What is a Data Dictionary?

A data dictionary is a type of documentation that provides important information about the variables in the dataset, such as their definitions, descriptions, and structure. This is especially important if you are working with multiple tables or with a database. A data dictionary can accompany the dataset or can be a standalone data item.

A data dictionary is a critical tool for research reproducibility and for revisiting a project later. Its main goal is to help people understand and use the dataset(s) by answering questions like, “What does this variable mean?”

Looking for a cheat sheet? Check out our one-pager

Warm-Up

It’s common for long-standing research projects to have a data dictionary. Here are some open-source examples to explore:

A data dictionary can be a simple table (spreadsheet or PDF) or a detailed web application. For some projects, a data dictionary is created and maintained by one member of the research team, but the whole team can also share this work.

Exercise 1

Please help us make sense of the dataset below.

Access this dataset:

Florida, Richard, 2013, “Class-Divided Cities, Detroit Edition Published in Atlantic Cities”, https://doi.org/10.5683/SP3/SNXXHQ, Borealis, V3

Download the data file “Detroit Class Data.xlsx” in the Original File Format. While examining the data, try to answer the following questions:

  1. What do you think the columns STATEFP10 and COUNTYFP10 mean?
  2. Describe the different measures in this study.
  3. How was the data collected?

Alternatively, here is another dataset example with a better data dictionary:

Barsky, Eugene; Mitchell, Marjorie; Buhler, Jeremy, 2019, “UBC Research Data Management Survey: Science and Engineering”, https://doi.org/10.5683/SP2/9VEAT9, Borealis, V3

You can see how a data dictionary allows users to make sense of the data quickly.


The Process of Creating a Data Dictionary

Document your work as you go: update the data dictionary whenever elements or variables are added or changed. This reduces the risk of forgetting valuable information or losing details.

Place the data dictionary where your data files are stored. Having the data dictionary nearby makes it easier to understand the data, as it serves as a guide to its contents. Check out our directory structures workshop for more information.

A data dictionary can be created with any text editor or word processor, but we suggest using spreadsheet software. Save the spreadsheet as a CSV or TSV file, because these are lightweight, non-proprietary file formats that are accessible to everyone and future-friendly.
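As a minimal sketch of saving a data dictionary in a non-proprietary format, the Python standard-library `csv` module can write one as CSV. The column headers and the two entries here are hypothetical placeholders, not taken from a real project, and the output goes to an in-memory buffer so the example runs without touching disk.

```python
import csv
import io

# Hypothetical column headers for the data dictionary itself.
FIELDS = ["variable_id", "variable_name", "definition", "type"]

# Hypothetical entries, one per variable in the dataset.
ENTRIES = [
    {"variable_id": "age_yrs", "variable_name": "Age in years",
     "definition": "Participant age at enrolment", "type": "number"},
    {"variable_id": "resp_dt", "variable_name": "Response date",
     "definition": "Date the survey was submitted", "type": "date"},
]


def dictionary_to_csv(entries, fields=FIELDS):
    """Return the data dictionary as CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


csv_text = dictionary_to_csv(ENTRIES)
```

To save to disk instead, the same writer can be given a file opened with `open("data_dictionary.csv", "w", newline="", encoding="utf-8")`; the file name is only an example.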

Stylistic Considerations of a Data Dictionary

How you write your data dictionary is as important as the information you include. To ensure consistency, follow the style that’s agreed upon by the research team. Also, make sure to note any stylistic decisions in your README file if necessary.

The following are some general best practices related to data documentation:

  • Be as clear as possible
  • Don’t use jargon
  • Define terms, abbreviations, and acronyms
  • Keep the data dictionary where you store your data
  • Follow good naming conventions and a consistent formatting style

Recommended Content

A data dictionary is usually formatted as a table with the variables in rows and variable information in columns. At the top, you should mention relevant metadata such as the dataset’s name, the creation date of the data dictionary, and the version number or last updated date.

Every project is different, so consider which of the following apply to your project.

Key fields to consider:

Field | Details
Variable ID | The name used to identify the specific variable. This can be a sequence of alphanumeric characters.
Variable Name | The name of the specific variable in a human language, like English. Don’t include abbreviations or acronyms.
Variable Definition | An explanation of what the variable means, how it was calculated, how it should be used, or any known patterns. You can refer to existing vocabularies to increase interoperability (e.g., Unified Medical Language System).
Variable Type | The format of the variable (e.g., string, number, percentage, date).
Allowable Values / Parameters | Describe the values that can be entered. For example: min/max values for numbers, a list of options for characters, or whether the field accepts nulls.
Requirement | Indicate if the variable is required with a “yes” or “no”.
Notes | Add any extra notes, remarks, or instructions that help contextualize the variable.
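
One way to put the Variable Type, Allowable Values, and Requirement fields to work is to check incoming data against them. The following is a minimal sketch in Python; the entry layout, the `check_value` helper, and the `age_yrs` variable are all hypothetical, and a real project would adapt the rules to its own data dictionary.

```python
# Hypothetical data dictionary entry using some of the key fields above.
AGE_ENTRY = {
    "variable_id": "age_yrs",
    "type": "number",
    "min": 0,
    "max": 120,
    "required": True,
}


def check_value(entry, value):
    """Return a list of problems found for one value; empty means it passed."""
    if value is None or value == "":
        # Missing values are only a problem when the variable is required.
        return ["missing required value"] if entry["required"] else []
    if entry["type"] == "number":
        try:
            number = float(value)
        except (TypeError, ValueError):
            return [f"{value!r} is not a number"]
        if not entry["min"] <= number <= entry["max"]:
            return [f"{number} is outside {entry['min']}-{entry['max']}"]
    return []
```

For instance, `check_value(AGE_ENTRY, "42")` returns an empty list, while an empty string is reported as a missing required value.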

Secondary fields to consider:

Field | Details
Example Usage / Sample Values | Provide some examples of how the variable is implemented, or what the variable looks like.
Measurement Units | The measurement units of the variable.
Question Text | Include the exact wording from the survey, interview, task, etc.
Timestamp | The time at which the variable’s data was collected.
Missing Data | A description of the missing data for the specific variable. Indicate the type of missing data, such as the system missing the data, a data instrument error, or a participant skip error.

Sample Data Dictionary

Below is a shortened data dictionary for the dataset we looked at in Exercise 1. You can download a sample data dictionary based on Exercise 1 here (CSV file). The fields and existing values can be deleted or modified to fit your project.

NOTE: The values here are made-up examples for educational purposes. They do not reflect the real study.

Column Name | Meaningful Name | Description | Data Type | Data Usage Type | Sample Values
STATEFP10 | State code | The unique numeric code for the state. More information on state codes can be found here. | String | Dimension Attribute | “01”, “02”, “06”
COUNTYFP10 | County code | The unique numeric code for the county. More information on county codes can be found here. | String | Dimension Foreign Key | “001”, “003”, “005”
GEOID10 | Geographical ID | Combined state, county, and tract identifier. | String | Dimension Foreign Key | “26163593300”
FFFPCT | Fast Food Percentage | Percentage of restaurants classified as fast food. | Number | Fact | 40.3, 55.8, 22.5
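
The String data type for the code columns matters in practice: read as a number, “01” would become 1 and the leading zero would be lost. As a hedged sketch, the sample GEOID10 value can be split into its parts, assuming the standard US Census layout for tract identifiers (2-digit state, 3-digit county, 6-digit tract). The `split_geoid` helper is hypothetical, and the input is the made-up sample value from the table above.

```python
def split_geoid(geoid):
    """Split an 11-character census tract GEOID into its component codes.

    Assumes the layout state (2) + county (3) + tract (6); the value must be
    a string so that leading zeros are preserved.
    """
    if not isinstance(geoid, str) or len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f"expected an 11-digit string, got {geoid!r}")
    return {
        "STATEFP10": geoid[:2],
        "COUNTYFP10": geoid[2:5],
        "tract": geoid[5:],
    }


parts = split_geoid("26163593300")
```

Recording the type as String in the data dictionary tells anyone loading the file to keep these columns as text rather than letting a spreadsheet or parser convert them to numbers.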

Data Dictionary Template

You can download a blank data dictionary template here. Modify the header and fields included to fit your project.


Here is a breakdown of what we covered: A data dictionary is an informative document about the dataset’s variables, content, structure, and other details needed for understanding and reproducing the research. Remember to have a consistent and clear style, and record any updates made. Aim to make your data dictionary accessible to others and for the future by saving it in a non-proprietary format.

Congrats!

Hooray! You can now create a data dictionary so you and other researchers can understand the dataset with no problems!


Need help?

Please reach out to research.data@ubc.ca for assistance with any of your research data questions.

