What is a Data Dictionary?
A data dictionary is a type of documentation that provides important information about the variables in the dataset, such as their definitions, descriptions, and structure. This is especially important if you are working with multiple tables or with a database. A data dictionary can accompany the dataset or can be a standalone data item.
A data dictionary is a critical tool for research reproducibility and for revisiting your own data later. Its main goal is to help people understand and use the dataset(s) by answering questions like, “What does this variable mean?”
Looking for a cheat sheet? Check out our one-pager
Warm-Up
It’s common for long-standing research projects to have a data dictionary. Here are some open-source examples to explore:
- National Database of Deep-Sea Corals
- Climate and Forecast Conventions
- Organic Carbon Sorption and Decomposition in Selected Global Soils
- Planetary Science Dictionary (NASA)
A data dictionary can be a simple table (spreadsheet or PDF) or a detailed web application. For some projects, a data dictionary is created and maintained by one member of the research team, but the whole team can also share this work.
Exercise 1
Please help us make sense of the dataset below.
Access this dataset:
Florida, Richard, 2013, “Class-Divided Cities, Detroit Edition Published in Atlantic Cities”, https://doi.org/10.5683/SP3/SNXXHQ, Borealis, V3
Download the data file “Detroit Class Data.xlsx” in the Original File Format. While examining the data, try to answer the following questions:
- What do you think the columns STATEFP10 and COUNTYFP10 mean?
- Describe the different measures in this study.
- How was the data collected?
Alternatively, here is another dataset example with a better data dictionary:
Barsky, Eugene; Mitchell, Marjorie; Buhler, Jeremy, 2019, “UBC Research Data Management Survey: Science and Engineering”, https://doi.org/10.5683/SP2/9VEAT9, Borealis, V3
You can see how a data dictionary allows users to make sense of the data quickly.
The Process of Creating a Data Dictionary
Document your work as you go, updating the data dictionary whenever elements or variables are added or changed. This reduces the risk of forgetting valuable information or losing details.
Place the data dictionary where your data files are stored. Having the data dictionary nearby makes it easier to understand the data, as it serves as a guide to its contents. Check out our directory structures workshop for more information.
A data dictionary can be created with any text editor or word processor, but we suggest using spreadsheet software. The spreadsheet should be saved as a CSV or TSV file because these are lightweight, non-proprietary file formats that are accessible to everyone and future-friendly.
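If you prefer to generate the file with a script instead of a spreadsheet, the following is a minimal sketch using Python's standard `csv` module. The column headers, the rows, and the helper name `write_data_dictionary` are illustrative assumptions, not a required layout.

```python
import csv
import io

# Illustrative column headers; pick the fields that suit your project.
FIELDS = ["variable_id", "variable_name", "definition", "type"]

# Made-up example rows, not taken from a real study.
ROWS = [
    {
        "variable_id": "STATEFP10",
        "variable_name": "State code",
        "definition": "The unique numeric code for the state.",
        "type": "string",
    },
    {
        "variable_id": "FFFPCT",
        "variable_name": "Fast Food Percentage",
        "definition": "Percentage of restaurants classified as fast food.",
        "type": "number",
    },
]


def write_data_dictionary(rows, fields, handle, delimiter=","):
    """Write data dictionary rows to an open text handle as CSV (or TSV)."""
    writer = csv.DictWriter(handle, fieldnames=fields, delimiter=delimiter)
    writer.writeheader()
    writer.writerows(rows)


# Write to an in-memory buffer here; to save a file instead, use
# open("data_dictionary.csv", "w", newline="", encoding="utf-8").
buffer = io.StringIO()
write_data_dictionary(ROWS, FIELDS, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `delimiter="\t"` produces a TSV file instead. The `csv` module quotes any value that contains the delimiter, so definitions with commas stay intact.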
Stylistic Considerations of a Data Dictionary
How you write your data dictionary is as important as the information you include. To ensure consistency, follow the style that’s agreed upon by the research team. Also, make sure to note any stylistic decisions in your README file if necessary.
The following are some general best practices related to data documentation:
- Be as clear as possible
- Don’t use jargon
- Define terms, abbreviations, and acronyms
- Keep the data dictionary where you store your data
- Follow good naming conventions and a consistent formatting style
Recommended Content
A data dictionary is usually formatted as a table with the variables in rows and variable information in columns. At the top, you should mention relevant metadata such as the dataset’s name, the creation date of the data dictionary, and the version number or last updated date.
Every project is different, so consider which of the following apply to your project.
Key fields to consider:
Field | Details |
---|---|
Variable ID | The name used to identify the specific variable. This can be a sequence of alphanumeric characters. |
Variable Name | The name of the specific variable in a human language, like English. Don’t include abbreviations or acronyms. |
Variable Definition | An explanation of what the variable means, how it was calculated, how it should be used, or any known patterns. You can refer to existing vocabularies to increase interoperability (e.g. Unified Medical Language System). |
Variable Type | The format of the variable (e.g., string, number, percentage, date). |
Allowable Values / Parameters | Describe the values that can be entered. For example: min/max values for numbers, a list of options for characters, or whether the field accepts nulls. |
Requirement | Indicate if the variable is required with a “yes” or “no”. |
Notes | Add any extra notes, remarks, or instructions that help contextualize the variable. |
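Fields such as Variable Type, Allowable Values, and Requirement can also be used to check data automatically. The sketch below is one possible way to do that, not a standard tool: the `VariableEntry` class, the `check_value` function, and the 0 to 100 range for `FFFPCT` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VariableEntry:
    """One row of a data dictionary (hypothetical structure)."""
    variable_id: str
    variable_name: str
    definition: str
    variable_type: str              # e.g. "string" or "number"
    required: bool
    minimum: Optional[float] = None  # allowable minimum, if any
    maximum: Optional[float] = None  # allowable maximum, if any
    notes: str = ""


def check_value(entry, value):
    """Return a list of problems for one value; an empty list means OK."""
    problems = []
    # Missing values are only a problem when the variable is required.
    if value is None or value == "":
        if entry.required:
            problems.append(f"{entry.variable_id}: value is required")
        return problems
    if entry.variable_type == "number":
        try:
            number = float(value)
        except (TypeError, ValueError):
            problems.append(f"{entry.variable_id}: {value!r} is not a number")
            return problems
        if entry.minimum is not None and number < entry.minimum:
            problems.append(f"{entry.variable_id}: {number} is below {entry.minimum}")
        if entry.maximum is not None and number > entry.maximum:
            problems.append(f"{entry.variable_id}: {number} is above {entry.maximum}")
    return problems


# Assumed entry: a required percentage that must fall between 0 and 100.
fast_food = VariableEntry(
    variable_id="FFFPCT",
    variable_name="Fast Food Percentage",
    definition="Percentage of restaurants classified as fast food.",
    variable_type="number",
    required=True,
    minimum=0,
    maximum=100,
)

for candidate in [40.3, 140, None, "abc"]:
    print(candidate, check_value(fast_food, candidate))
```

Only number ranges and required values are checked here. A fuller version might also check lists of allowed options for text variables.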
Secondary fields to consider:
Field | Details |
---|---|
Example Usage/Sample Values | Provide some examples of how the variable is implemented, or what the variable looks like. |
Measurement Units | The measurement units of the variable. |
Question Text | Include the exact wording from the survey, interview, task, etc. |
Timestamp | The time at which the variable’s data was collected. |
Missing Data | A description of the missing data for the specific variable. Indicate the type of missing data, such as data missing from the system, a data instrument error, or a participant skip error. |
Sample Data Dictionary
Below is a shortened data dictionary for the dataset we looked at in Exercise 1. You can download this sample data dictionary here (CSV file). Delete or modify the fields and existing values to fit your project.
NOTE: The values here are made-up examples for educational purposes. They do not reflect the real study.
Column Name | Meaningful Name | Description | Data Type | Data Usage Type | Sample Values |
---|---|---|---|---|---|
STATEFP10 | State code | The unique numeric code for the state. More information on state codes can be found here. | String | Dimension Attribute | “01”, “02”, “06” |
COUNTYFP10 | County code | The unique numeric code for the county. More information on county codes can be found here. | String | Dimension Foreign Key | “001”, “003”, “005” |
GEOID10 | Geographical ID | Combined state, county, and tract identifier. | String | Dimension Foreign Key | “26163593300” |
FFFPCT | Fast Food Percentage | Percentage of restaurants classified as fast food. | Number | Fact | 40.3, 55.8, 22.5 |
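Once a data dictionary exists, a short script can flag columns in a data file that it does not yet document. The sketch below assumes the four column names from the sample table above. The function name `undocumented_columns` and the extra column `EXAMPLE_COL` are hypothetical and used only for illustration.

```python
# Column names and meaningful names from the sample data dictionary above.
documented = {
    "STATEFP10": "State code",
    "COUNTYFP10": "County code",
    "GEOID10": "Geographical ID",
    "FFFPCT": "Fast Food Percentage",
}


def undocumented_columns(dataset_columns, dictionary):
    """Return the columns that have no entry in the data dictionary."""
    return [name for name in dataset_columns if name not in dictionary]


# Hypothetical header row of a data file; EXAMPLE_COL is an invented column.
columns = ["STATEFP10", "COUNTYFP10", "GEOID10", "FFFPCT", "EXAMPLE_COL"]

missing = undocumented_columns(columns, documented)
print("Columns without a data dictionary entry:", missing)
```

Running a check like this whenever the dataset changes is one way to keep the data dictionary up to date as you go.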
Data Dictionary Template
You can download a blank data dictionary template here. Modify the header and fields included to fit your project.
Here is a breakdown of what we covered: A data dictionary is an informative document about the dataset’s variables, content, structure, and other details needed for understanding and reproducing the research. Remember to have a consistent and clear style, and record any updates made. Aim to make your data dictionary accessible to others and for the future by saving it in a non-proprietary format.
Congrats!
Hooray! You can now create a data dictionary so you and other researchers can understand the dataset with no problems!
Sources
- Harvard Biomedical Data Management. Data Dictionary. https://datamanagement.hms.harvard.edu/collect-analyze/documentation-metadata/data-dictionary
- Penn Libraries Guides. Data Management Resources. https://guides.library.upenn.edu/c.php?g=564157&p=9554907
- Phegley, L. (2023). University of Pennsylvania. Data Dictionary Blank Template. https://repository.upenn.edu/entities/publication/0430ccdd-cbd8-4404-9f54-11cb81d5b3b1
- Stony Brook University Data Governance. Data Dictionary Standards. https://www.stonybrook.edu/commcms/datagovernance/structureandroles/datadictionarystandards
Need help?
Please reach out to research.data@ubc.ca for assistance with any of your research data questions.