Data dictionary
A data dictionary is variable-level documentation that records key information about each variable in a dataset. It is especially important when you are working with multiple tables or with a database. A data dictionary is typically formatted as a table in which each row corresponds to a variable in your dataset and each column represents a field of information about that variable. At a minimum, the data dictionary should include the variable name, data type, description, and sample values.
The main goal of the data dictionary is to help people understand datasets. It will help your peers answer questions such as “what does this variable mean?”
Most long-standing research projects have a data dictionary. Below are some open-source examples:
- National Database of Deep-Sea Corals
- Climate and Forecast Conventions
- Organic Carbon Sorption and Decomposition in Selected Global Soils
- Human Health Risk Assessment
- Drug Product Database - Health Canada
- NCI Proteomic Data Commons
- EarthExplorer USGS Landsat
- Planetary Science Dictionary (NASA)
As you can see, a data dictionary can be a simple table (spreadsheet or PDF) or a full-fledged web application. Some projects need only one data dictionary that a single person can create and maintain, while others require a whole team to create and maintain it.
Exercise 1
Please help us to make sense of a dataset.
Access a dataset:
Florida, Richard, 2013, “Class-Divided Cities, Detroit Edition Published in Atlantic Cities”, https://doi.org/10.5683/SP3/SNXXHQ, Borealis, V3, UNF:6:zsehCAz4agntvPwDZF03OA== [fileUNF]
Download the data file “Detroit Class Data.xlsx” in the Original File Format. Examining the data, try to answer the following questions:
- What do you think the columns STATEFP10 and COUNTYFP10 mean?
- Describe the different measures in this study.
- How was the data collected?
How to create a Data Dictionary
- A data dictionary will typically be structured so that each row corresponds to a column in your dataset and each column represents a field of information about that dataset column.
- Include the following fields:
  - Column name
  - Column name in plain English (business name)
  - Description of the column
  - Data type
  - Data usage type
  - Sample values
- Optional fields to include:
  - Transformations (Was the column the result of a transformation?)
  - Example usage (SQL queries)
  - Missing values
  - Values (this is useful if a column uses a scale/test)
  - Other notes
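As a starting point, the mechanical fields in the list above (column name, data type, sample values, missing values) can be drafted automatically, leaving the plain-English name, description, and usage type to be filled in by hand. The following is a minimal sketch using only the Python standard library. The function names `infer_type` and `draft_dictionary`, the coarse type labels, and the two sample rows are all assumptions made for illustration; they are not part of the original template.

```python
def infer_type(values):
    """Guess a coarse data type from the non-missing values of one column."""
    non_missing = [v for v in values if v is not None and v != ""]
    if not non_missing:
        return "Unknown"
    if all(isinstance(v, bool) for v in non_missing):
        return "Boolean"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool)
           for v in non_missing):
        return "Number"
    return "String"


def draft_dictionary(rows, max_samples=3):
    """Build one draft data dictionary entry per dataset column.

    `rows` is a list of dicts, one dict per record in the dataset.
    """
    columns = list(rows[0].keys()) if rows else []
    entries = []
    for col in columns:
        values = [row.get(col) for row in rows]
        # Collect up to `max_samples` distinct, non-missing sample values.
        samples = []
        for v in values:
            if v is not None and v != "" and v not in samples:
                samples.append(v)
            if len(samples) == max_samples:
                break
        entries.append({
            "Column Name": col,
            "Business Name": "",      # fill in by hand
            "Description": "",        # fill in by hand
            "Data Type": infer_type(values),
            "Data Usage Type": "",    # fill in by hand
            "Sample Values": ", ".join(str(s) for s in samples),
            "Missing Values": sum(1 for v in values if v is None or v == ""),
        })
    return entries


# Hypothetical records, made up for this example.
rows = [
    {"GEOID10": "26163593300", "WCPCT": 45.2},
    {"GEOID10": "26163593400", "WCPCT": None},
]
entries = draft_dictionary(rows)
for entry in entries:
    print(entry["Column Name"], entry["Data Type"], entry["Missing Values"])
```

A draft like this only saves typing. The descriptions still have to come from someone who knows what the columns mean.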
Template
Below is an example data dictionary for the dataset from the article we looked at earlier in the chapter. This template was designed to work for most datasets. There are other data dictionary templates available for more specific needs. NOTE: The values here are made up for educational purposes; they do not reflect what the real study had in mind.
You can download this template here
Column Name | Business Name | Description | Data Type | Data Usage Type | Sample Values |
---|---|---|---|---|---|
STATEFP10 | State code | The unique numeric code for the state. | String | Dimension Attribute | “01”, “02”, “06” |
COUNTYFP10 | County code | The unique numeric code for the county. | String | Dimension Foreign Key | “001”, “003”, “005” |
TRACTCE10 | Census Tract Code | Code identifying a specific census tract. | String | Dimension Attribute | “593300” |
GEOID10 | Geographical ID | Combined state, county, and tract identifier. | String | Dimension Foreign Key | “26163593300” |
NAMELSAD10 | Area Name | The full name of the census area. | String | Attribute | “Los Angeles County, CA”, “Cook County, IL” |
class | Land Classification | Indicates land use or classification. | String | Dimension Attribute | “Residential”, “Commercial”, “Agricultural” |
CCPCT | Child Care Percentage | Percentage of households using child care services. | Number | Fact | 15.2, 25.4, 32.7 |
FFFPCT | Fast Food Percentage | Percentage of restaurants classified as fast food. | Number | Fact | 40.3, 55.8, 22.5 |
SCPCT | Senior Citizen Percentage | Percentage of population over 65 years old. | Number | Fact | 10.4, 15.8, 20.1 |
WCPCT | Working Class Percentage | Percentage of households in the working class. | Number | Fact | 45.2, 62.1, 51.3 |
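The template lists the geographic codes as String rather than Number, and describes GEOID10 as a combination of the state, county, and tract codes. The sketch below illustrates both points. It assumes the usual Census layout of a 2-digit state code, a 3-digit county code, and a 6-digit tract code; the helper name `build_geoid` is hypothetical.

```python
def build_geoid(state_fp, county_fp, tract_ce):
    """Concatenate zero-padded state (2), county (3), and tract (6) codes."""
    return (
        str(state_fp).zfill(2)
        + str(county_fp).zfill(3)
        + str(tract_ce).zfill(6)
    )


# Using the sample tract code from the table above.
geoid = build_geoid("26", "163", "593300")
print(geoid)

# Why codes are stored as strings: converting "06" to a number
# silently drops the leading zero, so the code no longer matches.
as_number = int("06")
print(str(as_number))

# The zero can be restored only if the expected width is known,
# which is exactly the kind of detail a data dictionary should record.
restored = str(as_number).zfill(2)
print(restored)
```

Recording the data type as String in the dictionary warns users not to let a spreadsheet or import tool convert these codes to numbers.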
The Style
How you write your data dictionary is as important as the information you include. Always remember to be as clear as possible. Follow the style guide provided by your team to be consistent. The following are some general best practices related to data documentation:
- Don’t use jargon
- Define terms and acronyms
- House the data dictionary where you store data
- Make it machine-readable
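One practical benefit of a machine-readable data dictionary is that it can be checked against the dataset automatically. The sketch below compares the column names in a dataset with the column names documented in the dictionary. The function name `check_coverage`, the result keys, and the example column lists are assumptions made for illustration.

```python
def check_coverage(dataset_columns, dictionary_entries):
    """Compare dataset columns with the columns documented in the dictionary.

    Returns the columns that are in the dataset but not documented
    ("undocumented") and the entries that describe columns no longer
    present in the dataset ("stale").
    """
    documented = {entry["Column Name"] for entry in dictionary_entries}
    actual = set(dataset_columns)
    return {
        "undocumented": sorted(actual - documented),
        "stale": sorted(documented - actual),
    }


# Hypothetical inputs: the dataset gained WCPCT, and SCPCT was dropped.
dataset_columns = ["STATEFP10", "COUNTYFP10", "WCPCT"]
dictionary_entries = [
    {"Column Name": "STATEFP10"},
    {"Column Name": "COUNTYFP10"},
    {"Column Name": "SCPCT"},
]

report = check_coverage(dataset_columns, dictionary_entries)
print(report)
```

Running a check like this whenever the dataset changes helps keep the dictionary from drifting out of date.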
The process
Document your work as you go, so you don’t lose track of any details. If you wait until the end of your project, you might already have lost or forgotten valuable information.
You can create a data dictionary with any text editor, but we suggest using a spreadsheet application (Excel, Numbers, etc.). Although you should edit the data dictionary in spreadsheet software, we suggest saving it as a CSV or TSV file, because these are non-proprietary formats that will remain freely readable by everyone into the future.
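If the data dictionary is assembled in code rather than in a spreadsheet, it can be saved as CSV directly. The following is a minimal sketch using Python's standard `csv` module. The field list mirrors the template headers, while the function name `write_dictionary` and the single example entry are assumptions. An in-memory buffer is used here so the example is self-contained; in practice you would pass a file opened with `open(path, "w", newline="", encoding="utf-8")`.

```python
import csv
import io

FIELDS = [
    "Column Name",
    "Business Name",
    "Description",
    "Data Type",
    "Data Usage Type",
    "Sample Values",
]


def write_dictionary(entries, handle):
    """Write data dictionary entries as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(entries)


# One made-up entry, based on the first row of the template.
entries = [
    {
        "Column Name": "STATEFP10",
        "Business Name": "State code",
        "Description": "The unique numeric code for the state.",
        "Data Type": "String",
        "Data Usage Type": "Dimension Attribute",
        "Sample Values": '"01", "02", "06"',
    }
]

buffer = io.StringIO()
write_dictionary(entries, buffer)

# Read the CSV back to confirm that nothing was lost on the way.
buffer.seek(0)
round_trip = list(csv.DictReader(buffer))
print(round_trip[0]["Sample Values"])
```

The `csv` module quotes fields that contain commas or quotation marks, so sample values such as `"01", "02", "06"` survive the round trip unchanged.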
Congrats!
You are now ready to create your data dictionary so other researchers can understand your dataset with no problems!
Need help?
Please reach out to research.data@ubc.ca for assistance with any of your research data questions.