Preparing your data

This workshop emphasizes the visualization but in many cases your data will need to be manipulated, reformatted, and re-packaged before you feed it to a visualization tool. Tableau (and other visualization tools) have some features to help clean messy data, but here are some general things to consider when preparing your data.

OpenRefine is a powerful open source tool for working with messy data. If you are familiar with the R programming language, Tidyverse packages like dplyr and tidyr are also useful.

Each measure in one column

If you’re working with tabular data (e.g. Excel, csv), format it so that each measure appears in one column only. A table like this…

	march	april	may	june	july
kittens_adopted	0	0	1	0	2
meals_ordered	8	11	17	13	23

…will be harder to work with and will have fewer visualization options than this table with the same information:

month	kittens_adopted	meals_ordered
march	0	8
april	0	11
may	1	17
june	0	13
july	2	23

Remove non-data content

Excel worksheets often begin with titles and other information, with the actual data starting below. Subtitles and empty lines within the data table may further interfere with visualization software’s ability to interpret the contents. Removing extra lines and non-data content is a good general practice, though Tableau can often identify the data elements correctly during import.

Know your dataset

It’s important to know a few things about your data before importing it. How many records are there? What are the variable labels? Knowing this will help you confirm that Tableau imported the data correctly.

The sample dataset for this workshop contains four months of data from the Statistics Canada Labour Force Survey (Jan, Apr, Jul, and Oct 2020). This data and supporting documentation is freely available from UBC Library’s Abacus data repository under the terms of the Statistics Canada Open License. Statistics Canada’s Guide to the Labour Force Survey provides information about the survey purpose, methodology, and terms.

To make it easier to work with in Tableau the sample dataset is a modified subset of the original data. Some variables were excluded and variable names and values were replaced with descriptive labels (see the Resources page for details about the changes). The dataset contains 360,554 observations of 23 variables:

Actual hours worked/wk at main job
Age group
Census metropolitan area
Class of worker (main job)
Duration of unemployment
Flows into unemployment
Highest educational attainment
Job permanency (employees only)
Labour force status
Occupation at main job
Province
Reason for leaving job
Reason for part time work
Record number
Sex
Single or multiple jobholder
Statistical Weight
Student status
Survey month
Survey year
Type of work (main job)
Usual hourly wages (employees only)
Usual hours worked/wk at main job

Checking your work

If you know your data it’s easier to spot when something’s not quite right, suggesting a setting or calculation is incorrect. The sample dataset is the same one used to generate Statistics Canada table 14-10-0017-02, Labour force characteristics by province, monthly, unadjusted for seasonality. As you learn Tableau you may use this table to check your work.

Preparing your data

Each measure in one column

Remove non-data content

Know your dataset

Checking your work

Statistics Canada table

Graph produced from sample dataset