Getting ready to analyze microdata
Please note that this workshop aims to demonstrate R's capacity to analyze microdata, not to teach new users how to use R. For beginner R help, please take one of the workshops offered by the library or book a consult with the GAA.
Follow along and run the code in this R Markdown file.
Set the working directory
The default working directory for an R Markdown file is the location of the file (e.g. your “Downloads” folder if you just downloaded and opened it). To save workshop files to a different location, in RStudio go to File -> Save As and choose a location you will remember.
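To confirm where the workshop files will be saved, you can ask R for its current working directory. The following is a minimal sketch using base R only:

```r
# Print the folder R is currently using as its working directory
getwd()

# List the files R can see in that folder
list.files()
```

If the printed path is not the folder you expected, save the R Markdown file to your preferred location and reopen it from there.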
Install and load required packages
required_packages <- c("haven", "dplyr", "srvyr", "gtsummary", "ggplot2", "ggpubr", "flextable", "officer")
# Loop through each required package and install any that are missing
for (package in required_packages) {
  # Compare against the names of installed packages (the row names of the matrix)
  if (!package %in% rownames(installed.packages())) {
    install.packages(package)
  }
}
#Load required packages for use in this R session
library(haven) #to import SPSS .sav file
library(dplyr) #for data manipulation
library(srvyr) #survey-specific functions
library(gtsummary) #create summary tables
library(ggplot2) #create plots
library(flextable) # create tables
library(officer) # manipulate word docs and power points
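If one of the library() calls fails, it can help to see which packages are still missing. The sketch below uses a hypothetical helper, missing_packages(), which is not part of any package; it relies only on base R's requireNamespace().

```r
# Hypothetical helper: return the names of packages that cannot be loaded.
# requireNamespace() tries to load each package's namespace and returns
# TRUE or FALSE instead of stopping with an error.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

required <- c("haven", "dplyr", "srvyr", "gtsummary", "ggplot2", "flextable", "officer")

# An empty result means every package in the list is available
missing_packages(required)
```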
Download and unzip the microdata
This workshop uses Public Use Microdata Files (PUMFs) from the Canadian Tobacco and Nicotine Survey (CTNS). PUMFs for the CTNS and other Statistics Canada surveys are available in Abacus, UBC Library’s data repository (https://abacus.library.ubc.ca/).
Survey data can be downloaded by visiting Abacus with a browser, but R can automate the process using the Abacus API. Each Abacus file has a persistent identifier called a handle. Listed below are the CTNS data files used in this example. (Links are to the Abacus records where you’ll also find codebooks and user guides.)
| Survey | File description | File name | Handle |
|---|---|---|---|
| CTNS 2020 | Microdata in SPSS .sav format | CTNS_2020_PUMF_SPSS_sav.zip | 11272.1/AB2/UYC0Z8/XVITQW |
| CTNS 2022 | Microdata in SPSS .sav format | CTNS_2022_SPSS_SAV.zip | 11272.1/AB2/PWWFK3/4K96XZ |
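To show how a handle becomes a download link, here is a minimal sketch that builds the API URL from each handle in the table. The helper abacus_url() is hypothetical (it is not part of any package) and simply follows the URL pattern used in this workshop.

```r
# Hypothetical helper: build the Abacus API download URL for a file handle
abacus_url <- function(handle) {
  paste0(
    "https://abacus.library.ubc.ca/api/access/datafile/",
    ":persistentId?persistentId=hdl:", handle
  )
}

# Handles taken from the table above
handles <- c(
  CTNS_2020 = "11272.1/AB2/UYC0Z8/XVITQW",
  CTNS_2022 = "11272.1/AB2/PWWFK3/4K96XZ"
)

# One URL per survey year, named by survey
urls <- vapply(handles, abacus_url, character(1))
print(urls)
```

Building the URL from the handle makes it easier to add another survey year later: only the handle needs to change.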
Run the code below to download and unzip the data files for the 2020 and 2022 survey years:
download.file("https://abacus.library.ubc.ca/api/access/datafile/:persistentId?persistentId=hdl:11272.1/AB2/UYC0Z8/XVITQW","CTNS_2020.zip", mode="wb")
download.file("https://abacus.library.ubc.ca/api/access/datafile/:persistentId?persistentId=hdl:11272.1/AB2/PWWFK3/4K96XZ","CTNS_2022.zip", mode="wb")
unzip("CTNS_2020.zip")
unzip("CTNS_2022.zip")
Look in the RStudio Files tab in the bottom right of the screen. You should see the unzipped “.sav” files in your working directory.
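The same check can be done from the console. The sketch below defines a hypothetical helper, find_sav_files(), that lists the .sav files in a folder using base R's list.files(); by default it looks in the working directory.

```r
# Hypothetical helper: list SPSS .sav files in a folder
# (default: the current working directory)
find_sav_files <- function(dir = ".") {
  list.files(dir, pattern = "\\.sav$", ignore.case = TRUE)
}

# Should include the unzipped CTNS files if the download and unzip steps worked
find_sav_files()
```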
Read in your sav files
The read_sav function from the haven package imports SPSS .sav files as data frames (tibbles), which are similar to spreadsheets. It also imports variable and value labels to make the data easier to work with.
#Import the .sav files and store them as 'data2020' and 'data2022'
data2020 <- read_sav("ctns_2020_pumf_eng.sav")
data2022 <- read_sav("ctns_2022_pumf.sav")
It is important to check that your data imported correctly. Click ‘data2020’ and ‘data2022’ in the environment pane (top right) to view the imported data. Is it what you expect?
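You can also check the import from code. haven stores each variable label in the column's "label" attribute, so a short helper can collect them. The helper variable_labels() below is hypothetical and uses base R only; the commented lines assume data2020 already exists in your session.

```r
# Hypothetical helper: collect the variable label of every column.
# Columns without a label return NA.
variable_labels <- function(df) {
  vapply(df, function(col) {
    lbl <- attr(col, "label", exact = TRUE)
    if (is.null(lbl)) NA_character_ else as.character(lbl)
  }, character(1))
}

# Example use after importing (assumes data2020 exists in your session):
# dim(data2020)                    # number of rows and columns
# head(variable_labels(data2020))  # labels of the first few variables
```

Comparing the row and column counts against the codebook is a quick way to spot an incomplete import.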