Orientation to the CTNS

The example in this workshop is the Canadian Tobacco and Nicotine Survey (CTNS) from Statistics Canada. We will use the CTNS Public Use Microdata File (PUMF) for two survey years available in Abacus: 2020 and 2022.

Documentation

Before analyzing the data it’s important to read the documentation. Good survey documentation will answer these and other questions:

What is the survey’s purpose?
What is the target population?
When was data collected?
How were questions asked?
What variables are available?

The CTNS PUMFs in Abacus come with user guides, codebooks, and/or questionnaires that will help you interpret the data and avoid mistakes in your analysis.

All PUMFs will have some way to identify variables, whether it’s a codebook, data dictionary or other reference file. User guides and questionnaires are not always present.

User guide

The user guide usually begins with a description of the survey and includes sections on definitions, methodology, data collection, and more. Here’s an excerpt from the survey description in the CTNS 2022 user guide (p7).

The main objective of the Canadian Tobacco and Nicotine Survey (CTNS) is to gather information about the prevalence of cigarette smoking, vaping, cannabis and alcohol use.

Understanding Canadian trends in tobacco, nicotine, cannabis, and alcohol use is vital to the effective development, implementation and evaluation of national and provincial strategies, policies and programs. The CTNS was conducted by Statistics Canada in late 2022 and early 2023 with the cooperation and support of Health Canada.

Codebook

A codebook lists all the variables in the PUMF along with:

answer categories (possible answers to the question)
frequencies (the number of responses for each category)
weighted frequencies (population estimate for each category)

Here is the CTNS 2020 codebook entry for the GENDER variable:

[Figure: CTNS 2020 codebook, Gender variable]

Weights

The codebook frequencies lead us to a key concept: the survey weight. In a Statistics Canada PUMF each response represents a certain number of people in the target population. That number is represented by the weight variable.

The CTNS 2022 target population “is non-institutionalized persons aged 15 years or older living in Canada’s ten provinces, who are not members of collectives or living on reserves” (CTNS 2022 User Guide, p9).

An analysis based on the number of responses will tell us about the survey sample (e.g. X people who answered the survey are current smokers). More often we want to produce estimates for the population (e.g. X people in the target population are current smokers). Weights let us move from the first kind of statement to the second by making the data more representative of the target population.

WTPP is the weight variable for the CTNS. In most cases our analysis should be based on the sum of the WTPP variable, not on the count of records.
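
To make the difference concrete, here is a minimal sketch using a made-up five-row data frame. The values are invented for illustration and are not CTNS data; only the variable names DV_SSR and WTPP are borrowed from the survey. It compares the count of records with the sum of WTPP for each smoking status.

```r
# Toy data: invented values, NOT real CTNS records
toy <- data.frame(
  DV_SSR = c("Current", "Never", "Never", "Former", "Current"),
  WTPP   = c(1200, 800, 950, 1100, 400)
)

# Unweighted: number of respondents in each category (describes the sample)
sample_counts <- table(toy$DV_SSR)

# Weighted: sum of WTPP in each category (estimates the target population)
population_estimates <- tapply(toy$WTPP, toy$DV_SSR, sum)

print(sample_counts)
print(population_estimates)
```

In this toy example two respondents are current smokers, but together they stand in for 1,600 people in the (imaginary) target population.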

Why use sav files?

If we compare the codebook to the data imported into R from a sav file, we can see that each variable’s codebook description appears directly in the header of the dataframe. If we used a plain text format, we would lose this information.
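
As a rough illustration of what a sav file preserves, the sketch below builds a small labelled column by hand with the haven package, mimicking the kind of column that haven's read_sav() produces. The codes, labels, and description are invented for the example and are not copied from the CTNS codebook.

```r
library(haven)

# A hand-made labelled column, similar to what read_sav() creates.
# The codes and labels here are hypothetical, not the real CTNS coding.
gender <- labelled(
  c(1, 2, 2, 1),
  labels = c(Male = 1, Female = 2),
  label  = "Gender of respondent"
)

# The variable description travels with the data
var_description <- attr(gender, "label")

# as_factor() swaps the numeric codes for their text labels
gender_factor <- as_factor(gender)

print(var_description)
print(levels(gender_factor))
```

A plain text export of the same column would contain only the codes 1 and 2, so the description and category labels would have to be looked up in the codebook.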

Set up your survey data for analysis

After confirming that the data imported correctly, we can perform other operations to prepare the data for analysis.

Note: The code below uses the pipe operator (%>%), which is made available by the dplyr package, to apply several functions in sequence. The %>% at the end of each line tells R to take the output of that line and “pipe” it into the next line for further processing.

# as_factor() comes from haven; as_survey() comes from srvyr
ctns2020 <- data2020 %>%
  as_factor() %>%             # convert labelled codes to factors for readable labels
  droplevels() %>%            # remove levels that have no data (tidier tables)
  as_survey(weights = WTPP)   # treat as a survey with weight variable WTPP

# same steps for the 2022 file
ctns2022 <- data2022 %>%
  as_factor() %>%
  droplevels() %>%
  as_survey(weights = WTPP)

srvyr package

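In the code above we used the as_survey function from the srvyr package and set the weight to be the WTPP variable. From this point on, summaries computed with srvyr functions on ctns2020 and ctns2022 will apply the weights automatically, and those weighted estimates can then be used in tables and graphs.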
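
The sketch below shows what a weighted summary might look like once a survey object exists. It uses an invented five-row tibble rather than the CTNS data, and the object names toy, toy_svy, and estimates are hypothetical. The srvyr functions survey_total() and survey_mean() do the weighting; unweighted() is used to get the plain record count for comparison.

```r
library(dplyr)
library(srvyr)

# Toy data: invented values, NOT real CTNS records
toy <- tibble(
  DV_SSR = factor(c("Current", "Never", "Never", "Former", "Current")),
  WTPP   = c(1200, 800, 950, 1100, 400)
)

# Same idea as the workshop code: declare WTPP as the survey weight
toy_svy <- toy %>%
  as_survey(weights = WTPP)

estimates <- toy_svy %>%
  group_by(DV_SSR) %>%
  summarise(
    respondents = unweighted(n()),  # plain count of records
    population  = survey_total(),   # sum of WTPP, plus a standard error column
    proportion  = survey_mean()     # weighted share of the total
  )

print(estimates)
```

The same pattern should carry over to ctns2020 and ctns2022 by swapping in the real survey objects and the variables of interest.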

Selected CTNS variables, 2020 and 2022

Below are selected variables we will use when we analyze the CTNS PUMF in R. Take a moment to familiarize yourself with them, and notice the differences between the 2020 and 2022 survey years.

CTNS 2020

variable	notes
AGEGROUP	age groups from 15 to 65+
DV_SSR	smoking status (current/former/never); named DV_SS in 2022
GENDER	binary variable, conflates gender and sex (Female/Male)
HHLDSIZE	household size from 1 to 5+
PROV_C	province
WTPP	survey weight

CTNS 2022

variable	notes
AGEGROUP	age groups from 15 to 65+
DV_SS	smoking status (current/former/never); named DV_SSR in 2020
GENDER	acknowledges non-binary identities but groups them for privacy (Women+/Men+)
HHLDSIZE	not available in 2022 PUMF
PROV_C	province
WTPP	survey weight

When there are differences between years, the researcher will need to decide whether comparisons are possible. There are three differences among these variables:

HHLDSIZE: the household size variable is present in the 2020 PUMF, but not in 2022
DV_SSR and DV_SS: different names for the same smoking status variable
GENDER: the 2022 gender variable allows non-binary gender identities, but only two categories are reported in the PUMF

How would you handle these differences?
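
One possible approach, sketched here with invented toy rows rather than real CTNS records, is to rename DV_SSR to DV_SS, drop HHLDSIZE because it has no 2022 counterpart, add a year column, and stack the two years. The names toy2020, toy2022, combined, and YEAR are hypothetical. GENDER is kept as-is, but its categories differ between years, so whether to compare it remains a judgment call for the researcher.

```r
library(dplyr)

# Toy data: invented values, NOT real CTNS records
toy2020 <- tibble(
  DV_SSR   = c("Current", "Never"),
  GENDER   = c("Female", "Male"),
  HHLDSIZE = c("1", "5+"),
  WTPP     = c(1200, 800)
)
toy2022 <- tibble(
  DV_SS  = c("Former", "Never"),
  GENDER = c("Women+", "Men+"),
  WTPP   = c(1100, 950)
)

combined <- bind_rows(
  toy2020 %>%
    rename(DV_SS = DV_SSR) %>%  # use the 2022 name for smoking status
    select(-HHLDSIZE) %>%       # not available in 2022, so drop it
    mutate(YEAR = 2020),
  toy2022 %>%
    mutate(YEAR = 2022)
)

print(combined)
```

A step like this would be applied to the plain data frames before calling as_survey, and it assumes the smoking status categories are defined the same way in both years, which should be checked against the codebooks.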