This workbook guides you through de-identifying the Apple
Grower Satisfaction Survey
(WorkerSatisfaction_300rows.xlsx: 300 rows, 16
columns).
WorkerSatisfaction_300rows_deidentified.xlsx
— for research team and analystsdata_key_file/WorkerSatisfaction_data_key_DUMMY.xlsx
— for authorized personnel only; store separately from the de-identified
data (dummy example)Do not edit the raw source file
(WorkerSatisfaction_300rows.xlsx).
| Part | Section | What happens |
|---|---|---|
| 1 | Setup | Install packages, set paths |
| 2 | Load and assess | Read raw data; review direct IDs, dates, geography, free text |
| 3 | Transform | Remove, generalize, suppress, and pseudonymize — preview after each step |
| 4 | Verify and deliver | QA checks → export de-identified file → export data key |
| Risk | What to look for (Part 2) | What to do (Part 3) |
|---|---|---|
| Direct identifiers | Names, emails, handles, owner names | Remove columns |
| Dates | age is date of birth |
Replace with 5-year age bands |
| Geography | Small city counts | Suppress rare cities, then anonymize |
| External sources | Free-text comments |
Remove column |
| Quasi-identifiers | Orchard names | Pseudonymize to Orchard_01, … |
| Package | Purpose |
|---|---|
readxl |
Read the original Excel survey file |
dplyr |
Transform data (native pipe \|>) |
writexl |
Export Excel outputs |
Run Install packages once if a package is missing.
# install.packages("readxl")
# install.packages("dplyr")
# install.packages("writexl")
# install.packages(c("readxl", "dplyr", "writexl"))
library(readxl)
library(dplyr)
library(writexl)
DATA_FILE <- "WorkerSatisfaction_300rows.xlsx"
OUTPUT_FILE <- "WorkerSatisfaction_300rows_deidentified.xlsx"
KEY_DIR <- "data_key_file"
KEY_FILE <- file.path(KEY_DIR, "WorkerSatisfaction_data_key_DUMMY.xlsx")
K_THRESHOLD <- 5
k-anonymity threshold: categories with fewer than 5 records will be pooled before anonymization.
cat("K_THRESHOLD =", K_THRESHOLD, "\n")
## K_THRESHOLD = 5
cat("Data key file:", KEY_FILE, "\n")
## Data key file: data_key_file/WorkerSatisfaction_data_key_DUMMY.xlsx
Load the unmodified survey and review what makes the data identifiable.
raw <- read_excel(DATA_FILE, sheet = "Unmodified data")
cat("Raw data:", nrow(raw), "rows ×", ncol(raw), "columns\n")
## Raw data: 300 rows × 16 columns
head(raw, 10)
## # A tibble: 10 × 16
## worker_id email_id age immigration_stat city province
## <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 James Strange james.s… 1998-09-10 00:00:00 Non-immigrant Kelo… B.C.
## 2 Maya Liya maya.li… 1994-08-04 00:00:00 Immigrant Kelo… B.C.
## 3 Amelio Beal amelio.… 1995-01-02 00:00:00 Non-permanent r… West… B.C.
## 4 Cara Sahara cara.sa… 1991-03-22 00:00:00 Non-permanent r… West… B.C.
## 5 Neiv Rieg neivr@g… 1990-07-15 00:00:00 Non-permanent r… Vict… B.C.
## 6 Troy Ahoy tahoy@y… 1988-06-06 00:00:00 Immigrant Vict… B.C.
## 7 Dave Mahew dmahew@… 2000-11-11 00:00:00 Non-immigrant Vern… B.C.
## 8 Jamie Thomas jamiet@… 1992-12-18 00:00:00 Immigrant Vern… B.C.
## 9 Betty Stills bettys5… 1997-09-06 00:00:00 Non-permanent r… Cobb… B.C.
## 10 Enrique Iglasias ei123@h… 1985-10-16 00:00:00 Non-permanent r… Cobb… B.C.
## # ℹ 10 more variables: orchard_id <chr>, owner_id <chr>, username_id <chr>,
## # sns_worked <dbl>, sat_hrs <dbl>, trt_workers <dbl>, trt_manager <dbl>,
## # cmf_manager <dbl>, sat_work_overall <dbl>, comments <chr>
These variables point to a specific person and will be removed in Part 3:
| Variable | Why it is risky |
|---|---|
worker_id |
Full name (300 unique values) |
email_id |
Email address (300 unique values) |
username_id |
Social-media handle (265 unique values) |
owner_id |
Orchard owner name |
The column age stores date of birth,
not a numeric age. Exact DOB is a strong quasi-identifier when combined
with location and employer.
data.frame(
earliest = as.character(min(as.Date(raw$age))),
latest = as.character(max(as.Date(raw$age))),
unique_dates = length(unique(raw$age))
)
## earliest latest unique_dates
## 1 1970-02-20 2000-11-11 295
Exact DOB will be replaced with 5-year age bands in Part 3.
city has 12 levels; province is always B.C.
Small groups are easier to re-identify.
raw |>
count(city, sort = TRUE) |>
mutate(rare = n < K_THRESHOLD)
## # A tibble: 12 × 3
## city n rare
## <chr> <int> <lgl>
## 1 Victoria 33 FALSE
## 2 Summerland 32 FALSE
## 3 Keremeos 31 FALSE
## 4 Kelowna 29 FALSE
## 5 West Kelowna 29 FALSE
## 6 Peachland 27 FALSE
## 7 Cobble Hill 26 FALSE
## 8 Vernon 24 FALSE
## 9 Naramata 18 FALSE
## 10 Oliver 18 FALSE
## 11 Osoyoos 17 FALSE
## 12 Penticton 16 FALSE
rare_cities <- raw |>
count(city) |>
filter(n < K_THRESHOLD) |>
pull(city)
cat("Rare cities (n <", K_THRESHOLD, "):\n")
## Rare cities (n < 5 ):
rare_cities
## character(0)
raw |> count(immigration_stat, sort = TRUE)
## # A tibble: 3 × 2
## immigration_stat n
## <chr> <int>
## 1 Non-permanent resident 107
## 2 Non-immigrant 103
## 3 Immigrant 90
Note: Pool rare cities to
"Other BC community", then replace all city names with codes (City_01, …). Immigration counts are checked here; recoding is only needed if any group falls below 5.
The comments field can name people, employers,
platforms, and birth years — even after other columns are cleaned.
raw |>
filter(comments != "None") |>
slice(1:10) |>
select(comments)
## # A tibble: 10 × 1
## comments
## <chr>
## 1 I get that I'm a farm worker and grinding is part of the job, but it'd be ni…
## 2 I don't mind the actual work, but the people I work with and the owners like…
## 3 I only like this job because I can make good money, everything else sucks. I…
## 4 The hours and work are hard, but it's good money and I love the team we have…
## 5 Hard but great work
## 6 I'm an apple picker, it is what it is
## 7 It's a good summer job to help me pay for school, definitely not something I…
## 8 The owner's wife is such a sweetheart, but I don't like the owner himself. …
## 9 I hate this job and everybody who works here. the guys I work with are awfu…
## 10 I don't love the job but I stay for the money and the routine.
Note: Remove
commentsentirely. Free text cannot be reliably de-identified without manual review.
Orchard names link respondents to a specific workplace. They will be pseudonymized in Part 3.
raw |> count(orchard_id, sort = TRUE)
## # A tibble: 12 × 2
## orchard_id n
## <chr> <int>
## 1 Billie's Apples 33
## 2 Okanagan Fresh 32
## 3 Valley View Apples 31
## 4 Applejacks 29
## 5 West Kelowna Apples 29
## 6 Lakeside Groves 27
## 7 Country Apples 26
## 8 Bon Appletit 24
## 9 Golden Valley Farms 18
## 10 Ridgeline Orchard 18
## 11 Peak Harvest Co 17
## 12 Sunrise Orchards 16
Note: Replace orchard names with non-descriptive codes (
Orchard_01, …).
Apply each rule from Part 2. Run chunks in order; each step shows 10 rows of the updated dataframe.
direct_ids <- c("worker_id", "email_id", "username_id", "owner_id")
step1 <- raw |>
select(-all_of(direct_ids), -comments)
cat("Step 3.1:", nrow(step1), "rows ×", ncol(step1), "columns\n")
## Step 3.1: 300 rows × 11 columns
head(step1, 10)
## # A tibble: 10 × 11
## age immigration_stat city province orchard_id sns_worked
## <dttm> <chr> <chr> <chr> <chr> <dbl>
## 1 1998-09-10 00:00:00 Non-immigrant Kelo… B.C. Applejacks 2
## 2 1994-08-04 00:00:00 Immigrant Kelo… B.C. Applejacks 4
## 3 1995-01-02 00:00:00 Non-permanent resid… West… B.C. West Kelo… 1
## 4 1991-03-22 00:00:00 Non-permanent resid… West… B.C. West Kelo… 2
## 5 1990-07-15 00:00:00 Non-permanent resid… Vict… B.C. Billie's … 2
## 6 1988-06-06 00:00:00 Immigrant Vict… B.C. Billie's … 5
## 7 2000-11-11 00:00:00 Non-immigrant Vern… B.C. Bon Apple… 3
## 8 1992-12-18 00:00:00 Immigrant Vern… B.C. Bon Apple… 1
## 9 1997-09-06 00:00:00 Non-permanent resid… Cobb… B.C. Country A… 3
## 10 1985-10-16 00:00:00 Non-permanent resid… Cobb… B.C. Country A… 6
## # ℹ 5 more variables: sat_hrs <dbl>, trt_workers <dbl>, trt_manager <dbl>,
## # cmf_manager <dbl>, sat_work_overall <dbl>
to_age_band <- function(dob) {
yrs <- as.integer(difftime(Sys.Date(), as.Date(dob), units = "days") / 365.25)
if (yrs < 25) "18-24"
else if (yrs < 35) "25-34"
else if (yrs < 45) "35-44"
else if (yrs < 55) "45-54"
else "55+"
}
step2 <- step1 |>
mutate(age_band = sapply(age, to_age_band)) |>
select(-age) |>
relocate(age_band, .before = immigration_stat)
cat("Step 3.2:", nrow(step2), "rows ×", ncol(step2), "columns\n")
## Step 3.2: 300 rows × 11 columns
head(step2, 10)
## # A tibble: 10 × 11
## age_band immigration_stat city province orchard_id sns_worked sat_hrs
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 25-34 Non-immigrant Kelow… B.C. Applejacks 2 3
## 2 25-34 Immigrant Kelow… B.C. Applejacks 4 2
## 3 25-34 Non-permanent resident West … B.C. West Kelo… 1 1
## 4 35-44 Non-permanent resident West … B.C. West Kelo… 2 3
## 5 35-44 Non-permanent resident Victo… B.C. Billie's … 2 2
## 6 35-44 Immigrant Victo… B.C. Billie's … 5 3
## 7 25-34 Non-immigrant Vernon B.C. Bon Apple… 3 3
## 8 25-34 Immigrant Vernon B.C. Bon Apple… 1 4
## 9 25-34 Non-permanent resident Cobbl… B.C. Country A… 3 3
## 10 35-44 Non-permanent resident Cobbl… B.C. Country A… 6 1
## # ℹ 4 more variables: trt_workers <dbl>, trt_manager <dbl>, cmf_manager <dbl>,
## # sat_work_overall <dbl>
step3 <- step2 |>
mutate(city = if_else(city %in% rare_cities, "Other BC community", city))
cat("Step 3.3:", nrow(step3), "rows ×", ncol(step3), "columns\n")
## Step 3.3: 300 rows × 11 columns
head(step3, 10)
## # A tibble: 10 × 11
## age_band immigration_stat city province orchard_id sns_worked sat_hrs
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 25-34 Non-immigrant Kelow… B.C. Applejacks 2 3
## 2 25-34 Immigrant Kelow… B.C. Applejacks 4 2
## 3 25-34 Non-permanent resident West … B.C. West Kelo… 1 1
## 4 35-44 Non-permanent resident West … B.C. West Kelo… 2 3
## 5 35-44 Non-permanent resident Victo… B.C. Billie's … 2 2
## 6 35-44 Immigrant Victo… B.C. Billie's … 5 3
## 7 25-34 Non-immigrant Vernon B.C. Bon Apple… 3 3
## 8 25-34 Immigrant Vernon B.C. Bon Apple… 1 4
## 9 25-34 Non-permanent resident Cobbl… B.C. Country A… 3 3
## 10 35-44 Non-permanent resident Cobbl… B.C. Country A… 6 1
## # ℹ 4 more variables: trt_workers <dbl>, trt_manager <dbl>, cmf_manager <dbl>,
## # sat_work_overall <dbl>
step3 |> count(city, sort = TRUE)
## # A tibble: 12 × 2
## city n
## <chr> <int>
## 1 Victoria 33
## 2 Summerland 32
## 3 Keremeos 31
## 4 Kelowna 29
## 5 West Kelowna 29
## 6 Peachland 27
## 7 Cobble Hill 26
## 8 Vernon 24
## 9 Naramata 18
## 10 Oliver 18
## 11 Osoyoos 17
## 12 Penticton 16
step4 <- step3 |>
mutate(city = paste0(
"City_", sprintf("%02d", as.integer(factor(city)))
))
cat("Step 3.4:", nrow(step4), "rows ×", ncol(step4), "columns\n")
## Step 3.4: 300 rows × 11 columns
head(step4, 10)
## # A tibble: 10 × 11
## age_band immigration_stat city province orchard_id sns_worked sat_hrs
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 25-34 Non-immigrant City_… B.C. Applejacks 2 3
## 2 25-34 Immigrant City_… B.C. Applejacks 4 2
## 3 25-34 Non-permanent resident City_… B.C. West Kelo… 1 1
## 4 35-44 Non-permanent resident City_… B.C. West Kelo… 2 3
## 5 35-44 Non-permanent resident City_… B.C. Billie's … 2 2
## 6 35-44 Immigrant City_… B.C. Billie's … 5 3
## 7 25-34 Non-immigrant City_… B.C. Bon Apple… 3 3
## 8 25-34 Immigrant City_… B.C. Bon Apple… 1 4
## 9 25-34 Non-permanent resident City_… B.C. Country A… 3 3
## 10 35-44 Non-permanent resident City_… B.C. Country A… 6 1
## # ℹ 4 more variables: trt_workers <dbl>, trt_manager <dbl>, cmf_manager <dbl>,
## # sat_work_overall <dbl>
step4 |> count(city, sort = TRUE)
## # A tibble: 12 × 2
## city n
## <chr> <int>
## 1 City_11 33
## 2 City_09 32
## 3 City_03 31
## 4 City_02 29
## 5 City_12 29
## 6 City_07 27
## 7 City_01 26
## 8 City_10 24
## 9 City_04 18
## 10 City_05 18
## 11 City_06 17
## 12 City_08 16
step5 <- step4 |>
mutate(orchard_id = paste0(
"Orchard_", sprintf("%02d", as.integer(factor(orchard_id)))
))
deid <- step5
cat("Step 3.5 — final:", nrow(deid), "rows ×", ncol(deid), "columns\n")
## Step 3.5 — final: 300 rows × 11 columns
head(deid, 10)
## # A tibble: 10 × 11
## age_band immigration_stat city province orchard_id sns_worked sat_hrs
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 25-34 Non-immigrant City_… B.C. Orchard_01 2 3
## 2 25-34 Immigrant City_… B.C. Orchard_01 4 2
## 3 25-34 Non-permanent resident City_… B.C. Orchard_12 1 1
## 4 35-44 Non-permanent resident City_… B.C. Orchard_12 2 3
## 5 35-44 Non-permanent resident City_… B.C. Orchard_02 2 2
## 6 35-44 Immigrant City_… B.C. Orchard_02 5 3
## 7 25-34 Non-immigrant City_… B.C. Orchard_03 3 3
## 8 25-34 Immigrant City_… B.C. Orchard_03 1 4
## 9 25-34 Non-permanent resident City_… B.C. Orchard_04 3 3
## 10 35-44 Non-permanent resident City_… B.C. Orchard_04 6 1
## # ℹ 4 more variables: trt_workers <dbl>, trt_manager <dbl>, cmf_manager <dbl>,
## # sat_work_overall <dbl>
deid |> count(orchard_id, sort = TRUE)
## # A tibble: 12 × 2
## orchard_id n
## <chr> <int>
## 1 Orchard_02 33
## 2 Orchard_07 32
## 3 Orchard_11 31
## 4 Orchard_01 29
## 5 Orchard_12 29
## 6 Orchard_06 27
## 7 Orchard_04 26
## 8 Orchard_03 24
## 9 Orchard_05 18
## 10 Orchard_09 18
## 11 Orchard_08 17
## 12 Orchard_10 16
| Removed | Transformed |
|---|---|
worker_id, email_id,
username_id, owner_id, comments,
exact DOB (age) |
age → age_band; rare cities pooled;
city → City_01 …; orchard_id →
Orchard_01 … |
Confirm the de-identified data is safe to export. All checks should
show TRUE.
data.frame(
check = c(
"No direct ID columns",
"No comments column",
"No exact DOB column",
"Row count unchanged",
"Rare cities suppressed",
"Cities anonymized",
"Orchards pseudonymized"
),
passed = c(
!any(direct_ids %in% names(deid)),
!"comments" %in% names(deid),
!"age" %in% names(deid),
nrow(deid) == nrow(raw),
!any(rare_cities %in% deid$city),
!any(unique(raw$city) %in% deid$city),
!any(unique(raw$orchard_id) %in% deid$orchard_id)
)
)
## check passed
## 1 No direct ID columns TRUE
## 2 No comments column TRUE
## 3 No exact DOB column TRUE
## 4 Row count unchanged TRUE
## 5 Rare cities suppressed TRUE
## 6 Cities anonymized TRUE
## 7 Orchards pseudonymized TRUE
data.frame(
rows = nrow(deid),
columns = ncol(deid),
smallest_city = min(table(deid$city)),
smallest_age_band = min(table(deid$age_band)),
smallest_orchard = min(table(deid$orchard_id))
)
## rows columns smallest_city smallest_age_band smallest_orchard
## 1 300 11 16 13 16
Note: Each group should ideally have at least 5 records, or be pooled into a broader category.
This is the file you can share for analysis. It does not include the data key.
write_xlsx(deid, OUTPUT_FILE)
cat("Exported:", OUTPUT_FILE, "\n")
## Exported: WorkerSatisfaction_300rows_deidentified.xlsx
cat("Rows:", nrow(deid), "| Columns:", ncol(deid), "\n\n")
## Rows: 300 | Columns: 11
cat("Preview of exported data:\n")
## Preview of exported data:
head(deid, 10)
## # A tibble: 10 × 11
## age_band immigration_stat city province orchard_id sns_worked sat_hrs
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 25-34 Non-immigrant City_… B.C. Orchard_01 2 3
## 2 25-34 Immigrant City_… B.C. Orchard_01 4 2
## 3 25-34 Non-permanent resident City_… B.C. Orchard_12 1 1
## 4 35-44 Non-permanent resident City_… B.C. Orchard_12 2 3
## 5 35-44 Non-permanent resident City_… B.C. Orchard_02 2 2
## 6 35-44 Immigrant City_… B.C. Orchard_02 5 3
## 7 25-34 Non-immigrant City_… B.C. Orchard_03 3 3
## 8 25-34 Immigrant City_… B.C. Orchard_03 1 4
## 9 25-34 Non-permanent resident City_… B.C. Orchard_04 3 3
## 10 35-44 Non-permanent resident City_… B.C. Orchard_04 6 1
## # ℹ 4 more variables: trt_workers <dbl>, trt_manager <dbl>, cmf_manager <dbl>,
## # sat_work_overall <dbl>
Keeping the data key file well protected is essential because it is the sole mechanism that can re-identify individuals in otherwise de-identified research data, and its compromise can undermine confidentiality protections and cause harm to participants. In the UBC environment, the key file must be stored separately from research data on UBC-approved secure systems, with access restricted to authorized personnel only and never stored on personal devices or unapproved cloud services.
Here is an example of a data key file for the work described above; this is a dummy file provided for illustrative purposes only, and in real-world research settings, such files must never be shared on public systems.
File path: data_key_file/WorkerSatisfaction_data_key_DUMMY.xlsx
| Sheet | Contents |
|---|---|
README |
Dummy disclaimer and UBC storage requirements |
orchard_key |
Original orchard name → pseudonym code |
city_key |
City label after suppression → pseudonym code |
rare_city_suppress |
Cities pooled to "Other BC community" |
The key holds orchard and city code mappings only — not worker names, emails, or other direct identifiers removed in step 3.1.
orchard_key <- step1 |>
distinct(orchard_id) |>
arrange(orchard_id) |>
transmute(
original_orchard = orchard_id,
pseudonym_code = paste0("Orchard_", sprintf("%02d", row_number()))
)
city_key <- step3 |>
distinct(city) |>
arrange(city) |>
transmute(
city_after_suppression = city,
pseudonym_code = paste0("City_", sprintf("%02d", row_number()))
)
if (length(rare_cities) > 0) {
rare_city_key <- data.frame(
original_city = rare_cities,
pooled_to = "Other BC community"
)
} else {
rare_city_key <- data.frame(
original_city = "(none — all cities at or above k threshold)",
pooled_to = NA_character_
)
}
key_readme <- data.frame(
field = c(
"File status",
"Purpose",
"Contains",
"Storage requirement (UBC)"
),
description = c(
"DUMMY — illustrative example only; not for production use",
"Reverse pseudonym codes for orchard and city in the de-identified file",
"Orchard map, city map, rare-city suppression list — no worker names or emails",
"Store separately from de-identified data on UBC-approved secure systems; authorized access only"
)
)
cat("Orchard key:\n")
## Orchard key:
head(orchard_key, 10)
## # A tibble: 10 × 2
## original_orchard pseudonym_code
## <chr> <chr>
## 1 Applejacks Orchard_01
## 2 Billie's Apples Orchard_02
## 3 Bon Appletit Orchard_03
## 4 Country Apples Orchard_04
## 5 Golden Valley Farms Orchard_05
## 6 Lakeside Groves Orchard_06
## 7 Okanagan Fresh Orchard_07
## 8 Peak Harvest Co Orchard_08
## 9 Ridgeline Orchard Orchard_09
## 10 Sunrise Orchards Orchard_10
cat("\nCity key:\n")
##
## City key:
city_key
## # A tibble: 12 × 2
## city_after_suppression pseudonym_code
## <chr> <chr>
## 1 Cobble Hill City_01
## 2 Kelowna City_02
## 3 Keremeos City_03
## 4 Naramata City_04
## 5 Oliver City_05
## 6 Osoyoos City_06
## 7 Peachland City_07
## 8 Penticton City_08
## 9 Summerland City_09
## 10 Vernon City_10
## 11 Victoria City_11
## 12 West Kelowna City_12
dir.create(KEY_DIR, showWarnings = FALSE)
write_xlsx(
list(
README = key_readme,
orchard_key = orchard_key,
city_key = city_key,
rare_city_suppress = rare_city_key
),
KEY_FILE
)
cat("Data key exported:", KEY_FILE, "\n")
## Data key exported: data_key_file/WorkerSatisfaction_data_key_DUMMY.xlsx
cat("Sheets: README, orchard_key, city_key, rare_city_suppress\n")
## Sheets: README, orchard_key, city_key, rare_city_suppress
cat("Store separately from", OUTPUT_FILE, "— never on public systems.\n")
## Store separately from WorkerSatisfaction_300rows_deidentified.xlsx — never on public systems.
| Stage | What changed |
|---|---|
| Direct identifiers | Removed worker_id, email_id,
username_id, owner_id |
| External sources | Removed comments |
| Dates | age (DOB) → age_band (5-year groups) |
| Geography | Rare cities pooled; all cities → City_01 … |
| Quasi-identifiers | orchard_id → Orchard_01 …; mappings in
data_key_file/WorkerSatisfaction_data_key_DUMMY.xlsx (dummy, separate
storage) |
The de-identified dataset keeps all satisfaction and treatment ratings for analysis while substantially lowering re-identification risk. The data key remains the only link back to original orchard and city names — protect it accordingly.