Introduction

This workbook guides you through de-identifying the Apple Grower Satisfaction Survey (WorkerSatisfaction_300rows.xlsx: 300 rows, 16 columns).

Learning objectives

Identify privacy risks in survey data
Apply de-identification steps in R using dplyr
Export a shareable dataset and a separate data key file

Workflow

Setup — load packages and set file paths
Load and assess — inspect raw data and review privacy risks
Transform — apply de-identification rules step by step
Verify and deliver — run QA checks and export output files

Output files

WorkerSatisfaction_300rows_deidentified.xlsx — for research team and analysts
data_key_file/WorkerSatisfaction_data_key_DUMMY.xlsx — for authorized personnel only; store separately from the de-identified data (dummy example)

Do not edit the raw source file (WorkerSatisfaction_300rows.xlsx).

Roadmap

Part	Section	What happens
1	Setup	Install packages, set paths
2	Load and assess	Read raw data; review direct IDs, dates, geography, free text
3	Transform	Remove, generalize, suppress, and pseudonymize — preview after each step
4	Verify and deliver	QA checks → export de-identified file → export data key

Risk	What to look for (Part 2)	What to do (Part 3)
Direct identifiers	Names, emails, handles, owner names	Remove columns
Dates	`age` is date of birth	Replace with 5-year age bands
Geography	Small city counts	Suppress rare cities, then anonymize
External sources	Free-text `comments`	Remove column
Quasi-identifiers	Orchard names	Pseudonymize to `Orchard_01`, …

Part 1 — Setup

Required packages

Package	Purpose
`readxl`	Read the original Excel survey file
`dplyr`	Transform data (native pipe `\\|>`)
`writexl`	Export Excel outputs

Run Install packages once if a package is missing.

# install.packages("readxl")
# install.packages("dplyr")
# install.packages("writexl")
# install.packages(c("readxl", "dplyr", "writexl"))

library(readxl)
library(dplyr)
library(writexl)

DATA_FILE   <- "WorkerSatisfaction_300rows.xlsx"
OUTPUT_FILE <- "WorkerSatisfaction_300rows_deidentified.xlsx"
KEY_DIR     <- "data_key_file"
KEY_FILE    <- file.path(KEY_DIR, "WorkerSatisfaction_data_key_DUMMY.xlsx")
K_THRESHOLD <- 5

k-anonymity threshold: categories with fewer than 5 records will be pooled before anonymization.

cat("K_THRESHOLD =", K_THRESHOLD, "\n")

## K_THRESHOLD = 5

cat("Data key file:", KEY_FILE, "\n")

## Data key file: data_key_file/WorkerSatisfaction_data_key_DUMMY.xlsx

Part 2 — Load and Assess

Load the unmodified survey and review what makes the data identifiable.

2.1 Load raw data

raw <- read_excel(DATA_FILE, sheet = "Unmodified data")

cat("Raw data:", nrow(raw), "rows ×", ncol(raw), "columns\n")

## Raw data: 300 rows × 16 columns

head(raw, 10)

## # A tibble: 10 × 16
##    worker_id        email_id age                 immigration_stat city  province
##    <chr>            <chr>    <dttm>              <chr>            <chr> <chr>   
##  1 James Strange    james.s… 1998-09-10 00:00:00 Non-immigrant    Kelo… B.C.    
##  2 Maya Liya        maya.li… 1994-08-04 00:00:00 Immigrant        Kelo… B.C.    
##  3 Amelio Beal      amelio.… 1995-01-02 00:00:00 Non-permanent r… West… B.C.    
##  4 Cara Sahara      cara.sa… 1991-03-22 00:00:00 Non-permanent r… West… B.C.    
##  5 Neiv Rieg        neivr@g… 1990-07-15 00:00:00 Non-permanent r… Vict… B.C.    
##  6 Troy Ahoy        tahoy@y… 1988-06-06 00:00:00 Immigrant        Vict… B.C.    
##  7 Dave Mahew       dmahew@… 2000-11-11 00:00:00 Non-immigrant    Vern… B.C.    
##  8 Jamie Thomas     jamiet@… 1992-12-18 00:00:00 Immigrant        Vern… B.C.    
##  9 Betty Stills     bettys5… 1997-09-06 00:00:00 Non-permanent r… Cobb… B.C.    
## 10 Enrique Iglasias ei123@h… 1985-10-16 00:00:00 Non-permanent r… Cobb… B.C.    
## # ℹ 10 more variables: orchard_id <chr>, owner_id <chr>, username_id <chr>,
## #   sns_worked <dbl>, sat_hrs <dbl>, trt_workers <dbl>, trt_manager <dbl>,
## #   cmf_manager <dbl>, sat_work_overall <dbl>, comments <chr>

2.2 Direct identifiers

These variables point to a specific person and will be removed in Part 3:

Variable	Why it is risky
`worker_id`	Full name (300 unique values)
`email_id`	Email address (300 unique values)
`username_id`	Social-media handle (265 unique values)
`owner_id`	Orchard owner name

2.3 Dates

The column age stores date of birth, not a numeric age. Exact DOB is a strong quasi-identifier when combined with location and employer.

data.frame(
  earliest     = as.character(min(as.Date(raw$age))),
  latest       = as.character(max(as.Date(raw$age))),
  unique_dates = length(unique(raw$age))
)

##     earliest     latest unique_dates
## 1 1970-02-20 2000-11-11          295

Exact DOB will be replaced with 5-year age bands in Part 3.

2.4 Geography

city has 12 levels; province is always B.C. Small groups are easier to re-identify.

raw |>
  count(city, sort = TRUE) |>
  mutate(rare = n < K_THRESHOLD)

## # A tibble: 12 × 3
##    city             n rare 
##    <chr>        <int> <lgl>
##  1 Victoria        33 FALSE
##  2 Summerland      32 FALSE
##  3 Keremeos        31 FALSE
##  4 Kelowna         29 FALSE
##  5 West Kelowna    29 FALSE
##  6 Peachland       27 FALSE
##  7 Cobble Hill     26 FALSE
##  8 Vernon          24 FALSE
##  9 Naramata        18 FALSE
## 10 Oliver          18 FALSE
## 11 Osoyoos         17 FALSE
## 12 Penticton       16 FALSE

rare_cities <- raw |>
  count(city) |>
  filter(n < K_THRESHOLD) |>
  pull(city)

cat("Rare cities (n <", K_THRESHOLD, "):\n")

## Rare cities (n < 5 ):

rare_cities

## character(0)

raw |> count(immigration_stat, sort = TRUE)

## # A tibble: 3 × 2
##   immigration_stat           n
##   <chr>                  <int>
## 1 Non-permanent resident   107
## 2 Non-immigrant            103
## 3 Immigrant                 90

Note: Pool rare cities to "Other BC community", then replace all city names with codes (City_01, …). Immigration counts are checked here; recoding is only needed if any group falls below 5.

2.5 External sources (free text)

The comments field can name people, employers, platforms, and birth years — even after other columns are cleaned.

raw |>
  filter(comments != "None") |>
  slice(1:10) |>
  select(comments)

## # A tibble: 10 × 1
##    comments                                                                     
##    <chr>                                                                        
##  1 I get that I'm a farm worker and grinding is part of the job, but it'd be ni…
##  2 I don't mind the actual work, but the people I work with and the owners like…
##  3 I only like this job because I can make good money, everything else sucks. I…
##  4 The hours and work are hard, but it's good money and I love the team we have…
##  5 Hard but great work                                                          
##  6 I'm an apple picker, it is what it is                                        
##  7 It's a good summer job to help me pay for school, definitely not something I…
##  8 The owner's wife is such a sweetheart, but I don't like the owner himself.  …
##  9 I hate this job and everybody who works here.  the guys I work with are awfu…
## 10 I don't love the job but I stay for the money and the routine.

Note: Remove comments entirely. Free text cannot be reliably de-identified without manual review.

2.6 Quasi-identifiers (orchards)

Orchard names link respondents to a specific workplace. They will be pseudonymized in Part 3.

raw |> count(orchard_id, sort = TRUE)

## # A tibble: 12 × 2
##    orchard_id              n
##    <chr>               <int>
##  1 Billie's Apples        33
##  2 Okanagan Fresh         32
##  3 Valley View Apples     31
##  4 Applejacks             29
##  5 West Kelowna Apples    29
##  6 Lakeside Groves        27
##  7 Country Apples         26
##  8 Bon Appletit           24
##  9 Golden Valley Farms    18
## 10 Ridgeline Orchard      18
## 11 Peak Harvest Co        17
## 12 Sunrise Orchards       16

Note: Replace orchard names with non-descriptive codes (Orchard_01, …).

Part 3 — Transform

Apply each rule from Part 2. Run chunks in order; each step shows 10 rows of the updated dataframe.

3.1 Remove direct identifiers and comments

direct_ids <- c("worker_id", "email_id", "username_id", "owner_id")

step1 <- raw |>
  select(-all_of(direct_ids), -comments)

cat("Step 3.1:", nrow(step1), "rows ×", ncol(step1), "columns\n")

## Step 3.1: 300 rows × 11 columns

head(step1, 10)

## # A tibble: 10 × 11
##    age                 immigration_stat     city  province orchard_id sns_worked
##    <dttm>              <chr>                <chr> <chr>    <chr>           <dbl>
##  1 1998-09-10 00:00:00 Non-immigrant        Kelo… B.C.     Applejacks          2
##  2 1994-08-04 00:00:00 Immigrant            Kelo… B.C.     Applejacks          4
##  3 1995-01-02 00:00:00 Non-permanent resid… West… B.C.     West Kelo…          1
##  4 1991-03-22 00:00:00 Non-permanent resid… West… B.C.     West Kelo…          2
##  5 1990-07-15 00:00:00 Non-permanent resid… Vict… B.C.     Billie's …          2
##  6 1988-06-06 00:00:00 Immigrant            Vict… B.C.     Billie's …          5
##  7 2000-11-11 00:00:00 Non-immigrant        Vern… B.C.     Bon Apple…          3
##  8 1992-12-18 00:00:00 Immigrant            Vern… B.C.     Bon Apple…          1
##  9 1997-09-06 00:00:00 Non-permanent resid… Cobb… B.C.     Country A…          3
## 10 1985-10-16 00:00:00 Non-permanent resid… Cobb… B.C.     Country A…          6
## # ℹ 5 more variables: sat_hrs <dbl>, trt_workers <dbl>, trt_manager <dbl>,
## #   cmf_manager <dbl>, sat_work_overall <dbl>

3.2 Generalize date of birth

to_age_band <- function(dob) {
  yrs <- as.integer(difftime(Sys.Date(), as.Date(dob), units = "days") / 365.25)
  if (yrs < 25)      "18-24"
  else if (yrs < 35) "25-34"
  else if (yrs < 45) "35-44"
  else if (yrs < 55) "45-54"
  else               "55+"
}

step2 <- step1 |>
  mutate(age_band = sapply(age, to_age_band)) |>
  select(-age) |>
  relocate(age_band, .before = immigration_stat)

cat("Step 3.2:", nrow(step2), "rows ×", ncol(step2), "columns\n")

## Step 3.2: 300 rows × 11 columns

head(step2, 10)

## # A tibble: 10 × 11
##    age_band immigration_stat       city   province orchard_id sns_worked sat_hrs
##    <chr>    <chr>                  <chr>  <chr>    <chr>           <dbl>   <dbl>
##  1 25-34    Non-immigrant          Kelow… B.C.     Applejacks          2       3
##  2 25-34    Immigrant              Kelow… B.C.     Applejacks          4       2
##  3 25-34    Non-permanent resident West … B.C.     West Kelo…          1       1
##  4 35-44    Non-permanent resident West … B.C.     West Kelo…          2       3
##  5 35-44    Non-permanent resident Victo… B.C.     Billie's …          2       2
##  6 35-44    Immigrant              Victo… B.C.     Billie's …          5       3
##  7 25-34    Non-immigrant          Vernon B.C.     Bon Apple…          3       3
##  8 25-34    Immigrant              Vernon B.C.     Bon Apple…          1       4
##  9 25-34    Non-permanent resident Cobbl… B.C.     Country A…          3       3
## 10 35-44    Non-permanent resident Cobbl… B.C.     Country A…          6       1
## # ℹ 4 more variables: trt_workers <dbl>, trt_manager <dbl>, cmf_manager <dbl>,
## #   sat_work_overall <dbl>

3.3 Suppress rare cities

step3 <- step2 |>
  mutate(city = if_else(city %in% rare_cities, "Other BC community", city))

cat("Step 3.3:", nrow(step3), "rows ×", ncol(step3), "columns\n")

## Step 3.3: 300 rows × 11 columns

head(step3, 10)

## # A tibble: 10 × 11
##    age_band immigration_stat       city   province orchard_id sns_worked sat_hrs
##    <chr>    <chr>                  <chr>  <chr>    <chr>           <dbl>   <dbl>
##  1 25-34    Non-immigrant          Kelow… B.C.     Applejacks          2       3
##  2 25-34    Immigrant              Kelow… B.C.     Applejacks          4       2
##  3 25-34    Non-permanent resident West … B.C.     West Kelo…          1       1
##  4 35-44    Non-permanent resident West … B.C.     West Kelo…          2       3
##  5 35-44    Non-permanent resident Victo… B.C.     Billie's …          2       2
##  6 35-44    Immigrant              Victo… B.C.     Billie's …          5       3
##  7 25-34    Non-immigrant          Vernon B.C.     Bon Apple…          3       3
##  8 25-34    Immigrant              Vernon B.C.     Bon Apple…          1       4
##  9 25-34    Non-permanent resident Cobbl… B.C.     Country A…          3       3
## 10 35-44    Non-permanent resident Cobbl… B.C.     Country A…          6       1
## # ℹ 4 more variables: trt_workers <dbl>, trt_manager <dbl>, cmf_manager <dbl>,
## #   sat_work_overall <dbl>

step3 |> count(city, sort = TRUE)

## # A tibble: 12 × 2
##    city             n
##    <chr>        <int>
##  1 Victoria        33
##  2 Summerland      32
##  3 Keremeos        31
##  4 Kelowna         29
##  5 West Kelowna    29
##  6 Peachland       27
##  7 Cobble Hill     26
##  8 Vernon          24
##  9 Naramata        18
## 10 Oliver          18
## 11 Osoyoos         17
## 12 Penticton       16

3.4 Anonymize city

step4 <- step3 |>
  mutate(city = paste0(
    "City_", sprintf("%02d", as.integer(factor(city)))
  ))

cat("Step 3.4:", nrow(step4), "rows ×", ncol(step4), "columns\n")

## Step 3.4: 300 rows × 11 columns

head(step4, 10)

## # A tibble: 10 × 11
##    age_band immigration_stat       city   province orchard_id sns_worked sat_hrs
##    <chr>    <chr>                  <chr>  <chr>    <chr>           <dbl>   <dbl>
##  1 25-34    Non-immigrant          City_… B.C.     Applejacks          2       3
##  2 25-34    Immigrant              City_… B.C.     Applejacks          4       2
##  3 25-34    Non-permanent resident City_… B.C.     West Kelo…          1       1
##  4 35-44    Non-permanent resident City_… B.C.     West Kelo…          2       3
##  5 35-44    Non-permanent resident City_… B.C.     Billie's …          2       2
##  6 35-44    Immigrant              City_… B.C.     Billie's …          5       3
##  7 25-34    Non-immigrant          City_… B.C.     Bon Apple…          3       3
##  8 25-34    Immigrant              City_… B.C.     Bon Apple…          1       4
##  9 25-34    Non-permanent resident City_… B.C.     Country A…          3       3
## 10 35-44    Non-permanent resident City_… B.C.     Country A…          6       1
## # ℹ 4 more variables: trt_workers <dbl>, trt_manager <dbl>, cmf_manager <dbl>,
## #   sat_work_overall <dbl>

step4 |> count(city, sort = TRUE)

## # A tibble: 12 × 2
##    city        n
##    <chr>   <int>
##  1 City_11    33
##  2 City_09    32
##  3 City_03    31
##  4 City_02    29
##  5 City_12    29
##  6 City_07    27
##  7 City_01    26
##  8 City_10    24
##  9 City_04    18
## 10 City_05    18
## 11 City_06    17
## 12 City_08    16

3.5 Pseudonymize orchard

step5 <- step4 |>
  mutate(orchard_id = paste0(
    "Orchard_", sprintf("%02d", as.integer(factor(orchard_id)))
  ))

deid <- step5

cat("Step 3.5 — final:", nrow(deid), "rows ×", ncol(deid), "columns\n")

## Step 3.5 — final: 300 rows × 11 columns

head(deid, 10)

## # A tibble: 10 × 11
##    age_band immigration_stat       city   province orchard_id sns_worked sat_hrs
##    <chr>    <chr>                  <chr>  <chr>    <chr>           <dbl>   <dbl>
##  1 25-34    Non-immigrant          City_… B.C.     Orchard_01          2       3
##  2 25-34    Immigrant              City_… B.C.     Orchard_01          4       2
##  3 25-34    Non-permanent resident City_… B.C.     Orchard_12          1       1
##  4 35-44    Non-permanent resident City_… B.C.     Orchard_12          2       3
##  5 35-44    Non-permanent resident City_… B.C.     Orchard_02          2       2
##  6 35-44    Immigrant              City_… B.C.     Orchard_02          5       3
##  7 25-34    Non-immigrant          City_… B.C.     Orchard_03          3       3
##  8 25-34    Immigrant              City_… B.C.     Orchard_03          1       4
##  9 25-34    Non-permanent resident City_… B.C.     Orchard_04          3       3
## 10 35-44    Non-permanent resident City_… B.C.     Orchard_04          6       1
## # ℹ 4 more variables: trt_workers <dbl>, trt_manager <dbl>, cmf_manager <dbl>,
## #   sat_work_overall <dbl>

deid |> count(orchard_id, sort = TRUE)

## # A tibble: 12 × 2
##    orchard_id     n
##    <chr>      <int>
##  1 Orchard_02    33
##  2 Orchard_07    32
##  3 Orchard_11    31
##  4 Orchard_01    29
##  5 Orchard_12    29
##  6 Orchard_06    27
##  7 Orchard_04    26
##  8 Orchard_03    24
##  9 Orchard_05    18
## 10 Orchard_09    18
## 11 Orchard_08    17
## 12 Orchard_10    16

What changed

Removed	Transformed
`worker_id`, `email_id`, `username_id`, `owner_id`, `comments`, exact DOB (`age`)	`age` → `age_band`; rare cities pooled; `city` → `City_01` …; `orchard_id` → `Orchard_01` …

Part 4 — Verify and Deliver

4.1 QA checks

Confirm the de-identified data is safe to export. All checks should show TRUE.

data.frame(
  check = c(
    "No direct ID columns",
    "No comments column",
    "No exact DOB column",
    "Row count unchanged",
    "Rare cities suppressed",
    "Cities anonymized",
    "Orchards pseudonymized"
  ),
  passed = c(
    !any(direct_ids %in% names(deid)),
    !"comments" %in% names(deid),
    !"age" %in% names(deid),
    nrow(deid) == nrow(raw),
    !any(rare_cities %in% deid$city),
    !any(unique(raw$city) %in% deid$city),
    !any(unique(raw$orchard_id) %in% deid$orchard_id)
  )
)

##                    check passed
## 1   No direct ID columns   TRUE
## 2     No comments column   TRUE
## 3    No exact DOB column   TRUE
## 4    Row count unchanged   TRUE
## 5 Rare cities suppressed   TRUE
## 6      Cities anonymized   TRUE
## 7 Orchards pseudonymized   TRUE

data.frame(
  rows               = nrow(deid),
  columns            = ncol(deid),
  smallest_city      = min(table(deid$city)),
  smallest_age_band  = min(table(deid$age_band)),
  smallest_orchard   = min(table(deid$orchard_id))
)

##   rows columns smallest_city smallest_age_band smallest_orchard
## 1  300      11            16                13               16

Note: Each group should ideally have at least 5 records, or be pooled into a broader category.

4.2 Export de-identified dataset

This is the file you can share for analysis. It does not include the data key.

write_xlsx(deid, OUTPUT_FILE)

cat("Exported:", OUTPUT_FILE, "\n")

## Exported: WorkerSatisfaction_300rows_deidentified.xlsx

cat("Rows:", nrow(deid), "| Columns:", ncol(deid), "\n\n")

## Rows: 300 | Columns: 11

cat("Preview of exported data:\n")

## Preview of exported data:

head(deid, 10)

## # A tibble: 10 × 11
##    age_band immigration_stat       city   province orchard_id sns_worked sat_hrs
##    <chr>    <chr>                  <chr>  <chr>    <chr>           <dbl>   <dbl>
##  1 25-34    Non-immigrant          City_… B.C.     Orchard_01          2       3
##  2 25-34    Immigrant              City_… B.C.     Orchard_01          4       2
##  3 25-34    Non-permanent resident City_… B.C.     Orchard_12          1       1
##  4 35-44    Non-permanent resident City_… B.C.     Orchard_12          2       3
##  5 35-44    Non-permanent resident City_… B.C.     Orchard_02          2       2
##  6 35-44    Immigrant              City_… B.C.     Orchard_02          5       3
##  7 25-34    Non-immigrant          City_… B.C.     Orchard_03          3       3
##  8 25-34    Immigrant              City_… B.C.     Orchard_03          1       4
##  9 25-34    Non-permanent resident City_… B.C.     Orchard_04          3       3
## 10 35-44    Non-permanent resident City_… B.C.     Orchard_04          6       1
## # ℹ 4 more variables: trt_workers <dbl>, trt_manager <dbl>, cmf_manager <dbl>,
## #   sat_work_overall <dbl>

4.3 Data key file (DUMMY — illustrative only)

Keeping the data key file well protected is essential because it is the sole mechanism that can re-identify individuals in otherwise de-identified research data, and its compromise can undermine confidentiality protections and cause harm to participants. In the UBC environment, the key file must be stored separately from research data on UBC-approved secure systems, with access restricted to authorized personnel only and never stored on personal devices or unapproved cloud services.

Here is an example of a data key file for the work described above; this is a dummy file provided for illustrative purposes only, and in real-world research settings, such files must never be shared on public systems.

File path: data_key_file/WorkerSatisfaction_data_key_DUMMY.xlsx

Sheet	Contents
`README`	Dummy disclaimer and UBC storage requirements
`orchard_key`	Original orchard name → pseudonym code
`city_key`	City label after suppression → pseudonym code
`rare_city_suppress`	Cities pooled to `"Other BC community"`

The key holds orchard and city code mappings only — not worker names, emails, or other direct identifiers removed in step 3.1.

orchard_key <- step1 |>
  distinct(orchard_id) |>
  arrange(orchard_id) |>
  transmute(
    original_orchard = orchard_id,
    pseudonym_code   = paste0("Orchard_", sprintf("%02d", row_number()))
  )

city_key <- step3 |>
  distinct(city) |>
  arrange(city) |>
  transmute(
    city_after_suppression = city,
    pseudonym_code         = paste0("City_", sprintf("%02d", row_number()))
  )

if (length(rare_cities) > 0) {
  rare_city_key <- data.frame(
    original_city = rare_cities,
    pooled_to     = "Other BC community"
  )
} else {
  rare_city_key <- data.frame(
    original_city = "(none — all cities at or above k threshold)",
    pooled_to     = NA_character_
  )
}

key_readme <- data.frame(
  field = c(
    "File status",
    "Purpose",
    "Contains",
    "Storage requirement (UBC)"
  ),
  description = c(
    "DUMMY — illustrative example only; not for production use",
    "Reverse pseudonym codes for orchard and city in the de-identified file",
    "Orchard map, city map, rare-city suppression list — no worker names or emails",
    "Store separately from de-identified data on UBC-approved secure systems; authorized access only"
  )
)

cat("Orchard key:\n")

## Orchard key:

head(orchard_key, 10)

## # A tibble: 10 × 2
##    original_orchard    pseudonym_code
##    <chr>               <chr>         
##  1 Applejacks          Orchard_01    
##  2 Billie's Apples     Orchard_02    
##  3 Bon Appletit        Orchard_03    
##  4 Country Apples      Orchard_04    
##  5 Golden Valley Farms Orchard_05    
##  6 Lakeside Groves     Orchard_06    
##  7 Okanagan Fresh      Orchard_07    
##  8 Peak Harvest Co     Orchard_08    
##  9 Ridgeline Orchard   Orchard_09    
## 10 Sunrise Orchards    Orchard_10

cat("\nCity key:\n")

## 
## City key:

city_key

## # A tibble: 12 × 2
##    city_after_suppression pseudonym_code
##    <chr>                  <chr>         
##  1 Cobble Hill            City_01       
##  2 Kelowna                City_02       
##  3 Keremeos               City_03       
##  4 Naramata               City_04       
##  5 Oliver                 City_05       
##  6 Osoyoos                City_06       
##  7 Peachland              City_07       
##  8 Penticton              City_08       
##  9 Summerland             City_09       
## 10 Vernon                 City_10       
## 11 Victoria               City_11       
## 12 West Kelowna           City_12

dir.create(KEY_DIR, showWarnings = FALSE)

write_xlsx(
  list(
    README             = key_readme,
    orchard_key        = orchard_key,
    city_key           = city_key,
    rare_city_suppress = rare_city_key
  ),
  KEY_FILE
)

cat("Data key exported:", KEY_FILE, "\n")

## Data key exported: data_key_file/WorkerSatisfaction_data_key_DUMMY.xlsx

cat("Sheets: README, orchard_key, city_key, rare_city_suppress\n")

## Sheets: README, orchard_key, city_key, rare_city_suppress

cat("Store separately from", OUTPUT_FILE, "— never on public systems.\n")

## Store separately from WorkerSatisfaction_300rows_deidentified.xlsx — never on public systems.

Closing Summary

Stage	What changed
Direct identifiers	Removed `worker_id`, `email_id`, `username_id`, `owner_id`
External sources	Removed `comments`
Dates	`age` (DOB) → `age_band` (5-year groups)
Geography	Rare cities pooled; all cities → `City_01` …
Quasi-identifiers	`orchard_id` → `Orchard_01` …; mappings in data_key_file/WorkerSatisfaction_data_key_DUMMY.xlsx (dummy, separate storage)

The de-identified dataset keeps all satisfaction and treatment ratings for analysis while substantially lowering re-identification risk. The data key remains the only link back to original orchard and city names — protect it accordingly.

Worker Satisfaction Survey — Data De-identification Workbook

Data De-identification Advisor

May 28, 2026