Introduction

This workbook guides you through de-identifying the Apple Grower Satisfaction Survey (WorkerSatisfaction_300rows.xlsx: 300 rows, 16 columns).

Learning objectives

  • Identify privacy risks in survey data
  • Apply de-identification steps in R using dplyr
  • Export a shareable dataset and a separate data key file

Workflow

  1. Setup — load packages and set file paths
  2. Load and assess — inspect raw data and review privacy risks
  3. Transform — apply de-identification rules step by step
  4. Verify and deliver — run QA checks and export output files

Output files

  • WorkerSatisfaction_300rows_deidentified.xlsx — for research team and analysts
  • data_key_file/WorkerSatisfaction_data_key_DUMMY.xlsx — for authorized personnel only; store separately from the de-identified data (dummy example)

Do not edit the raw source file (WorkerSatisfaction_300rows.xlsx).

Roadmap

Part Section What happens
1 Setup Install packages, set paths
2 Load and assess Read raw data; review direct IDs, dates, geography, free text
3 Transform Remove, generalize, suppress, and pseudonymize — preview after each step
4 Verify and deliver QA checks → export de-identified file → export data key
Risk What to look for (Part 2) What to do (Part 3)
Direct identifiers Names, emails, handles, owner names Remove columns
Dates age is date of birth Replace with 5-year age bands
Geography Small city counts Suppress rare cities, then anonymize
External sources Free-text comments Remove column
Quasi-identifiers Orchard names Pseudonymize to Orchard_01, …

Part 1 — Setup

Required packages

Package Purpose
readxl Read the original Excel survey file
dplyr Transform data (native pipe \|>)
writexl Export Excel outputs

Run Install packages once if a package is missing.

# install.packages("readxl")
# install.packages("dplyr")
# install.packages("writexl")
# install.packages(c("readxl", "dplyr", "writexl"))
library(readxl)
library(dplyr)
library(writexl)
DATA_FILE   <- "WorkerSatisfaction_300rows.xlsx"
OUTPUT_FILE <- "WorkerSatisfaction_300rows_deidentified.xlsx"
KEY_DIR     <- "data_key_file"
KEY_FILE    <- file.path(KEY_DIR, "WorkerSatisfaction_data_key_DUMMY.xlsx")
K_THRESHOLD <- 5

k-anonymity threshold: categories with fewer than 5 records will be pooled before anonymization.

cat("K_THRESHOLD =", K_THRESHOLD, "\n")
## K_THRESHOLD = 5
cat("Data key file:", KEY_FILE, "\n")
## Data key file: data_key_file/WorkerSatisfaction_data_key_DUMMY.xlsx

Part 2 — Load and Assess

Load the unmodified survey and review what makes the data identifiable.

2.1 Load raw data

raw <- read_excel(DATA_FILE, sheet = "Unmodified data")

cat("Raw data:", nrow(raw), "rows ×", ncol(raw), "columns\n")
## Raw data: 300 rows × 16 columns
head(raw, 10)
## # A tibble: 10 × 16
##    worker_id        email_id age                 immigration_stat city  province
##    <chr>            <chr>    <dttm>              <chr>            <chr> <chr>   
##  1 James Strange    james.s… 1998-09-10 00:00:00 Non-immigrant    Kelo… B.C.    
##  2 Maya Liya        maya.li… 1994-08-04 00:00:00 Immigrant        Kelo… B.C.    
##  3 Amelio Beal      amelio.… 1995-01-02 00:00:00 Non-permanent r… West… B.C.    
##  4 Cara Sahara      cara.sa… 1991-03-22 00:00:00 Non-permanent r… West… B.C.    
##  5 Neiv Rieg        neivr@g… 1990-07-15 00:00:00 Non-permanent r… Vict… B.C.    
##  6 Troy Ahoy        tahoy@y… 1988-06-06 00:00:00 Immigrant        Vict… B.C.    
##  7 Dave Mahew       dmahew@… 2000-11-11 00:00:00 Non-immigrant    Vern… B.C.    
##  8 Jamie Thomas     jamiet@… 1992-12-18 00:00:00 Immigrant        Vern… B.C.    
##  9 Betty Stills     bettys5… 1997-09-06 00:00:00 Non-permanent r… Cobb… B.C.    
## 10 Enrique Iglasias ei123@h… 1985-10-16 00:00:00 Non-permanent r… Cobb… B.C.    
## # ℹ 10 more variables: orchard_id <chr>, owner_id <chr>, username_id <chr>,
## #   sns_worked <dbl>, sat_hrs <dbl>, trt_workers <dbl>, trt_manager <dbl>,
## #   cmf_manager <dbl>, sat_work_overall <dbl>, comments <chr>

2.2 Direct identifiers

These variables point to a specific person and will be removed in Part 3:

Variable Why it is risky
worker_id Full name (300 unique values)
email_id Email address (300 unique values)
username_id Social-media handle (265 unique values)
owner_id Orchard owner name

2.3 Dates

The column age stores date of birth, not a numeric age. Exact DOB is a strong quasi-identifier when combined with location and employer.

data.frame(
  earliest     = as.character(min(as.Date(raw$age))),
  latest       = as.character(max(as.Date(raw$age))),
  unique_dates = length(unique(raw$age))
)
##     earliest     latest unique_dates
## 1 1970-02-20 2000-11-11          295

Exact DOB will be replaced with 5-year age bands in Part 3.

2.4 Geography

city has 12 levels; province is always B.C. Small groups are easier to re-identify.

raw |>
  count(city, sort = TRUE) |>
  mutate(rare = n < K_THRESHOLD)
## # A tibble: 12 × 3
##    city             n rare 
##    <chr>        <int> <lgl>
##  1 Victoria        33 FALSE
##  2 Summerland      32 FALSE
##  3 Keremeos        31 FALSE
##  4 Kelowna         29 FALSE
##  5 West Kelowna    29 FALSE
##  6 Peachland       27 FALSE
##  7 Cobble Hill     26 FALSE
##  8 Vernon          24 FALSE
##  9 Naramata        18 FALSE
## 10 Oliver          18 FALSE
## 11 Osoyoos         17 FALSE
## 12 Penticton       16 FALSE
rare_cities <- raw |>
  count(city) |>
  filter(n < K_THRESHOLD) |>
  pull(city)

cat("Rare cities (n <", K_THRESHOLD, "):\n")
## Rare cities (n < 5 ):
rare_cities
## character(0)
raw |> count(immigration_stat, sort = TRUE)
## # A tibble: 3 × 2
##   immigration_stat           n
##   <chr>                  <int>
## 1 Non-permanent resident   107
## 2 Non-immigrant            103
## 3 Immigrant                 90

Note: Pool rare cities to "Other BC community", then replace all city names with codes (City_01, …). Immigration counts are checked here; recoding is only needed if any group falls below 5.

2.5 External sources (free text)

The comments field can name people, employers, platforms, and birth years — even after other columns are cleaned.

raw |>
  filter(comments != "None") |>
  slice(1:10) |>
  select(comments)
## # A tibble: 10 × 1
##    comments                                                                     
##    <chr>                                                                        
##  1 I get that I'm a farm worker and grinding is part of the job, but it'd be ni…
##  2 I don't mind the actual work, but the people I work with and the owners like…
##  3 I only like this job because I can make good money, everything else sucks. I…
##  4 The hours and work are hard, but it's good money and I love the team we have…
##  5 Hard but great work                                                          
##  6 I'm an apple picker, it is what it is                                        
##  7 It's a good summer job to help me pay for school, definitely not something I…
##  8 The owner's wife is such a sweetheart, but I don't like the owner himself.  …
##  9 I hate this job and everybody who works here.  the guys I work with are awfu…
## 10 I don't love the job but I stay for the money and the routine.

Note: Remove comments entirely. Free text cannot be reliably de-identified without manual review.

2.6 Quasi-identifiers (orchards)

Orchard names link respondents to a specific workplace. They will be pseudonymized in Part 3.

raw |> count(orchard_id, sort = TRUE)
## # A tibble: 12 × 2
##    orchard_id              n
##    <chr>               <int>
##  1 Billie's Apples        33
##  2 Okanagan Fresh         32
##  3 Valley View Apples     31
##  4 Applejacks             29
##  5 West Kelowna Apples    29
##  6 Lakeside Groves        27
##  7 Country Apples         26
##  8 Bon Appletit           24
##  9 Golden Valley Farms    18
## 10 Ridgeline Orchard      18
## 11 Peak Harvest Co        17
## 12 Sunrise Orchards       16

Note: Replace orchard names with non-descriptive codes (Orchard_01, …).


Part 3 — Transform

Apply each rule from Part 2. Run chunks in order; each step shows 10 rows of the updated dataframe.

3.1 Remove direct identifiers and comments

direct_ids <- c("worker_id", "email_id", "username_id", "owner_id")

step1 <- raw |>
  select(-all_of(direct_ids), -comments)

cat("Step 3.1:", nrow(step1), "rows ×", ncol(step1), "columns\n")
## Step 3.1: 300 rows × 11 columns
head(step1, 10)
## # A tibble: 10 × 11
##    age                 immigration_stat     city  province orchard_id sns_worked
##    <dttm>              <chr>                <chr> <chr>    <chr>           <dbl>
##  1 1998-09-10 00:00:00 Non-immigrant        Kelo… B.C.     Applejacks          2
##  2 1994-08-04 00:00:00 Immigrant            Kelo… B.C.     Applejacks          4
##  3 1995-01-02 00:00:00 Non-permanent resid… West… B.C.     West Kelo…          1
##  4 1991-03-22 00:00:00 Non-permanent resid… West… B.C.     West Kelo…          2
##  5 1990-07-15 00:00:00 Non-permanent resid… Vict… B.C.     Billie's …          2
##  6 1988-06-06 00:00:00 Immigrant            Vict… B.C.     Billie's …          5
##  7 2000-11-11 00:00:00 Non-immigrant        Vern… B.C.     Bon Apple…          3
##  8 1992-12-18 00:00:00 Immigrant            Vern… B.C.     Bon Apple…          1
##  9 1997-09-06 00:00:00 Non-permanent resid… Cobb… B.C.     Country A…          3
## 10 1985-10-16 00:00:00 Non-permanent resid… Cobb… B.C.     Country A…          6
## # ℹ 5 more variables: sat_hrs <dbl>, trt_workers <dbl>, trt_manager <dbl>,
## #   cmf_manager <dbl>, sat_work_overall <dbl>

3.2 Generalize date of birth

to_age_band <- function(dob) {
  yrs <- as.integer(difftime(Sys.Date(), as.Date(dob), units = "days") / 365.25)
  if (yrs < 25)      "18-24"
  else if (yrs < 35) "25-34"
  else if (yrs < 45) "35-44"
  else if (yrs < 55) "45-54"
  else               "55+"
}

step2 <- step1 |>
  mutate(age_band = sapply(age, to_age_band)) |>
  select(-age) |>
  relocate(age_band, .before = immigration_stat)

cat("Step 3.2:", nrow(step2), "rows ×", ncol(step2), "columns\n")
## Step 3.2: 300 rows × 11 columns
head(step2, 10)
## # A tibble: 10 × 11
##    age_band immigration_stat       city   province orchard_id sns_worked sat_hrs
##    <chr>    <chr>                  <chr>  <chr>    <chr>           <dbl>   <dbl>
##  1 25-34    Non-immigrant          Kelow… B.C.     Applejacks          2       3
##  2 25-34    Immigrant              Kelow… B.C.     Applejacks          4       2
##  3 25-34    Non-permanent resident West … B.C.     West Kelo…          1       1
##  4 35-44    Non-permanent resident West … B.C.     West Kelo…          2       3
##  5 35-44    Non-permanent resident Victo… B.C.     Billie's …          2       2
##  6 35-44    Immigrant              Victo… B.C.     Billie's …          5       3
##  7 25-34    Non-immigrant          Vernon B.C.     Bon Apple…          3       3
##  8 25-34    Immigrant              Vernon B.C.     Bon Apple…          1       4
##  9 25-34    Non-permanent resident Cobbl… B.C.     Country A…          3       3
## 10 35-44    Non-permanent resident Cobbl… B.C.     Country A…          6       1
## # ℹ 4 more variables: trt_workers <dbl>, trt_manager <dbl>, cmf_manager <dbl>,
## #   sat_work_overall <dbl>

3.3 Suppress rare cities

step3 <- step2 |>
  mutate(city = if_else(city %in% rare_cities, "Other BC community", city))

cat("Step 3.3:", nrow(step3), "rows ×", ncol(step3), "columns\n")
## Step 3.3: 300 rows × 11 columns
head(step3, 10)
## # A tibble: 10 × 11
##    age_band immigration_stat       city   province orchard_id sns_worked sat_hrs
##    <chr>    <chr>                  <chr>  <chr>    <chr>           <dbl>   <dbl>
##  1 25-34    Non-immigrant          Kelow… B.C.     Applejacks          2       3
##  2 25-34    Immigrant              Kelow… B.C.     Applejacks          4       2
##  3 25-34    Non-permanent resident West … B.C.     West Kelo…          1       1
##  4 35-44    Non-permanent resident West … B.C.     West Kelo…          2       3
##  5 35-44    Non-permanent resident Victo… B.C.     Billie's …          2       2
##  6 35-44    Immigrant              Victo… B.C.     Billie's …          5       3
##  7 25-34    Non-immigrant          Vernon B.C.     Bon Apple…          3       3
##  8 25-34    Immigrant              Vernon B.C.     Bon Apple…          1       4
##  9 25-34    Non-permanent resident Cobbl… B.C.     Country A…          3       3
## 10 35-44    Non-permanent resident Cobbl… B.C.     Country A…          6       1
## # ℹ 4 more variables: trt_workers <dbl>, trt_manager <dbl>, cmf_manager <dbl>,
## #   sat_work_overall <dbl>
step3 |> count(city, sort = TRUE)
## # A tibble: 12 × 2
##    city             n
##    <chr>        <int>
##  1 Victoria        33
##  2 Summerland      32
##  3 Keremeos        31
##  4 Kelowna         29
##  5 West Kelowna    29
##  6 Peachland       27
##  7 Cobble Hill     26
##  8 Vernon          24
##  9 Naramata        18
## 10 Oliver          18
## 11 Osoyoos         17
## 12 Penticton       16

3.4 Anonymize city

step4 <- step3 |>
  mutate(city = paste0(
    "City_", sprintf("%02d", as.integer(factor(city)))
  ))

cat("Step 3.4:", nrow(step4), "rows ×", ncol(step4), "columns\n")
## Step 3.4: 300 rows × 11 columns
head(step4, 10)
## # A tibble: 10 × 11
##    age_band immigration_stat       city   province orchard_id sns_worked sat_hrs
##    <chr>    <chr>                  <chr>  <chr>    <chr>           <dbl>   <dbl>
##  1 25-34    Non-immigrant          City_… B.C.     Applejacks          2       3
##  2 25-34    Immigrant              City_… B.C.     Applejacks          4       2
##  3 25-34    Non-permanent resident City_… B.C.     West Kelo…          1       1
##  4 35-44    Non-permanent resident City_… B.C.     West Kelo…          2       3
##  5 35-44    Non-permanent resident City_… B.C.     Billie's …          2       2
##  6 35-44    Immigrant              City_… B.C.     Billie's …          5       3
##  7 25-34    Non-immigrant          City_… B.C.     Bon Apple…          3       3
##  8 25-34    Immigrant              City_… B.C.     Bon Apple…          1       4
##  9 25-34    Non-permanent resident City_… B.C.     Country A…          3       3
## 10 35-44    Non-permanent resident City_… B.C.     Country A…          6       1
## # ℹ 4 more variables: trt_workers <dbl>, trt_manager <dbl>, cmf_manager <dbl>,
## #   sat_work_overall <dbl>
step4 |> count(city, sort = TRUE)
## # A tibble: 12 × 2
##    city        n
##    <chr>   <int>
##  1 City_11    33
##  2 City_09    32
##  3 City_03    31
##  4 City_02    29
##  5 City_12    29
##  6 City_07    27
##  7 City_01    26
##  8 City_10    24
##  9 City_04    18
## 10 City_05    18
## 11 City_06    17
## 12 City_08    16

3.5 Pseudonymize orchard

step5 <- step4 |>
  mutate(orchard_id = paste0(
    "Orchard_", sprintf("%02d", as.integer(factor(orchard_id)))
  ))

deid <- step5

cat("Step 3.5 — final:", nrow(deid), "rows ×", ncol(deid), "columns\n")
## Step 3.5 — final: 300 rows × 11 columns
head(deid, 10)
## # A tibble: 10 × 11
##    age_band immigration_stat       city   province orchard_id sns_worked sat_hrs
##    <chr>    <chr>                  <chr>  <chr>    <chr>           <dbl>   <dbl>
##  1 25-34    Non-immigrant          City_… B.C.     Orchard_01          2       3
##  2 25-34    Immigrant              City_… B.C.     Orchard_01          4       2
##  3 25-34    Non-permanent resident City_… B.C.     Orchard_12          1       1
##  4 35-44    Non-permanent resident City_… B.C.     Orchard_12          2       3
##  5 35-44    Non-permanent resident City_… B.C.     Orchard_02          2       2
##  6 35-44    Immigrant              City_… B.C.     Orchard_02          5       3
##  7 25-34    Non-immigrant          City_… B.C.     Orchard_03          3       3
##  8 25-34    Immigrant              City_… B.C.     Orchard_03          1       4
##  9 25-34    Non-permanent resident City_… B.C.     Orchard_04          3       3
## 10 35-44    Non-permanent resident City_… B.C.     Orchard_04          6       1
## # ℹ 4 more variables: trt_workers <dbl>, trt_manager <dbl>, cmf_manager <dbl>,
## #   sat_work_overall <dbl>
deid |> count(orchard_id, sort = TRUE)
## # A tibble: 12 × 2
##    orchard_id     n
##    <chr>      <int>
##  1 Orchard_02    33
##  2 Orchard_07    32
##  3 Orchard_11    31
##  4 Orchard_01    29
##  5 Orchard_12    29
##  6 Orchard_06    27
##  7 Orchard_04    26
##  8 Orchard_03    24
##  9 Orchard_05    18
## 10 Orchard_09    18
## 11 Orchard_08    17
## 12 Orchard_10    16

What changed

Removed Transformed
worker_id, email_id, username_id, owner_id, comments, exact DOB (age) ageage_band; rare cities pooled; cityCity_01 …; orchard_idOrchard_01

Part 4 — Verify and Deliver

4.1 QA checks

Confirm the de-identified data is safe to export. All checks should show TRUE.

data.frame(
  check = c(
    "No direct ID columns",
    "No comments column",
    "No exact DOB column",
    "Row count unchanged",
    "Rare cities suppressed",
    "Cities anonymized",
    "Orchards pseudonymized"
  ),
  passed = c(
    !any(direct_ids %in% names(deid)),
    !"comments" %in% names(deid),
    !"age" %in% names(deid),
    nrow(deid) == nrow(raw),
    !any(rare_cities %in% deid$city),
    !any(unique(raw$city) %in% deid$city),
    !any(unique(raw$orchard_id) %in% deid$orchard_id)
  )
)
##                    check passed
## 1   No direct ID columns   TRUE
## 2     No comments column   TRUE
## 3    No exact DOB column   TRUE
## 4    Row count unchanged   TRUE
## 5 Rare cities suppressed   TRUE
## 6      Cities anonymized   TRUE
## 7 Orchards pseudonymized   TRUE
data.frame(
  rows               = nrow(deid),
  columns            = ncol(deid),
  smallest_city      = min(table(deid$city)),
  smallest_age_band  = min(table(deid$age_band)),
  smallest_orchard   = min(table(deid$orchard_id))
)
##   rows columns smallest_city smallest_age_band smallest_orchard
## 1  300      11            16                13               16

Note: Each group should ideally have at least 5 records, or be pooled into a broader category.

4.2 Export de-identified dataset

This is the file you can share for analysis. It does not include the data key.

write_xlsx(deid, OUTPUT_FILE)

cat("Exported:", OUTPUT_FILE, "\n")
## Exported: WorkerSatisfaction_300rows_deidentified.xlsx
cat("Rows:", nrow(deid), "| Columns:", ncol(deid), "\n\n")
## Rows: 300 | Columns: 11
cat("Preview of exported data:\n")
## Preview of exported data:
head(deid, 10)
## # A tibble: 10 × 11
##    age_band immigration_stat       city   province orchard_id sns_worked sat_hrs
##    <chr>    <chr>                  <chr>  <chr>    <chr>           <dbl>   <dbl>
##  1 25-34    Non-immigrant          City_… B.C.     Orchard_01          2       3
##  2 25-34    Immigrant              City_… B.C.     Orchard_01          4       2
##  3 25-34    Non-permanent resident City_… B.C.     Orchard_12          1       1
##  4 35-44    Non-permanent resident City_… B.C.     Orchard_12          2       3
##  5 35-44    Non-permanent resident City_… B.C.     Orchard_02          2       2
##  6 35-44    Immigrant              City_… B.C.     Orchard_02          5       3
##  7 25-34    Non-immigrant          City_… B.C.     Orchard_03          3       3
##  8 25-34    Immigrant              City_… B.C.     Orchard_03          1       4
##  9 25-34    Non-permanent resident City_… B.C.     Orchard_04          3       3
## 10 35-44    Non-permanent resident City_… B.C.     Orchard_04          6       1
## # ℹ 4 more variables: trt_workers <dbl>, trt_manager <dbl>, cmf_manager <dbl>,
## #   sat_work_overall <dbl>

4.3 Data key file (DUMMY — illustrative only)

Keeping the data key file well protected is essential because it is the sole mechanism that can re-identify individuals in otherwise de-identified research data, and its compromise can undermine confidentiality protections and cause harm to participants. In the UBC environment, the key file must be stored separately from research data on UBC-approved secure systems, with access restricted to authorized personnel only and never stored on personal devices or unapproved cloud services.

Here is an example of a data key file for the work described above; this is a dummy file provided for illustrative purposes only, and in real-world research settings, such files must never be shared on public systems.

File path: data_key_file/WorkerSatisfaction_data_key_DUMMY.xlsx

Sheet Contents
README Dummy disclaimer and UBC storage requirements
orchard_key Original orchard name → pseudonym code
city_key City label after suppression → pseudonym code
rare_city_suppress Cities pooled to "Other BC community"

The key holds orchard and city code mappings only — not worker names, emails, or other direct identifiers removed in step 3.1.

orchard_key <- step1 |>
  distinct(orchard_id) |>
  arrange(orchard_id) |>
  transmute(
    original_orchard = orchard_id,
    pseudonym_code   = paste0("Orchard_", sprintf("%02d", row_number()))
  )

city_key <- step3 |>
  distinct(city) |>
  arrange(city) |>
  transmute(
    city_after_suppression = city,
    pseudonym_code         = paste0("City_", sprintf("%02d", row_number()))
  )

if (length(rare_cities) > 0) {
  rare_city_key <- data.frame(
    original_city = rare_cities,
    pooled_to     = "Other BC community"
  )
} else {
  rare_city_key <- data.frame(
    original_city = "(none — all cities at or above k threshold)",
    pooled_to     = NA_character_
  )
}

key_readme <- data.frame(
  field = c(
    "File status",
    "Purpose",
    "Contains",
    "Storage requirement (UBC)"
  ),
  description = c(
    "DUMMY — illustrative example only; not for production use",
    "Reverse pseudonym codes for orchard and city in the de-identified file",
    "Orchard map, city map, rare-city suppression list — no worker names or emails",
    "Store separately from de-identified data on UBC-approved secure systems; authorized access only"
  )
)

cat("Orchard key:\n")
## Orchard key:
head(orchard_key, 10)
## # A tibble: 10 × 2
##    original_orchard    pseudonym_code
##    <chr>               <chr>         
##  1 Applejacks          Orchard_01    
##  2 Billie's Apples     Orchard_02    
##  3 Bon Appletit        Orchard_03    
##  4 Country Apples      Orchard_04    
##  5 Golden Valley Farms Orchard_05    
##  6 Lakeside Groves     Orchard_06    
##  7 Okanagan Fresh      Orchard_07    
##  8 Peak Harvest Co     Orchard_08    
##  9 Ridgeline Orchard   Orchard_09    
## 10 Sunrise Orchards    Orchard_10
cat("\nCity key:\n")
## 
## City key:
city_key
## # A tibble: 12 × 2
##    city_after_suppression pseudonym_code
##    <chr>                  <chr>         
##  1 Cobble Hill            City_01       
##  2 Kelowna                City_02       
##  3 Keremeos               City_03       
##  4 Naramata               City_04       
##  5 Oliver                 City_05       
##  6 Osoyoos                City_06       
##  7 Peachland              City_07       
##  8 Penticton              City_08       
##  9 Summerland             City_09       
## 10 Vernon                 City_10       
## 11 Victoria               City_11       
## 12 West Kelowna           City_12
dir.create(KEY_DIR, showWarnings = FALSE)

write_xlsx(
  list(
    README             = key_readme,
    orchard_key        = orchard_key,
    city_key           = city_key,
    rare_city_suppress = rare_city_key
  ),
  KEY_FILE
)

cat("Data key exported:", KEY_FILE, "\n")
## Data key exported: data_key_file/WorkerSatisfaction_data_key_DUMMY.xlsx
cat("Sheets: README, orchard_key, city_key, rare_city_suppress\n")
## Sheets: README, orchard_key, city_key, rare_city_suppress
cat("Store separately from", OUTPUT_FILE, "— never on public systems.\n")
## Store separately from WorkerSatisfaction_300rows_deidentified.xlsx — never on public systems.

Closing Summary

Stage What changed
Direct identifiers Removed worker_id, email_id, username_id, owner_id
External sources Removed comments
Dates age (DOB) → age_band (5-year groups)
Geography Rare cities pooled; all cities → City_01
Quasi-identifiers orchard_idOrchard_01 …; mappings in data_key_file/WorkerSatisfaction_data_key_DUMMY.xlsx (dummy, separate storage)

The de-identified dataset keeps all satisfaction and treatment ratings for analysis while substantially lowering re-identification risk. The data key remains the only link back to original orchard and city names — protect it accordingly.