Keys
Primary keys and foreign keys
As mentioned in the previous section, combining two data frames requires at least one key, a variable shared by those data frames.
- If a key variable or set of variables uniquely identifies each observation/row in a data frame, it is called the primary key of that data frame.
- If a key variable or set of variables in a data frame uniquely identifies each observation/row in another data frame, it is called a foreign key in the current data frame.
For example, in the following relationship diagram,
Source: Ben Brumm, A Guide to the Entity Relationship Diagram (ERD)
student_id
is the primary key (PK) in the data frame namedstudent
, because eachstudent_id
should be able to identify one and only one student.course_id
is the PK in the data frame namedcourse
. Eachcourse_id
should correspond to one and only one course (not exactly like the system at UBC).- The
course_enrollment
data frame can connect to the other two data frames because it has the two variablesstudent_id
andcourse_id
. And that’s why those two variables are marked as the foreign key (FK) in thecourse_enrollment
data frame.
Such a diagram would be excellent when we come across a new set of data frames, but it is not always available. We can identify the primary key by making sense of the data frame or examining whether a variable can uniquely identify each row.
Examining primary keys in R
According to its definition, to examine whether a variable is a primary key, we can examine whether it uniquely identifies each row. To do this in R, we can use the count()
function and see if all the n
, i.e., the number of cases associated with each value of the variable, is equal to one. For example, let’s say we make a guess that tailnum
is the primary key for the planes
data frame, then we can check if any tailnum
shows up exactly once in planes
by the following code.
Input
tailnum.count=planes %>%
count(tailnum) %>%
filter(n > 1)
The output shows zero rows, which indicates that each value of tailnum
only shows up once in the planes
table. In other words, tailnum
can serve as the primary key for the planes
table.
Output
# A tibble: 0 × 2
# … with 2 variables: tailnum <chr>, n <int>
# ℹ Use `colnames()` to see all variable names
Practice
- Identify and examine the primary keys for the other four data frames in the
nycflights13
packages? - Find an example of foreign key from those five data frames?
Once you are comfortable identifying keys, the next step is to see how they can help us join multiple data frames.