Joins

Combining two data frames seems easy if the rows are perfectly aligned. However, the order of the rows in two data frames is usually messy in reality, as shown in the following figure. That’s where keys come into place. This section introduces how to join two data frames based on the identified keys.

Source: Brad Cannell, R for Epidemiology

Mutating join

Description in R Document mutate-joins {dplyr} [Package dplyr version 1.0.10]

The mutating joins add columns from y to x, matching rows based on the keys:

inner_join(): includes all rows in x and y.

left_join(): includes all rows in x.

right_join(): includes all rows in y.

full_join(): includes all rows in x or y.

Source: Hadley Wickham & Garrett Grolemund, R for Data Science (2e)

Examples of mutating joins in R

Input

# run the following code line by line
# after run each line, look at the help tab in the right bottom
# scroll down to the examples section and run examples
?inner_join
?full_join
?left_join
?right_join

Practice

Source: Jeffrey B. Arnold, R for Data Science Exercise Solutions

What weather conditions make it more likely to see a delay?

Step 1

Join the weather and flights data frames. What type of join to use? What is the key?

Input

weather_flights <- weather %>%
  ?_join(flights, by = "")

Step 2

Select the necessary variables, weather conditions from temp to visib and dep_delay.

Input

use_weather_flights <- weather_flights %>%
  select()

Step 3

Visualize the relationship between each weather condition and departure delay.

Input

## group data to calculate mean delay per temperature
flights.sum=use_weather_flights %>%
  group_by(temp) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE))

## plot the data
ggplot(flights.sum, aes(x = temp, y = mean_delay)) +
  geom_smooth(se=F) + 
  geom_point()

Filtering join

Description in R Document filter-joins {dplyr} [Package dplyr version 1.0.10]

Filtering joins filter rows from x based on the presence or absence of matches in y:

semi_join() return all rows from x with a match in y.

anti_join() return all rows from x without a match in y.

Examples of filtering joins in R

Input

# run the following code line by line
# after run each line, look at the help tab in the right bottom
# scroll down to the examples section and run examples
?semi_join
?anti_join

Practice

Source: Jeffrey B. Arnold, R for Data Science Exercise Solutions

Find the 48 hours (over the course of the whole year) that have the worst departure delays. Cross-reference it with the weather data. Can you see any patterns?

Step 1

To find the 48 hours with the worst departure delays, group flights by hour of scheduled departure time and calculate the average delay, arrange by delay, and select the 48 observations (hours) with the highest average delay.

Input

worst_hours <- flights %>%
  mutate(hour = sched_dep_time %/% 100) %>%
  group_by(origin, time_hour) %>%
# new for R4.4.0 - need to add .groups = 'drop' to the summarise section to group by all the variables in group_by()
  summarise(dep_delay = mean(dep_delay, na.rm = TRUE), .groups = 'drop') %>%
  ungroup() %>%
# sort the data to have longest delay at the top of the dataframe
  arrange(desc(dep_delay)) %>%
# keep only 5 longst average delays for easier visualisation 
  slice(1:5)

Step 2

To get the weather for these hours, choose a type of filtering joins.

Input

weather_most_delayed <- ?_join(weather, worst_hours, by = c("origin", "time_hour"))

Step 3

To get the weather for hours except for those hours, choose a type of filtering joins.

Input

weather_less_delayed <- ?_join(weather, worst_hours, by = c("origin", "time_hour"))

Step 4

Compare the average weather conditions in the most and less delayed hours.

Input

weather_less_delayed <- weather_less_delayed %>%
  mutate(most_delayed = 'no')

weather.md.sum=weather_most_delayed %>%
  mutate(most_delayed = 'yes') %>%
  rows_append(weather_less_delayed) %>%
  group_by(most_delayed) %>%
  summarise(temp = mean(temp, na.rm = TRUE),
            dewp = mean(dewp, na.rm = TRUE),
            humid = mean(humid, na.rm = TRUE),
            wind_speed = mean(wind_speed, na.rm = TRUE),
            precip = mean(precip, na.rm = TRUE),
            visib = mean(visib, na.rm = TRUE))