
1 Introduction to dplyr

1.1 What is dplyr?

dplyr is a part of the tidyverse meta-package that facilitates data manipulation. As with all other tidyverse packages, dplyr has extensive documentation and cheat sheets available.

Package installation and loading

To use the functions from the dplyr package, let’s first install (if needed) and load the tidyverse, which includes dplyr.

Input

if (!require(tidyverse)) { # require() returns FALSE if the package cannot be loaded
  install.packages("tidyverse") # if so, install the package
  library(tidyverse) # and then load it
}

1.2 Tidy data and pipes

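Tidy data is data that has been cleaned and is arranged in the format that analysis functions expect: each variable in its own column and each observation in its own row. Usually, it is best to have the data in a “long format”, because most tidyverse functions are written to work with data arranged that way.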

wide vs long format
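
To make the difference concrete, here is a minimal sketch that reshapes a small wide table into long format. The data frame `wide` and its columns (`student`, `test1`, `test2`) are made up for illustration, and the reshaping uses `pivot_longer()` from tidyr, another tidyverse package.

```r
library(tidyr) # part of the tidyverse; provides pivot_longer()

# Hypothetical wide data: one row per student, one column per test
wide <- data.frame(
  student = c("A", "B"),
  test1 = c(80, 90),
  test2 = c(85, 95)
)

# Long format: one row per student-test combination
long <- pivot_longer(
  wide,
  cols = c(test1, test2), # columns to stack
  names_to = "test",      # new column holding the old column names
  values_to = "score"     # new column holding the values
)

long
```

In the long version, the two test columns become a single `score` column, with a `test` column recording which test each score came from.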

The pipe operator %>% takes the thing on its left side and feeds that to its right side. You can read it as “then”.

The pipe operator works well with dplyr because most of the functions in the package share two patterns:

  • The input is a data frame, which is usually the first argument for a function.
  • The output is also a data frame.

Therefore, you can pass on the output from a function as input for the next function.
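
As a small sketch of this idea, the example below uses the built-in `mtcars` data and the base R functions `head()` and `nrow()`, chosen here only because they are simple. It shows that the pipe supplies its left side as the first argument of the function on its right.

```r
library(dplyr) # loading dplyr also makes %>% available

# These two calls are equivalent: the pipe passes mtcars as the first argument
first_rows_nested <- head(mtcars, 3)
first_rows_piped <- mtcars %>% head(3)
identical(first_rows_nested, first_rows_piped)

# head() returns a data frame, so its output can be piped into another function
n_rows <- mtcars %>%
  head(10) %>% # keep the first 10 rows, then
  nrow()       # count them
n_rows
```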

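Note: in RStudio, you can use CTRL+Shift+M (PC) or CMD+Shift+M (Mac) as a keyboard shortcut for %>%. Depending on your RStudio settings, the shortcut may instead insert the native pipe |>, which is available in R 4.1 and later and works the same way in the examples on this page.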

Getting grouped summaries is a common data exploration task, and it usually requires multiple steps. For example, the following sample code from the dplyr cheat sheet first groups the cases in the mtcars data by the cyl variable (i.e., groups the cars by number of cylinders) and then calculates the average miles per gallon (mpg) for each group.

Input

# without using pipes - operations are nested
newmtcars = summarise(group_by(mtcars, cyl), avg = mean(mpg))

# with pipes - easier to read because each command is separated and in order
newmtcars = mtcars %>% 
  group_by(cyl) %>%
  summarise(avg = mean(mpg))
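
If you want to convince yourself that the two versions do the same thing, a quick check along these lines should work. The object names `nested` and `piped` are just illustrative, and the comparison converts both results to plain data frames first.

```r
library(dplyr)

# Nested version: read from the inside out
nested <- summarise(group_by(mtcars, cyl), avg = mean(mpg))

# Piped version: read from top to bottom
piped <- mtcars %>%
  group_by(cyl) %>%
  summarise(avg = mean(mpg))

# Compare the two results as plain data frames
same_result <- isTRUE(all.equal(as.data.frame(nested), as.data.frame(piped)))
same_result

# One row per value of cyl (4, 6 and 8), with the average mpg for each
piped
```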


This page is meant to introduce the dplyr package briefly and get you ready to move on to learning its functions.