Data transformation

Note: You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

Introduction

Visualisation is an important tool for generating insight, but it’s rare that you get the data in exactly the right form you need for it. Often you’ll need to create some new variables or summaries to see the most important patterns, or maybe you just want to rename the variables or reorder the observations to make the data a little easier to work with. You’ll learn how to do all that (and more!) in this chapter, which will introduce you to data transformation using the dplyr package and a new dataset on flights that departed New York City in 2013.

The goal of this chapter is to give you an overview of all the key tools for transforming a data frame. We’ll come back to these functions in more detail in later chapters, as we start to dig into specific types of data (e.g. numbers, strings, dates).

Prerequisites

In this chapter we’ll focus on the dplyr package, another core member of the tidyverse. We’ll illustrate the key ideas using data from the nycflights13 package, and use ggplot2 to help us understand the data.

library(nycflights13)
library(tidyverse)
#> ── Attaching packages ──────────────────────────────────── tidyverse 1.3.2 ──
#> ✔ ggplot2 3.4.0.9000        ✔ purrr   0.9000.0.9000
#> ✔ tibble  3.1.8             ✔ dplyr   1.0.99.9000  
#> ✔ tidyr   1.2.1.9001        ✔ stringr 1.4.1.9000   
#> ✔ readr   2.1.3             ✔ forcats 0.5.2        
#> ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()

Take careful note of the conflicts message that’s printed when you load the tidyverse. It tells you that dplyr masks some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: stats::filter() and stats::lag(). So far we’ve mostly ignored which package a function comes from because most of the time it doesn’t matter. However, knowing the package can help you find help and related functions, so when we need to be precise about which package a function comes from, we’ll use the same syntax as R: packagename::functionname().
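To make the conflict concrete, here is a minimal sketch on made-up data (the vector x and the tibble df are invented for illustration and are not part of nycflights13). Both functions are called by their full names, so the result should not depend on which packages happen to be attached:

```r
library(dplyr)

# Toy inputs, invented for illustration
x <- c(1, 2, 3, 4, 5)
df <- tibble(x = x)

# stats::filter() applies a linear filter to a series of numbers;
# here, a centred 3-point moving average
moving_avg <- stats::filter(x, rep(1 / 3, 3))

# dplyr::filter() keeps the rows of a data frame that satisfy a condition
big_rows <- dplyr::filter(df, x > 3)

moving_avg
big_rows
```

The two functions share a name but do unrelated jobs, which is why the conflicts message is worth reading.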

nycflights13

To explore the basic dplyr verbs, we’re going to use nycflights13::flights. This dataset contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics, and is documented in ?flights.

flights
#> # A tibble: 336,776 × 19
#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
#> 1  2013     1     1      517      515       2     830     819      11 UA     
#> 2  2013     1     1      533      529       4     850     830      20 UA     
#> 3  2013     1     1      542      540       2     923     850      33 AA     
#> 4  2013     1     1      544      545      -1    1004    1022     -18 B6     
#> 5  2013     1     1      554      600      -6     812     837     -25 DL     
#> 6  2013     1     1      554      558      -4     740     728      12 UA     
#> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

If you’ve used R before, you might notice that this data frame prints a little differently to other data frames you’ve seen. That’s because it’s a tibble, a special type of data frame used by the tidyverse to avoid some common gotchas. The most important difference is the way it prints: tibbles are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen. To see every column in the console you can use print(flights, width = Inf), but it’s generally more convenient to use View(flights) to open the dataset in the scrollable RStudio viewer.

You might have noticed the short abbreviations that follow each column name. These tell you the type of each variable: <int> is short for integer, <dbl> is short for double (aka real numbers), <chr> for character (aka strings), and <dttm> for date-time. These are important because the operations you can perform on a column depend so much on its “type”, and these types are used to organize the chapters in the next section of the book.
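As a rough illustration of these abbreviations, the sketch below builds a tiny tibble with one column of each type. The tibble toy and its values are made up; only the column names echo flights. glimpse() is used here as one convenient way to list each column with its type.

```r
library(dplyr)

# A made-up tibble with one column of each type mentioned above
toy <- tibble(
  flight    = c(1545L, 1714L),      # <int>: the L suffix creates integers
  dep_delay = c(2, 4),              # <dbl>: plain numbers are doubles
  carrier   = c("UA", "AA"),        # <chr>: character strings
  time_hour = as.POSIXct(           # <dttm>: date-times
    c("2013-01-01 05:00:00", "2013-01-01 06:00:00"),
    tz = "UTC"
  )
)

toy           # the abbreviations appear under each column name
glimpse(toy)  # a transposed view: one line per column, with its type
```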

dplyr basics

You’re about to learn the primary dplyr verbs which will allow you to solve the vast majority of your data manipulation challenges. But before we discuss their individual differences, it’s worth stating what they have in common:

  1. The first argument is always a data frame.

  2. The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).

  3. The result is always a new data frame.

Because the first argument is a data frame and the output is a data frame, dplyr verbs work well with the pipe, |>. The pipe takes the thing on its left and passes it along to the function on its right so that x |> f(y) is equivalent to f(x, y), and x |> f(y) |> g(z) is equivalent to g(f(x, y), z). The easiest way to pronounce the pipe is “then”. That makes it possible to get a sense of the following code even though you haven’t yet learned the details:

flights |>
  filter(dest == "IAH") |> 
  group_by(year, month, day) |> 
  summarize(
    arr_delay = mean(arr_delay, na.rm = TRUE)
  )

The code starts with the flights dataset, then filters it, then groups it, then summarizes it. We’ll come back to the pipe and its alternatives in #sec-pipes.
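If the equivalence between x |> f(y) and f(x, y) feels abstract, the following sketch writes the same computation both ways on a small made-up tibble (toy and its values are invented for illustration) and checks that the results match:

```r
library(dplyr)

# Toy data, invented for illustration
toy <- tibble(
  dest      = c("IAH", "IAH", "MIA"),
  month     = c(1L, 1L, 2L),
  arr_delay = c(10, NA, 5)
)

# Piped: read top to bottom as "toy, then filter, then summarize"
piped <- toy |>
  filter(dest == "IAH") |>
  summarize(arr_delay = mean(arr_delay, na.rm = TRUE))

# Nested: the same calls, read from the inside out
nested <- summarize(
  filter(toy, dest == "IAH"),
  arr_delay = mean(arr_delay, na.rm = TRUE)
)

identical(piped, nested)
```

Both forms do the same work; the piped version simply lists the steps in the order they happen.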

dplyr’s verbs are organised into four groups based on what they operate on: rows, columns, groups, or tables. In the following sections you’ll learn the most important verbs for rows, columns, and groups, then we’ll come back to the verbs that work on tables in #chp-joins. Let’s dive in!

Rows

The most important verbs that operate on rows are filter(), which changes which rows are present without changing their order, and arrange(), which changes the order of the rows without changing which are present. Both functions only affect the rows, and the columns are left unchanged.

filter()

filter() allows you to keep rows based on the values of the columns. (Later, you’ll learn about the slice_*() family, which allows you to choose rows based on their positions.) The first argument is the data frame. The second and subsequent arguments are the conditions that must be true to keep the row. For example, we could find all flights that arrived more than 120 minutes (two hours) late:

flights |> 
  filter(arr_delay > 120)
#> # A tibble: 10,034 × 19
#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
#> 1  2013     1     1      811      630     101    1047     830     137 MQ     
#> 2  2013     1     1      848     1835     853    1001    1950     851 MQ     
#> 3  2013     1     1      957      733     144    1056     853     123 UA     
#> 4  2013     1     1     1114      900     134    1447    1222     145 UA     
#> 5  2013     1     1     1505     1310     115    1638    1431     127 EV     
#> 6  2013     1     1     1525     1340     105    1831    1626     125 B6     
#> # … with 10,028 more rows, 9 more variables: flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

As well as > (greater than), you can use >= (greater than or equal to), < (less than), <= (less than or equal to), == (equal to), and != (not equal to). You can also use & (and) or | (or) to combine multiple conditions:

# Flights that departed on January 1
flights |> 
  filter(month == 1 & day == 1)
#> # A tibble: 842 × 19
#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
#> 1  2013     1     1      517      515       2     830     819      11 UA     
#> 2  2013     1     1      533      529       4     850     830      20 UA     
#> 3  2013     1     1      542      540       2     923     850      33 AA     
#> 4  2013     1     1      544      545      -1    1004    1022     -18 B6     
#> 5  2013     1     1      554      600      -6     812     837     -25 DL     
#> 6  2013     1     1      554      558      -4     740     728      12 UA     
#> # … with 836 more rows, 9 more variables: flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

# Flights that departed in January or February
flights |> 
  filter(month == 1 | month == 2)
#> # A tibble: 51,955 × 19
#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
#> 1  2013     1     1      517      515       2     830     819      11 UA     
#> 2  2013     1     1      533      529       4     850     830      20 UA     
#> 3  2013     1     1      542      540       2     923     850      33 AA     
#> 4  2013     1     1      544      545      -1    1004    1022     -18 B6     
#> 5  2013     1     1      554      600      -6     812     837     -25 DL     
#> 6  2013     1     1      554      558      -4     740     728      12 UA     
#> # … with 51,949 more rows, 9 more variables: flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

There’s a useful shortcut when you’re combining | and ==: %in%. It keeps rows where the variable equals one of the values on the right:

# A shorter way to select flights that departed in January or February
flights |> 
  filter(month %in% c(1, 2))
#> # A tibble: 51,955 × 19
#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
#> 1  2013     1     1      517      515       2     830     819      11 UA     
#> 2  2013     1     1      533      529       4     850     830      20 UA     
#> 3  2013     1     1      542      540       2     923     850      33 AA     
#> 4  2013     1     1      544      545      -1    1004    1022     -18 B6     
#> 5  2013     1     1      554      600      -6     812     837     -25 DL     
#> 6  2013     1     1      554      558      -4     740     728      12 UA     
#> # … with 51,949 more rows, 9 more variables: flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

We’ll come back to these comparisons and logical operators in more detail in #chp-logicals.
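As a quick check that %in% really is a shortcut for chained | and == tests, this sketch filters a small made-up tibble (toy is invented for illustration) both ways and compares the results:

```r
library(dplyr)

# Toy data, invented for illustration
toy <- tibble(month = c(1L, 2L, 3L, 12L))

# Two spellings of "month is 1 or 2"
with_or <- toy |> filter(month == 1 | month == 2)
with_in <- toy |> filter(month %in% c(1, 2))

identical(with_or, with_in)

# != keeps everything except the matching rows
not_jan <- toy |> filter(month != 1)
not_jan
```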

When you run filter(), dplyr executes the filtering operation, creating a new data frame, and then prints it. It doesn’t modify the existing flights dataset because dplyr functions never modify their inputs. To save the result, you need to use the assignment operator, <-:

jan1 <- flights |> 
  filter(month == 1 & day == 1)
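The same behaviour can be seen on a small scale. In this sketch (toy and toy_jan1 are made-up names), filtering without assignment leaves the input untouched, and the result is only kept once it is assigned:

```r
library(dplyr)

# Toy data, invented for illustration
toy <- tibble(month = c(1L, 1L, 2L), day = c(1L, 2L, 1L))

# Running filter() on its own prints the result but leaves `toy` alone
toy |> filter(month == 1 & day == 1)
nrow(toy)  # still 3

# Assign the result to keep it
toy_jan1 <- toy |> filter(month == 1 & day == 1)
nrow(toy_jan1)
```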

Common mistakes

When you’re starting out with R, the easiest mistake to make is to use = instead of == when testing for equality. filter() will let you know when this happens:

flights |> 
  filter(month = 1)
#> Error in `filter()`:
#> ! We detected a named input.
#> ℹ This usually means that you've used `=` instead of `==`.
#> ℹ Did you mean `month == 1`?

Another mistake is to write “or” statements like you would in English:

flights |> 
  filter(month == 1 | 2)

This works, in the sense that it doesn’t throw an error, but it doesn’t do what you want. We’ll come back to what it does and why in #sec-boolean-operations.
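Until then, a small experiment shows the symptom. The sketch below uses a made-up tibble (toy is invented for illustration); the full explanation is left to the later chapter.

```r
library(dplyr)

# Toy data, invented for illustration
toy <- tibble(month = c(1L, 2L, 3L))

# R reads this as (month == 1) | 2. The bare 2 is treated as TRUE,
# so the condition is TRUE for every row and nothing is filtered out.
wrong <- toy |> filter(month == 1 | 2)

# What was meant: compare month against each value in turn
right <- toy |> filter(month == 1 | month == 2)

nrow(wrong)
nrow(right)
```

Comparing row counts before and after a filter is a cheap way to catch this kind of silent mistake.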

arrange()

arrange() changes the order of the rows based on the value of the columns. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. For example, the following code sorts by the departure time, which is spread over four columns.

flights |> 
  arrange(year, month, day, dep_time)
#> # A tibble: 336,776 × 19
#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
#> 1  2013     1     1      517      515       2     830     819      11 UA     
#> 2  2013     1     1      533      529       4     850     830      20 UA     
#> 3  2013     1     1      542      540       2     923     850      33 AA     
#> 4  2013     1     1      544      545      -1    1004    1022     -18 B6     
#> 5  2013     1     1      554      600      -6     812     837     -25 DL     
#> 6  2013     1     1      554      558      -4     740     728      12 UA     
#> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

You can use desc() on a column inside arrange() to sort by that column in descending order. For example, this code orders flights from the longest departure delay to the shortest:

flights |> 
  arrange(desc(dep_delay))
#> # A tibble: 336,776 × 19
#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
#> 1  2013     1     9      641      900    1301    1242    1530    1272 HA     
#> 2  2013     6    15     1432     1935    1137    1607    2120    1127 MQ     
#> 3  2013     1    10     1121     1635    1126    1239    1810    1109 MQ     
#> 4  2013     9    20     1139     1845    1014    1457    2210    1007 AA     
#> 5  2013     7    22      845     1600    1005    1044    1815     989 MQ     
#> 6  2013     4    10     1100     1900     960    1342    2211     931 DL     
#> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

You can combine arrange() and filter() to solve more complex problems. For example, we could look for the flights that were most delayed on arrival but left roughly on time:

flights |> 
  filter(dep_delay <= 10 & dep_delay >= -10) |> 
  arrange(desc(arr_delay))
#> # A tibble: 239,109 × 19
#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
#> 1  2013    11     1      658      700      -2    1329    1015     194 VX     
#> 2  2013     4    18      558      600      -2    1149     850     179 AA     
#> 3  2013     7     7     1659     1700      -1    2050    1823     147 US     
#> 4  2013     7    22     1606     1615      -9    2056    1831     145 DL     
#> 5  2013     9    19      648      641       7    1035     810     145 UA     
#> 6  2013     4    18      655      700      -5    1213     950     143 AA     
#> # … with 239,103 more rows, 9 more variables: flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
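With 336,776 rows it is hard to see tie-breaking at work, so here is a sketch on four made-up flights (the tibble toy, its flight labels, and its values are invented for illustration). It shows ties being broken by a second column, desc() applied to one column, and filter() combined with arrange():

```r
library(dplyr)

# Toy data, invented for illustration
toy <- tibble(
  flight    = c("A", "B", "C", "D"),
  day       = c(2L, 1L, 1L, 2L),
  dep_delay = c(5, 30, -3, 60),
  arr_delay = c(40, 25, 10, 55)
)

# Sort by day; rows with the same day are ordered by dep_delay
by_day <- toy |> arrange(day, dep_delay)
by_day

# desc() reverses the order for that one column only
by_day_desc <- toy |> arrange(day, desc(dep_delay))
by_day_desc

# filter() then arrange(): roughly on-time departures, worst arrivals first
late_arrivals <- toy |>
  filter(dep_delay <= 10 & dep_delay >= -10) |>
  arrange(desc(arr_delay))
late_arrivals
```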

Exercises

  1. Find all flights that

    1. Had an arrival delay of two or more hours
    2. Flew to Houston (IAH or HOU)
    3. Were operated by United, American, or Delta
    4. Departed in summer (July, August, and September)
    5. Arrived more than two hours late, but didn’t leave late
    6. Were delayed by at least an hour, but made up over 30 minutes in flight
  2. Sort flights to find the flights with longest departure delays. Find the flights that left earliest in the morning.

  3. Sort flights to find the fastest flights (Hint: try sorting by a calculation).

  4. Which flights traveled the farthest? Which traveled the shortest?

  5. Does it matter what order you use filter() and arrange() in if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.

Columns

There are four important verbs that affect the columns without changing the rows: mutate(), select(), rename(), and relocate(). mutate() creates new columns that are functions of the existing columns; select(), rename(), and relocate() change which columns are present, their names, or their positions.

mutate()

The job of mutate() is to add new columns that are calculated from the existing columns. In the transform chapters, you’ll learn a large set of functions that you can use to manipulate different types of variables. For now, we’ll stick with basic algebra, which allows us to compute the gain, how much time a delayed flight made up in the air, and the speed in miles per hour:

flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60
  )
#> # A tibble: 336,776 × 21
#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
#> 1  2013     1     1      517      515       2     830     819      11 UA     
#> 2  2013     1     1      533      529       4     850     830      20 UA     
#> 3  2013     1     1      542      540       2     923     850      33 AA     
#> 4  2013     1     1      544      545      -1    1004    1022     -18 B6     
#> 5  2013     1     1      554      600      -6     812     837     -25 DL     
#> 6  2013     1     1      554      558      -4     740     728      12 UA     
#> # … with 336,770 more rows, 11 more variables: flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, gain <dbl>, speed <dbl>, and abbreviated
#> #   variable names ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time,
#> #   ⁵​arr_delay

By default, mutate() adds new columns on the right hand side of your dataset, which makes it difficult to see what’s happening here. (Remember that in RStudio, the easiest way to see a dataset with many columns is View().) We can use the .before argument to instead add the variables to the left hand side:

flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .before = 1
  )
#> # A tibble: 336,776 × 21
#>    gain speed  year month   day dep_time sched_dep_…¹ dep_d…² arr_t…³ sched…⁴
#>   <dbl> <dbl> <int> <int> <int>    <int>        <int>   <dbl>   <int>   <int>
#> 1    -9  370.  2013     1     1      517          515       2     830     819
#> 2   -16  374.  2013     1     1      533          529       4     850     830
#> 3   -31  408.  2013     1     1      542          540       2     923     850
#> 4    17  517.  2013     1     1      544          545      -1    1004    1022
#> 5    19  394.  2013     1     1      554          600      -6     812     837
#> 6   -16  288.  2013     1     1      554          558      -4     740     728
#> # … with 336,770 more rows, 11 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>, and abbreviated variable names ¹​sched_dep_time,
#> #   ²​dep_delay, ³​arr_time, ⁴​sched_arr_time

The . is a sign that .before is an argument to the function, not the name of a new variable. You can also use .after to add after a variable, and in both .before and .after you can use a variable name instead of a position. For example, we could add the new variables after day:

flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .after = day
  )
#> # A tibble: 336,776 × 21
#>    year month   day  gain speed dep_time sched_dep_…¹ dep_d…² arr_t…³ sched…⁴
#>   <int> <int> <int> <dbl> <dbl>    <int>        <int>   <dbl>   <int>   <int>
#> 1  2013     1     1    -9  370.      517          515       2     830     819
#> 2  2013     1     1   -16  374.      533          529       4     850     830
#> 3  2013     1     1   -31  408.      542          540       2     923     850
#> 4  2013     1     1    17  517.      544          545      -1    1004    1022
#> 5  2013     1     1    19  394.      554          600      -6     812     837
#> 6  2013     1     1   -16  288.      554          558      -4     740     728
#> # … with 336,770 more rows, 11 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>, and abbreviated variable names ¹​sched_dep_time,
#> #   ²​dep_delay, ³​arr_time, ⁴​sched_arr_time

Alternatively, you can control which variables are kept with the .keep argument. A particularly useful value is "used", which keeps only the columns involved in your calculations so you can see their inputs and outputs side by side:

flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    hours = air_time / 60,
    gain_per_hour = gain / hours,
    .keep = "used"
  )
#> # A tibble: 336,776 × 6
#>   dep_delay arr_delay air_time  gain hours gain_per_hour
#>       <dbl>     <dbl>    <dbl> <dbl> <dbl>         <dbl>
#> 1         2        11      227    -9  3.78         -2.38
#> 2         4        20      227   -16  3.78         -4.23
#> 3         2        33      160   -31  2.67        -11.6 
#> 4        -1       -18      183    17  3.05          5.57
#> 5        -6       -25      116    19  1.93          9.83
#> 6        -4        12      150   -16  2.5          -6.4 
#> # … with 336,770 more rows
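The placement arguments are easier to compare side by side on a narrow table. This sketch uses a made-up three-column tibble (toy and the result names are invented for illustration) and inspects only the resulting column names:

```r
library(dplyr)

# Toy data, invented for illustration
toy <- tibble(
  dep_delay = c(2, 4),
  arr_delay = c(11, 20),
  air_time  = c(227, 160)
)

at_end    <- toy |> mutate(gain = dep_delay - arr_delay)
at_front  <- toy |> mutate(gain = dep_delay - arr_delay, .before = 1)
after_dep <- toy |> mutate(gain = dep_delay - arr_delay, .after = dep_delay)
used_only <- toy |> mutate(gain = dep_delay - arr_delay, .keep = "used")

names(at_end)     # gain is added as the last column
names(at_front)   # gain is the first column
names(after_dep)  # gain sits right after dep_delay
names(used_only)  # air_time is dropped because it wasn't used
```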

select()

It’s not uncommon to get datasets with hundreds or even thousands of variables. In this situation, the first challenge is often just focusing on the variables you’re interested in. select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables. select() is not terribly useful with the flights data because we only have 19 variables, but you can still get the general idea of how it works:

# Select columns by name
flights |> 
  select(year, month, day)
#> # A tibble: 336,776 × 3
#>    year month   day
#>   <int> <int> <int>
#> 1  2013     1     1
#> 2  2013     1     1
#> 3  2013     1     1
#> 4  2013     1     1
#> 5  2013     1     1
#> 6  2013     1     1
#> # … with 336,770 more rows

# Select all columns between year and day (inclusive)
flights |> 
  select(year:day)
#> # A tibble: 336,776 × 3
#>    year month   day
#>   <int> <int> <int>
#> 1  2013     1     1
#> 2  2013     1     1
#> 3  2013     1     1
#> 4  2013     1     1
#> 5  2013     1     1
#> 6  2013     1     1
#> # … with 336,770 more rows

# Select all columns except those from year to day (inclusive)
flights |> 
  select(!year:day)
#> # A tibble: 336,776 × 16
#>   dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier flight tailnum
#>      <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>    <int> <chr>  
#> 1      517         515       2     830     819      11 UA        1545 N14228 
#> 2      533         529       4     850     830      20 UA        1714 N24211 
#> 3      542         540       2     923     850      33 AA        1141 N619AA 
#> 4      544         545      -1    1004    1022     -18 B6         725 N804JB 
#> 5      554         600      -6     812     837     -25 DL         461 N668DN 
#> 6      554         558      -4     740     728      12 UA        1696 N39463 
#> # … with 336,770 more rows, 7 more variables: origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>, and abbreviated variable names ¹​sched_dep_time,
#> #   ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

# Select all columns that are characters
flights |> 
  select(where(is.character))
#> # A tibble: 336,776 × 4
#>   carrier tailnum origin dest 
#>   <chr>   <chr>   <chr>  <chr>
#> 1 UA      N14228  EWR    IAH  
#> 2 UA      N24211  LGA    IAH  
#> 3 AA      N619AA  JFK    MIA  
#> 4 B6      N804JB  JFK    BQN  
#> 5 DL      N668DN  LGA    ATL  
#> 6 UA      N39463  EWR    ORD  
#> # … with 336,770 more rows

There are a number of helper functions you can use within select():

  • starts_with("abc"): matches names that begin with “abc”.
  • ends_with("xyz"): matches names that end with “xyz”.
  • contains("ijk"): matches names that contain “ijk”.
  • num_range("x", 1:3): matches x1, x2 and x3.

See ?select for more details. Once you know regular expressions (the topic of #chp-regexps) you’ll also be able to use matches() to select variables that match a pattern.
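The helpers listed above can be tried out on a small made-up tibble. In this sketch, toy and its columns x1 to x3 are invented for illustration; the other column names echo flights.

```r
library(dplyr)

# Toy data, invented for illustration
toy <- tibble(
  x1 = 1, x2 = 2, x3 = 3,
  dep_time = 517L, arr_time = 830L, dep_delay = 2
)

dep_cols  <- toy |> select(starts_with("dep"))     # dep_time, dep_delay
time_cols <- toy |> select(ends_with("time"))      # dep_time, arr_time
under     <- toy |> select(contains("_"))          # every name with an underscore
x_cols    <- toy |> select(num_range("x", 1:2))    # x1, x2

names(dep_cols)
names(time_cols)
names(under)
names(x_cols)
```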

You can rename variables as you select() them by using =. The new name appears on the left hand side of the =, and the old variable appears on the right hand side:

flights |> 
  select(tail_num = tailnum)
#> # A tibble: 336,776 × 1
#>   tail_num
#>   <chr>   
#> 1 N14228  
#> 2 N24211  
#> 3 N619AA  
#> 4 N804JB  
#> 5 N668DN  
#> 6 N39463  
#> # … with 336,770 more rows

rename()

If you want to keep all the existing variables and rename just a few, you can use rename() instead of select():

flights |> 
  rename(tail_num = tailnum)
#> # A tibble: 336,776 × 19
#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
#> 1  2013     1     1      517      515       2     830     819      11 UA     
#> 2  2013     1     1      533      529       4     850     830      20 UA     
#> 3  2013     1     1      542      540       2     923     850      33 AA     
#> 4  2013     1     1      544      545      -1    1004    1022     -18 B6     
#> 5  2013     1     1      554      600      -6     812     837     -25 DL     
#> 6  2013     1     1      554      558      -4     740     728      12 UA     
#> # … with 336,770 more rows, 9 more variables: flight <int>, tail_num <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

It works exactly the same way as select(), but keeps all the variables that aren’t explicitly selected.
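The difference between the two verbs is clearest when they are given the same renaming on the same data. This sketch uses a made-up one-row tibble (toy is invented for illustration):

```r
library(dplyr)

# Toy data, invented for illustration
toy <- tibble(year = 2013L, tailnum = "N14228", dest = "IAH")

renamed  <- toy |> rename(tail_num = tailnum)  # keeps every column
selected <- toy |> select(tail_num = tailnum)  # keeps only the named column

names(renamed)
names(selected)
```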

If you have a bunch of inconsistently named columns and it would be painful to fix them all by hand, check out janitor::clean_names() which provides some useful automated cleaning.

relocate()

Use relocate() to move variables around. You might want to collect related variables together or move important variables to the front. By default relocate() moves variables to the front:

flights |> 
  relocate(time_hour, air_time)
#> # A tibble: 336,776 × 19
#>   time_hour           air_time  year month   day dep_time sched_dep…¹ dep_d…²
#>   <dttm>                 <dbl> <int> <int> <int>    <int>       <int>   <dbl>
#> 1 2013-01-01 05:00:00      227  2013     1     1      517         515       2
#> 2 2013-01-01 05:00:00      227  2013     1     1      533         529       4
#> 3 2013-01-01 05:00:00      160  2013     1     1      542         540       2
#> 4 2013-01-01 05:00:00      183  2013     1     1      544         545      -1
#> 5 2013-01-01 06:00:00      116  2013     1     1      554         600      -6
#> 6 2013-01-01 05:00:00      150  2013     1     1      554         558      -4
#> # … with 336,770 more rows, 11 more variables: arr_time <int>,
#> #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, and abbreviated variable names ¹​sched_dep_time, ²​dep_delay

But you can use the same .before and .after arguments as mutate() to choose where to put them:

flights |> 
  relocate(year:dep_time, .after = time_hour)
#> # A tibble: 336,776 × 19
#>   sched…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier flight tailnum origin dest 
#>     <int>   <dbl>   <int>   <int>   <dbl> <chr>    <int> <chr>   <chr>  <chr>
#> 1     515       2     830     819      11 UA        1545 N14228  EWR    IAH  
#> 2     529       4     850     830      20 UA        1714 N24211  LGA    IAH  
#> 3     540       2     923     850      33 AA        1141 N619AA  JFK    MIA  
#> 4     545      -1    1004    1022     -18 B6         725 N804JB  JFK    BQN  
#> 5     600      -6     812     837     -25 DL         461 N668DN  LGA    ATL  
#> 6     558      -4     740     728      12 UA        1696 N39463  EWR    ORD  
#> # … with 336,770 more rows, 9 more variables: air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, year <int>,
#> #   month <int>, day <int>, dep_time <int>, and abbreviated variable names
#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

flights |> 
  relocate(starts_with("arr"), .before = dep_time)
#> # A tibble: 336,776 × 19
#>    year month   day arr_time arr_de…¹ dep_t…² sched…³ dep_d…⁴ sched…⁵ carrier
#>   <int> <int> <int>    <int>    <dbl>   <int>   <int>   <dbl>   <int> <chr>  
#> 1  2013     1     1      830       11     517     515       2     819 UA     
#> 2  2013     1     1      850       20     533     529       4     830 UA     
#> 3  2013     1     1      923       33     542     540       2     850 AA     
#> 4  2013     1     1     1004      -18     544     545      -1    1022 B6     
#> 5  2013     1     1      812      -25     554     600      -6     837 DL     
#> 6  2013     1     1      740       12     554     558      -4     728 UA     
#> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> #   ¹​arr_delay, ²​dep_time, ³​sched_dep_time, ⁴​dep_delay, ⁵​sched_arr_time
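Because the flights output is too wide to show every column, the effect of relocate() can be hard to follow. The sketch below repeats the three patterns on a made-up five-column tibble (toy is invented for illustration) and prints only the column order:

```r
library(dplyr)

# Toy data, invented for illustration
toy <- tibble(
  year = 2013L, dep_time = 517L, arr_time = 830L,
  arr_delay = 11, carrier = "UA"
)

to_front  <- toy |> relocate(carrier)                         # default: move to the front
to_after  <- toy |> relocate(year, .after = carrier)          # place after a named column
to_before <- toy |> relocate(starts_with("arr"), .before = dep_time)

names(to_front)
names(to_after)
names(to_before)
```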

Exercises

  1. Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?

  2. Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?

  3. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

  4. What happens if you include the name of a variable multiple times in a select() call?

  5. What does the any_of() function do? Why might it be helpful in conjunction with this vector?

    variables <- c("year", "month", "day", "dep_delay", "arr_delay")

  6. Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

    select(flights, contains("TIME"))

Groups

So far you’ve learned about functions that work with rows and columns. dplyr gets even more powerful when you add in the ability to work with groups. In this section, we’ll focus on the most important functions: group_by(), summarize(), and the slice family of functions.

group_by()

Use group_by() to divide your dataset into groups meaningful for your analysis:

flights |> 
  group_by(month)
#> # A tibble: 336,776 × 19
#> # Groups:   month [12]
#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
#> 1  2013     1     1      517      515       2     830     819      11 UA     
#> 2  2013     1     1      533      529       4     850     830      20 UA     
#> 3  2013     1     1      542      540       2     923     850      33 AA     
#> 4  2013     1     1      544      545      -1    1004    1022     -18 B6     
#> 5  2013     1     1      554      600      -6     812     837     -25 DL     
#> 6  2013     1     1      554      558      -4     740     728      12 UA     
#> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

group_by() doesn’t change the data but, if you look closely at the output, you’ll notice that it’s now “grouped by” month. This means subsequent operations will now work “by month”.
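To see what “grouped by” means in practice, here is a minimal sketch on a made-up three-row tibble (the `toy` data and its values are hypothetical, not taken from flights). It uses dplyr’s group_vars() and n_groups() to inspect the grouping that group_by() records:

```r
library(dplyr)

# Hypothetical toy data: three flights across two months
toy <- tibble(
  month = c(1, 1, 2),
  dep_delay = c(5, 15, 30)
)

by_month <- toy |> 
  group_by(month)

# The number of rows is unchanged ...
nrow(by_month)
#> [1] 3

# ... but the tibble now records its grouping variable and group count
group_vars(by_month)
#> [1] "month"
n_groups(by_month)
#> [1] 2
```

The grouping only takes effect when you apply a later verb, such as summarize(), to the grouped tibble.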

summarize()

The most important grouped operation is a summary, which collapses each group to a single row. (This is a slight simplification; later on you’ll learn how to use summarize() to produce multiple summary rows for each group.) Here we compute the average departure delay by month:

flights |> 
  group_by(month) |> 
  summarize(
    delay = mean(dep_delay)
  )
#> # A tibble: 12 × 2
#>   month delay
#>   <int> <dbl>
#> 1     1    NA
#> 2     2    NA
#> 3     3    NA
#> 4     4    NA
#> 5     5    NA
#> 6     6    NA
#> # … with 6 more rows

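Uh-oh! Something has gone wrong, and all of our results are NA (pronounced “N-A”), R’s symbol for a missing value. This happens because some flights have a missing dep_delay, and the mean of a set of values that includes a missing value is itself missing. We’ll come back to discuss missing values in the chapter on missing values, but for now we’ll remove them by using na.rm = TRUE: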

flights |> 
  group_by(month) |> 
  summarize(
    delay = mean(dep_delay, na.rm = TRUE)
  )
#> # A tibble: 12 × 2
#>   month delay
#>   <int> <dbl>
#> 1     1  10.0
#> 2     2  10.8
#> 3     3  13.2
#> 4     4  13.9
#> 5     5  13.0
#> 6     6  20.8
#> # … with 6 more rows
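The NA results above follow from how base R propagates missing values through mean(). A small sketch with a hypothetical vector of delays (the values are made up for illustration) shows the behavior in isolation:

```r
# Hypothetical delays in minutes; one value is missing
delays <- c(2, 4, NA, -6)

# A single NA makes the whole mean missing
mean(delays)
#> [1] NA

# na.rm = TRUE drops the missing value before averaging
mean(delays, na.rm = TRUE)
#> [1] 0
```

Note that na.rm = TRUE silently changes how many values contribute to the mean, which is one reason to report a count alongside any summary.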

You can create any number of summaries in a single call to summarize(). You’ll learn various useful summaries in the upcoming chapters, but one very useful summary is n(), which returns the number of rows in each group:

flights |> 
  group_by(month) |> 
  summarize(
    delay = mean(dep_delay, na.rm = TRUE), 
    n = n()
  )
#> # A tibble: 12 × 3
#>   month delay     n
#>   <int> <dbl> <int>
#> 1     1  10.0 27004
#> 2     2  10.8 24951
#> 3     3  13.2 28834
#> 4     4  13.9 28330
#> 5     5  13.0 28796
#> 6     6  20.8 28243
#> # … with 6 more rows

Means and counts can get you a surprisingly long way in data science!

The slice_ functions

There are five handy functions that allow you to pick off specific rows within each group:

  • df |> slice_head(n = 1) takes the first row from each group.
  • df |> slice_tail(n = 1) takes the last row in each group.
  • df |> slice_min(x, n = 1) takes the row with the smallest value of x.
  • df |> slice_max(x, n = 1) takes the row with the largest value of x.
  • df |> slice_sample(n = 1) takes one random row.

You can vary n to select more than one row, or instead of n =, you can use prop = 0.1 to select (e.g.) 10% of the rows in each group. For example, the following code finds the most delayed flight to each destination:

flights |> 
  group_by(dest) |> 
  slice_max(arr_delay, n = 1)
#> # A tibble: 108 × 19
#> # Groups:   dest [105]
#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
#> 1  2013     7    22     2145     2007      98     132    2259     153 B6     
#> 2  2013     7    23     1139      800     219    1250     909     221 B6     
#> 3  2013     1    25      123     2000     323     229    2101     328 EV     
#> 4  2013     8    17     1740     1625      75    2042    2003      39 UA     
#> 5  2013     7    22     2257      759     898     121    1026     895 DL     
#> 6  2013     7    10     2056     1505     351    2347    1758     349 UA     
#> # … with 102 more rows, 9 more variables: flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

This is similar to computing the max delay with summarize(), but you get the whole row instead of the single summary:

flights |> 
  group_by(dest) |> 
  summarize(max_delay = max(arr_delay, na.rm = TRUE))
#> Warning: There was 1 warning in `summarize()`.
#> ℹ In argument `max_delay = max(arr_delay, na.rm = TRUE)`.
#> ℹ In group 52: `dest = "LGA"`.
#> Caused by warning in `max()`:
#> ! no non-missing arguments to max; returning -Inf
#> # A tibble: 105 × 2
#>   dest  max_delay
#>   <chr>     <dbl>
#> 1 ABQ         153
#> 2 ACK         221
#> 3 ALB         328
#> 4 ANC          39
#> 5 ATL         895
#> 6 AUS         349
#> # … with 99 more rows
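You may have noticed that the slice_max() result above has 108 rows for only 105 destinations. The most likely explanation is that slice_min() and slice_max() keep tied rows by default (their with_ties argument defaults to TRUE), so n = 1 can return more than one row per group. The sketch below illustrates this on a hypothetical toy tibble rather than on flights:

```r
library(dplyr)

# Hypothetical data: two ABQ flights are tied for the largest arrival delay
toy <- tibble(
  dest = c("ABQ", "ABQ", "ABQ", "ACK", "ACK"),
  arr_delay = c(153, 153, 10, 221, 5)
)

# Default behavior: ties are kept, so ABQ contributes two rows
with_ties <- toy |> 
  group_by(dest) |> 
  slice_max(arr_delay, n = 1)
nrow(with_ties)
#> [1] 3

# with_ties = FALSE returns exactly n rows per group
no_ties <- toy |> 
  group_by(dest) |> 
  slice_max(arr_delay, n = 1, with_ties = FALSE)
nrow(no_ties)
#> [1] 2
```

Whether to keep ties depends on the question: keep them if every flight with the maximum delay matters, drop them if you need exactly one row per group.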

Grouping by multiple variables

You can create groups using more than one variable. For example, we could make a group for each day:

daily <- flights |>  
  group_by(year, month, day)
daily
#> # A tibble: 336,776 × 19
#> # Groups:   year, month, day [365]
#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
#> 1  2013     1     1      517      515       2     830     819      11 UA     
#> 2  2013     1     1      533      529       4     850     830      20 UA     
#> 3  2013     1     1      542      540       2     923     850      33 AA     
#> 4  2013     1     1      544      545      -1    1004    1022     -18 B6     
#> 5  2013     1     1      554      600      -6     812     837     -25 DL     
#> 6  2013     1     1      554      558      -4     740     728      12 UA     
#> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

When you summarize a tibble grouped by more than one variable, each summary peels off the last group. In hindsight, this wasn’t a great way to make this function work, but it’s difficult to change without breaking existing code. To make it obvious what’s happening, dplyr displays a message that tells you how you can change this behavior:

daily_flights <- daily |> 
  summarize(
    n = n()
  )
#> `summarise()` has grouped output by 'year', 'month'. You can override using
#> the `.groups` argument.

If you’re happy with this behavior, you can explicitly request it in order to suppress the message:

daily_flights <- daily |> 
  summarize(
    n = n(), 
    .groups = "drop_last"
  )

Alternatively, change the default behavior by setting a different value, e.g. "drop" to drop all grouping or "keep" to preserve the same groups.
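To compare the three .groups options side by side, here is a small sketch on a hypothetical four-row tibble (`toy` and `daily_toy` are illustrative names, not objects from this chapter). It uses group_vars() to show which grouping variables remain after each summary:

```r
library(dplyr)

# Hypothetical data: four flights over three distinct days
toy <- tibble(
  year = 2013,
  month = c(1, 1, 1, 2),
  day = c(1, 1, 2, 1)
)
daily_toy <- toy |> 
  group_by(year, month, day)

# "drop_last" (the default here) peels off the last grouping variable
dropped_last <- daily_toy |> 
  summarize(n = n(), .groups = "drop_last")
group_vars(dropped_last)
#> [1] "year"  "month"

# "drop" removes all grouping
dropped_all <- daily_toy |> 
  summarize(n = n(), .groups = "drop")
group_vars(dropped_all)
#> character(0)

# "keep" preserves the original grouping
kept <- daily_toy |> 
  summarize(n = n(), .groups = "keep")
group_vars(kept)
#> [1] "year"  "month" "day"
```

All three results contain the same three summary rows; only the grouping carried forward to later steps differs.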

Ungrouping

You might also want to remove grouping outside of summarize(). You can do this with ungroup().

daily |> 
  ungroup() |>
  summarize(
    delay = mean(dep_delay, na.rm = TRUE), 
    flights = n()
  )
#> # A tibble: 1 × 2
#>   delay flights
#>   <dbl>   <int>
#> 1  12.6  336776

As you can see, when you summarize an ungrouped data frame, you get a single row back because dplyr treats all the rows in an ungrouped data frame as belonging to one group.

Exercises

  1. Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights |> group_by(carrier, dest) |> summarize(n()))

  2. Find the most delayed flight to each destination.

  3. How do delays vary over the course of the day? Illustrate your answer with a plot.

  4. What happens if you supply a negative n to slice_min() and friends?

  5. Explain what count() does in terms of the dplyr verbs you just learned. What does the sort argument to count() do?

Case study: aggregates and sample size

Whenever you do any aggregation, it’s always a good idea to include a count (n()). That way, you can ensure that you’re not drawing conclusions based on very small amounts of data. For example, let’s look at the planes (identified by their tail number) that have the highest average delays:

delays <- flights |>  
  filter(!is.na(arr_delay), !is.na(tailnum)) |> 
  group_by(tailnum) |> 
  summarize(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  )

ggplot(delays, aes(delay)) + 
  geom_freqpoly(binwidth = 10)

A frequency histogram showing the distribution of flight delays. The distribution is unimodal, with a large spike around 0, and asymmetric: very few flights leave more than 30 minutes early, but flights are delayed up to 5 hours.

Wow, there are some planes that have an average delay of 5 hours (300 minutes)! That seems pretty surprising, so let’s draw a scatterplot of number of flights vs. average delay:

ggplot(delays, aes(n, delay)) + 
  geom_point(alpha = 1/10)

A scatterplot showing number of flights versus average delay. Delays for planes with a very small number of flights have very high variability (from -50 to ~300), but the variability rapidly decreases as the number of flights increases.

Not surprisingly, there is much greater variation in the average delay when there are few flights for a given plane. The shape of this plot is very characteristic: whenever you plot a mean (or other summary) vs. group size, you’ll see that the variation decreases as the sample size increases (*cough* the central limit theorem *cough*).

When looking at this sort of plot, it’s often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups:

delays |>  
  filter(n > 25) |> 
  ggplot(aes(n, delay)) + 
  geom_point(alpha = 1/10) + 
  geom_smooth(se = FALSE)

Now that the y-axis (average delay) is smaller (-20 to 60 minutes), we can see a more complicated story. The smooth line suggests an initial decrease in average delay from 10 minutes to 0 minutes as number of flights per plane increases from 25 to 100. This is followed by a gradual increase up to 10 minutes for 250 flights, then a gradual decrease to ~5 minutes at 500 flights.

Note the handy pattern for combining ggplot2 and dplyr. It’s a bit annoying that you have to switch from |> to +, but it’s not too much of a hassle once you get the hang of it.
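The shrinking variation seen in the scatterplots above can be reproduced with a small simulation. The sketch below assumes, purely for illustration, that individual delays are drawn from a normal distribution with mean 10 and standard deviation 40 minutes (real delays are not normal), and measures how much group means vary for different group sizes:

```r
set.seed(1014)  # arbitrary seed so the sketch is reproducible

# Mean of n simulated delays; the distribution parameters are hypothetical
sim_group_mean <- function(n) {
  mean(rnorm(n, mean = 10, sd = 40))
}

# For each group size, simulate 2000 group means and measure their spread
sizes <- c(5, 50, 500)
spread <- vapply(
  sizes,
  function(n) sd(replicate(2000, sim_group_mean(n))),
  numeric(1)
)

# The spread of the group means shrinks as the group size grows
round(spread, 1)
```

Under these assumptions the spread should fall roughly in proportion to 1 / sqrt(n), which is why the smallest groups dominate the extremes of a mean-vs-size plot.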

There’s another common variation on this pattern that we can see in some data about baseball players. The following code uses data from the Lahman package to compare what proportion of times a player hits the ball vs. the number of attempts they take:

batters <- Lahman::Batting |> 
  group_by(playerID) |> 
  summarize(
    perf = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
    n = sum(AB, na.rm = TRUE)
  )
batters
#> # A tibble: 20,166 × 3
#>   playerID    perf     n
#>   <chr>      <dbl> <int>
#> 1 aardsda01 0          4
#> 2 aaronha01 0.305  12364
#> 3 aaronto01 0.229    944
#> 4 aasedo01  0          5
#> 5 abadan01  0.0952    21
#> 6 abadfe01  0.111      9
#> # … with 20,160 more rows

When we plot the skill of the batter (measured by the batting average, perf) against the number of opportunities to hit the ball (measured by times at bat, n), we see two patterns:

  1. As above, the variation in our aggregate decreases as we get more data points.

  2. There’s a positive correlation between skill (perf) and opportunities to hit the ball (n) because obviously teams want to give their best batters the most opportunities to hit the ball.

batters |> 
  filter(n > 100) |> 
  ggplot(aes(n, perf)) +
    geom_point(alpha = 1 / 10) + 
    geom_smooth(se = FALSE)

A scatterplot of number of batting opportunities vs. batting performance overlaid with a smoothed line. Average performance increases sharply from 0.2 when n is 1 to 0.25 when n is ~1000. Average performance continues to increase linearly at a much shallower slope, reaching ~0.3 when n is ~15,000.

This also has important implications for ranking. If you naively sort on desc(perf), the people with the best batting averages are clearly lucky, not skilled:

batters |> 
  arrange(desc(perf))
#> # A tibble: 20,166 × 3
#>   playerID   perf     n
#>   <chr>     <dbl> <int>
#> 1 abramge01     1     1
#> 2 alberan01     1     1
#> 3 banisje01     1     1
#> 4 bartocl01     1     1
#> 5 bassdo01      1     1
#> 6 birasst01     1     2
#> # … with 20,160 more rows

You can find a good explanation of this problem and how to overcome it at http://varianceexplained.org/r/empirical_bayes_baseball/ and https://www.evanmiller.org/how-not-to-sort-by-average-rating.html.
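As a rough first guard against this problem, you can require a minimum number of at-bats before ranking, reusing the same filter() idea as in the plot above. The sketch below uses a hypothetical four-player tibble (the player IDs and the threshold of 100 are illustrative assumptions); the articles linked above describe more principled approaches than a hard cutoff:

```r
library(dplyr)

# Hypothetical players: two with almost no at-bats, two with many
toy_batters <- tibble(
  playerID = c("lucky01", "lucky02", "steady01", "steady02"),
  perf = c(1, 0.5, 0.305, 0.229),
  n = c(1, 2, 12364, 944)
)

# Naive ranking: the players with one or two at-bats come first
naive <- toy_batters |> 
  arrange(desc(perf))
naive$playerID
#> [1] "lucky01"  "lucky02"  "steady01" "steady02"

# Ranking only players with enough data to support their average
ranked <- toy_batters |> 
  filter(n > 100) |> 
  arrange(desc(perf))
ranked$playerID
#> [1] "steady01" "steady02"
```

A hard cutoff is simple but arbitrary: it discards players just below the threshold and treats everyone above it as equally reliable.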

Summary

In this chapter, you’ve learned the tools that dplyr provides for working with data frames. The tools are roughly grouped into three categories: those that manipulate the rows (like filter() and arrange()), those that manipulate the columns (like select() and mutate()), and those that manipulate groups (like group_by() and summarize()). In this chapter, we’ve focused on these “whole data frame” tools, but you haven’t yet learned much about what you can do with individual variables. We’ll come back to that in the Transform part of the book, where each chapter will give you tools for a specific type of variable.

For now, we’ll pivot back to workflow, and in the next chapter you’ll learn more about the pipe, |>, why we recommend it, and a little of the history that led from magrittr’s %>% to base R’s |>.