# Data transformation {#data-transform}
## Introduction
Visualisation is an important tool for insight generation, but it is rare that you get the data in exactly the right form you need.
Often you'll need to create some new variables or summaries, or maybe you just want to rename the variables or reorder the observations in order to make the data a little easier to work with.
You'll learn how to do all that (and more!) in this chapter, which will teach you how to transform your data using the dplyr package and a new dataset on flights departing New York City in 2013.
### Prerequisites
In this chapter we're going to focus on how to use the dplyr package, another core member of the tidyverse.
We'll illustrate the key ideas using data from the nycflights13 package, and use ggplot2 to help us understand the data.
```{r setup, message = FALSE}
library(nycflights13)
library(tidyverse)
```
Take careful note of the conflicts message that's printed when you load the tidyverse.
It tells you that dplyr masks some functions in base R.
If you want to use the base version of these functions after loading dplyr, you'll need to use their full names: `stats::filter()` and `stats::lag()`.
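To make the masking concrete, here is a minimal sketch that calls both versions by their full names. The vector `x` is made up for illustration.

```{r}
library(dplyr)

x <- c(1, 2, 3, 4, 5) # hypothetical example vector

# stats::filter() applies a linear filter; with three equal weights it
# computes a centred 3-point moving average (NA at both ends).
stats::filter(x, rep(1 / 3, 3))

# dplyr::lag() shifts the values one position later, padding with NA.
dplyr::lag(x)
```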
### nycflights13
To explore the basic data manipulation verbs of dplyr, we'll use `nycflights13::flights`.
This data frame contains all `r format(nrow(nycflights13::flights), big.mark = ",")` flights that departed from New York City in 2013.
The data comes from the US [Bureau of Transportation Statistics](http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&Link=0), and is documented in `?flights`.
```{r}
flights
```
If you've used R before, you might notice that this data frame prints a little differently to data frames you might've worked with in the past.
That's because it's a **tibble**, a special type of data frame designed by the tidyverse team.
The most important difference between a tibble and a data frame is the print method.
Tibbles only show the first few rows and the columns that fit on one screen.
This makes it easier to rapidly iterate when working with large data; if you want to see everything you can use `View(flights)` to open the dataset in the RStudio viewer.
We'll come back to other important differences in Chapter \@ref(tibbles).
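If the default print shows too little, you can ask for more from the console as well. The sketch below relies on the `n` and `width` arguments of the tibble print method and on dplyr's `glimpse()`; the choice of five rows is arbitrary.

```{r}
library(dplyr)
library(nycflights13)

# Show the first 5 rows and every column, however wide the output gets.
print(flights, n = 5, width = Inf)

# glimpse() prints one line per column, which suits wide data.
glimpse(flights)
```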
You might also have noticed the row of three (or four) letter abbreviations under the column names.
These describe the type of each variable:
- `int` stands for integer.
- `dbl` stands for double, a vector of real numbers.
- `chr` stands for character, a vector of strings.
- `dttm` stands for date-time (a date + a time).
There are three other common types of variables that aren't used in this dataset but you'll encounter later in the book:
- `lgl` stands for logical, vectors that contain only `TRUE` or `FALSE`.
- `fct` stands for factor, which R uses to represent categorical variables with fixed possible values.
- `date` stands for date.
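Since `flights` has none of these three types, here is a small made-up tibble that contains one column of each. The column names and values are purely illustrative.

```{r}
library(dplyr)

# Hypothetical data: one logical, one factor, and one date column.
types_demo <- tibble(
  on_time = c(TRUE, FALSE),
  size = factor(c("small", "large"), levels = c("small", "large")),
  day = as.Date(c("2013-01-01", "2013-12-25"))
)

# Printing shows the type abbreviation under each column name.
types_demo
```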
### dplyr basics
In this chapter you are going to learn the primary dplyr verbs that allow you to solve the vast majority of your data manipulation challenges.
They are organised into three camps:
- Functions that operate on **rows**, like `filter()` which subsets rows based on the values of the columns, the `slice()` functions that subset rows based on their position, and `arrange()` which changes the order of the rows.
- Functions that operate on **columns**, like `mutate()` which creates new columns, `select()` which picks columns, `rename()` which changes column names, and `relocate()` which moves columns from place to place.
- Functions that operate on **groups**, like `group_by()` which divides data up into groups for analysis, and `summarise()` which reduces each group to a single row.
Later, in Chapter \@ref(relational-data), you'll learn about other verbs that work with **tables**, like the join functions and the set operations.
All dplyr verbs work the same way:
1. The first argument is a data frame.
2. The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
3. The result is a new data frame.
Together these properties make it easy to chain together multiple simple steps to achieve a complex result.
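The three properties can be seen in a minimal sketch on a made-up tibble (`df` and its columns are hypothetical). Each verb takes a data frame first, refers to columns by bare name, and returns a new data frame without touching the input.

```{r}
library(dplyr)

df <- tibble(x = c(3, 1, 2), y = c("a", "b", "c")) # hypothetical data

big <- filter(df, x > 1) # data frame in, bare column name, data frame out
sorted <- arrange(big, x) # the result of one verb feeds the next

sorted
df # the original is unchanged
```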
Let's dive in and see how these verbs work.
## Rows
### `filter()`
`filter()` allows you to subset observations based on their values.
The first argument is the name of the data frame.
The second and subsequent arguments are the expressions that filter the data frame.
For example, we can select all flights on January 1st with:
```{r}
filter(flights, month == 1, day == 1)
```
When you run that line of code, dplyr executes the filtering operation and returns a new data frame.
dplyr functions never modify their inputs, so if you want to save the result, you'll need to use the assignment operator, `<-`:
```{r}
jan1 <- filter(flights, month == 1, day == 1)
```
R either prints out the results, or saves them to a variable.
If you want to do both, you can wrap the assignment in parentheses:
```{r}
(dec25 <- filter(flights, month == 12, day == 25))
```
To use filtering effectively, you have to know how to select the observations that you want using the comparison operators.
R provides the standard suite: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal).
It also provides `%in%`: `filter(df, x %in% c(a, b, c))` will return all rows where `x` is `a`, `b`, or `c`.
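As a short sketch of these operators on `flights`: the first call uses `%in%`, the second spells out the same condition with `==` and `|`, and the third uses a plain comparison. The thresholds are chosen only for illustration.

```{r}
library(dplyr)
library(nycflights13)

# Flights that departed in November or December.
nov_dec <- filter(flights, month %in% c(11, 12))

# The same rows, written with == and | instead of %in%.
nov_dec_or <- filter(flights, month == 11 | month == 12)

# Flights that arrived more than two hours (120 minutes) late.
filter(flights, arr_delay > 120)
```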
When you're starting out with R, the easiest mistake to make is to use `=` instead of `==` when testing for equality.
`filter()` will let you know when this happens:
```{r, error = TRUE}
filter(flights, month = 1)
```
### `slice()`
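As noted in the overview, the `slice()` functions subset rows based on their position. The following is a minimal sketch, assuming dplyr 1.0.0 or later (where the `slice_*()` helpers are available); the row counts are arbitrary.

```{r}
library(dplyr)
library(nycflights13)

slice_head(flights, n = 3) # first three rows
slice_tail(flights, n = 3) # last three rows
slice(flights, 10:12) # rows 10 to 12, by position

# Rows with the three largest departure delays; ties are kept by default,
# so this can return more than three rows.
slice_max(flights, dep_delay, n = 3)
```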
### `arrange()`
`arrange()` works similarly to `filter()` except that instead of selecting rows, it changes their order.
It takes a data frame and a set of column names (or more complicated expressions) to order by.
If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:
```{r}
arrange(flights, year, month, day)
```
Use `desc()` to re-order by a column in descending order:
```{r}
arrange(flights, desc(dep_delay))
```
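One detail worth knowing is how `arrange()` treats missing values: they are sorted to the end, whether or not you use `desc()`. A small sketch on a made-up tibble:

```{r}
library(dplyr)

df <- tibble(x = c(5, 2, NA)) # hypothetical data with one missing value

arrange(df, x) # 2, 5, NA
arrange(df, desc(x)) # 5, 2, NA
```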
### Exercises
1. Find all flights that

    a. Had an arrival delay of two or more hours
    b. Flew to Houston (`IAH` or `HOU`)
    c. Were operated by United, American, or Delta
    d. Departed in summer (July, August, and September)
    e. Arrived more than two hours late, but didn't leave late
    f. Were delayed by at least an hour, but made up over 30 minutes in flight
    g. Departed between midnight and 6am (inclusive)

2. Sort `flights` to find the flights with longest departure delays.
    Find the flights that left earliest.

3. Sort `flights` to find the fastest (highest speed) flights.
    (Hint: try sorting by a calculation).

4. Which flights travelled the farthest?
    Which travelled the shortest?
## Columns
### `mutate()`
Besides selecting sets of existing columns, it's often useful to add new columns that are functions of existing columns.
That's the job of `mutate()`.
By default, `mutate()` adds new columns at the end of your dataset, so we'll start by creating a narrower dataset with `select()` (described in Section \@ref(select)) to make the new variables easy to see.
Remember that when you're in RStudio, the easiest way to see all the columns is `View()`.
```{r}
flights_sml <- select(flights,
  year:day,
  ends_with("delay"),
  distance,
  air_time
)
```
```{r}
mutate(flights_sml,
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60
)
```
Note that you can refer to columns that you've just created:
```{r}
mutate(flights_sml,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)
```
You can control which variables are kept with the `.keep` argument:
```{r}
mutate(flights,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours,
  .keep = "none"
)
```
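Two related options are sketched below on a made-up tibble (`toy` is hypothetical). `.keep = "used"` retains only the columns that went into the calculation plus the new ones, and `.before` places the new column somewhere other than the end. Both assume dplyr 1.0.0 or later.

```{r}
library(dplyr)

toy <- tibble(year = 2013, dep_delay = c(10, -5), arr_delay = c(4, -12)) # hypothetical data

# Keep only the inputs to the calculation and the new column.
used <- mutate(toy, gain = dep_delay - arr_delay, .keep = "used")
names(used) # "dep_delay" "arr_delay" "gain"

# Put the new column first instead of last.
front <- mutate(toy, gain = dep_delay - arr_delay, .before = 1)
names(front) # "gain" "year" "dep_delay" "arr_delay"
```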
### `select()` {#select}
It's not uncommon to get datasets with hundreds or even thousands of variables.
In this case, the first challenge is often narrowing in on the variables you're actually interested in.
`select()` allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.
`select()` is not terribly useful with the flights data because we only have 19 variables, but you can still get the general idea:
```{r}
# Select columns by name
select(flights, year, month, day)
# Select all columns between year and day (inclusive)
select(flights, year:day)
# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))
```
There are a number of helper functions you can use within `select()`:
- `starts_with("abc")`: matches names that begin with "abc".
- `ends_with("xyz")`: matches names that end with "xyz".
- `contains("ijk")`: matches names that contain "ijk".
- `matches("(.)\\1")`: selects variables that match a regular expression. This one matches any variables that contain repeated characters. You'll learn more about regular expressions in [strings].
- `num_range("x", 1:3)`: matches `x1`, `x2` and `x3`.
See `?select` for more details.
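Here is a brief sketch of the helpers in action. It uses a made-up one-row tibble (`toy`, with hypothetical column names) so that the selected names are easy to check by eye.

```{r}
library(dplyr)

# Hypothetical data: only the column names matter here.
toy <- tibble(
  dep_time = 1, dep_delay = 2, arr_time = 3, arr_delay = 4,
  x1 = 5, x2 = 6, x3 = 7
)

names(select(toy, starts_with("dep"))) # "dep_time" "dep_delay"
names(select(toy, ends_with("delay"))) # "dep_delay" "arr_delay"
names(select(toy, matches("^(dep|arr)_"))) # the four dep_/arr_ columns
names(select(toy, num_range("x", 1:2))) # "x1" "x2"
```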
You can rename variables as you `select()` them by using `=`.
The new name appears on the left hand side of the `=`, and the old variable appears on the right hand side:
```{r}
select(flights, tail_num = tailnum)
```
### `rename()`
If you want to keep all the existing variables and rename only a few, you can use `rename()` instead of `select()`:
```{r}
rename(flights, tail_num = tailnum)
```
It works exactly the same way as `select()`, but keeps all the variables that aren't explicitly selected.
### `relocate()`
You can move variables around with `relocate()`.
By default it moves variables to the front:
```{r}
relocate(flights, time_hour, air_time)
```
But you can use the `.before` and `.after` arguments to choose where to place them:
```{r}
relocate(flights, year:dep_time, .after = time_hour)
relocate(flights, starts_with("arr"), .before = dep_time)
```
### Exercises
```{r, eval = FALSE, echo = FALSE}
# For data checking, not used in results shown in book
flights <- flights %>% mutate(
  dep_time = hour * 60 + minute,
  arr_time = (arr_time %/% 100) * 60 + (arr_time %% 100),
  airtime2 = arr_time - dep_time,
  dep_sched = dep_time + dep_delay
)
ggplot(flights, aes(dep_sched)) + geom_histogram(binwidth = 60)
ggplot(flights, aes(dep_sched %% 60)) + geom_histogram(binwidth = 1)
ggplot(flights, aes(air_time - airtime2)) + geom_histogram()
```
1. Currently `dep_time` and `sched_dep_time` are convenient to look at, but hard to compute with because they're not really continuous numbers.
    Convert them to a more convenient representation of number of minutes since midnight.

2. Compare `air_time` with `arr_time - dep_time`.
    What do you expect to see?
    What do you see?
    What do you need to do to fix it?

3. Compare `dep_time`, `sched_dep_time`, and `dep_delay`.
    How would you expect those three numbers to be related?

4. Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from `flights`.

5. What happens if you include the name of a variable multiple times in a `select()` call?

6. What does the `any_of()` function do?
    Why might it be helpful in conjunction with this vector?

    ```{r}
    variables <- c("year", "month", "day", "dep_delay", "arr_delay")
    ```

7. Does the result of running the following code surprise you?
    How do the select helpers deal with case by default?
    How can you change that default?

    ```{r, eval = FALSE}
    select(flights, contains("TIME"))
    ```
## Groups
### `group_by()`
`group_by()` doesn't appear to do anything:
```{r}
by_month <- group_by(flights, month)
by_month
```
If you look closely, you'll notice that it's now "grouped by" month, but otherwise the data is unchanged.
The reason to group your data is that doing so changes how other verbs operate.
### `summarise()`
The most important operation that you might apply to grouped data is a summary.
It collapses each group to a single row:
```{r}
summarise(by_month, delay = mean(dep_delay, na.rm = TRUE))
```
You can create any number of summaries at once.
You'll learn various useful summaries in the upcoming chapters on individual data types, but another useful summary function is `n()`, which returns the number of rows in each group:
```{r}
summarise(by_month, delay = mean(dep_delay, na.rm = TRUE), n = n())
```
(In fact, `count()`, which you already learned about, is just a shortcut for grouping and summarising with `n()`.)
Here we've used `mean()` to compute the average delay for each month.
The `na.rm = TRUE` is important because it asks R to "remove" (rm) the missing (na) values.
If you forget it, the output isn't very useful:
```{r}
summarise(by_month, delay = mean(dep_delay))
```
We'll come back to discuss missing values in Chapter \@ref(missing-values).
For now, know you can drop them in summary functions by using `na.rm = TRUE` or remove them with a filter by using `!is.na()`:
```{r}
not_cancelled <- filter(flights, !is.na(dep_delay))
by_month <- group_by(not_cancelled, month)
summarise(by_month, delay = mean(dep_delay))
```
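The reason a single missing value spoils the summary is easiest to see on a plain vector. In this sketch (the numbers are made up), both ways of dropping the `NA` give the same answer:

```{r}
x <- c(10, NA, 30) # hypothetical delays with one missing value

mean(x) # NA: the result is unknown because one input is unknown
mean(x, na.rm = TRUE) # 20: drop missing values inside the summary
mean(x[!is.na(x)]) # 20: drop them beforehand, as filter() does
```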
### Combining multiple operations with the pipe
This code is a little frustrating to write because we have to give each intermediate data frame a name, even though we don't care about them.
Naming things is hard, so this slows down our analysis.
There's another way to tackle the same problem with the pipe, `%>%`:
```{r}
flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(month) %>%
  summarise(delay = mean(dep_delay))
```
This focuses on the transformations, not what's being transformed, which makes the code easier to read.
You can read it as a series of imperative statements: filter, then group, then summarise.
As suggested by this reading, a good way to pronounce `%>%` when reading code is "then".
Behind the scenes, `x %>% f(y)` turns into `f(x, y)`, and `x %>% f(y) %>% g(z)` turns into `g(f(x, y), z)` and so on.
You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom.
We'll use piping frequently from now on because it considerably improves the readability of code, and we'll come back to it in more detail in Chapter \@ref(pipes).
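To check the rewriting rule for yourself, the sketch below writes the same computation twice on a made-up tibble (`toy` is hypothetical): once as nested calls and once with the pipe. The two versions produce the same totals.

```{r}
library(dplyr)

toy <- tibble(g = c("a", "a", "b"), v = c(1, 2, 10)) # hypothetical data

# Nested form: read from the inside out.
nested <- summarise(group_by(filter(toy, v > 1), g), total = sum(v))

# Piped form: read top to bottom.
piped <- toy %>%
  filter(v > 1) %>%
  group_by(g) %>%
  summarise(total = sum(v))

identical(nested$total, piped$total) # TRUE
```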
### Grouping by multiple variables
You can group a data frame by multiple variables:
```{r}
daily <- group_by(flights, year, month, day)
daily
```
When you group by multiple variables, each summary peels off one level of the grouping by default, and a message is printed that tells you how you can change this behaviour.
```{r}
daily %>% summarise(flights = n())
```
If you're happy with this behaviour, you can also explicitly define it, in which case the message won't be printed out.
```{r results = FALSE}
summarise(daily, flights = n(), .groups = "drop_last")
```
Or you can change the default behaviour by setting a different value, e.g. `"drop"` for dropping all levels of grouping or `"keep"` for the same grouping structure as `daily`:
```{r results = FALSE}
summarise(daily, flights = n(), .groups = "drop")
summarise(daily, flights = n(), .groups = "keep")
```
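One way to see what each `.groups` option leaves behind is to ask for the grouping variables of the result with `group_vars()`. This is a minimal sketch on a made-up tibble (`toy` and `by_day` are hypothetical names):

```{r}
library(dplyr)

toy <- tibble(year = 2013, month = c(1, 1, 2), day = c(1, 2, 1)) # hypothetical data
by_day <- group_by(toy, year, month, day)

group_vars(summarise(by_day, n = n(), .groups = "drop_last")) # "year" "month"
group_vars(summarise(by_day, n = n(), .groups = "drop")) # character(0)
group_vars(summarise(by_day, n = n(), .groups = "keep")) # "year" "month" "day"
```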
### Ungrouping
You might also want to remove grouping outside of `summarise()`.
You can do this and return to operations on ungrouped data using `ungroup()`.
```{r}
daily %>%
  ungroup() %>%
  summarise(
    delay = mean(dep_delay, na.rm = TRUE),
    flights = n()
  )
```
For the purposes of summarising, ungrouped data is treated as if all your data was in a single group, so you get one row back.
### Other verbs
- `select()`, `rename()`, `relocate()`: grouping has no effect.
- `filter()`, `mutate()`: computation happens per group.
This doesn't affect the functions you currently know but is very useful once you learn about window functions, Section \@ref(window-functions).
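To illustrate what "per group" means, here is a small sketch on a made-up tibble (`toy` is hypothetical). Inside a grouped `mutate()` or `filter()`, summary functions such as `mean()` and `max()` see only the rows of the current group.

```{r}
library(dplyr)

toy <- tibble(g = c("a", "a", "b"), v = c(1, 3, 10)) # hypothetical data

# mean(v) is computed separately for "a" (2) and "b" (10).
centred <- toy %>%
  group_by(g) %>%
  mutate(diff = v - mean(v))
centred$diff # -1 1 0

# max(v) is also per group, so one row survives from each group.
top <- toy %>%
  group_by(g) %>%
  filter(v == max(v))
top$v # 3 10
```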
### Exercises
1. Which carrier has the worst delays?
    Challenge: can you disentangle the effects of bad airports vs. bad carriers?
    Why/why not?
    (Hint: think about `flights %>% group_by(carrier, dest) %>% summarise(n())`)

2. What does the `sort` argument to `count()` do?
    Can you explain it in terms of the dplyr verbs you've learned so far?
## Case study: aggregates and sample size
Whenever you do any aggregation, it's always a good idea to include either a count (`n()`), or a count of non-missing values (`sum(!is.na(x))`).
That way you can check that you're not drawing conclusions based on very small amounts of data.
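The two kinds of count differ whenever a group contains missing values. A minimal sketch on a made-up tibble (`toy` is hypothetical) shows both next to the mean they qualify:

```{r}
library(dplyr)

toy <- tibble(g = c("a", "a", "b"), x = c(1, NA, 3)) # hypothetical data

res <- toy %>%
  group_by(g) %>%
  summarise(
    avg = mean(x, na.rm = TRUE),
    n = n(), # rows in the group
    n_obs = sum(!is.na(x)) # rows that actually contributed to avg
  )
res
```

Group `a` has two rows but only one non-missing value, so its average rests on a single observation.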
For example, let's look at the planes (identified by their tail number) that have the highest average delays:
```{r}
delays <- not_cancelled %>%
  group_by(tailnum) %>%
  summarise(
    # not_cancelled only drops missing dep_delay, so arr_delay can still be NA
    delay = mean(arr_delay, na.rm = TRUE)
  )

ggplot(data = delays, mapping = aes(x = delay)) +
  geom_freqpoly(binwidth = 10)
```
Wow, there are some planes that have an *average* delay of 5 hours (300 minutes)!
The story is actually a little more nuanced.
We can get more insight if we draw a scatterplot of number of flights vs. average delay:
```{r}
delays <- not_cancelled %>%
  group_by(tailnum) %>%
  summarise(
    # not_cancelled only drops missing dep_delay, so arr_delay can still be NA
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  )

ggplot(data = delays, mapping = aes(x = n, y = delay)) +
  geom_point(alpha = 1/10)
```
Not surprisingly, there is much greater variation in the average delay when there are few flights.
The shape of this plot is very characteristic: whenever you plot a mean (or other summary) vs. group size, you'll see that the variation decreases as the sample size increases.
When looking at this sort of plot, it's often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups.
This is what the following code does, as well as showing you a handy pattern for integrating ggplot2 into dplyr flows.
It's a bit painful that you have to switch from `%>%` to `+`, but once you get the hang of it, it's quite convenient.
```{r}
delays %>%
  filter(n > 25) %>%
  ggplot(mapping = aes(x = n, y = delay)) +
    geom_point(alpha = 1/10)
```
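The narrowing of the plot as `n` grows is not specific to flights. The following simulation is only a sketch: it uses standard normal draws rather than real delays, with arbitrary group sizes and an arbitrary seed, and it measures how much the mean of a group varies for each size.

```{r}
set.seed(2013) # arbitrary seed, for reproducibility

sizes <- c(5, 50, 500) # hypothetical group sizes

# For each size, simulate 1000 group means and measure their spread.
spread <- sapply(sizes, function(n) sd(replicate(1000, mean(rnorm(n)))))
names(spread) <- sizes
spread
```

The spread shrinks roughly in proportion to one over the square root of the group size, which is why the smallest groups dominate the extremes.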
There's another common variation of this type of pattern.
Let's look at how the average performance of batters in baseball is related to the number of times they're at bat.
Here I use data from the **Lahman** package to compute the batting average (number of hits / number of attempts) of every major league baseball player.
When I plot the skill of the batter (measured by the batting average, `ba`) against the number of opportunities to hit the ball (measured by at bat, `ab`), you see two patterns:
1. As above, the variation in our aggregate decreases as we get more data points.
2. There's a positive correlation between skill (`ba`) and opportunities to hit the ball (`ab`).
This is because teams control who gets to play, and obviously they'll pick their best players.
```{r}
# Convert to a tibble so it prints nicely
batting <- as_tibble(Lahman::Batting)
batters <- batting %>%
  group_by(playerID) %>%
  summarise(
    ba = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
    ab = sum(AB, na.rm = TRUE)
  )

batters %>%
  filter(ab > 100) %>%
  ggplot(mapping = aes(x = ab, y = ba)) +
    geom_point() +
    geom_smooth(se = FALSE)
```
This also has important implications for ranking.
If you naively sort on `desc(ba)`, the people with the best batting averages are clearly lucky, not skilled:
```{r}
batters %>%
  arrange(desc(ba))
```
You can find a good explanation of this problem at <http://varianceexplained.org/r/empirical_bayes_baseball/> and <http://www.evanmiller.org/how-not-to-sort-by-average-rating.html>.
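A simple, if crude, guard is to require a minimum sample size before ranking, as the plot above did with `ab > 100`. The sketch below uses a made-up tibble (`toy` and its players are hypothetical), and the threshold is a judgment call rather than a rule; the articles linked above describe more principled approaches.

```{r}
library(dplyr)

# Hypothetical players: one lucky single at-bat and two regulars.
toy <- tibble(
  player = c("one_hit_wonder", "regular", "star"),
  h = c(1, 120, 180),
  ab = c(1, 500, 550)
) %>%
  mutate(ba = h / ab)

# Naive ranking: the player with a single at-bat comes first.
naive <- arrange(toy, desc(ba))
naive$player[1] # "one_hit_wonder"

# Ranking only players with enough at-bats.
ranked <- toy %>%
  filter(ab > 100) %>%
  arrange(desc(ba))
ranked$player[1] # "star"
```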