Diving back into transformation chapter

This commit is contained in:
Hadley Wickham 2022-02-14 10:00:53 -06:00
parent 2efcd7e4fe
commit 6825c577d9
2 changed files with 255 additions and 186 deletions


@ -8,7 +8,10 @@ status("restructuring")
Visualisation is an important tool for insight generation, but it is rare that you get the data in exactly the right form you need.
Often you'll need to create some new variables or summaries, or maybe you just want to rename the variables or reorder the observations in order to make the data a little easier to work with.
You'll learn how to do all that (and more!) in this chapter, which will introduce you to data transformation using the dplyr package and a new dataset on flights departing New York City in 2013.
The goal of this chapter is to give you an overview of all the key tools for transforming a data frame.
We'll come back to these functions in more detail in later chapters, as we start to dig into specific types of data (e.g. numbers, strings, dates).
### Prerequisites
@ -34,30 +37,15 @@ The data comes from the US [Bureau of Transportation Statistics](http://www.tran
flights
```
If you've used R before, you might notice that this data frame prints a little differently to data frames that you might've worked with in the past.
That's because it's a **tibble**, a special type of data frame designed by the tidyverse team to avoid some common data.frame gotchas.
The most important difference is the way it prints: tibbles are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen.
If you want to see everything you can use `View(flights)` to open the dataset in the RStudio viewer.
We'll come back to other important differences in Chapter \@ref(tibbles).
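If the truncated printout isn't enough, there are two other handy ways to inspect every column (both available once you've loaded the tidyverse); a quick sketch:

```{r, eval = FALSE}
# Transpose the printout so each variable gets its own line
glimpse(flights)

# Or force all columns to print, however wide the result
print(flights, width = Inf)
```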
You might also have noticed the row of short abbreviations following each column name.
These describe the type of each variable: `<int>` is short for integer, `<dbl>` for double (aka real numbers), `<chr>` for character (aka strings), and `<dttm>` for date-time.
These types are important because the operations you can perform on a column depend so much on its type, and they're used to organize the chapters in the Transform section of this book.
### dplyr basics
@ -70,72 +58,134 @@ All dplyr verbs work the same way:
3. The result is a new data frame.
This means that dplyr code typically looks something like this:
```{r, eval = FALSE}
data |>
filter(x == 1) |>
mutate(
y = x + 1
)
```
`|>` is a special operator called a pipe.
It takes the thing on its left and passes it along to the function on its right.
The easiest way to pronounce the pipe is "then".
So you can read the above as take data, then filter it, then mutate it.
We'll come back to the pipe and its alternatives in Chapter \@ref(pipes).
In RStudio, you can make the pipe by pressing Ctrl/Cmd + Shift + M.
Behind the scenes, `x |> f(y)` turns into `f(x, y)`, and `x |> f(y) |> g(z)` turns into `g(f(x, y), z)`, and so on.
You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom.
Together these properties make it easy to chain together multiple simple steps to achieve a complex result.
The verbs are organised into four groups based on what they operate on: **rows**, **columns**, **groups**, or **tables**.
In the following sections you'll learn the most important verbs for rows, columns, and groups.
We'll come back to operations that work on multiple tables in Chapter \@ref(relational-data).
Let's dive in!
## Rows
The most important verbs that affect the rows are `filter()`, which changes which rows are present without changing their order, and `arrange()`, which changes the order of the rows without changing which are present.
Both functions only affect the rows, so the columns are left unchanged.
### `filter()`
`filter()` allows you to pick rows based on the values of the columns[^data-transform-1].
The first argument is the data frame.
The second and subsequent arguments are the conditions that must be true to keep the row.
For example, we could find all flights that arrived more than 120 minutes (two hours) late:
[^data-transform-1]: Later, you'll learn about the `slice_*()` family which allows you to choose rows based on their positions.
```{r}
flights |>
filter(arr_delay > 120)
```
As well as `>` (greater than), R provides `>=` (greater than or equal to), `<` (less than), `<=` (less than or equal to), `==` (equal to), and `!=` (not equal to).
You can use `&` (and) or `|` (or) to combine multiple conditions:
```{r}
# Flights that departed on January 1
flights |>
filter(month == 1 & day == 1)
# Flights that departed in January or February
flights |>
filter(month == 1 | month == 2)
```
There's a useful shortcut when you're combining `|` and `==`: `%in%`.
It returns true if the value on the left hand side is any of the values on the right hand side:
```{r}
flights |>
filter(month %in% c(1, 2))
```
We'll come back to these comparisons and logical operators in more detail in Chapter \@ref(logicals-numbers).
When you run `filter()`, dplyr executes the filtering operation, creating a new data frame, and then prints it.
It doesn't modify the existing `flights` dataset because dplyr functions never modify their inputs.
To save the result, you need to use the assignment operator, `<-`:
```{r}
jan1 <- flights |>
filter(month == 1 & day == 1)
```
### `arrange()`
`arrange()` changes the order of the rows based on the value of the columns.
Again, it takes a data frame and a set of column names (or more complicated expressions) to order by.
If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.
For example, the following code sorts by the departure time, which is spread over four columns.
```{r}
flights |>
arrange(year, month, day, dep_time)
```
You can use `desc()` to re-order by a column in descending order.
For example, this is useful if you want to see the most delayed flights:
```{r}
flights |>
arrange(desc(dep_delay))
```
You can of course combine `arrange()` and `filter()` to solve more complex problems.
For example, we could look for the flights that were most delayed on arrival but left roughly on time:
```{r}
flights |>
filter(dep_delay <= 10 & dep_delay >= -10) |>
arrange(desc(arr_delay))
```
### Common mistakes
When you're starting out with R, the easiest mistake to make is to use `=` instead of `==` when testing for equality.
`filter()` will let you know when this happens:
```{r, error = TRUE}
flights |>
filter(month = 1)
```
Another mistake is writing "or" statements like you would in English:
```{r, eval = FALSE}
flights |>
filter(month == 1 | 2)
```
This works, in the sense that it doesn't throw an error, but it doesn't do what you want.
We'll come back to what it does and why in Section \@ref(boolean-operations).
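As a rough sketch of what goes wrong: `==` binds more tightly than `|`, so R reads the condition as `(month == 1) | 2`, and a non-zero number like `2` counts as `TRUE`, so every row is kept:

```{r, eval = FALSE}
# `(month == 1) | 2` is TRUE for every row, so nothing is filtered out
flights |>
  filter(month == 1 | 2)

# What you actually want:
flights |>
  filter(month %in% c(1, 2))
```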
### Exercises
@ -147,97 +197,109 @@ arrange(flights, desc(dep_delay))
d. Departed in summer (July, August, and September)
e. Arrived more than two hours late, but didn't leave late
f. Were delayed by at least an hour, but made up over 30 minutes in flight
g. Departed between midnight and 6am (inclusive)
2. Sort `flights` to find the flights with longest departure delays.
Find the flights that left earliest in the morning.
3. Sort `flights` to find the fastest flights (Hint: try sorting by a calculation).
4. Which flights traveled the farthest?
Which traveled the shortest?
5. Does it matter what order you use `filter()` and `arrange()` in if you're using both?
Why/why not?
Think about the results and how much work the functions would have to do.
## Columns
There are four important verbs that affect the columns without changing the rows: `mutate()`, `select()`, `rename()`, and `relocate()`.
`mutate()` creates new columns that are functions of the existing columns; `select()`, `rename()`, and `relocate()` change which columns are present, their names, and their positions.
### `mutate()`
The job of `mutate()` is to add new columns that are calculated from the existing columns.
In the transform chapters, you'll learn a large set of functions that you can use to manipulate different types of variables.
For now, we'll stick with basic algebra, which allows us to compute the `gain`, how much time a delayed flight made up in the air, and the `speed` in miles per hour:
```{r}
flights |>
mutate(
gain = dep_delay - arr_delay,
speed = distance / air_time * 60
)
```
By default, `mutate()` adds new columns on the right hand side of your dataset, which makes it hard to see what's happening here.
We can use the `.before` argument to instead add the variables to the left hand side[^data-transform-2]:
[^data-transform-2]: Remember that when you're in RStudio, the easiest way to see all the columns is `View()`.
```{r}
flights |>
mutate(
gain = dep_delay - arr_delay,
speed = distance / air_time * 60,
.before = 1
)
```
The `.` is a sign that `.before` is an argument to the function, not the name of a new variable.
You can also use `.after` to add after a variable, and in both `.before` and `.after` you can use the name of a variable instead of a position.
For example, we could add the new variables after `day`:
```{r}
flights |>
mutate(
gain = dep_delay - arr_delay,
speed = distance / air_time * 60,
.after = day
)
```
Alternatively, you can control which variables are kept with the `.keep` argument.
A particularly useful value is `"used"`, which allows you to see the inputs and outputs from your calculations:
```{r}
flights |>
mutate(
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours,
.keep = "used"
)
```
### `select()` {#select}
It's not uncommon to get datasets with hundreds or even thousands of variables.
In this case, the first challenge is often focusing on just the variables you're interested in.
`select()` allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.
`select()` is not terribly useful with the flights data because we only have 19 variables, but you can still get the general idea of how it works:
```{r}
# Select columns by name
flights |>
select(year, month, day)
# Select all columns between year and day (inclusive)
flights |>
select(year:day)
# Select all columns except those from year to day (inclusive)
flights |>
select(-(year:day))
# Select all columns that are characters
flights |>
select(where(is.character))
```
There are a number of helper functions you can use within `select()`:
- `starts_with("abc")`: matches names that begin with "abc".
- `ends_with("xyz")`: matches names that end with "xyz".
- `contains("ijk")`: matches names that contain "ijk".
- `num_range("x", 1:3)`: matches `x1`, `x2` and `x3`.
See `?select` for more details.
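For example, with the flights data the helpers look like this (a quick sketch; each call returns a narrower data frame):

```{r, eval = FALSE}
# All columns whose names start with "dep"
flights |>
  select(starts_with("dep"))

# All columns whose names contain "delay"
flights |>
  select(contains("delay"))
```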
@ -247,7 +309,7 @@ You can rename variables as you `select()` them by using `=`.
The new name appears on the left hand side of the `=`, and the old variable appears on the right hand side:
```{r}
flights |> select(tail_num = tailnum)
```
### `rename()`
@ -255,29 +317,31 @@ select(flights, tail_num = tailnum)
If you just want to keep all the existing variables and just want to rename a few, you can use `rename()` instead of `select()`:
```{r}
flights |>
rename(tail_num = tailnum)
```
It works exactly the same way as `select()`, but keeps all the variables that aren't explicitly selected.
### `relocate()`
You can move variables around with `relocate()`.
By default it moves variables to the front:
```{r}
flights |>
relocate(time_hour, air_time)
```
But you can use the same `.before` and `.after` arguments as `mutate()` to choose where to put them:
```{r}
flights |>
relocate(year:dep_time, .after = time_hour)
flights |>
relocate(starts_with("arr"), .before = dep_time)
```
These work the same way as the `.before` and `.after` arguments to `mutate()` --- they can be a numeric position, the name of a variable, or any of the other functions that you can use with `select()`.
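For instance, all of the following forms are valid (a sketch of the possibilities):

```{r, eval = FALSE}
# A numeric position
flights |>
  relocate(carrier, .before = 1)

# The name of a variable
flights |>
  relocate(carrier, .after = day)

# A select()-style helper
flights |>
  relocate(contains("delay"), .before = dep_time)
```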
### Exercises
```{r, eval = FALSE, echo = FALSE}
@ -334,11 +398,11 @@ The two key functions are `group_by()` and `summarise()`, but as you'll learn `g
Use `group_by()` to divide your dataset into groups meaningful for your analysis:
```{r}
flights |>
group_by(month)
```
`group_by()` doesn't change the data but, if you look closely at the output, you'll notice that it's now "grouped by" month.
The reason to group your data is that it changes the operation of subsequent verbs.
### `summarise()`
@ -350,73 +414,83 @@ Here we compute the average departure delay by month:
[^data-transform-3]: This is a slight simplification; later on you'll learn how to use `summarise()` to produce multiple summary rows for each group.
```{r}
flights |>
group_by(month) |>
summarise(
delay = mean(dep_delay)
)
```
Uhoh!
Something has gone wrong and all of our results are `NA`, R's symbol for a missing value.
We'll come back to discuss missing values in Chapter \@ref(missing-values), but for now we'll remove them by using `na.rm = TRUE`:
```{r}
flights |>
group_by(month) |>
summarise(
delay = mean(dep_delay, na.rm = TRUE)
)
```
You can create any number of summaries in a single call to `summarise()`.
You'll learn various useful summaries in the upcoming chapters, but one very useful summary is `n()`, which returns the number of rows in each group:
```{r}
flights |>
group_by(month) |>
summarise(
delay = mean(dep_delay, na.rm = TRUE),
n = n()
)
```
(In fact, `count()`, which we've used a bunch in previous chapters, is just shorthand for `group_by()` + `summarise(n = n())`.)
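To make that equivalence concrete, these two pipelines return the same data frame of months and row counts:

```{r, eval = FALSE}
flights |>
  group_by(month) |>
  summarise(n = n())

flights |>
  count(month)
```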
Means and counts can get you a surprisingly long way in data science!
### Grouping by multiple variables
You can group a data frame by multiple variables:
```{r}
daily <- flights %>%
group_by(year, month, day)
daily
```
When you group by multiple variables, each summary peels off one level of the grouping by default, and a message is printed that tells you how you can change this behaviour.
```{r}
daily %>%
summarise(
n = n()
)
```
If you're happy with this behaviour, you can explicitly define it in order to suppress the message:
```{r, results = FALSE}
daily %>%
summarise(
n = n(),
.groups = "drop_last"
)
```
Alternatively, you can change the default behaviour by setting a different value, e.g. `"drop"` to drop all levels of grouping or `"keep"` to preserve the same grouping structure as `daily`:
```{r, results = FALSE}
daily %>%
summarise(
n = n(),
.groups = "drop"
)
daily %>%
summarise(
n = n(),
.groups = "keep"
)
```
### Ungrouping
@ -439,7 +513,7 @@ For the purposes of summarising, ungrouped data is treated as if all your data w
`group_by()` is usually paired with `summarise()`, but it's good to know how it affects other verbs:
- `select()`, `rename()`, `relocate()`: grouping has no effect.
- `mutate()`: computation happens per group.
This doesn't affect the functions you currently know, but it is very useful once you learn about window functions in Section \@ref(window-functions).
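As a small sketch of per-group computation (the new column name here is made up for illustration): each flight's departure delay is compared to the mean for its month, not the overall mean:

```{r, eval = FALSE}
flights |>
  group_by(month) |>
  mutate(relative_delay = dep_delay - mean(dep_delay, na.rm = TRUE))
```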
@ -458,15 +532,17 @@ For the purposes of summarising, ungrouped data is treated as if all your data w
## Case study: aggregates and sample size
Whenever you do any aggregation, it's always a good idea to include a count (`n()`).
That way you can check that you're not drawing conclusions based on very small amounts of data.
For example, let's look at the planes (identified by their tail number) that have the highest average delays:
```{r}
delays <- flights %>%
filter(!is.na(arr_delay)) %>%
group_by(tailnum) %>%
summarise(
delay = mean(arr_delay),
n = n()
)
ggplot(data = delays, mapping = aes(x = delay)) +
@ -479,13 +555,6 @@ The story is actually a little more nuanced.
We can get more insight if we draw a scatterplot of number of flights vs. average delay:
```{r}
ggplot(data = delays, mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10)
```
@ -501,13 +570,24 @@ It's a bit painful that you have to switch from `%>%` to `+`, but once you get t
delays %>%
filter(n > 25) %>%
ggplot(mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10) +
geom_smooth(se = FALSE)
```
There's another common variation of this type of pattern.
Let's look at how the average performance of batters in baseball is related to the number of times they're at bat.
Here I use data from the **Lahman** package to compute the batting average (number of hits / number of attempts) of every major league baseball player.
```{r}
batters <- Lahman::Batting %>%
group_by(playerID) %>%
summarise(
ba = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
ab = sum(AB, na.rm = TRUE)
)
batters
```
When I plot the skill of the batter (measured by the batting average, `ba`) against the number of opportunities to hit the ball (measured by at bat, `ab`), you see two patterns:
1. As above, the variation in our aggregate decreases as we get more data points.
@ -516,20 +596,10 @@ When I plot the skill of the batter (measured by the batting average, `ba`) agai
This is because teams control who gets to play, and obviously they'll pick their best players.
```{r}
batters %>%
filter(ab > 100) %>%
ggplot(mapping = aes(x = ab, y = ba)) +
geom_point(alpha = 1 / 10) +
geom_smooth(se = FALSE)
```


@ -24,6 +24,19 @@ You can create new objects with `<-`:
x <- 3 * 4
```
You can **c**ombine multiple elements into a vector with `c()`:
```{r}
primes <- c(2, 3, 5, 7, 11, 13)
```
And basic arithmetic is applied to every element of the vector:
```{r}
primes * 2
primes - 1
```
All R statements where you create objects, **assignment** statements, have the same form:
```{r eval = FALSE}
@ -134,20 +147,6 @@ If this happens, R will show you the continuation character "+":
The `+` tells you that R is waiting for more input; it doesn't think you're done yet.
Usually that means you've forgotten either a `"` or a `)`. Either add the missing pair, or press ESCAPE to abort the expression and try again.
If you make an assignment, you don't get to see the value.
You're then tempted to immediately double-check the result:
```{r}
y <- seq(1, 10, length.out = 5)
y
```
This common action can be shortened by surrounding the assignment with parentheses, which causes the assignment and "print to screen" to happen at the same time:
```{r}
(y <- seq(1, 10, length.out = 5))
```
Now look at your environment in the upper right pane:
```{r, echo = FALSE, out.width = NULL}