Start rewriting transform chapter

parent d80982caa6
commit 86e98ae66e
|
@@ -30,46 +30,47 @@ The data comes from the US [Bureau of Transportation Statistics](http://www.tran
|
|||
flights
|
||||
```
|
||||
|
||||
You might notice that this data frame prints a little differently from other data frames you might have used in the past: it only shows the first few rows and all the columns that fit on one screen.
|
||||
It also displays the number of rows (`r format(nrow(nycflights13::flights), big.mark = ",")`) and columns (`r ncol(nycflights13::flights)`).
|
||||
(To see the whole dataset, you can run `View(flights)`, which will open the dataset in the RStudio viewer.)
|
||||
It prints differently because it's a **tibble**.
|
||||
Tibbles are data frames, but slightly tweaked to work better in the tidyverse.
|
||||
For now, you don't need to worry about the differences; we'll come back to tibbles in more detail in Chapter \@ref(tibbles).
|
||||
If you've used R before, you might notice that this data frame prints a little differently to data frames you might've worked with in the past.
|
||||
That's because it's a **tibble**, a special type of data frame designed by the tidyverse team.
|
||||
|
||||
The most important difference between a tibble and a data frame is the print method.
|
||||
Tibbles only show the first few rows and the columns that fit on one screen.
|
||||
This makes it easier to rapidly iterate when working with large data; if you want to see everything you can use `View(flights)` to open the dataset in the RStudio viewer.
|
||||
We'll come back to other important differences in Chapter \@ref(tibbles).
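
As a rough illustration of the printing difference, the sketch below converts a plain data frame to a tibble and compares how many lines each one prints. It assumes the tibble package is installed; `df` and `tb` are made-up objects, not part of the chapter's data.

```r
library(tibble)

# A plain data frame with 100 rows (made-up data for illustration).
df <- data.frame(id = 1:100, value = (1:100) / 10)

# The same data as a tibble.
tb <- as_tibble(df)

# Capture what print() would write to the console for each object.
printed_df <- capture.output(print(df))
printed_tb <- capture.output(print(tb))

# The data frame prints every row; the tibble prints only the first few.
length(printed_df)
length(printed_tb)
```

A tibble is still a data frame (`inherits(tb, "data.frame")` is `TRUE`), so code written for data frames generally keeps working.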
|
||||
|
||||
You might also have noticed the row of three (or four) letter abbreviations under the column names.
|
||||
These describe the type of each variable:
|
||||
|
||||
- `int` stands for integers.
|
||||
- `int` stands for integer.
|
||||
|
||||
- `dbl` stands for doubles, or real numbers.
|
||||
- `dbl` stands for double, a vector of real numbers.
|
||||
|
||||
- `chr` stands for characters, or strings.
|
||||
- `chr` stands for character, a vector of strings.
|
||||
|
||||
- `dttm` stands for date-times (a date + a time).
|
||||
- `dttm` stands for date-time (a date + a time).
|
||||
|
||||
There are three other common types of variables that aren't used in this dataset but you'll encounter later in the book:
|
||||
|
||||
- `lgl` stands for logical, vectors that contain only `TRUE` or `FALSE`.
|
||||
|
||||
- `fctr` stands for factors, which R uses to represent categorical variables with fixed possible values.
|
||||
- `fctr` stands for factor, which R uses to represent categorical variables with fixed possible values.
|
||||
|
||||
- `date` stands for dates.
|
||||
- `date` stands for date.
|
||||
|
||||
### dplyr basics
|
||||
|
||||
In this chapter you are going to learn the five key dplyr functions that allow you to solve the vast majority of your data manipulation challenges:
|
||||
In this chapter you are going to learn the primary dplyr verbs that allow you to solve the vast majority of your data manipulation challenges.
|
||||
They are organised into three camps:
|
||||
|
||||
- Pick observations by their values (`filter()`).
|
||||
- Reorder the rows (`arrange()`).
|
||||
- Pick variables by their names (`select()`).
|
||||
- Create new variables with functions of existing variables (`mutate()`).
|
||||
- Collapse many values down to a single summary (`summarise()`).
|
||||
- Functions that operate on **rows**, like `filter()` which subsets rows based on the values of the columns, the `slice()` functions that subset rows based on their position, and `arrange()` which changes the order of the rows.
|
||||
|
||||
These can all be used in conjunction with `group_by()` which changes the scope of each function from operating on the entire dataset to operating on it group-by-group.
|
||||
These six functions provide the verbs for a language of data manipulation.
|
||||
- Functions that operate on **columns**, like `mutate()` which creates new columns, `select()` which picks columns, `rename()` which changes column names, and `relocate()` which moves columns from place to place.
|
||||
|
||||
All verbs work similarly:
|
||||
- Functions that operate on **groups**, like `group_by()` which divides data up into groups for analysis, and `summarise()` which reduces each group to a single row.
|
||||
|
||||
Later, in Chapter \@ref(relational-data), you'll learn about other verbs that work with **tables**, like the join functions and the set operations.
|
||||
|
||||
All dplyr verbs work the same way:
|
||||
|
||||
1. The first argument is a data frame.
|
||||
|
||||
|
@@ -80,7 +81,9 @@ All verbs work similarly:
|
|||
Together these properties make it easy to chain together multiple simple steps to achieve a complex result.
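
A minimal sketch of that shared shape, assuming dplyr is installed. The tibble `toy` and its values are invented for illustration: each verb takes a data frame first and returns a new data frame, leaving the input untouched.

```r
library(dplyr)

# Made-up stand-in for a flights-like table.
toy <- tibble(
  month     = c(1, 1, 2),
  dep_delay = c(5, -2, 30)
)

# First argument is a data frame; the result is a new data frame.
late    <- filter(toy, dep_delay > 0)
ordered <- arrange(late, desc(dep_delay))

# The original is not modified.
nrow(toy)
ordered
```

Because every verb returns a data frame, the output of one step can be handed straight to the next.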
|
||||
Let's dive in and see how these verbs work.
|
||||
|
||||
## Filter rows with `filter()`
|
||||
## Rows
|
||||
|
||||
### `filter()`
|
||||
|
||||
`filter()` allows you to subset observations based on their values.
|
||||
The first argument is the name of the data frame.
|
||||
|
@@ -105,35 +108,20 @@ If you want to do both, you can wrap the assignment in parentheses:
|
|||
(dec25 <- filter(flights, month == 12, day == 25))
|
||||
```
|
||||
|
||||
### Comparisons
|
||||
|
||||
To use filtering effectively, you have to know how to select the observations that you want using the comparison operators.
|
||||
R provides the standard suite: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal).
|
||||
It also provides `%in%`: `filter(df, x %in% c(a, b, c))` will return all rows where `x` is `a`, `b`, or `c`.
|
||||
|
||||
When you're starting out with R, the easiest mistake to make is to use `=` instead of `==` when testing for equality.
|
||||
When this happens you'll get an informative error:
|
||||
`filter()` will let you know when this happens:
|
||||
|
||||
```{r, error = TRUE}
|
||||
filter(flights, month = 1)
|
||||
```
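
The comparison operators can be tried on a small table. This is only a sketch, assuming dplyr is installed; `toy` is a made-up tibble rather than the real `flights` data.

```r
library(dplyr)

# Made-up data for illustration.
toy <- tibble(
  month   = c(1, 11, 12, 12),
  carrier = c("UA", "AA", "DL", "B6")
)

# `==` tests for equality.
december <- filter(toy, month == 12)

# `%in%` matches any of several values.
big_three <- filter(toy, carrier %in% c("UA", "AA", "DL"))

# Using `=` by mistake raises an error instead of silently filtering.
mistake <- try(filter(toy, month = 12), silent = TRUE)
inherits(mistake, "try-error")
```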
|
||||
|
||||
### Exercises
|
||||
### `slice()`
|
||||
|
||||
1. Find all flights that
|
||||
|
||||
a. Had an arrival delay of two or more hours
|
||||
b. Flew to Houston (`IAH` or `HOU`)
|
||||
c. Were operated by United, American, or Delta
|
||||
d. Departed in summer (July, August, and September)
|
||||
e. Arrived more than two hours late, but didn't leave late
|
||||
f. Were delayed by at least an hour, but made up over 30 minutes in flight
|
||||
g. Departed between midnight and 6am (inclusive)
|
||||
|
||||
2. How many flights have a missing `dep_time`?
|
||||
What other variables are missing?
|
||||
What might these rows represent?
|
||||
|
||||
## Arrange rows with `arrange()`
|
||||
### `arrange()`
|
||||
|
||||
`arrange()` works similarly to `filter()` except that instead of selecting rows, it changes their order.
|
||||
It takes a data frame and a set of column names (or more complicated expressions) to order by.
|
||||
|
@@ -151,15 +139,73 @@ arrange(flights, desc(dep_delay))
|
|||
|
||||
### Exercises
|
||||
|
||||
1. Sort `flights` to find the flights with longest departure delays.
|
||||
1. Find all flights that
|
||||
|
||||
a. Had an arrival delay of two or more hours
|
||||
b. Flew to Houston (`IAH` or `HOU`)
|
||||
c. Were operated by United, American, or Delta
|
||||
d. Departed in summer (July, August, and September)
|
||||
e. Arrived more than two hours late, but didn't leave late
|
||||
f. Were delayed by at least an hour, but made up over 30 minutes in flight
|
||||
g. Departed between midnight and 6am (inclusive)
|
||||
|
||||
2. Sort `flights` to find the flights with longest departure delays.
|
||||
Find the flights that left earliest.
|
||||
|
||||
2. Sort `flights` to find the fastest (highest speed) flights.
|
||||
3. Sort `flights` to find the fastest (highest speed) flights.
|
||||
(Hint: try sorting by a calculation).
|
||||
|
||||
3. Which flights travelled the farthest?
|
||||
4. Which flights travelled the farthest?
|
||||
Which travelled the shortest?
|
||||
|
||||
## Select columns with `select()` {#select}
|
||||
## Columns
|
||||
|
||||
### `mutate()`
|
||||
|
||||
Besides selecting sets of existing columns, it's often useful to add new columns that are functions of existing columns.
|
||||
That's the job of `mutate()`.
|
||||
|
||||
`mutate()` always adds new columns at the end of your dataset so we'll start by creating a narrower dataset so we can see the new variables.
|
||||
Remember that when you're in RStudio, the easiest way to see all the columns is `View()`.
|
||||
|
||||
```{r}
|
||||
flights_sml <- select(flights,
|
||||
year:day,
|
||||
ends_with("delay"),
|
||||
distance,
|
||||
air_time
|
||||
)
|
||||
```
|
||||
|
||||
```{r}
|
||||
mutate(flights_sml,
|
||||
gain = dep_delay - arr_delay,
|
||||
speed = distance / air_time * 60
|
||||
)
|
||||
```
|
||||
|
||||
Note that you can refer to columns that you've just created:
|
||||
|
||||
```{r}
|
||||
mutate(flights_sml,
|
||||
gain = dep_delay - arr_delay,
|
||||
hours = air_time / 60,
|
||||
gain_per_hour = gain / hours
|
||||
)
|
||||
```
|
||||
|
||||
You can control which variables are kept with the `.keep` argument:
|
||||
|
||||
```{r}
|
||||
mutate(flights,
|
||||
gain = dep_delay - arr_delay,
|
||||
hours = air_time / 60,
|
||||
gain_per_hour = gain / hours,
|
||||
.keep = "none"
|
||||
)
|
||||
```
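
`.keep` accepts other values besides `"none"`. The sketch below compares `"used"` and `"none"` on a made-up tibble; it assumes dplyr 1.0.0 or later, where the `.keep` argument is available.

```r
library(dplyr)

# Made-up data for illustration.
toy <- tibble(
  dep_delay = c(10, 0),
  arr_delay = c(4, 12),
  air_time  = c(120, 60)
)

# "used" keeps only the columns that went into the new variables.
used <- mutate(toy, gain = dep_delay - arr_delay, .keep = "used")

# "none" keeps only the new variables.
none <- mutate(toy, gain = dep_delay - arr_delay, .keep = "none")

names(used)
names(none)
```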
|
||||
|
||||
### `select()` {#select}
|
||||
|
||||
It's not uncommon to get datasets with hundreds or even thousands of variables.
|
||||
In this case, the first challenge is often narrowing in on the variables you're actually interested in.
|
||||
|
@@ -190,80 +236,37 @@ There are a number of helper functions you can use within `select()`:
|
|||
|
||||
See `?select` for more details.
|
||||
|
||||
`select()` can be used to rename variables, but it's rarely useful because it drops all of the variables not explicitly mentioned.
|
||||
Instead, use `rename()`, which is a variant of `select()` that keeps all the variables that aren't explicitly mentioned:
|
||||
You can rename variables as you `select()` them by using `=`.
|
||||
The new name appears on the left hand side of the `=`, and the old variable appears on the right hand side:
|
||||
|
||||
```{r}
|
||||
select(flights, tail_num = tailnum)
|
||||
```
|
||||
|
||||
### `rename()`
|
||||
|
||||
If you want to keep all the existing variables and rename just a few, you can use `rename()` instead of `select()`:
|
||||
|
||||
```{r}
|
||||
rename(flights, tail_num = tailnum)
|
||||
```
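
The difference between the two is easiest to see from the column names that come back. A small sketch, assuming dplyr is installed and using a made-up one-row tibble:

```r
library(dplyr)

# Made-up data for illustration.
toy <- tibble(year = 2013, tailnum = "N14228", dest = "IAH")

# select() renames and drops everything not mentioned.
selected <- select(toy, tail_num = tailnum)

# rename() renames and keeps everything else in place.
renamed <- rename(toy, tail_num = tailnum)

names(selected)
names(renamed)
```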
|
||||
|
||||
If you want to move certain variables to the start of the data frame but not drop the others, you can do this in two ways: using `select()` in conjunction with the `everything()` helper or using `relocate()`.
|
||||
It works exactly the same way as `select()`, but keeps all the variables that aren't explicitly selected.
|
||||
|
||||
### `relocate()`
|
||||
|
||||
You can move variables around with `relocate()`.
|
||||
By default it moves variables to the front:
|
||||
|
||||
```{r}
|
||||
select(flights, time_hour, air_time, everything())
|
||||
relocate(flights, time_hour, air_time)
|
||||
```
|
||||
|
||||
### Exercises
|
||||
|
||||
1. Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from `flights`.
|
||||
|
||||
2. What happens if you include the name of a variable multiple times in a `select()` call?
|
||||
|
||||
3. What does the `any_of()` function do?
|
||||
Why might it be helpful in conjunction with this vector?
|
||||
|
||||
```{r}
|
||||
variables <- c("year", "month", "day", "dep_delay", "arr_delay")
|
||||
```
|
||||
|
||||
4. Does the result of running the following code surprise you?
|
||||
How do the select helpers deal with case by default?
|
||||
How can you change that default?
|
||||
|
||||
```{r, eval = FALSE}
|
||||
select(flights, contains("TIME"))
|
||||
```
|
||||
|
||||
## Add new variables with `mutate()`
|
||||
|
||||
Besides selecting sets of existing columns, it's often useful to add new columns that are functions of existing columns.
|
||||
That's the job of `mutate()`.
|
||||
|
||||
`mutate()` always adds new columns at the end of your dataset so we'll start by creating a narrower dataset so we can see the new variables.
|
||||
Remember that when you're in RStudio, the easiest way to see all the columns is `View()`.
|
||||
But you can use the `.before` and `.after` arguments to choose where to place them:
|
||||
|
||||
```{r}
|
||||
flights_sml <- select(flights,
|
||||
year:day,
|
||||
ends_with("delay"),
|
||||
distance,
|
||||
air_time
|
||||
)
|
||||
mutate(flights_sml,
|
||||
gain = dep_delay - arr_delay,
|
||||
speed = distance / air_time * 60
|
||||
)
|
||||
```
|
||||
|
||||
Note that you can refer to columns that you've just created:
|
||||
|
||||
```{r}
|
||||
mutate(flights_sml,
|
||||
gain = dep_delay - arr_delay,
|
||||
hours = air_time / 60,
|
||||
gain_per_hour = gain / hours
|
||||
)
|
||||
```
|
||||
|
||||
If you only want to keep the new variables, use `transmute()`:
|
||||
|
||||
```{r}
|
||||
transmute(flights,
|
||||
gain = dep_delay - arr_delay,
|
||||
hours = air_time / 60,
|
||||
gain_per_hour = gain / hours
|
||||
)
|
||||
relocate(flights, year:dep_time, .after = time_hour)
|
||||
relocate(flights, starts_with("arr"), .before = dep_time)
|
||||
```
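
To check where columns end up, it can help to look only at the resulting names. The sketch below assumes dplyr 1.0.0 or later (for `relocate()`) and uses a made-up one-row tibble with a few `flights`-like column names.

```r
library(dplyr)

# Made-up data for illustration.
toy <- tibble(
  year      = 2013,
  month     = 1,
  dep_time  = 517,
  arr_time  = 830,
  time_hour = "2013-01-01 05:00"
)

# Default: move the named columns to the front.
front <- relocate(toy, time_hour)

# .after / .before place the columns relative to another column.
after  <- relocate(toy, year:month, .after = time_hour)
before <- relocate(toy, starts_with("arr"), .before = dep_time)

names(front)
names(after)
names(before)
```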
|
||||
|
||||
### Exercises
|
||||
|
@@ -293,68 +296,75 @@ ggplot(flights, aes(air_time - airtime2)) + geom_histogram()
|
|||
3. Compare `dep_time`, `sched_dep_time`, and `dep_delay`.
|
||||
How would you expect those three numbers to be related?
|
||||
|
||||
4. Find the 10 most delayed flights using a ranking function.
|
||||
How do you want to handle ties?
|
||||
Carefully read the documentation for `min_rank()`.
|
||||
4. Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from `flights`.
|
||||
|
||||
5. What does `1:3 + 1:10` return?
|
||||
Why?
|
||||
5. What happens if you include the name of a variable multiple times in a `select()` call?
|
||||
|
||||
6. What trigonometric functions does R provide?
|
||||
6. What does the `any_of()` function do?
|
||||
Why might it be helpful in conjunction with this vector?
|
||||
|
||||
## Grouped summaries with `summarise()`
|
||||
```{r}
|
||||
variables <- c("year", "month", "day", "dep_delay", "arr_delay")
|
||||
```
|
||||
|
||||
The last key verb is `summarise()`.
|
||||
It collapses a data frame to a single row:
|
||||
7. Does the result of running the following code surprise you?
|
||||
How do the select helpers deal with case by default?
|
||||
How can you change that default?
|
||||
|
||||
```{r}
|
||||
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
|
||||
```
|
||||
```{r, eval = FALSE}
|
||||
select(flights, contains("TIME"))
|
||||
```
|
||||
|
||||
(We'll come back to what that `na.rm = TRUE` means very shortly.)
|
||||
## Groups
|
||||
|
||||
`summarise()` is not terribly useful unless we pair it with `group_by()`.
|
||||
This changes the unit of analysis from the complete dataset to individual groups.
|
||||
Then, when you use the dplyr verbs on a grouped data frame they'll be automatically applied "by group".
|
||||
For example, if we applied exactly the same code to a data frame grouped by month, we get the average delay per month:
|
||||
### `group_by()`
|
||||
|
||||
`group_by()` doesn't appear to do anything:
|
||||
|
||||
```{r}
|
||||
by_month <- group_by(flights, month)
|
||||
by_month
|
||||
```
|
||||
|
||||
If you look closely, you'll notice that it's now "grouped by" month, but otherwise the data is unchanged.
|
||||
The reason to group your data is that it changes the operation of other verbs.
|
||||
|
||||
### `summarise()`
|
||||
|
||||
The most important operation that you might apply to grouped data is a summary.
|
||||
It collapses each group to a single row:
|
||||
|
||||
```{r}
|
||||
summarise(by_month, delay = mean(dep_delay, na.rm = TRUE))
|
||||
```
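
The same two steps on a tiny made-up table make the effect easier to verify by hand. This is a sketch assuming dplyr is installed; `toy` is not part of the chapter's data.

```r
library(dplyr)

# Made-up data for illustration.
toy <- tibble(
  month     = c(1, 1, 2),
  dep_delay = c(10, 20, 0)
)

# Grouping records the grouping variable but leaves the rows alone.
by_month <- group_by(toy, month)
group_vars(by_month)
nrow(by_month)

# Summarising then collapses each group to one row.
monthly <- summarise(by_month, delay = mean(dep_delay))
monthly
```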
|
||||
|
||||
Together `group_by()` and `summarise()` provide one of the tools that you'll use most commonly when working with dplyr: grouped summaries.
|
||||
But before we go any further with this, we need to introduce a powerful new idea: the pipe.
|
||||
You can create any number of summaries at once.
|
||||
You'll learn various useful summaries in the upcoming chapters on individual data types; one particularly useful summary function is `n()`, which returns the number of rows in each group:
|
||||
|
||||
### Combining multiple operations with the pipe
|
||||
|
||||
Imagine that we want to explore the relationship between the distance and average delay for each location.
|
||||
Using what you know about dplyr, you might write code like this:
|
||||
|
||||
```{r, fig.width = 6}
|
||||
by_dest <- group_by(flights, dest)
|
||||
delay <- summarise(by_dest,
|
||||
count = n(),
|
||||
dist = mean(distance, na.rm = TRUE),
|
||||
delay = mean(arr_delay, na.rm = TRUE)
|
||||
)
|
||||
delay <- filter(delay, count > 20, dest != "HNL")
|
||||
|
||||
# It looks like delays increase with distance up to ~750 miles
|
||||
# and then decrease. Maybe as flights get longer there's more
|
||||
# ability to make up delays in the air?
|
||||
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
|
||||
geom_point(aes(size = count), alpha = 1/3) +
|
||||
geom_smooth(se = FALSE)
|
||||
```{r}
|
||||
summarise(by_month, delay = mean(dep_delay, na.rm = TRUE), n = n())
|
||||
```
|
||||
|
||||
There are three steps to prepare this data:
|
||||
(In fact, `count()`, which you already learned about, is just a shortcut for grouping and summarising with `n()`.)
|
||||
|
||||
1. Group flights by destination.
|
||||
Here we've used `mean()` to compute the average delay for each month.
|
||||
The `na.rm = TRUE` is important because it asks R to "remove" (rm) the missing (na) values.
|
||||
If you forget it, the output isn't very useful:
|
||||
|
||||
2. Summarise to compute distance, average delay, and number of flights.
|
||||
```{r}
|
||||
summarise(by_month, delay = mean(dep_delay))
|
||||
```
|
||||
|
||||
3. Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
|
||||
We'll come back to discuss missing values in Chapter \@ref(missing-values).
|
||||
For now, know you can drop them in summary functions by using `na.rm = TRUE` or remove them with a filter by using `!is.na()`:
|
||||
|
||||
```{r}
|
||||
not_cancelled <- filter(flights, !is.na(dep_delay))
|
||||
by_month <- group_by(not_cancelled, month)
|
||||
summarise(by_month, delay = mean(dep_delay))
|
||||
```
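
On a small made-up table the three variants can be compared side by side. A sketch, assuming dplyr is installed:

```r
library(dplyr)

# Made-up data with one missing delay in month 1.
toy <- tibble(
  month     = c(1, 1, 2),
  dep_delay = c(10, NA, 30)
)
by_month <- group_by(toy, month)

# Without na.rm, any group containing an NA summarises to NA.
with_na <- summarise(by_month, delay = mean(dep_delay))

# Option 1: drop missing values inside the summary function.
na_rm <- summarise(by_month, delay = mean(dep_delay, na.rm = TRUE))

# Option 2: drop the rows first, then group and summarise.
kept     <- filter(toy, !is.na(dep_delay))
filtered <- summarise(group_by(kept, month), delay = mean(dep_delay))

with_na
na_rm
filtered
```

Both options give the same means here, but they differ in what `n()` would report, because filtering removes the rows entirely.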
|
||||
|
||||
### Combining multiple operations with the pipe
|
||||
|
||||
This code is a little frustrating to write because we have to give each intermediate data frame a name, even though we don't care about them.
|
||||
Naming things is hard, so this slows down our analysis.
|
||||
|
@@ -362,66 +372,23 @@ Naming things is hard, so this slows down our analysis.
|
|||
There's another way to tackle the same problem with the pipe, `%>%`:
|
||||
|
||||
```{r}
|
||||
sdelays <- flights %>%
|
||||
group_by(dest) %>%
|
||||
summarise(
|
||||
count = n(),
|
||||
dist = mean(distance, na.rm = TRUE),
|
||||
delay = mean(arr_delay, na.rm = TRUE)
|
||||
) %>%
|
||||
filter(count > 20, dest != "HNL")
|
||||
flights %>%
|
||||
filter(!is.na(dep_delay)) %>%
|
||||
group_by(month) %>%
|
||||
summarise(delay = mean(dep_delay))
|
||||
```
|
||||
|
||||
This focuses on the transformations, not what's being transformed, which makes the code easier to read.
|
||||
You can read it as a series of imperative statements: group, then summarise, then filter.
|
||||
You can read it as a series of imperative statements: filter, then group, then summarise.
|
||||
As suggested by this reading, a good way to pronounce `%>%` when reading code is "then".
|
||||
|
||||
Behind the scenes, `x %>% f(y)` turns into `f(x, y)`, and `x %>% f(y) %>% g(z)` turns into `g(f(x, y), z)` and so on.
|
||||
You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom.
|
||||
We'll use piping frequently from now on because it considerably improves the readability of code, and we'll come back to it in more detail in Chapter \@ref(pipes).
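
The rewriting rule can be checked directly by writing the same computation both ways. A sketch on made-up data, assuming dplyr is installed (it re-exports `%>%`):

```r
library(dplyr)

# Made-up data for illustration.
toy <- tibble(
  month     = c(1, 1, 2),
  dep_delay = c(10, NA, 30)
)

# Nested calls: read inside-out.
nested <- summarise(
  group_by(filter(toy, !is.na(dep_delay)), month),
  delay = mean(dep_delay)
)

# Piped calls: read top-to-bottom as "filter, then group, then summarise".
piped <- toy %>%
  filter(!is.na(dep_delay)) %>%
  group_by(month) %>%
  summarise(delay = mean(dep_delay))

# Both forms produce the same result.
identical(as.data.frame(nested), as.data.frame(piped))
```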
|
||||
|
||||
Working with the pipe is one of the key criteria for belonging to the tidyverse.
|
||||
The only exception is ggplot2: it was written before the pipe was discovered.
|
||||
Unfortunately, the next iteration of ggplot2, ggvis, which does use the pipe, isn't quite ready for prime time yet.
|
||||
### Grouping by multiple variables
|
||||
|
||||
## Missing values {#missing-values-summarise}
|
||||
|
||||
You may have wondered about the `na.rm` argument we used above.
|
||||
What happens if we don't set it?
|
||||
|
||||
```{r}
|
||||
flights %>%
|
||||
group_by(month) %>%
|
||||
summarise(mean = mean(dep_delay))
|
||||
```
|
||||
|
||||
We get a lot of missing values!
|
||||
That's because aggregation functions obey the usual rule of missing values: if there's any missing value in the input, the output will be a missing value.
|
||||
Fortunately, all aggregation functions have an `na.rm` argument which removes the missing values prior to computation:
|
||||
|
||||
```{r}
|
||||
flights %>%
|
||||
group_by(month) %>%
|
||||
summarise(mean = mean(dep_delay, na.rm = TRUE))
|
||||
```
|
||||
|
||||
In this case, missing values represent cancelled flights, therefore we could also tackle the problem by first removing the cancelled flights.
|
||||
We'll save this dataset so we can reuse it in the next few examples.
|
||||
|
||||
```{r}
|
||||
not_cancelled <- flights %>%
|
||||
filter(!is.na(dep_delay), !is.na(arr_delay))
|
||||
|
||||
not_cancelled %>%
|
||||
group_by(month) %>%
|
||||
summarise(mean = mean(dep_delay))
|
||||
```
|
||||
|
||||
## Grouping by multiple variables
|
||||
|
||||
You can group a data frame by multiple variables as well.
|
||||
Note that the grouping information is printed on top of the output.
|
||||
The number in the square brackets indicates how many groups are created.
|
||||
You can group a data frame by multiple variables:
|
||||
|
||||
```{r}
|
||||
daily <- group_by(flights, year, month, day)
|
||||
|
@@ -431,34 +398,22 @@ daily
|
|||
When you group by multiple variables, each summary peels off one level of the grouping by default, and a message is printed that tells you how you can change this behaviour.
|
||||
|
||||
```{r}
|
||||
summarise(daily, flights = n())
|
||||
daily %>% summarise(flights = n())
|
||||
```
|
||||
|
||||
If you're happy with this behaviour, you can also explicitly define it, in which case the message won't be printed out.
|
||||
|
||||
```{r}
|
||||
```{r results = FALSE}
|
||||
summarise(daily, flights = n(), .groups = "drop_last")
|
||||
```
|
||||
|
||||
Or you can change the default behaviour by setting a different value, e.g. `"drop"` for dropping all levels of grouping or `"keep"` for the same grouping structure as `daily`.
|
||||
Or you can change the default behaviour by setting a different value, e.g. `"drop"` for dropping all levels of grouping or `"keep"` for the same grouping structure as `daily`:
|
||||
|
||||
```{r}
|
||||
# Note the difference between the grouping structures
|
||||
```{r results = FALSE}
|
||||
summarise(daily, flights = n(), .groups = "drop")
|
||||
summarise(daily, flights = n(), .groups = "keep")
|
||||
```
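
One way to see what each `.groups` value does is to inspect the grouping variables of the result with `group_vars()`. The sketch below assumes dplyr 1.0.0 or later and uses a made-up table; the name `toy_daily` is hypothetical.

```r
library(dplyr)

# Made-up data: three distinct days.
toy <- tibble(year = 2013, month = c(1, 1, 2), day = c(1, 2, 1))
toy_daily <- group_by(toy, year, month, day)

# "drop_last" peels off the last grouping variable (day).
g_drop_last <- group_vars(summarise(toy_daily, flights = n(), .groups = "drop_last"))

# "drop" removes all grouping.
g_drop <- group_vars(summarise(toy_daily, flights = n(), .groups = "drop"))

# "keep" preserves the original grouping.
g_keep <- group_vars(summarise(toy_daily, flights = n(), .groups = "keep"))

g_drop_last
g_drop
g_keep
```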
|
||||
|
||||
The fact that each summary peels off one level of the grouping by default makes it easy to progressively roll up a dataset:
|
||||
|
||||
```{r}
|
||||
(per_day <- summarise(daily, flights = n()))
|
||||
(per_month <- summarise(per_day, flights = sum(flights)))
|
||||
(per_year <- summarise(per_month, flights = sum(flights)))
|
||||
```
|
||||
|
||||
Be careful when progressively rolling up summaries: it's OK for sums and counts, but you need to think about weighting means and variances, and it's not possible to do it exactly for rank-based statistics like the median.
|
||||
In other words, the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median.
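
A small made-up example shows why: rolling up sums reproduces the overall sum, while rolling up medians does not reproduce the overall median. This is a sketch assuming dplyr is installed; the groups and values are invented.

```r
library(dplyr)

# Made-up data: group "a" has three small values, group "b" one large value.
toy <- tibble(
  g = c("a", "a", "a", "b"),
  x = c(1, 2, 3, 100)
)

per_group <- toy %>%
  group_by(g) %>%
  summarise(total = sum(x), med = median(x))

# Sum of groupwise sums equals the overall sum.
sum(per_group$total)
sum(toy$x)

# Median of groupwise medians differs from the overall median.
median(per_group$med)
median(toy$x)
```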
|
||||
|
||||
### Ungrouping
|
||||
|
||||
You might also want to remove grouping outside of `summarise()`.
|
||||
|
@@ -466,11 +421,33 @@ You can do this and return to operations on ungrouped data using `ungroup()`.
|
|||
|
||||
```{r}
|
||||
daily %>%
|
||||
ungroup() %>% # no longer grouped by date
|
||||
summarise(flights = n()) # all flights
|
||||
ungroup() %>%
|
||||
summarise(
|
||||
delay = mean(dep_delay, na.rm = TRUE),
|
||||
flights = n()
|
||||
)
|
||||
```
|
||||
|
||||
### Counts
|
||||
For the purposes of summarising, ungrouped data is treated as if all your data was in a single group, so you get one row back.
|
||||
|
||||
### Other verbs
|
||||
|
||||
- `select()`, `rename()`, `relocate()`: grouping has no effect.
|
||||
|
||||
- `filter()`, `mutate()`: computation happens per group.
|
||||
This doesn't affect the functions you currently know but is very useful once you learn about window functions (Section \@ref(window-functions)).
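
As a sketch of what "per group" means for `mutate()` and `filter()`, the example below uses a made-up table and assumes dplyr is installed. Inside a grouped verb, `sum()` and `max()` see only the rows of the current group.

```r
library(dplyr)

# Made-up data for illustration.
toy <- tibble(
  dest      = c("IAH", "IAH", "MIA", "MIA"),
  arr_delay = c(10, 30, 5, 15)
)

# Grouped mutate: each delay as a share of its destination's total.
shares <- toy %>%
  group_by(dest) %>%
  mutate(prop = arr_delay / sum(arr_delay)) %>%
  ungroup()

# Grouped filter: keep the most delayed flight per destination.
worst <- toy %>%
  group_by(dest) %>%
  filter(arr_delay == max(arr_delay)) %>%
  ungroup()

shares
worst
```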
|
||||
|
||||
### Exercises
|
||||
|
||||
1. Which carrier has the worst delays?
|
||||
Challenge: can you disentangle the effects of bad airports vs. bad carriers?
|
||||
Why/why not?
|
||||
(Hint: think about `flights %>% group_by(carrier, dest) %>% summarise(n())`)
|
||||
|
||||
2. What does the `sort` argument to `count()` do?
|
||||
Can you explain it in terms of the dplyr verbs you've learned so far?
|
||||
|
||||
## Case study: aggregates and sample size
|
||||
|
||||
Whenever you do any aggregation, it's always a good idea to include either a count (`n()`), or a count of non-missing values (`sum(!is.na(x))`).
|
||||
That way you can check that you're not drawing conclusions based on very small amounts of data.
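
A minimal sketch of that habit, assuming dplyr is installed and using made-up tail numbers and delays: alongside the mean, report both the group size and the number of non-missing values.

```r
library(dplyr)

# Made-up data: plane "N2" has a single, very delayed flight.
toy <- tibble(
  tailnum   = c("N1", "N1", "N1", "N2"),
  arr_delay = c(5, NA, 15, 120)
)

delays <- toy %>%
  group_by(tailnum) %>%
  summarise(
    delay = mean(arr_delay, na.rm = TRUE),
    n     = n(),                    # rows in the group
    n_obs = sum(!is.na(arr_delay))  # rows that actually contribute to the mean
  )

delays
```

Here the largest average delay belongs to the plane with only one observation, which is exactly the kind of result the counts help you treat with caution.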
|
||||
|
@@ -518,15 +495,6 @@ delays %>%
|
|||
geom_point(alpha = 1/10)
|
||||
```
|
||||
|
||||
------------------------------------------------------------------------
|
||||
|
||||
RStudio tip: a useful keyboard shortcut is Cmd/Ctrl + Shift + P.
|
||||
This resends the previously sent chunk from the editor to the console.
|
||||
This is very convenient when you're (e.g.) exploring the value of `n` in the example above.
|
||||
You send the whole block once with Cmd/Ctrl + Enter, then you modify the value of `n` and press Cmd/Ctrl + Shift + P to resend the complete block.
|
||||
|
||||
------------------------------------------------------------------------
|
||||
|
||||
There's another common variation of this type of pattern.
|
||||
Let's look at how the average performance of batters in baseball is related to the number of times they're at bat.
|
||||
Here I use data from the **Lahman** package to compute the batting average (number of hits / number of attempts) of every major league baseball player.
|
||||
|
@@ -565,99 +533,3 @@ batters %>%
|
|||
```
|
||||
|
||||
You can find a good explanation of this problem at <http://varianceexplained.org/r/empirical_bayes_baseball/> and <http://www.evanmiller.org/how-not-to-sort-by-average-rating.html>.
|
||||
|
||||
### Exercises
|
||||
|
||||
1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights.
|
||||
Consider the following scenarios:
|
||||
|
||||
- A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
|
||||
|
||||
- A flight is always 10 minutes late.
|
||||
|
||||
- A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
|
||||
|
||||
- 99% of the time a flight is on time.
|
||||
1% of the time it's 2 hours late.
|
||||
|
||||
Which is more important: arrival delay or departure delay?
|
||||
|
||||
2. Come up with another approach that will give you the same output as `not_cancelled %>% count(dest)` and `not_cancelled %>% count(tailnum, wt = distance)` (without using `count()`).
|
||||
|
||||
3. Our definition of cancelled flights (`is.na(dep_delay) | is.na(arr_delay)` ) is slightly suboptimal.
|
||||
Why?
|
||||
Which is the most important column?
|
||||
|
||||
4. Look at the number of cancelled flights per day.
|
||||
Is there a pattern?
|
||||
Is the proportion of cancelled flights related to the average delay?
|
||||
|
||||
5. Which carrier has the worst delays?
|
||||
Challenge: can you disentangle the effects of bad airports vs. bad carriers?
|
||||
Why/why not?
|
||||
(Hint: think about `flights %>% group_by(carrier, dest) %>% summarise(n())`)
|
||||
|
||||
6. What does the `sort` argument to `count()` do?
|
||||
When might you use it?
|
||||
|
||||
## Grouped mutates and filters
|
||||
|
||||
Grouping is most useful in conjunction with `summarise()`, but you can also do convenient operations with `mutate()` and `filter()`:
|
||||
|
||||
- Find the worst members of each group:
|
||||
|
||||
```{r}
|
||||
flights_sml %>%
|
||||
group_by(year, month, day) %>%
|
||||
filter(rank(desc(arr_delay)) < 10)
|
||||
```
|
||||
|
||||
- Find all groups bigger than a threshold:
|
||||
|
||||
```{r}
|
||||
popular_dests <- flights %>%
|
||||
group_by(dest) %>%
|
||||
filter(n() > 365)
|
||||
popular_dests
|
||||
```
|
||||
|
||||
- Standardise to compute per group metrics:
|
||||
|
||||
```{r}
|
||||
popular_dests %>%
|
||||
filter(arr_delay > 0) %>%
|
||||
mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
|
||||
select(year:day, dest, arr_delay, prop_delay)
|
||||
```
|
||||
|
||||
A grouped filter is a grouped mutate followed by an ungrouped filter.
|
||||
I generally avoid them except for quick and dirty manipulations: otherwise it's hard to check that you've done the manipulation correctly.
|
||||
|
||||
Functions that work most naturally in grouped mutates and filters are known as window functions (vs. the summary functions used for summaries).
|
||||
You can learn more about useful window functions in the corresponding vignette: `vignette("window-functions")`.
|
||||
|
||||
### Exercises
|
||||
|
||||
1. Refer back to the lists of useful mutate and filtering functions.
|
||||
Describe how each operation changes when you combine it with grouping.
|
||||
|
||||
2. Which plane (`tailnum`) has the worst on-time record?
|
||||
|
||||
3. What time of day should you fly if you want to avoid delays as much as possible?
|
||||
|
||||
4. For each destination, compute the total minutes of delay.
|
||||
For each flight, compute the proportion of the total delay for its destination.
|
||||
|
||||
5. Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave.
|
||||
Using `lag()`, explore how the delay of a flight is related to the delay of the immediately preceding flight.
|
||||
|
||||
6. Look at each destination.
|
||||
Can you find flights that are suspiciously fast?
|
||||
(i.e. flights that represent a potential data entry error).
|
||||
Compute the air time of a flight relative to the shortest flight to that destination.
|
||||
Which flights were most delayed in the air?
|
||||
|
||||
7. Find all destinations that are flown by at least two carriers.
|
||||
Use that information to rank the carriers.
|
||||
|
||||
8. For each plane, count the number of flights before the first delay of greater than 1 hour.
|
||||
|
|
|
@@ -26,7 +26,7 @@ filter(flights, month == 11 | month == 12)
|
|||
```
|
||||
|
||||
The order of operations doesn't work like English.
|
||||
You can't write `filter(flights, month == (11 | 12))`, which you might literally translate into "find all flights that departed in November or December".
|
||||
You can't write `filter(flights, month == 11 | 12)`, which you might literally translate into "find all flights that departed in November or December".
|
||||
Instead it finds all months that equal `11 | 12`, an expression that evaluates to `TRUE`.
|
||||
In a numeric context (like here), `TRUE` becomes `1`, so this finds all flights in January, not November or December.
|
||||
This is quite confusing!
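
A small base R sketch of the coercion described above, using a made-up `month` vector rather than the `flights` data. It also shows the unparenthesised form, where `==` is evaluated before `|`, and the `%in%` spelling that expresses the intended "November or December" test.

```r
# Hypothetical months for four flights (not taken from `flights`)
month <- c(1, 11, 12, 3)

# `|` treats any non-zero number as TRUE, so this is a single TRUE
or_value <- 11 | 12

# Comparing a number with TRUE coerces TRUE to 1, so only January matches
jan_only <- month == (11 | 12)

# Without parentheses `==` runs first, and `<anything> | 12` is TRUE
always_true <- month == 11 | 12

# What was actually meant: membership in a set of months
intended <- month %in% c(11, 12)
```

Printing `jan_only`, `always_true`, and `intended` side by side makes the difference between the three spellings easy to see.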
|
||||
|
@ -77,6 +77,12 @@ You'll learn how to create new variables shortly.
|
|||
summarise(hour_prop = mean(arr_delay > 60))
|
||||
```
|
||||
|
||||
`cumany()` `cumall()`
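
As a rough illustration of the two cumulative helpers named above, here is a minimal sketch on a made-up vector of delays (the `delay` values are assumptions, not data from `flights`). `cumany()` turns `TRUE` at the first `TRUE` and stays there; `cumall()` stays `TRUE` only until the first `FALSE`.

```r
library(dplyr)

# Hypothetical arrival delays, in minutes, for five consecutive flights
delay <- c(10, 45, 90, 5, 120)
late <- delay > 60

# TRUE from the first late flight onwards
seen_late <- cumany(late)

# TRUE only while no late flight has been seen yet
still_on_time <- cumall(!late)
```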
|
||||
|
||||
### Exercises
|
||||
|
||||
1. For each plane, count the number of flights before the first delay of greater than 1 hour.
|
||||
|
||||
## Basic math
|
||||
|
||||
There are many functions for creating new variables that you can use with `mutate()`.
|
||||
|
@ -121,6 +127,12 @@ There's no way to list every possible function that you might use, but here's a
|
|||
cummean(x)
|
||||
```
|
||||
|
||||
### Recycling rules
|
||||
|
||||
Base R.
|
||||
|
||||
Tidyverse.
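
The two headings above are only an outline, so the following is a sketch of the base R side under that assumption: shorter vectors are repeated to match the longer one, and R only warns when the lengths do not divide evenly. The tidyverse is stricter; `tibble()`, for instance, only recycles vectors of length one.

```r
x <- 1:6

# A length-1 vector is recycled to every element
plus_one <- x + 1

# A length-2 vector is recycled three times: 10, 20, 10, 20, 10, 20
plus_pair <- x + c(10, 20)

# Lengths 5 and 2 do not divide evenly: base R still recycles, but warns
warned <- tryCatch(
  {
    1:5 + 1:2
    FALSE
  },
  warning = function(w) TRUE
)
```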
|
||||
|
||||
## Summaries
|
||||
|
||||
Just using means, counts, and sum can get you a long way, but R provides many other useful summary functions:
|
||||
|
@ -175,6 +187,22 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
|
|||
)
|
||||
```
|
||||
|
||||
### Exercises
|
||||
|
||||
1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights.
|
||||
Consider the following scenarios:
|
||||
|
||||
- A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
|
||||
|
||||
- A flight is always 10 minutes late.
|
||||
|
||||
- A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
|
||||
|
||||
- 99% of the time a flight is on time.
|
||||
1% of the time it's 2 hours late.
|
||||
|
||||
Which is more important: arrival delay or departure delay?
|
||||
|
||||
## Floating point
|
||||
|
||||
There's another common problem you might encounter when using `==`: floating point numbers.
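
A minimal base R sketch of the problem, assuming a hypothetical helper `is_close()` that compares with a tolerance (it is not a dplyr or base function; the tolerance value is an arbitrary choice for illustration):

```r
# Both look like they should be TRUE, but the stored values are only
# approximations, so exact comparison fails
exact_sqrt <- sqrt(2)^2 == 2
exact_div <- 1 / 49 * 49 == 1

# Hypothetical helper: treat two numbers as equal if they differ by
# less than a small tolerance
is_close <- function(x, y, tol = 1e-8) {
  abs(x - y) < tol
}

close_sqrt <- is_close(sqrt(2)^2, 2)
close_div <- is_close(1 / 49 * 49, 1)
```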
|
||||
|
@ -195,5 +223,6 @@ near(1 / 49 * 49, 1)
|
|||
|
||||
## Exercises
|
||||
|
||||
1. How could you use `arrange()` to sort all missing values to the start?
|
||||
(Hint: use `!is.na()`).
|
||||
1. What trigonometric functions does R provide?
|
||||
2.
|
||||
|
||||
|
|
|
@ -46,6 +46,21 @@ If you want to determine if a value is missing, use `is.na()`:
|
|||
is.na(x)
|
||||
```
|
||||
|
||||
### Exercises
|
||||
|
||||
1. How many flights have a missing `dep_time`?
|
||||
What other variables are missing?
|
||||
What might these rows represent?
|
||||
|
||||
2. How could you use `arrange()` to sort all missing values to the start?
|
||||
(Hint: use `!is.na()`).
|
||||
|
||||
3. Come up with another approach that will give you the same output as `not_cancelled %>% count(dest)` and `not_cancelled %>% count(tailnum, wt = distance)` (without using `count()`).
|
||||
|
||||
4. Look at the number of cancelled flights per day.
|
||||
Is there a pattern?
|
||||
Is the proportion of cancelled flights related to the average delay?
|
||||
|
||||
## Explicit vs implicit missing values {#missing-values-tidy}
|
||||
|
||||
Changing the representation of a dataset brings up an important subtlety of missing values.
|
||||
|
@ -151,8 +166,8 @@ arrange(df, desc(x))
|
|||
|
||||
## Exercises
|
||||
|
||||
1. Why is `NA ^ 0` not missing?
|
||||
Why is `NA | TRUE` not missing?
|
||||
Why is `FALSE & NA` not missing?
|
||||
Can you figure out the general rule?
|
||||
(`NA * 0` is a tricky counterexample!)
|
||||
1. Why is `NA ^ 0` not missing? Why is `NA | TRUE` not missing? Why is `FALSE & NA` not missing? Can you figure out the general rule? (`NA * 0` is a tricky counterexample!)
|
||||
|
||||
### Missing matches
|
||||
|
||||
Discuss `anti_join()`
|
||||
|
|
|
@ -102,3 +102,74 @@ not_cancelled <- flights %>%
|
|||
mutate(r = min_rank(desc(dep_time))) %>%
|
||||
filter(r %in% range(r))
|
||||
```
|
||||
|
||||
### dplyr
|
||||
|
||||
```{r}
|
||||
flights_sml <- select(flights,
|
||||
year:day,
|
||||
ends_with("delay"),
|
||||
distance,
|
||||
air_time
|
||||
)
|
||||
```
|
||||
|
||||
- Find the worst members of each group:
|
||||
|
||||
```{r}
|
||||
flights_sml %>%
|
||||
group_by(year, month, day) %>%
|
||||
filter(rank(desc(arr_delay)) < 10)
|
||||
```
|
||||
|
||||
- Find all groups bigger than a threshold:
|
||||
|
||||
```{r}
|
||||
popular_dests <- flights %>%
|
||||
group_by(dest) %>%
|
||||
filter(n() > 365)
|
||||
popular_dests
|
||||
```
|
||||
|
||||
- Standardise to compute per group metrics:
|
||||
|
||||
```{r}
|
||||
popular_dests %>%
|
||||
filter(arr_delay > 0) %>%
|
||||
mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
|
||||
select(year:day, dest, arr_delay, prop_delay)
|
||||
```
|
||||
|
||||
A grouped filter is a grouped mutate followed by an ungrouped filter.
|
||||
I generally avoid them except for quick and dirty manipulations: otherwise it's hard to check that you've done the manipulation correctly.
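
To make the equivalence concrete, here is a small sketch on a made-up tibble (the columns `g`, `x`, and `is_max` are assumptions for illustration). Writing the grouped mutate explicitly leaves an intermediate column you can inspect before filtering, which is what makes the two-step form easier to check.

```r
library(dplyr)

# Hypothetical data: two groups with a numeric value each
df <- tibble(
  g = c("a", "a", "b", "b", "b"),
  x = c(1, 5, 2, 8, 3)
)

# Grouped filter: keep the largest x within each group
grouped <- df %>%
  group_by(g) %>%
  filter(x == max(x)) %>%
  ungroup()

# The same thing as a grouped mutate followed by an ungrouped filter
two_step <- df %>%
  group_by(g) %>%
  mutate(is_max = x == max(x)) %>%
  ungroup() %>%
  filter(is_max) %>%
  select(-is_max)
```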
|
||||
|
||||
Functions that work most naturally in grouped mutates and filters are known as window functions (vs. the summary functions used for summaries).
|
||||
You can learn more about useful window functions in the corresponding vignette: `vignette("window-functions")`.
|
||||
|
||||
### Exercises
|
||||
|
||||
1. Find the 10 most delayed flights using a ranking function.
|
||||
How do you want to handle ties?
|
||||
Carefully read the documentation for `min_rank()`.
|
||||
|
||||
2. Which plane (`tailnum`) has the worst on-time record?
|
||||
|
||||
3. What time of day should you fly if you want to avoid delays as much as possible?
|
||||
|
||||
4. For each destination, compute the total minutes of delay.
|
||||
For each flight, compute the proportion of the total delay for its destination.
|
||||
|
||||
5. Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave.
|
||||
Using `lag()`, explore how the delay of a flight is related to the delay of the immediately preceding flight.
|
||||
|
||||
6. Look at each destination.
|
||||
Can you find flights that are suspiciously fast?
|
||||
(i.e. flights that represent a potential data entry error).
|
||||
Compute the air time of a flight relative to the shortest flight to that destination.
|
||||
Which flights were most delayed in the air?
|
||||
|
||||
7. Find all destinations that are flown by at least two carriers.
|
||||
Use that information to rank the carriers.
|
||||
|
||||
8.
|
||||
|
||||
|
|