Break up data-transform content

Hadley Wickham 2021-04-19 07:56:29 -05:00
parent 40d7bcb5d0
commit 861e27026e
4 changed files with 351 additions and 335 deletions

@ -117,115 +117,6 @@ When this happens you'll get an informative error:
filter(flights, month = 1)
```
There's another common problem you might encounter when using `==`: floating point numbers.
These results might surprise you!
```{r}
(sqrt(2) ^ 2) == 2
(1 / 49 * 49) == 1
```
Computers use finite precision arithmetic (they obviously can't store an infinite number of digits!) so remember that every number you see is an approximation.
Instead of relying on `==`, use `near()`:
```{r}
near(sqrt(2) ^ 2, 2)
near(1 / 49 * 49, 1)
```
### Logical operators
Multiple arguments to `filter()` are combined with "and": every expression must be true in order for a row to be included in the output.
For other types of combinations, you'll need to use Boolean operators yourself: `&` is "and", `|` is "or", and `!` is "not".
Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations.
```{r bool-ops, echo = FALSE, fig.cap = "Complete set of Boolean operations. `x` is the left-hand circle, `y` is the right-hand circle, and the shaded region shows which parts each operator selects.", fig.alt = "Six Venn diagrams, each explaining a given logical operator. The circles (sets) in each of the Venn diagrams represent x and y. 1. y & !x is y but none of x, x & y is the intersection of x and y, x & !y is x but none of y, x is all of x none of y, xor(x, y) is everything except the intersection of x and y, y is all of y none of x, and x | y is everything."}
knitr::include_graphics("diagrams/transform-logical.png")
```
The following code finds all flights that departed in November or December:
```{r, eval = FALSE}
filter(flights, month == 11 | month == 12)
```
The order of operations doesn't work like English.
You can't write `filter(flights, month == (11 | 12))`, which you might literally translate into "finds all flights that departed in November or December".
Instead it finds all months that equal `11 | 12`, an expression that evaluates to `TRUE`.
In a numeric context (like here), `TRUE` becomes `1`, so this finds all flights in January, not November or December.
This is quite confusing!
A useful short-hand for this problem is `x %in% y`.
This will select every row where `x` is one of the values in `y`.
We could use it to rewrite the code above:
```{r, eval = FALSE}
nov_dec <- filter(flights, month %in% c(11, 12))
```
Sometimes you can simplify complicated subsetting by remembering De Morgan's law: `!(x & y)` is the same as `!x | !y`, and `!(x | y)` is the same as `!x & !y`.
For example, if you wanted to find flights that weren't delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:
```{r, eval = FALSE}
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
```
As well as `&` and `|`, R also has `&&` and `||`.
Don't use them here!
You'll learn when you should use them in Section \@ref(conditional-execution) on conditional execution.
Whenever you start using complicated, multipart expressions in `filter()`, consider making them explicit variables instead.
That makes it much easier to check your work.
You'll learn how to create new variables shortly.
### Missing values {#missing-values-filter}
One important feature of R that can make comparison tricky is missing values, or `NA`s ("not availables").
`NA` represents an unknown value so missing values are "contagious": almost any operation involving an unknown value will also be unknown.
```{r}
NA > 5
10 == NA
NA + 10
NA / 2
```
The most confusing result is this one:
```{r}
NA == NA
```
It's easiest to understand why this is true with a bit more context:
```{r}
# Let x be Mary's age. We don't know how old she is.
x <- NA
# Let y be John's age. We don't know how old he is.
y <- NA
# Are John and Mary the same age?
x == y
# We don't know!
```
If you want to determine if a value is missing, use `is.na()`:
```{r}
is.na(x)
```
`filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values.
If you want to preserve missing values, ask for them explicitly:
```{r}
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
filter(df, is.na(x) | x > 1)
```
### Exercises
1. Find all flights that
@ -238,20 +129,10 @@ filter(df, is.na(x) | x > 1)
f. Were delayed by at least an hour, but made up over 30 minutes in flight
g. Departed between midnight and 6am (inclusive)
2. Another useful dplyr filtering helper is `between()`.
What does it do?
Can you use it to simplify the code needed to answer the previous challenges?
3. How many flights have a missing `dep_time`?
2. How many flights have a missing `dep_time`?
What other variables are missing?
What might these rows represent?
4. Why is `NA ^ 0` not missing?
Why is `NA | TRUE` not missing?
Why is `FALSE & NA` not missing?
Can you figure out the general rule?
(`NA * 0` is a tricky counterexample!)
## Arrange rows with `arrange()`
`arrange()` works similarly to `filter()` except that instead of selecting rows, it changes their order.
@ -268,14 +149,6 @@ Use `desc()` to re-order by a column in descending order:
arrange(flights, desc(dep_delay))
```
Missing values are always sorted at the end:
```{r}
df <- tibble(x = c(5, 2, NA))
arrange(df, x)
arrange(df, desc(x))
```
### Exercises
1. Sort `flights` to find the flights with longest departure delays.
@ -286,9 +159,6 @@ arrange(df, desc(x))
3. Which flights travelled the farthest?
Which travelled the shortest?
4. How could you use `arrange()` to sort all missing values to the start?
(Hint: use `!is.na()`).
## Select columns with `select()` {#select}
It's not uncommon to get datasets with hundreds or even thousands of variables.
@ -396,80 +266,6 @@ transmute(flights,
)
```
### Useful creation functions {#mutate-funs}
There are many functions for creating new variables that you can use with `mutate()`.
The key property is that the function must be vectorised: it must take a vector of values as input and return a vector with the same number of values as output.
There's no way to list every possible function that you might use, but here's a selection of functions that are frequently useful:
- Arithmetic operators: `+`, `-`, `*`, `/`, `^`.
These are all vectorised, using the so-called "recycling rules".
If one argument is shorter than the other, it will be automatically extended to be the same length.
This is most useful when one of the arguments is a single number: `air_time / 60`, `hour * 60 + minute`, etc.
Arithmetic operators are also useful in conjunction with the aggregate functions you'll learn about later.
For example, `x / sum(x)` calculates the proportion of a total, and `y - mean(y)` computes the difference from the mean.
- Modular arithmetic: `%/%` (integer division) and `%%` (remainder), where `x == y * (x %/% y) + (x %% y)`.
Modular arithmetic is a handy tool because it allows you to break integers up into pieces.
For example, in the flights dataset, you can compute `hour` and `minute` from `dep_time` with:
```{r}
transmute(flights,
dep_time,
hour = dep_time %/% 100,
minute = dep_time %% 100
)
```
- Logs: `log()`, `log2()`, `log10()`.
Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude.
They also convert multiplicative relationships to additive.
All else being equal, I recommend using `log2()` because it's easy to interpret: a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving.
- Offsets: `lead()` and `lag()` allow you to refer to leading or lagging values.
This allows you to compute running differences (e.g. `x - lag(x)`) or find when values change (`x != lag(x)`).
They are most useful in conjunction with `group_by()`, which you'll learn about shortly.
```{r}
(x <- 1:10)
lag(x)
lead(x)
```
- Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: `cumsum()`, `cumprod()`, `cummin()`, `cummax()`; and dplyr provides `cummean()` for cumulative means.
If you need rolling aggregates (i.e. a sum computed over a rolling window), try the RcppRoll package.
```{r}
x
cumsum(x)
cummean(x)
```
- Logical comparisons: `<`, `<=`, `>`, `>=`, `!=`, and `==`, which you learned about earlier.
If you're doing a complex sequence of logical operations it's often a good idea to store the interim values in new variables so you can check that each step is working as expected.
- Ranking: there are a number of ranking functions, but you should start with `min_rank()`.
It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th).
The default gives the smallest values the smallest ranks; use `desc(x)` to give the largest values the smallest ranks.
```{r}
y <- c(1, 2, 2, NA, 3, 4)
min_rank(y)
min_rank(desc(y))
```
If `min_rank()` doesn't do what you need, look at the variants `row_number()`, `dense_rank()`, `percent_rank()`, `cume_dist()`, `ntile()`.
See their help pages for more details.
```{r}
row_number(y)
dense_rank(y)
percent_rank(y)
cume_dist(y)
```
### Exercises
```{r, eval = FALSE, echo = FALSE}
@ -588,7 +384,7 @@ Working with the pipe is one of the key criteria for belonging to the tidyverse.
The only exception is ggplot2: it was written before the pipe was discovered.
Unfortunately, the next iteration of ggplot2, ggvis, which does use the pipe, isn't quite ready for prime time yet.
### Missing values {#missing-values-summarise}
## Missing values {#missing-values-summarise}
You may have wondered about the `na.rm` argument we used above.
What happens if we don't set it?
@ -621,7 +417,7 @@ not_cancelled %>%
summarise(mean = mean(dep_delay))
```
### Grouping by multiple variables
## Grouping by multiple variables
You can group a data frame by multiple variables as well.
Note that the grouping information is printed on top of the output.
@ -770,134 +566,6 @@ batters %>%
You can find a good explanation of this problem at <http://varianceexplained.org/r/empirical_bayes_baseball/> and <http://www.evanmiller.org/how-not-to-sort-by-average-rating.html>.
### Useful summary functions {#summarise-funs}
Just using means, counts, and sums can get you a long way, but R provides many other useful summary functions:
- Measures of location: we've used `mean(x)`, but `median(x)` is also useful.
The mean is the sum divided by the length; the median is a value where 50% of `x` is above it, and 50% is below it.
```{r}
not_cancelled %>%
group_by(month) %>%
summarise(
med_arr_delay = median(arr_delay),
med_dep_delay = median(dep_delay)
)
```
It's sometimes useful to combine aggregation with logical subsetting.
We haven't talked about this sort of subsetting yet, but you'll learn more about it in Section \@ref(vector-subsetting).
```{r}
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
avg_delay1 = mean(arr_delay),
avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
)
```
- Measures of spread: `sd(x)`, `IQR(x)`, `mad(x)`.
The root mean squared deviation, or standard deviation `sd(x)`, is the standard measure of spread.
The interquartile range `IQR(x)` and median absolute deviation `mad(x)` are robust equivalents that may be more useful if you have outliers.
```{r}
# Why is distance to some destinations more variable than to others?
not_cancelled %>%
group_by(dest) %>%
summarise(distance_sd = sd(distance)) %>%
arrange(desc(distance_sd))
```
- Measures of rank: `min(x)`, `quantile(x, 0.25)`, `max(x)`.
Quantiles are a generalisation of the median.
For example, `quantile(x, 0.25)` will find a value of `x` that is greater than 25% of the values, and less than the remaining 75%.
```{r}
# When do the first and last flights leave each day?
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
first = min(dep_time),
last = max(dep_time)
)
```
- Measures of position: `first(x)`, `nth(x, 2)`, `last(x)`.
These work similarly to `x[1]`, `x[2]`, and `x[length(x)]` but let you set a default value if that position does not exist (e.g. you're trying to get the 3rd element from a group that only has two elements).
For example, we can find the first and last departure for each day:
```{r}
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
first_dep = first(dep_time),
last_dep = last(dep_time)
)
```
These functions are complementary to filtering on ranks.
Filtering gives you all variables, with each observation in a separate row:
```{r}
not_cancelled %>%
group_by(year, month, day) %>%
mutate(r = min_rank(desc(dep_time))) %>%
filter(r %in% range(r))
```
- Counts: You've seen `n()`, which takes no arguments, and returns the size of the current group.
To count the number of non-missing values, use `sum(!is.na(x))`.
To count the number of distinct (unique) values, use `n_distinct(x)`.
```{r}
# Which destinations have the most carriers?
not_cancelled %>%
group_by(dest) %>%
summarise(carriers = n_distinct(carrier)) %>%
arrange(desc(carriers))
```
Counts are so useful that dplyr provides a simple helper if all you want is a count:
```{r}
not_cancelled %>%
count(dest)
```
Just like with `group_by()`, you can also provide multiple variables to `count()`.
```{r}
not_cancelled %>%
count(carrier, dest)
```
You can optionally provide a weight variable.
For example, you could use this to "count" (sum) the total number of miles a plane flew:
```{r}
not_cancelled %>%
count(tailnum, wt = distance)
```
- Counts and proportions of logical values: `sum(x > 10)`, `mean(y == 0)`.
When used with numeric functions, `TRUE` is converted to 1 and `FALSE` to 0.
This makes `sum()` and `mean()` very useful: `sum(x)` gives the number of `TRUE`s in `x`, and `mean(x)` gives the proportion.
```{r}
# How many flights left before 5am? (these usually indicate delayed
# flights from the previous day)
not_cancelled %>%
group_by(year, month, day) %>%
summarise(n_early = sum(dep_time < 500))
# What proportion of flights are delayed by more than an hour?
not_cancelled %>%
group_by(year, month, day) %>%
summarise(hour_prop = mean(arr_delay > 60))
```
### Exercises
1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights.

@ -1,3 +1,191 @@
# Logicals and numbers {#logicals-numbers}
## Introduction
`between()`
## Logical operators
Multiple arguments to `filter()` are combined with "and": every expression must be true in order for a row to be included in the output.
For other types of combinations, you'll need to use Boolean operators yourself: `&` is "and", `|` is "or", and `!` is "not".
Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations.
```{r bool-ops, echo = FALSE, fig.cap = "Complete set of Boolean operations. `x` is the left-hand circle, `y` is the right-hand circle, and the shaded region shows which parts each operator selects.", fig.alt = "Six Venn diagrams, each explaining a given logical operator. The circles (sets) in each of the Venn diagrams represent x and y. 1. y & !x is y but none of x, x & y is the intersection of x and y, x & !y is x but none of y, x is all of x none of y, xor(x, y) is everything except the intersection of x and y, y is all of y none of x, and x | y is everything."}
knitr::include_graphics("diagrams/transform-logical.png")
```
The following code finds all flights that departed in November or December:
```{r, eval = FALSE}
filter(flights, month == 11 | month == 12)
```
The order of operations doesn't work like English.
You can't write `filter(flights, month == (11 | 12))`, which you might literally translate into "finds all flights that departed in November or December".
Instead it finds all months that equal `11 | 12`, an expression that evaluates to `TRUE`.
In a numeric context (like here), `TRUE` becomes `1`, so this finds all flights in January, not November or December.
This is quite confusing!
A useful short-hand for this problem is `x %in% y`.
This will select every row where `x` is one of the values in `y`.
We could use it to rewrite the code above:
```{r, eval = FALSE}
nov_dec <- filter(flights, month %in% c(11, 12))
```
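To see why the `month == (11 | 12)` version goes wrong, it can help to evaluate the pieces on a small vector. The sketch below uses a made-up `month` vector (not the `flights` data) and only base R:

```r
# Hypothetical months for five flights (illustrative data, not from `flights`)
month <- c(1, 11, 12, 6, 11)

# `|` coerces both numbers to logical, so this is TRUE | TRUE, i.e. TRUE
either <- 11 | 12

# Comparing a number with TRUE coerces TRUE to 1, so only January matches
wrong <- month == (11 | 12)

# Spelling out both comparisons, or using %in%, gives the intended result
right_or <- month == 11 | month == 12
right_in <- month %in% c(11, 12)

# The two correct forms agree element by element
identical(right_or, right_in)
```

Printing `wrong` next to `right_in` makes the January mix-up easy to spot.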
Sometimes you can simplify complicated subsetting by remembering De Morgan's law: `!(x & y)` is the same as `!x | !y`, and `!(x | y)` is the same as `!x & !y`.
For example, if you wanted to find flights that weren't delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:
```{r, eval = FALSE}
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
```
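If you want to convince yourself that the two filters select the same rows, one option is to compare the logical vectors directly. This is a minimal base R sketch on invented delay values, not on `flights`:

```r
# Made-up delays in minutes (illustrative values only)
arr_delay <- c(10, 130, 90, 200, 120)
dep_delay <- c(5, 20, 150, 180, 120)

# "Not (arrived late or departed late)"
not_either <- !(arr_delay > 120 | dep_delay > 120)

# De Morgan's law: "did not arrive late and did not depart late"
both_ok <- arr_delay <= 120 & dep_delay <= 120

# Both conditions flag exactly the same flights
identical(not_either, both_ok)
```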
As well as `&` and `|`, R also has `&&` and `||`.
Don't use them here!
You'll learn when you should use them in Section \@ref(conditional-execution) on conditional execution.
Whenever you start using complicated, multipart expressions in `filter()`, consider making them explicit variables instead.
That makes it much easier to check your work.
You'll learn how to create new variables shortly.
## Summaries
- Counts and proportions of logical values: `sum(x > 10)`, `mean(y == 0)`.
When used with numeric functions, `TRUE` is converted to 1 and `FALSE` to 0.
This makes `sum()` and `mean()` very useful: `sum(x)` gives the number of `TRUE`s in `x`, and `mean(x)` gives the proportion.
```{r}
# How many flights left before 5am? (these usually indicate delayed
# flights from the previous day)
not_cancelled %>%
group_by(year, month, day) %>%
summarise(n_early = sum(dep_time < 500))
# What proportion of flights are delayed by more than an hour?
not_cancelled %>%
group_by(year, month, day) %>%
summarise(hour_prop = mean(arr_delay > 60))
```
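The coercion of `TRUE` to 1 and `FALSE` to 0 is easiest to see on a tiny vector. A small sketch with invented values:

```r
# Illustrative values only
x <- c(3, 12, 15, 8)

# The comparison produces a logical vector
big <- x > 10

# sum() counts the TRUEs; mean() gives the share of TRUEs
n_big <- sum(big)
prop_big <- mean(big)
```

Here two of the four values exceed 10, so `n_big` is 2 and `prop_big` is 0.5.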
## Basic math
There are many functions for creating new variables that you can use with `mutate()`.
The key property is that the function must be vectorised: it must take a vector of values as input and return a vector with the same number of values as output.
There's no way to list every possible function that you might use, but here's a selection of functions that are frequently useful:
- Arithmetic operators: `+`, `-`, `*`, `/`, `^`.
These are all vectorised, using the so-called "recycling rules".
If one argument is shorter than the other, it will be automatically extended to be the same length.
This is most useful when one of the arguments is a single number: `air_time / 60`, `hour * 60 + minute`, etc.
Arithmetic operators are also useful in conjunction with the aggregate functions you'll learn about later.
For example, `x / sum(x)` calculates the proportion of a total, and `y - mean(y)` computes the difference from the mean.
- Modular arithmetic: `%/%` (integer division) and `%%` (remainder), where `x == y * (x %/% y) + (x %% y)`.
Modular arithmetic is a handy tool because it allows you to break integers up into pieces.
For example, in the flights dataset, you can compute `hour` and `minute` from `dep_time` with:
```{r}
transmute(flights,
dep_time,
hour = dep_time %/% 100,
minute = dep_time %% 100
)
```
- Logs: `log()`, `log2()`, `log10()`.
Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude.
They also convert multiplicative relationships to additive.
All else being equal, I recommend using `log2()` because it's easy to interpret: a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving.
- Logical comparisons: `<`, `<=`, `>`, `>=`, `!=`, and `==`, which you learned about earlier.
If you're doing a complex sequence of logical operations it's often a good idea to store the interim values in new variables so you can check that each step is working as expected.
- Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: `cumsum()`, `cumprod()`, `cummin()`, `cummax()`; and dplyr provides `cummean()` for cumulative means.
If you need rolling aggregates (i.e. a sum computed over a rolling window), try the RcppRoll package.
```{r}
(x <- 1:10)
cumsum(x)
cummean(x)
```
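Two of the claims in the list above can be checked by hand in base R: the identity behind `%/%` and `%%`, and the "difference of 1 means doubling" reading of `log2()`. The vectors below are invented for illustration:

```r
# Hypothetical departure times in HHMM format (illustrative values)
dep_time <- c(517, 1345, 2359)

# Integer division and remainder split HHMM into its two pieces
hour <- dep_time %/% 100
minute <- dep_time %% 100

# The identity x == y * (x %/% y) + (x %% y) holds for every element
identity_holds <- all(dep_time == 100 * hour + minute)

# Recycling: the single numbers 60 and 100 are reused for every element
mins_since_midnight <- hour * 60 + minute

# On the log2 scale, each doubling adds 1
sizes <- c(1, 2, 4, 8)
log_steps <- diff(log2(sizes))
```

With these inputs, `mins_since_midnight` is 317, 825, and 1439, and every entry of `log_steps` is 1.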
## Summaries
Just using means, counts, and sums can get you a long way, but R provides many other useful summary functions:
- Measures of location: we've used `mean(x)`, but `median(x)` is also useful.
The mean is the sum divided by the length; the median is a value where 50% of `x` is above it, and 50% is below it.
```{r}
not_cancelled %>%
group_by(month) %>%
summarise(
med_arr_delay = median(arr_delay),
med_dep_delay = median(dep_delay)
)
```
It's sometimes useful to combine aggregation with logical subsetting.
We haven't talked about this sort of subsetting yet, but you'll learn more about it in Section \@ref(vector-subsetting).
```{r}
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
avg_delay1 = mean(arr_delay),
avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
)
```
- Measures of spread: `sd(x)`, `IQR(x)`, `mad(x)`.
The root mean squared deviation, or standard deviation `sd(x)`, is the standard measure of spread.
The interquartile range `IQR(x)` and median absolute deviation `mad(x)` are robust equivalents that may be more useful if you have outliers.
```{r}
# Why is distance to some destinations more variable than to others?
not_cancelled %>%
group_by(dest) %>%
summarise(distance_sd = sd(distance)) %>%
arrange(desc(distance_sd))
```
- Measures of rank: `min(x)`, `quantile(x, 0.25)`, `max(x)`.
Quantiles are a generalisation of the median.
For example, `quantile(x, 0.25)` will find a value of `x` that is greater than 25% of the values, and less than the remaining 75%.
```{r}
# When do the first and last flights leave each day?
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
first = min(dep_time),
last = max(dep_time)
)
```
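To get a feel for why the robust measures matter, you can compare them on a small vector with one extreme value. This is a sketch on made-up numbers, using only base R and its default quantile method:

```r
# Illustrative data with a single large outlier
x <- c(1, 2, 3, 4, 100)

# Location: the outlier drags the mean but not the median
x_mean <- mean(x)
x_median <- median(x)

# Spread: sd() reacts strongly to the outlier, IQR() and mad() much less
x_sd <- sd(x)
x_iqr <- IQR(x)
x_mad <- mad(x)

# The first quartile, using R's default quantile algorithm
q25 <- quantile(x, 0.25)
```

Here the mean is 22 while the median is 3, and the standard deviation is many times larger than the interquartile range of 2.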
## Floating point
There's another common problem you might encounter when using `==`: floating point numbers.
These results might surprise you!
```{r}
(sqrt(2) ^ 2) == 2
(1 / 49 * 49) == 1
```
Computers use finite precision arithmetic (they obviously can't store an infinite number of digits!) so remember that every number you see is an approximation.
Instead of relying on `==`, use `near()`:
```{r}
near(sqrt(2) ^ 2, 2)
near(1 / 49 * 49, 1)
```
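The idea behind a tolerance-based comparison can be sketched in base R. The tolerance used here, `.Machine$double.eps^0.5`, is an assumption about a sensible default rather than a statement of how any particular function is implemented:

```r
# The squared square root is not exactly 2 in floating point
x <- sqrt(2)^2
exactly_two <- x == 2

# The difference is tiny but not zero
gap <- x - 2

# Treat values as equal when they differ by less than a small tolerance
tol <- .Machine$double.eps^0.5
close_enough <- abs(x - 2) < tol

# Base R's all.equal() also compares with a tolerance
approx_one <- isTRUE(all.equal(1 / 49 * 49, 1))
```

Wrapping `all.equal()` in `isTRUE()` matters because it returns a description of the difference, not `FALSE`, when the values are not close.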
## Exercises
1. How could you use `arrange()` to sort all missing values to the start?
(Hint: use `!is.na()`).

@ -1,3 +1,70 @@
# Missing values {#missing-values}
## Introduction
## Basics
### Missing values {#missing-values-filter}
One important feature of R that can make comparison tricky is missing values, or `NA`s ("not availables").
`NA` represents an unknown value so missing values are "contagious": almost any operation involving an unknown value will also be unknown.
```{r}
NA > 5
10 == NA
NA + 10
NA / 2
```
The most confusing result is this one:
```{r}
NA == NA
```
It's easiest to understand why this is true with a bit more context:
```{r}
# Let x be Mary's age. We don't know how old she is.
x <- NA
# Let y be John's age. We don't know how old he is.
y <- NA
# Are John and Mary the same age?
x == y
# We don't know!
```
If you want to determine if a value is missing, use `is.na()`:
```{r}
is.na(x)
```
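The same reasoning explains why `== NA` can't be used to find missing values in a vector. A short sketch on invented ages:

```r
# Illustrative ages; the second one is unknown
ages <- c(31, NA, 27)

# Comparing with NA yields NA for every element, known or not
compared <- ages == NA

# is.na() answers the question that was actually meant
missing <- is.na(ages)

# Because TRUE counts as 1, sum() gives the number of missing values
n_missing <- sum(missing)
```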
## dplyr verbs
`filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values.
If you want to preserve missing values, ask for them explicitly:
```{r}
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
filter(df, is.na(x) | x > 1)
```
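It can help to look at the logical vectors that each condition produces. The sketch below is base R only and also shows, for contrast, that base subsetting with `[` treats an `NA` condition differently from the `filter()` behaviour described above:

```r
x <- c(1, NA, 3)

# The plain condition is NA where x is missing
cond <- x > 1

# TRUE | NA is TRUE, so asking for missing values explicitly keeps them
keep_na <- is.na(x) | x > 1

# Base `[` returns NA for an NA condition instead of dropping the element
base_subset <- x[cond]

# which() keeps only the positions where the condition is TRUE
true_only <- x[which(cond)]
```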
Missing values are always sorted at the end:
```{r}
df <- tibble(x = c(5, 2, NA))
arrange(df, x)
arrange(df, desc(x))
```
## Exercises
1. Why is `NA ^ 0` not missing?
Why is `NA | TRUE` not missing?
Why is `FALSE & NA` not missing?
Can you figure out the general rule?
(`NA * 0` is a tricky counterexample!)

@ -1,3 +1,96 @@
# Vector tools
## Introduction
`%in%`
## Counts
- Counts: You've seen `n()`, which takes no arguments, and returns the size of the current group.
To count the number of non-missing values, use `sum(!is.na(x))`.
To count the number of distinct (unique) values, use `n_distinct(x)`.
```{r}
# Which destinations have the most carriers?
not_cancelled %>%
group_by(dest) %>%
summarise(carriers = n_distinct(carrier)) %>%
arrange(desc(carriers))
```
Counts are so useful that dplyr provides a simple helper if all you want is a count:
```{r}
not_cancelled %>%
count(dest)
```
Just like with `group_by()`, you can also provide multiple variables to `count()`.
```{r}
not_cancelled %>%
count(carrier, dest)
```
You can optionally provide a weight variable.
For example, you could use this to "count" (sum) the total number of miles a plane flew:
```{r}
not_cancelled %>%
count(tailnum, wt = distance)
```
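For intuition about what these counts compute, here is a rough base R sketch on a tiny invented table. It is not how dplyr implements `n_distinct()` or `count()`, just the same arithmetic spelled out:

```r
# Made-up flights (illustrative values only)
dest <- c("ATL", "ATL", "BOS", "ATL", "BOS")
carrier <- c("DL", "FL", "B6", "DL", "B6")
distance <- c(762, 762, 187, 762, 187)

# Distinct carriers per destination
n_carriers <- tapply(carrier, dest, function(v) length(unique(v)))

# Plain count of flights per destination
n_flights <- table(dest)

# A weighted "count" is a sum of the weight within each group
miles <- tapply(distance, dest, sum)
```

In this toy data, ATL has three flights from two carriers, and the weighted count for BOS is 374 miles.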
## Window functions
- Offsets: `lead()` and `lag()` allow you to refer to leading or lagging values.
This allows you to compute running differences (e.g. `x - lag(x)`) or find when values change (`x != lag(x)`).
They are most useful in conjunction with `group_by()`, which you'll learn about shortly.
```{r}
(x <- 1:10)
lag(x)
lead(x)
```
- Ranking: there are a number of ranking functions, but you should start with `min_rank()`.
It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th).
The default gives the smallest values the smallest ranks; use `desc(x)` to give the largest values the smallest ranks.
```{r}
y <- c(1, 2, 2, NA, 3, 4)
min_rank(y)
min_rank(desc(y))
```
If `min_rank()` doesn't do what you need, look at the variants `row_number()`, `dense_rank()`, `percent_rank()`, `cume_dist()`, `ntile()`.
See their help pages for more details.
```{r}
row_number(y)
dense_rank(y)
percent_rank(y)
cume_dist(y)
```
- Measures of position: `first(x)`, `nth(x, 2)`, `last(x)`.
These work similarly to `x[1]`, `x[2]`, and `x[length(x)]` but let you set a default value if that position does not exist (e.g. you're trying to get the 3rd element from a group that only has two elements).
For example, we can find the first and last departure for each day:
```{r}
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
first_dep = first(dep_time),
last_dep = last(dep_time)
)
```
These functions are complementary to filtering on ranks.
Filtering gives you all variables, with each observation in a separate row:
```{r}
not_cancelled %>%
group_by(year, month, day) %>%
mutate(r = min_rank(desc(dep_time))) %>%
filter(r %in% range(r))
```
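The offset and ranking functions above can be approximated in base R, which may help clarify what they compute. The helper names `lag_x` and `lead_x` are made up for this sketch, and the `rank()` calls are a base R stand-in, not dplyr's implementation:

```r
# Illustrative values
x <- c(1, 3, 6, 10)

# Shift the values one position later or earlier, padding with NA
lag_x <- c(NA, head(x, -1))
lead_x <- c(tail(x, -1), NA)

# Running differences: each value minus the one before it
running_diff <- x - lag_x

# Ties share the lowest rank; missing values keep a missing rank
y <- c(1, 2, 2, NA, 3, 4)
rank_asc <- rank(y, ties.method = "min", na.last = "keep")

# Negating the values ranks the largest first
rank_desc <- rank(-y, ties.method = "min", na.last = "keep")
```

With these inputs, `running_diff` is `NA 2 3 4`, `rank_asc` is `1 2 2 NA 4 5`, and `rank_desc` is `5 3 3 NA 2 1`.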