From 861e27026e1b04e3c9cee4ffb23d04991c83963f Mon Sep 17 00:00:00 2001 From: Hadley Wickham Date: Mon, 19 Apr 2021 07:56:29 -0500 Subject: [PATCH] Break up data-transform content --- data-transform.Rmd | 338 +------------------------------------------ logicals-numbers.Rmd | 188 ++++++++++++++++++++++++ missing-values.Rmd | 67 +++++++++ vector-tools.Rmd | 93 ++++++++++++ 4 files changed, 351 insertions(+), 335 deletions(-) diff --git a/data-transform.Rmd b/data-transform.Rmd index e3477a0..1e9f08f 100644 --- a/data-transform.Rmd +++ b/data-transform.Rmd @@ -117,115 +117,6 @@ When this happens you'll get an informative error: filter(flights, month = 1) ``` -There's another common problem you might encounter when using `==`: floating point numbers. -These results might surprise you! - -```{r} -(sqrt(2) ^ 2) == 2 -(1 / 49 * 49) == 1 -``` - -Computers use finite precision arithmetic (they obviously can't store an infinite number of digits!) so remember that every number you see is an approximation. -Instead of relying on `==`, use `near()`: - -```{r} -near(sqrt(2) ^ 2, 2) -near(1 / 49 * 49, 1) -``` - -### Logical operators - -Multiple arguments to `filter()` are combined with "and": every expression must be true in order for a row to be included in the output. -For other types of combinations, you'll need to use Boolean operators yourself: `&` is "and", `|` is "or", and `!` is "not". -Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations. - -```{r bool-ops, echo = FALSE, fig.cap = "Complete set of boolean operations. `x` is the left-hand circle, `y` is the right-hand circle, and the shaded region show which parts each operator selects.", fig.alt = "Six Venn diagrams, each explaining a given logical operator. The circles (sets) in each of the Venn diagrams represent x and y. 1. 
y & !x is y but none of x, x & y is the intersection of x and y, x & !y is x but none of y, x is all of x none of y, xor(x, y) is everything except the intersection of x and y, y is all of y none of x, and x | y is everything."} -knitr::include_graphics("diagrams/transform-logical.png") -``` - -The following code finds all flights that departed in November or December: - -```{r, eval = FALSE} -filter(flights, month == 11 | month == 12) -``` - -The order of operations doesn't work like English. -You can't write `filter(flights, month == (11 | 12))`, which you might literally translate into "finds all flights that departed in November or December". -Instead it finds all months that equal `11 | 12`, an expression that evaluates to `TRUE`. -In a numeric context (like here), `TRUE` becomes `1`, so this finds all flights in January, not November or December. -This is quite confusing! - -A useful short-hand for this problem is `x %in% y`. -This will select every row where `x` is one of the values in `y`. -We could use it to rewrite the code above: - -```{r, eval = FALSE} -nov_dec <- filter(flights, month %in% c(11, 12)) -``` - -Sometimes you can simplify complicated subsetting by remembering De Morgan's law: `!(x & y)` is the same as `!x | !y`, and `!(x | y)` is the same as `!x & !y`. -For example, if you wanted to find flights that weren't delayed (on arrival or departure) by more than two hours, you could use either of the following two filters: - -```{r, eval = FALSE} -filter(flights, !(arr_delay > 120 | dep_delay > 120)) -filter(flights, arr_delay <= 120, dep_delay <= 120) -``` - -As well as `&` and `|`, R also has `&&` and `||`. -Don't use them here! -You'll learn when you should use them in Section \@ref(conditional-execution) on conditional execution. - -Whenever you start using complicated, multipart expressions in `filter()`, consider making them explicit variables instead. -That makes it much easier to check your work. 
-You'll learn how to create new variables shortly. - -### Missing values {#missing-values-filter} - -One important feature of R that can make comparison tricky is missing values, or `NA`s ("not availables"). -`NA` represents an unknown value so missing values are "contagious": almost any operation involving an unknown value will also be unknown. - -```{r} -NA > 5 -10 == NA -NA + 10 -NA / 2 -``` - -The most confusing result is this one: - -```{r} -NA == NA -``` - -It's easiest to understand why this is true with a bit more context: - -```{r} -# Let x be Mary's age. We don't know how old she is. -x <- NA - -# Let y be John's age. We don't know how old he is. -y <- NA - -# Are John and Mary the same age? -x == y -# We don't know! -``` - -If you want to determine if a value is missing, use `is.na()`: - -```{r} -is.na(x) -``` - -`filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values. -If you want to preserve missing values, ask for them explicitly: - -```{r} -df <- tibble(x = c(1, NA, 3)) -filter(df, x > 1) -filter(df, is.na(x) | x > 1) -``` - ### Exercises 1. Find all flights that @@ -238,20 +129,10 @@ filter(df, is.na(x) | x > 1) f. Were delayed by at least an hour, but made up over 30 minutes in flight g. Departed between midnight and 6am (inclusive) -2. Another useful dplyr filtering helper is `between()`. - What does it do? - Can you use it to simplify the code needed to answer the previous challenges? - -3. How many flights have a missing `dep_time`? +2. How many flights have a missing `dep_time`? What other variables are missing? What might these rows represent? -4. Why is `NA ^ 0` not missing? - Why is `NA | TRUE` not missing? - Why is `FALSE & NA` not missing? - Can you figure out the general rule? - (`NA * 0` is a tricky counterexample!) - ## Arrange rows with `arrange()` `arrange()` works similarly to `filter()` except that instead of selecting rows, it changes their order. 
@@ -268,14 +149,6 @@ Use `desc()` to re-order by a column in descending order: arrange(flights, desc(dep_delay)) ``` -Missing values are always sorted at the end: - -```{r} -df <- tibble(x = c(5, 2, NA)) -arrange(df, x) -arrange(df, desc(x)) -``` - ### Exercises 1. Sort `flights` to find the flights with longest departure delays. @@ -286,9 +159,6 @@ arrange(df, desc(x)) 3. Which flights travelled the farthest? Which travelled the shortest? -4. How could you use `arrange()` to sort all missing values to the start? - (Hint: use `!is.na()`). - ## Select columns with `select()` {#select} It's not uncommon to get datasets with hundreds or even thousands of variables. @@ -396,80 +266,6 @@ transmute(flights, ) ``` -### Useful creation functions {#mutate-funs} - -There are many functions for creating new variables that you can use with `mutate()`. -The key property is that the function must be vectorised: it must take a vector of values as input, return a vector with the same number of values as output. -There's no way to list every possible function that you might use, but here's a selection of functions that are frequently useful: - -- Arithmetic operators: `+`, `-`, `*`, `/`, `^`. - These are all vectorised, using the so called "recycling rules". - If one parameter is shorter than the other, it will be automatically extended to be the same length. - This is most useful when one of the arguments is a single number: `air_time / 60`, `hours * 60 + minute`, etc. - - Arithmetic operators are also useful in conjunction with the aggregate functions you'll learn about later. - For example, `x / sum(x)` calculates the proportion of a total, and `y - mean(y)` computes the difference from the mean. - -- Modular arithmetic: `%/%` (integer division) and `%%` (remainder), where `x == y * (x %/% y) + (x %% y)`. - Modular arithmetic is a handy tool because it allows you to break integers up into pieces. 
- For example, in the flights dataset, you can compute `hour` and `minute` from `dep_time` with: - - ```{r} - transmute(flights, - dep_time, - hour = dep_time %/% 100, - minute = dep_time %% 100 - ) - ``` - -- Logs: `log()`, `log2()`, `log10()`. - Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude. - They also convert multiplicative relationships to additive. - - All else being equal, I recommend using `log2()` because it's easy to interpret: a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving. - -- Offsets: `lead()` and `lag()` allow you to refer to leading or lagging values. - This allows you to compute running differences (e.g. `x - lag(x)`) or find when values change (`x != lag(x)`). - They are most useful in conjunction with `group_by()`, which you'll learn about shortly. - - ```{r} - (x <- 1:10) - lag(x) - lead(x) - ``` - -- Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: `cumsum()`, `cumprod()`, `cummin()`, `cummax()`; and dplyr provides `cummean()` for cumulative means. - If you need rolling aggregates (i.e. a sum computed over a rolling window), try the RcppRoll package. - - ```{r} - x - cumsum(x) - cummean(x) - ``` - -- Logical comparisons: `<`, `<=`, `>`, `>=`, `!=`, and `==`, which you learned about earlier. - If you're doing a complex sequence of logical operations it's often a good idea to store the interim values in new variables so you can check that each step is working as expected. - -- Ranking: there are a number of ranking functions, but you should start with `min_rank()`. - It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th). - The default gives smallest values the small ranks; use `desc(x)` to give the largest values the smallest ranks. 
- - ```{r} - y <- c(1, 2, 2, NA, 3, 4) - min_rank(y) - min_rank(desc(y)) - ``` - - If `min_rank()` doesn't do what you need, look at the variants `row_number()`, `dense_rank()`, `percent_rank()`, `cume_dist()`, `ntile()`. - See their help pages for more details. - - ```{r} - row_number(y) - dense_rank(y) - percent_rank(y) - cume_dist(y) - ``` - ### Exercises ```{r, eval = FALSE, echo = FALSE} @@ -588,7 +384,7 @@ Working with the pipe is one of the key criteria for belonging to the tidyverse. The only exception is ggplot2: it was written before the pipe was discovered. Unfortunately, the next iteration of ggplot2, ggvis, which does use the pipe, isn't quite ready for prime time yet. -### Missing values {#missing-values-summarise} +## Missing values {#missing-values-summarise} You may have wondered about the `na.rm` argument we used above. What happens if we don't set it? @@ -621,7 +417,7 @@ not_cancelled %>% summarise(mean = mean(dep_delay)) ``` -### Grouping by multiple variables +## Grouping by multiple variables You can group a data frame by multiple variables as well. Note that the grouping information is printed on top of the output. @@ -770,134 +566,6 @@ batters %>% You can find a good explanation of this problem at and . -### Useful summary functions {#summarise-funs} - -Just using means, counts, and sum can get you a long way, but R provides many other useful summary functions: - -- Measures of location: we've used `mean(x)`, but `median(x)` is also useful. - The mean is the sum divided by the length; the median is a value where 50% of `x` is above it, and 50% is below it. - - ```{r} - not_cancelled %>% - group_by(month) %>% - summarise( - med_arr_delay = median(arr_delay), - med_dep_delay = median(dep_delay) - ) - ``` - - It's sometimes useful to combine aggregation with logical subsetting. - We haven't talked about this sort of subsetting yet, but you'll learn more about it in Section \@ref(vector-subsetting). 
- - ```{r} - not_cancelled %>% - group_by(year, month, day) %>% - summarise( - avg_delay1 = mean(arr_delay), - avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay - ) - ``` - -- Measures of spread: `sd(x)`, `IQR(x)`, `mad(x)`. - The root mean squared deviation, or standard deviation `sd(x)`, is the standard measure of spread. - The interquartile range `IQR(x)` and median absolute deviation `mad(x)` are robust equivalents that may be more useful if you have outliers. - - ```{r} - # Why is distance to some destinations more variable than to others? - not_cancelled %>% - group_by(dest) %>% - summarise(distance_sd = sd(distance)) %>% - arrange(desc(distance_sd)) - ``` - -- Measures of rank: `min(x)`, `quantile(x, 0.25)`, `max(x)`. - Quantiles are a generalisation of the median. - For example, `quantile(x, 0.25)` will find a value of `x` that is greater than 25% of the values, and less than the remaining 75%. - - ```{r} - # When do the first and last flights leave each day? - not_cancelled %>% - group_by(year, month, day) %>% - summarise( - first = min(dep_time), - last = max(dep_time) - ) - ``` - -- Measures of position: `first(x)`, `nth(x, 2)`, `last(x)`. - These work similarly to `x[1]`, `x[2]`, and `x[length(x)]` but let you set a default value if that position does not exist (i.e. you're trying to get the 3rd element from a group that only has two elements). - For example, we can find the first and last departure for each day: - - ```{r} - not_cancelled %>% - group_by(year, month, day) %>% - summarise( - first_dep = first(dep_time), - last_dep = last(dep_time) - ) - ``` - - These functions are complementary to filtering on ranks. - Filtering gives you all variables, with each observation in a separate row: - - ```{r} - not_cancelled %>% - group_by(year, month, day) %>% - mutate(r = min_rank(desc(dep_time))) %>% - filter(r %in% range(r)) - ``` - -- Counts: You've seen `n()`, which takes no arguments, and returns the size of the current group. 
- To count the number of non-missing values, use `sum(!is.na(x))`. - To count the number of distinct (unique) values, use `n_distinct(x)`. - - ```{r} - # Which destinations have the most carriers? - not_cancelled %>% - group_by(dest) %>% - summarise(carriers = n_distinct(carrier)) %>% - arrange(desc(carriers)) - ``` - - Counts are so useful that dplyr provides a simple helper if all you want is a count: - - ```{r} - not_cancelled %>% - count(dest) - ``` - - Just like with `group_by()`, you can also provide multiple variables to `count()`. - - ```{r} - not_cancelled %>% - count(carrier, dest) - ``` - - You can optionally provide a weight variable. - For example, you could use this to "count" (sum) the total number of miles a plane flew: - - ```{r} - not_cancelled %>% - count(tailnum, wt = distance) - ``` - -- Counts and proportions of logical values: `sum(x > 10)`, `mean(y == 0)`. - When used with numeric functions, `TRUE` is converted to 1 and `FALSE` to 0. - This makes `sum()` and `mean()` very useful: `sum(x)` gives the number of `TRUE`s in `x`, and `mean(x)` gives the proportion. - - ```{r} - # How many flights left before 5am? (these usually indicate delayed - # flights from the previous day) - not_cancelled %>% - group_by(year, month, day) %>% - summarise(n_early = sum(dep_time < 500)) - - # What proportion of flights are delayed by more than an hour? - not_cancelled %>% - group_by(year, month, day) %>% - summarise(hour_prop = mean(arr_delay > 60)) - ``` - ### Exercises 1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. 
diff --git a/logicals-numbers.Rmd b/logicals-numbers.Rmd index deaecbb..93c6aa4 100644 --- a/logicals-numbers.Rmd +++ b/logicals-numbers.Rmd @@ -1,3 +1,191 @@ # Logicals and numbers {#logicals-numbers} ## Introduction + +`between()` + +## Logical operators + +Multiple arguments to `filter()` are combined with "and": every expression must be true in order for a row to be included in the output. +For other types of combinations, you'll need to use Boolean operators yourself: `&` is "and", `|` is "or", and `!` is "not". +Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations. + +```{r bool-ops, echo = FALSE, fig.cap = "Complete set of boolean operations. `x` is the left-hand circle, `y` is the right-hand circle, and the shaded region show which parts each operator selects.", fig.alt = "Six Venn diagrams, each explaining a given logical operator. The circles (sets) in each of the Venn diagrams represent x and y. 1. y & !x is y but none of x, x & y is the intersection of x and y, x & !y is x but none of y, x is all of x none of y, xor(x, y) is everything except the intersection of x and y, y is all of y none of x, and x | y is everything."} +knitr::include_graphics("diagrams/transform-logical.png") +``` + +The following code finds all flights that departed in November or December: + +```{r, eval = FALSE} +filter(flights, month == 11 | month == 12) +``` + +The order of operations doesn't work like English. +You can't write `filter(flights, month == (11 | 12))`, which you might literally translate into "finds all flights that departed in November or December". +Instead it finds all months that equal `11 | 12`, an expression that evaluates to `TRUE`. +In a numeric context (like here), `TRUE` becomes `1`, so this finds all flights in January, not November or December. +This is quite confusing! + +A useful short-hand for this problem is `x %in% y`. +This will select every row where `x` is one of the values in `y`. 
+We could use it to rewrite the code above: + +```{r, eval = FALSE} +nov_dec <- filter(flights, month %in% c(11, 12)) +``` + +Sometimes you can simplify complicated subsetting by remembering De Morgan's law: `!(x & y)` is the same as `!x | !y`, and `!(x | y)` is the same as `!x & !y`. +For example, if you wanted to find flights that weren't delayed (on arrival or departure) by more than two hours, you could use either of the following two filters: + +```{r, eval = FALSE} +filter(flights, !(arr_delay > 120 | dep_delay > 120)) +filter(flights, arr_delay <= 120, dep_delay <= 120) +``` + +As well as `&` and `|`, R also has `&&` and `||`. +Don't use them here! +You'll learn when you should use them in Section \@ref(conditional-execution) on conditional execution. + +Whenever you start using complicated, multipart expressions in `filter()`, consider making them explicit variables instead. +That makes it much easier to check your work. +You'll learn how to create new variables shortly. + +## Summaries + +- Counts and proportions of logical values: `sum(x > 10)`, `mean(y == 0)`. + When used with numeric functions, `TRUE` is converted to 1 and `FALSE` to 0. + This makes `sum()` and `mean()` very useful: `sum(x)` gives the number of `TRUE`s in `x`, and `mean(x)` gives the proportion. + + ```{r} + # How many flights left before 5am? (these usually indicate delayed + # flights from the previous day) + not_cancelled %>% + group_by(year, month, day) %>% + summarise(n_early = sum(dep_time < 500)) + + # What proportion of flights are delayed by more than an hour? + not_cancelled %>% + group_by(year, month, day) %>% + summarise(hour_prop = mean(arr_delay > 60)) + ``` + +## Basic math + +There are many functions for creating new variables that you can use with `mutate()`. +The key property is that the function must be vectorised: it must take a vector of values as input, return a vector with the same number of values as output. 
+There's no way to list every possible function that you might use, but here's a selection of functions that are frequently useful: + +- Arithmetic operators: `+`, `-`, `*`, `/`, `^`. + These are all vectorised, using the so called "recycling rules". + If one parameter is shorter than the other, it will be automatically extended to be the same length. + This is most useful when one of the arguments is a single number: `air_time / 60`, `hours * 60 + minute`, etc. + + Arithmetic operators are also useful in conjunction with the aggregate functions you'll learn about later. + For example, `x / sum(x)` calculates the proportion of a total, and `y - mean(y)` computes the difference from the mean. + +- Modular arithmetic: `%/%` (integer division) and `%%` (remainder), where `x == y * (x %/% y) + (x %% y)`. + Modular arithmetic is a handy tool because it allows you to break integers up into pieces. + For example, in the flights dataset, you can compute `hour` and `minute` from `dep_time` with: + + ```{r} + transmute(flights, + dep_time, + hour = dep_time %/% 100, + minute = dep_time %% 100 + ) + ``` + +- Logs: `log()`, `log2()`, `log10()`. + Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude. + They also convert multiplicative relationships to additive. + + All else being equal, I recommend using `log2()` because it's easy to interpret: a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving. + +- Logical comparisons: `<`, `<=`, `>`, `>=`, `!=`, and `==`, which you learned about earlier. + If you're doing a complex sequence of logical operations it's often a good idea to store the interim values in new variables so you can check that each step is working as expected. 
+ +- Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: `cumsum()`, `cumprod()`, `cummin()`, `cummax()`; and dplyr provides `cummean()` for cumulative means. + If you need rolling aggregates (i.e. a sum computed over a rolling window), try the RcppRoll package. + + ```{r} + x + cumsum(x) + cummean(x) + ``` + +## Summaries + +Just using means, counts, and sum can get you a long way, but R provides many other useful summary functions: + +- Measures of location: we've used `mean(x)`, but `median(x)` is also useful. + The mean is the sum divided by the length; the median is a value where 50% of `x` is above it, and 50% is below it. + + ```{r} + not_cancelled %>% + group_by(month) %>% + summarise( + med_arr_delay = median(arr_delay), + med_dep_delay = median(dep_delay) + ) + ``` + + It's sometimes useful to combine aggregation with logical subsetting. + We haven't talked about this sort of subsetting yet, but you'll learn more about it in Section \@ref(vector-subsetting). + + ```{r} + not_cancelled %>% + group_by(year, month, day) %>% + summarise( + avg_delay1 = mean(arr_delay), + avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay + ) + ``` + +- Measures of spread: `sd(x)`, `IQR(x)`, `mad(x)`. + The root mean squared deviation, or standard deviation `sd(x)`, is the standard measure of spread. + The interquartile range `IQR(x)` and median absolute deviation `mad(x)` are robust equivalents that may be more useful if you have outliers. + + ```{r} + # Why is distance to some destinations more variable than to others? + not_cancelled %>% + group_by(dest) %>% + summarise(distance_sd = sd(distance)) %>% + arrange(desc(distance_sd)) + ``` + +- Measures of rank: `min(x)`, `quantile(x, 0.25)`, `max(x)`. + Quantiles are a generalisation of the median. + For example, `quantile(x, 0.25)` will find a value of `x` that is greater than 25% of the values, and less than the remaining 75%. 
+ + ```{r} + # When do the first and last flights leave each day? + not_cancelled %>% + group_by(year, month, day) %>% + summarise( + first = min(dep_time), + last = max(dep_time) + ) + ``` + +## Floating point + +There's another common problem you might encounter when using `==`: floating point numbers. +These results might surprise you! + +```{r} +(sqrt(2) ^ 2) == 2 +(1 / 49 * 49) == 1 +``` + +Computers use finite precision arithmetic (they obviously can't store an infinite number of digits!) so remember that every number you see is an approximation. +Instead of relying on `==`, use `near()`: + +```{r} +near(sqrt(2) ^ 2, 2) +near(1 / 49 * 49, 1) +``` + +## Exercises + +1. How could you use `arrange()` to sort all missing values to the start? + (Hint: use `!is.na()`). diff --git a/missing-values.Rmd b/missing-values.Rmd index abc43a9..ed10e52 100644 --- a/missing-values.Rmd +++ b/missing-values.Rmd @@ -1,3 +1,70 @@ # Missing values {#missing-values} ## Introduction + +## Basics + +### Missing values {#missing-values-filter} + +One important feature of R that can make comparison tricky is missing values, or `NA`s ("not availables"). +`NA` represents an unknown value so missing values are "contagious": almost any operation involving an unknown value will also be unknown. + +```{r} +NA > 5 +10 == NA +NA + 10 +NA / 2 +``` + +The most confusing result is this one: + +```{r} +NA == NA +``` + +It's easiest to understand why this is true with a bit more context: + +```{r} +# Let x be Mary's age. We don't know how old she is. +x <- NA + +# Let y be John's age. We don't know how old he is. +y <- NA + +# Are John and Mary the same age? +x == y +# We don't know! +``` + +If you want to determine if a value is missing, use `is.na()`: + +```{r} +is.na(x) +``` + +## dplyr verbs + +`filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values. 
+If you want to preserve missing values, ask for them explicitly: + +```{r} +df <- tibble(x = c(1, NA, 3)) +filter(df, x > 1) +filter(df, is.na(x) | x > 1) +``` + +Missing values are always sorted at the end: + +```{r} +df <- tibble(x = c(5, 2, NA)) +arrange(df, x) +arrange(df, desc(x)) +``` + +## Exercises + +1. Why is `NA ^ 0` not missing? + Why is `NA | TRUE` not missing? + Why is `FALSE & NA` not missing? + Can you figure out the general rule? + (`NA * 0` is a tricky counterexample!) diff --git a/vector-tools.Rmd b/vector-tools.Rmd index 00a569d..b98df1c 100644 --- a/vector-tools.Rmd +++ b/vector-tools.Rmd @@ -1,3 +1,96 @@ # Vector tools ## Introduction + +`%in%` + +## Counts + +- Counts: You've seen `n()`, which takes no arguments, and returns the size of the current group. + To count the number of non-missing values, use `sum(!is.na(x))`. + To count the number of distinct (unique) values, use `n_distinct(x)`. + + ```{r} + # Which destinations have the most carriers? + not_cancelled %>% + group_by(dest) %>% + summarise(carriers = n_distinct(carrier)) %>% + arrange(desc(carriers)) + ``` + + Counts are so useful that dplyr provides a simple helper if all you want is a count: + + ```{r} + not_cancelled %>% + count(dest) + ``` + + Just like with `group_by()`, you can also provide multiple variables to `count()`. + + ```{r} + not_cancelled %>% + count(carrier, dest) + ``` + + You can optionally provide a weight variable. + For example, you could use this to "count" (sum) the total number of miles a plane flew: + + ```{r} + not_cancelled %>% + count(tailnum, wt = distance) + ``` + +## Window functions + +- Offsets: `lead()` and `lag()` allow you to refer to leading or lagging values. + This allows you to compute running differences (e.g. `x - lag(x)`) or find when values change (`x != lag(x)`). + They are most useful in conjunction with `group_by()`, which you'll learn about shortly. 
+ + ```{r} + (x <- 1:10) + lag(x) + lead(x) + ``` + +- Ranking: there are a number of ranking functions, but you should start with `min_rank()`. + It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th). + The default gives smallest values the small ranks; use `desc(x)` to give the largest values the smallest ranks. + + ```{r} + y <- c(1, 2, 2, NA, 3, 4) + min_rank(y) + min_rank(desc(y)) + ``` + + If `min_rank()` doesn't do what you need, look at the variants `row_number()`, `dense_rank()`, `percent_rank()`, `cume_dist()`, `ntile()`. + See their help pages for more details. + + ```{r} + row_number(y) + dense_rank(y) + percent_rank(y) + cume_dist(y) + ``` + +- Measures of position: `first(x)`, `nth(x, 2)`, `last(x)`. + These work similarly to `x[1]`, `x[2]`, and `x[length(x)]` but let you set a default value if that position does not exist (i.e. you're trying to get the 3rd element from a group that only has two elements). + For example, we can find the first and last departure for each day: + + ```{r} + not_cancelled %>% + group_by(year, month, day) %>% + summarise( + first_dep = first(dep_time), + last_dep = last(dep_time) + ) + ``` + + These functions are complementary to filtering on ranks. + Filtering gives you all variables, with each observation in a separate row: + + ```{r} + not_cancelled %>% + group_by(year, month, day) %>% + mutate(r = min_rank(desc(dep_time))) %>% + filter(r %in% range(r)) + ```
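
Note on the final chunk moved into `vector-tools.Rmd`: the `filter(r %in% range(r))` step keeps the rows whose rank equals the smallest or largest rank in the group. The sketch below shows that idea in base R only, so it runs without dplyr or the `flights` data. It assumes a small made-up vector of departure times, and uses `rank(..., ties.method = "min")` on negated values as a stand-in for `min_rank(desc(x))`; it is an illustration, not the book's code.

```r
# Hypothetical departure times (HHMM) for a single day; not taken from `flights`.
dep_time <- c(517, 533, 542, 2359, 2359, 1200)

# Base R stand-in for min_rank(desc(dep_time)): tied values share the
# smallest rank, and negating makes the latest departure rank 1.
r <- rank(-dep_time, ties.method = "min")

# range(r) is c(min(r), max(r)), so this keeps the latest and the earliest
# departures, including any ties.
keep <- r %in% range(r)

dep_time[keep]
```

Because ties share a rank, both 2359 departures are kept alongside the earliest one, which is why filtering on ranks can return more than two rows per group.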