diff --git a/data-transform.Rmd b/data-transform.Rmd
index e3477a0..1e9f08f 100644
--- a/data-transform.Rmd
+++ b/data-transform.Rmd
@@ -117,115 +117,6 @@ When this happens you'll get an informative error:
filter(flights, month = 1)
```
-There's another common problem you might encounter when using `==`: floating point numbers.
-These results might surprise you!
-
-```{r}
-(sqrt(2) ^ 2) == 2
-(1 / 49 * 49) == 1
-```
-
-Computers use finite precision arithmetic (they obviously can't store an infinite number of digits!) so remember that every number you see is an approximation.
-Instead of relying on `==`, use `near()`:
-
-```{r}
-near(sqrt(2) ^ 2, 2)
-near(1 / 49 * 49, 1)
-```
-
-### Logical operators
-
-Multiple arguments to `filter()` are combined with "and": every expression must be true in order for a row to be included in the output.
-For other types of combinations, you'll need to use Boolean operators yourself: `&` is "and", `|` is "or", and `!` is "not".
-Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations.
-
-```{r bool-ops, echo = FALSE, fig.cap = "Complete set of boolean operations. `x` is the left-hand circle, `y` is the right-hand circle, and the shaded region show which parts each operator selects.", fig.alt = "Six Venn diagrams, each explaining a given logical operator. The circles (sets) in each of the Venn diagrams represent x and y. 1. y & !x is y but none of x, x & y is the intersection of x and y, x & !y is x but none of y, x is all of x none of y, xor(x, y) is everything except the intersection of x and y, y is all of y none of x, and x | y is everything."}
-knitr::include_graphics("diagrams/transform-logical.png")
-```
-
-The following code finds all flights that departed in November or December:
-
-```{r, eval = FALSE}
-filter(flights, month == 11 | month == 12)
-```
-
-The order of operations doesn't work like English.
-You can't write `filter(flights, month == (11 | 12))`, which you might literally translate into "finds all flights that departed in November or December".
-Instead it finds all months that equal `11 | 12`, an expression that evaluates to `TRUE`.
-In a numeric context (like here), `TRUE` becomes `1`, so this finds all flights in January, not November or December.
-This is quite confusing!
-
-A useful short-hand for this problem is `x %in% y`.
-This will select every row where `x` is one of the values in `y`.
-We could use it to rewrite the code above:
-
-```{r, eval = FALSE}
-nov_dec <- filter(flights, month %in% c(11, 12))
-```
-
-Sometimes you can simplify complicated subsetting by remembering De Morgan's law: `!(x & y)` is the same as `!x | !y`, and `!(x | y)` is the same as `!x & !y`.
-For example, if you wanted to find flights that weren't delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:
-
-```{r, eval = FALSE}
-filter(flights, !(arr_delay > 120 | dep_delay > 120))
-filter(flights, arr_delay <= 120, dep_delay <= 120)
-```
-
-As well as `&` and `|`, R also has `&&` and `||`.
-Don't use them here!
-You'll learn when you should use them in Section \@ref(conditional-execution) on conditional execution.
-
-Whenever you start using complicated, multipart expressions in `filter()`, consider making them explicit variables instead.
-That makes it much easier to check your work.
-You'll learn how to create new variables shortly.
-
-### Missing values {#missing-values-filter}
-
-One important feature of R that can make comparison tricky is missing values, or `NA`s ("not availables").
-`NA` represents an unknown value so missing values are "contagious": almost any operation involving an unknown value will also be unknown.
-
-```{r}
-NA > 5
-10 == NA
-NA + 10
-NA / 2
-```
-
-The most confusing result is this one:
-
-```{r}
-NA == NA
-```
-
-It's easiest to understand why this is true with a bit more context:
-
-```{r}
-# Let x be Mary's age. We don't know how old she is.
-x <- NA
-
-# Let y be John's age. We don't know how old he is.
-y <- NA
-
-# Are John and Mary the same age?
-x == y
-# We don't know!
-```
-
-If you want to determine if a value is missing, use `is.na()`:
-
-```{r}
-is.na(x)
-```
-
-`filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values.
-If you want to preserve missing values, ask for them explicitly:
-
-```{r}
-df <- tibble(x = c(1, NA, 3))
-filter(df, x > 1)
-filter(df, is.na(x) | x > 1)
-```
-
### Exercises
1. Find all flights that
@@ -238,20 +129,10 @@ filter(df, is.na(x) | x > 1)
f. Were delayed by at least an hour, but made up over 30 minutes in flight
g. Departed between midnight and 6am (inclusive)
-2. Another useful dplyr filtering helper is `between()`.
- What does it do?
- Can you use it to simplify the code needed to answer the previous challenges?
-
-3. How many flights have a missing `dep_time`?
+2. How many flights have a missing `dep_time`?
What other variables are missing?
What might these rows represent?
-4. Why is `NA ^ 0` not missing?
- Why is `NA | TRUE` not missing?
- Why is `FALSE & NA` not missing?
- Can you figure out the general rule?
- (`NA * 0` is a tricky counterexample!)
-
## Arrange rows with `arrange()`
`arrange()` works similarly to `filter()` except that instead of selecting rows, it changes their order.
@@ -268,14 +149,6 @@ Use `desc()` to re-order by a column in descending order:
arrange(flights, desc(dep_delay))
```
-Missing values are always sorted at the end:
-
-```{r}
-df <- tibble(x = c(5, 2, NA))
-arrange(df, x)
-arrange(df, desc(x))
-```
-
### Exercises
1. Sort `flights` to find the flights with longest departure delays.
@@ -286,9 +159,6 @@ arrange(df, desc(x))
3. Which flights travelled the farthest?
Which travelled the shortest?
-4. How could you use `arrange()` to sort all missing values to the start?
- (Hint: use `!is.na()`).
-
## Select columns with `select()` {#select}
It's not uncommon to get datasets with hundreds or even thousands of variables.
@@ -396,80 +266,6 @@ transmute(flights,
)
```
-### Useful creation functions {#mutate-funs}
-
-There are many functions for creating new variables that you can use with `mutate()`.
-The key property is that the function must be vectorised: it must take a vector of values as input, return a vector with the same number of values as output.
-There's no way to list every possible function that you might use, but here's a selection of functions that are frequently useful:
-
-- Arithmetic operators: `+`, `-`, `*`, `/`, `^`.
- These are all vectorised, using the so called "recycling rules".
- If one parameter is shorter than the other, it will be automatically extended to be the same length.
- This is most useful when one of the arguments is a single number: `air_time / 60`, `hours * 60 + minute`, etc.
-
- Arithmetic operators are also useful in conjunction with the aggregate functions you'll learn about later.
- For example, `x / sum(x)` calculates the proportion of a total, and `y - mean(y)` computes the difference from the mean.
-
-- Modular arithmetic: `%/%` (integer division) and `%%` (remainder), where `x == y * (x %/% y) + (x %% y)`.
- Modular arithmetic is a handy tool because it allows you to break integers up into pieces.
- For example, in the flights dataset, you can compute `hour` and `minute` from `dep_time` with:
-
- ```{r}
- transmute(flights,
- dep_time,
- hour = dep_time %/% 100,
- minute = dep_time %% 100
- )
- ```
-
-- Logs: `log()`, `log2()`, `log10()`.
- Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude.
- They also convert multiplicative relationships to additive.
-
- All else being equal, I recommend using `log2()` because it's easy to interpret: a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving.
-
-- Offsets: `lead()` and `lag()` allow you to refer to leading or lagging values.
- This allows you to compute running differences (e.g. `x - lag(x)`) or find when values change (`x != lag(x)`).
- They are most useful in conjunction with `group_by()`, which you'll learn about shortly.
-
- ```{r}
- (x <- 1:10)
- lag(x)
- lead(x)
- ```
-
-- Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: `cumsum()`, `cumprod()`, `cummin()`, `cummax()`; and dplyr provides `cummean()` for cumulative means.
- If you need rolling aggregates (i.e. a sum computed over a rolling window), try the RcppRoll package.
-
- ```{r}
- x
- cumsum(x)
- cummean(x)
- ```
-
-- Logical comparisons: `<`, `<=`, `>`, `>=`, `!=`, and `==`, which you learned about earlier.
- If you're doing a complex sequence of logical operations it's often a good idea to store the interim values in new variables so you can check that each step is working as expected.
-
-- Ranking: there are a number of ranking functions, but you should start with `min_rank()`.
- It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th).
- The default gives smallest values the small ranks; use `desc(x)` to give the largest values the smallest ranks.
-
- ```{r}
- y <- c(1, 2, 2, NA, 3, 4)
- min_rank(y)
- min_rank(desc(y))
- ```
-
- If `min_rank()` doesn't do what you need, look at the variants `row_number()`, `dense_rank()`, `percent_rank()`, `cume_dist()`, `ntile()`.
- See their help pages for more details.
-
- ```{r}
- row_number(y)
- dense_rank(y)
- percent_rank(y)
- cume_dist(y)
- ```
-
### Exercises
```{r, eval = FALSE, echo = FALSE}
@@ -588,7 +384,7 @@ Working with the pipe is one of the key criteria for belonging to the tidyverse.
The only exception is ggplot2: it was written before the pipe was discovered.
Unfortunately, the next iteration of ggplot2, ggvis, which does use the pipe, isn't quite ready for prime time yet.
-### Missing values {#missing-values-summarise}
+## Missing values {#missing-values-summarise}
You may have wondered about the `na.rm` argument we used above.
What happens if we don't set it?
@@ -621,7 +417,7 @@ not_cancelled %>%
summarise(mean = mean(dep_delay))
```
-### Grouping by multiple variables
+## Grouping by multiple variables
You can group a data frame by multiple variables as well.
Note that the grouping information is printed on top of the output.
@@ -770,134 +566,6 @@ batters %>%
You can find a good explanation of this problem at and .
-### Useful summary functions {#summarise-funs}
-
-Just using means, counts, and sum can get you a long way, but R provides many other useful summary functions:
-
-- Measures of location: we've used `mean(x)`, but `median(x)` is also useful.
- The mean is the sum divided by the length; the median is a value where 50% of `x` is above it, and 50% is below it.
-
- ```{r}
- not_cancelled %>%
- group_by(month) %>%
- summarise(
- med_arr_delay = median(arr_delay),
- med_dep_delay = median(dep_delay)
- )
- ```
-
- It's sometimes useful to combine aggregation with logical subsetting.
- We haven't talked about this sort of subsetting yet, but you'll learn more about it in Section \@ref(vector-subsetting).
-
- ```{r}
- not_cancelled %>%
- group_by(year, month, day) %>%
- summarise(
- avg_delay1 = mean(arr_delay),
- avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
- )
- ```
-
-- Measures of spread: `sd(x)`, `IQR(x)`, `mad(x)`.
- The root mean squared deviation, or standard deviation `sd(x)`, is the standard measure of spread.
- The interquartile range `IQR(x)` and median absolute deviation `mad(x)` are robust equivalents that may be more useful if you have outliers.
-
- ```{r}
- # Why is distance to some destinations more variable than to others?
- not_cancelled %>%
- group_by(dest) %>%
- summarise(distance_sd = sd(distance)) %>%
- arrange(desc(distance_sd))
- ```
-
-- Measures of rank: `min(x)`, `quantile(x, 0.25)`, `max(x)`.
- Quantiles are a generalisation of the median.
- For example, `quantile(x, 0.25)` will find a value of `x` that is greater than 25% of the values, and less than the remaining 75%.
-
- ```{r}
- # When do the first and last flights leave each day?
- not_cancelled %>%
- group_by(year, month, day) %>%
- summarise(
- first = min(dep_time),
- last = max(dep_time)
- )
- ```
-
-- Measures of position: `first(x)`, `nth(x, 2)`, `last(x)`.
- These work similarly to `x[1]`, `x[2]`, and `x[length(x)]` but let you set a default value if that position does not exist (i.e. you're trying to get the 3rd element from a group that only has two elements).
- For example, we can find the first and last departure for each day:
-
- ```{r}
- not_cancelled %>%
- group_by(year, month, day) %>%
- summarise(
- first_dep = first(dep_time),
- last_dep = last(dep_time)
- )
- ```
-
- These functions are complementary to filtering on ranks.
- Filtering gives you all variables, with each observation in a separate row:
-
- ```{r}
- not_cancelled %>%
- group_by(year, month, day) %>%
- mutate(r = min_rank(desc(dep_time))) %>%
- filter(r %in% range(r))
- ```
-
-- Counts: You've seen `n()`, which takes no arguments, and returns the size of the current group.
- To count the number of non-missing values, use `sum(!is.na(x))`.
- To count the number of distinct (unique) values, use `n_distinct(x)`.
-
- ```{r}
- # Which destinations have the most carriers?
- not_cancelled %>%
- group_by(dest) %>%
- summarise(carriers = n_distinct(carrier)) %>%
- arrange(desc(carriers))
- ```
-
- Counts are so useful that dplyr provides a simple helper if all you want is a count:
-
- ```{r}
- not_cancelled %>%
- count(dest)
- ```
-
- Just like with `group_by()`, you can also provide multiple variables to `count()`.
-
- ```{r}
- not_cancelled %>%
- count(carrier, dest)
- ```
-
- You can optionally provide a weight variable.
- For example, you could use this to "count" (sum) the total number of miles a plane flew:
-
- ```{r}
- not_cancelled %>%
- count(tailnum, wt = distance)
- ```
-
-- Counts and proportions of logical values: `sum(x > 10)`, `mean(y == 0)`.
- When used with numeric functions, `TRUE` is converted to 1 and `FALSE` to 0.
- This makes `sum()` and `mean()` very useful: `sum(x)` gives the number of `TRUE`s in `x`, and `mean(x)` gives the proportion.
-
- ```{r}
- # How many flights left before 5am? (these usually indicate delayed
- # flights from the previous day)
- not_cancelled %>%
- group_by(year, month, day) %>%
- summarise(n_early = sum(dep_time < 500))
-
- # What proportion of flights are delayed by more than an hour?
- not_cancelled %>%
- group_by(year, month, day) %>%
- summarise(hour_prop = mean(arr_delay > 60))
- ```
-
### Exercises
1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights.
diff --git a/logicals-numbers.Rmd b/logicals-numbers.Rmd
index deaecbb..93c6aa4 100644
--- a/logicals-numbers.Rmd
+++ b/logicals-numbers.Rmd
@@ -1,3 +1,191 @@
# Logicals and numbers {#logicals-numbers}
## Introduction
+
+`between()`
+
+## Logical operators
+
+Multiple arguments to `filter()` are combined with "and": every expression must be true in order for a row to be included in the output.
+For other types of combinations, you'll need to use Boolean operators yourself: `&` is "and", `|` is "or", and `!` is "not".
+Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations.
+
+```{r bool-ops, echo = FALSE, fig.cap = "Complete set of Boolean operations. `x` is the left-hand circle, `y` is the right-hand circle, and the shaded regions show which parts each operator selects.", fig.alt = "Seven Venn diagrams, each explaining a given logical operator. The circles (sets) in each of the Venn diagrams represent x and y. y & !x is y but none of x, x & y is the intersection of x and y, x & !y is x but none of y, x is all of x none of y, xor(x, y) is everything except the intersection of x and y, y is all of y none of x, and x | y is everything."}
+knitr::include_graphics("diagrams/transform-logical.png")
+```
+
+The following code finds all flights that departed in November or December:
+
+```{r, eval = FALSE}
+filter(flights, month == 11 | month == 12)
+```
+
+The order of operations doesn't work like English.
+You can't write `filter(flights, month == (11 | 12))`, which you might literally translate into "find all flights that departed in November or December".
+Instead it finds all flights where `month` equals `11 | 12`, an expression that evaluates to `TRUE`.
+In a numeric context (like here), `TRUE` becomes `1`, so this finds all flights in January, not November or December.
+This is quite confusing!
+
+A useful short-hand for this problem is `x %in% y`.
+This will select every row where `x` is one of the values in `y`.
+We could use it to rewrite the code above:
+
+```{r, eval = FALSE}
+nov_dec <- filter(flights, month %in% c(11, 12))
+```
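+
+To see the difference on a small scale, here is a minimal sketch using a made-up vector called `month` (it stands in for the `flights` column and is not part of the dataset):
+
+```{r}
+# Hypothetical month values, for illustration only
+month <- c(1, 11, 12, 5)
+
+# `11 | 12` collapses to a single logical value
+11 | 12
+
+# so this comparison is really `month == TRUE`, which only matches 1 (January)
+month == (11 | 12)
+
+# `%in%` checks each month against the whole set of values
+month %in% c(11, 12)
+```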
+
+Sometimes you can simplify complicated subsetting by remembering De Morgan's laws: `!(x & y)` is the same as `!x | !y`, and `!(x | y)` is the same as `!x & !y`.
+For example, if you wanted to find flights that weren't delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:
+
+```{r, eval = FALSE}
+filter(flights, !(arr_delay > 120 | dep_delay > 120))
+filter(flights, arr_delay <= 120, dep_delay <= 120)
+```
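+
+If you want to convince yourself that the two forms agree, you can check them on a pair of small logical vectors.
+This is only a sketch: `x` and `y` are made-up vectors, and one value is `NA` to show that the equivalence also holds when a value is missing.
+
+```{r}
+# Hypothetical logical vectors, including one missing value
+x <- c(TRUE, TRUE, FALSE, FALSE, NA)
+y <- c(TRUE, FALSE, TRUE, FALSE, TRUE)
+
+# Each pair of expressions should give exactly the same result
+identical(!(x & y), !x | !y)
+identical(!(x | y), !x & !y)
+```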
+
+As well as `&` and `|`, R also has `&&` and `||`.
+Don't use them here!
+You'll learn when you should use them in Section \@ref(conditional-execution) on conditional execution.
+
+Whenever you start using complicated, multipart expressions in `filter()`, consider making them explicit variables instead.
+That makes it much easier to check your work.
+You'll learn how to create new variables shortly.
+
+## Summaries
+
+- Counts and proportions of logical values: `sum(x > 10)`, `mean(y == 0)`.
+ When used with numeric functions, `TRUE` is converted to 1 and `FALSE` to 0.
+ This makes `sum()` and `mean()` very useful: `sum(x)` gives the number of `TRUE`s in `x`, and `mean(x)` gives the proportion.
+
+ ```{r}
+ # How many flights left before 5am? (these usually indicate delayed
+ # flights from the previous day)
+ not_cancelled %>%
+   group_by(year, month, day) %>%
+   summarise(n_early = sum(dep_time < 500))
+
+ # What proportion of flights are delayed by more than an hour?
+ not_cancelled %>%
+   group_by(year, month, day) %>%
+   summarise(hour_prop = mean(arr_delay > 60))
+ ```
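+
+The same idea works on any vector, which makes it easy to try out before using it inside `summarise()`.
+Here is a minimal sketch with a made-up vector `delay` (not a column from `flights`):
+
+```{r}
+# Hypothetical delays in minutes, for illustration only
+delay <- c(3, 12, 25, 8)
+
+# The comparison gives one TRUE/FALSE per value
+delay > 10
+
+# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUEs
+sum(delay > 10)
+
+# and mean() gives the proportion of TRUEs
+mean(delay > 10)
+```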
+
+## Basic math
+
+There are many functions for creating new variables that you can use with `mutate()`.
+The key property is that the function must be vectorised: it must take a vector of values as input and return a vector with the same number of values as output.
+There's no way to list every possible function that you might use, but here's a selection of functions that are frequently useful:
+
+- Arithmetic operators: `+`, `-`, `*`, `/`, `^`.
+ These are all vectorised, using the so-called "recycling rules".
+ If one argument is shorter than the other, it will be automatically extended to be the same length.
+ This is most useful when one of the arguments is a single number: `air_time / 60`, `hour * 60 + minute`, etc.
+
+ Arithmetic operators are also useful in conjunction with the aggregate functions you'll learn about later.
+ For example, `x / sum(x)` calculates the proportion of a total, and `y - mean(y)` computes the difference from the mean.
+
+- Modular arithmetic: `%/%` (integer division) and `%%` (remainder), where `x == y * (x %/% y) + (x %% y)`.
+ Modular arithmetic is a handy tool because it allows you to break integers up into pieces.
+ For example, in the flights dataset, you can compute `hour` and `minute` from `dep_time` with:
+
+ ```{r}
+ transmute(flights,
+   dep_time,
+   hour = dep_time %/% 100,
+   minute = dep_time %% 100
+ )
+ ```
+
+- Logs: `log()`, `log2()`, `log10()`.
+ Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude.
+ They also convert multiplicative relationships to additive.
+
+ All else being equal, I recommend using `log2()` because it's easy to interpret: a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving.
+
+- Logical comparisons: `<`, `<=`, `>`, `>=`, `!=`, and `==`, which you learned about earlier.
+ If you're doing a complex sequence of logical operations it's often a good idea to store the interim values in new variables so you can check that each step is working as expected.
+
+- Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: `cumsum()`, `cumprod()`, `cummin()`, `cummax()`; and dplyr provides `cummean()` for cumulative means.
+ If you need rolling aggregates (i.e. a sum computed over a rolling window), try the RcppRoll package.
+
+ ```{r}
+ (x <- 1:10)
+ cumsum(x)
+ cummean(x)
+ ```
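+
+Recycling and modular arithmetic are easiest to understand on a tiny example.
+The sketch below uses a made-up vector `dep_time` written in the same `HHMM` style as the `flights` column; the name `mins_since_midnight` is purely illustrative.
+
+```{r}
+# Hypothetical departure times in HHMM format
+dep_time <- c(517, 1545, 2359)
+
+# Integer division and remainder split each time into two pieces
+hour <- dep_time %/% 100
+minute <- dep_time %% 100
+hour
+minute
+
+# The single number 60 is recycled to match the length of `hour`
+mins_since_midnight <- hour * 60 + minute
+mins_since_midnight
+
+# The two pieces always recombine into the original value
+all(dep_time == 100 * hour + minute)
+```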
+
+## Summaries
+
+Just using means, counts, and sum can get you a long way, but R provides many other useful summary functions:
+
+- Measures of location: we've used `mean(x)`, but `median(x)` is also useful.
+ The mean is the sum divided by the length; the median is a value where 50% of `x` is above it, and 50% is below it.
+
+ ```{r}
+ not_cancelled %>%
+   group_by(month) %>%
+   summarise(
+     med_arr_delay = median(arr_delay),
+     med_dep_delay = median(dep_delay)
+   )
+ ```
+
+ It's sometimes useful to combine aggregation with logical subsetting.
+ We haven't talked about this sort of subsetting yet, but you'll learn more about it in Section \@ref(vector-subsetting).
+
+ ```{r}
+ not_cancelled %>%
+   group_by(year, month, day) %>%
+   summarise(
+     avg_delay1 = mean(arr_delay),
+     avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
+   )
+ ```
+
+- Measures of spread: `sd(x)`, `IQR(x)`, `mad(x)`.
+ The root mean squared deviation, or standard deviation `sd(x)`, is the standard measure of spread.
+ The interquartile range `IQR(x)` and median absolute deviation `mad(x)` are robust equivalents that may be more useful if you have outliers.
+
+ ```{r}
+ # Why is distance to some destinations more variable than to others?
+ not_cancelled %>%
+   group_by(dest) %>%
+   summarise(distance_sd = sd(distance)) %>%
+   arrange(desc(distance_sd))
+ ```
+
+- Measures of rank: `min(x)`, `quantile(x, 0.25)`, `max(x)`.
+ Quantiles are a generalisation of the median.
+ For example, `quantile(x, 0.25)` will find a value of `x` that is greater than 25% of the values, and less than the remaining 75%.
+
+ ```{r}
+ # When do the first and last flights leave each day?
+ not_cancelled %>%
+   group_by(year, month, day) %>%
+   summarise(
+     first = min(dep_time),
+     last = max(dep_time)
+   )
+ ```
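+
+To get a feel for why the robust summaries matter, it helps to compare them on a vector with one extreme value.
+This is a minimal sketch with made-up numbers (`values` is not taken from `flights`):
+
+```{r}
+# Hypothetical data with a single large outlier
+values <- c(1, 2, 3, 4, 100)
+
+# The outlier drags the mean upwards, but barely affects the median
+mean(values)
+median(values)
+
+# Likewise, sd() is inflated by the outlier while IQR() and mad() are not
+sd(values)
+IQR(values)
+mad(values)
+```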
+
+## Floating point
+
+A common problem you might encounter when using `==` with numbers is floating point arithmetic.
+These results might surprise you!
+
+```{r}
+(sqrt(2) ^ 2) == 2
+(1 / 49 * 49) == 1
+```
+
+Computers use finite precision arithmetic (they obviously can't store an infinite number of digits!), so remember that many of the numbers you see are approximations.
+Instead of relying on `==`, use `near()`:
+
+```{r}
+near(sqrt(2) ^ 2, 2)
+near(1 / 49 * 49, 1)
+```
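+
+You can see what is going on by looking at the size of the error and comparing it to a tolerance yourself.
+The sketch below uses only base R; the tolerance `tol` is an assumption chosen to mirror the default documented for `near()`, so treat it as an illustration of the idea rather than a description of how `near()` is implemented.
+
+```{r}
+x <- sqrt(2) ^ 2
+
+# The result is not exactly 2 ...
+x == 2
+
+# ... but the difference is tiny
+x - 2
+
+# Comparing within a tolerance treats the two values as equal
+tol <- .Machine$double.eps^0.5
+abs(x - 2) < tol
+```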
+
+## Exercises
+
+1. How could you use `arrange()` to sort all missing values to the start?
+ (Hint: use `!is.na()`).
diff --git a/missing-values.Rmd b/missing-values.Rmd
index abc43a9..ed10e52 100644
--- a/missing-values.Rmd
+++ b/missing-values.Rmd
@@ -1,3 +1,70 @@
# Missing values {#missing-values}
## Introduction
+
+## Basics
+
+### Missing values {#missing-values-filter}
+
+One important feature of R that can make comparison tricky is missing values, or `NA`s ("not availables").
+`NA` represents an unknown value so missing values are "contagious": almost any operation involving an unknown value will also be unknown.
+
+```{r}
+NA > 5
+10 == NA
+NA + 10
+NA / 2
+```
+
+The most confusing result is this one:
+
+```{r}
+NA == NA
+```
+
+It's easiest to understand why this is true with a bit more context:
+
+```{r}
+# Let x be Mary's age. We don't know how old she is.
+x <- NA
+
+# Let y be John's age. We don't know how old he is.
+y <- NA
+
+# Are John and Mary the same age?
+x == y
+# We don't know!
+```
+
+If you want to determine if a value is missing, use `is.na()`:
+
+```{r}
+is.na(x)
+```
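+
+The same distinction matters for whole vectors.
+As a quick sketch, using a made-up vector `ages` with one unknown value:
+
+```{r}
+# Hypothetical ages; the second one is unknown
+ages <- c(25, NA, 31)
+
+# Comparing with NA never answers the question: every result is NA
+ages == NA
+
+# is.na() tells you which values are missing
+is.na(ages)
+
+# and, because TRUE counts as 1, sum() tells you how many
+sum(is.na(ages))
+```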
+
+## dplyr verbs
+
+`filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values.
+If you want to preserve missing values, ask for them explicitly:
+
+```{r}
+df <- tibble(x = c(1, NA, 3))
+filter(df, x > 1)
+filter(df, is.na(x) | x > 1)
+```
+
+`arrange()` always sorts missing values at the end, even when you use `desc()`:
+
+```{r}
+df <- tibble(x = c(5, 2, NA))
+arrange(df, x)
+arrange(df, desc(x))
+```
+
+## Exercises
+
+1. Why is `NA ^ 0` not missing?
+ Why is `NA | TRUE` not missing?
+ Why is `FALSE & NA` not missing?
+ Can you figure out the general rule?
+ (`NA * 0` is a tricky counterexample!)
diff --git a/vector-tools.Rmd b/vector-tools.Rmd
index 00a569d..b98df1c 100644
--- a/vector-tools.Rmd
+++ b/vector-tools.Rmd
@@ -1,3 +1,96 @@
# Vector tools
## Introduction
+
+`%in%`
+
+## Counts
+
+- Counts: You've seen `n()`, which takes no arguments, and returns the size of the current group.
+ To count the number of non-missing values, use `sum(!is.na(x))`.
+ To count the number of distinct (unique) values, use `n_distinct(x)`.
+
+ ```{r}
+ # Which destinations have the most carriers?
+ not_cancelled %>%
+   group_by(dest) %>%
+   summarise(carriers = n_distinct(carrier)) %>%
+   arrange(desc(carriers))
+ ```
+
+ Counts are so useful that dplyr provides a simple helper if all you want is a count:
+
+ ```{r}
+ not_cancelled %>%
+   count(dest)
+ ```
+
+ Just like with `group_by()`, you can also provide multiple variables to `count()`.
+
+ ```{r}
+ not_cancelled %>%
+   count(carrier, dest)
+ ```
+
+ You can optionally provide a weight variable.
+ For example, you could use this to "count" (sum) the total number of miles a plane flew:
+
+ ```{r}
+ not_cancelled %>%
+   count(tailnum, wt = distance)
+ ```
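+
+A weighted count is really a grouped sum, which you can check on a tiny data frame.
+This is a minimal sketch: `trips` is a made-up tibble, not a subset of `flights`.
+
+```{r}
+library(dplyr)
+
+# Hypothetical trips: plane "A" flew twice, plane "B" once
+trips <- tibble(
+  tailnum = c("A", "A", "B"),
+  distance = c(100, 250, 400)
+)
+
+# A weighted count adds up `distance` within each `tailnum`
+by_count <- trips %>%
+  count(tailnum, wt = distance)
+by_count
+
+# which gives the same totals as an explicit grouped sum
+by_sum <- trips %>%
+  group_by(tailnum) %>%
+  summarise(n = sum(distance))
+by_sum
+```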
+
+## Window functions
+
+- Offsets: `lead()` and `lag()` allow you to refer to leading or lagging values.
+ This allows you to compute running differences (e.g. `x - lag(x)`) or find when values change (`x != lag(x)`).
+ They are most useful in conjunction with `group_by()`, which you'll learn about shortly.
+
+ ```{r}
+ (x <- 1:10)
+ lag(x)
+ lead(x)
+ ```
+
+- Ranking: there are a number of ranking functions, but you should start with `min_rank()`.
+ It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th).
+ The default gives the smallest values the smallest ranks; use `desc(x)` to give the largest values the smallest ranks.
+
+ ```{r}
+ y <- c(1, 2, 2, NA, 3, 4)
+ min_rank(y)
+ min_rank(desc(y))
+ ```
+
+ If `min_rank()` doesn't do what you need, look at the variants `row_number()`, `dense_rank()`, `percent_rank()`, `cume_dist()`, `ntile()`.
+ See their help pages for more details.
+
+ ```{r}
+ row_number(y)
+ dense_rank(y)
+ percent_rank(y)
+ cume_dist(y)
+ ```
+
+- Measures of position: `first(x)`, `nth(x, 2)`, `last(x)`.
+ These work similarly to `x[1]`, `x[2]`, and `x[length(x)]` but let you set a default value if that position does not exist (e.g. you're trying to get the 3rd element from a group that only has two elements).
+ For example, we can find the first and last departure for each day:
+
+ ```{r}
+ not_cancelled %>%
+   group_by(year, month, day) %>%
+   summarise(
+     first_dep = first(dep_time),
+     last_dep = last(dep_time)
+   )
+ ```
+
+ These functions are complementary to filtering on ranks.
+ Filtering gives you all variables, with each observation in a separate row:
+
+ ```{r}
+ not_cancelled %>%
+   group_by(year, month, day) %>%
+   mutate(r = min_rank(desc(dep_time))) %>%
+   filter(r %in% range(r))
+ ```
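+
+Two of the ideas above, offsets and default values for missing positions, are easy to try on small vectors.
+The sketch below uses made-up vectors (`readings` and `short` are illustrative names, not columns from `flights`):
+
+```{r}
+library(dplyr)
+
+# Hypothetical sequence of readings
+readings <- c(3, 3, 5, 5, 5, 2)
+
+# Running differences: the first value has nothing before it, so it is NA
+diffs <- readings - lag(readings)
+diffs
+
+# Find the positions where the value changes from the previous one
+changed <- readings != lag(readings)
+changed
+
+# Asking for a position that doesn't exist gives NA with `[` ...
+short <- c(10, 20)
+short[3]
+
+# ... while nth() lets you choose what to return instead
+nth(short, 3, default = 0)
+```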