Break up data-transform content

2021-04-19 07:56:29 -05:00 · 2021-04-19 07:56:29 -05:00 · 861e27026e
parent 40d7bcb5d0
commit 861e27026e
4 changed files with 351 additions and 335 deletions
--- a/data-transform.Rmd
+++ b/data-transform.Rmd
@@ -117,115 +117,6 @@ When this happens you'll get an informative error:
 filter(flights, month = 1)
 ```

-There's another common problem you might encounter when using `==`: floating point numbers.
-These results might surprise you!
-
-```{r}
-(sqrt(2) ^ 2) == 2
-(1 / 49 * 49) == 1
-```
-
-Computers use finite precision arithmetic (they obviously can't store an infinite number of digits!) so remember that every number you see is an approximation.
-Instead of relying on `==`, use `near()`:
-
-```{r}
-near(sqrt(2) ^ 2,  2)
-near(1 / 49 * 49, 1)
-```
-
-### Logical operators
-
-Multiple arguments to `filter()` are combined with "and": every expression must be true in order for a row to be included in the output.
-For other types of combinations, you'll need to use Boolean operators yourself: `&` is "and", `|` is "or", and `!` is "not".
-Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations.
-
-```{r bool-ops, echo = FALSE, fig.cap = "Complete set of boolean operations. `x` is the left-hand circle, `y` is the right-hand circle, and the shaded region show which parts each operator selects.", fig.alt = "Six Venn diagrams, each explaining a given logical operator. The circles (sets) in each of the Venn diagrams represent x and y. 1. y & !x is y but none of x, x & y is the intersection of x and y, x & !y is x but none of y, x is all of x none of y, xor(x, y) is everything except the intersection of x and y, y is all of y none of x, and x | y is everything."}
-knitr::include_graphics("diagrams/transform-logical.png")
-```
-
-The following code finds all flights that departed in November or December:
-
-```{r, eval = FALSE}
-filter(flights, month == 11 | month == 12)
-```
-
-The order of operations doesn't work like English.
-You can't write `filter(flights, month == (11 | 12))`, which you might literally translate into "finds all flights that departed in November or December".
-Instead it finds all months that equal `11 | 12`, an expression that evaluates to `TRUE`.
-In a numeric context (like here), `TRUE` becomes `1`, so this finds all flights in January, not November or December.
-This is quite confusing!
-
-A useful short-hand for this problem is `x %in% y`.
-This will select every row where `x` is one of the values in `y`.
-We could use it to rewrite the code above:
-
-```{r, eval = FALSE}
-nov_dec <- filter(flights, month %in% c(11, 12))
-```
-
-Sometimes you can simplify complicated subsetting by remembering De Morgan's law: `!(x & y)` is the same as `!x | !y`, and `!(x | y)` is the same as `!x & !y`.
-For example, if you wanted to find flights that weren't delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:
-
-```{r, eval = FALSE}
-filter(flights, !(arr_delay > 120 | dep_delay > 120))
-filter(flights, arr_delay <= 120, dep_delay <= 120)
-```
-
-As well as `&` and `|`, R also has `&&` and `||`.
-Don't use them here!
-You'll learn when you should use them in Section \@ref(conditional-execution) on conditional execution.
-
-Whenever you start using complicated, multipart expressions in `filter()`, consider making them explicit variables instead.
-That makes it much easier to check your work.
-You'll learn how to create new variables shortly.
-
-### Missing values {#missing-values-filter}
-
-One important feature of R that can make comparison tricky is missing values, or `NA`s ("not availables").
-`NA` represents an unknown value so missing values are "contagious": almost any operation involving an unknown value will also be unknown.
-
-```{r}
-NA > 5
-10 == NA
-NA + 10
-NA / 2
-```
-
-The most confusing result is this one:
-
-```{r}
-NA == NA
-```
-
-It's easiest to understand why this is true with a bit more context:
-
-```{r}
-# Let x be Mary's age. We don't know how old she is.
-x <- NA
-
-# Let y be John's age. We don't know how old he is.
-y <- NA
-
-# Are John and Mary the same age?
-x == y
-# We don't know!
-```
-
-If you want to determine if a value is missing, use `is.na()`:
-
-```{r}
-is.na(x)
-```
-
-`filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values.
-If you want to preserve missing values, ask for them explicitly:
-
-```{r}
-df <- tibble(x = c(1, NA, 3))
-filter(df, x > 1)
-filter(df, is.na(x) | x > 1)
-```
-
 ### Exercises

 1.  Find all flights that
@@ -238,20 +129,10 @@ filter(df, is.na(x) | x > 1)
    f.  Were delayed by at least an hour, but made up over 30 minutes in flight
    g.  Departed between midnight and 6am (inclusive)

-2.  Another useful dplyr filtering helper is `between()`.
-    What does it do?
-    Can you use it to simplify the code needed to answer the previous challenges?
-
-3.  How many flights have a missing `dep_time`?
+2.  How many flights have a missing `dep_time`?
    What other variables are missing?
    What might these rows represent?

-4.  Why is `NA ^ 0` not missing?
-    Why is `NA | TRUE` not missing?
-    Why is `FALSE & NA` not missing?
-    Can you figure out the general rule?
-    (`NA * 0` is a tricky counterexample!)
-
 ## Arrange rows with `arrange()`

 `arrange()` works similarly to `filter()` except that instead of selecting rows, it changes their order.
@@ -268,14 +149,6 @@ Use `desc()` to re-order by a column in descending order:
 arrange(flights, desc(dep_delay))
 ```

-Missing values are always sorted at the end:
-
-```{r}
-df <- tibble(x = c(5, 2, NA))
-arrange(df, x)
-arrange(df, desc(x))
-```
-
 ### Exercises

 1.  Sort `flights` to find the flights with longest departure delays.
@@ -286,9 +159,6 @@ arrange(df, desc(x))
 3.  Which flights travelled the farthest?
    Which travelled the shortest?

-4.  How could you use `arrange()` to sort all missing values to the start?
-    (Hint: use `!is.na()`).
-
 ## Select columns with `select()` {#select}

 It's not uncommon to get datasets with hundreds or even thousands of variables.
@@ -396,80 +266,6 @@ transmute(flights,
 )
 ```

-### Useful creation functions {#mutate-funs}
-
-There are many functions for creating new variables that you can use with `mutate()`.
-The key property is that the function must be vectorised: it must take a vector of values as input, return a vector with the same number of values as output.
-There's no way to list every possible function that you might use, but here's a selection of functions that are frequently useful:
-
--   Arithmetic operators: `+`, `-`, `*`, `/`, `^`.
-    These are all vectorised, using the so called "recycling rules".
-    If one parameter is shorter than the other, it will be automatically extended to be the same length.
-    This is most useful when one of the arguments is a single number: `air_time / 60`, `hours * 60 + minute`, etc.
-
-    Arithmetic operators are also useful in conjunction with the aggregate functions you'll learn about later.
-    For example, `x / sum(x)` calculates the proportion of a total, and `y - mean(y)` computes the difference from the mean.
-
--   Modular arithmetic: `%/%` (integer division) and `%%` (remainder), where `x == y * (x %/% y) + (x %% y)`.
-    Modular arithmetic is a handy tool because it allows you to break integers up into pieces.
-    For example, in the flights dataset, you can compute `hour` and `minute` from `dep_time` with:
-
-    ```{r}
-    transmute(flights,
-      dep_time,
-      hour = dep_time %/% 100,
-      minute = dep_time %% 100
-    )
-    ```
-
--   Logs: `log()`, `log2()`, `log10()`.
-    Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude.
-    They also convert multiplicative relationships to additive.
-
-    All else being equal, I recommend using `log2()` because it's easy to interpret: a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving.
-
--   Offsets: `lead()` and `lag()` allow you to refer to leading or lagging values.
-    This allows you to compute running differences (e.g. `x - lag(x)`) or find when values change (`x != lag(x)`).
-    They are most useful in conjunction with `group_by()`, which you'll learn about shortly.
-
-    ```{r}
-    (x <- 1:10)
-    lag(x)
-    lead(x)
-    ```
-
--   Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: `cumsum()`, `cumprod()`, `cummin()`, `cummax()`; and dplyr provides `cummean()` for cumulative means.
-    If you need rolling aggregates (i.e. a sum computed over a rolling window), try the RcppRoll package.
-
-    ```{r}
-    x
-    cumsum(x)
-    cummean(x)
-    ```
-
--   Logical comparisons: `<`, `<=`, `>`, `>=`, `!=`, and `==`, which you learned about earlier.
-    If you're doing a complex sequence of logical operations it's often a good idea to store the interim values in new variables so you can check that each step is working as expected.
-
--   Ranking: there are a number of ranking functions, but you should start with `min_rank()`.
-    It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th).
-    The default gives smallest values the small ranks; use `desc(x)` to give the largest values the smallest ranks.
-
-    ```{r}
-    y <- c(1, 2, 2, NA, 3, 4)
-    min_rank(y)
-    min_rank(desc(y))
-    ```
-
-    If `min_rank()` doesn't do what you need, look at the variants `row_number()`, `dense_rank()`, `percent_rank()`, `cume_dist()`, `ntile()`.
-    See their help pages for more details.
-
-    ```{r}
-    row_number(y)
-    dense_rank(y)
-    percent_rank(y)
-    cume_dist(y)
-    ```
-
 ### Exercises

 ```{r, eval = FALSE, echo = FALSE}
@@ -588,7 +384,7 @@ Working with the pipe is one of the key criteria for belonging to the tidyverse.
 The only exception is ggplot2: it was written before the pipe was discovered.
 Unfortunately, the next iteration of ggplot2, ggvis, which does use the pipe, isn't quite ready for prime time yet.

-### Missing values {#missing-values-summarise}
+## Missing values {#missing-values-summarise}

 You may have wondered about the `na.rm` argument we used above.
 What happens if we don't set it?
@@ -621,7 +417,7 @@ not_cancelled %>%
  summarise(mean = mean(dep_delay))
 ```

-### Grouping by multiple variables
+## Grouping by multiple variables

 You can group a data frame by multiple variables as well.
 Note that the grouping information is printed on top of the output.
@@ -770,134 +566,6 @@ batters %>%

 You can find a good explanation of this problem at <http://varianceexplained.org/r/empirical_bayes_baseball/> and <http://www.evanmiller.org/how-not-to-sort-by-average-rating.html>.

-### Useful summary functions {#summarise-funs}
-
-Just using means, counts, and sum can get you a long way, but R provides many other useful summary functions:
-
--   Measures of location: we've used `mean(x)`, but `median(x)` is also useful.
-    The mean is the sum divided by the length; the median is a value where 50% of `x` is above it, and 50% is below it.
-
-    ```{r}
-    not_cancelled %>%
-      group_by(month) %>%
-      summarise(
-        med_arr_delay = median(arr_delay),
-        med_dep_delay = median(dep_delay)
-        )
-    ```
-
-    It's sometimes useful to combine aggregation with logical subsetting.
-    We haven't talked about this sort of subsetting yet, but you'll learn more about it in Section \@ref(vector-subsetting).
-
-    ```{r}
-    not_cancelled %>% 
-      group_by(year, month, day) %>% 
-      summarise(
-        avg_delay1 = mean(arr_delay),
-        avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
-      )
-    ```
-
--   Measures of spread: `sd(x)`, `IQR(x)`, `mad(x)`.
-    The root mean squared deviation, or standard deviation `sd(x)`, is the standard measure of spread.
-    The interquartile range `IQR(x)` and median absolute deviation `mad(x)` are robust equivalents that may be more useful if you have outliers.
-
-    ```{r}
-    # Why is distance to some destinations more variable than to others?
-    not_cancelled %>% 
-      group_by(dest) %>% 
-      summarise(distance_sd = sd(distance)) %>% 
-      arrange(desc(distance_sd))
-    ```
-
--   Measures of rank: `min(x)`, `quantile(x, 0.25)`, `max(x)`.
-    Quantiles are a generalisation of the median.
-    For example, `quantile(x, 0.25)` will find a value of `x` that is greater than 25% of the values, and less than the remaining 75%.
-
-    ```{r}
-    # When do the first and last flights leave each day?
-    not_cancelled %>% 
-      group_by(year, month, day) %>% 
-      summarise(
-        first = min(dep_time),
-        last = max(dep_time)
-      )
-    ```
-
--   Measures of position: `first(x)`, `nth(x, 2)`, `last(x)`.
-    These work similarly to `x[1]`, `x[2]`, and `x[length(x)]` but let you set a default value if that position does not exist (i.e. you're trying to get the 3rd element from a group that only has two elements).
-    For example, we can find the first and last departure for each day:
-
-    ```{r}
-    not_cancelled %>% 
-      group_by(year, month, day) %>% 
-      summarise(
-        first_dep = first(dep_time), 
-        last_dep = last(dep_time)
-      )
-    ```
-
-    These functions are complementary to filtering on ranks.
-    Filtering gives you all variables, with each observation in a separate row:
-
-    ```{r}
-    not_cancelled %>% 
-      group_by(year, month, day) %>% 
-      mutate(r = min_rank(desc(dep_time))) %>% 
-      filter(r %in% range(r))
-    ```
-
--   Counts: You've seen `n()`, which takes no arguments, and returns the size of the current group.
-    To count the number of non-missing values, use `sum(!is.na(x))`.
-    To count the number of distinct (unique) values, use `n_distinct(x)`.
-
-    ```{r}
-    # Which destinations have the most carriers?
-    not_cancelled %>% 
-      group_by(dest) %>% 
-      summarise(carriers = n_distinct(carrier)) %>% 
-      arrange(desc(carriers))
-    ```
-
-    Counts are so useful that dplyr provides a simple helper if all you want is a count:
-
-    ```{r}
-    not_cancelled %>% 
-      count(dest)
-    ```
-
-    Just like with `group_by()`, you can also provide multiple variables to `count()`.
-
-    ```{r}
-    not_cancelled %>% 
-      count(carrier, dest)
-    ```
-
-    You can optionally provide a weight variable.
-    For example, you could use this to "count" (sum) the total number of miles a plane flew:
-
-    ```{r}
-    not_cancelled %>% 
-      count(tailnum, wt = distance)
-    ```
-
--   Counts and proportions of logical values: `sum(x > 10)`, `mean(y == 0)`.
-    When used with numeric functions, `TRUE` is converted to 1 and `FALSE` to 0.
-    This makes `sum()` and `mean()` very useful: `sum(x)` gives the number of `TRUE`s in `x`, and `mean(x)` gives the proportion.
-
-    ```{r}
-    # How many flights left before 5am? (these usually indicate delayed
-    # flights from the previous day)
-    not_cancelled %>% 
-      group_by(year, month, day) %>% 
-      summarise(n_early = sum(dep_time < 500))
-
-    # What proportion of flights are delayed by more than an hour?
-    not_cancelled %>% 
-      group_by(year, month, day) %>% 
-      summarise(hour_prop = mean(arr_delay > 60))
-    ```
-
 ### Exercises

 1.  Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights.
--- a/logicals-numbers.Rmd
+++ b/logicals-numbers.Rmd
@@ -1,3 +1,191 @@
 # Logicals and numbers {#logicals-numbers}

 ## Introduction
+
+`between()`
+
+## Logical operators
+
+Multiple arguments to `filter()` are combined with "and": every expression must be true in order for a row to be included in the output.
+For other types of combinations, you'll need to use Boolean operators yourself: `&` is "and", `|` is "or", and `!` is "not".
+Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations.
+
+```{r bool-ops, echo = FALSE, fig.cap = "Complete set of boolean operations. `x` is the left-hand circle, `y` is the right-hand circle, and the shaded regions show which parts each operator selects.", fig.alt = "Six Venn diagrams, each explaining a given logical operator. The circles (sets) in each of the Venn diagrams represent x and y. 1. y & !x is y but none of x, x & y is the intersection of x and y, x & !y is x but none of y, x is all of x none of y, xor(x, y) is everything except the intersection of x and y, y is all of y none of x, and x | y is everything."}
+knitr::include_graphics("diagrams/transform-logical.png")
+```
+
+The following code finds all flights that departed in November or December:
+
+```{r, eval = FALSE}
+filter(flights, month == 11 | month == 12)
+```
+
+The order of operations doesn't work like English.
+You can't write `filter(flights, month == (11 | 12))`, which you might read literally as "find all flights that departed in November or December".
+Instead, it finds all rows where `month` equals `11 | 12`, an expression that evaluates to `TRUE`.
+In a numeric context (like here), `TRUE` becomes `1`, so this finds all flights in January, not November or December.
+This is quite confusing!
+
+A useful short-hand for this problem is `x %in% y`.
+This will select every row where `x` is one of the values in `y`.
+We could use it to rewrite the code above:
+
+```{r, eval = FALSE}
+nov_dec <- filter(flights, month %in% c(11, 12))
+```
+
+Sometimes you can simplify complicated subsetting by remembering De Morgan's law: `!(x & y)` is the same as `!x | !y`, and `!(x | y)` is the same as `!x & !y`.
+For example, if you wanted to find flights that weren't delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:
+
+```{r, eval = FALSE}
+filter(flights, !(arr_delay > 120 | dep_delay > 120))
+filter(flights, arr_delay <= 120, dep_delay <= 120)
+```
+
+As well as `&` and `|`, R also has `&&` and `||`.
+Don't use them here!
+You'll learn when you should use them in Section \@ref(conditional-execution) on conditional execution.
+
+Whenever you start using complicated, multipart expressions in `filter()`, consider making them explicit variables instead.
+That makes it much easier to check your work.
+You'll learn how to create new variables shortly.
+
+## Summaries
+
+-   Counts and proportions of logical values: `sum(x > 10)`, `mean(y == 0)`.
+    When used with numeric functions, `TRUE` is converted to 1 and `FALSE` to 0.
+    This makes `sum()` and `mean()` very useful: `sum(x)` gives the number of `TRUE`s in `x`, and `mean(x)` gives the proportion.
+
+    ```{r}
+    # How many flights left before 5am? (these usually indicate delayed
+    # flights from the previous day)
+    not_cancelled %>% 
+      group_by(year, month, day) %>% 
+      summarise(n_early = sum(dep_time < 500))
+
+    # What proportion of flights are delayed by more than an hour?
+    not_cancelled %>% 
+      group_by(year, month, day) %>% 
+      summarise(hour_prop = mean(arr_delay > 60))
+    ```
+
+## Basic math
+
+There are many functions for creating new variables that you can use with `mutate()`.
+The key property is that the function must be vectorised: it must take a vector of values as input and return a vector with the same number of values as output.
+There's no way to list every possible function that you might use, but here's a selection of functions that are frequently useful:
+
+-   Arithmetic operators: `+`, `-`, `*`, `/`, `^`.
+    These are all vectorised, using the so-called "recycling rules".
+    If one argument is shorter than the other, it will be automatically extended to be the same length.
+    This is most useful when one of the arguments is a single number: `air_time / 60`, `hours * 60 + minute`, etc.
+
+    Arithmetic operators are also useful in conjunction with the aggregate functions you'll learn about later.
+    For example, `x / sum(x)` calculates the proportion of a total, and `y - mean(y)` computes the difference from the mean.
+
+-   Modular arithmetic: `%/%` (integer division) and `%%` (remainder), where `x == y * (x %/% y) + (x %% y)`.
+    Modular arithmetic is a handy tool because it allows you to break integers up into pieces.
+    For example, in the flights dataset, you can compute `hour` and `minute` from `dep_time` with:
+
+    ```{r}
+    transmute(flights,
+      dep_time,
+      hour = dep_time %/% 100,
+      minute = dep_time %% 100
+    )
+    ```
+
+-   Logs: `log()`, `log2()`, `log10()`.
+    Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude.
+    They also convert multiplicative relationships to additive.
+
+    All else being equal, I recommend using `log2()` because it's easy to interpret: a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving.
+
+-   Logical comparisons: `<`, `<=`, `>`, `>=`, `!=`, and `==`, which you learned about earlier.
+    If you're doing a complex sequence of logical operations it's often a good idea to store the interim values in new variables so you can check that each step is working as expected.
+
+-   Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: `cumsum()`, `cumprod()`, `cummin()`, `cummax()`; and dplyr provides `cummean()` for cumulative means.
+    If you need rolling aggregates (i.e. a sum computed over a rolling window), try the RcppRoll package.
+
+    ```{r}
+    (x <- 1:10)
+    cumsum(x)
+    cummean(x)
+    ```
+
+## Summaries
+
+Just using means, counts, and sum can get you a long way, but R provides many other useful summary functions:
+
+-   Measures of location: we've used `mean(x)`, but `median(x)` is also useful.
+    The mean is the sum divided by the length; the median is a value where 50% of `x` is above it, and 50% is below it.
+
+    ```{r}
+    not_cancelled %>%
+      group_by(month) %>%
+      summarise(
+        med_arr_delay = median(arr_delay),
+        med_dep_delay = median(dep_delay)
+      )
+    ```
+
+    It's sometimes useful to combine aggregation with logical subsetting.
+    We haven't talked about this sort of subsetting yet, but you'll learn more about it in Section \@ref(vector-subsetting).
+
+    ```{r}
+    not_cancelled %>% 
+      group_by(year, month, day) %>% 
+      summarise(
+        avg_delay1 = mean(arr_delay),
+        avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
+      )
+    ```
+
+-   Measures of spread: `sd(x)`, `IQR(x)`, `mad(x)`.
+    The root mean squared deviation, or standard deviation `sd(x)`, is the standard measure of spread.
+    The interquartile range `IQR(x)` and median absolute deviation `mad(x)` are robust equivalents that may be more useful if you have outliers.
+
+    ```{r}
+    # Why is distance to some destinations more variable than to others?
+    not_cancelled %>% 
+      group_by(dest) %>% 
+      summarise(distance_sd = sd(distance)) %>% 
+      arrange(desc(distance_sd))
+    ```
+
+-   Measures of rank: `min(x)`, `quantile(x, 0.25)`, `max(x)`.
+    Quantiles are a generalisation of the median.
+    For example, `quantile(x, 0.25)` will find a value of `x` that is greater than 25% of the values, and less than the remaining 75%.
+
+    ```{r}
+    # When do the first and last flights leave each day?
+    not_cancelled %>% 
+      group_by(year, month, day) %>% 
+      summarise(
+        first = min(dep_time),
+        last = max(dep_time)
+      )
+    ```
+
+## Floating point
+
+There's another common problem you might encounter when using `==`: floating point numbers.
+These results might surprise you!
+
+```{r}
+(sqrt(2) ^ 2) == 2
+(1 / 49 * 49) == 1
+```
+
+Computers use finite precision arithmetic (they obviously can't store an infinite number of digits!) so remember that every number you see is an approximation.
+Instead of relying on `==`, use `near()`:
+
+```{r}
+near(sqrt(2) ^ 2,  2)
+near(1 / 49 * 49, 1)
+```
+
+## Exercises
+
+1.  How could you use `arrange()` to sort all missing values to the start?
+    (Hint: use `!is.na()`).
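
Note on `logicals-numbers.Rmd`: the pitfalls described in the moved text can be checked in isolation. What follows is a minimal base R sketch, not part of the commit. The vectors `month`, `arr_delay`, `dep_delay`, and `dep_time` are made-up stand-ins for `flights` columns, and `is_near()` is a hypothetical helper that assumes a tolerance of `sqrt(.Machine$double.eps)`, in the spirit of `near()`.

```r
# Made-up stand-ins for flights columns (assumptions, not the real data).
month     <- c(1, 11, 12, 6, 1)
arr_delay <- c(10, 130, 90, 200)
dep_delay <- c(5, 20, 150, 300)
dep_time  <- c(517, 1545, 2359)

# `11 | 12` is evaluated first and gives TRUE; TRUE is coerced to 1 in `==`,
# so this matches January rather than November or December.
wrong <- month == (11 | 12)
right <- month %in% c(11, 12)

# De Morgan's law: !(x | y) is the same as !x & !y.
not_late_1 <- !(arr_delay > 120 | dep_delay > 120)
not_late_2 <- arr_delay <= 120 & dep_delay <= 120

# Integer division and remainder split HHMM times into hour and minute.
hour   <- dep_time %/% 100
minute <- dep_time %% 100

# Hypothetical helper: tolerance-based comparison instead of `==`.
is_near <- function(x, y, tol = sqrt(.Machine$double.eps)) {
  abs(x - y) < tol
}

exact_equal  <- c(sqrt(2)^2 == 2, 1 / 49 * 49 == 1)
close_enough <- c(is_near(sqrt(2)^2, 2), is_near(1 / 49 * 49, 1))
```

Here `wrong` flags only the January entries, `right` flags November and December, the two `not_late_*` vectors agree, and both exact comparisons fail while the tolerance-based ones succeed.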
--- a/missing-values.Rmd
+++ b/missing-values.Rmd
@@ -1,3 +1,70 @@
 # Missing values {#missing-values}

 ## Introduction
+
+## Basics
+
+### Missing values {#missing-values-filter}
+
+One important feature of R that can make comparison tricky is missing values, or `NA`s ("not availables").
+`NA` represents an unknown value so missing values are "contagious": almost any operation involving an unknown value will also be unknown.
+
+```{r}
+NA > 5
+10 == NA
+NA + 10
+NA / 2
+```
+
+The most confusing result is this one:
+
+```{r}
+NA == NA
+```
+
+It's easiest to understand why this happens with a bit more context:
+
+```{r}
+# Let x be Mary's age. We don't know how old she is.
+x <- NA
+
+# Let y be John's age. We don't know how old he is.
+y <- NA
+
+# Are John and Mary the same age?
+x == y
+# We don't know!
+```
+
+If you want to determine if a value is missing, use `is.na()`:
+
+```{r}
+is.na(x)
+```
+
+## dplyr verbs
+
+`filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values.
+If you want to preserve missing values, ask for them explicitly:
+
+```{r}
+df <- tibble(x = c(1, NA, 3))
+filter(df, x > 1)
+filter(df, is.na(x) | x > 1)
+```
+
+Missing values are always sorted at the end:
+
+```{r}
+df <- tibble(x = c(5, 2, NA))
+arrange(df, x)
+arrange(df, desc(x))
+```
+
+## Exercises
+
+1.  Why is `NA ^ 0` not missing?
+    Why is `NA | TRUE` not missing?
+    Why is `FALSE & NA` not missing?
+    Can you figure out the general rule?
+    (`NA * 0` is a tricky counterexample!)
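
Note on `missing-values.Rmd`: the filtering and sorting behaviour described above can be imitated without dplyr. This is a rough base R sketch under stated assumptions, not part of the commit: `which()` stands in for the way `filter()` drops both `FALSE` and `NA`, and `sort(..., na.last = TRUE)` stands in for `arrange()` placing missing values at the end.

```r
x <- c(1, NA, 3)

# Comparisons involving NA are NA; is.na() is the way to test for missingness.
cmp        <- x > 1
is_missing <- is.na(x)

# filter() keeps rows only where the condition is TRUE. Base `[` would keep
# an NA slot instead, so which() is used here to drop FALSE and NA alike.
kept         <- x[which(x > 1)]
kept_with_na <- x[which(is.na(x) | x > 1)]

# Sorting with missing values last in both directions.
z <- c(5, 2, NA)
ascending  <- sort(z, na.last = TRUE)
descending <- sort(z, decreasing = TRUE, na.last = TRUE)
```

Asking for `is.na(x)` explicitly is what brings the missing entry back into `kept_with_na`.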
--- a/vector-tools.Rmd
+++ b/vector-tools.Rmd
@@ -1,3 +1,96 @@
 # Vector tools

 ## Introduction
+
+`%in%`
+
+## Counts
+
+-   Counts: You've seen `n()`, which takes no arguments, and returns the size of the current group.
+    To count the number of non-missing values, use `sum(!is.na(x))`.
+    To count the number of distinct (unique) values, use `n_distinct(x)`.
+
+    ```{r}
+    # Which destinations have the most carriers?
+    not_cancelled %>% 
+      group_by(dest) %>% 
+      summarise(carriers = n_distinct(carrier)) %>% 
+      arrange(desc(carriers))
+    ```
+
+    Counts are so useful that dplyr provides a simple helper if all you want is a count:
+
+    ```{r}
+    not_cancelled %>% 
+      count(dest)
+    ```
+
+    Just like with `group_by()`, you can also provide multiple variables to `count()`.
+
+    ```{r}
+    not_cancelled %>% 
+      count(carrier, dest)
+    ```
+
+    You can optionally provide a weight variable.
+    For example, you could use this to "count" (sum) the total number of miles a plane flew:
+
+    ```{r}
+    not_cancelled %>% 
+      count(tailnum, wt = distance)
+    ```
+
+## Window functions
+
+-   Offsets: `lead()` and `lag()` allow you to refer to leading or lagging values.
+    This allows you to compute running differences (e.g. `x - lag(x)`) or find when values change (`x != lag(x)`).
+    They are most useful in conjunction with `group_by()`, which you'll learn about shortly.
+
+    ```{r}
+    (x <- 1:10)
+    lag(x)
+    lead(x)
+    ```
+
+-   Ranking: there are a number of ranking functions, but you should start with `min_rank()`.
+    It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th).
+    The default gives the smallest values the smallest ranks; use `desc(x)` to give the largest values the smallest ranks.
+
+    ```{r}
+    y <- c(1, 2, 2, NA, 3, 4)
+    min_rank(y)
+    min_rank(desc(y))
+    ```
+
+    If `min_rank()` doesn't do what you need, look at the variants `row_number()`, `dense_rank()`, `percent_rank()`, `cume_dist()`, `ntile()`.
+    See their help pages for more details.
+
+    ```{r}
+    row_number(y)
+    dense_rank(y)
+    percent_rank(y)
+    cume_dist(y)
+    ```
+
+-   Measures of position: `first(x)`, `nth(x, 2)`, `last(x)`.
+    These work similarly to `x[1]`, `x[2]`, and `x[length(x)]` but let you set a default value if that position does not exist (e.g. you're trying to get the 3rd element from a group that only has two elements).
+    For example, we can find the first and last departure for each day:
+
+    ```{r}
+    not_cancelled %>% 
+      group_by(year, month, day) %>% 
+      summarise(
+        first_dep = first(dep_time), 
+        last_dep = last(dep_time)
+      )
+    ```
+
+    These functions are complementary to filtering on ranks.
+    Filtering gives you all variables, with each observation in a separate row:
+
+    ```{r}
+    not_cancelled %>% 
+      group_by(year, month, day) %>% 
+      mutate(r = min_rank(desc(dep_time))) %>% 
+      filter(r %in% range(r))
+    ```