Start rewriting transform chapter

2021-04-20 07:59:47 -05:00 · 86e98ae66e
parent d80982caa6
commit 86e98ae66e
4 changed files with 319 additions and 332 deletions
--- a/data-transform.Rmd
+++ b/data-transform.Rmd
@@ -30,46 +30,47 @@ The data comes from the US [Bureau of Transportation Statistics](http://www.tran
 flights
 ```

-You might notice that this data frame prints a little differently from other data frames you might have used in the past: it only shows the first few rows and all the columns that fit on one screen.
-It also displays the number of rows (`r format(nrow(nycflights13::flights), big.mark = ",")`) and columns (`r ncol(nycflights13::flights)`).
-(To see the whole dataset, you can run `View(flights)` which will open the dataset in the RStudio viewer).
-It prints differently because it's a **tibble**.
-Tibbles are data frames, but slightly tweaked to work better in the tidyverse.
-For now, you don't need to worry about the differences; we'll come back to tibbles in more detail in Chapter \@ref(tibbles).
+If you've used R before, you might notice that this data frame prints a little differently to data frames you might've worked with in the past.
+That's because it's a **tibble**, a special type of data frame designed by the tidyverse team.
+
+The most important difference between a tibble and a data frame is the print method.
+Tibbles only show the first few rows and the columns that fit on one screen.
+This makes it easier to rapidly iterate when working with large data; if you want to see everything you can use `View(flights)` to open the dataset in the RStudio viewer.
+We'll come back to other important differences in Chapter \@ref(tibbles).

 You might also have noticed the row of three (or four) letter abbreviations under the column names.
 These describe the type of each variable:

--   `int` stands for integers.
+-   `int` stands for integer.

--   `dbl` stands for doubles, or real numbers.
+-   `dbl` stands for double, a vector of real numbers.

--   `chr` stands for characters, or strings.
+-   `chr` stands for character, a vector of strings.

--   `dttm` stands for date-times (a date + a time).
+-   `dttm` stands for date-time (a date + a time).

 There are three other common types of variables that aren't used in this dataset but you'll encounter later in the book:

 -   `lgl` stands for logical, vectors that contain only `TRUE` or `FALSE`.

--   `fctr` stands for factors, which R uses to represent categorical variables with fixed possible values.
+-   `fct` stands for factor, which R uses to represent categorical variables with fixed possible values.

--   `date` stands for dates.
+-   `date` stands for date.

 ### dplyr basics

-In this chapter you are going to learn the five key dplyr functions that allow you to solve the vast majority of your data manipulation challenges:
+In this chapter you are going to learn the primary dplyr verbs that allow you to solve the vast majority of your data manipulation challenges.
+They are organised into three camps:

--   Pick observations by their values (`filter()`).
--   Reorder the rows (`arrange()`).
--   Pick variables by their names (`select()`).
--   Create new variables with functions of existing variables (`mutate()`).
--   Collapse many values down to a single summary (`summarise()`).
+-   Functions that operate on **rows**, like `filter()` which subsets rows based on the values of the columns, the `slice()` functions that subset rows based on their position, and `arrange()` which changes the order of the rows.

-These can all be used in conjunction with `group_by()` which changes the scope of each function from operating on the entire dataset to operating on it group-by-group.
-These six functions provide the verbs for a language of data manipulation.
+-   Functions that operate on **columns**, like `mutate()` which creates new columns, `select()` which picks columns, `rename()` which changes column names, and `relocate()` which moves columns from place to place.

-All verbs work similarly:
+-   Functions that operate on **groups**, like `group_by()` which divides data up into groups for analysis, and `summarise()` which reduces each group to a single row.
+
+Later, in Chapter \@ref(relational-data), you'll learn about other verbs that work with **tables**, like the join functions and the set operations.
+
+All dplyr verbs work the same way:

 1.  The first argument is a data frame.

@@ -80,7 +81,9 @@ All verbs work similarly:
 Together these properties make it easy to chain together multiple simple steps to achieve a complex result.
 Let's dive in and see how these verbs work.

-## Filter rows with `filter()`
+## Rows
+
+### `filter()`

 `filter()` allows you to subset observations based on their values.
 The first argument is the name of the data frame.
@@ -105,35 +108,20 @@ If you want to do both, you can wrap the assignment in parentheses:
 (dec25 <- filter(flights, month == 12, day == 25))
 ```

-### Comparisons
-
 To use filtering effectively, you have to know how to select the observations that you want using the comparison operators.
 R provides the standard suite: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal).
+It also provides `%in%`: `filter(df, x %in% c(a, b, c))` will return all rows where `x` is `a`, `b`, or `c`.
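+
+As a minimal sketch of `%in%` in use, the two calls below should select the same rows.
+The `library()` calls are repeated only so the chunk runs on its own, and the names `with_or` and `with_in` are illustrative.
+
+```{r}
+library(dplyr)
+library(nycflights13)
+
+# Spelled out with `|`
+with_or <- filter(flights, month == 11 | month == 12)
+
+# The same condition written with %in%
+with_in <- filter(flights, month %in% c(11, 12))
+
+# Check that both approaches return the same rows
+identical(with_or, with_in)
+```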

 When you're starting out with R, the easiest mistake to make is to use `=` instead of `==` when testing for equality.
-When this happens you'll get an informative error:
+`filter()` will let you know when this happens:

 ```{r, error = TRUE}
 filter(flights, month = 1)
 ```

-### Exercises
+### `slice()`
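+
+The `slice_*()` helpers subset rows by position rather than by value.
+A brief sketch, assuming dplyr 1.0.0 or later (where these helpers were introduced); the `library()` calls are repeated only so the chunk runs on its own.
+
+```{r}
+library(dplyr)
+library(nycflights13)
+
+# First five rows, in their current order
+slice_head(flights, n = 5)
+
+# Last five rows
+slice_tail(flights, n = 5)
+
+# Rows with the five largest departure delays (ties can add extra rows)
+slice_max(flights, dep_delay, n = 5)
+
+# Five rows chosen at random
+slice_sample(flights, n = 5)
+```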

-1.  Find all flights that
-
-    a.  Had an arrival delay of two or more hours
-    b.  Flew to Houston (`IAH` or `HOU`)
-    c.  Were operated by United, American, or Delta
-    d.  Departed in summer (July, August, and September)
-    e.  Arrived more than two hours late, but didn't leave late
-    f.  Were delayed by at least an hour, but made up over 30 minutes in flight
-    g.  Departed between midnight and 6am (inclusive)
-
-2.  How many flights have a missing `dep_time`?
-    What other variables are missing?
-    What might these rows represent?
-
-## Arrange rows with `arrange()`
+### `arrange()`

 `arrange()` works similarly to `filter()` except that instead of selecting rows, it changes their order.
 It takes a data frame and a set of column names (or more complicated expressions) to order by.
@@ -151,15 +139,73 @@ arrange(flights, desc(dep_delay))

 ### Exercises

-1.  Sort `flights` to find the flights with longest departure delays.
+1.  Find all flights that
+
+    a.  Had an arrival delay of two or more hours
+    b.  Flew to Houston (`IAH` or `HOU`)
+    c.  Were operated by United, American, or Delta
+    d.  Departed in summer (July, August, and September)
+    e.  Arrived more than two hours late, but didn't leave late
+    f.  Were delayed by at least an hour, but made up over 30 minutes in flight
+    g.  Departed between midnight and 6am (inclusive)
+
+2.  Sort `flights` to find the flights with longest departure delays.
    Find the flights that left earliest.

-2.  Sort `flights` to find the fastest (highest speed) flights.
+3.  Sort `flights` to find the fastest (highest speed) flights.
+    (Hint: try sorting by a calculation).

-3.  Which flights travelled the farthest?
+4.  Which flights travelled the farthest?
    Which travelled the shortest?

-## Select columns with `select()` {#select}
+## Columns
+
+### `mutate()`
+
+Besides selecting sets of existing columns, it's often useful to add new columns that are functions of existing columns.
+That's the job of `mutate()`.
+
+By default, `mutate()` adds new columns at the end of your dataset, so we'll start by creating a narrower dataset to make the new variables easy to see.
+Remember that when you're in RStudio, the easiest way to see all the columns is `View()`.
+
+```{r}
+flights_sml <- select(flights, 
+  year:day, 
+  ends_with("delay"), 
+  distance, 
+  air_time
+)
+```
+
+```{r}
+mutate(flights_sml,
+  gain = dep_delay - arr_delay,
+  speed = distance / air_time * 60
+)
+```
+
+Note that you can refer to columns that you've just created:
+
+```{r}
+mutate(flights_sml,
+  gain = dep_delay - arr_delay,
+  hours = air_time / 60,
+  gain_per_hour = gain / hours
+)
+```
+
+You can control which variables are kept with the `.keep` argument:
+
+```{r}
+mutate(flights,
+  gain = dep_delay - arr_delay,
+  hours = air_time / 60,
+  gain_per_hour = gain / hours,
+  .keep = "none"
+)
+```
+
+### `select()` {#select}

 It's not uncommon to get datasets with hundreds or even thousands of variables.
 In this case, the first challenge is often narrowing in on the variables you're actually interested in.
@@ -190,80 +236,37 @@ There are a number of helper functions you can use within `select()`:

 See `?select` for more details.

-`select()` can be used to rename variables, but it's rarely useful because it drops all of the variables not explicitly mentioned.
-Instead, use `rename()`, which is a variant of `select()` that keeps all the variables that aren't explicitly mentioned:
+You can rename variables as you `select()` them by using `=`.
+The new name appears on the left hand side of the `=`, and the old variable appears on the right hand side:
+
+```{r}
+select(flights, tail_num = tailnum)
+```
+
+### `rename()`
+
+If you want to keep all the existing variables and just rename a few, you can use `rename()` instead of `select()`:

 ```{r}
 rename(flights, tail_num = tailnum)
 ```

-If you want to move certain variables to the start of the data frame but not drop the others, you can do this in two ways: using `select()` in conjunction with the `everything()` helper or using `relocate()`.
+It works exactly the same way as `select()`, but keeps all the variables that aren't explicitly selected.
+
+### `relocate()`
+
+You can move variables around with `relocate()`.
+By default it moves variables to the front:

 ```{r}
-select(flights, time_hour, air_time, everything())
 relocate(flights, time_hour, air_time)
 ```

-### Exercises
-
-1.  Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from `flights`.
-
-2.  What happens if you include the name of a variable multiple times in a `select()` call?
-
-3.  What does the `any_of()` function do?
-    Why might it be helpful in conjunction with this vector?
-
-    ```{r}
-    variables <- c("year", "month", "day", "dep_delay", "arr_delay")
-    ```
-
-4.  Does the result of running the following code surprise you?
-    How do the select helpers deal with case by default?
-    How can you change that default?
-
-    ```{r, eval = FALSE}
-    select(flights, contains("TIME"))
-    ```
-
-## Add new variables with `mutate()`
-
-Besides selecting sets of existing columns, it's often useful to add new columns that are functions of existing columns.
-That's the job of `mutate()`.
-
-`mutate()` always adds new columns at the end of your dataset so we'll start by creating a narrower dataset so we can see the new variables.
-Remember that when you're in RStudio, the easiest way to see all the columns is `View()`.
+But you can use the `.before` and `.after` arguments to choose where to place them:

 ```{r}
-flights_sml <- select(flights, 
-  year:day, 
-  ends_with("delay"), 
-  distance, 
-  air_time
-)
-mutate(flights_sml,
-  gain = dep_delay - arr_delay,
-  speed = distance / air_time * 60
-)
-```
-
-Note that you can refer to columns that you've just created:
-
-```{r}
-mutate(flights_sml,
-  gain = dep_delay - arr_delay,
-  hours = air_time / 60,
-  gain_per_hour = gain / hours
-)
-```
-
-If you only want to keep the new variables, use `transmute()`:
-
-```{r}
-transmute(flights,
-  gain = dep_delay - arr_delay,
-  hours = air_time / 60,
-  gain_per_hour = gain / hours
-)
+relocate(flights, year:dep_time, .after = time_hour)
+relocate(flights, starts_with("arr"), .before = dep_time)
 ```

 ### Exercises
@@ -293,68 +296,75 @@ ggplot(flights, aes(air_time - airtime2)) + geom_histogram()
 3.  Compare `dep_time`, `sched_dep_time`, and `dep_delay`.
    How would you expect those three numbers to be related?

-4.  Find the 10 most delayed flights using a ranking function.
-    How do you want to handle ties?
-    Carefully read the documentation for `min_rank()`.
+4.  Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from `flights`.

-5.  What does `1:3 + 1:10` return?
-    Why?
+5.  What happens if you include the name of a variable multiple times in a `select()` call?

-6.  What trigonometric functions does R provide?
+6.  What does the `any_of()` function do?
+    Why might it be helpful in conjunction with this vector?

-## Grouped summaries with `summarise()`
+    ```{r}
+    variables <- c("year", "month", "day", "dep_delay", "arr_delay")
+    ```

-The last key verb is `summarise()`.
-It collapses a data frame to a single row:
+7.  Does the result of running the following code surprise you?
+    How do the select helpers deal with case by default?
+    How can you change that default?

-```{r}
-summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
-```
+    ```{r, eval = FALSE}
+    select(flights, contains("TIME"))
+    ```

-(We'll come back to what that `na.rm = TRUE` means very shortly.)
+## Groups

-`summarise()` is not terribly useful unless we pair it with `group_by()`.
-This changes the unit of analysis from the complete dataset to individual groups.
-Then, when you use the dplyr verbs on a grouped data frame they'll be automatically applied "by group".
-For example, if we applied exactly the same code to a data frame grouped by month, we get the average delay per month:
+### `group_by()`
+
+`group_by()` doesn't appear to do anything:

 ```{r}
 by_month <- group_by(flights, month)
+by_month
+```
+
+If you look closely, you'll notice that it's now "grouped by" month, but otherwise the data is unchanged.
+The reason to group your data is that it changes how other verbs operate.
+
+### `summarise()`
+
+The most important operation that you might apply to grouped data is a summary.
+It collapses each group to a single row:
+
+```{r}
 summarise(by_month, delay = mean(dep_delay, na.rm = TRUE))
 ```

-Together `group_by()` and `summarise()` provide one of the tools that you'll use most commonly when working with dplyr: grouped summaries.
-But before we go any further with this, we need to introduce a powerful new idea: the pipe.
+You can create any number of summaries at once.
+You'll learn various useful summaries in the upcoming chapters on individual data types; one that is useful right away is `n()`, which returns the number of rows in each group:

-### Combining multiple operations with the pipe
-
-Imagine that we want to explore the relationship between the distance and average delay for each location.
-Using what you know about dplyr, you might write code like this:
-
-```{r, fig.width = 6}
-by_dest <- group_by(flights, dest)
-delay <- summarise(by_dest,
-  count = n(),
-  dist = mean(distance, na.rm = TRUE),
-  delay = mean(arr_delay, na.rm = TRUE)
-)
-delay <- filter(delay, count > 20, dest != "HNL")
-
-# It looks like delays increase with distance up to ~750 miles 
-# and then decrease. Maybe as flights get longer there's more 
-# ability to make up delays in the air?
-ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
-  geom_point(aes(size = count), alpha = 1/3) +
-  geom_smooth(se = FALSE)
+```{r}
+summarise(by_month, delay = mean(dep_delay, na.rm = TRUE), n = n())
 ```

-There are three steps to prepare this data:
+(In fact, `count()`, which you already learned about, is just a shortcut for grouping and summarising with `n()`.)
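+
+To make the relationship between `count()` and `n()` concrete, here is a small sketch.
+Both calls should give one row per month, with the number of flights in a column called `n`; the `library()` calls are repeated only so the chunk runs on its own.
+
+```{r}
+library(dplyr)
+library(nycflights13)
+
+# Shortcut
+count(flights, month)
+
+# Spelled out: group, then summarise with n()
+summarise(group_by(flights, month), n = n())
+```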

-1.  Group flights by destination.
+Here we've used `mean()` to compute the average delay for each month.
+The `na.rm = TRUE` is important because it asks R to "remove" (rm) the missing (na) values.
+If you forget it, the output isn't very useful:

-2.  Summarise to compute distance, average delay, and number of flights.
+```{r}
+summarise(by_month, delay = mean(dep_delay))
+```

-3.  Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
+We'll come back to discuss missing values in Chapter \@ref(missing-values).
+For now, know that you can drop them in summary functions by using `na.rm = TRUE`, or remove them beforehand with a filter by using `!is.na()`:
+
+```{r}
+not_cancelled <- filter(flights, !is.na(dep_delay))
+by_month <- group_by(not_cancelled, month)
+summarise(by_month, delay = mean(dep_delay))
+```
+
+### Combining multiple operations with the pipe

 This code is a little frustrating to write because we have to give each intermediate data frame a name, even though we don't care about them.
 Naming things is hard, so this slows down our analysis.
@@ -362,66 +372,23 @@ Naming things is hard, so this slows down our analysis.
 There's another way to tackle the same problem with the pipe, `%>%`:

 ```{r}
-sdelays <- flights %>% 
-  group_by(dest) %>% 
-  summarise(
-    count = n(),
-    dist = mean(distance, na.rm = TRUE),
-    delay = mean(arr_delay, na.rm = TRUE)
-  ) %>% 
-  filter(count > 20, dest != "HNL")
+flights %>% 
+  filter(!is.na(dep_delay)) %>% 
+  group_by(month) %>%
+  summarise(delay = mean(dep_delay))
 ```

 This focuses on the transformations, not what's being transformed, which makes the code easier to read.
-You can read it as a series of imperative statements: group, then summarise, then filter.
+You can read it as a series of imperative statements: filter, then group, then summarise.
 As suggested by this reading, a good way to pronounce `%>%` when reading code is "then".

 Behind the scenes, `x %>% f(y)` turns into `f(x, y)`, and `x %>% f(y) %>% g(z)` turns into `g(f(x, y), z)` and so on.
 You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom.
 We'll use piping frequently from now on because it considerably improves the readability of code, and we'll come back to it in more detail in Chapter \@ref(pipes).
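+
+As a quick check of that rewriting rule, the sketch below builds the same summary twice, once with nested calls and once with the pipe.
+The names `nested` and `piped` are illustrative, and the `library()` calls are repeated only so the chunk runs on its own.
+
+```{r}
+library(dplyr)
+library(nycflights13)
+
+# Nested calls: read from the inside out
+nested <- summarise(
+  group_by(flights, month),
+  delay = mean(dep_delay, na.rm = TRUE)
+)
+
+# Piped: read from top to bottom
+piped <- flights %>%
+  group_by(month) %>%
+  summarise(delay = mean(dep_delay, na.rm = TRUE))
+
+# Check that both forms give the same result
+identical(nested, piped)
+```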

-Working with the pipe is one of the key criteria for belonging to the tidyverse.
-The only exception is ggplot2: it was written before the pipe was discovered.
-Unfortunately, the next iteration of ggplot2, ggvis, which does use the pipe, isn't quite ready for prime time yet.
+### Grouping by multiple variables

-## Missing values {#missing-values-summarise}
-
-You may have wondered about the `na.rm` argument we used above.
-What happens if we don't set it?
-
-```{r}
-flights %>% 
-  group_by(month) %>% 
-  summarise(mean = mean(dep_delay))
-```
-
-We get a lot of missing values!
-That's because aggregation functions obey the usual rule of missing values: if there's any missing value in the input, the output will be a missing value.
-Fortunately, all aggregation functions have an `na.rm` argument which removes the missing values prior to computation:
-
-```{r}
-flights %>% 
-  group_by(month) %>% 
-  summarise(mean = mean(dep_delay, na.rm = TRUE))
-```
-
-In this case, missing values represent cancelled flights, therefore we could also tackle the problem by first removing the cancelled flights.
-We'll save this dataset so we can reuse it in the next few examples.
-
-```{r}
-not_cancelled <- flights %>% 
-  filter(!is.na(dep_delay), !is.na(arr_delay))
-
-not_cancelled %>% 
-  group_by(month) %>% 
-  summarise(mean = mean(dep_delay))
-```
-
-## Grouping by multiple variables
-
-You can group a data frame by multiple variables as well.
-Note that the grouping information is printed on top of the output.
-The number in the square brackets indicates how many groups are created.
+You can group a data frame by multiple variables:

 ```{r}
 daily <- group_by(flights, year, month, day)
@@ -431,34 +398,22 @@ daily
 When you group by multiple variables, each summary peels off one level of the grouping by default, and a message is printed that tells you how you can change this behaviour.

 ```{r}
-summarise(daily, flights = n())
+daily %>% summarise(flights = n())
 ```

 If you're happy with this behaviour, you can also explicitly define it, in which case the message won't be printed out.

-```{r}
+```{r results = FALSE}
 summarise(daily, flights = n(), .groups = "drop_last")
 ```

-Or you can change the default behaviour by setting a different value, e.g. `"drop"` for dropping all levels of grouping or `"keep"` for same grouping structure as `daily`.
+Or you can change the default behaviour by setting a different value, e.g. `"drop"` for dropping all levels of grouping or `"keep"` for the same grouping structure as `daily`:

-```{r}
-# Note the difference between the grouping structures
+```{r results = FALSE}
 summarise(daily, flights = n(), .groups = "drop")
 summarise(daily, flights = n(), .groups = "keep")
 ```

-The fact that each summary peels off one level of the grouping by default makes it easy to progressively roll up a dataset:
-
-```{r}
-(per_day   <- summarise(daily, flights = n()))
-(per_month <- summarise(per_day, flights = sum(flights)))
-(per_year  <- summarise(per_month, flights = sum(flights)))
-```
-
-Be careful when progressively rolling up summaries: it's OK for sums and counts, but you need to think about weighting means and variances, and it's not possible to do it exactly for rank-based statistics like the median.
-In other words, the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median.
-
 ### Ungrouping

 You might also want to remove grouping outside of `summarise()`.
@@ -466,11 +421,33 @@ You can do this and return to operations on ungrouped data using `ungroup()`.

 ```{r}
 daily %>% 
-  ungroup() %>%             # no longer grouped by date
-  summarise(flights = n())  # all flights
+  ungroup() %>%
+  summarise(
+    delay = mean(dep_delay, na.rm = TRUE), 
+    flights = n()
+  )
 ```

-### Counts
+For the purposes of summarising, ungrouped data is treated as if all your data was in a single group, so you get one row back.
+
+### Other verbs
+
+-   `select()`, `rename()`, `relocate()`: grouping has no effect.
+
+-   `filter()`, `mutate()`: computation happens per group.
+    This doesn't affect the functions you currently know but is very useful once you learn about window functions in Section \@ref(window-functions).
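+
+As a rough sketch of what "per group" means: a summary function such as `mean()`, used inside a grouped `mutate()`, is evaluated once per group rather than once for the whole dataset.
+The column name `month_delay` and the result name `above_avg` are illustrative, and the `library()` calls are repeated only so the chunk runs on its own.
+
+```{r}
+library(dplyr)
+library(nycflights13)
+
+above_avg <- flights %>%
+  group_by(month) %>%
+  # month_delay is the average for that row's month, not for the whole year
+  mutate(month_delay = mean(dep_delay, na.rm = TRUE)) %>%
+  # keep flights delayed by more than their own month's average
+  filter(dep_delay > month_delay) %>%
+  select(month, day, dep_delay, month_delay)
+
+above_avg
+```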
+
+### Exercises
+
+1.  Which carrier has the worst delays?
+    Challenge: can you disentangle the effects of bad airports vs. bad carriers?
+    Why/why not?
+    (Hint: think about `flights %>% group_by(carrier, dest) %>% summarise(n())`)
+
+2.  What does the `sort` argument to `count()` do?
+    Can you explain it in terms of the dplyr verbs you've learned so far?
+
+## Case study: aggregates and sample size

 Whenever you do any aggregation, it's always a good idea to include either a count (`n()`), or a count of non-missing values (`sum(!is.na(x))`).
 That way you can check that you're not drawing conclusions based on very small amounts of data.
@@ -518,15 +495,6 @@ delays %>%
  geom_point(alpha = 1/10)
 ```

------------------------------------------------------------------------
-
-RStudio tip: a useful keyboard shortcut is Cmd/Ctrl + Shift + P.
-This resends the previously sent chunk from the editor to the console.
-This is very convenient when you're (e.g.) exploring the value of `n` in the example above.
-You send the whole block once with Cmd/Ctrl + Enter, then you modify the value of `n` and press Cmd/Ctrl + Shift + P to resend the complete block.
-
------------------------------------------------------------------------
-
 There's another common variation of this type of pattern.
 Let's look at how the average performance of batters in baseball is related to the number of times they're at bat.
 Here I use data from the **Lahman** package to compute the batting average (number of hits / number of attempts) of every major league baseball player.
@@ -565,99 +533,3 @@ batters %>%
 ```

 You can find a good explanation of this problem at <http://varianceexplained.org/r/empirical_bayes_baseball/> and <http://www.evanmiller.org/how-not-to-sort-by-average-rating.html>.
-
-### Exercises
-
-1.  Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights.
-    Consider the following scenarios:
-
-    -   A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
-
-    -   A flight is always 10 minutes late.
-
-    -   A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
-
-    -   99% of the time a flight is on time.
-        1% of the time it's 2 hours late.
-
-    Which is more important: arrival delay or departure delay?
-
-2.  Come up with another approach that will give you the same output as `not_cancelled %>% count(dest)` and `not_cancelled %>% count(tailnum, wt = distance)` (without using `count()`).
-
-3.  Our definition of cancelled flights (`is.na(dep_delay) | is.na(arr_delay)` ) is slightly suboptimal.
-    Why?
-    Which is the most important column?
-
-4.  Look at the number of cancelled flights per day.
-    Is there a pattern?
-    Is the proportion of cancelled flights related to the average delay?
-
-5.  Which carrier has the worst delays?
-    Challenge: can you disentangle the effects of bad airports vs. bad carriers?
-    Why/why not?
-    (Hint: think about `flights %>% group_by(carrier, dest) %>% summarise(n())`)
-
-6.  What does the `sort` argument to `count()` do.
-    When might you use it?
-
-## Grouped mutates and filters
-
-Grouping is most useful in conjunction with `summarise()`, but you can also do convenient operations with `mutate()` and `filter()`:
-
--   Find the worst members of each group:
-
-    ```{r}
-    flights_sml %>% 
-      group_by(year, month, day) %>%
-      filter(rank(desc(arr_delay)) < 10)
-    ```
-
--   Find all groups bigger than a threshold:
-
-    ```{r}
-    popular_dests <- flights %>% 
-      group_by(dest) %>% 
-      filter(n() > 365)
-    popular_dests
-    ```
-
--   Standardise to compute per group metrics:
-
-    ```{r}
-    popular_dests %>% 
-      filter(arr_delay > 0) %>% 
-      mutate(prop_delay = arr_delay / sum(arr_delay)) %>% 
-      select(year:day, dest, arr_delay, prop_delay)
-    ```
-
-A grouped filter is a grouped mutate followed by an ungrouped filter.
-I generally avoid them except for quick and dirty manipulations: otherwise it's hard to check that you've done the manipulation correctly.
-
-Functions that work most naturally in grouped mutates and filters are known as window functions (vs. the summary functions used for summaries).
-You can learn more about useful window functions in the corresponding vignette: `vignette("window-functions")`.
-
-### Exercises
-
-1.  Refer back to the lists of useful mutate and filtering functions.
-    Describe how each operation changes when you combine it with grouping.
-
-2.  Which plane (`tailnum`) has the worst on-time record?
-
-3.  What time of day should you fly if you want to avoid delays as much as possible?
-
-4.  For each destination, compute the total minutes of delay.
-    For each flight, compute the proportion of the total delay for its destination.
-
-5.  Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave.
-    Using `lag()`, explore how the delay of a flight is related to the delay of the immediately preceding flight.
-
-6.  Look at each destination.
-    Can you find flights that are suspiciously fast?
-    (i.e. flights that represent a potential data entry error).
-    Compute the air time of a flight relative to the shortest flight to that destination.
-    Which flights were most delayed in the air?
-
-7.  Find all destinations that are flown by at least two carriers.
-    Use that information to rank the carriers.
-
-8.  For each plane, count the number of flights before the first delay of greater than 1 hour.
--- a/logicals-numbers.Rmd
+++ b/logicals-numbers.Rmd
@@ -26,7 +26,7 @@ filter(flights, month == 11 | month == 12)
 ```

 The order of operations doesn't work like English.
-You can't write `filter(flights, month == (11 | 12))`, which you might literally translate into "finds all flights that departed in November or December".
+You can't write `filter(flights, month == 11 | 12)`, which you might literally translate into "finds all flights that departed in November or December".
 Instead it finds all months that equal `11 | 12`, an expression that evaluates to `TRUE`.
 In a numeric context (like here), `TRUE` becomes `1`, so this finds all flights in January, not November or December.
 This is quite confusing!
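+
+Operator precedence matters here, and a small vector makes it easy to check.
+The sketch below uses a hypothetical `month` vector rather than `flights`, and needs only base R.
+
+```{r}
+month <- c(1, 11, 12)
+
+# `11 | 12` is a single logical value: TRUE
+11 | 12
+
+# With parentheses, `month` is compared against TRUE, which is coerced to 1,
+# so only the January value matches: TRUE FALSE FALSE
+month == (11 | 12)
+
+# Without parentheses, `==` binds more tightly than `|`, so this is
+# (month == 11) | 12, and any non-zero number counts as TRUE: TRUE TRUE TRUE
+month == 11 | 12
+
+# What was intended: FALSE TRUE TRUE
+month %in% c(11, 12)
+```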
@@ -77,6 +77,12 @@ You'll learn how to create new variables shortly.
      summarise(hour_prop = mean(arr_delay > 60))
    ```

+`cumany()` `cumall()`
+
+### Exercises
+
+1.  For each plane, count the number of flights before the first delay of greater than 1 hour.
+
 ## Basic math

 There are many functions for creating new variables that you can use with `mutate()`.
@@ -121,6 +127,12 @@ There's no way to list every possible function that you might use, but here's a
    cummean(x)
    ```

+### Recycling rules
+
+Base R.
+
+Tidyverse.
+
 ## Summaries

 Just using means, counts, and sum can get you a long way, but R provides many other useful summary functions:
@@ -175,6 +187,22 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
      )
    ```

+### Exercises
+
+1.  Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights.
+    Consider the following scenarios:
+
+    -   A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
+
+    -   A flight is always 10 minutes late.
+
+    -   A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
+
+    -   99% of the time a flight is on time.
+        1% of the time it's 2 hours late.
+
+    Which is more important: arrival delay or departure delay?
+
 ## Floating point

 There's another common problem you might encounter when using `==`: floating point numbers.
@@ -195,5 +223,6 @@ near(1 / 49 * 49, 1)

 ## Exercises

-1.  How could you use `arrange()` to sort all missing values to the start?
-    (Hint: use `!is.na()`).
+1.  What trigonometric functions does R provide?
+2.  
+
--- a/missing-values.Rmd
+++ b/missing-values.Rmd
@@ -46,6 +46,21 @@ If you want to determine if a value is missing, use `is.na()`:
 is.na(x)
 ```

+### Exercises
+
+1.  How many flights have a missing `dep_time`?
+    What other variables are missing?
+    What might these rows represent?
+
+2.  How could you use `arrange()` to sort all missing values to the start?
+    (Hint: use `!is.na()`).
+
+3.  Come up with another approach that will give you the same output as `not_cancelled %>% count(dest)` and `not_cancelled %>% count(tailnum, wt = distance)` (without using `count()`).
+
+4.  Look at the number of cancelled flights per day.
+    Is there a pattern?
+    Is the proportion of cancelled flights related to the average delay?
+
 ## Explicit vs implicit missing values {#missing-values-tidy}

 Changing the representation of a dataset brings up an important subtlety of missing values.
@@ -151,8 +166,8 @@ arrange(df, desc(x))

 ## Exercises

-1.  Why is `NA ^ 0` not missing?
-    Why is `NA | TRUE` not missing?
-    Why is `FALSE & NA` not missing?
-    Can you figure out the general rule?
-    (`NA * 0` is a tricky counterexample!)
+1.  Why is `NA ^ 0` not missing? Why is `NA | TRUE` not missing? Why is `FALSE & NA` not missing? Can you figure out the general rule? (`NA * 0` is a tricky counterexample!)
+
+### Missing matches
+
+Discuss `anti_join()`
--- a/vector-tools.Rmd
+++ b/vector-tools.Rmd
@@ -102,3 +102,74 @@ not_cancelled <- flights %>%
      mutate(r = min_rank(desc(dep_time))) %>% 
      filter(r %in% range(r))
    ```
+
+### dplyr
+
+```{r}
+flights_sml <- select(flights, 
+  year:day, 
+  ends_with("delay"), 
+  distance, 
+  air_time
+)
+```
+
+-   Find the worst members of each group:
+
+    ```{r}
+    flights_sml %>% 
+      group_by(year, month, day) %>%
+      filter(rank(desc(arr_delay)) < 10)
+    ```
+
+-   Find all groups bigger than a threshold:
+
+    ```{r}
+    popular_dests <- flights %>% 
+      group_by(dest) %>% 
+      filter(n() > 365)
+    popular_dests
+    ```
+
+-   Standardise to compute per group metrics:
+
+    ```{r}
+    popular_dests %>% 
+      filter(arr_delay > 0) %>% 
+      mutate(prop_delay = arr_delay / sum(arr_delay)) %>% 
+      select(year:day, dest, arr_delay, prop_delay)
+    ```
+
+A grouped filter is a grouped mutate followed by an ungrouped filter.
+I generally avoid them except for quick and dirty manipulations: otherwise it's hard to check that you've done the manipulation correctly.
+
+Functions that work most naturally in grouped mutates and filters are known as window functions (vs. the summary functions used for summaries).
+You can learn more about useful window functions in the corresponding vignette: `vignette("window-functions")`.
+
+### Exercises
+
+1.  Find the 10 most delayed flights using a ranking function.
+    How do you want to handle ties?
+    Carefully read the documentation for `min_rank()`.
+
+2.  Which plane (`tailnum`) has the worst on-time record?
+
+3.  What time of day should you fly if you want to avoid delays as much as possible?
+
+4.  For each destination, compute the total minutes of delay.
+    For each flight, compute the proportion of the total delay for its destination.
+
+5.  Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave.
+    Using `lag()`, explore how the delay of a flight is related to the delay of the immediately preceding flight.
+
+6.  Look at each destination.
+    Can you find flights that are suspiciously fast?
+    (i.e. flights that represent a potential data entry error).
+    Compute the air time of a flight relative to the shortest flight to that destination.
+    Which flights were most delayed in the air?
+
+7.  Find all destinations that are flown by at least two carriers.
+    Use that information to rank the carriers.
+
+8.  
+