diff --git a/data-transform.Rmd b/data-transform.Rmd
index 1e9f08f..19118eb 100644
--- a/data-transform.Rmd
+++ b/data-transform.Rmd
@@ -30,46 +30,47 @@ The data comes from the US [Bureau of Transportation Statistics](http://www.tran
flights
```
-You might notice that this data frame prints a little differently from other data frames you might have used in the past: it only shows the first few rows and all the columns that fit on one screen.
-It also displays the number of rows (`r format(nrow(nycflights13::flights), big.mark = ",")`) and columns (`r ncol(nycflights13::flights)`).
-(To see the whole dataset, you can run `View(flights)` which will open the dataset in the RStudio viewer).
-It prints differently because it's a **tibble**.
-Tibbles are data frames, but slightly tweaked to work better in the tidyverse.
-For now, you don't need to worry about the differences; we'll come back to tibbles in more detail in Chapter \@ref(tibbles).
+If you've used R before, you might notice that this data frame prints a little differently to data frames you might've worked with in the past.
+That's because it's a **tibble**, a special type of data frame designed by the tidyverse team.
+
+The most important difference between a tibble and a data frame is the print method.
+Tibbles only show the first few rows and the columns that fit on one screen.
+This makes it easier to rapidly iterate when working with large data; if you want to see everything you can use `View(flights)` to open the dataset in the RStudio viewer.
+We'll come back to other important differences in Chapter \@ref(tibbles).
You might also have noticed the row of three (or four) letter abbreviations under the column names.
These describe the type of each variable:
-- `int` stands for integers.
+- `int` stands for integer.
-- `dbl` stands for doubles, or real numbers.
+- `dbl` stands for double, a vector of real numbers.
-- `chr` stands for characters, or strings.
+- `chr` stands for character, a vector of strings.
-- `dttm` stands for date-times (a date + a time).
+- `dttm` stands for date-time (a date + a time).
There are three other common types of variables that aren't used in this dataset but you'll encounter later in the book:
- `lgl` stands for logical, vectors that contain only `TRUE` or `FALSE`.
-- `fctr` stands for factors, which R uses to represent categorical variables with fixed possible values.
+- `fctr` stands for factor, which R uses to represent categorical variables with fixed possible values.
-- `date` stands for dates.
+- `date` stands for date.
### dplyr basics
-In this chapter you are going to learn the five key dplyr functions that allow you to solve the vast majority of your data manipulation challenges:
+In this chapter you are going to learn the primary dplyr verbs that allow you to solve the vast majority of your data manipulation challenges.
+They are organised into three camps:
-- Pick observations by their values (`filter()`).
-- Reorder the rows (`arrange()`).
-- Pick variables by their names (`select()`).
-- Create new variables with functions of existing variables (`mutate()`).
-- Collapse many values down to a single summary (`summarise()`).
+- Functions that operate on **rows**, like `filter()` which subsets rows based on the values of the columns, the `slice()` functions that subset rows based on their position, and `arrange()` which changes the order of the rows.
-These can all be used in conjunction with `group_by()` which changes the scope of each function from operating on the entire dataset to operating on it group-by-group.
-These six functions provide the verbs for a language of data manipulation.
+- Functions that operate on **columns**, like `mutate()` which creates new columns, `select()` which picks columns, `rename()` which changes column names, and `relocate()` which moves columns from place to place.
-All verbs work similarly:
+- Functions that operate on **groups**, like `group_by()` which divides data up into groups for analysis, and `summarise()` which reduces each group to a single row.
+
+Later, in Chapter \@ref(relational-data), you'll learn about other verbs that work with **tables**, like the join functions and the set operations.
+
+All dplyr verbs work the same way:
1. The first argument is a data frame.
@@ -80,7 +81,9 @@ All verbs work similarly:
Together these properties make it easy to chain together multiple simple steps to achieve a complex result.
Let's dive in and see how these verbs work.
-## Filter rows with `filter()`
+## Rows
+
+### `filter()`
`filter()` allows you to subset observations based on their values.
The first argument is the name of the data frame.
@@ -105,35 +108,20 @@ If you want to do both, you can wrap the assignment in parentheses:
(dec25 <- filter(flights, month == 12, day == 25))
```
-### Comparisons
-
To use filtering effectively, you have to know how to select the observations that you want using the comparison operators.
R provides the standard suite: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal).
+It also provides `%in%`: `filter(df, x %in% c(a, b, c))` will return all rows where `x` is `a`, `b`, or `c`.
When you're starting out with R, the easiest mistake to make is to use `=` instead of `==` when testing for equality.
-When this happens you'll get an informative error:
+`filter()` will let you know when this happens:
```{r, error = TRUE}
filter(flights, month = 1)
```
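+
+As a minimal sketch of `%in%`, the two calls below should select the same rows.
+The choice of months and the names `nov_dec` and `nov_dec2` are only illustrative.
+
+```{r}
+library(dplyr)
+library(nycflights13)
+
+# Flights that departed in November or December
+nov_dec <- filter(flights, month %in% c(11, 12))
+
+# The same subset written with == and |
+nov_dec2 <- filter(flights, month == 11 | month == 12)
+
+# Check that both approaches return the same data
+identical(nov_dec, nov_dec2)
+```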
-### Exercises
+### `slice()`
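+
+The `slice()` functions subset rows based on their position rather than their values.
+The sketch below is only illustrative; the row counts and positions are arbitrary choices, not taken from the text.
+
+```{r}
+library(dplyr)
+library(nycflights13)
+
+# The first five rows, in their current order
+slice_head(flights, n = 5)
+
+# The flights with the three largest departure delays
+# (ties are kept by default, so more than three rows can come back)
+slice_max(flights, dep_delay, n = 3)
+
+# Rows 1, 3, and 5, chosen by position
+slice(flights, c(1, 3, 5))
+```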
-1. Find all flights that
-
- a. Had an arrival delay of two or more hours
- b. Flew to Houston (`IAH` or `HOU`)
- c. Were operated by United, American, or Delta
- d. Departed in summer (July, August, and September)
- e. Arrived more than two hours late, but didn't leave late
- f. Were delayed by at least an hour, but made up over 30 minutes in flight
- g. Departed between midnight and 6am (inclusive)
-
-2. How many flights have a missing `dep_time`?
- What other variables are missing?
- What might these rows represent?
-
-## Arrange rows with `arrange()`
+### `arrange()`
`arrange()` works similarly to `filter()` except that instead of selecting rows, it changes their order.
It takes a data frame and a set of column names (or more complicated expressions) to order by.
@@ -151,15 +139,73 @@ arrange(flights, desc(dep_delay))
### Exercises
-1. Sort `flights` to find the flights with longest departure delays.
+1. Find all flights that
+
+ a. Had an arrival delay of two or more hours
+ b. Flew to Houston (`IAH` or `HOU`)
+ c. Were operated by United, American, or Delta
+ d. Departed in summer (July, August, and September)
+ e. Arrived more than two hours late, but didn't leave late
+ f. Were delayed by at least an hour, but made up over 30 minutes in flight
+ g. Departed between midnight and 6am (inclusive)
+
+2. Sort `flights` to find the flights with the longest departure delays.
Find the flights that left earliest.
-2. Sort `flights` to find the fastest (highest speed) flights.
+3. Sort `flights` to find the fastest (highest speed) flights.
+ (Hint: try sorting by a calculation).
-3. Which flights travelled the farthest?
+4. Which flights travelled the farthest?
Which travelled the shortest?
-## Select columns with `select()` {#select}
+## Columns
+
+### `mutate()`
+
+Besides selecting sets of existing columns, it's often useful to add new columns that are functions of existing columns.
+That's the job of `mutate()`.
+
+`mutate()` always adds new columns at the end of your dataset so we'll start by creating a narrower dataset so we can see the new variables.
+Remember that when you're in RStudio, the easiest way to see all the columns is `View()`.
+
+```{r}
+flights_sml <- select(flights,
+ year:day,
+ ends_with("delay"),
+ distance,
+ air_time
+)
+```
+
+```{r}
+mutate(flights_sml,
+ gain = dep_delay - arr_delay,
+ speed = distance / air_time * 60
+)
+```
+
+Note that you can refer to columns that you've just created:
+
+```{r}
+mutate(flights_sml,
+ gain = dep_delay - arr_delay,
+ hours = air_time / 60,
+ gain_per_hour = gain / hours
+)
+```
+
+You can control which variables are kept with the `.keep` argument; for example, `.keep = "none"` keeps only the variables you've just created:
+
+```{r}
+mutate(flights,
+ gain = dep_delay - arr_delay,
+ hours = air_time / 60,
+ gain_per_hour = gain / hours,
+ .keep = "none"
+)
+```
+
+### `select()` {#select}
It's not uncommon to get datasets with hundreds or even thousands of variables.
In this case, the first challenge is often narrowing in on the variables you're actually interested in.
@@ -190,80 +236,37 @@ There are a number of helper functions you can use within `select()`:
See `?select` for more details.
-`select()` can be used to rename variables, but it's rarely useful because it drops all of the variables not explicitly mentioned.
-Instead, use `rename()`, which is a variant of `select()` that keeps all the variables that aren't explicitly mentioned:
+You can rename variables as you `select()` them by using `=`.
+The new name appears on the left hand side of the `=`, and the old variable appears on the right hand side:
+
+```{r}
+select(flights, tail_num = tailnum)
+```
+
+### `rename()`
+
+If you want to keep all the existing variables and only rename a few, you can use `rename()` instead of `select()`:
```{r}
rename(flights, tail_num = tailnum)
```
-If you want to move certain variables to the start of the data frame but not drop the others, you can do this in two ways: using `select()` in conjunction with the `everything()` helper or using `relocate()`.
+It works exactly the same way as `select()`, but keeps all the variables that aren't explicitly selected.
+
+### `relocate()`
+
+You can move variables around with `relocate()`.
+By default it moves variables to the front:
```{r}
-select(flights, time_hour, air_time, everything())
relocate(flights, time_hour, air_time)
```
-### Exercises
-
-1. Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from `flights`.
-
-2. What happens if you include the name of a variable multiple times in a `select()` call?
-
-3. What does the `any_of()` function do?
- Why might it be helpful in conjunction with this vector?
-
- ```{r}
- variables <- c("year", "month", "day", "dep_delay", "arr_delay")
- ```
-
-4. Does the result of running the following code surprise you?
- How do the select helpers deal with case by default?
- How can you change that default?
-
- ```{r, eval = FALSE}
- select(flights, contains("TIME"))
- ```
-
-## Add new variables with `mutate()`
-
-Besides selecting sets of existing columns, it's often useful to add new columns that are functions of existing columns.
-That's the job of `mutate()`.
-
-`mutate()` always adds new columns at the end of your dataset so we'll start by creating a narrower dataset so we can see the new variables.
-Remember that when you're in RStudio, the easiest way to see all the columns is `View()`.
+But you can use the `.before` and `.after` arguments to choose where to place them:
```{r}
-flights_sml <- select(flights,
- year:day,
- ends_with("delay"),
- distance,
- air_time
-)
-mutate(flights_sml,
- gain = dep_delay - arr_delay,
- speed = distance / air_time * 60
-)
-```
-
-Note that you can refer to columns that you've just created:
-
-```{r}
-mutate(flights_sml,
- gain = dep_delay - arr_delay,
- hours = air_time / 60,
- gain_per_hour = gain / hours
-)
-```
-
-If you only want to keep the new variables, use `transmute()`:
-
-```{r}
-transmute(flights,
- gain = dep_delay - arr_delay,
- hours = air_time / 60,
- gain_per_hour = gain / hours
-)
+relocate(flights, year:dep_time, .after = time_hour)
+relocate(flights, starts_with("arr"), .before = dep_time)
```
### Exercises
@@ -293,68 +296,75 @@ ggplot(flights, aes(air_time - airtime2)) + geom_histogram()
3. Compare `dep_time`, `sched_dep_time`, and `dep_delay`.
How would you expect those three numbers to be related?
-4. Find the 10 most delayed flights using a ranking function.
- How do you want to handle ties?
- Carefully read the documentation for `min_rank()`.
+4. Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from `flights`.
-5. What does `1:3 + 1:10` return?
- Why?
+5. What happens if you include the name of a variable multiple times in a `select()` call?
-6. What trigonometric functions does R provide?
+6. What does the `any_of()` function do?
+ Why might it be helpful in conjunction with this vector?
-## Grouped summaries with `summarise()`
+ ```{r}
+ variables <- c("year", "month", "day", "dep_delay", "arr_delay")
+ ```
-The last key verb is `summarise()`.
-It collapses a data frame to a single row:
+7. Does the result of running the following code surprise you?
+ How do the select helpers deal with case by default?
+ How can you change that default?
-```{r}
-summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
-```
+ ```{r, eval = FALSE}
+ select(flights, contains("TIME"))
+ ```
-(We'll come back to what that `na.rm = TRUE` means very shortly.)
+## Groups
-`summarise()` is not terribly useful unless we pair it with `group_by()`.
-This changes the unit of analysis from the complete dataset to individual groups.
-Then, when you use the dplyr verbs on a grouped data frame they'll be automatically applied "by group".
-For example, if we applied exactly the same code to a data frame grouped by month, we get the average delay per month:
+### `group_by()`
+
+`group_by()` doesn't appear to do anything:
```{r}
by_month <- group_by(flights, month)
+by_month
+```
+
+If you look closely, you'll notice that it's now "grouped by" month, but otherwise the data is unchanged.
+You group your data because grouping changes how other verbs operate.
+
+### `summarise()`
+
+The most important operation that you might apply to grouped data is a summary.
+It collapses each group to a single row:
+
+```{r}
summarise(by_month, delay = mean(dep_delay, na.rm = TRUE))
```
-Together `group_by()` and `summarise()` provide one of the tools that you'll use most commonly when working with dplyr: grouped summaries.
-But before we go any further with this, we need to introduce a powerful new idea: the pipe.
+You can create any number of summaries at once.
+You'll learn various useful summaries in the upcoming chapters on individual data types, but one very useful summary function is `n()`, which returns the number of rows in each group:
-### Combining multiple operations with the pipe
-
-Imagine that we want to explore the relationship between the distance and average delay for each location.
-Using what you know about dplyr, you might write code like this:
-
-```{r, fig.width = 6}
-by_dest <- group_by(flights, dest)
-delay <- summarise(by_dest,
- count = n(),
- dist = mean(distance, na.rm = TRUE),
- delay = mean(arr_delay, na.rm = TRUE)
-)
-delay <- filter(delay, count > 20, dest != "HNL")
-
-# It looks like delays increase with distance up to ~750 miles
-# and then decrease. Maybe as flights get longer there's more
-# ability to make up delays in the air?
-ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
- geom_point(aes(size = count), alpha = 1/3) +
- geom_smooth(se = FALSE)
+```{r}
+summarise(by_month, delay = mean(dep_delay, na.rm = TRUE), n = n())
```
-There are three steps to prepare this data:
+(In fact, `count()`, which you already learned about, is just a shortcut for grouping and summarising with `n()`.)
-1. Group flights by destination.
+Here we've used `mean()` to compute the average delay for each month.
+The `na.rm = TRUE` is important because it asks R to "remove" (rm) the missing (na) values.
+If you forget it, the output isn't very useful:
-2. Summarise to compute distance, average delay, and number of flights.
+```{r}
+summarise(by_month, delay = mean(dep_delay))
+```
-3. Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
+We'll come back to discuss missing values in Chapter \@ref(missing-values).
+For now, it's enough to know that you can drop them in summary functions by using `na.rm = TRUE` or remove them with a filter by using `!is.na()`:
+
+```{r}
+not_cancelled <- filter(flights, !is.na(dep_delay))
+by_month <- group_by(not_cancelled, month)
+summarise(by_month, delay = mean(dep_delay))
+```
+
+### Combining multiple operations with the pipe
This code is a little frustrating to write because we have to give each intermediate data frame a name, even though we don't care about them.
Naming things is hard, so this slows down our analysis.
@@ -362,66 +372,23 @@ Naming things is hard, so this slows down our analysis.
There's another way to tackle the same problem with the pipe, `%>%`:
```{r}
-sdelays <- flights %>%
- group_by(dest) %>%
- summarise(
- count = n(),
- dist = mean(distance, na.rm = TRUE),
- delay = mean(arr_delay, na.rm = TRUE)
- ) %>%
- filter(count > 20, dest != "HNL")
+flights %>%
+ filter(!is.na(dep_delay)) %>%
+ group_by(month) %>%
+ summarise(delay = mean(dep_delay))
```
This focuses on the transformations, not what's being transformed, which makes the code easier to read.
-You can read it as a series of imperative statements: group, then summarise, then filter.
+You can read it as a series of imperative statements: filter, then group, then summarise.
As suggested by this reading, a good way to pronounce `%>%` when reading code is "then".
Behind the scenes, `x %>% f(y)` turns into `f(x, y)`, and `x %>% f(y) %>% g(z)` turns into `g(f(x, y), z)` and so on.
You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom.
We'll use piping frequently from now on because it considerably improves the readability of code, and we'll come back to it in more detail in Chapter \@ref(pipes).
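+
+As a small sketch of that rewriting rule, the piped and nested forms below should return the same result.
+The names `piped` and `nested` are just illustrative.
+
+```{r}
+library(dplyr)
+library(nycflights13)
+
+# Piped form: read top-to-bottom as "filter, then group, then summarise"
+piped <- flights %>%
+  filter(!is.na(dep_delay)) %>%
+  group_by(month) %>%
+  summarise(delay = mean(dep_delay))
+
+# Nested form: what the pipe expands to, read from the inside out
+nested <- summarise(
+  group_by(filter(flights, !is.na(dep_delay)), month),
+  delay = mean(dep_delay)
+)
+
+# Compare the two results
+all.equal(piped, nested)
+```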
-Working with the pipe is one of the key criteria for belonging to the tidyverse.
-The only exception is ggplot2: it was written before the pipe was discovered.
-Unfortunately, the next iteration of ggplot2, ggvis, which does use the pipe, isn't quite ready for prime time yet.
+### Grouping by multiple variables
-## Missing values {#missing-values-summarise}
-
-You may have wondered about the `na.rm` argument we used above.
-What happens if we don't set it?
-
-```{r}
-flights %>%
- group_by(month) %>%
- summarise(mean = mean(dep_delay))
-```
-
-We get a lot of missing values!
-That's because aggregation functions obey the usual rule of missing values: if there's any missing value in the input, the output will be a missing value.
-Fortunately, all aggregation functions have an `na.rm` argument which removes the missing values prior to computation:
-
-```{r}
-flights %>%
- group_by(month) %>%
- summarise(mean = mean(dep_delay, na.rm = TRUE))
-```
-
-In this case, missing values represent cancelled flights, therefore we could also tackle the problem by first removing the cancelled flights.
-We'll save this dataset so we can reuse it in the next few examples.
-
-```{r}
-not_cancelled <- flights %>%
- filter(!is.na(dep_delay), !is.na(arr_delay))
-
-not_cancelled %>%
- group_by(month) %>%
- summarise(mean = mean(dep_delay))
-```
-
-## Grouping by multiple variables
-
-You can group a data frame by multiple variables as well.
-Note that the grouping information is printed on top of the output.
-The number in the square brackets indicates how many groups are created.
+You can group a data frame by multiple variables:
```{r}
daily <- group_by(flights, year, month, day)
@@ -431,34 +398,22 @@ daily
When you group by multiple variables, each summary peels off one level of the grouping by default, and a message is printed that tells you how you can change this behaviour.
```{r}
-summarise(daily, flights = n())
+daily %>% summarise(flights = n())
```
If you're happy with this behaviour, you can also explicitly define it, in which case the message won't be printed out.
-```{r}
+```{r, results = FALSE}
summarise(daily, flights = n(), .groups = "drop_last")
```
-Or you can change the default behaviour by setting a different value, e.g. `"drop"` for dropping all levels of grouping or `"keep"` for same grouping structure as `daily`.
+Or you can change the default behaviour by setting a different value, e.g. `"drop"` for dropping all levels of grouping or `"keep"` for the same grouping structure as `daily`:
-```{r}
-# Note the difference between the grouping structures
+```{r, results = FALSE}
summarise(daily, flights = n(), .groups = "drop")
summarise(daily, flights = n(), .groups = "keep")
```
-The fact that each summary peels off one level of the grouping by default makes it easy to progressively roll up a dataset:
-
-```{r}
-(per_day <- summarise(daily, flights = n()))
-(per_month <- summarise(per_day, flights = sum(flights)))
-(per_year <- summarise(per_month, flights = sum(flights)))
-```
-
-Be careful when progressively rolling up summaries: it's OK for sums and counts, but you need to think about weighting means and variances, and it's not possible to do it exactly for rank-based statistics like the median.
-In other words, the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median.
-
### Ungrouping
You might also want to remove grouping outside of `summarise()`.
@@ -466,11 +421,33 @@ You can do this and return to operations on ungrouped data using `ungroup()`.
```{r}
daily %>%
- ungroup() %>% # no longer grouped by date
- summarise(flights = n()) # all flights
+ ungroup() %>%
+ summarise(
+ delay = mean(dep_delay, na.rm = TRUE),
+ flights = n()
+ )
```
-### Counts
+For the purposes of summarising, ungrouped data is treated as if all your data was in a single group, so you get one row back.
+
+### Other verbs
+
+- `select()`, `rename()`, `relocate()`: grouping has no effect.
+
+- `filter()`, `mutate()`: computation happens per group.
+ This doesn't affect the functions you currently know but is very useful once you learn about window functions, Section \@ref(window-functions).
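+
+As a minimal sketch of what "computation happens per group" means, compare an ungrouped and a grouped `mutate()`.
+The column name `month_mean` is an illustrative choice.
+
+```{r}
+library(dplyr)
+library(nycflights13)
+
+# Ungrouped: mean() sees every flight, so each row gets the same value
+mutate(flights, month_mean = mean(dep_delay, na.rm = TRUE))
+
+# Grouped: mean() is computed separately within each month
+mutate(group_by(flights, month), month_mean = mean(dep_delay, na.rm = TRUE))
+```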
+
+### Exercises
+
+1. Which carrier has the worst delays?
+ Challenge: can you disentangle the effects of bad airports vs. bad carriers?
+ Why/why not?
+ (Hint: think about `flights %>% group_by(carrier, dest) %>% summarise(n())`)
+
+2. What does the `sort` argument to `count()` do?
+ Can you explain it in terms of the dplyr verbs you've learned so far?
+
+## Case study: aggregates and sample size
Whenever you do any aggregation, it's always a good idea to include either a count (`n()`), or a count of non-missing values (`sum(!is.na(x))`).
That way you can check that you're not drawing conclusions based on very small amounts of data.
@@ -518,15 +495,6 @@ delays %>%
geom_point(alpha = 1/10)
```
-------------------------------------------------------------------------
-
-RStudio tip: a useful keyboard shortcut is Cmd/Ctrl + Shift + P.
-This resends the previously sent chunk from the editor to the console.
-This is very convenient when you're (e.g.) exploring the value of `n` in the example above.
-You send the whole block once with Cmd/Ctrl + Enter, then you modify the value of `n` and press Cmd/Ctrl + Shift + P to resend the complete block.
-
-------------------------------------------------------------------------
-
There's another common variation of this type of pattern.
Let's look at how the average performance of batters in baseball is related to the number of times they're at bat.
Here I use data from the **Lahman** package to compute the batting average (number of hits / number of attempts) of every major league baseball player.
@@ -565,99 +533,3 @@ batters %>%
```
You can find a good explanation of this problem at and .
-
-### Exercises
-
-1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights.
- Consider the following scenarios:
-
- - A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
-
- - A flight is always 10 minutes late.
-
- - A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
-
- - 99% of the time a flight is on time.
- 1% of the time it's 2 hours late.
-
- Which is more important: arrival delay or departure delay?
-
-2. Come up with another approach that will give you the same output as `not_cancelled %>% count(dest)` and `not_cancelled %>% count(tailnum, wt = distance)` (without using `count()`).
-
-3. Our definition of cancelled flights (`is.na(dep_delay) | is.na(arr_delay)` ) is slightly suboptimal.
- Why?
- Which is the most important column?
-
-4. Look at the number of cancelled flights per day.
- Is there a pattern?
- Is the proportion of cancelled flights related to the average delay?
-
-5. Which carrier has the worst delays?
- Challenge: can you disentangle the effects of bad airports vs. bad carriers?
- Why/why not?
- (Hint: think about `flights %>% group_by(carrier, dest) %>% summarise(n())`)
-
-6. What does the `sort` argument to `count()` do.
- When might you use it?
-
-## Grouped mutates and filters
-
-Grouping is most useful in conjunction with `summarise()`, but you can also do convenient operations with `mutate()` and `filter()`:
-
-- Find the worst members of each group:
-
- ```{r}
- flights_sml %>%
- group_by(year, month, day) %>%
- filter(rank(desc(arr_delay)) < 10)
- ```
-
-- Find all groups bigger than a threshold:
-
- ```{r}
- popular_dests <- flights %>%
- group_by(dest) %>%
- filter(n() > 365)
- popular_dests
- ```
-
-- Standardise to compute per group metrics:
-
- ```{r}
- popular_dests %>%
- filter(arr_delay > 0) %>%
- mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
- select(year:day, dest, arr_delay, prop_delay)
- ```
-
-A grouped filter is a grouped mutate followed by an ungrouped filter.
-I generally avoid them except for quick and dirty manipulations: otherwise it's hard to check that you've done the manipulation correctly.
-
-Functions that work most naturally in grouped mutates and filters are known as window functions (vs. the summary functions used for summaries).
-You can learn more about useful window functions in the corresponding vignette: `vignette("window-functions")`.
-
-### Exercises
-
-1. Refer back to the lists of useful mutate and filtering functions.
- Describe how each operation changes when you combine it with grouping.
-
-2. Which plane (`tailnum`) has the worst on-time record?
-
-3. What time of day should you fly if you want to avoid delays as much as possible?
-
-4. For each destination, compute the total minutes of delay.
- For each flight, compute the proportion of the total delay for its destination.
-
-5. Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave.
- Using `lag()`, explore how the delay of a flight is related to the delay of the immediately preceding flight.
-
-6. Look at each destination.
- Can you find flights that are suspiciously fast?
- (i.e. flights that represent a potential data entry error).
- Compute the air time of a flight relative to the shortest flight to that destination.
- Which flights were most delayed in the air?
-
-7. Find all destinations that are flown by at least two carriers.
- Use that information to rank the carriers.
-
-8. For each plane, count the number of flights before the first delay of greater than 1 hour.
diff --git a/logicals-numbers.Rmd b/logicals-numbers.Rmd
index d918625..63f3aaa 100644
--- a/logicals-numbers.Rmd
+++ b/logicals-numbers.Rmd
@@ -26,7 +26,7 @@ filter(flights, month == 11 | month == 12)
```
The order of operations doesn't work like English.
-You can't write `filter(flights, month == (11 | 12))`, which you might literally translate into "finds all flights that departed in November or December".
+You can't write `filter(flights, month == 11 | 12)`, which you might literally translate into "finds all flights that departed in November or December".
Instead it finds all months that equal `11 | 12`, an expression that evaluates to `TRUE`.
In a numeric context (like here), `TRUE` becomes `1`, so this finds all flights in January, not November or December.
This is quite confusing!
@@ -77,6 +77,12 @@ You'll learn how to create new variables shortly.
summarise(hour_prop = mean(arr_delay > 60))
```
+`cumany()` `cumall()`
+
+### Exercises
+
+1. For each plane, count the number of flights before the first delay of greater than 1 hour.
+
## Basic math
There are many functions for creating new variables that you can use with `mutate()`.
@@ -121,6 +127,12 @@ There's no way to list every possible function that you might use, but here's a
cummean(x)
```
+### Recycling rules
+
+Base R.
+
+Tidyverse.
+
## Summaries
Just using means, counts, and sum can get you a long way, but R provides many other useful summary functions:
@@ -175,6 +187,22 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
)
```
+### Exercises
+
+1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights.
+ Consider the following scenarios:
+
+ - A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
+
+ - A flight is always 10 minutes late.
+
+ - A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
+
+ - 99% of the time a flight is on time.
+ 1% of the time it's 2 hours late.
+
+ Which is more important: arrival delay or departure delay?
+
## Floating point
There's another common problem you might encounter when using `==`: floating point numbers.
@@ -195,5 +223,6 @@ near(1 / 49 * 49, 1)
## Exercises
-1. How could you use `arrange()` to sort all missing values to the start?
- (Hint: use `!is.na()`).
+1. What trigonometric functions does R provide?
+2.
+
diff --git a/missing-values.Rmd b/missing-values.Rmd
index 01e7aa8..6416bb1 100644
--- a/missing-values.Rmd
+++ b/missing-values.Rmd
@@ -46,6 +46,21 @@ If you want to determine if a value is missing, use `is.na()`:
is.na(x)
```
+### Exercises
+
+1. How many flights have a missing `dep_time`?
+ What other variables are missing?
+ What might these rows represent?
+
+2. How could you use `arrange()` to sort all missing values to the start?
+ (Hint: use `!is.na()`).
+
+3. Come up with another approach that will give you the same output as `not_cancelled %>% count(dest)` and `not_cancelled %>% count(tailnum, wt = distance)` (without using `count()`).
+
+4. Look at the number of cancelled flights per day.
+ Is there a pattern?
+ Is the proportion of cancelled flights related to the average delay?
+
## Explicit vs implicit missing values {#missing-values-tidy}
Changing the representation of a dataset brings up an important subtlety of missing values.
@@ -151,8 +166,8 @@ arrange(df, desc(x))
## Exercises
-1. Why is `NA ^ 0` not missing?
- Why is `NA | TRUE` not missing?
- Why is `FALSE & NA` not missing?
- Can you figure out the general rule?
- (`NA * 0` is a tricky counterexample!)
+1. Why is `NA ^ 0` not missing? Why is `NA | TRUE` not missing? Why is `FALSE & NA` not missing? Can you figure out the general rule? (`NA * 0` is a tricky counterexample!)
+
+### Missing matches
+
+Discuss `anti_join()`
diff --git a/vector-tools.Rmd b/vector-tools.Rmd
index f1a2e23..b1a8621 100644
--- a/vector-tools.Rmd
+++ b/vector-tools.Rmd
@@ -102,3 +102,74 @@ not_cancelled <- flights %>%
mutate(r = min_rank(desc(dep_time))) %>%
filter(r %in% range(r))
```
+
+### dplyr
+
+```{r}
+flights_sml <- select(flights,
+ year:day,
+ ends_with("delay"),
+ distance,
+ air_time
+)
+```
+
+- Find the worst members of each group:
+
+ ```{r}
+ flights_sml %>%
+ group_by(year, month, day) %>%
+ filter(rank(desc(arr_delay)) < 10)
+ ```
+
+- Find all groups bigger than a threshold:
+
+ ```{r}
+ popular_dests <- flights %>%
+ group_by(dest) %>%
+ filter(n() > 365)
+ popular_dests
+ ```
+
+- Standardise to compute per group metrics:
+
+ ```{r}
+ popular_dests %>%
+ filter(arr_delay > 0) %>%
+ mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
+ select(year:day, dest, arr_delay, prop_delay)
+ ```
+
+A grouped filter is a grouped mutate followed by an ungrouped filter.
+I generally avoid them except for quick and dirty manipulations: otherwise it's hard to check that you've done the manipulation correctly.
+
+Functions that work most naturally in grouped mutates and filters are known as window functions (vs. the summary functions used for summaries).
+You can learn more about useful window functions in the corresponding vignette: `vignette("window-functions")`.
+
+### Exercises
+
+1. Find the 10 most delayed flights using a ranking function.
+ How do you want to handle ties?
+ Carefully read the documentation for `min_rank()`.
+
+2. Which plane (`tailnum`) has the worst on-time record?
+
+3. What time of day should you fly if you want to avoid delays as much as possible?
+
+4. For each destination, compute the total minutes of delay.
+ For each flight, compute the proportion of the total delay for its destination.
+
+5. Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave.
+ Using `lag()`, explore how the delay of a flight is related to the delay of the immediately preceding flight.
+
+6. Look at each destination.
+ Can you find flights that are suspiciously fast?
+ (i.e. flights that represent a potential data entry error).
+ Compute the air time of a flight relative to the shortest flight to that destination.
+ Which flights were most delayed in the air?
+
+7. Find all destinations that are flown by at least two carriers.
+ Use that information to rank the carriers.
+
+8.
+