diff --git a/data-transform.Rmd b/data-transform.Rmd index 1e9f08f..19118eb 100644 --- a/data-transform.Rmd +++ b/data-transform.Rmd @@ -30,46 +30,47 @@ The data comes from the US [Bureau of Transportation Statistics](http://www.tran flights ``` -You might notice that this data frame prints a little differently from other data frames you might have used in the past: it only shows the first few rows and all the columns that fit on one screen. -It also displays the number of rows (`r format(nrow(nycflights13::flights), big.mark = ",")`) and columns (`r ncol(nycflights13::flights)`). -(To see the whole dataset, you can run `View(flights)` which will open the dataset in the RStudio viewer). -It prints differently because it's a **tibble**. -Tibbles are data frames, but slightly tweaked to work better in the tidyverse. -For now, you don't need to worry about the differences; we'll come back to tibbles in more detail in Chapter \@ref(tibbles). +If you've used R before, you might notice that this data frame prints a little differently to data frames you might've worked with in the past. +That's because it's a **tibble**, a special type of data frame designed by the tidyverse team. + +The most important difference between a tibble and a data frame is the print method. +Tibbles only show the first few rows and the columns that fit on one screen. +This makes it easier to rapidly iterate when working with large data; if you want to see everything you can use `View(flights)` to open the dataset in the RStudio viewer. +We'll come back to other important differences in Chapter \@ref(tibbles). You might also have noticed the row of three (or four) letter abbreviations under the column names. These describe the type of each variable: -- `int` stands for integers. +- `int` stands for integer. -- `dbl` stands for doubles, or real numbers. +- `dbl` stands for double, a vector of real numbers. -- `chr` stands for characters, or strings. 
+- `chr` stands for character, a vector of strings. -- `dttm` stands for date-times (a date + a time). +- `dttm` stands for date-time (a date + a time). There are three other common types of variables that aren't used in this dataset but you'll encounter later in the book: - `lgl` stands for logical, vectors that contain only `TRUE` or `FALSE`. -- `fctr` stands for factors, which R uses to represent categorical variables with fixed possible values. +- `fctr` stands for factor, which R uses to represent categorical variables with fixed possible values. -- `date` stands for dates. +- `date` stands for date. ### dplyr basics -In this chapter you are going to learn the five key dplyr functions that allow you to solve the vast majority of your data manipulation challenges: +In this chapter you are going to learn the primary dplyr verbs that allow you to solve the vast majority of your data manipulation challenges. +They are organised into three camps: -- Pick observations by their values (`filter()`). -- Reorder the rows (`arrange()`). -- Pick variables by their names (`select()`). -- Create new variables with functions of existing variables (`mutate()`). -- Collapse many values down to a single summary (`summarise()`). +- Functions that operate on **rows**, like `filter()` which subsets rows based on the values of the columns, the `slice()` functions that subset rows based on their position, and `arrange()` which changes the order of the rows. -These can all be used in conjunction with `group_by()` which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. -These six functions provide the verbs for a language of data manipulation. +- Functions that operate on **columns**, like `mutate()` which creates new columns, `select()` which picks columns, `rename()` which changes column names, `relocate()` which moves columns from place to place. 
-All verbs work similarly: +- Functions that operate on **groups**, like `group_by()` which divides data up into groups for analysis, and `summarise()` which reduces each group to a single row. + +Later, in Chapter \@ref(relational-data), you'll learn about other verbs that work with **tables**, like the join functions and the set operations. + +All dplyr verbs work the same way: 1. The first argument is a data frame. @@ -80,7 +81,9 @@ All verbs work similarly: Together these properties make it easy to chain together multiple simple steps to achieve a complex result. Let's dive in and see how these verbs work. -## Filter rows with `filter()` +## Rows + +### `filter()` `filter()` allows you to subset observations based on their values. The first argument is the name of the data frame. @@ -105,35 +108,20 @@ If you want to do both, you can wrap the assignment in parentheses: (dec25 <- filter(flights, month == 12, day == 25)) ``` -### Comparisons - To use filtering effectively, you have to know how to select the observations that you want using the comparison operators. R provides the standard suite: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal). +It also provides `%in%`: `filter(df, x %in% c(a, b, c))` will return all rows where `x` is `a`, `b`, or `c`. When you're starting out with R, the easiest mistake to make is to use `=` instead of `==` when testing for equality. -When this happens you'll get an informative error: +`filter()` will let you know when this happens: ```{r, error = TRUE} filter(flights, month = 1) ``` -### Exercises +### `slice()` -1. Find all flights that - - a. Had an arrival delay of two or more hours - b. Flew to Houston (`IAH` or `HOU`) - c. Were operated by United, American, or Delta - d. Departed in summer (July, August, and September) - e. Arrived more than two hours late, but didn't leave late - f. Were delayed by at least an hour, but made up over 30 minutes in flight - g. Departed between midnight and 6am (inclusive) - -2. 
How many flights have a missing `dep_time`? - What other variables are missing? - What might these rows represent? - -## Arrange rows with `arrange()` +### `arrange()` `arrange()` works similarly to `filter()` except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. @@ -151,15 +139,73 @@ arrange(flights, desc(dep_delay)) ### Exercises -1. Sort `flights` to find the flights with longest departure delays. +1. Find all flights that + + a. Had an arrival delay of two or more hours + b. Flew to Houston (`IAH` or `HOU`) + c. Were operated by United, American, or Delta + d. Departed in summer (July, August, and September) + e. Arrived more than two hours late, but didn't leave late + f. Were delayed by at least an hour, but made up over 30 minutes in flight + g. Departed between midnight and 6am (inclusive) + +2. Sort `flights` to find the flights with longest departure delays. Find the flights that left earliest. -2. Sort `flights` to find the fastest (highest speed) flights. +3. Sort `flights` to find the fastest (highest speed) flights. + (Hint: try sorting by a calculation). -3. Which flights travelled the farthest? +4. Which flights travelled the farthest? Which travelled the shortest? -## Select columns with `select()` {#select} +## Columns + +### `mutate()` + +Besides selecting sets of existing columns, it's often useful to add new columns that are functions of existing columns. +That's the job of `mutate()`. + +`mutate()` always adds new columns at the end of your dataset so we'll start by creating a narrower dataset so we can see the new variables. +Remember that when you're in RStudio, the easiest way to see all the columns is `View()`. 
+ +```{r} +flights_sml <- select(flights, + year:day, + ends_with("delay"), + distance, + air_time +) +``` + +```{r} +mutate(flights_sml, + gain = dep_delay - arr_delay, + speed = distance / air_time * 60 +) +``` + +Note that you can refer to columns that you've just created: + +```{r} +mutate(flights_sml, + gain = dep_delay - arr_delay, + hours = air_time / 60, + gain_per_hour = gain / hours +) +``` + +You can control which variables are kept with the `.keep` argument: + +```{r} +mutate(flights, + gain = dep_delay - arr_delay, + hours = air_time / 60, + gain_per_hour = gain / hours, + .keep = "none" +) +``` + +### `select()` {#select} It's not uncommon to get datasets with hundreds or even thousands of variables. In this case, the first challenge is often narrowing in on the variables you're actually interested in. @@ -190,80 +236,37 @@ There are a number of helper functions you can use within `select()`: See `?select` for more details. -`select()` can be used to rename variables, but it's rarely useful because it drops all of the variables not explicitly mentioned. -Instead, use `rename()`, which is a variant of `select()` that keeps all the variables that aren't explicitly mentioned: +You can rename variables as you `select()` them by using `=`. +The new name appears on the left hand side of the `=`, and the old variable appears on the right hand side: + +```{r} +select(flights, tail_num = tailnum) +``` + +### `rename()` + +If you want to keep all the existing variables and just rename a few, you can use `rename()` instead of `select()`: ```{r} rename(flights, tail_num = tailnum) ``` -If you want to move certain variables to the start of the data frame but not drop the others, you can do this in two ways: using `select()` in conjunction with the `everything()` helper or using `relocate()`. +It works exactly the same way as `select()`, but keeps all the variables that aren't explicitly selected. 
+ +### `relocate()` + +You can move variables around with `relocate()`. +By default it moves variables to the front: ```{r} -select(flights, time_hour, air_time, everything()) relocate(flights, time_hour, air_time) ``` -### Exercises - -1. Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from `flights`. - -2. What happens if you include the name of a variable multiple times in a `select()` call? - -3. What does the `any_of()` function do? - Why might it be helpful in conjunction with this vector? - - ```{r} - variables <- c("year", "month", "day", "dep_delay", "arr_delay") - ``` - -4. Does the result of running the following code surprise you? - How do the select helpers deal with case by default? - How can you change that default? - - ```{r, eval = FALSE} - select(flights, contains("TIME")) - ``` - -## Add new variables with `mutate()` - -Besides selecting sets of existing columns, it's often useful to add new columns that are functions of existing columns. -That's the job of `mutate()`. - -`mutate()` always adds new columns at the end of your dataset so we'll start by creating a narrower dataset so we can see the new variables. -Remember that when you're in RStudio, the easiest way to see all the columns is `View()`. 
+But you can use the `.before` and `.after` arguments to choose where to place them: ```{r} -flights_sml <- select(flights, - year:day, - ends_with("delay"), - distance, - air_time -) -mutate(flights_sml, - gain = dep_delay - arr_delay, - speed = distance / air_time * 60 -) -``` - -Note that you can refer to columns that you've just created: - -```{r} -mutate(flights_sml, - gain = dep_delay - arr_delay, - hours = air_time / 60, - gain_per_hour = gain / hours -) -``` - -If you only want to keep the new variables, use `transmute()`: - -```{r} -transmute(flights, - gain = dep_delay - arr_delay, - hours = air_time / 60, - gain_per_hour = gain / hours -) +relocate(flights, year:dep_time, .after = time_hour) +relocate(flights, starts_with("arr"), .before = dep_time) ``` ### Exercises @@ -293,68 +296,75 @@ ggplot(flights, aes(air_time - airtime2)) + geom_histogram() 3. Compare `dep_time`, `sched_dep_time`, and `dep_delay`. How would you expect those three numbers to be related? -4. Find the 10 most delayed flights using a ranking function. - How do you want to handle ties? - Carefully read the documentation for `min_rank()`. +4. Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from `flights`. -5. What does `1:3 + 1:10` return? - Why? +5. What happens if you include the name of a variable multiple times in a `select()` call? -6. What trigonometric functions does R provide? +6. What does the `any_of()` function do? + Why might it be helpful in conjunction with this vector? -## Grouped summaries with `summarise()` + ```{r} + variables <- c("year", "month", "day", "dep_delay", "arr_delay") + ``` -The last key verb is `summarise()`. -It collapses a data frame to a single row: +7. Does the result of running the following code surprise you? + How do the select helpers deal with case by default? + How can you change that default? 
-```{r} -summarise(flights, delay = mean(dep_delay, na.rm = TRUE)) -``` + ```{r, eval = FALSE} + select(flights, contains("TIME")) + ``` -(We'll come back to what that `na.rm = TRUE` means very shortly.) +## Groups -`summarise()` is not terribly useful unless we pair it with `group_by()`. -This changes the unit of analysis from the complete dataset to individual groups. -Then, when you use the dplyr verbs on a grouped data frame they'll be automatically applied "by group". -For example, if we applied exactly the same code to a data frame grouped by month, we get the average delay per month: +### `group_by()` + +`group_by()` doesn't appear to do anything: ```{r} by_month <- group_by(flights, month) +by_month +``` + +If you look closely, you'll notice that it's now "grouped by" month, but otherwise the data is unchanged. +The reason to group your data is because it changes the operation of other verbs. + +### `summarise()` + +The most important operation that you might apply to grouped data is a summary. +It collapses each group to a single row: + +```{r} summarise(by_month, delay = mean(dep_delay, na.rm = TRUE)) ``` -Together `group_by()` and `summarise()` provide one of the tools that you'll use most commonly when working with dplyr: grouped summaries. -But before we go any further with this, we need to introduce a powerful new idea: the pipe. +You can create any number of summaries at once. +You'll learn various useful summaries in the upcoming chapters on individual data types, but another useful summary function is `n()`, which returns the number of rows in each group: -### Combining multiple operations with the pipe - -Imagine that we want to explore the relationship between the distance and average delay for each location. 
-Using what you know about dplyr, you might write code like this: - -```{r, fig.width = 6} -by_dest <- group_by(flights, dest) -delay <- summarise(by_dest, - count = n(), - dist = mean(distance, na.rm = TRUE), - delay = mean(arr_delay, na.rm = TRUE) -) -delay <- filter(delay, count > 20, dest != "HNL") - -# It looks like delays increase with distance up to ~750 miles -# and then decrease. Maybe as flights get longer there's more -# ability to make up delays in the air? -ggplot(data = delay, mapping = aes(x = dist, y = delay)) + - geom_point(aes(size = count), alpha = 1/3) + - geom_smooth(se = FALSE) +```{r} +summarise(by_month, delay = mean(dep_delay, na.rm = TRUE), n = n()) ``` -There are three steps to prepare this data: +(In fact, `count()` which you already learned about, is just a short cut for grouping + summarising with `n()`) -1. Group flights by destination. +Here we've used `mean()` to compute the average delay for each month. +The `na.rm = TRUE` is important because it asks R to "remove" (rm) the missing (na) values. +If you forget it, the output isn't very useful: -2. Summarise to compute distance, average delay, and number of flights. +```{r} +summarise(by_month, delay = mean(dep_delay)) +``` -3. Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport. +We'll come back to discuss missing values in Chapter \@ref(missing-values). +For now, know you can drop them in summary functions by using `na.rm = TRUE` or remove them with a filter by using `!is.na()`: + +```{r} +not_cancelled <- filter(flights, !is.na(dep_delay)) +by_month <- group_by(not_cancelled, month) +summarise(by_month, delay = mean(dep_delay)) +``` + +### Combining multiple operations with the pipe This code is a little frustrating to write because we have to give each intermediate data frame a name, even though we don't care about them. Naming things is hard, so this slows down our analysis. 
@@ -362,66 +372,23 @@ Naming things is hard, so this slows down our analysis. There's another way to tackle the same problem with the pipe, `%>%`: ```{r} -sdelays <- flights %>% - group_by(dest) %>% - summarise( - count = n(), - dist = mean(distance, na.rm = TRUE), - delay = mean(arr_delay, na.rm = TRUE) - ) %>% - filter(count > 20, dest != "HNL") +flights %>% + filter(!is.na(dep_delay)) %>% + group_by(month) %>% + summarise(delay = mean(dep_delay)) ``` This focuses on the transformations, not what's being transformed, which makes the code easier to read. -You can read it as a series of imperative statements: group, then summarise, then filter. +You can read it as a series of imperative statements: filter, then group, then summarise. As suggested by this reading, a good way to pronounce `%>%` when reading code is "then". Behind the scenes, `x %>% f(y)` turns into `f(x, y)`, and `x %>% f(y) %>% g(z)` turns into `g(f(x, y), z)` and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom. We'll use piping frequently from now on because it considerably improves the readability of code, and we'll come back to it in more detail in Chapter \@ref(pipes). -Working with the pipe is one of the key criteria for belonging to the tidyverse. -The only exception is ggplot2: it was written before the pipe was discovered. -Unfortunately, the next iteration of ggplot2, ggvis, which does use the pipe, isn't quite ready for prime time yet. +### Grouping by multiple variables -## Missing values {#missing-values-summarise} - -You may have wondered about the `na.rm` argument we used above. -What happens if we don't set it? - -```{r} -flights %>% - group_by(month) %>% - summarise(mean = mean(dep_delay)) -``` - -We get a lot of missing values! -That's because aggregation functions obey the usual rule of missing values: if there's any missing value in the input, the output will be a missing value. 
-Fortunately, all aggregation functions have an `na.rm` argument which removes the missing values prior to computation: - -```{r} -flights %>% - group_by(month) %>% - summarise(mean = mean(dep_delay, na.rm = TRUE)) -``` - -In this case, missing values represent cancelled flights, therefore we could also tackle the problem by first removing the cancelled flights. -We'll save this dataset so we can reuse it in the next few examples. - -```{r} -not_cancelled <- flights %>% - filter(!is.na(dep_delay), !is.na(arr_delay)) - -not_cancelled %>% - group_by(month) %>% - summarise(mean = mean(dep_delay)) -``` - -## Grouping by multiple variables - -You can group a data frame by multiple variables as well. -Note that the grouping information is printed on top of the output. -The number in the square brackets indicates how many groups are created. +You can group a data frame by multiple variables: ```{r} daily <- group_by(flights, year, month, day) @@ -431,34 +398,22 @@ daily When you group by multiple variables, each summary peels off one level of the grouping by default, and a message is printed that tells you how you can change this behaviour. ```{r} -summarise(daily, flights = n()) +daily %>% summarise(flights = n()) ``` If you're happy with this behaviour, you can also explicitly define it, in which case the message won't be printed out. -```{r} +```{r results = FALSE} summarise(daily, flights = n(), .groups = "drop_last") ``` -Or you can change the default behaviour by setting a different value, e.g. `"drop"` for dropping all levels of grouping or `"keep"` for same grouping structure as `daily`. +Or you can change the default behaviour by setting a different value, e.g. 
`"drop"` for dropping all levels of grouping or `"keep"` for the same grouping structure as `daily`: -```{r} -# Note the difference between the grouping structures +```{r results = FALSE} summarise(daily, flights = n(), .groups = "drop") summarise(daily, flights = n(), .groups = "keep") ``` -The fact that each summary peels off one level of the grouping by default makes it easy to progressively roll up a dataset: - -```{r} -(per_day <- summarise(daily, flights = n())) -(per_month <- summarise(per_day, flights = sum(flights))) -(per_year <- summarise(per_month, flights = sum(flights))) -``` - -Be careful when progressively rolling up summaries: it's OK for sums and counts, but you need to think about weighting means and variances, and it's not possible to do it exactly for rank-based statistics like the median. -In other words, the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median. - ### Ungrouping You might also want to remove grouping outside of `summarise()`. @@ -466,11 +421,33 @@ You can do this and return to operations on ungrouped data using `ungroup()`. ```{r} daily %>% - ungroup() %>% # no longer grouped by date - summarise(flights = n()) # all flights + ungroup() %>% + summarise( + delay = mean(dep_delay, na.rm = TRUE), + flights = n() + ) ``` -### Counts +For the purposes of summarising, ungrouped data is treated as if all your data was in a single group, so you get one row back. + +### Other verbs + +- `select()`, `rename()`, `relocate()`: grouping has no effect. + +- `filter()`, `mutate()`: computation happens per group. + This doesn't affect the functions you currently know but is very useful once you learn about window functions, Section \@ref(window-functions). + +### Exercises + +1. Which carrier has the worst delays? + Challenge: can you disentangle the effects of bad airports vs. bad carriers? + Why/why not? + (Hint: think about `flights %>% group_by(carrier, dest) %>% summarise(n())`) + +2. 
What does the `sort` argument to `count()` do? + Can you explain it in terms of the dplyr verbs you've learned so far? + +## Case study: aggregates and sample size Whenever you do any aggregation, it's always a good idea to include either a count (`n()`), or a count of non-missing values (`sum(!is.na(x))`). That way you can check that you're not drawing conclusions based on very small amounts of data. @@ -518,15 +495,6 @@ delays %>% geom_point(alpha = 1/10) ``` ------------------------------------------------------------------------- - -RStudio tip: a useful keyboard shortcut is Cmd/Ctrl + Shift + P. -This resends the previously sent chunk from the editor to the console. -This is very convenient when you're (e.g.) exploring the value of `n` in the example above. -You send the whole block once with Cmd/Ctrl + Enter, then you modify the value of `n` and press Cmd/Ctrl + Shift + P to resend the complete block. - ------------------------------------------------------------------------- - There's another common variation of this type of pattern. Let's look at how the average performance of batters in baseball is related to the number of times they're at bat. Here I use data from the **Lahman** package to compute the batting average (number of hits / number of attempts) of every major league baseball player. @@ -565,99 +533,3 @@ batters %>% ``` You can find a good explanation of this problem at and . - -### Exercises - -1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. - Consider the following scenarios: - - - A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time. - - - A flight is always 10 minutes late. - - - A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time. - - - 99% of the time a flight is on time. - 1% of the time it's 2 hours late. - - Which is more important: arrival delay or departure delay? - -2. 
Come up with another approach that will give you the same output as `not_cancelled %>% count(dest)` and `not_cancelled %>% count(tailnum, wt = distance)` (without using `count()`). - -3. Our definition of cancelled flights (`is.na(dep_delay) | is.na(arr_delay)` ) is slightly suboptimal. - Why? - Which is the most important column? - -4. Look at the number of cancelled flights per day. - Is there a pattern? - Is the proportion of cancelled flights related to the average delay? - -5. Which carrier has the worst delays? - Challenge: can you disentangle the effects of bad airports vs. bad carriers? - Why/why not? - (Hint: think about `flights %>% group_by(carrier, dest) %>% summarise(n())`) - -6. What does the `sort` argument to `count()` do. - When might you use it? - -## Grouped mutates and filters - -Grouping is most useful in conjunction with `summarise()`, but you can also do convenient operations with `mutate()` and `filter()`: - -- Find the worst members of each group: - - ```{r} - flights_sml %>% - group_by(year, month, day) %>% - filter(rank(desc(arr_delay)) < 10) - ``` - -- Find all groups bigger than a threshold: - - ```{r} - popular_dests <- flights %>% - group_by(dest) %>% - filter(n() > 365) - popular_dests - ``` - -- Standardise to compute per group metrics: - - ```{r} - popular_dests %>% - filter(arr_delay > 0) %>% - mutate(prop_delay = arr_delay / sum(arr_delay)) %>% - select(year:day, dest, arr_delay, prop_delay) - ``` - -A grouped filter is a grouped mutate followed by an ungrouped filter. -I generally avoid them except for quick and dirty manipulations: otherwise it's hard to check that you've done the manipulation correctly. - -Functions that work most naturally in grouped mutates and filters are known as window functions (vs. the summary functions used for summaries). -You can learn more about useful window functions in the corresponding vignette: `vignette("window-functions")`. - -### Exercises - -1. 
Refer back to the lists of useful mutate and filtering functions. - Describe how each operation changes when you combine it with grouping. - -2. Which plane (`tailnum`) has the worst on-time record? - -3. What time of day should you fly if you want to avoid delays as much as possible? - -4. For each destination, compute the total minutes of delay. - For each flight, compute the proportion of the total delay for its destination. - -5. Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. - Using `lag()`, explore how the delay of a flight is related to the delay of the immediately preceding flight. - -6. Look at each destination. - Can you find flights that are suspiciously fast? - (i.e. flights that represent a potential data entry error). - Compute the air time of a flight relative to the shortest flight to that destination. - Which flights were most delayed in the air? - -7. Find all destinations that are flown by at least two carriers. - Use that information to rank the carriers. - -8. For each plane, count the number of flights before the first delay of greater than 1 hour. diff --git a/logicals-numbers.Rmd b/logicals-numbers.Rmd index d918625..63f3aaa 100644 --- a/logicals-numbers.Rmd +++ b/logicals-numbers.Rmd @@ -26,7 +26,7 @@ filter(flights, month == 11 | month == 12) ``` The order of operations doesn't work like English. -You can't write `filter(flights, month == (11 | 12))`, which you might literally translate into "finds all flights that departed in November or December". +You can't write `filter(flights, month == 11 | 12)`, which you might literally translate into "finds all flights that departed in November or December". Instead it finds all months that equal `11 | 12`, an expression that evaluates to `TRUE`. In a numeric context (like here), `TRUE` becomes `1`, so this finds all flights in January, not November or December. 
This is quite confusing! @@ -77,6 +77,12 @@ You'll learn how to create new variables shortly. summarise(hour_prop = mean(arr_delay > 60)) ``` +`cumany()` `cumall()` + +### Exercises + +1. For each plane, count the number of flights before the first delay of greater than 1 hour. + ## Basic math There are many functions for creating new variables that you can use with `mutate()`. @@ -121,6 +127,12 @@ There's no way to list every possible function that you might use, but here's a cummean(x) ``` +### Recycling rules + +Base R. + +Tidyverse. + ## Summaries Just using means, counts, and sum can get you a long way, but R provides many other useful summary functions: @@ -175,6 +187,22 @@ Just using means, counts, and sum can get you a long way, but R provides many ot ) ``` +### Exercises + +1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. + Consider the following scenarios: + + - A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time. + + - A flight is always 10 minutes late. + + - A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time. + + - 99% of the time a flight is on time. + 1% of the time it's 2 hours late. + + Which is more important: arrival delay or departure delay? + ## Floating point There's another common problem you might encounter when using `==`: floating point numbers. @@ -195,5 +223,6 @@ near(1 / 49 * 49, 1) ## Exercises -1. How could you use `arrange()` to sort all missing values to the start? - (Hint: use `!is.na()`). +1. What trigonometric functions does R provide? +2. + diff --git a/missing-values.Rmd b/missing-values.Rmd index 01e7aa8..6416bb1 100644 --- a/missing-values.Rmd +++ b/missing-values.Rmd @@ -46,6 +46,21 @@ If you want to determine if a value is missing, use `is.na()`: is.na(x) ``` +### Exercises + +1. How many flights have a missing `dep_time`? + What other variables are missing? + What might these rows represent? + +2. 
How could you use `arrange()` to sort all missing values to the start? + (Hint: use `!is.na()`). + +3. Come up with another approach that will give you the same output as `not_cancelled %>% count(dest)` and `not_cancelled %>% count(tailnum, wt = distance)` (without using `count()`). + +4. Look at the number of cancelled flights per day. + Is there a pattern? + Is the proportion of cancelled flights related to the average delay? + ## Explicit vs implicit missing values {#missing-values-tidy} Changing the representation of a dataset brings up an important subtlety of missing values. @@ -151,8 +166,8 @@ arrange(df, desc(x)) ## Exercises -1. Why is `NA ^ 0` not missing? - Why is `NA | TRUE` not missing? - Why is `FALSE & NA` not missing? - Can you figure out the general rule? - (`NA * 0` is a tricky counterexample!) +1. Why is `NA ^ 0` not missing? Why is `NA | TRUE` not missing? Why is `FALSE & NA` not missing? Can you figure out the general rule? (`NA * 0` is a tricky counterexample!) + +### Missing matches + +Discuss `anti_join()` diff --git a/vector-tools.Rmd b/vector-tools.Rmd index f1a2e23..b1a8621 100644 --- a/vector-tools.Rmd +++ b/vector-tools.Rmd @@ -102,3 +102,74 @@ not_cancelled <- flights %>% mutate(r = min_rank(desc(dep_time))) %>% filter(r %in% range(r)) ``` + +### dplyr + +```{r} +flights_sml <- select(flights, + year:day, + ends_with("delay"), + distance, + air_time +) +``` + +- Find the worst members of each group: + + ```{r} + flights_sml %>% + group_by(year, month, day) %>% + filter(rank(desc(arr_delay)) < 10) + ``` + +- Find all groups bigger than a threshold: + + ```{r} + popular_dests <- flights %>% + group_by(dest) %>% + filter(n() > 365) + popular_dests + ``` + +- Standardise to compute per group metrics: + + ```{r} + popular_dests %>% + filter(arr_delay > 0) %>% + mutate(prop_delay = arr_delay / sum(arr_delay)) %>% + select(year:day, dest, arr_delay, prop_delay) + ``` + +A grouped filter is a grouped mutate followed by an ungrouped filter. 
+I generally avoid them except for quick and dirty manipulations: otherwise it's hard to check that you've done the manipulation correctly. + +Functions that work most naturally in grouped mutates and filters are known as window functions (vs. the summary functions used for summaries). +You can learn more about useful window functions in the corresponding vignette: `vignette("window-functions")`. + +### Exercises + +1. Find the 10 most delayed flights using a ranking function. + How do you want to handle ties? + Carefully read the documentation for `min_rank()`. + +2. Which plane (`tailnum`) has the worst on-time record? + +3. What time of day should you fly if you want to avoid delays as much as possible? + +4. For each destination, compute the total minutes of delay. + For each flight, compute the proportion of the total delay for its destination. + +5. Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. + Using `lag()`, explore how the delay of a flight is related to the delay of the immediately preceding flight. + +6. Look at each destination. + Can you find flights that are suspiciously fast? + (i.e. flights that represent a potential data entry error). + Compute the air time of a flight relative to the shortest flight to that destination. + Which flights were most delayed in the air? + +7. Find all destinations that are flown by at least two carriers. + Use that information to rank the carriers. + +8. +