Brainstorm a few exercises

hadley 2015-12-18 09:53:15 -06:00
parent bdcb95410b
commit c3eed28bbf
1 changed files with 32 additions and 2 deletions

@@ -589,6 +589,21 @@ mean(c(1, 5, 10, NA), na.rm = TRUE)
### Exercises
1. Brainstorm at least 5 different ways to assess the typical delay
   characteristics of a group of flights. Consider the following scenarios:

   * A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of
     the time.
   * A flight is always 10 minutes late.
   * A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of
     the time.
   * 99% of the time a flight is on time. 1% of the time it's 2 hours late.

   Which is more important: arrival delay or departure delay?
## Multiple operations
Imagine we want to explore the relationship between the distance and average delay for each location. Using what you already know about dplyr, you might write code like this:
@@ -755,10 +770,25 @@ Grouping is definitely most useful in conjunction with `summarise()`, but you ca
mutate(prop_delay = arr_delay / sum(arr_delay))
```
You can see more uses in the window functions vignette: `vignette("window-functions")`.
A grouped filter is equivalent to a grouped mutate followed by an ungrouped filter. I generally avoid them except for quick-and-dirty manipulations: otherwise it's too hard to check that you've done the manipulation correctly.
Functions that work most naturally in grouped mutates and filters are known as window functions (as opposed to the aggregate or summary functions used in grouped summaries). You can learn more about useful window functions in the corresponding vignette: `vignette("window-functions")`.
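To make the relationship between a grouped filter and a grouped mutate concrete, here is a minimal sketch. It uses a small made-up data frame (`delays`, with hypothetical columns `dest` and `arr_delay`) rather than the flights data, and the "above the group mean" condition is only an illustration, not something taken from the chapter.

```r
library(dplyr)

# Hypothetical toy data, standing in for a table of flights
delays <- data.frame(
  dest      = c("A", "A", "A", "B", "B"),
  arr_delay = c(10, 30, 20, 5, 15)
)

# Grouped filter: within each destination, keep rows above the group mean
grouped_filter <- delays %>%
  group_by(dest) %>%
  filter(arr_delay > mean(arr_delay)) %>%
  ungroup()

# The same result in two explicit steps:
# a grouped mutate that stores the condition, then an ungrouped filter
two_step <- delays %>%
  group_by(dest) %>%
  mutate(above_mean = arr_delay > mean(arr_delay)) %>%
  ungroup() %>%
  filter(above_mean) %>%
  select(-above_mean)
```

The two-step version is longer, but the intermediate `above_mean` column can be inspected before any rows are dropped, which makes the manipulation easier to check.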
### Exercises
1. Which plane (`tailnum`) has the worst on-time record?
1. What time of day should you fly if you want to avoid delays as much
   as possible?
1. Look at each destination. Can you find flights that are suspiciously
   fast (i.e. flights that represent a potential data entry error)? Compute
   the air time of a flight relative to the shortest flight to that
   destination. Which flights were most delayed in the air?
1. Find all destinations that are flown by at least two carriers. Use that
   information to rank the carriers.
## Multiple tables of data
It's rare that a data analysis involves only a single table of data. In practice, you'll normally have many tables that contribute to an analysis, and you need flexible tools to combine them. In dplyr, there are three families of verbs that work with two tables at a time: