From c3eed28bbf48781b1660c2b44c2f501918f6bf52 Mon Sep 17 00:00:00 2001
From: hadley
Date: Fri, 18 Dec 2015 09:53:15 -0600
Subject: [PATCH] Brainstorm a few exercises

---
 transform.Rmd | 34 ++++++++++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/transform.Rmd b/transform.Rmd
index ea87703..d2fcaf9 100644
--- a/transform.Rmd
+++ b/transform.Rmd
@@ -589,6 +589,21 @@ mean(c(1, 5, 10, NA), na.rm = TRUE)
 
 ### Exercises
 
+1. Brainstorm at least 5 different ways to assess the typical delay
+   characteristics of a group of flights. Consider the following scenarios:
+
+   * A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of
+     the time.
+
+   * A flight is always 10 minutes late.
+
+   * A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of
+     the time.
+
+   * 99% of the time a flight is on time. 1% of the time it's 2 hours late.
+
+   Which is more important: arrival delay or departure delay?
+
 ## Multiple operations
 
 Imagine we want to explore the relationship between the distance and average delay for each location. Using what you already know about dplyr, you might write code like this:
@@ -755,10 +770,25 @@ Grouping is definitely most useful in conjunction with `summarise()`, but you ca
   mutate(prop_delay = arr_delay / sum(arr_delay))
 ```
 
-You can see more uses in window functions vignette `vignette("window-functions")`.
-
 A grouped filter is basically like a grouped mutate followed by an ungrouped filter. I generally avoid them except for quick and dirty manipulations. Otherwise it's too hard to check that you've done the manipulation correctly.
 
+Functions that work most naturally in grouped mutates and filters are known as window functions (vs. aggregate or summary functions used in grouped summaries). You can learn more about useful window functions in the corresponding vignette: `vignette("window-functions")`.
+
+### Exercises
+
+1. Which plane (`tailnum`) has the worst on-time record?
+
+1. What time of day should you fly if you want to avoid delays as much
+   as possible?
+
+1. Look at each destination. Can you find flights that are suspiciously
+   fast? (i.e. flights that represent a potential data entry error). Compute
+   the air time of a flight relative to the shortest flight to that destination.
+   Which flights were most delayed in the air?
+
+1. Find all destinations that are flown by at least two carriers. Use that
+   information to rank the carriers.
+
 ## Multiple tables of data
 
 It's rare that a data analysis involves only a single table of data. In practice, you'll normally have many tables that contribute to an analysis, and you need flexible tools to combine them. In dplyr, there are three families of verbs that work with two tables at a time: