From 41e84491bf4df7b864cc42d2d8bf997b6a06db75 Mon Sep 17 00:00:00 2001
From: Peter Hurford <peter@peterhurford.com>
Date: Wed, 16 Dec 2015 10:39:26 -0600
Subject: [PATCH 01/14] Typofix: Juli -> Julia

---
 intro.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/intro.Rmd b/intro.Rmd
index 587c35f..fcabdd4 100644
--- a/intro.Rmd
+++ b/intro.Rmd
@@ -71,7 +71,7 @@ Another class of big data problem consists of many small data problems. Each ind
 
 ### Python
 
-In this book, you won't learn anything about Python, Juli, or any other programming language useful for data science. This isn't because we think these tools are bad. They're not! And in practice, most data science teams use a mix of languages, often at least R and Python.
+In this book, you won't learn anything about Python, Julia, or any other programming language useful for data science. This isn't because we think these tools are bad. They're not! And in practice, most data science teams use a mix of languages, often at least R and Python.
 
 However, we strongly believe that it's best to master one tool at a time. You will get better faster if you dive deep, rather than spreading yourself thinly over many topics. This doesn't mean you should be only know one thing, just that you'll generally learn faster if you stick to one thing at a time.
 

From 8babf77789123833d3b9af7cddabdba9e08335fb Mon Sep 17 00:00:00 2001
From: hadley <h.wickham@gmail.com>
Date: Wed, 16 Dec 2015 11:20:33 -0600
Subject: [PATCH 02/14] tab dump

---
 rmarkdown.Rmd | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/rmarkdown.Rmd b/rmarkdown.Rmd
index 4b6e2dd..4407df7 100644
--- a/rmarkdown.Rmd
+++ b/rmarkdown.Rmd
@@ -4,6 +4,8 @@ Recommendations for learning more about communication:
 
 For writing: [Style: Lessons in Clarity and Grace](http://amzn.com/0321898680), <http://www.americanscientist.org/issues/id.877,y.0,no.,content.true,page.1,css.print/issue.aspx>
 
-For presentations: [slide:ology](http://amzn.com/0596522347), <http://www.howtogiveatalk.com>, <https://github.com/jtleek/talkguide> (academic).
+For presentations: [slide:ology](http://amzn.com/0596522347), <http://www.howtogiveatalk.com>, <https://github.com/jtleek/talkguide> (academic), <http://speaking.io>, <https://www.coursera.org/learn/public-speaking>.
 
 For expository visulisations: WSJ guide?
+
+Design: [The Non-Designer's Design Book](http://amzn.com/0133966151)

From d2613dcd72aaf3be4b765e311692dd025defea10 Mon Sep 17 00:00:00 2001
From: hadley <h.wickham@gmail.com>
Date: Wed, 16 Dec 2015 11:21:42 -0600
Subject: [PATCH 03/14] Another tab dump for strptime

---
 datetimes.Rmd | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/datetimes.Rmd b/datetimes.Rmd
index 78d639f..556cb4e 100644
--- a/datetimes.Rmd
+++ b/datetimes.Rmd
@@ -1 +1,4 @@
 # Dates and times
+
+
+If you have trouble remembering these abbreviations, check out the [strptimer package](https://cran.r-project.org/web/packages/strptimer/vignettes/strptimer.html).

From d149c28639c95cde8e3718fb6906efd3df37d012 Mon Sep 17 00:00:00 2001
From: hadley <h.wickham@gmail.com>
Date: Wed, 16 Dec 2015 16:29:39 -0600
Subject: [PATCH 04/14] Remove unneeded config option

---
 .travis.yml | 1 -
 1 file changed, 1 deletion(-)

diff --git a/.travis.yml b/.travis.yml
index 3d09cb3..614ad68 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -51,6 +51,5 @@ deploy:
   secret_access_key:
      secure: "KB6D4dRFyqABOUBC6q6CTI7WZQ+4kFOSDWNQFAbXJQR4TzR8J6uddAiSZyG8T1/8z+9Lm1VK417Zi0dGm3r3epbSnLClitBetvE11DoByomK+ey+NJ0MdXuXbFCJhX9l+8QDbDRLd/b2MEr36JXNaNQaLf5wdHImVVfcCm5STAIOM42plYMvz4Uhao+VjIKo+0IqiGHQHsNcU4qQXS4jd4FtO/t1xCwa7SgH0wwV2yJmeh8mM7QpmUEpBcZTHDvqZu6BitxtkYQDCh1iuBwhbPlYug/WOtyHmKYgU/c3+C+xW4OLv10OsE+eK6noEzIXQ80sPIyKMpkn+9P+7MnoRU/oZTXmYJOuXE5mvy+CiJ4TzZZxzB/g8HzklRRI4eFBmJ/zTTMmJMwBdbUhCXepARe4gr7pDFKhSTXvBVxljJBrkiGz6W1JeZ9nKzUbuIlWNJ9aaYM2UDMbRef7xyKlKbBNw1+90aTTW8Jo+0Sz3/R7daBTcnr0Bszg4QCaOMoxJJF/Ty/tTHiComAt/kNRqlSiU2g/Ch0jOz5TRV3c29OjQQ/a9ftf5pqlvgStwjjszgHQfRrd4mxGq2E/1gkPGL7ada+TWPAVjCc8HtPGK/36IjSccFB6qGkwTFf3uOBmAC2XVnJJlwG8v20nL5ZZwpCCbQANeQq/ILQsYUmk7RM="
   bucket: r4ds.had.co.nz
-  endpoint: r4ds.had.co.nz.s3-website-us-east-1.amazonaws.com
   local-dir: _site
   skip_cleanup: true

From fae4699bba188081ec03d7600b7ef2962b5d6da9 Mon Sep 17 00:00:00 2001
From: hadley <h.wickham@gmail.com>
Date: Wed, 16 Dec 2015 16:41:08 -0600
Subject: [PATCH 05/14] Restore yaml metadata necessary for travis

---
 _plugins/knit.r       | 2 +-
 _plugins/rmarkdown.rb | 8 ++++----
 data-structures.Rmd   | 4 ++++
 datetimes.Rmd         | 4 ++++
 eda.Rmd               | 4 ++++
 functions.Rmd         | 4 ++++
 import.Rmd            | 4 ++++
 intro.Rmd             | 4 ++++
 lists.Rmd             | 4 ++++
 model-assess.Rmd      | 4 ++++
 model-vis.Rmd         | 4 ++++
 model.Rmd             | 4 ++++
 rmarkdown.Rmd         | 4 ++++
 shiny.Rmd             | 4 ++++
 strings.Rmd           | 4 ++++
 tidy.Rmd              | 4 ++++
 transform.Rmd         | 4 ++++
 visualize.Rmd         | 4 ++++
 18 files changed, 69 insertions(+), 5 deletions(-)

diff --git a/_plugins/knit.r b/_plugins/knit.r
index 0d0f1ce..c6cfd26 100755
--- a/_plugins/knit.r
+++ b/_plugins/knit.r
@@ -4,7 +4,7 @@ library(bookdown)
 library(methods)
 
 args <- commandArgs(trailingOnly = TRUE)
-path <- args[1]
+path <- temp.Rmd
 
 if (!file.exists(path)) {
   stop("Can't find path ", path, call. = FALSE)
diff --git a/_plugins/rmarkdown.rb b/_plugins/rmarkdown.rb
index 0bc0451..9bbb72f 100644
--- a/_plugins/rmarkdown.rb
+++ b/_plugins/rmarkdown.rb
@@ -21,13 +21,13 @@ module Jekyll
 
       # http://rubyquicktips.com/post/5862861056/execute-shell-commands
       content = `_plugins/knit.r temp.Rmd`
-      
+
       if $?.exitstatus != 0
-        raise "Knitting failed" 
+        raise "Knitting failed"
       end
-      
+
       content
       # File.unlink f.path
     end
   end
-end
\ No newline at end of file
+end
diff --git a/data-structures.Rmd b/data-structures.Rmd
index 8800ede..61d4948 100644
--- a/data-structures.Rmd
+++ b/data-structures.Rmd
@@ -1,3 +1,7 @@
+----
+layout: default
+----
+
 # Data structures
 
 Might be quite brief.
diff --git a/datetimes.Rmd b/datetimes.Rmd
index 556cb4e..bbfae71 100644
--- a/datetimes.Rmd
+++ b/datetimes.Rmd
@@ -1,3 +1,7 @@
+----
+layout: default
+----
+
 # Dates and times
 
 
diff --git a/eda.Rmd b/eda.Rmd
index 8685f66..541eaa2 100644
--- a/eda.Rmd
+++ b/eda.Rmd
@@ -1,3 +1,7 @@
+----
+layout: default
+----
+
 # Exploratory data analysis
 
 ```{r, include = FALSE}
diff --git a/functions.Rmd b/functions.Rmd
index f215592..399e72f 100644
--- a/functions.Rmd
+++ b/functions.Rmd
@@ -1,3 +1,7 @@
+----
+layout: default
+----
+
 # Expressing yourself in code
 
 ```{r, include = FALSE}
diff --git a/import.Rmd b/import.Rmd
index 79c4a44..5e4eb8e 100644
--- a/import.Rmd
+++ b/import.Rmd
@@ -1,3 +1,7 @@
+----
+layout: default
+----
+
 # Data import
 
 ```{r, include = FALSE}
diff --git a/intro.Rmd b/intro.Rmd
index 587c35f..4d116db 100644
--- a/intro.Rmd
+++ b/intro.Rmd
@@ -1,3 +1,7 @@
+----
+layout: default
+----
+
 # Introduction
 
 ```{r setup-intro, include = FALSE}
diff --git a/lists.Rmd b/lists.Rmd
index 09aa4eb..16b2b5b 100644
--- a/lists.Rmd
+++ b/lists.Rmd
@@ -1,3 +1,7 @@
+----
+layout: default
+----
+
 # Lists
 
 ```{r setup-lists, include=FALSE}
diff --git a/model-assess.Rmd b/model-assess.Rmd
index 48ebb0e..ed35000 100644
--- a/model-assess.Rmd
+++ b/model-assess.Rmd
@@ -1,3 +1,7 @@
+----
+layout: default
+----
+
 # Model assessment
 
 ```{r setup-model, include=FALSE}
diff --git a/model-vis.Rmd b/model-vis.Rmd
index ac429e7..b6b2a16 100644
--- a/model-vis.Rmd
+++ b/model-vis.Rmd
@@ -1,3 +1,7 @@
+----
+layout: default
+----
+
 # Model visualisation
 
 Gap minder
diff --git a/model.Rmd b/model.Rmd
index 163e9d1..7b4316c 100644
--- a/model.Rmd
+++ b/model.Rmd
@@ -1,3 +1,7 @@
+----
+layout: default
+----
+
 # Model
 
 After reading this chapter, what can you do that you couldn't before?
diff --git a/rmarkdown.Rmd b/rmarkdown.Rmd
index 4407df7..2030a23 100644
--- a/rmarkdown.Rmd
+++ b/rmarkdown.Rmd
@@ -1,3 +1,7 @@
+----
+layout: default
+----
+
 # R Markdown
 
 Recommendations for learning more about communication:
diff --git a/shiny.Rmd b/shiny.Rmd
index 578d0bd..a0c4073 100644
--- a/shiny.Rmd
+++ b/shiny.Rmd
@@ -1 +1,5 @@
+----
+layout: default
+----
+
 # Shiny
diff --git a/strings.Rmd b/strings.Rmd
index 35355af..ce79688 100644
--- a/strings.Rmd
+++ b/strings.Rmd
@@ -1,3 +1,7 @@
+----
+layout: default
+----
+
 # String manipulation
 
 ```{r setup-strings, include = FALSE}
diff --git a/tidy.Rmd b/tidy.Rmd
index 86be192..45ae7db 100644
--- a/tidy.Rmd
+++ b/tidy.Rmd
@@ -1,3 +1,7 @@
+----
+layout: default
+----
+
 # Tidy data
 
 > "Tidy datasets are all alike but every messy dataset is messy in its
diff --git a/transform.Rmd b/transform.Rmd
index 137e765..5d0f501 100644
--- a/transform.Rmd
+++ b/transform.Rmd
@@ -1,3 +1,7 @@
+----
+layout: default
+----
+
 # Data transformation {#transform}
 
 ```{r setup-transform, include = FALSE}
diff --git a/visualize.Rmd b/visualize.Rmd
index 3ee6654..458e8ff 100644
--- a/visualize.Rmd
+++ b/visualize.Rmd
@@ -1,3 +1,7 @@
+----
+layout: default
+----
+
 # Data visualisation
 
 ```{r setup-visualise, include = FALSE}

From 784ac93688496e437be5b4a737e98af91a04d625 Mon Sep 17 00:00:00 2001
From: hadley <h.wickham@gmail.com>
Date: Wed, 16 Dec 2015 16:44:59 -0600
Subject: [PATCH 06/14] Fix accidentally committed change

---
 _plugins/knit.r | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_plugins/knit.r b/_plugins/knit.r
index c6cfd26..0d0f1ce 100755
--- a/_plugins/knit.r
+++ b/_plugins/knit.r
@@ -4,7 +4,7 @@ library(bookdown)
 library(methods)
 
 args <- commandArgs(trailingOnly = TRUE)
-path <- temp.Rmd
+path <- args[1]
 
 if (!file.exists(path)) {
   stop("Can't find path ", path, call. = FALSE)

From 81a35eeb319ab322dc9c6dd9e2f5676afa895a18 Mon Sep 17 00:00:00 2001
From: hadley <h.wickham@gmail.com>
Date: Wed, 16 Dec 2015 16:50:28 -0600
Subject: [PATCH 07/14] Do they need titles too?

---
 intro.Rmd | 1 +
 1 file changed, 1 insertion(+)

diff --git a/intro.Rmd b/intro.Rmd
index 4d116db..37a5384 100644
--- a/intro.Rmd
+++ b/intro.Rmd
@@ -1,4 +1,5 @@
 ----
+title: Introduction
 layout: default
 ----
 

From 0961c774687e6c809d76a6d6739a242383209e64 Mon Sep 17 00:00:00 2001
From: hadley <h.wickham@gmail.com>
Date: Wed, 16 Dec 2015 17:16:01 -0600
Subject: [PATCH 08/14] Only 3 dashes?

---
 intro.Rmd | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/intro.Rmd b/intro.Rmd
index 37a5384..1a268cd 100644
--- a/intro.Rmd
+++ b/intro.Rmd
@@ -1,7 +1,7 @@
-----
+---
 title: Introduction
 layout: default
-----
+---
 
 # Introduction
 

From 7e70308dd569193a46a9b6037cc0ea3bff64a295 Mon Sep 17 00:00:00 2001
From: hadley <h.wickham@gmail.com>
Date: Wed, 16 Dec 2015 17:22:03 -0600
Subject: [PATCH 09/14] Fix yaml metadata

---
 data-structures.Rmd | 5 +++--
 datetimes.Rmd       | 5 +++--
 eda.Rmd             | 5 +++--
 functions.Rmd       | 5 +++--
 import.Rmd          | 5 +++--
 lists.Rmd           | 5 +++--
 model-assess.Rmd    | 5 +++--
 model-vis.Rmd       | 5 +++--
 model.Rmd           | 5 +++--
 rmarkdown.Rmd       | 5 +++--
 shiny.Rmd           | 5 +++--
 strings.Rmd         | 7 ++++---
 tidy.Rmd            | 5 +++--
 transform.Rmd       | 5 +++--
 visualize.Rmd       | 5 +++--
 15 files changed, 46 insertions(+), 31 deletions(-)

diff --git a/data-structures.Rmd b/data-structures.Rmd
index 61d4948..4abb782 100644
--- a/data-structures.Rmd
+++ b/data-structures.Rmd
@@ -1,6 +1,7 @@
-----
+---
 layout: default
-----
+title: Data structures
+---
 
 # Data structures
 
diff --git a/datetimes.Rmd b/datetimes.Rmd
index bbfae71..a3411e3 100644
--- a/datetimes.Rmd
+++ b/datetimes.Rmd
@@ -1,6 +1,7 @@
-----
+---
 layout: default
-----
+title: Dates and times
+---
 
 # Dates and times
 
diff --git a/eda.Rmd b/eda.Rmd
index 541eaa2..f03ea4f 100644
--- a/eda.Rmd
+++ b/eda.Rmd
@@ -1,6 +1,7 @@
-----
+---
 layout: default
-----
+title: Exploratory data analysis
+---
 
 # Exploratory data analysis
 
diff --git a/functions.Rmd b/functions.Rmd
index 399e72f..1e74d09 100644
--- a/functions.Rmd
+++ b/functions.Rmd
@@ -1,6 +1,7 @@
-----
+---
 layout: default
-----
+title: Expressing yourself in code
+---
 
 # Expressing yourself in code
 
diff --git a/import.Rmd b/import.Rmd
index 5e4eb8e..518ec6f 100644
--- a/import.Rmd
+++ b/import.Rmd
@@ -1,6 +1,7 @@
-----
+---
 layout: default
-----
+title: Data import
+---
 
 # Data import
 
diff --git a/lists.Rmd b/lists.Rmd
index 16b2b5b..8350760 100644
--- a/lists.Rmd
+++ b/lists.Rmd
@@ -1,6 +1,7 @@
-----
+---
 layout: default
-----
+title: Lists
+---
 
 # Lists
 
diff --git a/model-assess.Rmd b/model-assess.Rmd
index ed35000..4502580 100644
--- a/model-assess.Rmd
+++ b/model-assess.Rmd
@@ -1,6 +1,7 @@
-----
+---
 layout: default
-----
+title: Model assessment
+---
 
 # Model assessment
 
diff --git a/model-vis.Rmd b/model-vis.Rmd
index b6b2a16..afcfcbc 100644
--- a/model-vis.Rmd
+++ b/model-vis.Rmd
@@ -1,6 +1,7 @@
-----
+---
 layout: default
-----
+title: Model visualisation
+---
 
 # Model visualisation
 
diff --git a/model.Rmd b/model.Rmd
index 7b4316c..f611a65 100644
--- a/model.Rmd
+++ b/model.Rmd
@@ -1,6 +1,7 @@
-----
+---
 layout: default
-----
+title: Model
+---
 
 # Model
 
diff --git a/rmarkdown.Rmd b/rmarkdown.Rmd
index 2030a23..58c0809 100644
--- a/rmarkdown.Rmd
+++ b/rmarkdown.Rmd
@@ -1,6 +1,7 @@
-----
+---
 layout: default
-----
+title: R Markdown
+---
 
 # R Markdown
 
diff --git a/shiny.Rmd b/shiny.Rmd
index a0c4073..1d857ed 100644
--- a/shiny.Rmd
+++ b/shiny.Rmd
@@ -1,5 +1,6 @@
-----
+---
 layout: default
-----
+title: Shiny
+---
 
 # Shiny
diff --git a/strings.Rmd b/strings.Rmd
index ce79688..fa48b37 100644
--- a/strings.Rmd
+++ b/strings.Rmd
@@ -1,8 +1,9 @@
-----
+---
 layout: default
-----
+title: Strings
+---
 
-# String manipulation
+# Strings
 
 ```{r setup-strings, include = FALSE}
 library(stringr)
diff --git a/tidy.Rmd b/tidy.Rmd
index 45ae7db..9fc6ece 100644
--- a/tidy.Rmd
+++ b/tidy.Rmd
@@ -1,6 +1,7 @@
-----
+---
 layout: default
-----
+title: Tidy data
+---
 
 # Tidy data
 
diff --git a/transform.Rmd b/transform.Rmd
index 5d0f501..54e3a0b 100644
--- a/transform.Rmd
+++ b/transform.Rmd
@@ -1,6 +1,7 @@
-----
+---
 layout: default
-----
+title: Transform
+---
 
 # Data transformation {#transform}
 
diff --git a/visualize.Rmd b/visualize.Rmd
index 458e8ff..d948e0d 100644
--- a/visualize.Rmd
+++ b/visualize.Rmd
@@ -1,6 +1,7 @@
-----
+---
 layout: default
-----
+title: Visualize
+---
 
 # Data visualisation
 

From fc142e6c5fbde2571c0083ac8c04e6db64089978 Mon Sep 17 00:00:00 2001
From: hadley <h.wickham@gmail.com>
Date: Wed, 16 Dec 2015 17:23:36 -0600
Subject: [PATCH 10/14] Don't show code when including graphics

---
 intro.Rmd | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/intro.Rmd b/intro.Rmd
index 1a268cd..8a403f0 100644
--- a/intro.Rmd
+++ b/intro.Rmd
@@ -109,7 +109,7 @@ To run the code in this book, you will need to install both R and the RStudio ID
 
 RStudio is an integated development environment, or IDE, for R programming. There are three key regions:
 
-```{r}
+```{r, echo = FALSE}
 knitr::include_graphics("screenshots/rstudio-layout.png")
 ```
 
@@ -129,7 +129,7 @@ If you want to see a list of all keyboard shortcuts, use the meta keyboard short
 
 We strongly recommend making two changes to the default RStudio options:
 
-```{r}
+```{r, echo = FALSE}
 knitr::include_graphics("screenshots/rstudio-workspace.png")
 ```
 

From 1c37fa049338b03cc00324c46e4e51b927df34c5 Mon Sep 17 00:00:00 2001
From: Kirill Sevastyanenko <kirillseva@gmail.com>
Date: Wed, 16 Dec 2015 18:25:30 -0500
Subject: [PATCH 11/14] Update lists.Rmd

I would imagine this was the intention
---
 lists.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
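
Reviewer note (not part of the commit): a minimal base-R sketch of why the
second call should inspect `x_named`. Both lists hold the same values, so
calling `str(x)` twice would hide the only difference between them, the names.

```r
# Same values, with and without names (taken from the chunk being patched).
x <- list(1, 2, 3)
x_named <- list(a = 1, b = 2, c = 3)

# The unnamed list has no names attribute; the named one does.
names(x)        # NULL
names(x_named)  # "a" "b" "c"

# str() labels the elements `$ a`, `$ b`, `$ c` only for the named list.
str(x)
str(x_named)
```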

diff --git a/lists.Rmd b/lists.Rmd
index 8350760..aedf831 100644
--- a/lists.Rmd
+++ b/lists.Rmd
@@ -49,7 +49,7 @@ x <- list(1, 2, 3)
 str(x)
 
 x_named <- list(a = 1, b = 2, c = 3)
-str(x)
+str(x_named)
 ```
 
 Unlike atomic vectors, `lists()` can contain a mix of objects:

From bdcb95410b98ffc3146bccff4290f8311c2eb42e Mon Sep 17 00:00:00 2001
From: hadley <h.wickham@gmail.com>
Date: Thu, 17 Dec 2015 08:46:44 -0600
Subject: [PATCH 12/14] More on data transform

---
 transform.Rmd | 95 +++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 73 insertions(+), 22 deletions(-)
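
Reviewer note (not part of the commit): the caveat about rolling up summaries
might be easier to follow with numbers. The sketch below uses base R and a
made-up four-row data frame (the `delays` data and its values are assumptions
for illustration, not nycflights13) to show that counts and sums roll up
exactly, means need weighting by group size, and medians cannot be recovered
from per-group medians.

```r
# Hypothetical toy data: day 1 has three flights, day 2 has one.
delays <- data.frame(
  day   = c(1, 1, 1, 2),
  delay = c(10, 20, 30, 100)
)

# Per-day summaries: count, sum and mean of delay.
per_day <- data.frame(
  day  = c(1, 2),
  n    = as.vector(tapply(delays$delay, delays$day, length)),
  sum  = as.vector(tapply(delays$delay, delays$day, sum)),
  mean = as.vector(tapply(delays$delay, delays$day, mean))
)

# Counts and sums roll up exactly.
total_n   <- sum(per_day$n)    # 4, same as nrow(delays)
total_sum <- sum(per_day$sum)  # 160, same as sum(delays$delay)

# A plain mean of the daily means is wrong when group sizes differ ...
naive_mean <- mean(per_day$mean)                      # (20 + 100) / 2 = 60
# ... but weighting each daily mean by its count recovers the overall mean.
weighted   <- weighted.mean(per_day$mean, per_day$n)  # 160 / 4 = 40
overall    <- mean(delays$delay)                      # 40

# Per-day medians do not determine the overall median.
daily_medians     <- as.vector(tapply(delays$delay, delays$day, median))
median_of_medians <- median(daily_medians)  # median of 20 and 100 = 60
overall_median    <- median(delays$delay)   # 25
```

The same reasoning applies to the `per_day`, `per_month`, `per_year` chain in
the chapter: it is exact there because every step sums counts.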

diff --git a/transform.Rmd b/transform.Rmd
index 54e3a0b..ea87703 100644
--- a/transform.Rmd
+++ b/transform.Rmd
@@ -11,6 +11,7 @@ library(nycflights13)
 library(ggplot2)
 source("common.R")
 options(dplyr.print_min = 6)
+knitr::opts_chunk$set(fig.path = "figures/")
 ```
 
 Visualisation is an important tool for insight generation, but it is rare that you get the data in exactly the right form you need for visualisation. Often you'll need to create some new variables or summaries, or maybe you just want to rename the variables or reorder the observations in order to make the data a little easier to work with. You'll learn how to do all that (and more!) in this chapter which will teach you how to transform your data using the dplyr package.
@@ -530,6 +531,19 @@ by_day <- group_by(flights, year, month, day)
 summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
 ```
 
+### Grouping by multiple variables
+
+When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll up a dataset:
+
+```{r}
+daily <- group_by(flights, year, month, day)
+(per_day   <- summarise(daily, flights = n()))
+(per_month <- summarise(per_day, flights = sum(flights)))
+(per_year  <- summarise(per_month, flights = sum(flights)))
+```
+
+However, you need to be careful when progressively rolling up summaries like this: it's OK for sums and counts, but you need to think about weighting for means and variances, and it's not possible to do it exactly for medians.
+
 ### Useful summaries
 
 You use `summarise()` with __aggregate functions__, which take a vector of values and return a single number. 
@@ -623,6 +637,10 @@ Behind the scenes, `x %>% f(y)` turns into `f(x, y)` so you can use it to rewrit
 
 The pipe makes it easier to solve complex problems by joining together simple pieces. Each dplyr function does one thing well, helping you advance to your goal with one small step. You can check your work frequently, and if you get stuck, you just need to think: "what's one small thing I could do to advance towards a solution".
 
+Where does `%>%` come from?
+
+Most of the packages you'll learn through this book have been designed to work with the pipe (tidyr, dplyr, stringr, purrr, ...). The only exception is ggplot2: it was developed considerably before the discovery of the pipe. Unfortunately the next iteration of ggplot2, ggvis, which does use the pipe, isn't ready for prime time yet.
+
 The rest of this section explores some practical uses of the pipe when combining multiple dplyr operations to solve real problems.
 
 ### Counts
@@ -660,7 +678,22 @@ ggplot(delays, aes(n, delay)) +
 
 You'll see that most of the very delayed flight numbers happen very rarely. The shape of this plot is very characteristic: whenever you plot a mean (or many other summaries) vs number of observations, you'll see that the variation decreases as the sample size increases.
 
-There's another variation on this type of plot as shown below. Here I use the Lahman package to compute the batting average (number of hits / number of attempts) of every major league baseball player.  When I plot the skill of the batter against the number of times batted, you see two patterns:
+When looking at this sort of plot, it's often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups. This is what the following code does, and it also shows you a handy pattern for integrating ggplot2 into dplyr flows. It's a bit painful that you have to switch from `%>%` to `+`, but once you get the hang of it, it's quite convenient.
+
+```{r}
+delays %>% 
+  filter(n > 25) %>% 
+  ggplot(aes(n, delay)) + 
+    geom_point()
+```
+
+--------------------------------------------------------------------------------
+
+RStudio tip: a useful keyboard shortcut is Cmd + Shift + P. This resends the previously sent chunk from the editor to the console. This is very convenient when you're (e.g.) exploring the value of `n` in the example above. You send the whole block once with Cmd + Enter, then you modify the value of `n` and press Cmd + Shift + P to resend the complete block.
+
+--------------------------------------------------------------------------------
+
+There's another common variation of this type of pattern. Let's look at how the average performance of batters in baseball is related to the number of times they're at bat. Here I use the Lahman package to compute the batting average (number of hits / number of attempts) of every major league baseball player.  When I plot the skill of the batter against the number of times batted, you see two patterns:
 
 1.  As above, the variation in our aggregate decreases as we get more 
     data points.
@@ -677,34 +710,54 @@ batters <- batting %>%
   summarise(
     ba = sum(H) / sum(AB),
     ab = sum(AB)
-  ) %>% 
-  filter(ab > 100)
+  )
 
-ggplot(batters, aes(ab, ba)) +
-  geom_point() + 
-  geom_smooth(se = FALSE)
+batters %>% 
+  filter(ab > 100) %>% 
+  ggplot(aes(ab, ba)) +
+    geom_point() + 
+    geom_smooth(se = FALSE)
 ```
 
-### Grouping by multiple variables
-
-When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll-up a dataset:
+This also has important implications for ranking. If you naively sort on `desc(ba)`, the people with the best batting averages are clearly lucky, not skilled:
 
 ```{r}
-daily <- group_by(flights, year, month, day)
-(per_day   <- summarise(daily, flights = n()))
-(per_month <- summarise(per_day, flights = sum(flights)))
-(per_year  <- summarise(per_month, flights = sum(flights)))
+batters %>% arrange(desc(ba))
 ```
 
-However you need to be careful when progressively rolling up summaries like this: it's ok for sums and counts, but you need to think about weighting for means and variances, and it's not possible to do it exactly for medians.
+You can find a good explanation of this problem at <http://varianceexplained.org/r/empirical_bayes_baseball/> and <http://www.evanmiller.org/how-not-to-sort-by-average-rating.html>.
 
 ### Grouped mutates (and filters)
 
-* `mutate()` and `filter()` are most useful in conjunction with window 
-  functions (like `rank()`, or `min(x) == x`). They are described in detail in 
-  the windows function vignette `vignette("window-functions")`.
+Grouping is definitely most useful in conjunction with `summarise()`, but you can also do convenient operations with `mutate()` and `filter()`:
 
-A grouped filter is basically like a grouped mutate followed by a regular filter. I generally avoid them except for quick and dirty manipulations. Otherwise it's too hard to check that you've done the manipulation correctly.
+*   Find the worst members of each group:
+
+    ```{r}
+    flights %>% 
+      group_by(year, month, day) %>%
+      filter(rank(desc(arr_delay)) < 10)
+    ```
+
+*   Find all groups bigger than a threshold:
+
+    ```{r}
+    popular_dests <- flights %>% 
+      group_by(dest) %>% 
+      filter(n() > 365)
+    ```
+
+*   Standardise to compute per group metrics:
+
+    ```{r}
+    popular_dests %>% 
+      filter(arr_delay > 0) %>% 
+      mutate(prop_delay = arr_delay / sum(arr_delay))
+    ```
+
+You can see more uses in window functions vignette `vignette("window-functions")`.
+
+A grouped filter is basically like a grouped mutate followed by an ungrouped filter. I generally avoid them except for quick and dirty manipulations. Otherwise it's too hard to check that you've done the manipulation correctly.
 
 ## Multiple tables of data
 
@@ -727,8 +780,7 @@ All two-table verbs work similarly. The first two arguments are `x` and `y`, and
 
 Mutating joins allow you to combine variables from multiple tables. For example, take the nycflights13 data. In one table we have flight information with an abbreviation for carrier, and in another we have a mapping between abbreviations and full names. You can use a join to add the carrier names to the flight data:
 
-```{r, warning = FALSE}
-library("nycflights13")
+```{r}
 # Drop unimportant variables so it's easier to understand the join results.
 flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier)
 
@@ -832,7 +884,6 @@ Filtering joins match obserations in the same way as mutating joins, but affect
 These are most useful for diagnosing join mismatches. For example, there are many flights in the nycflights13 dataset that don't have a matching tail number in the planes table:
 
 ```{r}
-library("nycflights13")
 flights %>% 
   anti_join(planes, by = "tailnum") %>% 
   count(tailnum, sort = TRUE)
@@ -936,7 +987,7 @@ When joining tables, dplyr is a little more conservative than base R about the t
     
 Otherwise logicals will be silently upcast to integer, and integer to numeric, but coercing to character will raise an error:
 
-```{r, error = TRUE, purl = FALSE}
+```{r, error = TRUE}
 df1 <- data_frame(x = 1, y = 1L)
 df2 <- data_frame(x = 2, y = 1.5)
 full_join(df1, df2) %>% str()

From 43978e440569b27383adb0f52ca0a652f44c1272 Mon Sep 17 00:00:00 2001
From: Kirill Sevastyanenko <kirillseva@gmail.com>
Date: Thu, 17 Dec 2015 11:29:47 -0500
Subject: [PATCH 13/14] Update lists.Rmd

typo
---
 lists.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
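
Reviewer note (not part of the commit): the paragraph touched here states that
`x[[i]][[j]]` is equivalent to `transpose(x)[[j]][[i]]`. The sketch below
checks that idea with a hypothetical base-R helper, `transpose_list()`, which
is only a stand-in written for this note and is not the `transpose()` function
the chapter uses. The `rows` data is likewise made up.

```r
# Hypothetical base-R stand-in for transpose(), for illustration only:
# turns a list of equal-length lists "inside out".
transpose_list <- function(x) {
  lapply(seq_along(x[[1]]), function(j) lapply(x, function(row) row[[j]]))
}

# Row-based data, as a JSON API might return it: one list per record.
rows <- list(
  list(x = 1, y = "a"),
  list(x = 2, y = "b"),
  list(x = 3, y = "c")
)

# Column-based layout: one list per field.
cols <- transpose_list(rows)

# Indices swap: field j of record i becomes element i of column j.
swap_ok <- identical(rows[[2]][[1]], cols[[1]][[2]])

# Transposing twice gives back the original values. This simple helper drops
# names, so the comparison is against the unnamed records.
round_trip_ok <- identical(
  lapply(rows, unname),
  transpose_list(transpose_list(rows))
)
```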

diff --git a/lists.Rmd b/lists.Rmd
index 8350760..4a5a722 100644
--- a/lists.Rmd
+++ b/lists.Rmd
@@ -541,7 +541,7 @@ You'll see an example of this in the next section, as `transpose()` is particula
 
 It's called transpose by analogy to matrices. When you subset a transposed matrix, you switch indices: `x[i, j]` is the same as `t(x)[j, i]`. It's the same idea when transposing a list, but the subsetting looks a little different: `x[[i]][[j]]` is equivalent to `transpose(x)[[j]][[i]]`. Similarly, a transpose is its own inverse so `transpose(transpose(x))` is equal to `x`.
 
-Tranpose is also useful when working with JSON apis. Many JSON APIs represent data frames in a row-based format, rather than R's column-based format. `transpose()` makes it easy to switch between the two:
+Transpose is also useful when working with JSON APIs. Many JSON APIs represent data frames in a row-based format, rather than R's column-based format. `transpose()` makes it easy to switch between the two:
 
 ```{r}
 df <- dplyr::data_frame(x = 1:3, y = c("a", "b", "c"))

From c3eed28bbf48781b1660c2b44c2f501918f6bf52 Mon Sep 17 00:00:00 2001
From: hadley <h.wickham@gmail.com>
Date: Fri, 18 Dec 2015 09:53:15 -0600
Subject: [PATCH 14/14] Brainstorm a few exercises

---
 transform.Rmd | 34 ++++++++++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)
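
Reviewer note (not part of the commit): one way to sanity-check the first new
exercise is to encode its four scenarios as vectors and compare a few
candidate summaries. This is only a sketch in base R; the choice of 100
flights per scenario, the names in `scenarios`, and the helper
`summarise_delay()` are assumptions made for this note.

```r
# The four scenarios from the exercise, as delays in minutes
# (negative = early). Each vector stands for 100 hypothetical flights.
scenarios <- list(
  early_late_15 = rep(c(-15, 15), each = 50),
  always_10     = rep(10, 100),
  early_late_30 = rep(c(-30, 30), each = 50),
  rare_2_hours  = c(rep(0, 99), 120)
)

# A few candidate summaries; none of them is "the" right answer.
summarise_delay <- function(d) {
  c(
    mean      = mean(d),       # average delay
    median    = median(d),     # typical delay
    sd        = sd(d),         # how variable the delay is
    prop_late = mean(d > 0),   # share of flights that are late
    worst     = max(d)         # largest delay
  )
}

# One column per scenario, one row per summary.
delay_summaries <- sapply(scenarios, summarise_delay)
round(delay_summaries, 2)
```

In this toy setup the mean cannot tell the two early/late scenarios apart
(both average to zero), while the standard deviation and the worst case can,
which is the kind of contrast the exercise seems designed to surface.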

diff --git a/transform.Rmd b/transform.Rmd
index ea87703..d2fcaf9 100644
--- a/transform.Rmd
+++ b/transform.Rmd
@@ -589,6 +589,21 @@ mean(c(1, 5, 10, NA), na.rm = TRUE)
 
 ### Exercises
 
+1.  Brainstorm at least 5 different ways to assess the typical delay 
+    characteristics of a group of flights. Consider the following scenarios:
+    
+    * A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of 
+      the time.
+      
+    * A flight is always 10 minutes late.
+
+    * A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of 
+      the time.
+      
+    * 99% of the time a flight is on time. 1% of the time it's 2 hours late.
+    
+    Which is more important: arrival delay or departure delay?
+
 ## Multiple operations
 
 Imagine we want to explore the relationship between the distance and average delay for each location. Using what you already know about dplyr, you might write code like this:
@@ -755,10 +770,25 @@ Grouping is definitely most useful in conjunction with `summarise()`, but you ca
       mutate(prop_delay = arr_delay / sum(arr_delay))
     ```
 
-You can see more uses in window functions vignette `vignette("window-functions")`.
-
 A grouped filter is basically like a grouped mutate followed by an ungrouped filter. I generally avoid them except for quick and dirty manipulations. Otherwise it's too hard to check that you've done the manipulation correctly.
 
+Functions that work most naturally in grouped mutates and filters are known as window functions (vs. aggregate or summary functions used in grouped summaries). You can learn more about useful window functions in the corresponding vignette: `vignette("window-functions")`.
+
+### Exercises
+
+1.  Which plane (`tailnum`) has the worst on-time record?
+
+1.  What time of day should you fly if you want to avoid delays as much
+    as possible?
+    
+1.  Look at each destination. Can you find flights that are suspiciously
+    fast? (i.e. flights that represent a potential data entry error). Compute
+    the air time of a flight relative to the shortest flight to that destination.
+    Which flights were most delayed in the air?
+    
+1.  Find all destinations that are flown by at least two carriers. Use that
+    information to rank the carriers.
+
 ## Multiple tables of data
 
 It's rare that a data analysis involves only a single table of data. In practice, you'll normally have many tables that contribute to an analysis, and you need flexible tools to combine them. In dplyr, there are three families of verbs that work with two tables at a time: