More work on visualization chapter. Adds grammar of graphics diagrams.

2015-11-18 09:21:22 -05:00 · 2015-11-18 09:21:22 -05:00 · c2ee6409e0
parent 05223ebb52
commit c2ee6409e0
15 changed files with 123 additions and 100 deletions
--- a/images/visualization-10.png
+++ b/images/visualization-10.png
--- a/images/visualization-11.png
+++ b/images/visualization-11.png
--- a/images/visualization-12.png
+++ b/images/visualization-12.png
--- a/images/visualization-13.png
+++ b/images/visualization-13.png
--- a/images/visualization-14.png
+++ b/images/visualization-14.png
--- a/images/visualization-15.png
+++ b/images/visualization-15.png
--- a/images/visualization-16.png
+++ b/images/visualization-16.png
--- a/images/visualization-3.png
+++ b/images/visualization-3.png
--- a/images/visualization-4.png
+++ b/images/visualization-4.png
--- a/images/visualization-5.png
+++ b/images/visualization-5.png
--- a/images/visualization-6.png
+++ b/images/visualization-6.png
--- a/images/visualization-7.png
+++ b/images/visualization-7.png
--- a/images/visualization-8.png
+++ b/images/visualization-8.png
--- a/images/visualization-9.png
+++ b/images/visualization-9.png
--- a/visualize.Rmd
+++ b/visualize.Rmd
@@ -116,7 +116,7 @@ ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
 ```

-The next few subsections will introduce useful arguments (and functions) that you can add to the template. Each argument will come with a new set of options---and likely a new set of questions. Hold those questions for now. We will catalogue your options in Section 2. Use this section to become familiar with the `ggplot2` syntax. Once you do, the low level details of `ggplot2` will be easier to understand.
+The next few subsections will introduce several arguments (and functions) that you can add to the template. Each argument will come with a new set of options---and likely a new set of questions. Hold those questions for now. We will catalogue your options in Section 2. Use this section to become familiar with the `ggplot2` syntax. Once you do, the low-level details of `ggplot2` will be easier to understand.

 #### Aesthetic Mappings

@@ -205,7 +205,7 @@ But isn't random noise, you know, bad? It *is* true that jittering your data wil

 ### Bar Charts

-After scatterplots, bar charts are the most used type of plot. To make a bar chart use the function `geom_bar()`.
+After scatterplots, bar charts are one of the most used types of plot. To make a bar chart with `ggplot2`, use the function `geom_bar()`.

 ```{r}
 ggplot(data = diamonds) + 
@@ -251,7 +251,7 @@ ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
 ```

-You'll learn the complete set of position options in Section 2.
+See Section 2 to learn about other position options.

 #### Stats

@@ -378,12 +378,10 @@ ggplot(data = mpg) +

 > "Wax on. Wax off."---*The Karate Kid* (1984)

-In this section, you learned how to make more than just scatterplots, bar charts, and coxcomb plots; you learned a foundation that you can use to make _any_ type of plot with `ggplot2`.
+In this section, you learned more than how to make scatterplots, bar charts, and coxcomb plots; you learned a foundation that you can use to make _any_ type of plot with `ggplot2`.

 To see this, let's add position adjustments, stats, coordinate systems, and facetting to our code template. In `ggplot2`, each of these parameters will work with every plot and every geom. 

-As a result, you can use this template to make each plot in `ggplot2`: 
-
 ```{r eval = FALSE}
 ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
@@ -395,169 +393,193 @@ ggplot(data = <DATA>) +
  <FACET_FUNCTION>
 ```

-The template takes seven parameters, the capitalized words that appear in the template. In practice, you rarely need to supply all seven parameters because `ggplot2` will provide useful defaults for everything except the data, mappings, and geom function.
+The template takes seven parameters, the bracketed words that appear in the template. In practice, you rarely need to supply all seven parameters because `ggplot2` will provide useful defaults for everything except the data, mappings, and geom function.

-These seven parameters are connected by a powerful idea known as the _Grammar and Graphics_, which you can use to make _any_ type of plot. The next section will look at each of these parameters closely. It begins by introducing the Grammar of Graphics.
+The seven parameters in the template are connected by a powerful idea known as the _Grammar of Graphics_, a system for describing plots. The grammar shows that you can uniquely describe _any_ plot as a combination of---you guessed it: a data set, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, and a facetting scheme.
+
+In other words, you can use the template above to make any graph that you can imagine---at least in theory. Section 2 will examine how this works in practice. The section explains how the grammar of graphics works and how `ggplot2` implements the grammar to build real graphs. It also catalogues all of the options that `ggplot2` puts at your fingertips for geoms, mappings, stats, position adjustments, and coordinate systems. 

 ## The Grammar of Graphics

-The "gg" of `ggplot2` stands for the grammar of graphics, a system for describing and building plots. You can think of the grammar of graphics as a formula for building a plot---any plot. 
+The "gg" of `ggplot2` stands for the grammar of graphics, a system for describing plots. According to the grammar, a plot is a combination of seven elements:

-$$\text{plot} = \text{coordinate system} + \left \text{data} + \text{stat} + \text{geom} + \text{mappings} + \text{position} \right + \text{facet scheme}$$
+$$\text{plot} = \Big( \text{data} + \text{stat} + \text{geom} + \text{mappings} + \text{position} \Big) + \text{coordinate system} + \text{facet scheme}$$

-According to the grammar, you can uniquely describe any plot as a combination of these seven elements. To see how the grammar of graphics works, consider a thought exercise:
+This may not be an obvious way to think about plots, so let's explore the formula above with a thought exercise. You can build any plot in the following manner.

-To build a plot, you begin with a data set and a coordinate system.
+To build the plot, you begin with a data set to visualize and a coordinate system to visualize it in. We'll visualize an abbreviated version of the `mpg` data set in the Cartesian coordinate system.

-`r bookdown::embed_png("images/blank.png", dpi = 150)`
+`r bookdown::embed_png("images/visualization-3.png", dpi = 400)`

-You then choose whether to visualize the data as it is, or whether to summarize the data with a transformation (and then visualize the summary). Let's visualize our data as it is. To do this, we will use the identity transformation, which returns the data as it is.

-`r bookdown::embed_png("images/blank.png", dpi = 150)`
+You then choose whether to visualize the data itself, or whether to summarize the data with a transformation and then visualize the summary. Let's visualize our data as it is. This would be the same as applying an identity transformation to the data, since an identity transformation returns the data as it is.

-You then choose a visual object to represent the observations in your data set. Here we will use a point. Each point will represent one row of data. Let's call the points geoms. short for geometrical object. 
+`r bookdown::embed_png("images/visualization-4.png", dpi = 400)`

-`r bookdown::embed_png("images/blank.png", dpi = 150)`

-Next you map variables in your data to the aesthetic properties of your geoms. Here we map the... to the...
+Next, you need to choose some sort of visual object to represent the observations in your data set. This object will be what you actually draw in the coordinate system. 

-`r bookdown::embed_png("images/blank.png", dpi = 150)`
+Here we will use a set of points. Each point will represent one row of data. Let's call the points geoms, short for geometrical objects.

-To place your points into your coordinate system, you map the x location aesthetic to a variable
+`r bookdown::embed_png("images/visualization-5.png", dpi = 400)`

-`r bookdown::embed_png("images/blank.png", dpi = 150)`
+Next, you map variables in your data to the visual properties of your geoms. These properties are what we call aesthetics. Let's map the... to the...

-as well as the y location aesthetic.
+`r bookdown::embed_png("images/visualization-6.png", dpi = 400)`

-`r bookdown::embed_png("images/blank.png", dpi = 150)`
+One pair of mappings is particularly important. To place your points into your coordinate system, you map the x location aesthetic to a variable, here `displ`.

-The process creates a complete graph, but you can also choose to adjust the position of the points (or not) and to facet the graph (or not).
+`r bookdown::embed_png("images/visualization-7.png", dpi = 400)`

-`r bookdown::embed_png("images/blank.png", dpi = 150)`
+And you map the y location aesthetic to a variable, here `hwy`.

-You can reuse this process to make any graph. To make the graph different, switch out one of the elements. For example, you can use a line as a geom to make a line graph, or a bar to make a bar chart. You can also switch the data set, coordinate system, etc.
+`r bookdown::embed_png("images/visualization-8.png", dpi = 400)`

-`r bookdown::embed_png("images/blank.png", dpi = 150)`
+The process creates a complete graph: 

-We can use the same thought experiment to see that the grammar of graphics has a layered nature. You can assemble a data set, a stat, a geom, mappings, and a position adjustment into a layer that you can add to another graph.
+`r bookdown::embed_png("images/visualization-9.png", dpi = 400)`

-Imagine that we begin a new graph. This graph uses the same data set as our previous graph. This time we'll apply a "smooth" stat to the data. The stat fits a model to the data and then returns a transformed data set. 
+However, you can also choose to adjust the position of the points (or not) and to facet the graph (or not).

-`r bookdown::embed_png("images/blank.png", dpi = 150)`
+`r bookdown::embed_png("images/visualization-10.png", dpi = 400)`

-The transformed data contains three new columns: 
+You can reuse this process to make any graph. If you change any of the elements involved, you will end up with a new graph. For example, we can change our geom to a line to make a line graph, or to a bar to make a bar chart. Or we can change the position to "jitter" to make a jittered plot.
+
+`r bookdown::embed_png("images/visualization-11.png", dpi = 400)`
+
+You can also switch the data set, coordinate system, or any other component of the graph.
+
+Let's extend our thought experiment to add a model line to the graph. To do this, we will add a new _layer_ to the graph.
+
+### Layers
+
+A layer is a collection of a data set, a stat, a geom, a set of mappings, and a position adjustment. You can add a layer to a coordinate system and facetting scheme to make a complete graph, or you can add a layer to an existing graph to make a layered graph.
+
+Let's build a layer that uses the same data set as our previous graph. In this layer, we will apply a "smooth" stat to the data. The stat fits a model to the data and then returns a transformed data set with three new columns: 

 * `y` - the value of the model line at each data point
 * `ymin` - the y value of the bottom of the confidence interval associated with the model at each data point
 * `ymax` - the y value of the top of the confidence interval associated with the model at each point 

-Let's represent these points with a line geom. We will map the x values of the line to `displ` and we will map the y values to our new `y` variable. We won't use a position adjustment.
+`r bookdown::embed_png("images/visualization-12.png", dpi = 400)`

-`r bookdown::embed_png("images/blank.png", dpi = 150)`
+In this layer, we will represent the observations with a line geom. We map the x values of the line to `displ` and we map the y values to our new `y` variable. We won't use a position adjustment.
+
+`r bookdown::embed_png("images/visualization-13.png", dpi = 400)`

 We now have a "layer" that we can add to a coordinate system and facetting scheme to make a complete graph.

-`r bookdown::embed_png("images/blank.png", dpi = 150)`
+`r bookdown::embed_png("images/visualization-14.png", dpi = 400)`

 Or we can add the layer to our previous graph to make a plot that shows both summary information and raw data.

-`r bookdown::embed_png("images/blank.png", dpi = 150)`
+`r bookdown::embed_png("images/visualization-15.png", dpi = 400)`

-For completion, let's add one more layer. This layer will begin with the same data set as the previous layer. It will also use the same stat. However, we will use the ribbon geom to visualize the data points. We will map the top of the ribbon to `ymax`, the bottom of the ribbon to `ymin`, and we will map the x position of the ribbon to `displ`. We will not use a position adjustment.
+For completeness, let's add one more layer. This layer will begin with the same data set as the previous layer. It will also use the same stat. However, we will use the ribbon geom to visualize the data points. A ribbon is similar to a shaded region contained by two lines.

-We can overlay the layer on our graph to show raw data, summary information, and the uncertainty associated with that summary.
+We map the top of the ribbon to `ymax` and the bottom of the ribbon to `ymin`. We map the x position of the ribbon to `displ`. We will not use a position adjustment.

-`r bookdown::embed_png("images/blank.png", dpi = 150)`
+We can now add the layer to our graph to show in one plot:
+
+* our raw data
+* a visual summary of the data (the smooth line)
+* the uncertainty associated with the summary
+
+`r bookdown::embed_png("images/visualization-16.png", dpi = 400)`

 If you like, you can continue to add layers to the graph (but the graph will soon become cluttered).

-The thought experiment shows that you can describe any graph with a combination of elements that should seem familiar now---data, coordinate system, geoms, stats, aesthetic mappings, position adjustments, and facets. These elements themselves form the grammar of graphics.
-
-In summary, the grammar of graphics is a system that helps you uniquely describe graphs. 
-
-`ggplot2` is a software package that uses R to assemble actual graphs from descriptions that you write with the grammar of graphics.
-
-### Layers
+The thought experiment shows that the elements of the grammar of graphics work together to build a graph. You can describe any graph with these elements, and each unique combination of elements makes a single, unique graph. You can also extend a graph by adding layers of new data, stats, geoms, mappings, and positions.


+In other words, you can extend the grammar of graphics formula indefinitely to make layered plots:

-```{r echo = FALSE}
-ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
-  geom_point() +
-  geom_smooth()
+$$
+\begin{aligned}
+\text{plot} = & \Big( \text{data} + \text{stat} + \text{geom} + \text{mappings} + \text{position} \Big) + \\
+& \Big( \text{data} + \text{stat} + \text{geom} + \text{mappings} + \text{position} \Big)^{*} + \\
+& \Big( \text{data} + \text{stat} + \text{geom} + \text{mappings} + \text{position} \Big)^{*} + \\
+& \text{coordinate system} + \text{facet scheme}
+\end{aligned}
+$$
+
+### Working with layers
+
+`ggplot2` syntax matches this formulation almost exactly. The basic low-level function of `ggplot2` is `layer()`, which combines data, stats, geoms, mappings, and positions into a single layer to plot.
+
+If you have time on your hands, you can use `layer()` to create a multi-layered plot like the one above. Initialize your plot with `ggplot()`. Then add as many calls to `layer()` as you like. Give each layer its own `data`, `stat`, `geom`, `mapping`, and `position` arguments.
+
+```{r message = FALSE}
+ggplot() + 
+  layer(
+    data = mpg, 
+    stat = "identity", 
+    geom = "point", 
+    mapping = aes(x = displ, y = hwy),
+    position = "identity"
+  ) + 
+  layer(
+    data = mpg, 
+    stat = "smooth", 
+    geom = "ribbon", 
+    mapping = aes(x = displ, y = hwy),
+    position = "identity"
+  ) + 
+  layer(
+    data = mpg, 
+    stat = "smooth", 
+    geom = "line", 
+    mapping = aes(x = displ, y = hwy),
+    position = "identity"
+  ) +
+  coord_cartesian()
 ```

-In practice, you can add multiple data sets, geoms, stats, mappings, and position adjustments to the same graph. The graph above contains two geoms: a "point" geom and a "smooth" geom (i.e. a model line); as well as two stats: an "identity" stat and a "smooth" stat.
+Although you can build all of your graphs this way, few people do because `ggplot2` supplies some very efficient shortcuts.

-In contrast, each graph can only use one coordinate system and one facetting scheme.
+For example, you will find in practice that you always pair the same geoms with the same stats and position adjustments. You'll almost always use the point geom with the "identity" stat and the "identity" position. You'll almost always use the bar geom with the "bin" stat and the "stack" position.

+The `geom_` functions in `ggplot2` take advantage of these common combinations. Like `layer()`, each geom function builds a layer, but the geom functions preset the geom, stat, and position values of the layer to useful defaults. The geom becomes the geom that appears in the function name. The stat and position become the stat and position most commonly associated with the geom.

-You can think of `ggplot()` as initializing your graph with a cartesian coordinate system. Add a coordinate function or a facet function to change these defaults.
+`ggplot2` even provides geom functions for less common, but still useful combinations of geoms, stats, and positions. For example, the function `geom_jitter()` builds a layer that has a point geom, an "identity" stat and a "jitter" position. The function `geom_smooth()` builds two layers: a ribbon layer that is combined with a line layer as in the plot above. Together these layers display a model line with its standard error band.

-```{r plot}
-p <- ggplot() +
-  coord_polar() +
-  facet_wrap(~drv)
-```
+The result is a more direct syntax for making plots, one that you are already familiar with from Section 1.

-Then use a geom function to supply a combination of data, geom, stat, mappings, and position.
-
-```{r dependson=plot}
-p + geom_point(mapping = aes(x = displ, y = hwy), data = mpg, stat = "identity", position = "identity")
-```
-
-Each combination of data, geom, stat, mappings, and position provides a visual "layer" of information that you can add to the established coordinate system and facetting scheme. To build a multi-layered plot, just add multiple layers.
-
-```{r message = FALSE, dependson=plot}
-p + geom_point(mapping = aes(x = displ, y = hwy), data = mpg, stat = "identity", position = "identity") +
-  geom_smooth(mapping = aes(x = displ, y = hwy), data = mpg, stat = "smooth", position = "identity")
+```{r message = FALSE}
+ggplot() +
+  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
+  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
 ```

 #### Multiple geoms

-You can add a smooth line to your raw data by adding `geom_smooth()` to your original plot call. 
+As with `layer()`, you can add multiple geom functions to a single plot call.

-```{r, message = FALSE}
-ggplot(data = mpg) + 
-  geom_point(mapping = aes(x = displ, y = hwy)) +
-  geom_smooth(mapping= aes(x = displ, y = hwy))
+This system lets you build sophisticated graphs geom by geom, but it also makes it possible to write repetitive code. For example, the code above repeats the arguments `data = mpg, mapping = aes(x = displ, y = hwy)`. Repetition makes your code harder to read and write, and it also increases the chance of errors and typos.
+
+You can avoid repetition by passing `ggplot()` a set of global mappings to apply to each layer. For example, we can eliminate the duplication of `mapping = aes(x = displ, y = hwy)` in our previous code with a global mapping argument:
+
+```{r, eval = FALSE}
+ggplot(mapping = aes(x = displ, y = hwy)) + 
+  geom_point(data = mpg) + 
+  geom_smooth(data = mpg)
 ```

-This system lets you build sophisticated graphs geom by geom. You can add as many geoms as you like to a single plot call. `ggplot2` will place each new geom on top of the preceeding geom.
-
-#### Global and local mappings
-
-You can create a set of global mappings that apply to all geoms  by passing a mapping argument to `ggplot()`. For example, we can eliminate the duplication of `mapping = aes(x = displ, y = hwy)` in our previous code by using global mappings:
-
-```{r, message = FALSE}
-ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
-  geom_point() + 
-  geom_smooth()
-```
-
-You can also combine global mappings with local mappings to differentiate geoms.
+You can even combine global mappings with local mappings to differentiate geoms.

 * Mappings that appear in `ggplot()` will be applied to each geom.
 * Mappings that appear in a geom function will be applied to that geom only.
-* If a local mapping conflicts with a global mapping, `ggplot2` will use the local mapping for that geom only.
+* If a local aesthetic mapping conflicts with a global aesthetic mapping, `ggplot2` will use the local mapping. This is arbitrated on an aesthetic-by-aesthetic basis.

 ```{r, message = FALSE}
-ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
-  geom_point(mapping = aes(color = class)) + 
-  geom_smooth()
+ggplot(mapping = aes(x = displ, y = hwy)) + 
+  geom_point(data = mpg, mapping = aes(color = class)) + 
+  geom_smooth(data = mpg)
 ```

-The smooth line above is a single line with a single color. This does not occur if you add the color aesthetic to the global mappings. Smooth will draw a different colored line for each class of cars.
+For example, the smooth line above is a single line with a single color. This changes if you add the color aesthetic to the global mappings: `geom_smooth()` will then draw a differently colored line for each class of cars.

-```{r, message = FALSE, warning = FALSE}
-ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
-  geom_point() + 
-  geom_smooth(aes(y = cty))
-```
-
-#### Global and local data sets
-
-You can use the same system to specify individual data sets for each layer.
+You can use the same system to specify a global data set for every layer.

 ```{r, eval = FALSE}
 ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
@@ -581,6 +603,7 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = subset(mpg, cyl == 8))
 ```

+## The Vocabulary of Graphics

 ### Aesthetics