More work on visualization chapter.

Garrett 2015-11-16 21:19:27 -05:00
parent 413b943517
commit 05223ebb52
1 changed files with 117 additions and 86 deletions


@ -34,7 +34,7 @@ This chapter will teach you how to visualize your data with R and the `ggplot2`
## Outline
*Section 1* will get you started making graphs right away. You'll learn how to make several common types of plots, and you will explore `ggplot2`'s syntax along the way.
*Section 1* will get you started making graphs right away. You'll learn how to make several common types of plots, and you will explore `ggplot2`'s syntax.
*Section 2* will teach you the _grammar of graphics_, a versatile system for building plots. You'll learn to assemble any plot you like with _layers_, _geoms_, _stats_, _aesthetic mappings_, _position adjustments_, and _coordinate systems_.
@ -57,7 +57,7 @@ library(ggplot2)
Do cars with big engines use more fuel than cars with small engines?
Try to answer the question with a precise hypothesis: What does the relationship between engine size and fuel efficiency look like? Is it positive? Negative? Linear? Nonlinear? Strong? Weak?
You probably have an intuitive answer to this question. Now try to make your answer precise: What does the relationship between engine size and fuel efficiency look like? Is it positive? Negative? Linear? Nonlinear? Strong? Weak?
You can test your hypothesis with the `mpg` data set in the `ggplot2` package. The data set contains observations collected by the EPA on 38 models of car. Among the variables in `mpg` are `displ`, a car's engine size in litres, and `hwy`, a car's fuel efficiency on the highway in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.
@ -77,23 +77,18 @@ You will need to reload the package each time you start a new R session.
### Scatterplots
The code below plots the `hwy` variable of `mpg` against the `displ` variable.
Open an R session and run the code below. The code plots the `hwy` variable of `mpg` against the `displ` variable.
```{r eval = FALSE}
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
```
Open an R session and run the code. Your result will look like the graph below. Does the graph confirm your hypothesis about fuel and engine size?
```{r echo = FALSE}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
```
Your result will look like the graph above. Does the graph confirm your hypothesis about fuel and engine size?
The graph shows a negative relationship between engine size (`displ`) and fuel efficiency (`hwy`). In other words, cars with big engines use more fuel. But the graph shows us something else as well.
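One rough way to put a number on the direction of this relationship is a correlation coefficient. The sketch below is an illustration added here, not part of the chapter's own workflow. It assumes `ggplot2` is installed so that `mpg` is available, and the name `displ_hwy_cor` is made up for the example.

```{r}
library(ggplot2)  # provides the mpg data set

# Pearson correlation between engine size and highway mileage.
# A negative value agrees with the downward trend in the scatterplot.
displ_hwy_cor <- cor(mpg$displ, mpg$hwy)
displ_hwy_cor
```

A single number like this only summarizes the linear trend; it cannot show everything that the graph does.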
One group of points seems to fall outside the linear trend. These cars have a higher mileage than you might expect. Can you tell why? Before we examine these cars, let's review the code that made our graph.
One group of points seems to fall outside of the linear trend. These cars have a higher mileage than you might expect. Can you tell why? Before we examine these cars, let's review the code that made our graph.
`r bookdown::embed_png("images/visualization-1.png", dpi = 150)`
@ -121,15 +116,15 @@ ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
```
The next few sections will reveal useful arguments (and functions) that you can add to the template.
The next few subsections will introduce useful arguments (and functions) that you can add to the template. Each argument will come with a new set of options, and likely a new set of questions. Hold those questions for now. We will catalogue your options in Section 2. Use this section to become familiar with the `ggplot2` syntax. Once you do, the low-level details of `ggplot2` will be easier to understand.
#### Aesthetic Mappings
> "The greatest value of a picture is when it forces us to notice what we never expected to see."---John Tukey
Our plot above revealed a groups of cars that had better than expected mileage. How can you explain these cars?
Our plot above revealed a group of cars that had better than expected mileage. How can you explain these cars?
Let's hypothesize that the cars are hybrids. One way to test this hypothesis is to look at the `class` value for each car. The `class` variable of the `mpg` data set classifies cars into groups such as compact, midsize, and suv. If the outlying points are hybrids, they should be classified as compact, or perhaps subcompact, cars (keep in mind that this data was collected before hybrid trucks and suvs became popular).
Let's hypothesize that the cars are hybrids. One way to test this hypothesis is to look at the `class` value for each car. The `class` variable of the `mpg` data set classifies cars into groups such as compact, midsize, and suv. If the outlying points are hybrids, they should be classified as compact cars or, perhaps, subcompact cars (keep in mind that this data was collected before hybrid trucks and suvs became popular).
You can add a third variable, like `class`, to a two-dimensional scatterplot by mapping it to a new _aesthetic_.
@ -139,7 +134,7 @@ An aesthetic is a visual property of the points in your plot. Aesthetics include
You can convey information about your data by mapping the aesthetics in your plot to the variables in your data set. For example, we can map the colors of our points to the `class` variable. Then the color of each point will reveal its class affiliation.
To map an aesthetic to a variable, set the name of the aesthetic to the name of the variable, and do this _in your plot's `aes()` call_:
To map an aesthetic to a variable, set the name of the aesthetic to the name of the variable, _and do this in your plot's `aes()` call_:
```{r}
ggplot(data = mpg) +
@ -150,14 +145,14 @@ ggplot(data = mpg) +
The colors reveal that many of the unusual points are two-seater cars. These cars don't seem like hybrids. In fact, they seem like sports cars, and that's what they are. Sports cars have large engines like suvs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have such large engines.
Color is one of the most popular aesthetics to use in a scatterplot, but we could have mapped `class` to the size aesthetic in the same way. In this case, the exact size of each point reveals its class affiliation.
In the above example, we mapped `class` to the color aesthetic, but we could have mapped `class` to the size aesthetic in the same way. In this case, the exact size of each point reveals its class affiliation.
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = class))
```
Or we could have mapped `class` to the _alpha_ (i.e., transparency) of the points. Now the transparency of each point corresponds with its class affiliation.
Or we could have mapped `class` to the _alpha_ aesthetic, which controls the transparency of the points. Now the transparency of each point corresponds with its class affiliation.
```{r}
ggplot(data = mpg) +
@ -171,9 +166,9 @@ ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
```
In each case, you set the name of the aesthetic to the variable to display and you do this within the `aes()` function. The syntax highlights a useful insight because you also set `x` and `y` to variables within `aes()`: the x and y locations of a point are themselves aesthetics, visual properties that you can map to variables to display information about the data.
In each case, you set the name of the aesthetic to the variable to display, and you do this within the `aes()` function. The syntax highlights a useful insight because you also set `x` and `y` to variables within `aes()`. The insight is that the x and y locations of a point are themselves aesthetics, visual properties that you can map to variables to display information about the data.
Once you set an aesthetic, `ggplot2` takes care of the rest. It selects a pleasing set of values to use for the aesthetic, and it constructs a legend that explains the mapping. For x and y aesthetics, `ggplot2` does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts in the same way as a legend; it explains the mapping between locations and values.
Once you set an aesthetic, `ggplot2` takes care of the rest. It selects a pleasing set of levels to use for the aesthetic, and it constructs a legend that explains the mapping. For x and y aesthetics, `ggplot2` does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts in the same way as a legend; it explains the mapping between locations and values.
#### Exercises
@ -189,14 +184,14 @@ See the help page for `geom_point()` (`?geom_point`) to learn which aesthetics a
#### Position adjustments
Our scatterplot presents an interesting riddle: why does the plot only display 126 points? There are 234 observations in the data set. Also, why do the points appear to be arranged on a grid?
Did you notice that there is another riddle hidden in our scatterplot? The plot displays 126 points, but there are 234 observations in the `mpg` data set. Also, the points appear to fall on a grid. Why should this be?
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
```
The points appear in a grid because the measurements were rounded: `hwy` to the nearest integer and `displ` to the nearest tenth. This also explains why our graph appears to contain only 126 points. Many points overlap each other because they've been rounded to the same values of `hwy` and `displ`. 108 points are hidden beneath other points located at the same values.
The points appear in a grid because the measurements in `mpg` are rounded: `hwy` to the nearest integer and `displ` to the nearest tenth. This also explains why our graph appears to contain 126 points. Many points overlap each other because they have been rounded to the same values of `hwy` and `displ`. 108 points are hidden beneath other points located at the same values.
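The overlap can be checked directly. The sketch below, an illustration rather than part of the original chapter, counts the rows of `mpg` and the distinct combinations of `displ` and `hwy`. The variable names `n_obs` and `n_locations` are made up for the example.

```{r}
library(ggplot2)  # provides the mpg data set

# Number of observations (rows) in mpg
n_obs <- nrow(mpg)

# Number of distinct (displ, hwy) pairs, i.e. distinct point locations
n_locations <- nrow(unique(mpg[, c("displ", "hwy")]))

# Observations that share a location with another observation
n_obs - n_locations
```

The difference between the two counts is the number of points that cannot be seen in the unjittered plot.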
You can avoid this overplotting problem by setting the position argument of `geom_point()` to "jitter". `position = "jitter"` adds a small amount of random noise to each point. This spreads the points out because no two points are likely to receive the same amount of random noise.
@ -210,16 +205,14 @@ But isn't random noise, you know, bad? It *is* true that jittering your data wil
### Bar Charts
Bar charts are the most commonly used type of plot after scatterplots. To make a bar chart, use the function `geom_bar()`.
After scatterplots, bar charts are the most used type of plot. To make a bar chart, use the function `geom_bar()`.
```{r}
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
```
The chart above displays the total number of diamonds in the `diamonds` data set, grouped by `cut`. The `diamonds` data set comes with `ggplot2` and contains information about 53940 diamonds, including the `price`, `carat`, `color`, `clarity`, and `cut` of each diamond.
The graph shows that more diamonds are available with high-quality cuts than with low-quality cuts.
The chart above displays the total number of diamonds in the `diamonds` data set, grouped by `cut`. The `diamonds` data set comes with `ggplot2` and contains information about 53940 diamonds, including the `price`, `carat`, `color`, `clarity`, and `cut` of each diamond. The chart shows that more diamonds are available with high-quality cuts than with low-quality cuts.
A bar has different visual properties than a point, which can create some surprises. For example, how would you create this simple chart? If you have an R session open, give it a try.
@ -251,11 +244,18 @@ ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))
```
Bar charts are interesting because they reveal something subtle about many types of plots.
Bar charts also use different position adjustments than scatterplots. Every geom function in `ggplot2` accepts a position argument, but it wouldn't make sense to set `position = "jitter"` for a bar chart. However, you could set `position = "dodge"` to create an unstacked bar chart.
```{r}
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
```
You'll learn the complete set of position options in Section 2.
#### Stats
Consider our basic bar chart.
Bar charts are interesting because they reveal something subtle about many types of plots. Consider our basic bar chart.
```{r}
ggplot(data = diamonds) +
@ -277,16 +277,16 @@ ggplot(data = diamonds) +
Where does count come from?
Some graphs, like scatterplots, plot the raw values of your data set. Other graphs, like bar charts, do not plot raw values at all. These graphs apply an algorithm to your data and then plot the results of the algorithm. Consider how many graphs do this.
Some graphs, like scatterplots, plot the raw values of your data set. Other graphs, like bar charts, do not plot raw values at all. These graphs apply an algorithm to your data and then plot the results of the algorithm. Consider how often graphs do this.
* **bar charts** and **histograms** bin the raw data and then plot bin counts
* **smooth lines** (e.g. trend lines) apply a model to the raw data and then plot the model line
* **boxplots** calculate the quartiles of the raw data and then plot the quartiles as a box.
* **bar charts** and **histograms** bin your data and then plot bin counts
* **smooth lines** (e.g. trend lines) apply a model to your data and then plot the model line
* **boxplots** calculate the quartiles of your data and then plot the quartiles as a box.
* and so on.
`ggplot2` calls the algorithm that a graph uses to transform raw data a _stat_, which is short for statistical transformation. Each geom in `ggplot2` is associated with a stat that it automatically uses to plot your data (if a geom plots the raw data it uses the "identity" stat, i.e. the identity transformation).
`ggplot2` calls the algorithm that a graph uses to transform raw data a _stat_, which is short for statistical transformation. Each geom in `ggplot2` is associated with a stat that it uses to plot your data. `geom_bar()` uses the "bin" stat, which bins raw data and computes bin counts. `geom_point()` uses the "identity" stat, which applies the identity transformation, i.e. no transformation.
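To get a feel for what a counting stat computes, you can reproduce the bar heights by hand. The sketch below uses base R's `table()` as a stand-in. It only approximates the idea and is not the code that `ggplot2` runs internally; the names `cut_counts` and `cut_counts_df` are made up for the example.

```{r}
library(ggplot2)  # provides the diamonds data set

# Count the diamonds in each level of cut
cut_counts <- table(diamonds$cut)

# Convert the counts to a data frame with one row per bar
cut_counts_df <- as.data.frame(cut_counts, responseName = "count")
names(cut_counts_df)[1] <- "cut"
cut_counts_df
```

Each row of `cut_counts_df` corresponds to one bar in the basic bar chart, and the `count` column gives the height of that bar.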
You can change the stat that your geom uses. For example, you can use the identity stat to plot data that already lists the counts of each bar.
You can change the stat that your geom uses. For example, you can ask `geom_bar()` to use the "identity" stat. This is a useful way to plot data that already lists the heights for each bar, like the data set below.
```{r}
demo <- data.frame(
@ -297,7 +297,7 @@ demo <- data.frame(
demo
```
To use the identity stat, set the stat argument of `geom_bar()` to "identity".
To use the identity stat, set the stat argument of `geom_bar()` to "identity".
```{r}
ggplot(data = demo) +
@ -306,7 +306,7 @@ ggplot(data = demo) +
***
*Tip*: To learn which stat a geom uses, visit the geom's help page, e.g. `?geom_bar`. To learn more about a stat, visit the stat's help page, e.g. `?stat_bin`.
*Tip*: To learn which stat a geom uses, visit the geom's help page, e.g. `?geom_bar`.
***
@ -322,11 +322,13 @@ ggplot(data = diamonds) +
coord_polar()
```
Answer: A coxcomb plot is just a bar chart plotted in polar coordinates.
Answer: A coxcomb plot is a bar chart plotted in polar coordinates.
You can make coxcomb plots with `ggplot2` by first building a bar chart and then plotting it in polar coordinates.
#### Coordinate systems
To plot your data in polar coordinates, add `coord_polar()` to your plot call. Polar bar charts will look better if you also set `geom_bar()`'s width parameter to 1.
To plot your data in polar coordinates, add `coord_polar()` to your plot call. Polar bar charts will look better if you also set the width parameter of `geom_bar()` to 1.
```{r}
ggplot(data = diamonds) +
@ -334,13 +336,15 @@ ggplot(data = diamonds) +
coord_polar()
```
By default, `ggplot2` will map your y variable to $r$ and your x variable to $\theta$. When applied to a bar chart, this creates a coxcomb plot.
You can add `coord_polar()` to any plot in `ggplot2` to draw the plot in polar coordinates. By default, `ggplot2` will map your y variable to $r$ and your x variable to $\theta$.
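As a small illustration, the sketch below draws the original scatterplot in polar coordinates, first with the default mapping and then with the roles swapped through the `theta` argument of `coord_polar()`. The object names `scatter`, `polar_default`, and `polar_swapped` are made up for the example.

```{r}
library(ggplot2)

scatter <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Default: x (displ) becomes the angle, y (hwy) becomes the radius
polar_default <- scatter + coord_polar()

# theta = "y": y (hwy) becomes the angle, x (displ) becomes the radius
polar_swapped <- scatter + coord_polar(theta = "y")

polar_default
polar_swapped
```

A scatterplot is rarely easier to read in polar coordinates; the point of the sketch is only that the coordinate system can be changed independently of the geom.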
Coxcomb plots make a useful glyph that you can use to compare subgroups of data. _Facetting_ provides a quick way to do this.
#### Facets
You can create a separate polar chart for each level of a third variable by _facetting_ your plot. For example, you can create a separate subplot for each level of the `clarity` variable.
```{r}
```{r fig.height = 7, fig.width = 7}
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut), width = 1) +
  coord_polar() +
@ -349,9 +353,9 @@ ggplot(data = diamonds) +
Here, the first subplot displays the group of points that have the `clarity` value `I1`. The second subplot displays the group of points that have the `clarity` value `SI2`. And so on.
To facet your plot on a single discrete variable, add `facet_wrap()` to your plot call. The first argument of `facet_wrap()` is always a formula, a `~` followed by a variable name.
To facet your plot on a single discrete variable, add `facet_wrap()` to your plot call. The first argument of `facet_wrap()` is a formula, always a `~` followed by a variable name.
To facet your plot on the combinations of two variables, add `facet_grid()` to your plot call. The first argument of `facet_grid()` is always a formula, two variable names separated by a `~`.
To facet your plot on the combinations of two variables, add `facet_grid()` to your plot call. The first argument of `facet_grid()` is a formula, always two variable names separated by a `~`.
```{r fig.height = 7, fig.width = 7}
ggplot(data = diamonds) +
@ -360,7 +364,15 @@ ggplot(data = diamonds) +
facet_grid(color ~ clarity)
```
Here the first subplot displays all of the points that have an `I1` code for `clarity` _and_ a `D` code for `color`. Don't be confused; `color` is a variable name in the `diamonds` data set. So `facet_grid(color ~ clarity)` has nothing to do with the color aesthetic.
Here the first subplot displays all of the points that have an `I1` code for `clarity` _and_ a `D` code for `color`. Don't be confused: `color` is a variable name in the `diamonds` data set, so `facet_grid(color ~ clarity)` does not invoke the color aesthetic.
Facetting works on more than just polar charts. You can add `facet_wrap()` or `facet_grid()` to any plot in `ggplot2`. For example, you could facet our original scatterplot.
```{r fig.height = 6, fig.width = 6}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class)
```
### Bringing it together
@ -370,6 +382,8 @@ In this section, you learned how to make more than just scatterplots, bar charts
To see this, let's add position adjustments, stats, coordinate systems, and facetting to our code template. In `ggplot2`, each of these parameters will work with every plot and every geom.
As a result, you can use this template to make each plot in `ggplot2`:
```{r eval = FALSE}
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
@ -381,85 +395,102 @@ ggplot(data = <DATA>) +
<FACET_FUNCTION>
```
***
The template takes seven parameters, the capitalized words that appear in the template. In practice, you rarely need to supply all seven parameters because `ggplot2` will provide useful defaults for everything except the data, mappings, and geom function.
*Tip*: In practice, you do not need to define each of these parameters when you make a graph. `ggplot2` will supply a set of sensible defaults.
***
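To make the role of the defaults concrete, the sketch below spells out every slot of the template for the basic scatterplot. It assumes that `geom_point()` defaults to the "identity" stat and the "identity" position, and that a plot with no coordinate or facet function behaves like one with `coord_cartesian()` and `facet_null()`. The names `p_short` and `p_full` are made up for the example.

```{r}
library(ggplot2)

# The short form: only the data, the geom function, and the mappings
p_short <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# The same plot with the assumed defaults written out explicitly
p_full <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    stat = "identity",
    position = "identity"
  ) +
  coord_cartesian() +
  facet_null()

p_short
p_full
```

Both calls should draw the same graph, which is why the short form is usually enough.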
The parameters in our template are connected by a powerful idea known as the _Grammar of Graphics_. The Grammar of Graphics shows that you can uniquely describe a plot as a combination of:
* a data set
* a coordinate system
* a geom
* a stat
* a set of aesthetic mappings
* a position adjustment, and
* a facet scheme
As a result, you can build _any_ plot that you have in mind with the template above. To do so, just fill in the parameters that describe the plot.
The next section will look at each of these parameters closely.
These seven parameters are connected by a powerful idea known as the _Grammar of Graphics_, which you can use to make _any_ type of plot. The next section will look at each of these parameters closely. It begins by introducing the Grammar of Graphics.
## The Grammar of Graphics
The _grammar of graphics_ is the core of `ggplot2`. In fact, the "gg" of `ggplot2` stands for the grammar of graphics.
The "gg" of `ggplot2` stands for the grammar of graphics, a system for describing and building plots. You can think of the grammar of graphics as a formula for building a plot---any plot.
You can think of the grammar of graphics as a formula for building a plot---any plot. To build a plot, you begin with a data set and a coordinate system.
$$\text{plot} = \text{coordinate system} + \left( \text{data} + \text{stat} + \text{geom} + \text{mappings} + \text{position} \right) + \text{facet scheme}$$
According to the grammar, you can uniquely describe any plot as a combination of these seven elements. To see how the grammar of graphics works, consider a thought experiment:
To build a plot, you begin with a data set and a coordinate system.
`r bookdown::embed_png("images/blank.png", dpi = 150)`
You then visualize each row of data with a geom.
You then choose whether to visualize the data as it is, or whether to summarize the data with a transformation (and then visualize the summary). Let's visualize our data as it is. To do this, we will use the identity transformation, which returns the data as it is.
`r bookdown::embed_png("images/blank.png", dpi = 150)`
And you map variables in your data to the aesthetic properties of your geoms. Here we map the...
You then choose a visual object to represent the observations in your data set. Here we will use a point. Each point will represent one row of data. Let's call the points geoms, short for geometrical objects.
`r bookdown::embed_png("images/blank.png", dpi = 150)`
Once you map the x aesthetic of your geoms
Next you map variables in your data to the aesthetic properties of your geoms. Here we map the... to the...
`r bookdown::embed_png("images/blank.png", dpi = 150)`
and the y aesthetic
To place your points into your coordinate system, you map the x location aesthetic to a variable
`r bookdown::embed_png("images/blank.png", dpi = 150)`
you have a complete graph that you can choose to facet or not. You can also adjust positions as necessary.
as well as the y location aesthetic.
`r bookdown::embed_png("images/blank.png", dpi = 150)`
For some graphs you add an extra step; you transform the data with a statistical transformation, and then use geoms to represent the results.
The process creates a complete graph, but you can also choose to adjust the position of the points (or not) and to facet the graph (or not).
`r bookdown::embed_png("images/blank.png", dpi = 150)`
These parameters---data, coordinate system, geoms, stats, aesthetic mappings, position adjustments, and facets---make up the grammar of graphics. You can build any graph by selecting the correct combination of parameters, e.g.
* **data**:
* **coordinate system**:
* **geom**:
* **stat**:
* **mappings**:
* **position adjustment**:
* **facets**:
You can reuse this process to make any graph. To make the graph different, switch out one of the elements. For example, you can use a line as a geom to make a line graph, or a bar to make a bar chart. You can also switch the data set, coordinate system, etc.
`r bookdown::embed_png("images/blank.png", dpi = 150)`
If you alter any single parameter, you make a new graph:
We can use the same thought experiment to see that the grammar of graphics has a layered nature. You can assemble a data set, a stat, a geom, mappings, and a position adjustment into a layer that you can add to another graph.
* **data**:
* **coordinate system**:
* **geom**:
* **stat**:
* **mappings**:
* **position adjustment**:
* **facets**:
Imagine that we begin a new graph. This graph uses the same data set as our previous graph. This time we'll apply a "smooth" stat to the data. The stat fits a model to the data and then returns a transformed data set.
`r bookdown::embed_png("images/blank.png", dpi = 150)`
The transformed data contains three new columns:
* `y` - the value of the model line at each data point
* `ymin` - the y value of the bottom of the confidence interval associated with the model at each data point
* `ymax` - the y value of the top of the confidence interval associated with the model at each point
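You can inspect a transformed data set like this in `ggplot2` itself. The sketch below is one way to do so, assuming a version of `ggplot2` that provides `layer_data()`. The `method` and `formula` values are choices made for the example, and `p` and `smoothed` are made-up names.

```{r}
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(method = "loess", formula = y ~ x)

# Extract the data computed by the stat for the first (and only) layer
smoothed <- layer_data(p, 1)

# The columns that the geoms will draw: the model line and its interval
head(smoothed[, c("x", "y", "ymin", "ymax")])
```

In the versions I am aware of, the smooth stat evaluates the model on an evenly spaced grid of x values rather than only at the observed data points, so `smoothed` will usually have a different number of rows than `mpg`.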
Let's represent these points with a line geom. We will map the x values of the line to `displ` and we will map the y values to our new `y` variable. We won't use a position adjustment.
`r bookdown::embed_png("images/blank.png", dpi = 150)`
We now have a "layer" that we can add to a coordinate system and facetting scheme to make a complete graph.
`r bookdown::embed_png("images/blank.png", dpi = 150)`
Or we can add the layer to our previous graph to make a plot that shows both summary information and raw data.
`r bookdown::embed_png("images/blank.png", dpi = 150)`
For completeness, let's add one more layer. This layer will begin with the same data set as the previous layer. It will also use the same stat. However, we will use the ribbon geom to visualize the data points. We will map the top of the ribbon to `ymax`, the bottom of the ribbon to `ymin`, and we will map the x position of the ribbon to `displ`. We will not use a position adjustment.
We can overlay the layer on our graph to show raw data, summary information, and the uncertainty associated with that summary.
`r bookdown::embed_png("images/blank.png", dpi = 150)`
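The layers of the thought experiment can be sketched in code. This is an illustration under assumptions, not the chapter's own example: it relies on `geom_line()` and `geom_ribbon()` accepting `stat = "smooth"`, the loess `method` and the `alpha` value are choices made for the example, and `layered` is a made-up name.

```{r}
library(ggplot2)

layered <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  # Raw data: identity stat, point geom
  geom_point() +
  # Uncertainty: smooth stat, ribbon geom (ymin and ymax come from the stat);
  # drawn before the line so that the line sits on top of the ribbon
  geom_ribbon(stat = "smooth", method = "loess", formula = y ~ x, alpha = 0.3) +
  # Summary: smooth stat, line geom (y comes from the stat)
  geom_line(stat = "smooth", method = "loess", formula = y ~ x)

layered
```

Each `geom_*()` call contributes one layer, and each layer carries its own stat, geom, and position adjustment while sharing the data and mappings set in `ggplot()`.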
If you like, you can continue to add layers to the graph (but the graph will soon become cluttered).
The thought experiment shows that you can describe any graph with a combination of elements that should seem familiar now---data, coordinate system, geoms, stats, aesthetic mappings, position adjustments, and facets. These elements themselves form the grammar of graphics.
In summary, the grammar of graphics is a system that helps you uniquely describe graphs.
`ggplot2` is a software package that uses R to assemble actual graphs from descriptions that you write with the grammar of graphics.
### Layers
To build a graph in `ggplot2`, choose a coordinate system and a facetting scheme for your entire graph, and then add as many combinations of data, geoms, stats, mappings, and position adjustments as you like.
```{r echo = FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()
```
In practice, you can add multiple data sets, geoms, stats, mappings, and position adjustments to the same graph. The graph above contains two geoms, a "point" geom and a "smooth" geom (i.e. a model line), as well as two stats, an "identity" stat and a "smooth" stat.
In contrast, each graph can only use one coordinate system and one facetting scheme.
You can think of `ggplot()` as initializing your graph with a Cartesian coordinate system and no facetting. Add a coordinate function or a facet function to change these defaults.
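For example, the sketch below adds a facet function to a two-layer plot so that each panel shows one drive type. The `drv` variable of `mpg` and the linear `method` are choices made for the illustration, and `faceted` is a made-up name.

```{r}
library(ggplot2)

faceted <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  # A single facetting scheme applies to every layer in the plot
  facet_wrap(~ drv)

faceted
```

Both layers are split into the same panels, which reflects the rule that a graph has many layers but only one facetting scheme.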