More work on visualization.

Garrett 2015-11-18 17:01:07 -05:00
parent c2ee6409e0
commit 24cd931ef7
1 changed files with 211 additions and 141 deletions

@ -34,11 +34,13 @@ This chapter will teach you how to visualize your data with R and the `ggplot2`
## Outline
*Section 1* will get you started making graphs right away. You'll learn how to make several common types of plots, and you will explore `ggplot2`'s syntax.
*Section 1* will get you started making graphs right away. You'll learn how to make several common types of plots, and how to use the `ggplot2` syntax.
*Section 2* will teach you the _grammar of graphics_, a versatile system for building plots. You'll learn to assemble any plot you like with _layers_, _geoms_, _stats_, _aesthetic mappings_, _position adjustments_, and _coordinate systems_.
*Section 2* will teach you the _grammar of graphics_, a versatile system for building plots. You'll learn how to use a combination of _layers_, _geoms_, _stats_, _aesthetic mappings_, _position adjustments_, and _coordinate systems_ to assemble any plot you like.
*Section 3* will show you how to customize your plots with labels, legends, color schemes, and more.
*Section 3* will show you how to use `ggplot2` and the grammar of graphics to make many specific types of plot. This section documents each of the options provided by `ggplot2`.
*Section 4* will show you how to customize your plots with labels, legends, color schemes, and more.
## Prerequisites
@ -55,9 +57,9 @@ library(ggplot2)
## Basics
Do cars with big engines use more fuel than cars with small engines?
Before we look at any graphs, let's begin with a question to explore: Do cars with big engines use more fuel than cars with small engines?
You probably have an intuitive answer to this question. Now try to make your answer precise: What does the relationship between engine size and fuel efficiency look like? Is it positive? Negative? Linear? Nonlinear? Strong? Weak?
You probably have an intuitive answer, but try to make your answer precise: What does the relationship between engine size and fuel efficiency look like? Is it positive? Negative? Linear? Nonlinear? Strong? Weak?
You can test your hypothesis with the `mpg` data set in the `ggplot2` package. The data set contains observations collected by the EPA on 38 models of car. Among the variables in `mpg` are `displ`, a car's engine size in litres, and `hwy`, a car's fuel efficiency on the highway in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.
@ -77,7 +79,7 @@ You will need to reload the package each time you start a new R session.
### Scatterplots
Open an R session and run the code below. The code plots the `displ` variable of `mpg` against the `hwy` variable.
The easiest way to understand the `mpg` data set is to visualize it, which means that it's time to make our first graph. To do this, open an R session and run the code below. The code plots `displ` on the x axis and `hwy` on the y axis.
```{r}
ggplot(data = mpg) +
@ -105,18 +107,18 @@ With `ggplot2`, you begin a plot with the function `ggplot()`. `ggplot()` doesn'
The first argument of `ggplot()` is the data set to use in the graph. So `ggplot(data = mpg)` initializes a graph that will use the `mpg` data set.
You complete your graph by adding one or more layers to `ggplot()`. Here, the function `geom_point()` adds a layer of points to the plot, which creates a scatterplot. `ggplot2` comes with other `geom_` functions that you can use as well. Each function creates a different type of layer, and each function takes a mapping argument.
You complete your graph by adding one or more layers to `ggplot()`. Here, the function `geom_point()` adds a layer of points to the plot, which creates a scatterplot. `ggplot2` comes with other `geom_` functions that you can use as well. Each function creates a different type of layer, and each function takes a mapping argument. We'll learn about all of the geom functions in Section 3.
The mapping argument explains where your points should go. You must set mapping to a call to `aes()`. The `x` and `y` arguments of `aes()` explain which variables to map to the x and y axes of the graph. `ggplot()` will look for those variables in your data set, `mpg`.
The mapping argument of your geom function explains where your points should go. You must set mapping to a call to `aes()`. The `x` and `y` arguments of `aes()` explain which variables to map to the x and y axes of the graph. `ggplot()` will look for those variables in your data set, `mpg`.
You can use this code as a template to make many graphs with `ggplot2`. To make a graph, replace the bracketed sections in the code below with a new data set, a new geom function, or a new set of mappings.
This code suggests a template for making graphs with `ggplot2`. To make a graph, replace the bracketed sections in the code below with a new data set, a new geom function, or a new set of mappings.
```{r eval = FALSE}
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
```
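As one possible way to fill in the template, the sketch below keeps the `mpg` data set and `geom_point()` but swaps in a different pair of mappings. The choice of `cty` (the city fuel efficiency variable in `mpg`) for the x axis is only an illustration, and the object name `p` is arbitrary.

```{r}
library(ggplot2)

# <DATA> = mpg, <GEOM_FUNCTION> = geom_point, <MAPPINGS> = x = cty, y = hwy
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = cty, y = hwy))

p  # printing the object draws the plot
```

Changing only one bracketed piece at a time is a simple way to see what each part of the template contributes.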
The next few subsections will introduce several arguments (and functions) that you can add to the template. Each argument will come with a new set of options---and likely a new set of questions. Hold those questions for now. We will catalogue your options in Section 2. Use this section to become familiar with the `ggplot2` syntax. Once you do, the low level details of `ggplot2` will be easier to understand.
The next few subsections will introduce several arguments (and functions) that you can add to the template. Each argument will come with a new set of options---and likely a new set of questions. Hold those questions for now. We will catalogue your options in Section 3. Use this section to become familiar with the `ggplot2` syntax. Once you do, the low level details of `ggplot2` will be easier to understand.
#### Aesthetic Mappings
@ -126,7 +128,7 @@ Our plot above revealed a group of cars that had better than expected mileage. H
Let's hypothesize that the cars are hybrids. One way to test this hypothesis is to look at the `class` value for each car. The `class` variable of the `mpg` data set classifies cars into groups such as compact, midsize, and suv. If the outlying points are hybrids, they should be classified as compact cars or, perhaps, subcompact cars (keep in mind that this data was collected before hybrid trucks and suvs became popular).
You can add a third value, like `class`, to a two dimensional scatterplot by mapping it to a new _aesthetic_.
You can add a third variable, like `class`, to a two dimensional scatterplot by mapping it to an _aesthetic_.
An aesthetic is a visual property of the points in your plot. Aesthetics include things like the size, the shape, or the color of your points. You can display a point (like the one below) in different ways by changing its aesthetic properties.
@ -143,7 +145,7 @@ ggplot(data = mpg) +
`ggplot2` will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable. `ggplot2` will also add a legend that explains which levels correspond to which values.
The colors reveal that many of the unusual points are two seater cars. These cars don't seem like hybrids. In fact, they seem like sports cars---and that's what they are. Sports cars have large engines like suvs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have such large engines.
The colors reveal that many of the unusual points are two seater cars. These cars don't seem like hybrids. In fact, they seem like sports cars---and that's what they are. Sports cars have large engines like suvs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have large engines.
In the above example, we mapped `class` to the color aesthetic, but we could have mapped `class` to the size aesthetic in the same way. In this case, the exact size of each point reveals its class affiliation.
@ -166,9 +168,15 @@ ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
```
***
**Tip** - What happened to the suvs? `ggplot2` will only use six shapes at a time. See Section 3 for more details.
***
In each case, you set the name of the aesthetic to the variable to display, and you do this within the `aes()` function. The syntax highlights a useful insight because you also set `x` and `y` to variables within `aes()`. The insight is that the x and y locations of a point are themselves aesthetics, visual properties that you can map to variables to display information about the data.
Once you set an aesthetic, `ggplot2` takes care of the rest. It selects a pleasing set of levels to use for the aesthetic, and it constructs a legend that explains the mapping. For x and y aesthetics, `ggplot2` does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts in the same way as a legend; it explains the mapping between locations and values.
Once you set an aesthetic, `ggplot2` takes care of the rest. It selects a pleasing set of levels to use for the aesthetic, and it constructs a legend that explains the mapping. For x and y aesthetics, `ggplot2` does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts like a legend; it explains the mapping between locations and values.
#### Exercises
@ -193,7 +201,7 @@ ggplot(data = mpg) +
The points appear in a grid because the `hwy` measurements in `mpg` are rounded to the nearest integer and the `displ` measurements are rounded to the nearest tenth. This also explains why our graph appears to contain only 126 points. Many points overlap each other because they have been rounded to the same values of `hwy` and `displ`. 108 points are hidden behind other points located at the same value.
You can avoid this overplotting problem by setting the position argument of `geom_point()` to "jitter". `position = "jitter"` adds a small amount of random noise to each point. This spreads the points out because no two points are likely to receive the same amount of random noise.
You can avoid this overplotting problem by setting the position argument of `geom_point()` to "jitter". `position = "jitter"` adds a small amount of random noise to each point. This spreads the points out since no two points are likely to receive the same amount of random noise.
```{r}
ggplot(data = mpg) +
@ -205,7 +213,9 @@ But isn't random noise, you know, bad? It *is* true that jittering your data wil
### Bar Charts
After scatterplots, bar charts are one of the most used types of plot. To make a bar chart with `ggplot2` use the function `geom_bar()`.
You now know how to make scatterplots, but there are many different types of plots that you can use to visualize your data. After scatterplots, one of the most used types of plot is the bar chart.
To make a bar chart with `ggplot2` use the function `geom_bar()`.
```{r}
ggplot(data = diamonds) +
@ -244,14 +254,14 @@ ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))
```
Bar charts also use different position adjustments than scatterplots. Every geom function in `ggplot2` accepts a position argument, but it wouldn't make sense set `position = "jitter"` for a bar chart. However, you could set `position = "dodge"` to create an unstacked bar chart.
Bar charts also use different position adjustments than scatterplots. Every geom function in `ggplot2` accepts a position argument, but it wouldn't make sense to set `position = "jitter"` for a bar chart. However, you could set `position = "dodge"` to create an unstacked bar chart.
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
```
See Section 2 to learn about other position options.
See Section 3 to learn about other position options.
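For instance, `"fill"` is another position adjustment that `geom_bar()` accepts. The sketch below reuses the `diamonds` bar chart from this subsection. It stacks the bars and then stretches each stack to the same height, which can make proportions easier to compare across cuts. The object name `p_fill` is arbitrary.

```{r}
library(ggplot2)

# Same mappings as the dodged chart; only the position adjustment changes
p_fill <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")

p_fill
```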
#### Stats
@ -324,11 +334,11 @@ ggplot(data = diamonds) +
Answer: A coxcomb plot is a bar chart plotted in polar coordinates.
You can make coxcomb plots with `ggplot2` by first building a bar chart and then plotting it in polar coordinates.
#### Coordinate systems
To plot your data in polar coordinates, add `coord_polar()` to your plot call. Polar bar charts will look better if you also set the width parameter of `geom_bar()` to 1.
You can make coxcomb plots with `ggplot2` by first building a bar chart and then plotting it in polar coordinates.
To plot your data in polar coordinates, add `coord_polar()` to your plot call. Polar bar charts will look better if you also set the width parameter of `geom_bar()` to 1. This will ensure that no space appears between the bars.
```{r}
ggplot(data = diamonds) +
@ -338,11 +348,11 @@ ggplot(data = diamonds) +
You can add `coord_polar()` to any plot in `ggplot2` to draw the plot in polar coordinates. `ggplot2` will map the y variable to $r$ and your x variable to $\theta$.
Coxcomb plots make a useful glyph that you can use to compare subgroups of data. _Facetting_ provides a quick way to do this.
#### Facets
You can create a separate polar chart for each level of a third variable by _facetting_ your plot. For example, you can create a separate subplot for each level of the `clarity` variable.
Coxcomb plots are especially useful when you make many plots to compare against each other. Each coxcomb will act as a glyph that you can use to compare subgroups of data.
You can create a separate coxcomb plot for each subgroup in your data by _faceting_ your plot. For example, here we create a separate subplot for each level of the `clarity` variable.
```{r fig.height = 7, fig.width = 7}
ggplot(data = diamonds) +
@ -351,11 +361,11 @@ ggplot(data = diamonds) +
facet_wrap( ~ clarity)
```
Here, the first subplot displays the group of points that have the `clarity` value `I1`. The second subplot displays the group of points that have the `clarity` value `SI2`. And so on.
The first subplot displays the group of points that have the `clarity` value `I1`. The second subplot displays the group of points that have the `clarity` value `SI2`. And so on.
To facet your plot on a single discrete variable, add `facet_wrap()` to your plot call. The first argument of `facet_wrap()` is a formula, always a `~` followed by a variable name.
To facet your plot on the combinations of two variables, add `facet_grid()` to your plot call. The first argument of `facet_grid()` is a formula, always two variable names separated by a `~`.
To facet your plot on the combinations of two variables, add `facet_grid()` to your plot call. The first argument of `facet_grid()` is also a formula. This time the formula should contain two variable names separated by a `~`.
```{r fig.height = 7, fig.width = 7}
ggplot(data = diamonds) +
@ -364,9 +374,9 @@ ggplot(data = diamonds) +
facet_grid(color ~ clarity)
```
Here the first subplot displays all of the points that have an `I1` code for `clarity` _and_ a `D` code for `color`. Don't be confused; `color` is a variable name in the `diamonds` data set; `facet_grid(color ~ clarity)` is not invoking the color aesthetic.
Here the first subplot displays all of the points that have an `I1` code for `clarity` _and_ a `D` code for `color`. Don't be confused by the word color here; `color` is a variable name in the `diamonds` data set. `facet_grid(color ~ clarity)` is not invoking the color aesthetic.
Facetting works on more than just polar charts. You can add `facet_wrap()` or `facet_grid()` to any plot in `ggplot2`. For example, you could facet our original scatterplot.
Faceting works on more than just polar charts. You can add `facet_wrap()` or `facet_grid()` to any plot in `ggplot2`. For example, you could facet our original scatterplot.
```{r fig.height = 6, fig.width = 6}
ggplot(data = mpg) +
@ -380,7 +390,7 @@ ggplot(data = mpg) +
In this section, you learned more than how to make scatterplots, bar charts, and coxcomb plots; you learned a foundation that you can use to make _any_ type of plot with `ggplot2`.
To see this, let's add position adjustments, stats, coordinate systems, and facetting to our code template. In `ggplot2`, each of these parameters will work with every plot and every geom.
To see this, let's add position adjustments, stats, coordinate systems, and faceting to our code template. In `ggplot2`, each of these parameters will work with every plot and every geom.
```{r eval = FALSE}
ggplot(data = <DATA>) +
@ -393,11 +403,11 @@ ggplot(data = <DATA>) +
<FACET_FUNCTION>
```
The template takes seven parameters, the bracketed words that appear in the template. In practice, you rarely need to supply all seven parameters because `ggplot2` will provide useful defaults for everything except the data, mappings, and geom function.
The template takes seven parameters, the bracketed words that appear in the template. In practice, you rarely need to supply all seven parameters because `ggplot2` will provide useful defaults for everything except the data, the mappings, and the geom function.
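To make this concrete, here is a sketch that supplies every slot of the template explicitly. It assumes that the stat and the position adjustment are passed as the `stat` and `position` arguments of the geom function. The particular choices (a jittered scatterplot of `mpg`, faceted by `class`) and the object name `p_full` are only illustrations.

```{r}
library(ggplot2)

p_full <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    stat = "identity",     # the default stat for points, written out
    position = "jitter"    # a non-default position adjustment
  ) +
  coord_cartesian() +      # the default coordinate system, written out
  facet_wrap(~ class)      # one subplot per class of car

p_full
```

Dropping the `stat` argument and the `coord_cartesian()` call should give the same plot, because both only restate defaults.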
The seven parameters in the template are connected by a powerful idea known as the _Grammar of Graphics_, a system for describing plots. The grammar shows that you can uniquely describe _any_ plot as a combination of---you guessed it: a data set, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, and a facetting scheme.
The seven parameters in the template are connected by a powerful idea known as the _Grammar of Graphics_, a system for describing plots. The grammar shows that you can uniquely describe _any_ plot as a combination of---you guessed it: a data set, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, and a faceting scheme.
In other words, you can use the template above to make any graph that you can imagine---at least in theory. Section 2 will examine how this works in practice. The section explains how the grammar of graphics works and how `ggplot2` implements the grammar to build real graphs. It also catalogues all of the options that `ggplot2` puts at your fingertips for geoms, mappings, stats, position adjustments, and coordinate systems.
In other words, you can use the template above to make any graph that you can imagine---at least in theory. Section 2 will examine how this works in practice. The section explains the details of how the grammar of graphics works, and it shows how `ggplot2` implements the grammar to build real graphs.
## The Grammar of Graphics
@ -405,33 +415,35 @@ The "gg" of `ggplot2` stands for the grammar of graphics, a system for describin
$$\text{plot} = \Big( \text{data} + \text{stat} + \text{geom} + \text{mappings} + \text{position} \Big) + \text{coordinate system} + \text{facet scheme}$$
This may not be an obvious way to think about plots, so let's explore the formula above with a thought exercise. You can build any plot in the following manner.
You might not be used to thinking of plots in this way, so let's explore the formula above with a thought exercise. If you had to build a graph from scratch, how would you do it?
To build the plot, you begin with a data set to visualize and a coordinate system to visualize it in. We'll visualize an abbreviated version of the `mpg` data set, and the cartesian coordiante system.
Here's one way. To build a plot, you could begin with a data set to visualize and a coordinate system to visualize the data in. For this thought exercise, we will visualize an abbreviated version of the `mpg` data set, and we will use the Cartesian coordinate system.
`r bookdown::embed_png("images/visualization-3.png", dpi = 400)`
You then choose whether to visualize the data itself, or whether to summarize the data with a transformation and then visualize the summary. Let's visualize our data as it is. This would be the same as applying an identity transformation to the data, since an identity transformation returns the data as it is.
You could then choose whether to visualize the data in its raw form, or whether to summarize the data with a transformation and then visualize the summary. Let's visualize our data in its raw form. This would be the same as applying an identity transformation to the data, since an identity transformation returns the data as it is.
`r bookdown::embed_png("images/visualization-4.png", dpi = 400)`
Next, you need to choose some sort of visual object to represent the observations in your data set. This object will be what you actually draw in the coordinate system.
Next, you would need to choose some sort of visual object to represent the observations in your data set. This object will be what you actually draw in the coordinate system.
Here we will use a set of points. Each point will represent one row of data. Let's call the points geoms, short for geometrical object.
Here we will use a set of points. Each point will represent one row of data. Let's call the points "geoms", short for geometrical object.
`r bookdown::embed_png("images/visualization-5.png", dpi = 400)`
Next, you map variables in your data to the visual properties of your geoms. These properties are what we call aesthetics. Let's map the... to the...
Next, you could map variables in your data to the visual properties of your geoms. These visual properties are what we call aesthetics. Once you do this, the visual information contained in the point will communicate recorded information contained in the data set.
Let's map the `cyl` variable to the shape of our points.
`r bookdown::embed_png("images/visualization-6.png", dpi = 400)`
One pair of mappings is particularly important. To place your points into your coordinate system, you map the x location aesthetic to a variable. Here `displ`.
One pair of mappings would be particularly important. To place your points into your coordinate system, you would need to map a variable to the x location of the points, which is an aesthetic. Here we map `displ` to the x location.
`r bookdown::embed_png("images/visualization-7.png", dpi = 400)`
And you map the y location aesthetic to a variable. Here `hwy`.
And you would need to map a variable to the y location of the points, which is also an aesthetic. Here we map `hwy` to the y location.
`r bookdown::embed_png("images/visualization-8.png", dpi = 400)`
@ -439,21 +451,21 @@ The process creates a complete graph:
`r bookdown::embed_png("images/visualization-9.png", dpi = 400)`
However, you can also choose to adjust the position of the points (or not) and to facet the graph (or not).
However, you could modify the graph further. You could choose to adjust the position of the points (or not) and to facet the graph (or not).
`r bookdown::embed_png("images/visualization-10.png", dpi = 400)`
You can reuse this process to make any graph. If you change any of the elements involved, you will end up with a new graph. For example, we can change our geom to a line to make a line graph, or to a bar to make a bar chart. Or we can change the position to "jitter" to make a jittered plot.
This process works to make any graph. If you changed any of the elements involved, you would end up with a new graph. For example, we could change our geom to a line to make a line graph, or to a bar to make a bar chart. Or we could change the position to "jitter" to make a jittered plot.
`r bookdown::embed_png("images/visualization-11.png", dpi = 400)`
You can also switch the data set, coordinate system, or any other component of the graph.
You could also switch the data set, coordinate system, or any other component of the graph.
Let's extend this our experiment to add a model line to the graph. To do this, we will add a new _layer_ to the graph.
Let's extend the thought exercise to add a model line to the graph. To do this, we will add a new _layer_ to the graph.
### Layers
A layer is a collection of a data set, a stat, a geom, and a position adjustment. You can add a layer to a coordinate system and facetting scheme to make a complete graph, or you can add a layer to an existing graph to make a layered graph.
A layer is a collection of a data set, a stat, a geom, a set of mappings, and a position adjustment. You can add a layer to a coordinate system and faceting scheme to make a complete graph, or you can add a layer to an existing graph to make a layered graph.
Let's build a layer that uses the same data set as our previous graph. In this layer, we will apply a "smooth" stat to the data. The stat fits a model to the data and then returns a transformed data set with three new columns:
@ -467,7 +479,7 @@ In this layer, we will represent the observations with a line geom. We map the x
`r bookdown::embed_png("images/visualization-13.png", dpi = 400)`
We now have a "layer" that we can add to a coordinate system and facetting scheme to make a complete graph.
We now have a "layer" that we can add to a coordinate system and faceting scheme to make a complete graph.
`r bookdown::embed_png("images/visualization-14.png", dpi = 400)`
@ -481,7 +493,7 @@ We map the top of the ribbon to `ymax` and the bottom of the ribbon to `ymin`. W
We can now add the layer to our graph to show in one plot:
* our raw data
* raw data
* a visual summary of the data (the smooth line)
* the uncertainty associated with the summary
@ -489,7 +501,7 @@ We can now add the layer to our graph to show in one plot:
If you like, you can continue to add layers to the graph (but the graph will soon become cluttered).
The thought experiment shows that the elements of the grammar of graphics work together to build a graph. You can describe any graph with these elements, and each unique combination of elements makes a single, unique graph. You can also extend a graph by adding layers of new data, stats, geoms, mappings, and positions.
The thought exercise shows that the elements of the grammar of graphics work together to build a graph. You can describe any graph with these elements, and each unique combination of elements makes a single, unique graph. You can also extend a graph by adding layers of new data, stats, geoms, mappings, and positions.
In other words, you can extend the grammar of graphics formula indefinitely to make layered plots:
@ -537,13 +549,13 @@ ggplot() +
Although you can build all of your graphs this way, few people do because `ggplot2` supplies some very efficient shortcuts.
For example, you will find in practice that you always pair the same geoms with the same stats and position adjustments. You'll almost always use the point geom with the "identity" stat and the "identity" position. You'll almost always use the bar geom with the "bin" stat and the "stack" position.
For example, you will find in practice that you almost always pair the same geoms with the same stats and position adjustments. You will almost always use the point geom with the "identity" stat and the "identity" position. Similarly, you will almost always use the bar geom with the "bin" stat and the "stack" position.
The `geom_` functions in `ggplot2` take advantage of these common combinations. Like `layer()`, each geom function builds a layer, but the geom functions preset the geom, stat, and position values of the layer to useful defaults. The geom becomes the geom that appears in the function name. The stat and position become the stat and position most commonly associated with the geom.
The `geom_` functions in `ggplot2` take advantage of these common combinations. Like `layer()`, each geom function builds a layer, but the geom functions preset the geom, stat, and position values of the layer to useful defaults. The geom that appears in the function name becomes the geom of the layer. The stat and position most commonly associated with the geom become the default stat and position of the layer.
`ggplot2` even provides geom functions for less common, but still useful combinations of geoms, stats, and positions. For example, the function `geom_jitter()` builds a layer that has a point geom, an "identity" stat and a "jitter" position. The function `geom_smooth()` builds two layers: a ribbon layer that is combined with a line layer as in the plot above. Together these layers display a model line with its standard error band.
`ggplot2` even provides geom functions for less common, but still useful combinations of geoms, stats, and positions. For example, the function `geom_jitter()` builds a layer that has a point geom, an "identity" stat, and a "jitter" position. The function `geom_smooth()` builds a "layer" that is made of two sub-layers: a line layer that displays a model line and a ribbon layer that displays a standard error band.
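As a rough illustration of the `geom_jitter()` shortcut, the two plots sketched below should describe the same kind of layer: a point geom with a "jitter" position. The object names `p_shortcut` and `p_explicit` are arbitrary. Because jittering is random, the drawn points will differ slightly each time a plot is printed.

```{r}
library(ggplot2)

# Shortcut: geom_jitter() presets the position adjustment to "jitter"
p_shortcut <- ggplot(data = mpg) +
  geom_jitter(mapping = aes(x = displ, y = hwy))

# Explicit: geom_point() with the position adjustment spelled out
p_explicit <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
```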
The result is a more direct syntax for making plots, one that you are already familiar with from Section 1.
As a result, `geom_` functions provide a more direct syntax for making plots, one that you are already familiar with from Section 1.
```{r message = FALSE}
ggplot() +
@ -555,9 +567,9 @@ ggplot() +
As with `layer()`, you can add multiple geom functions to a single plot call.
This system lets you build sophisticated graphs geom by geom, but it also makes it possible to write repetitive code. For example, the code above repeats the arguments `data = mpg, mapping = aes(x = displ, y = hwy)`. Repetition makes your code harder to read and write, and it also increases the chance of errors and typos.
This system lets you build sophisticated graphs geom by geom, but it also makes it possible to write repetitive code. For example, the code above repeats the arguments `data = mpg` and `mapping = aes(x = displ, y = hwy)`. Repetition makes your code harder to read and write, and it also increases the chance of typos and errors.
You can avoid repetition by passing `ggplot()` a set of global mappings to apply to each layer. For example, we can eliminate the duplication of `mapping = aes(x = displ, y = hwy)` in our previous code with a global mapping argument:
You can avoid repetition by passing the repeated mappings to `ggplot()`. `ggplot2` will treat mappings that appear in `ggplot()` as global mappings to be applied to each layer. For example, we can eliminate the duplication of `mapping = aes(x = displ, y = hwy)` in our previous code with a global mapping argument:
```{r, eval = FALSE}
ggplot(mapping = aes(x = displ, y = hwy)) +
@ -577,9 +589,9 @@ ggplot(mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg)
```
For example, the smooth line above is a single line with a single color. This does not occur if you add the color aesthetic to the global mappings. Smooth will draw a different colored line for each class of cars.
This system lets us overlay a single smooth line on a set of colored points. Notice that this would not occur if you added the color aesthetic to the global mappings. In that case, `geom_smooth()` would use the color mapping to draw a different colored line for each class of cars.
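The contrast can be sketched by building both versions side by side. In the first plot the color mapping is local to the point layer. In the second it is part of the global mappings, so the smooth layer inherits it too. The object names `p_local` and `p_global` are arbitrary, and the code follows the same pattern as the example above.

```{r}
library(ggplot2)

# Color is mapped only in the point layer: one smooth line overall
p_local <- ggplot(mapping = aes(x = displ, y = hwy)) +
  geom_point(data = mpg, mapping = aes(color = class)) +
  geom_smooth(data = mpg)

# Color is mapped globally: every layer, including the smooth, is split by class
p_global <- ggplot(mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point(data = mpg) +
  geom_smooth(data = mpg)
```

Printing `p_local` and then `p_global` shows the difference between a mapping that one layer uses and a mapping that every layer inherits.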
You can use the same system to specify a global data set for every layer.
You can use the same system to specify a global data set for every layer. In other words,
```{r, eval = FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
@ -595,7 +607,7 @@ ggplot(mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg)
```
To apply the smooth line to a subset of your data, pass it its own data argument, here the subset of cars that have eight cylinders.
As with mappings, you can define a local data argument to override the global data argument on a layer by layer basis.
```{r, message = FALSE, warning = FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
@ -603,17 +615,155 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = subset(mpg, cyl == 8))
```
### Recap
Your understanding of the `ggplot2` syntax is now complete. You understand the grammar written into the syntax, and you know how to extend the syntax by adding extra layers to your plot, as well as how to truncate the syntax by relying on `ggplot2`'s default settings.
Only one thing remains. You need to learn the vocabulary of function names and argument options that you can use with your code template.
Section 3 will guide you through these functions and arguments. It catalogues all of the options that `ggplot2` puts at your fingertips for geoms, mappings, stats, position adjustments, and coordinate systems.
## The Vocabulary of Graphics
### Aesthetics
`ggplot2` comes with 37 geom functions, 22 stats, eight coordinate systems, six position adjustments, two facetting schemes, and an uncounted number of aesthetics to map. Each of these components introduces new decisions for you to make and new dilemmas for you to consider.
# Aesthetics
Tackling these details can seem overwhelming to a new student, but you are ready. You understand the big picture that these details fit into, and you know how to make your own graphs, so you can try things out as you go.
This section will explain all of the options that you can use to make graphs with `ggplot2`. Read through this section once, then return to it as a reference guide when you need it.
### Geoms
The geom of a plot is the geometric object that the plot uses to represent its data. People often describe plots by the type of geom that they use. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on.
`ggplot2` provides 37 `geom_` functions that you can use to visualize your data. Each geom is particularly well suited for displaying a certain type of data or a certain type of relationship.
This section organizes geoms according to these relationships. It describes each geom and lists the aesthetics to use with the geom.
Throughout the section we will rely on an important distinction between two types of variables:
* A variable is **continuous** if you can arrange its values in order _and_ an infinite number of values exists between any two values of the variable.
Numbers and date-times are examples of continuous variables. `ggplot2` will treat your variable as continuous if it is a numeric, integer, or a recognizable date-time class (but not a factor, see `?factor`).
* A variable is **discrete** if it is not continuous. Discrete variables can only contain a finite (or countably infinite) set of unique values.
Character strings and boolean values are examples of discrete data. `ggplot2` will treat your variable as discrete if it is not a numeric, integer, or recognizable date-time structure.
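One way to check how a variable will be treated is to inspect its class. The sketch below assumes the `mpg` data set that ships with `ggplot2`; the choice of columns is only for illustration.

```{r}
library(ggplot2)

class(mpg$displ)  # a numeric column, so it is treated as continuous
class(mpg$cyl)    # an integer column, so it is treated as continuous
class(mpg$class)  # a character column, so it is treated as discrete

# Wrapping a numeric or integer variable in factor() turns it into
# a factor, which is treated as discrete
class(factor(mpg$cyl))
```

This is useful when a numeric code, such as the number of cylinders, really describes a small set of categories.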
#### Visualizing Distributions
Recall that a variable is a quantity, quality, or property whose value can change between measurements.
This unique property---that the values of a variable can vary---gives the word "variable" its name. It also motivates all of data science. Scientists attempt to predict the value of variables and to understand what determines what those values will be.
One of the most useful tools in this quest is the set of values themselves. As you collect more data, the values of a variable will reveal which states of the variable are common, which are rare, and which are seemingly impossible. The pattern of the values that emerges is known as the variable's _distribution_.
##### Discrete distributions
To visualize the distribution of a discrete variable, count how many observations are associated with each value of the variable. You can compute these numbers quickly with R's `table()` function, but the easiest way to visualize the results is with `geom_bar()`.
```{r}
table(diamonds$cut)
```
##### `geom_bar()`
`geom_bar()` counts the number of observations that are associated with each value of a variable and displays the results as bars. The height of each bar reveals the count of observations that are associated with the x value of the bar.
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
```
Useful aesthetics for `geom_bar()` are:
* x (required)
* alpha
* color
* fill
* linetype
* size
* weight
Useful position adjustments for `geom_bar()` are:
* "stack" (default)
* "dodge"
* "fill"
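As a minimal sketch of how an aesthetic and a position adjustment combine, the code below maps `fill` to `clarity` and compares two of the adjustments listed above. It assumes the `diamonds` data set that ships with `ggplot2`; the object names `p_stack` and `p_dodge` are hypothetical.

```{r}
library(ggplot2)

# Default "stack": the clarity counts are stacked within each cut
p_stack <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity))

# "dodge": the clarity counts are placed side by side within each cut
p_dodge <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
```

Stacking makes the total count per cut easy to read, while dodging makes it easier to compare individual clarity groups.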
##### Continuous distributions
Plotting the distribution of a continuous variable is trickier than plotting the distribution of a discrete variable.
To reveal the distribution, you must first _bin_ the range of the variable, which means to divide the range into equally spaced intervals.
`r bookdown::embed_png("images/blank.png", dpi = 150)`
You can then count the number of observations that fall into each bin.
`r bookdown::embed_png("images/blank.png", dpi = 150)`
Finally, you can display the counts as bars, or as some other object.
`r bookdown::embed_png("images/blank.png", dpi = 150)`
This method is temperamental because the appearance of the distribution can change dramatically if the bin size changes. As no bin size is "correct," you should explore several bin sizes when examining data.
Several geoms exist to help you, and they almost all use the "bin" stat to implement the above strategy. For each of these geoms, you can set the following arguments for "bin" to use:
* binwidth - the width to use for the bins in the same units as the x variable
* origin - origin of the first bin interval
* right - if `TRUE`, bins will be right closed (i.e., points that fall on the border of two bins will be counted with the bin to the left)
* breaks - a vector of actual bin breaks to use. If you set the breaks argument, it will override the binwidth and origin arguments.
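To see how much the bin size matters, you can draw the same variable with two different values of `binwidth`. The sketch below uses `geom_histogram()`, one of the geoms that relies on the "bin" stat, and assumes the `diamonds` data set that ships with `ggplot2`. The two widths and the object names are arbitrary choices for illustration.

```{r}
library(ggplot2)

# Narrow bins (0.01 carat) show fine structure in the distribution
p_narrow <- ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.01)

# Wide bins (0.5 carat) smooth that structure away
p_wide <- ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
```

Comparing `p_narrow` and `p_wide` side by side is a quick way to judge whether a feature of the distribution depends on the bin size you chose.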
###### Histograms
###### Freqpoly
###### Dotplots
###### Density
###### Boxplots
##### Bivariate Distributions
###### bin2d
###### hex
###### density2d
###### rug
#### Visualizing Relationships
##### Discrete x, discrete y
###### Jitter
##### Discrete x, continuous y
###### Bar Charts
###### Boxplots
###### Dotplots
###### Violin plots
###### crossbar
###### errorbar
###### linerange
###### point range
##### Continuous x, continuous y
###### Points
###### Text
###### Jitter
###### Smooth
###### Quantile
##### Functions
###### line
###### area
###### step
##### Discrete x, discrete y, continuous z
###### raster
###### tile
##### Continuous x, continuous y, continuous z
###### contour
##### Maps
### Mappings
Have you experimented with aesthetics? Great! Here are some things that you may have noticed.
#### Continuous data
A continuous variable can contain an infinite number of values that can be put in order, like numbers or date-times. If your variable is continuous, `ggplot2` will treat it in a special way. `ggplot2` will
A continuous variable can contain an infinite number of values that can be put in order, like numbers or date-times. `ggplot2` will treat your variable as continuous if it is a numeric, integer, or a recognizable date-time structure (but not a factor, see `?factor`).
If your variable is continuous, `ggplot2` will treat it in a special way. `ggplot2` will
* use a gradient of colors from dark blue to light blue for the color aesthetic
* display a colorbar in the legend for the color aesthetic
@ -749,90 +899,8 @@ ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
```
### Geoms
You can add new data to your scatterplot with aesthetics and facets, but how can you summarize the data that is already there, for example with a trend line?
You can add summary information to your scatterplot with a geom. To understand geoms, ask yourself: how are these two plots similar?
```{r echo = FALSE, message = FALSE, fig.show='hold', fig.width=4, fig.height=4}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
```
Both plots contain the same:
* x variable
* y variable
* underlying data set
But the plots are not identical. Each uses a different _geom_, or geometrical object, to represent the data. The first plot uses a set of points to represent the data. The second plot uses a single, smoothed line.
To create the second plot, replace `geom_point()` in our template code...
```{r eval=FALSE}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
```
...with `geom_smooth()`,
```{r eval=FALSE, message = FALSE}
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
```
`ggplot2` comes with 37 `geom_` functions that you can use to visualize your data. Each function will represent the data with a different type of geom, like a bar, a line, a boxplot, a histogram, etc. You select the type of plot you wish to make by calling the `geom_` function that draws the geom you have in mind.
Each `geom_` function takes a `mapping` argument. However, the aesthetics that you pass to the argument will change from geom to geom. For example, you can set the shape of points, but it would not make sense to set the shape of a line.
To see which aesthetics your geom uses, visit its help page. To see a list of all available geoms, open the `ggplot2` package help page with `help(package = ggplot2)`.
#### Graphical primitives
#### Visualizing Distributions
##### Discrete distributions
###### Bar Charts
##### Continuous distributions
###### Histograms
###### Dotplots
###### Freqpoly
###### Density
###### Boxplots
##### Bivariate Distributions
###### bin2d
###### hex
###### density2d
###### rug
#### Visualizing Relationships
##### Discrete x, discrete y
###### Jitter
##### Discrete x, continuous y
###### Bar Charts
###### Boxplots
###### Dotplots
###### Violin plots
###### crossbar
###### errorbar
###### linerange
###### point range
##### Continuous x, continuous y
###### Points
###### Text
###### Jitter
###### Smooth
###### Quantile
##### Functions
###### line
###### area
###### step
##### Discrete x, discrete y, continuous z
###### raster
###### tile
##### Continuous x, continuous y, continuous z
###### contour
##### Maps
### Stats
@ -1254,6 +1322,8 @@ The results of `facet_wrap()` can be easier to study than the results of `facet_
#### Size
#### Shape
### Themes
### Zoom
### Saving plots
## Summary