More work on visualization.

Garrett 2015-11-11 17:29:55 -05:00
parent 1931343cf3
commit ce92742614
1 changed files with 356 additions and 161 deletions

@ -39,9 +39,9 @@ In *Section 1*, you will learn how to make scatterplots, the most popular type o
*Section 3* draws on examples in the first two sections to teach the _grammar of graphics_, a versatile system for describing---and building---any plot.
*Section 4* describes the best practices and functions for visualizing distributions of values.
*Section 4* describes the best ways to visualize distributions of values.
*Section 5* teaches the best practices and functions for visualizing relationships between variables.
*Section 5* teaches the best ways to visualize relationships between variables.
*Section 6* shows how to use `ggplot2` to create maps.
@ -74,13 +74,13 @@ To learn more about `mpg`, open its help page with the command `?mpg`.
***
*Tip*: If you have trouble loading `mpg`, its help page, or any of the functions in this chapter, you may need to load the `ggplot2` library with the command
*Tip*: If you have trouble loading `mpg`, its help page, or any of the functions in this chapter, you may need to load the `ggplot2` package with the command
```{r eval=FALSE}
library(ggplot2)
```
You will need to reload the library each time you start a new R session.
You will need to reload the package each time you start a new R session.
***
@ -108,9 +108,11 @@ ggplot(data = mpg) +
With `ggplot2`, you begin every plot with the function `ggplot()`. `ggplot()` doesn't create a plot by itself; instead it initializes a new plot that you can add layers to.
The first argument of `ggplot()` is the data set that you would like to use in your graph. So `ggplot(mpg)` initializes a graph that will use the `mpg` data set.
The first argument of `ggplot()` is the data set to use in the graph. So `ggplot(data = mpg)` initializes a graph that will use the `mpg` data set.
To complete a graph, you add one or more layers to `ggplot()`. The function `geom_point()` adds a layer of points to the plot, which creates a scatterplot. The mapping argument explains where those points should go. You must always set mapping to a call to `aes()`. The `x` and `y` arguments of `aes()` explain which variables to map to the x and y axes of the graph.
You complete your graph by adding one or more layers to `ggplot()`. Here, the function `geom_point()` adds a layer of points to the plot, which creates a scatterplot. `ggplot2` comes with 37 different `geom_` functions that you can use with `ggplot()`. Each function creates a different type of layer, and each function takes a mapping argument.
The mapping argument explains where your points should go. You must always set mapping to a call to `aes()`. The `x` and `y` arguments of `aes()` explain which variables to map to the x and y axes of the graph. `ggplot()` will look for those variables in your data set, `mpg`.
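As a minimal sketch of the pieces described above (assuming the `ggplot2` package is installed and using its bundled `mpg` data set), the call below names each argument explicitly. The object name `p` is only for illustration.

```{r}
library(ggplot2)

# ggplot() initializes the plot and names the data set;
# geom_point() adds a layer of points;
# aes() maps displ to the x axis and hwy to the y axis.
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

p  # printing the object draws the scatterplot
```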
You can use this code as a template to make any graph with `ggplot2`. To make a graph, replace the bracketed sections in the code below with a new data set, a new geom, or a new set of mappings. You can also add functions and arguments to the template that do not appear here.
@ -129,11 +131,11 @@ Let's hypothesize that the cars are hybrids. One way to test this hypothesis is
There are two ways to add a third value, like `class`, to a two dimensional scatterplot. You can map the value to a new _aesthetic_ or you can divide the plot into _facets_.
An aesthetic is a visual property of the points in your plot. Aesthetics include things like the size, the shape, or the color of your points.
An aesthetic is a visual property of the points in your plot. Aesthetics include things like the size, the shape, or the color of your points. You can display a point (like the one below) in different ways by changing its aesthetic properties.
`r bookdown::embed_png("images/visualization-2.png", dpi = 150)`
You can convey information by mapping the aesthetics in your plot to the variables in your data set. For example, we can map the colors of the points to the `class` variable. Then the color of each point will reveal its class affiliation.
You can convey information about your data by mapping the aesthetics in your plot to the variables in your data set. For example, we can map the colors of our points to the `class` variable. Then the color of each point will reveal its class affiliation.
To map an aesthetic to a variable, set the name of the aesthetic to the name of the variable, and do this _in your plot's `aes()` call_:
@ -144,37 +146,40 @@ ggplot(data = mpg) +
`ggplot2` will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable. `ggplot2` will also add a legend that explains which levels correspond to which values.
You can now see that most of the unusual points are two seater cars. This doesn't sound like a hybrid. In fact, it sounds like a sports car---and that's what the points are. Sports cars have the same size engines as suvs and pickup trucks. However, sports cars have much smaller bodies than suvs and pickup trucks, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have such large engines.
The colors reveal that many of the unusual points are two-seater cars. These don't sound like hybrids. In fact, they sound like sports cars---and that's what the points are. Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have such large engines.
Color is one of the most popular aesthetics to use in a scatterplot, but we could have mapped the size aesthetic to `class` in the same way. In this case, the exact size of the point reveals its class affiliation.
Color is one of the most popular aesthetics to use in a scatterplot, but we could have mapped `class` to the size aesthetic in the same way. In this case, the exact size of each point reveals its class affiliation.
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))
```
Or we could have mapped the _alpha_ of the points to the `class` variable. The alpha is the transparency of the points. Now the transparency of each point corresponds with its class affiliation.
Or we could have mapped `class` to the _alpha_, or transparency, of the points. Now the transparency of each point corresponds with its class affiliation.
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
```
We also could have mapped the shape of the points to the `class` variable.
We also could have mapped `class` to the shape of the points.
```{r warning=FALSE}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
```
In each case above, we set the name of the aesthetic to the variable to display and we do this within the `aes()` function. The syntax highlights a useful insight since we also set `x` and `y` to variables within `aes()`: the x location and the y location of a point are also aesthetics, visual properties that we can map to variables.
In each case, we set the name of the aesthetic to the variable to display and we do this within the `aes()` function. The syntax highlights a useful insight because we also set `x` and `y` to variables within `aes()`: the x location and the y location of a point are also aesthetics, visual properties that you can map to variables to display information about the data.
Once you set an aesthetic, `ggplot2` takes care of the rest. It selects a pleasing set of values to use for the aesthetic and it constructs a legend that explains the mapping. For x and y aesthetics, `ggplot2` does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts in the same way as a legend; it explains the mapping between locations and values.
Now that you know how to use aesthetics, take a moment to experiment with the `mpg` data set.
* Attempt to match different types of variables to different types of aesthetics.
+ Continuous variables in `mpg`: `displ`, `year`, `cyl`, `cty`, `hwy`
+ Discrete variables in `mpg`: `manufacturer`, `model`, `trans`, `drv`, `fl`, `class`
* Attempt to use more than one aesthetic at a time.
* Attempt to set an aesthetic to something other than a variable name, like `hwy / 2`.
See the help page for `geom_point()` (`?geom_point`) to learn which aesthetics are available to use in a scatterplot. See the help page for the `mpg` data set (`?mpg`) to learn which variables are in the data set.
@ -182,27 +187,27 @@ Have you experimented with aesthetics? Great! Here are some things that you may
#### Continuous data
A continuous variable can contain an infinite number of values that can be put in order, like numbers or date times. If your variable is continuous, `ggplot2` will treat it differently than a discrete variable. `ggplot2` will
A continuous variable can contain an infinite number of values that can be put in order, like numbers or date-times. If your variable is continuous, `ggplot2` will treat it in a special way. `ggplot2` will
* use a gradient of colors from blue to black for the color aesthetic.
* display a colorbar in the legend for the color aesthetic.
* not use the shape aesthetic.
`ggplot2` will not use the shape aesthetic to display continuous information. Why? Because the human eye cannot easily interpolate between shapes. Can you tell whether a shape is three quarters of the way between a triangle and a circle? How about five eighths of the way?
`ggplot2` will not use the shape aesthetic to display continuous information. Why? Because the human eye cannot easily interpolate between shapes. Can you tell whether a shape is three-quarters of the way between a triangle and a circle? How about five-eighths of the way?
`ggplot2` will treat your variable as continuous if it is a numeric, integer, or a recognizable date time structure (but not a factor, see `?factor`).
`ggplot2` will treat your variable as continuous if it is a numeric, integer, or a recognizable date-time structure (but not a factor, see `?factor`).
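To check how a variable will be treated, you can inspect its class. The sketch below is an illustration that assumes `ggplot2` is installed; it maps `cyl`, which `mpg` stores as an integer, to color twice: once as is, and once wrapped in `factor()` so that it is treated as discrete.

```{r}
library(ggplot2)

class(mpg$cyl)    # integer, so ggplot2 treats cyl as continuous
class(mpg$class)  # character, so ggplot2 treats class as discrete

# Continuous: a color gradient with a colorbar legend
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = cyl))

# Discrete: factor() converts cyl, so each value gets its own color
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = factor(cyl)))
```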
#### Discrete data
A discrete variable can takes a finite (or countably infinite) set of values. Character strings and boolean values are examples of discrete data. `ggplot2` will treat your variable as discrete if it is not a numeric, integer, or recognizable date time structure.
A discrete variable can only contain a finite (or countably infinite) set of values. Character strings and boolean values are examples of discrete data. `ggplot2` will treat your variable as discrete if it is not a numeric, integer, or recognizable date-time structure.
If your data is discrete, `ggplot2` will:
* `ggplot2` will use a set of colors that span the hues of the rainbow. The exact colors will depend on how many hues appear in your graph. `ggplot2` selects the colors in a way that ensures that one color does not visually dominate the others.
* use a set of colors that span the hues of the rainbow. The exact colors will depend on how many hues appear in your graph. `ggplot2` selects the colors in a way that ensures that one color does not visually dominate the others.
* use equally spaced values of size and alpha
* display up to six shapes for the shape aesthetic.
If your data requires more than six unique shapes, `ggplot2` will print an error message and only display the first six shapes. You may have noticed this in the graph above (and below), `ggplot2` did not display the suv values, which were the seventh unique class.
If your data requires more than six unique shapes, `ggplot2` will print a warning message and only display the first six shapes. You may have noticed this in the graph above (and below), `ggplot2` did not display the suv values, which were the seventh unique class.
```{r}
ggplot(data = mpg) +
@ -213,7 +218,7 @@ See _Section 7_ to learn how to pick your own colors, shapes, sizes, etc. for `g
#### Multiple aesthetics
You can use more than one aesthetic at a time. `ggplot2` will combine aesthetic legeneds where possible.
You can use more than one aesthetic at a time. `ggplot2` will combine aesthetic legends when possible.
```{r}
ggplot(data = mpg) +
@ -268,10 +273,10 @@ We could even divide our data into subgroups based on the combination of two var
#### `facet_grid()`
The graphs below show what a faceted graph looks like. They also show how you can build a faceted graph with `facet_grid()`. I'm not going to tell you how facet grid works---well at least not yet---because that would be too easy. Instead, I would like you to try to induce the syntax of `facet_grid()` from the code below.
The graphs below show what a faceted graph looks like. They also show how you can build a faceted graph with `facet_grid()`. I'm not going to tell you how `facet_grid()` works---well at least not yet. That would be too easy. Instead, I would like you to try to induce the syntax of `facet_grid()` from the code below. Consider:
* What variables determines the graph is split into rows?
* What variables determines the graph is split into columns?
* Which variables determine how the graph is split into rows?
* Which variables determine how the graph is split into columns?
* What parts of the syntax always stay the same?
* And what does the `.` do?
@ -283,10 +288,10 @@ ggplot(data = mpg) +
facet_grid(drv ~ cyl)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)
facet_grid(drv ~ .)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)
facet_grid(. ~ cyl)
```
Ready for the answers?
@ -301,13 +306,13 @@ This syntax mirrors the rows first, columns second convention of R.
If you prefer to facet your plot on only one dimension, add a `.` to your formula as a placeholder. If you place a `.` before the `~`, `facet_grid()` will not facet on the rows dimension. If you place a `.` after the `~`, `facet_grid()` will not facet on the columns dimension.
Facets let you quickly compare subgroups by glancing down rows and across columns. Each facet will use the same x and y limits, but you can change this behavior across rows or columns by adding a scales argument. Set scales to one of
Facets let you quickly compare subgroups by glancing down rows and across columns. Each facet will use the same limits on the x and y axes, but you can change this behavior across rows or columns by adding a scales argument. Set scales to one of
* `"free_y"` - to let y limits vary across rows
* `"free_x"` - to let x limits vary across columns
* `"free"` - to let both x and y limits vary
For example, the code below lets x limits vary across columns.
For example, the code below lets the limits of the x axes vary across columns.
```{r}
ggplot(data = mpg) +
@ -318,21 +323,30 @@ ggplot(data = mpg) +
#### `facet_wrap()`
`facet_wrap()` provides a pleasant way to facet a plot across a single variable with many values. The easiest way to understand `facet_wrap()` is to compare the output of `facet_grid()` and `facet_wrap()`.
What if you want to facet on a variable that has too many values to display nicely?
For example, if we facet on `class`, `ggplot2` must display narrow subplots to fit each subplot into the same row. This makes it difficult to compare x values with precision.
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ class)
```
`facet_wrap()` provides a more pleasant way to facet a plot across many values. It wraps the subplots into a multi-row, roughly square result.
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class)
```
`facet_wrap()` wraps the facets into a multi-row, roughly square result. if your facetting variable has many values, the results of `facet_wrap()` will be easier to study than the results of `facet_grid()`. However, `facet_wrap()` can only facet by one variable at a time.
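The results of `facet_wrap()` can be easier to study than the results of `facet_grid()`. However, `facet_wrap()` lays its subplots out in a single wrapped sequence, so it cannot place one variable along the rows and another along the columns the way `facet_grid()` can.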
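If the default layout is not what you want, `facet_wrap()` also accepts `nrow` and `ncol` arguments. A minimal sketch, assuming `ggplot2` is installed (the choice of two rows and the name `p_wrap` are arbitrary):

```{r}
library(ggplot2)

# Wrap the seven class subplots into two rows instead of the default layout
p_wrap <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)

p_wrap
```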
### Geoms
You can add new data to your scatterplot with aesthetics and facets, but how can you summarize the data that is already there, for example with a trend line?
You can add summary information to your scatterplot with a geom. To understand geoms, ask yourself: how are these two plots similar?
```{r echo = FALSE, message = FALSE, fig.show='hold', fig.width=4, fig.height=4}
@ -365,7 +379,7 @@ ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
```
`ggplot2` comes with 37 `geom_` functions that you can use to to visualize your data. Each function will represent the data with a different type of geom, like a bar, a line, a boxplot, a histogram, etc. You select the type of plot you wish to make by calling the geom_function that draws the geom you have in mind.
`ggplot2` comes with 37 `geom_` functions that you can use to visualize your data. Each function will represent the data with a different type of geom, like a bar, a line, a boxplot, a histogram, etc. You select the type of plot you wish to make by calling the `geom_` function that draws the geom you have in mind.
Each `geom_` function takes a `mapping` argument. However, the aesthetics that you pass to the argument will change from geom to geom. For example, you can set the shape of points, but it would not make sense to set the shape of a line.
@ -373,7 +387,7 @@ To see which aesthetics your geom uses, visit its help page. To see a list of al
#### Group aesthetic
The _group_ aesthetic is a useful way to apply a monolithic geom, like a smooth, to multiple subgroups.
The _group_ aesthetic is a useful way to apply a monolithic geom, like a smooth line, to multiple subgroups.
By default, `geom_smooth()` draws a single smoothed line for the entire data set. To draw a separate line for each group of points, set the group aesthetic to a grouping variable or expression.
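For instance, the sketch below groups the smooth by drive train. `drv` is an arbitrary choice of grouping variable, `p_grouped` is an illustrative name, and `ggplot2` is assumed to be installed.

```{r, message = FALSE}
library(ggplot2)

# One smooth line for the whole data set (the default)
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

# One smooth line for each value of drv
p_grouped <- ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))

p_grouped
```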
@ -391,9 +405,7 @@ ggplot(data = mpg) +
#### Multiple geoms
How can you use a geom to add summary information to your scatterplot?
You can adde multiple geoms to the same plot by adding multiple `geom_` functions to the plot call. For example, you can add the smooth geom to your existing scatterplot.
You can add a smooth line to your raw data by adding `geom_smooth()` to your original plot call.
```{r, message = FALSE}
ggplot(data = mpg) +
@ -401,13 +413,11 @@ ggplot(data = mpg) +
geom_smooth(mapping= aes(x = displ, y = hwy))
```
`ggplot2` will place each new geom on top of the preceeding geom. This system lets you build sophisticated graphs geom by geom.
This system lets you build sophisticated graphs geom by geom. You can add as many geoms as you like to a single plot call. `ggplot2` will place each new geom on top of the preceding geom.
#### Global and local mappings
Our new code calls `mapping = aes(x = displ, y = hwy)` twice. This is unwise because repetition increases the chance of a typo and makes your code harder to read and write.
To avoid repetition, pass the set of repeated mappings to `ggplot()`. `ggplot2` will treat these mappings as global mappings and apply them to each geom in your graph. You can then remove the mapping arguments in the individual layers.
You can create a set of global mappings that apply to all geoms by passing a mapping argument to `ggplot()`. For example, we can eliminate the duplication of `mapping = aes(x = displ, y = hwy)` in our previous code by using global mappings:
```{r, message = FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
@ -430,9 +440,9 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
The smooth line above is a single line with a single color. This does not occur if you add the color aesthetic to the global mappings. Smooth will draw a different colored line for each class of cars.
```{r, message = FALSE, warning = FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
geom_smooth(aes(y = cty))
```
#### Global and local data sets
@ -453,7 +463,7 @@ ggplot(mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg)
```
To apply the smooth line to a subset of the data, pass it its own data argument, here the subset of eight cylinder cars.
To apply the smooth line to a subset of your data, pass it its own data argument, here the subset of cars that have eight cylinders.
```{r, message = FALSE, warning = FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
@ -481,14 +491,14 @@ ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
```
It may be tempting to call the color aesthetic, but for bars and similar geoms the color aesthetic controls the _outline_ of the geom, e.g.
It may be tempting to call the color aesthetic, but for bars and other large geoms the color aesthetic controls the _outline_ of the geom, e.g.
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, color = cut))
```
What an ...interesting effect. Sort of psychedelic! But not what we had in mind.
The effect is interesting, sort of psychedelic, but not what we had in mind.
To control the interior fill of a bar, polygon, histogram, boxplot, or other geom with mass, you must call the _fill_ aesthetic.
@ -504,134 +514,95 @@ ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))
```
### Position
Ready for another riddle?
How could you make the chart below? Hint: given what you know now, you can't. So don't spend _too_ long trying.
```{r echo = FALSE}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
```
This plot displays the same information as the stacked bar chart above. Both charts show 40 color coded bars. Each bar represents a combination of `cut` and `clarity`.
However, the position of the bars within the two charts differs. In the stacked bar chart, `ggplot2` stacked bars with the same `cut` on top of one another. In the second plot, `ggplot2` placed bars with the same `cut` beside each other.
You can control this behavior by adding a _position adjustment_ to your call. A position adjustment tells `ggplot2` what to do when two or more objects overlap.
To set a position adjustment, set the `position` argument of your geom function to one of `"identity"`, `"stack"`, `"dodge"`, `"fill"`, or `"jitter"`.
#### Position = "identity"
For many geoms, the default position value is "identity". When `position = "identity"`, `ggplot2` will place each object exactly where it falls in the context of the graph.
This would make little sense for our bar chart. Each bar would start at `y = 0` and be placed directly above the `cut` value that it describes. Since there are eight bars for each `cut` value (one for each level of `clarity`), many bars will overlap. The plot will look suspiciously like a stacked bar chart, but the stacked heights will be inaccurate, as each bar actually extends to `y = 0`.
To see how such a graph would appear, set `position = "identity"`.
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "identity") +
ggtitle('Position = "identity"')
```
#### Position = "stack"
To avoid confusion, `ggplot2` uses a default "stack" position adjustment for bar charts. When `position = "stack"`, `ggplot2` places overlapping objects directly _above_ one another.
Here each bar begins exactly where the bar below it ends.
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "stack") +
ggtitle('Position = "stack"')
```
#### Position = "dodge"
When `position = "dodge"`, `ggplot2` places overlapping objects directly _beside_ one another. This is how I created the graph at the start of the section.
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge") +
ggtitle('Position = "dodge"')
```
#### Position = "fill"
When `position = "fill"`, `ggplot2` uses all of the available space to display overlapping objects. Within that space, it scales each in proportion to the other objects. `position = "fill"` is the most unusual of the position adjustments, but it creates an easy way to compare relative frequencies across groups.
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill") +
ggtitle('Position = "fill"')
```
#### Position = "jitter"
The last type of position doesn't make sense for bar charts, but it is very useful for scatterplots. Recall our first scatterplot.
Why does the plot appear to only display 126 points? There are 234 observations in the data set. Also, why do the points appear to be arranged on a grid?
```{r echo = FALSE}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
```
The points appear in a grid because the `hwy` and `displ` measurements were rounded to the nearest integer and tenths values. As a result, many points overlap each other because they've been rounded to the same values of `hwy` and `displ`. This also explains why our graph appears to contain only 126 points. 108 points are hidden on top of other points located at the same value.
This arrangement can cause problems because it makes it hard to see where the mass of the data is. Is there one special combination of `hwy` and `displ` that contains 109 values? Or are the data points more or less equally spread throughout the graph?
You can avoid this overplotting problem by setting the position adjustment to "jitter". `position = "jitter"` adds a small amount of random noise to each point, as we see below. This spreads the points out because no two points are likely to receive the same amount of random noise.
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
```
But isn't this, you know, bad? It *is* true that jittering your data will make it less accurate at the local level, but jittering may make your data _more_ accurate at the global level. By jittering your data, you can see where the mass of your data falls on an overplotted grid. Occasionally, jittering will reveal a pattern that was hidden within the grid.
`position = "jitter"` is shorthand for `position = position_jitter()`. This is true for the other values of position as well (e.g., `position_identity()`, `position_dodge()`, `position_fill()`, and `position_stack()`). The expanded syntax lets you specify details of the adjustment process, and also provides a way to open a help page for each process (which you will need to do if you wish to learn more).
```{r eval=FALSE}
?position_jitter
```
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy),
position = position_jitter(width = 0.03, height = 0.3))
```
Bar charts are interesting because they reveal something subtle about common types of plots.
### Stats
How does `ggplot2` know where to place the line in our smooth plot?
Consider our basic bar chart.
```{r echo = FALSE, message = FALSE, fig.show='hold', fig.width=4, fig.height=4}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
```
The y values of the line do not appear in our data set, nor did we give the y values to `ggplot2`. `ggplot2` calculated the y values by applying an algorithm to the data. In this case, `ggplot2` applied a smoothing algorithm to the data.
On the x axis it displays `cut`, a variable in the `diamonds` data set. On the y axis, it displays count. But where does count come from? Count is not a variable in the diamonds data set:
Many types of graphs plot information that does not appear in the raw data. To do this, the graph first applies an algorithm to the raw data and then plots the results. For example, a boxplot calculates the first, second, and third quartiles of a data set and then plots those summary statistics (among others). A histogram bins the raw data and then counts how many points fall into each bin. It plots those counts on the y axis.
```{r}
head(diamonds)
```
`ggplot2` calls these algorithms _stats_, which is short for statistical transformation. Stats are handled automatically in `ggplot2`. Not every geom uses a stat; but when one does, `ggplot2` will apply the stat in the background.
And we didn't tell `ggplot2` in our plot call where to find count values.
You can fine tune how a geom implements a stat by passing the geom parameters for the stat to use. To discover which stat a geom uses, visit the geom's help page.
```{r eval = FALSE}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
```
For example, the `?geom_smooth` help page shows that `geom_smooth()` uses the `stat_smooth()` stat by default. If you then open the `?stat_smooth` help page, you will see that `stat_smooth()` takes the arguments `method` and `se` among others. With `ggplot2`, you can supply arguments to the stat called by a geom, by passing the arguments as parameters to the geom.
What is going on here?
***
Some plots, like scatterplots, plot the raw values in your data set. Other types of graphs, like bar charts and smooth lines, do not plot raw values at all. These graphs apply an algorithm to the data and then plot the results of the algorithm. Consider how many graphs do this.
In general practice, you do not need to worry much about stats. Usually one geom will be closely associated with one stat, and `ggplot2` will implement the stat by default. However, stats are an integral part of the `ggplot2` package that you are welcome to modify. To learn more about `ggplot2`'s stat system, see [ggplot2: Elegant Graphics for Data Analysis](http://www.amazon.com/dp/0387981403/ref=cm_sw_su_dp?tag=ggplot2-20).
* **bar charts** and **histograms** bin the data and then plot bin counts
* **smooth lines** apply a model to the data and then plot the model line
* **boxplots** calculate the first, second, and third quartiles of a data set and then plot those summary statistics (among others)
* and so on.
`ggplot2` calls these algorithms _stats_, which is short for statistical transformation. Each geom in `ggplot2` is paired with a stat; although for some geoms, like points, the stat is the identity transformation, i.e. no transformation.
When you use a geom, `ggplot2` automatically applies the geom's stat algorithm in the background. You don't need to worry about the details or even think much about stats.
However, you can change or fine tune a geom's default stat to create interesting and useful plots.
To learn which stat a geom uses, visit the geom's help page. For example, the `?geom_bar` help page shows that `geom_bar()` uses the `stat_bin()` stat by default. To learn about the stat, visit the stat's help page.
To change a geom's stat, set the `stat` argument to the name of a stat in `ggplot2`. You can find a list of these stats by running `help(package = "ggplot2")`. Each stat is represented by a function that begins with `stat_`. The name of the stat is the portion of the function name that appears after `stat_`.
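As a sketch of that naming rule, you can list the `stat_` functions and strip the prefix to get names that the `stat` argument generally accepts. This assumes `ggplot2` is attached with `library()`; the exact list depends on the installed version, and the variable names are illustrative.

```{r}
library(ggplot2)

# Exported functions whose names begin with "stat_"
stat_functions <- ls("package:ggplot2", pattern = "^stat_")

# The stat name is whatever follows the "stat_" prefix
stat_names <- sub("^stat_", "", stat_functions)
head(stat_names)
```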
Many combinations of geoms and stats will create incompatible results. However, one useful non-default combination is to pair `geom_bar()` with `stat_identity()`. This combination lets you map the height of each bar to the value of a variable.
```{r}
demo <- data.frame(
a = c(1,2,3),
b = c(20, 30, 40)
)
ggplot(data = demo) +
geom_bar(aes(x = a, y = b), stat = "identity")
```
#### ..variables..
Many stats in `ggplot2` create more data than they display. For example, the `?stat_bin` help page explains that `stat_bin()` uses your raw data to create a new data frame with four columns: `count`, `density`, `ncount`, `ndensity`.
`geom_bar()` maps one of these columns, the `count` column, to the y axis of your plot. You can map any of the three remaining columns to your y axis as well. To do this, map the y aesthetic to the stat column name, and surround the column name with a pair of dots, `..`.
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..density..))
```
Note that to do this, you will need to
1. Determine which stat your geom uses
2. Determine which variables the stat creates from its help page
3. Surround the variable name with `..`
Also note that this procedure will not make sense in as many cases as you might suppose. Stat columns usually exist for a very narrow purpose. For example, it does not make sense to use `..density..` in a bar chart of discrete values, but `..density..` is a useful alternative to `..count..` when you use a histogram geom (histograms rely on the same stat as bar charts).
```{r message = FALSE, fig.show='hold', fig.width=4, fig.height=4}
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat))
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat, y = ..density..))
```
### Parameters
Two of the graphs in the last section used the `width` argument. `width` is a _parameter_ of the `geom_bar()` function, a piece of information that `ggplot2` uses to build the geom.
How do these two plots differ?
```{r echo = FALSE, message = FALSE, fig.show='hold', fig.width=4, fig.height=4}
@ -697,37 +668,260 @@ ggplot(data = mpg) +
As with aesthetics, different geoms respond to different parameters. How do you know which parameters to use with a geom? You can always treat a geom's aesthetics as parameters. You can also spot additional parameters by identifying a geom's stat.
### Coordinate systems
### Stats
How does `ggplot2` know where to place the line in our smooth plot?
```{r echo = FALSE, message = FALSE, fig.show='hold', fig.width=4, fig.height=4}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
```
The y values of the line do not appear in our data set, nor did we give the y values to `ggplot2`. `ggplot2` calculated the y values by applying an algorithm to the data. In this case, `ggplot2` applied a smoothing algorithm to the data.
Many types of graphs plot information that does not appear in the raw data. To do this, the graph first applies an algorithm to the raw data and then plots the results. For example, a boxplot calculates the first, second, and third quartiles of a data set and then plots those summary statistics (among others). A histogram bins the raw data and then counts how many points fall into each bin. It plots those counts on the y axis.
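To make the histogram case concrete, here is a rough base R sketch of the "bin, then count" step. It is not how `ggplot2` implements its stat; the vector `x` and the `breaks` are invented for illustration.

```{r}
# A small made-up vector of raw values
x <- c(1.2, 1.7, 2.1, 2.4, 2.8, 3.3, 3.9)

# Bin edges: [1, 2), [2, 3), [3, 4)
breaks <- seq(1, 4, by = 1)

# Step 1: assign each raw value to a bin
bins <- cut(x, breaks = breaks, right = FALSE)

# Step 2: count how many values fall into each bin
counts <- table(bins)
counts
```

The counts, not the raw values, are what a histogram draws on the y axis.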
`ggplot2` calls these algorithms _stats_, which is short for statistical transformation. Stats are handled automatically in `ggplot2`. Not every geom uses a stat, but when one does, `ggplot2` will apply the stat in the background.
You can fine-tune how a geom implements a stat by passing the geom function the parameters that the stat should use. To discover which stat a geom uses, visit the geom's help page.
For example, the `?geom_smooth` help page shows that `geom_smooth()` uses the `stat_smooth()` stat by default. If you then open the `?stat_smooth` help page, you will see that `stat_smooth()` takes the arguments `method` and `se` among others. With `ggplot2`, you can supply arguments to the stat called by a geom, by passing the arguments as parameters to the geom.
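As a short illustration, the sketch below passes `method` and `se` to `geom_smooth()`, which hands them on to `stat_smooth()`. The choice of `"lm"` (a straight-line fit) and `se = FALSE` (no confidence band) is only an example, and the name `p` is made up.

```{r message = FALSE}
library(ggplot2)

# `method` and `se` are arguments of stat_smooth(); geom_smooth() passes them along
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
p
```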
***
In general practice, you do not need to worry much about stats. Usually one geom will be closely associated with one stat, and `ggplot2` will implement the stat by default. However, stats are an integral part of the `ggplot2` package that you are welcome to modify. To learn more about `ggplot2`'s stat system, see [ggplot2: Elegant Graphics for Data Analysis](http://www.amazon.com/dp/0387981403/ref=cm_sw_su_dp?tag=ggplot2-20).
### Position
What if you didn't want a stacked bar chart? What if you wanted the chart below? Could you make it?
```{r echo = FALSE}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
```
This chart displays the same 40 color-coded bars as the stacked bar chart above. Each bar represents a combination of `cut` and `clarity`.
However, the position of the bars within the two charts is different. In the stacked bar chart, `ggplot2` stacked the bars on top of each other if they had the same cut. In the second plot, `ggplot2` placed the bars beside each other if they had the same cut.
You can control this behavior by adding a _position adjustment_ to your call. A position adjustment tells `ggplot2` what to do when two or more objects overlap.
To set a position adjustment, set the `position` argument of your geom function to one of `"identity"`, `"stack"`, `"dodge"`, `"fill"`, or `"jitter"`.
#### Position = "identity"
For many geoms, the default position value is "identity". When `position = "identity"`, `ggplot2` will place each object exactly where it falls in the context of the graph.
This would make little sense for our bar chart. Each bar would start at `y = 0` and would appear directly above the `cut` value that it describes. Since there are eight bars for each value of `cut` (one per level of `clarity`), many bars would overlap. The plot would look suspiciously like a stacked bar chart, but the stacked heights would be inaccurate, as each bar actually extends to `y = 0`. Some bars would not appear at all because they would be completely overlapped by other bars.
To see how such a graph would appear, set `position = "identity"`.
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
geom_bar(mapping = aes(x = cut, fill = clarity), position = "identity") +
ggtitle('Position = "identity"')
```
#### Position = "stack"
To avoid confusion, `ggplot2` uses a default "stack" position adjustment for bar charts. When `position = "stack"`, `ggplot2` places overlapping objects directly _above_ one another.
Here each bar begins exactly where the bar below it ends.
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "stack") +
ggtitle('Position = "stack"')
```
#### Position = "dodge"
When `position = "dodge"`, `ggplot2` places overlapping objects directly _beside_ one another. This is how I created the graph at the start of the section.
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge") +
ggtitle('Position = "dodge"')
```
#### Position = "fill"
When `position = "fill"`, `ggplot2` uses all of the available space to display overlapping objects. Within that space, `ggplot2` scales each object in proportion to the other objects. `position = "fill"` is the most unusual of the position adjustments, but it creates an easy way to compare relative frequencies across groups.
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill") +
ggtitle('Position = "fill"')
```
#### Position = "jitter"
The last type of position doesn't make sense for bar charts, but it is very useful for scatterplots. Recall our first scatterplot.
```{r echo = FALSE}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
```
Why does the plot appear to display only 126 points? There are 234 observations in the data set. Also, why do the points appear to be arranged on a grid?
The points appear in a grid because the `hwy` measurements were rounded to the nearest integer and the `displ` measurements were rounded to the nearest tenth. As a result, many points overlap each other because they've been rounded to the same values of `hwy` and `displ`. This also explains why our graph appears to contain only 126 points. The other 108 points are hidden behind points located at the same position.
This arrangement can cause problems because it makes it hard to see where the mass of the data is. Is there one special combination of `hwy` and `displ` that contains 109 values? Or are the data points more or less equally spread throughout the graph?
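You can check how much overplotting there is by counting the observations at each position. The following is a minimal sketch in base R; `pair_counts` is a name chosen for illustration.

```{r}
library(ggplot2)

# Tabulate how many observations share each (displ, hwy) pair
pair_counts <- as.data.frame(table(displ = mpg$displ, hwy = mpg$hwy))

# Keep only the pairs that actually occur in the data
pair_counts <- pair_counts[pair_counts$Freq > 0, ]

nrow(mpg)              # total number of observations
nrow(pair_counts)      # number of distinct plotting positions
max(pair_counts$Freq)  # size of the most crowded position
```

Comparing the first two numbers shows how many points are hidden. The third shows whether any single position dominates.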
You can avoid this overplotting problem by setting the position adjustment to "jitter". `position = "jitter"` adds a small amount of random noise to each point, as we see below. This spreads the points out because no two points are likely to receive the same amount of random noise.
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
```
But isn't this, you know, bad? It *is* true that jittering your data will make it less accurate at the local level, but jittering may make your data _more_ accurate at the global level. By jittering your data, you can see where the mass of your data falls on an overplotted grid. Occasionally, jittering will reveal a pattern that was hidden within the grid.
`ggplot2` recognizes `position = "jitter"` as shorthand for `position = position_jitter()`. This is true for the other values of position as well:
* `position = "identity"` is shorthand for `position = position_identity()`
* `position = "stack"` is shorthand for `position = position_stack()`
* `position = "dodge"` is shorthand for `position = position_dodge()`
* `position = "fill"` is shorthand for `position = position_fill()`
You can use the expanded syntax to specify details of the position process. You can also use the expanded syntax to open a help page for each position process (which you will need to do if you wish to learn more).
```{r eval=FALSE}
?position_jitter
```
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy),
position = position_jitter(width = 0.03, height = 0.3))
```
### Coordinate systems
You can make your bar charts even more versatile by changing the coordinate system of your plot. For example, you could flip the x and y axes of your plot, or you could plot your bar chart on polar coordinates, which creates a coxcomb plot.
```{r echo = FALSE, message = FALSE, fig.show='hold', fig.width=3, fig.height=4}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut)) +
coord_flip()
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut), width = 1) +
coord_polar()
```
To change the coordinate system of your plot, add a `coord_` function to your plot call. `ggplot2` comes with seven coordinate functions that each implement a different coordinate system.
#### Cartesian coordinates
`coord_cartesian()` generates a cartesian coordinate system for your plot. `ggplot2` adds a call to `coord_cartesian()` to your plot by default, but you can also manually add this call. Why would you want to do this?
You can set the `xlim` and `ylim` arguments of `coord_cartesian()` to zoom in on a region of your plot. Set each argument to a vector of length 2. `ggplot2` will use the first value as the minimum value on the x or y axis. It will use the second value as the maximum value.
Zooming is not very useful in our bar graph, but it can help us study the sports cars in our scatterplot.
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
coord_cartesian(xlim = c(4.5, 7.5), ylim = c(20, 30))
```
You can use the same arguments to zoom with most of the other coordinate functions in `ggplot2`. Check a function's help page to confirm that it accepts `xlim` and `ylim`.
***
*Tip*: You can also zoom by adding `xlim()` and/or `ylim()` to your plot call.
```{r eval = FALSE}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
xlim(4.5, 7.5) +
ylim(20, 30)
```
However, `xlim()` and `ylim()` do not provide a true zoom. Instead, they plot the subset of data that appears within the limits. This may change the appearance of elements that rely on unseen data points, such as a smooth line.
***
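The difference between the two kinds of zoom is easiest to see with a smooth line. The sketch below builds the same plot twice, once cropped with `coord_cartesian()` and once limited with `xlim()` and `ylim()`. The object names `base`, `zoomed`, and `subsetted` are made up for illustration, and the second plot is expected to warn about removed rows.

```{r message = FALSE, warning = FALSE, fig.show='hold', fig.width=4, fig.height=4}
library(ggplot2)

base <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

# True zoom: the smooth is fit to all of the data, then the view is cropped
zoomed <- base + coord_cartesian(xlim = c(4.5, 7.5), ylim = c(20, 30))

# Subset: points outside the limits are dropped before the smooth is fit
subsetted <- base + xlim(4.5, 7.5) + ylim(20, 30)

zoomed
subsetted
```

If the two smooth lines differ, the difference comes from the points that `xlim()` and `ylim()` discarded.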
#### Fixed coordinates
`coord_fixed()` also generates a cartesian coordinate system for your plot. However, you can use `coord_fixed()` to set the visual ratio between units on the x axis and units on the y axis. To do this, set the `ratio` argument to the desired ratio in length between y units and x units, e.g.
$$\text{ratio} = \frac{\text{length of one Y unit}}{\text{length of one X unit}}$$
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = factor(1), fill = cut)) +
coord_fixed(ratio = 0.5)
```
`coord_equal()` does the same thing as `coord_fixed()`.
#### Flipped coordinates
Add `coord_flip()` to your plot to switch the x and y axes.
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut)) +
coord_flip()
```
#### Map coordinates
Add `coord_map()` or `coord_quickmap()` to plot map data on a cartographic projection. See _Section 6_ for more details.
#### Polar coordinates
Add `coord_polar()` to your plot to plot your data in polar coordinates. By default, `ggplot2` will map your y variable to $r$ and your x variable to $\theta$. Reverse this behavior with the argument `theta = "y"`.
You can also use the `start` argument to control where in the plot your data starts, given as an offset in radians from the 12 o'clock position, and the `direction` argument to control the orientation of the plot (1 for clockwise, -1 for anti-clockwise).
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut)) +
geom_bar(mapping = aes(x = cut, fill = cut), width = 1) +
coord_polar()
```
Ignore `width = 1` for now. We will cover the argument in the section on parameters below.
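As a small sketch of the `start` and `direction` arguments: according to `?coord_polar`, `start` is an offset in radians from the 12 o'clock position. The values used here are arbitrary and chosen only to make the change visible, and the name `p` is made up.

```{r}
library(ggplot2)

p <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut), width = 1) +
  # Begin a quarter turn past 12 o'clock and run anti-clockwise
  coord_polar(start = pi / 2, direction = -1)
p
```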
***
*Tip*: `ggplot2` does not come with a pie chart geom, but you can make a pie chart by plotting a stacked bar chart in polar coordinates. To do this, ensure that:
* your x axis only has one value, e.g. `x = factor(1)`
* `width = 1`
* `theta = "y"`
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = factor(1), fill = cut), width = 1) +
coord_polar(theta = "y")
```
***
#### Transformed coordinates
Add `coord_trans()` to plot your data on cartesian coordinates that have been transformed in some way. To use `coord_trans()`, set the `xtrans` and/or `ytrans` argument to the name of a function that you would like to apply to the x and/or y values.
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), position = "jitter") +
coord_cartesian(xlim = c(1, 5), ylim = c(25, 45))
geom_point(mapping = aes(x = displ, y = hwy)) +
coord_trans(xtrans = "log", ytrans = "log")
```
## The Grammar of Graphics
### Layers
@ -774,6 +968,7 @@ ggplot(data = mpg) +
#### tile
### Continuous x, continuous y, continuous z
#### contour
### Advice for Big Data
## Maps