---
layout: default
title: Data Visualization
output: bookdown::html_chapter
---
|
|
|
|
```{r setup, include=FALSE}
knitr::opts_chunk$set(cache = TRUE)
```
|
|
|
|
# Visualize Data
|
|
|
|
Visualization makes data decipherable. Have you ever tried to study a table of raw data? Raw data is difficult to comprehend. You can examine values one at a time, but you cannot attend to many values at once. From a cognitive standpoint, the data overloads your attention span, which makes it hard to spot patterns in the data. See this for yourself; can you spot the striking relationship between $X$ and $Y$ in the table below?
|
|
|
|
```{r echo=FALSE}
X <- rep(seq(0.1, 1.9, length = 6), 2) + runif(12, -0.1, 0.1)
Y <- sqrt(1 - (X - 1)^2)
Y[1:6] <- -1 * Y[1:6]
Y <- Y - 1
# Shuffle all 12 rows so the table hides the pattern
shuffled <- sample(seq_along(X))
knitr::kable(round(data.frame(X = X[shuffled], Y = Y[shuffled]), 2))
```
|
|
|
|
In contrast, visualized data is easy to understand. Once you visualize data in a graph, you can see instantly the relationships between data points. You can spot the structure of the data, and you can read off individual values as necessary. For example, the graph below shows the same data as above. Here, the relationship between the points is obvious.
|
|
|
|
```{r echo=FALSE}
ggplot2::qplot(X, Y) + ggplot2::coord_fixed(ylim = c(-2.5, 2.5), xlim = c(-2.5, 2.5))
```
|
|
|
|
This chapter will teach you how to visualize your data with R and the `ggplot2` package. R contains several systems for making graphs, but the `ggplot2` system is one of the most beautiful and most versatile. `ggplot2` implements the *grammar of graphics*, a coherent system for describing and building graphs. The advantage is tremendous: with `ggplot2`, you can do more, and do it faster, by learning one system and applying it in many places.
|
|
|
|
## Outline
|
|
|
|
In *Section 1*, you will learn how to make scatterplots, the most popular type of data visualization. Along the way, you will learn how to add information to your plots with color, size, shape, and facets, and how to change the "type" of your plot with _geoms_.
|
|
|
|
*Section 2* shows how to build bar charts. Here you will learn how to plot summaries of your data with _stats_ and how to control the placement of objects with _positions_.
|
|
|
|
*Section 3* explains how to make histograms and how to fine-tune your plots with _parameters_. You will also learn the best ways to display comparisons in your plots.
|
|
|
|
*Section 4* draws on examples in the first three sections to teach the _grammar of graphics_, a versatile system for describing and building any plot.
|
|
|
|
*Section 5* concludes the chapter by showing how to customize your plots with labels, legends, and color schemes.
|
|
|
|
## Prerequisites
|
|
|
|
Load the `ggplot2` package to access the data sets and functions that we will use in this chapter. You can load the `ggplot2` package with the command:
|
|
|
|
```{r}
library(ggplot2)
```
|
|
|
|
## Scatterplots
|
|
|
|
Consider what you know about cars and form a hypothesis: do cars with big engines use more fuel than cars with small engines?
|
|
|
|
Now make your hypothesis more precise: What does the relationship between engine size and fuel efficiency look like? Is it positive? Negative? Linear? Nonlinear? Strong? Weak?
|
|
|
|
You can test your hypothesis with the `mpg` data set that comes in the `ggplot2` package. The data set contains data collected by the EPA on 38 models of car. Among the variables in `mpg` are `displ`, a car's engine size in litres, and `hwy`, a car's fuel efficiency on the highway in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.
|
|
|
|
To learn more about `mpg`, open its help page with the command `?mpg`.
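
Before plotting, it can help to glance at the data itself. The sketch below is one way to do that with base R functions and the `mpg` data set; the object names `n_models` and `fuel_vars` are our own, chosen only for illustration.

```{r}
library(ggplot2)

# Dimensions of the data set: rows are cars, columns are variables
dim(mpg)

# Number of distinct car models recorded in the data
n_models <- length(unique(mpg$model))
n_models

# The two variables used in this section
fuel_vars <- mpg[, c("displ", "hwy")]
summary(fuel_vars)
```

The summary gives a quick sense of the range of engine sizes and highway mileages that the scatterplot will need to display.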
|
|
|
|
***
|
|
|
|
*Tip*: If you have trouble loading `mpg`, its help page, or any of the functions in this chapter, you may need to load the `ggplot2` library with the command
|
|
|
|
```{r eval=FALSE}
library(ggplot2)
```
|
|
|
|
You will need to reload the library each time you start a new R session.
|
|
|
|
***
|
|
|
|
You can use the code below to plot the `hwy` variable of `mpg` against the `displ` variable. The syntax may seem strange, but you will learn to understand it soon enough. For now, just concentrate on being able to visualize the data, and enjoy your new powers. In the next section, we will explain the reasons behind the `ggplot2` syntax.
|
|
|
|
```{r eval=FALSE}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
```
|
|
|
|
Open an R session and run the code. Does the graph confirm or refute your hypothesis?
|
|
|
|
```{r echo=FALSE}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
```
|
|
|
|
You can immediately see that there is a negative relationship between engine size (`displ`) and fuel efficiency (`hwy`). In other words, cars with big engines have worse fuel efficiency. But the graph shows us something else as well.
|
|
|
|
![](images/visualization-1.png)
|
|
|
|
One group of points seems to fall outside the linear trend. These cars appear to get higher mileage than we would expect. What can explain this cluster? We'll examine this riddle in a second, so brainstorm some ideas.
|
|
|
|
In the meantime, let's review the code that we used to make the graph.
|
|
|
|
### Template
|
|
|
|
This code is almost a template for making plots with `ggplot2`.
|
|
|
|
```{r eval=FALSE}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
```
|
|
|
|
The function `ggplot()` initializes a new plot that you can add layers to. `ggplot()` doesn't create a plot by itself, but you should use `ggplot()` to begin every plot you make with `ggplot2`.
|
|
|
|
The first argument of `ggplot()` is the data set that you would like to use in your graph. So `ggplot(mpg)` initializes a graph that will use the `mpg` data set.
|
|
|
|
To complete a graph, add one or more layers to `ggplot()`. The function `geom_point()` adds a layer of points to the plot, which creates a scatterplot. The `mapping` argument explains where those points should go. Always set `mapping` to a call to `aes()`. The `x` and `y` arguments of `aes()` explain which variables to map to the x and y locations of the points.
|
|
|
|
You can change the data set, geom function, and aes arguments that you use in your plots. You can also add functions and arguments that do not appear here to make a graph. However, you can always return to this code as a simple template for a complete graph. To make a graph, replace the bracketed sections in the code below.
|
|
|
|
```{r eval = FALSE}
ggplot(data = <DATA>) +
  geom_<GEOM>(mapping = aes(<MAPPINGS>))
```
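
As a hedged illustration of filling in the template, the sketch below keeps `mpg` as the data set and the point geom, but swaps `cty` (city mileage, another variable in `mpg`) in for `hwy`. The name `city_plot` is our own.

```{r}
library(ggplot2)

# <DATA> = mpg, <GEOM> = point, <MAPPINGS> = x = displ, y = cty
city_plot <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = cty))
city_plot
```

Only the bracketed parts of the template changed; the surrounding structure stayed the same.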
|
|
|
|
### Aesthetic Mappings
|
|
|
|
> "The greatest value of a picture is when it forces us to notice what we never expected to see."
|
|
> - John Tukey
|
|
|
|
Visualizations can reveal relationships that you didn't expect to see, which makes them a very powerful tool for data science. For example, our plot above revealed a group of cars that had better than expected mileage. How can you explain these cars? Make a hypothesis before reading on.
|
|
|
|
Let's hypothesize that the cars are hybrids. One way to test this hypothesis is to look at the `class` value for each car. The `class` variable of the `mpg` data set classifies cars into groups such as compact, midsize, and suv. If the outlying points are hybrids, they should be classified as compact or perhaps subcompact cars (keep in mind that this data was collected before hybrid trucks and suvs became popular).
|
|
|
|
There are two ways to add a third variable, like `class`, to a two-dimensional scatterplot. You can map the variable to a new _aesthetic_ or you can divide the plot into _facets_.
|
|
|
|
An aesthetic is a visual property of the points in your plot. Aesthetics include things like the size, shape, or color of your points.
|
|
|
|
![](images/visualization-2.png)
|
|
|
|
You can convey information by mapping the aesthetics in your plot to the variables in your data set. For example, we can map the colors of the points to the `class` variable. Then the color of the point will reveal its class affiliation.
|
|
|
|
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))
```
|
|
|
|
To map an aesthetic to a variable, set the name of the aesthetic to the name of the variable and do this _in your plot's `aes()` call_. For example, above we set `color` to `class`.
|
|
|
|
`ggplot2` will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable. `ggplot2` will also add a legend that explains which levels correspond to which values.
|
|
|
|
We can now see that most of the unusual points are two-seater cars. These don't sound like hybrids. In fact, they sound a lot like sports cars, and that is what they are. As you can see in the graph, these cars have the same size engines as suvs and pickup trucks. However, sports cars have much smaller bodies than suvs and pickup trucks, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have such large engines.
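
If you would like to confirm what the plot suggests, one option is to pull the two-seaters out of the data and look at them directly. This is a minimal sketch using base R's `table()` and `subset()`; the name `two_seaters` is ours.

```{r}
library(ggplot2)

# Count the cars in each class
table(mpg$class)

# Look at the two-seaters directly
two_seaters <- subset(mpg, class == "2seater")
two_seaters[, c("manufacturer", "model", "displ", "hwy", "class")]
```

The `displ` column shows that every two-seater has a large engine, which is why the points sit on the right-hand side of the plot.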
|
|
|
|
Color is one of the most popular aesthetics to use in a scatterplot, but we could have mapped the size aesthetic to `class` in the same way. In this case, the exact size of the point reveals its class affiliation.
|
|
|
|
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = class))
```
|
|
|
|
Or we could have mapped the _alpha_ of the points to the `class` variable. The alpha is the transparency of the points. Now the transparency of each point corresponds with its class affiliation.
|
|
|
|
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
```
|
|
|
|
We also could have mapped the shape of the points to the `class` variable.
|
|
|
|
```{r warning=FALSE}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))
```
|
|
|
|
In each case above, we set the name of the aesthetic to the variable to display and we do this within the `aes()` function. This arrangement highlights a useful insight since we also set `x` and `y` to variables within `aes()`: the x location and the y location of a point are aesthetics, visual properties that we can map to variables.
|
|
|
|
Once you set an aesthetic, `ggplot2` takes care of the rest. It selects a pleasing set of values to use for the aesthetic and it constructs a legend that explains the mapping. For x and y aesthetics, `ggplot2` does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts in the same way as a legend, as a guide.
|
|
|
|
Now that you know how to use aesthetics, you should take a moment to experiment with the `mpg` data set. Attempt to match different types of variables to different types of aesthetics. Attempt to use more than one aesthetic at a time. You can learn what aesthetics are available to use in a scatterplot by looking up the help page for `geom_point()`, e.g. `?geom_point`.
|
|
|
|
Have you experimented with aesthetics? Great. Here are some things that you may have noticed.
|
|
|
|
#### Continuous data
|
|
|
|
`ggplot2` treats continuous variables differently than discrete variables. A continuous variable can contain an infinite number of values that can be put in order, like numbers or date times. If your variable is continuous, `ggplot2` will use a gradation of levels to display the values. For example, `ggplot2` will use a gradient of colors from dark blue to light blue, or a gradient of sizes or alpha levels. `ggplot2` will not use the shape aesthetic to display continuous information. Why? Because the human eye cannot easily interpolate between shapes. Can you tell whether a shape is 3/4 of the way between a triangle and a circle? Or just 5/8 of the way?
|
|
|
|
`ggplot2` will treat your variable as continuous if it is a numeric, integer, or a recognizable date time (but not a factor, see `?factor`).
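
To see a gradient in action, you could map color to a numeric variable. The sketch below uses `cty`, an integer column of `mpg`; the name `gradient_plot` is ours.

```{r}
library(ggplot2)

# cty is an integer column, so color is drawn as a gradient
gradient_plot <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = cty))
gradient_plot
```

Notice that the legend becomes a continuous color bar rather than a list of separate keys.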
|
|
|
|
#### Discrete data
|
|
|
|
A discrete variable can only take a finite (or countably infinite) set of values. Character strings and boolean values are examples of discrete data. `ggplot2` will treat your variable as discrete if it is not a numeric, an integer, or a recognizable date time.
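
If you are unsure how a variable will be treated, you can check how R stores it. The sketch below sorts the columns of `mpg` by storage type. It ignores date times, since `mpg` contains none, and the names `var_types`, `continuous_vars`, and `discrete_vars` are ours.

```{r}
library(ggplot2)

# Storage type of each variable in mpg
var_types <- vapply(mpg, function(col) class(col)[1], character(1))
var_types

# Numeric and integer columns will be treated as continuous
continuous_vars <- names(var_types)[var_types %in% c("numeric", "integer")]
discrete_vars <- setdiff(names(var_types), continuous_vars)
continuous_vars
discrete_vars
```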
|
|
|
|
`ggplot2` will attempt to display discrete data with a discrete set of aesthetic levels. It will choose a set of levels that maximizes the difference between each pair of levels. This means that `ggplot2` will use equally spaced sizes and alpha levels to display discrete data. `ggplot2` will use a set of colors that span the hues of the rainbow. The exact colors will depend on how many hues appear in your graph. `ggplot2` selects the colors in a way that ensures that one color does not visually dominate the others.
|
|
|
|
`ggplot2` will use up to six shapes to display discrete data. If the data contains more than six unique values, `ggplot2` will print a warning message and only display the first six values. You may have noticed that in the graph above (and below), `ggplot2` did not display the suv values, which were the seventh unique class.
|
|
|
|
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))
```
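
One possible workaround, sketched here ahead of the fuller treatment of customization, is to supply the shapes by hand with `scale_shape_manual()`. The shape codes 0 through 6 are an arbitrary choice of ours, and the names `n_classes` and `all_shapes` are illustrative.

```{r}
library(ggplot2)

# Supply seven shape codes so every class gets its own shape
n_classes <- length(unique(mpg$class))
all_shapes <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = seq_len(n_classes) - 1)
all_shapes
```

With one shape code per class, the suv points appear in the plot. Whether seven shapes are easy to tell apart is a separate question.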
|
|
|
|
See _Section 5_ to learn how to pick your own colors, shapes, sizes, etc. for `ggplot2` to use.
|
|
|
|
#### Multiple aesthetics
|
|
|
|
You can use more than one aesthetic at a time. `ggplot2` will combine aesthetic legends where possible.
|
|
|
|
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class, shape = class, size = cty))
```
|
|
|
|
### Facets
|
|
|
|
Facets provide a second way to add a variable to a two-dimensional graph. When you facet a graph, you divide your data into subgroups and then plot a separate graph, or _facet_, for each subgroup.
|
|
|
|
For example, we can divide our data set into four subgroups based on the `cyl` variable:
|
|
|
|
1. all of the cars that have four cylinder engines
2. all of the cars that have five cylinder engines (there are some)
3. all of the cars that have six cylinder engines, and
4. all of the cars that have eight cylinder engines
|
|
|
|
Or we could divide our data into three groups based on the `drv` variable:
|
|
|
|
1. all of the cars with four wheel drive (4)
2. all of the cars with front wheel drive (f)
3. all of the cars with rear wheel drive (r)
|
|
|
|
We could even divide our data into subgroups based on the combination of two variables:
|
|
|
|
1. all of the cars with four wheel drive (4) and 4 cylinders
2. all of the cars with four wheel drive (4) and 5 cylinders
3. all of the cars with four wheel drive (4) and 6 cylinders
4. all of the cars with four wheel drive (4) and 8 cylinders
5. all of the cars with front wheel drive (f) and 4 cylinders
6. all of the cars with front wheel drive (f) and 5 cylinders
7. all of the cars with front wheel drive (f) and 6 cylinders
8. all of the cars with front wheel drive (f) and 8 cylinders
9. all of the cars with rear wheel drive (r) and 4 cylinders
10. all of the cars with rear wheel drive (r) and 5 cylinders
11. all of the cars with rear wheel drive (r) and 6 cylinders
12. all of the cars with rear wheel drive (r) and 8 cylinders
|
|
|
|
#### `facet_grid()`
|
|
|
|
The graphs below show what a faceted graph looks like. They also show how you can build a faceted graph with `facet_grid()`. I'm not going to tell you how facet grid works---well at least not yet---because that would be too easy. Instead, I would like you to try to induce the syntax of `facet_grid()`. What determines which variable splits the graphs into rows? columns? What parts of the syntax always stay the same? And what does the `.` do?
|
|
|
|
Make an honest effort at answering these questions, and then read on below the graphs.
|
|
|
|
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)
```
|
|
|
|
To divide a plot into a grid of facets, add `facet_grid()` to the plot. Pass `facet_grid()` a formula: two variable names separated by a `~`. `facet_grid()` will split the graph into rows based on values of the first variable name. It will split the graph into columns based on values of the second variable name. This mirrors the rows first, columns second convention of R.
|
|
|
|
If you prefer not to split into rows or columns, pass `facet_grid()` a `.` as a placeholder on the corresponding side of the formula.
|
|
|
|
Facets let you quickly compare subgroups by glancing down rows or across columns. Each facet will use the same x and y limits, but you can change this behavior across rows or columns by adding the argument `scales = "free_x"`, `scales = "free_y"`, or `scales = "free"` (both x and y). For example:
|
|
|
|
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl, scales = "free_x")
```
|
|
|
|
|
|
#### `facet_wrap()`
|
|
|
|
`facet_wrap()` provides a pleasant way to facet a plot across a single variable with many values. It operates like `facet_grid(. ~ <VAR>)` except that it wraps the results into multiple rows to present a multi-line, roughly square result.
|
|
|
|
The easiest way to understand `facet_wrap()` is to compare the output of `facet_grid()` and `facet_wrap()`.
|
|
|
|
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ class)

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class)
```
|
|
|
|
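The results of `facet_wrap()` are easier to study if the faceting variable has many values. However, `facet_wrap()` arranges all of its facets in one wrapped sequence; it does not lay out one variable along the rows and another along the columns as `facet_grid()` does.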
|
|
|
|
### Position
|
|
|
|
Ready for another riddle?
|
|
|
|
Why does our graph appear to only display 126 points? There are 234 observations in the data set. Also, why do the points appear to be arranged on a grid?
|
|
|
|
```{r echo = FALSE}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
```
|
|
|
|
The points appear in a grid because the `hwy` measurements were rounded to the nearest integer and the `displ` measurements to the nearest tenth. As a result, many points overlap each other because they've been rounded to the same values of `hwy` and `displ`. This also explains why our graph appears to contain only 126 points: the other 108 points are hidden behind points located at the same values.
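
You can check the amount of overplotting yourself by counting the distinct combinations of `displ` and `hwy`. This is a rough sketch with base R; the names `locations`, `n_visible`, `n_hidden`, and `pair_counts` are ours.

```{r}
library(ggplot2)

# Each distinct (displ, hwy) pair is drawn at one location
locations <- unique(mpg[, c("displ", "hwy")])
n_visible <- nrow(locations)
n_hidden <- nrow(mpg) - n_visible
c(visible = n_visible, hidden = n_hidden)

# How many cars share each location, largest first
pair_counts <- sort(table(paste(mpg$displ, mpg$hwy)), decreasing = TRUE)
head(pair_counts)
```

The second table hints at where the data is densest, which the unjittered scatterplot cannot show.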
|
|
|
|
This arrangement can cause problems because it makes it hard to see where the mass of the data is. Is there one special combination of `hwy` and `displ` that contains 109 values? Or are the data points more or less equally spread throughout the graph?
|
|
|
|
You can avoid this overplotting problem by adding a _position adjustment_ to each point. A position adjustment tells `ggplot2` what to do when two or more points overlap.
|
|
|
|
To set a position adjustment, set the `position` argument of `geom_point()` to one of `"identity"`, `"jitter"`, `"dodge"`, `"fill"`, or `"stack"`.
|
|
|
|
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
```
|
|
|
|
`ggplot2` recognizes all five of these options, but only `"identity"` and `"jitter"` make sense for scatterplots (we'll look at the rest when we study bar charts). `position = "identity"` plots the points where they appear (the default). `position = "jitter"` adds a small amount of random noise to each point, as we see above. This spreads the points out because no two points are likely to receive the same amount of random noise.
|
|
|
|
But isn't this, you know, bad? It *is* true that jittering your data makes it less accurate at the local level, but jittering may make your data _more_ accurate at the global level. By jittering your data, you can see where the mass of your data falls on an overplotted grid. Occasionally, jittering will reveal a pattern that was hidden within the grid.
|
|
|
|
`position = "jitter"` is shorthand for `position = position_jitter()`. This is true for the other values of position as well (e.g., `position_identity()`, `position_dodge()`, `position_fill()`, and `position_stack()`). The expanded forms let you specify details of the adjustment process, and also provide a way to open a help page for each process (which you will need to do if you wish to learn more).
|
|
|
|
```{r eval=FALSE}
?position_jitter
```
|
|
|
|
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy),
    position = position_jitter(width = 0.03, height = 0.3))
```
|
|
|
|
### Geoms
|
|
|
|
How are these two plots similar?
|
|
|
|
```{r echo = FALSE, message = FALSE, fig.show='hold', fig.width=4, fig.height=4}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))
```
|
|
|
|
They both contain the same x variable, the same y variable, and if you look closely, you can see that they are plotting the same data. But the plots are not identical.
|
|
|
|
Each plot uses a different visual object to represent the data. The first plot represents each observation in the data set with a point. The second plot represents the entire group of observations with a smoothed line. You could say that these two graphs are different "types" of plots, or that they "draw" different things. In `ggplot2` syntax, we say that they use different _geoms_.
|
|
|
|
_geom_ is short for geometrical object. The geom of the plot determines what type of visual object the plot uses to represent the data. So far, all of our plots have used the point geom, which is how you create scatterplots.
|
|
|
|
The new plot uses the smooth geom, a smooth line fitted to the data. You can use different geoms to plot the same data. To change the geom in your plot, change the `geom_` function that you add to `ggplot()`. For example, to go from the first plot above to the second, replace `geom_point()` in
|
|
|
|
```{r eval=FALSE}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
```
|
|
|
|
with `geom_smooth()` like this
|
|
|
|
```{r eval=FALSE, message = FALSE}
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))
```
|
|
|
|
`ggplot2` comes with dozens of `geom_` functions that you can use (37 when this chapter was written). You can also find additional `geom_` functions in other R packages. All `geom_` functions behave similarly; each takes a `mapping` argument. However, the aesthetics that you pass to the argument will change from geom to geom. For example, you can set the shape of points, but it would not make sense to set the shape of a line.
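
The exact number of `geom_` functions depends on the version of `ggplot2` that you have installed. As a rough check, you could list the exported names that begin with `geom_`. The name `geom_names` is ours, and the count you see may differ from the one quoted in this chapter.

```{r}
library(ggplot2)

# Names of all exported objects that start with "geom_"
geom_names <- ls("package:ggplot2", pattern = "^geom_")
length(geom_names)
head(geom_names)
```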
|
|
|
|
#### Multiple geoms
|
|
|
|
You can add multiple geoms to the same plot by adding multiple `geom_` functions to the plot call. For instance, it is common to combine a geom that displays the raw data with a geom that displays a summary of the data:
|
|
|
|
```{r, message = FALSE}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))
```
|
|
|
|
`ggplot2` will place each new geom on top of the preceding geom. This system lets you build a sophisticated graph layer by layer, geom by geom.
|
|
|
|
#### Global and local mappings
|
|
|
|
Notice that our call now contains some redundant code. We call `mapping = aes(x = displ, y = hwy)` twice. It is unwise to repeat code because each repetition creates a chance to make a typo or error. Repetitions also make your code harder to read and write.
|
|
|
|
You can avoid repetition by passing a set of aesthetics to `ggplot()`. `ggplot2` will treat these aesthetics as global mappings that apply to each geom in the graph. You can then remove the mapping arguments in the individual layers.
|
|
|
|
```{r, message = FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()
```
|
|
|
|
If you supply a mapping argument in a `geom_` function, `ggplot2` will add the local aesthetics to the global aesthetics _for that geom only_. If one of the local aesthetics conflicts with a global aesthetic, `ggplot2` will override the global aesthetic _for that geom only_. This provides an easy way to differentiate geoms.
|
|
|
|
```{r, message = FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth()
```
|
|
|
|
The smooth line above is a single line with a single color. This does not occur if you add the color aesthetic to the global mappings: `geom_smooth()` will then draw a different colored line for each class of cars.
|
|
|
|
```{r, message = FALSE, warning = FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  geom_smooth()
```
|
|
|
|
#### Global and local data sets
|
|
|
|
You can use the same system to specify individual data sets for each layer.
|
|
|
|
```{r, eval = FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()
```
|
|
|
|
is analogous to
|
|
|
|
```{r, eval = FALSE}
ggplot(mapping = aes(x = displ, y = hwy)) +
  geom_point(data = mpg) +
  geom_smooth(data = mpg)
```
|
|
|
|
To apply the smooth line to a subset of the data, pass it its own data argument (here the subset of 8 cylinder cars).
|
|
|
|
```{r, message = FALSE, warning = FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(data = subset(mpg, cyl == 8))
```
|
|
|
|
### Parameters
|
|
|
|
How do these two plots differ?
|
|
|
|
```{r echo = FALSE, message = FALSE, fig.show='hold', fig.width=4, fig.height=4}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = lm)
```
|
|
|
|
Each overlays a smooth geom on a point geom, but each displays a different "type" of smooth line. In the first graph, `ggplot2` draws the result of a loess algorithm. In the second plot, `ggplot2` draws the result of a linear regression.
|
|
|
|
You can customize the output of `geom_smooth()` with its `method` argument. Set `method` to the name of a model function in R. `geom_smooth()` will display the result of modelling y on x with the function. In the graph above, we set `method = lm` to create the regression line. `lm()` is the R function that builds linear models.
|
|
|
|
```{r eval=FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = lm)
```
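
To see what the straight line represents, you can fit the same kind of model yourself. The sketch below assumes that the smooth layer models y on x, which for this plot means `hwy ~ displ`; the name `fit` is ours.

```{r}
library(ggplot2)

# Fit the linear model that the regression line is based on
fit <- lm(hwy ~ displ, data = mpg)
coef(fit)

# Predicted highway mileage for a 2 litre and a 6 litre engine
predict(fit, newdata = data.frame(displ = c(2, 6)))
```

The negative `displ` coefficient is the downward slope of the line in the plot.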
|
|
|
|
`method` is a _parameter_ of the `geom_smooth()` function, a piece of information that `ggplot2` uses to build the geom. If you do not set the `method` parameter, `ggplot2` defaults to a loess model or a generalized additive model depending on how many points appear in the graph.
|
|
|
|
`se` is another parameter of `geom_smooth()`. You can set the `se` parameter of `geom_smooth()` to `FALSE` to prevent `ggplot2` from drawing the standard error band that appears around the smooth line, i.e.
|
|
|
|
```{r}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)
```
|
|
|
|
Parameters are different than mappings because you do not set a parameter to a variable in the data set. `ggplot2` uses the value of a parameter directly. In contrast, to use a mapping, `ggplot2` must create a system of equivalencies between values of a variable and levels of an aesthetic.
|
|
|
|
#### Aesthetics as parameters
|
|
|
|
The distinction between parameters and mappings makes it easy to customize your graphs. Suppose you want to make a graph like the one below. How would you do it?
|
|
|
|
```{r echo = FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(color = "blue")
```
|
|
|
|
If you add `color = "blue"` to the `mapping` argument, you will get an unexpected result.
|
|
|
|
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
```
|
|
|
|
`ggplot2` treats `color = "blue"` as a mapping. It assumes that "blue" is a value in the data space. It uses R's recycling rules to assign the single value "blue" to each row of data. Then it creates a mapping from the value "blue" in the data space to the pinkish color that we see in the visual space. It even creates a legend to let you know that the color pink represents the value "blue." The choice of pink is a coincidence; `ggplot2` defaults to pink whenever a single discrete value is mapped to the color aesthetic.
|
|
|
|
This is not what we want. We want to set the color to blue. In short, we want to treat the color of the points like a parameter and set it directly.
|
|
|
|
To set an aesthetic as if it were a parameter, set it _outside_ of the `mapping` argument. This will place it outside of the `aes()` function as well.
|
|
|
|
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
```
|
|
|
|
`ggplot2` will treat assignments that appear in the `aes()` call of the mapping argument as mappings. It will treat assignments that appear outside of the mapping argument as parameters.
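
The same rule applies to other aesthetics. As a small sketch, the code below sets `color`, `size`, and `alpha` directly while still mapping `x` and `y`. The particular values (`"blue"`, `3`, `0.4`) and the name `styled_points` are arbitrary choices of ours.

```{r}
library(ggplot2)

# color, size, and alpha are set directly, so no legend is drawn
styled_points <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    color = "blue", size = 3, alpha = 0.4
  )
styled_points
```

Because none of these three aesthetics is tied to a variable, every point gets the same appearance.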
|
|
|
|
As with aesthetics, different geoms respond to different parameters. How do you know which parameters to use with a geom? You can always treat a geom's aesthetics as parameters. You can also spot additional parameters by identifying a geom's stat.
|
|
|
|
|
|
### Stats
|
|
|
|
How does `ggplot2` know where to place the line in our smooth plot?
|
|
|
|
```{r echo = FALSE, message = FALSE, fig.show='hold', fig.width=4, fig.height=4}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()
```
|
|
|
|
The y values of the line do not appear in our data set, nor did we give the y values to `ggplot2`. `ggplot2` calculated the y values by applying an algorithm to the data. In this case, `ggplot2` applied a smoothing algorithm to the data.
|
|
|
|
Many types of graphs plot information that does not appear in the raw data. To do this, the graph first applies an algorithm to the raw data and then plots the results. For example, a boxplot calculates the first, second, and third quartiles of a data set and then plots those summary statistics (among others). A histogram bins the raw data and then counts how many points fall into each bin. It plots those counts on the y axis.
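
To make the idea concrete, here is a rough sketch of the kind of summaries described above, computed by hand with base R. It is an approximation: the bin width of 5 mpg and the bin edges are our own choices, and `ggplot2` may pick different bins or compute quartiles slightly differently. The names `hwy_quartiles`, `bin_edges`, and `hwy_counts` are ours.

```{r}
library(ggplot2)

# The quartiles that a boxplot of hwy is built around
hwy_quartiles <- quantile(mpg$hwy, probs = c(0.25, 0.5, 0.75))
hwy_quartiles

# The kind of counts a histogram of hwy displays, using bins 5 mpg wide
bin_edges <- seq(10, 45, by = 5)
hwy_counts <- table(cut(mpg$hwy, breaks = bin_edges))
hwy_counts
```

Neither the quartiles nor the bin counts are columns of `mpg`; both are derived from the raw `hwy` values before anything is drawn.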
|
|
|
|
`ggplot2` calls these algorithms _stats_, which is short for statistical transformation. Stats are handled automatically in `ggplot2`. Not every geom uses a stat; but when one does, `ggplot2` will apply the stat in the background.
|
|
|
|
You can fine tune how a geom implements a stat by passing the geom parameters for the stat to use. To discover which stat a geom uses, visit the geom's help page.
|
|
|
|
For example, the `?geom_smooth` help page shows that `geom_smooth()` uses the `stat_smooth()` stat by default. If you then open the `?stat_smooth` help page, you will see that `stat_smooth()` takes the arguments `method` and `se` among others. With `ggplot2`, you can supply arguments to the stat called by a geom by passing the arguments as parameters to the geom.
|
|
|
|
***
|
|
|
|
In general practice, you do not need to worry much about stats. Usually one geom will be closely associated with one stat, and `ggplot2` will implement the stat by default. However, stats are an integral part of the `ggplot2` package that you are welcome to modify. To learn more about `ggplot2`'s stat system, see [ggplot2: Elegant Graphics for Data Analysis](http://www.amazon.com/dp/0387981403/ref=cm_sw_su_dp?tag=ggplot2-20).
|
|
|
|
## The Grammar of Graphics
|
|
|
|
## Bar Charts
|
|
|
|
After scatterplots, the most common type of plot is probably the bar chart. A bar chart is simply a graph that uses the bar geom.
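
As a first sketch, the code below applies the plotting template from earlier in the chapter with `geom_bar()` and maps only `x`, here to `class`. The names `class_bars` and `class_counts` are ours.

```{r}
library(ggplot2)

# One bar per class; bar height is the number of cars in that class
class_bars <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class))
class_bars

# The same counts computed by hand
class_counts <- table(mpg$class)
class_counts
```

The bar heights are not a column of `mpg`; they are counts that `ggplot2` computes from the data, which `table()` reproduces here.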
|
|
|
|
## Histograms
|
|
|
|
## Customizing plots
|