From 413b943517b56caab189c9cb5776209452e273ad Mon Sep 17 00:00:00 2001 From: Garrett Date: Fri, 13 Nov 2015 17:16:12 -0500 Subject: [PATCH] More work on visualization. Improves initial examples. --- visualize.Rmd | 922 +++++++++++++++++++++++++++++++------------------- 1 file changed, 570 insertions(+), 352 deletions(-) diff --git a/visualize.Rmd b/visualize.Rmd index 066de59..a7815ff 100644 --- a/visualize.Rmd +++ b/visualize.Rmd @@ -34,19 +34,11 @@ This chapter will teach you how to visualize your data with R and the `ggplot2` ## Outline -In *Section 1*, you will learn how to make scatterplots, the most popular type of data visualization. Along the way, you will learn to add information to your plots with color, size, shape, and facets; and how to change the "type" of your plot with _geoms_ . +*Section 1* will get you started making graphs right away. You'll learn how to make several common types of plots, and you will explore `ggplot2`'s syntax along the way. -*Section 2* shows how to build bar charts. Here you will learn how to plot summaries of your data with _stats_ and how to control the placement of objects with with _positions_. You'll also see how to change the _coordinate system_ of your graph. +*Section 2* will teach you the _grammar of graphics_, a versatile system for building plots. You'll learn to assemble any plot you like with _layers_, _geoms_, _stats_, _aesthetic mappings_, _position adjustments_, and _coordinate systems_. -*Section 3* draws on examples in the first two sections to teach the _gramar of graphics_, a versatile system for describing---and building---any plot. - -*Section 4* describes the best ways to visualize distributions of values. - -*Section 5* teaches the best ways to visualize relationships between variables. - -*Section 6* shows how to use `ggplot2` to create maps. - -*Section 7* concludes the chapter by showing how to customize your plots with labels, legends, and color schemes. 
+*Section 3* will show you how to customize your plots with labels, legends, color schemes, and more. ## Prerequisites @@ -61,9 +53,7 @@ install.packages("ggplot2") library(ggplot2) ``` -## Scatterplots - -> "A picture is not merely worth a thousand words, it is much more likely to be scrutinized than words are to be read."---John Tukey +## Basics Do cars with big engines use more fuel than cars with small engines? @@ -85,22 +75,31 @@ You will need to reload the package each time you start a new R session. *** -The code below plots the `displ` variable of `mpg` against the `hwy` variable. Open an R session and run the code. Does the graph confirm or refute your hypothesis? +### Scatterplots -```{r} +The code below plots the `displ` variable of `mpg` against the `hwy` variable. + +```{r eval = FALSE} ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) ``` -You can immediately see that there is a negative relationship between engine size (`displ`) and fuel efficiency (`hwy`). In other words, cars with big engines have a worse fuel efficiency. But the graph shows us something else as well. +Open an R session and run the code. Your result will look like the graph below. Does the graph confirm your hypothesis about fuel and engine size? + +```{r echo = FALSE} +ggplot(data = mpg) + + geom_point(mapping = aes(x = displ, y = hwy)) +``` + +The graph shows a negative relationship between engine size (`displ`) and fuel efficiency (`hwy`). In other words, cars with big engines use more fuel. But the graph shows us something else as well. One group of points seems to fall outside the linear trend. These cars have a higher mileage than you might expect. Can you tell why? Before we examine these cars, let's review the code that made our graph. `r bookdown::embed_png("images/visualization-1.png", dpi = 150)` -### Template +#### Template -Our is almost a template for making plots with `ggplot2`. +Our code is almost a template for making plots with `ggplot2`. 
```{r eval=FALSE} ggplot(data = mpg) + @@ -111,18 +110,20 @@ With `ggplot2`, you begin a plot with the function `ggplot()`. `ggplot()` doesn' The first argument of `ggplot()` is the data set to use in the graph. So `ggplot(data = mpg)` initializes a graph that will use the `mpg` data set. -You complete your graph by adding one or more layers to `ggplot()`. Here, the function `geom_point()` adds a layer of points to the plot, which creates a scatterplot. `ggplot2` comes with 37 different `geom_` functions that you can use with `ggplot()`. Each function creates a different type of layer, and each function takes a mapping argument. +You complete your graph by adding one or more layers to `ggplot()`. Here, the function `geom_point()` adds a layer of points to the plot, which creates a scatterplot. `ggplot2` comes with other `geom_` functions that you can use as well. Each function creates a different type of layer, and each function takes a mapping argument. -The mapping argument explains where your points should go. You must always set mapping to a call to `aes()`. The `x` and `y` arguments of `aes()` explain which variables to map to the x and y axes of the graph. `ggplot()` will look for those variables in your data set, `mpg`. +The mapping argument explains where your points should go. You must set mapping to a call to `aes()`. The `x` and `y` arguments of `aes()` explain which variables to map to the x and y axes of the graph. `ggplot()` will look for those variables in your data set, `mpg`. -You can use this code as a template to make any graph with `ggplot2`. To make a graph, replace the bracketed sections in the code below with a new data set, a new geom, or a new set of mappings. You can also add functions and arguments to the template that do not appear here. +You can use this code as a template to make many graphs with `ggplot2`. 
To make a graph, replace the bracketed sections in the code below with a new data set, a new geom function, or a new set of mappings. ```{r eval = FALSE} ggplot(data = ) + - geom_(mapping = aes()) + (mapping = aes()) ``` -### Aesthetic Mappings +The next few sections will reveal useful arguments (and functions) that you can add to the template. + +#### Aesthetic Mappings > "The greatest value of a picture is when it forces us to notice what we never expected to see."---John Tukey @@ -130,7 +131,7 @@ Our plot above revealed a groups of cars that had better than expected mileage. Let's hypothesize that the cars are hybrids. One way to test this hypothesis is to look at the `class` value for each car. The `class` variable of the `mpg` data set classifies cars into groups such as compact, midsize, and suv. If the outlying points are hybrids, they should be classified as compact, or perhaps subcompact, cars (keep in mind that this data was collected before hybrid trucks and suvs became popular). -There are two ways to add a third value, like `class`, to a two dimensional scatterplot. You can map the value to a new _aesthetic_ or you can divide the plot into _facets_. +You can add a third value, like `class`, to a two dimensional scatterplot by mapping it to a new _aesthetic_. An aesthetic is a visual property of the points in your plot. Aesthetics include things like the size, the shape, or the color of your points. You can display a point (like the one below) in different ways by changing its aesthetic properties. @@ -170,298 +171,315 @@ ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, shape = class)) ``` -In each case, we set the name of the aesthetic to the variable to display and we do this within the `aes()` function. 
The syntax highlights a useful insight because we also set `x` and `y` to variables within `aes()`: the x and y locations of a point are also aesthetics, visual properties that you can map to variables to display information about the data. +In each case, you set the name of the aesthetic to the variable to display and you do this within the `aes()` function. The syntax highlights a useful insight because you also set `x` and `y` to variables within `aes()`: the x and y locations of a point are themselves aesthetics, visual properties that you can map to variables to display information about the data. -Once you set an aesthetic, `ggplot2` takes care of the rest. It selects a pleasing set of values to use for the aesthetic and it constructs a legend that explains the mapping. For x and y aesthetics, `ggplot2` does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts in the same way as a legend; it explains the mapping between locations and values. +Once you set an aesthetic, `ggplot2` takes care of the rest. It selects a pleasing set of values to use for the aesthetic, and it constructs a legend that explains the mapping. For x and y aesthetics, `ggplot2` does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts in the same way as a legend; it explains the mapping between locations and values. + +#### Exercises Now that you know how to use aesthetics, take a moment to experiment with the `mpg` data set. * Attempt to match different types of variables to different types of aesthetics. - + Continuous variables in `mpg`: `displ`, `year`, `cyl`, `cty`, `hwy` - + Discrete variables in `mpg`: `manufacturer`, `model`, `trans`, `drv`, `fl`, `class` + + The continuous variables in `mpg` are: `displ`, `year`, `cyl`, `cty`, `hwy` + + The discrete variables in `mpg` are: `manufacturer`, `model`, `trans`, `drv`, `fl`, `class` * Attempt to use more than one aesthetic at a time. 
* Attempt to set an aesthetic to something other than a variable name, like `displ < 5`. See the help page for `geom_point()` (`?geom_point`) to learn which aesthetics are available to use in a scatterplot. See the help page for the `mpg` data set (`?mpg`) to learn which variables are in the data set. -Have you experimented with aesthetics? Great! Here are some things that you may have noticed. +#### Position adjustments -#### Continuous data - -A continuous variable can contain an infinite number of values that can be put in order, like numbers or date-times. If your variable is continuous, `ggplot2` will treat it in a special way. `ggplot2` will - -* use a gradient of colors from blue to black for the color aesthetic -* display a colorbar in the legend for the color aesthetic -* not use the shape aesthetic - - `ggplot2` will not use the shape aesthetic to display continuous information because the human eye cannot easily interpolate between shapes. Can you tell whether a shape is three-quarters of the way between a triangle and a circle? How about five-eights of the way? - -`ggplot2` will treat your variable as continuous if it is a numeric, integer, or a recognizable date-time structure (but not a factor, see `?factor`). - -#### Discrete data - -A discrete variable can only contain a finite (or countably infinite) set of values. Character strings and boolean values are examples of discrete data. `ggplot2` will treat your variable as discrete if it is not a numeric, integer, or recognizable date-time structure. - -If your data is discrete, `ggplot2` will: - -* use a set of colors that span the hues of the rainbow. The exact colors will depend on how many hues appear in your graph. `ggplot2` selects the colors in a way that ensures that one color does not visually dominate the others. -* use equally spaced values of size and alpha -* display up to six shapes for the shape aesthetic. 
- -If your data requires more than six unique shapes, `ggplot2` will print a warning message and only display the first six shapes. You may have noticed this in the graph above (and below), `ggplot2` did not display the suv values, which were the seventh unique class. +Our scatterplot presents an interesting riddle: why does the plot only display 126 points? There are 234 observations in the data set. Also, why do the points appear to be arranged on a grid? ```{r} ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy, shape = class)) + geom_point(mapping = aes(x = displ, y = hwy)) ``` -See _Section 7_ to learn how to pick your own colors, shapes, sizes, etc. for `ggplot2` to use. +The points appear in a grid because the `hwy` and `displ` measurements were rounded to the nearest integer and tenths values. This also explains why our graph appears to contain only 126 points. Many points overlap each other because they've been rounded to the same values of `hwy` and `displ`. 108 points are hidden on top of other points located at the same value. -#### Multiple aesthetics - -You can use more than one aesthetic at a time. `ggplot2` will combine aesthetic legends when possible. +You can avoid this overplotting problem by setting the position argument of `geom_point()` to "jitter". `position = "jitter"` adds a small amount of random noise to each point. This spreads the points out because no two points are likely to receive the same amount of random noise. ```{r} ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy, - color = drv, shape = drv, size = cty)) + geom_point(mapping = aes(x = displ, y = hwy), position = "jitter") ``` -#### Expressions +But isn't random noise, you know, bad? It *is* true that jittering your data will make it less accurate at the local level, but jittering may make your data _more_ accurate at the global level. Occasionally, jittering will reveal a pattern that was hidden within the grid. 
-You can map an aesthetic to more than a variable. You can map an aesthetic to raw data, or an expression. + +### Bar Charts + +Bar charts are the most commonly used type of plot after scatterplots. To make a bar chart, use the function `geom_bar()`. ```{r} -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy, - color = 1:234)) -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy, - color = displ < 5)) +ggplot(data = diamonds) + + geom_bar(mapping = aes(x = cut)) ``` -#### Setting vs. Mapping +The chart above displays the total number of diamonds in the `diamonds` data set, grouped by `cut`. The `diamonds` data set comes in `ggplot2` and contains information about 53940 diamonds, including the `price`, `carat`, `color`, `clarity`, and `cut` of each diamond. -You can also manually set aesthetics to specific levels. For example, you can make all of the points in your plot blue. +The graph shows that more diamonds are available with high quality cuts than low quality cuts. -```{r echo = FALSE} -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy), color = "blue") -``` - -To set an aesthetic manually, call the aesthetic name as an argument of your geom function. Then pass the aesthetic a value that R will recognize, such as - -* the name of a color as a character string -* the size of a point as a cex expansion factor (see `?par`) -* the shape as a point as a number code - -R uses the following numeric codes to refer to the following shapes. +A bar has different visual properties than a point, which can create some surprises. For example, how would you create this simple chart? If you have an R session open, give it a try.
```{r echo=FALSE} -pchShow <- - function(extras = c("*",".", "o","O","0","+","-","|","%","#"), - cex = 2, - col = "red3", bg = "gold", coltext = "brown", cextext = 1.1, - main = "") - { - nex <- length(extras) - np <- 26 + nex - ipch <- 0:(np-1) - k <- floor(sqrt(np)) - dd <- c(-1,1)/2 - rx <- dd + range(ix <- ipch %/% k) - ry <- dd + range(iy <- 3 + (k-1)- ipch %% k) - pch <- as.list(ipch) # list with integers & strings - if(nex > 0) pch[26+ 1:nex] <- as.list(extras) - plot(rx, ry, type = "n", axes = FALSE, xlab = "", ylab = "", main = main) - abline(v = ix, h = iy, col = "lightgray", lty = "dotted") - for(i in 1:np) { - pc <- pch[[i]] - points(ix[i], iy[i], pch = pc, col = col, bg = bg, cex = cex) - if(cextext > 0) - text(ix[i] - 0.4, iy[i], pc, col = coltext, cex = cextext) - } - } - -pchShow() +ggplot(data = diamonds) + + geom_bar(mapping = aes(x = cut, fill = cut)) ``` -If you try to set an aesthetic from within the mappings argument (i.e. the `aes()` call), you will get an unexpected result, as below. +It may be tempting to call the color aesthetic, but for bars the color aesthetic controls the _outline_ of the bar, e.g. ```{r} -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy, color = "blue")) +ggplot(data = diamonds) + + geom_bar(mapping = aes(x = cut, color = cut)) ``` -Here, `ggplot2` treats `color = "blue"` as a mapping because it appears in the mapping argument. `ggplot2` assumes that "blue" is a value in the data space. It uses R's recycling rules to pair the single value "blue" with each row of data in `mpg`. Then `ggplot2` creates a mapping from the value "blue" in the data space to the pinkish color that we see in the visual space. `ggplot2` even creates a legend to let you know that the color pink represents the value "blue." The choice of pink is a coincidence; `ggplot2` defaults to pink whenever a single discrete value is mapped to the color aesthetic. +The effect is interesting, sort of psychedelic, but not what we had in mind. 
-If you experience this type of behavior, remember: - -* define an aesthetic _within_ the `aes()` function to map levels of the aesthetic to values of data. You would expect a legend after this operation. -* define an aesthetic _outside of_ the `aes()` function to manually set the aesthetic to a specific level. You would not expect a legend after this operation. - -### Facets - -Facets provide a second way to add a variables to a two dimensional graph. When you facet a graph, you divide your data into subgroups and then plot a separate graph, or _facet_, for each subgroup. - -For example, we can divide our data set into four subgroups based on the `cyl` variable: - -1. all of the cars that have four cylinder engines -2. all of the cars that have five cylinder engines (there are some) -3. all of the cars that have six cylinder engines, and -4. all of the cars that have eight cylinder engines - -Or we could divide our data into three groups based on the `drv` variable: - -1. all of the cars with four wheel drive (4) -2. all of the cars with front wheel drive (f) -3. all of the cars with rear wheel drive (r) - -We could even divide our data into subgroups based on the combination of two variables: - -1. all of the cars with four wheel drive (4) and 4 cylinders -2. all of the cars with four wheel drive (4) and 5 cylinders -3. all of the cars with four wheel drive (4) and 6 cylinders -4. all of the cars with four wheel drive (4) and 8 cylinders -5. all of the cars with front wheel drive (f) and 4 cylinders -6. all of the cars with front wheel drive (f) and 5 cylinders -7. all of the cars with front wheel drive (f) and 6 cylinders -8. all of the cars with front wheel drive (f) and 8 cylinders -9. all of the cars with rear wheel drive (r) and 4 cylinders -10. all of the cars with rear wheel drive (r) and 5 cylinders -11. all of the cars with rear wheel drive (r) and 6 cylinders -12. 
all of the cars with rear wheel drive (r) and 8 cylinders - -#### `facet_grid()` - -The graphs below show what a faceted graph looks like. They also show how you can build a faceted graph with `facet_grid()`. I'm not going to tell you how `facet_grid()` works---well at least not yet. That would be too easy. Instead, I would like you to try to induce the syntax of `facet_grid()` from the code below. Consider: - -* Which variables determine how the graph is split into rows? -* Which variables determine how the graph is split into columns? -* What parts of the syntax always stay the same? -* And what does the `.` do? - -Make an honest effort at answering these questions, and then read on past the graphs. +To control the interior fill of a bar, you must call the _fill_ aesthetic. ```{r} -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy)) + - facet_grid(drv ~ cyl) -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy)) + - facet_grid(drv ~ .) -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy)) + - facet_grid(. ~ cyl) +ggplot(data = diamonds) + + geom_bar(mapping = aes(x = cut, fill = cut)) ``` -Ready for the answers? - -To facet your graph, add `facet_grid()` to your code. The first argument of `facet_grid()` is always a formula, two variable names separated by a `~`. - -`facet_grid()` will use the first variable in the formula to split the graph into rows. Each row will contain data points that have the same value of the variable. - -`facet_grid()` will use the second variable in the formula to split the graph into columns. Each column will contain data points that have the same value of the second variable. - -This syntax mirrors the rows first, columns second convention of R. - -If you prefer to facet your plot on only one dimension, add a `.` to your formula as a place holder. If you place a `.` before the `~`, `facet_grid()` will not facet on the rows dimension. 
If you place a `.` after the `~`, `facet_grid()` will not facet on the columns dimension. - -Facets let you quickly compare subgroups by glancing down rows and across columns. Each facet will use the same limits on the x and y axes, but you can change this behavior across rows or columns by adding a scales argument. Set scales to one of - -* `"free_y"` - to let y limits vary accross rows -* `"free_x"` - to let x limits vary accross columns -* `"free"` - to let both x and y limits vary - -For example, the code below lets the limits of the x axes vary across columns. +If you map the fill aesthetic to a third variable, like `clarity`, you get a stacked bar chart. ```{r} -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy)) + - facet_grid(drv ~ cyl, scales = "free_x") +ggplot(data = diamonds) + + geom_bar(mapping = aes(x = cut, fill = clarity)) ``` +Bar charts are interesting because they reveal something subtle about many types of plots. -#### `facet_wrap()` +#### Stats -What if you want to facet on a variable that has too many values to display nicely? - -For example, if we facet on `class`, `ggplot2` must display narrow subplots to fit each subplot into the same column. This makes it diffcult to compare x values with precision. +Consider our basic bar chart. ```{r} -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy)) + - facet_grid(. ~ class) +ggplot(data = diamonds) + + geom_bar(mapping = aes(x = cut)) ``` -`facet_wrap()` provides a more pleasant way to facet a plot across many values. It wraps the subplots into a multi-row, roughly square result. +On the x axis it displays `cut`, a variable in the `diamonds` data set. On the y axis, it displays count. But count is not a variable in the diamonds data set: ```{r} -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy)) + - facet_wrap(~ class) +head(diamonds) ``` -The results of `facet_wrap()` can be easier to study than the results of `facet_grid()`. 
However, `facet_wrap()` can only facet by one variable at a time. +Nor did we tell `ggplot2` in our code where to find count values. -### Geoms - -You can add new data to your scatterplot with aesthetics and facets, but how can you summarize the data that is already there, for example with a trend line? - -You can add summary information to your scatterplot with a geom. To understand geoms, ask yourself: how are these two plots similar? - -```{r echo = FALSE, message = FALSE, fig.show='hold', fig.width=4, fig.height=4} -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy)) - -ggplot(data = mpg) + - geom_smooth(mapping = aes(x = displ, y = hwy)) +```{r eval = FALSE} +ggplot(data = diamonds) + + geom_bar(mapping = aes(x = cut)) ``` -Both plots contain the same: +Where does count come from? -* x variable -* y variable -* underlying data set +Some graphs, like scatterplots, plot the raw values of your data set. Other graphs, like bar charts, do not plot raw values at all. These graphs apply an algorithm to your data and then plot the results of the algorithm. Consider how many graphs do this. -But the plots are not identical. Each uses a different _geom_, or geometrical object, to represent the data. The first plot uses a set of points to represent the data. The second plot uses a single, smoothed line. +* **bar charts** and **histograms** bin the raw data and then plot bin counts +* **smooth lines** (e.g. trend lines) apply a model to the raw data and then plot the model line +* **boxplots** calculate the quartiles of the raw data and then plot the quartiles as a box. +* and so on. -To create the second plot, replace `geom_point()` in our template code... +`ggplot2` calls the algorithm that a graph uses to transform raw data a _stat_, which is short for statistical transformation. Each geom in `ggplot2` is associated with a stat that it automatically uses to plot your data (if a geom plots the raw data it uses the "identity" stat, i.e. 
the identity transformation). -```{r eval=FALSE} -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy)) +You can change the stat that your geom uses. For example, you can use the identity stat to plot data that already lists the counts of each bar. + +```{r} +demo <- data.frame( + bars = c("bar_1","bar_2","bar_3"), + counts = c(20, 30, 40) +) + +demo ``` -...with `geom_smooth()`, +To use the identity stat, set the stat argument of `geom_bar()` to "identity". -```{r eval=FALSE, message = FALSE} -ggplot(data = mpg) + - geom_smooth(mapping = aes(x = displ, y = hwy)) +```{r} +ggplot(data = demo) + + geom_bar(mapping = aes(x = bars, y = counts), stat = "identity") ``` -`ggplot2` comes with 37 `geom_` functions that you can use to to visualize your data. Each function will represent the data with a different type of geom, like a bar, a line, a boxplot, a histogram, etc. You select the type of plot you wish to make by calling the geom_ function that draws the geom you have in mind. +*** -Each `geom_` function takes a `mapping` argument. However, the aesthetics that you pass to the argument will change from geom to geom. For example, you can set the shape of points, but it would not make sense to set the shape of a line. +*Tip*: To learn which stat a geom uses, visit the geom's help page, e.g. `?geom_bar`. To learn more about a stat, visit the stat's help page, e.g. `?stat_bin`. -To see which aesthetics your geom uses, visit its help page. To see a list of all available geoms, open the `ggplot2` package help page with `help(package = ggplot2)`. +*** -#### Group aesthetic +### Polar charts -The _group_ aesthetic is a useful way to apply a monolithic geom, like a smooth line, to multiple subgroups. +Here's another riddle: how is a bar chart similar to a coxcomb plot, like the one below? -By default, `geom_smooth()` draws a single smoothed line for the entire data set. 
To draw a separate line for each group of points, set the group aesthetic to a grouping variable or expression. - -```{r message = FALSE} -ggplot(data = mpg) + - geom_smooth(mapping = aes(x = displ, y = hwy, group = displ < 5)) +```{r echo = FALSE, message = FALSE, fig.show='hold', fig.width=3, fig.height=4} +ggplot(data = diamonds) + + geom_bar(mapping = aes(x = cut, fill = cut)) +ggplot(data = diamonds) + + geom_bar(mapping = aes(x = cut, fill = cut), width = 1) + + coord_polar() ``` -`ggplot2` will automatically infer a group aesthetic when you map an aesthetic of a monolithic geom to a discrete variable. Below `ggplot2` infers a group aesthetic from the `linetype = drv` aesthetic. It is useful to combine group aesthetics with secondary aesthetics because `ggplot2` cannot build a legend for a group aesthetic. +Answer: A coxcomb plot is just a bar chart plotted in polar coordinates. -```{r message = FALSE} -ggplot(data = mpg) + - geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv)) +#### Coordinate systems + +To plot your data in polar coordinates, add `coord_polar()` to your plot call. Polar bar charts will look better if you also set `geom_bar()`'s width parameter to 1. + +```{r} +ggplot(data = diamonds) + + geom_bar(mapping = aes(x = cut, fill = cut), width = 1) + + coord_polar() +``` + +By default, `ggplot2` will map your y variable to $r$ and your x variable to $\theta$. When applied to a bar chart, this creates a coxcomb plot. + +#### Facets + +You can create a separate polar chart for each level of a third variable by _facetting_ your plot. For example, you can create a separate subplot for each level of the `clarity` variable. + +```{r} +ggplot(data = diamonds) + + geom_bar(mapping = aes(x = cut, fill = cut), width = 1) + + coord_polar() + + facet_wrap( ~ clarity) +``` + +Here, the first subplot displays the group of points that have the `clarity` value `I1`. The second subplot displays the group of points that have the `clarity` value `SI2`. 
And so on. + +To facet your plot on a single discrete variable, add `facet_wrap()` to your plot call. The first argument of `facet_wrap()` is always a formula, a `~` followed by a variable name. + +To facet your plot on the combinations of two variables, add `facet_grid()` to your plot call. The first argument of `facet_grid()` is always a formula, two variable names separated by a `~`. + +```{r fig.height = 7, fig.width = 7} +ggplot(data = diamonds) + + geom_bar(mapping = aes(x = cut, fill = cut), width = 1) + + coord_polar() + + facet_grid(color ~ clarity) +``` + +Here the first subplot displays all of the points that have an `I1` code for `clarity` _and_ a `D` code for `color`. Don't be confused; `color` is a variable name in the `diamonds` data set. So `facet_grid(color ~ clarity)` has nothing to do with the color aesthetic. + +### Bringing it together + +> "Wax on. Wax off."---*The Karate Kid* (1984) + +In this section, you learned how to make more than just scatterplots, bar charts, and coxcomb plots; you learned a foundation that you can use to make _any_ type of plot with `ggplot2`. + +To see this, let's add position adjustments, stats, coordinate systems, and facetting to our code template. In `ggplot2`, each of these parameters will work with every plot and every geom. + +```{r eval = FALSE} +ggplot(data = ) + + ( + mapping = aes(), + stat = , + position = + ) + + + + +``` + +*** + +*Tip*: In practice, you do not need to define each of these parameters when you make a graph. `ggplot2` will supply a set of sensible defaults. + +*** + +The parameters in our template are connected by a powerful idea known as the _Grammar of Graphics_. The Grammar of Graphics shows that you can uniquely describe a plot as a combination of: + +* a data set +* a coordinate system +* a geom +* a stat +* a set of aesthetic mappings +* a position adjustment, and +* a facet scheme + +As a result, you can build _any_ plot that you have in mind with the template above.
To do so, just fill in the parameters that describe the plot. + +The next section will look at each of these parameters closely. + +## The Grammar of Graphics + +The _grammar of graphics_ is the core of `ggplot2`. In fact, the "gg" of `ggplot2` stands for the grammar of graphics. + +You can think of the grammar of graphics as a formula for building a plot---any plot. To build a plot, you begin with a data set and a coordinate system. + +`r bookdown::embed_png("images/blank.png", dpi = 150)` + +You then visualize each row of data with a geom. + +`r bookdown::embed_png("images/blank.png", dpi = 150)` + +And you map variables in your data to the aesthetic properties of your geoms. Here we map the... + +`r bookdown::embed_png("images/blank.png", dpi = 150)` + +Once you map the x aesthetic of your geoms + +`r bookdown::embed_png("images/blank.png", dpi = 150)` + +and the y aesthetic + +`r bookdown::embed_png("images/blank.png", dpi = 150)` + +you have a complete graph that you can choose to facet or not. You can also adjust positions as necessary. + +`r bookdown::embed_png("images/blank.png", dpi = 150)` + +For some graphs you add an extra step; you transform the data with a statistical transformation, and then use geoms to represent the results. + +`r bookdown::embed_png("images/blank.png", dpi = 150)` + +These parameters---data, coordinate system, geoms, stats, aesthetic mappings, position adjustments, and facets---make up the grammar of graphics. You can build any graph by selecting the correct combination of parameters, e.g. 
+ +* **data**: +* **coordinate system**: +* **geom**: +* **stat**: +* **mappings**: +* **position adjustment**: +* **facets**: + +`r bookdown::embed_png("images/blank.png", dpi = 150)` + +If you alter any single parameter, you make a new graph: + +* **data**: +* **coordinate system**: +* **geom**: +* **stat**: +* **mappings**: +* **position adjustment**: +* **facets**: + +`r bookdown::embed_png("images/blank.png", dpi = 150)` + +### Layers + +To build a graph in `ggplot2`, choose a coordinate system and a facetting scheme for your entire graph, and then add as many combinations of data, geoms, stats, mappings, and position adjustments as you like. + +You can think of `ggplot()` as initializing your graph with a cartesian coordinate system. Add a coordinate function or a facet function to change these defaults. + +```{r plot} +p <- ggplot() + + coord_polar() + + facet_wrap(~drv) +``` + +Then use a geom function to supply a combination of data, geom, stat, mappings, and position. + +```{r dependson=plot} +p + geom_point(mapping = aes(x = displ, y = hwy), data = mpg, stat = "identity", position = "identity") +``` + +Each combination of data, geom, stat, mappings, and position provides a visual "layer" of information that you can add to the established coordinate system and facetting scheme. To build a multi-layered plot, just add multiple layers. + +```{r message = FALSE, dependson=plot} +p + geom_point(mapping = aes(x = displ, y = hwy), data = mpg, stat = "identity", position = "identity") + + geom_smooth(mapping = aes(x = displ, y = hwy), data = mpg, stat = "smooth", position = "identity") ``` #### Multiple geoms @@ -532,94 +550,242 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_smooth(data = subset(mpg, cyl == 8)) ``` -## Bar Charts -A bar chart is a graph that uses the bar geom. Bar charts are the most commonly used type of plot after scatterplots. 
+### Aesthetics -The chart below displays the total number of diamonds in the `diamonds` data set, grouped by `cut`. The `diamonds` data set comes in `ggplot2` and contains information about 53940 diamonds, including the `price`, `carat`, `color`, `clarity`, and `cut` of each diamond. +# Aesthetics + +Have you experimented with aesthetics? Great! Here are some things that you may have noticed. + +#### Continuous data + +A continuous variable can contain an infinite number of values that can be put in order, like numbers or date-times. If your variable is continuous, `ggplot2` will treat it in a special way. `ggplot2` will + +* use a gradient of colors from blue to black for the color aesthetic +* display a colorbar in the legend for the color aesthetic +* not use the shape aesthetic + + `ggplot2` will not use the shape aesthetic to display continuous information because the human eye cannot easily interpolate between shapes. Can you tell whether a shape is three-quarters of the way between a triangle and a circle? How about five-eighths of the way? + +`ggplot2` will treat your variable as continuous if it is a numeric, integer, or a recognizable date-time structure (but not a factor, see `?factor`). + +#### Discrete data + +A discrete variable can only contain a finite (or countably infinite) set of values. Character strings and boolean values are examples of discrete data. `ggplot2` will treat your variable as discrete if it is not a numeric, integer, or recognizable date-time structure. + +If your data is discrete, `ggplot2` will: + +* use a set of colors that span the hues of the rainbow. The exact colors will depend on how many hues appear in your graph. `ggplot2` selects the colors in a way that ensures that one color does not visually dominate the others. +* use equally spaced values of size and alpha +* display up to six shapes for the shape aesthetic.
+ +If your data requires more than six unique shapes, `ggplot2` will print a warning message and only display the first six shapes. You may have noticed this in the graph above (and below): `ggplot2` did not display the suv values, which were the seventh unique class. ```{r} -ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut)) +ggplot(data = mpg) + + geom_point(mapping = aes(x = displ, y = hwy, shape = class)) ``` -The graph shows that more diamonds are available with high quality cuts than low quality cuts. +See _Section 3_ to learn how to pick your own colors, shapes, sizes, etc. for `ggplot2` to use. -A bar has different visual properties than a point, which can create some surprises. For example, how would you create this simple chart? If you have an R session open, give it a try. +#### Multiple aesthetics + +You can use more than one aesthetic at a time. `ggplot2` will combine aesthetic legends when possible. + +```{r} +ggplot(data = mpg) + + geom_point(mapping = aes(x = displ, y = hwy, + color = drv, shape = drv, size = cty)) +``` + +#### Expressions + +You can map an aesthetic to more than a variable. You can map an aesthetic to raw data, or an expression. + +```{r} +ggplot(data = mpg) + + geom_point(mapping = aes(x = displ, y = hwy, + color = 1:234)) +ggplot(data = mpg) + + geom_point(mapping = aes(x = displ, y = hwy, + color = displ < 5)) +``` + +```{r} +ggplot(data = mpg) + + geom_point(mapping = aes(x = displ, y = hwy, color = "blue")) +``` + +#### Setting vs. Mapping + +You can also manually set an aesthetic to a specific level. For example, you can make all of the points in your plot blue. + +```{r} +ggplot(data = mpg) + + geom_point(mapping = aes(x = displ, y = hwy), color = "blue") +``` + +To set an aesthetic manually, call the aesthetic as an argument of your geom function.
Then pass the aesthetic a value that R will recognize, such as + +* the name of a color as a character string +* the size of a point as a cex expansion factor (see `?par`) +* the shape of a point as a number code + +R uses the following numeric codes to refer to the following shapes. ```{r echo=FALSE} -ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut, fill = cut)) +pchShow <- + function(extras = c("*",".", "o","O","0","+","-","|","%","#"), + cex = 2, + col = "red3", bg = "gold", coltext = "brown", cextext = 1.1, + main = "") + { + nex <- length(extras) + np <- 26 + nex + ipch <- 0:(np-1) + k <- floor(sqrt(np)) + dd <- c(-1,1)/2 + rx <- dd + range(ix <- ipch %/% k) + ry <- dd + range(iy <- 3 + (k-1)- ipch %% k) + pch <- as.list(ipch) # list with integers & strings + if(nex > 0) pch[26+ 1:nex] <- as.list(extras) + plot(rx, ry, type = "n", axes = FALSE, xlab = "", ylab = "", main = main) + abline(v = ix, h = iy, col = "lightgray", lty = "dotted") + for(i in 1:np) { + pc <- pch[[i]] + points(ix[i], iy[i], pch = pc, col = col, bg = bg, cex = cex) + if(cextext > 0) + text(ix[i] - 0.4, iy[i], pc, col = coltext, cex = cextext) + } + } + +pchShow() ``` -It may be tempting to call the color aesthetic, but for bars and other large geoms the color aesthetic controls the _outline_ of the geom, e.g. +If you get an odd result, double check that you are calling the aesthetic as its own argument (and not calling it from inside of `mapping = aes()`). -```{r} -ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut, color = cut)) + +Here, `ggplot2` treats `color = "blue"` as a mapping because it appears in the mapping argument. `ggplot2` assumes that "blue" is a value in the data space. It uses R's recycling rules to pair the single value "blue" with each row of data in `mpg`. Then `ggplot2` creates a mapping from the value "blue" in the data space to the pinkish color that we see in the visual space.
`ggplot2` even creates a legend to let you know that the color pink represents the value "blue." The choice of pink is a coincidence; `ggplot2` defaults to pink whenever a single discrete value is mapped to the color aesthetic. + +If you experience this type of behavior, remember: + +* define an aesthetic _within_ the `aes()` function to map levels of the aesthetic to values of data. You would expect a legend after this operation. +* define an aesthetic _outside of_ the `aes()` function to manually set the aesthetic to a specific level. You would not expect a legend after this operation. + +#### Group aesthetic + +The _group_ aesthetic is a useful way to apply a monolithic geom, like a smooth line, to multiple subgroups. + +By default, `geom_smooth()` draws a single smoothed line for the entire data set. To draw a separate line for each group of points, set the group aesthetic to a grouping variable or expression. + +```{r message = FALSE} +ggplot(data = mpg) + + geom_smooth(mapping = aes(x = displ, y = hwy, group = displ < 5)) ``` -The effect is interesting, sort of psychedelic, but not what we had in mind. +`ggplot2` will automatically infer a group aesthetic when you map an aesthetic of a monolithic geom to a discrete variable. Below `ggplot2` infers a group aesthetic from the `linetype = drv` aesthetic. It is useful to combine group aesthetics with secondary aesthetics because `ggplot2` cannot build a legend for a group aesthetic. -To control the interior fill of a bar, polygon, histogram, boxplot, or other geom with mass, you must call the _fill_ aesthetic.
- -```{r} -ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut, fill = cut)) +```{r message = FALSE} +ggplot(data = mpg) + + geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv)) ``` -If you map the fill aesthetic to a third variable, like `clarity`, you get a stacked bar chart. +### Geoms +You can add new data to your scatterplot with aesthetics and facets, but how can you summarize the data that is already there, for example with a trend line? -```{r} -ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut, fill = clarity)) +You can add summary information to your scatterplot with a geom. To understand geoms, ask yourself: how are these two plots similar? + +```{r echo = FALSE, message = FALSE, fig.show='hold', fig.width=4, fig.height=4} +ggplot(data = mpg) + + geom_point(mapping = aes(x = displ, y = hwy)) + +ggplot(data = mpg) + + geom_smooth(mapping = aes(x = displ, y = hwy)) ``` -Bar charts are interesting because they reveal something subtle about common types of plots. +Both plots contain the same: + +* x variable +* y variable +* underlying data set + +But the plots are not identical. Each uses a different _geom_, or geometrical object, to represent the data. The first plot uses a set of points to represent the data. The second plot uses a single, smoothed line. + +To create the second plot, replace `geom_point()` in our template code... + +```{r eval=FALSE} +ggplot(data = mpg) + + geom_point(mapping = aes(x = displ, y = hwy)) +``` + +...with `geom_smooth()`, + +```{r eval=FALSE, message = FALSE} +ggplot(data = mpg) + + geom_smooth(mapping = aes(x = displ, y = hwy)) +``` + +`ggplot2` comes with 37 `geom_` functions that you can use to visualize your data. Each function will represent the data with a different type of geom, like a bar, a line, a boxplot, a histogram, etc. You select the type of plot you wish to make by calling the `geom_` function that draws the geom you have in mind. + +Each `geom_` function takes a `mapping` argument.
However, the aesthetics that you pass to the argument will change from geom to geom. For example, you can set the shape of points, but it would not make sense to set the shape of a line. + +To see which aesthetics your geom uses, visit its help page. To see a list of all available geoms, open the `ggplot2` package help page with `help(package = ggplot2)`. + +#### Graphical primitives +#### Visualizing Distributions +##### Discrete distributions +###### Bar Charts +##### Continuous distributions +###### Histograms +###### Dotplots +###### Freqpoly +###### Density +###### Boxplots +##### Bivariate Distributions +###### bin2d +###### hex +###### density2d +###### rug +#### Visualizing Relationships +##### Discrete x, discrete y +###### Jitter +##### Discrete x, continuous y +###### Bar Charts +###### Boxplots +###### Dotplots +###### Violin plots +###### crossbar +###### errorbar +###### linerange +###### point range +##### Continuous x, continuous y +###### Points +###### Text +###### Jitter +###### Smooth +###### Quantile +##### Functions +###### line +###### area +###### step +##### Discrete x, discrete y, continuous z +###### raster +###### tile +##### Continuous x, continuous y, continuous z +###### contour +##### Maps ### Stats -Consider our basic bar chart. - -```{r} -ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut)) -``` - -On the x axis it displays `cut`, a variable in the `diamonds` data set. On the y axis, it displays count. But count is not a variable in the diamonds data set: - -```{r} -head(diamonds) -``` - -Nor did we tell `ggplot2` in our code where to find count values. - -```{r eval = FALSE} -ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut)) -``` - -Where does count come from? - -Some graphs, like scatterplots, plot the raw values of your data set. Other graphs, like bar charts and smooth lines, do not plot raw values at all. These graphs apply an algorithm to your data and then plot the results of the algorithm. 
Consider how many graphs do this. - -* **bar charts** and **histograms** bin the raw data and then plot bin counts -* **smooth lines** apply a model to the raw data and then plot the model line -* **boxplots** calculate the quartiles of the raw data and then plot the quartiles as a box. -* and so on. - -`ggplot2` calls the algorithm that a graph uses to transform raw data a _stat_, which is short for statistical transformation. - -Each geom in `ggplot2` is designed to use a default stat when it creates a graph (if a geom plots the raw data it uses the "identity" stat, i.e. an identity transformation). In many cases, it does not make sense to change a geom's default stat. In other cases, you can change or fine tune the stat to make new graphs. - -*** - -*Tip*: To learn which stat a geom uses, visit the geom's help page, e.g. `?geom_bar`. To learn more about a stat, visit the stat's help page, e.g. `?stat_bin`. - -*** - #### Change a stat +In many cases, it does not make sense to change a geom's default stat. In other cases, you can change or fine tune the stat to make new graphs. + You can map the heights of bars in a bar chart to data values---not counts---by changing the stat of the bar chart. This works best if your data set contains one observation per bar, e.g. ```{r} @@ -680,7 +846,7 @@ Note that to do this, you will need to 2. Determine which variables the stat creates from its help page 3. Surround the variable name with `..` -### Position +### Positions At the beginning of this section, you learned how to use the fill aesthetic to make a stacked bar chart. @@ -921,62 +1087,110 @@ ggplot(data = mpg) + coord_trans(xtrans = "log", ytran = "log") ``` +### Facets + +Facets provide a second way to add a variable to a two dimensional graph. When you facet a graph, you divide your data into subgroups and then plot a separate graph, or _facet_, for each subgroup. + +For example, we can divide our data set into four subgroups based on the `cyl` variable: + +1.
all of the cars that have four cylinder engines +2. all of the cars that have five cylinder engines (there are some) +3. all of the cars that have six cylinder engines, and +4. all of the cars that have eight cylinder engines + +Or we could divide our data into three groups based on the `drv` variable: + +1. all of the cars with four wheel drive (4) +2. all of the cars with front wheel drive (f) +3. all of the cars with rear wheel drive (r) + +We could even divide our data into subgroups based on the combination of two variables: + +1. all of the cars with four wheel drive (4) and 4 cylinders +2. all of the cars with four wheel drive (4) and 5 cylinders +3. all of the cars with four wheel drive (4) and 6 cylinders +4. all of the cars with four wheel drive (4) and 8 cylinders +5. all of the cars with front wheel drive (f) and 4 cylinders +6. all of the cars with front wheel drive (f) and 5 cylinders +7. all of the cars with front wheel drive (f) and 6 cylinders +8. all of the cars with front wheel drive (f) and 8 cylinders +9. all of the cars with rear wheel drive (r) and 4 cylinders +10. all of the cars with rear wheel drive (r) and 5 cylinders +11. all of the cars with rear wheel drive (r) and 6 cylinders +12. all of the cars with rear wheel drive (r) and 8 cylinders + +#### `facet_grid()` + +The graphs below show what a faceted graph looks like. They also show how you can build a faceted graph with `facet_grid()`. I'm not going to tell you how `facet_grid()` works---well at least not yet. That would be too easy. Instead, I would like you to try to induce the syntax of `facet_grid()` from the code below. Consider: + +* Which variables determine how the graph is split into rows? +* Which variables determine how the graph is split into columns? +* What parts of the syntax always stay the same? +* And what does the `.` do? + +Make an honest effort at answering these questions, and then read on past the graphs. 
+ +```{r} +ggplot(data = mpg) + + geom_point(mapping = aes(x = displ, y = hwy)) + + facet_grid(drv ~ cyl) +ggplot(data = mpg) + + geom_point(mapping = aes(x = displ, y = hwy)) + + facet_grid(drv ~ .) +ggplot(data = mpg) + + geom_point(mapping = aes(x = displ, y = hwy)) + + facet_grid(. ~ cyl) +``` + +Ready for the answers? + +To facet your graph, add `facet_grid()` to your code. The first argument of `facet_grid()` is always a formula, two variable names separated by a `~`. + +`facet_grid()` will use the first variable in the formula to split the graph into rows. Each row will contain data points that have the same value of the first variable. + +`facet_grid()` will use the second variable in the formula to split the graph into columns. Each column will contain data points that have the same value of the second variable. + +This syntax mirrors the rows first, columns second convention of R. + +If you prefer to facet your plot on only one dimension, add a `.` to your formula as a placeholder. If you place a `.` before the `~`, `facet_grid()` will not facet on the rows dimension. If you place a `.` after the `~`, `facet_grid()` will not facet on the columns dimension. + +Facets let you quickly compare subgroups by glancing down rows and across columns. Each facet will use the same limits on the x and y axes, but you can change this behavior across rows or columns by adding a scales argument. Set scales to one of + +* `"free_y"` - to let y limits vary across rows +* `"free_x"` - to let x limits vary across columns +* `"free"` - to let both x and y limits vary + +For example, the code below lets the limits of the x axes vary across columns. + +```{r} +ggplot(data = mpg) + + geom_point(mapping = aes(x = displ, y = hwy)) + + facet_grid(drv ~ cyl, scales = "free_x") +``` +#### `facet_wrap()` +What if you want to facet on a variable that has too many values to display nicely?
+For example, if we facet on `class`, `ggplot2` must display narrow subplots to fit each subplot into the same row. This makes it difficult to compare x values with precision. -## The Grammar of Graphics +```{r} +ggplot(data = mpg) + + geom_point(mapping = aes(x = displ, y = hwy)) + + facet_grid(. ~ class) +``` -> "Wax on. Wax off."---Mr. Miyagi. *The Karate Kid* (1984) +`facet_wrap()` provides a more pleasant way to facet a plot across many values. It wraps the subplots into a multi-row, roughly square result. -### Layers +```{r} +ggplot(data = mpg) + + geom_point(mapping = aes(x = displ, y = hwy)) + + facet_wrap(~ class) +``` -## Visualizing Distributions -### Discrete distributions -#### Bar Charts -### Continuous distributions -#### Histograms -#### Dotplots -#### Freqpoly -#### Density -#### Boxplots -### Bivariate Distributions -#### bin2d -#### hex -#### density2d -#### rug +The results of `facet_wrap()` can be easier to study than the results of `facet_grid()`. However, `facet_wrap()` can only facet by one variable at a time. -## Visualizing Relationships -### Discrete x, discrete y -#### Jitter -### Discrete x, continuous y -#### Bar Charts -#### Boxplots -#### Dotplots -#### Violin plots -#### crossbar -#### errorbar -#### linerange -#### point range -### Continuous x, continuous y -#### Points -#### Text -#### Jitter -#### Smooth -#### Quantile -### Functions -#### line -#### area -#### step -### Discrete x, discrete y, continuous z -#### raster -#### tile -### Continuous x, continuous y, continuous z -#### contour -### Advice for Big Data - -## Maps ## Customizing plots ### Titles @@ -988,3 +1202,7 @@ ggplot(data = mpg) + ### Themes +## Summary + +> "A picture is not merely worth a thousand words, it is much more likely to be scrutinized than words are to be read."---John Tukey +