From f25b0daf052298b1ec9a468c7c33b714ae719f43 Mon Sep 17 00:00:00 2001 From: Garrett Date: Mon, 7 Dec 2015 14:05:04 -0500 Subject: [PATCH] Edited out 33 pages of visualization chapter. Completed first section of the chapter, which teaches the entire grammar of graphics by doing. --- visualize.Rmd | 1334 +++++++++++++++++-------------------------------- 1 file changed, 467 insertions(+), 867 deletions(-) diff --git a/visualize.Rmd b/visualize.Rmd index a7bb7bc..54f4cb4 100644 --- a/visualize.Rmd +++ b/visualize.Rmd @@ -4,15 +4,16 @@ title: Data Visualization output: bookdown::html_chapter --- -```{r setup, include=FALSE} +```{r setup, include=FALSE, message=FALSE} knitr::opts_chunk$set(cache = TRUE) +if (!require(hexbin)) install.packages("hexbin") ``` # Visualize Data > "The simple graph has brought more information to the data analyst’s mind than any other device."---John Tukey -Visualization makes data decipherable. Have you ever tried to study a table of raw data? You can examine values one at a time, but you cannot attend to many values at once. The data overloads your attention span, which makes it hard to spot patterns in the data. See this for yourself; can you spot the striking relationship between $X$ and $Y$ in the table below? +Visualization makes data decipherable. Have you ever tried to study a table of raw data? You can examine a couple of values at a time, but you cannot attend to many values at once. The data overloads your attention span, which makes it hard to spot patterns in the data. See this for yourself; can you spot the striking relationship between $X$ and $Y$ in the table below? ```{r data, echo=FALSE} x <- rep(seq(0.2, 1.8, length = 5), 2) + runif(10, -0.15, 0.15) @@ -34,13 +35,11 @@ This chapter will teach you how to visualize your data with R and the `ggplot2` ## Outline -*Section 1* will get you started making graphs right away. You'll learn how to make several common types of plots, and how to use the `ggplot2` syntax. 
+*Section 1* will get you started making graphs right away. You'll learn how to use the grammar of graphics to make any type of plot. -*Section 2* will guide you through the geoms, stats, position adjustments, coordinate systems, and facetting schemes that you can use to make different types of plots with `ggplot2`. +*Section 2* will show you how to use data visualization to explore and understand your data. -*Section 3* will teach you the _layered grammar of graphics_, a versatile system for building multi-layered plots that underlies `ggplot2`. - -*Section 4* will show you how to customize your plots with labels, legends, color schemes, and more. +*Section 3* will show you how to customize your plots with labels, legends, color schemes, and more. ## Prerequisites @@ -55,11 +54,15 @@ install.packages("ggplot2") library(ggplot2) ``` -## Basics +## The Layered Grammar of Graphics -Let's use our first graph to answer a question: Do cars with big engines use more fuel than cars with small engines? +The grammar of graphics is a language for describing graphs. Once you learn the language, you can use it to build graphs with `ggplot2`, but how should you learn the language? -You probably already have an answer, but try to make your answer precise. What does the relationship between engine size and fuel efficieny look like? Is it positive? Negative? Linear? Nonlinear? Strong? Weak? +Have you ever tried to learn a language by only studying its rules, vocabulary, and syntax? That's how I tried to learn Spanish in college, and now I speak _un muy, muy, poquito_. + +It is far better to learn a language by actually speaking it! And that's what we'll do here; we'll learn the grammar of graphics by making a series of plots. Don't worry if things seem confusing at first. By the end of the section, everything will come together. + +Let's use our first graph to answer a question: Do cars with big engines use more fuel than cars with small engines? 
You probably already have an answer, but try to make your answer precise. What does the relationship between engine size and fuel efficiency look like? Is it positive? Negative? Linear? Nonlinear? Strong? Weak? You can test your answer with the `mpg` data set in the `ggplot2` package. The data set contains observations collected by the EPA on 38 models of car. Among the variables in `mpg` are @@ -80,18 +83,18 @@ library(ggplot2) ### Scatterplots -The easiest way to understand the `mpg` data set is to visualize it, which means that it is time to make our first graph. To do this, open an R session and run the code below. The code plots the `displ` variable of `mpg` against the `hwy` variable to make the graph below. +The easiest way to understand the `mpg` data set is to visualize it, which means that it is time to make our first graph. To do this, open an R session and run the code below. The code plots the `hwy` variable of `mpg` against the `displ` variable to make the graph below. Does the graph confirm your hypothesis about fuel efficiency and engine size? ```{r} ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) ``` -Does the graph confirm your hypothesis about fuel efficiency and engine size? The graph shows a negative relationship between engine size (`displ`) and fuel efficiency (`hwy`). In other words, cars with big engines use more fuel. But the graph shows us something else as well. +The graph shows a negative relationship between engine size (`displ`) and fuel efficiency (`hwy`). In other words, cars with big engines use more fuel. But the graph shows us something else as well. One group of points seems to fall outside of the linear trend. These cars have a higher mileage than you might expect. Can you tell why? Before we examine these cars, let's review the code that made our graph. 
-`r bookdown::embed_png("images/visualization-1.png", dpi = 150)` +`r bookdown::embed_png("images/visualization-1.png", dpi = 300)` #### Template @@ -106,19 +109,17 @@ With `ggplot2`, you begin a plot with the function `ggplot()`. `ggplot()` doesn' The first argument of `ggplot()` is the data set to use in the graph. So `ggplot(data = mpg)` initializes a graph that will use the `mpg` data set. -You complete your graph by adding one or more layers to `ggplot()`. Here, the function `geom_point()` adds a layer of points to the plot, which creates a scatterplot. `ggplot2` comes with other `geom_` functions that you can use as well. Each function creates a different type of layer, and each function takes a mapping argument. You'll learn about all of the geom functions in Section 2. +You complete your graph by adding one or more layers to `ggplot()`. Here, the function `geom_point()` adds a layer of points to the plot, which creates a scatterplot. `ggplot2` comes with other geom functions that you can use as well. Each function creates a different type of layer, and each function takes a mapping argument. The mapping argument of your geom function explains where your points should go. You must set `mapping` to a call to `aes()`. The `x` and `y` arguments of `aes()` explain which variables to map to the x and y axes of the graph. `ggplot()` will look for those variables in your data set, `mpg`. -This code suggests a template for making graphs with `ggplot2`. To make a graph, replace the bracketed sections in the code below with a new data set, a new geom function, or a new set of mappings. +This code suggests a minimal template for making graphs with `ggplot2`. To make a graph, replace the bracketed sections in the code below with a data set, a geom function, or a set of mappings. ```{r eval = FALSE} ggplot(data = ) + (mapping = aes()) ``` -The remainder of this section will introduce several arguments (and functions) that you can add to the template. 
Each argument will come with a new set of options---and likely a new set of questions. Hold those questions for now. We will catalogue your options in Section 2. Use this section to become familiar with the `ggplot2` syntax. Once you do, the low level details of `ggplot2` will be easier to understand. - #### Aesthetic Mappings > "The greatest value of a picture is when it forces us to notice what we never expected to see."---John Tukey @@ -129,20 +130,18 @@ Let's hypothesize that the cars are hybrids. One way to test this hypothesis is You can add a third value, like `class`, to a two dimensional scatterplot by mapping it to an _aesthetic_. -An aesthetic is a visual property of the points in your plot. Aesthetics include things like the size, the shape, or the color of your points. You can display a point (like the one below) in different ways by changing the values of its aesthetic properties. Since we already use the word "value" to describe data, let's use the word "level" to describe aesthetic properties. Here we change the levels of a point's size, shape, and color properties to make the point small, trianglular, or blue. +An aesthetic is a visual property of the points in your plot. Aesthetics include things like the size, the shape, or the color of your points. You can display a point (like the one below) in different ways by changing the values of its aesthetic properties. Since we already use the word "value" to describe data, let's use the word "level" to describe aesthetic properties. Here we change the levels of a point's size, shape, and color to make the point small, triangular, or blue. -`r bookdown::embed_png("images/visualization-2.png", dpi = 150)` +`r bookdown::embed_png("images/visualization-2.png", dpi = 300)` You can convey information about your data by mapping the aesthetics in your plot to the variables in your data set. For example, we can map the colors of our points to the `class` variable. 
Then the color of each point will reveal its class affiliation. -To map an aesthetic to a variable, set the name of the aesthetic to the name of the variable, _and do this in your plot's `aes()` call_: - ```{r} ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = class)) ``` -`ggplot2` will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable. `ggplot2` will also add a legend that explains which levels correspond to which values. +To map an aesthetic to a variable, set the name of the aesthetic to the name of the variable, _and do this in your plot's `aes()` call_. `ggplot2` will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable. `ggplot2` will also add a legend that explains which levels correspond to which values. The colors reveal that many of the unusual points are two seater cars. These cars don't seem like hybrids. In fact, they seem like sports cars---and that's what they are. Sports cars have large engines like suvs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have large engines. @@ -169,7 +168,7 @@ ggplot(data = mpg) + *** -**Tip** - What happened to the suv's? `ggplot2` will only use six shapes at a time. See Section 2 for more details. +**Tip** - What happened to the SUVs? `ggplot2` will only use six shapes at a time. When a variable has more than six values, the additional groups will go unplotted when you use the shape aesthetic. *** @@ -177,46 +176,202 @@ In each case, you set the name of the aesthetic to the variable to display, and Once you set an aesthetic, `ggplot2` takes care of the rest. It selects a pleasing set of levels to use for the aesthetic, and it constructs a legend that explains the mapping. For x and y aesthetics, `ggplot2` does not create a legend, but it creates an axis line with tick marks and a label. 
The axis line acts as a legend; it explains the mapping between locations and values. +You can also set the aesthetic properties of your geom manually. For example, we can make all of the points in our plot blue. + +```{r} +ggplot(data = mpg) + + geom_point(mapping = aes(x = displ, y = hwy), color = "blue") +``` + +Here, the color doesn't convey information about a variable. It just changes the appearance of the plot. To set an aesthetic manually, pass the aesthetic as an argument of your geom function. Then give the aesthetic a value that R will recognize, such as + +* the name of a color as a character string +* the size of a point as a cex expansion factor (see `?par`) +* the shape of a point as a number code + +R uses the following numeric codes to refer to the following shapes. + +```{r echo=FALSE} +pchShow <- + function(extras = c("*",".", "o","O","0","+","-","|","%","#"), + cex = 2, + col = "red3", bg = "gold", coltext = "brown", cextext = 1.1, + main = "") + { + nex <- length(extras) + np <- 26 + nex + ipch <- 0:(np-1) + k <- floor(sqrt(np)) + dd <- c(-1,1)/2 + rx <- dd + range(ix <- ipch %/% k) + ry <- dd + range(iy <- 3 + (k-1)- ipch %% k) + pch <- as.list(ipch) # list with integers & strings + if(nex > 0) pch[26+ 1:nex] <- as.list(extras) + plot(rx, ry, type = "n", axes = FALSE, xlab = "", ylab = "", main = main) + abline(v = ix, h = iy, col = "lightgray", lty = "dotted") + for(i in 1:np) { + pc <- pch[[i]] + points(ix[i], iy[i], pch = pc, col = col, bg = bg, cex = cex) + if(cextext > 0) + text(ix[i] - 0.4, iy[i], pc, col = coltext, cex = cextext) + } + } + +pchShow() +``` + +If you get an odd result, double check that you are passing the aesthetic as its own argument (and not from inside of `mapping = aes()`). I like to think of aesthetics like this. If you set the aesthetic: + +* _inside_ of the `aes()` function, `ggplot2` will map the aesthetic to data values and build a legend. 
+* _outside_ of the `aes()` function, `ggplot2` will directly set the aesthetic to your input. + + #### Exercises Now that you know how to use aesthetics, take a moment to experiment with the `mpg` data set. -* Attempt to match different types of variables to different types of aesthetics. - + The continuous variables in `mpg` are: `displ`, `year`, `cyl`, `cty`, `hwy` +1. Map a discrete variable to `color`, `size`, `alpha`, and `shape`. Then map a continuous variable to each. Does `ggplot2` behave differently for discrete vs. continuous variables? + The discrete variables in `mpg` are: `manufacturer`, `model`, `trans`, `drv`, `fl`, `class` -* Attempt to use more than one aesthetic at a time. -* Attempt to set an aesthetic to something other than a variable name, like `displ < 5`. + + The continuous variables in `mpg` are: `displ`, `year`, `cyl`, `cty`, `hwy` +2. Map the same variable to multiple aesthetics in the same plot. Does it work? How many legends does `ggplot2` create? +3. Attempt to set an aesthetic to something other than a variable name, like `displ < 5`. What happens? -See the help page for `geom_point()` (`?geom_point`) to learn which aesthetics are available to use in a scatterplot. See the help page for the `mpg` data set (`?mpg`) to learn which variables are in the data set. +*** -#### Position adjustments +**Tip** - See the help page for `geom_point()` (`?geom_point`) to learn which aesthetics are available to use in a scatterplot. See the help page for the `mpg` data set (`?mpg`) to learn which variables are in the data set. -Did you notice that there is another riddle hidden in our scatterplot? The plot displays 126 points, but the `mpg` data set contains 234 observations. Also, the points appear to fall on a grid. Why should this be? +*** -```{r} +#### Geoms + +How are these two plots similar? 
+ +```{r echo = FALSE, message = FALSE, fig.show='hold', fig.width=3, fig.height=3} +ggplot(data = mpg) + + geom_point(mapping = aes(x = displ, y = hwy)) + +ggplot(data = mpg) + + geom_smooth(mapping = aes(x = displ, y = hwy)) +``` + +They both contain the same x variable, the same y variable, and if you look closely, you can see that they both describe the same data. But the plots are not identical. + +Each plot uses a different visual object to represent the data. You could say that these two graphs are different "types" of plots, or that they "draw" different things. In `ggplot2` syntax, we say that they use different _geoms_. + +A _geom_ is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. + +As we see above, you can use different geoms to plot the same data. The plot on the left uses the point geom, which is how you create a scatterplot; and the plot on the right uses the smooth geom, a smooth line fitted to the data. + +To change the geom in your plot, change the geom function that you add to `ggplot()`. For instance, you can make the plot on the left with `geom_point()`: + +```{r eval=FALSE} ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) ``` -The points appear in a grid because the `hwy` and `displ` measurements in `mpg` are rounded to the nearest integer and tenths values. This also explains why our graph appears to contain 126 points. Many points overlap each other because they have been rounded to the same values of `hwy` and `displ`. 108 points are hidden on top of other points located at the same value. +And you can make the plot on the right with `geom_smooth()`: -You can avoid this overplotting problem by adjusting the position of the points. Each geom function uses a position argument to determine how to adjust the position of objects that overlap. 
- -The most useful type of adjustment for scatterplots is known as a "jitter". Jittering adds a small amount of random noise to each point. This spreads the points out since no two points are likely to receive the same amount of random noise. To jitter your points, add `position = "jitter"` to `geom_point`. - -```{r} +```{r eval=FALSE, message = FALSE} ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy), position = "jitter") + geom_smooth(mapping = aes(x = displ, y = hwy)) ``` -But isn't random noise, you know, bad? It *is* true that jittering your data will make it less accurate at the local level, but jittering may make your data _more_ accurate at the global level. Occasionally, jittering will reveal a pattern that was hidden within the grid. +Every geom function takes a `mapping` argument. However, the aesthetics that you pass the argument will change from geom to geom. If you think about it, this makes sense. You could set the shape of a point, but you couldn't set the "shape" of a line. On the other hand, you _could_ change the linetype of a line: + +```{r message = FALSE} +ggplot(data = mpg) + + geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv)) +``` + +Now `geom_smooth()` separates the cars into three lines based on their `drv` value, which describes a car's drive train. `geom_smooth()` then gives each line a unique linetype. Here, `4` stands for four wheel drive, `f` for front wheel drive, and `r` for rear wheel drive. + +```{r message = FALSE} +ggplot(data = mpg) + + geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv)) +``` + +*** + +**Tip** - Many geoms use a single object to describe all of the data. For these geoms, you can ask `ggplot2` to draw a separate object for each group of observations by setting the `group` aesthetic to a discrete variable. 
+ +In practice, `ggplot2` will automatically detect when it needs to group the data to apply several levels of an aesthetic to a single, monolithic geom (as in the `geom_smooth()` example). It is convenient to rely on this feature because the group aesthetic by itself does not add a legend or distinguishing features to the resulting objects. + +*** + +`ggplot2` provides 37 geom functions that you can use to visualize your data. Each geom is particularly well suited for visualizing a certain type of data or a certain type of relationship. The table below lists the geoms in `ggplot2`, loosely organized by the type of relationship that they describe. Next to each geom is a visual representation of the geom. Beneath the geom is a list of aesthetics that apply to the geom. + +*** + +`r bookdown::embed_png("images/blank.png", dpi = 300)` + +*** + +#### Layers + +Smooth lines are especially useful when you plot them _on top_ of raw data. The raw data provides a context for the smooth line, and the smooth line provides a summary of the raw data. To plot a smooth line on top of a scatterplot, add a call to `geom_smooth()` _after_ a call to `geom_point()`. + +```{r, message = FALSE} +ggplot(data = mpg) + + geom_point(mapping = aes(x = displ, y = hwy)) + + geom_smooth(mapping = aes(x = displ, y = hwy)) +``` + +Why does this work? You can think of each geom function in `ggplot2` as a layer. When you add multiple geoms to your plot call, `ggplot2` will add multiple layers to your plot. This lets you build sophisticated, multi-layer plots; `ggplot2` will place each new geom on top of the preceding geoms. + +Pay attention to your coding habits whenever you use multiple geoms. Our call now contains some redundant code. We pass `mapping = aes(x = displ, y = hwy)` twice. As a general rule, it is unwise to repeat code because each repetition creates a chance to make a typo or error. Repetitions also make your code harder to read and write. 
+ +You can avoid repetition by passing a set of mappings to `ggplot()`. `ggplot2` will treat these mappings as global mappings that apply to each geom in the graph. You can then remove the mapping arguments in the individual layers. + +```{r, message = FALSE} +ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + + geom_point() + + geom_smooth() +``` + +If you place mappings in a geom function, `ggplot2` will treat them as local mappings. It will use these mappings to extend or overwrite the global mappings _for that geom only_. This provides an easy way to differentiate geoms. + +```{r, message = FALSE} +ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + + geom_point(mapping = aes(color = class)) + + geom_smooth() +``` + +You can use the same system to specify individual data sets for each layer. For example, we can apply our smooth line to just a subset of the `mpg` data set, the cars with eight cylinder engines. + +```{r, message = FALSE, warning = FALSE} +ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + + geom_point() + + geom_smooth(data = subset(mpg, cyl == 8)) +``` + +##### Exercises + +1. What would this graph look like? + +```{r, eval = FALSE} +ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) + + geom_point() + + geom_smooth() +``` + +2. Will these two graphs look different? + +```{r, eval = FALSE} +ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + + geom_point() + + geom_smooth() + +ggplot(mapping = aes(x = displ, y = hwy)) + + geom_point(data = mpg) + + geom_smooth(data = mpg) +``` ### Bar Charts -You now know how to make scatterplots, but there are many different types of plots that you can use to visualize your data. After scatterplots, one of the most used types of plot is the bar chart. +You now know how to make useful scatterplots with `ggplot2`, but there are many different types of plots that you can use to visualize your data. After scatterplots, one of the most used types of plot is the bar chart. 
-To make a bar chart with `ggplot2` use the function `geom_bar()`. +To make a bar chart with `ggplot2` use the function `geom_bar()`. `geom_bar()` does not require a $y$ aesthetic. ```{r} ggplot(data = diamonds) + @@ -255,73 +410,190 @@ ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = clarity)) ``` -Bar charts also use different position adjustments than scatterplots. It wouldn't make sense to set `position = "jitter"` for a bar chart. However, you could set `position = "dodge"` to create an unstacked bar chart. You'll learn about other position options in Section 2. +#### Positions -```{r} +But what if you don't want a stacked bar chart? What if you want the chart below? Could you make it? + +```{r echo = FALSE} ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge") ``` -#### Stats +The chart displays the same 40 color coded rectangles as the stacked bar chart above. Each bar represents a combination of `cut` and `clarity`. However, the position of the bars within the two charts is different. In the stacked bar chart, `ggplot2` stacked bars that have the same cut on top of each other. In this plot, `ggplot2` places bars that have the same cut beside each other. -Bar charts are interesting because they reveal something subtle about many types of plots. Consider our basic bar chart. +You can control this behavior by adding a _position adjustment_ to your geom. A position adjustment tells `ggplot2` what to do when two or more objects appear at the same spot in the coordinate system. To set a position adjustment, set the `position` argument of your geom function to one of `"identity"`, `"stack"`, `"dodge"`, `"fill"`, or `"jitter"`. + +##### Position = "identity" + +When `position = "identity"`, `ggplot2` will place each object exactly where it falls in the context of the graph. 
+ +For our bar chart, this would mean that each bar would start at `y = 0` and would appear directly above the `cut` value that it describes. Since there are eight bars for each value of `cut`, many bars would overlap. The plot will look suspiciously like a stacked bar chart, but the stacked heights will be inaccurate, as each bar actually descends to `y = 0`. Some bars would not appear at all because they would be completely overlapped by other bars. + +`position = "identity"` is a poor choice for a bar chart, but is the sensible default position adjustment for many geoms, such as `geom_point()`. ```{r} +ggplot(data = diamonds) + + geom_bar(mapping = aes(x = cut, fill = clarity), position = "identity") + + ggtitle('Position = "identity"') +``` + +##### Position = "stack" + +`position = "stack"` places overlapping objects directly _above_ one another. This is the default position adjustment for bar charts in `ggplot2`. Here each bar begins exactly where the bar below it ends. + +```{r} +ggplot(data = diamonds) + + geom_bar(mapping = aes(x = cut, fill = clarity), position = "stack") + + ggtitle('Position = "stack"') +``` + +##### Position = "fill" + +`position = "fill"` places overlapping objects above one another. However, it scales the objects to take up all of the available vertical space. As a result, `position = "fill"` makes it easy to compare relative frequencies across groups. + +```{r} +ggplot(data = diamonds) + + geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill") + + ggtitle('Position = "fill"') +``` + +##### Position = "dodge" + +`position = "dodge"` places overlapping objects directly _beside_ one another. This is how I created the graph at the start of the section. 
+ +```{r} +ggplot(data = diamonds) + + geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge") + + ggtitle('Position = "dodge"') +``` + +##### Position = "jitter" + +The last type of position doesn't make sense for bar charts, but it is very useful for scatterplots. Recall our first scatterplot. + +```{r echo = FALSE} +ggplot(data = mpg) + + geom_point(mapping = aes(x = displ, y = hwy)) +``` + +Did you notice that the plot displays only 126 points, even though there are 234 observations in the data set? Did you also notice that the points appear to fall on a grid? Why should this be? + +This is common behavior in scatterplots. The points appear in a grid because the `hwy` measurements were rounded to the nearest integer and the `displ` measurements to the nearest tenth. As a result, many points overlap each other because they've been rounded to the same values of `hwy` and `displ`. The rounding also explains why our graph appears to contain only 126 points. 108 points are hidden on top of other points located at the same value. + +This arrangement can cause problems because it makes it hard to see where the mass of the data is. Is there one special combination of `hwy` and `displ` that contains 109 values? Or are the data points more or less equally spread throughout the graph? + +You can avoid this overplotting problem by setting the position adjustment to "jitter". `position = "jitter"` adds a small amount of random noise to each point, which spreads the points out because no two points are likely to receive the same amount of random noise. + +```{r} +ggplot(data = mpg) + + geom_point(mapping = aes(x = displ, y = hwy), position = "jitter") + + ggtitle('Position = "jitter"') +``` + +But isn't random noise, you know, bad? It *is* true that jittering your data will make it less accurate at the local level, but jittering may make your data _more_ accurate at the global level. Occasionally, jittering will reveal a pattern that was hidden within the grid. 
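The effect of jittering can be sketched without `ggplot2` at all. The base R snippet below is only an illustration under stated assumptions: `toy` is a made-up data frame (not the real `mpg` data), and the noise ranges of 0.04 and 0.4 are arbitrary choices, not the amounts that `position = "jitter"` actually uses. It counts how many points share a location before and after adding uniform random noise.

```r
# Hypothetical toy data: values rounded the way `displ` and `hwy` are in mpg
toy <- data.frame(
  displ = c(1.8, 1.8, 2.0, 2.0, 2.0, 3.1),
  hwy   = c(29, 29, 31, 31, 31, 26)
)

# Distinct locations that a scatterplot of `toy` would actually show
n_visible <- nrow(unique(toy))

# Points hidden behind an earlier point at the same location
n_hidden <- sum(duplicated(toy))

# Add a small amount of uniform noise to each coordinate.
# The ranges here are assumptions chosen for illustration only.
set.seed(1)
jittered <- transform(
  toy,
  displ = displ + runif(nrow(toy), min = -0.04, max = 0.04),
  hwy   = hwy + runif(nrow(toy), min = -0.4, max = 0.4)
)

# Count the distinct locations again after adding noise
n_visible_after <- nrow(unique(jittered))

c(visible = n_visible, hidden = n_hidden, visible_after = n_visible_after)
```

In the toy data, three of the six points sit on top of other points, so an unjittered scatterplot would show only three. Once noise is added, each point gets its own location, at the cost of no longer sitting exactly on its measured value.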
+ +*** + +**Tip** - `ggplot2` comes with a special geom `geom_jitter()` that is the exact equivalent of `geom_point(position = "jitter")`. + +*** + +*** + +**Tip** - To learn more about a position adjustment, look up the help page associated with each adjustment: `?position_dodge`, `?position_fill`, `?position_identity`, `?position_jitter`, and `?position_stack`. + +*** + +#### Stats + +Bar charts are interesting because they reveal something subtle about plots. Consider our basic bar chart. + +```{r echo = FALSE} ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut)) ``` -On the x axis, the chart displays `cut`, a variable in the `diamonds` data set. On the y axis, it displays count. But count is not a variable in the diamonds data set: +On the x axis, the chart displays `cut`, a variable in the `diamonds` data set. On the y axis, it displays count; but count is not a variable in the diamonds data set: ```{r} head(diamonds) ``` -Nor did we tell `ggplot2` in our code where to find count values. - -```{r eval = FALSE} -ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut)) -``` - Where does count come from? Some graphs, like scatterplots, plot the raw values of your data set. Other graphs, like bar charts, do not plot raw values at all. These graphs apply an algorithm to your data and then plot the results of the algorithm. Consider how often graphs do this. * **bar charts** and **histograms** bin your data and then plot bin counts, the number of points that fall in each bin. -* **smooth lines** (e.g. trend lines) apply a model to your data and then plot the model line. +* **smooth lines** fit a model to your data and then plot the model line. * **boxplots** calculate the quartiles of your data and then plot the quartiles as a box. * and so on. -`ggplot2` calls the algorithm that a graph uses to transform raw data a _stat_, which is short for statistical transformation. Each geom in `ggplot2` is associated with a stat that it uses to plot your data. 
`geom_bar()` uses the "bin" stat, which bins raw data and computes bin counts. In contrast, `geom_point()` uses the "identity" stat, which applies the identity transformation, i.e. no transformation. +`ggplot2` calls the algorithm that a graph uses to transform raw data a _stat_, which is short for statistical transformation. Each geom in `ggplot2` is associated with a default stat that it uses to plot your data. `geom_bar()` uses the "count" stat, which computes a data set of counts for each x value from your raw data. `geom_bar()` then uses this computed data to make the plot. -You can change the stat that your geom uses. For example, you can ask `geom_bar()` to use the "identity" stat. This is a useful way to plot data that already lists the heights for each bar, like the data set below. +`r bookdown::embed_png("images/blank.png", dpi = 300)` -```{r} -demo <- data.frame( - bars = c("bar_1","bar_2","bar_3"), - counts = c(20, 30, 40) -) +A few geoms, like `geom_point()`, plot your raw data as it is. To keep things simple, let's imagine that these geoms also transform the data. They just use a very lame transformation, the identity transformation, which returns the data in its original state. Now we can say that _every_ geom uses a stat. -demo -``` +`r bookdown::embed_png("images/blank.png", dpi = 300)` -To use the identity stat, set the stat argument of `geom_bar()` to "identity". +You can learn which stat a geom uses, as well as what variables it computes by visiting the geom's help page. For example, the help page of `geom_bar()` shows that it uses the count stat and that the count stat computes two new variables, `count` and `prop`. If you have an R session open---and you should!---you can verify this by running `?geom_bar` at the command line. -```{r} -ggplot(data = demo) + - geom_bar(mapping = aes(x = bars, y = counts), stat = "identity") +Stats are the most subtle part of plotting because you do not see them in action. 
`ggplot2` applies the transformation and stores the results behind the scenes. You only see the finished plot. Moreover, `ggplot2` applies stats automatically, with a very intuitive set of defaults. So why bother thinking about them? Because you can use stats to do three very useful things. + +First, you can tell `ggplot2` to use variables created by the stat. For example, the count stat creates two variables, `count` and `prop`, but `geom_bar()` only uses the `count` variable by default. + +You can tell `geom_bar()` to use the prop variable by mapping $y$ to `..prop..`. The two dots that surround prop notify `ggplot2` that the prop variable appears in the transformed data set, not the raw data set. Be sure to include these dots whenever you refer to a variable that is created by a stat. + +```{r message = FALSE, fig.show='hold', fig.width=4, fig.height=4} +ggplot(data = diamonds) + + geom_bar(mapping = aes(x = cut)) + +ggplot(data = diamonds) + + geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1)) ``` *** -*Tip*: To learn which stat a geom uses, visit the geom's help page, e.g. `?geom_bar`. +**Tip** - The best way to discover which variables are created by a stat is to visit the stat's help page. To open the help page, place the prefix `?stat_` before the name of the stat, then run the command at the command line, e.g. `?stat_count`. + +*** + +Second, you can customize how a stat does its job. For example, the count stat takes a width parameter that it uses to set the widths of the bars in a bar plot. To pass a width value to the stat, provide a width argument to the geom that uses the stat. `width = 1` will make the bars wide enough to touch each other. + +```{r} +ggplot(data = diamonds) + + geom_bar(mapping = aes(x = cut), width = 1) +``` + +You can learn which arguments a stat takes and how it uses them at the stat's help page. + +Finally, you can change the stat that your geom uses by setting the geom's stat argument.
For example, you can map the heights of your bars to raw values---not counts---if you change the stat of `geom_bar()` from "count" to "identity". This works best if your data contains one value per bar, as in the demo data set below. Map the $y$ aesthetic to the variable that contains the bar heights. + +```{r} +demo <- data.frame( + a = c("bar_1","bar_2","bar_3"), + b = c(20, 30, 40) +) + +demo + +ggplot(data = demo) + + geom_bar(mapping = aes(x = a, y = b), stat = "identity") +``` + +Use consideration when you change a geom's stat. Many combinations of geoms and stats will create incompatible results. In practice, I almost always use a geom's default stat. + +`ggplot2` provides 22 stats for you to use. The table below describes each stat and lists the command that will open the stat's help page. As of `ggplot2` version 1.0.1.9003, stats share the same help page as the geom that they are most frequently associated with. + +*** + +`r bookdown::embed_png("images/blank.png", dpi = 300)` *** ### Polar charts -Here's another riddle: how is a bar chart similar to a coxcomb plot, like the one below? +Now that you can make scatterplots and bar charts, let's leave the cartesian coordinate system and examine the polar coordinate system. We will begin with a riddle: how is a bar chart similar to a coxcomb plot, like the one below? ```{r echo = FALSE, message = FALSE, fig.show='hold', fig.width=3, fig.height=4} ggplot(data = diamonds) + @@ -335,9 +607,7 @@ Answer: A coxcomb plot is a bar chart plotted in polar coordinates. #### Coordinate systems -You can make coxcomb plots with `ggplot2` by first building a bar chart and then plotting the chart in polar coordinates. - -To plot your data in polar coordinates, add `coord_polar()` to your plot call. Polar bar charts will look better if you also set the width parameter of `geom_bar()` to 1. This will ensure that no space appears between the bars. 
+To make a coxcomb plot with `ggplot2`, first build a bar chart and then add `coord_polar()` to your plot call. Polar bar charts will look better if you also set the width parameter of `geom_bar()` to 1. This will ensure that no space appears between the bars. ```{r} ggplot(data = diamonds) + @@ -345,15 +615,42 @@ ggplot(data = diamonds) + coord_polar() ``` -You can add `coord_polar()` to any plot in `ggplot2` to draw the plot in polar coordinates. `ggplot2` will map your $y$ variable to $r$ and your $x$ variable to $\theta$. +You can use `coord_polar()` to turn any plot in `ggplot2` into a polar chart. Whenever you add `coord_polar()` to a plot's call, `ggplot2` will draw the plot on a polar coordinate system. It will map the plot's $y$ variable to $r$ and the plot's $x$ variable to $\theta$. You can reverse this behavior by passing `coord_polar()` the argument `theta = "y"`. + +Polar coordinates unlock another riddle as well. You may have noticed that `ggplot2` does not come with a pie chart geom. Why would that be? + +In practice, a pie chart is just a stacked bar chart plotted in polar coordinates. To make a pie chart in `ggplot2`, create a stacked bar chart and: + +1. Ensure that the x axis only has one value. An easy way to do this is to set `x = factor(1)`. +2. Set the width of the bar to one, e.g. `width = 1` +3. Add `coord_polar()` +4. Pass `coord_polar()` the argument `theta = "y"` + +```{r} +ggplot(data = diamonds) + + geom_bar(mapping = aes(x = factor(1), fill = cut), width = 1) + + coord_polar(theta = "y") +``` + +`ggplot2` comes with eight coordinate functions that you can use in the same way as `coord_polar()`. The table below describes each function and what it does. Add any function to your plot's call to change the coordinate system that the plot uses. + +*** + +`r bookdown::embed_png("images/blank.png", dpi = 300)` + +*** + +*** + +**Tip** - You can learn more about each coordinate system by opening its help page in R, e.g.
`?coord_cartesian`, `?coord_fixed`, `?coord_flip`, `?coord_map`, `?coord_polar`, and `?coord_trans`. + +*** #### Facets -Coxcomb plots are especially useful when you make many plots to compare against each other. Each coxcomb will act as a glyph that you can use to compare subgroups of data. +Coxcomb plots are especially useful when you make many coxcomb plots to compare against each other. Each coxcomb will act as a glyph that you can use to compare subsets of your data. The quickest way to draw separate coxcombs for subsets of your data is to facet your plot. When you _facet_ a plot, you split it into subplots that each describe a subset of the data. -You can create a separate coxcomb plot for each subgroup in your data by _faceting_ your plot. To facet your plot on a single discrete variable, add `facet_wrap()` to your plot call. The first argument of `facet_wrap()` is a formula, always a `~` followed by a variable name. - -For example, here we create a separate subplot for each level of the `clarity` variable. The first subplot displays the group of points that have the `clarity` value `I1`. The second subplot displays the group of points that have the `clarity` value `SI2`. And so on. +To facet your plot on a single discrete variable, add `facet_wrap()` to your plot call. The first argument of `facet_wrap()` is a formula, always a `~` followed by a variable name. For example, here we create a separate subplot for each level of the `clarity` variable. The first subplot displays the group of points that have the `clarity` value `I1`. The second subplot displays the group of points that have the `clarity` value `SI2`. And so on. ```{r fig.height = 7, fig.width = 7} ggplot(data = diamonds) + @@ -381,7 +678,26 @@ ggplot(data = mpg) + facet_wrap(~ class) ``` -### Bringing it together +If you prefer to not facet on the rows or columns dimension, place a `.` instead of a variable name before or after the `~`. + +##### Exercises + +1. What graph will this code make? 
+```{r eval = FALSE} +ggplot(data = mpg) + + geom_point(mapping = aes(x = displ, y = hwy)) + + facet_grid(drv ~ .) +``` + +1. What graph will this code make? +```{r eval = FALSE} +ggplot(data = mpg) + + geom_point(mapping = aes(x = displ, y = hwy)) + + facet_grid(. ~ cyl) +``` + + +### The layered grammar of graphics In this section, you learned more than how to make scatterplots, bar charts, and coxcomb plots; you learned a foundation that you can use to make _any_ type of plot with `ggplot2`. @@ -398,32 +714,23 @@ ggplot(data = ) + ``` -The template takes seven parameters, the bracketed words that appear in the template. In practice, you rarely need to supply all seven parameters because `ggplot2` will provide useful defaults for everything except the data, the mappings, and the geom function. +Our new template takes seven parameters, the bracketed words that appear in the template. In practice, you rarely need to supply all seven parameters because `ggplot2` will provide useful defaults for everything except the data, the mappings, and the geom function. -The seven parameters in the template are connected by a powerful idea known as the _Grammar of Graphics_, a system for describing plots. The grammar shows that you can uniquely describe _any_ plot as a combination of---you guessed it: a data set, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, and a faceting scheme. +The seven parameters in the template compose the grammar of graphics, a formal system for building plots. The grammar of graphics is based on the insight that you can uniquely describe _any_ plot as a combination of---you guessed it: a data set, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, and a faceting scheme. -Before we look at the grammar of graphics, let's take a look at the different geoms, stats, position adjustments, coordinate systems, and facetting schemes that you can use in `ggplot2`. 
+To see how this works, consider how you could build a basic plot from scratch: you could start with a data set, transform it into the information that you want to display, choose a geometric object to represent each observation, and map aesthetic properties of the objects to variables in the data to visually display the values of each observation. You'd then select a coordinate system to place the objects into. You'd use the location of the objects (which is itself an aesthetic property) to display the values of the x and y variables. At that point, you would have a complete graph, but you could further adjust the positions of the objects or facet the graph if you like. You could also extend the plot by adding one or more additional layers, where each additional layer contains a data set, a geom, a set of mappings, a stat, and a position adjustment. -## The Vocabulary of `ggplot2` +*** -`ggplot2` comes with 37 geom functions, 22 stats, eight coordinate systems, six position adjustments, two facetting schemes, and 28 aesthetics to map. Each of these options introduces a new set of details to think about. +`r bookdown::embed_png("images/blank.png", dpi = 300)` -This section will guide you through the options, building your ability to make new types of plots as you go. Let's begin with the most noticeable part of a data visualization, the geom. +*** -### Geoms +Although this method may seem complicated, you could use it to build _any_ plot that you imagine. In other words, you can use the code template that you've learned in this chapter to build hundreds of thousands of unique plots. -The geom of a plot is the geometric object that the plot uses to represent data. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. +In the next section, we will use the template to explore a data set.
Along the way, we will build several of the most useful graphs for data scientists. -`ggplot2` provides 37 `geom_` functions that you can use to visualize your data. Each geom is particularly well suited for visualizing a certain type of data or a certain type of relationship. You can loosely classify geoms into groups that: - -1. Visualize distributions -2. Visualize functions between two variables -3. Visualize correlations between two variables -4. Visualize correlations between three variables -5. Visualize maps -6. Display basic objects (graphical primitives) - -Let's examine each group one at a time. For all of the geoms in `ggplot2`, you use the geom by inserting the geom's function into the `` spot in the code template in Section 1. +## Exploratory Data Visualization *** @@ -435,7 +742,7 @@ Let's examine each group one at a time. For all of the geoms in `ggplot2`, you u *** -#### Visualizing Distributions +### Visualizing Distributions The first group of geoms visualizes the _distribution_ of the values in a variable. @@ -505,19 +812,19 @@ The strategy of counting the number of observations at each value breaks down fo To get around this, data scientists divide the range of a continuous variable into equally spaced intervals, a process called _binning_. -`r bookdown::embed_png("images/visualization-17.png", dpi = 150)` +`r bookdown::embed_png("images/visualization-17.png", dpi = 300)` They then count how many observations fall into each bin. -`r bookdown::embed_png("images/visualization-18.png", dpi = 150)` +`r bookdown::embed_png("images/visualization-18.png", dpi = 300)` And display the count as a bar, or some other object. -`r bookdown::embed_png("images/visualization-19.png", dpi = 150)` +`r bookdown::embed_png("images/visualization-19.png", dpi = 300)` This method is temperamental because the appearance of the distribution can change dramatically if the bin size changes. 
As no bin size is "correct," you should explore several bin sizes when examining data. -`r bookdown::embed_png("images/visualization-20.png", dpi = 150)` +`r bookdown::embed_png("images/visualization-20.png", dpi = 300)` Several geoms exist to help you visualize continuous distributions. They almost all use the "bin" stat to implement the above strategy. For each of these geoms, you can set the following arguments for "bin" to use: @@ -657,6 +964,10 @@ Useful arguments that apply to `geom_dotplot()` In practice, I find that `geom_dotplot()` works best with small data sets and takes a lot of tweaking of the binwidth, dotsize, and stackratio arguments to fit the dots within the graph (the stack heights depend entirely on the organization of the dots, which renders the y axis ambiguous). That said, dotplots can be useful as a learning aid. They provide an intuitive representation of a histogram. +### Compare Distributions + +### Visualize Covariation + #### Visualize functions between two variables Distributions provide useful information about variables, but the information is general. By itself, a distribution cannot tell you how the value of a variable in one set of circumstances will differ from the value of the same variable in a different set of circumstances. @@ -896,804 +1207,93 @@ Useful arguments for `geom_smooth()` are: Be careful, `geom_smooth()` will overlay a trend line on every data set, even if the underlying data is uncorrelated. You can avoid being fooled by also inspecting the raw data or calculating the correlation between your variables, e.g. `cor(diamonds$carat, diamonds$price)`. - -##### Visualize correlations between three variables - -##### Visualize maps - -##### Display basic objects (graphical primitives) - - -#### Aesthetic Mappings - -Have you experimented with aesthetics? Great! Here are some things that you may have noticed. 
- -#### Continuous data - -A continuous variable can contain an infinite number of values that can be put in order, like numbers or date-times. `ggplot2` will treat your variable as continuous if it is a numeric, integer, or a recognizable date-time structure (but not a factor, see `?factor`). - - -If your variable is continuous, `ggplot2` will treat it in a special way. `ggplot2` will - -* use a gradient of colors from blue to black for the color aesthetic -* display a colorbar in the legend for the color aesthetic -* not use the shape aesthetic - - `ggplot2` will not use the shape aesthetic to display continuous information because the human eye cannot easily interpolate between shapes. Can you tell whether a shape is three-quarters of the way between a triangle and a circle? How about five-eights of the way? - -`ggplot2` will treat your variable as continuous if it is a numeric, integer, or a recognizable date-time structure (but not a factor, see `?factor`). - -#### Discrete data - -A discrete variable can only contain a finite (or countably infinite) set of values. Character strings and boolean values are examples of discrete data. `ggplot2` will treat your variable as discrete if it is not a numeric, integer, or recognizable date-time structure. - -If your data is discrete, `ggplot2` will: - -* use a set of colors that span the hues of the rainbow. The exact colors will depend on how many hues appear in your graph. `ggplot2` selects the colors in a way that ensures that one color does not visually dominate the others. -* use equally spaced values of size and alpha -* display up to six shapes for the shape aesthetic. - -If your data requires more than six unique shapes, `ggplot2` will print a warning message and only display the first six shapes. You may have noticed this in the graph above (and below), `ggplot2` did not display the suv values, which were the seventh unique class. 
- -```{r} -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy, shape = class)) -``` - -See _Section 7_ to learn how to pick your own colors, shapes, sizes, etc. for `ggplot2` to use. - -#### Multiple aesthetics - -You can use more than one aesthetic at a time. `ggplot2` will combine aesthetic legends when possible. - -```{r} -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy, - color = drv, shape = drv, size = cty)) -``` - -#### Expressions - -You can map an aesthetic to more than a variable. You can map an aesthetic to raw data, or an expression. - -```{r} -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy, - color = 1:234)) -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy, - color = displ < 5)) -``` - -```{r} -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy, color = "blue")) -``` - -#### Setting vs. Mapping - -You can also manually set an aesthetic to a specific level. For example, you can make all of the points in your plot blue. - -```{r} -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy), color = "blue") -``` - -To set an aesthetic manually, call the aesthetic as an argument of your geom function. Then pass the aesthetic a value that R will recognize, such as - -* the name of a color as a character string -* the size of a point as a cex expansion factor (see `?par`) -* the shape as a point as a number code - -R uses the following numeric codes to refer to the following shapes. 
- -```{r echo=FALSE} -pchShow <- - function(extras = c("*",".", "o","O","0","+","-","|","%","#"), - cex = 2, - col = "red3", bg = "gold", coltext = "brown", cextext = 1.1, - main = "") - { - nex <- length(extras) - np <- 26 + nex - ipch <- 0:(np-1) - k <- floor(sqrt(np)) - dd <- c(-1,1)/2 - rx <- dd + range(ix <- ipch %/% k) - ry <- dd + range(iy <- 3 + (k-1)- ipch %% k) - pch <- as.list(ipch) # list with integers & strings - if(nex > 0) pch[26+ 1:nex] <- as.list(extras) - plot(rx, ry, type = "n", axes = FALSE, xlab = "", ylab = "", main = main) - abline(v = ix, h = iy, col = "lightgray", lty = "dotted") - for(i in 1:np) { - pc <- pch[[i]] - points(ix[i], iy[i], pch = pc, col = col, bg = bg, cex = cex) - if(cextext > 0) - text(ix[i] - 0.4, iy[i], pc, col = coltext, cex = cextext) - } - } - -pchShow() -``` - -If you get an odd result, double check that you are calling the aesthetic as its own argument (and not calling it from inside of `mapping = aes()`. - - -Here, `ggplot2` treats `color = "blue"` as a mapping because it appears in the mapping argument. `ggplot2` assumes that "blue" is a value in the data space. It uses R's recycling rules to pair the single value "blue" with each row of data in `mpg`. Then `ggplot2` creates a mapping from the value "blue" in the data space to the pinkish color that we see in the visual space. `ggplot2` even creates a legend to let you know that the color pink represents the value "blue." The choice of pink is a coincidence; `ggplot2` defaults to pink whenever a single discrete value is mapped to the color aesthetic. - -If you experience this type of behavior, remember: - -* define an aesthetic _within_ the `aes()` function to map levels of the aesthetic to values of data. You would expect a legend after this operation. -* define an aesthetic _outside of_ the `aes()` function to manually set the aesthetic to a specific level. You would not expect a legend after this operation. 
- -Remember: - -* define an aesthetic _within_ the `aes()` function to map levels of the aesthetic to values of data. You would expect a legend after this operation. -* define an aesthetic _outside of_ the `aes()` function to manually set the aesthetic to a specific level. You would not expect a legend after this operation. - -#### Group aesthetic - -The _group_ aesthetic is a useful way to apply a monolithic geom, like a smooth line, to multiple subgroups. - -By default, `geom_smooth()` draws a single smoothed line for the entire data set. To draw a separate line for each group of points, set the group aesthetic to a grouping variable or expression. - -```{r message = FALSE} -ggplot(data = mpg) + - geom_smooth(mapping = aes(x = displ, y = hwy, group = displ < 5)) -``` - -`ggplot2` will automatically infer a group aesthetic when you map an aesthetic of a monolithic geom to a discrete variable. Below `ggplot2` infers a group aesthetic from the `linetype = drv` aesthetic. It is useful to combine group aesthetics with secondary aesthetics because `ggplot2` cannot build a legend for a group aesthetic. - -```{r message = FALSE} -ggplot(data = mpg) + - geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv)) -``` - - - - -### Stats - -#### Change a stat - -In many cases, it does not make sense to change a geom's default stat. In other cases, you can change or fine tune the stat to make new graphs. - -You can map the heights of bars in a bar chart to data values---not counts---by changing the stat of the bar chart. This works best if your data set contains one observation per bar, e.g. - -```{r} -demo <- data.frame( - a = c("bar_1","bar_2","bar_3"), - b = c(20, 30, 40) -) -``` - -By default, `geom_bar()` uses the bin stat, which creates a count for each bar. - -```{r} -ggplot(data = demo) + - geom_bar(mapping = aes(x = a)) -``` - -To change the stat of a geom, set its `stat` argument to the name of a stat. 
You may need to supply or remove mappings to accomodate the new stat. - -```{r} -ggplot(data = demo) + - geom_bar(mapping = aes(x = a, y = b), stat = "identity") -``` - -To find a list of available stats, run `help(package = "ggplot2")`. Each stat is listed as a function that begins with `stat_`. Set a geom's stat argument to the part of the function name that follows the underscore, surrounded in quotes, as above. - -Use consideration when you change a stat. Many combinations of geoms and stats create incompatible results. - -#### Set parameters - -Many stats use _parameters_ arguments that fine tune the statistical transformation. For example, the bin stat takes the parameter `width`, which controls the width of the bars in a bar chart. - -To set a parameter of a stat, pass the parameter as an argument to the geom function. +`geom_quantile()` fits a different type of model to your data. Use it to display the results of a quantile regression (see `?rq` for details). Like `geom_smooth()`, `geom_quantile()` takes a formula argument that describes the relationship between $x$ and $y$. ```{r} ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut), width = 1) + geom_point(mapping = aes(x = carat, y = price)) + + geom_quantile(mapping = aes(x = carat, y = price), + quantiles = c(0.05, 0.5, 0.95), + formula = y ~ poly(x, 2)) ``` -To learn which parameters are used by a stat, visit the stat's help page, e.g. `?stat_bin`. +Useful aesthetics for `geom_quantile()` are: -#### Use data from a stat +* x (required) +* y (required) +* alpha +* color +* linetype +* size +* weight -Many stats in `ggplot2` create more data than they display. For example, the `?stat_bin` help page explains that the `stat_bin()` transformation creates four new variables: `count`, `density`, `ncount`, and `ndensity`. `geom_bar()` uses only one of these variables. It maps the `count` variable to the y axis of your plot. 
+Useful arguments for `geom_quantile()` are: -You can use any of the variables created by a stat in an aesthetic mapping. To use a variable created by a stat, surround its name with a pair of dots, `..`. +* `formula` - the formula to use in the quantile regression +* `quantiles` - Conditional quantiles of $y$ to display. Each quantile is displayed with a line. -```{r message = FALSE, fig.show='hold', fig.width=4, fig.height=4} -ggplot(data = diamonds) + - geom_bar(mapping = aes(x = carat)) +`geom_smooth()` and `geom_quantile()` summarize the relationship between two variables as a function, but you can also summarize the relationship as a bivariate distribution. -ggplot(data = diamonds) + - geom_bar(mapping = aes(x = carat, y = ..density..)) -``` - -Note that to do this, you will need to - -1. Determine which stat your geom uses -2. Determine which variables the stat creates from its help page -3. Surround the variable name with `..` - -### Positions - -At the beginning of this section, you learned how to use the fill aesthetic to make a stacked bar chart. +`geom_bin2d()` divides the coordinate plane into a two dimensional grid and then displays the number of observations that fall into each bin in the grid. This technique lets you see where the mass of the data lies; bins with a light fill color contain more data than bins with a dark fill color. Bins with no fill contain no data at all. ```{r} -ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut, fill = clarity)) +ggplot(data = diamonds) + + geom_bin2d(mapping = aes(x = carat, y = price), binwidth = c(0.1, 500)) ``` -But what if you don't want a stacked bar chart? What if you want the chart below? Could you make it?
+Useful aesthetics for `geom_bin2d()` are: -```{r echo = FALSE} -ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge") -``` +* x (required) +* y (required) +* alpha +* color +* fill +* size +* weight -This chart displays the same 40 color coded bars as the stacked bar chart above. Each bar represents a combination of `cut` and `clarity`. +Useful arguments for `geom_bin2d()` are: -However, the position of the bars within the two charts is different. In the stacked bar chart, `ggplot2` stacked the bars on top of each other if they had the same cut. In the second plot, `ggplot2` placed the bars beside each other if they had the same cut. +* `bins` - A vector like `c(30, 40)` that gives the number of bins to use in the horizontal and vertical directions. +* `binwidth` - A vector like `c(0.1, 500)` that gives the binwidths to use in the horizontal and vertical directions. Overrides `bins` when set. +* `drop` - If `TRUE` (default) `geom_bin2d()` removes the fill from all bins that contain zero observations. -You can control this behavior by adding a _position adjustment_ to your call. A position adjustment tells `ggplot2` what to do when two or more objects overlap. - -To set a position adjustment, set the `position` argument of your geom function to one of `"identity"`, `"stack"`, `"dodge"`, `"fill"`, or `"jitter"`. - -#### Position = "identity" - -For many geoms, the default position value is "identity". When `position = "identity"`, `ggplot2` will place each object exactly where it falls in the context of the graph. - -This would make little sense for our bar chart. Each bar would start at `y = 0` and would appear directly above the `cut` value that it describes. Since there are seven bars for each value of `cut`, many bars would overlap. The plot will look suspiciously like a stacked bar chart, but the stacked heights will be inaccurate, as each bar actually extends to `y = 0`. 
Some bars would not appear at all because they would be completely overlapped by other bars. - -To see how such a graph would appear, set `position = "identity"`. +`geom_hex()` works similarly to `geom_bin2d()`, but it divides the coordinate plane into hexagon shaped bins. This can reduce visual artifacts that are introduced by the aligning edges of rectangular bins. ```{r} -ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut, fill = clarity), position = "identity") + - ggtitle('Position = "identity"') +ggplot(data = diamonds) + + geom_hex(mapping = aes(x = carat, y = price), binwidth = c(0.1, 500)) ``` -#### Position = "stack" +`geom_hex()` requires the `hexbin` package, which you can install with `install.packages("hexbin")`. -To avoid confusion, `ggplot2` uses a default "stack" position adjustment for bar charts. When `position = "stack"` `ggplot2` places overlapping objects directly _above_ one another. +`geom_density2d()` uses density contours to display similar information. It is the two dimensional equivalent of `geom_density()`. Interpret a two dimensional density plot the same way you would interpret a contour map. Each line connects points of equal density, which makes changes of slope easy to see. -Here each bar begins exactly where the bar below it ends. +As with other summary geoms, `geom_density2d()` makes a useful second layer. ```{r} -ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut, fill = clarity), position = "stack") + - ggtitle('Position = "stack"') +ggplot(data = diamonds) + + geom_point(mapping = aes(x = carat, y = price)) + + geom_density2d(mapping = aes(x = carat, y = price)) ``` -#### Position = "dodge" +Useful aesthetics for `geom_density2d()` are: -When `position = "dodge"`, `ggplot2` places overlapping objects directly _beside_ one another. This is how I created the graph at the start of the section.
+* x (required) +* y (required) +* alpha +* color +* linetype +* size -```{r} -ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge") + - ggtitle('Position = "dodge"') -``` +Useful arguments for `geom_density2d()` are: -#### Position = "fill" +* `h` - A vector like `c(0.2, 500)` that gives the bandwidths to use to estimate the density in the horizontal and vertical directions. +* `n` - number of gridpoints to use when estimating the density (defaults to 100). -When `position = "fill"`, `ggplot2` uses all of the available space to display overlapping objects. Within that space, `ggplot2` scales each object in proportion to the other objects. `position = "fill"` is the most unusual of the position adjustments, but it creates an easy way to compare relative frequencies across groups. +##### Visualize correlations between three variables -```{r} -ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill") + - ggtitle('Position = "fill"') -``` +There are two ways to add three (or more) variables to a two dimensional plot. You can map additional variables to aesthetics within the plot, or you can use a geom that is designed to visualize three variables. +`ggplot2` provides three geoms that are designed to display three variables: `geom_raster()`, `geom_tile()` and `geom_contour()`. These geoms generalize `geom_bin2d()` and `geom_density2d()` to display a third variable instead of a count or a density. -#### Position = "jitter" - -The last type of position doesn't make sense for bar charts, but it is very useful for scatterplots. Recall our first scatterplot. - -```{r echo = FALSE} -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy)) -``` - -Why does the plot appear to display only 126 points? There are 234 observations in the data set. Also, why do the points appear to be arranged on a grid?
- -The points appear in a grid because the `hwy` and `displ` measurements were rounded to the nearest integer and tenths values. As a result, many points overlap each other because they've been rounded to the same values of `hwy` and `displ`. This also explains why our graph appears to contain only 126 points. 108 points are hidden on top of other points located at the same value. - -This arrangement can cause problems because it makes it hard to see where the mass of the data is. Is there one special combination of `hwy` and `displ` that contains 109 values? Or are the data points more or less equally spread throughout the graph? - -You can avoid this overplotting problem by setting the position adjustment to "jitter". `position = "jitter"` adds a small amount of random noise to each point, as we see above. This spreads the points out because no two points are likely to receive the same amount of random noise. - -```{r} -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy), position = "jitter") -``` - -But isn't this, you know, bad? It *is* true that jittering your data will make it less accurate at the local level, but jittering may make your data _more_ accurate at the global level. By jittering your data, you can see where the mass of your data falls on an overplotted grid. Occasionally, jittering will reveal a pattern that was hidden within the grid. - -`ggplot2` recognizes `position = "jitter"` as shorthand for `position = position_jitter()`. This is true for the other values of position as well: - -* `position = "identity"` is shorthand for `position = position_identity()` -* `position = "stack"` is shorthand for `position = position_stack()` -* `position = "dodge"` is shorthand for `position = position_dodge()` -* `position = "fill"` is shorthand for `position = position_fill()` - -You can use the explanded syntax to specify details of the position process. 
You can also use the expanded syntax to open a help page for each position process (which you will need to do if you wish to learn more). - -```{r eval=FALSE} -?position_jitter -``` - -```{r} -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy), - position = position_jitter(width = 0.03, height = 0.3)) -``` - -### Coordinate systems - -You can make your bar charts even more versatile by changing the coordinate system of your plot. For example, you could flip the x and y axes of your plot, or you could plot your bar chart on polar coordinates to make a coxcomb plot or a polar clock chart. - -```{r echo = FALSE, message = FALSE, fig.show='hold', fig.width=3, fig.height=4} -ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut, fill = cut)) -ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut, fill = cut)) + - coord_flip() -ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut, fill = cut), width = 1) + - coord_polar() -ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut, fill = cut), width = 1) + - coord_polar(theta = "y") -``` - -To change the coordinate system of your plot, add a `coordinate_` function to your plot call. `ggplot2` comes with seven coordinate functions that each implement a different coordinate system. - -#### Cartesian coordinates - -`coord_cartesian()` generates a cartesian coordinate system for your plot. `ggplot2` adds a call to `coord_cartesian()` to your plot by default, but you can also manually add this call. Why would you want to do this? - -You can set the `xlim` and `ylim` arguments of `coord_cartesian()` to zoom in on a region of your plot. Set each argument to a vector of length 2. `ggplot2` will use the first value as the minimum value on the x or y axis. It will use the second value as the maximum value. - -Zooming is not very useful in our bar graph, but it can help us study the sports cars in our scatterplot. 
- -```{r} -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy, color = class)) + - coord_cartesian(xlim = c(4.5, 7.5), ylim = c(20, 30)) -``` - -You can use the same arguments to zoom with any of the coordinate functions in `ggplot2`. - -*** - -*Tip*: You can also zoom by adding `xlim()` and/or `ylim()` to your plot call. - -```{r eval = FALSE} -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy, color = class)) + - xlim(4.5, 7.5) + - ylim(20, 30) -``` - -However, `xlim()` and `ylim()` do not provide a true zoom. Instead, they plot the subset of data that appears within the limits. This may change the appearance of elements that rely on unseen data points, such as a smooth line. - -*** - -#### Fixed coordinates - -`coord_fixed()` also generates a cartesian coordinate system for your plot. However, you can used `coord_fixed()` to set the visual ratio between units on the x axis and units on the y axis. To do this, set the `ratio` argument to the desired ratio in length between y units and x units, e.g. - -$$\text{ratio} = \frac{\text{length of one Y unit}}{\text{length of one X unit}}$$ - -```{r} -ggplot(data = diamonds) + - geom_bar(mapping = aes(x = factor(1), fill = cut)) + - coord_fixed(ratio = 0.5) -``` - -`coord_equal()` does the same thing as `coord_fixed()`. - -#### Flipped coordinates - -Add `coord_flip()` to your plot to switch the x and y axes. - -```{r} -ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut, fill = cut)) + - coord_flip() -``` - -#### Map coordinates - -Add `coord_map()` or `coord_quickmap()` to plot map data on a cartographic projection. See _Section 6_ for more details. - -#### Polar coordinates - -Add `coord_polar()` to your plot to plot your data in polar coordinates. - -```{r} -ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut, fill = cut), width = 1) + - coord_polar() -``` - -By default, `ggplot2` will map your y variable to $r$ and your x variable to $\theta$. 
When applied to a bar chart, this creates a coxcomb plot. - -Reverse this behavior with the argument `theta = "y"`. When applied to a bar chart, this creates a polar clock chart. - -```{r} -ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut, fill = cut), width = 1) + - coord_polar(theta = "y") -``` - -You can also use the `start` argument to control where in the plot your data starts, from 0 to 12 (o'clock), and the `direction` argument to control the orientation of the plot (1 for clockwise, -1 for anti-clockwise). - -*** - -*Tip*: `ggplot2` does not come with a pie chart geom, but you can make a pie chart by plotting a stacked bar chart in polar coordinates. To do this, ensure that: - -* your x axis only has one value, e.g. `x = factor(1)` -* `width = 1` -* `theta = "y"` - -```{r} -ggplot(data = diamonds) + - geom_bar(mapping = aes(x = factor(1), fill = cut), width = 1) + - coord_polar(theta = "y") -``` - -*** - -#### Transformed coordinates - -Add `coord_trans()` to plot your data on cartesian coordinates that have been transformed in some way. To use `coord_trans()`, set the `xtrans` and/or `ytrans` argument to the name of a function that you would like to apply to the x and/or y values. - -```{r} -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy)) + - coord_trans(xtrans = "log", ytran = "log") -``` - -### Facets - -Facets provide a second way to add a variables to a two dimensional graph. When you facet a graph, you divide your data into subgroups and then plot a separate graph, or _facet_, for each subgroup. - -For example, we can divide our data set into four subgroups based on the `cyl` variable: - -1. all of the cars that have four cylinder engines -2. all of the cars that have five cylinder engines (there are some) -3. all of the cars that have six cylinder engines, and -4. all of the cars that have eight cylinder engines - -Or we could divide our data into three groups based on the `drv` variable: - -1. 
all of the cars with four wheel drive (4) -2. all of the cars with front wheel drive (f) -3. all of the cars with rear wheel drive (r) - -We could even divide our data into subgroups based on the combination of two variables: - -1. all of the cars with four wheel drive (4) and 4 cylinders -2. all of the cars with four wheel drive (4) and 5 cylinders -3. all of the cars with four wheel drive (4) and 6 cylinders -4. all of the cars with four wheel drive (4) and 8 cylinders -5. all of the cars with front wheel drive (f) and 4 cylinders -6. all of the cars with front wheel drive (f) and 5 cylinders -7. all of the cars with front wheel drive (f) and 6 cylinders -8. all of the cars with front wheel drive (f) and 8 cylinders -9. all of the cars with rear wheel drive (r) and 4 cylinders -10. all of the cars with rear wheel drive (r) and 5 cylinders -11. all of the cars with rear wheel drive (r) and 6 cylinders -12. all of the cars with rear wheel drive (r) and 8 cylinders - -#### `facet_grid()` - -The graphs below show what a faceted graph looks like. They also show how you can build a faceted graph with `facet_grid()`. I'm not going to tell you how `facet_grid()` works---well at least not yet. That would be too easy. Instead, I would like you to try to induce the syntax of `facet_grid()` from the code below. Consider: - -* Which variables determine how the graph is split into rows? -* Which variables determine how the graph is split into columns? -* What parts of the syntax always stay the same? -* And what does the `.` do? - -Make an honest effort at answering these questions, and then read on past the graphs. - -```{r} -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy)) + - facet_grid(drv ~ cyl) -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy)) + - facet_grid(drv ~ .) -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy)) + - facet_grid(. ~ cyl) -``` - -Ready for the answers? 
- -To facet your graph, add `facet_grid()` to your code. The first argument of `facet_grid()` is always a formula, two variable names separated by a `~`. - -`facet_grid()` will use the first variable in the formula to split the graph into rows. Each row will contain data points that have the same value of the variable. - -`facet_grid()` will use the second variable in the formula to split the graph into columns. Each column will contain data points that have the same value of the second variable. - -This syntax mirrors the rows first, columns second convention of R. - -If you prefer to facet your plot on only one dimension, add a `.` to your formula as a place holder. If you place a `.` before the `~`, `facet_grid()` will not facet on the rows dimension. If you place a `.` after the `~`, `facet_grid()` will not facet on the columns dimension. - -Facets let you quickly compare subgroups by glancing down rows and across columns. Each facet will use the same limits on the x and y axes, but you can change this behavior across rows or columns by adding a scales argument. Set scales to one of - -* `"free_y"` - to let y limits vary accross rows -* `"free_x"` - to let x limits vary accross columns -* `"free"` - to let both x and y limits vary - -For example, the code below lets the limits of the x axes vary across columns. - -```{r} -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy)) + - facet_grid(drv ~ cyl, scales = "free_x") -``` - - -#### `facet_wrap()` - -What if you want to facet on a variable that has too many values to display nicely? - -For example, if we facet on `class`, `ggplot2` must display narrow subplots to fit each subplot into the same column. This makes it diffcult to compare x values with precision. - -```{r} -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy)) + - facet_grid(. ~ class) -``` - -`facet_wrap()` provides a more pleasant way to facet a plot across many values. 
It wraps the subplots into a multi-row, roughly square result. - -```{r} -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy)) + - facet_wrap(~ class) -``` - -The results of `facet_wrap()` can be easier to study than the results of `facet_grid()`. However, `facet_wrap()` can only facet by one variable at a time. - -In other words, you can use the template above to make any graph that you can imagine---at least in theory. Section 2 will examine how this works in practice. The section explains the details of the grammar of graphics works, and it shows how `ggplot2` implements the grammar to build real graphs. - -## The Grammar of Graphics - -The "gg" of `ggplot2` stands for the grammar of graphics, a system for describing plots. According to the grammar, a plot is a combination of seven elements: - -$$\text{plot} = \Big( \text{data} + \text{stat} + \text{geom} + \text{mappings} + \text{position} \Big) + \text{coordinate system} + \text{facet scheme}$$ - -You might not be used to thinking of plots in this way, so let's explore the formula above with a thought exercise. If you had to build a graph from scratch, how would you do it? - -Here's one way. To build a plot, you could begin with a data set to visualize and a coordinate system to visualize the data in. For this thought exercise, we will visualize an abbreviated version of the `mpg` data set, and we will use the cartesian coordinate system. - -`r bookdown::embed_png("images/visualization-3.png", dpi = 400)` - - -You could then choose whether to visualize the data in its raw form, or whether to summarize the data with a transformation and then visualize the summary. Let's visualize our data as in its raw form. This would be the same as applying an identity transformation to the data, since an identity transformation returns the data as it is. 
- -`r bookdown::embed_png("images/visualization-4.png", dpi = 400)` - - -Next, you would need to choose some sort of visual object to represent the observations in your data set. This object will be what you actually draw in the coordinate system. - -Here we will use a set of points. Each point will represent one row of data. Let's call the points "geoms", short for geometrical object. - -`r bookdown::embed_png("images/visualization-5.png", dpi = 400)` - -Next, you could map variables in your data to the visual properties of your geoms. These visual properties are what we call aesthetics. Once you do this, the visual information contained in the point will communicate recorded information contained in the data set. - -Let's map the `cyl` variable to the shape of our points. - -`r bookdown::embed_png("images/visualization-6.png", dpi = 400)` - -One pair of mappings would be particularly important. To place your points into your coordinate system, you would need to map a variable to the x location of the points, which is an aesthetic. Here we map `displ` to the x location. - -`r bookdown::embed_png("images/visualization-7.png", dpi = 400)` - -And you would need to map a variable to the y location of the points, which is also an aesthetic. Here we map `hwy` to the y location. - -`r bookdown::embed_png("images/visualization-8.png", dpi = 400)` - -The process creates a complete graph: - -`r bookdown::embed_png("images/visualization-9.png", dpi = 400)` - -However, you could modify the graph further. You could choose to adjust the position of the points (or not) and to facet the graph (or not). - -`r bookdown::embed_png("images/visualization-10.png", dpi = 400)` - -This process works to make any graph. If you change any of the elements involved, you would end up with a new graph. For example, we could change our geom to a line to make a line graph, or to a bar to make a bar chart. Or we could change the position to "jitter" to make a jittered plot. 
- -`r bookdown::embed_png("images/visualization-11.png", dpi = 400)` - -You could also switch the data set, coordinate system, or any other component of the graph. - -Let's extend the thought expercise to add a model line to the graph. To do this, we will add a new _layer_ to the graph. - -### Layers - -A layer is a collection of a data set, a stat, a geom, and a position adjustment. You can add a layer to a coordinate system and faceting scheme to make a complete graph, or you can add a layer to an existing graph to make a layered graph. - -Let's build a layer that uses the same data set as our previous graph. In this layer, we will apply a "smooth" stat to the data. The stat fits a model to the data and then returns a transformed data set with three new columns: - -* `y` - the value of the model line at each data point -* `ymin` - the y value of the bottom of the confidence interval associated with the model at each data point -* `ymax` - the y value of the top of the confidence interval associated with the model at each point - -`r bookdown::embed_png("images/visualization-12.png", dpi = 400)` - -In this layer, we will represent the observations with a line geom. We map the x values of the line to `displ` and we map the y values to our new `y` variable. We won't use a position adjustment. - -`r bookdown::embed_png("images/visualization-13.png", dpi = 400)` - -We now have a "layer" that we can add to a coordinate system and faceting scheme to make a complete graph. - -`r bookdown::embed_png("images/visualization-14.png", dpi = 400)` - -Or we can add the layer to our previous graph to make a plot that shows both summary information and raw data. - -`r bookdown::embed_png("images/visualization-15.png", dpi = 400)` - -For completion, let's add one more layer. This layer will begin with the same data set as the previous layer. It will also use the same stat. However, we will use the ribbon geom to visualize the data points. 
A ribbon is similar to a shaded region contained by two lines. - -We map the top of the ribbon to `ymax` and the bottom of the ribbon to `ymin`. We map the x position of the ribbon to `displ`. We will not use a position adjustment. - -We can now add the layer to our graph to show in one plot: - -* raw data -* a visual summary of the data (the smooth line) -* the uncertainty associated with the summary - -`r bookdown::embed_png("images/visualization-16.png", dpi = 400)` - -If you like, you can continue to add layers to the graph (but the graph will soon become cluttered). - -The thought exercise shows that the elements of the grammar of graphics work together to build a graph. You can describe any graph with these elements, and each unique combination of elements makes a single, unique graph. You can also extend a graph by adding layers of new data, stats, geoms, mappings, and positions. - - -In other words, you can extend the grammar of graphics formula indefinitely to make layered plots: - -$$ -\begin{aligned} -\text{plot} = & \Big( \text{data} + \text{stat} + \text{geom} + \text{mappings} + \text{position} \Big) + \\ -& \Big( \text{data} + \text{stat} + \text{geom} + \text{mappings} + \text{position} \Big)^{*} + \\ -& \Big( \text{data} + \text{stat} + \text{geom} + \text{mappings} + \text{position} \Big)^{*} + \\ -& \text{coordinate system} + \text{facet scheme} -\end{aligned} -$$ - -### Working with layers - -`ggplot2` syntax matches this formulation almost exactly. The basic low level function of `ggplot2` is `layer()` which combines data, stats, geoms, mappings, and positions into a single layer to plot. - -If you have time on your hands, you can use `layer()` to create a multi-level plot like the one above. Initialize your plot with `ggplot()`. Then add as many calls to `layer()` as you like. Give each layer its own `data`, `stat`, `geom`, `mapping`, and `position` arguments. 
- -```{r message = FALSE} -ggplot() + - layer( - data = mpg, - stat = "identity", - geom = "point", - mapping = aes(x = displ, y = hwy), - position = "identity" - ) + - layer( - data = mpg, - stat = "smooth", - geom = "ribbon", - mapping = aes(x = displ, y = hwy), - position = "identity" - ) + - layer( - data = mpg, - stat = "smooth", - geom = "line", - mapping = aes(x = displ, y = hwy), - position = "identity" - ) + - coord_cartesian() -``` - -Although you can build all of your graphs this way, few people do because `ggplot2` supplies some very efficient shortcuts. - -For example, you will find in practice that you almost always pair the same geoms with the same stats and position adjustments. For instance, you will almost always use the point geom with the "identity" stat and the "identity" position. Similarly, you will almost always use the bar geom with the "bin" stat and the "stack" position. - -The `geom_` functions in `ggplot2` take advantage of these common combinations. Like `layer()`, each geom function builds a layer, but the geom functions preset the geom, stat, and position values of the layer to useful defaults. The geom that appears in the function name becomes the geom of the layer. The stat and postion most commonly asscoiated with the geom become the default stat and position of the layer. - -`ggplot2` even provides geom functions for less common, but still useful combinations of geoms, stats, and positions. For example, the function `geom_jitter()` builds a layer that has a point geom, an "identity" stat, and a "jitter" position. The function `geom_smooth()` builds a "layer" that is made of two sub-layers: a line layer that displays a model line and ribbon layer that displays a standard error band. - -As a result, `geom_` functions provide a more direct syntax for making plots, one that you are already familiar with from Section 1. 
- -```{r message = FALSE} -ggplot() + - geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + - geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy)) -``` - -#### Multiple geoms - -As with `layer()`, you can add multiple geom functions to a single plot call. - -This system lets you build sophisticated graphs geom by geom, but it also makes it possible to write repetitive code. For example, the code above repeats the arguments `data = mpg` and `mapping = aes(x = displ, y = hwy)`. Repetition makes your code harder to read and write, and it also increases the chance of typos and errors. - -You can avoid repetition by passing the repeated mappings to `ggplot()`. `ggplot2` will treat mappings that appear in `ggplot()` as global mappings to be applied to each layer. For example, we can eliminate the duplication of `mapping = aes(x = displ, y = hwy)` in our previous code with a global mapping argument: - -```{r, eval = FALSE} -ggplot(mapping = aes(x = displ, y = hwy)) + - geom_point(data = mpg) + - geom_smooth(data = mpg) -``` - -You can even combine global mappings with local mappings to differentiate geoms. - -* Mappings that appear in `ggplot()` will be applied to each geom. -* Mappings that appear in a geom function will be applied to that geom only. -* If a local aesthetic mapping conflicts with a global aesthetic mapping, `ggplot2` will use the local mapping. This is arbitrated on an aesthetic by aesthetic basis. - -```{r, message = FALSE} -ggplot(mapping = aes(x = displ, y = hwy)) + - geom_point(data = mpg, mapping = aes(color = class)) + - geom_smooth(data = mpg) -``` - -This system lets us overlay a single smooth line on a set of colored points. Notice that this would not occur if you add the color aesthetic to the global mappings. In that case, smooth would use the color mapping to draw a different colored line for each class of cars. - -You can use the same system to specify a global data set for every layer. 
In other words, - -```{r, eval = FALSE} -ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + - geom_point() + - geom_smooth() -``` - -is analagous to - -```{r, eval = FALSE} -ggplot(mapping = aes(x = displ, y = hwy)) + - geom_point(data = mpg) + - geom_smooth(data = mpg) -``` - -As with mappings, you can define a local data argument to override the global data argument on a layer by layer basis. - -```{r, message = FALSE, warning = FALSE} -ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + - geom_point() + - geom_smooth(data = subset(mpg, cyl == 8)) -``` - -### Recap - -Your understanding of the `ggplot2` syntax is now complete. You understand the grammar written into the syntax, and you know how to extend the syntax by adding extra layers to your plot, as well as how to truncate the syntax by relying on `ggplot2`'s default settings. - -Only one thing remains. You need to learn the vocabulary of function names and argument options that you can use with your code template. - -Section 3 will guide you through these functions and arguments. It catalogues all of the options that `ggplot2` puts at your fingertips for geoms, mappings, stats, position adjustments, and coordinate systems. - -## Customizing plots -### Titles -### Guides -### Scales -#### Color -#### Size -#### Shape -### Themes -### Zoom -### Saving plots - - - - -## Summary - -> "A picture is not merely worth a thousand words, it is much more likely to be scrutinized than words are to be read."---John Tukey - +`geom_raster()` and `geom_tile()`