diff --git a/data-visualize.Rmd b/data-visualize.Rmd index 01969bc..907a496 100644 --- a/data-visualize.Rmd +++ b/data-visualize.Rmd @@ -4,7 +4,7 @@ > "The simple graph has brought more information to the data analyst's mind than any other device." --- John Tukey -This chapter will teach you how to visualise your data using ggplot2. +This chapter will teach you how to visualize your data using ggplot2. R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2 implements the **grammar of graphics**, a coherent system for describing and building graphs. With ggplot2, you can do more faster by learning one system and applying it in many places. @@ -13,7 +13,7 @@ If you'd like to learn more about the theoretical underpinnings of ggplot2, I'd ### Prerequisites -This chapter focusses on ggplot2, one of the core members of the tidyverse. +This chapter focuses on ggplot2, one of the core members of the tidyverse. To access the datasets, help pages, and functions that we will use in this chapter, load the tidyverse by running this code: ```{r setup} @@ -25,7 +25,8 @@ It also tells you which functions from the tidyverse conflict with functions in If you run this code and get the error message "there is no package called 'tidyverse'", you'll need to first install it, then run `library()` once again. -```{r, eval = FALSE} +```{r} +#| eval: false install.packages("tidyverse") library(tidyverse) ``` @@ -57,7 +58,7 @@ mpg Among the variables in `mpg` are: -1. `displ`, a car's engine size, in litres. +1. `displ`, a car's engine size, in liters. 2. `hwy`, a car's fuel efficiency on the highway, in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance. @@ -68,7 +69,8 @@ To learn more about `mpg`, open its help page by running `?mpg`. 
To plot `mpg`, run this code to put `displ` on the x-axis and `hwy` on the y-axis: -```{r, fig.alt = "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association."} +```{r} +#| fig.alt: "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association." ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) ``` @@ -97,7 +99,8 @@ ggplot2 looks for the mapped variables in the `data` argument, in this case, `mp Let's turn this code into a reusable template for making graphs with ggplot2. To make a graph, replace the bracketed sections in the code below with a dataset, a geom function, or a collection of mappings. -```{r, eval = FALSE} +```{r} +#| eval: false ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>)) ``` @@ -129,7 +132,9 @@ In the plot below, one group of points (highlighted in red) seems to fall outsid These cars have a higher mileage than you might expect. How can you explain these cars? -```{r, echo = FALSE, fig.alt = "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association. Cars with engine size greater than 5 litres and highway fuel efficiency greater than 20 miles per gallon stand out from the rest of the data and are highlighted in red."} +```{r} +#| echo: false +#| fig.alt: "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association. Cars with engine size greater than 5 litres and highway fuel efficiency greater than 20 miles per gallon stand out from the rest of the data and are highlighted in red." 
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point() + geom_point(data = dplyr::filter(mpg, displ > 5, hwy > 20), colour = "red", size = 2.2) @@ -147,7 +152,10 @@ You can display a point (like the one below) in different ways by changing the v Since we already use the word "value" to describe data, let's use the word "level" to describe aesthetic properties. Here we change the levels of a point's size, shape, and color to make the point small, triangular, or blue: -```{r, echo = FALSE, fig.asp = 1/4, fig.alt = "Diagram that shows four plotting characters next to each other. The first is a large circle, the second is a small circle, the third is a triangle, and the fourth is a blue circle."} +```{r} +#| echo: false +#| fig.asp: 1/4 +#| fig.alt: "Diagram that shows four plotting characters next to each other. The first is a large circle, the second is a small circle, the third is a triangle, and the fourth is a blue circle." ggplot() + geom_point(aes(1, 1), size = 20) + geom_point(aes(2, 1), size = 10) + @@ -159,9 +167,10 @@ ggplot() + ``` You can convey information about your data by mapping the aesthetics in your plot to the variables in your dataset. -For example, you can map the colours of your points to the `class` variable to reveal the class of each car. +For example, you can map the colors of your points to the `class` variable to reveal the class of each car. -```{r, fig.alt = "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association. The points representing each car are coloured according to the class of the car. The legend on the right of the plot shows the mapping between colours and levels of the class variable: 2seater, compact, midsize, minivan, pickup, or suv."} +```{r} +#| fig.alt: "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association. The points representing each car are coloured according to the class of the car. 
The legend on the right of the plot shows the mapping between colours and levels of the class variable: 2seater, compact, midsize, minivan, pickup, or suv." ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = class)) ``` @@ -172,7 +181,7 @@ To map an aesthetic to a variable, associate the name of the aesthetic to the na ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as **scaling**. ggplot2 will also add a legend that explains which levels correspond to which values. -The colours reveal that many of the unusual points (with engine size greater than 5 litres and highway fuel efficiency greater than 20 miles per gallon) are two-seater cars. +The colors reveal that many of the unusual points (with engine size greater than 5 liters and highway fuel efficiency greater than 20 miles per gallon) are two-seater cars. These cars don't seem like hybrids, and are, in fact, sports cars! Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have large engines. @@ -181,14 +190,23 @@ In the above example, we mapped `class` to the color aesthetic, but we could hav In this case, the exact size of each point would reveal its class affiliation. We get a *warning* here, because mapping an unordered variable (`class`) to an ordered aesthetic (`size`) is not a good idea. -```{r, fig.alt = "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association. The points representing each car are sized according to the class of the car. 
The legend on the right of the plot shows the mapping between colours and levels of the class variable -- going from small to large: 2seater, compact, midsize, minivan, pickup, or suv."} +```{r} +#| fig.alt: "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association. The points representing each car are sized according to the class of the car. The legend on the right of the plot shows the mapping between colours and levels of the class variable -- going from small to large: 2seater, compact, midsize, minivan, pickup, or suv." ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, size = class)) ``` Or we could have mapped `class` to the *alpha* aesthetic, which controls the transparency of the points, or to the *shape* aesthetic, which controls the shape of the points. -```{r, fig.width = 4, out.width = "50%", fig.align = 'default', warning = FALSE, fig.asp = 1/2, fig.cap ="", fig.alt = "Two scatterplots next to each other, both visualizing highway fuel efficiency versus engine size of cars in ggplot2::mpg and showing a negative association. In the plot on the left class is mapped to the alpha aesthetic, resulting in different transparency levels for each level of class. In the plot on the right class is mapped the shape aesthetic, resulting in different plotting character shapes for each level of class. Each plot comes with a legend that shows the mapping between alpha level or shape and levels of the class variable."} +```{r} +#| fig.width: 4 +#| out.width: "50%" +#| fig.align: "default" +#| warning: false +#| fig.asp: 1/2 +#| fig.cap: "" +#| fig.alt: "Two scatterplots next to each other, both visualizing highway fuel efficiency versus engine size of cars in ggplot2::mpg and showing a negative association. In the plot on the left class is mapped to the alpha aesthetic, resulting in different transparency levels for each level of class. 
In the plot on the right class is mapped the shape aesthetic, resulting in different plotting character shapes for each level of class. Each plot comes with a legend that shows the mapping between alpha level or shape and levels of the class variable." + # Left ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, alpha = class)) @@ -214,7 +232,9 @@ The axis line acts as a legend; it explains the mapping between locations and va You can also *set* the aesthetic properties of your geom manually. For example, we can make all of the points in our plot blue: -```{r, fig.alt = "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association. All points are blue."} +```{r} +#| fig.alt: "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association. All points are blue." + ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "blue") ``` @@ -229,7 +249,14 @@ You'll need to pick a level that makes sense for that aesthetic: - The shape of a point as a number, as shown in Figure \@ref(fig:shapes). -```{r shapes, echo = FALSE, fig.asp = 1/2.75, fig.cap="R has 25 built in shapes that are identified by numbers. There are some seeming duplicates: for example, 0, 15, and 22 are all squares. The difference comes from the interaction of the `colour` and `fill` aesthetics. 
The hollow shapes (0--14) have a border determined by `colour`; the solid shapes (15--20) are filled with `colour`; the filled shapes (21--24) have a border of `colour` and are filled with `fill`.", warning = FALSE, fig.alt = "Mapping between shapes and the numbers that represent them: 0 - square, 1 - circle, 2 - triangle point up, 3 - plus, 4 - cross, 5 - diamond, 6 - triangle point down, 7 - square cross, 8 - star, 9 - diamond plus, 10 - circle plus, 11 - triangles up and down, 12 - square plus, 13 - circle cross, 14 - square and triangle down, 15 - filled square, 16 - filled circle, 17 - filled triangle point-up, 18 - filled diamond, 19 - solid circle, 20 - bullet (smaller circle), 21 - filled circle blue, 22 - filled square blue, 23 - filled diamond blue, 24 - filled triangle point-up blue, 25 - filled triangle point down blue."} +```{r} +#| label: shapes +#| echo: false +#| warning: false +#| fig.asp: 1/2.75 +#| fig.cap: "R has 25 built in shapes that are identified by numbers. There are some seeming duplicates: for example, 0, 15, and 22 are all squares. The difference comes from the interaction of the `colour` and `fill` aesthetics. The hollow shapes (0--14) have a border determined by `colour`; the solid shapes (15--20) are filled with `colour`; the filled shapes (21--24) have a border of `colour` and are filled with `fill`." +#| fig.alt: "Mapping between shapes and the numbers that represent them: 0 - square, 1 - circle, 2 - triangle point up, 3 - plus, 4 - cross, 5 - diamond, 6 - triangle point down, 7 - square cross, 8 - star, 9 - diamond plus, 10 - circle plus, 11 - triangles up and down, 12 - square plus, 13 - circle cross, 14 - square and triangle down, 15 - filled square, 16 - filled circle, 17 - filled triangle point-up, 18 - filled diamond, 19 - solid circle, 20 - bullet (smaller circle), 21 - filled circle blue, 22 - filled square blue, 23 - filled diamond blue, 24 - filled triangle point-up blue, 25 - filled triangle point down blue." 
+ shapes <- tibble( shape = c(0, 1, 2, 5, 3, 4, 6:19, 22, 21, 24, 23, 20), x = (0:24 %/% 5) / 2, @@ -251,7 +278,8 @@ ggplot(shapes, aes(x, y)) + 1. What's gone wrong with this code? Why are the points not blue? - ```{r, fig.alt = "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association. All points are red and the legend shows a red point that is mapped to the word 'blue'."} + ```{r} + #| fig.alt: "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association. All points are red and the legend shows a red point that is mapped to the word 'blue'." ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = "blue")) ``` @@ -312,7 +340,9 @@ To facet your plot by a single variable, use `facet_wrap()`. The first argument of `facet_wrap()` is a formula, which you create with `~` followed by a variable name (here, "formula" is the name of a data structure in R, not a synonym for "equation"). The variable that you pass to `facet_wrap()` should be discrete. -```{r, fig.alt = "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg, faceted by class, with facets spanning two rows."} +```{r} +#| fig.alt: "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg, faceted by class, with facets spanning two rows." + ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + facet_wrap(~ class, nrow = 2) @@ -322,7 +352,9 @@ To facet your plot on the combination of two variables, add `facet_grid()` to yo The first argument of `facet_grid()` is also a formula. This time the formula should contain two variable names separated by a `~`. -```{r, fig.alt = "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg, faceted by number of cylinders across rows and by type of drive train across columns. This results in a 4x3 grid of 12 facets. 
Some of these facets have no observations: 5 cylinders and 4 wheel drive, 4 or 5 cylinders and front wheel drive."} +```{r} +#| fig.alt: "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg, faceted by number of cylinders across rows and by type of drive train across columns. This results in a 4x3 grid of 12 facets. Some of these facets have no observations: 5 cylinders and 4 wheel drive, 4 or 5 cylinders and front wheel drive." + ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + facet_grid(drv ~ cyl) @@ -337,7 +369,9 @@ If you prefer to not facet in the rows or columns dimension, use a `.` instead o 2. What do the empty cells in plot with `facet_grid(drv ~ cyl)` mean? How do they relate to this plot? - ```{r, fig.alt = "Scatterplot of number of cycles versus type of drive train of cars in ggplot2::mpg. Shows that there are no cars with 5 cylinders that are 4 wheel drive or with 4 or 5 cylinders that are front wheel drive."} + ```{r} + #| fig.alt: "Scatterplot of number of cycles versus type of drive train of cars in ggplot2::mpg. Shows that there are no cars with 5 cylinders that are 4 wheel drive or with 4 or 5 cylinders that are front wheel drive." + ggplot(data = mpg) + geom_point(mapping = aes(x = drv, y = cyl)) ``` @@ -345,7 +379,9 @@ If you prefer to not facet in the rows or columns dimension, use a `.` instead o 3. What plots does the following code make? What does `.` do? - ```{r, eval = FALSE} + ```{r} + #| eval: false + ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + facet_grid(drv ~ .) @@ -357,7 +393,9 @@ If you prefer to not facet in the rows or columns dimension, use a `.` instead o 4. Take the first faceted plot in this section: - ```{r, eval = FALSE} + ```{r} + #| eval: false + ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + facet_wrap(~ class, nrow = 2) @@ -376,7 +414,9 @@ If you prefer to not facet in the rows or columns dimension, use a `.` instead o 6. 
Which of the following two plots makes it easier to compare engine size (`displ`) across cars with different drive trains? What does this say about when to place a faceting variable across rows or columns? - ```{r, fig.alt = "Two faceted plots, both visualizing highway fuel efficiency versus engine size of cars in ggplot2::mpg, faceted by drive train. In the top plot, facet are organized across rows and in the second, across columns."} + ```{r} + #| fig.alt: "Two faceted plots, both visualizing highway fuel efficiency versus engine size of cars in ggplot2::mpg, faceted by drive train. In the top plot, facet are organized across rows and in the second, across columns." + ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + facet_grid(drv ~ .) @@ -389,7 +429,9 @@ If you prefer to not facet in the rows or columns dimension, use a `.` instead o 7. Recreate this plot using `facet_wrap()` instead of `facet_grid()`. How do the positions of the facet labels change? - ```{r, fig.alt = "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg, faceted by type of drive train across rows."} + ```{r} + #| fig.alt: "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg, faceted by type of drive train across rows." + ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + facet_grid(drv ~ .) @@ -399,7 +441,14 @@ If you prefer to not facet in the rows or columns dimension, use a `.` instead o How are these two plots similar? -```{r, echo = FALSE, fig.width = 4, out.width = "50%", fig.align="default", message = FALSE, fig.alt = "Two plots: the plot on the left is a scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg and the plot on the right shows a smooth curve that follows the trajectory of the relationship between these variables. 
A confidence interval around the smooth curve is also displayed."} +```{r} +#| echo: false +#| message: false +#| fig.width: 4 +#| out.width: "50%" +#| fig.align: "default" +#| fig.alt: "Two plots: the plot on the left is a scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg and the plot on the right shows a smooth curve that follows the trajectory of the relationship between these variables. A confidence interval around the smooth curve is also displayed." + ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) @@ -422,7 +471,9 @@ The plot on the left uses the point geom, and the plot on the right uses the smo To change the geom in your plot, change the geom function that you add to `ggplot()`. For instance, to make the plots above, you can use this code: -```{r, eval = FALSE} +```{r} +#| eval: false + # left ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) @@ -438,7 +489,10 @@ You could set the shape of a point, but you couldn't set the "shape" of a line. On the other hand, you *could* set the linetype of a line. `geom_smooth()` will draw a different line, with a different linetype, for each unique value of the variable that you map to linetype. -```{r, message = FALSE, fig.alt = "A plot of highway fuel efficiency versus engine size of cars in ggplot2::mpg. The data are represented with smooth curves, which use a different line type (solid, dashed, or long dashed) for each type of drive train. Confidence intervals around the smooth curves are also displayed."} +```{r} +#| message: false +#| fig.alt: "A plot of highway fuel efficiency versus engine size of cars in ggplot2::mpg. The data are represented with smooth curves, which use a different line type (solid, dashed, or long dashed) for each type of drive train. Confidence intervals around the smooth curves are also displayed." 
+ ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv)) ``` @@ -447,9 +501,13 @@ Here `geom_smooth()` separates the cars into three lines based on their `drv` va One line describes all of the points that have a `4` value, one line describes all of the points that have an `f` value, and one line describes all of the points that have an `r` value. Here, `4` stands for four-wheel drive, `f` for front-wheel drive, and `r` for rear-wheel drive. -If this sounds strange, we can make it more clear by overlaying the lines on top of the raw data and then colouring everything according to `drv`. +If this sounds strange, we can make it more clear by overlaying the lines on top of the raw data and then coloring everything according to `drv`. + +```{r} +#| echo: false +#| message: false +#| fig.alt: "A plot of highway fuel efficiency versus engine size of cars in ggplot2::mpg. The data are represented with points (coloured by drive train) as well as smooth curves (where line type is determined based on drive train as well). Confidence intervals around the smooth curves are also displayed." -```{r, echo = FALSE, message = FALSE, fig.alt = "A plot of highway fuel efficiency versus engine size of cars in ggplot2::mpg. The data are represented with points (coloured by drive train) as well as smooth curves (where line type is determined based on drive train as well). Confidence intervals around the smooth curves are also displayed."} ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + geom_point() + geom_smooth(mapping = aes(linetype = drv)) @@ -469,7 +527,13 @@ ggplot2 will draw a separate object for each unique value of the grouping variab In practice, ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the `linetype` example). It is convenient to rely on this feature because the group aesthetic by itself does not add a legend or distinguishing features to the geoms. 
-```{r, fig.width = 3, fig.align = 'default', out.width = "33%", message = FALSE, fig.alt = "Three plots, each with highway fuel efficiency on the y-axis and engine size of cars in ggplot2::mpg, where data are represented by a smooth curve. The first plot only has these two variables, the center plot has three separate smooth curves for each level of drive train, and the right plot not only has the same three separate smooth curves for each level of drive train but these curves are plotted in different colours, without a legend explaining which color maps to which level. Confidence intervals around the smooth curves are also displayed."} +```{r} +#| fig.width: 3 +#| fig.align: "default" +#| out.width: "33%" +#| message: false +#| fig.alt: "Three plots, each with highway fuel efficiency on the y-axis and engine size of cars in ggplot2::mpg, where data are represented by a smooth curve. The first plot only has these two variables, the center plot has three separate smooth curves for each level of drive train, and the right plot not only has the same three separate smooth curves for each level of drive train but these curves are plotted in different colours, without a legend explaining which color maps to which level. Confidence intervals around the smooth curves are also displayed." + ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy)) @@ -485,7 +549,10 @@ ggplot(data = mpg) + To display multiple geoms in the same plot, add multiple geom functions to `ggplot()`: -```{r, message = FALSE, fig.alt = "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg with a smooth curve overlaid. A confidence interval around the smooth curves is also displayed."} +```{r} +#| message: false +#| fig.alt: "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg with a smooth curve overlaid. A confidence interval around the smooth curves is also displayed." 
+ ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + geom_smooth(mapping = aes(x = displ, y = hwy)) @@ -498,7 +565,9 @@ You can avoid this type of repetition by passing a set of mappings to `ggplot()` ggplot2 will treat these mappings as global mappings that apply to each geom in the graph. In other words, this code will produce the same plot as the previous code: -```{r, eval = FALSE} +```{r} +#| eval: false + ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point() + geom_smooth() @@ -508,7 +577,10 @@ If you place mappings in a geom function, ggplot2 will treat them as local mappi It will use these mappings to extend or overwrite the global mappings *for that layer only*. This makes it possible to display different aesthetics in different layers. -```{r, message = FALSE, fig.alt = "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg, where points are coloured according to the car class. A smooth curve following the trajectory of the relationship between highway fuel efficiency versus engine size of cars is overlaid along with a confidence interval around it."} +```{r} +#| message: false +#| fig.alt: "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg, where points are coloured according to the car class. A smooth curve following the trajectory of the relationship between highway fuel efficiency versus engine size of cars is overlaid along with a confidence interval around it." + ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(mapping = aes(color = class)) + geom_smooth() @@ -518,7 +590,10 @@ You can use the same idea to specify different `data` for each layer. Here, our smooth line displays just a subset of the `mpg` dataset, the subcompact cars. The local data argument in `geom_smooth()` overrides the global data argument in `ggplot()` for that layer only. 
-```{r, message = FALSE, fig.alt = "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg, where points are coloured according to the car class. A smooth curve following the trajectory of the relationship between highway fuel efficiency versus engine size of subcompact cars is overlaid along with a confidence interval around it."} +```{r} +#| message: false +#| fig.alt: "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg, where points are coloured according to the car class. A smooth curve following the trajectory of the relationship between highway fuel efficiency versus engine size of subcompact cars is overlaid along with a confidence interval around it." + ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(mapping = aes(color = class)) + geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE) @@ -536,7 +611,9 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 2. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions. - ```{r, eval = FALSE} + ```{r} + #| eval: false + ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + geom_point() + geom_smooth(se = FALSE) @@ -551,7 +628,9 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 5. Will these two graphs look different? Why/why not? - ```{r, eval = FALSE} + ```{r} + #| eval: false + ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point() + geom_smooth() @@ -564,7 +643,14 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 6. Recreate the R code necessary to generate the following graphs. Note that wherever a categorical variable is used in the plot, it's `drv`. - ```{r, echo = FALSE, fig.width = 3, out.width = "50%", fig.align = "default", message = FALSE, fig.alt = "There are six scatterplots in this figure, arranged in a 3x2 grid. 
In all plots highway fuel efficiency of cars in ggplot2::mpg are on the y-axis and engine size is on the x-axis. The first plot shows all points in black with a smooth curve overlaid on them. In the second plot points are also all black, with separate smooth curves overlaid for each level of drive train. On the third plot, points and the smooth curves are represented in different colours for each level of drive train. In the fourth plot the points are represented in different colours for each level of drive train but there is only a single smooth line fitted to the whole data. In the fifth plot, points are represented in different colours for each level of drive train, and a separate smooth curve with different line types are fitted to each level of drive train. And finally in the sixth plot points are represented in different colours for each level of drive train and they have a thick white border."} + ```{r} + #| echo: false + #| message: false + #| fig.width: 3 + #| out.width: "50%" + #| fig.align: "default" + #| fig.alt: "There are six scatterplots in this figure, arranged in a 3x2 grid. In all plots highway fuel efficiency of cars in ggplot2::mpg are on the y-axis and engine size is on the x-axis. The first plot shows all points in black with a smooth curve overlaid on them. In the second plot points are also all black, with separate smooth curves overlaid for each level of drive train. On the third plot, points and the smooth curves are represented in different colours for each level of drive train. In the fourth plot the points are represented in different colours for each level of drive train but there is only a single smooth line fitted to the whole data. In the fifth plot, points are represented in different colours for each level of drive train, and a separate smooth curve with different line types are fitted to each level of drive train. 
And finally in the sixth plot points are represented in different colours for each level of drive train and they have a thick white border." + ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point() + geom_smooth(se = FALSE) @@ -594,7 +680,9 @@ The following chart displays the total number of diamonds in the `diamonds` data The `diamonds` dataset is in the ggplot2 package and contains information on \~54,000 diamonds, including the `price`, `carat`, `color`, `clarity`, and `cut` of each diamond. The chart shows that more diamonds are available with high quality cuts than with low quality cuts. -```{r, fig.alt = "Bar chart of number of each each cut of diamond in the ggplots::diamonds dataset. There are roughly 1500 fair diamonds, 5000 good, 12000 very good, 14000 premium, and 22000 ideal cut diamonds."} +```{r} +#| fig.alt: "Bar chart of number of each each cut of diamond in the ggplots::diamonds dataset. There are roughly 1500 fair diamonds, 5000 good, 12000 very good, 14000 premium, and 22000 ideal cut diamonds." + ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut)) ``` @@ -614,7 +702,11 @@ Other graphs, like bar charts, calculate new values to plot: The algorithm used to calculate new values for a graph is called a **stat**, short for statistical transformation. The figure below describes how this process works with `geom_bar()`. -```{r, echo = FALSE, out.width = "100%", fig.alt = 'A figure demonstrating three steps of creating a bar chart: 1. geom_bar() begins with the diamonds data set. 2. geom_bar() transforms the data with the "count" stat, which returns a data set of cut values and counts. 3. geom_bar() uses the transformed data to build the plot. cut is mapped to the x-axis, count is mapped to the y-axis.'} +```{r} +#| echo: false +#| out.width: "100%" +#| fig.alt: 'A figure demonstrating three steps of creating a bar chart: 1. geom_bar() begins with the diamonds data set. 2. 
geom_bar() transforms the data with the "count" stat, which returns a data set of cut values and counts. 3. geom_bar() uses the transformed data to build the plot. cut is mapped to the x-axis, count is mapped to the y-axis.' + knitr::include_graphics("images/visualization-stat-bar.png") ``` @@ -626,7 +718,9 @@ That describes how it computes two new variables: `count` and `prop`. You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using `stat_count()` instead of `geom_bar()`: -```{r, fig.alt = "Bar chart of number of each each cut of diamond in the ggplots::diamonds dataset. There are roughly 1500 fair diamonds, 5000 good, 12000 very good, 14000 premium, and 22000 ideal cut diamonds."} +```{r} +#| fig.alt: "Bar chart of number of each each cut of diamond in the ggplots::diamonds dataset. There are roughly 1500 fair diamonds, 5000 good, 12000 very good, 14000 premium, and 22000 ideal cut diamonds." + ggplot(data = diamonds) + stat_count(mapping = aes(x = cut)) ``` @@ -640,7 +734,10 @@ There are three reasons you might need to use a stat explicitly: This lets me map the height of the bars to the raw values of a $y$ variable. Unfortunately when people talk about bar charts casually, they might be referring to this type of bar chart, where the height of the bar is already present in the data, or the previous bar chart where the height of the bar is generated by counting rows. - ```{r, warning = FALSE, fig.alt = "Bar chart of number of each each cut of diamond in the ggplots::diamonds dataset. There are roughly 1500 fair diamonds, 5000 good, 22000 ideal, 14000 premium, and 12000 very good, cut diamonds."} + ```{r} + #| warning: false + #| fig.alt: "Bar chart of number of each each cut of diamond in the ggplots::diamonds dataset. There are roughly 1500 fair diamonds, 5000 good, 22000 ideal, 14000 premium, and 12000 very good, cut diamonds." 
+ demo <- tribble( ~cut, ~freq, "Fair", 1610, @@ -660,7 +757,9 @@ There are three reasons you might need to use a stat explicitly: 2. You might want to override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of proportions, rather than counts: - ```{r, fig.alt = "Bar chart of proportion of each each cut of diamond in the ggplots::diamonds dataset. Roughly, fair diamonds make up 0.03, good 0.09, very good 0.22, premium 26, and ideal 0.40."} + ```{r} + #| fig.alt: "Bar chart of proportion of each cut of diamond in the ggplot2::diamonds dataset. Roughly, fair diamonds make up 0.03, good 0.09, very good 0.22, premium 0.26, and ideal 0.40." + ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1)) ``` @@ -668,9 +767,11 @@ There are three reasons you might need to use a stat explicitly: To find the variables computed by the stat, look for the section titled "computed variables" in the help for `geom_bar()`. 3. You might want to draw greater attention to the statistical transformation in your code. - For example, you might use `stat_summary()`, which summarises the y values for each unique x value, to draw attention to the summary that you're computing: + For example, you might use `stat_summary()`, which summarizes the y values for each unique x value, to draw attention to the summary that you're computing: + + ```{r} + #| fig.alt: "A plot with depth on the y-axis and cut on the x-axis (with levels fair, good, very good, premium, and ideal) of diamonds in ggplot2::diamonds. For each level of cut, vertical lines extend from minimum to maximum depth for diamonds in that cut category, and the median depth is indicated on the line with a point." - ```{r, fig.alt = "A plot with depth on the y-axis and cut on the x-axis (with levels fair, good, very good, premium, and ideal) of diamonds in ggplot2::diamonds.
For each level of cut, vertical lines extend from minimum to maximum depth for diamonds in that cut category, and the median depth is indicated on the line with a point."} ggplot(data = diamonds) + stat_summary( mapping = aes(x = cut, y = depth), @@ -703,7 +804,9 @@ To see a complete list of stats, try the [ggplot2 cheatsheet](http://rstudio.com Why? In other words what is the problem with these two graphs? - ```{r, eval = FALSE} + ```{r} + #| eval: false + ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, y = after_stat(prop))) ggplot(data = diamonds) + @@ -715,7 +818,11 @@ To see a complete list of stats, try the [ggplot2 cheatsheet](http://rstudio.com There's one more piece of magic associated with bar charts. You can colour a bar chart using either the `colour` aesthetic, or, more usefully, `fill`: -```{r, out.width = "50%", fig.align = "default", fig.alt = "Two bar charts of cut of diamonds in ggplot2::diamonds. In the first plot, the bars have coloured borders. In the second plot, they're filled with colours. Heights of the bars correspond to the number of diamonds in each cut category."} +```{r} +#| out.width: "50%" +#| fig.align: "default" +#| fig.alt: "Two bar charts of cut of diamonds in ggplot2::diamonds. In the first plot, the bars have coloured borders. In the second plot, they're filled with colours. Heights of the bars correspond to the number of diamonds in each cut category." + ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, colour = cut)) ggplot(data = diamonds) + @@ -723,9 +830,11 @@ ggplot(data = diamonds) + ``` Note what happens if you map the fill aesthetic to another variable, like `clarity`: the bars are automatically stacked. -Each coloured rectangle represents a combination of `cut` and `clarity`. +Each colored rectangle represents a combination of `cut` and `clarity`. + +```{r} +#| fig.alt: "Segmented bar chart of cut of diamonds in ggplot2::diamonds, where each bar is filled with colours for the levels of clarity. 
Heights of the bars correspond to the number of diamonds in each cut category, and heights of the coloured segments are proportional to the number of diamonds with a given clarity level within a given cut level." -```{r, fig.alt = "Segmented bar chart of cut of diamonds in ggplot2::diamonds, where each bar is filled with colours for the levels of clarity. Heights of the bars correspond to the number of diamonds in each cut category, and heights of the coloured segments are proportional to the number of diamonds with a given clarity level within a given cut level."} ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = clarity)) ``` @@ -737,7 +846,11 @@ If you don't want a stacked bar chart, you can use one of three other options: ` This is not very useful for bars, because it overlaps them. To see that overlapping we either need to make the bars slightly transparent by setting `alpha` to a small value, or completely transparent by setting `fill = NA`. - ```{r, out.width = "50%", fig.align = "default", fig.alt = "Two segmented bar charts of cut of diamonds in ggplot2::diamonds, where each bar is filled with colours for the levels of clarity. Heights of the bars correspond to the number of diamonds in each cut category, and heights of the coloured segments are proportional to the number of diamonds with a given clarity level within a given cut level. However the segments overlap. In the first plot the segments are filled with transparent colours, in the second plot the segments are only outlined with colours."} + ```{r} + #| out.width: "50%" + #| fig.align: "default" + #| fig.alt: "Two segmented bar charts of cut of diamonds in ggplot2::diamonds, where each bar is filled with colours for the levels of clarity. Heights of the bars correspond to the number of diamonds in each cut category, and heights of the coloured segments are proportional to the number of diamonds with a given clarity level within a given cut level. However the segments overlap. 
In the first plot the segments are filled with transparent colours, in the second plot the segments are only outlined with colours." + ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) + geom_bar(alpha = 1/5, position = "identity") ggplot(data = diamonds, mapping = aes(x = cut, colour = clarity)) + @@ -749,7 +862,9 @@ If you don't want a stacked bar chart, you can use one of three other options: ` - `position = "fill"` works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups. - ```{r, fig.alt = "Segmented bar chart of cut of diamonds in ggplot2::diamonds, where each bar is filled with colours for the levels of clarity. Height of each bar is 1 and heights of the coloured segments are proportional to the proportion of diamonds with a given clarity level within a given cut level."} + ```{r} + #| fig.alt: "Segmented bar chart of cut of diamonds in ggplot2::diamonds, where each bar is filled with colours for the levels of clarity. Height of each bar is 1 and heights of the coloured segments are proportional to the proportion of diamonds with a given clarity level within a given cut level." + ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill") ``` @@ -757,7 +872,9 @@ If you don't want a stacked bar chart, you can use one of three other options: ` - `position = "dodge"` places overlapping objects directly *beside* one another. This makes it easier to compare individual values. - ```{r, fig.alt = "Dodged bar chart of cut of diamonds in ggplot2::diamonds. Dodged bars are grouped by levels of cut (fair, good, very good, premium, and ideal). In each group there are eight bars, one for each level of clarity, and filled with a different color for each level. Heights of these bars represent the number of diamonds with a given level of cut and clarity."} + ```{r} + #| fig.alt: "Dodged bar chart of cut of diamonds in ggplot2::diamonds. 
Dodged bars are grouped by levels of cut (fair, good, very good, premium, and ideal). In each group there are eight bars, one for each level of clarity, and filled with a different color for each level. Heights of these bars represent the number of diamonds with a given level of cut and clarity." + ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge") ``` @@ -766,7 +883,10 @@ There's one other type of adjustment that's not useful for bar charts, but it ca Recall our first scatterplot. Did you notice that the plot displays only 126 points, even though there are 234 observations in the dataset? -```{r, echo = FALSE, fig.alt = "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association."} +```{r} +#| echo: FALSE +#| fig.alt: "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association." + ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) ``` @@ -780,7 +900,9 @@ You can avoid this gridding by setting the position adjustment to "jitter". `position = "jitter"` adds a small amount of random noise to each point. This spreads the points out because no two points are likely to receive the same amount of random noise. -```{r, fig.alt = "Jittered scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association."} +```{r} +#| fig.alt: "Jittered scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association." + ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), position = "jitter") ``` @@ -795,7 +917,9 @@ To learn more about a position adjustment, look up the help page associated with 1. What is the problem with this plot? How could you improve it? - ```{r, fig.alt = "Scatterplot of highway fuel efficiency versus city fuel efficiency of cars in ggplot2::mpg that shows a positive association. 
The number of points visible in this plot is less than the number of points in the dataset."} + ```{r} + #| fig.alt: "Scatterplot of highway fuel efficiency versus city fuel efficiency of cars in ggplot2::mpg that shows a positive association. The number of points visible in this plot is less than the number of points in the dataset." + ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + geom_point() ``` @@ -805,7 +929,7 @@ To learn more about a position adjustment, look up the help page associated with 3. Compare and contrast `geom_jitter()` with `geom_count()`. 4. What's the default position adjustment for `geom_boxplot()`? - Create a visualisation of the `mpg` dataset that demonstrates it. + Create a visualization of the `mpg` dataset that demonstrates it. ## Coordinate systems @@ -817,7 +941,12 @@ There are a number of other coordinate systems that are occasionally helpful. This is useful (for example), if you want horizontal boxplots. It's also useful for long labels: it's hard to get them to fit without overlapping on the x-axis. - ```{r, fig.width = 3, out.width = "50%", fig.align = "default", fig.alt = "Two side-by-side box plots of highway fuel efficiency of cars in ggplot2::mpg. A separate box plot is created for cars in each level of class (2seater, compact, midsize, minivan, pickup, subcompact, and suv). In the first plot class is on the x-axis, in the second plot class is on the y-axis. The second plot makes it easier to read the names of the levels of class since they're listed down the y-axis, avoiding overlap."} + ```{r} + #| fig.width: 3 + #| out.width: "50%" + #| fig.align: "default" + #| fig.alt: "Two side-by-side box plots of highway fuel efficiency of cars in ggplot2::mpg. A separate box plot is created for cars in each level of class (2seater, compact, midsize, minivan, pickup, subcompact, and suv). In the first plot class is on the x-axis, in the second plot class is on the y-axis. 
The second plot makes it easier to read the names of the levels of class since they're listed down the y-axis, avoiding overlap." + ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + geom_boxplot() ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + @@ -827,7 +956,11 @@ There are a number of other coordinate systems that are occasionally helpful. However, note that you can achieve the same result by flipping the aesthetic mappings of the two variables. - ```{r, fig.width = 3, fig.align = "default", fig.alt = "Side-by-side box plots of highway fuel efficiency of cars in ggplot2::mpg. A separate box plot is drawn along the y-axis for cars in each level of class (2seater, compact, midsize, minivan, pickup, subcompact, and suv)."} + ```{r} + #| fig.width: 3 + #| fig.align: "default" + #| fig.alt: "Side-by-side box plots of highway fuel efficiency of cars in ggplot2::mpg. A separate box plot is drawn along the y-axis for cars in each level of class (2seater, compact, midsize, minivan, pickup, subcompact, and suv)." + ggplot(data = mpg, mapping = aes(y = class, x = hwy)) + geom_boxplot() ``` @@ -835,7 +968,13 @@ There are a number of other coordinate systems that are occasionally helpful. - `coord_quickmap()` sets the aspect ratio correctly for maps. This is very important if you're plotting spatial data with ggplot2 (which unfortunately we don't have the space to cover in this book). - ```{r, fig.width = 3, out.width = "50%", fig.align = "default", message = FALSE, fig.alt = "Two maps of the boundaries of New Zealand. In the first plot the aspect ratio is incorrect, in the second plot it's correct."} + ```{r} + #| fig.width: 3 + #| out.width: "50%" + #| fig.align: "default" + #| message: FALSE + #| fig.alt: "Two maps of the boundaries of New Zealand. In the first plot the aspect ratio is incorrect, in the second plot it's correct." 
+ nz <- map_data("nz") ggplot(nz, aes(long, lat, group = group)) + @@ -849,7 +988,13 @@ There are a number of other coordinate systems that are occasionally helpful. - `coord_polar()` uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart. - ```{r, fig.width = 3, out.width = "50%", fig.align = "default", fig.asp = 1, fig.alt = "Two plots: on the left is a bar chart of cut of diamonds in ggplot2::diamonds, on the right is a Coxcomb chart of the same data."} + ```{r} + #| fig.width: 3 + #| out.width: "50%" + #| fig.align: "default" + #| fig.asp: 1 + #| fig.alt: "Two plots: on the left is a bar chart of cut of diamonds in ggplot2::diamonds, on the right is a Coxcomb chart of the same data." + bar <- ggplot(data = diamonds) + geom_bar( mapping = aes(x = cut, fill = cut), @@ -876,7 +1021,11 @@ There are a number of other coordinate systems that are occasionally helpful. Why is `coord_fixed()` important? What does `geom_abline()` do? - ```{r, fig.asp = 1, out.width = "50%", fig.alt = "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association. The plot also has a straight line that follows the trend of the relationship between the variables but doesn't go through the cloud of points, it's beneath it."} + ```{r} + #| fig.asp: 1 + #| out.width: "50%" + #| fig.alt: "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association. The plot also has a straight line that follows the trend of the relationship between the variables but doesn't go through the cloud of points, it's beneath it." 
+ ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + geom_point() + geom_abline() + @@ -906,7 +1055,11 @@ The grammar of graphics is based on the insight that you can uniquely describe * To see how this works, consider how you could build a basic plot from scratch: you could start with a dataset and then transform it into the information that you want to display (with a stat). -```{r, echo = FALSE, out.width = "100%", fig.alt = "A figure demonstrating the steps for going from raw data (ggplot2::diamonds) to table of counts where each row represents one level of cut and a count column shows how many diamonds are in that cut level. Steps 1 and 2 are annotated: 1. Begin with the diamonds dataset. 2. Compute counts for each cut value with stat_count()."} +```{r} +#| echo: FALSE +#| out.width: "100%" +#| fig.alt: "A figure demonstrating the steps for going from raw data (ggplot2::diamonds) to table of counts where each row represents one level of cut and a count column shows how many diamonds are in that cut level. Steps 1 and 2 are annotated: 1. Begin with the diamonds dataset. 2. Compute counts for each cut value with stat_count()." + knitr::include_graphics("images/visualization-grammar-1.png") ``` @@ -914,7 +1067,11 @@ Next, you could choose a geometric object to represent each observation in the t You could then use the aesthetic properties of the geoms to represent variables in the data. You would map the values of each variable to the levels of an aesthetic. -```{r, echo = FALSE, out.width = "100%", fig.alt = "A figure demonstrating the steps for going from raw data (ggplot2::diamonds) to table of counts where each row represents one level of cut and a count column shows how many diamonds are in that cut level. Each level is also mapped to a color. Steps 3 and 4 are annotated: 3. Represent each observation with a bar. 4. Map the fill of each bar to the ..count.. 
variable."} +```{r} +#| echo: FALSE +#| out.width: "100%" +#| fig.alt: "A figure demonstrating the steps for going from raw data (ggplot2::diamonds) to table of counts where each row represents one level of cut and a count column shows how many diamonds are in that cut level. Each level is also mapped to a color. Steps 3 and 4 are annotated: 3. Represent each observation with a bar. 4. Map the fill of each bar to the ..count.. variable." + knitr::include_graphics("images/visualization-grammar-2.png") ``` @@ -923,7 +1080,11 @@ You'd use the location of the objects (which is itself an aesthetic property) to At that point, you would have a complete graph, but you could further adjust the positions of the geoms within the coordinate system (a position adjustment) or split the graph into subplots (faceting). You could also extend the plot by adding one or more additional layers, where each additional layer uses a dataset, a geom, a set of mappings, a stat, and a position adjustment. -```{r, echo = FALSE, out.width = "100%", fig.alt = "A figure demonstrating the steps for going from raw data (ggplot2::diamonds) to bar chart where each bar represents one level of cut and filled in with a different color. Steps 5 and 6 are annotated: 5. Place geoms in a Cartesian coordinate system. 6. Map the y values to ..count.. and the x values to cut."} +```{r} +#| echo: FALSE +#| out.width: "100%" +#| fig.alt: "A figure demonstrating the steps for going from raw data (ggplot2::diamonds) to bar chart where each bar represents one level of cut and filled in with a different color. Steps 5 and 6 are annotated: 5. Place geoms in a Cartesian coordinate system. 6. Map the y values to ..count.. and the x values to cut." + knitr::include_graphics("images/visualization-grammar-3.png") ```
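
The six annotated steps in the three figures above can also be followed by hand in code. The chunk below is a minimal sketch, not part of the chapter's own examples: it assumes dplyr is loaded (it is when you load the tidyverse), and the object names `cut_counts`, `manual`, and `automatic` are made up for illustration. It computes the counts explicitly instead of relying on `stat_count()`, draws them with `geom_col()`, and then builds the same chart with `geom_bar()` so you can compare the two.

```{r}
library(ggplot2)
library(dplyr)

# Steps 1-2: begin with the diamonds dataset and compute a count for each
# cut value by hand (roughly what stat_count() does inside geom_bar()).
cut_counts <- count(diamonds, cut, name = "count")

# Steps 3-6: represent each row with a bar, map fill to the count,
# place the bars in a Cartesian coordinate system, and map x to cut
# and y to count.
manual <- ggplot(data = cut_counts) +
  geom_col(mapping = aes(x = cut, y = count, fill = count)) +
  coord_cartesian()

# The same chart, letting geom_bar() compute the counts with its stat.
automatic <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = after_stat(count)))
```

Printing `manual` or `automatic` draws the chart. The first version makes the statistical transformation visible as an ordinary data frame, while the second hides it inside the layer; which one reads better depends on how much attention you want to draw to the transformation.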