Updates to whole game intro + dataviz chapter (#932)

* Streamline narrative for new chapters and part name

* Add to do note to add sizing for faceted plots

* Minor edits, function updates, figure alt text

* Revert references to vars() in facets, use formula

* Spell check, colo*u*r, comma after r in chunk def

* Streamline fig.alt language

* If eval = FALSE, don't need fig.alt

* Fix sentence fragment
Mine Cetinkaya-Rundel 2021-03-19 12:38:28 +00:00 committed by GitHub
parent 588f70ac59
commit 1eee408cb6
3 changed files with 98 additions and 66 deletions

@ -621,6 +621,8 @@ You can learn more about `ggsave()` in the documentation.
### Figure sizing
<!--# TO DO: Add something about faceted plots here. -->
The biggest challenge of graphics in R Markdown is getting your figures the right size and shape.
There are five main options that control figure sizing: `fig.width`, `fig.height`, `fig.asp`, `out.width` and `out.height`.
Image sizing is challenging because there are two sizes (the size of the figure created by R and the size at which it is inserted in the output document), and multiple ways of specifying the size (i.e., height, width, and aspect ratio: pick two of three).
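As a rough sketch of how these options interact, the following sets chunk defaults with `knitr::opts_chunk$set()`. The specific values (a 6 inch width, an aspect ratio of 0.618, and an output width of 70%) are illustrative assumptions rather than settings taken from this chapter; the point is that fixing the width and the aspect ratio already determines the height.

```r
# Illustrative defaults only; tune the numbers for your own document.
fig_width <- 6      # width of the figure created by R, in inches
fig_asp   <- 0.618  # aspect ratio (height / width)

knitr::opts_chunk$set(
  fig.width = fig_width,
  fig.asp   = fig_asp,
  out.width = "70%",     # size at which the figure is inserted in the output
  fig.align = "center"
)

# Width and aspect ratio together imply the height, so fig.height is not set.
implied_height <- fig_width * fig_asp
```

Here `fig.width` and `fig.asp` describe the figure R creates, while `out.width` describes how large it appears in the output document.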
@ -688,3 +690,4 @@ Unfortunately, the book is not available online for free, although you can find
Another great resource is the ggplot2 extensions gallery <https://exts.ggplot2.tidyverse.org/gallery/>.
This site lists many of the packages that extend ggplot2 with new geoms and scales.
It's a great place to start if you're trying to do something that seems hard with ggplot2.

@ -9,7 +9,7 @@ R has several systems for making graphs, but ggplot2 is one of the most elegant
ggplot2 implements the **grammar of graphics**, a coherent system for describing and building graphs.
With ggplot2, you can do more, and do it faster, by learning one system and applying it in many places.
If you'd like to learn more about the theoretical underpinnings of ggplot2 before you start, I'd recommend reading "The Layered Grammar of Graphics", <http://vita.had.co.nz/papers/layered-grammar.pdf>.
If you'd like to learn more about the theoretical underpinnings of ggplot2, I'd recommend reading "The Layered Grammar of Graphics", <http://vita.had.co.nz/papers/layered-grammar.pdf>.
### Prerequisites
@ -25,7 +25,7 @@ It also tells you which functions from the tidyverse conflict with functions in
If you run this code and get the error message "there is no package called 'tidyverse'", you'll need to first install it, then run `library()` once again.
```{r eval = FALSE}
```{r, eval = FALSE}
install.packages("tidyverse")
library(tidyverse)
```
@ -47,7 +47,7 @@ Nonlinear?
### The `mpg` data frame
You can test your answer with the `mpg` **data frame** found in ggplot2 (aka `ggplot2::mpg`).
You can test your answer with the `mpg` **data frame** found in ggplot2 (a.k.a. `ggplot2::mpg`).
A data frame is a rectangular collection of variables (in the columns) and observations (in the rows).
`mpg` contains observations collected by the US Environmental Protection Agency on 38 models of car.
@ -68,7 +68,7 @@ To learn more about `mpg`, open its help page by running `?mpg`.
To plot `mpg`, run this code to put `displ` on the x-axis and `hwy` on the y-axis:
```{r}
```{r, fig.alt = "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association."}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
```
@ -88,7 +88,7 @@ ggplot2 comes with many geom functions that each add a different type of layer t
You'll learn a whole bunch of them throughout this chapter.
Each geom function in ggplot2 takes a `mapping` argument.
This defines how variables in your dataset are mapped to visual properties.
This defines how variables in your dataset are mapped to visual properties of your plot.
The `mapping` argument is always paired with `aes()`, and the `x` and `y` arguments of `aes()` specify which variables to map to the x and y axes.
ggplot2 looks for the mapped variables in the `data` argument, in this case, `mpg`.
@ -97,7 +97,7 @@ ggplot2 looks for the mapped variables in the `data` argument, in this case, `mp
Let's turn this code into a reusable template for making graphs with ggplot2.
To make a graph, replace the bracketed sections in the code below with a dataset, a geom function, or a collection of mappings.
```{r eval = FALSE}
```{r, eval = FALSE}
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
```
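For instance, one way to fill in the template (an illustrative choice, not the only one) is to use `mpg` as the dataset, `geom_point()` as the geom function, and a mapping of city mileage (`cty`) to the x-axis and `hwy` to the y-axis. The object name `p` is just a placeholder for this sketch.

```r
library(ggplot2)

# <DATA> = mpg, <GEOM_FUNCTION> = geom_point, <MAPPINGS> = x = cty, y = hwy
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = cty, y = hwy))

p  # printing the object draws the plot
```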
@ -129,7 +129,7 @@ In the plot below, one group of points (highlighted in red) seems to fall outsid
These cars have a higher mileage than you might expect.
How can you explain these cars?
```{r, echo = FALSE}
```{r, echo = FALSE, fig.alt = "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association. Cars with engine size greater than 5 litres and highway fuel efficiency greater than 20 miles per gallon stand out from the rest of the data and are highlighted in red."}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_point(data = dplyr::filter(mpg, displ > 5, hwy > 20), colour = "red", size = 2.2)
@ -147,7 +147,7 @@ You can display a point (like the one below) in different ways by changing the v
Since we already use the word "value" to describe data, let's use the word "level" to describe aesthetic properties.
Here we change the levels of a point's size, shape, and color to make the point small, triangular, or blue:
```{r, echo = FALSE, asp = 1/4}
```{r, echo = FALSE, asp = 1/4, fig.alt = "Diagram that shows four plotting characters next to each other. The first is a large circle, the second is a small circle, the third is a triangle, and the fourth is a blue circle."}
ggplot() +
geom_point(aes(1, 1), size = 20) +
geom_point(aes(2, 1), size = 10) +
@ -159,9 +159,9 @@ ggplot() +
```
You can convey information about your data by mapping the aesthetics in your plot to the variables in your dataset.
For example, you can map the colors of your points to the `class` variable to reveal the class of each car.
For example, you can map the colours of your points to the `class` variable to reveal the class of each car.
```{r}
```{r, fig.alt = "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association. The points representing each car are coloured according to the class of the car. The legend on the right of the plot shows the mapping between colours and levels of the class variable: 2seater, compact, midsize, minivan, pickup, subcompact, or suv."}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
```
@ -172,7 +172,7 @@ To map an aesthetic to a variable, associate the name of the aesthetic to the na
ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as **scaling**.
ggplot2 will also add a legend that explains which levels correspond to which values.
The colors reveal that many of the unusual points are two-seater cars.
The colours reveal that many of the unusual points (with engine size greater than 5 litres and highway fuel efficiency greater than 20 miles per gallon) are two-seater cars.
These cars don't seem like hybrids, and are, in fact, sports cars!
Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage.
In hindsight, these cars were unlikely to be hybrids since they have large engines.
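To see scaling at work, one option is to inspect the data that ggplot2 builds for the layer with `layer_data()`. This is a minimal sketch; the names `p`, `built`, `n_classes`, and `n_colours` are only for illustration.

```r
library(ggplot2)

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

# layer_data() returns the layer's data after scales have been applied,
# so the colour column holds the level assigned to each point.
built <- layer_data(p)

n_classes <- length(unique(mpg$class))
n_colours <- length(unique(built$colour))

# Scaling assigns one colour per unique value of class
n_classes == n_colours
```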
@ -181,14 +181,14 @@ In the above example, we mapped `class` to the color aesthetic, but we could hav
In this case, the exact size of each point would reveal its class affiliation.
We get a *warning* here, because mapping an unordered variable (`class`) to an ordered aesthetic (`size`) is not a good idea.
```{r}
```{r, fig.alt = "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association. The points representing each car are sized according to the class of the car. The legend on the right of the plot shows the mapping between sizes and levels of the class variable -- going from small to large: 2seater, compact, midsize, minivan, pickup, subcompact, or suv."}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))
```
Or we could have mapped `class` to the *alpha* aesthetic, which controls the transparency of the points, or to the shape aesthetic, which controls the shape of the points.
```{r out.width = "50%", fig.align = 'default', warning = FALSE, fig.asp = 1/2, fig.cap =""}
```{r, out.width = "50%", fig.align = 'default', warning = FALSE, fig.asp = 1/2, fig.cap ="", fig.alt = "Two scatterplots next to each other, both visualizing highway fuel efficiency versus engine size of cars in ggplot2::mpg and showing a negative association. In the plot on the left, class is mapped to the alpha aesthetic, resulting in different transparency levels for each level of class. In the plot on the right, class is mapped to the shape aesthetic, resulting in different plotting character shapes for each level of class. Each plot comes with a legend that shows the mapping between alpha level or shape and levels of the class variable."}
# Left
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
@ -214,7 +214,7 @@ The axis line acts as a legend; it explains the mapping between locations and va
You can also *set* the aesthetic properties of your geom manually.
For example, we can make all of the points in our plot blue:
```{r}
```{r, fig.alt = "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association. All points are blue."}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
```
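The same pattern extends to other aesthetics. As a small sketch (the particular levels chosen here are arbitrary), several aesthetics can be set at once by passing them as arguments of the geom function, outside of `aes()`:

```r
library(ggplot2)

# colour, size, and shape are set manually, so they apply to every point
# and no legend is drawn for them.
p <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    color = "blue",
    size = 3,    # point size
    shape = 17   # filled triangle
  )

# Every point in the built layer carries the same manually set levels
built <- layer_data(p)
unique(built$colour)
```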
@ -229,7 +229,7 @@ You'll need to pick a level that makes sense for that aesthetic:
- The shape of a point as a number, as shown in Figure \@ref(fig:shapes).
```{r shapes, echo = FALSE, out.width = "75%", fig.asp = 1/3, fig.cap="R has 25 built in shapes that are identified by numbers. There are some seeming duplicates: for example, 0, 15, and 22 are all squares. The difference comes from the interaction of the `colour` and `fill` aesthetics. The hollow shapes (0--14) have a border determined by `colour`; the solid shapes (15--20) are filled with `colour`; the filled shapes (21--24) have a border of `colour` and are filled with `fill`.", warning = FALSE}
```{r shapes, echo = FALSE, out.width = "75%", fig.asp = 1/3, fig.cap="R has 25 built in shapes that are identified by numbers. There are some seeming duplicates: for example, 0, 15, and 22 are all squares. The difference comes from the interaction of the `colour` and `fill` aesthetics. The hollow shapes (0--14) have a border determined by `colour`; the solid shapes (15--20) are filled with `colour`; the filled shapes (21--24) have a border of `colour` and are filled with `fill`.", warning = FALSE, fig.alt = "Mapping between shapes and the numbers that represent them: 0 - square, 1 - circle, 2 - triangle point up, 3 - plus, 4 - cross, 5 - diamond, 6 - triangle point down, 7 - square cross, 8 - star, 9 - diamond plus, 10 - circle plus, 11 - triangles up and down, 12 - square plus, 13 - circle cross, 14 - square and triangle down, 15 - filled square, 16 - filled circle, 17 - filled triangle point-up, 18 - filled diamond, 19 - solid circle, 20 - bullet (smaller circle), 21 - filled circle blue, 22 - filled square blue, 23 - filled diamond blue, 24 - filled triangle point-up blue, 25 - filled triangle point down blue."}
shapes <- tibble(
shape = c(0, 1, 2, 5, 3, 4, 6:19, 22, 21, 24, 23, 20),
x = (0:24 %/% 5) / 2,
@ -251,7 +251,7 @@ ggplot(shapes, aes(x, y)) +
1. What's gone wrong with this code?
Why are the points not blue?
```{r}
```{r, fig.alt = "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association. All points are red and the legend shows a red point that is mapped to the word 'blue'."}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
```
@ -309,20 +309,20 @@ One way to add additional variables is with aesthetics.
Another way, particularly useful for categorical variables, is to split your plot into **facets**, subplots that each display one subset of the data.
To facet your plot by a single variable, use `facet_wrap()`.
The first argument of `facet_wrap()` should be a formula, which you create with `~` followed by a variable name (here "formula" is the name of a data structure in R, not a synonym for "equation").
The first argument of `facet_wrap()` is a formula, which you create with `~` followed by a variable name (here "formula" is the name of a data structure in R, not a synonym for "equation").
The variable that you pass to `facet_wrap()` should be discrete.
```{r}
```{r, fig.alt = "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg, faceted by class, with facets spanning two rows."}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
facet_grid(drv ~ cyl)
```
To facet your plot on the combination of two variables, add `facet_grid()` to your plot call.
The first argument of `facet_grid()` is also a formula.
This time the formula should contain two variable names separated by a `~`.
```{r}
```{r, fig.alt = "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg, faceted by type of drive train across rows and by number of cylinders across columns. This results in a 3x4 grid of 12 facets. Some of these facets have no observations: 5 cylinders and 4 wheel drive, 4 or 5 cylinders and rear wheel drive."}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)
@ -337,7 +337,7 @@ If you prefer to not facet in the rows or columns dimension, use a `.` instead o
2. What do the empty cells in plot with `facet_grid(drv ~ cyl)` mean?
How do they relate to this plot?
```{r, eval = FALSE}
```{r, fig.alt = "Scatterplot of number of cylinders versus type of drive train of cars in ggplot2::mpg. Shows that there are no cars with 5 cylinders that are 4 wheel drive or with 4 or 5 cylinders that are rear wheel drive."}
ggplot(data = mpg) +
geom_point(mapping = aes(x = drv, y = cyl))
```
@ -345,7 +345,7 @@ If you prefer to not facet in the rows or columns dimension, use a `.` instead o
3. What plots does the following code make?
What does `.` do?
```{r eval = FALSE}
```{r, eval = FALSE}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)
@ -373,14 +373,33 @@ If you prefer to not facet in the rows or columns dimension, use a `.` instead o
What other options control the layout of the individual panels?
Why doesn't `facet_grid()` have `nrow` and `ncol` arguments?
6. When using `facet_grid()` you should usually put the variable with more unique levels in the columns.
Why?
6. Which of the following two plots makes it easier to compare engine size (`displ`) across cars with different drive trains?
What does this say about when to place a faceting variable across rows or columns?
```{r, fig.alt = "Two faceted plots, both visualizing highway fuel efficiency versus engine size of cars in ggplot2::mpg, faceted by drive train. In the top plot, facets are organized across rows and in the second, across columns."}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ drv)
```
7. Recreate this plot using `facet_wrap()` instead of `facet_grid()`.
How do the positions of the facet labels change?
```{r, fig.alt = "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg, faceted by type of drive train across rows."}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)
```
## Geometric objects
How are these two plots similar?
```{r echo = FALSE, out.width = "50%", fig.align="default", message = FALSE}
```{r, echo = FALSE, out.width = "50%", fig.align="default", message = FALSE, fig.alt = "Two plots: the plot on the left is a scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg and the plot on the right shows a smooth curve that follows the trajectory of the relationship between these variables. A confidence interval around the smooth curve is also displayed."}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
@ -403,7 +422,7 @@ The plot on the left uses the point geom, and the plot on the right uses the smo
To change the geom in your plot, change the geom function that you add to `ggplot()`.
For instance, to make the plots above, you can use this code:
```{r eval = FALSE}
```{r, eval = FALSE}
# left
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
@ -419,18 +438,18 @@ You could set the shape of a point, but you couldn't set the "shape" of a line.
On the other hand, you *could* set the linetype of a line.
`geom_smooth()` will draw a different line, with a different linetype, for each unique value of the variable that you map to linetype.
```{r message = FALSE}
```{r, message = FALSE, fig.alt = "A plot of highway fuel efficiency versus engine size of cars in ggplot2::mpg. The data are represented with smooth curves, which use a different line type (solid, dashed, or long dashed) for each type of drive train. Confidence intervals around the smooth curves are also displayed."}
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
```
Here `geom_smooth()` separates the cars into three lines based on their `drv` value, which describes a car's drivetrain.
Here `geom_smooth()` separates the cars into three lines based on their `drv` value, which describes a car's drive train.
One line describes all of the points with a `4` value, one line describes all of the points with an `f` value, and one line describes all of the points with an `r` value.
Here, `4` stands for four-wheel drive, `f` for front-wheel drive, and `r` for rear-wheel drive.
If this sounds strange, we can make it more clear by overlaying the lines on top of the raw data and then coloring everything according to `drv`.
If this sounds strange, we can make it clearer by overlaying the lines on top of the raw data and then colouring everything according to `drv`.
```{r echo = FALSE, message = FALSE}
```{r, echo = FALSE, message = FALSE, fig.alt = "A plot of highway fuel efficiency versus engine size of cars in ggplot2::mpg. The data are represented with points (coloured by drive train) as well as smooth curves (where line type is determined based on drive train as well). Confidence intervals around the smooth curves are also displayed."}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(mapping = aes(linetype = drv))
@ -438,11 +457,11 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
Notice that this plot contains two geoms in the same graph!
If this makes you excited, buckle up.
We will learn how to place multiple geoms in the same plot very soon.
You will learn how to place multiple geoms in the same plot very soon.
ggplot2 provides over 40 geoms, and extension packages provide even more (see <https://exts.ggplot2.tidyverse.org/gallery/> for a sampling).
The best way to get a comprehensive overview is the ggplot2 cheatsheet, which you can find at <http://rstudio.com/resources/cheatsheets>.
To learn more about any single geom, use help: `?geom_smooth`.
To learn more about any single geom, use help, e.g. `?geom_smooth`.
Many geoms, like `geom_smooth()`, use a single geometric object to display multiple rows of data.
For these geoms, you can set the `group` aesthetic to a categorical variable to draw multiple objects.
@ -450,7 +469,7 @@ ggplot2 will draw a separate object for each unique value of the grouping variab
In practice, ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the `linetype` example).
It is convenient to rely on this feature because the group aesthetic by itself does not add a legend or distinguishing features to the geoms.
```{r, fig.width = 3, fig.align = 'default', out.width = "33%", message = FALSE}
```{r, fig.width = 3, fig.align = 'default', out.width = "33%", message = FALSE, fig.alt = "Three plots, each with highway fuel efficiency on the y-axis and engine size on the x-axis for cars in ggplot2::mpg, where data are represented by a smooth curve. The first plot only has these two variables, the center plot has three separate smooth curves, one for each level of drive train, and the right plot not only has the same three separate smooth curves for each level of drive train but these curves are plotted in different colours, without a legend explaining which colour maps to which level. Confidence intervals around the smooth curves are also displayed."}
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
@ -466,7 +485,7 @@ ggplot(data = mpg) +
To display multiple geoms in the same plot, add multiple geom functions to `ggplot()`:
```{r, message = FALSE}
```{r, message = FALSE, fig.alt = "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg with a smooth curve overlaid. A confidence interval around the smooth curves is also displayed."}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))
@ -489,7 +508,7 @@ If you place mappings in a geom function, ggplot2 will treat them as local mappi
It will use these mappings to extend or overwrite the global mappings *for that layer only*.
This makes it possible to display different aesthetics in different layers.
```{r, message = FALSE}
```{r, message = FALSE, fig.alt = "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg, where points are coloured according to the car class. A smooth curve following the trajectory of the relationship between highway fuel efficiency versus engine size of cars is overlaid along with a confidence interval around it."}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth()
@ -499,7 +518,7 @@ You can use the same idea to specify different `data` for each layer.
Here, our smooth line displays just a subset of the `mpg` dataset, the subcompact cars.
The local data argument in `geom_smooth()` overrides the global data argument in `ggplot()` for that layer only.
```{r, message = FALSE}
```{r, message = FALSE, fig.alt = "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg, where points are coloured according to the car class. A smooth curve following the trajectory of the relationship between highway fuel efficiency versus engine size of subcompact cars is overlaid along with a confidence interval around it."}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)
@ -543,8 +562,9 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
```
6. Recreate the R code necessary to generate the following graphs.
Note that wherever a categorical variable is used in the plot, it's `drv`.
```{r echo = FALSE, fig.width = 3, out.width = "50%", fig.align = "default", message = FALSE}
```{r, echo = FALSE, fig.width = 3, out.width = "50%", fig.align = "default", message = FALSE, fig.alt = "There are six scatterplots in this figure, arranged in a 3x2 grid. In all plots highway fuel efficiency of cars in ggplot2::mpg is on the y-axis and engine size is on the x-axis. The first plot shows all points in black with a smooth curve overlaid on them. In the second plot points are also all black, with separate smooth curves overlaid for each level of drive train. On the third plot, points and the smooth curves are represented in different colours for each level of drive train. In the fourth plot the points are represented in different colours for each level of drive train but there is only a single smooth line fitted to the whole data. In the fifth plot, points are represented in different colours for each level of drive train, and separate smooth curves with different line types are fitted to each level of drive train. And finally in the sixth plot points are represented in different colours for each level of drive train and they have a thick white border."}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth(se = FALSE)
@ -571,10 +591,10 @@ Next, let's take a look at a bar chart.
Bar charts seem simple, but they are interesting because they reveal something subtle about plots.
Consider a basic bar chart, as drawn with `geom_bar()`.
The following chart displays the total number of diamonds in the `diamonds` dataset, grouped by `cut`.
The `diamonds` dataset comes in ggplot2 and contains information about \~54,000 diamonds, including the `price`, `carat`, `color`, `clarity`, and `cut` of each diamond.
The `diamonds` dataset is in the ggplot2 package and contains information on \~54,000 diamonds, including the `price`, `carat`, `color`, `clarity`, and `cut` of each diamond.
The chart shows that more diamonds are available with high quality cuts than with low quality cuts.
```{r}
```{r, fig.alt = "Bar chart of number of each cut of diamond in the ggplot2::diamonds dataset. There are roughly 1500 fair diamonds, 5000 good, 12000 very good, 14000 premium, and 22000 ideal cut diamonds."}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
```
@ -594,7 +614,7 @@ Other graphs, like bar charts, calculate new values to plot:
The algorithm used to calculate new values for a graph is called a **stat**, short for statistical transformation.
The figure below describes how this process works with `geom_bar()`.
```{r, echo = FALSE, out.width = "100%"}
```{r, echo = FALSE, out.width = "100%", fig.alt = 'A figure demonstrating three steps of creating a bar chart: 1. geom_bar() begins with the diamonds data set. 2. geom_bar() transforms the data with the "count" stat, which returns a data set of cut values and counts. 3. geom_bar() uses the transformed data to build the plot. cut is mapped to the x-axis, count is mapped to the y-axis.'}
knitr::include_graphics("images/visualization-stat-bar.png")
```
@ -606,7 +626,7 @@ That describes how it computes two new variables: `count` and `prop`.
You can generally use geoms and stats interchangeably.
For example, you can recreate the previous plot using `stat_count()` instead of `geom_bar()`:
```{r}
```{r, fig.alt = "Bar chart of number of each cut of diamond in the ggplot2::diamonds dataset. There are roughly 1500 fair diamonds, 5000 good, 12000 very good, 14000 premium, and 22000 ideal cut diamonds."}
ggplot(data = diamonds) +
stat_count(mapping = aes(x = cut))
```
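To make the "count" stat more concrete, here is a minimal sketch that counts the rows in each `cut` by hand with `dplyr::count()` and compares the result with what ggplot2 computed for the bar layer, as reported by `layer_data()`. The object names `manual`, `p`, and `computed` are placeholders for this example.

```r
library(ggplot2)
library(dplyr)

# Count the rows in each cut by hand
manual <- count(diamonds, cut)

# Ask ggplot2 what stat_count() computed for the bar layer
p <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
computed <- layer_data(p)

# Both are ordered by the levels of cut, so the counts line up
all(computed$count == manual$n)
```

The `computed` data frame also contains a `prop` column, which is the second variable that the stat computes.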
@ -620,7 +640,7 @@ There are three reasons you might need to use a stat explicitly:
This lets me map the height of the bars to the raw values of a $y$ variable.
Unfortunately when people talk about bar charts casually, they might be referring to this type of bar chart, where the height of the bar is already present in the data, or the previous bar chart where the height of the bar is generated by counting rows.
```{r, warning = FALSE}
```{r, warning = FALSE, fig.alt = "Bar chart of number of each cut of diamond in the ggplot2::diamonds dataset. There are roughly 1500 fair diamonds, 5000 good, 22000 ideal, 14000 premium, and 12000 very good cut diamonds."}
demo <- tribble(
~cut, ~freq,
"Fair", 1610,
@ -638,19 +658,19 @@ There are three reasons you might need to use a stat explicitly:
You might be able to guess at their meaning from the context, and you'll learn exactly what they do soon!)
2. You might want to override the default mapping from transformed variables to aesthetics.
For example, you might want to display a bar chart of proportion, rather than count:
For example, you might want to display a bar chart of proportions, rather than counts:
```{r}
```{r, fig.alt = "Bar chart of proportion of each cut of diamond in the ggplot2::diamonds dataset. Roughly, fair diamonds make up 0.03, good 0.09, very good 0.22, premium 0.26, and ideal 0.40."}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
```
To find the variables computed by the stat, look for the help section titled "computed variables".
To find the variables computed by the stat, look for the section titled "computed variables" in the help for `geom_bar()`.
3. You might want to draw greater attention to the statistical transformation in your code.
For example, you might use `stat_summary()`, which summarises the y values for each unique x value, to draw attention to the summary that you're computing:
```{r}
```{r, fig.alt = "A plot with depth on the y-axis and cut on the x-axis (with levels fair, good, very good, premium, and ideal) of diamonds in ggplot2::diamonds. For each level of cut, vertical lines extend from minimum to maximum depth for diamonds in that cut category, and the median depth is indicated on the line with a point."}
ggplot(data = diamonds) +
stat_summary(
mapping = aes(x = cut, y = depth),
@ -695,7 +715,7 @@ To see a complete list of stats, try the ggplot2 cheatsheet.
There's one more piece of magic associated with bar charts.
You can colour a bar chart using either the `colour` aesthetic, or, more usefully, `fill`:
```{r out.width = "50%", fig.align = "default"}
```{r, out.width = "50%", fig.align = "default", fig.alt = "Two bar charts of cut of diamonds in ggplot2::diamonds. In the first plot, the bars have coloured borders. In the second plot, they're filled with colours. Heights of the bars correspond to the number of diamonds in each cut category."}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, colour = cut))
ggplot(data = diamonds) +
@ -703,9 +723,9 @@ ggplot(data = diamonds) +
```
Note what happens if you map the fill aesthetic to another variable, like `clarity`: the bars are automatically stacked.
Each colored rectangle represents a combination of `cut` and `clarity`.
Each coloured rectangle represents a combination of `cut` and `clarity`.
```{r}
```{r, fig.alt = "Segmented bar chart of cut of diamonds in ggplot2::diamonds, where each bar is filled with colours for the levels of clarity. Heights of the bars correspond to the number of diamonds in each cut category, and heights of the coloured segments are proportional to the number of diamonds with a given clarity level within a given cut level."}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))
```
@ -717,7 +737,7 @@ If you don't want a stacked bar chart, you can use one of three other options: `
This is not very useful for bars, because it overlaps them.
To see that overlapping we either need to make the bars slightly transparent by setting `alpha` to a small value, or completely transparent by setting `fill = NA`.
```{r out.width = "50%", fig.align = "default"}
```{r, out.width = "50%", fig.align = "default", fig.alt = "Two segmented bar charts of cut of diamonds in ggplot2::diamonds, where each bar is filled with colours for the levels of clarity. Heights of the bars correspond to the number of diamonds in each cut category, and heights of the coloured segments are proportional to the number of diamonds with a given clarity level within a given cut level. However the segments overlap. In the first plot the segments are filled with transparent colours, in the second plot the segments are only outlined with colours."}
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
geom_bar(alpha = 1/5, position = "identity")
ggplot(data = diamonds, mapping = aes(x = cut, colour = clarity)) +
@ -729,7 +749,7 @@ If you don't want a stacked bar chart, you can use one of three other options: `
- `position = "fill"` works like stacking, but makes each set of stacked bars the same height.
This makes it easier to compare proportions across groups.
```{r}
```{r, fig.alt = "Segmented bar chart of cut of diamonds in ggplot2::diamonds, where each bar is filled with colours for the levels of clarity. Height of each bar is 1 and heights of the coloured segments are proportional to the proportion of diamonds with a given clarity level within a given cut level."}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
```
@ -737,7 +757,7 @@ If you don't want a stacked bar chart, you can use one of three other options: `
- `position = "dodge"` places overlapping objects directly *beside* one another.
This makes it easier to compare individual values.
```{r}
```{r, fig.alt = "Dodged bar chart of cut of diamonds in ggplot2::diamonds. Dodged bars are grouped by levels of cut (fair, good, very good, premium, and ideal). In each group there are eight bars, one for each level of clarity, each filled with a different colour. Heights of these bars represent the number of diamonds with a given level of cut and clarity."}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
```
@ -746,7 +766,7 @@ There's one other type of adjustment that's not useful for bar charts, but it ca
Recall our first scatterplot.
Did you notice that the plot displays only 126 points, even though there are 234 observations in the dataset?
```{r echo = FALSE}
```{r, echo = FALSE, fig.alt = "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association."}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
```
@ -760,7 +780,7 @@ You can avoid this gridding by setting the position adjustment to "jitter".
`position = "jitter"` adds a small amount of random noise to each point.
This spreads the points out because no two points are likely to receive the same amount of random noise.
```{r}
```{r, fig.alt = "Jittered scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association."}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
```
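If you need the jitter to be reproducible, or want to control how much noise is added, one option is to pass a position object instead of the string `"jitter"`. The sketch below assumes ggplot2's `position_jitter()` with its `width`, `height`, and `seed` arguments; the specific values and the name `p_jitter` are illustrative choices, not recommendations from the text.

```r
library(ggplot2)

# Jitter with explicit amounts and a fixed seed so the plot is reproducible.
# The width, height, and seed values are arbitrary, illustrative choices.
p_jitter <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(position = position_jitter(width = 0.05, height = 0.5, seed = 123))

p_jitter
```

Fixing the seed means the same points land in the same places each time the plot is rendered, which can be helpful when a figure is rebuilt repeatedly.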
@ -775,7 +795,7 @@ To learn more about a position adjustment, look up the help page associated with
1. What is the problem with this plot?
How could you improve it?
```{r}
```{r, fig.alt = "Scatterplot of highway fuel efficiency versus city fuel efficiency of cars in ggplot2::mpg that shows a positive association. The number of points visible in this plot is less than the number of points in the dataset."}
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point()
```
@ -797,7 +817,7 @@ There are a number of other coordinate systems that are occasionally helpful.
This is useful, for example, if you want horizontal boxplots.
It's also useful for long labels: it's hard to get them to fit without overlapping on the x-axis.
```{r fig.width = 3, out.width = "50%", fig.align = "default"}
```{r, fig.width = 3, out.width = "50%", fig.align = "default", fig.alt = "Two side-by-side box plots of highway fuel efficiency of cars in ggplot2::mpg. A separate box plot is created for cars in each level of class (2seater, compact, midsize, minivan, pickup, subcompact, and suv). In the first plot class is on the x-axis, in the second plot class is on the y-axis. The second plot makes it easier to read the names of the levels of class since they're listed down the y-axis, avoiding overlap."}
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
@ -805,10 +825,17 @@ There are a number of other coordinate systems that are occasionally helpful.
coord_flip()
```
However, note that you can achieve the same result by flipping the aesthetic mappings of the two variables.
```{r, fig.width = 3, fig.align = "default", fig.alt = "Side-by-side box plots of highway fuel efficiency of cars in ggplot2::mpg. A separate box plot is drawn along the y-axis for cars in each level of class (2seater, compact, midsize, minivan, pickup, subcompact, and suv)."}
ggplot(data = mpg, mapping = aes(y = class, x = hwy)) +
geom_boxplot()
```
- `coord_quickmap()` sets the aspect ratio correctly for maps.
This is very important if you're plotting spatial data with ggplot2 (which unfortunately we don't have the space to cover in this book).
```{r fig.width = 3, out.width = "50%", fig.align = "default", message = FALSE}
```{r, fig.width = 3, out.width = "50%", fig.align = "default", message = FALSE, fig.alt = "Two maps of the boundaries of New Zealand. In the first plot the aspect ratio is incorrect, in the second plot it's correct."}
nz <- map_data("nz")
ggplot(nz, aes(long, lat, group = group)) +
@ -822,7 +849,7 @@ There are a number of other coordinate systems that are occasionally helpful.
- `coord_polar()` uses polar coordinates.
Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.
```{r fig.width = 3, out.width = "50%", fig.align = "default", fig.asp = 1}
```{r, fig.width = 3, out.width = "50%", fig.align = "default", fig.asp = 1, fig.alt = "Two plots: on the left is a bar chart of cut of diamonds in ggplot2::diamonds, on the right is a Coxcomb chart of the same data."}
bar <- ggplot(data = diamonds) +
geom_bar(
mapping = aes(x = cut, fill = cut),
@ -849,7 +876,7 @@ There are a number of other coordinate systems that are occasionally helpful.
Why is `coord_fixed()` important?
What does `geom_abline()` do?
```{r, fig.asp = 1, out.width = "50%"}
```{r, fig.asp = 1, out.width = "50%", fig.alt = "Scatterplot of highway fuel efficiency versus city fuel efficiency of cars in ggplot2::mpg that shows a positive association. The plot also has a straight line that follows the trend of the relationship between the variables but doesn't go through the cloud of points; it's beneath it."}
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline() +
@ -879,7 +906,7 @@ The grammar of graphics is based on the insight that you can uniquely describe *
To see how this works, consider how you could build a basic plot from scratch: you could start with a dataset and then transform it into the information that you want to display (with a stat).
```{r, echo = FALSE, out.width = "100%"}
```{r, echo = FALSE, out.width = "100%", fig.alt = "A figure demonstrating the steps for going from raw data (ggplot2::diamonds) to a table of counts where each row represents one level of cut and a count column shows how many diamonds are in that cut level. Steps 1 and 2 are annotated: 1. Begin with the diamonds dataset. 2. Compute counts for each cut value with stat_count()."}
knitr::include_graphics("images/visualization-grammar-1.png")
```
@ -887,7 +914,7 @@ Next, you could choose a geometric object to represent each observation in the t
You could then use the aesthetic properties of the geoms to represent variables in the data.
You would map the values of each variable to the levels of an aesthetic.
```{r, echo = FALSE, out.width = "100%"}
```{r, echo = FALSE, out.width = "100%", fig.alt = "A figure demonstrating the steps for going from raw data (ggplot2::diamonds) to a table of counts where each row represents one level of cut and a count column shows how many diamonds are in that cut level. Each level is also mapped to a colour. Steps 3 and 4 are annotated: 3. Represent each observation with a bar. 4. Map the fill of each bar to the ..count.. variable."}
knitr::include_graphics("images/visualization-grammar-2.png")
```
@ -896,7 +923,7 @@ You'd use the location of the objects (which is itself an aesthetic property) to
At that point, you would have a complete graph, but you could further adjust the positions of the geoms within the coordinate system (a position adjustment) or split the graph into subplots (faceting).
You could also extend the plot by adding one or more additional layers, where each additional layer uses a dataset, a geom, a set of mappings, a stat, and a position adjustment.
```{r, echo = FALSE, out.width = "100%"}
```{r, echo = FALSE, out.width = "100%", fig.alt = "A figure demonstrating the steps for going from raw data (ggplot2::diamonds) to a bar chart where each bar represents one level of cut and is filled with a different colour. Steps 5 and 6 are annotated: 5. Place geoms in a Cartesian coordinate system. 6. Map the y values to ..count.. and the x values to cut."}
knitr::include_graphics("images/visualization-grammar-3.png")
```
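The steps above can be sketched in code by writing out the components that ggplot2 normally fills in by default. This is a minimal illustration rather than the book's own template: `stat = "count"`, `position = "stack"`, and `coord_cartesian()` are just the defaults for `geom_bar()` made explicit, the names `p` and `counts` are hypothetical, and `layer_data()` is used only to peek at the table that the stat computes.

```r
library(ggplot2)

# A bar chart with each grammar component spelled out explicitly.
p <- ggplot(data = diamonds) +       # 1. begin with a dataset
  geom_bar(                          # 3. represent each row of counts with a bar
    mapping = aes(x = cut, fill = cut),  # map variables to aesthetics
    stat = "count",                  # 2. compute counts for each cut value
    position = "stack"               # position adjustment (the default for bars)
  ) +
  coord_cartesian()                  # 5. place geoms in a Cartesian coordinate system

# Inspect the table computed by the stat: one row per level of cut.
counts <- layer_data(p)
counts[, c("x", "count")]

p
```

Because every argument shown is already a default, dropping `stat`, `position`, and `coord_cartesian()` should produce the same plot; writing them out simply makes the layers of the grammar visible.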


@ -2,7 +2,7 @@
# Introduction {#explore-intro}
The goal of the first part of this book is to get you up to speed with the basic tools of **data exploration** as quickly as possible.
The goal of the first part of this book is to introduce you to the data science workflow, including data **importing**, **tidying**, and **exploration**, as quickly as possible.
Data exploration is the art of looking at your data, rapidly generating hypotheses, quickly testing them, then repeating again and again and again.
The goal of data exploration is to generate many promising leads that you can later explore in more depth.
@ -10,6 +10,8 @@ The goal of data exploration is to generate many promising leads that you can la
knitr::include_graphics("diagrams/data-science-explore.png")
```
<!--# TO DO: Update figure to include import and tidy as well. -->
In this part of the book you will learn some useful tools that have an immediate payoff:
- Visualisation is a great place to start with R programming, because the payoff is so clear: you get to make elegant and informative plots that help you understand data.
@ -21,12 +23,12 @@ In this part of the book you will learn some useful tools that have an immediate
You'll learn the underlying principles, and how to get your data into a tidy form.
- Before you can transform and visualise your data, you need to first get your data into R.
In Chapter \@ref(data-import) you'll learn the basics of getting plain-text rectangular data into R.
In Chapter \@ref(data-import) you'll learn the basics of getting plain-text, rectangular data into R.
- Finally, in Chapter \@ref(exploratory-data-analysis), you'll combine visualisation and transformation with your curiosity and scepticism to ask and answer interesting questions about data.
Modelling is an important part of the exploratory process, but you don't have the skills to effectively learn or apply it yet so we will not cover it in this part.
Modelling is an important part of the exploratory process, but you don't have the skills to effectively learn or apply it yet, and the details of modelling fall outside the scope of this book.
Nestled among these three chapters that teach you the tools of exploration are three chapters that focus on your R workflow.
Nestled among these five chapters that teach you the tools for doing data science are three chapters that focus on your R workflow.
In Chapters \@ref(workflow-basics), \@ref(workflow-scripts), and \@ref(workflow-projects), you'll learn good workflow practices for writing and organising your R code.
These will set you up for success in the long run, as they'll give you the tools to stay organised when you tackle real projects.