New visualize part (#1115)

Mine Cetinkaya-Rundel 2022-12-04 13:05:38 -05:00 committed by GitHub
parent bff64c83eb
commit 1ffbbf90b5
22 changed files with 3030 additions and 2170 deletions


@ -17,6 +17,7 @@ Imports:
gapminder,
ggplot2,
ggrepel,
ggridges,
hexbin,
janitor,
jsonlite,

EDA.qmd

@ -4,7 +4,7 @@
#| results: "asis"
#| echo: false
source("_common.R")
status("polishing")
status("complete")
```
## Introduction
@ -52,7 +52,7 @@ When you ask a question, the question focuses your attention on a specific part
EDA is fundamentally a creative process.
And like most creative processes, the key to asking *quality* questions is to generate a large *quantity* of questions.
It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset.
It is difficult to ask revealing questions at the start of your analysis because you do not know what insights can be gleaned from your dataset.
On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery.
You can quickly drill down into the most interesting parts of your data---and develop a set of thought-provoking questions---if you follow up each question with a new question based on what you find.
@ -91,37 +91,10 @@ This is true even if you measure quantities that are constant, like the speed of
Each of your measurements will include a small amount of error that varies from measurement to measurement.
Variables can also vary if you measure across different subjects (e.g. the eye colors of different people) or different times (e.g. the energy levels of an electron at different moments).
Every variable has its own pattern of variation, which can reveal interesting information about how that variable varies between measurements on the same observation as well as across observations.
The best way to understand that pattern is to visualize the distribution of the variable's values.
The best way to understand that pattern is to visualize the distribution of the variable's values, which you've learned about in @sec-data-visualisation.
### Visualizing distributions
How you visualize the distribution of a variable will depend on whether the variable is categorical or continuous.
A variable is **categorical** if it can only take one of a small set of values.
In R, categorical variables are usually saved as factors or character vectors.
To examine the distribution of a categorical variable, you can use a bar chart:
```{r}
#| fig-alt: >
#| A bar chart of cuts of diamonds. The cuts are presented in increasing
#| order of frequency: Fair (less than 2500), Good (approximately 5000),
#| Very Good (approximately 12500), Premium (approximately 14000), and Ideal
#| (approximately 21500).
ggplot(data = diamonds, mapping = aes(x = cut)) +
geom_bar()
```
The height of the bars displays how many observations occurred with each x value.
You can compute these values manually with `count()`:
```{r}
diamonds |>
count(cut)
```
A variable is **continuous** if it can take any of an infinite set of ordered values.
Numbers and date-times are two examples of continuous variables.
To examine the distribution of a continuous variable, you can use a histogram:
We'll start our exploration by visualizing the distribution of weights (`carat`) of \~54,000 diamonds from the `diamonds` dataset.
Since `carat` is a numerical variable, we can use a histogram:
```{r}
#| fig-alt: >
@ -132,62 +105,10 @@ To examine the distribution of a continuous variable, you can use a histogram:
#| at 1, and much fewer, approximately 5000 diamonds in the bin centered at
#| 1.5. Beyond this, there's a trailing tail.
ggplot(data = diamonds, mapping = aes(x = carat)) +
ggplot(diamonds, aes(x = carat)) +
geom_histogram(binwidth = 0.5)
```
You can compute this by hand by combining `count()` and `cut_width()`:
```{r}
diamonds |>
count(cut_width(carat, 0.5))
```
A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin.
Note that even though it's not possible to have a `carat` value that is smaller than 0 (since weights of diamonds, by definition, are positive values), the bins start at a negative value (-0.25) in order to create bins of equal width across the range of the data with the center of the first bin at 0.
This behavior is also apparent in the histogram above, where the first bar ranges from -0.25 to 0.25.
The tallest bar shows that almost 30,000 observations have a `carat` value between 0.25 and 0.75, which are the left and right edges of the bar centered at 0.5.
You can set the width of the intervals in a histogram with the `binwidth` argument, which is measured in the units of the `x` variable.
You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns.
For example, here is how the graph above looks when we zoom into just the diamonds with a size of less than three carats and choose a smaller binwidth.
```{r}
#| fig-alt: >
#| A histogram of carats of diamonds, with the x-axis ranging from 0 to 3 and
#| the y-axis ranging from 0 to 10000. The binwidth is quite narrow (0.1),
#| resulting in many bars. The distribution is right skewed but there are lots
#| of ups and downs in the heights of the bins, creating a jagged outline.
smaller <- diamonds |>
filter(carat < 3)
ggplot(data = smaller, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.1)
```
If you wish to overlay multiple histograms in the same plot, we recommend using `geom_freqpoly()` instead of `geom_histogram()`.
`geom_freqpoly()` performs the same calculation as `geom_histogram()`, but instead of displaying the counts with bars, it uses lines.
It's much easier to understand overlapping lines than bars.
```{r}
#| fig-alt: >
#| A frequency polygon of carats of diamonds where each cut of carat (Fair,
#| Good, Very Good, Premium, and Ideal) is represented with a different color
#| line. The x-axis ranges from 0 to 3 and the y-axis ranges from 0 to almost
#| 6000. Ideal diamonds have a much higher peak than the others around 0.25
#| carats. All cuts of diamonds have right skewed distributions with local
#| peaks at 1 carat and 2 carats. As the cut level increases (from Fair to
#| Ideal), so does the number of diamonds that fall into that category.
ggplot(data = smaller, mapping = aes(x = carat, color = cut)) +
geom_freqpoly(binwidth = 0.1, size = 0.75)
```
We've also customized the thickness of the lines using the `size` argument in order to make them stand out a bit more against the background.
There are a few challenges with this type of plot, which we will come back to in @sec-cat-cont on visualizing a categorical and a continuous variable.
Now that you can visualize variation, what should you look for in your plots?
And what type of follow-up questions should you ask?
We've put together a list below of the most useful types of information that you will find in your graphs, along with some follow-up questions for each type of information.
@ -223,7 +144,10 @@ As an example, the histogram below suggests several interesting questions:
#| is right skewed, with many peaks followed by bars in decreasing heights,
#| until a sharp increase at the next peak.
ggplot(data = smaller, mapping = aes(x = carat)) +
smaller <- diamonds |>
filter(carat < 3)
ggplot(smaller, aes(x = carat)) +
geom_histogram(binwidth = 0.01)
```
@ -247,7 +171,7 @@ Eruption times appear to be clustered into two groups: there are short eruptions
#| and the y-axis ranges from 0 to roughly 40. The distribution is bimodal
#| with peaks around 1.75 and 4.5.
ggplot(data = faithful, mapping = aes(x = eruptions)) +
ggplot(faithful, aes(x = eruptions)) +
geom_histogram(binwidth = 0.25)
```
@ -268,7 +192,7 @@ The only evidence of outliers is the unusually wide limits on the x-axis.
#| y-axis ranges from 0 to 12000. There is a peak around 5, and the data
#| appear to be completely clustered around the peak.
ggplot(data = diamonds, mapping = aes(x = y)) +
ggplot(diamonds, aes(x = y)) +
geom_histogram(binwidth = 0.5)
```
@ -283,7 +207,7 @@ To make it easy to see the unusual values, we need to zoom to small values of th
#| there is one bin at 0 with a height of about 8, one a little over 30 with
#| a height of 1 and another one a little below 60 with a height of 1.
ggplot(data = diamonds, mapping = aes(x = y)) +
ggplot(diamonds, aes(x = y)) +
geom_histogram(binwidth = 0.5) +
coord_cartesian(ylim = c(0, 50))
```
@ -341,7 +265,7 @@ You'll need to figure out what caused them (e.g. a data entry error) and disclos
What happens if you leave `binwidth` unset?
What happens if you try and zoom so only half a bar shows?
## Missing values {#sec-missing-values-eda}
## Unusual values {#sec-missing-values-eda}
If you've encountered unusual values in your dataset, and simply want to move on to the rest of your analysis, you have two options.
@ -371,8 +295,8 @@ The first argument `test` should be a logical vector.
The result will contain the value of the second argument, `yes`, when `test` is `TRUE`, and the value of the third argument, `no`, when it is false.
Alternatively to `if_else()`, use `case_when()`.
`case_when()` is particularly useful inside `mutate()` when you want to create a new variable that relies on a complex combination of existing variables or would otherwise require multiple `if_else()` statements nested inside one another.
You will learn more about logical vectors in @sec-logicals.
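As a minimal sketch of the replacement described above (the 3 and 20 cutoffs are illustrative, not taken from this diff), you could mark implausible `y` values as missing with `if_else()`, or express the same rule with `case_when()`:
```{r}
# Replace suspicious y values (hypothetical cutoffs) with missing values
diamonds2 <- diamonds |>
  mutate(y = if_else(y < 3 | y > 20, NA, y))

# The same rule with case_when(), which scales better to several conditions
# (plain NA and .default require dplyr 1.1.0 or later)
diamonds2 <- diamonds |>
  mutate(y = case_when(
    y < 3 | y > 20 ~ NA,
    .default = y
  ))
```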
Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing.
It's not obvious where you should plot missing values, so ggplot2 doesn't include them in the plot, but it does warn that they've been removed:
```{r}
@ -383,7 +307,7 @@ It's not obvious where you should plot missing values, so ggplot2 doesn't includ
#| has length greater than 3. The one outlier has a length of 0 and a width
#| of about 6.5.
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
ggplot(diamonds2, aes(x = x, y = y)) +
geom_point()
```
@ -392,7 +316,7 @@ To suppress that warning, set `na.rm = TRUE`:
```{r}
#| eval: false
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
ggplot(diamonds2, aes(x = x, y = y)) +
geom_point(na.rm = TRUE)
```
@ -417,8 +341,8 @@ nycflights13::flights |>
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + (sched_min / 60)
) |>
ggplot(mapping = aes(sched_dep_time)) +
geom_freqpoly(mapping = aes(color = cancelled), binwidth = 1/4)
ggplot(aes(sched_dep_time)) +
geom_freqpoly(aes(color = cancelled), binwidth = 1/4)
```
However, this plot isn't great because there are many more non-cancelled flights than cancelled flights.
@ -437,14 +361,10 @@ In the next section we'll explore some techniques for improving this comparison.
If variation describes the behavior *within* a variable, covariation describes the behavior *between* variables.
**Covariation** is the tendency for the values of two or more variables to vary together in a related way.
The best way to spot covariation is to visualize the relationship between two or more variables.
How you do that depends again on the types of variables involved.
### A categorical and continuous variable {#sec-cat-cont}
### A categorical and a numerical variable {#sec-cat-num}
It's common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous frequency polygon.
The default appearance of `geom_freqpoly()` is not that useful for that sort of comparison because the height is given by the count.
That means if one of the groups is much smaller than the others, it's hard to see the differences in the shapes of their distributions.
For example, let's explore how the price of a diamond varies with its quality (measured by `cut`):
For example, let's explore how the price of a diamond varies with its quality (measured by `cut`) using `geom_freqpoly()`:
```{r}
#| fig-alt: >
@ -455,11 +375,11 @@ For example, let's explore how the price of a diamond varies with its quality (m
#| distributions of prices of diamonds. One notable feature is that
#| Ideal diamonds have the highest peak around 1500.
ggplot(data = diamonds, mapping = aes(x = price)) +
geom_freqpoly(mapping = aes(color = cut), binwidth = 500, size = 0.75)
ggplot(diamonds, aes(x = price)) +
geom_freqpoly(aes(color = cut), binwidth = 500, size = 0.75)
```
It's hard to see the difference in distribution because the overall counts differ so much:
The default appearance of `geom_freqpoly()` is not that useful for that sort of comparison because the height is given by the count and the overall counts of `cut` differ so much, making it hard to see the differences in the shapes of their distributions:
```{r}
#| fig-alt: >
@ -467,7 +387,7 @@ It's hard to see the difference in distribution because the overall counts diffe
#| frequencies of various cuts. Fair diamonds have the lowest frequency,
#| then Good, then Very Good, then Premium, and then Ideal.
ggplot(data = diamonds, mapping = aes(x = cut)) +
ggplot(diamonds, aes(x = cut)) +
geom_bar()
```
@ -483,8 +403,8 @@ Instead of displaying count, we'll display the **density**, which is the count s
#| diamonds. One notable feature is that all but Fair diamonds have high peaks
#| around a price of 1500 and Fair diamonds have a higher mean than others.
ggplot(data = diamonds, mapping = aes(x = price, y = after_stat(density))) +
geom_freqpoly(mapping = aes(color = cut), binwidth = 500, size = 0.75)
ggplot(diamonds, aes(x = price, y = after_stat(density))) +
geom_freqpoly(aes(color = cut), binwidth = 500, size = 0.75)
```
Note that we're mapping the density to `y`, but since `density` is not a variable in the `diamonds` dataset, we need to first calculate it.
@ -493,29 +413,7 @@ We use the `after_stat()` function to do so.
There's something rather surprising about this plot - it appears that fair diamonds (the lowest quality) have the highest average price!
But maybe that's because frequency polygons are a little hard to interpret - there's a lot going on in this plot.
Another alternative to display the distribution of a continuous variable broken down by a categorical variable is the boxplot.
A **boxplot** is a type of visual shorthand for a distribution of values that is popular among statisticians.
Each boxplot consists of:
- A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR).
In the middle of the box is a line that displays the median, i.e. 50th percentile, of the distribution.
These three lines give you a sense of the spread of the distribution and whether or not the distribution is symmetric about the median or skewed to one side.
- Visual points that display observations that fall more than 1.5 times the IQR from either edge of the box.
These outlying points are unusual so are plotted individually.
- A line (or whisker) that extends from each end of the box and goes to the farthest non-outlier point in the distribution.
```{r}
#| echo: false
#| fig-alt: >
#| A diagram depicting how a boxplot is created following the steps outlined
#| above.
knitr::include_graphics("images/EDA-boxplot.png")
```
Let's take a look at the distribution of price by cut using `geom_boxplot()`:
A visually simpler way to explore this relationship is with side-by-side boxplots.
```{r}
#| fig-height: 3
@ -525,7 +423,7 @@ Let's take a look at the distribution of price by cut using `geom_boxplot()`:
#| Ideal). The medians are close to each other, with the median for Ideal
#| diamonds lowest and that for Fair highest.
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
ggplot(diamonds, aes(x = cut, y = price)) +
geom_boxplot()
```
@ -535,7 +433,7 @@ In the exercises, you'll be challenged to figure out why.
`cut` is an ordered factor: fair is worse than good, which is worse than very good and so on.
Many categorical variables don't have such an intrinsic order, so you might want to reorder them to make a more informative display.
One way to do that is with the `reorder()` function.
One way to do that is with the `fct_reorder()` function.
For example, take the `class` variable in the `mpg` dataset.
You might be interested to know how highway mileage varies across classes:
@ -546,7 +444,7 @@ You might be interested to know how highway mileage varies across classes:
#| on the x-axis (2seaters, compact, midsize, minivan, pickup, subcompact,
#| and suv).
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
ggplot(mpg, aes(x = class, y = hwy)) +
geom_boxplot()
```
@ -559,8 +457,8 @@ To make the trend easier to see, we can reorder `class` based on the median valu
#| on the x-axis and ordered by increasing median highway mileage (pickup,
#| suv, minivan, 2seater, subcompact, compact, and midsize).
ggplot(data = mpg,
mapping = aes(x = fct_reorder(class, hwy, median), y = hwy)) +
ggplot(mpg,
aes(x = fct_reorder(class, hwy, median), y = hwy)) +
geom_boxplot()
```
@ -572,8 +470,8 @@ You can do that by exchanging the x and y aesthetic mappings.
#| Side-by-side boxplots of highway mileages of cars by class. Classes are
#| on the y-axis and ordered by increasing median highway mileage.
ggplot(data = mpg,
mapping = aes(y = fct_reorder(class, hwy, median), x = hwy)) +
ggplot(mpg,
aes(y = fct_reorder(class, hwy, median), x = hwy)) +
geom_boxplot()
```
@ -614,42 +512,13 @@ One way to do that is to rely on the built-in `geom_count()`:
#| the number of observations for that combination. The legend indicates
#| that these sizes range between 1000 and 4000.
ggplot(data = diamonds, mapping = aes(x = cut, y = color)) +
ggplot(diamonds, aes(x = cut, y = color)) +
geom_count()
```
The size of each circle in the plot displays how many observations occurred at each combination of values.
Covariation will appear as a strong correlation between specific x values and specific y values.
A more commonly used way of representing the covariation between two categorical variables is using a segmented bar chart.
In creating this bar chart, we map the variable we first want to divide the data by to the `x` aesthetic, and the variable we then want to further divide each group by to the `fill` aesthetic.
```{r}
#| fig-alt: >
#| A bar chart of cuts of diamonds, segmented by color. The number of diamonds
#| for each level of cut increases from Fair to Ideal and the heights
#| of the segments within each bar represent the number of diamonds that fall
#| within each color/cut combination. There appear to be some of each color of
#| diamonds within each level of cut of diamonds.
ggplot(data = diamonds, mapping = aes(x = cut, fill = color)) +
geom_bar()
```
However, in order to get a better sense of the relationship between these two variables, you should compare proportions instead of counts across groups.
```{r}
#| fig-alt: >
#| A bar chart of cuts of diamonds, segmented by color. The heights of each
#| of the bars representing each cut of diamond are the same, 1. The heights
#| of the segments within each bar represent the proportion of diamonds that
#| fall within each color/cut combination. The proportions don't appear to be
#| very different across the levels of cut.
ggplot(data = diamonds, mapping = aes(x = cut, fill = color)) +
geom_bar(position = "fill")
```
Another approach for exploring the relationship between these variables is computing the counts with dplyr:
```{r}
@ -669,8 +538,8 @@ Then visualize with `geom_tile()` and the fill aesthetic:
diamonds |>
count(color, cut) |>
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n))
ggplot(aes(x = color, y = cut)) +
geom_tile(aes(fill = n))
```
If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns.
@ -689,9 +558,9 @@ For larger plots, you might want to try the heatmaply package, which creates int
4. Why is it slightly better to use `aes(x = color, y = cut)` rather than `aes(x = cut, y = color)` in the example above?
### Two continuous variables
### Two numerical variables
You've already seen one great way to visualize the covariation between two continuous variables: draw a scatterplot with `geom_point()`.
You've already seen one great way to visualize the covariation between two numerical variables: draw a scatterplot with `geom_point()`.
You can see covariation as a pattern in the points.
For example, you can see an exponential relationship between the carat size and price of a diamond.
@ -701,7 +570,7 @@ For example, you can see an exponential relationship between the carat size and
#| A scatterplot of price vs. carat. The relationship is positive, somewhat
#| strong, and exponential.
ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
ggplot(diamonds, aes(x = carat, y = price)) +
geom_point()
```
@ -716,7 +585,7 @@ You've already seen one way to fix the problem: using the `alpha` aesthetic to a
#| the number of points is higher than other areas, The most obvious clusters
#| are for diamonds with 1, 1.5, and 2 carats.
ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
ggplot(diamonds, aes(x = carat, y = price)) +
geom_point(alpha = 1 / 100)
```
@ -738,11 +607,11 @@ You will need to install the hexbin package to use `geom_hex()`.
#| plot of price vs. carat. Both plots show that the highest density of
#| diamonds have low carats and low prices.
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
ggplot(smaller, aes(x = carat, y = price)) +
geom_bin2d()
# install.packages("hexbin")
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
ggplot(smaller, aes(x = carat, y = price)) +
geom_hex()
```
@ -760,8 +629,8 @@ For example, you could bin `carat` and then for each group, display a boxplot:
#| left skewed distributions. Cheaper, smaller diamonds have outliers on the
#| higher end, more expensive, bigger diamonds have outliers on the lower end.
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
ggplot(smaller, aes(x = carat, y = price)) +
geom_boxplot(aes(group = cut_width(carat, 0.1)))
```
`cut_width(x, width)`, as used above, divides `x` into bins of width `width`.
@ -778,8 +647,8 @@ That's the job of `cut_number()`:
#| increases as well. Cheaper, smaller diamonds have outliers on the higher
#| end, more expensive, bigger diamonds have outliers on the lower end.
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_number(carat, 20)))
ggplot(smaller, aes(x = carat, y = price)) +
geom_boxplot(aes(group = cut_number(carat, 20)))
```
#### Exercises
@ -805,7 +674,7 @@ ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
#| strong, linear relationship. There are a few unusual observations
#| above and below the bulk of the data, more below it than above.
ggplot(data = diamonds, mapping = aes(x = x, y = y)) +
ggplot(diamonds, aes(x = x, y = y)) +
geom_point() +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
```
@ -839,7 +708,7 @@ The scatterplot also displays the two clusters that we noticed above.
#| eruption times and short waiting times and one with long eruption times and
#| long waiting times.
ggplot(data = faithful, mapping = aes(x = eruptions, y = waiting)) +
ggplot(faithful, aes(x = eruptions, y = waiting)) +
geom_point()
```
@ -880,7 +749,7 @@ diamonds_fit <- linear_reg() |>
diamonds_aug <- augment(diamonds_fit, new_data = diamonds) |>
mutate(.resid = exp(.resid))
ggplot(data = diamonds_aug, mapping = aes(x = carat, y = .resid)) +
ggplot(diamonds_aug, aes(x = carat, y = .resid)) +
geom_point()
```
@ -893,66 +762,12 @@ Once you've removed the strong relationship between carat and price, you can see
#| quite similar, between roughly 0.75 and 1.25. Each of the distributions of
#| residuals is right skewed, with many outliers on the higher end.
ggplot(data = diamonds_aug, mapping = aes(x = cut, y = .resid)) +
ggplot(diamonds_aug, aes(x = cut, y = .resid)) +
geom_boxplot()
```
We're not discussing modelling in this book because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.
## ggplot2 calls
As we move on from these introductory chapters, we'll transition to a more concise expression of ggplot2 code.
So far we've been very explicit, which is helpful when you are learning:
```{r}
#| eval: false
#| fig-alt: >
#| A frequency polygon plot of eruption times for the Old Faithful geyser.
#| The distribution of eruption times is bimodal with one mode around 1.75
#| and the other around 4.5.
ggplot(data = faithful, mapping = aes(x = eruptions)) +
geom_freqpoly(binwidth = 0.25)
```
Typically, the first one or two arguments to a function are so important that you should know them by heart.
The first two arguments to `ggplot()` are `data` and `mapping`, and the first two arguments to `aes()` are `x` and `y`.
In the remainder of the book, we won't supply those names.
That saves typing, and, by reducing the amount of boilerplate, makes it easier to see what's different between plots.
That's a really important programming concern that we'll come back to in @sec-functions.
Rewriting the previous plot more concisely yields:
```{r}
#| eval: false
#| fig-alt: >
#| A frequency polygon plot of eruption times for the Old Faithful geyser.
#| The distribution of eruption times is bimodal with one mode around 1.75
#| and the other around 4.5.
ggplot(faithful, aes(eruptions)) +
geom_freqpoly(binwidth = 0.25)
```
Sometimes we'll turn the end of a pipeline of data transformation into a plot.
Watch for the transition from `|>` to `+`.
We wish this transition wasn't necessary but unfortunately ggplot2 was created before the pipe was discovered.
```{r}
#| eval: false
#| fig-alt: >
#| A tile plot of cut vs. clarity of diamonds. Each tile represents a
#| cut/clarity combination and tiles are colored according to the number of
#| observations in each tile. There are more Ideal diamonds than other cuts,
#| with the highest number being Ideal diamonds with VS2 clarity. Fair diamonds
#| and diamonds with clarity I1 are the lowest in frequency.
diamonds |>
count(cut, clarity) |>
ggplot(aes(clarity, cut, fill = n)) +
geom_tile()
```
## Summary
In this chapter you've learned a variety of tools to help you understand the variation within your data.


@ -34,9 +34,14 @@ book:
- workflow-style.qmd
- data-import.qmd
- workflow-scripts.qmd
- EDA.qmd
- workflow-help.qmd
- part: visualize.qmd
chapters:
- layers.qmd
- EDA.qmd
- communication.qmd
- part: transform.qmd
chapters:
- logicals.qmd
@ -64,7 +69,6 @@ book:
- part: communicate.qmd
chapters:
- quarto.qmd
- communicate-plots.qmd
- quarto-formats.qmd
- quarto-workflow.qmd


@ -1,743 +0,0 @@
# Graphics for communication {#sec-graphics-communication}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("drafting")
```
## Introduction
In @sec-exploratory-data-analysis, you learned how to use plots as tools for *exploration*.
When you make exploratory plots, you know---even before looking---which variables the plot will display.
You made each plot for a purpose, could quickly look at it, and then move on to the next plot.
In the course of most analyses, you'll produce tens or hundreds of plots, most of which are immediately thrown away.
Now that you understand your data, you need to *communicate* your understanding to others.
Your audience will likely not share your background knowledge and will not be deeply invested in the data.
To help others quickly build up a good mental model of the data, you will need to invest considerable effort in making your plots as self-explanatory as possible.
In this chapter, you'll learn some of the tools that ggplot2 provides to do so.
This chapter focuses on the tools you need to create good graphics.
We assume that you know what you want, and just need to know how to do it.
For that reason, we highly recommend pairing this chapter with a good general visualization book.
We particularly like [*The Truthful Art*](https://www.amazon.com/gp/product/0321934075/), by Alberto Cairo.
It doesn't teach the mechanics of creating visualizations, but instead focuses on what you need to think about in order to create effective graphics.
### Prerequisites
In this chapter, we'll focus once again on ggplot2.
We'll also use a little dplyr for data manipulation, and a few ggplot2 extension packages, including **ggrepel** and **patchwork**.
Rather than loading those extensions here, we'll refer to their functions explicitly, using the `::` notation.
This will help make it clear which functions are built into ggplot2, and which come from other packages.
Don't forget you'll need to install those packages with `install.packages()` if you don't already have them.
```{r}
#| message: false
library(tidyverse)
```
## Label
The easiest place to start when turning an exploratory graphic into an expository graphic is with good labels.
You add labels with the `labs()` function.
This example adds a plot title:
```{r}
#| message: false
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
labs(title = "Fuel efficiency generally decreases with engine size")
```
The purpose of a plot title is to summarize the main finding.
Avoid titles that just describe what the plot is, e.g. "A scatterplot of engine displacement vs. fuel economy".
If you need to add more text, there are two other useful labels that you can use in ggplot2 2.2.0 and above:
- `subtitle` adds additional detail in a smaller font beneath the title.
- `caption` adds text at the bottom right of the plot, often used to describe the source of the data.
```{r}
#| message: false
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
labs(
title = "Fuel efficiency generally decreases with engine size",
subtitle = "Two seaters (sports cars) are an exception because of their light weight",
caption = "Data from fueleconomy.gov"
)
```
You can also use `labs()` to replace the axis and legend titles.
It's usually a good idea to replace short variable names with more detailed descriptions, and to include the units.
```{r}
#| message: false
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
geom_smooth(se = FALSE) +
labs(
x = "Engine displacement (L)",
y = "Highway fuel economy (mpg)",
colour = "Car type"
)
```
It's possible to use mathematical equations instead of text strings.
Just switch `""` out for `quote()` and read about the available options in `?plotmath`:
```{r}
#| fig-asp: 1
#| out-width: "50%"
#| fig-width: 3
df <- tibble(
x = runif(10),
y = runif(10)
)
ggplot(df, aes(x, y)) +
geom_point() +
labs(
x = quote(sum(x[i] ^ 2, i == 1, n)),
y = quote(alpha + beta + frac(delta, theta))
)
```
### Exercises
1. Create one plot on the fuel economy data with customized `title`, `subtitle`, `caption`, `x`, `y`, and `colour` labels.
2. Recreate the following plot using the fuel economy data.
Note that both the colors and shapes of points vary by type of drive train.
```{r}
#| echo: false
ggplot(mpg, aes(cty, hwy, color = drv, shape = drv)) +
geom_point() +
labs(
x = "City MPG",
y = "Highway MPG",
shape = "Type of\ndrive train",
color = "Type of\ndrive train"
)
```
3. Take an exploratory graphic that you've created in the last month, and add informative titles to make it easier for others to understand.
## Annotations
In addition to labelling major components of your plot, it's often useful to label individual observations or groups of observations.
The first tool you have at your disposal is `geom_text()`.
`geom_text()` is similar to `geom_point()`, but it has an additional aesthetic: `label`.
This makes it possible to add textual labels to your plots.
There are two possible sources of labels.
First, you might have a tibble that provides labels.
The plot below isn't terribly useful, but it illustrates a useful approach: pull out the most efficient car in each class with dplyr, and then label it on the plot:
```{r}
best_in_class <- mpg |>
group_by(class) |>
filter(row_number(desc(hwy)) == 1)
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
geom_text(aes(label = model), data = best_in_class)
```
This is hard to read because the labels overlap with each other, and with the points.
We can make things a little better by switching to `geom_label()` which draws a rectangle behind the text.
We also use the `nudge_y` parameter to move the labels slightly above the corresponding points:
```{r}
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
geom_label(aes(label = model), data = best_in_class, nudge_y = 2, alpha = 0.5)
```
That helps a bit, but if you look closely in the top-left hand corner, you'll notice that there are two labels practically on top of each other.
This happens because the highway mileage and displacement for the best cars in the compact and subcompact categories are exactly the same.
There's no way that we can fix these by applying the same transformation for every label.
Instead, we can use the **ggrepel** package by Kamil Slowikowski.
This useful package will automatically adjust labels so that they don't overlap:
```{r}
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
geom_point(size = 3, shape = 1, data = best_in_class) +
ggrepel::geom_label_repel(aes(label = model), data = best_in_class)
```
Note another handy technique used here: we added a second layer of large, hollow points to highlight the labelled points.
You can sometimes use the same idea to replace the legend with labels placed directly on the plot.
It's not wonderful for this plot, but it isn't too bad.
(`theme(legend.position = "none")` turns the legend off --- we'll talk about it more shortly.)
```{r}
class_avg <- mpg |>
group_by(class) |>
summarize(
displ = median(displ),
hwy = median(hwy)
)
ggplot(mpg, aes(displ, hwy, colour = class)) +
ggrepel::geom_label_repel(aes(label = class),
data = class_avg,
size = 6,
label.size = 0,
segment.color = NA
) +
geom_point() +
theme(legend.position = "none")
```
Alternatively, you might just want to add a single label to the plot, but you'll still need to create a data frame.
Often, you want the label in the corner of the plot, so it's convenient to create a new data frame using `summarize()` to compute the maximum values of x and y.
```{r}
label_info <- mpg |>
summarize(
displ = max(displ),
hwy = max(hwy),
label = "Increasing engine size is \nrelated to decreasing fuel economy."
)
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_text(data = label_info, aes(label = label), vjust = "top", hjust = "right")
```
If you want to place the text exactly on the borders of the plot, you can use `+Inf` and `-Inf`.
Since we're no longer computing the positions from `mpg`, we can use `tibble()` to create the data frame:
```{r}
label_info <- tibble(
displ = Inf,
hwy = Inf,
label = "Increasing engine size is \nrelated to decreasing fuel economy."
)
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_text(data = label_info, aes(label = label), vjust = "top", hjust = "right")
```
In these examples, we manually broke the label up into lines using `"\n"`.
Another approach is to use `stringr::str_wrap()` to automatically add line breaks, given the number of characters you want per line:
```{r}
"Increasing engine size is related to decreasing fuel economy." |>
str_wrap(width = 40) |>
writeLines()
```
Note the use of `hjust` and `vjust` to control the alignment of the label.
@fig-just shows all nine possible combinations.
```{r}
#| label: fig-just
#| echo: false
#| fig-width: 4.5
#| fig-asp: 0.5
#| out-width: "60%"
#| fig-cap: >
#| All nine combinations of `hjust` and `vjust`.
vjust <- c(bottom = 0, center = 0.5, top = 1)
hjust <- c(left = 0, center = 0.5, right = 1)
df <- crossing(hj = names(hjust), vj = names(vjust)) |>
mutate(
y = vjust[vj],
x = hjust[hj],
label = paste0("hjust = '", hj, "'\n", "vjust = '", vj, "'")
)
ggplot(df, aes(x, y)) +
geom_point(colour = "grey70", size = 5) +
geom_point(size = 0.5, colour = "red") +
geom_text(aes(label = label, hjust = hj, vjust = vj), size = 4) +
labs(x = NULL, y = NULL)
```
Remember, in addition to `geom_text()`, you have many other geoms in ggplot2 available to help annotate your plot.
A few ideas:
- Use `geom_hline()` and `geom_vline()` to add reference lines.
We often make them thick (`size = 2`) and white (`colour = white`), and draw them underneath the primary data layer.
That makes them easy to see, without drawing attention away from the data.
- Use `geom_rect()` to draw a rectangle around points of interest.
The boundaries of the rectangle are defined by aesthetics `xmin`, `xmax`, `ymin`, `ymax`.
- Use `geom_segment()` with the `arrow` argument to draw attention to a point with an arrow.
Use aesthetics `x` and `y` to define the starting location, and `xend` and `yend` to define the end location.
The only limit is your imagination (and your patience with positioning annotations to be aesthetically pleasing)!
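To make these concrete, here is a rough sketch (not part of this commit; the highlighted region and arrow coordinates are made up) combining the three geoms on the `mpg` data:
```{r}
# Hypothetical region and arrow coordinates, chosen only for illustration
region <- tibble(xmin = 5, xmax = 7, ymin = 10, ymax = 30)
callout <- tibble(x = 4, y = 40, xend = 5.4, yend = 30.5)

ggplot(mpg, aes(displ, hwy)) +
  # a thick white reference line, drawn underneath the data layer
  geom_hline(yintercept = 30, colour = "white", size = 2) +
  geom_point() +
  # a rectangle around the points of interest
  geom_rect(
    data = region,
    aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
    inherit.aes = FALSE, fill = NA, colour = "red"
  ) +
  # an arrow drawing attention to that region
  geom_segment(
    data = callout,
    aes(x = x, y = y, xend = xend, yend = yend),
    inherit.aes = FALSE,
    arrow = arrow(length = unit(0.1, "inches"))
  )
```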
### Exercises
1. Use `geom_text()` with infinite positions to place text at the four corners of the plot.
2. Read the documentation for `annotate()`.
How can you use it to add a text label to a plot without having to create a tibble?
3. How do labels with `geom_text()` interact with faceting?
How can you add a label to a single facet?
How can you put a different label in each facet?
(Hint: Think about the underlying data.)
4. What arguments to `geom_label()` control the appearance of the background box?
5. What are the four arguments to `arrow()`?
How do they work?
Create a series of plots that demonstrate the most important options.
## Scales
The third way you can make your plot better for communication is to adjust the scales.
Scales control the mapping from data values to things that you can perceive.
Normally, ggplot2 automatically adds scales for you.
For example, when you type:
```{r}
#| label: default-scales
#| fig-show: "hide"
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class))
```
ggplot2 automatically adds default scales behind the scenes:
```{r}
#| fig-show: "hide"
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
scale_x_continuous() +
scale_y_continuous() +
scale_colour_discrete()
```
Note the naming scheme for scales: `scale_` followed by the name of the aesthetic, then `_`, then the name of the scale.
The default scales are named according to the type of variable they align with: continuous, discrete, datetime, or date.
There are lots of non-default scales which you'll learn about below.
The default scales have been carefully chosen to do a good job for a wide range of inputs.
Nevertheless, you might want to override the defaults for two reasons:
- You might want to tweak some of the parameters of the default scale.
This allows you to do things like change the breaks on the axes, or the key labels on the legend.
- You might want to replace the scale altogether, and use a completely different algorithm.
Often you can do better than the default because you know more about the data.
### Axis ticks and legend keys
There are two primary arguments that affect the appearance of the ticks on the axes and the keys on the legend: `breaks` and `labels`.
`breaks` controls the position of the ticks, or the values associated with the keys.
`labels` controls the text label associated with each tick/key.
The most common use of `breaks` is to override the default choice:
```{r}
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_y_continuous(breaks = seq(15, 40, by = 5))
```
You can use `labels` in the same way (a character vector the same length as `breaks`), but you can also set it to `NULL` to suppress the labels altogether.
This is useful for maps, or for publishing plots where you can't share the absolute numbers.
```{r}
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_x_continuous(labels = NULL) +
scale_y_continuous(labels = NULL)
```
You can also use `breaks` and `labels` to control the appearance of legends.
Collectively axes and legends are called **guides**.
Axes are used for x and y aesthetics; legends are used for everything else.
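For example (a sketch, not part of this commit), the same `labels` argument on a colour scale relabels the legend keys:
```{r}
# Relabel the legend keys for the drive train variable with a named vector
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = drv)) +
  scale_colour_discrete(
    labels = c("4" = "4-wheel drive", "f" = "front-wheel drive", "r" = "rear-wheel drive")
  )
```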
Another use of `breaks` is when you have relatively few data points and want to highlight exactly where the observations occur.
For example, take this plot that shows when each US president started and ended their term.
```{r}
presidential |>
mutate(id = 33 + row_number()) |>
ggplot(aes(start, id)) +
geom_point() +
geom_segment(aes(xend = end, yend = id)) +
scale_x_date(NULL, breaks = presidential$start, date_labels = "'%y")
```
Note that the specification of breaks and labels for date and datetime scales is a little different:
- `date_labels` takes a format specification, in the same form as `parse_datetime()`.
- `date_breaks` (not used in the plot above, but sketched below) takes a string like "2 days" or "1 month".
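A quick sketch of `date_breaks` (not part of this commit), using the `economics` dataset that ships with ggplot2:
```{r}
# Place a labelled break every 10 years, showing only the year
ggplot(economics, aes(date, unemploy)) +
  geom_line() +
  scale_x_date(date_breaks = "10 years", date_labels = "%Y")
```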
### Legend layout
You will most often use `breaks` and `labels` to tweak the axes.
While they both also work for legends, there are a few other techniques you are more likely to use.
To control the overall position of the legend, you need to use a `theme()` setting.
We'll come back to themes at the end of the chapter, but in brief, they control the non-data parts of the plot.
The theme setting `legend.position` controls where the legend is drawn:
```{r}
#| layout-ncol: 2
#| fig-width: 4
#| fig-asp: 1
base <- ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class))
base + theme(legend.position = "left")
base + theme(legend.position = "top")
base + theme(legend.position = "bottom")
base + theme(legend.position = "right") # the default
```
You can also use `legend.position = "none"` to suppress the display of the legend altogether.
To control the display of individual legends, use `guides()` along with `guide_legend()` or `guide_colorbar()`.
The following example shows two important settings: controlling the number of rows the legend uses with `nrow`, and overriding one of the aesthetics to make the points bigger.
This is particularly useful if you have used a low `alpha` to display many points on a plot.
```{r}
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
geom_smooth(se = FALSE) +
theme(legend.position = "bottom") +
guides(colour = guide_legend(nrow = 1, override.aes = list(size = 4)))
```
### Replacing a scale
Instead of just tweaking the details a little, you can instead replace the scale altogether.
There are two types of scales you're most likely to want to switch out: continuous position scales and colour scales.
Fortunately, the same principles apply to all the other aesthetics, so once you've mastered position and colour, you'll be able to quickly pick up other scale replacements.
It's very useful to plot transformations of your variable.
For example, as we've seen in [diamond prices](diamond-prices) it's easier to see the precise relationship between `carat` and `price` if we log transform them:
```{r}
#| fig-align: default
#| layout-ncol: 2
#| fig-width: 4
#| fig-height: 3
ggplot(diamonds, aes(carat, price)) +
geom_bin2d()
ggplot(diamonds, aes(log10(carat), log10(price))) +
geom_bin2d()
```
However, the disadvantage of this transformation is that the axes are now labelled with the transformed values, making it hard to interpret the plot.
Instead of doing the transformation in the aesthetic mapping, we can instead do it with the scale.
This is visually identical, except the axes are labelled on the original data scale.
```{r}
ggplot(diamonds, aes(carat, price)) +
geom_bin2d() +
scale_x_log10() +
scale_y_log10()
```
Another scale that is frequently customized is colour.
The default categorical scale picks colors that are evenly spaced around the colour wheel.
Useful alternatives are the ColorBrewer scales which have been hand tuned to work better for people with common types of colour blindness.
The two plots below look similar, but there is enough difference in the shades of red and green that the dots on the right can be distinguished even by people with red-green colour blindness.
```{r}
#| fig-align: default
#| layout-ncol: 2
#| fig-width: 4
#| fig-height: 3
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = drv))
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = drv)) +
scale_colour_brewer(palette = "Set1")
```
Don't forget simpler techniques.
If there are just a few colors, you can add a redundant shape mapping.
This will also help ensure your plot is interpretable in black and white.
```{r}
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = drv, shape = drv)) +
scale_colour_brewer(palette = "Set1")
```
The ColorBrewer scales are documented online at <https://colorbrewer2.org/> and made available in R via the **RColorBrewer** package, by Erich Neuwirth.
@fig-brewer shows the complete list of all palettes.
The sequential (top) and diverging (bottom) palettes are particularly useful if your categorical values are ordered, or have a "middle".
This often arises if you've used `cut()` to make a continuous variable into a categorical variable.
```{r}
#| label: fig-brewer
#| echo: false
#| fig.cap: All ColourBrewer scales.
#| fig.asp: 2.5
par(mar = c(0, 3, 0, 0))
RColorBrewer::display.brewer.all()
```
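As a small sketch of the `cut()` idea mentioned above (not part of this commit), a continuous variable binned into ordered groups pairs naturally with a sequential palette:
```{r}
# Bin city mileage into three ordered groups and colour with a sequential palette
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = cut(cty, 3))) +
  scale_colour_brewer(palette = "Blues") +
  labs(colour = "cty (binned)")
```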
When you have a predefined mapping between values and colors, use `scale_colour_manual()`.
For example, if we map presidential party to colour, we want to use the standard mapping of red for Republicans and blue for Democrats:
```{r}
presidential |>
mutate(id = 33 + row_number()) |>
ggplot(aes(start, id, colour = party)) +
geom_point() +
geom_segment(aes(xend = end, yend = id)) +
scale_colour_manual(values = c(Republican = "red", Democratic = "blue"))
```
For continuous colour, you can use the built-in `scale_colour_gradient()` or `scale_fill_gradient()`.
If you have a diverging scale, you can use `scale_colour_gradient2()`.
That allows you to give, for example, positive and negative values different colors.
That's sometimes also useful if you want to distinguish points above or below the mean.
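A quick sketch (not part of this commit, with simulated data) of a diverging scale centred on the mean:
```{r}
# Colour points by how far they fall above or below the mean of z
set.seed(1)
df <- tibble(x = rnorm(100), y = rnorm(100), z = rnorm(100))

ggplot(df, aes(x, y, colour = z - mean(z))) +
  geom_point() +
  scale_colour_gradient2(low = "darkblue", mid = "white", high = "darkred", midpoint = 0) +
  labs(colour = "z (centred)")
```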
Another option is to use the viridis color scales.
The designers, Nathaniel Smith and Stéfan van der Walt, carefully tailored continuous colour schemes that are perceptible to people with various forms of colour blindness as well as perceptually uniform in both color and black and white.
These scales are available as continuous (`c`), discrete (`d`), and binned (`b`) palettes in ggplot2.
```{r}
#| fig-align: default
#| layout-ncol: 2
#| fig-width: 4
#| fig-asp: 1
df <- tibble(
x = rnorm(10000),
y = rnorm(10000)
)
ggplot(df, aes(x, y)) +
geom_hex() +
coord_fixed() +
labs(title = "Default, continuous")
ggplot(df, aes(x, y)) +
geom_hex() +
coord_fixed() +
scale_fill_viridis_c() +
labs(title = "Viridis, continuous")
ggplot(df, aes(x, y)) +
geom_hex() +
coord_fixed() +
scale_fill_viridis_b() +
labs(title = "Viridis, binned")
```
Note that all colour scales come in two varieties: `scale_colour_x()` and `scale_fill_x()` for the `colour` and `fill` aesthetics respectively (the colour scales are available in both UK and US spellings).
### Exercises
1. Why doesn't the following code override the default scale?
```{r}
#| fig-show: "hide"
ggplot(df, aes(x, y)) +
geom_hex() +
scale_colour_gradient(low = "white", high = "red") +
coord_fixed()
```
2. What is the first argument to every scale?
How does it compare to `labs()`?
3. Change the display of the presidential terms by:
a. Combining the two variants shown above.
b. Improving the display of the y axis.
c. Labelling each term with the name of the president.
d. Adding informative plot labels.
e. Placing breaks every 4 years (this is trickier than it seems!).
4. Use `override.aes` to make the legend on the following plot easier to see.
```{r}
#| fig-format: "png"
#| out-width: "50%"
ggplot(diamonds, aes(carat, price)) +
geom_point(aes(colour = cut), alpha = 1/20)
```
## Zooming
There are three ways to control the plot limits:
1. Adjusting what data are plotted
2. Setting the limits in each scale
3. Setting `xlim` and `ylim` in `coord_cartesian()`
To zoom in on a region of the plot, it's generally best to use `coord_cartesian()`.
Compare the following two plots:
```{r}
#| layout-ncol: 2
#| fig-width: 4
#| fig-height: 3
#| message: false
ggplot(mpg, mapping = aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth() +
coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))
mpg |>
filter(displ >= 5, displ <= 7, hwy >= 10, hwy <= 30) |>
ggplot(aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth()
```
You can also set the `limits` on individual scales.
Reducing the limits is basically equivalent to subsetting the data.
It is generally more useful if you want to *expand* the limits, for example, to match scales across different plots.
For example, if we extract two classes of cars and plot them separately, it's difficult to compare the plots because all three scales (the x-axis, the y-axis, and the colour aesthetic) have different ranges.
```{r}
#| layout-ncol: 2
#| fig-width: 4
#| fig-height: 3
suv <- mpg |> filter(class == "suv")
compact <- mpg |> filter(class == "compact")
ggplot(suv, aes(displ, hwy, colour = drv)) +
geom_point()
ggplot(compact, aes(displ, hwy, colour = drv)) +
geom_point()
```
One way to overcome this problem is to share scales across multiple plots, training the scales with the `limits` of the full data.
```{r}
#| layout-ncol: 2
#| fig-width: 4
#| fig-height: 3
x_scale <- scale_x_continuous(limits = range(mpg$displ))
y_scale <- scale_y_continuous(limits = range(mpg$hwy))
col_scale <- scale_colour_discrete(limits = unique(mpg$drv))
ggplot(suv, aes(displ, hwy, colour = drv)) +
geom_point() +
x_scale +
y_scale +
col_scale
ggplot(compact, aes(displ, hwy, colour = drv)) +
geom_point() +
x_scale +
y_scale +
col_scale
```
In this particular case, you could have simply used faceting, but this technique is useful more generally if, for instance, you want to spread plots over multiple pages of a report.
## Themes
Finally, you can customize the non-data elements of your plot with a theme:
```{r}
#| message: false
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
theme_bw()
```
ggplot2 includes eight themes by default, as shown in @fig-themes.
Many more are included in add-on packages like **ggthemes** (<https://jrnold.github.io/ggthemes>), by Jeffrey Arnold.
```{r}
#| label: fig-themes
#| echo: false
#| fig-cap: The eight themes built-in to ggplot2.
#| fig-alt: >
#| Eight barplots created with ggplot2, each
#| with one of the eight built-in themes:
#| theme_bw() - White background with grid lines,
#| theme_light() - Light axes and grid lines,
#| theme_classic() - Classic theme, axes but no grid
#| lines, theme_linedraw() - Only black lines,
#| theme_dark() - Dark background for contrast,
#| theme_minimal() - Minimal theme, no background,
#| theme_gray() - Gray background (default theme),
#| theme_void() - Empty theme, only geoms are visible.
knitr::include_graphics("images/visualization-themes.png")
```
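As a quick sketch of an add-on theme (not part of this commit; it assumes the ggthemes package is installed):
```{r}
# One of the many extra themes provided by the ggthemes package
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  ggthemes::theme_economist()
```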
Many people wonder why the default theme has a gray background.
This was a deliberate choice because it puts the data forward while still making the grid lines visible.
The white grid lines are visible (which is important because they significantly aid position judgments), but they have little visual impact and we can easily tune them out.
The grey background gives the plot a similar typographic colour to the text, ensuring that the graphics fit in with the flow of a document without jumping out with a bright white background.
Finally, the grey background creates a continuous field of colour which ensures that the plot is perceived as a single visual entity.
It's also possible to control individual components of each theme, like the size and colour of the font used for the y axis.
Unfortunately, this level of detail is outside the scope of this book, so you'll need to read the [ggplot2 book](https://ggplot2-book.org/) for the full details.
You can also create your own themes, if you are trying to match a particular corporate or journal style.
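To give a flavour (a sketch, not part of this commit), individual components are adjusted with `theme()` and the `element_*()` helpers:
```{r}
# Tweak the y axis title font and drop the minor grid lines
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  theme(
    axis.title.y = element_text(size = 14, colour = "darkblue"),
    panel.grid.minor = element_blank()
  )
```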
## Saving your plots {#sec-ggsave}
There are two main ways to get your plots out of R and into your final write-up: `ggsave()` and knitr.
`ggsave()` will save the most recent plot to disk:
```{r}
#| fig-show: "hide"
ggplot(mpg, aes(displ, hwy)) + geom_point()
ggsave("my-plot.pdf")
```
```{r}
#| include: false
file.remove("my-plot.pdf")
```
If you don't specify the `width` and `height` they will be taken from the dimensions of the current plotting device.
For reproducible code, you'll want to specify them.
Generally, however, we recommend that you assemble your final reports using Quarto, so we focus on the important code chunk options that you should know about for graphics.
You can learn more about `ggsave()` in the documentation.
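For example (a sketch, not part of this commit):
```{r}
#| eval: false
# Fix the output size so the saved file doesn't depend on the current device
ggsave("my-plot.pdf", width = 6, height = 4, units = "in")
```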
## Learning more
The absolute best place to learn more is the ggplot2 book: [*ggplot2: Elegant graphics for data analysis*](https://ggplot2-book.org/).
It goes into much more depth about the underlying theory, and has many more examples of how to combine the individual pieces to solve practical problems.
Another great resource is the ggplot2 extensions gallery <https://exts.ggplot2.tidyverse.org/gallery/>.
This site lists many of the packages that extend ggplot2 with new geoms and scales.
It's a great place to start if you're trying to do something that seems hard with ggplot2.


@ -17,7 +17,7 @@ However, it doesn't matter how great your analysis is unless you can explain it
#| can't communicate your results to other humans, it doesn't matter how
#| great your analysis is.
#| fig-alt: >
#| A diagram displaying the data science cycle with visualize and
#| A diagram displaying the data science cycle with
#| communicate highlighed in blue.
#| out.width: NULL

communication.qmd (new file; diff suppressed because it is too large)

(another file's diff suppressed because it is too large)

layers.qmd (new file; diff suppressed because it is too large)


@ -10,7 +10,7 @@ status("polishing")
## Introduction
You've already learned the basics of missing values earlier in the book.
You first saw them in @sec-summarize where they interfered with computing summary statistics, and you learned about their infectious nature and how to check for their presence in @sec-na-comparison.
You first saw them in @sec-data-visualisation where they resulted in a warning when making a plot as well as in @sec-summarize where they interfered with computing summary statistics, and you learned about their infectious nature and how to check for their presence in @sec-na-comparison.
Now we'll come back to them in more depth, so you can learn more of the details.
We'll start by discussing some general tools for working with missing values recorded as `NA`s.


@ -336,35 +336,6 @@ The following table summarizes which types of output each option suppresses:
| `message: false` | | | | | \- | |
| `warning: false` | | | | | | \- |
### Global options
As you work more with knitr, you will discover that some of the default chunk options don't fit your needs and you want to change them.
You can do this by adding the preferred options in the document YAML, under `execute`.
For example, if you are preparing a report for an audience who does not need to see your code but only your results and narrative, you might set `echo: false` at the document level.
That will hide the code by default, showing only the chunks you deliberately choose to show (with `echo: true`).
You might consider setting `message: false` and `warning: false`, but that would make it harder to debug problems because you wouldn't see any messages in the final document.
``` yaml
title: "My report"
execute:
echo: false
```
Since Quarto is designed to be multi-lingual (it works with R as well as other languages like Python, Julia, etc.), not all of the knitr options are available at the document execution level, since some of them only work with knitr and not with the other engines Quarto uses for running code in other languages (e.g., Jupyter).
You can, however, still set these as global options for your document under the `knitr` field, under `opts_chunk`.
For example, when writing books and tutorials we set:
``` yaml
title: "Tutorial"
knitr:
opts_chunk:
comment: "#>"
collapse: true
```
This uses our preferred comment formatting and ensures that the code and output are kept closely entwined.
### Inline code
There is one other way to embed R code into a Quarto document: directly into the text, with: `r inline()`.
@ -607,7 +578,7 @@ This makes it easier to understand the `dependson` specification.
1. Set up a network of chunks where `d` depends on `c` and `b`, and both `b` and `c` depend on `a`. Have each chunk print `lubridate::now()`, set `cache: true`, then verify your understanding of caching.
## Troubleshooting
Troubleshooting Quarto documents can be challenging because you are no longer in an interactive R environment, and you will need to learn some new tricks.
Additionally, the error could be due to issues with the Quarto document itself or due to the R code in the Quarto document.


@ -6,8 +6,7 @@
source("_common.R")
```
After reading the first part of the book, you understand (at least superficially) the most important tools for doing data science.
Now it's time to start diving into the details.
The second part of the book was a deep dive into data visualization.
In this part of the book, you'll learn about the most important types of variables that you'll encounter inside a data frame and learn the tools you can use to work with them.
```{r}
@ -15,9 +14,9 @@ In this part of the book, you'll learn about the most important types of variabl
#| echo: false
#| fig-cap: >
#| The options for data transformation depends heavily on the type of
#| data involve, the subject of this part of the book.
#| data involved, the subject of this part of the book.
#| fig-alt: >
#| Our data science model transform, highlighted in blue.
#| Our data science model, with transform highlighted in blue.
#| out.width: NULL
knitr::include_graphics("diagrams/data-science/transform.png", dpi = 270)

visualize.qmd (new file)

@ -0,0 +1,41 @@
# Visualize {#sec-visualize .unnumbered}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("drafting")
```
After reading the first two parts of the book, you understand (at least superficially) the most important tools for doing data science.
Now it's time to start diving into the details.
In this part of the book, you'll learn about visualizing data in further depth.
```{r}
#| label: fig-ds-visualize
#| echo: false
#| fig-cap: >
#| Data visualization is often the first step in data exploration.
#| fig-alt: >
#| Our data science model, with visualize highlighted in blue.
#| out.width: NULL
knitr::include_graphics("diagrams/data-science/visualize.png", dpi = 270)
```
Each chapter addresses one to a few aspects of creating a data visualization.
- In @sec-layers you will learn about the layered grammar of graphics.
- In @sec-exploratory-data-analysis, you'll combine visualization with your curiosity and skepticism to ask and answer interesting questions about data.
- Finally, in @sec-communication you will learn how to take your exploratory graphics and turn them into expository graphics, graphics that help the newcomer to your analysis understand what's going on as quickly and easily as possible.
### Learning more
The absolute best place to learn more is the ggplot2 book: [*ggplot2: Elegant graphics for data analysis*](https://ggplot2-book.org/).
It goes into much more depth about the underlying theory, and has many more examples of how to combine the individual pieces to solve practical problems.
Another great resource is the ggplot2 extensions gallery <https://exts.ggplot2.tidyverse.org/gallery/>.
This site lists many of the packages that extend ggplot2 with new geoms and scales.
It's a great place to start if you're trying to do something that seems hard with ggplot2.


@ -39,8 +39,6 @@ Five chapters focus on the tools of data science:
- Before you can transform and visualize your data, you need to first get your data into R.
In @sec-data-import you'll learn the basics of getting `.csv` files into R.
- Finally, in @sec-exploratory-data-analysis, you'll combine visualization and transformation with your curiosity and skepticism to ask and answer interesting questions about data.
Nestled among these chapters are five other chapters that focus on your R workflow.
In @sec-workflow-basics, @sec-workflow-pipes, @sec-workflow-style, and @sec-workflow-scripts-projects, you'll learn good workflow practices for writing and organizing your R code.
These will set you up for success in the long run, as they'll give you the tools to stay organised when you tackle real projects.


@ -231,7 +231,23 @@ knitr::include_graphics("screenshots/rstudio-env.png")
What happens?
How can you get to the same place using the menus?
4. Let's revisit an exercise from @sec-ggsave.
Run the following lines of code.
Which of the two plots is saved as `mpg-plot.png`?
Why?
```{r}
#| eval: false
my_bar_plot <- ggplot(mpg, aes(x = class)) +
geom_bar()
my_scatter_plot <- ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point()
ggsave(filename = "mpg-plot.png", plot = my_bar_plot)
```
## Summary
You've now learned a little more about how R code works, along with some tips to help you understand your code when you come back to it in the future.
In the next chapter, we'll continue your data science journey by teaching you about dplyr, the tidyverse package that helps you transform data, whether it's selecting important variables, filtering down to rows of interest, or computing summary statistics.


@ -129,9 +129,24 @@ But they're still good to know about even if you've never used `%>%` because you
Luckily there's no need to commit entirely to one pipe or the other --- you can use the base pipe for the majority of cases where it's sufficient, and use the magrittr pipe when you really need its special features.
## `|>` vs `+`
Sometimes we'll turn the end of a pipeline of data transformation into a plot.
Watch for the transition from `|>` to `+`.
We wish this transition wasn't necessary but unfortunately ggplot2 was created before the pipe was discovered.
```{r}
#| eval: false
diamonds |>
count(cut, clarity) |>
ggplot(aes(clarity, cut, fill = n)) +
geom_tile()
```
## Summary
In this chapter, you've learn more about the pipe: why we recommend it and some of the history that lead to `|>`.
In this chapter, you've learned more about the pipe: why we recommend it and some of the history that led to `|>`.
The pipe is important because you'll use it again and again throughout your analysis, but hopefully it will quickly become invisible and your fingers will type it (or use the keyboard shortcut) without your brain having to think too much about it.
In the next chapter, we switch back to data science tools, learning about tidy data.