diff --git a/DESCRIPTION b/DESCRIPTION index 8a68c1c..747f378 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -17,6 +17,7 @@ Imports: gapminder, ggplot2, ggrepel, + ggridges, hexbin, janitor, jsonlite, diff --git a/EDA.qmd b/EDA.qmd index ada8a91..b0fb536 100644 --- a/EDA.qmd +++ b/EDA.qmd @@ -4,7 +4,7 @@ #| results: "asis" #| echo: false source("_common.R") -status("polishing") +status("complete") ``` ## Introduction @@ -52,7 +52,7 @@ When you ask a question, the question focuses your attention on a specific part EDA is fundamentally a creative process. And like most creative processes, the key to asking *quality* questions is to generate a large *quantity* of questions. -It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset. +It is difficult to ask revealing questions at the start of your analysis because you do not know what insights can be gleaned from your dataset. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data---and develop a set of thought-provoking questions---if you follow up each question with a new question based on what you find. @@ -91,37 +91,10 @@ This is true even if you measure quantities that are constant, like the speed of Each of your measurements will include a small amount of error that varies from measurement to measurement. Variables can also vary if you measure across different subjects (e.g. the eye colors of different people) or different times (e.g. the energy levels of an electron at different moments). Every variable has its own pattern of variation, which can reveal interesting information about how that variable varies between measurements on the same observation as well as across observations. 
-The best way to understand that pattern is to visualize the distribution of the variable's values. +The best way to understand that pattern is to visualize the distribution of the variable's values, which you've learned about in @sec-data-visualisation. -### Visualizing distributions - -How you visualize the distribution of a variable will depend on whether the variable is categorical or continuous. -A variable is **categorical** if it can only take one of a small set of values. -In R, categorical variables are usually saved as factors or character vectors. -To examine the distribution of a categorical variable, you can use a bar chart: - -```{r} -#| fig-alt: > -#| A bar chart of cuts of diamonds. The cuts are presented in increasing -#| order of frequency: Fair (less than 2500), Good (approximately 5000), -#| Very Good (apprximately 12500), Premium, (approximately 14000), and Ideal -#| (approximately 21500). - -ggplot(data = diamonds, mapping = aes(x = cut)) + - geom_bar() -``` - -The height of the bars displays how many observations occurred with each x value. -You can compute these values manually with `count()`: - -```{r} -diamonds |> - count(cut) -``` - -A variable is **continuous** if it can take any of an infinite set of ordered values. -Numbers and date-times are two examples of continuous variables. -To examine the distribution of a continuous variable, you can use a histogram: +We'll start our exploration by visualizing the distribution of weights (`carat`) of \~54,000 diamonds from the `diamonds` dataset. +Since `carat` is a numerical variable, we can use a histogram: ```{r} #| fig-alt: > @@ -132,62 +105,10 @@ To examine the distribution of a continuous variable, you can use a histogram: #| at 1, and much fewer, approximately 5000 diamonds in the bin centered at #| 1.5. Beyond this, there's a trailing tail. 
-ggplot(data = diamonds, mapping = aes(x = carat)) + +ggplot(diamonds, aes(x = carat)) + geom_histogram(binwidth = 0.5) ``` -You can compute this by hand by combining `count()` and `cut_width()`: - -```{r} -diamonds |> - count(cut_width(carat, 0.5)) -``` - -A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin. -Note that even though it's not possible to have a `carat` value that is smaller than 0 (since weights of diamonds, by definition, are positive values), the bins start at a negative value (-0.25) in order to create bins of equal width across the range of the data with the center of the first bin at 0. -This behavior is also apparent in the histogram above, where the first bar ranges from -0.25 to 0.25. -The tallest bar shows that almost 30,000 observations have a `carat` value between 0.25 and 0.75, which are the left and right edges of the bar centered at 0.5. - -You can set the width of the intervals in a histogram with the `binwidth` argument, which is measured in the units of the `x` variable. -You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. -For example, here is how the graph above looks when we zoom into just the diamonds with a size of less than three carats and choose a smaller binwidth. - -```{r} -#| fig-alt: > -#| A histogram of carats of diamonds, with the x-axis ranging from 0 to 3 and -#| the y-axis ranging from 0 to 10000. The binwidth is quite narrow (0.1), -#| resulting in many bars. The distribution is right skewed but there are lots -#| of ups and downs in the heights of the bins, creating a jagged outline. 
- -smaller <- diamonds |> - filter(carat < 3) - -ggplot(data = smaller, mapping = aes(x = carat)) + - geom_histogram(binwidth = 0.1) -``` - -If you wish to overlay multiple histograms in the same plot, we recommend using `geom_freqpoly()` instead of `geom_histogram()`. -`geom_freqpoly()` performs the same calculation as `geom_histogram()`, but instead of displaying the counts with bars, uses lines instead. -It's much easier to understand overlapping lines than bars. - -```{r} -#| fig-alt: > -#| A frequency polygon of carats of diamonds where each cut of carat (Fair, -#| Good, Very Good, Premium, and Ideal) is represented with a different color -#| line. The x-axis ranges from 0 to 3 and the y-axis ranges from 0 to almost -#| 6000. Ideal diamonds have a much higher peak than the others around 0.25 -#| carats. All cuts of diamonds have right skewed distributions with local -#| peaks at 1 carat and 2 carats. As the cut level increases (from Fair to -#| Ideal), so does the number of diamonds that fall into that category. - -ggplot(data = smaller, mapping = aes(x = carat, color = cut)) + - geom_freqpoly(binwidth = 0.1, size = 0.75) -``` - -We've also customized the thickness of the lines using the `size` argument in order to make them stand out a bit more against the background. - -There are a few challenges with this type of plot, which we will come back to in @sec-cat-cont on visualizing a categorical and a continuous variable. - Now that you can visualize variation, what should you look for in your plots? And what type of follow-up questions should you ask? We've put together a list below of the most useful types of information that you will find in your graphs, along with some follow-up questions for each type of information. @@ -223,7 +144,10 @@ As an example, the histogram below suggests several interesting questions: #| is right skewed, with many peaks followed by bars in decreasing heights, #| until a sharp increase at the next peak. 
-ggplot(data = smaller, mapping = aes(x = carat)) + +smaller <- diamonds |> + filter(carat < 3) + +ggplot(smaller, aes(x = carat)) + geom_histogram(binwidth = 0.01) ``` @@ -247,7 +171,7 @@ Eruption times appear to be clustered into two groups: there are short eruptions #| and the y-axis ranges from 0 to roughly 40. The distribution is bimodal #| with peaks around 1.75 and 4.5. -ggplot(data = faithful, mapping = aes(x = eruptions)) + +ggplot(faithful, aes(x = eruptions)) + geom_histogram(binwidth = 0.25) ``` @@ -268,7 +192,7 @@ The only evidence of outliers is the unusually wide limits on the x-axis. #| y-axis ranges from 0 to 12000. There is a peak around 5, and the data #| appear to be completely clustered around the peak. -ggplot(data = diamonds, mapping = aes(x = y)) + +ggplot(diamonds, aes(x = y)) + geom_histogram(binwidth = 0.5) ``` @@ -283,7 +207,7 @@ To make it easy to see the unusual values, we need to zoom to small values of th #| there is one bin at 0 with a height of about 8, one a little over 30 with #| a height of 1 and another one a little below 60 with a height of 1. -ggplot(data = diamonds, mapping = aes(x = y)) + +ggplot(diamonds, aes(x = y)) + geom_histogram(binwidth = 0.5) + coord_cartesian(ylim = c(0, 50)) ``` @@ -341,7 +265,7 @@ You'll need to figure out what caused them (e.g. a data entry error) and disclos What happens if you leave `binwidth` unset? What happens if you try and zoom so only half a bar shows? -## Missing values {#sec-missing-values-eda} +## Unusual values {#sec-missing-values-eda} If you've encountered unusual values in your dataset, and simply want to move on to the rest of your analysis, you have two options. @@ -371,8 +295,8 @@ The first argument `test` should be a logical vector. The result will contain the value of the second argument, `yes`, when `test` is `TRUE`, and the value of the third argument, `no`, when it is false. Alternatively to `if_else()`, use `case_when()`. 
`case_when()` is particularly useful inside mutate when you want to create a new variable that relies on a complex combination of existing variables or would otherwise require multiple `if_else()` statements nested inside one another. +You will learn more about logical vectors in @sec-logicals. -Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing. It's not obvious where you should plot missing values, so ggplot2 doesn't include them in the plot, but it does warn that they've been removed: ```{r} @@ -383,7 +307,7 @@ It's not obvious where you should plot missing values, so ggplot2 doesn't includ #| has length greater than 3. The one outlier has a length of 0 and a width #| of about 6.5. -ggplot(data = diamonds2, mapping = aes(x = x, y = y)) + +ggplot(diamonds2, aes(x = x, y = y)) + geom_point() ``` @@ -392,7 +316,7 @@ To suppress that warning, set `na.rm = TRUE`: ```{r} #| eval: false -ggplot(data = diamonds2, mapping = aes(x = x, y = y)) + +ggplot(diamonds2, aes(x = x, y = y)) + geom_point(na.rm = TRUE) ``` @@ -417,8 +341,8 @@ nycflights13::flights |> sched_min = sched_dep_time %% 100, sched_dep_time = sched_hour + (sched_min / 60) ) |> - ggplot(mapping = aes(sched_dep_time)) + - geom_freqpoly(mapping = aes(color = cancelled), binwidth = 1/4) + ggplot(aes(sched_dep_time)) + + geom_freqpoly(aes(color = cancelled), binwidth = 1/4) ``` However this plot isn't great because there are many more non-cancelled flights than cancelled flights. @@ -437,14 +361,10 @@ In the next section we'll explore some techniques for improving this comparison. If variation describes the behavior *within* a variable, covariation describes the behavior *between* variables. **Covariation** is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualize the relationship between two or more variables. -How you do that depends again on the types of variables involved. 
-### A categorical and continuous variable {#sec-cat-cont} +### A categorical and a numerical variable {#sec-cat-num} -It's common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous frequency polygon. -The default appearance of `geom_freqpoly()` is not that useful for that sort of comparison because the height is given by the count. -That means if one of the groups is much smaller than the others, it's hard to see the differences in the shapes of their distributions. -For example, let's explore how the price of a diamond varies with its quality (measured by `cut`): +For example, let's explore how the price of a diamond varies with its quality (measured by `cut`) using `geom_freqpoly()`: ```{r} #| fig-alt: > @@ -455,11 +375,11 @@ For example, let's explore how the price of a diamond varies with its quality (m #| distributions of prices of diamonds. One notable feature is that #| Ideal diamonds have the highest peak around 1500. -ggplot(data = diamonds, mapping = aes(x = price)) + - geom_freqpoly(mapping = aes(color = cut), binwidth = 500, size = 0.75) +ggplot(diamonds, aes(x = price)) + + geom_freqpoly(aes(color = cut), binwidth = 500, size = 0.75) ``` -It's hard to see the difference in distribution because the overall counts differ so much: +The default appearance of `geom_freqpoly()` is not that useful for that sort of comparison because the height is given by the count and the overall counts of `cut` differ so much, making it hard to see the differences in the shapes of their distributions: ```{r} #| fig-alt: > @@ -467,7 +387,7 @@ It's hard to see the difference in distribution because the overall counts diffe #| frenquencies of various cuts. Fair diamonds have the lowest frequency, #| then Good, then Very Good, then Premium, and then Ideal. 
-ggplot(data = diamonds, mapping = aes(x = cut)) + +ggplot(diamonds, aes(x = cut)) + geom_bar() ``` @@ -483,8 +403,8 @@ Instead of displaying count, we'll display the **density**, which is the count s #| diamonds. One notable feature is that all but Fair diamonds have high peaks #| around a price of 1500 and Fair diamonds have a higher mean than others. -ggplot(data = diamonds, mapping = aes(x = price, y = after_stat(density))) + - geom_freqpoly(mapping = aes(color = cut), binwidth = 500, size = 0.75) +ggplot(diamonds, aes(x = price, y = after_stat(density))) + + geom_freqpoly(aes(color = cut), binwidth = 500, size = 0.75) ``` Note that we're mapping the density the `y`, but since `density` is not a variable in the `diamonds` dataset, we need to first calculate it. @@ -493,29 +413,7 @@ We use the `after_stat()` function to do so. There's something rather surprising about this plot - it appears that fair diamonds (the lowest quality) have the highest average price! But maybe that's because frequency polygons are a little hard to interpret - there's a lot going on in this plot. -Another alternative to display the distribution of a continuous variable broken down by a categorical variable is the boxplot. -A **boxplot** is a type of visual shorthand for a distribution of values that is popular among statisticians. -Each boxplot consists of: - -- A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR). - In the middle of the box is a line that displays the median, i.e. 50th percentile, of the distribution. - These three lines give you a sense of the spread of the distribution and whether or not the distribution is symmetric about the median or skewed to one side. - -- Visual points that display observations that fall more than 1.5 times the IQR from either edge of the box. - These outlying points are unusual so are plotted individually. 
- -- A line (or whisker) that extends from each end of the box and goes to the farthest non-outlier point in the distribution. - -```{r} -#| echo: false -#| fig-alt: > -#| A diagram depicting how a boxplot is created following the steps outlined -#| above. - -knitr::include_graphics("images/EDA-boxplot.png") -``` - -Let's take a look at the distribution of price by cut using `geom_boxplot()`: +A visually simpler plot for exploring this relationship is using side-by-side boxplots. ```{r} #| fig-height: 3 @@ -525,7 +423,7 @@ Let's take a look at the distribution of price by cut using `geom_boxplot()`: #| Ideal). The medians are close to each other, with the median for Ideal #| diamonds lowest and that for Fair highest. -ggplot(data = diamonds, mapping = aes(x = cut, y = price)) + +ggplot(diamonds, aes(x = cut, y = price)) + geom_boxplot() ``` @@ -535,7 +433,7 @@ In the exercises, you'll be challenged to figure out why. `cut` is an ordered factor: fair is worse than good, which is worse than very good and so on. Many categorical variables don't have such an intrinsic order, so you might want to reorder them to make a more informative display. -One way to do that is with the `reorder()` function. +One way to do that is with the `fct_reorder()` function. For example, take the `class` variable in the `mpg` dataset. You might be interested to know how highway mileage varies across classes: @@ -546,7 +444,7 @@ You might be interested to know how highway mileage varies across classes: #| on the x-axis (2seaters, compact, midsize, minivan, pickup, subcompact, #| and suv). -ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + +ggplot(mpg, aes(x = class, y = hwy)) + geom_boxplot() ``` @@ -559,8 +457,8 @@ To make the trend easier to see, we can reorder `class` based on the median valu #| on the x-axis and ordered by increasing median highway mileage (pickup, #| suv, minivan, 2seater, subcompact, compact, and midsize). 
-ggplot(data = mpg, - mapping = aes(x = fct_reorder(class, hwy, median), y = hwy)) + +ggplot(mpg, + aes(x = fct_reorder(class, hwy, median), y = hwy)) + geom_boxplot() ``` @@ -572,8 +470,8 @@ You can do that by exchanging the x and y aesthetic mappings. #| Side-by-side boxplots of highway mileages of cars by class. Classes are #| on the y-axis and ordered by increasing median highway mileage. -ggplot(data = mpg, - mapping = aes(y = fct_reorder(class, hwy, median), x = hwy)) + +ggplot(mpg, + aes(y = fct_reorder(class, hwy, median), x = hwy)) + geom_boxplot() ``` @@ -614,42 +512,13 @@ One way to do that is to rely on the built-in `geom_count()`: #| the number of observations for that combination. The legend indicates #| that these sizes range between 1000 and 4000. -ggplot(data = diamonds, mapping = aes(x = cut, y = color)) + +ggplot(diamonds, aes(x = cut, y = color)) + geom_count() ``` The size of each circle in the plot displays how many observations occurred at each combination of values. Covariation will appear as a strong correlation between specific x values and specific y values. -A more commonly used way of representing the covariation between two categorical variables is using a segmented bar chart. -In creating this bar chart, we map the variable we want to divide the data into first to the `x` aesthetic and the variable we then further want to divide each group into to the `fill` aesthetic. - -```{r} -#| fig-alt: > -#| A bar chart of cuts of diamonds, segmented by color. The number of diamonds -#| for each level of cut increases from Fair to Ideal and the heights -#| of the segments within each bar represent the number of diamonds that fall -#| within each color/cut combination. There appear to be some of each color of -#| diamonds within each level of cut of diamonds. 
- -ggplot(data = diamonds, mapping = aes(x = cut, fill = color)) + - geom_bar() -``` - -However, in order to get a better sense of the relationship between these two variables, you should compare proportions instead of counts across groups. - -```{r} -#| fig-alt: > -#| A bar chart of cuts of diamonds, segmented by color. The heights of each -#| of the bars representing each cut of diamond are the same, 1. The heights -#| of the segments within each bar represent the proportion of diamonds that -#| fall within each color/cut combination. The proportions don't appear to be -#| very different across the levels of cut. - -ggplot(data = diamonds, mapping = aes(x = cut, fill = color)) + - geom_bar(position = "fill") -``` - Another approach for exploring the relationship between these variables is computing the counts with dplyr: ```{r} @@ -669,8 +538,8 @@ Then visualize with `geom_tile()` and the fill aesthetic: diamonds |> count(color, cut) |> - ggplot(mapping = aes(x = color, y = cut)) + - geom_tile(mapping = aes(fill = n)) + ggplot(aes(x = color, y = cut)) + + geom_tile(aes(fill = n)) ``` If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. @@ -689,9 +558,9 @@ For larger plots, you might want to try the heatmaply package, which creates int 4. Why is it slightly better to use `aes(x = color, y = cut)` rather than `aes(x = cut, y = color)` in the example above? -### Two continuous variables +### Two numerical variables -You've already seen one great way to visualize the covariation between two continuous variables: draw a scatterplot with `geom_point()`. +You've already seen one great way to visualize the covariation between two numerical variables: draw a scatterplot with `geom_point()`. You can see covariation as a pattern in the points. 
For example, you can see an exponential relationship between the carat size and price of a diamond. @@ -701,7 +570,7 @@ For example, you can see an exponential relationship between the carat size and #| A scatterplot of price vs. carat. The relationship is positive, somewhat #| strong, and exponential. -ggplot(data = diamonds, mapping = aes(x = carat, y = price)) + +ggplot(diamonds, aes(x = carat, y = price)) + geom_point() ``` @@ -716,7 +585,7 @@ You've already seen one way to fix the problem: using the `alpha` aesthetic to a #| the number of points is higher than other areas, The most obvious clusters #| are for diamonds with 1, 1.5, and 2 carats. -ggplot(data = diamonds, mapping = aes(x = carat, y = price)) + +ggplot(diamonds, aes(x = carat, y = price)) + geom_point(alpha = 1 / 100) ``` @@ -738,11 +607,11 @@ You will need to install the hexbin package to use `geom_hex()`. #| plot of price vs. carat. Both plots show that the highest density of #| diamonds have low carats and low prices. -ggplot(data = smaller, mapping = aes(x = carat, y = price)) + +ggplot(smaller, aes(x = carat, y = price)) + geom_bin2d() # install.packages("hexbin") -ggplot(data = smaller, mapping = aes(x = carat, y = price)) + +ggplot(smaller, aes(x = carat, y = price)) + geom_hex() ``` @@ -760,8 +629,8 @@ For example, you could bin `carat` and then for each group, display a boxplot: #| left skewed distributions. Cheaper, smaller diamonds have outliers on the #| higher end, more expensive, bigger diamonds have outliers on the lower end. -ggplot(data = smaller, mapping = aes(x = carat, y = price)) + - geom_boxplot(mapping = aes(group = cut_width(carat, 0.1))) +ggplot(smaller, aes(x = carat, y = price)) + + geom_boxplot(aes(group = cut_width(carat, 0.1))) ``` `cut_width(x, width)`, as used above, divides `x` into bins of width `width`. @@ -778,8 +647,8 @@ That's the job of `cut_number()`: #| increases as well. 
Cheaper, smaller diamonds have outliers on the higher #| end, more expensive, bigger diamonds have outliers on the lower end. -ggplot(data = smaller, mapping = aes(x = carat, y = price)) + - geom_boxplot(mapping = aes(group = cut_number(carat, 20))) +ggplot(smaller, aes(x = carat, y = price)) + + geom_boxplot(aes(group = cut_number(carat, 20))) ``` #### Exercises @@ -805,7 +674,7 @@ ggplot(data = smaller, mapping = aes(x = carat, y = price)) + #| strong, linear relationship. There are a few unusual observations #| above and below the bulk of the data, more below it than above. - ggplot(data = diamonds, mapping = aes(x = x, y = y)) + + ggplot(diamonds, aes(x = x, y = y)) + geom_point() + coord_cartesian(xlim = c(4, 11), ylim = c(4, 11)) ``` @@ -839,7 +708,7 @@ The scatterplot also displays the two clusters that we noticed above. #| eruption times and short waiting times and one with long eruption times and #| long waiting times. -ggplot(data = faithful, mapping = aes(x = eruptions, y = waiting)) + +ggplot(faithful, aes(x = eruptions, y = waiting)) + geom_point() ``` @@ -880,7 +749,7 @@ diamonds_fit <- linear_reg() |> diamonds_aug <- augment(diamonds_fit, new_data = diamonds) |> mutate(.resid = exp(.resid)) -ggplot(data = diamonds_aug, mapping = aes(x = carat, y = .resid)) + +ggplot(diamonds_aug, aes(x = carat, y = .resid)) + geom_point() ``` @@ -893,66 +762,12 @@ Once you've removed the strong relationship between carat and price, you can see #| quite similar, between roughly 0.75 to 1.25. Each of the distributions of #| residuals is right skewed, with many outliers on the higher end. -ggplot(data = diamonds_aug, mapping = aes(x = cut, y = .resid)) + +ggplot(diamonds_aug, aes(x = cut, y = .resid)) + geom_boxplot() ``` We're not discussing modelling in this book because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand. 
-## ggplot2 calls - -As we move on from these introductory chapters, we'll transition to a more concise expression of ggplot2 code. -So far we've been very explicit, which is helpful when you are learning: - -```{r} -#| eval: false -#| fig-alt: > -#| A frequency polygon plot of eruption times for the Old Faithful geyser. -#| The distribution of eruption times is binomodal with one mode around 1.75 -#| and the other around 4.5. - -ggplot(data = faithful, mapping = aes(x = eruptions)) + - geom_freqpoly(binwidth = 0.25) -``` - -Typically, the first one or two arguments to a function are so important that you should know them by heart. -The first two arguments to `ggplot()` are `data` and `mapping`, and the first two arguments to `aes()` are `x` and `y`. -In the remainder of the book, we won't supply those names. -That saves typing, and, by reducing the amount of boilerplate, makes it easier to see what's different between plots. -That's a really important programming concern that we'll come back to in @sec-functions. - -Rewriting the previous plot more concisely yields: - -```{r} -#| eval: false -#| fig-alt: > -#| A frequency polygon plot of eruption times for the Old Faithful geyser. -#| The distribution of eruption times is binomodal with one mode around 1.75 -#| and the other around 4.5. - -ggplot(faithful, aes(eruptions)) + - geom_freqpoly(binwidth = 0.25) -``` - -Sometimes we'll turn the end of a pipeline of data transformation into a plot. -Watch for the transition from `|>` to `+`. -We wish this transition wasn't necessary but unfortunately ggplot2 was created before the pipe was discovered. - -```{r} -#| eval: false -#| fig-alt: > -#| A tile plot of cut vs. clarity of diamonds. Each tile represents a -#| cut/ckarity combination and tiles are colored according to the number of -#| observations in each tile. There are more Ideal diamonds than other cuts, -#| with the highest number being Ideal diamonds with VS2 clarity. 
Fair diamonds -#| and diamonds with clarity I1 are the lowest in frequency. - -diamonds |> - count(cut, clarity) |> - ggplot(aes(clarity, cut, fill = n)) + - geom_tile() -``` - ## Summary In this chapter you've learned a variety of tools to help you understand the variation within your data. diff --git a/_quarto.yml b/_quarto.yml index 983c4f7..3aaa96e 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -34,9 +34,14 @@ book: - workflow-style.qmd - data-import.qmd - workflow-scripts.qmd - - EDA.qmd - workflow-help.qmd + - part: visualize.qmd + chapters: + - layers.qmd + - EDA.qmd + - communication.qmd + - part: transform.qmd chapters: - logicals.qmd @@ -64,7 +69,6 @@ book: - part: communicate.qmd chapters: - quarto.qmd - - communicate-plots.qmd - quarto-formats.qmd - quarto-workflow.qmd diff --git a/communicate-plots.qmd b/communicate-plots.qmd deleted file mode 100644 index f8b2397..0000000 --- a/communicate-plots.qmd +++ /dev/null @@ -1,743 +0,0 @@ -# Graphics for communication {#sec-graphics-communication} - -```{r} -#| results: "asis" -#| echo: false -source("_common.R") -status("drafting") -``` - -## Introduction - -In @sec-exploratory-data-analysis, you learned how to use plots as tools for *exploration*. -When you make exploratory plots, you know---even before looking---which variables the plot will display. -You made each plot for a purpose, could quickly look at it, and then move on to the next plot. -In the course of most analyses, you'll produce tens or hundreds of plots, most of which are immediately thrown away. - -Now that you understand your data, you need to *communicate* your understanding to others. -Your audience will likely not share your background knowledge and will not be deeply invested in the data. -To help others quickly build up a good mental model of the data, you will need to invest considerable effort in making your plots as self-explanatory as possible. -In this chapter, you'll learn some of the tools that ggplot2 provides to do so. 
- -This chapter focuses on the tools you need to create good graphics. -We assume that you know what you want, and just need to know how to do it. -For that reason, we highly recommend pairing this chapter with a good general visualization book. -We particularly like [*The Truthful Art*](https://www.amazon.com/gp/product/0321934075/), by Albert Cairo. -It doesn't teach the mechanics of creating visualizations, but instead focuses on what you need to think about in order to create effective graphics. - -### Prerequisites - -In this chapter, we'll focus once again on ggplot2. -We'll also use a little dplyr for data manipulation, and a few ggplot2 extension packages, including **ggrepel** and **patchwork**. -Rather than loading those extensions here, we'll refer to their functions explicitly, using the `::` notation. -This will help make it clear which functions are built into ggplot2, and which come from other packages. -Don't forget you'll need to install those packages with `install.packages()` if you don't already have them. - -```{r} -#| message: false - -library(tidyverse) -``` - -## Label - -The easiest place to start when turning an exploratory graphic into an expository graphic is with good labels. -You add labels with the `labs()` function. -This example adds a plot title: - -```{r} -#| message: false - -ggplot(mpg, aes(displ, hwy)) + - geom_point(aes(color = class)) + - geom_smooth(se = FALSE) + - labs(title = "Fuel efficiency generally decreases with engine size") -``` - -The purpose of a plot title is to summarize the main finding. -Avoid titles that just describe what the plot is, e.g. "A scatterplot of engine displacement vs. fuel economy". - -If you need to add more text, there are two other useful labels that you can use in ggplot2 2.2.0 and above: - -- `subtitle` adds additional detail in a smaller font beneath the title. - -- `caption` adds text at the bottom right of the plot, often used to describe the source of the data. 
- -```{r} -#| message: false - -ggplot(mpg, aes(displ, hwy)) + - geom_point(aes(color = class)) + - geom_smooth(se = FALSE) + - labs( - title = "Fuel efficiency generally decreases with engine size", - subtitle = "Two seaters (sports cars) are an exception because of their light weight", - caption = "Data from fueleconomy.gov" - ) -``` - -You can also use `labs()` to replace the axis and legend titles. -It's usually a good idea to replace short variable names with more detailed descriptions, and to include the units. - -```{r} -#| message: false - -ggplot(mpg, aes(displ, hwy)) + - geom_point(aes(colour = class)) + - geom_smooth(se = FALSE) + - labs( - x = "Engine displacement (L)", - y = "Highway fuel economy (mpg)", - colour = "Car type" - ) -``` - -It's possible to use mathematical equations instead of text strings. -Just switch `""` out for `quote()` and read about the available options in `?plotmath`: - -```{r} -#| fig-asp: 1 -#| out-width: "50%" -#| fig-width: 3 - -df <- tibble( - x = runif(10), - y = runif(10) -) -ggplot(df, aes(x, y)) + - geom_point() + - labs( - x = quote(sum(x[i] ^ 2, i == 1, n)), - y = quote(alpha + beta + frac(delta, theta)) - ) -``` - -### Exercises - -1. Create one plot on the fuel economy data with customized `title`, `subtitle`, `caption`, `x`, `y`, and `colour` labels. - -2. Recreate the following plot using the fuel economy data. - Note that both the colors and shapes of points vary by type of drive train. - - ```{r} - #| echo: false - - ggplot(mpg, aes(cty, hwy, color = drv, shape = drv)) + - geom_point() + - labs( - x = "City MPG", - y = "Highway MPG", - shape = "Type of\ndrive train", - color = "Type of\ndrive train" - ) - ``` - -3. Take an exploratory graphic that you've created in the last month, and add informative titles to make it easier for others to understand. - -## Annotations - -In addition to labelling major components of your plot, it's often useful to label individual observations or groups of observations. 
-The first tool you have at your disposal is `geom_text()`. -`geom_text()` is similar to `geom_point()`, but it has an additional aesthetic: `label`. -This makes it possible to add textual labels to your plots. - -There are two possible sources of labels. -First, you might have a tibble that provides labels. -The plot below isn't terribly useful, but it illustrates a useful approach: pull out the most efficient car in each class with dplyr, and then label it on the plot: - -```{r} -best_in_class <- mpg |> - group_by(class) |> - filter(row_number(desc(hwy)) == 1) - -ggplot(mpg, aes(displ, hwy)) + - geom_point(aes(colour = class)) + - geom_text(aes(label = model), data = best_in_class) -``` - -This is hard to read because the labels overlap with each other, and with the points. -We can make things a little better by switching to `geom_label()` which draws a rectangle behind the text. -We also use the `nudge_y` parameter to move the labels slightly above the corresponding points: - -```{r} -ggplot(mpg, aes(displ, hwy)) + - geom_point(aes(colour = class)) + - geom_label(aes(label = model), data = best_in_class, nudge_y = 2, alpha = 0.5) -``` - -That helps a bit, but if you look closely in the top-left hand corner, you'll notice that there are two labels practically on top of each other. -This happens because the highway mileage and displacement for the best cars in the compact and subcompact categories are exactly the same. -There's no way that we can fix these by applying the same transformation for every label. -Instead, we can use the **ggrepel** package by Kamil Slowikowski. 
-This useful package will automatically adjust labels so that they don't overlap: - -```{r} -ggplot(mpg, aes(displ, hwy)) + - geom_point(aes(colour = class)) + - geom_point(size = 3, shape = 1, data = best_in_class) + - ggrepel::geom_label_repel(aes(label = model), data = best_in_class) -``` - -Note another handy technique used here: we added a second layer of large, hollow points to highlight the labelled points. - -You can sometimes use the same idea to replace the legend with labels placed directly on the plot. -It's not wonderful for this plot, but it isn't too bad. -(`theme(legend.position = "none"`) turns the legend off --- we'll talk about it more shortly.) - -```{r} -class_avg <- mpg |> - group_by(class) |> - summarize( - displ = median(displ), - hwy = median(hwy) - ) - -ggplot(mpg, aes(displ, hwy, colour = class)) + - ggrepel::geom_label_repel(aes(label = class), - data = class_avg, - size = 6, - label.size = 0, - segment.color = NA - ) + - geom_point() + - theme(legend.position = "none") -``` - -Alternatively, you might just want to add a single label to the plot, but you'll still need to create a data frame. -Often, you want the label in the corner of the plot, so it's convenient to create a new data frame using `summarize()` to compute the maximum values of x and y. - -```{r} -label_info <- mpg |> - summarize( - displ = max(displ), - hwy = max(hwy), - label = "Increasing engine size is \nrelated to decreasing fuel economy." - ) - -ggplot(mpg, aes(displ, hwy)) + - geom_point() + - geom_text(data = label_info, aes(label = label), vjust = "top", hjust = "right") -``` - -If you want to place the text exactly on the borders of the plot, you can use `+Inf` and `-Inf`. -Since we're no longer computing the positions from `mpg`, we can use `tibble()` to create the data frame: - -```{r} -label_info <- tibble( - displ = Inf, - hwy = Inf, - label = "Increasing engine size is \nrelated to decreasing fuel economy." 
-) - -ggplot(mpg, aes(displ, hwy)) + - geom_point() + - geom_text(data = label_info, aes(label = label), vjust = "top", hjust = "right") -``` - -In these examples, we manually broke the label up into lines using `"\n"`. -Another approach is to use `stringr::str_wrap()` to automatically add line breaks, given the number of characters you want per line: - -```{r} -"Increasing engine size is related to decreasing fuel economy." |> - str_wrap(width = 40) |> - writeLines() -``` - -Note the use of `hjust` and `vjust` to control the alignment of the label. -@fig-just shows all nine possible combinations. - -```{r} -#| label: fig-just -#| echo: false -#| fig-width: 4.5 -#| fig-asp: 0.5 -#| out-width: "60%" -#| fig-cap: > -#| All nine combinations of `hjust` and `vjust`. - -vjust <- c(bottom = 0, center = 0.5, top = 1) -hjust <- c(left = 0, center = 0.5, right = 1) - -df <- crossing(hj = names(hjust), vj = names(vjust)) |> - mutate( - y = vjust[vj], - x = hjust[hj], - label = paste0("hjust = '", hj, "'\n", "vjust = '", vj, "'") - ) - -ggplot(df, aes(x, y)) + - geom_point(colour = "grey70", size = 5) + - geom_point(size = 0.5, colour = "red") + - geom_text(aes(label = label, hjust = hj, vjust = vj), size = 4) + - labs(x = NULL, y = NULL) -``` - -Remember, in addition to `geom_text()`, you have many other geoms in ggplot2 available to help annotate your plot. -A few ideas: - -- Use `geom_hline()` and `geom_vline()` to add reference lines. - We often make them thick (`size = 2`) and white (`colour = white`), and draw them underneath the primary data layer. - That makes them easy to see, without drawing attention away from the data. - -- Use `geom_rect()` to draw a rectangle around points of interest. - The boundaries of the rectangle are defined by aesthetics `xmin`, `xmax`, `ymin`, `ymax`. - -- Use `geom_segment()` with the `arrow` argument to draw attention to a point with an arrow. 
- Use aesthetics `x` and `y` to define the starting location, and `xend` and `yend` to define the end location. - -The only limit is your imagination (and your patience with positioning annotations to be aesthetically pleasing)! - -### Exercises - -1. Use `geom_text()` with infinite positions to place text at the four corners of the plot. - -2. Read the documentation for `annotate()`. - How can you use it to add a text label to a plot without having to create a tibble? - -3. How do labels with `geom_text()` interact with faceting? - How can you add a label to a single facet? - How can you put a different label in each facet? - (Hint: Think about the underlying data.) - -4. What arguments to `geom_label()` control the appearance of the background box? - -5. What are the four arguments to `arrow()`? - How do they work? - Create a series of plots that demonstrate the most important options. - -## Scales - -The third way you can make your plot better for communication is to adjust the scales. -Scales control the mapping from data values to things that you can perceive. -Normally, ggplot2 automatically adds scales for you. -For example, when you type: - -```{r} -#| label: default-scales -#| fig-show: "hide" - -ggplot(mpg, aes(displ, hwy)) + - geom_point(aes(colour = class)) -``` - -ggplot2 automatically adds default scales behind the scenes: - -```{r} -#| fig-show: "hide" - -ggplot(mpg, aes(displ, hwy)) + - geom_point(aes(colour = class)) + - scale_x_continuous() + - scale_y_continuous() + - scale_colour_discrete() -``` - -Note the naming scheme for scales: `scale_` followed by the name of the aesthetic, then `_`, then the name of the scale. -The default scales are named according to the type of variable they align with: continuous, discrete, datetime, or date. -There are lots of non-default scales which you'll learn about below. - -The default scales have been carefully chosen to do a good job for a wide range of inputs. 
-Nevertheless, you might want to override the defaults for two reasons: - -- You might want to tweak some of the parameters of the default scale. - This allows you to do things like change the breaks on the axes, or the key labels on the legend. - -- You might want to replace the scale altogether, and use a completely different algorithm. - Often you can do better than the default because you know more about the data. - -### Axis ticks and legend keys - -There are two primary arguments that affect the appearance of the ticks on the axes and the keys on the legend: `breaks` and `labels`. -Breaks controls the position of the ticks, or the values associated with the keys. -Labels controls the text label associated with each tick/key. -The most common use of `breaks` is to override the default choice: - -```{r} -ggplot(mpg, aes(displ, hwy)) + - geom_point() + - scale_y_continuous(breaks = seq(15, 40, by = 5)) -``` - -You can use `labels` in the same way (a character vector the same length as `breaks`), but you can also set it to `NULL` to suppress the labels altogether. -This is useful for maps, or for publishing plots where you can't share the absolute numbers. - -```{r} -ggplot(mpg, aes(displ, hwy)) + - geom_point() + - scale_x_continuous(labels = NULL) + - scale_y_continuous(labels = NULL) -``` - -You can also use `breaks` and `labels` to control the appearance of legends. -Collectively axes and legends are called **guides**. -Axes are used for x and y aesthetics; legends are used for everything else. - -Another use of `breaks` is when you have relatively few data points and want to highlight exactly where the observations occur. -For example, take this plot that shows when each US president started and ended their term. 
- -```{r} -presidential |> - mutate(id = 33 + row_number()) |> - ggplot(aes(start, id)) + - geom_point() + - geom_segment(aes(xend = end, yend = id)) + - scale_x_date(NULL, breaks = presidential$start, date_labels = "'%y") -``` - -Note that the specification of breaks and labels for date and datetime scales is a little different: - -- `date_labels` takes a format specification, in the same form as `parse_datetime()`. - -- `date_breaks` (not shown here), takes a string like "2 days" or "1 month". - -### Legend layout - -You will most often use `breaks` and `labels` to tweak the axes. -While they both also work for legends, there are a few other techniques you are more likely to use. - -To control the overall position of the legend, you need to use a `theme()` setting. -We'll come back to themes at the end of the chapter, but in brief, they control the non-data parts of the plot. -The theme setting `legend.position` controls where the legend is drawn: - -```{r} -#| layout-ncol: 2 -#| fig-width: 4 -#| fig-asp: 1 - -base <- ggplot(mpg, aes(displ, hwy)) + - geom_point(aes(colour = class)) - -base + theme(legend.position = "left") -base + theme(legend.position = "top") -base + theme(legend.position = "bottom") -base + theme(legend.position = "right") # the default -``` - -You can also use `legend.position = "none"` to suppress the display of the legend altogether. - -To control the display of individual legends, use `guides()` along with `guide_legend()` or `guide_colorbar()`. -The following example shows two important settings: controlling the number of rows the legend uses with `nrow`, and overriding one of the aesthetics to make the points bigger. -This is particularly useful if you have used a low `alpha` to display many points on a plot. 
- -```{r} -ggplot(mpg, aes(displ, hwy)) + - geom_point(aes(colour = class)) + - geom_smooth(se = FALSE) + - theme(legend.position = "bottom") + - guides(colour = guide_legend(nrow = 1, override.aes = list(size = 4))) -``` - -### Replacing a scale - -Instead of just tweaking the details a little, you can instead replace the scale altogether. -There are two types of scales you're mostly likely to want to switch out: continuous position scales and colour scales. -Fortunately, the same principles apply to all the other aesthetics, so once you've mastered position and colour, you'll be able to quickly pick up other scale replacements. - -It's very useful to plot transformations of your variable. -For example, as we've seen in [diamond prices](diamond-prices) it's easier to see the precise relationship between `carat` and `price` if we log transform them: - -```{r} -#| fig-align: default -#| layout-ncol: 2 -#| fig-width: 4 -#| fig-height: 3 - -ggplot(diamonds, aes(carat, price)) + - geom_bin2d() - -ggplot(diamonds, aes(log10(carat), log10(price))) + - geom_bin2d() -``` - -However, the disadvantage of this transformation is that the axes are now labelled with the transformed values, making it hard to interpret the plot. -Instead of doing the transformation in the aesthetic mapping, we can instead do it with the scale. -This is visually identical, except the axes are labelled on the original data scale. - -```{r} -ggplot(diamonds, aes(carat, price)) + - geom_bin2d() + - scale_x_log10() + - scale_y_log10() -``` - -Another scale that is frequently customized is colour. -The default categorical scale picks colors that are evenly spaced around the colour wheel. -Useful alternatives are the ColorBrewer scales which have been hand tuned to work better for people with common types of colour blindness. 
-The two plots below look similar, but there is enough difference in the shades of red and green that the dots on the right can be distinguished even by people with red-green colour blindness. - -```{r} -#| fig-align: default -#| layout-ncol: 2 -#| fig-width: 4 -#| fig-height: 3 - -ggplot(mpg, aes(displ, hwy)) + - geom_point(aes(color = drv)) - -ggplot(mpg, aes(displ, hwy)) + - geom_point(aes(color = drv)) + - scale_colour_brewer(palette = "Set1") -``` - -Don't forget simpler techniques. -If there are just a few colors, you can add a redundant shape mapping. -This will also help ensure your plot is interpretable in black and white. - -```{r} -ggplot(mpg, aes(displ, hwy)) + - geom_point(aes(color = drv, shape = drv)) + - scale_colour_brewer(palette = "Set1") -``` - -The ColorBrewer scales are documented online at and made available in R via the **RColorBrewer** package, by Erich Neuwirth. -@fig-brewer shows the complete list of all palettes. -The sequential (top) and diverging (bottom) palettes are particularly useful if your categorical values are ordered, or have a "middle". -This often arises if you've used `cut()` to make a continuous variable into a categorical variable. - -```{r} -#| label: fig-brewer -#| echo: false -#| fig.cap: All ColourBrewer scales. -#| fig.asp: 2.5 - -par(mar = c(0, 3, 0, 0)) -RColorBrewer::display.brewer.all() -``` - -When you have a predefined mapping between values and colors, use `scale_colour_manual()`. -For example, if we map presidential party to colour, we want to use the standard mapping of red for Republicans and blue for Democrats: - -```{r} -presidential |> - mutate(id = 33 + row_number()) |> - ggplot(aes(start, id, colour = party)) + - geom_point() + - geom_segment(aes(xend = end, yend = id)) + - scale_colour_manual(values = c(Republican = "red", Democratic = "blue")) -``` - -For continuous colour, you can use the built-in `scale_colour_gradient()` or `scale_fill_gradient()`. 
-If you have a diverging scale, you can use `scale_colour_gradient2()`. -That allows you to give, for example, positive and negative values different colors. -That's sometimes also useful if you want to distinguish points above or below the mean. - -Another option is to use the viridis color scales. -The designers, Nathaniel Smith and Stéfan van der Walt, carefully tailored continuous colour schemes that are perceptible to people with various forms of colour blindness as well as perceptually uniform in both color and black and white. -These scales are available as continuous (`c`), discrete (`d`), and binned (`b`) palettes in ggplot2. - -```{r} -#| fig-align: default -#| layout-ncol: 2 -#| fig-width: 4 -#| fig-asp: 1 - -df <- tibble( - x = rnorm(10000), - y = rnorm(10000) -) -ggplot(df, aes(x, y)) + - geom_hex() + - coord_fixed() + - labs(title = "Default, continuous") - -ggplot(df, aes(x, y)) + - geom_hex() + - coord_fixed() + - scale_fill_viridis_c() + - labs(title = "Viridis, continuous") - -ggplot(df, aes(x, y)) + - geom_hex() + - coord_fixed() + - scale_fill_viridis_b() + - labs(title = "Viridis, binned") -``` - -Note that all colour scales come in two variety: `scale_colour_x()` and `scale_fill_x()` for the `colour` and `fill` aesthetics respectively (the colour scales are available in both UK and US spellings). - -### Exercises - -1. Why doesn't the following code override the default scale? - - ```{r} - #| fig-show: "hide" - - ggplot(df, aes(x, y)) + - geom_hex() + - scale_colour_gradient(low = "white", high = "red") + - coord_fixed() - ``` - -2. What is the first argument to every scale? - How does it compare to `labs()`? - -3. Change the display of the presidential terms by: - - a. Combining the two variants shown above. - b. Improving the display of the y axis. - c. Labelling each term with the name of the president. - d. Adding informative plot labels. - e. Placing breaks every 4 years (this is trickier than it seems!). - -4. 
Use `override.aes` to make the legend on the following plot easier to see. - - ```{r} - #| fig-format: "png" - #| out-width: "50%" - - ggplot(diamonds, aes(carat, price)) + - geom_point(aes(colour = cut), alpha = 1/20) - ``` - -## Zooming - -There are three ways to control the plot limits: - -1. Adjusting what data are plotted -2. Setting the limits in each scale -3. Setting `xlim` and `ylim` in `coord_cartesian()` - -To zoom in on a region of the plot, it's generally best to use `coord_cartesian()`. -Compare the following two plots: - -```{r} -#| layout-ncol: 2 -#| fig-width: 4 -#| fig-height: 3 -#| message: false - -ggplot(mpg, mapping = aes(displ, hwy)) + - geom_point(aes(color = class)) + - geom_smooth() + - coord_cartesian(xlim = c(5, 7), ylim = c(10, 30)) - -mpg |> - filter(displ >= 5, displ <= 7, hwy >= 10, hwy <= 30) |> - ggplot(aes(displ, hwy)) + - geom_point(aes(color = class)) + - geom_smooth() -``` - -You can also set the `limits` on individual scales. -Reducing the limits is basically equivalent to subsetting the data. -It is generally more useful if you want *expand* the limits, for example, to match scales across different plots. -For example, if we extract two classes of cars and plot them separately, it's difficult to compare the plots because all three scales (the x-axis, the y-axis, and the colour aesthetic) have different ranges. - -```{r} -#| layout-ncol: 2 -#| fig-width: 4 -#| fig-height: 3 - -suv <- mpg |> filter(class == "suv") -compact <- mpg |> filter(class == "compact") - -ggplot(suv, aes(displ, hwy, colour = drv)) + - geom_point() - -ggplot(compact, aes(displ, hwy, colour = drv)) + - geom_point() -``` - -One way to overcome this problem is to share scales across multiple plots, training the scales with the `limits` of the full data. 
- -```{r} -#| layout-ncol: 2 -#| fig-width: 4 -#| fig-height: 3 - -x_scale <- scale_x_continuous(limits = range(mpg$displ)) -y_scale <- scale_y_continuous(limits = range(mpg$hwy)) -col_scale <- scale_colour_discrete(limits = unique(mpg$drv)) - -ggplot(suv, aes(displ, hwy, colour = drv)) + - geom_point() + - x_scale + - y_scale + - col_scale - -ggplot(compact, aes(displ, hwy, colour = drv)) + - geom_point() + - x_scale + - y_scale + - col_scale -``` - -In this particular case, you could have simply used faceting, but this technique is useful more generally, if for instance, you want to spread plots over multiple pages of a report. - -## Themes - -Finally, you can customize the non-data elements of your plot with a theme: - -```{r} -#| message: false - -ggplot(mpg, aes(displ, hwy)) + - geom_point(aes(color = class)) + - geom_smooth(se = FALSE) + - theme_bw() -``` - -ggplot2 includes eight themes by default, as shown in @fig-themes. -Many more are included in add-on packages like **ggthemes** (), by Jeffrey Arnold. - -```{r} -#| label: fig-themes -#| echo: false -#| fig-cap: The eight themes built-in to ggplot2. -#| fig-alt: > -#| Eight barplots created with ggplot2, each -#| with one of the eight built-in themes: -#| theme_bw() - White background with grid lines, -#| theme_light() - Light axes and grid lines, -#| theme_classic() - Classic theme, axes but no grid -#| lines, theme_linedraw() - Only black lines, -#| theme_dark() - Dark background for contrast, -#| theme_minimal() - Minimal theme, no background, -#| theme_gray() - Gray background (default theme), -#| theme_void() - Empty theme, only geoms are visible. - -knitr::include_graphics("images/visualization-themes.png") -``` - -Many people wonder why the default theme has a gray background. -This was a deliberate choice because it puts the data forward while still making the grid lines visible. 
-The white grid lines are visible (which is important because they significantly aid position judgments), but they have little visual impact and we can easily tune them out. -The grey background gives the plot a similar typographic colour to the text, ensuring that the graphics fit in with the flow of a document without jumping out with a bright white background. -Finally, the grey background creates a continuous field of colour which ensures that the plot is perceived as a single visual entity. - -It's also possible to control individual components of each theme, like the size and colour of the font used for the y axis. -Unfortunately, this level of detail is outside the scope of this book, so you'll need to read the [ggplot2 book](https://ggplot2-book.org/) for the full details. -You can also create your own themes, if you are trying to match a particular corporate or journal style. - -## Saving your plots {#sec-ggsave} - -There are two main ways to get your plots out of R and into your final write-up: `ggsave()` and knitr. -`ggsave()` will save the most recent plot to disk: - -```{r} -#| fig-show: "hide" - -ggplot(mpg, aes(displ, hwy)) + geom_point() -ggsave("my-plot.pdf") -``` - -```{r} -#| include: false - -file.remove("my-plot.pdf") -``` - -If you don't specify the `width` and `height` they will be taken from the dimensions of the current plotting device. -For reproducible code, you'll want to specify them. - -Generally, however, we recommend that you assemble your final reports using Quarto, so we focus on the important code chunk options that you should know about for graphics. -You can learn more about `ggsave()` in the documentation. - -## Learning more - -The absolute best place to learn more is the ggplot2 book: [*ggplot2: Elegant graphics for data analysis*](https://ggplot2-book.org/). -It goes into much more depth about the underlying theory, and has many more examples of how to combine the individual pieces to solve practical problems. 
- -Another great resource is the ggplot2 extensions gallery . -This site lists many of the packages that extend ggplot2 with new geoms and scales. -It's a great place to start if you're trying to do something that seems hard with ggplot2. diff --git a/communicate.qmd b/communicate.qmd index 013641a..9cbb780 100644 --- a/communicate.qmd +++ b/communicate.qmd @@ -17,7 +17,7 @@ However, it doesn't matter how great your analysis is unless you can explain it #| can't communicate your results to other humans, it doesn't matter how #| great your analysis is. #| fig-alt: > -#| A diagram displaying the data science cycle with visualize and +#| A diagram displaying the data science cycle with #| communicate highlighed in blue. #| out.width: NULL diff --git a/communication.qmd b/communication.qmd new file mode 100644 index 0000000..3b974eb --- /dev/null +++ b/communication.qmd @@ -0,0 +1,1154 @@ +# Communication {#sec-communication} + +```{r} +#| results: "asis" +#| echo: false +source("_common.R") +status("polishing") +``` + +## Introduction + +In @sec-exploratory-data-analysis, you learned how to use plots as tools for *exploration*. +When you make exploratory plots, you know---even before looking---which variables the plot will display. +You made each plot for a purpose, could quickly look at it, and then move on to the next plot. +In the course of most analyses, you'll produce tens or hundreds of plots, most of which are immediately thrown away. + +Now that you understand your data, you need to *communicate* your understanding to others. +Your audience will likely not share your background knowledge and will not be deeply invested in the data. +To help others quickly build up a good mental model of the data, you will need to invest considerable effort in making your plots as self-explanatory as possible. +In this chapter, you'll learn some of the tools that ggplot2 provides to do so. + +This chapter focuses on the tools you need to create good graphics. 
+We assume that you know what you want, and just need to know how to do it.
+For that reason, we highly recommend pairing this chapter with a good general visualization book.
+We particularly like [The Truthful Art](https://www.amazon.com/gp/product/0321934075/), by Alberto Cairo.
+It doesn't teach the mechanics of creating visualizations, but instead focuses on what you need to think about in order to create effective graphics.
+
+### Prerequisites
+
+In this chapter, we'll focus once again on ggplot2.
+We'll also use a little dplyr for data manipulation, **scales** to override the default breaks, labels, transformations and palettes, and a few ggplot2 extension packages, including **ggrepel** ([https://ggrepel.slowkow.com](https://ggrepel.slowkow.com/)) by Kamil Slowikowski and **patchwork** ([https://patchwork.data-imaginist.com](https://patchwork.data-imaginist.com/)) by Thomas Lin Pedersen.
+Don't forget that you'll need to install those packages with `install.packages()` if you don't already have them.
+
+```{r}
+#| label: setup
+
+library(tidyverse)
+library(ggrepel)
+library(patchwork)
+```
+
+## Labels
+
+The easiest place to start when turning an exploratory graphic into an expository graphic is with good labels.
+You add labels with the `labs()` function.
+This example adds a plot title:
+
+```{r}
+#| message: false
+#| fig-alt: >
+#|   Scatterplot of highway fuel efficiency versus engine size of cars, where
+#|   points are colored according to the car class. A smooth curve following
+#|   the trajectory of the relationship between highway fuel efficiency versus
+#|   engine size of cars is overlaid. The plot is titled "Fuel efficiency
+#|   generally decreases with engine size".
+
+ggplot(mpg, aes(x = displ, y = hwy)) +
+  geom_point(aes(color = class)) +
+  geom_smooth(se = FALSE) +
+  labs(title = "Fuel efficiency generally decreases with engine size")
+```
+
+The purpose of a plot title is to summarize the main finding.
+Avoid titles that just describe what the plot is, e.g. "A scatterplot of engine displacement vs. fuel economy".
+
+If you need to add more text, there are two other useful labels:
+
+- `subtitle` adds additional detail in a smaller font beneath the title.
+
+- `caption` adds text at the bottom right of the plot, often used to describe the source of the data.
+
+```{r}
+#| message: false
+#| fig-alt: >
+#|   Scatterplot of highway fuel efficiency versus engine size of cars, where
+#|   points are colored according to the car class. A smooth curve following
+#|   the trajectory of the relationship between highway fuel efficiency versus
+#|   engine size of cars is overlaid. The plot is titled "Fuel efficiency
+#|   generally decreases with engine size". The subtitle is "Two seaters
+#|   (sports cars) are an exception because of their light weight" and the
+#|   caption is "Data from fueleconomy.gov".
+
+ggplot(mpg, aes(x = displ, y = hwy)) +
+  geom_point(aes(color = class)) +
+  geom_smooth(se = FALSE) +
+  labs(
+    title = "Fuel efficiency generally decreases with engine size",
+    subtitle = "Two seaters (sports cars) are an exception because of their light weight",
+    caption = "Data from fueleconomy.gov"
+  )
+```
+
+You can also use `labs()` to replace the axis and legend titles.
+It's usually a good idea to replace short variable names with more detailed descriptions, and to include the units.
+
+```{r}
+#| message: false
+#| fig-alt: >
+#|   Scatterplot of highway fuel efficiency versus engine size of cars, where
+#|   points are colored according to the car class. A smooth curve following
+#|   the trajectory of the relationship between highway fuel efficiency versus
+#|   engine size of cars is overlaid. The x-axis is labelled "Engine
+#|   displacement (L)" and the y-axis is labelled "Highway fuel economy (mpg)".
+#|   The legend is labelled "Car type".
+
+ggplot(mpg, aes(x = displ, y = hwy)) +
+  geom_point(aes(color = class)) +
+  geom_smooth(se = FALSE) +
+  labs(
+    x = "Engine displacement (L)",
+    y = "Highway fuel economy (mpg)",
+    color = "Car type"
+  )
+```
+
+It's possible to use mathematical equations instead of text strings.
+Just switch `""` out for `quote()` and read about the available options in `?plotmath`:
+
+```{r}
+#| fig-asp: 1
+#| out-width: "50%"
+#| fig-width: 3
+#| fig-alt: >
+#|   Scatterplot with math text on the x and y axis labels. X-axis label
+#|   says sum of x_i squared, for i from 1 to n. Y-axis label says alpha +
+#|   beta + delta over theta.
+
+df <- tibble(
+  x = 1:10,
+  y = x ^ 2
+)
+
+ggplot(df, aes(x, y)) +
+  geom_point() +
+  labs(
+    x = quote(sum(x[i] ^ 2, i == 1, n)),
+    y = quote(alpha + beta + frac(delta, theta))
+  )
+```
+
+### Exercises
+
+1.  Create one plot on the fuel economy data with customized `title`, `subtitle`, `caption`, `x`, `y`, and `color` labels.
+
+2.  Recreate the following plot using the fuel economy data.
+    Note that both the colors and shapes of points vary by type of drive train.
+
+    ```{r}
+    #| echo: false
+    #| fig-alt: >
+    #|   Scatterplot of highway versus city fuel efficiency. Shapes and
+    #|   colors of points are determined by type of drive train.
+
+    ggplot(mpg, aes(x = cty, y = hwy, color = drv, shape = drv)) +
+      geom_point() +
+      labs(
+        x = "City MPG",
+        y = "Highway MPG",
+        shape = "Type of\ndrive train",
+        color = "Type of\ndrive train"
+      )
+    ```
+
+3.  Take an exploratory graphic that you've created in the last month, and add informative titles to make it easier for others to understand.
+
+## Annotations
+
+In addition to labelling major components of your plot, it's often useful to label individual observations or groups of observations.
+The first tool you have at your disposal is `geom_text()`.
+`geom_text()` is similar to `geom_point()`, but it has an additional aesthetic: `label`.
+This makes it possible to add textual labels to your plots.
+
+There are two possible sources of labels.
+First, you might have a tibble that provides labels.
+In the following plot we pull out the car with the largest engine size in each drive type and save its information as a new data frame called `label_info`.
+In order to create the `label_info` data frame, we used a number of new dplyr functions.
+You'll learn more about each of these soon!
+
+```{r}
+label_info <- mpg |>
+  group_by(drv) |>
+  arrange(desc(displ)) |>
+  slice_head(n = 1) |>
+  mutate(
+    drive_type = case_when(
+      drv == "f" ~ "front-wheel drive",
+      drv == "r" ~ "rear-wheel drive",
+      drv == "4" ~ "4-wheel drive"
+    )
+  ) |>
+  select(displ, hwy, drv, drive_type)
+
+label_info
+```
+
+Then, we use this new data frame to label the three groups directly on the plot, replacing the legend.
+Using the `fontface` and `size` arguments we can customize the look of the text labels.
+They're larger than the rest of the text on the plot and bolded.
+(`theme(legend.position = "none")` turns the legend off --- we'll talk about it more shortly.)
+
+```{r}
+#| fig-alt: >
+#|   Scatterplot of highway mileage versus engine size where points are colored
+#|   by drive type. Smooth curves for each drive type are overlaid.
+#|   Text labels identify the curves as front-wheel, rear-wheel, and 4-wheel.
+
+ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
+  geom_point(alpha = 0.3) +
+  geom_smooth(se = FALSE) +
+  geom_text(
+    data = label_info,
+    aes(x = displ, y = hwy, label = drive_type),
+    fontface = "bold", size = 5, hjust = "right", vjust = "bottom"
+  ) +
+  theme(legend.position = "none")
+```
+
+Note the use of `hjust` and `vjust` to control the alignment of the label.
+@fig-just shows all nine possible combinations.
+
+```{r}
+#| label: fig-just
+#| echo: false
+#| fig-width: 4.5
+#| fig-asp: 0.5
+#| out-width: "60%"
+#| fig-cap: >
+#|   All nine combinations of `hjust` and `vjust`.
+#| fig-alt: >
+#|   A 1x1 grid.
At (0,0) hjust is set to left and vjust is set to bottom. +#| At (0.5, 0) hjust is center and vjust is bottom and at (1, 0) hjust is +#| right and vjust is bottom. At (0, 0.5) hjust is left and vjust is +#| center, at (0.5, 0.5) hjust is center and vjust is center, and at (1, 0.5) +#| hjust is right and vjust is center. Finally, at (0, 1) hjust is left and +#| vjust is top, at (0.5, 1) hjust is center and vjust is top, and at (1, 1) +#| hjust is right and vjust is top. + +vjust <- c(bottom = 0, center = 0.5, top = 1) +hjust <- c(left = 0, center = 0.5, right = 1) + +df <- crossing(hj = names(hjust), vj = names(vjust)) |> + mutate( + y = vjust[vj], + x = hjust[hj], + label = paste0("hjust = '", hj, "'\n", "vjust = '", vj, "'") + ) + +ggplot(df, aes(x, y)) + + geom_point(color = "grey70", size = 5) + + geom_point(size = 0.5, color = "red") + + geom_text(aes(label = label, hjust = hj, vjust = vj), size = 4) + + labs(x = NULL, y = NULL) +``` + +However, the annotated plot we made above is hard to read because the labels overlap with each other, and with the points. +We can make things a little better by switching to `geom_label()`, which draws a rectangle behind the text. +We also use the `nudge_y` parameter to move the labels slightly above the corresponding points: + +```{r} +#| fig-alt: > +#| Scatterplot of highway fuel efficiency versus engine size of cars, where +#| points are colored by drive type. Smooth curves for each drive type are +#| overlaid. Labels in boxes with a white, semi-transparent background +#| identify the curves as front-wheel, rear-wheel, and 4-wheel drive. + +ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + + geom_point(alpha = 0.3) + + geom_smooth(se = FALSE) + + geom_label( + data = label_info, + aes(x = displ, y = hwy, label = drive_type), + fontface = "bold", size = 5, hjust = "right", alpha = 0.5, nudge_y = 2, + ) + + theme(legend.position = "none") +``` + +That helps a bit, but two of the labels still overlap with each other.
+This is difficult to fix by applying the same transformation for every label. +Instead, we can use the `geom_label_repel()` function from the ggrepel package. +This useful package will automatically adjust labels so that they don't overlap: + +```{r} +#| fig-alt: > +#| Scatterplot of highway fuel efficiency versus engine size of cars, where +#| points are colored according to the car class. Some points are labelled +#| with the car's name. The labels are box with white, transparent background +#| and positioned to not overlap. + +ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + + geom_point(alpha = 0.3) + + geom_smooth(se = FALSE) + + geom_label_repel( + data = label_info, + aes(x = displ, y = hwy, label = drive_type), + fontface = "bold", size = 5, nudge_y = 2, + ) + + theme(legend.position = "none") +``` + +You can also use the same idea to highlight certain points on a plot with `geom_text_repel()` from the ggrepel package. +Note another handy technique used here: we added a second layer of large, hollow points to further highlight the labelled points. + +```{r} +#| fig-alt: > +#| Scatterplot of highway fuel efficiency versus engine size of cars. Points +#| where highway mileage is above 40 as well as above 20 with engine size +#| above 5 are red, with a hollow red circle, and labelled with model name +#| of the car. + +potential_outliers <- mpg |> + filter(hwy > 40 | (hwy > 20 & displ > 5)) + +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point() + + geom_text_repel(data = potential_outliers, aes(label = model)) + + geom_point(data = potential_outliers, color = "red") + + geom_point(data = potential_outliers, color = "red", size = 3, shape = "circle open") +``` + +Alternatively, you might just want to add a single label to the plot, but you'll still need to create a data frame. +Often, you want the label in the corner of the plot, so it's convenient to create a new data frame using `summarize()` to compute the maximum values of x and y. 
+ +```{r} +#| fig-alt: > +#| Scatterplot of highway fuel efficiency versus engine size of cars. On the +#| top right corner, inset a bit from the corner, is an annotation that +#| reads "increasing engine size is related to decreasing fuel economy". +#| The text spans two lines. + +label_info <- mpg |> + summarize( + displ = max(displ), + hwy = max(hwy), + label = "Increasing engine size is \nrelated to decreasing fuel economy." + ) + +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point() + + geom_text( + data = label_info, aes(label = label), + vjust = "top", hjust = "right" + ) +``` + +If you want to place the text exactly on the borders of the plot, you can use `+Inf` and `-Inf`. +Since we're no longer computing the positions from `mpg`, we can use `tibble()` to create the data frame: + +```{r} +#| fig-alt: > +#| Scatterplot of highway fuel efficiency versus engine size of cars. On the +#| top right corner, flush against the corner, is an annotation that +#| reads "increasing engine size is related to decreasing fuel economy". +#| The text spans two lines. + +label_info <- tibble( + displ = Inf, + hwy = Inf, + label = "Increasing engine size is \nrelated to decreasing fuel economy." +) + +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point() + + geom_text(data = label_info, aes(label = label), vjust = "top", hjust = "right") +``` + +Alternatively, we can add the annotation without creating a new data frame, using `annotate()`. +This function adds a geom to a plot, but it doesn't map variables of a data frame to an aesthetic. +The first argument of this function, `geom`, is the geometric object you want to use for annotation. + +```{r} +#| fig-alt: > +#| Scatterplot of highway fuel efficiency versus engine size of cars. On the +#| top right corner, flush against the corner, is an annotation that +#| reads "increasing engine size is related to decreasing fuel economy". +#| The text spans two lines. 
+ +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point() + + annotate( + geom = "text", x = Inf, y = Inf, + label = "Increasing engine size is \nrelated to decreasing fuel economy.", + vjust = "top", hjust = "right" + ) +``` + +You can also use a label geom instead of a text geom, as we did earlier, and set aesthetics like color. +Another approach for drawing attention to a plot feature is using a segment geom with the `arrow` argument. +The `x` and `y` aesthetics define the starting location of the segment, and `xend` and `yend` define the end location. + +```{r} +#| fig-alt: > +#| Scatterplot of highway fuel efficiency versus engine size of cars. A red +#| arrow pointing down follows the trend of the points and the annotation +#| placed next to the arrow reads "increasing engine size is related to +#| decreasing fuel economy". The arrow and the annotation text are red. + +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point() + + annotate( + geom = "label", x = 3.5, y = 38, + label = "Increasing engine size is \nrelated to decreasing fuel economy.", + hjust = "left", color = "red" + ) + + annotate( + geom = "segment", + x = 3, y = 35, xend = 5, yend = 25, color = "red", + arrow = arrow(type = "closed") + ) +``` + +In these examples, we manually broke the label up into lines using `"\n"`. +Another approach is to use `stringr::str_wrap()` to automatically add line breaks, given the number of characters you want per line: + +```{r} +"Increasing engine size is related to decreasing fuel economy." |> + str_wrap(width = 40) |> + writeLines() +``` + +Remember, in addition to `geom_text()`, you have many other geoms in ggplot2 available to help annotate your plot. +A couple of ideas: + +- Use `geom_hline()` and `geom_vline()` to add reference lines. + We often make them thick (`size = 2`) and white (`color = "white"`), and draw them underneath the primary data layer. + That makes them easy to see, without drawing attention away from the data.
+ +- Use `geom_rect()` to draw a rectangle around points of interest. + The boundaries of the rectangle are defined by aesthetics `xmin`, `xmax`, `ymin`, `ymax`. + +- Use `geom_segment()` with the `arrow` argument to draw attention to a point with an arrow. + Use aesthetics `x` and `y` to define the starting location, and `xend` and `yend` to define the end location. + +The only limit is your imagination (and your patience with positioning annotations to be aesthetically pleasing)! + +### Exercises + +1. Use `geom_text()` with infinite positions to place text at the four corners of the plot. + +2. Use `annotate()` to add a point geom in the middle of your last plot without having to create a tibble. + Customize the shape, size, or color of the point. + +3. How do labels with `geom_text()` interact with faceting? + How can you add a label to a single facet? + How can you put a different label in each facet? + (Hint: Think about the underlying data.) + +4. What arguments to `geom_label()` control the appearance of the background box? + +5. What are the four arguments to `arrow()`? + How do they work? + Create a series of plots that demonstrate the most important options. + +## Scales + +The third way you can make your plot better for communication is to adjust the scales. +Scales control the mapping from data values to things that you can perceive. + +### Default scales + +Normally, ggplot2 automatically adds scales for you. +For example, when you type: + +```{r} +#| label: default-scales +#| fig-show: "hide" + +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point(aes(color = class)) +``` + +ggplot2 automatically adds default scales behind the scenes: + +```{r} +#| fig-show: "hide" + +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point(aes(color = class)) + + scale_x_continuous() + + scale_y_continuous() + + scale_color_discrete() +``` + +Note the naming scheme for scales: `scale_` followed by the name of the aesthetic, then `_`, then the name of the scale. 
+The default scales are named according to the type of variable they align with: continuous, discrete, datetime, or date. +There are lots of non-default scales which you'll learn about below. + +The default scales have been carefully chosen to do a good job for a wide range of inputs. +Nevertheless, you might want to override the defaults for two reasons: + +- You might want to tweak some of the parameters of the default scale. + This allows you to do things like change the breaks on the axes, or the key labels on the legend. + +- You might want to replace the scale altogether, and use a completely different algorithm. + Often you can do better than the default because you know more about the data. + +### Axis ticks and legend keys + +There are two primary arguments that affect the appearance of the ticks on the axes and the keys on the legend: `breaks` and `labels`. +Breaks controls the position of the ticks, or the values associated with the keys. +Labels controls the text label associated with each tick/key. +The most common use of `breaks` is to override the default choice: + +```{r} +#| fig-alt: > +#| Scatterplot of highway fuel efficiency versus engine size of cars. +#| The y-axis has breaks starting at 15 and ending at 40, increasing by 5. + +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point() + + scale_y_continuous(breaks = seq(15, 40, by = 5)) +``` + +You can use `labels` in the same way (a character vector the same length as `breaks`), but you can also set it to `NULL` to suppress the labels altogether. +This is useful for maps, or for publishing plots where you can't share the absolute numbers. + +```{r} +#| fig-alt: > +#| Scatterplot of highway fuel efficiency versus engine size of cars. +#| The x and y-axes do not have any labels at the axis ticks. 
+ +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point() + + scale_x_continuous(labels = NULL) + + scale_y_continuous(labels = NULL) +``` + +The `labels` argument coupled with labelling functions from the scales package is also useful for formatting numbers as currency, percent, etc. +The plot on the left shows default labelling with `label_dollar()`, which adds a dollar sign as well as a thousand separator comma. +The plot on the right adds further customization by dividing dollar values by 1,000 and adding a suffix "K" (for "thousands") as well as adding custom breaks. +Note that `breaks` is in the original scale of the data. + +```{r} +#| layout-ncol: 2 +#| fig-alt: > +#| Two side-by-side box plots of price versus cut of diamonds. The outliers +#| are transparent. On both plots the y-axis labels are formatted as dollars. +#| The y-axis labels on the plot start at $0 and go to $15,000, increasing +#| by $5,000. The y-axis labels on the right plot start at $1K and go to +#| $19K, increasing by $6K. + +# Left +ggplot(diamonds, aes(x = cut, y = price)) + + geom_boxplot(alpha = 0.05) + + scale_y_continuous(labels = scales::label_dollar()) + +# Right +ggplot(diamonds, aes(x = cut, y = price)) + + geom_boxplot(alpha = 0.05) + + scale_y_continuous( + labels = scales::label_dollar(scale = 1/1000, suffix = "K"), + breaks = seq(1000, 19000, by = 6000) + ) +``` + +Another handy label function is `label_percent()`: + +```{r} +#| fig-alt: > +#| Segmented bar plots of cut, filled with levels of clarity. The y-axis +#| labels start at 0% and go to 100%, increasing by 25%. The y-axis label +#| name is "Percentage". + +ggplot(diamonds, aes(x = cut, fill = clarity)) + + geom_bar(position = "fill") + + scale_y_continuous( + name = "Percentage", + labels = scales::label_percent() + ) +``` + +You can also use `breaks` and `labels` to control the appearance of legends. +Collectively axes and legends are called **guides**. 
+Axes are used for x and y aesthetics; legends are used for everything else. + +Another use of `breaks` is when you have relatively few data points and want to highlight exactly where the observations occur. +For example, take this plot that shows when each US president started and ended their term. + +```{r} +#| fig-alt: > +#| Line plot of id number of presidents versus the year they started their +#| presidency. Start year is marked with a point and a segment that starts +#| there and ends at the end of the presidency. The x-axis labels are +#| formatted as two digit years starting with an apostrophe, e.g., '53. + +presidential |> + mutate(id = 33 + row_number()) |> + ggplot(aes(start, id)) + + geom_point() + + geom_segment(aes(xend = end, yend = id)) + + scale_x_date(name = NULL, breaks = presidential$start, date_labels = "'%y") +``` + +Note that the specification of breaks and labels for date and datetime scales is a little different: + +- `date_labels` takes a format specification, in the same form as `parse_datetime()`. + +- `date_breaks` (not shown here), takes a string like "2 days" or "1 month". + +### Legend layout + +You will most often use `breaks` and `labels` to tweak the axes. +While they both also work for legends, there are a few other techniques you are more likely to use. + +To control the overall position of the legend, you need to use a `theme()` setting. +We'll come back to themes at the end of the chapter, but in brief, they control the non-data parts of the plot. +The theme setting `legend.position` controls where the legend is drawn: + +```{r} +#| layout-ncol: 2 +#| fig-width: 4 +#| fig-asp: 1 +#| fig-alt: > +#| Four scatterplots of highway fuel efficiency versus engine size of cars +#| where points are colored based on class of car. Clockwise, the legend +#| is placed on the left, top, bottom, and right of the plot. 
+ +base <- ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point(aes(color = class)) + +base + theme(legend.position = "left") +base + theme(legend.position = "top") +base + theme(legend.position = "bottom") +base + theme(legend.position = "right") # the default +``` + +You can also use `legend.position = "none"` to suppress the display of the legend altogether. + +To control the display of individual legends, use `guides()` along with `guide_legend()` or `guide_colorbar()`. +The following example shows two important settings: controlling the number of rows the legend uses with `nrow`, and overriding one of the aesthetics to make the points bigger. +This is particularly useful if you have used a low `alpha` to display many points on a plot. + +```{r} +#| fig-alt: > +#| Scatterplot of highway fuel efficiency versus engine size of cars +#| where points are colored based on class of car. Overlaid on the plot is a +#| smooth curve. The legend is at the bottom and classes are listed +#| horizontally in a row. The points in the legend are larger than the points +#| in the plot. + +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point(aes(color = class)) + + geom_smooth(se = FALSE) + + theme(legend.position = "bottom") + + guides(color = guide_legend(nrow = 1, override.aes = list(size = 4))) +``` + +### Replacing a scale + +Instead of just tweaking the details a little, you can replace the scale altogether. +There are two types of scales you're most likely to want to switch out: continuous position scales and color scales. +Fortunately, the same principles apply to all the other aesthetics, so once you've mastered position and color, you'll be able to quickly pick up other scale replacements. + +It's very useful to plot transformations of your variable.
+For example, it's easier to see the precise relationship between `carat` and `price` if we log transform them: + +```{r} +#| fig-align: default +#| layout-ncol: 2 +#| fig-width: 4 +#| fig-height: 3 +#| fig-alt: > +#| Two plots of price versus carat of diamonds. Data binned and the color of +#| the rectangles representing each bin based on the number of points that +#| fall into that bin. In the plot on the right, price and carat values +#| are logged and the axis labels shows the logged values. + +# Left +ggplot(diamonds, aes(x = carat, y = price)) + + geom_bin2d() + +# Right +ggplot(diamonds, aes(x = log10(carat), y = log10(price))) + + geom_bin2d() +``` + +However, the disadvantage of this transformation is that the axes are now labelled with the transformed values, making it hard to interpret the plot. +Instead of doing the transformation in the aesthetic mapping, we can instead do it with the scale. +This is visually identical, except the axes are labelled on the original data scale. + +```{r} +#| fig-alt: > +#| Plot of price versus carat of diamonds. Data binned and the color of +#| the rectangles representing each bin based on the number of points that +#| fall into that bin. The axis labels are on the original data scale. + +ggplot(diamonds, aes(x = carat, y = price)) + + geom_bin2d() + + scale_x_log10() + + scale_y_log10() +``` + +Another scale that is frequently customized is color. +The default categorical scale picks colors that are evenly spaced around the color wheel. +Useful alternatives are the ColorBrewer scales which have been hand tuned to work better for people with common types of color blindness. +The two plots below look similar, but there is enough difference in the shades of red and green that the dots on the right can be distinguished even by people with red-green color blindness. 
+ +```{r} +#| fig-align: default +#| layout-ncol: 2 +#| fig-width: 4 +#| fig-height: 3 +#| fig-alt: > +#| Two scatterplots of highway mileage versus engine size where points are +#| colored by drive type. The plot on the left uses the default +#| ggplot2 color palette and the plot on the right uses a different color +#| palette. + +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point(aes(color = drv)) + +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point(aes(color = drv)) + + scale_color_brewer(palette = "Set1") +``` + +Don't forget simpler techniques. +If there are just a few colors, you can add a redundant shape mapping. +This will also help ensure your plot is interpretable in black and white. + +```{r} +#| fig-alt: > +#| Scatterplot of highway mileage versus engine size where both color +#| and shape of points are based on drive type. The color palette is not +#| the default ggplot2 palette. + +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point(aes(color = drv, shape = drv)) + + scale_color_brewer(palette = "Set1") +``` + +The ColorBrewer scales are documented online and made available in R via the **RColorBrewer** package, by Erich Neuwirth. +@fig-brewer shows the complete list of all palettes. +The sequential (top) and diverging (bottom) palettes are particularly useful if your categorical values are ordered, or have a "middle". +This often arises if you've used `cut()` to make a continuous variable into a categorical variable. + +```{r} +#| label: fig-brewer +#| echo: false +#| fig-cap: All ColorBrewer scales. +#| fig-asp: 2.5 +#| fig-alt: > +#| All ColorBrewer scales. One group goes from light to dark colors. +#| Another group is a set of non ordinal colors. And the last group has +#| diverging scales (from dark to light to dark again). Within each set +#| there are a number of palettes. + +par(mar = c(0, 3, 0, 0)) +RColorBrewer::display.brewer.all() +``` + +When you have a predefined mapping between values and colors, use `scale_color_manual()`.
+For example, if we map presidential party to color, we want to use the standard mapping of red for Republicans and blue for Democrats: + +```{r} +#| fig-alt: > +#| Line plot of id number of presidents versus the year they started their +#| presidency. Start year is marked with a point and a segment that starts +#| there and ends at the end of the presidency. Democratic presidents are +#| represented in blue and Republicans in red. + +presidential |> + mutate(id = 33 + row_number()) |> + ggplot(aes(start, id, color = party)) + + geom_point() + + geom_segment(aes(xend = end, yend = id)) + + scale_color_manual(values = c(Republican = "red", Democratic = "blue")) +``` + +For continuous color, you can use the built-in `scale_color_gradient()` or `scale_fill_gradient()`. +If you have a diverging scale, you can use `scale_color_gradient2()`. +That allows you to give, for example, positive and negative values different colors. +That's sometimes also useful if you want to distinguish points above or below the mean. + +Another option is to use the viridis color scales. +The designers, Nathaniel Smith and Stéfan van der Walt, carefully tailored continuous color schemes that are perceptible to people with various forms of color blindness as well as perceptually uniform in both color and black and white. +These scales are available as continuous (`c`), discrete (`d`), and binned (`b`) palettes in ggplot2. + +```{r} +#| fig-align: default +#| layout-ncol: 2 +#| fig-width: 4 +#| fig-asp: 1 +#| fig-alt: > +#| Three hex plots where the color of the hexes shows the number of observations +#| that fall into that hex bin. The first plot uses the default, continuous +#| ggplot2 scale. The second plot uses the viridis, continuous scale, and the +#| third plot uses the viridis, binned scale.
+ +df <- tibble( + x = rnorm(10000), + y = rnorm(10000) +) + +ggplot(df, aes(x, y)) + + geom_hex() + + coord_fixed() + + labs(title = "Default, continuous") + +ggplot(df, aes(x, y)) + + geom_hex() + + coord_fixed() + + scale_fill_viridis_c() + + labs(title = "Viridis, continuous") + +ggplot(df, aes(x, y)) + + geom_hex() + + coord_fixed() + + scale_fill_viridis_b() + + labs(title = "Viridis, binned") +``` + +Note that all color scales come in two varieties: `scale_color_x()` and `scale_fill_x()` for the `color` and `fill` aesthetics respectively (the color scales are available in both UK and US spellings). + +### Zooming + +There are three ways to control the plot limits: + +1. Adjusting what data are plotted. +2. Setting the limits in each scale. +3. Setting `xlim` and `ylim` in `coord_cartesian()`. + +To zoom in on a region of the plot, it's generally best to use `coord_cartesian()`. +Compare the following two plots. +The first zooms with `coord_cartesian()`, so the smooth is still fitted to all of the data; the second subsets the data first, so the smooth is fitted only to the points in view: + +```{r} +#| layout-ncol: 2 +#| fig-width: 4 +#| fig-height: 3 +#| message: false + +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point(aes(color = class)) + + geom_smooth() + + coord_cartesian(xlim = c(5, 7), ylim = c(10, 30)) + +mpg |> + filter(displ >= 5, displ <= 7, hwy >= 10, hwy <= 30) |> + ggplot(aes(x = displ, y = hwy)) + + geom_point(aes(color = class)) + + geom_smooth() +``` + +You can also set the `limits` on individual scales. +Reducing the limits is basically equivalent to subsetting the data. +It is generally more useful if you want to *expand* the limits, for example, to match scales across different plots. +For example, if we extract two classes of cars and plot them separately, it's difficult to compare the plots because all three scales (the x-axis, the y-axis, and the color aesthetic) have different ranges.
+ +```{r} +#| layout-ncol: 2 +#| fig-width: 4 +#| fig-height: 3 + +suv <- mpg |> filter(class == "suv") +compact <- mpg |> filter(class == "compact") + +ggplot(suv, aes(displ, hwy, color = drv)) + + geom_point() + +ggplot(compact, aes(displ, hwy, color = drv)) + + geom_point() +``` + +One way to overcome this problem is to share scales across multiple plots, training the scales with the `limits` of the full data. + +```{r} +#| layout-ncol: 2 +#| fig-width: 4 +#| fig-height: 3 + +x_scale <- scale_x_continuous(limits = range(mpg$displ)) +y_scale <- scale_y_continuous(limits = range(mpg$hwy)) +col_scale <- scale_color_discrete(limits = unique(mpg$drv)) + +ggplot(suv, aes(x = displ, y = hwy, color = drv)) + + geom_point() + + x_scale + + y_scale + + col_scale + +ggplot(compact, aes(x = displ, y = hwy, color = drv)) + + geom_point() + + x_scale + + y_scale + + col_scale +``` + +In this particular case, you could have simply used faceting, but this technique is useful more generally, if for instance, you want to spread plots over multiple pages of a report. + +### Exercises + +1. Why doesn't the following code override the default scale? + + ```{r} + #| fig-show: "hide" + + df <- tibble( + x = rnorm(10000), + y = rnorm(10000) + ) + + ggplot(df, aes(x, y)) + + geom_hex() + + scale_color_gradient(low = "white", high = "red") + + coord_fixed() + ``` + +2. What is the first argument to every scale? + How does it compare to `labs()`? + +3. Change the display of the presidential terms by: + + a. Combining the two variants shown above. + b. Improving the display of the y axis. + c. Labelling each term with the name of the president. + d. Adding informative plot labels. + e. Placing breaks every 4 years (this is trickier than it seems!). + +4. Use `override.aes` to make the legend on the following plot easier to see. + + ```{r} + #| fig-format: "png" + #| out-width: "50%" + #| fig-alt: > + #| Scatterplot of price versus carat of diamonds. 
The points are colored + #| by cut of the diamonds and they're very transparent. + + ggplot(diamonds, aes(x = carat, y = price)) + + geom_point(aes(color = cut), alpha = 1/20) + ``` + +## Themes {#sec-themes} + +Finally, you can customize the non-data elements of your plot with a theme: + +```{r} +#| message: false + +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point(aes(color = class)) + + geom_smooth(se = FALSE) + + theme_bw() +``` + +ggplot2 includes eight themes by default, as shown in @fig-themes. +Many more are included in add-on packages like **ggthemes**, by Jeffrey Arnold. +You can also create your own themes, if you are trying to match a particular corporate or journal style. + +```{r} +#| label: fig-themes +#| echo: false +#| fig-cap: The eight themes built in to ggplot2. +#| fig-alt: > +#| Eight barplots created with ggplot2, each +#| with one of the eight built-in themes: +#| theme_bw() - White background with grid lines, +#| theme_light() - Light axes and grid lines, +#| theme_classic() - Classic theme, axes but no grid +#| lines, theme_linedraw() - Only black lines, +#| theme_dark() - Dark background for contrast, +#| theme_minimal() - Minimal theme, no background, +#| theme_gray() - Gray background (default theme), +#| theme_void() - Empty theme, only geoms are visible. + +knitr::include_graphics("images/visualization-themes.png") +``` + +Many people wonder why the default theme has a gray background. +This was a deliberate choice because it puts the data forward while still making the grid lines visible. +The white grid lines are visible (which is important because they significantly aid position judgments), but they have little visual impact and we can easily tune them out. +The grey background gives the plot a similar typographic color to the text, ensuring that the graphics fit in with the flow of a document without jumping out with a bright white background.
+Finally, the grey background creates a continuous field of color which ensures that the plot is perceived as a single visual entity. + +It's also possible to control individual components of each theme, like the size and color of the font used for the y axis. +We've already seen that `legend.position` controls where the legend is drawn. +There are many other aspects of the legend that can be customized with `theme()`. +For example, in the plot below we change the direction of the legend as well as put a black border around it. +A few other helpful `theme()` components are used to change the placement and format of the title and caption text. + +```{r} +ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + + geom_point() + + labs( + title = "Highway mileage decreases as engine size increases", + caption = "Source: https://fueleconomy.gov." + ) + + theme( + legend.position = c(0.6, 0.7), + legend.direction = "horizontal", + legend.box.background = element_rect(color = "black"), + plot.title = element_text(face = "bold"), + plot.title.position = "plot", + plot.caption.position = "plot", + plot.caption = element_text(hjust = 0) + ) +``` + +For an overview of all `theme()` components, see help with `?theme`. +The [ggplot2 book](https://ggplot2-book.org/) is also a great place to go for the full details on theming. + +### Exercises + +1. Pick a theme offered by the ggthemes package and apply it to the last plot you made. +2. Make the axis labels of your plot blue and bolded. + +## Layout + +So far we've talked about how to create and modify a single plot. +What if you have multiple plots you want to lay out in a certain way? +The patchwork package allows you to combine separate plots into the same graphic. +We loaded this package earlier in the chapter. + +To place two plots next to each other, you can simply add them to each other. +Note that you first need to create the plots and save them as objects (in the following example they're called `p1` and `p2`).
+Then, you place them next to each other with `+`. + +```{r} +#| fig-alt: > +#| Two plots (a scatterplot of highway mileage versus engine size and +#| side-by-side boxplots of highway mileage versus drive train) placed next +#| to each other. + +p1 <- ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point() + + labs(title = "Plot 1") +p2 <- ggplot(mpg, aes(x = drv, y = hwy)) + + geom_boxplot() + + labs(title = "Plot 2") +p1 + p2 +``` + +It's important to note that in the above code chunk we did not use a new function from the patchwork package. +Instead, the package added new functionality to the `+` operator. + +You can also create arbitrary plot layouts with patchwork. +In the following, `|` places `p1` and `p3` next to each other and `/` moves `p2` to the next line. + +```{r} +#| fig-alt: > +#| Three plots laid out such that first and third plot are next to each other +#| and the second plot is stretched beneath them. The first plot is a +#| scatterplot of highway mileage versus engine size, the third plot is a +#| scatterplot of highway mileage versus city mileage, and the second plot is +#| side-by-side boxplots of highway mileage versus drive train. + +p3 <- ggplot(mpg, aes(x = cty, y = hwy)) + + geom_point() + + labs(title = "Plot 3") +(p1 | p3) / p2 +``` + +Additionally, patchwork allows you to collect legends from multiple plots into one common legend, customize the placement of the legend as well as dimensions of the plots, and add a common title, subtitle, caption, etc. to your plots. +In the following, we have 5 plots. +We have turned off the legends on the box plots and the scatterplot and collected the legends for the density plots at the top of the plot with `& theme(legend.position = "top")`. +Note the use of the `&` operator here instead of the usual `+`. +This is because we're modifying the theme for the patchwork plot as opposed to the individual ggplots. +The legend is placed on top, inside the `guide_area()`.
+Finally, we have also customized the heights of the various components of our patchwork -- the guide has a height of 1, the box plots 3, density plots 2, and the faceted scatter plot 4. +Patchwork divides up the area you have allotted for your plot using this scale and places the components accordingly. + +```{r} +#| fig-alt: > +#| Five plots laid out such that the first two plots are next to each other. Plots +#| three and four are underneath them. And the fifth plot stretches under them. +#| The patchworked plot is titled "City and highway mileage for cars with +#| different drive trains" and captioned "Source: https://fueleconomy.gov". +#| The first two plots are side-by-side box plots. Plots 3 and 4 are density +#| plots. And the fifth plot is a faceted scatterplot. Each of these plots shows +#| geoms colored by drive train, but the patchworked plot has only one legend +#| that applies to all of them, above the plots and beneath the title. + +p1 <- ggplot(mpg, aes(x = drv, y = cty, color = drv)) + + geom_boxplot(show.legend = FALSE) + + labs(title = "Plot 1") + +p2 <- ggplot(mpg, aes(x = drv, y = hwy, color = drv)) + + geom_boxplot(show.legend = FALSE) + + labs(title = "Plot 2") + +p3 <- ggplot(mpg, aes(x = cty, color = drv, fill = drv)) + + geom_density(alpha = 0.5) + + labs(title = "Plot 3") + +p4 <- ggplot(mpg, aes(x = hwy, color = drv, fill = drv)) + + geom_density(alpha = 0.5) + + labs(title = "Plot 4") + +p5 <- ggplot(mpg, aes(x = cty, y = hwy, color = drv)) + + geom_point(show.legend = FALSE) + + facet_wrap(~drv) + + labs(title = "Plot 5") + +(guide_area() / (p1 + p2) / (p3 + p4) / p5) + + plot_annotation( + title = "City and highway mileage for cars with different drive trains", + caption = "Source: https://fueleconomy.gov." 
+ ) + + plot_layout( + guides = "collect", + heights = c(1, 3, 2, 4) + ) & + theme(legend.position = "top") +``` + +If you'd like to learn more about combining and laying out multiple plots with patchwork, we recommend looking through the guides on the package website. + +### Exercises + +1. What happens if you omit the parentheses in the following plot layout? + Can you explain why this happens? + + ```{r} + #| results: hide + + p1 <- ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point() + + labs(title = "Plot 1") + p2 <- ggplot(mpg, aes(x = drv, y = hwy)) + + geom_boxplot() + + labs(title = "Plot 2") + p3 <- ggplot(mpg, aes(x = cty, y = hwy)) + + geom_point() + + labs(title = "Plot 3") + + (p1 | p2) / p3 + ``` + +2. Using the three plots from the previous exercise, recreate the following patchwork. + + ```{r} + #| echo: false + #| fig-alt: > + #| Three plots: Plot 1 is a scatterplot of highway mileage versus engine size. + #| Plot 2 is side-by-side box plots of highway mileage versus drive train. + #| Plot 3 is a scatterplot of highway mileage versus city mileage. + #| Plot 1 is on the first row. Plots 2 and 3 are on the next row, each spanning + #| half the width of Plot 1. Plot 1 is labelled "Fig. A:", Plot 2 is labelled + #| "Fig. B:", and Plot 3 is labelled "Fig. C:". + + p1 / (p2 + p3) + + plot_annotation( + tag_levels = c("A"), + tag_prefix = "Fig. ", + tag_suffix = ":" + ) + ``` + +## Summary + +In this chapter you've learned about adding plot labels such as title, subtitle, and caption, as well as modifying default axis labels, using annotation to add informational text to your plot or to highlight specific data points, customizing the axis scales, and changing the theme of your plot. +You've also learned about combining multiple plots in a single graph using both simple and complex plot layouts. 
+ +While you've so far learned about how to make many different types of plots and how to customize them using a variety of techniques, we've barely scratched the surface of what you can create with ggplot2. +If you want to get a comprehensive understanding of ggplot2, we recommend reading the book, [*ggplot2: Elegant Graphics for Data Analysis*](https://ggplot2-book.org). +Other useful resources are the [*R Graphics Cookbook*](https://r-graphics.org) by Winston Chang and [*Fundamentals of Data Visualization*](https://clauswilke.com/dataviz/) by Claus Wilke. diff --git a/data-visualize.qmd b/data-visualize.qmd index a53b096..932a1f6 100644 --- a/data-visualize.qmd +++ b/data-visualize.qmd @@ -4,17 +4,22 @@ #| results: "asis" #| echo: false source("_common.R") +status("complete") ``` ## Introduction > "The simple graph has brought more information to the data analyst's mind than any other device." --- John Tukey -This chapter will teach you how to visualize your data using ggplot2. R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2 implements the **grammar of graphics**, a coherent system for describing and building graphs. With ggplot2, you can do more and faster by learning one system and applying it in many places. +This chapter will teach you how to visualize your data using ggplot2. +We will start by creating a simple scatterplot and use that to introduce aesthetic mappings and geometric objects -- the fundamental building blocks of ggplot2. +We will then walk you through visualizing distributions of single variables as well as visualizing relationships between two or more variables. +We'll finish off with saving your plots and troubleshooting tips. + ### Prerequisites This chapter focuses on ggplot2, one of the core packages in the tidyverse. @@ -40,325 +45,769 @@ library(tidyverse) You only need to install a package once, but you need to reload it every time you start a new session. 
+In addition to tidyverse, we will also use the **palmerpenguins** package, which includes the `penguins` dataset containing body measurements for penguins on three islands in the Palmer Archipelago. + +```{r} +library(palmerpenguins) +``` + ## First steps -Let's use our first graph to answer a question: Do cars with big engines use more fuel than cars with small engines? +Let's use our first graph to answer a question: Do penguins with longer flippers weigh more or less than penguins with shorter flippers? You probably already have an answer, but try to make your answer precise. -What does the relationship between engine size and fuel efficiency look like? +What does the relationship between flipper length and body mass look like? Is it positive? Negative? Linear? Nonlinear? +Does the relationship vary by the species of the penguin? +And how about by the island where the penguin lives? -### The `mpg` data frame +### The `penguins` data frame -You can test your answer with the `mpg` **data frame** found in ggplot2 (a.k.a. `ggplot2::mpg`). +You can test your answer with the `penguins` **data frame** found in palmerpenguins (a.k.a. `palmerpenguins::penguins`). A data frame is a rectangular collection of variables (in the columns) and observations (in the rows). -`mpg` contains `r nrow(mpg)` observations collected by the US Environmental Protection Agency on `r mpg |> distinct(model) |> nrow()` car models. +`penguins` contains `r nrow(penguins)` observations collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER[^data-visualize-1]. + +[^data-visualize-1]: Horst AM, Hill AP, Gorman KB (2020). + palmerpenguins: Palmer Archipelago (Antarctica) penguin data. + R package version 0.1.0. + . + doi: 10.5281/zenodo.3960218. ```{r} -mpg +penguins ``` -Among the variables in `mpg` are: +This data frame contains `r ncol(penguins)` columns. 
+For an alternative view, where you can see all variables and the first few observations of each variable, use `glimpse()`. +Or, if you're in RStudio, run `View(penguins)` to open an interactive data viewer. -1. `displ`: a car's engine size, in liters. +```{r} +glimpse(penguins) +``` -2. `hwy`: a car's fuel efficiency on the highway, in miles per gallon (mpg). - A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance. +Among the variables in `penguins` are: -To learn more about `mpg`, open its help page by running `?mpg`. +1. `species`: a penguin's species (Adelie, Chinstrap, or Gentoo). + +2. `flipper_length_mm`: length of a penguin's flipper, in millimeters. + +3. `body_mass_g`: body mass of a penguin, in grams. + +To learn more about `penguins`, open its help page by running `?penguins`. + +### Ultimate goal {#sec-ultimate-goal} + +Our ultimate goal in this chapter is to recreate the following visualization displaying the relationship between flipper lengths and body masses of these penguins, taking into consideration the species of the penguin. + +```{r} +#| echo: false +#| warning: false +#| fig-alt: > +#| A scatterplot of body mass vs. flipper length of penguins, with a +#| smooth curve displaying the relationship between these two variables +#| overlaid. The plot displays a positive, fairly linear, relatively +#| strong relationship between these two variables. Species (Adelie, +#| Chinstrap, and Gentoo) are represented with different colors and +#| shapes. The relationship between body mass and flipper length is +#| roughly the same for these three species, and Gentoo penguins are +#| larger than penguins from the other two species. 
+ +ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) + + geom_point(aes(color = species, shape = species)) + + geom_smooth() + + labs( + title = "Body mass and flipper length", + subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins", + x = "Flipper length (mm)", + y = "Body mass (g)", + color = "Species", + shape = "Species" + ) +``` ### Creating a ggplot -To plot `mpg`, run this code to put `displ` on the x-axis, `hwy` on the y-axis, and represent each observation with a point: +Let's recreate this plot layer by layer. + +With ggplot2, you begin a plot with the function `ggplot()`, defining a plot object that you then add layers to. +The first argument of `ggplot()` is the dataset to use in the graph, so `ggplot(data = penguins)` creates an empty graph. +This is not a very exciting plot, but you can think of it like an empty canvas you'll paint the remaining layers of your plot onto. ```{r} #| fig-alt: > -#| Scatterplot of highway fuel efficiency versus engine size of cars that -#| shows a negative association. +#| A blank, gray plot area. -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy)) +ggplot(data = penguins) ``` -The plot shows a negative relationship between engine size (`displ`) and fuel efficiency (`hwy`). -In other words, cars with smaller engine sizes have higher fuel efficiency and, in general, as engine size increases, fuel efficiency decreases. -Does this confirm or refute your hypothesis about fuel efficiency and engine size? +Next, we need to tell `ggplot()` the variables from this data frame that we want to map to visual properties (**aesthetics**) of the plot. +The `mapping` argument of the `ggplot()` function defines how variables in your dataset are mapped to visual properties of your plot. +The `mapping` argument is always paired with the `aes()` function, and the `x` and `y` arguments of `aes()` specify which variables to map to the x and y axes. 
+For now, we will only map flipper length to the `x` aesthetic and body mass to the `y` aesthetic. +ggplot2 looks for the mapped variables in the `data` argument, in this case, `penguins`. -With ggplot2, you begin a plot with the function `ggplot()`. -`ggplot()` creates a coordinate system that you can add layers to. -You can think of it like an empty canvas you'll paint the rest of your plot on, layer by layer. -The first argument of `ggplot()` is the dataset to use in the graph. -So `ggplot(data = mpg)` creates an empty graph, but it's not very interesting so we won't show it here. +The following plots show the result of adding these mappings, one at a time. + +```{r} +#| layout-ncol: 2 +#| fig-alt: > +#| There are two plots. The plot on the left shows flipper length on +#| the x-axis. The values range from 170 to 230. The plot on the right +#| also shows body mass on the y-axis. The values range from 3000 to +#| 6000. + +ggplot(data = penguins, + mapping = aes(x = flipper_length_mm)) +ggplot(data = penguins, + mapping = aes(x = flipper_length_mm, y = body_mass_g)) +``` + +Our empty canvas now has more structure -- it's clear where flipper lengths will be displayed (on the x-axis) and where body masses will be displayed (on the y-axis). +But the penguins themselves are not yet on the plot. +This is because we have not yet articulated, in our code, how to represent the observations from our data frame on our plot. + +To do so, we need to define a **geom**: the geometrical object that a plot uses to represent data. +These geometric objects are made available in ggplot2 with functions that start with `geom_`. +People often describe plots by the type of geom that the plot uses. +For example, bar charts use bar geoms (`geom_bar()`), line charts use line geoms (`geom_line()`), boxplots use boxplot geoms (`geom_boxplot()`), and so on. +Scatterplots break the trend; they use the point geom: `geom_point()`. 
-You complete your graph by adding one or more layers to `ggplot()`. The function `geom_point()` adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with many geom functions that each adds a different type of layer to a plot. -You'll learn a whole bunch of them throughout this chapter. +You'll learn a whole bunch of geoms throughout the book, particularly in @sec-layers. -Each geom function in ggplot2 takes a `mapping` argument. -This defines how variables in your dataset are mapped to visual properties of your plot. -The `mapping` argument is always paired with `aes()`, and the `x` and `y` arguments of `aes()` specify which variables to map to the x and y axes. -ggplot2 looks for the mapped variables in the `data` argument, in this case, `mpg`. +```{r} +#| fig-alt: > +#| A scatterplot of body mass vs. flipper length of penguins. The plot +#| displays a positive, linear, relatively strong relationship between +#| these two variables. -### A graphing template +ggplot(data = penguins, + mapping = aes(x = flipper_length_mm, y = body_mass_g)) + + geom_point() +``` -Let's turn this code into a reusable template for making graphs with ggplot2. -To make a graph, replace the bracketed sections in the code below with a dataset, a geom function, or a collection of mappings. +Now we have something that looks like what we might think of as a "scatter plot". +It doesn't yet match our "ultimate goal" plot, but using this plot we can start answering the question that motivated our exploration: "What does the relationship between flipper length and body mass look like?" The relationship appears to be positive, fairly linear, and moderately strong. +Penguins with longer flippers are generally larger in terms of their body mass. + +Before we add more layers to this plot, let's pause for a moment and review the warning message we got: + +> Removed 2 rows containing missing values (`geom_point()`). 
+ +We're seeing this message because there are two penguins in our dataset with missing body mass and flipper length values and ggplot2 has no way of representing them on the plot. +You don't need to worry about understanding the following code yet (you will learn about it in @sec-data-transform), but it's one way of identifying the observations with `NA`s for either body mass or flipper length. + +```{r} +penguins |> + select(species, flipper_length_mm, body_mass_g) |> + filter(is.na(body_mass_g) | is.na(flipper_length_mm)) +``` + +Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing. +This type of warning is probably one of the most common types of warnings you will see when working with real data -- missing values are a very common issue and you'll learn more about them throughout the book, particularly in @sec-missing-values. +For the remaining plots in this chapter we will suppress this warning so it's not printed alongside every single plot we make. + +### Adding aesthetics and layers + +Scatterplots are useful for displaying the relationship between two variables, but it's always a good idea to be skeptical of any apparent relationship between two variables and ask if there may be other variables that explain or change the nature of this apparent relationship. +Let's incorporate species into our plot and see if this reveals any additional insights into the apparent relationship between flipper length and body mass. +We will do this by representing species with different colored points. + +To achieve this, where should `species` go in the ggplot call from earlier? +If you guessed "in the aesthetic mapping, inside of `aes()`", you're already getting the hang of creating data visualizations with ggplot2! +And if not, don't worry. +Throughout the book you will make many more ggplots and have many more opportunities to check your intuition as you make them. 
+ +```{r} +#| warning: false +#| fig-alt: > +#| A scatterplot of body mass vs. flipper length of penguins. The plot +#| displays a positive, fairly linear, relatively strong relationship +#| between these two variables. Species (Adelie, Chinstrap, and Gentoo) +#| are represented with different colors. + +ggplot(data = penguins, + mapping = aes(x = flipper_length_mm, y = body_mass_g, + color = species)) + + geom_point() +``` + +When a variable is mapped to an aesthetic, ggplot2 will automatically assign a unique value of the aesthetic (here a unique color) to each unique level of the variable (each of the three species), a process known as **scaling**. +ggplot2 will also add a legend that explains which values correspond to which levels. + +Now let's add one more layer: a smooth curve displaying the relationship between body mass and flipper length. +Before you proceed, refer back to the code above, and think about how we can add this to our existing plot. + +Since this is a new geometric object representing our data, we will add a new geom: `geom_smooth()`. + +```{r} +#| warning: false +#| fig-alt: > +#| A scatterplot of body mass vs. flipper length of penguins. Overlaid +#| on the scatterplot are three smooth curves displaying the +#| relationship between these variables for each species (Adelie, +#| Chinstrap, and Gentoo). Different penguin species are plotted in +#| different colors for the points and the smooth curves. + +ggplot(data = penguins, + mapping = aes(x = flipper_length_mm, y = body_mass_g, + color = species)) + + geom_point() + + geom_smooth() +``` + +We have successfully added smooth curves, but this plot doesn't look like the plot from @sec-ultimate-goal, which only has one curve for the entire dataset as opposed to separate curves for each of the penguin species. + +When aesthetic mappings are defined in `ggplot()`, at the *global* level, they're inherited by each of the subsequent geom layers of the plot. 
+However, each geom function in ggplot2 can also take a `mapping` argument, which allows for aesthetic mappings at the *local* level. +Since we want points to be colored based on species but don't want the smooth curve to be separated by species, we should specify `color = species` for `geom_point()` only. + +```{r} +#| warning: false +#| fig-alt: > +#| A scatterplot of body mass vs. flipper length of penguins. Overlaid +#| on the scatterplot is a single smooth curve displaying the +#| relationship between these variables across all species (Adelie, +#| Chinstrap, and Gentoo). Different penguin species are plotted in +#| different colors for the points only. + +ggplot(data = penguins, + mapping = aes(x = flipper_length_mm, y = body_mass_g)) + + geom_point(mapping = aes(color = species)) + + geom_smooth() +``` + +Voila! +We have something that looks very much like our ultimate goal, though it's not yet perfect. +We still need to use different shapes for each species of penguin and improve the labels. + +It's generally not a good idea to represent information using only colors on a plot, as people perceive colors differently due to color blindness or other color vision differences. +Therefore, in addition to color, we can also map `species` to the `shape` aesthetic. + +```{r} +#| warning: false +#| fig-alt: > +#| A scatterplot of body mass vs. flipper length of penguins. Overlaid +#| on the scatterplot is a single smooth curve displaying the +#| relationship between these variables across all species (Adelie, +#| Chinstrap, and Gentoo). Different penguin species are plotted in +#| different colors and shapes for the points only. + +ggplot(data = penguins, + mapping = aes(x = flipper_length_mm, y = body_mass_g)) + + geom_point(mapping = aes(color = species, shape = species)) + + geom_smooth() +``` + +Note that the legend is automatically updated to reflect the different shapes of the points as well. 
+ +And finally, we can improve the labels of our plot using the `labs()` function in a new layer. + +```{r} +#| warning: false +#| fig-alt: > +#| A scatterplot of body mass vs. flipper length of penguins, with a +#| smooth curve displaying the relationship between these two variables +#| overlaid. The plot displays a positive, fairly linear, relatively +#| strong relationship between these two variables. Species (Adelie, +#| Chinstrap, and Gentoo) are represented with different colors and +#| shapes. The relationship between body mass and flipper length is +#| roughly the same for these three species, and Gentoo penguins are +#| larger than penguins from the other two species. + +ggplot(penguins, + aes(x = flipper_length_mm, y = body_mass_g)) + + geom_point(aes(color = species, shape = species)) + + geom_smooth() + + labs( + title = "Body mass and flipper length", + subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins", + x = "Flipper length (mm)", + y = "Body mass (g)", + color = "Species", + shape = "Species" + ) +``` + +We finally have a plot that perfectly matches our "ultimate goal"! + +### Exercises + +1. How many rows are in `penguins`? + How many columns? + +2. What does the `bill_depth_mm` variable in the `penguins` data frame describe? + Read the help for `?penguins` to find out. + +3. Make a scatterplot of `bill_depth_mm` vs `bill_length_mm`. + Describe the relationship between these two variables. + +4. What happens if you make a scatterplot of `species` vs `bill_depth_mm`? + Why is the plot not useful? + +5. Why does the following give an error and how would you fix it? + + ```{r} + #| eval: false + + ggplot(data = penguins) + + geom_point() + ``` + +6. What does the `na.rm` argument do in `geom_point()`? + What is the default value of the argument? + Create a scatterplot where you successfully use this argument set to `TRUE`. + +7. 
Add the following caption to the plot you made in the previous exercise: "Data come from the palmerpenguins package." Hint: Take a look at the documentation for `labs()`. + +8. Recreate the following visualization. + What aesthetic should `bill_depth_mm` be mapped to? + And should it be mapped at the global level or at the geom level? + + ```{r} + #| echo: false + #| fig-alt: > + #| A scatterplot of body mass vs. flipper length of penguins, colored + #| by bill depth. A smooth curve of the relationship between body mass + #| and flipper length is overlaid. The relationship is positive, + #| fairly linear, and moderately strong. + + ggplot(data = penguins, + mapping = aes(x = flipper_length_mm, y = body_mass_g)) + + geom_point(aes(color = bill_depth_mm)) + + geom_smooth() + ``` + +9. Run this code in your head and predict what the output will look like. + Then, run the code in R and check your predictions. + + ```{r} + #| eval: false + + ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + + geom_point() + + geom_smooth(se = FALSE) + ``` + +10. Will these two graphs look different? + Why/why not? + + ```{r} + #| eval: false + + ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + + geom_point() + + geom_smooth() + + ggplot() + + geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + + geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy)) + ``` + +## ggplot2 calls + +As we move on from these introductory sections, we'll transition to a more concise expression of ggplot2 code. +So far we've been very explicit, which is helpful when you are learning: ```{r} #| eval: false -ggplot(data = ) + - (mapping = aes()) +ggplot(data = penguins, + mapping = aes(x = flipper_length_mm, y = body_mass_g)) + + geom_point() ``` -The rest of this chapter will show you how to complete and extend this template to make different types of graphs. -We will begin with the `` component. 
+Typically, the first one or two arguments to a function are so important that you should know them by heart. +The first two arguments to `ggplot()` are `data` and `mapping`; in the remainder of the book, we won't supply those names. +That saves typing, and, by reducing the amount of boilerplate, makes it easier to see what's different between plots. +That's a really important programming concern that we'll come back to in @sec-functions. -### Exercises - -1. Run `ggplot(data = mpg)`. - What do you see? - -2. How many rows are in `mpg`? - How many columns? - -3. What does the `drv` variable describe? - Read the help for `?mpg` to find out. - -4. Make a scatterplot of `hwy` vs `cyl`. - -5. What happens if you make a scatterplot of `class` vs `drv`? - Why is the plot not useful? - -## Aesthetic mappings - -> "The greatest value of a picture is when it forces us to notice what we never expected to see." --- John Tukey - -In the plot below, one group of points (highlighted in red) seems to fall outside of the linear trend. -These cars have a higher fuel efficiency than you might expect. -That is, they have a higher miles per gallon than other cars with similar engine sizes. -How can you explain these cars? +Rewriting the previous plot more concisely yields: ```{r} -#| echo: false -#| fig-alt: > -#| Scatterplot of highway fuel efficiency versus engine size of cars that -#| shows a negative association. Cars with engine size greater than 5 litres -#| and highway fuel efficiency greater than 20 miles per gallon stand out from -#| the rest of the data and are highlighted in red. 
+#| eval: false -ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + - geom_point() + - geom_point(data = dplyr::filter(mpg, displ > 5, hwy > 20), color = "red", size = 1.6) + - geom_point(data = dplyr::filter(mpg, displ > 5, hwy > 20), color = "red", size = 3, shape = "circle open") +ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) + + geom_point() ``` -Let's hypothesize that the cars are hybrids. -One way to test this hypothesis is to look at the `class` value for each car. -The `class` variable of the `mpg` dataset classifies cars into groups such as compact, midsize, and SUV. -If the outlying points are hybrids, they should be classified as compact cars or, perhaps, subcompact cars (keep in mind that this data was collected before hybrid trucks and SUVs became popular). - -You can add a third variable, like `class`, to a two dimensional scatterplot by mapping it to an **aesthetic**. -An aesthetic is a visual property of the objects in your plot. -Aesthetics include things like the size, the shape, or the color of your points. -You can display a point (like the one below) in different ways by changing the values of its aesthetic properties. -Since we already use the word "value" to describe data, let's use the word "level" to describe aesthetic properties. -Here we change the levels of a point's size, shape, and color to make the point small, triangular, or blue: +In the future, you'll also learn about the pipe which will allow you to create that plot with: ```{r} -#| echo: false -#| fig.asp: 0.25 -#| fig-width: 8 -#| fig-alt: > -#| Diagram that shows four plotting characters next to each other. The first -#| is a large circle, the second is a small circle, the third is a triangle, -#| and the fourth is a blue circle. 
+#| eval: false -ggplot() + - geom_point(aes(1, 1), size = 20) + - geom_point(aes(2, 1), size = 10) + - geom_point(aes(3, 1), size = 20, shape = 17) + - geom_point(aes(4, 1), size = 20, color = "blue") + - scale_x_continuous(NULL, limits = c(0.5, 4.5), labels = NULL) + - scale_y_continuous(NULL, limits = c(0.9, 1.1), labels = NULL) + - theme(aspect.ratio = 1/3) +penguins |> + ggplot(aes(x = flipper_length_mm, y = body_mass_g)) + + geom_point() ``` -You can convey information about your data by mapping the aesthetics in your plot to the variables in your dataset. -For example, you can map the colors of your points to the `class` variable to reveal the class of each car. +This is the most common syntax you'll see in the wild. + +## Visualizing distributions + +How you visualize the distribution of a variable depends on the type of variable: categorical or numerical. + +### A categorical variable + +A variable is **categorical** if it can only take one of a small set of values. +To examine the distribution of a categorical variable, you can use a bar chart. +The height of the bars displays how many observations occurred with each `x` value. ```{r} #| fig-alt: > -#| Scatterplot of highway fuel efficiency versus engine size of cars that -#| shows a negative association. The points representing each car are colored -#| according to the class of the car. The legend on the right of the plot -#| shows the mapping between colors and levels of the class variable: -#| 2seater, compact, midsize, minivan, pickup, or suv. +#| A bar chart of frequencies of species of penguins: Adelie +#| (approximately 150), Chinstrap (approximately 90), Gentoo +#| (approximately 125). -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy, color = class)) +ggplot(penguins, aes(x = species)) + + geom_bar() ``` -(If you prefer British English, like Hadley, you can use `colour` instead of `color`.) 
- -To map an aesthetic to a variable, associate the name of the aesthetic with the name of the variable inside `aes()`. -ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as **scaling**. -ggplot2 will also add a legend that explains which levels correspond to which values. - -The colors reveal that many of the unusual points (with engine size greater than 5 liters and highway fuel efficiency greater than 20 miles per gallon) are two-seater cars. -These cars don't seem like hybrids, and are, in fact, sports cars! -Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. -In hindsight, these cars were unlikely to be hybrids since they have large engines. - -In the above example, we mapped `class` to the `color` aesthetic, but we could have mapped `class` to the `size` aesthetic in the same way. -In this case, the exact size of each point would reveal its class affiliation. -We get a *warning* here: mapping an unordered variable (`class`) to an ordered aesthetic (`size`) is generally not a good idea because it implies a ranking that does not in fact exist. +In bar plots of categorical variables with non-ordered levels, like the penguin `species` above, it's often preferable to reorder the bars based on their frequencies. +Doing so requires transforming the variable to a factor (how R handles categorical data) and then reordering the levels of that factor. ```{r} #| fig-alt: > -#| Scatterplot of highway fuel efficiency versus engine size of cars that -#| shows a negative association. The points representing each car are sized -#| according to the class of the car. The legend on the right of the plot -#| shows the mapping between sizes and levels of the class variable -- going -#| from small to large: 2seater, compact, midsize, minivan, pickup, or suv. 
+#| A bar chart of frequencies of species of penguins, where the bars are +#| ordered in decreasing order of their heights (frequencies): Adelie +#| (approximately 150), Gentoo (approximately 125), Chinstrap +#| (approximately 90). -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy, size = class)) +ggplot(penguins, aes(x = fct_infreq(species))) + + geom_bar() ``` -Similarly, we could have mapped `class` to the `alpha` aesthetic, which controls the transparency of the points, or to the `shape` aesthetic, which controls the shape of the points. +You will learn more about factors and functions for dealing with factors (like `fct_infreq()` shown above) in @sec-factors. + +### A numerical variable + +A variable is **numerical** if it can take any of an infinite set of ordered values. +Numbers and date-times are two examples of numerical variables. +To visualize the distribution of a numerical variable, you can use a histogram or a density plot. ```{r} +#| warning: false #| layout-ncol: 2 -#| fig-width: 4 -#| fig-height: 2 -#| warning: false #| fig-alt: > -#| Two scatterplots next to each other, both visualizing highway fuel -#| efficiency versus engine size of cars and showing a negative association. -#| In the plot on the left class is mapped to the alpha aesthetic, resulting -#| in different transparency levels for each level of class. In the plot on -#| the right class is mapped the shape aesthetic, resulting in different -#| plotting character shapes for each level of class. Each plot comes with a -#| legend that shows the mapping between alpha level or shape and levels of -#| the class variable. +#| A histogram (on the left) and density plot (on the right) of body masses +#| of penguins. The distribution is unimodal and right skewed, ranging +#| from approximately 2500 to 6500 grams. 
-# Left -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy, alpha = class)) - -# Right -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +ggplot(penguins, aes(x = body_mass_g)) + + geom_histogram(binwidth = 200) +ggplot(penguins, aes(x = body_mass_g)) + + geom_density() ``` -What happened to the SUVs? -ggplot2 will only use six shapes at a time. -By default, additional groups will go unplotted when you use the shape aesthetic. - -For each aesthetic, you use `aes()` to associate the name of the aesthetic with a variable to display. -The `aes()` function gathers together each of the aesthetic mappings used by a layer and passes them to the layer's mapping argument. -The syntax highlights a useful insight about `x` and `y`: the x and y locations of a point are themselves aesthetics, visual properties that you can map to variables to display information about the data. - -Once you map an aesthetic, ggplot2 takes care of the rest. -It selects a reasonable scale to use with the aesthetic, and it constructs a legend that explains the mapping between levels and values. -For x and y aesthetics, ggplot2 does not create a legend, but it creates an axis line with tick marks and a label. -The axis line acts as a legend; it explains the mapping between locations and values. - -You can also *set* the aesthetic properties of your geom manually. -For example, we can make all of the points in our plot blue: +A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin. +In the graph above, the tallest bar shows that 39 observations have a `body_mass_g` value between 3,500 and 3,700 grams, which are the left and right edges of the bar. ```{r} -#| fig-alt: > -#| Scatterplot of highway fuel efficiency versus engine size of cars that -#| shows a negative association. All points are blue. 
- -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy), color = "blue") +penguins |> + count(cut_width(body_mass_g, 200)) ``` -Here, the color doesn't convey information about a variable, but only changes the appearance of the plot. -To set an aesthetic manually, set the aesthetic by name as an argument of your geom function. -In other words, it goes *outside* of `aes()`. -You'll need to pick a value that makes sense for that aesthetic: - -- The name of a color as a character string. -- The size of a point in mm. -- The shape of a point as a number, as shown in @fig-shapes. +You can set the width of the intervals in a histogram with the `binwidth` argument, which is measured in the units of the `x` variable. +You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. +In the plots below, a binwidth of 20 is too narrow, resulting in too many bars, making it difficult to determine the shape of the distribution. +Similarly, a binwidth of 2,000 is too wide, resulting in all data being binned into only three bars, and also making it difficult to determine the shape of the distribution. ```{r} -#| label: fig-shapes -#| echo: false #| warning: false -#| fig.asp: 0.364 -#| fig-align: "center" -#| fig-cap: > -#| R has 25 built in shapes that are identified by numbers. There are some -#| seeming duplicates: for example, 0, 15, and 22 are all squares. The -#| difference comes from the interaction of the `color` and `fill` -#| aesthetics. The hollow shapes (0--14) have a border determined by `color`; -#| the solid shapes (15--20) are filled with `color`; the filled shapes -#| (21--24) have a border of `color` and are filled with `fill`. 
+#| layout-ncol: 3 #| fig-alt: > -#| Mapping between shapes and the numbers that represent them: 0 - square, -#| 1 - circle, 2 - triangle point up, 3 - plus, 4 - cross, 5 - diamond, -#| 6 - triangle point down, 7 - square cross, 8 - star, 9 - diamond plus, -#| 10 - circle plus, 11 - triangles up and down, 12 - square plus, -#| 13 - circle cross, 14 - square and triangle down, 15 - filled square, -#| 16 - filled circle, 17 - filled triangle point-up, 18 - filled diamond, -#| 19 - solid circle, 20 - bullet (smaller circle), 21 - filled circle blue, -#| 22 - filled square blue, 23 - filled diamond blue, 24 - filled triangle -#| point-up blue, 25 - filled triangle point down blue. +#| Three histograms of body masses of penguins, one with binwidth of 20 +#| (left), one with binwidth of 200 (center), and one with binwidth of +#| 2000 (right). The histogram with binwidth of 20 shows lots of ups and +#| downs in the heights of the bins, creating a jagged outline. The histogram +#| with binwidth of 2000 shows only three bins. -shapes <- tibble( - shape = c(0, 1, 2, 5, 3, 4, 6:19, 22, 21, 24, 23, 20), - x = (0:24 %/% 5) / 2, - y = (-(0:24 %% 5)) / 4 -) -ggplot(shapes, aes(x, y)) + - geom_point(aes(shape = shape), size = 5, fill = "red") + - geom_text(aes(label = shape), hjust = 0, nudge_x = 0.15) + - scale_shape_identity() + - expand_limits(x = 4.1) + - scale_x_continuous(NULL, breaks = NULL) + - scale_y_continuous(NULL, breaks = NULL, limits = c(-1.2, 0.2)) + - theme_minimal() + - theme(aspect.ratio = 1/2.75) +ggplot(penguins, aes(x = body_mass_g)) + + geom_histogram(binwidth = 20) +ggplot(penguins, aes(x = body_mass_g)) + + geom_histogram(binwidth = 200) +ggplot(penguins, aes(x = body_mass_g)) + + geom_histogram(binwidth = 2000) ``` ### Exercises -1. Why did the following code not result in a plot with blue points? +1. Make a bar plot of `species` of `penguins`, where you assign `species` to the `y` aesthetic. + How is this plot different? + +2. 
How are the following two plots different? + Which aesthetic, `color` or `fill`, is more useful for changing the color of bars? ```{r} - #| fig-alt: > - #| Scatterplot of highway fuel efficiency versus engine size of cars - #| that shows a negative association. All points are red and - #| the legend shows a red point that is mapped to the word blue. + #| eval: false - ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy, color = "blue")) + ggplot(penguins, aes(x = species)) + + geom_bar(color = "red") + + ggplot(penguins, aes(x = species)) + + geom_bar(fill = "red") ``` -2. Which variables in `mpg` are categorical? +3. What does the `bins` argument in `geom_histogram()` do? + +4. Make a histogram of the `carat` variable in the `diamonds` dataset. + Experiment with different binwidths. + What binwidth reveals the most interesting patterns? + +## Visualizing relationships + +To visualize a relationship we need to have at least two variables mapped to aesthetics of a plot. +In the following sections you will learn about commonly used plots for visualizing relationships between two or more variables and the geoms used for creating them. + +### A numerical and a categorical variable + +To visualize the relationship between a numerical and a categorical variable we can use side-by-side box plots. +A **boxplot** is a type of visual shorthand for a distribution of values that is popular among statisticians. +Each boxplot consists of: + +- A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR). + In the middle of the box is a line that displays the median, i.e. 50th percentile, of the distribution. + These three lines give you a sense of the spread of the distribution and whether or not the distribution is symmetric about the median or skewed to one side. + +- Visual points that display observations that fall more than 1.5 times the IQR from either edge of the box. 
+ These outlying points are unusual so are plotted individually. + +- A line (or whisker) that extends from each end of the box and goes to the farthest non-outlier point in the distribution. + +```{r} +#| echo: false +#| fig-alt: > +#| A diagram depicting how a boxplot is created following the steps outlined +#| above. + +knitr::include_graphics("images/EDA-boxplot.png") +``` + +Let's take a look at the distribution of body mass by species using `geom_boxplot()`: + +```{r} +#| warning: false +#| fig-alt: > +#| Side-by-side box plots of distributions of body masses of Adelie, +#| Chinstrap, and Gentoo penguins. The distributions of Adelie and +#| Chinstrap penguins' body masses appear to be symmetric with +#| medians around 3750 grams. The median body mass of Gentoo penguins +#| is much higher, around 5000 grams, and the distribution of the +#| body masses of these penguins appears to be somewhat right skewed. + +ggplot(penguins, aes(x = species, y = body_mass_g)) + + geom_boxplot() +``` + +Alternatively, we can make frequency polygons with `geom_freqpoly()`. +`geom_freqpoly()` performs the same calculation as `geom_histogram()`, but instead of displaying the counts with bars, it uses lines. +Overlapping lines are much easier to read than the overlapping bars of `geom_histogram()`. +There are a few challenges with this type of plot, which we will come back to in @sec-cat-num on exploring the covariation between a categorical and a numerical variable. + +```{r} +#| warning: false +#| fig-alt: > +#| A frequency polygon of body masses of penguins by species of +#| penguins. Each species (Adelie, Chinstrap, and Gentoo) is +#| represented with different colored outlines for the polygons. + +ggplot(penguins, aes(x = body_mass_g, color = species)) + + geom_freqpoly(binwidth = 200, linewidth = 0.75) +``` + +We've also customized the thickness of the lines using the `linewidth` argument in order to make them stand out a bit more against the background. 
+ +We can also use overlaid density plots, with `species` mapped to both `color` and `fill` aesthetics and using the `alpha` aesthetic to add transparency to the filled density curves. +This aesthetic takes values between 0 (completely transparent) and 1 (completely opaque). +In the following plot it's *set* to 0.5. + +```{r} +#| warning: false +#| fig-alt: > +#| A density plot of body masses of penguins by species of penguins. Each +#| species of penguins (Adelie, Chinstrap, and Gentoo) is represented in +#| different colored outlines for the density curves. The density curves +#| are also filled with the same colors, with some transparency +#| added. + +ggplot(penguins, aes(x = body_mass_g, color = species, fill = species)) + + geom_density(alpha = 0.5) +``` + +Note the terminology we have used here: + +- We *map* variables to aesthetics if we want the visual attribute represented by that aesthetic to vary based on the values of that variable. +- Otherwise, we *set* the value of an aesthetic. + +### Two categorical variables + +We can use segmented bar plots to visualize the relationship between two categorical variables. +In creating this bar chart, we map the variable that defines the bars to the `x` aesthetic and the variable that further divides each bar into segments to the `fill` aesthetic. + +Below are two segmented bar plots, both displaying the relationship between `island` and `species`, or specifically, visualizing the distribution of `species` within each island. +The plot on the left shows the frequencies of each species of penguins on each island and the plot on the right shows the relative frequencies (proportions) of each species within each island (despite the incorrectly labeled y-axis that says "count"). 
+The relative frequency plot, created by setting `position = "fill"` in the geom, is more useful for comparing species distributions across islands since it's not affected by the unequal numbers of penguins across the islands. +Based on the plot on the right, we can see that Gentoo penguins all live on Biscoe island and make up roughly 75% of the penguins on that island, Chinstrap all live on Dream island and make up roughly 50% of the penguins on that island, and Adelie live on all three islands and make up all of the penguins on Torgersen. + +```{r} +#| fig-alt: > +#| Bar plots of penguin species by island (Biscoe, Dream, and Torgersen). +#| On the left, frequencies of species are shown. On the right, relative +#| frequencies of species are shown. + +ggplot(penguins, aes(x = island, fill = species)) + + geom_bar() +ggplot(penguins, aes(x = island, fill = species)) + + geom_bar(position = "fill") +``` + +### Two numerical variables + +So far you've learned about scatterplots (created with `geom_point()`) and smooth curves (created with `geom_smooth()`) for visualizing the relationship between two numerical variables. +A scatterplot is probably the most commonly used plot for visualizing the relationship between two numerical variables. + +```{r} +#| warning: false +#| fig-alt: > +#| A scatterplot of body mass vs. flipper length of penguins. The plot +#| displays a positive, linear, relatively strong relationship between +#| these two variables. + +ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) + + geom_point() +``` + +### Three or more variables + +One way to add additional variables to a plot is by mapping them to an aesthetic. +For example, in the following scatterplot the colors of points represent species and the shapes of points represent islands. + +```{r} +#| warning: false +#| fig-alt: > +#| A scatterplot of body mass vs. flipper length of penguins. The plot +#| displays a positive, linear, relatively strong relationship between +#| these two variables. 
The points are colored based on the species of the +#| penguins and the shapes of the points represent islands (round points are +#| Biscoe island, triangles are Dream island, and squares are Torgersen +#| island). The plot is very busy and it's difficult to distinguish the shapes +#| of the points. + +ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) + + geom_point(aes(color = species, shape = island)) +``` + +However, adding too many aesthetic mappings to a plot makes it cluttered and difficult to make sense of. +Another way, which is particularly useful for categorical variables, is to split your plot into **facets**, subplots that each display one subset of the data. + +To facet your plot by a single variable, use `facet_wrap()`. +The first argument of `facet_wrap()` is a formula[^data-visualize-2], which you create with `~` followed by a variable name. +The variable that you pass to `facet_wrap()` should be categorical. + +[^data-visualize-2]: Here "formula" is the name of the type of thing created by `~`, not a synonym for "equation". + +```{r} +#| warning: false +#| fig-width: 8 +#| fig-asp: 0.33 +#| fig-alt: > +#| A scatterplot of body mass vs. flipper length of penguins. The shapes and +#| colors of points represent species. Penguins from each island are on a +#| separate facet. Within each facet, the relationship between body mass and +#| flipper length is positive, linear, and relatively strong. + +ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) + + geom_point(aes(color = species, shape = species)) + + facet_wrap(~island) +``` + +You will learn about many other geoms for visualizing distributions of variables and relationships between them in @sec-layers. + +### Exercises + +1. Which variables in `mpg` are categorical? Which variables are continuous? (Hint: type `?mpg` to read the documentation for the dataset). How can you see this information when you run `mpg`? -3. Map a continuous variable to `color`, `size`, and `shape`. 
- How do these aesthetics behave differently for categorical vs. continuous variables? +2. Make a scatterplot of `hwy` vs. `displ` using the `mpg` data frame. + Then, map a third, numerical variable to `color`, `size`, and `shape`. + How do these aesthetics behave differently for categorical vs. numerical variables? + +3. In the scatterplot of `hwy` vs. `displ`, what happens if you map a third variable to `linewidth`? 4. What happens if you map the same variable to multiple aesthetics? -5. What does the `stroke` aesthetic do? - What shapes does it work with? - (Hint: use `?geom_point`) +5. Make a scatterplot of `bill_depth_mm` vs. `bill_length_mm` and color the points by `species`. + What does adding coloring by species reveal about the relationship between these two variables? -6. What happens if you map an aesthetic to something other than a variable name, like `aes(color = displ < 5)`? - Note, you'll also need to specify x and y. +6. Why does the following code yield two separate legends? + How would you fix it to combine the two legends? + + ```{r} + #| warning: false + #| fig-alt: > + #| Scatterplot of bill depth vs. bill length where different color and + #| shape pairings represent each species. The plot has two legends, + #| one labelled "species" which shows the shape scale and the other + #| that shows the color scale. + + ggplot(data = penguins, + mapping = aes(x = bill_length_mm, y = bill_depth_mm, + color = species, shape = species)) + + geom_point() + + labs(color = "Species") + ``` + +## Saving your plots {#sec-ggsave} + +Once you've made a plot, you might want to get it out of R by saving it as an image that you can use elsewhere. 
+That's the job of `ggsave()`, which will save the most recent plot to disk: + +```{r} +#| fig-show: hide + +ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) + + geom_point() +ggsave(filename = "my-plot.png") +``` + +```{r} +#| include: false + +file.remove("my-plot.png") +``` + +This will save your plot to your working directory, a concept you'll learn more about in @sec-workflow-scripts. + +If you don't specify the `width` and `height` they will be taken from the dimensions of the current plotting device. +For reproducible code, you'll want to specify them. +You can learn more about `ggsave()` in the documentation. + +Generally, however, we recommend that you assemble your final reports using Quarto, a reproducible authoring system that allows you to interleave your code and your prose and automatically include your plots in your write-ups. +You will learn more about Quarto in @sec-quarto. + +### Exercises + +1. Run the following lines of code. + Which of the two plots is saved as `mpg-plot.png`? + Why? + + ```{r} + #| eval: false + + ggplot(mpg, aes(x = class)) + + geom_bar() + ggplot(mpg, aes(x = cty, y = hwy)) + + geom_point() + ggsave("mpg-plot.png") + ``` + +2. What do you need to change in the code above to save the plot as a PDF instead of a PNG? ## Common problems @@ -376,7 +825,9 @@ In this case, it's usually easy to start from scratch again by pressing ESCAPE t One common problem when creating ggplot2 graphics is to put the `+` in the wrong place: it has to come at the end of the line, not the start. In other words, make sure you haven't accidentally written code like this: -``` r +```{r} +#| eval: false + ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) ``` @@ -390,933 +841,14 @@ Sometimes the answer will be buried there! But when you're new to R, even if the answer is in the error message, you might not yet know how to understand it. 
Another great tool is Google: try googling the error message, as it's likely someone else has had the same problem, and has gotten help online. -## Facets - -One way to add additional variables to a plot is by mapping them to an aesthetic. -Another way, which is particularly useful for categorical variables, is to split your plot into **facets**, subplots that each display one subset of the data. - -To facet your plot by a single variable, use `facet_wrap()`. -The first argument of `facet_wrap()` is a formula[^data-visualize-1], which you create with `~` followed by a variable name. -The variable that you pass to `facet_wrap()` should be discrete. - -[^data-visualize-1]: Here "formula" is the name of the type of thing created by `~`, not a synonym for "equation". - -```{r} -#| fig-alt: > -#| Scatterplot of highway fuel efficiency versus engine size of cars, -#| faceted by class, with facets spanning two rows. - -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy)) + - facet_wrap(~cyl) -``` - -To facet your plot with the combination of two variables, switch from `facet_wrap()` to `facet_grid()`. -The first argument of `facet_grid()` is also a formula, but now it's a double sided formula: `rows ~ cols`. - -```{r} -#| fig-alt: > -#| Scatterplot of highway fuel efficiency versus engine size of cars, faceted -#| by number of cylinders across rows and by type of drive train across -#| columns. This results in a 4x3 grid of 12 facets. Some of these facets have -#| no observations: 5 cylinders and 4 wheel drive, 4 or 5 cylinders and front -#| wheel drive. - -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy)) + - facet_grid(drv ~ cyl) -``` - -### Exercises - -1. What happens if you facet on a continuous variable? - -2. What do the empty cells in plot with `facet_grid(drv ~ cyl)` mean? - How do they relate to this plot? - - ```{r} - #| fig-alt: > - #| Scatterplot of number of cycles versus type of drive train of cars. 
- #| The plot shows that there are no cars with 5 cylinders that are 4 - #| wheel drive or with 4 or 5 cylinders that are front wheel drive. - - ggplot(data = mpg) + - geom_point(mapping = aes(x = drv, y = cyl)) - ``` - -3. What plots does the following code make? - What does `.` do? - - ```{r} - #| eval: false - - ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy)) + - facet_grid(drv ~ .) - - ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy)) + - facet_grid(. ~ cyl) - ``` - -4. Take the first faceted plot in this section: - - ```{r} - #| eval: false - - ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy)) + - facet_wrap(~ class, nrow = 2) - ``` - - What are the advantages to using faceting instead of the color aesthetic? - What are the disadvantages? - How might the balance change if you had a larger dataset? - -5. Read `?facet_wrap`. - What does `nrow` do? - What does `ncol` do? - What other options control the layout of the individual panels? - Why doesn't `facet_grid()` have `nrow` and `ncol` arguments? - -6. Which of the following two plots makes it easier to compare engine size (`displ`) across cars with different drive trains? - What does this say about when to place a faceting variable across rows or columns? - - ```{r} - #| fig-alt: > - #| Two faceted plots, both visualizing highway fuel efficiency versus - #| engine size of cars, faceted by drive train. In the top plot, facet - #| are organized across rows and in the second, across columns. - - ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy)) + - facet_grid(drv ~ .) - - ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy)) + - facet_grid(. ~ drv) - ``` - -7. Recreate this plot using `facet_wrap()` instead of `facet_grid()`. - How do the positions of the facet labels change? - - ```{r} - #| fig-alt: > - #| Scatterplot of highway fuel efficiency versus engine size of cars, - #| faceted by type of drive train across rows. 
- - ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy)) + - facet_grid(drv ~ .) - ``` - -## Geometric objects - -How are these two plots similar? - -```{r} -#| echo: false -#| message: false -#| layout-ncol: 2 -#| fig-width: 4 -#| fig-height: 2 -#| fig-alt: > -#| There are two plots. The plot on the left is a scatterplot of highway fuel -#| efficiency versus engine size of cars and the plot on the right shows a -#| smooth curve that follows the trajectory of the relationship between these -#| variables. A confidence interval around the smooth curve is also displayed. - -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy)) - -ggplot(data = mpg) + - geom_smooth(mapping = aes(x = displ, y = hwy)) -``` - -Both plots contain the same x variable, the same y variable, and both describe the same data. -But the plots are not identical. -Each plot uses a different visual object to represent the data. -In ggplot2 syntax, we say that they use different **geoms**. - -A **geom** is the geometrical object that a plot uses to represent data. -People often describe plots by the type of geom that the plot uses. -For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. -Scatterplots break the trend; they use the point geom. -As we see above, you can use different geoms to plot the same data. -The plot on the left uses the point geom, and the plot on the right uses the smooth geom, a smooth line fitted to the data. - -To change the geom in your plot, change the geom function that you add to `ggplot()`. -For instance, to make the plots above, you can use this code: - -```{r} -#| eval: false - -# Left -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy)) - -# Right -ggplot(data = mpg) + - geom_smooth(mapping = aes(x = displ, y = hwy)) -``` - -Every geom function in ggplot2 takes a `mapping` argument. -However, not every aesthetic works with every geom. 
-You could set the shape of a point, but you couldn't set the "shape" of a line. -On the other hand, you *could* set the linetype of a line. -`geom_smooth()` will draw a different line, with a different linetype, for each unique value of the variable that you map to linetype. - -```{r} -#| message: false -#| fig-alt: > -#| A plot of highway fuel efficiency versus engine size of cars. The data are -#| represented with smooth curves, which use a different line type (solid, -#| dashed, or long dashed) for each type of drive train. Confidence intervals -#| around the smooth curves are also displayed. - -ggplot(data = mpg) + - geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv)) -``` - -Here, `geom_smooth()` separates the cars into three lines based on their `drv` value, which describes a car's drive train. -One line describes all of the points that have a `4` value, one line describes all of the points that have an `f` value, and one line describes all of the points that have an `r` value. -Here, `4` stands for four-wheel drive, `f` for front-wheel drive, and `r` for rear-wheel drive. - -If this sounds strange, we can make it more clear by overlaying the lines on top of the raw data and then coloring everything according to `drv`. - -```{r} -#| echo: false -#| message: false -#| fig-alt: > -#| A plot of highway fuel efficiency versus engine size of cars. The data -#| are represented with points (colored by drive train) as well as smooth -#| curves (where line type is determined based on drive train as well). -#| Confidence intervals around the smooth curves are also displayed. - -ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + - geom_point() + - geom_smooth(mapping = aes(linetype = drv)) -``` - -Notice that this plot contains two geoms in the same graph! -If this makes you excited, buckle up. -You will learn how to place multiple geoms in the same plot very soon. 
- -ggplot2 provides more than 40 geoms, and extension packages provide even more (see for a sampling). -The best way to get a comprehensive overview is the ggplot2 cheatsheet, which you can find at . -To learn more about any single geom, use the help (e.g. `?geom_smooth`). - -Many geoms, like `geom_smooth()`, use a single geometric object to display multiple rows of data. -For these geoms, you can set the `group` aesthetic to a categorical variable to draw multiple objects. -ggplot2 will draw a separate object for each unique value of the grouping variable. -In practice, ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the `linetype` example). -It is convenient to rely on this feature because the `group` aesthetic by itself does not add a legend or distinguishing features to the geoms. - -```{r} -#| layout-ncol: 3 -#| fig-width: 3 -#| fig-height: 3 -#| message: false -#| fig-alt: > -#| Three plots, each with highway fuel efficiency on the y-axis and engine -#| size of cars, where data are represented by a smooth curve. The first plot -#| only has these two variables, the center plot has three separate smooth -#| curves for each level of drive train, and the right plot not only has the -#| same three separate smooth curves for each level of drive train but these -#| curves are plotted in different colors, without a legend explaining which -#| color maps to which level. Confidence intervals around the smooth curves -#| are also displayed. 
- -ggplot(data = mpg) + - geom_smooth(mapping = aes(x = displ, y = hwy)) - -ggplot(data = mpg) + - geom_smooth(mapping = aes(x = displ, y = hwy, group = drv)) - -ggplot(data = mpg) + - geom_smooth( - mapping = aes(x = displ, y = hwy, color = drv), - show.legend = FALSE - ) -``` - -To display multiple geoms in the same plot, add multiple geom functions to `ggplot()`: - -```{r} -#| message: false -#| fig-alt: > -#| Scatterplot of highway fuel efficiency versus engine size of cars with a -#| smooth curve overlaid. A confidence interval around the smooth curves is -#| also displayed. - -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy)) + - geom_smooth(mapping = aes(x = displ, y = hwy)) -``` - -This, however, introduces some duplication in our code. -Imagine if you wanted to change the y-axis to display `cty` instead of `hwy`. -You'd need to change the variable in two places, and you might forget to update one. -You can avoid this type of repetition by passing a set of mappings to `ggplot()`. -ggplot2 will treat these mappings as global mappings that apply to each geom in the graph. -In other words, this code will produce the same plot as the previous code: - -```{r} -#| eval: false - -ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + - geom_point() + - geom_smooth() -``` - -If you place mappings in a geom function, ggplot2 will treat them as local mappings for the layer. -It will use these mappings to extend or overwrite the global mappings *for that layer only*. -This makes it possible to display different aesthetics in different layers. - -```{r} -#| message: false -#| fig-alt: > -#| Scatterplot of highway fuel efficiency versus engine size of cars, where -#| points are colored according to the car class. A smooth curve following -#| the trajectory of the relationship between highway fuel efficiency versus -#| engine size of cars is overlaid along with a confidence interval around it. 
- -ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + - geom_point(mapping = aes(color = class)) + - geom_smooth() -``` - -You can use the same idea to specify different `data` for each layer. -Here, our smooth line displays just a subset of the `mpg` dataset, the subcompact cars. -The local data argument in `geom_smooth()` overrides the global data argument in `ggplot()` for that layer only. - -```{r} -#| message: false -#| fig-alt: > -#| Scatterplot of highway fuel efficiency versus engine size of cars, where -#| points are colored according to the car class. A smooth curve following -#| the trajectory of the relationship between highway fuel efficiency versus -#| engine size of subcompact cars is overlaid along with a confidence interval -#| around it. - -ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + - geom_point(mapping = aes(color = class)) + - geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE) -``` - -(You'll learn how `filter()` works in the chapter on data transformations: for now, just know that this command selects only the subcompact cars.) - -### Exercises - -1. What geom would you use to draw a line chart? - A boxplot? - A histogram? - An area chart? - -2. Run this code in your head and predict what the output will look like. - Then, run the code in R and check your predictions. - - ```{r} - #| eval: false - - ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + - geom_point() + - geom_smooth(se = FALSE) - ``` - -3. Earlier in this chapter we used `show.legend` without explaining it: - - ```{r} - #| eval: false - ggplot(data = mpg) + - geom_smooth( - mapping = aes(x = displ, y = hwy, color = drv), - show.legend = FALSE - ) - ``` - - What does `show.legend = FALSE` do here? - What happens if you remove it? - Why do you think we used it earlier? - -4. What does the `se` argument to `geom_smooth()` do? - -5. Will these two graphs look different? - Why/why not? 
- - ```{r} - #| eval: false - - ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + - geom_point() + - geom_smooth() - - ggplot() + - geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + - geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy)) - ``` - -6. Recreate the R code necessary to generate the following graphs. - Note that wherever a categorical variable is used in the plot, it's `drv`. - - ```{r} - #| echo: false - #| message: false - #| layout-ncol: 2 - #| fig-width: 4 - #| fig-height: 2 - #| fig-alt: > - #| There are six scatterplots in this figure, arranged in a 3x2 grid. - #| In all plots highway fuel efficiency of cars are on the y-axis and - #| engine size is on the x-axis. The first plot shows all points in black - #| with a smooth curve overlaid on them. In the second plot points are - #| also all black, with separate smooth curves overlaid for each level of - #| drive train. On the third plot, points and the smooth curves are - #| represented in different colors for each level of drive train. In the - #| fourth plot the points are represented in different colors for each - #| level of drive train but there is only a single smooth line fitted to - #| the whole data. In the fifth plot, points are represented in different - #| colors for each level of drive train, and a separate smooth curve with - #| different line types are fitted to each level of drive train. And - #| finally in the sixth plot points are represented in different colors - #| for each level of drive train and they have a thick white border. 
- - ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + - geom_point() + - geom_smooth(se = FALSE) - ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + - geom_smooth(aes(group = drv), se = FALSE) + - geom_point() - ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + - geom_point() + - geom_smooth(se = FALSE) - ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + - geom_point(aes(color = drv)) + - geom_smooth(se = FALSE) - ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + - geom_point(aes(color = drv)) + - geom_smooth(aes(linetype = drv), se = FALSE) - ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + - geom_point(size = 4, color = "white") + - geom_point(aes(color = drv)) - ``` - -## Statistical transformations - -Next, let's take a look at a bar chart. -Bar charts seem simple, but they are interesting because they reveal something subtle about plots. -Consider a basic bar chart, as drawn with `geom_bar()` or `geom_col()`. -The following chart displays the total number of diamonds in the `diamonds` dataset, grouped by `cut`. -The `diamonds` dataset is in the ggplot2 package and contains information on \~54,000 diamonds, including the `price`, `carat`, `color`, `clarity`, and `cut` of each diamond. -The chart shows that more diamonds are available with high quality cuts than with low quality cuts. - -```{r} -#| fig-alt: > -#| Bar chart of number of each cut of diamond. There are roughly 1500 -#| Fair, 5000 Good, 12000 Very Good, 14000 Premium, and 22000 Ideal cut -#| diamonds. - -ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut)) -``` - -On the x-axis, the chart displays `cut`, a variable from `diamonds`. -On the y-axis, it displays count, but count is not a variable in `diamonds`! -Where does count come from? -Many graphs, like scatterplots, plot the raw values of your dataset. 
-Other graphs, like bar charts, calculate new values to plot: - -- bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin. - -- smoothers fit a model to your data and then plot predictions from the model. - -- boxplots compute a robust summary of the distribution and then display that summary as a specially formatted box. - -The algorithm used to calculate new values for a graph is called a **stat**, short for statistical transformation. -@fig-vis-stat-bar shows how this process works with `geom_bar()`. - -```{r} -#| label: fig-vis-stat-bar -#| echo: false -#| out-width: "100%" -#| fig-cap: > -#| When create a bar chart we first start with the raw data, then -#| aggregate it to count the number of observations in each bar, -#| and finally map those computed variables to plot aesthetics. -#| fig-alt: > -#| A figure demonstrating three steps of creating a bar chart. -#| Step 1. geom_bar() begins with the diamonds data set. Step 2. geom_bar() -#| transforms the data with the count stat, which returns a data set of -#| cut values and counts. Step 3. geom_bar() uses the transformed data to -#| build the plot. cut is mapped to the x-axis, count is mapped to the y-axis. - -knitr::include_graphics("images/visualization-stat-bar.png") -``` - -You can learn which stat a geom uses by inspecting the default value for the `stat` argument. -For example, `?geom_bar` shows that the default value for `stat` is "count", which means that `geom_bar()` uses `stat_count()`. -`stat_count()` is documented on the same page as `geom_bar()`. -If you scroll down, the section called "Computed variables" explains that it computes two new variables: `count` and `prop`. - -You can generally use geoms and stats interchangeably. -For example, you can recreate the previous plot using `stat_count()` instead of `geom_bar()`: - -```{r} -#| fig-alt: > -#| Bar chart of number of each cut of diamond. 
There are roughly 1500 -#| Fair, 5000 Good, 12000 Very Good, 14000 Premium, and 22000 Ideal cut -#| diamonds. - -ggplot(data = diamonds) + - stat_count(mapping = aes(x = cut)) -``` - -This works because every geom has a default stat; and every stat has a default geom. -This means that you can typically use geoms without worrying about the underlying statistical transformation. -However, there are three reasons why you might need to use a stat explicitly: - -1. You might want to override the default stat. - In the code below, we change the stat of `geom_bar()` from count (the default) to identity. - This lets me map the height of the bars to the raw values of a $y$ variable. - Unfortunately when people talk about bar charts casually, they might be referring to this type of bar chart, where the height of the bar is already present in the data, or the previous bar chart where the height of the bar is generated by counting rows. - - ```{r} - #| warning: false - #| fig-alt: > - #| Bar chart of number of each cut of diamond. There are roughly 1500 - #| Fair, 5000 Good, 12000 Very Good, 14000 Premium, and 22000 Ideal cut - #| diamonds. - - demo <- tribble( - ~cut, ~freq, - "Fair", 1610, - "Good", 4906, - "Very Good", 12082, - "Premium", 13791, - "Ideal", 21551 - ) - - ggplot(data = demo) + - geom_bar(mapping = aes(x = cut, y = freq), stat = "identity") - ``` - - (Don't worry that you haven't seen `<-` or `tribble()` before. - You might be able to guess their meaning from the context, and you'll learn exactly what they do soon!) - -2. You might want to override the default mapping from transformed variables to aesthetics. - For example, you might want to display a bar chart of proportions, rather than counts: - - ```{r} - #| fig-alt: > - #| Bar chart of proportion of each cut of diamond. Roughly, Fair - #| diamonds make up 0.03, Good 0.09, Very Good 0.22, Premium 26, and - #| Ideal 0.40. 
- - ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1)) - ``` - - To find the variables computed by the stat, look for the section titled "computed variables" in the help for `geom_bar()`. - -3. You might want to draw greater attention to the statistical transformation in your code. - For example, you might use `stat_summary()`, which summarizes the y values for each unique x value, to draw attention to the summary that you're computing: - - ```{r} - #| fig-alt: > - #| A plot with depth on the y-axis and cut on the x-axis (with levels - #| fair, good, very good, premium, and ideal) of diamonds. For each level - #| of cut, vertical lines extend from minimum to maximum depth for diamonds - #| in that cut category, and the median depth is indicated on the line - #| with a point. - - ggplot(data = diamonds) + - stat_summary( - mapping = aes(x = cut, y = depth), - fun.min = min, - fun.max = max, - fun = median - ) - ``` - -ggplot2 provides more than 20 stats for you to use. -Each stat is a function, so you can get help in the usual way, e.g. `?stat_bin`. -To see a complete list of stats, try the [ggplot2 cheatsheet](https://rstudio.com/resources/cheatsheets). - -### Exercises - -1. What is the default geom associated with `stat_summary()`? - How could you rewrite the previous plot to use that geom function instead of the stat function? - -2. What does `geom_col()` do? - How is it different from `geom_bar()`? - -3. Most geoms and stats come in pairs that are almost always used in concert. - Read through the documentation and make a list of all the pairs. - What do they have in common? - -4. What variables does `stat_smooth()` compute? - What parameters control its behaviour? - -5. In our proportion bar chart, we need to set `group = 1`. - Why? - In other words, what is the problem with these two graphs? 
- - ```{r} - #| eval: false - - ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut, y = after_stat(prop))) - ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut, fill = color, y = after_stat(prop))) - ``` - -## Position adjustments - -There's one more piece of magic associated with bar charts. -You can color a bar chart using either the `color` aesthetic, or, more usefully, `fill`: - -```{r} -#| layout-ncol: 2 -#| fig-width: 4 -#| fig-height: 2 -#| fig-alt: > -#| Two bar charts of cut of diamonds. In the first plot, the bars have colored -#| borders. In the second plot, they're filled with colors. Heights of the -#| bars correspond to the number of diamonds in each cut category. - -ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut, color = cut)) -ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut, fill = cut)) -``` - -Note what happens if you map the fill aesthetic to another variable, like `clarity`: the bars are automatically stacked. -Each colored rectangle represents a combination of `cut` and `clarity`. - -```{r} -#| fig-alt: > -#| Segmented bar chart of cut of diamonds, where each bar is filled with -#| colors for the levels of clarity. Heights of the bars correspond to the -#| number of diamonds in each cut category, and heights of the colored -#| segments are proportional to the number of diamonds with a given clarity -#| level within a given cut level. - -ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut, fill = clarity)) -``` - -The stacking is performed automatically using the **position adjustment** specified by the `position` argument. -If you don't want a stacked bar chart, you can use one of three other options: `"identity"`, `"dodge"` or `"fill"`. - -- `position = "identity"` will place each object exactly where it falls in the context of the graph. - This is not very useful for bars, because it overlaps them. 
- To see that overlapping we either need to make the bars slightly transparent by setting `alpha` to a small value, or completely transparent by setting `fill = NA`. - - ```{r} - #| layout-ncol: 2 - #| fig-width: 4 - #| fig-height: 2 - #| fig-alt: > - #| Two segmented bar charts of cut of diamonds, where each bar is filled - #| with colors for the levels of clarity. Heights of the bars correspond - #| to the number of diamonds in each cut category, and heights of the - #| colored segments are proportional to the number of diamonds with a - #| given clarity level within a given cut level. However the segments - #| overlap. In the first plot the segments are filled with transparent - #| colors, in the second plot the segments are only outlined with colors. - - ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) + - geom_bar(alpha = 1/5, position = "identity") - ggplot(data = diamonds, mapping = aes(x = cut, color = clarity)) + - geom_bar(fill = NA, position = "identity") - ``` - - The identity position adjustment is more useful for 2d geoms, like points, where it is the default. - -- `position = "fill"` works like stacking, but makes each set of stacked bars the same height. - This makes it easier to compare proportions across groups. - - ```{r} - #| fig-alt: > - #| Segmented bar chart of cut of diamonds, where each bar is filled with - #| colors for the levels of clarity. Height of each bar is 1 and heights - #| of the colored segments are proportional to the proportion of diamonds - #| with a given clarity level within a given cut level. - - ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill") - ``` - -- `position = "dodge"` places overlapping objects directly *beside* one another. - This makes it easier to compare individual values. - - ```{r} - #| fig-alt: > - #| Dodged bar chart of cut of diamonds. Dodged bars are grouped by levels - #| of cut (fair, good, very good, premium, and ideal). 
In each group there - #| are eight bars, one for each level of clarity, and filled with a - #| different color for each level. Heights of these bars represent the - #| number of diamonds with a given level of cut and clarity. - - ggplot(data = diamonds) + - geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge") - ``` - -There's one other type of adjustment that's not useful for bar charts, but can be very useful for scatterplots. -Recall our first scatterplot. -Did you notice that the plot displays only 126 points, even though there are 234 observations in the dataset? - -```{r} -#| echo: false -#| fig-alt: > -#| Scatterplot of highway fuel efficiency versus engine size of cars that -#| shows a negative association. - -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy)) -``` - -The underlying values of `hwy` and `displ` are rounded so the points appear on a grid and many points overlap each other. -This problem is known as **overplotting**. -This arrangement makes it difficult to see the distribution of the data. -Are the data points spread equally throughout the graph, or is there one special combination of `hwy` and `displ` that contains 109 values? - -You can avoid this gridding by setting the position adjustment to "jitter". -`position = "jitter"` adds a small amount of random noise to each point. -This spreads the points out because no two points are likely to receive the same amount of random noise. - -```{r} -#| fig-alt: > -#| Jittered scatterplot of highway fuel efficiency versus engine size of cars. -#| The plot shows a negative association. - -ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy), position = "jitter") -``` - -Adding randomness seems like a strange way to improve your plot, but while it makes your graph less accurate at small scales, it makes your graph *more* revealing at large scales. 
-Because this is such a useful operation, ggplot2 comes with a shorthand for `geom_point(position = "jitter")`: `geom_jitter()`. - -To learn more about a position adjustment, look up the help page associated with each adjustment: `?position_dodge`, `?position_fill`, `?position_identity`, `?position_jitter`, and `?position_stack`. - -### Exercises - -1. What is the problem with this plot? - How could you improve it? - - ```{r} - #| fig-alt: > - #| Scatterplot of highway fuel efficiency versus city fuel efficiency - #| of cars that shows a positive association. The number of points - #| visible in this plot is less than the number of points in the dataset. - - ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + - geom_point() - ``` - -2. What parameters to `geom_jitter()` control the amount of jittering? - -3. Compare and contrast `geom_jitter()` with `geom_count()`. - -4. What's the default position adjustment for `geom_boxplot()`? - Create a visualization of the `mpg` dataset that demonstrates it. - -## Coordinate systems - -Coordinate systems are probably the most complicated part of ggplot2. -The default coordinate system is the Cartesian coordinate system where the x and y positions act independently to determine the location of each point. -There are three other coordinate systems that are occasionally helpful. - -- `coord_flip()` switches the x and y axes. - This is useful (for example), if you want horizontal boxplots. - It's also useful for long labels: it's hard to get them to fit without overlapping on the x-axis. - - ```{r} - #| fig-width: 4 - #| fig-height: 2 - #| layout-ncol: 2 - #| fig-alt: > - #| Two side-by-side box plots of highway fuel efficiency of cars. A - #| separate box plot is created for cars in each level of class (2seater, - #| compact, midsize, minivan, pickup, subcompact, and suv). In the first - #| plot class is on the x-axis, in the second plot class is on the y-axis. 
- #| The second plot makes it easier to read the names of the levels of class - #| since they are listed down the y-axis, avoiding overlap. - - ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + - geom_boxplot() - ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + - geom_boxplot() + - coord_flip() - ``` - - However, note that you can achieve the same result by flipping the aesthetic mappings of the two variables. - - ```{r} - #| fig-alt: > - #| Side-by-side box plots of highway fuel efficiency of cars. A separate - #| box plot is drawn along the y-axis for cars in each level of class - #| (2seater, compact, midsize, minivan, pickup, subcompact, and suv). - - ggplot(data = mpg, mapping = aes(y = class, x = hwy)) + - geom_boxplot() - ``` - -- `coord_quickmap()` sets the aspect ratio correctly for maps. - This is very important if you're plotting spatial data with ggplot2. - We don't have the space to discuss maps in this book, but you can learn more in the [Maps chapter](https://ggplot2-book.org/maps.html) of *ggplot2: Elegant graphics for data analysis*. - - ```{r} - #| layout-ncol: 2 - #| fig-width: 4 - #| fig-height: 2 - #| message: false - #| fig-alt: > - #| Two maps of the boundaries of New Zealand. In the first plot the aspect - #| ratio is incorrect, in the second plot it is correct. - - nz <- map_data("nz") - - ggplot(nz, aes(long, lat, group = group)) + - geom_polygon(fill = "white", color = "black") - - ggplot(nz, aes(long, lat, group = group)) + - geom_polygon(fill = "white", color = "black") + - coord_quickmap() - ``` - -- `coord_polar()` uses polar coordinates. - Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart. - - ```{r} - #| layout-ncol: 2 - #| fig-width: 4 - #| fig-asp: 1 - #| fig-alt: > - #| There are two plots. On the left is a bar chart of cut of diamonds, - #| on the right is a Coxcomb chart of the same data. 
- - bar <- ggplot(data = diamonds) + - geom_bar( - mapping = aes(x = cut, fill = cut), - show.legend = FALSE, - width = 1 - ) + - theme(aspect.ratio = 1) + - labs(x = NULL, y = NULL) - - bar + coord_flip() - bar + coord_polar() - ``` - -### Exercises - -1. Turn a stacked bar chart into a pie chart using `coord_polar()`. - -2. What does `labs()` do? - Read the documentation. - -3. What's the difference between `coord_quickmap()` and `coord_map()`? - -4. What does the plot below tell you about the relationship between city and highway mpg? - Why is `coord_fixed()` important? - What does `geom_abline()` do? - - ```{r} - #| fig-alt: > - #| Scatterplot of highway fuel efficiency versus engine size of cars that - #| shows a negative association. The plot also has a straight line that - #| follows the trend of the relationship between the variables but does not - #| go through the cloud of points, it is beneath it. - - ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + - geom_point() + - geom_abline() + - coord_fixed() - ``` - -## The layered grammar of graphics - -In the previous sections, you learned much more than just how to make scatterplots, bar charts, and boxplots. -You learned a foundation that you can use to make *any* type of plot with ggplot2. -To see this, let's add position adjustments, stats, coordinate systems, and faceting to our code template: - - ggplot(data = ) + - ( - mapping = aes(), - stat = , - position = - ) + - + - - -Our new template takes seven parameters, the bracketed words that appear in the template. -In practice, you rarely need to supply all seven parameters to make a graph because ggplot2 will provide useful defaults for everything except the data, the mappings, and the geom function. - -The seven parameters in the template compose the grammar of graphics, a formal system for building plots. 
-The grammar of graphics is based on the insight that you can uniquely describe *any* plot as a combination of a dataset, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, and a faceting scheme. - -To see how this works, consider how you could build a basic plot from scratch: you could start with a dataset and then transform it into the information that you want to display (with a stat). - -```{r} -#| echo: false -#| fig-alt: > -#| A figure demonstrating the steps for going from raw data to table of counts -#| where each row represents one level of cut and a count column shows how many -#| diamonds are in that cut level. Steps 1 and 2 are annotated. Step 1. Begin -#| with the diamonds dataset. Step 2. Compute counts for each cut value -#| with stat_count(). - -knitr::include_graphics("images/visualization-grammar-1.png") -``` - -Next, you could choose a geometric object to represent each observation in the transformed data. -You could then use the aesthetic properties of the geoms to represent variables in the data. -You would map the values of each variable to the levels of an aesthetic. - -```{r} -#| echo: false -#| fig-alt: > -#| A figure demonstrating the steps for going from raw data to table of counts -#| where each row represents one level of cut and a count column shows how -#| many diamonds are in that cut level. Each level is also mapped to a color. -#| Steps 3 and 4 are annotated. Step 3. Represent each observation with a bar. -#| Step 4. Map the fill of each bar to the ..count.. variable. - -knitr::include_graphics("images/visualization-grammar-2.png") -``` - -You'd then select a coordinate system to place the geoms into, using the location of the objects (which is itself an aesthetic property) to display the values of the x and y variables. 
-At that point, you would have a complete graph, but you could further adjust the positions of the geoms within the coordinate system (a position adjustment) or split the graph into subplots (faceting). -You could also extend the plot by adding one or more additional layers, where each additional layer uses a dataset, a geom, a set of mappings, a stat, and a position adjustment. - -```{r} -#| echo: false -#| fig-alt: > -#| A figure demonstrating the steps for going from raw data to bar chart where -#| each bar represents one level of cut and filled in with a different color. -#| Steps 5 and 6 are annotated. Step 5. Place geoms in a Cartesian coordinate -#| system. Step 6. Map the y values to ..count.. and the x values to cut. - -knitr::include_graphics("images/visualization-grammar-3.png") -``` - -You could use this method to build *any* plot that you imagine. -In other words, you can use the code template that you've learned in this chapter to build hundreds of thousands of unique plots. - -If you'd like to learn more about this theoretical underpinnings of ggplot2, you might enjoy reading "[The Layered Grammar of Graphics](https://vita.had.co.nz/papers/layered-grammar.pdf)", the scientific paper that describes the theory of ggplot2 in detail. - ## Summary -In this chapter, you've learn the basics of data visualization with ggplot2. +In this chapter, you've learned the basics of data visualization with ggplot2. We started with the basic idea that underpins ggplot2: a visualization is a mapping from variables in your data to aesthetic properties like position, colour, size and shape. -You then learned about facets, which allow you to create small multiples, where each panel contains a subgroup from your data. -We then gave you a whirlwind tour of the geoms and stats which control the "type" of graph you get, whether it's a scatterplot, line plot, histogram, or something else. 
-Position adjustment control the fine details of position when geoms might otherwise overlap, and coordinate systems allow you fundamentally change what `x` and `y` mean. +You then learned about increasing the complexity and improving the presentation of your plots layer-by-layer. +You also learned about commonly used plots for visualizing the distribution of a single variable as well as for visualizing relationships between two or more variables, by leveraging additional aesthetic mappings and/or splitting your plot into small multiples using faceting. -We'll use visualizations again and again through out this book, introducing new techniques as we need them. -If you want to get a comprehensive understand of ggplot2, we recommend reading the book, [*ggplot2: Elegant Graphics for Data Analysis*](https://ggplot2-book.org). -Other useful resources are the [*R Graphics Cookbook*](https://r-graphics.org) by Winston Chang and [*Fundamentals of Data Visualization*](https://clauswilke.com/dataviz/) by Claus Wilke. +We'll use visualizations again and again throughout this book, introducing new techniques as we need them, and we'll take a deeper dive into creating visualizations with ggplot2 in @sec-layers through @sec-eda. With the basics of visualization under your belt, in the next chapter we're going to switch gears a little and give you some practical workflow advice. We intersperse workflow advice with data science tools throughout this part of the book because it'll help you stay organize as you write increasing amounts of R code. 
diff --git a/diagrams/data-science.graffle b/diagrams/data-science.graffle index a91b1fb..aff6229 100644 Binary files a/diagrams/data-science.graffle and b/diagrams/data-science.graffle differ diff --git a/diagrams/data-science/communicate.png b/diagrams/data-science/communicate.png index 7ba42b9..ea91ed2 100644 Binary files a/diagrams/data-science/communicate.png and b/diagrams/data-science/communicate.png differ diff --git a/diagrams/data-science/visualize.png b/diagrams/data-science/visualize.png new file mode 100644 index 0000000..b090dc8 Binary files /dev/null and b/diagrams/data-science/visualize.png differ diff --git a/images/visualization-grammar-1.png b/images/visualization-grammar-1.png deleted file mode 100644 index 997fa57..0000000 Binary files a/images/visualization-grammar-1.png and /dev/null differ diff --git a/images/visualization-grammar-2.png b/images/visualization-grammar-2.png deleted file mode 100644 index 4c7ee99..0000000 Binary files a/images/visualization-grammar-2.png and /dev/null differ diff --git a/images/visualization-grammar-3.png b/images/visualization-grammar-3.png deleted file mode 100644 index dc7b517..0000000 Binary files a/images/visualization-grammar-3.png and /dev/null differ diff --git a/images/visualization-grammar.png b/images/visualization-grammar.png new file mode 100644 index 0000000..f4e11c6 Binary files /dev/null and b/images/visualization-grammar.png differ diff --git a/layers.qmd b/layers.qmd new file mode 100644 index 0000000..9a5dc88 --- /dev/null +++ b/layers.qmd @@ -0,0 +1,1057 @@ +# Layers {#sec-layers} + +```{r} +#| results: "asis" +#| echo: false +source("_common.R") +status("complete") +``` + +## Introduction + +In @sec-data-visualisation, you learned much more than just how to make scatterplots, bar charts, and boxplots. +You learned a foundation that you can use to make *any* type of plot with ggplot2. 
+ +In this chapter, you'll expand on that foundation as you learn about the layered grammar of graphics. +We'll start with a deeper dive into aesthetic mappings, geometric objects, and facets. +Then, you will learn about statistical transformations ggplot2 makes under the hood when creating a plot. +These transformations are used to calculate new values to plot, such as the heights of bars in a bar plot or medians in a box plot. +You will also learn about position adjustments, which modify how geoms are displayed in your plots. +Finally, we'll briefly introduce coordinate systems. + +We will not cover every single function and option for each of these layers, but we will walk you through the most important and commonly used functionality provided by ggplot2 as well as introduce you to packages that extend ggplot2. + +### Prerequisites + +This chapter focuses on ggplot2, one of the core packages in the tidyverse. +To access the datasets, help pages, and functions used in this chapter, load the tidyverse by running this code: + +```{r} +#| label: setup + +library(tidyverse) +``` + +## Aesthetic mappings + +> "The greatest value of a picture is when it forces us to notice what we never expected to see." --- John Tukey + +The `mpg` data frame that is bundled with the ggplot2 package contains `r nrow(mpg)` observations collected by the US Environmental Protection Agency on `r mpg |> distinct(model) |> nrow()` car models. + +```{r} +mpg +``` + +Among the variables in `mpg` are: + +1. `displ`: A car's engine size, in liters. + A numerical variable. + +2. `hwy`: A car's fuel efficiency on the highway, in miles per gallon (mpg). + A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance. + A numerical variable. + +3. `class`: Type of car. + A categorical variable. + +You can learn about `mpg` on its help page by running `?mpg`. 
+ +Let's start by visualizing the relationship between `displ` and `hwy` for various `class`es of cars. +We can do this with a scatterplot where the numerical variables are mapped to the `x` and `y` aesthetics and the categorical variable is mapped to an aesthetic like `color` or `shape`. + +```{r} +#| layout-ncol: 2 +#| fig-width: 4 +#| fig-height: 2 +#| fig-alt: > +#| Two scatterplots next to each other, both visualizing highway fuel +#| efficiency versus engine size of cars and showing a negative +#| association. In the plot on the left class is mapped to the color +#| aesthetic, resulting in different colors for each class. +#| In the plot on the right class is mapped to the shape aesthetic, +#| resulting in different plotting character shapes for each class, +#| except for suv. Each plot comes with a legend that shows the +#| mapping between color or shape and levels of the class variable. + +# Left +ggplot(mpg, aes(x = displ, y = hwy, color = class)) + + geom_point() + +# Right +ggplot(mpg, aes(x = displ, y = hwy, shape = class)) + + geom_point() +``` + +When `class` is mapped to `shape`, we get two warnings: + +> 1: The shape palette can deal with a maximum of 6 discrete values because more than 6 becomes difficult to discriminate; you have 7. +> Consider specifying shapes manually if you must have them. +> +> 2: Removed 62 rows containing missing values (`geom_point()`). + +By default, ggplot2 will only use six shapes at a time, so additional groups go unplotted when you use the shape aesthetic. +The second warning is related -- there are 62 SUVs in the dataset and they're not plotted. + +Similarly, we can map `class` to `size` or `alpha` (transparency) aesthetics as well. + +```{r} +#| layout-ncol: 2 +#| fig-width: 4 +#| fig-height: 2 +#| fig-alt: > +#| Two scatterplots next to each other, both visualizing highway fuel +#| efficiency versus engine size of cars and showing a negative +#| association. 
In the plot on the left class is mapped to the size +#| aesthetic, resulting in different sizes for each class. +#| In the plot on the right class is mapped to the alpha aesthetic, +#| resulting in different alpha (transparency) levels for each class. +#| Each plot comes with a legend that shows the mapping between size +#| or alpha level and levels of the class variable. + +# Left +ggplot(mpg, aes(x = displ, y = hwy, size = class)) + + geom_point() + +# Right +ggplot(mpg, aes(x = displ, y = hwy, alpha = class)) + + geom_point() +``` + +Both of these produce warnings as well: + +> Using alpha for a discrete variable is not advised. + +Mapping a non-ordinal discrete (categorical) variable (`class`) to an ordered aesthetic (`size` or `alpha`) is generally not a good idea because it implies a ranking that does not in fact exist. + +Once you map an aesthetic, ggplot2 takes care of the rest. +It selects a reasonable scale to use with the aesthetic, and it constructs a legend that explains the mapping between levels and values. +For x and y aesthetics, ggplot2 does not create a legend, but it creates an axis line with tick marks and a label. +The axis line acts as a legend; it explains the mapping between locations and values. + +You can also set the aesthetic properties of your geom manually. +For example, we can make all of the points in our plot blue: + +```{r} +#| fig-alt: > +#| Scatterplot of highway fuel efficiency versus engine size of cars +#| that shows a negative association. All points are blue. + +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point(color = "blue") +``` + +Here, the color doesn't convey information about a variable, but only changes the appearance of the plot. +You can set an aesthetic manually by name as an argument of your geom function. 
+In other words, it goes *outside* of `aes()`. +You'll need to pick a value that makes sense for that aesthetic: + +- The name of a color as a character string. +- The size of a point in mm. +- The shape of a point as a number, as shown in @fig-shapes. + +```{r} +#| label: fig-shapes +#| echo: false +#| warning: false +#| fig.asp: 0.364 +#| fig-align: "center" +#| fig-cap: > +#| R has 25 built in shapes that are identified by numbers. There are some +#| seeming duplicates: for example, 0, 15, and 22 are all squares. The +#| difference comes from the interaction of the `color` and `fill` +#| aesthetics. The hollow shapes (0--14) have a border determined by `color`; +#| the solid shapes (15--20) are filled with `color`; the filled shapes +#| (21--24) have a border of `color` and are filled with `fill`. +#| fig-alt: > +#| Mapping between shapes and the numbers that represent them: 0 - square, +#| 1 - circle, 2 - triangle point up, 3 - plus, 4 - cross, 5 - diamond, +#| 6 - triangle point down, 7 - square cross, 8 - star, 9 - diamond plus, +#| 10 - circle plus, 11 - triangles up and down, 12 - square plus, +#| 13 - circle cross, 14 - square and triangle down, 15 - filled square, +#| 16 - filled circle, 17 - filled triangle point-up, 18 - filled diamond, +#| 19 - solid circle, 20 - bullet (smaller circle), 21 - filled circle blue, +#| 22 - filled square blue, 23 - filled diamond blue, 24 - filled triangle +#| point-up blue, 25 - filled triangle point down blue. 
+ +shapes <- tibble( + shape = c(0, 1, 2, 5, 3, 4, 6:19, 22, 21, 24, 23, 20), + x = (0:24 %/% 5) / 2, + y = (-(0:24 %% 5)) / 4 +) +ggplot(shapes, aes(x, y)) + + geom_point(aes(shape = shape), size = 5, fill = "red") + + geom_text(aes(label = shape), hjust = 0, nudge_x = 0.15) + + scale_shape_identity() + + expand_limits(x = 4.1) + + scale_x_continuous(NULL, breaks = NULL) + + scale_y_continuous(NULL, breaks = NULL, limits = c(-1.2, 0.2)) + + theme_minimal() + + theme(aspect.ratio = 1/2.75) +``` + +So far we have discussed aesthetics that we can map or set in a scatterplot, when using a point geom. +You can learn more about all possible aesthetic mappings in the aesthetic specifications vignette at . + +The specific aesthetics you can use for a plot depend on the geom you use to represent the data. +In the next section we dive deeper into geoms. + +### Exercises + +1. Create a scatterplot of `hwy` vs. `displ` where the points are pink filled in triangles. + +2. Why did the following code not result in a plot with blue points? + + ```{r} + #| fig-alt: > + #| Scatterplot of highway fuel efficiency versus engine size of cars + #| that shows a negative association. All points are red and + #| the legend shows a red point that is mapped to the word blue. + + ggplot(mpg) + + geom_point(aes(x = displ, y = hwy, color = "blue")) + ``` + +3. What does the `stroke` aesthetic do? + What shapes does it work with? + (Hint: use `?geom_point`) + +4. What happens if you map an aesthetic to something other than a variable name, like `aes(color = displ < 5)`? + Note, you'll also need to specify x and y. + +## Geometric objects {#sec-geometric-objects} + +How are these two plots similar? + +```{r} +#| echo: false +#| message: false +#| layout-ncol: 2 +#| fig-width: 4 +#| fig-height: 2 +#| fig-alt: > +#| There are two plots. 
The plot on the left is a scatterplot of highway +#| fuel efficiency versus engine size of cars and the plot on the right +#| shows a smooth curve that follows the trajectory of the relationship +#| between these variables. A confidence interval around the smooth +#| curve is also displayed. + +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point() + +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_smooth() +``` + +Both plots contain the same x variable, the same y variable, and both describe the same data. +But the plots are not identical. +Each plot uses a different geometric object, geom, to represent the data. +The plot on the left uses the point geom, and the plot on the right uses the smooth geom, a smooth line fitted to the data. + +To change the geom in your plot, change the geom function that you add to `ggplot()`. +For instance, to make the plots above, you can use this code: + +```{r} +#| eval: false + +# Left +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point() + +# Right +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_smooth() +``` + +Every geom function in ggplot2 takes a `mapping` argument. +However, not every aesthetic works with every geom. +You could set the shape of a point, but you couldn't set the "shape" of a line. +If you try, ggplot2 will silently ignore that aesthetic mapping. +On the other hand, you *could* set the linetype of a line. +`geom_smooth()` will draw a different line, with a different linetype, for each unique value of the variable that you map to linetype. + +```{r} +#| message: false +#| fig-alt: > +#| Two plots of highway fuel efficiency versus engine size of cars. +#| The data are represented with smooth curves. On the left, three +#| smooth curves, all with the same linetype. On the right, three +#| smooth curves with different line types (solid, dashed, or long +#| dashed) for each type of drive train. In both plots, confidence +#| intervals around the smooth curves are also displayed. 
+ +ggplot(mpg, aes(x = displ, y = hwy, shape = drv)) + + geom_smooth() +ggplot(mpg, aes(x = displ, y = hwy, linetype = drv)) + + geom_smooth() +``` + +Here, `geom_smooth()` separates the cars into three lines based on their `drv` value, which describes a car's drive train. +One line describes all of the points that have a `4` value, one line describes all of the points that have an `f` value, and one line describes all of the points that have an `r` value. +Here, `4` stands for four-wheel drive, `f` for front-wheel drive, and `r` for rear-wheel drive. + +If this sounds strange, we can make it more clear by overlaying the lines on top of the raw data and then coloring everything according to `drv`. + +```{r} +#| echo: false +#| message: false +#| fig-alt: > +#| A plot of highway fuel efficiency versus engine size of cars. The data +#| are represented with points (colored by drive train) as well as smooth +#| curves (where line type is determined based on drive train as well). +#| Confidence intervals around the smooth curves are also displayed. + +ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + + geom_point() + + geom_smooth(aes(linetype = drv)) +``` + +Notice that this plot contains two geoms in the same graph. + +Many geoms, like `geom_smooth()`, use a single geometric object to display multiple rows of data. +For these geoms, you can set the `group` aesthetic to a categorical variable to draw multiple objects. +ggplot2 will draw a separate object for each unique value of the grouping variable. +In practice, ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the `linetype` example). +It is convenient to rely on this feature because the `group` aesthetic by itself does not add a legend or distinguishing features to the geoms. 
+ +```{r} +#| layout-ncol: 3 +#| fig-width: 3 +#| fig-height: 3 +#| message: false +#| fig-alt: > +#| Three plots, each with highway fuel efficiency on the y-axis and engine +#| size of cars on the x-axis, where data are represented by a smooth curve. +#| The first plot only has these two variables, the center plot has three +#| separate smooth curves for each level of drive train, and the right plot +#| not only has the same three separate smooth curves for each level of drive +#| train but these curves are plotted in different colors, without a legend +#| explaining which color maps to which level. Confidence intervals around +#| the smooth curves are also displayed. + +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_smooth() + +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_smooth(aes(group = drv)) + +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_smooth(aes(color = drv), show.legend = FALSE) +``` + +If you place mappings in a geom function, ggplot2 will treat them as local mappings for the layer. +It will use these mappings to extend or overwrite the global mappings *for that layer only*. +This makes it possible to display different aesthetics in different layers. + +```{r} +#| message: false +#| fig-alt: > +#| Scatterplot of highway fuel efficiency versus engine size of cars, where +#| points are colored according to the car class. A smooth curve following +#| the trajectory of the relationship between highway fuel efficiency versus +#| engine size of cars is overlaid along with a confidence interval around it. + +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point(aes(color = class)) + + geom_smooth() +``` + +You can use the same idea to specify different `data` for each layer. +Here, we use red points as well as open circles to highlight two-seater cars. +The local data argument in `geom_point()` overrides the global data argument in `ggplot()` for that layer only.
+ +```{r} +#| message: false +#| fig-alt: > +#| Scatterplot of highway fuel efficiency versus engine size of cars, where +#| two-seater cars are highlighted with red points and red open circles +#| drawn around them. + +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point() + + geom_point( + data = mpg |> filter(class == "2seater"), + color = "red" + ) + + geom_point( + data = mpg |> filter(class == "2seater"), + shape = "circle open", size = 3, color = "red" + ) +``` + +(You'll learn how `filter()` works in the chapter on data transformations: for now, just know that this command selects only the two-seater cars.) + +Geoms are the fundamental building blocks of ggplot2. +You can completely transform the look of your plot by changing its geom, and different geoms can reveal different features of your data. +For example, the histogram and density plot below reveal that the distribution of highway mileage is bimodal and right skewed while the boxplot reveals two potential outliers. + +```{r} +#| fig-asp: 0.33 +#| fig-alt: > +#| Three plots: histogram, density plot, and box plot of highway +#| mileage. + +# Left +ggplot(mpg, aes(x = hwy)) + + geom_histogram(binwidth = 2) + +# Middle +ggplot(mpg, aes(x = hwy)) + + geom_density() + +# Right +ggplot(mpg, aes(x = hwy)) + + geom_boxplot() +``` + +ggplot2 provides more than 40 geoms but these don't cover all possible plots one could make. +If you need a different geom, we recommend looking into extension packages first to see if someone else has already implemented it (see for a sampling). +For example, the **ggridges** package ([https://wilkelab.org/ggridges](https://wilkelab.org/ggridges/){.uri}) is useful for making ridgeline plots, which can be useful for visualizing the density of a numerical variable for different levels of a categorical variable.
+In the following plot not only did we use a new geom (`geom_density_ridges()`), but we have also mapped the same variable to multiple aesthetics (`drv` to `y`, `fill`, and `color`) as well as set an aesthetic (`alpha = 0.5`) to make the density curves transparent. + +```{r} +#| fig-asp: 0.33 +#| fig-alt: > +#| Density curves for highway mileage for cars with rear wheel, +#| front wheel, and 4-wheel drives plotted separately. The +#| distribution is bimodal and roughly symmetric for rear and +#| 4 wheel drive cars and unimodal and right skewed for front +#| wheel drive cars. + +library(ggridges) + +ggplot(mpg, aes(x = hwy, y = drv, fill = drv, color = drv)) + + geom_density_ridges(alpha = 0.5, show.legend = FALSE) +``` + +The best place to get a comprehensive overview of all of the geoms ggplot2 offers, as well as all functions in the package, is the reference page: . +To learn more about any single geom, use the help (e.g. `?geom_smooth`). + +### Exercises + +1. What geom would you use to draw a line chart? + A boxplot? + A histogram? + An area chart? + +2. Earlier in this chapter we used `show.legend` without explaining it: + + ```{r} + #| eval: false + ggplot(mpg, aes(x = displ, y = hwy)) + + geom_smooth(aes(color = drv), show.legend = FALSE) + ``` + + What does `show.legend = FALSE` do here? + What happens if you remove it? + Why do you think we used it earlier? + +3. What does the `se` argument to `geom_smooth()` do? + +4. Recreate the R code necessary to generate the following graphs. + Note that wherever a categorical variable is used in the plot, it's `drv`. + + ```{r} + #| echo: false + #| message: false + #| layout-ncol: 2 + #| fig-width: 4 + #| fig-height: 2 + #| fig-alt: > + #| There are six scatterplots in this figure, arranged in a 3x2 grid. + #| In all plots highway fuel efficiency of cars is on the y-axis and + #| engine size is on the x-axis. The first plot shows all points in black + #| with a smooth curve overlaid on them.
In the second plot points are + #| also all black, with separate smooth curves overlaid for each level of + #| drive train. On the third plot, points and the smooth curves are + #| represented in different colors for each level of drive train. In the + #| fourth plot the points are represented in different colors for each + #| level of drive train but there is only a single smooth line fitted to + #| the whole data. In the fifth plot, points are represented in different + #| colors for each level of drive train, and separate smooth curves with + #| different line types are fitted to each level of drive train. And + #| finally in the sixth plot points are represented in different colors + #| for each level of drive train and they have a thick white border. + + ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point() + + geom_smooth(se = FALSE) + ggplot(mpg, aes(x = displ, y = hwy)) + + geom_smooth(aes(group = drv), se = FALSE) + + geom_point() + ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + + geom_point() + + geom_smooth(se = FALSE) + ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point(aes(color = drv)) + + geom_smooth(se = FALSE) + ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point(aes(color = drv)) + + geom_smooth(aes(linetype = drv), se = FALSE) + ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point(size = 4, color = "white") + + geom_point(aes(color = drv)) + ``` + +## Facets + +In @sec-data-visualisation you learned about faceting with `facet_wrap()`, which splits a plot into subplots that each display one subset of the data based on a categorical variable. + +```{r} +#| fig-alt: > +#| Scatterplot of highway fuel efficiency versus engine size of cars, +#| faceted by number of cylinders, with facets spanning two rows. + +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point() + + facet_wrap(~cyl) +``` + +To facet your plot with the combination of two variables, switch from `facet_wrap()` to `facet_grid()`.
+The first argument of `facet_grid()` is also a formula, but now it's a double-sided formula: `rows ~ cols`. + +```{r} +#| fig-alt: > +#| Scatterplot of highway fuel efficiency versus engine size of cars, faceted +#| by type of drive train across rows and by number of cylinders across +#| columns. This results in a 3x4 grid of 12 facets. Some of these facets have +#| no observations: 5 cylinders and 4 wheel drive, 4 or 5 cylinders and rear +#| wheel drive. + +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point() + + facet_grid(drv ~ cyl) +``` + +By default each of the facets shares the same scale for x and y axes. +This is useful when you want to compare data across facets but it can be limiting when you want to visualize the relationship within each facet better. +Setting the `scales` argument in a faceting function to `"free"` will allow for different axis scales across both rows and columns. +Other options for this argument are `"free_x"` (different x-axis scales across columns) and `"free_y"` (different y-axis scales across rows). + +```{r} +#| fig-alt: > +#| Scatterplot of highway fuel efficiency versus engine size of cars, +#| faceted by type of drive train across rows and by number of cylinders +#| across columns. This results in a 3x4 grid of 12 facets. Some of these +#| facets have no observations: 5 cylinders and 4 wheel drive, 4 or 5 +#| cylinders and rear wheel drive. Facets within a row share the same +#| y-scale and facets within a column share the same x-scale. + +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point() + + facet_grid(drv ~ cyl, scales = "free") +``` + +### Exercises + +1. What happens if you facet on a continuous variable? + +2. What do the empty cells in the plot with `facet_grid(drv ~ cyl)` mean? + How do they relate to this plot? + + ```{r} + #| fig-alt: > + #| Scatterplot of number of cylinders versus type of drive train of cars.
+ #| The plot shows that there are no cars with 5 cylinders that are 4 + #| wheel drive or with 4 or 5 cylinders that are rear wheel drive. + + ggplot(mpg) + + geom_point(aes(x = drv, y = cyl)) + ``` + +3. What plots does the following code make? + What does `.` do? + + ```{r} + #| eval: false + + ggplot(mpg) + + geom_point(aes(x = displ, y = hwy)) + + facet_grid(drv ~ .) + + ggplot(mpg) + + geom_point(aes(x = displ, y = hwy)) + + facet_grid(. ~ cyl) + ``` + +4. Take the first faceted plot in this section: + + ```{r} + #| eval: false + + ggplot(mpg) + + geom_point(aes(x = displ, y = hwy)) + + facet_wrap(~ class, nrow = 2) + ``` + + What are the advantages to using faceting instead of the color aesthetic? + What are the disadvantages? + How might the balance change if you had a larger dataset? + +5. Read `?facet_wrap`. + What does `nrow` do? + What does `ncol` do? + What other options control the layout of the individual panels? + Why doesn't `facet_grid()` have `nrow` and `ncol` arguments? + +6. Which of the following two plots makes it easier to compare engine size (`displ`) across cars with different drive trains? + What does this say about when to place a faceting variable across rows or columns? + + ```{r} + #| fig-alt: > + #| Two faceted plots, both visualizing highway fuel efficiency versus + #| engine size of cars, faceted by drive train. In the top plot, facets + #| are organized across rows and in the second, across columns. + + ggplot(mpg) + + geom_point(aes(x = displ, y = hwy)) + + facet_grid(drv ~ .) + + ggplot(mpg) + + geom_point(aes(x = displ, y = hwy)) + + facet_grid(. ~ drv) + ``` + +7. Recreate this plot using `facet_wrap()` instead of `facet_grid()`. + How do the positions of the facet labels change? + + ```{r} + #| fig-alt: > + #| Scatterplot of highway fuel efficiency versus engine size of cars, + #| faceted by type of drive train across rows. + + ggplot(mpg) + + geom_point(aes(x = displ, y = hwy)) + + facet_grid(drv ~ .)
+ ``` + +## Statistical transformations + +Consider a basic bar chart, drawn with `geom_bar()` or `geom_col()`. +The following chart displays the total number of diamonds in the `diamonds` dataset, grouped by `cut`. +The `diamonds` dataset is in the ggplot2 package and contains information on \~54,000 diamonds, including the `price`, `carat`, `color`, `clarity`, and `cut` of each diamond. +The chart shows that more diamonds are available with high quality cuts than with low quality cuts. + +```{r} +#| fig-alt: > +#| Bar chart of number of each cut of diamond. There are roughly 1500 +#| Fair, 5000 Good, 12000 Very Good, 14000 Premium, and 22000 Ideal cut +#| diamonds. + +ggplot(diamonds, aes(x = cut)) + + geom_bar() +``` + +On the x-axis, the chart displays `cut`, a variable from `diamonds`. +On the y-axis, it displays count, but count is not a variable in `diamonds`! +Where does count come from? +Many graphs, like scatterplots, plot the raw values of your dataset. +Other graphs, like bar charts, calculate new values to plot: + +- Bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin. + +- Smoothers fit a model to your data and then plot predictions from the model. + +- Boxplots compute a robust summary of the distribution and then display that summary as a specially formatted box. + +The algorithm used to calculate new values for a graph is called a **stat**, short for statistical transformation. +@fig-vis-stat-bar shows how this process works with `geom_bar()`. + +```{r} +#| label: fig-vis-stat-bar +#| echo: false +#| out-width: "100%" +#| fig-cap: > +#| When creating a bar chart, we first start with the raw data, then +#| aggregate it to count the number of observations in each bar, +#| and finally map those computed variables to plot aesthetics. +#| fig-alt: > +#| A figure demonstrating three steps of creating a bar chart. +#| Step 1. geom_bar() begins with the diamonds data set. Step 2.
geom_bar() +#| transforms the data with the count stat, which returns a data set of +#| cut values and counts. Step 3. geom_bar() uses the transformed data to +#| build the plot. cut is mapped to the x-axis, count is mapped to the y-axis. + +knitr::include_graphics("images/visualization-stat-bar.png") +``` + +You can learn which stat a geom uses by inspecting the default value for the `stat` argument. +For example, `?geom_bar` shows that the default value for `stat` is "count", which means that `geom_bar()` uses `stat_count()`. +`stat_count()` is documented on the same page as `geom_bar()`. +If you scroll down, the section called "Computed variables" explains that it computes two new variables: `count` and `prop`. + +Every geom has a default stat; and every stat has a default geom. +This means that you can typically use geoms without worrying about the underlying statistical transformation. +However, there are three reasons why you might need to use a stat explicitly: + +1. You might want to override the default stat. + In the code below, we change the stat of `geom_bar()` from count (the default) to identity. + This lets us map the height of the bars to the raw values of a $y$ variable. + + ```{r} + #| warning: false + #| fig-alt: > + #| Bar chart of number of each cut of diamond. There are roughly 1500 + #| Fair, 5000 Good, 12000 Very Good, 14000 Premium, and 22000 Ideal cut + #| diamonds. + + cut_frequencies <- tribble( + ~cut, ~freq, + "Fair", 1610, + "Good", 4906, + "Very Good", 12082, + "Premium", 13791, + "Ideal", 21551 + ) + + ggplot(cut_frequencies, aes(x = cut, y = freq)) + + geom_bar(stat = "identity") + ``` + +2. You might want to override the default mapping from transformed variables to aesthetics. + For example, you might want to display a bar chart of proportions, rather than counts: + + ```{r} + #| fig-alt: > + #| Bar chart of proportion of each cut of diamond. 
Roughly, Fair + #| diamonds make up 0.03, Good 0.09, Very Good 0.22, Premium 0.26, and + #| Ideal 0.40. + + ggplot(diamonds, aes(x = cut, y = after_stat(prop), group = 1)) + + geom_bar() + ``` + + To find the variables computed by the stat, look for the section titled "computed variables" in the help for `geom_bar()`. + +3. You might want to draw greater attention to the statistical transformation in your code. + For example, you might use `stat_summary()`, which summarizes the y values for each unique x value, to draw attention to the summary that you're computing: + + ```{r} + #| fig-alt: > + #| A plot with depth on the y-axis and cut on the x-axis (with levels + #| fair, good, very good, premium, and ideal) of diamonds. For each level + #| of cut, vertical lines extend from minimum to maximum depth for diamonds + #| in that cut category, and the median depth is indicated on the line + #| with a point. + + ggplot(diamonds) + + stat_summary( + aes(x = cut, y = depth), + fun.min = min, + fun.max = max, + fun = median + ) + ``` + +ggplot2 provides more than 20 stats for you to use. +Each stat is a function, so you can get help in the usual way, e.g. `?stat_bin`. + +### Exercises + +1. What is the default geom associated with `stat_summary()`? + How could you rewrite the previous plot to use that geom function instead of the stat function? + +2. What does `geom_col()` do? + How is it different from `geom_bar()`? + +3. Most geoms and stats come in pairs that are almost always used in concert. + Read through the documentation and make a list of all the pairs. + What do they have in common? + +4. What variables does `stat_smooth()` compute? + What parameters control its behavior? + +5. In our proportion bar chart, we need to set `group = 1`. + Why? + In other words, what is the problem with these two graphs?
+ + ```{r} + #| eval: false + + ggplot(diamonds, aes(x = cut, y = after_stat(prop))) + + geom_bar() + ggplot(diamonds, aes(x = cut, fill = color, y = after_stat(prop))) + + geom_bar() + ``` + +## Position adjustments + +There's one more piece of magic associated with bar charts. +You can color a bar chart using either the `color` aesthetic, or, more usefully, `fill`: + +```{r} +#| layout-ncol: 2 +#| fig-width: 4 +#| fig-height: 2 +#| fig-alt: > +#| Two bar charts of cut of diamonds. In the first plot, the bars have colored +#| borders. In the second plot, they're filled with colors. Heights of the +#| bars correspond to the number of diamonds in each cut category. + +ggplot(diamonds, aes(x = cut, color = cut)) + + geom_bar() +ggplot(diamonds, aes(x = cut, fill = cut)) + + geom_bar() +``` + +Note what happens if you map the fill aesthetic to another variable, like `clarity`: the bars are automatically stacked. +Each colored rectangle represents a combination of `cut` and `clarity`. + +```{r} +#| fig-alt: > +#| Segmented bar chart of cut of diamonds, where each bar is filled with +#| colors for the levels of clarity. Heights of the bars correspond to the +#| number of diamonds in each cut category, and heights of the colored +#| segments are proportional to the number of diamonds with a given clarity +#| level within a given cut level. + +ggplot(diamonds, aes(x = cut, fill = clarity)) + + geom_bar() +``` + +The stacking is performed automatically using the **position adjustment** specified by the `position` argument. +If you don't want a stacked bar chart, you can use one of three other options: `"identity"`, `"dodge"` or `"fill"`. + +- `position = "identity"` will place each object exactly where it falls in the context of the graph. + This is not very useful for bars, because it overlaps them. + To see that overlapping we either need to make the bars slightly transparent by setting `alpha` to a small value, or completely transparent by setting `fill = NA`. 
+ + ```{r} + #| layout-ncol: 2 + #| fig-width: 4 + #| fig-height: 2 + #| fig-alt: > + #| Two segmented bar charts of cut of diamonds, where each bar is filled + #| with colors for the levels of clarity. Heights of the bars correspond + #| to the number of diamonds in each cut category, and heights of the + #| colored segments are proportional to the number of diamonds with a + #| given clarity level within a given cut level. However the segments + #| overlap. In the first plot the segments are filled with transparent + #| colors, in the second plot the segments are only outlined with colors. + + ggplot(diamonds, aes(x = cut, fill = clarity)) + + geom_bar(alpha = 1/5, position = "identity") + ggplot(diamonds, aes(x = cut, color = clarity)) + + geom_bar(fill = NA, position = "identity") + ``` + + The identity position adjustment is more useful for 2d geoms, like points, where it is the default. + +- `position = "fill"` works like stacking, but makes each set of stacked bars the same height. + This makes it easier to compare proportions across groups. + + ```{r} + #| fig-alt: > + #| Segmented bar chart of cut of diamonds, where each bar is filled with + #| colors for the levels of clarity. Height of each bar is 1 and heights + #| of the colored segments are proportional to the proportion of diamonds + #| with a given clarity level within a given cut level. + + ggplot(diamonds, aes(x = cut, fill = clarity)) + + geom_bar(position = "fill") + ``` + +- `position = "dodge"` places overlapping objects directly *beside* one another. + This makes it easier to compare individual values. + + ```{r} + #| fig-alt: > + #| Dodged bar chart of cut of diamonds. Dodged bars are grouped by levels + #| of cut (fair, good, very good, premium, and ideal). In each group there + #| are eight bars, one for each level of clarity, and filled with a + #| different color for each level. Heights of these bars represent the + #| number of diamonds with a given level of cut and clarity. 
+ + ggplot(diamonds, aes(x = cut, fill = clarity)) + + geom_bar(position = "dodge") + ``` + +There's one other type of adjustment that's not useful for bar charts, but can be very useful for scatterplots. +Recall our first scatterplot. +Did you notice that the plot displays only 126 points, even though there are 234 observations in the dataset? + +```{r} +#| echo: false +#| fig-alt: > +#| Scatterplot of highway fuel efficiency versus engine size of cars that +#| shows a negative association. + +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point() +``` + +The underlying values of `hwy` and `displ` are rounded so the points appear on a grid and many points overlap each other. +This problem is known as **overplotting**. +This arrangement makes it difficult to see the distribution of the data. +Are the data points spread equally throughout the graph, or is there one special combination of `hwy` and `displ` that contains 109 values? + +You can avoid this gridding by setting the position adjustment to "jitter". +`position = "jitter"` adds a small amount of random noise to each point. +This spreads the points out because no two points are likely to receive the same amount of random noise. + +```{r} +#| fig-alt: > +#| Jittered scatterplot of highway fuel efficiency versus engine size of cars. +#| The plot shows a negative association. + +ggplot(mpg, aes(x = displ, y = hwy)) + + geom_point(position = "jitter") +``` + +Adding randomness seems like a strange way to improve your plot, but while it makes your graph less accurate at small scales, it makes your graph *more* revealing at large scales. +Because this is such a useful operation, ggplot2 comes with a shorthand for `geom_point(position = "jitter")`: `geom_jitter()`. + +To learn more about a position adjustment, look up the help page associated with each adjustment: `?position_dodge`, `?position_fill`, `?position_identity`, `?position_jitter`, and `?position_stack`. + +### Exercises + +1. 
What is the problem with this plot? + How could you improve it? + + ```{r} + #| fig-alt: > + #| Scatterplot of highway fuel efficiency versus city fuel efficiency + #| of cars that shows a positive association. The number of points + #| visible in this plot is less than the number of points in the dataset. + + ggplot(mpg, aes(x = cty, y = hwy)) + + geom_point() + ``` + +2. What parameters to `geom_jitter()` control the amount of jittering? + +3. Compare and contrast `geom_jitter()` with `geom_count()`. + +4. What's the default position adjustment for `geom_boxplot()`? + Create a visualization of the `mpg` dataset that demonstrates it. + +## Coordinate systems + +Coordinate systems are probably the most complicated part of ggplot2. +The default coordinate system is the Cartesian coordinate system where the x and y positions act independently to determine the location of each point. +There are two other coordinate systems that are occasionally helpful. + +- `coord_quickmap()` sets the aspect ratio correctly for maps. + This is very important if you're plotting spatial data with ggplot2. + We don't have the space to discuss maps in this book, but you can learn more in the [Maps chapter](https://ggplot2-book.org/maps.html) of *ggplot2: Elegant graphics for data analysis*. + + ```{r} + #| layout-ncol: 2 + #| fig-width: 4 + #| fig-height: 2 + #| message: false + #| fig-alt: > + #| Two maps of the boundaries of New Zealand. In the first plot the aspect + #| ratio is incorrect, in the second plot it is correct. + + nz <- map_data("nz") + + ggplot(nz, aes(long, lat, group = group)) + + geom_polygon(fill = "white", color = "black") + + ggplot(nz, aes(long, lat, group = group)) + + geom_polygon(fill = "white", color = "black") + + coord_quickmap() + ``` + +- `coord_polar()` uses polar coordinates. + Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart. 
+ + ```{r} + #| layout-ncol: 2 + #| fig-width: 4 + #| fig-asp: 1 + #| fig-alt: > + #| There are two plots. On the left is a bar chart of cut of diamonds, + #| on the right is a Coxcomb chart of the same data. + + bar <- ggplot(data = diamonds) + + geom_bar( + mapping = aes(x = cut, fill = cut), + show.legend = FALSE, + width = 1 + ) + + theme(aspect.ratio = 1) + + labs(x = NULL, y = NULL) + + bar + coord_flip() + bar + coord_polar() + ``` + +### Exercises + +1. Turn a stacked bar chart into a pie chart using `coord_polar()`. + +2. What's the difference between `coord_quickmap()` and `coord_map()`? + +3. What does the plot below tell you about the relationship between city and highway mpg? + Why is `coord_fixed()` important? + What does `geom_abline()` do? + + ```{r} + #| fig-alt: > + #| Scatterplot of highway fuel efficiency versus city fuel efficiency of cars that + #| shows a positive association. The plot also has a straight line that + #| follows the trend of the relationship between the variables but does not + #| go through the cloud of points; it lies beneath it. + + ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + + geom_point() + + geom_abline() + + coord_fixed() + ``` + +## The layered grammar of graphics + +We can expand on the graphing template you learned in @sec-graphing-template by adding position adjustments, stats, coordinate systems, and faceting: + + ggplot(data = <DATA>) + + <GEOM_FUNCTION>( + mapping = aes(<MAPPINGS>), + stat = <STAT>, + position = <POSITION> + ) + + <COORDINATE_FUNCTION> + + <FACET_FUNCTION> + +Our new template takes seven parameters, the bracketed words that appear in the template. +In practice, you rarely need to supply all seven parameters to make a graph because ggplot2 will provide useful defaults for everything except the data, the mappings, and the geom function. + +The seven parameters in the template compose the grammar of graphics, a formal system for building plots. 
+The grammar of graphics is based on the insight that you can uniquely describe *any* plot as a combination of a dataset, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, and a faceting scheme. + +To see how this works, consider how you could build a basic plot from scratch: you could start with a dataset and then transform it into the information that you want to display (with a stat). +Next, you could choose a geometric object to represent each observation in the transformed data. +You could then use the aesthetic properties of the geoms to represent variables in the data. +You would map the values of each variable to the levels of an aesthetic. +You'd then select a coordinate system to place the geoms into, using the location of the objects (which is itself an aesthetic property) to display the values of the x and y variables. + +```{r} +#| echo: false +#| fig-alt: > +#| A figure demonstrating the steps for going from raw data to table of counts +#| where each row represents one level of cut and a count column shows how many +#| diamonds are in that cut level. + +knitr::include_graphics("images/visualization-grammar.png") +``` + +At this point, you would have a complete graph, but you could further adjust the positions of the geoms within the coordinate system (a position adjustment) or split the graph into subplots (faceting). +You could also extend the plot by adding one or more additional layers, where each additional layer uses a dataset, a geom, a set of mappings, a stat, and a position adjustment. + +You could use this method to build *any* plot that you imagine. +In other words, you can use the code template that you've learned in this chapter to build hundreds of thousands of unique plots. 
+ +If you'd like to learn more about the theoretical underpinnings of ggplot2, you might enjoy reading "[The Layered Grammar of Graphics](https://vita.had.co.nz/papers/layered-grammar.pdf)", the scientific paper that describes the theory of ggplot2 in detail. + +## Summary + +In this chapter you learned about the layered grammar of graphics, starting with aesthetics and geometries to build a simple plot, facets for splitting the plot into subsets, statistics for understanding how geoms are calculated, position adjustments for controlling the fine details of position when geoms might otherwise overlap, and coordinate systems that allow you to fundamentally change what `x` and `y` mean. +One layer we have not yet touched on is theme, which we will introduce in @sec-themes. + +Two very useful resources for getting an overview of the complete ggplot2 functionality are the ggplot2 cheatsheet (which you can find at ) and the ggplot2 package website ([https://ggplot2.tidyverse.org](https://ggplot2.tidyverse.org/)). + +An important lesson you should take from this chapter is that when you feel the need for a geom that is not provided by ggplot2, it's always a good idea to look into whether someone else has already solved your problem by creating a ggplot2 extension package that offers that geom. diff --git a/missing-values.qmd b/missing-values.qmd index 86b21b5..3772dc2 100644 --- a/missing-values.qmd +++ b/missing-values.qmd @@ -10,7 +10,7 @@ status("polishing") ## Introduction You've already learned the basics of missing values earlier in the book. -You first saw them in @sec-summarize where they interfered with computing summary statistics, and you learned about their infectious nature and how to check for their presence in @sec-na-comparison. 
+You first saw them in @sec-data-visualisation where they resulted in a warning when making a plot as well as in @sec-summarize where they interfered with computing summary statistics, and you learned about their infectious nature and how to check for their presence in @sec-na-comparison. Now we'll come back to them in more depth, so you can learn more of the details. We'll start by discussing some general tools for working with missing values recorded as `NA`s. diff --git a/quarto.qmd b/quarto.qmd index 38a082f..506d23e 100644 --- a/quarto.qmd +++ b/quarto.qmd @@ -336,35 +336,6 @@ The following table summarizes which types of output each option suppresses: | `message: false` | | | | | \- | | | `warning: false` | | | | | | \- | -### Global options - -As you work more with knitr, you will discover that some of the default chunk options don't fit your needs and you want to change them. - -You can do this by adding the preferred options in the document YAML, under `execute`. -For example, if you are preparing a report for an audience who does not need to see your code but only your results and narrative, you might set `echo: false` at the document level. -That will hide the code by default, so only showing the chunks you deliberately choose to show (with `echo: true`). -You might consider setting `message: false` and `warning: false`, but that would make it harder to debug problems because you wouldn't see any messages in the final document. - -``` yaml -title: "My report" -execute: - echo: false -``` - -Since Quarto is designed to be multi-lingual (works with R as well as other languages like Python, Julia, etc.), all of the knitr options are not available at the document execution level since some of them only work with knitr and not other engines Quarto uses for running code in other languages (e.g., Jupyter). -You can, however, still set these as global options for your document under the `knitr` field, under `opts_chunk`. 
-For example, when writing books and tutorials we set: - -``` yaml -title: "Tutorial" -knitr: - opts_chunk: - comment: "#>" - collapse: true -``` - -This uses our preferred comment formatting and ensures that the code and output are kept closely entwined. - ### Inline code There is one other way to embed R code into a Quarto document: directly into the text, with: `r inline()`. @@ -607,7 +578,7 @@ This makes it easier to understand the `dependson` specification. 1. Set up a network of chunks where `d` depends on `c` and `b`, and both `b` and `c` depend on `a`. Have each chunk print `lubridate::now()`, set `cache: true`, then verify your understanding of caching. -## Troubleshooting +> > > > > > > 7ff2b1502187f15a978d74f59a88534fa6f1012e \## Troubleshooting Troubleshooting Quarto documents can be challenging because you are no longer in an interactive R environment, and you will need to learn some new tricks. Additionally, the error could be due to issues with the Quarto document itself or due to the R code in the Quarto document. diff --git a/transform.qmd b/transform.qmd index e5b6a93..332e4a8 100644 --- a/transform.qmd +++ b/transform.qmd @@ -6,8 +6,7 @@ source("_common.R") ``` -After reading the first part of the book, you understand (at least superficially) the most important tools for doing data science. -Now it's time to start diving into the details. +The second part of the book was a deep dive into data visualization. In this part of the book, you'll learn about the most important types of variables that you'll encounter inside a data frame and learn the tools you can use to work with them. ```{r} @@ -15,9 +14,9 @@ In this part of the book, you'll learn about the most important types of variabl #| echo: false #| fig-cap: > #| The options for data transformation depends heavily on the type of -#| data involve, the subject of this part of the book. +#| data involved, the subject of this part of the book. 
#| fig-alt: > -#| Our data science model transform, highlighted in blue. +#| Our data science model, with transform highlighted in blue. #| out.width: NULL knitr::include_graphics("diagrams/data-science/transform.png", dpi = 270) diff --git a/visualize.qmd b/visualize.qmd new file mode 100644 index 0000000..d386866 --- /dev/null +++ b/visualize.qmd @@ -0,0 +1,41 @@ +# Visualize {#sec-visualize .unnumbered} + +```{r} +#| results: "asis" +#| echo: false +source("_common.R") +status("drafting") +``` + +After reading the first two parts of the book, you understand (at least superficially) the most important tools for doing data science. +Now it's time to start diving into the details. +In this part of the book, you'll learn about visualizing data in further depth. + +```{r} +#| label: fig-ds-visualize +#| echo: false +#| fig-cap: > +#| Data visualization is often the first step in data exploration. +#| fig-alt: > +#| Our data science model, with visualize highlighted in blue. +#| out.width: NULL + +knitr::include_graphics("diagrams/data-science/visualize.png", dpi = 270) +``` + +Each chapter addresses one to a few aspects of creating a data visualization. + +- In @sec-layers you will learn about the layered grammar of graphics. + +- In @sec-exploratory-data-analysis, you'll combine visualization with your curiosity and skepticism to ask and answer interesting questions about data. + +- Finally, in @sec-communication you will learn how to take your exploratory graphics and turn them into expository graphics, graphics that help the newcomer to your analysis understand what's going on as quickly and easily as possible. + +### Learning more + +The absolute best place to learn more is the ggplot2 book: [*ggplot2: Elegant graphics for data analysis*](https://ggplot2-book.org/). +It goes into much more depth about the underlying theory, and has many more examples of how to combine the individual pieces to solve practical problems. 
+ +Another great resource is the ggplot2 extensions gallery . +This site lists many of the packages that extend ggplot2 with new geoms and scales. +It's a great place to start if you're trying to do something that seems hard with ggplot2. diff --git a/whole-game.qmd b/whole-game.qmd index deb23ab..1c59a38 100644 --- a/whole-game.qmd +++ b/whole-game.qmd @@ -39,8 +39,6 @@ Five chapters focus on the tools of data science: - Before you can transform and visualize your data, you need to first get your data into R. In @sec-data-import you'll learn the basics of getting `.csv` files into R. -- Finally, in @sec-exploratory-data-analysis, you'll combine visualization and transformation with your curiosity and skepticism to ask and answer interesting questions about data. - Nestled among these chapters are five other chapters that focus on your R workflow. In @sec-workflow-basics, @sec-workflow-pipes, @sec-workflow-style, and @sec-workflow-scripts-projects, you'll learn good workflow practices for writing and organizing your R code. These will set you up for success in the long run, as they'll give you the tools to stay organised when you tackle real projects. diff --git a/workflow-basics.qmd b/workflow-basics.qmd index 64ca05f..a368b15 100644 --- a/workflow-basics.qmd +++ b/workflow-basics.qmd @@ -231,7 +231,23 @@ knitr::include_graphics("screenshots/rstudio-env.png") What happens? How can you get to the same place using the menus? +4. Let's revisit an exercise from @sec-ggsave. + Run the following lines of code. + Which of the two plots is saved as `mpg-plot.png`? + Why? + + ```{r} + #| eval: false + + my_bar_plot <- ggplot(mpg, aes(x = class)) + + geom_bar() + my_scatter_plot <- ggplot(mpg, aes(x = cty, y = hwy)) + + geom_point() + ggsave(filename = "mpg-plot.png", plot = my_bar_plot) + ``` + ## Summary Now you've learned a little more about how R code works, along with some tips to help you understand your code when you come back to it in the future. 
In the next chapter, we'll continue your data science journey by teaching you about dplyr, the tidyverse package that helps you transform data, whether it's selecting important variables, filtering down to rows of interest, or computing summary statistics. + diff --git a/workflow-pipes.qmd b/workflow-pipes.qmd index 5df8aa9..1041eef 100644 --- a/workflow-pipes.qmd +++ b/workflow-pipes.qmd @@ -129,9 +129,24 @@ But they're still good to know about even if you've never used `%>%` because you Luckily there's no need to commit entirely to one pipe or the other --- you can use the base pipe for the majority of cases where it's sufficient, and use the magrittr pipe when you really need its special features. +## `|>` vs `+` + +Sometimes we'll turn the end of a pipeline of data transformation into a plot. +Watch for the transition from `|>` to `+`. +We wish this transition wasn't necessary, but unfortunately ggplot2 was created before the pipe was discovered. + +```{r} +#| eval: false + +diamonds |> + count(cut, clarity) |> + ggplot(aes(clarity, cut, fill = n)) + + geom_tile() +``` + ## Summary -In this chapter, you've learn more about the pipe: why we recommend it and some of the history that lead to `|>`. +In this chapter, you've learned more about the pipe: why we recommend it and some of the history that led to `|>`. The pipe is important because you'll use it again and again throughout your analysis, but hopefully it will quickly become invisible and your fingers will type it (or use the keyboard shortcut) without your brain having to think too much about it. In the next chapter, we switch back to data science tools, learning about tidy data.