Let’s use our first graph to answer a question: Do penguins with longer flippers weigh more or less than penguins with shorter flippers? You probably already have an answer, but try to make your answer precise. What does the relationship between flipper length and body mass look like? Is it positive? Negative? Linear? Nonlinear? Does the relationship vary by the species of the penguin? And how about by the island where the penguin lives?
The penguins data frame
You can test your answer with the penguins data frame found in palmerpenguins (a.k.a. palmerpenguins::penguins). A data frame is a rectangular collection of variables (in the columns) and observations (in the rows). penguins contains 344 observations collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER. Citation: Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.
penguins
#> # A tibble: 344 × 8
#>   species island    bill_length_mm bill_depth_mm flipper_length_mm
#>   <fct>   <fct>              <dbl>         <dbl>             <int>
#> 1 Adelie  Torgersen           39.1          18.7               181
#> 2 Adelie  Torgersen           39.5          17.4               186
#> 3 Adelie  Torgersen           40.3          18                 195
#> 4 Adelie  Torgersen           NA            NA                  NA
#> 5 Adelie  Torgersen           36.7          19.3               193
#> 6 Adelie  Torgersen           39.3          20.6               190
#> # … with 338 more rows, and 3 more variables: body_mass_g <int>, sex <fct>,
#> #   year <int>
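The printout above assumes the palmerpenguins package is already attached. As a minimal sketch (assuming the package is installed, for example via install.packages("palmerpenguins")), you can load it and confirm the dimensions reported in the tibble header. The names n_obs and n_vars are illustrative only.

```r
# Assumes palmerpenguins is installed.
library(palmerpenguins)

# A data frame is rectangular: rows are observations, columns are variables.
stopifnot(is.data.frame(penguins))

n_obs <- nrow(penguins)   # number of observations (rows)
n_vars <- ncol(penguins)  # number of variables (columns)

c(observations = n_obs, variables = n_vars)
```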
This data frame contains 8 columns. For an alternative view, where you can see all variables and the first few observations of each variable, use glimpse(). Or, if you’re in RStudio, run View(penguins) to open an interactive data viewer.
glimpse(penguins)
#> Rows: 344
#> Columns: 8
#> $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, A…
#> $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torge…
#> $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.…
#> $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.…
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, …
#> $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 347…
#> $ sex               <fct> male, female, female, NA, female, male, female, m…
#> $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2…
Among the variables in penguins are:
species: a penguin’s species (Adelie, Chinstrap, or Gentoo).
flipper_length_mm: length of a penguin’s flipper, in millimeters.
body_mass_g: body mass of a penguin, in grams.
To learn more about penguins, open its help page by running ?penguins.
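As a small sketch of how you might look at just these three variables, the following uses base R column selection (not a function introduced in this chapter) and checks how each column is stored. The name key_vars is illustrative only.

```r
library(palmerpenguins)

# Keep only the three variables described above.
key_vars <- penguins[, c("species", "flipper_length_mm", "body_mass_g")]

# species is stored as a factor; the two measurements are integers.
col_classes <- sapply(key_vars, class)
col_classes

# Both measurements contain NAs, so summaries need na.rm = TRUE.
range(key_vars$flipper_length_mm, na.rm = TRUE)
range(key_vars$body_mass_g, na.rm = TRUE)
```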
Creating a ggplot
Let’s recreate this plot layer-by-layer.
With ggplot2, you begin a plot with the function ggplot(), defining a plot object that you then add layers to. The first argument of ggplot() is the dataset to use in the graph, so ggplot(data = penguins) creates an empty graph. This is not a very exciting plot, but you can think of it like an empty canvas you’ll paint the remaining layers of your plot onto.
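The empty canvas described here can be sketched as follows, assuming ggplot2 and palmerpenguins are installed. The name canvas is illustrative; inspecting the layers element is just one way to confirm that nothing has been added yet.

```r
library(ggplot2)
library(palmerpenguins)

# Only the data is supplied: no aesthetic mappings and no layers yet.
canvas <- ggplot(data = penguins)

# Printing the object draws the empty canvas.
canvas

# The plot object has no layers at this point.
length(canvas$layers)
```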
Next, we need to tell ggplot() the variables from this data frame that we want to map to visual properties (aesthetics) of the plot. The mapping argument of the ggplot() function defines how variables in your dataset are mapped to visual properties of your plot. The mapping argument is always paired with the aes() function, and the x and y arguments of aes() specify which variables to map to the x and y axes. For now, we will only map flipper length to the x aesthetic and body mass to the y aesthetic. ggplot2 looks for the mapped variables in the data argument, in this case, penguins.
The following plots show the result of adding these mappings, one at a time.
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm)
)

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
)
Our empty canvas now has more structure – it’s clear where flipper lengths will be displayed (on the x-axis) and where body masses will be displayed (on the y-axis). But the penguins themselves are not yet on the plot. This is because we have not yet articulated, in our code, how to represent the observations from our data frame on our plot.
To do so, we need to define a geom: the geometrical object that a plot uses to represent data. These geometric objects are made available in ggplot2 with functions that start with geom_. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms (geom_bar()), line charts use line geoms (geom_line()), boxplots use boxplot geoms (geom_boxplot()), and so on. Scatterplots break the trend; they use the point geom: geom_point().
The function geom_point() adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with many geom functions that each adds a different type of layer to a plot. You’ll learn a whole bunch of geoms throughout the book, particularly in #chp-layers.
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()
#> Warning: Removed 2 rows containing missing values (`geom_point()`).
Now we have something that looks like what we might think of as a “scatterplot”. It doesn’t yet match our “ultimate goal” plot, but using this plot we can start answering the question that motivated our exploration: “What does the relationship between flipper length and body mass look like?” The relationship appears to be positive, fairly linear, and moderately strong. Penguins with longer flippers are generally larger in terms of their body mass.
Before we add more layers to this plot, let’s pause for a moment and review the warning message we got:
Removed 2 rows containing missing values (`geom_point()`).
We’re seeing this message because there are two penguins in our dataset with missing body mass and flipper length values and ggplot2 has no way of representing them on the plot. You don’t need to worry about understanding the following code yet (you will learn about it in #chp-data-transform), but it’s one way of identifying the observations with NAs for either body mass or flipper length.
penguins |>
  select(species, flipper_length_mm, body_mass_g) |>
  filter(is.na(body_mass_g) | is.na(flipper_length_mm))
#> # A tibble: 2 × 3
#>   species flipper_length_mm body_mass_g
#>   <fct>               <int>       <int>
#> 1 Adelie                 NA          NA
#> 2 Gentoo                 NA          NA
Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing. This type of warning is probably one of the most common types of warnings you will see when working with real data – missing values are a very common issue and you’ll learn more about them throughout the book, particularly in #chp-missing-values. For the remaining plots in this chapter we will suppress this warning so it’s not printed alongside every single plot we make.
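The chapter does not say how the warning is suppressed. One option, shown here as an assumption rather than as the authors’ method, is the na.rm argument of geom_point(), which drops incomplete rows without a warning. The sketch also counts the affected rows with base R so you can see where the “2” in the message comes from; n_missing is an illustrative name.

```r
library(ggplot2)
library(palmerpenguins)

# Count rows where either plotted variable is missing.
n_missing <- sum(
  is.na(penguins$flipper_length_mm) | is.na(penguins$body_mass_g)
)
n_missing

# na.rm = TRUE asks geom_point() to remove those rows silently.
quiet_plot <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(na.rm = TRUE)

quiet_plot
```

Silencing the warning this way is a per-layer choice, so it is worth doing only after you have checked which rows are being dropped.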
Adding aesthetics and layers
Scatterplots are useful for displaying the relationship between two variables, but it’s always a good idea to be skeptical of any apparent relationship between two variables and ask if there may be other variables that explain or change the nature of this apparent relationship. Let’s incorporate species into our plot and see if this reveals any additional insights into the apparent relationship between flipper length and body mass. We will do this by representing species with different colored points.
To achieve this, where should species go in the ggplot call from earlier? If you guessed “in the aesthetic mapping, inside of aes()”, you’re already getting the hang of creating data visualizations with ggplot2! And if not, don’t worry. Throughout the book you will make many more ggplots and have many more opportunities to check your intuition as you make them.
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point()
When a variable is mapped to an aesthetic, ggplot2 will automatically assign a unique value of the aesthetic (here a unique color) to each unique level of the variable (each of the three species), a process known as scaling. ggplot2 will also add a legend that explains which values correspond to which levels.
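To see scaling at work, the following sketch uses layer_data(), a ggplot2 helper that returns the data a layer will actually draw after scales have been applied. It is not used elsewhere in this chapter; the names p and point_data are illustrative. The point is simply that the number of distinct colour values equals the number of species levels.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point()

# Data for the first (and only) layer, after the colour scale has run.
point_data <- layer_data(p, 1)

# One distinct colour value per species level.
unique(point_data$colour)
nlevels(penguins$species)
```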
Now let’s add one more layer: a smooth curve displaying the relationship between body mass and flipper length. Before you proceed, refer back to the code above, and think about how we can add this to our existing plot.
Since this is a new geometric object representing our data, we will add a new geom: geom_smooth().
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point() +
  geom_smooth()
We have successfully added smooth curves, but this plot doesn’t look like the plot from #sec-ultimate-goal, which only has one curve for the entire dataset as opposed to separate curves for each of the penguin species.
When aesthetic mappings are defined in ggplot(), at the global level, they’re inherited by each of the subsequent geom layers of the plot. However, each geom function in ggplot2 can also take a mapping argument, which allows for aesthetic mappings at the local level. Since we want points to be colored based on species but don’t want the smooth curves to be separated out for them, we should specify color = species for geom_point() only.
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth()
Voila! We have something that looks very much like our ultimate goal, though it’s not yet perfect. We still need to use different shapes for each species of penguins and improve labels.
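If you want to check where each mapping lives in a plot like this one, you can inspect the plot object directly. This is a sketch for exploration only, and it relies on the internal structure of ggplot objects (the mapping and layers elements) rather than on anything the chapter introduces. The name p is illustrative.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth()

# Global mappings, inherited by every layer: x and y only.
names(p$mapping)

# Local mappings of the first layer (geom_point): colour only.
# aes() standardises the spelling "color" to "colour".
names(p$layers[[1]]$mapping)
```

Because color is mapped only in the point layer, geom_smooth() sees just x and y and fits a single curve to all penguins.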
It’s generally not a good idea to represent information using only colors on a plot, as people perceive colors differently due to color blindness or other color vision differences. Therefore, in addition to color, we can also map species to the shape aesthetic.
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species, shape = species)) +
  geom_smooth()
Note that the legend is automatically updated to reflect the different shapes of the points as well.
And finally, we can improve the labels of our plot using the labs() function in a new layer. Some of the arguments to labs() might be self-explanatory: title adds a title and subtitle adds a subtitle to the plot. Other arguments match the aesthetic mappings: x is the x-axis label, y is the y-axis label, and color and shape define the label for the legend.
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(aes(color = species, shape = species)) +
  geom_smooth() +
  labs(
    title = "Body mass and flipper length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Flipper length (mm)",
    y = "Body mass (g)",
    color = "Species",
    shape = "Species"
  )
We finally have a plot that perfectly matches our “ultimate goal”!