# Factors {#sec-factors}

```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("complete")
```

## Introduction

Factors are used for categorical variables, variables that have a fixed and known set of possible values.
They are also useful when you want to display character vectors in a non-alphabetical order.

We'll start by motivating why factors are needed for data analysis and how you can create them with `factor()`.
We'll then introduce you to the `gss_cat` dataset, which contains a bunch of categorical variables to experiment with.
You'll then use that dataset to practice modifying the order and values of factors, before we finish up with a discussion of ordered factors.

### Prerequisites

Base R provides some basic tools for creating and manipulating factors.
We'll supplement these with the **forcats** package, which is part of the core tidyverse.
It provides a wide range of helpers for working with **cat**egorical variables (and its name is an anagram of factors!).

```{r}
#| label: setup
#| message: false

library(tidyverse)
```

## Factor basics

Imagine that you have a variable that records month:

```{r}
x1 <- c("Dec", "Apr", "Jan", "Mar")
```

Using a string to record this variable has two problems:

1.  There are only twelve possible months, and there's nothing saving you from typos:

    ```{r}
    x2 <- c("Dec", "Apr", "Jam", "Mar")
    ```

2.  It doesn't sort in a useful way:

    ```{r}
    sort(x1)
    ```

You can fix both of these problems with a factor.
To create a factor you must start by creating a character vector of the valid **levels**:

```{r}
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
```

Now you can create a factor:

```{r}
y1 <- factor(x1, levels = month_levels)
y1

sort(y1)
```

Any values not in the levels will be silently converted to `NA`:

```{r}
y2 <- factor(x2, levels = month_levels)
y2
```
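
Because the conversion is silent, it can help to check which inputs failed to match a level.
The following is a minimal sketch rather than a forcats feature: the names `months_typo`, `as_months`, and `unmatched` are made up for illustration, and `month.abb` is base R's built-in vector of English month abbreviations.

```{r}
# Illustrative input containing one typo ("Jam")
months_typo <- c("Dec", "Apr", "Jam", "Mar")

# month.abb is a base R constant: "Jan", "Feb", ..., "Dec"
as_months <- factor(months_typo, levels = month.abb)

# Inputs that were not missing to begin with but became NA in the factor
unmatched <- months_typo[is.na(as_months) & !is.na(months_typo)]
unmatched
```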

This seems risky, so you might want to use `fct()` instead:

```{r}
#| error: true
y2 <- fct(x2, levels = month_levels)
```

If you omit the levels, they'll be taken from the data in alphabetical order:

```{r}
factor(x1)
```

Sometimes you'd prefer that the order of the levels matches the order of the first appearance in the data.
You can do that when creating the factor by setting levels to `unique(x)`, or after the fact, with `fct_inorder()`:

```{r}
f1 <- factor(x1, levels = unique(x1))
f1

f2 <- x1 |> factor() |> fct_inorder()
f2
```

If you ever need to access the set of valid levels directly, you can do so with `levels()`:

```{r}
levels(f2)
```
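
Note that the levels are the set of allowed values, not the values that happen to occur in the data.
As a small illustration (the name `months_seen` is ours, and `month.abb` is base R's built-in vector of month abbreviations), a factor built from four months still carries all twelve levels:

```{r}
months_seen <- factor(c("Dec", "Apr", "Jan", "Mar"), levels = month.abb)

nlevels(months_seen)          # number of valid levels
length(unique(months_seen))   # number of distinct values actually present
```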

You can also create a factor when reading your data with readr, using `col_factor()`:

```{r}
csv <- "
month,value
Jan,12
Feb,56
Mar,12"

df <- read_csv(csv, col_types = cols(month = col_factor(month_levels)))
df$month
```

## General Social Survey

For the rest of this chapter, we're going to use `forcats::gss_cat`.
It's a sample of data from the [General Social Survey](https://gss.norc.org), a long-running US survey conducted by the independent research organization NORC at the University of Chicago.
The survey has thousands of questions, so in `gss_cat` Hadley selected a handful that will illustrate some common challenges you'll encounter when working with factors.

```{r}
gss_cat
```

(Remember, since this dataset is provided by a package, you can get more information about the variables with `?gss_cat`.)

When factors are stored in a tibble, you can't see their levels so easily.
One way to view them is with `count()`:

```{r}
gss_cat |>
  count(race)
```
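
By default, `count()` only lists levels that occur in the data.
As a sketch of two ways to see every level, including any with zero observations, you could call `levels()` on the column or set `.drop = FALSE` in `count()`; the names `race_levels` and `race_counts` are ours:

```{r}
library(dplyr)
library(forcats)

# All levels defined for the factor, whether or not they occur in the data
race_levels <- levels(gss_cat$race)
race_levels

# Keep zero-count levels in the summary instead of dropping them
race_counts <- gss_cat |>
  count(race, .drop = FALSE)
race_counts
```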

Or with a bar chart:

```{r}
#| fig-alt: >
#|   A bar chart showing the distribution of race. There are ~2000
#|   records with race "Other", 3000 with race "Black", and over
#|   15,000 with race "White".
ggplot(gss_cat, aes(x = race)) +
  geom_bar()
```

When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels.
Those operations are described in the sections below.

### Exercises

1.  Explore the distribution of `rincome` (reported income).
    What makes the default bar chart hard to understand?
    How could you improve the plot?

2.  What is the most common `relig` in this survey?
    What's the most common `partyid`?

3.  Which `relig` does `denom` (denomination) apply to?
    How can you find out with a table?
    How can you find out with a visualization?

## Modifying factor order

It's often useful to change the order of the factor levels in a visualization.
For example, imagine you want to explore the average number of hours spent watching TV per day across religions:

```{r}
#| fig-alt: >
#|   A scatterplot with tvhours on the x-axis and religion on the y-axis.
#|   The y-axis is ordered seemingly arbitrarily, making it hard to get
#|   any sense of overall pattern.
relig_summary <- gss_cat |>
  group_by(relig) |>
  summarize(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

ggplot(relig_summary, aes(x = tvhours, y = relig)) +
  geom_point()
```

It is hard to read this plot because there's no overall pattern.
We can improve it by reordering the levels of `relig` using `fct_reorder()`.
`fct_reorder()` takes three arguments:

-   `.f`, the factor whose levels you want to modify.
-   `.x`, a numeric vector that you want to use to reorder the levels.
-   Optionally, `.fun`, a function that's used if there are multiple values of `.x` for each value of `.f`. The default value is `median`.
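
To make the role of `.fun` concrete, here is a toy sketch with several values per level.
The vectors `group` and `value` are invented for illustration and are not part of `gss_cat`:

```{r}
library(forcats)

# Toy data: two measurements for each of three groups
group <- factor(c("a", "a", "b", "b", "c", "c"))
value <- c(10, 12, 1, 3, 5, 100)

# Default summary is the median: b (2), a (11), c (52.5)
by_median <- fct_reorder(group, value)
levels(by_median)

# Using the minimum instead: b (1), c (5), a (10)
by_min <- fct_reorder(group, value, .fun = min)
levels(by_min)
```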

```{r}
#| fig-alt: >
#|   The same scatterplot as above, but now the religion is displayed in
#|   increasing order of tvhours. "Other eastern" has the fewest tvhours
#|   (under 2), and "Don't know" has the highest (over 5).
ggplot(relig_summary, aes(x = tvhours, y = fct_reorder(relig, tvhours))) +
  geom_point()
```

Reordering religion makes it much easier to see that people in the "Don't know" category watch much more TV, and that people identifying with Hinduism and Other Eastern religions watch much less.

As you start making more complicated transformations, we recommend moving them out of `aes()` and into a separate `mutate()` step.
For example, you could rewrite the plot above as:

```{r}
#| eval: false

relig_summary |>
  mutate(
    relig = fct_reorder(relig, tvhours)
  ) |>
  ggplot(aes(x = tvhours, y = relig)) +
  geom_point()
```

What if we create a similar plot looking at how average age varies across reported income level?

```{r}
#| fig-alt: >
#|   A scatterplot with age on the x-axis and income on the y-axis. Income
#|   has been reordered in order of average age which doesn't make much
#|   sense. One section of the y-axis goes from $6000-6999, then <$1000,
#|   then $8000-9999.
rincome_summary <- gss_cat |>
  group_by(rincome) |>
  summarize(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

ggplot(rincome_summary, aes(x = age, y = fct_reorder(rincome, age))) +
  geom_point()
```

Here, arbitrarily reordering the levels isn't a good idea!
That's because `rincome` already has a principled order that we shouldn't mess with.
Reserve `fct_reorder()` for factors whose levels are arbitrarily ordered.

However, it does make sense to pull "Not applicable" to the front with the other special levels.
You can use `fct_relevel()`.
It takes a factor, `.f`, and then any number of levels that you want to move to the front of the line.

```{r}
#| fig-alt: >
#|   The same scatterplot but now "Not Applicable" is displayed at the
#|   bottom of the y-axis. Generally there is a positive association
#|   between income and age, and the income band with the highest average
#|   age is "Not applicable".
ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
  geom_point()
```
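
The effect on the levels themselves is easier to see on a toy factor.
This sketch uses an invented four-level factor `f`; the `after` argument of `fct_relevel()` (default `0`) controls where the named levels are placed, and `after = Inf` sends them to the end instead:

```{r}
library(forcats)

f <- factor(c("a", "b", "c", "d"))

# Move "c" to the front of the levels
front <- fct_relevel(f, "c")
levels(front)

# Move "a" to the end of the levels
back <- fct_relevel(f, "a", after = Inf)
levels(back)
```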

Why do you think the average age for "Not applicable" is so high?

Another type of reordering is useful when you are coloring the lines on a plot.
`fct_reorder2(.f, .x, .y)` reorders the factor `.f` by the `.y` values associated with the largest `.x` values.
This makes the plot easier to read because the colors of the lines at the far right of the plot will line up with the legend.

```{r}
#| layout-ncol: 2
#| fig-width: 4
#| fig-height: 2
#| fig-alt: >
#|   A line plot with age on the x-axis and proportion on the y-axis.
#|   There is one line for each category of marital status: no answer,
#|   never married, separated, divorced, widowed, and married. It is
#|   a little hard to read the plot because the order of the legend is
#|   unrelated to the lines on the plot.
#|
#|   Rearranging the legend makes the plot easier to read because the
#|   legend colors now match the order of the lines on the far right
#|   of the plot. You can see some unsurprising patterns: the proportion
#|   never married decreases with age, married forms an upside down U
#|   shape, and widowed starts off low but increases steeply after age
#|   60.
by_age <- gss_cat |>
  filter(!is.na(age)) |>
  count(age, marital) |>
  group_by(age) |>
  mutate(
    prop = n / sum(n)
  )

ggplot(by_age, aes(age, prop, color = marital)) +
  geom_line(na.rm = TRUE)

ggplot(by_age, aes(x = age, y = prop, color = fct_reorder2(marital, age, prop))) +
  geom_line() +
  labs(color = "marital")
```
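
A toy sketch may help show what `fct_reorder2()` does to the levels.
The three series below are invented; each has a `y` value at `x = 1` and `x = 2`, and the levels end up ordered by the `y` value at the largest `x`, from highest to lowest:

```{r}
library(forcats)

series <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(1, 2, 1, 2, 1, 2)
y <- c(5, 1, 2, 9, 3, 4)

# At x = 2 the y values are a = 1, b = 9, c = 4,
# so the reordered levels run b, c, a
reordered <- fct_reorder2(series, x, y)
levels(reordered)
```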

Finally, for bar plots, you can use `fct_infreq()` to order levels in decreasing frequency: this is the simplest type of reordering because it doesn't need any extra variables.
Combine it with `fct_rev()` if you want them in increasing frequency so that in the bar plot the largest values are on the right, not the left.

```{r}
#| fig-alt: >
#|   A bar chart of marital status ordered from least to most common:
#|   no answer (~0), separated (~1,000), widowed (~2,000), divorced
#|   (~3,000), never married (~5,000), married (~10,000).
gss_cat |>
  mutate(marital = marital |> fct_infreq() |> fct_rev()) |>
  ggplot(aes(x = marital)) +
  geom_bar()
```
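
The same two steps can be checked on a small made-up factor (`pets` is purely illustrative):

```{r}
library(forcats)

pets <- factor(c("cat", "dog", "dog", "fish", "fish", "fish"))

# Most frequent level first
most_first <- fct_infreq(pets)
levels(most_first)

# Reverse the order so the most frequent level comes last
least_first <- fct_rev(most_first)
levels(least_first)
```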

### Exercises

1.  There are some suspiciously high numbers in `tvhours`.
    Is the mean a good summary?

2.  For each factor in `gss_cat` identify whether the order of the levels is arbitrary or principled.

3.  Why did moving "Not applicable" to the front of the levels move it to the bottom of the plot?

## Modifying factor levels

More powerful than changing the order of the levels is changing their values.
This allows you to clarify labels for publication, and collapse levels for high-level displays.
The most general and powerful tool is `fct_recode()`.
It allows you to recode, or change, the value of each level.
For example, take the `gss_cat$partyid`:

```{r}
gss_cat |> count(partyid)
```

The levels are terse and inconsistent.
Let's tweak them to be longer and use a parallel construction.
Like most rename and recoding functions in the tidyverse, the new values go on the left and the old values go on the right:

```{r}
gss_cat |>
  mutate(
    partyid = fct_recode(partyid,
      "Republican, strong"    = "Strong republican",
      "Republican, weak"      = "Not str republican",
      "Independent, near rep" = "Ind,near rep",
      "Independent, near dem" = "Ind,near dem",
      "Democrat, weak"        = "Not str democrat",
      "Democrat, strong"      = "Strong democrat"
    )
  ) |>
  count(partyid)
```

`fct_recode()` will leave the levels that aren't explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn't exist.
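
Both behaviors can be seen on a small invented factor.
In this sketch, `party` and its short labels are hypothetical; the second call deliberately misspells an old level, which should produce a warning and leave the factor unchanged:

```{r}
library(forcats)

party <- factor(c("Dem", "Rep", "Ind"))

# "Ind" is not mentioned, so it is left as is
recoded <- fct_recode(party, "Democrat" = "Dem", "Republican" = "Rep")
levels(recoded)

# "Dme" is not an existing level: expect a warning and unchanged levels
typo_result <- fct_recode(party, "Democrat" = "Dme")
levels(typo_result)
```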

To combine groups, you can assign multiple old levels to the same new level:

```{r}
gss_cat |>
  mutate(
    partyid = fct_recode(partyid,
      "Republican, strong"    = "Strong republican",
      "Republican, weak"      = "Not str republican",
      "Independent, near rep" = "Ind,near rep",
      "Independent, near dem" = "Ind,near dem",
      "Democrat, weak"        = "Not str democrat",
      "Democrat, strong"      = "Strong democrat",
      "Other"                 = "No answer",
      "Other"                 = "Don't know",
      "Other"                 = "Other party"
    )
  ) |>
  count(partyid)
```

Use this technique with care: if you group together categories that are truly different you will end up with misleading results.

If you want to collapse a lot of levels, `fct_collapse()` is a useful variant of `fct_recode()`.
For each new level, you can provide a vector of old levels:

```{r}
gss_cat |>
  mutate(
    partyid = fct_collapse(partyid,
      "other" = c("No answer", "Don't know", "Other party"),
      "rep" = c("Strong republican", "Not str republican"),
      "ind" = c("Ind,near rep", "Independent", "Ind,near dem"),
      "dem" = c("Not str democrat", "Strong democrat")
    )
  ) |>
  count(partyid)
```

Sometimes you just want to lump together the small groups to make a plot or table simpler.
That's the job of the `fct_lump_*()` family of functions.
`fct_lump_lowfreq()` is a simple starting point that progressively lumps the smallest categories into "Other", always keeping "Other" as the smallest category.

```{r}
gss_cat |>
  mutate(relig = fct_lump_lowfreq(relig)) |>
  count(relig)
```

In this case it's not very helpful: it is true that the majority of Americans in this survey are Protestant, but we'd probably like to see some more details!
Instead, we can use `fct_lump_n()` to specify that we want exactly 10 groups:

```{r}
gss_cat |>
  mutate(relig = fct_lump_n(relig, n = 10)) |>
  count(relig, sort = TRUE) |>
  print(n = Inf)
```

Read the documentation to learn about `fct_lump_min()` and `fct_lump_prop()` which are useful in other cases.
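
As a rough sketch of how those two differ, consider an invented factor `f` with counts a = 5, b = 3, c = 1, and d = 1.
`fct_lump_min()` lumps levels that appear fewer than `min` times, while `fct_lump_prop()` lumps levels whose share of the data is at or below `prop`:

```{r}
library(forcats)

f <- factor(c(rep("a", 5), rep("b", 3), "c", "d"))

# Keep levels seen at least 3 times; "c" and "d" become "Other"
lumped_min <- fct_lump_min(f, min = 3)
levels(lumped_min)

# Keep levels making up more than 20% of the data; same result here
lumped_prop <- fct_lump_prop(f, prop = 0.2)
levels(lumped_prop)
```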

### Exercises

1.  How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?

2.  How could you collapse `rincome` into a small set of categories?

3.  Notice there are 9 groups (excluding other) in the `fct_lump` example above.
    Why not 10?
    (Hint: type `?fct_lump`, and find the default for the argument `other_level` is "Other".)

## Ordered factors

Before we go on, there's a special type of factor that needs to be mentioned briefly: ordered factors.
Ordered factors, created with `ordered()`, imply a strict ordering between levels: the first level is "less than" the second level, which is "less than" the third level, and so on. The type itself records nothing about the size of the differences between levels.
You can recognize them when printing because they use `<` between the factor levels:

```{r}
ordered(c("a", "b", "c"))
```
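
One visible consequence of the ordering is that comparisons and range summaries work on ordered factors.
A small base R sketch, using an invented `rating` vector:

```{r}
rating <- ordered(
  c("low", "high", "medium"),
  levels = c("low", "medium", "high")
)

# Comparisons follow the level order, not alphabetical order
below_high <- rating < "high"
below_high

# max() returns the highest level present
top <- max(rating)
top
```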

In practice, `ordered()` factors behave very similarly to regular factors.
There are only two places where you might notice different behavior:

-   If you map an ordered factor to color or fill in ggplot2, it will default to `scale_color_viridis_d()`/`scale_fill_viridis_d()`, a color scale that implies a ranking.
-   If you use an ordered factor in a linear model, it will use "polynomial contrasts". These are mildly useful, but you are unlikely to have heard of them unless you have a PhD in Statistics, and even then you probably don't routinely interpret them. If you want to learn more, we recommend `vignette("contrasts", package = "faux")` by Lisa DeBruine.
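
To see the difference in contrasts concretely, here is a small base R sketch comparing the default contrast matrices for a regular and an ordered factor.
The level names are invented, and the output assumes R's default `contrasts` option has not been changed:

```{r}
sizes <- c("low", "medium", "high")

plain <- factor(sizes, levels = sizes)
ranked <- ordered(sizes, levels = sizes)

# Regular factor: treatment contrasts, one column per non-reference level
plain_contrasts <- contrasts(plain)
plain_contrasts

# Ordered factor: polynomial contrasts, linear (.L) and quadratic (.Q) terms
ranked_contrasts <- contrasts(ranked)
ranked_contrasts
```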

Given the arguable utility of these differences, we don't generally recommend using ordered factors.

## Summary

This chapter introduced you to the handy forcats package for working with factors and to its most commonly used functions.
forcats contains a wide range of other helpers that we didn't have space to discuss here, so whenever you're facing a challenge with factors that you haven't encountered before, we highly recommend skimming the [reference index](https://forcats.tidyverse.org/reference/index.html) to see if there's a canned function that can help solve your problem.

If you want to learn more about factors after reading this chapter, we recommend reading Amelia McNamara and Nicholas Horton's paper, [*Wrangling categorical data in R*](https://peerj.com/preprints/3163/).
This paper lays out some of the history discussed in [*stringsAsFactors: An unauthorized biography*](https://simplystatistics.org/posts/2015-07-24-stringsasfactors-an-unauthorized-biography/) and [*stringsAsFactors = \<sigh\>*](https://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh), and compares the tidy approaches to categorical data outlined in this book with base R methods.
An early version of the paper helped motivate and scope the forcats package; thanks Amelia & Nick!

In the next chapter we'll switch gears to start learning about dates and times in R.
Dates and times seem deceptively simple, but as you'll soon see, the more you learn about them, the more complex they seem to get!