@ -26,22 +26,47 @@ There are two ways to create a factor: during import with readr, using `col_fact
x <- c("pear", "apple", "banana", "apple", "pear", "apple")
factor(x, levels = c("apple", "banana", "pear"))
For the rest of this chapter, we're going to focus on `forcats::gss_cat`. It's a sample of variables from the [General Social Survey]( The variables have been selected to illustrate a number of challenges with working with factors.
Sometimes you'd prefer that the order of the levels match the order of the first appearance in the data. You can do that during creation by setting levels to `unique(x)`, or after the fact with `fct_inorder()`:
factor(x, levels = unique(x))
f <- factor(x)
f <- fct_inorder(f)
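As a quick check on the two approaches above, here is a minimal sketch (assuming forcats is installed; the names `by_creation` and `after_the_fact` are mine, purely for illustration) confirming that they give the same level order:

```r
library(forcats)

x <- c("pear", "apple", "banana", "apple", "pear", "apple")

# Levels fixed at creation, in order of first appearance
by_creation <- factor(x, levels = unique(x))

# Levels sorted alphabetically at creation, then reordered afterwards
after_the_fact <- fct_inorder(factor(x))

levels(by_creation)
levels(after_the_fact)
identical(levels(by_creation), levels(after_the_fact))
```

Both routes change only the order of the levels; the values stored in the factor are untouched.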
## General Social Survey
For the rest of this chapter, we're going to focus on `forcats::gss_cat`. It's a sample of variables from the [General Social Survey](, which is a long-running US survey run by the independent research organization NORC at the University of Chicago. The survey has thousands of questions, and in `gss_cat` I've selected a handful of variables to illustrate some common challenges you'll hit when working with factors.
Note that the order of levels is preserved in operations like `count()`:
gss_cat %>%
@ -55,7 +80,7 @@ ggplot(gss_cat, aes(race)) +
By default, ggplot2 will drop levels that don't have any values. You can force them to appear with:
ggplot(gss_cat, aes(race)) +
@ -63,10 +88,27 @@ ggplot(gss_cat, aes(race)) +
scale_x_discrete(drop = FALSE)
Unfortunately, dplyr doesn't yet have a `drop` option, but it will in the future.
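Until then, one workaround is base R's `table()`, which reports every level of a factor, including those with no observations. A minimal sketch on a made-up factor (`f` and its levels are illustrative, not taken from `gss_cat`):

```r
# A factor with a declared level ("c") that never occurs in the data
f <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))

# table() counts every level, so the empty level shows up with a count of 0
counts <- table(f)
counts
```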
### Exercise
## Modifying factor order
Let's look at a concrete example. Here I compute the average number of TV hours for each religion:
relig <- gss_cat %>%
group_by(relig) %>%
@ -77,10 +119,16 @@ relig <- gss_cat %>%
ggplot(relig, aes(tvhours, relig)) + geom_point()
This plot is a little hard to take in because the order of religion is basically arbitrary. We can improve it by reordering the levels of `relig` by `tvhours`. This makes it much easier to see that people in the "Don't know" category watch much more TV, and Hinduism & Other Eastern religions watch much less.
ggplot(relig, aes(tvhours, fct_reorder(relig, tvhours))) +
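To see what `fct_reorder()` does to the levels themselves, here is a small sketch on a made-up summary table (the data frame `toy` and its columns are hypothetical). With the default summary function, levels are sorted by increasing value of the second argument:

```r
library(forcats)

# Made-up summary table: one row per group
toy <- data.frame(
  group = factor(c("a", "b", "c")),
  value = c(3, 1, 2)
)

levels(toy$group)  # "a" "b" "c" (alphabetical)

reordered <- fct_reorder(toy$group, toy$value)
levels(reordered)  # "b" "c" "a" (increasing value)
```

Only the level order changes, which is why the effect shows up on the plot axis rather than in the data.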
What if we do the same thing for income levels?
rincome <- gss_cat %>%
@ -92,54 +140,132 @@ rincome <- gss_cat %>%
by_year <- gss_cat %>%
group_by(year, marital) %>%
But it does make sense to pull "Not applicable" to the front with the other special levels. You can use `fct_relevel()`. Why do you think the average age for "Not applicable" is so high?
ggplot(rincome, aes(age, fct_relevel(rincome, "Not applicable"))) +
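Here is a minimal sketch of what `fct_relevel()` does to the level order, using a cut-down, illustrative factor rather than the full `rincome` variable:

```r
library(forcats)

# Illustrative income factor with the special level last
income <- factor(
  c("Lt $1000", "$25000 or more", "Not applicable"),
  levels = c("Lt $1000", "$25000 or more", "Not applicable")
)

# Any levels you name are moved to the front; the rest keep their order
moved <- fct_relevel(income, "Not applicable")
levels(moved)  # "Not applicable" "Lt $1000" "$25000 or more"
```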
Another variation of `fct_reorder()` is useful when you are colouring the lines on a plot. Using `fct_reorder2()` makes the line colours nicely match the order of the legend.
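To make the idea concrete before the real plot, here is a sketch on two made-up lines (the data frame `toy` is hypothetical). By default, `fct_reorder2()` orders the levels by the `y` value associated with the largest `x`, in descending order, so the first level is the line that ends highest:

```r
library(forcats)

# Two made-up lines, "a" and "b", observed at x = 1 and x = 2
toy <- data.frame(
  group = factor(c("a", "a", "b", "b")),
  x     = c(1, 2, 1, 2),
  y     = c(5, 2, 1, 4)
)

# At the largest x, "b" (y = 4) sits above "a" (y = 2)
reordered <- fct_reorder2(toy$group, toy$x, toy$y)
levels(reordered)  # "b" "a"
```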
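```{r, fig.align = "default", out.width = "50%"}
# Proportion of each marital status within each age
by_age <- gss_cat %>%
  count(age, marital) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n))

# Default level order: the legend order is unrelated to the lines
ggplot(by_age, aes(age, prop, colour = marital)) +
  geom_line(na.rm = TRUE)

# Reordered levels: the legend order matches the lines at the largest age
ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
  geom_line() +
  labs(colour = "marital")
```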
Finally, for bar plots, you can use `fct_infreq()` to order levels by decreasing frequency (most common first). Combine it with `fct_rev()` if you want increasing frequency instead.
gss_cat %>%
mutate(marital = marital %>% fct_infreq() %>% fct_rev()) %>%
ggplot(aes(marital)) +
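A minimal sketch of the two functions on a made-up factor (the names `most_first` and `least_first` are illustrative):

```r
library(forcats)

# "c" occurs three times, "b" twice, "a" once
f <- factor(c("a", "b", "b", "c", "c", "c"))

most_first  <- fct_infreq(f)         # most common level first
least_first <- fct_rev(most_first)   # reverse the level order

levels(most_first)   # "c" "b" "a"
levels(least_first)  # "a" "b" "c"
```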
### Exercises
1. There are some suspiciously high numbers in `tvhours`. Is the mean a good
1. For each factor in `gss_cat` identify whether the order is arbitrary
or meaningful.
1. Recreate the display of marital status by age, using `geom_area()` instead
of `geom_line()`. What do you need to change to the plot? How might you
tweak the levels?
## Modifying factor levels
More powerful than changing the order of the levels is changing their values. This allows you to clarify labels for publication, and collapse levels for high-level displays.
### Manually grouping
The most general and powerful tool is `fct_recode()`. It allows you to recode, or change, the value of each level. For example, take `gss_cat$partyid`:
gss_cat %>% count(partyid)
### Lumping small groups together
The levels are a little hard to read. Let's tweak them to be longer and more consistent. Any levels that aren't explicitly mentioned will be left as is.
gss_cat %>%
mutate(partyid = fct_recode(partyid,
"Republican, strong" = "Strong republican",
"Republican, weak" = "Not str republican",
"Independent, near rep" = "Ind,near rep",
"Independent, near dem" = "Ind,near dem",
"Democrat, weak" = "Not str democrat",
"Democrat, strong" = "Strong democrat"
)) %>%
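The same recoding can be tried on a small, self-contained factor. In this sketch (the vector `party` is a made-up subset of the `partyid` levels), new names go on the left and old names on the right, and the unmentioned level is left alone:

```r
library(forcats)

party <- factor(
  c("Ind,near rep", "Strong democrat", "Independent"),
  levels = c("Ind,near rep", "Independent", "Strong democrat")
)

recoded <- fct_recode(party,
  "Independent, near rep" = "Ind,near rep",
  "Democrat, strong"      = "Strong democrat"
)

# "Independent" was not mentioned, so it keeps its name and position
levels(recoded)
```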
You can assign multiple old levels to the same new level:
gss_cat %>%
mutate(partyid = fct_recode(partyid,
"Republican, strong" = "Strong republican",
"Republican, weak" = "Not str republican",
"Independent, near rep" = "Ind,near rep",
"Independent, near dem" = "Ind,near dem",
"Democrat, weak" = "Not str democrat",
"Democrat, strong" = "Strong democrat",
"Other" = "No answer",
"Other" = "Don't know",
"Other" = "Other party"
)) %>%
You must use this technique with extreme care: if you group together categories that are truly different, you will end up with misleading results.
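Here is a minimal sketch of many-to-one recoding on a made-up factor (`party` holds just four of the `partyid` levels). Counting afterwards is a cheap way to check how many observations each merged level absorbed:

```r
library(forcats)

party <- factor(
  c("No answer", "Don't know", "Other party", "Independent"),
  levels = c("No answer", "Don't know", "Other party", "Independent")
)

# Three old levels are all mapped to the single new level "Other"
merged <- fct_recode(party,
  "Other" = "No answer",
  "Other" = "Don't know",
  "Other" = "Other party"
)

levels(merged)
table(merged)
```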
If you want to collapse a lot of levels, `fct_collapse()` is a useful variant. For each new level, you can provide a vector of old levels:
gss_cat %>%
mutate(partyid = fct_collapse(partyid,
other = c("No answer", "Don't know", "Other party"),
rep = c("Strong republican", "Not str republican"),
ind = c("Ind,near rep", "Independent", "Ind,near dem"),
dem = c("Not str democrat", "Strong democrat")
)) %>%
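A small sketch of `fct_collapse()` on a made-up factor (`party` is an illustrative subset of the `partyid` levels). Levels that are not listed in any vector, here "Independent", are left unchanged:

```r
library(forcats)

party_levels <- c(
  "Strong republican", "Not str republican", "Independent",
  "Not str democrat", "Strong democrat"
)
party <- factor(party_levels, levels = party_levels)

# Each argument name is a new level; each vector lists the old levels it absorbs
collapsed <- fct_collapse(party,
  rep = c("Strong republican", "Not str republican"),
  dem = c("Not str democrat", "Strong democrat")
)

levels(collapsed)
table(collapsed)
```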
Sometimes you just want to lump together all the small groups to make a plot or table simpler. That's the job of `fct_lump()`:
gss_cat %>%
mutate(relig = fct_lump(relig)) %>%
The default behaviour is to progressively lump together the smallest groups, ensuring that the aggregate is still the smallest group. In this case it's not super helpful: it is true that the majority of Americans are Protestant, but we've probably over-collapsed.
Instead, we can use the `n` parameter to specify how many groups (excluding "Other") we want to keep:
gss_cat %>%
mutate(relig = fct_lump(relig, n = 5)) %>%
count(relig, sort = TRUE)
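To compare the two behaviours side by side, here is a sketch on a made-up factor with one dominant level (the counts are invented for illustration). I've left out the expected output for the default case, since the point is only that "Other" never becomes larger than a level that was kept:

```r
library(forcats)

# Made-up factor: "a" dominates, the other levels are small
f <- factor(rep(c("a", "b", "c", "d"), times = c(10, 3, 2, 1)))

# Default: lump the smallest levels while "Other" stays the smallest group
lumped_default <- fct_lump(f)
table(lumped_default)

# n = 2: keep the two most common levels and lump the rest
lumped_n <- fct_lump(f, n = 2)
table(lumped_n)
```

With `n = 2`, both "a" and "b" survive and only "c" and "d" are merged, which is usually closer to what you want when one level dominates.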
### Exercises