Factors polishing

Add alt text. Update code style
Hadley Wickham 2022-05-03 16:02:13 -05:00
parent 7f43bdd7a2
commit 0a4a5c3d55
1 changed files with 94 additions and 49 deletions


@ -2,13 +2,12 @@
## Introduction
In R, factors are used to work with categorical variables, variables that have a fixed and known set of possible values.
Factors are used for categorical variables, variables that have a fixed and known set of possible values.
They are also useful when you want to display character vectors in a non-alphabetical order.
Historically, factors were much easier to work with than characters.
As a result, many of the functions in base R automatically convert characters to factors.
This means that factors often crop up in places where they're not actually helpful.
Fortunately, you don't need to worry about that in the tidyverse, and can focus on situations where factors are genuinely useful.
If, after reading this chapter, you want to learn more about factors, I recommend reading Amelia McNamara and Nicholas Horton's paper, [*Wrangling categorical data in R*](https://peerj.com/preprints/3163/).
This paper lays out some of the history discussed in [*stringsAsFactors: An unauthorized biography*](https://simplystatistics.org/posts/2015-07-24-stringsasfactors-an-unauthorized-biography/) and [*stringsAsFactors = \<sigh\>*](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh), and compares the tidy approaches to categorical data outlined in this book with base R methods.
An early version of the paper helped motivate and scope the forcats package; thanks Amelia & Nick!
### Prerequisites
@ -19,12 +18,6 @@ It provides tools for dealing with **cat**egorical variables (and it's an anagra
library(tidyverse)
```
### Learning more
If you want to learn more about factors, I recommend reading Amelia McNamara and Nicholas Horton's paper, [*Wrangling categorical data in R*](https://peerj.com/preprints/3163/).
This paper lays out some of the history discussed in [*stringsAsFactors: An unauthorized biography*](https://simplystatistics.org/posts/2015-07-24-stringsasfactors-an-unauthorized-biography/) and [*stringsAsFactors = \<sigh\>*](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh), and compares the tidy approaches to categorical data outlined in this book with base R methods.
An early version of the paper helped motivate and scope the forcats package; thanks Amelia & Nick!
## Creating factors
Imagine that you have a variable that records month:
@ -103,8 +96,8 @@ levels(f2)
## General Social Survey
For the rest of this chapter, we're going to focus on `forcats::gss_cat`.
It's a sample of data from the [General Social Survey](http://gss.norc.org), which is a long-running US survey conducted by the independent research organization NORC at the University of Chicago.
For the rest of this chapter, we're going to use `forcats::gss_cat`.
It's a sample of data from the [General Social Survey](http://gss.norc.org), a long-running US survey conducted by the independent research organization NORC at the University of Chicago.
The survey has thousands of questions, so in `gss_cat` I've selected a handful that will illustrate some common challenges you'll encounter when working with factors.
```{r}
@ -124,6 +117,10 @@ gss_cat |>
Or with a bar chart:
```{r}
#| fig.alt: >
#| A bar chart showing the distribution of race. There are ~2000
#| records with race "Other", 3000 with race "Black", and over
#| 15,000 with race "White".
ggplot(gss_cat, aes(race)) +
geom_bar()
```
@ -132,6 +129,9 @@ By default, ggplot2 will drop levels that don't have any values.
You can force them to display with:
```{r}
#| fig.alt: >
#| The same bar chart as the last plot, but now with a missing bar on
#| the far right with label "Not applicable".
ggplot(gss_cat, aes(race)) +
geom_bar() +
scale_x_discrete(drop = FALSE)
@ -142,8 +142,7 @@ In dplyr::count() set the `.drop` option to `FALSE`, to show these.
```{r}
gss_cat |>
count(race,
.drop = FALSE)
count(race, .drop = FALSE)
```
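
To make the effect of `.drop` concrete, here is a minimal sketch on a toy data frame. The data and the names `toy`, `kept`, and `dropped` are hypothetical and not part of `gss_cat`; the level `"c"` is declared but never observed.

```{r}
library(dplyr)

# Hypothetical toy data: "c" is a declared level with no observations
toy <- data.frame(
  x = factor(c("a", "b", "a"), levels = c("a", "b", "c"))
)

# Default behaviour: only levels that actually occur are counted
dropped <- toy |> count(x)

# With .drop = FALSE the empty level "c" appears with n = 0
kept <- toy |> count(x, .drop = FALSE)

dropped
kept
```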
When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels.
@ -160,14 +159,18 @@ Those operations are described in the sections below.
3. Which `relig` does `denom` (denomination) apply to?
How can you find out with a table?
How can you find out with a visualisation?
How can you find out with a visualization?
## Modifying factor order
It's often useful to change the order of the factor levels in a visualisation.
It's often useful to change the order of the factor levels in a visualization.
For example, imagine you want to explore the average number of hours spent watching TV per day across religions:
```{r}
#| fig.alt: >
#| A scatterplot with tvhours on the x-axis and religion on the y-axis.
#| The y-axis is ordered seemingly arbitrarily, making it hard to get
#| any sense of overall pattern.
relig_summary <- gss_cat |>
group_by(relig) |>
summarise(
@ -176,7 +179,8 @@ relig_summary <- gss_cat |>
n = n()
)
ggplot(relig_summary, aes(tvhours, relig)) + geom_point()
ggplot(relig_summary, aes(tvhours, relig)) +
geom_point()
```
It is difficult to interpret this plot because there's no overall pattern.
@ -188,6 +192,10 @@ We can improve it by reordering the levels of `relig` using `fct_reorder()`.
- Optionally, `fun`, a function that's used if there are multiple values of `x` for each value of `f`. The default value is `median`.
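
As a rough illustration of how the summary function changes the result, the sketch below reorders a small hypothetical factor by the default median and then by the minimum. The vectors `f` and `x` are made up for this example, and the argument is spelled `.fun`, which is the name used by current forcats releases.

```{r}
library(forcats)

# Hypothetical toy data: two values of x for each level of f
f <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(5, 7, 1, 3, 2, 20)

# Default summary is the median: a = 6, b = 2, c = 11
by_median <- levels(fct_reorder(f, x))

# Using the minimum instead: a = 5, b = 1, c = 2
by_min <- levels(fct_reorder(f, x, .fun = min))

by_median  # "b" "a" "c"
by_min     # "b" "c" "a"
```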
```{r}
#| fig.alt: >
#| The same scatterplot as above, but now the religion is displayed in
#| increasing order of tvhours. "Other eastern" has the fewest tvhours
#| (under 2), and "Don't know" has the highest (over 5).
ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
geom_point()
```
@ -201,12 +209,17 @@ For example, you could rewrite the plot above as:
relig_summary |>
mutate(relig = fct_reorder(relig, tvhours)) |>
ggplot(aes(tvhours, relig)) +
geom_point()
geom_point()
```
What if we create a similar plot looking at how average age varies across reported income level?
```{r}
#| fig.alt: >
#| A scatterplot with age on the x-axis and income on the y-axis. Income
#| has been reordered in order of average age which doesn't make much
#| sense. One section of the y-axis goes from $6000-6999, then <$1000,
#| then $8000-9999.
rincome_summary <- gss_cat |>
group_by(rincome) |>
summarise(
@ -215,7 +228,8 @@ rincome_summary <- gss_cat |>
n = n()
)
ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) + geom_point()
ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) +
geom_point()
```
Here, arbitrarily reordering the levels isn't a good idea!
@ -227,22 +241,43 @@ You can use `fct_relevel()`.
It takes a factor, `f`, and then any number of levels that you want to move to the front of the line.
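
A minimal sketch of that behaviour, using a hypothetical four-level factor rather than `rincome`: the named level moves to the front and the remaining levels keep their existing order.

```{r}
library(forcats)

# Hypothetical factor with a principled order plus a "n/a" level at the end
income <- factor(
  c("low", "high", "n/a", "mid"),
  levels = c("low", "mid", "high", "n/a")
)

# Move "n/a" to the front; the other levels stay in their original order
releveled <- levels(fct_relevel(income, "n/a"))

releveled  # "n/a" "low" "mid" "high"
```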
```{r}
#| fig.alt: >
#| The same scatterplot but now "Not applicable" is displayed at the
#| bottom of the y-axis. Generally there is a positive association
#| between income and age, and the income band with the highest average
#| age is "Not applicable".
ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
geom_point()
```
Why do you think the average age for "Not applicable" is so high?
Another type of reordering is useful when you are colouring the lines on a plot.
Another type of reordering is useful when you are coloring the lines on a plot.
`fct_reorder2()` reorders the factor by the `y` values associated with the largest `x` values.
This makes the plot easier to read because the line colours line up with the legend.
This makes the plot easier to read because the line colors line up with the legend.
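
The following sketch shows the ordering rule on a tiny hypothetical data frame (`toy`, `grp`, `x`, and `y` are illustrative names only). With the defaults, `fct_reorder2()` looks at the `y` value at the largest `x` within each group and sorts the levels from highest to lowest, which is the order in which the lines finish on the right-hand side of a plot.

```{r}
library(forcats)

# Hypothetical data: three groups, each observed at x = 1 and x = 2
toy <- data.frame(
  grp = factor(c("a", "a", "b", "b", "c", "c")),
  x   = c(1, 2, 1, 2, 1, 2),
  y   = c(1, 2, 5, 9, 3, 4)
)

# y at the largest x (x = 2): a = 2, b = 9, c = 4
reordered <- levels(fct_reorder2(toy$grp, toy$x, toy$y))

reordered  # "b" "c" "a"
```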
```{r, fig.align = "default", out.width = "50%", fig.width = 4}
#| fig.alt:
#| - >
#| A line plot with age on the x-axis and proportion on the y-axis.
#| There is one line for each category of marital status: no answer,
#| never married, separated, divorced, widowed, and married. It is
#| a little hard to read the plot because the order of the legend is
#| unrelated to the lines on the plot.
#| - >
#| Rearranging the legend makes the plot easier to read because the
#| legend colors now match the order of the lines on the far right
#| of the plot. You can see some unsurprising patterns: the proportion
#| never married decreases with age, married forms an upside down U
#| shape, and widowed starts off low but increases steeply after age
#| 60.
by_age <- gss_cat |>
filter(!is.na(age)) |>
count(age, marital) |>
group_by(age) |>
mutate(prop = n / sum(n))
mutate(
prop = n / sum(n)
)
ggplot(by_age, aes(age, prop, colour = marital)) +
geom_line(na.rm = TRUE)
@ -256,10 +291,14 @@ Finally, for bar plots, you can use `fct_infreq()` to order levels in increasing
You may want to combine with `fct_rev()`.
```{r}
#| fig.alt: >
#| A bar chart of marital status ordered from least to most common:
#| no answer (~0), separated (~1,000), widowed (~2,000), divorced
#| (~3,000), never married (~5,000), married (~10,000).
gss_cat |>
mutate(marital = marital |> fct_infreq() |> fct_rev()) |>
ggplot(aes(marital)) +
geom_bar()
geom_bar()
```
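
To see what each step in that pipeline does to the levels, here is a small sketch on a hypothetical factor `f`. `fct_infreq()` puts the most common level first, and `fct_rev()` then reverses the order so that the least common level comes first.

```{r}
library(forcats)

# Hypothetical factor: "z" occurs three times, "y" twice, "x" once
f <- factor(c("x", "y", "y", "z", "z", "z"))

# Most frequent level first
by_freq <- levels(fct_infreq(f))

# Reversed: least frequent level first
by_freq_rev <- levels(fct_rev(fct_infreq(f)))

by_freq      # "z" "y" "x"
by_freq_rev  # "x" "y" "z"
```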
### Exercises
@ -288,14 +327,16 @@ Let's tweak them to be longer and use a parallel construction.
```{r}
gss_cat |>
mutate(partyid = fct_recode(partyid,
"Republican, strong" = "Strong republican",
"Republican, weak" = "Not str republican",
"Independent, near rep" = "Ind,near rep",
"Independent, near dem" = "Ind,near dem",
"Democrat, weak" = "Not str democrat",
"Democrat, strong" = "Strong democrat"
)) |>
mutate(
partyid = fct_recode(partyid,
"Republican, strong" = "Strong republican",
"Republican, weak" = "Not str republican",
"Independent, near rep" = "Ind,near rep",
"Independent, near dem" = "Ind,near dem",
"Democrat, weak" = "Not str democrat",
"Democrat, strong" = "Strong democrat"
)
) |>
count(partyid)
```
@ -305,17 +346,19 @@ To combine groups, you can assign multiple old levels to the same new level:
```{r}
gss_cat |>
mutate(partyid = fct_recode(partyid,
"Republican, strong" = "Strong republican",
"Republican, weak" = "Not str republican",
"Independent, near rep" = "Ind,near rep",
"Independent, near dem" = "Ind,near dem",
"Democrat, weak" = "Not str democrat",
"Democrat, strong" = "Strong democrat",
"Other" = "No answer",
"Other" = "Don't know",
"Other" = "Other party"
)) |>
mutate(
partyid = fct_recode(partyid,
"Republican, strong" = "Strong republican",
"Republican, weak" = "Not str republican",
"Independent, near rep" = "Ind,near rep",
"Independent, near dem" = "Ind,near dem",
"Democrat, weak" = "Not str democrat",
"Democrat, strong" = "Strong democrat",
"Other" = "No answer",
"Other" = "Don't know",
"Other" = "Other party"
)
) |>
count(partyid)
```
@ -326,12 +369,14 @@ For each new variable, you can provide a vector of old levels:
```{r}
gss_cat |>
mutate(partyid = fct_collapse(partyid,
other = c("No answer", "Don't know", "Other party"),
rep = c("Strong republican", "Not str republican"),
ind = c("Ind,near rep", "Independent", "Ind,near dem"),
dem = c("Not str democrat", "Strong democrat")
)) |>
mutate(
partyid = fct_collapse(partyid,
other = c("No answer", "Don't know", "Other party"),
rep = c("Strong republican", "Not str republican"),
ind = c("Ind,near rep", "Independent", "Ind,near dem"),
dem = c("Not str democrat", "Strong democrat")
)
) |>
count(partyid)
```
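
The same idea on a deliberately tiny, hypothetical factor may make the mechanics easier to follow: each argument name becomes a new level, and the character vector lists the old levels that are folded into it. The names `f`, `ab`, and `cd` are made up for this sketch.

```{r}
library(forcats)

# Hypothetical factor with four levels
f <- factor(c("a", "b", "c", "d", "a"))

# Collapse four old levels into two new ones
collapsed <- fct_collapse(f,
  ab = c("a", "b"),
  cd = c("c", "d")
)

levels(collapsed)  # "ab" "cd"
table(collapsed)   # ab: 3, cd: 2
```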