From 0a4a5c3d55375e095121be084d565bba070f7c44 Mon Sep 17 00:00:00 2001
From: Hadley Wickham
Date: Tue, 3 May 2022 16:02:13 -0500
Subject: [PATCH] Factors polishing

Add alt text. Update code style
---
 factors.Rmd | 143 ++++++++++++++++++++++++++++++++++------------------
 1 file changed, 94 insertions(+), 49 deletions(-)

diff --git a/factors.Rmd b/factors.Rmd
index 5538d35..eed2c79 100644
--- a/factors.Rmd
+++ b/factors.Rmd
@@ -2,13 +2,12 @@
 
 ## Introduction
 
-In R, factors are used to work with categorical variables, variables that have a fixed and known set of possible values.
+Factors are used for categorical variables, variables that have a fixed and known set of possible values.
 They are also useful when you want to display character vectors in a non-alphabetical order.
 
-Historically, factors were much easier to work with than characters.
-As a result, many of the functions in base R automatically convert characters to factors.
-This means that factors often crop up in places where they're not actually helpful.
-Fortunately, you don't need to worry about that in the tidyverse, and can focus on situations where factors are genuinely useful.
+If, after reading this chapter, you want to learn more about factors, I recommend reading Amelia McNamara and Nicholas Horton's paper, [*Wrangling categorical data in R*](https://peerj.com/preprints/3163/).
+This paper lays out some of the history discussed in [*stringsAsFactors: An unauthorized biography*](https://simplystatistics.org/posts/2015-07-24-stringsasfactors-an-unauthorized-biography/) and [*stringsAsFactors = \<sigh\>*](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh), and compares the tidy approaches to categorical data outlined in this book with base R methods.
+An early version of the paper helped motivate and scope the forcats package; thanks Amelia & Nick!
 
 ### Prerequisites
 
@@ -19,12 +18,6 @@ It provides tools for dealing with **cat**egorical variables (and it's an anagra
 library(tidyverse)
 ```
 
-### Learning more
-
-If you want to learn more about factors, I recommend reading Amelia McNamara and Nicholas Horton's paper, [*Wrangling categorical data in R*](https://peerj.com/preprints/3163/).
-This paper lays out some of the history discussed in [*stringsAsFactors: An unauthorized biography*](https://simplystatistics.org/posts/2015-07-24-stringsasfactors-an-unauthorized-biography/) and [*stringsAsFactors = \<sigh\>*](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh), and compares the tidy approaches to categorical data outlined in this book with base R methods.
-An early version of the paper helped motivate and scope the forcats package; thanks Amelia & Nick!
-
 ## Creating factors
 
 Imagine that you have a variable that records month:
@@ -103,8 +96,8 @@ levels(f2)
 
 ## General Social Survey
 
-For the rest of this chapter, we're going to focus on `forcats::gss_cat`.
-It's a sample of data from the [General Social Survey](http://gss.norc.org), which is a long-running US survey conducted by the independent research organization NORC at the University of Chicago.
+For the rest of this chapter, we're going to use `forcats::gss_cat`.
+It's a sample of data from the [General Social Survey](http://gss.norc.org), a long-running US survey conducted by the independent research organization NORC at the University of Chicago.
 The survey has thousands of questions, so in `gss_cat` I've selected a handful that will illustrate some common challenges you'll encounter when working with factors.
 
 ```{r}
@@ -124,6 +117,10 @@ gss_cat |>
 Or with a bar chart:
 
 ```{r}
+#| fig.alt: >
+#|   A bar chart showing the distribution of race. There are ~2000
+#|   records with race "Other", 3000 with race "Black", and over
+#|   15,000 with race "White".
 ggplot(gss_cat, aes(race)) +
   geom_bar()
 ```
@@ -132,6 +129,9 @@ By default, ggplot2 will drop levels that don't have any values.
 You can force them to display with:
 
 ```{r}
+#| fig.alt: >
+#|   The same bar chart as the last plot, but now with a missing bar on
+#|   the far right with label "Not applicable".
 ggplot(gss_cat, aes(race)) +
   geom_bar() +
   scale_x_discrete(drop = FALSE)
@@ -142,8 +142,7 @@ In dplyr::count() set the `.drop` option to `FALSE`, to show these.
 
 ```{r}
 gss_cat |>
-  count(race,
-        .drop = FALSE)
+  count(race, .drop = FALSE)
 ```
 
 When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels.
@@ -160,14 +159,18 @@ Those operations are described in the sections below.
 
 3.  Which `relig` does `denom` (denomination) apply to?
     How can you find out with a table?
-    How can you find out with a visualisation?
+    How can you find out with a visualization?
 
 ## Modifying factor order
 
-It's often useful to change the order of the factor levels in a visualisation.
+It's often useful to change the order of the factor levels in a visualization.
 For example, imagine you want to explore the average number of hours spent watching TV per day across religions:
 
 ```{r}
+#| fig.alt: >
+#|   A scatterplot with tvhours on the x-axis and religion on the y-axis.
+#|   The y-axis is ordered seemingly arbitrarily making it hard to get
+#|   any sense of overall pattern.
 relig_summary <- gss_cat |>
   group_by(relig) |>
   summarise(
@@ -176,7 +179,8 @@ relig_summary <- gss_cat |>
     n = n()
   )
 
-ggplot(relig_summary, aes(tvhours, relig)) + geom_point()
+ggplot(relig_summary, aes(tvhours, relig)) +
+  geom_point()
 ```
 
 It is difficult to interpret this plot because there's no overall pattern.
@@ -188,6 +192,10 @@ We can improve it by reordering the levels of `relig` using `fct_reorder()`.
 -   Optionally, `fun`, a function that's used if there are multiple values of `x` for each value of `f`. The default value is `median`.
 
 ```{r}
+#| fig.alt: >
+#|   The same scatterplot as above, but now the religion is displayed in
+#|   increasing order of tvhours. "Other eastern" has the fewest tvhours
+#|   under 2, and "Don't know" has the highest (over 5).
 ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
   geom_point()
 ```
@@ -201,12 +209,17 @@ For example, you could rewrite the plot above as:
 relig_summary |>
   mutate(relig = fct_reorder(relig, tvhours)) |>
   ggplot(aes(tvhours, relig)) +
-    geom_point()
+  geom_point()
 ```
 
 What if we create a similar plot looking at how average age varies across reported income level?
 
 ```{r}
+#| fig.alt: >
+#|   A scatterplot with age on the x-axis and income on the y-axis. Income
+#|   has been reordered in order of average age which doesn't make much
+#|   sense. One section of the y-axis goes from $6000-6999, then <$1000,
+#|   then $8000-9999.
 rincome_summary <- gss_cat |>
   group_by(rincome) |>
   summarise(
@@ -215,7 +228,8 @@ rincome_summary <- gss_cat |>
     n = n()
   )
 
-ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) + geom_point()
+ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) +
+  geom_point()
 ```
 
 Here, arbitrarily reordering the levels isn't a good idea!
@@ -227,22 +241,43 @@ You can use `fct_relevel()`.
 It takes a factor, `f`, and then any number of levels that you want to move to the front of the line.
 
 ```{r}
+#| fig.alt: >
+#|   The same scatterplot but now "Not applicable" is displayed at the
+#|   bottom of the y-axis. Generally there is a positive association
+#|   between income and age, and the income band with the highest average
+#|   age is "Not applicable".
 ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
   geom_point()
 ```
 
 Why do you think the average age for "Not applicable" is so high?
 
-Another type of reordering is useful when you are colouring the lines on a plot.
+Another type of reordering is useful when you are coloring the lines on a plot.
 `fct_reorder2()` reorders the factor by the `y` values associated with the largest `x` values.
-This makes the plot easier to read because the line colours line up with the legend.
+This makes the plot easier to read because the line colors line up with the legend.
 
 ```{r, fig.align = "default", out.width = "50%", fig.width = 4}
+#| fig.alt:
+#|   - >
+#|     A line plot with age on the x-axis and proportion on the y-axis.
+#|     There is one line for each category of marital status: no answer,
+#|     never married, separated, divorced, widowed, and married. It is
+#|     a little hard to read the plot because the order of the legend is
+#|     unrelated to the lines on the plot.
+#|   - >
+#|     Rearranging the legend makes the plot easier to read because the
+#|     legend colors now match the order of the lines on the far right
+#|     of the plot. You can see some unsurprising patterns: the proportion
+#|     never married decreases with age, married forms an upside down U
+#|     shape, and widowed starts off low but increases steeply after age
+#|     60.
 by_age <- gss_cat |>
   filter(!is.na(age)) |>
   count(age, marital) |>
   group_by(age) |>
-  mutate(prop = n / sum(n))
+  mutate(
+    prop = n / sum(n)
+  )
 
 ggplot(by_age, aes(age, prop, colour = marital)) +
   geom_line(na.rm = TRUE)
@@ -256,10 +291,14 @@ Finally, for bar plots, you can use `fct_infreq()` to order levels in increasing
 You may want to combine with `fct_rev()`.
 
 ```{r}
+#| fig.alt: >
+#|   A bar chart of marital status ordered from least to most common:
+#|   no answer (~0), separated (~1,000), widowed (~2,000), divorced
+#|   (~3,000), never married (~5,000), married (~10,000).
 gss_cat |>
   mutate(marital = marital |> fct_infreq() |> fct_rev()) |>
   ggplot(aes(marital)) +
-    geom_bar()
+  geom_bar()
 ```
 
 ### Exercises
@@ -288,14 +327,16 @@ Let's tweak them to be longer and use a parallel construction.
 
 ```{r}
 gss_cat |>
-  mutate(partyid = fct_recode(partyid,
-    "Republican, strong" = "Strong republican",
-    "Republican, weak" = "Not str republican",
-    "Independent, near rep" = "Ind,near rep",
-    "Independent, near dem" = "Ind,near dem",
-    "Democrat, weak" = "Not str democrat",
-    "Democrat, strong" = "Strong democrat"
-  )) |>
+  mutate(
+    partyid = fct_recode(partyid,
+      "Republican, strong" = "Strong republican",
+      "Republican, weak" = "Not str republican",
+      "Independent, near rep" = "Ind,near rep",
+      "Independent, near dem" = "Ind,near dem",
+      "Democrat, weak" = "Not str democrat",
+      "Democrat, strong" = "Strong democrat"
+    )
+  ) |>
   count(partyid)
 ```
 
@@ -305,17 +346,19 @@ To combine groups, you can assign multiple old levels to the same new level:
 
 ```{r}
 gss_cat |>
-  mutate(partyid = fct_recode(partyid,
-    "Republican, strong" = "Strong republican",
-    "Republican, weak" = "Not str republican",
-    "Independent, near rep" = "Ind,near rep",
-    "Independent, near dem" = "Ind,near dem",
-    "Democrat, weak" = "Not str democrat",
-    "Democrat, strong" = "Strong democrat",
-    "Other" = "No answer",
-    "Other" = "Don't know",
-    "Other" = "Other party"
-  )) |>
+  mutate(
+    partyid = fct_recode(partyid,
+      "Republican, strong" = "Strong republican",
+      "Republican, weak" = "Not str republican",
+      "Independent, near rep" = "Ind,near rep",
+      "Independent, near dem" = "Ind,near dem",
+      "Democrat, weak" = "Not str democrat",
+      "Democrat, strong" = "Strong democrat",
+      "Other" = "No answer",
+      "Other" = "Don't know",
+      "Other" = "Other party"
+    )
+  ) |>
   count(partyid)
 ```
 
@@ -326,12 +369,14 @@ For each new variable, you can provide a vector of old levels:
 
 ```{r}
 gss_cat |>
-  mutate(partyid = fct_collapse(partyid,
-    other = c("No answer", "Don't know", "Other party"),
-    rep = c("Strong republican", "Not str republican"),
-    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
-    dem = c("Not str democrat", "Strong democrat")
-  )) |>
+  mutate(
+    partyid = fct_collapse(partyid,
+      other = c("No answer", "Don't know", "Other party"),
+      rep = c("Strong republican", "Not str republican"),
+      ind = c("Ind,near rep", "Independent", "Ind,near dem"),
+      dem = c("Not str democrat", "Strong democrat")
+    )
+  ) |>
   count(partyid)
 ```