From 344b5493a99824fc758710462ffba6abb7f0b641 Mon Sep 17 00:00:00 2001 From: hadley Date: Wed, 17 Aug 2016 10:23:57 -0500 Subject: [PATCH] Filling in some text about factors --- factors.Rmd | 204 ++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 165 insertions(+), 39 deletions(-) diff --git a/factors.Rmd b/factors.Rmd index 7c228df..6454fb5 100644 --- a/factors.Rmd +++ b/factors.Rmd @@ -26,22 +26,47 @@ There are two ways to create a factor: during import with readr, using `col_fact To turn a string into a factor, call `factor()`, supplying list of possible values: ```{r} - +x <- c("pear", "apple", "banana", "apple", "pear", "apple") +factor(x, levels = c("apple", "banana", "pear")) ``` -For the rest of this chapter, we're going to focus on `forcats::gss_cat`. It's a sample of variables from the [General Social Survey](https://gssdataexplorer.norc.org/). The variables have been selected to illustrate a number of challenges with working with factors. +Any values not in the list of levels will be silently converted to `NA`: + +```{r} +factor(x, levels = c("apple", "banana")) +``` + +If you omit the levels, they'll be taken from the data in alphabetical order: + +```{r} +factor(x) +``` + +Sometimes you'd prefer that the order of the levels match the order of the first appearance in the data. You can do that during creation by setting levels to `unique(x)`, or after the fact with `fct_inorder()`: + +```{r} +factor(x, levels = unique(x)) + +f <- factor(x) +f <- fct_inorder(f) +f +``` + +You can access the levels of the factor with `levels()`: + +```{r} +levels(f) +``` + +## General Social Survey + +For the rest of this chapter, we're going to focus on `forcats::gss_cat`. It's a sample of variables from the [General Social Survey](http://gss.norc.org), which is a long-running US survey run by the independent research organization NORC at the University of Chicago. 
The survey has thousands of questions, and in `gss_cat` I've selected a handful of variables to illustrate some common challenges you'll hit when working with factors. ```{r} gss_cat -```` - -You can see the levels of a factor with `levels()`: - -```{r} -levels(gss_cat$race) ``` -And this order is preserved in operations like `count()`: +Note that the order of levels is preserved in operations like `count()`: ```{r} gss_cat %>% @@ -55,7 +80,7 @@ ggplot(gss_cat, aes(race)) + geom_bar() ``` -Note that by default, ggplot2 will drop levels that don't have any values. You can force them to appear with : +By default, ggplot2 will drop levels that don't have any values. You can force them to appear with: ```{r} ggplot(gss_cat, aes(race)) + @@ -63,10 +88,27 @@ ggplot(gss_cat, aes(race)) + scale_x_discrete(drop = FALSE) ``` -Currently dplyr doesn't have a `drop` option, but it will in the future. +Unfortunately, dplyr doesn't yet have a `drop` option, but it will in the future. + +### Exercise + ## Modifying factor order +The order of a factor's levels can be meaningful or arbitrary: + +* arbitrary: where the order of the factor levels is arbitrary, like race, sex, + or religion. You have to pick an order for display, but it doesn't mean + anything. + +* meaningful: where the order of levels reflects an underlying order like + party affiliation (from strong republican - independent - strong democrat) + or income (from low to high). + +Generally, you should avoid jumbling the order if it's meaningful. + +Let's take a look with a concrete example. Here I compute the average number of TV hours for each religion: + ```{r} relig <- gss_cat %>% group_by(relig) %>% @@ -77,10 +119,16 @@ relig <- gss_cat %>% ) ggplot(relig, aes(tvhours, relig)) + geom_point() -ggplot(relig, aes(tvhours, fct_reorder(relig, tvhours))) + geom_point() ``` -If you just want to pull a couple of levels out to the front, you can use `fct_relevel()`. 
+This plot is a little hard to take in because the order of the religions is basically arbitrary. We can improve it by reordering the levels of `relig` by `tvhours`. This makes it much easier to see that people in the "Don't know" category watch much more TV, and Hinduism & Other Eastern religions watch much less. + +```{r} +ggplot(relig, aes(tvhours, fct_reorder(relig, tvhours))) + + geom_point() +``` + +What if we do the same thing for income levels? ```{r} rincome <- gss_cat %>% @@ -92,54 +140,132 @@ rincome <- gss_cat %>% ) ggplot(rincome, aes(age, rincome)) + geom_point() - -gss_cat %>% count(fct_rev(rincome)) ``` -`fct_rev(rincome)` -`fct_reorder(religion, rincome)` -`fct_reorder2(religion, year, rincome)` - +Here, arbitrarily reordering the levels isn't a good idea because `rincome` already has a meaningful order! ```{r} -by_year <- gss_cat %>% - group_by(year, marital) %>% +ggplot(rincome, aes(age, fct_reorder(rincome, age))) + geom_point() +``` + +But it does make sense to pull "Not applicable" to the front with the other special levels. You can do that with `fct_relevel()`. Why do you think the average age for "Not applicable" is so high? + +```{r} +ggplot(rincome, aes(age, fct_relevel(rincome, "Not applicable"))) + + geom_point() +``` + +Another variation of `fct_reorder()` is useful when you are colouring the lines on a plot. Using `fct_reorder2()` makes the line colours nicely match the order of the legend. + +```{r, fig.align = "default", out.width = "50%"} +by_age <- gss_cat %>% + group_by(age, marital) %>% count() %>% mutate(prop = n / sum(n)) -ggplot(by_year, aes(year, prop, colour = marital)) + - geom_line() - -ggplot(by_year, aes(year, prop, colour = fct_reorder2(marital, year, prop))) + +ggplot(by_age, aes(age, prop, colour = marital)) + geom_line() +ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) + + geom_line() + + labs(colour = "marital") ``` +Finally, for bar plots, you can use `fct_infreq()` to order levels by decreasing frequency. You may want to combine it with `fct_rev()` to get increasing frequency instead. 
+ +```{r} +gss_cat %>% + mutate(marital = marital %>% fct_infreq() %>% fct_rev()) %>% + ggplot(aes(marital)) + + geom_bar() +``` + +### Exercises + +1. There are some suspiciously high numbers in `tvhours`. Is the mean a good + summary? + +1. For each factor in `gss_cat` identify whether the order is arbitrary + or meaningful. + +1. Recreate the display of marital status by age, using `geom_area()` instead + of `geom_line()`. What do you need to change in the plot? How might you + tweak the levels? + ## Modifying factor levels -`fct_recode()` is the most general. It allows you to transform levels. +More powerful than changing the order of the levels is changing their values. This allows you to clarify labels for publication, and collapse levels for high-level displays. -### Manually grouping +The most general and powerful tool is `fct_recode()`. It allows you to recode, or change, the value of each level. For example, take `gss_cat$partyid`: ```{r} -fct_count(fct_collapse(gss_cat$partyid, - other = c("No answer", "Don't know", "Other party"), - rep = c("Strong republican", "Not str republican"), - ind = c("Ind,near rep", "Independent", "Ind,near dem"), - dem = c("Not str democrat", "Strong democrat") -)) +gss_cat %>% count(partyid) ``` -### Lumping small groups together +The levels are a little hard to read. Let's tweak them to be longer and more consistent. Any levels that aren't explicitly mentioned will be left as is. 
```{r} -gss_cat %>% mutate(relig = fct_lump(relig)) %>% count(relig) -gss_cat %>% mutate(relig = fct_lump(relig, 5)) %>% count(relig, sort = TRUE) +gss_cat %>% + mutate(partyid = fct_recode(partyid, + "Republican, strong" = "Strong republican", + "Republican, weak" = "Not str republican", + "Independent, near rep" = "Ind,near rep", + "Independent, near dem" = "Ind,near dem", + "Democrat, weak" = "Not str democrat", + "Democrat, strong" = "Strong democrat" + )) %>% + count(partyid) ``` +You can assign multiple old levels to the same new level: + ```{r} -gss_cat$relig %>% fct_infreq() %>% fct_lump(5) %>% fct_count() -gss_cat$relig %>% fct_lump(5) %>% fct_infreq() %>% fct_count() +gss_cat %>% + mutate(partyid = fct_recode(partyid, + "Republican, strong" = "Strong republican", + "Republican, weak" = "Not str republican", + "Independent, near rep" = "Ind,near rep", + "Independent, near dem" = "Ind,near dem", + "Democrat, weak" = "Not str democrat", + "Democrat, strong" = "Strong democrat", + "Other" = "No answer", + "Other" = "Don't know", + "Other" = "Other party" + )) %>% + count(partyid) +``` + +You must use this technique with extreme care: if you group together categories that are truly different you will end up with misleading results. + +If you want to collapse a lot of levels, `fct_collapse()` is a useful variant. For each new variable, you can provide a vector of old levels: + +```{r} +gss_cat %>% + mutate(partyid = fct_collapse(partyid, + other = c("No answer", "Don't know", "Other party"), + rep = c("Strong republican", "Not str republican"), + ind = c("Ind,near rep", "Independent", "Ind,near dem"), + dem = c("Not str democrat", "Strong democrat") + )) %>% + count(partyid) ``` -`fct_reorder()` is sometimes also useful. It... +Sometimes you just want to lump together all the small groups to make a plot or table simpler. 
That's the job of `fct_lump()`: + +```{r} +gss_cat %>% + mutate(relig = fct_lump(relig)) %>% + count(relig) +``` + +The default behaviour is to progressively lump together the smallest groups, ensuring that the aggregate is still the smallest group. In this case it's not very helpful: it is true that the majority of Americans in this survey are Protestant, but we've probably over-collapsed. + +Instead, we can use the `n` parameter to specify how many groups (excluding "Other") we want to keep: + +```{r} +gss_cat %>% + mutate(relig = fct_lump(relig, n = 5)) %>% + count(relig, sort = TRUE) +``` + +### Exercises