# Factors

## Introduction

In R, factors are used to work with categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.

Historically, factors were much easier to work with than characters. As a result, many of the functions in base R automatically convert characters to factors. This means that factors often crop up in places where they're not actually helpful. Fortunately, you don't need to worry about that in the tidyverse, and can focus on situations where factors are genuinely useful.

### Prerequisites

To work with factors, we'll use the __forcats__ package, which provides tools for dealing with **cat**egorical variables (and it's an anagram of factors!). It provides a wide range of helpers for working with factors. forcats is not part of the core tidyverse, so we need to load it explicitly.

```{r setup, message = FALSE}
library(tidyverse)
library(forcats)
```

### Learning more

If you want to learn more about factors, I recommend reading Amelia McNamara and Nicholas Horton’s paper, [_Wrangling categorical data in R_](https://peerj.com/preprints/3163/). This paper lays out some of the history discussed in [_stringsAsFactors: An unauthorized biography_](http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) and [_stringsAsFactors = \<sigh\>_](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh), and compares the tidy approaches to categorical data outlined in this book with base R methods. An early version of the paper helped motivate and scope the forcats package; thanks Amelia & Nick!

## Creating factors

Imagine that you have a variable that records month:

```{r}
x1 <- c("Dec", "Apr", "Jan", "Mar")
```

Using a string to record this variable has two problems:

1.  There are only twelve possible months, and there's nothing saving you
    from typos:
     
    ```{r}
    x2 <- c("Dec", "Apr", "Jam", "Mar")
    ```
    
1.  It doesn't sort in a useful way:

    ```{r}
    sort(x1)
    ```

You can fix both of these problems with a factor. To create a factor, you must start by creating a vector of the valid __levels__:

```{r}
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
```

Now you can create a factor:

```{r}
y1 <- factor(x1, levels = month_levels)
y1
sort(y1)
```

And any values not in the set will be silently converted to NA:

```{r}
y2 <- factor(x2, levels = month_levels)
y2
```
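
Because the conversion is silent, it can help to check which inputs failed to match a level. The following is a minimal sketch using only base R; the name `unmatched` is made up for illustration, and the inputs are redefined so the chunk stands alone:

```{r}
# Redefined here so the sketch is self-contained.
x2 <- c("Dec", "Apr", "Jam", "Mar")
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
y2 <- factor(x2, levels = month_levels)

# Inputs that were not missing but became NA did not match any level.
unmatched <- x2[is.na(y2) & !is.na(x2)]
unmatched
```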

If you want a warning, you can use `readr::parse_factor()`:

```{r}
y2 <- parse_factor(x2, levels = month_levels)
```

If you omit the levels, they'll be taken from the data in alphabetical order:

```{r}
factor(x1)
```

Sometimes you'd prefer that the order of the levels match the order of the first appearance in the data. You can do that when creating the factor by setting levels to `unique(x)`, or after the fact, with `fct_inorder()`:

```{r}
f1 <- factor(x1, levels = unique(x1))
f1

f2 <- x1 %>% factor() %>% fct_inorder()
f2
```

If you ever need to access the set of valid levels directly, you can do so with `levels()`:

```{r}
levels(f2)
```
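
It may also help to see why sorting a factor follows the levels rather than the alphabet: each value is stored as an integer position in the level vector. This is a small base R sketch; `codes` is an illustrative name, not something used elsewhere in the chapter:

```{r}
x1 <- c("Dec", "Apr", "Jan", "Mar")
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
y1 <- factor(x1, levels = month_levels)

# Each value is stored as the index of its level.
codes <- as.integer(y1)
codes

# Looking the codes up in the levels recovers the original strings.
levels(y1)[codes]
```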

## General Social Survey

For the rest of this chapter, we're going to focus on `forcats::gss_cat`. It's a sample of data from the [General Social Survey](http://gss.norc.org), which is a long-running US survey conducted by the independent research organization NORC at the University of Chicago. The survey has thousands of questions, so in `gss_cat` I've selected a handful that will illustrate some common challenges you'll encounter when working with factors.

```{r}
gss_cat
```

(Remember, since this dataset is provided by a package, you can get more information about the variables with `?gss_cat`.)

When factors are stored in a tibble, you can't see their levels so easily. One way to see them is with `count()`:

```{r}
gss_cat %>%
  count(race)
```

Or with a bar chart:

```{r}
ggplot(gss_cat, aes(race)) +
  geom_bar()
```

By default, ggplot2 will drop levels that don't have any values. You can force them to display with:

```{r}
ggplot(gss_cat, aes(race)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)
```

These levels represent valid values that simply did not occur in this dataset. Unfortunately, dplyr doesn't yet have a `drop` option, but it will in the future.
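
In the meantime, base R's `table()` reports a zero count for every unused level. The sketch below uses a small made-up factor (`race_toy`) rather than `gss_cat`, so the numbers are illustrative only:

```{r}
# A toy factor with one level that never occurs in the data.
race_toy <- factor(
  c("White", "Black", "White"),
  levels = c("Other", "Black", "White", "Not applicable")
)

# table() keeps every level, including those with zero observations.
counts <- table(race_toy)
counts
```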

When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels. Those operations are described in the sections below.

### Exercises

1.  Explore the distribution of `rincome` (reported income). What makes the
    default bar chart hard to understand? How could you improve the plot?

1.  What is the most common `relig` in this survey? What's the most
    common `partyid`?

1.  Which `relig` does `denom` (denomination) apply to? How can you find
    out with a table? How can you find out with a visualisation?

## Modifying factor order

It's often useful to change the order of the factor levels in a visualisation. For example, imagine you want to explore the average number of hours spent watching TV per day across religions:

```{r}
relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

ggplot(relig_summary, aes(tvhours, relig)) + geom_point()
```

It is difficult to interpret this plot because there's no overall pattern. We can improve it by reordering the levels of `relig` using `fct_reorder()`. `fct_reorder()` takes three arguments:

* `f`, the factor whose levels you want to modify.
* `x`, a numeric vector that you want to use to reorder the levels.
* Optionally, `fun`, a function that's used if there are multiple values of
  `x` for each value of `f`. The default value is `median`.

```{r}
ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
  geom_point()
```

Reordering religion makes it much easier to see that people in the "Don't know" category watch much more TV, and that people in the Hinduism and Other Eastern religion categories watch much less.
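
To see the effect of the summary function in isolation, here is a minimal sketch on made-up data with two observations per level. Recent forcats releases name the arguments `.f`, `.x`, and `.fun`, so the sketch passes the first two by position and names only `.fun`:

```{r}
library(forcats)

# Toy data (hypothetical): two observations per level.
f <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(1, 9, 4, 4, 2, 100)

# Default summary is the median: a = 5, b = 4, c = 51.
by_median <- levels(fct_reorder(f, x))
by_median

# Summarise each level by its minimum instead: a = 1, b = 4, c = 2.
by_min <- levels(fct_reorder(f, x, .fun = min))
by_min
```

Levels are sorted in ascending order of the summarised value, so the choice of summary function can change the ordering.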

As you start making more complicated transformations, I'd recommend moving them out of `aes()` and into a separate `mutate()` step. For example, you could rewrite the plot above as:

```{r, eval = FALSE}
relig_summary %>%
  mutate(relig = fct_reorder(relig, tvhours)) %>%
  ggplot(aes(tvhours, relig)) +
    geom_point()
```

What if we create a similar plot looking at how average age varies across reported income level?

```{r}
rincome_summary <- gss_cat %>%
  group_by(rincome) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) + geom_point()
```

Here, arbitrarily reordering the levels isn't a good idea! That's because `rincome` already has a principled order that we shouldn't mess with. Reserve `fct_reorder()` for factors whose levels are arbitrarily ordered.

However, it does make sense to pull "Not applicable" to the front with the other special levels. You can use `fct_relevel()`. It takes a factor, `f`, and then any number of levels that you want to move to the front of the line.

```{r}
ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
  geom_point()
```

Why do you think the average age for "Not applicable" is so high?
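
Stripped of the plot, the effect of `fct_relevel()` on the levels looks like this. The factor is a made-up stand-in for `rincome`, and the `after = Inf` call is an extra illustration of moving a level to the end instead:

```{r}
library(forcats)

# Toy factor (hypothetical) with a special level listed last.
income <- factor(
  c("low", "high", "Not applicable"),
  levels = c("low", "high", "Not applicable")
)

# Named levels move to the front; the rest keep their relative order.
to_front <- levels(fct_relevel(income, "Not applicable"))
to_front

# after = Inf moves the named level to the end instead.
to_end <- levels(fct_relevel(income, "low", after = Inf))
to_end
```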

Another type of reordering is useful when you are colouring the lines on a plot. `fct_reorder2()` reorders the factor by the `y` values associated with the largest `x` values. This makes the plot easier to read because the line colours line up with the legend.

```{r, fig.align = "default", out.width = "50%", fig.width = 4}
by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  count(age, marital) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n))

ggplot(by_age, aes(age, prop, colour = marital)) +
  geom_line(na.rm = TRUE)

ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
  geom_line() +
  labs(colour = "marital")
```
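
The rule behind `fct_reorder2()` is easier to check on a tiny example. In this sketch, with made-up vectors `g`, `x`, and `y`, each group has one `y` value at the largest `x`, and the levels come out in descending order of that value:

```{r}
library(forcats)

# Toy data (hypothetical): three groups observed at x = 1 and x = 2.
g <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(1, 2, 1, 2, 1, 2)
y <- c(9, 1, 0, 3, 5, 2)

# At the largest x (x = 2) the y values are a = 1, b = 3, c = 2,
# so the levels are ordered from the highest of those values to the lowest.
reordered <- levels(fct_reorder2(g, x, y))
reordered
```

The descending order is what makes the legend line up with the right-hand ends of the lines.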

Finally, for bar plots, you can use `fct_infreq()` to order levels in decreasing frequency: this is the simplest type of reordering because it doesn't need any extra variables. Combine it with `fct_rev()` if you want the bars in increasing frequency instead.

```{r}
gss_cat %>%
  mutate(marital = marital %>% fct_infreq() %>% fct_rev()) %>%
  ggplot(aes(marital)) +
    geom_bar()
```
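
As a quick check of the direction of each function, here is a sketch on a made-up factor: `fct_infreq()` puts the most common level first, and `fct_rev()` reverses whatever order it is given.

```{r}
library(forcats)

# Toy factor (hypothetical): "z" occurs three times, "y" twice, "x" once.
f <- factor(c("x", "y", "y", "z", "z", "z"))

# Most frequent level first.
most_first <- levels(fct_infreq(f))
most_first

# Reversed, so the least frequent level comes first.
least_first <- levels(fct_rev(fct_infreq(f)))
least_first
```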

### Exercises

1.  There are some suspiciously high numbers in `tvhours`. Is the mean a good
    summary?

1.  For each factor in `gss_cat` identify whether the order of the levels is
    arbitrary or principled.

1.  Why did moving "Not applicable" to the front of the levels move it to the
    bottom of the plot?

## Modifying factor levels

More powerful than changing the order of the levels is changing their values. This allows you to clarify labels for publication, and collapse levels for high-level displays. The most general and powerful tool is `fct_recode()`. It allows you to recode, or change, the value of each level. For example, take `gss_cat$partyid`:

```{r}
gss_cat %>% count(partyid)
```

The levels are terse and inconsistent. Let's tweak them to be longer and use a parallel construction.

```{r}
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat"
  )) %>%
  count(partyid)
```

`fct_recode()` will leave levels that aren't explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn't exist.
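
Both behaviours can be seen on a small made-up factor. In this sketch, `party` and the non-existent level `"Green party"` are illustrative only, and the warning is caught with `tryCatch()` so that the chunk runs quietly:

```{r}
library(forcats)

# Toy factor (hypothetical) with two levels.
party <- factor(c("Strong democrat", "Independent", "Strong democrat"))

# Only the mentioned level changes; "Independent" is left as is.
recoded <- fct_recode(party, "Democrat, strong" = "Strong democrat")
as.character(recoded)

# Referring to a level that does not exist triggers a warning.
warned <- tryCatch(
  {
    fct_recode(party, "Green" = "Green party")
    FALSE
  },
  warning = function(w) TRUE
)
warned
```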

To combine groups, you can assign multiple old levels to the same new level:

```{r}
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat",
    "Other"                 = "No answer",
    "Other"                 = "Don't know",
    "Other"                 = "Other party"
  )) %>%
  count(partyid)
```

You must use this technique with care: if you group together categories that are truly different you will end up with misleading results.

If you want to collapse a lot of levels, `fct_collapse()` is a useful variant of `fct_recode()`. For each new level, you can provide a vector of old levels:

```{r}
gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat")
  )) %>%
  count(partyid)
```
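
The same mapping on a four-value toy factor makes the old-to-new correspondence easy to verify by eye. This is a sketch with a made-up `party` vector, not the survey data:

```{r}
library(forcats)

# Toy factor (hypothetical) using four of the partyid labels.
party <- factor(c(
  "Strong republican", "Not str republican",
  "Strong democrat", "Not str democrat"
))

# Each new level is given the vector of old levels it absorbs.
collapsed <- fct_collapse(party,
  rep = c("Strong republican", "Not str republican"),
  dem = c("Not str democrat", "Strong democrat")
)
as.character(collapsed)
```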

Sometimes you just want to lump together all the small groups to make a plot or table simpler. That's the job of `fct_lump()`:

```{r}
gss_cat %>%
  mutate(relig = fct_lump(relig)) %>%
  count(relig)
```

The default behaviour is to progressively lump together the smallest groups, ensuring that the aggregate is still the smallest group. In this case it's not very helpful: it is true that the majority of Americans in this survey are Protestant, but we've probably over-collapsed.

Instead, we can use the `n` parameter to specify how many groups (excluding other) we want to keep:

```{r}
gss_cat %>%
  mutate(relig = fct_lump(relig, n = 10)) %>%
  count(relig, sort = TRUE) %>%
  print(n = Inf)
```
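
On a small made-up factor the effect of `n` is easy to trace by hand. In this sketch the two most common levels are kept and everything else is pooled into `Other`:

```{r}
library(forcats)

# Toy factor (hypothetical): a x 4, b x 3, c x 2, d x 1.
f <- factor(rep(c("a", "b", "c", "d"), times = c(4, 3, 2, 1)))

# Keep the two most common levels; "c" and "d" are pooled into "Other".
lumped <- fct_lump(f, n = 2)
table(lumped)
```

Note that `Other` ends up with three observations here, as many as `b`, which is why a very small `n` can hide a lot of the data.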

### Exercises

1.  How have the proportions of people identifying as Democrat, Republican, and
    Independent changed over time?

1.  How could you collapse `rincome` into a small set of categories?
-												Start banging out factors chapter

											
										
										
											2016-08-17 06:06:51 +08:00
+								# Factors
 								## Introduction
-												Copyedits for factors.Rmd (#280)


											
										
										
											2016-08-18 05:35:33 +08:00
+								In R, factors are used to work with categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.
-												Start banging out factors chapter

											
										
										
											2016-08-17 06:06:51 +08:00
-												Copyedits for communicate-plots.Rmd & factors.Rmd (#282)

* Copyedits for communicate-plots.Rmd & factors.Rmd

* Add missing 't'

											
										
										
											2016-08-18 20:31:26 +08:00
+								Historically, factors were much easier to work with than characters. As a result, many of the functions in base R automatically convert characters to factors. This means that factors often crop up in places where they're not actually helpful. Fortunately, you don't need to worry about that in the tidyverse, and can focus on situations where factors are genuinely useful.
-												Second pass through factors

											
										
										
											2016-08-18 02:49:27 +08:00
-												Start banging out factors chapter

											
										
										
											2016-08-17 06:06:51 +08:00
+								### Prerequisites
-												Use tidyverse package

Fixes #451

											
										
										
											2016-10-04 01:30:24 +08:00
+								To work with factors, we'll use the __forcats__ package, which provides tools for dealing with **cat**egorical variables (and it's an anagram of factors!). It provides a wide range of helpers for working with factors. forcats is not part of the core tidyverse, so we need to load it explicitly.
-												Start banging out factors chapter

											
										
										
											2016-08-17 06:06:51 +08:00
 								```{r setup, message = FALSE}
-												Use tidyverse package

Fixes #451

											
										
										
											2016-10-04 01:30:24 +08:00
+								library(tidyverse)
-												Start banging out factors chapter

											
										
										
											2016-08-17 06:06:51 +08:00
+								library(forcats)
 								```
-												Ref & acknowledge @AmeliaMN paper

											
										
										
											2017-10-28 00:53:57 +08:00
+								### Learning more
 								If you want to learn more about factors, I recommend reading Amelia McNamara and Nicholas Horton’s paper, [_Wrangling categorical data in R_](https://peerj.com/preprints/3163/). This paper lays out some of the history discussed in [_stringsAsFactors: An unauthorized biography_](http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) and [_stringsAsFactors = \<sigh\>_](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh), and compares the tidy approaches to categorical data outlined in this book with base R methods. A early version of the paper help motivate and scope the forcats package; thanks Amelia & Nick!
-												Start banging out factors chapter

											
										
										
											2016-08-17 06:06:51 +08:00
+								## Creating factors
-												Better factor motivation

Thanks to @csgillespie

											
										
										
											2016-10-04 22:00:33 +08:00
+								Imagine that you have a variable that records month:
-												Start banging out factors chapter

											
										
										
											2016-08-17 06:06:51 +08:00
 								```{r}
-												Better factor motivation

Thanks to @csgillespie

											
										
										
											2016-10-04 22:00:33 +08:00
+								x1 <- c("Dec", "Apr", "Jan", "Mar")
-												Filling in some text about factors

											
										
										
											2016-08-17 23:23:57 +08:00
+								```
-												Better factor motivation

Thanks to @csgillespie

											
										
										
											2016-10-04 22:00:33 +08:00
+								Using a string to record this variable has two problems:
 .  There are only twelve possible months, and there's nothing saving you
 								    from typos:
 								    ```{r}
 								    x2 <- c("Dec", "Apr", "Jam", "Mar")
 								    ```
 .  It doesn't sort in a useful way:
 								    ```{r}
 								    sort(x1)
 								    ```
 								You can fix both of these problems with a factor. To create a factor you must start by creating a list of the valid __levels__:
 								```{r}
 								month_levels <- c(
 								  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
 								  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
 								)
 								```
 								Now you can create a factor:
 								```{r}
 								y1 <- factor(x1, levels = month_levels)
 								y1
 								sort(y1)
 								```
 								And any values not in the set will be silently converted to NA:
 								```{r}
 								y2 <- factor(x2, levels = month_levels)
 								y2
 								```
-												Typo correction (#473)

Change "want" for "warning" in line 67.
											
										
										
											2016-10-17 02:56:27 +08:00
+								If you want a warning, you can use `readr::parse_factor()`:
-												Start banging out factors chapter

											
										
										
											2016-08-17 06:06:51 +08:00
-												Filling in some text about factors

											
										
										
											2016-08-17 23:23:57 +08:00
+								```{r}
-												Better factor motivation

Thanks to @csgillespie

											
										
										
											2016-10-04 22:00:33 +08:00
+								y2 <- parse_factor(x2, levels = month_levels)
-												Start banging out factors chapter

											
										
										
											2016-08-17 06:06:51 +08:00
+								```
-												Filling in some text about factors

											
										
										
											2016-08-17 23:23:57 +08:00
+								If you omit the levels, they'll be taken from the data in alphabetical order:
-												Start banging out factors chapter

											
										
										
											2016-08-17 06:06:51 +08:00
 								```{r}
-												Better factor motivation

Thanks to @csgillespie

											
										
										
											2016-10-04 22:00:33 +08:00
+								factor(x1)
-												Filling in some text about factors

											
										
										
											2016-08-17 23:23:57 +08:00
+								```
-												Copyedits for factors.Rmd (#280)


											
										
										
											2016-08-18 05:35:33 +08:00
+								Sometimes you'd prefer that the order of the levels match the order of the first appearance in the data. You can do that when creating the factor by setting levels to `unique(x)`, or after the fact, with `fct_inorder()`:
-												Filling in some text about factors

											
										
										
											2016-08-17 23:23:57 +08:00
 								```{r}
-												Better factor motivation

Thanks to @csgillespie

											
										
										
											2016-10-04 22:00:33 +08:00
+								f1 <- factor(x1, levels = unique(x1))
-												Second pass through factors

											
										
										
											2016-08-18 02:49:27 +08:00
+								f1
-												Filling in some text about factors

											
										
										
											2016-08-17 23:23:57 +08:00
-												Better factor motivation

Thanks to @csgillespie

											
										
										
											2016-10-04 22:00:33 +08:00
+								f2 <- x1 %>% factor() %>% fct_inorder()
-												Second pass through factors

											
										
										
											2016-08-18 02:49:27 +08:00
+								f2
-												Filling in some text about factors

											
										
										
											2016-08-17 23:23:57 +08:00
+								```
-												Copyedits for factors.Rmd (#280)


											
										
										
											2016-08-18 05:35:33 +08:00
+								If you ever need to access the set of valid levels directly, you can do so with `levels()`:
-												Filling in some text about factors

											
										
										
											2016-08-17 23:23:57 +08:00
 								```{r}
-												Second pass through factors

											
										
										
											2016-08-18 02:49:27 +08:00
+								levels(f2)
-												Filling in some text about factors

											
										
										
											2016-08-17 23:23:57 +08:00
+								```
 								## General Social Survey
-												Start banging out factors chapter

											
										
										
											2016-08-17 06:06:51 +08:00
-												Copyedits for factors.Rmd (#280)


											
										
										
											2016-08-18 05:35:33 +08:00
+								For the rest of this chapter, we're going to focus on `forcats::gss_cat`. It's a sample of data from the [General Social Survey](http://gss.norc.org), which is a long-running US survey conducted by the independent research organization NORC at the University of Chicago. The survey has thousands of questions, so in `gss_cat` I've selected a handful that will illustrate some common challenges you'll encounter when working with factors.
-												Start banging out factors chapter

											
										
										
											2016-08-17 06:06:51 +08:00
 								```{r}
-												Filling in some text about factors

											
										
										
											2016-08-17 23:23:57 +08:00
+								gss_cat
-												Start banging out factors chapter

											
										
										
											2016-08-17 06:06:51 +08:00
+								```
-												Second pass through factors

											
										
										
											2016-08-18 02:49:27 +08:00
+								(Remember, since this dataset is provided by a package, you can get more information about the variables with `?gss_cat`.)
 								When factors are stored in a tibble, you can't see their levels so easily. One way to see them is with `count()`:
-												Start banging out factors chapter

											
										
										
											2016-08-17 06:06:51 +08:00
 								```{r}
-												Copyedits for factors.Rmd (#280)


											
										
										
											2016-08-18 05:35:33 +08:00
+								gss_cat %>%
-												Start banging out factors chapter

											
										
										
											2016-08-17 06:06:51 +08:00
+								  count(race)
 								```
-												Second pass through factors

											
										
										
											2016-08-18 02:49:27 +08:00
+								Or with a bar chart:
-												Start banging out factors chapter

											
										
										
											2016-08-17 06:06:51 +08:00
 								```{r}
-												Copyedits for factors.Rmd (#280)


											
										
										
											2016-08-18 05:35:33 +08:00
+								ggplot(gss_cat, aes(race)) +
-												Start banging out factors chapter

											
										
										
											2016-08-17 06:06:51 +08:00
+								  geom_bar()
 								```
-												Second pass through factors

											
										
										
											2016-08-18 02:49:27 +08:00
+								By default, ggplot2 will drop levels that don't have any values. You can force them to display with:
-												Start banging out factors chapter

											
										
										
											2016-08-17 06:06:51 +08:00
 								```{r}
-												Copyedits for factors.Rmd (#280)


											
										
										
											2016-08-18 05:35:33 +08:00
+								ggplot(gss_cat, aes(race)) +
 								  geom_bar() +
-												Start banging out factors chapter

											
										
										
											2016-08-17 06:06:51 +08:00
+								  scale_x_discrete(drop = FALSE)
 								```
-												Copyedits for factors.Rmd (#280)


											
										
										
											2016-08-18 05:35:33 +08:00
+								These levels represent valid values that simply did not occur in this dataset. Unfortunately, dplyr doesn't yet have a `drop` option, but it will in the future.
-												Second pass through factors

											
										
										
											2016-08-18 02:49:27 +08:00
-												Fixing typos in factors.Rmd (#306)

* Fixing typos in factors.Rmd

* Update factors.Rmd

If 'Those' is more appropriate.

											
										
										
											2016-08-30 20:55:16 +08:00
+								When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels. Those operations are described in the sections below.
-												Filling in some text about factors

											
										
										
											2016-08-17 23:23:57 +08:00
 								### Exercise
-												Second pass through factors

											
										
										
											2016-08-18 02:49:27 +08:00
+.  Explore the distribution of `rincome` (reported income). What makes the
 								    default bar chart hard to understand? How could you improve the plot?
-												Copyedits for factors.Rmd (#280)


											
										
										
											2016-08-18 05:35:33 +08:00
-												Fix typos (#422)

* Fix typos

* Fix typos

* Fix typos

											
										
										
											2016-10-03 20:38:26 +08:00
+.  What is the most common `relig` in this survey? What's the most
-												Copyedits for communicate-plots.Rmd & factors.Rmd (#282)

* Copyedits for communicate-plots.Rmd & factors.Rmd

* Add missing 't'

											
										
										
											2016-08-18 20:31:26 +08:00
+								    common `partyid`?
-												Copyedits for factors.Rmd (#280)


											
										
										
											2016-08-18 05:35:33 +08:00
-												Fix typos (#422)

* Fix typos

* Fix typos

* Fix typos

											
										
										
											2016-10-03 20:38:26 +08:00
+.  Which `relig` does `denom` (denomination) apply to? How can you find
-												Second pass through factors

											
										
										
											2016-08-18 02:49:27 +08:00
+								    out with a table? How can you find out with a visualisation?
-												Start banging out factors chapter

											
										
										
											2016-08-17 06:06:51 +08:00
 								## Modifying factor order
-												Copyedits for factors.Rmd (#280)


											
										
										
											2016-08-18 05:35:33 +08:00
+								It's often useful to change the order of the factor levels in a visualisation. For example, imagine you want to explore the average number of hours spent watching TV per day across religions:
-												Filling in some text about factors

											
										
										
											2016-08-17 23:23:57 +08:00
-												Start banging out factors chapter

											
										
										
											2016-08-17 06:06:51 +08:00
+								```{r}
-												factors.Rmd clarification (#577)

Fixes #576 
											
										
										
											2017-05-04 20:07:34 +08:00
+								relig_summary <- gss_cat %>%
-												Copyedits for factors.Rmd (#280)


											
										
										
											2016-08-18 05:35:33 +08:00
+								  group_by(relig) %>%
-												Start banging out factors chapter

											
										
										
											2016-08-17 06:06:51 +08:00
+								  summarise(
 								    age = mean(age, na.rm = TRUE),
 								    tvhours = mean(tvhours, na.rm = TRUE),
 								    n = n()
 								  )
-												factors.Rmd clarification (#577)

Fixes #576 
											
										
										
											2017-05-04 20:07:34 +08:00
+								ggplot(relig_summary, aes(tvhours, relig)) + geom_point()
-												Start banging out factors chapter

											
										
										
											2016-08-17 06:06:51 +08:00
+								```
-												Copyedits for factors.Rmd (#280)


											
										
										
											2016-08-18 05:35:33 +08:00
+								It is difficult to interpret this plot because there's no overall pattern. We can improve it by reordering the levels of `relig` using `fct_reorder()`. `fct_reorder()` takes three arguments:
-												Second pass through factors

											
										
										
											2016-08-18 02:49:27 +08:00
 								* `f`, the factor whose levels you want to modify.
 								* `x`, a numeric vector that you want to use to reorder the levels.
-												Copyedits for factors.Rmd (#280)


											
										
										
											2016-08-18 05:35:33 +08:00
+								* Optionally, `fun`, a function that's used if there are multiple values of
-												Second pass through factors

											
										
										
											2016-08-18 02:49:27 +08:00
+								  `x` for each value of `f`. The default value is `median`.
-												Filling in some text about factors

											
										
										
											2016-08-17 23:23:57 +08:00
 								```{r}
-												factors.Rmd clarification (#577)

Fixes #576 
											
										
										
											2017-05-04 20:07:34 +08:00
+								ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
-												Filling in some text about factors

											
										
										
											2016-08-17 23:23:57 +08:00
+								  geom_point()
 								```
-												Copyedits for factors.Rmd (#280)


											
										
										
											2016-08-18 05:35:33 +08:00
+								Reordering religion makes it much easier to see that people in the "Don't know" category watch much more TV, and Hinduism & Other Eastern religions watch much less.
-												Second pass through factors

											
										
										
											2016-08-18 02:49:27 +08:00
-												Copyedits for factors.Rmd (#280)


											
										
										
											2016-08-18 05:35:33 +08:00
+								As you start making more complicated transformations, I'd recommend moving them out of `aes()` and into a separate `mutate()` step. For example, you could rewrite the plot above as:
-												Second pass through factors

											
										
										
											2016-08-18 02:49:27 +08:00
 								```{r, eval = FALSE}
-												factors.Rmd clarification (#577)

Fixes #576 
											
										
										
											2017-05-04 20:07:34 +08:00
+								relig_summary %>%
-												Copyedits for factors.Rmd (#280)


											
										
										
											2016-08-18 05:35:33 +08:00
+								  mutate(relig = fct_reorder(relig, tvhours)) %>%
 								  ggplot(aes(tvhours, relig)) +
-												Second pass through factors

											
										
										
											2016-08-18 02:49:27 +08:00
+								    geom_point()
 								```
 								What if we create a similar plot looking at how average age varies across reported income level?
-												Start banging out factors chapter

											
										
										
											2016-08-17 06:06:51 +08:00
 								```{r}
-												factors.Rmd clarification (#577)

Fixes #576 
											
										
										
											2017-05-04 20:07:34 +08:00
+								rincome_summary <- gss_cat %>%
-												Copyedits for factors.Rmd (#280)


											
										
										
											2016-08-18 05:35:33 +08:00
+								  group_by(rincome) %>%
-												Start banging out factors chapter

											
										
										
											2016-08-17 06:06:51 +08:00
+								  summarise(
 								    age = mean(age, na.rm = TRUE),
 								    tvhours = mean(tvhours, na.rm = TRUE),
 								    n = n()
 								  )
-												factors.Rmd clarification (#577)

Fixes #576 
											
										
										
											2017-05-04 20:07:34 +08:00
+								ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) + geom_point()
-												Filling in some text about factors

											
										
										
											2016-08-17 23:23:57 +08:00
+								```

Here, arbitrarily reordering the levels isn't a good idea! That's because `rincome` already has a principled order that we shouldn't mess with. Reserve `fct_reorder()` for factors whose levels are arbitrarily ordered.

However, it does make sense to pull "Not applicable" to the front with the other special levels. You can use `fct_relevel()`. It takes a factor, `f`, and then any number of levels that you want to move to the front of the line.

```{r}
ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
  geom_point()
```
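
To see what `fct_relevel()` does to the levels themselves, independent of any plot, here is a minimal sketch on a made-up factor. The vector `income` and its levels are hypothetical and are not taken from `gss_cat`:

```{r}
library(forcats)

# A hypothetical factor whose special level sits last
income <- factor(
  c("Low", "Mid", "High", "Not applicable"),
  levels = c("Low", "Mid", "High", "Not applicable")
)

# Move the named level to the front; the other levels keep their relative order
income_releveled <- fct_relevel(income, "Not applicable")
levels(income_releveled)
# levels are now: "Not applicable", "Low", "Mid", "High"
```

Only the level order changes; the values stored in the factor stay the same.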

Why do you think the average age for "Not applicable" is so high?

Another type of reordering is useful when you are colouring the lines on a plot. `fct_reorder2()` reorders the factor by the `y` values associated with the largest `x` values. This makes the plot easier to read because the line colours line up with the legend.

```{r, fig.align = "default", out.width = "50%", fig.width = 4}
by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  count(age, marital) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n))

ggplot(by_age, aes(age, prop, colour = marital)) +
  geom_line(na.rm = TRUE)

ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
  geom_line() +
  labs(colour = "marital")
```
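
A small made-up example may help show what `fct_reorder2()` does to the levels. The data frame `toy` is hypothetical. The sketch relies on the default behaviour of `fct_reorder2()`, which takes the `y` value at the largest `x` within each level and sorts the levels in descending order of that value:

```{r}
library(forcats)

# Hypothetical data: three groups, each observed at x = 1 and x = 2
toy <- data.frame(
  group = factor(c("a", "a", "b", "b", "c", "c")),
  x     = c(1, 2, 1, 2, 1, 2),
  y     = c(5, 1, 2, 3, 4, 2)
)

# At the largest x (x = 2) the y values are a = 1, b = 3, c = 2,
# so the levels come out as b, c, a (largest first)
reordered <- fct_reorder2(toy$group, toy$x, toy$y)
levels(reordered)
```

This is the order in which the lines end at the right-hand edge of a plot, which is why the legend lines up with them.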

Finally, for bar plots, you can use `fct_infreq()` to order levels in decreasing frequency: this is the simplest type of reordering because it doesn't need any extra variables. Combine it with `fct_rev()` if you want the levels in increasing frequency instead.

```{r}
gss_cat %>%
  mutate(marital = marital %>% fct_infreq() %>% fct_rev()) %>%
  ggplot(aes(marital)) +
    geom_bar()
```
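
The effect on the levels is easy to check on a tiny made-up factor. The vector `votes` below is hypothetical and serves only as an illustration:

```{r}
library(forcats)

# Hypothetical factor: "c" appears three times, "a" twice, "b" once
votes <- factor(c("b", "a", "c", "a", "c", "c"))

# Most frequent level first
by_freq <- fct_infreq(votes)
levels(by_freq)
# levels are now: "c", "a", "b"

# Reversed: least frequent level first
by_freq_rev <- fct_rev(by_freq)
levels(by_freq_rev)
# levels are now: "b", "a", "c"
```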

### Exercises

1.  There are some suspiciously high numbers in `tvhours`. Is the mean a good
    summary?

1.  For each factor in `gss_cat` identify whether the order of the levels is
    arbitrary or principled.

1.  Why did moving "Not applicable" to the front of the levels move it to the
    bottom of the plot?

## Modifying factor levels

More powerful than changing the order of the levels is changing their values. This allows you to clarify labels for publication, and collapse levels for high-level displays. The most general and powerful tool is `fct_recode()`. It allows you to recode, or change, the value of each level. For example, take `gss_cat$partyid`:

```{r}
gss_cat %>% count(partyid)
```

The levels are terse and inconsistent. Let's tweak them to be longer and use a parallel construction.

```{r}
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat"
  )) %>%
  count(partyid)
```

`fct_recode()` will leave levels that aren't explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn't exist.

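
Both behaviours can be seen on a small made-up factor. The vector `fruit` and the new level names are hypothetical, chosen only for this sketch:

```{r}
library(forcats)

fruit <- factor(c("apple", "bear", "banana"))

# Only "bear" is renamed; "apple" and "banana" are left as is
recoded <- fct_recode(fruit, "pear" = "bear")
levels(recoded)
# levels are now: "apple", "banana", "pear"

# "kiwi" is not a level of `fruit`, so this call emits a warning
# and returns the factor with its levels unchanged
unchanged <- fct_recode(fruit, "pear" = "kiwi")
levels(unchanged)
```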

To combine groups, you can assign multiple old levels to the same new level:

```{r}
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat",
    "Other"                 = "No answer",
    "Other"                 = "Don't know",
    "Other"                 = "Other party"
  )) %>%
  count(partyid)
```

You must use this technique with care: if you group together categories that are truly different you will end up with misleading results.

If you want to collapse a lot of levels, `fct_collapse()` is a useful variant of `fct_recode()`. For each new level, you can provide a vector of old levels:

```{r}
gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat")
  )) %>%
  count(partyid)
```
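
The same idea on a tiny made-up factor, where the counts are easy to verify by hand. The vector `animals` and the group names are hypothetical:

```{r}
library(forcats)

# Hypothetical factor with four levels
animals <- factor(c("cat", "dog", "trout", "salmon", "cat"))

# Each new level is given as a vector of the old levels it absorbs
grouped <- fct_collapse(animals,
  mammal = c("cat", "dog"),
  fish   = c("trout", "salmon")
)

# Three observations fall under "mammal" and two under "fish"
table(grouped)
```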

Sometimes you just want to lump together all the small groups to make a plot or table simpler. That's the job of `fct_lump()`:

```{r}
gss_cat %>%
  mutate(relig = fct_lump(relig)) %>%
  count(relig)
```

The default behaviour is to progressively lump together the smallest groups, ensuring that the aggregate is still the smallest group. In this case it's not very helpful: it is true that the majority of Americans in this survey are Protestant, but we've probably over-collapsed.

Instead, we can use the `n` parameter to specify how many groups (excluding other) we want to keep:

```{r}
gss_cat %>%
  mutate(relig = fct_lump(relig, n = 10)) %>%
  count(relig, sort = TRUE) %>%
  print(n = Inf)
```
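
A minimal sketch of the `n` argument on a made-up factor, small enough to check by hand. The vector `pets` is hypothetical:

```{r}
library(forcats)

# Hypothetical factor: 5 dogs, 3 cats, and two rare levels
pets <- factor(c(rep("dog", 5), rep("cat", 3), "newt", "gecko"))

# Keep the two most common levels; everything else becomes "Other"
lumped <- fct_lump(pets, n = 2)

# "dog" keeps its 5 observations, "cat" its 3, and "Other" holds the remaining 2
table(lumped)
```

More recent forcats releases also provide `fct_lump_n()`, which covers this use of the `n` argument.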

### Exercises

1.  How have the proportions of people identifying as Democrat, Republican, and
    Independent changed over time?

1.  How could you collapse `rincome` into a small set of categories?