parent
0a4a5c3d55
commit
79a761664a
64
factors.Rmd
64
factors.Rmd
|
@ -1,24 +1,29 @@
|
||||||
# Factors
|
# Factors
|
||||||
|
|
||||||
|
```{r, results = "asis", echo = FALSE}
|
||||||
|
status("complete")
|
||||||
|
```
|
||||||
|
|
||||||
## Introduction
|
## Introduction
|
||||||
|
|
||||||
Factors are used for categorical variables, variables that have a fixed and known set of possible values.
|
Factors are used for categorical variables, variables that have a fixed and known set of possible values.
|
||||||
They are also useful when you want to display character vectors in a non-alphabetical order.
|
They are also useful when you want to display character vectors in a non-alphabetical order.
|
||||||
|
|
||||||
If, after reading this chapter, you want to learn more about factors, I recommend reading Amelia McNamara and Nicholas Horton's paper, [*Wrangling categorical data in R*](https://peerj.com/preprints/3163/).
|
If you want to learn more about factors after reading this chapter, I recommend reading Amelia McNamara and Nicholas Horton's paper, [*Wrangling categorical data in R*](https://peerj.com/preprints/3163/).
|
||||||
This paper lays out some of the history discussed in [*stringsAsFactors: An unauthorized biography*](https://simplystatistics.org/posts/2015-07-24-stringsasfactors-an-unauthorized-biography/) and [*stringsAsFactors = \<sigh\>*](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh), and compares the tidy approaches to categorical data outlined in this book with base R methods.
|
This paper lays out some of the history discussed in [*stringsAsFactors: An unauthorized biography*](https://simplystatistics.org/posts/2015-07-24-stringsasfactors-an-unauthorized-biography/) and [*stringsAsFactors = \<sigh\>*](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh), and compares the tidy approaches to categorical data outlined in this book with base R methods.
|
||||||
An early version of the paper helped motivate and scope the forcats package; thanks Amelia & Nick!
|
An early version of the paper helped motivate and scope the forcats package; thanks Amelia & Nick!
|
||||||
|
|
||||||
### Prerequisites
|
### Prerequisites
|
||||||
|
|
||||||
To work with factors, we'll use the **forcats** package, which is part of the core tidyverse.
|
Base R provides some basic tools for creating and manipulating factors.
|
||||||
|
We'll supplement these with the **forcats** package, which is part of the core tidyverse.
|
||||||
It provides tools for dealing with **cat**egorical variables (and it's an anagram of factors!) using a wide range of helpers for working with factors.
|
It provides tools for dealing with **cat**egorical variables (and it's an anagram of factors!) using a wide range of helpers for working with factors.
|
||||||
|
|
||||||
```{r setup, message = FALSE}
|
```{r setup, message = FALSE}
|
||||||
library(tidyverse)
|
library(tidyverse)
|
||||||
```
|
```
|
||||||
|
|
||||||
## Creating factors
|
## Factor basics
|
||||||
|
|
||||||
Imagine that you have a variable that records month:
|
Imagine that you have a variable that records month:
|
||||||
|
|
||||||
|
@ -58,7 +63,7 @@ y1
|
||||||
sort(y1)
|
sort(y1)
|
||||||
```
|
```
|
||||||
|
|
||||||
And any values not in the set will be silently converted to NA:
|
And any values not in the levels will be silently converted to NA:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
y2 <- factor(x2, levels = month_levels)
|
y2 <- factor(x2, levels = month_levels)
|
||||||
|
@ -107,7 +112,7 @@ gss_cat
|
||||||
(Remember, since this dataset is provided by a package, you can get more information about the variables with `?gss_cat`.)
|
(Remember, since this dataset is provided by a package, you can get more information about the variables with `?gss_cat`.)
|
||||||
|
|
||||||
When factors are stored in a tibble, you can't see their levels so easily.
|
When factors are stored in a tibble, you can't see their levels so easily.
|
||||||
One way to see them is with `count()`:
|
One way to view them is with `count()`:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
gss_cat |>
|
gss_cat |>
|
||||||
|
@ -125,26 +130,6 @@ ggplot(gss_cat, aes(race)) +
|
||||||
geom_bar()
|
geom_bar()
|
||||||
```
|
```
|
||||||
|
|
||||||
By default, ggplot2 will drop levels that don't have any values.
|
|
||||||
You can force them to display with:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
#> fig.alt: >
|
|
||||||
#> The same bar chart as the last plot, but now with an missing bar on
|
|
||||||
#> the far right with label "Not applicable".
|
|
||||||
ggplot(gss_cat, aes(race)) +
|
|
||||||
geom_bar() +
|
|
||||||
scale_x_discrete(drop = FALSE)
|
|
||||||
```
|
|
||||||
|
|
||||||
These levels represent valid values that simply did not occur in this dataset.
|
|
||||||
In dplyr::count() set the `.drop` option to `FALSE`, to show these.
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
gss_cat |>
|
|
||||||
count(race, .drop = FALSE)
|
|
||||||
```
|
|
||||||
|
|
||||||
When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels.
|
When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels.
|
||||||
Those operations are described in the sections below.
|
Those operations are described in the sections below.
|
||||||
|
|
||||||
|
@ -183,7 +168,7 @@ ggplot(relig_summary, aes(tvhours, relig)) +
|
||||||
geom_point()
|
geom_point()
|
||||||
```
|
```
|
||||||
|
|
||||||
It is difficult to interpret this plot because there's no overall pattern.
|
It is hard to read this plot because there's no overall pattern.
|
||||||
We can improve it by reordering the levels of `relig` using `fct_reorder()`.
|
We can improve it by reordering the levels of `relig` using `fct_reorder()`.
|
||||||
`fct_reorder()` takes three arguments:
|
`fct_reorder()` takes three arguments:
|
||||||
|
|
||||||
|
@ -207,7 +192,9 @@ For example, you could rewrite the plot above as:
|
||||||
|
|
||||||
```{r, eval = FALSE}
|
```{r, eval = FALSE}
|
||||||
relig_summary |>
|
relig_summary |>
|
||||||
mutate(relig = fct_reorder(relig, tvhours)) |>
|
mutate(
|
||||||
|
relig = fct_reorder(relig, tvhours)
|
||||||
|
) |>
|
||||||
ggplot(aes(tvhours, relig)) +
|
ggplot(aes(tvhours, relig)) +
|
||||||
geom_point()
|
geom_point()
|
||||||
```
|
```
|
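As a rough illustration of what the `mutate()` call in this hunk does, the sketch below applies `fct_reorder()` to a few hand-made vectors. The level names and `tvhours` values are invented for the example and are not taken from `gss_cat`.

```r
library(forcats)

# Hypothetical summary data (not from gss_cat): one value per level.
relig <- factor(c("A", "B", "C"))
tvhours <- c(3.0, 1.5, 2.2)

# fct_reorder() sorts the levels of `relig` by `tvhours`, ascending by default.
# With a single value per level, the default summary (median) is that value.
reordered <- fct_reorder(relig, tvhours)
levels(reordered)
#> [1] "B" "C" "A"
```

Only the level order changes; the values stored in the factor stay the same.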
||||||
|
@ -253,8 +240,8 @@ ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
|
||||||
Why do you think the average age for "Not applicable" is so high?
|
Why do you think the average age for "Not applicable" is so high?
|
||||||
|
|
||||||
Another type of reordering is useful when you are coloring the lines on a plot.
|
Another type of reordering is useful when you are coloring the lines on a plot.
|
||||||
`fct_reorder2()` reorders the factor by the `y` values associated with the largest `x` values.
|
`fct_reorder2(f, x, y)` reorders the factor `f` by the `y` values associated with the largest `x` values.
|
||||||
This makes the plot easier to read because the line colurs line up with the legend.
|
This makes the plot easier to read because the colors of the line at the far right of the plot will line up with the legend.
|
||||||
|
|
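To make the `fct_reorder2(f, x, y)` description concrete, here is a minimal sketch on made-up data. The vectors `g`, `x`, and `y` are hypothetical stand-ins for `marital`, `age`, and `prop`.

```r
library(forcats)

# Hypothetical data: three groups, each observed at x = 1 and x = 2.
g <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(1, 2, 1, 2, 1, 2)
y <- c(5, 1, 2, 3, 1, 2)

# For each level, take the y value at the largest x (a: 1, b: 3, c: 2),
# then order the levels by that value, largest first.
reordered2 <- fct_reorder2(g, x, y)
levels(reordered2)
#> [1] "b" "c" "a"
```

Because the legend follows the level order, the first legend entry then matches the line that ends highest at the right edge of the plot.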
||||||
```{r, fig.align = "default", out.width = "50%", fig.width = 4}
|
```{r, fig.align = "default", out.width = "50%", fig.width = 4}
|
||||||
#| fig.alt:
|
#| fig.alt:
|
||||||
|
@ -288,7 +275,7 @@ ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
|
||||||
```
|
```
|
||||||
|
|
||||||
Finally, for bar plots, you can use `fct_infreq()` to order levels in decreasing frequency: this is the simplest type of reordering because it doesn't need any extra variables.
|
Finally, for bar plots, you can use `fct_infreq()` to order levels in decreasing frequency: this is the simplest type of reordering because it doesn't need any extra variables.
|
||||||
You may want to combine with `fct_rev()`.
|
Combine it with `fct_rev()` if you want the largest values on the right, not the left.
|
||||||
|
|
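A small sketch of how `fct_infreq()` and `fct_rev()` interact, using a made-up factor rather than `gss_cat`:

```r
library(forcats)

# Hypothetical factor: "c" occurs three times, "b" twice, "a" once.
f <- factor(c("a", "b", "b", "c", "c", "c"))

# fct_infreq() puts the most frequent level first.
by_freq <- fct_infreq(f)
levels(by_freq)
#> [1] "c" "b" "a"

# fct_rev() reverses the level order, so the most frequent level comes last.
by_freq_rev <- fct_rev(by_freq)
levels(by_freq_rev)
#> [1] "a" "b" "c"
```

In a bar chart with the factor on the x-axis, the second ordering places the tallest bar on the right.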
||||||
```{r}
|
```{r}
|
||||||
#| fig.alt: >
|
#| fig.alt: >
|
||||||
|
@ -324,6 +311,7 @@ gss_cat |> count(partyid)
|
||||||
|
|
||||||
The levels are terse and inconsistent.
|
The levels are terse and inconsistent.
|
||||||
Let's tweak them to be longer and use a parallel construction.
|
Let's tweak them to be longer and use a parallel construction.
|
||||||
|
Like most rename and recoding functions in the tidyverse, the new values go on the left and the old values go on the right:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
gss_cat |>
|
gss_cat |>
|
||||||
|
@ -340,7 +328,7 @@ gss_cat |>
|
||||||
count(partyid)
|
count(partyid)
|
||||||
```
|
```
|
||||||
|
|
||||||
`fct_recode()` will leave levels that aren't explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn't exist.
|
`fct_recode()` will leave the levels that aren't explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn't exist.
|
||||||
|
|
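The two behaviors described here can be checked on a toy factor. The level names below are hypothetical and much shorter than the real `partyid` levels.

```r
library(forcats)

# Hypothetical factor; levels are sorted alphabetically: "dem", "ind", "rep".
f <- factor(c("rep", "dem", "ind"))

# New values go on the left, old values on the right.
# "ind" is not mentioned, so it is left as is.
recoded <- fct_recode(f, "Republican" = "rep", "Democrat" = "dem")
levels(recoded)
#> [1] "Democrat"   "ind"        "Republican"

# Referring to a level that doesn't exist ("grn") raises a warning,
# which tryCatch() turns into TRUE here.
warned <- tryCatch(
  {
    fct_recode(f, "Green" = "grn")
    FALSE
  },
  warning = function(w) TRUE
)
warned
```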
||||||
To combine groups, you can assign multiple old levels to the same new level:
|
To combine groups, you can assign multiple old levels to the same new level:
|
||||||
|
|
||||||
|
@ -362,7 +350,7 @@ gss_cat |>
|
||||||
count(partyid)
|
count(partyid)
|
||||||
```
|
```
|
||||||
|
|
||||||
You must use this technique with care: if you group together categories that are truly different you will end up with misleading results.
|
Use this technique with care: if you group together categories that are truly different you will end up with misleading results.
|
||||||
|
|
||||||
If you want to collapse a lot of levels, `fct_collapse()` is a useful variant of `fct_recode()`.
|
If you want to collapse a lot of levels, `fct_collapse()` is a useful variant of `fct_recode()`.
|
||||||
For each new level, you can provide a vector of old levels:
|
For each new level, you can provide a vector of old levels:
|
||||||
|
@ -371,16 +359,16 @@ For each new variable, you can provide a vector of old levels:
|
||||||
gss_cat |>
|
gss_cat |>
|
||||||
mutate(
|
mutate(
|
||||||
partyid = fct_collapse(partyid,
|
partyid = fct_collapse(partyid,
|
||||||
other = c("No answer", "Don't know", "Other party"),
|
"other" = c("No answer", "Don't know", "Other party"),
|
||||||
rep = c("Strong republican", "Not str republican"),
|
"rep" = c("Strong republican", "Not str republican"),
|
||||||
ind = c("Ind,near rep", "Independent", "Ind,near dem"),
|
"ind" = c("Ind,near rep", "Independent", "Ind,near dem"),
|
||||||
dem = c("Not str democrat", "Strong democrat")
|
"dem" = c("Not str democrat", "Strong democrat")
|
||||||
)
|
)
|
||||||
) |>
|
) |>
|
||||||
count(partyid)
|
count(partyid)
|
||||||
```
|
```
|
||||||
|
|
||||||
Sometimes you just want to lump together all the small groups to make a plot or table simpler.
|
Sometimes you just want to lump together the small groups to make a plot or table simpler.
|
||||||
That's the job of the `fct_lump_*()` family of functions.
|
That's the job of the `fct_lump_*()` family of functions.
|
||||||
`fct_lump_lowfreq()` is a simple starting point that progressively lumps the smallest groups into "Other", always keeping "Other" as the smallest category.
|
`fct_lump_lowfreq()` is a simple starting point that progressively lumps the smallest groups into "Other", always keeping "Other" as the smallest category.
|
||||||
|
|
||||||
|
@ -400,6 +388,8 @@ gss_cat |>
|
||||||
print(n = Inf)
|
print(n = Inf)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
Read the documentation to learn about `fct_lump_min()` and `fct_lump_prop()` which are useful in other cases.
|
||||||
|
|
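As a hedged sketch of those two variants, the example below uses an invented factor with counts a = 6, b = 3, c = 1, d = 1. The thresholds are arbitrary choices for illustration.

```r
library(forcats)

# Hypothetical factor with 11 observations.
f <- factor(c(rep("a", 6), rep("b", 3), "c", "d"))

# fct_lump_min() lumps levels that appear fewer than `min` times.
lumped_min <- fct_lump_min(f, min = 3)
levels(lumped_min)
#> [1] "a"     "b"     "Other"

# fct_lump_prop() lumps levels whose share of observations is below `prop`.
# Only "a" (6/11) exceeds 0.5 here.
lumped_prop <- fct_lump_prop(f, prop = 0.5)
levels(lumped_prop)
#> [1] "a"     "Other"
```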
||||||
### Exercises
|
### Exercises
|
||||||
|
|
||||||
1. How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?
|
1. How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?
|
||||||
|
|
|
@ -40,6 +40,7 @@ There are two missing values in this dataset:
|
||||||
One way to think about the difference is with this Zen-like koan:
|
One way to think about the difference is with this Zen-like koan:
|
||||||
|
|
||||||
> An explicit missing value is the presence of an absence.\
|
> An explicit missing value is the presence of an absence.\
|
||||||
|
>
|
||||||
> An implicit missing value is the absence of a presence.
|
> An implicit missing value is the absence of a presence.
|
||||||
|
|
||||||
### Pivoting
|
### Pivoting
|
||||||
|
@ -239,6 +240,18 @@ health |>
|
||||||
|
|
||||||
Main con of this approach is that you need to carefully specify the `fill` argument so that
|
Main con of this approach is that you need to carefully specify the `fill` argument so that
|
||||||
|
|
||||||
|
By default, ggplot2 will drop levels that don't have any values.
|
||||||
|
You can force them to display by using `drop = FALSE` on the discrete axis:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
#| fig.alt: >
|
||||||
|
#| The same bar chart as the last plot, but now with a missing bar on
|
||||||
|
#| the far right with label "Not applicable".
|
||||||
|
ggplot(gss_cat, aes(race)) +
|
||||||
|
geom_bar() +
|
||||||
|
scale_x_discrete(drop = FALSE)
|
||||||
|
```
|
||||||
|
|
||||||
## NaN
|
## NaN
|
||||||
|
|
||||||
Special not a number.
|
Special not a number.
|
||||||
|
|