From 8815e8f91aee4673fdcfb609708e125c96efff17 Mon Sep 17 00:00:00 2001
From: hadley
Date: Tue, 16 Aug 2016 17:06:51 -0500
Subject: [PATCH] Start banging out factors chapter

---
 DESCRIPTION           |   2 +
 _bookdown.yml         |   1 +
 communicate-plots.Rmd |   4 +-
 factors.Rmd           | 145 ++++++++++++++++++++++++++++++++++++++++++
 vectors.Rmd           |   5 +-
 wrangle.Rmd           |   4 ++
 6 files changed, 155 insertions(+), 6 deletions(-)
 create mode 100644 factors.Rmd

diff --git a/DESCRIPTION b/DESCRIPTION
index 307d657..b0ff624 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -12,6 +12,7 @@ Imports:
     broom,
     condvis,
     dplyr,
+    forcats,
     gapminder,
     ggplot2,
     ggrepel,
@@ -37,6 +38,7 @@ Imports:
     tidyr,
     viridis
 Remotes:
+    hadley/forcats,
     hadley/modelr,
     hadley/stringr,
     hadley/tibble,
diff --git a/_bookdown.yml b/_bookdown.yml
index cfd80b9..f5d8b47 100644
--- a/_bookdown.yml
+++ b/_bookdown.yml
@@ -15,6 +15,7 @@ rmd_files: [
   "tidy.Rmd",
   "relational-data.Rmd",
   "strings.Rmd",
+  "factors.Rmd",
   "datetimes.Rmd",
 
   "program.Rmd",
diff --git a/communicate-plots.Rmd b/communicate-plots.Rmd
index e54e568..55af73a 100644
--- a/communicate-plots.Rmd
+++ b/communicate-plots.Rmd
@@ -10,7 +10,7 @@ Now you need to _communicate_ the result of your analysis to others. Your audien
 
 In this chapter, we'll focus once again on ggplot2. We'll also use a little dplyr for data manipulation, and a few ggplot2 extension packages, including __ggrepel__ and __viridis__. Rather than loading those extension here we'll refer to their functions explicitly with the `::` notation. That will help make it obvious what functions are built into ggplot2, and what functions come from other packages.
 
-```{r}
+```{r, message = FALSE}
 library(ggplot2)
 library(dplyr)
 ```
@@ -473,7 +473,7 @@ ggplot(mpg, aes(displ, hwy)) +
   theme_bw()
 ```
 
-ggplot2 includes eight themes by default, as shown in Figure \@ref(fig:themes). Many more are included in add-on packages like __ggthemes__ (), by Jeremy Arnold.
+ggplot2 includes eight themes by default, as shown in Figure \@ref(fig:themes). Many more are included in add-on packages like __ggthemes__ (), by Jeffrey Arnold.
 
 ```{r themes, echo = FALSE, fig.cap = "The eight themes built-in to ggplot2."}
 knitr::include_graphics("images/visualization-themes.png")
diff --git a/factors.Rmd b/factors.Rmd
new file mode 100644
index 0000000..7c228df
--- /dev/null
+++ b/factors.Rmd
@@ -0,0 +1,145 @@
+# Factors
+
+## Introduction
+
+In R, factors are used to work with categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.
+
+Historically, factors were much easier to work with than characters, so many functions in base R automatically convert characters to factors (controlled by the dread `stringsAsFactors` argument). To get more historical context, you might want to read [_stringsAsFactors: An unauthorized biography_](http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) by Roger Peng or [_stringsAsFactors = \_](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh) by Thomas Lumley.
+
+Factors aren't as common in the tidyverse, because no function will automatically turn a character vector into a factor. It is, however, a good idea to use factors when appropriate, and controlling their levels can be particularly useful for tailoring visualisations of categorical data.
+
+### Prerequisites
+
+To work with factors, we'll use the __forcats__ package (tools for dealing with **cat**egorical variables, and an anagram of factors). It provides a wide range of helpers for working with factors. We'll also use ggplot2 because factors are particularly important for visualisation.
+
+```{r setup, message = FALSE}
+# devtools::install_github("hadley/forcats")
+library(forcats)
+library(ggplot2)
+library(dplyr)
+```
+
+## Creating factors
+
+There are two ways to create a factor: during import with readr, using `col_factor()`, or after the fact, turning a string into a factor. Often you'll need to do a little experimentation, so I recommend starting with strings.
+
+To turn a string into a factor, call `factor()`, supplying a list of possible values:
+
+```{r}
+
+```
+
+For the rest of this chapter, we're going to focus on `forcats::gss_cat`. It's a sample of variables from the [General Social Survey](https://gssdataexplorer.norc.org/). The variables have been selected to illustrate a number of challenges of working with factors.
+
+```{r}
+gss_cat
+```
+
+You can see the levels of a factor with `levels()`:
+
+```{r}
+levels(gss_cat$race)
+```
+
+And this order is preserved in operations like `count()`:
+
+```{r}
+gss_cat %>%
+  count(race)
+```
+
+And in visualisations like `geom_bar()`:
+
+```{r}
+ggplot(gss_cat, aes(race)) +
+  geom_bar()
+```
+
+Note that by default, ggplot2 will drop levels that don't have any values. You can force them to appear with :
+
+```{r}
+ggplot(gss_cat, aes(race)) +
+  geom_bar() +
+  scale_x_discrete(drop = FALSE)
+```
+
+Currently dplyr doesn't have a `drop` option, but it will in the future.
+
+## Modifying factor order
+
+```{r}
+relig <- gss_cat %>%
+  group_by(relig) %>%
+  summarise(
+    age = mean(age, na.rm = TRUE),
+    tvhours = mean(tvhours, na.rm = TRUE),
+    n = n()
+  )
+
+ggplot(relig, aes(tvhours, relig)) + geom_point()
+ggplot(relig, aes(tvhours, fct_reorder(relig, tvhours))) + geom_point()
+```
+
+If you just want to pull a couple of levels out to the front, you can use `fct_relevel()`.
+
+```{r}
+rincome <- gss_cat %>%
+  group_by(rincome) %>%
+  summarise(
+    age = mean(age, na.rm = TRUE),
+    tvhours = mean(tvhours, na.rm = TRUE),
+    n = n()
+  )
+
+ggplot(rincome, aes(age, rincome)) + geom_point()
+
+gss_cat %>% count(fct_rev(rincome))
+```
+
+`fct_rev(rincome)`
+`fct_reorder(religion, rincome)`
+`fct_reorder2(religion, year, rincome)`
+
+
+```{r}
+by_year <- gss_cat %>%
+  count(year, marital) %>%
+  group_by(year) %>%
+  mutate(prop = n / sum(n))
+
+ggplot(by_year, aes(year, prop, colour = marital)) +
+  geom_line()
+
+ggplot(by_year, aes(year, prop, colour = fct_reorder2(marital, year, prop))) +
+  geom_line()
+
+```
+
+## Modifying factor levels
+
+`fct_recode()` is the most general. It allows you to transform levels.
+
+### Manually grouping
+
+```{r}
+fct_count(fct_collapse(gss_cat$partyid,
+  other = c("No answer", "Don't know", "Other party"),
+  rep = c("Strong republican", "Not str republican"),
+  ind = c("Ind,near rep", "Independent", "Ind,near dem"),
+  dem = c("Not str democrat", "Strong democrat")
+))
+```
+
+### Lumping small groups together
+
+```{r}
+gss_cat %>% mutate(relig = fct_lump(relig)) %>% count(relig)
+gss_cat %>% mutate(relig = fct_lump(relig, 5)) %>% count(relig, sort = TRUE)
+```
+
+```{r}
+gss_cat$relig %>% fct_infreq() %>% fct_lump(5) %>% fct_count()
+gss_cat$relig %>% fct_lump(5) %>% fct_infreq() %>% fct_count()
+```
+
+`fct_reorder()` is sometimes also useful. It...
diff --git a/vectors.Rmd b/vectors.Rmd
index 53afcb8..c214089 100644
--- a/vectors.Rmd
+++ b/vectors.Rmd
@@ -597,9 +597,7 @@ typeof(x)
 attributes(x)
 ```
 
-Historically, factors were much easier to work with than characters so many functions in base R automatically convert characters to factors (controlled by the dread `stringsAsFactors` argument). To get more historical context, you might want to read [stringsAsFactors: An unauthorized biography](http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) by Roger Peng or [stringsAsFactors = \](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh) by Thomas Lumley. The motivation for factors is modelling. If you're going to fit a model to categorical data, you need to know in advance all the possible values. There's no way to make a prediction for "green" if all you've ever seen is "red", "blue", and "yellow".
-
-Factors aren't common in the tidyverse, but you will need to deal with them if you are working with base R or many other packages. When you encounter a factor, you should first check to see if you can avoid creating it in the first place. Often there will be `stringsAsFactors` argument that you can set to `FALSE`. Otherwise, you can apply `as.character()` to the column to explicitly turn back into a character vector.
+You can create them from scratch with `factor()` or from a character vector with `as.factor()`.
 
 ```{r}
 x <- factor(letters[1:5])
@@ -607,7 +605,6 @@ is.factor(x)
 as.factor(letters[1:5])
 ```
 
-Otherwise, you might try my __forcats__ package, which provides handy functions for working with factors (forcats = tools **for** **cat**egorical variables, and is an anagram of factors!). At the time of writing it was only available on github, , but it may have made it to CRAN by the time you read this book.
 
 ### Dates and date-times
 
diff --git a/wrangle.Rmd b/wrangle.Rmd
index d15ec3d..a8a94a4 100644
--- a/wrangle.Rmd
+++ b/wrangle.Rmd
@@ -30,6 +30,10 @@ Data wrangling also encompasses data transformation, which you've already learn
 
 * [Strings] will introduce regular expressions, a powerful tool for
   manipulating strings.
+
+* [Factors] are how R stores categorical data. They are used when a variable
+  has a fixed set of possible values, or when you want to use a non-alphabetical
+  ordering of a string.
 
 * [Dates and times] will give you the key tools for working with dates and
   date-times.
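
Note on the "Creating factors" section of `factors.Rmd`: the text says to call `factor()` with the possible values, but the chunk that follows it in this patch is still empty. Below is a minimal sketch of what such a call could look like, using base R only. The `month_levels` and `x` values are made up for illustration and are not part of the patch.

```r
# Hypothetical example data: the allowed values, in the order we want them.
month_levels <- c("Jan", "Feb", "Mar", "Apr")

# A character vector that uses only some of those values.
x <- c("Mar", "Jan", "Apr", "Jan")

# Supplying `levels` fixes both the set of valid values and their order.
f <- factor(x, levels = month_levels)

levels(f)  # all four levels, in the supplied (non-alphabetical) order
sort(f)    # sorts by level order rather than alphabetically
table(f)   # "Feb" is kept as a level even though it never occurs

# Values that are not in `levels` are silently converted to NA.
factor(c("Jan", "Jam"), levels = month_levels)
```

Passing `levels` explicitly is what gives a factor its fixed, known set of values; without it, `factor()` falls back to the sorted unique values of the input.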