From 78ab61f284079323a7b92e92306ef190de1d1d00 Mon Sep 17 00:00:00 2001
From: Hadley Wickham
Date: Mon, 19 Apr 2021 07:59:07 -0500
Subject: [PATCH] Pull content out of tidying

---
 data-tidy.Rmd      | 209 ---------------------------------------------
 missing-values.Rmd |  84 ++++++++++++++++++
 strings.Rmd        | 125 +++++++++++++++++++++++++++
 3 files changed, 209 insertions(+), 209 deletions(-)

diff --git a/data-tidy.Rmd b/data-tidy.Rmd
index 6cba19f..41256de 100644
--- a/data-tidy.Rmd
+++ b/data-tidy.Rmd
@@ -1,7 +1,5 @@
 # Data tidying {#data-tidy}
 
-
-
 ## Introduction
 
 > "Happy families are all alike; every unhappy family is unhappy in its own way." ---- Leo Tolstoy
@@ -440,213 +438,6 @@ As you might have guessed from their names, `pivot_wider()` and `pivot_longer()`
   pivot_wider(names_from = drv, values_from = n)
 ```
 
-## Separating
-
-So far you've learned how to tidy `table2`, `table4a`, and `table4b`, but not `table3`.
-`table3` has a different problem: we have one column (`rate`) that contains two variables (`cases` and `population`).
-To fix this problem, we'll need the `separate()` function.
-You'll also learn about the complement of `separate()`: `unite()`, which you use if a single variable is spread across multiple columns.
-
-### Separate
-
-`separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears.
-Take `table3`:
-
-```{r}
-table3
-```
-
-The `rate` column contains both `cases` and `population` variables, and we need to split it into two variables.
-`separate()` takes the name of the column to separate, and the names of the columns to separate into, as shown in Figure \@ref(fig:tidy-separate) and the code below.
-
-```{r}
-table3 %>%
-  separate(rate, into = c("cases", "population"))
-```
-
-```{r tidy-separate, echo = FALSE, out.width = "75%", fig.cap = "Separating `rate` into `cases` and `population` to make `table3` tidy", fig.alt = "Two panels, one with a data frame with three columns (country, year, and rate) and the other with a data frame with four columns (country, year, cases, and population). Arrows show how the rate variable is separated into two variables: cases and population."}
-knitr::include_graphics("images/tidy-17.png")
-```
-
-By default, `separate()` will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter).
-For example, in the code above, `separate()` split the values of `rate` at the forward slash characters.
-If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`.
-For example, we could rewrite the code above as:
-
-```{r eval = FALSE}
-table3 %>%
-  separate(rate, into = c("cases", "population"), sep = "/")
-```
-
-(Formally, `sep` is a regular expression, which you'll learn more about in Chapter \@ref(strings).)
-
-Look carefully at the column types: you'll notice that `cases` and `population` are character columns.
-This is the default behaviour in `separate()`: it leaves the type of the column as is.
-Here, however, it's not very useful as those really are numbers.
-We can ask `separate()` to try and convert to better types using `convert = TRUE`:
-
-```{r}
-table3 %>%
-  separate(rate, into = c("cases", "population"), convert = TRUE)
-```
-
-### Unite
-
-`unite()` is the inverse of `separate()`: it combines multiple columns into a single column.
-You'll need it much less frequently than `separate()`, but it's still a useful tool to have in your back pocket.
-
-We can use `unite()` to rejoin the `cases` and `population` columns that we created in the last example.
-That data is saved as `tidyr::table1`.
-`unite()` takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in `dplyr::select()` style:
-
-```{r}
-table1 %>%
-  unite(rate, cases, population)
-```
-
-In this case we also need to use the `sep` argument.
-The default will place an underscore (`_`) between the values from different columns.
-Here we want `"/"` instead:
-
-```{r}
-table1 %>%
-  unite(rate, cases, population, sep = "/")
-```
-
-### Exercises
-
-1. What do the `extra` and `fill` arguments do in `separate()`?
-   Experiment with the various options for the following two toy datasets.
-
-   ```{r, eval = FALSE}
-   tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
-     separate(x, c("one", "two", "three"))
-
-   tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
-     separate(x, c("one", "two", "three"))
-   ```
-
-2. Both `unite()` and `separate()` have a `remove` argument.
-   What does it do?
-   Why would you set it to `FALSE`?
-
-3. Compare and contrast `separate()` and `extract()`.
-   Why are there three variations of separation (by position, by separator, and with groups), but only one unite?
-
-4. In the following example we're using `unite()` to create a `date` column from `month` and `day` columns.
-   How would you achieve the same outcome using `mutate()` and `paste()` instead of unite?
-
-   ```{r, eval = FALSE}
-   events <- tribble(
-     ~month, ~day,
-     1     , 20,
-     1     , 21,
-     1     , 22
-   )
-
-   events %>%
-     unite("date", month:day, sep = "-", remove = FALSE)
-   ```
-
-5. You can also pass a vector of integers to `sep`. `separate()` will interpret the integers as positions to split at.
-   Positive values start at 1 on the far-left of the strings; negative value start at -1 on the far-right of the strings.
-   Use `separate()` to represent location information in the following tibble in two columns: `state` (represented by the first two characters) and `county`.
-   Do this in two ways: using a positive and a negative value for `sep`.
-
-   ```{r}
-   baker <- tribble(
-     ~location,
-     "FLBaker County",
-     "GABaker County",
-     "ORBaker County",
-   )
-   baker
-   ```
-
-## Missing values {#missing-values-tidy}
-
-Changing the representation of a dataset brings up an important subtlety of missing values.
-Surprisingly, a value can be missing in one of two possible ways:
-
-- **Explicitly**, i.e. flagged with `NA`.
-- **Implicitly**, i.e. simply not present in the data.
-
-Let's illustrate this idea with a very simple data set:
-
-```{r}
-stocks <- tibble(
-  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
-  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
-  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
-)
-```
-
-There are two missing values in this dataset:
-
-- The return for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be instead contains `NA`.
-
-- The return for the first quarter of 2016 is implicitly missing, because it simply does not appear in the dataset.
-
-One way to think about the difference is with this Zen-like koan: An explicit missing value is the presence of an absence; an implicit missing value is the absence of a presence.
-
-The way that a dataset is represented can make implicit values explicit.
-For example, we can make the implicit missing value explicit by putting years in the columns:
-
-```{r}
-stocks %>%
-  pivot_wider(names_from = year, values_from = return)
-```
-
-Because these explicit missing values may not be important in other representations of the data, you can set `values_drop_na = TRUE` in `pivot_longer()` to turn explicit missing values implicit:
-
-```{r}
-stocks %>%
-  pivot_wider(names_from = year, values_from = return) %>%
-  pivot_longer(
-    cols = c(`2015`, `2016`),
-    names_to = "year",
-    values_to = "return",
-    values_drop_na = TRUE
-  )
-```
-
-Another important tool for making missing values explicit in tidy data is `complete()`:
-
-```{r}
-stocks %>%
-  complete(year, qtr)
-```
-
-`complete()` takes a set of columns, and finds all unique combinations.
-It then ensures the original dataset contains all those values, filling in explicit `NA`s where necessary.
-
-There's one other important tool that you should know for working with missing values.
-Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:
-
-```{r}
-treatment <- tribble(
-  ~person,           ~treatment, ~response,
-  "Derrick Whitmore", 1,         7,
-  NA,                 2,         10,
-  NA,                 3,         9,
-  "Katherine Burke",  1,         4
-)
-```
-
-You can fill in these missing values with `fill()`.
-It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).
-
-```{r}
-treatment %>%
-  fill(person)
-```
-
-### Exercises
-
-1. Compare and contrast the `fill` arguments to `pivot_wider()` and `complete()`.
-
-2. What does the direction argument to `fill()` do?
-
 ## Case study
 
 To finish off the chapter, let's pull together everything you've learned to tackle a realistic data tidying problem.
diff --git a/missing-values.Rmd b/missing-values.Rmd
index ed10e52..8a96a7f 100644
--- a/missing-values.Rmd
+++ b/missing-values.Rmd
@@ -42,6 +42,90 @@ If you want to determine if a value is missing, use `is.na()`:
 is.na(x)
 ```
 
+## Explicit vs implicit missing values {#missing-values-tidy}
+
+Changing the representation of a dataset brings up an important subtlety of missing values.
+Surprisingly, a value can be missing in one of two possible ways:
+
+- **Explicitly**, i.e. flagged with `NA`.
+- **Implicitly**, i.e. simply not present in the data.
+
+Let's illustrate this idea with a very simple data set:
+
+```{r}
+stocks <- tibble(
+  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
+  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
+  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
+)
+```
+
+There are two missing values in this dataset:
+
+- The return for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be instead contains `NA`.
+
+- The return for the first quarter of 2016 is implicitly missing, because it simply does not appear in the dataset.
+
+One way to think about the difference is with this Zen-like koan: An explicit missing value is the presence of an absence; an implicit missing value is the absence of a presence.
+
+The way that a dataset is represented can make implicit values explicit.
+For example, we can make the implicit missing value explicit by putting years in the columns:
+
+```{r}
+stocks %>%
+  pivot_wider(names_from = year, values_from = return)
+```
+
+Because these explicit missing values may not be important in other representations of the data, you can set `values_drop_na = TRUE` in `pivot_longer()` to turn explicit missing values implicit:
+
+```{r}
+stocks %>%
+  pivot_wider(names_from = year, values_from = return) %>%
+  pivot_longer(
+    cols = c(`2015`, `2016`),
+    names_to = "year",
+    values_to = "return",
+    values_drop_na = TRUE
+  )
+```
+
+Another important tool for making missing values explicit in tidy data is `complete()`:
+
+```{r}
+stocks %>%
+  complete(year, qtr)
+```
+
+`complete()` takes a set of columns, and finds all unique combinations.
+It then ensures the original dataset contains all those values, filling in explicit `NA`s where necessary.
+
+There's one other important tool that you should know for working with missing values.
+Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:
+
+```{r}
+treatment <- tribble(
+  ~person,           ~treatment, ~response,
+  "Derrick Whitmore", 1,         7,
+  NA,                 2,         10,
+  NA,                 3,         9,
+  "Katherine Burke",  1,         4
+)
+```
+
+You can fill in these missing values with `fill()`.
+It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).
+
+```{r}
+treatment %>%
+  fill(person)
+```
+
+### Exercises
+
+1. Compare and contrast the `fill` arguments to `pivot_wider()` and `complete()`.
+
+2. What does the direction argument to `fill()` do?
+
 ## dplyr verbs
 
 `filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values.
diff --git a/strings.Rmd b/strings.Rmd
index 4f42585..9ab8291 100644
--- a/strings.Rmd
+++ b/strings.Rmd
@@ -1048,3 +1048,128 @@ The main difference is the prefix: `str_` vs. `stri_`.
    c. Generate random text.
 
 2. How do you control the language that `stri_sort()` uses for sorting?
+
+## tidyr
+
+So far you've learned how to tidy `table2`, `table4a`, and `table4b`, but not `table3`.
+`table3` has a different problem: we have one column (`rate`) that contains two variables (`cases` and `population`).
+To fix this problem, we'll need the `separate()` function.
+You'll also learn about the complement of `separate()`: `unite()`, which you use if a single variable is spread across multiple columns.
+
+### Separate
+
+`separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears.
+Take `table3`:
+
+```{r}
+table3
+```
+
+The `rate` column contains both `cases` and `population` variables, and we need to split it into two variables.
+`separate()` takes the name of the column to separate, and the names of the columns to separate into, as shown in Figure \@ref(fig:tidy-separate) and the code below.
+
+```{r}
+table3 %>%
+  separate(rate, into = c("cases", "population"))
+```
+
+```{r tidy-separate, echo = FALSE, out.width = "75%", fig.cap = "Separating `rate` into `cases` and `population` to make `table3` tidy", fig.alt = "Two panels, one with a data frame with three columns (country, year, and rate) and the other with a data frame with four columns (country, year, cases, and population). Arrows show how the rate variable is separated into two variables: cases and population."}
+knitr::include_graphics("images/tidy-17.png")
+```
+
+By default, `separate()` will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter).
+For example, in the code above, `separate()` split the values of `rate` at the forward slash characters.
+If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`.
+For example, we could rewrite the code above as:
+
+```{r eval = FALSE}
+table3 %>%
+  separate(rate, into = c("cases", "population"), sep = "/")
+```
+
+(Formally, `sep` is a regular expression, which you'll learn more about in Chapter \@ref(strings).)
+
+Look carefully at the column types: you'll notice that `cases` and `population` are character columns.
+This is the default behaviour in `separate()`: it leaves the type of the column as is.
+Here, however, it's not very useful as those really are numbers.
+We can ask `separate()` to try and convert to better types using `convert = TRUE`:
+
+```{r}
+table3 %>%
+  separate(rate, into = c("cases", "population"), convert = TRUE)
+```
+
+### Unite
+
+`unite()` is the inverse of `separate()`: it combines multiple columns into a single column.
+You'll need it much less frequently than `separate()`, but it's still a useful tool to have in your back pocket.
+
+We can use `unite()` to rejoin the `cases` and `population` columns that we created in the last example.
+That data is saved as `tidyr::table1`.
+`unite()` takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in `dplyr::select()` style:
+
+```{r}
+table1 %>%
+  unite(rate, cases, population)
+```
+
+In this case we also need to use the `sep` argument.
+The default will place an underscore (`_`) between the values from different columns.
+Here we want `"/"` instead:
+
+```{r}
+table1 %>%
+  unite(rate, cases, population, sep = "/")
+```
+
+### Exercises
+
+1. What do the `extra` and `fill` arguments do in `separate()`?
+   Experiment with the various options for the following two toy datasets.
+
+   ```{r, eval = FALSE}
+   tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
+     separate(x, c("one", "two", "three"))
+
+   tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
+     separate(x, c("one", "two", "three"))
+   ```
+
+2. Both `unite()` and `separate()` have a `remove` argument.
+   What does it do?
+   Why would you set it to `FALSE`?
+
+3. Compare and contrast `separate()` and `extract()`.
+   Why are there three variations of separation (by position, by separator, and with groups), but only one unite?
+
+4. In the following example we're using `unite()` to create a `date` column from `month` and `day` columns.
+   How would you achieve the same outcome using `mutate()` and `paste()` instead of unite?
+
+   ```{r, eval = FALSE}
+   events <- tribble(
+     ~month, ~day,
+     1     , 20,
+     1     , 21,
+     1     , 22
+   )
+
+   events %>%
+     unite("date", month:day, sep = "-", remove = FALSE)
+   ```
+
+5. You can also pass a vector of integers to `sep`. `separate()` will interpret the integers as positions to split at.
+   Positive values start at 1 on the far-left of the strings; negative value start at -1 on the far-right of the strings.
+   Use `separate()` to represent location information in the following tibble in two columns: `state` (represented by the first two characters) and `county`.
+   Do this in two ways: using a positive and a negative value for `sep`.
+
+   ```{r}
+   baker <- tribble(
+     ~location,
+     "FLBaker County",
+     "GABaker County",
+     "ORBaker County",
+   )
+   baker
+   ```
+
+##