From 78ab61f284079323a7b92e92306ef190de1d1d00 Mon Sep 17 00:00:00 2001
From: Hadley Wickham <h.wickham@gmail.com>
Date: Mon, 19 Apr 2021 07:59:07 -0500
Subject: [PATCH] Pull content out of tidying

---
 data-tidy.Rmd      | 209 ---------------------------------------------
 missing-values.Rmd |  84 ++++++++++++++++++
 strings.Rmd        | 125 +++++++++++++++++++++++++++
 3 files changed, 209 insertions(+), 209 deletions(-)

diff --git a/data-tidy.Rmd b/data-tidy.Rmd
index 6cba19f..41256de 100644
--- a/data-tidy.Rmd
+++ b/data-tidy.Rmd
@@ -1,7 +1,5 @@
 # Data tidying {#data-tidy}
 
-<!--# Take out bit on missing values and move to missing values chapter. Maybe also move case study elsewhere? -->
-
 ## Introduction
 
 > "Happy families are all alike; every unhappy family is unhappy in its own way." ---- Leo Tolstoy
@@ -440,213 +438,6 @@ As you might have guessed from their names, `pivot_wider()` and `pivot_longer()`
       pivot_wider(names_from = drv, values_from = n)
     ```
 
-## Separating
-
-So far you've learned how to tidy `table2`, `table4a`, and `table4b`, but not `table3`.
-`table3` has a different problem: we have one column (`rate`) that contains two variables (`cases` and `population`).
-To fix this problem, we'll need the `separate()` function.
-You'll also learn about the complement of `separate()`: `unite()`, which you use if a single variable is spread across multiple columns.
-
-### Separate
-
-`separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears.
-Take `table3`:
-
-```{r}
-table3
-```
-
-The `rate` column contains both `cases` and `population` variables, and we need to split it into two variables.
-`separate()` takes the name of the column to separate, and the names of the columns to separate into, as shown in Figure \@ref(fig:tidy-separate) and the code below.
-
-```{r}
-table3 %>%
-  separate(rate, into = c("cases", "population"))
-```
-
-```{r tidy-separate, echo = FALSE, out.width = "75%", fig.cap = "Separating `rate` into `cases` and `population` to make `table3` tidy", fig.alt = "Two panels, one with a data frame with three columns (country, year, and rate) and the other with a data frame with four columns (country, year, cases, and population). Arrows show how the rate variable is separated into two variables: cases and population."}
-knitr::include_graphics("images/tidy-17.png")
-```
-
-By default, `separate()` will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter).
-For example, in the code above, `separate()` split the values of `rate` at the forward slash characters.
-If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`.
-For example, we could rewrite the code above as:
-
-```{r eval = FALSE}
-table3 %>%
-  separate(rate, into = c("cases", "population"), sep = "/")
-```
-
-(Formally, `sep` is a regular expression, which you'll learn more about in Chapter \@ref(strings).)
-
-Look carefully at the column types: you'll notice that `cases` and `population` are character columns.
-This is the default behaviour in `separate()`: it leaves the type of the column as is.
-Here, however, it's not very useful as those really are numbers.
-We can ask `separate()` to try and convert to better types using `convert = TRUE`:
-
-```{r}
-table3 %>%
-  separate(rate, into = c("cases", "population"), convert = TRUE)
-```
-
-### Unite
-
-`unite()` is the inverse of `separate()`: it combines multiple columns into a single column.
-You'll need it much less frequently than `separate()`, but it's still a useful tool to have in your back pocket.
-
-We can use `unite()` to rejoin the `cases` and `population` columns that we created in the last example.
-That data is saved as `tidyr::table1`.
-`unite()` takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in `dplyr::select()` style:
-
-```{r}
-table1 %>%
-  unite(rate, cases, population)
-```
-
-In this case we also need to use the `sep` argument.
-The default will place an underscore (`_`) between the values from different columns.
-Here we want `"/"` instead:
-
-```{r}
-table1 %>%
-  unite(rate, cases, population, sep = "/")
-```
-
-### Exercises
-
-1.  What do the `extra` and `fill` arguments do in `separate()`?
-    Experiment with the various options for the following two toy datasets.
-
-    ```{r, eval = FALSE}
-    tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
-      separate(x, c("one", "two", "three"))
-
-    tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
-      separate(x, c("one", "two", "three"))
-    ```
-
-2.  Both `unite()` and `separate()` have a `remove` argument.
-    What does it do?
-    Why would you set it to `FALSE`?
-
-3.  Compare and contrast `separate()` and `extract()`.
-    Why are there three variations of separation (by position, by separator, and with groups), but only one unite?
-
-4.  In the following example we're using `unite()` to create a `date` column from `month` and `day` columns.
-    How would you achieve the same outcome using `mutate()` and `paste()` instead of unite?
-
-    ```{r, eval = FALSE}
-    events <- tribble(
-      ~month, ~day,
-      1     , 20,
-      1     , 21,
-      1     , 22
-    )
-
-    events %>%
-      unite("date", month:day, sep = "-", remove = FALSE)
-    ```
-
-5.  You can also pass a vector of integers to `sep`. `separate()` will interpret the integers as positions to split at.
-    Positive values start at 1 on the far-left of the strings; negative value start at -1 on the far-right of the strings.
-    Use `separate()` to represent location information in the following tibble in two columns: `state` (represented by the first two characters) and `county`.
-    Do this in two ways: using a positive and a negative value for `sep`.
-
-    ```{r}
-    baker <- tribble(
-      ~location,
-      "FLBaker County",
-      "GABaker County",
-      "ORBaker County",
-    )
-    baker
-    ```
-
-## Missing values {#missing-values-tidy}
-
-Changing the representation of a dataset brings up an important subtlety of missing values.
-Surprisingly, a value can be missing in one of two possible ways:
-
--   **Explicitly**, i.e. flagged with `NA`.
--   **Implicitly**, i.e. simply not present in the data.
-
-Let's illustrate this idea with a very simple data set:
-
-```{r}
-stocks <- tibble(
-  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
-  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
-  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
-)
-```
-
-There are two missing values in this dataset:
-
--   The return for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be instead contains `NA`.
-
--   The return for the first quarter of 2016 is implicitly missing, because it simply does not appear in the dataset.
-
-One way to think about the difference is with this Zen-like koan: An explicit missing value is the presence of an absence; an implicit missing value is the absence of a presence.
-
-The way that a dataset is represented can make implicit values explicit.
-For example, we can make the implicit missing value explicit by putting years in the columns:
-
-```{r}
-stocks %>%
-  pivot_wider(names_from = year, values_from = return)
-```
-
-Because these explicit missing values may not be important in other representations of the data, you can set `values_drop_na = TRUE` in `pivot_longer()` to turn explicit missing values implicit:
-
-```{r}
-stocks %>%
-  pivot_wider(names_from = year, values_from = return) %>%
-  pivot_longer(
-    cols = c(`2015`, `2016`),
-    names_to = "year",
-    values_to = "return",
-    values_drop_na = TRUE
-  )
-```
-
-Another important tool for making missing values explicit in tidy data is `complete()`:
-
-```{r}
-stocks %>%
-  complete(year, qtr)
-```
-
-`complete()` takes a set of columns, and finds all unique combinations.
-It then ensures the original dataset contains all those values, filling in explicit `NA`s where necessary.
-
-There's one other important tool that you should know for working with missing values.
-Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:
-
-```{r}
-treatment <- tribble(
-  ~person,           ~treatment, ~response,
-  "Derrick Whitmore", 1,         7,
-  NA,                 2,         10,
-  NA,                 3,         9,
-  "Katherine Burke",  1,         4
-)
-```
-
-You can fill in these missing values with `fill()`.
-It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).
-
-```{r}
-treatment %>%
-  fill(person)
-```
-
-### Exercises
-
-1.  Compare and contrast the `fill` arguments to `pivot_wider()` and `complete()`.
-
-2.  What does the direction argument to `fill()` do?
-
 ## Case study
 
 To finish off the chapter, let's pull together everything you've learned to tackle a realistic data tidying problem.
diff --git a/missing-values.Rmd b/missing-values.Rmd
index ed10e52..8a96a7f 100644
--- a/missing-values.Rmd
+++ b/missing-values.Rmd
@@ -42,6 +42,90 @@ If you want to determine if a value is missing, use `is.na()`:
 is.na(x)
 ```
 
+## Explicit vs implicit missing values {#missing-values-tidy}
+
+Changing the representation of a dataset brings up an important subtlety of missing values.
+Surprisingly, a value can be missing in one of two possible ways:
+
+-   **Explicitly**, i.e. flagged with `NA`.
+-   **Implicitly**, i.e. simply not present in the data.
+
+Let's illustrate this idea with a very simple dataset:
+
+```{r}
+stocks <- tibble(
+  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
+  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
+  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
+)
+```
+
+There are two missing values in this dataset:
+
+-   The return for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be instead contains `NA`.
+
+-   The return for the first quarter of 2016 is implicitly missing, because it simply does not appear in the dataset.
+
+One way to think about the difference is with this Zen-like koan: An explicit missing value is the presence of an absence; an implicit missing value is the absence of a presence.
+
+The way that a dataset is represented can make implicit values explicit.
+For example, we can make the implicit missing value explicit by putting years in the columns:
+
+```{r}
+stocks %>%
+  pivot_wider(names_from = year, values_from = return)
+```
+
+Because these explicit missing values may not be important in other representations of the data, you can set `values_drop_na = TRUE` in `pivot_longer()` to turn explicit missing values implicit:
+
+```{r}
+stocks %>%
+  pivot_wider(names_from = year, values_from = return) %>%
+  pivot_longer(
+    cols = c(`2015`, `2016`),
+    names_to = "year",
+    values_to = "return",
+    values_drop_na = TRUE
+  )
+```
+
+Another important tool for making missing values explicit in tidy data is `complete()`:
+
+```{r}
+stocks %>%
+  complete(year, qtr)
+```
+
+`complete()` takes a set of columns, and finds all unique combinations.
+It then ensures the original dataset contains all those values, filling in explicit `NA`s where necessary.
+
+There's one other important tool that you should know for working with missing values.
+Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:
+
+```{r}
+treatment <- tribble(
+  ~person,           ~treatment, ~response,
+  "Derrick Whitmore", 1,         7,
+  NA,                 2,         10,
+  NA,                 3,         9,
+  "Katherine Burke",  1,         4
+)
+```
+
+You can fill in these missing values with `fill()`.
+It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).
+
+```{r}
+treatment %>%
+  fill(person)
+```
+
+### Exercises
+
+1.  Compare and contrast the `values_fill` argument to `pivot_wider()` and the `fill` argument to `complete()`.
+
+2.  What does the `.direction` argument to `fill()` do?
+
 ## dplyr verbs
 
 `filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values.
diff --git a/strings.Rmd b/strings.Rmd
index 4f42585..9ab8291 100644
--- a/strings.Rmd
+++ b/strings.Rmd
@@ -1048,3 +1048,128 @@ The main difference is the prefix: `str_` vs. `stri_`.
     c.  Generate random text.
 
 2.  How do you control the language that `stri_sort()` uses for sorting?
+
+## tidyr
+
+So far you've learned how to tidy `table2`, `table4a`, and `table4b`, but not `table3`.
+`table3` has a different problem: we have one column (`rate`) that contains two variables (`cases` and `population`).
+To fix this problem, we'll need the `separate()` function.
+You'll also learn about the complement of `separate()`: `unite()`, which you use if a single variable is spread across multiple columns.
+
+### Separate
+
+`separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears.
+Take `table3`:
+
+```{r}
+table3
+```
+
+The `rate` column contains both `cases` and `population` variables, and we need to split it into two variables.
+`separate()` takes the name of the column to separate, and the names of the columns to separate into, as shown in Figure \@ref(fig:tidy-separate) and the code below.
+
+```{r}
+table3 %>%
+  separate(rate, into = c("cases", "population"))
+```
+
+```{r tidy-separate, echo = FALSE, out.width = "75%", fig.cap = "Separating `rate` into `cases` and `population` to make `table3` tidy", fig.alt = "Two panels, one with a data frame with three columns (country, year, and rate) and the other with a data frame with four columns (country, year, cases, and population). Arrows show how the rate variable is separated into two variables: cases and population."}
+knitr::include_graphics("images/tidy-17.png")
+```
+
+By default, `separate()` will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter).
+For example, in the code above, `separate()` split the values of `rate` at the forward slash characters.
+If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`.
+For example, we could rewrite the code above as:
+
+```{r eval = FALSE}
+table3 %>%
+  separate(rate, into = c("cases", "population"), sep = "/")
+```
+
+(Formally, `sep` is a regular expression, which you learned about earlier in this chapter.)
+
+Look carefully at the column types: you'll notice that `cases` and `population` are character columns.
+This is the default behaviour in `separate()`: it leaves the type of the column as is.
+Here, however, it's not very useful as those really are numbers.
+We can ask `separate()` to try to convert the new columns to better types using `convert = TRUE`:
+
+```{r}
+table3 %>%
+  separate(rate, into = c("cases", "population"), convert = TRUE)
+```
+
+### Unite
+
+`unite()` is the inverse of `separate()`: it combines multiple columns into a single column.
+You'll need it much less frequently than `separate()`, but it's still a useful tool to have in your back pocket.
+
+We can use `unite()` to rejoin the `cases` and `population` columns that we created in the last example.
+That data is saved as `tidyr::table1`.
+`unite()` takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in `dplyr::select()` style:
+
+```{r}
+table1 %>%
+  unite(rate, cases, population)
+```
+
+In this case we also need to use the `sep` argument.
+The default will place an underscore (`_`) between the values from different columns.
+Here we want `"/"` instead:
+
+```{r}
+table1 %>%
+  unite(rate, cases, population, sep = "/")
+```
+
+### Exercises
+
+1.  What do the `extra` and `fill` arguments do in `separate()`?
+    Experiment with the various options for the following two toy datasets.
+
+    ```{r, eval = FALSE}
+    tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
+      separate(x, c("one", "two", "three"))
+
+    tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
+      separate(x, c("one", "two", "three"))
+    ```
+
+2.  Both `unite()` and `separate()` have a `remove` argument.
+    What does it do?
+    Why would you set it to `FALSE`?
+
+3.  Compare and contrast `separate()` and `extract()`.
+    Why are there three variations of separation (by position, by separator, and with groups), but only one unite?
+
+4.  In the following example we're using `unite()` to create a `date` column from `month` and `day` columns.
+    How would you achieve the same outcome using `mutate()` and `paste()` instead of `unite()`?
+
+    ```{r, eval = FALSE}
+    events <- tribble(
+      ~month, ~day,
+      1     , 20,
+      1     , 21,
+      1     , 22
+    )
+
+    events %>%
+      unite("date", month:day, sep = "-", remove = FALSE)
+    ```
+
+5.  You can also pass a vector of integers to `sep`. `separate()` will interpret the integers as positions to split at.
+    Positive values start at 1 on the far-left of the strings; negative values start at -1 on the far-right of the strings.
+    Use `separate()` to represent location information in the following tibble in two columns: `state` (represented by the first two characters) and `county`.
+    Do this in two ways: using a positive and a negative value for `sep`.
+
+    ```{r}
+    baker <- tribble(
+      ~location,
+      "FLBaker County",
+      "GABaker County",
+      "ORBaker County",
+    )
+    baker
+    ```
+
+##