Start sketching out extract section

Hadley Wickham 2022-10-04 16:37:02 -05:00
parent 35d4eed391
commit 7189136cf5
1 changed files with 92 additions and 8 deletions


@@ -258,15 +258,99 @@ df |>
Working from <https://github.com/tidyverse/tidyr/pull/1304>.
Common for multiple variables worth of data to be stored in a single string.
In this section you'll learn how to four various tidyr to extract them.
It's very common for multiple variables to be crammed together into a single string.
In this section you'll learn how to use four tidyr functions to extract them:
- `separate_by_longer()`
- `separate_at_longer()`
- `separate_by_wider()`
- `separate_at_wider()`
- `df |> separate_by_longer(col, sep)`
- `df |> separate_at_longer(col, width)`
- `df |> separate_by_wider(col, sep, names)`
- `df |> separate_at_wider(col, widths)`
We'll come back to the fifth member of this family, `separate_regex_wider()`, in @sec-regular-expressions since you need to know regular expression to use it.
If you look closely you can see there's a common pattern here: `separate`, followed by `by` or `at`, followed by `longer` or `wider`.
`by` splits up a string with a separator like `", "` or `" "`.
`at` splits at given locations, like 5, 10, and 17.
`longer` makes the input data frame longer, creating new rows; `wider` makes the input data frame wider, adding new columns.
There's one more member of this family, `separate_regex_wider()`, that we'll come back to in @sec-regular-expressions.
It's the most flexible of the `at` forms, but you need to know a bit about regular expressions in order to use it.
```{r}
#| include: false
has_dev_tidyr <- packageVersion("tidyr") >= "1.2.1.9001"
```
The next two sections will give you the basic idea behind these separate functions, and then we'll work through a few case studies that require multiple uses.
### Splitting into rows
`separate_by_longer()` and `separate_at_longer()` are most useful when the number of components varies from row to row.
`separate_by_longer()` arises most commonly:
```{r}
#| eval: !expr has_dev_tidyr
df1 <- tibble(x = c("a,b,c", "d,e", "f"))
df1 |>
  separate_by_longer(x, sep = ",")
```
(If the separators have some variation you can use a regular expression instead, if you know about it.)
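To make the idea of splitting into rows concrete, here is a minimal base R sketch of the same reshaping. It is only an illustration of the concept, not how tidyr implements `separate_by_longer()`; the names `pieces` and `long`, and the `row` column, are assumptions introduced for this example.

```r
# Illustrative base R sketch (not the tidyr implementation).
x <- c("a,b,c", "d,e", "f")

# Split each string on a literal comma; the result is a list with one
# character vector per input string.
pieces <- strsplit(x, ",", fixed = TRUE)

# Repeat each original row index once per piece, then flatten the pieces,
# so every component ends up in its own row.
long <- data.frame(
  row = rep(seq_along(pieces), lengths(pieces)),
  x = unlist(pieces)
)
long
```

The `row` column is kept here only to show which input string each piece came from.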
It's rarer to see `separate_at_longer()` in the wild, but some older datasets can adopt a very compact format where each character is used to record a value:
```{r}
#| eval: !expr has_dev_tidyr
df2 <- tibble(x = c("1211", "131", "21"))
df2 |>
  separate_at_longer(x, width = 1)
```
### Splitting into columns
`separate_by_wider()` and `separate_at_wider()` are most useful when there are a fixed number of components in each string, and you want to spread them into columns.
They are more complicated than their `longer` equivalents because you need to name the columns.
```{r}
#| eval: !expr has_dev_tidyr
df3 <- tibble(x = c("a,1,2022", "b,2,2011", "e,5,2015"))
df3 |>
  separate_by_wider(x, sep = ",", names = c("letter", "number", "year"))
```
If a specific value is not useful you can use `NA` to omit it from the results:
```{r}
#| eval: !expr has_dev_tidyr
df3 <- tibble(x = c("a,1,2022", "b,2,2011", "e,5,2015"))
df3 |>
  separate_by_wider(x, sep = ",", names = c("letter", NA, "year"))
```
Alternatively, you can provide `names_sep` and `separate_by_wider()` will use that separator to name the new columns automatically:
```{r}
#| eval: !expr has_dev_tidyr
df3 |>
  separate_by_wider(x, sep = ",", names_sep = "_")
```
`separate_at_wider()` works a little differently, because you typically want to specify the width of each column.
So you give it a named integer vector, where the name gives the name of the new column and the value is the number of characters it occupies.
You can omit values from the output by not naming them:
```{r}
#| eval: !expr has_dev_tidyr
df4 <- tibble(x = c("202215TX", "202122LA", "202325CA"))
df4 |>
  separate_at_wider(x, c(year = 4, age = 2, state = 2))
```
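The way a named vector of widths maps onto character positions can be sketched in base R. This is an illustration of the idea only, not the tidyr implementation; `starts`, `ends`, and `wide` are names assumed for this example.

```r
# Illustrative base R sketch (not the tidyr implementation).
x <- c("202215TX", "202122LA", "202325CA")
widths <- c(year = 4, age = 2, state = 2)

# Turn the widths into end and start positions:
# year spans 1-4, age spans 5-6, state spans 7-8.
ends <- cumsum(widths)
starts <- ends - widths + 1

# Cut each field out of every string and name the columns after the widths.
fields <- lapply(seq_along(widths), function(i) substr(x, starts[i], ends[i]))
wide <- as.data.frame(setNames(fields, names(widths)), stringsAsFactors = FALSE)
wide
```

Every column in `wide` is still a character vector; converting `year` and `age` to numbers would be a separate step.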
### Case studies
## Letters
@@ -355,7 +439,7 @@ TODO: add example with a plot.
1. Use `str_length()` and `str_sub()` to extract the middle letter from each baby name. What will you do if the string has an even number of characters?
2. Are there any major trends in the length of babynames over time? What about the popularity of first and last letters?
## Locale dependent operations {#sec-other-languages}
## Locale dependent {#sec-other-languages}
So far all of our examples have been using English.
The details of the many ways other languages are different to English are too diverse to detail here, but we wanted to give a quick outline of the functions whose behavior differs based on your **locale**, the set of settings that vary from country to country.