Fixes for dev tidyr

Hadley Wickham 2022-11-05 12:05:55 -05:00
parent 5ef6a6af54
commit 40a56c55ed
1 changed files with 16 additions and 33 deletions

@@ -261,10 +261,10 @@ Working from <https://github.com/tidyverse/tidyr/pull/1304>.
It's very common for multiple variables to be crammed together into a single string.
In this section you'll learn how to use four tidyr functions to extract them:
- `df |> separate_by_longer(col, sep)`
- `df |> separate_at_longer(col, width)`
- `df |> separate_by_wider(col, sep, names)`
- `df |> separate_at_wider(col, widths)`
- `df |> separate_longer_delim(col, delim)`
- `df |> separate_longer_position(col, width)`
- `df |> separate_wider_delim(col, delim, names)`
- `df |> separate_wider_position(col, widths)`
If you look closely you can see there's a common pattern here: `separate` followed by `by` or `at`, followed by `longer` or `wider`.
`by` splits up a string with a separator like `", "` or `" "`.
@@ -274,80 +274,63 @@ If you look closely you can see there's a common pattern here: `separate` follow
There's one more member of this family, `separate_regex_wider()`, that we'll come back to in @sec-regular-expressions.
It's the most flexible of the `at` forms but you need to know a bit about regular expressions in order to use it.
```{r}
#| include: false
has_dev_tidyr <- packageVersion("tidyr") >= "1.2.1.9001"
```
The next two sections will give you the basic idea behind these separate functions, and then we'll work through a few case studies that require multiple uses.
### Splitting into rows
`separate_by_longer()` and `separate_at_longer()` are most useful when the number of components varies from row to row.
`separate_by_longer()` arises most commonly:
`separate_longer_delim()` and `separate_longer_position()` are most useful when the number of components varies from row to row.
`separate_longer_delim()` arises most commonly:
```{r}
#| eval: !expr has_dev_tidyr
df1 <- tibble(x = c("a,b,c", "d,e", "f"))
df1 |>
separate_by_longer(x, sep = ",")
separate_longer_delim(x, delim = ",")
```
(If the separators vary from string to string, you can use a regular expression instead, if you're familiar with them.)
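To see what happens to the rest of the data frame when a column is split into rows, here is a minimal sketch. It assumes a version of tidyr that provides `separate_longer_delim()` (the development version at the time of this commit, or 1.3.0 and later). The `id` column and the names `df1_id` and `df1_long` are hypothetical additions for illustration only.

```r
library(tidyr)
library(tibble)

# Hypothetical data: `id` is an illustrative column added to the df1 example
df1_id <- tibble(id = 1:3, x = c("a,b,c", "d,e", "f"))

# Each comma-separated component of `x` gets its own row;
# the value of `id` is repeated for every row that came from the same string
df1_long <- df1_id |>
  separate_longer_delim(x, delim = ",")

df1_long
```

The output has one row per component, so a string with three components contributes three rows and a string with one component contributes a single row.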
It's rarer to see `separate_at_longer()` in the wild, but some older datasets can adopt a very compact format where each character is used to record a value:
It's rarer to see `separate_longer_position()` in the wild, but some older datasets can adopt a very compact format where each character is used to record a value:
```{r}
#| eval: !expr has_dev_tidyr
df2 <- tibble(x = c("1211", "131", "21"))
df2 |>
separate_at_longer(x, width = 1)
separate_longer_position(x, width = 1)
```
### Splitting into columns
`separate_by_wider()` and `separate_at_wider()` are most useful when there are a fixed number of components in each string, and you want to spread them into columns.
`separate_wider_delim()` and `separate_wider_position()` are most useful when there are a fixed number of components in each string, and you want to spread them into columns.
They are more complicated than their `longer` equivalents because you need to name the columns.
```{r}
#| eval: !expr has_dev_tidyr
df3 <- tibble(x = c("a,1,2022", "b,2,2011", "e,5,2015"))
df3 |>
separate_by_wider(x, sep = ",", names = c("letter", "number", "year"))
separate_wider_delim(x, delim = ",", names = c("letter", "number", "year"))
```
If a specific value is not useful you can use `NA` to omit it from the results:
```{r}
#| eval: !expr has_dev_tidyr
df3 <- tibble(x = c("a,1,2022", "b,2,2011", "e,5,2015"))
df3 |>
separate_by_wider(x, sep = ",", names = c("letter", NA, "year"))
separate_wider_delim(x, delim = ",", names = c("letter", NA, "year"))
```
Alternatively, you can provide `names_sep` and `separate_by_wider()` will use that separator to name automatically:
Alternatively, you can provide `names_sep` and `separate_wider_delim()` will use that separator to name automatically:
```{r}
#| eval: !expr has_dev_tidyr
df3 |>
separate_by_wider(x, sep = ",", names_sep = "_")
separate_wider_delim(x, delim = ",", names_sep = "_")
```
`separate_at_wider()` works a little differently, because you typically want to specify the width of each column.
`separate_wider_position()` works a little differently, because you typically want to specify the width of each column.
So you give it a named integer vector, where the name gives the name of the new column and the value is the number of characters it occupies.
You can omit values from the output by not naming them:
```{r}
#| eval: !expr has_dev_tidyr
df4 <- tibble(x = c("202215TX", "202122LA", "202325CA"))
df4 |>
separate_at_wider(x, c(year = 4, age = 2, state = 2))
separate_wider_position(x, c(year = 4, age = 2, state = 2))
```
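The example above names every component, so nothing is dropped. As a hedged sketch of the "not naming them" case, the following leaves the middle width unnamed. It assumes a tidyr version that provides `separate_wider_position()` (1.3.0 and later); `df4_wide` is a hypothetical name used only here.

```r
library(tidyr)
library(tibble)

df4 <- tibble(x = c("202215TX", "202122LA", "202325CA"))

# The middle width (2) has no name, so those two characters are
# matched but not included in the output
df4_wide <- df4 |>
  separate_wider_position(x, widths = c(year = 4, 2, state = 2))

df4_wide
```

Only `year` and `state` appear in the result, and both are character columns because the pieces are cut straight out of the original string.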
### Case studies