diff --git a/strings.qmd b/strings.qmd
index a0b9c94..ee35379 100644
--- a/strings.qmd
+++ b/strings.qmd
@@ -261,10 +261,10 @@ Working from .
 It's very common for multiple variables to be crammed together into a single string.
 In this section you'll learn how to use four tidyr to extract them:
 
-- `df |> separate_by_longer(col, sep)`
-- `df |> separate_at_longer(col, width)`
-- `df |> separate_by_wider(col, sep, names)`
-- `df |> separate_at_wider(col, widths)`
+- `df |> separate_longer_delim(col, delim)`
+- `df |> separate_longer_position(col, width)`
+- `df |> separate_wider_delim(col, delim, names)`
+- `df |> separate_wider_position(col, widths)`
 
 If you look closely you can see there's a common pattern here: `separate` followed by `by` or `at`, followed by longer or `wider`.
 `by` splits up a string with a separator like `", "` or `" "`.
@@ -274,80 +274,63 @@ If you look closely you can see there's a common pattern here: `separate` follow
 There's one more member of this family, `separate_regex_wider()`, that we'll come back in @sec-regular-expressions.
 It's the most flexible of the `at` forms but you need to know a bit about regular expression in order to use it.
 
-```{r}
-#| include: false
-has_dev_tidyr <- packageVersion("tidyr") >= "1.2.1.9001"
-```
-
 The next two sections will give you the basic idea behind these separate functions, and then we'll work through a few case studies that require mutliple uses.
 
 ### Splitting into rows
 
-`separate_by_longer()` and `separate_at_longer()` are most useful when the number of components varies from row to row.
-`separate_by_longer()` arises most commonly:
+`separate_longer_delim()` and `separate_longer_position()` are most useful when the number of components varies from row to row.
+`separate_longer_delim()` arises most commonly:
 
 ```{r}
-#| eval: !expr has_dev_tidyr
-
 df1 <- tibble(x = c("a,b,c", "d,e", "f"))
 df1 |>
-  separate_by_longer(x, sep = ",")
+  separate_longer_delim(x, delim = ",")
 ```
 
 (If the separators have some variation you can use a regular expression instead, if you know about it.)
 
-It's rarer to see `separate_at_longer()` in the wild, but some older datasets can adopt a very compact format where each character is used to record a value:
+It's rarer to see `separate_longer_position()` in the wild, but some older datasets can adopt a very compact format where each character is used to record a value:
 
 ```{r}
-#| eval: !expr has_dev_tidyr
-
 df2 <- tibble(x = c("1211", "131", "21"))
 df2 |>
-  separate_at_longer(x, width = 1)
+  separate_longer_position(x, width = 1)
 ```
 
 ### Splitting into columns
 
-`separate_by_wider()` and `separate_at_wider()` are most useful when there are a fixed number of components in each string, and you want to spread them into columns.
+`separate_wider_delim()` and `separate_wider_position()` are most useful when there are a fixed number of components in each string, and you want to spread them into columns.
 They are more complicated that their `by` equivalents because you need to name the columns.
 
 ```{r}
-#| eval: !expr has_dev_tidyr
-
 df3 <- tibble(x = c("a,1,2022", "b,2,2011", "e,5,2015"))
 df3 |>
-  separate_by_wider(x, sep = ",", names = c("letter", "number", "year"))
+  separate_wider_delim(x, delim = ",", names = c("letter", "number", "year"))
 ```
 
 If a specific value is not useful you can use `NA` to omit it from the results:
 
 ```{r}
-#| eval: !expr has_dev_tidyr
-
 df3 <- tibble(x = c("a,1,2022", "b,2,2011", "e,5,2015"))
 df3 |>
-  separate_by_wider(x, sep = ",", names = c("letter", NA, "year"))
+  separate_wider_delim(x, delim = ",", names = c("letter", NA, "year"))
 ```
 
-Alternatively, you can provide `names_sep` and `separate_by_wider()` will use that separator to name automatically:
+Alternatively, you can provide `names_sep` and `separate_wider_delim()` will use that separator to name automatically:
 
 ```{r}
-#| eval: !expr has_dev_tidyr
-
 df3 |>
-  separate_by_wider(x, sep = ",", names_sep = "_")
+  separate_wider_delim(x, delim = ",", names_sep = "_")
 ```
 
-`separate_at_wider()` works a little differently, because you typically want to specify the width of each column.
+`separate_wider_position()` works a little differently, because you typically want to specify the width of each column.
 So you give it a named integer vector, where the name gives the name of the new column and the value is the number of characters it occupies.
 You can omit values from the output by not naming them:
 
 ```{r}
-#| eval: !expr has_dev_tidyr
-
 df4 <- tibble(x = c("202215TX", "202122LA", "202325CA"))
 df4 |>
-  separate_at_wider(x, c(year = 4, age = 2, state = 2))
+  separate_wider_position(x, c(year = 4, age = 2, state = 2))
 ```
 
 ### Case studies