Suggested edits related to Strings chapter (#1219)

* Add {wakefield} as dependency for Strings chapter

* Move footnote into body of text

The footnote appears to be redundant with the more vague paragraph
immediately following it in the main body of the text, so combine their
information instead.

* Make explicit that `coalesce()` replaces NAs

* Fix definition of `start` & `end` for `str_sub()`

* Edit section on Letter variations

* Edit section on Locale-dependent function

* Apply suggestions from code review

Co-authored-by: Mine Cetinkaya-Rundel <cetinkaya.mine@gmail.com>
Stephan Koenig 2023-01-06 12:13:34 -08:00 committed by GitHub
parent fc51a5f5f8
commit 94033a1331
2 changed files with 24 additions and 27 deletions

@ -218,7 +218,7 @@ In this book, we'll use three data packages from outside the tidyverse:
```{r}
#| eval: false
install.packages(c("gapminder", "Lahman", "nycflights13", "palmerpenguins"))
install.packages(c("gapminder", "Lahman", "nycflights13", "palmerpenguins", "wakefield"))
```
These packages provide data on world development, baseball, airline flights, and body measurements of penguins that we'll use to illustrate key data science ideas.

@ -160,10 +160,7 @@ That naturally raises the question of what string functions you might use with `
### `str_c()`
`str_c()`[^strings-3] takes any number of vectors as arguments and returns a character vector:
[^strings-3]: `str_c()` is very similar to the base `paste0()`.
There are two main reasons we recommend it: it propagates `NA`s (rather than converting them to `"NA"`) and it uses the tidyverse recycling rules.
`str_c()` takes any number of vectors as arguments and returns a character vector:
```{r}
str_c("x", "y")
@ -171,7 +168,7 @@ str_c("x", "y", "z")
str_c("Hello ", c("John", "Susan"))
```
`str_c()` is designed to be used with `mutate()`, so it obeys the usual rules for recycling and missing values:
`str_c()` is very similar to the base `paste0()`, but is designed to be used with `mutate()` by obeying the usual tidyverse rules for recycling and propagating missing values:
```{r}
set.seed(1410)
@ -179,7 +176,7 @@ df <- tibble(name = c(wakefield::name(3), NA))
df |> mutate(greeting = str_c("Hi ", name, "!"))
```
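As a rough side-by-side of the `paste0()` comparison above, the following sketch shows how the two functions treat a missing value. The `name` vector is made up for illustration and is not part of the chapter:
```{r}
library(stringr)

# Hypothetical input: one real name and one missing value
name <- c("Ada", NA)

# Base R converts NA to the literal text "NA"
with_paste0 <- paste0("Hi ", name, "!")

# stringr propagates the missing value instead
with_str_c <- str_c("Hi ", name, "!")

with_paste0
with_str_c
```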
If you want missing values to display in another way, use `coalesce()`.
If you want missing values to display in another way, use `coalesce()` to replace them.
Depending on what you want, you might use it either inside or outside of `str_c()`:
```{r}
@ -192,9 +189,9 @@ df |>
### `str_glue()` {#sec-glue}
If you are mixing many fixed and variable strings with `str_c()`, you'll notice that you type a lot of `"`s, making it hard to see the overall goal of the code. An alternative approach is provided by the [glue package](https://glue.tidyverse.org) via `str_glue()`[^strings-4]. You give it a single string that has a special feature: anything inside `{}` will be evaluated like it's outside of the quotes:
If you are mixing many fixed and variable strings with `str_c()`, you'll notice that you type a lot of `"`s, making it hard to see the overall goal of the code. An alternative approach is provided by the [glue package](https://glue.tidyverse.org) via `str_glue()`[^strings-3]. You give it a single string that has a special feature: anything inside `{}` will be evaluated like it's outside of the quotes:
[^strings-4]: If you're not using stringr, you can also access it directly with `glue::glue()`.
[^strings-3]: If you're not using stringr, you can also access it directly with `glue::glue()`.
```{r}
df |> mutate(greeting = str_glue("Hi {name}!"))
@ -214,9 +211,9 @@ df |> mutate(greeting = str_glue("{{Hi {name}!}}"))
`str_c()` and `glue()` work well with `mutate()` because their output is the same length as their inputs.
What if you want a function that works well with `summarize()`, i.e., something that always returns a single string?
That's the job of `str_flatten()`[^strings-5]: it takes a character vector and combines each element of the vector into a single string:
That's the job of `str_flatten()`[^strings-4]: it takes a character vector and combines each element of the vector into a single string:
[^strings-5]: The base R equivalent is `paste()` used with the `collapse` argument.
[^strings-4]: The base R equivalent is `paste()` used with the `collapse` argument.
```{r}
str_flatten(c("x", "y", "z"))
@ -344,11 +341,11 @@ df4 |>
### Diagnosing widening problems
`separate_wider_delim()`[^strings-6] requires a fixed and known set of columns.
`separate_wider_delim()`[^strings-5] requires a fixed and known set of columns.
What happens if some of the rows don't have the expected number of pieces?
There are two possible problems, too few or too many pieces, so `separate_wider_delim()` provides two arguments to help: `too_few` and `too_many`. Let's first look at the `too_few` case with the following sample dataset:
[^strings-6]: The same principles apply to `separate_wider_position()` and `separate_wider_regex()`.
[^strings-5]: The same principles apply to `separate_wider_position()` and `separate_wider_regex()`.
```{r}
#| error: true
@ -463,9 +460,9 @@ You'll learn how to find the length of a string, extract substrings, and handle
str_length(c("a", "R for data science", NA))
```
You could use this with `count()` to find the distribution of lengths of US babynames and then with `filter()` to look at the longest names[^strings-7]:
You could use this with `count()` to find the distribution of lengths of US babynames and then with `filter()` to look at the longest names[^strings-6]:
[^strings-7]: Looking at these entries, we'd guess that the babynames data drops spaces or hyphens and truncates after 15 letters.
[^strings-6]: Looking at these entries, we'd guess that the babynames data drops spaces or hyphens and truncates after 15 letters.
```{r}
babynames |>
@ -478,7 +475,7 @@ babynames |>
### Subsetting
You can extract parts of a string using `str_sub(string, start, end)`, where `start` and `end` are the letters where the substring should start and end.
You can extract parts of a string using `str_sub(string, start, end)`, where `start` and `end` are the positions where the substring should start and end.
The `start` and `end` arguments are inclusive, so the length of the returned string will be `end - start + 1`:
```{r}
@ -564,9 +561,9 @@ readr uses UTF-8 everywhere.
This is a good default but will fail for data produced by older systems that don't use UTF-8.
If this happens, your strings will look weird when you print them.
Sometimes just one or two characters might be messed up; other times, you'll get complete gibberish.
For example here are two inline CSVs with unusual encodings[^strings-8]:
For example here are two inline CSVs with unusual encodings[^strings-7]:
[^strings-8]: Here I'm using the special `\x` to encode binary data directly into a string.
[^strings-7]: Here I'm using the special `\x` to encode binary data directly into a string.
```{r}
#| message: false
@ -602,7 +599,7 @@ If you'd like to learn more, we recommend reading the detailed explanation at <h
### Letter variations
If you're working with individual letters (e.g., with `str_length()` and `str_sub()`), there's an important challenge if you're working with a language that has accents because letters might be represented as an individual character or by combing an unaccented letter (e.g., ü) with a diacritic mark (e.g., ¨).
Working in languages with accents poses a significant challenge when determining the position of letters (e.g., with `str_length()` and `str_sub()`) as accented letters might be encoded as a single individual character (e.g., ü) or as two characters by combining an unaccented letter (e.g., u) with a diacritic mark (e.g., ¨).
For example, this code shows two ways of representing ü that look identical:
```{r}
@ -610,14 +607,14 @@ u <- c("\u00fc", "u\u0308")
str_view(u)
```
But they have different lengths, and the first characters are different:
But both strings differ in length, and their first characters are different:
```{r}
str_length(u)
str_sub(u, 1, 1)
```
Finally, note that these strings look differently when you compare them with `==`, for which stringr provides the handy `str_equal()` function:
Finally, note that a comparison of these strings with `==` interprets these strings as different, while the handy `str_equal()` function in stringr recognizes that both have the same appearance:
```{r}
u[[1]] == u[[2]]
@ -625,7 +622,7 @@ u[[1]] == u[[2]]
str_equal(u[[1]], u[[2]])
```
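If you need the two representations to behave identically with `==`, `str_length()`, and `str_sub()`, one option is to normalize them to a single form first. This is a minimal sketch, assuming the stringi package (which stringr is built on) is available. `stri_trans_nfc()` is not used elsewhere in this section, and `u_nfc` is a name chosen here for illustration:
```{r}
library(stringr)

# The same two encodings of ü as above
u <- c("\u00fc", "u\u0308")

# NFC normalization composes a base letter and its combining mark
# into a single code point where one exists
u_nfc <- stringi::stri_trans_nfc(u)

str_length(u_nfc)
u_nfc[[1]] == u_nfc[[2]]
```
After normalization both strings have length 1 and compare as equal.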
### Locale-dependent function
### Locale-dependent functions
Finally, there are a handful of stringr functions whose behavior depends on your **locale**.
A locale is similar to a language but includes an optional region specifier to handle regional variations within a language.
@ -635,21 +632,21 @@ If you don't already know the code for your language, [Wikipedia](https://en.wik
Base R string functions automatically use the locale set by your operating system.
This means that base R string functions do what you expect for your language, but your code might work differently if you share it with someone who lives in a different country.
To avoid this problem, stringr defaults to using English rules by using the "en" locale and requires you to specify the `locale` argument to override it.
To avoid this problem, stringr defaults to English rules by using the "en" locale and requires you to specify the `locale` argument to override it.
Fortunately, there are two sets of functions where the locale really matters: changing case and sorting.
The rules for changing cases are not the same in every language.
For example, Turkish has two i's: with and without a dot, and it capitalizes them in a different way to English:
The rules for changing cases differ among languages.
For example, Turkish has two i's: with and without a dot. Since they're two distinct letters, they're capitalized differently:
```{r}
str_to_upper(c("i", "ı"))
str_to_upper(c("i", "ı"), locale = "tr")
```
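The same rule applies in the other direction. As a small sketch that is not part of the chapter, the two Turkish capital letters are written with Unicode escapes here so the code does not depend on the file encoding, and `lower_tr` is an illustrative name:
```{r}
library(stringr)

# "I" is the dotless capital; "\u0130" is the dotted capital (İ)
upper <- c("I", "\u0130")

# With Turkish rules, each capital maps back to its own lowercase letter:
# the dotless ı ("\u0131") and the dotted i
lower_tr <- str_to_lower(upper, locale = "tr")
lower_tr
```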
Sorting strings depends on the order of the alphabet, and the order of the alphabet is not the same in every language[^strings-9]!
Sorting strings depends on the order of the alphabet, and the order of the alphabet is not the same in every language[^strings-8]!
Here's an example: in Czech, "ch" is a compound letter that appears after `h` in the alphabet.
[^strings-9]: Sorting in languages that don't have an alphabet, like Chinese, is more complicated still.
[^strings-8]: Sorting in languages that don't have an alphabet, like Chinese, is more complicated still.
```{r}
str_sort(c("a", "c", "ch", "h", "z"))