From 94033a1331a7c720298152ab67c786ffc4db7bc5 Mon Sep 17 00:00:00 2001
From: Stephan Koenig
Date: Fri, 6 Jan 2023 12:13:34 -0800
Subject: [PATCH] Suggested edits related to Strings chapter (#1219)

* Add {wakefield} as dependency for Strings chapter

* Move footnote into body of text

The footnote appears to be redundant with the more vague paragraph immediately following it in the main body of the text, so combine their information instead.

* Make explicit that `coalesce()` replaces NAs

* Fix definition of `start` & `end` for `str_sub()`

* Edit section on Letter variations

* Edit section on Locale-dependent function

* Apply suggestions from code review

Co-authored-by: Mine Cetinkaya-Rundel

Co-authored-by: Mine Cetinkaya-Rundel
---
 intro.qmd   |  2 +-
 strings.qmd | 49 +++++++++++++++++++++++--------------------------
 2 files changed, 24 insertions(+), 27 deletions(-)

diff --git a/intro.qmd b/intro.qmd
index d1143c0..285cd67 100644
--- a/intro.qmd
+++ b/intro.qmd
@@ -218,7 +218,7 @@ In this book, we'll use three data packages from outside the tidyverse:
 
 ```{r}
 #| eval: false
-install.packages(c("gapminder", "Lahman", "nycflights13", "palmerpenguins"))
+install.packages(c("gapminder", "Lahman", "nycflights13", "palmerpenguins", "wakefield"))
 ```
 
 These packages provide data on world development, baseball, airline flights, and body measurements of penguins that we'll use to illustrate key data science ideas.
diff --git a/strings.qmd b/strings.qmd
index 4b7fcfe..dd8f097 100644
--- a/strings.qmd
+++ b/strings.qmd
@@ -160,10 +160,7 @@ That naturally raises the question of what string functions you might use with `
 
 ### `str_c()`
 
-`str_c()`[^strings-3] takes any number of vectors as arguments and returns a character vector:
-
-[^strings-3]: `str_c()` is very similar to the base `paste0()`.
-    There are two main reasons we recommend it: it propagates `NA`s (rather than converting them to `"NA"`) and it uses the tidyverse recycling rules.
+`str_c()` takes any number of vectors as arguments and returns a character vector:
 
 ```{r}
 str_c("x", "y")
@@ -171,7 +168,7 @@ str_c("x", "y", "z")
 str_c("Hello ", c("John", "Susan"))
 ```
 
-`str_c()` is designed to be used with `mutate()`, so it obeys the usual rules for recycling and missing values:
+`str_c()` is very similar to the base `paste0()`, but is designed to be used with `mutate()` by obeying the usual tidyverse rules for recycling and propagating missing values:
 
 ```{r}
 set.seed(1410)
@@ -179,7 +176,7 @@ df <- tibble(name = c(wakefield::name(3), NA))
 df |> mutate(greeting = str_c("Hi ", name, "!"))
 ```
 
-If you want missing values to display in another way, use `coalesce()`.
+If you want missing values to display in another way, use `coalesce()` to replace them.
 Depending on what you want, you might use it either inside or outside of `str_c()`:
 
 ```{r}
@@ -192,9 +189,9 @@ df |>
 
 ### `str_glue()` {#sec-glue}
 
-If you are mixing many fixed and variable strings with `str_c()`, you'll notice that you type a lot of `"`s, making it hard to see the overall goal of the code. An alternative approach is provided by the [glue package](https://glue.tidyverse.org) via `str_glue()`[^strings-4]. You give it a single string that has a special feature: anything inside `{}` will be evaluated like it's outside of the quotes:
+If you are mixing many fixed and variable strings with `str_c()`, you'll notice that you type a lot of `"`s, making it hard to see the overall goal of the code. An alternative approach is provided by the [glue package](https://glue.tidyverse.org) via `str_glue()`[^strings-3]. You give it a single string that has a special feature: anything inside `{}` will be evaluated like it's outside of the quotes:
 
-[^strings-4]: If you're not using stringr, you can also access it directly with `glue::glue()`.
+[^strings-3]: If you're not using stringr, you can also access it directly with `glue::glue()`.
 
 ```{r}
 df |> mutate(greeting = str_glue("Hi {name}!"))
@@ -214,9 +211,9 @@ df |> mutate(greeting = str_glue("{{Hi {name}!}}"))
 
 `str_c()` and `glue()` work well with `mutate()` because their output is the same length as their inputs.
 What if you want a function that works well with `summarize()`, i.e., something that always returns a single string?
-That's the job of `str_flatten()`[^strings-5]: it takes a character vector and combines each element of the vector into a single string:
+That's the job of `str_flatten()`[^strings-4]: it takes a character vector and combines each element of the vector into a single string:
 
-[^strings-5]: The base R equivalent is `paste()` used with the `collapse` argument.
+[^strings-4]: The base R equivalent is `paste()` used with the `collapse` argument.
 
 ```{r}
 str_flatten(c("x", "y", "z"))
@@ -344,11 +341,11 @@ df4 |>
 
 ### Diagnosing widening problems
 
-`separate_wider_delim()`[^strings-6] requires a fixed and known set of columns.
+`separate_wider_delim()`[^strings-5] requires a fixed and known set of columns.
 What happens if some of the rows don't have the expected number of pieces?
 There are two possible problems, too few or too many pieces, so `separate_wider_delim()` provides two arguments to help: `too_few` and `too_many`. Let's first look at the `too_few` case with the following sample dataset:
 
-[^strings-6]: The same principles apply to `separate_wider_position()` and `separate_wider_regex()`.
+[^strings-5]: The same principles apply to `separate_wider_position()` and `separate_wider_regex()`.
 
 ```{r}
 #| error: true
@@ -463,9 +460,9 @@ You'll learn how to find the length of a string, extract substrings, and handle
 str_length(c("a", "R for data science", NA))
 ```
 
-You could use this with `count()` to find the distribution of lengths of US babynames and then with `filter()` to look at the longest names[^strings-7]:
+You could use this with `count()` to find the distribution of lengths of US babynames and then with `filter()` to look at the longest names[^strings-6]:
 
-[^strings-7]: Looking at these entries, we'd guess that the babynames data drops spaces or hyphens and truncates after 15 letters.
+[^strings-6]: Looking at these entries, we'd guess that the babynames data drops spaces or hyphens and truncates after 15 letters.
 
 ```{r}
 babynames |>
@@ -478,7 +475,7 @@ babynames |>
 
 ### Subsetting
 
-You can extract parts of a string using `str_sub(string, start, end)`, where `start` and `end` are the letters where the substring should start and end.
+You can extract parts of a string using `str_sub(string, start, end)`, where `start` and `end` are the positions where the substring should start and end.
 The `start` and `end` arguments are inclusive, so the length of the returned string will be `end - start + 1`:
 
 ```{r}
@@ -564,9 +561,9 @@ readr uses UTF-8 everywhere.
 This is a good default but will fail for data produced by older systems that don't use UTF-8.
 If this happens, your strings will look weird when you print them.
 Sometimes just one or two characters might be messed up; other times, you'll get complete gibberish.
-For example here are two inline CSVs with unusual encodings[^strings-8]:
+For example here are two inline CSVs with unusual encodings[^strings-7]:
 
-[^strings-8]: Here I'm using the special `\x` to encode binary data directly into a string.
+[^strings-7]: Here I'm using the special `\x` to encode binary data directly into a string.
 
 ```{r}
 #| message: false
@@ -602,7 +599,7 @@ If you'd like to learn more, we recommend reading the detailed explanation at