From 807795af45f4f249b4d7449e87850a3b17205175 Mon Sep 17 00:00:00 2001
From: Hadley Wickham
Date: Mon, 26 Apr 2021 14:49:14 -0500
Subject: [PATCH] More on strings

---
 prog-strings.Rmd |  18 +--
 regexps.Rmd      |  14 +++
 strings.Rmd      | 290 ++++++++++++++++++++++++++++++-----------------
 3 files changed, 201 insertions(+), 121 deletions(-)

diff --git a/prog-strings.Rmd b/prog-strings.Rmd
index ff63ad1..d36968b 100644
--- a/prog-strings.Rmd
+++ b/prog-strings.Rmd
@@ -153,6 +153,8 @@ str_split(x, " ")[[1]]
 str_split(x, boundary("word"))[[1]]
 ```
 
+Show how `separate_rows()` is a special case of `str_split()` + `summarise()`.
+
 ## Replace with function
 
 ## Locations
@@ -217,17 +219,5 @@ The main difference is the prefix: `str_` vs. `stri_`.
     unite("date", month:day, sep = "-", remove = FALSE)
     ```
 
-5. You can also pass a vector of integers to `sep`. `separate()` will interpret the integers as positions to split at.
-    Positive values start at 1 on the far-left of the strings; negative value start at -1 on the far-right of the strings.
-    Use `separate()` to represent location information in the following tibble in two columns: `state` (represented by the first two characters) and `county`.
-    Do this in two ways: using a positive and a negative value for `sep`.
-
-    ```{r}
-    baker <- tribble(
-      ~location,
-      "FLBaker County",
-      "GABaker County",
-      "ORBaker County",
-    )
-    baker
-    ```
+5. Write a function that turns (e.g.) a vector `c("a", "b", "c")` into the string `a, b, and c`.
+    Think carefully about what it should do if given a vector of length 0, 1, or 2.
diff --git a/regexps.Rmd b/regexps.Rmd
index 4b47d6c..5f739fc 100644
--- a/regexps.Rmd
+++ b/regexps.Rmd
@@ -169,6 +169,20 @@ Like with mathematical expressions, if precedence ever gets confusing, use paren
 str_view(c("grey", "gray"), "gr(e|a)y")
 ```
 
+When you have complex logical conditions (e.g. match a or b but not c unless d) it's often easier to combine multiple `str_detect()` calls with logical operators, rather than trying to create a single regular expression.
+For example, here are two ways to find all words that don't contain any vowels:
+
+```{r}
+# Find all words containing at least one vowel, and negate
+no_vowels_1 <- !str_detect(words, "[aeiou]")
+# Find all words consisting only of consonants (non-vowels)
+no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
+identical(no_vowels_1, no_vowels_2)
+```
+
+The results are identical, but I think the first approach is significantly easier to understand.
+If your regular expression gets overly complicated, try breaking it up into smaller pieces, giving each piece a name, and then combining the pieces with logical operations.
+
 ### Exercises
 
 1. Create regular expressions to find all words that:
diff --git a/strings.Rmd b/strings.Rmd
index 292c9fe..05ee1ea 100644
--- a/strings.Rmd
+++ b/strings.Rmd
@@ -4,11 +4,12 @@
 
 This chapter introduces you to strings in R.
 You'll learn the basics of how strings work and how to create them by hand.
-Big topic so spread over three chapters.
+Big topic so spread over three chapters: here we'll focus on the basic mechanics, in Chapter \@ref(regular-expressions) we'll dive into the details of regular expressions, the sometimes cryptic language for describing patterns in strings, and we'll return to strings later in Chapter \@ref(programming-with-strings) when we think about them from a programming perspective (rather than a data analysis perspective).
 
-Base R contains many functions to work with strings but we'll generally avoid them here because they can be inconsistent, which makes them hard to remember.
-Instead, we'll use stringr which is designed to be as consistent as possible, and all of its functions start with `str_`.
-The common `str_` prefix is particularly useful if you use RStudio, because typing `str_` will trigger autocomplete, allowing you to see all stringr functions:
+While base R contains functions that allow us to perform pretty much all of the operations described in this chapter, here we're going to use the **stringr** package.
+stringr has been carefully designed to be as consistent as possible so that knowledge gained about one function can be more easily transferred to the next.
+stringr functions all start with the same `str_` prefix.
+This is particularly useful if you use RStudio, because typing `str_` will trigger autocomplete, allowing you to see all stringr's functions:
 
 ```{r, echo = FALSE}
 knitr::include_graphics("screenshots/stringr-autocomplete.png")
@@ -17,6 +18,7 @@ knitr::include_graphics("screenshots/stringr-autocomplete.png")
 ### Prerequisites
 
 This chapter will focus on the **stringr** package for string manipulation, which is part of the core tidyverse.
+We'll also work with the babynames dataset.
 
 ```{r setup, message = FALSE}
 library(tidyverse)
@@ -25,7 +27,9 @@ library(babynames)
 
 ## Creating a string
 
-You can create strings with either single quotes or double quotes.
+To begin, let's discuss the mechanics of creating a string.
+We've created strings in passing earlier in the book, but didn't discuss the details.
+First, there are two basic ways to create a string: using either single quotes (`'`) or double quotes (`"`).
 Unlike other languages, there is no difference in behaviour.
 I recommend always using `"`, unless you want to create a string that contains multiple `"`.
 
@@ -41,7 +45,9 @@ If you forget to close a quote, you'll see `+`, the continuation character:
     + 
     + HELP I'M STUCK
 
-If this happen to you, press Escape and try again!
+If this happens to you, press Escape and try again.
+
+### Escapes
 
 To include a literal single or double quote in a string you can use `\` to "escape" it:
 
@@ -50,27 +56,25 @@ double_quote <- "\"" # or '"'
 single_quote <- '\'' # or "'"
 ```
 
-That means if you want to include a literal backslash, you'll need to double it up: `"\\"`.
-
-Beware that the printed representation of a string is not the same as string itself, because the printed representation shows the escapes.
-To see the raw contents of the string, use `writeLines()`:
+Which means if you want to include a literal backslash, you'll need to double it up: `"\\"`:
 
 ```{r}
-x <- c("\"", "\\")
-x
-str_view(x)
+backslash <- "\\"
 ```
 
-As shown above, you can combine strings into a (character) vector with `c()`:
+Beware that the printed representation of a string is not the same as the string itself, because the printed representation shows the escapes.
+To see the raw contents of the string, use `str_view()`:
 
 ```{r}
-c("one", "two", "three")
+x <- c(single_quote, double_quote, backslash)
+x
+str_view(x)
 ```
 
 ### Raw strings
 
 Creating a string with multiple quotes or backslashes gets confusing quickly.
-For example, lets create a string that contains the contents of the chunk above:
+For example, let's create a string that contains the contents of the chunk where I define the `double_quote` and `single_quote` variables:
 
 ```{r}
 tricky <- "double_quote <- \"\\\"\" # or '\"'
@@ -78,7 +82,9 @@ single_quote <- '\\'' # or \"'\""
 str_view(tricky)
 ```
 
-In R 4.0.0 and above, you can use a **raw** string to reduce the amount of escaping:
+You can instead use a **raw string**[^strings-1] to reduce the amount of escaping:
+
+[^strings-1]: Available in R 4.0.0 and above.
 
 ```{r}
 tricky <- r"(double_quote <- "\"" # or '"'
@@ -88,37 +94,35 @@ str_view(tricky)
 ```
 
 A raw string starts with `r"(` and finishes with `)"`.
-If your string contains `)"` you can instead use `r"[]"` or `r"{}"`, and if that's still not enough, you can insert any number of dashes to make the opening and closing pairs unique: `` `r"--()--" ``.
+If your string contains `)"` you can instead use `r"[]"` or `r"{}"`, and if that's still not enough, you can insert any number of dashes to make the opening and closing pairs unique, e.g. `` `r"--()--" ``, `` `r"---()---" ``, etc.
 
 ### Other special characters
 
-As well as `\"`, `\'`, and `\\` there are a handful of other special characters that may come in handy. The most common are `"\n"`, newline, and `"\t"`, tab, but you can see the complete list by requesting help on `"` with `?'"'` or `?"'"`.
+As well as `\"`, `\'`, and `\\` there are a handful of other special characters that may come in handy. The most common are `"\n"`, newline, and `"\t"`, tab, but you can see the complete list in `?'"'`.
 
-You'll also sometimes see strings containing Unicode escapes like `"\u00b5"`.
-This is a way of writing non-English characters that works on all platforms:
+You'll also sometimes see strings containing Unicode escapes that start with `\u` or `\U`.
+This is a way of writing non-English characters that works on all systems:
 
 ```{r}
-x <- "\u00b5"
+x <- c("\u00b5", "\U0001f604")
 x
+str_view(x)
 ```
 
 ## Combining strings
 
-To combine two or more strings, use `str_c()`:
+Use `str_c()`[^strings-2] to join together multiple strings into a single string:
+
+[^strings-2]: `str_c()` is very similar to the base `paste0()`.
+    There are two main reasons I use it here: it obeys the usual rules for handling `NA`, and it uses the tidyverse recycling rules.
 
 ```{r}
 str_c("x", "y")
 str_c("x", "y", "z")
 ```
 
-Use the `sep` argument to control how they're separated:
-
-```{r}
-str_c("x", "y", sep = ", ")
-```
-
 Like most other functions in R, missing values are contagious.
-As usual, if you want to show a different value, use `coalesce()`:
+You can use `coalesce()` to replace missing values with a value of your choosing:
 
 ```{r}
 x <- c("abc", NA)
@@ -126,7 +130,12 @@ str_c("|-", x, "-|")
 str_c("|-", coalesce(x, ""), "-|")
 ```
 
-`mutate()`
+Since `str_c()` creates a new variable, you'll usually use it with a `mutate()`:
+
+```{r}
+starwars %>%
+  mutate(greeting = str_c("Hi! I'm ", name, "."), .after = name)
+```
 
 Another powerful way of combining strings is with the glue package.
 You can either use `glue::glue()` or call it via the `str_glue()` wrapper that string provides for you.
@@ -139,15 +148,20 @@ str_glue("|-{x}-|")
 Like `str_c()`, `str_glue()` pairs well with `mutate()`:
 
 ```{r}
-starwars %>% mutate(
-  intro = str_glue("Hi my is {name} and I'm a {species} from {homeworld}"),
-  .keep = "none"
-)
+starwars %>%
+  mutate(
+    intro = str_glue("Hi! My name is {name} and I'm a {species} from {homeworld}"),
+    .keep = "none"
+  )
 ```
 
+You can use any valid R code inside of `{}`, but we recommend placing more complex calculations in their own variables.
+
 ## Length and subsetting
 
-For example, `str_length()` tells you the length of a string:
+It's also natural to think about the letters that make up an individual string.
+(But note that the idea of a "letter" isn't a natural fit to every language; we'll come back to that in Section \@ref(other-languages)).
+For example, `str_length()` tells you the length, the number of characters:
 
 ```{r}
 str_length(c("a", "R for data science", NA))
@@ -157,20 +171,30 @@ You could use this with `count()` to find the distribution of lengths of US baby
 
 ```{r}
 babynames %>%
-  count(length = str_length(name))
+  count(length = str_length(name), wt = n)
 ```
 
 You can extract parts of a string using `str_sub()`.
-As well as the string, `str_sub()` takes `start` and `end` arguments which give the (inclusive) position of the substring:
+As well as the string, `str_sub()` takes `start` and `end` arguments which give the (inclusive) characters to start and end at:
 
 ```{r}
 x <- c("Apple", "Banana", "Pear")
 str_sub(x, 1, 3)
-# negative numbers count backwards from end
+```
+
+You can use negative values to count back from the end of the string: -1 is the last character, -2 is the second to last character, etc.
+
+```{r}
 str_sub(x, -3, -1)
 ```
 
-We could use this with `mutate()` to find the first and last letter of each name:
+Note that `str_sub()` won't fail if the string is too short: it will just return as much as possible:
+
+```{r}
+str_sub("a", 1, 5)
+```
+
+We could use `str_sub()` with `mutate()` to find the first and last letter of each name:
 
 ```{r}
 babynames %>%
@@ -180,54 +204,78 @@ babynames %>%
   )
 ```
 
-Note that `str_sub()` won't fail if the string is too short: it will just return as much as possible:
+Sometimes you'll get a column that's made up of individual fixed length strings that have been joined together:
 
 ```{r}
-str_sub("a", 1, 5)
+df <- tribble(
+  ~ sex_year_age,
+  "M200115",
+  "F201503",
+)
 ```
 
-Note that the idea of a "letter" isn't a natural fit to every language, so you'll need to take care if you're working with text from other languages.
-We'll briefly talk about some of the issues in Section \@ref(other-languages).
+You can extract the columns using `str_sub()`:
 
-TODO: `separate()`
+```{r}
+df %>% mutate(
+  sex = str_sub(sex_year_age, 1, 1),
+  year = str_sub(sex_year_age, 2, 5),
+  age = str_sub(sex_year_age, 6, 7),
+)
+```
+
+Or use the `separate()` helper function:
+
+```{r}
+df %>%
+  separate(sex_year_age, c("sex", "year", "age"), c(1, 5))
+```
+
+Note that you give `separate()` three columns but only two positions --- that's because you're telling `separate()` where to break up the string.
+
+TODO: draw diagram to emphasise that it's the space between the characters.
+
+Later on, we'll come back to two related problems: components that vary in length, and components that are separated by a character.
 
 ### Exercises
 
-1. In code that doesn't use stringr, you'll often see `paste()` and `paste0()`.
-    What's the difference between the two functions?
-    What stringr function are they equivalent to?
-    How do the functions differ in their handling of `NA`?
-
-2. In your own words, describe the difference between the `sep` and `collapse` arguments to `str_c()`.
-
-3. Use `str_length()` and `str_sub()` to extract the middle character from a string.
-    What will you do if the string has an even number of characters?
-
-4. Write a function that turns (e.g.) a vector `c("a", "b", "c")` into the string `a, b, and c`.
-    Think carefully about what it should do if given a vector of length 0, 1, or 2.
-
-## String summaries
-
-You can perform the opposite operation with `summarise()` and `str_flatten()`:
-
-To collapse a vector of strings into a single string, use `collapse`:
-
-```{r}
-str_flatten(c("x", "y", "z"), ", ")
-```
-
-This is a great tool for `summarise()`ing character data.
-Later we'll come back to the inverse of this, `separate_rows()`.
+1. Use `str_length()` and `str_sub()` to extract the middle letter from each baby name. What will you do if the string has an even number of characters?
 
 ## Long strings
 
-`str_wrap()`
+Sometimes the reason you care about the length of a string is because you're trying to fit it into a label.
+stringr provides two useful tools for cases where your string is too long:
 
-`str_trunc()`
+- `str_trunc(x, 20)` ensures that no string is longer than 20 characters, replacing anything too long with `…`.
 
-## Introduction to regular expressions
+- `str_wrap(x, 20)` wraps a string introducing new lines so that each line is at most 20 characters (it doesn't hyphenate, however, so any word longer than 20 characters will make a longer line).
 
-Opting out by using `fixed()`
+## String summaries
+
+`str_c()` combines multiple character vectors into a single character vector; the output is the same length as the input.
+A related function is `str_flatten()`: it takes a character vector and returns a single string:
+
+```{r}
+str_flatten(c("x", "y", "z"))
+```
+
+Just like `sum()` and `mean()` take a vector of numbers and return a single number, `str_flatten()` takes a character vector and returns a single string.
+This makes `str_flatten()` a summary function for strings, so you'll often pair it with `summarise()`:
+
+```{r}
+df <- tribble(
+  ~ name, ~ fruit,
+  "Carmen", "banana",
+  "Carmen", "apple",
+  "Marvin", "nectarine",
+  "Terence", "cantaloupe",
+  "Terence", "papaya",
+  "Terence", "mandarin"
+)
+df %>%
+  group_by(name) %>%
+  summarise(fruits = str_flatten(fruit, ", "))
+```
 
 ## Detect matches
 
@@ -239,49 +287,27 @@ x <- c("apple", "banana", "pear")
 str_detect(x, "e")
 ```
 
+This makes it a logical pairing with `filter()`:
+
+```{r}
+babynames %>% filter(str_detect(name, "x"))
+```
+
 Remember that when you use a logical vector in a numeric context, `FALSE` becomes 0 and `TRUE` becomes 1.
 That makes `sum()` and `mean()` useful if you want to answer questions about matches across a larger vector:
 
-```{r}
-# How many common words start with t?
-sum(str_detect(words, "^t"))
-# What proportion of common words end with a vowel?
-mean(str_detect(words, "[aeiou]$"))
-```
-
-When you have complex logical conditions (e.g. match a or b but not c unless d) it's often easier to combine multiple `str_detect()` calls with logical operators, rather than trying to create a single regular expression.
-For example, here are two ways to find all words that don't contain any vowels:
-
-```{r}
-# Find all words containing at least one vowel, and negate
-no_vowels_1 <- !str_detect(words, "[aeiou]")
-# Find all words consisting only of consonants (non-vowels)
-no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
-identical(no_vowels_1, no_vowels_2)
-```
-
-The results are identical, but I think the first approach is significantly easier to understand.
-If your regular expression gets overly complicated, try breaking it up into smaller pieces, giving each piece a name, and then combining the pieces with logical operations.
-
-A common use of `str_detect()` is to select the elements that match a pattern.
-This makes it a natural pairing with `filter()`.
-The following regexp finds all names with repeated pairs of letters (you'll learn how that regexp works in the next chapter)
-
 ```{r}
 babynames %>%
-  filter(n > 100) %>%
-  count(name, wt = n) %>%
-  filter(str_detect(name, "(..).*\\1"))
+  group_by(year) %>%
+  summarise(prop_x = mean(str_detect(name, "x")))
 ```
 
+(Note that this gives us the proportion of names that contain an x; if you wanted the proportion of babies given a name containing an x, you'd need to perform a weighted mean).
+
 A variation on `str_detect()` is `str_count()`: rather than a simple yes or no, it tells you how many matches there are in a string:
 
 ```{r}
-x <- c("apple", "banana", "pear")
-str_count(x, "a")
-
-# On average, how many vowels per word?
-mean(str_count(words, "[aeiou]"))
+str_count(x, "p")
 ```
 
 It's natural to use `str_count()` with `mutate()`:
@@ -306,6 +332,54 @@ babynames %>%
     What word has the highest proportion of vowels?
     (Hint: what is the denominator?)
 
+## Introduction to regular expressions
+
+Before we can continue, we need to discuss the second argument to `str_detect()` --- it's not a fixed string, but a pattern, called a regular expression.
+A regular expression uses special characters
+
+```{r}
+str_detect(x, ".")
+```
+
+You can opt out by using `fixed()`:
+
+```{r}
+str_detect(x, fixed("."))
+```
+
+Note that regular expressions are case sensitive by default:
+
+```{r}
+babynames %>% filter(str_detect(name, "X"))
+babynames %>% filter(str_detect(name, fixed("X", ignore_case = TRUE)))
+```
+
+A common use of `str_detect()` is to select the elements that match a pattern.
+This makes it a natural pairing with `filter()`.
+The following regexp finds all names with repeated pairs of letters (you'll learn how that regexp works in the next chapter)
+
+```{r}
+babynames %>%
+  filter(n > 100) %>%
+  count(name, wt = n) %>%
+  filter(str_detect(name, "(..).*\\1"))
+```
+
+Simple patterns we'll use:
+
+- `.` match any character
+
+- `[abcd]` match "a", "b", "c", or "d".
+
+- `+` means match one or more: `a+` means match one or more a's in a row; `.+` means match one or more of anything; `[abcd]+` means match one or more of a/b/c/d in a row.
+
+You can use `str_view_all()` to see what a regular expression matches:
+
+```{r}
+str_view_all(x, "p+")
+str_view_all(x, "a.")
+```
+
 ## Replacing matches
 
 `str_replace_all()` allow you to replace matches with new strings.
@@ -324,6 +398,8 @@ x <- c("1 house", "1 person has 2 cars", "3 people")
 str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
 ```
 
+`str_remove_all()` is a shortcut for `str_replace_all(x, pattern, "")` --- it removes matching patterns from a string.
+
 Use in `mutate()`
 
 #### Exercises