Writing about strings

This commit is contained in:
Hadley Wickham 2021-12-02 11:57:51 -06:00
parent c0daa382c1
commit 26ab1cc1eb
1 changed files with 133 additions and 89 deletions

@ -8,35 +8,38 @@ status("restructuring")
This chapter introduces you to strings.
You'll learn the basics of how strings work in R and how to create them "by hand".
This is a big topic, so it's spread over three chapters: here we'll focus on the basic mechanics; in Chapter \@ref(regular-expressions) we'll dive into the details of regular expressions, the sometimes cryptic language for describing patterns in strings; and we'll return to strings later in Chapter \@ref(programming-with-strings), when we think about them from a programming perspective (rather than a data analysis perspective).
You'll also learn the basics of regular expressions, a powerful, but sometimes cryptic language for describing string patterns.
Regular expressions are a big topic, so we'll come back to them again in Chapter \@ref(regular-expressions) to discuss more of the details.
We'll finish up with a discussion of some of the new challenges that arise when working with non-English strings.
While base R contains functions that allow us to perform pretty much all of the operations described in this chapter, here we're going to use the **stringr** package.
stringr has been carefully designed to be as consistent as possible so that knowledge gained about one function can be more easily transferred to the next.
stringr functions all start with the same `str_` prefix.
This is particularly useful if you use RStudio, because typing `str_` will trigger autocomplete, allowing you to see all stringr's functions:
```{r, echo = FALSE}
knitr::include_graphics("screenshots/stringr-autocomplete.png")
```
We'll come back to strings again in Chapter \@ref(programming-with-strings), where we'll think about them more from a programming perspective than a data analysis perspective.
### Prerequisites
This chapter will focus on the **stringr** package for string manipulation, which is part of the core tidyverse.
We'll also work with the babynames dataset.
In this chapter, we'll use functions from the stringr package.
The equivalent functionality is available in base R (through functions like `grepl()`, `gsub()`, and `regmatches()`) but we think you'll find stringr easier to use because it's been carefully designed to be as consistent as possible.
We'll also work with the babynames dataset since it provides some fun data to apply string manipulation to.
```{r setup, message = FALSE}
library(tidyverse)
library(babynames)
```
You can easily tell when you're using a stringr function because all stringr functions start with `str_`.
This is particularly useful if you use RStudio, because typing `str_` will trigger autocomplete, allowing you to jog your memory of which functions are available.
```{r, echo = FALSE}
knitr::include_graphics("screenshots/stringr-autocomplete.png")
```
## Creating a string
To begin, let's discuss the mechanics of creating a string.
To begin, let's discuss the mechanics of creating a string[^strings-1].
We've created strings in passing earlier in the book, but didn't discuss the details.
First, there are two basic ways to create a string: using either single quotes (`'`) or double quotes (`"`).
Unlike other languages, there is no difference in behaviour.
I recommend always using `"`, unless you want to create a string that contains multiple `"`.
Unlike other languages, there is no difference in behaviour, but the [tidyverse style guide](https://style.tidyverse.org/syntax.html#character-vectors) recommends using `"`, unless the string contains multiple `"`.
[^strings-1]: A string is a length-1 character vector.
```{r}
string1 <- "This is a string"
@ -50,7 +53,14 @@ If you forget to close a quote, you'll see `+`, the continuation character:
+
+ HELP I'M STUCK
If this happens to you, press Escape and try again.
If this happens to you and you can't figure out which quote you need to close, press Escape to cancel, then try again.
You can combine multiple strings into a character vector by using `c()`:
```{r}
x <- c("first string", "second string", "third string")
x
```
### Escapes
@ -61,14 +71,16 @@ double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"
```
Which means if you want to include a literal backslash, you'll need to double it up: `"\\"`:
This means if you want to include a literal backslash in your string, you'll need to double it up: `"\\"`:
```{r}
backslash <- "\\"
```
Beware that the printed representation of a string is not the same as string itself, because the printed representation shows the escapes.
To see the raw contents of the string, use `str_view()`:
To see the raw contents of the string, use `str_view()` [^strings-2]:
[^strings-2]: You can also use the base R function `writeLines()`.
```{r}
x <- c(single_quote, double_quote, backslash)
@ -79,7 +91,7 @@ str_view(x)
### Raw strings
Creating a string with multiple quotes or backslashes gets confusing quickly.
For example, let's create a string that contains the contents of the chunk where I define the `double_quote` and `single_quote` variables:
To illustrate the problem, let's create a string that contains the contents of the chunk where I define the `double_quote` and `single_quote` variables:
```{r}
tricky <- "double_quote <- \"\\\"\" # or '\"'
@ -87,9 +99,11 @@ single_quote <- '\\'' # or \"'\""
str_view(tricky)
```
You can instead use a **raw string**[^strings-1] to reduce the amount of escaping:
That's a lot of backslashes!
[^strings-1]: Available in R 4.0.0 and above.
To eliminate the escaping you can instead use a **raw string**[^strings-3]:
[^strings-3]: Available in R 4.0.0 and above.
```{r}
tricky <- r"(double_quote <- "\"" # or '"'
@ -98,13 +112,12 @@ single_quote <- '\'' # or "'"
str_view(tricky)
```
A raw string starts with `r"(` and finishes with `)"`.
If your string contains `)"` you can instead use `r"[]"` or `r"{}"`, and if that's still not enough, you can insert any number of dashes to make the opening and closing pairs unique, e.g. `` `r"--()--" ``, `` `r"---()---" ``, etc.
A raw string usually starts with `r"(` and finishes with `)"`.
But if your string contains `)"` you can instead use `r"[]"` or `r"{}"`, and if that's still not enough, you can insert any number of dashes to make the opening and closing pairs unique, e.g. `` `r"--()--" ``, `` `r"---()---" ``, etc. Raw strings are flexible enough to handle any text.
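As a minimal sketch of the alternative delimiters described above (the variable names are illustrative, and raw strings require R 4.0.0 or later), both of the following raw strings hold the same text, which contains `)"` and so would end too early with the default `r"(` and `)"` pair:

```{r}
# Both raw strings hold the text: paste0("(", x, ")")
# Square brackets as delimiters: the string ends at ]"
with_brackets <- r"[paste0("(", x, ")")]"
# One dash on each side: the string ends at )-"
with_dashes <- r"-(paste0("(", x, ")"))-"

identical(with_brackets, with_dashes)
writeLines(with_brackets)
```

Which delimiter you pick is a matter of taste; the only requirement is that the closing sequence does not appear inside the text itself.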
### Other special characters
As well as `\"`, `\'`, and `\\` there are a handful of other special characters that may come in handy. The most common are `"\n"`, newline, and `"\t"`, tab, but you can see the complete list in `?'"'`.
As well as `\"`, `\'`, and `\\` there are a handful of other special characters that may come in handy. The most common are `\n`, newline, and `\t`, tab, but you can see the complete list in `?'"'`.
You'll also sometimes see strings containing Unicode escapes that start with `\u` or `\U`.
This is a way of writing non-English characters that works on all systems:
@ -114,72 +127,19 @@ x
str_view(x)
```
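The following sketch, which uses only base R and illustrative variable names, shows how these escapes behave: the printed representation shows the escape sequences, while `writeLines()` shows the characters they stand for, and each escape counts as a single character.

```{r}
x <- "one\ttwo\nthree"
print(x)       # printed representation: the escapes are shown as typed
writeLines(x)  # raw contents: a real tab and a real line break

nchar("\n")    # an escape sequence is a single character

mu <- "\u00b5" # Unicode escape for the micro sign
nchar(mu)
```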
## Combining strings
Use `str_c()`[^strings-2] to join together multiple character vectors into a single vector:
[^strings-2]: `str_c()` is very similar to the base `paste0()`.
There are two main reasons I use it here: it obeys the usual rules for handling `NA`, and it uses the tidyverse recycling rules.
```{r}
str_c("x", "y")
str_c("x", "y", "z")
```
`str_c()` obeys the usual recycling rules:
```{r}
names <- c("Timothy", "Dewey", "Mable")
str_c("Hi ", names, "!")
```
And like most other functions in R, missing values are contagious.
You can use `coalesce()` to replace missing values with a value of your choosing:
```{r}
x <- c("abc", NA)
str_c("|-", x, "-|")
str_c("|-", coalesce(x, ""), "-|")
```
Since `str_c()` creates a vector, you'll usually use it with a `mutate()`:
```{r}
starwars %>%
  mutate(greeting = str_c("Hi! I'm ", name, "."), .after = name)
```
Another powerful way of combining strings is with the glue package.
You can either use `glue::glue()` directly or call it via the `str_glue()` wrapper that stringr provides for you.
Glue works a little differently to the other methods: you give it a single string then within the string use `{}` to indicate where existing variables should be evaluated:
```{r}
str_glue("|-{x}-|")
```
Like `str_c()`, `str_glue()` pairs well with `mutate()`:
```{r}
starwars %>%
  mutate(
    intro = str_glue("Hi! My name is {name} and I'm a {species} from {homeworld}"),
    .keep = "none"
  )
```
You can use any valid R code inside of `{}`, but it's a good idea to pull complex calculations out into their own variables so you can more easily check your work.
## Length and subsetting
It's natural to think about the letters that make up an individual string.
(But note that the idea of a "letter" isn't a natural fit for every language; we'll come back to that in Section \@ref(other-languages).)
(Not every language uses letters, which we'll talk about more in Section \@ref(other-languages)).
For example, `str_length()` tells you the length of a string in characters:
```{r}
str_length(c("a", "R for data science", NA))
```
You could use this with `count()` to find the distribution of lengths of US babynames, and then with `filter()` to look at the longest names:
You could use this with `count()` to find the distribution of lengths of US babynames, and then with `filter()` to look at the longest names[^strings-4]:
[^strings-4]: Looking at these entries, I'd say the babynames data removes spaces or hyphens from names and truncates after 15 letters.
```{r}
babynames %>%
@ -257,7 +217,7 @@ Later on, we'll come back two related problems: the components have varying leng
1. Use `str_length()` and `str_sub()` to extract the middle letter from each baby name. What will you do if the string has an even number of characters?
## Long strings
### Long strings
Sometimes the reason you care about the length of a string is because you're trying to fit it into a label.
stringr provides two useful tools for cases where your string is too long:
@ -273,15 +233,87 @@ str_trunc(x, 30)
str_view(str_wrap(x, 30))
```
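As a rough sketch of how `str_trunc()` can be tuned, its `side` argument controls which part of the string is kept. The `label` text here is made up for illustration:

```{r}
library(stringr)

label <- "A label that is far too long to fit"

short <- c(
  str_trunc(label, 20),                  # keep the start (the default)
  str_trunc(label, 20, side = "left"),   # keep the end
  str_trunc(label, 20, side = "center")  # keep both ends
)
short
str_length(short)  # the width includes the ellipsis
```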
## String summaries
##
## Combining strings
There are two ways in which you might want to combine strings.
You might have a few character vectors that you want to combine to create a new vector.
Or you might have a single vector that you want to collapse down into a single string.
### str_c()
Use `str_c()`[^strings-5] to join together multiple character vectors into a single vector:
[^strings-5]: `str_c()` is very similar to the base `paste0()`.
There are two main reasons I use it here: it obeys the usual rules for handling `NA`, and it uses the tidyverse recycling rules.
```{r}
str_c("x", "y")
str_c("x", "y", "z")
```
`str_c()` obeys the tidyverse recycling rules so any length-1 vectors (aka strings) will be recycled to the length of the longest vector[^strings-6]:
[^strings-6]: If the other vectors don't have the same length, `str_c()` will error.
```{r}
names <- c("Timothy", "Dewey", "Mable")
str_c("Hi ", names, "!")
```
Like most other functions in R, missing values are contagious, so any missing input will cause the output to be missing.
If you don't want this behaviour, use `coalesce()` to replace missing values with something else:
```{r}
x <- c("abc", NA)
str_c("|-", x, "-|")
str_c("|-", coalesce(x, ""), "-|")
```
Since `str_c()` creates a vector, you'll usually use it with `mutate()`:
```{r}
starwars %>%
  mutate(greeting = str_c("Hi! I'm ", name, "."), .after = name)
```
### Glue
Another powerful way of combining strings is with the glue package.
You can either use `glue::glue()` directly or call it via the `str_glue()` wrapper that stringr provides for you.
Glue works a little differently to the other methods: you give it a single string then within the string use `{}` to indicate where existing variables should be evaluated:
```{r}
x <- c("abc", NA)
str_glue("|-{x}-|")
```
Like `str_c()`, `str_glue()` pairs well with `mutate()`:
```{r}
starwars %>%
  mutate(
    intro = str_glue("Hi! My name is {name} and I'm a {species} from {homeworld}"),
    .keep = "none"
  )
```
You can use any valid R code inside of `{}`, but it's a good idea to pull complex calculations out into their own variables so you can more easily check your work.
Differences with `NA` handling.
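One difference the note above presumably refers to is that `str_c()` propagates missing values, while `str_glue()` turns them into the text `"NA"`. The sketch below shows that, and also illustrates pulling a calculation out of `{}` into its own variable; `heights_cm` and `heights_in` are made-up names for illustration.

```{r}
library(stringr)

x <- c("abc", NA)
joined <- str_c("|-", x, "-|")  # the NA propagates to the output
glued <- str_glue("|-{x}-|")    # the NA becomes the text "NA"
joined
glued

# Any R code works inside {}, but a named intermediate is easier to check
heights_cm <- c(172, 96)
inline <- str_glue("About {round(heights_cm / 2.54)} inches")
heights_in <- round(heights_cm / 2.54)
separate <- str_glue("About {heights_in} inches")
identical(as.character(inline), as.character(separate))
```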
### `str_flatten()`
`str_c()` combines multiple character vectors into a single character vector; the output is the same length as the input.
A related function is `str_flatten()`: it takes a character vector and returns a single string:
A related function is `str_flatten()`:[^strings-7] it takes a character vector and returns a single string:
[^strings-7]: The base R equivalent is `paste()` with the `collapse` argument set.
```{r}
str_flatten(c("x", "y", "z"))
str_flatten(c("x", "y", "z"), ", ")
str_flatten(c("x", "y", "z"), ", ", ", and ")
str_flatten(c("x", "y", "z"), ", ", last = ", and ")
```
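To make the comparison with base R concrete, here is a small sketch using a made-up `fruit` vector: `str_flatten()` with a separator gives the same result as `paste()` with `collapse` set, and both return a single string.

```{r}
library(stringr)

fruit <- c("apple", "banana", "pear")

flat <- str_flatten(fruit, ", ")
base_flat <- paste(fruit, collapse = ", ")

identical(flat, base_flat)
length(flat)  # one string, however long the input was
```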
Just like `sum()` and `mean()` take a vector of numbers and return a single number, `str_flatten()` takes a character vector and returns a single string.
@ -302,6 +334,18 @@ df %>%
summarise(fruits = str_flatten(fruit, ", "))
```
### Exercises
1. Compare the results of `paste0()` with `str_c()` for the following inputs:
```{r, eval = FALSE}
str_c("hi ", NA)
str_c("hi ", character())
str_c(letters[1:2], letters[1:3])
```
## Splitting apart strings
## Detect matches
To determine if a character vector matches a pattern, use `str_detect()`.
@ -474,7 +518,7 @@ tibble(sentence = sentences) %>%
2. Find all words that come after a "number" like "one", "two", "three" etc. Pull out both the number and the word.
3. Find all contractions. Separate out the pieces before and after the apostrophe.
## Strings -\> Columns
## Strings -> Columns
## Separate
@ -509,7 +553,7 @@ table3 %>%
`separate_rows()`
## Strings -\> Rows
## Strings -> Rows
```{r}
starwars %>%
@ -546,11 +590,11 @@ Maybe things you think are true, but aren't list?
You will not generally find the base R `Encoding()` to be useful because it only supports three different encodings (and interpreting what they mean is non-trivial) and it only tells you the encoding that R thinks it is, not what it really is.
And typically the problem is that the declared encoding is wrong.
The tidyverse follows best practices[^strings-3] of using UTF-8 everywhere, so any string you create with the tidyverse will use UTF-8.
The tidyverse follows best practices[^strings-8] of using UTF-8 everywhere, so any string you create with the tidyverse will use UTF-8.
It's still possible to have problems, but they'll typically arise during data import.
Once you've diagnosed that you have an encoding problem, you should fix it in data import (i.e. by using the `encoding` argument to `readr::locale()`).
[^strings-3]: <http://utf8everywhere.org>
[^strings-8]: <http://utf8everywhere.org>
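As a hedged sketch of fixing an encoding at import, the example below writes a small hypothetical Latin-1 file by hand and then reads it back with the encoding declared through `readr::locale()`. The file and its contents are invented for illustration, and it assumes your system's iconv recognises the name `"latin1"`.

```{r}
library(readr)

# Hypothetical file: these bytes spell "niño" in Latin-1,
# where ñ is the single byte 0xf1 (not valid UTF-8 on its own)
path <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x6e, 0x69, 0xf1, 0x6f, 0x0a)), path)

# Declare the file's real encoding at import; readr converts the text to UTF-8
fixed <- read_lines(path, locale = locale(encoding = "latin1"))
fixed
```

Declaring the encoding once at import means every later step can assume UTF-8, instead of each step having to repair the text.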
### Length and subsetting