632 lines
20 KiB
Plaintext
632 lines
20 KiB
Plaintext
# Strings
|
||
|
||
## Introduction
|
||
|
||
This chapter introduces you to strings in R.
|
||
You'll learn the basics of how strings work and how to create them by hand.
|
||
Big topic so spread over three chapters: here we'll focus on the basic mechanics, in Chapter \@ref(regular-expressions) we'll dive into the details of regular expressions the sometimes cryptic language for describing patterns in strings, and we'll return to strings later in Chapter \@ref(programming-with-strings) when we think about them about from a programming perspective (rather than a data analysis perspective).
|
||
|
||
While base R contains functions that allow us to perform pretty much all of the operations described in this chapter, here we're going to use the **stringr** package.
|
||
stringr has been carefully designed to be as consistent as possible so that knowledge gained about one function can be more easily transferred to the next.
|
||
stringr functions all start with the same `str_` prefix.
|
||
This is particularly useful if you use RStudio, because typing `str_` will trigger autocomplete, allowing you to see all stringr's functions:
|
||
|
||
```{r, echo = FALSE}
|
||
knitr::include_graphics("screenshots/stringr-autocomplete.png")
|
||
```
|
||
|
||
### Prerequisites
|
||
|
||
This chapter will focus on the **stringr** package for string manipulation, which is part of the core tidyverse.
|
||
We'll also work with the babynames dataset.
|
||
|
||
```{r setup, message = FALSE}
|
||
library(tidyverse)
|
||
library(babynames)
|
||
```
|
||
|
||
## Creating a string
|
||
|
||
To begin, let's discuss the mechanics of creating a string.
|
||
We've created strings in passing earlier in the book, but didn't discuss the details.
|
||
First, there are two basic ways to create a string: using either single quotes (`'`) or double quotes (`"`).
|
||
Unlike other languages, there is no difference in behaviour.
|
||
I recommend always using `"`, unless you want to create a string that contains multiple `"`.
|
||
|
||
```{r}
|
||
string1 <- "This is a string"
|
||
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'
|
||
```
|
||
|
||
If you forget to close a quote, you'll see `+`, the continuation character:
|
||
|
||
> "This is a string without a closing quote
|
||
+
|
||
+
|
||
+ HELP I'M STUCK
|
||
|
||
If this happen to you, press Escape and try again.
|
||
|
||
### Escapes
|
||
|
||
To include a literal single or double quote in a string you can use `\` to "escape" it:
|
||
|
||
```{r}
|
||
double_quote <- "\"" # or '"'
|
||
single_quote <- '\'' # or "'"
|
||
```
|
||
|
||
Which means if you want to include a literal backslash, you'll need to double it up: `"\\"`:
|
||
|
||
```{r}
|
||
backslash <- "\\"
|
||
```
|
||
|
||
Beware that the printed representation of a string is not the same as string itself, because the printed representation shows the escapes.
|
||
To see the raw contents of the string, use `str_view()`:
|
||
|
||
```{r}
|
||
x <- c(single_quote, double_quote, backslash)
|
||
x
|
||
str_view(x)
|
||
```
|
||
|
||
### Raw strings
|
||
|
||
Creating a string with multiple quotes or backslashes gets confusing quickly.
|
||
For example, lets create a string that contains the contents of the chunk where I define the `double_quote` and `single_quote` variables:
|
||
|
||
```{r}
|
||
tricky <- "double_quote <- \"\\\"\" # or '\"'
|
||
single_quote <- '\\'' # or \"'\""
|
||
str_view(tricky)
|
||
```
|
||
|
||
You can instead use a **raw string**[^strings-1] to reduce the amount of escaping:
|
||
|
||
[^strings-1]: Available in R 4.0.0 and above.
|
||
|
||
```{r}
|
||
tricky <- r"(double_quote <- "\"" # or '"'
|
||
single_quote <- '\'' # or "'"
|
||
)"
|
||
str_view(tricky)
|
||
```
|
||
|
||
A raw string starts with `r"(` and finishes with `)"`.
|
||
If your string contains `)"` you can instead use `r"[]"` or `r"{}"`, and if that's still not enough, you can insert any number of dashes to make the opening and closing pairs unique, e.g. `` `r"--()--" ``, `` `r"---()---" ``,etc.
|
||
|
||
### Other special characters
|
||
|
||
As well as `\"`, `\'`, and `\\` there are a handful of other special characters that may come in handy. The most common are `"\n"`, newline, and `"\t"`, tab, but you can see the complete list in `?'"'`.
|
||
|
||
You'll also sometimes see strings containing Unicode escapes that start with `\u` or `\U`.
|
||
This is a way of writing non-English characters that works on all systems:
|
||
|
||
```{r}
|
||
x <- c("\u00b5", "\U0001f604")
|
||
x
|
||
str_view(x)
|
||
```
|
||
|
||
## Combining strings
|
||
|
||
Use `str_c()`[^strings-2] to join together multiple strings into a single string:
|
||
|
||
[^strings-2]: `str_c()` is very similar to the base `paste0()`.
|
||
There are two main reasons I use it here: it obeys the usual rules for handling `NA`, and it uses the tidyverse recycling rules.
|
||
|
||
```{r}
|
||
str_c("x", "y")
|
||
str_c("x", "y", "z")
|
||
```
|
||
|
||
Like most other functions in R, missing values are contagious.
|
||
You can use `coalesce()` to replace missing values with a value of your choosing:
|
||
|
||
```{r}
|
||
x <- c("abc", NA)
|
||
str_c("|-", x, "-|")
|
||
str_c("|-", coalesce(x, ""), "-|")
|
||
```
|
||
|
||
Since `str_c()` creates a new variable, you'll usually use it with a `mutate()`:
|
||
|
||
```{r}
|
||
starwars %>%
|
||
mutate(greeting = str_c("Hi! I'm ", name, "."), .after = name)
|
||
```
|
||
|
||
Another powerful way of combining strings is with the glue package.
|
||
You can either use `glue::glue()` or call it via the `str_glue()` wrapper that string provides for you.
|
||
Glue works a little differently to the other methods: you give it a single string using `{}` to indicate where you want to interpolate in existing variables:
|
||
|
||
```{r}
|
||
str_glue("|-{x}-|")
|
||
```
|
||
|
||
Like `str_c()`, `str_glue()` pairs well with `mutate()`:
|
||
|
||
```{r}
|
||
starwars %>%
|
||
mutate(
|
||
intro = str_glue("Hi! My is {name} and I'm a {species} from {homeworld}"),
|
||
.keep = "none"
|
||
)
|
||
```
|
||
|
||
You can use any valid R code inside of `{}`, but we recommend placing more complex calculations in their own variables.
|
||
|
||
## Length and subsetting
|
||
|
||
It's also natural to think about the letters that make up an individual string.
|
||
(But note that the idea of a "letter" isn't a natural fit to every language, we'll come back to that in Section \@ref(other-languages)).
|
||
For example, `str_length()` tells you the length, the number of characters:
|
||
|
||
```{r}
|
||
str_length(c("a", "R for data science", NA))
|
||
```
|
||
|
||
You could use this with `count()` to find the distribution of lengths of US babynames:
|
||
|
||
```{r}
|
||
babynames %>%
|
||
count(length = str_length(name), wt = n)
|
||
```
|
||
|
||
You can extract parts of a string using `str_sub()`.
|
||
As well as the string, `str_sub()` takes `start` and `end` arguments which give the (inclusive) characters to start and end at:
|
||
|
||
```{r}
|
||
x <- c("Apple", "Banana", "Pear")
|
||
str_sub(x, 1, 3)
|
||
```
|
||
|
||
You can use negative values to count back from the end of the string: -1 is the last character, -2 is the second to last character, etc.
|
||
|
||
```{r}
|
||
str_sub(x, -3, -1)
|
||
```
|
||
|
||
Note that `str_sub()` won't fail if the string is too short: it will just return as much as possible:
|
||
|
||
```{r}
|
||
str_sub("a", 1, 5)
|
||
```
|
||
|
||
We could use `str_sub()` with `mutate()` to find the first and last letter of each name:
|
||
|
||
```{r}
|
||
babynames %>%
|
||
mutate(
|
||
first = str_sub(name, 1, 1),
|
||
last = str_sub(name, -1, -1)
|
||
)
|
||
```
|
||
|
||
Sometimes you'll get a column that's made up of individual fixed length strings that have been joined together:
|
||
|
||
```{r}
|
||
df <- tribble(
|
||
~ sex_year_age,
|
||
"M200115",
|
||
"F201503",
|
||
)
|
||
```
|
||
|
||
You can extract the columns using `str_sub()`:
|
||
|
||
```{r}
|
||
df %>% mutate(
|
||
sex = str_sub(sex_year_age, 1, 1),
|
||
year = str_sub(sex_year_age, 2, 5),
|
||
age = str_sub(sex_year_age, 6, 7),
|
||
)
|
||
```
|
||
|
||
Or use the `separate()` helper function:
|
||
|
||
```{r}
|
||
df %>%
|
||
separate(sex_year_age, c("sex", "year", "age"), c(1, 5))
|
||
```
|
||
|
||
Note that you give `separate()` three columns but only two positions --- that's because you're telling `separate()` where to break up the string.
|
||
|
||
TODO: draw diagram to emphasise that it's the space between the characters.
|
||
|
||
Later on, we'll come back two related problems: the components having vary length are a separated by a character
|
||
|
||
### Exercises
|
||
|
||
1. Use `str_length()` and `str_sub()` to extract the middle letter from each baby name. What will you do if the string has an even number of characters?
|
||
|
||
## Long strings
|
||
|
||
Sometimes the reason you care about the length of a string is because you're trying to fit it into a label.
|
||
stringr provides two useful tools for cases where your string is too long:
|
||
|
||
- `str_trunc(x, 20)` ensures that no string is longer than 20 characters, replacing any thing too long with `…`.
|
||
|
||
- `str_wrap(x, 20)` wraps a string introducing new lines so that each line is at most 20 characters (it doesn't hyphenate, however, so any word longer than 20 characters will make a longer time)
|
||
|
||
## String summaries
|
||
|
||
`str_c()` combines multiple character vectors into a single character vector; the output is the same length as the input.
|
||
An related function is `str_flatten()`: it takes a character vector and returns a single string:
|
||
|
||
```{r}
|
||
str_flatten(c("x", "y", "z"))
|
||
```
|
||
|
||
Just like `sum()` and `mean()` take a vector of numbers and return a single number, `str_flatten()` takes a character vector and returns a single string.
|
||
This makes `str_flatten()` a summary function for strings, so you'll often pair it with `summarise()`:
|
||
|
||
```{r}
|
||
df <- tribble(
|
||
~ name, ~ fruit,
|
||
"Carmen", "banana",
|
||
"Carmen", "apple",
|
||
"Marvin", "nectarine",
|
||
"Terence", "cantaloupe",
|
||
"Terence", "papaya",
|
||
"Terence", "madarine"
|
||
)
|
||
df %>%
|
||
group_by(name) %>%
|
||
summarise(fruits = str_flatten(fruit, ", "))
|
||
```
|
||
|
||
## Detect matches
|
||
|
||
To determine if a character vector matches a pattern, use `str_detect()`.
|
||
It returns a logical vector the same length as the input:
|
||
|
||
```{r}
|
||
x <- c("apple", "banana", "pear")
|
||
str_detect(x, "e")
|
||
```
|
||
|
||
This makes it a logical pairing with `filter()`:
|
||
|
||
```{r}
|
||
babynames %>% filter(str_detect(name, "x"))
|
||
```
|
||
|
||
Remember that when you use a logical vector in a numeric context, `FALSE` becomes 0 and `TRUE` becomes 1.
|
||
That makes `sum()` and `mean()` useful if you want to answer questions about matches across a larger vector:
|
||
|
||
```{r}
|
||
babynames %>%
|
||
group_by(year) %>%
|
||
summarise(prop_x = mean(str_detect(name, "x")))
|
||
```
|
||
|
||
(Note that this gives us the proportion of names that contain an x; if you wanted the proportion of babies given a name containing an x, you'd need to perform a weighted mean).
|
||
|
||
A variation on `str_detect()` is `str_count()`: rather than a simple yes or no, it tells you how many matches there are in a string:
|
||
|
||
```{r}
|
||
str_count(x, "p")
|
||
```
|
||
|
||
It's natural to use `str_count()` with `mutate()`:
|
||
|
||
```{r}
|
||
babynames %>%
|
||
mutate(
|
||
vowels = str_count(name, "[aeiou]"),
|
||
consonants = str_count(name, "[^aeiou]")
|
||
)
|
||
```
|
||
|
||
### Exercises
|
||
|
||
1. For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple `str_detect()` calls.
|
||
|
||
a. Find all words that start or end with `x`.
|
||
b. Find all words that start with a vowel and end with a consonant.
|
||
c. Are there any words that contain at least one of each different vowel?
|
||
|
||
2. What word has the highest number of vowels?
|
||
What word has the highest proportion of vowels?
|
||
(Hint: what is the denominator?)
|
||
|
||
## Introduction to regular expressions
|
||
|
||
Before we can continue on we need to discuss the second argument to continue to `str_detect()` --- it's not a fixed string, but a pattern, called a regular expression.
|
||
A regular expression uses special characters
|
||
|
||
```{r}
|
||
str_detect(x, ".")
|
||
```
|
||
|
||
You can opt-out with by using `fixed`:
|
||
|
||
```{r}
|
||
str_detect(x, fixed("."))
|
||
```
|
||
|
||
Note that regular expressions are case sensitive by default:
|
||
|
||
```{r}
|
||
babynames %>% filter(str_detect(name, "X"))
|
||
babynames %>% filter(str_detect(name, fixed("X", ignore_case = TRUE)))
|
||
```
|
||
|
||
A common use of `str_detect()` is to select the elements that match a pattern.
|
||
This makes it a natural pairing with `filter()`.
|
||
The following regexp finds all names with repeated pairs of letters (you'll learn how that regexp works in the next chapter)
|
||
|
||
```{r}
|
||
babynames %>%
|
||
filter(n > 100) %>%
|
||
count(name, wt = n) %>%
|
||
filter(str_detect(name, "(..).*\\1"))
|
||
```
|
||
|
||
Simple patterns we'll use:
|
||
|
||
- `.` match any character
|
||
|
||
- `[abcd]` match "a", "b", "c", or "d".
|
||
|
||
- `+` means match one or more: `a+` means match one or more as in a row; `.+` means match one or more of anything; `[abcd]+` means match one of more of a/b/c/d in a row.
|
||
|
||
Can use `str_view_all()` see what a regular expression matches:
|
||
|
||
```{r}
|
||
str_view_all(x, "p+")
|
||
str_view_all(x, "a.")
|
||
```
|
||
|
||
## Replacing matches
|
||
|
||
`str_replace_all()` allow you to replace matches with new strings.
|
||
The simplest use is to replace a pattern with a fixed string:
|
||
|
||
```{r}
|
||
x <- c("apple", "pear", "banana")
|
||
str_replace_all(x, "[aeiou]", "-")
|
||
```
|
||
|
||
With `str_replace_all()` you can perform multiple replacements by supplying a named vector.
|
||
The name gives a regular expression to match, and the value gives the replacement.
|
||
|
||
```{r}
|
||
x <- c("1 house", "1 person has 2 cars", "3 people")
|
||
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
|
||
```
|
||
|
||
`str_remove_all()` is a short cut for `str_replace_all(x, pattern, "")` --- it removes matching patterns from a string.
|
||
|
||
Use in `mutate()`
|
||
|
||
#### Exercises
|
||
|
||
1. Replace all forward slashes in a string with backslashes.
|
||
|
||
2. Implement a simple version of `str_to_lower()` using `str_replace_all()`.
|
||
|
||
3. Switch the first and last letters in `words`.
|
||
Which of those strings are still words?
|
||
|
||
## Extract full matches
|
||
|
||
To extract the actual text of a match, use `str_extract()`.
|
||
To show that off, we're going to need a more complicated example.
|
||
I'm going to use the [Harvard sentences](https://en.wikipedia.org/wiki/Harvard_sentences), which were designed to test VOIP systems, but are also useful for practicing regexps.
|
||
These are provided in `stringr::sentences`:
|
||
|
||
```{r}
|
||
length(sentences)
|
||
head(sentences)
|
||
```
|
||
|
||
Imagine we want to find all sentences that contain a colour.
|
||
We first create a vector of colour names, and then turn it into a single regular expression:
|
||
|
||
```{r}
|
||
colours <- c("red", "orange", "yellow", "green", "blue", "purple")
|
||
colour_match <- str_c(colours, collapse = "|")
|
||
colour_match
|
||
```
|
||
|
||
Now we can select the sentences that contain a colour, and then extract the colour to figure out which one it is:
|
||
|
||
```{r}
|
||
has_colour <- str_subset(sentences, colour_match)
|
||
matches <- str_extract(has_colour, colour_match)
|
||
head(matches)
|
||
```
|
||
|
||
Note that `str_extract()` only extracts the first match.
|
||
We can see that most easily by first selecting all the sentences that have more than 1 match:
|
||
|
||
```{r}
|
||
more <- sentences[str_count(sentences, colour_match) > 1]
|
||
str_view_all(more, colour_match)
|
||
|
||
str_extract(more, colour_match)
|
||
```
|
||
|
||
This is a common pattern for stringr functions, because working with a single match allows you to use much simpler data structures.
|
||
To get all matches, use `str_extract_all()`.
|
||
It returns a list, so we'll come back to this later on.
|
||
|
||
### Exercises
|
||
|
||
1. In the previous example, you might have noticed that the regular expression matched "flickered", which is not a colour. Modify the regex to fix the problem.
|
||
|
||
## Extract part of matches
|
||
|
||
If your data is in a tibble, it's often easier to use `tidyr::extract()`.
|
||
It works like `str_match()` but requires you to name the matches, which are then placed in new columns:
|
||
|
||
```{r}
|
||
tibble(sentence = sentences) %>%
|
||
tidyr::extract(
|
||
sentence, c("article", "noun"), "(a|the) ([^ ]+)",
|
||
remove = FALSE
|
||
)
|
||
```
|
||
|
||
#### Exercises
|
||
|
||
1. Find all words that come after a "number" like "one", "two", "three" etc.
|
||
Pull out both the number and the word.
|
||
|
||
2. Find all contractions.
|
||
Separate out the pieces before and after the apostrophe.
|
||
|
||
## Strings -\> Columns
|
||
|
||
## Separate
|
||
|
||
`separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears.
|
||
Take `table3`:
|
||
|
||
```{r}
|
||
table3
|
||
```
|
||
|
||
The `rate` column contains both `cases` and `population` variables, and we need to split it into two variables.
|
||
`separate()` takes the name of the column to separate, and the names of the columns to separate into, as shown in Figure \@ref(fig:tidy-separate) and the code below.
|
||
|
||
```{r}
|
||
table3 %>%
|
||
separate(rate, into = c("cases", "population"))
|
||
```
|
||
|
||
```{r tidy-separate, echo = FALSE, out.width = "75%", fig.cap = "Separating `rate` into `cases` and `population` to make `table3` tidy", fig.alt = "Two panels, one with a data frame with three columns (country, year, and rate) and the other with a data frame with four columns (country, year, cases, and population). Arrows show how the rate variable is separated into two variables: cases and population."}
|
||
knitr::include_graphics("images/tidy-17.png")
|
||
```
|
||
|
||
By default, `separate()` will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter).
|
||
For example, in the code above, `separate()` split the values of `rate` at the forward slash characters.
|
||
If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`.
|
||
For example, we could rewrite the code above as:
|
||
|
||
```{r eval = FALSE}
|
||
table3 %>%
|
||
separate(rate, into = c("cases", "population"), sep = "/")
|
||
```
|
||
|
||
`separate_rows()`
|
||
|
||
## Strings -\> Rows
|
||
|
||
```{r}
|
||
starwars %>%
|
||
select(name, eye_color) %>%
|
||
filter(str_detect(eye_color, ", ")) %>%
|
||
separate_rows(eye_color)
|
||
```
|
||
|
||
### Exercises
|
||
|
||
1. Split up a string like `"apples, pears, and bananas"` into individual components.
|
||
|
||
2. Why is it better to split up by `boundary("word")` than `" "`?
|
||
|
||
3. What does splitting with an empty string (`""`) do?
|
||
Experiment, and then read the documentation.
|
||
|
||
## Other languages {#other-languages}
|
||
|
||
Encoding, and why not to trust `Encoding`.
|
||
As a general rule, we recommend using UTF-8 everywhere, converting as a early as possible (i.e. by using the `encoding` argument to `readr::locale()`).
|
||
|
||
### Length and subsetting
|
||
|
||
This seems like a straightforward computation if you're only familiar with English, but things get complex quick when working with other languages.
|
||
Include some examples from <https://gankra.github.io/blah/text-hates-you/>.
|
||
|
||
This is a problem even with European problem because Unicode provides two ways of representing characters with accents: many common characters have a special codepoint, but others can be built up from individual components.
|
||
|
||
```{r}
|
||
x <- c("\u00e1", "a\u0301")
|
||
x
|
||
str_length(x)
|
||
str_sub(x, 1, 1)
|
||
```
|
||
|
||
### Locales
|
||
|
||
Above I used `str_to_lower()` to change the text to lower case.
|
||
You can also use `str_to_upper()` or `str_to_title()`.
|
||
However, changing case is more complicated than it might at first appear because different languages have different rules for changing case.
|
||
You can pick which set of rules to use by specifying a locale:
|
||
|
||
```{r}
|
||
# Turkish has two i's: with and without a dot, and it
|
||
# has a different rule for capitalising them:
|
||
str_to_upper(c("i", "ı"))
|
||
str_to_upper(c("i", "ı"), locale = "tr")
|
||
```
|
||
|
||
The locale is specified as a ISO 639 language code, which is a two or three letter abbreviation.
|
||
If you don't already know the code for your language, [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) has a good list.
|
||
If you leave the locale blank, it will use English.
|
||
|
||
Another important operation that's affected by the locale is sorting.
|
||
The base R `order()` and `sort()` functions sort strings using the current locale.
|
||
If you want robust behaviour across different computers, you may want to use `str_sort()` and `str_order()` which take an additional `locale` argument:
|
||
|
||
```{r}
|
||
x <- c("apple", "eggplant", "banana")
|
||
|
||
str_sort(x, locale = "en") # English
|
||
|
||
str_sort(x, locale = "haw") # Hawaiian
|
||
```
|
||
|
||
TODO: add connection to `arrange()`
|
||
|
||
### `coll()`
|
||
|
||
Beware using `fixed()` with non-English data.
|
||
It is problematic because there are often multiple ways of representing the same character.
|
||
For example, there are two ways to define "á": either as a single character or as an "a" plus an accent:
|
||
|
||
```{r}
|
||
a1 <- "\u00e1"
|
||
a2 <- "a\u0301"
|
||
c(a1, a2)
|
||
a1 == a2
|
||
```
|
||
|
||
They render identically, but because they're defined differently, `fixed()` doesn't find a match.
|
||
Instead, you can use `coll()`, defined next, to respect human character comparison rules:
|
||
|
||
```{r}
|
||
str_detect(a1, fixed(a2))
|
||
str_detect(a1, coll(a2))
|
||
```
|
||
|
||
-
|
||
|
||
`coll()`: compare strings using standard **coll**ation rules.
|
||
This is useful for doing case insensitive matching.
|
||
Note that `coll()` takes a `locale` parameter that controls which rules are used for comparing characters.
|
||
Unfortunately different parts of the world use different rules!
|
||
|
||
```{r}
|
||
# That means you also need to be aware of the difference
|
||
# when doing case insensitive matches:
|
||
i <- c("I", "İ", "i", "ı")
|
||
i
|
||
|
||
str_subset(i, coll("i", ignore_case = TRUE))
|
||
str_subset(i, coll("i", ignore_case = TRUE, locale = "tr"))
|
||
```
|
||
|
||
Both `fixed()` and `regex()` have `ignore_case` arguments, but they do not allow you to pick the locale: they always use the default locale.
|
||
You can see what that is with the following code; more on stringi later.
|
||
|
||
```{r}
|
||
stringi::stri_locale_info()
|
||
```
|
||
|
||
The downside of `coll()` is speed; because the rules for recognising which characters are the same are complicated, `coll()` is relatively slow compared to `regex()` and `fixed()`.
|