# Programming with strings {#sec-programming-with-strings}

```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("drafting")
```

```{r}
library(stringr)
library(tidyr)
library(tibble)
```

### Encoding

You will not generally find the base R `Encoding()` useful: it only supports three different encodings (and interpreting what they mean is non-trivial), and it only tells you the encoding that R thinks a string has, not what it really is.
And typically the problem is that the declared encoding is wrong.

The tidyverse follows best practices[^prog-strings-1] of using UTF-8 everywhere, so any string you create with the tidyverse will use UTF-8.
It's still possible to have problems, but they'll typically arise during data import.
Once you've diagnosed that you have an encoding problem, you should fix it at data import (i.e. by using the `encoding` argument to `readr::locale()`).

[^prog-strings-1]: <http://utf8everywhere.org>

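As a rough sketch of what fixing the problem at import can look like, the chunk below writes a small Latin-1 encoded file to a temporary path and reads it back while declaring its encoding. The file contents and the names `path` and `text_utf8` are made up for illustration, and the example assumes readr is installed.

```{r}
# Hypothetical data: a one-column CSV stored as Latin-1 bytes.
# "\xf1" is the Latin-1 byte for "ñ"; on its own it is not valid UTF-8.
path <- tempfile(fileext = ".csv")
writeBin(charToRaw("text\nEl Ni\xf1o was particularly bad this year\n"), path)

# Declaring the file's real encoding lets readr convert the text to UTF-8
text_utf8 <- readr::read_csv(
  path,
  locale = readr::locale(encoding = "Latin1"),
  col_types = "c"
)$text

text_utf8
Encoding(text_utf8)
```

If you leave out the `locale` argument, readr assumes the file is UTF-8, which is the situation this section warns about.
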
### Length and subsetting

Computing the length of a string, or extracting part of it, seems straightforward if you're only familiar with English, but things get complex quickly when working with other languages.

The four most common scripts are Latin, Chinese, Arabic, and Devanagari, which represent four different kinds of writing system:

- Latin uses an alphabet, where each consonant and vowel gets its own letter.

- Chinese uses logograms.
  This raises the issue of half-width versus full-width characters: English letters are roughly twice as high as they are wide, whereas Chinese characters are roughly square.

- Arabic is an abjad: only consonants are written, and vowels are optionally indicated with diacritics.
  Additionally, it's written from right to left, so the first letter is the letter on the far right.

- Devanagari is an abugida, where each symbol represents a consonant-vowel pair and vowel notation is secondary.

> For instance, 'ch' is two letters in English and Latin, but considered to be one letter in Czech and Slovak.
> --- <http://utf8everywhere.org>

```{r}
# But see how a character boundary with a Czech locale actually splits "ch":
str_split("check", boundary("character", locale = "cs_CZ"))
```

Length is a problem even with Latin alphabets because many languages use **diacritics**, glyphs added to the basic alphabet.
This matters because Unicode provides two ways of representing characters with accents: many common characters have their own dedicated code point, but others are built up from individual components.

```{r}
x <- c("á", "x́")
str_length(x)
# str_width(x)
str_sub(x, 1, 1)

# Display width: each character counts as 0, 1, or 2 columns,
# but this assumes no font substitution
# stri_width(c("全形", "ab"))
```

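To make the two representations concrete, the sketch below builds "á" both ways using Unicode escapes and then normalizes one into the other with `stringi::stri_trans_nfc()`. The variable names are illustrative, and the example assumes stringi is installed (it's a dependency of stringr).

```{r}
library(stringr)

composed <- "\u00e1"     # a single code point: LATIN SMALL LETTER A WITH ACUTE
decomposed <- "a\u0301"  # "a" followed by COMBINING ACUTE ACCENT

# They render the same but contain a different number of code points
str_length(c(composed, decomposed))
composed == decomposed

# NFC normalization composes the pieces into the dedicated code point
# whenever one exists
stringi::stri_trans_nfc(decomposed) == composed
```

Normalizing both sides before comparing is one way to avoid surprises when strings come from different sources.
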
```{r}
# These two letters look identical but are different code points
cyrillic_a <- "А"
latin_a <- "A"
cyrillic_a == latin_a
stringi::stri_escape_unicode(cyrillic_a)
stringi::stri_escape_unicode(latin_a)
```

### `str_c()`

`str_c()` silently drops `NULL`s.
This is particularly useful in conjunction with `if`, because an `if` without an `else` returns `NULL` when its condition is `FALSE`:

```{r}
name <- "Hadley"
time_of_day <- "morning"
birthday <- FALSE

str_c(
  "Good ", time_of_day, " ", name,
  if (birthday) " and HAPPY BIRTHDAY",
  "."
)
```

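The same idea can be wrapped in a small function to see both branches side by side. `greet()` is a hypothetical helper written for this sketch, not part of stringr; the last line contrasts `NULL` with `NA`, which is not dropped.

```{r}
library(stringr)

# Hypothetical helper wrapping the greeting above
greet <- function(name, birthday = FALSE) {
  str_c(
    "Good morning ", name,
    if (birthday) " and HAPPY BIRTHDAY", # NULL when birthday is FALSE
    "."
  )
}

greet("Hadley")
greet("Hadley", birthday = TRUE)

# Unlike NULL, NA is not dropped: it makes the whole result missing
str_c("Good morning ", NA, ".")
```
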
### `str_dup()`

Closely related to `str_c()` is `str_dup()`.
If `str_c(a, a, a)` is like `a + a + a`, what's the equivalent of `3 * a`?
That's `str_dup()`:

```{r}
str_dup(letters[1:3], 3)
str_dup("a", 1:3)
```

## Performance

`fixed()` matches exactly the specified sequence of bytes.
It ignores all special regular expression characters and operates at a very low level.
This allows you to avoid complex escaping and can be much faster than regular expressions.
The following microbenchmark shows that it's about 3x faster for a simple example.

```{r}
microbenchmark::microbenchmark(
  fixed = str_detect(sentences, fixed("the")),
  regex = str_detect(sentences, "the"),
  times = 20
)
```

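Speed aside, the sketch below illustrates the other two points made above: `fixed()` saves you from escaping, and because it compares bytes it treats two representations of the same accented character as different. The second half uses `coll()`, stringr's collation-based matcher, purely as a point of comparison; the variable names are illustrative.

```{r}
library(stringr)

x <- c("a.b", "axb")

# As a regular expression, "." matches any character
str_detect(x, ".")
# Escaping works, but gets hard to read for complicated patterns
str_detect(x, "\\.")
# fixed() treats the pattern as literal text, so no escaping is needed
str_detect(x, fixed("."))

# Because fixed() compares bytes, two representations of "á" don't match
a1 <- "\u00e1"
a2 <- "a\u0301"
str_detect(a1, fixed(a2))
# coll() compares using collation rules instead, at some cost in speed
str_detect(a1, coll(a2))
```
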
As you saw with `str_split()`, you can use `boundary()` to match boundaries.
You can also use it with the other functions:

```{r}
x <- "This is a sentence."
str_view_all(x, boundary("word"))
str_extract_all(x, boundary("word"))
```

### Extract

```{r}
colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
colour_match

more <- sentences[str_count(sentences, colour_match) > 1]
str_extract_all(more, colour_match)
```

If you use `simplify = TRUE`, `str_extract_all()` will return a matrix with short matches expanded to the same length as the longest:

```{r}
str_extract_all(more, colour_match, simplify = TRUE)

x <- c("a", "a b", "a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)
```

We don't talk about matrices here, but they are useful elsewhere.

### Exercises

1. From the Harvard sentences data, extract:

    1. The first word from each sentence.
    2. All words ending in `ing`.
    3. All plurals.

## Grouped matches

Earlier in this chapter we talked about the use of parentheses for clarifying precedence and for backreferences when matching.
You can also use parentheses to extract parts of a complex match.
For example, imagine we want to extract nouns from the sentences.
As a heuristic, we'll look for any word that comes after "a" or "the".
Defining a "word" in a regular expression is a little tricky, so here we use a simple approximation: a sequence of at least one character that isn't a space.

```{r}
noun <- "(a|the) ([^ ]+)"

has_noun <- sentences |>
  str_subset(noun) |>
  head(10)
has_noun |>
  str_extract(noun)
```

`str_extract()` gives us the complete match; `str_match()` gives each individual component.
Instead of a character vector, it returns a matrix, with one column for the complete match followed by one column for each group:

```{r}
has_noun |>
  str_match(noun)
```

(Unsurprisingly, our heuristic for detecting nouns is poor, and also picks up adjectives like "smooth" and "parked".)

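A tiny hand-made input makes the shape of the result easier to see. The `toy` vector below is invented for this sketch; it includes one string where the pattern matches in the middle of a word, and one with no match at all.

```{r}
library(stringr)

noun <- "(a|the) ([^ ]+)"
toy <- c("the cat sat", "banana split", "no match here")

m <- str_match(toy, noun)
m

# Column 1 is the complete match; columns 2 and 3 are the two groups
m[, 3]
```

Strings without a match produce a row of `NA`s, and "banana split" shows another weakness of the heuristic: the pattern matches the final "a" of "banana" because nothing requires the article to start a word.
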
## Splitting

Use `str_split()` to split a string up into pieces.
For example, we could split sentences into words:

```{r}
sentences |>
  head(5) |>
  str_split(" ")
```

Because each component might contain a different number of pieces, this returns a list.
If you're working with a length-1 vector, the easiest thing is to just extract the first element of the list:

```{r}
str_split("a|b|c|d", "\\|")[[1]]
```

Otherwise, like the other stringr functions that return a list, you can use `simplify = TRUE` to return a matrix:

```{r}
sentences |>
  head(5) |>
  str_split(" ", simplify = TRUE)
```

You can also request a maximum number of pieces:

```{r}
fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
fields |> str_split(": ", n = 2, simplify = TRUE)
```

Instead of splitting up strings by patterns, you can also split up by character, line, sentence and word `boundary()`s:

```{r}
x <- "This is a sentence. This is another sentence."
str_view_all(x, boundary("word"))

str_split(x, " ")[[1]]
str_split(x, boundary("word"))[[1]]
```

Show how `separate_rows()` is a special case of `str_split()` + `summarize()`.

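One way to make that connection concrete is sketched below. Note that it swaps in `tidyr::unnest()` for `summarize()`, which keeps the example to the packages loaded in this chapter; the toy data frame and its column names are invented.

```{r}
library(stringr)
library(tidyr)
library(tibble)

# Hypothetical toy data: several values packed into a single string
df <- tibble(
  name = c("Ana", "Bo"),
  langs = c("R,Python", "Julia")
)

# separate_rows() splits `langs` and gives each piece its own row
by_separate <- separate_rows(df, langs, sep = ",")

# The same result by hand: split into a list-column, then unnest it
df_split <- df
df_split$langs <- str_split(df$langs, ",")
by_split <- unnest(df_split, langs)

by_separate
by_split
```
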
## Replace with function

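A minimal sketch of the idea, assuming stringr 1.5.0 or later, where the `replacement` argument of `str_replace_all()` can be a function that receives the matched text and returns what to put in its place. The inputs below are invented for illustration.

```{r}
library(stringr)

x <- c("apple pie", "banana split")

# Each matched vowel is passed to toupper(), and the result replaces the match
str_replace_all(x, "[aeiou]", toupper)

# An anonymous function works too; here each run of digits is doubled
double_it <- function(n) as.character(as.numeric(n) * 2)
str_replace_all("3 apples and 12 pears", "[0-9]+", double_it)
```

The function needs to return a character vector, which is why `double_it()` converts its result back with `as.character()`.
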
## Locations

`str_locate()` and `str_locate_all()` give you the starting and ending positions of each match.
These are particularly useful when none of the other functions does exactly what you want.
You can use `str_locate()` to find the matching pattern, then `str_sub()` to extract and/or modify the matches.

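Here is a small sketch of that locate-then-subset workflow. The input strings and the pattern are made up for illustration.

```{r}
library(stringr)

x <- c("banana", "pineapple")

# One row per string, with the start and end of the first match
loc <- str_locate(x, "an|app")
loc

# Extract the matched text using those positions
matched <- str_sub(x, loc[, "start"], loc[, "end"])
matched

# Or modify it in place with the assignment form of str_sub()
y <- x
str_sub(y, loc[, "start"], loc[, "end"]) <- str_to_upper(matched)
y

# str_locate_all() returns every match, as a list of matrices
all_locs <- str_locate_all("banana", "an")
all_locs
```
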
## stringi

stringr is built on top of the **stringi** package.
stringr is useful when you're learning because it exposes a minimal set of functions, which have been carefully picked to handle the most common string manipulation tasks.
stringi, on the other hand, is designed to be comprehensive.
It contains almost every function you might ever need: stringi has `r length(getNamespaceExports("stringi"))` functions to stringr's `r length(getNamespaceExports("stringr"))`.

If you find yourself struggling to do something in stringr, it's worth taking a look at stringi.
The packages work very similarly, so you should be able to translate your stringr knowledge in a natural way.
The main difference is the prefix: `str_` vs. `stri_`.

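As a quick illustration of how directly the two map onto each other, the sketch below runs the same two operations with each package. The input vector is made up, and these are just two pairs chosen as examples.

```{r}
library(stringr)

x <- c("apple", "banana")

# Same operation, different prefix
str_length(x)
stringi::stri_length(x)

# stringi tends to spell out the matching engine in the function name
str_detect(x, "an")
stringi::stri_detect_regex(x, "an")
```
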
### Exercises

1. Find the stringi functions that:

    a. Count the number of words.
    b. Find duplicated strings.
    c. Generate random text.

2. How do you control the language that `stri_sort()` uses for sorting?

### Exercises

1. What do the `extra` and `fill` arguments do in `separate()`?
    Experiment with the various options for the following two toy datasets.

    ```{r}
    #| eval: false

    tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) |>
      separate(x, c("one", "two", "three"))

    tibble(x = c("a,b,c", "d,e", "f,g,i")) |>
      separate(x, c("one", "two", "three"))
    ```

2. Both `unite()` and `separate()` have a `remove` argument.
    What does it do?
    Why would you set it to `FALSE`?

3. Compare and contrast `separate()` and `extract()`.
    Why are there three variations of separation (by position, by separator, and with groups), but only one unite?

4. In the following example we're using `unite()` to create a `date` column from `month` and `day` columns.
    How would you achieve the same outcome using `mutate()` and `paste()` instead of `unite()`?

    ```{r}
    #| eval: false

    events <- tribble(
      ~month, ~day,
      1     , 20,
      1     , 21,
      1     , 22
    )

    events |>
      unite("date", month:day, sep = "-", remove = FALSE)
    ```

5. Write a function that turns (e.g.) a vector `c("a", "b", "c")` into the string `a, b, and c`.
    Think carefully about what it should do if given a vector of length 0, 1, or 2.