Another pass through strings

hadley 2016-08-08 10:45:11 -05:00
parent d64eb1ef0a
commit 0ab8d322fb
2 changed files with 119 additions and 125 deletions

@ -2,13 +2,13 @@
## Introduction
This chapter introduces you to string manipulation in R. You'll learn the basics of how strings work and how to create them by hand, but the focus of this chapter will be on regular expressions. Character variables typically come as unstructured or semi-structured data. When this happens, you need some tools to make order from madness. Regular expressions are a very concise language for describing patterns in strings. When you first look at them, you'll think a cat walked across your keyboard, but as you learn more, you'll see how they allow you to express complex patterns very concisely. The goal of this chapter is not to teach you every detail of regular expressions. Instead I'll give you a solid foundation that allows you to solve a wide variety of problems and point you to resources where you can learn more.
This chapter introduces you to string manipulation in R. You'll learn the basics of how strings work and how to create them by hand, but the focus of this chapter will be on regular expressions, or regexps for short. Regular expressions are useful because strings usually contain unstructured or semi-structured data, and regexps provide powerful tools to make order from this sort of madness.
This chapter will focus on the __stringr__ package. This package provides a consistent set of functions that all work the same way and are easier to learn than the base R equivalents. We'll also take a brief look at the __stringi__ package. This package is what stringr uses internally: it's more complex than stringr (and includes many many more functions). stringr includes tools to let you tackle the most common 90% of string manipulation challenges; stringi contains functions to let you tackle the last 10%.
Regexps are a very concise language that lets you describe patterns in strings. When you first look at them, you'll think a cat walked across your keyboard, but as you learn more, you'll see how they allow you to express complex patterns very concisely. The goal of this chapter is not to teach you every detail of regular expressions. Instead I'll give you a solid foundation that allows you to solve a wide variety of problems and point you to resources where you can learn more.
### Prerequisites
In this chapter you'll use the stringr package to manipulate strings.
This chapter will focus on the __stringr__ package. This package provides a consistent set of functions that all work the same way and are easier to learn than the base R equivalents.
```{r setup}
library(stringr)
@ -16,7 +16,7 @@ library(stringr)
## String basics
In R, strings are stored in a character vector. You can create strings with either single quotes or double quotes: there is no difference in behaviour. I recommend always using `"`, unless you want to create a string that contains multiple `"`, in which case use `'`.
You can create strings with either single quotes or double quotes: unlike other languages, there is no difference in behaviour. I recommend always using `"`, unless you want to create a string that contains multiple `"`.
```{r}
string1 <- "This is a string"
@ -40,23 +40,22 @@ x
writeLines(x)
```
There are a handful of other special characters. The most common used are `"\n"`, newline, and `"\t"`, tab, but you can see the complete list by requesting help on `"`: `?'"'`, or `?"'"`. You'll also sometimes see strings like `"\u00b5"`, this is a way of writing non-English characters that works on all platforms:
There are a handful of other special characters. The most common are `"\n"`, newline, and `"\t"`, tab, but you can see the complete list by requesting help on `"`: `?'"'`, or `?"'"`. You'll also sometimes see strings like `"\u00b5"`: this is a way of writing non-English characters that works on all platforms:
```{r}
x <- "\u00b5"
x
```
### String length
Base R contains many functions to work with strings but we'll generally avoid them because they're inconsistent and hard to remember. Their behaviour is particularly inconsistent when it comes to missing values. For example, `nchar()`, which gives the length of a string, returns 2 for `NA` (instead of `NA`)
Multiple strings are often stored in a character vector, which you can create with `c()`:
```{r}
# Bug will be fixed in R 3.3.0
nchar(NA)
c("one", "two", "three")
```
Instead we'll use functions from stringr. These have more intuitive names, and all start with `str_`:
### String length
Base R contains many functions to work with strings but we'll avoid them because they can be inconsistent, which makes them hard to remember. Instead we'll use functions from stringr. These have more intuitive names, and all start with `str_`:
```{r}
str_length(NA)
@ -91,7 +90,7 @@ str_c("|-", x, "-|")
str_c("|-", str_replace_na(x), "-|")
```
As shown above, `str_c()` is vectorised, automatically recycling shorter vectors to the same length as the longest:
As shown above, `str_c()` is vectorised, and it automatically recycles shorter vectors to the same length as the longest:
```{r}
str_c("prefix-", c("a", "b", "c"), "-suffix")
@ -110,7 +109,7 @@ str_c("Good ", time_of_day, " ", name,
)
```
To collapse vectors into a single string, use `collapse`:
To collapse a vector of strings into a single string, use `collapse`:
```{r}
str_c(c("x", "y", "z"), collapse = ", ")
@ -142,7 +141,7 @@ x
### Locales
Above I used `str_to_lower()` to change to lower case. You can also use `str_to_upper()` or `str_to_title()`. However, changing case is more complicated than it might at first seem because different languages have different rules for changing case. You can pick which set of rules to use by specifying a locale:
Above I used `str_to_lower()` to change the text to lower case. You can also use `str_to_upper()` or `str_to_title()`. However, changing case is more complicated than it might at first appear because different languages have different rules for changing case. You can pick which set of rules to use by specifying a locale:
```{r}
# Turkish has two i's: with and without a dot, and it
@ -151,7 +150,7 @@ str_to_upper(c("i", "ı"))
str_to_upper(c("i", "ı"), locale = "tr")
```
The locale is specified as ISO 639 language codes, which are two or three letter abbreviations. If you don't already know the code for your language, [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) has a good list. If you leave the locale blank, it will use the current locale.
The locale is specified as an ISO 639 language code, which is a two- or three-letter abbreviation. If you don't already know the code for your language, [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) has a good list. If you leave the locale blank, it will use the current locale.
Another important operation that's affected by the locale is sorting. The base R `order()` and `sort()` functions sort strings using the current locale. If you want robust behaviour across different computers, you may want to use `str_sort()` and `str_order()` which take an additional `locale` argument:
@ -172,7 +171,7 @@ str_sort(x, locale = "haw") # Hawaiian
`NA`?
1. Use `str_length()` and `str_sub()` to extract the middle character from
a character vector.
a string. What will you do if the string has an even number of characters?
1. What does `str_wrap()` do? When might you want to use it?
@ -184,7 +183,7 @@ str_sort(x, locale = "haw") # Hawaiian
## Matching patterns with regular expressions
Regular expressions, regexps for short, are a very terse language that allow to describe patterns in strings. They take a little while to get your head around, but once you've got it you'll find them extremely useful.
Regexps are a very terse language that allows you to describe patterns in strings. They take a little while to get your head around, but once you understand them, you'll find them extremely useful.
To learn regular expressions, we'll use `str_view()` and `str_view_all()`. These functions take a character vector and a regular expression, and show you how they match. We'll start with very simple regular expressions and then gradually get more and more complicated. Once you've mastered pattern matching, you'll learn how to apply those ideas with various stringr functions.
@ -203,7 +202,7 @@ The next step up in complexity is `.`, which matches any character (except a new
str_view(x, ".a.")
```
But if "`.`" matches any character, how do you match an actual "`.`"? You need to use an "escape" to tell the regular expression you want to match it exactly, not use the special behaviour. In other words, you need to make the regular expression `\.`, but this creates a problem. We use strings to represent regular expressions, and `\` is also used as an escape symbol in strings. So the string `"\."` reduces to the special character written as `\.` In this case, `\.` is not a recognized special character and the string would lead to an error; but `"\n"` would reduce to a new line, `"\t"` would reduce to a tab, and `"\\"` would reduce to a literal `\`, which provides a way forward. To create a string that reduces to a literal backslash followed by a period, you need to escape the backslash. So to match a literal "`.`" you need to use `"\\."`, which simplifies to the regular expression `\.`.
But if "`.`" matches any character, how do you match an actual "`.`"? You need to use an "escape" to tell the regular expression you want to match it exactly, not use the special behaviour. Like strings, regexps use the backslash, `\`, to escape special behaviour. So to match an `.`, you need the regexp `\.`. Unfortunately this creates a problem. We use strings to represent regular expressions, and `\` is also used as an escape symbol in strings. So to create the regular expression `\.` we need the string `"\\."`.
```{r}
# To create the regular expression, we need \\
@ -212,7 +211,7 @@ dot <- "\\."
# But the expression itself only contains one:
writeLines(dot)
# And this tells R to look for explicit .
# And this tells R to look for an explicit .
str_view(c("abc", "a.c", "bef"), "a\\.c")
```
@ -225,7 +224,7 @@ writeLines(x)
str_view(x, "\\\\")
```
In this book, I'll write a regular expression like `\.` and the string that represents the regular expression as `"\\."`.
In this book, I'll write a regular expression as `\.` and the string that represents it as `"\\."`.
#### Exercises
@ -233,7 +232,7 @@ In this book, I'll write a regular expression like `\.` and the string that repr
1. How would you match the sequence `"'\`?
1. What patterns does will this regular expression match `"\..\..\..`?
1. What patterns will the regular expression `\..\..\..` match?
How would you represent it as a string?
### Anchors
@ -251,7 +250,7 @@ str_view(x, "a$")
To remember which is which, try this mnemonic which I learned from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): if you begin with power (`^`), you end up with money (`$`).
To force a regular expression to only match a complete string, anchor it with both `^` and `$`.:
To force a regular expression to only match a complete string, anchor it with both `^` and `$`:
```{r}
x <- c("apple pie", "apple", "apple cake")
@ -259,18 +258,14 @@ str_view(x, "apple")
str_view(x, "^apple$")
```
You can also match the boundary between words with `\b`. I don't find I often use this in R, but I will sometimes use it when I'm doing a find all in RStudio when I want to find the name of a function that's a component of other functions. For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on.
You can also match the boundary between words with `\b`. I don't often use this in R, but I will sometimes use it when I'm doing a search in RStudio when I want to find the name of a function that's a component of other functions. For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on.
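As a minimal sketch of the word-boundary anchor, the chunk below uses a made-up vector of function names (`funs` is an assumption for illustration, not part of the chapter's data). Because the pattern is written as a string, the backslash in `\b` is doubled:

```{r}
library(stringr)

# Hypothetical vector of function names, for illustration only
funs <- c("sum", "summarise", "summary", "rowsum")

# \b matches a word boundary, so only the standalone "sum" matches
str_detect(funs, "\\bsum\\b")

# Without the boundaries, every element contains "sum"
str_detect(funs, "sum")
```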
#### Exercises
1. How would you match the literal string `"$^$"`?
1. Given this corpus of common words:
```{r}
```
Create regular expressions that find all words that:
1. Given the corpus of common words in `stringr::words`, create regular
expressions that find all words that:
1. Start with "y".
1. End with "x"
@ -282,17 +277,16 @@ You can also match the boundary between words with `\b`. I don't find I often us
### Character classes and alternatives
There are number of other special patterns that match more than one character:
There are a number of special patterns that match more than one character. You've already seen `.`, which matches any character apart from a newline. There are four other useful tools:
* `.`: any character apart from a newline.
* `\d`: any digit.
* `\s`: any whitespace (space, tab, newline).
* `[abc]`: match a, b, or c.
* `[^abc]`: match anything except a, b, or c.
* `\d`: matches any digit.
* `\s`: matches any whitespace (e.g. space, tab, newline).
* `[abc]`: matches a, b, or c.
* `[^abc]`: matches anything except a, b, or c.
Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`.
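The double escaping can be sketched as follows. The input vector `x` is invented for illustration; only `str_detect()`, `str_count()` and `writeLines()` are used, and `str_count()` is assumed to be familiar from the stringr documentation rather than from the text above:

```{r}
library(stringr)

# Hypothetical input, for illustration only
x <- c("room 101", "no digits", "tab\there")

# The regexp \d is written as the string "\\d"
str_detect(x, "\\d")

# Likewise the regexp \s is written as "\\s"; it matches the space and the tab
str_count(x, "\\s")

# writeLines() shows the regexp that the string actually contains
writeLines("\\d")
```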
You can use _alternation_ to pick between one or more alternative patterns. For example, `abc|d..f` will match either '"abc"', or `"deaf"`. Note that the precedence for `|` is low, so that `abc|xyz` matches either `abc` or `xyz` not `abcyz` or `abxyz`:
You can use _alternation_ to pick between one or more alternative patterns. For example, `abc|d..f` will match either `"abc"` or `"deaf"`. Note that the precedence for `|` is low, so that `abc|xyz` matches `abc` or `xyz`, not `abcyz` or `abxyz`:
```{r}
str_view(c("abc", "xyz"), "abc|xyz")
@ -306,7 +300,7 @@ str_view(c("grey", "gray"), "gr(e|a)y")
#### Exercises
1. Create regular expressions that find all words that:
1. Create regular expressions to find all words that:
1. Start with a vowel.
@ -317,6 +311,10 @@ str_view(c("grey", "gray"), "gr(e|a)y")
1. End with `ing` or `ise`.
1. Empirically verify the rule "i before e except after c".
1. Is "q" always followed by a "u"?
1. Write a regular expression that matches a word if it's probably written
in British English, not American English.
@ -325,26 +323,41 @@ str_view(c("grey", "gray"), "gr(e|a)y")
### Repetition
The next step up in power involves control over how many times a pattern matches:
The next step up in power involves controlling how many times a pattern matches:
* `?`: 0 or 1
* `+`: 1 or more
* `*`: 0 or more
```{r}
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")
str_view(x, "CC+")
str_view(x, 'C[LX]+')
```
Note that the precedence of these operators is high, so you can write: `colou?r` to match either American or British spellings. That means most uses will need parentheses, like `bana(na)+` or `ba(na){2,}`.
You can also specify the number of matches numerically:
* `{n}`: exactly n
* `{n,}`: n or more
* `{,m}`: at most m
* `{n,m}`: between n and m
```{r}
str_view(x, "C{2}")
str_view(x, "C{2,}")
str_view(x, "C{2,3}")
```
By default these matches are "greedy": they will match the longest string possible. You can make them "lazy", matching the shortest string possible by putting a `?` after them. This is an advanced feature of regular expressions, but it's useful to know that it exists:
```{r}
str_view(x, 'C{2,3}?')
str_view(x, 'C[LX]+?')
```
Note that the precedence of these operators is high: each one applies only to the character or group immediately before it, so you can write `colou?r` to match either American or British spellings. That means most uses will need parentheses, like `bana(na)+` or `ba(na){2,}`.
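A small sketch of how precedence plays out in practice. The input strings are made up for illustration, and `str_extract()` is used here only to show which part of the string each pattern consumes:

```{r}
library(stringr)

# Hypothetical inputs, for illustration only
spellings <- c("color", "colour", "colouur")

# ? applies only to the preceding "u", so both spellings match
str_detect(spellings, "^colou?r$")

# Without parentheses, + applies only to the final "a"
str_extract("bananana", "bana+")

# With parentheses, + applies to the whole group "na"
str_extract("bananana", "bana(na)+")
```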
#### Exercises
1. Describe in words what these regular expressions match:
@ -358,40 +371,43 @@ Note that the precedence of these operators is high, so you can write: `colou?r`
1. Create regular expressions to find all words that:
1. Have three or more vowels in a row.
1. Start with three consonants.
1. Have three or more vowels in a row.
1. Have two or more vowel-consonant pairs in a row.
1. Solve the beginner regexp crosswords:
<https://regexcrossword.com/challenges/beginner/puzzles/1>
### Grouping and backreferences
You learned about parentheses earlier as a way to disambiguate complex expression. They do one other special thing: they also define numeric groups that you can refer to with _backreferences_, `\1`, `\2` etc. For example, the following regular expression finds all fruits that have a pair of letters that's repeated.
Earlier, you learned about parentheses as a way to disambiguate complex expressions. They also define "groups" that you can refer to with _backreferences_, like `\1`, `\2` etc. For example, the following regular expression finds all fruits that have a pair of letters that's repeated.
```{r}
str_view(fruit, "(..)\\1", match = TRUE)
```
(You'll also see how they're useful in conjunction with `str_match()` in a few pages.)
(Shortly, you'll also see how they're useful in conjunction with `str_match()`.)
Unfortunately `()` in regexps serve two purposes: you usually use them to disambiguate precedence, but you can also use them for grouping. If you're using one set for grouping and one set for disambiguation, things can get confusing. You might want to use `(?:)` instead: it only disambiguates, and doesn't modify the grouping. `(?:)` are called non-capturing parentheses.
For example:
```{r}
str_detect(c("grey", "gray"), "gr(e|a)y")
str_detect(c("grey", "gray"), "gr(?:e|a)y")
```
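`str_detect()` returns the same result for both patterns, so the difference is easier to see with `str_match()`, which reports groups as extra columns. This is a sketch under that assumption, not part of the original example:

```{r}
library(stringr)

x <- c("grey", "gray")

# Capturing parentheses: one column for the full match, one for the group
str_match(x, "gr(e|a)y")

# Non-capturing parentheses: only the full-match column remains
str_match(x, "gr(?:e|a)y")
```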
### Exercises
#### Exercises
1. Describe, in words, what these expressions will match:
1. `"(.)(.)\\2\\1"`
1. `(..)\1`
1. `"(.).\\1.\\1"`
1. `"(.)(.)(.).*\\3\\2\\1"`
1. Construct regular expressions to match words that:
1. Start and end with the same character.
1. Contain a repeated pair of letters
(e.g. "church" contains "ch" repeated twice)
1. Contain one letter repeated in at least three places
(e.g. "eleven" contains three "e"s.)
## Tools
@ -430,10 +446,10 @@ When you have complex logical conditions (e.g. match a or b but not c unless d)
no_vowels_1 <- !str_detect(words, "[aeiou]")
# Find all words consisting only of consonants (non-vowels)
no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
all.equal(no_vowels_1, no_vowels_2)
identical(no_vowels_1, no_vowels_2)
```
The results are identical, but I think the first approach is significantly easier to understand. So if you find your regular expression is getting overly complicated, try breaking it up into smaller pieces, giving each piece a name, and then combining them with logical operations.
The results are identical, but I think the first approach is significantly easier to understand. If your regular expression gets overly complicated, try breaking it up into smaller pieces, giving each piece a name, and then combining the pieces with logical operations.
A common use of `str_detect()` is to select the elements that match a pattern. You can do this with logical subsetting, or the convenient `str_subset()` wrapper:
@ -478,7 +494,7 @@ Note the use of `str_view_all()`. As you'll shortly learn, many stringr function
### Extract matches
To extract the actual text of a match, use `str_extract()`. To show that off, we're going to need a more complicated example. I'm going to use the [Harvard sentences](https://en.wikipedia.org/wiki/Harvard_sentences), which were designed to test VOIP systems, but are also useful for practicing regexes.
To extract the actual text of a match, use `str_extract()`. To show that off, we're going to need a more complicated example. I'm going to use the [Harvard sentences](https://en.wikipedia.org/wiki/Harvard_sentences), which were designed to test VOIP systems, but are also useful for practicing regexes. These are provided in `stringr::sentences`:
```{r}
length(sentences)
@ -510,16 +526,19 @@ str_view_all(more, colour_match)
str_extract(more, colour_match)
```
This is a common pattern for stringr functions, because working with a single match allows you to use much simpler data structures. To get all matches, use `str_extract_all()`. It returns either a list or a matrix, based on the value of the `simplify` argument:
This is a common pattern for stringr functions, because working with a single match allows you to use much simpler data structures. To get all matches, use `str_extract_all()`. It returns a list:
```{r}
str_extract_all(more, colour_match)
str_extract_all(more, colour_match, simplify = TRUE)
```
You'll learn more about working with lists in Chapter XYZ. If you use `simplify = TRUE`, note that short matches are expanded to the same length as the longest:
You'll learn more about lists in [lists](#lists) and [handling hierarchy].
If you use `simplify = TRUE`, `str_extract_all()` will return a matrix with short matches expanded to the same length as the longest:
```{r}
str_extract_all(more, colour_match, simplify = TRUE)
x <- c("a", "a b", "a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)
```
@ -527,7 +546,7 @@ str_extract_all(x, "[a-z]", simplify = TRUE)
#### Exercises
1. In the previous example, you might have noticed that the regular
expression matched "fickered", which is not a colour. Modify the
expression matched "flickered", which is not a colour. Modify the
regex to fix the problem.
1. From the Harvard sentences data, extract:
@ -538,44 +557,40 @@ str_extract_all(x, "[a-z]", simplify = TRUE)
### Grouped matches
Earlier in this chapter we talked about the use of parentheses for clarifying precedence and for backreferences when matching. You can also use parentheses to extract parts of a complex match. For example, imagine we want to extract nouns from the sentences. As a heuristic, we'll look for any word that comes after "a" or "the". Defining a "word" in a regular expression is a little tricky. Here I use a sequence of at least one character that isn't a space.
Earlier in this chapter we talked about the use of parentheses for clarifying precedence and for backreferences when matching. You can also use parentheses to extract parts of a complex match. For example, imagine we want to extract nouns from the sentences. As a heuristic, we'll look for any word that comes after "a" or "the". Defining a "word" in a regular expression is a little tricky, so here I use a sequence of at least one character that isn't a space.
```{r}
noun <- "(a|the) ([^ ]+)"
has_noun <- sentences %>%
str_subset(noun) %>%
head(10)
str_extract(has_noun, noun)
has_noun %>%
str_extract(noun)
```
`str_extract()` gives us the complete match; `str_match()` gives each individual component. Instead of a character vector, it returns a matrix, with one column for the complete match followed by one column for each group:
```{r}
str_match(has_noun, noun)
has_noun %>%
str_match(noun)
```
(Unsurprisingly, our heuristic for detecting nouns is poor, and also picks up adjectives like smooth and parked.)
```{r}
num <- str_c("one", "two", "three", "four", "five", "six",
"seven", "eight", "nine", "ten", sep = "|")
match <- str_interp("(${num}) ([^ ]+s)\\b")
sentences %>%
str_subset(match) %>%
head(10) %>%
str_match(match)
```
Like `str_extract()`, if you want all matches for each string, you'll need `str_match_all()`.
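As a rough illustration of the difference, the chunk below applies the same noun heuristic to a single made-up sentence (the string `x` is an assumption, not one of the Harvard sentences):

```{r}
library(stringr)

# Hypothetical sentence, for illustration only
x <- "the cat sat on a mat near the door"
noun <- "(a|the) ([^ ]+)"

# str_match() returns only the first match, as a one-row matrix
str_match(x, noun)

# str_match_all() returns a list with one matrix per input string,
# and one row per match
str_match_all(x, noun)[[1]]
```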
#### Exercises
1. Find all words that come after a "number" like "one", "two", "three" etc.
Pull out both the number and the word.
1. Find all contractions. Separate out the pieces before and after the
apostrophe.
### Replacing matches
`str_replace()` and `str_replace_all()` allow you to replace matches with new strings:
`str_replace()` and `str_replace_all()` allow you to replace matches with new strings. The simplest use is to replace a pattern with a fixed string:
```{r}
x <- c("apple", "pear", "banana")
@ -583,27 +598,28 @@ str_replace(x, "[aeiou]", "-")
str_replace_all(x, "[aeiou]", "-")
```
With `str_replace_all()` you can also perform multiple replacements by supplying a named vector:
With `str_replace_all()` you can perform multiple replacements by supplying a named vector:
```{r}
x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
```
You can refer to groups with backreferences:
Instead of replacing with a fixed string you can use backreferences to insert components of the match. For example, the following code flips the order of the second and third words.
```{r}
sentences %>%
head(5) %>%
str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2")
str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
head(5)
```
<!-- Replacing with a function call (hopefully) -->
#### Exercises
1. Replace all `/`s in a string with `\`s.
1. Switch the first and last letters in `words`. Which of those strings
are still words?
### Splitting
Use `str_split()` to split a string up into pieces. For example, we could split sentences into words:
@ -654,11 +670,12 @@ str_split(x, boundary("word"))[[1]]
1. Why is it better to split up by `boundary("word")` than `" "`?
1. What does splitting with an empty string (`""`) do?
1. What does splitting with an empty string (`""`) do? Experiment, and
then read the documentation.
### Find matches
`str_locate()`, `str_locate_all()` gives you the starting and ending positions of each match. These are particularly useful when none of the other functions does exactly what you want. You can use `str_locate()` to find the matching pattern, `str_sub()` to extract and/or modify them.
`str_locate()` and `str_locate_all()` give you the starting and ending positions of each match. These are particularly useful when none of the other functions does exactly what you want. You can use `str_locate()` to find the matching pattern, and `str_sub()` to extract and/or modify it.
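A minimal sketch of that locate-then-subset workflow, using an invented vector in which every element contains a match:

```{r}
library(stringr)

# Hypothetical input, for illustration only
x <- c("banana", "pecan")

# str_locate() returns a matrix of start and end positions for the first match
loc <- str_locate(x, "an")
loc

# Feed the positions to str_sub() to extract the matched text
str_sub(x, loc[, "start"], loc[, "end"])

# str_sub() also has an assignment form, which modifies the matched text
str_sub(x, loc[, "start"], loc[, "end"]) <- "AN"
x
```

Elements without a match produce `NA` positions, so real code would probably need to handle those before subsetting.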
## Other types of pattern
@ -703,21 +720,21 @@ There are three other functions you can use instead of `regex()`:
* `fixed()`: matches exactly the specified sequence of bytes. It ignores
all special regular expressions and operates at a very low level.
This allows you to avoid complex escaping and can be much faster than
regular expressions:
regular expressions. The following microbenchmark shows that it's about
3x faster for a simple example.
```{r}
microbenchmark::microbenchmark(
fixed = str_detect(sentences, fixed("the")),
regex = str_detect(sentences, "the")
regex = str_detect(sentences, "the"),
times = 20
)
```
Here the fixed match is almost 3x times faster than the regular
expression match. However, if you're working with non-English data
`fixed()` can lead to unreliable matches because there are often
multiple ways of representing the same character. For example, there
are two ways to define "á": either as a single character or as an "a"
plus an accent:
Beware using `fixed()` with non-English data. It is problematic because
there are often multiple ways of representing the same character. For
example, there are two ways to define "á": either as a single character or
as an "a" plus an accent:
```{r}
a1 <- "\u00e1"
@ -728,7 +745,7 @@ There are three other functions you can use instead of `regex()`:
They render identically, but because they're defined differently,
`fixed()` doesn't find a match. Instead, you can use `coll()`, defined
next to respect human character comparison rules:
next, to respect human character comparison rules:
```{r}
str_detect(a1, fixed(a2))
@ -807,42 +824,19 @@ There are a few other functions in base R that accept regular expressions:
that you have to use a regular expression to filter them all, you
need to think about what you're doing! (And probably use a list instead).
## stringi
## Advanced topics
stringr is built on top of the __stringi__ package. stringr is useful when you're learning because it exposes a minimal set of functions, which have been carefully picked to handle the most common string manipulation tasks. stringi, on the other hand, is designed to be comprehensive. It contains almost every function you might ever need: stringi has `r length(getNamespaceExports("stringi"))` functions to stringr's `r length(getNamespaceExports("stringr"))`.
If you find yourself struggling to do something in stringr, it's worth taking a look at stringi. The packages work very similarly, so you should be able to translate your stringr knowledge in a natural way. The main difference is the prefix: `str_` vs. `stri_`.
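To give a feel for how the two packages line up, here is a small sketch comparing two stringr functions with what I understand to be their stringi counterparts. The input vector is made up for illustration:

```{r}
library(stringr)

# Hypothetical input, for illustration only
x <- c("apple", "banana", NA)

# The stringr function and its stringi counterpart give the same answer
str_length(x)
stringi::stri_length(x)

# Likewise for pattern detection; note the more explicit stringi name
str_detect(x, "an")
stringi::stri_detect_regex(x, "an")
```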
### The stringi package
### Exercises
stringr is built on top of the __stringi__ package. stringr is useful when you're learning because it exposes a minimal set of functions, that have been carefully picked to handle the most common string manipulation functions. stringi on the other hand is designed to be comprehensive. It contains almost every function you might ever need. stringi has `r length(getNamespaceExports("stringi"))` functions to stringr's `r length(getNamespaceExports("stringr"))`.
1. Find the stringi functions that:
So if you find yourself struggling to do something that doesn't seem natural in stringr, it's worth taking a look at stringi. The use of the two packages is very similar because stringr was designed to mimic stringi's interface. The main difference is the prefix: `str_` vs. `stri_`.
### Encoding
Complicated and fraught with difficulty. Best approach is to convert to UTF-8 as soon as possible. All stringr and stringi functions do this. Readr always reads as UTF-8.
* UTF-8
* Latin1
* bytes: everything else
Generally, you should fix encoding problems during the data import phase.
Detect encoding operates statistically, by comparing frequency of byte fragments across languages and encodings. It's fundamentally heuristic and works better with larger amounts of text (i.e. a whole file, not a single string from that file).
```{r}
x <- "\xc9migr\xe9 cause c\xe9l\xe8bre d\xe9j\xe0 vu."
x
str_conv(x, "ISO-8859-1")
as.data.frame(stringi::stri_enc_detect(x))
str_conv(x, "ISO-8859-2")
```
### UTF-8
<http://wiki.secondlife.com/wiki/Unicode_In_5_Minutes>
<http://www.joelonsoftware.com/articles/Unicode.html>
Homoglyph attack, https://github.com/reinderien/mimic.
1. Count the number of words.
1. Find duplicated strings.
1. Generate random text.
1. How do you control the language that `stri_sort()` uses for
sorting?

@ -390,7 +390,7 @@ There is an important variation of `[` called `[[`. `[[` only ever extracts a si
than the length of the vector? What happens when you subset with a
name that doesn't exist?
## Recursive vectors (lists)
## Recursive vectors (lists) {#lists}
Lists are a step up in complexity from atomic vectors, because lists can contain other lists. This makes them suitable for representing hierarchical or tree-like structures. You create a list with `list()`: