More work on strings

This commit is contained in:
hadley 2015-10-29 10:13:19 -05:00
parent 96af27e155
commit 23908731a6
1 changed files with 137 additions and 45 deletions

@ -194,7 +194,7 @@ x <- c("apple", "banana", "pear")
str_view(x, "an")
```
The next step up in complexity is `.`, which matches any character:
The next step up in complexity is `.`, which matches any character (except a new line):
```{r}
str_view(x, ".a.")
@ -224,49 +224,73 @@ str_view(x, "\\\\")
In this book, I'll write a regular expression like `\.` and the string that represents the regular expression as `"\\."`.
### Exercises
#### Exercises
* Explain why each of these strings doesn't match a `\`: `"\"`, `"\\"`, `"\\\"`.
1. Explain why each of these strings doesn't match a `\`: `"\"`, `"\\"`, `"\\\"`.
* Solve this crossword puzzle clue: `a??le`
1. How would you match the sequence `"'\`?
1. What patterns will the regular expression `\..\..\..` match?
How would you represent it as a string?
### Anchors
Regular expressions can also match things that are not characters. The most important non-character matches are:
By default, regular expressions will match any part of a string. It's often useful to _anchor_ the regular expression so that it matches from the start or end of the string. You can use:
* `^`: the start of the line.
* `$`: the end of the line.
* `^` to match the start of the string.
* `$` to match the end of the string.
```{r}
x <- c("apple", "banana", "pear")
str_view(x, "^a")
str_view(x, "a$")
```
To remember which is which, try this mnemonic which I learned from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): if you begin with power (`^`), you end up with money (`$`).
To force a regular expression to only match a complete string, anchor it with both `^` and `$`:
```{r}
str_view(c("abcdef", "bcd"), "^bcd$")
x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")
str_view(x, "^apple$")
```
My favourite mnemonic for remembering which is which (from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953)): begin with power (`^`), end with money (`$`).
You can also match the boundary between words with `\b`. I don't often use this in R, but I sometimes use it when doing a find all in RStudio to find the name of a function that's a component of other function names. For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on.
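As a rough illustration of the word-boundary idea, the sketch below checks a few function-like names against `\bsum\b`. The vector `funs` is made up for this example, and the pattern is written as the string `"\\bsum\\b"` because the backslashes must be escaped in R.

```r
library(stringr)

# Hypothetical names to search through (not from the original text)
funs <- c("sum", "summarise", "summary", "rowsum", "sum(x)")

# \b requires a word boundary on both sides of "sum"
is_sum <- str_detect(funs, "\\bsum\\b")
is_sum
# Only "sum" and "sum(x)" match; the others contain "sum" inside a longer word
```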
Practice these by finding all common words:
#### Exercises
* Start with y.
* End in x.
* That are exactly 4 letters long. Without using `str_length()`
1. How would you match the literal string `"$^$"`?
1. Given this corpus of common words:
```{r}
common <- rcorpora::corpora("words/common")$commonWords
```
Create regular expressions that find all words that:
1. Start with "y".
1. End with "x"
1. Are exactly three letters long. (Don't cheat by using `str_length()`!)
1. Have seven letters or more.
Since this list is long, you might want to use the `match` argument to
`str_view()` to show only the matching or non-matching words.
### Character classes and alternatives
As well as `.` there are a number of other special patterns that match more than one character:
There are a number of other special patterns that match more than one character:
* `\d`: any digit
* `\s`: any whitespace (space, tab, newline)
* `[abc]`: match a, b, or c
* `[a-e]`: match any character between a and e
* `[^abc]`: match anything except a, b, or c
* `.`: any character apart from a new line.
* `\d`: any digit.
* `\s`: any whitespace (space, tab, newline).
* `[abc]`: match a, b, or c.
* `[^abc]`: match anything except a, b, or c.
Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`.
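To make the escaping rule concrete, here is a small sketch. The vector `x` is invented for illustration. `writeLines()` shows what the regular expression looks like once R has processed the string escapes.

```r
library(stringr)

# Hypothetical inputs: one with a digit, one with a space, one with a tab
x <- c("room101", "no digits", "tab\there")

has_digit <- str_detect(x, "\\d")  # the string "\\d" is the regexp \d
has_space <- str_detect(x, "\\s")  # the string "\\s" is the regexp \s

# Print the regexp as the regex engine sees it
writeLines("\\d")
```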
A similar idea is alternation: `x|y` matches either x or y. Note that the precedence for `|` is low, so that `abc|xyz` matches either `abc` or `xyz` not `abcyz` or `abxyz`:
You can use _alternation_ to pick between one or more alternative patterns. For example, `abc|d..f` will match either `"abc"` or a string like `"deaf"`. Note that the precedence for `|` is low, so that `abc|xyz` matches either `abc` or `xyz`, not `abcyz` or `abxyz`:
```{r}
str_view(c("abc", "xyz"), "abc|xyz")
@ -278,15 +302,29 @@ Like with mathematical expression, if precedence ever gets confusing, use parent
str_view(c("grey", "gray"), "gr(e|a)y")
```
Practice these by finding:
#### Exercises
* Start with a vowel.
* That only contain consonants.
* That don't contain any vowels.
1. Create regular expressions that find all words that:
1. Start with a vowel.
1. That only contain consonants. (Hint: think about matching
"not"-vowels.)
1. End with `ed`, but not with `eed`.
1. End with `ing` or `ise`.
1. Write a regular expression that matches a word if it's probably written
in British English, not American English.
1. Create a regular expression that will match telephone numbers as commonly
written in your country.
### Repetition
The next step up in power involves controlling how many times a pattern matches:
* `?`: 0 or 1
* `+`: 1 or more
* `*`: 0 or more
@ -297,19 +335,36 @@ Practice these by finding:
(By default these matches are "greedy": they will match the longest string possible. You can make them "lazy", matching the shortest string possible by putting a `?` after them.)
Note that the precedence of these operators is high, so you write: `colou?r`. That means you'll need to use parentheses for many uses: `bana(na)+` or `ba(na){2,}`.
Note that the precedence of these operators is high, so you can write `colou?r` to match either American or British spellings. That means most uses will need parentheses, like `bana(na)+` or `ba(na){2,}`.
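The following sketch illustrates the precedence and greediness points above. The input strings are made up for the example, and the anchors `^` and `$` are added only so that each pattern has to match the whole string.

```r
library(stringr)

# ? applies only to the preceding "u", not to "colou"
colour_ok <- str_detect(c("color", "colour", "colouur"), "^colou?r$")

# Parentheses make + apply to the whole "na" group
banana_ok <- str_detect(c("bana", "banana", "bananana"), "^bana(na)+$")

# Greedy (default) versus lazy (extra ?) matching
greedy <- str_extract("<a><b>", "<.+>")   # longest possible match
lazy   <- str_extract("<a><b>", "<.+?>")  # shortest possible match
```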
Practice these by finding all common words:
#### Exercises
* That contain three or more vowels in a row.
1. Describe in words what these regular expressions match:
(Read carefully to see if I'm using a regular expression or a string
that defines a regular expression.)
1. `^.*$`
1. `"\\{.+\\}"`
1. `\d{4}-\d{2}-\d{2}`
1. `"\\\\{4}"`
1. Create regular expressions to find all words that:
1. Have three or more vowels in a row.
1. Start with three consonants
1. Have two or more vowel-consonant pairs in a row.
### Grouping and backreferences
You learned about parentheses earlier as a way to disambiguate complex expressions. They do one other special thing: they also define numbered groups that you can refer to with _backreferences_, `\1`, `\2` etc. For example, the following regular expression finds all fruits that have a pair of letters that's repeated.
```{r}
fruit <- rcorpora::corpora("foods/fruits")$fruits
str_subset(fruit, "(..)\\1")
str_view(fruit, "(..)\\1", match = TRUE)
```
(You'll also see how they're useful in conjunction with `str_match()` in a few pages.)
Unfortunately `()` in regexps serve two purposes: you usually use them to disambiguate precedence, but you can also use them for grouping. If you're using one set for grouping and one set for disambiguation, things can get confusing. You might want to use `(?:)` instead: it only disambiguates, and doesn't modify the grouping. These are called non-capturing parentheses.
For example:
@ -319,44 +374,81 @@ str_detect(c("grey", "gray"), "gr(e|a)y")
str_detect(c("grey", "gray"), "gr(?:e|a)y")
```
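`str_detect()` gives the same answer for both patterns, so the difference between capturing and non-capturing parentheses is easier to see with `str_match()`, which returns one column for the full match plus one column per capturing group. This is only a sketch using the same two words as above.

```r
library(stringr)

words <- c("grey", "gray")

# Capturing parentheses: full match plus one group column
capturing <- str_match(words, "gr(e|a)y")

# Non-capturing parentheses: full match only
non_capturing <- str_match(words, "gr(?:e|a)y")

ncol(capturing)
ncol(non_capturing)
```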
Describe in words what these expressions will match:
### Exercises
* `str_subset(common, "(.)(.)\\2\\1")`
1. Describe, in words, what these expressions will match:
1. `"(.)(.)\\2\\1"`
1. `(..)\1`
1. `"(.)(.)(.).*\\3\\2\\1"`
1. Construct regular expressions to match words that:
1. Start and end with the same character.
## Tools
The stringr package contains functions for working with strings and patterns. We'll focus on four main categories:
Now that you've learned the basics of regular expressions, it's time to learn how to apply them to real problems. In this section you'll learn a wide array of stringr functions that let you:
* What matches the pattern?
* Does a string match a pattern?
* How can you replace a pattern with text?
* How can you split a string into pieces?
* Determine which elements match a pattern.
* Find the positions of matches.
* Extract the content of matches.
* Replace matches with new values.
* Split a string into pieces based on a match.
### Detecting matches
### Detect matches
`str_detect()`, `str_subset()`, `str_count()`
To determine if a character vector matches a pattern, use `str_detect()`. It returns a logical vector:
### Extracting matches
```{r}
# How many common words start with t?
sum(str_detect(common, "^t"))
```
When you have complicated logical conditions (e.g. match this or that but not these), combining multiple `str_detect()` calls with logical operators is often easier than writing a single regular expression. A simple example is finding all words that don't contain any vowels:
```{r}
# Find all words containing at least one vowel, and negate
no_vowels_1 <- !str_detect(common, "[aeiou]")
# Find all words consisting only of consonants (non-vowels)
no_vowels_2 <- str_detect(common, "^[^aeiou]+$")
all.equal(no_vowels_1, no_vowels_2)
```
If you find your regular expression is getting hard to understand, try breaking it up into smaller pieces, giving each piece a name, and then combining them with logical operations.
`str_count()` is similar to `str_detect()` but it returns an integer count of the number of matches, instead of a true/false:
```{r}
# What's the average number of vowels per word?
mean(str_count(common, "[aeiou]"))
```
`str_subset()` is a wrapper for the common pattern `x[str_detect(x, pattern)]`.
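A quick way to convince yourself of this equivalence is to compare the two forms directly. The vector `x` below is a made-up example, not the `common` corpus.

```r
library(stringr)

x <- c("apple", "banana", "pear", "kiwi")

via_subset <- str_subset(x, "an")
via_detect <- x[str_detect(x, "an")]

identical(via_subset, via_detect)
```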
### Find matches
`str_locate()` and `str_locate_all()` give you the starting and ending positions of each match. These are particularly useful when none of the other functions does exactly what you want. You can use `str_locate()` to find the matching pattern, and `str_sub()` to extract and/or modify it.
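As a minimal sketch of that workflow, the example below locates the month part of a date string, extracts it, and then overwrites it. The input string and the pattern are illustrative choices only.

```r
library(stringr)

x <- "2015-10-29"

# str_locate() returns a matrix with "start" and "end" columns
loc <- str_locate(x, "-\\d{2}-")

# Extract the located piece with str_sub()
piece <- str_sub(x, loc[1, "start"], loc[1, "end"])

# Modify the located piece in a copy of the string
y <- x
str_sub(y, loc[1, "start"], loc[1, "end"]) <- "/10/"
```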
### Extract matches
`str_extract()`, `str_extract_all()`
### Extracting grouped matches
`str_match()`, `str_match_all()`
Note that matches are always non-overlapping. The second match starts after the first is complete.
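To see the non-overlapping behaviour, consider a string where the pattern could in principle be found at overlapping positions. The string `"abababa"` is a made-up example: `"aba"` occurs starting at positions 1, 3 and 5, but only the matches at 1 and 5 are reported.

```r
library(stringr)

n_matches <- str_count("abababa", "aba")
matches <- str_extract_all("abababa", "aba")[[1]]

n_matches
matches
```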
### Replacing patterns
### Replacing matches
`str_replace()`, `str_replace_all()`
Backreferences.
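A brief sketch of how these functions behave, with inputs invented for illustration: `str_replace()` changes only the first match in each string, `str_replace_all()` changes every match, and the replacement string can refer to groups with `\\1`, `\\2`, and so on.

```r
library(stringr)

first_only <- str_replace("banana", "a", "-")
every_one  <- str_replace_all("banana", "a", "-")

# Swap two words using backreferences to the captured groups
swapped <- str_replace("apple pie", "(\\w+) (\\w+)", "\\2 \\1")
```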
### Splitting
`str_split()`, `str_split_fixed()`.
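As a small illustration with a made-up input, `str_split()` returns a list with one character vector per input string, while `str_split_fixed()` returns a character matrix with exactly `n` columns.

```r
library(stringr)

pieces <- str_split("a,b,c", ",")

# With n = 2, everything after the first separator stays in the last column
fixed <- str_split_fixed("a,b,c", ",", n = 2)
```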
### Finding locations
`str_locate()` and `str_locate_all()` give you the starting and ending positions of each match. These are particularly useful when none of the other functions does exactly what you want. You can use `str_locate()` to find the matching pattern, and `str_sub()` to extract and/or modify it.
`boundary()`
### Exercises