Separate regexps into own chapter
parent
5f45c33adb
commit
18253a1d52
|
@ -25,6 +25,7 @@ rmd_files: [
|
|||
"vector-tools.Rmd",
|
||||
"missing-values.Rmd",
|
||||
"strings.Rmd",
|
||||
"regexps.Rmd",
|
||||
"factors.Rmd",
|
||||
"datetimes.Rmd",
|
||||
"column-wise.Rmd",
|
||||
|
|
|
@ -0,0 +1,382 @@
|
|||
# Regular expressions
|
||||
|
||||
## Matching patterns with regular expressions
|
||||
|
||||
Regexps are a very terse language that allows you to describe patterns in strings.
|
||||
They take a little while to get your head around, but once you understand them, you'll find them extremely useful.
|
||||
|
||||
To learn regular expressions, we'll use `str_view()` and `str_view_all()`.
|
||||
These functions take a character vector and a regular expression, and show you how they match.
|
||||
We'll start with very simple regular expressions and then gradually get more and more complicated.
|
||||
Once you've mastered pattern matching, you'll learn how to apply those ideas with various stringr functions.
|
||||
|
||||
### Prerequisites
|
||||
|
||||
This chapter will focus on the **stringr** package for string manipulation, which is part of the core tidyverse.
|
||||
|
||||
```{r setup, message = FALSE}
|
||||
library(tidyverse)
|
||||
```
|
||||
|
||||
## Basic matches
|
||||
|
||||
The simplest patterns match exact strings:
|
||||
|
||||
```{r}
|
||||
x <- c("apple", "banana", "pear")
|
||||
str_view(x, "an")
|
||||
```
|
||||
|
||||
The next step up in complexity is `.`, which matches any character (except a newline):
|
||||
|
||||
```{r}
|
||||
str_view(x, ".a.")
|
||||
```
|
||||
|
||||
But if "`.`" matches any character, how do you match the character "`.`"?
|
||||
You need to use an "escape" to tell the regular expression you want to match it exactly, not use its special behaviour.
|
||||
Like strings, regexps use the backslash, `\`, to escape special behaviour.
|
||||
So to match a `.`, you need the regexp `\.`.
|
||||
Unfortunately this creates a problem.
|
||||
We use strings to represent regular expressions, and `\` is also used as an escape symbol in strings.
|
||||
So to create the regular expression `\.` we need the string `"\\."`.
|
||||
|
||||
```{r}
|
||||
# To create the regular expression, we need \\
|
||||
dot <- "\\."
|
||||
|
||||
# But the expression itself only contains one:
|
||||
writeLines(dot)
|
||||
|
||||
# And this tells R to look for an explicit .
|
||||
str_view(c("abc", "a.c", "bef"), "a\\.c")
|
||||
```
|
||||
|
||||
If `\` is used as an escape character in regular expressions, how do you match a literal `\`?
|
||||
Well, you need to escape it, creating the regular expression `\\`.
|
||||
To create that regular expression, you need to use a string, which also needs to escape `\`.
|
||||
That means to match a literal `\` you need to write `"\\\\"` --- you need four backslashes to match one!
|
||||
|
||||
```{r}
|
||||
x <- "a\\b"
|
||||
writeLines(x)
|
||||
|
||||
str_view(x, "\\\\")
|
||||
```
|
||||
|
||||
In this book, I'll write regular expressions as `\.` and the strings that represent them as `"\\."`.
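
The two layers of escaping can be checked directly. This is a minimal sketch rather than part of the original text, and it assumes R 4.0.0 or later for the raw-string syntax `r"(...)"`, which lets you type a regular expression without doubling the backslashes.

```{r}
library(stringr)

# The string "\\." holds two characters: a backslash and a dot
nchar("\\.")
#> [1] 2

# A raw string (R >= 4.0.0) spells the same regexp without doubling
identical(r"(\.)", "\\.")
#> [1] TRUE

# Matching one literal backslash: four backslashes in a normal string,
# two in a raw string
str_detect("a\\b", "\\\\")
#> [1] TRUE
str_detect("a\\b", r"(\\)")
#> [1] TRUE
```

Raw strings only remove the string-level escaping; the regular expression still needs its own backslash.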
|
||||
|
||||
### Exercises
|
||||
|
||||
1. Explain why each of these strings doesn't match a `\`: `"\"`, `"\\"`, `"\\\"`.
|
||||
|
||||
2. How would you match the sequence `"'\`?
|
||||
|
||||
3. What patterns will the regular expression `\..\..\..` match?
|
||||
How would you represent it as a string?
|
||||
|
||||
## Anchors
|
||||
|
||||
By default, regular expressions will match any part of a string.
|
||||
It's often useful to *anchor* the regular expression so that it matches from the start or end of the string.
|
||||
You can use:
|
||||
|
||||
- `^` to match the start of the string.
|
||||
- `$` to match the end of the string.
|
||||
|
||||
```{r}
|
||||
x <- c("apple", "banana", "pear")
|
||||
str_view(x, "^a")
|
||||
str_view(x, "a$")
|
||||
```
|
||||
|
||||
To remember which is which, try this mnemonic which I learned from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): if you begin with power (`^`), you end up with money (`$`).
|
||||
|
||||
To force a regular expression to only match a complete string, anchor it with both `^` and `$`:
|
||||
|
||||
```{r}
|
||||
x <- c("apple pie", "apple", "apple cake")
|
||||
str_view(x, "apple")
|
||||
str_view(x, "^apple$")
|
||||
```
|
||||
|
||||
You can also match the boundary between words with `\b`.
|
||||
I don't often use this in R, but I sometimes use it when searching in RStudio for the name of a function that's a component of other functions.
|
||||
For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on.
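
The same idea works inside R. The sketch below uses `str_detect()` on a made-up vector of names (`code` is purely illustrative) to show what the `\b` anchors change; remember that the regexp `\bsum\b` is written as the string `"\\bsum\\b"`.

```{r}
library(stringr)

# Hypothetical names to search through
code <- c("sum", "sum(x)", "summarise", "rowsum")

# Without boundaries, "sum" matches anywhere inside a name
str_detect(code, "sum")
#> [1] TRUE TRUE TRUE TRUE

# With \b on both sides, only the standalone name matches
str_detect(code, "\\bsum\\b")
#> [1]  TRUE  TRUE FALSE FALSE
```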
|
||||
|
||||
### Exercises
|
||||
|
||||
1. How would you match the literal string `"$^$"`?
|
||||
|
||||
2. Given the corpus of common words in `stringr::words`, create regular expressions that find all words that:
|
||||
|
||||
a. Start with "y".
|
||||
b. End with "x".
|
||||
c. Are exactly three letters long. (Don't cheat by using `str_length()`!)
|
||||
d. Have seven letters or more.
|
||||
|
||||
Since this list is long, you might want to use the `match` argument to `str_view()` to show only the matching or non-matching words.
|
||||
|
||||
## Character classes and alternatives
|
||||
|
||||
There are a number of special patterns that match more than one character.
|
||||
You've already seen `.`, which matches any character apart from a newline.
|
||||
There are four other useful tools:
|
||||
|
||||
- `\d`: matches any digit.
|
||||
- `\s`: matches any whitespace (e.g. space, tab, newline).
|
||||
- `[abc]`: matches a, b, or c.
|
||||
- `[^abc]`: matches anything except a, b, or c.
|
||||
|
||||
Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`.
|
||||
|
||||
A character class containing a single character is a nice alternative to backslash escapes when you want to include a single metacharacter in a regex.
|
||||
Many people find this more readable.
|
||||
|
||||
```{r}
|
||||
# Look for a literal character that normally has special meaning in a regex
|
||||
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
|
||||
str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
|
||||
str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]")
|
||||
```
|
||||
|
||||
This works for most (but not all) regex metacharacters: `$` `.` `|` `?` `*` `+` `(` `)` `[` `{`.
|
||||
Unfortunately, a few characters have special meaning even inside a character class and must be handled with backslash escapes: `]` `\` `^` and `-`.
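
As a quick check of both rules, here is a small sketch with an illustrative vector `x`. It uses `str_detect()` so the result is a plain logical vector.

```{r}
library(stringr)

x <- c("a.c", "a^c", "a-c", "a]c", "abc")

# A metacharacter such as . is literal inside a character class
str_detect(x, "a[.]c")
#> [1]  TRUE FALSE FALSE FALSE FALSE

# ^, - and ] are escaped with a backslash inside the class
str_detect(x, "a[\\^\\-\\]]c")
#> [1] FALSE  TRUE  TRUE  TRUE FALSE
```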
|
||||
|
||||
You can use *alternation* to pick between one or more alternative patterns.
|
||||
For example, `abc|d..f` will match either `"abc"` or `"deaf"`.
|
||||
Note that the precedence for `|` is low, so `abc|xyz` matches `abc` or `xyz`, not `abcyz` or `abxyz`.
|
||||
As with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:
|
||||
|
||||
```{r}
|
||||
str_view(c("grey", "gray"), "gr(e|a)y")
|
||||
```
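
The low precedence of `|` matters most when anchors are involved. The following sketch (the vector `x` is made up for illustration) shows how the anchors attach to only one alternative each unless you add parentheses.

```{r}
library(stringr)

x <- c("abc", "xyz", "abcyz", "abxyz")

# | has low precedence: this reads as (^abc) or (xyz$)
str_detect(x, "^abc|xyz$")
#> [1] TRUE TRUE TRUE TRUE

# Parentheses limit the alternation, so both anchors apply to both options
str_detect(x, "^(abc|xyz)$")
#> [1]  TRUE  TRUE FALSE FALSE
```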
|
||||
|
||||
### Exercises
|
||||
|
||||
1. Create regular expressions to find all words that:
|
||||
|
||||
a. Start with a vowel.
|
||||
b. Contain only consonants. (Hint: think about matching "not"-vowels.)
|
||||
c. End with `ed`, but not with `eed`.
|
||||
d. End with `ing` or `ise`.
|
||||
|
||||
2. Empirically verify the rule "i before e except after c".
|
||||
|
||||
3. Is "q" always followed by a "u"?
|
||||
|
||||
4. Write a regular expression that matches a word if it's probably written in British English, not American English.
|
||||
|
||||
5. Create a regular expression that will match telephone numbers as commonly written in your country.
|
||||
|
||||
## Repetition / Quantifiers
|
||||
|
||||
The next step up in power involves controlling how many times a pattern matches:
|
||||
|
||||
- `?`: 0 or 1
|
||||
- `+`: 1 or more
|
||||
- `*`: 0 or more
|
||||
|
||||
```{r}
|
||||
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
|
||||
str_view(x, "CC?")
|
||||
str_view(x, "CC+")
|
||||
str_view(x, 'C[LX]+')
|
||||
```
|
||||
|
||||
Note that the precedence of these operators is high: each one applies only to the single character or group immediately before it, so you can write `colou?r` to match either American or British spellings.
|
||||
That means most uses that repeat more than one character will need parentheses, like `bana(na)+`.
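
To see what a quantifier binds to, compare the patterns below. This is a small sketch using `str_detect()` and `str_extract()` on made-up inputs.

```{r}
library(stringr)

# ? applies only to the "u" immediately before it
str_detect(c("color", "colour", "colr"), "^colou?r$")
#> [1]  TRUE  TRUE FALSE

# + applies only to the final "a" ...
str_extract("bananana", "bana+")
#> [1] "bana"

# ... unless parentheses group "na" into one unit
str_extract("bananana", "bana(na)+")
#> [1] "bananana"
```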
|
||||
|
||||
You can also specify the number of matches precisely:
|
||||
|
||||
- `{n}`: exactly n
|
||||
- `{n,}`: n or more
|
||||
- `{1,m}`: at least 1 and at most m
|
||||
- `{n,m}`: between n and m
|
||||
|
||||
```{r}
|
||||
str_view(x, "C{2}")
|
||||
str_view(x, "C{2,}")
|
||||
str_view(x, "C{1,3}")
|
||||
str_view(x, "C{2,3}")
|
||||
```
|
||||
|
||||
By default these matches are "greedy": they will match the longest string possible.
|
||||
You can make them "lazy", matching the shortest string possible by putting a `?` after them.
|
||||
This is an advanced feature of regular expressions, but it's useful to know that it exists:
|
||||
|
||||
```{r}
|
||||
str_view(x, 'C{2,3}?')
|
||||
str_view(x, 'C[LX]+?')
|
||||
```
|
||||
|
||||
Collectively, these operators are called **quantifiers** because they quantify how many times a match can occur.
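
If you'd rather see the matched text than a rendered view, `str_extract()` returns the first match as a string. The sketch below applies the greedy and lazy patterns from above to the numeral part of `x` only.

```{r}
library(stringr)

x <- "MDCCCLXXXVIII"

str_extract(x, "C{2,3}")   # greedy: takes as many Cs as allowed
#> [1] "CCC"
str_extract(x, "C{2,3}?")  # lazy: stops at the minimum
#> [1] "CC"
str_extract(x, "C[LX]+")   # greedy
#> [1] "CLXXX"
str_extract(x, "C[LX]+?")  # lazy
#> [1] "CL"
```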
|
||||
|
||||
### Exercises
|
||||
|
||||
1. Describe the equivalents of `?`, `+`, `*` in `{m,n}` form.
|
||||
|
||||
2. Describe in words what these regular expressions match: (read carefully to see if I'm using a regular expression or a string that defines a regular expression.)
|
||||
|
||||
a. `^.*$`
|
||||
b. `"\\{.+\\}"`
|
||||
c. `\d{4}-\d{2}-\d{2}`
|
||||
d. `"\\\\{4}"`
|
||||
|
||||
3. Create regular expressions to find all words that:
|
||||
|
||||
a. Start with three consonants.
|
||||
b. Have three or more vowels in a row.
|
||||
c. Have two or more vowel-consonant pairs in a row.
|
||||
|
||||
4. Solve the beginner regexp crosswords at <https://regexcrossword.com/challenges/beginner>.
|
||||
|
||||
## Grouping and backreferences
|
||||
|
||||
Earlier, you learned about parentheses as a way to disambiguate complex expressions.
|
||||
Parentheses also create a *numbered* capturing group (number 1, 2 etc.).
|
||||
A capturing group stores *the part of the string* matched by the part of the regular expression inside the parentheses.
|
||||
You can refer to the same text as previously matched by a capturing group with *backreferences*, like `\1`, `\2` etc.
|
||||
For example, the following regular expression finds all fruits that have a repeated pair of letters.
|
||||
|
||||
```{r}
|
||||
str_view(fruit, "(..)\\1", match = TRUE)
|
||||
```
|
||||
|
||||
(Shortly, you'll also see how they're useful in conjunction with `str_match()`.)
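
As a rough preview, here is the same pattern applied to a short made-up vector. `str_subset()` keeps the strings that match, and `str_match()` returns the full match in the first column and the contents of capturing group 1 in the second.

```{r}
library(stringr)

x <- c("banana", "coconut", "papaya", "apple")

# Keep only the strings with a repeated pair of letters
str_subset(x, "(..)\\1")
#> [1] "banana"  "coconut" "papaya"

# Column 1 is the whole match, column 2 is what group 1 captured
m <- str_match("banana", "(..)\\1")
m[1, 1]
#> [1] "anan"
m[1, 2]
#> [1] "an"
```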
|
||||
|
||||
### Exercises
|
||||
|
||||
1. Describe, in words, what these expressions will match:
|
||||
|
||||
a. `(.)\1\1`
|
||||
b. `"(.)(.)\\2\\1"`
|
||||
c. `(..)\1`
|
||||
d. `"(.).\\1.\\1"`
|
||||
e. `"(.)(.)(.).*\\3\\2\\1"`
|
||||
|
||||
2. Construct regular expressions to match words that:
|
||||
|
||||
a. Start and end with the same character.
|
||||
b. Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.)
|
||||
c. Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.)
|
||||
|
||||
## Other uses of regular expressions
|
||||
|
||||
There are two useful functions in base R that also use regular expressions:
|
||||
|
||||
- `apropos()` searches all objects available from the global environment.
|
||||
This is useful if you can't quite remember the name of the function.
|
||||
|
||||
```{r}
|
||||
apropos("replace")
|
||||
```
|
||||
|
||||
- `dir()` lists all the files in a directory.
|
||||
The `pattern` argument takes a regular expression and only returns file names that match the pattern.
|
||||
For example, you can find all the R Markdown files in the current directory with:
|
||||
|
||||
```{r}
|
||||
head(dir(pattern = "\\.Rmd$"))
|
||||
```
|
||||
|
||||
(If you're more comfortable with "globs" like `*.Rmd`, you can convert them to regular expressions with `glob2rx()`.)
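
A short sketch of that conversion, using a made-up vector of file names (`files`) instead of a real directory so the result doesn't depend on where you run it:

```{r}
# Hypothetical file names, standing in for the output of dir()
files <- c("regexps.Rmd", "strings.Rmd", "README.md", "notes.Rmd.bak")

# Convert the glob to a regular expression, then match with base R
rx <- glob2rx("*.Rmd")
grepl(rx, files)
#> [1]  TRUE  TRUE FALSE FALSE
```

The converted pattern is anchored at both ends, which is why `notes.Rmd.bak` is not matched.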
|
||||
|
||||
## A caution
|
||||
|
||||
A word of caution before we continue: because regular expressions are so powerful, it's easy to try and solve every problem with a single regular expression.
|
||||
In the words of Jamie Zawinski:
|
||||
|
||||
> Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
|
||||
|
||||
As a cautionary tale, check out this regular expression that checks if an email address is valid:
|
||||
|
||||
(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
|
||||
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
|
||||
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
|
||||
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[
|
||||
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
|
||||
31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
|
||||
](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
|
||||
(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
|
||||
(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|
||||
|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)
|
||||
?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\
|
||||
r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[
|
||||
\t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)
|
||||
?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t]
|
||||
)*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[
|
||||
\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*
|
||||
)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
|
||||
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)
|
||||
*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+
|
||||
|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r
|
||||
\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:
|
||||
\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t
|
||||
]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031
|
||||
]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](
|
||||
?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?
|
||||
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?
|
||||
:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
|
||||
:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?
|
||||
[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\]
|
||||
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|
|
||||
\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>
|
||||
@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
|
||||
(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t]
|
||||
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
|
||||
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?
|
||||
:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[
|
||||
\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-
|
||||
\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(
|
||||
?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;
|
||||
:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
|
||||
^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\"
|
||||
.\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
|
||||
]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
|
||||
[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\
|
||||
r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\]
|
||||
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
|
||||
|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0
|
||||
00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\
|
||||
.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,
|
||||
;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
|
||||
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*
|
||||
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
|
||||
\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[
|
||||
^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
|
||||
]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(
|
||||
?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
|
||||
".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(
|
||||
?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[
|
||||
\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t
|
||||
])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t
|
||||
])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?
|
||||
:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|
|
||||
\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:
|
||||
[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
|
||||
]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)
|
||||
?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["
|
||||
()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
|
||||
?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>
|
||||
@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[
|
||||
\t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,
|
||||
;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t]
|
||||
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
|
||||
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?
|
||||
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
|
||||
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:
|
||||
\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[
|
||||
"()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])
|
||||
*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
|
||||
+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
|
||||
.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|
||||
|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(
|
||||
?:\r\n)?[ \t])*))*)?;\s*)
|
||||
|
||||
This is a somewhat pathological example (because email addresses are actually surprisingly complex), but it is used in real code.
|
||||
See the Stack Overflow discussion at <http://stackoverflow.com/a/201378> for more details.
|
||||
|
||||
Don't forget that you're in a programming language and you have other tools at your disposal.
|
||||
Instead of creating one complex regular expression, it's often easier to write a series of simpler regexps.
|
||||
If you get stuck trying to create a single regexp that solves your problem, take a step back and think about whether you could break the problem down into smaller pieces, solving each challenge before moving on to the next one.
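
As an illustration of that approach, the sketch below combines three simple checks with `&` instead of writing one large pattern. It is deliberately loose and is not a real email validator; the inputs and the three rules are assumptions made up for this example.

```{r}
library(stringr)

# Hypothetical inputs
x <- c("ada@example.com", "no-at-sign.com", "two@@example.com", "a b@example.com")

# Each check is a small, readable regexp
has_one_at <- str_count(x, "@") == 1
has_domain <- str_detect(x, "@[^@]+\\.[^@]+$")
no_space   <- !str_detect(x, "\\s")

has_one_at & has_domain & no_space
#> [1]  TRUE FALSE FALSE FALSE
```

Because each piece is a separate logical vector, you can inspect the checks one at a time to see which rule rejected a given string.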
|
481
strings.Rmd
|
@ -214,259 +214,6 @@ TODO: add connection to `arrange()`
|
|||
6. Write a function that turns (e.g.) a vector `c("a", "b", "c")` into the string `a, b, and c`.
|
||||
Think carefully about what it should do if given a vector of length 0, 1, or 2.
|
||||
|
||||
## Matching patterns with regular expressions
|
||||
|
||||
Regexps are a very terse language that allows you to describe patterns in strings.
|
||||
They take a little while to get your head around, but once you understand them, you'll find them extremely useful.
|
||||
|
||||
To learn regular expressions, we'll use `str_view()` and `str_view_all()`.
|
||||
These functions take a character vector and a regular expression, and show you how they match.
|
||||
We'll start with very simple regular expressions and then gradually get more and more complicated.
|
||||
Once you've mastered pattern matching, you'll learn how to apply those ideas with various stringr functions.
|
||||
|
||||
### Basic matches
|
||||
|
||||
The simplest patterns match exact strings:
|
||||
|
||||
```{r}
|
||||
x <- c("apple", "banana", "pear")
|
||||
str_view(x, "an")
|
||||
```
|
||||
|
||||
The next step up in complexity is `.`, which matches any character (except a newline):
|
||||
|
||||
```{r}
|
||||
str_view(x, ".a.")
|
||||
```
|
||||
|
||||
But if "`.`" matches any character, how do you match the character "`.`"?
|
||||
You need to use an "escape" to tell the regular expression you want to match it exactly, not use its special behaviour.
|
||||
Like strings, regexps use the backslash, `\`, to escape special behaviour.
|
||||
So to match an `.`, you need the regexp `\.`.
|
||||
Unfortunately this creates a problem.
|
||||
We use strings to represent regular expressions, and `\` is also used as an escape symbol in strings.
|
||||
So to create the regular expression `\.` we need the string `"\\."`.
|
||||
|
||||
```{r}
|
||||
# To create the regular expression, we need \\
|
||||
dot <- "\\."
|
||||
|
||||
# But the expression itself only contains one:
|
||||
writeLines(dot)
|
||||
|
||||
# And this tells R to look for an explicit .
|
||||
str_view(c("abc", "a.c", "bef"), "a\\.c")
|
||||
```
|
||||
|
||||
If `\` is used as an escape character in regular expressions, how do you match a literal `\`?
|
||||
Well you need to escape it, creating the regular expression `\\`.
|
||||
To create that regular expression, you need to use a string, which also needs to escape `\`.
|
||||
That means to match a literal `\` you need to write `"\\\\"` --- you need four backslashes to match one!
|
||||
|
||||
```{r}
|
||||
x <- "a\\b"
|
||||
writeLines(x)
|
||||
|
||||
str_view(x, "\\\\")
|
||||
```
|
||||
|
||||
In this book, I'll write regular expressions as `\.` and the strings that represent them as `"\\."`.
|
||||
|
||||
#### Exercises
|
||||
|
||||
1. Explain why each of these strings doesn't match a `\`: `"\"`, `"\\"`, `"\\\"`.
|
||||
|
||||
2. How would you match the sequence `"'\`?
|
||||
|
||||
3. What patterns will the regular expression `\..\..\..` match?
|
||||
How would you represent it as a string?
|
||||
|
||||
### Anchors
|
||||
|
||||
By default, regular expressions will match any part of a string.
|
||||
It's often useful to *anchor* the regular expression so that it matches from the start or end of the string.
|
||||
You can use:
|
||||
|
||||
- `^` to match the start of the string.
|
||||
- `$` to match the end of the string.
|
||||
|
||||
```{r}
|
||||
x <- c("apple", "banana", "pear")
|
||||
str_view(x, "^a")
|
||||
str_view(x, "a$")
|
||||
```
|
||||
|
||||
To remember which is which, try this mnemonic which I learned from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): if you begin with power (`^`), you end up with money (`$`).
|
||||
|
||||
To force a regular expression to only match a complete string, anchor it with both `^` and `$`:
|
||||
|
||||
```{r}
|
||||
x <- c("apple pie", "apple", "apple cake")
|
||||
str_view(x, "apple")
|
||||
str_view(x, "^apple$")
|
||||
```
|
||||
|
||||
You can also match the boundary between words with `\b`.
|
||||
I don't often use this in R, but I will sometimes use it when I'm doing a search in RStudio when I want to find the name of a function that's a component of other functions.
|
||||
For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on.
|
||||
|
||||
#### Exercises
|
||||
|
||||
1. How would you match the literal string `"$^$"`?
|
||||
|
||||
2. Given the corpus of common words in `stringr::words`, create regular expressions that find all words that:
|
||||
|
||||
a. Start with "y".
|
||||
b. End with "x".
|
||||
c. Are exactly three letters long. (Don't cheat by using `str_length()`!)
|
||||
d. Have seven letters or more.
|
||||
|
||||
Since this list is long, you might want to use the `match` argument to `str_view()` to show only the matching or non-matching words.
|
||||
|
||||
### Character classes and alternatives
|
||||
|
||||
There are a number of special patterns that match more than one character.
|
||||
You've already seen `.`, which matches any character apart from a newline.
|
||||
There are four other useful tools:
|
||||
|
||||
- `\d`: matches any digit.
|
||||
- `\s`: matches any whitespace (e.g. space, tab, newline).
|
||||
- `[abc]`: matches a, b, or c.
|
||||
- `[^abc]`: matches anything except a, b, or c.
|
||||
|
||||
Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`.
|
||||
|
||||
A character class containing a single character is a nice alternative to backslash escapes when you want to include a single metacharacter in a regex.
|
||||
Many people find this more readable.
|
||||
|
||||
```{r}
|
||||
# Look for a literal character that normally has special meaning in a regex
|
||||
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
|
||||
str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
|
||||
str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]")
|
||||
```
|
||||
|
||||
This works for most (but not all) regex metacharacters: `$` `.` `|` `?` `*` `+` `(` `)` `[` `{`.
|
||||
Unfortunately, a few characters have special meaning even inside a character class and must be handled with backslash escapes: `]` `\` `^` and `-`.
|
||||
|
||||
You can use *alternation* to pick between one or more alternative patterns.
|
||||
For example, `abc|d..f` will match either `"abc"` or `"deaf"`.
|
||||
Note that the precedence for `|` is low, so `abc|xyz` matches `abc` or `xyz`, not `abcyz` or `abxyz`.
|
||||
Like with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:
|
||||
|
||||
```{r}
|
||||
str_view(c("grey", "gray"), "gr(e|a)y")
|
||||
```
|
||||
|
||||
#### Exercises
|
||||
|
||||
1. Create regular expressions to find all words that:
|
||||
|
||||
a. Start with a vowel.
|
||||
b. Contain only consonants. (Hint: think about matching "not"-vowels.)
|
||||
c. End with `ed`, but not with `eed`.
|
||||
d. End with `ing` or `ise`.
|
||||
|
||||
2. Empirically verify the rule "i before e except after c".
|
||||
|
||||
3. Is "q" always followed by a "u"?
|
||||
|
||||
4. Write a regular expression that matches a word if it's probably written in British English, not American English.
|
||||
|
||||
5. Create a regular expression that will match telephone numbers as commonly written in your country.
|
||||
|
||||
### Repetition / Quantifiers
|
||||
|
||||
The next step up in power involves controlling how many times a pattern matches:
|
||||
|
||||
- `?`: 0 or 1
|
||||
- `+`: 1 or more
|
||||
- `*`: 0 or more
|
||||
|
||||
```{r}
|
||||
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
|
||||
str_view(x, "CC?")
|
||||
str_view(x, "CC+")
|
||||
str_view(x, 'C[LX]+')
|
||||
```
|
||||
|
||||
Note that the precedence of these operators is high, so you can write: `colou?r` to match either American or British spellings.
|
||||
That means most uses will need parentheses, like `bana(na)+`.
|
||||
|
||||
You can also specify the number of matches precisely:
|
||||
|
||||
- `{n}`: exactly n
|
||||
- `{n,}`: n or more
|
||||
- `{1,m}`: at most m
|
||||
- `{n,m}`: between n and m
|
||||
|
||||
```{r}
|
||||
str_view(x, "C{2}")
|
||||
str_view(x, "C{2,}")
|
||||
str_view(x, "C{1,3}")
|
||||
str_view(x, "C{2,3}")
|
||||
```
|
||||
|
||||
By default these matches are "greedy": they will match the longest string possible.
|
||||
You can make them "lazy", matching the shortest string possible by putting a `?` after them.
|
||||
This is an advanced feature of regular expressions, but it's useful to know that it exists:
|
||||
|
||||
```{r}
|
||||
str_view(x, 'C{2,3}?')
|
||||
str_view(x, 'C[LX]+?')
|
||||
```
|
||||
|
||||
Collectively, these operators are called **quantifiers** because they quantify how many times a match can occur.
|
||||
|
||||
#### Exercises
|
||||
|
||||
1. Describe the equivalents of `?`, `+`, `*` in `{m,n}` form.
|
||||
|
||||
2. Describe in words what these regular expressions match: (read carefully to see if I'm using a regular expression or a string that defines a regular expression.)
|
||||
|
||||
a. `^.*$`
|
||||
b. `"\\{.+\\}"`
|
||||
c. `\d{4}-\d{2}-\d{2}`
|
||||
d. `"\\\\{4}"`
|
||||
|
||||
3. Create regular expressions to find all words that:
|
||||
|
||||
a. Start with three consonants.
|
||||
b. Have three or more vowels in a row.
|
||||
c. Have two or more vowel-consonant pairs in a row.
|
||||
|
||||
4. Solve the beginner regexp crosswords at <https://regexcrossword.com/challenges/beginner>.
|
||||
|
||||
### Grouping and backreferences
|
||||
|
||||
Earlier, you learned about parentheses as a way to disambiguate complex expressions.
|
||||
Parentheses also create a *numbered* capturing group (number 1, 2 etc.).
|
||||
A capturing group stores *the part of the string* matched by the part of the regular expression inside the parentheses.
|
||||
You can refer to the same text as previously matched by a capturing group with *backreferences*, like `\1`, `\2` etc.
|
||||
For example, the following regular expression finds all fruits that have a repeated pair of letters.
|
||||
|
||||
```{r}
|
||||
str_view(fruit, "(..)\\1", match = TRUE)
|
||||
```
|
||||
|
||||
(Shortly, you'll also see how they're useful in conjunction with `str_match()`.)
|
||||
|
||||
#### Exercises
|
||||
|
||||
1. Describe, in words, what these expressions will match:
|
||||
|
||||
a. `(.)\1\1`
|
||||
b. `"(.)(.)\\2\\1"`
|
||||
c. `(..)\1`
|
||||
d. `"(.).\\1.\\1"`
|
||||
e. `"(.)(.)(.).*\\3\\2\\1"`
|
||||
|
||||
2. Construct regular expressions to match words that:
|
||||
|
||||
a. Start and end with the same character.
|
||||
b. Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.)
|
||||
c. Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.)
|
||||
|
||||
## Tools
|
||||
|
||||
Now that you've learned the basics of regular expressions, it's time to learn how to apply them to real problems.
|
||||
|
@ -478,103 +225,6 @@ In this section you'll learn a wide array of stringr functions that let you:
|
|||
- Replace matches with new values.
|
||||
- Split a string based on a match.
|
||||
|
||||
A word of caution before we continue: because regular expressions are so powerful, it's easy to try and solve every problem with a single regular expression.
|
||||
In the words of Jamie Zawinski:
|
||||
|
||||
> Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
|
||||
|
||||
As a cautionary tale, check out this regular expression that checks if an email address is valid:
|
||||
|
||||
(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
|
||||
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
|
||||
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
|
||||
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[
|
||||
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
|
||||
31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
|
||||
](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
|
||||
(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
|
||||
(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|
||||
|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)
|
||||
?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\
|
||||
r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[
|
||||
\t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)
|
||||
?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t]
|
||||
)*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[
|
||||
\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*
|
||||
)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
|
||||
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)
|
||||
*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+
|
||||
|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r
|
||||
\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:
|
||||
\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t
|
||||
]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031
|
||||
]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](
|
||||
?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?
|
||||
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?
|
||||
:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
|
||||
:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?
|
||||
[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\]
|
||||
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|
|
||||
\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>
|
||||
@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
|
||||
(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t]
|
||||
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
|
||||
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?
|
||||
:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[
|
||||
\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-
|
||||
\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(
|
||||
?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;
|
||||
:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
|
||||
^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\"
|
||||
.\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
|
||||
]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
|
||||
[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\
|
||||
r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\]
|
||||
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
|
||||
|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0
|
||||
00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\
|
||||
.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,
|
||||
;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
|
||||
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*
|
||||
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
|
||||
\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[
|
||||
^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
|
||||
]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(
|
||||
?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
|
||||
".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(
|
||||
?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[
|
||||
\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t
|
||||
])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t
|
||||
])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?
|
||||
:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|
|
||||
\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:
|
||||
[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
|
||||
]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)
|
||||
?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["
|
||||
()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
|
||||
?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>
|
||||
@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[
|
||||
\t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,
|
||||
;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t]
|
||||
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
|
||||
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?
|
||||
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
|
||||
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:
|
||||
\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[
|
||||
"()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])
|
||||
*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
|
||||
+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
|
||||
.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|
||||
|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(
|
||||
?:\r\n)?[ \t])*))*)?;\s*)
|
||||
|
||||
This is a somewhat pathological example (because email addresses are actually surprisingly complex), but is used in real code.
|
||||
See the Stack Overflow discussion at <http://stackoverflow.com/a/201378> for more details.
|
||||
|
||||
Don't forget that you're in a programming language and you have other tools at your disposal.
|
||||
Instead of creating one complex regular expression, it's often easier to write a series of simpler regexps.
|
||||
If you get stuck trying to create a single regexp that solves your problem, take a step back and think about whether you could break the problem down into smaller pieces, solving each challenge before moving on to the next one.
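
One way to picture this advice, as a sketch on a made-up vector (`fruit_demo` is hypothetical): to find strings that contain both an "a" and an "e", a single regexp has to spell out both possible orders, whereas two simple regexps combined with `&` state the intent directly.

```{r}
library(stringr)

# Hypothetical example vector
fruit_demo <- c("apple", "pear", "kiwi", "banana")

# One regexp: must cover "a before e" and "e before a"
single <- str_detect(fruit_demo, "a.*e|e.*a")

# Two simple regexps combined with a logical AND
combined <- str_detect(fruit_demo, "a") & str_detect(fruit_demo, "e")

identical(single, combined)
```

Both approaches give the same answer here, but the second stays readable when a third or fourth condition is added.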
|
||||
|
||||
### Detect matches
|
||||
|
||||
To determine if a character vector matches a pattern, use `str_detect()`.
|
||||
|
@ -872,6 +522,37 @@ str_split(x, " ")[[1]]
|
|||
str_split(x, boundary("word"))[[1]]
|
||||
```
|
||||
|
||||
### Separate
|
||||
|
||||
`separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears.
|
||||
Take `table3`:
|
||||
|
||||
```{r}
|
||||
table3
|
||||
```
|
||||
|
||||
The `rate` column contains both `cases` and `population` variables, and we need to split it into two variables.
|
||||
`separate()` takes the name of the column to separate, and the names of the columns to separate into, as shown in Figure \@ref(fig:tidy-separate) and the code below.
|
||||
|
||||
```{r}
|
||||
table3 %>%
|
||||
separate(rate, into = c("cases", "population"))
|
||||
```
|
||||
|
||||
```{r tidy-separate, echo = FALSE, out.width = "75%", fig.cap = "Separating `rate` into `cases` and `population` to make `table3` tidy", fig.alt = "Two panels, one with a data frame with three columns (country, year, and rate) and the other with a data frame with four columns (country, year, cases, and population). Arrows show how the rate variable is separated into two variables: cases and population."}
|
||||
knitr::include_graphics("images/tidy-17.png")
|
||||
```
|
||||
|
||||
By default, `separate()` will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter).
|
||||
For example, in the code above, `separate()` split the values of `rate` at the forward slash characters.
|
||||
If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`.
|
||||
For example, we could rewrite the code above as:
|
||||
|
||||
```{r eval = FALSE}
|
||||
table3 %>%
|
||||
separate(rate, into = c("cases", "population"), sep = "/")
|
||||
```
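
Because a character `sep` is interpreted as a regular expression, it can describe more than one separator at once. The sketch below assumes a made-up data frame (`rates_demo` is hypothetical, not one of tidyr's tables) in which the separator is sometimes `/` and sometimes `-`.

```{r}
library(tidyr)

# Hypothetical data: the two values are separated by "/" or by "-"
rates_demo <- data.frame(
  id = 1:3,
  rate = c("745/19987071", "2666-20595360", "37737/172006362")
)

# The character class [/-] matches either separator
split_demo <- separate(rates_demo, rate, into = c("cases", "population"), sep = "[/-]")
split_demo
```

As in the earlier examples, the new columns are character vectors unless you also set `convert = TRUE`.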
|
||||
|
||||
#### Exercises
|
||||
|
||||
1. Split up a string like `"apples, pears, and bananas"` into individual components.
|
||||
|
@ -1009,27 +690,6 @@ There are three other functions you can use instead of `regex()`:
|
|||
|
||||
2. What are the five most common words in `sentences`?
|
||||
|
||||
## Other uses of regular expressions
|
||||
|
||||
There are two useful functions in base R that also use regular expressions:
|
||||
|
||||
- `apropos()` searches all objects available from the global environment.
|
||||
This is useful if you can't quite remember the name of the function.
|
||||
|
||||
```{r}
|
||||
apropos("replace")
|
||||
```
|
||||
|
||||
- `dir()` lists all the files in a directory.
|
||||
The `pattern` argument takes a regular expression and only returns file names that match the pattern.
|
||||
For example, you can find all the R Markdown files in the current directory with:
|
||||
|
||||
```{r}
|
||||
head(dir(pattern = "\\.Rmd$"))
|
||||
```
|
||||
|
||||
(If you're more comfortable with "globs" like `*.Rmd`, you can convert them to regular expressions with `glob2rx()`.)
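
A short sketch of that conversion, using a made-up vector of file names (`files_demo` is hypothetical) so the result doesn't depend on the contents of your working directory:

```{r}
# glob2rx() lives in the utils package, which is attached by default
pattern <- glob2rx("*.Rmd")
pattern

# Hypothetical file names, used only to illustrate the match
files_demo <- c("regexps.Rmd", "strings.Rmd", "notes.txt", "Rmd-draft.md")
grepl(pattern, files_demo)
```

The resulting pattern is anchored at both ends, so only names that end in `.Rmd` match.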
|
||||
|
||||
## stringi
|
||||
|
||||
stringr is built on top of the **stringi** package.
|
||||
|
@ -1051,83 +711,6 @@ The main difference is the prefix: `str_` vs. `stri_`.
|
|||
|
||||
2. How do you control the language that `stri_sort()` uses for sorting?
|
||||
|
||||
## tidyr
|
||||
|
||||
So far you've learned how to tidy `table2`, `table4a`, and `table4b`, but not `table3`.
|
||||
`table3` has a different problem: we have one column (`rate`) that contains two variables (`cases` and `population`).
|
||||
To fix this problem, we'll need the `separate()` function.
|
||||
You'll also learn about the complement of `separate()`: `unite()`, which you use if a single variable is spread across multiple columns.
|
||||
|
||||
```{r}
|
||||
library(tidyr)
|
||||
```
|
||||
|
||||
### Separate
|
||||
|
||||
`separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears.
|
||||
Take `table3`:
|
||||
|
||||
```{r}
|
||||
table3
|
||||
```
|
||||
|
||||
The `rate` column contains both `cases` and `population` variables, and we need to split it into two variables.
|
||||
`separate()` takes the name of the column to separate, and the names of the columns to separate into, as shown in Figure \@ref(fig:tidy-separate) and the code below.
|
||||
|
||||
```{r}
|
||||
table3 %>%
|
||||
separate(rate, into = c("cases", "population"))
|
||||
```
|
||||
|
||||
```{r tidy-separate, echo = FALSE, out.width = "75%", fig.cap = "Separating `rate` into `cases` and `population` to make `table3` tidy", fig.alt = "Two panels, one with a data frame with three columns (country, year, and rate) and the other with a data frame with four columns (country, year, cases, and population). Arrows show how the rate variable is separated into two variables: cases and population."}
|
||||
knitr::include_graphics("images/tidy-17.png")
|
||||
```
|
||||
|
||||
By default, `separate()` will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter).
|
||||
For example, in the code above, `separate()` split the values of `rate` at the forward slash characters.
|
||||
If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`.
|
||||
For example, we could rewrite the code above as:
|
||||
|
||||
```{r eval = FALSE}
|
||||
table3 %>%
|
||||
separate(rate, into = c("cases", "population"), sep = "/")
|
||||
```
|
||||
|
||||
(Formally, `sep` is a regular expression, which you'll learn more about in Chapter \@ref(strings).)
|
||||
|
||||
Look carefully at the column types: you'll notice that `cases` and `population` are character columns.
|
||||
This is the default behaviour in `separate()`: it leaves the type of the column as is.
|
||||
Here, however, it's not very useful as those really are numbers.
|
||||
We can ask `separate()` to try and convert to better types using `convert = TRUE`:
|
||||
|
||||
```{r}
|
||||
table3 %>%
|
||||
separate(rate, into = c("cases", "population"), convert = TRUE)
|
||||
```
|
||||
|
||||
### Unite
|
||||
|
||||
`unite()` is the inverse of `separate()`: it combines multiple columns into a single column.
|
||||
You'll need it much less frequently than `separate()`, but it's still a useful tool to have in your back pocket.
|
||||
|
||||
We can use `unite()` to rejoin the `cases` and `population` columns that we created in the last example.
|
||||
That data is saved as `tidyr::table1`.
|
||||
`unite()` takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in `dplyr::select()` style:
|
||||
|
||||
```{r}
|
||||
table1 %>%
|
||||
unite(rate, cases, population)
|
||||
```
|
||||
|
||||
In this case we also need to use the `sep` argument.
|
||||
The default will place an underscore (`_`) between the values from different columns.
|
||||
Here we want `"/"` instead:
|
||||
|
||||
```{r}
|
||||
table1 %>%
|
||||
unite(rate, cases, population, sep = "/")
|
||||
```
|
||||
|
||||
### Exercises
|
||||
|
||||
1. What do the `extra` and `fill` arguments do in `separate()`?
|
||||
|
@ -1177,5 +760,3 @@ table1 %>%
|
|||
)
|
||||
baker
|
||||
```
|
||||
|
||||
##
|
||||
|
|