# Regular expressions

```{r, results = "asis", echo = FALSE}
status("restructuring")
```

## Introduction

You learned the basics of regular expressions in Chapter \@ref(strings), but regular expressions are a fairly rich language so it's worth spending some extra time on the details.

The chapter starts by expanding your knowledge of patterns, to cover six important new topics (escaping, anchoring, character classes, shorthand classes, quantifiers, and alternation).
Here we'll focus mostly on the language itself, not the functions that use it.
That means we'll mostly work with toy character vectors, showing the results with `str_view()` and `str_view_all()`.
You'll need to take what you learn here and apply it to data frames with tidyr functions or by combining dplyr and stringr functions.

Next we'll talk about the important concepts of "grouping" and "capturing", which give you new ways to extract variables out of strings using `tidyr::separate_group()`.
Grouping also allows you to use backreferences, which allow you to do things like match repeated patterns.

We'll finish by discussing the various "flags" that allow you to tweak the operation of regular expressions and cover a few final details about how regular expressions work.
These aren't particularly important in day-to-day usage, but a little extra understanding of the underlying tools is often helpful.

### Prerequisites

This chapter will use regular expressions as provided by the **stringr** package.

```{r setup, message = FALSE}
library(tidyverse)
```

It's worth noting that the regular expressions used by stringr are very slightly different to those of base R.
That's because stringr is built on top of the [stringi package](https://stringi.gagolewski.com), which is in turn built on top of the [ICU engine](https://unicode-org.github.io/icu/userguide/strings/regexp.html), whereas base R functions (like `gsub()` and `grepl()`) use either the [TRE engine](https://github.com/laurikari/tre) or the [PCRE engine](https://www.pcre.org).
Fortunately, the basics of regular expressions are so well established that you're unlikely to encounter any differences when working with the patterns you'll learn in this book.
You only need to be aware of the difference when you start to rely on advanced features like complex Unicode character ranges or special features that use the `(?…)` syntax.
You can learn more about these advanced features in `vignette("regular-expressions", package = "stringr")`.

Another useful reference is [https://www.regular-expressions.info/](https://www.regular-expressions.info/tutorial.html).
It's not R specific, but it includes a lot more information about how regular expressions actually work.

### Exercises

1. Explain why each of these strings doesn't match a `\`: `"\"`, `"\\"`, `"\\\"`.

2. How would you match the sequence `"'\`?

3. What patterns will the regular expression `\..\..\..` match?
    How would you represent it as a string?

## Pattern language

You learned the very basics of the regular expression pattern language in Chapter \@ref(strings), and now it's time to dig into more of the details.
First, we'll start with **escaping**, which allows you to match characters that the pattern language otherwise treats specially.
Next you'll learn about **anchors**, which allow you to match the start or end of the string.
Then you'll learn about **character classes** and their shortcuts, which allow you to match any character from a set.
We'll finish up with **quantifiers**, which control how many times a pattern can match, and **alternation**, which allows you to match either *this* or *that*.

The terms I use here are the technical names for each component.
They're not always the most evocative of their purpose, but it's very helpful to know the correct terms if you later want to Google for more details.

### Escaping {#regexp-escaping}

In Chapter \@ref(strings), you learned how to match a literal `.` by using `fixed(".")`.
But what if you want to match a literal `.` as part of a bigger regular expression?
You'll need to use an **escape**, which tells the regular expression that you want to match the character exactly, rather than use its special behavior.
Like strings, regexps use the backslash for escaping, so to match a `.`, you need the regexp `\.`.
Unfortunately this creates a problem.
We use strings to represent regular expressions, and `\` is also used as an escape symbol in strings.
So, as the following example shows, to create the regular expression `\.` we need the string `"\\."`.

```{r}
# To create the regular expression \., we need to use \\.
dot <- "\\."

# But the expression itself only contains one \
str_view(dot)

# And this tells R to look for an explicit .
str_view(c("abc", "a.c", "bef"), "a\\.c")
```

In this book, I'll write regular expressions as `\.` and strings that represent the regular expression as `"\\."`.

If `\` is used as an escape character in regular expressions, how do you match a literal `\`?
Well, you need to escape it, creating the regular expression `\\`.
To create that regular expression, you need to use a string, which also needs to escape `\`.
That means to match a literal `\` you need to write `"\\\\"`: you need four backslashes to match one!

```{r}
x <- "a\\b"
str_view(x)
str_view(x, "\\\\")
```

Alternatively, you might find it easier to use the raw strings you learned about in Section \@ref(raw-strings).
That lets you avoid one layer of escaping:

```{r}
str_view(x, r"(\\)")
```

The full set of characters with special meanings that need to be escaped is `.^$\|*+?{}[]()`.
In general, look at punctuation characters with suspicion; if your regular expression isn't matching what you think it should, check if you've used any of these characters.

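To see why the escape matters in practice, here is a minimal sketch that compares an unescaped and an escaped dot using `str_detect()`.
The file names are made up for illustration and are not from any dataset used in this book.

```{r}
library(stringr)

# Hypothetical file names, invented for this illustration
files <- c("report.csv", "report_csv", "reportXcsv")

# Unescaped: "." matches any character, so all three strings match
unescaped <- str_detect(files, "report.csv")
unescaped

# Escaped: the string "\\." is the regexp \. and matches only a literal dot
escaped <- str_detect(files, "report\\.csv")
escaped

# writeLines() prints the string as the regexp engine sees it
writeLines("report\\.csv")
```

Only the escaped version rejects the two names that don't contain a real dot.
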
### Anchors

By default, regular expressions will match any part of a string.
If you want to match at the start or end you need to **anchor** the regular expression using `^` or `$`.

- `^` to match the start of the string.
- `$` to match the end of the string.

```{r}
x <- c("apple", "banana", "pear")
str_view(x, "a") # match "a" anywhere
str_view(x, "^a") # match "a" at start
str_view(x, "a$") # match "a" at end
```

To remember which is which, try this mnemonic which I learned from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): if you begin with power (`^`), you end up with money (`$`).
It's tempting to put `$` at the start, because that's how we write sums of money, but it's not what regular expressions want.

To force a regular expression to only match the full string, anchor it with both `^` and `$`:

```{r}
x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")
str_view(x, "^apple$")
```

You can also match the boundary between words with `\b`.
I don't often use this in my R code, but I'll sometimes use it when I'm doing a search in RStudio.
It's useful to find the name of a function that's a component of other functions.
For example, if I want to find all uses of `sum()`, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on:

```{r}
x <- c("summary(x)", "summarise(df)", "rowsum(x)", "sum(x)")
str_view(x, "sum")
str_view(x, "\\bsum\\b")
```

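If you'd rather see the effect of each anchor as a logical vector than as highlighted text, the following sketch applies them with `str_detect()`.
The input strings are illustrative only.

```{r}
library(stringr)

x <- c("apple", "pineapple", "apple pie")

starts <- str_detect(x, "^apple")        # "apple" at the start of the string
ends   <- str_detect(x, "apple$")        # "apple" at the end of the string
whole  <- str_detect(x, "^apple$")       # the string is exactly "apple"
word   <- str_detect(x, "\\bapple\\b")   # "apple" as a standalone word

data.frame(x, starts, ends, whole, word)
```

Note that `\b` rejects "pineapple" because there is no word boundary between the "e" and the "a".
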
### Character classes

A **character class**, or character **set**, allows you to match any character in a set.
The basic syntax lists each character you want to match inside of `[]`, so `[abc]` will match a, b, or c.
Inside of `[]` only `-`, `^`, and `\` have special meanings:

- `-` defines a range. `[a-z]` matches any lower case letter and `[0-9]` matches any digit.
- `^` takes the inverse of the set. `[^abc]` matches anything except a, b, or c.
- `\` escapes special characters, so `[\^\-\]]` matches `^`, `-`, or `]`.

```{r}
str_view_all("abcd12345-!@#%. [", "[abc]")
str_view_all("abcd12345-!@#%. [", "[a-z]")
str_view_all("abcd12345-!@#%. [", "[^a-z0-9]")
str_view_all("abcd12345-!@#%. []", "[\\-]")
```

Remember that regular expressions are case sensitive, so if you want to match any lowercase or uppercase letter, you'd need to write `[a-zA-Z]`.

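As a small sketch of ranges and negation, the code below pulls individual characters out of a toy string with `str_extract_all()`.
The string is invented for illustration.

```{r}
library(stringr)

x <- "R2-D2&c3po"

lower   <- str_extract_all(x, "[a-z]")[[1]]         # lowercase letters only
letters_any <- str_extract_all(x, "[a-zA-Z]")[[1]]  # letters of either case
others  <- str_extract_all(x, "[^a-zA-Z0-9]")[[1]]  # anything that isn't a letter or digit

lower
letters_any
others
```
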
### Shorthand character classes

There are a few character classes that are used so commonly that they get their own single character shortcut.
You've already seen `.`, which matches any character apart from a newline.
There are three other useful pairs:

- `\d` matches any digit; `\D` matches anything that isn't a digit.
- `\s` matches any whitespace (e.g. space, tab, newline); `\S` matches anything that isn't whitespace.
- `\w` matches any "word" character, i.e. letters, digits, and the underscore; `\W` matches any non-word character.

Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`.
The following code demonstrates the different matches with a selection of letters, numbers, and punctuation characters.

```{r}
str_view_all("abcd12345!@#%. ", "\\d+")
str_view_all("abcd12345!@#%. ", "\\D+")
str_view_all("abcd12345!@#%. ", "\\w+")
str_view_all("abcd12345!@#%. ", "\\W+")
str_view_all("abcd12345!@#%. ", "\\s+")
str_view_all("abcd12345!@#%. ", "\\S+")
```

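The shorthand classes are easiest to appreciate when you extract or count with them.
This is a minimal sketch on a made-up sentence; nothing about it is specific to any dataset in the book.

```{r}
library(stringr)

x <- "Order 66 shipped on 2024-05-04"

digits <- str_extract_all(x, "\\d+")[[1]]  # runs of digits
words  <- str_extract_all(x, "\\w+")[[1]]  # runs of word characters
spaces <- str_count(x, "\\s")              # number of whitespace characters

digits
words
spaces

# The underscore counts as a word character
underscore <- str_detect("snake_case", "^\\w+$")
underscore
```
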
### Quantifiers

The **quantifiers** control how many times a pattern matches.
In Chapter \@ref(strings) we discussed `?` (0 or 1 matches), `+` (1 or more matches), and `*` (0 or more matches).
So `colou?r` will match both the American and British spellings, `\d+` will match one or more digits, and `\s?` will optionally match a single whitespace character.
You can also specify the number of matches precisely:

- `{n}`: exactly n
- `{n,}`: n or more
- `{n,m}`: between n and m

```{r}
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "C{2}")
str_view(x, "C{2,}")
str_view(x, "C{1,3}")
str_view(x, "C{2,3}")
```

By default these matches are **greedy**: they will match the longest string possible.
You can make them **lazy**, matching the shortest string possible, by putting a `?` after them.
This is an advanced feature of regular expressions, but it's useful to know that it exists:

```{r}
str_view(x, "C{2,3}?")
str_view(x, "C+[LX]+")
str_view(x, "C+[LX]+?")
```

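The difference between greedy and lazy matching is often clearer on text with repeated delimiters.
Here's a small sketch on an invented HTML-like string, using `str_extract()` so you can see exactly which text each version captures.

```{r}
library(stringr)

x <- "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as possible, so the match runs to the last ">"
greedy <- str_extract(x, "<.+>")
greedy

# Lazy: .+? grabs as little as possible, so the match stops at the first ">"
lazy <- str_extract(x, "<.+?>")
lazy
```
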
### Parentheses

Quantifiers apply to the preceding pattern, so `a+` matches one or more "a"s, `\d+` matches one or more digits, and `[aeiou]+` matches one or more vowels.
You can use parentheses to define a more complex pattern.
For example, `([aeiou].)+` matches a vowel followed by any character, repeated one or more times.

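To make the scope of a quantifier concrete, the following sketch compares `ab+` (the `+` applies only to "b") with `(ab)+` (the `+` applies to the whole group) on a toy string.

```{r}
library(stringr)

x <- "ababab abbb"

# + applies only to "b": an "a" followed by one or more "b"s
without_group <- str_extract_all(x, "ab+")[[1]]
without_group

# + applies to the group: one or more repetitions of "ab"
with_group <- str_extract_all(x, "(ab)+")[[1]]
with_group
```
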
### Alternation

You can use **alternation** to pick between one or more alternative patterns.
Here are a few examples:

- Match apple, pear, or banana: `apple|pear|banana`.
- Match three word characters or two digits: `\w{3}|\d{2}`.

`|` has very low precedence, so if you want to use it inside a bigger pattern you'll need to wrap it in parentheses.
For example, if you want to match only a complete string, you'll need `^(apple|pear|banana)$`.
`^apple|pear|banana$` will match apple at the start of the string, pear anywhere, and banana at the end.

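The precedence rule is easy to check for yourself.
This sketch runs both versions of the pattern over a few illustrative strings with `str_detect()`.

```{r}
library(stringr)

x <- c("apple", "apple pie", "pear tree", "ripe banana", "banana bread")

# Without parentheses: ^ binds only to apple and $ only to banana
loose <- str_detect(x, "^apple|pear|banana$")
loose

# With parentheses: the anchors apply to every alternative
strict <- str_detect(x, "^(apple|pear|banana)$")
strict
```
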
### Exercises

1. How would you match the literal string `"$^$"`?

2. Given the corpus of common words in `stringr::words`, create regular expressions that find all words that:

    a. Start with "y".
    b. Don't start with "y".
    c. End with "x".
    d. Are exactly three letters long. (Don't cheat by using `str_length()`!)
    e. Have seven letters or more.

    Since `words` is long, you might want to use the `match` argument to `str_view()` to show only the matching or non-matching words.

3. Create regular expressions that match the British or American spellings of the following words: grey/gray, modelling/modeling, summarize/summarise, aluminium/aluminum, defence/defense, analog/analogue, center/centre, sceptic/skeptic, aeroplane/airplane, arse/ass, doughnut/donut.

4. What strings will `$a` match?

5. Create regular expressions to find all words that:

    a. Start with a vowel.
    b. Only contain consonants. (Hint: think about matching "not"-vowels.)
    c. End with `ed`, but not with `eed`.
    d. End with `ing` or `ise`.

6. Empirically verify the rule "i before e except after c".

7. Is "q" always followed by a "u"?

8. Write a regular expression that matches a `word` if it's probably written in British English, not American English.

9. Create a regular expression that will match telephone numbers as commonly written in your country.

10. Describe the equivalents of `?`, `+`, `*` in `{m,n}` form.

11. Describe in words what these regular expressions match (read carefully to see if I'm using a regular expression or a string that defines a regular expression):

    a. `^.*$`
    b. `"\\{.+\\}"`
    c. `\d{4}-\d{2}-\d{2}`
    d. `"\\\\{4}"`

12. Create regular expressions to find all words that:

    a. Start with three consonants.
    b. Have three or more vowels in a row.
    c. Have two or more vowel-consonant pairs in a row.

13. Solve the beginner regexp crosswords at <https://regexcrossword.com/challenges/beginner>.

## Grouping and capturing

Earlier, you learned about parentheses as a way to create complex patterns.
Parentheses also create a numbered capturing group (number 1, 2, etc.).
A capturing group stores the part of the string matched by the part of the regular expression inside the parentheses.
You can refer to the same text as previously matched by a capturing group with **backreferences**, like `\1`, `\2`, etc.

For example, the following regular expression finds all fruits that have a repeated pair of letters.

```{r}
str_view(fruit, "(..)\\1", match = TRUE)
```

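To see what a backreference actually captures, here is a small sketch on a handful of hand-picked words rather than the full `fruit` vector.
`str_extract()` returns the matched text, or `NA` where nothing matches.

```{r}
library(stringr)

x <- c("banana", "coconut", "apple", "kiwi")

# (..)\1 : any two characters, immediately followed by the same two characters
pair_found <- str_detect(x, "(..)\\1")
pair_text  <- str_extract(x, "(..)\\1")
pair_found
pair_text

# (.)\1 : any single character, immediately repeated
doubled <- str_detect(x, "(.)\\1")
doubled
```
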
### Replacement

You can also use backreferences when replacing.
The following code will switch the order of the second and third words:

```{r}
sentences %>%
  str_replace("(\\w+) (\\w+) (\\w+)", "\\1 \\3 \\2") %>%
  head(5)
```

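The same idea works for any reordering task.
As a minimal sketch, the code below flips "Last, First" names into "First Last" order; the names are just illustrative input.

```{r}
library(stringr)

people <- c("Lovelace, Ada", "Hopper, Grace")

# Group 1 captures the surname, group 2 the given name;
# the replacement string emits them in the opposite order
flipped <- str_replace(people, "(\\w+), (\\w+)", "\\2 \\1")
flipped
```
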
Names that start and end with the same letter.
Implement with `str_sub()` instead.

### str_match()

```{r}
# str_match() returns a matrix: the full match, then one column per capturing group
sentences %>%
  str_match("the (\\w+) (\\w+)") %>%
  head()
```

### Non-capturing groups

Occasionally, you'll want to use parentheses without creating capturing groups.
You can create a non-capturing group with `(?:)`.
Typically, however, you'll find it easier to just ignore that result in the output of `str_match()`.

```{r}
x <- c("a gray cat", "a grey dog")
str_match(x, "(gr(e|a)y)")
str_match(x, "(gr(?:e|a)y)")
```

### Exercises

1. Describe, in words, what these expressions will match:

    a. `(.)\1\1`
    b. `"(.)(.)\\2\\1"`
    c. `(..)\1`
    d. `"(.).\\1.\\1"`
    e. `"(.)(.)(.).*\\3\\2\\1"`

2. Construct regular expressions to match words that:

    a. Start and end with the same character.
    b. Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice).
    c. Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s).

## Flags

There are a number of settings, called **flags**, that you can use to control some of the details of the pattern language.
In stringr, you supply these by passing the object created by `regex()` as the pattern, instead of a simple string:

```{r, eval = FALSE}
# The regular call:
str_view(fruit, "nana")
# Is shorthand for
str_view(fruit, regex("nana"))
```

This is useful because it allows you to pass additional arguments to control the details of the match.
The most useful is probably `ignore_case = TRUE` because it allows characters to match either their uppercase or lowercase forms:

```{r}
bananas <- c("banana", "Banana", "BANANA")
str_view(bananas, "banana")
str_view(bananas, regex("banana", ignore_case = TRUE))
```

If you're doing a lot of work with multiline strings (i.e. strings that contain `\n`), `multiline` and `dotall` can also be useful.
`dotall = TRUE` allows `.` to match everything, including `\n`:

```{r}
x <- "Line 1\nLine 2\nLine 3"
str_view_all(x, ".L")
str_view_all(x, regex(".L", dotall = TRUE))
```

And `multiline = TRUE` allows `^` and `$` to match the start and end of each line rather than the start and end of the complete string:

```{r}
x <- "Line 1\nLine 2\nLine 3"
str_view_all(x, "^Line")
str_view_all(x, regex("^Line", multiline = TRUE))
```

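If you want numbers rather than highlighted output, the flags can be compared with `str_count()` and `str_extract()`.
This is a small sketch on the same three-line string; the variable names are only for illustration.

```{r}
library(stringr)

x <- "Line 1\nLine 2\nLine 3"

# Without multiline, ^ only matches at the very start of the string
n_default   <- str_count(x, "^Line")
# With multiline, ^ matches at the start of each line
n_multiline <- str_count(x, regex("^Line", multiline = TRUE))
# ignore_case lets the lowercase pattern match "Line"
n_nocase    <- str_count(x, regex("line", ignore_case = TRUE))

c(n_default, n_multiline, n_nocase)

# By default . does not match "\n"; with dotall = TRUE it does
no_dotall   <- str_extract(x, "1.Line")
with_dotall <- str_extract(x, regex("1.Line", dotall = TRUE))
no_dotall
with_dotall
```
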
If you're writing a complicated regular expression and you're worried you might not understand it in the future, `comments = TRUE` can be super useful.
It allows you to use comments and white space to make complex regular expressions more understandable.
Spaces and new lines are ignored, as is everything after `#`.
(Note that I'm using a raw string here to minimise the number of escapes needed.)

```{r}
phone <- regex(r"(
  \(?     # optional opening parens
  (\d{3}) # area code
  [) -]?  # optional closing parens, space, or dash
  (\d{3}) # another three numbers
  [ -]?   # optional space or dash
  (\d{4}) # four more numbers
)", comments = TRUE)

str_match("514-791-8141", phone)
```

If you're using comments and want to match a space, newline, or `#`, you'll need to escape it:

```{r}
str_view("x x #", regex("x #", comments = TRUE))
str_view("x x #", regex(r"(x\ \#)", comments = TRUE))
```

## Some details

### Overlapping

Matches never overlap, and the regular expression engine only starts looking for a new match after the end of the last match.
For example, in `"abababa"`, how many times will the pattern `"aba"` match?
Regular expressions say two, not three:

```{r}
str_count("abababa", "aba")
str_view_all("abababa", "aba")
```

### Zero width matches

It's possible for a regular expression to match no character, i.e. the space between two characters.
This typically happens when you use a quantifier that allows zero matches:

```{r}
str_view_all("abcdef", "c?")
```

But anchors also create zero-width matches:

```{r}
str_view_all("this is a sentence", "\\b")
str_view_all("this is a sentence", "^")
```

### Greediness

By default, regular expression quantifiers are greedy: they attempt to match the longest possible string.