Polishing regexps
parent
bd50322b2b
commit
39be3c0f41
221
regexps.qmd
|
@ -318,14 +318,11 @@ In general, look at punctuation characters with suspicion; if your regular expre
|
||||||
### Anchors
|
### Anchors
|
||||||
|
|
||||||
By default, regular expressions will match any part of a string.
|
By default, regular expressions will match any part of a string.
|
||||||
If you want to match at the start or end you need to **anchor** the regular expression using `^` or `$`.
|
If you want to match at the start or end you need to **anchor** the regular expression using `^` to match the start of the string or `$` to match the end of the string:
|
||||||
|
|
||||||
- `^` to match the start of the string.
|
|
||||||
- `$` to match the end of the string.
|
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
str_view(fruit, "^a") # match "a" at start
|
str_view(fruit, "^a")
|
||||||
str_view(fruit, "a$") # match "a" at end
|
str_view(fruit, "a$")
|
||||||
```
|
```
|
||||||
|
|
||||||
To remember which is which, try this mnemonic which we learned from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): if you begin with power (`^`), you end up with money (`$`).
|
To remember which is which, try this mnemonic which we learned from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): if you begin with power (`^`), you end up with money (`$`).
|
||||||
|
@ -339,8 +336,7 @@ str_view(fruit, "^apple$")
|
||||||
```
|
```
|
||||||
|
|
||||||
You can also match the boundary between words (i.e. the start or end of a word) with `\b`.
|
You can also match the boundary between words (i.e. the start or end of a word) with `\b`.
|
||||||
This is not that useful in R code, but it can be handy when searching in RStudio.
|
This can be particularly useful when using RStudio's find and replace tool.
|
||||||
It's useful to find the name of a function that's a component of other functions.
|
|
||||||
For example, to find all uses of `sum()`, you can search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on:
|
For example, to find all uses of `sum()`, you can search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
|
@ -349,7 +345,7 @@ str_view(x, "sum")
|
||||||
str_view(x, "\\bsum\\b")
|
str_view(x, "\\bsum\\b")
|
||||||
```
|
```
|
||||||
|
|
||||||
When used alone anchors will produce a zero-width match:
|
When used alone, anchors will produce a zero-width match:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
str_view("abc", c("$", "^", "\\b"))
|
str_view("abc", c("$", "^", "\\b"))
|
||||||
|
@ -364,13 +360,15 @@ str_replace_all("abc", c("$", "^", "\\b"), "--")
|
||||||
### Character classes
|
### Character classes
|
||||||
|
|
||||||
A **character class**, or character **set**, allows you to match any character in a set.
|
A **character class**, or character **set**, allows you to match any character in a set.
|
||||||
The basic syntax lists each character you want to match inside of `[]`, so `[abc]` will match a, b, or c.
|
You can construct your own sets with `[]`, where `[abc]` matches a, b, or c.
|
||||||
Inside of `[]` only `-`, `^`, and `\` have special meanings:
|
There are three characters that have special meaning inside of `[]`:
|
||||||
|
|
||||||
- `-` defines a range, e.g., `[a-z]` matches any lower case letter and `[0-9]` matches any digit.
|
- `-` defines a range, e.g., `[a-z]` matches any lower case letter and `[0-9]` matches any digit.
|
||||||
- `^` takes the inverse of the set, e.g., `[^abc]` matches anything except a, b, or c.
|
- `^` takes the inverse of the set, e.g., `[^abc]` matches anything except a, b, or c.
|
||||||
- `\` escapes special characters, so `[\^\-\]]` matches `^`, `-`, or `]`.
|
- `\` escapes special characters, so `[\^\-\]]` matches `^`, `-`, or `]`.
|
||||||
|
|
||||||
|
Here are a few examples:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
str_view("abcd ABCD 12345 -!@#%.", "[abc]+")
|
str_view("abcd ABCD 12345 -!@#%.", "[abc]+")
|
||||||
str_view("abcd ABCD 12345 -!@#%.", "[a-z]+")
|
str_view("abcd ABCD 12345 -!@#%.", "[a-z]+")
|
||||||
|
@ -382,11 +380,11 @@ str_view("a-b-c", "[a-c]")
|
||||||
str_view("a-b-c", "[a\\-c]")
|
str_view("a-b-c", "[a\\-c]")
|
||||||
```
|
```
|
||||||
|
|
||||||
### Shorthand character classes
|
Some character classes are used so commonly that they get their own shortcut.
|
||||||
|
|
||||||
There are a few character classes that are used so commonly that they get their own shortcut.
|
|
||||||
You've already seen `.`, which matches any character apart from a newline.
|
You've already seen `.`, which matches any character apart from a newline.
|
||||||
There are three other particularly useful pairs:
|
There are three other particularly useful pairs[^regexps-4]:
|
||||||
|
|
||||||
|
[^regexps-4]: Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`.
|
||||||
|
|
||||||
- `\d`: matches any digit;\
|
- `\d`: matches any digit;\
|
||||||
`\D`: matches anything that isn't a digit.
|
`\D`: matches anything that isn't a digit.
|
||||||
|
@ -395,9 +393,7 @@ There are three other particularly useful pairs:
|
||||||
- `\w`: matches any "word" character, i.e. letters, numbers, and underscores;\
|
- `\w`: matches any "word" character, i.e. letters, numbers, and underscores;\
|
||||||
`\W`: matches any "non-word" character.
|
`\W`: matches any "non-word" character.
|
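As a quick check of the string-escaping rule for these shortcuts, here is a minimal sketch (added for illustration, with variable names that are ours and not from the chapter). It assumes stringr is installed and R is at least 4.0, since it uses a raw string. It shows that the R string `"\\d"` holds the two-character regexp `\d`, and that a raw string defines the same pattern:

```r
library(stringr)

# The string "\\d" contains two characters: a backslash and "d".
pattern <- "\\d"
writeLines(pattern)                      # shows the regexp as the engine sees it
n_chars <- str_length(pattern)

# A raw string avoids the doubled backslash but defines the same pattern.
same_pattern <- identical(pattern, r"(\d)")

# Extract the first run of digits from a mixed string.
digits <- str_extract("abcd 12345 !@#%.", "\\d+")
```

Printing a pattern with `writeLines()` is a handy way to see what the regular expression engine will actually receive.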
||||||
|
|
||||||
Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`.
|
The following code demonstrates the six shortcuts with a selection of letters, numbers, and punctuation characters.
|
||||||
|
|
||||||
The following code demonstrates the different shortcuts with a selection of letters, numbers, and punctuation characters.
|
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
str_view("abcd 12345 !@#%.", "\\d+")
|
str_view("abcd 12345 !@#%.", "\\d+")
|
||||||
|
@ -412,21 +408,27 @@ str_view("abcd 12345 !@#%.", "\\S+")
|
||||||
|
|
||||||
The **quantifiers** control how many times a pattern matches.
|
The **quantifiers** control how many times a pattern matches.
|
||||||
In @sec-reg-basics you learned about `?` (0 or 1 matches), `+` (1 or more matches), and `*` (0 or more matches).
|
In @sec-reg-basics you learned about `?` (0 or 1 matches), `+` (1 or more matches), and `*` (0 or more matches).
|
||||||
For example, `colou?r` will match American or British spelling, `\d+` will match one or more digits, and `\s?` will optionally match a single whitespace.
|
For example, `colou?r` will match American or British spelling, `\d+` will match one or more digits, and `\s?` will optionally match a single item of whitespace.
|
||||||
You can also specify the number of matches precisely:
|
You can also specify the number of matches precisely:
|
||||||
|
|
||||||
- `{n}`: exactly n matches.
|
- `{n}` matches exactly n times.
|
||||||
- `{n,}`: n or more matches.
|
- `{n,}` matches at least n times.
|
||||||
- `{n,m}`: between n and m matches.
|
- `{n,m}` matches between n and m times.
|
||||||
|
|
||||||
The following code shows how this works for a few simple examples using `\b` to make the match start at the beginning of a word.
|
The following code shows how this works for a few simple examples:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
x <- " x xx xxx xxxx"
|
x <- "-- -x- -xx- -xxx- -xxxx- -xxxxx-"
|
||||||
str_view(x, "\\bx{2}")
|
str_view(x, "-x?-") # [0, 1]
|
||||||
str_view(x, "\\bx{2,}")
|
str_view(x, "-x+-") # [1, Inf)
|
||||||
str_view(x, "\\bx{1,3}")
|
str_view(x, "-x*-") # [0, Inf)
|
||||||
str_view(x, "\\bx{2,3}")
|
str_view(x, "-x{2}-") # [2, 2]
|
||||||
|
str_view(x, "-x{2,}-") # [2, Inf)
|
||||||
|
str_view(x, "-x{2,3}-") # [2, 3]
|
||||||
|
```
|
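To compare the quantifiers side by side, the following sketch (an illustration added here, assuming stringr is installed; the variable names are ours) uses `str_extract_all()` on a shortened version of the same kind of input, so each result is an ordinary character vector:

```r
library(stringr)

x <- "-- -x- -xx- -xxx- -xxxx-"

# Pull out every match so the effect of each quantifier is easy to compare.
zero_or_one  <- str_extract_all(x, "-x?-")[[1]]     # "--" and "-x-"
exactly_two  <- str_extract_all(x, "-x{2}-")[[1]]   # "-xx-"
two_or_more  <- str_extract_all(x, "-x{2,}-")[[1]]  # "-xx-", "-xxx-", "-xxxx-"
two_to_three <- str_extract_all(x, "-x{2,3}-")[[1]] # "-xx-", "-xxx-"
```

The surrounding `-` characters matter: without them, `x{2}` would also match the first two characters of any longer run of `x`.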
||||||
|
|
||||||
|
```{r}
|
||||||
|
str_view(fruit, "")
|
||||||
```
|
```
|
||||||
|
|
||||||
### Operator precedence and parentheses
|
### Operator precedence and parentheses
|
||||||
|
@ -435,21 +437,19 @@ What does `ab+` match?
|
||||||
Does it match "a" followed by one or more "b"s, or does it match "ab" repeated any number of times?
|
Does it match "a" followed by one or more "b"s, or does it match "ab" repeated any number of times?
|
||||||
What does `^a|b$` match?
|
What does `^a|b$` match?
|
||||||
Does it match the complete string a or the complete string b, or does it match a string starting with a or a string ending with b?
|
Does it match the complete string a or the complete string b, or does it match a string starting with a or a string ending with b?
|
||||||
The answer to these questions is determined by operator precedence, similar to the PEMDAS or BEDMAS rules you might have learned in school for what `a + b * c`.
|
|
||||||
|
|
||||||
You already know that `a + b * c` is equivalent to `a + (b * c)` not `(a + b) * c` because `*` has higher precedence and `+` has lower precedence: you compute `*` before `+`.
|
The answer to these questions is determined by operator precedence, similar to the PEMDAS or BEDMAS rules you might have learned in school to understand how to compute `a + b * c`.
|
||||||
In regular expressions, quantifiers have high precedence and alternation has low precedence.
|
You know that `a + b * c` is equivalent to `a + (b * c)` not `(a + b) * c` because `*` has higher precedence and `+` has lower precedence: you compute `*` before `+`.
|
||||||
That means `ab+` is equivalent to `a(b+)`, and `^a|b$` is equivalent to `(^a)|(b$)`.
|
In regular expressions, quantifiers have higher precedence and alternation has lower precedence, which means that `ab+` is equivalent to `a(b+)`, and `^a|b$` is equivalent to `(^a)|(b$)`.
|
||||||
|
|
||||||
Just like with algebra, you can use parentheses to override the usual order.
|
Just like with algebra, you can use parentheses to override the usual order.
|
||||||
Unlike algebra you're unlikely to remember the precedence rules for regexes, so feel free to use parentheses liberally.
|
Unlike algebra you're unlikely to remember the precedence rules for regexes, so feel free to use parentheses liberally.
|
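The precedence rules are easy to check empirically. The sketch below (added for illustration, assuming stringr is installed; the test vectors are made up) compares each pattern with a parenthesised variant that changes its meaning:

```r
library(stringr)

x <- c("ab", "abb", "abab")
# `ab+` means `a(b+)`: one "a" followed by one or more "b"s.
quant_default <- str_detect(x, "^ab+$")    # TRUE  TRUE  FALSE
# Parentheses make the quantifier apply to the whole "ab".
quant_grouped <- str_detect(x, "^(ab)+$")  # TRUE  FALSE TRUE

y <- c("a", "b", "ax", "xb", "xax")
# `^a|b$` means `(^a)|(b$)`: starts with "a" OR ends with "b".
alt_default <- str_detect(y, "^a|b$")      # TRUE TRUE TRUE  TRUE  FALSE
# Parentheses restrict the alternation to a single character between anchors.
alt_grouped <- str_detect(y, "^(a|b)$")    # TRUE TRUE FALSE FALSE FALSE
```

When in doubt, a small test vector like this is often quicker than recalling the precedence table.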
||||||
|
|
||||||
Technically the escape, character classes, and parentheses are all operators that also have precedence.
|
|
||||||
But these tend to be less likely to cause confusion because they mostly behave how you expect: it's unlikely that you'd think that `\(s|d)` would mean `(\s)|(\d)`.
|
|
||||||
|
|
||||||
### Grouping and capturing
|
### Grouping and capturing
|
||||||
|
|
||||||
Parentheses are an important tool for controlling the order in which pattern operations are applied but they also have an important additional effect: they create **capturing groups** that allow you to use sub-components of the match.
|
Parentheses are important for controlling the order in which pattern operations are applied but they also have an important additional effect: they create **capturing groups** that allow you to use sub-components of the match.
|
||||||
You can refer back to previously matched text inside parentheses by using **back reference**: `\1` refers to the match contained in the first parenthesis, `\2` in the second parenthesis, and so on.
|
|
||||||
|
The first way to use a capturing group is to refer back to it within a match by using a **back reference**: `\1` refers to the match contained in the first parenthesis, `\2` in the second parenthesis, and so on.
|
||||||
For example, the following pattern finds all fruits that have a repeated pair of letters:
|
For example, the following pattern finds all fruits that have a repeated pair of letters:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
|
@ -459,19 +459,22 @@ str_view(fruit, "(..)\\1")
|
||||||
And this one finds all words that start and end with the same pair of letters:
|
And this one finds all words that start and end with the same pair of letters:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
str_view(words, "^(..).*\\1$")
|
str_view(words, "^(..).*\\1$")
|
||||||
```
|
```
|
||||||
|
|
||||||
You can also use backreferences in `str_replace()`:
|
You can also use backreferences in `str_replace()`.
|
||||||
|
For example, this code switches the order of the second and third words in `sentences`:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
sentences |>
|
sentences |>
|
||||||
str_replace("(\\w+) (\\w+) (\\w+)", "\\1 \\3 \\2") |>
|
str_replace("(\\w+) (\\w+) (\\w+)", "\\1 \\3 \\2") |>
|
||||||
head(5)
|
str_view()
|
||||||
```
|
```
|
||||||
|
|
||||||
If you want to extract the matches for each group you can use `str_match()`.
|
If you want to extract the matches for each group you can use `str_match()`.
|
||||||
But it returns a matrix, so isn't as easy to work with:
|
But `str_match()` returns a matrix, so it's not particularly easy to work with[^regexps-5]:
|
||||||
|
|
||||||
|
[^regexps-5]: Mostly because we never discuss matrices in this book!
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
sentences |>
|
sentences |>
|
||||||
|
@ -488,8 +491,8 @@ sentences |>
|
||||||
set_names("match", "word1", "word2")
|
set_names("match", "word1", "word2")
|
||||||
```
|
```
|
||||||
|
|
||||||
But then you've basically recreated your own simple version of `separate_wider_regex()`.
|
But then you've basically recreated your own version of `separate_wider_regex()`.
|
||||||
Indeed, behind the scenes `separate_wider_regex()` converts your vector of patterns to a single regexp that uses grouping to capture only the named components.
|
And indeed, behind the scenes `separate_wider_regex()` converts your vector of patterns to a single regexp that uses grouping to capture only the named components.
|
||||||
|
|
||||||
Occasionally, you'll want to use parentheses without creating matching groups.
|
Occasionally, you'll want to use parentheses without creating matching groups.
|
||||||
You can create a non-capturing group with `(?:)`.
|
You can create a non-capturing group with `(?:)`.
|
||||||
|
@ -502,24 +505,27 @@ str_match(x, "(gr(?:e|a)y)")
|
||||||
|
|
||||||
### Exercises
|
### Exercises
|
||||||
|
|
||||||
2. How would you match the literal string `"'\`? How about `"$^$"`?
|
1. How would you match the literal string `"'\`? How about `"$^$"`?
|
||||||
|
|
||||||
3. Explain why each of these patterns don't match a `\`: `"\"`, `"\\"`, `"\\\"`.
|
2. Explain why each of these patterns don't match a `\`: `"\"`, `"\\"`, `"\\\"`.
|
||||||
|
|
||||||
4. Given the corpus of common words in `stringr::words`, create regular expressions that find all words that:
|
3. Given the corpus of common words in `stringr::words`, create regular expressions that find all words that:
|
||||||
|
|
||||||
a. Start with "y".
|
a. Start with "y".
|
||||||
b. Don't start with "y".
|
b. Don't start with "y".
|
||||||
c. End with "x".
|
c. End with "x".
|
||||||
d. Are exactly three letters long. (Don't cheat by using `str_length()`!)
|
d. Are exactly three letters long. (Don't cheat by using `str_length()`!)
|
||||||
e. Have seven letters or more.
|
e. Have seven letters or more.
|
||||||
|
f. Contain a vowel-consonant pair.
|
||||||
|
g. Contain at least two vowel-consonant pairs in a row.
|
||||||
|
h. Only consist of repeated vowel-consonant pairs.
|
||||||
|
|
||||||
5. Create 11 regular expressions that match the British or American spellings for each of the following words: grey/gray, modelling/modeling, summarize/summarise, aluminium/aluminum, defence/defense, analog/analogue, center/centre, sceptic/skeptic, aeroplane/airplane, arse/ass, doughnut/donut.
|
4. Create 11 regular expressions that match the British or American spellings for each of the following words: grey/gray, modelling/modeling, summarize/summarise, aluminium/aluminum, defence/defense, analog/analogue, center/centre, sceptic/skeptic, aeroplane/airplane, arse/ass, doughnut/donut.
|
||||||
Try and make the shortest possible regex!
|
Try and make the shortest possible regex!
|
||||||
|
|
||||||
6. Create a regular expression that will match telephone numbers as commonly written in your country.
|
5. Create a regular expression that will match telephone numbers as commonly written in your country.
|
||||||
|
|
||||||
7. Describe in words what these regular expressions match: (read carefully to see if each entry is a regular expression or a string that defines a regular expression.)
|
6. Describe in words what these regular expressions match: (read carefully to see if each entry is a regular expression or a string that defines a regular expression.)
|
||||||
|
|
||||||
a. `^.*$`
|
a. `^.*$`
|
||||||
b. `"\\{.+\\}"`
|
b. `"\\{.+\\}"`
|
||||||
|
@ -529,24 +535,17 @@ str_match(x, "(gr(?:e|a)y)")
|
||||||
f. `(.)\1\1`
|
f. `(.)\1\1`
|
||||||
g. `"(..)\\1"`
|
g. `"(..)\\1"`
|
||||||
|
|
||||||
8. Solve the beginner regexp crosswords at <https://regexcrossword.com/challenges/beginner>.
|
7. Solve the beginner regexp crosswords at <https://regexcrossword.com/challenges/beginner>.
|
||||||
|
|
||||||
## Pattern control
|
## Pattern control
|
||||||
|
|
||||||
### Regex Flags {#sec-flags}
|
It's possible to exercise control over the details of the match by supplying a richer object to the `pattern` argument.
|
||||||
|
There are three particularly useful options: `regex()`, `fixed()`, and `coll()`, as described in the following sections.
|
||||||
|
|
||||||
The are a number of settings, often called **flags** in other programming languages, that you can use to control some of the details of the regex.
|
### Regex flags {#sec-flags}
|
||||||
In stringr, you can use these by wrapping the pattern in a call to `regex()`:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
#| eval: false
|
|
||||||
|
|
||||||
# The regular call:
|
|
||||||
str_view(fruit, "nana")
|
|
||||||
# is shorthand for
|
|
||||||
str_view(fruit, regex("nana"))
|
|
||||||
```
|
|
||||||
|
|
||||||
|
There are a number of settings that you can use to control the details of the regexp, which are often called **flags** in other programming languages.
|
||||||
|
In stringr, you can use these by wrapping the pattern in a call to `regex()`.
|
||||||
The most useful flag is probably `ignore_case = TRUE` because it allows characters to match either their uppercase or lowercase forms:
|
The most useful flag is probably `ignore_case = TRUE` because it allows characters to match either their uppercase or lowercase forms:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
|
@ -555,16 +554,16 @@ str_view(bananas, "banana")
|
||||||
str_view(bananas, regex("banana", ignore_case = TRUE))
|
str_view(bananas, regex("banana", ignore_case = TRUE))
|
||||||
```
|
```
|
||||||
|
|
||||||
If you're doing a lot of work with multiline strings (i.e. strings that contain `\n`), `multiline` and `dotall` can also be useful.
|
If you're doing a lot of work with multiline strings (i.e. strings that contain `\n`), `dotall` and `multiline` can also be useful.
|
||||||
`dotall = TRUE` allows `.` to match everything, including `\n`:
|
`dotall = TRUE` lets `.` match everything, including `\n`:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
x <- "Line 1\nLine 2\nLine 3"
|
x <- "Line 1\nLine 2\nLine 3"
|
||||||
str_view(x, ".L")
|
str_view(x, ".Line")
|
||||||
str_view(x, regex(".L", dotall = TRUE))
|
str_view(x, regex(".Line", dotall = TRUE))
|
||||||
```
|
```
|
||||||
|
|
||||||
And `multiline = TRUE` allows `^` and `$` to match the start and end of each line rather than the start and end of the complete string:
|
And `multiline = TRUE` makes `^` and `$` match the start and end of each line rather than the start and end of the complete string:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
x <- "Line 1\nLine 2\nLine 3"
|
x <- "Line 1\nLine 2\nLine 3"
|
||||||
|
@ -572,20 +571,23 @@ str_view(x, "^Line")
|
||||||
str_view(x, regex("^Line", multiline = TRUE))
|
str_view(x, regex("^Line", multiline = TRUE))
|
||||||
```
|
```
|
||||||
|
|
||||||
Finally, if you're writing a complicated regular expression and you're worried you might not understand it in the future, `comments = TRUE` can be extremely useful.
|
Finally, if you're writing a complicated regular expression and you're worried you might not understand it in the future, you might find `comments = TRUE` to be useful.
|
||||||
It allows you to use comments and whitespace to make complex regular expressions more understandable.
|
It ignores spaces and new lines, as well as everything after `#`, allowing you to use comments and whitespace to make complex regular expressions more understandable[^regexps-6].
|
||||||
Spaces and new lines are ignored, as is everything after `#`.
|
|
||||||
(Note that we use a raw string here to minimize the number of escapes needed.)
|
[^regexps-6]: `comments = TRUE` is particularly effective in combination with a raw string, as we use here.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
phone <- regex(r"(
|
phone <- regex(
|
||||||
|
r"(
|
||||||
\(? # optional opening parens
|
\(? # optional opening parens
|
||||||
(\d{3}) # area code
|
(\d{3}) # area code
|
||||||
[)\ -]? # optional closing parens, space, or dash
|
[)\ -]? # optional closing parens, space, or dash
|
||||||
(\d{3}) # another three numbers
|
(\d{3}) # another three numbers
|
||||||
[\ -]? # optional space or dash
|
[\ -]? # optional space or dash
|
||||||
(\d{4}) # four more numbers
|
(\d{4}) # four more numbers
|
||||||
)", comments = TRUE)
|
)",
|
||||||
|
comments = TRUE
|
||||||
|
)
|
||||||
|
|
||||||
str_match("514-791-8141", phone)
|
str_match("514-791-8141", phone)
|
||||||
```
|
```
|
||||||
|
@ -593,7 +595,7 @@ str_match("514-791-8141", phone)
|
||||||
If you're using comments and want to match a space, newline, or `#`, you'll need to escape it:
|
If you're using comments and want to match a space, newline, or `#`, you'll need to escape it:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
str_view("x x #", regex("x #", comments = TRUE))
|
str_view("x x #", regex(r"(x #)", comments = TRUE))
|
||||||
str_view("x x #", regex(r"(x\ \#)", comments = TRUE))
|
str_view("x x #", regex(r"(x\ \#)", comments = TRUE))
|
||||||
```
|
```
|
||||||
|
|
||||||
|
@ -605,33 +607,25 @@ You can opt-out of the regular expression rules by using `fixed()`:
|
||||||
str_view(c("", "a", "."), fixed("."))
|
str_view(c("", "a", "."), fixed("."))
|
||||||
```
|
```
|
||||||
|
|
||||||
You can opt out by setting `ignore_case = TRUE`.
|
`fixed()` also gives you the ability to ignore case:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
str_view("x X xy", "X")
|
str_view("x X", "X")
|
||||||
str_view("x X xy", fixed("X", ignore_case = TRUE))
|
str_view("x X", fixed("X", ignore_case = TRUE))
|
||||||
```
|
```
|
||||||
|
|
||||||
If you're working with non-English text, it's slightly safer to use `coll()` rather than
|
If you're working with non-English text, you should generally use `coll()` instead, as it implements the full rules for capitalization as used by the `locale` you specify.
|
||||||
|
See @sec-other-languages for more details.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
str_view("i İ ı I", fixed("İ", ignore_case = TRUE))
|
str_view("i İ ı I", fixed("İ", ignore_case = TRUE))
|
||||||
str_view("i İ ı I", coll("İ", ignore_case = TRUE, locale = "tr"))
|
str_view("i İ ı I", coll("İ", ignore_case = TRUE, locale = "tr"))
|
||||||
```
|
```
|
||||||
|
|
||||||
### Boundaries
|
|
||||||
|
|
||||||
## Practice
|
## Practice
|
||||||
|
|
||||||
To put these ideas in practice we'll solve a few semi-authentic problems using the `words` and `sentences` datasets built into stringr.
|
To put these ideas in practice we'll solve a few semi-authentic problems to show you how you might iteratively solve a more complex problem.
|
||||||
`words` is a list of common English words and `sentences` is a set of simple sentences originally used for testing voice transmission.
|
We'll discuss three general techniques: checking your work by creating simple positive and negative controls, combining regular expressions with Boolean algebra, and creating complex patterns using string manipulation.
|
||||||
|
|
||||||
```{r}
|
|
||||||
str_view(head(words))
|
|
||||||
str_view(head(sentences))
|
|
||||||
```
|
|
||||||
|
|
||||||
The following three sections help you practice the components of a pattern by discussing three general techniques: checking you work by creating simple positive and negative controls, combining regular expressions with Boolean algebra, and creating complex patterns using string manipulation.
|
|
||||||
|
|
||||||
### Check your work
|
### Check your work
|
||||||
|
|
||||||
|
@ -676,7 +670,7 @@ str_detect(neg, pattern)
|
||||||
|
|
||||||
It's typically much easier to come up with positive examples than negative examples, because it takes some time until you're good enough with regular expressions to predict where your weaknesses are.
|
It's typically much easier to come up with positive examples than negative examples, because it takes some time until you're good enough with regular expressions to predict where your weaknesses are.
|
||||||
Nevertheless they're still useful; even if you don't get them correct right away, you can slowly accumulate them as you work on your problem.
|
Nevertheless they're still useful; even if you don't get them correct right away, you can slowly accumulate them as you work on your problem.
|
||||||
If you later get more into programming and learn about unit tests, you can then turn these examples into automated tests that ensure you never make the same mistake twice.)
|
If you later get more into programming and learn about unit tests, you can then turn these examples into automated tests that ensure you never make the same mistake twice.
|
||||||
|
|
||||||
### Boolean operations {#sec-boolean-operations}
|
### Boolean operations {#sec-boolean-operations}
|
||||||
|
|
||||||
|
@ -742,7 +736,7 @@ The basic idea is simple: we just combine alternation with word boundaries.
|
||||||
str_view(sentences, "\\b(red|green|blue)\\b")
|
str_view(sentences, "\\b(red|green|blue)\\b")
|
||||||
```
|
```
|
||||||
|
|
||||||
But it would be tedious to construct this pattern by hand.
|
But as the number of colours grows, it would quickly get tedious to construct this pattern by hand.
|
||||||
Wouldn't it be nice if we could store the colours in a vector?
|
Wouldn't it be nice if we could store the colours in a vector?
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
|
@ -760,15 +754,15 @@ We could make this pattern more comprehensive if we had a good list of colors.
|
||||||
One place we could start from is the list of built-in colours that R can use for plots:
|
One place we could start from is the list of built-in colours that R can use for plots:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
str_view(colors())[1:27]
|
str_view(colors())
|
||||||
```
|
```
|
||||||
|
|
||||||
But first let's eliminate the numbered variants:
|
But let's first eliminate the numbered variants:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
cols <- colors()
|
cols <- colors()
|
||||||
cols <- cols[!str_detect(cols, "\\d")]
|
cols <- cols[!str_detect(cols, "\\d")]
|
||||||
cols[1:27]
|
str_view(cols)
|
||||||
```
|
```
|
||||||
|
|
||||||
Then we can turn this into one giant pattern:
|
Then we can turn this into one giant pattern:
|
||||||
|
@ -778,14 +772,20 @@ pattern <- str_c("\\b(", str_flatten(cols, "|"), ")\\b")
|
||||||
str_view(sentences, pattern)
|
str_view(sentences, pattern)
|
||||||
```
|
```
|
||||||
|
|
||||||
In this example `cols` only contains numbers and letters so you don't need to worry about metacharacters.
|
In this example `cols` only contains numbers and letters so you don't need to worry about special characters.
|
||||||
But in general, when creating patterns from existing strings it's good practice to run through `str_escape()` which will automatically add `\` in front of otherwise special characters.
|
But generally, when creating patterns from existing strings it's wise to run them through `str_escape()` which will automatically escape any special characters.
|
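To see why escaping matters, here is a minimal sketch (added for illustration). It assumes a version of stringr that provides `str_escape()` (1.5.0 or later), and the `extensions` and `files` vectors are hypothetical. It builds the same alternation pattern with and without escaping when the input strings contain a `.`:

```r
library(stringr)

extensions <- c(".R", ".qmd")  # hypothetical strings containing a metacharacter
files <- c("regexps.qmd", "notes_qmd", "plot.R", "plotR")

# Without escaping, "." matches any character, so the pattern is too permissive.
unescaped <- str_c("(", str_flatten(extensions, "|"), ")$")
naive_hits <- str_detect(files, unescaped)  # TRUE TRUE TRUE TRUE

# With str_escape(), "." only matches a literal dot.
escaped <- str_c("(", str_flatten(str_escape(extensions), "|"), ")$")
safe_hits <- str_detect(files, escaped)     # TRUE FALSE TRUE FALSE
```

The unescaped version wrongly accepts `"notes_qmd"` and `"plotR"`, which is the kind of silent over-matching that is hard to spot in a large pattern.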
||||||
|
|
||||||
### Exercises
|
### Exercises
|
||||||
|
|
||||||
1. Construct patterns to find evidence for and against the rule "i before e except after c".
|
1. Construct patterns to find evidence for and against the rule "i before e except after c".
|
||||||
2. `colors()` contains a number of modifiers like "lightgray" and "darkblue". How could you automatically identify these modifiers? (Think about how you might detect and then remove the colors that are being modified.)
|
|
||||||
3. Create a regular expression that finds any base R dataset. You can get a list of these datasets via a special use of the `data()` function: `data(package = "datasets")$results[, "Item"]`. Note that a number of old datasets are individual vectors; these contain the name of the grouping "data frame" in parentheses, so you'll need to also strip these off.
|
2. `colors()` contains a number of modifiers like "lightgray" and "darkblue".
|
||||||
|
How could you automatically identify these modifiers?
|
||||||
|
(Think about how you might detect and then remove the colors that are being modified.)
|
||||||
|
|
||||||
|
3. Create a regular expression that finds any base R dataset.
|
||||||
|
You can get a list of these datasets via a special use of the `data()` function: `data(package = "datasets")$results[, "Item"]`.
|
||||||
|
Note that a number of old datasets are individual vectors; these contain the name of the grouping "data frame" in parentheses, so you'll need to also strip these off.
|
||||||
|
|
||||||
## Elsewhere
|
## Elsewhere
|
||||||
|
|
||||||
|
@ -813,24 +813,25 @@ Fortunately, the basics of regular expressions are so well established that you'
|
||||||
You only need to be aware of the difference when you start to rely on advanced features like complex Unicode character ranges or special features that use the `(?…)` syntax.
|
You only need to be aware of the difference when you start to rely on advanced features like complex Unicode character ranges or special features that use the `(?…)` syntax.
|
||||||
You can learn more about these advanced features in `vignette("regular-expressions", package = "stringr")`.
|
You can learn more about these advanced features in `vignette("regular-expressions", package = "stringr")`.
|
||||||
|
|
||||||
- `apropos()` searches all objects available from the global environment.
|
`apropos()` searches all objects available from the global environment.
|
||||||
This is useful if you can't quite remember the name of the function.
|
This is useful if you can't quite remember the name of the function.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
apropos("replace")
|
apropos("replace")
|
||||||
```
|
```
|
||||||
|
|
||||||
- `dir()` lists all the files in a directory.
|
`dir()` lists all the files in a directory.
|
||||||
The `pattern` argument takes a regular expression and only returns file names that match the pattern.
|
The `pattern` argument takes a regular expression and only returns file names that match the pattern.
|
||||||
For example, you can find all the R Markdown files in the current directory with:
|
For example, you can find all the R Markdown files in the current directory with:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
head(dir(pattern = "\\.Rmd$"))
|
head(dir(pattern = "\\.Rmd$"))
|
||||||
```
|
```
|
||||||
|
|
||||||
(If you're more comfortable with "globs" like `*.Rmd`, you can convert them to regular expressions with `glob2rx()`).
|
(If you're more comfortable with "globs" like `*.Rmd`, you can convert them to regular expressions with `glob2rx()`).
|
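As a small illustration of that conversion (a sketch added here; the file names are hypothetical), `glob2rx()` from the utils package turns a glob into an anchored regular expression that can be passed to any function expecting a regexp:

```r
library(utils)  # attached by default; loaded explicitly for clarity

# Convert the glob to a regular expression.
rx <- glob2rx("*.Rmd")
print(rx)

# Hypothetical file names, used instead of reading a real directory.
files <- c("regexps.Rmd", "regexps.Rmd.bak", "notes.qmd")
matched <- grepl(rx, files)  # TRUE FALSE FALSE
```

Because the result is anchored at both ends, it should behave like the `"\\.Rmd$"` pattern above for ordinary file names.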
||||||
|
|
||||||
## Summary
|
## Summary
|
||||||
|
|
||||||
Another useful reference is [https://www.regular-expressions.info/](https://www.regular-expressions.info/tutorial.html).
|
Another useful reference is [https://www.regular-expressions.info/](https://www.regular-expressions.info/tutorial.html).
|
||||||
It's not R specific, but it covers the most advanced features and explains how regular expressions work under the hood.
|
It's not R specific, but it covers the most advanced features and explains how regular expressions work under the hood.
|
||||||
|
|
||||||
|
|