More regexp polish

This commit is contained in:
Hadley Wickham 2022-11-05 12:06:02 -05:00
parent 40a56c55ed
commit f97f5479e3
1 changed file with 30 additions and 29 deletions


@ -273,8 +273,6 @@ And we'll finish off with some details of **grouping** components of the pattern
The terms we use here are the technical names for each component.
They're not always the most evocative of their purpose, but it's very helpful to know the correct terms if you later want to Google for more details.
We'll concentrate on showing how these patterns work with `str_view()`; remember that you can use them with any of the functions that you learned above.
### Escaping {#sec-regexp-escaping}
In order to match a literal `.`, you need an **escape**, which tells the regular expression to ignore the special behavior and match exactly.
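For instance, a minimal sketch (the regexp escape `\.` has to be written as `"\\."` inside an R string):

```{r}
# "\\." escapes the special behavior of ".", matching only a literal period
str_view(c("abc", "a.c", "bef"), "a\\.c")
```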
@ -491,7 +489,7 @@ sentences |>
```
But then you've basically recreated your own version of `separate_wider_regex()`.
Indeed, behind the scenes, `separate_wider_regex()` converts your vector of patterns to a single regex that uses grouping to capture the named components.
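A rough sketch of that idea, with made-up data and patterns for illustration:

```{r}
df <- tibble::tibble(id = c("x1", "y2"))
# the friendly interface: a named vector of patterns
df |> tidyr::separate_wider_regex(id, patterns = c(letter = "[a-z]", number = "\\d"))
# roughly the single regex it builds behind the scenes
str_match(c("x1", "y2"), "([a-z])(\\d)")
```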
Occasionally, you'll want to use parentheses without creating matching groups.
You can create a non-capturing group with `(?:)`.
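A quick sketch of the difference, using `str_match()`, which returns one column per capturing group:

```{r}
x <- c("grey", "gray")
str_match(x, "gr(e|a)y")    # capturing group adds a column for the vowel
str_match(x, "gr(?:e|a)y")  # non-capturing group returns just the full match
```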
@ -553,25 +551,27 @@ str_view(bananas, "banana")
str_view(bananas, regex("banana", ignore_case = TRUE))
```
If you're doing a lot of work with multiline strings (i.e. strings that contain `\n`), `dotall` and `multiline` may also be useful:
- `dotall = TRUE` lets `.` match everything, including `\n`:

```{r}
x <- "Line 1\nLine 2\nLine 3"
str_view(x, ".Line")
str_view(x, regex(".Line", dotall = TRUE))
```
- `multiline = TRUE` makes `^` and `$` match the start and end of each line rather than the start and end of the complete string:

```{r}
x <- "Line 1\nLine 2\nLine 3"
str_view(x, "^Line")
str_view(x, regex("^Line", multiline = TRUE))
```
Finally, if you're writing a complicated regular expression and you're worried you might not understand it in the future, you might try `comments = TRUE`.
It tweaks the pattern language to ignore spaces and new lines, as well as everything after `#`.
This allows you to use comments and whitespace to make complex regular expressions more understandable[^regexps-7], as in the following example:
[^regexps-7]: `comments = TRUE` is particularly effective in combination with a raw string, as we use here.
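A sketch of what that example might look like (the phone-number pattern here is illustrative, written as a raw string so the backslashes don't need doubling):

```{r}
phone <- regex(
  r"(
    \(?     # optional opening parens
    (\d{3}) # area code
    [)\-]?  # optional closing parens or dash
    \ ?     # optional space
    (\d{3}) # another three numbers
    [\ -]?  # optional space or dash
    (\d{4}) # four more numbers
  )",
  comments = TRUE
)

str_extract(c("514-791-8141", "(514) 791 8141"), phone)
```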
@ -614,7 +614,7 @@ str_view("x X", fixed("X", ignore_case = TRUE))
```
If you're working with non-English text, you should generally use `coll()` instead, as it implements the full rules for capitalization as used by the `locale` you specify.
See @sec-other-languages for more details on locales.
```{r}
str_view("i İ ı I", fixed("İ", ignore_case = TRUE))
```
@ -667,8 +667,8 @@ str_detect(pos, pattern)
str_detect(neg, pattern)
```
It's typically much easier to come up with positive examples than negative examples, because it takes a while before you're good enough with regular expressions to predict where your weaknesses are.
Nevertheless, they're still useful; even if you don't get them correct right away, you can slowly accumulate them as you work on the problem.
If you later get more into programming and learn about unit tests, you can then turn these examples into automated tests that ensure you never make the same mistake twice.
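For instance, a minimal sketch of that idea, assuming the `pos`, `neg`, and `pattern` objects defined above and the testthat package:

```{r}
library(testthat)

test_that("pattern matches all positive and no negative examples", {
  expect_true(all(str_detect(pos, pattern)))   # every positive example matches
  expect_false(any(str_detect(neg, pattern)))  # no negative example matches
})
```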
### Boolean operations {#sec-boolean-operations}
@ -684,7 +684,7 @@ But we can make this problem a bit easier by flipping the problem around.
Instead of looking for words that contain only consonants, we could look for words that don't contain any vowels:
```{r}
str_view(words[!str_detect(words, "[aeiou]")])
```
This is a useful technique whenever you're dealing with logical combinations, particularly those involving "and" or "not".
@ -692,7 +692,7 @@ For example, imagine if you want to find all words that contain "a" and "b".
There's no "and" operator built into regular expressions, so we have to tackle it by looking for all words that contain an "a" followed by a "b", or a "b" followed by an "a":
```{r}
str_view(words, "a.*b|b.*a")
```
It's simpler to combine the results of two calls to `str_detect()`:
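Something like this sketch, reusing the `words` vector from above:

```{r}
# both conditions must hold, evaluated as separate logical vectors
str_view(words[str_detect(words, "a") & str_detect(words, "b")])
```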
@ -735,8 +735,8 @@ The basic idea is simple: we just combine alternation with word boundaries.
```{r}
str_view(sentences, "\\b(red|green|blue)\\b")
```
But as the number of colors grows, it would quickly get tedious to construct this pattern by hand.
Wouldn't it be nice if we could store the colors in a vector?
```{r}
rgb <- c("red", "green", "blue")
```
@ -750,7 +750,7 @@ str_c("\\b(", str_flatten(rgb, "|"), ")\\b")
```
We could make this pattern more comprehensive if we had a good list of colors.
One place we could start from is the list of built-in colors that R can use for plots:
```{r}
str_view(colors())
```
@ -786,15 +786,16 @@ But generally, when creating patterns from existing strings it's wise to run the
3. `colors()` contains a number of modifiers like "lightgray" and "darkblue".
How could you automatically identify these modifiers?
(Think about how you might detect and then remove the colors that are modified).
4. Create a regular expression that finds any base R dataset.
You can get a list of these datasets via a special use of the `data()` function: `data(package = "datasets")$results[, "Item"]`.
Note that a number of old datasets are individual vectors; these contain the name of the grouping "data frame" in parentheses, so you'll need to also strip these off.
## Regular expressions
As well as the stringr and tidyr functions we discussed at the very start of the chapter, there are many other places where you can use regular expressions.
The following sections describe some other useful stringr functions, some other places in the tidyverse that use regular expressions, and some handy base R functions.
### stringr