More regexp polish
parent
40a56c55ed
commit
f97f5479e3
59
regexps.qmd
|
@ -273,8 +273,6 @@ And we'll finish off with some details of **grouping** components of the pattern
|
||||||
The terms we use here are the technical names for each component.
|
The terms we use here are the technical names for each component.
|
||||||
They're not always the most evocative of their purpose, but it's very helpful to know the correct terms if you later want to Google for more details.
|
They're not always the most evocative of their purpose, but it's very helpful to know the correct terms if you later want to Google for more details.
|
||||||
|
|
||||||
We'll concentrate on showing how these patterns work with `str_view()`; remember that you can use them with any of the functions that you learned above.
|
|
||||||
|
|
||||||
### Escaping {#sec-regexp-escaping}
|
### Escaping {#sec-regexp-escaping}
|
||||||
|
|
||||||
In order to match a literal `.`, you need an **escape**, which tells the regular expression to ignore the special behavior and match exactly.
|
In order to match a literal `.`, you need an **escape**, which tells the regular expression to ignore the special behavior and match exactly.
|
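A minimal sketch of what escaping changes, assuming stringr is loaded; the input vector `x` is made up for illustration. Because the pattern is written as an R string, the backslash itself has to be doubled.

```r
library(stringr)

# Hypothetical input, not from the source file
x <- c("abc", "a.c", "bef")

# Unescaped: `.` matches any character, so "abc" and "a.c" both match
unescaped <- str_detect(x, "a.c")

# Escaped: "\\." in an R string is the regex `\.`, a literal dot
escaped <- str_detect(x, "a\\.c")
```

Only `"a.c"` should match the escaped pattern.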
||||||
|
@ -491,7 +489,7 @@ sentences |>
|
||||||
```
|
```
|
||||||
|
|
||||||
But then you've basically recreated your own version of `separate_wider_regex()`.
|
But then you've basically recreated your own version of `separate_wider_regex()`.
|
||||||
And indeed, behind the scenes `separate_wider_regex()` converts your vector of patterns to a single regex that uses grouping to capture the named components.
|
Indeed, behind the scenes, `separate_wider_regex()` converts your vector of patterns to a single regex that uses grouping to capture the named components.
|
||||||
|
|
||||||
Occasionally, you'll want to use parentheses without creating matching groups.
|
Occasionally, you'll want to use parentheses without creating matching groups.
|
||||||
You can create a non-capturing group with `(?:)`.
|
You can create a non-capturing group with `(?:)`.
|
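To illustrate the difference, here is a small sketch assuming stringr is loaded; the example strings are invented for this purpose. `str_match()` returns one column for the full match plus one column per capturing group, so a non-capturing group leaves only the first column.

```r
library(stringr)

# Hypothetical input, not from the source file
x <- c("a gray cat", "a grey dog")

# Capturing group: the result has an extra column for the group
capturing <- str_match(x, "gr(e|a)y")

# Non-capturing group: only the full match is returned
non_capturing <- str_match(x, "gr(?:e|a)y")
```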
||||||
|
@ -553,25 +551,27 @@ str_view(bananas, "banana")
|
||||||
str_view(bananas, regex("banana", ignore_case = TRUE))
|
str_view(bananas, regex("banana", ignore_case = TRUE))
|
||||||
```
|
```
|
||||||
|
|
||||||
If you're doing a lot of work with multiline strings (i.e. strings that contain `\n`), `dotall`and `multiline` also be useful.
|
If you're doing a lot of work with multiline strings (i.e. strings that contain `\n`), `dotall` and `multiline` can also be useful:
|
||||||
`dotall = TRUE` lets `.` match everything, including `\n`:
|
|
||||||
|
|
||||||
```{r}
|
- `dotall = TRUE` lets `.` match everything, including `\n`:
|
||||||
x <- "Line 1\nLine 2\nLine 3"
|
|
||||||
str_view(x, ".Line")
|
|
||||||
str_view(x, regex(".Line", dotall = TRUE))
|
|
||||||
```
|
|
||||||
|
|
||||||
And `multiline = TRUE` makes `^` and `$` match the start and end of each line rather than the start and end of the complete string:
|
```{r}
|
||||||
|
x <- "Line 1\nLine 2\nLine 3"
|
||||||
|
str_view(x, ".Line")
|
||||||
|
str_view(x, regex(".Line", dotall = TRUE))
|
||||||
|
```
|
||||||
|
|
||||||
```{r}
|
- `multiline = TRUE` makes `^` and `$` match the start and end of each line rather than the start and end of the complete string:
|
||||||
x <- "Line 1\nLine 2\nLine 3"
|
|
||||||
str_view(x, "^Line")
|
|
||||||
str_view(x, regex("^Line", multiline = TRUE))
|
|
||||||
```
|
|
||||||
|
|
||||||
Finally, if you're writing a complicated regular expression and you're worried you might not understand it in the future, you might find `comments = TRUE` to be useful.
|
```{r}
|
||||||
It ignores spaces and new lines, as well is everything after `#`, allowing you to use comments and whitespace to make complex regular expressions more understandable[^regexps-7].
|
x <- "Line 1\nLine 2\nLine 3"
|
||||||
|
str_view(x, "^Line")
|
||||||
|
str_view(x, regex("^Line", multiline = TRUE))
|
||||||
|
```
|
||||||
|
|
||||||
|
Finally, if you're writing a complicated regular expression and you're worried you might not understand it in the future, you might try `comments = TRUE`.
|
||||||
|
It tweaks the pattern language to ignore spaces and new lines, as well as everything after `#`.
|
||||||
|
This allows you to use comments and whitespace to make complex regular expressions more understandable[^regexps-7], as in the following example:
|
||||||
|
|
||||||
[^regexps-7]: `comments = TRUE` is particularly effective in combination with a raw string, as we use here.
|
[^regexps-7]: `comments = TRUE` is particularly effective in combination with a raw string, as we use here.
|
||||||
|
|
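A minimal sketch of `comments = TRUE` combined with a raw string, assuming stringr is loaded and R 4.0 or later (for the `r"(...)"` syntax). The year-month pattern and the inputs are hypothetical and chosen only to keep the example short.

```r
library(stringr)

# Whitespace and everything after `#` inside the pattern are ignored,
# so each component can sit on its own annotated line.
year_month <- regex(
  r"(
    ^(\d{4})  # four-digit year
    -         # literal separator
    (\d{2})$  # two-digit month
  )",
  comments = TRUE
)

# Hypothetical inputs, not from the source file
matches <- str_detect(c("2024-05", "24-5"), year_month)
```

The raw string avoids doubling every backslash, which keeps the annotated pattern close to how the regular expression is usually written.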
||||||
|
@ -614,7 +614,7 @@ str_view("x X", fixed("X", ignore_case = TRUE))
|
||||||
```
|
```
|
||||||
|
|
||||||
If you're working with non-English text, you should generally use `coll()` instead, as it implements the full rules for capitalization as used by the `locale` you specify.
|
If you're working with non-English text, you should generally use `coll()` instead, as it implements the full rules for capitalization as used by the `locale` you specify.
|
||||||
See \@#sec-other-languages for more details.
|
See @sec-other-languages for more details on locales.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
str_view("i İ ı I", fixed("İ", ignore_case = TRUE))
|
str_view("i İ ı I", fixed("İ", ignore_case = TRUE))
|
||||||
|
@ -667,8 +667,8 @@ str_detect(pos, pattern)
|
||||||
str_detect(neg, pattern)
|
str_detect(neg, pattern)
|
||||||
```
|
```
|
||||||
|
|
||||||
It's typically much easier to come up with positive examples than negative examples, because it takes some time until you're good enough with regular expressions to predict where your weaknesses are.
|
It's typically much easier to come up with positive examples than negative examples, because it takes a while before you're good enough with regular expressions to predict where your weaknesses are.
|
||||||
Nevertheless they're still useful; even if you don't get them correct right away, you can slowly accumulate them as you work on your problem.
|
Nevertheless they're still useful; even if you don't get them correct right away, you can slowly accumulate them as you work on the problem.
|
||||||
If you later get more into programming and learn about unit tests, you can then turn these examples into automated tests that ensure you never make the same mistake twice.
|
If you later get more into programming and learn about unit tests, you can then turn these examples into automated tests that ensure you never make the same mistake twice.
|
||||||
|
|
||||||
### Boolean operations {#sec-boolean-operations}
|
### Boolean operations {#sec-boolean-operations}
|
||||||
|
@ -684,7 +684,7 @@ But we can make this problem a bit easier by flipping the problem around.
|
||||||
Instead of looking for words that contain only consonants, we could look for words that don't contain any vowels:
|
Instead of looking for words that contain only consonants, we could look for words that don't contain any vowels:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
words[!str_detect(words, "[aeiou]")]
|
str_view(words[!str_detect(words, "[aeiou]")])
|
||||||
```
|
```
|
||||||
|
|
||||||
This is a useful technique whenever you're dealing with logical combinations, particularly those involving "and" or "not".
|
This is a useful technique whenever you're dealing with logical combinations, particularly those involving "and" or "not".
|
||||||
|
@ -692,7 +692,7 @@ For example, imagine if you want to find all words that contain "a" and "b".
|
||||||
There's no "and" operator built in to regular expressions so we have to tackle it by looking for all words that contain an "a" followed by a "b", or a "b" followed by an "a":
|
There's no "and" operator built in to regular expressions so we have to tackle it by looking for all words that contain an "a" followed by a "b", or a "b" followed by an "a":
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
words[str_detect(words, "a.*b|b.*a")]
|
str_view(words, "a.*b|b.*a")
|
||||||
```
|
```
|
||||||
|
|
||||||
It's simpler to combine the results of two calls to `str_detect()`:
|
It's simpler to combine the results of two calls to `str_detect()`:
|
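As a sketch of that approach, assuming stringr is loaded; the vector `w` is a made-up stand-in for `words`. Each call tests one condition and `&` combines the two logical vectors.

```r
library(stringr)

# Hypothetical stand-in for `words`
w <- c("able", "back", "cat", "bed")

# Keep only the strings that contain both an "a" and a "b", in any order
both <- w[str_detect(w, "a") & str_detect(w, "b")]
```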
||||||
|
@ -735,8 +735,8 @@ The basic idea is simple: we just combine alternation with word boundaries.
|
||||||
str_view(sentences, "\\b(red|green|blue)\\b")
|
str_view(sentences, "\\b(red|green|blue)\\b")
|
||||||
```
|
```
|
||||||
|
|
||||||
But as the number of colours grows, it would quickly get tedious to construct this pattern by hand.
|
But as the number of colors grows, it would quickly get tedious to construct this pattern by hand.
|
||||||
Wouldn't it be nice if we could store the colours in a vector?
|
Wouldn't it be nice if we could store the colors in a vector?
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
rgb <- c("red", "green", "blue")
|
rgb <- c("red", "green", "blue")
|
||||||
|
@ -750,7 +750,7 @@ str_c("\\b(", str_flatten(rgb, "|"), ")\\b")
|
||||||
```
|
```
|
||||||
|
|
||||||
We could make this pattern more comprehensive if we had a good list of colors.
|
We could make this pattern more comprehensive if we had a good list of colors.
|
||||||
One place we could start from is the list of built-in colours that R can use for plots:
|
One place we could start from is the list of built-in colors that R can use for plots:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
str_view(colors())
|
str_view(colors())
|
||||||
|
@ -786,15 +786,16 @@ But generally, when creating patterns from existing strings it's wise to run the
|
||||||
|
|
||||||
3. `colors()` contains a number of modifiers like "lightgray" and "darkblue".
|
3. `colors()` contains a number of modifiers like "lightgray" and "darkblue".
|
||||||
How could you automatically identify these modifiers?
|
How could you automatically identify these modifiers?
|
||||||
(Think about how you might detect and removed what colors are being modified).
|
(Think about how you might detect and then remove the colors that are modified).
|
||||||
|
|
||||||
4. Create a regular expression that finds any base R dataset.
|
4. Create a regular expression that finds any base R dataset.
|
||||||
You can get a list of these datasets via a special use of the `data()` function: `data(package = "datasets")$results[, "Item"]`.
|
You can get a list of these datasets via a special use of the `data()` function: `data(package = "datasets")$results[, "Item"]`.
|
||||||
Note that a number of old datasets are individual vectors; these contain the name of the grouping "data frame" in parentheses, so you'll need to also strip these off.
|
Note that a number of old datasets are individual vectors; these contain the name of the grouping "data frame" in parentheses, so you'll need to also strip these off.
|
||||||
|
|
||||||
## Elsewhere
|
## Regular expressions
|
||||||
|
|
||||||
The are a bunch of other places you can use regular expressions outside of stringr.
|
As well as the stringr and tidyr functions we discussed at the very start of the chapter, there are many other places where you can use regular expressions.
|
||||||
|
The following sections describe some other useful stringr functions, some other places in the tidyverse that use regular expressions, and some handy base R functions.
|
||||||
|
|
||||||
### stringr
|
### stringr
|
||||||
|
|
||||||
|
|