Polishing regexps

Hadley Wickham 2022-11-07 08:32:57 -06:00
parent f084940a37
commit 2dda48bc96
2 changed files with 88 additions and 57 deletions

View File

@ -116,7 +116,7 @@ Let's kick off that process by practicing with some useful stringr functions
### Exercises
## Key functions
## Key functions {#sec-stringr-regex-funs}
Now that you've got the basics of regular expressions under your belt, let's use them with some stringr and tidyr functions.
In the following section, you'll learn how to detect the presence or absence of a match, how to count the number of matches, how to replace a match with fixed text, and how to extract text using a pattern.
@ -242,10 +242,43 @@ These functions are naturally paired with `mutate()` when doing data cleaning.
### Extract variables
The last function comes from tidyr: `separate_wider_regex()`.
This works similarly to `separate_wider_location()` and `separate_wider_delim()` but you give it a vector of regular expressions rather than a vector widths or a delimiter.
The last function we'll discuss comes from tidyr: `separate_wider_regex()`.
It works like the `separate_wider_position()` and `separate_wider_delim()` functions that you learned about in @sec-string-columns but takes a vector of regular expressions rather than a vector of widths or a delimiter.
<!-- TODO: complete once tidyr has a nice dataset -->
Let's create a simple dataset to show how it works.
Here we have some data derived from `babynames`, giving the name, gender, and age of a bunch of people in a rather weird format[^regexps-5]:
[^regexps-5]: We wish we could reassure you that you'd never see something this weird in real life, but unfortunately over the course of your career you're likely to see much weirder!
```{r}
df <- tribble(
~str,
"<Sheryl>-F_34",
"<Kisha>-F_45",
"<Brandon>-N_33",
"<Sharon>-F_38",
"<Penny>-F_58",
"<Justin>-M_41",
"<Patricia>-F_84",
)
```
To extract this data using `separate_wider_regex()` we just need to construct a sequence of regular expressions that match each piece.
If we want the contents of that piece to appear in the output, we give it a name:
```{r}
df |>
separate_wider_regex(
str,
patterns = c(
"<", name = "[A-Za-z]+", ">-",
gender = ".", "_",
age = "[0-9]+"
)
)
```
If the match fails, you can use `too_few = "debug"` to figure out what went wrong, just like with `separate_wider_delim()` and `separate_wider_position()`.
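For example, here's a minimal sketch of the debug output, using a hypothetical variant of `df` with one malformed row (the bad row is made up for illustration):

```{r}
df_bad <- tribble(
  ~str,
  "<Sheryl>-F_34",
  "<Kisha>-F" # malformed: missing the underscore and age
)

df_bad |>
  separate_wider_regex(
    str,
    patterns = c(
      "<", name = "[A-Za-z]+", ">-",
      gender = ".", "_",
      age = "[0-9]+"
    ),
    too_few = "debug"
  )
```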
### Exercises
@ -257,8 +290,7 @@ This works similarly to `separate_wider_location()` and `separate_wider_delim()`
3. Implement a simple version of `str_to_lower()` using `str_replace_all()`.
4. Switch the first and last letters in `words`.
Which of those strings are still `words`?
4. Create a regular expression that will match telephone numbers as commonly written in your country.
## Pattern details
@ -383,9 +415,9 @@ str_view("a-b-c", "[a\\-c]")
Some character classes are used so commonly that they get their own shortcut.
You've already seen `.`, which matches any character apart from a newline.
There are three other particularly useful pairs[^regexps-5]:
There are three other particularly useful pairs[^regexps-6]:
[^regexps-5]: Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`.
[^regexps-6]: Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`.
- `\d`: matches any digit;\
`\D`: matches anything that isn't a digit.
@ -469,9 +501,9 @@ sentences |>
```
If you want to extract the matches for each group you can use `str_match()`.
But `str_match()` returns a matrix, so it's not particularly easy to work with[^regexps-6]:
But `str_match()` returns a matrix, so it's not particularly easy to work with[^regexps-7]:
[^regexps-6]: Mostly because we never discuss matrices in this book!
[^regexps-7]: Mostly because we never discuss matrices in this book!
```{r}
sentences |>
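  # (a plausible completion of this truncated chunk:)
  str_match("the (\\w+) (\\w+)") |>
  head()
```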
@ -513,14 +545,15 @@ str_match(x, "gr(?:e|a)y")
c. End with "x".
d. Are exactly three letters long. (Don't cheat by using `str_length()`!)
e. Have seven letters or more.
f. Contain a vowel-consonant pair
g. Contain at least two vowel-consonant pairs in a row
f. Contain a vowel-consonant pair.
g. Contain at least two vowel-consonant pairs in a row.
h. Only consist of repeated vowel-consonant pairs.
4. Create 11 regular expressions that match the British or American spellings for each of the following words: grey/gray, modelling/modeling, summarize/summarise, aluminium/aluminum, defence/defense, analog/analogue, center/centre, sceptic/skeptic, aeroplane/airplane, arse/ass, doughnut/donut.
Try and make the shortest possible regex!
5. Create a regular expression that will match telephone numbers as commonly written in your country.
5. Switch the first and last letters in `words`.
Which of those strings are still `words`?
6. Describe in words what these regular expressions match (read carefully to see whether each entry is a regular expression or a string that defines a regular expression):
@ -571,9 +604,9 @@ If you're doing a lot of work with multiline strings (i.e. strings that contain
Finally, if you're writing a complicated regular expression and you're worried you might not understand it in the future, you might try `comments = TRUE`.
It tweaks the pattern language to ignore spaces and new lines, as well as everything after `#`.
This allows you to use comments and whitespace to make complex regular expressions more understandable[^regexps-7], as in the following example:
This allows you to use comments and whitespace to make complex regular expressions more understandable[^regexps-8], as in the following example:
[^regexps-7]: `comments = TRUE` is particularly effective in combination with a raw string, as we use here.
[^regexps-8]: `comments = TRUE` is particularly effective in combination with a raw string, as we use here.
```{r}
phone <- regex(
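  # (a plausible completion of this truncated chunk, based on the surrounding
  # discussion; the exact pattern is a reconstruction:)
  r"(
    \(?     # optional opening parens
    (\d{3}) # area code
    [)\-]?  # optional closing parens or dash
    \ ?     # optional space
    (\d{3}) # another three digits
    [\ -]?  # optional space or dash
    (\d{4}) # four more digits
  )",
  comments = TRUE
)

str_extract(c("514-791-8141", "(123) 456 7890", "123456"), phone)
```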
@ -613,7 +646,7 @@ str_view("x X", "X")
str_view("x X", fixed("X", ignore_case = TRUE))
```
If you're working with non-English text, you should generally use `coll()` instead, as it implements the full rules for capitalization as used by the `locale` you specify.
If you're working with non-English text, you will probably want `coll()` instead of `fixed()`, as it implements the full rules for capitalization as used by the `locale` you specify.
See @sec-other-languages for more details on locales.
```{r}
@ -623,7 +656,7 @@ str_view("i İ ı I", coll("İ", ignore_case = TRUE, locale = "tr"))
## Practice
To put these ideas in practice we'll solve a few semi-authentic problems to show you how you might iteratively solve a more complex problem.
To put these ideas into practice we'll next solve a few semi-authentic problems.
We'll discuss three general techniques: checking your work by creating simple positive and negative controls, combining regular expressions with Boolean algebra, and creating complex patterns using string manipulation.
### Check your work
@ -635,7 +668,7 @@ Using the `^` anchor alone is not enough:
str_view(sentences, "^The")
```
Because it all matches sentences starting with `They` or `Those`.
Because that pattern also matches sentences starting with words like `They` or `These`.
We need to make sure that the "e" is the last letter in the word, which we can do by adding a word boundary:
```{r}
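# (a plausible completion of this truncated chunk:)
str_view(sentences, "^The\\b")
```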
@ -656,7 +689,7 @@ str_view(sentences, "^(She|He|It|They)\\b")
```
You might wonder how to spot such a mistake if it didn't occur in the first few matches.
A good technique is to create a few positive and negative matches and use them to test that you pattern works as expected.
A good technique is to create a few positive and negative matches and use them to test that your pattern works as expected:
```{r}
pos <- c("He is a boy", "She had a good time")
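# (lines elided here; presumably negative examples and the pattern, e.g.:)
neg <- c("Shells come from the sea", "Hadley said it was a great day")
pattern <- "^(She|He|It|They)\\b"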
@ -667,9 +700,8 @@ str_detect(pos, pattern)
str_detect(neg, pattern)
```
It's typically much easier to come up with positive examples than negative examples, because it takes a while before you're good enough with regular expressions to predict where your weaknesses are.
Nevertheless they're still useful; even if you don't get them correct right away, you can slowly accumulate them as you work on the problem.
If you later get more into programming and learn about unit tests, you can then turn these examples into automated tests that ensure you never make the same mistake twice.
It's typically much easier to come up with good positive examples than negative examples, because it takes a while before you're good enough with regular expressions to predict where your weaknesses are.
Nevertheless, they're still useful: as you work on the problem you can slowly accumulate a collection of your mistakes, ensuring that you never make the same mistake twice.
### Boolean operations {#sec-boolean-operations}
@ -680,7 +712,7 @@ One technique is to create a character class that contains all letters except fo
str_view(words, "^[^aeiou]+$")
```
But we can make this problem a bit easier by flipping the problem around.
But you can make this problem a bit easier by flipping the problem around.
Instead of looking for words that contain only consonants, we could look for words that don't contain any vowels:
```{r}
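# (a plausible completion of this truncated chunk:)
str_view(words[!str_detect(words, "[aeiou]")])
```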
@ -756,7 +788,7 @@ One place we could start from is the list of built-in colors that R can use for
str_view(colors())
```
But lets first element the numbered variants:
But let's first eliminate the numbered variants:
```{r}
cols <- colors()
@ -764,7 +796,8 @@ cols <- cols[!str_detect(cols, "\\d")]
str_view(cols)
```
Then we can turn this into one giant pattern:
Then we can turn this into one giant pattern.
We won't show the pattern here because it's huge, but you can see it working:
```{r}
pattern <- str_c("\\b(", str_flatten(cols, "|"), ")\\b")
@ -772,7 +805,7 @@ str_view(sentences, pattern)
```
In this example `cols` only contains numbers and letters so you don't need to worry about special characters.
But generally, when creating patterns from existing strings it's wise to run them through `str_escape()` which will automatically escape any special characters.
But in general, whenever you create patterns from existing strings it's wise to run them through `str_escape()` to escape any special characters.
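For example, here's a quick sketch (the input strings are made up) showing how `str_escape()` makes a literal `.` match only itself:

```{r}
str_view(c("a.c", "abc"), "a.c")
str_view(c("a.c", "abc"), str_escape("a.c"))
```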
### Exercises
@ -790,62 +823,60 @@ But generally, when creating patterns from existing strings it's wise to run the
4. Create a regular expression that finds any base R dataset.
You can get a list of these datasets via a special use of the `data()` function: `data(package = "datasets")$results[, "Item"]`.
Note that a number of old datasets are individual vectors; these contain the name of the grouping "data frame" in parentheses, so you'll need to also strip these off.
Note that a number of old datasets are individual vectors; these contain the name of the grouping "data frame" in parentheses, so you'll need to strip those off.
## Regular expressions
## Regular expressions in other places
As well as the stringr and tidyr functions we discussed at the very start of other chapter, there are many other places where you can use regular expressions.
The following sections describe some other use stringr functions, some other places in the tidyverse that use regular expressions, and some handy base R functions.
### stringr
- `str_locate()`, `str_locate_all()`
- `str_split()` and friends
- `str_extract()`
As well as the stringr and tidyr functions we discussed at the very start of the chapter, there are many other places in R where you can use regular expressions.
The following sections describe some other useful functions in the wider tidyverse and base R.
### tidyverse
- `matches()`: a "tidyselect" function that you can use anywhere in the tidyverse when selecting variables (e.g. `dplyr::select()`, `rename_with()`, `across()`, ...).
There are three other particularly useful places where you might want to use a regular expression:
- `names_pattern` in `pivot_longer()`
- `matches(pattern)` will select all variables whose name matches the supplied pattern.
It's a "tidyselect" function that you can use anywhere in any tidyverse function that selects variables (e.g. `select()`, `rename_with()` and `across()`).
- `delim` in `separate_delim_longer()` and `separate_delim_wider()`.
By default it matches a fixed string, but you can use `regex()` to make it match a pattern.
`regex(", ?")` is particularly useful.
- `pivot_longer()`'s `names_pattern` argument takes a vector of regular expressions, just like `separate_wider_regex()`.
It's useful when extracting data out of variable names with a complex structure.
- The `delim` argument in `separate_longer_delim()` and `separate_wider_delim()` usually matches a fixed string, but you can use `regex()` to make it match a pattern.
This is useful, for example, if you want to match a comma that is optionally followed by a space, i.e. `regex(", ?")`.
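For example, here's a small sketch (with a made-up data frame) of using `regex(", ?")` as a delimiter:

```{r}
df_commas <- tibble(x = "apple,banana, cherry")

df_commas |>
  separate_longer_delim(x, delim = regex(", ?"))
```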
### Base R
The regular expressions used by stringr are very slightly different to those of base R.
That's because stringr is built on top of the [stringi package](https://stringi.gagolewski.com), which is in turn built on top of the [ICU engine](https://unicode-org.github.io/icu/userguide/strings/regexp.html), whereas base R functions (like `gsub()` and `grepl()`) use either the [TRE engine](https://github.com/laurikari/tre) or the [PCRE engine](https://www.pcre.org).
Fortunately, the basics of regular expressions are so well established that you'll encounter few variations when working with the patterns you'll learn in this book (and we'll point them out where important).
You only need to be aware of the difference when you start to rely on advanced features like complex Unicode character ranges or special features that use the `(?…)` syntax.
You can learn more about these advanced features in `vignette("regular-expressions", package = "stringr")`.
`apropos()` searches all objects available from the global environment.
This is useful if you can't quite remember the name of the function.
`apropos(pattern)` searches all objects available from the global environment that match the given pattern.
This is useful if you can't quite remember the name of a function:
```{r}
apropos("replace")
```
`dir()` lists all the files in a directory.
The `pattern` argument takes a regular expression and only returns file names that match the pattern.
`dir(path, pattern)` lists all files in `path` that match a regular expression `pattern`.
For example, you can find all the R Markdown files in the current directory with:
```{r}
head(dir(pattern = "\\.Rmd$"))
```
(If you're more comfortable with "globs" like `*.Rmd`, you can convert them to regular expressions with `glob2rx()`).
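For example, `glob2rx()` converts a glob into the equivalent regular expression:

```{r}
glob2rx("*.Rmd")
```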
It's worth noting that the pattern language used by base R is very slightly different to that used by stringr.
That's because stringr is built on top of the [stringi package](https://stringi.gagolewski.com), which is in turn built on top of the [ICU engine](https://unicode-org.github.io/icu/userguide/strings/regexp.html), whereas base R functions use either the [TRE engine](https://github.com/laurikari/tre) or the [PCRE engine](https://www.pcre.org), depending on whether or not you've set `perl = TRUE`.
Fortunately, the basics of regular expressions are so well established that you'll encounter few variations when working with the patterns you'll learn in this book.
You only need to be aware of the difference when you start to rely on advanced features like complex Unicode character ranges or special features that use the `(?…)` syntax.
## Summary
To continue learning about regular expressions, start with `vignette("regular-expressions", package = "stringr")`: it documents the full set of syntax supported by stringr.
Don't forget that stringr is implemented on top of stringi, so if you're struggling to find a function that does what you need, don't be afraid to look in stringi too.
You'll find it very easy to pick up because it follows the same conventions as stringr.
Regular expressions are one of the most compact languages out there, with every punctuation character potentially overloaded with meaning.
They're definitely confusing at first, but as you train your eyes to read them and your brain to understand them, you unlock a huge amount of power.
In this chapter, you've started your journey to become a regular expression master by learning the most useful stringr functions and the most important components of the regular expression language.
There are plenty of resources to learn more.
A good place to start is `vignette("regular-expressions", package = "stringr")`: it documents the full set of syntax supported by stringr.
Another useful reference is [https://www.regular-expressions.info/](https://www.regular-expressions.info/tutorial.html).
It's not R specific, but it covers the most advanced features and explains how regular expressions work under the hood.
It's not R specific, but you can use it to learn about the most advanced features of regexes and how they work under the hood.
It's also good to know that stringr is implemented on top of the stringi package, by Marek Gagolewski.
If you're struggling to find a function that does what you need in stringr, don't be afraid to look in stringi.
You'll find stringi very easy to pick up because it follows many of the same conventions as stringr.
In the next chapter, we'll talk about a data structure closely related to strings: factors.
Factors are used to represent categorical data in R, data where there is a fixed and known set of possible values identified by a vector of strings.

View File

@ -297,7 +297,7 @@ df2 |>
separate_longer_position(x, width = 1)
```
### Splitting into columns
### Splitting into columns {#sec-string-columns}
`separate_wider_delim()` and `separate_wider_position()` are most useful when there are a fixed number of components in each string, and you want to spread them into columns.
They are more complicated than their `by` equivalents because you need to name the columns.
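For example, here's a minimal sketch (with a made-up data frame) of how the `names` argument works:

```{r}
df_codes <- tibble(x = c("a-1", "b-2", "c-3"))

df_codes |>
  separate_wider_delim(x, delim = "-", names = c("code", "n"))
```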