Typos and small grammar mistakes in regex chapter (#1129)

AlbertRapp 2022-11-09 21:47:46 +01:00 committed by GitHub
parent 4a761c77c6
commit 8edfbadba3
1 changed files with 62 additions and 50 deletions

@ -10,21 +10,21 @@ status("polishing")
## Introduction
In @sec-strings, you learned a whole bunch of useful functions for working with strings.
In this chapter we'll learn even more focusing on functions that use **regular expressions**, are a concise and powerful language for describing patterns within strings.
The term "regular expression" is a bit of a mouthful, so most people abbreviate to "regex"[^regexps-1] or "regexp".
In this chapter we'll focus on functions that use **regular expressions**, a concise and powerful language for describing patterns within strings.
The term "regular expression" is a bit of a mouthful, so most people abbreviate it to "regex"[^regexps-1] or "regexp".
[^regexps-1]: You can pronounce with either a hard-g (reg-x) or a soft-g (rej-x).
[^regexps-1]: You can pronounce it with either a hard-g (reg-x) or a soft-g (rej-x).
The chapter starts with the basics of regular expressions and the most useful stringr functions for data analysis.
We'll then expand your knowledge of patterns, to cover seven important new topics (escaping, anchoring, character classes, shorthand classes, quantifiers, precedence, and grouping).
Next we'll talk about some of the other types of pattern that stringr functions can work with, and the various "flags" that allow you to tweak the operation of regular expressions.
We'll then expand your knowledge of patterns and cover seven important new topics (escaping, anchoring, character classes, shorthand classes, quantifiers, precedence, and grouping).
Next, we'll talk about some of the other types of patterns that stringr functions can work with, and the various "flags" that allow you to tweak the operation of regular expressions.
We'll finish up with a survey of other places in the tidyverse and base R where you might use regexes.
### Prerequisites
::: callout-important
This chapter relies on features only found in stringr 1.5.0 and tidyr 1.3.0 which are still in development.
If you want to live life on the edge you can get the dev versions with `devtools::install_github(c("tidyverse/stringr", "tidyverse/tidyr"))`.
If you want to live life on the edge, you can get the dev versions with `devtools::install_github(c("tidyverse/stringr", "tidyverse/tidyr"))`.
:::
In this chapter, we'll use regular expression functions from stringr and tidyr, both core members of the tidyverse, as well as data from the babynames package.
@ -45,11 +45,11 @@ Through this chapter we'll use a mix of very simple inline examples so you can g
## Pattern basics {#sec-reg-basics}
We'll use with `str_view()` to learn how regex patterns work.
We'll use `str_view()` to learn how regex patterns work.
We used `str_view()` in the last chapter to better understand a string vs its printed representation, and now we'll use it with its second argument, a regular expression.
When this is supplied, `str_view()` will show only the elements of the string the match, surrounding each match with `<>`, and, where possible, highlight the match in blue.
When this is supplied, `str_view()` will show only the elements of the string vector that match, surrounding each match with `<>`, and, where possible, highlighting the match in blue.
The simplest patterns consist of letters and numbers, which match those characters exactly:
The simplest patterns consist of letters and numbers which match those characters exactly:
```{r}
str_view(fruit, "berry")
@ -57,7 +57,7 @@ str_view(fruit, "berry")
str_view(fruit, "BERRY")
```
Letters and numbers match exactly and so are called **literal characters**.
Letters and numbers match exactly and are called **literal characters**.
Punctuation characters like `.`, `+`, `*`, `[`, `]`, `?` have special meanings[^regexps-2] and are called **meta-characters**.
For example, `.` will match any character[^regexps-3], so `"a."` will match any string that contains an "a" followed by another character:
@ -76,7 +76,11 @@ Or we could find all the fruits that contain an "a", followed by three letters,
str_view(fruit, "a...e")
```
**Quantifiers** control how many times a pattern can match: `?` makes a pattern optional (i.e. it matches 0 or 1 times), `+` lets a pattern repeat (i.e. it matches at least once), and `*` lets a pattern be optional or repeat (i.e. it matches any number of times, including 0).
**Quantifiers** control how many times a pattern can match:
- `?` makes a pattern optional (i.e. it matches 0 or 1 times)
- `+` lets a pattern repeat (i.e. it matches at least once)
- `*` lets a pattern be optional or repeat (i.e. it matches any number of times, including 0).
```{r}
# ab? matches an "a", optionally followed by a "b".
@ -117,24 +121,24 @@ str_view(fruit, "aa|ee|ii|oo|uu")
Regular expressions are very compact and use a lot of punctuation characters, so they can seem overwhelming and hard to read at first.
Don't worry; you'll get better with practice, and simple patterns will soon become second nature.
Lets start kick of that process by practicing with some useful stringr functions.
Let's kick off that process by practicing with some useful stringr functions.
### Exercises
## Key functions {#sec-stringr-regex-funs}
Now that you've got the basics of regular expressions under your belt, lets use them with some stringr and tidyr functions.
Now that you've got the basics of regular expressions under your belt, let's use them with some stringr and tidyr functions.
In the following section, you'll learn about how to detect the presence or absence of a match, how to count the number of matches, how to replace a match with fixed text, and how to extract text using a pattern.
### Detect matches
`str_detect()` returns a logical vector that says is `TRUE` is the pattern matched an element of the character vector, and `FALSE` otherwise:
`str_detect()` returns a logical vector that is `TRUE` if the pattern matched an element of the character vector and `FALSE` otherwise:
```{r}
str_detect(c("a", "b", "c"), "[aeiou]")
```
Since `str_detect()` returns a logical vector the same length as the vector, it pairs well with `filter()`.
Since `str_detect()` returns a logical vector of the same length as the initial vector, it pairs well with `filter()`.
For example, this code finds all the most popular names containing a lower-case "x":
```{r}
@ -166,7 +170,7 @@ babynames |>
geom_line()
```
There are two functions that are closely related to `str_detect()`: `str_subset()` returns just the strings that contain a match, and `str_which()` returns the indexes of strings that have a match:
There are two functions that are closely related to `str_detect()`, namely `str_subset()` which returns just the strings that contain a match and `str_which()` which returns the indexes of strings that have a match:
```{r}
str_subset(c("a", "b", "c"), "[aeiou]")
@ -243,7 +247,7 @@ x <- c("apple", "pear", "banana")
str_remove_all(x, "[aeiou]")
```
These functions are naturally paired with `mutate()` when doing data cleaning., and you'll often apply them repeatedly to peel off layers of inconsistent formatting.
These functions are naturally paired with `mutate()` when doing data cleaning, and you'll often apply them repeatedly to peel off layers of inconsistent formatting.
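
As a minimal sketch of "peeling off layers", the following cleans a small vector of price strings in two steps. The `raw` vector is hypothetical illustration data, not part of the chapter; inside `mutate()` the same calls would be applied to a column instead of a standalone vector.

```{r}
library(stringr)

# Hypothetical messy price strings (assumed for illustration only).
raw <- c("  $1,200 ", "$35", " 7,000,000")

clean <- raw |>
  str_remove_all("[$,]") |>  # layer 1: inside [], `$` is a literal dollar sign
  str_trim()                 # layer 2: drop leading and trailing whitespace

prices <- as.numeric(clean)
prices
```

Each step removes one kind of inconsistency, which keeps every individual pattern simple.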
### Extract variables
@ -299,12 +303,12 @@ If the match fails, you can use `too_short = "debug"` to figure out what went wr
## Pattern details
Now that you understand the basics of the pattern language and how it use it with some stringr and tidyr functions, its time to dig into more of the details.
Now that you understand the basics of the pattern language and how to use it with some stringr and tidyr functions, it's time to dig into more of the details.
First, we'll start with **escaping**, which allows you to match metacharacters that would otherwise be treated specially.
Next you'll learn about **anchors**, which allow you to match the start or end of the string.
Then you'll more learn about **character classes** and their shortcuts, which allow you to match any character from a set.
Next you'll learn the final details of **quantifiers**, which control how many times a pattern can match.
Then we have to cover the important (but complex) topic of **operator precedence** and parentheses.
Next, you'll learn about **anchors** which allow you to match the start or end of the string.
Then, you'll learn more about **character classes** and their shortcuts which allow you to match any character from a set.
Next, you'll learn the final details of **quantifiers** which control how many times a pattern can match.
Then, we have to cover the important (but complex) topic of **operator precedence** and parentheses.
And we'll finish off with some details of **grouping** components of the pattern.
The terms we use here are the technical names for each component.
@ -312,8 +316,9 @@ They're not always the most evocative of their purpose, but it's very helpful to
### Escaping {#sec-regexp-escaping}
In order to match a literal `.`, you need an **escape**, which tells the regular expression to match metacharacters literally.
Like strings, regexps use the backslash for escaping, so to match a `.`, you need the regexp `\.`.
In order to match a literal `.`, you need an **escape** which tells the regular expression to match metacharacters literally.
Like strings, regexps use the backslash for escaping.
So, to match a `.`, you need the regexp `\.`.
Unfortunately this creates a problem.
We use strings to represent regular expressions, and `\` is also used as an escape symbol in strings.
So to create the regular expression `\.` we need the string `"\\."`, as the following example shows.
@ -333,7 +338,7 @@ In this book, we'll usually write regular expression without quotes, like `\.`.
If we need to emphasize what you'll actually type, we'll surround it with quotes and add extra escapes, like `"\\."`.
If `\` is used as an escape character in regular expressions, how do you match a literal `\`?
Well you need to escape it, creating the regular expression `\\`.
Well, you need to escape it, creating the regular expression `\\`.
To create that regular expression, you need to use a string, which also needs to escape `\`.
That means to match a literal `\` you need to write `"\\\\"` --- you need four backslashes to match one!
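
The two levels of escaping can be checked on a small example. This is a sketch on a hypothetical vector `x`, where the first element contains a single literal backslash:

```{r}
library(stringr)

# Hypothetical test strings; "a\\b" is the three characters a, \, b.
x <- c("a\\b", "a.b", "ab")
writeLines(x)  # shows the strings as the regex engine sees them

# Four backslashes in the string become the regex \\, a literal backslash.
has_backslash <- str_detect(x, "\\\\")

# Two backslashes plus a dot become the regex \., a literal dot.
has_dot <- str_detect(x, "\\.")

# An unescaped dot matches any character, so every element matches.
has_any <- str_detect(x, ".")
```

Only the first element contains a backslash and only the second contains a literal dot, while the unescaped `.` matches all three.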
@ -408,9 +413,9 @@ A **character class**, or character **set**, allows you to match any character i
As we discussed above, you can construct your own sets with `[]`, where `[abc]` matches a, b, or c.
There are three characters that have special meaning inside of `[]`:
- `-` defines a range, e.g. `[a-z]`: matches any lower case letter and `[0-9]` matches any number.
- `^` takes the inverse of the set, e.g. `[^abc]`: matches anything except a, b, or c.
- `\` escapes special characters, so `[\^\-\]]`: matches `^`, `-`, or `]`.
- `-` defines a range, e.g. `[a-z]` matches any lower case letter and `[0-9]` matches any number.
- `^` takes the inverse of the set, e.g. `[^abc]` matches anything except a, b, or c.
- `\` escapes special characters, so `[\^\-\]]` matches `^`, `-`, or `]`.
Here are a few examples:
@ -432,12 +437,12 @@ There are three other particularly useful pairs[^regexps-6]:
[^regexps-6]: Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`.
- `\d`: matches any digit;\
`\D`: matches anything that isn't a digit.
- `\s`: matches any whitespace (e.g. space, tab, newline);\
`\S`: matches anything that isn't whitespace.
- `\w`: matches any "word" character, i.e. letters and numbers;\
`\W`: matches any "non-word" character.
- `\d` matches any digit;\
`\D` matches anything that isn't a digit.
- `\s` matches any whitespace (e.g. space, tab, newline);\
`\S` matches anything that isn't whitespace.
- `\w` matches any "word" character, i.e. letters and numbers;\
`\W` matches any "non-word" character.
The following code demonstrates the six shortcuts with a selection of letters, numbers, and punctuation characters.
@ -483,13 +488,14 @@ Does it match the complete string a or the complete string b, or does it match a
The answer to these questions is determined by operator precedence, similar to the PEMDAS or BEDMAS rules you might have learned in school.
You know that `a + b * c` is equivalent to `a + (b * c)` not `(a + b) * c` because `*` has higher precedence and `+` has lower precedence: you compute `*` before `+`.
Similarly, regular expressions have their own precedence rules: quantifiers have high precedence and alternation has low precedence which means that `ab+` is equivalent to `a(b+)`, and `^a|b$` is equivalent to `(^a)|(b$)`.
Just like with algebra, you can use parentheses to override the usual order.
But unlike algebra you're unlikely to remember the precedence rules for regexes, so feel free to use parentheses liberally.
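
To see the effect of precedence, here is a small sketch comparing `^a|b$` with a parenthesized version. The vector `x` is made up for illustration:

```{r}
library(stringr)

# Hypothetical test strings chosen to separate the two readings.
x <- c("a", "b", "ab", "abc", "cat")

# Alternation has low precedence, so this means (^a)|(b$):
# "starts with a" OR "ends with b".
loose <- str_detect(x, "^a|b$")

# Parentheses make the anchors apply to the whole alternation:
# the complete string must be "a" or "b".
strict <- str_detect(x, "^(a|b)$")

data.frame(x, loose, strict)
```

The first pattern also matches "ab" and "abc", whereas the parenthesized pattern matches only the complete strings "a" and "b".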
### Grouping and capturing
As well overriding operator precedence, parentheses have another important effect: they create **capturing groups** that allow you to use to sub-components of the match.
As well as overriding operator precedence, parentheses have another important effect: they create **capturing groups** that allow you to use sub-components of the match.
The first way to use a capturing group is to refer back to it within a match with a **back reference**: `\1` refers to the match contained in the first parenthesis, `\2` in the second parenthesis, and so on.
For example, the following pattern finds all fruits that have a repeated pair of letters:
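
As a minimal sketch of the same idea on a toy vector (the three words below are chosen for illustration and are not the `fruit` data), `(..)\1` matches any two characters that are immediately repeated:

```{r}
library(stringr)

# Hypothetical words, assumed for illustration only.
x <- c("banana", "coconut", "apple")

# (..) captures two characters; \\1 requires the same two characters again.
has_repeat <- str_detect(x, "(..)\\1")

# str_extract() returns the first match, or NA when there is none.
repeated <- str_extract(x, "(..)\\1")
repeated
```

"banana" contains "anan" and "coconut" contains "coco", while "apple" has no repeated pair.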
@ -587,7 +593,8 @@ This allows you control the so called regex flags and match various types of fix
### Regex flags {#sec-flags}
There are a number of settings that can use to control the details of the regexp, which are often called **flags** in other programming languages.
There are a number of settings that you can use to control the details of the regexp.
These settings are often called **flags** in other programming languages.
In stringr, you can use these by wrapping the pattern in a call to `regex()`.
The most useful flag is probably `ignore_case = TRUE` because it allows characters to match either their uppercase or lowercase forms:
@ -597,7 +604,7 @@ str_view(bananas, "banana")
str_view(bananas, regex("banana", ignore_case = TRUE))
```
If you're doing a lot of work with multiline strings (i.e. strings that contain `\n`), `dotall`and `multiline` also be useful:
If you're doing a lot of work with multiline strings (i.e. strings that contain `\n`), `dotall` and `multiline` may also be useful:
- `dotall = TRUE` lets `.` match everything, including `\n`:
@ -669,8 +676,12 @@ str_view("i İ ı I", coll("İ", ignore_case = TRUE, locale = "tr"))
## Practice
To put these ideas in practice we'll next solve a few semi-authentic problems.
We'll discuss three general techniques: checking you work by creating simple positive and negative controls, combining regular expressions with Boolean algebra, and creating complex patterns using string manipulation.
To put these ideas into practice we'll solve a few semi-authentic problems next.
We'll discuss three general techniques:
1. checking your work by creating simple positive and negative controls
2. combining regular expressions with Boolean algebra
3. creating complex patterns using string manipulation
### Check your work
@ -702,7 +713,7 @@ str_view(sentences, "^(She|He|It|They)\\b")
```
You might wonder how you might spot such a mistake if it didn't occur in the first few matches.
A good technique is to create a few positive and negative matches and use them to test that you pattern works as expected:
A good technique is to create a few positive and negative matches and use them to test that your pattern works as expected:
```{r}
pos <- c("He is a boy", "She had a good time")
@ -757,7 +768,7 @@ words[str_detect(words, "a.*e.*i.*o.*u")]
words[str_detect(words, "u.*o.*i.*e.*a")]
```
It's much simpler to combine six calls to `str_detect()`:
It's much simpler to combine five calls to `str_detect()`:
```{r}
words[
@ -817,7 +828,7 @@ pattern <- str_c("\\b(", str_flatten(cols, "|"), ")\\b")
str_view(sentences, pattern)
```
In this example `cols` only contains numbers and letters so you don't need to worry about metacharacters.
In this example, `cols` only contains numbers and letters so you don't need to worry about metacharacters.
But in general, whenever you create patterns from existing strings it's wise to run them through `str_escape()` to ensure they match literally.
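
The difference is easy to see with strings that do contain metacharacters. This is a sketch with hypothetical `terms` and `candidates` vectors, and it assumes a stringr version that provides `str_escape()` (1.5.0 or later):

```{r}
library(stringr)

# Hypothetical search terms containing the metacharacters `.` and `+`.
terms <- c("a.b", "c+d")
candidates <- c("a.b", "axb", "c+d", "ccd")

# Without escaping, `.` and `+` keep their special meaning.
unescaped <- str_c("(", str_flatten(terms, "|"), ")")
naive <- str_detect(candidates, unescaped)

# With str_escape(), each term matches only itself.
escaped <- str_c("(", str_flatten(str_escape(terms), "|"), ")")
literal <- str_detect(candidates, escaped)

data.frame(candidates, naive, literal)
```

The unescaped pattern wrongly matches "axb" and "ccd" and misses "c+d", while the escaped pattern matches exactly the two original terms.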
### Exercises
@ -840,7 +851,7 @@ But in general, whenever you create create patterns from existing strings it's w
## Regular expressions in other places
As well as the stringr and tidyr functions we discussed at the very start of other chapter, there are many other places in R where you can use regular expressions.
Just like in the stringr and tidyr functions, there are many other places in R where you can use regular expressions.
The following sections describe some other useful functions in the wider tidyverse and base R.
### tidyverse
@ -879,18 +890,19 @@ You only need to be aware of the difference when you start to rely on advanced f
## Summary
Regular expressions are one of the most compact languages out there, with every punctuation character potentially overloaded with meaning.
They're definitely confusing at first, but as you train your eyes to read them and your brain to understand them you unlock a huge amount of powerful.
In this chapter, you've started your journey to become a regular expression master by learning the most useful stringr functions and the most important components of the regular expression language.
With every punctuation character potentially overloaded with meaning, regular expressions are one of the most compact languages out there.
They're definitely confusing at first but as you train your eyes to read them and your brain to understand them, you unlock a powerful skill that you can use in R and in many other places.
In this chapter, you've started your journey to become a regular expression master by learning the most useful stringr functions and the most important components of the regular expression language.
And there are plenty of resources to learn more.
There are plenty of resources to learn more.
A good place to start is `vignette("regular-expressions", package = "stringr")`: it documents the full set of syntax supported by stringr.
Another useful reference is [https://www.regular-expressions.info/](https://www.regular-expressions.info/tutorial.html).
It's not R specific, but you can use it to learn about the most advanced features of regexes and how they work under the hood.
It's also good to know that stringr is implemented on top of the stringi package, by Marek Gagolewsk.
It's also good to know that stringr is implemented on top of the stringi package by Marek Gagolewski.
If you're struggling to find a function that does what you need in stringr, don't be afraid to look in stringi.
You'll find stringi very easy to pick up because it follows many of the same conventions as stringr.
In the next chapter, we'll talk about a data structure closely related to strings: factors.
Factors are used to represent categorical data in R, data where there is a fixed and known set of possible values identified by a vector of strings.
Factors are used to represent categorical data in R, i.e. data with a fixed and known set of possible values identified by a vector of strings.