Separate regexps into own chapter

2021-04-21 08:43:19 -05:00 · 2021-04-21 08:43:19 -05:00 · 18253a1d52
parent 5f45c33adb
commit 18253a1d52
3 changed files with 414 additions and 450 deletions
--- a/_bookdown.yml
+++ b/_bookdown.yml
@@ -25,6 +25,7 @@ rmd_files: [
  "vector-tools.Rmd",
  "missing-values.Rmd",
  "strings.Rmd",
+  "regexps.Rmd",
  "factors.Rmd",
  "datetimes.Rmd",
  "column-wise.Rmd",
--- a/regexps.Rmd
+++ b/regexps.Rmd
@@ -0,0 +1,382 @@
+# Regular expressions
+
+## Matching patterns with regular expressions
+
+Regexps are a very terse language that allows you to describe patterns in strings.
+They take a little while to get your head around, but once you understand them, you'll find them extremely useful.
+
+To learn regular expressions, we'll use `str_view()` and `str_view_all()`.
+These functions take a character vector and a regular expression, and show you how they match.
+We'll start with very simple regular expressions and then gradually get more and more complicated.
+Once you've mastered pattern matching, you'll learn how to apply those ideas with various stringr functions.
+
+### Prerequisites
+
+This chapter will focus on the **stringr** package for string manipulation, which is part of the core tidyverse.
+
+```{r setup, message = FALSE}
+library(tidyverse)
+```
+
+## Basic matches
+
+The simplest patterns match exact strings:
+
+```{r}
+x <- c("apple", "banana", "pear")
+str_view(x, "an")
+```
+
+The next step up in complexity is `.`, which matches any character (except a newline):
+
+```{r}
+str_view(x, ".a.")
+```
+
+But if "`.`" matches any character, how do you match the character "`.`"?
+You need to use an "escape" to tell the regular expression you want to match it exactly, not use its special behaviour.
+Like strings, regexps use the backslash, `\`, to escape special behaviour.
+So to match an `.`, you need the regexp `\.`.
+Unfortunately this creates a problem.
+We use strings to represent regular expressions, and `\` is also used as an escape symbol in strings.
+So to create the regular expression `\.` we need the string `"\\."`.
+
+```{r}
+# To create the regular expression, we need \\
+dot <- "\\."
+
+# But the expression itself only contains one:
+writeLines(dot)
+
+# And this tells R to look for an explicit .
+str_view(c("abc", "a.c", "bef"), "a\\.c")
+```
+
+If `\` is used as an escape character in regular expressions, how do you match a literal `\`?
+Well you need to escape it, creating the regular expression `\\`.
+To create that regular expression, you need to use a string, which also needs to escape `\`.
+That means to match a literal `\` you need to write `"\\\\"` --- you need four backslashes to match one!
+
+```{r}
+x <- "a\\b"
+writeLines(x)
+
+str_view(x, "\\\\")
+```
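+
+As a quick check on the two levels of escaping, the sketch below counts the characters in the pattern string and confirms that it finds a backslash.
+It uses only `nchar()` from base R and `str_detect()` from stringr; the input vector is made up for illustration.
+
+```{r}
+library(stringr)
+
+# The string "\\\\" holds two characters, i.e. the regexp \\
+pattern <- "\\\\"
+nchar(pattern)
+#> [1] 2
+
+# Illustrative input: only the first element contains a literal backslash
+has_backslash <- str_detect(c("a\\b", "ab"), pattern)
+has_backslash
+#> [1]  TRUE FALSE
+```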
+
+In this book, I'll write regular expressions as `\.` and strings that represent the regular expression as `"\\."`.
+
+### Exercises
+
+1.  Explain why each of these strings doesn't match a `\`: `"\"`, `"\\"`, `"\\\"`.
+
+2.  How would you match the sequence `"'\`?
+
+3.  What patterns will the regular expression `\..\..\..` match?
+    How would you represent it as a string?
+
+## Anchors
+
+By default, regular expressions will match any part of a string.
+It's often useful to *anchor* the regular expression so that it matches from the start or end of the string.
+You can use:
+
+-   `^` to match the start of the string.
+-   `$` to match the end of the string.
+
+```{r}
+x <- c("apple", "banana", "pear")
+str_view(x, "^a")
+str_view(x, "a$")
+```
+
+To remember which is which, try this mnemonic which I learned from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): if you begin with power (`^`), you end up with money (`$`).
+
+To force a regular expression to only match a complete string, anchor it with both `^` and `$`:
+
+```{r}
+x <- c("apple pie", "apple", "apple cake")
+str_view(x, "apple")
+str_view(x, "^apple$")
+```
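+
+If you want the matching elements back rather than a highlighted view, a minimal sketch with `str_subset()` and `str_detect()` shows the same anchors at work.
+The vectors here are illustrative and chosen to mirror the examples above.
+
+```{r}
+library(stringr)
+
+fruits <- c("apple", "banana", "pear")
+
+# Keep only the elements that start with "a", then those that end with "a"
+starts_with_a <- str_subset(fruits, "^a")
+ends_with_a <- str_subset(fruits, "a$")
+starts_with_a
+#> [1] "apple"
+ends_with_a
+#> [1] "banana"
+
+# With both anchors, only the complete string "apple" matches
+whole_match <- str_detect(c("apple pie", "apple", "apple cake"), "^apple$")
+whole_match
+#> [1] FALSE  TRUE FALSE
+```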
+
+You can also match the boundary between words with `\b`.
+I don't often use this in R, but I sometimes use it when searching in RStudio for the name of a function that's a component of other functions.
+For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on.
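+
+The same idea can be tried in R itself.
+This is a small sketch on a made-up vector of names; remember that `\b` has to be written as `"\\b"` inside a string.
+
+```{r}
+library(stringr)
+
+# Illustrative names, not taken from any real search
+funs <- c("sum", "summarise", "summary", "rowsum", "sum(x)")
+
+# Without boundaries, every element contains "sum"
+plain <- str_detect(funs, "sum")
+plain
+#> [1] TRUE TRUE TRUE TRUE TRUE
+
+# With boundaries, "sum" must stand alone as a word
+bounded <- str_detect(funs, "\\bsum\\b")
+bounded
+#> [1]  TRUE FALSE FALSE FALSE  TRUE
+```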
+
+### Exercises
+
+1.  How would you match the literal string `"$^$"`?
+
+2.  Given the corpus of common words in `stringr::words`, create regular expressions that find all words that:
+
+    a.  Start with "y".
+    b.  End with "x".
+    c.  Are exactly three letters long. (Don't cheat by using `str_length()`!)
+    d.  Have seven letters or more.
+
+    Since this list is long, you might want to use the `match` argument to `str_view()` to show only the matching or non-matching words.
+
+## Character classes and alternatives
+
+There are a number of special patterns that match more than one character.
+You've already seen `.`, which matches any character apart from a newline.
+There are four other useful tools:
+
+-   `\d`: matches any digit.
+-   `\s`: matches any whitespace (e.g. space, tab, newline).
+-   `[abc]`: matches a, b, or c.
+-   `[^abc]`: matches anything except a, b, or c.
+
+Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`.
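+
+A short sketch of the four tools listed above, using stringr functions on made-up inputs:
+
+```{r}
+library(stringr)
+
+# \d: does the string contain a digit?
+has_digit <- str_detect(c("room 101", "lobby"), "\\d")
+has_digit
+#> [1]  TRUE FALSE
+
+# \s: replace every whitespace character (space or tab here) with "_"
+no_space <- str_replace_all("a b\tc", "\\s", "_")
+no_space
+#> [1] "a_b_c"
+
+# [abc] and [^ab]: first character in the set, first character outside it
+in_set <- str_extract("banana", "[abc]")
+out_of_set <- str_extract("banana", "[^ab]")
+in_set
+#> [1] "b"
+out_of_set
+#> [1] "n"
+```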
+
+A character class containing a single character is a nice alternative to backslash escapes when you want to include a single metacharacter in a regex.
+Many people find this more readable.
+
+```{r}
+# Look for a literal character that normally has special meaning in a regex
+str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
+str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
+str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]")
+```
+
+This works for most (but not all) regex metacharacters: `$` `.` `|` `?` `*` `+` `(` `)` `[` `{`.
+Unfortunately, a few characters have special meaning even inside a character class and must be handled with backslash escapes: `]` `\` `^` and `-`.
+
+You can use *alternation* to pick between one or more alternative patterns.
+For example, `abc|d..f` will match either `"abc"` or `"deaf"`.
+Note that the precedence for `|` is low, so `abc|xyz` matches `abc` or `xyz`, not `abcyz` or `abxyz`.
+Like with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:
+
+```{r}
+str_view(c("grey", "gray"), "gr(e|a)y")
+```
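+
+The low precedence of `|` matters most when it is combined with anchors.
+In this sketch (the input vector is illustrative), the unparenthesised pattern reads as "starts with abc, or ends with xyz", which is probably not what you meant.
+
+```{r}
+library(stringr)
+
+x <- c("abc", "xyz", "abcd", "wxyz")
+
+# Parsed as (^abc)|(xyz$): each anchor applies to one alternative only
+loose <- str_detect(x, "^abc|xyz$")
+loose
+#> [1] TRUE TRUE TRUE TRUE
+
+# Parentheses make both anchors apply to the whole alternation
+strict <- str_detect(x, "^(abc|xyz)$")
+strict
+#> [1]  TRUE  TRUE FALSE FALSE
+```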
+
+### Exercises
+
+1.  Create regular expressions to find all words that:
+
+    a.  Start with a vowel.
+    b.  Contain only consonants. (Hint: think about matching "not"-vowels.)
+    c.  End with `ed`, but not with `eed`.
+    d.  End with `ing` or `ise`.
+
+2.  Empirically verify the rule "i before e except after c".
+
+3.  Is "q" always followed by a "u"?
+
+4.  Write a regular expression that matches a word if it's probably written in British English, not American English.
+
+5.  Create a regular expression that will match telephone numbers as commonly written in your country.
+
+## Repetition / Quantifiers
+
+The next step up in power involves controlling how many times a pattern matches:
+
+-   `?`: 0 or 1
+-   `+`: 1 or more
+-   `*`: 0 or more
+
+```{r}
+x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
+str_view(x, "CC?")
+str_view(x, "CC+")
+str_view(x, 'C[LX]+')
+```
+
+Note that the precedence of these operators is high: a quantifier applies only to the single character or group immediately before it, so you can write `colou?r` to match either American or British spellings.
+That means most uses will need parentheses, like `bana(na)+`.
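+
+To see what the parentheses change, here is a small sketch comparing a quantifier applied to one character with a quantifier applied to a group.
+The test strings are made up for illustration.
+
+```{r}
+library(stringr)
+
+# `?` applies only to the "u"
+spellings <- str_detect(c("color", "colour", "colr"), "colou?r")
+spellings
+#> [1]  TRUE  TRUE FALSE
+
+# Without parentheses, `+` repeats only the final "a"
+one_char <- str_extract("bananana", "banana+")
+one_char
+#> [1] "banana"
+
+# With parentheses, `+` repeats the whole group "na"
+grouped <- str_extract("bananana", "bana(na)+")
+grouped
+#> [1] "bananana"
+```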
+
+You can also specify the number of matches precisely:
+
+-   `{n}`: exactly n
+-   `{n,}`: n or more
+-   `{1,m}`: at most m (and at least 1)
+-   `{n,m}`: between n and m
+
+```{r}
+str_view(x, "C{2}")
+str_view(x, "C{2,}")
+str_view(x, "C{1,3}")
+str_view(x, "C{2,3}")
+```
+
+By default these matches are "greedy": they will match the longest string possible.
+You can make them "lazy", matching the shortest string possible by putting a `?` after them.
+This is an advanced feature of regular expressions, but it's useful to know that it exists:
+
+```{r}
+str_view(x, 'C{2,3}?')
+str_view(x, 'C[LX]+?')
+```
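+
+Another common illustration of greedy versus lazy matching, sketched here on a made-up snippet of HTML, is pulling out a tag.
+The greedy version runs to the last `>` in the string, while the lazy version stops at the first one.
+
+```{r}
+library(stringr)
+
+html <- "<b>bold</b>"
+
+# Greedy: .+ takes as much as it can
+greedy <- str_extract(html, "<.+>")
+greedy
+#> [1] "<b>bold</b>"
+
+# Lazy: .+? takes as little as it can
+lazy <- str_extract(html, "<.+?>")
+lazy
+#> [1] "<b>"
+```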
+
+Collectively, these operators are called **quantifiers** because they quantify how many times a match can occur.
+
+### Exercises
+
+1.  Describe the equivalents of `?`, `+`, `*` in `{m,n}` form.
+
+2.  Describe in words what these regular expressions match: (read carefully to see if I'm using a regular expression or a string that defines a regular expression.)
+
+    a.  `^.*$`
+    b.  `"\\{.+\\}"`
+    c.  `\d{4}-\d{2}-\d{2}`
+    d.  `"\\\\{4}"`
+
+3.  Create regular expressions to find all words that:
+
+    a.  Start with three consonants.
+    b.  Have three or more vowels in a row.
+    c.  Have two or more vowel-consonant pairs in a row.
+
+4.  Solve the beginner regexp crosswords at <https://regexcrossword.com/challenges/beginner>.
+
+## Grouping and backreferences
+
+Earlier, you learned about parentheses as a way to disambiguate complex expressions.
+Parentheses also create a *numbered* capturing group (number 1, 2 etc.).
+A capturing group stores *the part of the string* matched by the part of the regular expression inside the parentheses.
+You can refer to the same text as previously matched by a capturing group with *backreferences*, like `\1`, `\2` etc.
+For example, the following regular expression finds all fruits that have a repeated pair of letters.
+
+```{r}
+str_view(fruit, "(..)\\1", match = TRUE)
+```
+
+(Shortly, you'll also see how they're useful in conjunction with `str_match()`.)
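+
+Here is a minimal sketch of backreferences on a few hand-picked words, first inside the pattern and then in a replacement.
+The replacement example relies on `str_replace()` accepting `\\1` and `\\2` in its `replacement` argument; the inputs are illustrative.
+
+```{r}
+library(stringr)
+
+# (..)\1 needs a pair of characters immediately followed by the same pair
+repeated_pair <- str_detect(c("banana", "coconut", "apple"), "(..)\\1")
+repeated_pair
+#> [1]  TRUE  TRUE FALSE
+
+# Backreferences in the replacement swap the two captured words
+swapped <- str_replace("hello world", "(\\w+) (\\w+)", "\\2 \\1")
+swapped
+#> [1] "world hello"
+```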
+
+### Exercises
+
+1.  Describe, in words, what these expressions will match:
+
+    a.  `(.)\1\1`
+    b.  `"(.)(.)\\2\\1"`
+    c.  `(..)\1`
+    d.  `"(.).\\1.\\1"`
+    e.  `"(.)(.)(.).*\\3\\2\\1"`
+
+2.  Construct regular expressions to match words that:
+
+    a.  Start and end with the same character.
+    b.  Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.)
+    c.  Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.)
+
+## Other uses of regular expressions
+
+There are two useful functions in base R that also use regular expressions:
+
+-   `apropos()` searches all objects available from the global environment.
+    This is useful if you can't quite remember the name of the function.
+
+    ```{r}
+    apropos("replace")
+    ```
+
+-   `dir()` lists all the files in a directory.
+    The `pattern` argument takes a regular expression and only returns file names that match the pattern.
+    For example, you can find all the R Markdown files in the current directory with:
+
+    ```{r}
+    head(dir(pattern = "\\.Rmd$"))
+    ```
+
+    (If you're more comfortable with "globs" like `*.Rmd`, you can convert them to regular expressions with `glob2rx()`.)
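+
+The glob conversion mentioned in the `dir()` item can be sketched as follows.
+The file names are hypothetical, so the example uses `grepl()` on a character vector instead of reading a real directory.
+
+```{r}
+# Hypothetical file names, used in place of a real directory listing
+files <- c("regexps.Rmd", "strings.Rmd", "README.md")
+
+# Convert the glob to a regular expression, then match with base R
+rmd_pattern <- glob2rx("*.Rmd")
+is_rmd <- grepl(rmd_pattern, files)
+is_rmd
+#> [1]  TRUE  TRUE FALSE
+```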
+
+## A caution
+
+A word of caution before we continue: because regular expressions are so powerful, it's easy to try and solve every problem with a single regular expression.
+In the words of Jamie Zawinski:
+
+> Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
+
+As a cautionary tale, check out this regular expression that checks if an email address is valid:
+
+    (?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
+    )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
+    \r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
+    ?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ 
+    \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
+    31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
+    ](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
+    (?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
+    (?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
+    |(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)
+    ?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\
+    r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[
+     \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)
+    ?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t]
+    )*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[
+     \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*
+    )(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
+    )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)
+    *:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+
+    |\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r
+    \n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:
+    \r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t
+    ]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031
+    ]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](
+    ?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?
+    :(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?
+    :\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
+    :(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?
+    [ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] 
+    \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|
+    \\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>
+    @,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
+    (?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t]
+    )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
+    ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?
+    :[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[
+    \]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-
+    \031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(
+    ?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;
+    :\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
+    ^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\"
+    .\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
+    ]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
+    [\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\
+    r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] 
+    \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
+    |\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0
+    00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\
+    .|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,
+    ;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
+    :[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*
+    (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
+    \[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[
+    ^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
+    ]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(
+    ?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
+    ".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(
+    ?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[
+    \["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t
+    ])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t
+    ])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?
+    :\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|
+    \Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:
+    [^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
+    ]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)
+    ?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["
+    ()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
+    ?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>
+    @,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[
+     \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,
+    ;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t]
+    )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
+    ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?
+    (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
+    \[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:
+    \r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[
+    "()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])
+    *))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
+    +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
+    .(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
+    |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(
+    ?:\r\n)?[ \t])*))*)?;\s*)
+
+This is a somewhat pathological example (because email addresses are actually surprisingly complex), but is used in real code.
+See the Stack Overflow discussion at <http://stackoverflow.com/a/201378> for more details.
+
+Don't forget that you're in a programming language and you have other tools at your disposal.
+Instead of creating one complex regular expression, it's often easier to write a series of simpler regexps.
+If you get stuck trying to create a single regexp that solves your problem, take a step back and think about whether you could break the problem down into smaller pieces, solving each challenge before moving on to the next one.
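+
+As a small sketch of that advice, suppose you want to keep identifiers that contain a lowercase letter and a digit but no whitespace.
+The rule and the `ids` vector are invented for illustration; the point is that three simple regexps combined with `&` are easier to read and debug than one clever pattern.
+
+```{r}
+library(stringr)
+
+# Hypothetical identifiers to validate
+ids <- c("abc123", "abc", "12 3a")
+
+# One simple check per requirement
+has_letter <- str_detect(ids, "[a-z]")
+has_digit <- str_detect(ids, "\\d")
+no_space <- !str_detect(ids, "\\s")
+
+# Combine the checks with ordinary logical operators
+valid <- has_letter & has_digit & no_space
+valid
+#> [1]  TRUE FALSE FALSE
+```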
--- a/strings.Rmd
+++ b/strings.Rmd
@@ -214,259 +214,6 @@ TODO: add connection to `arrange()`
 6.  Write a function that turns (e.g.) a vector `c("a", "b", "c")` into the string `a, b, and c`.
    Think carefully about what it should do if given a vector of length 0, 1, or 2.

-## Matching patterns with regular expressions
-
-Regexps are a very terse language that allow you to describe patterns in strings.
-They take a little while to get your head around, but once you understand them, you'll find them extremely useful.
-
-To learn regular expressions, we'll use `str_view()` and `str_view_all()`.
-These functions take a character vector and a regular expression, and show you how they match.
-We'll start with very simple regular expressions and then gradually get more and more complicated.
-Once you've mastered pattern matching, you'll learn how to apply those ideas with various stringr functions.
-
-### Basic matches
-
-The simplest patterns match exact strings:
-
-```{r}
-x <- c("apple", "banana", "pear")
-str_view(x, "an")
-```
-
-The next step up in complexity is `.`, which matches any character (except a newline):
-
-```{r}
-str_view(x, ".a.")
-```
-
-But if "`.`" matches any character, how do you match the character "`.`"?
-You need to use an "escape" to tell the regular expression you want to match it exactly, not use its special behaviour.
-Like strings, regexps use the backslash, `\`, to escape special behaviour.
-So to match an `.`, you need the regexp `\.`.
-Unfortunately this creates a problem.
-We use strings to represent regular expressions, and `\` is also used as an escape symbol in strings.
-So to create the regular expression `\.` we need the string `"\\."`.
-
-```{r}
-# To create the regular expression, we need \\
-dot <- "\\."
-
-# But the expression itself only contains one:
-writeLines(dot)
-
-# And this tells R to look for an explicit .
-str_view(c("abc", "a.c", "bef"), "a\\.c")
-```
-
-If `\` is used as an escape character in regular expressions, how do you match a literal `\`?
-Well you need to escape it, creating the regular expression `\\`.
-To create that regular expression, you need to use a string, which also needs to escape `\`.
-That means to match a literal `\` you need to write `"\\\\"` --- you need four backslashes to match one!
-
-```{r}
-x <- "a\\b"
-writeLines(x)
-
-str_view(x, "\\\\")
-```
-
-In this book, I'll write regular expression as `\.` and strings that represent the regular expression as `"\\."`.
-
-#### Exercises
-
-1.  Explain why each of these strings don't match a `\`: `"\"`, `"\\"`, `"\\\"`.
-
-2.  How would you match the sequence `"'\`?
-
-3.  What patterns will the regular expression `\..\..\..` match?
-    How would you represent it as a string?
-
-### Anchors
-
-By default, regular expressions will match any part of a string.
-It's often useful to *anchor* the regular expression so that it matches from the start or end of the string.
-You can use:
-
-   `^` to match the start of the string.
-   `$` to match the end of the string.
-
-```{r}
-x <- c("apple", "banana", "pear")
-str_view(x, "^a")
-str_view(x, "a$")
-```
-
-To remember which is which, try this mnemonic which I learned from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): if you begin with power (`^`), you end up with money (`$`).
-
-To force a regular expression to only match a complete string, anchor it with both `^` and `$`:
-
-```{r}
-x <- c("apple pie", "apple", "apple cake")
-str_view(x, "apple")
-str_view(x, "^apple$")
-```
-
-You can also match the boundary between words with `\b`.
-I don't often use this in R, but I will sometimes use it when I'm doing a search in RStudio when I want to find the name of a function that's a component of other functions.
-For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on.
-
-#### Exercises
-
-1.  How would you match the literal string `"$^$"`?
-
-2.  Given the corpus of common words in `stringr::words`, create regular expressions that find all words that:
-
-    a.  Start with "y".
-    b.  End with "x"
-    c.  Are exactly three letters long. (Don't cheat by using `str_length()`!)
-    d.  Have seven letters or more.
-
-    Since this list is long, you might want to use the `match` argument to `str_view()` to show only the matching or non-matching words.
-
-### Character classes and alternatives
-
-There are a number of special patterns that match more than one character.
-You've already seen `.`, which matches any character apart from a newline.
-There are four other useful tools:
-
-   `\d`: matches any digit.
-   `\s`: matches any whitespace (e.g. space, tab, newline).
-   `[abc]`: matches a, b, or c.
-   `[^abc]`: matches anything except a, b, or c.
-
-Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`.
-
-A character class containing a single character is a nice alternative to backslash escapes when you want to include a single metacharacter in a regex.
-Many people find this more readable.
-
-```{r}
-# Look for a literal character that normally has special meaning in a regex
-str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
-str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
-str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]")
-```
-
-This works for most (but not all) regex metacharacters: `$` `.` `|` `?` `*` `+` `(` `)` `[` `{`.
-Unfortunately, a few characters have special meaning even inside a character class and must be handled with backslash escapes: `]` `\` `^` and `-`.
-
-You can use *alternation* to pick between one or more alternative patterns.
-For example, `abc|d..f` will match either '"abc"', or `"deaf"`.
-Note that the precedence for `|` is low, so that `abc|xyz` matches `abc` or `xyz` not `abcyz` or `abxyz`.
-Like with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:
-
-```{r}
-str_view(c("grey", "gray"), "gr(e|a)y")
-```
-
-#### Exercises
-
-1.  Create regular expressions to find all words that:
-
-    a.  Start with a vowel.
-    b.  That only contain consonants. (Hint: thinking about matching "not"-vowels.)
-    c.  End with `ed`, but not with `eed`.
-    d.  End with `ing` or `ise`.
-
-2.  Empirically verify the rule "i before e except after c".
-
-3.  Is "q" always followed by a "u"?
-
-4.  Write a regular expression that matches a word if it's probably written in British English, not American English.
-
-5.  Create a regular expression that will match telephone numbers as commonly written in your country.
-
-### Repetition / Quantifiers
-
-The next step up in power involves controlling how many times a pattern matches:
-
-   `?`: 0 or 1
-   `+`: 1 or more
-   `*`: 0 or more
-
-```{r}
-x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
-str_view(x, "CC?")
-str_view(x, "CC+")
-str_view(x, 'C[LX]+')
-```
-
-Note that the precedence of these operators is high, so you can write: `colou?r` to match either American or British spellings.
-That means most uses will need parentheses, like `bana(na)+`.
-
-You can also specify the number of matches precisely:
-
-   `{n}`: exactly n
-   `{n,}`: n or more
-   `{1,m}`: at most m
-   `{n,m}`: between n and m
-
-```{r}
-str_view(x, "C{2}")
-str_view(x, "C{2,}")
-str_view(x, "C{1,3}")
-str_view(x, "C{2,3}")
-```
-
-By default these matches are "greedy": they will match the longest string possible.
-You can make them "lazy", matching the shortest string possible by putting a `?` after them.
-This is an advanced feature of regular expressions, but it's useful to know that it exists:
-
-```{r}
-str_view(x, 'C{2,3}?')
-str_view(x, 'C[LX]+?')
-```
-
-Collectively, these operators are called **quantifiers** because they quantify how many times a match can occur.
-
-#### Exercises
-
-1.  Describe the equivalents of `?`, `+`, `*` in `{m,n}` form.
-
-2.  Describe in words what these regular expressions match: (read carefully to see if I'm using a regular expression or a string that defines a regular expression.)
-
-    a.  `^.*$`
-    b.  `"\\{.+\\}"`
-    c.  `\d{4}-\d{2}-\d{2}`
-    d.  `"\\\\{4}"`
-
-3.  Create regular expressions to find all words that:
-
-    a.  Start with three consonants.
-    b.  Have three or more vowels in a row.
-    c.  Have two or more vowel-consonant pairs in a row.
-
-4.  Solve the beginner regexp crosswords at <https://regexcrossword.com/challenges/beginner>.
-
-### Grouping and backreferences
-
-Earlier, you learned about parentheses as a way to disambiguate complex expressions.
-Parentheses also create a *numbered* capturing group (number 1, 2 etc.).
-A capturing group stores *the part of the string* matched by the part of the regular expression inside the parentheses.
-You can refer to the same text as previously matched by a capturing group with *backreferences*, like `\1`, `\2` etc.
-For example, the following regular expression finds all fruits that have a repeated pair of letters.
-
-```{r}
-str_view(fruit, "(..)\\1", match = TRUE)
-```
-
-(Shortly, you'll also see how they're useful in conjunction with `str_match()`.)
-
-#### Exercises
-
-1.  Describe, in words, what these expressions will match:
-
-    a.  `(.)\1\1`
-    b.  `"(.)(.)\\2\\1"`
-    c.  `(..)\1`
-    d.  `"(.).\\1.\\1"`
-    e.  `"(.)(.)(.).*\\3\\2\\1"`
-
-2.  Construct regular expressions to match words that:
-
-    a.  Start and end with the same character.
-    b.  Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.)
-    c.  Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.)
-
 ## Tools

 Now that you've learned the basics of regular expressions, it's time to learn how to apply them to real problems.
@@ -478,103 +225,6 @@ In this section you'll learn a wide array of stringr functions that let you:
 -   Replace matches with new values.
 -   Split a string based on a match.

-A word of caution before we continue: because regular expressions are so powerful, it's easy to try and solve every problem with a single regular expression.
-In the words of Jamie Zawinski:
-
-> Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
-
-As a cautionary tale, check out this regular expression that checks if a email address is valid:
-
-    (?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
-    )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
-    \r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
-    ?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ 
-    \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
-    31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
-    ](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
-    (?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
-    (?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
-    |(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)
-    ?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\
-    r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[
-     \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)
-    ?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t]
-    )*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[
-     \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*
-    )(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
-    )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)
-    *:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+
-    |\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r
-    \n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:
-    \r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t
-    ]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031
-    ]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](
-    ?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?
-    :(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?
-    :\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
-    :(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?
-    [ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] 
-    \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|
-    \\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>
-    @,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
-    (?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t]
-    )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
-    ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?
-    :[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[
-    \]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-
-    \031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(
-    ?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;
-    :\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
-    ^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\"
-    .\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
-    ]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
-    [\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\
-    r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] 
-    \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
-    |\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0
-    00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\
-    .|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,
-    ;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
-    :[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*
-    (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
-    \[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[
-    ^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
-    ]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(
-    ?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
-    ".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(
-    ?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[
-    \["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t
-    ])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t
-    ])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?
-    :\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|
-    \Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:
-    [^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
-    ]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)
-    ?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["
-    ()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
-    ?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>
-    @,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[
-     \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,
-    ;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t]
-    )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
-    ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?
-    (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
-    \[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:
-    \r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[
-    "()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])
-    *))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
-    +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
-    .(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
-    |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(
-    ?:\r\n)?[ \t])*))*)?;\s*)
-
-This is a somewhat pathological example (email addresses are surprisingly complex), but it is used in real code.
-See the Stack Overflow discussion at <http://stackoverflow.com/a/201378> for more details.
-
-Don't forget that you're in a programming language and you have other tools at your disposal.
-Instead of creating one complex regular expression, it's often easier to write a series of simpler regexps.
-If you get stuck trying to create a single regexp that solves your problem, take a step back and consider whether you could break the problem down into smaller pieces, solving each challenge before moving on to the next one.
-
 ### Detect matches

 To determine if a character vector matches a pattern, use `str_detect()`.
@ -872,6 +522,37 @@ str_split(x, " ")[[1]]
 str_split(x, boundary("word"))[[1]]
 ```

+### Separate
+
+`separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears.
+Take `table3`:
+
+```{r}
+table3
+```
+
+The `rate` column contains both `cases` and `population` variables, and we need to split it into two variables.
+`separate()` takes the name of the column to separate, and the names of the columns to separate into, as shown in Figure \@ref(fig:tidy-separate) and the code below.
+
+```{r}
+table3 %>%
+  separate(rate, into = c("cases", "population"))
+```
+
+```{r tidy-separate, echo = FALSE, out.width = "75%", fig.cap = "Separating `rate` into `cases` and `population` to make `table3` tidy", fig.alt = "Two panels, one with a data frame with three columns (country, year, and rate) and the other with a data frame with four columns (country, year, cases, and population). Arrows show how the rate variable is separated into two variables: cases and population."}
+knitr::include_graphics("images/tidy-17.png")
+```
+
+By default, `separate()` will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter).
+For example, in the code above, `separate()` split the values of `rate` at the forward slash characters.
+If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`.
+For example, we could rewrite the code above as:
+
+```{r eval = FALSE}
+table3 %>%
+  separate(rate, into = c("cases", "population"), sep = "/")
+```
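+
+The `sep` argument is interpreted as a regular expression, so a single pattern can cover more than one separator character.
+The chunk below is a minimal sketch, not part of `table3`: the tibble `rates` and its mix of `/` and `-` separators are made up for illustration, and `convert = TRUE` is added so that the new columns are parsed as numbers instead of being left as character.
+
+```{r}
+library(tidyr)
+
+# Hypothetical data: "cases/population" values like those in `rate`,
+# but written with two different separator characters.
+rates <- tibble::tibble(
+  country = c("Afghanistan", "Brazil"),
+  rate = c("745/19987071", "37737-172006362")
+)
+
+# `sep` is a regular expression: the character class [/-] matches either
+# a forward slash or a hyphen. `convert = TRUE` parses the pieces as numbers.
+out <- rates %>%
+  separate(rate, into = c("cases", "population"), sep = "[/-]", convert = TRUE)
+out
+```
+
+Writing the separator as a character class keeps the call to a single `separate()` step; if the separators were messier, it would probably be easier to clean the column first and then split it.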
+
 #### Exercises

 1.  Split up a string like `"apples, pears, and bananas"` into individual components.
@ -1009,27 +690,6 @@ There are three other functions you can use instead of `regex()`:

 2.  What are the five most common words in `sentences`?

-## Other uses of regular expressions
-
-There are two useful functions in base R that also use regular expressions:
-
--   `apropos()` searches all objects available from the global environment.
-    This is useful if you can't quite remember the name of the function.
-
-    ```{r}
-    apropos("replace")
-    ```
-
--   `dir()` lists all the files in a directory.
-    The `pattern` argument takes a regular expression and only returns file names that match the pattern.
-    For example, you can find all the R Markdown files in the current directory with:
-
-    ```{r}
-    head(dir(pattern = "\\.Rmd$"))
-    ```
-
-    (If you're more comfortable with "globs" like `*.Rmd`, you can convert them to regular expressions with `glob2rx()`.)
-
 ## stringi

 stringr is built on top of the **stringi** package.
@ -1051,83 +711,6 @@ The main difference is the prefix: `str_` vs. `stri_`.

 2.  How do you control the language that `stri_sort()` uses for sorting?

-## tidyr
-
-So far you've learned how to tidy `table2`, `table4a`, and `table4b`, but not `table3`.
-`table3` has a different problem: we have one column (`rate`) that contains two variables (`cases` and `population`).
-To fix this problem, we'll need the `separate()` function.
-You'll also learn about the complement of `separate()`: `unite()`, which you use if a single variable is spread across multiple columns.
-
-```{r}
-library(tidyr)
-```
-
-### Separate
-
-`separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears.
-Take `table3`:
-
-```{r}
-table3
-```
-
-The `rate` column contains both `cases` and `population` variables, and we need to split it into two variables.
-`separate()` takes the name of the column to separate, and the names of the columns to separate into, as shown in Figure \@ref(fig:tidy-separate) and the code below.
-
-```{r}
-table3 %>%
-  separate(rate, into = c("cases", "population"))
-```
-
-```{r tidy-separate, echo = FALSE, out.width = "75%", fig.cap = "Separating `rate` into `cases` and `population` to make `table3` tidy", fig.alt = "Two panels, one with a data frame with three columns (country, year, and rate) and the other with a data frame with four columns (country, year, cases, and population). Arrows show how the rate variable is separated into two variables: cases and population."}
-knitr::include_graphics("images/tidy-17.png")
-```
-
-By default, `separate()` will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter).
-For example, in the code above, `separate()` split the values of `rate` at the forward slash characters.
-If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`.
-For example, we could rewrite the code above as:
-
-```{r eval = FALSE}
-table3 %>%
-  separate(rate, into = c("cases", "population"), sep = "/")
-```
-
-(Formally, `sep` is a regular expression, which you'll learn more about in Chapter \@ref(strings).)
-
-Look carefully at the column types: you'll notice that `cases` and `population` are character columns.
-This is the default behaviour in `separate()`: it leaves the type of the column as is.
-Here, however, it's not very useful as those really are numbers.
-We can ask `separate()` to try to convert to better types using `convert = TRUE`:
-
-```{r}
-table3 %>%
-  separate(rate, into = c("cases", "population"), convert = TRUE)
-```
-
-### Unite
-
-`unite()` is the inverse of `separate()`: it combines multiple columns into a single column.
-You'll need it much less frequently than `separate()`, but it's still a useful tool to have in your back pocket.
-
-We can use `unite()` to rejoin the `cases` and `population` columns that we created in the last example.
-That data is saved as `tidyr::table1`.
-`unite()` takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in `dplyr::select()` style:
-
-```{r}
-table1 %>%
-  unite(rate, cases, population)
-```
-
-In this case we also need to use the `sep` argument.
-The default will place an underscore (`_`) between the values from different columns.
-Here we want `"/"` instead:
-
-```{r}
-table1 %>%
-  unite(rate, cases, population, sep = "/")
-```
-
 ### Exercises

 1.  What do the `extra` and `fill` arguments do in `separate()`?
@ -1177,5 +760,3 @@ table1 %>%
    )
    baker
    ```
-
-##