From 18253a1d52128c53e50cf8135d46637853c3e8fd Mon Sep 17 00:00:00 2001 From: Hadley Wickham Date: Wed, 21 Apr 2021 08:43:19 -0500 Subject: [PATCH] Separate regexps into own chapter --- _bookdown.yml | 1 + regexps.Rmd | 382 +++++++++++++++++++++++++++++++++++++++ strings.Rmd | 481 ++++---------------------------------------------- 3 files changed, 414 insertions(+), 450 deletions(-) create mode 100644 regexps.Rmd diff --git a/_bookdown.yml b/_bookdown.yml index 1c2dddc..ef6904c 100644 --- a/_bookdown.yml +++ b/_bookdown.yml @@ -25,6 +25,7 @@ rmd_files: [ "vector-tools.Rmd", "missing-values.Rmd", "strings.Rmd", + "regexps.Rmd", "factors.Rmd", "datetimes.Rmd", "column-wise.Rmd", diff --git a/regexps.Rmd b/regexps.Rmd new file mode 100644 index 0000000..0e45762 --- /dev/null +++ b/regexps.Rmd @@ -0,0 +1,382 @@ +# Regular expressions + +## Matching patterns with regular expressions + +Regexps are a very terse language that allow you to describe patterns in strings. +They take a little while to get your head around, but once you understand them, you'll find them extremely useful. + +To learn regular expressions, we'll use `str_view()` and `str_view_all()`. +These functions take a character vector and a regular expression, and show you how they match. +We'll start with very simple regular expressions and then gradually get more and more complicated. +Once you've mastered pattern matching, you'll learn how to apply those ideas with various stringr functions. + +### Prerequisites + +This chapter will focus on the **stringr** package for string manipulation, which is part of the core tidyverse. 
+ +```{r setup, message = FALSE} +library(tidyverse) +``` + +## Basic matches + +The simplest patterns match exact strings: + +```{r} +x <- c("apple", "banana", "pear") +str_view(x, "an") +``` + +The next step up in complexity is `.`, which matches any character (except a newline): + +```{r} +str_view(x, ".a.") +``` + +But if "`.`" matches any character, how do you match the character "`.`"? +You need to use an "escape" to tell the regular expression you want to match it exactly, not use its special behaviour. +Like strings, regexps use the backslash, `\`, to escape special behaviour. +So to match an `.`, you need the regexp `\.`. +Unfortunately this creates a problem. +We use strings to represent regular expressions, and `\` is also used as an escape symbol in strings. +So to create the regular expression `\.` we need the string `"\\."`. + +```{r} +# To create the regular expression, we need \\ +dot <- "\\." + +# But the expression itself only contains one: +writeLines(dot) + +# And this tells R to look for an explicit . +str_view(c("abc", "a.c", "bef"), "a\\.c") +``` + +If `\` is used as an escape character in regular expressions, how do you match a literal `\`? +Well you need to escape it, creating the regular expression `\\`. +To create that regular expression, you need to use a string, which also needs to escape `\`. +That means to match a literal `\` you need to write `"\\\\"` --- you need four backslashes to match one! + +```{r} +x <- "a\\b" +writeLines(x) + +str_view(x, "\\\\") +``` + +In this book, I'll write regular expression as `\.` and strings that represent the regular expression as `"\\."`. + +### Exercises + +1. Explain why each of these strings don't match a `\`: `"\"`, `"\\"`, `"\\\"`. + +2. How would you match the sequence `"'\`? + +3. What patterns will the regular expression `\..\..\..` match? + How would you represent it as a string? + +## Anchors + +By default, regular expressions will match any part of a string. 
+It's often useful to *anchor* the regular expression so that it matches from the start or end of the string. +You can use: + +- `^` to match the start of the string. +- `$` to match the end of the string. + +```{r} +x <- c("apple", "banana", "pear") +str_view(x, "^a") +str_view(x, "a$") +``` + +To remember which is which, try this mnemonic which I learned from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): if you begin with power (`^`), you end up with money (`$`). + +To force a regular expression to only match a complete string, anchor it with both `^` and `$`: + +```{r} +x <- c("apple pie", "apple", "apple cake") +str_view(x, "apple") +str_view(x, "^apple$") +``` + +You can also match the boundary between words with `\b`. +I don't often use this in R, but I will sometimes use it when I'm doing a search in RStudio when I want to find the name of a function that's a component of other functions. +For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on. + +### Exercises + +1. How would you match the literal string `"$^$"`? + +2. Given the corpus of common words in `stringr::words`, create regular expressions that find all words that: + + a. Start with "y". + b. End with "x" + c. Are exactly three letters long. (Don't cheat by using `str_length()`!) + d. Have seven letters or more. + + Since this list is long, you might want to use the `match` argument to `str_view()` to show only the matching or non-matching words. + +## Character classes and alternatives + +There are a number of special patterns that match more than one character. +You've already seen `.`, which matches any character apart from a newline. +There are four other useful tools: + +- `\d`: matches any digit. +- `\s`: matches any whitespace (e.g. space, tab, newline). +- `[abc]`: matches a, b, or c. +- `[^abc]`: matches anything except a, b, or c. 
+ +Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`. + +A character class containing a single character is a nice alternative to backslash escapes when you want to include a single metacharacter in a regex. +Many people find this more readable. + +```{r} +# Look for a literal character that normally has special meaning in a regex +str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c") +str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c") +str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]") +``` + +This works for most (but not all) regex metacharacters: `$` `.` `|` `?` `*` `+` `(` `)` `[` `{`. +Unfortunately, a few characters have special meaning even inside a character class and must be handled with backslash escapes: `]` `\` `^` and `-`. + +You can use *alternation* to pick between one or more alternative patterns. +For example, `abc|d..f` will match either '"abc"', or `"deaf"`. +Note that the precedence for `|` is low, so that `abc|xyz` matches `abc` or `xyz` not `abcyz` or `abxyz`. +Like with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want: + +```{r} +str_view(c("grey", "gray"), "gr(e|a)y") +``` + +### Exercises + +1. Create regular expressions to find all words that: + + a. Start with a vowel. + b. That only contain consonants. (Hint: thinking about matching "not"-vowels.) + c. End with `ed`, but not with `eed`. + d. End with `ing` or `ise`. + +2. Empirically verify the rule "i before e except after c". + +3. Is "q" always followed by a "u"? + +4. Write a regular expression that matches a word if it's probably written in British English, not American English. + +5. Create a regular expression that will match telephone numbers as commonly written in your country. 
+ +## Repetition / Quantifiers + +The next step up in power involves controlling how many times a pattern matches: + +- `?`: 0 or 1 +- `+`: 1 or more +- `*`: 0 or more + +```{r} +x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII" +str_view(x, "CC?") +str_view(x, "CC+") +str_view(x, 'C[LX]+') +``` + +Note that the precedence of these operators is high, so you can write: `colou?r` to match either American or British spellings. +That means most uses will need parentheses, like `bana(na)+`. + +You can also specify the number of matches precisely: + +- `{n}`: exactly n +- `{n,}`: n or more +- `{1,m}`: at most m +- `{n,m}`: between n and m + +```{r} +str_view(x, "C{2}") +str_view(x, "C{2,}") +str_view(x, "C{1,3}") +str_view(x, "C{2,3}") +``` + +By default these matches are "greedy": they will match the longest string possible. +You can make them "lazy", matching the shortest string possible by putting a `?` after them. +This is an advanced feature of regular expressions, but it's useful to know that it exists: + +```{r} +str_view(x, 'C{2,3}?') +str_view(x, 'C[LX]+?') +``` + +Collectively, these operators are called **quantifiers** because they quantify how many times a match can occur. + +### Exercises + +1. Describe the equivalents of `?`, `+`, `*` in `{m,n}` form. + +2. Describe in words what these regular expressions match: (read carefully to see if I'm using a regular expression or a string that defines a regular expression.) + + a. `^.*$` + b. `"\\{.+\\}"` + c. `\d{4}-\d{2}-\d{2}` + d. `"\\\\{4}"` + +3. Create regular expressions to find all words that: + + a. Start with three consonants. + b. Have three or more vowels in a row. + c. Have two or more vowel-consonant pairs in a row. + +4. Solve the beginner regexp crosswords at [<https://regexcrossword.com/challenges/beginner>](https://regexcrossword.com/challenges/beginner){.uri}. + +## Grouping and backreferences + +Earlier, you learned about parentheses as a way to disambiguate complex expressions. 
+Parentheses also create a *numbered* capturing group (number 1, 2 etc.). +A capturing group stores *the part of the string* matched by the part of the regular expression inside the parentheses. +You can refer to the same text as previously matched by a capturing group with *backreferences*, like `\1`, `\2` etc. +For example, the following regular expression finds all fruits that have a repeated pair of letters. + +```{r} +str_view(fruit, "(..)\\1", match = TRUE) +``` + +(Shortly, you'll also see how they're useful in conjunction with `str_match()`.) + +### Exercises + +1. Describe, in words, what these expressions will match: + + a. `(.)\1\1` + b. `"(.)(.)\\2\\1"` + c. `(..)\1` + d. `"(.).\\1.\\1"` + e. `"(.)(.)(.).*\\3\\2\\1"` + +2. Construct regular expressions to match words that: + + a. Start and end with the same character. + b. Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.) + c. Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.) + +## Other uses of regular expressions + +There are two useful function in base R that also use regular expressions: + +- `apropos()` searches all objects available from the global environment. + This is useful if you can't quite remember the name of the function. + + ```{r} + apropos("replace") + ``` + +- `dir()` lists all the files in a directory. + The `pattern` argument takes a regular expression and only returns file names that match the pattern. + For example, you can find all the R Markdown files in the current directory with: + + ```{r} + head(dir(pattern = "\\.Rmd$")) + ``` + + (If you're more comfortable with "globs" like `*.Rmd`, you can convert them to regular expressions with `glob2rx()`): + +## A caution + +A word of caution before we continue: because regular expressions are so powerful, it's easy to try and solve every problem with a single regular expression. 
+In the words of Jamie Zawinski: + +> Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. + +As a cautionary tale, check out this regular expression that checks if a email address is valid: + + (?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t] + )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?: + \r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:( + ?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ + \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0 + 31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\ + ](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+ + (?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?: + (?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z + |(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n) + ?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\ + r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ + \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n) + ?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t] + )*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ + \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])* + )(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t] + )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*) + *:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+ + |\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r + \n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?: + \r\n)?[ 
\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t + ]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031 + ]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\]( + ?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(? + :(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(? + :\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(? + :(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)? + [ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] + \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]| + \\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<> + @,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|" + (?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t] + )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ + ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(? 
+ :[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[ + \]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000- + \031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|( + ?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,; + :\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([ + ^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\" + .\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\ + ]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\ + [\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\ + r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] + \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\] + |\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0 + 00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\ + .|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@, + ;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(? + :[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])* + (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". 
+ \[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[ + ^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\] + ]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*( + ?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ + ".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:( + ?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[ + \["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t + ])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t + ])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(? + :\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+| + \Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?: + [^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\ + ]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n) + ?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[" + ()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n) + ?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<> + @,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ + \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@, + ;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t] + )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ + ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)? + (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". 
+ \[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?: + \r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[ + "()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t]) + *))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]) + +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\ + .(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z + |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:( + ?:\r\n)?[ \t])*))*)?;\s*) + +This is a somewhat pathological example (because email addresses are actually surprisingly complex), but is used in real code. +See the Stack Overflow discussion at for more details. + +Don't forget that you're in a programming language and you have other tools at your disposal. +Instead of creating one complex regular expression, it's often easier to write a series of simpler regexps. +If you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one. diff --git a/strings.Rmd b/strings.Rmd index 97c54c2..aca01fd 100644 --- a/strings.Rmd +++ b/strings.Rmd @@ -214,259 +214,6 @@ TODO: add connection to `arrange()` 6. Write a function that turns (e.g.) a vector `c("a", "b", "c")` into the string `a, b, and c`. Think carefully about what it should do if given a vector of length 0, 1, or 2. -## Matching patterns with regular expressions - -Regexps are a very terse language that allow you to describe patterns in strings. -They take a little while to get your head around, but once you understand them, you'll find them extremely useful. - -To learn regular expressions, we'll use `str_view()` and `str_view_all()`. -These functions take a character vector and a regular expression, and show you how they match. 
-We'll start with very simple regular expressions and then gradually get more and more complicated. -Once you've mastered pattern matching, you'll learn how to apply those ideas with various stringr functions. - -### Basic matches - -The simplest patterns match exact strings: - -```{r} -x <- c("apple", "banana", "pear") -str_view(x, "an") -``` - -The next step up in complexity is `.`, which matches any character (except a newline): - -```{r} -str_view(x, ".a.") -``` - -But if "`.`" matches any character, how do you match the character "`.`"? -You need to use an "escape" to tell the regular expression you want to match it exactly, not use its special behaviour. -Like strings, regexps use the backslash, `\`, to escape special behaviour. -So to match an `.`, you need the regexp `\.`. -Unfortunately this creates a problem. -We use strings to represent regular expressions, and `\` is also used as an escape symbol in strings. -So to create the regular expression `\.` we need the string `"\\."`. - -```{r} -# To create the regular expression, we need \\ -dot <- "\\." - -# But the expression itself only contains one: -writeLines(dot) - -# And this tells R to look for an explicit . -str_view(c("abc", "a.c", "bef"), "a\\.c") -``` - -If `\` is used as an escape character in regular expressions, how do you match a literal `\`? -Well you need to escape it, creating the regular expression `\\`. -To create that regular expression, you need to use a string, which also needs to escape `\`. -That means to match a literal `\` you need to write `"\\\\"` --- you need four backslashes to match one! - -```{r} -x <- "a\\b" -writeLines(x) - -str_view(x, "\\\\") -``` - -In this book, I'll write regular expression as `\.` and strings that represent the regular expression as `"\\."`. - -#### Exercises - -1. Explain why each of these strings don't match a `\`: `"\"`, `"\\"`, `"\\\"`. - -2. How would you match the sequence `"'\`? - -3. What patterns will the regular expression `\..\..\..` match? 
- How would you represent it as a string? - -### Anchors - -By default, regular expressions will match any part of a string. -It's often useful to *anchor* the regular expression so that it matches from the start or end of the string. -You can use: - -- `^` to match the start of the string. -- `$` to match the end of the string. - -```{r} -x <- c("apple", "banana", "pear") -str_view(x, "^a") -str_view(x, "a$") -``` - -To remember which is which, try this mnemonic which I learned from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): if you begin with power (`^`), you end up with money (`$`). - -To force a regular expression to only match a complete string, anchor it with both `^` and `$`: - -```{r} -x <- c("apple pie", "apple", "apple cake") -str_view(x, "apple") -str_view(x, "^apple$") -``` - -You can also match the boundary between words with `\b`. -I don't often use this in R, but I will sometimes use it when I'm doing a search in RStudio when I want to find the name of a function that's a component of other functions. -For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on. - -#### Exercises - -1. How would you match the literal string `"$^$"`? - -2. Given the corpus of common words in `stringr::words`, create regular expressions that find all words that: - - a. Start with "y". - b. End with "x" - c. Are exactly three letters long. (Don't cheat by using `str_length()`!) - d. Have seven letters or more. - - Since this list is long, you might want to use the `match` argument to `str_view()` to show only the matching or non-matching words. - -### Character classes and alternatives - -There are a number of special patterns that match more than one character. -You've already seen `.`, which matches any character apart from a newline. -There are four other useful tools: - -- `\d`: matches any digit. -- `\s`: matches any whitespace (e.g. space, tab, newline). -- `[abc]`: matches a, b, or c. 
-- `[^abc]`: matches anything except a, b, or c. - -Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`. - -A character class containing a single character is a nice alternative to backslash escapes when you want to include a single metacharacter in a regex. -Many people find this more readable. - -```{r} -# Look for a literal character that normally has special meaning in a regex -str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c") -str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c") -str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]") -``` - -This works for most (but not all) regex metacharacters: `$` `.` `|` `?` `*` `+` `(` `)` `[` `{`. -Unfortunately, a few characters have special meaning even inside a character class and must be handled with backslash escapes: `]` `\` `^` and `-`. - -You can use *alternation* to pick between one or more alternative patterns. -For example, `abc|d..f` will match either '"abc"', or `"deaf"`. -Note that the precedence for `|` is low, so that `abc|xyz` matches `abc` or `xyz` not `abcyz` or `abxyz`. -Like with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want: - -```{r} -str_view(c("grey", "gray"), "gr(e|a)y") -``` - -#### Exercises - -1. Create regular expressions to find all words that: - - a. Start with a vowel. - b. That only contain consonants. (Hint: thinking about matching "not"-vowels.) - c. End with `ed`, but not with `eed`. - d. End with `ing` or `ise`. - -2. Empirically verify the rule "i before e except after c". - -3. Is "q" always followed by a "u"? - -4. Write a regular expression that matches a word if it's probably written in British English, not American English. - -5. Create a regular expression that will match telephone numbers as commonly written in your country. 
- -### Repetition / Quantifiers - -The next step up in power involves controlling how many times a pattern matches: - -- `?`: 0 or 1 -- `+`: 1 or more -- `*`: 0 or more - -```{r} -x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII" -str_view(x, "CC?") -str_view(x, "CC+") -str_view(x, 'C[LX]+') -``` - -Note that the precedence of these operators is high, so you can write: `colou?r` to match either American or British spellings. -That means most uses will need parentheses, like `bana(na)+`. - -You can also specify the number of matches precisely: - -- `{n}`: exactly n -- `{n,}`: n or more -- `{1,m}`: at most m -- `{n,m}`: between n and m - -```{r} -str_view(x, "C{2}") -str_view(x, "C{2,}") -str_view(x, "C{1,3}") -str_view(x, "C{2,3}") -``` - -By default these matches are "greedy": they will match the longest string possible. -You can make them "lazy", matching the shortest string possible by putting a `?` after them. -This is an advanced feature of regular expressions, but it's useful to know that it exists: - -```{r} -str_view(x, 'C{2,3}?') -str_view(x, 'C[LX]+?') -``` - -Collectively, these operators are called **quantifiers** because they quantify how many times a match can occur. - -#### Exercises - -1. Describe the equivalents of `?`, `+`, `*` in `{m,n}` form. - -2. Describe in words what these regular expressions match: (read carefully to see if I'm using a regular expression or a string that defines a regular expression.) - - a. `^.*$` - b. `"\\{.+\\}"` - c. `\d{4}-\d{2}-\d{2}` - d. `"\\\\{4}"` - -3. Create regular expressions to find all words that: - - a. Start with three consonants. - b. Have three or more vowels in a row. - c. Have two or more vowel-consonant pairs in a row. - -4. Solve the beginner regexp crosswords at <https://regexcrossword.com/challenges/beginner>. - -### Grouping and backreferences - -Earlier, you learned about parentheses as a way to disambiguate complex expressions. -Parentheses also create a *numbered* capturing group (number 1, 2 etc.). 
-A capturing group stores *the part of the string* matched by the part of the regular expression inside the parentheses. -You can refer to the same text as previously matched by a capturing group with *backreferences*, like `\1`, `\2` etc. -For example, the following regular expression finds all fruits that have a repeated pair of letters. - -```{r} -str_view(fruit, "(..)\\1", match = TRUE) -``` - -(Shortly, you'll also see how they're useful in conjunction with `str_match()`.) - -#### Exercises - -1. Describe, in words, what these expressions will match: - - a. `(.)\1\1` - b. `"(.)(.)\\2\\1"` - c. `(..)\1` - d. `"(.).\\1.\\1"` - e. `"(.)(.)(.).*\\3\\2\\1"` - -2. Construct regular expressions to match words that: - - a. Start and end with the same character. - b. Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.) - c. Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.) - ## Tools Now that you've learned the basics of regular expressions, it's time to learn how to apply them to real problems. @@ -478,103 +225,6 @@ In this section you'll learn a wide array of stringr functions that let you: - Replace matches with new values. - Split a string based on a match. -A word of caution before we continue: because regular expressions are so powerful, it's easy to try and solve every problem with a single regular expression. -In the words of Jamie Zawinski: - -> Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. 
- -As a cautionary tale, check out this regular expression that checks if a email address is valid: - - (?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t] - )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?: - \r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:( - ?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ - \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0 - 31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\ - ](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+ - (?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?: - (?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z - |(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n) - ?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\ - r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ - \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n) - ?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t] - )*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ - \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])* - )(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t] - )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*) - *:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+ - |\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r - \n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?: - \r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t - ]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031 - ]+(?:(?:(?:\r\n)?[ 
\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\]( - ?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(? - :(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(? - :\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(? - :(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)? - [ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] - \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]| - \\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<> - @,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|" - (?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t] - )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ - ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(? - :[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[ - \]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000- - \031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|( - ?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,; - :\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([ - ^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\" - .\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\ - ]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\ - [\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\ - r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] - \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\] - |\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0 - 00-\031]+(?:(?:(?:\r\n)?[ 
\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\ - .|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@, - ;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(? - :[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])* - (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". - \[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[ - ^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\] - ]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*( - ?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ - ".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:( - ?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[ - \["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t - ])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t - ])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(? 
- :\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+| - \Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?: - [^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\ - ]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n) - ?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[" - ()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n) - ?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<> - @,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ - \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@, - ;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t] - )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ - ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)? - (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". - \[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?: - \r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[ - "()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t]) - *))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]) - +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\ - .(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z - |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:( - ?:\r\n)?[ \t])*))*)?;\s*) - -This is a somewhat pathological example (because email addresses are actually surprisingly complex), but is used in real code. -See the Stack Overflow discussion at for more details. - -Don't forget that you're in a programming language and you have other tools at your disposal. 
-Instead of creating one complex regular expression, it's often easier to write a series of simpler regexps. -If you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one. - ### Detect matches To determine if a character vector matches a pattern, use `str_detect()`. @@ -872,6 +522,37 @@ str_split(x, " ")[[1]] str_split(x, boundary("word"))[[1]] ``` +### Separate + +`separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears. +Take `table3`: + +```{r} +table3 +``` + +The `rate` column contains both `cases` and `population` variables, and we need to split it into two variables. +`separate()` takes the name of the column to separate, and the names of the columns to separate into, as shown in Figure \@ref(fig:tidy-separate) and the code below. + +```{r} +table3 %>% + separate(rate, into = c("cases", "population")) +``` + +```{r tidy-separate, echo = FALSE, out.width = "75%", fig.cap = "Separating `rate` into `cases` and `population` to make `table3` tidy", fig.alt = "Two panels, one with a data frame with three columns (country, year, and rate) and the other with a data frame with four columns (country, year, cases, and population). Arrows show how the rate variable is separated into two variables: cases and population."} +knitr::include_graphics("images/tidy-17.png") +``` + +By default, `separate()` will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter). +For example, in the code above, `separate()` split the values of `rate` at the forward slash characters. +If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`. 
+For example, we could rewrite the code above as: + +```{r eval = FALSE} +table3 %>% + separate(rate, into = c("cases", "population"), sep = "/") +``` + #### Exercises 1. Split up a string like `"apples, pears, and bananas"` into individual components. @@ -1009,27 +690,6 @@ There are three other functions you can use instead of `regex()`: 2. What are the five most common words in `sentences`? -## Other uses of regular expressions - -There are two useful function in base R that also use regular expressions: - -- `apropos()` searches all objects available from the global environment. - This is useful if you can't quite remember the name of the function. - - ```{r} - apropos("replace") - ``` - -- `dir()` lists all the files in a directory. - The `pattern` argument takes a regular expression and only returns file names that match the pattern. - For example, you can find all the R Markdown files in the current directory with: - - ```{r} - head(dir(pattern = "\\.Rmd$")) - ``` - - (If you're more comfortable with "globs" like `*.Rmd`, you can convert them to regular expressions with `glob2rx()`): - ## stringi stringr is built on top of the **stringi** package. @@ -1051,83 +711,6 @@ The main difference is the prefix: `str_` vs. `stri_`. 2. How do you control the language that `stri_sort()` uses for sorting? -## tidyr - -So far you've learned how to tidy `table2`, `table4a`, and `table4b`, but not `table3`. -`table3` has a different problem: we have one column (`rate`) that contains two variables (`cases` and `population`). -To fix this problem, we'll need the `separate()` function. -You'll also learn about the complement of `separate()`: `unite()`, which you use if a single variable is spread across multiple columns. - -```{r} -library(tidyr) -``` - -### Separate - -`separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears. 
-Take `table3`: - -```{r} -table3 -``` - -The `rate` column contains both `cases` and `population` variables, and we need to split it into two variables. -`separate()` takes the name of the column to separate, and the names of the columns to separate into, as shown in Figure \@ref(fig:tidy-separate) and the code below. - -```{r} -table3 %>% - separate(rate, into = c("cases", "population")) -``` - -```{r tidy-separate, echo = FALSE, out.width = "75%", fig.cap = "Separating `rate` into `cases` and `population` to make `table3` tidy", fig.alt = "Two panels, one with a data frame with three columns (country, year, and rate) and the other with a data frame with four columns (country, year, cases, and population). Arrows show how the rate variable is separated into two variables: cases and population."} -knitr::include_graphics("images/tidy-17.png") -``` - -By default, `separate()` will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter). -For example, in the code above, `separate()` split the values of `rate` at the forward slash characters. -If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`. -For example, we could rewrite the code above as: - -```{r eval = FALSE} -table3 %>% - separate(rate, into = c("cases", "population"), sep = "/") -``` - -(Formally, `sep` is a regular expression, which you'll learn more about in Chapter \@ref(strings).) - -Look carefully at the column types: you'll notice that `cases` and `population` are character columns. -This is the default behaviour in `separate()`: it leaves the type of the column as is. -Here, however, it's not very useful as those really are numbers. 
-We can ask `separate()` to try and convert to better types using `convert = TRUE`: - -```{r} -table3 %>% - separate(rate, into = c("cases", "population"), convert = TRUE) -``` - -### Unite - -`unite()` is the inverse of `separate()`: it combines multiple columns into a single column. -You'll need it much less frequently than `separate()`, but it's still a useful tool to have in your back pocket. - -We can use `unite()` to rejoin the `cases` and `population` columns that we created in the last example. -That data is saved as `tidyr::table1`. -`unite()` takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in `dplyr::select()` style: - -```{r} -table1 %>% - unite(rate, cases, population) -``` - -In this case we also need to use the `sep` argument. -The default will place an underscore (`_`) between the values from different columns. -Here we want `"/"` instead: - -```{r} -table1 %>% - unite(rate, cases, population, sep = "/") -``` - ### Exercises 1. What do the `extra` and `fill` arguments do in `separate()`? @@ -1177,5 +760,3 @@ table1 %>% ) baker ``` - -##