# Regular expressions ```{r, results = "asis", echo = FALSE} status("restructuring") ``` ## Introduction We touched on regular expressions in Chapter \@ref(strings), but regular expressions really are their own miniature language so it's worth spending some extra time on them. Regular expressions can be overwhelming at first, and you'll think a cat walked across your keyboard, but as your understanding improves they will soon start to make sense. More details in `vignette("regular-expressions", package = "stringr")`. Here we'll focus mostly on pattern language itself, not the functions that use it. That means we'll mostly work with simple vectors showing the results with `str_view()` and `str_view_all()`. You'll need to take what you learn and apply it to data frames either with tidyr functions or by combining dplyr functions with stringr functions. ### Prerequisites This chapter will focus on the **stringr** package for string manipulation, which is part of the core tidyverse. ```{r setup, message = FALSE} library(tidyverse) ``` ## Escaping {#regexp-escaping} But if "`.`" matches any character, how do you match the character "`.`"? You need to use an "escape" to tell the regular expression you want to match it exactly, not use its special behaviour. Like strings, regexps use the backslash, `\`, to escape special behaviour. So to match an `.`, you need the regexp `\.`. Unfortunately this creates a problem. We use strings to represent regular expressions, and `\` is also used as an escape symbol in strings. So to create the regular expression `\.` we need the string `"\\."`. ```{r} # To create the regular expression, we need \\ dot <- "\\." # But the expression itself only contains one: str_view(dot) # And this tells R to look for an explicit . str_view(c("abc", "a.c", "bef"), "a\\.c") ``` If `\` is used as an escape character in regular expressions, how do you match a literal `\`? Well you need to escape it, creating the regular expression `\\`. To create that regular expression, you need to use a string, which also needs to escape `\`. That means to match a literal `\` you need to write `"\\\\"` --- you need four backslashes to match one! ```{r} x <- "a\\b" str_view(x) str_view(x, "\\\\") ``` Alternatively, you might find it easier to use the raw strings we discussed in Section \@ref(raw-strings) as that allows you to avoid one layer of escaping: ```{r} str_view(x, r"(\\)") ``` In this book, I'll write regular expression as `\.` and strings that represent the regular expression as `"\\."`. ### Exercises 1. Explain why each of these strings don't match a `\`: `"\"`, `"\\"`, `"\\\"`. 2. How would you match the sequence `"'\`? 3. What patterns will the regular expression `\..\..\..` match? How would you represent it as a string? ## Anchors By default, regular expressions will match any part of a string. It's often useful to **anchor** the regular expression so that it matches from the start or end of the string. You can use: - `^` to match the start of the string. - `$` to match the end of the string. ```{r} x <- c("apple", "banana", "pear") str_view(x, "^a") str_view(x, "a$") ``` To remember which is which, try this mnemonic which I learned from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): if you begin with power (`^`), you end up with money (`$`). To force a regular expression to only match a complete string, anchor it with both `^` and `$`: ```{r} x <- c("apple pie", "apple", "apple cake") str_view(x, "apple") str_view(x, "^apple$") ``` You can also match the boundary between words with `\b`. I don't often use this in R, but I will sometimes use it when I'm doing a search in RStudio when I want to find the name of a function that's a component of other functions. For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on. ```{r} x <- c("summary(x)", "summarise(df)", "rowsum(x)", "sum(x)") str_view(x, "sum") str_view_all(x, "\\bsum\\b") ``` ### Exercises 1. How would you match the literal string `"$^$"`? 2. Given the corpus of common words in `stringr::words`, create regular expressions that find all words that: a. Start with "y". b. End with "x" c. Are exactly three letters long. (Don't cheat by using `str_length()`!) d. Have seven letters or more. Since this list is long, you might want to use the `match` argument to `str_view()` to show only the matching or non-matching words. ## Matching multiple characters There are a number of special patterns that match more than one character. You've already seen `.`, which matches any character apart from a newline. There are four other useful tools: - `\d`: matches any digit. `\D` matches anything that isn't a digit. - `\s`: matches any whitespace (e.g. space, tab, newline). `\S` matches anything that isn't whitespace. - `[abc]`: matches a, b, or c. - `[^abc]`: matches anything except a, b, or c. Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`. A character class containing a single character is a nice alternative to backslash escapes when you want to include a single metacharacter in a regex. Many people find this more readable. ```{r} # Look for a literal character that normally has special meaning in a regex str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c") str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c") str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]") ``` This works for most (but not all) regex metacharacters: `$` `.` `|` `?` `*` `+` `(` `)` `[` `{`. Unfortunately, a few characters have special meaning even inside a character class and must be handled with backslash escapes: `]` `\` `^` and `-`. When you have complex logical conditions (e.g. match a or b but not c unless d) it's often easier to combine multiple `str_detect()` calls with logical operators, rather than trying to create a single regular expression. For example, here are two ways to find all words that don't contain any vowels: ```{r} # Find all words containing at least one vowel, and negate no_vowels_1 <- !str_detect(words, "[aeiou]") # Find all words consisting only of consonants (non-vowels) no_vowels_2 <- str_detect(words, "^[^aeiou]+$") identical(no_vowels_1, no_vowels_2) ``` The results are identical, but I think the first approach is significantly easier to understand. If your regular expression gets overly complicated, try breaking it up into smaller pieces, giving each piece a name, and then combining the pieces with logical operations. ### Exercises 1. Create regular expressions to find all words that: a. Start with a vowel. b. That only contain consonants. (Hint: thinking about matching "not"-vowels.) c. End with `ed`, but not with `eed`. d. End with `ing` or `ise`. 2. Empirically verify the rule "i before e except after c". 3. Is "q" always followed by a "u"? 4. Write a regular expression that matches a word if it's probably written in British English, not American English. 5. Create a regular expression that will match telephone numbers as commonly written in your country. ## Repetition / Quantifiers The next step up in power involves controlling how many times a pattern matches, the so called **quantifiers**. We discussed `?` (0 or 1 matches), `+` (1 or more matches), and `*` (0 or more matches) in the last chapter. Note that the precedence of these operators is high, so you can write: `colou?r` to match either American or British spellings. That means most uses will need parentheses, like `bana(na)+`. You can also specify the number of matches precisely: - `{n}`: exactly n - `{n,}`: n or more - `{n,m}`: between n and m ```{r} x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII" str_view(x, "C{2}") str_view(x, "C{2,}") str_view(x, "C{1,3}") str_view(x, "C{2,3}") ``` By default these matches are **greedy**: they will match the longest string possible. You can make them **lazy**, matching the shortest string possible by putting a `?` after them. This is an advanced feature of regular expressions, but it's useful to know that it exists: ```{r} str_view(x, 'C{2,3}?') str_view(x, 'C+[LX]+') str_view(x, 'C+[LX]+?') ``` ### Exercises 1. Describe the equivalents of `?`, `+`, `*` in `{m,n}` form. 2. Describe in words what these regular expressions match: (read carefully to see if I'm using a regular expression or a string that defines a regular expression.) a. `^.*$` b. `"\\{.+\\}"` c. `\d{4}-\d{2}-\d{2}` d. `"\\\\{4}"` 3. Create regular expressions to find all words that: a. Start with three consonants. b. Have three or more vowels in a row. c. Have two or more vowel-consonant pairs in a row. 4. Solve the beginner regexp crosswords at [\](https://regexcrossword.com/challenges/beginner){.uri}. ## Grouping and backreferences Earlier, you learned about parentheses as a way to disambiguate complex expressions. Parentheses also create a *numbered* capturing group (number 1, 2 etc.). A capturing group stores *the part of the string* matched by the part of the regular expression inside the parentheses. You can refer to the same text as previously matched by a capturing group with *backreferences*, like `\1`, `\2` etc. For example, the following regular expression finds all fruits that have a repeated pair of letters. ```{r} str_view(fruit, "(..)\\1", match = TRUE) ``` Also use for replacement: ```{r} sentences %>% str_replace("(\\w+) (\\w+) (\\w+)", "\\1 \\3 \\2") %>% head(5) ``` Names that start and end with the same letter. Implement with `str_sub()` instead. Can create non-capturing groups with `(?:)`. ### Exercises 1. Describe, in words, what these expressions will match: a. `(.)\1\1` b. `"(.)(.)\\2\\1"` c. `(..)\1` d. `"(.).\\1.\\1"` e. `"(.)(.)(.).*\\3\\2\\1"` 2. Construct regular expressions to match words that: a. Start and end with the same character. b. Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.) c. Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.) ## Other uses of regular expressions There are two useful function in base R that also use regular expressions: ## Options When you use a pattern that's a string, it's automatically wrapped into a call to `regex()`: ```{r, eval = FALSE} # The regular call: str_view(fruit, "nana") # Is shorthand for str_view(fruit, regex("nana")) ``` You can use the other arguments of `regex()` to control details of the match: - `ignore_case = TRUE` allows characters to match either their uppercase or lowercase forms. This always uses the current locale. ```{r} bananas <- c("banana", "Banana", "BANANA") str_view(bananas, "banana") str_view(bananas, regex("banana", ignore_case = TRUE)) ``` - `multiline = TRUE` allows `^` and `$` to match the start and end of each line rather than the start and end of the complete string. ```{r} x <- "Line 1\nLine 2\nLine 3" str_extract_all(x, "^Line")[[1]] str_extract_all(x, regex("^Line", multiline = TRUE))[[1]] ``` - `comments = TRUE` allows you to use comments and white space to make complex regular expressions more understandable. Spaces are ignored, as is everything after `#`. To match a literal space, you'll need to escape it: `"\\ "`. ```{r} phone <- regex(" \\(? # optional opening parens (\\d{3}) # area code [) -]? # optional closing parens, space, or dash (\\d{3}) # another three numbers [ -]? # optional space or dash (\\d{3}) # three more numbers ", comments = TRUE) str_match("514-791-8141", phone) ``` - `dotall = TRUE` allows `.` to match everything, including `\n`. ## Some details ### Overlapping Matches never overlap, and the regular expression engine only starts looking for a new match after the end of the last match. For example, in `"abababa"`, how many times will the pattern `"aba"` match? Regular expressions say two, not three: ```{r} str_count("abababa", "aba") str_view_all("abababa", "aba") ``` ### Zero width matches It's possible for a regular expression to match no character, i.e. the space between too characters. This typically happens when you use a quantifier that allows zero matches: ```{r} str_view_all("abcdef", "c?") ``` But `\b` also creatse a match: ```{r} str_view_all("this is a sentence", "\\b") ``` ### Operator precedence You can use *alternation* to pick between one or more alternative patterns. For example, `abc|d..f` will match either '"abc"', or `"deaf"`. Note that the precedence for `|` is low, so that `abc|xyz` matches `abc` or `xyz` not `abcyz` or `abxyz`. Like with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want: ```{r} str_view(c("grey", "gray"), "gr(e|a)y") ``` ## A caution A word of caution before we continue: because regular expressions are so powerful, it's easy to try and solve every problem with a single regular expression. In the words of Jamie Zawinski: > Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. As a cautionary tale, check out this regular expression that checks if a email address is valid: (?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t] )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?: \r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:( ?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0 31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\ ](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+ (?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?: (?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n) ?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\ r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n) ?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t] )*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])* )(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t] )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*) *:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+ |\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r \n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?: \r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t ]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031 ]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\]( ?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(? :(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(? :\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(? :(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)? [ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]| \\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<> @,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|" (?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t] )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(? :[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[ \]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000- \031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|( ?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,; :\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([ ^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\" .\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\ ]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\ [\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\ r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\] |\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0 00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\ .|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@, ;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(? :[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])* (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". \[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[ ^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\] ]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*( ?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:( ?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[ \["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t ])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t ])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(? :\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+| \Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?: [^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\ ]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n) ?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[" ()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n) ?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<> @,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@, ;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t] )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)? (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". \[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?: \r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[ "()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t]) *))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]) +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\ .(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:( ?:\r\n)?[ \t])*))*)?;\s*) This is a somewhat pathological example (because email addresses are actually surprisingly complex), but is used in real code. See the Stack Overflow discussion at for more details. Don't forget that you're in a programming language and you have other tools at your disposal. Instead of creating one complex regular expression, it's often easier to write a series of simpler regexps. If you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one. ### Exercises 1. In the previous example, you might have noticed that the regular expression matched "flickered", which is not a colour. Modify the regex to fix the problem. 2. Find all words that come after a "number" like "one", "two", "three" etc. Pull out both the number and the word. 3. Find all contractions. Separate out the pieces before and after the apostrophe.