
# Strings

## Introduction

This chapter introduces you to string manipulation in R. You'll learn the basics of how strings work and how to create them by hand, but the focus of this chapter will be on regular expressions, or regexps for short. Regular expressions are useful because strings usually contain unstructured or semi-structured data, and regexps are a concise language for describing patterns in strings. When you first look at a regexp, you'll think a cat walked across your keyboard, but as your understanding improves they will soon start to make sense.

### Prerequisites

This chapter will focus on the __stringr__ package for string manipulation. stringr is not part of the core tidyverse because you don't always have textual data, so we need to load it explicitly.

```{r setup, message = FALSE}
library(tidyverse)
library(stringr)
```

## String basics

You can create strings with either single quotes or double quotes. Unlike some other languages, there is no difference in behaviour. I recommend always using `"`, unless you want to create a string that contains multiple `"`.

```{r}
string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'
```

If you forget to close a quote, you'll see `+`, the continuation character:

```
> "This is a string without a closing quote
+ 
+ 
+ HELP I'M STUCK
```

If this happens to you, press Escape and try again!

To include a literal single or double quote in a string you can use `\` to "escape" it:

```{r}
double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"
```

That means if you want to include a literal backslash, you'll need to double it up: `"\\"`.

Beware that the printed representation of a string is not the same as the string itself, because the printed representation shows the escapes. To see the raw contents of the string, use `writeLines()`:

```{r}
x <- c("\"", "\\")
x
writeLines(x)
```

There are a handful of other special characters. The most common are `"\n"`, newline, and `"\t"`, tab, but you can see the complete list by requesting help on `"`: `?'"'`, or `?"'"`. You'll also sometimes see strings like `"\u00b5"`, which is a way of writing non-English characters that works on all platforms:

```{r}
x <- "\u00b5"
x
```
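
As a small illustration of the other escapes, here's a sketch with a made-up string containing two tabs and one newline. Each escape is stored as a single character, and `writeLines()` shows what it actually does:

```{r}
# A made-up string, just for illustration
tabbed <- "name\tvalue\nalpha\t1"

# The printed representation shows the escapes ...
tabbed

# ... while writeLines() shows the raw contents, spread over two lines
writeLines(tabbed)
```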

Multiple strings are often stored in a character vector, which you can create with `c()`:

```{r}
c("one", "two", "three")
```

### String length

Base R contains many functions to work with strings but we'll avoid them because they can be inconsistent, which makes them hard to remember. Instead we'll use functions from stringr. These have more intuitive names, and all start with `str_`. For example, `str_length()` tells you the number of characters in a string:

```{r}
str_length(c("a", "R for data science", NA))
```

The common `str_` prefix is particularly useful if you use RStudio, because typing `str_` will trigger autocomplete, allowing you to see all stringr functions:

```{r, echo = FALSE}
knitr::include_graphics("screenshots/stringr-autocomplete.png")
```

### Combining strings

To combine two or more strings, use `str_c()`:

```{r}
str_c("x", "y")
str_c("x", "y", "z")
```

Use the `sep` argument to control how they're separated:

```{r}
str_c("x", "y", sep = ", ")
```

As with most other functions in R, missing values are contagious. If you want them to print as `"NA"`, use `str_replace_na()`:

```{r}
x <- c("abc", NA)
str_c("|-", x, "-|")
str_c("|-", str_replace_na(x), "-|")
```

As shown above, `str_c()` is vectorised, and it automatically recycles shorter vectors to the same length as the longest:

```{r}
str_c("prefix-", c("a", "b", "c"), "-suffix")
```

Objects of length 0 are silently dropped. This is particularly useful in conjunction with `if`:

```{r}
name <- "Hadley"
time_of_day <- "morning"
birthday <- FALSE

str_c(
  "Good ", time_of_day, " ", name,
  if (birthday) " and HAPPY BIRTHDAY",
  "."
)
```

To collapse a vector of strings into a single string, use `collapse`:

```{r}
str_c(c("x", "y", "z"), collapse = ", ")
```

### Subsetting strings

You can extract parts of a string using `str_sub()`. As well as the string, `str_sub()` takes `start` and `end` arguments which give the (inclusive) position of the substring:

```{r}
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
# negative numbers count backwards from end
str_sub(x, -3, -1)
```

Note that `str_sub()` won't fail if the string is too short: it will just return as much as possible:

```{r}
str_sub("a", 1, 5)
```

You can also use the assignment form of `str_sub()` to modify strings:

```{r}
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x
```

### Locales

Above I used `str_to_lower()` to change the text to lower case. You can also use `str_to_upper()` or `str_to_title()`. However, changing case is more complicated than it might at first appear because different languages have different rules for changing case. You can pick which set of rules to use by specifying a locale:

```{r}
# Turkish has two i's: with and without a dot, and it
# has a different rule for capitalising them:
str_to_upper(c("i", "ı"))
str_to_upper(c("i", "ı"), locale = "tr")
```

The locale is specified as an ISO 639 language code, which is a two or three letter abbreviation. If you don't already know the code for your language, [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) has a good list. If you leave the locale blank, it will use the current locale, as provided by your operating system.

Another important operation that's affected by the locale is sorting. The base R `order()` and `sort()` functions sort strings using the current locale. If you want robust behaviour across different computers, you may want to use `str_sort()` and `str_order()` which take an additional `locale` argument:

```{r}
x <- c("apple", "eggplant", "banana")

str_sort(x, locale = "en")  # English

str_sort(x, locale = "haw") # Hawaiian
```

### Exercises

1.  In code that doesn't use stringr, you'll often see `paste()` and `paste0()`.
    What's the difference between the two functions? What stringr function are
    they equivalent to? How do the functions differ in their handling of 
    `NA`?
    
1.  In your own words, describe the difference between the `sep` and `collapse`
    arguments to `str_c()`.

1.  Use `str_length()` and `str_sub()` to extract the middle character from 
    a string. What will you do if the string has an even number of characters?

1.  What does `str_wrap()` do? When might you want to use it?

1.  What does `str_trim()` do? What's the opposite of `str_trim()`?

1.  Write a function that turns (e.g.) a vector `c("a", "b", "c")` into 
    the string `a, b, and c`. Think carefully about what it should do if
    given a vector of length 0, 1, or 2.

## Matching patterns with regular expressions

Regexps are a very terse language that allows you to describe patterns in strings. They take a little while to get your head around, but once you understand them, you'll find them extremely useful.

To learn regular expressions, we'll use `str_view()` and `str_view_all()`. These functions take a character vector and a regular expression, and show you how they match. We'll start with very simple regular expressions and then gradually get more and more complicated. Once you've mastered pattern matching, you'll learn how to apply those ideas with various stringr functions.

### Basic matches

The simplest patterns match exact strings:

```{r}
x <- c("apple", "banana", "pear")
str_view(x, "an")
```

The next step up in complexity is `.`, which matches any character (except a newline):

```{r}
str_view(x, ".a.")
```

But if "`.`" matches any character, how do you match the character "`.`"? You need to use an "escape" to tell the regular expression you want to match it exactly, not use its special behaviour. Like strings, regexps use the backslash, `\`, to escape special behaviour. So to match an `.`, you need the regexp `\.`. Unfortunately this creates a problem. We use strings to represent regular expressions, and `\` is also used as an escape symbol in strings. So to create the regular expression `\.` we need the string `"\\."`. 

```{r}
# To create the regular expression, we need \\
dot <- "\\."

# But the expression itself only contains one:
writeLines(dot)

# And this tells R to look for an explicit .
str_view(c("abc", "a.c", "bef"), "a\\.c")
```

If `\` is used as an escape character in regular expressions, how do you match a literal `\`? Well you need to escape it, creating the regular expression `\\`. To create that regular expression, you need to use a string, which also needs to escape `\`. That means to match a literal `\` you need to write `"\\\\"` --- you need four backslashes to match one!

```{r}
x <- "a\\b"
writeLines(x)

str_view(x, "\\\\")
```

In this book, I'll write regular expressions as `\.` and strings that represent the regular expression as `"\\."`.

#### Exercises

1.  Explain why each of these strings doesn't match a `\`: `"\"`, `"\\"`, `"\\\"`.

1.  How would you match the sequence `"'\`?

1.  What patterns will the regular expression `\..\..\..` match? 
    How would you represent it as a string?

### Anchors

By default, regular expressions will match any part of a string. It's often useful to _anchor_ the regular expression so that it matches from the start or end of the string. You can use:

* `^` to match the start of the string.
* `$` to match the end of the string.

```{r}
x <- c("apple", "banana", "pear")
str_view(x, "^a")
str_view(x, "a$")
```

To remember which is which, try this mnemonic which I learned from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): if you begin with power (`^`), you end up with money (`$`).

To force a regular expression to only match a complete string, anchor it with both `^` and `$`:

```{r}
x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")
str_view(x, "^apple$")
```

You can also match the boundary between words with `\b`. I don't often use this in R, but I will sometimes use it when I'm doing a search in RStudio when I want to find the name of a function that's a component of other functions. For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on.
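
To see `\b` at work in R itself, here's a quick sketch using `str_detect()` (covered properly later in this chapter) on a made-up vector of function names. As with other backslash escapes, the regexp `\bsum\b` has to be written as the string `"\\bsum\\b"`:

```{r}
library(stringr)

# Hypothetical function names, just for illustration
funs <- c("sum", "summarise", "rowsum", "sum(x)")

# Without boundaries, every element contains "sum"
no_boundary <- str_detect(funs, "sum")
no_boundary

# With boundaries, only the standalone word matches
with_boundary <- str_detect(funs, "\\bsum\\b")
with_boundary
```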

#### Exercises

1.  How would you match the literal string `"$^$"`?

1.  Given the corpus of common words in `stringr::words`, create regular
    expressions that find all words that:
    
    1. Start with "y".
    1. End with "x".
    1. Are exactly three letters long. (Don't cheat by using `str_length()`!)
    1. Have seven letters or more.

    Since this list is long, you might want to use the `match` argument to
    `str_view()` to show only the matching or non-matching words.

### Character classes and alternatives

There are a number of special patterns that match more than one character. You've already seen `.`, which matches any character apart from a newline. There are four other useful tools:

* `\d`: matches any digit.
* `\s`: matches any whitespace (e.g. space, tab, newline).
* `[abc]`: matches a, b, or c.
* `[^abc]`: matches anything except a, b, or c.

Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`.
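
Here's a minimal sketch of these classes on a few made-up strings. It borrows `str_extract()` and `str_detect()`, which are introduced later in the chapter, purely to make the results easy to read:

```{r}
library(stringr)

# Made-up strings for illustration; the second contains a tab
rooms <- c("room 7", "room\t12", "lobby")

# First digit in each string (NA when there is none)
first_digit <- str_extract(rooms, "\\d")
first_digit

# Does the string contain any whitespace?
has_space <- str_detect(rooms, "\\s")
has_space

# First character that is not r, o, or m
not_rom <- str_extract(rooms, "[^rom]")
not_rom
```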

A character class containing a single character is a nice alternative to backslash escapes when you want to include a single metacharacter in a regex. Many people find this more readable.

```{r}
# Look for a literal character that normally has special meaning in a regex
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]")
```

This works for most (but not all) regex metacharacters: `$` `.` `|` `?` `*` `+` `(` `)` `[` `{`. Unfortunately, a few characters have special meaning even inside a character class and must be handled with backslash escapes: `]` `\` `^` and `-`.
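
For those remaining characters, a backslash escape inside the class does the job. A small sketch on made-up strings (remember that each regexp backslash must be doubled in the string):

```{r}
library(stringr)

# Made-up strings for illustration
ops <- c("a-b", "a^b", "a]b", "ab")

# Match a, then a literal ^ or -, then b
caret_or_dash <- str_detect(ops, "a[\\^\\-]b")
caret_or_dash

# Match a, then a literal ], then b
close_bracket <- str_detect(ops, "a[\\]]b")
close_bracket
```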

You can use _alternation_ to pick between one or more alternative patterns. For example, `abc|d..f` will match either `"abc"` or `"deaf"`. Note that the precedence for `|` is low, so that `abc|xyz` matches `abc` or `xyz`, not `abcyz` or `abxyz`. As with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:

```{r}
str_view(c("grey", "gray"), "gr(e|a)y")
```

#### Exercises

1.  Create regular expressions to find all words that:

    1. Start with a vowel.

    1. Contain only consonants. (Hint: think about matching 
       "not"-vowels.)

    1. End with `ed`, but not with `eed`.
    
    1. End with `ing` or `ise`.
    
1.  Empirically verify the rule "i before e except after c".

1.  Is "q" always followed by a "u"?

1.  Write a regular expression that matches a word if it's probably written
    in British English, not American English.

1.  Create a regular expression that will match telephone numbers as commonly
    written in your country.

### Repetition

The next step up in power involves controlling how many times a pattern matches:

* `?`: 0 or 1
* `+`: 1 or more
* `*`: 0 or more

```{r}
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")
str_view(x, "CC+")
str_view(x, 'C[LX]+')
```

Note that the precedence of these operators is high, so you can write: `colou?r` to match either American or British spellings. That means most uses will need parentheses, like `bana(na)+`.
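
A short sketch of what that precedence means in practice, using made-up inputs and `str_detect()` and `str_extract()` from later in the chapter:

```{r}
library(stringr)

# ? applies only to the u, so both spellings match
spellings <- str_detect(c("color", "colour", "colr"), "colou?r")
spellings

# Without parentheses, + applies only to the final a
no_group <- str_extract("bananana", "banana+")
no_group

# With parentheses, + applies to the whole group "na"
with_group <- str_extract("bananana", "bana(na)+")
with_group
```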

You can also specify the number of matches precisely:

* `{n}`: exactly n
* `{n,}`: n or more
* `{0,m}`: at most m
* `{n,m}`: between n and m

```{r}
str_view(x, "C{2}")
str_view(x, "C{2,}")
str_view(x, "C{2,3}")
```

By default these matches are "greedy": they will match the longest string possible. You can make them "lazy", matching the shortest string possible by putting a `?` after them. This is an advanced feature of regular expressions, but it's useful to know that it exists:

```{r}
str_view(x, 'C{2,3}?')
str_view(x, 'C[LX]+?')
```

#### Exercises

1.  Describe the equivalents of `?`, `+`, `*` in `{m,n}` form.

1.  Describe in words what these regular expressions match:
    (read carefully to see if I'm using a regular expression or a string
    that defines a regular expression.)

    1. `^.*$`
    1. `"\\{.+\\}"`
    1. `\d{4}-\d{2}-\d{2}`
    1. `"\\\\{4}"`

1.  Create regular expressions to find all words that:

    1. Start with three consonants.
    1. Have three or more vowels in a row.
    1. Have two or more vowel-consonant pairs in a row.

1.  Solve the beginner regexp crosswords at
    <https://regexcrossword.com/challenges/beginner>.

### Grouping and backreferences

Earlier, you learned about parentheses as a way to disambiguate complex expressions. Parentheses also create a _numbered_ capturing group (number 1, 2 etc.). A capturing group stores _the part of the string_ matched by the part of the regular expression inside the parentheses. You can refer to the same text as previously matched by a capturing group with _backreferences_, like `\1`, `\2` etc. For example, the following regular expression finds all fruits that have a repeated pair of letters.

```{r}
str_view(fruit, "(..)\\1", match = TRUE)
```

(Shortly, you'll also see how they're useful in conjunction with `str_match()`.)

#### Exercises

1.  Describe, in words, what these expressions will match:

    1. `(.)\1\1`
    1. `"(.)(.)\\2\\1"`
    1. `(..)\1`
    1. `"(.).\\1.\\1"`
    1. `"(.)(.)(.).*\\3\\2\\1"`

1.  Construct regular expressions to match words that:

    1. Start and end with the same character.
    
    1. Contain a repeated pair of letters
       (e.g. "church" contains "ch" repeated twice.)
    
    1. Contain one letter repeated in at least three places
       (e.g. "eleven" contains three "e"s.)

## Tools

Now that you've learned the basics of regular expressions, it's time to learn how to apply them to real problems. In this section you'll learn a wide array of stringr functions that let you:

* Determine which strings match a pattern.
* Find the positions of matches.
* Extract the content of matches.
* Replace matches with new values.
* Split a string based on a match.

A word of caution before we continue: because regular expressions are so powerful, it's easy to try and solve every problem with a single regular expression. In the words of Jamie Zawinski:

> Some people, when confronted with a problem, think “I know, I’ll use regular
> expressions.” Now they have two problems. 

As a cautionary tale, check out this regular expression that checks if an email address is valid:

```
(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ 
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)
?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\
r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[
 \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)
?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t]
)*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[
 \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*
)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)
*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+
|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r
\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:
\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t
]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031
]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](
?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?
:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?
[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|
\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>
@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?
:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[
\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-
\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(
?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;
:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\"
.\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\
r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0
00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\
.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,
;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[
^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(
?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(
?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[
\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t
])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t
])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?
:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|
\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:
[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)
?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["
()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>
@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[
 \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,
;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:
\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[
"()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])
*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(
?:\r\n)?[ \t])*))*)?;\s*)
```

This is a somewhat pathological example (because email addresses are actually surprisingly complex), but is used in real code. See the Stack Overflow discussion at <http://stackoverflow.com/a/201378> for more details.

Don't forget that you're in a programming language and you have other tools at your disposal. Instead of creating one complex regular expression, it's often easier to write a series of simpler regexps. If you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.
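
As a rough sketch of that idea, here is one way to break a crude "does this look like an email address?" check into small named pieces. The addresses and the three rules are assumptions made up for illustration, and the result is nowhere near a real validator. It uses `str_count()` and `str_detect()`, which are described below:

```{r}
library(stringr)

# Made-up addresses, just for illustration
emails <- c("hadley@example.com", "not-an-email", "a@@b.com", "@example.com")

# Piece 1: exactly one @
has_one_at <- str_count(emails, "@") == 1
# Piece 2: at least one character before the @
has_local <- str_detect(emails, "^[^@]+@")
# Piece 3: something, a dot, and something after the @
has_domain <- str_detect(emails, "@[^@]+\\.[^@]+$")

# Combine the simple pieces with logical operators
plausible <- has_one_at & has_local & has_domain
plausible
```

Each piece is easy to read and test on its own, which is the main point, even though the combined check is deliberately simplistic.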

### Detect matches

To determine if a character vector matches a pattern, use `str_detect()`. It returns a logical vector the same length as the input:

```{r}
x <- c("apple", "banana", "pear")
str_detect(x, "e")
```

Remember that when you use a logical vector in a numeric context, `FALSE` becomes 0 and `TRUE` becomes 1. That makes `sum()` and `mean()` useful if you want to answer questions about matches across a larger vector:

```{r}
# How many common words start with t?
sum(str_detect(words, "^t"))
# What proportion of common words end with a vowel?
mean(str_detect(words, "[aeiou]$"))
```

When you have complex logical conditions (e.g. match a or b but not c unless d) it's often easier to combine multiple `str_detect()` calls with logical operators, rather than trying to create a single regular expression. For example, here are two ways to find all words that don't contain any vowels:

```{r}
# Find all words containing at least one vowel, and negate
no_vowels_1 <- !str_detect(words, "[aeiou]")
# Find all words consisting only of consonants (non-vowels)
no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
identical(no_vowels_1, no_vowels_2)
```

The results are identical, but I think the first approach is significantly easier to understand. If your regular expression gets overly complicated, try breaking it up into smaller pieces, giving each piece a name, and then combining the pieces with logical operations.

A common use of `str_detect()` is to select the elements that match a pattern. You can do this with logical subsetting, or the convenient `str_subset()` wrapper:

```{r}
words[str_detect(words, "x$")]
str_subset(words, "x$")
```

Typically, however, your strings will be one column of a data frame, and you'll want to use `filter()` instead:

```{r}
df <- tibble(
  word = words, 
  i = seq_along(word)
)
df %>% 
  filter(str_detect(word, "x$"))
```


A variation on `str_detect()` is `str_count()`: rather than a simple yes or no, it tells you how many matches there are in a string:

```{r}
x <- c("apple", "banana", "pear")
str_count(x, "a")

# On average, how many vowels per word?
mean(str_count(words, "[aeiou]"))
```

It's natural to use `str_count()` with `mutate()`:

```{r}
df %>% 
  mutate(
    vowels = str_count(word, "[aeiou]"),
    consonants = str_count(word, "[^aeiou]")
  )
```

Note that matches never overlap. For example, in `"abababa"`, how many times will the pattern `"aba"` match? Regular expressions say two, not three:

```{r}
str_count("abababa", "aba")
str_view_all("abababa", "aba")
```

Note the use of `str_view_all()`. As you'll shortly learn, many stringr functions come in pairs: one function works with a single match, and the other works with all matches. The second function will have the suffix `_all`.

#### Exercises

1.  For each of the following challenges, try solving it by using both a single
    regular expression, and a combination of multiple `str_detect()` calls.
    
    1.  Find all words that start or end with `x`.
    
    1.  Find all words that start with a vowel and end with a consonant.
    
    1.  Are there any words that contain at least one of each different
        vowel?

1.  What word has the highest number of vowels? What word has the highest
    proportion of vowels? (Hint: what is the denominator?)

### Extract matches

To extract the actual text of a match, use `str_extract()`. To show that off, we're going to need a more complicated example. I'm going to use the [Harvard sentences](https://en.wikipedia.org/wiki/Harvard_sentences), which were designed to test VOIP systems, but are also useful for practicing regexps. These are provided in `stringr::sentences`:

```{r}
length(sentences)
head(sentences)
```

Imagine we want to find all sentences that contain a colour. We first create a vector of colour names, and then turn it into a single regular expression:

```{r}
colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
colour_match
```

Now we can select the sentences that contain a colour, and then extract the colour to figure out which one it is:

```{r}
has_colour <- str_subset(sentences, colour_match)
matches <- str_extract(has_colour, colour_match)
head(matches)
```

Note that `str_extract()` only extracts the first match. We can see that most easily by first selecting all the sentences that have more than 1 match:

```{r}
more <- sentences[str_count(sentences, colour_match) > 1]
str_view_all(more, colour_match)

str_extract(more, colour_match)
```

This is a common pattern for stringr functions, because working with a single match allows you to use much simpler data structures. To get all matches, use `str_extract_all()`. It returns a list:

```{r}
str_extract_all(more, colour_match)
```

You'll learn more about lists in [lists](#lists) and [iteration].

If you use `simplify = TRUE`, `str_extract_all()` will return a matrix with short matches expanded to the same length as the longest:

```{r}
str_extract_all(more, colour_match, simplify = TRUE)

x <- c("a", "a b", "a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)
```

#### Exercises

1.  In the previous example, you might have noticed that the regular
    expression matched "flickered", which is not a colour. Modify the 
    regex to fix the problem.

1.  From the Harvard sentences data, extract:

    1. The first word from each sentence.
    1. All words ending in `ing`.
    1. All plurals.

### Grouped matches

Earlier in this chapter we talked about the use of parentheses for clarifying precedence and for backreferences when matching. You can also use parentheses to extract parts of a complex match. For example, imagine we want to extract nouns from the sentences. As a heuristic, we'll look for any word that comes after "a" or "the". Defining a "word" in a regular expression is a little tricky, so here I use a simple approximation: a sequence of at least one character that isn't a space.

```{r}
noun <- "(a|the) ([^ ]+)"

has_noun <- sentences %>%
  str_subset(noun) %>%
  head(10)
has_noun %>% 
  str_extract(noun)
```

`str_extract()` gives us the complete match; `str_match()` gives each individual component. Instead of a character vector, it returns a matrix, with one column for the complete match followed by one column for each group:

```{r}
has_noun %>% 
  str_match(noun)
```

(Unsurprisingly, our heuristic for detecting nouns is poor, and also picks up adjectives like smooth and parked.)

If your data is in a tibble, it's often easier to use `tidyr::extract()`. It works like `str_match()` but requires you to name the matches, which are then placed in new columns:

```{r}
tibble(sentence = sentences) %>% 
  tidyr::extract(
    sentence, c("article", "noun"), "(a|the) ([^ ]+)", 
    remove = FALSE
  )
```

Like `str_extract()`, if you want all matches for each string, you'll need `str_match_all()`.
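
For instance, here's a small sketch on a made-up sentence with two matches. `str_match_all()` returns a list with one matrix per input string, and each matrix has one row per match:

```{r}
library(stringr)

# A made-up sentence containing both "the" and "a"
chase <- "the cat chased a mouse"

# One list element per input string, one matrix row per match
all_matches <- str_match_all(chase, "(a|the) ([^ ]+)")
all_matches

# The third column holds the second group: the word after the article
after_article <- all_matches[[1]][, 3]
after_article
```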

#### Exercises

1. Find all words that come after a "number" like "one", "two", "three" etc.
   Pull out both the number and the word.

1. Find all contractions. Separate out the pieces before and after the 
   apostrophe.

### Replacing matches

`str_replace()` and `str_replace_all()` allow you to replace matches with new strings. The simplest use is to replace a pattern with a fixed string:

```{r}
x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
str_replace_all(x, "[aeiou]", "-")
```

With `str_replace_all()` you can perform multiple replacements by supplying a named vector:

```{r}
x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
```

Instead of replacing with a fixed string you can use backreferences to insert components of the match. In the following code, I flip the order of the second and third words.

```{r}
sentences %>% 
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>% 
  head(5)
```

#### Exercises

1.   Replace all forward slashes in a string with backslashes.

1.   Implement a simple version of `str_to_lower()` using `str_replace_all()`.

1.   Switch the first and last letters in `words`. Which of those strings
     are still words?

### Splitting

Use `str_split()` to split a string up into pieces. For example, we could split sentences into words:

```{r}
sentences %>%
  head(5) %>% 
  str_split(" ")
```

Because each component might contain a different number of pieces, this returns a list. If you're working with a length-1 vector, the easiest thing is to just extract the first element of the list:

```{r}
"a|b|c|d" %>% 
  str_split("\\|") %>% 
  .[[1]]
```

Otherwise, like the other stringr functions that return a list, you can use `simplify = TRUE` to return a matrix:

```{r}
sentences %>%
  head(5) %>% 
  str_split(" ", simplify = TRUE)
```

You can also request a maximum number of pieces:

```{r}
fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
fields %>% str_split(": ", n = 2, simplify = TRUE)
```
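
Limiting the number of pieces matters most when the value can itself contain the separator. A minimal sketch, using a made-up `field` string:

```{r}
library(stringr)

# Hypothetical field whose value itself contains the separator
field <- "Title: R: A Language"

# Without a limit, the value is split apart as well
str_split(field, ": ")[[1]]

# With n = 2, everything after the first separator stays together
str_split(field, ": ", n = 2)[[1]]
```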

Instead of splitting up strings by patterns, you can also split up by character, line, sentence and word `boundary()`s:

```{r}
x <- "This is a sentence.  This is another sentence."
str_view_all(x, boundary("word"))

str_split(x, " ")[[1]]
str_split(x, boundary("word"))[[1]]
```

#### Exercises

1.  Split up a string like `"apples, pears, and bananas"` into individual
    components.
    
1.  Why is it better to split up by `boundary("word")` than `" "`?

1.  What does splitting with an empty string (`""`) do? Experiment, and
    then read the documentation.

### Find matches

`str_locate()` and `str_locate_all()` give you the starting and ending positions of each match. These are particularly useful when none of the other functions does exactly what you want. You can use `str_locate()` to find the matching pattern and `str_sub()` to extract and/or modify it.
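
Here is one way the two functions might be combined. The pattern `"an|pp"` and the vector `x` are arbitrary choices for illustration.

```{r}
library(stringr)

x <- c("apple", "banana")

# Two-column matrix with the start and end of the first match in each string
loc <- str_locate(x, "an|pp")
loc

# Extract the matched text using those positions
str_sub(x, loc[, "start"], loc[, "end"])

# Modify the matched text in place with the assignment form
str_sub(x, loc[, "start"], loc[, "end"]) <- str_to_upper(
  str_sub(x, loc[, "start"], loc[, "end"])
)
x
```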

## Other types of pattern

When you use a pattern that's a string, it's automatically wrapped into a call to `regex()`:

```{r, eval = FALSE}
# The regular call:
str_view(fruit, "nana")
# Is shorthand for
str_view(fruit, regex("nana"))
```

You can use the other arguments of `regex()` to control details of the match:

*   `ignore_case = TRUE` allows characters to match either their uppercase or 
    lowercase forms. This always uses the current locale.
    
    ```{r}
    bananas <- c("banana", "Banana", "BANANA")
    str_view(bananas, "banana")
    str_view(bananas, regex("banana", ignore_case = TRUE))
    ```
    
*   `multiline = TRUE` allows `^` and `$` to match the start and end of each
    line rather than the start and end of the complete string.
    
    ```{r}
    x <- "Line 1\nLine 2\nLine 3"
    str_extract_all(x, "^Line")[[1]]
    str_extract_all(x, regex("^Line", multiline = TRUE))[[1]]
    ```
    
*   `comments = TRUE` allows you to use comments and white space to make 
    complex regular expressions more understandable. Spaces are ignored, as is 
    everything after `#`. To match a literal space, you'll need to escape it: 
    `"\\ "`.
    
    ```{r}
    phone <- regex("
      \\(?     # optional opening parens
      (\\d{3}) # area code
      [) -]?   # optional closing parens, space, or dash
      (\\d{3}) # another three numbers
      [ -]?    # optional space or dash
      (\\d{4}) # four more numbers
      ", comments = TRUE)
    
    str_match("514-791-8141", phone)
    ```

*   `dotall = TRUE` allows `.` to match everything, including `\n`.
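
A small sketch of `dotall`, reusing the style of the `multiline` example. The pattern `"1.Line"` is only there to force `.` to stand in for the newline.

```{r}
library(stringr)

x <- "Line 1\nLine 2"

# By default, . does not match the newline between the two lines
str_detect(x, "1.Line")

# With dotall = TRUE, . matches the newline too
str_detect(x, regex("1.Line", dotall = TRUE))
```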

There are three other functions you can use instead of `regex()`:

*   `fixed()`: matches exactly the specified sequence of bytes. It ignores
    all special regular expression syntax and operates at a very low level. 
    This allows you to avoid complex escaping and can be much faster than 
    regular expressions. The following microbenchmark shows that it's about
    3x faster for a simple example.
  
    ```{r}
    microbenchmark::microbenchmark(
      fixed = str_detect(sentences, fixed("the")),
      regex = str_detect(sentences, "the"),
      times = 20
    )
    ```
    
    Beware using `fixed()` with non-English data. It is problematic because 
    there are often multiple ways of representing the same character. For 
    example, there are two ways to define "á": either as a single character or 
    as an "a" plus an accent:
    
    ```{r}
    a1 <- "\u00e1"
    a2 <- "a\u0301"
    c(a1, a2)
    a1 == a2
    ```

    They render identically, but because they're defined differently, 
    `fixed()` doesn't find a match. Instead, you can use `coll()`, defined
    next, to respect human character comparison rules:

    ```{r}
    str_detect(a1, fixed(a2))
    str_detect(a1, coll(a2))
    ```
    
*   `coll()`: compare strings using standard **coll**ation rules. This is 
    useful for doing case insensitive matching. Note that `coll()` takes a
    `locale` parameter that controls which rules are used for comparing
    characters. Unfortunately different parts of the world use different rules!

    ```{r}
    # That means you also need to be aware of the difference
    # when doing case insensitive matches:
    i <- c("I", "İ", "i", "ı")
    i
    
    str_subset(i, coll("i", ignore_case = TRUE))
    str_subset(i, coll("i", ignore_case = TRUE, locale = "tr"))
    ```
    
    Both `fixed()` and `regex()` have `ignore_case` arguments, but they
    do not allow you to pick the locale: they always use the default locale.
    You can see what that is with the following code; more on stringi
    later.
    
    ```{r}
    stringi::stri_locale_info()
    ```
    
    The downside of `coll()` is speed; because the rules for recognising which
    characters are the same are complicated, `coll()` is relatively slow
    compared to `regex()` and `fixed()`.

*   As you saw with `str_split()` you can use `boundary()` to match boundaries.
    You can also use it with the other functions: 
    
    ```{r}
    x <- "This is a sentence."
    str_view_all(x, boundary("word"))
    str_extract_all(x, boundary("word"))
    ```
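
To make the "avoid complex escaping" point about `fixed()` concrete, here is a minimal comparison on two made-up strings:

```{r}
library(stringr)

x <- c("a.c", "abc")

# As a regular expression, "." matches any character
str_detect(x, ".")

# As a fixed pattern, "." only matches a literal dot, with no escaping
str_detect(x, fixed("."))

# The equivalent regular expression needs an escape
str_detect(x, "\\.")
```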

### Exercises

1.  How would you find all strings containing `\` with `regex()` vs.
    with `fixed()`?

1.  What are the five most common words in `sentences`?

## Other uses of regular expressions

There are two useful functions in base R that also use regular expressions:

*   `apropos()` searches all objects available from the global environment. This
    is useful if you can't quite remember the name of the function.
    
    ```{r}
    apropos("replace")
    ```
    
*   `dir()` lists all the files in a directory. The `pattern` argument takes
    a regular expression and only returns file names that match the pattern.
    For example, you can find all the R Markdown files in the current
    directory with:
    
    ```{r}
    head(dir(pattern = "\\.Rmd$"))
    ```
    
    (If you're more comfortable with "globs" like `*.Rmd`, you can convert
    them to regular expressions with `glob2rx()`.)
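
As a small sketch of that conversion, the code below turns a glob into a regular expression and checks it against a few made-up file names (the `files` vector is hypothetical).

```{r}
# glob2rx() is in the utils package, which is attached by default
rx <- glob2rx("*.Rmd")
rx

# Hypothetical file names, used only for illustration
files <- c("strings.Rmd", "strings.html", "notes.Rmd.bak")
grepl(rx, files)
```

The resulting string can also be passed to the `pattern` argument of `dir()`.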

## stringi

stringr is built on top of the __stringi__ package. stringr is useful when you're learning because it exposes a minimal set of functions, which have been carefully picked to handle the most common string manipulation tasks. stringi, on the other hand, is designed to be comprehensive. It contains almost every function you might ever need: stringi has `r length(getNamespaceExports("stringi"))` functions to stringr's `r length(getNamespaceExports("stringr"))`.

If you find yourself struggling to do something in stringr, it's worth taking a look at stringi. The packages work very similarly, so you should be able to translate your stringr knowledge in a natural way. The main difference is the prefix: `str_` vs. `stri_`.
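
For instance, here are two stringr calls next to what I believe are their closest stringi counterparts. The pairing is my own illustration, not an official mapping.

```{r}
library(stringr)
library(stringi)

x <- c("apple", "banana", NA)

# Same task, different prefix
str_length(x)
stri_length(x)

# stringi names the matching engine in the function name
str_detect(x, "an")
stri_detect_regex(x, "an")
```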

### Exercises

1.  Find the stringi functions that:

    1. Count the number of words.
    1. Find duplicated strings.
    1. Generate random text.

1.  How do you control the language that `stri_sort()` uses for 
    sorting?
-												Fix yaml metadata

											
										
										
											2015-12-17 07:22:03 +08:00
+								# Strings
-												Make sure first element is heading

											
										
										
											2015-12-12 02:34:20 +08:00
-												Consistent chapter intro layout

											
										
										
											2016-07-19 21:01:50 +08:00
+								## Introduction
-												Proof strings

											
										
										
											2016-08-13 00:28:16 +08:00
+								This chapter introduces you to string manipulation in R. You'll learn the basics of how strings work and how to create them by hand, but the focus of this chapter will be on regular expressions, or regexps for short. Regular expressions are useful because strings usually contain unstructured or semi-structured data, and regexps are a concise language for describing patterns in strings. When you first look at a regexp, you'll think a cat walked across your keyboard, but as your understanding improves they will soon start to make sense.
-												Consistent chapter intro layout

											
										
										
											2016-07-19 21:01:50 +08:00
 								### Prerequisites
-												Use tidyverse package

Fixes #451

											
										
										
											2016-10-04 01:30:24 +08:00
+								This chapter will focus on the __stringr__ package for string manipulation. stringr is not part of the core tidyverse because you don't always have textual data, so we need to load it explicitly.
-												Consistent chapter intro layout

											
										
										
											2016-07-19 21:01:50 +08:00
-												Use tidyverse package

Fixes #451

											
										
										
											2016-10-04 01:30:24 +08:00
+								```{r setup, message = FALSE}
 								library(tidyverse)
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								library(stringr)
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
+								```
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
+								## String basics
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
-												Proof strings

											
										
										
											2016-08-13 00:28:16 +08:00
+								You can create strings with either single quotes or double quotes. Unlike other languages, there is no difference in behaviour. I recommend always using `"`, unless you want to create a string that contains multiple `"`.
-												Some string tweaking

											
										
										
											2015-10-26 22:52:24 +08:00
 								```{r}
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
+								string1 <- "This is a string"
 								string2 <- 'If I want to include a "quote" inside a string, I use single quotes'
-												Some string tweaking

											
										
										
											2015-10-26 22:52:24 +08:00
+								```
-												More @csgillespie changes

											
										
										
											2016-10-04 22:01:12 +08:00
+								If you forget to close a quote, you'll see `+`, the continuation character:
 								```
 								> "This is a string without a closing quote
 								+
 								+
 								+ HELP I'M STUCK
 								```
 								If this happen to you, press Escape and try again!
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
+								To include a literal single or double quote in a string you can use `\` to "escape" it:
-												Some string tweaking

											
										
										
											2015-10-26 22:52:24 +08:00
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
+								```{r}
 								double_quote <- "\"" # or '"'
 								single_quote <- '\'' # or "'"
 								```
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
-												Small copy edits edits to strings.Rmd

											
										
										
											2016-04-08 04:10:32 +08:00
+								That means if you want to include a literal backslash, you'll need to double it up: `"\\"`.
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
-												Small copy edits edits to strings.Rmd

											
										
										
											2016-04-08 04:10:32 +08:00
+								Beware that the printed representation of a string is not the same as string itself, because the printed representation shows the escapes. To see the raw contents of the string, use `writeLines()`:
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
 								```{r}
 								x <- c("\"", "\\")
 								x
 								writeLines(x)
 								```
-												Another pass through strings

											
										
										
											2016-08-08 23:45:11 +08:00
+								There are a handful of other special characters. The most common are `"\n"`, newline, and `"\t"`, tab, but you can see the complete list by requesting help on `"`: `?'"'`, or `?"'"`. You'll also sometimes see strings like `"\u00b5"`, this is a way of writing non-English characters that works on all platforms:
-												Some string tweaking

											
										
										
											2015-10-26 22:52:24 +08:00
-												More on strings

											
										
										
											2015-10-27 22:33:41 +08:00
+								```{r}
-												Some string tweaking

											
										
										
											2015-10-26 22:52:24 +08:00
+								x <- "\u00b5"
 								x
 								```
-												Another pass through strings

											
										
										
											2016-08-08 23:45:11 +08:00
+								Multiple strings are often stored in a character vector, which you can create with `c()`:
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
 								```{r}
-												Another pass through strings

											
										
										
											2016-08-08 23:45:11 +08:00
+								c("one", "two", "three")
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
+								```
-												Another pass through strings

											
										
										
											2016-08-08 23:45:11 +08:00
+								### String length
-												Proof strings

											
										
										
											2016-08-13 00:28:16 +08:00
+								Base R contains many functions to work with strings but we'll avoid them because they can be inconsistent, which makes them hard to remember. Instead we'll use functions from stringr. These have more intuitive names, and all start with `str_`. For example, `str_length()` tells you the number of characters in a string:
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
 								```{r}
-												Proof strings

											
										
										
											2016-08-13 00:28:16 +08:00
+								str_length(c("a", "R for data science", NA))
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								```
-												Add autocomplete screenshot

											
										
										
											2015-11-09 20:32:56 +08:00
+								The common `str_` prefix is particularly useful if you use RStudio, because typing `str_` will trigger autocomplete, allowing you to see all stringr functions:
-												Local bookdown working

											
										
										
											2015-12-12 03:28:10 +08:00
+								```{r, echo = FALSE}
 								knitr::include_graphics("screenshots/stringr-autocomplete.png")
-												Add autocomplete screenshot

											
										
										
											2015-11-09 20:32:56 +08:00
+								```
-												More on strings

											
										
										
											2015-10-27 22:33:41 +08:00
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								### Combining strings
 								To combine two or more strings, use `str_c()`:
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								```{r}
 								str_c("x", "y")
 								str_c("x", "y", "z")
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
+								```
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								Use the `sep` argument to control how they're separated:
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
 								```{r}
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								str_c("x", "y", sep = ", ")
 								```
-												Small copy edits edits to strings.Rmd

											
										
										
											2016-04-08 04:10:32 +08:00
+								Like most other functions in R, missing values are contagious. If you want them to print as `"NA"`, use `str_replace_na()`:
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
 								```{r}
-												Some string tweaking

											
										
										
											2015-10-26 22:52:24 +08:00
+								x <- c("abc", NA)
 								str_c("|-", x, "-|")
 								str_c("|-", str_replace_na(x), "-|")
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								```
-												Another pass through strings

											
										
										
											2016-08-08 23:45:11 +08:00
+								As shown above, `str_c()` is vectorised, and it automatically recycles shorter vectors to the same length as the longest:
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
 								```{r}
 								str_c("prefix-", c("a", "b", "c"), "-suffix")
 								```
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
+								Objects of length 0 are silently dropped. This is particularly useful in conjunction with `if`:
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
 								```{r}
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
+								name <- "Hadley"
 								time_of_day <- "morning"
 								birthday <- FALSE
-												Proof strings

											
										
										
											2016-08-13 00:28:16 +08:00
+								str_c(
 								  "Good ", time_of_day, " ", name,
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
+								  if (birthday) " and HAPPY BIRTHDAY",
 								  "."
 								)
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								```
-												Another pass through strings

											
										
										
											2016-08-08 23:45:11 +08:00
+								To collapse a vector of strings into a single string, use `collapse`:
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
 								```{r}
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
+								str_c(c("x", "y", "z"), collapse = ", ")
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								```
 								### Subsetting strings
-												Small copy edits edits to strings.Rmd

											
										
										
											2016-04-08 04:10:32 +08:00
+								You can extract parts of a string using `str_sub()`. As well as the string, `str_sub()` takes `start` and `end` arguments which give the (inclusive) position of the substring:
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
 								```{r}
-												Fix for string subsetting example

Words in example vector `x <- c("apple", "banana", "pear")` should start with an uppercase letter so the `str_to_lower` example makes sense.
											
										
										
											2015-11-10 02:41:00 +08:00
+								x <- c("Apple", "Banana", "Pear")
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								str_sub(x, 1, 3)
 								# negative numbers count backwards from end
 								str_sub(x, -3, -1)
 								```
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
+								Note that `str_sub()` won't fail if the string is too short: it will just return as much as possible:
-												More on strings

											
										
										
											2015-10-27 22:33:41 +08:00
 								```{r}
 								str_sub("a", 1, 5)
 								```
-												Proof strings

											
										
										
											2016-08-13 00:28:16 +08:00
+								You can also use the assignment form of `str_sub()` to modify strings:
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
 								```{r}
 								str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
 								x
 								```
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
+								### Locales
-												Another pass through strings

											
										
										
											2016-08-08 23:45:11 +08:00
+								Above I used `str_to_lower()` to change the text to lower case. You can also use `str_to_upper()` or `str_to_title()`. However, changing case is more complicated than it might at first appear because different languages have different rules for changing case. You can pick which set of rules to use by specifying a locale:
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
 								```{r}
-												Keeping on writing about strings

											
										
										
											2015-11-02 10:59:18 +08:00
+								# Turkish has two i's: with and without a dot, and it
 								# has a different rule for capitalising them:
 								str_to_upper(c("i", "ı"))
 								str_to_upper(c("i", "ı"), locale = "tr")
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
+								```
-												Proof strings

											
										
										
											2016-08-13 00:28:16 +08:00
+								The locale is specified as a ISO 639 language code, which is a two or three letter abbreviation. If you don't already know the code for your language, [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) has a good list. If you leave the locale blank, it will use the current locale, as provided by your operating system.
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
-												Update strings.Rmd

typos
											
										
										
											2016-02-12 01:45:33 +08:00
+								Another important operation that's affected by the locale is sorting. The base R `order()` and `sort()` functions sort strings using the current locale. If you want robust behaviour across different computers, you may want to use `str_sort()` and `str_order()` which take an additional `locale` argument:
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
 								```{r}
 								x <- c("apple", "eggplant", "banana")
-												Proof strings

											
										
										
											2016-08-13 00:28:16 +08:00
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
+								str_sort(x, locale = "en")  # English
-												Proof strings

											
										
										
											2016-08-13 00:28:16 +08:00
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
+								str_sort(x, locale = "haw") # Hawaiian
 								```
-												More on strings

											
										
										
											2015-10-27 22:33:41 +08:00
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								### Exercises
-												More on strings

											
										
										
											2015-10-27 22:33:41 +08:00
+.  In code that doesn't use stringr, you'll often see `paste()` and `paste0()`.
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
+								    What's the difference between the two functions? What stringr function are
 								    they equivalent to? How do the functions differ in their handling of
-												More on strings

											
										
										
											2015-10-27 22:33:41 +08:00
+								    `NA`?
-												Proof strings

											
										
										
											2016-08-13 00:28:16 +08:00
+.  In your own words, describe the difference between the `sep` and `collapse`
 								    arguments to `str_c()`.
-												More on strings

											
										
										
											2015-10-27 22:33:41 +08:00
+.  Use `str_length()` and `str_sub()` to extract the middle character from
-												Another pass through strings

											
										
										
											2016-08-08 23:45:11 +08:00
+								    a string. What will you do if the string has an even number of characters?
-												More on strings

											
										
										
											2015-10-27 22:33:41 +08:00
 .  What does `str_wrap()` do? When might you want to use it?
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
+.  What does `str_trim()` do? What's the opposite of `str_trim()`?
-												More on strings

											
										
										
											2015-10-27 22:33:41 +08:00
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
+.  Write a function that turns (e.g.) a vector `c("a", "b", "c")` into
 								    the string `a, b, and c`. Think carefully about what it should do if
 								    given a vector of length 0, 1, or 2.
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
+								## Matching patterns with regular expressions
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
-												Another pass through strings

											
										
										
											2016-08-08 23:45:11 +08:00
+								Regexps are a very terse language that allow you to describe patterns in strings. They take a little while to get your head around, but once you understand them, you'll find them extremely useful.
-												More on strings

											
										
										
											2015-10-27 22:33:41 +08:00
-												Small copy edits edits to strings.Rmd

											
										
										
											2016-04-08 04:10:32 +08:00
+								To learn regular expressions, we'll use `str_view()` and `str_view_all()`. These functions take a character vector and a regular expression, and show you how they match. We'll start with very simple regular expressions and then gradually get more and more complicated. Once you've mastered pattern matching, you'll learn how to apply those ideas with various stringr functions.
-												More on strings

											
										
										
											2015-10-27 22:33:41 +08:00
-												Update strings.Rmd

typos
											
										
										
											2016-02-12 01:45:33 +08:00
+								### Basic matches
-												More on strings

											
										
										
											2015-10-27 22:33:41 +08:00
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
+								The simplest patterns match exact strings:
-												More on strings

											
										
										
											2015-10-27 22:33:41 +08:00
-												Strings tweaks

* Don't need to cache html widgets anymore
* Data now in stringr package

											
										
										
											2016-07-20 07:01:52 +08:00
+								```{r}
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
+								x <- c("apple", "banana", "pear")
 								str_view(x, "an")
-												More on strings

											
										
										
											2015-10-27 22:33:41 +08:00
+								```
-												Update strings.Rmd

typos
											
										
										
											2016-02-12 01:45:33 +08:00
+								The next step up in complexity is `.`, which matches any character (except a newline):
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
-												Strings tweaks

* Don't need to cache html widgets anymore
* Data now in stringr package

											
										
										
											2016-07-20 07:01:52 +08:00
+								```{r}
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
+								str_view(x, ".a.")
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								```
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
-												Proof strings

											
										
										
											2016-08-13 00:28:16 +08:00
+								But if "`.`" matches any character, how do you match the character "`.`"? You need to use an "escape" to tell the regular expression you want to match it exactly, not use its special behaviour. Like strings, regexps use the backslash, `\`, to escape special behaviour. So to match an `.`, you need the regexp `\.`. Unfortunately this creates a problem. We use strings to represent regular expressions, and `\` is also used as an escape symbol in strings. So to create the regular expression `\.` we need the string `"\\."`.
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
-												Strings tweaks

* Don't need to cache html widgets anymore
* Data now in stringr package

											
										
										
											2016-07-20 07:01:52 +08:00
+								```{r}
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								# To create the regular expression, we need \\
 								dot <- "\\."
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								# But the expression itself only contains one:
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
+								writeLines(dot)
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
-												Another pass through strings

											
										
										
											2016-08-08 23:45:11 +08:00
+								# And this tells R to look for an explicit .
-												Use str_view htmlwidget

											
										
										
											2015-10-28 00:03:27 +08:00
+								str_view(c("abc", "a.c", "bef"), "a\\.c")
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								```

If `\` is used as an escape character in regular expressions, how do you match a literal `\`? Well, you need to escape it, creating the regular expression `\\`. To create that regular expression, you need to use a string, which also needs to escape `\`. That means to match a literal `\` you need to write `"\\\\"`: you need four backslashes to match one!

```{r}
x <- "a\\b"
writeLines(x)

str_view(x, "\\\\")
```
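
If the two layers of escaping still feel slippery, counting characters can help. The following is only a sketch: the names `backslash` and `paths` and the example values are made up for illustration.

```{r}
library(stringr)

# The string literal "\\\\" holds just two characters: the regexp \\
backslash <- "\\\\"
str_length(backslash)
writeLines(backslash)

# Hypothetical data: only the first element contains a backslash
paths <- c("C:\\temp", "C:/temp")
str_detect(paths, backslash)
```

The string is two characters long, and that two-character regexp matches a single literal backslash in the data.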

In this book, I'll write regular expressions as `\.` and strings that represent the regular expression as `"\\."`.

#### Exercises

1.  Explain why each of these strings doesn't match a `\`: `"\"`, `"\\"`, `"\\\"`.

1.  How would you match the sequence `"'\`?

1.  What patterns will the regular expression `\..\..\..` match?
    How would you represent it as a string?

### Anchors

By default, regular expressions will match any part of a string. It's often useful to _anchor_ the regular expression so that it matches from the start or end of the string. You can use:

* `^` to match the start of the string.
* `$` to match the end of the string.

```{r}
x <- c("apple", "banana", "pear")
str_view(x, "^a")
str_view(x, "a$")
```

To remember which is which, try this mnemonic which I learned from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): if you begin with power (`^`), you end up with money (`$`).

To force a regular expression to only match a complete string, anchor it with both `^` and `$`:

```{r}
x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")
str_view(x, "^apple$")
```

You can also match the boundary between words with `\b`. I don't often use this in R, but I will sometimes use it when I'm doing a search in RStudio when I want to find the name of a function that's a component of other functions. For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on.
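
The same search can be tried from R. This is a small sketch, assuming a made-up vector `fns` that holds the function names mentioned above:

```{r}
library(stringr)

# Hypothetical vector of function names, mirroring the search described above
fns <- c("sum", "summarise", "summary", "rowsum")

# Without boundaries, "sum" is found inside every name
str_detect(fns, "sum")

# With \b on both sides, only the standalone word matches
str_detect(fns, "\\bsum\\b")
```

Note that the string needs `"\\b"`, because the backslash has to be escaped for the string as well.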

#### Exercises

1.  How would you match the literal string `"$^$"`?

1.  Given the corpus of common words in `stringr::words`, create regular
    expressions that find all words that:

    1. Start with "y".
    1. End with "x".
    1. Are exactly three letters long. (Don't cheat by using `str_length()`!)
    1. Have seven letters or more.

    Since this list is long, you might want to use the `match` argument to
    `str_view()` to show only the matching or non-matching words.

### Character classes and alternatives

There are a number of special patterns that match more than one character. You've already seen `.`, which matches any character apart from a newline. There are four other useful tools:

* `\d`: matches any digit.
* `\s`: matches any whitespace (e.g. space, tab, newline).
* `[abc]`: matches a, b, or c.
* `[^abc]`: matches anything except a, b, or c.

Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`.
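
Here is a quick sketch of the four tools in action. The vector `x` is invented for illustration, and `[aeiou]` stands in for a class like `[abc]`:

```{r}
library(stringr)

# Hypothetical inputs for illustration
x <- c("room 101", "no digits", "tab\there")

str_detect(x, "\\d")        # does the string contain a digit?
str_detect(x, "\\s")        # does it contain whitespace (space or tab here)?
str_extract(x, "[aeiou]")   # first character that is a vowel
str_extract(x, "[^aeiou]")  # first character that is not a vowel
```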

A character class containing a single character is a nice alternative to backslash escapes when you want to include a single metacharacter in a regex. Many people find this more readable.

```{r}
# Look for a literal character that normally has special meaning in a regex
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]")
```

This works for most (but not all) regex metacharacters: `$` `.` `|` `?` `*` `+` `(` `)` `[` `{`. Unfortunately, a few characters have special meaning even inside a character class and must be handled with backslash escapes: `]` `\` `^` and `-`.

You can use _alternation_ to pick between one or more alternative patterns. For example, `abc|d..f` will match either `"abc"` or `"deaf"`. Note that the precedence for `|` is low, so that `abc|xyz` matches `abc` or `xyz`, not `abcyz` or `abxyz`. Like with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:

```{r}
str_view(c("grey", "gray"), "gr(e|a)y")
```
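
To see the low precedence of `|` for yourself, here is a minimal sketch using the four strings from the paragraph above (the vector name `x` is just for illustration):

```{r}
library(stringr)

x <- c("abc", "xyz", "abxyz", "abcyz")

# Low precedence: the alternatives are the whole of "abc" or the whole of "xyz"
str_extract(x, "abc|xyz")

# Parentheses restrict the alternation to the single letters c and x
str_extract(x, "ab(c|x)yz")
```

The first pattern finds `abc` or `xyz` wherever they occur, while the second only matches the five-letter strings.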

#### Exercises

1.  Create regular expressions to find all words that:

    1. Start with a vowel.
    1. Contain only consonants. (Hint: think about matching
       "not"-vowels.)
    1. End with `ed`, but not with `eed`.
    1. End with `ing` or `ise`.

1.  Empirically verify the rule "i before e except after c".

1.  Is "q" always followed by a "u"?

1.  Write a regular expression that matches a word if it's probably written
    in British English, not American English.

1.  Create a regular expression that will match telephone numbers as commonly
    written in your country.

### Repetition

The next step up in power involves controlling how many times a pattern matches:

* `?`: 0 or 1
* `+`: 1 or more
* `*`: 0 or more

```{r}
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")
str_view(x, "CC+")
str_view(x, 'C[LX]+')
```

Note that the precedence of these operators is high: they apply only to the single character or group immediately before them, so you can write `colou?r` to match either American or British spellings. That means most uses that repeat more than one character will need parentheses, like `bana(na)+`.
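
A short sketch of what that precedence means in practice. The vectors `spellings` and `bananas` are made up for illustration:

```{r}
library(stringr)

# ? applies only to the "u" immediately before it
spellings <- c("color", "colour", "colouur")
str_detect(spellings, "^colou?r$")

# + applies to the whole group (na)
bananas <- c("bana", "banana", "bananana")
str_extract(bananas, "bana(na)+")

# Without parentheses, + applies only to the final "a"
str_extract("banaaa", "bana+")
```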

You can also specify the number of matches precisely:

* `{n}`: exactly n
* `{n,}`: n or more
* `{,m}`: at most m (some regular expression engines do not accept this form; `{0,m}` is an equivalent, more portable spelling)
* `{n,m}`: between n and m

```{r}
str_view(x, "C{2}")
str_view(x, "C{2,}")
str_view(x, "C{2,3}")
```

By default these matches are "greedy": they will match the longest string possible. You can make them "lazy", matching the shortest string possible by putting a `?` after them. This is an advanced feature of regular expressions, but it's useful to know that it exists:

```{r}
str_view(x, 'C{2,3}?')
str_view(x, 'C[LX]+?')
```
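
If you'd rather see the matched text than a highlighted view, `str_extract()` returns the first match as a string. This sketch re-creates `x` so that it runs on its own:

```{r}
library(stringr)

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

str_extract(x, "C{2,3}")   # greedy: takes all three Cs
str_extract(x, "C{2,3}?")  # lazy: stops as soon as it has two Cs
str_extract(x, "C[LX]+")   # greedy: keeps going through the Xs
str_extract(x, "C[LX]+?")  # lazy: stops after a single L
```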

#### Exercises

1.  Describe the equivalents of `?`, `+`, `*` in `{m,n}` form.

1.  Describe in words what these regular expressions match:
    (read carefully to see if I'm using a regular expression or a string
    that defines a regular expression.)

    1. `^.*$`
    1. `"\\{.+\\}"`
    1. `\d{4}-\d{2}-\d{2}`
    1. `"\\\\{4}"`

1.  Create regular expressions to find all words that:

    1. Start with three consonants.
    1. Have three or more vowels in a row.
    1. Have two or more vowel-consonant pairs in a row.

1.  Solve the beginner regexp crosswords at
    <https://regexcrossword.com/challenges/beginner>.

### Grouping and backreferences

Earlier, you learned about parentheses as a way to disambiguate complex expressions. Parentheses also create a _numbered_ capturing group (number 1, 2 etc.). A capturing group stores _the part of the string_ matched by the part of the regular expression inside the parentheses. You can refer to the same text as previously matched by a capturing group with _backreferences_, like `\1`, `\2` etc. For example, the following regular expression finds all fruits that have a repeated pair of letters.

```{r}
str_view(fruit, "(..)\\1", match = TRUE)
```

(Shortly, you'll also see how they're useful in conjunction with `str_match()`.)
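
To make the idea of "the same text as previously matched" concrete, here is a small sketch on a hand-picked vector (the words in `x` are chosen for illustration, not taken from `fruit`):

```{r}
library(stringr)

# Hypothetical words for illustration
x <- c("banana", "coconut", "pepper", "kiwi")

# (..)\1: any two characters followed by the same two characters again
str_extract(x, "(..)\\1")

# (.)\1: any single character immediately repeated
str_detect(x, "(.)\\1")
```

The backreference must match the text the group captured, not just the same pattern, which is why `pepper` has a doubled letter but no repeated pair.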

#### Exercises

1.  Describe, in words, what these expressions will match:

    1. `(.)\1\1`
    1. `"(.)(.)\\2\\1"`
    1. `(..)\1`
    1. `"(.).\\1.\\1"`
    1. `"(.)(.)(.).*\\3\\2\\1"`

1.  Construct regular expressions to match words that:

    1. Start and end with the same character.

    1. Contain a repeated pair of letters
       (e.g. "church" contains "ch" repeated twice.)

    1. Contain one letter repeated in at least three places
       (e.g. "eleven" contains three "e"s.)

## Tools

Now that you've learned the basics of regular expressions, it's time to learn how to apply them to real problems. In this section you'll learn a wide array of stringr functions that let you:

* Determine which strings match a pattern.
* Find the positions of matches.
* Extract the content of matches.
* Replace matches with new values.
* Split a string based on a match.

A word of caution before we continue: because regular expressions are so powerful, it's easy to try and solve every problem with a single regular expression. In the words of Jamie Zawinski:

> Some people, when confronted with a problem, think “I know, I’ll use regular
> expressions.” Now they have two problems.

As a cautionary tale, check out this regular expression that checks if an email address is valid:

```
 								(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
 								)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
 								\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
 								?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[
 								\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
 								31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
 								](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
 								(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
 								(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
 								|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)
 								?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\
 								r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[
 								 \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)
 								?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t]
 								)*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[
 								 \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*
 								)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
 								)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)
 								*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+
 								|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r
 								\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:
 								\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t
 								]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031
 								]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](
 								?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?
 								:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?
 								:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
 								:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?
 								[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\]
 								\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|
 								\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>
 								@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
 								(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t]
 								)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
 								".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?
 								:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[
 								\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-
 								\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(
 								?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;
 								:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
 								^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\"
 								.\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
 								]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
 								[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\
 								r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\]
 								\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
 								|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0
 								00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\
 								.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,
 								;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
 								:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*
 								(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
 								\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[
 								^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
 								]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(
 								?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
 								".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(
 								?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[
 								\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t
 								])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t
 								])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?
 								:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|
 								\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:
 								[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
 								]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)
 								?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["
 								()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
 								?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>
 								@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[
 								 \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,
 								;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t]
 								)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
 								".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?
 								(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
 								\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:
 								\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[
 								"()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])
 								*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
 								+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
 								.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
 								|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(
 								?:\r\n)?[ \t])*))*)?;\s*)
```

This is a somewhat pathological example (because email addresses are actually surprisingly complex), but is used in real code. See the Stack Overflow discussion at <http://stackoverflow.com/a/201378> for more details.

Don't forget that you're in a programming language and you have other tools at your disposal. Instead of creating one complex regular expression, it's often easier to write a series of simpler regexps. If you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving on to the next one.
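
As a rough sketch of what "smaller pieces" might look like, the code below combines two very simple checks instead of one big pattern. It is only a plausibility check on made-up addresses, not real email validation, and the names `emails`, `has_one_at` and `dot_after_at` are invented for this example. It uses the stringr functions `str_count()` and `str_detect()`.

```{r}
library(stringr)

# Hypothetical inputs; this is a rough plausibility check, not real validation
emails <- c("hadley@example.com", "no-at-sign.example.com", "two@@example.com")

# Piece 1: exactly one @ sign
has_one_at <- str_count(emails, "@") == 1

# Piece 2: a literal dot somewhere after the @ sign
dot_after_at <- str_detect(emails, "@.+\\.")

has_one_at & dot_after_at
```

Each piece is easy to read and test on its own, which is the main point of splitting the problem up.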

### Detect matches

To determine which elements of a character vector match a pattern, use `str_detect()`. It returns a logical vector the same length as the input:

```{r}
x <- c("apple", "banana", "pear")
str_detect(x, "e")
```

Remember that when you use a logical vector in a numeric context, `FALSE` becomes 0 and `TRUE` becomes 1. That makes `sum()` and `mean()` useful if you want to answer questions about matches across a larger vector:

```{r}
# How many common words start with t?
sum(str_detect(words, "^t"))
# What proportion of common words end with a vowel?
mean(str_detect(words, "[aeiou]$"))
```
When you have complex logical conditions (e.g. match a or b but not c unless d) it's often easier to combine multiple `str_detect()` calls with logical operators, rather than trying to create a single regular expression. For example, here are two ways to find all words that don't contain any vowels:

```{r}
# Find all words containing at least one vowel, and negate
no_vowels_1 <- !str_detect(words, "[aeiou]")
# Find all words consisting only of consonants (non-vowels)
no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
identical(no_vowels_1, no_vowels_2)
```

The results are identical, but I think the first approach is significantly easier to understand. If your regular expression gets overly complicated, try breaking it up into smaller pieces, giving each piece a name, and then combining the pieces with logical operations.

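
As a rough sketch of that advice, here's one way to name the pieces and combine them. The question itself (words that start with "s", end with "e", and contain no "a") is made up purely for illustration:

```{r}
library(stringr)

# Each piece answers one simple question and gets its own name
starts_s <- str_detect(words, "^s")
ends_e <- str_detect(words, "e$")
has_a <- str_detect(words, "a")

# Combine the named pieces with logical operators
words[starts_s & ends_e & !has_a]
```

Each piece can be checked on its own before you combine them, which is usually easier than debugging one long regular expression.
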
A common use of `str_detect()` is to select the elements that match a pattern. You can do this with logical subsetting, or the convenient `str_subset()` wrapper:

```{r}
words[str_detect(words, "x$")]
str_subset(words, "x$")
```

Typically, however, your strings will be one column of a data frame, and you'll want to use `filter()` instead:

```{r}
df <- tibble(
  word = words,
  i = seq_along(word)
)
df %>%
  filter(str_detect(word, "x$"))
```

A variation on `str_detect()` is `str_count()`: rather than a simple yes or no, it tells you how many matches there are in a string:

```{r}
x <- c("apple", "banana", "pear")
str_count(x, "a")

# On average, how many vowels per word?
mean(str_count(words, "[aeiou]"))
```

It's natural to use `str_count()` with `mutate()`:

```{r}
df %>%
  mutate(
    vowels = str_count(word, "[aeiou]"),
    consonants = str_count(word, "[^aeiou]")
  )
```

Note that matches never overlap. For example, in `"abababa"`, how many times will the pattern `"aba"` match? Regular expressions say two, not three:

```{r}
str_count("abababa", "aba")
str_view_all("abababa", "aba")
```

Note the use of `str_view_all()`. As you'll shortly learn, many stringr functions come in pairs: one function works with a single match, and the other works with all matches. The second function will have the suffix `_all`.

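
To make the pairing concrete, here is a small sketch using two such pairs on a made-up string (the input `"banana bread"` is only for illustration):

```{r}
library(stringr)

x <- "banana bread"

# Single-match versions only act on the first match in each string
str_extract(x, "an")
str_replace(x, "an", "AN")

# The `_all` versions act on every match in each string
str_extract_all(x, "an")
str_replace_all(x, "an", "AN")
```

Note that `str_extract_all()` returns a list rather than a character vector, because each string can have a different number of matches.
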
#### Exercises

1.  For each of the following challenges, try solving it by using both a single
    regular expression, and a combination of multiple `str_detect()` calls.

    1.  Find all words that start or end with `x`.

    1.  Find all words that start with a vowel and end with a consonant.

    1.  Are there any words that contain at least one of each different
        vowel?

1.  What word has the highest number of vowels? What word has the highest
    proportion of vowels? (Hint: what is the denominator?)


### Extract matches

To extract the actual text of a match, use `str_extract()`. To show that off, we're going to need a more complicated example. I'm going to use the [Harvard sentences](https://en.wikipedia.org/wiki/Harvard_sentences), which were designed to test VOIP systems, but are also useful for practicing regexps. These are provided in `stringr::sentences`:

```{r}
length(sentences)
head(sentences)
```

Imagine we want to find all sentences that contain a colour. We first create a vector of colour names, and then turn it into a single regular expression:

```{r}
colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
colour_match
```

Now we can select the sentences that contain a colour, and then extract the colour to figure out which one it is:

```{r}
has_colour <- str_subset(sentences, colour_match)
matches <- str_extract(has_colour, colour_match)
head(matches)
```

Note that `str_extract()` only extracts the first match. We can see that most easily by first selecting all the sentences that have more than 1 match:

```{r}
more <- sentences[str_count(sentences, colour_match) > 1]
str_view_all(more, colour_match)
str_extract(more, colour_match)
```

This is a common pattern for stringr functions, because working with a single match allows you to use much simpler data structures. To get all matches, use `str_extract_all()`. It returns a list:

```{r}
str_extract_all(more, colour_match)
```

You'll learn more about lists in [lists](#lists) and [iteration].

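
In the meantime, a few base R tools are enough for simple cases. The following is a small sketch, assuming a made-up vector `x` and pattern `pattern` (neither comes from the sentences data):

```{r}
library(stringr)

# Hypothetical input: the middle string has no match at all
x <- c("red and blue", "no colours here", "green, green, yellow")
pattern <- "red|green|blue|yellow"

matches <- str_extract_all(x, pattern)

lengths(matches)   # number of matches found in each string
matches[[3]]       # the matches for the third string only
unlist(matches)    # all matches flattened into one character vector
```

A string with no match gives a zero-length character vector in the list, so the result always has one element per input string.
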
If you use `simplify = TRUE`, `str_extract_all()` will return a matrix with short matches expanded to the same length as the longest:

```{r}
str_extract_all(more, colour_match, simplify = TRUE)

x <- c("a", "a b", "a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)
```

#### Exercises

1.  In the previous example, you might have noticed that the regular
    expression matched "flickered", which is not a colour. Modify the
    regex to fix the problem.

1.  From the Harvard sentences data, extract:

    1. The first word from each sentence.
    1. All words ending in `ing`.
    1. All plurals.


### Grouped matches

Earlier in this chapter we talked about the use of parentheses for clarifying precedence and for backreferences when matching. You can also use parentheses to extract parts of a complex match. For example, imagine we want to extract nouns from the sentences. As a heuristic, we'll look for any word that comes after "a" or "the". Defining a "word" in a regular expression is a little tricky, so here I use a simple approximation: a sequence of at least one character that isn't a space.

```{r}
noun <- "(a|the) ([^ ]+)"

has_noun <- sentences %>%
  str_subset(noun) %>%
  head(10)
has_noun %>%
  str_extract(noun)
```

`str_extract()` gives us the complete match; `str_match()` gives each individual component. Instead of a character vector, it returns a matrix, with one column for the complete match followed by one column for each group:

```{r}
has_noun %>%
  str_match(noun)
```

(Unsurprisingly, our heuristic for detecting nouns is poor, and also picks up adjectives like smooth and parked.)

If your data is in a tibble, it's often easier to use `tidyr::extract()`. It works like `str_match()` but requires you to name the matches, which are then placed in new columns:

```{r}
tibble(sentence = sentences) %>%
  tidyr::extract(
    sentence, c("article", "noun"), "(a|the) ([^ ]+)",
    remove = FALSE
  )
```

As with `str_extract()`, `str_match()` only returns the first match in each string. If you want all matches for each string, you'll need `str_match_all()`.

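
Here is a minimal sketch of `str_match_all()`, using the same article-plus-word pattern on a made-up sentence (the sentence is an assumption for illustration, not part of `sentences`):

```{r}
library(stringr)

noun <- "(a|the) ([^ ]+)"
x <- "the cat sat on a mat near the door"  # hypothetical sentence

# One matrix per input string; one row per match, one column per group
all_nouns <- str_match_all(x, noun)
all_nouns[[1]]
```

The result is a list because each input string can contain a different number of matches, just as with `str_extract_all()`.
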
#### Exercises

1. Find all words that come after a "number" like "one", "two", "three" etc.
   Pull out both the number and the word.

1. Find all contractions. Separate out the pieces before and after the
   apostrophe.


### Replacing matches

`str_replace()` and `str_replace_all()` allow you to replace matches with new strings. The simplest use is to replace a pattern with a fixed string:

```{r}
x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
str_replace_all(x, "[aeiou]", "-")
```

With `str_replace_all()` you can perform multiple replacements by supplying a named vector:

```{r}
x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
```

Instead of replacing with a fixed string you can use backreferences to insert components of the match. In the following code, I flip the order of the second and third words.

```{r}
sentences %>%
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
  head(5)
```

#### Exercises

1.   Replace all forward slashes in a string with backslashes.

1.   Implement a simple version of `str_to_lower()` using `str_replace_all()`.

1.   Switch the first and last letters in `words`. Which of those strings
     are still words?


### Splitting

Use `str_split()` to split a string up into pieces. For example, we could split sentences into words:

```{r}
sentences %>%
  head(5) %>%
  str_split(" ")
```

Because each component might contain a different number of pieces, this returns a list. If you're working with a length-1 vector, the easiest thing is to just extract the first element of the list:

```{r}
"a|b|c|d" %>%
  str_split("\\|") %>%
  .[[1]]
```

Otherwise, like the other stringr functions that return a list, you can use `simplify = TRUE` to return a matrix:

```{r}
sentences %>%
  head(5) %>%
  str_split(" ", simplify = TRUE)
```

You can also request a maximum number of pieces:

```{r}
fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
fields %>% str_split(": ", n = 2, simplify = TRUE)
```

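
The limit matters most when the separator can also appear inside a value. As a small sketch, assume a made-up field whose value itself contains `": "`:

```{r}
library(stringr)

field <- "Time: 10: 30"  # hypothetical field; the value contains the separator

# Without a limit, the string is split at every separator
str_split(field, ": ")[[1]]

# With n = 2, splitting stops after the first separator,
# so the value stays in one piece
str_split(field, ": ", n = 2)[[1]]
```
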
Instead of splitting up strings by patterns, you can also split up by character, line, sentence and word `boundary()`s:

```{r}
x <- "This is a sentence.  This is another sentence."
str_view_all(x, boundary("word"))

str_split(x, " ")[[1]]
str_split(x, boundary("word"))[[1]]
```

#### Exercises

1.  Split up a string like `"apples, pears, and bananas"` into individual
    components.

1.  Why is it better to split up by `boundary("word")` than `" "`?

1.  What does splitting with an empty string (`""`) do? Experiment, and
    then read the documentation.


### Find matches

`str_locate()` and `str_locate_all()` give you the starting and ending positions of each match. These are particularly useful when none of the other functions does exactly what you want. You can use `str_locate()` to find the position of a match, and then `str_sub()` to extract and/or modify the matched text.

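
A minimal sketch of that workflow, assuming a made-up input vector `x` and a pattern that finds the first run of vowels in each string:

```{r}
library(stringr)

x <- c("apple pie", "banana split")  # hypothetical input

# A matrix with one row per string and columns "start" and "end"
loc <- str_locate(x, "[aeiou]+")
loc

# Extract the matched text by position
str_sub(x, loc[, "start"], loc[, "end"])

# str_sub() also has an assignment form, which modifies the strings in place
str_sub(x, loc[, "start"], loc[, "end"]) <- "_"
x
```

For this particular task `str_extract()` and `str_replace()` would be simpler; working with positions pays off when you need to do something with the location itself.
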

## Other types of pattern

When you use a pattern that's a string, it's automatically wrapped into a call to `regex()`:

```{r, eval = FALSE}
# The regular call:
str_view(fruit, "nana")
# Is shorthand for
str_view(fruit, regex("nana"))
```

You can use the other arguments of `regex()` to control details of the match:

*   `ignore_case = TRUE` allows characters to match either their uppercase or
    lowercase forms. This always uses the current locale.

    ```{r}
    bananas <- c("banana", "Banana", "BANANA")
    str_view(bananas, "banana")
    str_view(bananas, regex("banana", ignore_case = TRUE))
    ```


*   `multiline = TRUE` allows `^` and `$` to match the start and end of each
    line rather than the start and end of the complete string.

    ```{r}
    x <- "Line 1\nLine 2\nLine 3"
    str_extract_all(x, "^Line")[[1]]
    str_extract_all(x, regex("^Line", multiline = TRUE))[[1]]
    ```


*   `comments = TRUE` allows you to use comments and white space to make
    complex regular expressions more understandable. Spaces are ignored, as is
    everything after `#`. To match a literal space, you'll need to escape it:
    `"\\ "`.

    ```{r}
    phone <- regex("
      \\(?     # optional opening parens
      (\\d{3}) # area code
      [) -]?   # optional closing parens, space, or dash
      (\\d{3}) # another three numbers
      [ -]?    # optional space or dash
      (\\d{3}) # three more numbers
      ", comments = TRUE)

    str_match("514-791-8141", phone)
    ```

*   `dotall = TRUE` allows `.` to match everything, including `\n`.

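
The list above gives no example for `dotall`, so here is a small sketch of its effect on a made-up two-line string:

```{r}
library(stringr)

x <- "Line 1\nLine 2"  # hypothetical two-line string

# By default `.` does not match a newline, so the match stops at the end of line 1
str_extract(x, "Line.*")

# With dotall = TRUE, `.` also matches "\n", so the match runs to the end of the string
str_extract(x, regex("Line.*", dotall = TRUE))
```
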
 								There are three other functions you can use instead of `regex()`:
-												More on strings

											
										
										
											2015-11-05 22:10:27 +08:00

*   `fixed()`: matches exactly the specified sequence of bytes. It ignores
    all special regular expression syntax and operates at a very low level.
    This allows you to avoid complex escaping and can be much faster than
    regular expressions. The following microbenchmark compares the two
    for a simple example.

    ```{r}
    microbenchmark::microbenchmark(
      fixed = str_detect(sentences, fixed("the")),
      regex = str_detect(sentences, "the"),
      times = 20
    )
    ```

    Beware of using `fixed()` with non-English data. It is problematic because
    there are often multiple ways of representing the same character. For
    example, there are two ways to define "á": either as a single character or
    as an "a" plus an accent:

    ```{r}
    a1 <- "\u00e1"
    a2 <- "a\u0301"
    c(a1, a2)
    a1 == a2
    ```

    They render identically, but because they're defined differently,
    `fixed()` doesn't find a match. Instead, you can use `coll()`, defined
    next, to respect human character comparison rules:

    ```{r}
    str_detect(a1, fixed(a2))
    str_detect(a1, coll(a2))
    ```

*   `coll()`: compare strings using standard **coll**ation rules. This is
    useful for doing case-insensitive matching. Note that `coll()` takes a
    `locale` parameter that controls which rules are used for comparing
    characters. Unfortunately, different parts of the world use different rules!

    ```{r}
    # That means you also need to be aware of the difference
    # when doing case insensitive matches:
    i <- c("I", "İ", "i", "ı")
    i
    str_subset(i, coll("i", ignore_case = TRUE))
    str_subset(i, coll("i", ignore_case = TRUE, locale = "tr"))
    ```

    Both `fixed()` and `regex()` have `ignore_case` arguments, but they
    do not allow you to pick the locale: they always use the default locale.
    You can see what that is with the following code; more on stringi
    later.

    ```{r}
    stringi::stri_locale_info()
    ```

    The downside of `coll()` is speed; because the rules for recognising which
    characters are the same are complicated, `coll()` is relatively slow
    compared to `regex()` and `fixed()`.

*   As you saw with `str_split()`, you can use `boundary()` to match boundaries.
    You can also use it with the other functions:

    ```{r}
    x <- "This is a sentence."
    str_view_all(x, boundary("word"))
    str_extract_all(x, boundary("word"))
    ```
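
The point made above about `fixed()` letting you avoid escaping can be sketched with a pattern that contains a regular expression metacharacter. The input vector `x` is hypothetical; only `str_detect()` and `fixed()` from stringr are used.

```r
library(stringr)

# Hypothetical inputs: only the first one contains a literal "."
x <- c("a.b", "axb")

# As a regular expression, `.` is a wildcard and matches any character
as_regex <- str_detect(x, ".")

# To match a literal dot with a regular expression, you must escape it
escaped <- str_detect(x, "\\.")

# fixed() treats the pattern literally, so no escaping is needed
as_fixed <- str_detect(x, fixed("."))

rbind(as_regex, escaped, as_fixed)
```

The `escaped` and `as_fixed` rows agree with each other, while the unescaped regular expression matches both strings.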

### Exercises

1.  How would you find all strings containing `\` with `regex()` vs.
    with `fixed()`?

1.  What are the five most common words in `sentences`?

## Other uses of regular expressions

There are two useful functions in base R that also use regular expressions:

*   `apropos()` searches all objects available from the global environment. This
    is useful if you can't quite remember the name of the function.

    ```{r}
    apropos("replace")
    ```

*   `dir()` lists all the files in a directory. The `pattern` argument takes
    a regular expression and only returns file names that match the pattern.
    For example, you can find all the R Markdown files in the current
    directory with:

    ```{r}
    head(dir(pattern = "\\.Rmd$"))
    ```

    (If you're more comfortable with "globs" like `*.Rmd`, you can convert
    them to regular expressions with `glob2rx()`.)
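
As a small illustration of the glob conversion just mentioned, the sketch below turns `*.Rmd` into a regular expression and applies it to a few file names. The file names are invented for the example; `glob2rx()` comes from the utils package, which is attached by default.

```r
# Convert a glob into the equivalent regular expression
pattern <- glob2rx("*.Rmd")
pattern

# Hypothetical file names, used only to show which ones the pattern accepts
files <- c("strings.Rmd", "notes.txt", "index.Rmd.bak")
matches <- grepl(pattern, files)
files[matches]
```

The resulting pattern is anchored at both ends, so a name like `index.Rmd.bak` is not matched. You could pass the same pattern to `dir(pattern = ...)`.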

## stringi

stringr is built on top of the __stringi__ package. stringr is useful when you're learning because it exposes a minimal set of functions, which have been carefully picked to handle the most common string manipulation tasks. stringi, on the other hand, is designed to be comprehensive. It contains almost every function you might ever need: stringi has `r length(getNamespaceExports("stringi"))` functions to stringr's `r length(getNamespaceExports("stringr"))`.

If you find yourself struggling to do something in stringr, it's worth taking a look at stringi. The packages work very similarly, so you should be able to translate your stringr knowledge in a natural way. The main difference is the prefix: `str_` vs. `stri_`.
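
To give a feel for how the two prefixes line up, here is a short side-by-side sketch. The vector `x` is made up for illustration. Note that the stringi function names are not always a pure prefix swap: for pattern matching, stringi also encodes the matching engine in the name, as in `stri_detect_regex()`.

```r
library(stringr)
library(stringi)

# Hypothetical input, used only for illustration
x <- c("apple", "banana", "pear")

# Length of each string
len_stringr <- str_length(x)
len_stringi <- stri_length(x)

# Detect a regular expression; stringi names the engine explicitly
det_stringr <- str_detect(x, "an")
det_stringi <- stri_detect_regex(x, "an")

# Convert to upper case
up_stringr <- str_to_upper(x)
up_stringi <- stri_trans_toupper(x)

data.frame(x, len_stringr, len_stringi, det_stringr, det_stringi)
```

Each pair of calls returns the same values for this input.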

### Exercises

1.  Find the stringi functions that:

    1. Count the number of words.
    1. Find duplicated strings.
    1. Generate random text.

1.  How do you control the language that `stri_sort()` uses for
    sorting?