---
layout: default
title: String manipulation
output: bookdown::html_chapter
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(stringr)
library(stringi)
```

# String manipulation

When working with text data, one of the most powerful tools at your disposal is regular expressions. Regular expressions are a very concise language for describing patterns in strings. When you first look at them, you'll think a cat walked across your keyboard, but as you learn more, you'll see how they allow you to express complex patterns very concisely.

In this chapter, you'll learn the basics of regular expressions using the stringr package. 

```{r}
library(stringr)
```

The chapter concludes with a brief look at the stringi package. This package is what stringr uses internally: it's more complex than stringr (and includes many many more functions). stringr includes tools to let you tackle the most common 90% of string manipulation challenges; stringi contains functions to let you tackle the last 10%.

For example, if you have

```{r}

```

The goal of this chapter is not to teach you every detail of regular expressions. Instead we'll give you a solid foundation that allows you to solve a wide variety of problems and point you to resources where you can learn more.

## String basics

In R, strings are always stored in a character vector. You can create strings with either single quotes or double quotes: there is no difference in behaviour. I recommend always using `"`, unless you want to create a string that contains multiple `"`, in which case use `'`.

To include a literal single or double quote in a string you can use `\` to "escape". Note that when you print a string, you see the escapes. To see the raw contents of the string, use `writeLines()` (or for a length-1 character vector, `cat(x, "\n")`).

```{r}
x <- c("\"", "\\")
x
writeLines(x)
```

There are a handful of other special characters. The most commonly used are `"\n"`, newline, and `"\t"`, tab, but you can see the complete list by requesting help on `"`: `?'"'`, or `?"'"`.
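As a small sketch of how these escapes behave (the string below is made up for illustration), compare the printed representation with the output of `writeLines()`:

```{r}
library(stringr)

# A made-up string containing tabs and a newline
x <- "name\tvalue\nalpha\t1"

# Printing shows the escapes...
x

# ...while writeLines() shows the raw contents
writeLines(x)

# Each escape sequence stands for a single character
tab_length <- str_length("\t")
tab_length
```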

You'll also sometimes see strings like `"\u00b5"`. This is a way of writing special characters that works on all platforms:

```{r}
x <- "\u00b5"
x
```

Remember that the representation of a string is different from the string itself.

### String length

Base R contains many functions to work with strings but we'll generally avoid them because they're inconsistent and hard to remember. A particularly annoying inconsistency is that the function that computes the number of characters in a string, `nchar()`, returns 2 for `NA` (instead of `NA`):

```{r}
# (Will be fixed in R 3.3.0)
nchar(NA)
str_length(NA)
```

Every stringr function starts with `str_`. That's particularly useful if you're using RStudio, because by the time you've typed `str_`, RStudio will be ready to offer autocomplete for the remaining characters. That helps if you can't quite remember the name of the function.

### Combining strings

To combine two or more strings, use `str_c()`:

```{r}
str_c("x", "y")
str_c("x", "y", "z")
```

Use the `sep` argument to control how they're separated:

```{r}
str_c("x", "y", sep = ", ")
```

Like most other functions in R, missing values are infectious. If you want them to print as `NA`, use `str_replace_na()`:

```{r}
x <- c("abc", NA)
str_c("|-", x, "-|")
str_c("|-", str_replace_na(x), "-|")
```

As shown above, `str_c()` is vectorised, automatically recycling shorter vectors to the same length as the longest:

```{r}
str_c("prefix-", c("a", "b", "c"), "-suffix")
```

To collapse vectors into a single string, use `collapse`:

```{r}
str_c(c("x", "y", "z"), collapse = ", ")
```

When creating strings you might also find `str_pad()` and `str_dup()` useful:

```{r}
x <- c("apple", "banana", "pear")
str_pad(x, 10)

str_c("Na ", str_dup("na ", 4), "batman!") 
```
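`str_pad()` also lets you choose which side to pad and which character to pad with. A minimal sketch, assuming the `side` and `pad` arguments of `str_pad()` and a made-up vector of IDs:

```{r}
library(stringr)

x <- c("apple", "banana", "pear")

# Pad on the right instead of the default left
right_padded <- str_pad(x, 10, side = "right")
right_padded

# Pad with a character other than a space, e.g. zero-padded IDs
ids <- str_pad(c("7", "42", "123"), 3, pad = "0")
ids
```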

### Subsetting strings

You can extract parts of a string using `str_sub()`. `str_sub()` takes two arguments in addition to the string: the position to start at, and the position to end at (inclusive):

```{r}
x <- c("apple", "banana", "pear")
str_sub(x, 1, 3)
# negative numbers count backwards from end
str_sub(x, -3, -1)
```

`str_sub()` won't fail if the string is too short: it returns as much as possible. If you don't want this behaviour, you'll need to check `str_length()` yourself.

```{r}
str_sub("a", 1, 5)
```
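One way to do that check yourself is sketched below. Returning `NA` for strings that are too short is just one possible policy, chosen here for illustration:

```{r}
library(stringr)

x <- c("a", "banana")

# Keep the first five characters only when the string is long enough;
# otherwise return NA (an assumed policy, not the only sensible one)
first_five <- ifelse(str_length(x) >= 5, str_sub(x, 1, 5), NA_character_)
first_five
```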

You can also use `str_sub()` to modify strings:

```{r}
# Capitalise the first letter of each string
str_sub(x, 1, 1) <- str_to_upper(str_sub(x, 1, 1))
x
```

You can use `str_to_lower()`, `str_to_upper()`, and `str_to_title()` to convert the case of a vector. Note that what that means depends on where you are in the world, so these functions all have a locale argument. If left blank it will use the current locale.

### Exercises

1.  In your own words, describe the difference between `sep` and `collapse`.

1.  In code that doesn't use stringr, you'll often see `paste()` and `paste0()`.
    What's the difference between the two functions? What stringr function are
    they equivalent to? How do the functions differ in their handling of 
    `NA`?
    
1.  Use `str_length()` and `str_sub()` to extract the middle character from 
    a character vector.
    
1.  Write a function that turns (e.g.) a vector `c("a", "b", "c")` into 
    the string `a, b, and c`. Think carefully about what it should do if
    given a vector of length 0, 1, or 2.

1.  What does `str_wrap()` do? When might you want to use it?

1.  What does `str_trim()` do? 

## Matching patterns with regular expressions

Regular expressions, regexps for short, are a very terse language that allows you to describe patterns in strings. They take a little while to get your head around, but once you've got it you'll find them extremely useful. 

To learn regular expressions, we'll use `str_view()` and `str_view_all()`. These functions take a character vector and a regular expression, and show you how they match. We'll start with very simple regular expressions and then gradually move to more and more complicated ones. Once you've mastered the basics of pattern matching with regular expressions, you'll learn all the stringr functions that use them and how to apply them to real problems.

### Fixed matches

The simplest patterns match exact strings.

```{r}

```
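A minimal sketch of an exact match, using a small made-up vector rather than the `common` word list:

```{r}
library(stringr)

# A small made-up vector, used only for illustration
x <- c("apple", "banana", "pear")

# "an" matches the literal character "a" followed by "n"
has_an <- str_detect(x, "an")
has_an

matches <- str_subset(x, "an")
matches
```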

```{r}
common <- rcorpora::corpora("words/common")$commonWords
```

### Match anything and escaping

```{r}
str_subset(c("abc", "adc", "bef"), "a.c")
```

But if "`.`" matches any character, how do you match an actual "`.`"? You need to use an "escape" to tell the regular expression you want to match it exactly, not use the special behaviour. The escape character used by regular expressions is `\`. Unfortunately, that's also the escape character used by strings, so to match a literal "`.`" you need to use `\\.`.

```{r}
# To create the regular expression, we need \\
dot <- "\\."

# But the expression itself only contains one:
cat(dot, "\n")

# And this tells R to look for explicit .
str_subset(c("abc", "a.c", "bef"), "a\\.c")
```

If `\` is used as an escape character, how do you match a literal `\`? Well, you need to escape it, creating the regular expression `\\`. And in R that needs to be in a string, so you need to write `"\\\\"` - that's right, you need four backslashes to match one!

```{r}
x <- "a\\b"
cat(x, "\n")

y <- str_replace(x, "\\\\", "-slash-")
cat(y, "\n")
```

Here I'll write a regular expression like `\.` and the string that represents the regular expression as `"\\."`.

Use regular expressions to:

* Solve this crossword puzzle clue: `a??le`

### Anchors

Regular expressions can also match things that are not characters. The most important non-character matches are:

* `^`: the start of the line.
* `$`: the end of the line.

To force a regular expression to only match a complete string, anchor it with both `^` and `$`:

```{r}
str_detect(c("abcdef", "bcd"), "^bcd$")
```

My favourite mnemonic for remembering which is which (from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953)): begin with power (`^`), end with money (`$`).

You can also match the boundary between words with `\b`. I don't find I often use this in R, but I will sometimes use it when I'm doing a find all in RStudio when I want to find the name of a function that's a component of other functions. For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on.
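The same search can be sketched in R. The vector below just collects the function names mentioned above; remember that `\b` has to be written as `"\\b"` inside a string:

```{r}
library(stringr)

# Function names taken from the paragraph above
funs <- c("sum", "summarise", "summary", "rowsum")

# Without boundaries, "sum" matches wherever it appears
anywhere <- str_subset(funs, "sum")
anywhere

# With \b on both sides, only the whole word matches
whole <- str_subset(funs, "\\bsum\\b")
whole
```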

Practice these by finding all common words that:

* Start with y.
* End in x.
* Are exactly 4 letters long (without using `str_length()`).

### Character classes and alternatives

As well as `.` there are a number of other special patterns that match more than one character:

* `\d`: any digit
* `\s`: any whitespace (space, tab, newline)
* `[abc]`: match a, b, or c
* `[a-e]`: match any character between a and e
* `[^abc]`: match anything except a, b, or c

Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`.
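A short sketch of these classes in action, on a few made-up strings. Inside a character class, a leading `^` negates the class:

```{r}
library(stringr)

# Made-up strings for illustration
x <- c("abc", "a1c", "a c", "xyz")

digit  <- str_detect(x, "a\\dc")    # a, any digit, c
space  <- str_detect(x, "a\\sc")    # a, any whitespace, c
in_set <- str_detect(x, "^[abc]")   # starts with a, b, or c
negate <- str_detect(x, "^[^abc]")  # starts with anything except a, b, or c

rbind(digit, space, in_set, negate)
```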

A similar idea is alternation: `x|y` matches either x or y. Note that the precedence for `|` is low, so that `abc|xyz` matches either `abc` or `xyz` not `abcyz` or `abxyz`:

```{r}
str_detect(c("abc", "xyz"), "abc|xyz")
```

As with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:

```{r}
str_detect(c("grey", "gray"), "gr(e|a)y")
```

Practice these by finding all common words that:

* Start with a vowel.
* Only contain consonants.
* Don't contain any vowels.


### Repetition

* `?`: 0 or 1
* `+`: 1 or more
* `*`: 0 or more
* `{n}`: exactly n
* `{n,}`: n or more
* `{0,m}`: at most m
* `{n,m}`: between n and m

(By default these matches are "greedy": they will match the longest string possible. You can make them "lazy", matching the shortest string possible by putting a `?` after them.)

Note that the precedence of these operators is high, so you can write `colou?r` to match either the American or the British spelling. That means you'll need to use parentheses for many uses: `bana(na)+` or `ba(na){2,}`.
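Here is a minimal sketch of precedence and of greedy versus lazy matching. The `tags` string is made up for illustration:

```{r}
library(stringr)

# ? applies only to the "u", so both spellings match
spellings <- str_detect(c("color", "colour"), "colou?r")
spellings

# Parentheses make {2,} apply to the whole group "na"
repeated <- str_extract("bananana", "ba(na){2,}")
repeated

# Greedy vs lazy on a made-up string
tags   <- "<a><b>"
greedy <- str_extract(tags, "<.+>")   # longest possible match
lazy   <- str_extract(tags, "<.+?>")  # shortest possible match
c(greedy, lazy)
```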

Practice these by finding all common words:

* That contain three or more vowels in a row.

### Grouping and backreferences

```{r}
fruit <- rcorpora::corpora("foods/fruits")$fruits
str_subset(fruit, "(..)\\1")
```

Unfortunately `()` in regexps serve two purposes: you usually use them to disambiguate precedence, but you can also use them for grouping. If you're using one set for grouping and one set for disambiguation, things can get confusing. You might want to use `(?:)` instead: it only disambiguates, and doesn't modify the grouping. These are called non-capturing parentheses.

For example:

```{r}
str_detect(c("grey", "gray"), "gr(e|a)y")
str_detect(c("grey", "gray"), "gr(?:e|a)y")
```
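`str_detect()` gives the same answer for both patterns, so the difference only shows up when you look at the groups. A small sketch using `str_match()`, which returns one column for the complete match plus one column per capturing group:

```{r}
library(stringr)

# Capturing parentheses add a column holding the group's text
capturing <- str_match(c("grey", "gray"), "gr(e|a)y")
capturing

# Non-capturing parentheses return only the complete match
non_capturing <- str_match(c("grey", "gray"), "gr(?:e|a)y")
non_capturing
```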

Describe in words what these expressions will match:

* `str_subset(common, "(.)(.)\\2\\1")`

## Tools

The stringr package contains functions for working with strings and patterns. We'll focus on four main categories:

* What matches the pattern?
* Does a string match a pattern? 
* How can you replace a pattern with text?
* How can you split a string into pieces?

### Detecting matches

`str_detect()`, `str_subset()`, `str_count()`
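A minimal sketch of how these three functions differ, using a made-up vector:

```{r}
library(stringr)

x <- c("apple", "banana", "pear")

# Does each string contain an "e"? Returns a logical vector
detected <- str_detect(x, "e")
detected

# Keep only the strings that match
kept <- str_subset(x, "e")
kept

# How many times does "a" appear in each string?
counts <- str_count(x, "a")
counts
```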

### Extracting matches

`str_extract()`, `str_extract_all()`
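As a sketch (the strings are made up), `str_extract()` returns the first match in each string, while `str_extract_all()` returns every match as a list with one element per input string:

```{r}
library(stringr)

x <- c("apple 12 pears 3", "no digits")

# First run of digits in each string; NA when there is no match
first <- str_extract(x, "\\d+")
first

# All runs of digits, as a list of character vectors
every <- str_extract_all(x, "\\d+")
every
```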

### Extracting grouped matches

`str_match()`, `str_match_all()`

Note that matches are always non-overlapping. The second match starts after the first is complete.
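A short sketch of both points, using made-up strings: `str_match()` returns the complete match followed by each group, and a string such as `"abababa"` contains only two non-overlapping matches of `aba`, even though three overlapping ones are possible.

```{r}
library(stringr)

# One column for the complete match, then one per group
m <- str_match(c("key=value", "novalue"), "(\\w+)=(\\w+)")
m

# Matching resumes after the end of the previous match
n_matches <- str_count("abababa", "aba")
n_matches
```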

### Replacing patterns

`str_replace()`, `str_replace_all()`
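A minimal sketch of the difference between the two, plus a replacement that refers back to groups with `\\1` and `\\2`. The strings are made up for illustration:

```{r}
library(stringr)

# Replace only the first match
first_only <- str_replace("banana", "a", "A")
first_only

# Replace every match
all_matches <- str_replace_all("banana", "a", "A")
all_matches

# Backreferences in the replacement swap the two words
swapped <- str_replace("hello world", "(\\w+) (\\w+)", "\\2 \\1")
swapped
```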

### Splitting

`str_split()`, `str_split_fixed()`. 
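As a sketch on a made-up string: `str_split()` returns a list (one element per input string), while `str_split_fixed()` returns a character matrix with a fixed number of columns.

```{r}
library(stringr)

x <- "a,b,c"

# A list, because each input could yield a different number of pieces
pieces <- str_split(x, ",")
pieces

# A matrix with exactly 2 columns; the last column keeps the remainder
fixed_pieces <- str_split_fixed(x, ",", 2)
fixed_pieces
```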

### Finding locations

`str_locate()` and `str_locate_all()` give you the starting and ending positions of each match. These are particularly useful when none of the other functions does exactly what you want. You can use `str_locate()` to find the matching pattern, and `str_sub()` to extract and/or modify it.
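A minimal sketch of that combination on a made-up string. `str_locate()` returns a matrix with `start` and `end` columns, which can be passed on to `str_sub()`:

```{r}
library(stringr)

x <- "banana"

# Position of the first match
loc <- str_locate(x, "an")
loc

# Extract the matched text using those positions
found <- str_sub(x, loc[, "start"], loc[, "end"])
found

# Modify a copy of the string in place at the same positions
y <- x
str_sub(y, loc[, "start"], loc[, "end"]) <- "AN"
y

# Positions of every match, as a list of matrices
all_locs <- str_locate_all(x, "an")
all_locs
```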

### Exercises

1.   Replace all `/` in a string with `\`.

## Other types of pattern

When you use a pattern that's a string, it's automatically wrapped into a call to `regex()`. Sometimes it's useful to call it explicitly so you can control the details of the match. Two other functions describe different types of pattern:

* `fixed()`: matches exactly that sequence of characters (i.e. it ignores
  all special regular expression patterns).
  
* `coll()`: compare strings using standard **coll**ation rules. This is 
  useful for doing case insensitive matching. Note that `coll()` takes a
  `locale` parameter that controls which rules are used for comparing
  characters. Unfortunately different parts of the world use different rules!
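A small sketch contrasting the default regular expression behaviour with `fixed()`, and showing an explicit call to `regex()` with its `ignore_case` argument. The strings are made up for illustration:

```{r}
library(stringr)

x <- c("a.c", "abc", "A.C")

# As a regular expression, "." matches any character
as_regex <- str_detect(x, "a.c")
as_regex

# fixed() treats "." as a literal dot
as_fixed <- str_detect(x, fixed("a.c"))
as_fixed

# Calling regex() explicitly lets you ignore case
no_case <- str_detect(x, regex("a\\.c", ignore_case = TRUE))
no_case
```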

```{r}
# Turkish has two i's: with and without a dot, and it
# has a different rule for capitalising them:
str_to_upper(c("i", "ı"))
str_to_upper(c("i", "ı"), locale = "tr")

# That means you also need to be aware of the difference
# when doing case insensitive matches:
i <- c("I", "İ", "i", "ı")
i

str_subset(i, fixed("i", TRUE))
str_subset(i, coll("i", TRUE))
str_subset(i, coll("i", TRUE, locale = "tr"))
```

## Other uses of regular expressions

There are a few other functions in base R that accept regular expressions:

*   `apropos()` searches all objects available from the global environment. This
    is useful if you can't quite remember the name of the function.
   
*   `ls()` is similar to `apropos()` but only works in the current 
    environment. However, if you have so many objects in your environment
    that you have to use a regular expression to filter them all, you 
    need to think about what you're doing! (And probably use a list instead).

*   `dir()` lists all the files in a directory. The `pattern` argument takes
    a regular expression and only returns file names that match the pattern.
    For example, you can find all csv files with `dir(pattern = "\\.csv$")`.
    (If you're more comfortable with "globs" like `*.csv`, you can convert
    them to regular expressions with `glob2rx()`.)
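To see how a glob and a regular expression relate without touching the file system, here is a sketch that applies both to a vector of made-up file names standing in for the contents of a directory:

```{r}
# Made-up file names, standing in for the result of dir()
files <- c("data.csv", "notes.txt", "csv_readme.md")

# The regular expression used with dir() above
by_regex <- grepl("\\.csv$", files)
by_regex

# Convert the glob to a regular expression and apply it
rx <- glob2rx("*.csv")
rx
by_glob <- grepl(rx, files)
by_glob
```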

## Advanced topics


### The stringi package

stringr is built on top of the __stringi__ package. stringr is useful when you're learning because it exposes a minimal set of functions that have been carefully picked to handle the most common string manipulation tasks. stringi, on the other hand, is designed to be comprehensive. It contains almost every function you might ever need. stringi has `r length(ls("package:stringi"))` functions to stringr's `r length(ls("package:stringr"))`.

So if you find yourself struggling to do something that doesn't seem natural in stringr, it's worth taking a look at stringi. The two packages work in very similar ways because stringr was designed to mimic stringi's interface. The main difference is the prefix: `str_` vs `stri_`.
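A minimal sketch of the parallel naming, using a made-up vector. `stri_reverse()` is included only as one example of a function I'm calling directly from stringi:

```{r}
library(stringr)
library(stringi)

x <- c("apple", "banana", "pear")

# The same operation under the two prefixes
len_stringr <- str_length(x)
len_stringi <- stri_length(x)
len_stringr
len_stringi

# A function called directly from stringi
reversed <- stri_reverse(x)
reversed
```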

### Encoding

Encoding is complicated and fraught with difficulty. The best approach is to convert to UTF-8 as soon as possible. All stringr and stringi functions do this, and readr always reads as UTF-8.

* UTF-8
* Latin1
* bytes: everything else

Generally, you should fix encoding problems during the data import phase.

Encoding detection operates statistically, by comparing the frequency of byte fragments across languages and encodings. It is fundamentally heuristic and works better with larger amounts of text (i.e. a whole file, not a single string from that file).

```{r}
x <- "\xc9migr\xe9 cause c\xe9l\xe8bre d\xe9j\xe0 vu."
x
str_conv(x, "ISO-8859-1")

as.data.frame(stringi::stri_enc_detect(x))
str_conv(x, "ISO-8859-2")
```
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
+								---
 								layout: default
 								title: String manipulation
 								output: bookdown::html_chapter
 								---
 								```{r setup, include=FALSE}
 								knitr::opts_chunk$set(echo = TRUE)
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								library(stringr)
-												More on strings

											
										
										
											2015-10-27 22:33:41 +08:00
+								library(stringi)
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
+								```
 								# String manipulation
 								When working with text data, one of the most powerful tools at your disposal is regular expressions. Regular expressions are a very concise language for describing patterns in strings. When you first look at them, you'll think a cat walked across your keyboard, but as you learn more, you'll see how they allow you to express complex patterns very concisely.
 								In this chapter, you'll learn the basics of regular expressions using the stringr package.
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								```{r}
 								library(stringr)
 								```
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
+								The chapter concludes with a brief look at the stringi package. This package is what stringr uses internally: it's more complex than stringr (and includes many many more functions). stringr includes tools to let you tackle the most common 90% of string manipulation challenges; stringi contains functions to let you tackle the last 10%.
-												Some string tweaking

											
										
										
											2015-10-26 22:52:24 +08:00
+								For example, if you have
 								```{r}
 								```
 								The goal of this chapter is not to teach you every detail of regular expressions. Instead we'll give you a solid foundation that allows you to solve a wide variety of problems and point you to resources where you can learn more.
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
+								## String basics
-												Some string tweaking

											
										
										
											2015-10-26 22:52:24 +08:00
+								In R, strings are always stored in a character vector. You can create strings with either single quotes or double quotes: there is no difference in behaviour. I recommend always using `"`, unless you want to create a string that contains multiple `"`, in which case use `'`.
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
 								To include a literal single or double quote in a string you can use `\` to "escape". Note that when you print a string, you see the escapes. To see the raw contents of the string, use `writeLines()` (or for a length-1 character vector, `cat(x, "\n")`).
 								```{r}
 								x <- c("\"", "\\")
 								x
 								writeLines(x)
 								```
-												Some string tweaking

											
										
										
											2015-10-26 22:52:24 +08:00
+								There are a handful of other special characters. The most common used are `"\n"`, new line, and `"\t"`, tab, but you can see the complete list by requesting help on `"`: `?'"'`, or `?"'"`.
 								You'll also sometimes strings like `"\u00b5"`, this is a way of writing special characters that works on all platforms:
-												More on strings

											
										
										
											2015-10-27 22:33:41 +08:00
+								```{r}
-												Some string tweaking

											
										
										
											2015-10-26 22:52:24 +08:00
+								x <- "\u00b5"
 								x
 								```
 								Remember that the representation of a string is different from the string itself.
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								### String length
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
+								Base R contains many functions to work with strings but we'll generally avoid them because they're inconsistent, and hard to remember. A particularly annoying inconsistency is that the function that computes the number of characters in a string, `nchar()`, returns 2 for `NA` (instead of `NA`)
 								```{r}
 								# (Will be fixed in R 3.3.0)
 								nchar(NA)
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								str_length(NA)
 								```
-												More on strings

											
										
										
											2015-10-27 22:33:41 +08:00
+								Every stringr function starts with `str_`. That's particularly useful if you're using RStudio, because by the time you've type `str_`, RStudio will be ready to offer autocomplete for the reminaing characters. That's useful if you can't quite remember the name of the function.
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								### Combining strings
 								To combine two or more strings, use `str_c()`:
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								```{r}
 								str_c("x", "y")
 								str_c("x", "y", "z")
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
+								```
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								Use the `sep` argument to control how they're separated:
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
 								```{r}
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								str_c("x", "y", sep = ", ")
 								```
 								Like most other functions in R, missing values are infectious. If you want them to print as `NA`, use `str_replace_na()`:
 								```{r}
-												Some string tweaking

											
										
										
											2015-10-26 22:52:24 +08:00
+								x <- c("abc", NA)
 								str_c("|-", x, "-|")
 								str_c("|-", str_replace_na(x), "-|")
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								```
-												More on strings

											
										
										
											2015-10-27 22:33:41 +08:00
+								As shown above, `str_c()` is vectorised, automatically recycling shorter vectors to the same length as the longest:
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
 								```{r}
 								str_c("prefix-", c("a", "b", "c"), "-suffix")
 								```
 								To collapse vectors into a single string, use `collapse`:
 								```{r}
 								str_c(c("x", "y", "z"), collapse = ", ")
 								```
 								When creating strings you might also find `str_pad()` and `str_dup()` useful:
 								```{r}
 								x <- c("apple", "banana", "pear")
 								str_pad(x, 10)
 								str_c("Na ", str_dup("na ", 4), "batman!")
 								```
 								### Subsetting strings
-												More on strings

											
										
										
											2015-10-27 22:33:41 +08:00
+								You can extract parts of a string using `str_sub()`. `str_sub()` takes two arguments in addition to the string: the position to start at, and the postion to end at (inclusive):
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
 								```{r}
 								x <- c("apple", "banana", "pear")
 								str_sub(x, 1, 3)
 								# negative numbers count backwards from end
 								str_sub(x, -3, -1)
 								```
-												More on strings

											
										
										
											2015-10-27 22:33:41 +08:00
+								`str_sub()` returns the longest string possible. If you don't want this behaviour, you'll need to check `str_length()` yourself.
 								```{r}
 								str_sub("a", 1, 5)
 								```
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								You can also use `str_sub()` to modify strings:
 								```{r}
 								str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
 								x
 								```
-												More on strings

											
										
										
											2015-10-27 22:33:41 +08:00
+								You can use `str_to_lower()`, `str_to_upper()`, and `str_to_title()` to convert the case of a vector. Note that what that means depends on where you are in the world, so these functions all have a locale argument. If left blank it will use the current locale.
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								### Exercises
 .  In your own words, describe the difference between `sep` and `collapse`.
-												More on strings

											
										
										
											2015-10-27 22:33:41 +08:00
+.  In code that doesn't use stringr, you'll often see `paste()` and `paste0()`.
 								    What's the difference between the two functions? What's stringr function are
 								    they equivalent too? How do the functions differ in their handling of
 								    `NA`?
 .  Use `str_length()` and `str_sub()` to extract the middle character from
 								    a character vector.
 .  Write a function that turns (e.g.) a vector `c("a", "b", "c")` into
 								    the string `a, b, and c`. Think carefully about what it should do if
 								    given a vector of length 0, 1, or 2.
 .  What does `str_wrap()` do? When might you want to use it?
 .  What does `str_trim()` do?
 								## Matching patterns with regular expressions
 								Regular expressions, regexps for short, a very terse language that allow to describe patterns in string. They take a little while to get your head around, but once you've got it you'll find them extremely useful.
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
-												More on strings

											
										
										
											2015-10-27 22:33:41 +08:00
+								To learn regular expressions, we'll use `str_show()` and `str_show_all()`. These functions take a character vector and a regular expression, and shows you how they match. We'll start with very simple regular expressions and then gradually move to more and more complicated. Once you've mastered the basics of pattern matching with regular expression, you'll learn all the stringr functions that use them and learn how to use them to solve real problems.
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
-												More on strings

											
										
										
											2015-10-27 22:33:41 +08:00
+								### Fixed matches
 								The simplest patterns that
 								```{r}
 								```
 								```{r}
 								common <- rcorpora::corpora("words/common")$commonWords
 								```
 								### Match anything and escaping
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								```{r}
 								str_subset(c("abc", "adc", "bef"), "a.c")
 								```
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								But if "`.`" matches any character, how do you match an actual "`.`"? You need to use an "escape" to tell the regular expression you want to match it exactly, not use the special behaviour. The escape character used by regular expressions is `\`. Unfortunately, that's also the escape character used by strings, so to match a literal "`.`" you need to use `\\.`.
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								```{r}
 								# To create the regular expression, we need \\
 								dot <- "\\."
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								# But the expression itself only contains one:
 								cat(dot, "\n")
 								# And this tells R to look for explicit .
 								str_subset(c("abc", "a.c", "bef"), "a\\.c")
 								```
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								If `\` is used an escape character, how do you match a literal `\`? Well you need to escape it, creating the regular expression `\\`. And in R that needs to be in a string, so you need to write `"\\\\"` - that's right, you need four backslashes to match one!
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
-												Some string tweaking

											
										
										
											2015-10-26 22:52:24 +08:00
+								```{r}
 								x <- "a\\b"
 								cat(x, "\n")
 								y <- str_replace(x, "\\\\", "-slash-")
 								cat(y, "\n")
 								```
-												More on strings

											
										
										
											2015-10-27 22:33:41 +08:00
+								Here I'll write a regular expression like `\.` and the string that represents the regular expression as `"\."`.
 								Use regular expressions to:
 								* Solve this crossword puzzle clue: `a??le`
 								### Anchors
 								Regular expressions can also match things that are not characters. The most important non-character matches are:
 								* `^`: the start of the line.
 								* `*`: the end of the line.
 								To force a regular expression to only match a complete string, anchor it with both `^` and `$`.:
 								```{r}
 								str_detect(c("abcdef", "bcd"), "^bcd$")
 								```
 								My favourite mneomic for rememember which is which (from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): begin with power (`^`), end with money (`$`).
 								You can also match the boundary between words with `\b`. I don't find I often use this in R, but I will sometimes use it when I'm doing a find all in RStudio when I want to find the name of a function that's a component of other functions. For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on.
 								Practice these by finding all common words:
 								* Start with y.
 								* End in x.
 								* That are exactly 4 letters long. Without using `str_length()`
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								### Character classes and alternatives
 								As well as `.` there are a number of other special patterns that match more than one character:
 								* `\d`: any digit
 								* `\s`: any whitespace (space, tab, newline)
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
+								* `[abc]`: match a, b, or c
 								* `[a-e]`: match any character between a and e
 								* `[!abc]`: match anything except a, b, or c
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`.
 								A similar idea is alternation: `x|y` matches either x or y. Note that the precedence for `|` is low, so that `abc|xyz` matches either `abc` or `xyz` not `abcyz` or `abxyz`:
 								```{r}
 								str_detect(c("abc", "xyz"), "abc|xyz")
 								```
-												Some string tweaking

											
										
										
											2015-10-26 22:52:24 +08:00
+								Like with mathematical expression, if precedence ever gets confusing, use parentheses to make it clear what you want:
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
 								```{r}
 								str_detect(c("grey", "gray"), "gr(e|a)y")
 								```

Practice these by finding all the common words that:

* Start with a vowel.
* Only contain consonants.
* Don't contain any vowels.
 								### Repetition
 								* `?`: 0 or 1
 								* `+`: 1 or more
 								* `*`: 0 or more
 								* `{n}`: exactly n
 								* `{n,}`: n or more
* `{0,m}`: at most m
 								* `{n,m}`: between n and m
 								(By default these matches are "greedy": they will match the longest string possible. You can make them "lazy", matching the shortest string possible by putting a `?` after them.)
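
The difference between greedy and lazy matching is easiest to see on a string with several candidate end points. This is a minimal sketch; the HTML-like input is an invented example:

```{r}
library(stringr)

# Hypothetical input with two tag pairs
x <- "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the last ">" in the string
str_extract(x, "<.+>")

# Lazy: .+? stops at the first ">" it can
str_extract(x, "<.+?>")
```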

Note that the precedence of these operators is high, so each one applies only to the character (or group) immediately before it. That's why you can write `colou?r` to match both `color` and `colour`. It also means you'll need to use parentheses for many uses: `bana(na)+` or `ba(na){2,}`.

Practice these by finding all the common words that:

* Contain three or more vowels in a row.

### Grouping and backreferences

```{r}
# (..) captures any two characters and \\1 must repeat them exactly,
# so this finds fruits that contain a repeated pair of letters
fruit <- rcorpora::corpora("foods/fruits")$fruits
str_subset(fruit, "(..)\\1")
```

Unfortunately `()` in regexps serve two purposes: you usually use them to disambiguate precedence, but you can also use them for grouping. If you're using one set for grouping and one set for disambiguation, things can get confusing. You might want to use `(?:)` instead: it only disambiguates, and doesn't modify the grouping. These are called non-capturing parentheses.

For example:

```{r}
 								str_detect(c("grey", "gray"), "gr(e|a)y")
 								str_detect(c("grey", "gray"), "gr(?:e|a)y")
 								```
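
Both patterns detect the same strings, so the difference only shows up when you ask for the groups. As a sketch, `str_match()` returns one column for the full match plus one column per capturing group:

```{r}
library(stringr)

x <- c("grey", "gray")

# Capturing parentheses: full match plus one group column
capturing <- str_match(x, "gr(e|a)y")
capturing

# Non-capturing parentheses: full match only
non_capturing <- str_match(x, "gr(?:e|a)y")
non_capturing
```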

Describe in words what these expressions will match:

 								* `str_subset(common, "(.)(.)\\2\\1")`

## Tools

 								The stringr package contains functions for working with strings and patterns. We'll focus on four main categories:
 								* What matches the pattern?
 								* Does a string match a pattern?
 								* How can you replace a pattern with text?
 								* How can you split a string into pieces?
 								### Detecting matches

`str_detect()`, `str_subset()`, `str_count()`
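
A minimal sketch of how the three functions differ, using a small made-up vector: `str_detect()` returns a logical vector, `str_subset()` returns the matching strings, and `str_count()` returns the number of matches in each string.

```{r}
library(stringr)

# Hypothetical input
x <- c("apple", "banana", "pear")

str_detect(x, "e")   # which strings contain an "e"?
str_subset(x, "e")   # keep only the strings that do
str_count(x, "a")    # how many "a"s are in each string?
```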

### Extracting matches

 								`str_extract()`, `str_extract_all()`
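
As a sketch with invented inputs: `str_extract()` returns the first match in each string (or `NA` if there is none), while `str_extract_all()` returns a list containing every match.

```{r}
library(stringr)

# Hypothetical input: some strings contain numbers, one does not
x <- c("apple 12, pear 3", "no numbers", "kiwi 7")

first_match <- str_extract(x, "\\d+")
first_match

all_matches <- str_extract_all(x, "\\d+")
all_matches
```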

### Extracting grouped matches

 								`str_match()`, `str_match_all()`

Note that matches are always non-overlapping. The second match starts after the first is complete.
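
The sketch below uses made-up `key=value` strings to show the matrix that `str_match()` returns, and a short repeated string to illustrate that matches don't overlap:

```{r}
library(stringr)

# Hypothetical input: two key=value pairs and one string without a pair
x <- c("key=value", "name=Hadley", "no pair")

# Column 1 is the full match; columns 2 and 3 are the two groups
m <- str_match(x, "(\\w+)=(\\w+)")
m

# "aba" could start at positions 1, 3 and 5, but the match at 3 overlaps
# the first one, so only two matches are found
str_count("abababa", "aba")
str_extract_all("abababa", "aba")[[1]]
```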

### Replacing patterns

`str_replace()`, `str_replace_all()`
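
A minimal sketch with invented strings: `str_replace()` changes only the first match in each string, `str_replace_all()` changes every match, and the replacement can refer back to groups with `\\1`, `\\2`, and so on.

```{r}
library(stringr)

# Hypothetical input
x <- c("apple pie", "banana bread")

str_replace(x, "[aeiou]", "-")       # first vowel only
str_replace_all(x, "[aeiou]", "-")   # every vowel

# Swap the two words using backreferences in the replacement
str_replace(x, "(\\w+) (\\w+)", "\\2 \\1")
```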

### Splitting

 								`str_split()`, `str_split_fixed()`.
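
As a sketch with made-up comma-separated strings: `str_split()` returns a list because each string can yield a different number of pieces, while `str_split_fixed()` returns a character matrix with exactly `n` columns, padded with empty strings.

```{r}
library(stringr)

# Hypothetical input with a different number of fields per string
x <- c("a,b,c", "d,e")

pieces <- str_split(x, ",")
pieces

fixed_pieces <- str_split_fixed(x, ",", n = 3)
fixed_pieces
```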
 								### Finding locations

`str_locate()` and `str_locate_all()` give you the starting and ending positions of each match. These are particularly useful when none of the other functions does exactly what you want. You can use `str_locate()` to find the matching pattern and `str_sub()` to extract and/or modify it.
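
The combination might look like the following sketch, where the sentence is an invented example. `str_locate()` returns an integer matrix with `start` and `end` columns, which can be passed straight to `str_sub()`:

```{r}
library(stringr)

# Hypothetical input
x <- "The quick brown fox"

loc <- str_locate(x, "quick")
loc

# Extract the matched text by position
str_sub(x, loc[1, "start"], loc[1, "end"])

# Modify the string in place using the same positions
str_sub(x, loc[1, "start"], loc[1, "end"]) <- "slow"
x
```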

### Exercises

1.  Replace all `/` in a string with `\`.

## Other types of pattern

When you use a pattern that's a string, it's automatically wrapped into a call to `regex()`. Sometimes it's useful to call it explicitly so you can control the details of the match. There are also other functions you can use in place of `regex()`:

* `fixed()`: matches exactly that sequence of characters (i.e. it ignores
  all special regular expression characters).
 								* `coll()`: compare strings using standard **coll**ation rules. This is
 								  useful for doing case insensitive matching. Note that `coll()` takes a
 								  `locale` parameter that controls which rules are used for comparing
 								  characters. Unfortunately different parts of the world use different rules!
 								```{r}
 								# Turkish has two i's: with and without a dot, and it
 								# has a different rule for capitalising them:
 								str_to_upper(c("i", "ı"))
 								str_to_upper(c("i", "ı"), locale = "tr")
 								# That means you also need to be aware of the difference
 								# when doing case insensitive matches:
 								i <- c("I", "İ", "i", "ı")
 								i
 								str_subset(i, fixed("i", TRUE))
 								str_subset(i, coll("i", TRUE))
 								str_subset(i, coll("i", TRUE, locale = "tr"))
 								```
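
To see how `fixed()` differs from the default `regex()` wrapper, here is a minimal sketch with two invented strings. In a regular expression `.` matches any character, whereas inside `fixed()` it only matches a literal dot:

```{r}
library(stringr)

# Hypothetical input
x <- c("a.c", "abc")

str_detect(x, "a.c")          # same as regex("a.c"): . matches anything
str_detect(x, fixed("a.c"))   # . must be a literal dot

# Calling regex() explicitly lets you set options such as ignore_case
str_detect("ABC", regex("abc", ignore_case = TRUE))
```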
 								## Other uses of regular expressions
 								There are a few other functions in base R that accept regular expressions:
*   `apropos()` searches all objects available from the global environment. This
    is useful if you can't quite remember the name of the function.
*   `ls()` is similar to `apropos()` but only works in the current
    environment. However, if you have so many objects in your environment
    that you have to use a regular expression to filter them all, you
    need to think about what you're doing! (And probably use a list instead.)
*   `dir()` lists all the files in a directory. The `pattern` argument takes
    a regular expression and only returns file names that match the pattern.
    For example, you can find all csv files with `dir(pattern = "\\.csv$")`.
    (If you're more comfortable with "globs" like `*.csv`, you can convert
    them to regular expressions with `glob2rx()`.)
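
The following sketch tries these out. The temporary directory and file names are assumptions created only for the demonstration, so nothing in your working directory is touched:

```{r}
# Functions whose names start with "read.csv" (utils is attached by default)
apropos("^read\\.csv")

# Create a few empty files in a fresh temporary directory
tmp <- tempfile("regex-demo-")
dir.create(tmp)
file.create(file.path(tmp, c("a.csv", "b.csv", "notes.txt")))

# Filter file names with a regular expression...
csv_files <- dir(tmp, pattern = "\\.csv$")
csv_files

# ...or write a glob and let glob2rx() translate it
glob2rx("*.csv")
dir(tmp, pattern = glob2rx("*.csv"))
```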
 								## Advanced topics
 								### The stringi package
stringr is built on top of the __stringi__ package. stringr is useful when you're learning because it exposes a minimal set of functions that have been carefully picked to handle the most common string manipulation tasks. stringi, on the other hand, is designed to be comprehensive: it contains almost every function you might ever need. stringi has `length(ls("package:stringi"))` functions to stringr's `length(ls("package:stringr"))`.

So if you find yourself struggling to do something that doesn't seem natural in stringr, it's worth taking a look at stringi. The two packages work very similarly because stringr was designed to mimic stringi's interface. The main difference is the prefix: `str_` vs `stri_`.
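
As a small sketch of the parallel naming (the input vector is invented), `str_detect()` and stringi's `stri_detect_regex()` give the same answer, and stringi also offers extras such as `stri_reverse()`:

```{r}
library(stringr)
library(stringi)

# Hypothetical input
x <- c("apple", "banana", "pear")

str_detect(x, "an")          # stringr
stri_detect_regex(x, "an")   # the corresponding stringi call

# One of the many additional stringi functions: reverse each string
stri_reverse(x)
```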
 								### Encoding
Encoding is complicated and fraught with difficulty. The best approach is to convert to UTF-8 as soon as possible. All stringr and stringi functions do this, and readr always reads as UTF-8.
 								* UTF-8
 								* Latin1
 								* bytes: everything else
 								Generally, you should fix encoding problems during the data import phase.
Encoding detection operates statistically, by comparing the frequency of byte fragments across languages and encodings. It is fundamentally heuristic and works better with larger amounts of text (i.e. a whole file, not a single string from that file).
 								```{r}
 								x <- "\xc9migr\xe9 cause c\xe9l\xe8bre d\xe9j\xe0 vu."
 								x
 								str_conv(x, "ISO-8859-1")
 								as.data.frame(stringi::stri_enc_detect(x))
 								str_conv(x, "ISO-8859-2")
 								```