More about strings

hadley 2015-10-22 13:17:00 -05:00
parent 88626be626
commit ec529ef1fa
1 changed files with 189 additions and 24 deletions

@ -6,6 +6,7 @@ output: bookdown::html_chapter
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(stringr)
```
# String manipulation
@ -14,6 +15,10 @@ When working with text data, one of the most powerful tools at your disposal is
In this chapter, you'll learn the basics of regular expressions using the stringr package.
```{r}
library(stringr)
```
The chapter concludes with a brief look at the stringi package. This package is what stringr uses internally: it's more complex than stringr (and includes many many more functions). stringr includes tools to let you tackle the most common 90% of string manipulation challenges; stringi contains functions to let you tackle the last 10%.
## String basics
@ -28,56 +33,153 @@ x
writeLines(x)
```
### String length
Base R contains many functions to work with strings but we'll generally avoid them because they're inconsistent and hard to remember. A particularly annoying inconsistency is that the function that computes the number of characters in a string, `nchar()`, returns 2 for `NA` (instead of `NA`):
```{r}
# (Will be fixed in R 3.3.0)
nchar(NA)
str_length(NA)
stringr::str_length(NA)
```
## Introduction to stringr ### Combining strings
To combine two or more strings, use `str_c()`:
```{r}
library(stringr) str_c("x", "y")
str_c("x", "y", "z")
```
The stringr package contains functions for working with strings and patterns. We'll focus on three: Use the `sep` argument to control how they're separated:
* `str_detect(string, pattern)`: does string match a pattern? ```{r}
* `str_extract(string, pattern)`: extact matching pattern from string str_c("x", "y", sep = ", ")
* `str_replace(string, pattern, replacement)`: replace pattern with replacement ```
* `str_split(string, pattern)`.
## Extracting patterns Like most other functions in R, missing values are infectious. If you want them to print as `NA`, use `str_replace_na()`:
## Introduction to regular expressions ```{r}
str_c("x", NA, "y")
str_c("x", str_replace_na(NA), "y")
```
Goal is not to be exhaustive. `str_c()` is vectorised, and it automatically recycles the shortest vectors to the same length as the longest:
### Character classes and alternative ```{r}
str_c("prefix-", c("a", "b", "c"), "-suffix")
```
* `.`: any character To collapse vectors into a single string, use `collapse`:
* `\d`: a digit
* `\s`: whitespace
* `x|y`: match x or y ```{r}
str_c(c("x", "y", "z"), collapse = ", ")
```
When creating strings you might also find `str_pad()` and `str_dup()` useful:
```{r}
x <- c("apple", "banana", "pear")
str_pad(x, 10)
str_c("Na ", str_dup("na ", 4), "batman!")
```
### Subsetting strings
You can extract parts of a string using `str_sub()`:
```{r}
x <- c("apple", "banana", "pear")
str_sub(x, 1, 3)
# negative numbers count backwards from end
str_sub(x, -3, -1)
```
You can also use `str_sub()` to modify strings:
```{r}
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x
```
### Exercises
1. In your own words, describe the difference between `sep` and `collapse`.
## Regular expressions
The stringr package contains functions for working with strings and patterns. We'll focus on four main categories:
* What matches the pattern?
* Does a string match a pattern?
* How can you replace a pattern with text?
* How can you split a string into pieces?
Key to all of these functions are regular expressions. Regular expressions are a very terse language that allows you to describe patterns in strings. They take a little while to get your head around, but once you've got it you'll find them extremely useful.
```{r}
```
The goal is not to be exhaustive, but to give you a solid foundation that allows you to solve a wide variety of problems. We'll point you to more resources where you can learn more about regular expressions.
### Matching anything and escaping
Regular expressions are not limited to matching fixed strings. You can also use special characters that match patterns. For example, `.` allows you to match any character:
```{r}
str_subset(c("abc", "adc", "bef"), "a.c")
```
But if "`.`" matches any character, how do you match an actual "`.`"? You need to use an "escape" to tell the regular expression you want to match it exactly, not use the special behaviour. The escape character used by regular expressions is `\`. Unfortunately, that's also the escape character used by strings, so to match a literal "`.`" you need to use `\\.`.
```{r}
# To create the regular expression, we need \\
dot <- "\\."
# But the expression itself only contains one:
cat(dot, "\n")
# And this tells R to look for explicit .
str_subset(c("abc", "a.c", "bef"), "a\\.c")
```
If `\` is used as an escape character, how do you match a literal `\`? Well, you need to escape it, creating the regular expression `\\`. And in R that needs to be in a string, so you need to write `"\\\\"` - that's right, you need four backslashes to match one!
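Here's a minimal sketch of the four-backslash rule in action. The string `x` is made up for illustration; it contains exactly one backslash.

```{r}
library(stringr)

# The string a\b: two backslashes in the source, one in the string itself
x <- "a\\b"
writeLines(x)

# The regular expression \\ (typed as "\\\\") matches that single backslash
str_detect(x, "\\\\")
```

`writeLines()` shows the string as it really is, which is a handy way to check how many backslashes you actually have.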
### Character classes and alternatives
As well as `.` there are a number of other special patterns that match more than one character:
* `\d`: any digit
* `\s`: any whitespace (space, tab, newline)
* `[abc]`: match a, b, or c
* `[a-e]`: match any character between a and e
* `[^abc]`: match anything except a, b, or c
### Escaping Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`.
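As a rough illustration of these classes, here is a small sketch. The vector `x` is invented for the example, and note the doubled backslashes needed to get `\d` and `\s` into a string.

```{r}
library(stringr)

# Made-up strings for illustration
x <- c("a1", "b 2", "cat", "dog")

str_detect(x, "\\d")    # contains a digit?       TRUE TRUE FALSE FALSE
str_detect(x, "\\s")    # contains whitespace?    FALSE TRUE FALSE FALSE
str_subset(x, "[abc]")  # contains a, b, or c:    "a1" "b 2" "cat"
str_subset(x, "[e-g]")  # contains e, f, or g:    "dog"
```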
You may have noticed that since `.` is a special regular expression character, you'll need to escape `.` A similar idea is alternation: `x|y` matches either x or y. Note that the precedence for `|` is low, so that `abc|xyz` matches either `abc` or `xyz` not `abcyz` or `abxyz`:
```{r}
str_detect(c("abc", "xyz"), "abc|xyz")
```
Like with mathematics, if precedence ever gets confusing, use parentheses to make it clear what you want:
```{r}
str_detect(c("grey", "gray"), "gr(e|a)y")
str_detect(c("grey", "gray"), "gr(?:e|a)y")
```
Unfortunately parentheses have some other side-effects in regular expressions, which we'll learn about later. Technically, the parentheses you should use are `(?:)` which are called non-capturing parentheses. Most of the time this won't make any difference so it's easy to use `()`, but it's sometimes helpful to be aware of `(?:)`.
### Repetition
* `?`: 0 or 1
* `+`: 1 or more
* `*`: 0 or more
* `{n}`: exactly n
* `{n,}`: n or more
* `{,m}`: at most m
@ -85,17 +187,34 @@ You may have noticed that since `.` is a special regular expression character, y
(By default these matches are "greedy": they will match the longest string possible. You can make them "lazy", matching the shortest string possible by putting a `?` after them.)
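To see the difference between greedy and lazy matching, here is a short sketch using an invented HTML-like string:

```{r}
library(stringr)

# Made-up input for illustration
x <- "<b>bold</b>"

# Greedy: .+ grabs as much as it can, so the match runs to the last ">"
str_extract(x, "<.+>")   # "<b>bold</b>"

# Lazy: .+? stops at the first ">" that lets the pattern succeed
str_extract(x, "<.+?>")  # "<b>"
```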
Note that the precedence of these operators is high, so in `colou?r` the `?` applies only to the `u`. That means you'll need to use parentheses for many uses: `bana(na)+` or `ba(na){2,}`.
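A quick sketch of what that precedence means in practice (the input strings are chosen just for illustration):

```{r}
library(stringr)

# ? binds only to the u, so both spellings match
str_detect(c("color", "colour"), "colou?r")  # TRUE TRUE

# Without parentheses, + repeats only the final a
str_extract("bananana", "banana+")    # "banana"

# With parentheses, + repeats the whole group "na"
str_extract("bananana", "bana(na)+")  # "bananana"
```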
### Anchors
* `^` match the start of the line
* `$` match the end of the line
* `\b` match boundary between words
My favourite mnemonic for remembering which is which (from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953)): begin with power (`^`), end with money (`$`).
To force a regular expression to only match a complete string:
```{r}
str_detect(c("abcdef", "bcd"), "^bcd$")
```
You can also match the boundary between words with `\b`. I don't find I often use this in R, but I will sometimes use it when I'm doing a find all in RStudio when I want to find the name of a function that's a component of other functions. For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on.
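The same idea can be sketched in R with stringr. The vector of function names below is just the list from the paragraph above, and `\b` has to be typed as `"\\b"` inside a string:

```{r}
library(stringr)

funs <- c("sum", "summarise", "summary", "rowsum")

# A plain "sum" matches every name that contains it
str_subset(funs, "sum")

# Word boundaries on both sides leave only the exact name
str_subset(funs, "\\bsum\\b")  # "sum"
```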
### Exercises
1. Replace all `/` in a string with `\`.
## Detecting matches
`str_detect()`, `str_subset()`, `str_count()`
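This section is still an outline, so here is only a minimal sketch of how the three functions differ, reusing the fruit vector from earlier in the chapter:

```{r}
library(stringr)

x <- c("apple", "banana", "pear")

str_detect(x, "an")  # one TRUE/FALSE per string:  FALSE TRUE FALSE
str_subset(x, "an")  # only the strings that match: "banana"
str_count(x, "a")    # number of matches per string: 1 3 1
```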
## Extracting matches
`str_extract()`, `str_extract_all()`
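As a rough sketch of the difference between the two (the input strings are invented for illustration): `str_extract()` returns the first match in each string, while `str_extract_all()` returns every match, as a list.

```{r}
library(stringr)

# Made-up strings for illustration
x <- c("apple 12 pears 7", "no digits")

# First run of digits in each string; NA when there is no match
str_extract(x, "\\d+")      # "12" NA

# All runs of digits; one list element per input string
str_extract_all(x, "\\d+")
```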
### Groups
@ -103,8 +222,54 @@ My favourite mneomic for rememember which is which (from [Evan Misshula](https:/
## Replacing patterns
`str_replace()`, `str_replace_all()`
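A minimal sketch of the difference: `str_replace()` changes only the first match in each string, while `str_replace_all()` changes every match.

```{r}
library(stringr)

str_replace("banana", "a", "o")      # "bonana"
str_replace_all("banana", "a", "o")  # "bonono"
```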
## Splitting
`str_split()`, `str_split_fixed()`.
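Here is a small sketch of the two functions on an invented comma-separated string. `str_split()` returns a list (one element per input string), and `str_split_fixed()` returns a character matrix with a fixed number of columns.

```{r}
library(stringr)

# Split on every comma: a list holding c("a", "b", "c")
pieces <- str_split("a,b,c", ",")
pieces

# Split into exactly two pieces: a 1 x 2 matrix, "a" and "b,c"
parts <- str_split_fixed("a,b,c", ",", 2)
parts
```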
## Other types of pattern
* `fixed()` When you use a pattern that's a string, it's automatically wrapped into a call to `regex()`. Sometimes it's useful to call it explicitly so you can control the
* `coll()`
* `boundary()` * `fixed()`: matches exactly that sequence of characters (i.e. ignoring
  all special regular expression patterns).
* `coll()`: compare strings using standard **coll**ation rules. This is
useful for doing case insensitive matching. Note that `coll()` takes a
`locale` parameter that controls which rules are used for comparing
characters. Unfortunately different parts of the world use different rules!
```{r}
# Turkish has two i's: with and without a dot, and it
# has a different rule for capitalising them:
str_to_upper(c("i", "ı"))
str_to_upper(c("i", "ı"), locale = "tr")
# That means you also need to be aware of the difference
# when doing case insensitive matches:
i <- c("I", "İ", "i", "ı")
i
str_subset(i, fixed("i", TRUE))
str_subset(i, coll("i", TRUE))
str_subset(i, coll("i", TRUE, locale = "tr"))
```
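Going back to `fixed()`: a short sketch of how it differs from the default regular expression matching, using two made-up strings.

```{r}
library(stringr)

x <- c("a.c", "abc")

# As a regular expression, . matches any character
str_detect(x, ".")         # TRUE TRUE

# With fixed(), . is just a literal full stop
str_detect(x, fixed("."))  # TRUE FALSE
```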
## Other uses of regular expressions
There are a few other functions in base R that accept regular expressions:
* `apropos()` searches all objects available from the global environment. This
  is useful if you can't quite remember the name of the function.
* `ls()` is similar to `apropos()` but only works in the current
environment. However, if you have so many objects in your environment
that you have to use a regular expression to filter them all, you
need to think about what you're doing! (And probably use a list instead).
* `dir()` lists all the files in a directory. The `pattern` argument takes
  a regular expression and only returns file names that match the pattern.
  For example, you can find all csv files with `dir(pattern = "\\.csv$")`.
  (If you're more comfortable with "globs" like `*.csv`, you can convert
  them to regular expressions with `glob2rx()`.)
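To check a pattern without touching the file system, you can try it on a character vector first. This is only a sketch: the file names below are hypothetical, and base R's `grepl()` stands in for `dir()`.

```{r}
# Hypothetical file names, used only for illustration
files <- c("data.csv", "data.csv.bak", "notes.txt")

# Convert the glob to a regular expression, then test each name
rx <- glob2rx("*.csv")
grepl(rx, files)         # TRUE FALSE FALSE

# The hand-written pattern gives the same answer here
grepl("\\.csv$", files)  # TRUE FALSE FALSE
```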