Some string tweaking

hadley 2015-10-26 09:52:24 -05:00
parent ec529ef1fa
commit 979289c50b
1 changed files with 59 additions and 31 deletions

@ -21,9 +21,17 @@ library(stringr)
The chapter concludes with a brief look at the stringi package. This package is what stringr uses internally: it's more complex than stringr (and includes many many more functions). stringr includes tools to let you tackle the most common 90% of string manipulation challenges; stringi contains functions to let you tackle the last 10%.
For example, if you have
```{r}
```
The goal of this chapter is not to teach you every detail of regular expressions. Instead we'll give you a solid foundation that allows you to solve a wide variety of problems and point you to resources where you can learn more.
## String basics
In R, strings are always stored in a character vector. You can create strings with either single quotes or double quotes: there is no different in behaviour.
In R, strings are always stored in a character vector. You can create strings with either single quotes or double quotes: there is no difference in behaviour. I recommend always using `"`, unless you want to create a string that contains multiple `"`, in which case use `'`.
To include a literal single or double quote in a string you can use `\` to "escape" it. Note that when you print a string, you see the escapes. To see the raw contents of the string, use `writeLines()` (or for a length-1 character vector, `cat(x, "\n")`).
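As a minimal sketch of the difference between the printed representation and the raw contents (the example string is made up for illustration):

```{r}
# A double-quoted string containing escaped double quotes; the single
# quote needs no escape inside a double-quoted string
quoted <- "She said \"it's fine\""

# print() shows the escaped representation
print(quoted)

# writeLines() shows the raw contents of the string
writeLines(quoted)
```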
@ -33,6 +41,17 @@ x
writeLines(x)
```
There are a handful of other special characters. The most commonly used are `"\n"`, new line, and `"\t"`, tab, but you can see the complete list by requesting help on `"`: `?'"'`, or `?"'"`.
You'll also sometimes see strings like `"\u00b5"`; this is a way of writing special characters that works on all platforms:
```R
x <- "\u00b5"
x
```
Remember that the representation of a string is different from the string itself.
### String length
Base R contains many functions to work with strings but we'll generally avoid them because they're inconsistent and hard to remember. A particularly annoying inconsistency is that the function that computes the number of characters in a string, `nchar()`, returns 2 for `NA` (instead of `NA`).
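A short sketch of that inconsistency, comparing `nchar()` with stringr's `str_length()` (the input vector is illustrative):

```{r}
library(stringr)

# Base R counts the characters in the text "NA"
nchar(NA)

# stringr propagates the missing value instead
str_length(NA)

# Works element-wise on a character vector
str_length(c("a", "R for data science", NA))
```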
@ -61,11 +80,12 @@ str_c("x", "y", sep = ", ")
Like most other functions in R, missing values are infectious. If you want them to print as `NA`, use `str_replace_na()`:
```{r}
str_c("x", NA, "y")
str_c("x", str_replace_na(NA), "y")
x <- c("abc", NA)
str_c("|-", x, "-|")
str_c("|-", str_replace_na(x), "-|")
```
`str_c()` is vectorised, and it automatically recycles the shortest vectors to the same length as the longest:
As shown above, `str_c()` is vectorised, automatically recycling the shortest vectors to the same length as the longest:
```{r}
str_c("prefix-", c("a", "b", "c"), "-suffix")
@ -108,25 +128,10 @@ x
1. In your own words, describe the difference between `sep` and `collapse`.
## Regular expressions
The stringr package contains functions for working with strings and patterns. We'll focus on four main categories
* What matches the pattern?
* Does a string match a pattern?
* How can you replace a pattern with text?
* How can you split a string into pieces?
## Regular expressions basics
Key to all of these functions are regular expressions. Regular expressions are a very terse language that allows you to describe patterns in strings. They take a little while to get your head around, but once you've got it you'll find them extremely useful.
```{r}
```
The goal is not to be exhaustive, but to give you a solid foundation that allows you to solve a wide variety of problems. We'll point you to more resources where you can learn more about regular expressions.
### Matching anything and escaping
Regular expressions are not limited to matching fixed strings. You can also use special characters that match patterns. For example, `.` allows you to match any character:
```{r}
@ -148,6 +153,14 @@ str_subset(c("abc", "a.c", "bef"), "a\\.c")
If `\` is used as an escape character, how do you match a literal `\`? Well you need to escape it, creating the regular expression `\\`. And in R that needs to be in a string, so you need to write `"\\\\"` - that's right, you need four backslashes to match one!
```{r}
x <- "a\\b"
cat(x, "\n")
y <- str_replace(x, "\\\\", "-slash-")
cat(y, "\n")
```
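To see why four backslashes are needed, it can help to look at the pattern string itself. This is a small sketch; the variable name `pattern` is only for illustration:

```{r}
library(stringr)

# The R string "\\\\" holds two characters: the regular expression \\
pattern <- "\\\\"
writeLines(pattern)
str_length(pattern)

# That regular expression matches the single backslash in a\b
str_detect(c("a\\b", "ab"), pattern)
```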
### Character classes and alternatives
As well as `.` there are a number of other special patterns that match more than one character:
@ -166,7 +179,7 @@ A similar idea is alternation: `x|y` matches either x or y. Note that the preced
str_detect(c("abc", "xyz"), "abc|xyz")
```
Like with mathematics, if precedence ever gets confusing, use parentheses to make it clear what you want:
Like with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:
```{r}
str_detect(c("grey", "gray"), "gr(e|a)y")
@ -191,42 +204,57 @@ Note that the precedence of these operators are high, so you write: `colou?r`. T
### Anchors
* `^` match the start of the line
* `$` match the end of the line
Regular expressions can also match things that are not characters. The most important non-character matches are:
My favourite mnemonic for remembering which is which (from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953)): begin with power (`^`), end with money (`$`).
* `^`: the start of the line.
* `$`: the end of the line.
To force a regular expression to only match a complete string:
To force a regular expression to only match a complete string, anchor it with both `^` and `$`:
```{r}
str_detect(c("abcdef", "bcd"), "^bcd$")
```
My favourite mnemonic for remembering which is which (from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953)): begin with power (`^`), end with money (`$`).
You can also match the boundary between words with `\b`. I don't find I often use this in R, but I will sometimes use it when I'm doing a find all in RStudio when I want to find the name of a function that's a component of other functions. For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on.
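The same word-boundary search can be sketched in R. Note that the backslash has to be doubled inside an R string; the vector of function names is made up for illustration:

```{r}
library(stringr)

# Hypothetical list of names to search through
fns <- c("sum", "summarise", "summary", "rowsum", "sum(x)")

# \b requires a word boundary on both sides of "sum"
str_detect(fns, "\\bsum\\b")
```

Only `"sum"` and `"sum(x)"` should match, because in the other names `sum` is directly adjacent to another letter.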
### Exercises
1. Replace all `/` in a string with `\`.
## Detecting matches
## Regular expression operations
The stringr package contains functions for working with strings and patterns. We'll focus on four main categories:
* What matches the pattern?
* Does a string match a pattern?
* How can you replace a pattern with text?
* How can you split a string into pieces?
### Detecting matches
`str_detect()`, `str_subset()`, `str_count()`
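A minimal sketch of how these three functions differ, using a made-up vector (`fruit_names` is not from the original text):

```{r}
library(stringr)

fruit_names <- c("apple", "banana", "pear")  # hypothetical data

# Does each string contain the pattern? Returns a logical vector
str_detect(fruit_names, "an")

# Keep only the strings that match
str_subset(fruit_names, "an")

# How many matches are there in each string?
str_count(fruit_names, "a")
```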
## Extracting matches
### Extracting matches
`str_extract()`, `str_extract_all()`
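As a rough illustration (the input strings are invented), `str_extract()` returns the first match per string, while `str_extract_all()` returns every match in a list:

```{r}
library(stringr)

notes <- c("apple 12, pear 7", "no numbers", "plum 3")  # hypothetical data

# First run of digits in each string, or NA when there is none
str_extract(notes, "[0-9]+")

# All runs of digits, as a list with one element per input string
str_extract_all(notes, "[0-9]+")
```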
### Groups
### Extracting grouped matches
`str_match()`, `str_match_all()`
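A small sketch of grouped extraction, assuming dates written as year-month-day (the input is made up). `str_match()` returns a character matrix: the first column is the complete match and each following column is one parenthesised group:

```{r}
library(stringr)

dates <- c("2015-10-26", "not a date")  # hypothetical data

# Three groups: year, month, day
parts <- str_match(dates, "([0-9]{4})-([0-9]{2})-([0-9]{2})")
parts

# Pull out the year column; strings that don't match give NA
parts[, 2]
```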
## Replacing patterns
### Replacing patterns
`str_replace()`, `str_replace_all()`
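The difference between the two is easiest to see side by side. This is only a sketch with invented input:

```{r}
library(stringr)

codes <- c("a-b-c", "d-e")  # hypothetical data

# Replaces only the first match in each string
str_replace(codes, "-", "+")      # "a+b-c" "d+e"

# Replaces every match in each string
str_replace_all(codes, "-", "+")  # "a+b+c" "d+e"
```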
## Splitting
### Splitting
`str_split()`, `str_split_fixed()`.
`str_split()`, `str_split_fixed()`.
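A brief sketch of the two return shapes, using a made-up comma-separated vector: `str_split()` returns a list because each string can yield a different number of pieces, while `str_split_fixed()` returns a matrix with a fixed number of columns.

```{r}
library(stringr)

rows <- c("a,b,c", "d,e")  # hypothetical data

# A list with one character vector per input string
str_split(rows, ",")

# A character matrix with exactly 3 columns; short rows are padded with ""
str_split_fixed(rows, ",", 3)
```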
### Finding locations
`str_locate()` and `str_locate_all()` give you the starting and ending positions of each match. These are particularly useful when none of the other functions does exactly what you want. You can use `str_locate()` to find the matching pattern and `str_sub()` to extract and/or modify it.
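One way this combination might look in practice (the input string and variable names are illustrative only):

```{r}
library(stringr)

contact <- "phone: 555-1234"  # hypothetical data

# A matrix with "start" and "end" columns for the first run of digits
loc <- str_locate(contact, "[0-9]+")
loc

# Extract the located piece
str_sub(contact, loc[, "start"], loc[, "end"])

# Modify the located piece in place using the assignment form of str_sub()
str_sub(contact, loc[, "start"], loc[, "end"]) <- "XXX"
contact
```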
## Other types of pattern