Strings tweaks

* Don't need to cache html widgets anymore
* Data now in stringr package
hadley 2016-07-19 18:01:52 -05:00
parent 68c05a49a1
commit 6a09717269
2 changed files with 21 additions and 24 deletions

@ -9,3 +9,7 @@ bookdown::gitbook:
text: "Edit"
sharing: no
css: r4ds.css
bookdown::pdf_book:
latex_engine: "xelatex"

@ -1,7 +1,5 @@
# Strings
<!-- look at http://d-rug.github.io/blog/2015/regex.fick/, http://qntm.org/files/re/re.html -->
## Introduction
This chapter introduces you to string manipulation in R. You'll learn the basics of how strings work and how to create them by hand, but the focus of this chapter will be on regular expressions. Character variables typically come as unstructured or semi-structured data. When this happens, you need some tools to make order from madness. Regular expressions are a very concise language for describing patterns in strings. When you first look at them, you'll think a cat walked across your keyboard, but as you learn more, you'll see how they allow you to express complex patterns very concisely. The goal of this chapter is not to teach you every detail of regular expressions. Instead I'll give you a solid foundation that allows you to solve a wide variety of problems and point you to resources where you can learn more.
@ -12,13 +10,8 @@ This chapter will focus on the __stringr__ package. This package provides a cons
In this chapter you'll use the stringr package to manipulate strings.
```{r setup, cache = FALSE}
```{r setup}
library(stringr)
# To be moved into stringr
common <- rcorpora::corpora("words/common")$commonWords
fruit <- rcorpora::corpora("foods/fruits")$fruits
sentences <- readr::read_lines("harvard-sentences.txt")
```
## String basics
@ -199,20 +192,20 @@ To learn regular expressions, we'll use `str_view()` and `str_view_all()`. These
The simplest patterns match exact strings:
```{r, cache = FALSE}
```{r}
x <- c("apple", "banana", "pear")
str_view(x, "an")
```
The next step up in complexity is `.`, which matches any character (except a newline):
```{r, cache = FALSE}
```{r}
str_view(x, ".a.")
```
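Because `str_view()` renders an HTML widget, it can help to confirm the same matches as plain values. A minimal sketch, assuming only stringr's `str_detect()` and `str_extract()` (the variable names are illustrative):

```r
library(stringr)

x <- c("apple", "banana", "pear")

# Which strings contain the exact sequence "an"?
has_an <- str_detect(x, "an")

# First match of "any character, then a, then any character" in each string;
# NA means the string has no such match.
first_dot_match <- str_extract(x, ".a.")
```

Here `has_an` is `FALSE TRUE FALSE` and `first_dot_match` is `NA "ban" "ear"`: "apple" has no character before its only "a", so `.a.` cannot match it.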
But if "`.`" matches any character, how do you match an actual "`.`"? You need to use an "escape" to tell the regular expression you want to match it exactly, not use its special behaviour. In other words, you need the regular expression `\.`, but this creates a problem. We use strings to represent regular expressions, and `\` is also used as an escape symbol in strings. So R reads the string `"\."` as an attempt to write the special character `\.`. Because `\.` is not a recognized special character, the string leads to an error; by contrast, `"\n"` reduces to a new line, `"\t"` reduces to a tab, and `"\\"` reduces to a literal `\`, which provides a way forward. To create a string that reduces to a literal backslash followed by a period, you need to escape the backslash. So to match a literal "`.`" you need to use the string `"\\."`, which reduces to the regular expression `\.`.
```{r, cache = FALSE}
```{r}
# To create the regular expression, we need \\
dot <- "\\."
@ -225,7 +218,7 @@ str_view(c("abc", "a.c", "bef"), "a\\.c")
If `\` is used as an escape character in regular expressions, how do you match a literal `\`? Well, you need to escape it, creating the regular expression `\\`. To create that regular expression, you need to use a string, which also needs to escape `\`. That means to match a literal `\` you need to write `"\\\\"`: you need four backslashes to match one!
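One way to keep the two escaping layers straight is to compare what you type with what the regular expression engine receives. The sketch below uses base R's `nchar()` and `writeLines()` for that; the variable names are only illustrative:

```r
library(stringr)

dot <- "\\."        # typed with two backslashes
nchar(dot)          # the string holds 2 characters: a backslash and a period
writeLines(dot)     # prints the regular expression itself: \.

# The escaped dot matches only a literal period
dot_hits <- str_detect(c("abc", "a.c", "bef"), "a\\.c")

backslash <- "\\\\" # typed with four backslashes
nchar(backslash)    # 2 characters: the regular expression \\
backslash_hit <- str_detect("a\\b", backslash)
```

Each layer removes one level of escaping: the R parser turns `"\\\\"` into the two-character string `\\`, and the regular expression engine then reads `\\` as one literal backslash.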
```{r, cache = FALSE}
```{r}
x <- "a\\b"
writeLines(x)
@ -250,7 +243,7 @@ By default, regular expressions will match any part of a string. It's often usef
* `^` to match the start of the string.
* `$` to match the end of the string.
```{r, cache = FALSE}
```{r}
x <- c("apple", "banana", "pear")
str_view(x, "^a")
str_view(x, "a$")
@ -260,7 +253,7 @@ To remember which is which, try this mnemonic which I learned from [Evan Misshul
To force a regular expression to only match a complete string, anchor it with both `^` and `$`:
```{r, cache = FALSE}
```{r}
x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")
str_view(x, "^apple$")
@ -301,13 +294,13 @@ Remember, to create a regular expression containing `\d` or `\s`, you'll need to
You can use _alternation_ to pick between one or more alternative patterns. For example, `abc|d..f` will match either `"abc"` or `"deaf"`. Note that the precedence for `|` is low, so `abc|xyz` matches either `abc` or `xyz`, not `abcyz` or `abxyz`:
```{r, cache = FALSE}
```{r}
str_view(c("abc", "xyz"), "abc|xyz")
```
Like with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:
```{r, cache = FALSE}
```{r}
str_view(c("grey", "gray"), "gr(e|a)y")
```
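To see the effect of precedence as values rather than highlights, the following sketch anchors two variants of the pattern so that only complete strings match. The input words and variable names are made up for illustration:

```r
library(stringr)

words <- c("abc", "xyz", "abcyz", "abxyz")

# Parentheses around the whole alternation: the complete string must be
# either "abc" or "xyz".
whole_alt <- str_detect(words, "^(abc|xyz)$")

# Parentheses around only "c|x": the complete string must be
# "ab", then "c" or "x", then "yz".
inner_alt <- str_detect(words, "^ab(c|x)yz$")
```

`whole_alt` is `TRUE TRUE FALSE FALSE` and `inner_alt` is `FALSE FALSE TRUE TRUE`, so where you put the parentheses decides which reading of `|` you get.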
@ -373,7 +366,7 @@ Note that the precedence of these operators is high, so you can write: `colou?r`
You learned about parentheses earlier as a way to disambiguate complex expressions. They do one other special thing: they also define numeric groups that you can refer to with _backreferences_, `\1`, `\2` etc. For example, the following regular expression finds all fruits that have a repeated pair of letters.
```{r, cache = FALSE}
```{r}
str_view(fruit, "(..)\\1", match = TRUE)
```
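The same backreference can be tried on a small hand-written vector, which makes it easier to see what `\1` refers to. This is a sketch using `str_subset()` and `str_extract()`; the three words are chosen only for illustration:

```r
library(stringr)

words <- c("banana", "coconut", "pear")

# Keep the words that contain some two-character sequence twice in a row
repeated <- str_subset(words, "(..)\\1")

# Pull out the repeated stretch itself (NA when there is none)
stretch <- str_extract(words, "(..)\\1")
```

`repeated` keeps "banana" and "coconut", and `stretch` is `"anan" "coco" NA`: in "banana" the group `(..)` captures "an", and `\1` then requires a second "an" immediately after it.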
@ -461,7 +454,7 @@ mean(str_count(common, "[aeiou]"))
Note that matches never overlap. For example, in `"abababa"`, how many times will the pattern `"aba"` match? Regular expressions say two, not three:
```{r, cache = FALSE}
```{r}
str_count("abababa", "aba")
str_view_all("abababa", "aba")
```
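If you do need the overlapping count, one possible workaround for a fixed piece of text such as `"aba"` is to compare every window of the string against it. This is only a sketch: it treats the pattern as literal text, not as a regular expression, and the variable names are illustrative:

```r
library(stringr)

s <- "abababa"
p <- "aba"

# Matches found by the regular expression engine never overlap
non_overlapping <- str_count(s, p)

# Compare every window of str_length(p) characters against the literal text.
# This only works for a fixed string like "aba", not a general pattern.
starts <- seq_len(str_length(s) - str_length(p) + 1)
overlapping <- sum(str_sub(s, starts, starts + str_length(p) - 1) == p)
```

`non_overlapping` is 2, while `overlapping` is 3 because the windows starting at positions 1, 3 and 5 all equal "aba".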
@ -510,7 +503,7 @@ head(matches)
Note that `str_extract()` only extracts the first match. We can see that most easily by first selecting all the sentences that have more than 1 match:
```{r, cache = FALSE}
```{r}
more <- sentences[str_count(sentences, colour_match) > 1]
str_view_all(more, colour_match)
@ -646,7 +639,7 @@ fields %>% str_split(": ", n = 2, simplify = TRUE)
Instead of splitting up strings by patterns, you can also split up by character, line, sentence and word `boundary()`s:
```{r, cache = FALSE}
```{r}
x <- "This is a sentence. This is another sentence."
str_view_all(x, boundary("word"))
@ -683,7 +676,7 @@ You can use the other arguments of `regex()` to control details of the match:
* `ignore_case = TRUE` allows characters to match either their uppercase or
lowercase forms. This always uses the current locale.
```{r, cache = FALSE}
```{r}
bananas <- c("banana", "Banana", "BANANA")
str_view(bananas, "banana")
str_view(bananas, regex("banana", ignore_case = TRUE))
@ -692,7 +685,7 @@ You can use the other arguments of `regex()` to control details of the match:
* `multiline = TRUE` allows `^` and `$` to match the start and end of each
line rather than the start and end of the complete string.
```{r, cache = FALSE}
```{r}
x <- "Line 1\nLine 2\nLine 3"
str_view_all(x, "^Line")
str_view_all(x, regex("^Line", multiline = TRUE))
@ -773,7 +766,7 @@ There are three other functions you can use instead of `regex()`:
* As you saw with `str_split()` you can use `boundary()` to match boundaries.
You can also use it with the other functions:
```{r, cache = FALSE}
```{r}
x <- "This is a sentence."
str_view_all(x, boundary("word"))
str_extract_all(x, boundary("word"))
@ -820,7 +813,7 @@ There are a few other functions in base R that accept regular expressions:
### The stringi package
stringr is built on top of the __stringi__ package. stringr is useful when you're learning because it exposes a minimal set of functions that have been carefully picked to handle the most common string manipulation tasks. stringi, on the other hand, is designed to be comprehensive. It contains almost every function you might ever need. stringi has `r length(ls(getNamespace("stringi")))` functions to stringr's `r length(ls("package:stringr"))`.
stringr is built on top of the __stringi__ package. stringr is useful when you're learning because it exposes a minimal set of functions that have been carefully picked to handle the most common string manipulation tasks. stringi, on the other hand, is designed to be comprehensive. It contains almost every function you might ever need. stringi has `r length(getNamespaceExports("stringi"))` functions to stringr's `r length(getNamespaceExports("stringr"))`.
So if you find yourself struggling to do something that doesn't seem natural in stringr, it's worth taking a look at stringi. The use of the two packages is very similar because stringr was designed to mimic stringi's interface. The main difference is the prefix: `str_` vs. `stri_`.
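As a rough illustration of how closely the two interfaces line up, the sketch below calls a stringr function and a stringi counterpart on the same input. `stri_detect_regex()` and `stri_count_regex()` are stringi functions that are not otherwise used in this chapter; the variable names are illustrative:

```r
library(stringr)

x <- c("apple", "banana", "pear")

# stringr and stringi versions of "does the pattern match?"
detect_r <- str_detect(x, "an")
detect_i <- stringi::stri_detect_regex(x, "an")

# stringr and stringi versions of "how many matches?"
count_r <- str_count(x, "a")
count_i <- stringi::stri_count_regex(x, "a")
```

Both pairs give the same answers. Note that the stringi names also spell out the matching engine (`_regex`), which stringr instead selects through helpers such as `regex()`, `fixed()` and `boundary()`.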