Small copy edits edits to strings.Rmd

This commit is contained in:
Garrett 2016-04-07 16:10:32 -04:00
parent 90bd4fb1d1
commit b724f667bb
1 changed files with 12 additions and 12 deletions

View File

@ -10,7 +10,7 @@ sentences <- readr::read_lines("harvard-sentences.txt")
<!-- look at http://d-rug.github.io/blog/2015/regex.fick/, http://qntm.org/files/re/re.html -->
This chapter introduces you to string manipulation in R. You'll learn the basics of how strings work and how to create them by hand, but the focus of this chapter will be on regular expressions. Character variables typically unstructured or semi-structured data so you need some tools to make order from madness. Regular expressions are a very concise language for describing patterns in strings. When you first look at them, you'll think a cat walked across your keyboard, but as you learn more, you'll see how they allow you to express complex patterns very concisely. The goal of this chapter is not to teach you every detail of regular expressions. Instead we'll give you a solid foundation that allows you to solve a wide variety of problems and point you to resources where you can learn more.
This chapter introduces you to string manipulation in R. You'll learn the basics of how strings work and how to create them by hand, but the focus of this chapter will be on regular expressions. Character variables typically come as unstructured or semi-structured data. When this happens, you need some tools to make order from madness. Regular expressions are a very concise language for describing patterns in strings. When you first look at them, you'll think a cat walked across your keyboard, but as you learn more, you'll see how they allow you to express complex patterns very concisely. The goal of this chapter is not to teach you every detail of regular expressions. Instead I'll give you a solid foundation that allows you to solve a wide variety of problems and point you to resources where you can learn more.
This chapter will focus on the __stringr__ package. This package provides a consistent set of functions that all work the same way and are easier to learn than the base R equivalents. We'll also take a brief look at the __stringi__ package. This package is what stringr uses internally: it's more complex than stringr (and includes many many more functions). stringr includes tools to let you tackle the most common 90% of string manipulation challenges; stringi contains functions to let you tackle the last 10%.
@ -30,9 +30,9 @@ double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"
```
That means if you want to include a literal `\`, you'll need to double it up: `"\\"`.
That means if you want to include a literal backslash, you'll need to double it up: `"\\"`.
Beware that the printed representation of the string is not the same as string itself, because the printed representation shows the escapes. To see the raw contents of the string, use `writeLines()`:
Beware that the printed representation of a string is not the same as string itself, because the printed representation shows the escapes. To see the raw contents of the string, use `writeLines()`:
```{r}
x <- c("\"", "\\")
@ -83,7 +83,7 @@ Use the `sep` argument to control how they're separated:
str_c("x", "y", sep = ", ")
```
Like most other functions in R, missing values are infectious. If you want them to print as `NA`, use `str_replace_na()`:
Like most other functions in R, missing values are contagious. If you want them to print as `"NA"`, use `str_replace_na()`:
```{r}
x <- c("abc", NA)
@ -118,7 +118,7 @@ str_c(c("x", "y", "z"), collapse = ", ")
### Subsetting strings
You can extract parts of a string using `str_sub()`. As well as the string, `str_sub()` takes `start` and `end` argument which give the (inclusive) position of the substring:
You can extract parts of a string using `str_sub()`. As well as the string, `str_sub()` takes `start` and `end` arguments which give the (inclusive) position of the substring:
```{r}
x <- c("Apple", "Banana", "Pear")
@ -186,7 +186,7 @@ str_sort(x, locale = "haw") # Hawaiian
Regular expressions, regexps for short, are a very terse language that allow to describe patterns in strings. They take a little while to get your head around, but once you've got it you'll find them extremely useful.
To learn regular expressions, we'll use `str_show()` and `str_show_all()`. These functions take a character vector and a regular expression, and show you how they match. We'll start with very simple regular expressions and then gradually get more and more complicated. Once you've mastered pattern matching, you'll learn how to apply those ideas with various stringr functions.
To learn regular expressions, we'll use `str_view()` and `str_view_all()`. These functions take a character vector and a regular expression, and show you how they match. We'll start with very simple regular expressions and then gradually get more and more complicated. Once you've mastered pattern matching, you'll learn how to apply those ideas with various stringr functions.
### Basic matches
@ -203,7 +203,7 @@ The next step up in complexity is `.`, which matches any character (except a new
str_view(x, ".a.")
```
But if "`.`" matches any character, how do you match an actual "`.`"? You need to use an "escape" to tell the regular expression you want to match it exactly, not use the special behaviour. The escape character used by regular expressions is `\`. Unfortunately, that's also the escape character used by strings, so to match a literal "`.`" you need to use `\\.`.
But if "`.`" matches any character, how do you match an actual "`.`"? You need to use an "escape" to tell the regular expression you want to match it exactly, not use the special behaviour. In other words, you need to make the regular expression `\.`, but this creates a problem. We use strings to represent regular expressions, and `\` is also used as an escape symbol in strings. So the string `"\."` reduces to the special character written as `\.` In this case, `\.` is not a recognized special character and the string would lead to an error; but `"\n"` would reduce to a new line, `"\t"` would reduce to a tab, and `"\\"` would reduce to a literal `\`, which provides a way forward. To create a string that reduces to a literal backslash followed by a period, you need to escape the backslash. So to match a literal "`.`" you need to use `"\\."`, which simplifies to the regular expression `\.`.
```{r, cache = FALSE}
# To create the regular expression, we need \\
@ -216,7 +216,7 @@ writeLines(dot)
str_view(c("abc", "a.c", "bef"), "a\\.c")
```
If `\` is used an escape character, how do you match a literal `\`? Well you need to escape it, creating the regular expression `\\`. To create that regular expression, you need to use a string, which also needs to escape `\`. That means to match a literal `\` you need to write `"\\\\"` - you need four backslashes to match one!
If `\` is used as an escape character in regular expressions, how do you match a literal `\`? Well you need to escape it, creating the regular expression `\\`. To create that regular expression, you need to use a string, which also needs to escape `\`. That means to match a literal `\` you need to write `"\\\\"` - you need four backslashes to match one!
```{r, cache = FALSE}
x <- "a\\b"
@ -372,7 +372,7 @@ str_view(fruit, "(..)\\1", match = TRUE)
(You'll also see how they're useful in conjunction with `str_match()` in a few pages.)
Unfortunately `()` in regexps serve two purposes: you usually use them to disambiguate precedence, but you can also use for grouping. If you're using one set for grouping and one set for disambiguation, things can get confusing. You might want to use `(?:)` instead: it only disambiguates, and doesn't modify the grouping. They are called non-capturing parentheses.
Unfortunately `()` in regexps serve two purposes: you usually use them to disambiguate precedence, but you can also use them for grouping. If you're using one set for grouping and one set for disambiguation, things can get confusing. You might want to use `(?:)` instead: it only disambiguates, and doesn't modify the grouping. `(?:)` are called non-capturing parentheses.
For example:
@ -401,7 +401,7 @@ Now that you've learned the basics of regular expressions, it's time to learn ho
* Find the positions of matches.
* Extract the content of matches.
* Replace matches with new values.
* How can you split a string based on a match.
* Split a string based on a match.
Because regular expressions are so powerful, it's easy to try and solve every problem with a single regular expression. But since you're in a programming language, it's often easy to break the problem down into smaller pieces. If you find yourself getting stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.
@ -459,7 +459,7 @@ str_count("abababa", "aba")
str_view_all("abababa", "aba")
```
Note the use of `str_view_all()`. As you'll shortly learn, many stringr functions come in pairs: one function works with a single match, and the other works with all matches.
Note the use of `str_view_all()`. As you'll shortly learn, many stringr functions come in pairs: one function works with a single match, and the other works with all matches. The second function will have the suffix `_all`.
### Exercises
@ -633,7 +633,7 @@ sentences %>%
You can also request a maximum number of pieces:
```{r}
fields <- c("Name: Hadley", "County: NZ", "Age: 35")
fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
fields %>% str_split(": ", n = 2, simplify = TRUE)
```