Merge pull request #42 from radugrosu/patch-5

Update strings.Rmd
This commit is contained in:
Hadley Wickham 2016-02-11 13:26:28 -06:00
commit a95709fa5f
1 changed file with 39 additions and 39 deletions


That means if you want to include a literal `\`, you'll need to double it up: `"\\"`.
Beware that the printed representation of a string is not the same as the string itself, because the printed representation shows the escapes. To see the raw contents of the string, use `writeLines()`:
```{r}
x <- c("\"", "\\")
x
writeLines(x)
```
There are a handful of other special characters. The most commonly used are `"\n"`, newline, and `"\t"`, tab, but you can see the complete list by requesting help on `"`: `?'"'` or `?"'"`. You'll also sometimes see strings like `"\u00b5"`; this is a way of writing non-English characters that works on all platforms:
```{r}
x <- "\u00b5"
x
```
### String length
Base R contains many functions to work with strings but we'll generally avoid them because they're inconsistent and hard to remember. Their behaviour is particularly inconsistent when it comes to missing values. For example, `nchar()`, which gives the length of a string, returns 2 for `NA` (instead of `NA`):
```{r}
# Bug will be fixed in R 3.3.0
nchar(NA)
```
### Locales
Above I used `str_to_lower()` to change to lower case. You can also use `str_to_upper()` or `str_to_title()`. However, changing case is more complicated than it might at first seem because different languages have different rules for changing case. You can pick which set of rules to use by specifying a locale:
```{r}
# Turkish has two i's: with and without a dot, and it
# has a different rule for capitalising them:
str_to_upper(c("i", "ı"), locale = "tr")
```
The locale is specified with ISO 639 language codes, which are two- or three-letter abbreviations. If you don't already know the code for your language, [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) has a good list. If you leave the locale blank, it will use the current locale.
Another important operation that's affected by the locale is sorting. The base R `order()` and `sort()` functions sort strings using the current locale. If you want robust behaviour across different computers, you may want to use `str_sort()` and `str_order()` which take an additional `locale` argument:
```{r}
x <- c("apple", "eggplant", "banana")
str_sort(x, locale = "haw") # Hawaiian
```
Regular expressions, regexps for short, are a very terse language that allows you to describe patterns in strings. They take a little while to get your head around, but once you've got it you'll find them extremely useful.
To learn regular expressions, we'll use `str_view()` and `str_view_all()`. These functions take a character vector and a regular expression, and show you how they match. We'll start with very simple regular expressions and then gradually get more and more complicated. Once you've mastered pattern matching, you'll learn how to apply those ideas with various stringr functions.
### Basic matches
The simplest patterns match exact strings:
```{r, cache = FALSE}
x <- c("apple", "banana", "pear")
str_view(x, "an")
```
The next step up in complexity is `.`, which matches any character (except a newline):
```{r, cache = FALSE}
str_view(x, ".a.")
```

```{r, cache = FALSE}
str_view(x, "^a")
str_view(x, "a$")
```
To remember which is which, try this mnemonic which I learned from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): if you begin with power (`^`), you end up with money (`$`).
To force a regular expression to only match a complete string, anchor it with both `^` and `$`:
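For example (a small sketch, assuming stringr is loaded), anchoring with both symbols matches only a complete string:

```{r}
library(stringr)
x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")    # matches "apple" inside every string
str_view(x, "^apple$")  # matches only the complete string "apple"
```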
There are a number of other special patterns that match more than one character:
* `.`: any character apart from a newline.
* `\d`: any digit.
* `\s`: any whitespace (space, tab, newline).
* `[abc]`: match a, b, or c.
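As a small sketch of character classes (assuming stringr is loaded; these vectors are illustrations, not from the chapter), note that `[.]` also gives you a way to match a literal `.`:

```{r}
library(stringr)
str_view(c("abc", "a.c", "a*c"), "a[.]c")  # [.] matches a literal dot
str_view(c("grey", "gray"), "gr[ea]y")     # [ea] matches either e or a
```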
You can use _alternation_ to pick between one or more alternative patterns. For example, `abc|xyz` will match either `abc` or `xyz`:

```{r, cache = FALSE}
str_view(c("abc", "xyz"), "abc|xyz")
```
Like with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:
```{r, cache = FALSE}
str_view(c("grey", "gray"), "gr(e|a)y")
```
1. Start with a vowel.
1. That only contain consonants. (Hint: think about matching
   "not"-vowels.)
1. End with `ed`, but not with `eed`.
By default these matches are "greedy": they will match the longest string possible.
```{r}
```
Note that the precedence of these operators is high, so you can write `colou?r` to match either American or British spellings. That means most uses will need parentheses, like `bana(na)+` or `ba(na){2,}`.
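A quick sketch of both points (assuming stringr is loaded):

```{r}
library(stringr)
str_view(c("color", "colour"), "colou?r")       # ? applies only to the preceding u
str_view(c("banana", "bananana"), "bana(na)+")  # parentheses group "na" for the +
```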
#### Exercises
1. Describe in words what these regular expressions match:
(read carefully to see if I'm using a regular expression or a string
that defines a regular expression.)
1. `^.*$`
1. Create regular expressions to find all words that:
1. Have three or more vowels in a row.
1. Start with three consonants.
1. Have two or more vowel-consonant pairs in a row.
### Grouping and backreferences
You learned about parentheses earlier as a way to disambiguate complex expressions. They do one other special thing: they also define numeric groups that you can refer to with _backreferences_, `\1`, `\2`, etc. For example, the following regular expression finds all fruits that have a pair of letters that's repeated.
```{r, cache = FALSE}
str_view(fruit, "(..)\\1", match = TRUE)
```
## Tools
Now that you've learned the basics of regular expressions, it's time to learn how to apply them to real problems. In this section you'll learn a wide array of stringr functions that let you:
* Determine which elements match a pattern.
* Find the positions of matches.
* Extract the content of matches.
* Replace matches with new values.
* Split a string based on a match.
Because regular expressions are so powerful, it's tempting to try and solve every problem with a single regular expression. But since you're in a programming language, it's often easier to break the problem down into smaller pieces. If you find yourself getting stuck trying to create a single regexp that solves your problem, take a step back and think about whether you could break the problem down into smaller pieces, solving each challenge before moving on to the next one.
### Detect matches
To determine if a character vector matches a pattern, use `str_detect()`. It returns a logical vector the same length as the input:

```{r}
x <- c("apple", "banana", "pear")
str_detect(x, "e")
```
Remember that when you use a logical vector in a numeric context, `FALSE` becomes 0 and `TRUE` becomes 1. That makes `sum()` and `mean()` useful if you want to answer questions about matches across a larger vector:
```{r}
# How many common words start with t?
sum(str_detect(common, "^t"))
```

```{r}
# Find all words containing at least one vowel, and negate
no_vowels_1 <- !str_detect(common, "[aeiou]")
no_vowels_2 <- str_detect(common, "^[^aeiou]+$")
all.equal(no_vowels_1, no_vowels_2)
```
The results are identical, but I think the first approach is significantly easier to understand. So if you find your regular expression is getting overly complicated, try breaking it up into smaller pieces, giving each piece a name, and then combining them with logical operations.
A common use of `str_detect()` is to select the elements that match a pattern. You can do this with logical subsetting, or the convenient `str_subset()` wrapper:
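For example (a sketch with a small inline vector, rather than a full word list):

```{r}
library(stringr)
x <- c("apple", "banana", "pear")
str_subset(x, "e$")     # keep only the elements ending in e
x[str_detect(x, "e$")]  # the equivalent logical subsetting
```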
Note the use of `str_view_all()`. As you'll shortly learn, many stringr functions come in pairs: one function that works with a single match, and an `_all` variant that works with all matches.
### Exercises
1. For each of the following challenges, try solving it by using both a single
   regular expression, and a combination of multiple `str_detect()` calls.
1. Find all words that start or end with `x`.
### Extract matches
To extract the actual text of a match, use `str_extract()`. To show that off, we're going to need a more complicated example. I'm going to use the [Harvard sentences](https://en.wikipedia.org/wiki/Harvard_sentences), which were designed to test VOIP systems, but are also useful for practicing regexes.
```{r}
length(sentences)
```
### Grouped matches
Earlier in this chapter we talked about the use of parentheses for clarifying precedence and for backreferences when matching. You can also use parentheses to extract parts of a complex match. For example, imagine we want to extract nouns from the sentences. As a heuristic, we'll look for any word that comes after "a" or "the". Defining a "word" in a regular expression is a little tricky. Here I use a sequence of at least one character that isn't a space.
```{r}
noun <- "(a|the) ([^ ]+)"
```
#### Exercises
1. Replace all `/`'s in a string with `\`'s.
### Splitting
Use `str_split()` to split a string up into pieces. For example, we could split sentences into words:

```{r}
sentences %>%
str_split(" ")
```
Because each component might contain a different number of pieces, this returns a list. If you're working with a length-1 vector, the easiest thing is to just extract the first element of the list:
```{r}
"a|b|c|d" %>%
  str_split("\\|") %>%
  .[[1]]
```

```{r}
sentences %>%
str_split(" ", simplify = TRUE)
```
You can also request a maximum number of pieces:
```{r}
fields <- c("Name: Hadley", "County: NZ", "Age: 35")
fields %>% str_split(": ", n = 2, simplify = TRUE)
```
1. Split up a string like `"apples, pears, and bananas"` into individual
components.
1. Why is it better to split up by `boundary("word")` than `" "`?
1. What does splitting with an empty string (`""`) do?
You can use the other arguments of `regex()` to control details of the match:
* `comments = TRUE` allows you to use comments and white space to make
  complex regular expressions more understandable. Spaces are ignored, as is
everything after `#`. To match a literal space, you'll need to escape it:
`"\\ "`.
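A minimal sketch of `comments = TRUE` (the phone-number pattern here is a made-up illustration, not from the chapter):

```{r}
library(stringr)
phone <- regex("
  \\d{3}  # area code
  [- ]?   # optional separator
  \\d{4}  # number
", comments = TRUE)
str_detect("555-0199", phone)
```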
There are three other functions you can use instead of `regex()`:
* `fixed()`: matches exactly the specified sequence of bytes. It ignores
all special regular expressions and operates at a very low level.
This allows you to avoid complex escaping and can be much faster than
regular expressions:
```{r}
# Beware using fixed() with non-English data: these two strings
# both render as "á" but are defined differently
a1 <- "\u00e1"
a2 <- "a\u0301"
c(a1, a2)
```
They render identically, but because they're defined differently,
`fixed()` doesn't find a match. Instead, you can use `coll()`, defined
next, to respect human character comparison rules:
```{r}
str_detect(a1, coll(a2))
```

You can find out what your current locale is with:

```{r}
stringi::stri_locale_info()
```
The downside of `coll()` is speed; because the rules for recognising which
characters are the same are complicated, `coll()` is relatively slow
compared to `regex()` and `fixed()`.
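For example (a sketch assuming stringr is loaded; this uses the same Turkish dotted/dotless i rule as the locale section above):

```{r}
library(stringr)
# In a Turkish locale, dotted İ is the upper-case form of i, so a
# case-insensitive coll() match treats them as the same letter
str_subset(c("i", "İ", "ı", "I"), coll("i", ignore_case = TRUE, locale = "tr"))
```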
* As you saw with `str_split()`, you can use `boundary()` to match boundaries.
  You can also use it with the other functions:
```{r, cache = FALSE}
x <- "This is a sentence."
str_view_all(x, boundary("word"))
```
There are a few other functions in base R that accept regular expressions:
* `apropos()` searches all objects available from the global environment. This
is useful if you can't quite remember the name of the function.
```{r}
apropos("replace")
```
* `dir()` lists all the files in a directory. The `pattern` argument takes
a regular expression and only returns file names that match the pattern.
For example, you can find all the R Markdown files in the current
directory with `dir(pattern = "\\.Rmd$")`.
### The stringi package
stringr is built on top of the __stringi__ package. stringr is useful when you're learning because it exposes a minimal set of functions, which have been carefully picked to handle the most common string manipulation tasks. stringi, on the other hand, is designed to be comprehensive: it contains almost every function you might ever need. stringi has `r length(ls("package:stringi"))` functions to stringr's `r length(ls("package:stringr"))`.
So if you find yourself struggling to do something that doesn't seem natural in stringr, it's worth taking a look at stringi. The use of the two packages is very similar because stringr was designed to mimic stringi's interface. The main difference is the prefix: `str_` vs `stri_`.
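For example (assuming both packages are installed), many operations correspond one-to-one across the two prefixes:

```{r}
library(stringr)
library(stringi)
str_length("déjà vu")   # stringr
stri_length("déjà vu")  # the stringi equivalent
```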
### Encoding
Complicated and fraught with difficulty. The best approach is to convert to UTF-8 as soon as possible.
Generally, you should fix encoding problems during the data import phase.
Encoding detection operates statistically, by comparing the frequency of byte fragments across languages and encodings. It's fundamentally heuristic, and works better with larger amounts of text (i.e. a whole file, not a single string from that file).
```{r}
x <- "\xc9migr\xe9 cause c\xe9l\xe8bre d\xe9j\xe0 vu."