From 99338eece05013be5ee508007c27eddffd3cd38c Mon Sep 17 00:00:00 2001
From: hadley <h.wickham@gmail.com>
Date: Thu, 5 Nov 2015 08:10:27 -0600
Subject: [PATCH] More on strings

---
 strings.Rmd | 221 ++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 161 insertions(+), 60 deletions(-)

diff --git a/strings.Rmd b/strings.Rmd
index a2ddb92..7462de6 100644
--- a/strings.Rmd
+++ b/strings.Rmd
@@ -408,21 +408,23 @@ Because regular expressions are so powerful, it's easy to try and solve every pr
 
 ### Detect matches
 
-To determine if a character vector matches a pattern, use `str_detect()`. It returns a logical vector:
+To determine if a character vector matches a pattern, use `str_detect()`. It returns a logical vector the same length as the input:
 
 ```{r}
 x <- c("apple", "banana", "pear")
 str_detect(x, "e")
 ```
 
-Remember that logical vectors are effectively combined with `sum()` and `mean()`. This makes it easy to answer questions about a complete vector:
+Remember that when you use a logical vector in a numeric context, `FALSE` becomes 0 and `TRUE` becomes 1. That makes `sum()` and `mean()` useful if you want to answer questions about matches across a larger vector:
 
 ```{r}
 # How many common words start with t?
 sum(str_detect(common, "^t"))
+# What proportion of common words end with a vowel?
+mean(str_detect(common, "[aeiou]$"))
 ```
 
-When you have complicated logical conditions (e.g. match this or that but not these) combining multiple `str_detect()` calls with logical operators is often easy. A simple example is if you want to find all words that don't contain any vowels:
+When you have complex logical conditions (e.g. match a or b but not c unless d) it's often easier to combine multiple `str_detect()` calls with logical operators, rather than trying to create a single regular expression. For example, here are two ways to find all words that don't contain any vowels:
 
 ```{r}
 # Find all words containing at least one vowel, and negate
@@ -432,33 +434,52 @@ no_vowels_2 <- str_detect(common, "^[^aeiou]+$")
 all.equal(no_vowels_1, no_vowels_2)
 ```
 
-If you find your regular expression is getting hard to understand, trying breaking it up into smaller pieces, giving each piece a name, and then combining with logical operations.
+The results are identical, but I think the first approach is significantly easier to understand. So if you find your regular expression is getting overly complicated, try breaking it up into smaller pieces, giving each piece a name, and then combining with logical operations.
 
-`str_count()` is similar to `str_detect()` but it returns an integer count of the number of matches, instead of a true/false:
+A common use of `str_detect()` is to select the elements that match a pattern. You can do this with logical subsetting, or the convenient `str_subset()` wrapper:
 
 ```{r}
+common[str_detect(common, "x$")]
+str_subset(common, "x$")
+```
+
+A variation on `str_detect()` is `str_count()`: rather than a simple yes or no, it tells you how many matches there are in a string:
+
+```{r}
+x <- c("apple", "banana", "pear")
 str_count(x, "a")
 
-# What's the average number of vowels per word?
+# On average, how many vowels per word?
 mean(str_count(common, "[aeiou]"))
 ```
 
-`str_subset()` is a wrapper for the common pattern `x[str_detect(x, pattern)]`.
+Note that matches never overlap. For example, in `"abababa"`, how many times will the pattern `"aba"` match? Regular expressions say two, not three:
+
+```{r}
+str_count("abababa", "aba")
+str_view_all("abababa", "aba")
+```
+
+Note the use of `str_view_all()`. As you'll shortly learn, many stringr functions come in pairs: one function works with a single match, and the other works with all matches.
 
 ### Exercises
 
-1.  For each of the following challenges, try solving with both a single
+1.  For each of the following challenges, try solving it with both a single
     regular expression, and a combination of multiple `str_detect()` calls.
     
     1.  Find all words that start or end with `x`.
+    
     1.  Find all words that start with a vowel and end with a consonant.
+    
+    1.  Are there any words that contain at least one of each different
+        vowel?
 
 1.  What word has the highest number of vowels? What word has the highest
-    proportion of vowels?
+    proportion of vowels? (Hint: what is the denominator?)
 
 ### Extract matches
 
-To extract the actual text of a match, use `str_extract()`. For that to be useful, we need a somewhat more complicated example. I'm going to use the [Harvard sentences](https://en.wikipedia.org/wiki/Harvard_sentences): these are sentences designed to tested VOIP systems, but we're going to use them as random data.
+To extract the actual text of a match, use `str_extract()`. To show that off, we're going to need a more complicated example. I'm going to use the [Harvard sentences](https://en.wikipedia.org/wiki/Harvard_sentences), which were designed to test VOIP systems, but are also useful for practicing regexes.
 
 ```{r}
 length(sentences)
@@ -481,26 +502,23 @@ matches <- str_extract(has_colour, colour_match)
 head(matches)
 ```
 
-A few sentences contain more than one colour and `str_extract()` only extracts the first:
+Note that `str_extract()` only extracts the first match. We can see that most easily by first selecting all the sentences that have more than 1 match:
 
 ```{r}
-table(str_count(sentences, colour_match))
 more <- sentences[str_count(sentences, colour_match) > 1]
-more
+str_view_all(more, colour_match)
 
 str_extract(more, colour_match)
 ```
 
-To get all matches, use `str_extract_all()`:
+This is a common pattern for stringr functions, because working with a single match allows you to use much simpler data structures. To get all matches, use `str_extract_all()`. It returns either a list or a matrix, based on the value of the `simplify` argument:
 
 ```{r}
-str_view_all(more, colour_match)
 str_extract_all(more, colour_match)
+str_extract_all(more, colour_match, simplify = TRUE)
 ```
 
-This returns a list, which is a little hard to work with, which is why it's not the default. You'll learn more about working with lists in Chapter XYZ. Note that matches are always non-overlapping: the second match starts after the first is complete.
-
-Another options is to convert it to a character matrix with `simplify = TRUE`. Short matches are expanded with `""` to the length of the longest:
+You'll learn more about working with lists in Chapter XYZ. If you use `simplify = TRUE`, note that short matches are expanded to the same length as the longest:
 
 ```{r}
 x <- c("a", "a b", "a b c")
@@ -509,18 +527,19 @@ str_extract_all(x, "[a-z]", simplify = TRUE)
 
 #### Exercises
 
+1.  In the previous example, you might have noticed that the regular
+    expression matched "flickered", which is not a colour. Modify the 
+    regex to fix the problem.
+
 1.  From the Harvard sentences data, extract:
 
     1. The first word from each sentence.
     1. All words ending in `ing`.
-
-1.  In the previous example, you might have noticed that our regular expression
-    matched "fickered", which is not a colour. Modify the regex to prevent
-    this problematic match.
+    1. All plurals.
 
 ### Grouped matches
 
-We talked early about the use of parentheses. You can use them if you want to extract parts of a match. For example, imagine we want to extract nouns from the sentences. As a heuristic, we'll look for any word that comes after "a" or "the":
+Earlier in this chapter we talked about the use of parentheses for clarifying precedence and for backreferences when matching. You can also use parentheses to extract parts of a complex match. For example, imagine we want to extract nouns from the sentences. As a heuristic, we'll look for any word that comes after "a" or "the". Defining a "word" in a regular expression is a little tricky, so here I use a simple approximation: a sequence of at least one character that isn't a space.
 
 ```{r}
 noun <- "(a|the) ([^ ]+)"
@@ -531,47 +550,64 @@ has_noun <- sentences %>%
 str_extract(has_noun, noun)
 ```
 
-(Defining a "word" in a regular expression is a little tricky. I've decided to go for a sequence of any characters except for a space.)
-
-`str_extract()` gives us the complete match, but we'd like to be able to dig into the pieces. That's the job of `str_match()`. Instead of a character vector, it returns a matrix, with one column for each group, and one column for the complete match:
+`str_extract()` gives us the complete match; `str_match()` gives each individual component. Instead of a character vector, it returns a matrix, with one column for the complete match followed by one column for each group:
 
 ```{r}
 str_match(has_noun, noun)
 ```
 
-(You can see our heuristic for finding nouns isn't that good as it also picks up adjectives like smooth and parked.)
+(Unsurprisingly, our heuristic for detecting nouns is poor, and also picks up adjectives like smooth and parked.)
 
-Like `str_extract()`, if you want all matches, you'll need to use `str_match_all()` and then work with the list that it returns.
+```{r}
+num <- str_c("one", "two", "three", "four", "five", "six",
+  "seven", "eight", "nine", "ten", sep = "|")
+
+match <- str_interp("(${num}) ([^ ]+s)\\b")
+sentences %>% 
+  str_subset(match) %>% 
+  head(10) %>% 
+  str_match(match)
+```
+
+Like `str_extract()`, if you want all matches for each string, you'll need `str_match_all()`.
 
 #### Exercises
 
 
 ### Replacing matches
 
-`str_replace()` allows you to transform 
+`str_replace()` and `str_replace_all()` allow you to replace matches with new strings:
+
+```{r}
+x <- c("apple", "pear", "banana")
+str_replace(x, "[aeiou]", "-")
+str_replace_all(x, "[aeiou]", "-")
+```
+
+With `str_replace_all()` you can also perform multiple replacements by supplying a named vector:
+
+```{r}
+x <- c("1 house", "2 cars", "3 people")
+str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
+```
+
+You can refer to groups with backreferences:
 
 ```{r}
 sentences %>% 
   head(5) %>% 
-  str_replace("([^ ]+) ([^ ]+)", "\\2 \\1")
+  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2")
 ```
 
-Like `str_extract()` and `str_match()`, `str_replace()` only affects the first match. To replace every match, use `str_replace_all()`. Compared to the other two `all()` functions, the output from `str_replace_all()` is simpler because it can stay as a character vector.
-
-Multiple replacements
-
-Backreferences.
-
-Replacing with a function call (hopefully)
+<!-- Replacing with a function call (hopefully) -->
 
 #### Exercises
 
 1.   Replace all `/` in a string with `\`.
 
-
 ### Splitting
 
-Another useful application is to split strings up into pieces. For example we could split sentences up into words
+Use `str_split()` to split a string up into pieces. For example, we could split sentences into words:
 
 ```{r}
 sentences %>%
@@ -579,7 +615,7 @@ sentences %>%
   str_split(" ")
 ```
 
-Note that this function has to return a list: the number of pieces each element is split up into might be difference, so there's no way to put them in a vector. If you're working with a length-1 vector, the easiest thing is to just extra the first element of the list:
+Because each component might contain a different number of pieces, this returns a list. If you're working with a length-1 vector, the easiest thing is to just extract the first element of the list:
 
 ```{r}
 "a|b|c|d" %>% 
@@ -587,27 +623,40 @@ Note that this function has to return a list: the number of pieces each element
   .[[1]]
 ```
 
-You'll learn other techniques in the lists chapter.
-
-If you want all strings to be split up into the same number of pieces, you can use `str_split_fixed()`. This outputs a matrix with one row for each string and one column for each piece:
+Otherwise, like the other stringr functions that return a list, you can use `simplify = TRUE` to return a matrix:
 
 ```{r}
-c("Name: Hadley", "County: NZ", "Age: 35") %>% 
-  str_split_fixed(": ", 2)
+sentences %>%
+  head(5) %>% 
+  str_split(" ", simplify = TRUE)
 ```
 
-<!-- Add comment to stringi issue that split should also preserve names -->
+You can also request a maximum number of pieces:
 
-Instead of splitting up strings by patterns, you can also split up by a predefined set of boundaries with `boundary()`: by character, by line, by sentence and by word.
+```{r}
+fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
+fields %>% str_split(": ", n = 2, simplify = TRUE)
+```
+
+Instead of splitting up strings by patterns, you can also split up by character, line, sentence and word `boundary()`s:
 
 ```{r}
 x <- "This is a sentence.  This is another sentence."
 str_view_all(x, boundary("word"))
 
-str_split(x, " ")
-str_split(x, boundary("word"))
+str_split(x, " ")[[1]]
+str_split(x, boundary("word"))[[1]]
 ```
 
+#### Exercises
+
+1.  Split up a string like `"apples, pears, and bananas"` into individual
+    components.
+    
+1.  Why is it better to split up by `boundary("word")` than `" "`?
+
+1.  What does splitting with an empty string (`""`) do?
+
 ### Find matches
 
 `str_locate()`, `str_locate_all()` gives you the starting and ending positions of each match. These are particularly useful when none of the other functions does exactly what you want. You can use `str_locate()` to find the matching pattern, `str_sub()` to extract and/or modify them.
@@ -652,9 +701,10 @@ You can use the other arguments of `regex()` to control details of the match:
 
 There are three other functions you can use instead of `regex()`:
 
-*   `fixed()`: matches exactly that sequence of characters (i.e. ignored
-    all special regular expression pattern). This allows you to avoid complex
-    escaping and is faster than matching regular expressions:
+*   `fixed()`: matches exactly the specified sequence of bytes. It ignores
+    all special regular expression syntax and operates at a very low level.
+    This allows you to avoid complex escaping and can be much faster than
+    regular expressions:
   
     ```{r}
     microbenchmark::microbenchmark(
@@ -663,9 +713,29 @@ There are three other functions you can use instead of `regex()`:
     )
     ```
     
-    The fixed match is almost 3x times faster than the regular expression match.
-    But note the units: here it's only 200 µs faster. 
-  
+    Here the fixed match is almost 3x faster than the regular 
+    expression match. However, if you're working with non-English data 
+    `fixed()` can lead to unreliable matches because there are often
+    multiple ways of representing the same character. For example, there
+    are two ways to define "á": either as a single character or as an "a" 
+    plus an accent:
+    
+    ```{r}
+    a1 <- "\u00e1"
+    a2 <- "a\u0301"
+    c(a1, a2)
+    a1 == a2
+    ```
+
+    They render identically, but because they're defined differently, 
+    `fixed()` doesn't find a match. Instead, you can use `coll()`, defined
+    next, to respect human character comparison rules:
+
+    ```{r}
+    str_detect(a1, fixed(a2))
+    str_detect(a1, coll(a2))
+    ```
+    
 *   `coll()`: compare strings using standard **coll**ation rules. This is 
     useful for doing case insensitive matching. Note that `coll()` takes a
     `locale` parameter that controls which rules are used for comparing
@@ -689,6 +759,10 @@ There are three other functions you can use instead of `regex()`:
     ```{r}
     stringi::stri_locale_info()
     ```
+    
+    The downside of `coll()` is speed: because the rules for recognising
+    which characters are the same are complicated, `coll()` is relatively
+    slow compared to `regex()` and `fixed()`.
 
 *   As you saw with `str_split()` you can use `boundary()` to match boundaries.
     You can also use it with the other functions, all though 
@@ -699,23 +773,41 @@ There are three other functions you can use instead of `regex()`:
     str_extract_all(x, boundary("word"))
     ```
 
+### Exercises
+
+1.  How would you find all strings containing `\` with `regex()` vs.
+    with `fixed()`?
+
+1.  What are the five most common words in `sentences`?
+
 ## Other uses of regular expressions
 
 There are a few other functions in base R that accept regular expressions:
 
 *   `apropos()` searchs all objects avaiable from the global environment. This
     is useful if you can't quite remember the name of the function.
+    
+    ```{r}
+    apropos("replace")
+    ```
+    
+*   `dir()` lists all the files in a directory. The `pattern` argument takes
+    a regular expression and only returns file names that match the pattern.
+    For example, you can find all the rmarkdown files in the current
+    directory with:
+    
+    ```{r}
+    head(dir(pattern = "\\.Rmd$"))
+    ```
+    
+    (If you're more comfortable with "globs" like `*.Rmd`, you can convert
+    them to regular expressions with `glob2rx()`.)
    
 *   `ls()` is similar to `apropos()` but only works in the current 
     environment. However, if you have so many objects in your environment
     that you have to use a regular expression to filter them all, you 
     need to think about what you're doing! (And probably use a list instead).
 
-*   `dir()` lists all the files in a directory. The `pattern` argument takes
-    a regular expression and only return file names that match the pattern.
-    For example, you can find all csv files with `dir(pattern = "\\.csv$")`.
-    (If you're more comfortable with "globs" like `*.csv`, you can convert
-    them to regular expressions with `glob2rx()`)
 
 ## Advanced topics
 
@@ -746,3 +838,12 @@ str_conv(x, "ISO-8859-1")
 as.data.frame(stringi::stri_enc_detect(x))
 str_conv(x, "ISO-8859-2")
 ```
+
+### UTF-8
+
+<http://wiki.secondlife.com/wiki/Unicode_In_5_Minutes>
+
+<http://www.joelonsoftware.com/articles/Unicode.html>
+
+Homoglyph attack: <https://github.com/reinderien/mimic>.
+