Break up strings chapter

2021-04-21 12:30:25 -05:00 · 2021-04-21 12:30:25 -05:00 · 58f7f16db1
parent 18253a1d52
commit 58f7f16db1
4 changed files with 257 additions and 238 deletions
--- a/_bookdown.yml
+++ b/_bookdown.yml
@@ -46,6 +46,7 @@ rmd_files: [
  "functions.Rmd",
  "vectors.Rmd",
  "iteration.Rmd",
+  "prog-strings.Rmd",

  "communicate.Rmd",
  "rmarkdown.Rmd",
--- a/prog-strings.Rmd
+++ b/prog-strings.Rmd
@@ -0,0 +1,190 @@
+# Programming with strings
+
+```{r}
+library(stringr)
+library(tidyr)
+library(tibble)
+```
+
+## Extract
+
+```{r}
+colours <- c("red", "orange", "yellow", "green", "blue", "purple")
+colour_match <- str_c(colours, collapse = "|")
+colour_match
+
+more <- sentences[str_count(sentences, colour_match) > 1]
+str_extract_all(more, colour_match)
+```
+
+If you use `simplify = TRUE`, `str_extract_all()` will return a matrix with short matches expanded to the same length as the longest:
+
+```{r}
+
+str_extract_all(more, colour_match, simplify = TRUE)
+
+x <- c("a", "a b", "a b c")
+str_extract_all(x, "[a-z]", simplify = TRUE)
+```
+
+We don't talk about matrices here, but they are useful elsewhere.
+
+### Exercises
+
+1.  From the Harvard sentences data, extract:
+
+    1.  The first word from each sentence.
+    2.  All words ending in `ing`.
+    3.  All plurals.
+
+## Grouped matches
+
+In the regular expressions chapter we talked about the use of parentheses for clarifying precedence and for backreferences when matching.
+You can also use parentheses to extract parts of a complex match.
+For example, imagine we want to extract nouns from the sentences.
+As a heuristic, we'll look for any word that comes after "a" or "the".
+Defining a "word" in a regular expression is a little tricky, so here I use a simple approximation: a sequence of at least one character that isn't a space.
+
+```{r}
+noun <- "(a|the) ([^ ]+)"
+
+has_noun <- sentences %>%
+  str_subset(noun) %>%
+  head(10)
+has_noun %>% 
+  str_extract(noun)
+```
+
+`str_extract()` gives us the complete match; `str_match()` gives each individual component.
+Instead of a character vector, it returns a matrix, with one column for the complete match followed by one column for each group:
+
+```{r}
+has_noun %>% 
+  str_match(noun)
+```
+
+(Unsurprisingly, our heuristic for detecting nouns is poor, and also picks up adjectives like smooth and parked.)
+
+## Splitting
+
+Use `str_split()` to split a string up into pieces.
+For example, we could split sentences into words:
+
+```{r}
+sentences %>%
+  head(5) %>% 
+  str_split(" ")
+```
+
+Because each component might contain a different number of pieces, this returns a list.
+If you're working with a length-1 vector, the easiest thing is to just extract the first element of the list:
+
+```{r}
+"a|b|c|d" %>% 
+  str_split("\\|") %>% 
+  .[[1]]
+```
+
+Otherwise, like the other stringr functions that return a list, you can use `simplify = TRUE` to return a matrix:
+
+```{r}
+sentences %>%
+  head(5) %>% 
+  str_split(" ", simplify = TRUE)
+```
+
+You can also request a maximum number of pieces:
+
+```{r}
+fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
+fields %>% str_split(": ", n = 2, simplify = TRUE)
+```
+
+Instead of splitting up strings by patterns, you can also split up by character, line, sentence and word `boundary()`s:
+
+```{r}
+x <- "This is a sentence.  This is another sentence."
+str_view_all(x, boundary("word"))
+
+str_split(x, " ")[[1]]
+str_split(x, boundary("word"))[[1]]
+```
+
+## Replace with function
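+
+As a minimal sketch of this idea, assuming a version of stringr in which `str_replace_all()` accepts a function as `replacement`:
+the function receives the matched text, and its return value is used as the replacement.
+
+```{r}
+library(stringr)
+
+x <- c("apple", "pear", "banana")
+
+# Upper-case every vowel by passing a function instead of a fixed string.
+str_replace_all(x, "[aeiou]", toupper)
+```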
+
+## Locations
+
+`str_locate()` and `str_locate_all()` give you the starting and ending positions of each match.
+These are particularly useful when none of the other functions does exactly what you want.
+You can use `str_locate()` to find where the pattern matches, and then `str_sub()` to extract and/or modify the matched text.
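+
+A small sketch of that combination, using a made-up string and pattern purely for illustration:
+
+```{r}
+library(stringr)
+
+x <- "The quick brown fox"
+
+# str_locate() returns a matrix with the start and end of the first match.
+loc <- str_locate(x, "quick")
+loc
+
+# Use those positions with str_sub() to extract the match...
+str_sub(x, loc[, "start"], loc[, "end"])
+
+# ...or to modify it in place.
+str_sub(x, loc[, "start"], loc[, "end"]) <- "slow"
+x
+```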
+
+## stringi
+
+stringr is built on top of the **stringi** package.
+stringr is useful when you're learning because it exposes a minimal set of functions, which have been carefully picked to handle the most common string manipulation tasks.
+stringi, on the other hand, is designed to be comprehensive.
+It contains almost every function you might ever need: stringi has `r length(getNamespaceExports("stringi"))` functions to stringr's `r length(getNamespaceExports("stringr"))`.
+
+If you find yourself struggling to do something in stringr, it's worth taking a look at stringi.
+The packages work very similarly, so you should be able to translate your stringr knowledge in a natural way.
+The main difference is the prefix: `str_` vs. `stri_`.
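+
+For instance, here is a rough side-by-side comparison, assuming stringi is installed (it is a dependency of stringr):
+
+```{r}
+x <- c("apple", "banana", "pear")
+
+# The same operation with the two prefixes.
+stringr::str_length(x)
+stringi::stri_length(x)
+
+# stri_detect_regex() plays the role of str_detect() with a regexp pattern.
+stringr::str_detect(x, "an")
+stringi::stri_detect_regex(x, "an")
+```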
+
+### Exercises
+
+1.  Find the stringi functions that:
+
+    a.  Count the number of words.
+    b.  Find duplicated strings.
+    c.  Generate random text.
+
+2.  How do you control the language that `stri_sort()` uses for sorting?
+
+### Exercises
+
+1.  What do the `extra` and `fill` arguments do in `separate()`?
+    Experiment with the various options for the following two toy datasets.
+
+    ```{r, eval = FALSE}
+    tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
+      separate(x, c("one", "two", "three"))
+
+    tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
+      separate(x, c("one", "two", "three"))
+    ```
+
+2.  Both `unite()` and `separate()` have a `remove` argument.
+    What does it do?
+    Why would you set it to `FALSE`?
+
+3.  Compare and contrast `separate()` and `extract()`.
+    Why are there three variations of separation (by position, by separator, and with groups), but only one unite?
+
+4.  In the following example we're using `unite()` to create a `date` column from `month` and `day` columns.
+    How would you achieve the same outcome using `mutate()` and `paste()` instead of unite?
+
+    ```{r, eval = FALSE}
+    events <- tribble(
+      ~month, ~day,
+      1     , 20,
+      1     , 21,
+      1     , 22
+    )
+
+    events %>%
+      unite("date", month:day, sep = "-", remove = FALSE)
+    ```
+
+5.  You can also pass a vector of integers to `sep`. `separate()` will interpret the integers as positions to split at.
+    Positive values start at 1 on the far-left of the strings; negative values start at -1 on the far-right of the strings.
+    Use `separate()` to represent location information in the following tibble in two columns: `state` (represented by the first two characters) and `county`.
+    Do this in two ways: using a positive and a negative value for `sep`.
+
+    ```{r}
+    baker <- tribble(
+      ~location,
+      "FLBaker County",
+      "GABaker County",
+      "ORBaker County",
+    )
+    baker
+    ```
--- a/regexps.Rmd
+++ b/regexps.Rmd
@@ -1,5 +1,11 @@
 # Regular expressions

+## Introduction
+
+The focus of this chapter will be on regular expressions, or regexps for short.
+Regular expressions are useful because strings usually contain unstructured or semi-structured data, and regexps are a concise language for describing patterns in strings.
+When you first look at a regexp, you'll think a cat walked across your keyboard, but as your understanding improves they will soon start to make sense.
+
 ## Matching patterns with regular expressions

 Regexps are a very terse language that allow you to describe patterns in strings.
@@ -229,7 +235,7 @@ Collectively, these operators are called **quantifiers** because they quantify h
    b.  Have three or more vowels in a row.
    c.  Have two or more vowel-consonant pairs in a row.

-4.  Solve the beginner regexp crosswords at [<https://regexcrossword.com/challenges/beginner>](https://regexcrossword.com/challenges/beginner){.uri}.
+4.  Solve the beginner regexp crosswords at [\<https://regexcrossword.com/challenges/beginner\>](https://regexcrossword.com/challenges/beginner){.uri}.

 ## Grouping and backreferences

@@ -245,6 +251,14 @@ str_view(fruit, "(..)\\1", match = TRUE)

 (Shortly, you'll also see how they're useful in conjunction with `str_match()`.)

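+You can also use backreferences in a replacement string to insert components of the match.
+The following code flips the order of the second and third words: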
+
+```{r}
+sentences %>% 
+  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>% 
+  head(5)
+```
+
 ### Exercises

 1.  Describe, in words, what these expressions will match:
@@ -380,3 +394,4 @@ See the Stack Overflow discussion at <http://stackoverflow.com/a/201378> for mor
 Don't forget that you're in a programming language and you have other tools at your disposal.
 Instead of creating one complex regular expression, it's often easier to write a series of simpler regexps.
 If you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.
+
--- a/strings.Rmd
+++ b/strings.Rmd
@@ -3,9 +3,8 @@
 ## Introduction

 This chapter introduces you to string manipulation in R.
-You'll learn the basics of how strings work and how to create them by hand, but the focus of this chapter will be on regular expressions, or regexps for short.
-Regular expressions are useful because strings usually contain unstructured or semi-structured data, and regexps are a concise language for describing patterns in strings.
-When you first look at a regexp, you'll think a cat walked across your keyboard, but as your understanding improves they will soon start to make sense.
+You'll learn the basics of how strings work and how to create them by hand.
+Strings are a big topic, so they're spread over three chapters.

 ### Prerequisites

@@ -15,7 +14,7 @@ This chapter will focus on the **stringr** package for string manipulation, whic
 library(tidyverse)
 ```

-## String basics
+## Creating a string

 You can create strings with either single quotes or double quotes.
 Unlike other languages, there is no difference in behaviour.
@@ -44,6 +43,8 @@ single_quote <- '\'' # or "'"

 That means if you want to include a literal backslash, you'll need to double it up: `"\\"`.

+TODO: raw string.
+
 Beware that the printed representation of a string is not the same as string itself, because the printed representation shows the escapes.
 To see the raw contents of the string, use `writeLines()`:

@@ -68,7 +69,7 @@ Multiple strings are often stored in a character vector, which you can create wi
 c("one", "two", "three")
 ```

-### String length
+## String length

 Base R contains many functions to work with strings but we'll avoid them because they can be inconsistent, which makes them hard to remember.
 Instead we'll use functions from stringr.
@@ -79,13 +80,15 @@ For example, `str_length()` tells you the number of characters in a string:
 str_length(c("a", "R for data science", NA))
 ```

+What is a letter?
+
 The common `str_` prefix is particularly useful if you use RStudio, because typing `str_` will trigger autocomplete, allowing you to see all stringr functions:

 ```{r, echo = FALSE}
 knitr::include_graphics("screenshots/stringr-autocomplete.png")
 ```

-### Combining strings
+## Combining strings

 To combine two or more strings, use `str_c()`:

@@ -115,7 +118,7 @@ As shown above, `str_c()` is vectorised, and it automatically recycles shorter v
 str_c("prefix-", c("a", "b", "c"), "-suffix")
 ```

-Objects of length 0 are silently dropped.
+`NULL`s are silently dropped.
 This is particularly useful in conjunction with `if`:

 ```{r}
@@ -136,7 +139,7 @@ To collapse a vector of strings into a single string, use `collapse`:
 str_c(c("x", "y", "z"), collapse = ", ")
 ```

-### Subsetting strings
+## Subsetting strings

 You can extract parts of a string using `str_sub()`.
 As well as the string, `str_sub()` takes `start` and `end` arguments which give the (inclusive) position of the substring:
@@ -161,7 +164,9 @@ str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
 x
 ```

-### Locales
+TODO: `separate()`
+
+## Locales

 Above I used `str_to_lower()` to change the text to lower case.
 You can also use `str_to_upper()` or `str_to_title()`.
@@ -214,18 +219,7 @@ TODO: add connection to `arrange()`
 6.  Write a function that turns (e.g.) a vector `c("a", "b", "c")` into the string `a, b, and c`.
    Think carefully about what it should do if given a vector of length 0, 1, or 2.

-## Tools
-
-Now that you've learned the basics of regular expressions, it's time to learn how to apply them to real problems.
-In this section you'll learn a wide array of stringr functions that let you:
-
-   Determine which strings match a pattern.
-   Find the positions of matches.
-   Extract the content of matches.
-   Replace matches with new values.
-   Split a string based on a match.
-
-### Detect matches
+## Detect matches

 To determine if a character vector matches a pattern, use `str_detect()`.
 It returns a logical vector the same length as the input:
@@ -235,6 +229,8 @@ x <- c("apple", "banana", "pear")
 str_detect(x, "e")
 ```

+TODO: add basic intro to regexps.
+
 Remember that when you use a logical vector in a numeric context, `FALSE` becomes 0 and `TRUE` becomes 1.
 That makes `sum()` and `mean()` useful if you want to answer questions about matches across a larger vector:

@@ -307,11 +303,7 @@ str_count("abababa", "aba")
 str_view_all("abababa", "aba")
 ```

-Note the use of `str_view_all()`.
-As you'll shortly learn, many stringr functions come in pairs: one function works with a single match, and the other works with all matches.
-The second function will have the suffix `_all`.
-
-#### Exercises
+### Exercises

 1.  For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple `str_detect()` calls.

@@ -323,7 +315,33 @@ The second function will have the suffix `_all`.
    What word has the highest proportion of vowels?
    (Hint: what is the denominator?)

-### Extract matches
+## Replacing matches
+
+`str_replace_all()` allows you to replace matches with new strings.
+The simplest use is to replace a pattern with a fixed string:
+
+```{r}
+x <- c("apple", "pear", "banana")
+str_replace_all(x, "[aeiou]", "-")
+```
+
+With `str_replace_all()` you can perform multiple replacements by supplying a named vector:
+
+```{r}
+x <- c("1 house", "2 cars", "3 people")
+str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
+```
+
+### Exercises
+
+1.  Replace all forward slashes in a string with backslashes.
+
+2.  Implement a simple version of `str_to_lower()` using `str_replace_all()`.
+
+3.  Switch the first and last letters in `words`.
+    Which of those strings are still words?
+
+## Extract full matches

 To extract the actual text of a match, use `str_extract()`.
 To show that off, we're going to need a more complicated example.
@@ -364,61 +382,14 @@ str_extract(more, colour_match)

 This is a common pattern for stringr functions, because working with a single match allows you to use much simpler data structures.
 To get all matches, use `str_extract_all()`.
-It returns a list:
+It returns a list, so we'll come back to this later on.

-```{r}
-str_extract_all(more, colour_match)
-```
-
-You'll learn more about lists in Section \@ref(lists) on lists and Chapter \@ref(iteration) on iteration.
-
-If you use `simplify = TRUE`, `str_extract_all()` will return a matrix with short matches expanded to the same length as the longest:
-
-```{r}
-str_extract_all(more, colour_match, simplify = TRUE)
-
-x <- c("a", "a b", "a b c")
-str_extract_all(x, "[a-z]", simplify = TRUE)
-```
-
-#### Exercises
+### Exercises

 1.  In the previous example, you might have noticed that the regular expression matched "flickered", which is not a colour.
    Modify the regex to fix the problem.

-2.  From the Harvard sentences data, extract:
-
-    1.  The first word from each sentence.
-    2.  All words ending in `ing`.
-    3.  All plurals.
-
-### Grouped matches
-
-Earlier in this chapter we talked about the use of parentheses for clarifying precedence and for backreferences when matching.
-You can also use parentheses to extract parts of a complex match.
-For example, imagine we want to extract nouns from the sentences.
-As a heuristic, we'll look for any word that comes after "a" or "the".
-Defining a "word" in a regular expression is a little tricky, so here I use a simple approximation: a sequence of at least one character that isn't a space.
-
-```{r}
-noun <- "(a|the) ([^ ]+)"
-
-has_noun <- sentences %>%
-  str_subset(noun) %>%
-  head(10)
-has_noun %>% 
-  str_extract(noun)
-```
-
-`str_extract()` gives us the complete match; `str_match()` gives each individual component.
-Instead of a character vector, it returns a matrix, with one column for the complete match followed by one column for each group:
-
-```{r}
-has_noun %>% 
-  str_match(noun)
-```
-
-(Unsurprisingly, our heuristic for detecting nouns is poor, and also picks up adjectives like smooth and parked.)
+## Extract part of matches

 If your data is in a tibble, it's often easier to use `tidyr::extract()`.
 It works like `str_match()` but requires you to name the matches, which are then placed in new columns:
@@ -441,88 +412,7 @@ Like `str_extract()`, if you want all matches for each string, you'll need `str_
 2.  Find all contractions.
    Separate out the pieces before and after the apostrophe.

-### Replacing matches
-
-`str_replace()` and `str_replace_all()` allow you to replace matches with new strings.
-The simplest use is to replace a pattern with a fixed string:
-
-```{r}
-x <- c("apple", "pear", "banana")
-str_replace(x, "[aeiou]", "-")
-str_replace_all(x, "[aeiou]", "-")
-```
-
-With `str_replace_all()` you can perform multiple replacements by supplying a named vector:
-
-```{r}
-x <- c("1 house", "2 cars", "3 people")
-str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
-```
-
-Instead of replacing with a fixed string you can use backreferences to insert components of the match.
-In the following code, I flip the order of the second and third words.
-
-```{r}
-sentences %>% 
-  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>% 
-  head(5)
-```
-
-#### Exercises
-
-1.  Replace all forward slashes in a string with backslashes.
-
-2.  Implement a simple version of `str_to_lower()` using `str_replace_all()`.
-
-3.  Switch the first and last letters in `words`.
-    Which of those strings are still words?
-
-### Splitting
-
-Use `str_split()` to split a string up into pieces.
-For example, we could split sentences into words:
-
-```{r}
-sentences %>%
-  head(5) %>% 
-  str_split(" ")
-```
-
-Because each component might contain a different number of pieces, this returns a list.
-If you're working with a length-1 vector, the easiest thing is to just extract the first element of the list:
-
-```{r}
-"a|b|c|d" %>% 
-  str_split("\\|") %>% 
-  .[[1]]
-```
-
-Otherwise, like the other stringr functions that return a list, you can use `simplify = TRUE` to return a matrix:
-
-```{r}
-sentences %>%
-  head(5) %>% 
-  str_split(" ", simplify = TRUE)
-```
-
-You can also request a maximum number of pieces:
-
-```{r}
-fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
-fields %>% str_split(": ", n = 2, simplify = TRUE)
-```
-
-Instead of splitting up strings by patterns, you can also split up by character, line, sentence and word `boundary()`s:
-
-```{r}
-x <- "This is a sentence.  This is another sentence."
-str_view_all(x, boundary("word"))
-
-str_split(x, " ")[[1]]
-str_split(x, boundary("word"))[[1]]
-```
-
-### Separate
+## Separate

 `separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears.
 Take `table3`:
@@ -553,7 +443,7 @@ table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/")
 ```

-#### Exercises
+### Exercises

 1.  Split up a string like `"apples, pears, and bananas"` into individual components.

@@ -562,12 +452,6 @@ table3 %>%
 3.  What does splitting with an empty string (`""`) do?
    Experiment, and then read the documentation.

-### Find matches
-
-`str_locate()` and `str_locate_all()` give you the starting and ending positions of each match.
-These are particularly useful when none of the other functions does exactly what you want.
-You can use `str_locate()` to find the matching pattern, `str_sub()` to extract and/or modify them.
-
 ## Other types of pattern

 When you use a pattern that's a string, it's automatically wrapped into a call to `regex()`:
@@ -689,74 +573,3 @@ There are three other functions you can use instead of `regex()`:
 1.  How would you find all strings containing `\` with `regex()` vs. with `fixed()`?

 2.  What are the five most common words in `sentences`?
-
-## stringi
-
-stringr is built on top of the **stringi** package.
-stringr is useful when you're learning because it exposes a minimal set of functions, which have been carefully picked to handle the most common string manipulation functions.
-stringi, on the other hand, is designed to be comprehensive.
-It contains almost every function you might ever need: stringi has `r length(getNamespaceExports("stringi"))` functions to stringr's `r length(getNamespaceExports("stringr"))`.
-
-If you find yourself struggling to do something in stringr, it's worth taking a look at stringi.
-The packages work very similarly, so you should be able to translate your stringr knowledge in a natural way.
-The main difference is the prefix: `str_` vs. `stri_`.
-
-### Exercises
-
-1.  Find the stringi functions that:
-
-    a.  Count the number of words.
-    b.  Find duplicated strings.
-    c.  Generate random text.
-
-2.  How do you control the language that `stri_sort()` uses for sorting?
-
-### Exercises
-
-1.  What do the `extra` and `fill` arguments do in `separate()`?
-    Experiment with the various options for the following two toy datasets.
-
-    ```{r, eval = FALSE}
-    tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
-      separate(x, c("one", "two", "three"))
-
-    tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
-      separate(x, c("one", "two", "three"))
-    ```
-
-2.  Both `unite()` and `separate()` have a `remove` argument.
-    What does it do?
-    Why would you set it to `FALSE`?
-
-3.  Compare and contrast `separate()` and `extract()`.
-    Why are there three variations of separation (by position, by separator, and with groups), but only one unite?
-
-4.  In the following example we're using `unite()` to create a `date` column from `month` and `day` columns.
-    How would you achieve the same outcome using `mutate()` and `paste()` instead of unite?
-
-    ```{r, eval = FALSE}
-    events <- tribble(
-      ~month, ~day,
-      1     , 20,
-      1     , 21,
-      1     , 22
-    )
-
-    events %>%
-      unite("date", month:day, sep = "-", remove = FALSE)
-    ```
-
-5.  You can also pass a vector of integers to `sep`. `separate()` will interpret the integers as positions to split at.
-    Positive values start at 1 on the far-left of the strings; negative value start at -1 on the far-right of the strings.
-    Use `separate()` to represent location information in the following tibble in two columns: `state` (represented by the first two characters) and `county`.
-    Do this in two ways: using a positive and a negative value for `sep`.
-
-    ```{r}
-    baker <- tribble(
-      ~location,
-      "FLBaker County",
-      "GABaker County",
-      "ORBaker County",
-    )
-    baker
-    ```