From 58f7f16db133e494c8b51443f16c64856392603a Mon Sep 17 00:00:00 2001 From: Hadley Wickham Date: Wed, 21 Apr 2021 12:30:25 -0500 Subject: [PATCH] Break up strings chapter --- _bookdown.yml | 1 + prog-strings.Rmd | 190 +++++++++++++++++++++++++++++++ regexps.Rmd | 17 ++- strings.Rmd | 287 +++++++++-------------------------------------- 4 files changed, 257 insertions(+), 238 deletions(-) create mode 100644 prog-strings.Rmd diff --git a/_bookdown.yml b/_bookdown.yml index ef6904c..b245c12 100644 --- a/_bookdown.yml +++ b/_bookdown.yml @@ -46,6 +46,7 @@ rmd_files: [ "functions.Rmd", "vectors.Rmd", "iteration.Rmd", + "prog-strings.Rmd", "communicate.Rmd", "rmarkdown.Rmd", diff --git a/prog-strings.Rmd b/prog-strings.Rmd new file mode 100644 index 0000000..c8c0773 --- /dev/null +++ b/prog-strings.Rmd @@ -0,0 +1,190 @@ +## Programming with strings + +```{r} +library(stringr) +library(tidyr) +library(tibble) +``` + +### Extract + +```{r} +colours <- c("red", "orange", "yellow", "green", "blue", "purple") +colour_match <- str_c(colours, collapse = "|") +colour_match + +more <- sentences[str_count(sentences, colour_match) > 1] +str_extract_all(more, colour_match) +``` + +If you use `simplify = TRUE`, `str_extract_all()` will return a matrix with short matches expanded to the same length as the longest: + +```{r} + +str_extract_all(more, colour_match, simplify = TRUE) + +x <- c("a", "a b", "a b c") +str_extract_all(x, "[a-z]", simplify = TRUE) +``` + +We don't talk about matrices here, but they are useful elsewhere. + +### Exercises + +1. From the Harvard sentences data, extract: + + 1. The first word from each sentence. + 2. All words ending in `ing`. + 3. All plurals. + +## Grouped matches + +Earlier in this chapter we talked about the use of parentheses for clarifying precedence and for backreferences when matching. +You can also use parentheses to extract parts of a complex match. +For example, imagine we want to extract nouns from the sentences. 
+As a heuristic, we'll look for any word that comes after "a" or "the". +Defining a "word" in a regular expression is a little tricky, so here I use a simple approximation: a sequence of at least one character that isn't a space. + +```{r} +noun <- "(a|the) ([^ ]+)" + +has_noun <- sentences %>% + str_subset(noun) %>% + head(10) +has_noun %>% + str_extract(noun) +``` + +`str_extract()` gives us the complete match; `str_match()` gives each individual component. +Instead of a character vector, it returns a matrix, with one column for the complete match followed by one column for each group: + +```{r} +has_noun %>% + str_match(noun) +``` + +(Unsurprisingly, our heuristic for detecting nouns is poor, and also picks up adjectives like smooth and parked.) + +## Splitting + +Use `str_split()` to split a string up into pieces. +For example, we could split sentences into words: + +```{r} +sentences %>% + head(5) %>% + str_split(" ") +``` + +Because each component might contain a different number of pieces, this returns a list. +If you're working with a length-1 vector, the easiest thing is to just extract the first element of the list: + +```{r} +"a|b|c|d" %>% + str_split("\\|") %>% + .[[1]] +``` + +Otherwise, like the other stringr functions that return a list, you can use `simplify = TRUE` to return a matrix: + +```{r} +sentences %>% + head(5) %>% + str_split(" ", simplify = TRUE) +``` + +You can also request a maximum number of pieces: + +```{r} +fields <- c("Name: Hadley", "Country: NZ", "Age: 35") +fields %>% str_split(": ", n = 2, simplify = TRUE) +``` + +Instead of splitting up strings by patterns, you can also split up by character, line, sentence and word `boundary()`s: + +```{r} +x <- "This is a sentence. This is another sentence." 
+str_view_all(x, boundary("word")) + +str_split(x, " ")[[1]] +str_split(x, boundary("word"))[[1]] +``` + +## Replace with function + +## Locations + +`str_locate()` and `str_locate_all()` give you the starting and ending positions of each match. +These are particularly useful when none of the other functions does exactly what you want. +You can use `str_locate()` to find the matching pattern, `str_sub()` to extract and/or modify them. + +## stringi + +stringr is built on top of the **stringi** package. +stringr is useful when you're learning because it exposes a minimal set of functions, which have been carefully picked to handle the most common string manipulation functions. +stringi, on the other hand, is designed to be comprehensive. +It contains almost every function you might ever need: stringi has `r length(getNamespaceExports("stringi"))` functions to stringr's `r length(getNamespaceExports("stringr"))`. + +If you find yourself struggling to do something in stringr, it's worth taking a look at stringi. +The packages work very similarly, so you should be able to translate your stringr knowledge in a natural way. +The main difference is the prefix: `str_` vs. `stri_`. + +### Exercises + +1. Find the stringi functions that: + + a. Count the number of words. + b. Find duplicated strings. + c. Generate random text. + +2. How do you control the language that `stri_sort()` uses for sorting? + +### Exercises + +1. What do the `extra` and `fill` arguments do in `separate()`? + Experiment with the various options for the following two toy datasets. + + ```{r, eval = FALSE} + tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% + separate(x, c("one", "two", "three")) + + tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% + separate(x, c("one", "two", "three")) + ``` + +2. Both `unite()` and `separate()` have a `remove` argument. + What does it do? + Why would you set it to `FALSE`? + +3. Compare and contrast `separate()` and `extract()`. 
+ Why are there three variations of separation (by position, by separator, and with groups), but only one unite? + +4. In the following example we're using `unite()` to create a `date` column from `month` and `day` columns. + How would you achieve the same outcome using `mutate()` and `paste()` instead of unite? + + ```{r, eval = FALSE} + events <- tribble( + ~month, ~day, + 1 , 20, + 1 , 21, + 1 , 22 + ) + + events %>% + unite("date", month:day, sep = "-", remove = FALSE) + ``` + +5. You can also pass a vector of integers to `sep`. `separate()` will interpret the integers as positions to split at. + Positive values start at 1 on the far-left of the strings; negative values start at -1 on the far-right of the strings. + Use `separate()` to represent location information in the following tibble in two columns: `state` (represented by the first two characters) and `county`. + Do this in two ways: using a positive and a negative value for `sep`. + + ```{r} + baker <- tribble( + ~location, + "FLBaker County", + "GABaker County", + "ORBaker County", + ) + baker + ``` diff --git a/regexps.Rmd b/regexps.Rmd index 0e45762..2a811e6 100644 --- a/regexps.Rmd +++ b/regexps.Rmd @@ -1,5 +1,11 @@ # Regular expressions +## Introduction + +The focus of this chapter will be on regular expressions, or regexps for short. +Regular expressions are useful because strings usually contain unstructured or semi-structured data, and regexps are a concise language for describing patterns in strings. +When you first look at a regexp, you'll think a cat walked across your keyboard, but as your understanding improves they will soon start to make sense. + ## Matching patterns with regular expressions Regexps are a very terse language that allow you to describe patterns in strings. @@ -229,7 +235,7 @@ Collectively, these operators are called **quantifiers** because they quantify h b. Have three or more vowels in a row. c. Have two or more vowel-consonant pairs in a row. -4. 
Solve the beginner regexp crosswords at [](https://regexcrossword.com/challenges/beginner){.uri}. +4. Solve the beginner regexp crosswords at [\](https://regexcrossword.com/challenges/beginner){.uri}. ## Grouping and backreferences @@ -245,6 +251,14 @@ str_view(fruit, "(..)\\1", match = TRUE) (Shortly, you'll also see how they're useful in conjunction with `str_match()`.) +Also use for replacement: + +```{r} +sentences %>% + str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>% + head(5) +``` + ### Exercises 1. Describe, in words, what these expressions will match: @@ -380,3 +394,4 @@ See the Stack Overflow discussion at for mor Don't forget that you're in a programming language and you have other tools at your disposal. Instead of creating one complex regular expression, it's often easier to write a series of simpler regexps. If you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one. + diff --git a/strings.Rmd b/strings.Rmd index aca01fd..3dc93f7 100644 --- a/strings.Rmd +++ b/strings.Rmd @@ -3,9 +3,8 @@ ## Introduction This chapter introduces you to string manipulation in R. -You'll learn the basics of how strings work and how to create them by hand, but the focus of this chapter will be on regular expressions, or regexps for short. -Regular expressions are useful because strings usually contain unstructured or semi-structured data, and regexps are a concise language for describing patterns in strings. -When you first look at a regexp, you'll think a cat walked across your keyboard, but as your understanding improves they will soon start to make sense. +You'll learn the basics of how strings work and how to create them by hand. +Big topic so spread over three chapters. 
### Prerequisites @@ -15,7 +14,7 @@ This chapter will focus on the **stringr** package for string manipulation, whic library(tidyverse) ``` -## String basics +## Creating a string You can create strings with either single quotes or double quotes. Unlike other languages, there is no difference in behaviour. @@ -44,6 +43,8 @@ single_quote <- '\'' # or "'" That means if you want to include a literal backslash, you'll need to double it up: `"\\"`. +TODO: raw string. + Beware that the printed representation of a string is not the same as string itself, because the printed representation shows the escapes. To see the raw contents of the string, use `writeLines()`: @@ -68,7 +69,7 @@ Multiple strings are often stored in a character vector, which you can create wi c("one", "two", "three") ``` -### String length +## String length Base R contains many functions to work with strings but we'll avoid them because they can be inconsistent, which makes them hard to remember. Instead we'll use functions from stringr. @@ -79,13 +80,15 @@ For example, `str_length()` tells you the number of characters in a string: str_length(c("a", "R for data science", NA)) ``` +What is a letter? + The common `str_` prefix is particularly useful if you use RStudio, because typing `str_` will trigger autocomplete, allowing you to see all stringr functions: ```{r, echo = FALSE} knitr::include_graphics("screenshots/stringr-autocomplete.png") ``` -### Combining strings +## Combining strings To combine two or more strings, use `str_c()`: @@ -115,7 +118,7 @@ As shown above, `str_c()` is vectorised, and it automatically recycles shorter v str_c("prefix-", c("a", "b", "c"), "-suffix") ``` -Objects of length 0 are silently dropped. +`NULL`s are silently dropped. 
This is particularly useful in conjunction with `if`: ```{r} @@ -136,7 +139,7 @@ To collapse a vector of strings into a single string, use `collapse`: str_c(c("x", "y", "z"), collapse = ", ") ``` -### Subsetting strings +## Subsetting strings You can extract parts of a string using `str_sub()`. As well as the string, `str_sub()` takes `start` and `end` arguments which give the (inclusive) position of the substring: @@ -161,7 +164,9 @@ str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1)) x ``` -### Locales +TODO: `separate()` + +## Locales Above I used `str_to_lower()` to change the text to lower case. You can also use `str_to_upper()` or `str_to_title()`. @@ -214,18 +219,7 @@ TODO: add connection to `arrange()` 6. Write a function that turns (e.g.) a vector `c("a", "b", "c")` into the string `a, b, and c`. Think carefully about what it should do if given a vector of length 0, 1, or 2. -## Tools - -Now that you've learned the basics of regular expressions, it's time to learn how to apply them to real problems. -In this section you'll learn a wide array of stringr functions that let you: - -- Determine which strings match a pattern. -- Find the positions of matches. -- Extract the content of matches. -- Replace matches with new values. -- Split a string based on a match. - -### Detect matches +## Detect matches To determine if a character vector matches a pattern, use `str_detect()`. It returns a logical vector the same length as the input: @@ -235,6 +229,8 @@ x <- c("apple", "banana", "pear") str_detect(x, "e") ``` +TODO: add basic intro to regexps. + Remember that when you use a logical vector in a numeric context, `FALSE` becomes 0 and `TRUE` becomes 1. That makes `sum()` and `mean()` useful if you want to answer questions about matches across a larger vector: @@ -307,11 +303,7 @@ str_count("abababa", "aba") str_view_all("abababa", "aba") ``` -Note the use of `str_view_all()`. 
-As you'll shortly learn, many stringr functions come in pairs: one function works with a single match, and the other works with all matches. -The second function will have the suffix `_all`. - -#### Exercises +### Exercises 1. For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple `str_detect()` calls. @@ -323,7 +315,33 @@ The second function will have the suffix `_all`. What word has the highest proportion of vowels? (Hint: what is the denominator?) -### Extract matches +## Replacing matches + +`str_replace_all()` allows you to replace matches with new strings. +The simplest use is to replace a pattern with a fixed string: + +```{r} +x <- c("apple", "pear", "banana") +str_replace_all(x, "[aeiou]", "-") +``` + +With `str_replace_all()` you can perform multiple replacements by supplying a named vector: + +```{r} +x <- c("1 house", "2 cars", "3 people") +str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three")) +``` + +#### Exercises + +1. Replace all forward slashes in a string with backslashes. + +2. Implement a simple version of `str_to_lower()` using `str_replace_all()`. + +3. Switch the first and last letters in `words`. + Which of those strings are still words? + +## Extract full matches To extract the actual text of a match, use `str_extract()`. To show that off, we're going to need a more complicated example. @@ -364,61 +382,14 @@ str_extract(more, colour_match) This is a common pattern for stringr functions, because working with a single match allows you to use much simpler data structures. To get all matches, use `str_extract_all()`. -It returns a list: +It returns a list, so we'll come back to this later on. -```{r} -str_extract_all(more, colour_match) -``` - -You'll learn more about lists in Section \@ref(lists) on lists and Chapter \@ref(iteration) on iteration. 
- -If you use `simplify = TRUE`, `str_extract_all()` will return a matrix with short matches expanded to the same length as the longest: - -```{r} -str_extract_all(more, colour_match, simplify = TRUE) - -x <- c("a", "a b", "a b c") -str_extract_all(x, "[a-z]", simplify = TRUE) -``` - -#### Exercises +### Exercises 1. In the previous example, you might have noticed that the regular expression matched "flickered", which is not a colour. Modify the regex to fix the problem. -2. From the Harvard sentences data, extract: - - 1. The first word from each sentence. - 2. All words ending in `ing`. - 3. All plurals. - -### Grouped matches - -Earlier in this chapter we talked about the use of parentheses for clarifying precedence and for backreferences when matching. -You can also use parentheses to extract parts of a complex match. -For example, imagine we want to extract nouns from the sentences. -As a heuristic, we'll look for any word that comes after "a" or "the". -Defining a "word" in a regular expression is a little tricky, so here I use a simple approximation: a sequence of at least one character that isn't a space. - -```{r} -noun <- "(a|the) ([^ ]+)" - -has_noun <- sentences %>% - str_subset(noun) %>% - head(10) -has_noun %>% - str_extract(noun) -``` - -`str_extract()` gives us the complete match; `str_match()` gives each individual component. -Instead of a character vector, it returns a matrix, with one column for the complete match followed by one column for each group: - -```{r} -has_noun %>% - str_match(noun) -``` - -(Unsurprisingly, our heuristic for detecting nouns is poor, and also picks up adjectives like smooth and parked.) +## Extract part of matches If your data is in a tibble, it's often easier to use `tidyr::extract()`. It works like `str_match()` but requires you to name the matches, which are then placed in new columns: @@ -441,88 +412,7 @@ Like `str_extract()`, if you want all matches for each string, you'll need `str_ 2. Find all contractions. 
Separate out the pieces before and after the apostrophe. -### Replacing matches - -`str_replace()` and `str_replace_all()` allow you to replace matches with new strings. -The simplest use is to replace a pattern with a fixed string: - -```{r} -x <- c("apple", "pear", "banana") -str_replace(x, "[aeiou]", "-") -str_replace_all(x, "[aeiou]", "-") -``` - -With `str_replace_all()` you can perform multiple replacements by supplying a named vector: - -```{r} -x <- c("1 house", "2 cars", "3 people") -str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three")) -``` - -Instead of replacing with a fixed string you can use backreferences to insert components of the match. -In the following code, I flip the order of the second and third words. - -```{r} -sentences %>% - str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>% - head(5) -``` - -#### Exercises - -1. Replace all forward slashes in a string with backslashes. - -2. Implement a simple version of `str_to_lower()` using `str_replace_all()`. - -3. Switch the first and last letters in `words`. - Which of those strings are still words? - -### Splitting - -Use `str_split()` to split a string up into pieces. -For example, we could split sentences into words: - -```{r} -sentences %>% - head(5) %>% - str_split(" ") -``` - -Because each component might contain a different number of pieces, this returns a list. 
-If you're working with a length-1 vector, the easiest thing is to just extract the first element of the list: - -```{r} -"a|b|c|d" %>% - str_split("\\|") %>% - .[[1]] -``` - -Otherwise, like the other stringr functions that return a list, you can use `simplify = TRUE` to return a matrix: - -```{r} -sentences %>% - head(5) %>% - str_split(" ", simplify = TRUE) -``` - -You can also request a maximum number of pieces: - -```{r} -fields <- c("Name: Hadley", "Country: NZ", "Age: 35") -fields %>% str_split(": ", n = 2, simplify = TRUE) -``` - -Instead of splitting up strings by patterns, you can also split up by character, line, sentence and word `boundary()`s: - -```{r} -x <- "This is a sentence. This is another sentence." -str_view_all(x, boundary("word")) - -str_split(x, " ")[[1]] -str_split(x, boundary("word"))[[1]] -``` - -### Separate +## Separate `separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears. Take `table3`: @@ -553,7 +443,7 @@ table3 %>% separate(rate, into = c("cases", "population"), sep = "/") ``` -#### Exercises +### Exercises 1. Split up a string like `"apples, pears, and bananas"` into individual components. @@ -562,12 +452,6 @@ table3 %>% 3. What does splitting with an empty string (`""`) do? Experiment, and then read the documentation. -### Find matches - -`str_locate()` and `str_locate_all()` give you the starting and ending positions of each match. -These are particularly useful when none of the other functions does exactly what you want. -You can use `str_locate()` to find the matching pattern, `str_sub()` to extract and/or modify them. - ## Other types of pattern When you use a pattern that's a string, it's automatically wrapped into a call to `regex()`: @@ -689,74 +573,3 @@ There are three other functions you can use instead of `regex()`: 1. How would you find all strings containing `\` with `regex()` vs. with `fixed()`? 2. What are the five most common words in `sentences`? 
- -## stringi - -stringr is built on top of the **stringi** package. -stringr is useful when you're learning because it exposes a minimal set of functions, which have been carefully picked to handle the most common string manipulation functions. -stringi, on the other hand, is designed to be comprehensive. -It contains almost every function you might ever need: stringi has `r length(getNamespaceExports("stringi"))` functions to stringr's `r length(getNamespaceExports("stringr"))`. - -If you find yourself struggling to do something in stringr, it's worth taking a look at stringi. -The packages work very similarly, so you should be able to translate your stringr knowledge in a natural way. -The main difference is the prefix: `str_` vs. `stri_`. - -### Exercises - -1. Find the stringi functions that: - - a. Count the number of words. - b. Find duplicated strings. - c. Generate random text. - -2. How do you control the language that `stri_sort()` uses for sorting? - -### Exercises - -1. What do the `extra` and `fill` arguments do in `separate()`? - Experiment with the various options for the following two toy datasets. - - ```{r, eval = FALSE} - tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% - separate(x, c("one", "two", "three")) - - tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% - separate(x, c("one", "two", "three")) - ``` - -2. Both `unite()` and `separate()` have a `remove` argument. - What does it do? - Why would you set it to `FALSE`? - -3. Compare and contrast `separate()` and `extract()`. - Why are there three variations of separation (by position, by separator, and with groups), but only one unite? - -4. In the following example we're using `unite()` to create a `date` column from `month` and `day` columns. - How would you achieve the same outcome using `mutate()` and `paste()` instead of unite? 
- - ```{r, eval = FALSE} - events <- tribble( - ~month, ~day, - 1 , 20, - 1 , 21, - 1 , 22 - ) - - events %>% - unite("date", month:day, sep = "-", remove = FALSE) - ``` - -5. You can also pass a vector of integers to `sep`. `separate()` will interpret the integers as positions to split at. - Positive values start at 1 on the far-left of the strings; negative value start at -1 on the far-right of the strings. - Use `separate()` to represent location information in the following tibble in two columns: `state` (represented by the first two characters) and `county`. - Do this in two ways: using a positive and a negative value for `sep`. - - ```{r} - baker <- tribble( - ~location, - "FLBaker County", - "GABaker County", - "ORBaker County", - ) - baker - ```