Break up strings chapter

Hadley Wickham 2021-04-21 12:30:25 -05:00
parent 18253a1d52
commit 58f7f16db1
4 changed files with 257 additions and 238 deletions


@ -46,6 +46,7 @@ rmd_files: [
"functions.Rmd",
"vectors.Rmd",
"iteration.Rmd",
"prog-strings.Rmd",
"communicate.Rmd",
"rmarkdown.Rmd",

prog-strings.Rmd Normal file

@ -0,0 +1,190 @@
## Programming with strings
```{r}
library(stringr)
library(tidyr)
library(tibble)
```
### Extract
```{r}
colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
colour_match
more <- sentences[str_count(sentences, colour_match) > 1]
str_extract_all(more, colour_match)
```
If you use `simplify = TRUE`, `str_extract_all()` will return a matrix with short matches expanded to the same length as the longest:
```{r}
str_extract_all(more, colour_match, simplify = TRUE)
x <- c("a", "a b", "a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)
```
We don't talk about matrices here, but they are useful elsewhere.
### Exercises
1. From the Harvard sentences data, extract:
1. The first word from each sentence.
2. All words ending in `ing`.
3. All plurals.
## Grouped matches
Earlier in this chapter we talked about the use of parentheses for clarifying precedence and for backreferences when matching.
You can also use parentheses to extract parts of a complex match.
For example, imagine we want to extract nouns from the sentences.
As a heuristic, we'll look for any word that comes after "a" or "the".
Defining a "word" in a regular expression is a little tricky, so here I use a simple approximation: a sequence of at least one character that isn't a space.
```{r}
noun <- "(a|the) ([^ ]+)"
has_noun <- sentences %>%
str_subset(noun) %>%
head(10)
has_noun %>%
str_extract(noun)
```
`str_extract()` gives us the complete match; `str_match()` gives each individual component.
Instead of a character vector, it returns a matrix, with one column for the complete match followed by one column for each group:
```{r}
has_noun %>%
str_match(noun)
```
(Unsurprisingly, our heuristic for detecting nouns is poor, and also picks up adjectives like smooth and parked.)
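To make the shape of that matrix concrete, here is a small sketch on a made-up vector (the three strings are my own illustration, not part of `sentences`). Column 1 holds the complete match, columns 2 and 3 hold the two groups, and a string with no match gives a row of `NA`s:

```{r}
library(stringr)

# Hypothetical input, chosen so the matches are easy to check by eye
x <- c("the cat sat", "a dog ran", "birds sing")
noun <- "(a|the) ([^ ]+)"

m <- str_match(x, noun)
m

# Pull out individual columns by position
full_match <- m[, 1]  # complete match
article    <- m[, 2]  # first group: "a" or "the"
word       <- m[, 3]  # second group: the word that follows
word
```

Indexing by column position is the simplest way to get at one group when you don't need the others.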
## Splitting
Use `str_split()` to split a string up into pieces.
For example, we could split sentences into words:
```{r}
sentences %>%
head(5) %>%
str_split(" ")
```
Because each component might contain a different number of pieces, this returns a list.
If you're working with a length-1 vector, the easiest thing is to just extract the first element of the list:
```{r}
"a|b|c|d" %>%
str_split("\\|") %>%
.[[1]]
```
Otherwise, like the other stringr functions that return a list, you can use `simplify = TRUE` to return a matrix:
```{r}
sentences %>%
head(5) %>%
str_split(" ", simplify = TRUE)
```
You can also request a maximum number of pieces:
```{r}
fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
fields %>% str_split(": ", n = 2, simplify = TRUE)
```
Instead of splitting up strings by patterns, you can also split up by character, line, sentence and word `boundary()`s:
```{r}
x <- "This is a sentence. This is another sentence."
str_view_all(x, boundary("word"))
str_split(x, " ")[[1]]
str_split(x, boundary("word"))[[1]]
```
## Replace with function
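This section is still a stub. As a minimal sketch of what it might cover, and assuming a version of stringr in which the `replacement` argument of `str_replace_all()` may be a function that is applied to the matched text, you can transform each match instead of substituting a fixed string:

```{r}
library(stringr)

x <- c("apple", "pear", "banana")

# Assumption: `replacement` accepts a function (check ?str_replace_all
# for your installed version). Here every vowel is upper-cased.
shouted <- str_replace_all(x, "[aeiou]", toupper)
shouted
```

Any function that takes a character vector and returns one of the same length should work in place of `toupper`.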
## Locations
`str_locate()` and `str_locate_all()` give you the starting and ending positions of each match.
These are particularly useful when none of the other functions does exactly what you want.
You can use `str_locate()` to find the position of a match, and then `str_sub()` to extract and/or modify the matching text.
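A minimal sketch of that combination, using a made-up string and pattern for illustration:

```{r}
library(stringr)

x <- "R for data science"

# str_locate() returns a matrix with `start` and `end` columns
loc <- str_locate(x, "data")
loc

# Extract the matched text by position
found <- str_sub(x, loc[, "start"], loc[, "end"])
found

# Or modify it in place with the assignment form of str_sub()
y <- x
str_sub(y, loc[, "start"], loc[, "end"]) <- "DATA"
y
```

`str_locate_all()` works the same way but returns a list with one matrix per input string, so you would typically loop over it.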
## stringi
stringr is built on top of the **stringi** package.
stringr is useful when you're learning because it exposes a minimal set of functions, which have been carefully picked to handle the most common string manipulation tasks.
stringi, on the other hand, is designed to be comprehensive.
It contains almost every function you might ever need: stringi has `r length(getNamespaceExports("stringi"))` functions to stringr's `r length(getNamespaceExports("stringr"))`.
If you find yourself struggling to do something in stringr, it's worth taking a look at stringi.
The packages work very similarly, so you should be able to translate your stringr knowledge in a natural way.
The main difference is the prefix: `str_` vs. `stri_`.
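As a small illustration of how closely the two packages line up, the sketch below calls `str_length()` and its stringi counterpart `stri_length()` on the same input. It assumes the stringi package is installed, which it will be if stringr is:

```{r}
library(stringr)
library(stringi)

x <- c("a", "R for data science", NA)

# Same job, different prefix
n_stringr <- str_length(x)
n_stringi <- stri_length(x)

n_stringr
n_stringi
```

Not every stringr function has an exact `stri_` twin with the same argument names, so check the stringi documentation when you translate.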
### Exercises
1. Find the stringi functions that:
a. Count the number of words.
b. Find duplicated strings.
c. Generate random text.
2. How do you control the language that `stri_sort()` uses for sorting?
### Exercises
1. What do the `extra` and `fill` arguments do in `separate()`?
Experiment with the various options for the following two toy datasets.
```{r, eval = FALSE}
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"))
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"))
```
2. Both `unite()` and `separate()` have a `remove` argument.
What does it do?
Why would you set it to `FALSE`?
3. Compare and contrast `separate()` and `extract()`.
Why are there three variations of separation (by position, by separator, and with groups), but only one unite?
4. In the following example we're using `unite()` to create a `date` column from `month` and `day` columns.
How would you achieve the same outcome using `mutate()` and `paste()` instead of unite?
```{r, eval = FALSE}
events <- tribble(
~month, ~day,
1 , 20,
1 , 21,
1 , 22
)
events %>%
unite("date", month:day, sep = "-", remove = FALSE)
```
5. You can also pass a vector of integers to `sep`. `separate()` will interpret the integers as positions to split at.
Positive values start at 1 on the far-left of the strings; negative values start at -1 on the far-right of the strings.
Use `separate()` to represent location information in the following tibble in two columns: `state` (represented by the first two characters) and `county`.
Do this in two ways: using a positive and a negative value for `sep`.
```{r}
baker <- tribble(
~location,
"FLBaker County",
"GABaker County",
"ORBaker County",
)
baker
```


@ -1,5 +1,11 @@
# Regular expressions
## Introduction
The focus of this chapter will be on regular expressions, or regexps for short.
Regular expressions are useful because strings usually contain unstructured or semi-structured data, and regexps are a concise language for describing patterns in strings.
When you first look at a regexp, you'll think a cat walked across your keyboard, but as your understanding improves they will soon start to make sense.
## Matching patterns with regular expressions
Regexps are a very terse language that allow you to describe patterns in strings.
@ -229,7 +235,7 @@ Collectively, these operators are called **quantifiers** because they quantify h
b. Have three or more vowels in a row.
c. Have two or more vowel-consonant pairs in a row.
4. Solve the beginner regexp crosswords at [<https://regexcrossword.com/challenges/beginner>](https://regexcrossword.com/challenges/beginner){.uri}.
4. Solve the beginner regexp crosswords at [\<https://regexcrossword.com/challenges/beginner\>](https://regexcrossword.com/challenges/beginner){.uri}.
## Grouping and backreferences
@ -245,6 +251,14 @@ str_view(fruit, "(..)\\1", match = TRUE)
(Shortly, you'll also see how they're useful in conjunction with `str_match()`.)
You can also use backreferences in the replacement string:
```{r}
sentences %>%
str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
head(5)
```
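Because the result of the code above depends on the contents of `sentences`, a one-string sketch may be easier to check by hand. The input below is made up for illustration; `\\1`, `\\2` and `\\3` refer to the first, second and third parenthesised groups:

```{r}
library(stringr)

# Hypothetical input: four words separated by single spaces
x <- "one two three four"

# Keep the first word, then swap the second and third
swapped <- str_replace(x, "([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2")
swapped
#> [1] "one three two four"
```

Only the first three words take part in the match, so the rest of the string is left untouched.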
### Exercises
1. Describe, in words, what these expressions will match:
@ -380,3 +394,4 @@ See the Stack Overflow discussion at <http://stackoverflow.com/a/201378> for mor
Don't forget that you're in a programming language and you have other tools at your disposal.
Instead of creating one complex regular expression, it's often easier to write a series of simpler regexps.
If you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.


@ -3,9 +3,8 @@
## Introduction
This chapter introduces you to string manipulation in R.
You'll learn the basics of how strings work and how to create them by hand, but the focus of this chapter will be on regular expressions, or regexps for short.
Regular expressions are useful because strings usually contain unstructured or semi-structured data, and regexps are a concise language for describing patterns in strings.
When you first look at a regexp, you'll think a cat walked across your keyboard, but as your understanding improves they will soon start to make sense.
You'll learn the basics of how strings work and how to create them by hand.
Strings are a big topic, so they are spread over three chapters.
### Prerequisites
@ -15,7 +14,7 @@ This chapter will focus on the **stringr** package for string manipulation, whic
library(tidyverse)
```
## String basics
## Creating a string
You can create strings with either single quotes or double quotes.
Unlike other languages, there is no difference in behaviour.
@ -44,6 +43,8 @@ single_quote <- '\'' # or "'"
That means if you want to include a literal backslash, you'll need to double it up: `"\\"`.
TODO: raw string.
Beware that the printed representation of a string is not the same as string itself, because the printed representation shows the escapes.
To see the raw contents of the string, use `writeLines()`:
@ -68,7 +69,7 @@ Multiple strings are often stored in a character vector, which you can create wi
c("one", "two", "three")
```
### String length
## String length
Base R contains many functions to work with strings but we'll avoid them because they can be inconsistent, which makes them hard to remember.
Instead we'll use functions from stringr.
@ -79,13 +80,15 @@ For example, `str_length()` tells you the number of characters in a string:
str_length(c("a", "R for data science", NA))
```
What is a letter?
The common `str_` prefix is particularly useful if you use RStudio, because typing `str_` will trigger autocomplete, allowing you to see all stringr functions:
```{r, echo = FALSE}
knitr::include_graphics("screenshots/stringr-autocomplete.png")
```
### Combining strings
## Combining strings
To combine two or more strings, use `str_c()`:
@ -115,7 +118,7 @@ As shown above, `str_c()` is vectorised, and it automatically recycles shorter v
str_c("prefix-", c("a", "b", "c"), "-suffix")
```
Objects of length 0 are silently dropped.
`NULL`s are silently dropped.
This is particularly useful in conjunction with `if`:
```{r}
@ -136,7 +139,7 @@ To collapse a vector of strings into a single string, use `collapse`:
str_c(c("x", "y", "z"), collapse = ", ")
```
### Subsetting strings
## Subsetting strings
You can extract parts of a string using `str_sub()`.
As well as the string, `str_sub()` takes `start` and `end` arguments which give the (inclusive) position of the substring:
@ -161,7 +164,9 @@ str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x
```
### Locales
TODO: `separate()`
## Locales
Above I used `str_to_lower()` to change the text to lower case.
You can also use `str_to_upper()` or `str_to_title()`.
@ -214,18 +219,7 @@ TODO: add connection to `arrange()`
6. Write a function that turns (e.g.) a vector `c("a", "b", "c")` into the string `a, b, and c`.
Think carefully about what it should do if given a vector of length 0, 1, or 2.
## Tools
Now that you've learned the basics of regular expressions, it's time to learn how to apply them to real problems.
In this section you'll learn a wide array of stringr functions that let you:
- Determine which strings match a pattern.
- Find the positions of matches.
- Extract the content of matches.
- Replace matches with new values.
- Split a string based on a match.
### Detect matches
## Detect matches
To determine if a character vector matches a pattern, use `str_detect()`.
It returns a logical vector the same length as the input:
@ -235,6 +229,8 @@ x <- c("apple", "banana", "pear")
str_detect(x, "e")
```
TODO: add basic intro to regexps.
Remember that when you use a logical vector in a numeric context, `FALSE` becomes 0 and `TRUE` becomes 1.
That makes `sum()` and `mean()` useful if you want to answer questions about matches across a larger vector:
@ -307,11 +303,7 @@ str_count("abababa", "aba")
str_view_all("abababa", "aba")
```
Note the use of `str_view_all()`.
As you'll shortly learn, many stringr functions come in pairs: one function works with a single match, and the other works with all matches.
The second function will have the suffix `_all`.
#### Exercises
### Exercises
1. For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple `str_detect()` calls.
@ -323,7 +315,33 @@ The second function will have the suffix `_all`.
What word has the highest proportion of vowels?
(Hint: what is the denominator?)
### Extract matches
## Replacing matches
`str_replace_all()` allows you to replace matches with new strings.
The simplest use is to replace a pattern with a fixed string:
```{r}
x <- c("apple", "pear", "banana")
str_replace_all(x, "[aeiou]", "-")
```
With `str_replace_all()` you can perform multiple replacements by supplying a named vector:
```{r}
x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
```
#### Exercises
1. Replace all forward slashes in a string with backslashes.
2. Implement a simple version of `str_to_lower()` using `str_replace_all()`.
3. Switch the first and last letters in `words`.
Which of those strings are still words?
## Extract full matches
To extract the actual text of a match, use `str_extract()`.
To show that off, we're going to need a more complicated example.
@ -364,61 +382,14 @@ str_extract(more, colour_match)
This is a common pattern for stringr functions, because working with a single match allows you to use much simpler data structures.
To get all matches, use `str_extract_all()`.
It returns a list:
It returns a list, so we'll come back to this later on.
```{r}
str_extract_all(more, colour_match)
```
You'll learn more about lists in Section \@ref(lists) on lists and Chapter \@ref(iteration) on iteration.
If you use `simplify = TRUE`, `str_extract_all()` will return a matrix with short matches expanded to the same length as the longest:
```{r}
str_extract_all(more, colour_match, simplify = TRUE)
x <- c("a", "a b", "a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)
```
#### Exercises
### Exercises
1. In the previous example, you might have noticed that the regular expression matched "flickered", which is not a colour.
Modify the regex to fix the problem.
2. From the Harvard sentences data, extract:
1. The first word from each sentence.
2. All words ending in `ing`.
3. All plurals.
### Grouped matches
Earlier in this chapter we talked about the use of parentheses for clarifying precedence and for backreferences when matching.
You can also use parentheses to extract parts of a complex match.
For example, imagine we want to extract nouns from the sentences.
As a heuristic, we'll look for any word that comes after "a" or "the".
Defining a "word" in a regular expression is a little tricky, so here I use a simple approximation: a sequence of at least one character that isn't a space.
```{r}
noun <- "(a|the) ([^ ]+)"
has_noun <- sentences %>%
str_subset(noun) %>%
head(10)
has_noun %>%
str_extract(noun)
```
`str_extract()` gives us the complete match; `str_match()` gives each individual component.
Instead of a character vector, it returns a matrix, with one column for the complete match followed by one column for each group:
```{r}
has_noun %>%
str_match(noun)
```
(Unsurprisingly, our heuristic for detecting nouns is poor, and also picks up adjectives like smooth and parked.)
## Extract part of matches
If your data is in a tibble, it's often easier to use `tidyr::extract()`.
It works like `str_match()` but requires you to name the matches, which are then placed in new columns:
@ -441,88 +412,7 @@ Like `str_extract()`, if you want all matches for each string, you'll need `str_
2. Find all contractions.
Separate out the pieces before and after the apostrophe.
### Replacing matches
`str_replace()` and `str_replace_all()` allow you to replace matches with new strings.
The simplest use is to replace a pattern with a fixed string:
```{r}
x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
str_replace_all(x, "[aeiou]", "-")
```
With `str_replace_all()` you can perform multiple replacements by supplying a named vector:
```{r}
x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
```
Instead of replacing with a fixed string you can use backreferences to insert components of the match.
In the following code, I flip the order of the second and third words.
```{r}
sentences %>%
str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
head(5)
```
#### Exercises
1. Replace all forward slashes in a string with backslashes.
2. Implement a simple version of `str_to_lower()` using `str_replace_all()`.
3. Switch the first and last letters in `words`.
Which of those strings are still words?
### Splitting
Use `str_split()` to split a string up into pieces.
For example, we could split sentences into words:
```{r}
sentences %>%
head(5) %>%
str_split(" ")
```
Because each component might contain a different number of pieces, this returns a list.
If you're working with a length-1 vector, the easiest thing is to just extract the first element of the list:
```{r}
"a|b|c|d" %>%
str_split("\\|") %>%
.[[1]]
```
Otherwise, like the other stringr functions that return a list, you can use `simplify = TRUE` to return a matrix:
```{r}
sentences %>%
head(5) %>%
str_split(" ", simplify = TRUE)
```
You can also request a maximum number of pieces:
```{r}
fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
fields %>% str_split(": ", n = 2, simplify = TRUE)
```
Instead of splitting up strings by patterns, you can also split up by character, line, sentence and word `boundary()`s:
```{r}
x <- "This is a sentence. This is another sentence."
str_view_all(x, boundary("word"))
str_split(x, " ")[[1]]
str_split(x, boundary("word"))[[1]]
```
### Separate
## Separate
`separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears.
Take `table3`:
@ -553,7 +443,7 @@ table3 %>%
separate(rate, into = c("cases", "population"), sep = "/")
```
#### Exercises
### Exercises
1. Split up a string like `"apples, pears, and bananas"` into individual components.
@ -562,12 +452,6 @@ table3 %>%
3. What does splitting with an empty string (`""`) do?
Experiment, and then read the documentation.
### Find matches
`str_locate()` and `str_locate_all()` give you the starting and ending positions of each match.
These are particularly useful when none of the other functions does exactly what you want.
You can use `str_locate()` to find the position of a match, and then `str_sub()` to extract and/or modify the matching text.
## Other types of pattern
When you use a pattern that's a string, it's automatically wrapped into a call to `regex()`:
@ -689,74 +573,3 @@ There are three other functions you can use instead of `regex()`:
1. How would you find all strings containing `\` with `regex()` vs. with `fixed()`?
2. What are the five most common words in `sentences`?
## stringi
stringr is built on top of the **stringi** package.
stringr is useful when you're learning because it exposes a minimal set of functions, which have been carefully picked to handle the most common string manipulation tasks.
stringi, on the other hand, is designed to be comprehensive.
It contains almost every function you might ever need: stringi has `r length(getNamespaceExports("stringi"))` functions to stringr's `r length(getNamespaceExports("stringr"))`.
If you find yourself struggling to do something in stringr, it's worth taking a look at stringi.
The packages work very similarly, so you should be able to translate your stringr knowledge in a natural way.
The main difference is the prefix: `str_` vs. `stri_`.
### Exercises
1. Find the stringi functions that:
a. Count the number of words.
b. Find duplicated strings.
c. Generate random text.
2. How do you control the language that `stri_sort()` uses for sorting?
### Exercises
1. What do the `extra` and `fill` arguments do in `separate()`?
Experiment with the various options for the following two toy datasets.
```{r, eval = FALSE}
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"))
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"))
```
2. Both `unite()` and `separate()` have a `remove` argument.
What does it do?
Why would you set it to `FALSE`?
3. Compare and contrast `separate()` and `extract()`.
Why are there three variations of separation (by position, by separator, and with groups), but only one unite?
4. In the following example we're using `unite()` to create a `date` column from `month` and `day` columns.
How would you achieve the same outcome using `mutate()` and `paste()` instead of unite?
```{r, eval = FALSE}
events <- tribble(
~month, ~day,
1 , 20,
1 , 21,
1 , 22
)
events %>%
unite("date", month:day, sep = "-", remove = FALSE)
```
5. You can also pass a vector of integers to `sep`. `separate()` will interpret the integers as positions to split at.
Positive values start at 1 on the far-left of the strings; negative values start at -1 on the far-right of the strings.
Use `separate()` to represent location information in the following tibble in two columns: `state` (represented by the first two characters) and `county`.
Do this in two ways: using a positive and a negative value for `sep`.
```{r}
baker <- tribble(
~location,
"FLBaker County",
"GABaker County",
"ORBaker County",
)
baker
```