Noodling on strings
parent
2505136477
commit
9091a1484d
|
@ -14,6 +14,7 @@ URL: https://github.com/hadley/r4ds
|
||||||
Depends:
|
Depends:
|
||||||
R (>= 3.1.0)
|
R (>= 3.1.0)
|
||||||
Imports:
|
Imports:
|
||||||
|
babynames,
|
||||||
feather,
|
feather,
|
||||||
gapminder,
|
gapminder,
|
||||||
ggrepel,
|
ggrepel,
|
||||||
|
|
15
regexps.Rmd
15
regexps.Rmd
|
@ -123,6 +123,17 @@ For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`,
|
||||||
|
|
||||||
Since this list is long, you might want to use the `match` argument to `str_view()` to show only the matching or non-matching words.
|
Since this list is long, you might want to use the `match` argument to `str_view()` to show only the matching or non-matching words.
|
||||||
|
|
||||||
|
## Overlapping and zero-width patterns
|
||||||
|
|
||||||
|
Note that matches never overlap.
|
||||||
|
For example, in `"abababa"`, how many times will the pattern `"aba"` match?
|
||||||
|
Regular expressions say two, not three:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
str_count("abababa", "aba")
|
||||||
|
str_view_all("abababa", "aba")
|
||||||
|
```
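If overlapping matches are what you actually want, one common workaround is a zero-width lookahead. The sketch below is not part of this commit; it assumes stringr's default ICU regex engine, where `(?=...)` matches a position without consuming any characters.

```r
library(stringr)

# Ordinary matching consumes the characters it matches,
# so the next match can only start after the previous one ends
str_count("abababa", "aba")

# A lookahead matches a position, not characters, so every
# position that is followed by "aba" is counted separately
str_count("abababa", "(?=aba)")
```

Because the lookahead is zero-width, it is only useful for counting or locating start positions; the matches themselves are empty strings.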
|
||||||
|
|
||||||
## Character classes and alternatives
|
## Character classes and alternatives
|
||||||
|
|
||||||
There are a number of special patterns that match more than one character.
|
There are a number of special patterns that match more than one character.
|
||||||
|
@ -259,6 +270,9 @@ sentences %>%
|
||||||
head(5)
|
head(5)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
Names that start and end with the same letter.
|
||||||
|
Implement with `str_sub()` instead.
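A minimal sketch of the `str_sub()` approach, using a small made-up vector (`sample_names` is hypothetical, not data from the book) and lower-casing the first letter so that `"Anna"` counts as a match:

```r
library(stringr)

# Hypothetical sample of names, for illustration only
sample_names <- c("Anna", "Bob", "Hadley", "Elle", "Otto")

# First letter (lower-cased) and last letter of each name
first <- str_to_lower(str_sub(sample_names, 1, 1))
last <- str_sub(sample_names, -1, -1)

# Keep the names where the two letters agree
same_ends <- sample_names[first == last]
same_ends
```

Comparing two extracted letters avoids the backreference that a single regular expression would need for the same task.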
|
||||||
|
|
||||||
### Exercises
|
### Exercises
|
||||||
|
|
||||||
1. Describe, in words, what these expressions will match:
|
1. Describe, in words, what these expressions will match:
|
||||||
|
@ -443,3 +457,4 @@ See the Stack Overflow discussion at <http://stackoverflow.com/a/201378> for mor
|
||||||
Don't forget that you're in a programming language and you have other tools at your disposal.
|
Don't forget that you're in a programming language and you have other tools at your disposal.
|
||||||
Instead of creating one complex regular expression, it's often easier to write a series of simpler regexps.
|
Instead of creating one complex regular expression, it's often easier to write a series of simpler regexps.
|
||||||
If you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.
|
If you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.
|
||||||
|
|
||||||
|
|
124
strings.Rmd
124
strings.Rmd
|
@ -2,12 +2,17 @@
|
||||||
|
|
||||||
## Introduction
|
## Introduction
|
||||||
|
|
||||||
This chapter introduces you to string manipulation in R.
|
This chapter introduces you to strings in R.
|
||||||
You'll learn the basics of how strings work and how to create them by hand.
|
You'll learn the basics of how strings work and how to create them by hand.
|
||||||
Big topic so spread over three chapters.
|
Big topic so spread over three chapters.
|
||||||
|
|
||||||
Base R contains many functions to work with strings but we'll generally avoid them here because they can be inconsistent, which makes them hard to remember.
|
Base R contains many functions to work with strings but we'll generally avoid them here because they can be inconsistent, which makes them hard to remember.
|
||||||
Instead, we'll use stringr which is designed to be as consistent as possible, and all of its functions start with `str_`.
|
Instead, we'll use stringr which is designed to be as consistent as possible, and all of its functions start with `str_`.
|
||||||
|
The common `str_` prefix is particularly useful if you use RStudio, because typing `str_` will trigger autocomplete, allowing you to see all stringr functions:
|
||||||
|
|
||||||
|
```{r, echo = FALSE}
|
||||||
|
knitr::include_graphics("screenshots/stringr-autocomplete.png")
|
||||||
|
```
|
||||||
|
|
||||||
### Prerequisites
|
### Prerequisites
|
||||||
|
|
||||||
|
@ -15,6 +20,7 @@ This chapter will focus on the **stringr** package for string manipulation, whic
|
||||||
|
|
||||||
```{r setup, message = FALSE}
|
```{r setup, message = FALSE}
|
||||||
library(tidyverse)
|
library(tidyverse)
|
||||||
|
library(babynames)
|
||||||
```
|
```
|
||||||
|
|
||||||
## Creating a string
|
## Creating a string
|
||||||
|
@ -86,7 +92,7 @@ If your string contains `)"` you can instead use `r"[]"` or `r"{}"`, and if that
|
||||||
|
|
||||||
### Other special characters
|
### Other special characters
|
||||||
|
|
||||||
As well as `\"`, `\'`, and `\\` there are a handful of other special characters that may come in handy. The most common are `"\n"`, newline, and `"\t"`, tab, but you can see the complete list by requesting help on `"`: `?'"'`, or `?"'"`.
|
As well as `\"`, `\'`, and `\\` there are a handful of other special characters that may come in handy. The most common are `"\n"`, newline, and `"\t"`, tab, but you can see the complete list by requesting help on `"` with `?'"'` or `?"'"`.
|
||||||
|
|
||||||
You'll also sometimes see strings containing Unicode escapes like `"\u00b5"`.
|
You'll also sometimes see strings containing Unicode escapes like `"\u00b5"`.
|
||||||
This is a way of writing non-English characters that works on all platforms:
|
This is a way of writing non-English characters that works on all platforms:
|
||||||
|
@ -105,12 +111,6 @@ str_c("x", "y")
|
||||||
str_c("x", "y", "z")
|
str_c("x", "y", "z")
|
||||||
```
|
```
|
||||||
|
|
||||||
The common `str_` prefix is particularly useful if you use RStudio, because typing `str_` will trigger autocomplete, allowing you to see all stringr functions:
|
|
||||||
|
|
||||||
```{r, echo = FALSE}
|
|
||||||
knitr::include_graphics("screenshots/stringr-autocomplete.png")
|
|
||||||
```
|
|
||||||
|
|
||||||
Use the `sep` argument to control how they're separated:
|
Use the `sep` argument to control how they're separated:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
|
@ -126,24 +126,24 @@ str_c("|-", x, "-|")
|
||||||
str_c("|-", coalesce(x, ""), "-|")
|
str_c("|-", coalesce(x, ""), "-|")
|
||||||
```
|
```
|
||||||
|
|
||||||
`str_c()` is vectorised which means that it automatically recycles individual strings to the same length as the longest vector input:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
str_c("prefix-", c("a", "b", "c"), "-suffix")
|
|
||||||
```
|
|
||||||
|
|
||||||
`mutate()`
|
`mutate()`
|
||||||
|
|
||||||
## Flattening strings
|
Another powerful way of combining strings is with the glue package.
|
||||||
|
You can either use `glue::glue()` or call it via the `str_glue()` wrapper that stringr provides for you.
|
||||||
To collapse a vector of strings into a single string, use `collapse`:
|
Glue works a little differently to the other methods: you give it a single string using `{}` to indicate where you want to interpolate in existing variables:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
str_flatten(c("x", "y", "z"), ", ")
|
str_glue("|-{x}-|")
|
||||||
```
|
```
|
||||||
|
|
||||||
This is a great tool for `summarise()`ing character data.
|
Like `str_c()`, `str_glue()` pairs well with `mutate()`:
|
||||||
Later we'll come back to the inverse of this, `separate_rows()`.
|
|
||||||
|
```{r}
|
||||||
|
starwars %>% mutate(
|
||||||
|
intro = str_glue("Hi, my name is {name} and I'm a {species} from {homeworld}"),
|
||||||
|
.keep = "none"
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
## Length and subsetting
|
## Length and subsetting
|
||||||
|
|
||||||
|
@ -153,6 +153,13 @@ For example, `str_length()` tells you the length of a string:
|
||||||
str_length(c("a", "R for data science", NA))
|
str_length(c("a", "R for data science", NA))
|
||||||
```
|
```
|
||||||
|
|
||||||
|
You could use this with `count()` to find the distribution of lengths of US babynames:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
babynames %>%
|
||||||
|
count(length = str_length(name))
|
||||||
|
```
|
||||||
|
|
||||||
You can extract parts of a string using `str_sub()`.
|
You can extract parts of a string using `str_sub()`.
|
||||||
As well as the string, `str_sub()` takes `start` and `end` arguments which give the (inclusive) position of the substring:
|
As well as the string, `str_sub()` takes `start` and `end` arguments which give the (inclusive) position of the substring:
|
||||||
|
|
||||||
|
@ -163,6 +170,16 @@ str_sub(x, 1, 3)
|
||||||
str_sub(x, -3, -1)
|
str_sub(x, -3, -1)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
We could use this with `mutate()` to find the first and last letter of each name:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
babynames %>%
|
||||||
|
mutate(
|
||||||
|
first = str_sub(name, 1, 1),
|
||||||
|
last = str_sub(name, -1, -1)
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
Note that `str_sub()` won't fail if the string is too short: it will just return as much as possible:
|
Note that `str_sub()` won't fail if the string is too short: it will just return as much as possible:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
|
@ -189,6 +206,19 @@ TODO: `separate()`
|
||||||
4. Write a function that turns (e.g.) a vector `c("a", "b", "c")` into the string `a, b, and c`.
|
4. Write a function that turns (e.g.) a vector `c("a", "b", "c")` into the string `a, b, and c`.
|
||||||
Think carefully about what it should do if given a vector of length 0, 1, or 2.
|
Think carefully about what it should do if given a vector of length 0, 1, or 2.
|
||||||
|
|
||||||
|
## String summaries
|
||||||
|
|
||||||
|
You can perform the opposite operation with `summarise()` and `str_flatten()`:
|
||||||
|
|
||||||
|
To collapse a vector of strings into a single string, use `str_flatten()`:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
str_flatten(c("x", "y", "z"), ", ")
|
||||||
|
```
|
||||||
|
|
||||||
|
This is a great tool for `summarise()`ing character data.
|
||||||
|
Later we'll come back to the inverse of this, `separate_rows()`.
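As a rough illustration of pairing `str_flatten()` with `summarise()`, the sketch below collapses one column within groups. The `fruit_df` tibble and its column names are made up for this example.

```r
library(dplyr)
library(stringr)

# Hypothetical data, for illustration only
fruit_df <- tibble(
  colour = c("red", "red", "yellow"),
  fruit = c("apple", "cherry", "banana")
)

# One row per colour, with that colour's fruits collapsed into one string
flat <- fruit_df %>%
  group_by(colour) %>%
  summarise(fruits = str_flatten(fruit, ", "))

flat
```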
|
||||||
|
|
||||||
## Long strings
|
## Long strings
|
||||||
|
|
||||||
`str_wrap()`
|
`str_wrap()`
|
||||||
|
@ -234,15 +264,14 @@ The results are identical, but I think the first approach is significantly easie
|
||||||
If your regular expression gets overly complicated, try breaking it up into smaller pieces, giving each piece a name, and then combining the pieces with logical operations.
|
If your regular expression gets overly complicated, try breaking it up into smaller pieces, giving each piece a name, and then combining the pieces with logical operations.
|
||||||
|
|
||||||
A common use of `str_detect()` is to select the elements that match a pattern.
|
A common use of `str_detect()` is to select the elements that match a pattern.
|
||||||
This makes it a natural pairing with `filter()`:
|
This makes it a natural pairing with `filter()`.
|
||||||
|
The following regexp finds all names with repeated pairs of letters (you'll learn how that regexp works in the next chapter):
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
df <- tibble(
|
babynames %>%
|
||||||
word = words,
|
filter(n > 100) %>%
|
||||||
i = seq_along(word)
|
count(name, wt = n) %>%
|
||||||
)
|
filter(str_detect(name, "(..).*\\1"))
|
||||||
df %>%
|
|
||||||
filter(str_detect(word, "x$"))
|
|
||||||
```
|
```
|
||||||
|
|
||||||
A variation on `str_detect()` is `str_count()`: rather than a simple yes or no, it tells you how many matches there are in a string:
|
A variation on `str_detect()` is `str_count()`: rather than a simple yes or no, it tells you how many matches there are in a string:
|
||||||
|
@ -258,22 +287,13 @@ mean(str_count(words, "[aeiou]"))
|
||||||
It's natural to use `str_count()` with `mutate()`:
|
It's natural to use `str_count()` with `mutate()`:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
df %>%
|
babynames %>%
|
||||||
mutate(
|
mutate(
|
||||||
vowels = str_count(word, "[aeiou]"),
|
vowels = str_count(name, "[aeiou]"),
|
||||||
consonants = str_count(word, "[^aeiou]")
|
consonants = str_count(name, "[^aeiou]")
|
||||||
)
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
Note that matches never overlap.
|
|
||||||
For example, in `"abababa"`, how many times will the pattern `"aba"` match?
|
|
||||||
Regular expressions say two, not three:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
str_count("abababa", "aba")
|
|
||||||
str_view_all("abababa", "aba")
|
|
||||||
```
|
|
||||||
|
|
||||||
### Exercises
|
### Exercises
|
||||||
|
|
||||||
1. For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple `str_detect()` calls.
|
1. For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple `str_detect()` calls.
|
||||||
|
@ -383,6 +403,8 @@ tibble(sentence = sentences) %>%
|
||||||
2. Find all contractions.
|
2. Find all contractions.
|
||||||
Separate out the pieces before and after the apostrophe.
|
Separate out the pieces before and after the apostrophe.
|
||||||
|
|
||||||
|
## Strings -\> Columns
|
||||||
|
|
||||||
## Separate
|
## Separate
|
||||||
|
|
||||||
`separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears.
|
`separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears.
|
||||||
|
@ -416,6 +438,15 @@ table3 %>%
|
||||||
|
|
||||||
`separate_rows()`
|
`separate_rows()`
|
||||||
|
|
||||||
|
## Strings -\> Rows
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
starwars %>%
|
||||||
|
select(name, eye_color) %>%
|
||||||
|
filter(str_detect(eye_color, ", ")) %>%
|
||||||
|
separate_rows(eye_color)
|
||||||
|
```
|
||||||
|
|
||||||
### Exercises
|
### Exercises
|
||||||
|
|
||||||
1. Split up a string like `"apples, pears, and bananas"` into individual components.
|
1. Split up a string like `"apples, pears, and bananas"` into individual components.
|
||||||
|
@ -427,11 +458,22 @@ table3 %>%
|
||||||
|
|
||||||
## Other languages {#other-languages}
|
## Other languages {#other-languages}
|
||||||
|
|
||||||
### Length
|
Encoding, and why not to trust `Encoding`.
|
||||||
|
As a general rule, we recommend using UTF-8 everywhere, converting as early as possible (i.e. by using the `encoding` argument to `readr::locale()`).
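A small sketch of converting at import time, assuming a file that is known to be Latin-1 encoded. The temporary file and its contents are invented for this example; in practice you would have to find out the source encoding yourself.

```r
library(readr)

# Write a tiny Latin-1 encoded CSV to a temporary file (hypothetical data);
# byte 0xe9 is "é" in Latin-1
path <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("name\nJos"), as.raw(0xe9), charToRaw("\n")), path)

# Declare the source encoding on import so readr converts the text to UTF-8
people <- read_csv(path, locale = locale(encoding = "ISO-8859-1"))

people$name
```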
|
||||||
|
|
||||||
|
### Length and subsetting
|
||||||
|
|
||||||
This seems like a straightforward computation if you're only familiar with English, but things get complex quickly when working with other languages.
|
This seems like a straightforward computation if you're only familiar with English, but things get complex quickly when working with other languages.
|
||||||
Include some examples from <https://gankra.github.io/blah/text-hates-you/>.
|
Include some examples from <https://gankra.github.io/blah/text-hates-you/>.
|
||||||
(Maybe better to include a non-English text section later?)
|
|
||||||
|
This is a problem even with European languages because Unicode provides two ways of representing characters with accents: many common characters have a special codepoint, but others can be built up from individual components.
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
x <- c("\u00e1", "a\u0301")
|
||||||
|
x
|
||||||
|
str_length(x)
|
||||||
|
str_sub(x, 1, 1)
|
||||||
|
```
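One way to make the two representations comparable is Unicode normalisation. The sketch below is an addition, not part of this commit; it uses `stringi::stri_trans_nfc()` to compose both strings into NFC form before measuring them.

```r
library(stringr)

# Precomposed "á" versus "a" followed by a combining acute accent
x <- c("\u00e1", "a\u0301")

# Compose both strings to NFC so each accented letter is one code point
x_nfc <- stringi::stri_trans_nfc(x)

str_length(x_nfc)
x_nfc[1] == x_nfc[2]
```

Normalising early, alongside the encoding conversion, means later calls to `str_length()` and `str_sub()` see the same form regardless of how the text was originally typed.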
|
||||||
|
|
||||||
### Locales
|
### Locales
|
||||||
|
|
||||||
|
|