From 9091a1484d3d6199c95c8ba8259187bdd7764ac4 Mon Sep 17 00:00:00 2001 From: Hadley Wickham Date: Fri, 23 Apr 2021 08:07:16 -0500 Subject: [PATCH] Noodling on strings --- DESCRIPTION | 1 + regexps.Rmd | 15 +++++++ strings.Rmd | 124 +++++++++++++++++++++++++++++++++++----------------- 3 files changed, 99 insertions(+), 41 deletions(-) diff --git a/DESCRIPTION b/DESCRIPTION index b7f9a24..817b6da 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -14,6 +14,7 @@ URL: https://github.com/hadley/r4ds Depends: R (>= 3.1.0) Imports: + babynames, feather, gapminder, ggrepel, diff --git a/regexps.Rmd b/regexps.Rmd index ad9707e..4b47d6c 100644 --- a/regexps.Rmd +++ b/regexps.Rmd @@ -123,6 +123,17 @@ For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, Since this list is long, you might want to use the `match` argument to `str_view()` to show only the matching or non-matching words. +## Overlapping and zero-width patterns + +Note that matches never overlap. +For example, in `"abababa"`, how many times will the pattern `"aba"` match? +Regular expressions say two, not three: + +```{r} +str_count("abababa", "aba") +str_view_all("abababa", "aba") +``` + ## Character classes and alternatives There are a number of special patterns that match more than one character. @@ -259,6 +270,9 @@ sentences %>% head(5) ``` +Names that start and end with the same letter. +Implement with `str_sub()` instead. + ### Exercises 1. Describe, in words, what these expressions will match: @@ -443,3 +457,4 @@ See the Stack Overflow discussion at for mor Don't forget that you're in a programming language and you have other tools at your disposal. Instead of creating one complex regular expression, it's often easier to write a series of simpler regexps. If you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one. 
+ diff --git a/strings.Rmd b/strings.Rmd index afd9b67..292c9fe 100644 --- a/strings.Rmd +++ b/strings.Rmd @@ -2,12 +2,17 @@ ## Introduction -This chapter introduces you to string manipulation in R. +This chapter introduces you to strings in R. You'll learn the basics of how strings work and how to create them by hand. Big topic so spread over three chapters. Base R contains many functions to work with strings but we'll generally avoid them here because they can be inconsistent, which makes them hard to remember. Instead, we'll use stringr which is designed to be as consistent as possible, and all of its functions start with `str_`. +The common `str_` prefix is particularly useful if you use RStudio, because typing `str_` will trigger autocomplete, allowing you to see all stringr functions: + +```{r, echo = FALSE} +knitr::include_graphics("screenshots/stringr-autocomplete.png") +``` ### Prerequisites @@ -15,6 +20,7 @@ This chapter will focus on the **stringr** package for string manipulation, whic ```{r setup, message = FALSE} library(tidyverse) +library(babynames) ``` ## Creating a string @@ -86,7 +92,7 @@ If your string contains `)"` you can instead use `r"[]"` or `r"{}"`, and if that ### Other special characters -As well as `\"`, `\'`, and `\\` there are a handful of other special characters that may come in handy. The most common are `"\n"`, newline, and `"\t"`, tab, but you can see the complete list by requesting help on `"`: `?'"'`, or `?"'"`. +As well as `\"`, `\'`, and `\\` there are a handful of other special characters that may come in handy. The most common are `"\n"`, newline, and `"\t"`, tab, but you can see the complete list by requesting help on `"` with `?'"'` or `?"'"`. You'll also sometimes see strings containing Unicode escapes like `"\u00b5"`. 
This is a way of writing non-English characters that works on all platforms: @@ -105,12 +111,6 @@ str_c("x", "y") str_c("x", "y", "z") ``` -The common `str_` prefix is particularly useful if you use RStudio, because typing `str_` will trigger autocomplete, allowing you to see all stringr functions: - -```{r, echo = FALSE} -knitr::include_graphics("screenshots/stringr-autocomplete.png") -``` - Use the `sep` argument to control how they're separated: ```{r} @@ -126,24 +126,24 @@ str_c("|-", x, "-|") str_c("|-", coalesce(x, ""), "-|") ``` -`str_c()` is vectorised which means that it automatically recycles individual strings to the same length as the longest vector input: - -```{r} -str_c("prefix-", c("a", "b", "c"), "-suffix") -``` - `mutate()` -## Flattening strings - -To collapse a vector of strings into a single string, use `collapse`: +Another powerful way of combining strings is with the glue package. +You can either use `glue::glue()` or call it via the `str_glue()` wrapper that stringr provides for you. +Glue works a little differently to the other methods: you give it a single string using `{}` to indicate where you want to interpolate in existing variables: ```{r} -str_flatten(c("x", "y", "z"), ", ") +str_glue("|-{x}-|") ``` -This is a great tool for `summarise()`ing character data. -Later we'll come back to the inverse of this, `separate_rows()`. +Like `str_c()`, `str_glue()` pairs well with `mutate()`: + +```{r} +starwars %>% mutate( + intro = str_glue("Hi my name is {name} and I'm a {species} from {homeworld}"), + .keep = "none" +) +``` ## Length and subsetting @@ -153,6 +153,13 @@ For example, `str_length()` tells you the length of a string: str_length(c("a", "R for data science", NA)) ``` +You could use this with `count()` to find the distribution of lengths of US babynames: + +```{r} +babynames %>% + count(length = str_length(name)) +``` + You can extract parts of a string using `str_sub()`. 
As well as the string, `str_sub()` takes `start` and `end` arguments which give the (inclusive) position of the substring: @@ -163,6 +170,16 @@ str_sub(x, 1, 3) str_sub(x, -3, -1) ``` +We could use this with `mutate()` to find the first and last letter of each name: + +```{r} +babynames %>% + mutate( + first = str_sub(name, 1, 1), + last = str_sub(name, -1, -1) + ) +``` + Note that `str_sub()` won't fail if the string is too short: it will just return as much as possible: ```{r} @@ -189,6 +206,19 @@ TODO: `separate()` 4. Write a function that turns (e.g.) a vector `c("a", "b", "c")` into the string `a, b, and c`. Think carefully about what it should do if given a vector of length 0, 1, or 2. +## String summaries + +You can perform the opposite operation with `summarise()` and `str_flatten()`: + +To collapse a vector of strings into a single string, use `collapse`: + +```{r} +str_flatten(c("x", "y", "z"), ", ") +``` + +This is a great tool for `summarise()`ing character data. +Later we'll come back to the inverse of this, `separate_rows()`. + ## Long strings `str_wrap()` @@ -234,15 +264,14 @@ The results are identical, but I think the first approach is significantly easie If your regular expression gets overly complicated, try breaking it up into smaller pieces, giving each piece a name, and then combining the pieces with logical operations. A common use of `str_detect()` is to select the elements that match a pattern. -This makes it a natural pairing with `filter()`: +This makes it a natural pairing with `filter()`. 
+The following regexp finds all names with repeated pairs of letters (you'll learn how that regexp works in the next chapter): ```{r} -df <- tibble( - word = words, - i = seq_along(word) -) -df %>% - filter(str_detect(word, "x$")) +babynames %>% + filter(n > 100) %>% + count(name, wt = n) %>% + filter(str_detect(name, "(..).*\\1")) ``` A variation on `str_detect()` is `str_count()`: rather than a simple yes or no, it tells you how many matches there are in a string: @@ -258,22 +287,13 @@ mean(str_count(words, "[aeiou]")) It's natural to use `str_count()` with `mutate()`: ```{r} -df %>% +babynames %>% mutate( - vowels = str_count(word, "[aeiou]"), - consonants = str_count(word, "[^aeiou]") + vowels = str_count(name, "[aeiou]"), + consonants = str_count(name, "[^aeiou]") ) ``` -Note that matches never overlap. -For example, in `"abababa"`, how many times will the pattern `"aba"` match? -Regular expressions say two, not three: - -```{r} -str_count("abababa", "aba") -str_view_all("abababa", "aba") -``` - ### Exercises 1. For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple `str_detect()` calls. @@ -383,6 +403,8 @@ tibble(sentence = sentences) %>% 2. Find all contractions. Separate out the pieces before and after the apostrophe. +## Strings -\> Columns + ## Separate `separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears. @@ -416,6 +438,15 @@ table3 %>% `separate_rows()` +## Strings -\> Rows + +```{r} +starwars %>% + select(name, eye_color) %>% + filter(str_detect(eye_color, ", ")) %>% + separate_rows(eye_color) +``` + ### Exercises 1. Split up a string like `"apples, pears, and bananas"` into individual components. @@ -427,11 +458,22 @@ table3 %>% ## Other languages {#other-languages} -### Length +Encoding, and why not to trust `Encoding`. +As a general rule, we recommend using UTF-8 everywhere, converting as early as possible (i.e. 
by using the `encoding` argument to `readr::locale()`). + +### Length and subsetting This seems like a straightforward computation if you're only familiar with English, but things get complex quick when working with other languages. Include some examples from . -(Maybe better to include a non-English text section later?) + +This is a problem even with European languages because Unicode provides two ways of representing characters with accents: many common characters have a special codepoint, but others can be built up from individual components. + +```{r} +x <- c("\u00e1", "a\u0301") +x +str_length(x) +str_sub(x, 1, 1) +``` ### Locales