diff --git a/prog-strings.Rmd b/prog-strings.Rmd
index e7174b6..c3989d6 100644
--- a/prog-strings.Rmd
+++ b/prog-strings.Rmd
@@ -10,6 +10,66 @@ library(tidyr)
library(tibble)
```

+### Encoding
+
+You will not generally find the base R `Encoding()` to be useful because it only supports three different encodings (and interpreting what they mean is non-trivial) and it only tells you the encoding that R thinks it is, not what it really is.
+And typically the problem is that the declared encoding is wrong.
+
+The tidyverse follows best practices[^prog-strings-1] of using UTF-8 everywhere, so any string you create with the tidyverse will use UTF-8.
+It's still possible to have problems, but they'll typically arise during data import.
+Once you've diagnosed that you have an encoding problem, you should fix it in data import (i.e. by using the `encoding` argument to `readr::locale()`).
+
+[^prog-strings-1]: 
+
+### Length and subsetting
+
+This seems like a straightforward computation if you're only familiar with English, but things get complex quickly when working with other languages.
+
+The four most common writing systems are Latin, Chinese, Arabic, and Devanagari, which represent four different approaches to writing:
+
+- Latin uses an alphabet, where each consonant and vowel gets its own letter.
+
+- Chinese uses logograms, where a single character typically represents a word or morpheme.
+  Width differs too: English letters are roughly twice as high as they are wide, while Chinese characters are roughly square (hence "half-width" versus "full-width" characters).
+
+- Arabic is an abjad: only consonants are written, and vowels are optionally indicated with diacritics.
+  Additionally, it's written from right-to-left, so the first letter is the letter on the far right.
+
+- Devanagari is an abugida, where each symbol represents a consonant-vowel pair; the vowel notation is secondary.
+
+> For instance, 'ch' is two letters in English and Latin, but considered to be one letter in Czech and Slovak.
+> --- 
+
+```{r}
+# But
+str_split("check", boundary("character", locale = "cs_CZ"))
+```
+
+Counting letters is tricky even with Latin alphabets because many languages use **diacritics**, glyphs added to the basic alphabet.
+It's a problem because Unicode provides two ways of representing characters with accents: many common characters have a special codepoint, but others can be built up from individual components.
+
+```{r}
+x <- c("á", "x́")
+str_length(x)
+# str_width(x)
+str_sub(x, 1, 1)
+
+# stri_width(c("全形", "ab"))
+# 0, 1, or 2
+# but this assumes no font substitution
+```
+
+```{r}
+cyrillic_a <- "А"
+latin_a <- "A"
+cyrillic_a == latin_a
+stringi::stri_escape_unicode(cyrillic_a)
+stringi::stri_escape_unicode(latin_a)
+```
+
### str_c

`NULL`s are silently dropped.
@@ -51,8 +111,6 @@ str_view_all(x, boundary("word"))
str_extract_all(x, boundary("word"))
```

-### 
-
### Extract

```{r}
diff --git a/regexps.Rmd b/regexps.Rmd
index d17136b..cf9bced 100644
--- a/regexps.Rmd
+++ b/regexps.Rmd
@@ -264,7 +264,7 @@ Collectively, these operators are called **quantifiers** because they quantify h
    b. Have three or more vowels in a row.
    c. Have two or more vowel-consonant pairs in a row.

-4. Solve the beginner regexp crosswords at [\](https://regexcrossword.com/challenges/beginner){.uri}.
+4. Solve the beginner regexp crosswords at [\](https://regexcrossword.com/challenges/beginner){.uri}.

## Grouping and backreferences

@@ -475,3 +475,9 @@ See the Stack Overflow discussion at for mor
Don't forget that you're in a programming language and you have other tools at your disposal.
Instead of creating one complex regular expression, it's often easier to write a series of simpler regexps.
If you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.
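+
+For example, it's much easier to find all `words` that contain an "a", an "e", and an "i" by combining three simple `str_detect()` calls with `&` than by writing a single regular expression that handles every possible ordering (a quick sketch, using the `words` vector that comes with stringr):
+
+```{r}
+words[
+  str_detect(words, "a") &
+    str_detect(words, "e") &
+    str_detect(words, "i")
+]
+```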
+
+### Exercises
+
+1. In the previous example, you might have noticed that the regular expression matched "flickered", which is not a colour. Modify the regex to fix the problem.
+2. Find all words that come after a "number" like "one", "two", "three" etc. Pull out both the number and the word.
+3. Find all contractions. Separate out the pieces before and after the apostrophe.
diff --git a/strings.Rmd b/strings.Rmd
index ea77d0f..b43faa3 100644
--- a/strings.Rmd
+++ b/strings.Rmd
@@ -6,12 +6,14 @@ status("restructuring")

## Introduction

-This chapter introduces you to strings.
-You'll learn the basics of how strings work in R and how to create them "by hand".
-You'll also learn the basics of regular expressions, a powerful, but sometimes cryptic language for describing string patterns.
-Regular expression are a big topic, so we'll come back to them again in Chapter \@ref(regular-expressions) to discuss more of the details.
-We'll finish up with a discussion of some of the new challenges that arise when working with non-English strings.
+So far, we've used a bunch of strings without really talking about how they work or the powerful tools you have to work with them.
+This chapter begins by diving into the details of creating strings, and from strings, character vectors.
+You'll then learn a grab bag of handy string functions before we dive into creating strings from data, then extracting data from strings.
+We'll then cover the basics of regular expressions, a powerful, but very concise and sometimes cryptic, language for describing patterns in strings.
+The chapter concludes with a brief discussion of where your expectations from English might steer you wrong when working with text from other languages.
+
+This chapter is paired with two other chapters.
+Regular expressions are a big topic, so we'll come back to them again in Chapter \@ref(regular-expressions).
+We'll come back to strings again in Chapter \@ref(programming-with-strings), where we'll think about them more from a programming perspective than a data analysis perspective.

### Prerequisites

@@ -55,13 +57,6 @@ If you forget to close a quote, you'll see `+`, the continuation character:

If this happen to you and you can't figure out which quote you need to close, press Escape to cancel, then try again.

-You can combine multiple strings into a character vector by using `c()`:
-
-```{r}
-x <- c("first string", "second string", "third string")
-x
-```
-
### Escapes

To include a literal single or double quote in a string you can use `\` to "escape" it:
@@ -127,7 +122,25 @@ x
str_view(x)
```

-## Length and subsetting
+Now that you've learned the basics of creating strings by "hand", we'll go into the details of creating strings from other strings, starting with combining strings.
+
+### Vectors
+
+You can combine multiple strings into a character vector by using `c()`:
+
+```{r}
+x <- c("first string", "second string", "third string")
+x
+```
+
+You can create a length zero character vector with `character()`.
+This is not usually very useful, but giving functions an unusual input like this can help you understand the general principles they follow.
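+
+For example, here's a quick sketch of the general principle that most stringr functions follow: a length zero input gives a length zero output.
+
+```{r}
+x <- character()
+length(x)
+
+str_length(x)
+str_to_upper(x)
+```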
+
+### Exercises
+
+## Handy functions
+
+### Length

It's natural to think about the letters that make up an individual string.
(Not every language uses letters, which we'll talk about more in Section \@ref(other-languages)).
@@ -150,6 +163,8 @@ babynames %>%
  count(name, wt = n, sort = TRUE)
```

+### Subsetting
+
You can extract parts of a string using `str_sub(string, start, end)`.
The `start` and `end` arguments are inclusive, so the length of the returned string will be `end - start + 1`:

@@ -180,42 +195,7 @@ babynames %>%
  )
```

-Sometimes you'll get a column that's made up of individual fixed length strings that have been joined together:
-
-```{r}
-df <- tribble(
-  ~ sex_year_age,
-  "M200115",
-  "F201503",
-)
-```
-
-You can extract the columns using `str_sub()`:
-
-```{r}
-df %>% mutate(
-  sex = str_sub(sex_year_age, 1, 1),
-  year = str_sub(sex_year_age, 2, 5),
-  age = str_sub(sex_year_age, 6, 7),
-)
-```
-
-Or use the `separate()` helper function:
-
-```{r}
-df %>% 
-  separate(sex_year_age, c("sex", "year", "age"), c(1, 5))
-```
-
-Note that you give `separate()` three columns but only two positions --- that's because you're telling `separate()` where to break up the string.
-
-TODO: draw diagram to emphasise that it's the space between the characters.
-
-Later on, we'll come back two related problems: the components have varying length and are a separated by a character, or they have an varying number of components and you want to split up into rows, rather than columns.
-
-### Exercises
-
-1. Use `str_length()` and `str_sub()` to extract the middle letter from each baby name. What will you do if the string has an even number of characters?
+Later, we'll come back to the problem of extracting data from strings.

### Long strings

@@ -233,7 +213,9 @@ str_trunc(x, 30)
str_view(str_wrap(x, 30))
```

-## 
+### Exercises
+
+1. Use `str_length()` and `str_sub()` to extract the middle letter from each baby name. What will you do if the string has an even number of characters?

## Combining strings

@@ -278,6 +260,16 @@ starwars %>%
  mutate(greeting = str_c("Hi! I'm ", name, "."), .after = name)
```

+### `str_dup()`
+
+`str_c(a, a, a)` is like `a + a + a`; what's the equivalent of `3 * a`?
+That's `str_dup()`:
+
+```{r}
+str_dup(letters[1:3], 3)
+str_dup("a", 1:3)
+```
+
### Glue

Another powerful way of combining strings is with the glue package.
@@ -301,12 +293,13 @@ starwars %>%

You can use any valid R code inside of `{}`, but it's a good idea to pull complex calculations out into their own variables so you can more easily check your work.

-Differences with `NA` handling.
+Differences with `NA` handling?

### `str_flatten()`

-`str_c()` combines multiple character vectors into a single character vector; the output is the same length as the input.
-An related function is `str_flatten()`:[^strings-7] it takes a character vector and returns a single string:
+So far I've shown you vectorised functions that work well with `mutate()`: the output of these functions is the same length as the input.
+There's one last important function that's a summary function: the output is always length 1, regardless of the length of the input.
+That's `str_flatten()`:[^strings-7] it takes a character vector and always returns a single string:

[^strings-7]: The base R equivalent is `paste()` with the `collapse` argument set.
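+
+To make the contrast concrete, here's a small sketch (with made-up data): `str_c()` returns one string per input element, while `str_flatten()` collapses the whole input into a single string:
+
+```{r}
+x <- c("x", "y", "z")
+
+str_c(x, "!")    # vectorised: length 3 in, length 3 out
+str_flatten(x)   # summary: always length 1
+```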
@@ -336,7 +329,7 @@ df %>%

### Exercises

-1. Compare the results of `paste0()` with `str_c()` for the following inputs:
+1. Compare and contrast the results of `paste0()` with `str_c()` for the following inputs:

    ```{r, eval = FALSE}
    str_c("hi ", NA)
    str_c(letters[1:2], letters[1:3])
    ```

+2. What does `str_flatten()` return if you give it a length 0 character vector?
+
## Splitting apart strings

-## Detect matches
+It's common for multiple variables' worth of data to be stored in a single string.
+In this section you'll learn how to use various tidyr functions to extract them.
+
+Waiting on: 
+
+## Working with patterns
+
+### Detect matches

To determine if a character vector matches a pattern, use `str_detect()`.
It returns a logical vector the same length as the input:

@@ -377,6 +379,8 @@ babynames %>%

(Note that this gives us the proportion of names that contain an x; if you wanted the proportion of babies given a name containing an x, you'd need to perform a weighted mean).

+### Count matches
+
A variation on `str_detect()` is `str_count()`: rather than a simple yes or no, it tells you how many matches there are in a string:

```{r}
@@ -394,14 +398,23 @@ babynames %>%
  )
```

-### Exercises
+You might also wonder if any names include special characters, like periods:

-1. What word has the highest number of vowels? What word has the highest proportion of vowels? (Hint: what is the denominator?)
+```{r}
+babynames %>% 
+  distinct(name) %>% 
+  head() %>% 
+  mutate(
+    periods = str_count(name, "."),
+  )
+```

-## Introduction to regular expressions
+That's weird!

-Before we can continue on we need to discuss the second argument to `str_detect()` --- the pattern that you want to match.
-Above, I used a simple string, but the pattern actually a much richer tool called a **regular expression**.
+### Introduction to regular expressions
+
+To understand what's going on, we need to discuss what the second argument to `str_detect()` really is.
+It looks like a simple string, but it's actually a much richer tool called a **regular expression**.

A regular expression uses special characters to match string patterns.
For example, `.` will match any character, so `"a."` will match any string that contains an a followed by another character:

@@ -426,17 +439,6 @@ There are three useful **quantifiers** that can be applied to other pattern: `?`

- `ab*` matches an "a", followed by any number of bs

-You can use `()` to control precedence:
-
-- `(ab)?` optionally matches "ab"
-
-- `(ab)+` matches one or more "ab" repeats
-
-```{r}
-str_view(c("aba", "ababab", "abbbbbb"), "ab+")
-str_view(c("aba", "ababab", "abbbbbb"), "(ab)+")
-```
-
There are various alternatives to `.` that match a restricted set of characters.
One useful operator is the **character class:** `[abcd]` match "a", "b", "c", or "d"; `[^abcd]` matches anything **except** "a", "b", "c", or "d".

@@ -457,15 +459,7 @@ str_view_all("x X xy", regex(".Y", ignore_case = TRUE))

We'll come back to case later, because it's not trivial for many languages.

-### Exercises
-
-1. For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple `str_detect()` calls.
-
-    a. Find all words that start or end with `x`.
-    b. Find all words that start with a vowel and end with a consonant.
-    c. Are there any words that contain at least one of each different vowel?
-
-## Replacing matches
+### Replacing matches

`str_replace_all()` allow you to replace matches with new strings.
The simplest use is to replace a pattern with a fixed string:

@@ -490,226 +484,76 @@ Use in `mutate()`

Using pipe inside mutate.
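+
+For example, here's a minimal sketch of using `str_replace_all()` inside `mutate()` with a pipe, reusing the `babynames` data from earlier (the particular replacement is just for illustration):
+
+```{r}
+babynames %>% 
+  mutate(vowels_hidden = str_replace_all(name, "[aeiou]", "-")) %>% 
+  select(name, vowels_hidden) %>% 
+  head()
+```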
Recommendation to make a function, and think about testing it --- don't need formal tests, but useful to build up a set of positive and negative test cases as you.

-#### Exercises
+### Exercises

-1. Replace all forward slashes in a string with backslashes.
+1. What word has the highest number of vowels?
+   What word has the highest proportion of vowels?
+   (Hint: what is the denominator?)

-2. Implement a simple version of `str_to_lower()` using `str_replace_all()`.
+2. For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple `str_detect()` calls.

-3. Switch the first and last letters in `words`.
+   a. Find all words that start or end with `x`.
+   b. Find all words that start with a vowel and end with a consonant.
+   c. Are there any words that contain at least one of each different vowel?
+
+3. Replace all forward slashes in a string with backslashes.
+
+4. Implement a simple version of `str_to_lower()` using `str_replace_all()`.
+
+5. Switch the first and last letters in `words`.
   Which of those strings are still `words`?

-## Extract full matches
+## Locale-dependent operations {#other-languages}

-If your data is in a tibble, it's often easier to use `tidyr::extract()`.
-It works like `str_match()` but requires you to name the matches, which are then placed in new columns:
+So far all of our examples have been using English.
+The many ways other languages differ from English are too diverse to detail here, but I wanted to give a quick outline of the functions whose behaviour differs based on your **locale**, the set of regional settings that vary from country to country.

-```{r}
-tibble(sentence = sentences) %>%
-  tidyr::extract(
-    sentence, c("article", "noun"), "(a|the) ([^ ]+)",
-    remove = FALSE
-  )
-```
-
-### Exercises
-
-1. In the previous example, you might have noticed that the regular expression matched "flickered", which is not a colour. Modify the regex to fix the problem.
-2. Find all words that come after a "number" like "one", "two", "three" etc. Pull out both the number and the word.
-3. Find all contractions. Separate out the pieces before and after the apostrophe.
-
-## Strings -> Columns
-
-## Separate
-
-`separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears.
-Take `table3`:
-
-```{r}
-table3
-```
-
-The `rate` column contains both `cases` and `population` variables, and we need to split it into two variables.
-`separate()` takes the name of the column to separate, and the names of the columns to separate into, as shown in Figure \@ref(fig:tidy-separate) and the code below.
-
-```{r}
-table3 %>% 
-  separate(rate, into = c("cases", "population"))
-```
-
-```{r tidy-separate, echo = FALSE, out.width = "75%", fig.cap = "Separating `rate` into `cases` and `population` to make `table3` tidy", fig.alt = "Two panels, one with a data frame with three columns (country, year, and rate) and the other with a data frame with four columns (country, year, cases, and population). Arrows show how the rate variable is separated into two variables: cases and population."}
-knitr::include_graphics("images/tidy-17.png")
-```
-
-By default, `separate()` will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter).
-For example, in the code above, `separate()` split the values of `rate` at the forward slash characters.
-If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`.
-For example, we could rewrite the code above as:
-
-```{r eval = FALSE}
-table3 %>% 
-  separate(rate, into = c("cases", "population"), sep = "/")
-```
-
-`separate_rows()`
-
-## Strings -> Rows
-
-```{r}
-starwars %>% 
-  select(name, eye_color) %>% 
-  filter(str_detect(eye_color, ", ")) %>% 
-  separate_rows(eye_color)
-```
-
-### Exercises
-
-1. Split up a string like `"apples, pears, and bananas"` into individual components.
-
-2. Why is it better to split up by `boundary("word")` than `" "`?
-
-3. What does splitting with an empty string (`""`) do?
-   Experiment, and then read the documentation.
-
-## Other writing systems {#other-languages}
-
-Unicode is a system for representing the many writing systems used around the world.
-Fundamental unit is a **code point**.
-This usually represents something like a letter or symbol, but might also be formatting like a diacritic mark or a (e.g.) the skin tone of an emoji.
-Character vs grapheme cluster.
-
-Include some examples from .
-
-All stringr functions default to the English locale.
-This ensures that your code works the same way on every system, avoiding subtle bugs.
-
-Maybe things you think are true, but aren't list?
-
-### Encoding
-
-You will not generally find the base R `Encoding()` to be useful because it only supports three different encodings (and interpreting what they mean is non-trivial) and it only tells you the encoding that R thinks it is, not what it really is.
-And typically the problem is that the declaring encoding is wrong.
-
-The tidyverse follows best practices[^strings-8] of using UTF-8 everywhere, so any string you create with the tidyverse will use UTF-8.
-It's still possible to have problems, but they'll typically arise during data import.
-Once you've diagnosed you have an encoding problem, you should fix it in data import (i.e. by using the `encoding` argument to `readr::locale()`).
-
-[^strings-8]: 
-
-### Length and subsetting
-
-This seems like a straightforward computation if you're only familiar with English, but things get complex quick when working with other languages.
-
-Four most common are Latin, Chinese, Arabic, and Devangari, which represent three different systems of writing systems:
-
-- Latin uses an alphabet, where each consonant and vowel gets its own letter.
-
-- Chinese.
-  Logograms.
-  Half width vs full width.
-  English letters are roughly twice as high as they are wide.
-  Chinese characters are roughly square.
-
-- Arabic is an abjad, only consonants are written and vowels are optionally as diacritics.
-  Additionally, it's written from right-to-left, so the first letter is the letter on the far right.
-
-- Devangari is an abugida where each symbol represents a consonant-vowel pair, , vowel notation secondary.
-
-> For instance, 'ch' is two letters in English and Latin, but considered to be one letter in Czech and Slovak.
-> --- 
-
-```{r}
-# But
-str_split("check", boundary("character", locale = "cs_CZ"))
-```
-
-This is a problem even with Latin alphabets because many languages use **diacritics**, glyphs added to the basic alphabet.
-This is a problem because Unicode provides two ways of representing characters with accents: many common characters have a special codepoint, but others can be built up from individual components.
-
-```{r}
-x <- c("á", "x́")
-str_length(x)
-# str_width(x)
-str_sub(x, 1, 1)
-
-# stri_width(c("全形", "ab"))
-# 0, 1, or 2
-# but this assumes no font substitution
-```
-
-```{r}
-cyrillic_a <- "А"
-latin_a <- "A"
-cyrillic_a == latin_a
-stringi::stri_escape_unicode(cyrillic_a)
-stringi::stri_escape_unicode(latin_a)
-```
-
-### Collation rules
-
-`coll()`: compare strings using standard **coll**ation rules.
-This is useful for doing case insensitive matching.
-Note that `coll()` takes a `locale` parameter that controls which rules are used for comparing characters.
-Unfortunately different parts of the world use different rules!
-Both `fixed()` and `regex()` have `ignore_case` arguments, but they do not allow you to pick the locale: they always use the default locale.
-You can see what that is with the following code; more on stringi later.
-
-```{r}
-a1 <- "\u00e1"
-a2 <- "a\u0301"
-c(a1, a2)
-a1 == a2
-
-str_detect(a1, fixed(a2))
-str_detect(a1, coll(a2))
-```
-
-The downside of `coll()` is speed; because the rules for recognising which characters are the same are complicated, `coll()` is relatively slow compared to `regex()` and `fixed()`.
-
-### Upper and lower case
-
-Relatively few writing systems have upper and lower case: Latin, Greek, and Cyrillic, plus a handful of lessor known languages.
-
-Above I used `str_to_lower()` to change the text to lower case.
-You can also use `str_to_upper()` or `str_to_title()`.
-However, changing case is more complicated than it might at first appear because different languages have different rules for changing case.
-You can pick which set of rules to use by specifying a locale:
-
-```{r}
-# Turkish has two i's: with and without a dot, and it
-# has a different rule for capitalising them:
-str_to_upper(c("i", "ı"))
-str_to_upper(c("i", "ı"), locale = "tr")
-```

+- Words are broken up by spaces.
+- Words are composed of individual letters.
+- All letters in a word are written down.

The locale is specified as a ISO 639 language code, which is a two or three letter abbreviation.
If you don't already know the code for your language, [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) has a good list, and you can see which are supported with `stringi::stri_locale_list()`.

-If you leave the locale blank, it will use English.
-The locale also affects case-insensitive matching, which `coll(ignore_case = TRUE)` which you can control with `coll()`:
+Base R string functions automatically use your current locale, but stringr functions all default to the English locale.
+This ensures that your code works the same way on every system, avoiding subtle bugs.
+To choose a different locale you'll need to specify the `locale` argument; seeing that a function has a `locale` argument tells you that its behaviour will differ from locale to locale.

-```{r}
-i <- c("Iİiı")
-
-str_view_all(i, coll("i", ignore_case = TRUE))
-str_view_all(i, coll("i", ignore_case = TRUE, locale = "tr"))
-```
+Here are a few places where locales matter:

+- Upper and lower case: only relatively few languages have upper and lower case (Latin, Greek, and Cyrillic, plus a handful of lesser-known languages). The rules are not the same in every language that uses these alphabets. For example, Turkish has two i's: with and without a dot, and it has a different rule for capitalising them:

-You can also do case insensitive matching this `fixed(ignore_case = TRUE)`, but this uses a simple approximation which will not work in all cases.
+
+    ```{r}
+    str_to_upper(c("i", "ı"))
+    str_to_upper(c("i", "ı"), locale = "tr")
+    ```

-### Sorting
-
-Unicode collation algorithm: 
-
-Another important operation that's affected by the locale is sorting.
-The base R `order()` and `sort()` functions sort strings using the current locale.
-If you want robust behaviour across different computers, you may want to use `str_sort()` and `str_order()` which take an additional `locale` argument.
-
-Can also control the "strength", which determines how accents are sorted.
-
-```{r}
-str_sort(c("a", "ch", "c", "h"))
-str_sort(c("a", "ch", "c", "h"), locale = "cs_CZ")
-```
-
-TODO: add connection to `arrange()`

+- The locale also affects case-insensitive matching, which you can control with `coll(ignore_case = TRUE)`:
+
+    ```{r}
+    i <- c("Iİiı")
+
+    str_view_all(i, coll("i", ignore_case = TRUE))
+    str_view_all(i, coll("i", ignore_case = TRUE, locale = "tr"))
+    ```
+
+- Many characters with diacritics can be recorded in multiple ways: these will print identically but won't match with `fixed()`.
+
+    ```{r}
+    a1 <- "\u00e1"
+    a2 <- "a\u0301"
+    c(a1, a2)
+    a1 == a2
+
+    str_view(a1, fixed(a2))
+    str_view(a1, coll(a2))
+    ```
+
+- Another important operation that's affected by the locale is sorting. The base R `order()` and `sort()` functions sort strings using the current locale. If you want robust behaviour across different computers, you may want to use `str_sort()` and `str_order()`, which take an additional `locale` argument. Here's an example: in Czech, "ch" is a digraph that appears after `h` in the alphabet.
+
+    ```{r}
+    str_sort(c("a", "ch", "c", "h"))
+    str_sort(c("a", "ch", "c", "h"), locale = "cs")
+    ```
+
+    TODO after dplyr 1.1.0: discuss `arrange()`
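+
+    A sketch of what that discussion might show, assuming the `.locale` argument that dplyr 1.1.0 adds to `arrange()` (treat the argument name as an assumption until that release is what you have installed):
+
+    ```{r, eval = FALSE}
+    # Hypothetical until dplyr 1.1.0: `.locale` plays the same role as `locale` in str_sort()
+    tibble(x = c("a", "ch", "c", "h")) %>% 
+      arrange(x, .locale = "cs")
+    ```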