diff --git a/list-columns.qmd b/list-columns.qmd deleted file mode 100644 index 65d0576..0000000 --- a/list-columns.qmd +++ /dev/null @@ -1,26 +0,0 @@ -# List columns {#sec-list-columns} - -```{r} -#| results: "asis" -#| echo: false -source("_common.R") -status("drafting") -``` - -## Introduction - - - -### Prerequisites - -In this chapter we'll continue using tidyr, which also provides a bunch of tools to rectangle your datasets. -tidyr is a member of the core tidyverse. - -```{r} -#| label: setup -#| message: false - -library(tidyverse) -``` - - diff --git a/prog-strings.qmd b/prog-strings.qmd deleted file mode 100644 index 14c7b58..0000000 --- a/prog-strings.qmd +++ /dev/null @@ -1,301 +0,0 @@ -# Programming with strings {#sec-programming-with-strings} - -```{r} -#| results: "asis" -#| echo: false -source("_common.R") -status("drafting") -``` - -```{r} -library(stringr) -library(tidyr) -library(tibble) -``` - -### Encoding - -You will not generally find the base R `Encoding()` to be useful because it only supports three different encodings (and interpreting what they mean is non-trivial) and it only tells you the encoding that R thinks it is, not what it really is. -And typically the problem is that the declaring encoding is wrong. - -The tidyverse follows best practices[^prog-strings-1] of using UTF-8 everywhere, so any string you create with the tidyverse will use UTF-8. -It's still possible to have problems, but they'll typically arise during data import. -Once you've diagnosed you have an encoding problem, you should fix it in data import (i.e. by using the `encoding` argument to `readr::locale()`). - -[^prog-strings-1]: - -### Length and subsetting - -This seems like a straightforward computation if you're only familiar with English, but things get complex quick when working with other languages. - -Four most common are Latin, Chinese, Arabic, and Devangari, which represent three different systems of writing systems: - -- Latin uses an alphabet, where each consonant and vowel gets its own letter. - -- Chinese. - Logograms. - Half width vs. full width. - English letters are roughly twice as high as they are wide. - Chinese characters are roughly square. - -- Arabic is an abjad, only consonants are written and vowels are optionally as diacritics. - Additionally, it's written from right-to-left, so the first letter is the letter on the far right. - -- Devangari is an abugida where each symbol represents a consonant-vowel pair, , vowel notation secondary. - -> For instance, 'ch' is two letters in English and Latin, but considered to be one letter in Czech and Slovak. -> --- - -```{r} -# But -str_split("check", boundary("character", locale = "cs_CZ")) -``` - -This is a problem even with Latin alphabets because many languages use **diacritics**, glyphs added to the basic alphabet. -This is a problem because Unicode provides two ways of representing characters with accents: many common characters have a special codepoint, but others can be built up from individual components. - -```{r} -x <- c("á", "x́") -str_length(x) -# str_width(x) -str_sub(x, 1, 1) - -# stri_width(c("全形", "ab")) -# 0, 1, or 2 -# but this assumes no font substitution -``` - -```{r} -cyrillic_a <- "А" -latin_a <- "A" -cyrillic_a == latin_a -stringi::stri_escape_unicode(cyrillic_a) -stringi::stri_escape_unicode(latin_a) -``` - -### str_c - -`NULL`s are silently dropped. -This is particularly useful in conjunction with `if`: - -```{r} -name <- "Hadley" -time_of_day <- "morning" -birthday <- FALSE - -str_c( - "Good ", time_of_day, " ", name, - if (birthday) " and HAPPY BIRTHDAY", - "." -) -``` - -### `str_dup()` - -Closely related to `str_c()` is `str_dup()`. -`str_c(a, a, a)` is like `a + a + a`, what's the equivalent of `3 * a`? -That's `str_dup()`: - -```{r} -str_dup(letters[1:3], 3) -str_dup("a", 1:3) -``` - -## Performance - -`fixed()`: matches exactly the specified sequence of bytes. -It ignores all special regular expressions and operates at a very low level. -This allows you to avoid complex escaping and can be much faster than regular expressions. -The following microbenchmark shows that it's about 3x faster for a simple example. - -```{r} -microbenchmark::microbenchmark( - fixed = str_detect(sentences, fixed("the")), - regex = str_detect(sentences, "the"), - times = 20 -) -``` - -As you saw with `str_split()` you can use `boundary()` to match boundaries. -You can also use it with the other functions: - -```{r} -x <- "This is a sentence." -str_view_all(x, boundary("word")) -str_extract_all(x, boundary("word")) -``` - -### Extract - -```{r} -colors <- c("red", "orange", "yellow", "green", "blue", "purple") -color_match <- str_c(colors, collapse = "|") -color_match - -more <- sentences[str_count(sentences, color_match) > 1] -str_extract_all(more, color_match) -``` - -If you use `simplify = TRUE`, `str_extract_all()` will return a matrix with short matches expanded to the same length as the longest: - -```{r} - -str_extract_all(more, color_match, simplify = TRUE) - -x <- c("a", "a b", "a b c") -str_extract_all(x, "[a-z]", simplify = TRUE) -``` - -We don't talk about matrices here, but they are useful elsewhere. - -### Exercises - -1. From the Harvard sentences data, extract: - - 1. The first word from each sentence. - 2. All words ending in `ing`. - 3. All plurals. - -## Grouped matches - -Earlier in this chapter we talked about the use of parentheses for clarifying precedence and for backreferences when matching. -You can also use parentheses to extract parts of a complex match. -For example, imagine we want to extract nouns from the sentences. -As a heuristic, we'll look for any word that comes after "a" or "the". -Defining a "word" in a regular expression is a little tricky, so here we use a simple approximation: a sequence of at least one character that isn't a space. - -```{r} -noun <- "(a|the) ([^ ]+)" - -has_noun <- sentences |> - str_subset(noun) |> - head(10) -has_noun |> - str_extract(noun) -``` - -`str_extract()` gives us the complete match; `str_match()` gives each individual component. -Instead of a character vector, it returns a matrix, with one column for the complete match followed by one column for each group: - -```{r} -has_noun |> - str_match(noun) -``` - -(Unsurprisingly, our heuristic for detecting nouns is poor, and also picks up adjectives like smooth and parked.) - -## Splitting - -Use `str_split()` to split a string up into pieces. -For example, we could split sentences into words: - -```{r} -sentences |> - head(5) |> - str_split(" ") -``` - -Because each component might contain a different number of pieces, this returns a list. -If you're working with a length-1 vector, the easiest thing is to just extract the first element of the list: - -```{r} -str_split("a|b|c|d", "\\|")[[1]] -``` - -Otherwise, like the other stringr functions that return a list, you can use `simplify = TRUE` to return a matrix: - -```{r} -sentences |> - head(5) |> - str_split(" ", simplify = TRUE) -``` - -You can also request a maximum number of pieces: - -```{r} -fields <- c("Name: Hadley", "Country: NZ", "Age: 35") -fields |> str_split(": ", n = 2, simplify = TRUE) -``` - -Instead of splitting up strings by patterns, you can also split up by character, line, sentence and word `boundary()`s: - -```{r} -x <- "This is a sentence. This is another sentence." -str_view_all(x, boundary("word")) - -str_split(x, " ")[[1]] -str_split(x, boundary("word"))[[1]] -``` - -Show how `separate_rows()` is a special case of `str_split()` + `summarize()`. - -## Replace with function - -## Locations - -`str_locate()` and `str_locate_all()` give you the starting and ending positions of each match. -These are particularly useful when none of the other functions does exactly what you want. -You can use `str_locate()` to find the matching pattern, `str_sub()` to extract and/or modify them. - -## stringi - -stringr is built on top of the **stringi** package. -stringr is useful when you're learning because it exposes a minimal set of functions, which have been carefully picked to handle the most common string manipulation functions. -stringi, on the other hand, is designed to be comprehensive. -It contains almost every function you might ever need: stringi has `r length(getNamespaceExports("stringi"))` functions to stringr's `r length(getNamespaceExports("stringr"))`. - -If you find yourself struggling to do something in stringr, it's worth taking a look at stringi. -The packages work very similarly, so you should be able to translate your stringr knowledge in a natural way. -The main difference is the prefix: `str_` vs. `stri_`. - -### Exercises - -1. Find the stringi functions that: - - a. Count the number of words. - b. Find duplicated strings. - c. Generate random text. - -2. How do you control the language that `stri_sort()` uses for sorting? - -### Exercises - -1. What do the `extra` and `fill` arguments do in `separate()`? - Experiment with the various options for the following two toy datasets. - - ```{r} - #| eval: false - - tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) |> - separate(x, c("one", "two", "three")) - - tibble(x = c("a,b,c", "d,e", "f,g,i")) |> - separate(x, c("one", "two", "three")) - ``` - -2. Both `unite()` and `separate()` have a `remove` argument. - What does it do? - Why would you set it to `FALSE`? - -3. Compare and contrast `separate()` and `extract()`. - Why are there three variations of separation (by position, by separator, and with groups), but only one unite? - -4. In the following example we're using `unite()` to create a `date` column from `month` and `day` columns. - How would you achieve the same outcome using `mutate()` and `paste()` instead of unite? - - ```{r} - #| eval: false - - events <- tribble( - ~month, ~day, - 1 , 20, - 1 , 21, - 1 , 22 - ) - - events |> - unite("date", month:day, sep = "-", remove = FALSE) - ``` - -5. Write a function that turns (e.g.) a vector `c("a", "b", "c")` into the string `a, b, and c`. - Think carefully about what it should do if given a vector of length 0, 1, or 2.