The chapter finishes with functions that work with individual letters and a brief discussion of where your expectations from English might steer you wrong when working with other languages.
You can quickly tell when you're using a stringr function because all stringr functions start with `str_`.
This is particularly useful if you use RStudio because typing `str_` will trigger autocomplete, allowing you to jog your memory of the available functions.
#| fig-alt: "`str_c` typed into the RStudio console with the autocomplete tooltip shown on top, which lists functions beginning with `str_c`. The function signature and beginning of the man page for the highlighted function from the autocomplete list are shown in a panel to its right."
There's no difference in behavior between the two, so in the interests of consistency, the [tidyverse style guide](https://style.tidyverse.org/syntax.html#character-vectors) recommends using `"`, unless the string contains multiple `"`.
Beware that the printed representation of a string is not the same as the string itself because the printed representation shows the escapes (in other words, when you print a string, you can copy and paste the output to recreate that string).
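To see the raw contents of a string, use `str_view()`. Here's a quick illustration (assuming stringr is attached via the tidyverse, as earlier in the chapter; the example strings are our own):

```{r}
double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"
backslash <- "\\"

x <- c(single_quote, double_quote, backslash)
x           # the printed representation shows the escapes
str_view(x) # str_view() shows the raw contents
```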
To illustrate the problem, let's create a string that contains the contents of the code block where we define the `double_quote` and `single_quote` variables:
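One way to write it looks like this (a sketch; what matters is the pile-up of backslashes):

```{r}
tricky <- "double_quote <- \"\\\"\" # or '\"'
single_quote <- '\\'' # or \"'\""
str_view(tricky)
```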
(This is sometimes called [leaning toothpick syndrome](https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome).) To eliminate the escaping, you can instead use a **raw string**[^strings-2]:
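For example, a sketch of the same string written as a raw string:

```{r}
tricky <- r"(double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'")"
str_view(tricky)
```

A raw string usually starts with `r"(` and finishes with `)"`.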
But if your string contains `)"` you can instead use `r"[]"` or `r"{}"`, and if that's still not enough, you can insert any number of dashes to make the opening and closing pairs unique, e.g., `r"--()--"`, `r"---()---"`, etc. Raw strings are flexible enough to handle any text.
As well as `\"`, `\'`, and `\\`, there are a handful of other special characters that may come in handy. The most common are `\n`, a new line, and `\t`, tab. You'll also sometimes see strings containing Unicode escapes that start with `\u` or `\U`. This is a way of writing non-English characters that work on all systems. You can see the complete list of other special characters in `?Quotes`.
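A few illustrative examples (the specific strings are our own):

```{r}
x <- c("one\ntwo", "one\ttwo", "\u00b5", "\U0001f604")
x
str_view(x)
```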
One of the challenges of working with text is that there's a variety of ways that white space can end up in the text, so this background helps you recognize that something strange is going on.
This will help you solve the common problem where you have some text you wrote that you want to combine with strings from a data frame.
For example, you might combine "Hello" with a `name` variable to create a greeting.
We'll show you how to do this with `str_c()` and `str_glue()` and how you can use them with `mutate()`.
That naturally raises the question of what string functions you might use with `summarize()`, so we'll finish this section with a discussion of `str_flatten()`, which is a summary function for strings.
`str_c()` is very similar to the base `paste0()`, but is designed to be used with `mutate()` by obeying the usual tidyverse rules for recycling and propagating missing values:
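A few calls showing the recycling and missing-value behavior, and how it pairs with `mutate()` (the small data frame here is our own example):

```{r}
str_c("x", "y")
str_c("x", "y", "z")
str_c("Hello ", c("John", "Susan"))
str_c("x", NA)

df <- tibble(name = c("Flora", "David", "Terra", NA))
df |> mutate(greeting = str_c("Hi ", name, "!"))
```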
If you are mixing many fixed and variable strings with `str_c()`, you'll notice that you type a lot of `"`s, making it hard to see the overall goal of the code. An alternative approach is provided by the [glue package](https://glue.tidyverse.org) via `str_glue()`[^strings-3]. You give it a single string that has a special feature: anything inside `{}` will be evaluated like it's outside of the quotes:
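For example, the greeting from above could be written like this (reusing `df`):

```{r}
df |> mutate(greeting = str_glue("Hi {name}!"))
```

One thing to watch: unlike `str_c()`, `str_glue()` currently turns a missing `name` into the literal string "NA". You might also wonder how to include a regular `{` or `}` in the output.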
The trick is that glue uses a slightly different escaping technique; instead of prefixing with a special character like `\`, you double up the special characters:
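For example (a sketch reusing `df` from above):

```{r}
df |> mutate(greeting = str_glue("{{Hi}} {name}!"))
```

And, as promised above, `str_flatten()`: it takes a character vector and always returns a single string, which makes it a natural partner for `summarize()`. A minimal sketch (the fruit data frame is made up):

```{r}
str_flatten(c("x", "y", "z"), ", ")

df2 <- tribble(
  ~name, ~fruit,
  "Carmen", "banana",
  "Carmen", "apple",
  "Marvin", "nectarine"
)
df2 |>
  group_by(name) |>
  summarize(fruits = str_flatten(fruit, ", "))
```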
- Just like with `pivot_longer()` and `pivot_wider()`, `_longer` functions make the input data frame longer by creating new rows and `_wider` functions make the input data frame wider by generating new columns.
The following two sections will give you the basic idea behind these separate functions, first separating into rows (which is a little simpler) and then separating into columns.
We'll finish off by discussing the tools that the `wider` functions give you to diagnose problems.
It's rarer to see `separate_longer_position()` in the wild, but some older datasets do use a very compact format where each character is used to record a value:
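A sketch of what that can look like (made-up data; `width = 1` splits after every character):

```{r}
df <- tibble(x = c("1211", "131", "21"))
df |>
  separate_longer_position(x, width = 1)
```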
Separating a string into columns tends to be most useful when there are a fixed number of components in each string, and you want to spread them into columns.
They are slightly more complicated than their `longer` equivalents because you need to name the columns.
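For example (made-up data; the column names here are our own choice):

```{r}
df <- tibble(x = c("a10.1.2022", "b10.2.2011", "e15.1.2015"))
df |>
  separate_wider_delim(
    x,
    delim = ".",
    names = c("code", "edition", "year")
  )
```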
What happens if some of the rows don't have the expected number of pieces?
There are two possible problems, too few or too many pieces, so `separate_wider_delim()` provides two arguments to help: `too_few` and `too_many`. Let's first look at the `too_few` case with the following sample dataset:
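The sketch below makes up such a dataset; with the default `too_few = "error"` the separation would fail, so we ask for `too_few = "debug"` instead, which is the mode discussed next:

```{r}
df <- tibble(x = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1"))

# Some rows have fewer than three pieces, so the default behavior errors;
# too_few = "debug" keeps every row and adds diagnostic columns instead.
debug <- df |>
  separate_wider_delim(
    x,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = "debug"
  )
debug
```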
When you use the debug mode, you get three extra columns added to the output: `x_ok`, `x_pieces`, and `x_remainder` (if you separate a variable with a different name, you'll get a different prefix).
Here, `x_ok` lets you quickly find the inputs that failed:
```{r}
debug |> filter(!x_ok)
```
`x_pieces` tells us how many pieces were found, compared to the expected 3 (the length of `names`).
`x_remainder` isn't useful when there are too few pieces, but we'll see it again shortly.
Sometimes looking at this debugging information will reveal a problem with your delimiter strategy or suggest that you need to do more preprocessing before separating.
You have a slightly different set of options for handling too many pieces: you can either silently "drop" any additional pieces or "merge" them all into the final column:
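A sketch with a made-up dataset where some rows have extra pieces:

```{r}
df <- tibble(x = c("1-1-1", "1-1-2", "1-3-5-6", "1-3-2", "1-3-5-7-9"))

df |>
  separate_wider_delim(
    x,
    delim = "-",
    names = c("x", "y", "z"),
    too_many = "drop"
  )

df |>
  separate_wider_delim(
    x,
    delim = "-",
    names = c("x", "y", "z"),
    too_many = "merge"
  )
```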
You could use this with `count()` to find the distribution of lengths of US babynames and then with `filter()` to look at the longest names, which happen to have 15 letters[^strings-6]:
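A sketch of both steps, assuming the `babynames` data frame from the babynames package used elsewhere in the book:

```{r}
library(babynames)

babynames |>
  count(length = str_length(name), wt = n)

babynames |>
  filter(str_length(name) == 15) |>
  count(name, wt = n, sort = TRUE)
```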
You can extract parts of a string using `str_sub(string, start, end)`, where `start` and `end` are the positions where the substring should start and end.
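For example (negative values count back from the end of the string):

```{r}
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
str_sub(x, -3, -1)
```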
1. When computing the distribution of the length of babynames, why did we use `wt = n`?
2. Use `str_length()` and `str_sub()` to extract the middle letter from each baby name. What will you do if the string has an even number of characters?
3. Are there any major trends in the length of babynames over time? What about the popularity of first and last letters?
Unfortunately, we don't have room for a full treatment of non-English languages.
Still, we wanted to draw your attention to some of the biggest challenges you might encounter: encoding, letter variations, and locale-dependent functions.
In the early days of computing, there were many competing standards for encoding non-English characters.
For example, there were two different encodings for Europe: Latin1 (aka ISO-8859-1) was used for Western European languages, and Latin2 (aka ISO-8859-2) was used for Central European languages.
Working in languages with accents poses a significant challenge when determining the position of letters (e.g., with `str_length()` and `str_sub()`), because an accented letter might be encoded as a single character (e.g., ü) or as two characters, an unaccented letter (e.g., u) combined with a diacritic mark (e.g., ¨).
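For example, both ways of writing ü print identically but behave differently (the escapes below are the standard code points for precomposed ü and for u plus a combining diaeresis):

```{r}
u <- c("\u00fc", "u\u0308")
str_view(u)
str_length(u)
str_sub(u, 1, 1)
```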
Finally, note that a comparison of these strings with `==` interprets these strings as different, while the handy `str_equal()` function in stringr recognizes that both have the same appearance:
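A quick check, continuing with `u` from above:

```{r}
u[[1]] == u[[2]]
str_equal(u[[1]], u[[2]])
```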
For example, "en" is English, "en_GB" is British English, and "en_US" is American English.
If you don't already know the code for your language, [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) has a good list, and you can see which are supported in stringr by looking at `stringi::stri_locale_list()`.
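For example (an illustrative sketch: Turkish distinguishes dotted and dotless i, and Czech treats "ch" as a single letter that sorts after "h"):

```{r}
str_to_upper(c("i", "ı"))
str_to_upper(c("i", "ı"), locale = "tr")

str_sort(c("a", "c", "ch", "h", "z"))
str_sort(c("a", "c", "ch", "h", "z"), locale = "cs")
```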
Base R string functions automatically use the locale set by your operating system.
This means that base R string functions do what you expect for your language, but your code might work differently if you share it with someone who lives in a different country.
In this chapter, you've learned about some of the power of the stringr package: how to create, combine, and extract strings, and about some of the challenges you might face with non-English strings.
Now it's time to learn one of the most important and powerful tools for working with strings: regular expressions.
Regular expressions are a concise but expressive language for describing patterns within strings, and they are the topic of the next chapter.