r4ds/strings.qmd

# Strings {#sec-strings}

```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("restructuring")
```

## Introduction

So far, you've used a bunch of strings without learning much about the details.
Now it's time to dive into them, learning what makes strings tick, and mastering some of the powerful string manipulation tool you have at your disposal.

We'll begin with the details of creating strings and character vectors.
You'll then dive into creating strings from data, then the opposite; extracting strings from data.
The chapter finishes up with functions that work with individual letters and a brief discussion of where your expectations from English might steer you wrong when working with other languages.

### Prerequisites

In this chapter, we'll use functions from the stringr package which is part of the core tidyverse.
We'll also use the babynames data since it provides some fun strings to manipulate.

```{r}
#| label: setup
#| message: false

library(tidyverse)
library(babynames)
```

You can easily tell when you're using a stringr function because all stringr functions start with `str_`.
This is particularly useful if you use RStudio, because typing `str_` will trigger autocomplete, allowing you jog your memory of which functions are available.

```{r}
#| echo: false

knitr::include_graphics("screenshots/stringr-autocomplete.png")
```

## Creating a string

We've created strings in passing earlier in the book, but didn't discuss the details.
Firstly, you can create a string using either single quotes (`'`) or double quotes (`"`).
There's no difference in behavior between the two so in the interests of consistency the [tidyverse style guide](https://style.tidyverse.org/syntax.html#character-vectors) recommends using `"`, unless the string contains multiple `"`.

```{r}
string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'
```

If you forget to close a quote, you'll see `+`, the continuation character:

    > "This is a string without a closing quote
    + 
    + 
    + HELP I'M STUCK

If this happen to you and you can't figure out which quote you need to close, press Escape to cancel, and try again.

### Escapes

To include a literal single or double quote in a string you can use `\` to "escape" it:

```{r}
double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"
```

And if you want to include a literal backslash in your string, you'll need to double it up: `"\\"`:

```{r}
backslash <- "\\"
```

Beware that the printed representation of a string is not the same as string itself, because the printed representation shows the escapes (in other words, when you print a string, you can copy and paste the output to recreate that string).
To see the raw contents of the string, use `str_view()`[^strings-1]:

[^strings-1]: Or use the base R function `writeLines()`.

```{r}
x <- c(single_quote, double_quote, backslash)
x
str_view(x)
```

### Raw strings {#sec-raw-strings}

Creating a string with multiple quotes or backslashes gets confusing quickly.
To illustrate the problem, lets create a string that contains the contents of the code block where we define the `double_quote` and `single_quote` variables:

```{r}
tricky <- "double_quote <- \"\\\"\" # or '\"'
single_quote <- '\\'' # or \"'\""
str_view(tricky)
```

That's a lot of backslashes!
(This is sometimes called [leaning toothpick syndrome](https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome).) To eliminate the escaping you can instead use a **raw string**[^strings-2]:

[^strings-2]: Available in R 4.0.0 and above.

```{r}
tricky <- r"(double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'")"
str_view(tricky)
```

A raw string usually starts with `r"(` and finishes with `)"`.
But if your string contains `)"` you can instead use `r"[]"` or `r"{}"`, and if that's still not enough, you can insert any number of dashes to make the opening and closing pairs unique, e.g. `` `r"--()--" ``, `` `r"---()---" ``, etc. Raw strings are flexible enough to handle any text.

### Other special characters

As well as `\"`, `\'`, and `\\` there are a handful of other special characters that may come in handy. The most common are `\n`, newline, and `\t`, tab. You'll also sometimes see strings containing Unicode escapes that start with `\u` or `\U`. This is a way of writing non-English characters that works on all systems. You can see the complete list of other special characters in `?'"'`.

```{r}
x <- c("one\ntwo", "one\ttwo", "\u00b5", "\U0001f604")
x
str_view(x)
```

Note that `str_view()` uses a blue background for tabs to make them easier to spot.
One of the challenges of working with text is that there's a variety of ways that white space can end up in text, so this background helps you recognize that something strange is going on.

### Exercises

1.  Create strings that contain the following values:

    1.  `He said "That's amazing!"`

    2.  `\a\b\c\d`

    3.  `\\\\\\`

2.  Create the string in your R session and print it.
    What happens to the special "\\u00a0"?
    How does `str_view()` display it?
    Can you do a little googling to figure out what this special character is?

    ```{r}
    x <- "This\u00a0is\u00a0tricky"
    ```

## Creating many strings from data

Now that you've learned the basics of creating a string or two by "hand", we'll go into the details of creating strings from other strings.
This will help you solve the common problem where you have some text that you wrote that you want to combine with strings from a data frame.
For example, to create a greeting you might combine "Hello" with a `name` variable.
We'll show you how to do this with `str_c()` and `str_glue()` and how you might use them with `mutate()`.
That naturally raises the question of what functions you might use with `summarise()`, so we'll finish this section with a discussion of `str_flatten()` which is a summary function for strings.

### `str_c()`

`str_c()`[^strings-3] takes any number of vectors as arguments and returns a character vector:

[^strings-3]: `str_c()` is very similar to the base `paste0()`.
    There are two main reasons we recommend: it obeys the usual rules for propagating `NA`s and it uses the tidyverse recycling rules.

```{r}
str_c("x", "y")
str_c("x", "y", "z")
str_c("Hello ", c("John", "Susan"))
```

`str_c()` is designed to be used with `mutate()` so it obeys the usual rules for recycling and missing values:

```{r}
set.seed(1410)
df <- tibble(name = c(wakefield::name(3), NA))
df |> mutate(greeting = str_c("Hi ", name, "!"))
```

If you want missing values to display in some other way, use `coalesce()` either inside or outside of `str_c()`:

```{r}
df |> mutate(
  greeting1 = str_c("Hi ", coalesce(name, "you"), "!"),
  greeting2 = coalesce(str_c("Hi ", name, "!"), "Hi!")
)
```

### `str_glue()` {#sec-glue}

If you are mixing many fixed and variable strings with `str_c()`, you'll notice that you have to type `""` repeatedly, and this can make it hard to see the overall goal of the code.
An alternative approach is provided by the [glue package](https://glue.tidyverse.org) via `str_glue()`[^strings-4] .
You give it a single string containing `{}`; anything inside `{}` will be evaluated like it's outside of the string:

[^strings-4]: If you're not using stringr, you can also access it directly with `glue::glue()`.

```{r}
df |> mutate(greeting = str_glue("Hi {name}!"))
```

As you can see, `str_glue()` currently converts missing values to the string "NA" making it inconsistent with `str_c()`.
We'll hopefully have fixed that by the time you're reading this[^strings-5].

[^strings-5]: Track our progress at <https://github.com/tidyverse/glue/issues/246>.

You also might wonder what happens if you need to include a regular `{` or `}` in your string.
If you guess that you'll need to somehow escape it, you're on the right track.
The trick is that glue uses a slightly different escaping technique; instead of prefixing with special character like `\`, you double up the special characters:

```{r}
df |> mutate(greeting = str_glue("{{Hi {name}!}}"))
```

### `str_flatten()`

`str_c()` and `glue()` work well with `mutate()` because their output is the same length as their inputs.
What if you want a function that works well with `summarise()`, i.e. something that always returns a single string?
That's the job of `str_flatten()`[^strings-6]: it takes a character vector and combines each element of the vector into a single string:

[^strings-6]: The base R equivalent is `paste()` used with the `collapse` argument.

```{r}
str_flatten(c("x", "y", "z"))
str_flatten(c("x", "y", "z"), ", ")
str_flatten(c("x", "y", "z"), ", ", last = ", and ")
```

This makes it work well with `summarise()`:

```{r}
df <- tribble(
  ~ name, ~ fruit,
  "Carmen", "banana",
  "Carmen", "apple",
  "Marvin", "nectarine",
  "Terence", "cantaloupe",
  "Terence", "papaya",
  "Terence", "madarine"
)
df |>
  group_by(name) |> 
  summarise(fruits = str_flatten(fruit, ", "))
```

### Exercises

1.  Compare and contrast the results of `paste0()` with `str_c()` for the following inputs:

    ```{r}
    #| eval: false

    str_c("hi ", NA)
    str_c(letters[1:2], letters[1:3])
    ```

2.  Convert the following expressions from `str_c()` to `str_glue()` or vice versa:

    a.  `str_c("The price of ", food, " is ", price)`

    b.  `glue("I'm {age} years old and live in {country}")`

    c.  `str_c("\\section{", title, "}")`

## Extracting data from strings

Working from <https://github.com/tidyverse/tidyr/pull/1304>.

It's very common for multiple variables to be crammed together into a single string.
In this section you'll learn how to use four tidyr to extract them:

-   `df |> separate_longer_delim(col, delim)`
-   `df |> separate_longer_position(col, width)`
-   `df |> separate_wider_delim(col, delim, names)`
-   `df |> separate_wider_(col, widths)`

If you look closely you can see there's a common pattern here: `separate` followed by `by` or `at`, followed by longer or `wider`.
`by` splits up a string with a separator like `", "` or `" "`.
`at` splits at given locations, like 5, 10, and 17.
`longer` makes input data frame longer, making new rows; `wider` makes the input data frame wider, add new columns.

There's one more member of this family, `separate_regex_wider()`, that we'll come back in @sec-regular-expressions.
It's the most flexible of the `at` forms but you need to know a bit about regular expression in order to use it.

The next two sections will give you the basic idea behind these separate functions, and then we'll work through a few case studies that require mutliple uses.

### Splitting into rows

`separate_longer_delim()` and `separate_longer_position()` are most useful when the number of components varies from row to row.
`separate_longer_delim()` arises most commonly:

```{r}
df1 <- tibble(x = c("a,b,c", "d,e", "f"))
df1 |> 
  separate_longer_delim(x, delim = ",")
```

(If the separators have some variation you can use a regular expression instead, if you know about it.)

It's rarer to see `separate_longer_position()` in the wild, but some older datasets can adopt a very compact format where each character is used to record a value:

```{r}
df2 <- tibble(x = c("1211", "131", "21"))
df2 |> 
  separate_longer_position(x, width = 1)
```

### Splitting into columns

`separate_wider_delim()` and `separate_wider_position()` are most useful when there are a fixed number of components in each string, and you want to spread them into columns.
They are more complicated that their `by` equivalents because you need to name the columns.

```{r}
df3 <- tibble(x = c("a,1,2022", "b,2,2011", "e,5,2015"))
df3 |> 
  separate_wider_delim(x, delim = ",", names = c("letter", "number", "year"))
```

If a specific value is not useful you can use `NA` to omit it from the results:

```{r}
df3 <- tibble(x = c("a,1,2022", "b,2,2011", "e,5,2015"))
df3 |> 
  separate_wider_delim(x, delim = ",", names = c("letter", NA, "year"))
```

Alternatively, you can provide `names_sep` and `separate_wider_delim()` will use that separator to name automatically:

```{r}
df3 |> 
  separate_wider_delim(x, delim = ",", names_sep = "_")
```

`separate_wider_position()` works a little differently, because you typically want to specify the width of each column.
So you give it a named integer vector, where the name gives the name of the new column and the value is the number of characters it occupies.
You can omit values from the output by not naming them:

```{r}
df4 <- tibble(x = c("202215TX", "202122LA", "202325CA")) 
df4 |> 
  separate_wider_position(x, c(year = 4, age = 2, state = 2))
```

### Case studies

## Letters

This section discusses string function that work with individual characters.
In English, characters are easy to understand because they're correspond to the 26 letters of the alphabet (plus a handful of punctuation characters).
Things get complicated quickly when you move beyond English.
Even languages that use the same alphabet, but add additional accents (like å, é, ï, ô, ū) are non-trivial because these extra letters might be represented as an individual character or by composing an unaccented letter with a diacritic mark.
Things get more complicated still as you move further away.
To give just a few examples in Japanese each "letter" is a syllable, in Chinese each "letter" is a complex logogram, and in Arabic, letters look radically different depending on where in the word they fail.

In this section, we'll you're using English (or a nearby language); if you're working with another language, these examples either may not applty or need radically different approaches.

### Length

`str_length()` tells you the number of letters in the string:

```{r}
str_length(c("a", "R for data science", NA))
```

You could use this with `count()` to find the distribution of lengths of US babynames, and then with `filter()` to look at the longest names[^strings-7]:

[^strings-7]: Looking at these entries, we'd guess that the babynames data removes spaces or hyphens from names and truncates after 15 letters.

```{r}
babynames |>
  count(length = str_length(name), wt = n)

babynames |> 
  filter(str_length(name) == 15) |> 
  count(name, wt = n, sort = TRUE)
```

### Subsetting

You can extract parts of a string using `str_sub(string, start, end)`.
The `start` and `end` arguments are inclusive, so the length of the returned string will be `end - start + 1`:

```{r}
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
```

You can use negative values to count back from the end of the string: -1 is the last character, -2 is the second to last character, etc.

```{r}
str_sub(x, -3, -1)
```

Note that `str_sub()` won't fail if the string is too short: it will just return as much as possible:

```{r}
str_sub("a", 1, 5)
```

We could use `str_sub()` with `mutate()` to find the first and last letter of each name:

```{r}
babynames |> 
  mutate(
    first = str_sub(name, 1, 1),
    last = str_sub(name, -1, -1)
  )
```

### Long strings

Sometimes the reason you care about the length of a string is because you're trying to fit it into a label on a plot or in a table.
stringr provides two useful tools for cases where your string is too long:

-   `str_trunc(x, 30)` ensures that no string is longer than 20 characters, replacing any thing too long with `…`.

-   `str_wrap(x, 30)` wraps a string introducing new lines so that each line is at most 30 characters (it doesn't hyphenate, however, so any word longer than 30 characters will make a longer line)

```{r}
x <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat."

str_view(str_trunc(x, 30))
str_view(str_wrap(x, 30))
```

TODO: add example with a plot.

### Exercises

1.  Use `str_length()` and `str_sub()` to extract the middle letter from each baby name. What will you do if the string has an even number of characters?
2.  Are there any major trends in the length of babynames over time? What about the popularity of first and last letters?

## Locale dependent {#sec-other-languages}

So far all of our examples have been using English.
The details of the many ways other languages are different to English are too diverse to detail here, but we wanted to give a quick outline of the functions who's behavior differs based on your **locale**, the set of settings that vary from country to country.

Locale is specified with lower-case language abbreviation, optionally followed by a `_` and a upper-case region identifier.
For example, "en" is English, "en_GB" is British English, and "en_US" is American English.
If you don't already know the code for your language, [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) has a good list, and you can see which are supported with `stringi::stri_locale_list()`.

Base R string functions automatically use your locale current locale.
This means that string manipulation code works the way you expect when you're working with text in your native language, but it might work differently when you share it with someone who lives in another country.
To avoid this problem, stringr defaults to the "en" locale, and requires you to specify the `locale` argument to override it.
This also makes it easy to tell if a function might have different behavior in different locales.

Fortunately there are three sets of functions where the locale matters:

-   **Changing case**: while only relatively few languages have upper and lower case (Latin, Greek, and Cyrillic, plus a handful of lessor known languages).
    The rules are not the same in every language that uses these alphabets.
    For example, Turkish has two i's: with and without a dot, and it has a different rule for capitalizing them:

    ```{r}
    str_to_upper(c("i", "ı"))
    str_to_upper(c("i", "ı"), locale = "tr")
    ```

-   **Comparing strings**: `str_equal()` lets you compare if two strings are equal, optionally ignoring case:

    ```{r}
    str_equal("i", "I", ignore_case = TRUE)
    str_equal("i", "I", ignore_case = TRUE, locale = "tr")
    ```

-   **Sorting strings**: `str_sort()` and `str_order()` sort vectors alphabetically, but the alphabet is not the same in every language[^strings-8]!
    Here's an example: in Czech, "ch" is a compound letter that appears after `h` in the alphabet.

    ```{r}
    str_sort(c("a", "c", "ch", "h", "z"))
    str_sort(c("a", "c", "ch", "h", "z"), locale = "cs")
    ```

    Danish has a similar problem.
    Normally, characters with diacritics (e.g. à, á, â) sort after the plain character (e.g. a).
    But in Danish ø and å are their own letters that come at the end of the alphabet:

    ```{r}
    str_sort(c("a", "å", "o", "ø", "z"))
    str_sort(c("a", "å", "o", "ø", "z"), locale = "da")
    ```

    This also comes up when sorting strings with `dplyr::arrange()` which is why it also has a `locale` argument.

[^strings-8]: Sorting in languages that don't have an alphabet (like Chinese) is more complicated still.

## Summary

In this chapter you've learned a wide of tools for working with strings, but you haven't learned one of the most important and powerful tools: regular expressions.
Regular expressions are very concise, but very expressive, language for describing patterns within strings, and are the topic of the next chapter.
-												Convert to Quarto book (#1026)

* Use _quarto.yml
* Rmd -> qmd
* Update build type to None
* Update styling Styling to get close to bs4_book + color
* Convert crossrefs
* Covert chunk options
* Switch to plausible for analytics
* Update action
											
										
										
											2022-05-14 04:46:49 +08:00
+								# Strings {#sec-strings}
-												Make sure first element is heading

											
										
										
											2015-12-12 02:34:20 +08:00
-												Convert to Quarto book (#1026)

* Use _quarto.yml
* Rmd -> qmd
* Update build type to None
* Update styling Styling to get close to bs4_book + color
* Convert crossrefs
* Covert chunk options
* Switch to plausible for analytics
* Update action
											
										
										
											2022-05-14 04:46:49 +08:00
+								```{r}
 								#| results: "asis"
 								#| echo: false
 								source("_common.R")
-												Add chapter status

											
										
										
											2021-05-04 21:10:39 +08:00
+								status("restructuring")
 								```
-												Consistent chapter intro layout

											
										
										
											2016-07-19 21:01:50 +08:00
+								## Introduction
-												More string re-orgs

											
										
										
											2021-12-06 22:41:49 +08:00
+								So far, you've used a bunch of strings without learning much about the details.
 								Now it's time to dive into them, learning what makes strings tick, and mastering some of the powerful string manipulation tool you have at your disposal.
 								We'll begin with the details of creating strings and character vectors.
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								You'll then dive into creating strings from data, then the opposite; extracting strings from data.
 								The chapter finishes up with functions that work with individual letters and a brief discussion of where your expectations from English might steer you wrong when working with other languages.
-												Reorganising bigger structure of strings

											
										
										
											2021-12-06 00:52:47 +08:00
-												Consistent chapter intro layout

											
										
										
											2016-07-19 21:01:50 +08:00
+								### Prerequisites
-												Polishing strings

											
										
										
											2022-01-20 09:35:16 +08:00
+								In this chapter, we'll use functions from the stringr package which is part of the core tidyverse.
 								We'll also use the babynames data since it provides some fun strings to manipulate.
-												Consistent chapter intro layout

											
										
										
											2016-07-19 21:01:50 +08:00
-												Convert to Quarto book (#1026)

* Use _quarto.yml
* Rmd -> qmd
* Update build type to None
* Update styling Styling to get close to bs4_book + color
* Convert crossrefs
* Covert chunk options
* Switch to plausible for analytics
* Update action
											
										
										
											2022-05-14 04:46:49 +08:00
+								```{r}
 								#| label: setup
 								#| message: false
-												Use tidyverse package

Fixes #451

											
										
										
											2016-10-04 01:30:24 +08:00
+								library(tidyverse)
-												Noodling on strings

											
										
										
											2021-04-23 21:07:16 +08:00
+								library(babynames)
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
+								```
-												Writing about strings

											
										
										
											2021-12-03 01:57:51 +08:00
+								You can easily tell when you're using a stringr function because all stringr functions start with `str_`.
 								This is particularly useful if you use RStudio, because typing `str_` will trigger autocomplete, allowing you jog your memory of which functions are available.
-												Convert to Quarto book (#1026)

* Use _quarto.yml
* Rmd -> qmd
* Update build type to None
* Update styling Styling to get close to bs4_book + color
* Convert crossrefs
* Covert chunk options
* Switch to plausible for analytics
* Update action
											
										
										
											2022-05-14 04:46:49 +08:00
+								```{r}
 								#| echo: false
-												Writing about strings

											
										
										
											2021-12-03 01:57:51 +08:00
+								knitr::include_graphics("screenshots/stringr-autocomplete.png")
 								```
-												Break up strings chapter

											
										
										
											2021-04-22 01:30:25 +08:00
+								## Creating a string
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
-												More on strings

											
										
										
											2021-04-27 03:49:14 +08:00
+								We've created strings in passing earlier in the book, but didn't discuss the details.
-												Polishing strings

											
										
										
											2022-09-07 20:28:51 +08:00
+								Firstly, you can create a string using either single quotes (`'`) or double quotes (`"`).
 								There's no difference in behavior between the two so in the interests of consistency the [tidyverse style guide](https://style.tidyverse.org/syntax.html#character-vectors) recommends using `"`, unless the string contains multiple `"`.
-												Some string tweaking

											
										
										
											2015-10-26 22:52:24 +08:00
 								```{r}
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
+								string1 <- "This is a string"
 								string2 <- 'If I want to include a "quote" inside a string, I use single quotes'
-												Some string tweaking

											
										
										
											2015-10-26 22:52:24 +08:00
+								```
-												More @csgillespie changes

											
										
										
											2016-10-04 22:01:12 +08:00
+								If you forget to close a quote, you'll see `+`, the continuation character:
-												Use the visual editor (#923)

* Sentence wrap + visual editor Rmd style updates

* Move files not called in bookdown.yml to extra

* UK spelling + canonical source + sentence wrap

* Rename .rmd -> .Rmd

* Sentence wrap + visual editor Rmd style

* Fix capitalization
											
										
										
											2021-02-21 23:40:40 +08:00
+								    > "This is a string without a closing quote
 								    +
 								    +
 								    + HELP I'M STUCK
-												More @csgillespie changes

											
										
										
											2016-10-04 22:01:12 +08:00
-												More string re-orgs

											
										
										
											2021-12-06 22:41:49 +08:00
+								If this happen to you and you can't figure out which quote you need to close, press Escape to cancel, and try again.
-												Writing about strings

											
										
										
											2021-12-03 01:57:51 +08:00
-												More on strings

											
										
										
											2021-04-27 03:49:14 +08:00
+								### Escapes
-												More @csgillespie changes

											
										
										
											2016-10-04 22:01:12 +08:00
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
+								To include a literal single or double quote in a string you can use `\` to "escape" it:
-												Some string tweaking

											
										
										
											2015-10-26 22:52:24 +08:00
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
+								```{r}
 								double_quote <- "\"" # or '"'
 								single_quote <- '\'' # or "'"
 								```
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
-												Polishing strings

											
										
										
											2022-01-20 09:35:16 +08:00
+								And if you want to include a literal backslash in your string, you'll need to double it up: `"\\"`:
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
 								```{r}
-												More on strings

											
										
										
											2021-04-27 03:49:14 +08:00
+								backslash <- "\\"
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
+								```
-												More strings and regexps

											
										
										
											2021-12-14 04:44:52 +08:00
+								Beware that the printed representation of a string is not the same as string itself, because the printed representation shows the escapes (in other words, when you print a string, you can copy and paste the output to recreate that string).
 								To see the raw contents of the string, use `str_view()`[^strings-1]:
-												Writing about strings

											
										
										
											2021-12-03 01:57:51 +08:00
-												Polishing strings

											
										
										
											2022-09-07 20:28:51 +08:00
+								[^strings-1]: Or use the base R function `writeLines()`.
-												Some string tweaking

											
										
										
											2015-10-26 22:52:24 +08:00
-												More on strings

											
										
										
											2015-10-27 22:33:41 +08:00
+								```{r}
-												More on strings

											
										
										
											2021-04-27 03:49:14 +08:00
+								x <- c(single_quote, double_quote, backslash)
 								x
 								str_view(x)
-												Some string tweaking

											
										
										
											2015-10-26 22:52:24 +08:00
+								```
-												Convert to Quarto book (#1026)

* Use _quarto.yml
* Rmd -> qmd
* Update build type to None
* Update styling Styling to get close to bs4_book + color
* Convert crossrefs
* Covert chunk options
* Switch to plausible for analytics
* Update action
											
										
										
											2022-05-14 04:46:49 +08:00
+								### Raw strings {#sec-raw-strings}
-												More hacking of string chapter

											
										
										
											2021-04-22 21:41:06 +08:00
 								Creating a string with multiple quotes or backslashes gets confusing quickly.
-												Polishing strings

											
										
										
											2022-09-07 20:28:51 +08:00
+								To illustrate the problem, lets create a string that contains the contents of the code block where we define the `double_quote` and `single_quote` variables:
-												Start on strings

											
										
										
											2015-10-21 22:31:15 +08:00
 								```{r}
-												More hacking of string chapter

											
										
										
											2021-04-22 21:41:06 +08:00
+								tricky <- "double_quote <- \"\\\"\" # or '\"'
 								single_quote <- '\\'' # or \"'\""
 								str_view(tricky)
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
+								```
-												Writing about strings

											
										
										
											2021-12-03 01:57:51 +08:00
+								That's a lot of backslashes!
-												Polishing strings

											
										
										
											2022-09-07 20:28:51 +08:00
+								(This is sometimes called [leaning toothpick syndrome](https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome).) To eliminate the escaping you can instead use a **raw string**[^strings-2]:
-												More on strings

											
										
										
											2021-04-27 03:49:14 +08:00
-												More string re-orgs

											
										
										
											2021-12-06 22:41:49 +08:00
+								[^strings-2]: Available in R 4.0.0 and above.
-												More working on strings

											
										
										
											2015-10-29 00:03:11 +08:00
 								```{r}
-												More hacking of string chapter

											
										
										
											2021-04-22 21:41:06 +08:00
+								tricky <- r"(double_quote <- "\"" # or '"'
-												More string re-orgs

											
										
										
											2021-12-06 22:41:49 +08:00
+								single_quote <- '\'' # or "'")"
-												More hacking of string chapter

											
										
										
											2021-04-22 21:41:06 +08:00
+								str_view(tricky)
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
+								```
-												Writing about strings

											
										
										
											2021-12-03 01:57:51 +08:00
+								A raw string usually starts with `r"(` and finishes with `)"`.
 								But if your string contains `)"` you can instead use `r"[]"` or `r"{}"`, and if that's still not enough, you can insert any number of dashes to make the opening and closing pairs unique, e.g. `` `r"--()--" ``, `` `r"---()---" ``, etc. Raw strings are flexible enough to handle any text.
-												Break up strings chapter

											
										
										
											2021-04-22 01:30:25 +08:00
-												More hacking of string chapter

											
										
										
											2021-04-22 21:41:06 +08:00
+								### Other special characters
-												Add autocomplete screenshot

											
										
										
											2015-11-09 20:32:56 +08:00
-												Polishing strings

											
										
										
											2022-01-20 09:35:16 +08:00
+								As well as `\"`, `\'`, and `\\` there are a handful of other special characters that may come in handy. The most common are `\n`, newline, and `\t`, tab. You'll also sometimes see strings containing Unicode escapes that start with `\u` or `\U`. This is a way of writing non-English characters that works on all systems. You can see the complete list of other special characters in `?'"'`.
-												More hacking of string chapter

											
										
										
											2021-04-22 21:41:06 +08:00
 								```{r}
-												Polishing strings

											
										
										
											2022-01-20 09:35:16 +08:00
+								x <- c("one\ntwo", "one\ttwo", "\u00b5", "\U0001f604")
-												More hacking of string chapter

											
										
										
											2021-04-22 21:41:06 +08:00
+								x
-												More on strings

											
										
										
											2021-04-27 03:49:14 +08:00
+								str_view(x)
-												Add autocomplete screenshot

											
										
										
											2015-11-09 20:32:56 +08:00
+								```
-												More on strings

											
										
										
											2015-10-27 22:33:41 +08:00
-												Polishing strings

											
										
										
											2022-09-07 20:28:51 +08:00
+								Note that `str_view()` uses a blue background for tabs to make them easier to spot.
 								One of the challenges of working with text is that there's a variety of ways that white space can end up in text, so this background helps you recognize that something strange is going on.
-												More strings and regexps

											
										
										
											2021-12-14 04:44:52 +08:00
-												Polishing strings

											
										
										
											2022-09-07 20:28:51 +08:00
+								### Exercises
-												Reorganising bigger structure of strings

											
										
										
											2021-12-06 00:52:47 +08:00
-												Polishing strings

											
										
										
											2022-09-07 20:28:51 +08:00
+.  Create strings that contain the following values:
-												Reorganising bigger structure of strings

											
										
										
											2021-12-06 00:52:47 +08:00
-												Polishing strings

											
										
										
											2022-09-07 20:28:51 +08:00
+.  `He said "That's amazing!"`
-												Reorganising bigger structure of strings

											
										
										
											2021-12-06 00:52:47 +08:00
-												Polishing strings

											
										
										
											2022-09-07 20:28:51 +08:00
+.  `\a\b\c\d`
-												Noodling on strings

											
										
										
											2021-04-23 21:07:16 +08:00
-												Polishing strings

											
										
										
											2022-09-07 20:28:51 +08:00
+.  `\\\\\\`
-												More hacking away at iteration

											
										
										
											2022-09-16 03:56:12 +08:00
+.  Create the string in your R session and print it.
 								    What happens to the special "\\u00a0"?
 								    How does `str_view()` display it?
 								    Can you do a little googling to figure out what this special character is?
-												Polishing strings

											
										
										
											2022-09-07 20:28:51 +08:00
 								    ```{r}
 								    x <- "This\u00a0is\u00a0tricky"
 								    ```
-												Reorganising bigger structure of strings

											
										
										
											2021-12-06 00:52:47 +08:00
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								## Creating many strings from data
-												Writing about strings

											
										
										
											2021-12-03 01:57:51 +08:00
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								Now that you've learned the basics of creating a string or two by "hand", we'll go into the details of creating strings from other strings.
-												Polishing strings

											
										
										
											2022-09-07 20:28:51 +08:00
+								This will help you solve the common problem where you have some text that you wrote that you want to combine with strings from a data frame.
-												More strings and regexps

											
										
										
											2021-12-14 04:44:52 +08:00
+								For example, to create a greeting you might combine "Hello" with a `name` variable.
-												Polishing strings

											
										
										
											2022-09-07 20:28:51 +08:00
+								We'll show you how to do this with `str_c()` and `str_glue()` and how you might use them with `mutate()`.
 								That naturally raises the question of what functions you might use with `summarise()`, so we'll finish this section with a discussion of `str_flatten()` which is a summary function for strings.
-												Writing about strings

											
										
										
											2021-12-03 01:57:51 +08:00
-												More string re-orgs

											
										
										
											2021-12-06 22:41:49 +08:00
+								### `str_c()`
-												Writing about strings

											
										
										
											2021-12-03 01:57:51 +08:00
-												More strings and regexps

											
										
										
											2021-12-14 04:44:52 +08:00
+								`str_c()`[^strings-3] takes any number of vectors as arguments and returns a character vector:
-												Writing about strings

											
										
										
											2021-12-03 01:57:51 +08:00
-												More strings and regexps

											
										
										
											2021-12-14 04:44:52 +08:00
+								[^strings-3]: `str_c()` is very similar to the base `paste0()`.
-												Polishing strings

											
										
										
											2022-09-07 20:28:51 +08:00
+								    There are two main reasons we recommend: it obeys the usual rules for propagating `NA`s and it uses the tidyverse recycling rules.
-												Writing about strings

											
										
										
											2021-12-03 01:57:51 +08:00
 								```{r}
 								str_c("x", "y")
 								str_c("x", "y", "z")
-												More strings and regexps

											
										
										
											2021-12-14 04:44:52 +08:00
+								str_c("Hello ", c("John", "Susan"))
-												Writing about strings

											
										
										
											2021-12-03 01:57:51 +08:00
+								```
-												Polishing strings

											
										
										
											2022-01-20 09:35:16 +08:00
+								`str_c()` is designed to be used with `mutate()` so it obeys the usual rules for recycling and missing values:
-												Writing about strings

											
										
										
											2021-12-03 01:57:51 +08:00
 								```{r}
-												Polishing strings

											
										
										
											2022-09-07 20:28:51 +08:00
+								set.seed(1410)
 								df <- tibble(name = c(wakefield::name(3), NA))
-												Convert from %>% to |>

											
										
										
											2022-02-24 03:15:52 +08:00
+								df |> mutate(greeting = str_c("Hi ", name, "!"))
-												Writing about strings

											
										
										
											2021-12-03 01:57:51 +08:00
+								```
-												More string re-orgs

											
										
										
											2021-12-06 22:41:49 +08:00
+								If you want missing values to display in some other way, use `coalesce()` either inside or outside of `str_c()`:
-												Reorganising bigger structure of strings

											
										
										
											2021-12-06 00:52:47 +08:00
 								```{r}
-												Convert from %>% to |>

											
										
										
											2022-02-24 03:15:52 +08:00
+								df |> mutate(
-												More string re-orgs

											
										
										
											2021-12-06 22:41:49 +08:00
+								  greeting1 = str_c("Hi ", coalesce(name, "you"), "!"),
 								  greeting2 = coalesce(str_c("Hi ", name, "!"), "Hi!")
 								)
-												Reorganising bigger structure of strings

											
										
										
											2021-12-06 00:52:47 +08:00
+								```
-												More hacking away at iteration

											
										
										
											2022-09-16 03:56:12 +08:00
+								### `str_glue()` {#sec-glue}
-												Writing about strings

											
										
										
											2021-12-03 01:57:51 +08:00
-												More strings and regexps

											
										
										
											2021-12-14 04:44:52 +08:00
+								If you are mixing many fixed and variable strings with `str_c()`, you'll notice that you have to type `""` repeatedly, and this can make it hard to see the overall goal of the code.
 								An alternative approach is provided by the [glue package](https://glue.tidyverse.org) via `str_glue()`[^strings-4] .
-												Polishing strings

											
										
										
											2022-09-07 20:28:51 +08:00
+								You give it a single string containing `{}`; anything inside `{}` will be evaluated like it's outside of the string:
-												Writing about strings

											
										
										
											2021-12-03 01:57:51 +08:00
-												fix minor plaintext typos and links (#1082)


											
										
										
											2022-09-01 19:43:57 +08:00
+								[^strings-4]: If you're not using stringr, you can also access it directly with `glue::glue()`.
-												Writing about strings

											
										
										
											2021-12-03 01:57:51 +08:00
 								```{r}
-												Convert from %>% to |>

											
										
										
											2022-02-24 03:15:52 +08:00
+								df |> mutate(greeting = str_glue("Hi {name}!"))
-												Writing about strings

											
										
										
											2021-12-03 01:57:51 +08:00
+								```
-												Polishing strings

											
										
										
											2022-09-07 20:28:51 +08:00
+								As you can see, `str_glue()` currently converts missing values to the string "NA" making it inconsistent with `str_c()`.
 								We'll hopefully have fixed that by the time you're reading this[^strings-5].
-												Writing about strings

											
										
										
											2021-12-03 01:57:51 +08:00
-												Polishing strings

											
										
										
											2022-09-07 20:28:51 +08:00
+								[^strings-5]: Track our progress at <https://github.com/tidyverse/glue/issues/246>.
-												Writing about strings

											
										
										
											2021-12-03 01:57:51 +08:00
-												More strings and regexps

											
										
										
											2021-12-14 04:44:52 +08:00
+								You also might wonder what happens if you need to include a regular `{` or `}` in your string.
-												Polishing strings

											
										
										
											2022-09-07 20:28:51 +08:00
+								If you guess that you'll need to somehow escape it, you're on the right track.
 								The trick is that glue uses a slightly different escaping technique; instead of prefixing with special character like `\`, you double up the special characters:
-												More strings and regexps

											
										
										
											2021-12-14 04:44:52 +08:00
 								```{r}
-												Convert from %>% to |>

											
										
										
											2022-02-24 03:15:52 +08:00
+								df |> mutate(greeting = str_glue("{{Hi {name}!}}"))
-												More strings and regexps

											
										
										
											2021-12-14 04:44:52 +08:00
+								```
-												More hacking of string chapter

											
										
										
											2021-04-22 21:41:06 +08:00
-												More strings and regexps

											
										
										
											2021-12-14 04:44:52 +08:00
+								### `str_flatten()`
-												More string re-orgs

											
										
										
											2021-12-06 22:41:49 +08:00
-												Polishing strings

											
										
										
											2022-01-20 09:35:16 +08:00
+								`str_c()` and `glue()` work well with `mutate()` because their output is the same length as their inputs.
-												More strings and regexps

											
										
										
											2021-12-14 04:44:52 +08:00
+								What if you want a function that works well with `summarise()`, i.e. something that always returns a single string?
-												Polishing strings

											
										
										
											2022-09-07 20:28:51 +08:00
+								That's the job of `str_flatten()`[^strings-6]: it takes a character vector and combines each element of the vector into a single string:
-												Writing about strings

											
										
										
											2021-12-03 01:57:51 +08:00
-												Polishing strings

											
										
										
											2022-09-07 20:28:51 +08:00
+								[^strings-6]: The base R equivalent is `paste()` used with the `collapse` argument.
-												More hacking of string chapter

											
										
										
											2021-04-22 21:41:06 +08:00
-												More on strings

											
										
										
											2021-04-27 03:49:14 +08:00
+								```{r}
 								str_flatten(c("x", "y", "z"))
-												More noodling on strings

											
										
										
											2021-04-27 21:41:47 +08:00
+								str_flatten(c("x", "y", "z"), ", ")
-												Writing about strings

											
										
										
											2021-12-03 01:57:51 +08:00
+								str_flatten(c("x", "y", "z"), ", ", last = ", and ")
-												More on strings

											
										
										
											2021-04-27 03:49:14 +08:00
+								```
-												More hacking of string chapter

											
										
										
											2021-04-22 21:41:06 +08:00
-												More strings and regexps

											
										
										
											2021-12-14 04:44:52 +08:00
+								This makes it work well with `summarise()`:
-												More on strings

											
										
										
											2021-04-27 03:49:14 +08:00
 								```{r}
 								df <- tribble(
 								  ~ name, ~ fruit,
 								  "Carmen", "banana",
 								  "Carmen", "apple",
 								  "Marvin", "nectarine",
 								  "Terence", "cantaloupe",
 								  "Terence", "papaya",
 								  "Terence", "madarine"
 								)
-												Convert from %>% to |>

											
										
										
											2022-02-24 03:15:52 +08:00
+								df |>
 								  group_by(name) |>
-												More on strings

											
										
										
											2021-04-27 03:49:14 +08:00
+								  summarise(fruits = str_flatten(fruit, ", "))
 								```
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
-												Writing about strings

											
										
										
											2021-12-03 01:57:51 +08:00
+								### Exercises
-												Reorganising bigger structure of strings

											
										
										
											2021-12-06 00:52:47 +08:00
+.  Compare and contrast the results of `paste0()` with `str_c()` for the following inputs:
-												Writing about strings

											
										
										
											2021-12-03 01:57:51 +08:00
-												Convert to Quarto book (#1026)

* Use _quarto.yml
* Rmd -> qmd
* Update build type to None
* Update styling Styling to get close to bs4_book + color
* Convert crossrefs
* Covert chunk options
* Switch to plausible for analytics
* Update action
											
										
										
											2022-05-14 04:46:49 +08:00
+								    ```{r}
 								    #| eval: false
-												Writing about strings

											
										
										
											2021-12-03 01:57:51 +08:00
+								    str_c("hi ", NA)
 								    str_c(letters[1:2], letters[1:3])
 								    ```
-												Polishing strings

											
										
										
											2022-09-07 20:28:51 +08:00
+.  Convert the following expressions from `str_c()` to `str_glue()` or vice versa:
-												More strings and regexps

											
										
										
											2021-12-14 04:44:52 +08:00
-												Polishing strings

											
										
										
											2022-01-20 09:35:16 +08:00
+								    a.  `str_c("The price of ", food, " is ", price)`
-												Reorganising bigger structure of strings

											
										
										
											2021-12-06 00:52:47 +08:00
-												Polishing strings

											
										
										
											2022-01-20 09:35:16 +08:00
+								    b.  `glue("I'm {age} years old and live in {country}")`
 								    c.  `str_c("\\section{", title, "}")`
-												Reorganising bigger structure of strings

											
										
										
											2021-12-06 00:52:47 +08:00
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								## Extracting data from strings
-												Proof strings

											
										
										
											2016-08-13 00:28:16 +08:00
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								Working from <https://github.com/tidyverse/tidyr/pull/1304>.
-												Convert to Quarto book (#1026)

* Use _quarto.yml
* Rmd -> qmd
* Update build type to None
* Update styling Styling to get close to bs4_book + color
* Convert crossrefs
* Covert chunk options
* Switch to plausible for analytics
* Update action
											
										
										
											2022-05-14 04:46:49 +08:00
-												Start sketching out extract section

											
										
										
											2022-10-05 05:37:02 +08:00
+								It's very common for multiple variables to be crammed together into a single string.
 								In this section you'll learn how to use four tidyr to extract them:
-												More on strings

											
										
										
											2021-04-27 03:49:14 +08:00
-												Fixes for dev tidyr

											
										
										
											2022-11-06 01:05:55 +08:00
+								-   `df |> separate_longer_delim(col, delim)`
 								-   `df |> separate_longer_position(col, width)`
 								-   `df |> separate_wider_delim(col, delim, names)`
 								-   `df |> separate_wider_(col, widths)`
-												More strings and regexps

											
										
										
											2021-12-14 04:44:52 +08:00
-												Start sketching out extract section

											
										
										
											2022-10-05 05:37:02 +08:00
+								If you look closely you can see there's a common pattern here: `separate` followed by `by` or `at`, followed by longer or `wider`.
 								`by` splits up a string with a separator like `", "` or `" "`.
 								`at` splits at given locations, like 5, 10, and 17.
 								`longer` makes input data frame longer, making new rows; `wider` makes the input data frame wider, add new columns.
 								There's one more member of this family, `separate_regex_wider()`, that we'll come back in @sec-regular-expressions.
 								It's the most flexible of the `at` forms but you need to know a bit about regular expression in order to use it.
 								The next two sections will give you the basic idea behind these separate functions, and then we'll work through a few case studies that require mutliple uses.
 								### Splitting into rows
-												Fixes for dev tidyr

											
										
										
											2022-11-06 01:05:55 +08:00
+								`separate_longer_delim()` and `separate_longer_position()` are most useful when the number of components varies from row to row.
 								`separate_longer_delim()` arises most commonly:
-												Start sketching out extract section

											
										
										
											2022-10-05 05:37:02 +08:00
 								```{r}
 								df1 <- tibble(x = c("a,b,c", "d,e", "f"))
 								df1 |>
-												Fixes for dev tidyr

											
										
										
											2022-11-06 01:05:55 +08:00
+								  separate_longer_delim(x, delim = ",")
-												Start sketching out extract section

											
										
										
											2022-10-05 05:37:02 +08:00
+								```
 								(If the separators have some variation you can use a regular expression instead, if you know about it.)
-												Fixes for dev tidyr

											
										
										
											2022-11-06 01:05:55 +08:00
+								It's rarer to see `separate_longer_position()` in the wild, but some older datasets can adopt a very compact format where each character is used to record a value:
-												Start sketching out extract section

											
										
										
											2022-10-05 05:37:02 +08:00
 								```{r}
 								df2 <- tibble(x = c("1211", "131", "21"))
 								df2 |>
-												Fixes for dev tidyr

											
										
										
											2022-11-06 01:05:55 +08:00
+								  separate_longer_position(x, width = 1)
-												Start sketching out extract section

											
										
										
											2022-10-05 05:37:02 +08:00
+								```
 								### Splitting into columns
-												Fixes for dev tidyr

											
										
										
											2022-11-06 01:05:55 +08:00
+								`separate_wider_delim()` and `separate_wider_position()` are most useful when there are a fixed number of components in each string, and you want to spread them into columns.
-												Start sketching out extract section

											
										
										
											2022-10-05 05:37:02 +08:00
+								They are more complicated that their `by` equivalents because you need to name the columns.
 								```{r}
 								df3 <- tibble(x = c("a,1,2022", "b,2,2011", "e,5,2015"))
 								df3 |>
-												Fixes for dev tidyr

											
										
										
											2022-11-06 01:05:55 +08:00
+								  separate_wider_delim(x, delim = ",", names = c("letter", "number", "year"))
-												Start sketching out extract section

											
										
										
											2022-10-05 05:37:02 +08:00
+								```
 								If a specific value is not useful you can use `NA` to omit it from the results:
 								```{r}
 								df3 <- tibble(x = c("a,1,2022", "b,2,2011", "e,5,2015"))
 								df3 |>
-												Fixes for dev tidyr

											
										
										
											2022-11-06 01:05:55 +08:00
+								  separate_wider_delim(x, delim = ",", names = c("letter", NA, "year"))
-												Start sketching out extract section

											
										
										
											2022-10-05 05:37:02 +08:00
+								```
-												Fixes for dev tidyr

											
										
										
											2022-11-06 01:05:55 +08:00
+								Alternatively, you can provide `names_sep` and `separate_wider_delim()` will use that separator to name automatically:
-												Start sketching out extract section

											
										
										
											2022-10-05 05:37:02 +08:00
 								```{r}
 								df3 |>
-												Fixes for dev tidyr

											
										
										
											2022-11-06 01:05:55 +08:00
+								  separate_wider_delim(x, delim = ",", names_sep = "_")
-												Start sketching out extract section

											
										
										
											2022-10-05 05:37:02 +08:00
+								```
-												Fixes for dev tidyr

											
										
										
											2022-11-06 01:05:55 +08:00
+								`separate_wider_position()` works a little differently, because you typically want to specify the width of each column.
-												Start sketching out extract section

											
										
										
											2022-10-05 05:37:02 +08:00
+								So you give it a named integer vector, where the name gives the name of the new column and the value is the number of characters it occupies.
 								You can omit values from the output by not naming them:
 								```{r}
 								df4 <- tibble(x = c("202215TX", "202122LA", "202325CA"))
 								df4 |>
-												Fixes for dev tidyr

											
										
										
											2022-11-06 01:05:55 +08:00
+								  separate_wider_position(x, c(year = 4, age = 2, state = 2))
-												Start sketching out extract section

											
										
										
											2022-10-05 05:37:02 +08:00
+								```
 								### Case studies
-												More strings and regexps

											
										
										
											2021-12-14 04:44:52 +08:00
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								## Letters
-												More strings and regexps

											
										
										
											2021-12-14 04:44:52 +08:00
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								This section discusses string function that work with individual characters.
 								In English, characters are easy to understand because they're correspond to the 26 letters of the alphabet (plus a handful of punctuation characters).
 								Things get complicated quickly when you move beyond English.
 								Even languages that use the same alphabet, but add additional accents (like å, é, ï, ô, ū) are non-trivial because these extra letters might be represented as an individual character or by composing an unaccented letter with a diacritic mark.
 								Things get more complicated still as you move further away.
 								To give just a few examples in Japanese each "letter" is a syllable, in Chinese each "letter" is a complex logogram, and in Arabic, letters look radically different depending on where in the word they fail.
-												More strings and regexps

											
										
										
											2021-12-14 04:44:52 +08:00
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								In this section, we'll you're using English (or a nearby language); if you're working with another language, these examples either may not applty or need radically different approaches.
-												More on strings

											
										
										
											2021-04-27 03:49:14 +08:00
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								### Length
-												More on strings

											
										
										
											2021-04-27 03:49:14 +08:00
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								`str_length()` tells you the number of letters in the string:
-												More on strings

											
										
										
											2021-04-27 03:49:14 +08:00
 								```{r}
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								str_length(c("a", "R for data science", NA))
-												More on strings

											
										
										
											2021-04-27 03:49:14 +08:00
+								```
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								You could use this with `count()` to find the distribution of lengths of US babynames, and then with `filter()` to look at the longest names[^strings-7]:
-												More noodling on strings

											
										
										
											2021-04-27 21:41:47 +08:00
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								[^strings-7]: Looking at these entries, we'd guess that the babynames data removes spaces or hyphens from names and truncates after 15 letters.
-												More noodling on strings

											
										
										
											2021-04-27 21:41:47 +08:00
-												More strings and regexps

											
										
										
											2021-12-14 04:44:52 +08:00
+								```{r}
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								babynames |>
 								  count(length = str_length(name), wt = n)
-												More noodling on strings

											
										
										
											2021-04-27 21:41:47 +08:00
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								babynames |>
 								  filter(str_length(name) == 15) |>
 								  count(name, wt = n, sort = TRUE)
-												More strings and regexps

											
										
										
											2021-12-14 04:44:52 +08:00
+								```
-												More noodling on strings

											
										
										
											2021-04-27 21:41:47 +08:00
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								### Subsetting
-												More on strings

											
										
										
											2021-04-27 03:49:14 +08:00
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								You can extract parts of a string using `str_sub(string, start, end)`.
 								The `start` and `end` arguments are inclusive, so the length of the returned string will be `end - start + 1`:
-												More on strings

											
										
										
											2021-04-27 03:49:14 +08:00
-												More noodling on strings

											
										
										
											2021-04-27 21:41:47 +08:00
+								```{r}
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								x <- c("Apple", "Banana", "Pear")
 								str_sub(x, 1, 3)
-												More noodling on strings

											
										
										
											2021-04-27 21:41:47 +08:00
+								```
-												More on strings

											
										
										
											2021-04-27 03:49:14 +08:00
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								You can use negative values to count back from the end of the string: -1 is the last character, -2 is the second to last character, etc.
-												More string re-orgs

											
										
										
											2021-12-06 22:41:49 +08:00
 								```{r}
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								str_sub(x, -3, -1)
-												More string re-orgs

											
										
										
											2021-12-06 22:41:49 +08:00
+								```
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								Note that `str_sub()` won't fail if the string is too short: it will just return as much as possible:
-												Polishing strings

											
										
										
											2022-01-20 09:35:16 +08:00
 								```{r}
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								str_sub("a", 1, 5)
-												Polishing strings

											
										
										
											2022-01-20 09:35:16 +08:00
+								```
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								We could use `str_sub()` with `mutate()` to find the first and last letter of each name:
-												More string re-orgs

											
										
										
											2021-12-06 22:41:49 +08:00
 								```{r}
-												Convert from %>% to |>

											
										
										
											2022-02-24 03:15:52 +08:00
+								babynames |>
-												More string re-orgs

											
										
										
											2021-12-06 22:41:49 +08:00
+								  mutate(
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								    first = str_sub(name, 1, 1),
 								    last = str_sub(name, -1, -1)
-												More string re-orgs

											
										
										
											2021-12-06 22:41:49 +08:00
+								  )
 								```
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								### Long strings
-												Polishing strings

											
										
										
											2022-01-20 09:35:16 +08:00
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								Sometimes the reason you care about the length of a string is because you're trying to fit it into a label on a plot or in a table.
 								stringr provides two useful tools for cases where your string is too long:
-												More string stuff

											
										
										
											2022-01-06 12:49:31 +08:00
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								-   `str_trunc(x, 30)` ensures that no string is longer than 20 characters, replacing any thing too long with `…`.
-												More string stuff

											
										
										
											2022-01-06 12:49:31 +08:00
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								-   `str_wrap(x, 30)` wraps a string introducing new lines so that each line is at most 30 characters (it doesn't hyphenate, however, so any word longer than 30 characters will make a longer line)
-												More on regexps

											
										
										
											2022-01-12 16:08:01 +08:00
 								```{r}
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								x <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat."
-												Break up strings chapter

											
										
										
											2021-04-22 01:30:25 +08:00
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								str_view(str_trunc(x, 30))
 								str_view(str_wrap(x, 30))
-												Break up strings chapter

											
										
										
											2021-04-22 01:30:25 +08:00
+								```
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								TODO: add example with a plot.
-												More noodling on strings

											
										
										
											2021-04-27 21:41:47 +08:00
 								### Exercises
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+.  Use `str_length()` and `str_sub()` to extract the middle letter from each baby name. What will you do if the string has an even number of characters?
 .  Are there any major trends in the length of babynames over time? What about the popularity of first and last letters?
-												More string re-orgs

											
										
										
											2021-12-06 22:41:49 +08:00
-												Start sketching out extract section

											
										
										
											2022-10-05 05:37:02 +08:00
+								## Locale dependent {#sec-other-languages}
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
-												More strings and regexps

											
										
										
											2021-12-14 04:44:52 +08:00
+								So far all of our examples have been using English.
-												Switch from I to we

Fixes #642

											
										
										
											2022-08-10 00:43:12 +08:00
+								The details of the many ways other languages are different to English are too diverse to detail here, but we wanted to give a quick outline of the functions who's behavior differs based on your **locale**, the set of settings that vary from country to country.
-												Polishing strings

											
										
										
											2022-01-20 09:35:16 +08:00
 								Locale is specified with lower-case language abbreviation, optionally followed by a `_` and a upper-case region identifier.
-												More strings and regexps

											
										
										
											2021-12-14 04:44:52 +08:00
+								For example, "en" is English, "en_GB" is British English, and "en_US" is American English.
-												Reorganising bigger structure of strings

											
										
										
											2021-12-06 00:52:47 +08:00
+								If you don't already know the code for your language, [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) has a good list, and you can see which are supported with `stringi::stri_locale_list()`.
-												Noodling on strings

											
										
										
											2021-04-23 21:07:16 +08:00
-												More strings and regexps

											
										
										
											2021-12-14 04:44:52 +08:00
+								Base R string functions automatically use your locale current locale.
 								This means that string manipulation code works the way you expect when you're working with text in your native language, but it might work differently when you share it with someone who lives in another country.
 								To avoid this problem, stringr defaults to the "en" locale, and requires you to specify the `locale` argument to override it.
 								This also makes it easy to tell if a function might have different behavior in different locales.
-												Noodling on other writing systems

											
										
										
											2021-04-28 21:43:44 +08:00
-												More strings and regexps

											
										
										
											2021-12-14 04:44:52 +08:00
+								Fortunately there are three sets of functions where the locale matters:
-												Noodling on strings

											
										
										
											2021-04-23 21:07:16 +08:00
-												More strings and regexps

											
										
										
											2021-12-14 04:44:52 +08:00
+								-   **Changing case**: while only relatively few languages have upper and lower case (Latin, Greek, and Cyrillic, plus a handful of lessor known languages).
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								    The rules are not the same in every language that uses these alphabets.
 								    For example, Turkish has two i's: with and without a dot, and it has a different rule for capitalizing them:
-												Noodling on other writing systems

											
										
										
											2021-04-28 21:43:44 +08:00
-												Reorganising bigger structure of strings

											
										
										
											2021-12-06 00:52:47 +08:00
+								    ```{r}
 								    str_to_upper(c("i", "ı"))
 								    str_to_upper(c("i", "ı"), locale = "tr")
 								    ```
-												Use the visual editor (#923)

* Sentence wrap + visual editor Rmd style updates

* Move files not called in bookdown.yml to extra

* UK spelling + canonical source + sentence wrap

* Rename .rmd -> .Rmd

* Sentence wrap + visual editor Rmd style

* Fix capitalization
											
										
										
											2021-02-21 23:40:40 +08:00
-												Polishing strings

											
										
										
											2022-01-20 09:35:16 +08:00
+								-   **Comparing strings**: `str_equal()` lets you compare if two strings are equal, optionally ignoring case:
-												More about strings

											
										
										
											2015-10-23 02:17:00 +08:00
-												Reorganising bigger structure of strings

											
										
										
											2021-12-06 00:52:47 +08:00
+								    ```{r}
-												More strings and regexps

											
										
										
											2021-12-14 04:44:52 +08:00
+								    str_equal("i", "I", ignore_case = TRUE)
 								    str_equal("i", "I", ignore_case = TRUE, locale = "tr")
-												Reorganising bigger structure of strings

											
										
										
											2021-12-06 00:52:47 +08:00
+								    ```
-												Use the visual editor (#923)

* Sentence wrap + visual editor Rmd style updates

* Move files not called in bookdown.yml to extra

* UK spelling + canonical source + sentence wrap

* Rename .rmd -> .Rmd

* Sentence wrap + visual editor Rmd style

* Fix capitalization
											
										
										
											2021-02-21 23:40:40 +08:00
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								-   **Sorting strings**: `str_sort()` and `str_order()` sort vectors alphabetically, but the alphabet is not the same in every language[^strings-8]!
-												Polishing strings

											
										
										
											2022-01-20 09:35:16 +08:00
+								    Here's an example: in Czech, "ch" is a compound letter that appears after `h` in the alphabet.
-												More string re-orgs

											
										
										
											2021-12-06 22:41:49 +08:00
 								    ```{r}
 								    str_sort(c("a", "c", "ch", "h", "z"))
 								    str_sort(c("a", "c", "ch", "h", "z"), locale = "cs")
 								    ```
 								    Danish has a similar problem.
-												Polishing strings

											
										
										
											2022-01-20 09:35:16 +08:00
+								    Normally, characters with diacritics (e.g. à, á, â) sort after the plain character (e.g. a).
 								    But in Danish ø and å are their own letters that come at the end of the alphabet:
-												More on strings

											
										
										
											2015-11-05 22:10:27 +08:00
-												Reorganising bigger structure of strings

											
										
										
											2021-12-06 00:52:47 +08:00
+								    ```{r}
-												More string re-orgs

											
										
										
											2021-12-06 22:41:49 +08:00
+								    str_sort(c("a", "å", "o", "ø", "z"))
 								    str_sort(c("a", "å", "o", "ø", "z"), locale = "da")
-												Reorganising bigger structure of strings

											
										
										
											2021-12-06 00:52:47 +08:00
+								    ```
-												More on strings

											
										
										
											2015-11-05 22:10:27 +08:00
-												Polishing strings

											
										
										
											2022-01-20 09:35:16 +08:00
+								    This also comes up when sorting strings with `dplyr::arrange()` which is why it also has a `locale` argument.
-												More strings and regexps

											
										
										
											2021-12-14 04:44:52 +08:00
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								[^strings-8]: Sorting in languages that don't have an alphabet (like Chinese) is more complicated still.
-												More strings and regexps

											
										
										
											2021-12-14 04:44:52 +08:00
-												Thrashing together new string/regexp structure

											
										
										
											2022-10-04 05:32:07 +08:00
+								## Summary
-												Rough out summaries for all transform chapters

											
										
										
											2022-10-21 22:14:49 +08:00
 								In this chapter you've learned a wide of tools for working with strings, but you haven't learned one of the most important and powerful tools: regular expressions.
 								Regular expressions are very concise, but very expressive, language for describing patterns within strings, and are the topic of the next chapter.