
<section data-type="chapter" id="chp-strings">
<h1><span id="sec-strings" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Strings</span></span></h1>
<section id="strings-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>So far, you’ve used a bunch of strings without learning much about the details. Now it’s time to dive into them, learn what makes strings tick, and master some of the powerful string manipulation tools you have at your disposal.</p>
<p>We’ll begin with the details of creating strings and character vectors. You’ll then dive into creating strings from data, then the opposite: extracting strings from data. We’ll then discuss tools that work with individual letters, and the chapter finishes with a brief discussion of where your expectations from English might steer you wrong when working with other languages.</p>
<p>We’ll keep working with strings in the next chapter, where you’ll learn more about the power of regular expressions.</p>

<section id="strings-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter, we’ll use functions from the stringr package, which is part of the core tidyverse. We’ll also use the babynames data since it provides some fun strings to manipulate.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
library(babynames)</pre>
</div>
<p>You can quickly tell when you’re using a stringr function because all stringr functions start with <code>str_</code>. This is particularly useful if you use RStudio because typing <code>str_</code> will trigger autocomplete, allowing you to jog your memory of the available functions.</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="screenshots/stringr-autocomplete.png" class="img-fluid" width="678"/></p>
</div>
</div>
</section>
</section>

<section id="creating-a-string" data-type="sect1">
<h1>
Creating a string</h1>
<p>We’ve created strings in passing earlier in the book but didn’t discuss the details. First, you can create a string using either single quotes (<code>'</code>) or double quotes (<code>"</code>). There’s no difference in behavior between the two, so in the interests of consistency, the <a href="https://style.tidyverse.org/syntax.html#character-vectors">tidyverse style guide</a> recommends using <code>"</code>, unless the string contains multiple <code>"</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">string1 &lt;- "This is a string"
string2 &lt;- 'If I want to include a "quote" inside a string, I use single quotes'</pre>
</div>
<p>If you forget to close a quote, you’ll see <code>+</code>, the continuation prompt:</p>
<pre><code>&gt; "This is a string without a closing quote
+ 
+ 
+ HELP I'M STUCK IN A STRING</code></pre>
<p>If this happens to you and you can’t figure out which quote to close, press Escape to cancel and try again.</p>

<section id="escapes" data-type="sect2">
<h2>
Escapes</h2>
<p>To include a literal single or double quote in a string, you can use <code>\</code> to “escape” it:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">double_quote &lt;- "\"" # or '"'
single_quote &lt;- '\'' # or "'"</pre>
</div>
<p>So if you want to include a literal backslash in your string, you’ll need to escape it: <code>"\\"</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">backslash &lt;- "\\"</pre>
</div>
<p>Beware that the printed representation of a string is not the same as the string itself because the printed representation shows the escapes (in other words, when you print a string, you can copy and paste the output to recreate that string). To see the raw contents of the string, use <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code><span data-type="footnote">Or use the base R function <code><a href="https://rdrr.io/r/base/writeLines.html">writeLines()</a></code>.</span>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(single_quote, double_quote, backslash)
x
#&gt; [1] "'"  "\"" "\\"

str_view(x)
#&gt; [1] │ '
#&gt; [2] │ "
#&gt; [3] │ \</pre>
</div>
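<p>The base R alternative mentioned in the footnote behaves similarly. As a minimal sketch, the following redefines the same three strings (so it runs on its own) and prints their raw contents with <code>writeLines()</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># The same three strings as above: a single quote, a double quote, a backslash
x &lt;- c("'", "\"", "\\")

# writeLines() prints the contents of each string, without the escapes
writeLines(x)
#&gt; '
#&gt; "
#&gt; \</pre>
</div>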
</section>

<section id="sec-raw-strings" data-type="sect2">
<h2>
Raw strings</h2>
<p>Creating a string with multiple quotes or backslashes gets confusing quickly. To illustrate the problem, let’s create a string that contains the contents of the code block where we define the <code>double_quote</code> and <code>single_quote</code> variables:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">tricky &lt;- "double_quote &lt;- \"\\\"\" # or '\"'
single_quote &lt;- '\\'' # or \"'\""
str_view(tricky)
#&gt; [1] │ double_quote &lt;- "\"" # or '"'
#&gt;     │ single_quote &lt;- '\'' # or "'"</pre>
</div>
<p>That’s a lot of backslashes! (This is sometimes called <a href="https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome">leaning toothpick syndrome</a>.) To eliminate the escaping, you can instead use a <strong>raw string</strong><span data-type="footnote">Available in R 4.0.0 and above.</span>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">tricky &lt;- r"(double_quote &lt;- "\"" # or '"'
single_quote &lt;- '\'' # or "'")"
str_view(tricky)
#&gt; [1] │ double_quote &lt;- "\"" # or '"'
#&gt;     │ single_quote &lt;- '\'' # or "'"</pre>
</div>
<p>A raw string usually starts with <code>r"(</code> and finishes with <code>)"</code>. But if your string contains <code>)"</code> you can instead use <code>r"[]"</code> or <code>r"{}"</code>, and if that’s still not enough, you can insert any number of dashes to make the opening and closing pairs unique, e.g., <code>r"--()--"</code>, <code>r"---()---"</code>, etc. Raw strings are flexible enough to handle any text.</p>
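<p>As a small sketch of the alternative delimiters, the two strings below are made up for illustration: the first contains <code>)"</code>, so it uses square brackets, and the second contains both <code>)"</code> and <code>]"</code>, so it adds dashes:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

# Square brackets as delimiters, so )" can appear inside the string
a &lt;- r"[closes with )" but keeps going]"

# Dashes make the delimiters unique when the text contains both )" and ]"
b &lt;- r"--(contains )" and ]" too)--"

str_view(c(a, b))
#&gt; [1] │ closes with )" but keeps going
#&gt; [2] │ contains )" and ]" too</pre>
</div>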
</section>

<section id="other-special-characters" data-type="sect2">
<h2>
Other special characters</h2>
<p>As well as <code>\"</code>, <code>\'</code>, and <code>\\</code>, there are a handful of other special characters that may come in handy. The most common are <code>\n</code>, newline, and <code>\t</code>, tab. You’ll also sometimes see strings containing Unicode escapes that start with <code>\u</code> or <code>\U</code>. This is a way of writing non-English characters that works on all systems. You can see the complete list of other special characters in <code><a href="https://rdrr.io/r/base/Quotes.html">?'"'</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c("one\ntwo", "one\ttwo", "\u00b5", "\U0001f604")
x
#&gt; [1] "one\ntwo" "one\ttwo" "µ"        "😄"
str_view(x)
#&gt; [1] │ one
#&gt;     │ two
#&gt; [2] │ one{\t}two
#&gt; [3] │ µ
#&gt; [4] │ 😄</pre>
</div>
<p>Note that <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> surrounds tabs with curly braces, and colors them blue where the console supports it, to make them easier to spot. One of the challenges of working with text is that there’s a variety of ways that white space can end up in the text, so this highlighting helps you recognize that something strange is going on.</p>
</section>

<section id="strings-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>Create strings that contain the following values:</p>
<ol type="1"><li><p><code>He said "That's amazing!"</code></p></li>
<li><p><code>\a\b\c\d</code></p></li>
<li><p><code>\\\\\\</code></p></li>
</ol></li>
<li>
<p>Create the following string in your R session and print it. What happens to the special “\u00a0”? How does <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> display it? Can you do a little googling to figure out what this special character is?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- "This\u00a0is\u00a0tricky"</pre>
</div>
</li>
</ol></section>
</section>

<section id="creating-many-strings-from-data" data-type="sect1">
<h1>
Creating many strings from data</h1>
<p>Now that you’ve learned the basics of creating a string or two by “hand”, we’ll go into the details of creating strings from other strings. This will help you solve the common problem where you have some text you wrote that you want to combine with strings from a data frame. For example, you might combine “Hello” with a <code>name</code> variable to create a greeting. We’ll show you how to do this with <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code> and how you can use them with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>. That naturally raises the question of what string functions you might use with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>, so we’ll finish this section with a discussion of <code><a href="https://stringr.tidyverse.org/reference/str_flatten.html">str_flatten()</a></code>, which is a summary function for strings.</p>

<section id="str_c" data-type="sect2">
<h2>
str_c()
</h2>
<p><code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> takes any number of vectors as arguments and returns a character vector:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_c("x", "y")
#&gt; [1] "xy"
str_c("x", "y", "z")
#&gt; [1] "xyz"
str_c("Hello ", c("John", "Susan"))
#&gt; [1] "Hello John"  "Hello Susan"</pre>
</div>
<p><code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> is very similar to the base <code><a href="https://rdrr.io/r/base/paste.html">paste0()</a></code>, but is designed to be used with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> by obeying the usual tidyverse rules for recycling and propagating missing values:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(name = c("Flora", "David", "Terra"))
df |&gt; mutate(greeting = str_c("Hi ", name, "!"))
#&gt; # A tibble: 3 × 2
#&gt;   name  greeting 
#&gt;   &lt;chr&gt; &lt;chr&gt;    
#&gt; 1 Flora Hi Flora!
#&gt; 2 David Hi David!
#&gt; 3 Terra Hi Terra!</pre>
</div>
<p>If you want missing values to display in another way, use <code><a href="https://dplyr.tidyverse.org/reference/coalesce.html">coalesce()</a></code> to replace them. Depending on what you want, you might use it either inside or outside of <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt; 
  mutate(
    greeting1 = str_c("Hi ", coalesce(name, "you"), "!"),
    greeting2 = coalesce(str_c("Hi ", name, "!"), "Hi!")
  )
#&gt; # A tibble: 3 × 3
#&gt;   name  greeting1 greeting2
#&gt;   &lt;chr&gt; &lt;chr&gt;     &lt;chr&gt;    
#&gt; 1 Flora Hi Flora! Hi Flora!
#&gt; 2 David Hi David! Hi David!
#&gt; 3 Terra Hi Terra! Hi Terra!</pre>
</div>
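<p>To see the difference between the two placements, it helps to have an actual missing value. The sketch below uses a hypothetical tibble, <code>df_na</code>, that is not part of the data above; <code>greeting</code> shows the missing value propagating through <code>str_c()</code>, while <code>greeting1</code> and <code>greeting2</code> show the two ways of replacing it:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

# Hypothetical data with one missing name
df_na &lt;- tibble(name = c("Flora", NA))

greetings &lt;- df_na |&gt; 
  mutate(
    greeting = str_c("Hi ", name, "!"),                    # NA propagates
    greeting1 = str_c("Hi ", coalesce(name, "you"), "!"),  # replace before combining
    greeting2 = coalesce(str_c("Hi ", name, "!"), "Hi!")   # replace after combining
  )
greetings
#&gt; # A tibble: 2 × 4
#&gt;   name  greeting  greeting1 greeting2
#&gt;   &lt;chr&gt; &lt;chr&gt;     &lt;chr&gt;     &lt;chr&gt;    
#&gt; 1 Flora Hi Flora! Hi Flora! Hi Flora!
#&gt; 2 &lt;NA&gt;  &lt;NA&gt;      Hi you!   Hi!</pre>
</div>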
</section>

<section id="sec-glue" data-type="sect2">
<h2>
str_glue()
</h2>
<p>If you are mixing many fixed and variable strings with <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code>, you’ll notice that you type a lot of <code>"</code>s, making it hard to see the overall goal of the code. An alternative approach is provided by the <a href="https://glue.tidyverse.org">glue package</a> via <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code><span data-type="footnote">If you’re not using stringr, you can also access it directly with <code><a href="https://glue.tidyverse.org/reference/glue.html">glue::glue()</a></code>.</span>. You give it a single string that has a special feature: anything inside <code><a href="https://rdrr.io/r/base/Paren.html">{}</a></code> will be evaluated like it’s outside of the quotes:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt; mutate(greeting = str_glue("Hi {name}!"))
#&gt; # A tibble: 3 × 2
#&gt;   name  greeting 
#&gt;   &lt;chr&gt; &lt;glue&gt;   
#&gt; 1 Flora Hi Flora!
#&gt; 2 David Hi David!
#&gt; 3 Terra Hi Terra!</pre>
</div>
<p>Note that <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code> currently converts missing values to the string <code>"NA"</code>, unfortunately making it inconsistent with <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code>.</p>
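<p>A minimal sketch of that inconsistency, using a hypothetical <code>name</code> vector with a missing value (it is not the <code>df</code> used above):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

# Hypothetical input with one missing value
name &lt;- c("Flora", NA)

# str_glue() turns the missing value into the text "NA"
glued &lt;- as.character(str_glue("Hi {name}!"))
glued
#&gt; [1] "Hi Flora!" "Hi NA!"

# str_c() propagates the missing value instead
combined &lt;- str_c("Hi ", name, "!")
combined
#&gt; [1] "Hi Flora!" NA</pre>
</div>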
<p>You also might wonder what happens if you need to include a regular <code>{</code> or <code>}</code> in your string. You’re on the right track if you guess you’ll need to escape it somehow. The trick is that glue uses a slightly different escaping technique: instead of prefixing with a special character like <code>\</code>, you double up the special characters:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt; mutate(greeting = str_glue("{{Hi {name}!}}"))
#&gt; # A tibble: 3 × 2
#&gt;   name  greeting   
#&gt;   &lt;chr&gt; &lt;glue&gt;     
#&gt; 1 Flora {Hi Flora!}
#&gt; 2 David {Hi David!}
#&gt; 3 Terra {Hi Terra!}</pre>
</div>
</section>

<section id="str_flatten" data-type="sect2">
<h2>
str_flatten()
</h2>
<p><code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code> work well with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> because their output is the same length as their inputs. What if you want a function that works well with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>, i.e., something that always returns a single string? That’s the job of <code><a href="https://stringr.tidyverse.org/reference/str_flatten.html">str_flatten()</a></code><span data-type="footnote">The base R equivalent is <code><a href="https://rdrr.io/r/base/paste.html">paste()</a></code> used with the <code>collapse</code> argument.</span>: it takes a character vector and combines each element of the vector into a single string:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_flatten(c("x", "y", "z"))
#&gt; [1] "xyz"
str_flatten(c("x", "y", "z"), ", ")
#&gt; [1] "x, y, z"
str_flatten(c("x", "y", "z"), ", ", last = ", and ")
#&gt; [1] "x, y, and z"</pre>
</div>
<p>This makes it work well with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tribble(
  ~ name, ~ fruit,
  "Carmen", "banana",
  "Carmen", "apple",
  "Marvin", "nectarine",
  "Terence", "cantaloupe",
  "Terence", "papaya",
  "Terence", "mandarin"
)
df |&gt;
  group_by(name) |&gt; 
  summarize(fruits = str_flatten(fruit, ", "))
#&gt; # A tibble: 3 × 2
#&gt;   name    fruits                      
#&gt;   &lt;chr&gt;   &lt;chr&gt;                       
#&gt; 1 Carmen  banana, apple               
#&gt; 2 Marvin  nectarine                   
#&gt; 3 Terence cantaloupe, papaya, mandarin</pre>
</div>
</section>

<section id="strings-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>Compare and contrast the results of <code><a href="https://rdrr.io/r/base/paste.html">paste0()</a></code> with <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> for the following inputs:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_c("hi ", NA)
str_c(letters[1:2], letters[1:3])</pre>
</div>
</li>
<li>
<p>Convert the following expressions from <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> to <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code> or vice versa:</p>
<ol type="a"><li><p><code>str_c("The price of ", food, " is ", price)</code></p></li>
<li><p><code>str_glue("I'm {age} years old and live in {country}")</code></p></li>
<li><p><code>str_c("\\section{", title, "}")</code></p></li>
</ol></li>
</ol></section>
</section>

<section id="extracting-data-from-strings" data-type="sect1">
<h1>
Extracting data from strings</h1>
<p>It’s very common for multiple variables to be crammed together into a single string. In this section, you’ll learn how to use four tidyr functions to extract them:</p>
<ul><li><code>df |&gt; separate_longer_delim(col, delim)</code></li>
<li><code>df |&gt; separate_longer_position(col, width)</code></li>
<li><code>df |&gt; separate_wider_delim(col, delim, names)</code></li>
<li><code>df |&gt; separate_wider_position(col, widths)</code></li>
</ul><p>If you look closely, you can see there’s a common pattern here: <code>separate_</code>, then <code>longer</code> or <code>wider</code>, then <code>_</code>, then by <code>delim</code> or <code>position</code>. That’s because these four functions are composed of two simpler primitives:</p>
<ul><li>
<code>longer</code> makes the input data frame longer, creating new rows; <code>wider</code> makes the input data frame wider, generating new columns.</li>
<li>
<code>delim</code> splits up a string with a delimiter like <code>", "</code> or <code>" "</code>; <code>position</code> splits at specified widths, like <code>c(3, 5, 2)</code>.</li>
</ul><p>We’ll return to the last member of this family, <code>separate_wider_regex()</code>, in <a href="#chp-regexps" data-type="xref">#chp-regexps</a>. It’s the most flexible of the <code>wider</code> functions, but you need to know something about regular expressions before you can use it.</p>
<p>The following two sections will give you the basic idea behind these separate functions, first separating into rows (which is a little simpler) and then separating into columns. We’ll finish off by discussing the tools that the <code>wider</code> functions give you to diagnose problems.</p>

<section id="separating-into-rows" data-type="sect2">
<h2>
Separating into rows</h2>
<p>Separating a string into rows tends to be most useful when the number of components varies from row to row. The most common case is requiring <code><a href="https://tidyr.tidyverse.org/reference/separate_longer_delim.html">separate_longer_delim()</a></code> to split based on a delimiter:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df1 &lt;- tibble(x = c("a,b,c", "d,e", "f"))
df1 |&gt; 
  separate_longer_delim(x, delim = ",")
#&gt; # A tibble: 6 × 1
#&gt;   x    
#&gt;   &lt;chr&gt;
#&gt; 1 a    
#&gt; 2 b    
#&gt; 3 c    
#&gt; 4 d    
#&gt; 5 e    
#&gt; 6 f</pre>
</div>
<p>It’s rarer to see <code><a href="https://tidyr.tidyverse.org/reference/separate_longer_delim.html">separate_longer_position()</a></code> in the wild, but some older datasets do use a very compact format where each character is used to record a value:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df2 &lt;- tibble(x = c("1211", "131", "21"))
df2 |&gt; 
  separate_longer_position(x, width = 1)
#&gt; # A tibble: 9 × 1
#&gt;   x    
#&gt;   &lt;chr&gt;
#&gt; 1 1    
#&gt; 2 2    
#&gt; 3 1    
#&gt; 4 1    
#&gt; 5 1    
#&gt; 6 3    
#&gt; # … with 3 more rows</pre>
</div>
</section>

<section id="sec-string-columns" data-type="sect2">
<h2>
Separating into columns</h2>
<p>Separating a string into columns tends to be most useful when there are a fixed number of components in each string, and you want to spread them into columns. The <code>wider</code> functions are slightly more complicated than their <code>longer</code> equivalents because you need to name the columns. For example, in the following dataset, <code>x</code> is made up of a code, an edition number, and a year, separated by <code>"."</code>. To use <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code>, we supply the delimiter and the names in two arguments:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df3 &lt;- tibble(x = c("a10.1.2022", "b10.2.2011", "e15.1.2015"))
df3 |&gt; 
  separate_wider_delim(
    x,
    delim = ".",
    names = c("code", "edition", "year")
  )
#&gt; # A tibble: 3 × 3
#&gt;   code  edition year 
#&gt;   &lt;chr&gt; &lt;chr&gt;   &lt;chr&gt;
#&gt; 1 a10   1       2022 
#&gt; 2 b10   2       2011 
#&gt; 3 e15   1       2015</pre>
</div>
<p>If a specific piece is not useful you can use an <code>NA</code> name to omit it from the results:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df3 |&gt; 
  separate_wider_delim(
    x,
    delim = ".",
    names = c("code", NA, "year")
  )
#&gt; # A tibble: 3 × 2
#&gt;   code  year 
#&gt;   &lt;chr&gt; &lt;chr&gt;
#&gt; 1 a10   2022 
#&gt; 2 b10   2011 
#&gt; 3 e15   2015</pre>
</div>
<p><code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_position()</a></code> works a little differently because you typically want to specify the width of each column. So you give it a named integer vector, where the name gives the name of the new column, and the value is the number of characters it occupies. You can omit values from the output by not naming them:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df4 &lt;- tibble(x = c("202215TX", "202122LA", "202325CA")) 
df4 |&gt; 
  separate_wider_position(
    x,
    widths = c(year = 4, age = 2, state = 2)
  )
#&gt; # A tibble: 3 × 3
#&gt;   year  age   state
#&gt;   &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2022  15    TX   
#&gt; 2 2021  22    LA   
#&gt; 3 2023  25    CA</pre>
</div>
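<p>The example above names every piece, so nothing is dropped. As a sketch of the omission described earlier, the variant below leaves the two-character age field unnamed; per the tidyr documentation, unnamed components are matched but not included in the output:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

df4 &lt;- tibble(x = c("202215TX", "202122LA", "202325CA"))

# The middle width (2 characters for age) has no name, so it is skipped
no_age &lt;- df4 |&gt; 
  separate_wider_position(
    x,
    widths = c(year = 4, 2, state = 2)
  )
no_age
#&gt; # A tibble: 3 × 2
#&gt;   year  state
#&gt;   &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2022  TX   
#&gt; 2 2021  LA   
#&gt; 3 2023  CA</pre>
</div>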
</section>

<section id="diagnosing-widening-problems" data-type="sect2">
<h2>
Diagnosing widening problems</h2>
<p><code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code><span data-type="footnote">The same principles apply to <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_position()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code>.</span> requires a fixed and known set of columns. What happens if some of the rows don’t have the expected number of pieces? There are two possible problems, too few or too many pieces, so <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code> provides two arguments to help: <code>too_few</code> and <code>too_many</code>. Let’s first look at the <code>too_few</code> case with the following sample dataset:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(x = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1"))

df |&gt; 
  separate_wider_delim(
    x,
    delim = "-",
    names = c("x", "y", "z")
  )
#&gt; Error in `separate_wider_delim()`:
#&gt; ! Expected 3 pieces in each element of `x`.
#&gt; ! 2 values were too short.
#&gt; ℹ Use `too_few = "debug"` to diagnose the problem.
#&gt; ℹ Use `too_few = "align_start"/"align_end"` to silence this message.</pre>
</div>
<p>You’ll notice that we get an error, but the error gives us some suggestions on how to proceed. Let’s start by debugging the problem:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">debug &lt;- df |&gt; 
  separate_wider_delim(
    x,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = "debug"
  )
#&gt; Warning: Debug mode activated: adding variables `x_ok`, `x_pieces`, and
#&gt; `x_remainder`.
debug
#&gt; # A tibble: 5 × 6
#&gt;   x     y     z     x_ok  x_pieces x_remainder
#&gt;   &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;lgl&gt;    &lt;int&gt; &lt;chr&gt;      
#&gt; 1 1-1-1 1     1     TRUE         3 ""         
#&gt; 2 1-1-2 1     2     TRUE         3 ""         
#&gt; 3 1-3   3     &lt;NA&gt;  FALSE        2 ""         
#&gt; 4 1-3-2 3     2     TRUE         3 ""         
#&gt; 5 1     &lt;NA&gt;  &lt;NA&gt;  FALSE        1 ""</pre>
</div>
<p>When you use the debug mode, you get three extra columns added to the output: <code>x_ok</code>, <code>x_pieces</code>, and <code>x_remainder</code> (if you separate a variable with a different name, you’ll get a different prefix). Here, <code>x_ok</code> lets you quickly find the inputs that failed:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">debug |&gt; filter(!x_ok)
#&gt; # A tibble: 2 × 6
#&gt;   x     y     z     x_ok  x_pieces x_remainder
#&gt;   &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;lgl&gt;    &lt;int&gt; &lt;chr&gt;      
#&gt; 1 1-3   3     &lt;NA&gt;  FALSE        2 ""         
#&gt; 2 1     &lt;NA&gt;  &lt;NA&gt;  FALSE        1 ""</pre>
</div>
<p><code>x_pieces</code> tells us how many pieces were found, compared to the expected 3 (the length of <code>names</code>). <code>x_remainder</code> isn’t useful when there are too few pieces, but we’ll see it again shortly.</p>
<p>Sometimes looking at this debugging information will reveal a problem with your delimiter strategy or suggest that you need to do more preprocessing before separating. In that case, fix the problem upstream and make sure to remove <code>too_few = "debug"</code> to ensure that new problems become errors.</p>
<p>In other cases, you may want to fill in the missing pieces with <code>NA</code>s and move on. That’s the job of <code>too_few = "align_start"</code> and <code>too_few = "align_end"</code> which allow you to control where the <code>NA</code>s should go:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt; 
  separate_wider_delim(
    x,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = "align_start"
  )
#&gt; # A tibble: 5 × 3
#&gt;   x     y     z    
#&gt;   &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1     1     1    
#&gt; 2 1     1     2    
#&gt; 3 1     3     &lt;NA&gt; 
#&gt; 4 1     3     2    
#&gt; 5 1     &lt;NA&gt;  &lt;NA&gt;</pre>
</div>
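<p>For comparison, here is a sketch of the other option, <code>too_few = "align_end"</code>, on the same values (recreated under the hypothetical name <code>df_few</code> so the block runs on its own). The pieces are now pushed to the right, so the <code>NA</code>s fill the leading columns instead:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

# Same values as the too_few example above
df_few &lt;- tibble(x = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1"))

aligned_end &lt;- df_few |&gt; 
  separate_wider_delim(
    x,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = "align_end"
  )
aligned_end
#&gt; # A tibble: 5 × 3
#&gt;   x     y     z    
#&gt;   &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1     1     1    
#&gt; 2 1     1     2    
#&gt; 3 &lt;NA&gt;  1     3    
#&gt; 4 1     3     2    
#&gt; 5 &lt;NA&gt;  &lt;NA&gt;  1</pre>
</div>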
<p>The same principles apply if you have too many pieces:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(x = c("1-1-1", "1-1-2", "1-3-5-6", "1-3-2", "1-3-5-7-9"))

df |&gt; 
  separate_wider_delim(
    x,
    delim = "-",
    names = c("x", "y", "z")
  )
#&gt; Error in `separate_wider_delim()`:
#&gt; ! Expected 3 pieces in each element of `x`.
#&gt; ! 2 values were too long.
#&gt; ℹ Use `too_many = "debug"` to diagnose the problem.
#&gt; ℹ Use `too_many = "drop"/"merge"` to silence this message.</pre>
</div>
<p>But now, when we debug the result, you can see the purpose of <code>x_remainder</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">debug &lt;- df |&gt; 
  separate_wider_delim(
    x,
    delim = "-",
    names = c("x", "y", "z"),
    too_many = "debug"
  )
#&gt; Warning: Debug mode activated: adding variables `x_ok`, `x_pieces`, and
#&gt; `x_remainder`.
debug |&gt; filter(!x_ok)
#&gt; # A tibble: 2 × 6
#&gt;   x         y     z     x_ok  x_pieces x_remainder
#&gt;   &lt;chr&gt;     &lt;chr&gt; &lt;chr&gt; &lt;lgl&gt;    &lt;int&gt; &lt;chr&gt;      
#&gt; 1 1-3-5-6   3     5     FALSE        4 -6         
#&gt; 2 1-3-5-7-9 3     5     FALSE        5 -7-9</pre>
</div>
<p>You have a slightly different set of options for handling too many pieces: you can either silently “drop” any additional pieces or “merge” them all into the final column:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt; 
  separate_wider_delim(
    x,
    delim = "-",
    names = c("x", "y", "z"),
    too_many = "drop"
  )
#&gt; # A tibble: 5 × 3
#&gt;   x     y     z    
#&gt;   &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1     1     1    
#&gt; 2 1     1     2    
#&gt; 3 1     3     5    
#&gt; 4 1     3     2    
#&gt; 5 1     3     5


df |&gt; 
  separate_wider_delim(
    x,
    delim = "-",
    names = c("x", "y", "z"),
    too_many = "merge"
  )
#&gt; # A tibble: 5 × 3
#&gt;   x     y     z    
#&gt;   &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1     1     1    
#&gt; 2 1     1     2    
#&gt; 3 1     3     5-6  
#&gt; 4 1     3     2    
#&gt; 5 1     3     5-7-9</pre>
</div>
</section>
</section>

<section id="letters" data-type="sect1">
<h1>
Letters</h1>
<p>In this section, we’ll introduce you to functions that allow you to work with the individual letters within a string. You’ll learn how to find the length of a string, extract substrings, and handle long strings in plots and tables.</p>

<section id="length" data-type="sect2">
<h2>
Length</h2>
<p><code><a href="https://stringr.tidyverse.org/reference/str_length.html">str_length()</a></code> tells you the number of letters in the string:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_length(c("a", "R for data science", NA))
#&gt; [1]  1 18 NA</pre>
</div>
<p>You could use this with <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> to find the distribution of lengths of US babynames and then with <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> to look at the longest names<span data-type="footnote">Looking at these entries, we’d guess that the babynames data drops spaces or hyphens and truncates after 15 letters.</span>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">babynames |&gt;
  count(length = str_length(name), wt = n)
#&gt; # A tibble: 14 × 2
#&gt;   length        n
#&gt;    &lt;int&gt;    &lt;int&gt;
#&gt; 1      2   338150
#&gt; 2      3  8589596
#&gt; 3      4 48506739
#&gt; 4      5 87011607
#&gt; 5      6 90749404
#&gt; 6      7 72120767
#&gt; # … with 8 more rows

babynames |&gt; 
  filter(str_length(name) == 15) |&gt; 
  count(name, wt = n, sort = TRUE)
#&gt; # A tibble: 34 × 2
#&gt;   name                n
#&gt;   &lt;chr&gt;           &lt;int&gt;
#&gt; 1 Franciscojavier   123
#&gt; 2 Christopherjohn   118
#&gt; 3 Johnchristopher   118
#&gt; 4 Christopherjame   108
#&gt; 5 Christophermich    52
#&gt; 6 Ryanchristopher    45
#&gt; # … with 28 more rows</pre>
</div>
</section>

<section id="subsetting" data-type="sect2">
<h2>
Subsetting</h2>
<p>You can extract parts of a string using <code>str_sub(string, start, end)</code>, where <code>start</code> and <code>end</code> are the positions where the substring should start and end. The <code>start</code> and <code>end</code> arguments are inclusive, so the length of the returned string will be <code>end - start + 1</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
#&gt; [1] "App" "Ban" "Pea"</pre>
</div>
<p>You can use negative values to count back from the end of the string: -1 is the last character, -2 is the second to last character, etc.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_sub(x, -3, -1)
#&gt; [1] "ple" "ana" "ear"</pre>
</div>
<p>Note that <code><a href="https://stringr.tidyverse.org/reference/str_sub.html">str_sub()</a></code> won’t fail if the string is too short: it will just return as much as possible:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_sub("a", 1, 5)
#&gt; [1] "a"</pre>
</div>
<p>We could use <code><a href="https://stringr.tidyverse.org/reference/str_sub.html">str_sub()</a></code> with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> to find the first and last letter of each name:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">babynames |&gt; 
  mutate(
    first = str_sub(name, 1, 1),
    last = str_sub(name, -1, -1)
  )
#&gt; # A tibble: 1,924,665 × 7
#&gt;    year sex   name          n   prop first last 
#&gt;   &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;     &lt;int&gt;  &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1  1880 F     Mary       7065 0.0724 M     y    
#&gt; 2  1880 F     Anna       2604 0.0267 A     a    
#&gt; 3  1880 F     Emma       2003 0.0205 E     a    
#&gt; 4  1880 F     Elizabeth  1939 0.0199 E     h    
#&gt; 5  1880 F     Minnie     1746 0.0179 M     e    
#&gt; 6  1880 F     Margaret   1578 0.0162 M     t    
#&gt; # … with 1,924,659 more rows</pre>
</div>
</section>

<section id="long-strings" data-type="sect2">
<h2>
Long strings</h2>
<p>Sometimes you care about the length of a string because you’re trying to fit it into a label on a plot or table. stringr provides two useful tools for cases where your string is too long:</p>
<ul><li><p><code>str_trunc(x, 30)</code> ensures that no string is longer than 30 characters by cutting longer strings short and ending them with <code>...</code> (the ellipsis counts toward the 30 characters).</p></li>
<li><p><code>str_wrap(x, 30)</code> wraps a string by introducing new lines so that each line is at most 30 characters (it doesn’t hyphenate, however, so any word longer than 30 characters will make a longer line).</p></li>
</ul><p>The following code shows these functions in action with a made-up string:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- paste0(
  "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod ",
  "tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim ",
  "veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea ",
  "commodo consequat."
)

str_view(str_trunc(x, 30))
#&gt; [1] │ Lorem ipsum dolor sit amet,...
str_view(str_wrap(x, 30))
#&gt; [1] │ Lorem ipsum dolor sit amet,
#&gt;     │ consectetur adipiscing
#&gt;     │ elit, sed do eiusmod tempor
#&gt;     │ incididunt ut labore et dolore
#&gt;     │ magna aliqua. Ut enim ad
#&gt;     │ minim veniam, quis nostrud
#&gt;     │ exercitation ullamco laboris
#&gt;     │ nisi ut aliquip ex ea commodo
#&gt;     │ consequat.</pre>
</div>
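<p>To make the width rule for truncation concrete, here is a minimal sketch with a made-up vector of plot labels (<code>labels</code> is our own example, not part of the chapter’s data). Strings that already fit are left alone, and truncated strings are exactly as wide as requested because the ellipsis is included in the count. The <code>side</code> argument controls which end of the string is kept:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

labels &lt;- c("short", "a considerably longer label that will not fit")

# Strings within the limit are untouched; longer ones end up exactly 20 wide
truncated &lt;- str_trunc(labels, 20)
str_length(truncated)
#&gt; [1]  5 20

# Keep the end of the string instead of the beginning
str_view(str_trunc(labels, 20, side = "left"))</pre>
</div>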
</section>

<section id="strings-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>Use <code><a href="https://stringr.tidyverse.org/reference/str_length.html">str_length()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_sub.html">str_sub()</a></code> to extract the middle letter from each baby name. What will you do if the string has an even number of characters?</li>
<li>Are there any major trends in the length of babynames over time? What about the popularity of first and last letters?</li>
</ol></section>
</section>

<section id="sec-other-languages" data-type="sect1">
<h1>
Non-English text</h1>
<p>So far, we’ve focused on English-language text, which is particularly easy to work with for two reasons. Firstly, the English alphabet is relatively simple: there are just 26 letters. Secondly (and maybe more importantly), the computing infrastructure we use today was predominantly designed by English speakers. Unfortunately, we don’t have room for a full treatment of non-English languages. Still, we want to draw your attention to some of the biggest challenges you might encounter: encoding, letter variations, and locale-dependent functions.</p>

<section id="encoding" data-type="sect2">
<h2>
Encoding</h2>
<p>When working with non-English text, the first challenge is often the <strong>encoding</strong>. To understand what’s going on, we need to dive into how computers represent strings. In R, we can get at the underlying representation of a string using <code><a href="https://rdrr.io/r/base/rawConversion.html">charToRaw()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">charToRaw("Hadley")
#&gt; [1] 48 61 64 6c 65 79</pre>
</div>
<p>Each of these six hexadecimal numbers represents one letter: <code>48</code> is H, <code>61</code> is a, and so on. The mapping from hexadecimal number to character is called the encoding, and in this case, the encoding is called ASCII. ASCII does a great job of representing English characters because it’s the <strong>American</strong> Standard Code for Information Interchange.</p>
<p>Things aren’t so easy for languages other than English. In the early days of computing, there were many competing standards for encoding non-English characters. For example, there were two different encodings for Europe: Latin1 (aka ISO-8859-1) was used for Western European languages, and Latin2 (aka ISO-8859-2) was used for Central European languages. In Latin1, the byte <code>b1</code> is “±”, but in Latin2, it’s “ą”! Fortunately, today there is one standard that is supported almost everywhere: UTF-8. UTF-8 can encode just about every character used by humans today and many extra symbols like emojis.</p>
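<p>As a rough illustration of how the same character maps to different bytes under different encodings, the sketch below looks at “ñ” in UTF-8 and in Latin1. It uses only base R; the variable names are our own, and <code><a href="https://rdrr.io/r/base/iconv.html">iconv()</a></code> is one of several ways to convert between encodings:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># "\u00f1" is ñ, written as a Unicode escape; R stores it as UTF-8
n_utf8 &lt;- "\u00f1"
charToRaw(n_utf8)
#&gt; [1] c3 b1

# The same character re-encoded as Latin1 takes a single byte
n_latin1 &lt;- iconv(n_utf8, from = "UTF-8", to = "latin1")
charToRaw(n_latin1)
#&gt; [1] f1

# rawToChar() goes the other way, from bytes back to a string
rawToChar(as.raw(c(0x48, 0x61, 0x64, 0x6c, 0x65, 0x79)))
#&gt; [1] "Hadley"</pre>
</div>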
<p>readr uses UTF-8 everywhere. This is a good default but will fail for data produced by older systems that don’t use UTF-8. If this happens, your strings will look weird when you print them. Sometimes just one or two characters might be messed up; other times, you’ll get complete gibberish. For example, here are two inline CSVs with unusual encodings<span data-type="footnote">Here we’re using the special <code>\x</code> escape to encode binary data directly into a string.</span>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x1 &lt;- "text\nEl Ni\xf1o was particularly bad this year"
read_csv(x1)
#&gt; # A tibble: 1 × 1
#&gt;   text                                       
#&gt;   &lt;chr&gt;                                      
#&gt; 1 "El Ni\xf1o was particularly bad this year"

x2 &lt;- "text\n\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
read_csv(x2)
#&gt; # A tibble: 1 × 1
#&gt;   text                                      
#&gt;   &lt;chr&gt;                                     
#&gt; 1 "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"</pre>
</div>
<p>To read these correctly, you specify the encoding via the <code>locale</code> argument:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">read_csv(x1, locale = locale(encoding = "Latin1"))
#&gt; # A tibble: 1 × 1
#&gt;   text                                  
#&gt;   &lt;chr&gt;                                 
#&gt; 1 El Niño was particularly bad this year

read_csv(x2, locale = locale(encoding = "Shift-JIS"))
#&gt; # A tibble: 1 × 1
#&gt;   text      
#&gt;   &lt;chr&gt;     
#&gt; 1 こんにちは</pre>
</div>
<p>How do you find the correct encoding? If you’re lucky, it’ll be included somewhere in the data documentation. Unfortunately, that’s rarely the case, so readr provides <code><a href="https://readr.tidyverse.org/reference/encoding.html">guess_encoding()</a></code> to help you figure it out. It’s not foolproof and works better when you have lots of text (unlike here), but it’s a reasonable place to start. Expect to try a few different encodings before you find the right one.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">guess_encoding(x1)
#&gt; # A tibble: 1 × 2
#&gt;   encoding   confidence
#&gt;   &lt;chr&gt;           &lt;dbl&gt;
#&gt; 1 ISO-8859-1       0.41
guess_encoding(x2)
#&gt; # A tibble: 1 × 2
#&gt;   encoding confidence
#&gt;   &lt;chr&gt;         &lt;dbl&gt;
#&gt; 1 KOI8-R         0.27</pre>
</div>
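<p>One way to “try a few different encodings” for a string that is already in memory is sketched below. This is only an illustration in base R, not something readr does for you: <code>validUTF8()</code> checks whether the bytes form valid UTF-8, and <code>iconv()</code> re-interprets them under each candidate. The list of candidates is an assumption, and the encoding names that <code>iconv()</code> accepts vary by platform:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x1 &lt;- "El Ni\xf1o was particularly bad this year"

# The raw bytes are not valid UTF-8, so the text must use another encoding
validUTF8(x1)
#&gt; [1] FALSE

# Convert from each candidate encoding and judge the results by eye
candidates &lt;- c("latin1", "ISO-8859-2")
for (enc in candidates) {
  cat(enc, ": ", iconv(x1, from = enc, to = "UTF-8"), "\n", sep = "")
}</pre>
</div>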
<p>Encodings are a rich and complex topic; we’ve only scratched the surface here. If you’d like to learn more, we recommend reading the detailed explanation at <a href="http://kunststube.net/encoding/" class="uri">http://kunststube.net/encoding/</a>.</p>
</section>

<section id="letter-variations" data-type="sect2">
<h2>
Letter variations</h2>
<p>Working in languages with accents poses a significant challenge when determining the position of letters (e.g., with <code><a href="https://stringr.tidyverse.org/reference/str_length.html">str_length()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_sub.html">str_sub()</a></code>) because an accented letter might be encoded as a single character (e.g., ü) or as two characters: an unaccented letter (e.g., u) combined with a diacritic mark (e.g., ¨). For example, this code shows two ways of representing ü that look identical:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">u &lt;- c("\u00fc", "u\u0308")
str_view(u)
#&gt; [1] │ ü
#&gt; [2] │ ü</pre>
</div>
<p>But the two strings differ in length, and their first characters are different:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_length(u)
#&gt; [1] 1 2
str_sub(u, 1, 1)
#&gt; [1] "ü" "u"</pre>
</div>
<p>Finally, note that a comparison of these strings with <code>==</code> interprets these strings as different, while the handy <code><a href="https://stringr.tidyverse.org/reference/str_equal.html">str_equal()</a></code> function in stringr recognizes that both have the same appearance:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">u[[1]] == u[[2]]
#&gt; [1] FALSE

str_equal(u[[1]], u[[2]])
#&gt; [1] TRUE</pre>
</div>
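<p>If you need positions and lengths to agree as well, one option is to normalize the strings to a single representation first. The sketch below calls <code>stri_trans_nfc()</code> from the stringi package (the package that stringr is built on); using stringi directly is our own suggestion here rather than something the chapter relies on. NFC normalization composes a letter and its combining mark into one character where possible:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

u &lt;- c("\u00fc", "u\u0308")

# Compose "u" + combining diaeresis into the single character ü
u_nfc &lt;- stringi::stri_trans_nfc(u)

str_length(u_nfc)
#&gt; [1] 1 1
u_nfc[[1]] == u_nfc[[2]]
#&gt; [1] TRUE</pre>
</div>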
</section>

<section id="locale-dependent-functions" data-type="sect2">
<h2>
Locale-dependent functions</h2>
<p>Finally, there are a handful of stringr functions whose behavior depends on your <strong>locale</strong>. A locale is similar to a language but includes an optional region specifier to handle regional variations within a language. A locale is specified by a lower-case language abbreviation, optionally followed by a <code>_</code> and an upper-case region identifier. For example, “en” is English, “en_GB” is British English, and “en_US” is American English. If you don’t already know the code for your language, <a href="https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes">Wikipedia</a> has a good list, and you can see which are supported in stringr by looking at <code><a href="https://rdrr.io/pkg/stringi/man/stri_locale_list.html">stringi::stri_locale_list()</a></code>.</p>
<p>Base R string functions automatically use the locale set by your operating system. This means that base R string functions do what you expect for your language, but your code might work differently if you share it with someone who lives in a different country. To avoid this problem, stringr defaults to English rules by using the “en” locale and requires you to specify the <code>locale</code> argument to override it. Fortunately, there are only two sets of functions where the locale really matters: changing case and sorting.</p>
<p>The rules for changing case differ among languages. For example, Turkish has two i’s: with and without a dot. Since they’re two distinct letters, they’re capitalized differently:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_to_upper(c("i", "ı"))
#&gt; [1] "I" "I"
str_to_upper(c("i", "ı"), locale = "tr")
#&gt; [1] "İ" "I"</pre>
</div>
<p>Sorting strings depends on the order of the alphabet, and the order of the alphabet is not the same in every language<span data-type="footnote">Sorting in languages that don’t have an alphabet, like Chinese, is more complicated still.</span>! Here’s an example: in Czech, “ch” is a compound letter that appears after <code>h</code> in the alphabet.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_sort(c("a", "c", "ch", "h", "z"))
#&gt; [1] "a"  "c"  "ch" "h"  "z"
str_sort(c("a", "c", "ch", "h", "z"), locale = "cs")
#&gt; [1] "a"  "c"  "h"  "ch" "z"</pre>
</div>
<p>This also comes up when sorting strings with <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">dplyr::arrange()</a></code>, which is why it has a <code>.locale</code> argument.</p>
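<p>Here is a minimal sketch of locale-aware sorting in a data frame. It assumes dplyr 1.1.0 or later, where the argument is spelled <code>.locale</code> and needs the stringi package to be installed; the tibble <code>letters_df</code> is made up for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

letters_df &lt;- tibble(x = c("z", "ch", "a", "h", "c"))

# Sort with Czech rules, where "ch" comes after "h"
letters_df |&gt;
  arrange(x, .locale = "cs") |&gt;
  pull(x)
#&gt; [1] "a"  "c"  "h"  "ch" "z"</pre>
</div>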
</section>
</section>

<section id="strings-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you’ve learned about some of the power of the stringr package: how to create, combine, and extract strings, and about some of the challenges you might face with non-English strings. Now it’s time to learn one of the most important and powerful tools for working with strings: regular expressions. Regular expressions are a very concise but very expressive language for describing patterns within strings and are the topic of the next chapter.</p>


</section>
</section>
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								<section data-type="chapter" id="chp-strings">
-												Actually strip status

											
										
										
											2022-11-19 01:55:22 +08:00
+								<h1><span id="sec-strings" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Strings</span></span></h1>
-												More minor page count tweaks & fixes

And re-convert with latest htmlbook

											
										
										
											2023-01-27 00:36:07 +08:00
+								<section id="strings-introduction" data-type="sect1">
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								<h1>
 								Introduction</h1>
-												Re-render book for O'Reilly

											
										
										
											2023-01-13 07:22:57 +08:00
+								<p>So far, you’ve used a bunch of strings without learning much about the details. Now it’s time to dive into them, learn what makes strings tick, and master some of the powerful string manipulation tools you have at your disposal.</p>
 								<p>We’ll begin with the details of creating strings and character vectors. You’ll then dive into creating strings from data, then the opposite; extracting strings from data. We’ll then discuss tools that work with individual letters. The chapter finishes with functions that work with individual letters and a brief discussion of where your expectations from English might steer you wrong when working with other languages.</p>
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								<p>We’ll keep working with strings in the next chapter, where you’ll learn more about the power of regular expressions.</p>
-												More minor page count tweaks & fixes

And re-convert with latest htmlbook

											
										
										
											2023-01-27 00:36:07 +08:00
+								<section id="strings-prerequisites" data-type="sect2">
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								<h2>
 								Prerequisites</h2>
-												Re-render book for O'Reilly

											
										
										
											2023-01-13 07:22:57 +08:00
+								<p>In this chapter, we’ll use functions from the stringr package, which is part of the core tidyverse. We’ll also use the babynames data since it provides some fun strings to manipulate.</p>
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								<div class="cell">
-												Fix code language

											
										
										
											2022-11-19 01:26:25 +08:00
+								<pre data-type="programlisting" data-code-language="r">library(tidyverse)
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								library(babynames)</pre>
 								</div>
-												Re-render book for O'Reilly

											
										
										
											2023-01-13 07:22:57 +08:00
+								<p>You can quickly tell when you’re using a stringr function because all stringr functions start with <code>str_</code>. This is particularly useful if you use RStudio because typing <code>str_</code> will trigger autocomplete, allowing you to jog your memory of the available functions.</p>
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								<div class="cell">
 								<div class="cell-output-display">
 								<p><img src="screenshots/stringr-autocomplete.png" class="img-fluid" width="678"/></p>
 								</div>
 								</div>
 								</section>
 								</section>
 								<section id="creating-a-string" data-type="sect1">
 								<h1>
 								Creating a string</h1>
-												Re-render book for O'Reilly

											
										
										
											2023-01-13 07:22:57 +08:00
+								<p>We’ve created strings in passing earlier in the book but didn’t discuss the details. Firstly, you can create a string using either single quotes (<code>'</code>) or double quotes (<code>"</code>). There’s no difference in behavior between the two, so in the interests of consistency, the <a href="https://style.tidyverse.org/syntax.html#character-vectors">tidyverse style guide</a> recommends using <code>"</code>, unless the string contains multiple <code>"</code>.</p>
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								<div class="cell">
-												Fix code language

											
										
										
											2022-11-19 01:26:25 +08:00
+								<pre data-type="programlisting" data-code-language="r">string1 &lt;- "This is a string"
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								string2 &lt;- 'If I want to include a "quote" inside a string, I use single quotes'</pre>
 								</div>
 								<p>If you forget to close a quote, you’ll see <code>+</code>, the continuation character:</p>
 								<pre><code>&gt; "This is a string without a closing quote
 								+
 								+
 								+ HELP I'M STUCK IN A STRING</code></pre>
-												Re-render book for O'Reilly

											
										
										
											2023-01-13 07:22:57 +08:00
+								<p>If this happens to you and you can’t figure out which quote to close, press Escape to cancel and try again.</p>
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
 								<section id="escapes" data-type="sect2">
 								<h2>
 								Escapes</h2>
-												Re-render book for O'Reilly

											
										
										
											2023-01-13 07:22:57 +08:00
+								<p>To include a literal single or double quote in a string, you can use <code>\</code> to “escape” it:</p>
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								<div class="cell">
-												Fix code language

											
										
										
											2022-11-19 01:26:25 +08:00
+								<pre data-type="programlisting" data-code-language="r">double_quote &lt;- "\"" # or '"'
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								single_quote &lt;- '\'' # or "'"</pre>
 								</div>
 								<p>So if you want to include a literal backslash in your string, you’ll need to escape it: <code>"\\"</code>:</p>
 								<div class="cell">
-												Fix code language

											
										
										
											2022-11-19 01:26:25 +08:00
+								<pre data-type="programlisting" data-code-language="r">backslash &lt;- "\\"</pre>
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								</div>
-												Re-render book for O'Reilly

											
										
										
											2023-01-13 07:22:57 +08:00
+								<p>Beware that the printed representation of a string is not the same as the string itself because the printed representation shows the escapes (in other words, when you print a string, you can copy and paste the output to recreate that string). To see the raw contents of the string, use <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code><span data-type="footnote">Or use the base R function <code><a href="https://rdrr.io/r/base/writeLines.html">writeLines()</a></code>.</span>:</p>
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								<div class="cell">
-												Fix code language

											
										
										
											2022-11-19 01:26:25 +08:00
+								<pre data-type="programlisting" data-code-language="r">x &lt;- c(single_quote, double_quote, backslash)
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								x
 								#&gt; [1] "'"  "\"" "\\"
 								str_view(x)
 								#&gt; [1] │ '
 								#&gt; [2] │ "
 								#&gt; [3] │ \</pre>
 								</div>
 								</section>
 								<section id="sec-raw-strings" data-type="sect2">
 								<h2>
 								Raw strings</h2>
-												Re-render book for O'Reilly

											
										
										
											2023-01-13 07:22:57 +08:00
+								<p>Creating a string with multiple quotes or backslashes gets confusing quickly. To illustrate the problem, let’s create a string that contains the contents of the code block where we define the <code>double_quote</code> and <code>single_quote</code> variables:</p>
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								<div class="cell">
-												Fix code language

											
										
										
											2022-11-19 01:26:25 +08:00
+								<pre data-type="programlisting" data-code-language="r">tricky &lt;- "double_quote &lt;- \"\\\"\" # or '\"'
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								single_quote &lt;- '\\'' # or \"'\""
 								str_view(tricky)
 								#&gt; [1] │ double_quote &lt;- "\"" # or '"'
 								#&gt;     │ single_quote &lt;- '\'' # or "'"</pre>
 								</div>
-												Re-render book for O'Reilly

											
										
										
											2023-01-13 07:22:57 +08:00
+								<p>That’s a lot of backslashes! (This is sometimes called <a href="https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome">leaning toothpick syndrome</a>.) To eliminate the escaping, you can instead use a <strong>raw string</strong><span data-type="footnote">Available in R 4.0.0 and above.</span>:</p>
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								<div class="cell">
-												Fix code language

											
										
										
											2022-11-19 01:26:25 +08:00
+								<pre data-type="programlisting" data-code-language="r">tricky &lt;- r"(double_quote &lt;- "\"" # or '"'
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								single_quote &lt;- '\'' # or "'")"
 								str_view(tricky)
 								#&gt; [1] │ double_quote &lt;- "\"" # or '"'
 								#&gt;     │ single_quote &lt;- '\'' # or "'"</pre>
 								</div>
-												Re-render book for O'Reilly

											
										
										
											2023-01-13 07:22:57 +08:00
+								<p>A raw string usually starts with <code>r"(</code> and finishes with <code>)"</code>. But if your string contains <code>)"</code> you can instead use <code>r"[]"</code> or <code>r"{}"</code>, and if that’s still not enough, you can insert any number of dashes to make the opening and closing pairs unique, e.g., <code>`r"--()--"</code>, <code>`r"---()---"</code>, etc. Raw strings are flexible enough to handle any text.</p>
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								</section>
 								<section id="other-special-characters" data-type="sect2">
 								<h2>
 								Other special characters</h2>
-												Re-render book for O'Reilly

											
										
										
											2023-01-13 07:22:57 +08:00
+								<p>As well as <code>\"</code>, <code>\'</code>, and <code>\\</code>, there are a handful of other special characters that may come in handy. The most common are <code>\n</code>, newline, and <code>\t</code>, tab. You’ll also sometimes see strings containing Unicode escapes that start with <code>\u</code> or <code>\U</code>. This is a way of writing non-English characters that work on all systems. You can see the complete list of other special characters in <code><a href="https://rdrr.io/r/base/Quotes.html">?'"'</a></code>.</p>
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								<div class="cell">
-												Fix code language

											
										
										
											2022-11-19 01:26:25 +08:00
+								<pre data-type="programlisting" data-code-language="r">x &lt;- c("one\ntwo", "one\ttwo", "\u00b5", "\U0001f604")
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								x
 								#&gt; [1] "one\ntwo" "one\ttwo" "µ"        "😄"
 								str_view(x)
 								#&gt; [1] │ one
 								#&gt;     │ two
 								#&gt; [2] │ one{\t}two
 								#&gt; [3] │ µ
 								#&gt; [4] │ 😄</pre>
 								</div>
-												Re-render book for O'Reilly

											
										
										
											2023-01-13 07:22:57 +08:00
+								<p>Note that <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> uses a blue background for tabs to make them easier to spot. One of the challenges of working with text is that there’s a variety of ways that white space can end up in the text, so this background helps you recognize that something strange is going on.</p>
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								</section>
-												More minor page count tweaks & fixes

And re-convert with latest htmlbook

											
										
										
											2023-01-27 00:36:07 +08:00
+								<section id="strings-exercises" data-type="sect2">
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								<h2>
 								Exercises</h2>
 								<ol type="1"><li>
 								<p>Create strings that contain the following values:</p>
 								<ol type="1"><li><p><code>He said "That's amazing!"</code></p></li>
 								<li><p><code>\a\b\c\d</code></p></li>
 								<li><p><code>\\\\\\</code></p></li>
 								</ol></li>
 								<li>
-												Don't transform non-crossref links

											
										
										
											2022-11-19 00:30:32 +08:00
+								<p>Create the string in your R session and print it. What happens to the special “\u00a0”? How does <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> display it? Can you do a little googling to figure out what this special character is?</p>
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								<div class="cell">
-												Fix code language

											
										
										
											2022-11-19 01:26:25 +08:00
+								<pre data-type="programlisting" data-code-language="r">x &lt;- "This\u00a0is\u00a0tricky"</pre>
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								</div>
 								</li>
 								</ol></section>
 								</section>
 								<section id="creating-many-strings-from-data" data-type="sect1">
 								<h1>
 								Creating many strings from data</h1>
-												Re-render book for O'Reilly

											
										
										
											2023-01-13 07:22:57 +08:00
+								<p>Now that you’ve learned the basics of creating a string or two by “hand”, we’ll go into the details of creating strings from other strings. This will help you solve the common problem where you have some text you wrote that you want to combine with strings from a data frame. For example, you might combine “Hello” with a <code>name</code> variable to create a greeting. We’ll show you how to do this with <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code> and how you can use them with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>. That naturally raises the question of what string functions you might use with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>, so we’ll finish this section with a discussion of <code><a href="https://stringr.tidyverse.org/reference/str_flatten.html">str_flatten()</a></code>, which is a summary function for strings.</p>
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
 								<section id="str_c" data-type="sect2">
 								<h2>
-												More minor page count tweaks & fixes

And re-convert with latest htmlbook

											
										
										
											2023-01-27 00:36:07 +08:00
+								str_c()
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								</h2>
-												Re-render book for O'Reilly

											
										
										
											2023-01-13 07:22:57 +08:00
+								<p><code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> takes any number of vectors as arguments and returns a character vector:</p>
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								<div class="cell">
-												Fix code language

											
										
										
											2022-11-19 01:26:25 +08:00
+								<pre data-type="programlisting" data-code-language="r">str_c("x", "y")
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								#&gt; [1] "xy"
 								str_c("x", "y", "z")
 								#&gt; [1] "xyz"
 								str_c("Hello ", c("John", "Susan"))
 								#&gt; [1] "Hello John"  "Hello Susan"</pre>
 								</div>
-												Re-render book for O'Reilly

											
										
										
											2023-01-13 07:22:57 +08:00
+								<p><code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> is very similar to the base <code><a href="https://rdrr.io/r/base/paste.html">paste0()</a></code>, but is designed to be used with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> by obeying the usual tidyverse rules for recycling and propagating missing values:</p>
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								<div class="cell">
-												More minor page count tweaks & fixes

And re-convert with latest htmlbook

											
										
										
											2023-01-27 00:36:07 +08:00
+								<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(name = c("Flora", "David", "Terra"))
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								df |&gt; mutate(greeting = str_c("Hi ", name, "!"))
-												More minor page count tweaks & fixes

And re-convert with latest htmlbook

											
										
										
											2023-01-27 00:36:07 +08:00
+								#&gt; # A tibble: 3 × 2
 								#&gt;   name  greeting
 								#&gt;   &lt;chr&gt; &lt;chr&gt;
 								#&gt; 1 Flora Hi Flora!
 								#&gt; 2 David Hi David!
 								#&gt; 3 Terra Hi Terra!</pre>
 								</div>
 								<p>If you want missing values to display in another way, use <code><a href="https://dplyr.tidyverse.org/reference/coalesce.html">coalesce()</a></code> to replace them. Depending on what you want, you might use it either inside or outside of <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code>:</p>
 								<div class="cell">
 								<pre data-type="programlisting" data-code-language="r">df |&gt;
 								  mutate(
 								    greeting1 = str_c("Hi ", coalesce(name, "you"), "!"),
 								    greeting2 = coalesce(str_c("Hi ", name, "!"), "Hi!")
 								  )
 								#&gt; # A tibble: 3 × 3
 								#&gt;   name  greeting1 greeting2
 								#&gt;   &lt;chr&gt; &lt;chr&gt;     &lt;chr&gt;
 								#&gt; 1 Flora Hi Flora! Hi Flora!
 								#&gt; 2 David Hi David! Hi David!
 								#&gt; 3 Terra Hi Terra! Hi Terra!</pre>
 								</div>
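<p>The data frame above contains no missing values, so the following is a minimal sketch of what changes when one is present. The <code>df_na</code> tibble and its column names are hypothetical and exist only for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

# Hypothetical data with one missing name
df_na &lt;- tibble(name = c("Flora", NA))

greetings &lt;- df_na |&gt;
  mutate(
    # No coalesce(): the missing value propagates to the result
    plain = str_c("Hi ", name, "!"),
    # coalesce() inside str_c(): replace the name, keep the surrounding text
    inside = str_c("Hi ", coalesce(name, "you"), "!"),
    # coalesce() outside str_c(): replace the whole missing result
    outside = coalesce(str_c("Hi ", name, "!"), "Hi!")
  )

greetings$plain
#&gt; [1] "Hi Flora!" NA
greetings$inside
#&gt; [1] "Hi Flora!" "Hi you!"
greetings$outside
#&gt; [1] "Hi Flora!" "Hi!"</pre>
</div>
<p>Which placement you choose depends on whether you want to fill in just the missing piece or swap in a different string altogether.</p>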
 								</section>
 								<section id="sec-glue" data-type="sect2">
 								<h2>
 								str_glue()
 								</h2>
 								<p>If you are mixing many fixed and variable strings with <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code>, you’ll notice that you type a lot of <code>"</code>s, making it hard to see the overall goal of the code. An alternative approach is provided by the <a href="https://glue.tidyverse.org">glue package</a> via <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code><span data-type="footnote">If you’re not using stringr, you can also access it directly with <code><a href="https://glue.tidyverse.org/reference/glue.html">glue::glue()</a></code>.</span>. You give it a single string that has a special feature: anything inside <code><a href="https://rdrr.io/r/base/Paren.html">{}</a></code> will be evaluated like it’s outside of the quotes:</p>
 								<div class="cell">
 								<pre data-type="programlisting" data-code-language="r">df |&gt; mutate(greeting = str_glue("Hi {name}!"))
 								#&gt; # A tibble: 3 × 2
 								#&gt;   name  greeting
 								#&gt;   &lt;chr&gt; &lt;glue&gt;
 								#&gt; 1 Flora Hi Flora!
 								#&gt; 2 David Hi David!
 								#&gt; 3 Terra Hi Terra!</pre>
 								</div>
 								<p>Note that <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code> currently converts missing values to the string <code>"NA"</code>, unfortunately making it inconsistent with <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code>.</p>
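<p>A small sketch of that difference, using a hypothetical vector <code>name</code> that contains one missing value:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

# Hypothetical input with one missing value
name &lt;- c("Flora", NA)

# str_c() propagates the missing value
combined &lt;- str_c("Hi ", name, "!")
combined
#&gt; [1] "Hi Flora!" NA

# str_glue() turns the missing value into the text "NA"
glued &lt;- str_glue("Hi {name}!")
glued
#&gt; Hi Flora!
#&gt; Hi NA!</pre>
</div>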
 								<p>You also might wonder what happens if you need to include a regular <code>{</code> or <code>}</code> in your string. You’re on the right track if you guess you’ll need to escape it somehow. The trick is that glue uses a slightly different escaping technique; instead of prefixing with a special character like <code>\</code>, you double up the special characters:</p>
 								<div class="cell">
 								<pre data-type="programlisting" data-code-language="r">df |&gt; mutate(greeting = str_glue("{{Hi {name}!}}"))
 								#&gt; # A tibble: 3 × 2
 								#&gt;   name  greeting
 								#&gt;   &lt;chr&gt; &lt;glue&gt;
 								#&gt; 1 Flora {Hi Flora!}
 								#&gt; 2 David {Hi David!}
 								#&gt; 3 Terra {Hi Terra!}</pre>
 								</div>
 								</section>
 								<section id="str_flatten" data-type="sect2">
 								<h2>
 								str_flatten()
 								</h2>
 								<p><code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code> work well with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> because their output is the same length as their inputs. What if you want a function that works well with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>, i.e., something that always returns a single string? That’s the job of <code><a href="https://stringr.tidyverse.org/reference/str_flatten.html">str_flatten()</a></code><span data-type="footnote">The base R equivalent is <code><a href="https://rdrr.io/r/base/paste.html">paste()</a></code> used with the <code>collapse</code> argument.</span>: it takes a character vector and combines each element of the vector into a single string:</p>
 								<div class="cell">
 								<pre data-type="programlisting" data-code-language="r">str_flatten(c("x", "y", "z"))
 								#&gt; [1] "xyz"
 								str_flatten(c("x", "y", "z"), ", ")
 								#&gt; [1] "x, y, z"
 								str_flatten(c("x", "y", "z"), ", ", last = ", and ")
 								#&gt; [1] "x, y, and z"</pre>
 								</div>
 								<p>This makes it work well with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>:</p>
 								<div class="cell">
 								<pre data-type="programlisting" data-code-language="r">df &lt;- tribble(
 								  ~ name, ~ fruit,
 								  "Carmen", "banana",
 								  "Carmen", "apple",
 								  "Marvin", "nectarine",
 								  "Terence", "cantaloupe",
 								  "Terence", "papaya",
 								  "Terence", "mandarin"
 								)
 								df |&gt;
 								  group_by(name) |&gt;
 								  summarize(fruits = str_flatten(fruit, ", "))
 								#&gt; # A tibble: 3 × 2
 								#&gt;   name    fruits
 								#&gt;   &lt;chr&gt;   &lt;chr&gt;
 								#&gt; 1 Carmen  banana, apple
 								#&gt; 2 Marvin  nectarine
 								#&gt; 3 Terence cantaloupe, papaya, mandarin</pre>
 								</div>
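<p>As the footnote above mentions, base R’s <code>paste()</code> with the <code>collapse</code> argument plays a similar role. The following sketch compares the two on a small vector; the variable names are chosen only for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

x &lt;- c("x", "y", "z")

# stringr: collapse a character vector into a single string
flat_stringr &lt;- str_flatten(x, ", ")
flat_stringr
#&gt; [1] "x, y, z"

# base R: paste() with collapse gives the same result here
flat_base &lt;- paste(x, collapse = ", ")
flat_base
#&gt; [1] "x, y, z"</pre>
</div>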
 								</section>
 								<section id="strings-exercises-1" data-type="sect2">
 								<h2>
 								Exercises</h2>
 								<ol type="1"><li>
 								<p>Compare and contrast the results of <code><a href="https://rdrr.io/r/base/paste.html">paste0()</a></code> with <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> for the following inputs:</p>
 								<div class="cell">
 								<pre data-type="programlisting" data-code-language="r">str_c("hi ", NA)
 								str_c(letters[1:2], letters[1:3])</pre>
 								</div>
 								</li>
 								<li>
 								<p>Convert the following expressions from <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> to <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code> or vice versa:</p>
 								<ol type="a"><li><p><code>str_c("The price of ", food, " is ", price)</code></p></li>
 								<li><p><code>str_glue("I'm {age} years old and live in {country}")</code></p></li>
 								<li><p><code>str_c("\\section{", title, "}")</code></p></li>
 								</ol></li>
 								</ol></section>
 								</section>
 								<section id="extracting-data-from-strings" data-type="sect1">
 								<h1>
 								Extracting data from strings</h1>
 								<p>It’s very common for multiple variables to be crammed together into a single string. In this section, you’ll learn how to use four tidyr functions to extract them:</p>
 								<ul><li><code>df |&gt; separate_longer_delim(col, delim)</code></li>
 								<li><code>df |&gt; separate_longer_position(col, width)</code></li>
 								<li><code>df |&gt; separate_wider_delim(col, delim, names)</code></li>
 								<li><code>df |&gt; separate_wider_position(col, widths)</code></li>
 								</ul><p>If you look closely, you can see there’s a common pattern here: <code>separate_</code>, then <code>longer</code> or <code>wider</code>, then <code>_</code>, then <code>delim</code> or <code>position</code>. That’s because these four functions are composed of two simpler primitives:</p>
 								<ul><li>
 								<code>longer</code> makes the input data frame longer, creating new rows; <code>wider</code> makes the input data frame wider, generating new columns.</li>
 								<li>
 								<code>delim</code> splits up a string with a delimiter like <code>", "</code> or <code>" "</code>; <code>position</code> splits at specified widths, like <code>c(3, 5, 2)</code>.</li>
 								</ul><p>We’ll return to the last member of this family, <code>separate_wider_regex()</code>, in <a href="#chp-regexps" data-type="xref">#chp-regexps</a>. It’s the most flexible of the <code>wider</code> functions, but you need to know something about regular expressions before you can use it.</p>
 								<p>The following two sections will give you the basic idea behind these separate functions, first separating into rows (which is a little simpler) and then separating into columns. We’ll finish off by discussing the tools that the <code>wider</code> functions give you to diagnose problems.</p>
 								<section id="separating-into-rows" data-type="sect2">
 								<h2>
 								Separating into rows</h2>
 								<p>Separating a string into rows tends to be most useful when the number of components varies from row to row. The most common case calls for <code><a href="https://tidyr.tidyverse.org/reference/separate_longer_delim.html">separate_longer_delim()</a></code>, which splits based on a delimiter:</p>
 								<div class="cell">
 								<pre data-type="programlisting" data-code-language="r">df1 &lt;- tibble(x = c("a,b,c", "d,e", "f"))
 								df1 |&gt;
 								  separate_longer_delim(x, delim = ",")
 								#&gt; # A tibble: 6 × 1
 								#&gt;   x
 								#&gt;   &lt;chr&gt;
 								#&gt; 1 a
 								#&gt; 2 b
 								#&gt; 3 c
 								#&gt; 4 d
 								#&gt; 5 e
 								#&gt; 6 f</pre>
 								</div>
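<p>When the data frame has other columns, their values are repeated for every new row. A minimal sketch, assuming a hypothetical <code>id</code> column added to the same strings:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

# Hypothetical data: the same strings as above plus an id column
df_ids &lt;- tibble(id = 1:3, x = c("a,b,c", "d,e", "f"))

longer &lt;- df_ids |&gt;
  separate_longer_delim(x, delim = ",")

# Each id is repeated once per piece of its original string
longer$id
#&gt; [1] 1 1 1 2 2 3
longer$x
#&gt; [1] "a" "b" "c" "d" "e" "f"</pre>
</div>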
 								<p>It’s rarer to see <code><a href="https://tidyr.tidyverse.org/reference/separate_longer_delim.html">separate_longer_position()</a></code> in the wild, but some older datasets do use a very compact format where each character is used to record a value:</p>
 								<div class="cell">
 								<pre data-type="programlisting" data-code-language="r">df2 &lt;- tibble(x = c("1211", "131", "21"))
 								df2 |&gt;
 								  separate_longer_position(x, width = 1)
 								#&gt; # A tibble: 9 × 1
 								#&gt;   x
 								#&gt;   &lt;chr&gt;
 								#&gt; 1 1
 								#&gt; 2 2
 								#&gt; 3 1
 								#&gt; 4 1
 								#&gt; 5 1
 								#&gt; 6 3
 								#&gt; # … with 3 more rows</pre>
 								</div>
 								</section>
 								<section id="sec-string-columns" data-type="sect2">
 								<h2>
 								Separating into columns</h2>
 								<p>Separating a string into columns tends to be most useful when there are a fixed number of components in each string, and you want to spread them into columns. The <code>wider</code> functions are slightly more complicated than their <code>longer</code> equivalents because you need to name the columns. For example, in the following dataset, <code>x</code> is made up of a code, an edition number, and a year, separated by <code>"."</code>. To use <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code>, we supply the delimiter and the names in two arguments:</p>
 								<div class="cell">
 								<pre data-type="programlisting" data-code-language="r">df3 &lt;- tibble(x = c("a10.1.2022", "b10.2.2011", "e15.1.2015"))
 								df3 |&gt;
 								  separate_wider_delim(
 								    x,
 								    delim = ".",
 								    names = c("code", "edition", "year")
 								  )
 								#&gt; # A tibble: 3 × 3
 								#&gt;   code  edition year
 								#&gt;   &lt;chr&gt; &lt;chr&gt;   &lt;chr&gt;
 								#&gt; 1 a10   1       2022
 								#&gt; 2 b10   2       2011
 								#&gt; 3 e15   1       2015</pre>
 								</div>
 								<p>If a specific piece is not useful you can use an <code>NA</code> name to omit it from the results:</p>
 								<div class="cell">
 								<pre data-type="programlisting" data-code-language="r">df3 |&gt;
 								  separate_wider_delim(
 								    x,
 								    delim = ".",
 								    names = c("code", NA, "year")
 								  )
 								#&gt; # A tibble: 3 × 2
 								#&gt;   code  year
 								#&gt;   &lt;chr&gt; &lt;chr&gt;
 								#&gt; 1 a10   2022
 								#&gt; 2 b10   2011
 								#&gt; 3 e15   2015</pre>
 								</div>
 								<p><code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_position()</a></code> works a little differently because you typically want to specify the width of each column. So you give it a named integer vector, where the name gives the name of the new column, and the value is the number of characters it occupies. You can omit values from the output by not naming them:</p>
 								<div class="cell">
 								<pre data-type="programlisting" data-code-language="r">df4 &lt;- tibble(x = c("202215TX", "202122LA", "202325CA"))
 								df4 |&gt;
 								  separate_wider_position(
 								    x,
 								    widths = c(year = 4, age = 2, state = 2)
 								  )
 								#&gt; # A tibble: 3 × 3
 								#&gt;   year  age   state
 								#&gt;   &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
 								#&gt; 1 2022  15    TX
 								#&gt; 2 2021  22    LA
 								#&gt; 3 2023  25    CA</pre>
 								</div>
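<p>The example above names every width, so all three pieces are kept. Here is a short sketch of the omission described earlier, where the middle width is left unnamed so that those two characters are matched but dropped from the output:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

df4 &lt;- tibble(x = c("202215TX", "202122LA", "202325CA"))

# The unnamed width (2) still consumes two characters,
# but no column is created for it
no_age &lt;- df4 |&gt;
  separate_wider_position(
    x,
    widths = c(year = 4, 2, state = 2)
  )

names(no_age)
#&gt; [1] "year"  "state"
no_age$state
#&gt; [1] "TX" "LA" "CA"</pre>
</div>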
 								</section>
 								<section id="diagnosing-widening-problems" data-type="sect2">
 								<h2>
 								Diagnosing widening problems</h2>
 								<p><code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code><span data-type="footnote">The same principles apply to <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_position()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code>.</span> requires a fixed and known set of columns. What happens if some of the rows don’t have the expected number of pieces? There are two possible problems, too few or too many pieces, so <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code> provides two arguments to help: <code>too_few</code> and <code>too_many</code>. Let’s first look at the <code>too_few</code> case with the following sample dataset:</p>
 								<div class="cell">
 								<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(x = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1"))
 								df |&gt;
 								  separate_wider_delim(
 								    x,
 								    delim = "-",
 								    names = c("x", "y", "z")
 								  )
 								#&gt; Error in `separate_wider_delim()`:
 								#&gt; ! Expected 3 pieces in each element of `x`.
 								#&gt; ! 2 values were too short.
 								#&gt; ℹ Use `too_few = "debug"` to diagnose the problem.
 								#&gt; ℹ Use `too_few = "align_start"/"align_end"` to silence this message.</pre>
 								</div>
 								<p>You’ll notice that we get an error, but the error gives us some suggestions on how we might proceed. Let’s start by debugging the problem:</p>
 								<div class="cell">
 								<pre data-type="programlisting" data-code-language="r">debug &lt;- df |&gt;
 								  separate_wider_delim(
 								    x,
 								    delim = "-",
 								    names = c("x", "y", "z"),
 								    too_few = "debug"
 								  )
 								#&gt; Warning: Debug mode activated: adding variables `x_ok`, `x_pieces`, and
 								#&gt; `x_remainder`.
 								debug
 								#&gt; # A tibble: 5 × 6
 								#&gt;   x     y     z     x_ok  x_pieces x_remainder
 								#&gt;   &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;lgl&gt;    &lt;int&gt; &lt;chr&gt;
 								#&gt; 1 1-1-1 1     1     TRUE         3 ""
 								#&gt; 2 1-1-2 1     2     TRUE         3 ""
 								#&gt; 3 1-3   3     &lt;NA&gt;  FALSE        2 ""
 								#&gt; 4 1-3-2 3     2     TRUE         3 ""
 								#&gt; 5 1     &lt;NA&gt;  &lt;NA&gt;  FALSE        1 ""</pre>
 								</div>
 								<p>When you use the debug mode, you get three extra columns added to the output: <code>x_ok</code>, <code>x_pieces</code>, and <code>x_remainder</code> (if you separate a variable with a different name, you’ll get a different prefix). Here, <code>x_ok</code> lets you quickly find the inputs that failed:</p>
 								<div class="cell">
 								<pre data-type="programlisting" data-code-language="r">debug |&gt; filter(!x_ok)
 								#&gt; # A tibble: 2 × 6
 								#&gt;   x     y     z     x_ok  x_pieces x_remainder
 								#&gt;   &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;lgl&gt;    &lt;int&gt; &lt;chr&gt;
 								#&gt; 1 1-3   3     &lt;NA&gt;  FALSE        2 ""
 								#&gt; 2 1     &lt;NA&gt;  &lt;NA&gt;  FALSE        1 ""</pre>
 								</div>
 								<p><code>x_pieces</code> tells us how many pieces were found, compared to the expected 3 (the length of <code>names</code>). <code>x_remainder</code> isn’t useful when there are too few pieces, but we’ll see it again shortly.</p>
 								<p>Sometimes looking at this debugging information will reveal a problem with your delimiter strategy or suggest that you need to do more preprocessing before separating. In that case, fix the problem upstream and make sure to remove <code>too_few = "debug"</code> to ensure that new problems become errors.</p>
 								<p>In other cases, you may want to fill in the missing pieces with <code>NA</code>s and move on. That’s the job of <code>too_few = "align_start"</code> and <code>too_few = "align_end"</code> which allow you to control where the <code>NA</code>s should go:</p>
 								<div class="cell">
 								<pre data-type="programlisting" data-code-language="r">df |&gt;
 								  separate_wider_delim(
 								    x,
 								    delim = "-",
 								    names = c("x", "y", "z"),
 								    too_few = "align_start"
 								  )
 								#&gt; # A tibble: 5 × 3
 								#&gt;   x     y     z
 								#&gt;   &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
 								#&gt; 1 1     1     1
 								#&gt; 2 1     1     2
 								#&gt; 3 1     3     &lt;NA&gt;
 								#&gt; 4 1     3     2
 								#&gt; 5 1     &lt;NA&gt;  &lt;NA&gt;</pre>
 								</div>
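<p>For comparison, here is a sketch of the same data with <code>too_few = "align_end"</code>, which puts the <code>NA</code>s at the start instead. The name <code>df_few</code> is hypothetical and simply keeps the example self-contained:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

# Same strings as the too_few example above
df_few &lt;- tibble(x = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1"))

aligned_end &lt;- df_few |&gt;
  separate_wider_delim(
    x,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = "align_end"
  )

# Short inputs fill the last columns first, so the NAs land in x (and y)
aligned_end$x
#&gt; [1] "1" "1" NA  "1" NA
aligned_end$y
#&gt; [1] "1" "1" "1" "3" NA
aligned_end$z
#&gt; [1] "1" "2" "3" "2" "1"</pre>
</div>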
 								<p>The same principles apply if you have too many pieces:</p>
 								<div class="cell">
 								<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(x = c("1-1-1", "1-1-2", "1-3-5-6", "1-3-2", "1-3-5-7-9"))
 								df |&gt;
 								  separate_wider_delim(
 								    x,
 								    delim = "-",
 								    names = c("x", "y", "z")
 								  )
 								#&gt; Error in `separate_wider_delim()`:
 								#&gt; ! Expected 3 pieces in each element of `x`.
 								#&gt; ! 2 values were too long.
 								#&gt; ℹ Use `too_many = "debug"` to diagnose the problem.
 								#&gt; ℹ Use `too_many = "drop"/"merge"` to silence this message.</pre>
 								</div>
 								<p>But now, when we debug the result, we can see the purpose of <code>x_remainder</code>:</p>
 								<div class="cell">
 								<pre data-type="programlisting" data-code-language="r">debug &lt;- df |&gt;
 								  separate_wider_delim(
 								    x,
 								    delim = "-",
 								    names = c("x", "y", "z"),
 								    too_many = "debug"
 								  )
 								#&gt; Warning: Debug mode activated: adding variables `x_ok`, `x_pieces`, and
 								#&gt; `x_remainder`.
 								debug |&gt; filter(!x_ok)
 								#&gt; # A tibble: 2 × 6
 								#&gt;   x         y     z     x_ok  x_pieces x_remainder
 								#&gt;   &lt;chr&gt;     &lt;chr&gt; &lt;chr&gt; &lt;lgl&gt;    &lt;int&gt; &lt;chr&gt;
 								#&gt; 1 1-3-5-6   3     5     FALSE        4 -6
 								#&gt; 2 1-3-5-7-9 3     5     FALSE        5 -7-9</pre>
 								</div>
 								<p>You have a slightly different set of options for handling too many pieces: you can either silently “drop” any additional pieces or “merge” them all into the final column:</p>
 								<div class="cell">
 								<pre data-type="programlisting" data-code-language="r">df |&gt;
 								  separate_wider_delim(
 								    x,
 								    delim = "-",
 								    names = c("x", "y", "z"),
 								    too_many = "drop"
 								  )
 								#&gt; # A tibble: 5 × 3
 								#&gt;   x     y     z
 								#&gt;   &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
 								#&gt; 1 1     1     1
 								#&gt; 2 1     1     2
 								#&gt; 3 1     3     5
 								#&gt; 4 1     3     2
 								#&gt; 5 1     3     5
 								df |&gt;
 								  separate_wider_delim(
 								    x,
 								    delim = "-",
 								    names = c("x", "y", "z"),
 								    too_many = "merge"
 								  )
 								#&gt; # A tibble: 5 × 3
 								#&gt;   x     y     z
 								#&gt;   &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
 								#&gt; 1 1     1     1
 								#&gt; 2 1     1     2
 								#&gt; 3 1     3     5-6
 								#&gt; 4 1     3     2
 								#&gt; 5 1     3     5-7-9</pre>
 								</div>
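<p>The two arguments can also be supplied in the same call when a column contains both kinds of problem. A minimal sketch, assuming a hypothetical <code>df_mixed</code> with one short and one long value:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

# Hypothetical data: one complete value, one too short, one too long
df_mixed &lt;- tibble(x = c("1-1-1", "1-3", "1-3-5-6"))

fixed &lt;- df_mixed |&gt;
  separate_wider_delim(
    x,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = "align_start",  # pad short values with NA on the right
    too_many = "merge"        # keep extra pieces in the final column
  )

fixed$y
#&gt; [1] "1" "3" "3"
fixed$z
#&gt; [1] "1"   NA    "5-6"</pre>
</div>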
 								</section>
 								</section>
 								<section id="letters" data-type="sect1">
 								<h1>
 								Letters</h1>
 								<p>In this section, we’ll introduce you to functions that allow you to work with the individual letters within a string. You’ll learn how to find the length of a string, extract substrings, and handle long strings in plots and tables.</p>
 								<section id="length" data-type="sect2">
 								<h2>
 								Length</h2>
-												Don't transform non-crossref links

											
										
										
											2022-11-19 00:30:32 +08:00
+								<p><code><a href="https://stringr.tidyverse.org/reference/str_length.html">str_length()</a></code> tells you the number of letters in the string:</p>
-												Commit O'Reilly HTML files to monitor fixes

											
										
										
											2022-11-19 00:28:19 +08:00
+								<div class="cell">
+								<pre data-type="programlisting" data-code-language="r">str_length(c("a", "R for data science", NA))
+								#&gt; [1]  1 18 NA</pre>
 								</div>
 								<p>You could use this with <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> to find the distribution of lengths of US baby names and then with <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> to look at the longest names<span data-type="footnote">Looking at these entries, we’d guess that the babynames data drops spaces or hyphens and truncates after 15 letters.</span>:</p>
+								<div class="cell">
+								<pre data-type="programlisting" data-code-language="r">babynames |&gt;
+								  count(length = str_length(name), wt = n)
 								#&gt; # A tibble: 14 × 2
 								#&gt;   length        n
 								#&gt;    &lt;int&gt;    &lt;int&gt;
 								#&gt; 1      2   338150
 								#&gt; 2      3  8589596
 								#&gt; 3      4 48506739
 								#&gt; 4      5 87011607
 								#&gt; 5      6 90749404
 								#&gt; 6      7 72120767
 								#&gt; # … with 8 more rows
 								babynames |&gt;
 								  filter(str_length(name) == 15) |&gt;
 								  count(name, wt = n, sort = TRUE)
 								#&gt; # A tibble: 34 × 2
 								#&gt;   name                n
 								#&gt;   &lt;chr&gt;           &lt;int&gt;
 								#&gt; 1 Franciscojavier   123
 								#&gt; 2 Christopherjohn   118
 								#&gt; 3 Johnchristopher   118
 								#&gt; 4 Christopherjame   108
 								#&gt; 5 Christophermich    52
 								#&gt; 6 Ryanchristopher    45
 								#&gt; # … with 28 more rows</pre>
 								</div>
 								</section>
 								<section id="subsetting" data-type="sect2">
 								<h2>
 								Subsetting</h2>
+								<p>You can extract parts of a string using <code>str_sub(string, start, end)</code>, where <code>start</code> and <code>end</code> are the positions where the substring should start and end. The <code>start</code> and <code>end</code> arguments are inclusive, so the length of the returned string will be <code>end - start + 1</code>:</p>
+								<div class="cell">
+								<pre data-type="programlisting" data-code-language="r">x &lt;- c("Apple", "Banana", "Pear")
+								str_sub(x, 1, 3)
 								#&gt; [1] "App" "Ban" "Pea"</pre>
 								</div>
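 								<p>As a quick check of the <code>end - start + 1</code> rule, the following sketch (our own illustration, reusing the same fruit names) measures the extracted pieces with <code>str_length()</code>:</p>
 								<div class="cell">
 								<pre data-type="programlisting" data-code-language="r">library(stringr)
 								
 								x &lt;- c("Apple", "Banana", "Pear")
 								# Positions 2 through 4 are inclusive, so each piece has 4 - 2 + 1 = 3 letters
 								str_sub(x, 2, 4)
 								#&gt; [1] "ppl" "ana" "ear"
 								str_length(str_sub(x, 2, 4))
 								#&gt; [1] 3 3 3</pre>
 								</div>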
 								<p>You can use negative values to count back from the end of the string: -1 is the last character, -2 is the second to last character, etc.</p>
 								<div class="cell">
+								<pre data-type="programlisting" data-code-language="r">str_sub(x, -3, -1)
+								#&gt; [1] "ple" "ana" "ear"</pre>
 								</div>
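 								<p>Positive and negative positions can also be mixed. As a small sketch of our own (not part of the examples above), this drops the first and last letter of each string by starting at position 2 and stopping at the second-to-last character:</p>
 								<div class="cell">
 								<pre data-type="programlisting" data-code-language="r">library(stringr)
 								
 								x &lt;- c("Apple", "Banana", "Pear")
 								# Start at the second letter, end at the second-to-last letter
 								str_sub(x, 2, -2)
 								#&gt; [1] "ppl"  "anan" "ea"</pre>
 								</div>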
 								<p>Note that <code><a href="https://stringr.tidyverse.org/reference/str_sub.html">str_sub()</a></code> won’t fail if the string is too short; it will just return as much as possible:</p>
+								<div class="cell">
+								<pre data-type="programlisting" data-code-language="r">str_sub("a", 1, 5)
+								#&gt; [1] "a"</pre>
 								</div>
+								<p>We could use <code><a href="https://stringr.tidyverse.org/reference/str_sub.html">str_sub()</a></code> with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> to find the first and last letter of each name:</p>
+								<div class="cell">
+								<pre data-type="programlisting" data-code-language="r">babynames |&gt;
+								  mutate(
 								    first = str_sub(name, 1, 1),
 								    last = str_sub(name, -1, -1)
 								  )
 								#&gt; # A tibble: 1,924,665 × 7
 								#&gt;    year sex   name          n   prop first last
 								#&gt;   &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;     &lt;int&gt;  &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;
 								#&gt; 1  1880 F     Mary       7065 0.0724 M     y
 								#&gt; 2  1880 F     Anna       2604 0.0267 A     a
 								#&gt; 3  1880 F     Emma       2003 0.0205 E     a
 								#&gt; 4  1880 F     Elizabeth  1939 0.0199 E     h
 								#&gt; 5  1880 F     Minnie     1746 0.0179 M     e
 								#&gt; 6  1880 F     Margaret   1578 0.0162 M     t
 								#&gt; # … with 1,924,659 more rows</pre>
 								</div>
 								</section>
 								<section id="long-strings" data-type="sect2">
 								<h2>
 								Long strings</h2>
+								<p>Sometimes you care about the length of a string because you’re trying to fit it into a label on a plot or table. stringr provides two useful tools for cases where your string is too long:</p>
 								<ul><li><p><code>str_trunc(x, 30)</code> ensures that no string is longer than 30 characters, cutting off longer strings and ending them with <code>...</code> (the ellipsis counts toward the 30 characters).</p></li>
 								<li><p><code>str_wrap(x, 30)</code> wraps a string, introducing new lines so that each line is at most 30 characters. (It doesn’t hyphenate, however, so any word longer than 30 characters will make a longer line.)</p></li>
 								</ul><p>The following code shows these functions in action with a made-up string:</p>
+								<div class="cell">
 								<pre data-type="programlisting" data-code-language="r">x &lt;- paste0(
 								  "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod ",
 								  "tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim ",
 								  "veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea ",
 								  "commodo consequat."
 								)
 								str_view(str_trunc(x, 30))
 								#&gt; [1] │ Lorem ipsum dolor sit amet,...
 								str_view(str_wrap(x, 30))
 								#&gt; [1] │ Lorem ipsum dolor sit amet,
 								#&gt;     │ consectetur adipiscing
 								#&gt;     │ elit, sed do eiusmod tempor
 								#&gt;     │ incididunt ut labore et dolore
 								#&gt;     │ magna aliqua. Ut enim ad
 								#&gt;     │ minim veniam, quis nostrud
 								#&gt;     │ exercitation ullamco laboris
 								#&gt;     │ nisi ut aliquip ex ea commodo
 								#&gt;     │ consequat.</pre>
 								</div>
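 								<p>To make the width rule concrete, here is a minimal sketch of our own using a shorter string. It shows that the <code>...</code> is counted as part of the requested width, and it assumes the <code>side</code> argument of <code>str_trunc()</code> to keep the end of the string instead of the start:</p>
 								<div class="cell">
 								<pre data-type="programlisting" data-code-language="r">library(stringr)
 								
 								title &lt;- "R for data science"
 								# Seven letters are kept and the three dots bring the total to 10
 								str_trunc(title, 10)
 								#&gt; [1] "R for d..."
 								str_length(str_trunc(title, 10))
 								#&gt; [1] 10
 								# Truncate from the left to keep the end of the string
 								str_trunc(title, 10, side = "left")
 								#&gt; [1] "...science"</pre>
 								</div>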
 								</section>
+								<section id="strings-exercises-2" data-type="sect2">
+								<h2>
 								Exercises</h2>
+								<ol type="1"><li>Use <code><a href="https://stringr.tidyverse.org/reference/str_length.html">str_length()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_sub.html">str_sub()</a></code> to extract the middle letter from each baby name. What will you do if the string has an even number of characters?</li>
 								<li>Are there any major trends in the length of baby names over time? What about the popularity of first and last letters?</li>
 								</ol></section>
 								</section>
 								<section id="sec-other-languages" data-type="sect1">
 								<h1>
 								Non-English text</h1>
 								<p>So far, we’ve focused on English-language text, which is particularly easy to work with for two reasons. First, the English alphabet is relatively simple: there are just 26 letters. Second (and maybe more importantly), the computing infrastructure we use today was predominantly designed by English speakers. Unfortunately, we don’t have room for a full treatment of non-English languages. Still, we want to draw your attention to some of the biggest challenges you might encounter: encoding, letter variations, and locale-dependent functions.</p>
 								<section id="encoding" data-type="sect2">
 								<h2>
 								Encoding</h2>
+								<p>When working with non-English text, the first challenge is often the <strong>encoding</strong>. To understand what’s going on, we need to dive into how computers represent strings. In R, we can get at the underlying representation of a string using <code><a href="https://rdrr.io/r/base/rawConversion.html">charToRaw()</a></code>:</p>
+								<div class="cell">
+								<pre data-type="programlisting" data-code-language="r">charToRaw("Hadley")
+								#&gt; [1] 48 61 64 6c 65 79</pre>
 								</div>
+								<p>Each of these six hexadecimal numbers represents one letter: <code>48</code> is H, <code>61</code> is a, and so on. The mapping from hexadecimal number to character is called the encoding, and in this case, the encoding is called ASCII. ASCII does a great job of representing English characters because it’s the <strong>American</strong> Standard Code for Information Interchange.</p>
 								<p>Things aren’t so easy for languages other than English. In the early days of computing, there were many competing standards for encoding non-English characters. For example, there were two different encodings for Europe: Latin1 (aka ISO-8859-1) was used for Western European languages, and Latin2 (aka ISO-8859-2) was used for Central European languages. In Latin1, the byte <code>b1</code> is “±”, but in Latin2, it’s “ą”! Fortunately, today there is one standard that is supported almost everywhere: UTF-8. UTF-8 can encode just about every character used by humans today and many extra symbols like emojis.</p>
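 								<p>To see how the same character can map to different bytes, here is a small base R sketch of our own. It assumes that <code>iconv()</code> on your system recognizes the name <code>"latin1"</code>, and it looks at the bytes behind “ñ” in UTF-8 and in Latin1:</p>
 								<div class="cell">
 								<pre data-type="programlisting" data-code-language="r"># "\u00f1" is ñ; R stores strings written with \u escapes as UTF-8
 								n_utf8 &lt;- "\u00f1"
 								charToRaw(n_utf8)
 								#&gt; [1] c3 b1
 								# Re-encode the same character as Latin1: now it is a single byte
 								n_latin1 &lt;- iconv(n_utf8, from = "UTF-8", to = "latin1")
 								charToRaw(n_latin1)
 								#&gt; [1] f1</pre>
 								</div>
 								<p>UTF-8 needs two bytes for this character, while Latin1 needs only one, so software that assumes the wrong encoding will misread it.</p>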
 								<p>readr uses UTF-8 everywhere. This is a good default but will fail for data produced by older systems that don’t use UTF-8. If this happens, your strings will look weird when you print them. Sometimes just one or two characters might be messed up; other times, you’ll get complete gibberish. For example, here are two inline CSVs with unusual encodings<span data-type="footnote">Here we’re using the special <code>\x</code> escape to encode binary data directly into a string.</span>:</p>
+								<div class="cell">
+								<pre data-type="programlisting" data-code-language="r">x1 &lt;- "text\nEl Ni\xf1o was particularly bad this year"
+								read_csv(x1)
 								#&gt; # A tibble: 1 × 1
 								#&gt;   text
 								#&gt;   &lt;chr&gt;
 								#&gt; 1 "El Ni\xf1o was particularly bad this year"
 								x2 &lt;- "text\n\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
 								read_csv(x2)
 								#&gt; # A tibble: 1 × 1
 								#&gt;   text
 								#&gt;   &lt;chr&gt;
 								#&gt; 1 "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"</pre>
 								</div>
+								<p>To read these correctly, you specify the encoding via the <code>locale</code> argument:</p>
+								<div class="cell">
+								<pre data-type="programlisting" data-code-language="r">read_csv(x1, locale = locale(encoding = "Latin1"))
+								#&gt; # A tibble: 1 × 1
 								#&gt;   text
 								#&gt;   &lt;chr&gt;
 								#&gt; 1 El Niño was particularly bad this year
 								read_csv(x2, locale = locale(encoding = "Shift-JIS"))
 								#&gt; # A tibble: 1 × 1
 								#&gt;   text
 								#&gt;   &lt;chr&gt;
 								#&gt; 1 こんにちは</pre>
 								</div>
+								<p>How do you find the correct encoding? If you’re lucky, it’ll be included somewhere in the data documentation. Unfortunately, that’s rarely the case, so readr provides <code><a href="https://readr.tidyverse.org/reference/encoding.html">guess_encoding()</a></code> to help you figure it out. It’s not foolproof and works better when you have lots of text (unlike here), but it’s a reasonable place to start. Expect to try a few different encodings before you find the right one.</p>
+								<div class="cell">
+								<pre data-type="programlisting" data-code-language="r">guess_encoding(x1)
+								#&gt; # A tibble: 1 × 2
 								#&gt;   encoding   confidence
 								#&gt;   &lt;chr&gt;           &lt;dbl&gt;
 								#&gt; 1 ISO-8859-1       0.41
 								guess_encoding(x2)
 								#&gt; # A tibble: 1 × 2
 								#&gt;   encoding confidence
 								#&gt;   &lt;chr&gt;         &lt;dbl&gt;
 								#&gt; 1 KOI8-R         0.27</pre>
 								</div>
+								<p>Encodings are a rich and complex topic; we’ve only scratched the surface here. If you’d like to learn more, we recommend reading the detailed explanation at <a href="http://kunststube.net/encoding/" class="uri">http://kunststube.net/encoding/</a>.</p>
+								</section>
 								<section id="letter-variations" data-type="sect2">
 								<h2>
 								Letter variations</h2>
 								<p>Working in languages with accents poses a significant challenge when determining the position of letters (e.g., with <code><a href="https://stringr.tidyverse.org/reference/str_length.html">str_length()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_sub.html">str_sub()</a></code>), because an accented letter might be encoded as a single character (e.g., ü) or as two characters that combine an unaccented letter (e.g., u) with a diacritic mark (e.g., ¨). For example, this code shows two ways of representing ü that look identical:</p>
+								<div class="cell">
+								<pre data-type="programlisting" data-code-language="r">u &lt;- c("\u00fc", "u\u0308")
+								str_view(u)
 								#&gt; [1] │ ü
 								#&gt; [2] │ ü</pre>
 								</div>
 								<p>But the two strings differ in length, and their first characters are different:</p>
+								<div class="cell">
+								<pre data-type="programlisting" data-code-language="r">str_length(u)
+								#&gt; [1] 1 2
 								str_sub(u, 1, 1)
 								#&gt; [1] "ü" "u"</pre>
 								</div>
+								<p>Finally, note that a comparison of these strings with <code>==</code> interprets these strings as different, while the handy <code><a href="https://stringr.tidyverse.org/reference/str_equal.html">str_equal()</a></code> function in stringr recognizes that both have the same appearance:</p>
+								<div class="cell">
+								<pre data-type="programlisting" data-code-language="r">u[[1]] == u[[2]]
+								#&gt; [1] FALSE
 								str_equal(u[[1]], u[[2]])
 								#&gt; [1] TRUE</pre>
 								</div>
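 								<p>If you need lengths and positions to agree as well, one option (an addition of ours, not something the examples above rely on) is to normalize the strings to a single Unicode form first. This sketch assumes <code>stringi::stri_trans_nfc()</code>, which composes a letter and its combining mark into one character where possible:</p>
 								<div class="cell">
 								<pre data-type="programlisting" data-code-language="r">library(stringr)
 								
 								u &lt;- c("\u00fc", "u\u0308")
 								# Normalize both spellings of ü to the composed (NFC) form
 								u_nfc &lt;- stringi::stri_trans_nfc(u)
 								str_length(u_nfc)
 								#&gt; [1] 1 1
 								u_nfc[[1]] == u_nfc[[2]]
 								#&gt; [1] TRUE</pre>
 								</div>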
 								</section>
+								<section id="locale-dependent-functions" data-type="sect2">
+								<h2>
+								Locale-dependent functions</h2>
 								<p>Finally, there are a handful of stringr functions whose behavior depends on your <strong>locale</strong>. A locale is similar to a language but includes an optional region specifier to handle regional variations within a language. A locale is specified by a lower-case language abbreviation, optionally followed by a <code>_</code> and an upper-case region identifier. For example, “en” is English, “en_GB” is British English, and “en_US” is American English. If you don’t already know the code for your language, <a href="https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes">Wikipedia</a> has a good list, and you can see which are supported in stringr by looking at <code><a href="https://rdrr.io/pkg/stringi/man/stri_locale_list.html">stringi::stri_locale_list()</a></code>.</p>
 								<p>Base R string functions automatically use the locale set by your operating system. This means that base R string functions do what you expect for your language, but your code might work differently if you share it with someone who lives in a different country. To avoid this problem, stringr defaults to English rules by using the “en” locale and requires you to specify the <code>locale</code> argument to override it. Fortunately, there are only two sets of functions where the locale really matters: changing case and sorting.</p>
 								<p>The rules for changing case differ among languages. For example, Turkish has two i’s: with and without a dot. Since they’re two distinct letters, they’re capitalized differently:</p>
+								<div class="cell">
+								<pre data-type="programlisting" data-code-language="r">str_to_upper(c("i", "ı"))
+								#&gt; [1] "I" "I"
 								str_to_upper(c("i", "ı"), locale = "tr")
 								#&gt; [1] "İ" "I"</pre>
 								</div>
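 								<p>The same rule applies in the other direction. As a brief sketch of our own, lowercasing the two Turkish capital letters with the <code>"tr"</code> locale gives back the dotless and dotted forms (here <code>"\u0130"</code> is the dotted capital İ):</p>
 								<div class="cell">
 								<pre data-type="programlisting" data-code-language="r">library(stringr)
 								
 								# In Turkish, "I" lowercases to dotless "ı" and "İ" lowercases to "i"
 								str_to_lower(c("I", "\u0130"), locale = "tr")
 								#&gt; [1] "ı" "i"</pre>
 								</div>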
+								<p>Sorting strings depends on the order of the alphabet, and the order of the alphabet is not the same in every language<span data-type="footnote">Sorting in languages that don’t have an alphabet, like Chinese, is more complicated still.</span>! Here’s an example: in Czech, “ch” is a compound letter that appears after <code>h</code> in the alphabet.</p>
+								<div class="cell">
+								<pre data-type="programlisting" data-code-language="r">str_sort(c("a", "c", "ch", "h", "z"))
+								#&gt; [1] "a"  "c"  "ch" "h"  "z"
 								str_sort(c("a", "c", "ch", "h", "z"), locale = "cs")
 								#&gt; [1] "a"  "c"  "h"  "ch" "z"</pre>
 								</div>
 								<p>This also comes up when sorting strings with <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">dplyr::arrange()</a></code>, which is why it also has a <code>.locale</code> argument.</p>
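 								<p>As a minimal sketch of how that might look, assuming dplyr 1.1.0 or later (where the argument is spelled <code>.locale</code>) and an installed stringi package, the small tibble <code>df</code> below is made up for illustration and sorted with Czech rules:</p>
 								<div class="cell">
 								<pre data-type="programlisting" data-code-language="r">library(dplyr)
 								
 								# A made-up one-column tibble holding the same strings as above
 								df &lt;- tibble(x = c("a", "c", "ch", "h", "z"))
 								df |&gt;
 								  arrange(x, .locale = "cs") |&gt;
 								  pull(x)
 								#&gt; [1] "a"  "c"  "h"  "ch" "z"</pre>
 								</div>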
+								</section>
 								</section>
+								<section id="strings-summary" data-type="sect1">
+								<h1>
 								Summary</h1>
+								<p>In this chapter, you’ve learned about some of the power of the stringr package: how to create, combine, and extract strings, and about some of the challenges you might face with non-English strings. Now it’s time to learn one of the most important and powerful tools for working with strings: regular expressions. Regular expressions are a very concise but very expressive language for describing patterns within strings and are the topic of the next chapter.</p>
 								</section>
 								</section>