<section data-type="chapter" id="chp-workflow-pipes">
<h1><span id="sec-workflow-pipes" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Workflow: Pipes</span></span></h1><p>The pipe, <code>|&gt;</code>, is a powerful tool for clearly expressing a sequence of operations that transform an object. We briefly introduced pipes in the previous chapter, but before going too much further, we want to give a few more details and discuss <code>%&gt;%</code>, a predecessor to <code>|&gt;</code>.</p><p>To add the pipe to your code, we recommend using the built-in keyboard shortcut Ctrl/Cmd + Shift + M. You’ll need to make one change to your RStudio options to use <code>|&gt;</code> instead of <code>%&gt;%</code> as shown in <a href="#fig-pipe-options" data-type="xref">#fig-pipe-options</a>; more on <code>%&gt;%</code> shortly.</p><div class="cell">
<div class="cell-output-display">
<figure id="fig-pipe-options"><p><img src="screenshots/rstudio-pipe-options.png" alt="Screenshot showing the &quot;Use native pipe operator&quot; option which can be found on the &quot;Editing&quot; panel of the &quot;Code&quot; options." width="616"/></p>
<figcaption>To insert <code>|&gt;</code>, make sure the “Use native pipe operator” option is checked.</figcaption>
</figure>
</div>
</div>
<section id="why-use-a-pipe" data-type="sect1">
<h1>
Why use a pipe?</h1>
<p>Each individual dplyr verb is quite simple, so solving complex problems typically requires combining multiple verbs. For example, the last chapter finished with a moderately complex pipe:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
  filter(!is.na(arr_delay), !is.na(tailnum)) |&gt;
  group_by(tailnum) |&gt;
  summarise(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  )</pre>
</div>
<p>Even though this pipe has four steps, it’s easy to skim because the verbs come at the start of each line: start with the <code>flights</code> data, then filter, then group, then summarize.</p>
<p>What would happen if we didn’t have the pipe? We could nest each function call inside the previous call:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">summarise(
  group_by(
    filter(
      flights,
      !is.na(arr_delay), !is.na(tailnum)
    ),
    tailnum
  ),
  delay = mean(arr_delay, na.rm = TRUE),
  n = n()
)</pre>
</div>
<p>Or we could use a bunch of intermediate variables:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights1 &lt;- filter(flights, !is.na(arr_delay), !is.na(tailnum))
flights2 &lt;- group_by(flights1, tailnum)
flights3 &lt;- summarise(flights2,
  delay = mean(arr_delay, na.rm = TRUE),
  n = n()
)</pre>
</div>
<p>While both of these forms have their time and place, the pipe generally produces data analysis code that’s both easier to write and easier to read.</p>
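<p>As a rough check that the three styles are interchangeable, the following sketch applies each of them to the built-in <code>mtcars</code> data and compares the results. The dataset and the names <code>piped</code>, <code>nested</code>, and <code>stepwise</code> are our own choices for illustration, made so that the example runs with only dplyr loaded; they are not part of the <code>flights</code> example above.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# 1. Pipe: reads top to bottom, one verb per line
piped &lt;- mtcars |&gt;
  filter(!is.na(mpg)) |&gt;
  group_by(cyl) |&gt;
  summarise(mpg = mean(mpg), n = n())

# 2. Nested calls: the same work, but read from the inside out
nested &lt;- summarise(
  group_by(filter(mtcars, !is.na(mpg)), cyl),
  mpg = mean(mpg),
  n = n()
)

# 3. Intermediate variables: every step needs its own name
cars1 &lt;- filter(mtcars, !is.na(mpg))
cars2 &lt;- group_by(cars1, cyl)
stepwise &lt;- summarise(cars2, mpg = mean(mpg), n = n())

# Errors if any of the three results differ
stopifnot(identical(piped, nested), identical(nested, stepwise))</pre>
</div>
<p>The results are the same in each case; what changes is how much bookkeeping the reader has to do to follow the order of operations.</p>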
</section>
<section id="magrittr-and-the-pipe" data-type="sect1">
<h1>
magrittr and the <code>%&gt;%</code> pipe</h1>
<p>If you’ve been using the tidyverse for a while, you might be familiar with the <code>%&gt;%</code> pipe provided by the <strong>magrittr</strong> package. The magrittr package is included in the core tidyverse, so you can use <code>%&gt;%</code> whenever you load the tidyverse:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

mtcars %&gt;%
  group_by(cyl) %&gt;%
  summarise(n = n())
#&gt; # A tibble: 3 × 2
#&gt;     cyl     n
#&gt;   &lt;dbl&gt; &lt;int&gt;
#&gt; 1     4    11
#&gt; 2     6     7
#&gt; 3     8    14</pre>
</div>
<p>For simple cases, <code>|&gt;</code> and <code>%&gt;%</code> behave identically. So why do we recommend the base pipe? Firstly, because it’s part of base R, it’s always available for you to use, even when you’re not using the tidyverse. Secondly, <code>|&gt;</code> is quite a bit simpler than <code>%&gt;%</code>: in the time between the invention of <code>%&gt;%</code> in 2014 and the inclusion of <code>|&gt;</code> in R 4.1.0 in 2021, we gained a better understanding of the pipe. This allowed the base implementation to jettison infrequently used and less important features.</p>
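<p>To make the first point concrete, here is a minimal sketch (our own toy example, assuming R 4.1.0 or later) in which the base pipe runs in a fresh session with no packages loaded, while the magrittr pipe gives the same answer only after magrittr has been attached:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># The base pipe is part of the language, so no library() call is needed
c(1, 4, 9) |&gt; sqrt() |&gt; sum()
#&gt; [1] 6

# The magrittr pipe behaves the same way here, once the package is attached
library(magrittr)
c(1, 4, 9) %&gt;% sqrt() %&gt;% sum()
#&gt; [1] 6</pre>
</div>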
</section>
<section id="vs" data-type="sect1">
<h1>
<code>|&gt;</code> vs <code>%&gt;%</code>
</h1>
<p>While <code>|&gt;</code> and <code>%&gt;%</code> behave identically for simple cases, there are a few important differences. These are most likely to affect you if you’re a long-term user of <code>%&gt;%</code> who has taken advantage of some of the more advanced features. But they’re still good to know about even if you’ve never used <code>%&gt;%</code> because you’re likely to encounter some of them when reading wild-caught code.</p>
<ul><li><p>By default, the pipe passes the object on its left-hand side to the first argument of the function on the right-hand side. <code>%&gt;%</code> allows you to change the placement with a <code>.</code> placeholder. For example, <code>x %&gt;% f(1)</code> is equivalent to <code>f(x, 1)</code> but <code>x %&gt;% f(1, .)</code> is equivalent to <code>f(1, x)</code>. R 4.2.0 added a <code>_</code> placeholder to the base pipe, with one additional restriction: the argument has to be named. For example, <code>x |&gt; f(1, y = _)</code> is equivalent to <code>f(1, y = x)</code>.</p></li>
<li>
<p>The <code>|&gt;</code> placeholder is deliberately simple and can’t replicate many features of the <code>%&gt;%</code> placeholder: you can’t pass it to multiple arguments, and it doesn’t have any special behavior when the placeholder is used inside another function. For example, <code>df %&gt;% split(.$var)</code> is equivalent to <code>split(df, df$var)</code> and <code>df %&gt;% {split(.$x, .$y)}</code> is equivalent to <code>split(df$x, df$y)</code>.</p>
<p>With <code>%&gt;%</code> you can use <code>.</code> on the left-hand side of operators like <code>$</code>, <code>[[</code>, <code>[</code> (which you’ll learn about in <a href="#sec-subset-many" data-type="xref">#sec-subset-many</a>), so you can extract a single column from a data frame with (e.g.) <code>mtcars %&gt;% .$cyl</code>. A future version of R may add similar support for <code>|&gt;</code> and <code>_</code>. For the special case of extracting a column out of a data frame, you can also use <code><a href="https://dplyr.tidyverse.org/reference/pull.html">dplyr::pull()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">mtcars |&gt; pull(cyl)
#&gt; [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4</pre>
</div>
</li>
<li><p><code>%&gt;%</code> allows you to drop the parentheses when calling a function with no other arguments; <code>|&gt;</code> always requires the parentheses.</p></li>
<li><p><code>%&gt;%</code> allows you to start a pipe with <code>.</code> to create a function rather than immediately executing the pipe; this is not supported by the base pipe.</p></li>
</ul><p>Luckily there’s no need to commit entirely to one pipe or the other: you can use the base pipe for the majority of cases where it’s sufficient, and use the magrittr pipe when you really need its special features.</p>
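<p>The differences listed above can be sketched side by side. This is an illustrative example rather than an exhaustive comparison: <code>f()</code> is a hypothetical helper that we define only to show where each argument lands, <code>root_sum</code> is likewise our own name, and the <code>_</code> placeholder assumes R 4.2.0 or later.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(magrittr)

# Hypothetical helper (not from any package) that reports its arguments
f &lt;- function(x, y) paste0("x = ", x, ", y = ", y)

# Default: the left-hand side becomes the first argument
"a" %&gt;% f("b")
#&gt; [1] "x = a, y = b"
"a" |&gt; f("b")
#&gt; [1] "x = a, y = b"

# Placeholders: `.` can be positional; `_` must be a named argument
"a" %&gt;% f("b", .)
#&gt; [1] "x = b, y = a"
"a" |&gt; f("b", y = _)
#&gt; [1] "x = b, y = a"

# `.` used inside another call: mtcars is still passed as the first
# argument, so this is split(mtcars, mtcars$cyl)
mtcars %&gt;% split(.$cyl) %&gt;% names()
#&gt; [1] "4" "6" "8"

# Parentheses are optional with %&gt;% but required with |&gt;
c(1, 4, 9) %&gt;% sqrt
#&gt; [1] 1 2 3
c(1, 4, 9) |&gt; sqrt()
#&gt; [1] 1 2 3

# Starting a pipe with `.` creates a function instead of running the pipe
root_sum &lt;- . %&gt;% sqrt() %&gt;% sum()
root_sum(c(1, 4, 9))
#&gt; [1] 6</pre>
</div>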
</section>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you’ve learned more about the pipe: why we recommend it and some of the history that led to <code>|&gt;</code>. The pipe is important because you’ll use it again and again throughout your analysis, but hopefully it will quickly become invisible and your fingers will type it (or use the keyboard shortcut) without your brain having to think too much about it.</p>
<p>In the next chapter, we switch back to data science tools, learning about tidy data. Tidy data is a consistent way of organizing your data frames that is used throughout the tidyverse. This consistency makes your life easier because once you have tidy data, it just works with the vast majority of tidyverse functions. Of course, life is never easy and most datasets that you encounter in the wild will not already be tidy. So we’ll also teach you how to use the tidyr package to tidy your untidy data.</p>
</section>
</section>