r4ds/oreilly/workflow-style.html

204 lines
12 KiB
HTML
Raw Normal View History

<section data-type="chapter" id="chp-workflow-style">
2023-01-13 07:22:57 +08:00
<h1><span id="sec-workflow-style" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Workflow: code style</span></span></h1><p>Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread. Even as a very new programmer, its a good idea to work on your code style. Using a consistent style makes it easier for others (including future-you!) to read your work and is particularly important if you need to get help from someone else. This chapter will introduce the most important points of the <a href="https://style.tidyverse.org">tidyverse style guide</a>, which is used throughout this book.</p><p>Styling your code will feel a bit tedious to start with, but if you practice it, it will soon become second nature. Additionally, there are some great tools to quickly restyle existing code, like the <a href="https://styler.r-lib.org"><strong>styler</strong></a> package by Lorenz Walthert. Once youve installed it with <code>install.packages("styler")</code>, an easy way to use it is via RStudios <strong>command palette</strong>. The command palette lets you use any built-in RStudio command and many addins provided by packages. Open the palette by pressing Cmd/Ctrl + Shift + P, then type “styler” to see all the shortcuts offered by styler. <a href="#fig-styler" data-type="xref">#fig-styler</a> shows the results.</p><div class="cell">
<div class="cell-output-display">
2022-11-19 00:42:43 +08:00
<figure id="fig-styler"><p><img src="screenshots/rstudio-palette.png" alt="A screenshot showing the command palette after typing &quot;styler&quot;, showing the four styling tool provided by the package." width="638"/></p>
<figcaption>RStudios command palette makes it easy to access every RStudio command using only the keyboard.</figcaption>
</figure>
</div>
</div><div class="cell">
2022-11-19 01:26:25 +08:00
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
library(nycflights13)</pre>
</div>
<section id="names" data-type="sect1">
<h1>
Names</h1>
2022-11-19 00:30:32 +08:00
<p>We talked briefly about names in <a href="#sec-whats-in-a-name" data-type="xref">#sec-whats-in-a-name</a>. Remember that variable names (those created by <code>&lt;-</code> and those created by <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>) should use only lowercase letters, numbers, and <code>_</code>. Use <code>_</code> to separate words within a name.</p>
<div class="cell">
2022-11-19 01:26:25 +08:00
<pre data-type="programlisting" data-code-language="r"># Strive for:
short_flights &lt;- flights |&gt; filter(air_time &lt; 60)
# Avoid:
SHORTFLIGHTS &lt;- flights |&gt; filter(air_time &lt; 60)</pre>
</div>
2023-01-13 07:22:57 +08:00
<p>As a general rule of thumb, its better to prefer long, descriptive names that are easy to understand rather than concise names that are fast to type. Short names save relatively little time when writing code (especially since autocomplete will help you finish typing them), but it can be time-consuming when you come back to old code and are forced to puzzle out a cryptic abbreviation.</p>
<p>If you have a bunch of names for related things, do your best to be consistent. Its easy for inconsistencies to arise when you forget a previous convention, so dont feel bad if you have to go back and rename things. In general, if you have a bunch of variables that are a variation on a theme, youre better off giving them a common prefix rather than a common suffix because autocomplete works best on the start of a variable.</p>
</section>
<section id="spaces" data-type="sect1">
<h1>
Spaces</h1>
2023-01-13 07:22:57 +08:00
<p>Put spaces on either side of mathematical operators apart from <code>^</code> (i.e. <code>+</code>, <code>-</code>, <code>==</code>, <code>&lt;</code>, …), and around the assignment operator (<code>&lt;-</code>).</p>
<div class="cell">
2022-11-19 01:26:25 +08:00
<pre data-type="programlisting" data-code-language="r"># Strive for
z &lt;- (a + b)^2 / d
# Avoid
z&lt;-( a + b ) ^ 2/d</pre>
</div>
2023-01-13 07:22:57 +08:00
<p>Dont put spaces inside or outside parentheses for regular function calls. Always put a space after a comma, just like in standard English.</p>
<div class="cell">
2022-11-19 01:26:25 +08:00
<pre data-type="programlisting" data-code-language="r"># Strive for
mean(x, na.rm = TRUE)
# Avoid
mean (x ,na.rm=TRUE)</pre>
</div>
2022-11-19 00:30:32 +08:00
<p>Its OK to add extra spaces if it improves alignment. For example, if youre creating multiple variables in <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, you might want to add spaces so that all the <code>=</code> line up. This makes it easier to skim the code.</p>
<div class="cell">
2022-11-19 01:26:25 +08:00
<pre data-type="programlisting" data-code-language="r">flights |&gt;
mutate(
speed = air_time / distance,
dep_hour = dep_time %/% 100,
dep_minute = dep_time %% 100
)</pre>
</div>
</section>
<section id="sec-pipes" data-type="sect1">
<h1>
Pipes</h1>
2023-01-13 07:22:57 +08:00
<p><code>|&gt;</code> should always have a space before it and should typically be the last thing on a line. This makes it easier to add new steps, rearrange existing steps, modify elements within a step, and get a 50,000 ft view by skimming the verbs on the left-hand side.</p>
<div class="cell">
2022-11-19 01:26:25 +08:00
<pre data-type="programlisting" data-code-language="r"># Strive for
flights |&gt;
filter(!is.na(arr_delay), !is.na(tailnum)) |&gt;
count(dest)
# Avoid
flights|&gt;filter(!is.na(arr_delay), !is.na(tailnum))|&gt;count(dest)</pre>
</div>
2023-01-13 07:22:57 +08:00
<p>If the function youre piping into has named arguments (like <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> or <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>), put each argument on a new line. If the function doesnt have named arguments (like <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> or <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>), keep everything on one line unless it doesnt fit, in which case you should put each argument on its own line.</p>
<div class="cell">
2022-11-19 01:26:25 +08:00
<pre data-type="programlisting" data-code-language="r"># Strive for
flights |&gt;
group_by(tailnum) |&gt;
summarize(
delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
# Avoid
flights |&gt;
group_by(
tailnum
) |&gt;
summarize(delay = mean(arr_delay, na.rm = TRUE), n = n())</pre>
</div>
<p>After the first step of the pipeline, indent each line by two spaces. If youre putting each argument on its own line, indent by an extra two spaces. Make sure <code>)</code> is on its own line, and un-indented to match the horizontal position of the function name.</p>
<div class="cell">
2022-11-19 01:26:25 +08:00
<pre data-type="programlisting" data-code-language="r"># Strive for
flights |&gt;
group_by(tailnum) |&gt;
summarize(
delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
# Avoid
flights|&gt;
group_by(tailnum) |&gt;
summarize(
delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
flights|&gt;
group_by(tailnum) |&gt;
summarize(
delay = mean(arr_delay, na.rm = TRUE),
n = n()
)</pre>
</div>
<p>Its OK to shirk some of these rules if your pipeline fits easily on one line. But in our collective experience, its common for short snippets to grow longer, so youll usually save time in the long run by starting with all the vertical space you need.</p>
<div class="cell">
2022-11-19 01:26:25 +08:00
<pre data-type="programlisting" data-code-language="r"># This fits compactly on one line
df |&gt; mutate(y = x + 1)
# While this takes up 4x as many lines, it's easily extended to
# more variables and more steps in the future
df |&gt;
mutate(
y = x + 1
)</pre>
</div>
<p>Finally, be wary of writing very long pipes, say longer than 10-15 lines. Try to break them up into smaller sub-tasks, giving each task an informative name. The names will help cue the reader into whats happening and makes it easier to check that intermediate results are as expected. Whenever you can give something an informative name, you should give it an informative name. Dont expect to get it right the first time! This means breaking up long pipelines if there are intermediate states that can get good names.</p>
</section>
<section id="ggplot2" data-type="sect1">
<h1>
ggplot2</h1>
<p>The same basic rules that apply to the pipe also apply to ggplot2; just treat <code>+</code> the same way as <code>|&gt;</code>.</p>
<div class="cell">
2022-11-19 01:26:25 +08:00
<pre data-type="programlisting" data-code-language="r">flights |&gt;
group_by(month) |&gt;
summarize(
delay = mean(arr_delay, na.rm = TRUE)
) |&gt;
2023-01-13 07:22:57 +08:00
ggplot(aes(x = month, y = delay)) +
geom_point() +
geom_line()</pre>
</div>
2023-01-13 07:22:57 +08:00
<p>Again, if you cant fit all of the arguments to a function on to a single line, put each argument on its own line:</p>
<div class="cell">
2022-11-19 01:26:25 +08:00
<pre data-type="programlisting" data-code-language="r">flights |&gt;
group_by(dest) |&gt;
summarize(
distance = mean(distance),
speed = mean(air_time / distance, na.rm = TRUE)
) |&gt;
2023-01-13 07:22:57 +08:00
ggplot(aes(x = distance, y = speed)) +
geom_smooth(
method = "loess",
span = 0.5,
se = FALSE,
color = "white",
linewidth = 4
) +
geom_point()</pre>
</div>
</section>
<section id="sectioning-comments" data-type="sect1">
<h1>
Sectioning comments</h1>
<p>As your scripts get longer, you can use <strong>sectioning</strong> comments to break up your file into manageable pieces:</p>
<div class="cell">
2022-11-19 01:26:25 +08:00
<pre data-type="programlisting" data-code-language="r"># Load data --------------------------------------
# Plot data --------------------------------------</pre>
</div>
<p>RStudio provides a keyboard shortcut to create these headers (Cmd/Ctrl + Shift + R), and will display them in the code navigation drop-down at the bottom-left of the editor, as shown in <a href="#fig-rstudio-sections" data-type="xref">#fig-rstudio-sections</a>.</p>
<div class="cell">
<div class="cell-output-display">
2022-11-19 00:42:43 +08:00
<figure id="fig-rstudio-sections"><p><img src="screenshots/rstudio-nav.png" width="125"/></p>
<figcaption>After adding sectioning comments to your script, you can easily navigate to them using the code navigation tool in the bottom-left of the script editor.</figcaption>
</figure>
</div>
</div>
</section>
<section id="workflow-style-exercises" data-type="sect1">
<h1>
Exercises</h1>
<ol type="1"><li>
<p>Restyle the following pipelines following the guidelines above.</p>
<div class="cell">
2022-11-19 01:26:25 +08:00
<pre data-type="programlisting" data-code-language="r">flights|&gt;filter(dest=="IAH")|&gt;group_by(year,month,day)|&gt;summarize(n=n(),delay=mean(arr_delay,na.rm=TRUE))|&gt;filter(n&gt;10)
flights|&gt;filter(carrier=="UA",dest%in%c("IAH","HOU"),sched_dep_time&gt;0900,sched_arr_time&lt;2000)|&gt;group_by(flight)|&gt;summarize(delay=mean(arr_delay,na.rm=TRUE),cancelled=sum(is.na(arr_delay)),n=n())|&gt;filter(n&gt;10)</pre>
</div>
</li>
</ol></section>
<section id="workflow-style-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, youve learn the most important principles of code style. These may feel like a set of arbitrary rules to start with (because they are!) but over time, as you write more code, and share code with more people, youll see how important a consistent style is. And dont forget about the styler package: its a great way to quickly improve the quality of poorly styled code.</p>
<p>So far, weve worked with datasets bundled inside of R packages. This makes it easier to get some practice on pre-prepared data, but obviously your data wont available in this way. So in the next chapter, youre going to learn how load data from disk into your R session using the readr package.</p>
</section>
</section>