<section data-type="chapter" id="chp-workflow-basics">
<h1><span id="sec-workflow-basics" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Workflow: basics</span></span></h1><p>You now have some experience running R code. We didn’t give you many details, but you’ve obviously figured out the basics, or you would’ve thrown this book away in frustration! Frustration is natural when you start programming in R because it is such a stickler for punctuation, and even one character out of place will cause it to complain. But while you should expect to be a little frustrated, take comfort in the fact that this experience is typical and temporary: it happens to everyone, and the only way to get over it is to keep trying.</p><p>Before we go any further, let’s ensure you’ve got a solid foundation in running R code and that you know some of the most helpful RStudio features.</p>
<section id="coding-basics" data-type="sect1">
<h1>
Coding basics</h1>
<p>Let’s review some basics we’ve omitted so far in the interest of getting you plotting as quickly as possible. You can use R as a calculator:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">1 / 200 * 30
#&gt; [1] 0.15
(59 + 73 + 2) / 3
#&gt; [1] 44.66667
sin(pi / 2)
#&gt; [1] 1</pre>
</div>
<p>You can create new objects with the assignment operator <code>&lt;-</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- 3 * 4</pre>
</div>
<p>You can <strong>c</strong>ombine multiple elements into a vector with <code><a href="https://rdrr.io/r/base/c.html">c()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">primes &lt;- c(2, 3, 5, 7, 11, 13)</pre>
</div>
<p>And basic arithmetic is applied to every element of the vector:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">primes * 2
#&gt; [1]  4  6 10 14 22 26
primes - 1
#&gt; [1]  1  2  4  6 10 12</pre>
</div>
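<p>As a small extra sketch (not one of the original examples), the same element-by-element rule applies to other arithmetic operators, whereas a summary function such as <code>sum()</code> collapses the whole vector into a single number:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">primes &lt;- c(2, 3, 5, 7, 11, 13)

# ^ is applied to each element in turn
primes ^ 2
#&gt; [1]   4   9  25  49 121 169

# sum() adds all the elements together
sum(primes)
#&gt; [1] 41</pre>
</div>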
<p>All R statements where you create objects, <strong>assignment</strong> statements, have the same form:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">object_name &lt;- value</pre>
</div>
<p>When reading that code, say “object name gets value” in your head.</p>
<p>You will make lots of assignments, and <code>&lt;-</code> is a pain to type. You can save time with RStudio’s keyboard shortcut: Alt + - (the minus sign). Notice that RStudio automatically surrounds <code>&lt;-</code> with spaces, which is a good code formatting practice. Code is miserable to read on a good day, so giveyoureyesabreak and use spaces.</p>
</section>
<section id="comments" data-type="sect1">
<h1>
Comments</h1>
<p>R will ignore any text after <code>#</code>. This allows you to write <strong>comments</strong>, text that is ignored by R but read by other humans. We’ll sometimes include comments in examples explaining what’s happening with the code.</p>
<p>Comments can be helpful for briefly describing what the following code does.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># define primes
primes &lt;- c(2, 3, 5, 7, 11, 13)
# multiply primes by 2
primes * 2
#&gt; [1]  4  6 10 14 22 26</pre>
</div>
<p>With short pieces of code like this, leaving a comment for every single line of code might not be necessary. But as the code you’re writing gets more complex, comments can save you (and your collaborators) a lot of time figuring out what was done in the code.</p>
<p>Use comments to explain the <em>why</em> of your code, not the <em>how</em> or the <em>what</em>. The <em>what</em> and <em>how</em> of your code are always possible to figure out, even if it might be tedious, by carefully reading it. But if you describe the “what” in both your comments and your code, you’ll have to remember to update the two in tandem. If you change the code and forget to update the comment, they’ll be inconsistent, leading to confusion when you return to your code in the future.</p>
<p>Figuring out <em>why</em> something was done is much more difficult, if not impossible. For example, <code>geom_smooth()</code> has an argument called <code>span</code>, which controls the smoothness of the curve, with larger values yielding a smoother curve. Suppose you decide to change the value of <code>span</code> from its default of 0.75 to 0.3: it’s easy for a future reader to understand <em>what</em> is happening, but unless you note your thinking in a comment, no one will understand <em>why</em> you changed the default.</p>
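<p>Here is a minimal sketch of what such a comment might look like. The reason given in the comment is invented purely for illustration, and we’re assuming the <code>mpg</code> data that ships with ggplot2; your own comment would record whatever actually motivated the change:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Hypothetical reason, for illustration only: the default span (0.75)
# smoothed away a dip in the curve that we want readers to see,
# so we use a smaller span to get a wigglier curve.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(span = 0.3)</pre>
</div>
<p>The code already says <em>what</em> changed (<code>span = 0.3</code>); the comment only has to carry the <em>why</em>.</p>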
<p>For data analysis code, use comments to explain your overall plan of attack and record important insights as you encounter them. There’s no way to re-capture this knowledge from the code itself.</p>
</section>
<section id="sec-whats-in-a-name" data-type="sect1">
<h1>
What’s in a name?</h1>
<p>Object names must start with a letter and can only contain letters, numbers, <code>_</code>, and <code>.</code>. You want your object names to be descriptive, so you’ll need to adopt a convention for multiple words. We recommend <strong>snake_case</strong>, where you separate lowercase words with <code>_</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">i_use_snake_case
otherPeopleUseCamelCase
some.people.use.periods
And_aFew.People_RENOUNCEconvention</pre>
</div>
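<p>If you’re ever unsure whether a name is allowed, one way to check (a sketch, using made-up candidate names) is base R’s <code>make.names()</code>, which rewrites any string that isn’t a valid name. A candidate that comes back unchanged is fine to use:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Hypothetical candidate names, chosen only for illustration
candidates &lt;- c("flight_delay", "total.count", "2nd_attempt", "_private")

# make.names() leaves valid names alone and alters invalid ones,
# so comparing its output with the input flags the valid candidates
make.names(candidates) == candidates
#&gt; [1]  TRUE  TRUE FALSE FALSE</pre>
</div>
<p>The last two candidates fail because they start with a number and an underscore rather than a letter.</p>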
<p>We’ll return to names again when we discuss code style in <a href="#chp-workflow-style" data-type="xref">#chp-workflow-style</a>.</p>
<p>You can inspect an object by typing its name:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x
#&gt; [1] 12</pre>
</div>
<p>Make another assignment:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">this_is_a_really_long_name &lt;- 2.5</pre>
</div>
<p>To inspect this object, try out RStudio’s completion facility: type “this”, press TAB, add characters until you have a unique prefix, then press return.</p>
<p>Oops, you made a mistake! The value of <code>this_is_a_really_long_name</code> should be 3.5, not 2.5. Use another keyboard shortcut to help you fix it. Type “this” then press Cmd/Ctrl + ↑. Doing so will list all the commands you’ve typed that start with those letters. Use the arrow keys to navigate, then press enter to retype the command. Change 2.5 to 3.5 and rerun.</p>
<p>Make yet another assignment:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">r_rocks &lt;- 2 ^ 3</pre>
</div>
<p>Let’s try to inspect it:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">r_rock
#&gt; Error: object 'r_rock' not found
R_rocks
#&gt; Error: object 'R_rocks' not found</pre>
</div>
<p>This illustrates the implied contract between you and R: R will do the tedious computations for you, but in exchange, you must be completely precise in your instructions. Typos matter; R can’t read your mind and say, “oh, they probably meant <code>r_rocks</code> when they typed <code>r_rock</code>”. Case matters; similarly, R can’t read your mind and say, “oh, they probably meant <code>r_rocks</code> when they typed <code>R_rocks</code>”.</p>
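<p>One way to see the same thing without triggering an error (a small sketch using base R’s <code>exists()</code>, which reports whether an object with exactly that name can be found):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">r_rocks &lt;- 2 ^ 3

exists("r_rocks")
#&gt; [1] TRUE
exists("r_rock")   # missing the final "s"
#&gt; [1] FALSE
exists("R_rocks")  # capital "R"
#&gt; [1] FALSE</pre>
</div>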
</section>
<section id="calling-functions" data-type="sect1">
<h1>
Calling functions</h1>
<p>R has a large collection of built-in functions that are called like this:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">function_name(arg1 = val1, arg2 = val2, ...)</pre>
</div>
<p>Let’s try using <code><a href="https://rdrr.io/r/base/seq.html">seq()</a></code>, which makes regular <strong>seq</strong>uences of numbers, and while we’re at it, learn more helpful features of RStudio. Type <code>se</code> and hit TAB. A popup shows you possible completions. Specify <code><a href="https://rdrr.io/r/base/seq.html">seq()</a></code> by typing more (a <code>q</code>) to disambiguate or by using ↑/↓ arrows to select. Notice the floating tooltip that pops up, reminding you of the function’s arguments and purpose. If you want more help, press F1 to get all the details in the help tab in the lower right pane.</p>
<p>When you’ve selected the function you want, press TAB again. RStudio will add matching opening (<code>(</code>) and closing (<code>)</code>) parentheses for you. Type the arguments <code>1, 10</code> and hit return.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">seq(1, 10)
#&gt; [1]  1  2  3  4  5  6  7  8  9 10</pre>
</div>
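<p>The call above relies on argument position. As a brief sketch of the general <code>function_name(arg1 = val1, arg2 = val2, ...)</code> form, you can also name the arguments; <code>from</code>, <code>to</code>, <code>by</code>, and <code>length.out</code> are all arguments of <code>seq()</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Count from 1 to 10 in steps of 2
seq(from = 1, to = 10, by = 2)
#&gt; [1] 1 3 5 7 9

# Ask for 5 evenly spaced values between 0 and 1
seq(from = 0, to = 1, length.out = 5)
#&gt; [1] 0.00 0.25 0.50 0.75 1.00</pre>
</div>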
<p>Type this code and notice that RStudio provides similar assistance with the paired quotation marks:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- "hello world"</pre>
</div>
<p>Quotation marks and parentheses must always come in a pair. RStudio does its best to help you, but it’s still possible to mess up and end up with a mismatch. If this happens, R will show you the continuation character “+”:</p>
<pre><code>&gt; x &lt;- "hello
+</code></pre>
<p>The <code>+</code> tells you that R is waiting for more input; it doesn’t think you’re done yet. Usually, this means you’ve forgotten either a <code>"</code> or a <code>)</code>. Either add the missing pair, or press ESCAPE to abort the expression and try again.</p>
<p>Note that the environment tab in the upper right pane displays all of the objects that you’ve created:</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="screenshots/rstudio-env.png" class="img-fluid" alt="Environment tab of RStudio which shows r_rocks, this_is_a_really_long_name, x, and y in the Global Environment." width="778"/></p>
</div>
</div>
</section>
<section id="workflow-basics-exercises" data-type="sect1">
<h1>
Exercises</h1>
<ol type="1"><li>
<p>Why does this code not work?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">my_variable &lt;- 10
my_varıable
#&gt; Error in eval(expr, envir, enclos): object 'my_varıable' not found</pre>
</div>
<p>Look carefully! (This may seem like an exercise in pointlessness, but training your brain to notice even the tiniest difference will pay off when programming.)</p>
</li>
<li>
<p>Tweak each of the following R commands so that they run correctly:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">libary(tidyverse)
ggplot(dota = mpg) +
  geom_point(maping = aes(x = displ, y = hwy))</pre>
</div>
</li>
<li><p>Press Alt + Shift + K. What happens? How can you get to the same place using the menus?</p></li>
<li>
<p>Let’s revisit an exercise from <a href="#sec-ggsave" data-type="xref">#sec-ggsave</a>. Run the following lines of code. Which of the two plots is saved as <code>mpg-plot.png</code>? Why?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">my_bar_plot &lt;- ggplot(mpg, aes(x = class)) +
  geom_bar()
my_scatter_plot &lt;- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()
ggsave(filename = "mpg-plot.png", plot = my_bar_plot)</pre>
</div>
</li>
</ol></section>
<section id="workflow-basics-summary" data-type="sect1">
<h1>
Summary</h1>
<p>You’ve now learned a little more about how R code works, along with some tips to help you understand your code when you come back to it in the future. In the next chapter, we’ll continue your data science journey by teaching you about dplyr, the tidyverse package that helps you transform data, whether it’s selecting important variables, filtering down to rows of interest, or computing summary statistics.</p>
</section>
</section>