<section data-type="chapter" id="chp-functions">
<h1><span id="sec-functions" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Functions</span></span></h1>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>One of the best ways to improve your reach as a data scientist is to write functions. Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting. Writing a function has three big advantages over using copy-and-paste:</p>
<ol type="1"><li><p>You can give a function an evocative name that makes your code easier to understand.</p></li>
<li><p>As requirements change, you only need to update code in one place, instead of many.</p></li>
<li><p>You eliminate the chance of making incidental mistakes when you copy and paste (i.e. updating a variable name in one place, but not in another).</p></li>
</ol><p>A good rule of thumb is to consider writing a function whenever you've copied and pasted a block of code more than twice (i.e. you now have three copies of the same code). In this chapter, you'll learn about three useful types of functions:</p>
<ul><li>Vector functions take one or more vectors as input and return a vector as output.</li>
<li>Data frame functions take a data frame as input and return a data frame as output.</li>
<li>Plot functions take a data frame as input and return a plot as output.</li>
</ul><p>Each of these sections includes many examples to help you generalize the patterns that you see. These examples wouldn't be possible without the help of folks on Twitter, and we encourage you to follow the links in the comments to see the original inspirations. You might also want to read the original motivating tweets for <a href="https://twitter.com/hadleywickham/status/1571603361350164486">general functions</a> and <a href="https://twitter.com/hadleywickham/status/1574373127349575680">plotting functions</a> to see even more functions.</p>
<section id="prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>We'll wrap up a variety of functions from around the tidyverse. We'll also use nycflights13 as a source of familiar data to use our functions with.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
library(nycflights13)</pre>
</div>
</section>
</section>
<section id="vector-functions" data-type="sect1">
<h1>
Vector functions</h1>
<p>We'll begin with vector functions: functions that take one or more vectors and return a vector result. For example, take a look at this code. What does it do?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(
a = rnorm(5),
b = rnorm(5),
c = rnorm(5),
d = rnorm(5),
)
df |&gt; mutate(
a = (a - min(a, na.rm = TRUE)) /
(max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
b = (b - min(b, na.rm = TRUE)) /
(max(b, na.rm = TRUE) - min(a, na.rm = TRUE)),
c = (c - min(c, na.rm = TRUE)) /
(max(c, na.rm = TRUE) - min(c, na.rm = TRUE)),
d = (d - min(d, na.rm = TRUE)) /
(max(d, na.rm = TRUE) - min(d, na.rm = TRUE)),
)
#&gt; # A tibble: 5 × 4
#&gt; a b c d
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 0.339 2.59 0.291 0
#&gt; 2 0.880 0 0.611 0.557
#&gt; 3 0 1.37 1 0.752
#&gt; 4 0.795 1.37 0 1
#&gt; 5 1 1.34 0.580 0.394</pre>
</div>
<p>You might be able to puzzle out that this rescales each column to have a range from 0 to 1. But did you spot the mistake? When Hadley wrote this code he made an error when copying and pasting and forgot to change an <code>a</code> to a <code>b</code>. Preventing this type of mistake is one very good reason to learn how to write functions.</p>
<section id="writing-a-function" data-type="sect2">
<h2>
Writing a function</h2>
<p>To write a function you need to first analyse your repeated code to figure out which parts are constant and which parts vary. If we take the code above and pull it outside of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, it's a little easier to see the pattern because each repetition is now one line:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">(a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE))
(b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(b, na.rm = TRUE))
(c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE))
(d - min(d, na.rm = TRUE)) / (max(d, na.rm = TRUE) - min(d, na.rm = TRUE)) </pre>
</div>
<p>To make this a bit clearer we can replace the bit that varies with <code>█</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">(█ - min(█, na.rm = TRUE)) / (max(█, na.rm = TRUE) - min(█, na.rm = TRUE))</pre>
</div>
<p>To turn this into a function you need three things:</p>
<ol type="1"><li><p>A <strong>name</strong>. Here we'll use <code>rescale01</code> because this function rescales a vector to lie between 0 and 1.</p></li>
<li><p>The <strong>arguments</strong>. The arguments are things that vary across calls and our analysis above tells us that we have just one. We'll call it <code>x</code> because this is the conventional name for a numeric vector.</p></li>
<li><p>The <strong>body</strong>. The body is the code that's repeated across all the calls.</p></li>
</ol><p>Then you create a function by following the template:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">name &lt;- function(arguments) {
body
}</pre>
</div>
<p>For this case that leads to:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">rescale01 &lt;- function(x) {
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}</pre>
</div>
<p>At this point you might test with a few simple inputs to make sure you've captured the logic correctly:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">rescale01(c(-10, 0, 10))
#&gt; [1] 0.0 0.5 1.0
rescale01(c(1, 2, 3, NA, 5))
#&gt; [1] 0.00 0.25 0.50 NA 1.00</pre>
</div>
<p>Then you can rewrite the call to <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> as:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt; mutate(
a = rescale01(a),
b = rescale01(b),
c = rescale01(c),
d = rescale01(d),
)
#&gt; # A tibble: 5 × 4
#&gt; a b c d
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 0.339 1 0.291 0
#&gt; 2 0.880 0 0.611 0.557
#&gt; 3 0 0.530 1 0.752
#&gt; 4 0.795 0.531 0 1
#&gt; 5 1 0.518 0.580 0.394</pre>
</div>
<p>(In <a href="#chp-iteration" data-type="xref">#chp-iteration</a>, you'll learn how to use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> to reduce the duplication even further so all you need is <code>df |&gt; mutate(across(a:d, rescale01))</code>.)</p>
</section>
<section id="improving-our-function" data-type="sect2">
<h2>
Improving our function</h2>
<p>You might notice that the <code>rescale01()</code> function does some unnecessary work — instead of computing <code><a href="https://rdrr.io/r/base/Extremes.html">min()</a></code> twice and <code><a href="https://rdrr.io/r/base/Extremes.html">max()</a></code> once we could instead compute both the minimum and maximum in one step with <code><a href="https://rdrr.io/r/base/range.html">range()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">rescale01 &lt;- function(x) {
rng &lt;- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}</pre>
</div>
<p>Or you might try this function on a vector that includes an infinite value:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(1:10, Inf)
rescale01(x)
#&gt; [1] 0 0 0 0 0 0 0 0 0 0 NaN</pre>
</div>
<p>That result is not particularly useful so we could ask <code><a href="https://rdrr.io/r/base/range.html">range()</a></code> to ignore infinite values:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">rescale01 &lt;- function(x) {
rng &lt;- range(x, na.rm = TRUE, finite = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}
rescale01(x)
#&gt; [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
#&gt; [8] 0.7777778 0.8888889 1.0000000 Inf</pre>
</div>
<p>These changes illustrate an important benefit of functions: because we've moved the repeated code into a function, we only need to make the change in one place.</p>
</section>
<section id="mutate-functions" data-type="sect2">
<h2>
Mutate functions</h2>
<p>Now that you've got the basic idea of functions, let's take a look at a whole bunch of examples. We'll start by looking at “mutate” functions, i.e. functions that work well inside of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> because they return an output of the same length as the input.</p>
<p>Let's start with a simple variation of <code>rescale01()</code>. Maybe you want to compute the Z-score, rescaling a vector to have a mean of zero and a standard deviation of one:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">z_score &lt;- function(x) {
(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}</pre>
</div>
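<p>As a quick sanity check, here is a minimal sketch (the input vector is made up for illustration) suggesting that the rescaled values end up with a mean of roughly zero and a standard deviation of roughly one, while the missing value is passed through:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">z_score &lt;- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Made-up input with one missing value
z &lt;- z_score(c(2, 4, 6, NA, 8))

mean(z, na.rm = TRUE) # approximately 0, up to floating point error
sd(z, na.rm = TRUE)   # approximately 1
is.na(z)              # the NA stays in its original position</pre>
</div>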
<p>Or maybe you want to wrap up a straightforward <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> and give it a useful name. For example, this <code>clamp()</code> function ensures all values of a vector lie between a minimum and a maximum:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">clamp &lt;- function(x, min, max) {
case_when(
x &lt; min ~ min,
x &gt; max ~ max,
.default = x
)
}
clamp(1:10, min = 3, max = 7)
#&gt; [1] 3 3 3 4 5 6 7 7 7 7</pre>
</div>
<p>Or maybe you'd rather mark those values as <code>NA</code>s:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">na_outside &lt;- function(x, min, max) {
case_when(
x &lt; min ~ NA,
x &gt; max ~ NA,
.default = x
)
}
na_outside(1:10, min = 3, max = 7)
#&gt; [1] NA NA 3 4 5 6 7 NA NA NA</pre>
</div>
<p>Of course functions don't just need to work with numeric variables. You might want to do some repeated string manipulation. Maybe you need to make the first character upper case:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">first_upper &lt;- function(x) {
str_sub(x, 1, 1) &lt;- str_to_upper(str_sub(x, 1, 1))
x
}
first_upper("hello")
#&gt; [1] "Hello"</pre>
</div>
<p>Or maybe you want to strip percent signs, commas, and dollar signs from a string before converting it into a number:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/NVlabormarket/status/1571939851922198530
clean_number &lt;- function(x) {
is_pct &lt;- str_detect(x, "%")
num &lt;- x |&gt;
str_remove_all("%") |&gt;
str_remove_all(",") |&gt;
str_remove_all(fixed("$")) |&gt;
as.numeric()
if_else(is_pct, num / 100, num)
}
clean_number("$12,300")
#&gt; [1] 12300
clean_number("45%")
#&gt; [1] 0.45</pre>
</div>
<p>Sometimes your functions will be highly specialized for one data analysis step. For example, if you have a bunch of variables that record missing values as 997, 998, or 999, you might want to write a function to replace them with <code>NA</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">fix_na &lt;- function(x) {
if_else(x %in% c(997, 998, 999), NA, x)
}</pre>
</div>
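<p>To see how a specialized helper like this might be used, here is a small sketch with a made-up <code>survey</code> tibble (the data and column names are hypothetical, not from the original text). It assumes a recent dplyr, where <code>if_else()</code> accepts a bare <code>NA</code> alongside a numeric vector:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

fix_na &lt;- function(x) {
  if_else(x %in% c(997, 998, 999), NA, x)
}

# Hypothetical survey data that codes missing values as 997, 998, or 999
survey &lt;- tibble(
  age = c(34, 997, 51),
  income = c(52000, 998, 999)
)

# Apply the helper to each affected column
fixed &lt;- survey |&gt;
  mutate(
    age = fix_na(age),
    income = fix_na(income)
  )
fixed</pre>
</div>
<p>Ordinary values are left untouched; only the three sentinel codes become <code>NA</code>.</p>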
<p>We've focused on examples that take a single vector because we think they're the most common. But there's no reason that your function can't take multiple vector inputs. For example, you might want to compute the distance between two locations on the globe using the haversine formula. This requires four vectors:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/RosanaFerrero/status/1574722120428539906/photo/1
haversine &lt;- function(long1, lat1, long2, lat2, round = 3) {
# convert to radians
long1 &lt;- long1 * pi / 180
lat1 &lt;- lat1 * pi / 180
long2 &lt;- long2 * pi / 180
lat2 &lt;- lat2 * pi / 180
R &lt;- 6371 # Earth mean radius in km
a &lt;- sin((lat2 - lat1) / 2)^2 +
cos(lat1) * cos(lat2) * sin((long2 - long1) / 2)^2
d &lt;- R * 2 * asin(sqrt(a))
round(d, round)
}</pre>
</div>
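<p>A short usage sketch follows. The coordinates are rough, assumed values for two airports and are only there for illustration; the two checks at the end use points where the expected distance is easy to reason about by hand:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">haversine &lt;- function(long1, lat1, long2, lat2, round = 3) {
  # convert to radians
  long1 &lt;- long1 * pi / 180
  lat1 &lt;- lat1 * pi / 180
  long2 &lt;- long2 * pi / 180
  lat2 &lt;- lat2 * pi / 180

  R &lt;- 6371 # Earth mean radius in km
  a &lt;- sin((lat2 - lat1) / 2)^2 +
    cos(lat1) * cos(lat2) * sin((long2 - long1) / 2)^2
  d &lt;- R * 2 * asin(sqrt(a))

  round(d, round)
}

# Approximate (assumed) coordinates for JFK and LAX, in degrees
jfk_lax &lt;- haversine(long1 = -73.78, lat1 = 40.64, long2 = -118.41, lat2 = 33.94)
jfk_lax # distance in km

# A point is zero km from itself
same_point &lt;- haversine(10, 20, 10, 20)

# A quarter of the way around the equator should be close to 6371 * pi / 2
quarter &lt;- haversine(0, 0, 90, 0)</pre>
</div>
<p>Because every step is vectorized, the same function should also work on whole columns inside <code>mutate()</code>.</p>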
</section>
<section id="summary-functions" data-type="sect2">
<h2>
Summary functions</h2>
<p>Another important family of vector functions is summary functions, functions that return a single value for use in <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>. Sometimes this can just be a matter of setting a default argument or two:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">commas &lt;- function(x) {
str_flatten(x, collapse = ", ", last = " and ")
}
commas(c("cat", "dog", "pigeon"))
#&gt; [1] "cat, dog and pigeon"</pre>
</div>
<p>Or you might wrap up a simple computation, like for the coefficient of variation, which divides the standard deviation by the mean:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">cv &lt;- function(x, na.rm = FALSE) {
sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}
cv(runif(100, min = 0, max = 50))
#&gt; [1] 0.5196276
cv(runif(100, min = 0, max = 500))
#&gt; [1] 0.5652554</pre>
</div>
<p>Or maybe you just want to make a common pattern easier to remember by giving it a memorable name:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/gbganalyst/status/1571619641390252033
n_missing &lt;- function(x) {
sum(is.na(x))
} </pre>
</div>
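<p>Because a summary function returns a single value, it slots directly into <code>summarize()</code>. The sketch below uses a tiny made-up tibble (<code>df</code>, <code>g</code>, and <code>x</code> are hypothetical names chosen for illustration):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

n_missing &lt;- function(x) {
  sum(is.na(x))
}

# Made-up data: group "a" has one missing value, group "b" has one
df &lt;- tibble(
  g = c("a", "a", "b"),
  x = c(1, NA, NA)
)

# One row per group, with the count of missing values in x
out &lt;- df |&gt;
  group_by(g) |&gt;
  summarize(n_miss = n_missing(x))
out</pre>
</div>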
<p>You can also write functions with multiple vector inputs. For example, maybe you want to compute the mean absolute percentage error to help you compare model predictions with actual values:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/neilgcurrie/status/1571607727255834625
mape &lt;- function(actual, predicted) {
sum(abs((actual - predicted) / actual)) / length(actual)
}</pre>
</div>
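<p>As a worked example with made-up numbers, the function above can be checked by hand: the three relative errors below are 0.1, 0.1, and 0, so the result should be their average. Note that, as written, the function would return <code>NA</code> if either input contains missing values and is not meaningful when <code>actual</code> contains zeros:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">mape &lt;- function(actual, predicted) {
  sum(abs((actual - predicted) / actual)) / length(actual)
}

# Made-up values for illustration
actual &lt;- c(100, 200, 400)
predicted &lt;- c(110, 180, 400)

# Relative errors are 10/100, 20/200, and 0/400
err &lt;- mape(actual, predicted)
err # (0.1 + 0.1 + 0) / 3</pre>
</div>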
<div data-type="note"><h1>
RStudio
</h1><p>Once you start writing functions, there are two RStudio shortcuts that are super useful:</p><ul><li><p>To find the definition of a function that you've written, place the cursor on the name of the function and press <code>F2</code>.</p></li>
<li><p>To quickly jump to a function, press <code>Ctrl + .</code> to open the fuzzy file and function finder and type the first few letters of your function name. You can also navigate to files, Quarto sections, and more, making it a very handy navigation tool.</p></li>
</ul></div>
</section>
<section id="exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">mean(is.na(x))
mean(is.na(y))
mean(is.na(z))
x / sum(x, na.rm = TRUE)
y / sum(y, na.rm = TRUE)
z / sum(z, na.rm = TRUE)
round(x / sum(x, na.rm = TRUE) * 100, 1)
round(y / sum(y, na.rm = TRUE) * 100, 1)
round(z / sum(z, na.rm = TRUE) * 100, 1)</pre>
</div>
</li>
<li><p>In the second variant of <code>rescale01()</code>, infinite values are left unchanged. Can you rewrite <code>rescale01()</code> so that <code>-Inf</code> is mapped to 0, and <code>Inf</code> is mapped to 1?</p></li>
<li><p>Given a vector of birthdates, write a function to compute the age in years.</p></li>
<li><p>Write your own functions to compute the variance and skewness of a numeric vector. Variance is defined as <span class="math display">\[
\mathrm{Var}(x) = \frac{1}{n - 1} \sum_{i=1}^n (x_i - \bar{x}) ^2 \text{,}
\]</span> where <span class="math inline">\(\bar{x} = (\sum_i^n x_i) / n\)</span> is the sample mean. Skewness is defined as <span class="math display">\[
\mathrm{Skew}(x) = \frac{\frac{1}{n-2}\left(\sum_{i=1}^n(x_i - \bar x)^3\right)}{\mathrm{Var}(x)^{3/2}} \text{.}
\]</span></p></li>
<li><p>Write <code>both_na()</code>, a summary function that takes two vectors of the same length and returns the number of positions that have an <code>NA</code> in both vectors.</p></li>
<li>
<p>Read the documentation to figure out what the following functions do. Why are they useful even though they are so short?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">is_directory &lt;- function(x) file.info(x)$isdir
is_readable &lt;- function(x) file.access(x, 4) == 0</pre>
</div>
</li>
</ol></section>
</section>
<section id="data-frame-functions" data-type="sect1">
<h1>
Data frame functions</h1>
<p>Vector functions are useful for pulling out code that's repeated within a dplyr verb. But you'll often also repeat the verbs themselves, particularly within a large pipeline. When you notice yourself copying and pasting multiple verbs multiple times, you might think about writing a data frame function. Data frame functions work like dplyr verbs: they take a data frame as the first argument, some extra arguments that say what to do with it, and return a data frame or vector.</p>
<p>To let you write a function that uses dplyr verbs, we'll first introduce you to the challenge of indirection and how you can overcome it with embracing, <code>{{ }}</code>. With this theory under your belt, we'll then show you a bunch of examples to illustrate what you might do with it.</p>
<section id="indirection-and-tidy-evaluation" data-type="sect2">
<h2>
Indirection and tidy evaluation</h2>
<p>When you start writing functions that use dplyr verbs you rapidly hit the problem of indirection. Let's illustrate the problem with a very simple function: <code>grouped_mean()</code>. The goal of this function is to compute the mean of <code>mean_var</code> grouped by <code>group_var</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">grouped_mean &lt;- function(df, group_var, mean_var) {
df |&gt;
group_by(group_var) |&gt;
summarize(mean(mean_var))
}</pre>
</div>
<p>If we try and use it, we get an error:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds |&gt; grouped_mean(cut, carat)
#&gt; Error in `group_by()`:
#&gt; ! Must group by variables found in `.data`.
#&gt; ✖ Column `group_var` is not found.</pre>
</div>
<p>To make the problem a bit more clear, we can use a made up data frame:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(
mean_var = 1,
group_var = "g",
group = 1,
x = 10,
y = 100
)
df |&gt; grouped_mean(group, x)
#&gt; # A tibble: 1 × 2
#&gt; group_var `mean(mean_var)`
#&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 g 1
df |&gt; grouped_mean(group, y)
#&gt; # A tibble: 1 × 2
#&gt; group_var `mean(mean_var)`
#&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 g 1</pre>
</div>
<p>Regardless of how we call <code>grouped_mean()</code> it always does <code>df |&gt; group_by(group_var) |&gt; summarize(mean(mean_var))</code>, instead of <code>df |&gt; group_by(group) |&gt; summarize(mean(x))</code> or <code>df |&gt; group_by(group) |&gt; summarize(mean(y))</code>. This is a problem of indirection, and it arises because dplyr uses <strong>tidy evaluation</strong> to allow you to refer to the names of variables inside your data frame without any special treatment.</p>
<p>Tidy evaluation is great 95% of the time because it makes your data analyses very concise as you never have to say which data frame a variable comes from; it's obvious from the context. The downside of tidy evaluation comes when we want to wrap up repeated tidyverse code into a function. Here we need some way to tell <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> not to treat <code>group_var</code> and <code>mean_var</code> as the names of the variables, but instead to look inside them for the variables we actually want to use.</p>
<p>Tidy evaluation includes a solution to this problem called <strong>embracing</strong> 🤗. Embracing a variable means wrapping it in braces, so (e.g.) <code>var</code> becomes <code>{{ var }}</code>. Embracing a variable tells dplyr to use the value stored inside the argument, not the argument as the literal variable name. One way to remember what's happening is to think of <code>{{ }}</code> as looking down a tunnel: <code>{{ var }}</code> will make a dplyr function look inside of <code>var</code> rather than looking for a variable called <code>var</code>.</p>
<p>So to make <code>grouped_mean()</code> work, we need to surround <code>group_var</code> and <code>mean_var</code> with <code>{{ }}</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">grouped_mean &lt;- function(df, group_var, mean_var) {
df |&gt;
group_by({{ group_var }}) |&gt;
summarize(mean({{ mean_var }}))
}
diamonds |&gt; grouped_mean(cut, carat)
#&gt; # A tibble: 5 × 2
#&gt; cut `mean(carat)`
#&gt; &lt;ord&gt; &lt;dbl&gt;
#&gt; 1 Fair 1.05
#&gt; 2 Good 0.849
#&gt; 3 Very Good 0.806
#&gt; 4 Premium 0.892
#&gt; 5 Ideal 0.703</pre>
</div>
<p>Success!</p>
</section>
<section id="sec-embracing" data-type="sect2">
<h2>
When to embrace?</h2>
<p>So the key challenge in writing data frame functions is figuring out which arguments need to be embraced. Fortunately, this is easy because you can look it up from the documentation 😄. There are two terms to look for in the docs which correspond to the two most common sub-types of tidy evaluation:</p>
<ul><li><p><strong>Data-masking</strong>: this is used in functions like <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> that compute with variables.</p></li>
<li><p><strong>Tidy-selection</strong>: this is used for functions like <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code> that select variables.</p></li>
</ul><p>Your intuition about which arguments use tidy evaluation should be good for many common functions — just think about whether you can compute (e.g. <code>x + 1</code>) or select (e.g. <code>a:x</code>).</p>
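<p>The distinction can be sketched with two tiny wrappers. The helper names <code>mean_of()</code> and <code>keep_cols()</code> are hypothetical and exist only for this illustration: the first embraces an argument that is passed to a data-masking verb, so callers can compute with columns; the second embraces an argument that is passed to a tidy-selection verb, so callers can use selection syntax such as ranges.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

# Data-masking: `var` ends up inside summarize(), so it can be an expression
mean_of &lt;- function(df, var) {
  df |&gt; summarize(mean = mean({{ var }}, na.rm = TRUE))
}

# Tidy-selection: `cols` ends up inside select(), so it can be a selection
keep_cols &lt;- function(df, cols) {
  df |&gt; select({{ cols }})
}

# Compute with variables
price_per_carat &lt;- diamonds |&gt; mean_of(price / carat)

# Select a range of variables (carat, cut, and color)
first_three &lt;- diamonds |&gt; keep_cols(carat:color)</pre>
</div>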
<p>In the following sections, we'll explore the sorts of handy functions you might write once you understand embracing.</p>
</section>
<section id="common-use-cases" data-type="sect2">
<h2>
Common use cases</h2>
<p>If you commonly perform the same set of summaries when doing initial data exploration, you might consider wrapping them up in a helper function:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">summary6 &lt;- function(data, var) {
data |&gt; summarize(
min = min({{ var }}, na.rm = TRUE),
mean = mean({{ var }}, na.rm = TRUE),
median = median({{ var }}, na.rm = TRUE),
max = max({{ var }}, na.rm = TRUE),
n = n(),
n_miss = sum(is.na({{ var }})),
.groups = "drop"
)
}
diamonds |&gt; summary6(carat)
#&gt; # A tibble: 1 × 6
#&gt; min mean median max n n_miss
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 0.2 0.798 0.7 5.01 53940 0</pre>
</div>
<p>(Whenever you wrap <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> in a helper, we think it's good practice to set <code>.groups = "drop"</code> to both avoid the message and leave the data in an ungrouped state.)</p>
<p>The nice thing about this function is, because it wraps <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>, you can use it on grouped data:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds |&gt;
group_by(cut) |&gt;
summary6(carat)
#&gt; # A tibble: 5 × 7
#&gt; cut min mean median max n n_miss
#&gt; &lt;ord&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 Fair 0.22 1.05 1 5.01 1610 0
#&gt; 2 Good 0.23 0.849 0.82 3.01 4906 0
#&gt; 3 Very Good 0.2 0.806 0.71 4 12082 0
#&gt; 4 Premium 0.2 0.892 0.86 4.01 13791 0
#&gt; 5 Ideal 0.2 0.703 0.54 3.5 21551 0</pre>
</div>
<p>Furthermore, since the arguments to <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> are data-masking, the <code>var</code> argument to <code>summary6()</code> is data-masking too. That means you can also summarize computed variables:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds |&gt;
group_by(cut) |&gt;
summary6(log10(carat))
#&gt; # A tibble: 5 × 7
#&gt; cut min mean median max n n_miss
#&gt; &lt;ord&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 Fair -0.658 -0.0273 0 0.700 1610 0
#&gt; 2 Good -0.638 -0.133 -0.0862 0.479 4906 0
#&gt; 3 Very Good -0.699 -0.164 -0.149 0.602 12082 0
#&gt; 4 Premium -0.699 -0.125 -0.0655 0.603 13791 0
#&gt; 5 Ideal -0.699 -0.225 -0.268 0.544 21551 0</pre>
</div>
<p>To summarize multiple variables, you'll need to wait until <a href="#sec-across" data-type="xref">#sec-across</a>, where you'll learn how to use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>.</p>
<p>Another popular <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> helper function is a version of <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> that also computes proportions:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/Diabb6/status/1571635146658402309
count_prop &lt;- function(df, var, sort = FALSE) {
df |&gt;
count({{ var }}, sort = sort) |&gt;
mutate(prop = n / sum(n))
}
diamonds |&gt; count_prop(clarity)
#&gt; # A tibble: 8 × 3
#&gt; clarity n prop
#&gt; &lt;ord&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 I1 741 0.0137
#&gt; 2 SI2 9194 0.170
#&gt; 3 SI1 13065 0.242
#&gt; 4 VS2 12258 0.227
#&gt; 5 VS1 8171 0.151
#&gt; 6 VVS2 5066 0.0939
#&gt; # … with 2 more rows</pre>
</div>
<p>This function has three arguments: <code>df</code>, <code>var</code>, and <code>sort</code>. Only <code>var</code> needs to be embraced because it's passed to <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>, which uses data-masking for all variables in <code>...</code>.</p>
<p>Or maybe you want to find the sorted unique values of a variable for a subset of the data. Rather than supplying a variable and a value to do the filtering, well allow the user to supply a condition:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">unique_where &lt;- function(df, condition, var) {
df |&gt;
filter({{ condition }}) |&gt;
distinct({{ var }}) |&gt;
arrange({{ var }})
}
# Find all the destinations in December
flights |&gt; unique_where(month == 12, dest)
#&gt; # A tibble: 96 × 1
#&gt; dest
#&gt; &lt;chr&gt;
#&gt; 1 ABQ
#&gt; 2 ALB
#&gt; 3 ATL
#&gt; 4 AUS
#&gt; 5 AVL
#&gt; 6 BDL
#&gt; # … with 90 more rows
# Which months did plane N14228 fly in?
flights |&gt; unique_where(tailnum == "N14228", month)
#&gt; # A tibble: 11 × 1
#&gt; month
#&gt; &lt;int&gt;
#&gt; 1 1
#&gt; 2 2
#&gt; 3 3
#&gt; 4 4
#&gt; 5 5
#&gt; 6 6
#&gt; # … with 5 more rows</pre>
</div>
<p>Here we embrace <code>condition</code> because it's passed to <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code>var</code> because it's passed to <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>.</p>
<p>We've made all these examples take a data frame as the first argument, but if you're working repeatedly with the same data, it can make sense to hardcode it. For example, the following function always works with the flights dataset and always selects <code>time_hour</code>, <code>carrier</code>, and <code>flight</code> since they form the compound primary key that allows you to identify a row.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights_sub &lt;- function(rows, cols) {
flights |&gt;
filter({{ rows }}) |&gt;
select(time_hour, carrier, flight, {{ cols }})
}
flights_sub(dest == "IAH", contains("time"))
#&gt; # A tibble: 7,198 × 8
#&gt; time_hour carrier flight dep_time sched_dep_time arr_time
#&gt; &lt;dttm&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013-01-01 05:00:00 UA 1545 517 515 830
#&gt; 2 2013-01-01 05:00:00 UA 1714 533 529 850
#&gt; 3 2013-01-01 06:00:00 UA 496 623 627 933
#&gt; 4 2013-01-01 07:00:00 UA 473 728 732 1041
#&gt; 5 2013-01-01 07:00:00 UA 1479 739 739 1104
#&gt; 6 2013-01-01 09:00:00 UA 1220 908 908 1228
#&gt; # … with 7,192 more rows, and 2 more variables: sched_arr_time &lt;int&gt;,
#&gt; # air_time &lt;dbl&gt;</pre>
</div>
</section>
<section id="data-masking-vs.-tidy-selection" data-type="sect2">
<h2>
Data-masking vs. tidy-selection</h2>
<p>Sometimes you want to select variables inside a function that uses data-masking. For example, imagine you want to write a <code>count_missing()</code> that counts the number of missing observations in rows. You might try writing something like:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">count_missing &lt;- function(df, group_vars, x_var) {
df |&gt;
group_by({{ group_vars }}) |&gt;
summarize(n_miss = sum(is.na({{ x_var }})))
}
flights |&gt;
count_missing(c(year, month, day), dep_time)
#&gt; Error in `group_by()`:
#&gt; In argument: `c(year, month, day)`.
#&gt; Caused by error:
#&gt; ! `c(year, month, day)` must be size 336776 or 1, not 1010328.</pre>
</div>
<p>This doesn’t work because <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> uses data-masking, not tidy-selection. We can work around that problem by using the handy <code><a href="https://dplyr.tidyverse.org/reference/pick.html">pick()</a></code> function, which allows you to use tidy-selection inside data-masking functions:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">count_missing &lt;- function(df, group_vars, x_var) {
  df |&gt;
    group_by(pick({{ group_vars }})) |&gt;
    summarize(n_miss = sum(is.na({{ x_var }})))
}

flights |&gt;
  count_missing(c(year, month, day), dep_time)
#&gt; `summarise()` has grouped output by 'year', 'month'. You can override using
#&gt; the `.groups` argument.
#&gt; # A tibble: 365 × 4
#&gt; # Groups: year, month [12]
#&gt; year month day n_miss
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 4
#&gt; 2 2013 1 2 8
#&gt; 3 2013 1 3 10
#&gt; 4 2013 1 4 6
#&gt; 5 2013 1 5 3
#&gt; 6 2013 1 6 1
#&gt; # … with 359 more rows</pre>
</div>
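<p>Because <code>pick()</code> accepts any tidy-selection, the grouping argument can also be a selection helper rather than an explicit <code>c()</code> of names. The following is a minimal sketch on a made-up tibble: <code>toy</code> and its columns are hypothetical, and <code>.groups = "drop"</code> is added here only to silence the grouping message. It assumes dplyr 1.1.0 or later, where <code>pick()</code> was introduced.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Variation on count_missing() that returns an ungrouped result
count_missing &lt;- function(df, group_vars, x_var) {
  df |&gt;
    group_by(pick({{ group_vars }})) |&gt;
    summarize(n_miss = sum(is.na({{ x_var }})), .groups = "drop")
}

# Hypothetical toy data: two grouping columns and one column with missing values
toy &lt;- tibble(
  g1 = c("a", "a", "b", "b"),
  g2 = c("x", "x", "y", "z"),
  value = c(1, NA, NA, 4)
)

# A tidy-selection helper picks both grouping columns at once
by_both &lt;- toy |&gt; count_missing(starts_with("g"), value)

# A single bare column name is also a valid selection
by_g1 &lt;- toy |&gt; count_missing(g1, value)</pre>
</div>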
<p>Another convenient use of <code><a href="https://dplyr.tidyverse.org/reference/pick.html">pick()</a></code> is to make a 2d table of counts. Here we count using all the variables in <code>rows</code> and <code>cols</code>, then use <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> to rearrange the counts into a grid:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/pollicipes/status/1571606508944719876
count_wide &lt;- function(data, rows, cols) {
  data |&gt;
    count(pick(c({{ rows }}, {{ cols }}))) |&gt;
    pivot_wider(
      names_from = {{ cols }},
      values_from = n,
      names_sort = TRUE,
      values_fill = 0
    )
}

diamonds |&gt; count_wide(clarity, cut)
#&gt; # A tibble: 8 × 6
#&gt; clarity Fair Good `Very Good` Premium Ideal
#&gt; &lt;ord&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 I1 210 96 84 205 146
#&gt; 2 SI2 466 1081 2100 2949 2598
#&gt; 3 SI1 408 1560 3240 3575 4282
#&gt; 4 VS2 261 978 2591 3357 5071
#&gt; 5 VS1 170 648 1775 1989 3589
#&gt; 6 VVS2 69 286 1235 870 2606
#&gt; # … with 2 more rows
diamonds |&gt; count_wide(c(clarity, color), cut)
#&gt; # A tibble: 56 × 7
#&gt; clarity color Fair Good `Very Good` Premium Ideal
#&gt; &lt;ord&gt; &lt;ord&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 I1 D 4 8 5 12 13
#&gt; 2 I1 E 9 23 22 30 18
#&gt; 3 I1 F 35 19 13 34 42
#&gt; 4 I1 G 53 19 16 46 16
#&gt; 5 I1 H 52 14 12 46 38
#&gt; 6 I1 I 34 9 8 24 17
#&gt; # … with 50 more rows</pre>
</div>
<p>While our examples have mostly focused on dplyr, tidy evaluation also underpins tidyr, and if you look at the <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> docs you can see that <code>names_from</code> uses tidy-selection.</p>
</section>
<section id="exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>Using the datasets from nycflights13, write a function that:</p>
<ol type="1"><li>
<p>Finds all flights that were cancelled (i.e. <code>is.na(arr_time)</code>) or delayed by more than an hour.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt; filter_severe()</pre>
</div>
</li>
<li>
<p>Counts the number of cancelled flights and the number of flights delayed by more than an hour.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt; group_by(dest) |&gt; summarize_severe()</pre>
</div>
</li>
<li>
<p>Finds all flights that were cancelled or delayed by more than a user supplied number of hours:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt; filter_severe(hours = 2)</pre>
</div>
</li>
<li>
<p>Summarizes the weather to compute the minimum, mean, and maximum, of a user supplied variable:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">weather |&gt; summarize_weather(temp)</pre>
</div>
</li>
<li>
<p>Converts the user supplied variable that uses clock time (e.g. <code>dep_time</code>, <code>arr_time</code>, etc.) into a decimal time (i.e. hours + (minutes / 60)).</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt; standardise_time(sched_dep_time)</pre>
</div>
</li>
</ol></li>
<li><p>For each of the following functions list all arguments that use tidy evaluation and describe whether they use data-masking or tidy-selection: <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename_with()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/slice.html">slice_min()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/slice.html">slice_sample()</a></code>.</p></li>
<li>
<p>Generalize the following function so that you can supply any number of variables to count.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">count_prop &lt;- function(df, var, sort = FALSE) {
  df |&gt;
    count({{ var }}, sort = sort) |&gt;
    mutate(prop = n / sum(n))
}</pre>
</div>
</li>
</ol></section>
</section>
<section id="plot-functions" data-type="sect1">
<h1>
Plot functions</h1>
<p>Instead of returning a data frame, you might want to return a plot. Fortunately, you can use the same techniques with ggplot2, because <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> is a data-masking function. For example, imagine that you’re making a lot of histograms:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds |&gt;
  ggplot(aes(x = carat)) +
  geom_histogram(binwidth = 0.1)

diamonds |&gt;
  ggplot(aes(x = carat)) +
  geom_histogram(binwidth = 0.05)</pre>
</div>
<p>Wouldn’t it be nice if you could wrap this up into a histogram function? This is easy as pie once you know that <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> is a data-masking function and you need to embrace:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">histogram &lt;- function(df, var, binwidth = NULL) {
  df |&gt;
    ggplot(aes(x = {{ var }})) +
    geom_histogram(binwidth = binwidth)
}

diamonds |&gt; histogram(carat, 0.1)</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-46-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
<p>Note that <code>histogram()</code> returns a ggplot2 plot, meaning you can still add on additional components if you want. Just remember to switch from <code>|&gt;</code> to <code>+</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds |&gt;
  histogram(carat, 0.1) +
  labs(x = "Size (in carats)", y = "Number of diamonds")</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-47-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
<section id="more-variables" data-type="sect2">
<h2>
More variables</h2>
<p>It’s straightforward to add more variables to the mix. For example, maybe you want an easy way to eyeball whether or not a dataset is linear by overlaying a smooth line and a straight line:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/tyler_js_smith/status/1574377116988104704
linearity_check &lt;- function(df, x, y) {
  df |&gt;
    ggplot(aes(x = {{ x }}, y = {{ y }})) +
    geom_point() +
    geom_smooth(method = "loess", color = "red", se = FALSE) +
    geom_smooth(method = "lm", color = "blue", se = FALSE)
}

starwars |&gt;
  filter(mass &lt; 1000) |&gt;
  linearity_check(mass, height)
#&gt; `geom_smooth()` using formula = 'y ~ x'
#&gt; `geom_smooth()` using formula = 'y ~ x'</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-48-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
<p>Or maybe you want an alternative to colored scatterplots for very large datasets where overplotting is a problem:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/ppaxisa/status/1574398423175921665
hex_plot &lt;- function(df, x, y, z, bins = 20, fun = "mean") {
  df |&gt;
    ggplot(aes(x = {{ x }}, y = {{ y }}, z = {{ z }})) +
    stat_summary_hex(
      aes(color = after_scale(fill)), # make border same color as fill
      bins = bins,
      fun = fun
    )
}

diamonds |&gt; hex_plot(carat, price, depth)</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-49-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
</section>
<section id="combining-with-dplyr" data-type="sect2">
<h2>
Combining with dplyr</h2>
<p>Some of the most useful helpers combine a dash of dplyr with ggplot2. For example, you might want to draw a bar chart with the categories on the vertical axis, where the bars are automatically sorted in frequency order using <code><a href="https://forcats.tidyverse.org/reference/fct_inorder.html">fct_infreq()</a></code>. Since the categories run along the y-axis, we also need to reverse the usual order to get the highest values at the top:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">sorted_bars &lt;- function(df, var) {
  df |&gt;
    mutate({{ var }} := fct_rev(fct_infreq({{ var }}))) |&gt;
    ggplot(aes(y = {{ var }})) +
    geom_bar()
}

diamonds |&gt; sorted_bars(cut)</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-50-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
<p>We have to use a new operator here, <code>:=</code>, because we are generating the variable name based on user-supplied data. Variable names go on the left hand side of <code>=</code>, but R’s syntax doesn’t allow anything to the left of <code>=</code> except for a single literal name. To work around this problem, we use the special operator <code>:=</code> which tidy evaluation treats in exactly the same way as <code>=</code>.</p>
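<p>To see <code>:=</code> in isolation, away from any plotting, here is a small sketch. The helper <code>double_col()</code> and the tibble <code>toy</code> are hypothetical names made up for illustration; the point is only that the embraced argument appears on both sides of <code>:=</code>, so the column the user names is overwritten in place:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical helper: doubles whichever column the caller names
double_col &lt;- function(df, var) {
  df |&gt;
    mutate({{ var }} := {{ var }} * 2)
}

toy &lt;- tibble(x = 1:3, y = 4:6)

# Only y changes; x is left untouched and no new column is added
doubled &lt;- toy |&gt; double_col(y)</pre>
</div>
<p>Writing <code>mutate({{ var }} = ...)</code> instead would not get that far: R rejects it as a syntax error before dplyr ever sees the call.</p>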
<p>Or maybe you want to make it easy to draw a bar plot just for a subset of the data:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">conditional_bars &lt;- function(df, condition, var) {
  df |&gt;
    filter({{ condition }}) |&gt;
    ggplot(aes(x = {{ var }})) +
    geom_bar()
}

diamonds |&gt; conditional_bars(cut == "Good", clarity)</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-51-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
<p>You can also get creative and display data summaries in other ways. For example, this code uses the axis labels to display the highest value. As you learn more about ggplot2, the power of your functions will continue to increase.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># https://gist.github.com/GShotwell/b19ef520b6d56f61a830fabb3454965b
fancy_ts &lt;- function(df, val, group) {
  labs &lt;- df |&gt;
    group_by({{ group }}) |&gt;
    summarize(breaks = max({{ val }}))

  df |&gt;
    ggplot(aes(x = date, y = {{ val }}, group = {{ group }}, color = {{ group }})) +
    geom_path() +
    scale_y_continuous(
      breaks = labs$breaks,
      labels = scales::label_comma(),
      minor_breaks = NULL,
      guide = guide_axis(position = "right")
    )
}

df &lt;- tibble(
  dist1 = sort(rnorm(50, 5, 2)),
  dist2 = sort(rnorm(50, 8, 3)),
  dist4 = sort(rnorm(50, 15, 1)),
  date = seq.Date(as.Date("2022-01-01"), as.Date("2022-04-10"), by = "2 days")
)

df &lt;- pivot_longer(df, cols = -date, names_to = "dist_name", values_to = "value")

fancy_ts(df, value, dist_name)</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-52-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
<p>Next we’ll discuss two more complicated cases: faceting and automatic labeling.</p>
</section>
<section id="faceting" data-type="sect2">
<h2>
Faceting</h2>
<p>Unfortunately, programming with faceting is a special challenge, because faceting was implemented before we understood what tidy evaluation was and how it should work. So you have to learn a new syntax. When programming with facets, instead of writing <code>~ x</code>, you need to write <code>vars(x)</code> and instead of <code>~ x + y</code> you need to write <code>vars(x, y)</code>. The only advantage of this syntax is that <code><a href="https://ggplot2.tidyverse.org/reference/vars.html">vars()</a></code> uses tidy evaluation so you can embrace within it:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/sharoz/status/1574376332821204999
foo &lt;- function(x) {
  ggplot(mtcars, aes(x = mpg, y = disp)) +
    geom_point() +
    facet_wrap(vars({{ x }}))
}

foo(cyl)</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-53-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
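<p>The same pattern carries over to two-way facets, where the formula <code>rows ~ cols</code> becomes a pair of <code>vars()</code> calls passed to the <code>rows</code> and <code>cols</code> arguments of <code>facet_grid()</code>. A minimal sketch, in which the function name <code>scatter_grid()</code> is made up for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Hypothetical helper: facet the mtcars scatterplot by two user-supplied variables
scatter_grid &lt;- function(rows, cols) {
  ggplot(mtcars, aes(x = mpg, y = disp)) +
    geom_point() +
    facet_grid(rows = vars({{ rows }}), cols = vars({{ cols }}))
}

# Equivalent to writing facet_grid(cyl ~ am) by hand
p &lt;- scatter_grid(cyl, am)</pre>
</div>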
<p>As with data frame functions, it can be useful to make your plotting functions tightly coupled to a specific dataset, or even a specific variable. For example, the following function makes it particularly easy to interactively explore the conditional distribution of <code>carat</code> from the diamonds dataset.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/yutannihilat_en/status/1574387230025875457
density &lt;- function(color, facets, binwidth = 0.1) {
  diamonds |&gt;
    ggplot(aes(x = carat, y = after_stat(density), color = {{ color }})) +
    geom_freqpoly(binwidth = binwidth) +
    facet_wrap(vars({{ facets }}))
}

density()
density(cut)
density(cut, clarity)</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-54-1.png" class="img-fluid" width="576"/></p>
</div>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-54-2.png" class="img-fluid" width="576"/></p>
</div>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-54-3.png" class="img-fluid" width="576"/></p>
</div>
</div>
</section>
<section id="labeling" data-type="sect2">
<h2>
Labeling</h2>
<p>Remember the histogram function we showed you earlier?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">histogram &lt;- function(df, var, binwidth = NULL) {
  df |&gt;
    ggplot(aes(x = {{ var }})) +
    geom_histogram(binwidth = binwidth)
}</pre>
</div>
<p>Wouldn’t it be nice if we could label the output with the variable and the bin width that was used? To do so, we’re going to have to go under the covers of tidy evaluation and use a function from a package we haven’t talked about yet: rlang. rlang is a low-level package that’s used by just about every other package in the tidyverse because it implements tidy evaluation (as well as many other useful tools).</p>
<p>To solve the labeling problem we can use <code><a href="https://rlang.r-lib.org/reference/englue.html">rlang::englue()</a></code>. This works similarly to <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code>, so any value wrapped in <code><a href="https://rdrr.io/r/base/Paren.html">{ }</a></code> will be inserted into the string. But it also understands <code>{{ }}</code>, which automatically inserts the appropriate variable name:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">histogram &lt;- function(df, var, binwidth) {
  label &lt;- rlang::englue("A histogram of {{var}} with binwidth {binwidth}")

  df |&gt;
    ggplot(aes(x = {{ var }})) +
    geom_histogram(binwidth = binwidth) +
    labs(title = label)
}

diamonds |&gt; histogram(carat, 0.1)</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-56-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
<p>You can use the same approach in any other place where you want to supply a string in a ggplot2 plot.</p>
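<p>For instance, the label could go on the x-axis instead of the title. The sketch below assumes nothing beyond ggplot2 and rlang; <code>labelled_histogram()</code> is a hypothetical name. Note that <code>englue()</code> is called directly inside the function that receives <code>var</code>, which is where <code>{{ var }}</code> can be resolved to the name the user typed:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Hypothetical variant of histogram() that labels the x-axis rather than the title
labelled_histogram &lt;- function(df, var, binwidth) {
  # {{ var }} becomes the variable name, {binwidth} its value
  x_label &lt;- rlang::englue("{{ var }} (bins of width {binwidth})")

  df |&gt;
    ggplot(aes(x = {{ var }})) +
    geom_histogram(binwidth = binwidth) +
    labs(x = x_label)
}

p &lt;- diamonds |&gt; labelled_histogram(carat, 0.1)</pre>
</div>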
</section>
<section id="exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<p>Build up a rich plotting function by incrementally implementing each of the steps below:</p>
<ol type="1"><li><p>Draw a scatterplot given a dataset and <code>x</code> and <code>y</code> variables.</p></li>
<li><p>Add a line of best fit (i.e. a linear model with no standard errors).</p></li>
<li><p>Add a title.</p></li>
</ol></section>
</section>
<section id="style" data-type="sect1">
<h1>
Style</h1>
<p>R doesn’t care what your function or arguments are called but the names make a big difference for humans. Ideally, the name of your function will be short, but clearly evoke what the function does. That’s hard! But it’s better to be clear than short, as RStudio’s autocomplete makes it easy to type long names.</p>
<p>Generally, function names should be verbs, and arguments should be nouns. There are some exceptions: nouns are ok if the function computes a very well known noun (e.g. <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> is better than <code>compute_mean()</code>), or accesses some property of an object (e.g. <code><a href="https://rdrr.io/r/stats/coef.html">coef()</a></code> is better than <code>get_coefficients()</code>). Use your best judgement and don’t be afraid to rename a function if you figure out a better name later.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Too short
f()
# Not a verb, or descriptive
my_awesome_function()
# Long, but clear
impute_missing()
collapse_years()</pre>
</div>
<p>R also doesn’t care about how you use white space in your functions but future readers will. Continue to follow the rules from <a href="#chp-workflow-style" data-type="xref">#chp-workflow-style</a>. Additionally, <code>function()</code> should always be followed by squiggly brackets (<code><a href="https://rdrr.io/r/base/Paren.html">{}</a></code>), and the contents should be indented by an additional two spaces. This makes it easier to see the hierarchy in your code by skimming the left-hand margin.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Missing extra two spaces
density &lt;- function(color, facets, binwidth = 0.1) {
diamonds |&gt;
  ggplot(aes(x = carat, y = after_stat(density), color = {{ color }})) +
  geom_freqpoly(binwidth = binwidth) +
  facet_wrap(vars({{ facets }}))
}

# Pipe indented incorrectly
density &lt;- function(color, facets, binwidth = 0.1) {
  diamonds |&gt;
  ggplot(aes(x = carat, y = after_stat(density), color = {{ color }})) +
  geom_freqpoly(binwidth = binwidth) +
  facet_wrap(vars({{ facets }}))
}</pre>
</div>
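<p>For contrast, here is the same function laid out the way the rules above suggest: the body is indented two spaces inside the braces, and every continuation line of the pipeline is indented a further two spaces. This is a sketch of our reading of that guidance rather than a new rule:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Body indented by two spaces; pipeline continuation lines by two more
density &lt;- function(color, facets, binwidth = 0.1) {
  diamonds |&gt;
    ggplot(aes(x = carat, y = after_stat(density), color = {{ color }})) +
    geom_freqpoly(binwidth = binwidth) +
    facet_wrap(vars({{ facets }}))
}

p &lt;- density(cut, clarity)</pre>
</div>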
<p>As you can see we recommend putting extra spaces inside of <code>{{ }}</code>. This makes it very obvious that something unusual is happening.</p>
<section id="exercises-3" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>Read the source code for each of the following two functions, puzzle out what they do, and then brainstorm better names.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">f1 &lt;- function(string, prefix) {
  substr(string, 1, nchar(prefix)) == prefix
}

f3 &lt;- function(x, y) {
  rep(y, length.out = length(x))
}</pre>
</div>
</li>
<li><p>Take a function that you’ve written recently and spend 5 minutes brainstorming a better name for it and its arguments.</p></li>
<li><p>Make a case for why <code>norm_r()</code>, <code>norm_d()</code> etc. would be better than <code><a href="https://rdrr.io/r/stats/Normal.html">rnorm()</a></code>, <code><a href="https://rdrr.io/r/stats/Normal.html">dnorm()</a></code>. Make a case for the opposite.</p></li>
</ol></section>
</section>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you learned how to write functions for three useful scenarios: creating a vector, creating a data frame, or creating a plot. Along the way you saw many examples, which hopefully started to get your creative juices flowing, and gave you some ideas for where functions might help your analysis code.</p>
<p>We have only shown you the bare minimum to get started with functions and there’s much more to learn. A few places to learn more are:</p>
<ul><li>To learn more about programming with tidy evaluation, see useful recipes in <a href="https://dplyr.tidyverse.org/articles/programming.html">programming with dplyr</a> and <a href="https://tidyr.tidyverse.org/articles/programming.html">programming with tidyr</a> and learn more about the theory in <a href="https://rlang.r-lib.org/reference/topic-data-mask.html">What is data-masking and why do I need {{?</a>.</li>
<li>To learn more about reducing duplication in your ggplot2 code, read the <a href="https://ggplot2-book.org/programming.html" class="uri">Programming with ggplot2</a> chapter of the ggplot2 book.</li>
<li>For more advice on function style, see the <a href="https://style.tidyverse.org/functions.html" class="uri">tidyverse style guide</a>.</li>
</ul><p>In the next chapter, we’ll dive into some of the details of R’s vector data structures that we’ve omitted so far. These are not immediately useful by themselves, but are a necessary foundation for the following chapter on iteration, which gives you further tools for reducing code duplication.</p>
</section>
</section>