<section data-type="chapter" id="chp-data-transform">
<h1><span id="sec-data-transform" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data transformation</span></span></h1>
<section id="data-transform-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>Visualisation is an important tool for generating insight, but it’s rare that you get the data in exactly the right form you need for it. Often you’ll need to create some new variables or summaries to see the most important patterns, or maybe you just want to rename the variables or reorder the observations to make the data a little easier to work with. You’ll learn how to do all that (and more!) in this chapter, which will introduce you to data transformation using the <strong>dplyr</strong> package and a new dataset on flights that departed New York City in 2013.</p>
<p>The goal of this chapter is to give you an overview of all the key tools for transforming a data frame. We’ll start with functions that operate on rows and then columns of a data frame. We will then introduce the ability to work with groups. We will end the chapter with a case study that showcases these functions in action, and we’ll come back to the functions in more detail in later chapters, as we start to dig into specific types of data (e.g. numbers, strings, dates).</p>
<section id="data-transform-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter we’ll focus on the dplyr package, another core member of the tidyverse. We’ll illustrate the key ideas using data from the nycflights13 package, and use ggplot2 to help us understand the data.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(nycflights13)
library(tidyverse)
#&gt; ── Attaching core tidyverse packages ──────────────── tidyverse 1.3.2.9000 ──
#&gt; ✔ dplyr 1.0.99.9000 ✔ readr 2.1.3
#&gt; ✔ forcats 0.5.2 ✔ stringr 1.5.0
#&gt; ✔ ggplot2 3.4.0.9000 ✔ tibble 3.1.8
#&gt; ✔ lubridate 1.9.0 ✔ tidyr 1.3.0
#&gt; ✔ purrr 1.0.1
#&gt; ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
#&gt; ✖ dplyr::filter() masks stats::filter()
#&gt; ✖ dplyr::lag() masks stats::lag()
#&gt; Use the conflicted package (&lt;http://conflicted.r-lib.org/&gt;) to force all conflicts to become errors</pre>
</div>
<p>Take careful note of the conflicts message that’s printed when you load the tidyverse. It tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: <code><a href="https://rdrr.io/r/stats/filter.html">stats::filter()</a></code> and <code><a href="https://rdrr.io/r/stats/lag.html">stats::lag()</a></code>. So far we’ve mostly ignored which package a function comes from because most of the time it doesn’t matter. However, knowing the package can help you find help and find related functions, so when we need to be precise about which package a function comes from, we’ll use the same syntax as R: <code>packagename::functionname()</code>.</p>
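<p>As a small illustration of the <code>packagename::functionname()</code> syntax, the sketch below calls both versions of <code>filter()</code> explicitly. The vector <code>x</code> and the data frame built from it are made up for this example; they are not part of nycflights13.</p>

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
x <- c(1, 2, 3, 4, 5)

# stats::filter() applies a linear filter to a series; with three equal
# weights it computes a centred moving average (NA at both ends)
moving_avg <- stats::filter(x, rep(1 / 3, 3))

# dplyr::filter() keeps the rows of a data frame that satisfy a condition
kept_rows <- dplyr::filter(data.frame(x = x), x > 3)
```

<p>Because both calls name their package, they behave the same way no matter which packages are attached or in what order.</p>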
</section>
<section id="nycflights13" data-type="sect2">
<h2>
nycflights13</h2>
<p>To explore the basic dplyr verbs, we’re going to use <code><a href="https://rdrr.io/pkg/nycflights13/man/flights.html">nycflights13::flights</a></code>. This dataset contains all 336,776 flights that departed from New York City in 2013. The data comes from the US <a href="http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&amp;Link=0">Bureau of Transportation Statistics</a>, and is documented in <code><a href="https://rdrr.io/pkg/nycflights13/man/flights.html">?flights</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights
#&gt; # A tibble: 336,776 × 19
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 517 515 2 830 819
#&gt; 2 2013 1 1 533 529 4 850 830
#&gt; 3 2013 1 1 542 540 2 923 850
#&gt; 4 2013 1 1 544 545 -1 1004 1022
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>If you’ve used R before, you might notice that this data frame prints a little differently to other data frames you’ve seen. That’s because it’s a <strong>tibble</strong>, a special type of data frame used by the tidyverse to avoid some common gotchas. The most important difference is the way it prints: tibbles are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen. There are a few options to see everything. If you’re using RStudio, the most convenient is probably <code>View(flights)</code>, which will open an interactive scrollable and filterable view. Otherwise you can use <code>print(flights, width = Inf)</code> to show all columns, or call <code><a href="https://pillar.r-lib.org/reference/glimpse.html">glimpse()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">glimpse(flights)
#&gt; Rows: 336,776
#&gt; Columns: 19
#&gt; $ year &lt;int&gt; 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013…
#&gt; $ month &lt;int&gt; 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#&gt; $ day &lt;int&gt; 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#&gt; $ dep_time &lt;int&gt; 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 55…
#&gt; $ sched_dep_time &lt;int&gt; 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 60…
#&gt; $ dep_delay &lt;dbl&gt; 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2,…
#&gt; $ arr_time &lt;int&gt; 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 8…
#&gt; $ sched_arr_time &lt;int&gt; 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 8…
#&gt; $ arr_delay &lt;dbl&gt; 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7,…
#&gt; $ carrier &lt;chr&gt; "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6"…
#&gt; $ flight &lt;int&gt; 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301…
#&gt; $ tailnum &lt;chr&gt; "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N…
#&gt; $ origin &lt;chr&gt; "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LG…
#&gt; $ dest &lt;chr&gt; "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IA…
#&gt; $ air_time &lt;dbl&gt; 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149…
#&gt; $ distance &lt;dbl&gt; 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 73…
#&gt; $ hour &lt;dbl&gt; 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6…
#&gt; $ minute &lt;dbl&gt; 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59…
#&gt; $ time_hour &lt;dttm&gt; 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-0…</pre>
</div>
<p>In both views, the variable names are followed by abbreviations that tell you the type of each variable: <code>&lt;int&gt;</code> is short for integer, <code>&lt;dbl&gt;</code> is short for double (aka real numbers), <code>&lt;chr&gt;</code> for character (aka strings), and <code>&lt;dttm&gt;</code> for date-time. These are important because the operations you can perform on a column depend so much on its “type”, and these types are used to organize the chapters in the next section of the book.</p>
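<p>To see where these abbreviations come from, the sketch below builds a one-row tibble with one column of each type and asks for the abbreviation of every column. The tibble <code>example</code> is hypothetical, and calling <code>pillar::type_sum()</code> directly is just one way to inspect the labels; it assumes the pillar package, which tibble depends on, is installed.</p>

```r
library(tibble)

# Hypothetical one-row tibble with one column of each type mentioned above
example <- tibble(
  flight    = 1545L,                                         # integer
  dep_delay = 2,                                             # double
  carrier   = "UA",                                          # character
  time_hour = as.POSIXct("2013-01-01 05:00:00", tz = "UTC")  # date-time
)

# pillar::type_sum() returns the abbreviation that tibbles print
# under each column name
type_abbr <- vapply(example, pillar::type_sum, character(1))
type_abbr
```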
</section>
<section id="dplyr-basics" data-type="sect2">
<h2>
dplyr basics</h2>
<p>You’re about to learn the primary dplyr verbs, which will allow you to solve the vast majority of your data manipulation challenges. But before we discuss their individual differences, it’s worth stating what they have in common:</p>
<ol type="1"><li><p>The first argument is always a data frame.</p></li>
<li><p>The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).</p></li>
<li><p>The result is always a new data frame.</p></li>
</ol><p>Because the first argument is a data frame and the output is a data frame, dplyr verbs work well with the pipe, <code>|&gt;</code>. The pipe takes the thing on its left and passes it along to the function on its right so that <code>x |&gt; f(y)</code> is equivalent to <code>f(x, y)</code>, and <code>x |&gt; f(y) |&gt; g(z)</code> is equivalent to <code>g(f(x, y), z)</code>. The easiest way to pronounce the pipe is “then”. That makes it possible to get a sense of the following code even though you haven’t yet learned the details:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
  filter(dest == "IAH") |&gt; 
  group_by(year, month, day) |&gt; 
  summarize(
    arr_delay = mean(arr_delay, na.rm = TRUE)
  )</pre>
</div>
<p>The code starts with the <code>flights</code> dataset, then filters it, then groups it, then summarizes it. We’ll come back to the pipe and its alternatives in <a href="#sec-pipes" data-type="xref">#sec-pipes</a>.</p>
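<p>The equivalence between piped and nested calls can be checked on a small vector. This is only a sketch with made-up values; it uses base R functions and assumes R 4.1 or later, where the native pipe <code>|&gt;</code> is available.</p>

```r
# Hypothetical values, used only for illustration
x <- c(1.234, 5.678)

# x |> f(y) is f(x, y)
piped_once  <- x |> round(1)
nested_once <- round(x, 1)

# x |> f(y) |> g(z) is g(f(x, y), z)
piped_twice  <- x |> round(1) |> paste("min")
nested_twice <- paste(round(x, 1), "min")

identical(piped_once, nested_once)
identical(piped_twice, nested_twice)
```

<p>The piped versions read left to right in the order the steps happen, whereas the nested versions have to be read from the inside out.</p>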
<p>dplyr’s verbs are organised into four groups based on what they operate on: <strong>rows</strong>, <strong>columns</strong>, <strong>groups</strong>, or <strong>tables</strong>. In the following sections you’ll learn the most important verbs for rows, columns, and groups, then we’ll come back to verbs that work on tables in <a href="#chp-joins" data-type="xref">#chp-joins</a>. Let’s dive in!</p>
</section>
</section>
<section id="rows" data-type="sect1">
<h1>
Rows</h1>
<p>The most important verbs that operate on rows are <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, which changes which rows are present without changing their order, and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>, which changes the order of the rows without changing which are present. Both functions only affect the rows, and the columns are left unchanged. We’ll also discuss <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>, which finds rows with unique values but, unlike <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, can also optionally modify the columns.</p>
<section id="filter" data-type="sect2">
<h2>
filter()
</h2>
<p><code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> allows you to keep rows based on the values of the columns<span data-type="footnote">Later, you’ll learn about the <code>slice_*()</code> family which allows you to choose rows based on their positions.</span>. The first argument is the data frame. The second and subsequent arguments are the conditions that must be true to keep the row. For example, we could find all flights that arrived more than 120 minutes (two hours) late:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt; 
  filter(arr_delay &gt; 120)
#&gt; # A tibble: 10,034 × 19
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 811 630 101 1047 830
#&gt; 2 2013 1 1 848 1835 853 1001 1950
#&gt; 3 2013 1 1 957 733 144 1056 853
#&gt; 4 2013 1 1 1114 900 134 1447 1222
#&gt; 5 2013 1 1 1505 1310 115 1638 1431
#&gt; 6 2013 1 1 1525 1340 105 1831 1626
#&gt; # … with 10,028 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>As well as <code>&gt;</code> (greater than), you can use <code>&gt;=</code> (greater than or equal to), <code>&lt;</code> (less than), <code>&lt;=</code> (less than or equal to), <code>==</code> (equal to), and <code>!=</code> (not equal to). You can also use <code>&amp;</code> (and) or <code>|</code> (or) to combine multiple conditions:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Flights that departed on January 1
flights |&gt; 
  filter(month == 1 &amp; day == 1)
#&gt; # A tibble: 842 × 19
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 517 515 2 830 819
#&gt; 2 2013 1 1 533 529 4 850 830
#&gt; 3 2013 1 1 542 540 2 923 850
#&gt; 4 2013 1 1 544 545 -1 1004 1022
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 836 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …
# Flights that departed in January or February
flights |&gt;
  filter(month == 1 | month == 2)
#&gt; # A tibble: 51,955 × 19
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 517 515 2 830 819
#&gt; 2 2013 1 1 533 529 4 850 830
#&gt; 3 2013 1 1 542 540 2 923 850
#&gt; 4 2013 1 1 544 545 -1 1004 1022
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 51,949 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>There’s a useful shortcut when you’re combining <code>|</code> and <code>==</code>: <code>%in%</code>. It keeps rows where the variable equals one of the values on the right:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># A shorter way to select flights that departed in January or February
flights |&gt; 
  filter(month %in% c(1, 2))
#&gt; # A tibble: 51,955 × 19
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 517 515 2 830 819
#&gt; 2 2013 1 1 533 529 4 850 830
#&gt; 3 2013 1 1 542 540 2 923 850
#&gt; 4 2013 1 1 544 545 -1 1004 1022
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 51,949 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
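<p>The sketch below checks these equivalences on a tiny made-up table, so the results are easy to verify by eye. The tibble <code>toy</code> is hypothetical and merely stands in for <code>flights</code>. It also shows that conditions separated by commas inside <code>filter()</code> are combined with “and”, just as <code>&amp;</code> does.</p>

```r
library(dplyr)

# Hypothetical toy data standing in for flights
toy <- tibble(
  month = c(1, 1, 2, 3, 12),
  day   = c(1, 15, 1, 1, 25)
)

# %in% is shorthand for a chain of == joined by |
with_or <- toy |> filter(month == 1 | month == 2)
with_in <- toy |> filter(month %in% c(1, 2))
identical(with_or, with_in)

# Conditions separated by commas are combined with "and", just like &
with_and   <- toy |> filter(month == 1 & day == 1)
with_comma <- toy |> filter(month == 1, day == 1)
identical(with_and, with_comma)
```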
<p>We’ll come back to these comparisons and logical operators in more detail in <a href="#chp-logicals" data-type="xref">#chp-logicals</a>.</p>
<p>When you run <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, dplyr executes the filtering operation, creating a new data frame, and then prints it. It doesn’t modify the existing <code>flights</code> dataset because dplyr functions never modify their inputs. To save the result, you need to use the assignment operator, <code>&lt;-</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">jan1 &lt;- flights |&gt; 
  filter(month == 1 &amp; day == 1)</pre>
</div>
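<p>To convince yourself that the input is left untouched, you can compare row counts before and after filtering. The following is a minimal sketch on a hypothetical three-row tibble rather than on <code>flights</code>.</p>

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(month = c(1, 1, 2), day = c(1, 2, 1))

# Without assignment the result is only printed; toy itself is untouched
toy |> filter(month == 1)
nrow(toy)

# With assignment the filtered rows are stored under a new name
jan <- toy |> filter(month == 1)
nrow(jan)
```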
</section>
<section id="common-mistakes" data-type="sect2">
<h2>
Common mistakes</h2>
<p>When you’re starting out with R, the easiest mistake to make is to use <code>=</code> instead of <code>==</code> when testing for equality. <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> will let you know when this happens:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt; 
  filter(month = 1)
#&gt; Error in `filter()`:
#&gt; ! We detected a named input.
#&gt; This usually means that you've used `=` instead of `==`.
#&gt; Did you mean `month == 1`?</pre>
</div>
<p>Another mistake is to write “or” statements like you would in English:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt; 
  filter(month == 1 | 2)</pre>
</div>
<p>This works, in the sense that it doesn’t throw an error, but it doesn’t do what you want. We’ll come back to what it does and why in <a href="#sec-boolean-operations" data-type="xref">#sec-boolean-operations</a>.</p>
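<p>As a brief preview, the sketch below evaluates the same expressions on a plain vector instead of a column. The vector <code>month</code> is made up for the example. Because <code>==</code> binds more tightly than <code>|</code>, the English-style version is read as <code>(month == 1) | 2</code>, and the number 2 is treated as <code>TRUE</code>.</p>

```r
# Hypothetical values, used only for illustration
month <- c(1, 2, 3)

# Parsed as (month == 1) | 2; the number 2 is coerced to TRUE,
# so every element passes the test
english_or <- month == 1 | 2

# What was probably intended
explicit_or <- month == 1 | month == 2
with_in     <- month %in% c(1, 2)
```

<p>Inside <code>filter()</code>, a condition that is true for every row keeps every row, which is why the mistake produces no error.</p>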
</section>
<section id="arrange" data-type="sect2">
<h2>
arrange()
</h2>
<p><code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> changes the order of the rows based on the values of the columns. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. For example, the following code sorts by the departure time, which is spread over four columns.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt; 
  arrange(year, month, day, dep_time)
#&gt; # A tibble: 336,776 × 19
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 517 515 2 830 819
#&gt; 2 2013 1 1 533 529 4 850 830
#&gt; 3 2013 1 1 542 540 2 923 850
#&gt; 4 2013 1 1 544 545 -1 1004 1022
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>You can use <code><a href="https://dplyr.tidyverse.org/reference/desc.html">desc()</a></code> to re-order by a column in descending order. For example, this code shows the most delayed flights:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt; 
  arrange(desc(dep_delay))
#&gt; # A tibble: 336,776 × 19
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 9 641 900 1301 1242 1530
#&gt; 2 2013 6 15 1432 1935 1137 1607 2120
#&gt; 3 2013 1 10 1121 1635 1126 1239 1810
#&gt; 4 2013 9 20 1139 1845 1014 1457 2210
#&gt; 5 2013 7 22 845 1600 1005 1044 1815
#&gt; 6 2013 4 10 1100 1900 960 1342 2211
#&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
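<p>Tie-breaking and descending order are easier to follow on a handful of rows. The sketch below uses a hypothetical four-row tibble, including one missing delay to show where <code>arrange()</code> places missing values.</p>

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  month     = c(2, 1, 1, 2),
  day       = c(1, 2, 1, 1),
  dep_delay = c(5, -3, 40, NA)
)

# month is compared first; day only breaks ties within a month
by_date <- toy |> arrange(month, day)

# Largest delay first; dplyr's arrange() places missing values last
# whether or not desc() is used
by_delay <- toy |> arrange(desc(dep_delay))
```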
<p>You can combine <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> to solve more complex problems. For example, we could look for the flights that were most delayed on arrival that left roughly on time:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt; 
  filter(dep_delay &lt;= 10 &amp; dep_delay &gt;= -10) |&gt; 
  arrange(desc(arr_delay))
#&gt; # A tibble: 239,109 × 19
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 11 1 658 700 -2 1329 1015
#&gt; 2 2013 4 18 558 600 -2 1149 850
#&gt; 3 2013 7 7 1659 1700 -1 2050 1823
#&gt; 4 2013 7 22 1606 1615 -9 2056 1831
#&gt; 5 2013 9 19 648 641 7 1035 810
#&gt; 6 2013 4 18 655 700 -5 1213 950
#&gt; # … with 239,103 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
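<p>The “roughly on time” condition can be written in more than one way. The sketch below, on a hypothetical vector of delays, compares the two-sided test used above with two alternatives: <code>abs()</code> from base R and dplyr’s <code>between()</code>, which includes both endpoints. Neither alternative appears in the code above; they are shown only as equivalent spellings.</p>

```r
library(dplyr)

# Hypothetical delays in minutes, including both boundary values
toy <- tibble(dep_delay = c(-15, -10, 0, 7, 10, 25))

with_and     <- toy |> filter(dep_delay <= 10 & dep_delay >= -10)
with_abs     <- toy |> filter(abs(dep_delay) <= 10)
with_between <- toy |> filter(between(dep_delay, -10, 10))
```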
</section>
<section id="distinct" data-type="sect2">
<h2>
distinct()
</h2>
<p><code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code> finds all the unique rows in a dataset, so in a technical sense, it primarily operates on the rows. Most of the time, however, you’ll want the distinct combination of some variables, so you can also optionally supply column names:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># This would remove any duplicate rows if there were any
flights |&gt;
  distinct()
#&gt; # A tibble: 336,776 × 19
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 517 515 2 830 819
#&gt; 2 2013 1 1 533 529 4 850 830
#&gt; 3 2013 1 1 542 540 2 923 850
#&gt; 4 2013 1 1 544 545 -1 1004 1022
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …
# This finds all unique origin and destination pairs.
flights |&gt;
  distinct(origin, dest)
#&gt; # A tibble: 224 × 2
#&gt; origin dest
#&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 EWR IAH
#&gt; 2 LGA IAH
#&gt; 3 JFK MIA
#&gt; 4 JFK BQN
#&gt; 5 LGA ATL
#&gt; 6 EWR ORD
#&gt; # … with 218 more rows</pre>
</div>
<p>Note that if you want to find the number of duplicates, or rows that weren’t duplicated, you’re better off swapping <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code> for <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> and then filtering as needed.</p>
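<p>One way this could look is sketched below on a hypothetical six-row tibble of origin and destination pairs. <code>count()</code> adds a column <code>n</code> holding the number of rows for each combination, which can then be filtered.</p>

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  origin = c("EWR", "EWR", "JFK", "LGA", "LGA", "LGA"),
  dest   = c("IAH", "IAH", "MIA", "ATL", "ATL", "ORD")
)

# distinct() returns each pair once but discards how often it occurred
pairs <- toy |> distinct(origin, dest)

# count() returns the same pairs plus a column n with the number of rows
pair_counts <- toy |> count(origin, dest)

# Pairs that occur more than once, and pairs that occur exactly once
duplicated_pairs <- pair_counts |> filter(n > 1)
unique_pairs     <- pair_counts |> filter(n == 1)
```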
</section>
<section id="data-transform-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>Find all flights that</p>
<ol type="a"><li>Had an arrival delay of two or more hours</li>
<li>Flew to Houston (<code>IAH</code> or <code>HOU</code>)</li>
<li>Were operated by United, American, or Delta</li>
<li>Departed in summer (July, August, and September)</li>
<li>Arrived more than two hours late, but didn’t leave late</li>
<li>Were delayed by at least an hour, but made up over 30 minutes in flight</li>
</ol></li>
<li><p>Sort <code>flights</code> to find the flights with longest departure delays. Find the flights that left earliest in the morning.</p></li>
<li><p>Sort <code>flights</code> to find the fastest flights (Hint: try sorting by a calculation).</p></li>
<li><p>Was there a flight on every day of 2013?</p></li>
<li><p>Which flights traveled the farthest distance? Which traveled the least distance?</p></li>
<li><p>Does it matter in what order you use <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.</p></li>
</ol></section>
</section>
<section id="columns" data-type="sect1">
<h1>
Columns</h1>
<p>There are four important verbs that affect the columns without changing the rows: <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code>. <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> creates new columns that are functions of the existing columns; <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> change which columns are present, their names, or their positions. We’ll also discuss <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code> since it allows you to get a column out of a data frame.</p>
<section id="sec-mutate" data-type="sect2">
<h2>
mutate()
</h2>
<p>The job of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> is to add new columns that are calculated from the existing columns. In the transform chapters, you’ll learn a large set of functions that you can use to manipulate different types of variables. For now, we’ll stick with basic algebra, which allows us to compute the <code>gain</code>, how much time a delayed flight made up in the air, and the <code>speed</code> in miles per hour:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt; 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60
  )
#&gt; # A tibble: 336,776 × 21
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 517 515 2 830 819
#&gt; 2 2013 1 1 533 529 4 850 830
#&gt; 3 2013 1 1 542 540 2 923 850
#&gt; 4 2013 1 1 544 545 -1 1004 1022
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 336,770 more rows, and 13 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
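<p>To make the units explicit, the sketch below repeats the calculation for a single flight, using the values of the first row of <code>flights</code> as printed earlier in this chapter. The tibble <code>first_flight</code> is constructed by hand for this purpose.</p>

```r
library(dplyr)

# Values copied from the first row of flights shown earlier
first_flight <- tibble(
  dep_delay = 2,    # minutes
  arr_delay = 11,   # minutes
  distance  = 1400, # miles
  air_time  = 227   # minutes
)

checked <- first_flight |>
  mutate(
    gain  = dep_delay - arr_delay,   # minutes made up in the air
    speed = distance / air_time * 60 # miles per minute times 60 = mph
  )
```

<p>This flight left 2 minutes late and arrived 11 minutes late, so its <code>gain</code> is -9: a negative value means the flight lost time in the air. Its <code>speed</code> works out to roughly 370 miles per hour.</p>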
<p>By default, <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> adds new columns on the right hand side of your dataset, which makes it difficult to see what’s happening here. We can use the <code>.before</code> argument to instead add the variables to the left hand side<span data-type="footnote">Remember that in RStudio, the easiest way to see a dataset with many columns is <code><a href="https://rdrr.io/r/utils/View.html">View()</a></code>.</span>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt; 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .before = 1
  )
#&gt; # A tibble: 336,776 × 21
#&gt; gain speed year month day dep_time sched_dep_time dep_delay arr_time
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 -9 370. 2013 1 1 517 515 2 830
#&gt; 2 -16 374. 2013 1 1 533 529 4 850
#&gt; 3 -31 408. 2013 1 1 542 540 2 923
#&gt; 4 17 517. 2013 1 1 544 545 -1 1004
#&gt; 5 19 394. 2013 1 1 554 600 -6 812
#&gt; 6 -16 288. 2013 1 1 554 558 -4 740
#&gt; # … with 336,770 more rows, and 12 more variables: sched_arr_time &lt;int&gt;,
#&gt; # arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, …</pre>
</div>
<p>The <code>.</code> is a sign that <code>.before</code> is an argument to the function, not the name of a new variable. You can also use <code>.after</code> to add after a variable, and in both <code>.before</code> and <code>.after</code> you can use the variable name instead of a position. For example, we could add the new variables after <code>day</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt; 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .after = day
  )
#&gt; # A tibble: 336,776 × 21
#&gt; year month day gain speed dep_time sched_dep_time dep_delay arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 2013 1 1 -9 370. 517 515 2 830
#&gt; 2 2013 1 1 -16 374. 533 529 4 850
#&gt; 3 2013 1 1 -31 408. 542 540 2 923
#&gt; 4 2013 1 1 17 517. 544 545 -1 1004
#&gt; 5 2013 1 1 19 394. 554 600 -6 812
#&gt; 6 2013 1 1 -16 288. 554 558 -4 740
#&gt; # … with 336,770 more rows, and 12 more variables: sched_arr_time &lt;int&gt;,
#&gt; # arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, …</pre>
</div>
<p>Alternatively, you can control which variables are kept with the <code>.keep</code> argument. A particularly useful value is <code>"used"</code>, which allows you to see the inputs and outputs from your calculations:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt; 
  mutate(
    gain = dep_delay - arr_delay,
    hours = air_time / 60,
    gain_per_hour = gain / hours,
    .keep = "used"
  )
#&gt; # A tibble: 336,776 × 6
#&gt; dep_delay arr_delay air_time gain hours gain_per_hour
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2 11 227 -9 3.78 -2.38
#&gt; 2 4 20 227 -16 3.78 -4.23
#&gt; 3 2 33 160 -31 2.67 -11.6
#&gt; 4 -1 -18 183 17 3.05 5.57
#&gt; 5 -6 -25 116 19 1.93 9.83
#&gt; 6 -4 12 150 -16 2.5 -6.4
#&gt; # … with 336,770 more rows</pre>
</div>
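<p><code>.keep</code> accepts other values besides <code>"used"</code>. The sketch below compares the four documented options on a hypothetical three-column tibble, so that the effect of each is visible from the column names alone.</p>

```r
library(dplyr)

# Hypothetical toy data: d is computed from a and b, while c is not used
toy <- tibble(a = 1:3, b = 4:6, c = 7:9)

keep_all    <- toy |> mutate(d = a + b, .keep = "all")    # the default
keep_used   <- toy |> mutate(d = a + b, .keep = "used")   # a, b, d
keep_unused <- toy |> mutate(d = a + b, .keep = "unused") # c, d
keep_none   <- toy |> mutate(d = a + b, .keep = "none")   # d

names(keep_used)
```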
</section>
<section id="sec-select" data-type="sect2">
<h2>
select()
</h2>
<p>It’s not uncommon to get datasets with hundreds or even thousands of variables. In this situation, the first challenge is often just focusing on the variables you’re interested in. <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> allows you to rapidly zoom in on a useful subset using operations based on the names of the variables. <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> is not terribly useful with the flights data because we only have 19 variables, but you can still get the general idea of how it works:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Select columns by name
flights |&gt; 
  select(year, month, day)
#&gt; # A tibble: 336,776 × 3
#&gt; year month day
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1
#&gt; 2 2013 1 1
#&gt; 3 2013 1 1
#&gt; 4 2013 1 1
#&gt; 5 2013 1 1
#&gt; 6 2013 1 1
#&gt; # … with 336,770 more rows
# Select all columns between year and day (inclusive)
flights |&gt;
  select(year:day)
#&gt; # A tibble: 336,776 × 3
#&gt; year month day
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1
#&gt; 2 2013 1 1
#&gt; 3 2013 1 1
#&gt; 4 2013 1 1
#&gt; 5 2013 1 1
#&gt; 6 2013 1 1
#&gt; # … with 336,770 more rows
# Select all columns except those from year to day (inclusive)
flights |&gt;
  select(!year:day)
#&gt; # A tibble: 336,776 × 16
#&gt; dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 517 515 2 830 819 11 UA
#&gt; 2 533 529 4 850 830 20 UA
#&gt; 3 542 540 2 923 850 33 AA
#&gt; 4 544 545 -1 1004 1022 -18 B6
#&gt; 5 554 600 -6 812 837 -25 DL
#&gt; 6 554 558 -4 740 728 12 UA
#&gt; # … with 336,770 more rows, and 9 more variables: flight &lt;int&gt;,
#&gt; # tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, …
# Select all columns that are characters
flights |&gt;
select(where(is.character))
#&gt; # A tibble: 336,776 × 4
#&gt; carrier tailnum origin dest
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 UA N14228 EWR IAH
#&gt; 2 UA N24211 LGA IAH
#&gt; 3 AA N619AA JFK MIA
#&gt; 4 B6 N804JB JFK BQN
#&gt; 5 DL N668DN LGA ATL
#&gt; 6 UA N39463 EWR ORD
#&gt; # … with 336,770 more rows</pre>
</div>
<p>There are a number of helper functions you can use within <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>:</p>
<ul><li>
<code>starts_with("abc")</code>: matches names that begin with “abc”.</li>
<li>
<code>ends_with("xyz")</code>: matches names that end with “xyz”.</li>
<li>
<code>contains("ijk")</code>: matches names that contain “ijk”.</li>
<li>
<code>num_range("x", 1:3)</code>: matches <code>x1</code>, <code>x2</code> and <code>x3</code>.</li>
</ul><p>See <code><a href="https://dplyr.tidyverse.org/reference/select.html">?select</a></code> for more details. Once you know regular expressions (the topic of <a href="#chp-regexps" data-type="xref">#chp-regexps</a>) you’ll also be able to use <code><a href="https://tidyselect.r-lib.org/reference/starts_with.html">matches()</a></code> to select variables that match a pattern.</p>
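<p>As a rough illustration of how these helpers behave on the <code>flights</code> data, here is a minimal sketch. The prefixes and suffixes are our own choices, picked only because they happen to match existing column names, and the object names <code>dep_cols</code> and <code>delay_cols</code> are hypothetical:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(nycflights13)
library(dplyr)

# Columns whose names begin with "dep": dep_time and dep_delay
dep_cols &lt;- flights |&gt;
  select(starts_with("dep"))

# Columns whose names end with "delay": dep_delay and arr_delay
delay_cols &lt;- flights |&gt;
  select(ends_with("delay"))

# Helpers can be mixed with bare column names in the same call
flights |&gt;
  select(carrier, contains("delay"))</pre>
</div>
<p>Helpers return columns in the order they appear in the data frame, so <code>dep_delay</code> comes before <code>arr_delay</code> in the second result.</p>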
<p>You can rename variables as you <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> them by using <code>=</code>. The new name appears on the left-hand side of the <code>=</code>, and the old variable appears on the right-hand side:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
select(tail_num = tailnum)
#&gt; # A tibble: 336,776 × 1
#&gt; tail_num
#&gt; &lt;chr&gt;
#&gt; 1 N14228
#&gt; 2 N24211
#&gt; 3 N619AA
#&gt; 4 N804JB
#&gt; 5 N668DN
#&gt; 6 N39463
#&gt; # … with 336,770 more rows</pre>
</div>
</section>
<section id="rename" data-type="sect2">
<h2>
rename()
</h2>
<p>If you want to keep all the existing variables and just rename a few, you can use <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code> instead of <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
rename(tail_num = tailnum)
#&gt; # A tibble: 336,776 × 19
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 517 515 2 830 819
#&gt; 2 2013 1 1 533 529 4 850 830
#&gt; 3 2013 1 1 542 540 2 923 850
#&gt; 4 2013 1 1 544 545 -1 1004 1022
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tail_num &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>It works exactly the same way as <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, but keeps all the variables that aren’t explicitly selected.</p>
<p>If you have a bunch of inconsistently named columns and it would be painful to fix them all by hand, check out <code><a href="https://rdrr.io/pkg/janitor/man/clean_names.html">janitor::clean_names()</a></code>, which provides some useful automated cleaning.</p>
</section>
<section id="relocate" data-type="sect2">
<h2>
relocate()
</h2>
<p>Use <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> to move variables around. You might want to collect related variables together or move important variables to the front. By default <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> moves variables to the front:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
relocate(time_hour, air_time)
#&gt; # A tibble: 336,776 × 19
#&gt; time_hour air_time year month day dep_time sched_dep_time
#&gt; &lt;dttm&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013-01-01 05:00:00 227 2013 1 1 517 515
#&gt; 2 2013-01-01 05:00:00 227 2013 1 1 533 529
#&gt; 3 2013-01-01 05:00:00 160 2013 1 1 542 540
#&gt; 4 2013-01-01 05:00:00 183 2013 1 1 544 545
#&gt; 5 2013-01-01 06:00:00 116 2013 1 1 554 600
#&gt; 6 2013-01-01 05:00:00 150 2013 1 1 554 558
#&gt; # … with 336,770 more rows, and 12 more variables: dep_delay &lt;dbl&gt;,
#&gt; # arr_time &lt;int&gt;, sched_arr_time &lt;int&gt;, arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, …</pre>
</div>
<p>But you can use the same <code>.before</code> and <code>.after</code> arguments as <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> to choose where to put them:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
relocate(year:dep_time, .after = time_hour)
#&gt; # A tibble: 336,776 × 19
#&gt; sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
#&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt; &lt;int&gt;
#&gt; 1 515 2 830 819 11 UA 1545
#&gt; 2 529 4 850 830 20 UA 1714
#&gt; 3 540 2 923 850 33 AA 1141
#&gt; 4 545 -1 1004 1022 -18 B6 725
#&gt; 5 600 -6 812 837 -25 DL 461
#&gt; 6 558 -4 740 728 12 UA 1696
#&gt; # … with 336,770 more rows, and 12 more variables: tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, …
flights |&gt;
relocate(starts_with("arr"), .before = dep_time)
#&gt; # A tibble: 336,776 × 19
#&gt; year month day arr_time arr_delay dep_time sched_dep_time dep_delay
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 2013 1 1 830 11 517 515 2
#&gt; 2 2013 1 1 850 20 533 529 4
#&gt; 3 2013 1 1 923 33 542 540 2
#&gt; 4 2013 1 1 1004 -18 544 545 -1
#&gt; 5 2013 1 1 812 -25 554 600 -6
#&gt; 6 2013 1 1 740 12 554 558 -4
#&gt; # … with 336,770 more rows, and 11 more variables: sched_arr_time &lt;int&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
</section>
<section id="data-transform-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<div class="cell">
</div>
<ol type="1"><li><p>Compare <code>air_time</code> with <code>arr_time - dep_time</code>. What do you expect to see? What do you see? What do you need to do to fix it?</p></li>
<li><p>Compare <code>dep_time</code>, <code>sched_dep_time</code>, and <code>dep_delay</code>. How would you expect those three numbers to be related?</p></li>
<li><p>Brainstorm as many ways as possible to select <code>dep_time</code>, <code>dep_delay</code>, <code>arr_time</code>, and <code>arr_delay</code> from <code>flights</code>.</p></li>
<li><p>What happens if you include the name of a variable multiple times in a <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> call?</p></li>
<li>
<p>What does the <code><a href="https://tidyselect.r-lib.org/reference/all_of.html">any_of()</a></code> function do? Why might it be helpful in conjunction with this vector?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">variables &lt;- c("year", "month", "day", "dep_delay", "arr_delay")</pre>
</div>
</li>
<li>
<p>Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">select(flights, contains("TIME"))</pre>
</div>
</li>
</ol></section>
</section>
<section id="groups" data-type="sect1">
<h1>
Groups</h1>
<p>So far you’ve learned about functions that work with rows and columns. dplyr gets even more powerful when you add in the ability to work with groups. In this section, we’ll focus on the most important functions: <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>, and the slice family of functions.</p>
<section id="group_by" data-type="sect2">
<h2>
group_by()
</h2>
<p>Use <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> to divide your dataset into groups meaningful for your analysis:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
group_by(month)
#&gt; # A tibble: 336,776 × 19
#&gt; # Groups: month [12]
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 517 515 2 830 819
#&gt; 2 2013 1 1 533 529 4 850 830
#&gt; 3 2013 1 1 542 540 2 923 850
#&gt; 4 2013 1 1 544 545 -1 1004 1022
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p><code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> doesn’t change the data but, if you look closely at the output, you’ll notice that it’s now “grouped by” month. This means subsequent operations will now work “by month”. <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> doesn’t do anything by itself; instead it changes the behavior of the subsequent verbs.</p>
</section>
<section id="sec-summarize" data-type="sect2">
<h2>
summarize()
</h2>
<p>The most important grouped operation is a summary, which collapses each group to a single row. In dplyr, this operation is performed by <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code><span data-type="footnote">Or <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>, if you prefer British English.</span>, as shown by the following example, which computes the average departure delay by month:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
group_by(month) |&gt;
summarize(
delay = mean(dep_delay)
)
#&gt; # A tibble: 12 × 2
#&gt; month delay
#&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 1 NA
#&gt; 2 2 NA
#&gt; 3 3 NA
#&gt; 4 4 NA
#&gt; 5 5 NA
#&gt; 6 6 NA
#&gt; # … with 6 more rows</pre>
</div>
<p>Uh-oh! Something has gone wrong and all of our results are <code>NA</code> (pronounced “N-A”), R’s symbol for missing value. We’ll come back to discuss missing values in <a href="#chp-missing-values" data-type="xref">#chp-missing-values</a>, but for now we’ll remove them by using <code>na.rm = TRUE</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
group_by(month) |&gt;
summarize(
delay = mean(dep_delay, na.rm = TRUE)
)
#&gt; # A tibble: 12 × 2
#&gt; month delay
#&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 1 10.0
#&gt; 2 2 10.8
#&gt; 3 3 13.2
#&gt; 4 4 13.9
#&gt; 5 5 13.0
#&gt; 6 6 20.8
#&gt; # … with 6 more rows</pre>
</div>
<p>You can create any number of summaries in a single call to <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>. You’ll learn various useful summaries in the upcoming chapters, but one very useful summary is <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code>, which returns the number of rows in each group:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
group_by(month) |&gt;
summarize(
delay = mean(dep_delay, na.rm = TRUE),
n = n()
)
#&gt; # A tibble: 12 × 3
#&gt; month delay n
#&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 1 10.0 27004
#&gt; 2 2 10.8 24951
#&gt; 3 3 13.2 28834
#&gt; 4 4 13.9 28330
#&gt; 5 5 13.0 28796
#&gt; 6 6 20.8 28243
#&gt; # … with 6 more rows</pre>
</div>
<p>Means and counts can get you a surprisingly long way in data science!</p>
</section>
<section id="the-slice_-functions" data-type="sect2">
<h2>
The slice_ functions</h2>
<p>There are five handy functions that allow you to pick off specific rows within each group:</p>
<ul><li>
<code>df |&gt; slice_head(n = 1)</code> takes the first row from each group.</li>
<li>
<code>df |&gt; slice_tail(n = 1)</code> takes the last row in each group.</li>
<li>
<code>df |&gt; slice_min(x, n = 1)</code> takes the row with the smallest value of <code>x</code>.</li>
<li>
<code>df |&gt; slice_max(x, n = 1)</code> takes the row with the largest value of <code>x</code>.</li>
<li>
<code>df |&gt; slice_sample(n = 1)</code> takes one random row.</li>
</ul><p>You can vary <code>n</code> to select more than one row, or instead of <code>n =</code>, you can use <code>prop = 0.1</code> to select (e.g.) 10% of the rows in each group. For example, the following code finds the most delayed flight to each destination:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
group_by(dest) |&gt;
slice_max(arr_delay, n = 1)
#&gt; # A tibble: 108 × 19
#&gt; # Groups: dest [105]
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 7 22 2145 2007 98 132 2259
#&gt; 2 2013 7 23 1139 800 219 1250 909
#&gt; 3 2013 1 25 123 2000 323 229 2101
#&gt; 4 2013 8 17 1740 1625 75 2042 2003
#&gt; 5 2013 7 22 2257 759 898 121 1026
#&gt; 6 2013 7 10 2056 1505 351 2347 1758
#&gt; # … with 102 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>This is similar to computing the max delay with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>, but you get the whole row instead of the single summary:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
group_by(dest) |&gt;
summarize(max_delay = max(arr_delay, na.rm = TRUE))
#&gt; Warning: There was 1 warning in `summarize()`.
#&gt; In argument: `max_delay = max(arr_delay, na.rm = TRUE)`.
#&gt; In group 52: `dest = "LGA"`.
#&gt; Caused by warning in `max()`:
#&gt; ! no non-missing arguments to max; returning -Inf
#&gt; # A tibble: 105 × 2
#&gt; dest max_delay
#&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 ABQ 153
#&gt; 2 ACK 221
#&gt; 3 ALB 328
#&gt; 4 ANC 39
#&gt; 5 ATL 895
#&gt; 6 AUS 349
#&gt; # … with 99 more rows</pre>
</div>
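<p>Returning to the <code>slice_max()</code> result earlier in this section: it has 108 rows even though there are only 105 destinations. That is because <code>slice_min()</code> and <code>slice_max()</code> keep tied values by default, so <code>n = 1</code> really means “all rows that share the most extreme value”. The sketch below uses a small made-up tibble (<code>toy</code>, with hypothetical values) to show the difference that <code>with_ties = FALSE</code> makes:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical toy data: destination "A" has two flights tied for the largest delay
toy &lt;- tibble(
  dest = c("A", "A", "A", "B", "B"),
  arr_delay = c(30, 30, 5, 12, 7)
)

# Default (with_ties = TRUE): both tied rows for "A" are kept, giving 3 rows
tied &lt;- toy |&gt;
  group_by(dest) |&gt;
  slice_max(arr_delay, n = 1)

# with_ties = FALSE: at most one row per group, giving 2 rows
one_each &lt;- toy |&gt;
  group_by(dest) |&gt;
  slice_max(arr_delay, n = 1, with_ties = FALSE)</pre>
</div>
<p>If you need exactly one row per group, for example before joining to another table, setting <code>with_ties = FALSE</code> is probably the safer choice.</p>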
</section>
<section id="grouping-by-multiple-variables" data-type="sect2">
<h2>
Grouping by multiple variables</h2>
<p>You can create groups using more than one variable. For example, we could make a group for each day:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">daily &lt;- flights |&gt;
group_by(year, month, day)
daily
#&gt; # A tibble: 336,776 × 19
#&gt; # Groups: year, month, day [365]
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 517 515 2 830 819
#&gt; 2 2013 1 1 533 529 4 850 830
#&gt; 3 2013 1 1 542 540 2 923 850
#&gt; 4 2013 1 1 544 545 -1 1004 1022
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>When you summarize a tibble grouped by more than one variable, each summary peels off the last group. In hindsight, this wasn’t a great way to make this function work, but it’s difficult to change without breaking existing code. To make it obvious what’s happening, dplyr displays a message that tells you how you can change this behavior:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">daily_flights &lt;- daily |&gt;
summarize(
n = n()
)
#&gt; `summarise()` has grouped output by 'year', 'month'. You can override using
#&gt; the `.groups` argument.</pre>
</div>
<p>If you’re happy with this behavior, you can explicitly request it in order to suppress the message:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">daily_flights &lt;- daily |&gt;
summarize(
n = n(),
.groups = "drop_last"
)</pre>
</div>
<p>Alternatively, change the default behavior by setting a different value, e.g. <code>"drop"</code> to drop all grouping or <code>"keep"</code> to preserve the same groups.</p>
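<p>One way to see what each <code>.groups</code> value leaves behind is to inspect the grouping of the result. The sketch below uses <code>group_vars()</code>, a dplyr function not otherwise covered in this chapter, which returns the names of the grouping variables. The object names <code>g_last</code>, <code>g_drop</code>, and <code>g_keep</code> are our own:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(nycflights13)
library(dplyr)

daily &lt;- flights |&gt;
  group_by(year, month, day)

# "drop_last": the result is still grouped by year and month
g_last &lt;- daily |&gt;
  summarize(n = n(), .groups = "drop_last") |&gt;
  group_vars()

# "drop": the result has no grouping variables at all
g_drop &lt;- daily |&gt;
  summarize(n = n(), .groups = "drop") |&gt;
  group_vars()

# "keep": the result is still grouped by year, month, and day
g_keep &lt;- daily |&gt;
  summarize(n = n(), .groups = "keep") |&gt;
  group_vars()</pre>
</div>
<p>In all three cases the summary itself has the same rows and values; only the grouping attached to the result differs, which matters for whatever verb comes next in the pipeline.</p>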
</section>
<section id="ungrouping" data-type="sect2">
<h2>
Ungrouping</h2>
<p>You might also want to remove grouping outside of <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>. You can do this with <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">ungroup()</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">daily |&gt;
ungroup() |&gt;
summarize(
delay = mean(dep_delay, na.rm = TRUE),
flights = n()
)
#&gt; # A tibble: 1 × 2
#&gt; delay flights
#&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 12.6 336776</pre>
</div>
<p>As you can see, when you summarize an ungrouped data frame, you get a single row back because dplyr treats all the rows in an ungrouped data frame as belonging to one group.</p>
</section>
<section id="data-transform-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about <code>flights |&gt; group_by(carrier, dest) |&gt; summarize(n())</code>)</p></li>
<li><p>Find the most delayed flight to each destination.</p></li>
<li><p>How do delays vary over the course of the day? Illustrate your answer with a plot.</p></li>
<li><p>What happens if you supply a negative <code>n</code> to <code><a href="https://dplyr.tidyverse.org/reference/slice.html">slice_min()</a></code> and friends?</p></li>
<li><p>Explain what <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> does in terms of the dplyr verbs you just learned. What does the <code>sort</code> argument to <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> do?</p></li>
<li>
<p>Suppose we have the following tiny data frame:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(
x = 1:5,
y = c("a", "b", "a", "a", "b"),
z = c("K", "K", "L", "L", "K")
)</pre>
</div>
<ol type="a"><li>
<p>What does the following code do? Run it, analyze the result, and describe what <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> does.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt;
group_by(y)</pre>
</div>
</li>
<li>
<p>What does the following code do? Run it, analyze the result, and describe what <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> does. Also comment on how it’s different from the <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> in part (a).</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt;
arrange(y)</pre>
</div>
</li>
<li>
<p>What does the following code do? Run it, analyze the result, and describe what the pipeline does.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt;
group_by(y) |&gt;
summarize(mean_x = mean(x))</pre>
</div>
</li>
<li>
<p>What does the following code do? Run it, analyze the result, and describe what the pipeline does. Then, comment on what the message says.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt;
group_by(y, z) |&gt;
summarize(mean_x = mean(x))</pre>
</div>
</li>
<li>
<p>What does the following code do? Run it, analyze the result, and describe what the pipeline does. How is the output different from the one in part (d)?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt;
group_by(y, z) |&gt;
summarize(mean_x = mean(x), .groups = "drop")</pre>
</div>
</li>
<li>
<p>What do the following pipelines do? Run both, analyze the results, and describe what each pipeline does. How are the outputs of the two pipelines different?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt;
group_by(y, z) |&gt;
summarize(mean_x = mean(x))
df |&gt;
group_by(y, z) |&gt;
mutate(mean_x = mean(x))</pre>
</div>
</li>
</ol></li>
</ol></section>
</section>
<section id="sec-sample-size" data-type="sect1">
<h1>
Case study: aggregates and sample size</h1>
<p>Whenever you do any aggregation, it’s always a good idea to include a count (<code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code>). That way, you can ensure that you’re not drawing conclusions based on very small amounts of data. For example, let’s look at the planes (identified by their tail number) that have the highest average delays:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">delays &lt;- flights |&gt;
filter(!is.na(arr_delay), !is.na(tailnum)) |&gt;
group_by(tailnum) |&gt;
summarize(
delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
ggplot(delays, aes(x = delay)) +
geom_freqpoly(binwidth = 10)</pre>
<div class="cell-output-display">
<p><img src="data-transform_files/figure-html/unnamed-chunk-45-1.png" class="img-fluid" alt="A frequency histogram showing the distribution of flight delays. The distribution is unimodal, with a large spike around 0, and asymmetric: very few flights leave more than 30 minutes early, but flights are delayed up to 5 hours." width="576"/></p>
</div>
</div>
<p>Wow, there are some planes that have an <em>average</em> delay of 5 hours (300 minutes)! That seems pretty surprising, so let’s draw a scatterplot of number of flights vs. average delay:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(delays, aes(x = n, y = delay)) +
geom_point(alpha = 1/10)</pre>
<div class="cell-output-display">
<p><img src="data-transform_files/figure-html/unnamed-chunk-46-1.png" class="img-fluid" alt="A scatterplot showing number of flights versus average delay. Delays for planes with very small number of flights have very high variability (from -50 to ~300), but the variability rapidly decreases as the number of flights increases." width="576"/></p>
</div>
</div>
<p>Not surprisingly, there is much greater variation in the average delay when there are few flights for a given plane. The shape of this plot is very characteristic: whenever you plot a mean (or other summary) vs. group size, you’ll see that the variation decreases as the sample size increases<span data-type="footnote">*cough* the central limit theorem *cough*.</span>.</p>
<p>When looking at this sort of plot, it’s often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">delays |&gt;
filter(n &gt; 25) |&gt;
ggplot(aes(x = n, y = delay)) +
geom_point(alpha = 1/10) +
geom_smooth(se = FALSE)</pre>
<div class="cell-output-display">
<p><img src="data-transform_files/figure-html/unnamed-chunk-47-1.png" class="img-fluid" alt="Now that the y-axis (average delay) is smaller (-20 to 60 minutes), we can see a more complicated story. The smooth line suggests an initial decrease in average delay from 10 minutes to 0 minutes as number of flights per plane increases from 25 to 100. This is followed by a gradual increase up to 10 minutes for 250 flights, then a gradual decrease to ~5 minutes at 500 flights." width="576"/></p>
</div>
</div>
<p>Note the handy pattern for combining ggplot2 and dplyr. It’s a bit annoying that you have to switch from <code>|&gt;</code> to <code>+</code>, but it’s not too much of a hassle once you get the hang of it.</p>
<p>There’s another common variation on this pattern that we can see in some data about baseball players. The following code uses data from the <strong>Lahman</strong> package to compare what proportion of times a player hits the ball vs. the number of attempts they take:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">batters &lt;- Lahman::Batting |&gt;
group_by(playerID) |&gt;
summarize(
perf = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
n = sum(AB, na.rm = TRUE)
)
batters
#&gt; # A tibble: 20,166 × 3
#&gt; playerID perf n
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 aardsda01 0 4
#&gt; 2 aaronha01 0.305 12364
#&gt; 3 aaronto01 0.229 944
#&gt; 4 aasedo01 0 5
#&gt; 5 abadan01 0.0952 21
#&gt; 6 abadfe01 0.111 9
#&gt; # … with 20,160 more rows</pre>
</div>
<p>When we plot the skill of the batter (measured by the batting average, <code>perf</code>) against the number of opportunities to hit the ball (measured by times at bat, <code>n</code>), we see two patterns:</p>
<ol type="1"><li><p>As above, the variation in our aggregate decreases as we get more data points.</p></li>
<li><p>There’s a positive correlation between skill (<code>perf</code>) and opportunities to hit the ball (<code>n</code>) because teams want to give their best batters the most opportunities to hit the ball.</p></li>
</ol><div class="cell">
<pre data-type="programlisting" data-code-language="r">batters |&gt;
filter(n &gt; 100) |&gt;
ggplot(aes(x = n, y = perf)) +
geom_point(alpha = 1 / 10) +
geom_smooth(se = FALSE)</pre>
<div class="cell-output-display">
<p><img src="data-transform_files/figure-html/unnamed-chunk-49-1.png" class="img-fluid" alt="A scatterplot of number of batting opportunities vs. batting performance overlaid with a smoothed line. Average performance increases sharply from 0.2 when n is 1 to 0.25 when n is ~1000. Average performance continues to increase linearly at a much shallower slope reaching ~0.3 when n is ~15,000." width="576"/></p>
</div>
</div>
<p>This also has important implications for ranking. If you naively sort on <code>desc(perf)</code>, the people with the best batting averages are clearly lucky, not skilled:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">batters |&gt;
arrange(desc(perf))
#&gt; # A tibble: 20,166 × 3
#&gt; playerID perf n
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 abramge01 1 1
#&gt; 2 alberan01 1 1
#&gt; 3 banisje01 1 1
#&gt; 4 bartocl01 1 1
#&gt; 5 bassdo01 1 1
#&gt; 6 birasst01 1 2
#&gt; # … with 20,160 more rows</pre>
</div>
<p>You can find a good explanation of this problem and how to overcome it at <a href="http://varianceexplained.org/r/empirical_bayes_baseball/" class="uri">http://varianceexplained.org/r/empirical_bayes_baseball/</a> and <a href="https://www.evanmiller.org/how-not-to-sort-by-average-rating.html" class="uri">https://www.evanmiller.org/how-not-to-sort-by-average-rating.html</a>.</p>
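<p>Short of the approaches described at those links, a crude stopgap is to require a minimum number of at-bats before ranking. The sketch below is only that: the threshold of 100 is an arbitrary assumption carried over from the plot above, the name <code>ranked</code> is our own, and this is not the method the linked articles recommend:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Recompute the per-player summary so this block stands on its own
batters &lt;- Lahman::Batting |&gt;
  group_by(playerID) |&gt;
  summarize(
    perf = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
    n = sum(AB, na.rm = TRUE)
  )

# Drop players with few at-bats (arbitrary cutoff), then rank by average
ranked &lt;- batters |&gt;
  filter(n &gt; 100) |&gt;
  arrange(desc(perf))</pre>
</div>
<p>A hard cutoff discards data and the ranking near the threshold is still noisy, which is part of why the linked articles favor shrinking each estimate toward an overall average instead.</p>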
</section>
<section id="data-transform-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you’ve learned the tools that dplyr provides for working with data frames. The tools are roughly grouped into three categories: those that manipulate the rows (like <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>), those that manipulate the columns (like <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>), and those that manipulate groups (like <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>). In this chapter, we’ve focused on these “whole data frame” tools, but you haven’t yet learned much about what you can do with the individual variable. We’ll come back to that in the Transform part of the book, where each chapter will give you tools for a specific type of variable.</p>
<p>For now, we’ll pivot back to workflow, and in the next chapter you’ll learn more about the pipe, <code>|&gt;</code>, why we recommend it, and a little of the history that led from magrittr’s <code>%&gt;%</code> to base R’s <code>|&gt;</code>.</p>
</section>
</section>