More work on O'Reilly book

* Make width narrower
* Convert deps to table
* Strip chapter status
Hadley Wickham 2022-11-18 11:05:00 -06:00
parent 5895db09cd
commit 69b4597f3b
33 changed files with 784 additions and 1048 deletions

@ -47,6 +47,17 @@ devtools::install_github("hadley/r4ds")
knitr::include_graphics("screenshots/rstudio-wg.png")
```
### O'Reilly
To generate the book for O'Reilly, build the book, then:
```{r}
devtools::load_all("../minibook/"); process_book()
html <- list.files("oreilly", pattern = "[.]html$", full.names = TRUE)
file.copy(html, "../r-for-data-science-2e/", overwrite = TRUE)
```
## Code of Conduct
Please note that r4ds uses a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html).

@ -17,7 +17,7 @@ options(
# Activate crayon output - temporarily disabled for quarto
# crayon.enabled = TRUE,
pillar.bold = TRUE,
width = 80
width = 77 # 80 - 3 for #> comment
)
ggplot2::theme_set(ggplot2::theme_gray(12))
@ -39,7 +39,7 @@ status <- function(type) {
)
cat(paste0(
"::: callout-", class, "\n",
"::: status callout-", class, "\n",
"You are reading the work-in-progress second edition of R for Data Science. ",
"This chapter ", status, ". ",
"You can find the complete first edition at <https://r4ds.had.co.nz>.\n",

@ -340,6 +340,22 @@ The book is powered by [Quarto](https://quarto.org) which makes it easy to write
This book was built with:
```{r}
sessioninfo::session_info(c("tidyverse"))
#| echo: false
#| results: asis
pkgs <- sessioninfo::package_info(
tidyverse:::tidyverse_packages(),
dependencies = FALSE
)
df <- tibble(
package = pkgs$package,
version = pkgs$ondiskversion,
source = gsub("@", "\\\\@", pkgs$source)
)
knitr::kable(df, format = "markdown")
```
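
The `gsub()` call in the dependency table chunk escapes every `@` in the package source. A plausible reason (an assumption, not something the commit states) is that an unescaped `@` in a source such as a GitHub ref could be read by Pandoc as a citation key once the table is emitted as markdown. A minimal sketch of what the replacement does, using a made-up source string:

```r
# Hypothetical source string; real values come from sessioninfo::package_info().
src <- "Github (tidyverse/dplyr@abc1234)"

# The R literal "\\\\@" holds three characters: backslash, backslash, @.
# gsub() reads the doubled backslash in a replacement as one literal backslash,
# so each "@" becomes a backslash followed by "@".
escaped <- gsub("@", "\\\\@", src)

# writeLines() shows the raw characters rather than R's escaped print form.
writeLines(escaped)
```

The result is exactly one character longer than the input, because a single backslash is inserted before the one `@`.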
```{r}
cli:::ruler()
```

@ -1,13 +1,5 @@
<section data-type="chapter" id="chp-EDA">
<h1><span id="sec-exploratory-data-analysis" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Exploratory data analysis</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<h1><span id="sec-exploratory-data-analysis" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Exploratory data analysis</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>

@ -1,13 +1,5 @@
<section data-type="chapter" id="chp-base-R">
<h1><span id="sec-base-r" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">A field guide to base R</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<p>To finish off the programming section, we're going to give you a quick tour of the most important base R functions that we don't otherwise discuss in the book. These tools are particularly useful as you do more programming and will help you read code that you'll encounter in the wild.</p><p>This is a good place to remind you that the tidyverse is not the only way to solve data science problems. We teach the tidyverse in this book because tidyverse packages share a common design philosophy, which increases the consistency across functions, making each new function or package a little easier to learn and use. It's not possible to use the tidyverse without using base R, so we've actually already taught you a <strong>lot</strong> of base R functions: from <code><a href="https://rdrr.io/r/base/library.html">library()</a></code> to load packages, to <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code> and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> for numeric summaries, to the factor, date, and POSIXct data types, and of course all the basic operators like <code>+</code>, <code>-</code>, <code>/</code>, <code>*</code>, <code>|</code>, <code>&amp;</code>, and <code>!</code>. What we haven't focused on so far is base R workflows, so we will highlight a few of those in this chapter.</p><p>After you read this book you'll learn other approaches to the same problems using base R, data.table, and other packages. You'll certainly encounter these other approaches when you start reading R code written by other people, particularly if you're using StackOverflow. It's 100% okay to write code that uses a mix of approaches, and don't let anyone tell you otherwise!</p><p>In this chapter, we'll focus on four big topics: subsetting with <code>[</code>, subsetting with <code>[[</code> and <code>$</code>, the apply family of functions, and for loops. To finish off, we'll briefly discuss two important plotting functions.</p>
<h1><span id="sec-base-r" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">A field guide to base R</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p><p>To finish off the programming section, we're going to give you a quick tour of the most important base R functions that we don't otherwise discuss in the book. These tools are particularly useful as you do more programming and will help you read code that you'll encounter in the wild.</p><p>This is a good place to remind you that the tidyverse is not the only way to solve data science problems. We teach the tidyverse in this book because tidyverse packages share a common design philosophy, which increases the consistency across functions, making each new function or package a little easier to learn and use. It's not possible to use the tidyverse without using base R, so we've actually already taught you a <strong>lot</strong> of base R functions: from <code><a href="https://rdrr.io/r/base/library.html">library()</a></code> to load packages, to <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code> and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> for numeric summaries, to the factor, date, and POSIXct data types, and of course all the basic operators like <code>+</code>, <code>-</code>, <code>/</code>, <code>*</code>, <code>|</code>, <code>&amp;</code>, and <code>!</code>. What we haven't focused on so far is base R workflows, so we will highlight a few of those in this chapter.</p><p>After you read this book you'll learn other approaches to the same problems using base R, data.table, and other packages. You'll certainly encounter these other approaches when you start reading R code written by other people, particularly if you're using StackOverflow. It's 100% okay to write code that uses a mix of approaches, and don't let anyone tell you otherwise!</p><p>In this chapter, we'll focus on four big topics: subsetting with <code>[</code>, subsetting with <code>[[</code> and <code>$</code>, the apply family of functions, and for loops. To finish off, we'll briefly discuss two important plotting functions.</p>
<section id="prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>

@ -1,13 +1,5 @@
<section data-type="chapter" id="chp-communicate-plots">
<h1><span id="sec-graphics-communication" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Graphics for communication</span></span></h1><div data-type="important"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter is currently a dumping ground for ideas, and we don't recommend reading it. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<h1><span id="sec-graphics-communication" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Graphics for communication</span></span></h1><p>::: status callout-important You are reading the work-in-progress second edition of R for Data Science. This chapter is currently a dumping ground for ideas, and we don't recommend reading it. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>

@ -1,13 +1,5 @@
<section data-type="chapter" id="chp-data-import">
<h1><span id="sec-data-import" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data import</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<h1><span id="sec-data-import" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data import</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
@ -83,7 +75,7 @@ Reading data from a file</h1>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">students &lt;- read_csv("data/students.csv")
#&gt; Rows: 6 Columns: 5
#&gt; ── Column specification ────────────────────────────────────────────────────────
#&gt; ── Column specification ─────────────────────────────────────────────────────
#&gt; Delimiter: ","
#&gt; chr (4): Full Name, favourite.food, mealPlan, AGE
#&gt; dbl (1): Student ID
@ -324,7 +316,7 @@ Guessing types</h2>
T,Inf,2021-02-16,ghi"
)
#&gt; Rows: 3 Columns: 4
#&gt; ── Column specification ────────────────────────────────────────────────────────
#&gt; ── Column specification ─────────────────────────────────────────────────────
#&gt; Delimiter: ","
#&gt; chr (1): string
#&gt; dbl (1): numeric
@ -360,7 +352,7 @@ Missing values, column types, and problems</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- read_csv(csv)
#&gt; Rows: 4 Columns: 1
#&gt; ── Column specification ────────────────────────────────────────────────────────
#&gt; ── Column specification ─────────────────────────────────────────────────────
#&gt; Delimiter: ","
#&gt; chr (1): x
#&gt;
@ -370,8 +362,8 @@ Missing values, column types, and problems</h2>
<p>In this very small case, you can easily see the missing value <code>.</code>. But what happens if you have thousands of rows with only a few missing values represented by <code>.</code>s speckled amongst them? One approach is to tell readr that <code>x</code> is a numeric column, and then see where it fails. You can do that with the <code>col_types</code> argument, which takes a named list:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- read_csv(csv, col_types = list(x = col_double()))
#&gt; Warning: One or more parsing issues, call `problems()` on your data frame for details,
#&gt; e.g.:
#&gt; Warning: One or more parsing issues, call `problems()` on your data frame for
#&gt; details, e.g.:
#&gt; dat &lt;- vroom(...)
#&gt; problems(dat)</pre>
</div>
@ -381,13 +373,13 @@ Missing values, column types, and problems</h2>
#&gt; # A tibble: 1 × 5
#&gt; row col expected actual file
#&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 3 1 a double . /private/tmp/Rtmp43JYhG/file7cf337a06034</pre>
#&gt; 1 3 1 a double . /private/tmp/Rtmpc2nAIe/file8f2f488fc2f4</pre>
</div>
<p>This tells us that there was a problem in row 3, col 1 where readr expected a double but got a <code>.</code>. That suggests this dataset uses <code>.</code> for missing values. So when we set <code>na = "."</code>, the automatic guessing succeeds, giving us the numeric column that we want:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- read_csv(csv, na = ".")
#&gt; Rows: 4 Columns: 1
#&gt; ── Column specification ────────────────────────────────────────────────────────
#&gt; ── Column specification ─────────────────────────────────────────────────────
#&gt; Delimiter: ","
#&gt; dbl (1): x
#&gt;
@ -447,7 +439,7 @@ Reading data from multiple files</h1>
<pre data-type="programlisting" data-code-language="downlit">sales_files &lt;- c("data/01-sales.csv", "data/02-sales.csv", "data/03-sales.csv")
read_csv(sales_files, id = "file")
#&gt; Rows: 19 Columns: 6
#&gt; ── Column specification ────────────────────────────────────────────────────────
#&gt; ── Column specification ─────────────────────────────────────────────────────
#&gt; Delimiter: ","
#&gt; chr (1): month
#&gt; dbl (4): year, brand, item, n

@ -1,13 +1,5 @@
<section data-type="chapter" id="chp-data-tidy">
<h1><span id="sec-data-tidy" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data tidying</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter is largely complete and just needs final proof reading. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<h1><span id="sec-data-tidy" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data tidying</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter is largely complete and just needs final proof reading. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
@ -174,21 +166,21 @@ Data in column names</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">billboard
#&gt; # A tibble: 317 × 79
#&gt; artist track date.ent…¹ wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8 wk9
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;date&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA NA
#&gt; 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA NA
#&gt; 3 3 Door… Kryp… 2000-04-08 81 70 68 67 66 57 54 53 51
#&gt; 4 3 Door… Loser 2000-10-21 76 76 72 69 67 65 55 59 62
#&gt; 5 504 Bo Wobb… 2000-04-15 57 34 25 17 17 31 36 49 53
#&gt; 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2 3
#&gt; # … with 311 more rows, 67 more variables: wk10 &lt;dbl&gt;, wk11 &lt;dbl&gt;, wk12 &lt;dbl&gt;,
#&gt; # wk13 &lt;dbl&gt;, wk14 &lt;dbl&gt;, wk15 &lt;dbl&gt;, wk16 &lt;dbl&gt;, wk17 &lt;dbl&gt;, wk18 &lt;dbl&gt;,
#&gt; # wk19 &lt;dbl&gt;, wk20 &lt;dbl&gt;, wk21 &lt;dbl&gt;, wk22 &lt;dbl&gt;, wk23 &lt;dbl&gt;, wk24 &lt;dbl&gt;,
#&gt; # wk25 &lt;dbl&gt;, wk26 &lt;dbl&gt;, wk27 &lt;dbl&gt;, wk28 &lt;dbl&gt;, wk29 &lt;dbl&gt;, wk30 &lt;dbl&gt;,
#&gt; # wk31 &lt;dbl&gt;, wk32 &lt;dbl&gt;, wk33 &lt;dbl&gt;, wk34 &lt;dbl&gt;, wk35 &lt;dbl&gt;, wk36 &lt;dbl&gt;,
#&gt; # wk37 &lt;dbl&gt;, wk38 &lt;dbl&gt;, wk39 &lt;dbl&gt;, wk40 &lt;dbl&gt;, wk41 &lt;dbl&gt;, wk42 &lt;dbl&gt;,
#&gt; # wk43 &lt;dbl&gt;, wk44 &lt;dbl&gt;, wk45 &lt;dbl&gt;, wk46 &lt;dbl&gt;, wk47 &lt;dbl&gt;, wk48 &lt;dbl&gt;, …</pre>
#&gt; artist track date.ent…¹ wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;date&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
#&gt; 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
#&gt; 3 3 Doors D… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
#&gt; 4 3 Doors D… Loser 2000-10-21 76 76 72 69 67 65 55 59
#&gt; 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
#&gt; 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
#&gt; # … with 311 more rows, 68 more variables: wk9 &lt;dbl&gt;, wk10 &lt;dbl&gt;,
#&gt; # wk11 &lt;dbl&gt;, wk12 &lt;dbl&gt;, wk13 &lt;dbl&gt;, wk14 &lt;dbl&gt;, wk15 &lt;dbl&gt;, wk16 &lt;dbl&gt;,
#&gt; # wk17 &lt;dbl&gt;, wk18 &lt;dbl&gt;, wk19 &lt;dbl&gt;, wk20 &lt;dbl&gt;, wk21 &lt;dbl&gt;, wk22 &lt;dbl&gt;,
#&gt; # wk23 &lt;dbl&gt;, wk24 &lt;dbl&gt;, wk25 &lt;dbl&gt;, wk26 &lt;dbl&gt;, wk27 &lt;dbl&gt;, wk28 &lt;dbl&gt;,
#&gt; # wk29 &lt;dbl&gt;, wk30 &lt;dbl&gt;, wk31 &lt;dbl&gt;, wk32 &lt;dbl&gt;, wk33 &lt;dbl&gt;, wk34 &lt;dbl&gt;,
#&gt; # wk35 &lt;dbl&gt;, wk36 &lt;dbl&gt;, wk37 &lt;dbl&gt;, wk38 &lt;dbl&gt;, wk39 &lt;dbl&gt;, wk40 &lt;dbl&gt;,
#&gt; # wk41 &lt;dbl&gt;, wk42 &lt;dbl&gt;, wk43 &lt;dbl&gt;, wk44 &lt;dbl&gt;, wk45 &lt;dbl&gt;, …</pre>
</div>
<p>In this dataset, each observation is a song. The first three columns (<code>artist</code>, <code>track</code> and <code>date.entered</code>) are variables that describe the song. Then we have 76 columns (<code>wk1</code>-<code>wk76</code>) that describe the rank of the song in each week. Here, the column names are one variable (the <code>week</code>) and the cell values are another (the <code>rank</code>).</p>
<p>To tidy this data, we'll use <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>. After the data, there are three key arguments:</p>
@ -347,21 +339,21 @@ Many variables in column names</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">who2
#&gt; # A tibble: 7,240 × 58
#&gt; country year sp_m_…¹ sp_m_…² sp_m_…³ sp_m_…⁴ sp_m_…⁵ sp_m_…⁶ sp_m_65 sp_f_…⁷
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 Afghani… 1980 NA NA NA NA NA NA NA NA
#&gt; 2 Afghani… 1981 NA NA NA NA NA NA NA NA
#&gt; 3 Afghani… 1982 NA NA NA NA NA NA NA NA
#&gt; 4 Afghani… 1983 NA NA NA NA NA NA NA NA
#&gt; 5 Afghani… 1984 NA NA NA NA NA NA NA NA
#&gt; 6 Afghani… 1985 NA NA NA NA NA NA NA NA
#&gt; # … with 7,234 more rows, 48 more variables: sp_f_1524 &lt;dbl&gt;, sp_f_2534 &lt;dbl&gt;,
#&gt; # sp_f_3544 &lt;dbl&gt;, sp_f_4554 &lt;dbl&gt;, sp_f_5564 &lt;dbl&gt;, sp_f_65 &lt;dbl&gt;,
#&gt; # sn_m_014 &lt;dbl&gt;, sn_m_1524 &lt;dbl&gt;, sn_m_2534 &lt;dbl&gt;, sn_m_3544 &lt;dbl&gt;,
#&gt; # sn_m_4554 &lt;dbl&gt;, sn_m_5564 &lt;dbl&gt;, sn_m_65 &lt;dbl&gt;, sn_f_014 &lt;dbl&gt;,
#&gt; # sn_f_1524 &lt;dbl&gt;, sn_f_2534 &lt;dbl&gt;, sn_f_3544 &lt;dbl&gt;, sn_f_4554 &lt;dbl&gt;,
#&gt; # sn_f_5564 &lt;dbl&gt;, sn_f_65 &lt;dbl&gt;, ep_m_014 &lt;dbl&gt;, ep_m_1524 &lt;dbl&gt;,
#&gt; # ep_m_2534 &lt;dbl&gt;, ep_m_3544 &lt;dbl&gt;, ep_m_4554 &lt;dbl&gt;, ep_m_5564 &lt;dbl&gt;, …</pre>
#&gt; country year sp_m_014 sp_m_1…¹ sp_m_…² sp_m_…³ sp_m_…⁴ sp_m_…⁵ sp_m_65
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 Afghanistan 1980 NA NA NA NA NA NA NA
#&gt; 2 Afghanistan 1981 NA NA NA NA NA NA NA
#&gt; 3 Afghanistan 1982 NA NA NA NA NA NA NA
#&gt; 4 Afghanistan 1983 NA NA NA NA NA NA NA
#&gt; 5 Afghanistan 1984 NA NA NA NA NA NA NA
#&gt; 6 Afghanistan 1985 NA NA NA NA NA NA NA
#&gt; # … with 7,234 more rows, 49 more variables: sp_f_014 &lt;dbl&gt;,
#&gt; # sp_f_1524 &lt;dbl&gt;, sp_f_2534 &lt;dbl&gt;, sp_f_3544 &lt;dbl&gt;, sp_f_4554 &lt;dbl&gt;,
#&gt; # sp_f_5564 &lt;dbl&gt;, sp_f_65 &lt;dbl&gt;, sn_m_014 &lt;dbl&gt;, sn_m_1524 &lt;dbl&gt;,
#&gt; # sn_m_2534 &lt;dbl&gt;, sn_m_3544 &lt;dbl&gt;, sn_m_4554 &lt;dbl&gt;, sn_m_5564 &lt;dbl&gt;,
#&gt; # sn_m_65 &lt;dbl&gt;, sn_f_014 &lt;dbl&gt;, sn_f_1524 &lt;dbl&gt;, sn_f_2534 &lt;dbl&gt;,
#&gt; # sn_f_3544 &lt;dbl&gt;, sn_f_4554 &lt;dbl&gt;, sn_f_5564 &lt;dbl&gt;, sn_f_65 &lt;dbl&gt;,
#&gt; # ep_m_014 &lt;dbl&gt;, ep_m_1524 &lt;dbl&gt;, ep_m_2534 &lt;dbl&gt;, ep_m_3544 &lt;dbl&gt;, …</pre>
</div>
<p>This dataset records tuberculosis data collected by the WHO. There are two columns that are already variables and are easy to interpret: <code>country</code> and <code>year</code>. They are followed by 56 columns like <code>sp_m_014</code>, <code>ep_m_4554</code>, and <code>rel_m_3544</code>. If you stare at these columns for long enough, you'll notice there's a pattern. Each column name is made up of three pieces separated by <code>_</code>. The first piece, <code>sp</code>/<code>rel</code>/<code>ep</code>, describes the method used for the <code>diagnosis</code>, the second piece, <code>m</code>/<code>f</code> is the <code>gender</code>, and the third piece, <code>014</code>/<code>1524</code>/<code>2534</code>/<code>3544</code>/<code>4554</code>/<code>65</code> is the <code>age</code> range.</p>
<p>So in this case we have six variables: two variables are already columns, three variables are contained in the column name, and one variable is in the cell values. This requires two changes to our call to <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>: <code>names_to</code> gets a vector of column names and <code>names_sep</code> describes how to split the variable name up into pieces:</p>
@ -454,14 +446,14 @@ Widening data</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">cms_patient_experience
#&gt; # A tibble: 500 × 5
#&gt; org_pac_id org_nm measure_cd measure_title prf_r…¹
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_1 CAHPS for MIPS SSM… 63
#&gt; 2 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_2 CAHPS for MIPS SSM… 87
#&gt; 3 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_3 CAHPS for MIPS SSM… 86
#&gt; 4 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_5 CAHPS for MIPS SSM… 57
#&gt; 5 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_8 CAHPS for MIPS SSM… 85
#&gt; 6 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_12 CAHPS for MIPS SSM… 24
#&gt; org_pac_id org_nm measure_cd measure_title prf_r…¹
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_1 CAHPS for MIPS … 63
#&gt; 2 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_2 CAHPS for MIPS … 87
#&gt; 3 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_3 CAHPS for MIPS … 86
#&gt; 4 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_5 CAHPS for MIPS … 57
#&gt; 5 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_8 CAHPS for MIPS … 85
#&gt; 6 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_12 CAHPS for MIPS … 24
#&gt; # … with 494 more rows, and abbreviated variable name ¹prf_rate</pre>
</div>
<p>An observation is an organisation, but each organisation is spread across six rows, with one row for each variable, or measure. We can see the complete set of values for <code>measure_cd</code> and <code>measure_title</code> by using <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>:</p>
@ -469,13 +461,13 @@ Widening data</h2>
<pre data-type="programlisting" data-code-language="downlit">cms_patient_experience |&gt;
distinct(measure_cd, measure_title)
#&gt; # A tibble: 6 × 2
#&gt; measure_cd measure_title
#&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 CAHPS_GRP_1 CAHPS for MIPS SSM: Getting Timely Care, Appointments, and Infor
#&gt; 2 CAHPS_GRP_2 CAHPS for MIPS SSM: How Well Providers Communicate
#&gt; 3 CAHPS_GRP_3 CAHPS for MIPS SSM: Patient's Rating of Provider
#&gt; 4 CAHPS_GRP_5 CAHPS for MIPS SSM: Health Promotion and Education
#&gt; 5 CAHPS_GRP_8 CAHPS for MIPS SSM: Courteous and Helpful Office Staff
#&gt; measure_cd measure_title
#&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 CAHPS_GRP_1 CAHPS for MIPS SSM: Getting Timely Care, Appointments, and In…
#&gt; 2 CAHPS_GRP_2 CAHPS for MIPS SSM: How Well Providers Communicate
#&gt; 3 CAHPS_GRP_3 CAHPS for MIPS SSM: Patient's Rating of Provider
#&gt; 4 CAHPS_GRP_5 CAHPS for MIPS SSM: Health Promotion and Education
#&gt; 5 CAHPS_GRP_8 CAHPS for MIPS SSM: Courteous and Helpful Office Staff
#&gt; 6 CAHPS_GRP_12 CAHPS for MIPS SSM: Stewardship of Patient Resources</pre>
</div>
<p>Neither of these columns will make particularly great variable names: <code>measure_cd</code> doesn't hint at the meaning of the variable and <code>measure_title</code> is a long sentence containing spaces. We'll use <code>measure_cd</code> for now, but in a real analysis you might want to create your own variable names that are both short and meaningful.</p>
@ -487,14 +479,14 @@ Widening data</h2>
values_from = prf_rate
)
#&gt; # A tibble: 500 × 9
#&gt; org_pac_id org_nm measu…¹ CAHPS…² CAHPS…³ CAHPS…⁴ CAHPS…⁵ CAHPS…⁶ CAHPS…⁷
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 0446157747 USC CARE M… CAHPS … 63 NA NA NA NA NA
#&gt; 2 0446157747 USC CARE M… CAHPS … NA 87 NA NA NA NA
#&gt; 3 0446157747 USC CARE M… CAHPS … NA NA 86 NA NA NA
#&gt; 4 0446157747 USC CARE M… CAHPS … NA NA NA 57 NA NA
#&gt; 5 0446157747 USC CARE M… CAHPS … NA NA NA NA 85 NA
#&gt; 6 0446157747 USC CARE M… CAHPS … NA NA NA NA NA 24
#&gt; org_pac_id org_nm measu…¹ CAHPS…² CAHPS…³ CAHPS…⁴ CAHPS…⁵ CAHPS…⁶ CAHPS…⁷
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 0446157747 USC CAR… CAHPS … 63 NA NA NA NA NA
#&gt; 2 0446157747 USC CAR… CAHPS … NA 87 NA NA NA NA
#&gt; 3 0446157747 USC CAR… CAHPS … NA NA 86 NA NA NA
#&gt; 4 0446157747 USC CAR… CAHPS … NA NA NA 57 NA NA
#&gt; 5 0446157747 USC CAR… CAHPS … NA NA NA NA 85 NA
#&gt; 6 0446157747 USC CAR… CAHPS … NA NA NA NA NA 24
#&gt; # … with 494 more rows, and abbreviated variable names ¹measure_title,
#&gt; # ²CAHPS_GRP_1, ³CAHPS_GRP_2, ⁴CAHPS_GRP_3, ⁵CAHPS_GRP_5, ⁶CAHPS_GRP_8,
#&gt; # ⁷CAHPS_GRP_12</pre>
@ -508,14 +500,14 @@ Widening data</h2>
values_from = prf_rate
)
#&gt; # A tibble: 95 × 8
#&gt; org_pac_id org_nm CAHPS…¹ CAHPS…² CAHPS…³ CAHPS…⁴ CAHPS…⁵ CAHPS…⁶
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 0446157747 USC CARE MEDICAL G… 63 87 86 57 85 24
#&gt; 2 0446162697 ASSOCIATION OF UNI… 59 85 83 63 88 22
#&gt; 3 0547164295 BEAVER MEDICAL GRO… 49 NA 75 44 73 12
#&gt; 4 0749333730 CAPE PHYSICIANS AS… 67 84 85 65 82 24
#&gt; 5 0840104360 ALLIANCE PHYSICIAN… 66 87 87 64 87 28
#&gt; 6 0840109864 REX HOSPITAL INC 73 87 84 67 91 30
#&gt; org_pac_id org_nm CAHPS…¹ CAHPS…² CAHPS…³ CAHPS…⁴ CAHPS…⁵ CAHPS…⁶
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 0446157747 USC CARE MEDICA… 63 87 86 57 85 24
#&gt; 2 0446162697 ASSOCIATION OF … 59 85 83 63 88 22
#&gt; 3 0547164295 BEAVER MEDICAL … 49 NA 75 44 73 12
#&gt; 4 0749333730 CAPE PHYSICIANS… 67 84 85 65 82 24
#&gt; 5 0840104360 ALLIANCE PHYSIC… 66 87 87 64 87 28
#&gt; 6 0840109864 REX HOSPITAL INC 73 87 84 67 91 30
#&gt; # … with 89 more rows, and abbreviated variable names ¹CAHPS_GRP_1,
#&gt; # ²CAHPS_GRP_2, ³CAHPS_GRP_3, ⁴CAHPS_GRP_5, ⁵CAHPS_GRP_8, ⁶CAHPS_GRP_12</pre>
</div>
@ -602,7 +594,8 @@ How does<code>pivot_wider()</code> work?</h2>
names_from = name,
values_from = value
)
#&gt; Warning: Values from `value` are not uniquely identified; output will contain list-cols.
#&gt; Warning: Values from `value` are not uniquely identified; output will contain
#&gt; list-cols.
#&gt; • Use `values_fn = list` to suppress this warning.
#&gt; • Use `values_fn = {summary_fun}` to summarise duplicates.
#&gt; • Use the following dplyr code to identify duplicates.
@ -695,15 +688,16 @@ col_year &lt;- gapminder |&gt;
)
col_year
#&gt; # A tibble: 142 × 13
#&gt; country `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` `1997`
#&gt; &lt;fct&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 Afghani… 2.89 2.91 2.93 2.92 2.87 2.90 2.99 2.93 2.81 2.80
#&gt; 2 Albania 3.20 3.29 3.36 3.44 3.52 3.55 3.56 3.57 3.40 3.50
#&gt; 3 Algeria 3.39 3.48 3.41 3.51 3.62 3.69 3.76 3.75 3.70 3.68
#&gt; 4 Angola 3.55 3.58 3.63 3.74 3.74 3.48 3.44 3.39 3.42 3.36
#&gt; 5 Argenti… 3.77 3.84 3.85 3.91 3.98 4.00 3.95 3.96 3.97 4.04
#&gt; 6 Austral… 4.00 4.04 4.09 4.16 4.23 4.26 4.29 4.34 4.37 4.43
#&gt; # … with 136 more rows, and 2 more variables: `2002` &lt;dbl&gt;, `2007` &lt;dbl&gt;</pre>
#&gt; country `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992`
#&gt; &lt;fct&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 Afghanistan 2.89 2.91 2.93 2.92 2.87 2.90 2.99 2.93 2.81
#&gt; 2 Albania 3.20 3.29 3.36 3.44 3.52 3.55 3.56 3.57 3.40
#&gt; 3 Algeria 3.39 3.48 3.41 3.51 3.62 3.69 3.76 3.75 3.70
#&gt; 4 Angola 3.55 3.58 3.63 3.74 3.74 3.48 3.44 3.39 3.42
#&gt; 5 Argentina 3.77 3.84 3.85 3.91 3.98 4.00 3.95 3.96 3.97
#&gt; 6 Australia 4.00 4.04 4.09 4.16 4.23 4.26 4.29 4.34 4.37
#&gt; # … with 136 more rows, and 3 more variables: `1997` &lt;dbl&gt;, `2002` &lt;dbl&gt;,
#&gt; # `2007` &lt;dbl&gt;</pre>
</div>
<p><code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> produces a tibble where each row is labelled by the <code>country</code> variable. But most classic statistical algorithms don't want the identifier as an explicit variable; they want it as a <strong>row name</strong>. We can turn the <code>country</code> variable into row names with <code>column_to_rownames()</code>:</p>
<div class="cell">

@ -1,13 +1,5 @@
<section data-type="chapter" id="chp-data-transform">
<h1><span id="sec-data-transform" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data transformation</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<h1><span id="sec-data-transform" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data transformation</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
@ -21,12 +13,12 @@ Prerequisites</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(nycflights13)
library(tidyverse)
#&gt; ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
#&gt; ── Attaching packages ──────────────────────────────────── tidyverse 1.3.2 ──
#&gt; ✔ ggplot2 3.4.0.9000 ✔ purrr 0.9000.0.9000
#&gt; ✔ tibble 3.1.8 ✔ dplyr 1.0.99.9000
#&gt; ✔ tidyr 1.2.1.9001 ✔ stringr 1.4.1.9000
#&gt; ✔ readr 2.1.3 ✔ forcats 0.5.2
#&gt; ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#&gt; ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
#&gt; ✖ dplyr::filter() masks stats::filter()
#&gt; ✖ dplyr::lag() masks stats::lag()</pre>
</div>
@ -40,14 +32,14 @@ nycflights13</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights
#&gt; # A tibble: 336,776 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; # … with 336,770 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
@ -92,14 +84,14 @@ Rows</h1>
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(arr_delay &gt; 120)
#&gt; # A tibble: 10,034 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 811 630 101 1047 830 137 MQ
#&gt; 2 2013 1 1 848 1835 853 1001 1950 851 MQ
#&gt; 3 2013 1 1 957 733 144 1056 853 123 UA
#&gt; 4 2013 1 1 1114 900 134 1447 1222 145 UA
#&gt; 5 2013 1 1 1505 1310 115 1638 1431 127 EV
#&gt; 6 2013 1 1 1525 1340 105 1831 1626 125 B6
#&gt; year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 811 630 101 1047 830 137 MQ
#&gt; 2 2013 1 1 848 1835 853 1001 1950 851 MQ
#&gt; 3 2013 1 1 957 733 144 1056 853 123 UA
#&gt; 4 2013 1 1 1114 900 134 1447 1222 145 UA
#&gt; 5 2013 1 1 1505 1310 115 1638 1431 127 EV
#&gt; 6 2013 1 1 1525 1340 105 1831 1626 125 B6
#&gt; # … with 10,028 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
@ -111,14 +103,14 @@ Rows</h1>
flights |&gt;
filter(month == 1 &amp; day == 1)
#&gt; # A tibble: 842 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; # … with 836 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
@ -128,14 +120,14 @@ flights |&gt;
flights |&gt;
filter(month == 1 | month == 2)
#&gt; # A tibble: 51,955 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; # … with 51,949 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
@ -147,14 +139,14 @@ flights |&gt;
flights |&gt;
filter(month %in% c(1, 2))
#&gt; # A tibble: 51,955 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; # … with 51,949 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
@ -197,14 +189,14 @@ Common mistakes</h2>
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
arrange(year, month, day, dep_time)
#&gt; # A tibble: 336,776 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; # … with 336,770 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
@ -215,14 +207,14 @@ Common mistakes</h2>
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
arrange(desc(dep_delay))
#&gt; # A tibble: 336,776 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 9 641 900 1301 1242 1530 1272 HA
#&gt; 2 2013 6 15 1432 1935 1137 1607 2120 1127 MQ
#&gt; 3 2013 1 10 1121 1635 1126 1239 1810 1109 MQ
#&gt; 4 2013 9 20 1139 1845 1014 1457 2210 1007 AA
#&gt; 5 2013 7 22 845 1600 1005 1044 1815 989 MQ
#&gt; 6 2013 4 10 1100 1900 960 1342 2211 931 DL
#&gt; year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 9 641 900 1301 1242 1530 1272 HA
#&gt; 2 2013 6 15 1432 1935 1137 1607 2120 1127 MQ
#&gt; 3 2013 1 10 1121 1635 1126 1239 1810 1109 MQ
#&gt; 4 2013 9 20 1139 1845 1014 1457 2210 1007 AA
#&gt; 5 2013 7 22 845 1600 1005 1044 1815 989 MQ
#&gt; 6 2013 4 10 1100 1900 960 1342 2211 931 DL
#&gt; # … with 336,770 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
@ -234,14 +226,14 @@ Common mistakes</h2>
filter(dep_delay &lt;= 10 &amp; dep_delay &gt;= -10) |&gt;
arrange(desc(arr_delay))
#&gt; # A tibble: 239,109 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 11 1 658 700 -2 1329 1015 194 VX
#&gt; 2 2013 4 18 558 600 -2 1149 850 179 AA
#&gt; 3 2013 7 7 1659 1700 -1 2050 1823 147 US
#&gt; 4 2013 7 22 1606 1615 -9 2056 1831 145 DL
#&gt; 5 2013 9 19 648 641 7 1035 810 145 UA
#&gt; 6 2013 4 18 655 700 -5 1213 950 143 AA
#&gt; year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 11 1 658 700 -2 1329 1015 194 VX
#&gt; 2 2013 4 18 558 600 -2 1149 850 179 AA
#&gt; 3 2013 7 7 1659 1700 -1 2050 1823 147 US
#&gt; 4 2013 7 22 1606 1615 -9 2056 1831 145 DL
#&gt; 5 2013 9 19 648 641 7 1035 810 145 UA
#&gt; 6 2013 4 18 655 700 -5 1213 950 143 AA
#&gt; # … with 239,103 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
@ -285,14 +277,14 @@ Columns</h1>
speed = distance / air_time * 60
)
#&gt; # A tibble: 336,776 × 21
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; # … with 336,770 more rows, 11 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, gain &lt;dbl&gt;, speed &lt;dbl&gt;, and abbreviated
@ -308,18 +300,19 @@ Columns</h1>
.before = 1
)
#&gt; # A tibble: 336,776 × 21
#&gt; gain speed year month day dep_time sched…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 -9 370. 2013 1 1 517 515 2 830 819 11
#&gt; 2 -16 374. 2013 1 1 533 529 4 850 830 20
#&gt; 3 -31 408. 2013 1 1 542 540 2 923 850 33
#&gt; 4 17 517. 2013 1 1 544 545 -1 1004 1022 -18
#&gt; 5 19 394. 2013 1 1 554 600 -6 812 837 -25
#&gt; 6 -16 288. 2013 1 1 554 558 -4 740 728 12
#&gt; # … with 336,770 more rows, 10 more variables: carrier &lt;chr&gt;, flight &lt;int&gt;,
#&gt; # tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;,
#&gt; # hour &lt;dbl&gt;, minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
#&gt; gain speed year month day dep_time sched_dep_…¹ dep_d…² arr_t…³ sched…⁴
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 -9 370. 2013 1 1 517 515 2 830 819
#&gt; 2 -16 374. 2013 1 1 533 529 4 850 830
#&gt; 3 -31 408. 2013 1 1 542 540 2 923 850
#&gt; 4 17 517. 2013 1 1 544 545 -1 1004 1022
#&gt; 5 19 394. 2013 1 1 554 600 -6 812 837
#&gt; 6 -16 288. 2013 1 1 554 558 -4 740 728
#&gt; # … with 336,770 more rows, 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;, and abbreviated variable names ¹sched_dep_time,
#&gt; # ²dep_delay, ³arr_time, ⁴sched_arr_time</pre>
</div>
<p>The <code>.</code> is a sign that <code>.before</code> is an argument to the function, not the name of a new variable. You can also use <code>.after</code> to add after a variable, and in both <code>.before</code> and <code>.after</code> you can use a variable name instead of a position. For example, we could add the new variables after <code>day</code>:</p>
<div class="cell">
@ -330,18 +323,19 @@ Columns</h1>
.after = day
)
#&gt; # A tibble: 336,776 × 21
#&gt; year month day gain speed dep_time sched…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 2013 1 1 -9 370. 517 515 2 830 819 11
#&gt; 2 2013 1 1 -16 374. 533 529 4 850 830 20
#&gt; 3 2013 1 1 -31 408. 542 540 2 923 850 33
#&gt; 4 2013 1 1 17 517. 544 545 -1 1004 1022 -18
#&gt; 5 2013 1 1 19 394. 554 600 -6 812 837 -25
#&gt; 6 2013 1 1 -16 288. 554 558 -4 740 728 12
#&gt; # … with 336,770 more rows, 10 more variables: carrier &lt;chr&gt;, flight &lt;int&gt;,
#&gt; # tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;,
#&gt; # hour &lt;dbl&gt;, minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
#&gt; year month day gain speed dep_time sched_dep_…¹ dep_d…² arr_t…³ sched…⁴
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 -9 370. 517 515 2 830 819
#&gt; 2 2013 1 1 -16 374. 533 529 4 850 830
#&gt; 3 2013 1 1 -31 408. 542 540 2 923 850
#&gt; 4 2013 1 1 17 517. 544 545 -1 1004 1022
#&gt; 5 2013 1 1 19 394. 554 600 -6 812 837
#&gt; 6 2013 1 1 -16 288. 554 558 -4 740 728
#&gt; # … with 336,770 more rows, 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;, and abbreviated variable names ¹sched_dep_time,
#&gt; # ²dep_delay, ³arr_time, ⁴sched_arr_time</pre>
</div>
<p>Alternatively, you can control which variables are kept with the <code>.keep</code> argument. A particularly useful value is <code>"used"</code>, which keeps only the columns involved in your calculations so you can see the inputs and outputs side by side:</p>
<div class="cell">
@ -403,18 +397,18 @@ flights |&gt;
flights |&gt;
select(!year:day)
#&gt; # A tibble: 336,776 × 16
#&gt; dep_time sched…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier flight tailnum origin
#&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 517 515 2 830 819 11 UA 1545 N14228 EWR
#&gt; 2 533 529 4 850 830 20 UA 1714 N24211 LGA
#&gt; 3 542 540 2 923 850 33 AA 1141 N619AA JFK
#&gt; 4 544 545 -1 1004 1022 -18 B6 725 N804JB JFK
#&gt; 5 554 600 -6 812 837 -25 DL 461 N668DN LGA
#&gt; 6 554 558 -4 740 728 12 UA 1696 N39463 EWR
#&gt; # … with 336,770 more rows, 6 more variables: dest &lt;chr&gt;, air_time &lt;dbl&gt;,
#&gt; # distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated
#&gt; # variable names ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time,
#&gt; # ⁵arr_delay
#&gt; dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier flight tailnum
#&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 517 515 2 830 819 11 UA 1545 N14228
#&gt; 2 533 529 4 850 830 20 UA 1714 N24211
#&gt; 3 542 540 2 923 850 33 AA 1141 N619AA
#&gt; 4 544 545 -1 1004 1022 -18 B6 725 N804JB
#&gt; 5 554 600 -6 812 837 -25 DL 461 N668DN
#&gt; 6 554 558 -4 740 728 12 UA 1696 N39463
#&gt; # … with 336,770 more rows, 7 more variables: origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;, and abbreviated variable names ¹sched_dep_time,
#&gt; # ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
# Select all columns that are characters
flights |&gt;
@ -466,14 +460,14 @@ flights |&gt;
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
rename(tail_num = tailnum)
#&gt; # A tibble: 336,776 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; # … with 336,770 more rows, 9 more variables: flight &lt;int&gt;, tail_num &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
@ -492,51 +486,51 @@ flights |&gt;
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
relocate(time_hour, air_time)
#&gt; # A tibble: 336,776 × 19
#&gt; time_hour air_time year month day dep_t…¹ sched…² dep_d…³ arr_t…⁴
#&gt; &lt;dttm&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 2013-01-01 05:00:00 227 2013 1 1 517 515 2 830
#&gt; 2 2013-01-01 05:00:00 227 2013 1 1 533 529 4 850
#&gt; 3 2013-01-01 05:00:00 160 2013 1 1 542 540 2 923
#&gt; 4 2013-01-01 05:00:00 183 2013 1 1 544 545 -1 1004
#&gt; 5 2013-01-01 06:00:00 116 2013 1 1 554 600 -6 812
#&gt; 6 2013-01-01 05:00:00 150 2013 1 1 554 558 -4 740
#&gt; # … with 336,770 more rows, 10 more variables: sched_arr_time &lt;int&gt;,
#&gt; # arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;,
#&gt; # dest &lt;chr&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;, and abbreviated
#&gt; # variable names ¹dep_time, ²sched_dep_time, ³dep_delay, ⁴arr_time</pre>
#&gt; time_hour air_time year month day dep_time sched_dep…¹ dep_d…²
#&gt; &lt;dttm&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 2013-01-01 05:00:00 227 2013 1 1 517 515 2
#&gt; 2 2013-01-01 05:00:00 227 2013 1 1 533 529 4
#&gt; 3 2013-01-01 05:00:00 160 2013 1 1 542 540 2
#&gt; 4 2013-01-01 05:00:00 183 2013 1 1 544 545 -1
#&gt; 5 2013-01-01 06:00:00 116 2013 1 1 554 600 -6
#&gt; 6 2013-01-01 05:00:00 150 2013 1 1 554 558 -4
#&gt; # … with 336,770 more rows, 11 more variables: arr_time &lt;int&gt;,
#&gt; # sched_arr_time &lt;int&gt;, arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;,
#&gt; # tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, and abbreviated variable names ¹sched_dep_time, ²dep_delay</pre>
</div>
<p>But you can use the same <code>.before</code> and <code>.after</code> arguments as <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> to choose where to put them:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
relocate(year:dep_time, .after = time_hour)
#&gt; # A tibble: 336,776 × 19
#&gt; sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier flight tailnum origin dest
#&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 515 2 830 819 11 UA 1545 N14228 EWR IAH
#&gt; 2 529 4 850 830 20 UA 1714 N24211 LGA IAH
#&gt; 3 540 2 923 850 33 AA 1141 N619AA JFK MIA
#&gt; 4 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN
#&gt; 5 600 -6 812 837 -25 DL 461 N668DN LGA ATL
#&gt; 6 558 -4 740 728 12 UA 1696 N39463 EWR ORD
#&gt; # … with 336,770 more rows, 9 more variables: air_time &lt;dbl&gt;, distance &lt;dbl&gt;,
#&gt; # hour &lt;dbl&gt;, minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, year &lt;int&gt;, month &lt;int&gt;,
#&gt; # day &lt;int&gt;, dep_time &lt;int&gt;, and abbreviated variable names ¹sched_dep_time,
#&gt; # ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
#&gt; sched…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier flight tailnum origin dest
#&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 515 2 830 819 11 UA 1545 N14228 EWR IAH
#&gt; 2 529 4 850 830 20 UA 1714 N24211 LGA IAH
#&gt; 3 540 2 923 850 33 AA 1141 N619AA JFK MIA
#&gt; 4 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN
#&gt; 5 600 -6 812 837 -25 DL 461 N668DN LGA ATL
#&gt; 6 558 -4 740 728 12 UA 1696 N39463 EWR ORD
#&gt; # … with 336,770 more rows, 9 more variables: air_time &lt;dbl&gt;,
#&gt; # distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, year &lt;int&gt;,
#&gt; # month &lt;int&gt;, day &lt;int&gt;, dep_time &lt;int&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
flights |&gt;
relocate(starts_with("arr"), .before = dep_time)
#&gt; # A tibble: 336,776 × 19
#&gt; year month day arr_time arr_delay dep_time sched_…¹ dep_d…² sched…³ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 830 11 517 515 2 819 UA
#&gt; 2 2013 1 1 850 20 533 529 4 830 UA
#&gt; 3 2013 1 1 923 33 542 540 2 850 AA
#&gt; 4 2013 1 1 1004 -18 544 545 -1 1022 B6
#&gt; 5 2013 1 1 812 -25 554 600 -6 837 DL
#&gt; 6 2013 1 1 740 12 554 558 -4 728 UA
#&gt; year month day arr_time arr_de…¹ dep_t…² sched…³ dep_d…⁴ sched…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 830 11 517 515 2 819 UA
#&gt; 2 2013 1 1 850 20 533 529 4 830 UA
#&gt; 3 2013 1 1 923 33 542 540 2 850 AA
#&gt; 4 2013 1 1 1004 -18 544 545 -1 1022 B6
#&gt; 5 2013 1 1 812 -25 554 600 -6 837 DL
#&gt; 6 2013 1 1 740 12 554 558 -4 728 UA
#&gt; # … with 336,770 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹​sched_dep_time, ²dep_delay, ³sched_arr_time</pre>
#&gt; # ¹​arr_delay, ²dep_time, ³sched_dep_time, ⁴dep_delay, ⁵sched_arr_time</pre>
</div>
</section>
@ -580,14 +574,14 @@ Groups</h1>
group_by(month)
#&gt; # A tibble: 336,776 × 19
#&gt; # Groups: month [12]
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; # … with 336,770 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
@ -679,14 +673,14 @@ The<code>slice_</code> functions</h2>
slice_max(arr_delay, n = 1)
#&gt; # A tibble: 108 × 19
#&gt; # Groups: dest [105]
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 7 22 2145 2007 98 132 2259 153 B6
#&gt; 2 2013 7 23 1139 800 219 1250 909 221 B6
#&gt; 3 2013 1 25 123 2000 323 229 2101 328 EV
#&gt; 4 2013 8 17 1740 1625 75 2042 2003 39 UA
#&gt; 5 2013 7 22 2257 759 898 121 1026 895 DL
#&gt; 6 2013 7 10 2056 1505 351 2347 1758 349 UA
#&gt; year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 7 22 2145 2007 98 132 2259 153 B6
#&gt; 2 2013 7 23 1139 800 219 1250 909 221 B6
#&gt; 3 2013 1 25 123 2000 323 229 2101 328 EV
#&gt; 4 2013 8 17 1740 1625 75 2042 2003 39 UA
#&gt; 5 2013 7 22 2257 759 898 121 1026 895 DL
#&gt; 6 2013 7 10 2056 1505 351 2347 1758 349 UA
#&gt; # … with 102 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
@ -725,14 +719,14 @@ Grouping by multiple variables</h2>
daily
#&gt; # A tibble: 336,776 × 19
#&gt; # Groups: year, month, day [365]
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; # … with 336,770 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
@ -744,8 +738,8 @@ daily
summarize(
n = n()
)
#&gt; `summarise()` has grouped output by 'year', 'month'. You can override using the
#&gt; `.groups` argument.</pre>
#&gt; `summarise()` has grouped output by 'year', 'month'. You can override using
#&gt; the `.groups` argument.</pre>
</div>
<p>If you're happy with this behavior, you can explicitly request it in order to suppress the message:</p>
<div class="cell">


@ -14,12 +14,12 @@ Prerequisites</h2>
<p>This chapter focuses on ggplot2, one of the core packages in the tidyverse. To access the datasets, help pages, and functions used in this chapter, load the tidyverse by running this code:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
#&gt; ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
#&gt; ── Attaching packages ──────────────────────────────────── tidyverse 1.3.2 ──
#&gt; ✔ ggplot2 3.4.0.9000 ✔ purrr 0.9000.0.9000
#&gt; ✔ tibble 3.1.8 ✔ dplyr 1.0.99.9000
#&gt; ✔ tidyr 1.2.1.9001 ✔ stringr 1.4.1.9000
#&gt; ✔ readr 2.1.3 ✔ forcats 0.5.2
#&gt; ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#&gt; ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
#&gt; ✖ dplyr::filter() masks stats::filter()
#&gt; ✖ dplyr::lag() masks stats::lag()</pre>
</div>
@ -45,14 +45,14 @@ The<code>mpg</code> data frame</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">mpg
#&gt; # A tibble: 234 × 11
#&gt; manufacturer model displ year cyl trans drv cty hwy fl class
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa
#&gt; 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa
#&gt; 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa
#&gt; 4 audi a4 2 2008 4 auto(av) f 21 30 p compa
#&gt; 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa
#&gt; 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa
#&gt; manufacturer model displ year cyl trans drv cty hwy fl class
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p comp…
#&gt; 2 audi a4 1.8 1999 4 manual(… f 21 29 p comp…
#&gt; 3 audi a4 2 2008 4 manual(… f 20 31 p comp…
#&gt; 4 audi a4 2 2008 4 auto(av) f 21 30 p comp…
#&gt; 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p comp…
#&gt; 6 audi a4 2.8 1999 6 manual(… f 18 26 p comp…
#&gt; # … with 228 more rows</pre>
</div>
<p>Among the variables in <code>mpg</code> are:</p>


@ -1,26 +1,5 @@
<section data-type="chapter" id="chp-databases">
<h1><span id="sec-import-databases" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Databases</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p>
<p>There are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organised. In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you're interested in:</p><div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds_db &lt;- tbl(con, in_schema("sales", "diamonds"))
diamonds_db &lt;- tbl(con, in_catalog("north_america", "sales", "diamonds"))</pre>
</div><p>Other times you might want to use your own SQL query as a starting point:</p><div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds_db &lt;- tbl(con, sql("SELECT * FROM diamonds"))</pre>
</div>
<p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we'll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It's not perfect, but it's continually improving, and if you hit a problem you can file an issue <a href="https://github.com/tidyverse/dbplyr/issues/">on GitHub</a> to help us do better.</p>
<p>In the examples above, note that <code>"year"</code> and <code>"type"</code> are wrapped in double quotes. That's because these are <strong>reserved words</strong> in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.</p><p>When working with other databases you're likely to see every variable name quoted because only a handful of client packages, like duckdb, know what all the reserved words are, so they quote everything to be safe.</p><pre data-type="programlisting" data-code-language="sql">SELECT "tailnum", "type", "manufacturer", "model", "year"
FROM "planes"</pre><p>Some other database systems use backticks instead of quotes:</p><pre data-type="programlisting" data-code-language="sql">SELECT `tailnum`, `type`, `manufacturer`, `model`, `year`
FROM `planes`</pre></div>
<h1><span id="sec-import-databases" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Databases</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
@ -203,8 +182,6 @@ diamonds_db
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p>
<p>There are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organised. In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you're interested in:</p><div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds_db &lt;- tbl(con, in_schema("sales", "diamonds"))
diamonds_db &lt;- tbl(con, in_catalog("north_america", "sales", "diamonds"))</pre>
@ -334,8 +311,6 @@ planes |&gt; show_query()
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p>
<p>There are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organised. In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you're interested in:</p><div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds_db &lt;- tbl(con, in_schema("sales", "diamonds"))
diamonds_db &lt;- tbl(con, in_catalog("north_america", "sales", "diamonds"))</pre>
@ -388,8 +363,6 @@ planes |&gt;
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p>
<p>There are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organised. In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you're interested in:</p><div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds_db &lt;- tbl(con, in_schema("sales", "diamonds"))
diamonds_db &lt;- tbl(con, in_catalog("north_america", "sales", "diamonds"))</pre>
@ -665,8 +638,8 @@ mutate_query &lt;- function(df, ...) {
mean = mean(arr_delay, na.rm = TRUE),
median = median(arr_delay, na.rm = TRUE)
)
#&gt; `summarise()` has grouped output by "year" and "month". You can override using
#&gt; the `.groups` argument.
#&gt; `summarise()` has grouped output by "year" and "month". You can override
#&gt; using the `.groups` argument.
#&gt; &lt;SQL&gt;
#&gt; SELECT
#&gt; "year",


@ -1,13 +1,5 @@
<section data-type="chapter" id="chp-datetimes">
<h1><span id="sec-dates-and-times" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Dates and times</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<h1><span id="sec-dates-and-times" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Dates and times</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
@ -43,7 +35,7 @@ Creating date/times</h1>
<pre data-type="programlisting" data-code-language="downlit">today()
#&gt; [1] "2022-11-18"
now()
#&gt; [1] "2022-11-18 10:21:36 CST"</pre>
#&gt; [1] "2022-11-18 10:59:07 CST"</pre>
</div>
<p>Otherwise, the following sections describe the four ways you're likely to create a date/time:</p>
<ul><li>While reading a file with readr.</li>


@ -1,13 +1,5 @@
<section data-type="chapter" id="chp-factors">
<h1><span id="sec-factors" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Factors</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter is largely complete and just needs final proof reading. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<h1><span id="sec-factors" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Factors</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter is largely complete and just needs final proof reading. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
@ -122,14 +114,14 @@ General Social Survey</h1>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gss_cat
#&gt; # A tibble: 21,483 × 9
#&gt; year marital age race rincome partyid relig denom tvhours
#&gt; &lt;int&gt; &lt;fct&gt; &lt;int&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 2000 Never married 26 White $8000 to 9999 Ind,near r… Prot… Sout… 12
#&gt; 2 2000 Divorced 48 White $8000 to 9999 Not str re… Prot… Bapt… NA
#&gt; 3 2000 Widowed 67 White Not applicable Independent Prot… No d… 2
#&gt; 4 2000 Never married 39 White Not applicable Ind,near r… Orth… Not … 4
#&gt; 5 2000 Divorced 25 White Not applicable Not str de… None Not … 1
#&gt; 6 2000 Married 25 White $20000 - 24999 Strong dem… Prot… Sout… NA
#&gt; year marital age race rincome partyid relig denom tvhours
#&gt; &lt;int&gt; &lt;fct&gt; &lt;int&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 2000 Never married 26 White $8000 to 9999 Ind,nea… Prot… Sout… 12
#&gt; 2 2000 Divorced 48 White $8000 to 9999 Not str… Prot… Bapt… NA
#&gt; 3 2000 Widowed 67 White Not applicable Indepen… Prot… No d… 2
#&gt; 4 2000 Never married 39 White Not applicable Ind,nea… Orth… Not … 4
#&gt; 5 2000 Divorced 25 White Not applicable Not str… None Not … 1
#&gt; 6 2000 Married 25 White $20000 - 24999 Strong … Prot… Sout… NA
#&gt; # … with 21,477 more rows</pre>
</div>
<p>(Remember, since this dataset is provided by a package, you can get more information about the variables with <code><a href="https://forcats.tidyverse.org/reference/gss_cat.html">?gss_cat</a></code>.)</p>


@ -1,17 +1,5 @@
<section data-type="chapter" id="chp-functions">
<h1><span id="sec-functions" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Functions</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div><h1>
RStudio
</h1><p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p>
<p>Once you start writing functions, there are two RStudio shortcuts that are super useful:</p><ul><li><p>To find the definition of a function that you've written, place the cursor on the name of the function and press <code>F2</code>.</p></li>
<li><p>To quickly jump to a function, press <code>Ctrl + .</code> to open the fuzzy file and function finder and type the first few letters of your function name. You can also navigate to files, Quarto sections, and more, making it a very handy navigation tool.</p></li>
</ul></div>
<h1><span id="sec-functions" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Functions</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
@ -278,9 +266,7 @@ mape &lt;- function(actual, predicted) {
</div>
<div data-type="note"><h1>
RStudio
</h1><p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p>
<p>Once you start writing functions, there are two RStudio shortcuts that are super useful:</p><ul><li><p>To find the definition of a function that you've written, place the cursor on the name of the function and press <code>F2</code>.</p></li>
</h1><p>Once you start writing functions, there are two RStudio shortcuts that are super useful:</p><ul><li><p>To find the definition of a function that you've written, place the cursor on the name of the function and press <code>F2</code>.</p></li>
<li><p>To quickly jump to a function, press <code>Ctrl + .</code> to open the fuzzy file and function finder and type the first few letters of your function name. You can also navigate to files, Quarto sections, and more, making it a very handy navigation tool.</p></li>
</ul></div>
@ -490,14 +476,14 @@ flights |&gt; unique_where(tailnum == "N14228", month)
flights_sub(dest == "IAH", contains("time"))
#&gt; # A tibble: 7,198 × 8
#&gt; time_hour carrier flight dep_time sched_de…¹ arr_t…² sched…³ air_t…⁴
#&gt; &lt;dttm&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 2013-01-01 05:00:00 UA 1545 517 515 830 819 227
#&gt; 2 2013-01-01 05:00:00 UA 1714 533 529 850 830 227
#&gt; 3 2013-01-01 06:00:00 UA 496 623 627 933 932 229
#&gt; 4 2013-01-01 07:00:00 UA 473 728 732 1041 1038 238
#&gt; 5 2013-01-01 07:00:00 UA 1479 739 739 1104 1038 249
#&gt; 6 2013-01-01 09:00:00 UA 1220 908 908 1228 1219 233
#&gt; time_hour carrier flight dep_time sched…¹ arr_t…² sched…³ air_t…⁴
#&gt; &lt;dttm&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 2013-01-01 05:00:00 UA 1545 517 515 830 819 227
#&gt; 2 2013-01-01 05:00:00 UA 1714 533 529 850 830 227
#&gt; 3 2013-01-01 06:00:00 UA 496 623 627 933 932 229
#&gt; 4 2013-01-01 07:00:00 UA 473 728 732 1041 1038 238
#&gt; 5 2013-01-01 07:00:00 UA 1479 739 739 1104 1038 249
#&gt; 6 2013-01-01 09:00:00 UA 1220 908 908 1228 1219 233
#&gt; # … with 7,192 more rows, and abbreviated variable names ¹sched_dep_time,
#&gt; # ²arr_time, ³sched_arr_time, ⁴air_time</pre>
</div>
@ -529,8 +515,8 @@ flights |&gt;
}
flights |&gt;
count_missing(c(year, month, day), dep_time)
#&gt; `summarise()` has grouped output by 'year', 'month'. You can override using the
#&gt; `.groups` argument.
#&gt; `summarise()` has grouped output by 'year', 'month'. You can override using
#&gt; the `.groups` argument.
#&gt; # A tibble: 365 × 4
#&gt; # Groups: year, month [12]
#&gt; year month day n_miss


@ -98,12 +98,12 @@ The tidyverse</h2>
<p>You will not be able to use the functions, objects, or help files in a package until you load it with <code><a href="https://rdrr.io/r/base/library.html">library()</a></code>. Once you have installed a package, you can load it using the <code><a href="https://rdrr.io/r/base/library.html">library()</a></code> function:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
#&gt; ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
#&gt; ── Attaching packages ──────────────────────────────────── tidyverse 1.3.2 ──
#&gt; ✔ ggplot2 3.4.0.9000 ✔ purrr 0.9000.0.9000
#&gt; ✔ tibble 3.1.8 ✔ dplyr 1.0.99.9000
#&gt; ✔ tidyr 1.2.1.9001 ✔ stringr 1.4.1.9000
#&gt; ✔ readr 2.1.3 ✔ forcats 0.5.2
#&gt; ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#&gt; ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
#&gt; ✖ dplyr::filter() masks stats::filter()
#&gt; ✖ dplyr::lag() masks stats::lag()</pre>
</div>
@ -162,134 +162,105 @@ Acknowledgements</h1>
Colophon</h1>
<p>An online version of this book is available at <a href="https://r4ds.hadley.nz" class="uri">https://r4ds.hadley.nz</a>. It will continue to evolve in between reprints of the physical book. The source of the book is available at <a href="https://github.com/hadley/r4ds" class="uri">https://github.com/hadley/r4ds</a>. The book is powered by <a href="https://quarto.org">Quarto</a> which makes it easy to write books that combine text and executable code.</p>
<p>This book was built with:</p>
<div class="cell-output-display">
<table class="table"><colgroup><col style="width: 14%"/><col style="width: 14%"/><col style="width: 71%"/></colgroup><thead><tr class="header"><th style="text-align: left;">package</th>
<th style="text-align: left;">version</th>
<th style="text-align: left;">source</th>
</tr></thead><tbody><tr class="odd"><td style="text-align: left;">broom</td>
<td style="text-align: left;">1.0.1</td>
<td style="text-align: left;">CRAN (R 4.2.0)</td>
</tr><tr class="even"><td style="text-align: left;">cli</td>
<td style="text-align: left;">3.4.1</td>
<td style="text-align: left;">CRAN (R 4.2.1)</td>
</tr><tr class="odd"><td style="text-align: left;">crayon</td>
<td style="text-align: left;">1.5.2</td>
<td style="text-align: left;">CRAN (R 4.2.0)</td>
</tr><tr class="even"><td style="text-align: left;">dbplyr</td>
<td style="text-align: left;">2.2.1.9000</td>
<td style="text-align: left;">Github (tidyverse/dbplyr@f7b5596f6125011ab0dcd4eccbfe56c5294214da)</td>
</tr><tr class="odd"><td style="text-align: left;">dplyr</td>
<td style="text-align: left;">1.0.99.9000</td>
<td style="text-align: left;">local</td>
</tr><tr class="even"><td style="text-align: left;">dtplyr</td>
<td style="text-align: left;">1.2.2</td>
<td style="text-align: left;">CRAN (R 4.2.0)</td>
</tr><tr class="odd"><td style="text-align: left;">forcats</td>
<td style="text-align: left;">0.5.2</td>
<td style="text-align: left;">CRAN (R 4.2.0)</td>
</tr><tr class="even"><td style="text-align: left;">ggplot2</td>
<td style="text-align: left;">3.4.0.9000</td>
<td style="text-align: left;">Github (tidyverse/ggplot2@4fea51b1eb2cdacebeacf425627dcbc1d61a5d3e)</td>
</tr><tr class="odd"><td style="text-align: left;">googledrive</td>
<td style="text-align: left;">2.0.0</td>
<td style="text-align: left;">CRAN (R 4.2.0)</td>
</tr><tr class="even"><td style="text-align: left;">googlesheets4</td>
<td style="text-align: left;">1.0.1</td>
<td style="text-align: left;">CRAN (R 4.2.0)</td>
</tr><tr class="odd"><td style="text-align: left;">haven</td>
<td style="text-align: left;">2.5.1</td>
<td style="text-align: left;">CRAN (R 4.2.0)</td>
</tr><tr class="even"><td style="text-align: left;">hms</td>
<td style="text-align: left;">1.1.2</td>
<td style="text-align: left;">CRAN (R 4.2.0)</td>
</tr><tr class="odd"><td style="text-align: left;">httr</td>
<td style="text-align: left;">1.4.4</td>
<td style="text-align: left;">CRAN (R 4.2.0)</td>
</tr><tr class="even"><td style="text-align: left;">jsonlite</td>
<td style="text-align: left;">1.8.3</td>
<td style="text-align: left;">CRAN (R 4.2.1)</td>
</tr><tr class="odd"><td style="text-align: left;">lubridate</td>
<td style="text-align: left;">1.9.0</td>
<td style="text-align: left;">CRAN (R 4.2.1)</td>
</tr><tr class="even"><td style="text-align: left;">magrittr</td>
<td style="text-align: left;">2.0.3</td>
<td style="text-align: left;">CRAN (R 4.2.0)</td>
</tr><tr class="odd"><td style="text-align: left;">modelr</td>
<td style="text-align: left;">0.1.9</td>
<td style="text-align: left;">CRAN (R 4.2.0)</td>
</tr><tr class="even"><td style="text-align: left;">pillar</td>
<td style="text-align: left;">1.8.1</td>
<td style="text-align: left;">CRAN (R 4.2.0)</td>
</tr><tr class="odd"><td style="text-align: left;">purrr</td>
<td style="text-align: left;">0.9000.0.9000</td>
<td style="text-align: left;">Github (tidyverse/purrr@aaaa58a571cc449dbcc4348e77e589b373e1e059)</td>
</tr><tr class="even"><td style="text-align: left;">readr</td>
<td style="text-align: left;">2.1.3</td>
<td style="text-align: left;">CRAN (R 4.2.1)</td>
</tr><tr class="odd"><td style="text-align: left;">readxl</td>
<td style="text-align: left;">1.4.1</td>
<td style="text-align: left;">CRAN (R 4.2.0)</td>
</tr><tr class="even"><td style="text-align: left;">reprex</td>
<td style="text-align: left;">2.0.2</td>
<td style="text-align: left;">CRAN (R 4.2.0)</td>
</tr><tr class="odd"><td style="text-align: left;">rlang</td>
<td style="text-align: left;">1.0.6</td>
<td style="text-align: left;">CRAN (R 4.2.0)</td>
</tr><tr class="even"><td style="text-align: left;">rstudioapi</td>
<td style="text-align: left;">0.14</td>
<td style="text-align: left;">CRAN (R 4.2.0)</td>
</tr><tr class="odd"><td style="text-align: left;">rvest</td>
<td style="text-align: left;">1.0.3</td>
<td style="text-align: left;">CRAN (R 4.2.0)</td>
</tr><tr class="even"><td style="text-align: left;">stringr</td>
<td style="text-align: left;">1.4.1.9000</td>
<td style="text-align: left;">Github (tidyverse/stringr@ebf38238cbb80bf0e852d5d8d056c04e36d7c20c)</td>
</tr><tr class="odd"><td style="text-align: left;">tibble</td>
<td style="text-align: left;">3.1.8</td>
<td style="text-align: left;">CRAN (R 4.2.0)</td>
</tr><tr class="even"><td style="text-align: left;">tidyr</td>
<td style="text-align: left;">1.2.1.9001</td>
<td style="text-align: left;">Github (tidyverse/tidyr@91747952f10c961be747c0de1026d109c920e4fc)</td>
</tr><tr class="odd"><td style="text-align: left;">tidyverse</td>
<td style="text-align: left;">1.3.2</td>
<td style="text-align: left;">CRAN (R 4.2.0)</td>
</tr><tr class="even"><td style="text-align: left;">xml2</td>
<td style="text-align: left;">1.3.3</td>
<td style="text-align: left;">CRAN (R 4.2.0)</td>
</tr></tbody></table></div>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">sessioninfo::session_info(c("tidyverse"))
#&gt; ─ Session info ───────────────────────────────────────────────────────────────
#&gt; setting value
#&gt; version R version 4.2.1 (2022-06-23)
#&gt; os macOS Ventura 13.0.1
#&gt; system aarch64, darwin20
#&gt; ui X11
#&gt; language (EN)
#&gt; collate en_US.UTF-8
#&gt; ctype en_US.UTF-8
#&gt; tz America/Chicago
#&gt; date 2022-11-18
#&gt; pandoc 2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#&gt;
#&gt; ─ Packages ───────────────────────────────────────────────────────────────────
#&gt; package * version date (UTC) lib source
#&gt; askpass 1.1 2019-01-13 [1] CRAN (R 4.2.0)
#&gt; assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.0)
#&gt; backports 1.4.1 2021-12-13 [1] CRAN (R 4.2.0)
#&gt; base64enc 0.1-3 2015-07-28 [1] CRAN (R 4.2.0)
#&gt; bit 4.0.4 2020-08-04 [1] CRAN (R 4.2.0)
#&gt; bit64 4.0.5 2020-08-30 [1] CRAN (R 4.2.0)
#&gt; blob 1.2.3 2022-04-10 [1] CRAN (R 4.2.0)
#&gt; broom 1.0.1 2022-08-29 [1] CRAN (R 4.2.0)
#&gt; bslib 0.4.1 2022-11-02 [1] CRAN (R 4.2.0)
#&gt; cachem 1.0.6 2021-08-19 [1] CRAN (R 4.2.0)
#&gt; callr 3.7.3 2022-11-02 [1] CRAN (R 4.2.1)
#&gt; cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.2.0)
#&gt; cli 3.4.1 2022-09-23 [1] CRAN (R 4.2.1)
#&gt; clipr 0.8.0 2022-02-22 [1] CRAN (R 4.2.0)
#&gt; colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.2.0)
#&gt; cpp11 0.4.3 2022-10-12 [1] CRAN (R 4.2.0)
#&gt; crayon 1.5.2 2022-09-29 [1] CRAN (R 4.2.0)
#&gt; curl 4.3.3 2022-10-06 [1] CRAN (R 4.2.0)
#&gt; data.table 1.14.4 2022-10-17 [1] CRAN (R 4.2.1)
#&gt; DBI 1.1.3 2022-06-18 [1] CRAN (R 4.2.0)
#&gt; dbplyr 2.2.1.9000 2022-11-03 [1] Github (tidyverse/dbplyr@f7b5596)
#&gt; digest 0.6.30 2022-10-18 [1] CRAN (R 4.2.0)
#&gt; dplyr * 1.0.99.9000 2022-11-17 [1] local
#&gt; dtplyr 1.2.2 2022-08-20 [1] CRAN (R 4.2.0)
#&gt; ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.0)
#&gt; evaluate 0.18 2022-11-07 [1] CRAN (R 4.2.1)
#&gt; fansi 1.0.3 2022-03-24 [1] CRAN (R 4.2.0)
#&gt; farver 2.1.1 2022-07-06 [1] CRAN (R 4.2.0)
#&gt; fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0)
#&gt; forcats * 0.5.2 2022-08-19 [1] CRAN (R 4.2.0)
#&gt; fs 1.5.2 2021-12-08 [1] CRAN (R 4.2.0)
#&gt; gargle 1.2.1.9000 2022-10-27 [1] Github (r-lib/gargle@69d3f28)
#&gt; generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.0)
#&gt; ggplot2 * 3.4.0.9000 2022-11-10 [1] Github (tidyverse/ggplot2@4fea51b)
#&gt; glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
#&gt; googledrive 2.0.0 2021-07-08 [1] CRAN (R 4.2.0)
#&gt; googlesheets4 1.0.1 2022-08-13 [1] CRAN (R 4.2.0)
#&gt; gtable 0.3.1.9000 2022-09-25 [1] local
#&gt; haven 2.5.1 2022-08-22 [1] CRAN (R 4.2.0)
#&gt; highr 0.9 2021-04-16 [1] CRAN (R 4.2.0)
#&gt; hms 1.1.2 2022-08-19 [1] CRAN (R 4.2.0)
#&gt; htmltools 0.5.3 2022-07-18 [1] CRAN (R 4.2.0)
#&gt; httr 1.4.4 2022-08-17 [1] CRAN (R 4.2.0)
#&gt; ids 1.0.1 2017-05-31 [1] CRAN (R 4.2.0)
#&gt; isoband 0.2.6 2022-10-06 [1] CRAN (R 4.2.0)
#&gt; jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.2.0)
#&gt; jsonlite 1.8.3 2022-10-21 [1] CRAN (R 4.2.1)
#&gt; knitr 1.40 2022-08-24 [1] CRAN (R 4.2.0)
#&gt; labeling 0.4.2 2020-10-20 [1] CRAN (R 4.2.0)
#&gt; lattice 0.20-45 2021-09-22 [2] CRAN (R 4.2.1)
#&gt; lifecycle 1.0.3.9000 2022-10-10 [1] Github (r-lib/lifecycle@80a1e52)
#&gt; lubridate 1.9.0 2022-11-06 [1] CRAN (R 4.2.1)
#&gt; magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
#&gt; MASS 7.3-58.1 2022-08-03 [1] CRAN (R 4.2.0)
#&gt; Matrix 1.5-1 2022-09-13 [1] CRAN (R 4.2.0)
#&gt; memoise 2.0.1 2021-11-26 [1] CRAN (R 4.2.0)
#&gt; mgcv 1.8-41 2022-10-21 [1] CRAN (R 4.2.0)
#&gt; mime 0.12 2021-09-28 [1] CRAN (R 4.2.0)
#&gt; modelr 0.1.9 2022-08-19 [1] CRAN (R 4.2.0)
#&gt; munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.0)
#&gt; nlme 3.1-160 2022-10-10 [1] CRAN (R 4.2.0)
#&gt; openssl 2.0.4 2022-10-17 [1] CRAN (R 4.2.1)
#&gt; pillar 1.8.1 2022-08-19 [1] CRAN (R 4.2.0)
#&gt; pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0)
#&gt; prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.2.0)
#&gt; processx 3.8.0 2022-10-26 [1] CRAN (R 4.2.1)
#&gt; progress 1.2.2 2019-05-16 [1] CRAN (R 4.2.0)
#&gt; ps 1.7.2 2022-10-26 [1] CRAN (R 4.2.1)
#&gt; purrr * 0.9000.0.9000 2022-11-10 [1] Github (tidyverse/purrr@aaaa58a)
#&gt; R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0)
#&gt; rappdirs 0.3.3 2021-01-31 [1] CRAN (R 4.2.0)
#&gt; RColorBrewer 1.1-3 2022-04-03 [1] CRAN (R 4.2.0)
#&gt; readr * 2.1.3 2022-10-01 [1] CRAN (R 4.2.1)
#&gt; readxl 1.4.1 2022-08-17 [1] CRAN (R 4.2.0)
#&gt; rematch 1.0.1 2016-04-21 [1] CRAN (R 4.2.0)
#&gt; rematch2 2.1.2 2020-05-01 [1] CRAN (R 4.2.0)
#&gt; reprex 2.0.2 2022-08-17 [1] CRAN (R 4.2.0)
#&gt; rlang 1.0.6 2022-09-24 [1] CRAN (R 4.2.0)
#&gt; rmarkdown 2.18 2022-11-09 [1] CRAN (R 4.2.1)
#&gt; rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.0)
#&gt; rvest 1.0.3 2022-08-19 [1] CRAN (R 4.2.0)
#&gt; sass 0.4.2 2022-07-16 [1] CRAN (R 4.2.0)
#&gt; scales 1.2.1 2022-08-20 [1] CRAN (R 4.2.0)
#&gt; selectr 0.4-2 2019-11-20 [1] CRAN (R 4.2.0)
#&gt; stringi 1.7.8 2022-07-11 [1] CRAN (R 4.2.0)
#&gt; stringr * 1.4.1.9000 2022-11-10 [1] Github (tidyverse/stringr@ebf3823)
#&gt; sys 3.4.1 2022-10-18 [1] CRAN (R 4.2.0)
#&gt; tibble * 3.1.8 2022-07-22 [1] CRAN (R 4.2.0)
#&gt; tidyr * 1.2.1.9001 2022-11-05 [1] Github (tidyverse/tidyr@9174795)
#&gt; tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.2.1)
#&gt; tidyverse * 1.3.2 2022-07-18 [1] CRAN (R 4.2.0)
#&gt; timechange 0.1.1 2022-11-04 [1] CRAN (R 4.2.1)
#&gt; tinytex 0.42 2022-09-27 [1] CRAN (R 4.2.1)
#&gt; tzdb 0.3.0 2022-03-28 [1] CRAN (R 4.2.0)
#&gt; utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.0)
#&gt; uuid 1.1-0 2022-04-19 [1] CRAN (R 4.2.0)
#&gt; vctrs 0.5.0 2022-10-22 [1] CRAN (R 4.2.0)
#&gt; viridisLite 0.4.1 2022-08-22 [1] CRAN (R 4.2.0)
#&gt; vroom 1.6.0 2022-09-30 [1] CRAN (R 4.2.0)
#&gt; withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0)
#&gt; xfun 0.34 2022-10-18 [1] CRAN (R 4.2.1)
#&gt; xml2 1.3.3 2021-11-30 [1] CRAN (R 4.2.0)
#&gt; yaml 2.3.6 2022-10-18 [1] CRAN (R 4.2.0)
#&gt;
#&gt; [1] /Users/hadleywickham/Library/R/arm64/4.2/library
#&gt; [2] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
#&gt;
#&gt; ──────────────────────────────────────────────────────────────────────────────
cli:::ruler()
#&gt; ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8
#&gt; 12345678901234567890123456789012345678901234567890123456789012345678901234567890</pre>
<pre data-type="programlisting" data-code-language="downlit">cli:::ruler()
#&gt; ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+--
#&gt; 12345678901234567890123456789012345678901234567890123456789012345678901234567</pre>
</div>


@ -1,13 +1,5 @@
<section data-type="chapter" id="chp-iteration">
<h1><span id="sec-iteration" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Iteration</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<h1><span id="sec-iteration" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Iteration</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
@ -226,9 +218,10 @@ df_miss |&gt;
n = n()
)
#&gt; # A tibble: 1 × 9
#&gt; a_median a_n_miss b_median b_n_miss c_median c_n_miss d_median d_n_miss n
#&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 0.429 1 -0.721 1 -0.796 2 0.704 0 5</pre>
#&gt; a_median a_n_miss b_median b_n_miss c_median c_n_miss d_med…¹ d_n_m…² n
#&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 0.429 1 -0.721 1 -0.796 2 0.704 0 5
#&gt; # … with abbreviated variable names ¹d_median, ²d_n_miss</pre>
</div>
<p>If you look carefully, you might intuit that the columns are named using a glue specification (<a href="#sec-glue" data-type="xref">#sec-glue</a>) like <code>{.col}_{.fn}</code> where <code>.col</code> is the name of the original column and <code>.fn</code> is the name of the function. That's not a coincidence! As you'll learn in the next section, you can use the <code>.names</code> argument to supply your own glue spec.</p>
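<p>The naming rule can be sketched on a tiny example. This is only an illustration under stated assumptions: <code>df_demo</code> is a made-up tibble (not the chapter's <code>df_miss</code>), and <code>fns</code>, <code>default_names</code>, and <code>custom_names</code> are hypothetical helper names. It compares the default names that <code>across()</code> produces for a named list of functions with the names produced by a custom <code>.names</code> spec:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

# Hypothetical toy data, made up for this illustration
df_demo &lt;- tibble(a = c(1, 2, NA), b = c(4, NA, 6))

# Named list of summary functions; the list names fill in {.fn}
fns &lt;- list(
  median = \(x) median(x, na.rm = TRUE),
  n_miss = \(x) sum(is.na(x))
)

# Default naming for a named list of functions: "{.col}_{.fn}"
default_names &lt;- df_demo |&gt; summarize(across(a:b, fns))
names(default_names)
#&gt; [1] "a_median" "a_n_miss" "b_median" "b_n_miss"

# Custom naming: put the function name first via .names
custom_names &lt;- df_demo |&gt;
  summarize(across(a:b, fns, .names = "{.fn}_{.col}"))
names(custom_names)
#&gt; [1] "median_a" "n_miss_a" "median_b" "n_miss_b"</pre>
</div>
<p>Only the column names differ between the two results; the summary values are the same.</p>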
</section>
@ -251,9 +244,10 @@ Column names</h2>
n = n(),
)
#&gt; # A tibble: 1 × 9
#&gt; median_a n_miss_a median_b n_miss_b median_c n_miss_c median_d n_miss_d n
#&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 0.429 1 -0.721 1 -0.796 2 0.704 0 5</pre>
#&gt; median_a n_miss_a median_b n_miss_b median_c n_miss_c media…¹ n_mis…² n
#&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 0.429 1 -0.721 1 -0.796 2 0.704 0 5
#&gt; # … with abbreviated variable names ¹median_d, ²n_miss_d</pre>
</div>
<p>The <code>.names</code> argument is particularly important when you use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>. By default the output of <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> is given the same names as the inputs. This means that <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> inside of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> will replace existing columns. For example, here we use <code><a href="https://dplyr.tidyverse.org/reference/coalesce.html">coalesce()</a></code> to replace <code>NA</code>s with <code>0</code>:</p>
<div class="cell">
@ -930,8 +924,8 @@ DBI::dbCreateTable(con, "gapminder", template)</pre>
<pre data-type="programlisting" data-code-language="downlit">con |&gt; tbl("gapminder")
#&gt; # Source: table&lt;gapminder&gt; [0 x 6]
#&gt; # Database: DuckDB 0.5.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
#&gt; # … with 6 variables: country &lt;chr&gt;, continent &lt;chr&gt;, lifeExp &lt;dbl&gt;, pop &lt;dbl&gt;,
#&gt; # gdpPercap &lt;dbl&gt;, year &lt;dbl&gt;</pre>
#&gt; # … with 6 variables: country &lt;chr&gt;, continent &lt;chr&gt;, lifeExp &lt;dbl&gt;,
#&gt; # pop &lt;dbl&gt;, gdpPercap &lt;dbl&gt;, year &lt;dbl&gt;</pre>
</div>
<p>Next, we need a function that takes a single file path, reads it into R, and adds the result to the <code>gapminder</code> table. We can do that by combining <code>read_excel()</code> with <code><a href="https://dbi.r-dbi.org/reference/dbAppendTable.html">DBI::dbAppendTable()</a></code>:</p>
<div class="cell">


@ -1,13 +1,5 @@
<section data-type="chapter" id="chp-joins">
<h1><span id="sec-joins" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Joins</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<h1><span id="sec-joins" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Joins</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
@ -57,14 +49,14 @@ Primary and foreign keys</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">airports
#&gt; # A tibble: 1,458 × 8
#&gt; faa name lat lon alt tz dst tzone
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/Ne
#&gt; 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/Ch
#&gt; 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Ch
#&gt; 4 06N Randall Airport 41.4 -74.4 523 -5 A America/Ne
#&gt; 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/Ne
#&gt; 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/Ne
#&gt; faa name lat lon alt tz dst tzone
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America…
#&gt; 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America…
#&gt; 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America…
#&gt; 4 06N Randall Airport 41.4 -74.4 523 -5 A America…
#&gt; 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America…
#&gt; 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America…
#&gt; # … with 1,452 more rows</pre>
</div>
</li>
@ -73,14 +65,14 @@ Primary and foreign keys</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">planes
#&gt; # A tibble: 3,322 × 9
#&gt; tailnum year type manuf…¹ model engines seats speed engine
#&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 N10156 2004 Fixed wing multi engine EMBRAER EMB-… 2 55 NA Turbo…
#&gt; 2 N102UW 1998 Fixed wing multi engine AIRBUS… A320… 2 182 NA Turbo…
#&gt; 3 N103US 1999 Fixed wing multi engine AIRBUS… A320… 2 182 NA Turbo…
#&gt; 4 N104UW 1999 Fixed wing multi engine AIRBUS… A320… 2 182 NA Turbo…
#&gt; 5 N10575 2002 Fixed wing multi engine EMBRAER EMB-… 2 55 NA Turbo…
#&gt; 6 N105UW 1999 Fixed wing multi engine AIRBUS… A320… 2 182 NA Turbo…
#&gt; tailnum year type manuf…¹ model engines seats speed engine
#&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 N10156 2004 Fixed wing multi en EMBRAER EMB-… 2 55 NA Turbo…
#&gt; 2 N102UW 1998 Fixed wing multi en AIRBUS… A320… 2 182 NA Turbo…
#&gt; 3 N103US 1999 Fixed wing multi en AIRBUS… A320… 2 182 NA Turbo…
#&gt; 4 N104UW 1999 Fixed wing multi en AIRBUS… A320… 2 182 NA Turbo…
#&gt; 5 N10575 2002 Fixed wing multi en EMBRAER EMB-… 2 55 NA Turbo…
#&gt; 6 N105UW 1999 Fixed wing multi en AIRBUS… A320… 2 182 NA Turbo…
#&gt; # … with 3,316 more rows, and abbreviated variable name ¹manufacturer</pre>
</div>
</li>
@ -89,16 +81,17 @@ Primary and foreign keys</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">weather
#&gt; # A tibble: 26,115 × 15
#&gt; origin year month day hour temp dewp humid wind_dir wind_speed wind_gust
#&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA
#&gt; 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA
#&gt; 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 NA
#&gt; 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7 NA
#&gt; 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7 NA
#&gt; 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5 NA
#&gt; # … with 26,109 more rows, and 4 more variables: precip &lt;dbl&gt;, pressure &lt;dbl&gt;,
#&gt; # visib &lt;dbl&gt;, time_hour &lt;dttm&gt;</pre>
#&gt; origin year month day hour temp dewp humid wind_dir wind_sp…¹ wind_…²
#&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA
#&gt; 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA
#&gt; 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 NA
#&gt; 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7 NA
#&gt; 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7 NA
#&gt; 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5 NA
#&gt; # … with 26,109 more rows, 4 more variables: precip &lt;dbl&gt;, pressure &lt;dbl&gt;,
#&gt; # visib &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹wind_speed, ²wind_gust</pre>
</div>
</li>
</ul><p>A <strong>foreign key</strong> is a variable (or set of variables) that corresponds to a primary key in another table. For example:</p>
@ -147,8 +140,8 @@ weather |&gt;
filter(is.na(tailnum))
#&gt; # A tibble: 0 × 9
#&gt; # … with 9 variables: tailnum &lt;chr&gt;, year &lt;int&gt;, type &lt;chr&gt;,
#&gt; # manufacturer &lt;chr&gt;, model &lt;chr&gt;, engines &lt;int&gt;, seats &lt;int&gt;, speed &lt;int&gt;,
#&gt; # engine &lt;chr&gt;
#&gt; # manufacturer &lt;chr&gt;, model &lt;chr&gt;, engines &lt;int&gt;, seats &lt;int&gt;,
#&gt; # speed &lt;int&gt;, engine &lt;chr&gt;
weather |&gt;
filter(is.na(time_hour) | is.na(origin))
@ -189,18 +182,19 @@ Surrogate keys</h2>
mutate(id = row_number(), .before = 1)
flights2
#&gt; # A tibble: 336,776 × 20
#&gt; id year month day dep_time sched_dep_t…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 1 2013 1 1 517 515 2 830 819 11
#&gt; 2 2 2013 1 1 533 529 4 850 830 20
#&gt; 3 3 2013 1 1 542 540 2 923 850 33
#&gt; 4 4 2013 1 1 544 545 -1 1004 1022 -18
#&gt; 5 5 2013 1 1 554 600 -6 812 837 -25
#&gt; 6 6 2013 1 1 554 558 -4 740 728 12
#&gt; id year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 1 2013 1 1 517 515 2 830 819 11
#&gt; 2 2 2013 1 1 533 529 4 850 830 20
#&gt; 3 3 2013 1 1 542 540 2 923 850 33
#&gt; 4 4 2013 1 1 544 545 -1 1004 1022 -18
#&gt; 5 5 2013 1 1 554 600 -6 812 837 -25
#&gt; 6 6 2013 1 1 554 558 -4 740 728 12
#&gt; # … with 336,770 more rows, 10 more variables: carrier &lt;chr&gt;, flight &lt;int&gt;,
#&gt; # tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;,
#&gt; # hour &lt;dbl&gt;, minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
#&gt; # hour &lt;dbl&gt;, minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable
#&gt; # names ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time,
#&gt; # ⁵arr_delay</pre>
</div>
<p>Surrogate keys can be particularly useful when communicating to other humans: it's much easier to tell someone to take a look at flight 2001 than to say look at UA430, which departed at 9am on 2013-01-03.</p>
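<p>As a minimal sketch of that idea, assuming a small made-up tibble (<code>trips</code> and its values are hypothetical and not part of nycflights13), a surrogate key built the same way as in <code>flights2</code> lets you pick out a row with a single number instead of spelling out every part of the compound key:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

# Hypothetical data: each row is identified by carrier + flight + time_hour
trips &lt;- tibble(
  carrier = c("UA", "UA", "AA"),
  flight = c(430, 430, 12),
  time_hour = as.POSIXct(
    c("2013-01-03 09:00:00", "2013-01-04 09:00:00", "2013-01-03 09:00:00"),
    tz = "UTC"
  )
)

# Add a surrogate key in the first column
trips2 &lt;- trips |&gt;
  mutate(id = row_number(), .before = 1)

# Look up a row by its surrogate key...
trips2 |&gt;
  filter(id == 2)

# ...instead of by every component of the compound key
trips2 |&gt;
  filter(
    carrier == "UA",
    flight == 430,
    time_hour == as.POSIXct("2013-01-04 09:00:00", tz = "UTC")
  )</pre>
</div>
<p>Both filters select the same row. The surrogate key is only meaningful within this particular table, so it complements the compound key rather than replacing it.</p>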
</section>
@ -247,14 +241,14 @@ flights2
left_join(airlines)
#&gt; Joining with `by = join_by(carrier)`
#&gt; # A tibble: 336,776 × 7
#&gt; year time_hour origin dest tailnum carrier name
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA United Air Lines Inc.
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA United Air Lines Inc.
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA American Airlines Inc.
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 JetBlue Airways
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Delta Air Lines Inc.
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA United Air Lines Inc.
#&gt; year time_hour origin dest tailnum carrier name
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA United Air Lines In
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA United Air Lines In
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA American Airlines I
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 JetBlue Airways
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Delta Air Lines Inc.
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA United Air Lines In
#&gt; # … with 336,770 more rows</pre>
</div>
<p>Or we could find out the temperature and wind speed when each plane departed:</p>
@ -279,14 +273,14 @@ flights2
left_join(planes |&gt; select(tailnum, type, engines, seats))
#&gt; Joining with `by = join_by(tailnum)`
#&gt; # A tibble: 336,776 × 9
#&gt; year time_hour origin dest tailnum carrier type engines seats
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Fixed wi… 2 149
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA Fixed wi… 2 149
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA Fixed wi… 2 178
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 Fixed wi… 2 200
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Fixed wi… 2 178
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Fixed wi… 2 191
#&gt; year time_hour origin dest tailnum carrier type engines seats
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Fixed… 2 149
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA Fixed… 2 149
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA Fixed… 2 178
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 Fixed… 2 200
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Fixed… 2 178
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Fixed… 2 191
#&gt; # … with 336,770 more rows</pre>
</div>
<p>When <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code> fails to find a match for a row in <code>x</code>, it fills in the new variables with missing values. For example, there's no information about the plane with tail number <code>N3ALAA</code> so the <code>type</code>, <code>engines</code>, and <code>seats</code> will be missing:</p>
@ -318,14 +312,14 @@ Specifying join keys</h2>
left_join(planes)
#&gt; Joining with `by = join_by(year, tailnum)`
#&gt; # A tibble: 336,776 × 13
#&gt; year time_hour origin dest tailnum carrier type manufactu…¹ model
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; year time_hour origin dest tailnum carrier type manufa…¹ model
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; # … with 336,770 more rows, 4 more variables: engines &lt;int&gt;, seats &lt;int&gt;,
#&gt; # speed &lt;int&gt;, engine &lt;chr&gt;, and abbreviated variable name ¹manufacturer</pre>
</div>
@ -334,17 +328,16 @@ Specifying join keys</h2>
<pre data-type="programlisting" data-code-language="downlit">flights2 |&gt;
left_join(planes, join_by(tailnum))
#&gt; # A tibble: 336,776 × 14
#&gt; year.x time_hour origin dest tailnum carrier year.y type manuf…¹
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA 1999 Fixed … BOEING
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA 1998 Fixed … BOEING
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA 1990 Fixed … BOEING
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 2012 Fixed … AIRBUS
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL 1991 Fixed … BOEING
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA 2012 Fixed … BOEING
#&gt; # … with 336,770 more rows, 5 more variables: model &lt;chr&gt;, engines &lt;int&gt;,
#&gt; # seats &lt;int&gt;, speed &lt;int&gt;, engine &lt;chr&gt;, and abbreviated variable name
#&gt; # ¹manufacturer</pre>
#&gt; year.x time_hour origin dest tailnum carrier year.y type
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA 1999 Fixed wing …
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA 1998 Fixed wing …
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA 1990 Fixed wing …
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 2012 Fixed wing …
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL 1991 Fixed wing …
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA 2012 Fixed wing …
#&gt; # … with 336,770 more rows, and 6 more variables: manufacturer &lt;chr&gt;,
#&gt; # model &lt;chr&gt;, engines &lt;int&gt;, seats &lt;int&gt;, speed &lt;int&gt;, engine &lt;chr&gt;</pre>
</div>
<p>Note that the <code>year</code> variables are disambiguated in the output with a suffix (<code>year.x</code> and <code>year.y</code>), which tells you whether the variable came from the <code>x</code> or <code>y</code> argument. You can override the default suffixes with the <code>suffix</code> argument.</p>
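<p>For instance, here is a small sketch of the <code>suffix</code> argument, assuming two made-up tibbles (<code>trips</code> and <code>fleet</code> are hypothetical stand-ins for <code>flights2</code> and <code>planes</code>) that share a key and a non-key <code>year</code> column:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

# Hypothetical tables: tailnum is the key, year appears in both
trips &lt;- tibble(tailnum = c("N1", "N2"), year = c(2013, 2013))
fleet &lt;- tibble(tailnum = c("N1", "N2"), year = c(1999, 2004))

# Replace the default .x and .y suffixes with more descriptive ones
joined &lt;- trips |&gt;
  left_join(fleet, join_by(tailnum), suffix = c("_flight", "_plane"))

names(joined)</pre>
</div>
<p>The first element of <code>suffix</code> is applied to the clashing column from <code>x</code> and the second to the one from <code>y</code>; the key column keeps its original name.</p>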
<p><code>join_by(tailnum)</code> is short for <code>join_by(tailnum == tailnum)</code>. It's important to know about this fuller form for two reasons. Firstly, it describes the relationship between the two tables: the keys must be equal. That's why this type of join is often called an <strong>equi-join</strong>. You'll learn about non-equi-joins in <a href="#sec-non-equi-joins" data-type="xref">#sec-non-equi-joins</a>.</p>
@ -353,30 +346,30 @@ Specifying join keys</h2>
<pre data-type="programlisting" data-code-language="downlit">flights2 |&gt;
left_join(airports, join_by(dest == faa))
#&gt; # A tibble: 336,776 × 13
#&gt; year time_hour origin dest tailnum carrier name lat lon alt
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Geor… 30.0 -95.3 97
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA Geor… 30.0 -95.3 97
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA Miam… 25.8 -80.3 8
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 &lt;NA&gt; NA NA NA
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Hart… 33.6 -84.4 1026
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Chic… 42.0 -87.9 668
#&gt; # … with 336,770 more rows, and 3 more variables: tz &lt;dbl&gt;, dst &lt;chr&gt;,
#&gt; # tzone &lt;chr&gt;
#&gt; year time_hour origin dest tailnum carrier name lat lon
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA George … 30.0 -95.3
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA George … 30.0 -95.3
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA Miami I… 25.8 -80.3
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 &lt;NA&gt; NA NA
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Hartsfi… 33.6 -84.4
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Chicago… 42.0 -87.9
#&gt; # … with 336,770 more rows, and 4 more variables: alt &lt;dbl&gt;, tz &lt;dbl&gt;,
#&gt; # dst &lt;chr&gt;, tzone &lt;chr&gt;
flights2 |&gt;
left_join(airports, join_by(origin == faa))
#&gt; # A tibble: 336,776 × 13
#&gt; year time_hour origin dest tailnum carrier name lat lon alt
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Newa… 40.7 -74.2 18
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA La G… 40.8 -73.9 22
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA John… 40.6 -73.8 13
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 John… 40.6 -73.8 13
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL La G… 40.8 -73.9 22
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Newa… 40.7 -74.2 18
#&gt; # … with 336,770 more rows, and 3 more variables: tz &lt;dbl&gt;, dst &lt;chr&gt;,
#&gt; # tzone &lt;chr&gt;</pre>
#&gt; year time_hour origin dest tailnum carrier name lat lon
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Newark … 40.7 -74.2
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA La Guar… 40.8 -73.9
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA John F … 40.6 -73.8
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 John F … 40.6 -73.8
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL La Guar… 40.8 -73.9
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Newark … 40.7 -74.2
#&gt; # … with 336,770 more rows, and 4 more variables: alt &lt;dbl&gt;, tz &lt;dbl&gt;,
#&gt; # dst &lt;chr&gt;, tzone &lt;chr&gt;</pre>
</div>
<p>In older code you might see a different way of specifying the join keys, using a character vector:</p>
<ul><li>
@ -405,14 +398,14 @@ Filtering joins</h2>
<pre data-type="programlisting" data-code-language="downlit">airports |&gt;
semi_join(flights2, join_by(faa == dest))
#&gt; # A tibble: 101 × 8
#&gt; faa name lat lon alt tz dst tzone
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 ABQ Albuquerque International Sunport 35.0 -107. 5355 -7 A Americ
#&gt; 2 ACK Nantucket Mem 41.3 -70.1 48 -5 A Americ
#&gt; 3 ALB Albany Intl 42.7 -73.8 285 -5 A Americ
#&gt; 4 ANC Ted Stevens Anchorage Intl 61.2 -150. 152 -9 A Americ
#&gt; 5 ATL Hartsfield Jackson Atlanta Intl 33.6 -84.4 1026 -5 A Americ
#&gt; 6 AUS Austin Bergstrom Intl 30.2 -97.7 542 -6 A Americ
#&gt; faa name lat lon alt tz dst tzone
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 ABQ Albuquerque International Sunpo… 35.0 -107. 5355 -7 A Amer
#&gt; 2 ACK Nantucket Mem 41.3 -70.1 48 -5 A Amer…
#&gt; 3 ALB Albany Intl 42.7 -73.8 285 -5 A Amer…
#&gt; 4 ANC Ted Stevens Anchorage Intl 61.2 -150. 152 -9 A Amer…
#&gt; 5 ATL Hartsfield Jackson Atlanta Intl 33.6 -84.4 1026 -5 A Amer…
#&gt; 6 AUS Austin Bergstrom Intl 30.2 -97.7 542 -6 A Amer…
#&gt; # … with 95 more rows</pre>
</div>
<p><strong>Anti-joins</strong> are the opposite: they return all rows in <code>x</code> that don't have a match in <code>y</code>. They're useful for finding missing values that are <strong>implicit</strong> in the data, the topic of <a href="#sec-missing-implicit" data-type="xref">#sec-missing-implicit</a>. Implicitly missing values don't show up as <code>NA</code>s but instead only exist as an absence. For example, we can find rows that are missing from <code>airports</code> by looking for flights that don't have a matching destination airport:</p>
@ -664,14 +657,14 @@ Allow multiple rows</h2>
plane_flights
#&gt; # A tibble: 284,170 × 9
#&gt; tailnum type engines seats year time_hour origin dest carrier
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 N10156 Fixed wi… 2 55 2013 2013-01-10 06:00:00 EWR PIT EV
#&gt; 2 N10156 Fixed wi… 2 55 2013 2013-01-10 10:00:00 EWR CHS EV
#&gt; 3 N10156 Fixed wi… 2 55 2013 2013-01-10 15:00:00 EWR MSP EV
#&gt; 4 N10156 Fixed wi… 2 55 2013 2013-01-11 06:00:00 EWR CMH EV
#&gt; 5 N10156 Fixed wi… 2 55 2013 2013-01-11 11:00:00 EWR MCI EV
#&gt; 6 N10156 Fixed wi… 2 55 2013 2013-01-11 18:00:00 EWR PWM EV
#&gt; tailnum type engines seats year time_hour origin dest carrier
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 N10156 Fixed… 2 55 2013 2013-01-10 06:00:00 EWR PIT EV
#&gt; 2 N10156 Fixed… 2 55 2013 2013-01-10 10:00:00 EWR CHS EV
#&gt; 3 N10156 Fixed… 2 55 2013 2013-01-10 15:00:00 EWR MSP EV
#&gt; 4 N10156 Fixed… 2 55 2013 2013-01-11 06:00:00 EWR CMH EV
#&gt; 5 N10156 Fixed… 2 55 2013 2013-01-11 11:00:00 EWR MCI EV
#&gt; 6 N10156 Fixed… 2 55 2013 2013-01-11 18:00:00 EWR PWM EV
#&gt; # … with 284,164 more rows</pre>
</div>
</section>


@ -1,13 +1,5 @@
<section data-type="chapter" id="chp-logicals">
<h1><span id="sec-logicals" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Logical vectors</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<h1><span id="sec-logicals" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Logical vectors</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
@ -55,14 +47,14 @@ Comparisons</h1>
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(dep_time &gt; 600 &amp; dep_time &lt; 2000 &amp; abs(arr_delay) &lt; 20)
#&gt; # A tibble: 172,286 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 601 600 1 844 850 -6 B6
#&gt; 2 2013 1 1 602 610 -8 812 820 -8 DL
#&gt; 3 2013 1 1 602 605 -3 821 805 16 MQ
#&gt; 4 2013 1 1 606 610 -4 858 910 -12 AA
#&gt; 5 2013 1 1 606 610 -4 837 845 -8 DL
#&gt; 6 2013 1 1 607 607 0 858 915 -17 UA
#&gt; year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 601 600 1 844 850 -6 B6
#&gt; 2 2013 1 1 602 610 -8 812 820 -8 DL
#&gt; 3 2013 1 1 602 605 -3 821 805 16 MQ
#&gt; 4 2013 1 1 606 610 -4 858 910 -12 AA
#&gt; 5 2013 1 1 606 610 -4 837 845 -8 DL
#&gt; 6 2013 1 1 607 607 0 858 915 -17 UA
#&gt; # … with 172,280 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
@ -185,14 +177,14 @@ is.na(c("a", NA, "b"))
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(is.na(dep_time))
#&gt; # A tibble: 8,255 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 NA 1630 NA NA 1815 NA EV
#&gt; 2 2013 1 1 NA 1935 NA NA 2240 NA AA
#&gt; 3 2013 1 1 NA 1500 NA NA 1825 NA AA
#&gt; 4 2013 1 1 NA 600 NA NA 901 NA B6
#&gt; 5 2013 1 2 NA 1540 NA NA 1747 NA EV
#&gt; 6 2013 1 2 NA 1620 NA NA 1746 NA EV
#&gt; year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 NA 1630 NA NA 1815 NA EV
#&gt; 2 2013 1 1 NA 1935 NA NA 2240 NA AA
#&gt; 3 2013 1 1 NA 1500 NA NA 1825 NA AA
#&gt; 4 2013 1 1 NA 600 NA NA 901 NA B6
#&gt; 5 2013 1 2 NA 1540 NA NA 1747 NA EV
#&gt; 6 2013 1 2 NA 1620 NA NA 1746 NA EV
#&gt; # … with 8,249 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
@ -204,14 +196,14 @@ is.na(c("a", NA, "b"))
filter(month == 1, day == 1) |&gt;
arrange(dep_time)
#&gt; # A tibble: 842 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; # … with 836 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
@ -221,14 +213,14 @@ flights |&gt;
filter(month == 1, day == 1) |&gt;
arrange(desc(is.na(dep_time)), dep_time)
#&gt; # A tibble: 842 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 NA 1630 NA NA 1815 NA EV
#&gt; 2 2013 1 1 NA 1935 NA NA 2240 NA AA
#&gt; 3 2013 1 1 NA 1500 NA NA 1825 NA AA
#&gt; 4 2013 1 1 NA 600 NA NA 901 NA B6
#&gt; 5 2013 1 1 517 515 2 830 819 11 UA
#&gt; 6 2013 1 1 533 529 4 850 830 20 UA
#&gt; year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 NA 1630 NA NA 1815 NA EV
#&gt; 2 2013 1 1 NA 1935 NA NA 2240 NA AA
#&gt; 3 2013 1 1 NA 1500 NA NA 1825 NA AA
#&gt; 4 2013 1 1 NA 600 NA NA 901 NA B6
#&gt; 5 2013 1 1 517 515 2 830 819 11 UA
#&gt; 6 2013 1 1 533 529 4 850 830 20 UA
#&gt; # … with 836 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
@ -294,14 +286,14 @@ Order of operations</h2>
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(month == 11 | 12)
#&gt; # A tibble: 336,776 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; # … with 336,770 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
@ -356,14 +348,14 @@ c(1, 2, NA) %in% NA
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(dep_time %in% c(NA, 0800))
#&gt; # A tibble: 8,803 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 800 800 0 1022 1014 8 DL
#&gt; 2 2013 1 1 800 810 -10 949 955 -6 MQ
#&gt; 3 2013 1 1 NA 1630 NA NA 1815 NA EV
#&gt; 4 2013 1 1 NA 1935 NA NA 2240 NA AA
#&gt; 5 2013 1 1 NA 1500 NA NA 1825 NA AA
#&gt; 6 2013 1 1 NA 600 NA NA 901 NA B6
#&gt; year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 800 800 0 1022 1014 8 DL
#&gt; 2 2013 1 1 800 810 -10 949 955 -6 MQ
#&gt; 3 2013 1 1 NA 1630 NA NA 1815 NA EV
#&gt; 4 2013 1 1 NA 1935 NA NA 2240 NA AA
#&gt; 5 2013 1 1 NA 1500 NA NA 1825 NA AA
#&gt; 6 2013 1 1 NA 600 NA NA 901 NA B6
#&gt; # … with 8,797 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names

@ -1,13 +1,5 @@
<section data-type="chapter" id="chp-missing-values">
<h1><span id="sec-missing-values" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Missing values</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<h1><span id="sec-missing-values" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Missing values</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>

@ -1,13 +1,5 @@
<section data-type="chapter" id="chp-numbers">
<h1><span id="sec-numbers" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Numbers</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<h1><span id="sec-numbers" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Numbers</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
@ -218,14 +210,14 @@ x * c(1, 2, 3)
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(month == c(1, 2))
#&gt; # A tibble: 25,977 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 542 540 2 923 850 33 AA
#&gt; 3 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 4 2013 1 1 555 600 -5 913 854 19 B6
#&gt; 5 2013 1 1 557 600 -3 838 846 -8 B6
#&gt; 6 2013 1 1 558 600 -2 849 851 -2 B6
#&gt; year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 542 540 2 923 850 33 AA
#&gt; 3 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 4 2013 1 1 555 600 -5 913 854 19 B6
#&gt; 5 2013 1 1 557 600 -3 838 846 -8 B6
#&gt; 6 2013 1 1 558 600 -2 849 851 -2 B6
#&gt; # … with 25,971 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
@ -759,8 +751,8 @@ Positions</h2>
fifth_dep = nth(dep_time, 5),
last_dep = last(dep_time)
)
#&gt; `summarise()` has grouped output by 'year', 'month'. You can override using the
#&gt; `.groups` argument.
#&gt; `summarise()` has grouped output by 'year', 'month'. You can override using
#&gt; the `.groups` argument.
#&gt; # A tibble: 365 × 6
#&gt; # Groups: year, month [12]
#&gt; year month day first_dep fifth_dep last_dep
@ -783,14 +775,14 @@ Positions</h2>
filter(r %in% c(1, max(r)))
#&gt; # A tibble: 1,195 × 20
#&gt; # Groups: year, month, day [365]
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 2353 2359 -6 425 445 -20 B6
#&gt; 3 2013 1 1 2353 2359 -6 418 442 -24 B6
#&gt; 4 2013 1 1 2356 2359 -3 425 437 -12 B6
#&gt; 5 2013 1 2 42 2359 43 518 442 36 B6
#&gt; 6 2013 1 2 458 500 -2 703 650 13 US
#&gt; year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 2353 2359 -6 425 445 -20 B6
#&gt; 3 2013 1 1 2353 2359 -6 418 442 -24 B6
#&gt; 4 2013 1 1 2356 2359 -3 425 437 -12 B6
#&gt; 5 2013 1 2 42 2359 43 518 442 36 B6
#&gt; 6 2013 1 2 458 500 -2 703 650 13 US
#&gt; # … with 1,189 more rows, 10 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, r &lt;int&gt;, and abbreviated variable names

@ -1,13 +1,5 @@
<section data-type="chapter" id="chp-quarto-formats">
<h1><span id="sec-quarto-formats" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Quarto formats</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<h1><span id="sec-quarto-formats" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Quarto formats</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>

@ -1,13 +1,5 @@
<section data-type="chapter" id="chp-quarto-workflow">
<h1><span id="sec-quarto-workflow" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Quarto workflow</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<p>Earlier, we discussed a basic workflow for capturing your R code where you work interactively in the <em>console</em>, then capture what works in the <em>script editor</em>. Quarto brings together the console and the script editor, blurring the lines between interactive exploration and long-term code capture. You can rapidly iterate within a chunk, editing and re-executing with Cmd/Ctrl + Shift + Enter. When you’re happy, you move on and start a new chunk.</p><p>Quarto is also important because it so tightly integrates prose and code. This makes it a great <strong>analysis notebook</strong> because it lets you develop code and record your thoughts. An analysis notebook shares many of the same goals as a classic lab notebook in the physical sciences. It:</p><ul><li><p>Records what you did and why you did it. Regardless of how great your memory is, if you don’t record what you do, there will come a time when you have forgotten important details. Write them down so you don’t forget!</p></li>
<h1><span id="sec-quarto-workflow" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Quarto workflow</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p><p>Earlier, we discussed a basic workflow for capturing your R code where you work interactively in the <em>console</em>, then capture what works in the <em>script editor</em>. Quarto brings together the console and the script editor, blurring the lines between interactive exploration and long-term code capture. You can rapidly iterate within a chunk, editing and re-executing with Cmd/Ctrl + Shift + Enter. When you’re happy, you move on and start a new chunk.</p><p>Quarto is also important because it so tightly integrates prose and code. This makes it a great <strong>analysis notebook</strong> because it lets you develop code and record your thoughts. An analysis notebook shares many of the same goals as a classic lab notebook in the physical sciences. It:</p><ul><li><p>Records what you did and why you did it. Regardless of how great your memory is, if you don’t record what you do, there will come a time when you have forgotten important details. Write them down so you don’t forget!</p></li>
<li><p>Supports rigorous thinking. You are more likely to come up with a strong analysis if you record your thoughts as you go, and continue to reflect on them. This also saves you time when you eventually write up your analysis to share with others.</p></li>
<li><p>Helps others understand your work. It is rare to do data analysis by yourself, and you’ll often be working as part of a team. A lab notebook helps you share not only what you’ve done, but why you did it with your colleagues or lab mates.</p></li>
</ul><p>Much of the good advice about using lab notebooks effectively can also be translated to analysis notebooks. We’ve drawn on our own experiences and Colin Purrington’s advice on lab notebooks (<a href="https://colinpurrington.com/tips/lab-notebooks" class="uri">https://colinpurrington.com/tips/lab-notebooks</a>) to come up with the following tips:</p><ul><li><p>Ensure each notebook has a descriptive title, an evocative file name, and a first paragraph that briefly describes the aims of the analysis.</p></li>

@ -1,13 +1,5 @@
<section data-type="chapter" id="chp-quarto">
<h1><span id="sec-quarto" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Quarto</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<h1><span id="sec-quarto" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Quarto</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>

@ -1,29 +1,5 @@
<section data-type="chapter" id="chp-rectangling">
<h1><span id="sec-rectangling" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data rectangling</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div><h1>
Base R
</h1><p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p>
<p>It’s possible to put a list in a column of a <code>data.frame</code>, but it’s a lot fiddlier because <code><a href="https://rdrr.io/r/base/data.frame.html">data.frame()</a></code> treats a list as a list of columns:</p><div class="cell">
<pre data-type="programlisting" data-code-language="downlit">data.frame(x = list(1:3, 3:5))
#&gt; x.1.3 x.3.5
#&gt; 1 1 3
#&gt; 2 2 4
#&gt; 3 3 5</pre>
</div><p>You can force <code><a href="https://rdrr.io/r/base/data.frame.html">data.frame()</a></code> to treat a list as a list of rows by wrapping it in <code><a href="https://rdrr.io/r/base/AsIs.html">I()</a></code>, but the result doesn’t print particularly well:</p><div class="cell">
<pre data-type="programlisting" data-code-language="downlit">data.frame(
x = I(list(1:2, 3:5)),
y = c("1, 2", "3, 4, 5")
)
#&gt; x y
#&gt; 1 1, 2 1, 2
#&gt; 2 3, 4, 5 3, 4, 5</pre>
</div><p>It’s easier to use list-columns with tibbles because <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> treats lists like vectors and the print method has been designed with lists in mind.</p></div>
<h1><span id="sec-rectangling" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data rectangling</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
@ -198,9 +174,7 @@ df
<p>Similarly, if you <code><a href="https://rdrr.io/r/utils/View.html">View()</a></code> a data frame in RStudio, you’ll get the standard tabular view, which doesn’t allow you to selectively expand list columns. To explore those fields you’ll need to <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code> and view, e.g. <code>df |&gt; pull(z) |&gt; View()</code>.</p>
<div data-type="note"><h1>
Base R
</h1><p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p>
<p>It’s possible to put a list in a column of a <code>data.frame</code>, but it’s a lot fiddlier because <code><a href="https://rdrr.io/r/base/data.frame.html">data.frame()</a></code> treats a list as a list of columns:</p><div class="cell">
</h1><p>It’s possible to put a list in a column of a <code>data.frame</code>, but it’s a lot fiddlier because <code><a href="https://rdrr.io/r/base/data.frame.html">data.frame()</a></code> treats a list as a list of columns:</p><div class="cell">
<pre data-type="programlisting" data-code-language="downlit">data.frame(x = list(1:3, 3:5))
#&gt; x.1.3 x.3.5
#&gt; 1 1 3
@ -486,15 +460,15 @@ repos
unnest_longer(json) |&gt;
unnest_wider(json)
#&gt; # A tibble: 176 × 68
#&gt; id name full_…¹ owner private html_…² descr…³ fork url forks…⁴
#&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;list&gt; &lt;lgl&gt; &lt;chr&gt; &lt;chr&gt; &lt;lgl&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 6.12e7 after gaborc… &lt;named list&gt; FALSE https:… Run Co… FALSE http… https:…
#&gt; 2 4.05e7 argu… gaborc… &lt;named list&gt; FALSE https:… Declar… FALSE http… https:…
#&gt; 3 3.64e7 ask gaborc… &lt;named list&gt; FALSE https:… Friend… FALSE http… https:…
#&gt; 4 3.49e7 base… gaborc… &lt;named list&gt; FALSE https:… Do we … FALSE http… https:…
#&gt; 5 6.16e7 cite… gaborc… &lt;named list&gt; FALSE https:… Test R… TRUE http… https:…
#&gt; 6 3.39e7 clis… gaborc… &lt;named list&gt; FALSE https:… Unicod… FALSE http… https:…
#&gt; # … with 170 more rows, 58 more variables: keys_url &lt;chr&gt;,
#&gt; id name full_…¹ owner private html_…² descr…³ fork url
#&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;list&gt; &lt;lgl&gt; &lt;chr&gt; &lt;chr&gt; &lt;lgl&gt; &lt;chr&gt;
#&gt; 1 61160198 after gaborc… &lt;named list&gt; FALSE https:… Run Co… FALSE http…
#&gt; 2 40500181 argufy gaborc… &lt;named list&gt; FALSE https:… Declar… FALSE http…
#&gt; 3 36442442 ask gaborc… &lt;named list&gt; FALSE https:… Friend… FALSE http…
#&gt; 4 34924886 baseimpo… gaborc… &lt;named list&gt; FALSE https:… Do we … FALSE http…
#&gt; 5 61620661 citest gaborc… &lt;named list&gt; FALSE https:… Test R… TRUE http…
#&gt; 6 33907457 clisymbo… gaborc… &lt;named list&gt; FALSE https:… Unicod… FALSE http…
#&gt; # … with 170 more rows, 59 more variables: forks_url &lt;chr&gt;, keys_url &lt;chr&gt;,
#&gt; # collaborators_url &lt;chr&gt;, teams_url &lt;chr&gt;, hooks_url &lt;chr&gt;,
#&gt; # issue_events_url &lt;chr&gt;, events_url &lt;chr&gt;, assignees_url &lt;chr&gt;,
#&gt; # branches_url &lt;chr&gt;, tags_url &lt;chr&gt;, blobs_url &lt;chr&gt;, git_tags_url &lt;chr&gt;,
@ -539,14 +513,14 @@ repos
unnest_wider(json) |&gt;
select(id, full_name, owner, description)
#&gt; # A tibble: 176 × 4
#&gt; id full_name owner description
#&gt; &lt;int&gt; &lt;chr&gt; &lt;list&gt; &lt;chr&gt;
#&gt; 1 61160198 gaborcsardi/after &lt;named list [17]&gt; Run Code in the Background
#&gt; 2 40500181 gaborcsardi/argufy &lt;named list [17]&gt; Declarative function argum…
#&gt; 3 36442442 gaborcsardi/ask &lt;named list [17]&gt; Friendly CLI interaction i…
#&gt; 4 34924886 gaborcsardi/baseimports &lt;named list [17]&gt; Do we get warnings for und…
#&gt; 5 61620661 gaborcsardi/citest &lt;named list [17]&gt; Test R package and repo fo…
#&gt; 6 33907457 gaborcsardi/clisymbols &lt;named list [17]&gt; Unicode symbols for CLI ap…
#&gt; id full_name owner description
#&gt; &lt;int&gt; &lt;chr&gt; &lt;list&gt; &lt;chr&gt;
#&gt; 1 61160198 gaborcsardi/after &lt;named list [17]&gt; Run Code in the Backgro…
#&gt; 2 40500181 gaborcsardi/argufy &lt;named list [17]&gt; Declarative function ar…
#&gt; 3 36442442 gaborcsardi/ask &lt;named list [17]&gt; Friendly CLI interactio…
#&gt; 4 34924886 gaborcsardi/baseimports &lt;named list [17]&gt; Do we get warnings for …
#&gt; 5 61620661 gaborcsardi/citest &lt;named list [17]&gt; Test R package and repo…
#&gt; 6 33907457 gaborcsardi/clisymbols &lt;named list [17]&gt; Unicode symbols for CLI…
#&gt; # … with 170 more rows</pre>
</div>
<p>You can use this to work back to understand how <code>gh_repos</code> was structured: each child was a GitHub user containing a list of up to 30 GitHub repositories that they created.</p>
@ -572,21 +546,21 @@ repos
select(id, full_name, owner, description) |&gt;
unnest_wider(owner, names_sep = "_")
#&gt; # A tibble: 176 × 20
#&gt; id full_…¹ owner…² owner…³ owner…⁴ owner…⁵ owner…⁶ owner…⁷ owner…⁸ owner…⁹
#&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 6.12e7 gaborc… gaborc… 660288 https:… "" https:… https:… https:… https:…
#&gt; 2 4.05e7 gaborc… gaborc… 660288 https:… "" https:… https:… https:… https:…
#&gt; 3 3.64e7 gaborc… gaborc… 660288 https:… "" https:… https:… https:… https:…
#&gt; 4 3.49e7 gaborc… gaborc… 660288 https:… "" https:… https:… https:… https:…
#&gt; 5 6.16e7 gaborc… gaborc… 660288 https:… "" https:… https:… https:… https:…
#&gt; 6 3.39e7 gaborc… gaborc… 660288 https:… "" https:… https:… https:… https:…
#&gt; # … with 170 more rows, 10 more variables: owner_gists_url &lt;chr&gt;,
#&gt; # owner_starred_url &lt;chr&gt;, owner_subscriptions_url &lt;chr&gt;,
#&gt; # owner_organizations_url &lt;chr&gt;, owner_repos_url &lt;chr&gt;,
#&gt; # owner_events_url &lt;chr&gt;, owner_received_events_url &lt;chr&gt;, owner_type &lt;chr&gt;,
#&gt; # owner_site_admin &lt;lgl&gt;, description &lt;chr&gt;, and abbreviated variable names
#&gt; # ¹full_name, ²owner_login, ³owner_id, ⁴owner_avatar_url, ⁵owner_gravatar_id,
#&gt; # ⁶owner_url, ⁷owner_html_url, ⁸owner_followers_url, ⁹owner_following_url</pre>
#&gt; id full_name owner…¹ owner…² owner…³ owner…⁴ owner…⁵ owner…⁶ owner…⁷
#&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 61160198 gaborcsar… gaborc… 660288 https:… "" https:… https:… https:…
#&gt; 2 40500181 gaborcsar… gaborc… 660288 https:… "" https:… https:… https:…
#&gt; 3 36442442 gaborcsar… gaborc… 660288 https:… "" https:… https:… https:…
#&gt; 4 34924886 gaborcsar… gaborc… 660288 https:… "" https:… https:… https:…
#&gt; 5 61620661 gaborcsar… gaborc… 660288 https:… "" https:… https:… https:…
#&gt; 6 33907457 gaborcsar… gaborc… 660288 https:… "" https:… https:… https:…
#&gt; # … with 170 more rows, 11 more variables: owner_following_url &lt;chr&gt;,
#&gt; # owner_gists_url &lt;chr&gt;, owner_starred_url &lt;chr&gt;,
#&gt; # owner_subscriptions_url &lt;chr&gt;, owner_organizations_url &lt;chr&gt;,
#&gt; # owner_repos_url &lt;chr&gt;, owner_events_url &lt;chr&gt;,
#&gt; # owner_received_events_url &lt;chr&gt;, owner_type &lt;chr&gt;,
#&gt; # owner_site_admin &lt;lgl&gt;, description &lt;chr&gt;, and abbreviated variable
#&gt; # names ¹owner_login, ²owner_id, ³owner_avatar_url, ⁴owner_gravatar_id, …</pre>
</div>
<p>This gives another wide dataset, but you can see that <code>owner</code> appears to contain a lot of additional data about the person who “owns” the repository.</p>
</section>
@ -614,14 +588,14 @@ chars
<pre data-type="programlisting" data-code-language="downlit">chars |&gt;
unnest_wider(json)
#&gt; # A tibble: 30 × 18
#&gt; url id name gender culture born died alive titles aliases father
#&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;lgl&gt; &lt;list&gt; &lt;list&gt; &lt;chr&gt;
#&gt; 1 https://ww… 1022 Theo… Male "Ironb… "In … "" TRUE &lt;chr&gt; &lt;chr&gt; ""
#&gt; 2 https://ww… 1052 Tyri… Male "" "In … "" TRUE &lt;chr&gt; &lt;chr&gt; ""
#&gt; 3 https://ww… 1074 Vict… Male "Ironb… "In … "" TRUE &lt;chr&gt; &lt;chr&gt; ""
#&gt; 4 https://ww… 1109 Will Male "" "" "In … FALSE &lt;chr&gt; &lt;chr&gt; ""
#&gt; 5 https://ww… 1166 Areo… Male "Norvo… "In … "" TRUE &lt;chr&gt; &lt;chr&gt; ""
#&gt; 6 https://ww… 1267 Chett Male "" "At … "In … FALSE &lt;chr&gt; &lt;chr&gt; ""
#&gt; url id name gender culture born died alive titles aliases father
#&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;lgl&gt; &lt;list&gt; &lt;list&gt; &lt;chr&gt;
#&gt; 1 https:/… 1022 Theo… Male "Ironb… "In … "" TRUE &lt;chr&gt; &lt;chr&gt; ""
#&gt; 2 https:/… 1052 Tyri… Male "" "In … "" TRUE &lt;chr&gt; &lt;chr&gt; ""
#&gt; 3 https:/… 1074 Vict… Male "Ironb… "In … "" TRUE &lt;chr&gt; &lt;chr&gt; ""
#&gt; 4 https:/… 1109 Will Male "" "" "In … FALSE &lt;chr&gt; &lt;chr&gt; ""
#&gt; 5 https:/… 1166 Areo… Male "Norvo… "In … "" TRUE &lt;chr&gt; &lt;chr&gt; ""
#&gt; 6 https:/… 1267 Chett Male "" "At … "In … FALSE &lt;chr&gt; &lt;chr&gt; ""
#&gt; # … with 24 more rows, and 7 more variables: mother &lt;chr&gt;, spouse &lt;chr&gt;,
#&gt; # allegiances &lt;list&gt;, books &lt;list&gt;, povBooks &lt;list&gt;, tvSeries &lt;list&gt;,
#&gt; # playedBy &lt;list&gt;</pre>
@ -633,14 +607,14 @@ chars
select(id, name, gender, culture, born, died, alive)
characters
#&gt; # A tibble: 30 × 7
#&gt; id name gender culture born died alive
#&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;lgl&gt;
#&gt; 1 1022 Theon Greyjoy Male "Ironborn" "In 278 AC or 279 AC, a… "" TRUE
#&gt; 2 1052 Tyrion Lannister Male "" "In 273 AC, at Casterly… "" TRUE
#&gt; 3 1074 Victarion Greyjoy Male "Ironborn" "In 268 AC or before, a… "" TRUE
#&gt; 4 1109 Will Male "" "" "In … FALSE
#&gt; 5 1166 Areo Hotah Male "Norvoshi" "In 257 AC or before, a… "" TRUE
#&gt; 6 1267 Chett Male "" "At Hag's Mire" "In … FALSE
#&gt; id name gender culture born died alive
#&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;lgl&gt;
#&gt; 1 1022 Theon Greyjoy Male "Ironborn" "In 278 AC or 279 AC… "" TRUE
#&gt; 2 1052 Tyrion Lannister Male "" "In 273 AC, at Caste… "" TRUE
#&gt; 3 1074 Victarion Greyjoy Male "Ironborn" "In 268 AC or before… "" TRUE
#&gt; 4 1109 Will Male "" "" "In … FALSE
#&gt; 5 1166 Areo Hotah Male "Norvoshi" "In 257 AC or before… "" TRUE
#&gt; 6 1267 Chett Male "" "At Hag's Mire" "In … FALSE
#&gt; # … with 24 more rows</pre>
</div>
<p>There are also many list-columns:</p>
@ -649,15 +623,15 @@ characters
unnest_wider(json) |&gt;
select(id, where(is.list))
#&gt; # A tibble: 30 × 8
#&gt; id titles aliases allegiances books povBooks tvSeries playedBy
#&gt; &lt;int&gt; &lt;list&gt; &lt;list&gt; &lt;list&gt; &lt;list&gt; &lt;list&gt; &lt;list&gt; &lt;list&gt;
#&gt; 1 1022 &lt;chr [3]&gt; &lt;chr [4]&gt; &lt;chr [1]&gt; &lt;chr [3]&gt; &lt;chr [2]&gt; &lt;chr [6]&gt; &lt;chr [1]&gt;
#&gt; 2 1052 &lt;chr [2]&gt; &lt;chr [11]&gt; &lt;chr [1]&gt; &lt;chr [2]&gt; &lt;chr [4]&gt; &lt;chr [6]&gt; &lt;chr [1]&gt;
#&gt; 3 1074 &lt;chr [2]&gt; &lt;chr [1]&gt; &lt;chr [1]&gt; &lt;chr [3]&gt; &lt;chr [2]&gt; &lt;chr [1]&gt; &lt;chr [1]&gt;
#&gt; 4 1109 &lt;chr [1]&gt; &lt;chr [1]&gt; &lt;NULL&gt; &lt;chr [1]&gt; &lt;chr [1]&gt; &lt;chr [1]&gt; &lt;chr [1]&gt;
#&gt; 5 1166 &lt;chr [1]&gt; &lt;chr [1]&gt; &lt;chr [1]&gt; &lt;chr [3]&gt; &lt;chr [2]&gt; &lt;chr [2]&gt; &lt;chr [1]&gt;
#&gt; 6 1267 &lt;chr [1]&gt; &lt;chr [1]&gt; &lt;NULL&gt; &lt;chr [2]&gt; &lt;chr [1]&gt; &lt;chr [1]&gt; &lt;chr [1]&gt;
#&gt; # … with 24 more rows</pre>
#&gt; id titles aliases allegiances books povBooks tvSeries playe…¹
#&gt; &lt;int&gt; &lt;list&gt; &lt;list&gt; &lt;list&gt; &lt;list&gt; &lt;list&gt; &lt;list&gt; &lt;list&gt;
#&gt; 1 1022 &lt;chr [3]&gt; &lt;chr [4]&gt; &lt;chr [1]&gt; &lt;chr [3]&gt; &lt;chr [2]&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 2 1052 &lt;chr [2]&gt; &lt;chr [11]&gt; &lt;chr [1]&gt; &lt;chr [2]&gt; &lt;chr [4]&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 3 1074 &lt;chr [2]&gt; &lt;chr [1]&gt; &lt;chr [1]&gt; &lt;chr [3]&gt; &lt;chr [2]&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 4 1109 &lt;chr [1]&gt; &lt;chr [1]&gt; &lt;NULL&gt; &lt;chr [1]&gt; &lt;chr [1]&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 5 1166 &lt;chr [1]&gt; &lt;chr [1]&gt; &lt;chr [1]&gt; &lt;chr [3]&gt; &lt;chr [2]&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 6 1267 &lt;chr [1]&gt; &lt;chr [1]&gt; &lt;NULL&gt; &lt;chr [2]&gt; &lt;chr [1]&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; # … with 24 more rows, and abbreviated variable name ¹playedBy</pre>
</div>
<p>Let’s explore the <code>titles</code> column. It’s an unnamed list-column, so we’ll unnest it into rows:</p>
<div class="cell">
@ -713,14 +687,14 @@ characters |&gt;
select(id, name) |&gt;
inner_join(titles, by = "id", multiple = "all")
#&gt; # A tibble: 53 × 3
#&gt; id name title
#&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1022 Theon Greyjoy Prince of Winterfell
#&gt; 2 1022 Theon Greyjoy Captain of Sea Bitch
#&gt; 3 1022 Theon Greyjoy Lord of the Iron Islands (by law of the green lands)
#&gt; 4 1052 Tyrion Lannister Acting Hand of the King (former)
#&gt; 5 1052 Tyrion Lannister Master of Coin (former)
#&gt; 6 1074 Victarion Greyjoy Lord Captain of the Iron Fleet
#&gt; id name title
#&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1022 Theon Greyjoy Prince of Winterfell
#&gt; 2 1022 Theon Greyjoy Captain of Sea Bitch
#&gt; 3 1022 Theon Greyjoy Lord of the Iron Islands (by law of the green land…
#&gt; 4 1052 Tyrion Lannister Acting Hand of the King (former)
#&gt; 5 1052 Tyrion Lannister Master of Coin (former)
#&gt; 6 1074 Victarion Greyjoy Lord Captain of the Iron Fleet
#&gt; # … with 47 more rows</pre>
</div>
<p>You could imagine creating a table like this for each of the list-columns, then using joins to combine them with the character data as you need it.</p>
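<p>A minimal sketch of that idea, assuming dplyr and tidyr are installed. The tibble <code>chars_small</code> and every value in it are invented for illustration and only mimic the shape of the character data; the names ending in <code>_small</code> are hypothetical too:</p>

```r
library(dplyr)
library(tidyr)

# Hypothetical two-row stand-in with two list-columns (values are made up)
chars_small <- tibble(
  id = c(1L, 2L),
  name = c("A", "B"),
  titles = list(c("T1", "T2"), "T3"),
  aliases = list("X", c("Y", "Z"))
)

# One row per character: the "parent" table
characters_small <- chars_small |>
  select(id, name)

# One long table per list-column, keyed by id
titles_small <- chars_small |>
  select(id, titles) |>
  unnest_longer(titles) |>
  rename(title = titles)

aliases_small <- chars_small |>
  select(id, aliases) |>
  unnest_longer(aliases) |>
  rename(alias = aliases)

# Join back to the character data only when a particular table is needed
character_aliases <- characters_small |>
  inner_join(aliases_small, by = "id")
character_aliases
```

<p>Keeping each list-column in its own long table avoids repeating the character-level fields until a join actually asks for them.</p>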
@ -855,15 +829,15 @@ Deeply nested</h2>
unnest_wider(results)
locations
#&gt; # A tibble: 7 × 6
#&gt; city address_components formatted_address geometry place_id types
#&gt; &lt;chr&gt; &lt;list&gt; &lt;chr&gt; &lt;list&gt; &lt;chr&gt; &lt;list&gt;
#&gt; 1 Houston &lt;list [4]&gt; Houston, TX, USA &lt;named list&gt; ChIJAYW… &lt;list&gt;
#&gt; 2 Washington &lt;list [2]&gt; Washington, USA &lt;named list&gt; ChIJ-bD… &lt;list&gt;
#&gt; 3 Washington &lt;list [4]&gt; Washington, DC, USA &lt;named list&gt; ChIJW-T… &lt;list&gt;
#&gt; 4 New York &lt;list [3]&gt; New York, NY, USA &lt;named list&gt; ChIJOwg… &lt;list&gt;
#&gt; 5 Chicago &lt;list [4]&gt; Chicago, IL, USA &lt;named list&gt; ChIJ7cv… &lt;list&gt;
#&gt; 6 Arlington &lt;list [4]&gt; Arlington, TX, USA &lt;named list&gt; ChIJ05g… &lt;list&gt;
#&gt; # … with 1 more row</pre>
#&gt; city address_components formatted_address geometry place…¹ types
#&gt; &lt;chr&gt; &lt;list&gt; &lt;chr&gt; &lt;list&gt; &lt;chr&gt; &lt;list&gt;
#&gt; 1 Houston &lt;list [4]&gt; Houston, TX, USA &lt;named list&gt; ChIJAY… &lt;list&gt;
#&gt; 2 Washington &lt;list [2]&gt; Washington, USA &lt;named list&gt; ChIJ-b… &lt;list&gt;
#&gt; 3 Washington &lt;list [4]&gt; Washington, DC, … &lt;named list&gt; ChIJW-… &lt;list&gt;
#&gt; 4 New York &lt;list [3]&gt; New York, NY, USA &lt;named list&gt; ChIJOw… &lt;list&gt;
#&gt; 5 Chicago &lt;list [4]&gt; Chicago, IL, USA &lt;named list&gt; ChIJ7c… &lt;list&gt;
#&gt; 6 Arlington &lt;list [4]&gt; Arlington, TX, U… &lt;named list&gt; ChIJ05… &lt;list&gt;
#&gt; # … with 1 more row, and abbreviated variable name ¹place_id</pre>
</div>
<p>Now we can see why two cities got two results: Washington matched both Washington state and Washington, DC, and Arlington matched Arlington, Virginia and Arlington, Texas.</p>
<p>There are a few different places we could go from here. We might want to determine the exact location of the match, which is stored in the <code>geometry</code> list-column:</p>
@ -872,14 +846,14 @@ locations
select(city, formatted_address, geometry) |&gt;
unnest_wider(geometry)
#&gt; # A tibble: 7 × 6
#&gt; city formatted_address bounds location locati…¹ viewport
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;list&gt; &lt;list&gt; &lt;chr&gt; &lt;list&gt;
#&gt; 1 Houston Houston, TX, USA &lt;named list&gt; &lt;named list&gt; APPROXI… &lt;named list&gt;
#&gt; 2 Washington Washington, USA &lt;named list&gt; &lt;named list&gt; APPROXI… &lt;named list&gt;
#&gt; 3 Washington Washington, DC, USA &lt;named list&gt; &lt;named list&gt; APPROXI… &lt;named list&gt;
#&gt; 4 New York New York, NY, USA &lt;named list&gt; &lt;named list&gt; APPROXI… &lt;named list&gt;
#&gt; 5 Chicago Chicago, IL, USA &lt;named list&gt; &lt;named list&gt; APPROXI… &lt;named list&gt;
#&gt; 6 Arlington Arlington, TX, USA &lt;named list&gt; &lt;named list&gt; APPROXI… &lt;named list&gt;
#&gt; city formatted_address bounds location locat…¹ viewport
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;list&gt; &lt;list&gt; &lt;chr&gt; &lt;list&gt;
#&gt; 1 Houston Houston, TX, USA &lt;named list&gt; &lt;named list&gt; APPROX… &lt;named list&gt;
#&gt; 2 Washington Washington, USA &lt;named list&gt; &lt;named list&gt; APPROX… &lt;named list&gt;
#&gt; 3 Washington Washington, DC, … &lt;named list&gt; &lt;named list&gt; APPROX… &lt;named list&gt;
#&gt; 4 New York New York, NY, USA &lt;named list&gt; &lt;named list&gt; APPROX… &lt;named list&gt;
#&gt; 5 Chicago Chicago, IL, USA &lt;named list&gt; &lt;named list&gt; APPROX… &lt;named list&gt;
#&gt; 6 Arlington Arlington, TX, U… &lt;named list&gt; &lt;named list&gt; APPROX… &lt;named list&gt;
#&gt; # … with 1 more row, and abbreviated variable name ¹location_type</pre>
</div>
<p>That gives us new <code>bounds</code> (a rectangular region) and <code>location</code> (a point). We can unnest <code>location</code> to see the latitude (<code>lat</code>) and longitude (<code>lng</code>):</p>
@ -889,14 +863,14 @@ locations
unnest_wider(geometry) |&gt;
unnest_wider(location)
#&gt; # A tibble: 7 × 7
#&gt; city formatted_address bounds lat lng locati…¹ viewport
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;list&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;list&gt;
#&gt; 1 Houston Houston, TX, USA &lt;named list&gt; 29.8 -95.4 APPROXI&lt;named list&gt;
#&gt; 2 Washington Washington, USA &lt;named list&gt; 47.8 -121. APPROXI&lt;named list&gt;
#&gt; 3 Washington Washington, DC, USA &lt;named list&gt; 38.9 -77.0 APPROXI&lt;named list&gt;
#&gt; 4 New York New York, NY, USA &lt;named list&gt; 40.7 -74.0 APPROXI&lt;named list&gt;
#&gt; 5 Chicago Chicago, IL, USA &lt;named list&gt; 41.9 -87.6 APPROXI&lt;named list&gt;
#&gt; 6 Arlington Arlington, TX, USA &lt;named list&gt; 32.7 -97.1 APPROXI&lt;named list&gt;
#&gt; city formatted_address bounds lat lng locat…¹ viewport
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;list&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;list&gt;
#&gt; 1 Houston Houston, TX, USA &lt;named list&gt; 29.8 -95.4 APPROX… &lt;named list&gt;
#&gt; 2 Washington Washington, USA &lt;named list&gt; 47.8 -121. APPROX… &lt;named list&gt;
#&gt; 3 Washington Washington, DC, &lt;named list&gt; 38.9 -77.0 APPROX… &lt;named list&gt;
#&gt; 4 New York New York, NY, USA &lt;named list&gt; 40.7 -74.0 APPROX… &lt;named list&gt;
#&gt; 5 Chicago Chicago, IL, USA &lt;named list&gt; 41.9 -87.6 APPROX… &lt;named list&gt;
#&gt; 6 Arlington Arlington, TX, U &lt;named list&gt; 32.7 -97.1 APPROX… &lt;named list&gt;
#&gt; # … with 1 more row, and abbreviated variable name ¹location_type</pre>
</div>
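<p>As a minimal, self-contained sketch of the same two-step unnesting, the following builds a tiny tibble by hand. The <code>places</code> data and its values are invented for illustration and are not the geocoding results shown above; only the <code>unnest_wider()</code> calls mirror the text.</p>

```r
library(tibble)
library(tidyr)

# Hypothetical data: each element of `geometry` is a named list that
# itself contains a named list called `location`.
places <- tibble(
  city = c("Houston", "Chicago"),
  geometry = list(
    list(
      location = list(lat = 29.8, lng = -95.4),
      location_type = "APPROXIMATE"
    ),
    list(
      location = list(lat = 41.9, lng = -87.6),
      location_type = "APPROXIMATE"
    )
  )
)

# First spread the components of `geometry` into columns, then spread
# the components of `location` into `lat` and `lng`.
unnested <- places |>
  unnest_wider(geometry) |>
  unnest_wider(location)

unnested
```

<p>Each call to <code>unnest_wider()</code> peels off one level of nesting, which is why it is applied twice here.</p>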
<p>Extracting the bounds requires a few more steps:</p>

@ -1,13 +1,5 @@
<section data-type="chapter" id="chp-regexps">
<h1><span id="sec-regular-expressions" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Regular expressions</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<h1><span id="sec-regular-expressions" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Regular expressions</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
@ -1006,8 +998,9 @@ Base R</h2>
<p><code>apropos(pattern)</code> searches all objects available from the global environment that match the given pattern. This is useful if you can’t quite remember the name of a function:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">apropos("replace")
#&gt; [1] "%+replace%" "replace" "replace_na" "setReplaceMethod"
#&gt; [5] "str_replace" "str_replace_all" "str_replace_na" "theme_replace"</pre>
#&gt; [1] "%+replace%" "replace" "replace_na"
#&gt; [4] "setReplaceMethod" "str_replace" "str_replace_all"
#&gt; [7] "str_replace_na" "theme_replace"</pre>
</div>
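<p>Because the pattern passed to <code>apropos()</code> is a regular expression, it can be anchored like any other. The sketch below assumes a default R session, where the utils package is attached, and uses a pattern chosen purely for illustration:</p>

```r
# `^` anchors the match to the start of the name and `\\.` matches a
# literal dot, so this finds objects whose names begin with "read.".
read_funs <- apropos("^read\\.")
read_funs
```

<p>The exact result depends on which packages are attached, so no output is shown here.</p>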
<p><code>list.files(path, pattern)</code> lists all files in <code>path</code> that match a regular expression <code>pattern</code>. For example, you can find all the R Markdown files in the current directory with:</p>
<div class="cell">

@ -1,13 +1,5 @@
<section data-type="chapter" id="chp-spreadsheets">
<h1><span id="sec-import-spreadsheets" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Spreadsheets</span></span></h1><div data-type="important"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter is currently a dumping ground for ideas, and we don’t recommend reading it. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<h1><span id="sec-import-spreadsheets" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Spreadsheets</span></span></h1><p>::: status callout-important You are reading the work-in-progress second edition of R for Data Science. This chapter is currently a dumping ground for ideas, and we don’t recommend reading it. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
@ -197,16 +189,16 @@ Reading individual sheets</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_excel("data/penguins.xlsx", sheet = "Torgersen Island")
#&gt; # A tibble: 52 × 8
#&gt; species island bill_length_mm bill_depth_mm flipp…¹ body_…² sex year
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
#&gt; 2 Adelie Torgersen 39.5 17.399999999… 186 3800 fema… 2007
#&gt; 3 Adelie Torgersen 40.299999999999997 18 195 3250 fema… 2007
#&gt; 4 Adelie Torgersen NA NA NA NA NA 2007
#&gt; 5 Adelie Torgersen 36.700000000000003 19.3 193 3450 fema… 2007
#&gt; 6 Adelie Torgersen 39.299999999999997 20.6 190 3650 male 2007
#&gt; # … with 46 more rows, and abbreviated variable names ¹​flipper_length_mm,
#&gt; # ²body_mass_g</pre>
#&gt; species island bill_length_mm bill_dep…¹ flipp…² body_…³ sex year
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
#&gt; 2 Adelie Torgersen 39.5 17.399999… 186 3800 fema… 2007
#&gt; 3 Adelie Torgersen 40.299999999999997 18 195 3250 fema… 2007
#&gt; 4 Adelie Torgersen NA NA NA NA NA 2007
#&gt; 5 Adelie Torgersen 36.700000000000003 19.3 193 3450 fema… 2007
#&gt; 6 Adelie Torgersen 39.299999999999997 20.6 190 3650 male 2007
#&gt; # … with 46 more rows, and abbreviated variable names ¹​bill_depth_mm,
#&gt; # ²​flipper_length_mm, ³​body_mass_g</pre>
</div>
<p>Some variables that appear to contain numerical data are read in as characters due to the character string <code>"NA"</code> not being recognized as a true <code>NA</code>.</p>
<div class="cell">
@ -214,14 +206,14 @@ Reading individual sheets</h2>
penguins_torgersen
#&gt; # A tibble: 52 × 8
#&gt; species island bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex year
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
#&gt; 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
#&gt; 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
#&gt; 4 Adelie Torgersen NA NA NA NA &lt;NA&gt; 2007
#&gt; 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
#&gt; 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
#&gt; species island bill_length_mm bill_depth_mm flippe…¹ body_…² sex year
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
#&gt; 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
#&gt; 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
#&gt; 4 Adelie Torgersen NA NA NA NA &lt;NA&gt; 2007
#&gt; 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
#&gt; 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
#&gt; # … with 46 more rows, and abbreviated variable names ¹flipper_length_mm,
#&gt; # ²body_mass_g</pre>
</div>
@ -249,14 +241,14 @@ dim(penguins_dream)
<pre data-type="programlisting" data-code-language="downlit">penguins &lt;- bind_rows(penguins_torgersen, penguins_biscoe, penguins_dream)
penguins
#&gt; # A tibble: 344 × 8
#&gt; species island bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex year
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
#&gt; 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
#&gt; 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
#&gt; 4 Adelie Torgersen NA NA NA NA &lt;NA&gt; 2007
#&gt; 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
#&gt; 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
#&gt; species island bill_length_mm bill_depth_mm flippe…¹ body_…² sex year
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
#&gt; 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
#&gt; 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
#&gt; 4 Adelie Torgersen NA NA NA NA &lt;NA&gt; 2007
#&gt; 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
#&gt; 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
#&gt; # … with 338 more rows, and abbreviated variable names ¹flipper_length_mm,
#&gt; # ²body_mass_g</pre>
</div>
@ -287,14 +279,14 @@ deaths &lt;- read_excel(deaths_path)
#&gt; • `` -&gt; `...6`
deaths
#&gt; # A tibble: 18 × 6
#&gt; `Lots of people` ...2 ...3 ...4 ...5 ...6
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 simply cannot resist writing &lt;NA&gt; &lt;NA&gt; &lt;NA&gt; &lt;NA&gt; some not
#&gt; 2 at the top &lt;NA&gt; of their sp
#&gt; 3 or merging &lt;NA&gt; &lt;NA&gt; &lt;NA&gt; cells
#&gt; 4 Name Profession Age Has kids Date of birth Date of
#&gt; 5 David Bowie musician 69 TRUE 17175 42379
#&gt; 6 Carrie Fisher actor 60 TRUE 20749 42731
#&gt; `Lots of people` ...2 ...3 ...4 ...5 ...6
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 simply cannot resist writing &lt;NA&gt; &lt;NA&gt; &lt;NA&gt; &lt;NA&gt; some …
#&gt; 2 at the top &lt;NA&gt; of their…
#&gt; 3 or merging &lt;NA&gt; &lt;NA&gt; &lt;NA&gt; cells
#&gt; 4 Name Profession Age Has kids Date of birth Date …
#&gt; 5 David Bowie musician 69 TRUE 17175 42379
#&gt; 6 Carrie Fisher actor 60 TRUE 20749 42731
#&gt; # … with 12 more rows</pre>
</div>
<p>The top three rows and the bottom four rows are not part of the data frame.</p>
@ -302,29 +294,30 @@ deaths
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_excel(deaths_path, skip = 4)
#&gt; # A tibble: 14 × 6
#&gt; Name Profession Age `Has kids` `Date of birth` `Date of death`
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dttm&gt; &lt;chr&gt;
#&gt; 1 David Bowie musician 69 TRUE 1947-01-08 00:00:00 42379
#&gt; 2 Carrie Fisher actor 60 TRUE 1956-10-21 00:00:00 42731
#&gt; 3 Chuck Berry musician 90 TRUE 1926-10-18 00:00:00 42812
#&gt; 4 Bill Paxton actor 61 TRUE 1955-05-17 00:00:00 42791
#&gt; 5 Prince musician 57 TRUE 1958-06-07 00:00:00 42481
#&gt; 6 Alan Rickman actor 69 FALSE 1946-02-21 00:00:00 42383
#&gt; # … with 8 more rows</pre>
#&gt; Name Profession Age `Has kids` `Date of birth` Date of dea…¹
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dttm&gt; &lt;chr&gt;
#&gt; 1 David Bowie musician 69 TRUE 1947-01-08 00:00:00 42379
#&gt; 2 Carrie Fisher actor 60 TRUE 1956-10-21 00:00:00 42731
#&gt; 3 Chuck Berry musician 90 TRUE 1926-10-18 00:00:00 42812
#&gt; 4 Bill Paxton actor 61 TRUE 1955-05-17 00:00:00 42791
#&gt; 5 Prince musician 57 TRUE 1958-06-07 00:00:00 42481
#&gt; 6 Alan Rickman actor 69 FALSE 1946-02-21 00:00:00 42383
#&gt; # … with 8 more rows, and abbreviated variable name ¹​`Date of death`</pre>
</div>
<p>We could also set <code>n_max</code> to omit the extraneous rows at the bottom.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_excel(deaths_path, skip = 4, n_max = 10)
#&gt; # A tibble: 10 × 6
#&gt; Name Profession Age Has k…¹ `Date of birth` `Date of death`
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;lgl&gt; &lt;dttm&gt; &lt;dttm&gt;
#&gt; 1 David Bowie musician 69 TRUE 1947-01-08 00:00:00 2016-01-10 00:00:00
#&gt; 2 Carrie Fisher actor 60 TRUE 1956-10-21 00:00:00 2016-12-27 00:00:00
#&gt; 3 Chuck Berry musician 90 TRUE 1926-10-18 00:00:00 2017-03-18 00:00:00
#&gt; 4 Bill Paxton actor 61 TRUE 1955-05-17 00:00:00 2017-02-25 00:00:00
#&gt; 5 Prince musician 57 TRUE 1958-06-07 00:00:00 2016-04-21 00:00:00
#&gt; 6 Alan Rickman actor 69 FALSE 1946-02-21 00:00:00 2016-01-14 00:00:00
#&gt; # … with 4 more rows, and abbreviated variable name ¹​`Has kids`</pre>
#&gt; Name Profe…¹ Age Has k…² `Date of birth` `Date of death`
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;lgl&gt; &lt;dttm&gt; &lt;dttm&gt;
#&gt; 1 David Bowie musici… 69 TRUE 1947-01-08 00:00:00 2016-01-10 00:00:00
#&gt; 2 Carrie Fisher actor 60 TRUE 1956-10-21 00:00:00 2016-12-27 00:00:00
#&gt; 3 Chuck Berry musici… 90 TRUE 1926-10-18 00:00:00 2017-03-18 00:00:00
#&gt; 4 Bill Paxton actor 61 TRUE 1955-05-17 00:00:00 2017-02-25 00:00:00
#&gt; 5 Prince musici… 57 TRUE 1958-06-07 00:00:00 2016-04-21 00:00:00
#&gt; 6 Alan Rickman actor 69 FALSE 1946-02-21 00:00:00 2016-01-14 00:00:00
#&gt; # … with 4 more rows, and abbreviated variable names ¹Profession,
#&gt; # ²​`Has kids`</pre>
</div>
<p>Another approach is using cell ranges. In Excel, the top left cell is <code>A1</code>. As you move across columns to the right, the cell label moves down the alphabet, i.e. <code>B1</code>, <code>C1</code>, etc. And as you move down a column, the number in the cell label increases, i.e. <code>A2</code>, <code>A3</code>, etc.</p>
<p>The data we want to read in starts in cell <code>A5</code> and ends in cell <code>F15</code>. In spreadsheet notation, this is <code>A5:F15</code>.</p>
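<p>A minimal sketch of reading just that rectangle follows. It assumes that <code>deaths_path</code> points at the <code>deaths.xlsx</code> example file bundled with readxl, which is an assumption and not something shown in this excerpt:</p>

```r
library(readxl)

# Assumption: the example workbook that ships with readxl.
deaths_path <- readxl_example("deaths.xlsx")

# `range` takes spreadsheet notation; the first row of the range
# (row 5 here) is used for the column names.
deaths <- read_excel(deaths_path, range = "A5:F15")
dim(deaths)
```

<p>Specifying a range replaces the combination of <code>skip</code> and <code>n_max</code>, since the rectangle already excludes the notes above and below the data.</p>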

@ -1,13 +1,5 @@
<section data-type="chapter" id="chp-strings">
<h1><span id="sec-strings" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Strings</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<h1><span id="sec-strings" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Strings</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>

@ -1,10 +1,2 @@
<section data-type="chapter" id="chp-webscraping">
<h1><span id="sec-scraping" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Web scraping</span></span></h1><div data-type="important"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter is currently a dumping ground for ideas, and we don’t recommend reading it. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
</section>
<h1><span id="sec-scraping" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Web scraping</span></span></h1><p>::: status callout-important You are reading the work-in-progress second edition of R for Data Science. This chapter is currently a dumping ground for ideas, and we don’t recommend reading it. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p></section>

@ -1,13 +1,5 @@
<section data-type="chapter" id="chp-workflow-basics">
<h1><span id="sec-workflow-basics" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Workflow: basics</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<p>You now have some experience running R code. We didn’t give you many details, but you’ve obviously figured out the basics, or you would’ve thrown this book away in frustration! Frustration is natural when you start programming in R, because it is such a stickler for punctuation, and even one character out of place will cause it to complain. But while you should expect to be a little frustrated, take comfort in that this experience is both typical and temporary: it happens to everyone, and the only way to get over it is to keep trying.</p><p>Before we go any further, let’s make sure you’ve got a solid foundation in running R code, and that you know about some of the most helpful RStudio features.</p>
<h1><span id="sec-workflow-basics" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Workflow: basics</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p><p>You now have some experience running R code. We didn’t give you many details, but you’ve obviously figured out the basics, or you would’ve thrown this book away in frustration! Frustration is natural when you start programming in R, because it is such a stickler for punctuation, and even one character out of place will cause it to complain. But while you should expect to be a little frustrated, take comfort in that this experience is both typical and temporary: it happens to everyone, and the only way to get over it is to keep trying.</p><p>Before we go any further, let’s make sure you’ve got a solid foundation in running R code, and that you know about some of the most helpful RStudio features.</p>
<section id="coding-basics" data-type="sect1">
<h1>
Coding basics</h1>

@ -1,13 +1,5 @@
<section data-type="chapter" id="chp-workflow-help">
<h1><span id="sec-workflow-getting-help" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Workflow: Getting help</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<p>This book is not an island; there is no single resource that will allow you to master R. As you begin to apply the techniques described in this book to your own data, you will soon find questions that we do not answer. This section describes a few tips on how to get help, and to help you keep learning.</p>
<h1><span id="sec-workflow-getting-help" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Workflow: Getting help</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p><p>This book is not an island; there is no single resource that will allow you to master R. As you begin to apply the techniques described in this book to your own data, you will soon find questions that we do not answer. This section describes a few tips on how to get help, and to help you keep learning.</p>
<section id="google-is-your-friend" data-type="sect1">
<h1>
Google is your friend</h1>

@ -1,13 +1,5 @@
<section data-type="chapter" id="chp-workflow-pipes">
<h1><span id="sec-workflow-pipes" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Workflow: Pipes</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter is largely complete and just needs final proof reading. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<p>The pipe, <code>|&gt;</code>, is a powerful tool for clearly expressing a sequence of operations that transform an object. We briefly introduced pipes in the previous chapter, but before going too much farther, we want to give a few more details and discuss <code>%&gt;%</code>, a predecessor to <code>|&gt;</code>.</p><p>To add the pipe to your code, we recommend using the built-in keyboard shortcut Ctrl/Cmd + Shift + M. You’ll need to make one change to your RStudio options to use <code>|&gt;</code> instead of <code>%&gt;%</code> as shown in <a href="#fig-pipe-options" data-type="xref">#fig-pipe-options</a>; more on <code>%&gt;%</code> shortly.</p><div class="cell">
<h1><span id="sec-workflow-pipes" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Workflow: Pipes</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter is largely complete and just needs final proof reading. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p><p>The pipe, <code>|&gt;</code>, is a powerful tool for clearly expressing a sequence of operations that transform an object. We briefly introduced pipes in the previous chapter, but before going too much farther, we want to give a few more details and discuss <code>%&gt;%</code>, a predecessor to <code>|&gt;</code>.</p><p>To add the pipe to your code, we recommend using the built-in keyboard shortcut Ctrl/Cmd + Shift + M. You’ll need to make one change to your RStudio options to use <code>|&gt;</code> instead of <code>%&gt;%</code> as shown in <a href="#fig-pipe-options" data-type="xref">#fig-pipe-options</a>; more on <code>%&gt;%</code> shortly.</p><div class="cell">
<div class="cell-output-display">
<figure id="fig-pipe-options"><p><img src="screenshots/rstudio-pipe-options.png" alt="Screenshot showing the &quot;Use native pipe operator&quot; option which can be found on the &quot;Editing&quot; panel of the &quot;Code&quot; options." width="616"/></p>

@ -1,15 +1,5 @@
<section data-type="chapter" id="chp-workflow-scripts">
<h1><span id="sec-workflow-scripts-projects" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Workflow: scripts and projects</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div><h1>
RStudio server
</h1><p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p>
<p>If you’re using RStudio server, your R session is never restarted by default. When you close your RStudio server tab, it might feel like you’re closing R, but the server actually keeps it running in the background. The next time you return, you’ll be in exactly the same place you left. This makes it even more important to regularly restart R so that you’re starting with a fresh slate.</p></div>
<p>This chapter will introduce you to two very important tools for organizing your code: scripts and projects.</p>
<h1><span id="sec-workflow-scripts-projects" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Workflow: scripts and projects</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p><p>This chapter will introduce you to two very important tools for organizing your code: scripts and projects.</p>
<section id="scripts" data-type="sect1">
<h1>
Scripts</h1>
@ -126,9 +116,7 @@ What is the source of truth?</h2>
</ol><p>We collectively use this pattern hundreds of times a week.</p>
<div data-type="note"><h1>
RStudio server
</h1><p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p>
<p>If youre using RStudio server, your R session is never restarted by default. When you close your RStudio server tab, it might feel like youre closing R, but the server actually keeps it running in the background. The next time you return, youll be in exactly the same place you left. This makes it even more important to regularly restart R so that youre starting with a refresh slate.</p></div>
</h1><p>If you’re using RStudio server, your R session is never restarted by default. When you close your RStudio server tab, it might feel like you’re closing R, but the server actually keeps it running in the background. The next time you return, you’ll be in exactly the same place you left. This makes it even more important to regularly restart R so that you’re starting with a fresh slate.</p></div>
</section>

@ -1,13 +1,5 @@
<section data-type="chapter" id="chp-workflow-style">
<h1><span id="sec-workflow-style" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Workflow: code style</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<p>Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread. Even as a very new programmer it’s a good idea to work on your code style. Using a consistent style makes it easier for others (including future-you!) to read your work, and is particularly important if you need to get help from someone else. This chapter will introduce you to the most important points of the <a href="https://style.tidyverse.org">tidyverse style guide</a>, which is used throughout this book.</p><p>Styling your code will feel a bit tedious to start with, but if you practice it, it will soon become second nature. Additionally, there are some great tools to quickly restyle existing code, like the <a href="https://styler.r-lib.org">styler</a> package by Lorenz Walthert. Once you’ve installed it with <code>install.packages("styler")</code>, an easy way to use it is via RStudio’s <strong>command palette</strong>. The command palette lets you use any built-in RStudio command, as well as many addins provided by packages. Open the palette by pressing Cmd/Ctrl + Shift + P, then type “styler” to see all the shortcuts provided by styler. <a href="#fig-styler" data-type="xref">#fig-styler</a> shows the results.</p><div class="cell">
<h1><span id="sec-workflow-style" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Workflow: code style</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p><p>Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread. Even as a very new programmer it’s a good idea to work on your code style. Using a consistent style makes it easier for others (including future-you!) to read your work, and is particularly important if you need to get help from someone else. This chapter will introduce you to the most important points of the <a href="https://style.tidyverse.org">tidyverse style guide</a>, which is used throughout this book.</p><p>Styling your code will feel a bit tedious to start with, but if you practice it, it will soon become second nature. Additionally, there are some great tools to quickly restyle existing code, like the <a href="https://styler.r-lib.org">styler</a> package by Lorenz Walthert. Once you’ve installed it with <code>install.packages("styler")</code>, an easy way to use it is via RStudio’s <strong>command palette</strong>. The command palette lets you use any built-in RStudio command, as well as many addins provided by packages. Open the palette by pressing Cmd/Ctrl + Shift + P, then type “styler” to see all the shortcuts provided by styler. <a href="#fig-styler" data-type="xref">#fig-styler</a> shows the results.</p><div class="cell">
<div class="cell-output-display">
<figure id="fig-styler"><p><img src="screenshots/rstudio-palette.png" alt="A screenshot showing the command palette after typing &quot;styler&quot;, showing the four styling tools provided by the package." width="638"/></p>