More minor page count tweaks & fixes

And re-convert with latest htmlbook
Hadley Wickham 2023-01-26 10:36:07 -06:00
parent d9afa135fc
commit aa9d72a7c6
38 changed files with 838 additions and 1093 deletions

@ -52,7 +52,8 @@ devtools::install_github("hadley/r4ds")
To generate book for O'Reilly, build the book then:
```{r}
devtools::load_all("../minibook/"); process_book()
# pak::pak("hadley/htmlbook")
htmlbook::convert_book()
html <- list.files("oreilly", pattern = "[.]html$", full.names = TRUE)
file.copy(html, "../r-for-data-science-2e/", overwrite = TRUE)
@ -63,6 +64,8 @@ fs::dir_create(unique(dirname(dest)))
file.copy(pngs, dest, overwrite = TRUE)
```
Then commit and push to atlas.
## Code of Conduct
Please note that r4ds uses a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html).

@ -16,8 +16,9 @@ options(
pillar.max_footer_lines = 2,
pillar.min_chars = 15,
stringr.view_n = 6,
# Activate crayon output - temporarily disabled for quarto
# crayon.enabled = TRUE,
# Temporarily deactivate cli output for quarto
cli.num_colors = 0,
cli.hyperlink = FALSE,
pillar.bold = TRUE,
width = 77 # 80 - 3 for #> comment
)

@ -210,7 +210,7 @@ This function was the inspiration for much of dplyr's syntax.
2. Why is `x[-which(x > 0)]` not the same as `x[x <= 0]`?
Read the documentation for `which()` and do some experiments to figure it out.
## Selecting a single element `$` and `[[` {#sec-subset-one}
## Selecting a single element with `$` and `[[` {#sec-subset-one}
`[`, which selects many elements, is paired with `[[` and `$`, which extract a single element.
In this section, we'll show you how to use `[[` and `$` to pull columns out of data frames, discuss a couple more differences between `data.frames` and tibbles, and emphasize some important differences between `[` and `[[` when used with lists.
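
As a minimal sketch of the distinction described above (the list `l` and its contents are made up for illustration and are not taken from the chapter), `[` keeps the container while `[[` and `$` pull out the element itself:

```r
# A small list, invented purely for illustration.
l <- list(a = 1:3, b = "a string", c = pi)

# `[` returns a smaller list, even when it selects a single element.
sub <- l["a"]
str(sub)

# `[[` and `$` extract the element itself (here, an integer vector).
one <- l[["a"]]
same <- l$a
str(one)
```

Here `sub` is still a list of length one, whereas `one` and `same` are the integer vector stored under `a`.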

@ -365,7 +365,7 @@ knitr::kable(df, format = "markdown")
```
```{r}
#| eval: false
#| include: false
cli:::ruler()
```

@ -1,6 +1,6 @@
<section data-type="chapter" id="chp-EDA">
<h1><span id="sec-exploratory-data-analysis" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Exploratory data analysis</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="EDA-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>This chapter will show you how to use visualization and transformation to explore your data in a systematic way, a task that statisticians call exploratory data analysis, or EDA for short. EDA is an iterative cycle. You:</p>
@ -10,7 +10,7 @@ Introduction</h1>
</ol><p>EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others.</p>
<p>EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you’ll need to deploy all the tools of EDA: visualization, transformation, and modelling.</p>
<section id="prerequisites" data-type="sect2">
<section id="EDA-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter we’ll combine what you’ve learned about dplyr and ggplot2 to interactively ask questions, answer them with data, and then ask new questions.</p>
@ -137,7 +137,7 @@ unusual
<p>It’s good practice to repeat your analysis with and without the outliers. If they have minimal effect on the results, and you can’t figure out why they’re there, it’s reasonable to omit them, and move on. However, if they have a substantial effect on your results, you shouldn’t drop them without justification. You’ll need to figure out what caused them (e.g. a data entry error) and disclose that you removed them in your write-up.</p>
</section>
<section id="exercises" data-type="sect2">
<section id="EDA-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Explore the distribution of each of the <code>x</code>, <code>y</code>, and <code>z</code> variables in <code>diamonds</code>. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.</p></li>
@ -198,7 +198,7 @@ Unusual values</h1>
</div>
<p>However this plot isn’t great because there are many more non-cancelled flights than cancelled flights. In the next section we’ll explore some techniques for improving this comparison.</p>
<section id="exercises-1" data-type="sect2">
<section id="EDA-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference in how missing values are handled in histograms and bar charts?</p></li>
@ -217,9 +217,7 @@ A categorical and a numerical variable</h2>
<p>For example, let’s explore how the price of a diamond varies with its quality (measured by <code>cut</code>) using <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = price)) +
geom_freqpoly(aes(color = cut), binwidth = 500, size = 0.75)
#&gt; Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
#&gt; Please use `linewidth` instead.</pre>
geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-16-1.png" class="img-fluid" alt="A frequency polygon of prices of diamonds where each cut of carat (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 30000 and the y-axis ranges from 0 to 5000. The lines overlap a great deal, suggesting similar frequency distributions of prices of diamonds. One notable feature is that Ideal diamonds have the highest peak around 1500." width="576"/></p>
</div>
@ -235,7 +233,7 @@ A categorical and a numerical variable</h2>
<p>To make the comparison easier we need to swap what is displayed on the y-axis. Instead of displaying count, we’ll display the <strong>density</strong>, which is the count standardized so that the area under each frequency polygon is one.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = price, y = after_stat(density))) +
geom_freqpoly(aes(color = cut), binwidth = 500, size = 0.75)</pre>
geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-18-1.png" class="img-fluid" alt="A frequency polygon of densities of prices of diamonds where each cut of carat (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 20000. The lines overlap a great deal, suggesting similar density distributions of prices of diamonds. One notable feature is that all but Fair diamonds have high peaks around a price of 1500 and Fair diamonds have a higher mean than others." width="576"/></p>
</div>
@ -279,7 +277,7 @@ A categorical and a numerical variable</h2>
</div>
</div>
<section id="exercises-2" data-type="sect3">
<section id="EDA-exercises-2" data-type="sect3">
<h3>
Exercises</h3>
<ol type="1"><li><p>Use what you’ve learned to improve the visualization of the departure times of cancelled vs. non-cancelled flights.</p></li>
@ -291,7 +289,7 @@ Exercises</h3>
</ol></section>
</section>
<section id="two-categorical-variables" data-type="sect2">
<section id="EDA-two-categorical-variables" data-type="sect2">
<h2>
Two categorical variables</h2>
<p>To visualize the covariation between categorical variables, you’ll need to count the number of observations for each combination of levels of these categorical variables. One way to do that is to rely on the built-in <code><a href="https://ggplot2.tidyverse.org/reference/geom_count.html">geom_count()</a></code>:</p>
@ -330,7 +328,7 @@ Two categorical variables</h2>
</div>
<p>If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the heatmaply package, which creates interactive plots.</p>
<section id="exercises-3" data-type="sect3">
<section id="EDA-exercises-3" data-type="sect3">
<h3>
Exercises</h3>
<ol type="1"><li><p>How could you rescale the count dataset above to more clearly show the distribution of cut within color, or color within cut?</p></li>
@ -340,7 +338,7 @@ Exercises</h3>
</ol></section>
</section>
<section id="two-numerical-variables" data-type="sect2">
<section id="EDA-two-numerical-variables" data-type="sect2">
<h2>
Two numerical variables</h2>
<p>You’ve already seen one great way to visualize the covariation between two numerical variables: draw a scatterplot with <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code>. You can see covariation as a pattern in the points. For example, you can see an exponential relationship between the carat size and price of a diamond.</p>
@ -390,7 +388,7 @@ ggplot(smaller, aes(x = carat, y = price)) +
</div>
</div>
<section id="exercises-4" data-type="sect3">
<section id="EDA-exercises-4" data-type="sect3">
<h3>
Exercises</h3>
<ol type="1"><li><p>Instead of summarizing the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_width()</a></code> vs. <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_number()</a></code>? How does that impact a visualization of the 2d distribution of <code>carat</code> and <code>price</code>?</p></li>
@ -464,7 +462,7 @@ ggplot(diamonds_aug, aes(x = carat, y = .resid)) +
<p>We’re not discussing modelling in this book because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.</p>
</section>
<section id="summary" data-type="sect1">
<section id="EDA-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter you’ve learned a variety of tools to help you understand the variation within your data. You’ve seen techniques that work with a single variable at a time and with a pair of variables. This might seem painfully restrictive if you have tens or hundreds of variables in your data, but they’re the foundation upon which all other techniques are built.</p>

@ -1,13 +1,13 @@
<section data-type="chapter" id="chp-arrow">
<h1><span id="sec-arrow" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Arrow</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="arrow-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>CSV files are designed to be easily read by humans. They’re a good interchange format because they’re very simple and they can be read by every tool under the sun. But CSV files aren’t very efficient: you have to do quite a lot of work to read the data into R. In this chapter, you’ll learn about a powerful alternative: the <a href="https://parquet.apache.org/">parquet format</a>, an open standards-based format widely used by big data systems.</p>
<p>We’ll pair parquet files with <a href="https://arrow.apache.org">Apache Arrow</a>, a multi-language toolbox designed for efficient analysis and transport of large data sets. We’ll use Apache Arrow via the <a href="https://arrow.apache.org/docs/r/">arrow package</a>, which provides a dplyr backend allowing you to analyze larger-than-memory datasets using familiar dplyr syntax. As an additional benefit, arrow is extremely fast: you’ll see some examples later in the chapter.</p>
<p>Both arrow and dbplyr provide dplyr backends, so you might wonder when to use each. In many cases, the choice is made for you, as in the data is already in a database or in parquet files, and you’ll want to work with it as is. But if you’re starting with your own data (perhaps CSV files), you can either load it into a database or convert it to parquet. In general, it’s hard to know what will work best, so in the early stages of your analysis we’d encourage you to try both and pick the one that works the best for you.</p>
<section id="prerequisites" data-type="sect2">
<section id="arrow-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter, we’ll continue to use the tidyverse, particularly dplyr, but we’ll pair it with the arrow package which is designed specifically for working with large data.</p>
@ -272,7 +272,7 @@ Using dbplyr with arrow</h2>
</section>
</section>
<section id="summary" data-type="sect1">
<section id="arrow-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you’ve been given a taste of the arrow package, which provides a dplyr backend for working with large on-disk datasets. It can work with CSV files, but it’s much, much faster if you convert your data to parquet. Parquet is a binary data format that’s designed specifically for data analysis on modern computers. Far fewer tools can work with parquet files compared to CSV, but its partitioned, compressed, and columnar structure makes it much more efficient to analyze.</p>

@ -1,6 +1,6 @@
<section data-type="chapter" id="chp-base-R">
<h1><span id="sec-base-r" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">A field guide to base R</span></span></h1><p>To finish off the programming section, we’re going to give you a quick tour of the most important base R functions that we don’t otherwise discuss in the book. These tools are particularly useful as you do more programming and will help you read code that you’ll encounter in the wild.</p><p>This is a good place to remind you that the tidyverse is not the only way to solve data science problems. We teach the tidyverse in this book because tidyverse packages share a common design philosophy, which increases the consistency across functions, making each new function or package a little easier to learn and use. It’s not possible to use the tidyverse without using base R, so we’ve actually already taught you a <strong>lot</strong> of base R functions: from <code><a href="https://rdrr.io/r/base/library.html">library()</a></code> to load packages, to <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code> and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> for numeric summaries, to the factor, date, and POSIXct data types, and of course all the basic operators like <code>+</code>, <code>-</code>, <code>/</code>, <code>*</code>, <code>|</code>, <code>&amp;</code>, and <code>!</code>. What we haven’t focused on so far is base R workflows, so we will highlight a few of those in this chapter.</p><p>After you read this book you’ll learn other approaches to the same problems using base R, data.table, and other packages. You’ll certainly encounter these other approaches when you start reading R code written by other people, particularly if you’re using StackOverflow. 
It’s 100% okay to write code that uses a mix of approaches, and don’t let anyone tell you otherwise!</p><p>In this chapter, we’ll focus on four big topics: subsetting with <code>[</code>, subsetting with <code>[[</code> and <code>$</code>, the apply family of functions, and <code>for</code> loops. To finish off, we’ll briefly discuss two important plotting functions.</p>
<section id="prerequisites" data-type="sect2">
<section id="base-R-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<div class="cell">
@ -10,7 +10,7 @@ Prerequisites</h2>
<section id="sec-subset-many" data-type="sect1">
<h1>
Selecting multiple elements with<code>[</code>
Selecting multiple elements with [
</h1>
<p><code>[</code> is used to extract sub-components from vectors and data frames, and is called like <code>x[i]</code> or <code>x[i, j]</code>. In this section, we’ll introduce you to the power of <code>[</code>, first showing you how you can use it with vectors, then how the same principles extend in a straightforward way to two-dimensional (2d) structures like data frames. We’ll then help you cement that knowledge by showing how various dplyr verbs are special cases of <code>[</code>.</p>
@ -188,7 +188,7 @@ df |&gt; subset(x &gt; 1, c(y, z))
<p>This function was the inspiration for much of dplyr’s syntax.</p>
</section>
<section id="exercises" data-type="sect2">
<section id="base-R-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
@ -203,7 +203,7 @@ Exercises</h2>
<section id="sec-subset-one" data-type="sect1">
<h1>
Selecting a single element<code>$</code> and <code>[[</code>
Selecting a single element with $ and [[
</h1>
<p><code>[</code>, which selects many elements, is paired with <code>[[</code> and <code>$</code>, which extract a single element. In this section, we’ll show you how to use <code>[[</code> and <code>$</code> to pull columns out of data frames, discuss a couple more differences between <code>data.frames</code> and tibbles, and emphasize some important differences between <code>[</code> and <code>[[</code> when used with lists.</p>
@ -284,7 +284,7 @@ tb$z
<p>For this reason we sometimes joke that tibbles are lazy and surly: they do less and complain more.</p>
</section>
<section id="lists" data-type="sect2">
<section id="base-R-lists" data-type="sect2">
<h2>
Lists</h2>
<p><code>[[</code> and <code>$</code> are also really important for working with lists, and it’s important to understand how they differ from <code>[</code>. Let’s illustrate the differences with a list named <code>l</code>:</p>
@ -372,7 +372,7 @@ df[["x"]]
</div>
</section>
<section id="exercises-1" data-type="sect2">
<section id="base-R-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What happens when you use <code>[[</code> with a positive integer that’s bigger than the length of the vector? What happens when you subset with a name that doesn’t exist?</p></li>
@ -515,7 +515,7 @@ plot(diamonds$carat, diamonds$price)</pre>
<p>Note that base plotting functions work with vectors, so you need to pull columns out of the data frame using <code>$</code> or some other technique.</p>
</section>
<section id="summary" data-type="sect1">
<section id="base-R-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, we’ve shown you a selection of base R functions useful for subsetting and iteration. Compared to approaches discussed elsewhere in the book, these functions tend to have more of a “vector” flavor than a “data frame” flavor because base R functions tend to take individual vectors, rather than a data frame and some column specification. This often makes life easier for programming and so becomes more important as you write more functions and begin to write your own packages.</p>

@ -1,28 +1,18 @@
<section data-type="chapter" id="chp-communication">
<h1><span id="sec-communication" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Communication</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="communication-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>In <a href="#chp-EDA" data-type="xref">#chp-EDA</a>, you learned how to use plots as tools for <em>exploration</em>. When you make exploratory plots, you know, even before looking, which variables the plot will display. You made each plot for a purpose, could quickly look at it, and then move on to the next plot. In the course of most analyses, you’ll produce tens or hundreds of plots, most of which are immediately thrown away.</p>
<p>Now that you understand your data, you need to <em>communicate</em> your understanding to others. Your audience will likely not share your background knowledge and will not be deeply invested in the data. To help others quickly build up a good mental model of the data, you will need to invest considerable effort in making your plots as self-explanatory as possible. In this chapter, you’ll learn some of the tools that ggplot2 provides to do so.</p>
<p>This chapter focuses on the tools you need to create good graphics. We assume that you know what you want, and just need to know how to do it. For that reason, we highly recommend pairing this chapter with a good general visualization book. We particularly like <a href="https://www.amazon.com/gp/product/0321934075/">The Truthful Art</a>, by Alberto Cairo. It doesn’t teach the mechanics of creating visualizations, but instead focuses on what you need to think about in order to create effective graphics.</p>
<section id="prerequisites" data-type="sect2">
<section id="communication-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter, we’ll focus once again on ggplot2. We’ll also use a little dplyr for data manipulation, <strong>scales</strong> to override the default breaks, labels, transformations and palettes, and a few ggplot2 extension packages, including <strong>ggrepel</strong> (<a href="https://ggrepel.slowkow.com/">https://ggrepel.slowkow.com</a>) by Kamil Slowikowski and <strong>patchwork</strong> (<a href="https://patchwork.data-imaginist.com/">https://patchwork.data-imaginist.com</a>) by Thomas Lin Pedersen. Don’t forget that you’ll need to install those packages with <code><a href="https://rdrr.io/r/utils/install.packages.html">install.packages()</a></code> if you don’t already have them.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
#&gt; ── Attaching core tidyverse packages ──────────────── tidyverse 1.3.2.9000 ──
#&gt; ✔ dplyr 1.0.99.9000 ✔ readr 2.1.3
#&gt; ✔ forcats 0.5.2.9000 ✔ stringr 1.5.0.9000
#&gt; ✔ ggplot2 3.4.0.9000 ✔ tibble 3.1.8
#&gt; ✔ lubridate 1.9.0 ✔ tidyr 1.2.1.9001
#&gt; ✔ purrr 1.0.1
#&gt; ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
#&gt; ✖ dplyr::filter() masks stats::filter()
#&gt; ✖ dplyr::lag() masks stats::lag()
#&gt; Use the conflicted package (&lt;http://conflicted.r-lib.org/&gt;) to force all conflicts to become errors
library(ggrepel)
library(patchwork)</pre>
</div>
@ -91,7 +81,7 @@ ggplot(df, aes(x, y)) +
</div>
</div>
<section id="exercises" data-type="sect2">
<section id="communication-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Create one plot on the fuel economy data with customized <code>title</code>, <code>subtitle</code>, <code>caption</code>, <code>x</code>, <code>y</code>, and <code>color</code> labels.</p></li>
@ -280,12 +270,12 @@ ggplot(mpg, aes(x = displ, y = hwy)) +
#&gt; decreasing fuel economy.</pre>
</div>
<p>Remember, in addition to <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_text()</a></code>, you have many other geoms in ggplot2 available to help annotate your plot. A couple ideas:</p>
<ul><li><p>Use <code><a href="https://ggplot2.tidyverse.org/reference/geom_abline.html">geom_hline()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/geom_abline.html">geom_vline()</a></code> to add reference lines. We often make them thick (<code>size = 2</code>) and white (<code>color = white</code>), and draw them underneath the primary data layer. That makes them easy to see, without drawing attention away from the data.</p></li>
<ul><li><p>Use <code><a href="https://ggplot2.tidyverse.org/reference/geom_abline.html">geom_hline()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/geom_abline.html">geom_vline()</a></code> to add reference lines. We often make them thick (<code>linewidth = 2</code>) and white (<code>color = white</code>), and draw them underneath the primary data layer. That makes them easy to see, without drawing attention away from the data.</p></li>
<li><p>Use <code><a href="https://ggplot2.tidyverse.org/reference/geom_tile.html">geom_rect()</a></code> to draw a rectangle around points of interest. The boundaries of the rectangle are defined by aesthetics <code>xmin</code>, <code>xmax</code>, <code>ymin</code>, <code>ymax</code>.</p></li>
<li><p>Use <code><a href="https://ggplot2.tidyverse.org/reference/geom_segment.html">geom_segment()</a></code> with the <code>arrow</code> argument to draw attention to a point with an arrow. Use aesthetics <code>x</code> and <code>y</code> to define the starting location, and <code>xend</code> and <code>yend</code> to define the end location.</p></li>
</ul><p>The only limit is your imagination (and your patience with positioning annotations to be aesthetically pleasing)!</p>
<section id="exercises-1" data-type="sect2">
<section id="communication-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Use <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_text()</a></code> with infinite positions to place text at the four corners of the plot.</p></li>
@ -603,7 +593,7 @@ mpg |&gt;
</div>
</div>
</div>
<p>You can also set the <code>limits</code> on individual scales. Reducing the limits is basically equivalent to subsetting the data. It is generally more useful if you want <em>expand</em> the limits, for example, to match scales across different plots. For example, if we extract two classes of cars and plot them separately, it’s difficult to compare the plots because all three scales (the x-axis, the y-axis, and the color aesthetic) have different ranges.</p>
<p>You can also set the <code>limits</code> on individual scales. Reducing the limits is basically equivalent to subsetting the data. It is generally more useful if you want to <em>expand</em> the limits, for example, to match scales across different plots. For example, if we extract two classes of cars and plot them separately, it’s difficult to compare the plots because all three scales (the x-axis, the y-axis, and the color aesthetic) have different ranges.</p>
<div>
<pre data-type="programlisting" data-code-language="r">suv &lt;- mpg |&gt; filter(class == "suv")
compact &lt;- mpg |&gt; filter(class == "compact")
@ -655,7 +645,7 @@ ggplot(compact, aes(x = displ, y = hwy, color = drv)) +
<p>In this particular case, you could have simply used faceting, but this technique is useful more generally, if for instance, you want to spread plots over multiple pages of a report.</p>
</section>
<section id="exercises-2" data-type="sect2">
<section id="communication-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
@ -740,7 +730,7 @@ Themes</h1>
</div>
<p>For an overview of all <code><a href="https://ggplot2.tidyverse.org/reference/theme.html">theme()</a></code> components, see help with <code><a href="https://ggplot2.tidyverse.org/reference/theme.html">?theme</a></code>. The <a href="https://ggplot2-book.org/">ggplot2 book</a> is also a great place to go for the full details on theming.</p>
<section id="exercises-3" data-type="sect2">
<section id="communication-exercises-3" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>Pick a theme offered by the ggthemes package and apply it to the last plot you made.</li>
@ -808,14 +798,14 @@ p5 &lt;- ggplot(mpg, aes(x = cty, y = hwy, color = drv)) +
guides = "collect",
heights = c(1, 3, 2, 4)
) &amp;
theme(legend.position = "bottom")</pre>
theme(legend.position = "top")</pre>
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-45-1.png" alt="Five plots laid out such that first two plots are next to each other. Plots three and four are underneath them. And the fifth plot stretches under them. The patchworked plot is titled &quot;City and highway mileage for cars with different drive trains&quot; and captioned &quot;Source: Source: https://fueleconomy.gov&quot;. The first two plots are side-by-side box plots. Plots 3 and 4 are density plots. And the fifth plot is a faceted scatterplot. Each of these plots show geoms colored by drive train, but the patchworked plot has only one legend that applies to all of them, above the plots and beneath the title." width="576"/></p>
</div>
</div>
<p>If you’d like to learn more about combining and laying out multiple plots with patchwork, we recommend looking through the guides on the package website: <a href="https://patchwork.data-imaginist.com" class="uri">https://patchwork.data-imaginist.com</a>.</p>
<section id="exercises-4" data-type="sect2">
<section id="communication-exercises-4" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
@ -848,7 +838,7 @@ p3 &lt;- ggplot(mpg, aes(x = cty, y = hwy)) +
</ol></section>
</section>
<section id="summary" data-type="sect1">
<section id="communication-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter you’ve learned about adding plot labels such as title, subtitle, caption as well as modifying default axis labels, using annotation to add informational text to your plot or to highlight specific data points, customizing the axis scales, and changing the theme of your plot. You’ve also learned about combining multiple plots in a single graph using both simple and complex plot layouts.</p>

@ -1,12 +1,12 @@
<section data-type="chapter" id="chp-data-import">
<h1><span id="sec-data-import" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data import</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="data-import-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>Working with data provided by R packages is a great way to learn data science tools, but you want to apply what you’ve learned to your own data at some point. In this chapter, you’ll learn the basics of reading data files into R.</p>
<p>Specifically, this chapter will focus on reading plain-text rectangular files. We’ll start with practical advice for handling features like column names, types, and missing data. You will then learn about reading data from multiple files at once and writing data from R to a file. Finally, you’ll learn how to handcraft data frames in R.</p>
<section id="prerequisites" data-type="sect2">
<section id="data-import-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter, you’ll learn how to load flat files in R with the <strong>readr</strong> package, which is part of the core tidyverse.</p>
@ -257,7 +257,7 @@ Other file types</h2>
<li><p><code><a href="https://readr.tidyverse.org/reference/read_log.html">read_log()</a></code> reads Apache-style log files.</p></li>
</ul></section>
<section id="exercises" data-type="sect2">
<section id="data-import-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What function would you use to read a file where fields were separated with “|”?</p></li>
@ -372,9 +372,9 @@ Missing values, column types, and problems</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">problems(df)
#&gt; # A tibble: 1 × 5
#&gt; row col expected actual file
#&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 3 1 a double . /private/tmp/Rtmp1nE0XP/file11b88112257a4</pre>
#&gt; row col expected actual file
#&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 3 1 a double . /private/tmp/Rtmpx37bAU/filec1bb57d587a7</pre>
</div>
<p>This tells us that there was a problem in row 3, col 1 where readr expected a double but got a <code>.</code>. That suggests this dataset uses <code>.</code> for missing values. So then we set <code>na = "."</code>, the automatic guessing succeeds, giving us the numeric column that we want:</p>
<div class="cell">
@ -584,7 +584,7 @@ Data entry</h1>
<p>We’ll use <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> and <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tribble()</a></code> later in the book to construct small examples to demonstrate how various functions work.</p>
</section>
<section id="summary" data-type="sect1">
<section id="data-import-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you’ve learned how to load CSV files with <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> and to do your own data entry with <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> and <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tribble()</a></code>. You’ve learned how CSV files work, some of the problems you might encounter, and how to overcome them. We’ll come back to data import a few times in this book: <a href="#chp-spreadsheets" data-type="xref">#chp-spreadsheets</a> covers Excel and Google Sheets, <a href="#chp-databases" data-type="xref">#chp-databases</a> will show you how to load data from databases, <a href="#chp-arrow" data-type="xref">#chp-arrow</a> covers parquet files, <a href="#chp-rectangling" data-type="xref">#chp-rectangling</a> covers JSON, and <a href="#chp-webscraping" data-type="xref">#chp-webscraping</a> covers websites.</p>

View File

@ -1,6 +1,6 @@
<section data-type="chapter" id="chp-data-tidy">
<h1><span id="sec-data-tidy" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data tidying</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="data-tidy-introduction" data-type="sect1">
<h1>
Introduction</h1>
<blockquote class="blockquote">
@ -14,7 +14,7 @@ Introduction</h1>
<p>In this chapter, you will learn a consistent way to organize your data in R using a system called <strong>tidy data</strong>. Getting your data into this format requires some work up front, but that work pays off in the long term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the data questions you care about.</p>
<p>In this chapter, you’ll first learn the definition of tidy data and see it applied to a simple toy dataset. Then we’ll dive into the primary tool you’ll use for tidying data: pivoting. Pivoting allows you to change the form of your data without changing any of the values. We’ll finish with a discussion of usefully untidy data and how you can create it if needed.</p>
<section id="prerequisites" data-type="sect2">
<section id="data-tidy-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter, we’ll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. tidyr is a member of the core tidyverse.</p>
@ -35,7 +35,7 @@ Tidy data</h1>
<pre data-type="programlisting" data-code-language="r">table1
#&gt; # A tibble: 6 × 4
#&gt; country year cases population
#&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 Afghanistan 1999 745 19987071
#&gt; 2 Afghanistan 2000 2666 20595360
#&gt; 3 Brazil 1999 37737 172006362
@ -45,7 +45,7 @@ Tidy data</h1>
table2
#&gt; # A tibble: 12 × 4
#&gt; country year type count
#&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;int&gt;
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 Afghanistan 1999 cases 745
#&gt; 2 Afghanistan 1999 population 19987071
#&gt; 3 Afghanistan 2000 cases 2666
@ -56,7 +56,7 @@ table2
table3
#&gt; # A tibble: 6 × 3
#&gt; country year rate
#&gt; * &lt;chr&gt; &lt;int&gt; &lt;chr&gt;
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 Afghanistan 1999 745/19987071
#&gt; 2 Afghanistan 2000 2666/20595360
#&gt; 3 Brazil 1999 37737/172006362
@ -68,14 +68,14 @@ table3
table4a # cases
#&gt; # A tibble: 3 × 3
#&gt; country `1999` `2000`
#&gt; * &lt;chr&gt; &lt;int&gt; &lt;int&gt;
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 Afghanistan 745 2666
#&gt; 2 Brazil 37737 80488
#&gt; 3 China 212258 213766
table4b # population
#&gt; # A tibble: 3 × 3
#&gt; country `1999` `2000`
#&gt; * &lt;chr&gt; &lt;int&gt; &lt;int&gt;
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 Afghanistan 19987071 20595360
#&gt; 2 Brazil 172006362 174504898
#&gt; 3 China 1272915272 1280428583</pre>
@ -106,7 +106,7 @@ table1 |&gt;
)
#&gt; # A tibble: 6 × 5
#&gt; country year cases population rate
#&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 Afghanistan 1999 745 19987071 0.373
#&gt; 2 Afghanistan 2000 2666 20595360 1.29
#&gt; 3 Brazil 1999 37737 172006362 2.19
@ -119,7 +119,7 @@ table1 |&gt;
count(year, wt = cases)
#&gt; # A tibble: 2 × 2
#&gt; year n
#&gt; &lt;int&gt; &lt;int&gt;
#&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 1999 250740
#&gt; 2 2000 296920
@ -133,7 +133,7 @@ ggplot(table1, aes(x = year, y = cases)) +
</div>
</div>
<section id="exercises" data-type="sect2">
<section id="data-tidy-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Using prose, describe how the variables and observations are organised in each of the sample tables.</p></li>
@ -166,21 +166,16 @@ Data in column names</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">billboard
#&gt; # A tibble: 317 × 79
#&gt; artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;date&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
#&gt; 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
#&gt; 3 3 Doors… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
#&gt; 4 3 Doors… Loser 2000-10-21 76 76 72 69 67 65 55 59
#&gt; 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
#&gt; 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
#&gt; # … with 311 more rows, and 68 more variables: wk9 &lt;dbl&gt;, wk10 &lt;dbl&gt;,
#&gt; # wk11 &lt;dbl&gt;, wk12 &lt;dbl&gt;, wk13 &lt;dbl&gt;, wk14 &lt;dbl&gt;, wk15 &lt;dbl&gt;, wk16 &lt;dbl&gt;,
#&gt; # wk17 &lt;dbl&gt;, wk18 &lt;dbl&gt;, wk19 &lt;dbl&gt;, wk20 &lt;dbl&gt;, wk21 &lt;dbl&gt;, wk22 &lt;dbl&gt;,
#&gt; # wk23 &lt;dbl&gt;, wk24 &lt;dbl&gt;, wk25 &lt;dbl&gt;, wk26 &lt;dbl&gt;, wk27 &lt;dbl&gt;, wk28 &lt;dbl&gt;,
#&gt; # wk29 &lt;dbl&gt;, wk30 &lt;dbl&gt;, wk31 &lt;dbl&gt;, wk32 &lt;dbl&gt;, wk33 &lt;dbl&gt;, wk34 &lt;dbl&gt;,
#&gt; # wk35 &lt;dbl&gt;, wk36 &lt;dbl&gt;, wk37 &lt;dbl&gt;, wk38 &lt;dbl&gt;, wk39 &lt;dbl&gt;, wk40 &lt;dbl&gt;,
#&gt; # wk41 &lt;dbl&gt;, wk42 &lt;dbl&gt;, wk43 &lt;dbl&gt;, wk44 &lt;dbl&gt;, wk45 &lt;dbl&gt;, …</pre>
#&gt; artist track date.entered wk1 wk2 wk3 wk4 wk5
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;date&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2 Pac Baby Don't Cry (Ke… 2000-02-26 87 82 72 77 87
#&gt; 2 2Ge+her The Hardest Part O… 2000-09-02 91 87 92 NA NA
#&gt; 3 3 Doors Down Kryptonite 2000-04-08 81 70 68 67 66
#&gt; 4 3 Doors Down Loser 2000-10-21 76 76 72 69 67
#&gt; 5 504 Boyz Wobble Wobble 2000-04-15 57 34 25 17 17
#&gt; 6 98^0 Give Me Just One N… 2000-08-19 51 39 34 26 26
#&gt; # … with 311 more rows, and 71 more variables: wk6 &lt;dbl&gt;, wk7 &lt;dbl&gt;,
#&gt; # wk8 &lt;dbl&gt;, wk9 &lt;dbl&gt;, wk10 &lt;dbl&gt;, wk11 &lt;dbl&gt;, wk12 &lt;dbl&gt;, wk13 &lt;dbl&gt;, …</pre>
</div>
<p>In this dataset, each observation is a song. The first three columns (<code>artist</code>, <code>track</code> and <code>date.entered</code>) are variables that describe the song. Then we have 76 columns (<code>wk1</code>-<code>wk76</code>) that describe the rank of the song in each week. Here, the column names are one variable (the <code>week</code>) and the cell values are another (the <code>rank</code>).</p>
<p>To tidy this data, well use <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>. After the data, there are three key arguments:</p>
@ -339,21 +334,16 @@ Many variables in column names</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">who2
#&gt; # A tibble: 7,240 × 58
#&gt; country year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554 sp_m_5564
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 Afghanist… 1980 NA NA NA NA NA NA
#&gt; 2 Afghanist… 1981 NA NA NA NA NA NA
#&gt; 3 Afghanist… 1982 NA NA NA NA NA NA
#&gt; 4 Afghanist… 1983 NA NA NA NA NA NA
#&gt; 5 Afghanist… 1984 NA NA NA NA NA NA
#&gt; 6 Afghanist… 1985 NA NA NA NA NA NA
#&gt; # … with 7,234 more rows, and 50 more variables: sp_m_65 &lt;dbl&gt;,
#&gt; # sp_f_014 &lt;dbl&gt;, sp_f_1524 &lt;dbl&gt;, sp_f_2534 &lt;dbl&gt;, sp_f_3544 &lt;dbl&gt;,
#&gt; # sp_f_4554 &lt;dbl&gt;, sp_f_5564 &lt;dbl&gt;, sp_f_65 &lt;dbl&gt;, sn_m_014 &lt;dbl&gt;,
#&gt; # sn_m_1524 &lt;dbl&gt;, sn_m_2534 &lt;dbl&gt;, sn_m_3544 &lt;dbl&gt;, sn_m_4554 &lt;dbl&gt;,
#&gt; # sn_m_5564 &lt;dbl&gt;, sn_m_65 &lt;dbl&gt;, sn_f_014 &lt;dbl&gt;, sn_f_1524 &lt;dbl&gt;,
#&gt; # sn_f_2534 &lt;dbl&gt;, sn_f_3544 &lt;dbl&gt;, sn_f_4554 &lt;dbl&gt;, sn_f_5564 &lt;dbl&gt;,
#&gt; # sn_f_65 &lt;dbl&gt;, ep_m_014 &lt;dbl&gt;, ep_m_1524 &lt;dbl&gt;, ep_m_2534 &lt;dbl&gt;, …</pre>
#&gt; country year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 Afghanistan 1980 NA NA NA NA NA
#&gt; 2 Afghanistan 1981 NA NA NA NA NA
#&gt; 3 Afghanistan 1982 NA NA NA NA NA
#&gt; 4 Afghanistan 1983 NA NA NA NA NA
#&gt; 5 Afghanistan 1984 NA NA NA NA NA
#&gt; 6 Afghanistan 1985 NA NA NA NA NA
#&gt; # … with 7,234 more rows, and 51 more variables: sp_m_5564 &lt;dbl&gt;,
#&gt; # sp_m_65 &lt;dbl&gt;, sp_f_014 &lt;dbl&gt;, sp_f_1524 &lt;dbl&gt;, sp_f_2534 &lt;dbl&gt;, …</pre>
</div>
<p>This dataset, collected by the WHO, records information about tuberculosis diagnoses. There are two columns that are already variables and are easy to interpret: <code>country</code> and <code>year</code>. They are followed by 56 columns like <code>sp_m_014</code>, <code>ep_m_4554</code>, and <code>rel_m_3544</code>. If you stare at these columns for long enough, you’ll notice there’s a pattern. Each column name is made up of three pieces separated by <code>_</code>. The first piece, <code>sp</code>/<code>rel</code>/<code>ep</code>, describes the method used for the <code>diagnosis</code>, the second piece, <code>m</code>/<code>f</code> is the <code>gender</code>, and the third piece, <code>014</code>/<code>1524</code>/<code>2534</code>/<code>3544</code>/<code>4554</code>/<code>5564</code>/<code>65</code> is the <code>age</code> range.</p>
<p>So in this case we have six variables: two variables are already columns, three variables are contained in the column name, and one variable is in the cell values. This requires two changes to our call to <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>: <code>names_to</code> gets a vector of column names and <code>names_sep</code> describes how to split the variable name up into pieces:</p>
@ -479,16 +469,16 @@ Widening data</h2>
values_from = prf_rate
)
#&gt; # A tibble: 500 × 9
#&gt; org_pac_id org_nm measure_title CAHPS_GRP_1 CAHPS_GRP_2 CAHPS_GRP_3
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 0446157747 USC CARE MEDI… CAHPS for MI… 63 NA NA
#&gt; 2 0446157747 USC CARE MEDI… CAHPS for MI… NA 87 NA
#&gt; 3 0446157747 USC CARE MEDI… CAHPS for MI… NA NA 86
#&gt; 4 0446157747 USC CARE MEDI… CAHPS for MI… NA NA NA
#&gt; 5 0446157747 USC CARE MEDI… CAHPS for MI… NA NA NA
#&gt; 6 0446157747 USC CARE MEDI… CAHPS for MI… NA NA NA
#&gt; # … with 494 more rows, and 3 more variables: CAHPS_GRP_5 &lt;dbl&gt;,
#&gt; # CAHPS_GRP_8 &lt;dbl&gt;, CAHPS_GRP_12 &lt;dbl&gt;</pre>
#&gt; org_pac_id org_nm measure_title CAHPS_GRP_1 CAHPS_GRP_2
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… 63 NA
#&gt; 2 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… NA 87
#&gt; 3 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… NA NA
#&gt; 4 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… NA NA
#&gt; 5 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… NA NA
#&gt; 6 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… NA NA
#&gt; # … with 494 more rows, and 4 more variables: CAHPS_GRP_3 &lt;dbl&gt;,
#&gt; # CAHPS_GRP_5 &lt;dbl&gt;, CAHPS_GRP_8 &lt;dbl&gt;, CAHPS_GRP_12 &lt;dbl&gt;</pre>
</div>
<p>The output doesn’t look quite right; we still seem to have multiple rows for each organization. That’s because, by default, <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> will attempt to preserve all the existing columns including <code>measure_title</code> which has six distinct observations for each organization. To fix this problem we need to tell <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> which columns identify each row; in this case those are the variables starting with <code>"org"</code>:</p>
<div class="cell">
@ -515,7 +505,7 @@ Widening data</h2>
<section id="how-does-pivot_wider-work" data-type="sect2">
<h2>
How does<code>pivot_wider()</code> work?</h2>
How does pivot_wider() work?</h2>
<p>To understand how <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> works, let’s again start with a very simple dataset:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tribble(
@ -849,7 +839,7 @@ Pragmatic computation</h2>
</ul></section>
</section>
<section id="summary" data-type="sect1">
<section id="data-tidy-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter you learned about tidy data: data that has variables in columns and observations in rows. Tidy data makes working in the tidyverse easier, because it’s a consistent structure understood by most functions: the main challenge is transforming the data from whatever structure you receive it in to a tidy format. To that end, you learned about <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> which allow you to tidy up many untidy datasets. Of course, tidy data can’t solve every problem so we also showed you some places where you might want to deliberately untidy your data in order to present to humans, feed into statistical models, or just pragmatically get shit done. If you particularly enjoyed this chapter and want to learn more about the underlying theory, you can learn more about the history and theoretical underpinnings in the <a href="https://www.jstatsoft.org/article/view/v059i10">Tidy Data</a> paper published in the Journal of Statistical Software.</p>

View File

@ -1,12 +1,12 @@
<section data-type="chapter" id="chp-data-transform">
<h1><span id="sec-data-transform" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data transformation</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="data-transform-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>Visualisation is an important tool for generating insight, but it’s rare that you get the data in exactly the right form you need for it. Often you’ll need to create some new variables or summaries to see the most important patterns, or maybe you just want to rename the variables or reorder the observations to make the data a little easier to work with. You’ll learn how to do all that (and more!) in this chapter, which will introduce you to data transformation using the <strong>dplyr</strong> package and a new dataset on flights that departed New York City in 2013.</p>
<p>The goal of this chapter is to give you an overview of all the key tools for transforming a data frame. We’ll start with functions that operate on rows and then columns of a data frame. We will then introduce the ability to work with groups. We will end the chapter with a case study that showcases these functions in action and we’ll come back to the functions in more detail in later chapters, as we start to dig into specific types of data (e.g. numbers, strings, dates).</p>
<section id="prerequisites" data-type="sect2">
<section id="data-transform-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter we’ll focus on the dplyr package, another core member of the tidyverse. We’ll illustrate the key ideas using data from the nycflights13 package, and use ggplot2 to help us understand the data.</p>
@ -15,14 +15,14 @@ Prerequisites</h2>
library(tidyverse)
#&gt; ── Attaching core tidyverse packages ──────────────── tidyverse 1.3.2.9000 ──
#&gt; ✔ dplyr 1.0.99.9000 ✔ readr 2.1.3
#&gt; ✔ forcats 0.5.2.9000 ✔ stringr 1.5.0.9000
#&gt; ✔ forcats 0.5.2 ✔ stringr 1.5.0
#&gt; ✔ ggplot2 3.4.0.9000 ✔ tibble 3.1.8
#&gt; ✔ lubridate 1.9.0 ✔ tidyr 1.2.1.9001
#&gt; ✔ lubridate 1.9.0 ✔ tidyr 1.3.0
#&gt; ✔ purrr 1.0.1
#&gt; ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
#&gt; ✖ dplyr::filter() masks stats::filter()
#&gt; ✖ dplyr::lag() masks stats::lag()
#&gt; Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors</pre>
#&gt; Use the conflicted package (&lt;http://conflicted.r-lib.org/&gt;) to force all conflicts to become errors</pre>
</div>
<p>Take careful note of the conflicts message that’s printed when you load the tidyverse. It tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: <code><a href="https://rdrr.io/r/stats/filter.html">stats::filter()</a></code> and <code><a href="https://rdrr.io/r/stats/lag.html">stats::lag()</a></code>. So far we’ve mostly ignored which package a function comes from because most of the time it doesn’t matter. However, knowing the package can help you find help and related functions, so when we need to be precise about which package a function comes from, we’ll use the same syntax as R: <code>packagename::functionname()</code>.</p>
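<p>As a minimal sketch of that syntax (the vector <code>x</code> is made up for illustration), the following calls the base R function by its full name, so it behaves the same whether or not dplyr is loaded:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># A made-up numeric vector, used only for illustration
x &lt;- c(1, 2, 3, 4, 5)

# stats::filter() applies a linear filter to a series; three equal weights
# give a centred three-point moving average, with NA at both ends where
# the window runs off the data
moving_avg &lt;- stats::filter(x, rep(1 / 3, 3))
as.numeric(moving_avg)</pre>
</div>
<p>Writing plain <code>filter(x, rep(1 / 3, 3))</code> after loading dplyr would instead dispatch to <code>dplyr::filter()</code>, which expects a data frame as its first argument.</p>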
</section>
@ -43,9 +43,7 @@ nycflights13</h2>
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;</pre>
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>If you’ve used R before, you might notice that this data frame prints a little differently to other data frames you’ve seen. That’s because it’s a <strong>tibble</strong>, a special type of data frame used by the tidyverse to avoid some common gotchas. The most important difference is the way it prints: tibbles are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen. There are a few options to see everything. If you’re using RStudio, the most convenient is probably <code>View(flights)</code>, which will open an interactive scrollable and filterable view. Otherwise you can use <code>print(flights, width = Inf)</code> to show all columns, or use <code><a href="https://pillar.r-lib.org/reference/glimpse.html">glimpse()</a></code>:</p>
<div class="cell">
@ -103,7 +101,7 @@ Rows</h1>
<section id="filter" data-type="sect2">
<h2>
<code>filter()</code>
filter()
</h2>
<p><code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> allows you to keep rows based on the values of the columns<span data-type="footnote">Later, you’ll learn about the <code>slice_*()</code> family which allows you to choose rows based on their positions.</span>. The first argument is the data frame. The second and subsequent arguments are the conditions that must be true to keep the row. For example, we could find all flights that arrived more than 120 minutes (two hours) late:</p>
<div class="cell">
@ -119,9 +117,7 @@ Rows</h1>
#&gt; 5 2013 1 1 1505 1310 115 1638 1431
#&gt; 6 2013 1 1 1525 1340 105 1831 1626
#&gt; # … with 10,028 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;</pre>
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>As well as <code>&gt;</code> (greater than), you can use <code>&gt;=</code> (greater than or equal to), <code>&lt;</code> (less than), <code>&lt;=</code> (less than or equal to), <code>==</code> (equal to), and <code>!=</code> (not equal to). You can also use <code>&amp;</code> (and) or <code>|</code> (or) to combine multiple conditions:</p>
<div class="cell">
@ -138,9 +134,7 @@ flights |&gt;
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 836 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …
# Flights that departed in January or February
flights |&gt;
@ -155,9 +149,7 @@ flights |&gt;
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 51,949 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;</pre>
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>There’s a useful shortcut when you’re combining <code>|</code> and <code>==</code>: <code>%in%</code>. It keeps rows where the variable equals one of the values on the right:</p>
<div class="cell">
@ -174,9 +166,7 @@ flights |&gt;
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 51,949 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;</pre>
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>We’ll come back to these comparisons and logical operators in more detail in <a href="#chp-logicals" data-type="xref">#chp-logicals</a>.</p>
<p>When you run <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, dplyr executes the filtering operation, creating a new data frame, and then prints it. It doesn’t modify the existing <code>flights</code> dataset because dplyr functions never modify their inputs. To save the result, you need to use the assignment operator, <code>&lt;-</code>:</p>
@ -208,7 +198,7 @@ Common mistakes</h2>
<section id="arrange" data-type="sect2">
<h2>
<code>arrange()</code>
arrange()
</h2>
<p><code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> changes the order of the rows based on the value of the columns. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. For example, the following code sorts by the departure time, which is spread over four columns.</p>
<div class="cell">
@ -224,9 +214,7 @@ Common mistakes</h2>
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;</pre>
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>You can use <code><a href="https://dplyr.tidyverse.org/reference/desc.html">desc()</a></code> to re-order by a column in descending order. For example, this code shows the most delayed flights:</p>
<div class="cell">
@ -242,9 +230,7 @@ Common mistakes</h2>
#&gt; 5 2013 7 22 845 1600 1005 1044 1815
#&gt; 6 2013 4 10 1100 1900 960 1342 2211
#&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;</pre>
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>You can combine <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> to solve more complex problems. For example, we could look for the flights that were most delayed on arrival that left roughly on time:</p>
<div class="cell">
@ -261,17 +247,15 @@ Common mistakes</h2>
#&gt; 5 2013 9 19 648 641 7 1035 810
#&gt; 6 2013 4 18 655 700 -5 1213 950
#&gt; # … with 239,103 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;</pre>
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
</section>
<section id="distinct" data-type="sect2">
<h2>
<code>distinct()</code>
distinct()
</h2>
<p><code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code> finds all the unique rows in a dataset, so in a technical sense, it primarily operates on the rows. Most of the time, however, you’ll want to the distinct combination of some variables, so you can also optionally supply column names:</p>
<p><code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code> finds all the unique rows in a dataset, so in a technical sense, it primarily operates on the rows. Most of the time, however, you’ll want the distinct combination of some variables, so you can also optionally supply column names:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># This would remove any duplicate rows if there were any
flights |&gt;
@ -286,9 +270,7 @@ flights |&gt;
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …
# This finds all unique origin and destination pairs.
flights |&gt;
@ -307,7 +289,7 @@ flights |&gt;
<p>Note that if you want to find the number of duplicates, or rows that weren’t duplicated, you’re better off swapping <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code> for <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> and then filtering as needed.</p>
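<p>A minimal sketch of that approach, using a small made-up tibble (<code>pairs</code>, which is not part of nycflights13) so that the example stands alone:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical data: the EWR-IAH pair appears twice, the others once
pairs &lt;- tibble(
  origin = c("EWR", "EWR", "JFK", "LGA"),
  dest   = c("IAH", "IAH", "MIA", "ATL")
)

# count() adds a column n holding the number of rows per combination;
# filter(n &gt; 1) then keeps only the combinations that were duplicated
dups &lt;- pairs |&gt;
  count(origin, dest) |&gt;
  filter(n &gt; 1)
dups</pre>
</div>
<p>Swapping the condition for <code>n == 1</code> would instead keep the combinations that occur exactly once.</p>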
</section>
<section id="exercises" data-type="sect2">
<section id="data-transform-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
@ -334,7 +316,7 @@ Columns</h1>
<section id="sec-mutate" data-type="sect2">
<h2>
<code>mutate()</code>
mutate()
</h2>
<p>The job of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> is to add new columns that are calculated from the existing columns. In the transform chapters, you’ll learn a large set of functions that you can use to manipulate different types of variables. For now, we’ll stick with basic algebra, which allows us to compute the <code>gain</code>, how much time a delayed flight made up in the air, and the <code>speed</code> in miles per hour:</p>
<div class="cell">
@ -353,9 +335,7 @@ Columns</h1>
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 336,770 more rows, and 13 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;, gain &lt;dbl&gt;, speed &lt;dbl&gt;</pre>
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>By default, <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> adds new columns on the right hand side of your dataset, which makes it difficult to see what’s happening here. We can use the <code>.before</code> argument to instead add the variables to the left hand side<span data-type="footnote">Remember that in RStudio, the easiest way to see a dataset with many columns is <code><a href="https://rdrr.io/r/utils/View.html">View()</a></code>.</span>:</p>
<div class="cell">
@ -375,9 +355,7 @@ Columns</h1>
#&gt; 5 19 394. 2013 1 1 554 600 -6 812
#&gt; 6 -16 288. 2013 1 1 554 558 -4 740
#&gt; # … with 336,770 more rows, and 12 more variables: sched_arr_time &lt;int&gt;,
#&gt; # arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;</pre>
#&gt; # arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, …</pre>
</div>
<p>The <code>.</code> is a sign that <code>.before</code> is an argument to the function, not the name of a new variable. You can also use <code>.after</code> to add after a variable, and in both <code>.before</code> and <code>.after</code> you can use the variable name instead of a position. For example, we could add the new variables after <code>day</code>:</p>
<div class="cell">
@ -397,14 +375,12 @@ Columns</h1>
#&gt; 5 2013 1 1 19 394. 554 600 -6 812
#&gt; 6 2013 1 1 -16 288. 554 558 -4 740
#&gt; # … with 336,770 more rows, and 12 more variables: sched_arr_time &lt;int&gt;,
#&gt; # arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;</pre>
#&gt; # arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, …</pre>
</div>
<p>Alternatively, you can control which variables are kept with the <code>.keep</code> argument. A particularly useful value is <code>"used"</code>, which keeps only the columns that were used or created in the <code>mutate()</code> step, so you can see the inputs and outputs of your calculations side by side:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
mutate(,
mutate(
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours,
@ -425,7 +401,7 @@ Columns</h1>
<section id="sec-select" data-type="sect2">
<h2>
<code>select()</code>
select()
</h2>
<p>It’s not uncommon to get datasets with hundreds or even thousands of variables. In this situation, the first challenge is often just focusing on the variables you’re interested in. <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> allows you to rapidly zoom in on a useful subset using operations based on the names of the variables. <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> is not terribly useful with the flights data because we only have 19 variables, but you can still get the general idea of how it works:</p>
<div class="cell">
@ -470,8 +446,7 @@ flights |&gt;
#&gt; 5 554 600 -6 812 837 -25 DL
#&gt; 6 554 558 -4 740 728 12 UA
#&gt; # … with 336,770 more rows, and 9 more variables: flight &lt;int&gt;,
#&gt; # tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;,
#&gt; # hour &lt;dbl&gt;, minute &lt;dbl&gt;, time_hour &lt;dttm&gt;
#&gt; # tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, …
# Select all columns that are characters
flights |&gt;
@ -516,7 +491,7 @@ flights |&gt;
<section id="rename" data-type="sect2">
<h2>
<code>rename()</code>
rename()
</h2>
<p>If you want to keep all the existing variables and just rename a few, you can use <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code> instead of <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>:</p>
<div class="cell">
@ -532,9 +507,7 @@ flights |&gt;
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tail_num &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;</pre>
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tail_num &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>It works exactly the same way as <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, but keeps all the variables that aren’t explicitly selected.</p>
<p>If you have a bunch of inconsistently named columns and it would be painful to fix them all by hand, check out <code><a href="https://rdrr.io/pkg/janitor/man/clean_names.html">janitor::clean_names()</a></code>, which provides some useful automated cleaning.</p>
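<p>As a small sketch of that automated cleaning, assuming the janitor package is installed and using a made-up tibble called <code>messy</code> (not part of the flights data), the column names are converted to a consistent snake case style:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tibble)

# Hypothetical data with inconsistently named columns
messy &lt;- tibble(`First Name` = "Ada", `Tail Num` = "N14228")

# clean_names() rewrites the column names and leaves the values untouched
cleaned &lt;- janitor::clean_names(messy)
names(cleaned)</pre>
</div>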
@ -542,7 +515,7 @@ flights |&gt;
<section id="relocate" data-type="sect2">
<h2>
<code>relocate()</code>
relocate()
</h2>
<p>Use <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> to move variables around. You might want to collect related variables together or move important variables to the front. By default <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> moves variables to the front:</p>
<div class="cell">
@ -558,9 +531,7 @@ flights |&gt;
#&gt; 5 2013-01-01 06:00:00 116 2013 1 1 554 600
#&gt; 6 2013-01-01 05:00:00 150 2013 1 1 554 558
#&gt; # … with 336,770 more rows, and 12 more variables: dep_delay &lt;dbl&gt;,
#&gt; # arr_time &lt;int&gt;, sched_arr_time &lt;int&gt;, arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;,
#&gt; # flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, distance &lt;dbl&gt;,
#&gt; # hour &lt;dbl&gt;, minute &lt;dbl&gt;</pre>
#&gt; # arr_time &lt;int&gt;, sched_arr_time &lt;int&gt;, arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, …</pre>
</div>
<p>But you can use the same <code>.before</code> and <code>.after</code> arguments as <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> to choose where to put them:</p>
<div class="cell">
@ -576,9 +547,7 @@ flights |&gt;
#&gt; 5 600 -6 812 837 -25 DL 461
#&gt; 6 558 -4 740 728 12 UA 1696
#&gt; # … with 336,770 more rows, and 12 more variables: tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, year &lt;int&gt;, month &lt;int&gt;, day &lt;int&gt;,
#&gt; # dep_time &lt;int&gt;
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, …
flights |&gt;
relocate(starts_with("arr"), .before = dep_time)
#&gt; # A tibble: 336,776 × 19
@ -591,13 +560,11 @@ flights |&gt;
#&gt; 5 2013 1 1 812 -25 554 600 -6
#&gt; 6 2013 1 1 740 12 554 558 -4
#&gt; # … with 336,770 more rows, and 11 more variables: sched_arr_time &lt;int&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;</pre>
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
</section>
<section id="exercises-1" data-type="sect2">
<section id="data-transform-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<div class="cell">
@ -629,7 +596,7 @@ Groups</h1>
<section id="group_by" data-type="sect2">
<h2>
<code>group_by()</code>
group_by()
</h2>
<p>Use <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> to divide your dataset into groups meaningful for your analysis:</p>
<div class="cell">
@ -646,16 +613,14 @@ Groups</h1>
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;</pre>
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p><code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> doesn’t change the data but, if you look closely at the output, you’ll notice that it’s now “grouped by” month. This means subsequent operations will now work “by month”. <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> doesn’t do anything by itself; instead it changes the behavior of the subsequent verbs.</p>
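<p>As a minimal sketch of this idea, assuming a small made-up tibble called <code>toy</code> (not the flights data), you can check that grouping leaves the rows and columns untouched and only records the grouping variable:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical toy data, used only for illustration
toy &lt;- tibble(month = c(1, 1, 2), dep_delay = c(5, 15, 30))
by_month &lt;- toy |&gt; group_by(month)

# Same dimensions as before; only the grouping metadata differs
dim(by_month)
group_vars(by_month)

# A later verb now works "by month", giving one row per month
by_month |&gt; summarize(avg_delay = mean(dep_delay))</pre>
</div>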
</section>
<section id="sec-summarize" data-type="sect2">
<h2>
<code>summarize()</code>
summarize()
</h2>
<p>The most important grouped operation is a summary, which collapses each group to a single row. In dplyr, this operation is performed by <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code><span data-type="footnote">Or <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>, if you prefer British English.</span>, as shown by the following example, which computes the average departure delay by month:</p>
<div class="cell">
@ -717,7 +682,7 @@ Groups</h1>
<section id="the-slice_-functions" data-type="sect2">
<h2>
The<code>slice_</code> functions</h2>
The slice_ functions</h2>
<p>There are five handy functions that allow you to pick off specific rows within each group:</p>
<ul><li>
<code>df |&gt; slice_head(n = 1)</code> takes the first row from each group.</li>
@ -745,9 +710,7 @@ The<code>slice_</code> functions</h2>
#&gt; 5 2013 7 22 2257 759 898 121 1026
#&gt; 6 2013 7 10 2056 1505 351 2347 1758
#&gt; # … with 102 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;</pre>
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>This is similar to computing the max delay with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>, but you get the whole row instead of the single summary:</p>
<div class="cell">
@ -791,9 +754,7 @@ daily
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;</pre>
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>When you summarize a tibble grouped by more than one variable, each summary peels off the last group. In hindsight, this wasn’t a great way to make this function work, but it’s difficult to change without breaking existing code. To make it obvious what’s happening, dplyr displays a message that tells you how you can change this behavior:</p>
<div class="cell">
@ -834,7 +795,7 @@ Ungrouping</h2>
<p>As you can see, when you summarize an ungrouped data frame, you get a single row back because dplyr treats all the rows in an ungrouped data frame as belonging to one group.</p>
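<p>These grouping behaviors can be sketched on a small made-up tibble (<code>toy</code> is an assumption for illustration, not data from this chapter). Summarizing a tibble grouped by two variables drops the last grouping variable, <code>.groups = "drop"</code> removes all grouping, and summarizing ungrouped data returns a single row:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical toy data, used only for illustration
toy &lt;- tibble(
  year = c(2013, 2013, 2013, 2013),
  month = c(1, 1, 2, 2),
  delay = c(5, 15, 30, 10)
)

# Default: the last grouping variable (month) is peeled off
peeled &lt;- toy |&gt;
  group_by(year, month) |&gt;
  summarize(avg = mean(delay))
group_vars(peeled)

# Explicitly drop all grouping instead
dropped &lt;- toy |&gt;
  group_by(year, month) |&gt;
  summarize(avg = mean(delay), .groups = "drop")
group_vars(dropped)

# No grouping at all: every row belongs to one implicit group
overall &lt;- toy |&gt; summarize(avg = mean(delay))
nrow(overall)</pre>
</div>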
</section>
<section id="exercises-2" data-type="sect2">
<section id="data-transform-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about <code>flights |&gt; group_by(carrier, dest) |&gt; summarize(n())</code>)</p></li>
@ -996,7 +957,7 @@ batters
<p>You can find a good explanation of this problem and how to overcome it at <a href="http://varianceexplained.org/r/empirical_bayes_baseball/" class="uri">http://varianceexplained.org/r/empirical_bayes_baseball/</a> and <a href="https://www.evanmiller.org/how-not-to-sort-by-average-rating.html" class="uri">https://www.evanmiller.org/how-not-to-sort-by-average-rating.html</a>.</p>
</section>
<section id="summary" data-type="sect1">
<section id="data-transform-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you’ve learned the tools that dplyr provides for working with data frames. The tools are roughly grouped into three categories: those that manipulate the rows (like <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>), those that manipulate the columns (like <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>), and those that manipulate groups (like <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>). In this chapter, we’ve focused on these “whole data frame” tools, but you haven’t yet learned much about what you can do with the individual variable. We’ll come back to that in the Transform part of the book, where each chapter will give you tools for a specific type of variable.</p>


@ -1,6 +1,6 @@
<section data-type="chapter" id="chp-data-visualize">
<h1><span id="sec-data-visualization" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data visualization</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="data-visualize-introduction" data-type="sect1">
<h1>
Introduction</h1>
<blockquote class="blockquote">
@ -9,30 +9,30 @@ Introduction</h1>
<p>R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2 implements the <strong>grammar of graphics</strong>, a coherent system for describing and building graphs. With ggplot2, you can do more and faster by learning one system and applying it in many places.</p>
<p>This chapter will teach you how to visualize your data using <strong>ggplot2</strong>. We will start by creating a simple scatterplot and use that to introduce aesthetic mappings and geometric objects, the fundamental building blocks of ggplot2. We will then walk you through visualizing distributions of single variables as well as visualizing relationships between two or more variables. We’ll finish off with saving your plots and troubleshooting tips.</p>
<section id="prerequisites" data-type="sect2">
<section id="data-visualize-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>This chapter focuses on ggplot2, one of the core packages in the tidyverse. To access the datasets, help pages, and functions used in this chapter, load the tidyverse by running this code:</p>
<p>This chapter focuses on ggplot2, one of the core packages in the tidyverse. To access the datasets, help pages, and functions used in this chapter, load the tidyverse by running:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
#&gt; ── Attaching core tidyverse packages ──────────────── tidyverse 1.3.2.9000 ──
#&gt; ✔ dplyr 1.0.99.9000 ✔ readr 2.1.3
#&gt; ✔ forcats 0.5.2.9000 ✔ stringr 1.5.0.9000
#&gt; ✔ forcats 0.5.2 ✔ stringr 1.5.0
#&gt; ✔ ggplot2 3.4.0.9000 ✔ tibble 3.1.8
#&gt; ✔ lubridate 1.9.0 ✔ tidyr 1.2.1.9001
#&gt; ✔ lubridate 1.9.0 ✔ tidyr 1.3.0
#&gt; ✔ purrr 1.0.1
#&gt; ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
#&gt; ✖ dplyr::filter() masks stats::filter()
#&gt; ✖ dplyr::lag() masks stats::lag()
#&gt; Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors</pre>
#&gt; Use the conflicted package (&lt;http://conflicted.r-lib.org/&gt;) to force all conflicts to become errors</pre>
</div>
<p>That one line of code loads the core tidyverse; packages which you will use in almost every data analysis. It also tells you which functions from the tidyverse conflict with functions in base R (or from other packages you might have loaded).</p>
<p>That one line of code loads the core tidyverse: the packages that you will use in almost every data analysis. It also tells you which functions from the tidyverse conflict with functions in base R (or from other packages you might have loaded)<span data-type="footnote">You can eliminate that message and force conflict resolution to happen on demand by using the conflicted package, which becomes more important as you load more packages. You can learn more about conflicted at <a href="https://conflicted.r-lib.org" class="uri">https://conflicted.r-lib.org</a>.</span>.</p>
<p>If you run this code and get the error message <code>there is no package called 'tidyverse'</code>, you’ll need to first install it, then run <code><a href="https://rdrr.io/r/base/library.html">library()</a></code> once again.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">install.packages("tidyverse")
library(tidyverse)</pre>
</div>
<p>You only need to install a package once, but you need to reload it every time you start a new session.</p>
<p>You only need to install a package once, but you need to load it every time you start a new session.</p>
<p>In addition to tidyverse, we will also use the <strong>palmerpenguins</strong> package, which includes the <code>penguins</code> dataset containing body measurements for penguins on three islands in the Palmer Archipelago.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(palmerpenguins)</pre>
@ -47,20 +47,21 @@ First steps</h1>
<section id="the-penguins-data-frame" data-type="sect2">
<h2>
The<code>penguins</code> data frame</h2>
The penguins data frame</h2>
<p>You can test your answer with the <code>penguins</code> <strong>data frame</strong> found in palmerpenguins (a.k.a. <code><a href="https://allisonhorst.github.io/palmerpenguins/reference/penguins.html">palmerpenguins::penguins</a></code>). A data frame is a rectangular collection of variables (in the columns) and observations (in the rows). <code>penguins</code> contains 344 observations collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER<span data-type="footnote">Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. <a href="https://allisonhorst.github.io/palmerpenguins/" class="uri">https://allisonhorst.github.io/palmerpenguins/</a>. doi: 10.5281/zenodo.3960218.</span>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">penguins
#&gt; # A tibble: 344 × 8
#&gt; species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#&gt; &lt;fct&gt; &lt;fct&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 Adelie Torgers… 39.1 18.7 181 3750
#&gt; 2 Adelie Torgers… 39.5 17.4 186 3800
#&gt; 3 Adelie Torgers… 40.3 18 195 3250
#&gt; 4 Adelie Torgers… NA NA NA NA
#&gt; 5 Adelie Torgers… 36.7 19.3 193 3450
#&gt; 6 Adelie Torgers… 39.3 20.6 190 3650
#&gt; # … with 338 more rows, and 2 more variables: sex &lt;fct&gt;, year &lt;int&gt;</pre>
#&gt; species island bill_length_mm bill_depth_mm flipper_length_mm
#&gt; &lt;fct&gt; &lt;fct&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 Adelie Torgersen 39.1 18.7 181
#&gt; 2 Adelie Torgersen 39.5 17.4 186
#&gt; 3 Adelie Torgersen 40.3 18 195
#&gt; 4 Adelie Torgersen NA NA NA
#&gt; 5 Adelie Torgersen 36.7 19.3 193
#&gt; 6 Adelie Torgersen 39.3 20.6 190
#&gt; # … with 338 more rows, and 3 more variables: body_mass_g &lt;int&gt;, sex &lt;fct&gt;,
#&gt; # year &lt;int&gt;</pre>
</div>
<p>This data frame contains 8 columns. For an alternative view, where you can see all variables and the first few observations of each variable, use <code><a href="https://pillar.r-lib.org/reference/glimpse.html">glimpse()</a></code>. Or, if you’re in RStudio, run <code>View(penguins)</code> to open an interactive data viewer.</p>
<div class="cell">
@ -239,7 +240,7 @@ Adding aesthetics and layers</h2>
<p>We finally have a plot that perfectly matches our “ultimate goal”!</p>
</section>
<section id="exercises" data-type="sect2">
<section id="data-visualize-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>How many rows are in <code>penguins</code>? How many columns?</p></li>
@ -410,7 +411,7 @@ ggplot(penguins, aes(x = body_mass_g)) +
</div>
</section>
<section id="exercises-1" data-type="sect2">
<section id="data-visualize-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Make a bar plot of <code>species</code> of <code>penguins</code>, where you assign <code>species</code> to the <code>y</code> aesthetic. How is this plot different?</p></li>
@ -479,7 +480,7 @@ A numerical and a categorical variable</h2>
<li>Otherwise, we <em>set</em> the value of an aesthetic.</li>
</ul></section>
<section id="two-categorical-variables" data-type="sect2">
<section id="data-visualize-two-categorical-variables" data-type="sect2">
<h2>
Two categorical variables</h2>
<p>We can use segmented bar plots to visualize the relationship between two categorical variables. In creating this bar chart, we map the variable that defines the bars to the <code>x</code> aesthetic and the variable that further divides each bar into segments to the <code>fill</code> aesthetic.</p>
@ -498,7 +499,7 @@ ggplot(penguins, aes(x = island, fill = species)) +
</div>
</section>
<section id="two-numerical-variables" data-type="sect2">
<section id="data-visualize-two-numerical-variables" data-type="sect2">
<h2>
Two numerical variables</h2>
<p>So far you’ve learned about scatterplots (created with <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code>) and smooth curves (created with <code><a href="https://ggplot2.tidyverse.org/reference/geom_smooth.html">geom_smooth()</a></code>) for visualizing the relationship between two numerical variables. A scatterplot is probably the most commonly used plot for visualizing the relationship between two variables.</p>
@ -535,7 +536,7 @@ Three or more variables</h2>
<p>You will learn about many other geoms for visualizing distributions of variables and relationships between them in <a href="#chp-layers" data-type="xref">#chp-layers</a>.</p>
</section>
<section id="exercises-2" data-type="sect2">
<section id="data-visualize-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Which variables in <code>mpg</code> are categorical? Which variables are continuous? (Hint: type <code><a href="https://ggplot2.tidyverse.org/reference/mpg.html">?mpg</a></code> to read the documentation for the dataset). How can you see this information when you run <code>mpg</code>?</p></li>
@ -576,7 +577,7 @@ ggsave(filename = "my-plot.png")</pre>
<p>If you don’t specify the <code>width</code> and <code>height</code>, they will be taken from the dimensions of the current plotting device. For reproducible code, you’ll want to specify them. You can learn more about <code><a href="https://ggplot2.tidyverse.org/reference/ggsave.html">ggsave()</a></code> in the documentation.</p>
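<p>For instance, a minimal sketch that fixes the output size explicitly might look like this (the plot, the temporary file path, and the 6 by 4 inch size are arbitrary choices made for illustration):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Any plot will do; mpg ships with ggplot2
p &lt;- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Write to a temporary location so the sketch leaves no files behind
out &lt;- file.path(tempdir(), "mpg-plot.png")

# Passing width, height, and units makes the size independent of the
# current plotting device
ggsave(filename = out, plot = p, width = 6, height = 4, units = "in", dpi = 300)</pre>
</div>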
<p>Generally, however, we recommend that you assemble your final reports using Quarto, a reproducible authoring system that allows you to interleave your code and your prose and automatically include your plots in your write-ups. You will learn more about Quarto in <a href="#chp-quarto" data-type="xref">#chp-quarto</a>.</p>
<section id="exercises-3" data-type="sect2">
<section id="data-visualize-exercises-3" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
@ -607,7 +608,7 @@ Common problems</h1>
<p>If that doesn’t help, carefully read the error message. Sometimes the answer will be buried there! But when you’re new to R, even if the answer is in the error message, you might not yet know how to understand it. Another great tool is Google: try googling the error message, as it’s likely someone else has had the same problem, and has gotten help online.</p>
</section>
<section id="summary" data-type="sect1">
<section id="data-visualize-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you’ve learned the basics of data visualization with ggplot2. We started with the basic idea that underpins ggplot2: a visualization is a mapping from variables in your data to aesthetic properties like position, color, size and shape. You then learned about increasing the complexity and improving the presentation of your plots layer-by-layer. You also learned about commonly used plots for visualizing the distribution of a single variable as well as for visualizing relationships between two or more variables, by leveraging additional aesthetic mappings and/or splitting your plot into small multiples using faceting.</p>


@ -1,12 +1,12 @@
<section data-type="chapter" id="chp-databases">
<h1><span id="sec-import-databases" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Databases</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="databases-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>A huge amount of data lives in databases, so it’s essential that you know how to access it. Sometimes you can ask someone to download a snapshot into a <code>.csv</code> for you, but this gets painful quickly: every time you need to make a change you’ll have to communicate with another human. You want to be able to reach into the database directly to get the data you need, when you need it.</p>
<p>In this chapter, you’ll first learn the basics of the DBI package: how to use it to connect to a database and then retrieve data with a SQL<span data-type="footnote">SQL is either pronounced “s”-“q”-“l” or “sequel”.</span> query. <strong>SQL</strong>, short for <strong>s</strong>tructured <strong>q</strong>uery <strong>l</strong>anguage, is the lingua franca of databases, and is an important language for all data scientists to learn. That said, we’re not going to start with SQL, but instead we’ll teach you dbplyr, which can translate your dplyr code to SQL. We’ll use that as a way to teach you some of the most important features of SQL. You won’t become a SQL master by the end of the chapter, but you will be able to identify the most important components and understand what they do.</p>
<section id="prerequisites" data-type="sect2">
<section id="databases-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter, we’ll introduce DBI and dbplyr. DBI is a low-level interface that connects to databases and executes SQL; dbplyr is a high-level interface that translates your dplyr code to SQL queries then executes them with DBI.</p>
@ -148,7 +148,7 @@ as_tibble(dbGetQuery(con, sql))
<p>You’ll need to be a little careful with <code><a href="https://dbi.r-dbi.org/reference/dbGetQuery.html">dbGetQuery()</a></code> since it can potentially return more data than fits in memory. We won’t discuss it further here, but if you’re dealing with very large datasets it’s possible to deal with a “page” of data at a time by using <code><a href="https://dbi.r-dbi.org/reference/dbSendQuery.html">dbSendQuery()</a></code> to get a “result set” which you can page through by calling <code><a href="https://dbi.r-dbi.org/reference/dbFetch.html">dbFetch()</a></code> until <code><a href="https://dbi.r-dbi.org/reference/dbHasCompleted.html">dbHasCompleted()</a></code> returns <code>TRUE</code>.</p>
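<p>A rough sketch of that paging pattern, assuming the DBI and duckdb packages are installed and using a copy of the built-in <code>mtcars</code> data as a stand-in for a large table (the page size of 10 rows is an arbitrary choice for illustration):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(DBI)

# In-memory database with a small stand-in table
con &lt;- dbConnect(duckdb::duckdb())
dbWriteTable(con, "mtcars", mtcars)

# Send the query without pulling the results into R yet
res &lt;- dbSendQuery(con, "SELECT * FROM mtcars")

n_rows &lt;- 0
while (!dbHasCompleted(res)) {
  # Fetch one "page" of at most 10 rows and process it
  page &lt;- dbFetch(res, n = 10)
  n_rows &lt;- n_rows + nrow(page)
}

# Always release the result set and the connection when done
dbClearResult(res)
dbDisconnect(con, shutdown = TRUE)</pre>
</div>
<p>Here each page is only counted; in real code you would replace that step with whatever per-page processing you need.</p>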
</section>
<section id="other-functions" data-type="sect2">
<section id="databases-other-functions" data-type="sect2">
<h2>
Other functions</h2>
<p>There are lots of other functions in DBI that you might find useful if you’re managing your own data (like <code><a href="https://dbi.r-dbi.org/reference/dbWriteTable.html">dbWriteTable()</a></code> which we used in <a href="#sec-load-data" data-type="xref">#sec-load-data</a>), but we’re going to skip past them in the interest of staying focused on working with data that already lives in a database.</p>
@ -164,7 +164,7 @@ dbplyr basics</h1>
<pre data-type="programlisting" data-code-language="r">diamonds_db &lt;- tbl(con, "diamonds")
diamonds_db
#&gt; # Source: table&lt;diamonds&gt; [?? x 10]
#&gt; # Database: DuckDB 0.6.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
#&gt; # Database: DuckDB 0.6.1 [root@Darwin 22.2.0:R 4.2.1/:memory:]
#&gt; carat cut color clarity depth table price x y z
#&gt; &lt;dbl&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
@ -175,25 +175,24 @@ diamonds_db
#&gt; 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
#&gt; # … with more rows</pre>
</div>
<div data-type="note"><div class="callout-body d-flex">
<div data-type="note">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>There are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organised. In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you’re interested in:</p><div class="cell">
<p>There are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organised. In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you’re interested in:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds_db &lt;- tbl(con, in_schema("sales", "diamonds"))
diamonds_db &lt;- tbl(con, in_catalog("north_america", "sales", "diamonds"))</pre>
</div><p>Other times you might want to use your own SQL query as a starting point:</p><div class="cell">
</div>
<p>Other times you might want to use your own SQL query as a starting point:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds_db &lt;- tbl(con, sql("SELECT * FROM diamonds"))</pre>
</div>
<p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue <a href="https://github.com/tidyverse/dbplyr/issues/">on GitHub</a> to help us do better.</p>
<p>In the examples above note that <code>"year"</code> and <code>"type"</code> are wrapped in double quotes. That’s because these are <strong>reserved words</strong> in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.</p><p>When working with other databases you’re likely to see every variable name quoted because only a handful of client packages, like duckdb, know what all the reserved words are, so they quote everything to be safe.</p><pre data-type="programlisting" data-code-language="sql">SELECT "tailnum", "type", "manufacturer", "model", "year"
FROM "planes"</pre><p>Some other database systems use backticks instead of quotes:</p><pre data-type="programlisting" data-code-language="sql">SELECT `tailnum`, `type`, `manufacturer`, `model`, `year`
FROM `planes`</pre></div>
</div>
</div>
<p>This object is <strong>lazy</strong>; when you use dplyr verbs on it, dplyr doesnt do any work: it just records the sequence of operations that you want to perform and only performs them when needed. For example, take the following pipeline:</p>
<div class="cell">
@ -203,7 +202,7 @@ FROM `planes`</pre></div>
big_diamonds_db
#&gt; # Source: SQL [?? x 5]
#&gt; # Database: DuckDB 0.6.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
#&gt; # Database: DuckDB 0.6.1 [root@Darwin 22.2.0:R 4.2.1/:memory:]
#&gt; carat cut color clarity price
#&gt; &lt;dbl&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 1.54 Premium E VS2 15002
@ -304,25 +303,16 @@ planes |&gt; show_query()
<ul><li>In SQL, case doesn’t matter: you can write <code>select</code>, <code>SELECT</code>, or even <code>SeLeCt</code>. In this book we’ll stick with the common convention of writing SQL keywords in uppercase to distinguish them from table or variable names.</li>
<li>In SQL, order matters: you must always write the clauses in the order <code>SELECT</code>, <code>FROM</code>, <code>WHERE</code>, <code>GROUP BY</code>, <code>ORDER BY</code>. Confusingly, this order doesn’t match how the clauses are actually evaluated, which is first <code>FROM</code>, then <code>WHERE</code>, <code>GROUP BY</code>, <code>SELECT</code>, and <code>ORDER BY</code>.</li>
</ul><p>The following sections explore each clause in more detail.</p>
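<p>As a rough sketch of how the written clause order shows up in practice, the pipeline below uses one dplyr verb per clause and asks dbplyr for the SQL it would run. It assumes the DBI, dplyr, dbplyr, and duckdb packages are installed, and it uses a copy of the built-in <code>mtcars</code> data purely as a stand-in table. The generated query should list the clauses in the written order, even though the database evaluates <code>WHERE</code> and <code>GROUP BY</code> before <code>SELECT</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(DBI)
library(dplyr)
library(dbplyr)

# In-memory database with a small stand-in table
con &lt;- dbConnect(duckdb::duckdb())
dbWriteTable(con, "mtcars", mtcars)

query &lt;- tbl(con, "mtcars") |&gt;
  filter(cyl &gt; 4) |&gt;                               # WHERE
  group_by(gear) |&gt;                                # GROUP BY
  summarize(mean_mpg = mean(mpg, na.rm = TRUE)) |&gt; # SELECT
  arrange(desc(mean_mpg))                          # ORDER BY

# Print the SQL that dbplyr generates for this pipeline
query |&gt; show_query()

dbDisconnect(con, shutdown = TRUE)</pre>
</div>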
<div data-type="note"><div class="callout-body d-flex">
<div data-type="note">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>There are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organised. In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you’re interested in:</p><div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds_db &lt;- tbl(con, in_schema("sales", "diamonds"))
diamonds_db &lt;- tbl(con, in_catalog("north_america", "sales", "diamonds"))</pre>
</div><p>Other times you might want to use your own SQL query as a starting point:</p><div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds_db &lt;- tbl(con, sql("SELECT * FROM diamonds"))</pre>
</div>
<p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue <a href="https://github.com/tidyverse/dbplyr/issues/">on GitHub</a> to help us do better.</p>
<p>In the examples above note that <code>"year"</code> and <code>"type"</code> are wrapped in double quotes. That’s because these are <strong>reserved words</strong> in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.</p><p>When working with other databases you’re likely to see every variable name quoted because only a handful of client packages, like duckdb, know what all the reserved words are, so they quote everything to be safe.</p><pre data-type="programlisting" data-code-language="sql">SELECT "tailnum", "type", "manufacturer", "model", "year"
FROM "planes"</pre><p>Some other database systems use backticks instead of quotes:</p><pre data-type="programlisting" data-code-language="sql">SELECT `tailnum`, `type`, `manufacturer`, `model`, `year`
FROM `planes`</pre></div>
</div>
</div>
</section>
@ -356,26 +346,23 @@ planes |&gt;
#&gt; FROM planes</pre>
</div>
<p>This example also shows you how SQL does renaming. In SQL terminology renaming is called <strong>aliasing</strong> and is done with <code>AS</code>. Note that unlike <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, the old name is on the left and the new name is on the right.</p>
<div data-type="note"><div class="callout-body d-flex">
<div data-type="note">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
<p>In the examples above note that <code>"year"</code> and <code>"type"</code> are wrapped in double quotes. That’s because these are <strong>reserved words</strong> in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.</p>
<p>When working with other databases you’re likely to see every variable name quoted because only a handful of client packages, like duckdb, know what all the reserved words are, so they quote everything to be safe.</p>
<pre data-type="programlisting" data-code-language="sql">SELECT "tailnum", "type", "manufacturer", "model", "year"
FROM "planes"</pre>
<p>Some other database systems use backticks instead of quotes:</p>
<pre data-type="programlisting" data-code-language="sql">SELECT `tailnum`, `type`, `manufacturer`, `model`, `year`
FROM `planes`</pre>
</div>
<p>There are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organised. In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you’re interested in:</p><div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds_db &lt;- tbl(con, in_schema("sales", "diamonds"))
diamonds_db &lt;- tbl(con, in_catalog("north_america", "sales", "diamonds"))</pre>
</div><p>Other times you might want to use your own SQL query as a starting point:</p><div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds_db &lt;- tbl(con, sql("SELECT * FROM diamonds"))</pre>
</div>
<p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue <a href="https://github.com/tidyverse/dbplyr/issues/">on GitHub</a> to help us do better.</p>
<p>In the examples above note that <code>"year"</code> and <code>"type"</code> are wrapped in double quotes. That’s because these are <strong>reserved words</strong> in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.</p><p>When working with other databases you’re likely to see every variable name quoted because only a handful of client packages, like duckdb, know what all the reserved words are, so they quote everything to be safe.</p><pre data-type="programlisting" data-code-language="sql">SELECT "tailnum", "type", "manufacturer", "model", "year"
FROM "planes"</pre><p>Some other database systems use backticks instead of quotes:</p><pre data-type="programlisting" data-code-language="sql">SELECT `tailnum`, `type`, `manufacturer`, `model`, `year`
FROM `planes`</pre></div>
<p>The translations for <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> are similarly straightforward: each variable becomes a new expression in <code>SELECT</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
@ -461,7 +448,7 @@ flights |&gt;
#&gt; Use `na.rm = TRUE` to silence this warning
#&gt; This warning is displayed once every 8 hours.
#&gt; # Source: SQL [?? x 2]
#&gt; # Database: DuckDB 0.6.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
#&gt; # Database: DuckDB 0.6.1 [root@Darwin 22.2.0:R 4.2.1/:memory:]
#&gt; dest delay
#&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 ATL 11.3
@ -552,7 +539,7 @@ Subqueries</h2>
<p>Sometimes dbplyr will create a subquery where it’s not needed because it doesn’t yet know how to optimize that translation. As dbplyr improves over time, these cases will get rarer but will probably never go away.</p>
</section>
<section id="joins" data-type="sect2">
<section id="databases-joins" data-type="sect2">
<h2>
Joins</h2>
<p>If you’re familiar with dplyr’s joins, SQL joins are very similar. Here’s a simple example:</p>
@ -597,7 +584,7 @@ Other verbs</h2>
<p>dbplyr also translates other verbs like <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>, <code>slice_*()</code>, and <code><a href="https://generics.r-lib.org/reference/setops.html">intersect()</a></code>, and a growing selection of tidyr functions like <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code>. The easiest way to see the full set of what’s currently available is to visit the dbplyr website: <a href="https://dbplyr.tidyverse.org/reference/" class="uri">https://dbplyr.tidyverse.org/reference/</a>.</p>
</section>
<section id="exercises" data-type="sect2">
<section id="databases-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What is <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code> translated to? How about <code><a href="https://rdrr.io/r/utils/head.html">head()</a></code>?</p></li>
@ -731,7 +718,7 @@ flights |&gt;
<p>dbplyr also translates common string and date-time manipulation functions, which you can learn about in <code><a href="https://dbplyr.tidyverse.org/articles/translation-function.html">vignette("translation-function", package = "dbplyr")</a></code>. dbplyr’s translations are certainly not perfect, and there are many R functions that aren’t translated yet, but dbplyr does a surprisingly good job covering the functions that you’ll use most of the time.</p>
</section>
<section id="summary" data-type="sect1">
<section id="databases-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter you learned how to access data from databases. We focused on dbplyr, a dplyr “backend” that allows you to write the dplyr code you’re familiar with, and have it be automatically translated to SQL. We used that translation to teach you a little SQL; it’s important to learn some SQL because it’s <em>the</em> most commonly used language for working with data and knowing some will make it easier for you to communicate with other data folks who don’t use R. If you’ve finished this chapter and would like to learn more about SQL, we have two recommendations:</p>

View File

@ -1,6 +1,6 @@
<section data-type="chapter" id="chp-datetimes">
<h1><span id="sec-dates-and-times" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Dates and times</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="datetimes-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>This chapter will show you how to work with dates and times in R. At first glance, dates and times seem simple. You use them all the time in your regular life, and they don’t seem to cause much confusion. However, the more you learn about dates and times, the more complicated they seem to get!</p>
@ -8,14 +8,12 @@ Introduction</h1>
<p>Dates and times are hard because they have to reconcile two physical phenomena (the rotation of the Earth and its orbit around the sun) with a whole raft of geopolitical phenomena including months, time zones, and DST. This chapter won’t teach you every last detail about dates and times, but it will give you a solid grounding of practical skills that will help you with common data analysis challenges.</p>
<p>We’ll begin by showing you how to create date-times from various inputs, and then once you’ve got a date-time, how you can extract components like year, month, and day. We’ll then dive into the tricky topic of working with time spans, which come in a variety of flavors depending on what you’re trying to do. We’ll conclude with a brief discussion of the additional challenges posed by time zones.</p>
<section id="prerequisites" data-type="sect2">
<section id="datetimes-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>This chapter will focus on the <strong>lubridate</strong> package, which makes it easier to work with dates and times in R. lubridate is not part of core tidyverse because you only need it when youre working with dates/times. We will also need nycflights13 for practice data.</p>
<p>This chapter will focus on the <strong>lubridate</strong> package, which makes it easier to work with dates and times in R. As of the latest tidyverse release, lubridate is part of the core tidyverse. We will also need nycflights13 for practice data.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
library(lubridate)
library(nycflights13)</pre>
</div>
</section>
@ -33,9 +31,9 @@ Creating date/times</h1>
<p>To get the current date or date-time you can use <code><a href="https://lubridate.tidyverse.org/reference/now.html">today()</a></code> or <code><a href="https://lubridate.tidyverse.org/reference/now.html">now()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">today()
#&gt; [1] "2023-01-12"
#&gt; [1] "2023-01-26"
now()
#&gt; [1] "2023-01-12 17:04:08 CST"</pre>
#&gt; [1] "2023-01-26 10:32:54 CST"</pre>
</div>
<p>Otherwise, the following sections describe the four ways you’re likely to create a date/time:</p>
<ul><li>While reading a file with readr.</li>
@ -281,9 +279,9 @@ From other types</h2>
<p>You may want to switch between a date-time and a date. That’s the job of <code><a href="https://lubridate.tidyverse.org/reference/as_date.html">as_datetime()</a></code> and <code><a href="https://lubridate.tidyverse.org/reference/as_date.html">as_date()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">as_datetime(today())
#&gt; [1] "2023-01-12 UTC"
#&gt; [1] "2023-01-26 UTC"
as_date(now())
#&gt; [1] "2023-01-12"</pre>
#&gt; [1] "2023-01-26"</pre>
</div>
<p>Sometimes you’ll get date/times as numeric offsets from the “Unix Epoch”, 1970-01-01. If the offset is in seconds, use <code><a href="https://lubridate.tidyverse.org/reference/as_date.html">as_datetime()</a></code>; if it’s in days, use <code><a href="https://lubridate.tidyverse.org/reference/as_date.html">as_date()</a></code>.</p>
<div class="cell">
@ -294,7 +292,7 @@ as_date(365 * 10 + 2)
</div>
</section>
<section id="exercises" data-type="sect2">
<section id="datetimes-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
@ -474,7 +472,7 @@ update(ymd("2023-02-01"), hour = 400)
</div>
</section>
<section id="exercises-1" data-type="sect2">
<section id="datetimes-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>How does the distribution of flight times within a day change over the course of the year?</p></li>
@ -507,12 +505,12 @@ Durations</h2>
<pre data-type="programlisting" data-code-language="r"># How old is Hadley?
h_age &lt;- today() - ymd("1979-10-14")
h_age
#&gt; Time difference of 15796 days</pre>
#&gt; Time difference of 15810 days</pre>
</div>
<p>A difftime class object records a time span of seconds, minutes, hours, days, or weeks. This ambiguity can make difftimes a little painful to work with, so lubridate provides an alternative which always uses seconds: the <strong>duration</strong>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">as.duration(h_age)
#&gt; [1] "1364774400s (~43.25 years)"</pre>
#&gt; [1] "1365984000s (~43.29 years)"</pre>
</div>
<p>Durations come with a bunch of convenient constructors:</p>
<div class="cell">
@ -530,7 +528,7 @@ dweeks(3)
dyears(1)
#&gt; [1] "31557600s (~1 years)"</pre>
</div>
<p>Durations always record the time span in seconds. Larger units are created by converting minutes, hours, days, weeks, and years to seconds: 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, and 7 days in a week. Larger time units are more problematic. A year is uses the “average” number of days in a year, i.e. 365.25. Theres no way to convert a month to a duration, because theres just too much variation.</p>
<p>Durations always record the time span in seconds. Larger units are created by converting minutes, hours, days, weeks, and years to seconds: 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, and 7 days in a week. Larger time units are more problematic. A year uses the “average” number of days in a year, i.e. 365.25. There’s no way to convert a month to a duration, because there’s just too much variation.</p>
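<p>The conversion factors above can be checked with plain arithmetic. This is a minimal base R sketch; the variable names are invented for illustration and are not part of lubridate:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">seconds_per_minute &lt;- 60
seconds_per_hour &lt;- 60 * seconds_per_minute
seconds_per_day &lt;- 24 * seconds_per_hour
seconds_per_week &lt;- 7 * seconds_per_day
# A duration "year" uses the average year length of 365.25 days
seconds_per_year &lt;- 365.25 * seconds_per_day

seconds_per_year
#&gt; [1] 31557600</pre>
</div>
<p>The result agrees with the number of seconds printed for <code>dyears(1)</code> above.</p>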
<p>You can add and multiply durations:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">2 * dyears(1)
@ -545,14 +543,14 @@ last_year &lt;- today() - dyears(1)</pre>
</div>
<p>However, because durations represent an exact number of seconds, sometimes you might get an unexpected result:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">one_pm &lt;- ymd_hms("2026-03-12 13:00:00", tz = "America/New_York")
<pre data-type="programlisting" data-code-language="r">one_am &lt;- ymd_hms("2026-03-08 01:00:00", tz = "America/New_York")
one_pm
#&gt; [1] "2026-03-12 13:00:00 EDT"
one_pm + ddays(1)
#&gt; [1] "2026-03-13 13:00:00 EDT"</pre>
one_am
#&gt; [1] "2026-03-08 01:00:00 EST"
one_am + ddays(1)
#&gt; [1] "2026-03-09 02:00:00 EDT"</pre>
</div>
<p>Why is one day after 1pm March 12, 2pm March 13? If you look carefully at the date you might also notice that the time zones have changed. March 12 only has 23 hours because its when DST starts, so if we add a full days worth of seconds we end up with a different time.</p>
<p>Why is one day after 1am March 8, 2am March 9? If you look carefully at the date you might also notice that the time zones have changed. March 8 only has 23 hours because it’s when DST starts, so if we add a full day’s worth of seconds we end up with a different time.</p>
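<p>One way to confirm that March 8, 2026 is a short day, sketched here in base R and assuming your system’s time zone database includes <code>America/New_York</code>, is to measure the elapsed time between two consecutive midnights. The variable names are ours:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">day_start &lt;- as.POSIXct("2026-03-08 00:00:00", tz = "America/New_York")
day_end &lt;- as.POSIXct("2026-03-09 00:00:00", tz = "America/New_York")

# Elapsed time between the two midnights, expressed in hours
difftime(day_end, day_start, units = "hours")
#&gt; Time difference of 23 hours</pre>
</div>
<p>Adding a duration of one day always adds 24 hours’ worth of seconds, which is one hour more than this particular calendar day contains.</p>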
</section>
<section id="periods" data-type="sect2">
@ -560,10 +558,10 @@ one_pm + ddays(1)
Periods</h2>
<p>To solve this problem, lubridate provides <strong>periods</strong>. Periods are time spans but don’t have a fixed length in seconds; instead they work with “human” times, like days and months. That allows them to work in a more intuitive way:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">one_pm
#&gt; [1] "2026-03-12 13:00:00 EDT"
one_pm + days(1)
#&gt; [1] "2026-03-13 13:00:00 EDT"</pre>
<pre data-type="programlisting" data-code-language="r">one_am
#&gt; [1] "2026-03-08 01:00:00 EST"
one_am + days(1)
#&gt; [1] "2026-03-09 01:00:00 EDT"</pre>
</div>
<p>Like durations, periods can be created with a number of friendly constructor functions.</p>
<div class="cell">
@ -591,10 +589,10 @@ ymd("2024-01-01") + years(1)
#&gt; [1] "2025-01-01"
# Daylight Savings Time
one_pm + ddays(1)
#&gt; [1] "2026-03-13 13:00:00 EDT"
one_pm + days(1)
#&gt; [1] "2026-03-13 13:00:00 EDT"</pre>
one_am + ddays(1)
#&gt; [1] "2026-03-09 02:00:00 EDT"
one_am + days(1)
#&gt; [1] "2026-03-09 01:00:00 EDT"</pre>
</div>
<p>Let’s use periods to fix an oddity related to our flight dates. Some planes appear to have arrived at their destination <em>before</em> they departed from New York City.</p>
<div class="cell">
@ -668,7 +666,7 @@ y2024 / days(1)
</div>
</section>
<section id="exercises-2" data-type="sect2">
<section id="datetimes-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Explain <code>days(overnight * 1)</code> to someone who has just started learning R. How does it work?</p></li>
@ -694,7 +692,7 @@ Time zones</h1>
<p>And see the complete list of all time zone names with <code><a href="https://rdrr.io/r/base/timezones.html">OlsonNames()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">length(OlsonNames())
#&gt; [1] 596
#&gt; [1] 597
head(OlsonNames())
#&gt; [1] "Africa/Abidjan" "Africa/Accra" "Africa/Addis_Ababa"
#&gt; [4] "Africa/Algiers" "Africa/Asmara" "Africa/Asmera"</pre>
@ -755,7 +753,7 @@ x4b - x4
</li>
</ul></section>
<section id="summary" data-type="sect1">
<section id="datetimes-summary" data-type="sect1">
<h1>
Summary</h1>
<p>This chapter has introduced you to the tools that lubridate provides to help you work with date-time data. Working with dates and times can seem harder than necessary, but hopefully this chapter has helped you see why: date-times are more complex than they seem at first glance, and handling every possible situation adds complexity. Even if your data never crosses a daylight savings boundary or involves a leap year, the functions need to be able to handle it.</p>

View File

@ -1,12 +1,12 @@
<section data-type="chapter" id="chp-factors">
<h1><span id="sec-factors" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Factors</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="factors-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>Factors are used for categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.</p>
<p>We’ll start by motivating why factors are needed for data analysis and how you can create them with <code><a href="https://rdrr.io/r/base/factor.html">factor()</a></code>. We’ll then introduce you to the <code>gss_cat</code> dataset which contains a bunch of categorical variables to experiment with. You’ll then use that dataset to practice modifying the order and values of factors, before we finish up with a discussion of ordered factors.</p>
<section id="prerequisites" data-type="sect2">
<section id="factors-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>Base R provides some basic tools for creating and manipulating factors. We’ll supplement these with the <strong>forcats</strong> package, which is part of the core tidyverse. It provides tools for dealing with <strong>cat</strong>egorical variables (and it’s an anagram of factors!) using a wide range of helpers for working with factors.</p>
@ -114,15 +114,16 @@ General Social Survey</h1>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">gss_cat
#&gt; # A tibble: 21,483 × 9
#&gt; year marital age race rincome partyid relig denom tvhours
#&gt; &lt;int&gt; &lt;fct&gt; &lt;int&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 2000 Never married 26 White $8000 to 9999 Ind,nea… Prot… Sout… 12
#&gt; 2 2000 Divorced 48 White $8000 to 9999 Not str… Prot… Bapt… NA
#&gt; 3 2000 Widowed 67 White Not applicable Indepen… Prot… No d… 2
#&gt; 4 2000 Never married 39 White Not applicable Ind,nea… Orth… Not … 4
#&gt; 5 2000 Divorced 25 White Not applicable Not str… None Not … 1
#&gt; 6 2000 Married 25 White $20000 - 24999 Strong … Prot… Sout… NA
#&gt; # … with 21,477 more rows</pre>
#&gt; year marital age race rincome partyid
#&gt; &lt;int&gt; &lt;fct&gt; &lt;int&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt;
#&gt; 1 2000 Never married 26 White $8000 to 9999 Ind,near rep
#&gt; 2 2000 Divorced 48 White $8000 to 9999 Not str republican
#&gt; 3 2000 Widowed 67 White Not applicable Independent
#&gt; 4 2000 Never married 39 White Not applicable Ind,near rep
#&gt; 5 2000 Divorced 25 White Not applicable Not str democrat
#&gt; 6 2000 Married 25 White $20000 - 24999 Strong democrat
#&gt; # … with 21,477 more rows, and 3 more variables: relig &lt;fct&gt;, denom &lt;fct&gt;,
#&gt; # tvhours &lt;int&gt;</pre>
</div>
<p>(Remember, since this dataset is provided by a package, you can get more information about the variables with <code><a href="https://forcats.tidyverse.org/reference/gss_cat.html">?gss_cat</a></code>.)</p>
<p>When factors are stored in a tibble, you can’t see their levels so easily. One way to view them is with <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>:</p>
@ -136,14 +137,6 @@ General Social Survey</h1>
#&gt; 2 Black 3129
#&gt; 3 White 16395</pre>
</div>
<p>Or with a bar chart:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(gss_cat, aes(x = race)) +
geom_bar()</pre>
<div class="cell-output-display">
<p><img src="factors_files/figure-html/unnamed-chunk-16-1.png" class="img-fluid" alt="A bar chart showing the distribution of race. There are ~2000 records with race &quot;Other&quot;, 3000 with race &quot;Black&quot;, and other 15,000 with race &quot;White&quot;." width="576"/></p>
</div>
</div>
<p>When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels. Those operations are described in the sections below.</p>
<section id="exercise" data-type="sect2">
@ -171,7 +164,7 @@ Modifying factor order</h1>
ggplot(relig_summary, aes(x = tvhours, y = relig)) +
geom_point()</pre>
<div class="cell-output-display">
<p><img src="factors_files/figure-html/unnamed-chunk-17-1.png" class="img-fluid" alt="A scatterplot of with tvhours on the x-axis and religion on the y-axis. The y-axis is ordered seemingly aribtrarily making it hard to get any sense of overall pattern." width="576"/></p>
<p><img src="factors_files/figure-html/unnamed-chunk-16-1.png" class="img-fluid" alt="A scatterplot with tvhours on the x-axis and religion on the y-axis. The y-axis is ordered seemingly arbitrarily, making it hard to get any sense of overall pattern." width="576"/></p>
</div>
</div>
<p>It is hard to read this plot because there’s no overall pattern. We can improve it by reordering the levels of <code>relig</code> using <code><a href="https://forcats.tidyverse.org/reference/fct_reorder.html">fct_reorder()</a></code>. <code><a href="https://forcats.tidyverse.org/reference/fct_reorder.html">fct_reorder()</a></code> takes three arguments:</p>
@ -184,7 +177,7 @@ ggplot(relig_summary, aes(x = tvhours, y = relig)) +
<pre data-type="programlisting" data-code-language="r">ggplot(relig_summary, aes(x = tvhours, y = fct_reorder(relig, tvhours))) +
geom_point()</pre>
<div class="cell-output-display">
<p><img src="factors_files/figure-html/unnamed-chunk-18-1.png" class="img-fluid" alt="The same scatterplot as above, but now the religion is displayed in increasing order of tvhours. &quot;Other eastern&quot; has the fewest tvhours under 2, and &quot;Don't know&quot; has the highest (over 5)." width="576"/></p>
<p><img src="factors_files/figure-html/unnamed-chunk-17-1.png" class="img-fluid" alt="The same scatterplot as above, but now the religion is displayed in increasing order of tvhours. &quot;Other eastern&quot; has the fewest tvhours under 2, and &quot;Don't know&quot; has the highest (over 5)." width="576"/></p>
</div>
</div>
<p>Reordering religion makes it much easier to see that people in the “Don’t know” category watch much more TV, and Hinduism &amp; Other Eastern religions watch much less.</p>
@ -210,7 +203,7 @@ ggplot(relig_summary, aes(x = tvhours, y = relig)) +
ggplot(rincome_summary, aes(x = age, y = fct_reorder(rincome, age))) +
geom_point()</pre>
<div class="cell-output-display">
<p><img src="factors_files/figure-html/unnamed-chunk-20-1.png" class="img-fluid" alt="A scatterplot with age on the x-axis and income on the y-axis. Income has been reordered in order of average age which doesn't make much sense. One section of the y-axis goes from $6000-6999, then &lt;$1000, then $8000-9999." width="576"/></p>
<p><img src="factors_files/figure-html/unnamed-chunk-19-1.png" class="img-fluid" alt="A scatterplot with age on the x-axis and income on the y-axis. Income has been reordered in order of average age which doesn't make much sense. One section of the y-axis goes from $6000-6999, then &lt;$1000, then $8000-9999." width="576"/></p>
</div>
</div>
<p>Here, arbitrarily reordering the levels isn’t a good idea! That’s because <code>rincome</code> already has a principled order that we shouldn’t mess with. Reserve <code><a href="https://forcats.tidyverse.org/reference/fct_reorder.html">fct_reorder()</a></code> for factors whose levels are arbitrarily ordered.</p>
@ -219,20 +212,13 @@ ggplot(rincome_summary, aes(x = age, y = fct_reorder(rincome, age))) +
<pre data-type="programlisting" data-code-language="r">ggplot(rincome_summary, aes(x = age, y = fct_relevel(rincome, "Not applicable"))) +
geom_point()</pre>
<div class="cell-output-display">
<p><img src="factors_files/figure-html/unnamed-chunk-21-1.png" class="img-fluid" alt="The same scatterplot but now &quot;Not Applicable&quot; is displayed at the bottom of the y-axis. Generally there is a positive association between income and age, and the income band with the highest average age is &quot;Not applicable&quot;." width="576"/></p>
<p><img src="factors_files/figure-html/unnamed-chunk-20-1.png" class="img-fluid" alt="The same scatterplot but now &quot;Not Applicable&quot; is displayed at the bottom of the y-axis. Generally there is a positive association between income and age, and the income band with the highest average age is &quot;Not applicable&quot;." width="576"/></p>
</div>
</div>
<p>Why do you think the average age for “Not applicable” is so high?</p>
<p>Another type of reordering is useful when you are coloring the lines on a plot. <code>fct_reorder2(f, x, y)</code> reorders the factor <code>f</code> by the <code>y</code> values associated with the largest <code>x</code> values. This makes the plot easier to read because the colors of the line at the far right of the plot will line up with the legend.</p>
<div>
<pre data-type="programlisting" data-code-language="r">#|
#| Rearranging the legend makes the plot easier to read because the
#| legend colors now match the order of the lines on the far right
#| of the plot. You can see some unsuprising patterns: the proportion
#| never marred decreases with age, married forms an upside down U
#| shape, and widowed starts off low but increases steeply after age
#| 60.
by_age &lt;- gss_cat |&gt;
<pre data-type="programlisting" data-code-language="r">by_age &lt;- gss_cat |&gt;
filter(!is.na(age)) |&gt;
count(age, marital) |&gt;
group_by(age) |&gt;
@ -249,10 +235,10 @@ ggplot(by_age, aes(x = age, y = prop, color = fct_reorder2(marital, age, prop)))
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="factors_files/figure-html/unnamed-chunk-22-1.png" class="img-fluid" alt="A line plot with age on the x-axis and proportion on the y-axis. There is one line for each category of marital status: no answer, never married, separated, divorced, widowed, and married. It is a little hard to read the plot because the order of the legend is unrelated to the lines on the plot." width="384"/></p>
<p><img src="factors_files/figure-html/unnamed-chunk-21-1.png" class="img-fluid" alt="A line plot with age on the x-axis and proportion on the y-axis. There is one line for each category of marital status: no answer, never married, separated, divorced, widowed, and married. It is a little hard to read the plot because the order of the legend is unrelated to the lines on the plot. Rearranging the legend makes the plot easier to read because the legend colors now match the order of the lines on the far right of the plot. You can see some unsurprising patterns: the proportion never married decreases with age, married forms an upside down U shape, and widowed starts off low but increases steeply after age 60." width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="factors_files/figure-html/unnamed-chunk-22-2.png" class="img-fluid" alt="A line plot with age on the x-axis and proportion on the y-axis. There is one line for each category of marital status: no answer, never married, separated, divorced, widowed, and married. It is a little hard to read the plot because the order of the legend is unrelated to the lines on the plot." width="384"/></p>
<p><img src="factors_files/figure-html/unnamed-chunk-21-2.png" class="img-fluid" alt="A line plot with age on the x-axis and proportion on the y-axis. There is one line for each category of marital status: no answer, never married, separated, divorced, widowed, and married. It is a little hard to read the plot because the order of the legend is unrelated to the lines on the plot. Rearranging the legend makes the plot easier to read because the legend colors now match the order of the lines on the far right of the plot. You can see some unsurprising patterns: the proportion never married decreases with age, married forms an upside down U shape, and widowed starts off low but increases steeply after age 60." width="384"/></p>
</div>
</div>
</div>
@ -264,11 +250,11 @@ ggplot(by_age, aes(x = age, y = prop, color = fct_reorder2(marital, age, prop)))
ggplot(aes(x = marital)) +
geom_bar()</pre>
<div class="cell-output-display">
<p><img src="factors_files/figure-html/unnamed-chunk-23-1.png" class="img-fluid" alt="A bar char of marital status ordered in from least to most common: no answer (~0), separated (~1,000), widowed (~2,000), divorced (~3,000), never married (~5,000), married (~10,000)." width="576"/></p>
<p><img src="factors_files/figure-html/unnamed-chunk-22-1.png" class="img-fluid" alt="A bar chart of marital status ordered from least to most common: no answer (~0), separated (~1,000), widowed (~2,000), divorced (~3,000), never married (~5,000), married (~10,000)." width="576"/></p>
</div>
</div>
<section id="exercises" data-type="sect2">
<section id="factors-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>There are some suspiciously high numbers in <code>tvhours</code>. Is the mean a good summary?</p></li>
@ -402,7 +388,7 @@ Modifying factor levels</h1>
</div>
<p>Read the documentation to learn about <code><a href="https://forcats.tidyverse.org/reference/fct_lump.html">fct_lump_min()</a></code> and <code><a href="https://forcats.tidyverse.org/reference/fct_lump.html">fct_lump_prop()</a></code> which are useful in other cases.</p>
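<p>As a hedged sketch of how these two functions might be used on <code>gss_cat</code>: the thresholds of 1000 respondents and 5% are arbitrary values picked for illustration, not recommendations, and the result names <code>by_min</code> and <code>by_prop</code> are ours.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(forcats)

# Lump together religions that appear fewer than 1000 times
by_min &lt;- gss_cat |&gt;
  mutate(relig = fct_lump_min(relig, min = 1000)) |&gt;
  count(relig)
by_min

# Lump together religions that account for at most 5% of respondents
by_prop &lt;- gss_cat |&gt;
  mutate(relig = fct_lump_prop(relig, prop = 0.05)) |&gt;
  count(relig)
by_prop</pre>
</div>
<p>A count-based threshold is convenient when you know how many observations a group needs to be worth reporting; a proportion-based threshold adapts to the size of the dataset.</p>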
<section id="exercises-1" data-type="sect2">
<section id="factors-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?</p></li>
@ -426,7 +412,7 @@ Ordered factors</h1>
</ul><p>Given the arguable utility of these differences, we don't generally recommend using ordered factors.</p>
</section>
<section id="summary" data-type="sect1">
<section id="factors-summary" data-type="sect1">
<h1>
Summary</h1>
<p>This chapter introduced you to the handy forcats package for working with factors and to its most commonly used functions. forcats contains a wide range of other helpers that we didn't have space to discuss here, so whenever you're facing a factor analysis challenge that you haven't encountered before, I highly recommend skimming the <a href="https://forcats.tidyverse.org/reference/index.html">reference index</a> to see if there's a canned function that can help solve your problem.</p>

View File

@ -1,6 +1,6 @@
<section data-type="chapter" id="chp-functions">
<h1><span id="sec-functions" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Functions</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="functions-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>One of the best ways to improve your reach as a data scientist is to write functions. Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting. Writing a function has three big advantages over using copy-and-paste:</p>
@ -13,7 +13,7 @@ Introduction</h1>
<li>Plot functions that take a data frame as input and return a plot as output.</li>
</ul><p>Each of these sections includes many examples to help you generalize the patterns that you see. These examples wouldn't be possible without the help of folks on Twitter, and we encourage you to follow the links in the comments to see the original inspirations. You might also want to read the original motivating tweets for <a href="https://twitter.com/hadleywickham/status/1571603361350164486">general functions</a> and <a href="https://twitter.com/hadleywickham/status/1574373127349575680">plotting functions</a> to see even more functions.</p>
<section id="prerequisites" data-type="sect2">
<section id="functions-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>We'll wrap up a variety of functions from around the tidyverse. We'll also use nycflights13 as a source of familiar data to use our functions with.</p>
@ -273,13 +273,18 @@ mape &lt;- function(actual, predicted) {
</div>
<div data-type="note"><h1>
RStudio
</h1><p>Once you start writing functions, there are two RStudio shortcuts that are super useful:</p><ul><li><p>To find the definition of a function that youve written, place the cursor on the name of the function and press <code>F2</code>.</p></li>
</h1>
<p>Once you start writing functions, there are two RStudio shortcuts that are super useful:</p>
<ul><li><p>To find the definition of a function that you've written, place the cursor on the name of the function and press <code>F2</code>.</p></li>
<li><p>To quickly jump to a function, press <code>Ctrl + .</code> to open the fuzzy file and function finder and type the first few letters of your function name. You can also navigate to files, Quarto sections, and more, making it a very handy navigation tool.</p></li>
</ul></div>
</ul>
</div>
</section>
<section id="exercises" data-type="sect2">
<section id="functions-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
@ -610,7 +615,7 @@ diamonds |&gt; count_wide(c(clarity, color), cut)
<p>While our examples have mostly focused on dplyr, tidy evaluation also underpins tidyr, and if you look at the <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> docs you can see that <code>names_from</code> uses tidy-selection.</p>
</section>
<section id="exercises-1" data-type="sect2">
<section id="functions-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
@ -691,9 +696,6 @@ diamonds |&gt; histogram(carat, 0.1)</pre>
<pre data-type="programlisting" data-code-language="r">diamonds |&gt;
histogram(carat, 0.1) +
labs(x = "Size (in carats)", y = "Number of diamonds")</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-47-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
<section id="more-variables" data-type="sect2">
@ -706,15 +708,13 @@ linearity_check &lt;- function(df, x, y) {
df |&gt;
ggplot(aes(x = {{ x }}, y = {{ y }})) +
geom_point() +
geom_smooth(method = "loess", color = "red", se = FALSE) +
geom_smooth(method = "lm", color = "blue", se = FALSE)
geom_smooth(method = "loess", formula = y ~ x, color = "red", se = FALSE) +
geom_smooth(method = "lm", formula = y ~ x, color = "blue", se = FALSE)
}
starwars |&gt;
filter(mass &lt; 1000) |&gt;
linearity_check(mass, height)
#&gt; `geom_smooth()` using formula = 'y ~ x'
#&gt; `geom_smooth()` using formula = 'y ~ x'</pre>
linearity_check(mass, height)</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-48-1.png" class="img-fluid" width="576"/></p>
</div>
@ -837,15 +837,6 @@ density &lt;- function(color, facets, binwidth = 0.1) {
density()
density(cut)
density(cut, clarity)</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-54-1.png" class="img-fluid" width="576"/></p>
</div>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-54-2.png" class="img-fluid" width="576"/></p>
</div>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-54-3.png" class="img-fluid" width="576"/></p>
</div>
</div>
</section>
@ -880,7 +871,7 @@ diamonds |&gt; histogram(carat, 0.1)</pre>
<p>You can use the same approach in any other place where you want to supply a string in a ggplot2 plot.</p>
</section>
<section id="exercises-2" data-type="sect2">
<section id="functions-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<p>Build up a rich plotting function by incrementally implementing each of the steps below:</p>
@ -926,7 +917,7 @@ density &lt;- function(color, facets, binwidth = 0.1) {
</div>
<p>As you can see we recommend putting extra spaces inside of <code>{{ }}</code>. This makes it very obvious that something unusual is happening.</p>
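<p>As a minimal sketch of this style in practice, the hypothetical helper below (<code>grouped_mean()</code> and its argument names are our own illustration, not functions from dplyr) embraces two user-supplied columns, with the extra spaces making each <code>{{ }}</code> easy to spot:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical helper for illustration: each data-variable argument is
# embraced so that dplyr looks it up inside the data frame
grouped_mean &lt;- function(df, group_var, mean_var) {
  df |&gt;
    group_by({{ group_var }}) |&gt;
    summarize(mean = mean({{ mean_var }}, na.rm = TRUE))
}

# mtcars ships with base R; cyl and mpg are columns in it
mtcars |&gt; grouped_mean(cyl, mpg)</pre>
</div>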
<section id="exercises-3" data-type="sect2">
<section id="functions-exercises-3" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
@ -946,7 +937,7 @@ f3 &lt;- function(x, y) {
</ol></section>
</section>
<section id="summary" data-type="sect1">
<section id="functions-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you learned how to write functions for three useful scenarios: creating a vector, creating a data frame, or creating a plot. Along the way you saw many examples, which hopefully started to get your creative juices flowing, and gave you some ideas for where functions might help your analysis code.</p>

View File

@ -57,7 +57,7 @@ Python, Julia, and friends</h2>
</section>
</section>
<section id="prerequisites" data-type="sect1">
<section id="intro-prerequisites" data-type="sect1">
<h1>
Prerequisites</h1>
<p>We've made a few assumptions about what you already know to get the most out of this book. You should be generally numerically literate, and it's helpful if you have some programming experience already. If you've never programmed before, you might find <a href="https://rstudio-education.github.io/hopr/">Hands on Programming with R</a> by Garrett to be a valuable adjunct to this book.</p>
@ -99,16 +99,16 @@ The tidyverse</h2>
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
#&gt; ── Attaching core tidyverse packages ──────────────── tidyverse 1.3.2.9000 ──
#&gt; ✔ dplyr 1.0.99.9000 ✔ readr 2.1.3
#&gt; ✔ forcats 0.5.2.9000 ✔ stringr 1.5.0.9000
#&gt; ✔ forcats 0.5.2 ✔ stringr 1.5.0
#&gt; ✔ ggplot2 3.4.0.9000 ✔ tibble 3.1.8
#&gt; ✔ lubridate 1.9.0 ✔ tidyr 1.2.1.9001
#&gt; ✔ lubridate 1.9.0 ✔ tidyr 1.3.0
#&gt; ✔ purrr 1.0.1
#&gt; ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
#&gt; ✖ dplyr::filter() masks stats::filter()
#&gt; ✖ dplyr::lag() masks stats::lag()
#&gt; Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors</pre>
#&gt; Use the conflicted package (&lt;http://conflicted.r-lib.org/&gt;) to force all conflicts to become errors</pre>
</div>
<p>This tells you that tidyverse loads eight packages: ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr, and forcats. These are considered the <strong>core</strong> of the tidyverse because youll use them in almost every analysis.</p>
<p>This tells you that tidyverse loads nine packages: dplyr, forcats, ggplot2, lubridate, purrr, readr, stringr, tibble, and tidyr. These are considered the <strong>core</strong> of the tidyverse because you'll use them in almost every analysis.</p>
<p>Packages in the tidyverse change fairly frequently. You can check whether updates are available and optionally install them by running <code><a href="https://tidyverse.tidyverse.org/reference/tidyverse_update.html">tidyverse_update()</a></code>.</p>
</section>
@ -116,11 +116,16 @@ The tidyverse</h2>
<h2>
Other packages</h2>
<p>There are many other excellent packages that are not part of the tidyverse because they solve problems in a different domain or are designed with a different set of underlying principles. This doesn't make them better or worse, just different. In other words, the complement to the tidyverse is not the messyverse but many other universes of interrelated packages. As you tackle more data science projects with R, you'll learn new packages and new ways of thinking about data.</p>
<p>In this book, well use five data packages from outside the tidyverse:</p>
<p>We'll use many packages from outside the tidyverse in this book. For example, we use the following four data packages to provide interesting applications:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">install.packages(c("gapminder", "Lahman", "nycflights13", "palmerpenguins", "wakefield"))</pre>
<pre data-type="programlisting" data-code-language="r">install.packages(c("babynames", "gapminder", "nycflights13", "palmerpenguins"))</pre>
</div>
<p>These packages provide data on world development, baseball, airline flights, and body measurements of penguins that well use to illustrate key data science ideas, while the final one helps generate random data sets.</p>
<p>We'll also use a selection of other packages for one-off examples. You don't need to install them now; just remember that whenever you see an error like this:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggrepel)
#&gt; Error in library(ggrepel) : there is no package called 'ggrepel'</pre>
</div>
<p>You need to run <code>install.packages("ggrepel")</code> to install the package.</p>
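<p>If you would rather check before loading, one possible approach is sketched below. <code>install_if_missing()</code> is a hypothetical helper written for this illustration, not part of any package; it relies only on base R's <code>requireNamespace()</code> and <code>install.packages()</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Hypothetical helper: install a package only when it is not already available
install_if_missing &lt;- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Invisibly report whether the package can now be loaded
  invisible(requireNamespace(pkg, quietly = TRUE))
}

install_if_missing("ggrepel")
library(ggrepel)</pre>
</div>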
</section>
</section>
@ -177,17 +182,17 @@ Colophon</h1>
<td style="text-align: left;">1.1.0.9000</td>
<td style="text-align: left;">local</td>
</tr><tr class="even"><td style="text-align: left;">dbplyr</td>
<td style="text-align: left;">2.2.1.9000</td>
<td style="text-align: left;">2.3.0.9000</td>
<td style="text-align: left;">local</td>
</tr><tr class="odd"><td style="text-align: left;">dplyr</td>
<td style="text-align: left;">1.0.99.9000</td>
<td style="text-align: left;">Github (tidyverse/dplyr@f4bece54fb56e10d7ae6a3bb27f2afedd65683ca)</td>
<td style="text-align: left;">Github (tidyverse/dplyr@6a1d46965a0f3ac180456e16bbe004755ec8488e)</td>
</tr><tr class="even"><td style="text-align: left;">dtplyr</td>
<td style="text-align: left;">1.2.2</td>
<td style="text-align: left;">CRAN (R 4.2.0)</td>
</tr><tr class="odd"><td style="text-align: left;">forcats</td>
<td style="text-align: left;">0.5.2.9000</td>
<td style="text-align: left;">local</td>
<td style="text-align: left;">0.5.2</td>
<td style="text-align: left;">CRAN (R 4.2.0)</td>
</tr><tr class="even"><td style="text-align: left;">ggplot2</td>
<td style="text-align: left;">3.4.0.9000</td>
<td style="text-align: left;">Github (tidyverse/ggplot2@4fea51b1eb2cdacebeacf425627dcbc1d61a5d3e)</td>
@ -246,14 +251,14 @@ Colophon</h1>
<td style="text-align: left;">1.0.3</td>
<td style="text-align: left;">CRAN (R 4.2.0)</td>
</tr><tr class="odd"><td style="text-align: left;">stringr</td>
<td style="text-align: left;">1.5.0.9000</td>
<td style="text-align: left;">Github (tidyverse/stringr@e4601f7fdb125faafbd028cb9e32d23ef2d1efed)</td>
<td style="text-align: left;">1.5.0</td>
<td style="text-align: left;">CRAN (R 4.2.0)</td>
</tr><tr class="even"><td style="text-align: left;">tibble</td>
<td style="text-align: left;">3.1.8</td>
<td style="text-align: left;">CRAN (R 4.2.0)</td>
</tr><tr class="odd"><td style="text-align: left;">tidyr</td>
<td style="text-align: left;">1.2.1.9001</td>
<td style="text-align: left;">local</td>
<td style="text-align: left;">1.3.0</td>
<td style="text-align: left;">CRAN (R 4.2.0)</td>
</tr><tr class="even"><td style="text-align: left;">tidyverse</td>
<td style="text-align: left;">1.3.2.9000</td>
<td style="text-align: left;">Github (tidyverse/tidyverse@aeabcde8c6ae435f16b5173682d5667d292829fb)</td>
@ -261,9 +266,6 @@ Colophon</h1>
<td style="text-align: left;">1.3.3</td>
<td style="text-align: left;">CRAN (R 4.2.0)</td>
</tr></tbody></table></div>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">cli:::ruler()</pre>
</div>
</section>

View File

@ -1,6 +1,6 @@
<section data-type="chapter" id="chp-iteration">
<h1><span id="sec-iteration" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Iteration</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="iteration-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>In this chapter, you'll learn tools for iteration, repeatedly performing the same action on different objects. Iteration in R generally tends to look rather different from other programming languages because so much of it is implicit and we get it for free. For example, if you want to double a numeric vector <code>x</code> in R, you can just write <code>2 * x</code>. In most other languages, you'd need to explicitly double each element of <code>x</code> using some sort of for loop.</p>
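<p>To make the contrast concrete, here is a small sketch that doubles a vector both ways. The vector <code>x</code> and the names <code>doubled_vec</code> and <code>doubled_loop</code> are made up for this illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(1, 5, 10)

# Implicit iteration: arithmetic is applied to every element for us
doubled_vec &lt;- 2 * x

# Explicit iteration: what you would write in many other languages
doubled_loop &lt;- numeric(length(x))
for (i in seq_along(x)) {
  doubled_loop[[i]] &lt;- 2 * x[[i]]
}

identical(doubled_vec, doubled_loop)
#&gt; [1] TRUE</pre>
</div>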
@ -13,17 +13,19 @@ Introduction</h1>
<code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> create new rows and columns for each element of a list-column.</li>
</ul><p>Now it's time to learn some more general tools, often called <strong>functional programming</strong> tools because they are built around functions that take other functions as inputs. Learning functional programming can easily veer into the abstract, but in this chapter we'll keep things concrete by focusing on three common tasks: modifying multiple columns, reading multiple files, and saving multiple objects.</p>
<section id="prerequisites" data-type="sect2">
<section id="iteration-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<div data-type="important"><div class="callout-body d-flex">
<div data-type="important">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>This chapter relies on features only found in dplyr 1.1.0, which is still in development. If you want to live life on the edge you can get the dev version with <code>devtools::install_github(c( "tidyverse/dplyr"))</code>.</p>
<p>This chapter relies on features only found in purrr 1.0.0 and dplyr 1.1.0, which are still in development. If you want to live life on the edge you can get the dev version with <code>devtools::install_github(c("tidyverse/purrr", "tidyverse/dplyr"))</code>.</p></div>
</div>
</div>
<p>In this chapter, we'll focus on tools provided by dplyr and purrr, both core members of the tidyverse. You've seen dplyr before, but <a href="http://purrr.tidyverse.org/">purrr</a> is new. We're just going to use a couple of purrr functions in this chapter, but it's a great package to explore as you improve your programming skills.</p>
<div class="cell">
@ -73,7 +75,7 @@ Modifying multiple columns</h1>
<section id="selecting-columns-with-.cols" data-type="sect2">
<h2>
Selecting columns with<code>.cols</code>
Selecting columns with .cols
</h2>
<p>The first argument to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>, <code>.cols</code>, selects the columns to transform. This uses the same specifications as <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <a href="#sec-select" data-type="xref">#sec-select</a>, so you can use functions like <code><a href="https://tidyselect.r-lib.org/reference/starts_with.html">starts_with()</a></code> and <code><a href="https://tidyselect.r-lib.org/reference/starts_with.html">ends_with()</a></code> to select columns based on their name.</p>
<p>There are two additional selection techniques that are particularly useful for <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>: <code><a href="https://tidyselect.r-lib.org/reference/everything.html">everything()</a></code> and <code><a href="https://tidyselect.r-lib.org/reference/where.html">where()</a></code>. <code><a href="https://tidyselect.r-lib.org/reference/everything.html">everything()</a></code> is straightforward: it selects every (non-grouping) column:</p>
@ -316,12 +318,10 @@ df_miss |&gt; filter(if_all(a:d, is.na))
<section id="across-in-functions" data-type="sect2">
<h2>
<code>across()</code> in functions</h2>
across() in functions</h2>
<p><code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> is particularly useful to program with because it allows you to operate on multiple columns. For example, <a href="https://twitter.com/_wurli/status/1571836746899283969">Jacob Scott</a> uses this little helper which wraps a bunch of lubridate functions to expand all date columns into year, month, and day columns:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(lubridate)
expand_dates &lt;- function(df) {
<pre data-type="programlisting" data-code-language="r">expand_dates &lt;- function(df) {
df |&gt;
mutate(
across(where(is.Date), list(year = year, month = month, day = mday))
@ -382,7 +382,7 @@ diamonds |&gt;
<section id="vs-pivot_longer" data-type="sect2">
<h2>
Vs<code>pivot_longer()</code>
Vs pivot_longer()
</h2>
<p>Before we go on, it's worth pointing out an interesting connection between <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> (<a href="#sec-pivoting" data-type="xref">#sec-pivoting</a>). In many cases, you can perform the same calculations by first pivoting the data and then performing the operations by group rather than by column. For example, take this multi-function summary:</p>
<div class="cell">
@ -472,7 +472,7 @@ df_long |&gt;
<p>If needed, you could <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> this back to the original form.</p>
</section>
<section id="exercises" data-type="sect2">
<section id="iteration-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Compute the number of unique values in each column of <code><a href="https://allisonhorst.github.io/palmerpenguins/reference/penguins.html">palmerpenguins::penguins</a></code>.</p></li>
@ -535,7 +535,7 @@ paths
</div>
</section>
<section id="lists" data-type="sect2">
<section id="iteration-lists" data-type="sect2">
<h2>
Lists</h2>
<p>Now that we have these 12 paths, we could call <code>read_excel()</code> 12 times to get 12 data frames:</p>
@ -575,7 +575,7 @@ gapminder_2007 &lt;- readxl::read_excel("data/gapminder/2007.xlsx")</pre>
<section id="purrrmap-and-list_rbind" data-type="sect2">
<h2>
<code>purrr::map()</code> and <code>list_rbind()</code>
purrr::map() and list_rbind()
</h2>
<p>The code to collect those data frames in a list “by hand” is basically just as tedious to type as code that reads the files one-by-one. Happily, we can use <code><a href="https://purrr.tidyverse.org/reference/map.html">purrr::map()</a></code> to make even better use of our <code>paths</code> vector. <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> is similar to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>, but instead of doing something to each column in a data frame, it does something to each element of a vector. <code>map(x, f)</code> is shorthand for:</p>
<div class="cell">
@ -919,7 +919,7 @@ DBI::dbCreateTable(con, "gapminder", template)</pre>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">con |&gt; tbl("gapminder")
#&gt; # Source: table&lt;gapminder&gt; [0 x 6]
#&gt; # Database: DuckDB 0.6.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
#&gt; # Database: DuckDB 0.6.1 [root@Darwin 22.2.0:R 4.2.1/:memory:]
#&gt; # … with 6 variables: country &lt;chr&gt;, continent &lt;chr&gt;, lifeExp &lt;dbl&gt;,
#&gt; # pop &lt;dbl&gt;, gdpPercap &lt;dbl&gt;, year &lt;dbl&gt;</pre>
</div>
@ -932,7 +932,7 @@ DBI::dbCreateTable(con, "gapminder", template)</pre>
DBI::dbAppendTable(con, "gapminder", df)
}</pre>
</div>
<p>Now we need to call <code>append_csv()</code> once for each element of <code>paths</code>. Thats certainly possible with <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code>:</p>
<p>Now we need to call <code>append_file()</code> once for each element of <code>paths</code>. That's certainly possible with <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">paths |&gt; map(append_file)</pre>
</div>
@ -946,7 +946,7 @@ DBI::dbCreateTable(con, "gapminder", template)</pre>
tbl("gapminder") |&gt;
count(year)
#&gt; # Source: SQL [?? x 2]
#&gt; # Database: DuckDB 0.6.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
#&gt; # Database: DuckDB 0.6.1 [root@Darwin 22.2.0:R 4.2.1/:memory:]
#&gt; year n
#&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 1952 142
@ -1071,7 +1071,7 @@ ggsave(by_clarity$path[[8]], by_clarity$plot[[8]], width = 6, height = 6)</pre>
</section>
</section>
<section id="summary" data-type="sect1">
<section id="iteration-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you've seen how to use explicit iteration to solve three problems that come up frequently when doing data science: manipulating multiple columns, reading multiple files, and saving multiple outputs. But in general, iteration is a super power: if you know the right iteration technique, you can easily go from fixing one problem to fixing all the problems. Once you've mastered the techniques in this chapter, we highly recommend learning more by reading the <a href="https://adv-r.hadley.nz/functionals.html">Functionals chapter</a> of <em>Advanced R</em> and consulting the <a href="https://purrr.tidyverse.org">purrr website</a>.</p>

View File

@ -1,6 +1,6 @@
<section data-type="chapter" id="chp-joins">
<h1><span id="sec-joins" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Joins</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="joins-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>It's rare that a data analysis involves only a single data frame. Typically you have many data frames, and you must <strong>join</strong> them together to answer the questions that you're interested in. This chapter will introduce you to two important types of joins:</p>
@ -8,7 +8,7 @@ Introduction</h1>
<li>Filtering joins, which filter observations from one data frame based on whether or not they match an observation in another.</li>
</ul><p>We'll begin by discussing keys, the variables used to connect a pair of data frames in a join. We cement the theory with an examination of the keys in the nycflights13 datasets, then use that knowledge to start joining data frames together. Next we'll discuss how joins work, focusing on their action on the rows. We'll finish up with a discussion of non-equi-joins, a family of joins that provide a more flexible way of matching keys than the default equality relationship.</p>
<section id="prerequisites" data-type="sect2">
<section id="joins-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter, we'll explore the five related datasets from nycflights13 using the join functions from dplyr.</p>
@ -22,7 +22,7 @@ library(nycflights13)</pre>
<section id="keys" data-type="sect1">
<h1>
Keys</h1>
<p>To understand joins, you need to first understand how two tables can be connected through a pair of keys, with on each table. In this section, youll learn about the two types of key and see examples of both in the datasets of the nycflights13 package. Youll also learn how to check that your keys are valid, and what to do if your table lacks a key.</p>
<p>To understand joins, you need to first understand how two tables can be connected through a pair of keys, within each table. In this section, you'll learn about the two types of key and see examples of both in the datasets of the nycflights13 package. You'll also learn how to check that your keys are valid, and what to do if your table lacks a key.</p>
<section id="primary-and-foreign-keys" data-type="sect2">
<h2>
@ -46,51 +46,52 @@ Primary and foreign keys</h2>
</li>
<li>
<p><code>airports</code> records data about each airport. You can identify each airport by its three letter airport code, making <code>faa</code> the primary key.</p>
<div class="cell">
<div class="cell" data-r.options="{&quot;width&quot;:67}">
<pre data-type="programlisting" data-code-language="r">airports
#&gt; # A tibble: 1,458 × 8
#&gt; faa name lat lon alt tz dst tzone
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America…
#&gt; 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America…
#&gt; 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America…
#&gt; 4 06N Randall Airport 41.4 -74.4 523 -5 A America…
#&gt; 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America…
#&gt; 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America…
#&gt; # … with 1,452 more rows</pre>
#&gt; faa name lat lon alt tz dst
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A
#&gt; 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A
#&gt; 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A
#&gt; 4 06N Randall Airport 41.4 -74.4 523 -5 A
#&gt; 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A
#&gt; 6 0A9 Elizabethton Municipal Airpo… 36.4 -82.2 1593 -5 A
#&gt; # … with 1,452 more rows, and 1 more variable: tzone &lt;chr&gt;</pre>
</div>
</li>
<li>
<p><code>planes</code> records data about each plane. You can identify a plane by its tail number, making <code>tailnum</code> the primary key.</p>
<div class="cell">
<div class="cell" data-r.options="{&quot;width&quot;:67}">
<pre data-type="programlisting" data-code-language="r">planes
#&gt; # A tibble: 3,322 × 9
#&gt; tailnum year type manufacturer model engines seats speed engine
#&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 N10156 2004 Fixed wing mul… EMBRAER EMB-… 2 55 NA Turbo…
#&gt; 2 N102UW 1998 Fixed wing mul… AIRBUS INDU… A320… 2 182 NA Turbo…
#&gt; 3 N103US 1999 Fixed wing mul… AIRBUS INDU… A320… 2 182 NA Turbo…
#&gt; 4 N104UW 1999 Fixed wing mul… AIRBUS INDU… A320… 2 182 NA Turbo…
#&gt; 5 N10575 2002 Fixed wing mul… EMBRAER EMB-… 2 55 NA Turbo…
#&gt; 6 N105UW 1999 Fixed wing mul… AIRBUS INDU… A320… 2 182 NA Turbo…
#&gt; # … with 3,316 more rows</pre>
#&gt; tailnum year type manufacturer model engines
#&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt;
#&gt; 1 N10156 2004 Fixed wing multi… EMBRAER EMB-145XR 2
#&gt; 2 N102UW 1998 Fixed wing multi… AIRBUS INDUSTR… A320-214 2
#&gt; 3 N103US 1999 Fixed wing multi… AIRBUS INDUSTR… A320-214 2
#&gt; 4 N104UW 1999 Fixed wing multi… AIRBUS INDUSTR… A320-214 2
#&gt; 5 N10575 2002 Fixed wing multi… EMBRAER EMB-145LR 2
#&gt; 6 N105UW 1999 Fixed wing multi… AIRBUS INDUSTR… A320-214 2
#&gt; # … with 3,316 more rows, and 3 more variables: seats &lt;int&gt;,
#&gt; # speed &lt;int&gt;, engine &lt;chr&gt;</pre>
</div>
</li>
<li>
<p><code>weather</code> records data about the weather at the origin airports. You can identify each observation by the combination of location and time, making <code>origin</code> and <code>time_hour</code> the compound primary key.</p>
<div class="cell">
<div class="cell" data-r.options="{&quot;width&quot;:67}">
<pre data-type="programlisting" data-code-language="r">weather
#&gt; # A tibble: 26,115 × 15
#&gt; origin year month day hour temp dewp humid wind_dir wind_speed
#&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
#&gt; 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
#&gt; 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
#&gt; 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
#&gt; 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
#&gt; 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
#&gt; # … with 26,109 more rows, and 5 more variables: wind_gust &lt;dbl&gt;,
#&gt; # precip &lt;dbl&gt;, pressure &lt;dbl&gt;, visib &lt;dbl&gt;, time_hour &lt;dttm&gt;</pre>
#&gt; origin year month day hour temp dewp humid wind_dir
#&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 EWR 2013 1 1 1 39.0 26.1 59.4 270
#&gt; 2 EWR 2013 1 1 2 39.0 27.0 61.6 250
#&gt; 3 EWR 2013 1 1 3 39.0 28.0 64.4 240
#&gt; 4 EWR 2013 1 1 4 39.9 28.0 62.2 250
#&gt; 5 EWR 2013 1 1 5 39.0 28.0 64.4 260
#&gt; 6 EWR 2013 1 1 6 37.9 28.0 67.2 240
#&gt; # … with 26,109 more rows, and 6 more variables: wind_speed &lt;dbl&gt;,
#&gt; # wind_gust &lt;dbl&gt;, precip &lt;dbl&gt;, pressure &lt;dbl&gt;, visib &lt;dbl&gt;, </pre>
</div>
</li>
</ul><p>A <strong>foreign key</strong> is a variable (or set of variables) that corresponds to a primary key in another table. For example:</p>
@ -139,23 +140,20 @@ weather |&gt;
filter(is.na(tailnum))
#&gt; # A tibble: 0 × 9
#&gt; # … with 9 variables: tailnum &lt;chr&gt;, year &lt;int&gt;, type &lt;chr&gt;,
#&gt; # manufacturer &lt;chr&gt;, model &lt;chr&gt;, engines &lt;int&gt;, seats &lt;int&gt;,
#&gt; # speed &lt;int&gt;, engine &lt;chr&gt;
#&gt; # manufacturer &lt;chr&gt;, model &lt;chr&gt;, engines &lt;int&gt;, seats &lt;int&gt;, …
weather |&gt;
filter(is.na(time_hour) | is.na(origin))
#&gt; # A tibble: 0 × 15
#&gt; # … with 15 variables: origin &lt;chr&gt;, year &lt;int&gt;, month &lt;int&gt;, day &lt;int&gt;,
#&gt; # hour &lt;int&gt;, temp &lt;dbl&gt;, dewp &lt;dbl&gt;, humid &lt;dbl&gt;, wind_dir &lt;dbl&gt;,
#&gt; # wind_speed &lt;dbl&gt;, wind_gust &lt;dbl&gt;, precip &lt;dbl&gt;, pressure &lt;dbl&gt;,
#&gt; # visib &lt;dbl&gt;, time_hour &lt;dttm&gt;</pre>
#&gt; # hour &lt;int&gt;, temp &lt;dbl&gt;, dewp &lt;dbl&gt;, humid &lt;dbl&gt;, wind_dir &lt;dbl&gt;, …</pre>
</div>
</section>
<section id="surrogate-keys" data-type="sect2">
<h2>
Surrogate keys</h2>
<p>So far we haven't talked about the primary key for <code>flights</code>. It's not super important here, because there are no data frames that use it as a foreign key, but it's still useful to consider because it's easier to work with observations if have some way to describe them to others.</p>
<p>So far we haven't talked about the primary key for <code>flights</code>. It's not super important here, because there are no data frames that use it as a foreign key, but it's still useful to consider because it's easier to work with observations if we have some way to describe them to others.</p>
<p>After a little thinking and experimentation, we determined that there are three variables that together uniquely identify each flight:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
@ -190,14 +188,12 @@ flights2
#&gt; 5 5 2013 1 1 554 600 -6 812
#&gt; 6 6 2013 1 1 554 558 -4 740
#&gt; # … with 336,770 more rows, and 12 more variables: sched_arr_time &lt;int&gt;,
#&gt; # arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;</pre>
#&gt; # arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, …</pre>
</div>
<p>Surrogate keys can be particularly useful when communicating to other humans: it's much easier to tell someone to take a look at flight 2001 than to say look at UA430 which departed 9am 2013-01-03.</p>
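<p>As a minimal sketch of both ideas, the code below first checks that a compound key is unique and then adds a surrogate key. The <code>trips</code> tibble and its values are hypothetical stand-ins for <code>flights</code>, invented for illustration; only <code>count()</code>, <code>filter()</code>, <code>mutate()</code>, and <code>row_number()</code> come from dplyr.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical toy data (not part of nycflights13)
trips &lt;- tibble(
  time_hour = as.POSIXct(
    c("2013-01-01 05:00", "2013-01-01 05:00", "2013-01-01 06:00"),
    tz = "UTC"
  ),
  carrier = c("UA", "AA", "UA"),
  flight = c(1545L, 1141L, 1545L)
)

# A compound key is valid if no combination occurs more than once
dups &lt;- trips |&gt;
  count(time_hour, carrier, flight) |&gt;
  filter(n &gt; 1)

# A surrogate key is simply the row number, placed in the first column
trips2 &lt;- trips |&gt;
  mutate(id = row_number(), .before = 1)</pre>
</div>
<p>If <code>dups</code> has zero rows, the three variables together identify each row, and <code>id</code> gives a shorter handle for the same rows.</p>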
</section>
<section id="exercises" data-type="sect2">
<section id="joins-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>We forgot to draw the relationship between <code>weather</code> and <code>airports</code> in <a href="#fig-flights-relationships" data-type="xref">#fig-flights-relationships</a>. What is the relationship and how should it appear in the diagram?</p></li>
@ -211,7 +207,7 @@ Exercises</h2>
<section id="sec-mutating-joins" data-type="sect1">
<h1>
Basic joins</h1>
<p>Now that you understand how data frames are connected via keys, we can start using joins to better understand the <code>flights</code> dataset. dplyr provides six join functions: <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">inner_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">right_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">semi_join()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">anti_join()</a></code>. They all have the same interface: they take a pair of data frames (<code>x</code> and <code>y</code>) and return a data frame. The order of the rows and columns in the output is primarily determined by <code>x</code>.</p>
<p>Now that you understand how data frames are connected via keys, we can start using joins to better understand the <code>flights</code> dataset. dplyr provides six join functions: <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">inner_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">right_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">semi_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">anti_join()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">full_join()</a></code>. They all have the same interface: they take a pair of data frames (<code>x</code> and <code>y</code>) and return a data frame. The order of the rows and columns in the output is primarily determined by <code>x</code>.</p>
<p>In this section, you'll learn how to use one mutating join, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code>, and two filtering joins, <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">semi_join()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">anti_join()</a></code>. In the next section, you'll learn exactly how these functions work, and about the remaining <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">inner_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">right_join()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">full_join()</a></code>.</p>
<section id="mutating-joins" data-type="sect2">
@ -271,15 +267,15 @@ flights2
left_join(planes |&gt; select(tailnum, type, engines, seats))
#&gt; Joining with `by = join_by(tailnum)`
#&gt; # A tibble: 336,776 × 9
#&gt; year time_hour origin dest tailnum carrier type engines seats
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Fixed… 2 149
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA Fixed… 2 149
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA Fixed… 2 178
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 Fixed… 2 200
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Fixed… 2 178
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Fixed… 2 191
#&gt; # … with 336,770 more rows</pre>
#&gt; year time_hour origin dest tailnum carrier type
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Fixed wing multi en…
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA Fixed wing multi en…
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA Fixed wing multi en…
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 Fixed wing multi en…
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Fixed wing multi en…
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Fixed wing multi en…
#&gt; # … with 336,770 more rows, and 2 more variables: engines &lt;int&gt;, seats &lt;int&gt;</pre>
</div>
<p>When <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code> fails to find a match for a row in <code>x</code>, it fills in the new variables with missing values. For example, there's no information about the plane with tail number <code>N3ALAA</code> so the <code>type</code>, <code>engines</code>, and <code>seats</code> will be missing:</p>
<div class="cell">
@ -326,16 +322,16 @@ Specifying join keys</h2>
<pre data-type="programlisting" data-code-language="r">flights2 |&gt;
left_join(planes, join_by(tailnum))
#&gt; # A tibble: 336,776 × 14
#&gt; year.x time_hour origin dest tailnum carrier year.y type
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA 1999 Fixed wing …
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA 1998 Fixed wing …
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA 1990 Fixed wing …
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 2012 Fixed wing …
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL 1991 Fixed wing …
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA 2012 Fixed wing …
#&gt; # … with 336,770 more rows, and 6 more variables: manufacturer &lt;chr&gt;,
#&gt; # model &lt;chr&gt;, engines &lt;int&gt;, seats &lt;int&gt;, speed &lt;int&gt;, engine &lt;chr&gt;</pre>
#&gt; year.x time_hour origin dest tailnum carrier year.y
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA 1999
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA 1998
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA 1990
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 2012
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL 1991
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA 2012
#&gt; # … with 336,770 more rows, and 7 more variables: type &lt;chr&gt;,
#&gt; # manufacturer &lt;chr&gt;, model &lt;chr&gt;, engines &lt;int&gt;, seats &lt;int&gt;, </pre>
</div>
<p>Note that the <code>year</code> variables are disambiguated in the output with a suffix (<code>year.x</code> and <code>year.y</code>), which tells you whether the variable came from the <code>x</code> or <code>y</code> argument. You can override the default suffixes with the <code>suffix</code> argument.</p>
<p><code>join_by(tailnum)</code> is short for <code>join_by(tailnum == tailnum)</code>. It's important to know about this fuller form for two reasons. Firstly, it describes the relationship between the two tables: the keys must be equal. That's why this type of join is often called an <strong>equi-join</strong>. You'll learn about non-equi-joins in <a href="#sec-non-equi-joins" data-type="xref">#sec-non-equi-joins</a>.</p>
@ -344,30 +340,30 @@ Specifying join keys</h2>
<pre data-type="programlisting" data-code-language="r">flights2 |&gt;
left_join(airports, join_by(dest == faa))
#&gt; # A tibble: 336,776 × 13
#&gt; year time_hour origin dest tailnum carrier name lat lon
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA George … 30.0 -95.3
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA George … 30.0 -95.3
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA Miami I… 25.8 -80.3
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 &lt;NA&gt; NA NA
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Hartsfi… 33.6 -84.4
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Chicago… 42.0 -87.9
#&gt; # … with 336,770 more rows, and 4 more variables: alt &lt;dbl&gt;, tz &lt;dbl&gt;,
#&gt; # dst &lt;chr&gt;, tzone &lt;chr&gt;
#&gt; year time_hour origin dest tailnum carrier name
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA George Bush Interco…
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA George Bush Interco…
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA Miami Intl
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 &lt;NA&gt;
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Hartsfield Jackson …
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Chicago Ohare Intl
#&gt; # … with 336,770 more rows, and 6 more variables: lat &lt;dbl&gt;, lon &lt;dbl&gt;,
#&gt; # alt &lt;dbl&gt;, tz &lt;dbl&gt;, dst &lt;chr&gt;, tzone &lt;chr&gt;
flights2 |&gt;
left_join(airports, join_by(origin == faa))
#&gt; # A tibble: 336,776 × 13
#&gt; year time_hour origin dest tailnum carrier name lat lon
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Newark … 40.7 -74.2
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA La Guar… 40.8 -73.9
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA John F … 40.6 -73.8
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 John F … 40.6 -73.8
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL La Guar… 40.8 -73.9
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Newark … 40.7 -74.2
#&gt; # … with 336,770 more rows, and 4 more variables: alt &lt;dbl&gt;, tz &lt;dbl&gt;,
#&gt; # dst &lt;chr&gt;, tzone &lt;chr&gt;</pre>
#&gt; year time_hour origin dest tailnum carrier name
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Newark Liberty Intl
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA La Guardia
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA John F Kennedy Intl
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 John F Kennedy Intl
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL La Guardia
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Newark Liberty Intl
#&gt; # … with 336,770 more rows, and 6 more variables: lat &lt;dbl&gt;, lon &lt;dbl&gt;,
#&gt; # alt &lt;dbl&gt;, tz &lt;dbl&gt;, dst &lt;chr&gt;, tzone &lt;chr&gt;</pre>
</div>
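<p>The three ways of specifying keys discussed so far can be compared side by side on a small example. This is only a sketch: the tibbles <code>x</code>, <code>planes_toy</code>, and <code>airports_toy</code> and their values are hypothetical, chosen to mirror the shape of <code>flights2</code>, <code>planes</code>, and <code>airports</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical toy tables
x &lt;- tibble(tailnum = c("N1", "N2"), year = c(2013, 2013), dest = c("IAH", "MIA"))
planes_toy &lt;- tibble(tailnum = c("N1", "N2"), year = c(1999, 1990))
airports_toy &lt;- tibble(faa = c("IAH", "MIA"), name = c("George Bush", "Miami Intl"))

# join_by(tailnum) is shorthand for join_by(tailnum == tailnum)
short &lt;- x |&gt; left_join(planes_toy, join_by(tailnum))
long &lt;- x |&gt; left_join(planes_toy, join_by(tailnum == tailnum))

# Replace the default ".x" and ".y" suffixes on the clashing year columns
named &lt;- x |&gt;
  left_join(planes_toy, join_by(tailnum), suffix = c("_flight", "_plane"))

# Keys with different names in each table
by_dest &lt;- x |&gt; left_join(airports_toy, join_by(dest == faa))</pre>
</div>
<p>Here <code>short</code> and <code>long</code> are the same data frame, <code>named</code> carries <code>year_flight</code> and <code>year_plane</code> instead of <code>year.x</code> and <code>year.y</code>, and <code>by_dest</code> keeps the key under its name in <code>x</code> (<code>dest</code>).</p>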
<p>In older code you might see a different way of specifying the join keys, using a character vector:</p>
<ul><li>
@ -396,17 +392,17 @@ Filtering joins</h2>
<pre data-type="programlisting" data-code-language="r">airports |&gt;
semi_join(flights2, join_by(faa == dest))
#&gt; # A tibble: 101 × 8
#&gt; faa name lat lon alt tz dst tzone
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 ABQ Albuquerque International Sunpo… 35.0 -107. 5355 -7 A Amer…
#&gt; 2 ACK Nantucket Mem 41.3 -70.1 48 -5 A Amer…
#&gt; 3 ALB Albany Intl 42.7 -73.8 285 -5 A Amer…
#&gt; 4 ANC Ted Stevens Anchorage Intl 61.2 -150. 152 -9 A Amer…
#&gt; 5 ATL Hartsfield Jackson Atlanta Intl 33.6 -84.4 1026 -5 A Amer
#&gt; 6 AUS Austin Bergstrom Intl 30.2 -97.7 542 -6 A Amer…
#&gt; faa name lat lon alt tz dst tzone
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 ABQ Albuquerque Internati… 35.0 -107. 5355 -7 A America/Denver
#&gt; 2 ACK Nantucket Mem 41.3 -70.1 48 -5 A America/New_Yo
#&gt; 3 ALB Albany Intl 42.7 -73.8 285 -5 A America/New_Yo
#&gt; 4 ANC Ted Stevens Anchorage… 61.2 -150. 152 -9 A America/Anchor…
#&gt; 5 ATL Hartsfield Jackson At… 33.6 -84.4 1026 -5 A America/New_Yo
#&gt; 6 AUS Austin Bergstrom Intl 30.2 -97.7 542 -6 A America/Chicago
#&gt; # … with 95 more rows</pre>
</div>
<p><strong>Anti-joins</strong> are the opposite: they return all rows in <code>x</code> that don't have a match in <code>y</code>. They're useful for finding missing values that are <strong>implicit</strong> in the data, the topic of <a href="#sec-missing-implicit" data-type="xref">#sec-missing-implicit</a>. Implicitly missing values don't show up as <code>NA</code>s but instead only exist as an absence. For example, we can find rows that as missing from <code>airports</code> by looking for flights that don't have a matching destination airport:</p>
<p><strong>Anti-joins</strong> are the opposite: they return all rows in <code>x</code> that don't have a match in <code>y</code>. They're useful for finding missing values that are <strong>implicit</strong> in the data, the topic of <a href="#sec-missing-implicit" data-type="xref">#sec-missing-implicit</a>. Implicitly missing values don't show up as <code>NA</code>s but instead only exist as an absence. For example, we can find rows that are missing from <code>airports</code> by looking for flights that don't have a matching destination airport:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights2 |&gt;
anti_join(airports, join_by(dest == faa)) |&gt;
@ -437,7 +433,7 @@ Filtering joins</h2>
</div>
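<p>The difference between the two filtering joins is easiest to see on a tiny example. The sketch below uses hypothetical tibbles (<code>airports_toy</code> and <code>flights_toy</code>, invented here) rather than the real nycflights13 tables.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical toy tables
airports_toy &lt;- tibble(faa = c("ATL", "ORD", "ZZZ"))
flights_toy &lt;- tibble(dest = c("ATL", "ATL", "BQN"))

# Semi-join: airports that appear as a destination.
# Rows of x are never duplicated, even though ATL matches twice.
used &lt;- airports_toy |&gt;
  semi_join(flights_toy, join_by(faa == dest))

# Anti-join: destinations with no matching airport record
unknown &lt;- flights_toy |&gt;
  anti_join(airports_toy, join_by(dest == faa)) |&gt;
  distinct(dest)</pre>
</div>
<p>In this toy data <code>used</code> contains only <code>ATL</code>, and <code>unknown</code> contains only <code>BQN</code>. Neither join adds columns from <code>y</code>.</p>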
</section>
<section id="exercises-1" data-type="sect2">
<section id="joins-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Find the 48 hours (over the course of the whole year) that have the worst delays. Cross-reference it with the <code>weather</code> data. Can you see any patterns?</p></li>
@ -655,15 +651,15 @@ Allow multiple rows</h2>
plane_flights
#&gt; # A tibble: 284,170 × 9
#&gt; tailnum type engines seats year time_hour origin dest carrier
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 N10156 Fixed… 2 55 2013 2013-01-10 06:00:00 EWR PIT EV
#&gt; 2 N10156 Fixed… 2 55 2013 2013-01-10 10:00:00 EWR CHS EV
#&gt; 3 N10156 Fixed… 2 55 2013 2013-01-10 15:00:00 EWR MSP EV
#&gt; 4 N10156 Fixed… 2 55 2013 2013-01-11 06:00:00 EWR CMH EV
#&gt; 5 N10156 Fixed… 2 55 2013 2013-01-11 11:00:00 EWR MCI EV
#&gt; 6 N10156 Fixed… 2 55 2013 2013-01-11 18:00:00 EWR PWM EV
#&gt; # … with 284,164 more rows</pre>
#&gt; tailnum type engines seats year time_hour origin
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt;
#&gt; 1 N10156 Fixed wing multi en… 2 55 2013 2013-01-10 06:00:00 EWR
#&gt; 2 N10156 Fixed wing multi en… 2 55 2013 2013-01-10 10:00:00 EWR
#&gt; 3 N10156 Fixed wing multi en… 2 55 2013 2013-01-10 15:00:00 EWR
#&gt; 4 N10156 Fixed wing multi en… 2 55 2013 2013-01-11 06:00:00 EWR
#&gt; 5 N10156 Fixed wing multi en… 2 55 2013 2013-01-11 11:00:00 EWR
#&gt; 6 N10156 Fixed wing multi en… 2 55 2013 2013-01-11 18:00:00 EWR
#&gt; # … with 284,164 more rows, and 2 more variables: dest &lt;chr&gt;, carrier &lt;chr&gt;</pre>
</div>
</section>
@ -814,19 +810,19 @@ Rolling joins</h2>
<p>Now imagine that you have a table of employee birthdays:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">employees &lt;- tibble(
name = wakefield::name(100),
name = sample(babynames::babynames$name, 100),
birthday = lubridate::ymd("2022-01-01") + (sample(365, 100, replace = TRUE) - 1)
)
employees
#&gt; # A tibble: 100 × 2
#&gt; name birthday
#&gt; &lt;variable&gt; &lt;date&gt;
#&gt; 1 Lindzy 2022-08-11
#&gt; 2 Santania 2022-03-01
#&gt; 3 Gardell 2022-03-04
#&gt; 4 Cyrille 2022-11-15
#&gt; 5 Kynli 2022-07-09
#&gt; 6 Sever 2022-02-03
#&gt; name birthday
#&gt; &lt;chr&gt; &lt;date&gt;
#&gt; 1 Case 2022-09-13
#&gt; 2 Shonnie 2022-03-30
#&gt; 3 Burnard 2022-01-10
#&gt; 4 Omer 2022-11-25
#&gt; 5 Hillel 2022-07-30
#&gt; 6 Curlie 2022-12-11
#&gt; # … with 94 more rows</pre>
</div>
<p>And for each employee we want to find the last party date that comes before (or on) their birthday. We can express that with a rolling join:</p>
@ -834,27 +830,22 @@ employees
<pre data-type="programlisting" data-code-language="r">employees |&gt;
left_join(parties, join_by(closest(birthday &gt;= party)))
#&gt; # A tibble: 100 × 4
#&gt; name birthday q party
#&gt; &lt;variable&gt; &lt;date&gt; &lt;int&gt; &lt;date&gt;
#&gt; 1 Lindzy 2022-08-11 3 2022-07-11
#&gt; 2 Santania 2022-03-01 1 2022-01-10
#&gt; 3 Gardell 2022-03-04 1 2022-01-10
#&gt; 4 Cyrille 2022-11-15 4 2022-10-03
#&gt; 5 Kynli 2022-07-09 2 2022-04-04
#&gt; 6 Sever 2022-02-03 1 2022-01-10
#&gt; name birthday q party
#&gt; &lt;chr&gt; &lt;date&gt; &lt;int&gt; &lt;date&gt;
#&gt; 1 Case 2022-09-13 3 2022-07-11
#&gt; 2 Shonnie 2022-03-30 1 2022-01-10
#&gt; 3 Burnard 2022-01-10 1 2022-01-10
#&gt; 4 Omer 2022-11-25 4 2022-10-03
#&gt; 5 Hillel 2022-07-30 3 2022-07-11
#&gt; 6 Curlie 2022-12-11 4 2022-10-03
#&gt; # … with 94 more rows</pre>
</div>
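<p>To see what <code>closest()</code> is doing, it can help to run the same join on a handful of rows. The sketch below uses hypothetical tables (<code>parties_toy</code> and <code>employees_toy</code>, invented for illustration) and assumes dplyr 1.1.0 or later, where <code>join_by()</code> and <code>closest()</code> are available.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical toy tables
parties_toy &lt;- tibble(
  q = 1:2,
  party = as.Date(c("2022-01-10", "2022-04-04"))
)
employees_toy &lt;- tibble(
  name = c("a", "b", "c"),
  birthday = as.Date(c("2022-02-01", "2022-04-04", "2022-05-01"))
)

# For each birthday, keep only the nearest party satisfying birthday &gt;= party
rolled &lt;- employees_toy |&gt;
  left_join(parties_toy, join_by(closest(birthday &gt;= party)))</pre>
</div>
<p>Without <code>closest()</code>, employee <code>c</code> would match both parties, because both satisfy <code>birthday &gt;= party</code>. With it, each employee gets a single row: <code>a</code> is assigned to quarter 1, while <code>b</code> (whose birthday falls exactly on a party date) and <code>c</code> are assigned to quarter 2.</p>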
<p>There is, however, one problem with this approach: the folks with birthdays before January 10 don't get a party:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">employees |&gt;
anti_join(parties, join_by(closest(birthday &gt;= party)))
#&gt; # A tibble: 4 × 2
#&gt; name birthday
#&gt; &lt;variable&gt; &lt;date&gt;
#&gt; 1 Janeida 2022-01-04
#&gt; 2 Aires 2022-01-07
#&gt; 3 Mikalya 2022-01-06
#&gt; 4 Carlynn 2022-01-08</pre>
#&gt; # A tibble: 0 × 2
#&gt; # … with 2 variables: name &lt;chr&gt;, birthday &lt;date&gt;</pre>
</div>
<p>To resolve that issue we'll need to tackle the problem a different way, with overlap joins.</p>
</section>
@ -910,19 +901,19 @@ parties
<pre data-type="programlisting" data-code-language="r">employees |&gt;
inner_join(parties, join_by(between(birthday, start, end)), unmatched = "error")
#&gt; # A tibble: 100 × 6
#&gt; name birthday q party start end
#&gt; &lt;variable&gt; &lt;date&gt; &lt;int&gt; &lt;date&gt; &lt;date&gt; &lt;date&gt;
#&gt; 1 Lindzy 2022-08-11 3 2022-07-11 2022-07-11 2022-10-02
#&gt; 2 Santania 2022-03-01 1 2022-01-10 2022-01-01 2022-04-03
#&gt; 3 Gardell 2022-03-04 1 2022-01-10 2022-01-01 2022-04-03
#&gt; 4 Cyrille 2022-11-15 4 2022-10-03 2022-10-03 2022-12-31
#&gt; 5 Kynli 2022-07-09 2 2022-04-04 2022-04-04 2022-07-10
#&gt; 6 Sever 2022-02-03 1 2022-01-10 2022-01-01 2022-04-03
#&gt; name birthday q party start end
#&gt; &lt;chr&gt; &lt;date&gt; &lt;int&gt; &lt;date&gt; &lt;date&gt; &lt;date&gt;
#&gt; 1 Case 2022-09-13 3 2022-07-11 2022-07-11 2022-10-02
#&gt; 2 Shonnie 2022-03-30 1 2022-01-10 2022-01-01 2022-04-03
#&gt; 3 Burnard 2022-01-10 1 2022-01-10 2022-01-01 2022-04-03
#&gt; 4 Omer 2022-11-25 4 2022-10-03 2022-10-03 2022-12-31
#&gt; 5 Hillel 2022-07-30 3 2022-07-11 2022-07-11 2022-10-02
#&gt; 6 Curlie 2022-12-11 4 2022-10-03 2022-10-03 2022-12-31
#&gt; # … with 94 more rows</pre>
</div>
</section>
<section id="exercises-2" data-type="sect2">
<section id="joins-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
@ -951,7 +942,7 @@ x |&gt; full_join(y, by = "key", keep = TRUE)
</ol></section>
</section>
<section id="summary" data-type="sect1">
<section id="joins-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you've learned how to use mutating and filtering joins to combine data from a pair of data frames. Along the way you learned how to identify keys, and the difference between primary and foreign keys. You also understand how joins work and how to figure out how many rows the output will have. Finally, you've gained a glimpse into the power of non-equi-joins and seen a few interesting use cases.</p>

View File

@ -1,28 +1,18 @@
<section data-type="chapter" id="chp-layers">
<h1><span id="sec-layers" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Layers</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="layers-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>In the <a href="#chp-data-visualize" data-type="xref">#chp-data-visualize</a>, you learned much more than just how to make scatterplots, bar charts, and boxplots. You learned a foundation that you can use to make <em>any</em> type of plot with ggplot2.</p>
<p>In this chapter, you'll expand on that foundation as you learn about the layered grammar of graphics. We'll start with a deeper dive into aesthetic mappings, geometric objects, and facets. Then, you will learn about statistical transformations ggplot2 makes under the hood when creating a plot. These transformations are used to calculate new values to plot, such as the heights of bars in a bar plot or medians in a box plot. You will also learn about position adjustments, which modify how geoms are displayed in your plots. Finally, we'll briefly introduce coordinate systems.</p>
<p>We will not cover every single function and option for each of these layers, but we will walk you through the most important and commonly used functionality provided by ggplot2 as well as introduce you to packages that extend ggplot2.</p>
<section id="prerequisites" data-type="sect2">
<section id="layers-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>This chapter focuses on ggplot2, one of the core packages in the tidyverse. To access the datasets, help pages, and functions used in this chapter, load the tidyverse by running this code:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
#&gt; ── Attaching core tidyverse packages ──────────────── tidyverse 1.3.2.9000 ──
#&gt; ✔ dplyr 1.0.99.9000 ✔ readr 2.1.3
#&gt; ✔ forcats 0.5.2.9000 ✔ stringr 1.5.0.9000
#&gt; ✔ ggplot2 3.4.0.9000 ✔ tibble 3.1.8
#&gt; ✔ lubridate 1.9.0 ✔ tidyr 1.2.1.9001
#&gt; ✔ purrr 1.0.1
#&gt; ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
#&gt; ✖ dplyr::filter() masks stats::filter()
#&gt; ✖ dplyr::lag() masks stats::lag()
#&gt; Use the conflicted package (&lt;http://conflicted.r-lib.org/&gt;) to force all conflicts to become errors</pre>
<pre data-type="programlisting" data-code-language="r">library(tidyverse)</pre>
</div>
</section>
</section>
@ -37,15 +27,15 @@ Aesthetic mappings</h1>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">mpg
#&gt; # A tibble: 234 × 11
#&gt; manufacturer model displ year cyl trans drv cty hwy fl class
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p comp…
#&gt; 2 audi a4 1.8 1999 4 manual(… f 21 29 p comp…
#&gt; 3 audi a4 2 2008 4 manual(… f 20 31 p comp…
#&gt; 4 audi a4 2 2008 4 auto(av) f 21 30 p comp…
#&gt; 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p comp…
#&gt; 6 audi a4 2.8 1999 6 manual(… f 18 26 p comp…
#&gt; # … with 228 more rows</pre>
#&gt; manufacturer model displ year cyl trans drv cty hwy fl
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p
#&gt; 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p
#&gt; 3 audi a4 2 2008 4 manual(m6) f 20 31 p
#&gt; 4 audi a4 2 2008 4 auto(av) f 21 30 p
#&gt; 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p
#&gt; 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p
#&gt; # … with 228 more rows, and 1 more variable: class &lt;chr&gt;</pre>
</div>
<p>Among the variables in <code>mpg</code> are:</p>
<ol type="1"><li><p><code>displ</code>: A car's engine size, in liters. A numerical variable.</p></li>
@ -134,7 +124,7 @@ ggplot(mpg, aes(x = displ, y = hwy, alpha = class)) +
<p>So far we have discussed aesthetics that we can map or set in a scatterplot, when using a point geom. You can learn more about all possible aesthetic mappings in the aesthetic specifications vignette at <a href="https://ggplot2.tidyverse.org/articles/ggplot2-specs.html" class="uri">https://ggplot2.tidyverse.org/articles/ggplot2-specs.html</a>.</p>
<p>The specific aesthetics you can use for a plot depend on the geom you use to represent the data. In the next section we dive deeper into geoms.</p>
<section id="exercises" data-type="sect2">
<section id="layers-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Create a scatterplot of <code>hwy</code> vs. <code>displ</code> where the points are pink filled in triangles.</p></li>
@ -285,7 +275,7 @@ ggplot(mpg, aes(x = hwy, y = drv, fill = drv, color = drv)) +
</div>
<p>The best place to get a comprehensive overview of all of the geoms ggplot2 offers, as well as all functions in the package, is the reference page: <a href="https://ggplot2.tidyverse.org/reference" class="uri">https://ggplot2.tidyverse.org/reference</a>. To learn more about any single geom, use the help (e.g. <code><a href="https://ggplot2.tidyverse.org/reference/geom_smooth.html">?geom_smooth</a></code>).</p>
<section id="exercises-1" data-type="sect2">
<section id="layers-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?</p></li>
@ -361,7 +351,7 @@ Facets</h1>
</div>
</div>
<section id="exercises-2" data-type="sect2">
<section id="layers-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What happens if you facet on a continuous variable?</p></li>
@ -502,7 +492,7 @@ ggplot(cut_frequencies, aes(x = cut, y = freq)) +
</li>
</ol><p>ggplot2 provides more than 20 stats for you to use. Each stat is a function, so you can get help in the usual way, e.g. <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">?stat_bin</a></code>.</p>
<section id="exercises-3" data-type="sect2">
<section id="layers-exercises-3" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What is the default geom associated with <code><a href="https://ggplot2.tidyverse.org/reference/stat_summary.html">stat_summary()</a></code>? How could you rewrite the previous plot to use that geom function instead of the stat function?</p></li>
@ -608,7 +598,7 @@ ggplot(diamonds, aes(x = cut, color = clarity)) +
<p>Adding randomness seems like a strange way to improve your plot, but while it makes your graph less accurate at small scales, it makes your graph <em>more</em> revealing at large scales. Because this is such a useful operation, ggplot2 comes with a shorthand for <code>geom_point(position = "jitter")</code>: <code><a href="https://ggplot2.tidyverse.org/reference/geom_jitter.html">geom_jitter()</a></code>.</p>
<p>To learn more about a position adjustment, look up the help page associated with each adjustment: <code><a href="https://ggplot2.tidyverse.org/reference/position_dodge.html">?position_dodge</a></code>, <code><a href="https://ggplot2.tidyverse.org/reference/position_stack.html">?position_fill</a></code>, <code><a href="https://ggplot2.tidyverse.org/reference/position_identity.html">?position_identity</a></code>, <code><a href="https://ggplot2.tidyverse.org/reference/position_jitter.html">?position_jitter</a></code>, and <code><a href="https://ggplot2.tidyverse.org/reference/position_stack.html">?position_stack</a></code>.</p>
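<p>As a small sketch of the shorthand described above, the two plots below should be equivalent apart from the random noise itself. The third variant is an illustrative choice rather than something from the text: it calls <code>position_jitter()</code> directly to control the amount of noise and fix the random seed so the plot is reproducible. The specific <code>width</code>, <code>height</code>, and <code>seed</code> values are arbitrary.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Long form: a point geom with a jitter position adjustment
p1 &lt;- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(position = "jitter")

# Shorthand for the same layer
p2 &lt;- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_jitter()

# Explicit control over the noise, with a fixed seed for reproducibility
p3 &lt;- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(position = position_jitter(width = 0.1, height = 0.1, seed = 1))</pre>
</div>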
<section id="exercises-4" data-type="sect2">
<section id="layers-exercises-4" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
@ -681,7 +671,7 @@ bar + coord_polar()</pre>
</div>
</li>
</ul>
<section id="exercises-5" data-type="sect2">
<section id="layers-exercises-5" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Turn a stacked bar chart into a pie chart using <code><a href="https://ggplot2.tidyverse.org/reference/coord_polar.html">coord_polar()</a></code>.</p></li>
@ -726,7 +716,7 @@ The layered grammar of graphics</h1>
<p>If you'd like to learn more about the theoretical underpinnings of ggplot2, you might enjoy reading “<a href="https://vita.had.co.nz/papers/layered-grammar.pdf">The Layered Grammar of Graphics</a>”, the scientific paper that describes the theory of ggplot2 in detail.</p>
</section>
<section id="summary" data-type="sect1">
<section id="layers-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter you learned about the layered grammar of graphics starting with aesthetics and geometries to build a simple plot, facets for splitting the plot into subsets, statistics for understanding how geoms are calculated, position adjustments for controlling the fine details of position when geoms might otherwise overlap, and coordinate systems that allow you to fundamentally change what <code>x</code> and <code>y</code> mean. One layer we have not yet touched on is theme, which we will introduce in <a href="#sec-themes" data-type="xref">#sec-themes</a>.</p>

View File

@ -1,12 +1,12 @@
<section data-type="chapter" id="chp-logicals">
<h1><span id="sec-logicals" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Logical vectors</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="logicals-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>In this chapter, you'll learn tools for working with logical vectors. Logical vectors are the simplest type of vector because each element can only be one of three possible values: <code>TRUE</code>, <code>FALSE</code>, and <code>NA</code>. It's relatively rare to find logical vectors in your raw data, but you'll create and manipulate them in the course of almost every analysis.</p>
<p>We'll begin by discussing the most common way of creating logical vectors: with numeric comparisons. Then you'll learn about how you can use Boolean algebra to combine different logical vectors, as well as some useful summaries. We'll finish off with <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code>, two useful functions for making conditional changes powered by logical vectors.</p>
<section id="prerequisites" data-type="sect2">
<section id="logicals-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>Most of the functions you'll learn about in this chapter are provided by base R, so we don't need the tidyverse, but we'll still load it so we can use <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, and friends to work with data frames. We'll also continue to draw examples from the nycflights13 dataset.</p>
@ -56,9 +56,7 @@ Comparisons</h1>
#&gt; 5 2013 1 1 606 610 -4 837 845
#&gt; 6 2013 1 1 607 607 0 858 915
#&gt; # … with 172,280 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;</pre>
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>It’s useful to know that this is a shortcut and you can explicitly create the underlying logical variables with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>:</p>
<div class="cell">
@ -151,17 +149,14 @@ x == y
filter(dep_time == NA)
#&gt; # A tibble: 0 × 19
#&gt; # … with 19 variables: year &lt;int&gt;, month &lt;int&gt;, day &lt;int&gt;, dep_time &lt;int&gt;,
#&gt; # sched_dep_time &lt;int&gt;, dep_delay &lt;dbl&gt;, arr_time &lt;int&gt;,
#&gt; # sched_arr_time &lt;int&gt;, arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;,
#&gt; # tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;,
#&gt; # hour &lt;dbl&gt;, minute &lt;dbl&gt;, time_hour &lt;dttm&gt;</pre>
#&gt; # sched_dep_time &lt;int&gt;, dep_delay &lt;dbl&gt;, arr_time &lt;int&gt;, …</pre>
</div>
<p>Instead, we’ll need a new tool: <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>.</p>
</section>
<section id="is.na" data-type="sect2">
<h2>
<code>is.na()</code>
is.na()
</h2>
<p><code>is.na(x)</code> works with any type of vector and returns <code>TRUE</code> for missing values and <code>FALSE</code> for everything else:</p>
<div class="cell">
@ -186,9 +181,7 @@ is.na(c("a", NA, "b"))
#&gt; 5 2013 1 2 NA 1540 NA NA 1747
#&gt; 6 2013 1 2 NA 1620 NA NA 1746
#&gt; # … with 8,249 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;</pre>
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p><code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code> can also be useful in <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>. <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> usually puts all the missing values at the end but you can override this default by first sorting by <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>:</p>
<div class="cell">
@ -205,9 +198,7 @@ is.na(c("a", NA, "b"))
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 836 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …
flights |&gt;
filter(month == 1, day == 1) |&gt;
@ -222,14 +213,12 @@ flights |&gt;
#&gt; 5 2013 1 1 517 515 2 830 819
#&gt; 6 2013 1 1 533 529 4 850 830
#&gt; # … with 836 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;</pre>
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>We’ll come back to cover missing values in more depth in <a href="#chp-missing-values" data-type="xref">#chp-missing-values</a>.</p>
</section>
<section id="exercises" data-type="sect2">
<section id="logicals-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>How does <code><a href="https://dplyr.tidyverse.org/reference/near.html">dplyr::near()</a></code> work? Type <code>near</code> to see the source code.</li>
@ -295,9 +284,7 @@ Order of operations</h2>
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;</pre>
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>This code doesn’t error, but it also doesn’t seem to have worked. What’s going on? Here, R first evaluates <code>month == 11</code>, creating a logical vector, which we call <code>nov</code>. It then computes <code>nov | 12</code>. When you use a number with a logical operator, R converts everything apart from 0 to <code>TRUE</code>, so this is equivalent to <code>nov | TRUE</code>, which will always be <code>TRUE</code>, so every row will be selected:</p>
<div class="cell">
@ -322,7 +309,7 @@ Order of operations</h2>
<section id="in" data-type="sect2">
<h2>
<code>%in%</code>
%in%
</h2>
<p>An easy way to avoid the problem of getting your <code>==</code>s and <code>|</code>s in the right order is to use <code>%in%</code>. <code>x %in% y</code> returns a logical vector the same length as <code>x</code> that is <code>TRUE</code> whenever a value in <code>x</code> is anywhere in <code>y</code>.</p>
<div class="cell">
@ -357,13 +344,11 @@ c(1, 2, NA) %in% NA
#&gt; 5 2013 1 1 NA 1500 NA NA 1825
#&gt; 6 2013 1 1 NA 600 NA NA 901
#&gt; # … with 8,797 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;</pre>
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
</section>
<section id="exercises-1" data-type="sect2">
<section id="logicals-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>Find all flights where <code>arr_delay</code> is missing but <code>dep_delay</code> is not. Find all flights where neither <code>arr_time</code> nor <code>sched_arr_time</code> are missing, but <code>arr_delay</code> is.</li>
@ -496,7 +481,7 @@ Logical subsetting</h2>
<p>Also note the difference in the group size: in the first chunk <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code> gives the number of delayed flights per day; in the second, <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code> gives the total number of flights.</p>
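<p>The difference in group size can be sketched on a small example. The <code>toy</code> tibble and its values below are hypothetical and stand in for <code>flights</code>; this is an illustration under those assumptions, not the chunks discussed above.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical toy data: arrival delays for flights on two days.
toy &lt;- tibble(
  day = c(1, 1, 1, 2, 2),
  arr_delay = c(10, -5, 20, -3, 40)
)

# Filter first: n() counts only the delayed flights that survive the filter.
filtered &lt;- toy |&gt;
  filter(arr_delay &gt; 0) |&gt;
  group_by(day) |&gt;
  summarize(behind = mean(arr_delay), n = n())

# Subset inside summarize(): n() still counts every flight in the group.
subsetted &lt;- toy |&gt;
  group_by(day) |&gt;
  summarize(behind = mean(arr_delay[arr_delay &gt; 0]), n = n())</pre>
</div>
<p>Both results have the same <code>behind</code> column, but <code>n</code> differs because the rows were dropped at different points.</p>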
</section>
<section id="exercises-2" data-type="sect2">
<section id="logicals-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>What will <code>sum(is.na(x))</code> tell you? How about <code>mean(is.na(x))</code>?</li>
@ -511,7 +496,7 @@ Conditional transformations</h1>
<section id="if_else" data-type="sect2">
<h2>
<code>if_else()</code>
if_else()
</h2>
<p>If you want to use one value when a condition is <code>TRUE</code> and another value when it’s <code>FALSE</code>, you can use <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">dplyr::if_else()</a></code><span data-type="footnote">dplyr’s <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> is very similar to base R’s <code><a href="https://rdrr.io/r/base/ifelse.html">ifelse()</a></code>. There are two main advantages of <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> over <code><a href="https://rdrr.io/r/base/ifelse.html">ifelse()</a></code>: you can choose what should happen to missing values, and <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> is much more likely to give you a meaningful error if your variables have incompatible types.</span>. You’ll always use the first three arguments of <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code>. The first argument, <code>condition</code>, is a logical vector, the second, <code>true</code>, gives the output when the condition is true, and the third, <code>false</code>, gives the output if the condition is false.</p>
<p>Let’s begin with a simple example of labeling a numeric vector as either “+ve” or “-ve”:</p>
@ -547,12 +532,13 @@ if_else(is.na(x1), y1, x1)
<section id="case_when" data-type="sect2">
<h2>
<code>case_when()</code>
case_when()
</h2>
<p>dplyr’s <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> is inspired by SQL’s <code>CASE</code> statement and provides a flexible way of performing different computations for different conditions. It has a special syntax that unfortunately looks like nothing else you’ll use in the tidyverse. It takes pairs that look like <code>condition ~ output</code>. <code>condition</code> must be a logical vector; when it’s <code>TRUE</code>, <code>output</code> will be used.</p>
<p>This means we could recreate our previous nested <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> as follows:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">case_when(
<pre data-type="programlisting" data-code-language="r">x &lt;- c(-3:3, NA)
case_when(
x == 0 ~ "0",
x &lt; 0 ~ "-ve",
x &gt; 0 ~ "+ve",
@ -582,7 +568,7 @@ if_else(is.na(x1), y1, x1)
<div class="cell">
<pre data-type="programlisting" data-code-language="r">case_when(
x &gt; 0 ~ "+ve",
x &gt; 3 ~ "big"
x &gt; 2 ~ "big"
)
#&gt; [1] NA NA NA NA "+ve" "+ve" "+ve" NA</pre>
</div>
@ -595,8 +581,8 @@ if_else(is.na(x1), y1, x1)
arr_delay &lt; -30 ~ "very early",
arr_delay &lt; -15 ~ "early",
abs(arr_delay) &lt;= 15 ~ "on time",
arr_delay &gt; 15 ~ "late",
arr_delay &gt; 60 ~ "very late",
arr_delay &lt; 60 ~ "late",
arr_delay &lt; Inf ~ "very late",
),
.keep = "used"
)
@ -611,6 +597,7 @@ if_else(is.na(x1), y1, x1)
#&gt; 6 12 on time
#&gt; # … with 336,770 more rows</pre>
</div>
<p>Be wary when writing this sort of complex <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> statement; my first two attempts used a mix of <code>&lt;</code> and <code>&gt;</code> and I kept accidentally creating overlapping conditions.</p>
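<p>As a rough illustration of how overlapping conditions go wrong, the sketch below applies both styles to a small hypothetical <code>delay</code> vector (not the <code>flights</code> data). Because <code>case_when()</code> uses the first condition that matches, a later condition that overlaps an earlier one is never reached.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical delays in minutes; NA stands for a cancelled flight.
delay &lt;- c(-45, -20, 0, 30, 90, NA)

# Mixed &lt; and &gt;: "late" also matches 90, so "very late" is never used.
overlapping &lt;- case_when(
  is.na(delay)     ~ "cancelled",
  delay &lt; -30      ~ "very early",
  delay &lt; -15      ~ "early",
  abs(delay) &lt;= 15 ~ "on time",
  delay &gt; 15       ~ "late",
  delay &gt; 60       ~ "very late"
)

# Only &lt;: each condition handles what the earlier ones left over.
ordered &lt;- case_when(
  is.na(delay)     ~ "cancelled",
  delay &lt; -30      ~ "very early",
  delay &lt; -15      ~ "early",
  abs(delay) &lt;= 15 ~ "on time",
  delay &lt; 60       ~ "late",
  delay &lt; Inf      ~ "very late"
)

overlapping[5]
#&gt; [1] "late"
ordered[5]
#&gt; [1] "very late"</pre>
</div>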
</section>
<section id="compatible-types" data-type="sect2">
@ -639,7 +626,7 @@ case_when(
</section>
</section>
<section id="summary" data-type="sect1">
<section id="logicals-summary" data-type="sect1">
<h1>
Summary</h1>
<p>The definition of a logical vector is simple because each value must be either <code>TRUE</code>, <code>FALSE</code>, or <code>NA</code>. But logical vectors provide a huge amount of power. In this chapter, you learned how to create logical vectors with <code>&gt;</code>, <code>&lt;</code>, <code>&lt;=</code>, <code>&gt;=</code>, <code>==</code>, <code>!=</code>, and <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>, how to combine them with <code>!</code>, <code>&amp;</code>, and <code>|</code>, and how to summarize them with <code><a href="https://rdrr.io/r/base/any.html">any()</a></code>, <code><a href="https://rdrr.io/r/base/all.html">all()</a></code>, <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code>, and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code>. You also learned the powerful <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> functions that allow you to return values depending on the value of a logical vector.</p>


@ -1,12 +1,12 @@
<section data-type="chapter" id="chp-missing-values">
<h1><span id="sec-missing-values" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Missing values</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="missing-values-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>You’ve already learned the basics of missing values earlier in the book. You first saw them in <a href="#chp-data-visualize" data-type="xref">#chp-data-visualize</a> where they resulted in a warning when making a plot as well as in <a href="#sec-summarize" data-type="xref">#sec-summarize</a> where they interfered with computing summary statistics, and you learned about their infectious nature and how to check for their presence in <a href="#sec-na-comparison" data-type="xref">#sec-na-comparison</a>. Now we’ll come back to them in more depth, so you can learn more of the details.</p>
<p>We’ll start by discussing some general tools for working with missing values recorded as <code>NA</code>s. We’ll then explore the idea of implicitly missing values, values that are simply absent from your data, and show some tools you can use to make them explicit. We’ll finish off with a related discussion of empty groups, caused by factor levels that don’t appear in the data.</p>
<section id="prerequisites" data-type="sect2">
<section id="missing-values-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>The functions for working with missing data mostly come from dplyr and tidyr, which are core members of the tidyverse.</p>
@ -173,11 +173,11 @@ Complete</h2>
<p>In some cases, the complete set of observations can’t be generated by a simple combination of variables. In that case, you can do manually what <code><a href="https://tidyr.tidyverse.org/reference/complete.html">complete()</a></code> does for you: create a data frame that contains all the rows that should exist (using whatever combination of techniques you need), then combine it with your original dataset with <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">dplyr::full_join()</a></code>.</p>
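<p>A minimal sketch of that manual approach follows. The <code>sales</code> and <code>expected</code> tibbles are hypothetical, chosen so that the rows that should exist are not a simple crossing of <code>year</code> and <code>qtr</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical data: sales recorded for only some quarters.
sales &lt;- tibble(
  year = c(2020, 2020, 2021),
  qtr = c(1, 3, 2),
  amount = c(10, 30, 25)
)

# Every row that should exist, built by hand; 2021 only has two
# quarters so far, so crossing year and qtr would create too many rows.
expected &lt;- tibble(
  year = c(2020, 2020, 2020, 2021, 2021),
  qtr = c(1, 2, 3, 1, 2)
)

# Rows present in expected but absent from sales get amount = NA.
completed &lt;- full_join(expected, sales, by = c("year", "qtr"))</pre>
</div>
<p>Here <code>completed</code> has five rows, and the two quarters with no recorded sales now show up as explicit <code>NA</code>s.</p>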
</section>
<section id="joins" data-type="sect2">
<section id="missing-values-joins" data-type="sect2">
<h2>
Joins</h2>
<p>This brings us to another important way of revealing implicitly missing observations: joins. You’ll learn more about joins in <a href="#chp-joins" data-type="xref">#chp-joins</a>, but we wanted to quickly mention them to you here since you can often only know that values are missing from one dataset when you compare it to another.</p>
<p><code>dplyr::anti_join(x, y)</code> is a particularly useful tool here because it selects only the rows in <code>x</code> that don’t have a match in <code>y</code>. For example, we can use two <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">anti_join()</a></code>s reveal to reveal that we’re missing information for four airports and 722 planes mentioned in <code>flights</code>:</p>
<p><code>dplyr::anti_join(x, y)</code> is a particularly useful tool here because it selects only the rows in <code>x</code> that don’t have a match in <code>y</code>. For example, we can use two <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">anti_join()</a></code>s to reveal that we’re missing information for four airports and 722 planes mentioned in <code>flights</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(nycflights13)
@ -210,7 +210,7 @@ flights |&gt;
</div>
</section>
<section id="exercises" data-type="sect2">
<section id="missing-values-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>Can you find any relationship between the carrier and the rows that appear to be missing from <code>planes</code>?</li>
@ -323,7 +323,7 @@ length(x2)
<p>The main drawback of this approach is that you get an <code>NA</code> for the count, even though you know that it should be zero.</p>
</section>
<section id="summary" data-type="sect1">
<section id="missing-values-summary" data-type="sect1">
<h1>
Summary</h1>
<p>Missing values are weird! Sometimes they’re recorded as an explicit <code>NA</code> but other times you only notice them by their absence. This chapter has given you some tools for working with explicit missing values, tools for uncovering implicit missing values, and discussed some of the ways that implicit can become explicit and vice versa.</p>


@ -1,22 +1,24 @@
<section data-type="chapter" id="chp-numbers">
<h1><span id="sec-numbers" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Numbers</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="numbers-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>Numeric vectors are the backbone of data science, and you’ve already used them a bunch of times earlier in the book. Now it’s time to systematically survey what you can do with them in R, ensuring that you’re well situated to tackle any future problem involving numeric vectors.</p>
<p>We’ll start by giving you a couple of tools to make numbers if you have strings, and then go into a little more detail of <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>. Then we’ll dive into various numeric transformations that pair well with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, including more general transformations that can be applied to other types of vector, but are often used with numeric vectors. We’ll finish off by covering the summary functions that pair well with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> and show you how they can also be used with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>.</p>
<section id="prerequisites" data-type="sect2">
<section id="numbers-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<div data-type="important"><div class="callout-body d-flex">
<div data-type="important">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>This chapter relies on features only found in dplyr 1.1.0, which is still in development. If you want to live on the edge, you can get the dev versions with <code>devtools::install_github("tidyverse/dplyr")</code>.</p>
<p>This chapter relies on features only found in dplyr 1.1.0, which is still in development. If you want to live on the edge, you can get the dev versions with <code>devtools::install_github("tidyverse/dplyr")</code>.</p></div>
</div>
</div>
<p>This chapter mostly uses functions from base R, which are available without loading any packages. But we still need the tidyverse because we’ll use these base R functions inside of tidyverse functions like <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>. Like in the last chapter, we’ll use real examples from nycflights13, as well as toy examples made with <code><a href="https://rdrr.io/r/base/c.html">c()</a></code> and <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tribble()</a></code>.</p>
<div class="cell">
@ -109,9 +111,7 @@ Counts</h1>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
group_by(dest) |&gt;
summarize(
carriers = n_distinct(carrier)
) |&gt;
summarize(carriers = n_distinct(carrier)) |&gt;
arrange(desc(carriers))
#&gt; # A tibble: 105 × 2
#&gt; dest carriers
@ -144,17 +144,7 @@ Counts</h1>
</div>
<p>Weighted counts are a common problem so <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> has a <code>wt</code> argument that does the same thing:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt; count(tailnum, wt = distance)
#&gt; # A tibble: 4,044 × 2
#&gt; tailnum n
#&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 D942DN 3418
#&gt; 2 N0EGMQ 250866
#&gt; 3 N10156 115966
#&gt; 4 N102UW 25722
#&gt; 5 N103US 24619
#&gt; 6 N104UW 25157
#&gt; # … with 4,038 more rows</pre>
<pre data-type="programlisting" data-code-language="r">flights |&gt; count(tailnum, wt = distance)</pre>
</div>
</li>
<li>
@ -176,7 +166,7 @@ Counts</h1>
</div>
</li>
</ul>
<section id="exercises" data-type="sect2">
<section id="numbers-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>How can you use <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> to count the number of rows with a missing value for a given variable?</li>
@ -228,9 +218,7 @@ x * c(1, 2, 3)
#&gt; 5 2013 1 1 557 600 -3 838 846
#&gt; 6 2013 1 1 558 600 -2 849 851
#&gt; # … with 25,971 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;</pre>
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>The code runs without error, but it doesn’t return what you want. Because of the recycling rules it finds flights in odd numbered rows that departed in January and flights in even numbered rows that departed in February. And unfortunately there’s no warning because <code>flights</code> has an even number of rows.</p>
<p>To protect you from this type of silent failure, most tidyverse functions use a stricter form of recycling that only recycles single values. Unfortunately that doesn’t help here, or in many other cases, because the key computation is performed by the base R function <code>==</code>, not <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>.</p>
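<p>The silent failure is easy to reproduce without a data frame. The sketch below uses a short hypothetical <code>month</code> vector to show how <code>==</code> recycles its right-hand side position by position, while <code>%in%</code> asks the question that was probably intended.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Hypothetical month values; the length is even, so there is no warning.
month &lt;- c(1, 2, 1, 2, 2, 1)

# == recycles c(1, 2) to c(1, 2, 1, 2, 1, 2) and compares pairwise.
by_position &lt;- month == c(1, 2)
by_position
#&gt; [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE

# %in% checks whether each value is either 1 or 2.
by_membership &lt;- month %in% c(1, 2)
by_membership
#&gt; [1] TRUE TRUE TRUE TRUE TRUE TRUE</pre>
</div>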
@ -476,7 +464,7 @@ slide_vec(x, sum, .before = 2, .after = 2, .complete = TRUE)
</div>
</section>
<section id="exercises-1" data-type="sect2">
<section id="numbers-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Explain in words what each line of the code used to generate <a href="#fig-prop-cancelled" data-type="xref">#fig-prop-cancelled</a> does.</p></li>
@ -671,7 +659,7 @@ df
</div>
</section>
<section id="exercises-2" data-type="sect2">
<section id="numbers-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for <code><a href="https://dplyr.tidyverse.org/reference/row_number.html">min_rank()</a></code>.</p></li>
@ -718,10 +706,8 @@ Center</h2>
.groups = "drop"
) |&gt;
ggplot(aes(x = mean, y = median)) +
geom_abline(slope = 1, intercept = 0, color = "white", size = 2) +
geom_point()
#&gt; Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
#&gt; Please use `linewidth` instead.</pre>
geom_abline(slope = 1, intercept = 0, color = "white", linewidth = 2) +
geom_point()</pre>
<div class="cell-output-display">
<figure id="fig-mean-vs-median"><p><img src="numbers_files/figure-html/fig-mean-vs-median-1.png" alt="All points fall below a 45° line, meaning that the median delay is always less than the mean delay. Most points are clustered in a dense region of mean [0, 20] and median [0, 5]. As the mean delay increases, the spread of the median also increases. There are two outlying points with mean ~60, median ~50, and mean ~85, median ~55." width="576"/></p>
@ -875,15 +861,13 @@ Positions</h2>
#&gt; 5 2013 1 2 42 2359 43 518 442
#&gt; 6 2013 1 2 458 500 -2 703 650
#&gt; # … with 1,189 more rows, and 12 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;, r &lt;int&gt;</pre>
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
</section>
<section id="with-mutate" data-type="sect2">
<h2>
With<code>mutate()</code>
With mutate()
</h2>
<p>As the names suggest, the summary functions are typically paired with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>. However, because of the recycling rules we discussed in <a href="#sec-recycling" data-type="xref">#sec-recycling</a> they can also be usefully paired with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, particularly when you want to do some sort of group standardization. For example:</p>
<ul><li>
@ -894,7 +878,7 @@ With<code>mutate()</code>
<code>x / first(x)</code> computes an index based on the first observation.</li>
</ul></section>
<section id="exercises-3" data-type="sect2">
<section id="numbers-exercises-3" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
@ -910,7 +894,7 @@ Exercises</h2>
</ol></section>
</section>
<section id="summary" data-type="sect1">
<section id="numbers-summary" data-type="sect1">
<h1>
Summary</h1>
<p>You’re already familiar with many tools for working with numbers, and after reading this chapter you now know how to use them in R. You’ve also learned a handful of useful general transformations that are commonly, but not exclusively, applied to numeric vectors like ranks and offsets. Finally, you worked through a number of numeric summaries, and discussed a few of the statistical challenges that you should consider.</p>


@ -1,9 +1,9 @@
<section data-type="chapter" id="chp-preface-2e">
<h1>Preface to the second edition</h1><p>Welcome to the second edition of “R for Data Science”! This is a major reworking of the first edition, removing material we no longer think is useful, adding material we wish we included in the first edition, and generally updating the text and code to reflect changes in best practices. We’re also very excited to welcome a new co-author: Mine Çetinkaya-Rundel, a noted data science educator and one of our colleagues at Posit (the company formerly known as RStudio).</p><p>A brief summary of the biggest changes follows:</p><ul><li><p>The first part of the book has been renamed to “Whole game”. The goal of this section is to give you the rough details of the “whole game” of data science before we dive into the details.</p></li>
<li><p>The second part of the book is “Visualize”. This part gives data visualization tools and best practices a more thorough coverage compared to the first edition.</p></li>
<li><p>The third part of the book is now called “Transform” and gains new chapters on numbers, logical vectors, and missing values. These were previously parts of the data transformation chapter, but needed much more room.</p></li>
<li><p>The fourth part of the book is called “Import”. It’s a new set of chapters that goes beyond reading flat text files to now embrace working with spreadsheets, getting data out of databases, working with big data, rectangling hierarchical data, and scraping data from web sites.</p></li>
<li><p>The “Program” part continues, but has been rewritten from top-to-bottom to focus on the most important parts of function writing and iteration. Function writing now includes sections on how to wrap tidyverse functions (dealing with the challenges of tidy evaluation), since this has become much easier over the last few years. We’ve added a new chapter on important Base R functions that you’re likely to see when reading R code found in the wild.</p></li>
<li><p>The second part of the book is “Visualize”. This part gives data visualization tools and best practices a more thorough coverage compared to the first edition. The best place to get all the details is still the <a href="http://ggplot2-book.org/">ggplot2 book</a>, but now R4DS covers more of the most important techniques.</p></li>
<li><p>The third part of the book is now called “Transform” and gains new chapters on numbers, logical vectors, and missing values. These were previously parts of the data transformation chapter, but needed much more room to cover all the details.</p></li>
<li><p>The fourth part of the book is called “Import”. It’s a new set of chapters that goes beyond reading flat text files to working with spreadsheets, getting data out of databases, working with big data, rectangling hierarchical data, and scraping data from web sites.</p></li>
<li><p>The “Program” part remains, but has been rewritten from top-to-bottom to focus on the most important parts of function writing and iteration. Function writing now includes details on how to wrap tidyverse functions (dealing with the challenges of tidy evaluation), since this has become much easier and more important over the last few years. We’ve added a new chapter on important base R functions that you’re likely to see in wild-caught R code.</p></li>
<li><p>The modeling part has been removed. We never had enough room to fully do modelling justice, and there are now much better resources available. We generally recommend using the <a href="https://www.tidymodels.org/">tidymodels</a> packages and reading <a href="https://www.tmwr.org/">Tidy Modeling with R</a> by Max Kuhn and Julia Silge.</p></li>
<li><p>The communicate part continues as well, but features Quarto instead of R Markdown as the tool of choice for authoring reproducible computational documents.</p></li>
</ul><p>Other changes include switching from magrittr’s pipe (<code>%&gt;%</code>) to the base pipe (<code>|&gt;</code>) and switching the book’s source from RMarkdown to Quarto.</p></section>
<li><p>The communicate part remains, but has been thoroughly updated to feature Quarto instead of R Markdown. This edition of the book has been written in Quarto, and it’s clearly the tool of the future.</p></li>
</ul></section>


@ -3,16 +3,10 @@
<div class="cell-output-display">
<figure id="fig-ds-program"><p><img src="diagrams/data-science/program.png" alt="Our model of the data science process with program (import, tidy, transform, visualize, model, and communicate, i.e. everything) highlighted in blue." width="535"/></p>
<figcaption>Figure 1: Programming is the water in which all other components of the data science process swims.</figcaption>
<figcaption>Figure 1: Programming is the water in which all the other components swim.</figcaption>
</figure>
</div>
</div><p>Programming produces code, and code is a tool of communication. Obviously code tells the computer what you want it to do. But it also communicates meaning to other humans. Thinking about code as a vehicle for communication is important because every project you do is fundamentally collaborative. Even if you’re not working with other people, you’ll definitely be working with future-you! Writing clear code is important so that others (like future-you) can understand why you tackled an analysis in the way you did. That means getting better at programming also involves getting better at communicating. Over time, you want your code to become not just easier to write, but easier for others to read.</p><p>Writing code is similar in many ways to writing prose. One parallel which we find particularly useful is that in both cases rewriting is the key to clarity. The first expression of your ideas is unlikely to be particularly clear, and you may need to rewrite multiple times. After solving a data analysis challenge, it’s often worth looking at your code and thinking about whether or not it’s obvious what you’ve done. If you spend a little time rewriting your code while the ideas are fresh, you can save a lot of time later trying to recreate what your code did. But this doesn’t mean you should rewrite every function: you need to balance what you need to achieve now with saving time in the long run. (But the more you rewrite your functions the more likely your first attempt will be clear.)</p><p>In the following three chapters, you’ll learn skills to improve your programming skills:</p><ol type="1"><li><p>Copy-and-paste is a powerful tool, but you should avoid doing it more than twice. Repeating yourself in code is dangerous because it can easily lead to errors and inconsistencies. 
Instead, in <a href="#chp-functions" data-type="xref">#chp-functions</a>, you’ll learn how to write <strong>functions</strong> which let you extract out repeated code so that it can be easily reused.</p></li>
</div><p>Programming produces code, and code is a tool of communication. Obviously code tells the computer what you want it to do. But it also communicates meaning to other humans. Thinking about code as a vehicle for communication is important because every project you do is fundamentally collaborative. Even if you're not working with other people, you'll definitely be working with future-you! Writing clear code is important so that others (like future-you) can understand why you tackled an analysis in the way you did. That means getting better at programming also involves getting better at communicating. Over time, you want your code to become not just easier to write, but easier for others to read.</p><p>In the following three chapters, you'll learn skills that will improve your programming:</p><ol type="1"><li><p>Copy-and-paste is a powerful tool, but you should avoid doing it more than twice. Repeating yourself in code is dangerous because it can easily lead to errors and inconsistencies. Instead, in <a href="#chp-functions" data-type="xref">#chp-functions</a>, you'll learn how to write <strong>functions</strong> which let you extract out repeated code so that it can be easily reused.</p></li>
<li><p>Functions extract out repeated code, but you often need to repeat the same actions on different inputs. You need tools for <strong>iteration</strong> that let you do similar things again and again. These tools include for loops and functional programming, which you'll learn about in <a href="#chp-iteration" data-type="xref">#chp-iteration</a>.</p></li>
<li><p>As you read more code written by others, youll see more code that doesnt use the tidyverse. In <a href="#chp-base-R" data-type="xref">#chp-base-R</a>, youll learn some of the most important base R functions that youll see in the wild. These functions tend to be designed to use individual vectors, rather than data frames, often making them a good fit for your programming needs.</p></li>
</ol><section id="chp-program" class="level2">
<h1>Learning more</h1>
<p>The goal of these chapters is to teach you the minimum about programming that you need to practice data science. Once you have mastered the material in this book, we strongly believe you should continue to invest in your programming skills. Learning more about programming is a long-term investment: it wont pay off immediately, but in the long term it will allow you to solve new problems more quickly, and let you reuse your insights from previous problems in new scenarios.</p>
<p>To learn more you need to study R as a programming language, not just an interactive environment for data science. We have written two books that will help you do so:</p>
<ul><li><p><a href="https://rstudio-education.github.io/hopr/"><em>Hands on Programming with R</em></a>, by Garrett Grolemund. This is an introduction to R as a programming language and is a great place to start if R is your first programming language. It covers similar material to these chapters, but with a different style and different motivation examples (based in the casino). Its a useful complement if you find that these four chapters go by too quickly.</p></li>
<li><p><a href="https://adv-r.hadley.nz/"><em>Advanced R</em></a> by Hadley Wickham. This dives into the details of R the programming language. This is a great place to start if you have existing programming experience. Its also a great next step once youve internalized the ideas in these chapters.</p></li>
</ul></section></div>
<li><p>As you read more code written by others, you'll see more code that doesn't use the tidyverse. In <a href="#chp-base-R" data-type="xref">#chp-base-R</a>, you'll learn some of the most important base R functions that you'll see in the wild.</p></li>
</ol><p>The goal of these chapters is to teach you the minimum about programming that you need for data science. Once you have mastered the material here, we strongly recommend that you continue to invest in your programming skills. We've written two books that you might find helpful. <a href="https://rstudio-education.github.io/hopr/"><em>Hands on Programming with R</em></a>, by Garrett Grolemund, is an introduction to R as a programming language and is a great place to start if R is your first programming language. <a href="https://adv-r.hadley.nz/"><em>Advanced R</em></a> by Hadley Wickham dives into the details of R the programming language; it's a great place to start if you have existing programming experience and a great next step once you've internalized the ideas in these chapters.</p></div>

View File

@ -1,6 +1,6 @@
<section data-type="chapter" id="chp-quarto-formats">
<h1><span id="sec-quarto-formats" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Quarto formats</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="quarto-formats-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>So far, you've seen Quarto used to produce HTML documents. This chapter gives a brief overview of some of the many other types of output you can produce with Quarto.</p>
@ -268,7 +268,7 @@ Other formats</h1>
</ul><p>See <a href="https://quarto.org/docs/output-formats/all-formats.html" class="uri">https://quarto.org/docs/output-formats/all-formats.html</a> for a list of even more formats.</p>
</section>
<section id="learning-more" data-type="sect1">
<section id="quarto-formats-learning-more" data-type="sect1">
<h1>
Learning more</h1>
<p>To learn more about effective communication in these different formats, we recommend the following resources:</p>

View File

@ -1,6 +1,6 @@
<section data-type="chapter" id="chp-quarto">
<h1><span id="sec-quarto" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Quarto</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="quarto-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>Quarto provides a unified authoring framework for data science, combining your code, its results, and your prose. Quarto documents are fully reproducible and support dozens of output formats, like PDFs, Word files, presentations, and more.</p>
@ -11,7 +11,7 @@ Introduction</h1>
</ol><p>Quarto is a command line interface tool, not an R package. This means that help is, by-and-large, not available through <code>?</code>. Instead, as you work through this chapter, and use Quarto in the future, you should refer to the Quarto documentation page at <a href="https://quarto.org/" class="uri">https://quarto.org</a> for help.</p>
<p>If you're an R Markdown user, you might be thinking “Quarto sounds a lot like R Markdown”. You're not wrong! Quarto unifies the functionality of many packages from the R Markdown ecosystem (rmarkdown, bookdown, distill, xaringan, etc.) into a single consistent system and extends it with native support for multiple programming languages, like Python and Julia, in addition to R. In a way, Quarto reflects everything that was learned from expanding and supporting the R Markdown ecosystem over a decade.</p>
<section id="prerequisites" data-type="sect2">
<section id="quarto-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>You need the Quarto command line interface (Quarto CLI), but you don't need to explicitly install it or load it, as RStudio automatically does both when needed.</p>
@ -84,7 +84,7 @@ smaller |&gt;
<p>To get started with your own <code>.qmd</code> file, select <em>File &gt; New File &gt; Quarto Document…</em> in the menu bar. RStudio will launch a wizard that you can use to pre-populate your file with useful content that reminds you how the key features of Quarto work.</p>
<p>The following sections dive into the three components of a Quarto document in more detail: the markdown text, the code chunks, and the YAML header.</p>
<section id="exercises" data-type="sect2">
<section id="quarto-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Create a new Quarto document using <em>File &gt; New File &gt; Quarto Document</em>. Read the instructions. Practice running the chunks individually. Then render the document by clicking the appropriate button and then by using the appropriate keyboard shortcut. Verify that you can modify the code, re-run it, and see modified output.</p></li>
@ -106,7 +106,7 @@ Visual editor</h1>
<p>The visual editor has many more features that we haven't enumerated here that you might find useful as you gain experience authoring with it.</p>
<p>Most importantly, while the visual editor displays your content with formatting, under the hood, it saves your content in plain Markdown and you can switch back and forth between the visual and source editors to view and edit your content using either tool.</p>
<section id="exercises-1" data-type="sect2">
<section id="quarto-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<!--# TO DO: Add exercises. -->
@ -165,7 +165,7 @@ Source editor</h1>
</div>
<p>The best way to learn these is simply to try them out. It will take a few days, but soon they will become second nature, and you won't need to think about them. If you forget, you can get to a handy reference sheet with <em>Help &gt; Markdown Quick Reference</em>.</p>
<section id="exercises-2" data-type="sect2">
<section id="quarto-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Practice what you've learned by creating a brief CV. The title should be your name, and you should include headings for (at least) education or employment. Each of the sections should include a bulleted list of jobs/degrees. Highlight the year in bold.</p></li>
@ -341,7 +341,7 @@ comma(.12358124331)
</div>
</section>
<section id="exercises-3" data-type="sect2">
<section id="quarto-exercises-3" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Add a section that explores how diamond sizes vary by cut, color, and clarity. Assume you're writing a report for someone who doesn't know R, and instead of setting <code>echo: false</code> on each chunk, set a global option.</p></li>
@ -394,14 +394,14 @@ Other important options</h2>
<p>It's a good idea to name code chunks that produce figures, even if you don't routinely label other chunks. The chunk label is used to generate the file name of the graphic on disk, so naming your chunks makes it much easier to pick out plots and reuse them in other circumstances (e.g., if you want to quickly drop a single plot into an email or a tweet).</p>
</section>
<section id="exercises-4" data-type="sect2">
<section id="quarto-exercises-4" data-type="sect2">
<h2>
Exercises</h2>
<!--# TO DO: Add exercises -->
</section>
</section>
<section id="tables" data-type="sect1">
<section id="quarto-tables" data-type="sect1">
<h1>
Tables</h1>
<p>Similar to figures, you can include two types of tables in a Quarto document. They can be markdown tables that you create directly in your Quarto document (using the Insert Table menu) or they can be tables generated as a result of a code chunk. In this section we will focus on the latter, tables generated via computation.</p>
@ -499,7 +499,7 @@ Tables</h1>
<p>Read the documentation for <code><a href="https://rdrr.io/pkg/knitr/man/kable.html">?knitr::kable</a></code> to see the other ways in which you can customize the table. For even deeper customization, consider the <strong>gt</strong>, <strong>huxtable</strong>, <strong>reactable</strong>, <strong>kableExtra</strong>, <strong>xtable</strong>, <strong>stargazer</strong>, <strong>pander</strong>, <strong>tables</strong>, and <strong>ascii</strong> packages. Each provides a set of tools for returning formatted tables from R code.</p>
<p>There is also a rich set of options for controlling how figures are embedded. You'll learn about these in <span class="quarto-unresolved-ref">?sec-graphics-communication</span>.</p>
<section id="exercises-5" data-type="sect2">
<section id="quarto-exercises-5" data-type="sect2">
<h2>
Exercises</h2>
<!--# TO DO: Add exercises -->
@ -672,7 +672,7 @@ csl: apa.csl</pre>
</section>
</section>
<section id="learning-more" data-type="sect1">
<section id="quarto-learning-more" data-type="sect1">
<h1>
Learning more</h1>
<p>Quarto is still relatively young, and is still growing rapidly. The best place to stay on top of innovations is the official Quarto website: <a href="https://quarto.org/" class="uri">https://quarto.org</a>.</p>

View File

@ -1,12 +1,12 @@
<section data-type="chapter" id="chp-rectangling">
<h1><span id="sec-rectangling" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Hierarchical data</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="rectangling-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>In this chapter, you'll learn the art of data <strong>rectangling</strong>, taking data that is fundamentally hierarchical, or tree-like, and converting it into a rectangular data frame made up of rows and columns. This is important because hierarchical data is surprisingly common, especially when working with data that comes from the web.</p>
<p>To learn about rectangling, you'll need to first learn about lists, the data structure that makes hierarchical data possible. Then you'll learn about two crucial tidyr functions: <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">tidyr::unnest_longer()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">tidyr::unnest_wider()</a></code>. We'll then show you a few case studies, applying these simple functions again and again to solve real problems. We'll finish off by talking about JSON, the most frequent source of hierarchical datasets and a common format for data exchange on the web.</p>
<section id="prerequisites" data-type="sect2">
<section id="rectangling-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter, we'll use many functions from tidyr, a core member of the tidyverse. We'll also use repurrrsive to provide some interesting datasets for rectangling practice, and we'll finish by using jsonlite to read JSON files into R lists.</p>
@ -18,7 +18,7 @@ library(jsonlite)</pre>
</section>
</section>
<section id="lists" data-type="sect1">
<section id="rectangling-lists" data-type="sect1">
<h1>
Lists</h1>
<p>So far you've worked with data frames that contain simple vectors like integers, numbers, characters, date-times, and factors. These vectors are simple because they're homogeneous: every element is of the same data type. If you want to store elements of different types in the same vector, you'll need a <strong>list</strong>, which you create with <code><a href="https://rdrr.io/r/base/list.html">list()</a></code>:</p>
@ -174,13 +174,19 @@ df
<p>Similarly, if you <code><a href="https://rdrr.io/r/utils/View.html">View()</a></code> a data frame in RStudio, you'll get the standard tabular view, which doesn't allow you to selectively expand list columns. To explore those fields you'll need to <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code> and view, e.g. <code>df |&gt; pull(z) |&gt; View()</code>.</p>
<div data-type="note"><h1>
Base R
</h1><p>Its possible to put a list in a column of a <code>data.frame</code>, but its a lot fiddlier because <code><a href="https://rdrr.io/r/base/data.frame.html">data.frame()</a></code> treats a list as a list of columns:</p><div class="cell">
</h1>
<p>It's possible to put a list in a column of a <code>data.frame</code>, but it's a lot fiddlier because <code><a href="https://rdrr.io/r/base/data.frame.html">data.frame()</a></code> treats a list as a list of columns:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">data.frame(x = list(1:3, 3:5))
#&gt; x.1.3 x.3.5
#&gt; 1 1 3
#&gt; 2 2 4
#&gt; 3 3 5</pre>
</div><p>You can force <code><a href="https://rdrr.io/r/base/data.frame.html">data.frame()</a></code> to treat a list as a list of rows by wrapping it in list <code><a href="https://rdrr.io/r/base/AsIs.html">I()</a></code>, but the result doesnt print particularly well:</p><div class="cell">
</div>
<p>You can force <code><a href="https://rdrr.io/r/base/data.frame.html">data.frame()</a></code> to treat a list as a list of rows by wrapping it in <code><a href="https://rdrr.io/r/base/AsIs.html">I()</a></code>, but the result doesn't print particularly well:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">data.frame(
x = I(list(1:2, 3:5)),
y = c("1, 2", "3, 4, 5")
@ -188,7 +194,10 @@ Base R
#&gt; x y
#&gt; 1 1, 2 1, 2
#&gt; 2 3, 4, 5 3, 4, 5</pre>
</div><p>Its easier to use list-columns with tibbles because <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> treats lists like vectors and the print method has been designed with lists in mind.</p></div>
</div>
<p>It's easier to use list-columns with tibbles because <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> treats lists like vectors and the print method has been designed with lists in mind.</p>
</div>
</section>
</section>
@ -220,7 +229,7 @@ df2 &lt;- tribble(
<section id="unnest_wider" data-type="sect2">
<h2>
<code>unnest_wider()</code>
unnest_wider()
</h2>
<p>When each row has the same number of elements with the same names, like <code>df1</code>, it's natural to put each component into its own column with <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code>:</p>
<div class="cell">
@ -260,7 +269,7 @@ df2 &lt;- tribble(
<section id="unnest_longer" data-type="sect2">
<h2>
<code>unnest_longer()</code>
unnest_longer()
</h2>
<p>When each row contains an unnamed list, it's most natural to put each element into its own row with <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code>:</p>
<div class="cell">
@ -387,7 +396,7 @@ Inconsistent types</h2>
<p>You'll learn more about <code><a href="https://purrr.tidyverse.org/reference/map.html">map_lgl()</a></code> in <a href="#chp-iteration" data-type="xref">#chp-iteration</a>.</p>
</section>
<section id="other-functions" data-type="sect2">
<section id="rectangling-other-functions" data-type="sect2">
<h2>
Other functions</h2>
<p>tidyr has a few other useful rectangling functions that we're not going to cover in this book:</p>
@ -400,7 +409,7 @@ Other functions</h2>
</ul><p>These functions are good to know about as you might encounter them when reading other people's code or tackling rarer rectangling challenges yourself.</p>
</section>
<section id="exercises" data-type="sect2">
<section id="rectangling-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
@ -460,51 +469,26 @@ repos
unnest_longer(json) |&gt;
unnest_wider(json)
#&gt; # A tibble: 176 × 68
#&gt; id name full_name owner private html_url description fork
#&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;list&gt; &lt;lgl&gt; &lt;chr&gt; &lt;chr&gt; &lt;lgl&gt;
#&gt; 1 61160198 after gaborcsa… &lt;named list&gt; FALSE https:/… Run Code i… FALSE
#&gt; 2 40500181 argufy gaborcsa… &lt;named list&gt; FALSE https:/… Declarativ… FALSE
#&gt; 3 36442442 ask gaborcsa… &lt;named list&gt; FALSE https:/… Friendly C… FALSE
#&gt; 4 34924886 baseimp… gaborcsa… &lt;named list&gt; FALSE https:/… Do we get … FALSE
#&gt; 5 61620661 citest gaborcsa… &lt;named list&gt; FALSE https:/… Test R pac… TRUE
#&gt; 6 33907457 clisymb… gaborcsa… &lt;named list&gt; FALSE https:/… Unicode sy… FALSE
#&gt; # … with 170 more rows, and 60 more variables: url &lt;chr&gt;, forks_url &lt;chr&gt;,
#&gt; # keys_url &lt;chr&gt;, collaborators_url &lt;chr&gt;, teams_url &lt;chr&gt;,
#&gt; # hooks_url &lt;chr&gt;, issue_events_url &lt;chr&gt;, events_url &lt;chr&gt;,
#&gt; # assignees_url &lt;chr&gt;, branches_url &lt;chr&gt;, tags_url &lt;chr&gt;,
#&gt; # blobs_url &lt;chr&gt;, git_tags_url &lt;chr&gt;, git_refs_url &lt;chr&gt;,
#&gt; # trees_url &lt;chr&gt;, statuses_url &lt;chr&gt;, languages_url &lt;chr&gt;,
#&gt; # stargazers_url &lt;chr&gt;, contributors_url &lt;chr&gt;, subscribers_url &lt;chr&gt;, …</pre>
#&gt; id name full_name owner private html_url
#&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;list&gt; &lt;lgl&gt; &lt;chr&gt;
#&gt; 1 61160198 after gaborcsardi/after &lt;named list&gt; FALSE https://github…
#&gt; 2 40500181 argufy gaborcsardi/argu… &lt;named list&gt; FALSE https://github…
#&gt; 3 36442442 ask gaborcsardi/ask &lt;named list&gt; FALSE https://github…
#&gt; 4 34924886 baseimports gaborcsardi/base… &lt;named list&gt; FALSE https://github…
#&gt; 5 61620661 citest gaborcsardi/cite… &lt;named list&gt; FALSE https://github…
#&gt; 6 33907457 clisymbols gaborcsardi/clis… &lt;named list&gt; FALSE https://github…
#&gt; # … with 170 more rows, and 62 more variables: description &lt;chr&gt;,
#&gt; # fork &lt;lgl&gt;, url &lt;chr&gt;, forks_url &lt;chr&gt;, keys_url &lt;chr&gt;, …</pre>
</div>
<p>This has worked but the result is a little overwhelming: there are so many columns that tibble doesnt even print all of them! We can see them all with <code><a href="https://rdrr.io/r/base/names.html">names()</a></code>:</p>
<p>This has worked but the result is a little overwhelming: there are so many columns that tibble doesn't even print all of them! We can see them all with <code><a href="https://rdrr.io/r/base/names.html">names()</a></code>; here we look at just the first 10:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">repos |&gt;
unnest_longer(json) |&gt;
unnest_wider(json) |&gt;
names()
#&gt; [1] "id" "name" "full_name"
#&gt; [4] "owner" "private" "html_url"
#&gt; [7] "description" "fork" "url"
#&gt; [10] "forks_url" "keys_url" "collaborators_url"
#&gt; [13] "teams_url" "hooks_url" "issue_events_url"
#&gt; [16] "events_url" "assignees_url" "branches_url"
#&gt; [19] "tags_url" "blobs_url" "git_tags_url"
#&gt; [22] "git_refs_url" "trees_url" "statuses_url"
#&gt; [25] "languages_url" "stargazers_url" "contributors_url"
#&gt; [28] "subscribers_url" "subscription_url" "commits_url"
#&gt; [31] "git_commits_url" "comments_url" "issue_comment_url"
#&gt; [34] "contents_url" "compare_url" "merges_url"
#&gt; [37] "archive_url" "downloads_url" "issues_url"
#&gt; [40] "pulls_url" "milestones_url" "notifications_url"
#&gt; [43] "labels_url" "releases_url" "deployments_url"
#&gt; [46] "created_at" "updated_at" "pushed_at"
#&gt; [49] "git_url" "ssh_url" "clone_url"
#&gt; [52] "svn_url" "homepage" "size"
#&gt; [55] "stargazers_count" "watchers_count" "language"
#&gt; [58] "has_issues" "has_downloads" "has_wiki"
#&gt; [61] "has_pages" "forks_count" "mirror_url"
#&gt; [64] "open_issues_count" "forks" "open_issues"
#&gt; [67] "watchers" "default_branch"</pre>
names() |&gt;
head(10)
#&gt; [1] "id" "name" "full_name" "owner" "private"
#&gt; [6] "html_url" "description" "fork" "url" "forks_url"</pre>
</div>
<p>Let's select a few that look interesting:</p>
<div class="cell">
@ -523,7 +507,7 @@ repos
#&gt; 6 33907457 gaborcsardi/clisymbols &lt;named list [17]&gt; Unicode symbols for CLI…
#&gt; # … with 170 more rows</pre>
</div>
<p>You can use this to work back to understand how <code>gh_repos</code> was strucured: each child was a GitHub user containing a list of up to 30 GitHub repositories that they created.</p>
<p>You can use this to work back to understand how <code>gh_repos</code> was structured: each child was a GitHub user containing a list of up to 30 GitHub repositories that they created.</p>
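<p>One way to check this reading of the structure is to count the repositories stored for each user. This is only a sketch: it assumes the repurrrsive package is installed, and <code>repos_per_user</code> is a name introduced here for illustration.</p>

```r
library(repurrrsive)

# gh_repos is a list with one element per GitHub user;
# each element is itself a list of that user's repositories
repos_per_user <- lengths(gh_repos)
repos_per_user

# The total should match the number of rows produced by unnest_longer()
sum(repos_per_user)
```

<p>If the structure is as described, no user should have more than 30 repositories and the total should equal the 176 rows shown in the tibbles above.</p>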
<p><code>owner</code> is another list-column, and since it contains a named list, we can use <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> to get at the values:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">repos |&gt;
@ -531,11 +515,13 @@ repos
unnest_wider(json) |&gt;
select(id, full_name, owner, description) |&gt;
unnest_wider(owner)
#&gt; Error in `unpack()` at tidyr/R/unnest-wider.R:121:2:
#&gt; ! Names must be unique.
#&gt; Error in `unnest_wider()`:
#&gt; ! Can't duplicate names between the affected columns and the original
#&gt; data.
#&gt; ✖ These names are duplicated:
#&gt; * "id" at locations 1 and 4.
#&gt; Use argument `names_repair` to specify repair strategy.</pre>
#&gt; `id`, from `owner`.
#&gt; Use `names_sep` to disambiguate using the column name.
#&gt; Or use `names_repair` to specify a repair strategy.</pre>
</div>
<!--# TODO: https://github.com/tidyverse/tidyr/issues/1390 -->
<p>Uh oh, this list column also contains an <code>id</code> column and we can't have two <code>id</code> columns in the same data frame. Rather than following the advice to use <code>names_repair</code> (which would also work), we'll instead use <code>names_sep</code>:</p>
@ -546,21 +532,16 @@ repos
select(id, full_name, owner, description) |&gt;
unnest_wider(owner, names_sep = "_")
#&gt; # A tibble: 176 × 20
#&gt; id full_name owner_login owner_id owner_avatar_url owner_gravatar_id
#&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 61160198 gaborcsar… gaborcsardi 660288 https://avatars… ""
#&gt; 2 40500181 gaborcsar… gaborcsardi 660288 https://avatars… ""
#&gt; 3 36442442 gaborcsar… gaborcsardi 660288 https://avatars… ""
#&gt; 4 34924886 gaborcsar… gaborcsardi 660288 https://avatars… ""
#&gt; 5 61620661 gaborcsar… gaborcsardi 660288 https://avatars… ""
#&gt; 6 33907457 gaborcsar… gaborcsardi 660288 https://avatars… ""
#&gt; # … with 170 more rows, and 14 more variables: owner_url &lt;chr&gt;,
#&gt; # owner_html_url &lt;chr&gt;, owner_followers_url &lt;chr&gt;,
#&gt; # owner_following_url &lt;chr&gt;, owner_gists_url &lt;chr&gt;,
#&gt; # owner_starred_url &lt;chr&gt;, owner_subscriptions_url &lt;chr&gt;,
#&gt; # owner_organizations_url &lt;chr&gt;, owner_repos_url &lt;chr&gt;,
#&gt; # owner_events_url &lt;chr&gt;, owner_received_events_url &lt;chr&gt;,
#&gt; # owner_type &lt;chr&gt;, owner_site_admin &lt;lgl&gt;, description &lt;chr&gt;</pre>
#&gt; id full_name owner_login owner_id owner_avatar_url
#&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 61160198 gaborcsardi/after gaborcsardi 660288 https://avatars.gith…
#&gt; 2 40500181 gaborcsardi/argufy gaborcsardi 660288 https://avatars.gith…
#&gt; 3 36442442 gaborcsardi/ask gaborcsardi 660288 https://avatars.gith…
#&gt; 4 34924886 gaborcsardi/baseimports gaborcsardi 660288 https://avatars.gith…
#&gt; 5 61620661 gaborcsardi/citest gaborcsardi 660288 https://avatars.gith…
#&gt; 6 33907457 gaborcsardi/clisymbols gaborcsardi 660288 https://avatars.gith…
#&gt; # … with 170 more rows, and 15 more variables: owner_gravatar_id &lt;chr&gt;,
#&gt; # owner_url &lt;chr&gt;, owner_html_url &lt;chr&gt;, owner_followers_url &lt;chr&gt;, …</pre>
</div>
<p>This gives another wide dataset, but you can see that <code>owner</code> appears to contain a lot of additional data about the person who “owns” the repository.</p>
</section>
@ -588,17 +569,16 @@ chars
<pre data-type="programlisting" data-code-language="r">chars |&gt;
unnest_wider(json)
#&gt; # A tibble: 30 × 18
#&gt; url id name gender culture born died alive titles aliases father
#&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;lgl&gt; &lt;list&gt; &lt;list&gt; &lt;chr&gt;
#&gt; 1 https:/… 1022 Theo… Male "Ironb… "In … "" TRUE &lt;chr&gt; &lt;chr&gt; ""
#&gt; 2 https:/… 1052 Tyri… Male "" "In … "" TRUE &lt;chr&gt; &lt;chr&gt; ""
#&gt; 3 https:/… 1074 Vict… Male "Ironb… "In … "" TRUE &lt;chr&gt; &lt;chr&gt; ""
#&gt; 4 https:/… 1109 Will Male "" "" "In … FALSE &lt;chr&gt; &lt;chr&gt; ""
#&gt; 5 https:/… 1166 Areo… Male "Norvo… "In … "" TRUE &lt;chr&gt; &lt;chr&gt; ""
#&gt; 6 https:/… 1267 Chett Male "" "At … "In … FALSE &lt;chr&gt; &lt;chr&gt; ""
#&gt; # … with 24 more rows, and 7 more variables: mother &lt;chr&gt;, spouse &lt;chr&gt;,
#&gt; # allegiances &lt;list&gt;, books &lt;list&gt;, povBooks &lt;list&gt;, tvSeries &lt;list&gt;,
#&gt; # playedBy &lt;list&gt;</pre>
#&gt; url id name gender culture born
#&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 https://www.anapio… 1022 Theon Greyjoy Male "Ironborn" "In 278 AC or …
#&gt; 2 https://www.anapio… 1052 Tyrion Lannist… Male "" "In 273 AC, at…
#&gt; 3 https://www.anapio… 1074 Victarion Grey… Male "Ironborn" "In 268 AC or …
#&gt; 4 https://www.anapio… 1109 Will Male "" ""
#&gt; 5 https://www.anapio… 1166 Areo Hotah Male "Norvoshi" "In 257 AC or …
#&gt; 6 https://www.anapio… 1267 Chett Male "" "At Hag's Mire"
#&gt; # … with 24 more rows, and 12 more variables: died &lt;chr&gt;, alive &lt;lgl&gt;,
#&gt; # titles &lt;list&gt;, aliases &lt;list&gt;, father &lt;chr&gt;, mother &lt;chr&gt;, …</pre>
</div>
<p>And selecting a few columns to make it easier to read:</p>
<div class="cell">
@ -607,15 +587,15 @@ chars
select(id, name, gender, culture, born, died, alive)
characters
#&gt; # A tibble: 30 × 7
#&gt; id name gender culture born died alive
#&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;lgl&gt;
#&gt; 1 1022 Theon Greyjoy Male "Ironborn" "In 278 AC or 279 AC… "" TRUE
#&gt; 2 1052 Tyrion Lannister Male "" "In 273 AC, at Caste… "" TRUE
#&gt; 3 1074 Victarion Greyjoy Male "Ironborn" "In 268 AC or before… "" TRUE
#&gt; 4 1109 Will Male "" "" "In … FALSE
#&gt; 5 1166 Areo Hotah Male "Norvoshi" "In 257 AC or before… "" TRUE
#&gt; 6 1267 Chett Male "" "At Hag's Mire" "In … FALSE
#&gt; # … with 24 more rows</pre>
#&gt; id name gender culture born died
#&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1022 Theon Greyjoy Male "Ironborn" "In 278 AC or 27… ""
#&gt; 2 1052 Tyrion Lannister Male "" "In 273 AC, at C… ""
#&gt; 3 1074 Victarion Greyjoy Male "Ironborn" "In 268 AC or be… ""
#&gt; 4 1109 Will Male "" "" "In 297 AC, at…
#&gt; 5 1166 Areo Hotah Male "Norvoshi" "In 257 AC or be… ""
#&gt; 6 1267 Chett Male "" "At Hag's Mire" "In 299 AC, at…
#&gt; # … with 24 more rows, and 1 more variable: alive &lt;lgl&gt;</pre>
</div>
<p>There are also many list-columns:</p>
<div class="cell">
@ -828,15 +808,16 @@ Deeply nested</h2>
unnest_wider(results)
locations
#&gt; # A tibble: 7 × 6
#&gt; city address_compone…¹ formatted_address geometry place_id types
#&gt; &lt;chr&gt; &lt;list&gt; &lt;chr&gt; &lt;list&gt; &lt;chr&gt; &lt;list&gt;
#&gt; 1 Houston &lt;list [4]&gt; Houston, TX, USA &lt;named list&gt; ChIJAYW… &lt;list&gt;
#&gt; 2 Washington &lt;list [2]&gt; Washington, USA &lt;named list&gt; ChIJ-bD… &lt;list&gt;
#&gt; 3 Washington &lt;list [4]&gt; Washington, DC, … &lt;named list&gt; ChIJW-T… &lt;list&gt;
#&gt; 4 New York &lt;list [3]&gt; New York, NY, USA &lt;named list&gt; ChIJOwg… &lt;list&gt;
#&gt; 5 Chicago &lt;list [4]&gt; Chicago, IL, USA &lt;named list&gt; ChIJ7cv… &lt;list&gt;
#&gt; 6 Arlington &lt;list [4]&gt; Arlington, TX, U… &lt;named list&gt; ChIJ05g… &lt;list&gt;
#&gt; # … with 1 more row, and abbreviated variable name ¹address_components</pre>
#&gt; city address_compone…¹ formatted_address geometry place_id
#&gt; &lt;chr&gt; &lt;list&gt; &lt;chr&gt; &lt;list&gt; &lt;chr&gt;
#&gt; 1 Houston &lt;list [4]&gt; Houston, TX, USA &lt;named list&gt; ChIJAYWNSLS4QI…
#&gt; 2 Washington &lt;list [2]&gt; Washington, USA &lt;named list&gt; ChIJ-bDD5__lhV…
#&gt; 3 Washington &lt;list [4]&gt; Washington, DC, … &lt;named list&gt; ChIJW-T2Wt7Gt4…
#&gt; 4 New York &lt;list [3]&gt; New York, NY, USA &lt;named list&gt; ChIJOwg_06VPwo…
#&gt; 5 Chicago &lt;list [4]&gt; Chicago, IL, USA &lt;named list&gt; ChIJ7cv00DwsDo…
#&gt; 6 Arlington &lt;list [4]&gt; Arlington, TX, U… &lt;named list&gt; ChIJ05gI5NJiTo…
#&gt; # … with 1 more row, 1 more variable: types &lt;list&gt;, and abbreviated variable
#&gt; # name ¹address_components</pre>
</div>
<p>Now we can see why two cities got two results: Washington matched both Washington state and Washington, DC, and Arlington matched Arlington, Virginia and Arlington, Texas.</p>
<p>There are a few different places we could go from here. We might want to determine the exact location of the match, which is stored in the <code>geometry</code> list-column:</p>
@ -937,7 +918,7 @@ locations
<p>If these case studies have whetted your appetite for more real-life rectangling, you can see a few more examples in <code>vignette("rectangling", package = "tidyr")</code>.</p>
</section>
<section id="exercises-1" data-type="sect2">
<section id="rectangling-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Roughly estimate when <code>gh_repos</code> was created. Why can you only roughly estimate the date?</p></li>
@ -965,7 +946,7 @@ Exercises</h2>
JSON</h1>
<p>All of the case studies in the previous section were sourced from wild-caught JSON. JSON is short for <strong>j</strong>ava<strong>s</strong>cript <strong>o</strong>bject <strong>n</strong>otation and is the way that most web APIs return data. It’s important to understand it because while JSON and R’s data types are pretty similar, there isn’t a perfect 1-to-1 mapping, so it’s good to understand a bit about JSON if things go wrong.</p>
<section id="data-types" data-type="sect2">
<section id="rectangling-data-types" data-type="sect2">
<h2>
Data types</h2>
<p>JSON is a simple format designed to be easily read and written by machines, not humans. It has six key data types. Four of them are scalars:</p>
@ -1083,7 +1064,7 @@ Translation challenges</h2>
<p>Since JSON doesn’t have any way to represent dates or date-times, they’re often stored as ISO8601 date times in strings, and you’ll need to use <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">readr::parse_date()</a></code> or <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">readr::parse_datetime()</a></code> to turn them into the correct data structure. Similarly, JSON’s rules for representing floating point numbers are a little imprecise, so you’ll also sometimes find numbers stored in strings. Apply <code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">readr::parse_double()</a></code> as needed to get the correct variable type.</p>
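<p>As a minimal sketch of these conversions, assuming some hypothetical string values of the kind a JSON API might return (the values and variable names are illustrative, not taken from the case studies):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(readr)

# Hypothetical values, as they might look after parsing JSON
date_string &lt;- "2023-01-26"
datetime_string &lt;- "2023-01-26T10:36:07Z"
number_string &lt;- "1.5"

# Each parser turns a string into the matching R type
parse_date(date_string)
#&gt; [1] "2023-01-26"
parse_datetime(datetime_string)
#&gt; [1] "2023-01-26 10:36:07 UTC"
parse_double(number_string)
#&gt; [1] 1.5</pre>
</div>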
</section>
<section id="exercises-2" data-type="sect2">
<section id="rectangling-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
@ -1110,7 +1091,7 @@ df_row &lt;- tibble(json = json_row)</pre>
</ol></section>
</section>
<section id="summary" data-type="sect1">
<section id="rectangling-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you learned what lists are, how you can generate them from JSON files, and how to turn them into rectangular data frames. Surprisingly, we only need two new functions: <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> to put list elements into rows and <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> to put list elements into columns. It doesn’t matter how deeply nested the list-column is; all you need to do is repeatedly call these two functions.</p>

View File

@ -1,23 +1,14 @@
<section data-type="chapter" id="chp-regexps">
<h1><span id="sec-regular-expressions" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Regular expressions</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="regexps-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>In <a href="#chp-strings" data-type="xref">#chp-strings</a>, you learned a whole bunch of useful functions for working with strings. This chapter will focus on functions that use <strong>regular expressions</strong>, a concise and powerful language for describing patterns within strings. The term “regular expression” is a bit of a mouthful, so most people abbreviate it to “regex”<span data-type="footnote">You can pronounce it with either a hard-g (reg-x) or a soft-g (rej-x).</span> or “regexp”.</p>
<p>The chapter starts with the basics of regular expressions and the most useful stringr functions for data analysis. We’ll then expand your knowledge of patterns and cover seven important new topics (escaping, anchoring, character classes, shorthand classes, quantifiers, precedence, and grouping). Next, we’ll talk about some of the other types of patterns that stringr functions can work with and the various “flags” that allow you to tweak the operation of regular expressions. We’ll finish with a survey of other places in the tidyverse and base R where you might use regexes.</p>
<section id="prerequisites" data-type="sect2">
<section id="regexps-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<div data-type="important"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>This chapter relies on features only found in tidyr 1.3.0, which is still in development. If you want to live on the edge, you can get the dev version with <code>devtools::install_github("tidyverse/tidyr")</code>.</p></div>
<p>In this chapter, we’ll use regular expression functions from stringr and tidyr, both core members of the tidyverse, as well as data from the babynames package.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
@ -46,11 +37,7 @@ Pattern basics</h1>
#&gt; [11] │ boysen&lt;berry&gt;
#&gt; [19] │ cloud&lt;berry&gt;
#&gt; [21] │ cran&lt;berry&gt;
#&gt; [29] │ elder&lt;berry&gt;
#&gt; [32] │ goji &lt;berry&gt;
#&gt; [33] │ goose&lt;berry&gt;
#&gt; [38] │ huckle&lt;berry&gt;
#&gt; ... and 4 more
#&gt; ... and 8 more
str_view(fruit, "BERRY")</pre>
</div>
@ -70,8 +57,7 @@ str_view(fruit, "BERRY")</pre>
#&gt; [51] │ nect&lt;arine&gt;
#&gt; [62] │ pine&lt;apple&gt;
#&gt; [64] │ pomegr&lt;anate&gt;
#&gt; [70] │ r&lt;aspbe&gt;rry
#&gt; [73] │ sal&lt;al be&gt;rry</pre>
#&gt; ... and 2 more</pre>
</div>
<p><strong>Quantifiers</strong> control how many times a pattern can match:</p>
<ul><li>
@ -123,11 +109,7 @@ str_view(words, "[^aeiou][^aeiou][^aeiou][^aeiou]")
#&gt; [34] │ alth&lt;ough&gt;
#&gt; [37] │ am&lt;ount&gt;
#&gt; [46] │ app&lt;oint&gt;
#&gt; [47] │ appr&lt;oach&gt;
#&gt; [52] │ ar&lt;ound&gt;
#&gt; [61] │ &lt;auth&gt;ority
#&gt; [79] │ be&lt;auty&gt;
#&gt; ... and 62 more</pre>
#&gt; ... and 66 more</pre>
</div>
<p>(We’ll learn more elegant ways to express these ideas in <a href="#sec-quantifiers" data-type="xref">#sec-quantifiers</a>.)</p>
<p>You can use <strong>alternation</strong>, <code>|</code>, to pick between one or more alternative patterns. For example, the following patterns look for fruits containing “apple”, “pear”, or “banana”, or a repeated vowel.</p>
@ -144,11 +126,6 @@ str_view(fruit, "aa|ee|ii|oo|uu")
#&gt; [66] │ purple mangost&lt;ee&gt;n</pre>
</div>
<p>Regular expressions are very compact and use a lot of punctuation characters, so they can seem overwhelming and hard to read at first. Don’t worry; you’ll get better with practice, and simple patterns will soon become second nature. Let’s kick off that process by practicing with some useful stringr functions.</p>
<section id="exercises" data-type="sect2">
<h2>
Exercises</h2>
</section>
</section>
<section id="sec-stringr-regex-funs" data-type="sect1">
@ -286,7 +263,7 @@ str_remove_all(x, "[aeiou]")
<section id="sec-extract-variables" data-type="sect2">
<h2>
Extract variables</h2>
<p>The last function well discuss uses regular expressions to extract data out of one column into one or more new columns: <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code>. Its a peer of the <code>separate_wider_location()</code> and <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code> functions that you learned about in <a href="#sec-string-columns" data-type="xref">#sec-string-columns</a>. These functions live in tidyr because the operates on (columns of) data frames, rather than individual vectors.</p>
<p>The last function we’ll discuss uses regular expressions to extract data out of one column into one or more new columns: <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code>. It’s a peer of the <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_position()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code> functions that you learned about in <a href="#sec-string-columns" data-type="xref">#sec-string-columns</a>. These functions live in tidyr because they operate on (columns of) data frames, rather than individual vectors.</p>
<p>Let’s create a simple dataset to show how it works. Here we have some data derived from <code>babynames</code> where we have the name, gender, and age of a bunch of people in a rather weird format<span data-type="footnote">We wish we could reassure you that you’d never see something this weird in real life, but unfortunately over the course of your career you’re likely to see much weirder!</span>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tribble(
@ -325,7 +302,7 @@ Extract variables</h2>
<p>If the match fails, you can use <code>too_short = "debug"</code> to figure out what went wrong, just like <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_position()</a></code>.</p>
</section>
<section id="exercises-1" data-type="sect2">
<section id="regexps-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What baby name has the most vowels? What name has the highest proportion of vowels? (Hint: what is the denominator?)</p></li>
@ -398,8 +375,8 @@ str_view(fruit, "a$")
#&gt; [56] │ papay&lt;a&gt;
#&gt; [74] │ satsum&lt;a&gt;</pre>
</div>
<p>Its tempting to think that <code>$</code> should matches the start of a string, because thats how we write dollar amounts, but its not what regular expressions want.</p>
<p>To force a regular expression to only the full string, anchor it with both <code>^</code> and <code>$</code>:</p>
<p>It’s tempting to think that <code>$</code> should match the start of a string, because that’s how we write dollar amounts, but it’s not what regular expressions want.</p>
<p>To force a regular expression to match only the full string, anchor it with both <code>^</code> and <code>$</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view(fruit, "apple")
#&gt; [1] │ &lt;apple&gt;
@ -407,7 +384,7 @@ str_view(fruit, "a$")
str_view(fruit, "^apple$")
#&gt; [1] │ &lt;apple&gt;</pre>
</div>
<p>You can also match the boundary between words (i.e. the start or end of a word) with <code>\b</code>. This can be particularly when using RStudios find and replace tool. For example, if to find all uses of <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code>, you can search for <code>\bsum\b</code> to avoid matching <code>summarize</code>, <code>summary</code>, <code>rowsum</code> and so on:</p>
<p>You can also match the boundary between words (i.e. the start or end of a word) with <code>\b</code>. This can be particularly useful when using RStudio’s find and replace tool. For example, to find all uses of <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code>, you can search for <code>\bsum\b</code> to avoid matching <code>summarize</code>, <code>summary</code>, <code>rowsum</code> and so on:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)")
str_view(x, "sum")
@ -523,7 +500,7 @@ Operator precedence and parentheses</h2>
<section id="grouping-and-capturing" data-type="sect2">
<h2>
Grouping and capturing</h2>
<p>As well overriding operator precedence, parentheses have another important effect: they create <strong>capturing groups</strong> that allow you to use sub-components of the match.</p>
<p>As well as overriding operator precedence, parentheses have another important effect: they create <strong>capturing groups</strong> that allow you to use sub-components of the match.</p>
<p>The first way to use a capturing group is to refer back to it within a match with a <strong>back reference</strong>: <code>\1</code> refers to the match contained in the first parenthesis, <code>\2</code> in the second parenthesis, and so on. For example, the following pattern finds all fruits that have a repeated pair of letters:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view(fruit, "(..)\\1")
@ -548,17 +525,13 @@ Grouping and capturing</h2>
<pre data-type="programlisting" data-code-language="r">sentences |&gt;
str_replace("(\\w+) (\\w+) (\\w+)", "\\1 \\3 \\2") |&gt;
str_view()
#&gt; [1] │ The canoe birch slid on the smooth planks.
#&gt; [2] │ Glue sheet the to the dark blue background.
#&gt; [3] │ It's to easy tell the depth of a well.
#&gt; [4] │ These a days chicken leg is a rare dish.
#&gt; [5] │ Rice often is served in round bowls.
#&gt; [6] │ The of juice lemons makes fine punch.
#&gt; [7] │ The was box thrown beside the parked truck.
#&gt; [8] │ The were hogs fed chopped corn and garbage.
#&gt; [9] │ Four of hours steady work faced us.
#&gt; [10] │ A size large in stockings is hard to sell.
#&gt; ... and 710 more</pre>
#&gt; [1] │ The canoe birch slid on the smooth planks.
#&gt; [2] │ Glue sheet the to the dark blue background.
#&gt; [3] │ It's to easy tell the depth of a well.
#&gt; [4] │ These a days chicken leg is a rare dish.
#&gt; [5] │ Rice often is served in round bowls.
#&gt; [6] │ The of juice lemons makes fine punch.
#&gt; ... and 714 more</pre>
</div>
<p>If you want to extract the matches for each group you can use <code><a href="https://stringr.tidyverse.org/reference/str_match.html">str_match()</a></code>. But <code><a href="https://stringr.tidyverse.org/reference/str_match.html">str_match()</a></code> returns a matrix, so it’s not particularly easy to work with<span data-type="footnote">Mostly because we never discuss matrices in this book!</span>:</p>
<div class="cell">
@ -605,7 +578,7 @@ str_match(x, "gr(?:e|a)y")
</div>
</section>
<section id="exercises-2" data-type="sect2">
<section id="regexps-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>How would you match the literal string <code>"'\</code>? How about <code>"$^$"</code>?</p></li>
@ -645,7 +618,7 @@ Pattern control</h1>
<section id="sec-flags" data-type="sect2">
<h2>
Regex flags</h2>
<p>There are a number of settings that can use to control the details of the regexp. These settings are often called <strong>flags</strong> in other programming languages. In stringr, you can use these by wrapping the pattern in a call to <code><a href="https://stringr.tidyverse.org/reference/modifiers.html">regex()</a></code>. The most useful flag is probably <code>ignore_case = TRUE</code> because it allows characters to match either their uppercase or lowercase forms:</p>
<p>There are a number of settings that can be used to control the details of the regexp. These settings are often called <strong>flags</strong> in other programming languages. In stringr, you can use these by wrapping the pattern in a call to <code><a href="https://stringr.tidyverse.org/reference/modifiers.html">regex()</a></code>. The most useful flag is probably <code>ignore_case = TRUE</code> because it allows characters to match either their uppercase or lowercase forms:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">bananas &lt;- c("banana", "Banana", "BANANA")
str_view(bananas, "banana")
@ -737,7 +710,7 @@ str_view("i İ ı I", coll("İ", ignore_case = TRUE, locale = "tr"))
<h1>
Practice</h1>
<p>To put these ideas into practice we’ll solve a few semi-authentic problems next. We’ll discuss three general techniques:</p>
<ol type="1"><li>checking you work by creating simple positive and negative controls</li>
<ol type="1"><li>checking your work by creating simple positive and negative controls</li>
<li>combining regular expressions with Boolean algebra</li>
<li>creating complex patterns using string manipulation</li>
</ol>
@ -753,11 +726,7 @@ Check your work</h2>
#&gt; [7] │ &lt;The&gt; box was thrown beside the parked truck.
#&gt; [8] │ &lt;The&gt; hogs were fed chopped corn and garbage.
#&gt; [11] │ &lt;The&gt; boy was there when the sun rose.
#&gt; [13] │ &lt;The&gt; source of the huge river is the clear spring.
#&gt; [18] │ &lt;The&gt; soft cushion broke the man's fall.
#&gt; [19] │ &lt;The&gt; salt breeze came across from the sea.
#&gt; [20] │ &lt;The&gt; girl at the booth sold fifty bonds.
#&gt; ... and 267 more</pre>
#&gt; ... and 271 more</pre>
</div>
<p>Because that pattern also matches sentences starting with words like <code>They</code> or <code>These</code>. We need to make sure that the “e” is the last letter in the word, which we can do by adding a word boundary:</p>
<div class="cell">
@ -768,26 +737,18 @@ Check your work</h2>
#&gt; [8] │ &lt;The&gt; hogs were fed chopped corn and garbage.
#&gt; [11] │ &lt;The&gt; boy was there when the sun rose.
#&gt; [13] │ &lt;The&gt; source of the huge river is the clear spring.
#&gt; [18] │ &lt;The&gt; soft cushion broke the man's fall.
#&gt; [19] │ &lt;The&gt; salt breeze came across from the sea.
#&gt; [20] │ &lt;The&gt; girl at the booth sold fifty bonds.
#&gt; [21] │ &lt;The&gt; small pup gnawed a hole in the sock.
#&gt; ... and 246 more</pre>
#&gt; ... and 250 more</pre>
</div>
<p>What about finding all sentences that begin with a pronoun?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view(sentences, "^She|He|It|They\\b")
#&gt; [3] │ &lt;It&gt;'s easy to tell the depth of a well.
#&gt; [15] │ &lt;He&gt;lp the woman get back to her feet.
#&gt; [27] │ &lt;He&gt;r purse was full of useless trash.
#&gt; [29] │ &lt;It&gt; snowed, rained, and hailed the same morning.
#&gt; [63] │ &lt;He&gt; ran half way to the hardware store.
#&gt; [90] │ &lt;He&gt; lay prone and hardly moved a limb.
#&gt; [116] │ &lt;He&gt; ordered peach pie with ice cream.
#&gt; [118] │ &lt;He&gt;mp is a weed found in parts of the tropics.
#&gt; [127] │ &lt;It&gt; caught its hind paw in a rusty trap.
#&gt; [132] │ &lt;He&gt; said the same phrase thirty times.
#&gt; ... and 53 more</pre>
#&gt; [3] │ &lt;It&gt;'s easy to tell the depth of a well.
#&gt; [15] │ &lt;He&gt;lp the woman get back to her feet.
#&gt; [27] │ &lt;He&gt;r purse was full of useless trash.
#&gt; [29] │ &lt;It&gt; snowed, rained, and hailed the same morning.
#&gt; [63] │ &lt;He&gt; ran half way to the hardware store.
#&gt; [90] │ &lt;He&gt; lay prone and hardly moved a limb.
#&gt; ... and 57 more</pre>
</div>
<p>A quick inspection of the results shows that we’re getting some spurious matches. That’s because we’ve forgotten to use parentheses:</p>
<div class="cell">
@ -798,11 +759,7 @@ Check your work</h2>
#&gt; [90] │ &lt;He&gt; lay prone and hardly moved a limb.
#&gt; [116] │ &lt;He&gt; ordered peach pie with ice cream.
#&gt; [127] │ &lt;It&gt; caught its hind paw in a rusty trap.
#&gt; [132] │ &lt;He&gt; said the same phrase thirty times.
#&gt; [153] │ &lt;He&gt; broke a new shoelace that day.
#&gt; [159] │ &lt;She&gt; sewed the torn coat quite neatly.
#&gt; [168] │ &lt;He&gt; knew the skill of the great young actress.
#&gt; ... and 47 more</pre>
#&gt; ... and 51 more</pre>
</div>
<p>You might wonder how you might spot such a mistake if it didn’t occur in the first few matches. A good technique is to create a few positive and negative matches and use them to test that your pattern works as expected:</p>
<div class="cell">
@ -850,11 +807,7 @@ Boolean operations</h2>
#&gt; [62] │ &lt;availab&gt;le
#&gt; [66] │ &lt;ba&gt;by
#&gt; [67] │ &lt;ba&gt;ck
#&gt; [68] │ &lt;ba&gt;d
#&gt; [69] │ &lt;ba&gt;g
#&gt; [70] │ &lt;bala&gt;nce
#&gt; [71] │ &lt;ba&gt;ll
#&gt; ... and 20 more</pre>
#&gt; ... and 24 more</pre>
</div>
<p>It’s simpler to combine the results of two calls to <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code>:</p>
<div class="cell">
@ -897,11 +850,7 @@ Creating a pattern with code</h2>
#&gt; [148] │ The spot on the blotter was made by &lt;green&gt; ink.
#&gt; [160] │ The sofa cushion is &lt;red&gt; and of light weight.
#&gt; [174] │ The sky that morning was clear and bright &lt;blue&gt;.
#&gt; [204] │ A &lt;blue&gt; crane is a tall wading bird.
#&gt; [217] │ It is hard to erase &lt;blue&gt; or &lt;red&gt; ink.
#&gt; [224] │ The lamp shone with a steady &lt;green&gt; flame.
#&gt; [247] │ The box is held by a bright &lt;red&gt; snapper.
#&gt; ... and 16 more</pre>
#&gt; ... and 20 more</pre>
</div>
<p>But as the number of colors grows, it would quickly get tedious to construct this pattern by hand. Wouldn’t it be nice if we could store the colors in a vector?</p>
<div class="cell">
@ -915,34 +864,26 @@ Creating a pattern with code</h2>
<p>We could make this pattern more comprehensive if we had a good list of colors. One place we could start from is the list of built-in colors that R can use for plots:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view(colors())
#&gt; [1] │ white
#&gt; [2] │ aliceblue
#&gt; [3] │ antiquewhite
#&gt; [4] │ antiquewhite1
#&gt; [5] │ antiquewhite2
#&gt; [6] │ antiquewhite3
#&gt; [7] │ antiquewhite4
#&gt; [8] │ aquamarine
#&gt; [9] │ aquamarine1
#&gt; [10] │ aquamarine2
#&gt; ... and 647 more</pre>
#&gt; [1] │ white
#&gt; [2] │ aliceblue
#&gt; [3] │ antiquewhite
#&gt; [4] │ antiquewhite1
#&gt; [5] │ antiquewhite2
#&gt; [6] │ antiquewhite3
#&gt; ... and 651 more</pre>
</div>
<p>But let’s first eliminate the numbered variants:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">cols &lt;- colors()
cols &lt;- cols[!str_detect(cols, "\\d")]
str_view(cols)
#&gt; [1] │ white
#&gt; [2] │ aliceblue
#&gt; [3] │ antiquewhite
#&gt; [4] │ aquamarine
#&gt; [5] │ azure
#&gt; [6] │ beige
#&gt; [7] │ bisque
#&gt; [8] │ black
#&gt; [9] │ blanchedalmond
#&gt; [10] │ blue
#&gt; ... and 133 more</pre>
#&gt; [1] │ white
#&gt; [2] │ aliceblue
#&gt; [3] │ antiquewhite
#&gt; [4] │ aquamarine
#&gt; [5] │ azure
#&gt; [6] │ beige
#&gt; ... and 137 more</pre>
</div>
<p>Then we can turn this into one giant pattern. We won’t show the pattern here because it’s huge, but you can see it working:</p>
<div class="cell">
@ -954,16 +895,12 @@ str_view(sentences, pattern)
#&gt; [66] │ Cars and busses stalled in &lt;snow&gt; drifts.
#&gt; [92] │ A wisp of cloud hung in the &lt;blue&gt; air.
#&gt; [112] │ Leaves turn &lt;brown&gt; and &lt;yellow&gt; in the fall.
#&gt; [148] │ The spot on the blotter was made by &lt;green&gt; ink.
#&gt; [149] │ Mud was spattered on the front of his &lt;white&gt; shirt.
#&gt; [160] │ The sofa cushion is &lt;red&gt; and of light weight.
#&gt; [167] │ The office paint was a dull, sad &lt;tan&gt;.
#&gt; ... and 53 more</pre>
#&gt; ... and 57 more</pre>
</div>
<p>In this example, <code>cols</code> only contains numbers and letters so you dont need to worry about metacharacters. But in general, whenever you create create patterns from existing strings its wise to run them through <code><a href="https://stringr.tidyverse.org/reference/str_escape.html">str_escape()</a></code> to ensure they match literally.</p>
<p>In this example, <code>cols</code> only contains numbers and letters so you don’t need to worry about metacharacters. But in general, whenever you create patterns from existing strings it’s wise to run them through <code><a href="https://stringr.tidyverse.org/reference/str_escape.html">str_escape()</a></code> to ensure they match literally.</p>
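<p>As a small sketch of why escaping matters, assume a hypothetical vector of strings that happen to contain the metacharacters <code>$</code> and <code>.</code> (the <code>prices</code> vector and the test strings are made up for illustration):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

# Hypothetical strings that contain the metacharacters $ and .
prices &lt;- c("$1.50", "a.b")

# Escape each string, then join the alternatives with |
pattern &lt;- str_c(str_escape(prices), collapse = "|")

# "axb" would match an unescaped "a.b", but not the escaped version
str_detect(c("costs $1.50", "axb"), pattern)
#&gt; [1]  TRUE FALSE</pre>
</div>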
</section>
<section id="exercises-3" data-type="sect2">
<section id="regexps-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
@ -988,8 +925,8 @@ Regular expressions in other places</h1>
tidyverse</h2>
<p>There are three other particularly useful places where you might want to use regular expressions:</p>
<ul><li><p><code>matches(pattern)</code> will select all variables whose name matches the supplied pattern. It’s a “tidyselect” function that you can use in any tidyverse function that selects variables (e.g. <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename_with()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>).</p></li>
<li><p><code>pivot_longer()'s</code> <code>names_pattern</code> argument takes a vector of regular expressions, just like <code>separate_with_regex()</code>. Its useful when extracting data out of variable names with a complex structure</p></li>
<li><p>The <code>delim</code> argument in <code>separate_delim_longer()</code> and <code>separate_delim_wider()</code> usually matches a fixed string, but you can use <code><a href="https://stringr.tidyverse.org/reference/modifiers.html">regex()</a></code> to make it match a pattern. This is useful, for example, if you want to match a comma that is optionally followed by a space, i.e. <code>regex(", ?")</code>.</p></li>
<li><p><code>pivot_longer()'s</code> <code>names_pattern</code> argument takes a vector of regular expressions, just like <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code>. It’s useful when extracting data out of variable names with a complex structure.</p></li>
<li><p>The <code>delim</code> argument in <code><a href="https://tidyr.tidyverse.org/reference/separate_longer_delim.html">separate_longer_delim()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code> usually matches a fixed string, but you can use <code><a href="https://stringr.tidyverse.org/reference/modifiers.html">regex()</a></code> to make it match a pattern. This is useful, for example, if you want to match a comma that is optionally followed by a space, i.e. <code>regex(", ?")</code>.</p></li>
</ul></section>
<section id="base-r" data-type="sect2">
@ -1011,7 +948,7 @@ Base R</h2>
</section>
</section>
<section id="summary" data-type="sect1">
<section id="regexps-summary" data-type="sect1">
<h1>
Summary</h1>
<p>With every punctuation character potentially overloaded with meaning, regular expressions are one of the most compact languages out there. They’re definitely confusing at first but as you train your eyes to read them and your brain to understand them, you unlock a powerful skill that you can use in R and in many other places.</p>

View File

@ -1,6 +1,6 @@
<section data-type="chapter" id="chp-spreadsheets">
<h1><span id="sec-import-spreadsheets" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Spreadsheets</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="spreadsheets-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>So far, you have learned about importing data from plain text files, e.g., <code>.csv</code> and <code>.tsv</code> files. Sometimes you need to analyze data that lives in a spreadsheet. This chapter will introduce you to tools for working with data in Excel spreadsheets and Google Sheets. This will build on much of what you’ve learned in <a href="#chp-data-import" data-type="xref">#chp-data-import</a>, but we will also discuss additional considerations and complexities when working with data from spreadsheets.</p>
@ -11,7 +11,7 @@ Introduction</h1>
<h1>
Excel</h1>
<section id="prerequisites" data-type="sect2">
<section id="spreadsheets-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this section, you’ll learn how to load data from Excel spreadsheets in R with the <strong>readxl</strong> package. This package is non-core tidyverse, so you need to load it explicitly, but it is installed automatically when you install the tidyverse package.</p>
@ -190,15 +190,16 @@ Reading worksheets</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">read_excel("data/penguins.xlsx", sheet = "Torgersen Island")
#&gt; # A tibble: 52 × 8
#&gt; species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 Adelie Torgers… 39.1 18.7 181 3750
#&gt; 2 Adelie Torgers… 39.5 17.399999999… 186 3800
#&gt; 3 Adelie Torgers… 40.2999999999… 18 195 3250
#&gt; 4 Adelie Torgers… NA NA NA NA
#&gt; 5 Adelie Torgers… 36.7000000000… 19.3 193 3450
#&gt; 6 Adelie Torgers… 39.2999999999… 20.6 190 3650
#&gt; # … with 46 more rows, and 2 more variables: sex &lt;chr&gt;, year &lt;dbl&gt;</pre>
#&gt; species island bill_length_mm bill_depth_mm flipper_length_mm
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 Adelie Torgersen 39.1 18.7 181
#&gt; 2 Adelie Torgersen 39.5 17.399999999999999 186
#&gt; 3 Adelie Torgersen 40.299999999999997 18 195
#&gt; 4 Adelie Torgersen NA NA NA
#&gt; 5 Adelie Torgersen 36.700000000000003 19.3 193
#&gt; 6 Adelie Torgersen 39.299999999999997 20.6 190
#&gt; # … with 46 more rows, and 3 more variables: body_mass_g &lt;chr&gt;, sex &lt;chr&gt;,
#&gt; # year &lt;dbl&gt;</pre>
</div>
<p>Some variables that appear to contain numerical data are read in as characters due to the character string <code>"NA"</code> not being recognized as a true <code>NA</code>.</p>
<div class="cell">
@ -206,15 +207,16 @@ Reading worksheets</h2>
penguins_torgersen
#&gt; # A tibble: 52 × 8
#&gt; species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 Adelie Torgers… 39.1 18.7 181 3750
#&gt; 2 Adelie Torgers… 39.5 17.4 186 3800
#&gt; 3 Adelie Torgers… 40.3 18 195 3250
#&gt; 4 Adelie Torgers… NA NA NA NA
#&gt; 5 Adelie Torgers… 36.7 19.3 193 3450
#&gt; 6 Adelie Torgers… 39.3 20.6 190 3650
#&gt; # … with 46 more rows, and 2 more variables: sex &lt;chr&gt;, year &lt;dbl&gt;</pre>
#&gt; species island bill_length_mm bill_depth_mm flipper_length_mm
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 Adelie Torgersen 39.1 18.7 181
#&gt; 2 Adelie Torgersen 39.5 17.4 186
#&gt; 3 Adelie Torgersen 40.3 18 195
#&gt; 4 Adelie Torgersen NA NA NA
#&gt; 5 Adelie Torgersen 36.7 19.3 193
#&gt; 6 Adelie Torgersen 39.3 20.6 190
#&gt; # … with 46 more rows, and 3 more variables: body_mass_g &lt;dbl&gt;, sex &lt;chr&gt;,
#&gt; # year &lt;dbl&gt;</pre>
</div>
<p>Alternatively, you can use <code><a href="https://readxl.tidyverse.org/reference/excel_sheets.html">excel_sheets()</a></code> to get information on all worksheets in an Excel spreadsheet, and then read the one(s) you’re interested in.</p>
<div class="cell">
@ -240,15 +242,16 @@ dim(penguins_dream)
<pre data-type="programlisting" data-code-language="r">penguins &lt;- bind_rows(penguins_torgersen, penguins_biscoe, penguins_dream)
penguins
#&gt; # A tibble: 344 × 8
#&gt; species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 Adelie Torgers… 39.1 18.7 181 3750
#&gt; 2 Adelie Torgers… 39.5 17.4 186 3800
#&gt; 3 Adelie Torgers… 40.3 18 195 3250
#&gt; 4 Adelie Torgers… NA NA NA NA
#&gt; 5 Adelie Torgers… 36.7 19.3 193 3450
#&gt; 6 Adelie Torgers… 39.3 20.6 190 3650
#&gt; # … with 338 more rows, and 2 more variables: sex &lt;chr&gt;, year &lt;dbl&gt;</pre>
#&gt; species island bill_length_mm bill_depth_mm flipper_length_mm
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 Adelie Torgersen 39.1 18.7 181
#&gt; 2 Adelie Torgersen 39.5 17.4 186
#&gt; 3 Adelie Torgersen 40.3 18 195
#&gt; 4 Adelie Torgersen NA NA NA
#&gt; 5 Adelie Torgersen 36.7 19.3 193
#&gt; 6 Adelie Torgersen 39.3 20.6 190
#&gt; # … with 338 more rows, and 3 more variables: body_mass_g &lt;dbl&gt;, sex &lt;chr&gt;,
#&gt; # year &lt;dbl&gt;</pre>
</div>
<p>In <a href="#chp-iteration" data-type="xref">#chp-iteration</a> we’ll talk about ways of doing this sort of task without repetitive code.</p>
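<p>In the meantime, here is a rough sketch of the idea, not necessarily the approach the iteration chapter takes. It assumes the example workbook that ships with readxl (located with <code>readxl_example()</code>) as a stand-in for your own file, and uses purrr’s <code>map()</code> to read every worksheet into a named list:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(readxl)
library(purrr)

# readxl_example() returns the path to a workbook bundled with readxl;
# it stands in here for your own spreadsheet
path &lt;- readxl_example("datasets.xlsx")

# Read every worksheet into a list of tibbles, named by sheet
sheets &lt;- excel_sheets(path)
all_sheets &lt;- sheets |&gt;
  set_names() |&gt;
  map(\(sheet) read_excel(path, sheet = sheet))</pre>
</div>
<p>If every worksheet shares the same columns, as in the penguins example, the resulting list could then be passed to <code>bind_rows()</code>.</p>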
</section>
@ -277,14 +280,14 @@ deaths &lt;- read_excel(deaths_path)
#&gt; • `` -&gt; `...6`
deaths
#&gt; # A tibble: 18 × 6
#&gt; `Lots of people` ...2 ...3 ...4 ...5 ...6
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 simply cannot resist writing &lt;NA&gt; &lt;NA&gt; &lt;NA&gt; &lt;NA&gt; some
#&gt; 2 at the top &lt;NA&gt; of their…
#&gt; 3 or merging &lt;NA&gt; &lt;NA&gt; &lt;NA&gt; cells
#&gt; 4 Name Profession Age Has kids Date of birth Date …
#&gt; 5 David Bowie musician 69 TRUE 17175 42379
#&gt; 6 Carrie Fisher actor 60 TRUE 20749 42731
#&gt; `Lots of people` ...2 ...3 ...4 ...5 ...6
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 simply cannot resi &lt;NA&gt; &lt;NA&gt; &lt;NA&gt; &lt;NA&gt; some notes
#&gt; 2 at the top &lt;NA&gt; of their spreadsh
#&gt; 3 or merging &lt;NA&gt; &lt;NA&gt; &lt;NA&gt; cells
#&gt; 4 Name Profession Age Has kids Date of birth Date of death
#&gt; 5 David Bowie musician 69 TRUE 17175 42379
#&gt; 6 Carrie Fisher actor 60 TRUE 20749 42731
#&gt; # … with 12 more rows</pre>
</div>
<p>The top three rows and the bottom four rows are not part of the data frame.</p>
@ -292,29 +295,29 @@ deaths
<div class="cell">
<pre data-type="programlisting" data-code-language="r">read_excel(deaths_path, skip = 4)
#&gt; # A tibble: 14 × 6
#&gt; Name Profession Age `Has kids` `Date of birth` `Date of death`
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dttm&gt; &lt;chr&gt;
#&gt; 1 David Bowie musician 69 TRUE 1947-01-08 00:00:00 42379
#&gt; 2 Carrie Fis… actor 60 TRUE 1956-10-21 00:00:00 42731
#&gt; 3 Chuck Berry musician 90 TRUE 1926-10-18 00:00:00 42812
#&gt; 4 Bill Paxton actor 61 TRUE 1955-05-17 00:00:00 42791
#&gt; 5 Prince musician 57 TRUE 1958-06-07 00:00:00 42481
#&gt; 6 Alan Rickm… actor 69 FALSE 1946-02-21 00:00:00 42383
#&gt; # … with 8 more rows</pre>
#&gt; Name Profession Age `Has kids` `Date of birth`
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dttm&gt;
#&gt; 1 David Bowie musician 69 TRUE 1947-01-08 00:00:00
#&gt; 2 Carrie Fisher actor 60 TRUE 1956-10-21 00:00:00
#&gt; 3 Chuck Berry musician 90 TRUE 1926-10-18 00:00:00
#&gt; 4 Bill Paxton actor 61 TRUE 1955-05-17 00:00:00
#&gt; 5 Prince musician 57 TRUE 1958-06-07 00:00:00
#&gt; 6 Alan Rickman actor 69 FALSE 1946-02-21 00:00:00
#&gt; # … with 8 more rows, and 1 more variable: `Date of death` &lt;chr&gt;</pre>
</div>
<p>We could also set <code>n_max</code> to omit the extraneous rows at the bottom.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">read_excel(deaths_path, skip = 4, n_max = 10)
#&gt; # A tibble: 10 × 6
#&gt; Name Profession Age `Has kids` `Date of birth` `Date of death`
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;lgl&gt; &lt;dttm&gt; &lt;dttm&gt;
#&gt; 1 David … musician 69 TRUE 1947-01-08 00:00:00 2016-01-10 00:00:00
#&gt; 2 Carrie… actor 60 TRUE 1956-10-21 00:00:00 2016-12-27 00:00:00
#&gt; 3 Chuck … musician 90 TRUE 1926-10-18 00:00:00 2017-03-18 00:00:00
#&gt; 4 Bill P… actor 61 TRUE 1955-05-17 00:00:00 2017-02-25 00:00:00
#&gt; 5 Prince musician 57 TRUE 1958-06-07 00:00:00 2016-04-21 00:00:00
#&gt; 6 Alan R… actor 69 FALSE 1946-02-21 00:00:00 2016-01-14 00:00:00
#&gt; # … with 4 more rows</pre>
#&gt; Name Profession Age `Has kids` `Date of birth`
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;lgl&gt; &lt;dttm&gt;
#&gt; 1 David Bowie musician 69 TRUE 1947-01-08 00:00:00
#&gt; 2 Carrie Fisher actor 60 TRUE 1956-10-21 00:00:00
#&gt; 3 Chuck Berry musician 90 TRUE 1926-10-18 00:00:00
#&gt; 4 Bill Paxton actor 61 TRUE 1955-05-17 00:00:00
#&gt; 5 Prince musician 57 TRUE 1958-06-07 00:00:00
#&gt; 6 Alan Rickman actor 69 FALSE 1946-02-21 00:00:00
#&gt; # … with 4 more rows, and 1 more variable: `Date of death` &lt;dttm&gt;</pre>
</div>
<p>Another approach is using cell ranges. In Excel, the top left cell is <code>A1</code>. As you move across columns to the right, the cell label moves down the alphabet, i.e. <code>B1</code>, <code>C1</code>, etc. And as you move down a column, the number in the cell label increases, i.e. <code>A2</code>, <code>A3</code>, etc.</p>
<p>The data we want to read in starts in cell <code>A5</code> and ends in cell <code>F15</code>. In spreadsheet notation, this is <code>A5:F15</code>.</p>
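<p>To make the notation concrete, here is a minimal sketch in base R that translates a range like <code>A5:F15</code> into a count of rows and columns. The helpers <code>a1_to_index()</code> and <code>range_dims()</code> are hypothetical names invented for this illustration; they are not part of readxl.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Hypothetical helper: convert a cell label such as "F15" into
# numeric row and column indices (column letters are base 26).
a1_to_index &lt;- function(cell) {
letters_part &lt;- toupper(gsub("[0-9]", "", cell))
digits_part &lt;- gsub("[^0-9]", "", cell)
col &lt;- 0
for (ch in strsplit(letters_part, "")[[1]]) {
col &lt;- col * 26 + match(ch, LETTERS)
}
c(row = as.integer(digits_part), col = as.integer(col))
}

# Hypothetical helper: number of rows and columns covered by a range.
range_dims &lt;- function(range) {
ends &lt;- strsplit(range, ":", fixed = TRUE)[[1]]
start &lt;- a1_to_index(ends[1])
end &lt;- a1_to_index(ends[2])
c(
rows = end[["row"]] - start[["row"]] + 1,
cols = end[["col"]] - start[["col"]] + 1
)
}

range_dims("A5:F15")
#&gt; rows cols
#&gt;   11    6</pre>
</div>
<p>The range spans 11 rows because it covers the header row plus the 10 rows of data, and 6 columns from <code>A</code> to <code>F</code>.</p>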
@ -332,7 +335,7 @@ deaths
</li>
</ul></section>
<section id="data-types" data-type="sect2">
<section id="spreadsheets-data-types" data-type="sect2">
<h2>
Data types</h2>
<p>In CSV files, all values are strings. This is not particularly true to the data, but it is simple: everything is a string.</p>
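<p>A small sketch of what this means in practice, using base R’s <code>read.csv()</code> on a made-up two-row CSV (the data is invented for illustration). The text itself carries no type information; any types you see come from the reader’s guesses.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">csv_text &lt;- "name,age\nAda,36\nGrace,45"

# Read every column exactly as stored: as text.
as_stored &lt;- read.csv(text = csv_text, colClasses = "character")
class(as_stored$age)
#&gt; [1] "character"

# Let the reader guess column types from the text.
guessed &lt;- read.csv(text = csv_text)
class(guessed$age)
#&gt; [1] "integer"</pre>
</div>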
@ -399,7 +402,7 @@ write_xlsx(bake_sale, path = "data/bake-sale.xlsx")</pre>
<section id="formatted-output" data-type="sect2">
<h2>
Formatted output</h2>
<p>The readxl package is a light-weight solution for writing a simple Excel spreadsheet, but if you’re interested in additional features like writing to sheets within a spreadsheet and styling, you will want to use the <strong>openxlsx</strong> package. Note that this package is not part of the tidyverse so the functions and workflows may feel unfamiliar. For example, function names are camelCase, multiple functions can’t be composed in pipelines, and arguments are in a different order than they tend to be in the tidyverse. However, this is ok. As your R learning and usage expands outside of this book you will encounter lots of different styles used in various R packages that you might need to use to accomplish specific goals in R. A good way of familiarizing yourself with the coding style used in a new package is to run the examples provided in function documentation to get a feel for the syntax and the output formats as well as reading any vignettes that might come with the package.</p>
<p>The writexl package is a light-weight solution for writing a simple Excel spreadsheet, but if you’re interested in additional features like writing to sheets within a spreadsheet and styling, you will want to use the <strong>openxlsx</strong> package. Note that this package is not part of the tidyverse so the functions and workflows may feel unfamiliar. For example, function names are camelCase, multiple functions can’t be composed in pipelines, and arguments are in a different order than they tend to be in the tidyverse. However, this is ok. As your R learning and usage expands outside of this book you will encounter lots of different styles used in various R packages that you might need to use to accomplish specific goals in R. A good way of familiarizing yourself with the coding style used in a new package is to run the examples provided in function documentation to get a feel for the syntax and the output formats as well as reading any vignettes that might come with the package.</p>
<p>Below we show how to write a spreadsheet with three sheets, one for each species of penguins in the <code>penguins</code> data frame.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(openxlsx)
@ -466,7 +469,7 @@ writeDataTable(
<p>See <a href="https://ycphs.github.io/openxlsx/articles/Formatting.html" class="uri">https://ycphs.github.io/openxlsx/articles/Formatting.html</a> for an extensive discussion on further formatting functionality for data written from R to Excel with openxlsx.</p>
</section>
<section id="exercises" data-type="sect2">
<section id="spreadsheets-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
@ -595,8 +598,8 @@ Read sheets</h2>
<p>The first argument to <code><a href="https://googlesheets4.tidyverse.org/reference/range_read.html">read_sheet()</a></code> is the URL of the file to read. You can also access this file via <a href="https://pos.it/r4ds-students" class="uri">https://pos.it/r4ds-students</a>, however note that at the time of writing this book you can’t read a sheet directly from a short link.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">students &lt;- read_sheet("https://docs.google.com/spreadsheets/d/1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w/edit?usp=sharing")
#&gt; ✔ Reading from "students".
#&gt; ✔ Range 'Sheet1'.</pre>
#&gt; ✔ Reading from students.
#&gt; ✔ Range Sheet1.</pre>
</div>
<p><code><a href="https://googlesheets4.tidyverse.org/reference/range_read.html">read_sheet()</a></code> will read the file in as a tibble.</p>
<div class="cell">
@ -624,8 +627,8 @@ Read sheets</h2>
age = if_else(age == "five", "5", age),
age = parse_number(age)
)
#&gt; ✔ Reading from "students".
#&gt; ✔ Range '2:10000000'.
#&gt; ✔ Reading from students.
#&gt; ✔ Range 2:10000000.
students
#&gt; # A tibble: 6 × 5
@ -642,18 +645,19 @@ students
<p>It’s also possible to read individual sheets from Google Sheets. Let’s read the penguins Google Sheet at <a href="https://pos.it/r4ds-penguins" class="uri">https://pos.it/r4ds-penguins</a>, and specifically the “Torgersen Island” sheet in it.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">read_sheet("https://docs.google.com/spreadsheets/d/1aFu8lnD_g0yjF5O-K6SFgSEWiHPpgvFCF0NY9D6LXnY/edit?usp=sharing", sheet = "Torgersen Island")
#&gt; ✔ Reading from "penguins".
#&gt; ✔ Reading from penguins.
#&gt; ✔ Range ''Torgersen Island''.
#&gt; # A tibble: 52 × 8
#&gt; species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;list&gt; &lt;list&gt; &lt;list&gt; &lt;list&gt;
#&gt; 1 Adelie Torgers… &lt;dbl [1]&gt; &lt;dbl [1]&gt; &lt;dbl [1]&gt; &lt;dbl [1]&gt;
#&gt; 2 Adelie Torgers… &lt;dbl [1]&gt; &lt;dbl [1]&gt; &lt;dbl [1]&gt; &lt;dbl [1]&gt;
#&gt; 3 Adelie Torgers… &lt;dbl [1]&gt; &lt;dbl [1]&gt; &lt;dbl [1]&gt; &lt;dbl [1]&gt;
#&gt; 4 Adelie Torgers… &lt;chr [1]&gt; &lt;chr [1]&gt; &lt;chr [1]&gt; &lt;chr [1]&gt;
#&gt; 5 Adelie Torgers… &lt;dbl [1]&gt; &lt;dbl [1]&gt; &lt;dbl [1]&gt; &lt;dbl [1]&gt;
#&gt; 6 Adelie Torgers… &lt;dbl [1]&gt; &lt;dbl [1]&gt; &lt;dbl [1]&gt; &lt;dbl [1]&gt;
#&gt; # … with 46 more rows, and 2 more variables: sex &lt;chr&gt;, year &lt;dbl&gt;</pre>
#&gt; species island bill_length_mm bill_depth_mm flipper_length_mm
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;list&gt; &lt;list&gt; &lt;list&gt;
#&gt; 1 Adelie Torgersen &lt;dbl [1]&gt; &lt;dbl [1]&gt; &lt;dbl [1]&gt;
#&gt; 2 Adelie Torgersen &lt;dbl [1]&gt; &lt;dbl [1]&gt; &lt;dbl [1]&gt;
#&gt; 3 Adelie Torgersen &lt;dbl [1]&gt; &lt;dbl [1]&gt; &lt;dbl [1]&gt;
#&gt; 4 Adelie Torgersen &lt;chr [1]&gt; &lt;chr [1]&gt; &lt;chr [1]&gt;
#&gt; 5 Adelie Torgersen &lt;dbl [1]&gt; &lt;dbl [1]&gt; &lt;dbl [1]&gt;
#&gt; 6 Adelie Torgersen &lt;dbl [1]&gt; &lt;dbl [1]&gt; &lt;dbl [1]&gt;
#&gt; # … with 46 more rows, and 3 more variables: body_mass_g &lt;list&gt;, sex &lt;chr&gt;,
#&gt; # year &lt;dbl&gt;</pre>
</div>
<p>You can obtain a list of all sheets within a Google Sheet with <code><a href="https://googlesheets4.tidyverse.org/reference/sheet_properties.html">sheet_names()</a></code>:</p>
<div class="cell">
@ -664,19 +668,19 @@ students
<div class="cell">
<pre data-type="programlisting" data-code-language="r">deaths_url &lt;- gs4_example("deaths")
deaths &lt;- read_sheet(deaths_url, range = "A5:F15")
#&gt; ✔ Reading from "deaths".
#&gt; ✔ Range 'A5:F15'.
#&gt; ✔ Reading from deaths.
#&gt; ✔ Range A5:F15.
deaths
#&gt; # A tibble: 10 × 6
#&gt; Name Profession Age `Has kids` `Date of birth` `Date of death`
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;lgl&gt; &lt;dttm&gt; &lt;dttm&gt;
#&gt; 1 David … musician 69 TRUE 1947-01-08 00:00:00 2016-01-10 00:00:00
#&gt; 2 Carrie… actor 60 TRUE 1956-10-21 00:00:00 2016-12-27 00:00:00
#&gt; 3 Chuck … musician 90 TRUE 1926-10-18 00:00:00 2017-03-18 00:00:00
#&gt; 4 Bill P… actor 61 TRUE 1955-05-17 00:00:00 2017-02-25 00:00:00
#&gt; 5 Prince musician 57 TRUE 1958-06-07 00:00:00 2016-04-21 00:00:00
#&gt; 6 Alan R… actor 69 FALSE 1946-02-21 00:00:00 2016-01-14 00:00:00
#&gt; # … with 4 more rows</pre>
#&gt; Name Profession Age `Has kids` `Date of birth`
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;lgl&gt; &lt;dttm&gt;
#&gt; 1 David Bowie musician 69 TRUE 1947-01-08 00:00:00
#&gt; 2 Carrie Fisher actor 60 TRUE 1956-10-21 00:00:00
#&gt; 3 Chuck Berry musician 90 TRUE 1926-10-18 00:00:00
#&gt; 4 Bill Paxton actor 61 TRUE 1955-05-17 00:00:00
#&gt; 5 Prince musician 57 TRUE 1958-06-07 00:00:00
#&gt; 6 Alan Rickman actor 69 FALSE 1946-02-21 00:00:00
#&gt; # … with 4 more rows, and 1 more variable: `Date of death` &lt;dttm&gt;</pre>
</div>
</section>
@ -700,7 +704,7 @@ Authentication</h2>
<p>When you attempt to read in a sheet that requires authentication, googlesheets4 will direct you to a web browser with a prompt to sign in to your Google account and grant permission to operate on your behalf with Google Sheets. However, if you want to use a particular Google account, authentication scope, etc. you can do so with <code><a href="https://googlesheets4.tidyverse.org/reference/gs4_auth.html">gs4_auth()</a></code>, e.g. <code>gs4_auth(email = "mine@example.com")</code>, which will force the use of a token associated with a specific email. For further authentication details, we recommend reading the googlesheets4 auth vignette: <a href="https://googlesheets4.tidyverse.org/articles/auth.html" class="uri">https://googlesheets4.tidyverse.org/articles/auth.html</a>.</p>
</section>
<section id="exercises-1" data-type="sect2">
<section id="spreadsheets-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Read the <code>students</code> dataset from earlier in the chapter from Excel and also from Google Sheets, with no additional arguments supplied to the <code><a href="https://readxl.tidyverse.org/reference/read_excel.html">read_excel()</a></code> and <code><a href="https://googlesheets4.tidyverse.org/reference/range_read.html">read_sheet()</a></code> functions. Are the resulting data frames in R exactly the same? If not, how are they different?</p></li>
@ -728,7 +732,7 @@ Exercises</h2>
</ol></section>
</section>
<section id="summary" data-type="sect1">
<section id="spreadsheets-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter you learned how to read data into R from spreadsheets: from Microsoft Excel with <code><a href="https://readxl.tidyverse.org/reference/read_excel.html">read_excel()</a></code> from the readxl package and from Google Sheets with <code><a href="https://googlesheets4.tidyverse.org/reference/range_read.html">read_sheet()</a></code> from the googlesheets4 package. These functions work very similarly to each other and have similar arguments for specifying column names, NA strings, rows to skip at the top of the file you’re reading in, etc. Both functions also make it possible to read a single sheet from a spreadsheet.</p>

View File

@ -1,24 +1,15 @@
<section data-type="chapter" id="chp-strings">
<h1><span id="sec-strings" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Strings</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="strings-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>So far, you’ve used a bunch of strings without learning much about the details. Now it’s time to dive into them, learn what makes strings tick, and master some of the powerful string manipulation tools you have at your disposal.</p>
<p>We’ll begin with the details of creating strings and character vectors. You’ll then dive into creating strings from data, then the opposite: extracting strings from data. We’ll then discuss tools that work with individual letters. The chapter finishes with a brief discussion of where your expectations from English might steer you wrong when working with other languages.</p>
<p>We’ll keep working with strings in the next chapter, where you’ll learn more about the power of regular expressions.</p>
<section id="prerequisites" data-type="sect2">
<section id="strings-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<div data-type="important"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>This chapter relies on features only found in tidyr 1.3.0, which is still in development. If you want to live on the edge, you can get the dev versions with <code>devtools::install_github("tidyverse/tidyr")</code>.</p></div>
<p>In this chapter, we’ll use functions from the stringr package, which is part of the core tidyverse. We’ll also use the babynames data since it provides some fun strings to manipulate.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
@ -113,7 +104,7 @@ str_view(x)
<p>Note that <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> uses a blue background for tabs to make them easier to spot. One of the challenges of working with text is that there’s a variety of ways that white space can end up in the text, so this background helps you recognize that something strange is going on.</p>
</section>
<section id="exercises" data-type="sect2">
<section id="strings-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
@ -138,7 +129,7 @@ Creating many strings from data</h1>
<section id="str_c" data-type="sect2">
<h2>
<code>str_c()</code>
str_c()
</h2>
<p><code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> takes any number of vectors as arguments and returns a character vector:</p>
<div class="cell">
@ -151,16 +142,14 @@ str_c("Hello ", c("John", "Susan"))
</div>
<p><code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> is very similar to the base <code><a href="https://rdrr.io/r/base/paste.html">paste0()</a></code>, but is designed to be used with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> by obeying the usual tidyverse rules for recycling and propagating missing values:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">set.seed(1410)
df &lt;- tibble(name = c(wakefield::name(3), NA))
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(name = c("Flora", "David", "Terra"))
df |&gt; mutate(greeting = str_c("Hi ", name, "!"))
#&gt; # A tibble: 4 × 2
#&gt; name greeting
#&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 Ilena Hi Ilena!
#&gt; 2 Sacramento Hi Sacramento!
#&gt; 3 Graylon Hi Graylon!
#&gt; 4 &lt;NA&gt; &lt;NA&gt;</pre>
#&gt; # A tibble: 3 × 2
#&gt; name greeting
#&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 Flora Hi Flora!
#&gt; 2 David Hi David!
#&gt; 3 Terra Hi Terra!</pre>
</div>
<p>If you want missing values to display in another way, use <code><a href="https://dplyr.tidyverse.org/reference/coalesce.html">coalesce()</a></code> to replace them. Depending on what you want, you might use it either inside or outside of <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code>:</p>
<div class="cell">
@ -169,48 +158,45 @@ df |&gt; mutate(greeting = str_c("Hi ", name, "!"))
greeting1 = str_c("Hi ", coalesce(name, "you"), "!"),
greeting2 = coalesce(str_c("Hi ", name, "!"), "Hi!")
)
#&gt; # A tibble: 4 × 3
#&gt; name greeting1 greeting2
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 Ilena Hi Ilena! Hi Ilena!
#&gt; 2 Sacramento Hi Sacramento! Hi Sacramento!
#&gt; 3 Graylon Hi Graylon! Hi Graylon!
#&gt; 4 &lt;NA&gt; Hi you! Hi!</pre>
#&gt; # A tibble: 3 × 3
#&gt; name greeting1 greeting2
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 Flora Hi Flora! Hi Flora!
#&gt; 2 David Hi David! Hi David!
#&gt; 3 Terra Hi Terra! Hi Terra!</pre>
</div>
</section>
<section id="sec-glue" data-type="sect2">
<h2>
<code>str_glue()</code>
str_glue()
</h2>
<p>If you are mixing many fixed and variable strings with <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code>, you’ll notice that you type a lot of <code>"</code>s, making it hard to see the overall goal of the code. An alternative approach is provided by the <a href="https://glue.tidyverse.org">glue package</a> via <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code><span data-type="footnote">If you’re not using stringr, you can also access it directly with <code><a href="https://glue.tidyverse.org/reference/glue.html">glue::glue()</a></code>.</span>. You give it a single string that has a special feature: anything inside <code><a href="https://rdrr.io/r/base/Paren.html">{}</a></code> will be evaluated like it’s outside of the quotes:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt; mutate(greeting = str_glue("Hi {name}!"))
#&gt; # A tibble: 4 × 2
#&gt; name greeting
#&gt; &lt;chr&gt; &lt;glue&gt;
#&gt; 1 Ilena Hi Ilena!
#&gt; 2 Sacramento Hi Sacramento!
#&gt; 3 Graylon Hi Graylon!
#&gt; 4 &lt;NA&gt; Hi NA!</pre>
#&gt; # A tibble: 3 × 2
#&gt; name greeting
#&gt; &lt;chr&gt; &lt;glue&gt;
#&gt; 1 Flora Hi Flora!
#&gt; 2 David Hi David!
#&gt; 3 Terra Hi Terra!</pre>
</div>
<p>Note that <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code> currently converts missing values to the string <code>"NA"</code>, unfortunately making it inconsistent with <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code>.</p>
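<p>A minimal sketch of that difference, assuming stringr is installed and using a made-up vector that contains one missing value:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

# Illustrative data, not from the chapter's data frame.
name &lt;- c("Flora", NA)

# str_c() propagates the missing value.
str_c("Hi ", name, "!")
#&gt; [1] "Hi Flora!" NA

# str_glue() turns it into the literal string "NA".
str_glue("Hi {name}!")
#&gt; Hi Flora!
#&gt; Hi NA!</pre>
</div>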
<p>You also might wonder what happens if you need to include a regular <code>{</code> or <code>}</code> in your string. You’re on the right track if you guess you’ll need to escape it somehow. The trick is that glue uses a slightly different escaping technique; instead of prefixing with a special character like <code>\</code>, you double up the special characters:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt; mutate(greeting = str_glue("{{Hi {name}!}}"))
#&gt; # A tibble: 4 × 2
#&gt; name greeting
#&gt; &lt;chr&gt; &lt;glue&gt;
#&gt; 1 Ilena {Hi Ilena!}
#&gt; 2 Sacramento {Hi Sacramento!}
#&gt; 3 Graylon {Hi Graylon!}
#&gt; 4 &lt;NA&gt; {Hi NA!}</pre>
#&gt; # A tibble: 3 × 2
#&gt; name greeting
#&gt; &lt;chr&gt; &lt;glue&gt;
#&gt; 1 Flora {Hi Flora!}
#&gt; 2 David {Hi David!}
#&gt; 3 Terra {Hi Terra!}</pre>
</div>
</section>
<section id="str_flatten" data-type="sect2">
<h2>
<code>str_flatten()</code>
str_flatten()
</h2>
<p><code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> and <code>glue()</code> work well with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> because their output is the same length as their inputs. What if you want a function that works well with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>, i.e., something that always returns a single string? That’s the job of <code><a href="https://stringr.tidyverse.org/reference/str_flatten.html">str_flatten()</a></code><span data-type="footnote">The base R equivalent is <code><a href="https://rdrr.io/r/base/paste.html">paste()</a></code> used with the <code>collapse</code> argument.</span>: it takes a character vector and combines each element of the vector into a single string:</p>
<div class="cell">
@ -244,7 +230,7 @@ df |&gt;
</div>
</section>
<section id="exercises-1" data-type="sect2">
<section id="strings-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
@ -598,7 +584,12 @@ Long strings</h2>
<li><p><code>str_wrap(x, 30)</code> wraps a string introducing new lines so that each line is at most 30 characters (it doesn’t hyphenate, however, so any word longer than 30 characters will make a longer line)</p></li>
</ul><p>The following code shows these functions in action with a made-up string:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat."
<pre data-type="programlisting" data-code-language="r">x &lt;- paste0(
"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod ",
"tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim ",
"veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea ",
"commodo consequat."
)
str_view(str_trunc(x, 30))
#&gt; [1] │ Lorem ipsum dolor sit amet,...
@ -610,12 +601,12 @@ str_view(str_wrap(x, 30))
#&gt; │ magna aliqua. Ut enim ad
#&gt; │ minim veniam, quis nostrud
#&gt; │ exercitation ullamco laboris
#&gt; │ nisi ut aliquip ex ea commodo
#&gt; │ consequat.</pre>
</div>
</section>
<section id="exercises-2" data-type="sect2">
<section id="strings-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>Use <code><a href="https://stringr.tidyverse.org/reference/str_length.html">str_length()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_sub.html">str_sub()</a></code> to extract the middle letter from each baby name. What will you do if the string has an even number of characters?</li>
@ -734,7 +725,7 @@ str_sort(c("a", "c", "ch", "h", "z"), locale = "cs")
</section>
</section>
<section id="summary" data-type="sect1">
<section id="strings-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you’ve learned about some of the power of the stringr package: how to create, combine, and extract strings, and about some of the challenges you might face with non-English strings. Now it’s time to learn one of the most important and powerful tools for working with strings: regular expressions. Regular expressions are a very concise but very expressive language for describing patterns within strings and are the topic of the next chapter.</p>

View File

@ -1,6 +1,6 @@
<section data-type="chapter" id="chp-webscraping">
<h1><span id="sec-scraping" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Web scraping</span></span></h1><p>This chapter introduces you to the basics of web scraping with <a href="https://rvest.tidyverse.org">rvest</a>. Web scraping is a very useful tool for extracting data from web pages. Some websites will offer an API, a set of structured HTTP requests that return data as JSON, which you handle using the techniques from <a href="#chp-rectangling" data-type="xref">#chp-rectangling</a>. Where possible, you should use the API, because typically it will give you more reliable data. Unfortunately, however, programming with web APIs is out of scope for this book. Instead, we are teaching scraping, a technique that works whether or not a site provides an API.</p><p>In this chapter, we’ll first discuss the ethics and legalities of scraping before we dive into the basics of HTML. You’ll then learn the basics of CSS selectors to locate specific elements on the page, and how to use rvest functions to get data from text and attributes out of HTML and into R. We’ll then discuss some techniques to figure out what CSS selector you need for the page you’re scraping, before finishing up with a couple of case studies, and a brief discussion of dynamic websites.</p>
<section id="prerequisites" data-type="sect2">
<section id="webscraping-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter, we’ll focus on tools provided by rvest. rvest is a member of the tidyverse, but is not a core member so you’ll need to load it explicitly. We’ll also load the full tidyverse since we’ll find it generally useful working with the data we’ve scraped.</p>
@ -240,7 +240,7 @@ html |&gt;
<p><code><a href="https://rvest.tidyverse.org/reference/html_attr.html">html_attr()</a></code> always returns a string, so if you’re extracting numbers or dates, you’ll need to do some post-processing.</p>
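<p>For example, a numeric attribute comes back as text until you convert it. This is a minimal sketch on made-up HTML; the <code>data-count</code> attribute is an assumption chosen purely for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(rvest)

# Illustrative HTML, not taken from a real page.
html &lt;- minimal_html("&lt;p&gt;&lt;a href='x' data-count='42'&gt;link&lt;/a&gt;&lt;/p&gt;")

count &lt;- html |&gt;
html_element("a") |&gt;
html_attr("data-count")
count
#&gt; [1] "42"

# Convert the string to a number before doing arithmetic with it.
as.numeric(count)
#&gt; [1] 42</pre>
</div>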
</section>
<section id="tables" data-type="sect2">
<section id="webscraping-tables" data-type="sect2">
<h2>
Tables</h2>
<p>If you’re lucky, your data will be already stored in an HTML table, and it’ll be a matter of just reading it from that table. It’s usually straightforward to recognize a table in your browser: it’ll have a rectangular structure of rows and columns, and you can copy and paste it into a tool like Excel.</p>
@ -248,22 +248,10 @@ Tables</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">html &lt;- minimal_html("
&lt;table class='mytable'&gt;
&lt;tr&gt;
&lt;th&gt;x&lt;/th&gt;
&lt;th&gt;y&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1.5&lt;/td&gt;
&lt;td&gt;2.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4.9&lt;/td&gt;
&lt;td&gt;1.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7.2&lt;/td&gt;
&lt;td&gt;8.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;th&gt;x&lt;/th&gt; &lt;th&gt;y&lt;/th&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1.5&lt;/td&gt; &lt;td&gt;2.7&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;4.9&lt;/td&gt; &lt;td&gt;1.3&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;7.2&lt;/td&gt; &lt;td&gt;8.1&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
")</pre>
</div>
@ -374,7 +362,6 @@ section |&gt; html_element(".director") |&gt; html_text2()
IMDB top films</h2>
<p>For our next task we’ll tackle something a little trickier, extracting the top 250 movies from the Internet Movie Database (IMDb). At the time we wrote this chapter, the page looked like <a href="#fig-scraping-imdb" data-type="xref">#fig-scraping-imdb</a>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">knitr::include_graphics("screenshots/scraping-imdb.png", dpi = 300)</pre>
<div class="cell-output-display">
<figure id="fig-scraping-imdb"><p><img src="screenshots/scraping-imdb.png" alt="The screenshot shows a table with columns &quot;Rank and Title&quot;, &quot;IMDb Rating&quot;, and &quot;Your Rating&quot;. 9 movies out of the top 250 are shown. The top 5 are the Shawshank Redemption, The Godfather, The Dark Knight, The Godfather: Part II, and 12 Angry Men." width="418"/></p>
@ -392,14 +379,14 @@ table &lt;- html |&gt;
html_table()
table
#&gt; # A tibble: 250 × 5
#&gt; `` `Rank &amp; Title` `IMDb Rating` `Your Rating` ``
#&gt; &lt;lgl&gt; &lt;chr&gt; &lt;dbl&gt; &lt;chr&gt; &lt;lgl&gt;
#&gt; 1 NA "1.\n The Shawshank Redemptio… 9.2 "12345678910… NA
#&gt; 2 NA "2.\n The Godfather\n … 9.2 "12345678910… NA
#&gt; 3 NA "3.\n The Dark Knight\n … 9 "12345678910… NA
#&gt; 4 NA "4.\n The Godfather: Part II\… 9 "12345678910… NA
#&gt; 5 NA "5.\n 12 Angry Men\n (… 9 "12345678910… NA
#&gt; 6 NA "6.\n Schindler's List\n … 8.9 "12345678910… NA
#&gt; `` `Rank &amp; Title` `IMDb Rating` `Your Rating` ``
#&gt; &lt;lgl&gt; &lt;chr&gt; &lt;dbl&gt; &lt;chr&gt; &lt;lgl&gt;
#&gt; 1 NA "1.\n The Shawshank Redempt… 9.2 "12345678910\n… NA
#&gt; 2 NA "2.\n The Godfather\n … 9.2 "12345678910\n… NA
#&gt; 3 NA "3.\n The Dark Knight\n … 9 "12345678910\n… NA
#&gt; 4 NA "4.\n The Godfather: Part I… 9 "12345678910\n… NA
#&gt; 5 NA "5.\n 12 Angry Men\n … 9 "12345678910\n… NA
#&gt; 6 NA "6.\n Schindler's List\n … 8.9 "12345678910\n… NA
#&gt; # … with 244 more rows</pre>
</div>
<p>This includes a few empty columns, but overall does a good job of capturing the information from the table. However, we need to do some more processing to make it easier to use. First, we’ll rename the columns to be easier to work with, and remove the extraneous whitespace in rank and title. We will do this with <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> (instead of <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code>) to do the renaming and selecting of just these two columns in one step. Then, we’ll apply <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code> (from <a href="#sec-extract-variables" data-type="xref">#sec-extract-variables</a>) to pull out the title, year, and rank into their own variables.</p>
@ -438,12 +425,12 @@ ratings
html_elements("td strong") |&gt;
head() |&gt;
html_attr("title")
#&gt; [1] "9.2 based on 2,684,096 user ratings"
#&gt; [2] "9.2 based on 1,861,107 user ratings"
#&gt; [3] "9.0 based on 2,657,484 user ratings"
#&gt; [4] "9.0 based on 1,273,669 user ratings"
#&gt; [5] "9.0 based on 792,941 user ratings"
#&gt; [6] "8.9 based on 1,357,901 user ratings"</pre>
#&gt; [1] "9.2 based on 2,691,480 user ratings"
#&gt; [2] "9.2 based on 1,867,146 user ratings"
#&gt; [3] "9.0 based on 2,665,189 user ratings"
#&gt; [4] "9.0 based on 1,276,943 user ratings"
#&gt; [5] "9.0 based on 795,129 user ratings"
#&gt; [6] "8.9 based on 1,361,148 user ratings"</pre>
</div>
<p>We can combine this with the tabular data and again apply <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code> to extract out the bit of data we care about:</p>
<div class="cell">
@ -465,12 +452,12 @@ ratings
#&gt; # A tibble: 250 × 5
#&gt; rank title year rating number
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 1 The Shawshank Redemption 1994 9.2 2684096
#&gt; 2 2 The Godfather 1972 9.2 1861107
#&gt; 3 3 The Dark Knight 2008 9 2657484
#&gt; 4 4 The Godfather: Part II 1974 9 1273669
#&gt; 5 5 12 Angry Men 1957 9 792941
#&gt; 6 6 Schindler's List 1993 8.9 1357901
#&gt; 1 1 The Shawshank Redemption 1994 9.2 2691480
#&gt; 2 2 The Godfather 1972 9.2 1867146
#&gt; 3 3 The Dark Knight 2008 9 2665189
#&gt; 4 4 The Godfather: Part II 1974 9 1276943
#&gt; 5 5 12 Angry Men 1957 9 795129
#&gt; 6 6 Schindler's List 1993 8.9 1361148
#&gt; # … with 244 more rows</pre>
</div>
</section>
@ -483,7 +470,7 @@ Dynamic sites</h1>
<p>It's still possible to scrape these types of sites, but rvest needs to use a more expensive process: fully simulating the web browser including running all JavaScript. This functionality is not available at the time of writing, but it's something we're actively working on and should be available by the time you read this. It uses the <a href="https://rstudio.github.io/chromote/index.html">chromote package</a> which actually runs the Chrome browser in the background, and gives you additional tools to interact with the site, like a human typing text and clicking buttons. Check out the rvest website for more details.</p>
</section>
<section id="summary" data-type="sect1">
<section id="webscraping-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you've learned about the why, the why not, and the how of scraping data from web pages. First, you've learned about the basics of HTML and using CSS selectors to refer to specific elements, then you've learned about using the rvest package to get data out of HTML into R. We then demonstrated web scraping with two case studies: a simpler scenario on scraping data on Star Wars films from the rvest package website and a more complex scenario on scraping the top 250 films from IMDB.</p>

View File

@ -119,7 +119,7 @@ Calling functions</h1>
</div>
</section>
<section id="exercises" data-type="sect1">
<section id="workflow-basics-exercises" data-type="sect1">
<h1>
Exercises</h1>
<ol type="1"><li>
@ -153,7 +153,7 @@ ggsave(filename = "mpg-plot.png", plot = my_bar_plot)</pre>
</li>
</ol></section>
<section id="summary" data-type="sect1">
<section id="workflow-basics-summary" data-type="sect1">
<h1>
Summary</h1>
<p>You've now learned a little more about how R code works, and picked up some tips to help you understand your code when you come back to it in the future. In the next chapter, we'll continue your data science journey by teaching you about dplyr, the tidyverse package that helps you transform data, whether it's selecting important variables, filtering down to rows of interest, or computing summary statistics.</p>

View File

@ -62,7 +62,7 @@ Investing in yourself</h1>
<p>If you're an active Twitter user, you might also want to follow Hadley (<a href="https://twitter.com/hadleywickham">@hadleywickham</a>), Mine (<a href="https://twitter.com/minebocek">@minebocek</a>), Garrett (<a href="https://twitter.com/statgarrett">@statgarrett</a>), or follow <a href="https://twitter.com/rstudiotips">@rstudiotips</a> to keep up with new features in the IDE. If you want the full fire hose of new developments, you can also read the (<a href="https://twitter.com/search?q=%23rstats"><code>#rstats</code></a>) hashtag. This is one of the key tools that Hadley and Mine use to keep up with new developments in the community.</p>
</section>
<section id="summary" data-type="sect1">
<section id="workflow-help-summary" data-type="sect1">
<h1>
Summary</h1>
<p>This chapter concludes the Whole Game part of the book. You've now seen the most important parts of the data science process: visualization, transformation, tidying and importing. Now you've got a holistic view of the whole process, and we start to get into the details of small pieces.</p>

View File

@ -50,7 +50,7 @@ flights3 &lt;- summarize(flight2,
<section id="magrittr-and-the-pipe" data-type="sect1">
<h1>
magrittr and the<code>%&gt;%</code> pipe</h1>
magrittr and the %&gt;% pipe</h1>
<p>If you've been using the tidyverse for a while, you might be familiar with the <code>%&gt;%</code> pipe provided by the <strong>magrittr</strong> package. The magrittr package is included in the core tidyverse, so you can use <code>%&gt;%</code> whenever you load the tidyverse:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
@ -70,7 +70,7 @@ mtcars %&gt;%
<section id="vs." data-type="sect1">
<h1>
<code>|&gt;</code> vs. <code>%&gt;%</code>
|&gt; vs. %&gt;%
</h1>
<p>While <code>|&gt;</code> and <code>%&gt;%</code> behave identically for simple cases, there are a few crucial differences. These are most likely to affect you if you're a long-term user of <code>%&gt;%</code> who has taken advantage of some of the more advanced features. But they're still good to know about even if you've never used <code>%&gt;%</code> because you're likely to encounter some of them when reading wild-caught code.</p>
<ul><li><p>By default, the pipe passes the object on its left-hand side to the first argument of the function on the right-hand side. <code>%&gt;%</code> allows you to change the placement with a <code>.</code> placeholder. For example, <code>x %&gt;% f(1)</code> is equivalent to <code>f(x, 1)</code> but <code>x %&gt;% f(1, .)</code> is equivalent to <code>f(1, x)</code>. R 4.2.0 added a <code>_</code> placeholder to the base pipe, with one additional restriction: the argument has to be named. For example, <code>x |&gt; f(1, y = _)</code> is equivalent to <code>f(1, y = x)</code>.</p></li>
@ -89,7 +89,7 @@ mtcars %&gt;%
<section id="vs" data-type="sect1">
<h1>
<code>|&gt;</code> vs <code>+</code>
|&gt; vs +
</h1>
<p>Sometimes we'll turn the end of a data transformation pipeline into a plot. Watch for the transition from <code>|&gt;</code> to <code>+</code>. We wish this transition wasn't necessary, but unfortunately, ggplot2 was created before the pipe was discovered.</p>
<div class="cell">
@ -100,7 +100,7 @@ mtcars %&gt;%
</div>
</section>
<section id="summary" data-type="sect1">
<section id="workflow-pipes-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you've learned more about the pipe: why we recommend it and some of the history that led to <code>|&gt;</code>. The pipe is important because you'll use it again and again throughout your analysis, but hopefully, it will quickly become invisible, and your fingers will type it (or use the keyboard shortcut) without your brain having to think too much about it.</p>

View File

@ -116,7 +116,12 @@ What is the source of truth?</h2>
</ol><p>We collectively use this pattern hundreds of times a week.</p>
<div data-type="note"><h1>
RStudio server
</h1><p>If you're using RStudio server, your R session is never restarted by default. When you close your RStudio server tab, it might feel like you're closing R, but the server actually keeps it running in the background. The next time you return, you'll be in exactly the same place you left. This makes it even more important to regularly restart R so that you're starting with a fresh slate.</p></div>
</h1>
<p>If you're using RStudio server, your R session is never restarted by default. When you close your RStudio server tab, it might feel like you're closing R, but the server actually keeps it running in the background. The next time you return, you'll be in exactly the same place you left. This makes it even more important to regularly restart R so that you're starting with a fresh slate.</p>
</div>
</section>
@ -196,28 +201,21 @@ Relative and absolute paths</h2>
</section>
</section>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>In summary, scripts and projects give you a solid workflow that will serve you well in the future:</p>
<ul><li>Create one RStudio project for each data analysis project.</li>
<li>Save your scripts (with informative names) in the project, edit them, run them in bits or as a whole. Restart R frequently to make sure you've captured everything in your scripts.</li>
<li>Only ever use relative paths, not absolute paths.</li>
</ul><p>Then everything you need is in one place and cleanly separated from all the other projects that you are working on.</p>
</section>
<section id="exercises" data-type="sect1">
<section id="workflow-scripts-exercises" data-type="sect1">
<h1>
Exercises</h1>
<ol type="1"><li><p>Go to the RStudio Tips Twitter account, <a href="https://twitter.com/rstudiotips" class="uri">https://twitter.com/rstudiotips</a> and find one tip that looks interesting. Practice using it!</p></li>
<li><p>What other common mistakes will RStudio diagnostics report? Read <a href="https://support.posit.co/hc/en-us/articles/205753617-Code-Diagnostics" class="uri">https://support.posit.co/hc/en-us/articles/205753617-Code-Diagnostics</a> to find out.</p></li>
</ol></section>
<section id="summary-1" data-type="sect1">
<section id="workflow-scripts-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you've learned how to organize your R code in scripts (files) and projects (directories). Much like code style, this may feel like busywork at first. But as you accumulate more code across multiple projects, you'll learn to appreciate how a little up front organisation can save you a bunch of time down the road.</p>
<p>Next up, you'll learn about how to get help and how to ask good coding questions.</p>
<p>In summary, scripts and projects give you a solid workflow that will serve you well in the future:</p>
<ul><li>Create one RStudio project for each data analysis project.</li>
<li>Save your scripts (with informative names) in the project, edit them, run them in bits or as a whole. Restart R frequently to make sure you've captured everything in your scripts.</li>
<li>Only ever use relative paths, not absolute paths.</li>
</ul><p>Then everything you need is in one place and cleanly separated from all the other projects that you are working on. Next up, you'll learn about how to get help and how to ask good coding questions.</p>
</section>

View File

@ -153,7 +153,7 @@ ggplot2</h1>
span = 0.5,
se = FALSE,
color = "white",
size = 4
linewidth = 4
) +
geom_point()</pre>
</div>
@ -179,7 +179,7 @@ Sectioning comments</h1>
</div>
</section>
<section id="exercises" data-type="sect1">
<section id="workflow-style-exercises" data-type="sect1">
<h1>
Exercises</h1>
<ol type="1"><li>
@ -192,7 +192,7 @@ flights|&gt;filter(carrier=="UA",dest%in%c("IAH","HOU"),sched_dep_time&gt;0900,s
</li>
</ol></section>
<section id="summary" data-type="sect1">
<section id="workflow-style-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you've learned the most important principles of code style. These may feel like a set of arbitrary rules to start with (because they are!) but over time, as you write more code, and share code with more people, you'll see how important a consistent style is. And don't forget about the styler package: it's a great way to quickly improve the quality of poorly styled code.</p>

View File

@ -335,16 +335,6 @@ Mac and Linux uses slashes (e.g. `plots/diamonds.pdf`) and Windows uses backslas
R can work with either type (no matter what platform you're currently using), but unfortunately, backslashes mean something special to R, and to get a single backslash in the path, you need to type two backslashes!
That makes life frustrating, so we recommend always using the Linux/Mac style with forward slashes.
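The backslash rule above can be sketched in a few lines of R. This is only an illustration: it reuses the example path `plots/diamonds.pdf` from the text, the variable names `path_forward` and `path_back` are made up for the sketch, and no such file needs to exist.

```r
# Forward slashes work on every platform, including Windows.
path_forward <- "plots/diamonds.pdf"

# Inside an R string, a single backslash must be typed as two backslashes.
path_back <- "plots\\diamonds.pdf"

# writeLines() prints the string as R stores it, with one backslash.
writeLines(path_back)

# Both strings have the same number of characters, because the doubled
# backslash in the source code stands for a single character.
nchar(path_back) == nchar(path_forward)
```

Printing `path_back` with `print()` instead would show the doubled backslash again, because `print()` displays the escaped form; `writeLines()` shows the raw contents.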
## Summary
In summary, scripts and projects give you a solid workflow that will serve you well in the future:
- Create one RStudio project for each data analysis project.
- Save your scripts (with informative names) in the project, edit them, run them in bits or as a whole. Restart R frequently to make sure you've captured everything in your scripts.
- Only ever use relative paths, not absolute paths.
Then everything you need is in one place and cleanly separated from all the other projects that you are working on.
## Exercises
1. Go to the RStudio Tips Twitter account, <https://twitter.com/rstudiotips> and find one tip that looks interesting.
@ -355,8 +345,11 @@ Then everything you need is in one place and cleanly separated from all the othe
## Summary
In this chapter, you've learned how to organize your R code in scripts (files) and projects (directories).
Much like code style, this may feel like busywork at first.
But as you accumulate more code across multiple projects, you'll learn to appreciate how a little up front organisation can save you a bunch of time down the road.
In summary, scripts and projects give you a solid workflow that will serve you well in the future:
- Create one RStudio project for each data analysis project.
- Save your scripts (with informative names) in the project, edit them, run them in bits or as a whole. Restart R frequently to make sure you've captured everything in your scripts.
- Only ever use relative paths, not absolute paths.
Then everything you need is in one place and cleanly separated from all the other projects that you are working on.
Next up, you'll learn about how to get help and how to ask good coding questions.