<section data-type="chapter" id="chp-missing-values">
<h1><span id="sec-missing-values" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Missing values</span></span></h1>
<section id="missing-values-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>You’ve already learned the basics of missing values earlier in the book. You first saw them in <a href="#chp-data-visualize" data-type="xref">#chp-data-visualize</a>, where they resulted in a warning when making a plot, as well as in <a href="#sec-summarize" data-type="xref">#sec-summarize</a>, where they interfered with computing summary statistics, and you learned about their infectious nature and how to check for their presence in <a href="#sec-na-comparison" data-type="xref">#sec-na-comparison</a>. Now we’ll come back to them in more depth, so you can learn more of the details.</p>
<p>We’ll start by discussing some general tools for working with missing values recorded as <code>NA</code>s. We’ll then explore the idea of implicitly missing values, values that are simply absent from your data, and show some tools you can use to make them explicit. We’ll finish off with a related discussion of empty groups, caused by factor levels that don’t appear in the data.</p>
<section id="missing-values-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>The functions for working with missing data mostly come from dplyr and tidyr, which are core members of the tidyverse.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)</pre>
</div>
</section>
</section>
<section id="explicit-missing-values" data-type="sect1">
<h1>
Explicit missing values</h1>
<p>To begin, let’s explore a few handy tools for creating or eliminating explicit missing values, i.e., cells where you see an <code>NA</code>.</p>
<section id="last-observation-carried-forward" data-type="sect2">
<h2>
Last observation carried forward</h2>
<p>A common use for missing values is as a data entry convenience. When data is entered by hand, missing values sometimes indicate that the value in the previous row has been repeated (or carried forward):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">treatment &lt;- tribble(
  ~person,           ~treatment, ~response,
  "Derrick Whitmore", 1,         7,
  NA,                 2,         10,
  NA,                 3,         NA,
  "Katherine Burke",  1,         4
)</pre>
</div>
<p>You can fill in these missing values with <code><a href="https://tidyr.tidyverse.org/reference/fill.html">tidyr::fill()</a></code>. It works like <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, taking a set of columns:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">treatment |&gt;
  fill(everything())
#&gt; # A tibble: 4 × 3
#&gt; person treatment response
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 Derrick Whitmore 1 7
#&gt; 2 Derrick Whitmore 2 10
#&gt; 3 Derrick Whitmore 3 10
#&gt; 4 Katherine Burke 1 4</pre>
</div>
<p>This treatment is sometimes called “last observation carried forward”, or <strong>locf</strong> for short. You can use the <code>.direction</code> argument to fill in missing values that have been generated in more exotic ways.</p>
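<p>As a minimal sketch of what <code>.direction</code> can do, suppose (hypothetically) that each person’s name had been recorded on the last row of their block rather than the first. The <code>treatment_up</code> data below is invented for illustration; <code>.direction = "up"</code> fills each <code>NA</code> from the next non-missing value below it:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

# Hypothetical data: the name appears on the LAST row of each block
treatment_up &lt;- tribble(
  ~person,            ~treatment,
  NA,                 1,
  NA,                 2,
  "Derrick Whitmore", 3,
  "Katherine Burke",  1
)

# Fill upward, so each NA takes the next non-missing value below it
filled &lt;- treatment_up |&gt;
  fill(person, .direction = "up")

filled$person
#&gt; [1] "Derrick Whitmore" "Derrick Whitmore" "Derrick Whitmore"
#&gt; [4] "Katherine Burke"</pre>
</div>
<p>The other accepted values are <code>"down"</code> (the default), <code>"downup"</code>, and <code>"updown"</code>; the last two fill in one direction first and then the other.</p>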
</section>
<section id="fixed-values" data-type="sect2">
<h2>
Fixed values</h2>
<p>Sometimes missing values represent some fixed and known value, most commonly 0. You can use <code><a href="https://dplyr.tidyverse.org/reference/coalesce.html">dplyr::coalesce()</a></code> to replace them:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(1, 4, 5, 7, NA)
coalesce(x, 0)
#&gt; [1] 1 4 5 7 0</pre>
</div>
<p>Sometimes you’ll hit the opposite problem, where some concrete value actually represents a missing value. This typically arises in data generated by older software that doesn’t have a proper way to represent missing values, so it must instead use some special value like 99 or -999.</p>
<p>If possible, handle this when reading in the data, for example, by using the <code>na</code> argument to <code><a href="https://readr.tidyverse.org/reference/read_delim.html">readr::read_csv()</a></code>. If you discover the problem later, or your data source doesn’t provide a way to handle it on read, you can use <code><a href="https://dplyr.tidyverse.org/reference/na_if.html">dplyr::na_if()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(1, 4, 5, 7, -99)
na_if(x, -99)
#&gt; [1] 1 4 5 7 NA</pre>
</div>
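<p>For comparison, here is a rough sketch of the read-time approach using the <code>na</code> argument. The CSV text and its column names are made up for illustration, and it is passed inline with <code>I()</code> so the example needs no file on disk:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

# Hypothetical CSV text in which -99 stands for "missing"
csv_text &lt;- I("id,score\n1,7\n2,-99\n3,4\n")

# Treat empty strings, "NA", and "-99" as missing while parsing
scores &lt;- read_csv(
  csv_text,
  na = c("", "NA", "-99"),
  show_col_types = FALSE
)

scores$score
#&gt; [1]  7 NA  4</pre>
</div>
<p>Note that <code>na</code> applies to every column, so this approach is only appropriate when the sentinel value can’t occur as a legitimate value anywhere in the file.</p>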
</section>
<section id="nan" data-type="sect2">
<h2>
NaN</h2>
<p>Before we continue, there’s one special type of missing value that you’ll encounter from time to time: a <code>NaN</code> (pronounced “nan”), or <strong>n</strong>ot <strong>a</strong> <strong>n</strong>umber. It’s not that important to know about because it generally behaves just like <code>NA</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(NA, NaN)
x * 10
#&gt; [1] NA NaN
x == 1
#&gt; [1] NA NA
is.na(x)
#&gt; [1] TRUE TRUE</pre>
</div>
<p>In the rare case you need to distinguish an <code>NA</code> from a <code>NaN</code>, you can use <code>is.nan(x)</code>.</p>
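<p>A short sketch of that distinction: <code>is.na()</code> is <code>TRUE</code> for both kinds of missing value, while <code>is.nan()</code> is <code>TRUE</code> only for <code>NaN</code>, so combining the two isolates the plain <code>NA</code>s:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(NA, NaN, 1)

is.na(x)
#&gt; [1]  TRUE  TRUE FALSE
is.nan(x)
#&gt; [1] FALSE  TRUE FALSE

# Missing, but not NaN: the "ordinary" NAs
only_na &lt;- is.na(x) &amp; !is.nan(x)
only_na
#&gt; [1]  TRUE FALSE FALSE</pre>
</div>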
<p>You’ll generally encounter a <code>NaN</code> when you perform a mathematical operation that has an indeterminate result:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">0 / 0
#&gt; [1] NaN
0 * Inf
#&gt; [1] NaN
Inf - Inf
#&gt; [1] NaN
sqrt(-1)
#&gt; Warning in sqrt(-1): NaNs produced
#&gt; [1] NaN</pre>
</div>
</section>
</section>
<section id="sec-missing-implicit" data-type="sect1">
<h1>
Implicit missing values</h1>
<p>So far we’ve talked about missing values that are <strong>explicitly</strong> missing, i.e., you can see an <code>NA</code> in your data. But missing values can also be <strong>implicitly</strong> missing, if an entire row of data is simply absent from the data. Let’s illustrate the difference with a simple dataset that records the price of some stock each quarter:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">stocks &lt;- tibble(
  year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
  qtr   = c(   1,    2,    3,    4,    2,    3,    4),
  price = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)</pre>
</div>
<p>This dataset has two missing observations:</p>
<ul><li><p>The <code>price</code> in the fourth quarter of 2020 is explicitly missing, because its value is <code>NA</code>.</p></li>
<li><p>The <code>price</code> for the first quarter of 2021 is implicitly missing, because it simply does not appear in the dataset.</p></li>
</ul><p>One way to think about the difference is with this Zen-like koan:</p>
<blockquote class="blockquote">
<p>An explicit missing value is the presence of an absence.<br/></p>
<p>An implicit missing value is the absence of a presence.</p>
</blockquote>
<p>Sometimes you want to make implicit missings explicit in order to have something physical to work with. In other cases, explicit missings are forced upon you by the structure of the data and you want to get rid of them. The following sections discuss some tools for moving between implicit and explicit missingness.</p>
<section id="pivoting" data-type="sect2">
<h2>
Pivoting</h2>
<p>You’ve already seen one tool that can make implicit missings explicit and vice versa: pivoting. Making data wider can make implicit missing values explicit because every combination of the rows and new columns must have some value. For example, if we pivot <code>stocks</code> to put <code>qtr</code> in the columns, both missing values become explicit:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">stocks |&gt;
  pivot_wider(
    names_from = qtr,
    values_from = price
  )
#&gt; # A tibble: 2 × 5
#&gt; year `1` `2` `3` `4`
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2020 1.88 0.59 0.35 NA
#&gt; 2 2021 NA 0.92 0.17 2.66</pre>
</div>
<p>By default, making data longer preserves explicit missing values, but if they are structurally missing values that only exist because the data is not tidy, you can drop them (make them implicit) by setting <code>values_drop_na = TRUE</code>. See the examples in <a href="#sec-tidy-data" data-type="xref">#sec-tidy-data</a> for more details.</p>
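<p>To make the reverse direction concrete, here is a small sketch that widens <code>stocks</code> and then lengthens it again with <code>values_drop_na = TRUE</code>. The intermediate name <code>stocks_wide</code> is just for illustration. Both <code>NA</code> cells disappear, so the explicit missing value for the fourth quarter of 2020 becomes implicit:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

stocks &lt;- tibble(
  year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
  qtr   = c(   1,    2,    3,    4,    2,    3,    4),
  price = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# Widening creates one cell per year/quarter: 8 cells, 2 of them NA
stocks_wide &lt;- stocks |&gt;
  pivot_wider(names_from = qtr, values_from = price)

# Lengthening with values_drop_na = TRUE discards the NA cells
stocks_long &lt;- stocks_wide |&gt;
  pivot_longer(
    cols = !year,
    names_to = "qtr",
    values_to = "price",
    values_drop_na = TRUE
  )

nrow(stocks_long)
#&gt; [1] 6</pre>
</div>
<p>Note that <code>qtr</code> comes back as a character column here, because it is rebuilt from column names.</p>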
</section>
<section id="complete" data-type="sect2">
<h2>
Complete</h2>
<p><code><a href="https://tidyr.tidyverse.org/reference/complete.html">tidyr::complete()</a></code> allows you to generate explicit missing values by providing a set of variables that define the combination of rows that should exist. For example, we know that all combinations of <code>year</code> and <code>qtr</code> should exist in the <code>stocks</code> data:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">stocks |&gt;
  complete(year, qtr)
#&gt; # A tibble: 8 × 3
#&gt; year qtr price
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2020 1 1.88
#&gt; 2 2020 2 0.59
#&gt; 3 2020 3 0.35
#&gt; 4 2020 4 NA
#&gt; 5 2021 1 NA
#&gt; 6 2021 2 0.92
#&gt; # … with 2 more rows</pre>
</div>
<p>Typically, you’ll call <code><a href="https://tidyr.tidyverse.org/reference/complete.html">complete()</a></code> with names of existing variables, filling in the missing combinations. However, sometimes the individual variables are themselves incomplete, so you can instead provide your own data. For example, you might know that the <code>stocks</code> dataset is supposed to run from 2019 to 2021, so you could explicitly supply those values for <code>year</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">stocks |&gt;
  complete(year = 2019:2021, qtr)
#&gt; # A tibble: 12 × 3
#&gt; year qtr price
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2019 1 NA
#&gt; 2 2019 2 NA
#&gt; 3 2019 3 NA
#&gt; 4 2019 4 NA
#&gt; 5 2020 1 1.88
#&gt; 6 2020 2 0.59
#&gt; # … with 6 more rows</pre>
</div>
<p>If the range of a variable is correct, but not all values are present, you could use <code>full_seq(x, 1)</code> to generate all values from <code>min(x)</code> to <code>max(x)</code> spaced out by 1.</p>
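<p>As a sketch of how that might look, the hypothetical <code>yearly</code> data below spans 2019 to 2022 but has no row for 2020. Passing <code>full_seq(year, 1)</code> to <code>complete()</code> adds the missing year:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

# Hypothetical data: the range is right, but 2020 is absent
yearly &lt;- tibble(
  year  = c(2019, 2021, 2022),
  sales = c(10, 12, 15)
)

full_seq(yearly$year, 1)
#&gt; [1] 2019 2020 2021 2022

# Use the full sequence as the set of years that should exist
yearly_full &lt;- yearly |&gt;
  complete(year = full_seq(year, 1))

yearly_full$sales
#&gt; [1] 10 NA 12 15</pre>
</div>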
<p>In some cases, the complete set of observations can’t be generated by a simple combination of variables. In that case, you can do manually what <code><a href="https://tidyr.tidyverse.org/reference/complete.html">complete()</a></code> does for you: create a data frame that contains all the rows that should exist (using whatever combination of techniques you need), then combine it with your original dataset with <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">dplyr::full_join()</a></code>.</p>
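<p>A minimal sketch of the manual approach, using <code>stocks</code> for familiarity even though <code>complete()</code> would handle this case directly. The name <code>all_quarters</code> is illustrative; in a real problem you would build it with whatever logic describes the rows that ought to exist:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

stocks &lt;- tibble(
  year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
  qtr   = c(   1,    2,    3,    4,    2,    3,    4),
  price = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# Step 1: a data frame with every row that should exist
all_quarters &lt;- expand_grid(
  year = c(2020, 2021),
  qtr  = c(1, 2, 3, 4)
)

# Step 2: join the real data onto it; unmatched rows get NA for price
stocks_full &lt;- all_quarters |&gt;
  full_join(stocks, by = join_by(year, qtr))

nrow(stocks_full)
#&gt; [1] 8
sum(is.na(stocks_full$price))
#&gt; [1] 2</pre>
</div>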
</section>
<section id="missing-values-joins" data-type="sect2">
<h2>
Joins</h2>
<p>This brings us to another important way of revealing implicitly missing observations: joins. You’ll learn more about joins in <a href="#chp-joins" data-type="xref">#chp-joins</a>, but we wanted to quickly mention them to you here since you can often only know that values are missing from one dataset when you compare it to another.</p>
<p><code>dplyr::anti_join(x, y)</code> is a particularly useful tool here because it selects only the rows in <code>x</code> that don’t have a match in <code>y</code>. For example, we can use two <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">anti_join()</a></code>s to reveal that we’re missing information for four airports and 722 planes mentioned in <code>flights</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(nycflights13)

flights |&gt;
  distinct(faa = dest) |&gt;
  anti_join(airports)
#&gt; Joining with `by = join_by(faa)`
#&gt; # A tibble: 4 × 1
#&gt; faa
#&gt; &lt;chr&gt;
#&gt; 1 BQN
#&gt; 2 SJU
#&gt; 3 STT
#&gt; 4 PSE
flights |&gt;
  distinct(tailnum) |&gt;
  anti_join(planes)
#&gt; Joining with `by = join_by(tailnum)`
#&gt; # A tibble: 722 × 1
#&gt; tailnum
#&gt; &lt;chr&gt;
#&gt; 1 N3ALAA
#&gt; 2 N3DUAA
#&gt; 3 N542MQ
#&gt; 4 N730MQ
#&gt; 5 N9EAMQ
#&gt; 6 N532UA
#&gt; # … with 716 more rows</pre>
</div>
</section>
<section id="missing-values-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>Can you find any relationship between the carrier and the rows that appear to be missing from <code>planes</code>?</li>
</ol></section>
</section>
<section id="factors-and-empty-groups" data-type="sect1">
<h1>
Factors and empty groups</h1>
<p>A final type of missingness is the empty group, a group that doesn’t contain any observations, which can arise when working with factors. For example, imagine we have a dataset that contains some health information about people:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">health &lt;- tibble(
  name   = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
  smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
  age    = c(34L, 88L, 75L, 47L, 56L),
)</pre>
</div>
<p>And we want to count the number of smokers with <code><a href="https://dplyr.tidyverse.org/reference/count.html">dplyr::count()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">health |&gt; count(smoker)
#&gt; # A tibble: 1 × 2
#&gt; smoker n
#&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 no 5</pre>
</div>
<p>This dataset only contains non-smokers, but we know that smokers exist; the group of smokers is empty. We can ask <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> to keep all the groups, even those not seen in the data, by using <code>.drop = FALSE</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">health |&gt; count(smoker, .drop = FALSE)
#&gt; # A tibble: 2 × 2
#&gt; smoker n
#&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 yes 0
#&gt; 2 no 5</pre>
</div>
<p>The same principle applies to ggplot2’s discrete axes, which will also drop levels that don’t have any values. You can force them to display by supplying <code>drop = FALSE</code> to the appropriate discrete axis:</p>
<div>
<pre data-type="programlisting" data-code-language="r">ggplot(health, aes(x = smoker)) +
  geom_bar() +
  scale_x_discrete()

ggplot(health, aes(x = smoker)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="missing-values_files/figure-html/unnamed-chunk-17-1.png" class="img-fluid" alt="A bar chart with a single value on the x-axis, &quot;no&quot;." width="288"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="missing-values_files/figure-html/unnamed-chunk-17-2.png" class="img-fluid" alt="The same bar chart as the last plot, but now with two values on the x-axis, &quot;yes&quot; and &quot;no&quot;. There is no bar for the &quot;yes&quot; category." width="288"/></p>
</div>
</div>
</div>
</div>
<p>The same problem comes up more generally with <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">dplyr::group_by()</a></code>. And again you can use <code>.drop = FALSE</code> to preserve all factor levels:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">health |&gt;
  group_by(smoker, .drop = FALSE) |&gt;
  summarize(
    n = n(),
    mean_age = mean(age),
    min_age = min(age),
    max_age = max(age),
    sd_age = sd(age)
  )
#&gt; Warning: There were 2 warnings in `summarize()`.
#&gt; The first warning was:
#&gt; In argument: `min_age = min(age)`.
#&gt; In group 1: `smoker = yes`.
#&gt; Caused by warning in `min()`:
#&gt; ! no non-missing arguments to min; returning Inf
#&gt; Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
#&gt; # A tibble: 2 × 6
#&gt; smoker n mean_age min_age max_age sd_age
#&gt; &lt;fct&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 yes 0 NaN Inf -Inf NA
#&gt; 2 no 5 60 34 88 21.6</pre>
</div>
<p>We get some interesting results here because when summarizing an empty group, the summary functions are applied to zero-length vectors. There’s an important distinction between empty vectors, which have length 0, and missing values, each of which has length 1.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># A vector containing two missing values
x1 &lt;- c(NA, NA)
length(x1)
#&gt; [1] 2
# A vector containing nothing
x2 &lt;- numeric()
length(x2)
#&gt; [1] 0</pre>
</div>
<p>All summary functions work with zero-length vectors, but they may return results that are surprising at first glance. Here we see <code>mean(age)</code> returning <code>NaN</code> because <code>mean(age)</code> = <code>sum(age)/length(age)</code>, which here is 0/0. <code><a href="https://rdrr.io/r/base/Extremes.html">max()</a></code> and <code><a href="https://rdrr.io/r/base/Extremes.html">min()</a></code> return -Inf and Inf, respectively, for empty vectors, so if you combine the results with a non-empty vector of new data and recompute, you’ll get the minimum or maximum of the new data<span data-type="footnote">In other words, <code>min(c(x, y))</code> is always equal to <code>min(min(x), min(y))</code>.</span>.</p>
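<p>The behavior described above can be checked directly on an empty vector. In this sketch, <code>x</code> and <code>y</code> are arbitrary illustrative vectors, and <code>suppressWarnings()</code> is used only to silence the warnings that <code>min()</code> and <code>max()</code> emit for empty input:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- numeric()   # an empty vector
y &lt;- c(3, 7)     # some new, non-empty data

# The mean of nothing is 0 / 0
sum(x) / length(x)
#&gt; [1] NaN

min_x &lt;- suppressWarnings(min(x))
max_x &lt;- suppressWarnings(max(x))
c(min_x, max_x)
#&gt; [1]  Inf -Inf

# Combining with new data gives the min and max of the new data
min(c(x, y)) == min(min_x, min(y))
#&gt; [1] TRUE
max(c(x, y)) == max(max_x, max(y))
#&gt; [1] TRUE</pre>
</div>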
<p>Sometimes a simpler approach is to perform the summary and then make the implicit missings explicit with <code><a href="https://tidyr.tidyverse.org/reference/complete.html">complete()</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">health |&gt;
  group_by(smoker) |&gt;
  summarize(
    n = n(),
    mean_age = mean(age),
    min_age = min(age),
    max_age = max(age),
    sd_age = sd(age)
  ) |&gt;
  complete(smoker)
#&gt; # A tibble: 2 × 6
#&gt; smoker n mean_age min_age max_age sd_age
#&gt; &lt;fct&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 yes NA NA NA NA NA
#&gt; 2 no 5 60 34 88 21.6</pre>
</div>
<p>The main drawback of this approach is that you get an <code>NA</code> for the count, even though you know that it should be zero.</p>
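<p>One possible workaround, sketched here on a simpler count, is the <code>fill</code> argument of <code>complete()</code>, which takes a named list of values to use in place of <code>NA</code> for the newly created rows. The name <code>counts</code> is illustrative:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

health &lt;- tibble(
  name   = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
  smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
  age    = c(34L, 88L, 75L, 47L, 56L)
)

# Count, then add the empty level with n = 0 instead of NA
counts &lt;- health |&gt;
  count(smoker) |&gt;
  complete(smoker, fill = list(n = 0L))

counts$n
#&gt; [1] 0 5</pre>
</div>
<p>Columns that aren’t named in <code>fill</code> still receive <code>NA</code>, which is usually what you want for summaries like a mean that are undefined for an empty group.</p>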
</section>
<section id="missing-values-summary" data-type="sect1">
<h1>
Summary</h1>
<p>Missing values are weird! Sometimes they’re recorded as an explicit <code>NA</code>, but other times you only notice them by their absence. This chapter has given you some tools for working with explicit missing values and for uncovering implicit missing values, and has discussed some of the ways that implicit can become explicit and vice versa.</p>
<p>In the next chapter, we tackle the final topic in this part of the book: joins. This is a bit of a change from the chapters so far because we’re going to discuss tools that work with data frames as a whole, not something that you put inside a data frame.</p>
</section>
</section>