131 lines
3.5 KiB
Plaintext
131 lines
3.5 KiB
Plaintext
# Missing values {#missing-values}
|
|
|
|
```{r, results = "asis", echo = FALSE}
|
|
status("drafting")
|
|
```
|
|
|
|
## Introduction
|
|
|
|
A value can be missing in one of two possible ways.
|
|
It can be **explicitly** missing, i.e. flagged with `NA`, or it can be **implicitly**, missing i.e. simply not present in the data.
|
|
|
|
This chapter will explore cases where implicit and explicit missing values can become explict,
|
|
|
|
### Prerequisites
|
|
|
|
```{r setup, message = FALSE}
|
|
library(tidyverse)
|
|
library(nycflights13)
|
|
```
|
|
|
|
## Motivation
|
|
|
|
Let's illustrate this idea with a very simple data set.
|
|
|
|
```{r}
|
|
stocks <- tibble(
|
|
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
|
|
qtr = c( 1, 2, 3, 4, 2, 3, 4),
|
|
return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
|
|
)
|
|
```
|
|
|
|
There are two missing values in this dataset:
|
|
|
|
- The return for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be instead contains `NA`.
|
|
|
|
- The return for the first quarter of 2016 is implicitly missing, because it simply does not appear in the dataset.
|
|
|
|
One way to think about the difference is with this Zen-like koan: An explicit missing value is the presence of an absence; an implicit missing value is the absence of a presence.
|
|
|
|
## Complete and joins
|
|
|
|
If a dataset has a regular structure, you can make implicit missing values implicit with `complete()`:
|
|
|
|
```{r}
|
|
stocks |>
|
|
complete(year, qtr)
|
|
```
|
|
|
|
If you know that the range isn't correct, you can:
|
|
|
|
```{r}
|
|
stocks |>
|
|
complete(year = 2015:2017, qtr)
|
|
```
|
|
|
|
`complete()` takes a set of columns, and finds all unique combinations.
|
|
It then ensures the original dataset contains all those values, filling in explicit `NA`s where necessary.
|
|
|
|
```{r}
|
|
stocks |>
|
|
expand(year, qtr) |>
|
|
left_join(stocks)
|
|
```
|
|
|
|
Other times missing values might be defined by another dataset.
|
|
|
|
```{r}
|
|
flights |>
|
|
distinct(faa = dest) |>
|
|
anti_join(airports)
|
|
|
|
flights |>
|
|
distinct(tailnum) |>
|
|
anti_join(planes)
|
|
```
|
|
|
|
## Pivotting {#missing-values-tidy}
|
|
|
|
Changing the representation of a dataset brings up an important subtlety of missing values.
|
|
|
|
The way that a dataset is represented can make implicit values explicit.
|
|
For example, we can make the implicit missing value explicit by putting years in the columns:
|
|
|
|
```{r}
|
|
stocks |>
|
|
pivot_wider(names_from = year, values_from = return)
|
|
```
|
|
|
|
Because these explicit missing values may not be important in other representations of the data, you can set `values_drop_na = TRUE` in `pivot_longer()` to turn explicit missing values implicit:
|
|
|
|
```{r}
|
|
stocks |>
|
|
pivot_wider(names_from = year, values_from = return) |>
|
|
pivot_longer(
|
|
cols = c(`2015`, `2016`),
|
|
names_to = "year",
|
|
values_to = "return",
|
|
values_drop_na = TRUE
|
|
)
|
|
```
|
|
|
|
## Last observation carried forward
|
|
|
|
There's one other important tool that you should know for working with missing values.
|
|
Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:
|
|
|
|
```{r}
|
|
treatment <- tribble(
|
|
~person, ~treatment, ~response,
|
|
"Derrick Whitmore", 1, 7,
|
|
NA, 2, 10,
|
|
NA, 3, 9,
|
|
"Katherine Burke", 1, 4
|
|
)
|
|
```
|
|
|
|
You can fill in these missing values with `fill()`.
|
|
It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).
|
|
|
|
```{r}
|
|
treatment |>
|
|
fill(person)
|
|
```
|
|
|
|
## Factors
|
|
|
|
- factors: `group_by` + `.drop = FALSE`
|
|
|
|
##
|