A little thinking about missing values

Hadley Wickham 2022-03-31 08:10:52 -05:00
parent 61d8a75908
commit 27507a8bf2
2 changed files with 60 additions and 62 deletions


@@ -6,27 +6,21 @@ status("drafting")
## Introduction
```{r}
A value can be missing in one of two possible ways.
It can be **explicitly** missing, i.e. flagged with `NA`, or it can be **implicitly** missing, i.e. simply not present in the data.
This chapter will explore cases where implicit missing values can become explicit, and vice versa.
### Prerequisites
```{r setup, message = FALSE}
library(tidyverse)
library(nycflights13)
```
Missing topics:
## Motivation
- Missing values generated from matching data frames (i.e. `left_join()` and `anti_join()`)
- Last observation carried forward and `tidyr::fill()`
- `coalesce()` and `na_if()`
## Explicit vs implicit missing values {#missing-values-tidy}
Changing the representation of a dataset brings up an important subtlety of missing values.
Surprisingly, a value can be missing in one of two possible ways:
- **Explicitly**, i.e. flagged with `NA`.
- **Implicitly**, i.e. simply not present in the data.
Let's illustrate this idea with a very simple data set:
Let's illustrate this idea with a very simple data set.
```{r}
stocks <- tibble(
@@ -44,6 +38,47 @@ There are two missing values in this dataset:
One way to think about the difference is with this Zen-like koan: An explicit missing value is the presence of an absence; an implicit missing value is the absence of a presence.
## Complete and joins
If a dataset has a regular structure, you can make implicit missing values explicit with `complete()`:
```{r}
stocks |>
  complete(year, qtr)
```
If you know that the range of values present in the data isn't complete, you can supply the full set of values yourself:
```{r}
stocks |>
  complete(year = 2015:2017, qtr)
```
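As a rough sketch of what supplying your own values does, here is a made-up tibble (`sales` and its columns are hypothetical, not part of the chapter's data) in which the year 2019 never appears at all:

```{r}
library(dplyr)
library(tidyr)

# Hypothetical data: 2019 is absent entirely, 2021 Q2 is implicitly missing,
# and 2020 Q2 is explicitly missing
sales <- tibble(
  year   = c(2020, 2020, 2021),
  qtr    = c(1, 2, 1),
  amount = c(10, NA, 30)
)

# Supplying year = 2019:2021 adds rows for 2019 even though it never occurs
filled <- sales |>
  complete(year = 2019:2021, qtr)

nrow(filled)  # 3 years x 2 quarters = 6 rows
```

Every combination that was not in `sales` gets an explicit `NA` in `amount`.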
`complete()` takes a set of columns, and finds all unique combinations.
It then ensures the original dataset contains all those values, filling in explicit `NA`s where necessary.
```{r}
stocks |>
  expand(year, qtr) |>
  left_join(stocks)
```
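One way to convince yourself that the two approaches agree is to compare them on a small example. This is only a sketch: `sales` is a hypothetical tibble, and the comparison goes through base `all.equal()` on plain data frames rather than anything specific to tidyr.

```{r}
library(dplyr)
library(tidyr)

# Hypothetical data: 2021 Q2 is implicitly missing
sales <- tibble(
  year   = c(2020, 2020, 2021),
  qtr    = c(1, 2, 1),
  amount = c(10, NA, 30)
)

# All combinations of year and qtr, with NA where no row existed
completed <- sales |>
  complete(year, qtr)

# The same result built by hand: expand the keys, then join the data back on
by_hand <- sales |>
  expand(year, qtr) |>
  left_join(sales, by = c("year", "qtr"))

all.equal(as.data.frame(completed), as.data.frame(by_hand))
```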
Other times missing values might be defined by another dataset.
```{r}
flights |>
  distinct(faa = dest) |>
  anti_join(airports)
flights |>
  distinct(tailnum) |>
  anti_join(planes)
```
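`anti_join(x, y)` keeps the rows of `x` that have no match in `y`, which is what makes it useful for finding values that are implicitly missing from a lookup table. A minimal sketch on made-up data (`trips` and `stations` are hypothetical stand-ins, not the nycflights13 tables):

```{r}
library(dplyr)

# Hypothetical data: "CCC" appears as a destination but has no station record
trips <- tibble(dest = c("AAA", "BBB", "BBB", "CCC"))
stations <- tibble(
  code = c("AAA", "BBB"),
  name = c("Alpha", "Bravo")
)

# Rename dest to match the key in stations, then keep only unmatched codes
unmatched <- trips |>
  distinct(code = dest) |>
  anti_join(stations, by = "code")

unmatched
```

Spelling out `by` avoids the message dplyr prints when it has to guess the join columns.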
## Pivoting {#missing-values-tidy}
Changing the representation of a dataset brings up an important subtlety of missing values.
The way that a dataset is represented can make implicit values explicit.
For example, we can make the implicit missing value explicit by putting years in the columns:
@@ -65,15 +100,7 @@ stocks |>
)
```
Another important tool for making missing values explicit in tidy data is `complete()`:
```{r}
stocks |>
  complete(year, qtr)
```
`complete()` takes a set of columns, and finds all unique combinations.
It then ensures the original dataset contains all those values, filling in explicit `NA`s where necessary.
## Last observation carried forward
There's one other important tool that you should know for working with missing values.
Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:
@@ -96,41 +123,8 @@ treatment |>
  fill(person)
```
`group_by` + `.drop = FALSE`
## Factors
### Exercises
- factors: `group_by` + `.drop = FALSE`
1. Compare and contrast the `fill` arguments to `pivot_wider()` and `complete()`.
2. What does the direction argument to `fill()` do?
## dplyr verbs
`filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values.
If you want to preserve missing values, ask for them explicitly:
```{r}
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
filter(df, is.na(x) | x > 1)
```
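The reason `filter()` drops the missing row is that the comparison itself evaluates to `NA`, which is neither `TRUE` nor `FALSE`. The following sketch reuses the same `df` and just stores the intermediate results (the names `kept` and `kept_with_na` are made up for illustration):

```{r}
library(dplyr)

df <- tibble(x = c(1, NA, 3))

# The comparison is NA for the missing element
df$x > 1

# filter() keeps only rows where the condition is TRUE, so the NA row is dropped
kept <- filter(df, x > 1)

# Asking for missing values explicitly keeps the NA row as well
kept_with_na <- filter(df, is.na(x) | x > 1)

nrow(kept)
nrow(kept_with_na)
```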
Missing values are always sorted at the end, regardless of whether you use `desc()`:
```{r}
df <- tibble(x = c(5, 2, NA))
arrange(df, x)
arrange(df, desc(x))
```
Explain the warning here
```{r, eval = FALSE}
flights |>
  group_by(dest) |>
  summarise(max_delay = max(arr_delay, na.rm = TRUE))
```
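The warning most likely comes from groups in which every `arr_delay` is missing: once `na.rm = TRUE` removes the `NA`s there is nothing left, and `max()` of an empty vector warns and returns `-Inf`. A minimal sketch of that behaviour, where `safe_max()` is a hypothetical helper and not part of dplyr:

```{r}
# A group where every value is missing
x <- c(NA_real_, NA_real_)

# Warns "no non-missing arguments to max; returning -Inf"
res <- max(x, na.rm = TRUE)
res

# Hypothetical helper: return NA instead of -Inf when nothing is left
safe_max <- function(x) {
  if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)
}

safe_max(x)
safe_max(c(1, NA, 3))
```

Whether `NA` or `-Inf` is the better summary for an all-missing group depends on the analysis; the helper only makes the choice explicit.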
## Exercises
1. Why is `NA ^ 0` not missing? Why is `NA | TRUE` not missing? Why is `FALSE & NA` not missing? Can you figure out the general rule? (`NA * 0` is a tricky counterexample!)
##


@@ -344,6 +344,10 @@ slide_vec(x, sum, .before = 2, .after = 2, .complete = TRUE)
These are often used with numbers, but can be applied to most other column types.
### Missing values
`coalesce()`
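
This section is still a stub, so the following is only a sketch of how `coalesce()` and its near-inverse `na_if()` are typically used; the vectors and the `-99` sentinel are made up for illustration.

```{r}
library(dplyr)

# coalesce() takes the first non-missing value at each position,
# so a scalar second argument acts as a replacement for NA
x <- c(1, NA, 5, NA)
filled <- coalesce(x, 0)
filled

# na_if() goes the other way: it turns a sentinel value into NA
y <- c(1, -99, 5)
cleaned <- na_if(y, -99)
cleaned
```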
### Ranks
dplyr provides a number of ranking functions, but you should start with `dplyr::min_rank()`.