A little thinking about missing values

Hadley Wickham 2022-03-31 08:10:52 -05:00
parent 61d8a75908
commit 27507a8bf2
2 changed files with 60 additions and 62 deletions


@ -6,27 +6,21 @@ status("drafting")
## Introduction ## Introduction
```{r} A value can be missing in one of two possible ways.
It can be **explicitly** missing, i.e. flagged with `NA`, or it can be **implicitly** missing, i.e. simply not present in the data.
This chapter will explore cases where implicit missing values can become explicit.
### Prerequisites
```{r setup, message = FALSE}
library(tidyverse) library(tidyverse)
library(nycflights13)
``` ```
Missing topics: ## Motivation
- Missing values generated from matching data frames (i.e. `left_join()` and `anti_join()`) Let's illustrate this idea with a very simple data set.
- Last observation carried forward and `tidyr::fill()`
- `coalesce()` and `na_if()`
## Explicit vs implicit missing values {#missing-values-tidy}
Changing the representation of a dataset brings up an important subtlety of missing values.
Surprisingly, a value can be missing in one of two possible ways:
- **Explicitly**, i.e. flagged with `NA`.
- **Implicitly**, i.e. simply not present in the data.
Let's illustrate this idea with a very simple data set:
```{r} ```{r}
stocks <- tibble( stocks <- tibble(
@ -44,6 +38,47 @@ There are two missing values in this dataset:
One way to think about the difference is with this Zen-like koan: An explicit missing value is the presence of an absence; an implicit missing value is the absence of a presence. One way to think about the difference is with this Zen-like koan: An explicit missing value is the presence of an absence; an implicit missing value is the absence of a presence.
## Complete and joins
If a dataset has a regular structure, you can make implicit missing values explicit with `complete()`:
```{r}
stocks |>
complete(year, qtr)
```
If you know that the range of values present in the data isn't correct, you can supply the full set of values yourself:
```{r}
stocks |>
complete(year = 2015:2017, qtr)
```
`complete()` takes a set of columns, and finds all unique combinations.
It then ensures the original dataset contains all those values, filling in explicit `NA`s where necessary.
```{r}
stocks |>
expand(year, qtr) |>
left_join(stocks)
```
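As a minimal sketch of the relationship between `complete()` and the `expand()` plus `left_join()` pipeline, the following uses a small made-up tibble (`sales` and its values are hypothetical, not the chapter's `stocks` data). The explicit `by` argument is added here only to make the join keys visible.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: the row for 2021, quarter 2 is implicitly missing.
sales <- tibble(
  year = c(2020, 2020, 2021),
  qtr = c(1, 2, 1),
  amount = c(10, 12, 9)
)

# complete() adds the missing year/qtr combination with an explicit NA.
completed <- sales |>
  complete(year, qtr)

# The same rows built by hand: all combinations, then a left join.
by_hand <- sales |>
  expand(year, qtr) |>
  left_join(sales, by = c("year", "qtr"))
```

Both results should have four rows, with `amount` set to `NA` for the combination that was absent from `sales`.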
Other times missing values might be defined by another dataset.
```{r}
flights |>
distinct(faa = dest) |>
anti_join(airports)
flights |>
distinct(tailnum) |>
anti_join(planes)
```
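The same pattern can be tried on toy data. This sketch assumes two hypothetical tibbles, `trips` and `lookup`, standing in for `flights` and `airports`; `anti_join()` keeps only the keys that have no match in the second table.

```r
library(dplyr)

# Hypothetical tables: trips refers to airport codes, lookup describes them.
trips <- tibble(dest = c("ORD", "BQN", "ORD", "SJU"))
lookup <- tibble(faa = c("ORD", "ATL"), name = c("Airport one", "Airport two"))

# Distinct destinations that do not appear in the lookup table.
unmatched <- trips |>
  distinct(faa = dest) |>
  anti_join(lookup, by = "faa")
```

Here `unmatched` holds the codes that are used in `trips` but are implicitly missing from `lookup`.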
## Pivoting {#missing-values-tidy}
Changing the representation of a dataset brings up an important subtlety of missing values.
The way that a dataset is represented can make implicit values explicit. The way that a dataset is represented can make implicit values explicit.
For example, we can make the implicit missing value explicit by putting years in the columns: For example, we can make the implicit missing value explicit by putting years in the columns:
@ -65,15 +100,7 @@ stocks |>
) )
``` ```
Another important tool for making missing values explicit in tidy data is `complete()`: ## Last observation carried forward
```{r}
stocks |>
complete(year, qtr)
```
`complete()` takes a set of columns, and finds all unique combinations.
It then ensures the original dataset contains all those values, filling in explicit `NA`s where necessary.
There's one other important tool that you should know for working with missing values. There's one other important tool that you should know for working with missing values.
Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward: Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:
@ -96,41 +123,8 @@ treatment |>
fill(person) fill(person)
``` ```
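For a self-contained illustration of carrying the last observation forward, here is a small sketch with a hypothetical `visits` tibble (not the chapter's `treatment` data). By default `fill()` fills downwards.

```r
library(tidyr)

# Hypothetical data-entry style table: person is recorded only on the first row.
visits <- tibble(
  person = c("A", NA, NA, "B"),
  visit = c(1, 2, 3, 1)
)

# Each NA in person is replaced by the most recent non-missing value above it.
filled <- visits |>
  fill(person)
```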
`group_by` + `.drop = FALSE` ## Factors
### Exercises - factors: `group_by` + `.drop = FALSE`
1. Compare and contrast the `fill` arguments to `pivot_wider()` and `complete()`. ##
2. What does the `.direction` argument to `fill()` do?
## dplyr verbs
`filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values.
If you want to preserve missing values, ask for them explicitly:
```{r}
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
filter(df, is.na(x) | x > 1)
```
Missing values are always sorted at the end:
```{r}
df <- tibble(x = c(5, 2, NA))
arrange(df, x)
arrange(df, desc(x))
```
Explain the warning here
```{r, eval = FALSE}
flights |>
group_by(dest) |>
summarise(max_delay = max(arr_delay, na.rm = TRUE))
```
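One plausible source of such a warning, assuming at least one group contains only missing values, is that `max()` with `na.rm = TRUE` is left with nothing to compare. A minimal base R sketch of that situation (the vector `all_missing` is hypothetical):

```r
# Hypothetical group in which every value is missing.
all_missing <- c(NA_real_, NA_real_)

# After the NAs are removed the input is empty, so max() signals a warning.
warning_text <- tryCatch(
  max(all_missing, na.rm = TRUE),
  warning = function(w) conditionMessage(w)
)

# With the warning suppressed, the value returned is -Inf.
result <- suppressWarnings(max(all_missing, na.rm = TRUE))
```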
## Exercises
1. Why is `NA ^ 0` not missing? Why is `NA | TRUE` not missing? Why is `FALSE & NA` not missing? Can you figure out the general rule? (`NA * 0` is a tricky counterexample!)


@ -344,6 +344,10 @@ slide_vec(x, sum, .before = 2, .after = 2, .complete = TRUE)
These are often used with numbers, but can be applied to most other column types. These are often used with numbers, but can be applied to most other column types.
### Missing values
`coalesce()`
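A brief sketch of how `coalesce()` and its counterpart `na_if()` behave, using a hypothetical vector `x` in which `-99` stands in for a sentinel value:

```r
library(dplyr)

# Hypothetical vector: one NA, and -99 used as a code for "missing".
x <- c(1, NA, 3, -99)

# coalesce() replaces each NA with the value supplied in the next argument.
no_missing <- coalesce(x, 0)

# na_if() goes the other way and turns a sentinel value into NA.
with_missing <- na_if(x, -99)
```

`no_missing` is `1, 0, 3, -99` and `with_missing` is `1, NA, 3, NA`.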
### Ranks ### Ranks
dplyr provides a number of ranking functions, but you should start with `dplyr::min_rank()`. dplyr provides a number of ranking functions, but you should start with `dplyr::min_rank()`.