r4ds/missing-values.Rmd

# Missing values {#missing-values}

```{r, results = "asis", echo = FALSE}
status("drafting")
```

## Introduction

We have also discussed missing values earlier in the book in:

TODO: add index of previous mentions

### Prerequisites

```{r setup, message = FALSE}
library(tidyverse)
library(nycflights13)
```

## Explicit vs implicit missing values

A value can be missing in one of two possible ways.
It can be **explicitly** missing, i.e. flagged with `NA`, or it can be **implicitly**, missing i.e. simply not present in the data.
Let's illustrate this important idea with a very simple data set, which records the price of a stock in each quarter.

```{r}
stocks <- tibble(
  year  = c(2022, 2022, 2022, 2022, 2023, 2023, 2023),
  qtr   = c(   1,    2,    3,    4,    2,    3,    4),
  price = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)
```

There are two missing values in this dataset:

-   The return for the fourth quarter of 2022 is explicitly missing, because the cell where its value should be instead contains `NA`.

-   The return for the first quarter of 2023 is implicitly missing, because it simply does not appear in the dataset.

One way to think about the difference is with this Zen-like koan:

> An explicit missing value is the presence of an absence.\
> An implicit missing value is the absence of a presence.

### Pivoting

You've already learned about one tool that can make implicit missings explicit and vice versa: pivoting.
Making data wider can make implicit missing values become explicit.
For example, if we pivot `stocks` to put the `year` in the columns pivoting, we can make both missing values explicit:

```{r}
stocks |>
  pivot_wider(
    names_from = year, 
    values_from = price
  )
```

Making data longer generally preserves explicit missing values, but you can make them implicit by setting `drop_na`.
If we make that wider data longer, and specify `values_drop_na = TRUE` we can make both missing values implicit:

```{r}
stocks |>
  pivot_wider(
    names_from = year, 
    values_from = price
  ) |> 
  pivot_longer(
    cols = -qtr, 
    names_to = "year", 
    values_to = "price", 
    values_drop_na = TRUE
  )
```

Generally, however, you use `values_drop_na` because you have missing values that don't represent missing observations, but are forced to exist due to the representation of the data.
See the examples in Chapter \@ref(tidy-data) for more details.

### Complete

A more direct way to implicit values into explicit values is with `complete()`:

```{r}
stocks |>
  complete(year, qtr)
```

`complete()` takes a set of columns, and finds all unique combinations.
It then ensures the original dataset contains all those values, filling in explicit `NA`s where necessary.

Typically, you'll work with the values in the dataset, filling in missing combinations.
But sometimes the dataset is incomplete If you know that the range isn't correct, you can:

```{r}
stocks |>
  complete(year = 2015:2017, qtr)
```

Or if the range is correct, but there might be missing values in the middle:

```{r}
stocks |>
  complete(year = full_seq(year, 1), qtr = 1:4)
```

TODO: add discussion of `group_by()` plus nesting etc.

### Joins

Other times missing values might be defined by another dataset.

```{r}
flights |> 
  distinct(faa = dest) |> 
  anti_join(airports)

flights |> 
  distinct(tailnum) |> 
  anti_join(planes)
```

## When missings represent values

### Last observation carried forward

Another place that missing values arise is as a data entry convenience.
Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:

```{r}
treatment <- tribble(
  ~person,           ~treatment, ~response,
  "Derrick Whitmore", 1,         7,
  NA,                 2,         10,
  NA,                 3,         9,
  "Katherine Burke",  1,         4
)
```

You can fill in these missing values with `fill()`.
It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).

```{r}
treatment |>
  fill(person)
```

You can use the `direction` argument to fill in missing values that have been generated in more exotic ways.

### Fixed values

Some times missing values represent some fixed known value, mostly commonly 0.
You can use `coalesce()` to replace.

```{r}
x <- c(1, 4, 5, 7, NA)
coalesce(x, 0)
```

You can apply this to every numeric column in a data frame with:

```{r, eval = FALSE}
df |> mutate(across(where(is.numeric), coalesce, 0))
```

### Sentinel values

Sometimes you'll hit the opposite problem because some older software doesn't have an explicit way to represent missing values, so it might be recorded using some special sentinel value like 99 or -999.
If possible, handle this when reading in the data, for example, by using the `na` argument to `read::read_csv()`.
If you discover later, or from a data source that doesn't provide a way to handle on read, you can use `na_if()`

```{r}
x <- c(1, 4, 5, 7, -99)
na_if(x, -99)
```

You can apply this to every numeric column in a data frame with:

```{r, eval = FALSE}
df |> mutate(across(where(is.numeric), na_if, -99))
```

## Factors

Another sort of missing value arises with factors.
For example, imagine we have a dataset that contains some health information about people:

```{r}
health <- tibble(
  name = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
  smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
  age = c(34L, 88L, 75L, 47L, 56L),
)
```

And we want to count the number of smokers:

```{r}
health |> count(smoker)
```

This dataset only contains non-smokers, but we know that smokers exist.
We can request to keep all the value, even if not seen in the data with `.drop = FALSE`:

```{r}
health |> count(smoker, .drop = FALSE)
```

This argument also works with `group_by()`:

```{r}
health |> 
  group_by(smoker, .drop = FALSE) |> 
  summarise(
    n = n(),
    mean_age = mean(age),
    min_age = min(age),
    max_age = max(age),
    sd_age = sd(age)
  )
```

Summary functions generally work with zero-length vectors, but they may return results that are surprising at first glance.
There's almost always some deeper logic behind them.
A sometimes simpler approach is to perform the summary and then make the implicit missings explicit with `complete()`.

```{r}
health |> 
  group_by(smoker) |> 
  summarise(
    n = n(),
    mean_age = mean(age),
    min_age = min(age),
    max_age = max(age),
    sd_age = sd(age)
  ) |> 
  complete(smoker, fill = list(n = 0))
```

Main con of this approach is that you need to carefully specify the `fill` argument so that
Data transformation (#940) * Minor edit + link to style guide * Fix reference * If you don't know order of operations, not clear * Alt text + minor edits * Add median and fix reference * Move up mult groups up to discuss summarise msg * Go over grouping again * Part rename * Chapter rename * Clean up section labels to avoid dups * Update comment * Switch part order * Move columnwise to transform 2021-03-29 21:58:27 +08:00			`# Missing values {#missing-values}`
Second crack and 2e structure 2021-03-04 01:13:14 +08:00
Add chapter status 2021-05-04 21:10:39 +08:00			```{r, results = "asis", echo = FALSE}
			`status("drafting")`
			```

Second crack and 2e structure 2021-03-04 01:13:14 +08:00			`## Introduction`
Break up data-transform content 2021-04-19 20:56:29 +08:00
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			`We have also discussed missing values earlier in the book in:`
Continuing to rewrite data-transform 2021-04-21 21:25:39 +08:00
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			`TODO: add index of previous mentions`
Continuing to rewrite data-transform 2021-04-21 21:25:39 +08:00
A little thinking about missing values 2022-03-31 21:10:52 +08:00			`### Prerequisites`
Continuing to rewrite data-transform 2021-04-21 21:25:39 +08:00
A little thinking about missing values 2022-03-31 21:10:52 +08:00			```{r setup, message = FALSE}
			`library(tidyverse)`
			`library(nycflights13)`
			```
Pull content out of tidying 2021-04-19 20:59:07 +08:00
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			`## Explicit vs implicit missing values`
Pull content out of tidying 2021-04-19 20:59:07 +08:00
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			`A value can be missing in one of two possible ways.`
			It can be explicitly missing, i.e. flagged with `NA`, or it can be implicitly, missing i.e. simply not present in the data.
			`Let's illustrate this important idea with a very simple data set, which records the price of a stock in each quarter.`
Pull content out of tidying 2021-04-19 20:59:07 +08:00
			```{r}
			`stocks <- tibble(`
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			`year = c(2022, 2022, 2022, 2022, 2023, 2023, 2023),`
			`qtr = c( 1, 2, 3, 4, 2, 3, 4),`
			`price = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)`
Pull content out of tidying 2021-04-19 20:59:07 +08:00			`)`
			```

			`There are two missing values in this dataset:`

Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			- The return for the fourth quarter of 2022 is explicitly missing, because the cell where its value should be instead contains `NA`.

			`- The return for the first quarter of 2023 is implicitly missing, because it simply does not appear in the dataset.`
Pull content out of tidying 2021-04-19 20:59:07 +08:00
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			`One way to think about the difference is with this Zen-like koan:`
Pull content out of tidying 2021-04-19 20:59:07 +08:00
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			`> An explicit missing value is the presence of an absence.\`
			`> An implicit missing value is the absence of a presence.`
Pull content out of tidying 2021-04-19 20:59:07 +08:00
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			`### Pivoting`
A little thinking about missing values 2022-03-31 21:10:52 +08:00
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			`You've already learned about one tool that can make implicit missings explicit and vice versa: pivoting.`
			`Making data wider can make implicit missing values become explicit.`
			For example, if we pivot `stocks` to put the `year` in the columns pivoting, we can make both missing values explicit:
A little thinking about missing values 2022-03-31 21:10:52 +08:00
			```{r}
			`stocks \|>`
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			`pivot_wider(`
			`names_from = year,`
			`values_from = price`
			`)`
A little thinking about missing values 2022-03-31 21:10:52 +08:00			```

Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			Making data longer generally preserves explicit missing values, but you can make them implicit by setting `drop_na`.
			If we make that wider data longer, and specify `values_drop_na = TRUE` we can make both missing values implicit:
A little thinking about missing values 2022-03-31 21:10:52 +08:00
			```{r}
			`stocks \|>`
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			`pivot_wider(`
			`names_from = year,`
			`values_from = price`
			`) \|>`
			`pivot_longer(`
			`cols = -qtr,`
			`names_to = "year",`
			`values_to = "price",`
			`values_drop_na = TRUE`
			`)`
			```

			Generally, however, you use `values_drop_na` because you have missing values that don't represent missing observations, but are forced to exist due to the representation of the data.
			`See the examples in Chapter \@ref(tidy-data) for more details.`

			`### Complete`

			A more direct way to implicit values into explicit values is with `complete()`:

			```{r}
			`stocks \|>`
			`complete(year, qtr)`
A little thinking about missing values 2022-03-31 21:10:52 +08:00			```

			`complete()` takes a set of columns, and finds all unique combinations.
			It then ensures the original dataset contains all those values, filling in explicit `NA`s where necessary.

Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			`Typically, you'll work with the values in the dataset, filling in missing combinations.`
			`But sometimes the dataset is incomplete If you know that the range isn't correct, you can:`

A little thinking about missing values 2022-03-31 21:10:52 +08:00			```{r}
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			`stocks \|>`
			`complete(year = 2015:2017, qtr)`
			```

			`Or if the range is correct, but there might be missing values in the middle:`

			```{r}
			`stocks \|>`
			`complete(year = full_seq(year, 1), qtr = 1:4)`
A little thinking about missing values 2022-03-31 21:10:52 +08:00			```

Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			TODO: add discussion of `group_by()` plus nesting etc.

			`### Joins`

A little thinking about missing values 2022-03-31 21:10:52 +08:00			`Other times missing values might be defined by another dataset.`

			```{r}
			`flights \|>`
			`distinct(faa = dest) \|>`
			`anti_join(airports)`

			`flights \|>`
			`distinct(tailnum) \|>`
			`anti_join(planes)`
			```

Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			`## When missings represent values`
Pull content out of tidying 2021-04-19 20:59:07 +08:00
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			`### Last observation carried forward`
Pull content out of tidying 2021-04-19 20:59:07 +08:00
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			`Another place that missing values arise is as a data entry convenience.`
Pull content out of tidying 2021-04-19 20:59:07 +08:00			`Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:`

			```{r}
			`treatment <- tribble(`
			`~person, ~treatment, ~response,`
			`"Derrick Whitmore", 1, 7,`
			`NA, 2, 10,`
			`NA, 3, 9,`
			`"Katherine Burke", 1, 4`
			`)`
			```

			You can fill in these missing values with `fill()`.
			`It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).`

			```{r}
Convert from %>% to \|> 2022-02-24 03:15:52 +08:00			`treatment \|>`
Pull content out of tidying 2021-04-19 20:59:07 +08:00			`fill(person)`
			```

Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			You can use the `direction` argument to fill in missing values that have been generated in more exotic ways.

			`### Fixed values`

			`Some times missing values represent some fixed known value, mostly commonly 0.`
			You can use `coalesce()` to replace.

			```{r}
			`x <- c(1, 4, 5, 7, NA)`
			`coalesce(x, 0)`
			```

			`You can apply this to every numeric column in a data frame with:`

			```{r, eval = FALSE}
			`df \|> mutate(across(where(is.numeric), coalesce, 0))`
			```

			`### Sentinel values`

			`Sometimes you'll hit the opposite problem because some older software doesn't have an explicit way to represent missing values, so it might be recorded using some special sentinel value like 99 or -999.`
			If possible, handle this when reading in the data, for example, by using the `na` argument to `read::read_csv()`.
			If you discover later, or from a data source that doesn't provide a way to handle on read, you can use `na_if()`

			```{r}
			`x <- c(1, 4, 5, 7, -99)`
			`na_if(x, -99)`
			```

			`You can apply this to every numeric column in a data frame with:`

			```{r, eval = FALSE}
			`df \|> mutate(across(where(is.numeric), na_if, -99))`
			```

A little thinking about missing values 2022-03-31 21:10:52 +08:00			`## Factors`
Polishing data transformation 2022-02-16 01:59:19 +08:00
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			`Another sort of missing value arises with factors.`
			`For example, imagine we have a dataset that contains some health information about people:`

			```{r}
			`health <- tibble(`
			`name = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),`
			`smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),`
			`age = c(34L, 88L, 75L, 47L, 56L),`
			`)`
			```

			`And we want to count the number of smokers:`

			```{r}
			`health \|> count(smoker)`
			```

			`This dataset only contains non-smokers, but we know that smokers exist.`
			We can request to keep all the value, even if not seen in the data with `.drop = FALSE`:

			```{r}
			`health \|> count(smoker, .drop = FALSE)`
			```

			This argument also works with `group_by()`:

			```{r}
			`health \|>`
			`group_by(smoker, .drop = FALSE) \|>`
			`summarise(`
			`n = n(),`
			`mean_age = mean(age),`
			`min_age = min(age),`
			`max_age = max(age),`
			`sd_age = sd(age)`
			`)`
			```

			`Summary functions generally work with zero-length vectors, but they may return results that are surprising at first glance.`
			`There's almost always some deeper logic behind them.`
			A sometimes simpler approach is to perform the summary and then make the implicit missings explicit with `complete()`.

			```{r}
			`health \|>`
			`group_by(smoker) \|>`
			`summarise(`
			`n = n(),`
			`mean_age = mean(age),`
			`min_age = min(age),`
			`max_age = max(age),`
			`sd_age = sd(age)`
			`) \|>`
			`complete(smoker, fill = list(n = 0))`
			```
Break up data-transform content 2021-04-19 20:56:29 +08:00
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			Main con of this approach is that you need to carefully specify the `fill` argument so that