Polishing missing values

This commit is contained in:
Hadley Wickham 2022-05-10 21:04:14 -05:00
parent 721ba68ac4
commit 0ea0ce5e14
1 changed file with 36 additions and 32 deletions


@ -1,18 +1,17 @@
# Missing values {#missing-values}
```{r, results = "asis", echo = FALSE}
status("polishing")
```
## Introduction
You've already learned the basics of missing values earlier in the book: you first saw them in Section \@ref(summarize), where they interfered with computing summary statistics, and you learned about their infectious nature and how to check for their presence in Section \@ref(na-comparison).
In this chapter, we'll come back to missing values in more depth, so you can learn more of the details.
We'll start by discussing some general tools for working with explicitly missing values, i.e. those recorded as `NA`.
We'll then explore the idea of implicitly missing values, values that are simply absent from your data, and show some tools you can use to make them explicit.
We'll finish off with a related discussion of empty groups, caused by factor levels that don't appear in the data.
### Prerequisites
@ -24,12 +23,12 @@ library(tidyverse)
## Explicit missing values
To begin, let's explore a few handy tools for creating or eliminating explicit missing values, i.e. cells where you see an `NA`.
In the following sections you'll learn how to carry the last observation forward, convert `NA`s to fixed values, and convert fixed values to `NA`s; you'll also meet the special variant of `NA` known as "not a number".
### Last observation carried forward
Missing values are commonly used as a data entry convenience.
When data is entered by hand, a missing value sometimes indicates that the value in the previous row has been repeated:
```{r}
treatment <- tribble(
@ -42,18 +41,19 @@ treatment <- tribble(
```
You can fill in these missing values with `tidyr::fill()`.
It works like `select()`, taking a set of columns:
```{r}
treatment |>
fill(everything())
```
This treatment is sometimes called "last observation carried forward", or **locf** for short.
You can use the `.direction` argument to fill in missing values that have been generated in more exotic ways.
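For example, a minimal sketch with the `treatment` data above, filling upwards from the next observation instead of downwards:

```{r}
treatment |>
  fill(everything(), .direction = "up")
```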
### Fixed values
Sometimes missing values represent some fixed and known value, most commonly 0.
You can use `dplyr::coalesce()` to replace them:
```{r}
@ -61,7 +61,7 @@ x <- c(1, 4, 5, 7, NA)
coalesce(x, 0)
```
You could use `mutate()` together with `across()` to apply this treatment to (say) every numeric column in a data frame:
```{r, eval = FALSE}
df |>
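  # a sketch of the elided step: replace NA with 0 in every numeric column
  mutate(across(where(is.numeric), \(x) coalesce(x, 0)))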
@ -70,8 +70,8 @@ df |>
### Sentinel values
Sometimes you'll hit the opposite problem where some concrete value actually represents a missing value.
This typically arises in data generated by older software that doesn't have a proper way to represent missing values, so it must instead use some special sentinel value like 99 or -999.
If possible, handle this when reading in the data, for example, by using the `na` argument to `readr::read_csv()`.
If you discover the problem later, or your data source doesn't provide a way to handle it on read, you can use `dplyr::na_if()`:
@ -81,7 +81,7 @@ x <- c(1, 4, 5, 7, -99)
na_if(x, -99)
```
You could apply this transformation to every numeric column in a data frame with the following code.
```{r, eval = FALSE}
df |>
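  # a sketch of the elided step: convert the sentinel -99 to NA in every numeric column
  mutate(across(where(is.numeric), \(x) na_if(x, -99)))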
@ -113,9 +113,9 @@ sqrt(-1)
## Implicit missing values
So far we've talked about missing values that are **explicitly** missing, i.e. you can see an `NA` in your data.
But missing values can also be **implicitly** missing, if an entire row is simply absent from the data.
Let's illustrate the difference with a simple data set that records the price of some stock each quarter:
```{r}
stocks <- tibble(
@ -137,9 +137,9 @@ One way to think about the difference is with this Zen-like koan:
>
> An implicit missing value is the absence of a presence.
Sometimes you want to make implicit missings explicit in order to have something physical to work with.
In other cases, explicit missings are forced upon you by the structure of the data and you want to get rid of them.
The following sections discuss some tools for moving between implicit and explicit missingness.
### Pivoting
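Making data wider can make implicit missing values explicit, because every combination of the rows and new columns must have some value. As a sketch with the `stocks` data above:

```{r}
stocks |>
  pivot_wider(names_from = qtr, values_from = price)
```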
@ -160,16 +160,17 @@ See the examples in Chapter \@ref(tidy-data) for more details.
### Complete
`tidyr::complete()` allows you to generate explicit missing values by providing a set of variables that define the combination of rows that should exist.
For example, we know that all combinations of `year` and `qtr` should exist in the `stocks` data:
```{r}
stocks |>
complete(year, qtr)
```
Typically, you'll call `complete()` with names of existing variables, filling in the missing combinations.
However, sometimes the individual variables are themselves incomplete, so you can instead provide your own data.
For example, you might know that the `stocks` dataset is supposed to run from 2019 to 2021, so you could explicitly supply those values for `year`:
```{r}
stocks |>
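  # a sketch of the elided step: supply the full range of years by hand
  complete(year = 2019:2021, qtr)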
@ -178,7 +179,7 @@ stocks |>
If the range of a variable is correct, but not all values are present, you could use `full_seq(x, 1)` to generate all values from `min(x)` to `max(x)` spaced out by 1.
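For instance, a quick sketch of `full_seq()` filling in the gaps in a vector:

```{r}
full_seq(c(2, 4, 7), 1)  # 2 3 4 5 6 7
```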
In some cases, the complete set of observations can't be generated by a simple combination of variables.
In that case, you can do manually what `complete()` does for you: create a data frame that contains all the rows that should exist (using whatever combination of techniques you need), then combine it with your original dataset with `dplyr::full_join()`.
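Here's a sketch of that manual approach, assuming the `stocks` data above; `tidyr::expand_grid()` builds the full grid of rows that should exist:

```{r, eval = FALSE}
all_rows <- expand_grid(year = 2019:2021, qtr = 1:4)
all_rows |>
  full_join(stocks, by = c("year", "qtr"))
```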
### Joins
@ -209,7 +210,7 @@ If you're worried about this, and you have dplyr 1.1.0 or newer, you can use the
## Factors and empty groups
A final type of missingness is the empty group, a group that doesn't contain any observations, which can arise when working with factors.
For example, imagine we have a dataset that contains some health information about people:
```{r}
@ -226,8 +227,7 @@ And we want to count the number of smokers with `dplyr::count()`:
health |> count(smoker)
```
This dataset only contains non-smokers, but we know that smokers exist; the group of smokers is empty.
We can request that `count()` keep all the groups, even those not seen in the data, by using `.drop = FALSE`:
```{r}
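# a sketch of the elided call: .drop = FALSE keeps the empty smoker group
health |> count(smoker, .drop = FALSE)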
@ -271,20 +271,24 @@ health |>
)
```
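For reference, a sketch of the grouped summary being discussed, with the summary names assumed from the text below:

```{r}
health |>
  group_by(smoker, .drop = FALSE) |>
  summarize(
    n = n(),
    mean_age = mean(age),
    min_age = min(age),
    max_age = max(age)
  )
```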
We get some interesting results here because when summarizing an empty group, the summary functions are applied to zero-length vectors.
There's an important distinction between empty vectors, which have length 0, and missing values, which each have length 1.
```{r}
# A vector containing two missing values
x1 <- c(NA, NA)
length(x1)
# A vector containing nothing
x2 <- numeric()
length(x2)
```
All summary functions work with zero-length vectors, but they may return results that are surprising at first glance.
Here we see `mean(age)` returning `NaN` because `mean(age)` = `sum(age) / length(age)`, which here is 0 / 0.
`max()` and `min()` return -Inf and Inf for empty vectors, so if you combine the results with a non-empty vector of new data and recompute, you'll get the minimum or maximum of the new data[^missing-values-1].
[^missing-values-1]: In other words, `min(c(x, y))` is always equal to `min(min(x), min(y))`.
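You can see these behaviors directly on empty vectors:

```{r}
mean(numeric())  # NaN: 0 / 0
min(numeric())   # Inf, with a warning
max(numeric())   # -Inf, with a warning
```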
Sometimes a simpler approach is to perform the summary and then make the implicit missings explicit with `complete()`.