Polishing missing values

This commit is contained in:
Hadley Wickham 2022-05-10 21:04:14 -05:00
parent 721ba68ac4
commit 0ea0ce5e14
1 changed file with 36 additions and 32 deletions


@ -1,18 +1,17 @@
# Missing values {#missing-values}
```{r, results = "asis", echo = FALSE}
status("polishing")
```
## Introduction
You've already learned the basics of missing values earlier in the book: you first saw them in Section \@ref(summarize), where they interfered with computing summary statistics, and you learned about their infectious nature and how to check for their presence in Section \@ref(na-comparison).
In this chapter, we'll come back to missing values in more depth, so you can learn more of the details.
We'll start by discussing some general tools for working with explicitly missing values, i.e. those recorded as `NA`.
We'll then explore the idea of implicitly missing values, values that are simply absent from your data, and show some tools you can use to make them explicit.
We'll finish off with a related discussion of empty groups, caused by factor levels that don't appear in the data.
### Prerequisites
@ -24,12 +23,12 @@ library(tidyverse)
## Explicit missing values
To begin, let's explore a few handy tools for creating or eliminating explicit missing values, i.e. cells where you see an `NA`.
In the following sections you'll learn how to carry the last observation forward, convert `NA`s to fixed values, and convert fixed values to `NA`s; you'll also meet the special variant of `NA` known as "not a number".
### Last observation carried forward
Missing values are commonly used as a data entry convenience.
When data is entered by hand, a missing value sometimes indicates that the value in the previous row has been repeated:
```{r}
treatment <- tribble(
@ -42,18 +41,19 @@ treatment <- tribble(
```
You can fill in these missing values with `tidyr::fill()`.
It works like `select()`, taking a set of columns:
```{r}
treatment |>
fill(everything())
```
This treatment is sometimes called "last observation carried forward", or **locf** for short.
You can use the `.direction` argument to fill in missing values that have been generated in more exotic ways.
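For example, a minimal sketch with the `treatment` data above, filling upwards from the next observation instead of downwards:

```{r}
treatment |>
  fill(everything(), .direction = "up")
```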
### Fixed values
Sometimes missing values represent some fixed and known value, most commonly 0.
You can use `dplyr::coalesce()` to replace them:
```{r}
@ -61,7 +61,7 @@ x <- c(1, 4, 5, 7, NA)
coalesce(x, 0)
```
You could use `mutate()` together with `across()` to apply this treatment to (say) every numeric column in a data frame:
```{r, eval = FALSE}
df |>
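  # a sketch of the elided step: replace NA with 0 in every numeric column
  mutate(across(where(is.numeric), \(x) coalesce(x, 0)))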
@ -70,8 +70,8 @@ df |>
### Sentinel values
Sometimes you'll hit the opposite problem where some concrete value actually represents a missing value.
This typically arises in data generated by older software that doesn't have a proper way to represent missing values, so it must instead use some special sentinel value like 99 or -999.
If possible, handle this when reading in the data, for example, by using the `na` argument to `readr::read_csv()`.
If you discover the problem later, or your data source doesn't provide a way to handle it on read, you can use `dplyr::na_if()`:
@ -81,7 +81,7 @@ x <- c(1, 4, 5, 7, -99)
na_if(x, -99)
```
You could apply this transformation to every numeric column in a data frame with the following code.
```{r, eval = FALSE}
df |>
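  # a sketch of the elided step: convert the sentinel -99 to NA in every numeric column
  mutate(across(where(is.numeric), \(x) na_if(x, -99)))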
@ -113,9 +113,9 @@ sqrt(-1)
## Implicit missing values
So far we've talked about missing values that are **explicitly** missing, i.e. you can see an `NA` in your data.
But missing values can also be **implicitly** missing, if an entire row is simply absent from the data.
Let's illustrate the difference with a simple data set that records the price of some stock each quarter:
```{r}
stocks <- tibble(
@ -137,9 +137,9 @@ One way to think about the difference is with this Zen-like koan:
>
> An implicit missing value is the absence of a presence.
Sometimes you want to make implicit missings explicit in order to have something physical to work with.
In other cases, explicit missings are forced upon you by the structure of the data and you want to get rid of them.
The following sections discuss some tools for moving between implicit and explicit missingness.
### Pivoting
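Making data wider can make implicit missing values explicit, because every combination of the rows and new columns must have some value. As a sketch with the `stocks` data above:

```{r}
stocks |>
  pivot_wider(names_from = qtr, values_from = price)
```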
@ -160,16 +160,17 @@ See the examples in Chapter \@ref(tidy-data) for more details.
### Complete
`tidyr::complete()` allows you to generate explicit missing values by providing a set of variables that define the combination of rows that should exist.
For example, we know that all combinations of `year` and `qtr` should exist in the `stocks` data:
```{r}
stocks |>
complete(year, qtr)
```
Typically, you'll call `complete()` with names of existing variables, filling in the missing combinations.
However, sometimes the individual variables are themselves incomplete, so you can instead provide your own data.
For example, you might know that the `stocks` dataset is supposed to run from 2019 to 2021, so you could explicitly supply those values for `year`:
```{r}
stocks |>
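  # a sketch of the elided step: supply the full range of years by hand
  complete(year = 2019:2021, qtr)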
@ -178,7 +179,7 @@ stocks |>
If the range of a variable is correct, but not all values are present, you could use `full_seq(x, 1)` to generate all values from `min(x)` to `max(x)` spaced out by 1.
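For instance, a quick sketch of `full_seq()` filling in the gaps in a vector:

```{r}
full_seq(c(2, 4, 7), 1)  # 2 3 4 5 6 7
```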
In some cases, the complete set of observations can't be generated by a simple combination of variables.
In that case, you can do manually what `complete()` does for you: create a data frame that contains all the rows that should exist (using whatever combination of techniques you need), then combine it with your original dataset with `dplyr::full_join()`.
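Here's a sketch of that manual approach, assuming the `stocks` data above; `tidyr::expand_grid()` builds the full grid of rows that should exist:

```{r, eval = FALSE}
all_rows <- expand_grid(year = 2019:2021, qtr = 1:4)
all_rows |>
  full_join(stocks, by = c("year", "qtr"))
```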
### Joins
@ -209,7 +210,7 @@ If you're worried about this, and you have dplyr 1.1.0 or newer, you can use the
## Factors and empty groups
A final type of missingness is the empty group, a group that doesn't contain any observations, which can arise when working with factors.
For example, imagine we have a dataset that contains some health information about people:
```{r}
@ -226,8 +227,7 @@ And we want to count the number of smokers with `dplyr::count()`:
health |> count(smoker)
```
This dataset only contains non-smokers, but we know that smokers exist; the group of smokers is empty.
We can request that `count()` keep all the groups, even those not seen in the data, by using `.drop = FALSE`:
```{r}
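# a sketch of the elided call: .drop = FALSE keeps the empty smoker group
health |> count(smoker, .drop = FALSE)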
@ -271,20 +271,24 @@ health |>
)
```
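For reference, a sketch of the grouped summary being discussed, with the summary names assumed from the text below:

```{r}
health |>
  group_by(smoker, .drop = FALSE) |>
  summarize(
    n = n(),
    mean_age = mean(age),
    min_age = min(age),
    max_age = max(age)
  )
```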
We get some interesting results here because when summarizing an empty group, the summary functions are applied to zero-length vectors.
There's an important distinction between empty vectors, which have length 0, and missing values, which each have length 1.
```{r}
# A vector containing two missing values
x1 <- c(NA, NA)
length(x1)
# A vector containing nothing
x2 <- numeric()
length(x2)
```
All summary functions work with zero-length vectors, but they may return results that are surprising at first glance.
Here we see `mean(age)` returning `NaN` because `mean(age)` = `sum(age) / length(age)`, which here is 0 / 0.
`max()` and `min()` return -Inf and Inf for empty vectors, so if you combine the results with a non-empty vector of new data and recompute, you'll get the minimum or maximum of the new data[^missing-values-1].
[^missing-values-1]: In other words, `min(c(x, y))` is always equal to `min(min(x), min(y))`.
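You can see these behaviors directly on empty vectors:

```{r}
mean(numeric())  # NaN: 0 / 0
min(numeric())   # Inf, with a warning
max(numeric())   # -Inf, with a warning
```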
Sometimes a simpler approach is to perform the summary and then make the implicit missings explicit with `complete()`.