diff --git a/missing-values.Rmd b/missing-values.Rmd index c3bbf52..3dfee88 100644 --- a/missing-values.Rmd +++ b/missing-values.Rmd @@ -1,18 +1,17 @@ # Missing values {#missing-values} ```{r, results = "asis", echo = FALSE} -status("restructuring") +status("polishing") ``` ## Introduction -You've already learned the basics of missing values earlier in the the book. -You first saw them in Section \@ref(summarize) where they interfered with computing summary statistics, and you learned about their their infectious nature and how to check for their presence in Section \@ref(na-comparison). -Now we'll come back to them in more depth, so you can learn more of the details. +You've already learned the basics of missing values earlier in the book: you first saw them in Section \@ref(summarize) where they interfered with computing summary statistics, and you learned about their infectious nature and how to check for their presence in Section \@ref(na-comparison). +In this chapter, we'll come back to missing values in more depth, so you can learn more of the details. -We'll start by discussing some general tools for working with missing values recorded as `NA`s. +We'll start by discussing some general tools for working with explicitly missing values, i.e. values recorded as `NA`. We'll then explore the idea of implicitly missing values, values that are simply absent from your data, and show some tools you can use to make them explicit. -We'll finish off with a related discussion of empty groups, caused by factor levels that don't appear in the data. +We'll finish off with a related discussion of empty groups, caused by factor levels that don't appear in the data. ### Prerequisites @@ -24,12 +23,12 @@ library(tidyverse) ## Explicit missing values -To begin, let's explore a few handy tools for creating or eliminating missing explicit values, i.e. cells where you see an `NA`. +To begin, let's explore a few handy tools for creating or eliminating explicit missing values, i.e. cells where you see an `NA`. 
+In the following sections, you'll learn how to carry the last observation forward, convert `NA`s to fixed values, and convert some fixed values to `NA`s; you'll also learn about the special variant of `NA` known as "not a number". ### Last observation carried forward -A common use for missing values is as a data entry convenience. -Sometimes data that has been entered by hand, missing values indicate that the value in the previous row has been repeated: +Missing values are commonly used as a data entry convenience, where they indicate that the value in the previous row has been repeated: ```{r} treatment <- tribble( @@ -42,18 +41,19 @@ treatment <- tribble( ``` You can fill in these missing values with `tidyr::fill()`. -It works like `select()`, taking a set of columns where you want missing values to be replaced by last observation carried forward: +It works like `select()`, taking a set of columns: ```{r} treatment |> fill(everything()) ``` +This treatment is sometimes called "last observation carried forward", or **locf** for short. You can use the `direction` argument to fill in missing values that have been generated in more exotic ways. ### Fixed values -Some times missing values represent some fixed known value, mostly commonly 0. +Sometimes missing values represent some fixed and known value, most commonly 0. You can use `dplyr::coalesce()` to replace them: ```{r} x <- c(1, 4, 5, 7, NA) coalesce(x, 0) ``` -You could use `mutate()` together with `across()` to apply to every numeric column in a data frame: +You could use `mutate()` together with `across()` to apply this treatment to (say) every numeric column in a data frame: ```{r, eval = FALSE} df |> @@ -70,8 +70,8 @@ df |> ### Sentinel values -Sometimes you'll hit the opposite problem where some value should actually be treated as a missing value. 
-This typically arises in data generated by older software which doesn't have an explicit way to represent missing values, so it uses some special sentinel value like 99 or -999. +Sometimes you'll hit the opposite problem where some concrete value actually represents a missing value. +This typically arises in data generated by older software that doesn't have a proper way to represent missing values, so it must instead use some special sentinel value like 99 or -999. If possible, handle this when reading in the data, for example, by using the `na` argument to `readr::read_csv()`. If you discover the problem later, or your data source doesn't provide a way to handle it on read, you can use `dplyr::na_if()`: ```{r} x <- c(1, 4, 5, 7, -99) na_if(x, -99) ``` -And you could apply this transformation to every numeric column in a data frame with the following code. +You could apply this transformation to every numeric column in a data frame with the following code: ```{r, eval = FALSE} df |> @@ -113,9 +113,9 @@ sqrt(-1) ## Implicit missing values -So far we've talked with missing values that are **explicitly** missing, i.e. you can see them in your data as an `NA`. +So far we've talked about missing values that are **explicitly** missing, i.e. you can see an `NA` in your data. But missing values can also be **implicitly** missing, if an entire row of data is simply absent from the data. -Let's illustrate this idea with a simple data set, which records the price of a stock in each quarter: +Let's illustrate the difference with a simple data set that records the price of some stock each quarter: ```{r} stocks <- tibble( @@ -137,9 +137,9 @@ One way to think about the difference is with this Zen-like koan: > > An implicit missing value is the absence of a presence. -It's often useful to make implicit missings explicit so you have something physical that you can work with. -In other cases, explicit missings are forced upon you by the structure of the data. 
-The following sections discuss some tools for moving between implicit and explict. +Sometimes you want to make implicit missings explicit in order to have something physical to work with. +In other cases, explicit missings are forced upon you by the structure of the data and you want to get rid of them. +The following sections discuss some tools for moving between implicit and explicit missingness. ### Pivoting @@ -160,16 +160,17 @@ See the examples in Chapter \@ref(tidy-data) for more details. ### Complete -`tidyr::complete()` allows you to generate explicit missing values in tidy data by providing a set of variables that generates all rows that should exist: +`tidyr::complete()` allows you to generate explicit missing values by providing a set of variables that define the combination of rows that should exist. +For example, we know that all combinations of `year` and `qtr` should exist in the `stocks` data: ```{r} stocks |> complete(year, qtr) ``` -Typically, you'll call `complete()` with names of variables that already exist, filling in their missing combinations. -However, sometimes the individual variables are themselves incomplete, so you can also provide your own data. -For example, you might know that this dataset is supposed to run from 2019 to 2021, so you could explicitly supply those values for `year`: +Typically, you'll call `complete()` with names of existing variables, filling in the missing combinations. +However, sometimes the individual variables are themselves incomplete, so you can instead provide your own data. +For example, you might know that the `stocks` dataset is supposed to run from 2019 to 2021, so you could explicitly supply those values for `year`: ```{r} stocks |> @@ -178,7 +179,7 @@ stocks |> If the range of a variable is correct, but not all values are present, you could use `full_seq(x, 1)` to generate all values from `min(x)` to `max(x)` spaced out by 1. 
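To make the `full_seq()` behaviour mentioned above concrete, here is a minimal standalone sketch (an editor's addition, not part of the diff itself; it assumes only that tidyr is installed):

```{r}
library(tidyr)

# full_seq() generates every value from min(x) to max(x),
# spaced out by the given period (here 1), filling in the gaps
# between the observed values:
x <- c(2, 4, 7)
full_seq(x, 1)
#> [1] 2 3 4 5 6 7
```

Inside `complete()`, you would use it in a variable specification, e.g. `complete(year = full_seq(year, 1), qtr)`.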
-In some cases, the complete set of observations can't be generated by a simple combination of variables with `complete()`. +In some cases, the complete set of observations can't be generated by a simple combination of variables. In that case, you can do manually what `complete()` does for you: create a data frame that contains all the rows that should exist (using whatever combination of techniques you need), then combine it with your original dataset with `dplyr::full_join()`. ### Joins @@ -209,7 +210,7 @@ If you're worried about this, and you have dplyr 1.1.0 or newer, you can use the ## Factors and empty groups -A final type of missingness is empty groups, groups that don't contain any observation, which can arise when working with factors. +A final type of missingness is the empty group, a group that doesn't contain any observations, which can arise when working with factors. For example, imagine we have a dataset that contains some health information about people: ```{r} health <- @@ -226,8 +227,7 @@ And we want to count the number of smokers with `dplyr::count()`: health |> count(smoker) ``` -This dataset only contains non-smokers, but we know that smokers exist. -The group of non-smoker is empty. +This dataset only contains non-smokers, but we know that smokers exist; the group of non-smokers is empty. We can request `count()` to keep all the groups, even those not seen in the data by using `.drop = FALSE`: ```{r} @@ -271,20 +271,24 @@ health |> ) ``` -We get some interesting results here because we are a summarizing an empty group, so the summary functions are applied to zero-length vectors. -Zero-length vectors are empty, not missing: +We get some interesting results here because when summarizing an empty group, the summary functions are applied to zero-length vectors. +There's an important distinction between empty vectors, which have length 0, and missing values, which each have length 1. 
```{r} +# A vector containing two missing values x1 <- c(NA, NA) length(x1) +# A vector containing nothing x2 <- numeric() length(x2) ``` -Summary functions do work with zero-length vectors, but they may return results that are surprising at first glance. +All summary functions work with zero-length vectors, but they may return results that are surprising at first glance. Here we see `mean(age)` returning `NaN` because `mean(age)` = `sum(age)/length(age)` which here is 0/0. -`max()` and `min()` return -Inf and Inf for empty vectors so if you combine the results with a non-empty vector of new data and recompute you'll get min or max of the new data. +`max()` and `min()` return -Inf and Inf for empty vectors, so if you combine the results with a non-empty vector of new data and recompute, you'll get the minimum or maximum of the new data[^missing-values-1]. + +[^missing-values-1]: In other words, `min(c(x, y))` is always equal to `min(min(x), min(y))`. A sometimes simpler approach is to perform the summary and then make the implicit missings explicit with `complete()`.
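The zero-length behaviour described above can be checked directly; here is a minimal sketch (an editor's addition, not part of the diff itself):

```{r}
x <- numeric()

sum(x)  # 0: summing nothing gives the additive identity
mean(x) # NaN, because it is sum(x) / length(x), i.e. 0 / 0
min(x)  # Inf, with a warning about no non-missing arguments
max(x)  # -Inf, also with a warning
```

These identity values are exactly why the footnote's identity holds: combining `Inf` (from an empty `min()`) with a non-empty vector of new data and recomputing returns the minimum of the new data.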