From 7f75e635122d7a521e38656dd4b7b93181243345 Mon Sep 17 00:00:00 2001
From: Hadley Wickham
Date: Wed, 4 May 2022 16:08:53 -0500
Subject: [PATCH] More polishing of tidy data

---
 missing-values.Rmd | 44 ++++++++++++++++++++++++--------------------
 1 file changed, 24 insertions(+), 20 deletions(-)

diff --git a/missing-values.Rmd b/missing-values.Rmd
index 532d5c2..27f5016 100644
--- a/missing-values.Rmd
+++ b/missing-values.Rmd
@@ -16,39 +16,37 @@ We'll finish off with a related discussion of empty groups, caused by factor lev
 
 ### Prerequisites
 
-Most of the functions for working with missing values live in tidyr, but some are also in dplyr.
-So we'll load the whole tidyverse ando t
+The functions for working will missing data mostly come from dplyr and tidyr, which are core members of the tidyverse.
 
 ```{r setup, message = FALSE}
 library(tidyverse)
-library(nycflights13)
 ```
 
 ## Explicit missing values
 
-To begin, let's
+To begin, let's explore a few handy tools for creating or eliminating missing explicit values, i.e. cells where you see an `NA`.
 
 ### Last observation carried forward
 
-Another place that missing values arise is as a data entry convenience.
-Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:
+A common use for missing values is as a data entry convenience.
+Sometimes data that has been entered by hand, missing values indicate that the value in the previous row has been repeated:
 
 ```{r}
 treatment <- tribble(
   ~person, ~treatment, ~response,
   "Derrick Whitmore", 1, 7,
   NA, 2, 10,
-  NA, 3, 9,
+  NA, 3, NA,
   "Katherine Burke", 1, 4
 )
 ```
 
 You can fill in these missing values with `tidyr::fill()`.
-It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).
+It works like `select()`, taking a set of columns where you want missing values to be replaced by last observation carried forward:
 
 ```{r}
 treatment |>
-  fill(person)
+  fill(everything())
 ```
 
 You can use the `direction` argument to fill in missing values that have been generated in more exotic ways.
@@ -56,34 +54,38 @@ You can use the `direction` argument to fill in missing values that have been ge
 ### Fixed values
 
 Some times missing values represent some fixed known value, mostly commonly 0.
-You can use `dplyr::coalesce()` to replace.
+You can use `dplyr::coalesce()` to replace them:
 
 ```{r}
 x <- c(1, 4, 5, 7, NA)
 coalesce(x, 0)
 ```
 
-You could apply this to every numeric column in a data frame with:
+You could use `mutate()` together with `across()` to apply to every numeric column in a data frame:
 
 ```{r, eval = FALSE}
-df |> mutate(across(where(is.numeric), coalesce, 0))
+df |>
+  mutate(across(where(is.numeric), coalesce, 0))
 ```
 
 ### Sentinel values
 
-Sometimes you'll hit the opposite problem because some older software doesn't have an explicit way to represent missing values, so it might be recorded using some special sentinel value like 99 or -999.
-If possible, handle this when reading in the data, for example, by using the `na` argument to `read::read_csv()`.
-If you discover later, or from a data source that doesn't provide a way to handle on read, you can use `na_if()`
+Sometimes you'll hit the opposite problem where some value should actually be treated as a missing value.
+This typically arises in data generated by older software which doesn't have an explicit way to represent missing values, so it uses some special sentinel value like 99 or -999.
+
+If possible, handle this when reading in the data, for example, by using the `na` argument to `readr::read_csv()`.
+If you discover the problem later, or your data source doesn't provide a way to handle on it read, you can use `dplyr::na_if():`
 
 ```{r}
 x <- c(1, 4, 5, 7, -99)
 na_if(x, -99)
 ```
 
-You could apply this to every numeric column in a data frame with:
+And you could apply this transformation to every numeric column in a data frame with the following code.
 
 ```{r, eval = FALSE}
-df |> mutate(across(where(is.numeric), na_if, -99))
+df |>
+  mutate(across(where(is.numeric), na_if, -99))
 ```
 
 ### NaN
@@ -188,6 +190,8 @@ Often you can only know that values are missing from one dataset when you go to
 The following example shows how two `anti_join()`s reveals that we're missing information for four airports and 722 planes.
 
 ```{r}
+library(nycflights13)
+
 flights |>
   distinct(faa = dest) |>
   anti_join(airports)
@@ -216,7 +220,7 @@ health <- tibble(
 )
 ```
 
-And we want to count the number of smokers:
+And we want to count the number of smokers with `dplyr::count()`:
 
 ```{r}
 health |> count(smoker)
@@ -251,7 +255,7 @@ ggplot(health, aes(smoker)) +
   scale_x_discrete(drop = FALSE)
 ```
 
-`.drop = TRUE` also works with `group_by()`:
+`.drop = TRUE` also works with `dplyr::group_by()`:
 
 ```{r}
 health |>
@@ -295,4 +299,4 @@ health |>
   complete(smoker)
 ```
 
-The main drawback of this approach is that you get an `NA` for the count, even though you know that's zero.
+The main drawback of this approach is that you get an `NA` for the count, even though you know that it's zero.