A little thinking about missing values

2022-03-31 08:10:52 -05:00 · 2022-03-31 08:10:52 -05:00 · 27507a8bf2
parent 61d8a75908
commit 27507a8bf2
2 changed files with 60 additions and 62 deletions
--- a/missing-values.Rmd
+++ b/missing-values.Rmd
@@ -6,27 +6,21 @@ status("drafting")
 ## Introduction
-```{r}
+A value can be missing in one of two possible ways.
 It can be **explicitly** missing, i.e. flagged with `NA`, or it can be **implicitly** missing, i.e. simply not present in the data.
 This chapter will explore cases where implicit missing values can become explicit, and vice versa.
 ### Prerequisites
 ```{r setup, message = FALSE}
 library(tidyverse)
 library(nycflights13)
 ```
-Missing topics:
+## Motivation
-   Missing values generated from matching data frames (i.e. `left_join()` and `anti_join()`)
+Let's illustrate this idea with a very simple data set.
 -   Last observation carried forward and `tidyr::fill()`
 -   `coalesce()` and `na_if()`
 ## Explicit vs implicit missing values {#missing-values-tidy}
 Changing the representation of a dataset brings up an important subtlety of missing values.
 Surprisingly, a value can be missing in one of two possible ways:
 -   **Explicitly**, i.e. flagged with `NA`.
 -   **Implicitly**, i.e. simply not present in the data.
 Let's illustrate this idea with a very simple data set:
 ```{r}
 stocks <- tibble(
@@ -44,6 +38,47 @@ There are two missing values in this dataset:
 One way to think about the difference is with this Zen-like koan: An explicit missing value is the presence of an absence; an implicit missing value is the absence of a presence.
 ## Complete and joins
 If a dataset has a regular structure, you can make implicit missing values explicit with `complete()`:
 ```{r}
 stocks |>
  complete(year, qtr)
 ```
 If you know that the range observed in the data isn't complete, you can supply the full set of values yourself:
 ```{r}
 stocks |>
  complete(year = 2015:2017, qtr)
 ```
 `complete()` takes a set of columns, and finds all unique combinations.
 It then ensures the original dataset contains all those values, filling in explicit `NA`s where necessary.
 ```{r}
 stocks |> 
  expand(year, qtr) |> 
  left_join(stocks)
 ```
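 As a rough check of that equivalence, here is a minimal sketch using a small made-up tibble (`sales` and its columns are hypothetical, not part of this chapter's data) that compares `complete()` with `expand()` followed by `left_join()`:
 ```{r}
 library(dplyr)
 library(tidyr)
 # Hypothetical data: the row for 2021, quarter 2 is implicitly missing
 sales <- tibble(
   year = c(2020, 2020, 2021),
   qtr = c(1, 2, 1),
   revenue = c(10, 12, 11)
 )
 # complete() adds the missing combination with an explicit NA
 completed <- sales |>
   complete(year, qtr)
 # The same result built by hand: all combinations, then a join back to the data
 joined <- sales |>
   expand(year, qtr) |>
   left_join(sales, by = c("year", "qtr"))
 identical(completed$revenue, joined$revenue)
 ```
 Both approaches should give four rows, with 2021 quarter 2 carrying an explicit `NA` for `revenue`.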
 Other times missing values might be defined by another dataset.
 ```{r}
 flights |> 
  distinct(faa = dest) |> 
  anti_join(airports)
 flights |> 
  distinct(tailnum) |> 
  anti_join(planes)
 ```
 ## Pivoting {#missing-values-tidy}
 Changing the representation of a dataset brings up an important subtlety of missing values.
 The way that a dataset is represented can make implicit missing values explicit.
 For example, we can make the implicit missing value explicit by putting years in the columns:
@@ -65,15 +100,7 @@ stocks |>
  )
 ```
-Another important tool for making missing values explicit in tidy data is `complete()`:
+## Last observation carried forward
 ```{r}
 stocks |>
  complete(year, qtr)
 ```
 `complete()` takes a set of columns, and finds all unique combinations.
 It then ensures the original dataset contains all those values, filling in explicit `NA`s where necessary.
 There's one other important tool that you should know for working with missing values.
 Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:
@@ -96,41 +123,8 @@ treatment |>
  fill(person)
 ```
-`group_by` + `.drop = FALSE`
+## Factors
-### Exercises
+-   factors: `group_by` + `.drop = FALSE`
-1.  Compare and contrast the `fill` arguments to `pivot_wider()` and `complete()`.
+## 
 2.  What does the `.direction` argument to `fill()` do?
 ## dplyr verbs
 `filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values.
 If you want to preserve missing values, ask for them explicitly:
 ```{r}
 df <- tibble(x = c(1, NA, 3))
 filter(df, x > 1)
 filter(df, is.na(x) | x > 1)
 ```
 Missing values are always sorted at the end, even with `desc()`:
 ```{r}
 df <- tibble(x = c(5, 2, NA))
 arrange(df, x)
 arrange(df, desc(x))
 ```
 Explain the warning here
 ```{r, eval = FALSE}
 flights |> 
  group_by(dest) |> 
  summarise(max_delay = max(arr_delay, na.rm = TRUE))
 ```
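 One plausible explanation, sketched with base R only: if a group has no non-missing values, `na.rm = TRUE` leaves `max()` with nothing to compare, so it returns `-Inf` and warns. That some destination in `flights` has only missing `arr_delay` values is an assumption here, and `all_missing` is a made-up vector standing in for such a group.
 ```{r}
 # Hypothetical group whose arrival delays are all missing
 all_missing <- c(NA_real_, NA_real_)
 # Capture the warning message instead of letting it print
 msg <- tryCatch(
   max(all_missing, na.rm = TRUE),
   warning = function(w) conditionMessage(w)
 )
 msg
 # The value returned once the warning is silenced
 res <- suppressWarnings(max(all_missing, na.rm = TRUE))
 res
 ```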
 ## Exercises
 1.  Why is `NA ^ 0` not missing? Why is `NA | TRUE` not missing? Why is `FALSE & NA` not missing? Can you figure out the general rule? (`NA * 0` is a tricky counterexample!)
--- a/numbers.Rmd
+++ b/numbers.Rmd
@@ -344,6 +344,10 @@ slide_vec(x, sum, .before = 2, .after = 2, .complete = TRUE)
 These are often used with numbers, but can be applied to most other column types.
 ### Missing values
 `coalesce()`
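 As a placeholder sketch (the vectors and the `-99` sentinel are made up for illustration): `coalesce()` replaces missing values with a fallback, while `na_if()` goes the other way and turns a sentinel value into `NA`.
 ```{r}
 library(dplyr)
 # Replace NA with a fallback value
 x <- c(1, NA, 3)
 filled <- coalesce(x, 0)
 filled
 # Turn a sentinel value (here -99, a hypothetical missing-data code) into NA
 y <- c(1, -99, 3)
 cleaned <- na_if(y, -99)
 cleaned
 ```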
 ### Ranks
 dplyr provides a number of ranking functions, but you should start with `dplyr::min_rank()`.