
# Vector tools

```{r, results = "asis", echo = FALSE}
status("drafting")
```

## Introduction

`%in%`

`c()`

```{r}
library(tidyverse)
library(nycflights13)

not_cancelled <- flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))
```

## Counts

-   Counts: You've seen `n()`, which takes no arguments and returns the size of the current group.
    To count the number of non-missing values, use `sum(!is.na(x))`.
    To count the number of distinct (unique) values, use `n_distinct(x)`.

    ```{r}
    # Which destinations have the most carriers?
    not_cancelled %>%
      group_by(dest) %>%
      summarise(carriers = n_distinct(carrier)) %>%
      arrange(desc(carriers))
    ```

    Counts are so useful that dplyr provides a simple helper if all you want is a count:

    ```{r}
    not_cancelled %>%
      count(dest)
    ```

    Just like with `group_by()`, you can also provide multiple variables to `count()`.

    ```{r}
    not_cancelled %>%
      count(carrier, dest)
    ```

    You can optionally provide a weight variable.
    For example, you could use this to "count" (sum) the total number of miles a plane flew:

    ```{r}
    not_cancelled %>%
      count(tailnum, wt = distance)
    ```
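
The counting idioms above can be checked on a small example.
This is a minimal sketch: `trips` and its columns are hypothetical toy data, not part of nycflights13.
It compares `n()` with `sum(!is.na(x))`, and a weighted `count()` with an explicit grouped sum.

```{r}
library(dplyr)

# Hypothetical toy data, invented for illustration only
trips <- tibble(
  plane = c("A", "A", "B", "B", "B"),
  miles = c(100, 250, 80, 40, 120),
  delay = c(5, NA, 0, 12, NA)
)

# n() counts rows; sum(!is.na(x)) counts only the non-missing values
trip_counts <- trips %>%
  group_by(plane) %>%
  summarise(
    n = n(),
    n_recorded = sum(!is.na(delay))
  )

# A weighted count sums the weight variable within each group ...
by_count <- trips %>%
  count(plane, wt = miles)

# ... so it matches an explicit grouped sum of the same variable
by_sum <- trips %>%
  group_by(plane) %>%
  summarise(n = sum(miles))
```

Here `by_count` and `by_sum` hold the same totals per plane, which is why the weighted form is often read as a "sum" rather than a "count".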

## Window functions

-   Offsets: `lead()` and `lag()` allow you to refer to leading or lagging values.
    This allows you to compute running differences (e.g. `x - lag(x)`) or find when values change (`x != lag(x)`).
    They are most useful in conjunction with `group_by()`, which you'll learn about shortly.

    ```{r}
    (x <- 1:10)
    lag(x)
    lead(x)
    ```

-   Ranking: there are a number of ranking functions, but you should start with `min_rank()`.
    It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th).
    The default gives the smallest values the smallest ranks; use `desc(x)` to give the largest values the smallest ranks.

    ```{r}
    y <- c(1, 2, 2, NA, 3, 4)
    min_rank(y)
    min_rank(desc(y))
    ```

    If `min_rank()` doesn't do what you need, look at the variants `row_number()`, `dense_rank()`, `percent_rank()`, `cume_dist()`, and `ntile()`.
    See their help pages for more details.

    ```{r}
    row_number(y)
    dense_rank(y)
    percent_rank(y)
    cume_dist(y)
    ```

-   Measures of position: `first(x)`, `nth(x, 2)`, `last(x)`.
    These work similarly to `x[1]`, `x[2]`, and `x[length(x)]` but let you set a default value if that position does not exist (e.g. you're trying to get the 3rd element from a group that only has two elements).
    For example, we can find the first and last departure for each day:

    ```{r}
    not_cancelled %>%
      group_by(year, month, day) %>%
      summarise(
        first_dep = first(dep_time),
        last_dep = last(dep_time)
      )
    ```

    These functions are complementary to filtering on ranks.
    Filtering gives you all variables, with each observation in a separate row:

    ```{r}
    not_cancelled %>%
      group_by(year, month, day) %>%
      mutate(r = min_rank(desc(dep_time))) %>%
      filter(r %in% range(r))
    ```
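
The offset and position helpers described in the list above can also be tried on bare vectors.
The following is a minimal sketch: `readings` and `short` are hypothetical values chosen only to show running differences, change detection, and the `default` argument of `nth()`.

```{r}
library(dplyr)

# Hypothetical values, invented for illustration only
readings <- c(3, 3, 5, 5, 5, 2)

# Running difference: each value minus the one before it.
# The first element has no predecessor, so the result starts with NA.
diffs <- readings - lag(readings)

# Change detection: TRUE wherever a value differs from the previous one
changed <- readings != lag(readings)

# nth() falls back to `default` when the requested position does not exist
short <- c(10, 20)
third_or_zero <- nth(short, 3, default = 0)
```

Because `lag()` pads with `NA`, both `diffs` and `changed` begin with a missing value; inside `filter()` that first row would be dropped unless it is handled explicitly.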

### Cumulative

-   Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: `cumsum()`, `cumprod()`, `cummin()`, `cummax()`; and dplyr provides `cummean()` for cumulative means.
    If you need rolling aggregates (i.e. a sum computed over a rolling window), try the RcppRoll package.

```{r}
x <- 1:10
cumsum(x)
cummean(x)
```
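
For comparison with the cumulative functions, a rolling sum can be sketched in base R without any extra package.
This is only an illustration of what "rolling window" means, using `stats::filter()` as a stand-in; it is an assumption for this sketch, not the package the text recommends.

```{r}
x <- 1:10

# Trailing 3-element sum: position i holds x[i - 2] + x[i - 1] + x[i].
# The first two positions have incomplete windows and are NA.
# stats::filter() is namespaced because dplyr::filter() masks it
# once dplyr is loaded.
roll_sum <- as.numeric(stats::filter(x, rep(1, 3), sides = 1))
roll_sum
```

Unlike `cumsum()`, which keeps growing over the whole vector, each element of `roll_sum` depends only on the three most recent values.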

Generalise to rolling and use slider package instead?

Another useful pair of functions is cumulative any, `cumany()`, and cumulative all, `cumall()`.
`cumany()` will be `TRUE` after it encounters the first `TRUE`, and `cumall()` will be `FALSE` after it encounters its first `FALSE`.
These are particularly useful in conjunction with `filter()` because they allow you to select:

-   `cumall(x)`: all cases before the first `FALSE`.
-   `cumall(!x)`: all cases before the first `TRUE`.
-   `cumany(x)`: all cases from the first `TRUE` onwards (inclusive).
-   `cumany(!x)`: all cases from the first `FALSE` onwards (inclusive).

```{r}
df <- data.frame(
  date = as.Date("2020-01-01") + 0:6,
  balance = c(100, 50, 25, -25, -50, 30, 120)
)
# all rows from the first overdraft onwards (inclusive)
df %>% filter(cumany(balance < 0))
# all rows before the first overdraft
df %>% filter(cumall(!(balance < 0)))
```
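
The same behaviour is easier to see on a bare logical vector.
In this minimal sketch, `flags` is a hypothetical vector invented for illustration.

```{r}
library(dplyr)

# Hypothetical logical vector, invented for illustration only
flags <- c(FALSE, FALSE, TRUE, FALSE, TRUE)

# TRUE only for the positions before the first TRUE in `flags`
before_first_true <- cumall(!flags)

# TRUE from the first TRUE in `flags` onwards, including that position
from_first_true <- cumany(flags)
```

The two results are complements of each other, so together they split the vector at the first `TRUE`.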

### dplyr

```{r}
flights_sml <- select(flights,
  year:day,
  ends_with("delay"),
  distance,
  air_time
)
```

-   Find the worst members of each group:

    ```{r}
    flights_sml %>%
      group_by(year, month, day) %>%
      filter(rank(desc(arr_delay)) < 10)
    ```

-   Find all groups bigger than a threshold:

    ```{r}
    popular_dests <- flights %>%
      group_by(dest) %>%
      filter(n() > 365)
    popular_dests
    ```

-   Standardise to compute per-group metrics:

    ```{r}
    popular_dests %>%
      filter(arr_delay > 0) %>%
      mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
      select(year:day, dest, arr_delay, prop_delay)
    ```

A grouped filter is a grouped mutate followed by an ungrouped filter.
I generally avoid them except for quick and dirty manipulations: otherwise it's hard to check that you've done the manipulation correctly.
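
The equivalence between a grouped filter and a grouped mutate followed by an ungrouped filter can be sketched as follows.
`toy` and the helper column `above` are hypothetical names used only for this illustration.

```{r}
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  g = c("a", "a", "b", "b", "b"),
  v = c(1, 5, 2, 8, 3)
)

# Grouped filter: keep rows above their own group's mean
grouped_filter <- toy %>%
  group_by(g) %>%
  filter(v > mean(v)) %>%
  ungroup()

# Same result in two explicit steps: grouped mutate, then ungrouped filter
two_step <- toy %>%
  group_by(g) %>%
  mutate(above = v > mean(v)) %>%
  ungroup() %>%
  filter(above) %>%
  select(-above)
```

The two-step version leaves the intermediate column available for inspection before the filter is applied, which makes the manipulation easier to check.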

Functions that work most naturally in grouped mutates and filters are known as window functions (vs. the summary functions used for summaries).
You can learn more about useful window functions in the corresponding vignette: `vignette("window-functions")`.

### Exercises

1.  Find the 10 most delayed flights using a ranking function.
    How do you want to handle ties?
    Carefully read the documentation for `min_rank()`.

2.  Which plane (`tailnum`) has the worst on-time record?

3.  What time of day should you fly if you want to avoid delays as much as possible?

4.  For each destination, compute the total minutes of delay.
    For each flight, compute the proportion of the total delay for its destination.

5.  Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave.
    Using `lag()`, explore how the delay of a flight is related to the delay of the immediately preceding flight.

6.  Look at each destination.
    Can you find flights that are suspiciously fast (i.e. flights that represent a potential data entry error)?
    Compute the air time of a flight relative to the shortest flight to that destination.
    Which flights were most delayed in the air?

7.  Find all destinations that are flown by at least two carriers.
    Use that information to rank the carriers.

## Recycling rules

Base R.

Tidyverse.
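
The two headings above are still stubs.
As a minimal sketch of the contrast they point at, and assuming the comparison intended is between base R arithmetic and `tibble()`, the following shows that base R recycles a shorter vector whose length divides the longer one, while `tibble()` recycles only vectors of length 1.

```{r}
library(tibble)

# Base R: the shorter vector is silently repeated to match the longer one
base_sum <- 1:6 + 1:2

# tibble(): a length-1 value is recycled to every row ...
ok <- tibble(a = 1:4, b = 0)

# ... but any other length mismatch is an error rather than being recycled
mismatch <- tryCatch(
  tibble(a = 1:4, b = 1:2),
  error = function(e) "error"
)
```

Here `mismatch` ends up holding the string `"error"`, because the `tibble()` call fails instead of repeating `1:2`.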