# Iteration
```{r, include=FALSE}
library(purrr)
```
One part of reducing duplication is writing functions. Functions allow you to identify repeated patterns of code and extract them out into independent pieces that you can reuse and easily update as your code changes. Iteration helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets. (Generally, you shouldn't need to use explicit iteration to deal with different subsets of your data: in most cases the implicit iteration in dplyr will take care of that problem for you.)

In this chapter you'll learn about two important iteration tools: for loops and functional programming. For loops are a great place to start because they make iteration very explicit, so it's obvious what's happening. However, that explicitness is also their downside: for loops are quite verbose, and require quite a bit of book-keeping code. One of the goals of functional programming is to extract common for-loop patterns out into their own functions. Once you master the vocabulary, this allows you to solve many common iteration problems with less code, more ease, and less chance of error.

## For loops
Imagine we have this simple data frame:
```{r}
df <- data.frame(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
```
And we want to compute the median of every column. One way to do that is to use a for loop:
```{r}
results <- vector("double", ncol(df))  # 1. results
for (i in seq_along(df)) {             # 2. sequence
  results[[i]] <- median(df[[i]])      # 3. body
}
results
```
Every for loop has three main components:

1. The __results__: `results <- vector("double", ncol(df))`.

    Before you start the for loop, you must always allocate sufficient space
    for the output. This is very important for efficiency: if you grow
    the output at each iteration using `c()`, `rbind()`, or similar,
    your for loop will be very slow.

    A general way of creating an empty vector of a given length is the `vector()`
    function. It has two arguments: the type of the vector ("logical",
    "integer", "double", or "character") and the length of the vector.
    (There's a small illustration of the difference this makes after this list.)

1. The __sequence__: `i in seq_along(df)`. This determines what to loop over:
    each run of the for loop will assign `i` to a different value from
    `seq_along(df)`. It's useful to think of `i` as a pronoun.

    You might not have seen `seq_along()` before. It's a safe version of the
    familiar `1:length(l)`. There's one important difference in behaviour: if
    you have a zero-length vector, `seq_along()` does the right thing:

    ```{r}
    y <- vector("double", 0)
    seq_along(y)
    1:length(y)
    ```

    It's unlikely that you've deliberately created a zero-length vector, but
    they're easy to create accidentally.

1. The __body__: `results[[i]] <- median(df[[i]])`. This is the code that does
    the work. It's run repeatedly, each time with a different value for `i`.
    The first iteration will run `results[[1]] <- median(df[[1]])`,
    the second will run `results[[2]] <- median(df[[2]])`, and so on.
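
As a quick aside (not part of the original text), here's a minimal sketch of what `vector()` gives you, and of the grow-with-`c()` pattern that the first component warns against:

```{r}
vector("double", 4)  # preallocated: four zeroes, ready to be filled in

# The slow alternative: growing the output at each iteration
grow <- double()
for (i in seq_along(df)) {
  grow <- c(grow, median(df[[i]]))  # copies `grow` every time through the loop
}
grow
```
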
### Modifying input
We now have the tools to go back to our challenge from [functions]:
```{r}
df <- data.frame(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
```
To solve this with a for loop we need to identify the three pieces:

1. Results: overwrite the columns in the input. We don't need to create a new
    object; we can instead reuse the existing one.

1. Sequence: remember that we can think about a data frame as a list of
columns, so to iterate over each column we can use `seq_along(df)`.
1. Body: apply `rescale01()`.

This gives us:

```{r}
for (i in seq_along(df)) {
  df[[i]] <- rescale01(df[[i]])
}
```

### Unknown output length
Sometimes you might not know how long the output will be. There is one common pattern here that has a relatively simple workaround. For example, imagine you want to simulate some random vectors of random lengths:
```{r}
means <- c(0, 1, 2)
results <- double()
for (i in seq_along(means)) {
  n <- sample(100, 1)
  results <- c(results, rnorm(n, means[[i]]))
}
str(results)
```
In general this loop isn't going to be very efficient because in each iteration R has to copy all the data from the previous iterations. In technical terms you get "quadratic" behaviour, which means that a loop with three times as many elements would take nine times ($3^2$) as long to run. A better approach is to save the results in a list, and then combine them into a single vector after the loop is done:

```{r}
out <- vector("list", length(means))
for (i in seq_along(means)) {
  n <- sample(100, 1)
  out[[i]] <- rnorm(n, means[[i]])
}
str(out)
```
Then you can use a function like `unlist()` or `purrr::flatten_dbl()` to collapse the list into a single vector. This pattern occurs in other places too (both variations are sketched in a short example after this list):

1. You might be generating a long string. Instead of `paste()`ing together each
iteration, save the results in a character vector and then run
`paste(results, collapse = "")` to combine the individual results into
a single string.
1. You might be generating a big data frame. Instead of `rbind()`ing the results
    together after each run, save them in a list and then use
    `dplyr::bind_rows(results)` to combine them into a single
    data frame.
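
Here's a minimal sketch of both patterns; the `pieces` and `rows` objects are made up purely for illustration:

```{r}
# Character vector pattern: collect pieces, then paste() once at the end
pieces <- character(3)
for (i in seq_along(pieces)) {
  pieces[[i]] <- paste0("chunk-", i)
}
paste(pieces, collapse = "")

# Data frame pattern: collect rows in a list, then bind once at the end
rows <- vector("list", 3)
for (i in seq_along(rows)) {
  rows[[i]] <- data.frame(id = i, value = rnorm(1))
}
dplyr::bind_rows(rows)
```
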
### Looping patterns
There are three basic ways to loop over a vector:

1. Loop over the elements: `for (x in xs)`. Most useful for side-effects,
but it's difficult to save the output efficiently.
1. Loop over the numeric indices: `for (i in seq_along(xs))`. Most common
form if you want to know the element (`xs[[i]]`) and its position.
1. Loop over the names: `for (nm in names(xs))`. This gives you the name,
    which you can use to access the value with `xs[[nm]]`. It's useful if you
    want to use the name in a plot title or a file name.

The most general form uses `seq_along(xs)`, because from the position you can access both the name and the value:

```{r, eval = FALSE}
for (i in seq_along(xs)) {
  name <- names(xs)[[i]]
  value <- xs[[i]]
}
```
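
For instance, here's a small sketch of using the position to recover both the name and the value, the way you might when building file names; the `scores` vector is made up for illustration:

```{r}
scores <- c(maths = 90, art = 75, music = 82)
for (i in seq_along(scores)) {
  name <- names(scores)[[i]]
  value <- scores[[i]]
  cat(paste0(name, ".csv"), ":", value, "\n")
}
```
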
### Exercises
1. Convert the song "99 bottles of beer on the wall" to a function. Generalise
to any number of any vessel containing any liquid on any surface.
1. Convert the nursery rhyme "ten in the bed" to a function. Generalise it
to any number of people in any sleeping structure.
1. It's common to see for loops that don't preallocate the output and instead
    increase the length of a vector at each step:

    ```{r, eval = FALSE}
    results <- vector("integer", 0)
    for (i in seq_along(x)) {
      results <- c(results, lengths(x[[i]]))
    }
    results
    ```

    How does this affect performance?

## While loops

## For loops vs functionals
For loops are not as important in R as they are in other languages because, rather than writing your own for loops, you'll typically use prewritten functions that wrap up common for-loop patterns. These functions are important because they take care of the book-keeping code, letting you focus purely on what's happening.

Imagine you have a data frame and you want to compute the mean of each column. You might write code like this:
```{r}
df <- data.frame(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

results <- numeric(length(df))
for (i in seq_along(df)) {
  results[i] <- mean(df[[i]])
}
results
```
(Here we're taking advantage of the fact that a data frame is a list of the individual columns, so `length()` and `seq_along()` are useful.)
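
If you want to convince yourself that a data frame really is a list of columns, a quick check (not part of the original text) is:

```{r}
length(df)    # number of columns
names(df)     # column names
str(df[[1]])  # the first column, extracted as a plain vector
```
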
You realise that you're going to want to compute the means of every column pretty frequently, so you extract it out into a function:
```{r}
col_mean <- function(df) {
  results <- numeric(length(df))
  for (i in seq_along(df)) {
    results[i] <- mean(df[[i]])
  }
  results
}
```
But then you think it'd also be helpful to be able to compute the median or the standard deviation:
```{r}
col_median <- function(df) {
  results <- numeric(length(df))
  for (i in seq_along(df)) {
    results[i] <- median(df[[i]])
  }
  results
}

col_sd <- function(df) {
  results <- numeric(length(df))
  for (i in seq_along(df)) {
    results[i] <- sd(df[[i]])
  }
  results
}
```
I've now copied-and-pasted this function three times, so it's time to think about how to generalise it. Most of the code is for-loop boilerplate and it's hard to see the one piece (`mean()`, `median()`, `sd()`) that differs.

What would you do if you saw a set of functions like this:

```{r}
f1 <- function(x) abs(x - mean(x)) ^ 1
f2 <- function(x) abs(x - mean(x)) ^ 2
f3 <- function(x) abs(x - mean(x)) ^ 3
```
Hopefully, you'd notice that there's a lot of duplication, and extract it out into an additional argument:
```{r}
f <- function(x, i) abs(x - mean(x)) ^ i
```
You've reduced the chance of bugs (because you now have a third of the original code), and made it easy to generalise to new situations. We can do exactly the same thing with `col_mean()`, `col_median()` and `col_sd()`, by adding an argument that contains the function to apply to each column:

```{r}
col_summary <- function(df, fun) {
  out <- vector("numeric", length(df))
  for (i in seq_along(df)) {
    out[i] <- fun(df[[i]])
  }
  out
}
col_summary(df, median)
col_summary(df, min)
```
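
One refinement you might consider (a sketch, not something introduced in the text; the name `col_summary2` is made up) is forwarding extra arguments to `fun` with `...`, so callers can pass options like `na.rm = TRUE`:

```{r}
col_summary2 <- function(df, fun, ...) {
  out <- vector("numeric", length(df))
  for (i in seq_along(df)) {
    out[i] <- fun(df[[i]], ...)  # extra arguments passed straight through
  }
  out
}
col_summary2(df, median, na.rm = TRUE)
```
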
The idea of using a function as an argument to another function is extremely powerful. It might take you a while to wrap your head around it, but it's worth the investment. In the rest of the chapter, you'll learn about and use the __purrr__ package, which provides a set of functions that eliminate the need for for loops in many common scenarios. The apply family of functions in base R (`apply()`, `lapply()`, `tapply()`, etc) solves a similar problem, but purrr is more consistent and thus easier to learn.

The goal of using purrr functions instead of for loops is to allow you to break common list manipulation challenges into independent pieces:

1. How can you solve the problem for a single element of the list? Once
you've solved that problem, purrr takes care of generalising your
solution to every element in the list.
1. If you're solving a complex problem, how can you break it down into
bite sized pieces that allow you to advance one small step towards a
solution? With purrr, you get lots of small pieces that you can
compose together with the pipe.

This structure makes it easier to solve new problems. It also makes it easier to understand your solutions to old problems when you re-read your old code.

In later chapters you'll learn how to apply these ideas when modelling. You can often use multiple simple models to help understand a complex dataset, or you might have multiple models because you're bootstrapping or cross-validating. The techniques you'll learn in this chapter will be invaluable.

### Exercises
1. Read the documentation for `apply()`. In the 2d case, what two for loops
does it generalise?
1. Adapt `col_summary()` so that it only applies to numeric columns.
    You might want to start with an `is_numeric()` function that returns
    a logical vector that has a `TRUE` corresponding to each numeric column.

## The map functions
The pattern of looping over a list and doing something to each element is so common that the purrr package provides a family of functions to do it for you. Each function always returns the same type of output so there are six variations based on what sort of result you want:

* `map()` returns a list.
* `map_lgl()` returns a logical vector.
* `map_int()` returns an integer vector.
* `map_dbl()` returns a double vector.
* `map_chr()` returns a character vector.
* `map_df()` returns a data frame.
* `walk()` returns nothing. Walk is a little different to the others because
it's called exclusively for its side effects, so it's described in more detail
later in [walk](#walk).

Each function takes a list as input, applies a function to each piece, and then returns a new vector that's the same length as the input. The type of the vector is determined by the specific map function. Usually you want to use the most specific variant available, falling back to `map()` only when there is no specialised equivalent.
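
To connect this back to the for loops above, here's a rough conceptual sketch of what a typed map function does; `my_map_dbl` is made up for illustration and is not how purrr actually implements `map_dbl()` (the real versions are written in C):

```{r}
my_map_dbl <- function(.x, .f, ...) {
  out <- vector("double", length(.x))  # allocate the results
  for (i in seq_along(.x)) {           # loop over the sequence
    out[[i]] <- .f(.x[[i]], ...)       # apply the function in the body
  }
  out
}
my_map_dbl(df, mean)
```
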
We can use these functions to perform the same computations as the previous for loops:
```{r}
map_dbl(df, mean)
map_dbl(df, median)
map_dbl(df, sd)
```
Compared to using a for loop, the focus is on the operation being performed (i.e. `mean()`, `median()`, `sd()`), not on the book-keeping required to loop over every element and store the results.

There are a few differences between the `map_*()` functions and `col_summary()`:

* All purrr functions are implemented in C. This makes them a little faster
at the expense of readability.
* The second argument, `.f`, the function to apply, can be a formula, a
character vector, or an integer vector. You'll learn about those handy
shortcuts in the next section.
* Any arguments after `.f` will be passed on to it each time it's called:

    ```{r}
    map_dbl(df, mean, trim = 0.5)
    ```

* The map functions also preserve names:

    ```{r}
    z <- list(x = 1:3, y = 4:5)
    map_int(z, length)
    ```

### Shortcuts
There are a few shortcuts that you can use with `.f` in order to save a little typing. Imagine you want to fit a linear model to each group in a dataset. The following toy example splits up the `mtcars` dataset into three pieces (one for each value of `cyl`) and fits the same linear model to each piece:

```{r}
models <- mtcars %>%
split(.$cyl) %>%
map(function(df) lm(mpg ~ wt, data = df))
```
The syntax for creating an anonymous function in R is quite verbose so purrr provides a convenient shortcut: a one-sided formula.
```{r}
models <- mtcars %>%
split(.$cyl) %>%
map(~lm(mpg ~ wt, data = .))
```
Here I've used `.` as a pronoun: it refers to the current list element (in the same way that `i` referred to the current index in the for loop). You can also use `.x` and `.y` to refer to up to two arguments. If you want to create a function with more than two arguments, do it the regular way!
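
As a tiny illustration (not from the original text), these two calls are equivalent; `.x` works the same way as `.` for a single argument:

```{r}
map_dbl(1:3, function(x) x ^ 2)
map_dbl(1:3, ~ .x ^ 2)
```
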
When you're looking at many models, you might want to extract a summary statistic like the $R^2$. To do that we need to first run `summary()` and then extract the component called `r.squared`. We could do that using the shorthand for anonymous functions:
```{r}
models %>%
map(summary) %>%
map_dbl(~.$r.squared)
```
But extracting named components is a common operation, so purrr provides an even shorter shortcut: you can use a string.
```{r}
models %>%
map(summary) %>%
map_dbl("r.squared")
```
You can also use a numeric vector to select elements by position:
```{r}
x <- list(list(1, 2, 3), list(4, 5, 6), list(7, 8, 9))
x %>% map_dbl(2)
```

### Base R
If you're familiar with the apply family of functions in base R, you might have noticed some similarities with the purrr functions:

* `lapply()` is basically identical to `map()`. There's no advantage to using
`map()` over `lapply()` except that it's consistent with all the other
functions in purrr.
* The base `sapply()` is a wrapper around `lapply()` that automatically tries
to simplify the results. This is useful for interactive work but is
problematic in a function because you never know what sort of output
you'll get:

    ```{r}
    x1 <- list(
      c(0.27, 0.37, 0.57, 0.91, 0.20),
      c(0.90, 0.94, 0.66, 0.63, 0.06),
      c(0.21, 0.18, 0.69, 0.38, 0.77)
    )
    x2 <- list(
      c(0.50, 0.72, 0.99, 0.38, 0.78),
      c(0.93, 0.21, 0.65, 0.13, 0.27),
      c(0.39, 0.01, 0.38, 0.87, 0.34)
    )

    threshold <- function(x, cutoff = 0.8) x[x > cutoff]
    str(sapply(x1, threshold))
    str(sapply(x2, threshold))
    ```

* `vapply()` is a safe alternative to `sapply()` because you supply an additional
    argument that defines the type. The only problem with `vapply()` is that
    it's a lot of typing: `vapply(df, is.numeric, logical(1))` is equivalent to
    `map_lgl(df, is.numeric)`.

    One advantage of `vapply()` over the map functions is that it can also
    produce matrices - the map functions only ever produce vectors.

* `map_df(x, f)` is effectively the same as `do.call("rbind", lapply(x, f))`
but under the hood is much more efficient.

### Exercises
1. How can you determine which columns in a data frame are factors?
(Hint: remember that data frames are lists.)
1. What happens when you use the map functions on vectors that aren't lists?
What does `map(1:5, runif)` do? Why?
1. What does `map(-2:2, rnorm, n = 5)` do? Why?
1. Rewrite `map(x, function(df) lm(mpg ~ wt, data = df))` to eliminate the
anonymous function.

## Dealing with failure
When you do many operations on a list, sometimes one will fail. When this happens, you'll get an error message, and no output. This is annoying: why does one failure prevent you from accessing all the other successes? How do you ensure that one bad apple doesn't ruin the whole barrel?

In this section you'll learn how to deal with this situation with a new function: `safely()`. `safely()` is an adverb: it takes a function (a verb) and returns a modified version. In this case, the modified function will never throw an error. Instead, it always returns a list with two elements:

1. `result` is the original result. If there was an error, this will be `NULL`.
1. `error` is an error object. If the operation was successful this will be
`NULL`.

(You might be familiar with the `try()` function in base R. It's similar, but because it sometimes returns the original result and it sometimes returns an error object, it's more difficult to work with.)

Let's illustrate this with a simple example, `log()`:

```{r}
safe_log <- safely(log)
str(safe_log(10))
str(safe_log("a"))
```
When the function succeeds the `result` element contains the result and the `error` element is `NULL`. When the function fails, the `result` element is `NULL` and the `error` element contains an error object.

`safely()` is designed to work with map:

```{r}
x <- list(1, 10, "a")
y <- x %>% map(safely(log))
str(y)
```
This would be easier to work with if we had two lists: one of all the errors and one of all the results. That's easy to get with `transpose()`.
```{r}
y <- y %>% transpose()
str(y)
```
It's up to you how to deal with the errors, but typically you'll either look at the values of `x` where `y` is an error or work with the values of y that are ok:
```{r}
is_ok <- y$error %>% map_lgl(is_null)
x[!is_ok]
y$result[is_ok] %>% flatten_dbl()
```
Purrr provides two other useful adverbs:

* Like `safely()`, `possibly()` always succeeds. It's simpler than `safely()`,
    because you give it a default value to return when there is an error.

    ```{r}
    x <- list(1, 10, "a")
    x %>% map_dbl(possibly(log, NA_real_))
    ```

* `quietly()` performs a similar role to `safely()`, but instead of capturing
    errors, it captures printed output, messages, and warnings:

    ```{r}
    x <- list(1, -1)
    x %>% map(quietly(log)) %>% str()
    ```

### Exercises
1. Challenge: read all the csv files in this directory. Which ones failed
    and why?

    ```{r, eval = FALSE}
    files <- dir("data", pattern = "\\.csv$")
    files %>%
      set_names(., basename(.)) %>%
      map_df(safely(readr::read_csv), .id = "filename")
    ```

## Parallel maps
So far we've mapped along a single list. But often you have multiple related lists that you need to iterate along in parallel. That's the job of the `map2()` and `pmap()` functions. For example, imagine you want to simulate some random normals with different means. You know how to do that with `map()`:

```{r}
mu <- list(5, 10, -3)
mu %>% map(rnorm, n = 10)
```
What if you also want to vary the standard deviation? You need to iterate along a vector of means and a vector of standard deviations in parallel. That's a job for `map2()` which works with two parallel sets of inputs:
```{r}
sigma <- list(1, 5, 10)
map2(mu, sigma, rnorm, n = 10)
```
`map2()` generates this series of function calls:
```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/lists-map2.png")
```
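
Spelled out (this is a sketch of what the diagram shows, not output captured from purrr), the generated calls look like:

```{r, eval = FALSE}
rnorm(5, 1, n = 10)    # mu[[1]], sigma[[1]]
rnorm(10, 5, n = 10)   # mu[[2]], sigma[[2]]
rnorm(-3, 10, n = 10)  # mu[[3]], sigma[[3]]
```
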
The arguments that vary for each call come before the function name, and arguments that are the same for every function call come afterwards.

Like `map()`, `map2()` is just a wrapper around a for loop:

```{r}
map2 <- function(x, y, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], y[[i]], ...)
  }
  out
}
```
You could also imagine `map3()`, `map4()`, `map5()`, `map6()` etc, but that would get tedious quickly. Instead, purrr provides `pmap()` which takes a list of arguments. You might use that if you wanted to vary the mean, standard deviation, and number of samples:
```{r}
n <- list(1, 3, 5)
args1 <- list(n, mu, sigma)
args1 %>% pmap(rnorm) %>% str()
```
That looks like:
```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/lists-pmap-unnamed.png")
```
However, instead of relying on position matching, it's better to name the arguments. This is more verbose, but it makes the code clearer.
```{r}
args2 <- list(mean = mu, sd = sigma, n = n)
args2 %>% pmap(rnorm) %>% str()
```
That generates longer, but safer, calls:
```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/lists-pmap-named.png")
```
Since the arguments are all the same length, it makes sense to store them in a data frame:
```{r}
params <- dplyr::data_frame(mean = mu, sd = sigma, n = n)
params$result <- params %>% pmap(rnorm)
params
```
As soon as your code gets complicated, I think a data frame is a good approach because it ensures that each column has a name and is the same length as all the other columns. We'll come back to this idea when we explore the intersection of dplyr, purrr, and model fitting.

### Invoking different functions
There's one more step up in complexity - as well as varying the arguments to the function you might also vary the function itself:
```{r}
f <- c("runif", "rnorm", "rpois")
param <- list(
  list(min = -1, max = 1),
  list(sd = 5),
  list(lambda = 10)
)
```
To handle this case, you can use `invoke_map()`:
```{r}
invoke_map(f, param, n = 5) %>% str()
```
```{r, echo = FALSE}
knitr::include_graphics("diagrams/lists-invoke.png")
```
The first argument is a list of functions or character vector of function names. The second argument is a list of lists giving the arguments that vary for each function. The subsequent arguments are passed on to every function.

You can use `dplyr::frame_data()` to make creating these matching pairs a little easier:

```{r, eval = FALSE}
# Needs dev version of dplyr
sim <- dplyr::frame_data(
  ~f,      ~params,
  "runif", list(min = -1, max = 1),
  "rnorm", list(sd = 5),
  "rpois", list(lambda = 10)
)
sim %>% dplyr::mutate(
  samples = invoke_map(f, params, n = 10)
)
```

## Walk {#walk}
Walk is an alternative to map that you use when you want to call a function for its side effects, rather than for its return value. You typically do this because you want to render output to the screen or save files to disk - the important thing is the action, not the return value. Here's a very simple example:
```{r}
x <- list(1, "a", 3)
x %>%
walk(print)
```
`walk()` is generally not that useful compared to `walk2()` or `pwalk()`. For example, if you had a list of plots and a vector of file names, you could use `pwalk()` to save each file to the corresponding location on disk:
```{r}
library(ggplot2)
plots <- mtcars %>%
split(.$cyl) %>%
map(~ggplot(., aes(mpg, wt)) + geom_point())
paths <- paste0(names(plots), ".pdf")
pwalk(list(paths, plots), ggsave, path = tempdir())
```
`walk()`, `walk2()` and `pwalk()` all invisibly return the `.x`, the first argument. This makes them suitable for use in the middle of pipelines.
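
For example, here's a small sketch (not from the original text) that prints a message in the middle of a pipeline and then keeps going:

```{r}
1:3 %>%
  walk(~ cat("processing", .x, "\n")) %>%  # side effect only
  map_dbl(~ .x * 10)                       # the original values carry on
```
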
## Predicates
Imagine we want to summarise each numeric column of a data frame. We could do it in two steps:

1. Find all numeric columns.
1. Summarise each column.

In code, that would look like:

```{r}
col_sum <- function(df, f) {
  is_num <- df %>% map_lgl(is_numeric)
  df[is_num] %>% map_dbl(f)
}
```
`is_numeric()` is a __predicate__: a function that returns either `TRUE` or `FALSE`. There are a number of purrr functions designed to work specifically with predicates (a few of them are illustrated after this list):

* `keep()` and `discard()` keeps/discards list elements where the predicate is
true.
* `head_while()` and `tail_while()` keep the first/last elements of a list,
    stopping as soon as they hit an element where the predicate is false.
* `some()` and `every()` determine if the predicate is true for any or all of
the elements.
* `detect()` and `detect_index()` find the first element where the predicate
    is true: `detect()` returns its value, `detect_index()` its position.
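
Here's a quick tour of a few of these, using a small made-up list:

```{r}
x <- list(1, "a", 3, 10)

x %>% keep(is.numeric) %>% str()     # only the numeric elements
x %>% discard(is.numeric) %>% str()  # everything else
x %>% some(is.character)             # is any element a character?
x %>% every(is.character)            # are all elements characters?
x %>% detect(is.numeric)             # value of the first numeric element
x %>% detect_index(is.character)     # position of the first character element
```
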
We could use `keep()` to simplify the summary function to:
```{r}
col_sum <- function(df, f) {
  df %>%
    keep(is.numeric) %>%
    map_dbl(f)
}
```

I like this formulation because you can easily read the sequence of steps.

### Exercises
1. A possible base R equivalent of `col_sum()` is:

    ```{r}
    col_sum3 <- function(df, f) {
      is_num <- sapply(df, is.numeric)
      df_num <- df[, is_num]
      sapply(df_num, f)
    }
    ```

    But it has a number of bugs, as illustrated with the following inputs:

    ```{r, eval = FALSE}
    df <- data.frame(z = c("a", "b", "c"), x = 1:3, y = 3:1)
    # OK
    col_sum3(df, mean)
    # Has problems: doesn't always return a numeric vector
    col_sum3(df[1:2], mean)
    col_sum3(df[1], mean)
    col_sum3(df[0], mean)
    ```

    What causes the bugs?