# Iteration {#sec-iteration}

```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("drafting")
```

## Introduction

In @sec-functions, we talked about how important it is to reduce duplication in your code by creating functions instead of copying-and-pasting.
Reducing code duplication has three main benefits:

1.  It's easier to see the intent of your code, because your eyes are drawn to what's different, not what stays the same.

2.  It's easier to respond to changes in requirements.
    As your needs change, you only need to make changes in one place, rather than remembering to change every place that you copied-and-pasted the code.

3.  You're likely to have fewer bugs because each line of code is used in more places.

One tool for reducing duplication is functions, which reduce duplication by identifying repeated patterns of code and extracting them into independent pieces that can be easily reused and updated.
Another tool for reducing duplication is **iteration**, which helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets.

Iteration is somewhat of a moving target in the tidyverse because we keep adding new features to make it easier to solve problems that previously required explicit iteration.
For example:

-   To draw one plot for each group you can use ggplot2's faceting.
-   To compute summary statistics for subgroups you can use `dplyr::group_by()` + `dplyr::summarise()`.
-   To read every .csv file in a directory you can pass a vector to `readr::read_csv()`.
-   To extract every element from a named list you can use `tidyr::unnest_wider()`.

In this section we'll show you three related sets of tools for manipulating each column in a data frame, reading each file in a directory, and saving objects.

We're going to give the very basics of iteration, focusing on the places where it comes up in an analysis.
But in general, iteration is a superpower: once you've solved one problem, you can apply iteration techniques to solve every similar problem.
You can learn more in <https://purrr.tidyverse.org> and the [Functionals chapter](https://adv-r.hadley.nz/functionals.html) of *Advanced R*.

### Prerequisites

We'll use a selection of useful iteration idioms from dplyr and purrr, both core members of the tidyverse.

```{r}
#| label: setup
#| message: false

library(tidyverse)
```

## Modifying multiple columns

Imagine you have this simple tibble:

```{r}
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
```

And you want to compute the median of every column.
You could do it with copy-and-paste:

```{r}
df %>% summarise(
  a = median(a),
  b = median(b),
  c = median(c),
  d = median(d),
)
```

But that breaks our rule of thumb: never copy and paste more than twice.
And you could imagine that this will get particularly tedious if you have tens or even hundreds of variables.
Instead you can use `across()`:

```{r}
df %>% summarise(
  across(a:d, median)
)
```

There are two arguments that you'll use every time:

-   The first argument specifies which columns you want to iterate over. It uses the same syntax as `select()`.
-   The second argument specifies what to do with each column.

There's another argument, `.names`, that's useful when you use `across()` with `mutate()`, and two variations, `if_any()` and `if_all()`, that work with `filter()`.
These are described in detail below.

### Which columns

The first argument to `across()`, `.cols`, selects the columns to transform.
This argument uses the same specifications as `select()` (@sec-select), so you can use functions like `starts_with()` and `ends_with()` to select variables based on their name.
There are two other techniques that you can use with both `select()` and `across()` that we didn't discuss earlier because they're particularly useful for `across()`: `everything()` and `where()`.

`everything()` is straightforward: it selects every (non-grouping) column!

```{r}
df <- tibble(
  grp = sample(2, 10, replace = TRUE),
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

df %>% 
  group_by(grp) %>% 
  summarise(across(everything(), median))
```

`where()` allows you to select columns based on their type:

-   `where(is.numeric)` selects all numeric columns.
-   `where(is.character)` selects all string columns.
-   `where(is.Date)` selects all date columns.
-   `where(is.POSIXct)` selects all date-time columns.
-   `where(is.logical)` selects all logical columns.

You can combine these in the usual `select()` way with Boolean algebra so that `!where(is.numeric)` selects all non-numeric columns and `starts_with("a") & where(is.logical)` selects all logical columns whose name starts with "a".
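
As a small sketch of how `where()` combines with `across()` and with other selection helpers, the example below uses a made-up tibble, `df_types`, that isn't part of this chapter's data:

```{r}
library(tidyverse)

# `df_types` is a hypothetical tibble used only for this illustration
df_types <- tibble(
  id = c("x", "y", "z"),
  a_flag = c(TRUE, FALSE, TRUE),
  a_num = c(1.5, 2.5, 3.5),
  b_num = c(10, 20, 30)
)

# Summarise only the numeric columns
num_means <- df_types |> 
  summarise(across(where(is.numeric), mean))
num_means

# Keep only the logical columns whose name starts with "a"
df_types |> 
  select(starts_with("a") & where(is.logical))
```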

### Extra arguments

The second argument, `.funs`, determines what happens to each column selected by the first argument.
In most cases, this will be the name of an existing function, but you can also create your own function inline, or supply multiple functions.

Let's motivate this problem with an example: what happens if we have some missing values?
It'd be nice to be able to pass along additional arguments to `median()`:

```{r}
rnorm_na <- function(n, n_na, mean = 0, sd = 1) {
  sample(c(rnorm(n - n_na, mean = mean, sd = sd), rep(NA, n_na)))
}

df <- tibble(
  a = rnorm_na(10, 2),
  b = rnorm_na(10, 2),
  c = rnorm_na(10, 4),
  d = rnorm(10)
)
df %>% summarise(
  across(a:d, median)
)
```

For complicated reasons, it's not easy to pass on arguments from `across()`, so instead we can create another function that wraps `median()` and calls it with the correct arguments.
We can write that compactly using R's anonymous function shorthand:

```{r}
df %>% summarise(
  across(a:d, \(x) median(x, na.rm = TRUE))
)
```

This is shorthand for creating a function, as below.
It's easier to remember because you just replace the eight letters of `function` with a single `\`.

```{r}
#| results: false
df %>% summarise(
  across(a:d, function(x) median(x, na.rm = TRUE))
)
```

As well as computing the median without missing values, it'd be nice to know how many missing values there were.
We can do that by supplying a named list of functions to `across()`:

```{r}
df %>% summarise(
  across(a:d, list(
    median = \(x) median(x, na.rm = TRUE),
    n_miss = \(x) sum(is.na(x))
  ))
)
```

Note that you could describe the names of the new columns using a glue specification (@sec-glue) like `{.col}_{.fn}` where `.col` is the name of the original column and `.fn` is the name of the function in the list.
That's not a coincidence because you can use the `.names` argument to set these names.

### Column names

The result of `across()` is named according to the specification provided in the `.names` argument.
We could specify our own if we wanted the name of the function to come first.
(You can't currently change the order of the columns).

```{r}
df %>% summarise(
  across(a:d, list(
    median = \(x) median(x, na.rm = TRUE),
    n_miss = \(x) sum(is.na(x))
  ), .names = "{.fn}_{.col}")
)
```

The `.names` argument is particularly important when you use `across()` with `mutate()`.
By default the outputs of `across()` are given the same names as the inputs.
This means that `across()` inside of `mutate()` will replace existing columns:

```{r}
df %>% mutate(
  across(a:d, \(x) x + 1)
)
```

If you'd like to instead create new columns, you can supply the `.names` argument which takes a glue specification where `{.col}` refers to the current column name.

```{r}
df %>% mutate(
  across(a:d, \(x) x * 2, .names = "{.col}_2")
)
```

### Filtering

`across()` is a great match for `summarise()` and `mutate()` but it's not such a great fit for `filter()` because you usually string together calls to multiple functions either with `|` or `&`.
So dplyr provides two variants of `across()` called `if_any()` and `if_all()`:

```{r}
df |> filter(is.na(a) | is.na(b) | is.na(c) | is.na(d))
# same as:
df |> filter(if_any(a:d, is.na))

df |> filter(is.na(a) & is.na(b) & is.na(c) & is.na(d))
# same as:
df |> filter(if_all(a:d, is.na))
```
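
To see the difference between the two variants on concrete rows, here is a minimal sketch using a hypothetical three-row tibble, `na_demo`, where one row has no missing values, one has a single missing value, and one is missing everywhere:

```{r}
library(tidyverse)

# `na_demo` is a made-up tibble used only for this illustration
na_demo <- tibble(
  a = c(1, NA, NA),
  b = c(2, 2, NA)
)

# Rows where at least one of the selected columns is missing
any_na <- na_demo |> filter(if_any(a:b, is.na))
any_na

# Rows where every selected column is missing
all_na <- na_demo |> filter(if_all(a:b, is.na))
all_na
```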

### Vs `pivot_longer()`

Before we go on, it's worth pointing out an interesting connection between `across()` and `pivot_longer()`.
In many cases, you can perform the same calculations by first pivoting the data and then performing the operations by group rather than by column.
For example, we could rewrite our multiple summary `across()` as:

```{r}
df |> 
  pivot_longer(a:d) |> 
  group_by(name) |> 
  summarise(
    median = median(value, na.rm = TRUE),
    n_miss = sum(is.na(value))
  )
```

This is a useful technique to know about because sometimes you'll hit a problem that's not currently possible to solve with `across()`: when you have groups of variables that you want to compute with simultaneously.
For example, imagine that our data frame contains both values and weights and we want to compute a weighted mean:

```{r}
df3 <- tibble(
  a_val = rnorm(10),
  a_w = runif(10),
  b_val = rnorm(10),
  b_w = runif(10),
  c_val = rnorm(10),
  c_w = runif(10),
  d_val = rnorm(10),
  d_w = runif(10)
)
```

There's currently no way to do this with `across()`[^iteration-1], but it's relatively straightforward with `pivot_longer()`:

[^iteration-1]: Maybe there will be one day, but currently we don't see how.

```{r}
df3_long <- df3 |> 
  pivot_longer(
    everything(), 
    names_to = c("group", ".value"), 
    names_sep = "_"
  )
df3_long

df3_long |> 
  group_by(group) |> 
  summarise(mean = weighted.mean(val, w))
```

If needed, you could `pivot_wider()` this back to the original form.
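
The round trip might look something like the sketch below. It uses a hypothetical two-row miniature of the values-and-weights data (`vw_wide`) and adds a `row` identifier before pivoting, an assumption made here so that `pivot_wider()` can tell which long rows belong to the same original row:

```{r}
library(tidyverse)

# `vw_wide` is a made-up miniature of the values-and-weights data
vw_wide <- tibble(
  a_val = c(1, 2), a_w = c(0.5, 0.5),
  b_val = c(3, 4), b_w = c(0.2, 0.8)
)

# Pivot longer, keeping a row id so the original rows can be recovered
vw_long <- vw_wide |> 
  mutate(row = row_number()) |> 
  pivot_longer(
    !row,
    names_to = c("group", ".value"),
    names_sep = "_"
  )

# Pivot back: names_glue rebuilds names like "a_val" and "a_w"
vw_back <- vw_long |> 
  pivot_wider(
    names_from = group,
    values_from = c(val, w),
    names_glue = "{group}_{.value}"
  ) |> 
  select(!row)
vw_back
```

The columns may come back in a different order than they started in, so reorder them with `select()` or `relocate()` if the order matters.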

### Exercises

1.  Compute the number of unique values in each column of `palmerpenguins::penguins`.
2.  Compute the mean of every column in `mtcars`.
3.  Group `diamonds` by `cut`, `clarity`, and `color` then count the number of observations and the mean of each numeric variable.
4.  What happens if you use a list of functions, but don't name them? How is the output named?
5.  It is possible to use `across()` inside `filter()` where it's equivalent to `if_all()`. Can you explain why?

## Reading multiple files

Imagine you have a directory full of Excel spreadsheets[^iteration-2] you want to read.
You could do it with copy and paste:

[^iteration-2]: If you instead had a directory of csv files with the same format, you can use `read_csv()` directly: `read_csv(c("data/y2019.csv", "data/y2020.csv", "data/y2021.csv", "data/y2022.csv"))`.

```{r}
#| eval: false
data2019 <- readxl::read_excel("data/y2019.xls")
data2020 <- readxl::read_excel("data/y2020.xls")
data2021 <- readxl::read_excel("data/y2021.xls")
data2022 <- readxl::read_excel("data/y2022.xls")
```

And then use `dplyr::bind_rows()` to combine them all together:

```{r}
#| eval: false
data <- bind_rows(data2019, data2020, data2021, data2022)
```

But you can imagine that this would get tedious quickly, since often you won't have four files, but more like 400.
In this section you'll first learn a little bit about the base `dir()` function which allows you to list all the files in a directory.
And then about `purrr::map()` which lets you repeatedly apply a function to each element of a vector, allowing you to read many files in one step.
And then we'll finish up with `purrr::list_rbind()` which takes a list of data frames and combines them all together.

### Listing files in a directory

The base `dir()` function lists the files in a directory.
Use the `pattern` argument, a regular expression, to filter the file names.
Always set `full.names = TRUE` so that the directory is included in each path.

Let's make this problem real with a folder of 12 Excel spreadsheets containing data from the gapminder package, which records information about multiple countries over time:

```{r}
paths <- dir("data/gapminder", pattern = "\\.xlsx$", full.names = TRUE)
paths
```
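
To see what `full.names` changes, here is a self-contained sketch that creates a throwaway directory with a few empty files. The directory and file names are hypothetical and exist only for this illustration:

```{r}
# Create a temporary directory holding three empty files
demo_dir <- file.path(tempdir(), "dir-demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("1952.xlsx", "1957.xlsx", "notes.txt"))))

# Without full.names you get bare file names, relative to demo_dir
dir(demo_dir, pattern = "\\.xlsx$")

# With full.names = TRUE the directory is included, so the paths can be
# handed directly to a reading function
demo_paths <- dir(demo_dir, pattern = "\\.xlsx$", full.names = TRUE)
file.exists(demo_paths)
```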

### Basic pattern

Now that we have the paths, we want to call `read_excel()` with each path.
Since in general we won't know how many elements there are, instead of putting each individual data frame in its own variable, we'll save them all into a list:

```{r}
#| eval: false
list(
  readxl::read_excel("data/gapminder/1952.xlsx"),
  readxl::read_excel("data/gapminder/1957.xlsx"),
  readxl::read_excel("data/gapminder/1962.xlsx"),
  ...,
  readxl::read_excel("data/gapminder/2007.xlsx")
)
```

The shortcut for this is the `map()` function.
`map(x, f)` is shorthand for:

```{r}
#| eval: false
list(
  f(x[[1]]),
  f(x[[2]]),
  ...,
  f(x[[n]])
)
```

`map()` is similar to `across()`, but instead of doing something to each column in a data frame, it does something to each element of a vector.
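
Before applying it to files, it may help to see `map()` on a tiny vector. This is a minimal sketch; the name `squares` is just an illustration:

```{r}
library(purrr)

# map() returns a list with one element per element of the input
squares <- map(1:3, \(i) i^2)
str(squares)
```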

We can use `map()` to get a list of data frames in one step with:

```{r}
files <- map(paths, readxl::read_excel)
length(files)

files[[1]]
```

(This is another data structure that doesn't display particularly compactly with `str()`, so you might want to load it into RStudio and inspect it with `View()`.)

Now we can use `purrr::list_rbind()` to combine that list of data frames into a single data frame:

```{r}
list_rbind(files)
```

Or we could combine in a single pipeline like this:

```{r}
#| results: false
paths |> 
  map(readxl::read_excel) |> 
  list_rbind()
```

What if we want to pass in extra arguments to `read_excel()`?
We use the same trick that we used with `across()`.
For example, it's often useful to peek at just the first few rows of the data:

```{r}
paths |> 
  map(\(path) readxl::read_excel(path, n_max = 1)) |> 
  list_rbind()
```

This really hammers in something that you might've noticed earlier: each individual sheet doesn't contain the year.
That's only recorded in the path.

### Data in the path

Sometimes the name of the file is itself data.
In this example, the file name contains the year, which is not otherwise recorded in the individual data frames.
To get that column into the final data frame, we need to do two things.

Firstly, we give the path vector names.
The easiest way to do this is with the `set_names()` function, which can optionally take a function.
Here we use `basename` to extract just the file name from the full path:

```{r}
paths <- paths |> set_names(basename) 
paths
```

Those names are automatically carried along by all the map functions, so the list of data frames will have those same names:

```{r}
#| eval: false
paths |> 
  map(readxl::read_excel) |> 
  names()
```

Then we use the `names_to` argument to `list_rbind()` to tell it which column to save the names to:

```{r}
paths |> 
  set_names(basename) |> 
  map(readxl::read_excel) |> 
  list_rbind(names_to = "year") |> 
  mutate(year = parse_number(year))
```

Here I used `readr::parse_number()` to turn year into a proper number.

If the path contains more data, do `paths <- paths |> set_names()` to set the names to the full path, and then use `tidyr::separate()` and friends to turn them into useful columns.

```{r}
paths |> 
  set_names() |> 
  map(readxl::read_excel) |> 
  list_rbind(names_to = "year") |> 
  separate(year, into = c(NA, "directory", "file", "ext"), sep = "[/.]")
```

### Get to a single data frame as quickly as possible

If you need to read and transform your data in some way you have two basic ways of structuring your code: doing a little iteration and a lot in a function, or doing a lot of iteration with simple functions.
Let's make that concrete with an example.

Say you want to read in a bunch of files, filter out missing values, pivot them, and then join them all together.
One way to approach the problem is to write a function that takes a file and does all those steps:

```{r}
#| eval: false
process_file <- function(path) {
  df <- read_csv(path)
  
  df |> 
    filter(!is.na(id)) |> 
    mutate(id = tolower(id)) |> 
    pivot_longer(jan:dec, names_to = "month")
}

all <- paths |> 
  map(process_file) |> 
  list_rbind()
```

Alternatively, you could write

```{r}
#| eval: false

data <- paths |> 
  map(read_csv) |> 
  list_rbind() 

data |> 
  filter(!is.na(id)) |> 
  mutate(id = tolower(id)) |> 
  pivot_longer(jan:dec, names_to = "month")
```

If you need to do more work to get `list_rbind()` to work, you should do it, but in general the sooner you can get everything into one big data frame the better.

This is particularly important if the structure of your data varies in some way because it's usually easier to understand the variations when you have them all in front of you.
It's also much easier to interactively experiment and figure out what the right approach is.

### Heterogeneous data

However, sometimes that's not possible because the data frames are sufficiently inconsistent that `list_rbind()` either fails or yields a data frame that's not very useful.
In that case, start by loading all the files:

```{r}
#| eval: false
files <- paths |> map(readxl::read_excel)
```

If the files have heterogeneous formats you might need to do more processing before you can successfully merge them.
You can use `map_if()` or `map_at()` to selectively modify inputs.
Use `map_if()` if it's easier to select the elements to transform with a function; use `map_at()` if you can tell based on their names.
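
As a rough sketch of both approaches, suppose two files were read into a list where one stores `id` as text and the other as integers. The list `dfs` and its element names below are hypothetical stand-ins for real files:

```{r}
library(tidyverse)

# `dfs` is a made-up list of data frames with inconsistent `id` types;
# list_rbind(dfs) would fail because character and integer can't be combined
dfs <- list(
  y2019 = tibble(id = c("A", "B"), n = 1:2),
  y2020 = tibble(id = 1:2, n = 3:4)
)

# map_if(): transform only the elements where the predicate returns TRUE
fixed <- dfs |> 
  map_if(
    \(df) !is.character(df$id),
    \(df) mutate(df, id = as.character(id))
  )

# map_at(): transform only the elements with the given names
fixed2 <- dfs |> 
  map_at("y2020", \(df) mutate(df, id = as.character(id)))

# Either version can now be combined
list_rbind(fixed, names_to = "file")
```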

After spending all this effort, save it to a new csv file.

In terms of organising your analysis project, you might want to have a file called `0-cleanup.R` that generates nice csv files to be used by the rest of your project.

If the files are really inconsistent, one useful way to get some traction is to think about the structure of the files as data itself.

```{r}
#| eval: false

paths |> 
  set_names(basename) |> 
  map(\(path) read_csv(path, n_max = 0)) |> 
  map(\(df) data.frame(cols = names(df))) |> 
  list_rbind(names_to = "name")
```

You could then think about pivoting or plotting this data to understand what the differences are.
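
One simple option, sketched here with in-memory data frames standing in for files, is to count how often each column name appears. The list `sheets` and its contents are hypothetical:

```{r}
library(tidyverse)

# `sheets` is a made-up list standing in for files with differing columns
sheets <- list(
  a.csv = tibble(id = 1, jan = 2, feb = 3),
  b.csv = tibble(id = 1, jan = 2),
  c.csv = tibble(ID = 1, jan = 2, feb = 3)
)

# One row per (file, column) pair, then count how many files have each column
col_summary <- sheets |> 
  map(\(df) tibble(cols = names(df))) |> 
  list_rbind(names_to = "name") |> 
  count(cols, sort = TRUE)
col_summary
```

Columns that appear in fewer files than expected, or under slightly different spellings, are good candidates for a closer look.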

### Handling failures

Sometimes you might not be able to read every file successfully.

```{r}
#| eval: false
paths |> 
  map(safely(\(path) readxl::read_excel(path)))
```

When you use the map functions to repeat many operations, the chances are much higher that one of those operations will fail.
When this happens, you'll get an error message, and no output.
This is annoying: why does one failure prevent you from accessing all the other successes?
How do you ensure that one bad apple doesn't ruin the whole barrel?

In this section you'll learn how to deal with this situation with a new function: `safely()`.
`safely()` is an adverb: it takes a function (a verb) and returns a modified version.
In this case, the modified function will never throw an error.
Instead, it always returns a list with two elements:

1.  `result` is the original result.
    If there was an error, this will be `NULL`.

2.  `error` is an error object.
    If the operation was successful, this will be `NULL`.

(You might be familiar with the `try()` function in base R.
It's similar, but because it sometimes returns the original result and it sometimes returns an error object it's more difficult to work with.)

Let's illustrate this with a simple example: `log()`:

```{r}
safe_log <- safely(log)
str(safe_log(10))
str(safe_log("a"))
```

When the function succeeds, the `result` element contains the result and the `error` element is `NULL`.
When the function fails, the `result` element is `NULL` and the `error` element contains an error object.

`safely()` is designed to work with `map()`:

```{r}
x <- list(1, 10, "a")
y <- x |> map(safely(log))
str(y)
```

```{r}
#| eval: false
paths |> 
  map(safely(read_csv))
```

This would be easier to work with if we had two lists: one of all the errors and one of all the output.
That's easy to get with `purrr::transpose()`:

```{r}
y <- y |> transpose()
str(y)
```

It's up to you how to deal with the errors, but typically you'll either look at the values of `x` where `y` is an error, or work with the values of `y` that are ok:

```{r}
is_ok <- y$error |> map_lgl(is_null)
x[!is_ok]
y$result[is_ok] |> flatten_dbl()
```

## Writing multiple outputs

So far we've focused on `map()`, which is designed for functions that return something.
But some functions don't return things; they instead do things (i.e. their return value isn't important).
This sort of function includes:

-   Saving data to a database.
-   Saving data to disk, like `readr::write_csv()`.
-   Saving plots to disk with `ggsave()`.

Rather than returning a useful value, these functions change the state of the world in some way.
In this section, you'll learn about `map()`'s friend `walk()`, which is designed to work with this sort of function.
Along the way you'll see how to use it to load multiple csv files into a database and turn multiple plots into files.

### Writing to a database

Sometimes when working with many files at once, it's not possible to load all your data into memory at once.
If you can't `map(files, read_csv)` how can you work with your data?
Well, one approach is to put it all into a database and then use dbplyr to access just the subsets that you need.

Sometimes the database package will provide a handy function that will take a vector of paths and load them all into the database.
This is the case with duckdb's `duckdb_read_csv()`:

```{r}
#| eval: false
duckdb::duckdb_read_csv(con, "cars", paths)
```

But with other databases you'll need to do it yourself.
The key idea is to write a function that loads your data then immediately appends it to an existing table with `dbAppendTable()`:

```{r}
#| eval: false
append_csv <- function(path) {
  df <- read_csv(path)
  DBI::dbAppendTable(con, "cars", df)
}
```

Then you just need to create a table to fill in.
Here I use a `filter()` that's guaranteed to select zero rows to create a table that will have the right column names and types.

```{r}
#| eval: false
con <- DBI::dbConnect(RSQLite::SQLite(), tempfile())

template <- read_csv(paths[[1]])
DBI::dbWriteTable(con, "cars", filter(template, FALSE))
```

Then I need to call `append_csv()` once for each element of `paths`.
That's certainly possible with `map()`:

```{r}
#| eval: false
paths |> map(append_csv)
```

But we don't actually care about the output, so instead we can use `walk()`.
This does exactly the same thing as `map()` but throws the output away.

```{r}
#| eval: false
paths |> walk(append_csv)
```
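
The difference is easy to see with a toy side-effect function. In this sketch, `announce()` is a hypothetical helper that only prints a message:

```{r}
library(purrr)

# A made-up function called purely for its side effect (printing)
announce <- function(x) cat("Processing", x, "\n")

# map() collects the uninteresting return values into a list
out_map <- map(c("a", "b"), announce)
str(out_map)

# walk() calls the function for its side effects and invisibly
# returns its input, so nothing extra is printed
out_walk <- walk(c("a", "b"), announce)
```

Because `walk()` returns its input invisibly, it can also sit in the middle of a pipeline without disturbing the data flowing through it.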

### Writing csv files

The same basic principle applies if we want to save out multiple csv files, one for each group.
Let's imagine that we want to take the `ggplot2::diamonds` data and save out one csv file for each `clarity`.
First we need to make those individual datasets.
One way to do that is with dplyr's `group_split()`:

```{r}
by_clarity <- diamonds |> 
  group_by(clarity) |> 
  group_split()
```

This produces a list of length 8, containing one tibble for each unique value of `clarity`:

```{r}
length(by_clarity)

by_clarity[[1]]
```

If we were going to save these data frames by hand, we might write something like:

```{r}
#| eval: false
write_csv(by_clarity[[1]], "diamonds-I1.csv")
write_csv(by_clarity[[2]], "diamonds-SI2.csv")
write_csv(by_clarity[[3]], "diamonds-SI1.csv")
...
write_csv(by_clarity[[8]], "diamonds-IF.csv")
```

This is a little different compared to our previous uses of `map()` because instead of changing one argument we're now changing two.
This means that we'll need to use `map2()` instead of `map()`.
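
In the same spirit as `map()`, `map2(x, y, f)` calls `f(x[[1]], y[[1]])`, `f(x[[2]], y[[2]])`, and so on. A minimal sketch with two hypothetical numeric vectors:

```{r}
library(purrr)

# Two made-up vectors of equal length, iterated over in parallel
lengths_cm <- c(10, 20, 30)
widths_cm <- c(1, 2, 3)

# Each call receives one element from each vector
areas <- map2(lengths_cm, widths_cm, \(l, w) l * w)
str(areas)
```

`walk2()` relates to `map2()` in the same way that `walk()` relates to `map()`.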

We'll also need to generate the names for those files somehow.
The most general way to do so is to use `dplyr::group_keys()`:

```{r}
keys <- diamonds |> 
  group_by(clarity) |> 
  group_keys()
keys

paths <- keys |> 
  mutate(path = str_glue("diamonds-{clarity}.csv")) |> 
  pull()
paths
```

This feels a bit fiddly here because we're only working with a single group, but you can imagine this is very powerful if you want to group by multiple variables.

Now that we have all the pieces in place, we can eliminate the need to copy and paste by running `walk2()`:

```{r}
#| eval: false
walk2(by_clarity, paths, write_csv)
```

### Saving plots

We can take the same basic approach to create many plots.
We're jumping the gun here a bit because you won't learn how to save a single plot until @sec-ggsave, but hopefully the basic idea will be clear.

Let's first split up the data:

```{r}
by_cyl <- mtcars |> group_by(cyl)
```

Then create the plots using `map()` to call `ggplot()` repeatedly with different datasets.
That gives us a list of plots[^iteration-3]:

[^iteration-3]: You can print `plots` to get a crude animation --- you'll get one plot for each element of `plots`.

```{r}
plots <- by_cyl |>
  group_split() |> 
  map(\(df) ggplot(df, aes(mpg, wt)) + geom_point())
```

(If this was a more complicated plot you'd use a named function so there's more room for all the details.)

Then you create the file names:

```{r}
paths <- by_cyl |> 
  group_keys() |> 
  mutate(path = str_glue("cyl-{cyl}.png")) |> 
  pull()
paths
```

Then use `walk2()` with `ggsave()` to save each plot:

```{r}
walk2(plots, paths, \(plot, name) ggsave(name, plot, path = tempdir()))
```

This is shorthand for:

```{r}
#| eval: false
ggsave(paths[[1]], plots[[1]], path = tempdir())
ggsave(paths[[2]], plots[[2]], path = tempdir())
ggsave(paths[[3]], plots[[3]], path = tempdir())
```

It's barely necessary here, but you can imagine how useful this would be if you had to create hundreds of plots.

### Exercises

1.  Imagine you have a table of student data containing (amongst other variables) `school_name` and `student_id`. Sketch out what code you'd write if you wanted to save all the information for each student in a file called `{student_id}.csv` in the `{school_name}` directory.

## For loops

Another way to attack this sort of problem is with a `for` loop.
To stay focused, we don't teach `for` loops here, but they're definitely important.
You can learn more about them and how they're connected to the map functions in purrr in <https://adv-r.hadley.nz/control-flow.html#loops> and <https://adv-r.hadley.nz/functionals.html>.

Once you master these functions, you'll find it takes much less time to solve iteration problems.
But you should never feel bad about using a `for` loop instead of a map function.
The map functions are a step up a tower of abstraction, and it can take a long time to get your head around how they work.
The important thing is that you solve the problem that you're working on, not write the most concise and elegant code (although that's definitely something you want to strive towards!).
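
To make the connection concrete, here is a small sketch that computes the same result both ways. The names `inputs`, `via_map`, and `via_loop` are purely illustrative:

```{r}
library(purrr)

inputs <- c(1, 10, 100)

# With map(): one call, and the output list is created for you
via_map <- map(inputs, log10)

# With a for loop: allocate the output first, then fill it in
# one element at a time
via_loop <- vector("list", length(inputs))
for (i in seq_along(inputs)) {
  via_loop[[i]] <- log10(inputs[[i]])
}

# Both approaches produce the same list
identical(via_map, via_loop)
```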

Some people will tell you to avoid `for` loops because they are slow.
They're wrong!
(Well, at least they're rather out of date, as `for` loops haven't been slow for many years.) The chief benefit of using functions like `map()` is not speed, but clarity: they make your code easier to write and to read.

If you actually need to worry about performance, you'll know; it'll be obvious.
Until then, don't worry about it.