
# Iteration {#sec-iteration}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("drafting")
```
## Introduction
In @sec-functions, we talked about how important it is to reduce duplication in your code by creating functions instead of copying-and-pasting.
Reducing code duplication has three main benefits:
1. It's easier to see the intent of your code, because your eyes are drawn to what's different, not what stays the same.
2. It's easier to respond to changes in requirements.
As your needs change, you only need to make changes in one place, rather than remembering to change every place that you copied-and-pasted the code.
3. You're likely to have fewer bugs because each line of code is used in more places.
One tool for reducing duplication is functions, which identify repeated patterns of code and extract them into independent pieces that can be easily reused and updated.
Another tool for reducing duplication is **iteration**, which helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets.
Iteration is somewhat of a moving target in the tidyverse because we keep adding new features to make it easier to solve problems that previously required explicit iteration.
For example:
- To draw one plot for each group you can use ggplot2's faceting.
- To compute summary statistics for subgroups you can use `dplyr::group_by()` + `dplyr::summarise()`.
- To read every .csv file in a directory you can pass a vector to `readr::read_csv()`.
- To extract every element from a named list you can use `tidyr::unnest_wider()`.
### Prerequisites
We'll use a selection of useful iteration idioms from dplyr and purrr, both core members of the tidyverse.
```{r}
#| label: setup
#| message: false
library(tidyverse)
```
## For each column
Imagine you have this simple tibble:
```{r}
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
```
And you want to compute the median of every column.
You could do it with copy-and-paste:
```{r}
df |> summarise(
  a = median(a),
  b = median(b),
  c = median(c),
  d = median(d),
)
```
But that breaks our rule of thumb: never copy and paste more than twice.
And you could imagine that this will get particularly tedious if you have tens or even hundreds of variables.
Instead you can use `across()`:
```{r}
df |> summarise(
  across(a:d, median)
)
```
`across()` has two important arguments, which we'll explore in more detail below:

- The first argument specifies which columns you want to iterate over. It uses the same syntax as `select()`.
- The second argument specifies what to do with each column.
### Which columns
The first argument accepts all the same specifications as `select()`.
But there are two extras that we haven't discussed earlier:
- `everything()` selects all columns.
- `where(fun)` selects all columns where `fun` returns `TRUE`. It's most commonly used with functions like `is.numeric()`, `is.factor()`, `is.character()`, `lubridate::is.Date()`, and `lubridate::is.POSIXt()`.
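
For example, you could use `where()` to compute the mean of every numeric column, regardless of what the columns are called:

```{r}
df |> summarise(across(where(is.numeric), mean))
```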
### Extra arguments
What happens if we have some missing values?
It'd be nice to be able to pass along additional arguments to `median()`:
```{r}
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = c(NA, rnorm(9)),
  d = rnorm(10)
)

df |> summarise(
  across(a:d, median)
)
```
For complicated reasons, it's not easy to pass on arguments from `across()`, so instead we can create another function that wraps `median()` and calls it with the correct arguments.
We can write that compactly using R's anonymous function shorthand:
```{r}
df |> summarise(
  across(a:d, \(x) median(x, na.rm = TRUE))
)
```
This is shorthand for creating a function, as shown below.
It's easier to remember because you just replace the eight letters of `function` with a single `\`.
```{r}
#| results: false
df |> summarise(
  across(a:d, function(x) median(x, na.rm = TRUE))
)
```
### Mutating
2016-03-01 22:29:58 +08:00
2022-09-10 21:38:20 +08:00
You face a similar problem when you want to modify the columns with `mutate()`:
```{r}
df |> mutate(
  across(a:d, \(x) x + 1)
)
```
By default, the outputs of `across()` are given the same names as the inputs.
This means that using `across()` inside of `mutate()` will replace the existing columns by default.
If you'd like to instead create new columns, you can supply the `.names` argument which takes a glue specification where `{.col}` refers to the current column name.
```{r}
df |> mutate(
  across(a:d, \(x) x * 2, .names = "{.col}_2")
)
```
The name specification is also important if you supply a list of multiple functions to `across()`.
In this case the default specification is `{.col}_{.fun}`.
```{r}
df |> summarise(
  across(a:d, list(
    median = \(x) median(x, na.rm = TRUE),
    n_miss = \(x) sum(is.na(x))
  ))
)
```
### Filtering

`across()` is a great match for `summarise()` and `mutate()`, but it's more awkward to use with `filter()`, because you usually want to combine conditions across columns with `|` or `&`. For that, dplyr provides two variants of `across()`: `if_any()` and `if_all()`.
```{r}
df |> filter(is.na(a) | is.na(b) | is.na(c) | is.na(d))

# same thing, but if_any() scales to any number of columns
df |> filter(if_any(a:d, is.na))
```
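
`if_all()` is the conjunctive counterpart. For example, to keep only the rows with no missing values in `a:d`:

```{r}
df |> filter(if_all(a:d, \(x) !is.na(x)))
```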
### Vs `pivot_longer()`
Before we go on, it's worth pointing out an interesting connection to `pivot_longer()`.
In many cases, you can perform the same calculations by first pivoting the data longer and then performing the operations by group rather than by column.
```{r}
df |>
  pivot_longer(a:d) |>
  group_by(name) |>
  summarise(
    median = median(value, na.rm = TRUE),
    n_miss = sum(is.na(value))
  )
```
Another place where you have to use `pivot_longer()` or similar is if you have pairs of variables that you need to compute with simultaneously:
```{r}
df <- tibble(
  a_val = rnorm(10),
  a_w = runif(10),
  b_val = rnorm(10),
  b_w = runif(10),
  c_val = rnorm(10),
  c_w = runif(10),
  d_val = rnorm(10),
  d_w = runif(10)
)
df |>
  pivot_longer(
    everything(),
    names_to = c("group", ".value"),
    names_sep = "_"
  ) |>
  group_by(group) |>
  summarise(mean = weighted.mean(val, w))
```
(You could `pivot_wider()` this back to the original form if that's the structure you need.)
One day `across()` or a friend might support this sort of computation directly, but currently we don't see how.
### Exercises
1. Compute the number of unique values in each column of `palmerpenguins::penguins`.
2. Compute the mean of every column in `mtcars`.
3. Group `diamonds` by `cut`, `clarity`, and `color`, then count the number of observations and compute the mean of each numeric column.
## For each file
`map()` is similar to `across()`, but instead of doing something to each column in a data frame, it does something to each element of a list.
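
Here's a minimal illustration: `map()` takes a list and a function, applies the function to each element, and returns a new list of the results.

```{r}
x <- list(1:3, 10:20, c(4, 5, 6))
x |> map(mean)
```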
To get the files to iterate over, use `dir()`.
Use its `pattern` argument, a regular expression, to select just the files you care about.
Always set `full.names = TRUE` so the results include the directory and can be used as paths.
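
For example, assuming a hypothetical `data/` directory of csv files:

```{r}
#| eval: false
paths <- dir("data", pattern = "\\.csv$", full.names = TRUE)
```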
If you're lucky, all the files have the same format and you can pass the whole vector of paths straight to `readr::read_csv()`.
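
Recent versions of readr accept a vector of paths, and the optional `id` argument records which file each row came from:

```{r}
#| eval: false
read_csv(paths, id = "path")
```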
Otherwise you'll need to do it yourself.
That takes two steps: read every file into a list, then join the pieces back into a data frame.
Overall this framework is sometimes called split-apply-combine: you split the problem up into pieces (here, paths), apply a function to each piece (`read_csv()`), and then combine the pieces back together (`list_rbind()`).
```{r}
#| eval: false
paths <- dir("data", pattern = "\\.xls$", full.names = TRUE)
paths |>
  map(\(path) readxl::read_excel(path)) |>
  list_rbind()
```
### Data in the path
If the files have heterogeneous formats you might need to do more processing before you can successfully merge them.
You can use `map_if()` or `map_at()` to selectively modify inputs.
Use `map_if()` if it's easier to select the elements to transform with a function; use `map_at()` if you can select the elements based on their names or positions.
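
For example, here's a sketch with `map_if()`, assuming a hypothetical list of data frames called `files` where some files read the shared `id` column as numeric:

```{r}
#| eval: false
files <- files |> map_if(
  # only transform the data frames where id was parsed as a number
  \(df) is.numeric(df$id),
  \(df) mutate(df, id = as.character(id))
)
```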
If the path itself contains data, try:
```{r}
#| eval: false
paths |>
  set_names() |>
  map(readxl::read_excel) |>
  list_rbind(.id = "path")
```
You can then use `tidyr::separate_wider_delim()` and friends to turn the path into useful columns.
You can use `set_names(basename)` to just use the file name.
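
Putting those together, here's a sketch assuming hypothetical per-year files with names like `2019.xlsx`:

```{r}
#| eval: false
paths |>
  set_names(basename) |>
  map(readxl::read_excel) |>
  list_rbind(.id = "file") |>
  separate_wider_delim(file, delim = ".", names = c("year", "ext"))
```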
### Get to a single data frame as quickly as possible
If you need to read and transform your data in some way, there are two basic ways of structuring your code: doing a little iteration with a function that does a lot, or doing a lot of iteration with simple functions.
Let's make that concrete with an example.
Say you want to read in a bunch of files, filter out missing values, pivot them, and then join them all together.
One way to approach the problem is to write a function that takes a file and does all those steps:
```{r}
#| eval: false
process_file <- function(path) {
  df <- read_csv(path)

  df |>
    filter(!is.na(id)) |>
    mutate(id = tolower(id)) |>
    pivot_longer(jan:dec, names_to = "month")
}
paths <- dir("data", full.names = TRUE)

all <- paths |>
  map(process_file) |>
  list_rbind()
```
Alternatively, you could perform each step of the process on every file at once:
```{r}
#| eval: false
paths <- dir("data", full.names = TRUE)
data <- paths |>
  map(read_csv) |>
  list_rbind()
data |>
  filter(!is.na(id)) |>
  mutate(id = tolower(id)) |>
  pivot_longer(jan:dec, names_to = "month")
```
If you need to do more work to get `list_rbind()` to succeed, you should do it, but in general the sooner you can get everything into one big data frame the better.
This is particularly important if the structure of your data varies in some way, because it's usually easier to understand the variations when you have them all in front of you.
It's also much easier to experiment interactively and figure out the right approach with one big data frame than with many small ones.
### Optimize iteration speed by saving your work
Even in that case, I'd suggest starting with a single pass that loads all the files:
```{r}
#| eval: false
files <- paths |> map(read_csv)
```
Then you can iteratively test your tidying code as you develop it.
After spending all this effort getting your data into shape, save the result to a new csv file so you never have to repeat the work.
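
For example, assuming the combined and tidied data frame is called `all`:

```{r}
#| eval: false
write_csv(all, "data/all-cleaned.csv")
```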
In terms of organising your analysis project, you might want to have a file called `0-cleanup.R` that generates nice csv files to be used by the rest of your project.
### For really inconsistent data
If the files are really inconsistent, one useful way to get some traction is to think about the structure of the files as data itself.
```{r}
#| eval: false
paths |>
  set_names(basename) |>
  # read just the column headers of each file
  map(\(path) read_csv(path, n_max = 0)) |>
  # then record the column names found in each file
  map(\(df) data.frame(cols = names(df))) |>
  list_rbind(.id = "name")
```
You could then think about pivoting or plotting this data to understand what the differences are.
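
For example, assuming you saved the result above as `col_info`, a quick count shows which column names are shared across files and which are idiosyncratic:

```{r}
#| eval: false
col_info |> count(cols, sort = TRUE)
```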
### Handling failures
When you use the map functions to repeat many operations, the chances are much higher that one of those operations will fail.
When this happens, you'll get an error message, and no output.
This is annoying: why does one failure prevent you from accessing all the other successes?
How do you ensure that one bad apple doesn't ruin the whole barrel?
In this section you'll learn how to deal with this situation with a new function: `safely()`.
`safely()` is an adverb: it takes a function (a verb) and returns a modified version.
In this case, the modified function will never throw an error.
Instead, it always returns a list with two elements:
1. `result` is the original result.
If there was an error, this will be `NULL`.
2. `error` is an error object.
If the operation was successful, this will be `NULL`.
(You might be familiar with the `try()` function in base R.
It's similar, but because it sometimes returns the original result and it sometimes returns an error object it's more difficult to work with.)
Let's illustrate this with a simple example: `log()`:
```{r}
safe_log <- safely(log)
str(safe_log(10))
str(safe_log("a"))
```
When the function succeeds, the `result` element contains the result and the `error` element is `NULL`.
When the function fails, the `result` element is `NULL` and the `error` element contains an error object.
`safely()` is designed to work with `map()`:
```{r}
x <- list(1, 10, "a")
y <- x |> map(safely(log))
str(y)
```
The same idea applies when reading many files, some of which might be malformed:

```{r}
#| eval: false
paths |>
  map(safely(read_csv))
```
This would be easier to work with if we had two lists: one of all the errors and one of all the output.
That's easy to get with `purrr::transpose()`:
```{r}
y <- y |> transpose()
str(y)
```
It's up to you how to deal with the errors, but typically you'll either look at the values of `x` where `y` is an error, or work with the values of `y` that are ok:
```{r}
is_ok <- y$error |> map_lgl(is.null)
x[!is_ok]
y$result[is_ok] |> flatten_dbl()
```
## Writing multiple outputs
The main challenge is that there are two important arguments: the object you want to save and the place you want to save it.
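
For example, here's a minimal sketch that saves one csv per group, using `walk2()` (which you'll see more of below) to pair each data frame with its path:

```{r}
#| eval: false
by_cyl <- mtcars |> split(mtcars$cyl)
paths <- str_c("cyl-", names(by_cyl), ".csv")
walk2(by_cyl, paths, write_csv)
```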
### Very large data
If you have very large data, it might be impossible to hold all of it in memory at once.
If you're lucky, the database you're working with will have a function that loads csv files directly into the database.
For example, if you're using duckdb, you can:
```{r}
#| eval: false
# con is an existing DBI connection to a duckdb database
duckdb::duckdb_read_csv(con, "cars", paths)
```
Otherwise, you can create an empty table with the right structure and then append each file to it:
```{r}
#| eval: false
# use the first file as a template for the column names and types,
# then create an empty table with the same structure
template <- read_csv(paths[[1]])
DBI::dbWriteTable(con, "cars", filter(template, FALSE))
read_write <- function(path) {
  df <- read_csv(path)
  DBI::dbAppendTable(con, "cars", df)
}
# walk() works like map() but discards the results; here we only care about the side effect
paths |> walk(read_write)
```
Or maybe you just write one clean csv for each file and then read with `arrow::open_dataset()`.
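
Here's a sketch of that approach, assuming a hypothetical `cleaned/` directory and a hypothetical `clean_file()` function that tidies one data frame:

```{r}
#| eval: false
paths |> walk(\(path) {
  read_csv(path) |>
    clean_file() |>  # hypothetical per-file tidying function
    write_csv(file.path("cleaned", basename(path)))
})

cleaned <- arrow::open_dataset("cleaned", format = "csv")
```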
### Saving plots
The key tool is `walk2()`.
It differs from `map()` in two ways: it iterates over two arguments at the same time, and it hides the output, which makes it a good fit for functions like `ggsave()` that are called only for their side effects.
```{r}
#| eval: false
plots <- mtcars |>
  # split() returns a named list, one data frame per value of cyl
  split(mtcars$cyl) |>
  map(\(df) ggplot(df, aes(mpg, wt)) + geom_point())

# use the list names to build one file name per plot, e.g. "cyl-4.pdf"
paths <- str_c("cyl-", names(plots), ".pdf")

walk2(paths, plots, ggsave, path = tempdir())
```
## For loops
Another way to attack this sort of problem is with a `for` loop.
We don't teach for loops here to stay focused.
They're definitely important.
You can learn more about them and how they're connected to the map functions in purrr in <https://adv-r.hadley.nz/control-flow.html#loops> and <https://adv-r.hadley.nz/functionals.html>.
Once you master these functions, you'll find it takes much less time to solve iteration problems.
But you should never feel bad about using a `for` loop instead of a map function.
The map functions are a step up a tower of abstraction, and it can take a long time to get your head around how they work.
The important thing is that you solve the problem that you're working on, not write the most concise and elegant code (although that's definitely something you want to strive towards!).
Some people will tell you to avoid `for` loops because they are slow.
They're wrong!
(Well, at least they're rather out of date, as `for` loops haven't been slow for many years.)
The chief benefit of using functions like `map()` is not speed, but clarity: they make your code easier to write and to read.
If you actually need to worry about performance, you'll know; it'll be obvious.
Till then, don't worry about it.