r4ds/iteration.qmd

962 lines
31 KiB
Plaintext
Raw Normal View History

# Iteration {#sec-iteration}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
2022-09-10 21:38:20 +08:00
status("drafting")
```
2016-03-01 22:29:58 +08:00
2016-07-24 22:08:24 +08:00
## Introduction
2022-09-22 23:43:21 +08:00
In this chapter, you'll tools for iteration, repeatedly performing the same action on different objects.
You've already learned a number of special purpose tools for iteration:
2022-09-10 21:38:20 +08:00
- To draw one plot for each group you can use ggplot2's facetting.
2022-09-22 23:43:21 +08:00
- To compute a summary statistic for each subgroup you can use `group_by()` and `summarise()`.
- To extract each element in a named list you can use `unnest_wider()` or `unnest_longer()`.
2022-09-16 03:56:12 +08:00
2022-09-22 23:43:21 +08:00
Now it's time to learn some more general tools.
Tools for iteration can quickly become very abstract, but in this chapter we'll keep things concrete to make as easy as possible to learn the basics.
We're going to focus on three related tools for three related tasks: modifying multiple columns, reading multiple files, and saving multiple objects.
We'll conclude with a brief discussion of `for`-loops, an important iteration technique that we deliberately don't cover here, and provide a few pointers for learning more.
2022-09-16 04:28:54 +08:00
2016-07-19 21:01:50 +08:00
### Prerequisites
2022-09-26 21:38:18 +08:00
::: callout-important
This chapter relies on features only found in purrr 1.0.0, which is still in development.
If you want to live life on the edge you can get the dev version with `devtools::install_github("tidyverse/purrr")`.
:::
2022-09-22 23:43:21 +08:00
In this chapter, we'll focus on tools provided by dplyr and purrr, both core members of the tidyverse.
You've seen dplyr before, but purrr is new.
We're going to use just a couple of purrr functions from in this chapter, but it's a great package to skill as you improve your programming skills.
2016-07-19 21:01:50 +08:00
```{r}
#| label: setup
#| message: false
2016-10-04 01:30:24 +08:00
library(tidyverse)
2016-07-19 21:01:50 +08:00
```
2022-09-29 22:20:45 +08:00
This chapter also relies on a function that hasn't yet been implemented for dplyr but will be by the time the book is out:
2022-09-26 21:38:18 +08:00
```{r}
pick <- function(cols) {
across({{ cols }})
}
```
2022-09-19 05:18:45 +08:00
## Modifying multiple columns {#sec-across}
2016-03-01 22:29:58 +08:00
2022-09-10 21:38:20 +08:00
Imagine you have this simple tibble:
2016-03-01 22:29:58 +08:00
```{r}
2016-10-04 03:10:05 +08:00
df <- tibble(
2016-03-01 22:29:58 +08:00
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
2016-03-21 21:55:07 +08:00
```
2016-03-01 22:29:58 +08:00
2022-09-10 21:38:20 +08:00
And you want to compute the median of every column.
You could do it with copy-and-paste:
2016-03-22 21:57:52 +08:00
```{r}
2022-09-22 23:43:21 +08:00
df |> summarise(
2022-09-10 21:38:20 +08:00
a = median(a),
b = median(b),
c = median(c),
d = median(d),
2022-09-20 22:13:51 +08:00
n = n()
2022-09-10 21:38:20 +08:00
)
2016-03-22 21:57:52 +08:00
```
2022-09-22 23:43:21 +08:00
But that breaks our rule of thumb to never copy and paste more than twice, and you can imagine that this will get very tedious if you have tens or even hundreds of variables.
2022-09-10 21:38:20 +08:00
Instead you can use `across()`:
2016-03-22 21:57:52 +08:00
2016-03-21 21:55:07 +08:00
```{r}
2022-09-22 23:43:21 +08:00
df |> summarise(
2022-09-20 22:13:51 +08:00
across(a:d, median),
n = n()
2022-09-10 21:38:20 +08:00
)
2016-03-01 22:29:58 +08:00
```
2022-09-22 23:43:21 +08:00
`across()` has three particularly important arguments, which we'll discuss in detail in the following sections.
You'll use the first two every time you use `across()`:
2022-09-16 03:56:12 +08:00
2022-09-22 23:43:21 +08:00
- The first argument, `.cols`, specifies which columns you want to iterate over. It uses tidy-select syntax, just like `select()`.
- The second argument, `.fns`, specifies what to do with each column.
2022-09-22 23:43:21 +08:00
The `.names` argument gives you control over the output names, and is particularly useful when you use `across()` with `mutate()`.
We'll also discuss two important variations, `if_any()` and `if_all()`, which work with `filter()`.
2022-09-16 03:56:12 +08:00
2022-09-20 22:13:51 +08:00
### Selecting columns with `.cols`
2022-09-22 23:43:21 +08:00
The first argument to `across()` selects the columns to transform.
2022-09-16 03:56:12 +08:00
This argument uses the same specifications as `select()`, @sec-select, so you can use functions like `starts_with()` and `ends_with()` to select variables based on their name.
2022-09-20 22:13:51 +08:00
Grouping columns are automatically ignored because they're carried along for the ride by the dplyr verb.
2022-09-22 23:43:21 +08:00
There are two additional selection techniques that are particularly useful for `across()`: `everything()` and `where()`.
`everything()` is straightforward: it selects every (non-grouping) column:
2022-09-16 03:56:12 +08:00
```{r}
df <- tibble(
grp = sample(2, 10, replace = TRUE),
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
2022-09-22 23:43:21 +08:00
df |>
2022-09-16 03:56:12 +08:00
group_by(grp) |>
summarise(across(everything(), median))
```
`where()` allows you to select columns based on their type:
- `where(is.numeric)` selects all numeric columns.
- `where(is.character)` selects all string columns.
- `where(is.Date)` selects all date columns.
- `where(is.POSIXct)` selects all date-time columns.
- `where(is.logical)` selects all logical columns.
2022-09-20 22:13:51 +08:00
```{r}
df <- tibble(
x1 = 1:3,
x2 = runif(3),
y1 = sample(letters, 3),
y2 = c("banana", "apple", "egg")
)
df |>
summarise(across(where(is.numeric), mean))
2022-09-22 23:43:21 +08:00
2022-09-20 22:13:51 +08:00
df |>
summarise(across(where(is.character), str_flatten))
```
2022-09-22 23:43:21 +08:00
Just like other selectors, you can combine these with Boolean algebra.
For example, `!where(is.numeric)` selects all non-numeric columns and `starts_with("a") & where(is.logical)` selects all logical columns whose name starts with "a".
2022-09-22 23:43:21 +08:00
### Defining the action with `.fns`
2022-09-22 23:43:21 +08:00
The second argument to `across()` defines how each column will be transformed.
In simple cases, this will just be the name of existing function, but you might want to supply additional arguments or perform multiple transformations, as described below.
2022-09-16 03:56:12 +08:00
2022-09-22 23:43:21 +08:00
Lets motivate this problem with an simple example: what happens if we have some missing values in our data?
`median()` will preserve those missing values giving us a suboptimal output:
2016-03-01 22:29:58 +08:00
```{r}
2022-09-16 03:56:12 +08:00
rnorm_na <- function(n, n_na, mean = 0, sd = 1) {
sample(c(rnorm(n - n_na, mean = mean, sd = 1), rep(NA, n_na)))
}
2016-10-04 03:10:05 +08:00
df <- tibble(
2022-09-22 23:43:21 +08:00
a = rnorm_na(5, 1),
b = rnorm_na(5, 1),
c = rnorm_na(5, 2),
d = rnorm(5)
2016-03-21 21:55:07 +08:00
)
2022-09-22 23:43:21 +08:00
df |>
2022-09-20 22:13:51 +08:00
summarise(
across(a:d, median),
n = n()
)
2016-03-01 22:29:58 +08:00
```
2022-09-22 23:43:21 +08:00
It'd be nice to be able to pass along `na.rm = TRUE` to `median()` to remove these missing values.
To do so, instead of calling `median()` directly, we need to create a new function that calls `median()` with the correct arguments:
```{r}
df |>
summarise(
across(a:d, function(x) median(x, na.rm = TRUE)),
n = n()
)
```
This is a little verbose, so R comes with a handy shortcut: for this sort of throw away function[^iteration-1], you can replace `function` with `\`:
[^iteration-1]: These are often called anonymous functions because you don't give them a name with `<-.`
2016-03-21 21:55:07 +08:00
```{r}
2022-09-22 23:43:21 +08:00
#| results: false
df |>
2022-09-20 22:13:51 +08:00
summarise(
across(a:d, \(x) median(x, na.rm = TRUE)),
n = n()
)
```
2022-09-22 23:43:21 +08:00
In either case, `across()` effectively expands to the following code:
2022-09-20 22:13:51 +08:00
```{r}
#| eval: false
2022-09-22 23:43:21 +08:00
df |> summarise(
2022-09-20 22:13:51 +08:00
a = median(a, na.rm = TRUE),
b = median(b, na.rm = TRUE),
c = median(c, na.rm = TRUE),
d = median(d, na.rm = TRUE),
n = n()
2022-09-10 21:38:20 +08:00
)
2016-03-01 22:29:58 +08:00
```
2022-09-22 23:43:21 +08:00
When we remove the missing values from the `median()`, it would be nice to know just how many values we were removing.
We find that out by supplying two functions to `across()`: one to compute the median and the other to count the missing values.
You can supply multiple functions with a named list:
2022-09-16 03:56:12 +08:00
```{r}
2022-09-22 23:43:21 +08:00
df |>
2022-09-20 22:13:51 +08:00
summarise(
across(a:d, list(
median = \(x) median(x, na.rm = TRUE),
n_miss = \(x) sum(is.na(x))
)),
n = n()
)
2022-09-16 03:56:12 +08:00
```
2022-09-22 23:43:21 +08:00
If you look carefully, you might intuit that the columns are named using using a glue specification (@sec-glue) like `{.col}_{.fn}` where `.col` is the name of the original column and `.fn` is the name of the function.
That's not a coincidence!
As you'll learn in the next section, you can use `.names` argument to supply your own glue spec.
2022-09-16 03:56:12 +08:00
### Missing values {#sec-across-missing-values}
```{r}
#| eval: false
df |>
mutate(across(where(is.numeric), coalesce, 0))
df |>
mutate(across(where(is.numeric), na_if, -99))
```
2022-09-16 03:56:12 +08:00
### Column names
2016-03-01 22:29:58 +08:00
2022-09-16 03:56:12 +08:00
The result of `across()` is named according to the specification provided in the `.names` variable.
2022-09-22 23:43:21 +08:00
We could specify our own if we wanted the name of the function to come first[^iteration-2]:
2022-09-20 22:13:51 +08:00
2022-09-22 23:43:21 +08:00
[^iteration-2]: You can't currently change the order of the columns, but you could reorder them after the fact using `relocate()` or similar.
2016-03-01 22:29:58 +08:00
2016-03-21 21:55:07 +08:00
```{r}
2022-09-22 23:43:21 +08:00
df |>
2022-09-20 22:13:51 +08:00
summarise(
across(
a:d,
list(
median = \(x) median(x, na.rm = TRUE),
n_miss = \(x) sum(is.na(x))
),
.names = "{.fn}_{.col}"
),
n = n(),
)
2016-03-01 22:29:58 +08:00
```
2022-09-16 03:56:12 +08:00
The `.names` argument is particularly important when you use `across()` with `mutate()`.
2022-09-20 22:13:51 +08:00
By default the output of `across()` is given the same names as the inputs.
2022-09-16 03:56:12 +08:00
This means that `across()` inside of `mutate()` will replace existing columns:
2016-03-22 21:57:52 +08:00
2016-03-21 21:55:07 +08:00
```{r}
2022-09-22 23:43:21 +08:00
df |>
2022-09-20 22:13:51 +08:00
mutate(
2022-09-22 23:43:21 +08:00
across(a:d, \(x) coalesce(x, 0))
2022-09-20 22:13:51 +08:00
)
2016-03-21 21:55:07 +08:00
```
2022-09-20 22:13:51 +08:00
If you'd like to instead create new columns, you can use the `.names` argument give the output new names:
2016-03-01 22:29:58 +08:00
```{r}
2022-09-22 23:43:21 +08:00
df |>
2022-09-20 22:13:51 +08:00
mutate(
across(a:d, \(x) x * 2, .names = "{.col}_double")
)
2016-03-22 21:57:52 +08:00
```
2016-03-01 22:29:58 +08:00
2022-09-10 21:38:20 +08:00
### Filtering
2016-03-01 22:29:58 +08:00
2022-09-16 03:56:12 +08:00
`across()` is a great match for `summarise()` and `mutate()` but it's not such a great fit for `filter()` because you usually string together calls to multiple functions either with `|` or `&`.
So dplyr provides two variants of `across()` called `if_any()` and `if_all()`:
```{r}
2022-09-10 21:38:20 +08:00
df |> filter(is.na(a) | is.na(b) | is.na(c) | is.na(d))
2022-09-16 03:56:12 +08:00
# same as:
2022-09-10 21:38:20 +08:00
df |> filter(if_any(a:d, is.na))
2022-09-16 03:56:12 +08:00
df |> filter(is.na(a) & is.na(b) & is.na(c) & is.na(d))
# same as:
df |> filter(if_all(a:d, is.na))
2016-03-01 22:29:58 +08:00
```
2022-09-20 22:13:51 +08:00
### `across()` in functions
`across()` is particularly useful to program with because it allows you to operate with multiple variables.
2022-09-22 23:43:21 +08:00
For example, [Jacob Scott](https://twitter.com/_wurli/status/1571836746899283969) uses this little helper to expand all date variables into year, month, and day variables:
2022-09-20 22:13:51 +08:00
```{r}
expand_dates <- function(df) {
df |>
mutate(
across(
where(lubridate::is.Date),
list(year = year, month = month, day = mday)
)
)
}
```
2022-09-22 23:43:21 +08:00
It also makes it easy to supply multiple variables in a single argument because the first argument uses tidy-select.
You just need to remember to embrace that argument.
2022-09-20 22:13:51 +08:00
For example, this function will compute the means of numeric variables by default.
2022-09-22 23:43:21 +08:00
But by supplying the second argument you can choose to summarize just selected variables:
2022-09-20 22:13:51 +08:00
```{r}
2022-09-22 23:43:21 +08:00
summarise_means <- function(df, summary_vars = where(is.numeric)) {
df |>
2022-09-20 22:13:51 +08:00
summarise(
across({{ summary_vars }}, \(x) mean(x, na.rm = TRUE)),
n = n()
)
}
diamonds |>
group_by(clarity) |>
summarise_means()
diamonds |>
group_by(clarity) |>
summarise_means(c(carat, x:z))
```
2022-09-10 21:38:20 +08:00
### Vs `pivot_longer()`
2022-09-16 03:56:12 +08:00
Before we go on, it's worth pointing out an interesting connection between `across()` and `pivot_longer()`.
2022-09-10 21:38:20 +08:00
In many cases, you perform the same calculations by first pivoting the data and then performing the operations by group rather than by column.
2022-09-16 03:56:12 +08:00
For example, we could rewrite our multiple summary `across()` as:
2016-03-01 22:29:58 +08:00
2016-03-22 21:57:52 +08:00
```{r}
2022-09-10 21:38:20 +08:00
df |>
pivot_longer(a:d) |>
group_by(name) |>
summarise(
median = median(value, na.rm = TRUE),
n_miss = sum(is.na(value))
)
2016-03-22 21:57:52 +08:00
```
2022-09-16 03:56:12 +08:00
This is a useful technique to know about because sometimes you'll hit a problem that's not currently possible to solve with `across()`: when you have groups of variables that you want to compute with simultaneously.
For example, imagine that our data frame contains both values and weights and we want to compute a weighted mean:
2016-03-22 21:57:52 +08:00
2022-09-10 21:38:20 +08:00
```{r}
2022-09-16 03:56:12 +08:00
df3 <- tibble(
2022-09-10 21:38:20 +08:00
a_val = rnorm(10),
a_w = runif(10),
b_val = rnorm(10),
b_w = runif(10),
c_val = rnorm(10),
c_w = runif(10),
d_val = rnorm(10),
d_w = runif(10)
)
2022-09-16 03:56:12 +08:00
```
2022-09-22 23:43:21 +08:00
There's currently no way to do this with `across()`[^iteration-3], but it's relatively straightforward with `pivot_longer()`:
2022-09-16 03:56:12 +08:00
2022-09-22 23:43:21 +08:00
[^iteration-3]: Maybe there will be one day, but currently we don't see how.
2022-09-16 03:56:12 +08:00
```{r}
df3_long <- df3 |>
2022-09-10 21:38:20 +08:00
pivot_longer(
everything(),
names_to = c("group", ".value"),
names_sep = "_"
2022-09-16 03:56:12 +08:00
)
df3_long
df3_long |>
2022-09-10 21:38:20 +08:00
group_by(group) |>
summarise(mean = weighted.mean(val, w))
```
2022-09-16 03:56:12 +08:00
If needed, you could `pivot_wider()` this back to the original form.
2022-09-10 21:38:20 +08:00
### Exercises
2016-03-01 22:29:58 +08:00
2022-09-10 21:38:20 +08:00
1. Compute the number of unique values in each column of `palmerpenguins::penguins`.
2022-09-22 23:43:21 +08:00
2022-09-10 21:38:20 +08:00
2. Compute the mean of every column in `mtcars`.
2022-09-22 23:43:21 +08:00
2022-09-10 21:38:20 +08:00
3. Group `diamonds` by `cut`, `clarity`, and `color` then count the number of observations and the mean of each numeric variable.
2022-09-22 23:43:21 +08:00
4. What happens if you use a list of functions, but don't name them?
How is the output named?
5. It is possible to use `across()` inside `filter()` where it's equivalent to `if_all()`.
Can you explain why?
6. Adjust `expand_dates()` to automatically remove the date columns after they've been expanded.
Do you need to embrace any arguments?
7. Explain what each step of the pipeline in this function does.
What special feature of `where()` are we taking advantage of?
```{r}
#| results: false
show_missing <- function(df, group_vars, summary_vars = everything()) {
df |>
group_by(pick({{ group_vars }})) |>
summarise(
across({{ summary_vars }}, \(x) sum(is.na(x))),
.groups = "drop"
) |>
select(where(\(x) any(x > 0)))
}
nycflights13::flights |> show_missing(c(year, month, day))
```
2022-09-16 04:28:54 +08:00
## Reading multiple files
2016-03-01 22:29:58 +08:00
2022-09-22 23:43:21 +08:00
In the previous section, you learn how to use `dplyr::across()` to repeat a transformation on multiple columns.
In this section, you'll learn how to use `purrr::map()` to read every file in a directly.
Let's start with a little motivation: imagine you have a directory full of excel spreadsheets[^iteration-4] you want to read.
2022-09-16 03:56:12 +08:00
You could do it with copy and paste:
2022-09-22 23:43:21 +08:00
[^iteration-4]: If you instead had a directory of csv files with the same format, you can use the technique from @sec-readr-directory.
2022-09-16 03:56:12 +08:00
```{r}
#| eval: false
data2019 <- readr::read_excel("data/y2019.xls")
data2020 <- readr::read_excel("data/y2020.xls")
data2021 <- readr::read_excel("data/y2021.xls")
data2022 <- readr::read_excel("data/y2022.xls")
```
And then use `dplyr::bind_rows()` to combine them all together:
```{r}
#| eval: false
data <- bind_rows(data2019, data2020, data2021, data2022)
```
2022-09-22 23:43:21 +08:00
You can imagine that this would get tedious quickly, especially if you had 400 files, not four.
So in the following sections, you'll learn how to automate this sort of task.
There are basic steps: use `dir()` list all the files, then use `purrr::map()` to read each of them into a list, then use `purrr::list_rbind()` to combine them into a single data frame.
We'll then discuss how you can handle situations of increasing heterogeneity, where you can't do exactly the same thing to every file.
2022-09-16 03:56:12 +08:00
### Listing files in a directory
2022-09-20 22:13:51 +08:00
`dir()` lists the files in a directory.
You'll almost always use three arguments:
2022-09-22 23:43:21 +08:00
- `path`, the first argument is the directory to look in.
2022-09-20 22:13:51 +08:00
2022-09-22 23:43:21 +08:00
- `pattern` is a regular expression that the file names must match.
The most common pattern is something like `\\.xlsx$` or `\\.csv$` to match an extension, but you can use whatever you need to extract the data files from anything else living in that directory.
2022-09-20 22:13:51 +08:00
- `full.names` determines whether or not the directory name should be included in the output.
You almost always want this to be `TRUE`.
2022-09-22 23:43:21 +08:00
To make our motivating example concrete, this book contains a folder with 12 excel spreadsheets containining data from the gapminder package.
Each file contains one years for data for 142 countries.
2022-09-20 22:13:51 +08:00
We can list them all with the appropriate call to `dir()`:
2022-09-17 23:58:18 +08:00
2022-09-16 03:56:12 +08:00
```{r}
2022-09-17 23:58:18 +08:00
paths <- dir("data/gapminder", pattern = "\\.xlsx$", full.names = TRUE)
paths
2022-09-16 03:56:12 +08:00
```
2022-09-20 22:13:51 +08:00
### `purrr::map()` and `list_rbind()`
2022-09-22 23:43:21 +08:00
Now that we have these 12 paths, we could call `read_excel()` 12 times to get 12 data frames.
In general, we won't know how files there are to read, so instead of saving each data frame to its own variable, we'll put them all into a list, something like this:
2016-03-01 22:29:58 +08:00
```{r}
2022-09-10 21:38:20 +08:00
#| eval: false
2022-09-17 23:58:18 +08:00
list(
readxl::read_excel("data/gapminder/1952.xls"),
readxl::read_excel("data/gapminder/1957.xls"),
readxl::read_excel("data/gapminder/1962.xls"),
...,
readxl::read_excel("data/gapminder/2007.xls")
)
```
2022-09-20 22:13:51 +08:00
Now that's just as tedious to type as before, but we can use a shortcut: `purrr::map()`.
`map()` is similar to `across()`, but instead of doing something to each column in a data frame, it does something to each element of a vector.
`map(x, f)` is shorthand for:
2022-09-17 23:58:18 +08:00
```{r}
#| eval: false
list(
f(x[[1]]),
f(x[[2]]),
...,
f(x[[n]])
)
```
2022-09-20 22:13:51 +08:00
So we can use `map()` get a list of 12 data frames:
2022-09-17 23:58:18 +08:00
```{r}
files <- map(paths, readxl::read_excel)
length(files)
files[[1]]
```
2022-09-20 22:13:51 +08:00
(This is another data structure that doesn't display particularly compactly with `str()` so you might want to load into RStudio and inspect it with `View()`).
2022-09-17 23:58:18 +08:00
2022-09-20 22:13:51 +08:00
Now we can use `purrr::list_rbind()` to combine that list of data frames into a single data frame:
2022-09-17 23:58:18 +08:00
```{r}
list_rbind(files)
```
2022-09-20 22:13:51 +08:00
Or we could do both steps at once in pipeline:
2022-09-17 23:58:18 +08:00
```{r}
#| results: false
2022-09-10 21:38:20 +08:00
paths |>
2022-09-17 23:58:18 +08:00
map(readxl::read_excel) |>
2022-09-10 21:38:20 +08:00
list_rbind()
2016-03-01 22:29:58 +08:00
```
2022-09-17 23:58:18 +08:00
What if we want to pass in extra arguments to `read_excel()`?
2022-09-22 23:43:21 +08:00
We use the same technique that we used with `across()`.
For example, it's often useful to peak at the first few row of the data with `n_max = 1`:
2022-09-17 23:58:18 +08:00
```{r}
paths |>
map(\(path) readxl::read_excel(path, n_max = 1)) |>
list_rbind()
```
2022-09-22 23:43:21 +08:00
This makes it clear that something is missing: there's no `year` column because that value is recorded in the path, not the individual files.
2022-09-20 22:13:51 +08:00
We'll tackle that problem next.
2022-09-17 23:58:18 +08:00
2022-09-25 22:26:22 +08:00
### Data in the path {#sec-data-in-the-path}
2016-03-01 22:29:58 +08:00
2022-09-17 23:58:18 +08:00
Sometimes the name of the file is itself data.
2022-09-22 23:43:21 +08:00
In this example, the file name contains the year, which is not otherwise recorded in the individual files.
2022-09-17 23:58:18 +08:00
To get that column into the final data frame, we need to do two things.
2022-09-22 23:43:21 +08:00
2022-09-20 22:13:51 +08:00
Firstly, we name the vector of paths.
The easiest way to do this is with the `set_names()` function, which can take a function.
2022-09-22 23:43:21 +08:00
Here we use `basename()` to extract just the file name from the full path:
2022-09-17 23:58:18 +08:00
```{r}
paths <- paths |> set_names(basename)
paths
```
Those paths are automatically carried along by all the map functions, so the list of data frames will have those same names:
2016-03-01 22:29:58 +08:00
```{r}
2022-09-10 21:38:20 +08:00
#| eval: false
2022-09-17 23:58:18 +08:00
paths |>
map(readxl::read_excel) |>
names()
```
2022-09-22 23:43:21 +08:00
Then we use the `names_to` argument `list_rbind()` to tell it to save the names to a new column called `year`, then use `readr::parse_number()` to turn it into a number.
2022-09-17 23:58:18 +08:00
```{r}
2022-09-10 21:38:20 +08:00
paths |>
2022-09-16 03:56:12 +08:00
set_names(basename) |>
2022-09-17 23:58:18 +08:00
map(readxl::read_excel) |>
list_rbind(names_to = "year") |>
mutate(year = parse_number(year))
2016-03-01 22:29:58 +08:00
```
2022-09-22 23:43:21 +08:00
In more complicated other cases, there might be another variable stored in the directory name, or maybe the file name contains multiple bits of data.
In that case, use `set_names()` (without any arguments) to record the full path, and then use `tidyr::separate_by()` and friends to turn them into useful columns.
2016-03-01 22:29:58 +08:00
2022-09-17 23:58:18 +08:00
```{r}
paths |>
set_names() |>
map(readxl::read_excel) |>
list_rbind(names_to = "year") |>
2022-09-22 23:43:21 +08:00
separate(
year,
into = c(NA, "directory", "file", "ext"),
sep = "[/.]"
)
2022-09-17 23:58:18 +08:00
```
2016-03-01 22:29:58 +08:00
2022-09-20 22:13:51 +08:00
### Save your work
2016-03-01 22:29:58 +08:00
2022-09-22 23:43:21 +08:00
Now that you've done all this hard work to get to a nice tidy data frame, it's a great time to save your work:
2016-03-01 22:29:58 +08:00
2022-09-22 23:43:21 +08:00
```{r}
gapminder <- paths |>
set_names(basename) |>
map(readxl::read_excel) |>
list_rbind(names_to = "year") |>
mutate(year = parse_number(year))
write_csv(gapminder, "gapminder.csv")
```
```{r}
#| include: false
unlink("gapminder.csv")
```
2022-09-29 23:50:47 +08:00
If you're working in a project, we'd suggest calling the file that does this sort of data prep work something like `0-cleanup.R.` The `0` in the file name suggests that this should be run before anything else.
2022-09-22 23:43:21 +08:00
If your input data files change of over time, you might consider learning a tool like [targets](https://docs.ropensci.org/targets/) to set up your data cleaning code to automatically re-run when ever one of the input files is modified.
2022-09-20 22:13:51 +08:00
### Many simple iterations
2022-09-22 23:43:21 +08:00
Here we've just loaded the data directly from disk, and were lucky enough to get a tidy dataset.
In most cases, you'll need to do some additional tidying, and you have basic basic options: you can do one round of iteration with a complex function, or do a multiple rounds of iteration with multiple simple functions.
In our experience most folks reach first for one complex iteration, but you're often better by doing multiple simple iterations.
2022-09-20 22:13:51 +08:00
2022-09-22 23:43:21 +08:00
For example, imagine that you want to read in a bunch of files, filter out missing values, pivot, and then combine.
One way to approach the problem is write a function that takes a file and does all those steps and call `map()` once:
2016-03-01 22:29:58 +08:00
```{r}
2022-09-10 21:38:20 +08:00
#| eval: false
process_file <- function(path) {
df <- read_csv(path)
df |>
filter(!is.na(id)) |>
mutate(id = tolower(id)) |>
pivot_longer(jan:dec, names_to = "month")
2016-03-01 22:29:58 +08:00
}
2016-08-15 21:18:51 +08:00
2022-09-22 23:43:21 +08:00
paths |>
2022-09-10 21:38:20 +08:00
map(process_file) |>
list_rbind()
2016-03-01 22:29:58 +08:00
```
2022-09-22 23:43:21 +08:00
Another approach is to read all the files and combine them together first.
Then you only need to
2016-03-01 22:29:58 +08:00
2016-03-24 22:09:09 +08:00
```{r}
2022-09-10 21:38:20 +08:00
#| eval: false
2022-09-22 23:43:21 +08:00
paths |>
2022-09-10 21:38:20 +08:00
map(read_csv) |>
2022-09-22 23:43:21 +08:00
list_rbind() |>
2022-09-10 21:38:20 +08:00
filter(!is.na(id)) |>
mutate(id = tolower(id)) |>
pivot_longer(jan:dec, names_to = "month")
2016-03-01 22:29:58 +08:00
```
2022-09-22 23:43:21 +08:00
We recommend the second approach because it stops you getting fixated on getting the first file right because moving on to the rest.
By considering all of the data when doing tidying and cleaning, you're more likely to think holistically and end up with a higher quality result.
2016-03-01 22:29:58 +08:00
2022-09-16 03:56:12 +08:00
### Heterogeneous data
2016-03-01 22:29:58 +08:00
2022-09-22 23:43:21 +08:00
Unfortunately it's sometime not possible to go from `map()` straight to `list_rbind()` because the data frames are so heterogeneous that `list_rbind()` either fails or yields a data frame that's not very useful.
In that case, it's still useful to start by loading all of the files:
2016-03-01 22:29:58 +08:00
```{r}
2022-09-10 21:38:20 +08:00
#| eval: false
2022-09-22 23:43:21 +08:00
files <- paths |>
map(readxl::read_excel)
2016-03-01 22:29:58 +08:00
```
2022-09-22 23:43:21 +08:00
Then a very useful strategy is to convert the structure of the data frames to data so that you can explore using your data science skills.
2022-09-20 22:13:51 +08:00
One way to do so is with this handy `df_types` function that returns a tibble with one row for each column:
2022-09-20 22:13:51 +08:00
```{r}
df_types <- function(df) {
tibble(
col_name = names(df),
2022-09-22 23:43:21 +08:00
col_type = map_chr(df, vctrs::vec_ptype_full),
n_miss = map_int(df, \(x) sum(is.na(x)))
2022-09-20 22:13:51 +08:00
)
}
2016-03-01 22:29:58 +08:00
2022-09-20 22:13:51 +08:00
df_types(starwars)
2022-09-22 23:43:21 +08:00
df_types(nycflights13::flights)
2022-09-20 22:13:51 +08:00
```
2022-09-22 23:43:21 +08:00
You can then apply this function all of the files, and maybe do some pivoting to make it easy to see where there are differences.
2016-03-01 22:29:58 +08:00
2022-09-10 21:38:20 +08:00
```{r}
2022-09-20 22:13:51 +08:00
files |>
map(df_types) |>
list_rbind(names_to = "file_name") |>
2022-09-22 23:43:21 +08:00
select(-n_miss) |>
2022-09-20 22:13:51 +08:00
pivot_wider(names_from = col_name, values_from = col_type)
2022-09-10 21:38:20 +08:00
```
2016-03-01 22:29:58 +08:00
2022-09-20 22:13:51 +08:00
If the files have heterogeneous formats you might need to do more processing before you can successfully merge them.
2022-09-22 23:43:21 +08:00
Unfortunately we're now going to leave you to figure that out on your own, but you might want to read about `map_if()` and `map_at()`.
`map_if()` allows you to selectively modify elements of a list based on their values; `map_at()` allows you to selectively modify elements based on their names.
2016-03-01 22:29:58 +08:00
2022-09-10 21:38:20 +08:00
### Handling failures
2016-03-01 22:29:58 +08:00
2022-09-20 22:13:51 +08:00
Sometimes the structure of your data might be sufficiently wild that you can't even read all the files with a single command.
2022-09-22 23:43:21 +08:00
And then you'll encounter one of the downsides of map: is that it succeeds or fails as a whole.
`map()` will either successfully read all of the files in a directory or fail with an error.
This is annoying: why does one failure prevent you from accessing all the other successes?
2016-03-01 22:29:58 +08:00
2022-09-22 23:43:21 +08:00
Luckily, purrr comes with a helper to tackle this problem: `possibly()`.
When you wrap a function in possible, a failure with instead return a `NULL`.
`list_rbind()` automatically ignores `NULL`s, so the following code will always succeed:
2016-03-01 22:29:58 +08:00
```{r}
2022-09-20 22:13:51 +08:00
files <- paths |>
map(possibly(\(path) readxl::read_excel(path), NULL))
2016-03-01 22:29:58 +08:00
2022-09-20 22:13:51 +08:00
data <- files |> list_rbind()
2022-09-10 21:38:20 +08:00
```
2022-09-20 22:13:51 +08:00
Now comes the hard part of figuring out why they failed and what do to about it.
Start by getting the paths that failed:
2016-03-01 22:29:58 +08:00
```{r}
2022-09-20 22:13:51 +08:00
failed <- map_vec(files, is.null)
paths[failed]
2016-03-01 22:29:58 +08:00
```
2022-09-22 23:43:21 +08:00
Then call the import function again for each failure and figure out what went wrong.
2016-03-01 22:29:58 +08:00
## Saving multiple outputs
2016-03-01 22:29:58 +08:00
In the last section, you learned about `map()`, which is useful for reading multiple files into a single object.
In this section, we'll now explore the opposite: how can you take one or more R objects and save them to one or more files?
We'll explore this challenge using three examples:
2022-09-16 04:28:54 +08:00
- Saving multiple data frames into one database.
- Saving multiple data frames into multiple csv files.
- Saving multiple plots to multiple `.png` files.
2022-09-20 22:13:51 +08:00
### Writing to a database {#sec-save-database}
Sometimes when working with many files at once, it's not possible to fit all your data into memory at once.
If you can't `map(files, read_csv)` how can you work with your data?
One approach is to put it all into a database and then use dbplyr to access just the subsets that you need.
2022-09-16 04:28:54 +08:00
If you're lucky, the database package will provide a handy function that will take a vector of paths and load them all into the database.
2022-09-16 04:28:54 +08:00
This is the case with duckdb's `duckdb_read_csv()`:
2016-03-01 22:29:58 +08:00
```{r}
2022-09-10 21:38:20 +08:00
#| eval: false
2022-09-20 22:13:51 +08:00
con <- DBI::dbConnect(duckdb::duckdb())
duckdb::duckdb_read_csv(con, "gapminder", paths)
2016-03-01 22:29:58 +08:00
```
But here we don't have csv files, we have excel spreadsheets.
2022-09-20 22:13:51 +08:00
So we're going to have to do it "by hand".
And you can use this same pattern for databases that don't have a handy function for loading many csv files.
2022-09-20 22:13:51 +08:00
We need to start by creating a table that will fill in with data.
The easiest way to do this is by creating template for the existing data.
So we begin by loading a single row from one file and adding the year to it:
2016-03-01 22:29:58 +08:00
```{r}
template <- readxl::read_excel(paths[[1]], n_max = 1)
2022-09-20 22:13:51 +08:00
template$year <- 1952
template
```
2022-09-20 22:13:51 +08:00
Now we can connect to the database, and `DBI::dbCreateTable()` to turn our template into database table:
```{r}
con <- DBI::dbConnect(duckdb::duckdb())
2022-09-20 22:13:51 +08:00
DBI::dbCreateTable(con, "gapminder", template)
2022-09-16 04:28:54 +08:00
```
`dbCreateTable()` doesn't use the data in `template`, just variable names and types.
So if we inspect the `gapminder` table now you'll see that it's empty but it has the variables we need:
2022-09-16 04:28:54 +08:00
```{r}
2022-09-20 22:13:51 +08:00
con |> tbl("gapminder")
```
Next, we need a function that takes a single file path and reads it into R, and adds it to the `gapminder` table, the job of `DBI::dbAppendTable()`:
2022-09-16 04:28:54 +08:00
2022-09-20 22:13:51 +08:00
```{r}
append_file <- function(path) {
df <- readxl::read_excel(path)
df$year <- parse_number(basename(path))
DBI::dbAppendTable(con, "gapminder", df)
}
2022-09-16 04:28:54 +08:00
```
Now we need to call `append_csv()` once for `path`.
That's certainly possible with `map()`:
2016-03-01 22:29:58 +08:00
2022-09-16 04:28:54 +08:00
```{r}
#| eval: false
2022-09-20 22:13:51 +08:00
paths |> map(append_file)
2022-09-16 04:28:54 +08:00
```
But we don't actually care about the output, so instead of `map()` it's slightly nicer to use `walk()`.
`walk()` does exactly the same thing as `map()` but throws the output away:
2022-09-16 04:28:54 +08:00
```{r}
2022-09-20 22:13:51 +08:00
paths |> walk(append_file)
```
Now if we can see we have all the data in our table:
2022-09-20 22:13:51 +08:00
```{r}
con |>
tbl("gapminder") |>
count(year)
2022-09-20 22:13:51 +08:00
```
```{r, include = FALSE}
DBI::dbDisconnect(con, shutdown = TRUE)
2016-03-01 22:29:58 +08:00
```
2022-09-16 04:28:54 +08:00
### Writing csv files
The same basic principle applies if we want to write multiple csv files, one for each group.
2022-09-16 04:28:54 +08:00
Let's imagine that we want to take the `ggplot2::diamonds` data and save our one csv file for each `clarity`.
First we need to make those individual datasets.
One way to do that is with dplyr's `group_split()`:
```{r}
by_clarity <- diamonds |>
group_by(clarity) |>
group_split()
```
This produces a list of length 8, containing one tibble for each unique value of `clarity`:
```{r}
length(by_clarity)
by_clarity[[1]]
```
If we were going to save these data frames by hand, we might write something like:
```{r}
#| eval: false
write_csv(by_clarity[[1]], "diamonds-I1.csv")
write_csv(by_clarity[[2]], "diamonds-SI2.csv")
write_csv(by_clarity[[3]], "diamonds-SI1.csv")
...
write_csv(by_clarity[[8]], "diamonds-IF.csv")
```
This is a little different to our previous uses of `map()` because there are two arguments changing, not just one.
That means that we'll need to use `map2()` instead of `map()`.
2022-09-16 04:28:54 +08:00
But before we can use `map2()` we need to figure out the names for those files.
The most general way to do so is to use `dplyr::group_key()` to get the unique values of the grouping variables, then use `mutate()` and `str_glue()` to make a path:
2022-09-16 04:28:54 +08:00
```{r}
keys <- diamonds |>
group_by(clarity) |>
group_keys()
keys
paths <- keys |>
mutate(path = str_glue("diamonds-{clarity}.csv")) |>
pull()
paths
```
This feels a bit fiddly here because we're only working with a single group, but you can imagine this is very powerful when you're grouping by multiple variables.
2022-09-16 04:28:54 +08:00
Now that we have all the pieces in place, we can eliminate the need to copy and paste by running `walk2()`:
```{r}
walk2(by_clarity, paths, write_csv)
```
2016-03-01 22:29:58 +08:00
This is shorthand for:
2022-09-16 04:28:54 +08:00
```{r}
#| eval: false
write_csv(by_clarity[[1]], paths[[1]])
write_csv(by_clarity[[2]], paths[[2]])
write_csv(by_clarity[[3]], paths[[3]])
...
write_csv(by_clarity[[8]], paths[[8]])
```
2022-09-16 03:56:12 +08:00
2022-09-16 04:28:54 +08:00
```{r}
#| include: false
unlink(paths)
2022-09-16 04:28:54 +08:00
```
### Saving plots
We can take the same basic approach to create many plots.
We're jumping the gun here a bit because you won't learn how to save a single plot until @sec-ggsave, but hopefully you'll get the basic idea.
Let's assume you've already split up the data using `group_split()`.
Now you can use `map()` to create a list of many plots[^iteration-5]:
2022-09-16 04:28:54 +08:00
2022-09-22 23:43:21 +08:00
[^iteration-5]: You can print `plots` to get a crude animation --- you'll get one plot for each element of `plots`.
2016-03-01 22:29:58 +08:00
```{r}
plots <- by_clarity |>
map(\(df) ggplot(df, aes(carat)) + geom_histogram(binwidth = 0.01))
2022-09-16 03:56:12 +08:00
```
2022-09-16 04:28:54 +08:00
(If this was a more complicated plot you'd use a named function so there's more room for all the details.)
Then you create the file names:
2022-09-16 03:56:12 +08:00
```{r}
paths <- keys |>
mutate(path = str_glue("clarity-{clarity}.png")) |>
2022-09-16 04:28:54 +08:00
pull()
paths
```
Then use `walk2()` with `ggsave()` to save each plot:
2016-03-01 22:29:58 +08:00
2022-09-16 04:28:54 +08:00
```{r}
walk2(paths, plots, \(path, plot) ggsave(path, plot, width = 6, height = 6))
2016-03-01 22:29:58 +08:00
```
2022-09-16 04:28:54 +08:00
This is short hand for:
```{r}
#| eval: false
ggsave(paths[[1]], plots[[1]], width = 6, height = 6)
ggsave(paths[[2]], plots[[2]], width = 6, height = 6)
ggsave(paths[[3]], plots[[3]], width = 6, height = 6)
...
ggsave(paths[[8]], plots[[8]], width = 6, height = 6)
2022-09-16 04:28:54 +08:00
```
```{r}
#| include: false
unlink(paths)
```
2022-09-16 04:28:54 +08:00
### Exercises
1. Imagine you have a table of student data containing (amongst other variables) `school_name` and `student_id`. Sketch out what code you'd write if you want to save all the information for each student in file called `{student_id}.csv` in the `{school}` directory.
2022-09-10 21:38:20 +08:00
## For loops
Before we finish up this chapter, we have a duty to mention another important technique for iteration in R, the `for` loop.
`for` loops are powerful and general tool that you definitely need to learn as you become a more experienced R programmer.
But we skip them here because, as you've seen, you can solve a whole bunch of useful problems just by learning `across()`, `map()`, and `walk2()`.
If you'd like to learn more about for loops, <https://adv-r.hadley.nz/control-flow.html#loops> is one place to start.
2021-02-22 21:00:06 +08:00
2022-09-10 21:38:20 +08:00
Some people will tell you to avoid `for` loops because they are slow.
They're wrong!
(Well at least they're rather out of date, as `for` loops haven't been slow for many years.) The chief benefit of using functions like `map()` is not speed, but clarity: once you've mastered the basic idea, they make your code easier to write and to read.
2022-09-22 23:43:21 +08:00
## Summary
In this chapter you learn iteration tools to solve three problems that come up frequently when doing data science: manipulating multiple columns, reading multiple files, and saving multiple outputs.
But in general, iteration is a super power: if you know the right iteration technique, you can easily go from fixing one problems to fixing any number of problems.
Once you've mastered the techniques in this chapter, we highly recommend learning more by reading <https://purrr.tidyverse.org> and the [Functionals chapter](https://adv-r.hadley.nz/functionals.html) of *Advanced R*.
This chapter concludes the programming section of the book.
You've now learned the basics of programming in R.
You know now the data types that underpin all of the objects you work with, and have two powerful techniques (functions and iteration) for reducing the duplication in your code.
We hope you've got a taste for how programming can help your analyses, and you've made a solid start on your journey to become not just a data scientist who uses R, but a data science who can program in R.