# Iteration {#sec-iteration}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("drafting")
```
## Introduction
Iteration is somewhat of a moving target in the tidyverse because we keep adding new features to make it easier to solve problems that previously required explicit iteration.
For example:

- To draw one plot for each group you can use ggplot2's faceting.
- To compute summary statistics for subgroups you can use `dplyr::group_by()` + `dplyr::summarise()`.
- To read every .csv file in a directory you can pass a vector to `readr::read_csv()`.
- To extract every element from a named list you can use `tidyr::unnest_wider()`.
In this chapter we'll show you three related sets of tools for manipulating each column in a data frame, reading each file in a directory, and saving each element in a list.
These are the basics of iteration, focusing on the places where it comes up in an analysis.
But in general, iteration is a super power: once you've solved one problem, you can apply iteration techniques to solve every similar problem.
You can learn more in <https://purrr.tidyverse.org> and the [Functionals chapter](https://adv-r.hadley.nz/functionals.html) of *Advanced R*.
### Prerequisites
We'll use a selection of important iteration idioms from dplyr and purrr, both core members of the tidyverse.
```{r}
#| label: setup
#| message: false
library(tidyverse)
```
## Modifying multiple columns {#sec-across}
Imagine you have this simple tibble:
```{r}
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
```
And you want to compute the median of every column.
You could do it with copy-and-paste:
```{r}
df |> summarise(
  a = median(a),
  b = median(b),
  c = median(c),
  d = median(d),
  n = n()
)
```
But that breaks our rule of thumb: never copy and paste more than twice.
And you could imagine that this will get particularly tedious if you have tens or even hundreds of variables.
Instead you can use `across()`:
```{r}
df |> summarise(
  across(a:d, median),
  n = n()
)
```
There are two arguments that you'll use every time:
- The first argument specifies which columns you want to iterate over. It uses the same syntax as `select()`.
- The second argument specifies what to do with each column.
There's another argument, `.names`, that's useful when you use `across()` with `mutate()`, and two variations, `if_any()` and `if_all()`, that work with `filter()`.
These are described in detail below.
### Selecting columns with `.cols`
The first argument to `across()`, `.cols`, selects the columns to transform.
This argument uses the same specifications as `select()` (@sec-select), so you can use functions like `starts_with()` and `ends_with()` to select columns based on their names.
Grouping columns are automatically ignored because they're carried along for the ride by the dplyr verb.
There are two other techniques that you can use with both `select()` and `across()` that we didn't discuss earlier because they're particularly useful for `across()`: `everything()` and `where()`.
`everything()` is straightforward: it selects every (non-grouping) column!
```{r}
df <- tibble(
  grp = sample(2, 10, replace = TRUE),
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

df |>
  group_by(grp) |>
  summarise(across(everything(), median))
```
`where()` allows you to select columns based on their type:
- `where(is.numeric)` selects all numeric columns.
- `where(is.character)` selects all string columns.
- `where(is.Date)` selects all date columns.
- `where(is.POSIXct)` selects all date-time columns.
- `where(is.logical)` selects all logical columns.
```{r}
df <- tibble(
  x1 = 1:3,
  x2 = runif(3),
  y1 = sample(letters, 3),
  y2 = c("banana", "apple", "egg")
)

df |>
  summarise(across(where(is.numeric), mean))
df |>
  summarise(across(where(is.character), str_flatten))
```
You can combine these in the usual `select()` way with Boolean algebra so that `!where(is.numeric)` selects all non-numeric columns and `starts_with("a") & where(is.logical)` selects all logical columns whose name starts with "a".
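For example, here's a minimal sketch (reusing the `df` from above) that combines a name-based and a type-based selection:

```{r}
df |>
  summarise(across(starts_with("x") & where(is.numeric), mean))
```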
### Defining the action with `.fns`
The second argument, `.fns`, determines what happens to each column selected by the first argument.
In most cases, this will be the name of an existing function, but you can also create your own function inline, or supply multiple functions.
Let's motivate this problem with an example: what happens if we have some missing values?
It'd be nice to be able to pass along additional arguments to `median()`:
```{r}
rnorm_na <- function(n, n_na, mean = 0, sd = 1) {
  sample(c(rnorm(n - n_na, mean = mean, sd = sd), rep(NA, n_na)))
}

df <- tibble(
  a = rnorm_na(10, 2),
  b = rnorm_na(10, 2),
  c = rnorm_na(10, 4),
  d = rnorm(10)
)

df |>
  summarise(
    across(a:d, median),
    n = n()
  )
```
For complicated reasons, it's not easy to pass on arguments from `across()`, so instead we can create another function that wraps `median()` and calls it with the correct arguments.
We can write that compactly using R's anonymous function shorthand:
```{r}
df |>
  summarise(
    across(a:d, \(x) median(x, na.rm = TRUE)),
    n = n()
  )
```
This expands to the following code.
Each call is the same, apart from the argument, which changes each time.
```{r}
#| eval: false
df |> summarise(
  a = median(a, na.rm = TRUE),
  b = median(b, na.rm = TRUE),
  c = median(c, na.rm = TRUE),
  d = median(d, na.rm = TRUE),
  n = n()
)
```
This is shorthand for creating a function, as below.
It's easier to remember because you just replace the eight letters of `function` with a single `\`.
```{r}
#| results: false
df |>
  summarise(
    across(a:d, function(x) median(x, na.rm = TRUE)),
    n = n()
  )
```
As well as computing the median without the missing values, it'd be nice to know how many missing values there were.
We can do that by supplying a named list of functions to `across()`:
```{r}
df |>
  summarise(
    across(a:d, list(
      median = \(x) median(x, na.rm = TRUE),
      n_miss = \(x) sum(is.na(x))
    )),
    n = n()
  )
```
If you look carefully, you might intuit that the columns are named using a glue specification (@sec-glue) like `{.col}_{.fn}`, where `.col` is the name of the original column and `.fn` is the name of the function in the list.
That's not a coincidence: you can use the `.names` argument to supply your own specification, the topic of the next section.
### Column names
The result of `across()` is named according to the specification provided in the `.names` argument.
We could specify our own if we wanted the name of the function to come first[^iteration-1]:
[^iteration-1]: You can't currently change the order of the columns, but you could reorder them after the fact using `relocate()` or similar.
```{r}
df |>
  summarise(
    across(
      a:d,
      list(
        median = \(x) median(x, na.rm = TRUE),
        n_miss = \(x) sum(is.na(x))
      ),
      .names = "{.fn}_{.col}"
    ),
    n = n()
  )
```
The `.names` argument is particularly important when you use `across()` with `mutate()`.
By default the output of `across()` is given the same names as the inputs.
This means that `across()` inside of `mutate()` will replace existing columns:
```{r}
df |>
  mutate(
    across(a:d, \(x) x + 1)
  )
```
If you'd instead like to create new columns, you can use the `.names` argument to give the output new names:
```{r}
df |>
  mutate(
    across(a:d, \(x) x * 2, .names = "{.col}_double")
  )
```
### Filtering
`across()` is a great match for `summarise()` and `mutate()`, but it's not such a great fit for `filter()`, because there you usually combine multiple conditions with either `|` or `&`.
So dplyr provides two variants of `across()` called `if_any()` and `if_all()`:
```{r}
df |> filter(is.na(a) | is.na(b) | is.na(c) | is.na(d))
# same as:
df |> filter(if_any(a:d, is.na))

df |> filter(is.na(a) & is.na(b) & is.na(c) & is.na(d))
# same as:
df |> filter(if_all(a:d, is.na))
```
### `across()` in functions
`across()` is particularly useful to program with because it allows you to operate on multiple columns.
For example, [Jacob Scott](https://twitter.com/_wurli/status/1571836746899283969) uses this little helper to expand all date columns into year, month, and day columns:
```{r}
expand_dates <- function(df) {
  df |>
    mutate(
      across(
        where(lubridate::is.Date),
        list(year = year, month = month, day = mday)
      )
    )
}
```
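To see it in action, here's a small usage sketch with a hypothetical tibble (assuming lubridate is attached so that `ymd()`, `year()`, `month()`, and `mday()` are available):

```{r}
library(lubridate)

# A made-up tibble with a date column to feed to expand_dates()
df_date <- tibble(
  name = c("Amy", "Bob"),
  date = ymd(c("2009-08-03", "2010-01-16"))
)

df_date |>
  expand_dates()
```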
It also lets the user supply multiple variables to your function.
The key thing to remember is that the first argument to `across()` uses tidy evaluation, so you need to embrace any arguments.
For example, this function will compute the means of numeric variables by default.
But by supplying the second argument you can choose to summarize just selected variables.
```{r}
summarise_means <- function(data, summary_vars = where(is.numeric)) {
  data |>
    summarise(
      across({{ summary_vars }}, \(x) mean(x, na.rm = TRUE)),
      n = n()
    )
}

diamonds |>
  group_by(clarity) |>
  summarise_means()

diamonds |>
  group_by(clarity) |>
  summarise_means(c(carat, x:z))
```
### Vs `pivot_longer()`
Before we go on, it's worth pointing out an interesting connection between `across()` and `pivot_longer()`.
In many cases, you can perform the same calculations by first pivoting the data and then performing the operations by group rather than by column.
For example, we could rewrite our multiple summary `across()` as:
```{r}
df |>
  pivot_longer(a:d) |>
  group_by(name) |>
  summarise(
    median = median(value, na.rm = TRUE),
    n_miss = sum(is.na(value))
  )
```
This is a useful technique to know about because sometimes you'll hit a problem that's not currently possible to solve with `across()`: when you have groups of variables that you want to compute with simultaneously.
For example, imagine that our data frame contains both values and weights and we want to compute a weighted mean:
```{r}
df3 <- tibble(
  a_val = rnorm(10),
  a_w = runif(10),
  b_val = rnorm(10),
  b_w = runif(10),
  c_val = rnorm(10),
  c_w = runif(10),
  d_val = rnorm(10),
  d_w = runif(10)
)
```
There's currently no way to do this with `across()`[^iteration-2], but it's relatively straightforward with `pivot_longer()`:
[^iteration-2]: Maybe there will be one day, but currently we don't see how.
```{r}
df3_long <- df3 |>
  pivot_longer(
    everything(),
    names_to = c("group", ".value"),
    names_sep = "_"
  )
df3_long

df3_long |>
  group_by(group) |>
  summarise(mean = weighted.mean(val, w))
```
If needed, you could `pivot_wider()` this back to the original form.
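For example, here's a minimal sketch: because the long form has no explicit row identifier, we create one per group first, then widen with a `names_glue` that reconstructs the original column names:

```{r}
df3_long |>
  group_by(group) |>
  mutate(row = row_number()) |>
  ungroup() |>
  pivot_wider(
    id_cols = row,
    names_from = group,
    values_from = c(val, w),
    names_glue = "{group}_{.value}"
  ) |>
  select(-row)
```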
### Exercises
1. Compute the number of unique values in each column of `palmerpenguins::penguins`.
2. Compute the mean of every column in `mtcars`.
3. Group `diamonds` by `cut`, `clarity`, and `color` then count the number of observations and the mean of each numeric variable.
4. What happens if you use a list of functions, but don't name them? How is the output named?
5. It is possible to use `across()` inside `filter()` where it's equivalent to `if_all()`. Can you explain why?
## Reading multiple files
Imagine you have a directory full of Excel spreadsheets[^iteration-3] you want to read.
You could do it with copy and paste:
[^iteration-3]: If you instead had a directory of csv files with the same format, you can use the technique from @sec-readr-directory.
```{r}
#| eval: false
data2019 <- readxl::read_excel("data/y2019.xls")
data2020 <- readxl::read_excel("data/y2020.xls")
data2021 <- readxl::read_excel("data/y2021.xls")
data2022 <- readxl::read_excel("data/y2022.xls")
```
And then use `dplyr::bind_rows()` to combine them all together:
```{r}
#| eval: false
data <- bind_rows(data2019, data2020, data2021, data2022)
```
But you can imagine that this would get tedious quickly, especially if you had 400 files, not just four.
In the following sections, you'll learn how to use `dir()` to list all the files in a directory, then `purrr::map()` to read each of them into a list, and then `purrr::list_rbind()` to combine them into a single data frame.
We'll then discuss how you can use these tools as the challenge level increases.
### Listing files in a directory
`dir()` lists the files in a directory.
You'll almost always use three arguments:
- `path`, the first argument, which you won't usually name, is the directory to look in.
- `pattern` is a regular expression that file names must match to be included in the output.
The most common pattern is to match an extension like `\\.xlsx$` or `\\.csv$`, but you can use whatever you need to extract your data files.
- `full.names` determines whether or not the directory name should be included in the output.
You almost always want this to be `TRUE`.
For example, this book contains a folder with 12 Excel spreadsheets containing data from the gapminder package.
Each file provides the life expectancy, population, and per capita GDP for 142 countries for one year.
We can list them all with the appropriate call to `dir()`:
```{r}
paths <- dir("data/gapminder", pattern = "\\.xlsx$", full.names = TRUE)
paths
```
### `purrr::map()` and `list_rbind()`
Now that we have these 12 paths, we call `read_excel()` 12 times to get 12 data frames.
We're going to make a small generalization compared to the example above.
Since, in general, we won't know how many files there are to read, instead of loading each individual data frame into its own variable, we'll put them all into a list, something like this:
```{r}
#| eval: false
list(
  readxl::read_excel("data/gapminder/1952.xlsx"),
  readxl::read_excel("data/gapminder/1957.xlsx"),
  readxl::read_excel("data/gapminder/1962.xlsx"),
  ...,
  readxl::read_excel("data/gapminder/2007.xlsx")
)
```
Now that's just as tedious to type as before, but we can use a shortcut: `purrr::map()`.
`map()` is similar to `across()`, but instead of doing something to each column in a data frame, it does something to each element of a vector.
`map(x, f)` is shorthand for:
```{r}
#| eval: false
list(
  f(x[[1]]),
  f(x[[2]]),
  ...,
  f(x[[n]])
)
```
So we can use `map()` to get a list of 12 data frames:
```{r}
files <- map(paths, readxl::read_excel)
length(files)
files[[1]]
```
(This is another data structure that doesn't display particularly compactly with `str()`, so you might want to load it into RStudio and inspect it with `View()`.)
Now we can use `purrr::list_rbind()` to combine that list of data frames into a single data frame:
```{r}
list_rbind(files)
```
Or we could do both steps at once in a pipeline:
```{r}
#| results: false
paths |>
  map(readxl::read_excel) |>
  list_rbind()
```
What if we want to pass in extra arguments to `read_excel()`?
We use the same trick that we used with `across()`.
For example, it's often useful to peek at just the first few rows of the data, which we can do with `n_max`:
```{r}
paths |>
  map(\(path) readxl::read_excel(path, n_max = 1)) |>
  list_rbind()
```
This makes it very clear that each individual sheet doesn't contain the year, which is only recorded in the path.
We'll tackle that problem next.
### Data in the path
Sometimes the name of the file is itself data.
In this example, the file name contains the year, which is not otherwise recorded in the individual data frames.
To get that column into the final data frame, we need to do two things.
Firstly, we name the vector of paths.
The easiest way to do this is with the `set_names()` function, which can take a function.
Here we use `basename()` to extract just the file name from the full path:
```{r}
paths <- paths |> set_names(basename)
paths
```
Those names are automatically carried along by all the map functions, so the list of data frames will have the same names:
```{r}
#| eval: false
paths |>
  map(readxl::read_excel) |>
  names()
```
Then we use the `names_to` argument to `list_rbind()` to tell it to save the names into a new column called `year`, and use `readr::parse_number()` to turn the strings into numbers.
```{r}
paths |>
  set_names(basename) |>
  map(readxl::read_excel) |>
  list_rbind(names_to = "year") |>
  mutate(year = parse_number(year))
```
In more complicated cases, there might be other variables stored in the directory name, or maybe the file name contains multiple bits of data.
In that case, use `set_names()` without any arguments to record the full path, and then use `tidyr::separate()` and friends to turn the pieces into useful columns.
```{r}
paths |>
  set_names() |>
  map(readxl::read_excel) |>
  list_rbind(names_to = "year") |>
  separate(year, into = c(NA, "directory", "file", "ext"), sep = "[/.]")
```
### Save your work
Now that you've done all this hard work to get to a nice tidy data frame, make sure to save your work!
In terms of organising your analysis project, you might want to have a file called `0-cleanup.R` that generates nice csv files to be used by the rest of your project.
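Here's a minimal sketch of what such a script might contain, reusing the gapminder pipeline from above (the output file name is just a suggestion):

```{r}
#| eval: false
gapminder <- paths |>
  set_names(basename) |>
  map(readxl::read_excel) |>
  list_rbind(names_to = "year") |>
  mutate(year = parse_number(year))

# Save the tidied result so downstream scripts can just read_csv() it
write_csv(gapminder, "gapminder.csv")
```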
### Many simple iterations
If you need to read and transform your data in some way, you have two basic options: do one round of iteration with a complex function, or do multiple rounds of iteration with simple functions.
In our experience, you will be better off with many simple iterations, but most folks reach first for one complex iteration.
Let's make that concrete with an example.
Imagine that you want to read in a bunch of files, filter out missing values, pivot them, and then join them all together.
One way to approach the problem is to write a function that takes a file and does all those steps:
```{r}
#| eval: false
process_file <- function(path) {
  df <- read_csv(path)

  df |>
    filter(!is.na(id)) |>
    mutate(id = tolower(id)) |>
    pivot_longer(jan:dec, names_to = "month")
}
```
Then you call `map()` once:
```{r}
#| eval: false
all <- paths |>
  map(process_file) |>
  list_rbind()
```
Alternatively, you could read all the files first:
```{r}
#| eval: false
data <- paths |>
  map(read_csv) |>
  list_rbind()
```
Then rely on dplyr functions to do the rest:
```{r}
#| eval: false
data |>
  filter(!is.na(id)) |>
  mutate(id = tolower(id)) |>
  pivot_longer(jan:dec, names_to = "month")
```
We think this second approach is usually more desirable because it stops you from getting fixated on getting the first file right before moving on to the rest.
By considering all of the data when you do your tidying and cleaning, you're more likely to think holistically about the problems and end up with a higher quality result.
### Heterogeneous data
Unfortunately, sometimes this strategy fails because the data frames are so heterogeneous that `list_rbind()` either fails or yields a data frame that's not very useful.
In that case, it's still useful to start by getting all of the files into memory:
```{r}
#| eval: false
files <- paths |> map(readxl::read_excel)
```
Then a very useful strategy is to capture the structure of the data frames as data, so that you can explore it with the tools you already know.
One way to do so is with this handy `df_types` function that returns a tibble with one row for each column:
```{r}
df_types <- function(df) {
  tibble(
    col_name = names(df),
    col_type = map_chr(df, vctrs::vec_ptype_full)
  )
}

df_types(starwars)
```
You can then apply this function to all of the files:
```{r}
files |>
  map(df_types) |>
  list_rbind(names_to = "file_name") |>
  pivot_wider(names_from = col_name, values_from = col_type)
```
If the files have heterogeneous formats, you might need to do more processing before you can successfully merge them.
You can use `map_if()` or `map_at()` to selectively modify inputs: use `map_if()` if it's easier to select the elements to transform with a function, and `map_at()` if you can select them based on their names.
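For example, here's a hedged sketch (the `id` column is hypothetical) that converts `id` to character, but only in the data frames where it's currently numeric:

```{r}
#| eval: false
files |>
  map_if(
    \(df) is.numeric(df$id),          # predicate: which elements to modify
    \(df) mutate(df, id = as.character(id))  # transformation applied to those
  )
```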
### Handling failures
Sometimes the structure of your data might be sufficiently wild that you can't even read all the files with a single command.
One of the downsides of `map()` is that it succeeds or fails as a whole: either you successfully read all of the files in a directory or you fail with an error.
This is annoying: why does one failure prevent you from accessing all the other successes?
How do you ensure that one bad apple doesn't ruin the whole barrel?
Luckily, purrr comes with a helper for this situation: `possibly()`.
It takes a function and returns a modified version that, instead of erroring, returns a default value (here `NULL`).
Any failure will then put a `NULL` in the list of files, and `list_rbind()` will automatically ignore those `NULL`s.
```{r}
files <- paths |>
  map(possibly(\(path) readxl::read_excel(path), NULL))

data <- files |> list_rbind()
```
Now comes the hard part: figuring out why the failures failed and what to do about them.
Start by getting the paths that failed:
```{r}
failed <- map_vec(files, is.null)
paths[failed]
```
Then the hard work begins: you'll have to look at each failure, call the import function again, and figure out what went wrong.
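For example, a minimal sketch to surface the first error message:

```{r}
#| eval: false
# Re-run the import on the first failing path to see the error it throws
readxl::read_excel(paths[failed][[1]])
```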
## Saving multiple objects
So far we've focused on `map()`, which is designed for functions that return something.
But some functions don't return things, they instead do things (i.e. their return value isn't important).
This sort of function includes:
- Saving data to a database.
- Saving data to disk, like `readr::write_csv()`.
- Saving plots to disk with `ggsave()`.
In this section, you'll learn about `map()`'s friend `walk()`, which is designed to work with this sort of function.
Along the way you'll see how to use it to load multiple csv files into a database and turn multiple plots into files.
### Writing to a database {#sec-save-database}
Sometimes when working with many files, it's not possible to load all your data into memory at once.
If you can't `map(files, read_csv)`, how can you work with your data?
Well, one approach is to put it all into a database and then use dbplyr to access just the subsets that you need.
Sometimes the database package will provide a handy function that takes a vector of paths and loads them all into the database.
This is the case with duckdb's `duckdb_read_csv()`:
```{r}
#| eval: false
con <- DBI::dbConnect(duckdb::duckdb())
duckdb::duckdb_read_csv(con, "gapminder", paths)
```
But we don't have csv files, we have Excel spreadsheets, so we're going to have to do it "by hand"; you can use the same pattern for databases that don't have a convenient bulk loader.
Unlike in @sec-load-data, we're not using `dbWriteTable()`, because we're going to create the table once and then append to it multiple times.
So instead we'll use `dbCreateTable()` and `dbAppendTable()`.
We first create an empty table with the fields we'll use:
```{r}
con <- DBI::dbConnect(duckdb::duckdb())
template <- readxl::read_excel(paths[[1]])
template$year <- 1952

DBI::dbCreateTable(con, "gapminder", template)
```
Unlike `dbWriteTable()`, `dbCreateTable()` doesn't load in any data.
Its job is to create a table with the right fields of the right types:
```{r}
con |> tbl("gapminder")
```
Now we need a function that takes a single path and loads it into an existing table in the database with `dbAppendTable()`:
```{r}
append_file <- function(path) {
  df <- readxl::read_excel(path)
  df$year <- parse_number(basename(path))

  DBI::dbAppendTable(con, "gapminder", df)
}
```
Now we need to call `append_file()` once for each value of `path`.
That's certainly possible with `map()`:
```{r}
#| eval: false
paths |> map(append_file)
```
But we don't actually care about the output, so instead we can use `walk()`.
This does exactly the same thing as `map()` but throws the output away.
```{r}
paths |> walk(append_file)
```
Now if we look at the data we can see we have all the data in one place:
```{r}
con |> tbl("gapminder")
```
```{r}
#| include: false
DBI::dbDisconnect(con, shutdown = TRUE)
```
### Writing csv files
The same basic principle applies if we want to save out multiple csv files, one for each group.
Let's imagine that we want to take the `ggplot2::diamonds` data and save one csv file for each `clarity`.
First we need to make those individual datasets.
One way to do that is with dplyr's `group_split()`:
```{r}
by_clarity <- diamonds |>
  group_by(clarity) |>
  group_split()
```
This produces a list of length 8, containing one tibble for each unique value of `clarity`:
```{r}
length(by_clarity)
by_clarity[[1]]
```
If we were going to save these data frames by hand, we might write something like:
```{r}
#| eval: false
write_csv(by_clarity[[1]], "diamonds-I1.csv")
write_csv(by_clarity[[2]], "diamonds-SI2.csv")
write_csv(by_clarity[[3]], "diamonds-SI1.csv")
...
write_csv(by_clarity[[8]], "diamonds-IF.csv")
```
This is a little different to our previous uses of `map()` because instead of varying one argument we're now varying two.
This means that we'll need to use `map2()` instead of `map()`.
We'll also need to generate the names for those files somehow.
The most general way to do so is to use `dplyr::group_keys()`:
```{r}
keys <- diamonds |>
  group_by(clarity) |>
  group_keys()
keys

paths <- keys |>
  mutate(path = str_glue("diamonds-{clarity}.csv")) |>
  pull()
paths
```
This feels a bit fiddly here because we're only grouping by a single variable, but you can imagine how powerful this is when you group by multiple variables.
Now that we have all the pieces in place, we can eliminate the need to copy and paste by running `walk2()`:
```{r}
#| eval: false
walk2(by_clarity, paths, write_csv)
```
### Saving plots
You can take the same basic approach if you want to create many plots.
We're jumping the gun here a bit because you won't learn how to save a single plot until @sec-ggsave, but hopefully the basic idea is easy enough to follow.
Let's first split up the data:
```{r}
by_cyl <- mtcars |> group_by(cyl)
```
Then create the plots using `map()` to call `ggplot()` repeatedly with different datasets.
That gives us a list of plots[^iteration-4]:
[^iteration-4]: You can print `plots` to get a crude animation --- you'll get one plot for each element of `plots`.
```{r}
plots <- by_cyl |>
  group_split() |>
  map(\(df) ggplot(df, aes(mpg, wt)) + geom_point())
```
(If this was a more complicated plot you'd use a named function so there's more room for all the details.)
Then you create the file names:
```{r}
paths <- by_cyl |>
  group_keys() |>
  mutate(path = str_glue("cyl-{cyl}.png")) |>
  pull()
paths
```
Then use `walk2()` with `ggsave()` to save each plot:
```{r}
walk2(plots, paths, \(plot, name) ggsave(name, plot, path = tempdir()))
```
This is shorthand for:
```{r}
#| eval: false
ggsave(paths[[1]], plots[[1]], path = tempdir())
ggsave(paths[[2]], plots[[2]], path = tempdir())
ggsave(paths[[3]], plots[[3]], path = tempdir())
```
It's barely necessary here, but you can imagine how useful this would be if you had to create hundreds of plots.
### Exercises
1. Imagine you have a table of student data containing (amongst other variables) `school_name` and `student_id`. Sketch out what code you'd write if you wanted to save all the information for each student in a file called `{student_id}.csv` in the `{school_name}` directory.
## For loops
Another way to attack this sort of problem is with a `for` loop.
We don't teach for loops here to stay focused.
They're definitely important.
You can learn more about them and how they're connected to the map functions in purrr in <https://adv-r.hadley.nz/control-flow.html#loops> and <https://adv-r.hadley.nz/functionals.html>.
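To show the connection, here's a minimal sketch of the for-loop equivalent of the `map()` + `list_rbind()` pipeline used earlier in this chapter:

```{r}
#| eval: false
# Pre-allocate a list, fill it one element per iteration, then combine
files <- vector("list", length(paths))
for (i in seq_along(paths)) {
  files[[i]] <- readxl::read_excel(paths[[i]])
}
data <- list_rbind(files)
```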
Once you master these functions, you'll find it takes much less time to solve iteration problems.
But you should never feel bad about using a `for` loop instead of a map function.
The map functions are a step up a tower of abstraction, and it can take a long time to get your head around how they work.
The important thing is that you solve the problem that you're working on, not write the most concise and elegant code (although that's definitely something you want to strive towards!).
Some people will tell you to avoid `for` loops because they are slow.
They're wrong!
(Well, at least they're rather out of date, as `for` loops haven't been slow for many years.) The chief benefit of using functions like `map()` is not speed, but clarity: they make your code easier to write and to read.
If you actually need to worry about performance, you'll know; it'll be obvious.
Until then, don't worry about it.