More hacking away at iteration

This commit is contained in:
Hadley Wickham 2022-09-15 14:56:12 -05:00
parent 242b9e9c11
commit 525807e842
2 changed files with 194 additions and 77 deletions

View File

@ -29,7 +29,8 @@ For example:
- To compute summary statistics for subgroups you can use `dplyr::group_by()` + `dplyr::summarise()`.
- To read every .csv file in a directory you can pass a vector to `readr::read_csv()`.
- To extract every element from a named list you can use `tidyr::unnest_wider()`.
In this section we'll show you three related sets of tools for manipulating each column in a data frame, reading each file in a directory, and saving multiple outputs.
### Prerequisites
@ -44,6 +45,8 @@ library(tidyverse)
## For each column
### Motivation
Imagine you have this simple tibble:
```{r}
@ -77,27 +80,63 @@ df %>% summarise(
)
```
There are two arguments that you'll use every time:
- The first argument specifies which columns you want to iterate over. It uses the same syntax as `select()`.
- The second argument specifies what to do with each column.
There's another argument, `.names`, that's useful when you use `across()` with `mutate()`, and two variations, `if_any()` and `if_all()`, that work with `filter()`.
These are described in detail below.
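As a minimal sketch (assuming numeric columns `a` through `d`, as in the surrounding examples), a typical call looks like:

```{r}
#| eval: false
df |> summarise(across(a:d, median))
```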
### Which columns
The first argument to `across()`, `.cols`, selects the columns to transform.
This argument uses the same specifications as `select()`, @sec-select, so you can use functions like `starts_with()` and `ends_with()` to select variables based on their name.
There are two other techniques that you can use with both `select()` and `across()` that we didn't discuss earlier because they're particularly useful for `across()`: `everything()` and `where()`.
`everything()` is straightforward: it selects every (non-grouping) column!
```{r}
df <- tibble(
  grp = sample(2, 10, replace = TRUE),
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
df |>
  group_by(grp) |>
  summarise(across(everything(), median))
```
`where()` allows you to select columns based on their type:
- `where(is.numeric)` selects all numeric columns.
- `where(is.character)` selects all string columns.
- `where(is.Date)` selects all date columns.
- `where(is.POSIXct)` selects all date-time columns.
- `where(is.logical)` selects all logical columns.
You can combine these in the usual `select()` way with Boolean algebra, so that `!where(is.numeric)` selects all non-numeric columns, and `starts_with("a") & where(is.logical)` selects all logical columns whose name starts with "a".
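For example, a quick sketch that summarises every numeric (non-grouping) column:

```{r}
df |>
  group_by(grp) |>
  summarise(across(where(is.numeric), mean))
```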
### Extra arguments
The second argument, `.fns`, determines what happens to each column selected by the first argument.
In most cases, this will be the name of an existing function, but you can also create your own function inline, or supply multiple functions.
Let's motivate this problem with an example: what happens if we have some missing values?
It'd be nice to be able to pass along additional arguments to `median()`:
```{r}
rnorm_na <- function(n, n_na, mean = 0, sd = 1) {
  sample(c(rnorm(n - n_na, mean = mean, sd = sd), rep(NA, n_na)))
}

df <- tibble(
  a = rnorm_na(10, 2),
  b = rnorm_na(10, 2),
  c = rnorm_na(10, 4),
  d = rnorm(10)
)
df %>% summarise(
@ -124,28 +163,8 @@ df %>% summarise(
)
```
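One way to pass those extra arguments along is with an inline anonymous function, as sketched here:

```{r}
df |>
  summarise(across(a:d, \(x) median(x, na.rm = TRUE)))
```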
As well as computing the median without missing values, it'd be nice to know how many missing values there were.
We can do that by supplying a named list of functions to `across()`:
```{r}
df %>% summarise(
@ -156,18 +175,62 @@ df %>% summarise(
)
```
Note that the new columns are named using a glue specification (@sec-glue) like `{.col}_{.fn}`, where `.col` is the name of the original column and `.fn` is the name of the function in the list.
That's not a coincidence: you can use the `.names` argument to supply your own specification.
### Column names
The result of `across()` is named according to the specification provided in the `.names` argument.
We could specify our own if we wanted the name of the function to come first.
(You can't currently change the order of the columns.)
```{r}
df %>% summarise(
  across(a:d, list(
    median = \(x) median(x, na.rm = TRUE),
    n_miss = \(x) sum(is.na(x))
  ), .names = "{.fn}_{.col}")
)
```
The `.names` argument is particularly important when you use `across()` with `mutate()`.
By default the outputs of `across()` are given the same names as the inputs.
This means that `across()` inside of `mutate()` will replace existing columns:
```{r}
df %>% mutate(
  across(a:d, \(x) x + 1)
)
```
If you'd like to instead create new columns, you can supply the `.names` argument which takes a glue specification where `{.col}` refers to the current column name.
```{r}
df %>% mutate(
  across(a:d, \(x) x * 2, .names = "{.col}_2")
)
```
### Filtering
`across()` is a great match for `summarise()` and `mutate()`, but it's not such a great fit for `filter()`, because there you usually combine multiple conditions with either `|` or `&`.
So dplyr provides two variants of `across()` called `if_any()` and `if_all()`:
```{r}
df |> filter(is.na(a) | is.na(b) | is.na(c) | is.na(d))
# same as:
df |> filter(if_any(a:d, is.na))

df |> filter(is.na(a) & is.na(b) & is.na(c) & is.na(d))
# same as:
df |> filter(if_all(a:d, is.na))
```
### Vs `pivot_longer()`
Before we go on, it's worth pointing out an interesting connection between `across()` and `pivot_longer()`.
In many cases, you can perform the same calculations by first pivoting the data and then performing the operations by group rather than by column.
For example, we could rewrite our multiple summary `across()` as:
```{r}
df |>
@ -179,10 +242,11 @@ df |>
)
```
This is a useful technique to know about because sometimes you'll hit a problem that's not currently possible to solve with `across()`: when you have groups of variables that you want to compute with simultaneously.
For example, imagine that our data frame contains both values and weights and we want to compute a weighted mean:
```{r}
df3 <- tibble(
  a_val = rnorm(10),
  a_w = runif(10),
  b_val = rnorm(10),
@ -192,38 +256,76 @@ df <- tibble(
  d_val = rnorm(10),
  d_w = runif(10)
)
```
There's currently no way to do this with `across()`[^iteration-1], but it's relatively straightforward with `pivot_longer()`:
[^iteration-1]: Maybe there will be one day, but currently we don't see how.
```{r}
df3_long <- df3 |>
  pivot_longer(
    everything(),
    names_to = c("group", ".value"),
    names_sep = "_"
  )
df3_long

df3_long |>
  group_by(group) |>
  summarise(mean = weighted.mean(val, w))
```
If needed, you could `pivot_wider()` this back to the original form.
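For instance, a sketch of that round trip (the `names_glue` specification here is an assumption, not from the chapter):

```{r}
#| eval: false
df3_long |>
  pivot_wider(
    names_from = group,
    values_from = c(val, w),
    names_glue = "{group}_{.value}"
  )
```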
### Exercises
1. Compute the number of unique values in each column of `palmerpenguins::penguins`.
2. Compute the mean of every column in `mtcars`.
3. Group `diamonds` by `cut`, `clarity`, and `color` then count the number of observations and the mean of each numeric variable.
4. What happens if you use a list of functions, but don't name them? How is the output named?
5. It is possible to use `across()` inside `filter()` where it's equivalent to `if_all()`. Can you explain why?
## For each file
Imagine you have a directory full of Excel spreadsheets[^iteration-2] you want to read in.
You could do it with copy and paste:
[^iteration-2]: If you instead had a directory of csv files with the same format, you can use `read_csv()` directly: `read_csv(c("data/y2019.csv", "data/y2020.csv", "data/y2021.csv", "data/y2022.csv"))`.
```{r}
#| eval: false
data2019 <- readxl::read_excel("data/y2019.xls")
data2020 <- readxl::read_excel("data/y2020.xls")
data2021 <- readxl::read_excel("data/y2021.xls")
data2022 <- readxl::read_excel("data/y2022.xls")
```
And then use `dplyr::bind_rows()` to combine them all together:
```{r}
#| eval: false
data <- bind_rows(data2019, data2020, data2021, data2022)
```
But you can imagine that this would get tedious quickly, since often you won't have four files, but more like 400.
In this section you'll first learn a little bit about the base `dir()` function, which allows you to list all the files in a directory.
Then you'll learn about `map()`, which lets you repeatedly apply a function to each element of a vector, allowing you to read many files in one step.
`map()` is similar to `across()`, but instead of doing something to each column in a data frame, it does something to each element of a vector.
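A minimal sketch of the idea: apply a function to each element of a vector and get back a list:

```{r}
map(c(1, 4, 9), sqrt)
```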
### Listing files in a directory
To list the files in a directory, use `dir()`.
Use its `pattern` argument, a regular expression, to keep only the files that match.
Always use `full.names = TRUE` so that the returned paths include the directory.
If you're lucky, you can then just pass the paths straight to `readr::read_csv(paths)`.
```{r}
#| eval: false
paths <- dir("data", pattern = "\\.xls$", full.names = TRUE)
```
Otherwise you'll need to do it yourself.
### Basic pattern
The basic pattern has two steps: first read every file into a list, then join the pieces back together into a single data frame.
@ -232,9 +334,6 @@ You split the problem up into pieces (here paths), apply a function to each piec
```{r}
#| eval: false
paths <- dir(pattern = "\\.xls$")
paths |>
  map(\(path) readxl::read_excel(path)) |>
  list_rbind()
@ -242,17 +341,13 @@ paths |>
### Data in the path
If the file name itself contains data, try:
```{r}
#| eval: false
paths |>
  set_names(basename) |>
  map(\(path) readxl::read_excel(path)) |>
  list_rbind(names_to = "path")
```
@ -279,7 +374,6 @@ process_file <- function(path) {
  pivot_longer(jan:dec, names_to = "month")
}
paths <- dir("data", full.names = TRUE)
all <- paths |>
  map(process_file) |>
  list_rbind()
@ -290,8 +384,6 @@ Alternatively, you could write
```{r}
#| eval: false
paths <- dir("data", full.names = TRUE)
data <- paths |>
  map(read_csv) |>
  list_rbind()
@ -307,23 +399,24 @@ If you need to do more work to get `list_rbind()` to work, you should do it, but
This is particularly important if the structure of your data varies in some way because it's usually easier to understand the variations when you have them all in front of you.
It's much easier to interactively experiment and figure out what the right approach is.
### Heterogeneous data
However, sometimes that's not possible because the data frames are sufficiently inconsistent that `list_rbind()` either fails or yields a data frame that's not very useful.
In that case, start by loading all the files:
```{r}
#| eval: false
files <- paths |> map(\(path) readxl::read_excel(path))
```
Then you can iteratively test your tidying code as you develop it.
If the files have heterogeneous formats you might need to do more processing before you can successfully merge them.
You can use `map_if()` or `map_at()` to selectively modify inputs.
Use `map_if()` if it's easier to select the elements to transform with a function; use `map_at()` if you can tell based on their names.
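For example, a hypothetical sketch (the `yr` column and the `rename()` fix are assumptions for illustration):

```{r}
#| eval: false
# Suppose some files used "yr" where the others used "year":
files <- files |>
  map_if(\(df) "yr" %in% names(df), \(df) rename(df, year = yr))
```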
After spending all this effort, save the result to a new csv file.
In terms of organising your analysis project, you might want to have a file called `0-cleanup.R` that generates nice csv files to be used by the rest of your project.
### For really inconsistent data
If the files are really inconsistent, one useful way to get some traction is to think about the structure of the files as data itself.
@ -340,6 +433,14 @@ You could then think about pivotting or plotting this code to understand what th
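A sketch of that idea (the helper and column names here are assumptions): record each file's column names and types as a data frame, then combine them:

```{r}
#| eval: false
df_structure <- function(df) {
  tibble(
    col_name = names(df),
    col_type = map_chr(df, \(col) class(col)[[1]])
  )
}

# if `files` isn't named, file_id will just be the position;
# use set_names(paths, basename) upstream to get file names instead
files |>
  map(df_structure) |>
  list_rbind(names_to = "file_id")
```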
### Handling failures
Sometimes you might not be able to read every file successfully.
In that case, wrap the reading function in `safely()` to capture the failure instead of throwing an error:
```{r}
#| eval: false
paths |>
  map(safely(\(path) readxl::read_excel(path)))
```
When you use the map functions to repeat many operations, the chances are much higher that one of those operations will fail.
When this happens, you'll get an error message, and no output.
This is annoying: why does one failure prevent you from accessing all the other successes?
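Here's a sketch of how you might then separate the successes from the failures (the intermediate names are assumptions):

```{r}
#| eval: false
out <- paths |> map(safely(\(path) readxl::read_excel(path)))
ok <- out |> map_lgl(\(x) is.null(x$error))

paths[!ok] # the files that failed to read

out[ok] |> # combine the successful reads
  map(\(x) x$result) |>
  list_rbind()
```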
@ -402,7 +503,10 @@ y$result[is_ok] |> flatten_dbl()
## Writing multiple outputs
So far we've focused on `map()`, which is designed for functions that return something.
But some functions don't return data; instead they change the state of the world in some way.
In this section, you'll learn about `map()`'s friend `walk()`, which is designed to work with this sort of function.
Along the way you'll see how to use it to load multiple csv files into a database and to save multiple plots to files.
### Very large data
@ -419,34 +523,41 @@ Otherwise:
```{r}
#| eval: false
template <- read_csv(paths[[1]])
DBI::dbWriteTable(con, "cars", filter(template, FALSE))

append_csv <- function(path) {
  df <- read_csv(path)
  DBI::dbAppendTable(con, "cars", df)
}
paths |> walk(append_csv)
```
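Note that this sketch assumes an existing DBI connection `con`; for example, an in-memory duckdb database (assuming you have duckdb installed):

```{r}
#| eval: false
con <- DBI::dbConnect(duckdb::duckdb())
```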
Or maybe you could write one clean csv file for each input file and then read them all with `arrow::open_dataset()`.
### Saving plots
To save plots, we need to embrace a new challenge: there are now two important arguments: the object you want to save and the place you want to save it.
So we're going to switch from `walk()` to `walk2()`, which differs from `map()` in two ways: it iterates over two arguments at the same time, and it hides its output.
Let's first make some plots:
```{r}
#| eval: false
plots <- mtcars |>
  # split() (unlike group_split()) names the list elements after the
  # grouping values ("4", "6", "8"), which we use for file names below
  split(mtcars$cyl) |>
  map(\(df) ggplot(df, aes(mpg, wt)) + geom_point())
```
Then, we can use the plot names to generate file names and save each plot:
```{r}
#| eval: false
file_names <- str_c(names(plots), ".pdf")
plots |>
  walk2(file_names, \(plot, name) ggsave(name, plot, path = tempdir()))
```
## For loops

View File

@ -18,7 +18,8 @@ Next, we'll discuss the basics of regular expressions, a powerful tool for descr
The chapter finishes up with functions that work with individual letters, including a brief discussion of where your expectations from English might steer you wrong when working with other languages, and a few useful non-stringr functions.
This chapter is paired with two other chapters.
Regular expressions are a big topic, so we'll come back to them again in @sec-regular-expressions.
We'll also come back to strings again in @sec-programming-with-strings where we'll look at them from a programming perspective rather than a data analysis perspective.
### Prerequisites
@ -138,7 +139,10 @@ One of the challenges of working with text is that there's a variety of ways tha
3. `\\\\\\`
2. Create the string in your R session and print it.
What happens to the special "\\u00a0"?
How does `str_view()` display it?
Can you do a little googling to figure out what this special character is?
```{r}
x <- "This\u00a0is\u00a0tricky"
@ -182,7 +186,7 @@ df |> mutate(
)
```
### `str_glue()` {#sec-glue}
If you are mixing many fixed and variable strings with `str_c()`, you'll notice that you have to type `""` repeatedly, and this can make it hard to see the overall goal of the code.
An alternative approach is provided by the [glue package](https://glue.tidyverse.org) via `str_glue()`[^strings-4].
@ -325,7 +329,8 @@ str_detect(c("x", "X"), "x")
In general, any letter or number will match exactly, but punctuation characters like `.`, `+`, `*`, `[`, `]`, and `?` often have special meanings[^strings-8].
For example, `.` will match any character[^strings-9], so `"a."` will match any string that contains an "a" followed by another character:
[^strings-8]: You'll learn how to escape this special behaviour in @sec-regexp-escaping.
@ -342,7 +347,8 @@ This shows which characters are matched by colouring the match blue and surround
str_view_all(c("a", "ab", "ae", "bd", "ea", "eab"), "a.")
```
Regular expressions are a powerful and flexible language which we'll come back to in @sec-regular-expressions.
Here we'll introduce only the most important components: quantifiers and character classes.
**Quantifiers** control how many times a pattern can match: `?` makes a pattern optional (i.e. it matches 0 or 1 times), `+` lets a pattern repeat (i.e. it matches at least once), and `*` lets a pattern be optional or repeat (i.e. it matches any number of times, including 0).
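For example, a quick sketch of each quantifier in action:

```{r}
x <- c("a", "ab", "abb")
str_view_all(x, "ab?") # "b" is optional, so this matches in all three strings
str_view_all(x, "ab+") # needs at least one "b", so no match in "a"
str_view_all(x, "ab*") # any number of "b"s, so this matches in all three strings
```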