Bang out more details on writing files
parent
525807e842
commit
8e0e6db1d2
|
@ -678,7 +678,7 @@ It's also possible to control individual components of each theme, like the size
|
||||||
Unfortunately, this level of detail is outside the scope of this book, so you'll need to read the [ggplot2 book](https://ggplot2-book.org/) for the full details.
|
Unfortunately, this level of detail is outside the scope of this book, so you'll need to read the [ggplot2 book](https://ggplot2-book.org/) for the full details.
|
||||||
You can also create your own themes, if you are trying to match a particular corporate or journal style.
|
You can also create your own themes, if you are trying to match a particular corporate or journal style.
|
||||||
|
|
||||||
## Saving your plots
|
## Saving your plots {#sec-ggsave}
|
||||||
|
|
||||||
There are two main ways to get your plots out of R and into your final write-up: `ggsave()` and knitr.
|
There are two main ways to get your plots out of R and into your final write-up: `ggsave()` and knitr.
|
||||||
`ggsave()` will save the most recent plot to disk:
|
`ggsave()` will save the most recent plot to disk:
|
||||||
|
|
173
iteration.qmd
|
@ -32,6 +32,10 @@ For example:
|
||||||
|
|
||||||
In this section we'll show you three related sets of tools for manipulating each column in a data frame, reading each file in a directory, and saving objects.
|
In this section we'll show you three related sets of tools for manipulating each column in a data frame, reading each file in a directory, and saving objects.
|
||||||
|
|
||||||
|
We're going to give the very basics of iteration, focusing on the places where it comes up in an analysis.
|
||||||
|
But in general, iteration is a superpower: once you've solved one problem, you can apply iteration techniques to solve every similar problem.
|
||||||
|
You can learn more in <https://purrr.tidyverse.org> and the [Functionals chapter](https://adv-r.hadley.nz/functionals.html) of *Advanced R*.
|
||||||
|
|
||||||
### Prerequisites
|
### Prerequisites
|
||||||
|
|
||||||
We'll use a selection of useful iteration idioms from dplyr and purrr, both core members of the tidyverse.
|
We'll use a selection of useful iteration idioms from dplyr and purrr, both core members of the tidyverse.
|
||||||
|
@ -43,7 +47,7 @@ We'll use a selection of useful iteration idioms from dplyr and purrr, both core
|
||||||
library(tidyverse)
|
library(tidyverse)
|
||||||
```
|
```
|
||||||
|
|
||||||
## For each column
|
## Modifying multiple columns
|
||||||
|
|
||||||
### Motivation
|
### Motivation
|
||||||
|
|
||||||
|
@ -286,7 +290,7 @@ If needed, you could `pivot_wider()` this back to the original form.
|
||||||
4. What happens if you use a list of functions, but don't name them? How is the output named?
|
4. What happens if you use a list of functions, but don't name them? How is the output named?
|
||||||
5. It is possible to use `across()` inside `filter()` where it's equivalent to `if_all()`. Can you explain why?
|
5. It is possible to use `across()` inside `filter()` where it's equivalent to `if_all()`. Can you explain why?
|
||||||
|
|
||||||
## For each file
|
## Reading multiple files
|
||||||
|
|
||||||
Imagine you have a directory full of Excel spreadsheets[^iteration-2] you want to read in.
|
Imagine you have a directory full of Excel spreadsheets[^iteration-2] you want to read in.
|
||||||
You could do it with copy and paste:
|
You could do it with copy and paste:
|
||||||
|
@ -504,62 +508,183 @@ y$result[is_ok] |> flatten_dbl()
|
||||||
## Writing multiple outputs
|
## Writing multiple outputs
|
||||||
|
|
||||||
So far we've focused on map, which is designed for functions that return something.
|
So far we've focused on map, which is designed for functions that return something.
|
||||||
But some functions don't return data, they instead change the state of the world in some way.
|
But some functions don't return things; instead, they do things (i.e. their return value isn't important).
|
||||||
|
This sort of function includes:
|
||||||
|
|
||||||
|
- Saving data to a database.
|
||||||
|
- Saving data to disk, like `readr::write_csv()`.
|
||||||
|
- Saving plots to disk with `ggsave()`.
|
||||||
|
|
||||||
|
Rather than returning data, they change the state of the world in some way.
|
||||||
In this section, you'll learn about `map()`'s friend `walk()`, which is designed to work with this sort of function.
|
In this section, you'll learn about `map()`'s friend `walk()`, which is designed to work with this sort of function.
|
||||||
Along the way you'll see how to use it to load multiple csv files into a database and turn multiple plots into files.
|
Along the way you'll see how to use it to load multiple csv files into a database and turn multiple plots into files.
|
||||||
|
|
||||||
### Very large data
|
### Writing to a database
|
||||||
|
|
||||||
Another exception to this rule is if you have very large data --- it might be impossible to store all the data in memory at once.
|
Sometimes when working with many files, it's not possible to load all your data into memory at once.
|
||||||
If you're lucky, the database you're working with will have a function to load csv files directly into the database.
|
If you can't `map(files, read_csv)`, how can you work with your data?
|
||||||
For example, if you're using duckdb, you can:
|
Well, one approach is to put it all into a database and then use dbplyr to access just the subsets that you need.
|
||||||
|
|
||||||
|
Sometimes the database package will provide a handy function that will take a vector of paths and load them all into the database.
|
||||||
|
This is the case with duckdb's `duckdb_read_csv()`:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
#| eval: false
|
#| eval: false
|
||||||
duckdb::duckdb_read_csv(con, "cars", paths)
|
duckdb::duckdb_read_csv(con, "cars", paths)
|
||||||
```
|
```
|
||||||
|
|
||||||
Otherwise:
|
But with other databases you'll need to do it yourself.
|
||||||
|
The key idea is to write a function that loads your data and then immediately appends it to an existing table with `dbAppendTable()`:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
#| eval: false
|
#| eval: false
|
||||||
template <- read_csv(paths[[1]])
|
|
||||||
DBI::dbWriteTable(con, "cars", filter(template, FALSE))
|
|
||||||
|
|
||||||
append_csv <- function(path) {
|
append_csv <- function(path) {
|
||||||
df <- read_csv(path)
|
df <- read_csv(path)
|
||||||
DBI::dbAppendTable(con, "cars", df)
|
DBI::dbAppendTable(con, "cars", df)
|
||||||
}
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Then you just need to create a table to fill in.
|
||||||
|
Here I use a `filter()` that's guaranteed to select zero rows to create a table that will have the right column names and types.
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
#| eval: false
|
||||||
|
con <- DBI::dbConnect(RSQLite::SQLite(tempfile()))
|
||||||
|
|
||||||
|
template <- read_csv(paths[[1]])
|
||||||
|
DBI::dbWriteTable(con, "cars", filter(template, FALSE))
|
||||||
|
```
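
To see why the zero-row trick works, here is a minimal sketch. It assumes a small made-up tibble (`template` below is hypothetical and stands in for the first csv file) rather than real files or a database connection, and only checks that filtering with `FALSE` keeps the column names and types while dropping every row.

```r
library(dplyr)

# Hypothetical stand-in for the data frame read from the first path.
template <- tibble(model = c("a", "b"), mpg = c(21.0, 22.8), cyl = c(6L, 4L))

# A single FALSE is recycled across every row, so no rows are kept.
empty <- filter(template, FALSE)

nrow(empty)           # number of rows left
names(empty)          # column names are unchanged
sapply(empty, class)  # column types are unchanged
```

Because the empty data frame keeps its column types, a table created from it should be ready to receive rows with the same structure.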
|
||||||
|
|
||||||
|
Then I need to call `append_csv()` once for each value of `path`.
|
||||||
|
That's certainly possible with map:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
#| eval: false
|
||||||
|
paths |> map(append_csv)
|
||||||
|
```
|
||||||
|
|
||||||
|
But we don't actually care about the output, so instead we can use `walk()`.
|
||||||
|
This does exactly the same thing as `map()` but throws the output away.
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
#| eval: false
|
||||||
paths |> walk(append_csv)
|
paths |> walk(append_csv)
|
||||||
```
|
```
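
The difference between `map()` and `walk()` can be sketched without a database. The example below assumes a hypothetical side-effect function, `record()`, standing in for `append_csv()`, and made-up file names that are never read. `walk()` still calls the function once per element, but it returns its input invisibly instead of a list of results.

```r
library(purrr)

# Hypothetical side-effect function: it appends each path to `seen`
# and returns a value that we don't care about.
seen <- character()
record <- function(path) {
  seen <<- c(seen, path)
  TRUE
}

demo_paths <- c("a.csv", "b.csv")  # made-up names; no files are read

out_map <- map(demo_paths, record)    # a list of return values
out_walk <- walk(demo_paths, record)  # the input, returned invisibly

length(seen)  # `record()` ran once per path for each of the two calls
```

Returning the input invisibly is also what lets `walk()` sit in the middle of a pipeline.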
|
||||||
|
|
||||||
Or maybe you just write one clean csv for each file and then read with `arrow::open_dataset()`.
|
### Writing csv files
|
||||||
|
|
||||||
|
The same basic principle applies if we want to save out multiple csv files, one for each group.
|
||||||
|
Let's imagine that we want to take the `ggplot2::diamonds` data and save out one csv file for each `clarity`.
|
||||||
|
First we need to make those individual datasets.
|
||||||
|
One way to do that is with dplyr's `group_split()`:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
by_clarity <- diamonds |>
|
||||||
|
group_by(clarity) |>
|
||||||
|
group_split()
|
||||||
|
```
|
||||||
|
|
||||||
|
This produces a list of length 8, containing one tibble for each unique value of `clarity`:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
length(by_clarity)
|
||||||
|
|
||||||
|
by_clarity[[1]]
|
||||||
|
```
|
||||||
|
|
||||||
|
If we were going to save these data frames by hand, we might write something like:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
#| eval: false
|
||||||
|
write_csv(by_clarity[[1]], "diamonds-I1.csv")
|
||||||
|
write_csv(by_clarity[[2]], "diamonds-SI2.csv")
|
||||||
|
write_csv(by_clarity[[3]], "diamonds-SI1.csv")
|
||||||
|
...
|
||||||
|
write_csv(by_clarity[[8]], "diamonds-IF.csv")
|
||||||
|
```
|
||||||
|
|
||||||
|
This is a little different compared to our previous uses of `map()` because instead of changing one argument we're now changing two.
|
||||||
|
This means that we'll need to use `map2()` instead of `map()`.
|
||||||
|
|
||||||
|
We'll also need to generate the names for those files somehow.
|
||||||
|
The most general way to do so is to use `dplyr::group_keys()`:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
keys <- diamonds |>
|
||||||
|
group_by(clarity) |>
|
||||||
|
group_keys()
|
||||||
|
keys
|
||||||
|
|
||||||
|
paths <- keys |>
|
||||||
|
mutate(path = str_glue("diamonds-{clarity}.csv")) |>
|
||||||
|
pull()
|
||||||
|
paths
|
||||||
|
```
|
||||||
|
|
||||||
|
This feels a bit fiddly here because we're only grouping by a single variable, but you can imagine this is very powerful if you want to group by multiple variables.
|
||||||
|
|
||||||
|
Now that we have all the pieces in place, we can eliminate the need to copy and paste by running `walk2()`:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
#| eval: false
|
||||||
|
walk2(by_clarity, paths, write_csv)
|
||||||
|
```
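
Here is a self-contained sketch of the same pattern that is safe to run anywhere. It assumes a tiny made-up data frame (`gems`, hypothetical) in place of `diamonds`, and writes into a subdirectory of `tempdir()` rather than the working directory; both choices are illustrative only.

```r
library(dplyr)
library(purrr)
library(readr)

# Small made-up data frame standing in for `diamonds`.
gems <- tibble(
  clarity = c("I1", "I1", "SI2", "IF"),
  price = c(300, 450, 520, 900)
)

by_group <- gems |> group_by(clarity)
pieces <- group_split(by_group)  # one tibble per group
keys <- group_keys(by_group)     # one row per group, in the same order

# Write into a temporary directory (an assumption of this sketch).
out_dir <- file.path(tempdir(), "gems-demo")
dir.create(out_dir, showWarnings = FALSE)
out_paths <- file.path(out_dir, paste0("gems-", keys$clarity, ".csv"))

# Pair each piece with its path and write it out.
walk2(pieces, out_paths, write_csv)
file.exists(out_paths)
```

Because `group_split()` and `group_keys()` are both computed from the same grouped data frame, the pieces and the paths line up element by element.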
|
||||||
|
|
||||||
### Saving plots
|
### Saving plots
|
||||||
|
|
||||||
To save plots, we need to embrace a new challenge: there's now two important arguments: the object you want to save and the place you want to save it.
|
We can take the same basic approach if we want to create many plots.
|
||||||
So we're going to switch from `walk()` to `walk2()`.
|
We're jumping the gun here a bit because you won't learn how to save a single plot until @sec-ggsave, but hopefully
|
||||||
|
|
||||||
`walk2()`.
|
Let's first split up the data:
|
||||||
It differs in two ways: it iterates over two arguments at the same time, and it hides the output.
|
|
||||||
Let's first make some plots:
|
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
plots <- mtcars |>
|
by_cyl <- mtcars |> group_by(cyl)
|
||||||
group_split(cyl) |>
|
```
|
||||||
|
|
||||||
|
Then create the plots using `map()` to call `ggplot()` repeatedly with different datasets.
|
||||||
|
That gives us a list of plots[^iteration-3]:
|
||||||
|
|
||||||
|
[^iteration-3]: You can print `plots` to get a crude animation --- you'll get one plot for each element of `plots`.
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
plots <- by_cyl |>
|
||||||
|
group_split() |>
|
||||||
map(\(df) ggplot(df, aes(mpg, wt)) + geom_point())
|
map(\(df) ggplot(df, aes(mpg, wt)) + geom_point())
|
||||||
```
|
```
|
||||||
|
|
||||||
Then
|
(If this was a more complicated plot you'd use a named function so there's more room for all the details.)
|
||||||
|
|
||||||
|
Then you create the file names:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
file_names <- str_c(names(plots), ".pdf")
|
paths <- by_cyl |>
|
||||||
|
group_keys() |>
|
||||||
plots |>
|
mutate(path = str_glue("cyl-{cyl}.png")) |>
|
||||||
walk2(file_names, \(plot, name) ggsave(name, plot, path = tempdir()))
|
pull()
|
||||||
|
paths
|
||||||
```
|
```
|
||||||
|
|
||||||
|
Then use `walk2()` with `ggsave()` to save each plot:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
walk2(plots, paths, \(plot, name) ggsave(name, plot, path = tempdir()))
|
||||||
|
```
|
||||||
|
|
||||||
|
This is shorthand for:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
#| eval: false
|
||||||
|
ggsave(paths[[1]], plots[[1]], path = tempdir())
|
||||||
|
ggsave(paths[[2]], plots[[2]], path = tempdir())
|
||||||
|
ggsave(paths[[3]], plots[[3]], path = tempdir())
|
||||||
|
```
|
||||||
|
|
||||||
|
It's barely necessary here, but you can imagine how useful this would be if you had to create hundreds of plots.
|
||||||
|
|
||||||
|
### Exercises
|
||||||
|
|
||||||
|
1. Imagine you have a table of student data containing (amongst other variables) `school_name` and `student_id`. Sketch out what code you'd write if you want to save all the information for each student in a file called `{student_id}.csv` in the `{school_name}` directory.
|
||||||
|
|
||||||
## For loops
|
## For loops
|
||||||
|
|
||||||
Another way to attack this sort of problem is with a `for` loop.
|
Another way to attack this sort of problem is with a `for` loop.
|
||||||
|
|