Update iteration.qmd (#1179)

Just some minor edits and one comment. I hope that's useful.
mcsnowface, PhD 2022-12-10 12:38:07 -07:00 committed by GitHub
parent 34bb0a5f44
commit 73d779d8e0
1 changed file with 28 additions and 29 deletions


@@ -12,7 +12,7 @@ status("polishing")
In this chapter, you'll learn tools for iteration, repeatedly performing the same action on different objects.
Iteration in R generally looks rather different than it does in other programming languages because so much of it is implicit and we get it for free.
For example, if you want to double a numeric vector `x` in R, you can just write `2 * x`.
In most other languages, you'd need to explicitly double each element of x using some sort of for loop.
In most other languages, you'd need to explicitly double each element of `x` using some sort of for loop.
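To make the contrast concrete, here's a minimal sketch (using a made-up vector `x`) of the vectorized form next to the explicit loop you'd write in many other languages:

```{r}
x <- c(1, 2, 3)

# Implicit iteration: the whole vector is doubled at once
2 * x

# The explicit equivalent you'd write in most other languages
out <- numeric(length(x))
for (i in seq_along(x)) {
  out[i] <- 2 * x[i]
}
out
```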
This book has already given you a small but powerful number of tools that perform the same action for multiple "things":
@@ -32,7 +32,7 @@ If you want to live life on the edge you can get the dev version with `devtools:
In this chapter, we'll focus on tools provided by dplyr and purrr, both core members of the tidyverse.
You've seen dplyr before, but [purrr](http://purrr.tidyverse.org/) is new.
We're going to use just a couple of purrr functions from in this chapter, but it's a great package to explore as you improve your programming skills.
We're just going to use a couple of purrr functions in this chapter, but it's a great package to explore as you improve your programming skills.
```{r}
#| label: setup
@@ -67,7 +67,7 @@ df |> summarize(
```
That breaks our rule of thumb to never copy and paste more than twice, and you can imagine that this will get very tedious if you have tens or even hundreds of columns.
Instead you can use `across()`:
Instead, you can use `across()`:
```{r}
df |> summarize(
@@ -129,14 +129,14 @@ df_types |>
```
As with other selectors, you can combine these with Boolean algebra.
For example, `!where(is.numeric)` selects all non-numeric columns and `starts_with("a") & where(is.logical)` selects all logical columns whose name starts with "a".
For example, `!where(is.numeric)` selects all non-numeric columns, and `starts_with("a") & where(is.logical)` selects all logical columns whose name starts with "a".
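As a quick illustration (a sketch using a small made-up tibble, so the column names here are just placeholders):

```{r}
df_sel <- tibble(
  a_lgl = c(TRUE, FALSE, TRUE),
  a_num = c(1, 2, 3),
  b_lgl = c(FALSE, FALSE, TRUE)
)

# Selects only a_lgl: the logical columns whose names start with "a"
df_sel |> summarize(across(starts_with("a") & where(is.logical), mean))
```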
### Calling a single function
The second argument to `across()` defines how each column will be transformed.
In simple cases, as above, this will be a single existing function.
This is a pretty special feature of R: we're passing one function (`median`, `mean`, `str_flatten`, ...) to another function (`across`).
This is one of the features that makes R a function programming language.
This is one of the features that makes R a functional programming language.
It's important to note that we're passing this function to `across()` so that `across()` can call it; we're not calling it ourselves.
That means the function name should never be followed by `()`.
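For example (a sketch reusing `df` from above, with the column range `a:d` assumed from the earlier example; the second line is deliberately wrong and left unevaluated):

```{r}
#| eval: false
df |> summarize(across(a:d, median))    # right: pass median itself so across() can call it
df |> summarize(across(a:d, median()))  # wrong: this calls median() immediately, with no arguments
```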
@@ -159,7 +159,7 @@ median()
### Calling multiple functions
In more complex cases, you might want to supply additional arguments or perform multiple transformations.
Lets motivate this problem with a simple example: what happens if we have some missing values in our data?
Let's motivate this problem with a simple example: what happens if we have some missing values in our data?
`median()` propagates those missing values, giving us a suboptimal output:
```{r}
@@ -224,7 +224,7 @@ df_miss |>
)
```
When we remove the missing values from the `median()`, it would be nice to know just how many values we were removing.
When we remove the missing values from the `median()`, it would be nice to know just how many values were removed.
We can find that out by supplying two functions to `across()`: one to compute the median and the other to count the missing values.
You supply multiple functions by using a named list to `.fns`:
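The chunk that follows is elided from this diff; roughly, the pattern looks like this (reusing `df_miss`, with the column range `a:d` assumed):

```{r}
#| eval: false
df_miss |>
  summarize(
    across(a:d, list(
      median = \(x) median(x, na.rm = TRUE),
      n_miss = \(x) sum(is.na(x))
    ))
  )
```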
@@ -266,7 +266,7 @@ df_miss |>
```
The `.names` argument is particularly important when you use `across()` with `mutate()`.
By default the output of `across()` is given the same names as the inputs.
By default, the output of `across()` is given the same names as the inputs.
This means that `across()` inside of `mutate()` will replace existing columns.
For example, here we use `coalesce()` to replace `NA`s with `0`:
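That chunk is elided here too; a sketch of the pattern being described (again assuming `df_miss` with numeric columns `a:d`):

```{r}
#| eval: false
df_miss |>
  mutate(across(a:d, \(x) coalesce(x, 0)))
```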
@@ -486,7 +486,6 @@ We'll then discuss how you can handle situations of increasing heterogeneity, wh
### Listing files in a directory
As the name suggests, `list.files()` lists the files in a directory.
TO CONSIDER: why not use it via the more obvious name `list.files()`?
You'll almost always use three arguments:
- The first argument, `path`, is the directory to look in.
@@ -573,7 +572,7 @@ length(files)
files[[1]]
```
(This is another data structure that doesn't display particularly compactly with `str()` so you might want to load into RStudio and inspect it with `View()`).
(This is another data structure that doesn't display particularly compactly with `str()`, so you might want to load it into RStudio and inspect it with `View()`.)
Now we can use `purrr::list_rbind()` to combine that list of data frames into a single data frame:
@@ -581,7 +580,7 @@ Now we can use `purrr::list_rbind()` to combine that list of data frames into a
list_rbind(files)
```
Or we could do both steps at once in pipeline:
Or we could do both steps at once in a pipeline:
```{r}
#| results: false
@@ -592,7 +591,7 @@ paths |>
What if we want to pass in extra arguments to `read_excel()`?
We use the same technique that we used with `across()`.
For example, it's often useful to peak at the first few row of the data with `n_max = 1`:
For example, it's often useful to peek at the first few rows of the data with `n_max = 1`:
```{r}
paths |>
@@ -605,9 +604,9 @@ We'll tackle that problem next.
### Data in the path {#sec-data-in-the-path}
Sometimes the name of the file is itself data.
Sometimes the name of the file is data itself.
In this example, the file name contains the year, which is not otherwise recorded in the individual files.
To get that column into the final data frame, we need to do two things.
To get that column into the final data frame, we need to do two things:
First, we name the vector of paths.
The easiest way to do this is with the `set_names()` function, which can take a function.
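For instance, a sketch that names each element of `paths` with its file name:

```{r}
#| eval: false
paths |>
  set_names(basename)
```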
@@ -695,11 +694,11 @@ If your input data files change over time, you might consider learning a tool li
### Many simple iterations
Here we've just loaded the data directly from disk, and we were lucky enough to get a tidy dataset.
In most cases, you'll need to do some additional tidying, and you have two basic basic options: you can do one round of iteration with a complex function, or do a multiple rounds of iteration with simple functions.
In most cases, you'll need to do some additional tidying, and you have two basic options: you can do one round of iteration with a complex function, or do multiple rounds of iteration with simple functions.
In our experience most folks reach first for one complex iteration, but you're often better off doing multiple simple iterations.
For example, imagine that you want to read in a bunch of files, filter out missing values, pivot, and then combine.
One way to approach the problem is write a function that takes a file and does all those steps then call `map()` once:
One way to approach the problem is to write a function that takes a file and does all those steps, then call `map()` once:
```{r}
#| eval: false
@@ -730,7 +729,7 @@ paths |>
list_rbind()
```
We recommend this approach because it stops you getting fixated on getting the first file right because moving on to the rest.
We recommend this approach because it stops you from fixating on getting the first file right before moving on to the rest.
By considering all of the data when doing tidying and cleaning, you're more likely to think holistically and end up with a higher quality result.
In this particular example, there's another optimization you could make, by binding all the data frames together earlier.
@@ -748,7 +747,7 @@ paths |>
### Heterogeneous data
Unfortunately sometimes it's not possible to go from `map()` straight to `list_rbind()` because the data frames are so heterogeneous that `list_rbind()` either fails or yields a data frame that's not very useful.
Unfortunately, sometimes it's not possible to go from `map()` straight to `list_rbind()` because the data frames are so heterogeneous that `list_rbind()` either fails or yields a data frame that's not very useful.
In that case, it's still useful to start by loading all of the files:
```{r}
@@ -757,7 +756,7 @@ files <- paths |>
map(readxl::read_excel)
```
Then a very useful strategy is to capture the structure of the data frames to data so that you can explore it using your data science skills.
Then a very useful strategy is to capture the structure of the data frames so that you can explore it using your data science skills.
One way to do so is with this handy `df_types` function that returns a tibble with one row for each column:
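The function's definition sits between the hunks of this diff; a minimal sketch of what such a helper might look like (using `vctrs::vec_ptype_full()` to describe each column's type):

```{r}
# Sketch: one row per column, giving its name and its type as a string
df_types <- function(df) {
  tibble(
    col_name = names(df),
    col_type = map_chr(df, vctrs::vec_ptype_full)
  )
}
```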
```{r}
@@ -773,7 +772,7 @@ df_types(starwars)
df_types(nycflights13::flights)
```
You can then apply this function all of the files, and maybe do some pivoting to make it easy to see where there are differences.
You can then apply this function to all of the files, and maybe do some pivoting to make it easier to see where the differences are.
For example, this makes it easy to verify that the gapminder spreadsheets that we've been working with are all quite homogeneous:
```{r}
@@ -784,8 +783,8 @@ files |>
pivot_wider(names_from = col_name, values_from = col_type)
```
If the files have heterogeneous formats you might need to do more processing before you can successfully merge them.
Unfortunately we're now going to leave you to figure that out on your own, but you might want to read about `map_if()` and `map_at()`.
If the files have heterogeneous formats, you might need to do more processing before you can successfully merge them.
Unfortunately, we're now going to leave you to figure that out on your own, but you might want to read about `map_if()` and `map_at()`.
`map_if()` allows you to selectively modify elements of a list based on their values; `map_at()` allows you to selectively modify elements based on their names.
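For example, a hypothetical sketch: suppose some spreadsheets store `year` as text, so only those elements need fixing before binding:

```{r}
#| eval: false
files |>
  # Only modify the data frames whose year column came in as character
  map_if(\(df) is.character(df$year), \(df) mutate(df, year = as.integer(year))) |>
  list_rbind()
```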
### Handling failures
@@ -808,7 +807,7 @@ data <- files |> list_rbind()
This works particularly well here because `list_rbind()`, like many tidyverse functions, automatically ignores `NULL`s.
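The reading step above is elided in this diff; the pattern being described looks roughly like this, with `purrr::possibly()` turning read failures into `NULL`s instead of errors:

```{r}
#| eval: false
files <- paths |>
  map(possibly(\(path) readxl::read_excel(path), NULL))
```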
Now you have all the data that can be read easily, and it's time to tackle the hard part of figuring out why some files failed load and what do to about it.
Now you have all the data that can be read easily, and it's time to tackle the hard part of figuring out why some files failed to load and what to do about it.
Start by getting the paths that failed:
```{r}
@@ -825,13 +824,13 @@ In this section, we'll now explore sort of the opposite problem: how can you tak
We'll explore this challenge using three examples:
- Saving multiple data frames into one database.
- Saving multiple data frames into multiple csv files.
- Saving multiple data frames into multiple `.csv` files.
- Saving multiple plots to multiple `.png` files.
### Writing to a database {#sec-save-database}
Sometimes, when working with many files, it's not possible to fit all your data into memory at once, and you can't do `map(files, read_csv)`.
One approach to deal with this problem is to load your into a database so you can access just the bits you need with dbplyr.
One approach to deal with this problem is to load your data into a database so you can access just the bits you need with dbplyr.
If you're lucky, the database package you're using will provide a handy function that takes a vector of paths and loads them all into the database.
This is the case with duckdb's `duckdb_read_csv()`:
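That chunk is elided from the diff; a sketch of the call (assuming `con` is a duckdb connection, `"gapminder"` the table name, and `paths` the vector of csv files):

```{r}
#| eval: false
con <- DBI::dbConnect(duckdb::duckdb())
duckdb::duckdb_read_csv(con, "gapminder", paths)
```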
@@ -856,7 +855,7 @@ template$year <- 1952
template
```
Now we can connect to the database, and use `DBI::dbCreateTable()` to turn our template into database table:
Now we can connect to the database, and use `DBI::dbCreateTable()` to turn our template into a database table:
```{r}
con <- DBI::dbConnect(duckdb::duckdb())
@@ -931,7 +930,7 @@ This gives us a new tibble with eight rows and two columns.
by_clarity$data[[1]]
```
While we're here, lets create a column that gives the name of output file, using `mutate()` and `str_glue()`:
While we're here, let's create a column that gives the name of the output file, using `mutate()` and `str_glue()`:
```{r}
by_clarity <- by_clarity |>
@@ -1026,11 +1025,11 @@ unlink(by_clarity$path)
```
## Summary
In this chapter you've seen how to use explicit iteration to solve three problems that come up frequently when doing data science: manipulating multiple columns, reading multiple files, and saving multiple outputs.
In this chapter, you've seen how to use explicit iteration to solve three problems that come up frequently when doing data science: manipulating multiple columns, reading multiple files, and saving multiple outputs.
But in general, iteration is a superpower: if you know the right iteration technique, you can easily go from fixing one problem to fixing all the problems.
Once you've mastered the techniques in this chapter, we highly recommend learning more by reading the [Functionals chapter](https://adv-r.hadley.nz/functionals.html) of *Advanced R* and consulting the [purrr website](https://purrr.tidyverse.org).
If you know much about iteration in other languages you might be surprised that we didn't discuss the `for` loop.
If you know much about iteration in other languages, you might be surprised that we didn't discuss the `for` loop.
That's because R's orientation towards data analysis changes how we iterate: in most cases you can rely on an existing idiom to do something to each column or each group.
And when you can't, you can often use a functional programming tool like `map()` that does something to each element of a list.
However, you will see `for` loops in wild-caught code, so you'll learn about them in the next chapter where we'll discuss some important base R tools.