Hacking pipes chapter

2022-02-15 15:34:31 -06:00 · 2022-02-15 15:34:31 -06:00 · 5b7f2de32d
parent 5283553b74
commit 5b7f2de32d
2 changed files with 49 additions and 127 deletions
--- a/preface-2e.Rmd
+++ b/preface-2e.Rmd
@@ -9,7 +9,9 @@ Welcome to the second edition of "R for Data Science".
 -   Data import also gains a whole part that goes beyond importing rectangular data to include chapters on working with spreadsheets, databases, and web scraping.
 -   The iteration chapter gains a new case study on web scraping from multiple pages.
 -   The modeling part has been removed. For modeling, we recommend using packages from [tidymodels](https://www.tidymodels.org/) and reading [Tidy Modeling with R](https://www.tmwr.org/) by Max Kuhn and Julia Silge to learn more about them.
+-   We've switched from the magrittr pipe to the base pipe.

 ## Acknowledgements {.unnumbered}

 *TO DO: Add acknowledgements.*
+
--- a/workflow-pipes.Rmd
+++ b/workflow-pipes.Rmd
@@ -1,169 +1,92 @@
 # Workflow: Pipes {#workflow-pipes}

+```{r, results = "asis", echo = FALSE}
+status("restructuring")
+```
+
 ## Introduction

 Pipes are a powerful tool for clearly expressing a sequence of multiple operations.
-So far, you've been using them without knowing how they work, or what the alternatives are.
-Now, in this chapter, it's time to explore the pipe in more detail.
-You'll learn the alternatives to the pipe, when you shouldn't use the pipe, and some useful related tools.
+We briefly introduced them in the previous chapter, but before going much further I want to explain a little more about how they work and give a splash of history.

 ### Prerequisites

-The pipe, `%>%`, comes from the **magrittr** package by Stefan Milton Bache.
-Packages in the tidyverse load `%>%` for you automatically, so you don't usually load magrittr explicitly.
-Here, however, we're focussing on piping, and we aren't loading any other packages, so we will load it explicitly.
+The pipe `|>` is built into R itself so you don't need anything else 😄.
+But we'll also discuss another historically important pipe, `%>%`, which comes from the magrittr package and is made available when you load the tidyverse.

 ```{r setup, message = FALSE}
-library(magrittr)
+library(tidyverse)
 ```

-## Piping alternatives
+## Why use a pipe?

 The point of the pipe is to help you write code in a way that is easier to read and understand.
-To see why the pipe is so useful, we're going to explore a number of ways of writing the same code.
-Let's use code to tell a story about a little bunny named Foo Foo:
-
-> Little bunny Foo Foo\
-> Went hopping through the forest\
-> Scooping up the field mice\
-> And bopping them on the head
-
-This is a popular Children's poem that is accompanied by hand actions.
-
-We'll start by defining an object to represent little bunny Foo Foo:
+Imagine you wanted to express the following sequence of actions as R code: find keys, start car, drive to work, park.
+You could write it as nested function calls:

 ```{r, eval = FALSE}
-foo_foo <- little_bunny()
+park(drive(start_car(find("keys")), to = "work"))
 ```

-And we'll use a function for each key verb: `hop()`, `scoop()`, and `bop()`.
-Using this object and these verbs, there are (at least) four ways we could retell the story in code:
-
-1.  Save each intermediate step as a new object.
-2.  Overwrite the original object many times.
-3.  Compose functions.
-4.  Use the pipe.
-
-We'll work through each approach, showing you the code and talking about the advantages and disadvantages.
-
-### Intermediate steps
-
-The simplest approach is to save each step as a new object:
+But writing it out with the pipe gives it a more natural, easier-to-read structure:

 ```{r, eval = FALSE}
-foo_foo_1 <- hop(foo_foo, through = forest)
-foo_foo_2 <- scoop(foo_foo_1, up = field_mice)
-foo_foo_3 <- bop(foo_foo_2, on = head)
+find("keys") |> 
+  start_car() |>  
+  drive(to = "work") |> 
+  park()
 ```

-The main downside of this form is that it forces you to name each intermediate element.
-If there are natural names, this is a good idea, and you should do it.
-But many times, like this in this example, there aren't natural names, and you add numeric suffixes to make the names unique.
-That leads to two problems:
+Behind the scenes, the pipe actually transforms your code to the first form.
+In other words, `x |> f(y)` is equivalent to `f(x, y)`.
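+
+As a minimal sketch of that rewriting (assuming R 4.1 or later; `x`, `f`, and `y` are placeholder names that don't need to exist), you can ask R to show the parsed form of a piped expression without evaluating it:
+
+```{r}
+# quote() captures the expression as parsed, without running it.
+# The base pipe is rewritten by the parser, so no `|>` remains.
+piped <- quote(x |> f(y))
+piped
+
+# The parsed call is identical to the nested form.
+identical(piped, quote(f(x, y)))
+```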

-1.  The code is cluttered with unimportant names
+## magrittr and the `%>%` pipe

-2.  You have to carefully increment the suffix on each line.
+If you've been using the tidyverse for a while, you might be more familiar with `%>%` than `|>`.
+`%>%` comes from the **magrittr** package by Stefan Milton Bache and has been available since 2014.
+This pipe was so successful that in 2021 the base pipe, `|>`, was added to R 4.1.0.

-Whenever I write code like this, I invariably use the wrong number on one line and then spend 10 minutes scratching my head and trying to figure out what went wrong with my code.
+`|>` is inspired by `%>%`, and the tidyverse team was involved in its design.
+`|>` offers fewer features than `%>%`, but we largely believe this to be a feature.
+`%>%` was an experiment and included many speculative features that seemed like a good idea at the time, but in hindsight added too much complexity relative to their advantages.
+The development of the base pipe gave us an opportunity to reset back to the most useful core.

-You may also worry that this form creates many copies of your data and takes up a lot of memory.
-Surprisingly, that's not the case.
-First, note that proactively worrying about memory is not a useful way to spend your time: worry about it when it becomes a problem (i.e. you run out of memory), not before.
-Second, R isn't stupid, and it will share columns across data frames, where possible.
-Let's take a look at an actual data manipulation pipeline where we add a new column to `ggplot2::diamonds`:
+## Changing the argument
+
+There is one feature that `%>%` has that `|>` currently lacks: a very easy way to change which argument you pass the object to --- you just put a `.` where you want the object on the left of the pipe to go.
+Ironically this is particularly important for many base functions which were designed well before the pipe existed.
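+
+As a rough sketch of the `.` placeholder (the use of `lm()` and this particular model are our own illustration, not something the chapter relies on), here the data frame on the left is routed to the `data` argument rather than to the first argument:
+
+```{r}
+library(magrittr)
+
+# `.` marks where the left-hand side goes; lm()'s first argument is the formula.
+fit <- mtcars %>% lm(mpg ~ cyl, data = .)
+
+# Compare against the same model fitted with an ordinary call.
+all.equal(coef(fit), coef(lm(mpg ~ cyl, data = mtcars)))
+```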
+
+One particularly challenging example is extracting a single column out of a data frame with `$`.
+With `%>%` you can write the fairly straightforward:

 ```{r}
-diamonds <- ggplot2::diamonds
-diamonds2 <- diamonds %>% 
-  dplyr::mutate(price_per_carat = price / carat)
-
-pryr::object_size(diamonds)
-pryr::object_size(diamonds2)
-pryr::object_size(diamonds, diamonds2)
+mtcars %>% .$cyl
 ```

-`pryr::object_size()` gives the memory occupied by all of its arguments.
-The results seem counterintuitive at first:
-
-   `diamonds` takes up 3.46 MB,
-   `diamonds2` takes up 3.89 MB,
-   `diamonds` and `diamonds2` together take up 3.89 MB!
-
-How can that work?
-Well, `diamonds2` has 10 columns in common with `diamonds`: there's no need to duplicate all that data, so the two data frames have variables in common.
-These variables will only get copied if you modify one of them.
-In the following example, we modify a single value in `diamonds$carat`.
-That means the `carat` variable can no longer be shared between the two data frames, and a copy must be made.
-The size of each data frame is unchanged, but the collective size increases:
+But the base pipe requires the rather cryptic:

 ```{r}
-diamonds$carat[1] <- NA
-pryr::object_size(diamonds)
-pryr::object_size(diamonds2)
-pryr::object_size(diamonds, diamonds2)
+mtcars |> (`$`)(cyl)
 ```

-(Note that we use `pryr::object_size()` here, not the built-in `object.size()`.
-`object.size()` only takes a single object so it can't compute how data is shared across multiple objects.)
+Fortunately, dplyr provides a way out of this common problem with `pull()`:

-### Overwrite the original
-
-Instead of creating intermediate objects at each step, we could overwrite the original object:
-
-```{r, eval = FALSE}
-foo_foo <- hop(foo_foo, through = forest)
-foo_foo <- scoop(foo_foo, up = field_mice)
-foo_foo <- bop(foo_foo, on = head)
+```{r}
+mtcars |> pull(cyl)
 ```

-This is less typing (and less thinking), so you're less likely to make mistakes.
-However, there are two problems:
+magrittr offers a number of other variations on the pipe that you might want to learn about.
+We don't teach them here because none of them has been sufficiently popular that you could reasonably expect a randomly chosen R user to recognize them.

-1.  Debugging is painful: if you make a mistake you'll need to re-run the complete pipeline from the beginning.
+In R 4.2, the base pipe will gain its own placeholder, `_`.
+The placeholder must be passed to a named argument.
+It doesn't solve the `$` problem above, but it helps out in lots of other places.

-2.  The repetition of the object being transformed (we've written `foo_foo` six times!) obscures what's changing on each line.
-
-### Function composition
-
-Another approach is to abandon assignment and just string the function calls together:
-
-```{r, eval = FALSE}
-bop(
-  scoop(
-    hop(foo_foo, through = forest),
-    up = field_mice
-  ), 
-  on = head
-)
-```
-
-Here the disadvantage is that you have to read from inside-out, from right-to-left, and that the arguments end up spread far apart (evocatively called the [Dagwood sandwich](https://en.wikipedia.org/wiki/Dagwood_sandwich) problem).
-In short, this code is hard for a human to consume.
-
-### Use the pipe
-
-Finally, we can use the pipe:
-
-```{r, eval = FALSE}
-foo_foo %>%
-  hop(through = forest) %>%
-  scoop(up = field_mice) %>%
-  bop(on = head)
-```
-
-This is my favourite form, because it focusses on verbs, not nouns.
-You can read this series of function compositions like it's a set of imperative actions.
-Foo Foo hops, then scoops, then bops.
-The downside, of course, is that you need to be familiar with the pipe.
-If you've never seen `%>%` before, you'll have no idea what this code does.
-Fortunately, most people pick up the idea very quickly, so when you share your code with others who aren't familiar with the pipe, you can easily teach them.
-
-The pipe works by performing a "lexical transformation": behind the scenes, R reassembles the code in the pipe to the function composition form used above.
+Expect it to continue to evolve.
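+
+A minimal sketch of the `_` placeholder, assuming R 4.2 or later (it is a syntax error in earlier versions), with `lm()` chosen purely for illustration:
+
+```{r, eval = FALSE}
+# `_` must be supplied to a named argument, here `data`.
+mtcars |> lm(mpg ~ cyl, data = _)
+```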

 ## When not to use the pipe

-The pipe is a powerful tool, but it's not the only tool at your disposal, and it doesn't solve every problem!
+The pipe is so much fun to use that it's easy to go overboard and use pipes when better alternatives exist.
 Pipes are most useful for rewriting a fairly short linear sequence of operations.
 I think you should reach for another tool when:

@@ -173,6 +96,3 @@ I think you should reach for another tool when:

 -   You have multiple inputs or outputs.
    If there isn't one primary object being transformed, but two or more objects being combined together, don't use the pipe.
-
-   You are starting to think about a directed graph with a complex dependency structure.
-    Pipes are fundamentally linear and expressing complex relationships with them will typically yield confusing code.