From bdc3555b9ab1a46d698d23aa8d419de38f329ff7 Mon Sep 17 00:00:00 2001
From: Hadley Wickham
Date: Mon, 21 Nov 2022 08:36:55 -0600
Subject: [PATCH] Describe pull in Base R chapter

Fixes #1108
---
 base-R.qmd    | 13 ++++++++++-
 functions.qmd | 65 +++++++++++++++++++++++++++------------------------
 2 files changed, 47 insertions(+), 31 deletions(-)

diff --git a/base-R.qmd b/base-R.qmd
index c77981e..15625c4 100644
--- a/base-R.qmd
+++ b/base-R.qmd
@@ -219,7 +219,7 @@ In this section, we'll show you how to use `[[` and `$` to pull columns out of a
 
 ### Data frames
 
-`[[` and `$` can be used like `pull()` to extract columns out of a data frame.
+`[[` and `$` can be used to extract columns out of a data frame.
 `[[` can access by position or by name, and `$` is specialized for access by name:
 
 ```{r}
@@ -255,6 +255,16 @@ max(diamonds$carat)
 levels(diamonds$cut)
 ```
 
+dplyr also provides an equivalent to `[[`/`$` that we didn't mention in @sec-data-transform: `pull()`.
+`pull()` takes either a variable name or variable position and returns just that column.
+That means we could rewrite the above code to use the pipe:
+
+```{r}
+diamonds |> pull(carat) |> mean()
+
+diamonds |> pull(cut) |> levels()
+```
+
 ### Tibbles
 
 There are a couple of important differences between tibbles and base `data.frame`s when it comes to `$`.
@@ -537,3 +547,4 @@ This often makes life easier for programming and so becomes more important as yo
 This chapter concludes the programming section of the book.
 You've made a solid start on your journey to becoming not just a data scientist who uses R, but a data scientist who can *program* in R.
 We hope these chapters have sparked your interested in programming and that you're are looking forward to learning more outside of this book.
+
diff --git a/functions.qmd b/functions.qmd
index d2d9e6c..717fd0e 100644
--- a/functions.qmd
+++ b/functions.qmd
@@ -384,14 +384,14 @@ With this theory under your belt, we'll then show you a bunch of examples to ill
 ### Indirection and tidy evaluation
 
 When you start writing functions that use dplyr verbs you rapidly hit the problem of indirection.
-Let's illustrate the problem with a very simple function: `pull_unique()`.
-The goal of this function is to `pull()` the unique (distinct) values of a variable:
+Let's illustrate the problem with a very simple function: `grouped_mean()`.
+The goal of this function is to compute the mean of `mean_var` grouped by `group_var`:
 
 ```{r}
-pull_unique <- function(df, var) {
+grouped_mean <- function(df, group_var, mean_var) {
   df |>
-    distinct(var) |>
-    pull(var)
+    group_by(group_var) |>
+    summarize(mean(mean_var))
 }
 ```
 
@@ -399,38 +399,45 @@ If we try and use it, we get an error:
 
 ```{r}
 #| error: true
-diamonds |> pull_unique(clarity)
+diamonds |> grouped_mean(cut, carat)
 ```
 
 To make the problem a bit more clear we can use a made up data frame:
 
 ```{r}
-df <- tibble(var = "var", x = "x", y = "y")
-df |> pull_unique(x)
-df |> pull_unique(y)
+df <- tibble(
+  mean_var = 1,
+  group_var = "g",
+  group = 1,
+  x = 10,
+  y = 100
+)
+df |> grouped_mean(group, x)
+df |> grouped_mean(group, y)
 ```
 
-Regardless of how we call `pull_unique()` it always does `df |> distinct(var) |> pull(var)`, instead of `df |> distinct(x) |> pull(x)` or `df |> distinct(y) |> pull(y)`.
+Regardless of how we call `grouped_mean()` it always does `df |> group_by(group_var) |> summarize(mean(mean_var))`, instead of `df |> group_by(group) |> summarize(mean(x))` or `df |> group_by(group) |> summarize(mean(y))`.
 This is a problem of indirection, and it arises because dplyr uses **tidy evaluation** to allow you to refer to the names of variables inside your data frame without any special treatment.
 
 Tidy evaluation is great 95% of the time because it makes your data analyses very concise as you never have to say which data frame a variable comes from; it's obvious from the context.
 The downside of tidy evaluation comes when we want to wrap up repeated tidyverse code into a function.
-Here we need some way to tell `distinct()` and `pull()` not to treat `var` as the name of a variable, but instead look inside `var` for the variable we actually want to use.
+Here we need some way to tell `group_by()` and `summarize()` not to treat `group_var` and `mean_var` as the names of variables, but instead to look inside them for the variables we actually want to use.
 
 Tidy evaluation includes a solution to this problem called **embracing** 🤗.
 Embracing a variable means to wrap it in braces so (e.g.) `var` becomes `{{ var }}`.
 Embracing a variable tells dplyr to use the value stored inside the argument, not the argument as the literal variable name.
 One way to remember what's happening is to think of `{{ }}` as looking down a tunnel --- `{{ var }}` will make a dplyr function look inside of `var` rather than looking for a variable called `var`.
 
-So to make `pull_unique()` work we need to replace `var` with `{{ var }}`:
+So to make `grouped_mean()` work we need to surround `group_var` and `mean_var` with `{{ }}`:
 
 ```{r}
-pull_unique <- function(df, var) {
+grouped_mean <- function(df, group_var, mean_var) {
   df |>
-    distinct({{ var }}) |>
-    pull({{ var }})
+    group_by({{ group_var }}) |>
+    summarize(mean({{ mean_var }}))
 }
-diamonds |> pull_unique(clarity)
+
+diamonds |> grouped_mean(cut, carat)
 ```
 
 Success!
@@ -511,8 +518,7 @@ unique_where <- function(df, condition, var) {
   df |>
     filter({{ condition }}) |>
     distinct({{ var }}) |>
-    arrange({{ var }}) |>
-    pull({{ var }})
+    arrange({{ var }})
 }
 
 # Find all the destinations in December
@@ -521,7 +527,7 @@ flights |> unique_where(month == 12, dest)
 flights |> unique_where(tailnum == "N14228", month)
 ```
 
-Here we embrace `condition` because it's passed to `filter()` and `var` because its passed to `distinct()`, `arrange()`, and `pull()`.
+Here we embrace `condition` because it's passed to `filter()` and `var` because it's passed to `distinct()` and `arrange()`.
 
 We've made all these examples take a data frame as the first argument, but if you're working repeatedly with the same data, it can make sense to hardcode it.
 For example, the following function always works with the flights dataset and always selects `time_hour`, `carrier`, and `flight` since they form the compound primary key that allows you to identify a row.
@@ -890,21 +896,20 @@ This makes it easier to see the hierarchy in your code by skimming the left-hand
 
 ```{r}
 # missing extra two spaces
-pull_unique <- function(df, var) {
-df |>
-  distinct({{ var }}) |>
-  pull({{ var }})
+density <- function(colour, facets, binwidth = 0.1) {
+diamonds |>
+  ggplot(aes(carat, after_stat(density), colour = {{ colour }})) +
+  geom_freqpoly(binwidth = binwidth) +
+  facet_wrap(vars({{ facets }}))
 }
 
 # Pipe indented incorrectly
-pull_unique <- function(df, var) {
-  df |>
-  distinct({{ var }}) |>
-  pull({{ var }})
+density <- function(colour, facets, binwidth = 0.1) {
+  diamonds |>
+  ggplot(aes(carat, after_stat(density), colour = {{ colour }})) +
+  geom_freqpoly(binwidth = binwidth) +
+  facet_wrap(vars({{ facets }}))
 }
-
-# Missing {} and all one line
-pull_unique <- function(df, var) df |> distinct({{ var }}) |> pull({{ var }})
 ```
 
 As you can see we recommend putting extra spaces inside of `{{ }}`.
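
The indirection problem and the embracing fix described in the `functions.qmd` hunks can be reproduced with a minimal, self-contained sketch. It assumes dplyr is installed and R is at least 4.1 (for the native pipe). The toy data frame, the name `grouped_mean_broken()`, and the named `mean` summary column are hypothetical choices for illustration and are not part of the patch.

```r
library(dplyr)

# Hypothetical toy data; deliberately has no columns named
# `group_var` or `mean_var`.
df <- tibble(
  group = c("a", "a", "b"),
  x = c(1, 3, 10)
)

# Without embracing, dplyr treats `group_var` and `mean_var` as literal
# names rather than looking inside the arguments.
grouped_mean_broken <- function(df, group_var, mean_var) {
  df |>
    group_by(group_var) |>
    summarize(mean = mean(mean_var))
}

# With embracing, dplyr looks inside the arguments for the columns the
# caller supplied.
grouped_mean <- function(df, group_var, mean_var) {
  df |>
    group_by({{ group_var }}) |>
    summarize(mean = mean({{ mean_var }}))
}

# The un-embraced version fails on this data; capture the error instead
# of stopping the script.
broken <- try(grouped_mean_broken(df, group, x), silent = TRUE)

# The embraced version computes one mean per group.
result <- grouped_mean(df, group, x)

# pull() extracts the same column that `$` does, as the base-R.qmd hunk
# describes.
same_column <- identical(result |> pull(mean), result$mean)
```

Naming the summary column (`mean = ...`) is only there so the result is easy to pull out afterwards; the patch itself leaves the column unnamed.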