More work on programming

Hadley Wickham 2022-10-19 17:36:06 -05:00
parent 3e167168e7
commit 765d1c8191
2 changed files with 185 additions and 210 deletions

View File

@ -11,8 +11,6 @@ status("drafting")
One of the best ways to improve your reach as a data scientist is to write functions.
Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting.
Writing a function has three big advantages over using copy-and-paste:
1. You can give a function an evocative name that makes your code easier to understand.
@ -21,9 +19,8 @@ Writing a function has three big advantages over using copy-and-paste:
3. You eliminate the chance of making incidental mistakes when you copy and paste (i.e. updating a variable name in one place, but not in another).
Writing good functions is a lifetime journey.
Even after using R for many years, we still learn new techniques and better ways of approaching old problems.
A good rule of thumb is to consider writing a function whenever you've copied and pasted a block of code more than twice (i.e. you now have three copies of the same code).
The goal of this chapter is to get you started on your journey with three useful types of functions:
- Vector functions take one or more vectors as input and return a vector as output.
- Data frame functions take a data frame as input and return a data frame as output.
@ -31,15 +28,14 @@ The goal of this chapter is to get you started on your journey with functions wi
The chapter concludes with some advice on function style.
This chapter includes many examples to help you generalize the patterns that you see.
Many of the examples were inspired by real data analysis code supplied by folks on Twitter; follow the links in the comments to see the original inspiration.
And if you want to see even more examples, check out the motivating tweets for [general functions](https://twitter.com/hadleywickham/status/1571603361350164486) and [plotting functions](https://twitter.com/hadleywickham/status/1574373127349575680).
### Prerequisites
We'll wrap up a variety of functions from around the tidyverse.
We'll also use nycflights13 as a source of familiar data to use our functions with.
```{r}
#| message: false
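# assumed setup: load the packages the chapter uses throughout
library(tidyverse)
library(nycflights13)
```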
@ -58,7 +54,6 @@ pick <- function(cols) {
## Vector functions
We'll begin with vector functions: functions that take one or more vectors and return a vector result.
For example, take a look at this code.
What does it do?
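Something like the following sketch (with made-up `rnorm()` data; note the copy-and-paste slip in the computation for `b`):

```{r}
df <- tibble(
  a = rnorm(5),
  b = rnorm(5),
  c = rnorm(5),
  d = rnorm(5)
)

df |> mutate(
  a = (a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
  b = (b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(a, na.rm = TRUE)),
  c = (c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE)),
  d = (d - min(d, na.rm = TRUE)) / (max(d, na.rm = TRUE) - min(d, na.rm = TRUE))
)
```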
@ -89,7 +84,7 @@ Preventing this type of mistake of is one very good reason to learn how to write
### Writing a function
To write a function you need to first analyse your repeated code to figure out which parts are constant and which parts vary.
If we take the code above and pull it outside of `mutate()` it's a little easier to see the pattern because each repetition is now one line:
```{r}
@ -108,19 +103,17 @@ To make this a bit clearer we can replace the bit that varies with `█`:
(█ - min(█, na.rm = TRUE)) / (max(█, na.rm = TRUE) - min(█, na.rm = TRUE))
```
There's only one thing that varies which implies we're going to need a function with one argument.
To turn this into a function you need three things:
1. A **name**.
Here we'll use `rescale01` because this function rescales a vector to lie between 0 and 1.
2. The **arguments**.
The arguments are things that vary across calls, and our analysis above tells us that we have just one.
We'll call it `x` because this is the conventional name for a numeric vector.
3. The **body**.
The body is the code that is repeated across all the calls.
Then you create a function by following the template:
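Following the template with the name, argument, and body from above gives something like this (the test call at the end is our own illustration):

```{r}
rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

rescale01(c(-10, 0, 10))
```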
@ -190,29 +183,20 @@ These changes illustrate an important benefit of functions: because we've moved
### Mutate functions
Now that you've got the basic idea of functions, let's take a look at a whole bunch of examples.
We'll start by looking at "mutate" functions, i.e. functions that work well inside of `mutate()` and `filter()` because they return an output of the same length as the input.
Let's start with a simple variation of `rescale01()`.
Maybe you want to compute the Z-score, rescaling a vector to have a mean of zero and a standard deviation of one:
```{r}
z_score <- function(x) {
(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}
```
Or maybe you want to wrap up a straightforward `case_when()` in order to give it a useful name.
For example, this `clamp()` function ensures all values of a vector lie between a minimum and a maximum:
```{r}
clamp <- function(x, min, max) {
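  # body reconstructed as the straightforward case_when() described above (a sketch)
  case_when(
    x < min ~ min,
    x > max ~ max,
    .default = x
  )
}

clamp(1:10, min = 3, max = 7)
```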
@ -228,19 +212,19 @@ clamp(1:10, min = 3, max = 7)
Or maybe you'd rather mark those values as `NA`s:
```{r}
na_outside <- function(x, min, max) {
case_when(
x < min ~ NA,
x > max ~ NA,
.default = x
)
}
na_outside(1:10, min = 3, max = 7)
```
Of course functions don't just need to work with numeric variables.
You might want to extract out some repeated string manipulation.
Maybe you need to make the first character upper case:
```{r}
first_upper <- function(x) {
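  # a sketch of the body: replace the first character with its uppercase version
  str_sub(x, 1, 1) <- str_to_upper(str_sub(x, 1, 1))
  x
}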
@ -267,8 +251,19 @@ clean_number("$12,300")
clean_number("45%")
```
Sometimes your functions will be highly specialized for one data analysis.
For example, if you have a bunch of variables that record missing values as 997, 998, or 999, you might want to write a function to replace them with `NA`:
```{r}
fix_na <- function(x) {
if_else(x %in% c(997, 998, 999), NA, x)
}
```
We've focused on examples that take a single vector because we think they're the most common.
But there's no reason that your function can't take multiple vector inputs.
For example, you might want to compute the distance between two locations on the globe using the haversine formula.
This requires four vectors:
```{r}
# https://twitter.com/RosanaFerrero/status/1574722120428539906/photo/1
@ -290,17 +285,17 @@ haversine <- function(long1, lat1, long2, lat2, round = 3) {
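# signature from the context above; the body is a sketch of the standard formula
haversine <- function(long1, lat1, long2, lat2, round = 3) {
  # convert decimal degrees to radians
  long1 <- long1 * pi / 180
  lat1 <- lat1 * pi / 180
  long2 <- long2 * pi / 180
  lat2 <- lat2 * pi / 180

  R <- 6371 # Earth mean radius in km
  a <- sin((lat2 - lat1) / 2)^2 +
    cos(lat1) * cos(lat2) * sin((long2 - long1) / 2)^2
  d <- R * 2 * asin(sqrt(a))

  round(d, round)
}
```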
### Summary functions
Another important family of vector functions is summary functions, functions that return a single value for use in `summarize()`.
Sometimes this can just be a matter of setting a default argument or two:
```{r}
commas <- function(x) {
str_flatten(x, collapse = ", ", last = " and ")
}
commas(c("cat", "dog", "pigeon"))
```
Or you might wrap up a simple computation, like the coefficient of variation, which divides the standard deviation by the mean:
```{r}
cv <- function(x, na.rm = FALSE) {
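  # a sketch of the body: the coefficient of variation is the sd over the mean
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}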
@ -320,7 +315,7 @@ n_missing <- function(x) {
```
You can also write functions with multiple vector inputs.
For example, maybe you want to compute the mean absolute prediction error to help you compare model predictions with actual values:
```{r}
# https://twitter.com/neilgcurrie/status/1571607727255834625
@ -329,6 +324,17 @@ mape <- function(actual, predicted) {
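mape <- function(actual, predicted) {
  # a sketch of the body: mean absolute error, relative to the actual values
  mean(abs((actual - predicted) / actual))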
}
```
::: callout-note
## RStudio
Once you start writing functions, there are two RStudio shortcuts that are super useful:
- To find the definition of a function that you've written, place the cursor on the name of the function and press `F2`.
- To quickly jump to a function, press `Ctrl + .` to open the fuzzy file and function finder and type the first few letters of your function name.
You can also navigate to files, Quarto sections, and more, making it a very handy navigation tool.
:::
### Exercises
1. Practice turning the following code snippets into functions.
@ -377,9 +383,13 @@ mape <- function(actual, predicted) {
## Data frame functions
Vector functions are useful for pulling out code that's repeated within a dplyr verb.
But you'll often also repeat the verbs themselves, particularly within a large pipeline.
When you notice yourself copying and pasting multiple verbs multiple times, you might think about writing a data frame function.
Data frame functions work like dplyr verbs: they take a data frame as the first argument, some extra arguments that say what to do with it, and return a data frame or vector.
To let you write a function that uses dplyr verbs, we'll first introduce you to the challenge of indirection and how you can overcome it with embracing, `{{ }}`.
With this theory under your belt, we'll then show you a bunch of examples to illustrate what you might do with it.
### Indirection and tidy evaluation
@ -411,7 +421,7 @@ df |> pull_unique(y)
```
Regardless of how we call `pull_unique()` it always does `df |> distinct(var) |> pull(var)`, instead of `df |> distinct(x) |> pull(x)` or `df |> distinct(y) |> pull(y)`.
This is a problem of indirection, and it arises because dplyr uses **tidy evaluation** to allow you to refer to the names of variables inside your data frame without any special treatment.
Tidy evaluation is great 95% of the time because it makes your data analyses very concise as you never have to say which data frame a variable comes from; it's obvious from the context.
The downside of tidy evaluation comes when we want to wrap up repeated tidyverse code into a function.
@ -420,7 +430,7 @@ Here we need some way tell `distinct()` and `pull()` not to treat `var` as the n
Tidy evaluation includes a solution to this problem called **embracing**.
Embracing a variable means to wrap it in braces so (e.g.) `var` becomes `{{ var }}`.
Embracing a variable tells dplyr to use the value stored inside the argument, not the argument as a literal variable name.
One way to remember what's happening is to think of `{{ }}` as looking down a tunnel --- `{{ var }}` will make a dplyr function look inside of `var` rather than looking for a variable called `var`.
So to make `pull_unique()` work we need to replace `var` with `{{ var }}`:
@ -433,28 +443,23 @@ pull_unique <- function(df, var) {
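```{r}
pull_unique <- function(df, var) {
  # {{ var }} makes distinct() and pull() use the variable stored in var
  df |>
    distinct({{ var }}) |>
    pull({{ var }})
}
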
diamonds |> pull_unique(clarity)
```
Success!
### When to embrace?
So the key challenge in writing data frame functions is figuring out which arguments need to be embraced.
Fortunately this is easy because you can look it up from the documentation 😄.
There are two terms to look for in the docs, which correspond to the two most common sub-types of tidy evaluation:
- **Data-masking**: this is used in functions like `arrange()`, `filter()`, and `summarise()` that compute with variables.
- **Tidy-selection**: this is used for functions like `select()`, `relocate()`, and `rename()` that select variables.
When you start looking closely at the documentation, you'll notice that many dplyr functions use `...`.
This is a special shorthand syntax that matches any arguments that aren't otherwise explicitly matched.
For example, `arrange()` uses data-masking for `...` and `select()` uses tidy-select for `...`.
Your intuition for many common functions should be pretty good --- think about whether you can compute (e.g. `x + 1`) or select (e.g. `a:x`).
There are a few cases where it's harder to tell because you usually use them with a single variable, which uses the same syntax for both data-masking and tidy-selection.
For example, the arguments to `group_by()`, `count()`, and `distinct()` are computing arguments because they can all create new variables.
If you're ever confused, just look at the docs.
In the next two sections we'll explore the sorts of handy functions you might write for data-masking and tidy-select arguments.
### Common use cases
If you commonly perform the same set of summaries when doing initial data exploration, you might consider wrapping them up in a helper function:
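For example, a helper along these lines computes a handful of common summaries in one call (the name `summary6` and the exact set of statistics are one plausible choice):

```{r}
summary6 <- function(data, var) {
  data |> summarise(
    min = min({{ var }}, na.rm = TRUE),
    mean = mean({{ var }}, na.rm = TRUE),
    median = median({{ var }}, na.rm = TRUE),
    max = max({{ var }}, na.rm = TRUE),
    n = n(),
    n_miss = sum(is.na({{ var }})),
    .groups = "drop"
  )
}

diamonds |> summary6(carat)
```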
@ -494,9 +499,7 @@ diamonds |>
To summarize multiple variables you'll need to wait until @sec-across, where you'll learn how to use `across()`.
Another popular `summarise()` helper function is a version of `count()` that also computes proportions:
```{r}
# https://twitter.com/Diabb6/status/1571635146658402309
@ -508,59 +511,7 @@ count_prop <- function(df, var, sort = FALSE) {
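# a sketch of the definition: count, then turn n into a proportion
count_prop <- function(df, var, sort = FALSE) {
  df |>
    count({{ var }}, sort = sort) |>
    mutate(prop = n / sum(n))
}
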
diamonds |> count_prop(clarity)
```
This function has three arguments: `df`, `var`, and `sort`, and only `var` needs to be embraced because it's passed to `count()`, which uses data-masking for all variables in `...`.
Or maybe you want to find the sorted unique values of a variable for a subset of the data.
Rather than supplying a variable and a value to do the filtering, we'll allow the user to supply a condition:
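A sketch of such a function, reconstructed from the description that follows:

```{r}
unique_where <- function(df, condition, var) {
  df |>
    filter({{ condition }}) |>
    distinct({{ var }}) |>
    arrange({{ var }}) |>
    pull({{ var }})
}

flights |> unique_where(tailnum == "N14228", month)
```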
@ -582,9 +533,8 @@ flights |> unique_where(tailnum == "N14228", month)
Here we embrace `condition` because it's passed to `filter()` and `var` because it's passed to `distinct()`, `arrange()`, and `pull()`.
We've made all these examples take a data frame as the first argument, but if you're working repeatedly with the same data, it can make sense to hardcode it.
For example, the following function always works with the flights dataset and always selects `time_hour`, `carrier`, and `flight`, since they form the compound primary key that allows you to identify a row.
```{r}
flights_sub <- function(rows, cols) {
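  # a sketch: filter the rows, always keeping the key columns plus the requested ones
  flights |>
    filter({{ rows }}) |>
    select(time_hour, carrier, flight, {{ cols }})
}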
@ -596,11 +546,60 @@ flights_sub <- function(rows, cols) {
flights_sub(dest == "IAH", contains("time"))
```
### Data-masking vs tidy-selection
Sometimes you want to select variables inside a function that uses data-masking.
For example, imagine you want to write `count_missing()` that counts the number of missing observations in rows.
You might try writing something like:
```{r}
#| error: true
count_missing <- function(df, group_vars, x_var) {
df |>
group_by({{ group_vars }}) |>
summarise(n_miss = sum(is.na({{ x_var }})))
}
flights |>
count_missing(c(year, month, day), dep_time)
```
This doesn't work because `group_by()` uses data-masking, not tidy-selection.
We can work around that problem by using the handy `pick()`, which allows you to use tidy-selection inside data-masking functions:
```{r}
count_missing <- function(df, group_vars, x_var) {
df |>
group_by(pick({{ group_vars }})) |>
summarise(n_miss = sum(is.na({{ x_var }})))
}
flights |>
count_missing(c(year, month, day), dep_time)
```
Another convenient use of `pick()` is to make a 2d table of counts.
Here we count using all the variables in the `rows` and `columns`, then use `pivot_wider()` to rearrange into a grid:
```{r}
# https://twitter.com/pollicipes/status/1571606508944719876
count_wide <- function(data, rows, cols) {
data |>
count(pick(c({{ rows }}, {{ cols }}))) |>
pivot_wider(
names_from = {{ cols }},
values_from = n,
names_sort = TRUE,
values_fill = 0
)
}
diamonds |> count_wide(clarity, cut)
diamonds |> count_wide(c(clarity, color), cut)
```
While our examples have mostly focused on dplyr, tidy evaluation also underpins tidyr, and if you look at the `pivot_wider()` docs you can see that `names_from` uses tidy-selection.
### Learning more
This section has introduced you to some of the power and flexibility of tidy evaluation with dplyr (and a dash of tidyr).
We've only used the smallest part of tidy evaluation, embracing, and it already gives you considerable power to reduce duplication in your data analyses.
You can learn more advanced techniques in `vignette("programming", package = "dplyr")`.
### Exercises
## Plot functions
@ -644,7 +643,7 @@ diamonds |>
### More variables
It's straightforward to add more variables to the mix.
For example, maybe you want an easy way to eyeball whether or not a data set is linear by overlaying a smooth line and a straight line:
```{r}
# https://twitter.com/tyler_js_smith/status/1574377116988104704
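# a sketch: a scatterplot plus a flexible smooth (red) and a straight line (blue);
# the colours and smoothing choices are our own assumptions
linearity_check <- function(df, x, y) {
  df |>
    ggplot(aes({{ x }}, {{ y }})) +
    geom_point() +
    geom_smooth(method = "loess", formula = y ~ x, color = "red", se = FALSE) +
    geom_smooth(method = "lm", formula = y ~ x, color = "blue", se = FALSE)
}

starwars |>
  linearity_check(mass, height)
```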
@ -662,7 +661,7 @@ starwars |>
```
Or maybe you want an alternative to colored scatterplots for very large datasets where overplotting is a problem:
```{r}
# https://twitter.com/ppaxisa/status/1574398423175921665
@ -670,7 +669,7 @@ hex_plot <- function(df, x, y, z, bins = 20, fun = "mean") {
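hex_plot <- function(df, x, y, z, bins = 20, fun = "mean") {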
df |>
ggplot(aes({{ x }}, {{ y }}, z = {{ z }})) +
stat_summary_hex(
aes(colour = after_scale(fill)), # make border same colour as fill
bins = bins,
fun = fun,
)
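}

diamonds |> hex_plot(carat, price, depth)
```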
@ -681,8 +680,8 @@ diamonds |> hex_plot(carat, price, depth)
### Combining with dplyr
Some of the most useful helpers combine a dash of dplyr with ggplot2.
For example, you might want to draw a vertical bar chart where you automatically sort the bars in frequency order using `fct_infreq()`.
Since the bar chart is vertical, we also need to reverse the usual order to get the highest values at the top:
```{r}
sorted_bars <- function(df, var) {
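  # a sketch: := lets us assign to the embraced variable name, and
  # fct_rev(fct_infreq()) puts the most frequent category at the top
  df |>
    mutate({{ var }} := fct_rev(fct_infreq({{ var }}))) |>
    ggplot(aes(y = {{ var }})) +
    geom_bar()
}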
@ -694,7 +693,22 @@ sorted_bars <- function(df, var) {
diamonds |> sorted_bars(cut)
```
Or maybe you want to make it easy to draw a bar plot just for a subset of the data:
```{r}
conditional_bars <- function(df, condition, var) {
df |>
filter({{ condition }}) |>
ggplot(aes({{ var }})) +
geom_bar()
}
diamonds |> conditional_bars(cut == "Good", clarity)
```
You can also get creative and display data summaries in other ways.
For example, this code uses the axis labels to display the highest value.
As you learn more about ggplot2, the power of your functions will continue to increase.
```{r}
# https://gist.github.com/GShotwell/b19ef520b6d56f61a830fabb3454965b
@ -724,15 +738,14 @@ df <- tibble(
df <- pivot_longer(df, cols = -date, names_to = "dist_name", values_to = "value")
fancy_ts(df, value, dist_name)
```
Next we'll discuss two more complicated cases: faceting and automatic labeling.
### Faceting
Unfortunately programming with faceting is a special challenge, because faceting was implemented before we understood what tidy evaluation was and how it should work, so you have to learn a new syntax.
When programming with facets, instead of writing `~ x`, you need to write `vars(x)` and instead of `~ x + y` you need to write `vars(x, y)`.
The only advantage of this syntax is that `vars()` uses tidy evaluation so you can embrace within it:
@ -746,17 +759,19 @@ foo <- function(x) {
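```{r}
# the signature comes from the context above; plotting mtcars is our assumption
foo <- function(x) {
  ggplot(mtcars, aes(mpg, disp)) +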
geom_point() +
facet_wrap(vars({{ x }}))
}
foo(cyl)
```
As with data frame functions, it can also be useful to make your plotting functions tightly coupled to a specific dataset, or even a specific variable.
The following function makes it particularly easy to interactively explore the conditional distribution of `bill_length_mm` from the palmerpenguins dataset.
```{r}
# https://twitter.com/yutannihilat_en/status/1574387230025875457
density <- function(fill, facets) {
palmerpenguins::penguins |>
ggplot(aes(bill_length_mm, fill = {{ fill }})) +
geom_density(alpha = 0.5) +
facet_wrap(vars({{ facets }}))
}
density()
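density(island, sex)
```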
@ -766,43 +781,28 @@ density(island, sex)
Also note that we hardcoded the `x` variable but allowed the fill to vary.
### Labeling
Remember the histogram function we showed you earlier?
```{r}
histogram <- function(df, var, binwidth = NULL) {
df |>
ggplot(aes({{ var }})) +
geom_histogram(binwidth = binwidth)
}
```
Wouldn't it be nice if we could label the output with the variable and the binwidth that was used?
To do so, we're going to have to go under the covers of tidy evaluation and use a function from a new package: rlang.
rlang is a low-level package that's used by just about every other package in the tidyverse because it implements tidy evaluation (and provides many other useful tools).
To solve the labeling problem we can use `rlang::englue()`.
This works similarly to `str_glue()`, so any value wrapped in `{ }` will be inserted into the string.
But unlike `str_glue()`, it also understands `{{ }}`, which automatically inserts the appropriate variable name.
```{r}
histogram <- function(df, var, binwidth) {
label <- rlang::englue("A histogram of {{var}} with binwidth {binwidth}")
df |>
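    # a sketch of the remainder: draw the histogram and use the glued label as the title
    ggplot(aes({{ var }})) +
    geom_histogram(binwidth = binwidth) +
    labs(title = label)
}

diamonds |> histogram(carat, 0.1)
```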
@ -819,23 +819,7 @@ Hopefully it'll be fixed soon!)
You can use the same approach any other place that you might supply a string in a ggplot2 plot.
### Exercises
## Style
@ -916,13 +900,16 @@ Learn more at <https://style.tidyverse.org/functions.html>
## Summary
In this chapter you learned how to write functions for three useful scenarios: creating a vector, creating a data frame, or creating a plot.
Along the way you saw many examples, which hopefully started to get your creative juices flowing, and gave you some ideas for where functions might help your analysis code.
You also learned a little about tidy evaluation so you could wrap functions from dplyr, tidyr, and ggplot2.
Tidy evaluation is a key component of the tidyverse because it allows you to write `diamonds |> filter(x == y)` and `filter()` knows to use `x` and `y` from the diamonds dataset.
The downside of tidy evaluation is that you need to learn a new technique for programming: embracing, `{{ x }}`.
Embracing already gives you considerable power to reduce duplication in your data analyses, but there are many more advanced techniques available, which you can learn more about in `vignette("programming", package = "dplyr")` and `vignette("programming", package = "tidyr")`.
Here we've focused on very simple plotting functions, the sort of functions that you might naturally extract from repeated code in your analyses.
As you get better at programming and learn more about ggplot2, you'll be able to create richer functions with greater flexibility.
The next place you might stop on your journey is the [Programming with ggplot2](https://ggplot2-book.org/programming.html){.uri} chapter of the ggplot2 book, where you'll learn other ways to reduce duplication in your plotting code.
In the next chapter, we'll dive into some of the details of R's vector data structures that we've omitted so far.
These are immediately useful by themselves, but are a necessary foundation for the following chapter on iteration that provides some amazingly powerful tools.

View File

@ -12,6 +12,7 @@ Programming is a cross-cutting skill needed for all data science work: you must
```{r}
#| label: fig-ds-program
#| echo: false
#| out.width: ~
#| fig-cap: >
#| Programming is the water in which all other components of the data
#| science process swim.
@ -19,7 +20,6 @@ Programming is a cross-cutting skill needed for all data science work: you must
#| Our model of the data science process with program (import, tidy,
#| transform, visualize, model, and communicate, i.e. everything)
#| highlighted in blue.
knitr::include_graphics("diagrams/data-science/program.png", dpi = 270)
```
@ -47,25 +47,13 @@ In the following three chapters, you'll learn skills that will allow you to both
Repeating yourself in code is dangerous because it can easily lead to errors and inconsistencies.
Instead, in @sec-functions, you'll learn how to write **functions** which let you extract out repeated code so that it can be easily reused.
2. As you start to write more powerful functions, you'll need a solid grounding in R's **data structures**, provided by vectors, which we discuss in @sec-vectors.
You must master the four common atomic vectors, the three important S3 classes built on top of them, and understand the mysteries of the list and data frame.
3. Functions extract out repeated code, but you often need to repeat the same actions on different inputs.
You need tools for **iteration** that let you do similar things again and again.
These tools include for loops and functional programming, which you'll learn about in @sec-iteration.
## Learning more
The goal of these chapters is to teach you the minimum about programming that you need to practice data science.