More writing about functions

hadley 2016-03-03 08:25:43 -06:00
parent 200a630d0b
commit f6d7f86a84
1 changed files with 123 additions and 183 deletions

@ -4,11 +4,24 @@ knit: bookdown::preview_chapter
# Functions
One of the best ways to grow in your skills as a data scientist in R is to write functions. Functions allow you to automate common tasks, instead of using copy-and-paste. Writing good functions is a lifetime journey: you won't learn everything but you'll hopefully get to start walking in the right direction.
One of the best ways to improve your reach as a data scientist is to write functions. Functions allow you to automate common tasks. Writing a function has three big advantages over using copy-and-paste:
1. You drastically reduce the chances of making incidental mistakes when
you copy and paste.
1. As requirements change, you only need to update code in one place, instead
of many.
1. You can give a function an evocative name that makes your code easier to
understand.
Writing good functions is a lifetime journey. Even after using R for many years we still learn new techniques and better ways of approaching old problems. The goal of this chapter is not to master every esoteric detail of functions but to get you started with some pragmatic advice that you can start using right away.
As well as practical advice for writing functions, this chapter also gives you some suggestions for how to style your code. Good coding style is like using correct punctuation. You can manage without it, but it sure makes things easier to read. As with styles of punctuation, there are many possible variations. Here we present the style we use in our code, but the most important thing is to be consistent.
## When should you write a function?
Whenever you've copied and pasted code more than twice, you need to take a look at it and see if you can extract out the common components and make a function. For example, take a look at this code. What does it do?
You should consider writing a function whenever you've copied and pasted a block of code more than twice (i.e. you now have three copies of the same code). For example, take a look at this code. What does it do?
```{r}
df <- data.frame(
@ -28,32 +41,30 @@ df$d <- (df$d - min(df$d, na.rm = TRUE)) /
(max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))
```
You might be able to puzzle out that this rescales each column to 0--1. But did you spot the mistake? I made an error when copying-and-pasting the code for `df$b`, and I forgot to change an `a` to a `b`. Extracting repeated code out into a function is a good idea because it helps make your code more understandable (because you can name the operation), and it prevents you from making this class of errors.
You might be able to puzzle out that this rescales each column to have a range from 0 to 1. But did you spot the mistake? I made an error when copying-and-pasting the code for `df$b`, and I forgot to change an `a` to a `b`. Extracting repeated code out into a function is a good idea because it prevents you from making this type of mistake.
To write a function you need to first analyse the operation. How many inputs does it have?
To write a function you need to first analyse the code. How many inputs does it have?
```{r, eval = FALSE}
(df$a - min(df$a, na.rm = TRUE)) /
(max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
```
This code only has one input: `df$a`. (You might argue that `TRUE` is also an input: I don't think it is here, but there are other scenarios in which it might be.)
To make that more clear, it's a good idea to rewrite the code using some temporary variables. Here this function only takes one input, so I'll call it `x`:
This code only has one input: `df$a`. (You might wonder if that `TRUE` is also an input: you can explore why it's not in the exercise below.) To make the single input clearer, it's a good idea to rewrite the code using a temporary variable with a general name. Here the code only takes one vector input, so I'll call it `x`:
```{r}
x <- 1:10
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
```
There is some duplication in this code: I'm computing the `min()` and `max()` multiple times, and I could instead do that in one step:
There is some duplication in this code. We're computing the range of the data three times, but it makes sense to do it in one step:
```{r}
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
```
Now that I've simplified the code, and checked that it still works, I can turn it into a function:
Pulling out intermediate calculations into named variables is good practice because it makes it more clear what the code is doing. Now that I've simplified the code, and checked that it still works, I can turn it into a function:
```{r}
rescale01 <- function(x) {
@ -63,20 +74,19 @@ rescale01 <- function(x) {
rescale01(c(0, 5, 10))
```
There are three key components here:
There are three key steps to making a function:
1. The name of the function, `rescale01`.
1. You need to pick a __name__ for the function. Here I've used `rescale01`
because this function rescales a vector to lie between 0 and 1.
1. The call to `function` listing each argument to the function. Sometimes
these are called _formal_ arguments to distinguish them from the specific
arguments used in a given call.
1. You list the inputs, or __arguments__, to the function inside `function`.
1. The body of the function wrapped in `{`. This does the computation, and
the function returns the last evaluated statement.
1. You place the __body__ of the function inside a `{` block immediately
following `function`.
Note the process that I followed here: I constructed the `function` last. It's much easier to start with code that works on a sample input and then turn it into a function rather than the other way around. You're more likely to get to your final destination if you take small steps and check your work after each step.
Note the process that I followed here: I only made the function after I'd figured out how to make it work with a simple input. It's much easier to start with working code and turn it into a function as opposed to creating a function and then trying to make it work.
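The same workflow can be sketched on a different, deliberately small task. The task (a coefficient of variation) and the name `coef_var()` are assumptions made up for this illustration; they are not part of the chapter's running example.

```{r}
# Step 1: get the code working on a simple input
x <- c(2, 4, 6, NA)
sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

# Step 2: only then wrap the working code in a function
# (`coef_var` is a hypothetical name chosen for this sketch)
coef_var <- function(x) {
  sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
}

# Step 3: check that the function agrees with the code from step 1
coef_var(x)
```

Because the expression in step 1 was already known to work, any surprise in step 3 can only come from the wrapping itself, which keeps each step easy to check.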
Now we can use that to simplify our original example:
Now that we have `rescale01()`, we can use it to simplify our original example:
```{r}
df$a <- rescale01(df$a)
@ -85,19 +95,12 @@ df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
```
There are two advantages to using a function:
1. We can name the operation. Naming functions well is hard, but important,
because it makes your code much easier to understand.
1. We avoid one class of copy-and-paste errors.
However, we still have quite a bit of duplication: we're still doing the same thing to multiple columns. You'll learn how to deal with that in the iteration chapter, but first, you'll need to learn more about functions.
Compared to our original code, this is easier to understand and we've eliminated one class of copy-and-paste errors. There's still quite a bit of duplication since we're doing the same thing to multiple columns. You'll learn how to eliminate that duplication in the next chapter, Iteration.
### Practice
1. Why is `TRUE` not a parameter to `rescale01()`? What would happen if
`x` containing a missing value, and `na.rm` was `FALSE`.
`x` contained a missing value, and `na.rm` was `FALSE`?
1. Practice turning the following code snippets into functions. Think about
what each function does. What would you call it? How many arguments does it
@ -117,7 +120,11 @@ However, we still have quite a bit of duplication: we're still doing the same th
1. Implement a `fizzbuzz` function. It takes a single number as input. If
the number is divisible by three, return "fizz". If it's divisible by
five return "buzz". If it's divisible by three and five, return "fizzbuzz".
Otherwise, return the number.
Otherwise, return the number. Make sure you first write working code,
before you create the function.
1. Write `both_na()`, a function that takes two vectors of the same length
and returns the number of positions that have an `NA` in both vectors.
1. What do the following functions do? Why are they useful even though they
are so short?
@ -129,37 +136,14 @@ However, we still have quite a bit of duplication: we're still doing the same th
1. Read the [complete lyrics](https://en.wikipedia.org/wiki/Little_Bunny_Foo_Foo)
to "Little Bunny Foo Foo". There's a lot of duplication in this song.
Extend the initial piping example to recreate the complete song, using
functions to reduce duplication.
Extend the initial piping example to recreate the complete song, and use
functions to reduce the duplication.
## Functions for humans
## Functions are for humans and computers
It's important to remember that functions are not just for the computer, but are also for humans. R doesn't care what your function is called, or what comments it contains, but there are important for human readers. This section discusses some things that you should bear in mind when naming and commenting your functions.
It's important to remember that functions are not just for the computer, but are also for humans. R doesn't care what your function is called, or what comments it contains, but these are important for human readers. This section discusses some things that you should bear in mind when writing functions that humans can understand.
There are few rules for naming your functions, but lots of suggestions. I strongly recommend using only lowercase, and separating multiple words with underscores ("snake\_case"). Camel case is a legitimate alternative, but be consistent: pick either snake\_case or camelCase for your code, don't mix them.
Generally, function names should be verbs, and arguments should be nouns. There are some exceptions: nouns are ok if the function computes a very well known noun (i.e. `mean()` is better than `compute_mean()`), or accessing some property of an object (i.e. `coef()` is better than `get_coefficients()`). A good sign that a noun might be a better choice is if you're using a very broad verb like get, or compute, or calculate, or determine. Use your best judgement and don't be afraid to rename a function if you later figure out a better name.
Ideally the name of your function will be short, but clearly evoke what the function does. However, concise names are hard, and autocomplete makes it easy to type long names, so it's better to err on the side of clear descriptions, rather than concise names.
```{r, eval = FALSE}
# Good
day_one
day_1
# Bad
f <- function(x, y, z) {}
first_day_of_the_month
DayOne
dayone
djm1
d1
```
There are also a handful of few very short names that are used very commonly. It's worth remembering these and using in your own functions:
The name of a function is surprisingly important. Ideally the name of your function will be short, but clearly evoke what the function does. However, it's hard to come up with concise names, and autocomplete makes it easy to type long names, so it's better to err on the side of clear descriptions, rather than short names. There are a few exceptions to this rule: the handful of very common, very short names. It's worth memorising these:
* `x`, `y`, `z`: vectors.
* `df`: a data frame.
@ -167,9 +151,13 @@ There are also a handful of few very short names that are used very commonly. It
* `n`: length, or number of rows.
* `p`: number of columns.
Generally, function names should be verbs, and arguments should be nouns. There are some exceptions: nouns are ok if the function computes a very well known noun (e.g. `mean()` is better than `compute_mean()`), or accesses some property of an object (e.g. `coef()` is better than `get_coefficients()`). A good sign that a noun might be a better choice is if you're using a very broad verb like get, or compute, or calculate, or determine. Use your best judgement and don't be afraid to rename a function if you later figure out a better name.
To make it easy to type function names, I strongly recommend using only lowercase, and separating multiple words with underscores (so-called "snake\_case"). Camel case is a legitimate alternative, but be consistent: pick either snake\_case or camelCase for your code, don't mix them. R itself is not very consistent, but there's nothing you can do about that. Make sure you don't fall into the same trap by making your code as consistent as possible.
If you have a family of functions that do similar things, make sure they have consistent names and arguments. Use a common prefix to indicate that they are connected. That's better than a common suffix because autocomplete allows you to type the prefix and see all the members of the family.
```{r}
```{r, eval = FALSE}
# Good
input_select
input_checkbox
@ -223,7 +211,7 @@ Another important use of comments is to break up your file into easily readable
## Conditional execution
An `if` statement allows you to conditionally execute code. It has the following form:
An `if` statement allows you to conditionally execute code. It looks like this:
```{r, eval = FALSE}
if (condition) {
@ -248,7 +236,6 @@ has_name <- function(x) {
}
```
Curly braces are optional when the body is a single expression (both here and in function definitions), but I recommend always using them because it makes it easier to see the hierarchy in your code. An opening curly brace should never go on its own line and should always be followed by a new line. A closing curly brace should always go on its own line, unless it's followed by `else`. Always indent the code inside curly braces.
```{r, eval = FALSE}
@ -281,6 +268,8 @@ If `condition` isn't a single `TRUE` or `FALSE` you'll get a warning or error.
You can use `||` (or) and `&&` (and) to combine multiple logical expressions. These operators are "short-circuiting": as soon as `||` sees the first `TRUE` it returns `TRUE` without computing anything else. As soon as `&&` sees the first `FALSE` it returns `FALSE`.
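A small sketch can make short-circuiting visible. The helper `noisy()` is hypothetical, invented here only so that you can see when the right-hand side is actually evaluated.

```{r}
# `noisy()` is a hypothetical helper that reports when it is evaluated
noisy <- function(value) {
  message("evaluating ", value)
  value
}

TRUE || noisy(FALSE)   # TRUE; noisy() is never called
FALSE && noisy(TRUE)   # FALSE; noisy() is never called
TRUE && noisy(TRUE)    # noisy() has to run here, so the message appears

# A common use: guard a test that only makes sense for some inputs
x <- NULL
!is.null(x) && x > 0   # FALSE; `x > 0` is never evaluated
```

The last line shows why this matters in practice: the cheap or safe check goes first, and the second check only runs when the first one allows it.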
Chaining multiple else ifs together.
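As a rough illustration of such a chain, here is a sketch; the function `describe_temp()` and its cut-offs are assumptions made up for this example.

```{r}
# `describe_temp()` and its thresholds are hypothetical
describe_temp <- function(temp) {
  if (temp < 0) {
    "freezing"
  } else if (temp < 15) {
    "cold"
  } else if (temp < 25) {
    "mild"
  } else {
    "hot"
  }
}

describe_temp(-3)
describe_temp(20)
describe_temp(30)
```

The conditions are tested from top to bottom and only the first matching branch runs, so each later condition can assume the earlier ones were `FALSE`.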
Like a function, an `if` statement "returns" the last expression it evaluated. This means you can assign the result of an `if` statement to a variable:
```{r}
@ -333,81 +322,120 @@ function(x, y, op) {
Neither `if` nor `switch` is vectorised: they work with a single value at a time.
## Arguments
### Exercises
1. What's the difference between `if` and `ifelse()`? Carefully read the help
and construct three examples that illustrate the key differences.
1. What happens if you use `switch()` with numeric values?
1. What does this `switch()` call do?
```{r}
switch(x,
a = ,
b = "ab",
c = ,
d = "cd"
)
```
## Function arguments
Note that arguments in R are lazily evaluated: they're not computed until they're needed. That means if they're never used, they're never evaluated. This is an important property of R the programming language, but is unlikely to be important to you for a while. You can read more about lazy evaluation at <http://adv-r.had.co.nz/Functions.html#lazy-evaluation>.
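A minimal sketch of what lazy evaluation means in practice, using two hypothetical functions (`first_only()` and `uses_both()`) written only for this illustration:

```{r}
# `first_only()` never touches `y`, so `y` is never evaluated
first_only <- function(x, y) {
  x * 2
}
first_only(5, stop("y is never evaluated"))

# `uses_both()` needs `y`, so the error surfaces as soon as `y` is used
uses_both <- function(x, y) {
  x + y
}
try(uses_both(5, stop("y was needed")))
```

The first call returns normally even though its second argument would throw an error if it were ever computed.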
Often the arguments to a function fall into two broad sets: one set supplies the data to compute on, and the other supplies arguments that controls the details of the computation. For example:
The arguments to a function typically fall into two broad sets: one set supplies the data to compute on, and the other supplies arguments that control the details of the computation. For example:
* In `log()`, the data is `x`, and the base of the logarithm is `base`.
* In `log()`, the data is `x`, and the detail is the `base` of the logarithm.
* In `mean()`, the data is `x`, and `trim` and `na.rm` control the computation.
* In `mean()`, the data is `x`, and the details are the `trim` and how to
handle missing values (`na.rm`).
* In `t.test()`, the data is `x` and `y`, and `alternative`, `mu`, `paired`,
`var.equal`, and `conf.level` control the details of the test.
* In `t.test()`, the data is `x` and `y`, and the details of the test are
`alternative`, `mu`, `paired`, `var.equal`, and `conf.level`.
* In `paste()` you can supply unlimited strings to `...`, and the pasting
is controlled by `sep` and `collapse`.
* In `paste()` you can supply any number of strings to `...`, and the details
of the concatenation are controlled by `sep` and `collapse`.
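To make the split between data and details concrete, here is a short sketch that calls three of these base R functions. The input values are arbitrary and chosen only for illustration.

```{r}
# Data first, then a detail argument that tweaks the computation
log(100, base = 10)

# `trim = 0.25` drops the lowest and highest quarter of the values
mean(c(1, 2, 3, 100), trim = 0.25)

# `...` takes the strings; `sep` and `collapse` control how they are joined
paste("a", "b", "c", sep = "-")
paste(c("a", "b", "c"), collapse = ", ")
```

In each call the data can be read off from the unnamed arguments, while the named arguments tell you which details were changed from their defaults.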
Generally, the arguments that control computation have default values so you don't need to supply the most commonly used values.
In almost all cases, the default value should be the value that is used most commonly. There are a few exceptions to do with safety. For example, `na.rm` should always have default value `FALSE` even though `TRUE` is what you usually want if you have missing values. The default forces you to confront and deal with the missingness in your data, rather than allowing it to silently propagate.
You can choose to supply default values to your arguments for common options. This is useful so that you don't need to repeat yourself all the time.
Generally, data arguments should come first. Detail arguments should go on the end, and usually should have default values. You specify a default value in the same way you call a function with a named argument:
```{r}
foo <- function(x = 1, y = TRUE, z = 10:1) {
# Compute a confidence interval around the mean using the normal approximation
mean_ci <- function(x, conf = 0.95) {
  se <- sd(x) / sqrt(length(x))
  alpha <- 1 - conf
  mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}
x <- runif(100)
mean_ci(x)
mean_ci(x, conf = 0.99)
```
Whenever you have a mix of arguments with and without defaults, those without defaults should come first.
The default value should almost always be the most common value. There are a few exceptions to do with safety. For example, it makes sense for `na.rm` to default to `FALSE` because missing values are important. Even though `na.rm = TRUE` is what you usually put in your code, it's a bad idea to silently ignore missing values by default.
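A quick sketch of why this default is the safe one; the vector `heights` is made-up data used only for illustration.

```{r}
heights <- c(150, NA, 170)

# With the default, the missing value propagates and forces you to notice it
mean(heights)

# Dropping missing values is an explicit, visible decision
mean(heights, na.rm = TRUE)
```

The first call returns `NA`, which is hard to overlook; the second returns the mean of the two observed values.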
Default values can depend on other arguments but don't overuse this technique as it's possible to create code that is very difficult to understand. What does this function do?
When you call a function, typically you can omit the names for the data arguments (because they are used so commonly). If you override the default value of a detail argument, you should use the full name:
```{r}
bar <- function(x = y + 1, y = x - 1) {
x * y
}
```{r, eval = FALSE}
# Good
mean(1:10, na.rm = TRUE)
# Bad
mean(x = 1:10, , FALSE)
mean(, TRUE, x = c(1:10, NA))
```
### Arguments that take value from a set
You can refer to an argument by its unique prefix (e.g. `mean(x, n = TRUE)`), but this is generally best avoided given the possibilities for confusion.
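The sketch below shows how prefix matching can go wrong. The function `scale_by()` and its argument names are hypothetical, picked so that two arguments share the same first letter.

```{r}
# `scale_by()` is a hypothetical function with two arguments starting with "m"
scale_by <- function(x, multiplier = 1, minimum = 0) {
  max(x * multiplier, minimum)
}

scale_by(2, mult = 3)        # works: `mult` is a unique prefix of `multiplier`
try(scale_by(2, m = 3))      # error: `m` matches both `multiplier` and `minimum`
scale_by(2, multiplier = 3)  # the full name is unambiguous and easier to read
```

A prefix that is unique today can also stop being unique if the function later gains another argument, which is one more reason to write names out in full.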
`match.arg()`
Notice that when you call a function, you should place spaces around `=`, and always put a space after a comma, not before (just like in regular English). Using whitespace makes it easier to skim the call for the important components.
```{r, eval = FALSE}
# Good
average <- mean(feet / 12 + inches, na.rm = TRUE)
# Bad
average<-mean(feet/12+inches,na.rm=TRUE)
```
### Dot dot dot
There's a special argument that's used quite commonly: `...` (pronounced dot-dot-dot). This captures any other arguments not otherwise matched. It's useful because you can then send those `...` on to another argument. This is a useful catch-all if your function primarily wraps another function. For example, you might have written your own wrapper designed to add linear model lines to a ggplot:
There's one special argument you need to know about: `...`, pronounced dot-dot-dot. This captures any other arguments not otherwise matched. It's useful because you can then send those `...` on to another function. This is a useful catch-all if your function primarily wraps another function.
```{r}
geom_lm <- function(formula = y ~ x, colour = alpha("steelblue", 0.5),
size = 2, ...) {
geom_smooth(formula = formula, se = FALSE, method = "lm", colour = colour,
size = size, ...)
}
```
For example, I commonly create these helper functions that wrap around `paste()`:
```{r}
commas <- function(...) paste0(..., collapse = ", ")
commas(letters[1:10])
rule <- function(..., pad = "-") {
title <- paste0(...)
width <- getOption("width") - nchar(title) - 5
cat(title, " ", paste(rep(pad, width), collapse = ""), "\n", sep = "")
}
rule("Important output")
```
This allows you to use any other arguments of `geom_smooth()`, even those that aren't explicitly listed in your wrapper (and even arguments that don't exist yet in the version of ggplot2 that you're using).
Here `...` lets me forward on any arguments that I don't want to deal with to `paste()`. It's a very convenient technique. But it does come at a price: any misspelled arguments will not raise an error. This makes it easy for typos to go unnoticed:
```{r}
x <- c(1, 2)
sum(x, na.mr = TRUE)
```
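To see what the typo actually does, it helps to compare the results side by side. This is only a sketch built on the same `x`, with one extra vector containing an `NA` added for illustration.

```{r}
x <- c(1, 2)

sum(x)                       # 3
sum(x, na.mr = TRUE)         # 4: the misspelled argument lands in `...`,
                             # so TRUE is added to the total as 1

# The typo also means missing values are not removed
sum(c(x, NA), na.mr = TRUE)  # NA
sum(c(x, NA), na.rm = TRUE)  # 3
```

No warning or error is raised in any of these calls, so the only sign of the mistake is a result that is quietly wrong.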
If you just want to get the values of the `...`, use `list(...)`.
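As a minimal sketch of that idea, the two helpers below (`count_inputs()` and `input_names()`, both hypothetical names) capture `...` with `list(...)` and then inspect the result like any other list.

```{r}
# Both helpers are hypothetical; they only illustrate `list(...)`
count_inputs <- function(...) {
  args <- list(...)
  length(args)
}

input_names <- function(...) {
  names(list(...))
}

count_inputs(1, "a", TRUE)
input_names(a = 1, b = 2)
```

Note that `list(...)` evaluates every argument it captures, so this approach suits cases where you want the values themselves.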
### Exercises
1. What happens if you call `bar()`? What does the error message mean?
1. What does `commas(letters, collapse = "-")` do? Why?
1. What happens if you try to override the method in `geom_lm()` created
above (e.g. `geom_lm(method = "glm")`)? Why?
1. It'd be nice if you could supply multiple characters to the `pad` argument, e.g.
`rule("Title", pad = "-+")`. Why doesn't this currently work? How could you
fix it?
1. What does the `trim` argument to `mean()` do? When might you use it?
1. The default value for the `method` argument to `cor()` is
`c("pearson", "kendall", "spearman")`. What does that mean? What
value is used by default?
## Body
@ -596,91 +624,3 @@ mean_by <- function(data, group_var, mean_var, n = 10) {
This fails because it tells dplyr to group by `group_var` and compute the mean of `mean_var` neither of which exist in the data frame.
Writing reusable functions for ggplot2 poses a similar problem because `aes(group_var, mean_var)` would look for variables called `group_var` and `mean_var`. It's really only been in the last couple of months that I fully understood this problem, so there aren't currently any great (or general) solutions. However, now that I've understood the problem I think there will be some systematic solutions in the near future.
## Code style {#style}
Good coding style is like using correct punctuation. You can manage without it, but it sure makes things easier to read. As with styles of punctuation, there are many possible variations. Below I describe my style, which is used in this book and in all my packages. You don't have to use my style, but I strongly recommend that you use a consistent style and you document it. If you're working on someone else's code, don't impose your own style. Instead, read their style documentation and follow it as closely as possible.
Good style is important because while your code only has one author, it will usually have multiple readers. This is especially true when you're writing code with others. In that case, it's a good idea to agree on a common style up-front. Since no style is strictly better than another, working with others may mean that you'll need to sacrifice some preferred aspects of your style.
### Spacing
Place spaces around all infix operators (`=`, `+`, `-`, `<-`, etc.). The same rule applies when using `=` in function calls. Always put a space after a comma, and never before (just like in regular English).
```{r, eval = FALSE}
# Good
average <- mean(feet / 12 + inches, na.rm = TRUE)
# Bad
average<-mean(feet/12+inches,na.rm=TRUE)
```
There's a small exception to this rule: `:`, `::` and `:::` don't need spaces around them.
```{r, eval = FALSE}
# Good
x <- 1:10
base::get
# Bad
x <- 1 : 10
base :: get
```
Place a space before left parentheses, except in a function call.
```{r, eval = FALSE}
# Good
if (debug) do(x)
plot(x, y)
# Bad
if(debug)do(x)
plot (x, y)
```
Extra spacing (i.e., more than one space in a row) is ok if it improves alignment of equal signs or assignments (`<-`).
```{r, eval = FALSE}
list(
total = a + b + c,
mean = (a + b + c) / n
)
```
Do not place spaces around code in parentheses or square brackets (unless there's a comma, in which case see above).
```{r, eval = FALSE}
# Good
if (debug) do(x)
diamonds[5, ]
# Bad
if ( debug ) do(x) # No spaces around debug
x[1,] # Needs a space after the comma
x[1 ,] # Space goes after comma not before
```
### Line length
Strive to limit your code to 80 characters per line. This fits comfortably on a printed page with a reasonably sized font. If you find yourself running out of room, this is a good indication that you should encapsulate some of the work in a separate function.
### Indentation
When indenting your code, use two spaces. Never use tabs or mix tabs and spaces. I recommend the following configuration in RStudio:
```{r, echo = FALSE}
knitr::include_graphics("screenshots/style-options.png")
```
The only exception is if a function definition runs over multiple lines. In that case, indent the second line to where the definition starts:
```{r, eval = FALSE}
long_function_name <- function(a = "a long argument",
b = "another argument",
c = "another long argument") {
# As usual code is indented by two spaces.
}
```