Tweak functions

hadley 2016-08-09 15:49:26 -05:00
parent ae5764e3c7
commit 93a23b3f28
1 changed file with 73 additions and 55 deletions


@ -2,20 +2,20 @@
## Introduction
One of the best ways to improve your reach as a data scientist is to write functions. Functions allow you to automate common tasks. Writing a function has three big advantages over using copy-and-paste:
One of the best ways to improve your reach as a data scientist is to write functions. Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting. Writing a function has three big advantages over using copy-and-paste:
1. You drastically reduce the chances of making incidental mistakes when
you copy and paste.
1. As requirements change, you only need to update code in one place, instead
of many.
1. You can give a function an evocative name that makes your code easier to
understand.
Writing good functions is a lifetime journey. Even after using R for many years we still learn new techniques and better ways of approaching old problems. The goal of this chapter is not to master every esoteric detail of functions but to get you started with some pragmatic advice that you can start using right away.
1. As requirements change, you only need to update code in one place, instead
of many.
As well as practical advice for writing functions, this chapter also gives you some suggestions for how to style your code. Good coding style is like using correct punctuation. You can manage without it, but it sure makes things easier to read. As with styles of punctuation, there are many possible variations. Here we present the style we use in our code, but the most important thing is to be consistent.
1. You eliminate the chance of making incidental mistakes when you copy and
paste (i.e. updating a variable name in one place, but not in another).
Writing good functions is a lifetime journey. Even after using R for many years I still learn new techniques and better ways of approaching old problems. The goal of this chapter is not to master every esoteric detail of functions but to get you started with some pragmatic advice that you can apply immediately.
As well as practical advice for writing functions, this chapter also gives you some suggestions for how to style your code. Good coding style is like using correct punctuation. You can manage without it, but it sure makes things easier to read. As with styles of punctuation, there are many possible variations. Here we present the style we use in our code, but the most important thing is to be consistent.
### Prerequisites
@ -52,7 +52,7 @@ To write a function you need to first analyse the code. How many inputs does it
(max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
```
This code only has one input: `df$a`. (It's a little surprisingly that `TRUE` is not an input: you can explore why in the exercise below.) To make the single input more clear, it's a good idea to rewrite the code using temporary variables with a general name. Here this function only takes one vector of input, so I'll call it `x`:
This code only has one input: `df$a`. (If you're surprised that `TRUE` is not an input, you can explore why in the exercise below.) To make the inputs more clear, it's a good idea to rewrite the code using temporary variables with general names. Here this code only requires a single numeric vector, so I'll call it `x`:
```{r}
x <- 1:10
@ -108,7 +108,7 @@ df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
```
Compared to the original, this code is easier to understand and we've eliminated one class of copy-and-paste errors. There is still quite a bit of duplication since we're doing the same thing to multiple columns. We'll learn how to eliminate that duplication in [iteration], once you've learn more about R's data structures in [data-structures].
Compared to the original, this code is easier to understand and we've eliminated one class of copy-and-paste errors. There is still quite a bit of duplication since we're doing the same thing to multiple columns. We'll learn how to eliminate that duplication in [iteration], once you've learned more about R's data structures in [data-structures].
Another advantage of functions is that if our requirements change, we only need to make the change in one place. For example, we might discover that some of our variables include infinite values, and `rescale01()` fails:
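Here is a minimal sketch of the failure, together with one possible fix using the `finite` argument of `range()`:

```{r, eval = FALSE}
x <- c(1:10, Inf)
rescale01(x)
# Every finite value becomes 0 and Inf becomes NaN, because the range is infinite

# One possible fix: ignore infinite values when computing the range
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01(x)
# Finite values are now rescaled to [0, 1], and Inf is left unchanged
```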
@ -134,6 +134,10 @@ This is an important part of the "do not repeat yourself" (or DRY) principle. Th
1. Why is `TRUE` not a parameter to `rescale01()`? What would happen if
`x` contained a single missing value, and `na.rm` was `FALSE`?
1. In the second variant of `rescale01()`, infinite values are left
unchanged. Rewrite `rescale01()` so that `-Inf` is mapped to 0, and
`Inf` is mapped to 1.
1. Practice turning the following code snippets into functions. Think about
what each function does. What would you call it? How many arguments does it
need? Can you rewrite it to be more expressive or less duplicative?
@ -149,11 +153,11 @@ This is an important part of the "do not repeat yourself" (or DRY) principle. Th
1. Follow <http://nicercode.github.io/intro/writing-functions.html> to
write your own functions to compute the variance and skew of a vector.
1. Implement a `fizzbuzz` function. It take a single number as input. If
the number is divisible by three, return "fizz". If it's divisible by
five return "buzz". If it's divisible by three and five, return "fizzbuzz".
Otherwise, return the number. Make sure you first write working code,
before you create the function.
1. Implement a `fizzbuzz` function. It takes a single number as input. If
the number is divisible by three, it returns "fizz". If it's divisible by
five it returns "buzz". If it's divisible by three and five, it returns
"fizzbuzz". Otherwise, it returns the number. Make sure you first write
working code before you create the function.
1. Write `both_na()`, a function that takes two vectors of the same length
and returns the number of positions that have an `NA` in both vectors.
@ -175,7 +179,7 @@ This is an important part of the "do not repeat yourself" (or DRY) principle. Th
It's important to remember that functions are not just for the computer, but are also for humans. R doesn't care what your function is called, or what comments it contains, but these are important for human readers. This section discusses some things that you should bear in mind when writing functions that humans can understand.
The name of a function is important. Ideally, the name of your function will be short, but clearly evoke what the function does. However, it's hard to come up with concise names, and autocomplete makes it easy to type long names, so it's better to err on the side of clear descriptions, rather than short names.
The name of a function is important. Ideally, the name of your function will be short, but clearly evoke what the function does. That said, it's hard to come up with concise names, and autocomplete makes it easy to type long names, so it's better to err on the side of clarity, not brevity.
Generally, function names should be verbs, and arguments should be nouns. There are some exceptions: nouns are ok if the function computes a very well known noun (e.g. `mean()` is better than `compute_mean()`), or accesses some property of an object (e.g. `coef()` is better than `get_coefficients()`). A good sign that a noun might be a better choice is if you're using a very broad verb like "get", "compute", "calculate", or "determine". Use your best judgement and don't be afraid to rename a function if you figure out a better name later.
@ -191,7 +195,7 @@ impute_missing()
collapse_years()
```
If your function name is composed of multiple words, I recommend using "snake\_case", where each word is lower case and separated by an underscore. camelCase is a popular alternative, but be consistent: pick one or the other and stick with it. R itself is not very consistent, but there's nothing you can do about that. Make sure you don't fall into the same trap by making your code as consistent as possible.
If your function name is composed of multiple words, I recommend using "snake\_case", where each word is lower case and separated by an underscore. camelCase is a popular alternative. It doesn't really matter which one you pick, the important thing is to be consistent: pick one or the other and stick with it. R itself is not very consistent, but there's nothing you can do about that. Make sure you don't fall into the same trap by making your code as consistent as possible.
```{r, eval = FALSE}
# Never do this!
@ -224,12 +228,7 @@ c <- 10
mean <- function(x) sum(x)
```
Use comments, lines starting with `#`, to explain the "why" of your code. You generally should avoid comments that explain the "what" or the "how". If you can't understand what the code does from reading it, you should think about how to rewrite it to be more clear. Do you need to add some intermediate variables with useful names? Do you need to break out a subcomponent of a large function so you can name it? However, your code can never capture the reasoning behind your decisions: why did you choose this approach instead of an alternative? What else did you try that didn't work? It's a great idea to capture that sort of thinking in a comment.
```{r, eval = FALSE}
# NEED EXAMPLE!
```
Use comments, lines starting with `#`, to explain the "why" of your code. You generally should avoid comments that explain the "what" or the "how". If you can't understand what the code does from reading it, you should think about how to rewrite it to be more clear. Do you need to add some intermediate variables with useful names? Do you need to break out a subcomponent of a large function so you can name it? However, your code can never capture the reasoning behind your decisions: why did you choose this approach instead of an alternative? What else did you try that didn't work? It's a great idea to capture that sort of thinking in a comment.
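For example, a comment like this records a decision that the code alone can't convey (the data and names here are invented for illustration):

```{r, eval = FALSE}
# Use the median rather than the mean: order values are heavily right-skewed,
# and a handful of very large orders made the mean misleading.
typical_order <- median(orders$value, na.rm = TRUE)
```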
Another important use of comments is to break up your file into easily readable chunks. Use long lines of `-` and `=` to make it easy to spot the breaks. RStudio even provides a keyboard shortcut for this: Cmd/Ctrl + Shift + R.
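The resulting section headers look something like this (a sketch):

```{r, eval = FALSE}
# Load data --------------------------------------

# Plot data --------------------------------------
```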
@ -242,7 +241,7 @@ Another important use of comments is to break up your file into easily readable
### Exercises
1. Read the source code for each of the following three functions, puzzle out
what they do, and then brainstorm good names.
what they do, and then brainstorm better names.
```{r}
f1 <- function(string, prefix) {
@ -263,7 +262,7 @@ Another important use of comments is to break up your file into easily readable
1. Compare and contrast `rnorm()` and `MASS::mvrnorm()`. How could you make
them more consistent?
1. Make a case for why `normr()`, `normd()` etc would be better than
1. Make a case for why `norm_r()`, `norm_d()` etc would be better than
`rnorm()`, `dnorm()`. Make a case for the opposite.
## Conditional execution
@ -322,7 +321,7 @@ x == 2
x - 2
```
And remember, `x == NA` doesn't work!
And remember, `x == NA` doesn't do anything useful!
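A quick sketch shows why:

```{r}
x <- c(1, NA, 3)
# Every comparison with NA is itself NA, so this never yields TRUE or FALSE
x == NA
```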
### Multiple conditions
@ -338,7 +337,7 @@ if (this) {
}
```
But note that if you end up with a very long series of chained `if` statements, you should consider rewriting. One useful technique is the `switch()` function. It allows you to evaluate selected code based on position or name. Note that neither `if` nor `switch()` is vectorised: they work with a single value at a time.
But if you end up with a very long series of chained `if` statements, you should consider rewriting. One useful technique is the `switch()` function. It allows you to evaluate selected code based on position or name.
```{r}
function(x, y, op) {
@ -354,9 +353,9 @@ function(x, y, op) {
Another useful function that can often eliminate long chains of `if` statements is `cut()`. It's used to discretise continuous variables.
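For example, a small sketch (the breakpoints and labels are invented for illustration):

```{r}
temp <- c(-5, 3, 12, 22, 31)
# cut() replaces a chain of ifs with a single set of breakpoints and labels
cut(temp,
  breaks = c(-Inf, 0, 10, 20, 30, Inf),
  labels = c("freezing", "cold", "cool", "warm", "hot")
)
```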
### If styles
### Code style
Squiggly brackets are optional (for both `if` and `function`), but highly recommended. When coupled with good style (described below), this makes it easier to see the hierarchy in your code. You can easily see how the code is nested by skimming the left-hand margin.
Both `if` and `function` should (almost) always be followed by squiggly brackets (`{}`), and the contents should be indented by two spaces. This makes it easier to see the hierarchy in your code by skimming the left-hand margin.
An opening curly brace should never go on its own line and should always be followed by a new line. A closing curly brace should always go on its own line, unless it's followed by `else`. Always indent the code inside curly braces.
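For example, a sketch of the recommended layout (with invented variables):

```{r, eval = FALSE}
# Good
if (y < 0 && debug) {
  message("Y is negative")
}

if (y == 0) {
  log(x)
} else {
  y ^ x
}
```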
@ -432,9 +431,9 @@ if (y < 20) {
1. What happens if you use `switch()` with numeric values?
1. What does this `switch()` call do?
1. What does this `switch()` call do? What happens if `x` is "e"?
```{r}
```{r, eval = FALSE}
switch(x,
a = ,
b = "ab",
@ -442,17 +441,19 @@ if (y < 20) {
d = "cd"
)
```
Experiment, then carefully read the documentation.
## Function arguments
The arguments to a function typically fall into two broad sets: one set supplies the data to compute on, and the other supplies arguments that controls the details of the computation. For example:
The arguments to a function typically fall into two broad sets: one set supplies the data to compute on, and the other supplies arguments that control the details of the computation. For example:
* In `log()`, the data is `x`, and the detail is the `base` of the logarithm.
* In `mean()`, the data is `x`, and the details are the `trim` and how to
handle missing values (`na.rm`).
* In `mean()`, the data is `x`, and the details are how much data to trim
from the ends (`trim`) and how to handle missing values (`na.rm`).
* In `t.test()`, the data is `x` and `y`, and the details of the test are
* In `t.test()`, the data are `x` and `y`, and the details of the test are
`alternative`, `mu`, `paired`, `var.equal`, and `conf.level`.
* In `paste()` you can supply any number of strings to `...`, and the details
@ -473,7 +474,7 @@ mean_ci(x)
mean_ci(x, 0.99)
```
The default value should almost always be the most common value. There are a few exceptions to do with safety. For example, it makes sense for `na.rm` to default to `FALSE` because missing values are important. Even though `na.rm = TRUE` is what you usually put in your code, it's a bad idea to silently ignore missing values by default.
The default value should almost always be the most common value. The few exceptions are to do with safety. For example, it makes sense for `na.rm` to default to `FALSE` because missing values are important. Even though `na.rm = TRUE` is what you usually put in your code, it's a bad idea to silently ignore missing values by default.
When you call a function, typically you can omit the names for the data arguments (because they are used so commonly). If you override the default value of a detail argument, you should use the full name:
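For example, a sketch using `mean()`:

```{r}
# Good: data argument unnamed, detail arguments named in full
mean(1:10, trim = 0.1, na.rm = TRUE)

# Works, but harder to read: detail arguments matched only by position
mean(1:10, 0.1, TRUE)
```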
@ -500,7 +501,7 @@ average<-mean(feet/12+inches,na.rm=TRUE)
### Choosing names
The names of the arguments are also important. R doesn't care, but the readers of your code (including future you!) will find your code easier to understand. Generally you should prefer longer, more descriptive names, but there are a handful of very common, very short names. It's worth memorising these:
The names of the arguments are also important. R doesn't care, but the readers of your code (including future-you!) will. Generally you should prefer longer, more descriptive names, but there are a handful of very common, very short names. It's worth memorising these:
* `x`, `y`, `z`: vectors.
* `w`: a vector of weights.
@ -509,11 +510,11 @@ The names of the arguments are also important. R doesn't care, but the readers o
* `n`: length, or number of rows.
* `p`: number of columns.
Otherwise, consider matching names of arguments in existing R functions. For example, always use `na.rm` to determine if missing values should be removed.
Otherwise, consider matching names of arguments in existing R functions. For example, use `na.rm` to determine if missing values should be removed.
### Checking values
As you start to write more functions, you'll eventually get to the point where you don't remember exactly how your function works. At this point it's easier to call your function with invalid inputs. To avoid this problem, it's often useful to make constraints explicit. For example, imagine you've written some functions for computing weighted summary statistics:
As you start to write more functions, you'll eventually get to the point where you don't remember exactly how your function works. At this point it's easy to call your function with invalid inputs. To avoid this problem, it's often useful to make constraints explicit. For example, imagine you've written some functions for computing weighted summary statistics:
```{r}
wt_mean <- function(x, w) {
@ -534,7 +535,7 @@ What happens if `x` and `w` are not the same length?
wt_mean(1:6, 1:3)
```
In this case, because of R's recycling rules, we don't get an error.
In this case, because of R's vector recycling rules, we don't get an error.
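You can see recycling at work in a simple sketch:

```{r}
# The shorter vector is recycled to match the longer one, without a warning
1:6 + 1:3
```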
It's good practice to check important preconditions, and throw an error (with `stop()`), if they are not true:
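A minimal sketch of what that check might look like for `wt_mean()`:

```{r, error = TRUE}
wt_mean <- function(x, w) {
  if (length(x) != length(w)) {
    stop("`x` and `w` must be the same length", call. = FALSE)
  }
  sum(w * x) / sum(w)
}
wt_mean(1:6, 1:3)
```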
@ -570,7 +571,24 @@ wt_mean <- function(x, w, na.rm = FALSE) {
}
```
This is a lot of extra work for little additional gain.
This is a lot of extra work for little additional gain. A useful compromise is the built-in `stopifnot()`: it checks that each argument is `TRUE`, and produces a generic error message if not.
```{r, error = TRUE}
wt_mean <- function(x, w, na.rm = FALSE) {
  stopifnot(is.logical(na.rm), length(na.rm) == 1)
  stopifnot(length(x) == length(w))
  if (na.rm) {
    miss <- is.na(x) | is.na(w)
    x <- x[!miss]
    w <- w[!miss]
  }
  sum(w * x) / sum(w)
}
wt_mean(1:6, 6:1, na.rm = "foo")
```
Note that when using `stopifnot()` you assert what should be true rather than checking for what might be wrong.
### Dot dot dot
@ -581,9 +599,9 @@ sum(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
stringr::str_c("a", "b", "c", "d", "e", "f")
```
How do these functions work? They rely on a special argument: `...` (pronounced dot-dot-dot). This special argument captures any number of arguments that aren't otherwise matched.
It's useful because you can then send those `...` on to another function. This is a useful catch-all if your function primarily wraps another function. For example, I commonly create these helper functions that wrap around `paste()`:
It's useful because you can then send those `...` on to another function. This is a useful catch-all if your function primarily wraps another function. For example, I commonly create these helper functions that wrap around `str_c()`:
```{r}
commas <- function(...) stringr::str_c(..., collapse = ", ")
@ -597,26 +615,26 @@ rule <- function(..., pad = "-") {
rule("Important output")
```
Here `...` lets me forward on any arguments that I don't want to deal with to `paste()`. It's a very convenient technique. But it does came at a price: any misspelled arguments will not raise an error. This makes it easy for typos to go unnoticed:
Here `...` lets me forward on any arguments that I don't want to deal with to `str_c()`. It's a very convenient technique. But it does come at a price: any misspelled arguments will not raise an error. This makes it easy for typos to go unnoticed:
```{r}
x <- c(1, 2)
sum(x, na.mr = TRUE)
```
If you just want to get the values of the `...`, use `list(...)`.
If you just want to capture the values of the `...`, use `list(...)`.
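For example, a sketch of a helper invented for illustration:

```{r}
count_args <- function(...) {
  # list(...) captures every argument passed through the dots
  args <- list(...)
  length(args)
}
count_args(1, "a", TRUE)
```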
### Lazy evaluation
Arguments in R are lazily evaluated: they're not computed until they're needed. That means if they're never used, they're never called. This is an important property of R as a programming language, but is generally not important for data analysis. You can read more about lazy evaluation at <http://adv-r.had.co.nz/Functions.html#lazy-evaluation>
Arguments in R are lazily evaluated: they're not computed until they're needed. That means if they're never used, they're never called. This is an important property of R as a programming language, but is generally not important when you're writing your own functions for data analysis. You can read more about lazy evaluation at <http://adv-r.had.co.nz/Functions.html#lazy-evaluation>
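A small sketch of this behaviour (the function is invented for illustration):

```{r}
f <- function(x, y) {
  # y is never touched, so the expression supplied for it is never evaluated
  x * 2
}
f(10, stop("this error never happens"))
```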
### Exercises
1. What does `commas(letters, collapse = "-")` do? Why?
1. It'd be nice if you supply multiple characters to the `pad` argument, e.g.
`rule("Title", pad = "-+")`. Why doesn't this currently work? How could you
fix it?
1. It'd be nice if you could supply multiple characters to the `pad` argument,
e.g. `rule("Title", pad = "-+")`. Why doesn't this currently work? How
could you fix it?
1. What does the `trim` argument to `mean()` do? When might you use it?
@ -630,7 +648,7 @@ Figuring out what your function should return is usually straightforward: it's w
### Explicit return statements
The value returned by the function is the usually the last statement it evaluates, but you choose to return early by using `return()`. I think it's best to save the use of `return()` to signal that you can return early with a simpler solution. A common reason to do this is because the inputs are empty:
The value returned by the function is usually the last statement it evaluates, but you can choose to return early by using `return()`. I think it's best to save the use of `return()` to signal that you can return early with a simpler solution. A common reason to do this is because the inputs are empty:
```{r}
complicated_function <- function(x, y, z) {
@ -662,7 +680,7 @@ f <- function() {
}
```
But if the first block is very long, by the time you get to the else, you've forgotten what's going on. One way to rewrite it is to use an early return for the simple case:
But if the first block is very long, by the time you get to the `else`, you've forgotten the `condition`. One way to rewrite it is to use an early return for the simple case:
```{r, eval = FALSE}
@ -715,7 +733,7 @@ class(x)
dim(x)
```
But we can still use it in a pipeline:
And we can still use it in a pipeline:
```{r, include = FALSE}
library(dplyr)
@ -730,7 +748,7 @@ mtcars %>%
## Environment
The last component of a function is it's environment. This is not something you need to understand deeply when you first start writing functions. However, it's important to know a little bit about environments because they are crucial to how functions work. The environment of a function controls how R finds the value associated with a name. For example, take this function:
The last component of a function is its environment. This is not something you need to understand deeply when you first start writing functions. However, it's important to know a little bit about environments because they are crucial to how functions work. The environment of a function controls how R finds the value associated with a name. For example, take this function:
```{r}
f <- function(x) {
@ -764,4 +782,4 @@ table(replicate(1000, 1 + 2))
rm(`+`)
```
This is a common phenomenon in R. R places few limits on your power. You can do many things that you can't do in other programming languages. You can do many things that 99% of the time are extremely ill-advised (like overriding how addition works!). But this power and flexibility is what makes tools like ggplot2 and dplyr possible. Learning how to make best use of this flexibility is beyond the scope of this book, but you can read about in "[Advanced R](http://adv-r.had.co.nz)".
This is a common phenomenon in R. R places few limits on your power. You can do many things that you can't do in other programming languages. You can do many things that 99% of the time are extremely ill-advised (like overriding how addition works!). But this power and flexibility is what makes tools like ggplot2 and dplyr possible. Learning how to make best use of this flexibility is beyond the scope of this book, but you can read about it in [_Advanced R_](http://adv-r.had.co.nz).