Function proofing

hadley 2016-08-18 08:37:48 -05:00
parent d6fcb7e78f
commit b632d512f7
2 changed files with 46 additions and 36 deletions


@ -2,7 +2,7 @@
## Introduction
One of the best ways to improve your reach as a data scientist is to write functions. Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting.. Writing a function has three big advantages over using copy-and-paste:
One of the best ways to improve your reach as a data scientist is to write functions. Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting. Writing a function has three big advantages over using copy-and-paste:
1. You can give a function an evocative name that makes your code easier to
understand.
@ -13,9 +13,9 @@ One of the best ways to improve your reach as a data scientist is to write funct
1. You eliminate the chance of making incidental mistakes when you copy and
paste (e.g. updating a variable name in one place, but not in another).
Writing good functions is a lifetime journey. Even after using R for many years I still learn new techniques and better ways of approaching old problems. The goal of this chapter is not to master every esoteric detail of functions but to get you started with some pragmatic advice that you can apply immediately.
Writing good functions is a lifetime journey. Even after using R for many years I still learn new techniques and better ways of approaching old problems. The goal of this chapter is not to teach you every esoteric detail of functions but to get you started with some pragmatic advice that you can apply immediately.
As well as practical advice for writing functions, this chapter also gives you some suggestions for how to style your code. Good coding style is like using correct punctuation. Youcanmanagewithoutit, but it sure makes things easier to read. As with styles of punctuation, there are many possible variations. Here we present the style we use in our code, but the most important thing is to be consistent.
As well as practical advice for writing functions, this chapter also gives you some suggestions for how to style your code. Good code style is like correct punctuation. Youcanmanagewithoutit, but it sure makes things easier to read! As with styles of punctuation, there are many possible variations. Here we present the style we use in our code, but the most important thing is to be consistent.
### Prerequisites
@ -26,7 +26,7 @@ The focus of this chapter is on writing functions in base R, so you won't need a
You should consider writing a function whenever you've copied and pasted a block of code more than twice (i.e. you now have three copies of the same code). For example, take a look at this code. What does it do?
```{r}
df <- data.frame(
df <- tibble::tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
@ -36,7 +36,7 @@ df <- data.frame(
df$a <- (df$a - min(df$a, na.rm = TRUE)) /
(max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) /
(max(df$a, na.rm = TRUE) - min(df$b, na.rm = TRUE))
(max(df$b, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) /
(max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) /
@ -55,7 +55,7 @@ To write a function you need to first analyse the code. How many inputs does it
This code only has one input: `df$a`. (If you're surprised that `TRUE` is not an input, you can explore why in the exercise below.) To make the inputs more clear, it's a good idea to rewrite the code using temporary variables with general names. Here this code only requires a single numeric vector, so I'll call it `x`:
```{r}
x <- 1:10
x <- df$a
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
```
@ -85,8 +85,8 @@ There are three key steps to creating a new function:
Here we have just one argument. If we had more the call would look like
`function(x, y, z)`.
1. You place the __body__ of the function inside a `{` block immediately
following `function`.
1. You place the code you have developed in the __body__ of the function, a
`{` block that immediately follows `function(...)`.
Note the overall process: I only made the function after I'd figured out how to make it work with a simple input. It's easier to start with working code and turn it into a function; it's harder to create a function and then try to make it work.
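Applied to the rescaling expression above, those three steps might produce something like the following. This is a minimal sketch: the hunk does not show the function definition, so the body here is an assumption built from the `x`-based expression, with `range()` used to avoid computing the minimum twice.

```r
# Sketch only: the body is assumed from the expression developed above.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)        # c(min, max), ignoring NAs
  (x - rng[1]) / (rng[2] - rng[1])     # map the minimum to 0 and the maximum to 1
}

rescale01(c(0, 5, 10))
#> [1] 0.0 0.5 1.0
```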
@ -108,7 +108,7 @@ df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
```
Compared to the original, this code is easier to understand and we've eliminated one class of copy-and-paste errors. There is still quite a bit of duplication since we're doing the same thing to multiple columns. We'll learn how to eliminate that duplication in [iteration], once you've learned more about R's data structures in [data-structures].
Compared to the original, this code is easier to understand and we've eliminated one class of copy-and-paste errors. There is still quite a bit of duplication since we're doing the same thing to multiple columns. We'll learn how to eliminate that duplication in [iteration], once you've learned more about R's data structures in [vectors].
Another advantage of functions is that if our requirements change, we only need to make the change in one place. For example, we might discover that some of our variables include infinite values, and `rescale01()` fails:
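A sketch of that failure and one possible fix, assuming `rescale01()` is built on `range()`. The definition is not part of this hunk, so both versions below are illustrative, and `rescale01_finite()` is a hypothetical name.

```r
# Assumed definition; not shown in the diff.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

x <- c(1:10, Inf)
rescale01(x)   # the infinite maximum collapses finite values to 0; Inf becomes NaN

# One possible fix: ask range() to ignore infinite values.
rescale01_finite <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01_finite(x)   # finite values span 0 to 1; Inf stays Inf
```

Because the logic lives in one function, the change is made once rather than in every copied expression.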
@ -151,7 +151,8 @@ This is an important part of the "do not repeat yourself" (or DRY) principle. Th
```
1. Follow <http://nicercode.github.io/intro/writing-functions.html> to
write your own functions to compute the variance and skew of a vector.
write your own functions to compute the variance and skew of a numeric
vector.
1. Implement a `fizzbuzz` function. It takes a single number as input. If
the number is divisible by three, it returns "fizz". If it's divisible by
@ -179,7 +180,7 @@ This is an important part of the "do not repeat yourself" (or DRY) principle. Th
It's important to remember that functions are not just for the computer, but are also for humans. R doesn't care what your function is called, or what comments it contains, but these are important for human readers. This section discusses some things that you should bear in mind when writing functions that humans can understand.
The name of a function is important. Ideally, the name of your function will be short, but clearly evoke what the function does. That said, it's hard to come up with concise names, and autocomplete makes it easy to type long names, so it's better to err on the side of clarity, not brevity.
The name of a function is important. Ideally, the name of your function will be short, but clearly evoke what the function does. That's hard! But it's better to be clear than short, as RStudio's autocomplete makes it easy to type long names.
Generally, function names should be verbs, and arguments should be nouns. There are some exceptions: nouns are ok if the function computes a very well known noun (i.e. `mean()` is better than `compute_mean()`), or accessing some property of an object (i.e. `coef()` is better than `get_coefficients()`). A good sign that a noun might be a better choice is if you're using a very broad verb like "get", "compute", "calculate", or "determine". Use your best judgement and don't be afraid to rename a function if you figure out a better name later.
@ -195,12 +196,12 @@ impute_missing()
collapse_years()
```
If your function name is composed of multiple words, I recommend using "snake\_case", where each word is lower case and separated by an underscore. camelCase is a popular alternative. It doesn't really matter which one you pick, the important thing is to be consistent: pick one or the other and stick with it. R itself is not very consistent, but there's nothing you can do about that. Make sure you don't fall into the same trap by making your code as consistent as possible.
If your function name is composed of multiple words, I recommend using "snake\_case", where each lowercase word is separated by an underscore. camelCase is a popular alternative. It doesn't really matter which one you pick, the important thing is to be consistent: pick one or the other and stick with it. R itself is not very consistent, but there's nothing you can do about that. Make sure you don't fall into the same trap by making your code as consistent as possible.
```{r, eval = FALSE}
# Never do this!
col_mins()
rowMaxes()
col_mins <- function(x, y) {}
rowMaxes <- function(y, x) {}
```
If you have a family of functions that do similar things, make sure they have consistent names and arguments. Use a common prefix to indicate that they are connected. That's better than a common suffix because autocomplete allows you to type the prefix and see all the members of the family.
@ -228,9 +229,9 @@ c <- 10
mean <- function(x) sum(x)
```
Use comments, lines starting with `#`, to explain the "why" of your code. You generally should avoid comments that explain the "what" or the "how". If you can't understand what the code does from reading it, you should think about how to rewrite it to be more clear. Do you need to add some intermediate variables with useful names? Do you need to break out a subcomponent of a large function so you can name it? However, your code can never capture the reasoning behind your decisions: why did you choose this approach instead of an alternative? What else did I try that didn't work? It's a great idea to capture that sort of thinking in a comment.
Use comments, lines starting with `#`, to explain the "why" of your code. You generally should avoid comments that explain the "what" or the "how". If you can't understand what the code does from reading it, you should think about how to rewrite it to be more clear. Do you need to add some intermediate variables with useful names? Do you need to break out a subcomponent of a large function so you can name it? However, your code can never capture the reasoning behind your decisions: why did you choose this approach instead of an alternative? What else did you try that didn't work? It's a great idea to capture that sort of thinking in a comment.
Another important use of comments is to break up your file into easily readable chunks. Use long lines of `-` and `=` to make it easy to spot the breaks. RStudio even provides a keyboard shortcut for this: Cmd/Ctrl + Shift + R.
Another important use of comments is to break up your file into easily readable chunks. Use long lines of `-` and `=` to make it easy to spot the breaks.
```{r, eval = FALSE}
# Load data --------------------------------------
@ -238,6 +239,12 @@ Another important use of comments is to break up your file into easily readable
# Plot data --------------------------------------
```
RStudio provides a keyboard shortcut to create these headers (Cmd/Ctrl + Shift + R), and will display them in the code navigation drop-down at the bottom-left of the editor:
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("screenshots/rstudio-nav.png")
```
### Exercises
1. Read the source code for each of the following three functions, puzzle out
@ -306,7 +313,7 @@ if (NA) {}
You can use `||` (or) and `&&` (and) to combine multiple logical expressions. These operators are "short-circuiting": as soon as `||` sees the first `TRUE` it returns `TRUE` without computing anything else. As soon as `&&` sees the first `FALSE` it returns `FALSE`. You should never use `|` or `&` in an `if` statement: these are vectorised operations that apply to multiple values (that's why you use them in `filter()`). If you do have a logical vector, you can use `any()` or `all()` to collapse it to a single value.
Be careful when testing for equality. `==` is vectorised, which means that it's easy to get more than one output. Either check the length is already 1, collapsed with `all()` or `any()`, or use the non-vectorised `identical()`. `identical()` is very strict: it always returns either a single `TRUE` or a single `FALSE`, and doesn't coerce types. This means that you need to be careful when comparing integers and doubles:
Be careful when testing for equality. `==` is vectorised, which means that it's easy to get more than one output. Either check the length is already 1, collapse with `all()` or `any()`, or use the non-vectorised `identical()`. `identical()` is very strict: it always returns either a single `TRUE` or a single `FALSE`, and doesn't coerce types. This means that you need to be careful when comparing integers and doubles:
```{r}
identical(0L, 0)
@ -321,6 +328,8 @@ x == 2
x - 2
```
Instead use `dplyr::near()` for comparisons, as described in [comparisons].
And remember, `x == NA` doesn't do anything useful!
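These pitfalls can be illustrated with base R alone. The tolerance check below is a hand-rolled stand-in for `dplyr::near()`; the helper name `roughly_equal()` and its tolerance are assumptions made for illustration.

```r
x <- sqrt(2) ^ 2

x == 2                 # FALSE: floating point error
identical(0L, 0)       # FALSE: integer vs double

# Hypothetical helper: compare within a small tolerance.
roughly_equal <- function(a, b, tol = 1e-8) {
  abs(a - b) < tol
}
roughly_equal(x, 2)    # TRUE

# Comparing with NA yields NA, so test for missingness with is.na().
NA == NA               # NA
is.na(NA)              # TRUE
```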
### Multiple conditions
@ -339,7 +348,7 @@ if (this) {
But if you end up with a very long series of chained `if` statements, you should consider rewriting. One useful technique is the `switch()` function. It allows you to evaluate selected code based on position or name.
```{r}
```{r, echo = FALSE}
function(x, y, op) {
switch(op,
plus = x + y,
@ -383,7 +392,7 @@ else {
}
```
It's ok to drop the curly braces if you have a very short `if `statement that can fit on one line:
It's ok to drop the curly braces if you have a very short `if` statement that can fit on one line:
```{r}
y <- 10
@ -427,7 +436,8 @@ if (y < 20) {
```
How would you change the call to `cut()` if I'd used `<` instead of `<=`?
What are the advantages of `cut()` for this type of problem?
What is the other chief advantage of `cut()` for this problem? (Hint:
what happens if you have many values in `temp`?)
1. What happens if you use `switch()` with numeric values?
@ -456,7 +466,7 @@ The arguments to a function typically fall into two broad sets: one set supplies
* In `t.test()`, the data are `x` and `y`, and the details of the test are
`alternative`, `mu`, `paired`, `var.equal`, and `conf.level`.
* In `paste()` you can supply any number of strings to `...`, and the details
* In `str_c()` you can supply any number of strings to `...`, and the details
of the concatenation are controlled by `sep` and `collapse`.
Generally, data arguments should come first. Detail arguments should go on the end, and usually should have default values. You specify a default value in the same way you call a function with a named argument:
@ -474,9 +484,9 @@ mean_ci(x)
mean_ci(x, conf = 0.99)
```
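The hunk shows only the calls to `mean_ci()`. A definition along these lines would support both of them; treat it as a sketch, since the body (a normal-approximation interval) is assumed rather than taken from the diff.

```r
# Sketch: `conf` is a detail argument with a default of 0.95.
mean_ci <- function(x, conf = 0.95) {
  se <- sd(x) / sqrt(length(x))                       # standard error of the mean
  alpha <- 1 - conf
  mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))   # lower and upper bounds
}

set.seed(1)
x <- runif(100)
mean_ci(x)               # uses the default, conf = 0.95
mean_ci(x, conf = 0.99)  # overrides the default, giving a wider interval
```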
The default value should almost always be the most common value. The few exceptions are to do with safety. For example, it makes sense for `na.rm` to default to `FALSE` because missing values are important. Even though `na.rm = TRUE` is what you usually put in your code, it's a bad idea to silently ignore missing values by default.
The default value should almost always be the most common value. The few exceptions to this rule are to do with safety. For example, it makes sense for `na.rm` to default to `FALSE` because missing values are important. Even though `na.rm = TRUE` is what you usually put in your code, it's a bad idea to silently ignore missing values by default.
When you call a function, you typically omit the names of the data arguments (because they are used so commonly). If you override the default value of a detail argument, you should use the full name:
When you call a function, you typically omit the names of the data arguments, because they are used so commonly. If you override the default value of a detail argument, you should use the full name:
```{r, eval = FALSE}
# Good
@ -590,7 +600,7 @@ wt_mean(1:6, 6:1, na.rm = "foo")
Note that when using `stopifnot()` you assert what should be true rather than checking for what might be wrong.
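A minimal sketch of that style, using a weighted mean whose body is assumed for illustration (only the call `wt_mean(1:6, 6:1, na.rm = "foo")` appears in the diff):

```r
wt_mean <- function(x, w, na.rm = FALSE) {
  # Assert what must be true; stopifnot() errors if any condition is not.
  stopifnot(is.logical(na.rm), length(na.rm) == 1)
  stopifnot(length(x) == length(w))

  if (na.rm) {
    miss <- is.na(x) | is.na(w)
    x <- x[!miss]
    w <- w[!miss]
  }
  sum(w * x) / sum(w)
}

wt_mean(1:6, 6:1)
# wt_mean(1:6, 6:1, na.rm = "foo") would stop with an error from stopifnot()
```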
### Dot dot dot
### Dot-dot-dot (...)
Many functions in R take an arbitrary number of inputs:
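For example, `sum()` and `paste()` both accept any number of inputs through `...`. The wrapper below is a hypothetical helper (the name `commas()` is an assumption) showing how `...` can be forwarded to another function.

```r
sum(1, 2, 3, 4, 5)
#> [1] 15

# Hypothetical helper: forward every input to paste().
commas <- function(...) {
  paste(..., sep = "", collapse = ", ")
}

commas(letters[1:5])
#> [1] "a, b, c, d, e"
```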
@ -626,7 +636,7 @@ If you just want to capture the values of the `...`, use `list(...)`.
### Lazy evaluation
Arguments in R are lazily evaluated: they're not computed until they're needed. That means if they're never used, they're never called. This is an important property of R as a programming language, but is generally not important when you're writing your own functions for data analysis. You can read more about lazy evaluation at <http://adv-r.had.co.nz/Functions.html#lazy-evaluation>
Arguments in R are lazily evaluated: they're not computed until they're needed. That means if they're never used, they're never called. This is an important property of R as a programming language, but is generally not important when you're writing your own functions for data analysis. You can read more about lazy evaluation at <http://adv-r.had.co.nz/Functions.html#lazy-evaluation>.
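A small illustration, with hypothetical function names: an argument that is never used is never evaluated, so even an expression that would raise an error causes no harm.

```r
# `y` is never touched inside the body, so it is never evaluated.
first_only <- function(x, y) {
  x
}

first_only(1, stop("this is never evaluated"))
#> [1] 1

# Using the argument forces its evaluation, and the error surfaces.
use_both <- function(x, y) {
  x + y
}
# use_both(1, stop("now this runs")) would signal an error
```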
### Exercises
@ -644,7 +654,11 @@ Arguments in R are lazily evaluated: they're not computed until they're needed.
## Return values
Figuring out what your function should return is usually straightforward: it's why you created the function in the first place! There are two things you should consider when returning a value: Does returning early make your function easier to read? And can you make your function pipeable?
Figuring out what your function should return is usually straightforward: it's why you created the function in the first place! There are two things you should consider when returning a value:
1. Does returning early make your function easier to read?
2. Can you make your function pipeable?
### Explicit return statements
@ -704,11 +718,11 @@ This tends to make the code easier to understand, because you don't need quite s
### Writing pipeable functions
If you want to write your own pipeable functions, thinking about the return value is important. There are two main types of pipeable functions.
If you want to write your own pipeable functions, thinking about the return value is important. There are two main types of pipeable functions: transformation and side-effecty.
In __transformation__ functions, there's a clear "primary" object that is passed in as the first argument, and a modified version is returned by the function. For example, the key objects for dplyr and tidyr are data frames. If you can identify what the object type is for your domain, you'll find that your functions just work in a pipe.
In __transformation__ functions, there's a clear "primary" object that is passed in as the first argument, and a modified version is returned by the function. For example, the key objects for dplyr and tidyr are data frames. If you can identify what the object type is for your domain, you'll find that your functions just work with the pipe.
__Side-effect__ functions, however, are primarily called to perform an action, like drawing a plot or saving a file, not transforming an object. These functions should "invisibly" return the first argument, so they're not printed by default, but can still be used in a pipeline. For example, this simple function that prints out the number of missing values in a data frame:
__Side-effect__ functions are primarily called to perform an action, like drawing a plot or saving a file, not transforming an object. These functions should "invisibly" return the first argument, so they're not printed by default, but can still be used in a pipeline. For example, this simple function that prints out the number of missing values in a data frame:
```{r}
show_missings <- function(df) {
@ -733,16 +747,12 @@ class(x)
dim(x)
```
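The diff truncates the body of `show_missings()`. A sketch consistent with the description above (print a count, then return the input invisibly) might look like this; the exact body is an assumption.

```r
show_missings <- function(df) {
  n <- sum(is.na(df))
  cat("Missing values: ", n, "\n", sep = "")
  invisible(df)   # return the input so the call can sit in a pipeline
}

x <- show_missings(mtcars)   # prints the count, but not the data frame
identical(x, mtcars)         # TRUE: the data frame is still returned
```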
And we can still use it in a pipeline:
```{r, include = FALSE}
library(dplyr)
```
And we can still use it in a pipe:
```{r}
mtcars %>%
show_missings() %>%
mutate(mpg = ifelse(mpg < 20, NA, mpg)) %>%
dplyr::mutate(mpg = ifelse(mpg < 20, NA, mpg)) %>%
show_missings()
```
@ -756,7 +766,7 @@ f <- function(x) {
}
```
In many programming languages, this would be an error, because `y` is not defined inside the function. In R, this is valid code because R uses rules called _lexical scoping_ to find the value associated with a name. Since `y` is not defined inside the function, R will look in the _environment_ where the function was defined:
In many programming languages, this would be an error, because `y` is not defined inside the function. In R, this is valid code because R uses rules called __lexical scoping__ to find the value associated with a name. Since `y` is not defined inside the function, R will look in the __environment__ where the function was defined:
```{r}
y <- 100

BIN screenshots/rstudio-nav.png (new file, binary not shown, 15 KiB)