More about robust code

This commit is contained in:
hadley 2016-01-20 08:55:50 -06:00
parent 9ee44648ed
commit 164e82847d
1 changed file with 125 additions and 94 deletions


## Robust and readable functions
(This is an advanced topic. You shouldn't worry too much about it when you first start writing functions. Instead you should focus on getting a function that works right for the easiest 80% of the problem. Then in time, you'll learn how to get to 99% with minimal extra effort. The defaults in this book should steer you in the right direction: we avoid teaching you functions with major surprises.)
In this section you'll learn an important principle that lends itself to reliable and readable code: favour code that can be understood with a minimum of context. On one extreme, take this code:
```{r, eval = FALSE}
baz <- foo(bar, qux)
```
What does it do? You can glean only a little from the context: `foo()` is a function that takes (at least) two arguments, and it returns a result we store in `baz`. But apart from that, you have no idea. To understand what this function does, you need to read the definitions of `foo()`, `bar`, and `qux`. Using better variable names helps a lot:
```{r, eval = FALSE}
df2 <- arrange(df, qux)
```
It's now much easier to see what's going on! Function and variable names are important because they tell you (or at least jog your memory about) what the code does. That helps you understand code in isolation, even if you don't completely understand all the details. Unfortunately naming things is hard, and it's hard to give concrete advice apart from: give objects short but evocative names. As autocomplete in RStudio has gotten better, I've tended to use longer names that are more descriptive. Short names are faster to type, but you write code relatively infrequently compared to the number of times that you read it.
The idea of minimising the context needed to understand your code goes beyond just good naming. You also want to favour functions with predictable behaviour and few surprises. If a function does radically different things when its inputs differ slightly, you'll need to carefully read the surrounding context in order to predict what it will do. The goal of this section is to educate you about the most common ways R functions can be surprising and to provide you with unsurprising alternatives.
There are three common classes of surprises in R:
1. Unstable types: What will `df[, x]` return? You can assume that `df`
   is a data frame and `x` is a vector because of their names. But you don't
   know whether this code will return a data frame or a vector, because the
   behaviour of `[` depends on the length of `x`.
1. Non-standard evaluation: What will `filter(df, x == y)` do? It depends on
   whether `x` or `y` or both are variables in `df` or variables in the current
   environment.
### Variable lookup
1. Hidden arguments: What sort of variable will `data.frame(x = "a")`
   create? It will be either a character vector or a factor depending on
   the value of the global `stringsAsFactors` option.
Avoiding these three types of surprise helps you to write code that is easy to understand and that fails obviously when given unexpected input. If predictability is so important, why do any functions behave differently? It's because R is not just a programming language: it's also an environment for interactive data analysis. Some things make sense for interactive use (where you quickly check the output, and guessing what you want is ok) but don't make sense for programming (where you want errors to arise as quickly as possible).
You might notice that these issues revolve around data frames. That's unfortunate, because data frames are the data structure you'll use most often. It's ironic: the most frustrating things about programming in R are features that were originally designed to make your data analysis easier! Data frames try very hard to be helpful:
```{r}
df <- data.frame(xy = c("x", "y"))
# Character vectors used to be hard to work with, so R
# helpfully converts them to a factor for you:
class(df$xy)
# If you're only selecting a single column, R tries to be helpful
# and give you that column, rather than giving you a single column
# data frame
class(df[, "xy"])
# If you have long variable names, R is "helpful" and lets you select
# them with a unique prefix
df$x
```
These features all made sense at the time they were added to R, but computing environments have changed a lot, and these features now tend to cause a lot of problems. dplyr disables them for you:
```{r, error = TRUE}
df <- dplyr::data_frame(xy = c("x", "y"))
class(df$xy)
class(df[, "xy"])
df$x
```
### Unpredictable types
One of the most frustrating functions for programming is `[`, because of the way it returns a vector if the result has a single column, and a data frame otherwise. In other words, if you see code like `df[x, ]` you can't predict what it will return without knowing how many columns `df` has. This can trip you up in surprising ways. For example, imagine you've written this function to return the last row of a data frame:
```{r}
last_row <- function(df) {
  df[nrow(df), ]
}
```
It's not always going to return a row! If you give it a single column data frame, it will return a single number:
```{r}
df <- data.frame(x = 1:3)
last_row(df)
```
There are two ways to avoid this problem:
* Use `drop = FALSE`: `df[x, , drop = FALSE]`.
* Subset the data frame like a list: `df[x]`.
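To make the difference concrete, here's a minimal sketch of both techniques applied to column subsetting (the data frame `df3` is made up for illustration):
```{r}
df3 <- data.frame(x = 1:3, y = 4:6)
class(df3[, "x"])                # a vector: `[` simplified the result
class(df3[, "x", drop = FALSE])  # still a data frame, thanks to drop = FALSE
class(df3["x"])                  # list-style subsetting never simplifies
```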
Using one of those techniques for `last_row()` makes it more predictable: you know it will always return a data frame.
```{r}
last_row <- function(df) {
  df[nrow(df), , drop = FALSE]
}
last_row(df)
```
Another common cause of problems is the `sapply()` function. If you've never heard of it before, feel free to skip this bit: just remember to avoid it! The problem with `sapply()` is that it tries to guess what the simplest form of output is, and it always succeeds.
The following code shows how `sapply()` can produce three different types of data depending on the input.
```{r}
df <- data.frame(
  a = 1L,
  b = 1.5,
  y = Sys.time(),
  z = ordered(1)
)
df[1:4] %>% sapply(class) %>% str()
df[1:2] %>% sapply(class) %>% str()
df[3:4] %>% sapply(class) %>% str()
```
In the next chapter, you'll learn about the purrr package which provides a variety of alternatives. In this case, you could use `map_chr()` which always returns a character vector: if it can't, it will throw an error. Another option is the base `vapply()` function which takes a third argument indicating what the output should look like.
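As a hedged sketch of both alternatives, using the data frame `df` from above (note that `class()` returns two values for the datetime and ordered-factor columns, so the strict functions fail loudly instead of quietly changing shape):
```{r, error = TRUE}
# vapply()'s third argument declares the shape each result must have
vapply(df[1:2], class, character(1))  # fine: one string per column
vapply(df[3:4], class, character(1))  # error: class() returns two values here
purrr::map_chr(df[1:2], class)        # the purrr equivalent, same guarantee
```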
This doesn't make `sapply()` bad and `vapply()` and `map_chr()` good. `sapply()` is nice because you can use it interactively without having to think about what the function will return. 95% of the time it will do the right thing, and if it doesn't you can quickly fix it. `map_chr()` is more important when you're programming, because a clear error message is more valuable when an operation is buried deep inside a tree of function calls.
### Non-standard evaluation
You've learned a number of functions that implement special lookup rules:
```{r}
ggplot(mpg, aes(displ, cty)) + geom_point()
filter(mpg, displ > 10)
```
These are called "non-standard evaluation", or NSE for short, because the usual lookup rules don't apply. In both cases above, neither `displ` nor `cty` is present in the global environment. Instead, both ggplot2 and dplyr look for them first in a data frame. This is great for interactive use, but can cause problems inside a function because they'll fall back to the global environment if the variable isn't found.
Under R's standard scoping rules, a name is looked up first in the environment where it is used, then in that environment's parents, ending at the global environment and the attached packages. NSE functions like `filter()` effectively put the data frame at the front of that search path, which is why bare column names work.
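A minimal sketch of that mixed lookup (the data frame and variables here are invented for illustration):
```{r}
cutoff <- 3                        # lives in the global environment
df <- dplyr::data_frame(x = 1:5)   # `x` lives in the data frame
dplyr::filter(df, x > cutoff)      # `x` comes from `df`, `cutoff` from the global environment
```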
To make this concrete, consider a function `big_x(df, threshold)`, written as `filter(df, x > threshold)`, that is supposed to return the rows of `df` where `x` is greater than `threshold`. There are two ways in which this function can fail:

1. `df$x` might not exist, in which case `big_x()` either throws an error
   or silently falls back to an `x` found somewhere else.
   The second failure mode is particularly pernicious because it doesn't
   throw an error, but instead silently returns an incorrect result. It
   works because by design `filter()` looks in both the data frame and
   the parent environment.

   It is unlikely that the variable you care about will both be missing where
   you expect it, and present where you don't expect it. But I think it's
   worth weighing heavily in your analysis of potential failure modes because
   it's a failure that's easy to miss (since it just silently gives a bad
   result), and hard to track down (since you need to read a lot of context).
1. `df$threshold` might exist:
    ```{r}
    df <- dplyr::data_frame(x = 1:10, threshold = 100)
    big_x(df, 5)
    ```
   Again, this is bad because it silently gives an unexpected result.
How can you avoid this problem? Currently, you need to do this:
```{r}
big_x <- function(df, threshold) {
  # ...
}
```
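The body of the chunk above is elided in this diff; as a hedged sketch, the kind of guard it describes might look like this (the exact check and message are assumptions, not the author's code):
```{r}
big_x <- function(df, threshold) {
  # Assumed guard: refuse to run if `df` already contains a `threshold` column
  if ("threshold" %in% names(df)) {
    stop("`df` must not contain a variable called `threshold`", call. = FALSE)
  }
  dplyr::filter(df, x > threshold)
}
```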
dplyr currently has no way to force a name to be interpreted as either a local or a parent variable, and that's really why you should avoid NSE. In a future version you should be able to do:
```{r}
big_x <- function(df, threshold) {
  # ...
}
```
Another option is to implement it yourself using base subsetting:
```{r}
big_x <- function(df, threshold) {
  rows <- df$x > threshold
  df[!is.na(rows) & rows, , drop = FALSE]
}
```
The challenge is remembering that `filter()` also drops missing values (that's what the `!is.na(rows)` is for), and that you also need to remember `drop = FALSE`!
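To see why both details matter, compare the two approaches on a small made-up example with a missing value:
```{r}
df <- data.frame(x = c(1, NA, 10))
dplyr::filter(df, x > 5)        # filter() silently drops the NA row
df[df$x > 5, , drop = FALSE]    # a bare logical subset keeps it as an all-NA row
```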
### Relying on global options
Functions are easiest to reason about if they have two properties:
1. Their output only depends on their inputs.
1. They don't affect the outside world except through their return value.
The first property is particularly important. If a function has hidden additional inputs, it's very difficult to even know where the important context is!
The biggest breakers of this rule in base R are the functions that create data frames. Most of these functions have a `stringsAsFactors` argument that defaults to `getOption("stringsAsFactors")`. This means that a global option affects the operation of a very large number of functions, and you need to be aware that, depending on an external state, a function might produce either a character vector or a factor. In this book, we steer you away from that problem by recommending functions like `readr::read_csv()` and `dplyr::data_frame()` that don't rely on this option. But be aware of it! Generally, if a function is affected by a global option, you should avoid setting it.
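Here's a minimal sketch of the problem (saving and restoring the option so the demo doesn't leak global state):
```{r}
old <- options(stringsAsFactors = TRUE)
class(data.frame(x = "a")$x)    # "factor"
options(stringsAsFactors = FALSE)
class(data.frame(x = "a")$x)    # "character"
options(old)                    # restore the original setting
```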
Only use `options()` to control side-effects of a function. The value of an option should never affect the return value of a function. There are only three violations of this rule in base R: `stringsAsFactors`, `encoding`, `na.action`. For example, base R lets you control the number of digits printed in default displays with (e.g.) `options(digits = 3)`. This is a good use of an option because it's something that people frequently want control over, but doesn't affect the computation of a result, just its display. Follow this principle with your own use of options.
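For example, a display option changes what you see, not what you compute; a quick sketch:
```{r}
old <- options(digits = 3)
print(pi)   # displayed as 3.14
pi - 3.14   # but the full value is untouched
options(old)
```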
### Exercises
1. Look at the `encoding` argument to `file()`, `url()`, `gzfile()` etc.
   What is the default value? Why should you avoid setting the default
   value on a global level?