Continued brain dump about reliable functions

This commit is contained in:
hadley 2016-01-19 08:36:31 -06:00
parent cc554ee638
commit 9ee44648ed
1 changed file with 161 additions and 1 deletion


@@ -393,10 +393,24 @@ As you learn more about R, you'll learn more functions that allow you to abstrac
## Robust and readable functions
(This is an advanced topic. You shouldn't worry too much about it when you first start writing functions. Instead, focus on getting a function that works right for the easiest 80% of the problem. Then, in time, you'll learn how to get to 99% with minimal extra effort. The defaults in this book should steer you in the right direction: we avoid teaching you functions with major surprises.)

There is one principle that tends to lend itself both to easily readable code and to code that works well even when generalised to handle new situations that you didn't previously think about.

You want to use functions whose behaviour can be understood with as little context as possible. The less code you need to read to predict the likely outcome of a function, the easier it is to understand the code. Such code is also less likely to fail in unexpected ways when it meets new situations.
What does this code do?
```{r, eval = FALSE}
baz <- foo(bar, qux)
```
You can glean a little from the context: `foo()` is a function that takes (at least) two arguments, and it returns a result we store in `baz`. But apart from that, you have no idea. To understand what this function does, you need to read much more of the context. This is an extreme example.
Function and variable names are important because they hint at (or at least jog your memory of) what the code does. The advantage of using built-in functions is that you can use them in many places so that you're more likely to remember what they do.
The other side of this problem is using functions that rarely surprise you: functions that have consistent behaviour regardless of their inputs. These functions are useful because they act as bottlenecks: it doesn't matter what goes into them because you always know what comes out.
A few examples:
* What will `df[, x]` return? You can assume that `df` is a data frame
@@ -408,7 +422,7 @@ A few examples:
* What will `filter(df, x == y)` do? It depends on whether `x` or `y` or
both are variables in `df` or variables in the current environment.
Compare with `df[df$x == y, , drop = FALSE]`.
Currently `filter(df, local(x) == global(y))`
* What sort of column will `data.frame(x = "a")` create? You
can't be sure whether it will contain characters or factors depending on
@@ -423,3 +437,149 @@ The transition from interactive analysis to programming R can be very frustratin
If this behaviour is advantageous for programming, why do any functions behave differently? Because R is not just a programming language, it's also an environment for interactive data analysis. Some things make sense for interactive use (where you quickly check the output, and guessing what you want is ok) but don't make sense for programming (where you want errors to arise as quickly as possible).
It's a continuum, not two discrete endpoints. It's not possible to write code where every single line is understandable in isolation. Even if you could, it wouldn't be desirable. Relying on a little context is useful. You just don't want to go overboard.
### Naming
```{r}
# these function names don't mean quite what they seem to mean
is.atomic(NULL)
is.vector(factor(1:3))  # FALSE: is.vector() also checks for attributes
```
You'll learn more about these in the data structures chapter.
### Type
`sapply()` vs `vapply()` vs the `purrr::map_xyz()` family. (This wouldn't be a problem if R's functions declared their return types.)
```{r, eval = FALSE}
sapply(df, class)   # you need to know the details of df to predict the output
map_chr(df, class)  # a character vector the same length as df, or a clear error
```
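To see the difference in action, here's a small sketch using a data frame with a date-time column: its class is a vector of length two, so `sapply()` quietly falls back to returning a list, while `map_chr()` fails with an informative error.

```{r, error = TRUE}
df2 <- data.frame(x = 1:3, y = Sys.time() + 1:3)
sapply(df2, class)          # silently returns a list, because class(df2$y) has length 2
purrr::map_chr(df2, class)  # errors: each result must be a single string
```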
This doesn't make `sapply()` bad and `map_chr()` good. `sapply()` is nice because you can use it interactively without having to think about what `f` will return. 95% of the time it will do the right thing, and if it doesn't you can quickly fix it. `map_chr()` is more important when you're programming, because a clear error message is more valuable when an operation is buried deep inside a tree of function calls. At this point it's also worth thinking more about `[.data.frame`.
You'll learn more about this type of function, and about more predictable alternatives, in the purrr chapter.
Another kind of type-stability is illustrated by the dplyr verbs. `filter()`, `mutate()`, `summarise()`, etc. don't always return exactly the same type, but they always return something that behaves like a data frame and has the same type as their first argument.
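For example, the class of the output tracks the class of the input. A quick sketch:

```{r, eval = FALSE}
d1 <- data.frame(x = 1:3)
d2 <- dplyr::data_frame(x = 1:3)
class(dplyr::filter(d1, x > 1))  # a plain data frame in, a plain data frame out
class(dplyr::filter(d2, x > 1))  # a tbl_df in, a tbl_df out
```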
### Variable lookup
You've learned a number of functions that implement special lookup rules:
```{r, eval = FALSE}
ggplot(mpg, aes(displ, cty)) + geom_point()
filter(mpg, displ > 10)
```
This is so-called "non-standard evaluation", because the usual lookup rules don't apply. In both cases above, neither `displ` nor `cty` is present in the global environment. Instead, both ggplot2 and dplyr look for them first in a data frame. This is great for interactive use, but it can cause problems inside a function, because they'll fall back to the global environment if the variable isn't found in the data frame.
[Talk a little bit about the standard scoping rules]
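Roughly, R uses lexical scoping: a name is looked up first among a function's arguments and local variables, then in the environment where the function was defined, working outwards. A minimal sketch:

```{r}
y <- 100
f <- function(x) {
  # `x` is local; `y` isn't, so R looks in the environment where
  # f() was defined -- here, the global environment
  x + y
}
f(1)
```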
For example, take this function:
```{r}
big_x <- function(df, threshold) {
  dplyr::filter(df, x > threshold)
}
```
There are two ways in which this function can fail:
1. `df$x` might not exist. There are two potential failure modes:
```{r, error = TRUE}
big_x(mtcars, 10)  # errors: there's no `x` in mtcars or in any parent environment

x <- 1
big_x(mtcars, 10)  # silently returns no rows: `x` is now found in the global environment
```
The second failure mode is particularly pernicious because it doesn't
throw an error, but instead silently returns an incorrect result,
because it finds `x` in a parent environment. It's unlikely to happen,
but I think it's worth weighting heavily in your analysis of potential
failure modes, because it's a failure that will be extremely time
consuming to track down: you have to read a lot of context to spot it.
1. `df$threshold` might exist! There's only one potential failure mode
here, but again it's bad:
```{r}
df <- dplyr::data_frame(x = 1:10, threshold = 100)
big_x(df, 5)  # silently returns no rows: `threshold` is taken from `df`, not the argument
```
How can you avoid this problem? Currently, you need to do this:
```{r}
big_x <- function(df, threshold) {
  if (!"x" %in% names(df)) {
    stop("`df` must contain variable called `x`.", call. = FALSE)
  }
  if ("threshold" %in% names(df)) {
    stop("`df` must not contain variable called `threshold`.", call. = FALSE)
  }

  dplyr::filter(df, x > threshold)
}
```
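With those checks in place, both failure modes from above now fail immediately, with errors that point at the real problem:

```{r, error = TRUE}
big_x(mtcars, 10)                                       # errors: no `x` in `df`
big_x(dplyr::data_frame(x = 1:10, threshold = 100), 5)  # errors: `threshold` in `df`
```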
That's because dplyr currently has no way to force a name to be interpreted as either a local or a parent variable. (I've only just realised that this is really why I think you should avoid using NSE.) In a future version you should be able to do:
```{r, eval = FALSE}
big_x <- function(df, threshold) {
  # hypothetical syntax: not yet implemented in dplyr
  dplyr::filter(df, .this$x > .parent$threshold)
}
```
Another option is to implement it yourself using base subsetting:
```{r}
big_x <- function(df, threshold) {
  i <- df$x > threshold
  df[!is.na(i) & i, , drop = FALSE]
}
```
The challenge is remembering that `filter()` also drops missing values (hence the `!is.na(i)`), and that you need `drop = FALSE` or the function will return a vector if `df` only has one column.
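Both gotchas are easy to demonstrate with a small one-column data frame containing a missing value:

```{r}
df1 <- data.frame(x = c(1, NA, 10))
dplyr::filter(df1, x > 5)                       # the NA row is silently dropped
df1[df1$x > 5, ]                                # a vector, with an NA smuggled in
df1[!is.na(df1$x) & df1$x > 5, , drop = FALSE]  # still a data frame, no NA row
```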
### Purity
Functions are easiest to reason about if they have two properties:
1. Their output only depends on their inputs.
1. They don't affect the outside world except through their return value.
There are lots of important functions that aren't pure:
1. Random number generation.
1. I/O
1. Current time etc.
1. Plotting
But it makes sense to separate functions into those that are called primarily for their side-effects and those that are called primarily for their return value. In other words, if you see `f(x, y, z)` you know it's called for its side-effect, and if you see `a <- g(x, y, z)` you know it's called for its return value and is unlikely to affect the state of the world otherwise.
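For example (using `readr::write_csv()` purely as an illustration of a side-effect function):

```{r, eval = FALSE}
readr::write_csv(mtcars, "mtcars.csv")  # called for its side-effect: it writes a file
avg <- mean(mtcars$mpg)                 # called for its value: nothing outside changes
```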
The biggest breakers of this rule in base R are the functions that create data frames. Most of these functions have a `stringsAsFactors` argument that defaults to `getOption("stringsAsFactors")`. This means that a global option affects the operation of a very large number of functions, and you need to be aware that, depending on an external state, a function might produce either a character vector or a factor. In this book, we steer you away from that problem by recommending functions like `readr::read_csv()` and `dplyr::data_frame()` that don't rely on this option. But be aware of it! Generally, if a function is affected by a global option, you should avoid setting it.
Generally, if you want to use options in your own functions, I recommend using them for controlling default displays, not data types. For example, dplyr has some options that let you control the default number of rows and columns that are printed out. This is a good use of an option because it's something that people frequently want control over, but it doesn't affect the computation of a result, just its interactive display.
`options(digits)` is a similar case: it controls how many significant digits get printed, not how values are stored or computed.
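A quick check that it really only affects the display:

```{r}
y2 <- 1 / 3
options(digits = 3)
y2           # prints with 3 significant digits
y2 == 1 / 3  # but the underlying value is unchanged
options(digits = 7)  # restore the default
```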
#### Exercises
1. Look at the `encoding` argument to `file()`, `url()`, `gzfile()` etc.
What is the default value? Why should you avoid setting the default
value on a global level?
### Other things that can catch you out
```{r, error = TRUE}
df <- data.frame(abc = 10)
df$a  # partial matching: silently returns the `abc` column

df <- dplyr::data_frame(abc = 10)
df$a  # no partial matching: a warning or error instead of the wrong column
```