More functions polishing

hadley 2016-03-04 09:30:16 -06:00
parent 443b38b834
commit 6f5f443734
1 changed files with 88 additions and 49 deletions

@ -74,19 +74,21 @@ rescale01 <- function(x) {
rescale01(c(0, 5, 10))
```
There are three key steps to making a function:
There are three key steps to creating a new function:
1. You need to pick a __name__ for the function. Here I've used `rescale01`
because this function rescales a vector to lie between 0 and 1.
1. You list the inputs, or __arguments__, to the function inside `function`.
Here we have just one argument. If we had more the call would look like
`function(x, y, z)`.
1. You place the __body__ of the function inside a `{` block immediately
following `function`.
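As a minimal sketch, here are the three parts labelled in one place. The body is an assumption on my part, since only the first line of `rescale01()` is visible here; it is one plausible way to rescale a vector to lie between 0 and 1.

```r
# name: rescale01; argument: x; body: everything inside the { block
rescale01 <- function(x) {
  # Assumed body (not shown in the text above): shift and scale by the range
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01(c(0, 5, 10))
#> [1] 0.0 0.5 1.0
```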
Note the process that I followed here: I only made the function after I'd figured out how to make it work with a simple input. It's much easier to start with working code and turn it into a function as opposed to creating a function and then trying to make it work.
Note the overall process: I only made the function after I'd figured out how to make it work with a simple input. It's easier to start with working code and turn it into a function; it's harder to create a function and then try to make it work.
Now that we have `rescale01()` we can use that to simplify our original example:
Now that we have `rescale01()` we can use that to simplify the original example:
```{r}
df$a <- rescale01(df$a)
@ -95,7 +97,7 @@ df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
```
Compared to our original code, this is easier to understand and we've eliminated one class of copy-and-paste errors. There's still quite a bit of duplication since we're doing the same thing to multiple columns. You'll learn how to eliminate that duplication in the next chapter, Iteration.
Compared to the original, this code is easier to understand. We've also eliminated one class of copy-and-paste errors. There is, however, still quite a bit of duplication since we're doing the same thing to multiple columns. You'll learn how to eliminate that duplication in the next chapter.
### Practice
@ -143,7 +145,7 @@ Compared to our original code, this is easier to understand and we've eliminated
It's important to remember that functions are not just for the computer, but are also for humans. R doesn't care what your function is called, or what comments it contains, but these are important for human readers. This section discusses some things that you should bear in mind when writing functions that humans can understand.
The name of a function is surprisingly important. Ideally the name of your function will be short, but clearly evoke what the function does. However, it's hard to come up with concise names, and autocomplete makes it easy to type long names, so it's better to err on the side of clear descriptions, rather than short names. There are a few exceptions to this rule: the handful of very common, very short names. It's worth memorising these:
The name of a function is important. Ideally the name of your function will be short, but clearly evoke what the function does. However, it's hard to come up with concise names, and autocomplete makes it easy to type long names, so it's better to err on the side of clear descriptions, rather than short names. There are a few exceptions to this rule: the handful of very common, very short names. It's worth memorising these:
* `x`, `y`, `z`: vectors.
* `df`: a data frame.
@ -153,7 +155,7 @@ The name of a function is surprisingly important. Ideally the name of your funct
Generally, function names should be verbs, and arguments should be nouns. There are some exceptions: nouns are ok if the function computes a very well known noun (e.g. `mean()` is better than `compute_mean()`), or accesses some property of an object (e.g. `coef()` is better than `get_coefficients()`). A good sign that a noun might be a better choice is if you're using a very broad verb like get, or compute, or calculate, or determine. Use your best judgement and don't be afraid to rename a function if you later figure out a better name.
To make it easy to type function names, I strongly recommend using only lowercase, and separating multiple words with underscores (so called "snake\_case"). Camel case is a legitimate alternative, but be consistent: pick either snake\_case or camelCase for your code, don't mix them. R itself is not very consistent, but there's nothing you can do about that. Make sure you don't fall into the same trap by making your code as consistent as possible.
If your function name is composed of multiple words, I recommend using "snake\_case", where each word is lower case and separated by an underscore. camelCase is a popular alternative, but be consistent: pick one or the other and stick with it. R itself is not very consistent, but there's nothing you can do about that. Make sure you don't fall into the same trap by making your code as consistent as possible.
If you have a family of functions that do similar things, make sure they have consistent names and arguments. Use a common prefix to indicate that they are connected. That's better than a common suffix because autocomplete allows you to type the prefix and see all the members of the family.
@ -178,14 +180,14 @@ c <- 10
mean <- function(x) sum(x)
```
Use comments, `#`, to explain the "why" of your code. You generally should avoid comments that explain the "what" or the "how". If you can't understand what the code does from reading, you should think about how to rewrite it to be more clear. Do you need to add some intermediate variables with useful names? Do you need to break out a subcomponent of a large function so you can describe it with a name? However, your code can never capture the reasoning behind your decisions: why do you choose this approach instead of an alternative? It's a great idea to capture that sort of thinking in a comment so that when you come back to your analysis in the future, you can jog your memory about the why.
Use comments, lines starting with `#`, to explain the "why" of your code. You generally should avoid comments that explain the "what" or the "how". If you can't understand what the code does from reading it, you should think about how to rewrite it to be more clear. Do you need to add some intermediate variables with useful names? Do you need to break out a subcomponent of a large function so you can name it? However, your code can never capture the reasoning behind your decisions: why did you choose this approach instead of an alternative? What else did you try that didn't work? It's a great idea to capture that sort of thinking in a comment.
Another important use of comments is to break up your file into easily readable chunks. Use long lines of `-` and `=` to make it easy to spot the breaks. RStudio even provides a keyboard shortcut to add this: Cmd/Ctrl + Shift + R.
Another important use of comments is to break up your file into easily readable chunks. Use long lines of `-` and `=` to make it easy to spot the breaks. RStudio even provides a keyboard shortcut for this: Cmd/Ctrl + Shift + R.
```{r, eval = FALSE}
# Load data ---------------------------
# Load data --------------------------------------
# Plot data ---------------------------
# Plot data --------------------------------------
```
### Exercises
@ -198,7 +200,7 @@ Another important use of comments is to break up your file into easily readable
substr(string, 1, nchar(prefix)) == prefix
}
f2 <- function(x) {
if (length(x) <= 1L) return(NULL)
if (length(x) <= 1) return(NULL)
x[-length(x)]
}
f3 <- function(x, y) {
@ -209,6 +211,12 @@ Another important use of comments is to break up your file into easily readable
1. Take a function that you've written recently and spend 5 minutes
brainstorming a better name for it and its arguments.
1. Compare and contrast `rnorm()` and `mvrnorm()`. How could you make
them more consistent?
1. Make a case for why `normr()`, `normd()` etc would be better than
`rnorm()`, `dnorm()`. Make a case for the opposite.
## Conditional execution
An `if` statement allows you to conditionally execute code. It looks like this:
@ -236,11 +244,32 @@ has_name <- function(x) {
}
```
Squiggly brackets are always optional (both here and in function definitions), but I recommend using them because it makes it easier to see the hierarchy in your code. An opening curly brace should never go on its own line and should always be followed by a new line. A closing curly brace should always go on its own line, unless it's followed by `else`. Always indent the code inside curly braces.
This takes advantage of the standard rules of function return values: a function returns the last value that was computed. Here it will be one of the two if branches.
### Conditions
The `condition` should be either a single `TRUE` or a single `FALSE`. If it's a vector you'll get a warning message (an error as of R 4.2.0); if it's an `NA`, you'll get an error. Watch out for these messages in your own code:
```{r, error = TRUE}
if (c(TRUE, FALSE)) {
}
if (NA) {
}
```
You can use `||` (or) and `&&` (and) to combine multiple logical expressions. These operators are "short-circuiting": as soon as `||` sees the first `TRUE` it returns `TRUE` without computing anything else. As soon as `&&` sees the first `FALSE` it returns `FALSE`.
You should never use `|` or `&` in an `if` statement: these are vectorised operations that apply to multiple values. If you do have a logical vector, you can use `any()` or `all()` to collapse it to a single value.
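Here's a small sketch of both ideas: short-circuiting, and collapsing a logical vector with `any()` before it reaches `if`. The vector `x` and the variable `status` are made up for illustration.

```r
x <- c(1, 5, NA)

# `||` short-circuits: the right-hand side is never evaluated here,
# so the call to stop() does not raise an error
TRUE || stop("not evaluated")

# is.na(x) is a logical vector of length 3; any() collapses it
# to a single TRUE or FALSE that is safe to use in `if`
if (any(is.na(x))) {
  status <- "has missing values"
} else {
  status <- "complete"
}
status
```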
### If styles
Squiggly brackets are always optional (for both `if` and `function`), but I recommend using them because it makes it easier to see the hierarchy in your code. An opening curly brace should never go on its own line and should always be followed by a new line. A closing curly brace should always go on its own line, unless it's followed by `else`. Always indent the code inside curly braces.
```{r, eval = FALSE}
# Good
if (y < 0 && debug) {
message("Y is negative")
}
@ -252,7 +281,6 @@ if (y == 0) {
}
# Bad
if (y < 0 && debug)
message("Y is negative")
@ -264,20 +292,14 @@ else {
}
```
If `condition` isn't a single `TRUE` or `FALSE` you'll get a warning or error.
You can use `||` (or) and `&&` (and) to combine multiple logical expressions. These operators are "short-circuiting": as soon as `||` sees the first `TRUE` it returns `TRUE` without computing anything else. As soon as `&&` sees the first `FALSE` it returns `FALSE`.
Chaining multiple else ifs together.
Like a function, an `if` statement "returns" the last expression it evaluated. This means you can assign the result of an `if` statement to a variable:
It's ok to drop the curly braces if you have a very short `if` statement that can fit on one line:
```{r}
y <- 10
x <- if (y < 20) "Too low" else "Too high"
```
I recommend doing this only if the if statement is very short, otherwise it's easier to read:
I recommend this only for very brief `if` statements. Otherwise, the full form is easier to read:
```{r}
if (y < 20) {
@ -287,26 +309,21 @@ if (y < 20) {
}
```
(Note there's a built in function that does this for you: `cut()`. The above call is the same as `cut(y, c(-Inf, 20, Inf), c("Too low", "Too high"))`. It returns a factor, and generalises better for large numbers of splits)
### Multiple conditions
This allows you to write compact functions:
You can chain multiple if statements together:
```{r}
greeting <- function(time = lubridate::now()) {
hour <- lubridate::hour(time)
if (hour < 12) {
"Good morning"
} else if (hour < 18) {
"Good afternoon"
} else {
"Good evening"
}
```{r, eval = FALSE}
if (this) {
# do that
} else if (that) {
# do something else
} else {
#
}
greeting()
```
Another useful technique is the `switch()` function. It allows you to evaluate selected code based on position or name.
If you find that you have a very long series of chained `if` statements, you should consider rewriting. One useful technique is the `switch()` function. It allows you to evaluate selected code based on position or name.
```{r}
function(x, y, op) {
@ -320,13 +337,38 @@ function(x, y, op) {
}
```
Neither `if` nor `switch` is vectorised: they work with a single value at a time.
Another useful function that can often eliminate long chains of `if` statements is `cut()`. It's used to discretise continuous variables.
Note that neither `if` nor `switch()` is vectorised: they work with a single value at a time.
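To illustrate, here is a rough sketch of `cut()` applied to a whole vector at once, something a single `if` can't do. The values in `y` and the name `level` are invented for the example.

```r
y <- c(5, 25, 12)

# The breaks define two intervals, (-Inf, 20] and (20, Inf];
# the labels name them in order
level <- cut(y, breaks = c(-Inf, 20, Inf), labels = c("Too low", "Too high"))

# cut() works on every element and returns a factor
as.character(level)
#> [1] "Too low"  "Too high" "Too low"
```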
### Exercises
1. What's the difference between `if` and `ifelse()`? Carefully read the help
and construct three examples that illustrate the key differences.
1. Write a greeting function that says "good morning", "good afternoon",
or "good evening", depending on the time of day. (Hint: use a time
argument that defaults to `lubridate::now()`. That will make it
easier to test your function.)
1. How could you use `cut()` to simplify this set of nested if-else statements?
```{r, eval = FALSE}
if (temp <= 0) {
"freezing"
} else if (temp <= 10) {
"cold"
} else if (temp <= 20) {
"cool"
} else if (temp <= 30) {
"warm"
} else {
"hot"
}
```
How would you change the call to `cut()` if I'd used `<` instead of `<=`?
1. What happens if you use `switch()` with numeric values?
1. What does this `switch()` call do?
@ -342,8 +384,6 @@ Neither `if` not `switch` are vectorised: they work with a single value at a tim
## Function arguments
Note that arguments in R are lazily evaluated: they're not computed until they're needed. That means if they're never used, they're never called. This is an important property of R the programming language, but is unlikely to be important to you for a while. You can read more about lazy evaluation at <http://adv-r.had.co.nz/Functions.html#lazy-evaluation>
The arguments to a function typically fall into two broad sets: one set supplies the data to compute on, and the other supplies arguments that control the details of the computation. For example:
* In `log()`, the data is `x`, and the detail is the `base` of the logarithm.
@ -423,6 +463,10 @@ sum(x, na.mr = TRUE)
If you just want to get the values of the `...`, use `list(...)`.
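For instance, a minimal sketch with a hypothetical helper, `count_dots()`, that captures whatever was passed through `...` and counts it:

```r
# Hypothetical helper (not part of any package): capture the values
# passed via `...` as a list and report how many there were
count_dots <- function(...) {
  args <- list(...)
  length(args)
}

count_dots(1, "a", TRUE)
#> [1] 3
```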
### Lazy evaluation
Arguments in R are lazily evaluated: they're not computed until they're needed. That means if they're never used, they're never called. This is an important property of R as a programming language, but is generally not important for data analysis. You can read more about lazy evaluation at <http://adv-r.had.co.nz/Functions.html#lazy-evaluation>
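A quick way to see this for yourself is with a made-up function, here called `first_only()`, that ignores its second argument:

```r
# `y` is never used in the body, so the expression supplied for it
# is never evaluated
first_only <- function(x, y) {
  x
}

# No error is raised: stop() is never called
first_only(1, stop("this is never evaluated"))
#> [1] 1
```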
### Exercises
1. What does `commas(letters, collapse = "-")` do? Why?
@ -437,13 +481,9 @@ If you just want to get the values of the `...`, use `list(...)`.
`c("pearson", "kendall", "spearman")`. What does that mean? What
value is used by default?
## Body
## Return values
The body of the function does the actual work. The value returned by the function is the last statement it evaluates. Unlike other languages all statements in R return a value.
### Return values
You can explicitly return early from a function with `return()`. I think it's best to save the use of `return()` to signal that you can return early with a simpler solution. For example, you might write an if statement like this:
The value returned by the function is the value of the last statement it evaluates. You can explicitly return early from a function with `return()`. I think it's best to save the use of `return()` to signal that you can return early with a simpler solution. For example, you might write an if statement like this:
```{r, eval = FALSE}
f <- function() {
@ -484,10 +524,9 @@ f <- function() {
This tends to make the code easier to understand, because you don't need quite so much context to understand it.
### Invisible values
Some functions return "invisible" values. These are not printed out by default but can be saved to a variable:
Some functions return "invisible" values. These are not printed by default but can be saved to a variable:
```{r}
f <- function() {
@ -506,7 +545,7 @@ You can also force printing by surrounding the call in parentheses:
(f())
```
Invisible values are mostly used when your function is called primarily for its side-effects (e.g. printing, plotting, or saving a file). It's nice to be able to pipe such functions together, so returning the main input value is useful. This allows you to do things like:
Invisible values are mostly used when your function is called primarily for its side-effects (e.g. printing, plotting, or saving a file). It's nice to be able to pipe such functions together, so it's good practice to invisibly return the first argument. This allows you to do things like:
```{r, eval = FALSE}
library(readr)