Update functions.Rmd

typos
Radu Grosu 2016-02-12 11:54:28 +00:00
parent a95709fa5f
commit 27e70f2234
1 changed files with 20 additions and 20 deletions


@@ -17,9 +17,9 @@ diamonds <- ggplot2::diamonds
Code is a tool of communication, not just to the computer, but to other people. This is important because every project you undertake is fundamentally collaborative. Even if you're not working with other people, you'll definitely be working with future-you. You want to write clear code so that future-you doesn't curse present-you when you look at a project again after several months have passed.
-To me, improving your communication skills is a key part of mastering R as a programming language. Over time, you want your code to becomes more and more clear, and easier to write. In this chapter, you'll learn three important skills that help you to move in this direction:
+To me, improving your communication skills is a key part of mastering R as a programming language. Over time, you want your code to become more and more clear, and easier to write. In this chapter, you'll learn three important skills that help you move in this direction:
-1. We'll dive deep in to the __pipe__, `%>%`, talking more about how it works
+1. We'll dive deep into the __pipe__, `%>%`, talking more about how it works
and how it gives you a new tool for rewriting your code. You'll also learn
about when not to use the pipe!
@@ -34,7 +34,7 @@ To me, improving your communication skills is a key part of mastering R as a pro
common patterns of for loops and put them in a function. We'll come back to
that idea in XYZ.
-Removing duplication is an important part of expressing yourself clearly because it lets the reader (i.e. future-you!) focus on what's different between operations rather than what's the same. The goal is not just to write better funtions or to do things that you couldn't do before, but to code with more "ease". As you internalise the ideas in this chapter, you should find it easier to re-tackle problems that you've solved in the past with much effort.
+Removing duplication is an important part of expressing yourself clearly because it lets the reader (i.e. future-you!) focus on what's different between operations rather than what's the same. The goal is not just to write better functions or to do things that you couldn't do before, but to code with more "ease". As you internalise the ideas in this chapter, you should find it easier to re-tackle problems that you've solved in the past with much effort.
Writing code is similar in many ways to writing prose. One parallel which I find particularly useful is that in both cases rewriting is key to clarity. The first expression of your ideas is unlikely to be particularly clear, and you may need to rewrite multiple times. After solving a data analysis challenge, it's often worth looking at your code and thinking about whether or not it's obvious what you've done. If you spend a little time rewriting your code while the ideas are fresh, you can save a lot of time later trying to recreate what your code did. But this doesn't mean you should rewrite every function: you need to balance what you need to achieve now with saving time in the long run. (But the more you rewrite your functions the more likely you'll first attempt will be clear.)
@@ -51,7 +51,7 @@ To explore how you can write the same code in many different ways, let's use cod
> Scooping up the field mice
> And bopping them on the head
-We'll start by defining an object to represent litte bunny Foo Foo:
+We'll start by defining an object to represent little bunny Foo Foo:
```{r, eval = FALSE}
foo_foo <- little_bunny()
@@ -95,7 +95,7 @@ object_size(diamonds, diamonds2)
* `diamonds2` takes up 3.89 MB,
* `diamonds` and `diamonds2` together take up 3.89 MB!
-How can that work? Well, `diamonds2` has 10 columns in common with `diamonds`: there's no need to duplicate all that data so both data frames share the vectors. R will only create a copy of a vector if you modify it. Modifying a single value will mean that the data frames can no longer share as much memory. The individual sizes will be unchange, but the collective size will increase:
+How can that work? Well, `diamonds2` has 10 columns in common with `diamonds`: there's no need to duplicate all that data so both data frames share the vectors. R will only create a copy of a vector if you modify it. Modifying a single value will mean that the data frames can no longer share as much memory. The individual sizes will be unchanged, but the collective size will increase:
```{r}
diamonds$carat[1] <- NA
@@ -121,7 +121,7 @@ This is less typing (and less thinking), so you're less likely to make mistakes.
1. It will make debugging painful: if you make a mistake you'll need to start
again from scratch.
-1. The reptition of the object being transformed (we've written `foo_foo` six
+1. The repetition of the object being transformed (we've written `foo_foo` six
times!) obscures what's changing on each line.
#### Function composition
@@ -205,7 +205,7 @@ library(magrittr)
cor(disp, mpg)
```
-* For assignment. magrittr provides the `%<>%` operator which allows you to
+* For assignment magrittr provides the `%<>%` operator which allows you to
replace code like:
```R
@@ -219,7 +219,7 @@ library(magrittr)
```
I'm not a fan of this operator because I think assignment is such a
-special operation that it should always be clear when it's occuring.
+special operation that it should always be clear when it's occurring.
In my opinion, a little bit of duplication (i.e. repeating the
name of the object twice), is fine in return for making assignment
more explicit.
@@ -237,19 +237,19 @@ The pipe is a powerful tool, but it's not the only tool at your disposal, and it
* You have multiple inputs or outputs. If there is not one primary object
being transformed, write code the regular ways.
-* You are start to think about a directed graph with a complex
+* You are starting to think about a directed graph with a complex
dependency structure. Pipes are fundamentally linear and expressing
complex relationships with them typically does not yield clear code.
### Pipes in production
-When you run a pipe interactively, it's easy to see if something goes wrong. When you start writing pipes that are used in production, i.e. they're run automatically and a human doesn't immediately look at the output it's a really good idea to include some assertions that verify the data looks like expect. One great way to do this is the ensurer package, writen by Stefan Milton Bache (the author of magrittr).
+When you run a pipe interactively, it's easy to see if something goes wrong. When you start writing pipes that are used in production, i.e. they're run automatically and a human doesn't immediately look at the output it's a really good idea to include some assertions that verify the data looks like expected. One great way to do this is the ensurer package, written by Stefan Milton Bache (the author of magrittr).
<http://www.r-statistics.com/2014/11/the-ensurer-package-validation-inside-pipes/>
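The ensurer API itself is not shown in this excerpt, so the following is only a minimal sketch of the same idea built on base R's `stopifnot()`. It is not part of the changed file; the helper `check_positive_price()` and the toy data frame `sales` are hypothetical.

```r
library(magrittr)

# Hypothetical assertion step: stops with an error if the data looks wrong,
# otherwise returns the data unchanged so the pipe can continue.
check_positive_price <- function(df) {
  stopifnot(
    is.data.frame(df),
    "price" %in% names(df),
    all(df$price > 0)
  )
  df
}

# Toy data standing in for a production input
sales <- data.frame(item = c("a", "b", "c"), price = c(10, 25, 5))

# The check sits in the middle of the pipe, before any further processing
n_expensive <- sales %>%
  check_positive_price() %>%
  subset(price > 5) %>%
  nrow()
```

Because the check returns its input, it can be dropped into an existing pipe without changing the steps around it; when an assertion fails, the pipe stops at that step instead of passing bad data along.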
## Functions
-One of the best ways to grow in your capabilities as a user of R for data science is to write functions. Functions allow you to automate common tasks, instead of using copy-and-paste. Writing good functions is a lifetime journey: you won't learn everything but you'll hopefully get start walking in the right direction.
+One of the best ways to grow in your capabilities as a user of R for data science is to write functions. Functions allow you to automate common tasks, instead of using copy-and-paste. Writing good functions is a lifetime journey: you won't learn everything but you'll hopefully get to start walking in the right direction.
Whenever you've copied and pasted code more than twice, you need to take a look at it and see if you can extract out the common components and make a function. For example, take a look at this code. What does it do?
@@ -344,7 +344,7 @@ foo <- function(x = 1, y = TRUE, z = 10:1) {
}
```
-Default values can depend on other arguments but don't over use this technique as it's possible to create code that is very difficult to understand:
+Default values can depend on other arguments but don't overuse this technique as it's possible to create code that is very difficult to understand:
```{r}
bar <- function(x = y + 1, y = x + 1) {
@@ -352,7 +352,7 @@ bar <- function(x = y + 1, y = x + 1) {
}
```
-On other aspect of arguments you'll commonly see is `...`. This captures any other arguments not otherwise matched. It's useful because you can then send those `...` on to another argument. This is a useful catch all if your function primarily wraps another function. For example, you might have written your own wrapper designed to add linear model lines to a ggplot:
+On other aspect of arguments you'll commonly see is `...`. This captures any other arguments not otherwise matched. It's useful because you can then send those `...` on to another argument. This is a useful catch-all if your function primarily wraps another function. For example, you might have written your own wrapper designed to add linear model lines to a ggplot:
```{r}
geom_lm <- function(formula = y ~ x, colour = alpha("steelblue", 0.5),
@@ -362,7 +362,7 @@ geom_lm <- function(formula = y ~ x, colour = alpha("steelblue", 0.5),
}
```
-This allows you to use any other arguments of `geom_smooth()`, even thoses that aren't explicitly listed in your wrapper (and even arguments that don't exist yet in the version of ggplot2 that you're using).
+This allows you to use any other arguments of `geom_smooth()`, even those that aren't explicitly listed in your wrapper (and even arguments that don't exist yet in the version of ggplot2 that you're using).
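To illustrate `...` forwarding without depending on ggplot2, here is a minimal sketch that is not part of the changed file. The wrapper `mean_rounded()` and its `digits` argument are hypothetical; only base R's `mean()` and `round()` are used.

```r
# Hypothetical wrapper: rounds the result of mean(), forwarding any
# extra arguments (e.g. na.rm, trim) to mean() via `...`.
mean_rounded <- function(x, digits = 1, ...) {
  round(mean(x, ...), digits)
}

x <- c(1.25, 2.5, NA, 4)

without_na_rm <- mean_rounded(x)                # NA: na.rm was not supplied
with_na_rm    <- mean_rounded(x, na.rm = TRUE)  # na.rm is passed through to mean()
```

The wrapper never names `na.rm`, yet callers can still use it, which is the same property the `geom_smooth()` wrapper relies on.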
Note that arguments in R are lazily evaluated: they're not computed until they're needed. That means if they're never used, they're never called:
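The chapter's own example falls outside this hunk. A minimal sketch of the behaviour, using a hypothetical function `g()` that is not taken from the chapter:

```r
# Hypothetical function: `y` is never used in the body,
# so the expression supplied for it is never evaluated.
g <- function(x, y) {
  x * 2
}

# stop() would raise an error if it were evaluated; here it is not,
# because g() never touches `y`.
result <- g(10, stop("this is never evaluated"))
```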
@@ -493,9 +493,9 @@ f(10)
You should avoid functions that work like this because it makes it harder to predict what your function will return.
-This behaviour seems like a recipe for bugs, but by and large it doesn't cause too many, especially as you become a more experienced R programmer. The advantage of this behaviour is from a language stand point it allows R to be very consistent. Every name is looked up using the same set of rules. For `f()` that includes the behaviour of two things that you might not expect: `{` and `+`.
+This behaviour seems like a recipe for bugs, but by and large it doesn't cause too many, especially as you become a more experienced R programmer. The advantage of this behaviour is that from a language standpoint it allows R to be very consistent. Every name is looked up using the same set of rules. For `f()` that includes the behaviour of two things that you might not expect: `{` and `+`.
-This consistent set of rules allows for a number of powerful tool that are unfortunately beyond the scope of this book, but you can read about in "Advanced R".
+This consistent set of rules allows for a number of powerful tools that are unfortunately beyond the scope of this book, but you can read about in "Advanced R".
#### Exercises
@@ -577,9 +577,9 @@ mean_by <- function(data, group_var, mean_var, n = 10) {
}
```
-Because this tells dplyr to group by `group_var` and compute the mean of `mean_var` neither of which exist in the data frame. A similar problem exists in ggplot2.
+This fails because it tells dplyr to group by `group_var` and compute the mean of `mean_var` neither of which exist in the data frame. A similar problem exists in ggplot2.
-I've only really recently understood this problem well, so the solutions are currently rather complicated and beyond the scope of this book. You can learn them online techniques with online resources:
+I've only really recently understood this problem well, so the solutions are currently rather complicated and beyond the scope of this book. You can learn about these techniques online:
* Programming with ggplot2 (an excerpt from the ggplot2 book):
http://rpubs.com/hadley/97970
@@ -649,7 +649,7 @@ df$d <- rescale01(df$d)
In this case the output is already present: we're modifying an existing object.
-Need to think about a data frame as a list of column (we'll make this definition precise later on). The length of a data frame is the number of columns. To extract a single column, you use `[[`.
+Think about a data frame as a list of columns (we'll make this definition precise later on). The length of a data frame is the number of columns. To extract a single column, you use `[[`.
That makes our for loop quite simple:
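The loop itself falls outside this hunk. A sketch of what such a loop could look like, assuming `rescale01()` rescales a numeric vector to the range [0, 1] (its definition is not shown in this excerpt) and using a toy `df` in place of the chapter's data:

```r
# Assumed definition: rescale a numeric vector to lie between 0 and 1
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Toy data frame; the chapter's own `df` is not shown in this excerpt
df <- data.frame(
  a = c(1, 2, 3),
  b = c(10, 20, 30),
  c = c(5, 5.5, 6),
  d = c(0, 50, 100)
)

# A data frame is a list of columns, so seq_along(df) indexes the columns
# and `[[` extracts or replaces one column at a time.
for (i in seq_along(df)) {
  df[[i]] <- rescale01(df[[i]])
}
```

Assigning back into `df[[i]]` modifies the existing object, so no separate output container is needed.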
@@ -678,7 +678,7 @@ There are three basic ways to loop over a vector:
but it's difficult to save the output efficiently.
1. Loop over the numeric indices: `for (i in seq_along(xs))`. Most common
-form if you want to know the element (`xs[[i]]`) and it's position.
+form if you want to know the element (`xs[[i]]`) and its position.
1. Loop over the names: `for (nm in names(xs))`. Gives you both the name
and the position. This is useful if you want to use the name in a