Brain dump of expressing yourself.

This commit is contained in:
hadley 2015-10-19 08:41:33 -05:00
parent f2a8c6bbf7
commit e8d5b0e479
3 changed files with 231 additions and 1 deletions

View File

@ -22,7 +22,7 @@ install:
# Install R packages
- ./travis-tool.sh r_binary_install knitr png
- ./travis-tool.sh r_install ggplot2 dplyr tidyr
- ./travis-tool.sh r_install ggplot2 dplyr tidyr pryr
- ./travis-tool.sh github_package hadley/bookdown garrettgman/DSR hadley/readr
script: jekyll build

View File

@ -8,6 +8,7 @@
<li><a href="dates.html">Dates and times</a></li>
-->
<li><a href="tidy.html">Tidy</a></li>
<li><a href="expressing-yourself.html">Expressing yourself</a></li>
<li><a href="import.html">Import</a></li>
<!--
<li><a href="models.html">Model</a></li>

229
expressing-yourself.Rmd Normal file
View File

@ -0,0 +1,229 @@
---
layout: default
title: Expressing yourself
output: bookdown::html_chapter
---
# Expressing yourself in code
Code is a means of communication, not just to the computer, but to other people. This is important because every project you undertake is fundamentally collaborative, and even if you're not working with other people you'll definitely be working with future-you.
After solving a data analysis challenge, it's often worth looking at your code and thinking about whether or not it's obvious what you've done. If you spend a little time rewriting your code while the ideas are fresh, you can save a lot of time later trying to recreate what your code did.
To me, this is what mastering R as a programming language is all about: making it easier to express yourself, so that over time your becomes more and more clear, and easier to write. In this chapter, you'll learn some of the most important skills, but to learn more you need to study R as a programming language, not just an interactive environment for data science. We have written two books that will help you do so:
* [Hands on programming with R](http://shop.oreilly.com/product/0636920028574.do),
by Garrett Grolemund. This is an introduction to R as a programming language
and is a great place to start if R is your first programming language.
* [Advanced R](http://adv-r.had.co.nz) by Hadley Wickham. This dives into the
details of R the programming language. This is a great place to start if
you've programmed in other languages and you want to learn what makes R
special, different, and particularly well suited to data analysis.
You get better very slowly if you don't consciously practice, so this chapter brings together a number of ideas that we mention elsewhere into one focussed chapter on code as communication.
```{r}
library(magrittr)
```
## Piping
```R
foo_foo <- little_bunny()
```
There are a number of ways that you could write this:
1. Function composition:
```R
bop_on(
scoop_up(
hop_through(foo_foo, forest),
field_mouse
),
head
)
```
The disadvantage is that you have to read from inside-out, from
right-to-left, and that the arguments end up spread far apart
(sometimes called the
[dagwood sandwhich](https://en.wikipedia.org/wiki/Dagwood_sandwich)
problem).
1. Intermediate state:
```R
foo_foo_1 <- hop_through(foo_foo, forest)
foo_foo_2 <- scoop_up(foo_foo_1, field_mouse)
foo_foo_3 <- bop_on(foo_foo_2, head)
```
This avoids the nesting, but you now have to name each intermediate element.
If there are natural names, use this form. But if you're just numbering
them, I don't think it's that useful. Whenever I write code like this,
I invariably write the wrong number somewhere and then spend 10 minutes
scratching my head and trying to figure out what went wrong with my code.
You may also worry that this form creates many intermediate copies of your
data and takes up a lot of memory. First, in R, I don't think worrying about
memory is a useful way to spend your time: worry about it when it becomes
a problem (i.e. you run out of memory), not before. Second, R isn't stupid:
it will reuse the shared columns in a pipeline of data frame transformations.
You can see that using `pryr::object_size()` (unfortunately the built-in
`object.size()` doesn't have quite enough smarts to show you this super
important feature of R):
```{R}
diamonds <- ggplot2::diamonds
pryr::object_size(diamonds)
diamonds2 <- dplyr::mutate(diamonds, price_per_carat = price / carat)
pryr::object_size(diamonds2)
pryr::object_size(diamonds, diamonds2)
```
`diamonds` is 3.46 MB, and `diamonds2` is 3.89 MB, but the total size of
`diamonds` and `diamonds2` is only 3.89 MB. How does that work?
only 3.89 MB
1. Overwrite the original:
```R
foo_foo <- hop_through(foo_foo, forest)
foo_foo <- scoop_up(foo_foo, field_mouse)
foo_foo <- bop_on(foo_foo, head)
```
This is a minor variation of the previous form, where instead of giving
each intermediate element its own name, you use the same name, replacing
the previous value at each step. This is less typing (and less thinking),
so you're less likely to make mistakes. However, it can make debugging
painful, because if you make a mistake you'll need to start from
scratch again. Also, I think the reptition of the object being transformed
(here we've repeated `foo_foo` six times!) obscures the intent of the code.
1. Use the pipe
```R
foo_foo %>%
hop_through(forest) %>%
scoop_up(field_mouse) %>%
bop_on(head)
```
This is my favourite form. The downside is that you need to understand
what the pipe does, but once you've mastered that simple task, you can
read this series of function compositions like it's a set of imperative
actions.
## Useful intermediates
* Whenever you write your own function that is used primarily for its
side-effects, you should always return the first argument invisibly, e.g.
`invisible(x)`: that way it can easily be used in a pipe.
If a function doesn't follow this contract (e.g. `plot()` which returns
`NULL`), you can still use it with magrittr by using the "tee" operator.
`%T>%` works like `%>%` except instead it returns the LHS instead of the
RHS:
```{r}
library(magrittr)
rnorm(100) %>%
matrix(ncol = 2) %>%
plot() %>%
str()
rnorm(100) %>%
matrix(ncol = 2) %T>%
plot() %>%
str()
```
* When you run a pipe interactively, it's easy to see if something
goes wrong. When you start writing pipes that are used in production, i.e.
they're run automatically and a human doesn't immediately look at the output
it's a really good idea to include some assertions that verify the data
looks like expect. One great way to do this is the ensurer package,
writen by Stefan Milton Bache (the author of magrittr).
<http://www.r-statistics.com/2014/11/the-ensurer-package-validation-inside-pipes/>
* If you're working with functions that don't have a dataframe based API
(i.e. you pass them individual vectors, not a data frame and expressions
to be evaluated in the context of that data frame), you might find `%$%`
useful. It "explodes" out the variables in a data frame so that you can
refer to them explicitly. This is useful when working with many functions
in base R:
```{r}
mtcars %$%
cor(disp, mpg)
```
## When not to use the pipe
The pipe is a powerful tool, but it's not the only tool at your disposal, and it doesn't solve every problem! Generally, you should reach for another tool when:
* Your pipes get longer than five or six lines. It's a good idea to create
intermediate objects with meaningful names. That helps with debugging,
because it's easier to figure out when things went wrong. It also helps
understand the problem, because a good name can be very evocative of the
purpose.
* You have multiple inputs or outputs.
* Instead of creating a linear pipeline where you're primarily transforming
one object, you're starting to create a directed graphs with a complex
dependency structure. Pipes are fundamentally linear and expressing
complex relationships with them does not often yield clear code.
* For assignment. magrittr provides the `%<>%` operator which allows you to
replace code like:
```R
mtcars <- mtcars %>% transform(cyl = cyl * 2)
```
with
```R
mtcars %<>% transform(cyl = cyl * 2)
```
I'm not a fan of this operator because I think assignment is such a
special operation that it should always be clear when it's occuring.
In my opinion, a little bit of duplication (i.e. repeating the
name of the object twice), is fine in return for making assignment
more explicit.
I think it also gives you a better mental model of how assignment works
in R. The above code does not modify `mtcars`: it instead creates a
modified copy and then replaces the old version.
## Duplication
A rule of thumb: whenever you copy and paste something more than twice (i.e. so you now have three copies), you should consider making a function instead. For example:
```R
df$x %>% abs() %>% sqrt() %>% mean()
df$y %>% abs() %>% sqrt() %>% mean()
df$z %>% abs() %>% sqrt() %>% mean()
```
If you've never written a function before, or just want to quickly remove duplication in code that uses magrittr, you can take advantage of a cool feature: if the first argument in the pipeline is `.`, you get a new function, rather than a specific transformation.
```R
my_f <- . %>% abs() %>% sqrt() %>% mean()
df$x %>% my_f()
df$y %>% my_f()
df$z %>% my_f()
```
As you become a better R programming, you'll learn more techniques for reducing various types of duplication. This allows you to do more with less, and allows you to express yourself more clearly by taking advantage of powerful programming constructs.