Integrate feedback from @jennybc

hadley 2016-10-03 16:08:44 -05:00
parent 6a4c1c9270
commit c8b586514b
7 changed files with 78 additions and 59 deletions

21
EDA.Rmd
View File

@ -197,6 +197,10 @@ ggplot(diamonds) +
This allows us to see that there are three unusual values: 0, ~30, and ~60. We pluck them out with dplyr:
```{r, include = FALSE}
old <- options(tibble.print_max = 10, tibble.print_min = 10)
```
```{r}
unusual <- diamonds %>%
filter(y < 3 | y > 20) %>%
@ -204,6 +208,10 @@ unusual <- diamonds %>%
unusual
```
```{r, include = FALSE}
options(old)
```
The `y` variable measures one of the three dimensions of these diamonds, in mm. We know that diamonds can't have a width of 0mm, so these values must be incorrect. We might also suspect that measurements of 32mm and 59mm are implausible: those diamonds are over an inch long, but don't cost hundreds of thousands of dollars!
It's good practice to repeat your analysis with and without the outliers. If they have minimal effect on the results, and you can't figure out why they're there, it's reasonable to replace them with missing values, and move on. However, if they have a substantial effect on your results, you shouldn't drop them without justification. You'll need to figure out what caused them (e.g. a data entry error) and disclose that you removed them in your write-up.
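A minimal sketch of the "replace them with missing values" option, assuming the same cutoffs that were used to build `unusual` above. The name `diamonds_na` is only illustrative; it does not appear elsewhere in the chapter.

```{r}
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Keep every row, but turn implausible y values into NA so that
# later summaries and plots can skip them explicitly.
diamonds_na <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))
```

Because the rows are kept, you can still rerun the analysis on `diamonds` and compare the results with and without the unusual values.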
@ -452,16 +460,17 @@ ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price))
```
Scatterplots become less useful as the size of your dataset grows, because points begin to overplot, and pile up into areas of uniform black (as above). This problem is similar to showing the distribution of price by cut using a scatterplot:
Scatterplots become less useful as the size of your dataset grows, because points begin to overplot, and pile up into areas of uniform black (as above).
You've already seen one way to fix the problem: using the `alpha` aesthetic to add transparency.
```{r, dev = "png"}
ggplot(data = diamonds) +
geom_point(mapping = aes(x = price, y = cut))
geom_point(mapping = aes(x = carat, y = price), alpha = 1 / 100)
```
And we can fix it in the same way: by using binning. Previously you used `geom_histogram()` and `geom_freqpoly()` to bin in one dimension. Now you'll learn how to use `geom_bin2d()` and `geom_hex()` to bin in two dimensions.
But using transparency can be challenging for very large datasets. Another solution is to use binning. Previously you used `geom_histogram()` and `geom_freqpoly()` to bin in one dimension. Now you'll learn how to use `geom_bin2d()` and `geom_hex()` to bin in two dimensions.
`geom_bin2d()` and `geom_hex()` divide the coordinate plane into two dimensional bins and then use a fill color to display how many points fall into each bin. `geom_bin2d()` creates rectangular bins. `geom_hex()` creates hexagonal bins. You will need to install the hexbin package to use `geom_hex()`.
`geom_bin2d()` and `geom_hex()` divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin. `geom_bin2d()` creates rectangular bins. `geom_hex()` creates hexagonal bins. You will need to install the hexbin package to use `geom_hex()`.
```{r, fig.asp = 1, out.width = "50%", fig.align = "default"}
ggplot(data = smaller) +
@ -539,7 +548,7 @@ ggplot(data = faithful) +
Patterns provide one of the most useful tools for data scientists because they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.
Models are a tool for extracting patterns out of data. For example, consider the diamonds data. It's hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related. It's possible to use a model to remove the very strong relationship between price and carat so we can explore the subtleties that remain:
Models are a tool for extracting patterns out of data. For example, consider the diamonds data. It's hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related. It's possible to use a model to remove the very strong relationship between price and carat so we can explore the subtleties that remain. The following code fits a model that predicts `price` from `carat` and then computes the residuals (the difference between the predicted value and the actual value). The residuals give us a view of the price of the diamond, once the effect of carat has been removed.
```{r, dev = "png"}
library(modelr)
@ -561,7 +570,7 @@ ggplot(data = diamonds2) +
geom_boxplot(mapping = aes(x = cut, y = resid))
```
We're saving modelling for later because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.
You'll learn how models, and the modelr package, work in the final part of the book, [model]. We're saving modelling for later because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.
## ggplot2 calls

View File

@ -105,7 +105,7 @@ A new major version of R comes out once a year, and there are 2-3 minor releases
### RStudio
RStudio is an integrated development environment, or IDE, for R programming. Download and install it from <http://www.rstudio.com/download>. RStudio is updated a couple of times a year. When a new version is available, RStudio will let you know. It's a good idea to upgrade regularly so you can take advantage of the latest and greatest features.
RStudio is an integrated development environment, or IDE, for R programming. Download and install it from <http://www.rstudio.com/download>. RStudio is updated a couple of times a year. When a new version is available, RStudio will let you know. It's a good idea to upgrade regularly so you can take advantage of the latest and greatest features. For this book, make sure you have RStudio 1.0.0.
When you start RStudio, you'll see two key regions in the interface:

View File

@ -25,12 +25,24 @@ flights
You might notice that this data frame prints a little differently from other data frames you might have used in the past: it only shows the first few rows and all the columns that fit on one screen. (To see the whole dataset, you can run `View(flights)` which will open the dataset in the RStudio viewer). It prints differently because it's a __tibble__. Tibbles are data frames, but slightly tweaked to work better in the tidyverse. For now, you don't need to worry about the differences; we'll come back to tibbles in more detail in [wrangle](#wrangle-intro).
You might also have noticed the row of three letter abbreviations under the column names. These describe the type of each variable:
You might also have noticed the row of three (or four) letter abbreviations under the column names. These describe the type of each variable:
* `int` stands for integers.
* `dbl` stands for doubles, or real numbers.
* `chr` stands for character vectors, or strings.
* `dttm` stands for date-times (a date + a time).
There are three other common types of variables that aren't used in this dataset but you'll encounter later in the book:
* `lgl` stands for logical, vectors that contain only `TRUE` or `FALSE`.
* `int` stands for integers.
* `dbl` stands for doubles, or real numbers.
* `chr` stands for character vectors, or strings.
* `fctr` stands for factors, which R uses to represent categorical variables
with fixed possible values.
* `date` stands for dates.
### Dplyr basics
@ -48,9 +60,9 @@ All verbs work similarly:
1. The first argument is a data frame.
1. The subsequent arguments describe what to do with the data frame.
You can refer to columns in the data frame directly without using `$`.
1. The subsequent arguments describe what to do with the data frame,
using the variable names (without quotes).
1. The result is a new data frame.
Together these properties make it easy to chain together multiple simple steps to achieve a complex result. Let's dive in and see how these verbs work.
@ -92,15 +104,13 @@ sqrt(2) ^ 2 == 2
1/49 * 49 == 1
```
Computers use finite precision arithmetic (they obviously can't store an infinite number of digits!) so remember that every number you see is an approximation. Instead of relying on `==`, use `dplyr::near()`:
Computers use finite precision arithmetic (they obviously can't store an infinite number of digits!) so remember that every number you see is an approximation. Instead of relying on `==`, use `near()`:
```{r}
near(sqrt(2) ^ 2, 2)
near(1 / 49 * 49, 1)
```
(Remember that we use `::` to be explicit about where a function lives. If dplyr is installed, `dplyr::near()` will always work. If you want to use the shorter `near()`, you need to make sure you have loaded dplyr with `library(dplyr)`.)
### Logical operators
Multiple arguments to `filter()` are combined with "and": every expression must be true in order for a row to be included in the output. For other types of combinations, you'll need to use Boolean operators yourself: `&` is "and", `|` is "or", and `!` is "not". Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations.
@ -117,6 +127,12 @@ filter(flights, month == 11 | month == 12)
The order of operations doesn't work like English. You can't write `filter(flights, month == (11 | 12))`, which you might literally translate into "finds all flights that departed in November or December". Instead it finds all months that equal `11 | 12`, an expression that evaluates to `TRUE`. In a numeric context (like here), `TRUE` becomes one, so this finds all flights in January, not November or December. This is quite confusing!
A useful short-hand for this problem is `x %in% y`. This will select every row where `x` is one of the values in `y`. We could use it to rewrite the code above:
```{r, eval = FALSE}
nov_dec <- filter(flights, month %in% c(11, 12))
```
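To see why the "or" shortcut goes wrong, here is a small base R sketch using a made-up vector `month` (an assumption for illustration, not the `flights` column). Note the parentheses: `==` binds more tightly than `|`, so the unparenthesised form is different again.

```{r}
month <- c(1, 11, 12)

11 | 12                # TRUE: both numbers are treated as logical TRUE
month == (11 | 12)     # TRUE FALSE FALSE: compares each month to TRUE, i.e. 1
month == 11 | 12       # TRUE TRUE TRUE: parsed as (month == 11) | 12
month %in% c(11, 12)   # FALSE TRUE TRUE: the test we actually wanted
```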
Sometimes you can simplify complicated subsetting by remembering De Morgan's law: `!(x & y)` is the same as `!x | !y`, and `!(x | y)` is the same as `!x & !y`. For example, if you wanted to find flights that weren't delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:
```{r, eval = FALSE}
@ -126,20 +142,6 @@ filter(flights, arr_delay <= 120, dep_delay <= 120)
As well as `&` and `|`, R also has `&&` and `||`. Don't use them here! You'll learn when you should use them in [conditional execution].
Sometimes you want to find all rows after the first `TRUE`, or all rows until the first `FALSE`. The window functions `cumany()` and `cumall()` allow you to find these values:
```{r}
df <- tibble(
x = c(FALSE, TRUE, FALSE),
y = c(TRUE, FALSE, TRUE)
)
filter(df, cumany(x)) # all rows after first TRUE
filter(df, cumall(y)) # all rows until first FALSE
```
(`tibble()` creates a dataset "by hand". You'll learn more about it in [tibbles].)
Whenever you start using complicated, multipart expressions in `filter()`, consider making them explicit variables instead. That makes it much easier to check your work. You'll learn how to create new variables shortly.
### Missing values
@ -702,18 +704,10 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
* Measures of position: `first(x)`, `nth(x, 2)`, `last(x)`. These work
similarly to `x[1]`, `x[2]`, and `x[length(x)]` but let you set a default
value if that position does not exist (i.e. you're trying to get the 3rd
element from a group that only has two elements).
These functions are complementary to filtering on ranks. Filtering gives
you all variables, with each observation in a separate row. Summarising
gives you one row per group, with multiple variables:
element from a group that only has two elements). For example, we can
find the first and last departure for each day:
```{r}
not_cancelled %>%
group_by(year, month, day) %>%
mutate(r = min_rank(desc(dep_time))) %>%
filter(r %in% range(r))
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
@ -721,6 +715,16 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
last_dep = last(dep_time)
)
```
These functions are complementary to filtering on ranks. Filtering gives
you all variables, with each observation in a separate row:
```{r}
not_cancelled %>%
group_by(year, month, day) %>%
mutate(r = min_rank(desc(dep_time))) %>%
filter(r %in% range(r))
```
* Counts: You've seen `n()`, which takes no arguments, and returns the
size of the current group. To count the number of non-missing values, use
@ -847,6 +851,7 @@ Grouping is most useful in conjunction with `summarise()`, but you can also do c
popular_dests <- flights %>%
group_by(dest) %>%
filter(n() > 365)
popular_dests
```
* Standardise to compute per group metrics:
@ -872,6 +877,9 @@ Functions that work most naturally in grouped mutates and filters are known as
1. What time of day should you fly if you want to avoid delays as much
as possible?
1. For each destination, compute the total minutes of delay. For each
flight, compute the proportion of the total delay for its destination.
1. Delays are typically temporally correlated: even once the problem that
caused the initial delay has been resolved, later flights are delayed
to allow earlier flights to leave. Using `lag()`, explore how the delay

View File

@ -48,7 +48,7 @@ The dataset contains observations collected by the EPA on 38 models of cars. Amo
To learn more about `mpg`, open its help page by running `?mpg`.
To plot `mpg`, open an R session and run the code below. The code plots the `mpg` data by putting `displ` on the x-axis and `hwy` on the y-axis:
To plot `mpg`, run this code to put `displ` on the x-axis and `hwy` on the y-axis:
```{r}
ggplot(data = mpg) +
@ -70,7 +70,7 @@ You complete your graph by adding one or more layers to `ggplot()`. The function
Each geom function in ggplot2 takes a `mapping` argument. This defines how variables in your dataset are mapped to visual properties. The `mapping` argument is always paired with `aes()`, and the `x` and `y` arguments of `aes()` specify which variables to map to the x and y axes. ggplot2 looks for the mapped variable in the `data` argument, in this case, `mpg`.
Let's turn this code into a reusable template for making graphs with ggplot2. To make a graph, replace the bracketed sections in the code below with a dataset, a geom function, or a set of mappings.
Let's turn this code into a reusable template for making graphs with ggplot2. To make a graph, replace the bracketed sections in the code below with a dataset, a geom function, or a collection of mappings.
```{r eval = FALSE}
ggplot(data = <DATA>) +
@ -83,6 +83,8 @@ The rest of this chapter will show you how to complete and extend this template
1. Run `ggplot(data = mpg)`. What do you see?
1. How many rows are in `mtcars`? How many columns?
1. What does the `drv` variable describe? Read the help for `?mpg` to find
out.
@ -128,7 +130,7 @@ ggplot(data = mpg) +
(If you prefer British English, like Hadley, you can use `colour` instead of `color`.)
To map an aesthetic to a variable, set the name of the aesthetic to the name of the variable inside `aes()`. ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as __scaling__. ggplot2 will also add a legend that explains which levels correspond to which values.
To map an aesthetic to a variable, associate the name of the aesthetic with the name of the variable inside `aes()`. ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as __scaling__. ggplot2 will also add a legend that explains which levels correspond to which values.
The colors reveal that many of the unusual points are two-seater cars. These cars don't seem like hybrids, and are, in fact, sports cars! Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have large engines.
@ -149,20 +151,20 @@ ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
```
What happened to the SUVs? ggplot2 will only use six shapes at a time. Additional groups will go unplotted when you use this aesthetic.
What happened to the SUVs? ggplot2 will only use six shapes at a time. By default, additional groups will go unplotted when you use this aesthetic.
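The words "by default" matter here. As a hedged sketch (not something the chapter itself covers at this point), you can supply your own shapes with ggplot2's `scale_shape_manual()`, giving one shape code per class so that no group is dropped. The variable `n_classes` is a helper introduced only for this example.

```{r}
library(ggplot2)

# One shape code per distinct class, so every group gets plotted.
n_classes <- length(unique(mpg$class))

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = seq_len(n_classes))
p
```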
For each aesthetic, you set the name of the aesthetic to the variable to display within the `aes()` function. The `aes()` function gathers together each of the aesthetic mappings used by a layer and passes them to the layer's mapping argument. The syntax highlights a useful insight about `x` and `y`: the x and y locations of a point are themselves aesthetics, visual properties that you can map to variables to display information about the data.
For each aesthetic, you use `aes()` to associate the name of the aesthetic with a variable to display. The `aes()` function gathers together each of the aesthetic mappings used by a layer and passes them to the layer's mapping argument. The syntax highlights a useful insight about `x` and `y`: the x and y locations of a point are themselves aesthetics, visual properties that you can map to variables to display information about the data.
Once you set an aesthetic, ggplot2 takes care of the rest. It selects a reasonable scale to use with the aesthetic, and it constructs a legend that explains the mapping between levels and values. For x and y aesthetics, ggplot2 does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts as a legend; it explains the mapping between locations and values.
Once you map an aesthetic, ggplot2 takes care of the rest. It selects a reasonable scale to use with the aesthetic, and it constructs a legend that explains the mapping between levels and values. For x and y aesthetics, ggplot2 does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts as a legend; it explains the mapping between locations and values.
You can also set the aesthetic properties of your geom manually. For example, we can make all of the points in our plot blue:
You can also _set_ the aesthetic properties of your geom manually. For example, we can make all of the points in our plot blue:
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
```
Here, the color doesn't convey information about a variable, but only changes the appearance of the plot. To set an aesthetic manually, set the aesthetic by name as an argument of your geom function. You'll need to pick a value that makes sense for that aesthetic:
Here, the color doesn't convey information about a variable, but only changes the appearance of the plot. To set an aesthetic manually, set the aesthetic by name as an argument of your geom function; i.e. it goes _outside_ of `aes()`. You'll need to pick a value that makes sense for that aesthetic:
* The name of a color as a character string.
* The size of a point in mm.
@ -206,14 +208,13 @@ Note that there are some seeming duplicates: 0, 15, and 22 are all squares. The
these aesthetics behave differently for categorical vs. continuous
variables?
1. What happens if you map the same variable across multiple aesthetics?
What happens if you map different variables across multiple aesthetics?
1. What happens if you map the same variable to multiple aesthetics?
1. What does the `stroke` aesthetic do? What shapes does it work with?
(Hint: use `?geom_point`)
1. What happens if you set an aesthetic to something other than a variable
name, like `displ < 5`?
1. What happens if you map an aesthetic to something other than a variable
name, like `aes(colour = displ < 5)`?
## Common problems
@ -416,8 +417,9 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth(se = FALSE)
```
1. What does `show.legend = FALSE` do? What happens if you remove it?
Why do you think I used it in the example above.
Why do you think I used it earlier in the chapter?
1. What does the `se` argument to `geom_smooth()` do?

View File

@ -79,7 +79,7 @@ There's an implied contract between you and R: it will do the tedious computatio
R has a large collection of built-in functions that are called like this:
```{r eval = FALSE}
functionName(arg1 = val1, arg2 = val2, ...)
function_name(arg1 = val1, arg2 = val2, ...)
```
Let's try using `seq()` which makes regular **seq**uences of numbers and, while we're at it, learn more helpful features of RStudio. Type `se` and hit TAB. A popup shows you possible completions. Specify `seq()` by typing more (a "q") to disambiguate, or by using ↑/↓ arrows to select. Notice the floating tooltip that pops up, reminding you of the function's arguments and purpose. If you want more help, press F1 to get all the details in the help tab in the lower right pane.

View File

@ -65,7 +65,7 @@ knitr::include_graphics("screenshots/rstudio-project-2.png")
knitr::include_graphics("screenshots/rstudio-project-3.png")
```
Call your project `r4ds`.
Call your project `r4ds` and think carefully about which _subdirectory_ you put the project in. If you don't store it somewhere sensible, it will be hard to find it in the future!
Once this process is complete, you'll get a new RStudio project just for this book. Check that the "home" directory of your project is the current working directory:
@ -84,7 +84,7 @@ library(readr)
ggplot(diamonds, aes(carat, price)) +
geom_hex()
ggsave("diamonds-hex.pdf")
ggsave("diamonds.pdf")
write_csv(diamonds, "diamonds.csv")
```

View File

@ -26,7 +26,7 @@ not_cancelled %>%
Instead of running expression-by-expression, you can also execute the complete script in one step: Cmd/Ctrl + Shift + S. Doing this regularly is a great way to check that you've captured all the important parts of your code in the script.
I recommend that you always start your script with the packages that you need. That way, if you share your code with others, they can easily see what packages they need to install.
I recommend that you always start your script with the packages that you need. That way, if you share your code with others, they can easily see what packages they need to install. Note, however, that you should never include `install.packages()` or `setwd()` in a script that you share. It's very antisocial to change settings on someone else's computer!
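As a rough sketch of that advice, the top of a shared script might look like the following. The two packages are only examples, and the commented-out lines show the kind of calls to leave out.

```{r}
# Packages this script needs: readers can see at a glance what to install.
library(ggplot2)
library(dplyr)

# Deliberately NOT included in a shared script:
# install.packages("dplyr")        # would install software on the reader's machine
# setwd("/path/on/my/computer")    # would change the reader's working directory
```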
When working through future chapters, I highly recommend starting in the editor and practicing your keyboard shortcuts. Over time, sending code to the console in this way will become so natural that you won't even think about it.