Merge branch 'master' of github.com:hadley/r4ds

# Conflicts:
#	EDA.Rmd
This commit is contained in:
hadley 2016-07-31 11:41:30 -05:00
commit c4168bbd37
3 changed files with 41 additions and 37 deletions

EDA.Rmd

@@ -101,7 +101,7 @@ A variable is **continuous** if it can take any of an infinite set of ordered value
```{r}
ggplot(data = diamonds) +
geom_histogram(aes(x = carat), binwidth = 0.5)
geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
```
You can compute this by hand by combining `dplyr::count()` and `ggplot2::cut_width()`:
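That by-hand computation can be sketched like this (assuming dplyr and ggplot2 are attached, as elsewhere in the chapter):

```{r}
# Count diamonds falling into 0.5-carat bins, without drawing a histogram
diamonds %>%
  count(cut_width(carat, 0.5))
```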
@@ -183,14 +183,14 @@ Outliers are observations that are unusual; data points that don't seem to fit t
```{r}
ggplot(diamonds) +
geom_histogram(aes(x = y), binwidth = 0.5)
geom_histogram(mapping = aes(x = y), binwidth = 0.5)
```
There are so many observations in the common bins that the rare bins are so short that you can't see them (although maybe if you stare intently at 0 you'll spot something). To make it easy to see the unusual values, we need to zoom in to small values of the y-axis with `coord_cartesian()`:
```{r}
ggplot(diamonds) +
geom_histogram(aes(x = y), binwidth = 0.5) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
coord_cartesian(ylim = c(0, 50))
```
@@ -217,13 +217,13 @@ It's good practice to repeat your analysis with and without the outliers. If the
might decide which dimension is the length, width, and depth.
1. Explore the distribution of `price`. Do you discover anything unusual
or surprising? (Hint: carefully think about the `binwidth` and make sure
you)
or surprising? (Hint: Carefully think about the `binwidth` and make sure
you.)
1. How many diamonds are 0.99 carat? How many are 1 carat? What
do you think is the cause of the difference?
1. Compare and contrast `coord_cartesian()` vs `xlim()`/`ylim()` when
1. Compare and contrast `coord_cartesian()` vs `xlim()` or `ylim()` when
zooming in on a histogram. What happens if you leave `binwidth` unset?
What happens if you try and zoom so only half a bar shows?
@@ -255,7 +255,7 @@ If you've encountered unusual values in your dataset, and simply want to move on
`ifelse()` has three arguments. The first argument `test` should be a logical vector. The result will contain the value of the second argument, `yes`, when `test` is `TRUE`, and the value of the third argument, `no`, when it is false.
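A minimal sketch of those three arguments:

```{r}
x <- c(1, 2, 10, 20)
# test = x > 5; where TRUE we get "big" (yes), where FALSE we get "small" (no)
ifelse(x > 5, "big", "small")
```

This returns `"small"` for the first two values and `"big"` for the last two.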
Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing. It's not obvious where you should plot missing values, so ggplot2 doesn't include them in the plot, but does warn that they're been removed:
Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing. It's not obvious where you should plot missing values, so ggplot2 doesn't include them in the plot, but it does warn that they've been removed:
```{r, dev = "png"}
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
@@ -298,25 +298,25 @@ If variation describes the behavior _within_ a variable, covariation describes t
### A categorical and continuous variable {#cat-cont}
It's common to want to explore the distribution of a continuous variable broken down by a categorical, as in the previous frequency polygon. The default appearance of `geom_freqpoly()` is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it's hard to see the differences in shape. For example, lets explore how the price of a diamond varies with its quality:
It's common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous frequency polygon. The default appearance of `geom_freqpoly()` is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it's hard to see the differences in shape. For example, let's explore how the price of a diamond varies with its quality:
```{r}
ggplot(data = diamonds, mapping = aes(x = price)) +
geom_freqpoly(aes(colour = cut), binwidth = 500)
geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
```
It's hard to see the difference in distribution because the overall counts differ so much:
```{r, out.width = "50%", fig.width = 4}
ggplot(diamonds, aes(cut)) +
geom_bar()
ggplot(diamonds) +
geom_bar(mapping = aes(x = cut))
```
To make the comparison easier we need to swap what is displayed on the y-axis. Instead of display count, we'll display __density__, which is the count standardised so that the area under each frequency polygon is one.
To make the comparison easier we need to swap what is displayed on the y-axis. Instead of displaying count, we'll display __density__, which is the count standardised so that the area under each frequency polygon is one.
```{r}
ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) +
geom_freqpoly(aes(colour = cut), binwidth = 500)
geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
```
There's something rather surprising about this plot - it appears that fair diamonds (the lowest quality) have the highest average price! But maybe that's because frequency polygons are a little hard to interpret - there's a lot going on in this plot.
@@ -350,7 +350,7 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
We see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot). It supports the counterintuitive finding that better quality diamonds are cheaper on average! In the exercises, you'll be challenged to figure out why.
`cut` is an ordered factor: fair is worse than good, which is worse than very good and so on. Many categorical variables don't have an intrinsic order, so you might want to reorder them to make an more informative display. One way to do that is with the `reorder()` function.
`cut` is an ordered factor: fair is worse than good, which is worse than very good and so on. Many categorical variables don't have such an intrinsic order, so you might want to reorder them to make a more informative display. One way to do that is with the `reorder()` function.
For example, take the `class` variable in the `mpg` dataset. You might be interested to know how highway mileage varies across classes:
@@ -363,14 +363,14 @@ To make the trend easier to see, we can reorder `class` based on the median valu
```{r fig.height = 3}
ggplot(data = mpg) +
geom_boxplot(aes(x = reorder(class, hwy, FUN = median), y = hwy))
geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))
```
If you have long variable names, `geom_boxplot()` will work better if you flip it 90°. You can do that with `coord_flip()`.
```{r}
ggplot(data = mpg) +
geom_boxplot(aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
coord_flip()
```
@@ -379,8 +379,6 @@ ggplot(data = mpg) +
1. Use what you've learned to improve the visualisation of the departure times
of cancelled vs. non-cancelled flights.
1. What variable in the diamonds dataset is most important for predicting
the price of a diamond? How is that variable correlated with cut?
Why does the combination of those two relationships lead to lower quality
@@ -429,7 +427,13 @@ Then visualise with `geom_tile()` and the fill aesthetic:
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(aes(fill = n))
geom_tile(mapping = aes(fill = n))
```
If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the d3heatmap or heatmaply packages, which create interactive plots.
@@ -452,14 +456,14 @@ You've already seen one great way to visualise the covariation between two conti
```{r, dev = "png"}
ggplot(data = diamonds) +
geom_point(aes(x = carat, y = price))
geom_point(mapping = aes(x = carat, y = price))
```
Scatterplots become less useful as the size of your dataset grows, because points begin to overplot, and pile up into areas of uniform black (as above). This problem is similar to showing the distribution of price by cut using a scatterplot:
```{r, dev = "png"}
ggplot(data = diamonds, mapping = aes(x = price, y = cut)) +
geom_point()
ggplot(data = diamonds) +
geom_point(mapping = aes(x = price, y = cut))
```
And we can fix it in the same way: by using binning. Previously you used `geom_histogram()` and `geom_freqpoly()` to bin in one dimension. Now you'll learn how to use `geom_bin2d()` and `geom_hex()` to bin in two dimensions.
@@ -468,18 +472,18 @@ And we can fix it in the same way: by using binning. Previously you used `geom_h
```{r, fig.asp = 1, out.width = "50%", fig.align = "default"}
ggplot(data = smaller) +
geom_bin2d(aes(x = carat, y = price))
geom_bin2d(mapping = aes(x = carat, y = price))
# install.packages("hexbin")
ggplot(data = smaller) +
geom_hex(aes(x = carat, y = price))
geom_hex(mapping = aes(x = carat, y = price))
```
Another option is to bin one continuous variable so it acts like a categorical variable. Then you can use one of the techniques for visualising the combination of a categorical and a continuous variable that you learned about. For example, you could bin `carat` and then for each group, display a boxplot:
```{r}
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(aes(group = cut_width(carat, 0.1)))
geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
```
`cut_width(x, width)`, as used above, divides `x` into bins of width `width`. By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it's difficult to tell that each boxplot summarises a different number of points. One way to show that is to make the width of the boxplot proportional to the number of points with `varwidth = TRUE`.
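That `varwidth = TRUE` variant can be sketched as follows (assuming the `smaller` dataset defined earlier in the chapter):

```{r}
# Boxplot widths now reflect how many diamonds fall in each carat bin
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)), varwidth = TRUE)
```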
@@ -488,7 +492,7 @@ Another approach is to display approximately the same number of points in each b
```{r}
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(aes(group = cut_number(carat, 20)))
geom_boxplot(mapping = aes(group = cut_number(carat, 20)))
```
#### Exercises
@@ -513,7 +517,7 @@ ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
```{r, dev = "png"}
ggplot(data = diamonds) +
geom_point(aes(x = x, y = y)) +
geom_point(mapping = aes(x = x, y = y)) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
```
@@ -537,7 +541,7 @@ A scatterplot of Old Faithful eruption lengths versus the wait time between erup
```{r fig.height = 2}
ggplot(data = faithful) +
geom_point(aes(x = eruptions, y = waiting))
geom_point(mapping = aes(x = eruptions, y = waiting))
```
Patterns provide one of the most useful tools for data scientists because they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.
@@ -553,15 +557,15 @@ diamonds2 <- diamonds %>%
add_residuals(mod) %>%
mutate(resid = exp(resid))
ggplot(data = diamonds2, mapping = aes(x = carat, y = resid)) +
geom_point()
ggplot(data = diamonds2) +
geom_point(mapping = aes(x = carat, y = resid))
```
Once you've removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.
```{r}
ggplot(data = diamonds2, mapping = aes(x = cut, y = resid)) +
geom_boxplot()
ggplot(data = diamonds2) +
geom_boxplot(mapping = aes(x = cut, y = resid))
```
We're saving modelling for later because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.


@@ -175,7 +175,7 @@ Spreading is the opposite of gathering. You use it when an observation is scatte
table2
```
To tidy this up, we first analysis the representation in similar way to `gather()`. This time, however, we only need two parameters:
To tidy this up, we first analyse the representation in a similar way to `gather()`. This time, however, we only need two parameters:
* The column that contains variable names, the `key` column. Here, it's
`type`.
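Putting the parameters together, the spread step might look like this (the second parameter, the `value` column, is `count` in `table2`):

```{r}
# Variable names come from `type`; cell values come from `count`
table2 %>%
  spread(key = type, value = count)
```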
@@ -385,7 +385,7 @@ stocks %>%
`complete()` takes a set of columns, and finds all unique combinations. It then ensures the original dataset contains all those values, filling in explicit `NA`s where necessary.
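A minimal sketch of `complete()` on a small hand-made tibble (hypothetical data, not the chapter's `stocks` example):

```{r}
df <- tibble::tibble(
  year  = c(2015, 2015, 2016),
  qtr   = c(1, 2, 2),
  sales = c(10, 20, 30)
)
# All four year/qtr combinations appear in the output; the absent
# (2016, 1) row is filled in with an explicit NA for sales
df %>% tidyr::complete(year, qtr)
```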
There's one other important tool that you should know for working with missing values. Sometimes when a data source has primarily been used for data entry, missing values indicate the the previous value should be carried forward:
There's one other important tool that you should know for working with missing values. Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:
```{r}
treatment <- frame_data(
@@ -412,7 +412,7 @@ treatment %>%
## Case Study
To finish off the chapter, let's pull together everything you've learned to tackle a realistic data tidying problem. The `tidyr::who` dataset contains reporter tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the *2014 World Health Organization Global Tuberculosis Report*, available for download at <www.who.int/tb/country/data/download/en/>.
To finish off the chapter, let's pull together everything you've learned to tackle a realistic data tidying problem. The `tidyr::who` dataset contains reported tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the *2014 World Health Organization Global Tuberculosis Report*, available for download at <http://www.who.int/tb/country/data/download/en/>.
There's a wealth of epidemiological information in this dataset, but it's challenging to work with the data in the form that it's provided:


@@ -24,7 +24,7 @@ To explore the basic data manipulation verbs of dplyr, we'll use `nycflights13::
flights
```
You might notice that this data frame prints little differently to other data frames you might have used in the past: it only shows the first few rows and all the columns that fit on one screen. (To see the whole dataset, you can run `View(flights)` which will open the dataset in the RStudio viewer). It prints differently because it's a __tibble__. Tibbles are data frames, but slightly tweaked to work better in the tidyverse. For now, you don't need to worry about the differences; we'll come back to tibbles in more detail in [wrangle](#wrangle-intro).
You might notice that this data frame prints a little differently from other data frames you might have used in the past: it only shows the first few rows and all the columns that fit on one screen. (To see the whole dataset, you can run `View(flights)` which will open the dataset in the RStudio viewer). It prints differently because it's a __tibble__. Tibbles are data frames, but slightly tweaked to work better in the tidyverse. For now, you don't need to worry about the differences; we'll come back to tibbles in more detail in [wrangle](#wrangle-intro).
You might also have noticed the row of three-letter abbreviations under the column names. These describe the type of each variable:
@@ -420,7 +420,7 @@ There are many functions for creating new variables that you can use with `mutat
(e.g. 1st, 2nd, 2nd, 4th). The default gives the smallest values the
smallest ranks; use `desc(x)` to give the largest values the smallest ranks.
If `min_rank()` doesn't do what you need, look at the variants
`row_number()`, `dense_rank()`, `cume_dist()`, `percent_rank()`,
`row_number()`, `dense_rank()`, `percent_rank()`, `cume_dist()`,
`ntile()`.
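The variants are easiest to tell apart side by side; a quick sketch with a tie and a missing value (assuming dplyr is attached):

```{r}
y <- c(1, 2, 2, NA, 4)
min_rank(y)    # ties share the lowest rank, then a gap: 1 2 2 NA 4
dense_rank(y)  # ties share a rank, no gap afterwards: 1 2 2 NA 3
row_number(y)  # ties broken by position: 1 2 3 NA 4
```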
```{r}
@@ -475,7 +475,7 @@ The last key verb is `summarise()`. It collapses a data frame to a single row:
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
```
(we'll come back to what that `na.rm = TRUE` means very shortly.)
(We'll come back to what that `na.rm = TRUE` means very shortly.)
`summarise()` is not terribly useful unless we pair it with `group_by()`. This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame they'll be automatically applied "by group". For example, if we apply exactly the same code to a data frame grouped by date, we get the average delay per date:
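A sketch of that grouped version, assuming `nycflights13::flights` is loaded as above:

```{r}
# Same summary as before, but the unit of analysis is now one day
flights %>%
  group_by(year, month, day) %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE))
```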