Typos in transform & EDA (#209)

* little different to -> a little differently from

* reorder to match order in following code/table

* capitalization

* capitalization + punctuation

* replace / with or since it's hard to see between tt formatted code

* missing pronoun

* they're been -> they've been

* lets needs apostrophe

* Instead of display -> Instead of displaying

* missing comma

* add mapping in front of aes in a bunch of locations

* adding mapping before aes for sections before the last section where it explicitly says from here on out we'll omit them to make calls simpler
Mine Cetinkaya-Rundel 2016-07-29 23:05:08 -04:00 committed by Hadley Wickham
parent 8f087e8ce3
commit fe73722b0a
2 changed files with 33 additions and 35 deletions

EDA.Rmd

@ -100,7 +100,7 @@ A variable is **continuous** if it can take any of an infinite set of ordered value
```{r}
ggplot(data = diamonds) +
geom_histogram(aes(x = carat), binwidth = 0.5)
geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
```
You can compute this by hand by combining `dplyr::count()` and `ggplot2::cut_width()`:
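A minimal sketch of that by-hand computation, assuming the dplyr and ggplot2 packages are installed (the name `carat_counts` is only illustrative): `cut_width()` assigns each carat value to a bin of width 0.5, and `count()` tallies the rows in each bin.

```r
library(dplyr)
library(ggplot2)

# Bin carat into intervals of width 0.5, then count the diamonds per bin.
# The counts should correspond to the bar heights of a histogram drawn
# with binwidth = 0.5.
carat_counts <- diamonds %>%
  count(cut_width(carat, 0.5))

carat_counts
```

Each row of the result is one bin, with the bin's interval in the first column and the number of diamonds in `n`.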
@ -178,14 +178,14 @@ Outliers are observations that are unusual; data points that don't seem to fit t
```{r}
ggplot(diamonds) +
geom_histogram(aes(x = y), binwidth = 0.5)
geom_histogram(mapping = aes(x = y), binwidth = 0.5)
```
There are so many observations in the common bins that the rare bins are too short to see (although maybe if you stare intently at 0 you'll spot something). To make it easy to see the unusual values, we need to zoom in to small values of the y-axis with `coord_cartesian()`:
```{r}
ggplot(diamonds) +
geom_histogram(aes(x = y), binwidth = 0.5) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
coord_cartesian(ylim = c(0, 50))
```
@ -211,13 +211,13 @@ When you discover an outlier, it's a good idea to trace it back as far as possib
might decide which dimension is the length, width, and depth.
1. Explore the distribution of `price`. Do you discover anything unusual
or surprising? (Hint: carefully think about the `binwidth` and make sure
you)
or surprising? (Hint: Carefully think about the `binwidth` and make sure
you.)
1. How many diamonds are 0.99 carat? How many are 1 carat? What
do you think is the cause of the difference?
1. Compare and contrast `coord_cartesian()` vs `xlim()`/`ylim()` when
1. Compare and contrast `coord_cartesian()` vs `xlim()` or `ylim()` when
zooming in on a histogram. What happens if you leave `binwidth` unset?
What happens if you try and zoom so only half a bar shows?
@ -248,7 +248,7 @@ If you've encountered unusual values in your dataset, and simply want to move on
`ifelse()` has three arguments. The first argument `test` should be a logical vector. The result will contain the value of the second argument, `yes`, when `test` is `TRUE`, and the value of the third argument, `no`, when it is false.
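A small illustration of that behaviour, using a made-up vector `y` and arbitrary cut-offs of 3 and 20 (both are assumptions for this sketch, not values taken from the text above):

```r
# Hypothetical measurements; the first and last are treated as unusual.
y <- c(0, 3.5, 5.2, 58.9)

# Where the test is TRUE the result is NA (the `yes` argument);
# elsewhere the original value is kept (the `no` argument).
cleaned <- ifelse(y < 3 | y > 20, NA, y)

cleaned
```

The result has the same length as `y`, so it can be used inside `mutate()` to replace a column in place.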
Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing. It's not obvious where you should plot missing values, so ggplot2 doesn't include them in the plot, but does warn that they're been removed:
Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing. It's not obvious where you should plot missing values, so ggplot2 doesn't include them in the plot, but it does warn that they've been removed:
```{r, dev = "png"}
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
@ -291,25 +291,25 @@ If variation describes the behavior _within_ a variable, covariation describes t
### A categorical and continuous variable
It's common to want to explore the distribution of a continuous variable broken down by a categorical, as in the previous frequency polygon. The default appearance of `geom_freqpoly()` is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it's hard to see the differences in shape. For example, lets explore how the price of a diamond varies with its quality:
It's common to want to explore the distribution of a continuous variable broken down by a categorical, as in the previous frequency polygon. The default appearance of `geom_freqpoly()` is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it's hard to see the differences in shape. For example, let's explore how the price of a diamond varies with its quality:
```{r}
ggplot(data = diamonds, mapping = aes(x = price)) +
geom_freqpoly(aes(colour = cut), binwidth = 500)
geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
```
It's hard to see the difference in distribution because the overall counts differ so much:
```{r, out.width = "50%", fig.width = 4}
ggplot(diamonds, aes(cut)) +
geom_bar()
ggplot(diamonds) +
geom_bar(mapping = aes(x = cut))
```
To make the comparison easier we need to swap what is displayed on the y-axis. Instead of display count, we'll display __density__, which is the count standardised so that the area under each frequency polygon is one.
To make the comparison easier we need to swap what is displayed on the y-axis. Instead of displaying count, we'll display __density__, which is the count standardised so that the area under each frequency polygon is one.
```{r}
ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) +
geom_freqpoly(aes(colour = cut), binwidth = 500)
geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
```
There's something rather surprising about this plot - it appears that fair diamonds (the lowest quality) have the highest average price! But maybe that's because frequency polygons are a little hard to interpret - there's a lot going on in this plot.
@ -343,7 +343,7 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
We see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot). It supports the counterintuitive finding that better quality diamonds are cheaper on average! In the exercises, you'll be challenged to figure out why.
`cut` is an ordered factor: fair is worse than good, which is worse than very good and so on. Most factors are unordered, so it's fair game to reorder to display the results better. For example, take the `class` variable in the `mpg` dataset. You might be interested to know how highway mileage varies across classes:
`cut` is an ordered factor: fair is worse than good, which is worse than very good, and so on. Most factors are unordered, so it's fair game to reorder to display the results better. For example, take the `class` variable in the `mpg` dataset. You might be interested to know how highway mileage varies across classes:
```{r}
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
@ -354,14 +354,14 @@ Covariation will appear as a systematic change in the medians or IQRs of the box
```{r fig.height = 3}
ggplot(data = mpg) +
geom_boxplot(aes(x = reorder(class, hwy, FUN = median), y = hwy))
geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))
```
If you have long variable names, `geom_boxplot()` will work better if you flip it 90°. You can do that with `coord_flip()`.
```{r}
ggplot(data = mpg) +
geom_boxplot(aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
coord_flip()
```
@ -370,8 +370,6 @@ ggplot(data = mpg) +
1. Use what you've learned to improve the visualisation of the departure times
of cancelled vs. non-cancelled flights.
1. What variable in the diamonds dataset is most important for predicting
the price of a diamond? How is that variable correlated with cut?
Why does the combination of those two relationships lead to lower quality
@ -419,7 +417,7 @@ Then visualise with `geom_tile()` and the fill aesthetic:
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(aes(fill = n))
geom_tile(mapping = aes(fill = n))
```
If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the d3heatmap or heatmaply packages, which create interactive plots.
@ -442,14 +440,14 @@ You've already seen one great way to visualise the covariation between two conti
```{r, dev = "png"}
ggplot(data = diamonds) +
geom_point(aes(x = carat, y = price))
geom_point(mapping = aes(x = carat, y = price))
```
Scatterplots become less useful as the size of your dataset grows, because points begin to overplot, and pile up into areas of uniform black (as above). This problem is similar to showing the distribution of price by cut using a scatterplot:
```{r, dev = "png"}
ggplot(data = diamonds, mapping = aes(x = price, y = cut)) +
geom_point()
ggplot(data = diamonds) +
geom_point(mapping = aes(x = price, y = cut))
```
And we can fix it in the same way: by using binning. Previously you used `geom_histogram()` and `geom_freqpoly()` to bin in one dimension. Now you'll learn how to use `geom_bin2d()` and `geom_hex()` to bin in two dimensions.
@ -458,18 +456,18 @@ And we can fix it in the same way: by using binning. Previously you used `geom_h
```{r, fig.asp = 1, out.width = "50%", fig.align = "default"}
ggplot(data = smaller) +
geom_bin2d(aes(x = carat, y = price))
geom_bin2d(mapping = aes(x = carat, y = price))
# install.packages("hexbin")
ggplot(data = smaller) +
geom_hex(aes(x = carat, y = price))
geom_hex(mapping = aes(x = carat, y = price))
```
Another option is to bin one continuous variable so it acts like a categorical variable. Then you can use one of the techniques for visualising the combination of a discrete and a continuous variable that you learned about. For example, you could bin `carat` and then for each group, display a boxplot:
```{r}
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(aes(group = cut_width(carat, 0.1)))
geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
```
`cut_width(x, width)`, as used above, divides `x` into bins of width `width`. By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it's difficult to tell that each boxplot summarises a different number of points. One way to show that is to make the width of the boxplot proportional to the number of points with `varwidth = TRUE`.
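A sketch of the `varwidth = TRUE` variant follows. It assumes the dplyr and ggplot2 packages are installed. Because `smaller` is not defined in the lines shown here, the sketch takes it to be the diamonds under 3 carats; that definition and the name `varwidth_plot` are assumptions.

```r
library(dplyr)
library(ggplot2)

# Assumption: `smaller` is the subset of diamonds below 3 carats.
smaller <- diamonds %>%
  filter(carat < 3)

# Same binned boxplot as above, but each box's width now reflects
# how many observations fall in its bin.
varwidth_plot <- ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)), varwidth = TRUE)

varwidth_plot
```

Narrow boxes then flag bins with few diamonds, where the summary is less reliable.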
@ -478,7 +476,7 @@ Another approach is to display approximately the same number of points in each b
```{r}
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(aes(group = cut_number(carat, 20)))
geom_boxplot(mapping = aes(group = cut_number(carat, 20)))
```
#### Exercises
@ -503,7 +501,7 @@ ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
```{r, dev = "png"}
ggplot(data = diamonds) +
geom_point(aes(x = x, y = y)) +
geom_point(mapping = aes(x = x, y = y)) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
```
@ -527,7 +525,7 @@ A scatterplot of Old Faithful eruption lengths versus the wait time between erup
```{r fig.height = 2}
ggplot(data = faithful) +
geom_point(aes(x = eruptions, y = waiting))
geom_point(mapping = aes(x = eruptions, y = waiting))
```
Patterns provide one of the most useful tools for data scientists because they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.
@ -543,15 +541,15 @@ diamonds2 <- diamonds %>%
add_residuals(mod) %>%
mutate(resid = exp(resid))
ggplot(data = diamonds2, mapping = aes(x = carat, y = resid)) +
geom_point()
ggplot(data = diamonds2) +
geom_point(mapping = aes(x = carat, y = resid))
```
Once you've removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.
```{r}
ggplot(data = diamonds2, mapping = aes(x = cut, y = resid)) +
geom_boxplot()
ggplot(data = diamonds2) +
geom_boxplot(mapping = aes(x = cut, y = resid))
```
We're saving modelling for later because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.

View File

@ -24,7 +24,7 @@ To explore the basic data manipulation verbs of dplyr, we'll use `nycflights13::
flights
```
You might notice that this data frame prints little differently to other data frames you might have used in the past: it only shows the first few rows and all the columns that fit on one screen. (To see the whole dataset, you can run `View(flights)` which will open the dataset in the RStudio viewer). It prints differently because it's a __tibble__. Tibbles are data frames, but slightly tweaked to work better in the tidyverse. For now, you don't need to worry about the differences; we'll come back to tibbles in more detail in [wrangle](#wrangle-intro).
You might notice that this data frame prints a little differently from other data frames you might have used in the past: it only shows the first few rows and all the columns that fit on one screen. (To see the whole dataset, you can run `View(flights)` which will open the dataset in the RStudio viewer). It prints differently because it's a __tibble__. Tibbles are data frames, but slightly tweaked to work better in the tidyverse. For now, you don't need to worry about the differences; we'll come back to tibbles in more detail in [wrangle](#wrangle-intro).
You might also have noticed the row of three-letter abbreviations under the column names. These describe the type of each variable:
@ -426,7 +426,7 @@ There are many functions for creating new variables that you can use with `mutat
(e.g. 1st, 2nd, 2nd, 4th). The default gives the smallest values the smallest
ranks; use `desc(x)` to give the largest values the smallest ranks.
If `min_rank()` doesn't do what you need, look at the variants
`row_number()`, `dense_rank()`, `cume_dist()`, `percent_rank()`,
`row_number()`, `dense_rank()`, `percent_rank()`, `cume_dist()`,
`ntile()`.
```{r}
@ -481,7 +481,7 @@ The last key verb is `summarise()`. It collapses a data frame to a single row:
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
```
(we'll come back to what that `na.rm = TRUE` means very shortly.)
(We'll come back to what that `na.rm = TRUE` means very shortly.)
`summarise()` is not terribly useful unless we pair it with `group_by()`. This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame, they'll be automatically applied "by group". For example, if we apply exactly the same code to a data frame grouped by date, we get the average delay per date:
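A minimal sketch of that grouped summary, assuming the dplyr and nycflights13 packages are installed (the names `by_day` and `daily_delay` are illustrative):

```r
library(dplyr)
library(nycflights13)

# Group the flights by calendar date, then apply the same summarise()
# call as before; it now returns one row per date instead of one row overall.
by_day <- group_by(flights, year, month, day)
daily_delay <- summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))

daily_delay
```

The only difference from the ungrouped call is the data frame passed in; `summarise()` itself is unchanged.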