From fe73722b0af729ada4529951eeca6d3301bcb63b Mon Sep 17 00:00:00 2001
From: Mine Cetinkaya-Rundel
Date: Fri, 29 Jul 2016 23:05:08 -0400
Subject: [PATCH 1/2] Typos in transform & EDA (#209)

* little different to -> a little differently from
* reorder to match order in following code/table
* capitalization
* capitalization + punctuation
* replace / with or since it's hard to see between tt formatted code
* missing pronoun
* they're been -> they've been
* lets need apostrophe
* Instead of display -> Instead of displaying
* missing comma
* add mapping in front of aes in a bunch of locations
* adding mapping before aes for sections before the last section where it explicitly says from here on out we'll omit them to make calls simpler
---
 EDA.Rmd       | 62 +++++++++++++++++++++++++--------------------------
 transform.Rmd |  6 ++---
 2 files changed, 33 insertions(+), 35 deletions(-)

diff --git a/EDA.Rmd b/EDA.Rmd
index 3a07b93..048cac5 100644
--- a/EDA.Rmd
+++ b/EDA.Rmd
@@ -100,7 +100,7 @@ A variable is **continuous** if can take any of an infinite set of ordered value
 
 ```{r}
 ggplot(data = diamonds) + 
-  geom_histogram(aes(x = carat), binwidth = 0.5)
+  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
 ```
 
 You can compute this by hand by combining `dplyr::count()` and `ggplot2::cut_width()`:
@@ -178,14 +178,14 @@ Outliers are observations that are unusual; data points that don't seem to fit t
 
 ```{r}
 ggplot(diamonds) + 
-  geom_histogram(aes(x = y), binwidth = 0.5)
+  geom_histogram(mapping = aes(x = y), binwidth = 0.5)
 ```
 
 There are so many observations in the common bins that the rare bins are so short that you can't see them (although maybe if you stare intently at 0 you'll spot something). To make it easy to see the unusual values, we need to zoom in to small values of the y-axis with `coord_cartesian()`:
 
 ```{r}
 ggplot(diamonds) + 
-  geom_histogram(aes(x = y), binwidth = 0.5) +
+  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
   coord_cartesian(ylim = c(0, 50))
 ```
 
@@ -211,13 +211,13 @@ When you discover an outlier, it's a good idea to trace it back as far as possib
    might decide which dimension is the length, width, and depth.
 
 1. Explore the distribution of `price`. Do you discover anything unusual
-   or surprising? (Hint: carefully think about the `binwidth` and make sure
-   you)
+   or surprising? (Hint: Carefully think about the `binwidth` and make sure
+   you.)
 
 1. How many diamonds are 0.99 carat? How many are 1 carat? What
   do you think is the cause of the difference?
 
-1. Compare and contrast `coord_cartesian()` vs `xlim()`/`ylim()` when
+1. Compare and contrast `coord_cartesian()` vs `xlim()` or `ylim()` when
   zooming in on a histogram. What happens if you leave `binwidth` unset?
   What happens if you try and zoom so only half a bar shows?
 
@@ -248,7 +248,7 @@ If you've encountered unusual values in your dataset, and simply want to move on
 
 `ifelse()` has three arguments. The first argument `test` should be a logical vector. The result will contain the value of the second argument, `yes`, when `test` is `TRUE`, and the value of the third argument, `no`, when it is false.
 
-Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing. It's not obvious where you should plot missing values, so ggplot2 doesn't include them in the plot, but does warn that they're been removed:
+Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing. It's not obvious where you should plot missing values, so ggplot2 doesn't include them in the plot, but it does warn that they've been removed:

 ```{r, dev = "png"}
 ggplot(data = diamonds2, mapping = aes(x = x, y = y)) + 
@@ -291,25 +291,25 @@ If variation describes the behavior _within_ a variable, covariation describes t
 
 ### A categorical and continuous variable
 
-It's common to want to explore the distribution of a continuous variable broken down by a categorical, as in the previous frequency polygon. The default appearance of `geom_freqpoly()` is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it's hard to see the differences in shape. For example, lets explore how the price of a diamond varies with its quality:
+It's common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous frequency polygon. The default appearance of `geom_freqpoly()` is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it's hard to see the differences in shape. For example, let's explore how the price of a diamond varies with its quality:
 
 ```{r}
 ggplot(data = diamonds, mapping = aes(x = price)) + 
-  geom_freqpoly(aes(colour = cut), binwidth = 500)
+  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
 ```
 
 It's hard to see the difference in distribution because the overall counts differ so much:
 
 ```{r, out.width = "50%", fig.width = 4}
-ggplot(diamonds, aes(cut)) + 
-  geom_bar()
+ggplot(diamonds) + 
+  geom_bar(mapping = aes(x = cut))
 ```
 
-To make the comparison easier we need to swap what is displayed on the y-axis. Instead of display count, we'll display __density__, which is the count standardised so that the area under each frequency polygon is one.
+To make the comparison easier we need to swap what is displayed on the y-axis. Instead of displaying count, we'll display __density__, which is the count standardised so that the area under each frequency polygon is one.
 
 ```{r}
 ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) + 
-  geom_freqpoly(aes(colour = cut), binwidth = 500)
+  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
 ```
 
 There's something rather surprising about this plot - it appears that fair diamonds (the lowest quality) have the highest average price! But maybe that's because frequency polygons are a little hard to interpret - there's a lot going on in this plot.
@@ -343,7 +343,7 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = price)) + 
 
 We see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot). It supports the counterintuitive finding that better quality diamonds are cheaper on average! In the exercises, you'll be challenged to figure out why.
 
-`cut` is an ordered factor: fair is worse than good, which is worse than very good and so on. Most factors are unordered, so it's fair game to reorder to display the results better. For example, take the `class` variable in the `mpg` dataset. You might be interested to know how highway mileage varies across classes:
+`cut` is an ordered factor: fair is worse than good, which is worse than very good, and so on. Most factors are unordered, so it's fair game to reorder to display the results better. For example, take the `class` variable in the `mpg` dataset. You might be interested to know how highway mileage varies across classes:

 ```{r}
 ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
@@ -354,14 +354,14 @@ Covariation will appear as a systematic change in the medians or IQRs of the box
 
 ```{r fig.height = 3}
 ggplot(data = mpg) +
-  geom_boxplot(aes(x = reorder(class, hwy, FUN = median), y = hwy))
+  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))
 ```
 
 If you have long variable names, `geom_boxplot()` will work better if you flip it 90°. You can do that with `coord_flip()`.
 
 ```{r}
 ggplot(data = mpg) +
-  geom_boxplot(aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
+  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
   coord_flip()
 ```
 
@@ -370,8 +370,6 @@ ggplot(data = mpg) +
 
 1. Use what you've learned to improve the visualisation of the departure times
   of cancelled vs. non-cancelled flights.
-
-
 1. What variable in the diamonds dataset is most important for predicting
   the price of a diamond? How is that variable correlated with cut? Why does
   the combination of those two relationships lead to lower quality
@@ -419,7 +417,7 @@ Then visualise with `geom_tile()` and the fill aesthetic:
 diamonds %>% 
   count(color, cut) %>%  
   ggplot(mapping = aes(x = color, y = cut)) +
-    geom_tile(aes(fill = n))
+    geom_tile(mapping = aes(fill = n))
 ```
 
 If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the d3heatmap or heatmaply packages, which create interactive plots.
@@ -442,14 +440,14 @@ You've already seen one great way to visualise the covariation between two conti
 
 ```{r, dev = "png"}
 ggplot(data = diamonds) +
-  geom_point(aes(x = carat, y = price))
+  geom_point(mapping = aes(x = carat, y = price))
 ```
 
 Scatterplots become less useful as the size of your dataset grows, because points begin to overplot, and pile up into areas of uniform black (as above). This problem is similar to showing the distribution of price by cut using a scatterplot:
 
 ```{r, dev = "png"}
-ggplot(data = diamonds, mapping = aes(x = price, y = cut)) + 
-  geom_point()
+ggplot(data = diamonds) +
+  geom_point(mapping = aes(x = price, y = cut))
 ```
 
 And we can fix it in the same way: by using binning. Previously you used `geom_histogram()` and `geom_freqpoly()` to bin in one dimension. Now you'll learn how to use `geom_bin2d()` and `geom_hex()` to bin in two dimensions.
@@ -458,18 +456,18 @@ And we can fix it in the same way: by using binning. Previously you used `geom_h
 
 ```{r, fig.asp = 1, out.width = "50%", fig.align = "default"}
 ggplot(data = smaller) +
-  geom_bin2d(aes(x = carat, y = price))
+  geom_bin2d(mapping = aes(x = carat, y = price))
 
 # install.packages("hexbin")
 ggplot(data = smaller) +
-  geom_hex(aes(x = carat, y = price))
+  geom_hex(mapping = aes(x = carat, y = price))
 ```
 
 Another option is to bin one continuous variable so it acts like a categorical variable. Then you can use one of the techniques for visualising the combination of a discrete and a continuous variable that you learned about. For example, you could bin `carat` and then for each group, display a boxplot:
 
 ```{r}
 ggplot(data = smaller, mapping = aes(x = carat, y = price)) + 
-  geom_boxplot(aes(group = cut_width(carat, 0.1)))
+  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
 ```
 
 `cut_width(x, width)`, as used above, divides `x` into bins of width `width`. By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it's difficult to tell that each boxplot summarises a different number of points. One way to show that is to make the width of the boxplot proportional to the number of points with `varwidth = TRUE`.
@@ -478,7 +476,7 @@ Another approach is to display approximately the same number of points in each b
 
 ```{r}
 ggplot(data = smaller, mapping = aes(x = carat, y = price)) + 
-  geom_boxplot(aes(group = cut_number(carat, 20)))
+  geom_boxplot(mapping = aes(group = cut_number(carat, 20)))
 ```
 
 #### Exercises
@@ -503,7 +501,7 @@ ggplot(data = smaller, mapping = aes(x = carat, y = price)) + 
 
 ```{r, dev = "png"}
 ggplot(data = diamonds) +
-  geom_point(aes(x = x, y = y)) +
+  geom_point(mapping = aes(x = x, y = y)) +
   coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
 ```
 
@@ -527,7 +525,7 @@ A scatterplot of Old Faithful eruption lengths versus the wait time between erup
 
 ```{r fig.height = 2}
 ggplot(data = faithful) + 
-  geom_point(aes(x = eruptions, y = waiting))
+  geom_point(mapping = aes(x = eruptions, y = waiting))
 ```
 
 Patterns provide one of the most useful tools for data scientists because they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.
@@ -543,15 +541,15 @@ diamonds2 <- diamonds %>% 
   add_residuals(mod) %>% 
   mutate(resid = exp(resid))
 
-ggplot(data = diamonds2, mapping = aes(x = carat, y = resid)) + 
-  geom_point()
+ggplot(data = diamonds2) + 
+  geom_point(mapping = aes(x = carat, y = resid))
 ```
 
 Once you've removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.
 
 ```{r}
-ggplot(data = diamonds2, mapping = aes(x = cut, y = resid)) + 
-  geom_boxplot()
+ggplot(data = diamonds2) + 
+  geom_boxplot(mapping = aes(x = cut, y = resid))
 ```
 
 We're saving modelling for later because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.
diff --git a/transform.Rmd b/transform.Rmd
index 3f45a07..5180f71 100644
--- a/transform.Rmd
+++ b/transform.Rmd
@@ -24,7 +24,7 @@ To explore the basic data manipulation verbs of dplyr, we'll use `nycflights13::
 flights
 ```
 
-You might notice that this data frame prints little differently to other data frames you might have used in the past: it only shows the first few rows and all the columns that fit on one screen. (To see the whole dataset, you can run `View(flights)` which will open the dataset in the RStudio viewer). It prints differently because it's a __tibble__. Tibbles are data frames, but slightly tweaked to work better in the tidyverse. For now, you don't need to worry about the differences; we'll come back to tibbles in more detail in [wrangle](#wrangle-intro).
+You might notice that this data frame prints a little differently from other data frames you might have used in the past: it only shows the first few rows and all the columns that fit on one screen. (To see the whole dataset, you can run `View(flights)` which will open the dataset in the RStudio viewer). It prints differently because it's a __tibble__. Tibbles are data frames, but slightly tweaked to work better in the tidyverse. For now, you don't need to worry about the differences; we'll come back to tibbles in more detail in [wrangle](#wrangle-intro).

 You might also have noticed the row of three letter abbreviations under the column names. These describe the type of each variable:

@@ -426,7 +426,7 @@ There are many functions for creating new variables that you can use with `mutat
     (e.g. 1st, 2nd, 2nd, 4th). The default gives smallest values the small
     ranks; use `desc(x)` to give the largest values the smallest ranks. 
     If `min_rank()` doesn't do what you need, look at the variants
-    `row_number()`, `dense_rank()`, `cume_dist()`, `percent_rank()`,
+    `row_number()`, `dense_rank()`, `percent_rank()`, `cume_dist()`,
     `ntile()`.
 
     ```{r}
@@ -481,7 +481,7 @@ The last key verb is `summarise()`. It collapses a data frame to a single row:
 summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
 ```
 
-(we'll come back to what that `na.rm = TRUE` means very shortly.)
+(We'll come back to what that `na.rm = TRUE` means very shortly.)
 
 `summarise()` is not terribly useful unless we pair it with `group_by()`. This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame they'll be automatically applied "by group". For example, if we applied exactly the same code to a data frame grouped by date, we get the average delay per date:

From 3eb371e1111d5ec11bacc14d8b4d38208a055bed Mon Sep 17 00:00:00 2001
From: Christian Mongeau
Date: Sun, 31 Jul 2016 18:33:58 +0200
Subject: [PATCH 2/2] Fixes in tidy (#210)

* Fixed URL to WHO data

The link was not rendered as it was missing the protocol.

* Typos
---
 tidy.Rmd | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tidy.Rmd b/tidy.Rmd
index f436d17..38db179 100644
--- a/tidy.Rmd
+++ b/tidy.Rmd
@@ -171,7 +171,7 @@ Spreading is the opposite of gathering. You use it when an observation is scatte
 table2
 ```
 
-To tidy this up, we first analysis the representation in similar way to `gather()`. This time, however, we only need two parameters:
+To tidy this up, we first analyse the representation in a similar way to `gather()`. This time, however, we only need two parameters:
 
 * The column that contains variable names, the `key` column. Here, it's `type`.
@@ -380,7 +380,7 @@ stocks %>% 
 
 `complete()` takes a set of columns, and finds all unique combinations. It then ensures the original dataset contains all those values, filling in explicit `NA`s where necessary.
 
-There's one other important tool that you should know for working with missing values. Sometimes when a data source has primarily been used for data entry, missing values indicate the the previous value should be carried forward:
+There's one other important tool that you should know for working with missing values. Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:
 
 ```{r}
 treatment <- frame_data(
@@ -407,7 +407,7 @@ treatment %>% 
 
 ## Case Study
 
-To finish off the chapter, let's pull together everything you've learned to tackle a realistic data tidying problem. The `tidyr::who` dataset contains reporter tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the *2014 World Health Organization Global Tuberculosis Report*, available for download at .
+To finish off the chapter, let's pull together everything you've learned to tackle a realistic data tidying problem. The `tidyr::who` dataset contains reported tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the *2014 World Health Organization Global Tuberculosis Report*, available for download at .
 
 There's a wealth of epidemiological information in this dataset, but it's challenging to work with the data in the form that it's provided: