From b3855be66cca0f6196be2f12ea1abfac68d4bb7b Mon Sep 17 00:00:00 2001 From: hadley Date: Tue, 4 Oct 2016 07:49:10 -0500 Subject: [PATCH] Incorporating suggestions from @csgillespie --- transform.Rmd | 10 ++-- visualize.Rmd | 110 ++++++++++++++++++++++++-------------------- workflow-basics.Rmd | 16 +++---- 3 files changed, 72 insertions(+), 64 deletions(-) diff --git a/transform.Rmd b/transform.Rmd index 8c721f0..7448ae8 100644 --- a/transform.Rmd +++ b/transform.Rmd @@ -8,7 +8,7 @@ Visualisation is an important tool for insight generation, but it is rare that y In this chapter we're going to focus on how to use the dplyr package, another core member of the tidyverse. We'll illustrate the key ideas using data from the nycflights13 package, and use ggplot2 to help us understand the data. -```{r setup} +```{r setup, message = FALSE} library(nycflights13) library(tidyverse) ``` @@ -44,7 +44,7 @@ There are three other common types of variables that aren't used in this dataset * `date` stands for dates. -### Dplyr basics +### dplyr basics In this chapter you are going to learn the five key dplyr functions that allow you to solve the vast majority of your data manipulation challenges: @@ -431,7 +431,7 @@ There are many functions for creating new variables that you can use with `mutat dense_rank(y), percent_rank(y), cume_dist(y) - ) %>% knitr::kable() + ) ``` ### Exercises @@ -594,7 +594,7 @@ delays <- not_cancelled %>% ) ggplot(data = delays, mapping = aes(x = n, y = delay)) + - geom_point() + geom_point(alpha = 1/10) ``` Not surprisingly, there is much greater variation in the average delay when there are few flights. The shape of this plot is very characteristic: whenever you plot a mean (or other summary) vs. group size, you'll see that the variation decreases as the sample size increases. 
@@ -605,7 +605,7 @@ When looking at this sort of plot, it's often useful to filter out the groups wi delays %>% filter(n > 25) %>% ggplot(mapping = aes(x = n, y = delay)) + - geom_point() + geom_point(alpha = 1/10) ``` -------------------------------------------------------------------------------- diff --git a/visualize.Rmd b/visualize.Rmd index 29ab116..efcd3b9 100644 --- a/visualize.Rmd +++ b/visualize.Rmd @@ -38,7 +38,7 @@ You can test your answer with the `mpg` dataset in ggplot2, or `ggplot2::mpg`: mpg ``` -The dataset contains observations collected by the EPA on 38 models of cars. Among the variables in `mpg` are: +The dataset contains observations collected by the US Environmental Protection Agency on 38 models of cars. Among the variables in `mpg` are: 1. `displ`, a car's engine size, in litres. @@ -345,14 +345,7 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + Notice that this plot contains two geoms in the same graph! If this makes you excited, buckle up. In the next section, we will learn how to place multiple geoms in the same plot. -ggplot2 provides over 30 geoms, and extension packages provide even more (see for a sampling). The table below lists the geoms in ggplot2, loosely organized by the type of relationship that they visualise. Beneath each geom is a list of aesthetics the geom understands, and mandatory aesthetics are bolded. The geom call lists the most important arguments. To learn more about any single geom, open its help page in R by running the command `?` followed by the name of the geom function, e.g. `?geom_smooth`. - -```{r, echo = FALSE, out.width = "100%"} -knitr::include_graphics("images/visualization-geoms-1.png") -knitr::include_graphics("images/visualization-geoms-2.png") -knitr::include_graphics("images/visualization-geoms-3.png") -knitr::include_graphics("images/visualization-geoms-4.png") -``` +ggplot2 provides over 30 geoms, and extension packages provide even more (see for a sampling). 
The best way to get a comprehensive overview is the ggplot2 cheatsheet, which you can find at . To learn more about any single geom, use help: `?geom_smooth`. Many geoms, like `geom_smooth()`, use a single geometric object to display multiple rows of data. For these geoms, you can set the `group` aesthetic to a categorical variable to draw multiple objects. ggplot2 will draw a separate object for each unique value of the grouping variable. In practice, ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the `linetype` example). It is convenient to rely on this feature because the group aesthetic by itself does not add a legend or distinguishing features to the geoms. @@ -470,34 +463,39 @@ ggplot(data = diamonds) + On the x-axis, the chart displays `cut`, a variable from `diamonds`. On the y-axis, it displays count, but count is not a variable in `diamonds`! Where does count come from? Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot: -* __bar charts__, __histograms__, and __frequency polygons__ bin your data +* bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin. -* __smoothers__ fit a model to your data and then plot predictions from the +* smoothers fit a model to your data and then plot predictions from the model. -* __boxplots__ calculate the quartiles of your data and then plot the - quartiles as a box. +* boxplots compute a robust summary of the distribution and display it as a + specially formatted box. -ggplot2 calls the algorithm that a graph uses to calculate new values, a __stat__, which is short for statistical transformation. Each geom in ggplot2 is associated with a default stat that it uses to calculate values to plot. The figure below describes how this process works with `geom_bar()`. 
+The algorithm used to calculate new values for a graph is called a __stat__, short for statistical transformation. The figure below describes how this process works with `geom_bar()`. ```{r, echo = FALSE, out.width = "100%"} knitr::include_graphics("images/visualization-stat-bar.png") ``` -A few geoms, like `geom_point()`, plot your raw data as it is. These geoms also apply a transformation to your data, the identity transformation, which returns the data in its original state. Now we can say that _every_ geom uses a stat. +You can learn which stat a geom uses by inspecting the default value for the `stat` argument. For example, `?geom_bar` shows the default value for `stat` is "count", which means that `geom_bar()` uses `stat_count()`. `stat_count()` is documented on the same page as `geom_bar()`, and if you scroll down you can find a section called "Computed variables". That tells you that it computes two new variables: `count` and `prop`. -```{r, echo = FALSE, out.width = "100%"} -knitr::include_graphics("images/visualization-stat-point.png") +You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using `stat_count()` instead of `geom_bar()`: + +```{r} +ggplot(data = diamonds) + + stat_count(mapping = aes(x = cut)) ``` -You can learn which stat a geom uses, as well as what variables it computes by visiting the geom's help page. For example, the help page of `geom_bar()` shows that it uses the count stat and that the count stat computes two new variables, `count` and `prop`. - -Stats are the most subtle part of plotting because you can't see them directly. ggplot2 applies the transformation and stores the results behind the scenes. You only see the impact in the final plot. Generally, you don't need to think about stats: the defaults work away on your behalf to summarise your data as needed for a particular plot. 
However, there are two cases where you might need to know about them: +This works because every geom has a default stat; and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation. There are three reasons you might need to use a stat explicitly: 1. You might want to override the default stat. In the code below, I change the stat of `geom_bar()` from count (the default) to identity. This lets me map the height of the bars to the raw values of a $y$ variable. + Unfortunately when people talk about bar charts casually, they might be + referring to this type of bar chart, where the height of the bar is already + present in the data, or the previous bar chart where the height of the bar + is generated by counting rows. ```{r} demo <- tibble( @@ -510,10 +508,9 @@ Stats are the most subtle part of plotting because you can't see them directly. geom_bar(mapping = aes(x = a, y = b), stat = "identity") ``` - (Unfortunately when people talk about bar charts casually, they might be - referring to this type of bar chart, where the height of the bar is already - present in the data, or the previous bar chart where the height of the bar - is generated by counting rows.) + (Don't worry that you haven't seen `<-` or `tibble()` before. You might be + able to guess at their meaning from the context, and you'll learn exactly + what they do soon!) 1. You might want to override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of @@ -524,36 +521,57 @@ Stats are the most subtle part of plotting because you can't see them directly. geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1)) ``` - The help page of `?geom_bar` reveals that the sum stat creates two - variables, `count` and `prop`. 
By default, `geom_bar()` maps `y` - to `count`, but you can ask it to use `prop` instead with - `aes(y = ..prop..)`.The two dots that surround prop notify ggplot2 that - the `prop` variable appears in the transformed dataset not in the - raw dataset. + To find the variables computed by the stat, look for the help section + titled "computed variables". -ggplot2 provides over 20 stats for you to use. Each stat is saved as a function, which provides a convenient way to access a stat's help page, e.g. `?stat_identity`. The table below describes each stat in ggplot2 and lists the parameters that the stat takes, as well as the variables that the stat makes. - -```{r, echo = FALSE, out.width = "100%"} -knitr::include_graphics("images/visualization-stats.png") -``` +1. You might want to draw greater attention to the statistical transformation + in your code. For example, you might use `stat_summary()`, which + summarises the y values for each unique x value, to draw + attention to the summary that you're computing: + + ```{r} + ggplot(data = diamonds) + + stat_summary( + mapping = aes(x = cut, y = depth), + fun.ymin = min, + fun.ymax = max, + fun.y = median + ) + ``` + +ggplot2 provides over 20 stats for you to use. Each stat is a function, so you can get help in the usual way, e.g. `?stat_bin`. To see a complete list of stats, try the ggplot2 cheatsheet. ### Exercises -1. In our proportion bar chart, we need to set `group = 1`. Why? In other - words, why is this graph not useful? +1. What is the default geom associated with `stat_summary()`? How could + you rewrite the previous plot to use that geom function instead of the + stat function? + +1. What does `geom_col()` do? How is it different to `geom_bar()`? + +1. Most geoms and stats come in pairs that are almost always used in + concert. Read through the documentation and make a list of all the + pairs. What do they have in common? + +1. What variables does `stat_smooth()` compute? What parameters control + its behaviour? 
+ +1. In our proportion bar chart, we need to set `group = 1`. Why? In other + words what is the problem with these two graphs? ```{r, eval = FALSE} ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, y = ..prop..)) + ggplot(data = diamonds) + + geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..)) ``` - -1. How do you find out the default stat associated with a geom? + ## Position adjustments There's one more piece of magic associated with bar charts. You can colour a bar chart using either the `colour` aesthetic, or more usefully, `fill`: -```{r fig.width = 3, out.width = "50%", fig.align = "default"} +```{r out.width = "50%", fig.align = "default"} ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, colour = cut)) ggplot(data = diamonds) + @@ -631,6 +649,8 @@ To learn more about a position adjustment, look up the help page associated with geom_point() ``` +1. What parameters to `geom_jitter()` control the amount of jittering? + 1. Compare and contrast `geom_jitter()` with `geom_count()`. 1. What's the default position adjustment for `geom_boxplot()`? Create @@ -638,9 +658,7 @@ To learn more about a position adjustment, look up the help page associated with ## Coordinate systems -Coordinate systems are probably the most complicated part of ggplot2. The default coordinate system is the Cartesian coordinate system where the x and y position act independently to find the location of each point. - -There are a number of other coordinate systems that are occasionally helpful. +Coordinate systems are probably the most complicated part of ggplot2. The default coordinate system is the Cartesian coordinate system where the x and y position act independently to find the location of each point. There are a number of other coordinate systems that are occasionally helpful. * `coord_flip()` switches the x and y axes. This is useful (for example), if you want horizontal boxplots. 
@@ -685,12 +703,6 @@ There are a number of other coordinate systems that are occasionally helpful. bar + coord_polar() ``` -The table below describes each built-in coord. You can learn more about each coordinate system by opening its help page in R, e.g. `?coord_cartesian`. - -```{r, echo = FALSE, out.width = "100%"} -knitr::include_graphics("images/visualization-coordinate-systems.png") -``` - ### Exercises 1. Turn a stacked bar chart into a pie chart using `coord_polar()`. diff --git a/workflow-basics.Rmd b/workflow-basics.Rmd index 4b0e404..05e3650 100644 --- a/workflow-basics.Rmd +++ b/workflow-basics.Rmd @@ -108,14 +108,14 @@ The `+` tells you that R is waiting for more input; it doesn't think you're done If you make an assignment, you don't get to see the value. You're then tempted to immediately double-check the result: ```{r} -y <- seq(1, 10, length = 5) +y <- seq(1, 10, length.out = 5) y ``` This common action can be shortened by surrounding the assignment with parentheses, which causes assignment and "print to screen" to happen. ```{r} -(y <- seq(1, 10, length = 5)) +(y <- seq(1, 10, length.out = 5)) ``` Now look at your environment in the upper right pane: @@ -142,17 +142,13 @@ Here you can see all of the objects that you've created. 1. Tweak each of the following R commands so that they run correctly: ```{r, eval = FALSE} - library(ggplot2) - library(dplyr) - + library(tidyverse) + ggplot(dota = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) - mpg %>% - fliter(cyl = 8) - - diamond %>% - filter(carat > 3) + fliter(mpg, cyl = 8) + filter(diamond, carat > 3) ``` 1. Press Alt + Shift + K. What happens? How can you get to the same place