Tweak figures throughout book

hadley 2016-07-18 09:52:55 -05:00
parent 061e233740
commit 11294f5d0c
8 changed files with 41 additions and 48 deletions


@@ -4,7 +4,12 @@ options(digits = 3)
knitr::opts_chunk$set(
comment = "#>",
collapse = TRUE,
cache = TRUE
cache = TRUE,
out.width = "70%",
fig.align = 'center',
fig.width = 6,
fig.asp = 0.618, # 1 / phi
fig.show = "hold"
)
options(dplyr.print_min = 6, dplyr.print_max = 6)


@@ -23,7 +23,7 @@ circle %>%
While we may stumble over raw data, we can easily process visual information. Within your mind is a powerful visual processing system fine-tuned by millions of years of evolution. As a result, often the quickest way to understand your data is to visualize it. Once you plot your data, you can instantly see the relationships between values. Here, we see that the values fall on a circle.
```{r echo=FALSE, dependson=data}
```{r echo=FALSE, dependson = data, fig.asp = 1, out.width = "30%", fig.width = 3}
ggplot(circle, aes(x, y)) +
geom_point() +
coord_fixed()


@@ -69,7 +69,7 @@ There are some important topics that this book doesn't cover. We believe it's im
This book proudly focuses on small, in-memory datasets. This is the right place to start because you can't tackle big data unless you have experience with small data. The tools you learn in this book will easily handle hundreds of megabytes of data, and with a little care you can typically use them to work with 1-2 Gb of data. If you're routinely working with larger data (10-100 Gb, say), you should learn more about [data.table](https://github.com/Rdatatable/data.table). This book doesn't teach data.table because its very concise interface offers fewer linguistic cues, which makes it harder to learn. But if you're working with large data, the performance payoff is worth the extra effort required to learn it.
If your data is bigger than this, carefully consider if your big data problem might actually be a small data problem in disguise. While the complete data might be big, often the data needed to answer a specific question is small. You might be able to find a subset, subsample, or summary that fits in memory and still allows you to answer the question that you're interested in. The challenge here is finding the right small data, which often requires a lot of iteration. We'll touch on this idea in [[Data transformation]].
If your data is bigger than this, carefully consider if your big data problem might actually be a small data problem in disguise. While the complete data might be big, often the data needed to answer a specific question is small. You might be able to find a subset, subsample, or summary that fits in memory and still allows you to answer the question that you're interested in. The challenge here is finding the right small data, which often requires a lot of iteration. We'll touch on this idea in [data transformation].
Another possibility is that your big data problem is actually a large number of small data problems. Each individual problem might fit in memory, but you have millions of them. For example, you might want to fit a model to each person in your dataset. That would be trivial if you had just 10 or 100 people, but instead you have a million. Fortunately each problem is independent of the others (a setup that is sometimes called embarrassingly parallel), so you just need a system (like Hadoop) that allows you to send different datasets to different computers for processing. Once you've figured out how to answer the question for a single subset using the tools described in this book, you can use packages like sparklyr, rhipe, and ddr to solve it for the full dataset.
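Both strategies come down to a few lines with the tools this book teaches. Here is a minimal sketch, assuming the tidyverse and gapminder packages are attached (the per-country linear model is purely illustrative, not the book's example):

```r
library(tidyverse)
library(gapminder)

# Small data in disguise: a grouped summary that easily fits in memory.
per_country <- gapminder %>%
  group_by(country) %>%
  summarise(mean_life = mean(lifeExp))

# Many small problems: fit one model per group. Each fit is independent,
# so the same pattern scales out to systems like sparklyr.
by_country <- gapminder %>%
  group_by(country) %>%
  nest() %>%
  mutate(model = map(data, ~ lm(lifeExp ~ year, data = .x)))
```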
@@ -101,7 +101,7 @@ The complement of hypothesis generation is hypothesis confirmation. Hypothesis c
This means to do hypothesis confirmation you need to "preregister"
(write out in advance) your analysis plan, and not deviate from it
even when you have seen the data. We'll talk a little about some
strategies you can use to make this easier in [[model assessment]].
strategies you can use to make this easier in [model assessment].
It's common to think about modelling as a tool for hypothesis confirmation, and visualisation as a tool for hypothesis generation. But that's a false dichotomy: models are often used for exploration, and with a little care you can use visualisation for confirmation. The key difference is how often you look at each observation: if you look only once, it's confirmation; if you look more than once, it's exploration.
@@ -131,7 +131,7 @@ To run the code in this book, you will need to install both R and the RStudio ID
RStudio is an integrated development environment, or IDE, for R programming. There are three key regions:
```{r echo = FALSE}
```{r echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/intro-rstudio.png")
```
@@ -151,7 +151,7 @@ If you want to see a list of all keyboard shortcuts, use the meta shortcut Alt +
We strongly recommend making two changes to the default RStudio options:
```{r, echo = FALSE}
```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("screenshots/rstudio-workspace.png")
```


@@ -1,15 +1,3 @@
```{r include=FALSE, cache=FALSE}
set.seed(1014)
options(digits = 3)
knitr::opts_chunk$set(
comment = "#>",
collapse = TRUE,
cache = TRUE
)
options(dplyr.print_min = 6, dplyr.print_max = 6)
```
# Model
The goal of a fitted model is to provide a simple low-dimensional summary of a dataset. Ideally, the fitted model will capture true "signals" (i.e. patterns generated by the phenomenon of interest), and ignore "noise" (i.e. random variation that you're not interested in).
@@ -667,7 +655,7 @@ One way to do this is to use `condvis::visualweight()`.
### Transformations
```{r}
```{r, dev = "png"}
ggplot(diamonds, aes(x = carat, y = price)) +
geom_point()
ggplot(diamonds, aes(x = log(carat), y = log(price))) +
@@ -700,7 +688,7 @@ Iteratively re-fit the model down-weighting outlying points (points with high re
### Additive models
```{r}
```{r, dev = "png"}
library(mgcv)
gam(income ~ s(education), data = heights)


@@ -77,7 +77,7 @@ One way is to use the same approach as in the last chapter: there's a strong sig
You already know how to do that for a single country:
```{r, out.width = "33%", fig.asp = 1, fig.width = 3, fig.show = "hold"}
```{r, out.width = "33%", fig.asp = 1, fig.width = 3, fig.align='default'}
nz <- filter(gapminder, country == "New Zealand")
nz %>%
ggplot(aes(year, lifeExp)) +


@@ -61,7 +61,7 @@ R follows a set of conventions that makes one layout of tabular data much easier
Data that satisfies these rules is known as *tidy data*. Notice that `table1` is tidy data.
```{r, echo = FALSE}
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-1.png")
```
@@ -75,7 +75,7 @@ Tidy data works well with R because it takes advantage of R's traits as a vector
Tidy data arranges values so that the relationships between variables in a dataset will parallel the relationship between vectors in R's storage objects. R stores tabular data as a data frame, a list of atomic vectors arranged to look like a table. Each column in the table is an atomic vector in the list. In tidy data, each variable in the dataset is assigned to its own column, i.e., its own vector in the data frame.
```{r, echo = FALSE}
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-2.png")
```
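You can check this claim directly; a hedged aside (not part of the diff):

```r
# A data frame is a list of equal-length atomic vectors, one per column.
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
is.list(df)  #> TRUE
df$x         # each column is an atomic vector
#> [1] 1 2 3
```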
@@ -110,7 +110,7 @@ table1$population / table1$cases
To create the output, R applies the function element-wise: first to the first elements of each vector involved, then to the second elements, and so on until it reaches the end of the vectors. If one vector is shorter than the others, R will recycle its values as needed (according to a set of recycling rules).
```{r, echo = FALSE}
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-3.png")
```
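Recycling is easy to see in a sketch like this (illustrative values, not from the diff):

```r
x <- c(10, 20, 30, 40)
y <- c(1, 2)
x + y  # the shorter y is recycled to c(1, 2, 1, 2)
#> [1] 11 22 31 42
```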
@@ -130,7 +130,7 @@ If you use basic R syntax, your calculations will look like the code below. If y
#### Dataset one
```{r, echo = FALSE}
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-4.png")
```
@@ -143,7 +143,7 @@ table1$cases / table1$population * 10000
#### Dataset two
```{r, echo = FALSE}
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-5.png")
```
@@ -160,7 +160,7 @@ table2$value[case_rows] / table2$value[pop_rows] * 10000
#### Dataset three
```{r, echo = FALSE}
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-6.png")
```
@@ -173,7 +173,7 @@ Dataset three combines the values of cases and population into the same cells. I
#### Dataset four
```{r, echo = FALSE}
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-7.png")
```
@@ -257,7 +257,7 @@ spread(table2, key, value)
`spread()` returns a copy of your dataset that has had the key and value columns removed. In their place, `spread()` adds a new column for each unique key in the key column. These unique keys will form the column names of the new columns. `spread()` distributes the cells of the former value column across the cells of the new columns and truncates any non-key, non-value columns in a way that prevents duplication.
```{r, echo = FALSE}
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-8.png")
```
@@ -291,7 +291,7 @@ gather(table4, "year", "cases", 2:3)
We've placed "key" in quotation marks because you will usually use `gather()` to create tidy data. In this case, the "key" column will contain values, not keys. The values will only be keys in the sense that they were formerly in the column names, a place where keys belong.
```{r, echo = FALSE}
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-9.png")
```


@@ -254,7 +254,7 @@ If you've encountered unusual values in your dataset, and simply want to move on
ggplot2 subscribes to the philosophy that missing values should never silently go missing. It's not obvious where you should plot missing values, so ggplot2 doesn't include them in the plot, but it does warn that they've been removed:
```{r}
```{r, dev = "png"}
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
geom_point()
```
@@ -336,7 +336,7 @@ Another alternative to display the distribution of a continuous variable broken
* A line (or whisker) that extends from each end of the box and goes to the
farthest non-outlier point in the distribution.
```{r, echo = FALSE}
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/EDA-boxplot.png")
```
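To see those components on real data, here is a minimal sketch using ggplot2's built-in `mpg` dataset as a stand-in:

```r
# Box, median line, whiskers, and outlying points, as described above.
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()
```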
@@ -441,14 +441,14 @@ If the categorical variables are unordered, you might want to use the seriation
You've already seen one great way to visualise the covariation between two continuous variables: draw a scatterplot with `geom_point()`. You can see covariation as a pattern in the points. For example, you can see an exponential relationship between the carat size and price of a diamond.
```{r}
```{r, dev = "png"}
ggplot(data = diamonds) +
geom_point(aes(x = carat, y = price))
```
Scatterplots become less useful as the size of your dataset grows, because points begin to pile up into areas of uniform black (as above). This problem is known as __overplotting__. It's similar to the problem of showing the distribution of price by cut using a scatterplot:
```{r}
```{r, dev = "png"}
ggplot(data = diamonds, mapping = aes(x = price, y = cut)) +
geom_point()
```
@@ -457,7 +457,7 @@ And we can fix it in the same way: by using binning. Previously you used `geom_h
`geom_bin2d()` and `geom_hex()` divide the coordinate plane into two-dimensional bins and then use a fill color to display how many points fall into each bin. `geom_bin2d()` creates rectangular bins. `geom_hex()` creates hexagonal bins. You will need to install the hexbin package to use `geom_hex()`.
```{r fig.show='hold', fig.asp = 1, out.width = "50%"}
```{r, fig.asp = 1, out.width = "50%", fig.align = "default"}
ggplot(data = smaller) +
geom_bin2d(aes(x = carat, y = price))
@@ -502,7 +502,7 @@ ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
unusual combination of $x$ and $y$ values, which makes the points outliers
even though their $x$ and $y$ values appear normal when examined separately.
```{r}
```{r, dev = "png"}
ggplot(data = diamonds) +
geom_point(aes(x = x, y = y)) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
@@ -535,7 +535,7 @@ Patterns provide one of the most useful tools for data scientists because they r
Models are a rich tool for extracting patterns out of data. For example, consider the diamonds data. It's hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related. It's possible to use a model to remove the very strong relationship between price and carat so we can explore the subtleties that remain.
```{r}
```{r, dev = "png"}
library(modelr)
mod <- lm(log(price) ~ log(carat), data = diamonds)
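# A hedged continuation of this truncated chunk, assuming the tidyverse
# is attached: add the model's residuals, then explore what remains once
# the strong carat-price relationship is removed. The column name
# "lresid" is an illustrative choice, not the book's.
diamonds2 <- diamonds %>%
  add_residuals(mod, "lresid")
ggplot(diamonds2, aes(x = cut, y = lresid)) +
  geom_boxplot()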


@@ -212,7 +212,7 @@ If you get an odd result, double check that you are calling the aesthetic as its
How are these two plots similar?
```{r echo = FALSE, out.width = "50%"}
```{r echo = FALSE, out.width = "50%", fig.align="default"}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
@@ -263,7 +263,7 @@ Next to each geom is a visual representation of the geom. Beneath the geom is a
To learn more about any single geom, open its help page in R by running the command `?` followed by the name of the geom function, e.g. `?geom_smooth`.
```{r, echo = FALSE}
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/visualization-geoms-1.png")
knitr::include_graphics("images/visualization-geoms-2.png")
knitr::include_graphics("images/visualization-geoms-3.png")
@@ -274,7 +274,7 @@ Many geoms use a single object to describe all of the data. For example, `geom_s
In practice, ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the `linetype` example). It is convenient to rely on this feature because the group aesthetic by itself does not add a legend or distinguishing features to the geoms.
```{r, fig.show='hold', fig.height = 2.5, fig.width = 2.5, out.width = "33%"}
```{r, fig.asp = 1, fig.width = 2.5, fig.align = 'default', out.width = "33%"}
ggplot(diamonds) +
geom_smooth(aes(x = carat, y = price))
@@ -518,13 +518,13 @@ Some graphs, like scatterplots, plot the raw values of your dataset. Other graph
ggplot2 calls the algorithm that a graph uses to calculate new values a _stat_, which is short for statistical transformation. Each geom in ggplot2 is associated with a default stat that it uses to calculate values to plot. The figure below describes how this process works with `geom_bar()`.
```{r, echo = FALSE}
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/visualization-stat-bar.png")
```
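Because every geom has a default stat (and every stat a default geom), the two are typically interchangeable. A hedged illustration:

```r
# These two calls draw the same bar chart: geom_bar() uses stat_count()
# by default, and stat_count() uses geom_bar() by default.
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))
```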
A few geoms, like `geom_point()`, plot your raw data as it is. These geoms also apply a transformation to your data, the identity transformation, which returns the data in its original state. Now we can say that _every_ geom uses a stat.
```{r, echo = FALSE}
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/visualization-stat-point.png")
```
@@ -575,7 +575,7 @@ For `geom_count()`, the `..prop..` variable does not do anything useful until yo
ggplot2 provides over 20 stats for you to use. Each stat is saved as a function, which provides a convenient way to access a stat's help page, e.g. `?stat_identity`. The table below describes each stat in ggplot2 and lists the parameters that the stat takes, as well as the variables that the stat makes.
```{r, echo = FALSE}
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/visualization-stats.png")
```
@@ -583,7 +583,7 @@ knitr::include_graphics("images/visualization-stats.png")
Let's leave the Cartesian coordinate system and examine the polar coordinate system. We will begin with a riddle: how is a bar chart similar to a coxcomb plot, like the one below?
```{r echo = FALSE, fig.show='hold', fig.width=3, fig.height=4, out.width = "50%"}
```{r echo = FALSE, fig.width=3, fig.height=4, out.width = "50%", fig.align = "default"}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
ggplot(data = diamonds) +
@@ -620,7 +620,7 @@ ggplot2 comes with eight coordinate functions that you can use in the same way a
You can learn more about each coordinate system by opening its help page in R, e.g. `?coord_cartesian`, `?coord_fixed`, `?coord_flip`, `?coord_map`, `?coord_polar`, and `?coord_trans`.
```{r, echo = FALSE}
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/visualization-coordinate-systems.png")
```
@@ -693,19 +693,19 @@ The seven parameters in the template compose the grammar of graphics, a formal s
To see how this works, consider how you could build a basic plot from scratch: you could start with a dataset and then transform it into the information that you want to display (with a stat).
```{r, echo = FALSE}
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/visualization-grammar-1.png")
```
Next, you could choose a geometric object to represent each observation in the transformed data. You could then use the aesthetic properties of the geoms to represent variables in the data. You would map the values of each variable to the levels of an aesthetic.
```{r, echo = FALSE}
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/visualization-grammar-2.png")
```
You'd then select a coordinate system to place the geoms into. You'd use the location of the objects (which is itself an aesthetic property) to display the values of the x and y variables. At that point, you would have a complete graph, but you could further adjust the positions of the geoms within the coordinate system (a position adjustment) or split the graph into subplots (facetting). You could also extend the plot by adding one or more additional layers, where each additional layer uses a dataset, a geom, a set of mappings, a stat, and a position adjustment.
```{r, echo = FALSE}
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/visualization-grammar-3.png")
```
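Spelled out with every one of the seven parameters made explicit, the template looks something like this sketch (the stat, position, and coordinate system shown are just the defaults `geom_point()` would supply anyway):

```r
# Data, geom, aesthetic mappings, stat, position adjustment,
# coordinate system, and facetting scheme:
ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    stat = "identity",      # default stat for geom_point()
    position = "identity"   # default position adjustment
  ) +
  coord_cartesian() +       # default coordinate system
  facet_wrap(~ class)
```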