Merge branch 'master' of github.com:hadley/r4ds

hadley 2016-07-10 09:20:27 -05:00
commit 0bd0021537
4 changed files with 25 additions and 25 deletions


@ -26,7 +26,7 @@ __Visualisation__ is a fundamentally human activity. A good visualisation will s
__Models__ are the complementary tools to visualisation. Models are fundamentally mathematical or computational tools, so they generally scale well. Even when they don't, it's usually cheaper to buy more computers than it is to buy more brains. But every model makes assumptions, and by its very nature a model cannot question its own assumptions. That means a model cannot fundamentally surprise you.
The last step of data science is __communication__, an absolutely critical part of any data analysis project. It doesn't matter how well models and visualisation have led you to understand the data, unless you can commmunicate your results to other people.
The last step of data science is __communication__, an absolutely critical part of any data analysis project. It doesn't matter how well models and visualisation have led you to understand the data, unless you can communicate your results to other people.
There's one important toolset that's not shown in the diagram: programming. Programming is a cross-cutting tool that you use in every part of the project. You don't need to be an expert programmer to be a data scientist, but learning more about programming pays off. Becoming a better programmer will allow you to automate common tasks, and solve new problems with greater ease.
@ -66,7 +66,7 @@ This book proudly focuses on small, in-memory datasets. This is the right place
Many big data problems are small data problems in disguise. Often your complete dataset is big, but the data needed to answer a specific question is small. It's often possible to find a subset, subsample, or summary that fits in memory and still allows you to answer the question that you're interested in. The challenge here is finding the right small data, which often requires a lot of iteration. We'll touch on this idea in [transform](#transform).
Another class of big data problem consists of many small data problems. Each individual problem might fit in memory, but you have millions of them. For example, you might want to fit a model to each person in your dataset. That would be trivial if you had just 10 or 100 people, but instead you have a million. Fortunately each problem is independent (sometimes called embarassingly parallel), so you just need a system (like Hadoop) that allows you to send different datasets to different computers for processing. Once you've figured out how to answer the question for a single subset using the tools described in this book, you can use packages like SparkR, rhipe, and ddr to solve it for the complete dataset.
Another class of big data problem consists of many small data problems. Each individual problem might fit in memory, but you have millions of them. For example, you might want to fit a model to each person in your dataset. That would be trivial if you had just 10 or 100 people, but instead you have a million. Fortunately each problem is independent (sometimes called embarrassingly parallel), so you just need a system (like Hadoop) that allows you to send different datasets to different computers for processing. Once you've figured out how to answer the question for a single subset using the tools described in this book, you can use packages like SparkR, rhipe, and ddr to solve it for the complete dataset.
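As a single-machine sketch of this split-and-fit pattern (hypothetical data frame `df` with columns `person`, `y`, and `x`):

```{r, eval = FALSE}
# Fit one independent model per person. At scale you would send each piece
# to a different machine rather than looping over them locally.
models <- lapply(split(df, df$person), function(piece) {
  lm(y ~ x, data = piece)
})
```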
### Big p data (many variables)
@ -102,7 +102,7 @@ To run the code in this book, you will need to install both R and the RStudio ID
### RStudio
RStudio is an integated development environment, or IDE, for R programming. There are three key regions:
RStudio is an integrated development environment, or IDE, for R programming. There are three key regions:
```{r, echo = FALSE}
knitr::include_graphics("screenshots/rstudio-layout.png")
@ -128,7 +128,7 @@ We strongly recommend making two changes to the default RStudio options:
knitr::include_graphics("screenshots/rstudio-workspace.png")
```
This ensures that every time you restart RStudio you get a completely clean slate. This is good pratice because it encourages you to capture all important interactions in your code. There's nothing worse than discovering three months after the fact that you've only stored the results of an important calculation in your workspace, not the calculation itself in your code. During a project, it's good practice to regularly restart R either using the menu Session | Restart R or the keyboard shortcut Cmd + Shift + F10.
This ensures that every time you restart RStudio you get a completely clean slate. This is good practice because it encourages you to capture all important interactions in your code. There's nothing worse than discovering three months after the fact that you've only stored the results of an important calculation in your workspace, not the calculation itself in your code. During a project, it's good practice to regularly restart R either using the menu Session | Restart R or the keyboard shortcut Cmd + Shift + F10.
### R packages


@ -142,7 +142,7 @@ near(1 / 49 * 49, 1)
### Logical operators
Multiple arguments to `filter()` are combined with "and". To get more complicated expressions, you can use boolean operators yourself:
Multiple arguments to `filter()` are combined with "and". To get more complicated expressions, you can use Boolean operators yourself:
```{r, eval = FALSE}
filter(flights, month == 11 | month == 12)
@ -160,7 +160,7 @@ Instead you can use the helpful `%in%` shortcut:
filter(flights, month %in% c(11, 12))
```
The following figure shows the complete set of boolean operations:
The following figure shows the complete set of Boolean operations:
```{r bool-ops, echo = FALSE, fig.cap = "Complete set of boolean operations"}
knitr::include_graphics("diagrams/transform-logical.png")
@ -247,7 +247,7 @@ filter(df, is.na(x) | x > 1)
## Arrange rows with `arrange()`
`arrange()` works similarly to `filter()` except that instead of filtering or selecting rows, it reorders them. It takes a data frame, and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:
`arrange()` works similarly to `filter()` except that instead of filtering or selecting rows, it reorders them. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:
```{r}
arrange(flights, year, month, day)
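# Use desc() to order a column in descending order, e.g. most-delayed first:
arrange(flights, desc(arr_delay))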
@ -281,7 +281,7 @@ flights[order(flights$year, flights$month, flights$day), , drop = FALSE]
### Exercises
1. How could use `arrange()` to sort all missing values to the start?
1. How could you use `arrange()` to sort all missing values to the start?
(Hint: use `is.na()`).
1. Sort `flights` to find the most delayed flights. Find the flights that
@ -629,7 +629,7 @@ ggplot(delays, aes(n, delay)) +
geom_point()
```
Not suprisingly, there is much more variation in the average delay when there are few flights. The shape of this plot is very characteristic: whenever you plot a mean (or many other summaries) vs. number of observations, you'll see that the variation decreases as the sample size increases.
Not surprisingly, there is much more variation in the average delay when there are few flights. The shape of this plot is very characteristic: whenever you plot a mean (or many other summaries) vs. number of observations, you'll see that the variation decreases as the sample size increases.
When looking at this sort of plot, it's often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups. This is what the following code does; it also shows you a handy pattern for integrating ggplot2 into dplyr flows. It's a bit painful that you have to switch from `%>%` to `+`, but once you get the hang of it, it's quite convenient.
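As a minimal sketch of that pattern (assuming `delays` carries the `n` and `delay` columns plotted above; the cutoff of 25 is arbitrary):

```{r, eval = FALSE}
delays %>%
  filter(n > 25) %>%        # drop the smallest groups with dplyr
  ggplot(aes(n, delay)) +   # then switch from %>% to + for ggplot2
    geom_point()
```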


@ -120,7 +120,7 @@ ggplot(data = diamonds) +
### Asking questions about variation
Now that you can visualize variation, what should you look for in your plots? And what type of follow up questions should you ask? I've put together a list below of the most useful types of information that you will find in your graphs, along with some follow up questions for each type of information. The key to asking good follow up questions will be to rely on your **curiosity** (What do you want to learn more about?) as well as your **skepticism** (How could this be misleading?).
Now that you can visualize variation, what should you look for in your plots? And what type of follow-up questions should you ask? I've put together a list below of the most useful types of information that you will find in your graphs, along with some follow-up questions for each type of information. The key to asking good follow-up questions will be to rely on your **curiosity** (What do you want to learn more about?) as well as your **skepticism** (How could this be misleading?).
* *Typical values*
@ -211,7 +211,7 @@ ggplot(data = diamonds) +
geom_count(mapping = aes(x = cut, y = color))
```
The size of each circle in the plot displays how many observations occurred at each combination of values. Covariation will appear as a strong correlation between specifc x values and specific y values. As with bar charts, you can calculate the specific values with `table()`.
The size of each circle in the plot displays how many observations occurred at each combination of values. Covariation will appear as a strong correlation between specific x values and specific y values. As with bar charts, you can calculate the specific values with `table()`.
```{r}
table(diamonds$color, diamonds$cut)
@ -238,7 +238,7 @@ ggplot(data = mpg) +
geom_boxplot(aes(x = class, y = hwy))
```
Covariation will appear as a systematic change in the medians or IQR's of the boxplots. To make the trend easier to see, wrap the $x$ variable with `reorder()`. The code below reorders the x axis based on the median hwy value of each group.
Covariation will appear as a systematic change in the medians or IQRs of the boxplots. To make the trend easier to see, wrap the $x$ variable with `reorder()`. The code below reorders the x axis based on the median hwy value of each group.
```{r fig.height = 3}
ggplot(data = mpg) +
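  # sketch of the reorder() call described above, assuming the usual
  # mpg columns (class, hwy):
  geom_boxplot(aes(x = reorder(class, hwy, FUN = median), y = hwy))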
@ -444,7 +444,7 @@ small_iris %>%
### K means clustering
K means clustering provides a simulation based alternative to hierarchical clustering. It identifies the "best" way to group your data into a pre-defined number of clusters. The figure below visualizes (in two dimensional space) the k means algorith:
K means clustering provides a simulation-based alternative to hierarchical clustering. It identifies the "best" way to group your data into a predefined number of clusters. The figure below visualizes (in two-dimensional space) the k means algorithm:
1. Randomly assign each data point to one of $k$ groups
2. Compute the centroid of each group
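As a rough base R sketch of these first two steps (hypothetical numeric matrix `x` with one row per data point):

```{r, eval = FALSE}
k <- 3
# Step 1: randomly assign each data point to one of k groups.
assignment <- sample(k, nrow(x), replace = TRUE)
# Step 2: compute the centroid (column means) of each group.
centroids <- sapply(split(as.data.frame(x), assignment), colMeans)
```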
@ -465,7 +465,7 @@ iris_kmeans <- small_iris %>%
iris_kmeans$cluster
```
Unlike `hclust()`, the k means algorithm does not porvide an intuitive visual interface. Instead, `kmeans()` returns a kmeans class object. Subset the object with `$cluster` to access a list of cluster assignments for your data set, e.g. `iris_kmeans$cluster`. You can visualize the results by mapping them to an aesthetic, or you can apply the results by passing them to dplyr's `group_by()` function.
Unlike `hclust()`, the k means algorithm does not provide an intuitive visual interface. Instead, `kmeans()` returns an object of class `kmeans`. Subset the object with `$cluster` to access a vector of cluster assignments for your data set, e.g. `iris_kmeans$cluster`. You can visualize the results by mapping them to an aesthetic, or you can apply the results by passing them to dplyr's `group_by()` function.
```{r}
ggplot(small_iris, aes(x = Sepal.Width, y = Sepal.Length)) +
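  # sketch: color each point by its cluster assignment
  geom_point(aes(color = factor(iris_kmeans$cluster)))

# A minimal sketch of the group_by() application mentioned above:
small_iris %>%
  mutate(cluster = iris_kmeans$cluster) %>%
  group_by(cluster) %>%
  summarise(avg_width = mean(Sepal.Width), avg_length = mean(Sepal.Length))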
@ -489,7 +489,7 @@ Ask the same questions about clusters that you find with `hclust()` and `kmeans(
* Might there be a mismatch between the number of clusters that you found and the number that exists in real life? Are only a couple of the clusters meaningful? Are there more clusters in the data than you found?
* How stable are the clusters if you re-run the algorithm?
* How stable are the clusters if you rerun the algorithm?
Keep in mind that both algorithms _will always_ return a set of clusters, whether your data appears clustered or not. As a result, you should always be skeptical about the results. They can be quite insightful, but there is no reason to treat them as a fact without doing further research.
@ -516,7 +516,7 @@ $$\hat{y} = 0.13 + 0.98 x$$
which is the equation of the blue model line in the graph above. Even if we did not have the graph, we could use the model coefficients in the equation above to determine that a positive relationship exists between $y$ and $x$ such that a one-unit increase in $x$ is associated with an approximately one-unit increase in $y$. We could use a model statistic, such as adjusted $r^{2}$, to determine that the relationship is very strong (here adjusted $r^{2} = 0.99$).
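A quick sketch of pulling those quantities out of a fitted model (using the `diamonds3` data frame that appears in the code below):

```{r, eval = FALSE}
mod <- lm(y ~ x, data = diamonds3)
coef(mod)                   # intercept and slope, roughly 0.13 and 0.98
summary(mod)$adj.r.squared  # adjusted r^2, here about 0.99
```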
Finally, we could spot outliers in our data by examining the residuals of the model, which are the distances between the actual $y$ values of our data points and the $y$ values that the model would predict for the data points. Observations that are outliers in n-dimensional space will have residuals that are outliers in one dimensional space. You can find these outliers by plotting a histogram of the residuals or by visualizing the residuals against any variable in a two dimenisonal plot.
Finally, we could spot outliers in our data by examining the residuals of the model, which are the distances between the actual $y$ values of our data points and the $y$ values that the model would predict for the data points. Observations that are outliers in n-dimensional space will have residuals that are outliers in one-dimensional space. You can find these outliers by plotting a histogram of the residuals or by visualizing the residuals against any variable in a two-dimensional plot.
```{r echo = FALSE, fig.width = 3, fig.show='hold'}
diamond_mod <- lm(y ~ x, data = diamonds3)
@ -529,7 +529,7 @@ ggplot(resids) +
geom_point(aes(x = x, y = .resid))
```
You can easily use these techniques with n dimensional relationships that cannot be visualized easily. When you spot a pattern or outlier, ask yourself the same questions that you would ask when you spot a pattern or outlier in a graph. Then visualize the residuals of your model in various ways. If a pattern exists in the residuals, it suggests that your model does not accurately describe the pattern in your data.
You can readily use these techniques with n-dimensional relationships that cannot easily be visualized. When you spot a pattern or outlier, ask yourself the same questions that you would ask when you spot a pattern or outlier in a graph. Then visualize the residuals of your model in various ways. If a pattern exists in the residuals, it suggests that your model does not accurately describe the pattern in your data.
I'll postpone teaching you how to fit and interpret models with R until Part 4. Although models are simple things, just descriptions of patterns, they are tied into the logic of statistical inference: if a model describes your data accurately _and_ your data is similar to the world at large, then your model should describe the world at large accurately. This chain of reasoning provides a basis for using models to make inferences and predictions. As a result, there is more to learn about models than we can examine here.
@ -551,7 +551,7 @@ diamonds %>%
The window functions from Chapter 3 are particularly useful for calculating new variables. To calculate a variable from two or more variables, use basic operators or the `map2()`, `map3()`, and `map_n()` functions from purrr. You will learn more about purrr in Chapter ?.
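For instance, a minimal sketch of deriving one variable from two others, first with a basic operator and then with `map2_dbl()` (the type-stable variant of `map2()` in current purrr):

```{r, eval = FALSE}
library(purrr)
ratio1 <- mtcars$wt / mtcars$disp                 # basic operator
ratio2 <- map2_dbl(mtcars$wt, mtcars$disp, `/`)   # same result via purrr
```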
If you are statistically trained, you can use R to extract potential variables with more sophisticated algorithms. R provides `prcomp()` for Principle Components Analysis and `factanal()` for factor analysis. The psych and SEM packages also provide further tools for working with latent variables.
If you are statistically trained, you can use R to extract potential variables with more sophisticated algorithms. R provides `prcomp()` for principal component analysis and `factanal()` for factor analysis. The psych and SEM packages also provide further tools for working with latent variables.
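As a concrete sketch, PCA on the numeric `iris` measurements (scaling the variables first is usually advisable):

```{r, eval = FALSE}
pca <- prcomp(iris[, 1:4], scale. = TRUE)
head(pca$x)   # the component scores, usable as new variables
summary(pca)  # proportion of variance explained by each component
```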
### To make new observations
@ -569,7 +569,7 @@ Variables, values, and observations provide a basis for Exploratory Data Analysi
Within any particular observation, the exact form of the relationship between variables may be obscured by mediating factors, measurement error, or random noise, which means that the patterns in your data will appear as signals obscured by noise.
Due to a quirk of the human cognitive system, the easiest way to spot signal admidst noise is to visualize your data. The concepts of variables, values, and observations have a role to play here as well. To visualize your data, represent each observation with its own geometric object, such as a point. Then map each variable to an aesthetic property of the point, setting specific values of the variable to specific levels of the aesthetic. You could also compute group-level statistics from your data (i.e. new observations) and map them to geoms, something that `geom_bar()`, `geom_boxplot()` and other geoms do for you automatically.
Due to a quirk of the human cognitive system, the easiest way to spot signal amidst noise is to visualize your data. The concepts of variables, values, and observations have a role to play here as well. To visualize your data, represent each observation with its own geometric object, such as a point. Then map each variable to an aesthetic property of the point, setting specific values of the variable to specific levels of the aesthetic. You could also compute group-level statistics from your data (i.e. new observations) and map them to geoms, something that `geom_bar()`, `geom_boxplot()` and other geoms do for you automatically.
## Exploratory Data Analysis and Data Science
@ -599,7 +599,7 @@ Finally, if your work is meaningful at all, it will have an audience, which mean
knitr::include_graphics("images/EDA-data-science-4.png")
```
This model of data science forms a roadmap for the rest of the book.
This model of data science forms a road map for the rest of the book.
* Part 1 of the book covered the central tasks of the model above, Exploratory Data Analysis.


@ -19,7 +19,7 @@ library(ggplot2)
## A code template
Let's use our first graph to answer a question: Do cars with big engines use more fuel than cars with small engines? You probably already have an answer, but try to make your answer precise. What does the relationship between engine size and fuel efficieny look like? Is it positive? Negative? Linear? Nonlinear?
Let's use our first graph to answer a question: Do cars with big engines use more fuel than cars with small engines? You probably already have an answer, but try to make your answer precise. What does the relationship between engine size and fuel efficiency look like? Is it positive? Negative? Linear? Nonlinear?
You can test your answer with the `mpg` data set in the `ggplot2` package. The data set contains observations collected by the EPA on 38 models of car. Among the variables in `mpg` are
@ -63,7 +63,7 @@ The rest of this chapter will show you how to complete and extend this template
> "The greatest value of a picture is when it forces us to notice what we never expected to see."---John Tukey
In the plot above, one group of points seems to fall outside of the linear trend. These cars have a higher mileage than you might expect. How can you explain these cars?
In the plot below, one group of points seems to fall outside of the linear trend. These cars have a higher mileage than you might expect. How can you explain these cars?
```{r, echo = FALSE}
knitr::include_graphics("images/visualization-1.png")
@ -73,7 +73,7 @@ Let's hypothesize that the cars are hybrids. One way to test this hypothesis is
You can add a third variable, like `class`, to a two-dimensional scatterplot by mapping it to an _aesthetic_.
An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points. You can display a point (like the one below) in different ways by changing the values of its aesthetic properties. Since we already use the word "value" to describe data, let's use the word "level" to describe aesthetic properties. Here we change the levels of a point's size, shape, and color to make the point small, trianglular, or blue.
An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points. You can display a point (like the one below) in different ways by changing the values of its aesthetic properties. Since we already use the word "value" to describe data, let's use the word "level" to describe aesthetic properties. Here we change the levels of a point's size, shape, and color to make the point small, triangular, or blue.
```{r, echo = FALSE}
knitr::include_graphics("images/visualization-2.png")
@ -111,7 +111,7 @@ ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
```
What happened to the suv's? `ggplot2` will only use six shapes at a time. Additional groups will go unplotted when you use this aesthetic.
What happened to the suvs? `ggplot2` will only use six shapes at a time. Additional groups will go unplotted when you use this aesthetic.
For each aesthetic, you set the name of the aesthetic to the variable to display, and you do this within the `aes()` function. The `aes()` function gathers together each of the aesthetic mappings used by a layer and passes them to the layer's mapping argument. The syntax highlights a useful insight, because you also set `x` and `y` to variables within `aes()`: the x and y locations of a point are themselves aesthetics, visual properties that you can map to variables to display information about the data.
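For example, all three mappings below (`x`, `y`, and `color`) are gathered by `aes()` and passed to the layer:

```{r, eval = FALSE}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))
```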
@ -546,7 +546,7 @@ knitr::include_graphics("images/visualization-stats.png")
## Coordinate systems
Let's leave the cartesian coordinate system and examine the polar coordinate system. We will begin with a riddle: how is a bar chart similar to a coxcomb plot, like the one below?
Let's leave the Cartesian coordinate system and examine the polar coordinate system. We will begin with a riddle: how is a bar chart similar to a coxcomb plot, like the one below?
```{r echo = FALSE, message = FALSE, fig.show='hold', fig.width=3, fig.height=4}
ggplot(data = diamonds) +