Merge branch 'master' of github.com:hadley/r4ds

# Conflicts:
#	index.rmd
This commit is contained in:
hadley 2016-04-01 10:32:09 -07:00
commit 80db16f565
5 changed files with 66 additions and 30 deletions

View File

@ -106,7 +106,7 @@ df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
```
Compared to the original, this code is easier to understand and we've eliminated one class of copy-and-paste errors. There is still quite a bit of duplication since we're doing the same thing to multiple columns. We'll learn how to eliminate that duplication in [iteration], once you've learn more about R's data structures in [data_structures].
Compared to the original, this code is easier to understand and we've eliminated one class of copy-and-paste errors. There is still quite a bit of duplication since we're doing the same thing to multiple columns. We'll learn how to eliminate that duplication in [iteration], once you've learned more about R's data structures in [data-structures].
Another advantage of functions is that if our requirements change, we only need to make the change in one place. For example, we might discover that some of our variables include infinite values, and `rescale01()` fails:
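As a hedged sketch of the failure described here (assuming the `rescale01()` definition used earlier in the chapter), an infinite value propagates through `range()` and breaks the rescaling:

```r
# Definition as used earlier in the chapter (restated for the example)
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# An infinite value makes range() return Inf, so finite values collapse
# to 0 and the infinite one becomes NaN instead of a value in [0, 1]
rescale01(c(1:10, Inf))
#> [1] 0 0 0 0 0 0 0 0 0 0 NaN
```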

View File

@ -1,11 +1,16 @@
---
knit: "bookdown::render_book"
title: "R for Data Science"
author: ["Garrett Grolemund", "Hadley Wickham"]
description: "This book will teach you how to do data science with R: You'll learn how to get your data into R, get it into the most useful structure, transform it, visualise it and model it. In this book, you will find a practicum of skills for data science. Just as a chemist learns how to clean test tubes and stock a lab, you'll learn how to clean data and draw plots---and many other things besides. These are the skills that allow data science to happen, and here you will find the best practices for doing each of these things with R. You'll learn how to use the grammar of graphics, literate programming, and reproducible research to save time. You'll also learn how to manage cognitive resources to facilitate discoveries when wrangling, visualizing, and exploring data."
url: 'http\://r4ds.had.co.nz/'
github-repo: hadley/r4ds
cover-image: cover.png
---
# Welcome
This is the book site for __"R for data science"__. This book will teach you how to do data science with R: You'll learn how to get your data into R, get it into the most useful structure, transform it, visualise it and model it. In this book, you will find a practicum of skills for data science. Just as a chemist learns how to clean test tubes and stock a lab, you'll learn how to clean data and draw plots---and many other things besides. These are the skills that allow data science to happen, and here you will find the best practices for doing each of these things with R. You'll learn how to use the grammar of graphics, literate programming, and reproducible research to save time. You'll also learn how to manage cognitive resources to facilitate discoveries when wrangling, visualizing, and exploring data. (__R for Data Science__ was formally called __Data Science with R__ in __Hands-On Programming with R__)
This is the book site for __"R for data science"__. This book will teach you how to do data science with R: You'll learn how to get your data into R, get it into the most useful structure, transform it, visualise it and model it. In this book, you will find a practicum of skills for data science. Just as a chemist learns how to clean test tubes and stock a lab, you'll learn how to clean data and draw plots---and many other things besides. These are the skills that allow data science to happen, and here you will find the best practices for doing each of these things with R. You'll learn how to use the grammar of graphics, literate programming, and reproducible research to save time. You'll also learn how to manage cognitive resources to facilitate discoveries when wrangling, visualizing, and exploring data. (__R for Data Science__ was formerly called __Data Science with R__ in __Hands-On Programming with R__)
To be published by O'Reilly in July 2016.

View File

@ -4,7 +4,7 @@
install.packages <- function(...) invisible()
```
Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge. The goal of "R for Data Science" is to introduce you to the most important in R tools that you need to do data science. After reading this book, you'll have the tools to tackle a wide variety of data science challenges, using the best parts of R.
Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge. The goal of "R for Data Science" is to introduce you to the most important R tools that you need to do data science. After reading this book, you'll have the tools to tackle a wide variety of data science challenges, using the best parts of R.
## What you will learn
@ -24,24 +24,24 @@ There are two main engines of knowledge generation: visualisation and modelling.
__Visualisation__ is a fundamentally human activity. A good visualisation will show you things that you did not expect, or raise new questions of the data. A good visualisation might also hint that you're asking the wrong question and you need to refine your thinking. In short, visualisations can surprise you. However, visualisations don't scale particularly well.
__Models__ are the complementary tools to visualisation. Models are a fundamentally mathematical or computation tool, so they generally scale well. Even when they don't, it's usually cheaper to buy more computers than it is to buy more brains. But every model makes assumptions, and by its very nature a model can not question its own assumptions. That means a model can not fundamentally surprise you.
__Models__ are the complementary tools to visualisation. Models are a fundamentally mathematical or computational tool, so they generally scale well. Even when they don't, it's usually cheaper to buy more computers than it is to buy more brains. But every model makes assumptions, and by its very nature a model can not question its own assumptions. That means a model cannot fundamentally surprise you.
The last step of data science is __communication__, an absolutely critical part of any data analysis project. It doesn't matter how well models and visualisation have led you to understand the data, unless you can communicate your results to other people.
There's one important toolset that's not shown in the diagram: programming. Programming is a cross-cutting tool that you use in every part of the project. You don't need to be an expert programmer to be a data scientist, but learning more about programming pays off. Becoming a better programmer will allow you automate common tasks, and solve new problems with greater ease.
There's one important toolset that's not shown in the diagram: programming. Programming is a cross-cutting tool that you use in every part of the project. You don't need to be an expert programmer to be a data scientist, but learning more about programming pays off. Becoming a better programmer will allow you to automate common tasks, and solve new problems with greater ease.
You'll use these tools in every data science project, but for most projects they're not enough. There's a rough 80-20 rule at play: you can probably tackle 80% of every project using the tools that we'll teach you, but you'll need more to tackle the remaining 20%. Throughout this book we'll point you to resources where you can learn more.
## How you will learn
The above description of the tools of data science was organised roughly around the order in which you use them in analysis (although of course you'll iterate through them multiple times). In our experience, however, this is not the best way to learn them:
The above description of the tools of data science is organised roughly around the order in which you use them in analysis (although of course you'll iterate through them multiple times). In our experience, however, this is not the best way to learn them:
* Starting with data ingest and tidying is sub-optimal because 80% of the time
it's routine and boring, and the other 20% of the time it's horrendously
frustrating. Instead, we'll start with visualisation and transformation on
data that's already been imported and tidied. That way, when you ingest
and tidy your own data, you'll be able to keep your motivation high because
you know the pain is worth it because of what you can accomplish once its
you know the pain is worth it because of what you can accomplish once it's
done.
* Some topics are best explained with other tools. For example, we believe that
@ -58,15 +58,15 @@ Within each chapter, we try and stick to a similar pattern: start with some moti
## What you won't learn
There are some important topics that this book doesn't cover. We believe it's important to stay ruthlessly focussed on the essentials so you can get up and running as quickly as possible. That means this book can't cover every important topic.
There are some important topics that this book doesn't cover. We believe it's important to stay ruthlessly focused on the essentials so you can get up and running as quickly as possible. That means this book can't cover every important topic.
### Big data
This book proudly focusses on small, in-memory datasets. This is the right place to start because you can't tackle big data unless you have experience with small data. The tools you learn in this book will easily handle hundreds of megabytes of data, and with a little care you can typically use them to work with 1-2 Gb of data. If you're routinely working larger data (10-100 Gb, say), you should learn more about [data.table](https://github.com/Rdatatable/data.table). We don't teach data.table here because it has a very concise interface that is harder to learn because it offers fewer linguistic cues. But if you're working with large data, the performance payoff is worth a little extra effort to learn it.
This book proudly focuses on small, in-memory datasets. This is the right place to start because you can't tackle big data unless you have experience with small data. The tools you learn in this book will easily handle hundreds of megabytes of data, and with a little care you can typically use them to work with 1-2 Gb of data. If you're routinely working with larger data (10-100 Gb, say), you should learn more about [data.table](https://github.com/Rdatatable/data.table). We don't teach data.table here because it has a very concise interface that is harder to learn because it offers fewer linguistic cues. But if you're working with large data, the performance payoff is worth a little extra effort to learn it.
Many big data problems are often small data problems in disguise. Often your complete dataset is big, but the data needed to answer a specific question is small. It's often possible to find a subset, subsample, or summary that fits in memory and still allows you to answer the question that you're interested in. The challenge here is finding the right small data, which often requires a lot of iteration. We'll touch on this idea in [transform](#transform).
Another class of big data problem consists of many small data problems. Each individual problem might fit in memory, but you have millions of them. For example, you might want to fit a model to each person in your dataset. That would be trivial if you had just 10 or 100 people, but instead you have a million. Fortunately each problem is independent (sometimes called embarassingly parallel), so you just need a system (like hadoop) that allows you to send different datasets to different computers for processing. Once you've figured out to how answer the question for a single subset using the tools described in this book, you can use packages like SparkR, rhipe, and ddr to solve it for the complete dataset.
Another class of big data problem consists of many small data problems. Each individual problem might fit in memory, but you have millions of them. For example, you might want to fit a model to each person in your dataset. That would be trivial if you had just 10 or 100 people, but instead you have a million. Fortunately each problem is independent (sometimes called embarrassingly parallel), so you just need a system (like Hadoop) that allows you to send different datasets to different computers for processing. Once you've figured out how to answer the question for a single subset using the tools described in this book, you can use packages like SparkR, rhipe, and ddr to solve it for the complete dataset.
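A minimal single-machine sketch of that "many small problems" pattern follows; the data frame `people_data`, its columns, and the grouping variable are hypothetical, and in the truly large case each element of the split would be shipped to a different worker by a system like Hadoop or SparkR rather than handled by `lapply()`:

```r
# Hypothetical data: one row per observation, many observations per person
people_data <- data.frame(
  person_id = rep(1:1000, each = 20),
  predictor = rnorm(20000),
  outcome   = rnorm(20000)
)

# Each per-person problem fits comfortably in memory...
by_person <- split(people_data, people_data$person_id)

# ...so the whole job is just the same small fit repeated many times
models <- lapply(by_person, function(df) lm(outcome ~ predictor, data = df))
length(models)
```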
### Python
@ -80,7 +80,7 @@ This book focuses exclusively on structured data sets: collections of values tha
### Formal Statistics and Machine Learning
This book focusses on practical tools for understanding your data: visualization, modelling, and transformation. You can develop your understanding further by learning probability theory, statistical hypothesis testing, and machine learning methods; but we won't teach you those things here. There are many books that cover these topics, but few that integrate the other parts of the data science process. When you are ready, you can and should read books devoted to each of these topics. We recommend *Statistical Modeling: A Fresh Approach* by Danny Kaplan; *An Introduction to Statistical Learning* by James, Witten, Hastie, and Tibshirani; and *Applied Predictive Modeling* by Kuhn and Johnson.
This book focuses on practical tools for understanding your data: visualization, modelling, and transformation. You can develop your understanding further by learning probability theory, statistical hypothesis testing, and machine learning methods; but we won't teach you those things here. There are many books that cover these topics, but few that integrate the other parts of the data science process. When you are ready, you can and should read books devoted to each of these topics. We recommend *Statistical Modeling: A Fresh Approach* by Danny Kaplan; *An Introduction to Statistical Learning* by James, Witten, Hastie, and Tibshirani; and *Applied Predictive Modeling* by Kuhn and Johnson.
## Prerequisites
@ -88,7 +88,7 @@ We've made few assumptions about what you already know in order to get the most
To run the code in this book, you will need to install both R and the RStudio IDE, an application that makes R easier to use. Both are open source, free and easy to install:
1. Download R and install R, <https://www.r-project.org/alt-home/>.
1. Download and install R, <https://www.r-project.org/alt-home/>.
1. Download and install RStudio, <http://www.rstudio.com/download>.
1. Install needed packages (see below).
@ -104,7 +104,7 @@ You run R code in the __console__ pane. Textual output appears inline, and graph
There are three keyboard shortcuts for the RStudio IDE that we strongly encourage you to learn because they'll save you so much time:
* Cmd + Enter: sends current line (or current selection) from the editor to
* Cmd + Enter: sends the current line (or current selection) from the editor to
the console and runs it. (Ctrl + Enter on a PC)
* Tab: suggest possible completions for the text you've typed.
@ -120,7 +120,7 @@ We strongly recommend making two changes to the default RStudio options:
knitr::include_graphics("screenshots/rstudio-workspace.png")
```
This ensures that every time you restart RStudio you get a completely clean slate. This is good pratice because it encourages you to capture all important interactions in your code. There's nothing worse than discovering three months after the fact that you've only stored the results of important calculation in your workspace, not the calculation itself in your code. During a project, it's good practice to regularly restart R either using the menu Session | Restart R or the keyboard shortcut Cmd + Shift + F10.
This ensures that every time you restart RStudio you get a completely clean slate. This is good practice because it encourages you to capture all important interactions in your code. There's nothing worse than discovering three months after the fact that you've only stored the results of an important calculation in your workspace, not the calculation itself in your code. During a project, it's good practice to regularly restart R either using the menu Session | Restart R or the keyboard shortcut Cmd + Shift + F10.
### R packages
@ -129,8 +129,8 @@ You'll also need to install some R packages. An R _package_ is a collection of f
```{r}
pkgs <- c(
"broom", "dplyr", "ggplot2", "jpeg", "jsonlite",
"knitr", "microbenchmark", "png", "pryr", "purrr", "readr", "stringr",
"tidyr"
"knitr", "Lahman", "microbenchmark", "png", "pryr", "purrr",
"rcorpora", "readr", "stringr", "tibble", "tidyr"
)
install.packages(pkgs)
```
@ -149,15 +149,15 @@ You will need to reload the package every time you start a new R session.
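As a tiny sketch of that point, attaching a couple of the packages installed above (any installed package works the same way, and the calls have to be repeated in each new session):

```r
# library() attaches an installed package to the current R session only
library(ggplot2)
library(dplyr)
```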
* Google. Always a great place to start! Adding "R" to a query is usually
enough to filter it down. If you ever hit an error message that you
don't know how to handle, it is a great idea to google it.
don't know how to handle, it is a great idea to Google it.
If your operating system defaults to another language, you can use
`Sys.setenv(LANGUAGE = "en")` to tell R to use english. That's likely to
`Sys.setenv(LANGUAGE = "en")` to tell R to use English. That's likely to
get you to common solutions more quickly.
* StackOverflow. Be sure to read and use [How to make a reproducible example](http://adv-r.had.co.nz/Reproducibility.html)([reprex](https://github.com/jennybc/reprex)) before posting. Unfortunately the R stackoverflow community is not always the friendliest.
* Stack Overflow. Be sure to read and use [How to make a reproducible example](http://adv-r.had.co.nz/Reproducibility.html) ([reprex](https://github.com/jennybc/reprex)) before posting. Unfortunately the R Stack Overflow community is not always the friendliest.
* Twitter. #rstats hashtag is very welcoming. Great way to keep up with
* Twitter. The #rstats hashtag is very welcoming and is a great way to keep up with
what's happening in the community.
## Acknowledgements

View File

@ -17,9 +17,9 @@ In [functions], we talked about how important it is to reduce duplication in you
1. You're likely to have fewer bugs because each line of code is
used in more places.
One part of reducing duplication is writing functions. Functions allow you to identify repeated patterns of code and extract them out in to indepdent pieces that you can reuse and easily update as code changes. Iteration helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets. (Generally, you won't need to use explicit iteration to deal with different subsets of your data: in most cases the implicit iteration in dplyr will take care of that problem for you.)
One part of reducing duplication is writing functions. Functions allow you to identify repeated patterns of code and extract them out into independent pieces that you can reuse and easily update as code changes. Iteration helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets. (Generally, you won't need to use explicit iteration to deal with different subsets of your data: in most cases the implicit iteration in dplyr will take care of that problem for you.)
In this chapter you'll learn about two important iteration tools: for loops and functional programming. For loops are a great place to start because they make iteration very explicit, so it's obvious what's happening. However, for loops are quite verbose, and include quite a bit of book-keeping code, that is duplicated for every for loop. Functional programming (FP) offers tools to extract out this duplicated code, so each common for loop pattern gets its own function. Once you master the vocabulary of FP, you can solve many common iteration problems with less code, more ease, and fewer errors.
In this chapter you'll learn about two important iteration paradigms: imperative programming and functional programming, and the machinery each provides. On the imperative side you have things like for loops and while loops, which are a great place to start because they make iteration very explicit, so it's obvious what's happening. However, for loops are quite verbose and include quite a bit of bookkeeping code that is duplicated for every for loop. Functional programming (FP) offers tools to extract out this duplicated code, so each common for loop pattern gets its own function. Once you master the vocabulary of FP, you can solve many common iteration problems with less code, more ease, and fewer errors.
Some people will tell you to avoid for loops because they are slow. They're wrong! (Well, at least they're rather out of date; for loops haven't been slow for many years.) The chief benefit of using FP functions like `lapply()` or `purrr::map()` is that they are more expressive and make code both easier to write and easier to read.
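As a hedged illustration of that expressiveness (using a hypothetical data frame `df` of numeric columns, the same shape as the running example below), the whole column-median loop collapses to a single call:

```r
library(purrr)

df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10))  # hypothetical data
map_dbl(df, median)  # one expressive call in place of the explicit for loop below
```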
@ -100,7 +100,7 @@ Every for loop has three components:
it's easy to create them accidentally. If you use `1:length(x)` instead
of `seq_along(x)`, you're likely to get a confusing error message.
1. The __body__: `output[i] <- median(df[[i]])`. This is the code that does
1. The __body__: `output[[i]] <- median(df[[i]])`. This is the code that does
the work. It's run repeatedly, each time with a different value for `i`.
The first iteration will run `output[[1]] <- median(df[[1]])`,
the second will run `output[[2]] <- median(df[[2]])`, and so on.
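Putting the three components together, here is a minimal sketch of the complete loop (again with a hypothetical numeric data frame `df`); the comments mark which part is which:

```r
df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10))  # hypothetical data

output <- vector("double", ncol(df))  # output: allocate space before looping
for (i in seq_along(df)) {            # sequence: safe even for zero-column input
  output[[i]] <- median(df[[i]])      # body: run once per value of i
}
output
```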
@ -130,7 +130,7 @@ That's all there is to the for loop! Now is a good time to practice creating som
x <- sample(100)
sd <- 0
for (i in seq_along(out)) {
for (i in seq_along(x)) {
sd <- sd + (x[i] - mean(x)) ^ 2
}
sd <- sqrt(sd) / (length(x) - 1)
@ -335,7 +335,7 @@ while (nheads < 3) {
flips
```
I mention for loops briefly, because I hardly ever use them. They're most often used for simulation, which is outside the scope of this book. However, it is good to know they exist, if you encounter a problem where the number of iterations is not known in advance.
I mention while loops briefly, because I hardly ever use them. They're most often used for simulation, which is outside the scope of this book. However, it is good to know they exist, if you encounter a problem where the number of iterations is not known in advance.
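Since the hunk above shows only the tail of the example, here is a hedged reconstruction of the full coin-flip loop it refers to; the helper `flip()` is an assumption written to match the surrounding context:

```r
flip <- function() sample(c("T", "H"), 1)  # assumed helper: a single coin flip

flips <- 0
nheads <- 0
# The number of iterations isn't known in advance, so a while loop fits:
# keep flipping until three heads occur in a row
while (nheads < 3) {
  if (flip() == "H") {
    nheads <- nheads + 1
  } else {
    nheads <- 0
  }
  flips <- flips + 1
}
flips
```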
### Exercises

View File

@ -1,6 +1,9 @@
---
output: pdf_document
---
# Model
A model is a function that summarizes how the values of one variable vary in response to the values of other variables. Models play a large role in hypothesis testing and prediction, but for the moment you should think of models just like you think of statistics. A statistic summarizes a *distribution* in a way that is easy to understand; and a model summarizes *covariation* in a way that is easy to understand. In other words, a model is just another way to describe data.
A model is a function that summarizes how the values of one variable vary in relation to the values of other variables. Models play a large role in hypothesis testing and prediction, but for the moment you should think of models just like you think of statistics. A statistic summarizes a *distribution* in a way that is easy to understand; and a model summarizes *covariation* in a way that is easy to understand. In other words, a model is just another way to describe data.
This chapter will explain how to build useful models with R.
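As a tiny hedged sketch of that idea (with simulated, hypothetical data), a fitted line is just a compact summary of how one variable covaries with another:

```r
# Hypothetical data: y covaries with x plus some noise
d <- data.frame(x = rnorm(100))
d$y <- 2 * d$x + rnorm(100)

# The two coefficients summarise the covariation, much as a mean
# summarises a distribution
mod <- lm(y ~ x, data = d)
coef(mod)
```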
@ -20,7 +23,7 @@ This chapter will explain how to build useful models with R.
To access the functions and data sets that we will use in the chapter, load the `ggplot2`, `dplyr`, `mgcv`, `splines`, and `broom` packages:
```{r}
```{r message = FALSE}
# install.packages("")
library(ggplot2)
library(dplyr)
@ -31,7 +34,8 @@ library(broom)
## Linear models
Have you heard that a relationship exists between your height and your income? It sounds far-fetched---and maybe it is---but many people believe that taller people will be promoted faster and valued more for their work, an effect that directly inflates the income of the vertically gifted. Do you think this is true?
Have you heard that a relationship exists between your height and your income? It sounds far-fetched---and maybe it is---but many people believe that taller people will be promoted faster and valued more for their work, an effect that increases their income. Could this be true?
Luckily, it is easy to measure someone's height, as well as their income (and a swath of other variables besides), which means that we can collect data relevant to the question. In fact, the Bureau of Labor Statistics has been doing this in a controlled way for over 50 years. The BLS [National Longitudinal Surveys (NLS)](https://www.nlsinfo.org/) track the income, education, and life circumstances of a large cohort of Americans across several decades. In case you are wondering, the point of the NLS is not to study the relationship between height and income; that's just a lucky accident.
@ -54,7 +58,6 @@ I've narrowed the data down to 10 variables:
* `sat_math` - Each subject's score on the math portion of the Scholastic Aptitude Test (SAT), out of 800.
* `bdate` - Month of birth with 1 = January.
```{r}
head(heights)
```
@ -70,7 +73,35 @@ First, let's address a distraction: the data is censored in an odd way. The y va
Also, you can see that heights have been rounded to the nearest inch.
Second, the relationship is not very strong.
Setting those concerns aside, we can measure the correlation between height and income with R's `cor()` function. Correlation, $r$ from statistics, measures how strongly the values of two variables are related. The sign of the correlation describes whether the variables have a positive or negative relationship. The magnitude of the correlation describes how strongly the values of one variable determine the values of the second. A correlation of 1 or -1 implies that the value of one variable completely determines the value of the second variable.
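For reference, the quantity being described is Pearson's correlation coefficient, which for paired observations $(x_i, y_i)$ can be written as:

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$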
```{r echo = FALSE, cache = TRUE}
x1 <- rnorm(100)
y1 <- .5 * x1 + rnorm(100, sd = .5)
y2 <- -.5 * x1 + rnorm(100, sd = .5)
cordat <- data.frame(x = rep(x1, 5),
                     y = c(-x1, y2, rnorm(100), y1, x1),
                     cor = rep(1:5, each = 100))
cordat$cor <- factor(cordat$cor, levels = 1:5,
                     labels = c("Correlation = -1.0",
                                "Correlation = -0.5",
                                "Correlation = 0",
                                "Correlation = 0.5",
                                "Correlation = 1.0"))
ggplot(cordat, aes(x = x, y = y)) +
  geom_point() +
  facet_grid(. ~ cor) +
  coord_fixed()
```
The plots above illustrate the strength of the relationship between two variables. If the values of the variables fall on a straight line with positive slope (i.e. the value of one variable completely determines the value of the other), the correlation is 1; if they fall on a line with negative slope, the correlation is -1.
The correlation suggests that heights may have a small effect on income.
```{r}
cor(heights$height, heights$income, use = "na")