Merge branch 'master' of github.com:hadley/r4ds

Garrett 2015-12-07 14:05:09 -05:00
commit 94e6d5f27b
13 changed files with 251 additions and 106 deletions


@ -1,15 +1,24 @@
<li><a href="intro.html">Introduction</a></li>
<li class="dropdown-header">Data science essentials</li>
<li><a href="visualize.html">Visualize</a></li>
<li><a href="transform.html">Transform</a></li>
<li><a href="strings.html">String manipulation</a></li>
<!--
<li><a href="dates.html">Dates and times</a></li>
-->
<li><a href="tidy.html">Tidy</a></li>
<li><a href="expressing-yourself.html">Expressing yourself</a></li>
<li><a href="import.html">Import</a></li>
<li class="dropdown-header">Communication</li>
<li><a href="rmarkdown.html">R Markdown</a></li>
<li><a href="shiny.html">Shiny</a></li>
<li class="dropdown-header">Programming</li>
<li><a href="data-structures.html">Data structures</a></li>
<li><a href="strings.html">Strings</a></li>
<li><a href="datetimes.html">Dates and times</a></li>
<li><a href="functions.html">Expressing yourself with code</a></li>
<li><a href="lists.html">Lists</a></li>
<!--
<li><a href="models.html">Model</a></li>
<li><a href="communicate.html">Communicate</a></li>
-->
<li class="dropdown-header">Modelling</li>
<li><a href="model-linear.html">Linear models</a></li>
<li><a href="model-vis.html">Models and visualisation</a></li>
<li><a href="model-assess.html">Model assesment</a></li>
<li><a href="model-other.html">Other models</a></li>

data-structures.Rmd Normal file

@ -0,0 +1,21 @@
---
layout: default
title: Data structures
output: bookdown::html_chapter
---
Might be quite brief.
## Data structures
Atomic vectors and lists. What is a data frame?
`typeof()` vs. `class()` mostly in context of how date/times and factors are built on top of simpler structures.
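A quick sketch of the kind of example this might use (the objects are invented for illustration):

```{r}
today <- Sys.Date()
typeof(today)  # "double": a date is stored as days since 1970-01-01
class(today)   # "Date": the class controls printing and behaviour

f <- factor(c("a", "b", "a"))
typeof(f)      # "integer": a factor is stored as integer codes
class(f)       # "factor": plus a levels attribute
```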
## Factors
(Since they won't get a chapter of their own.)
## Subsetting
Not sure where else this should be covered.
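A minimal sketch of the distinction this section might draw (the list is invented for illustration):

```{r}
l <- list(a = 1:3, b = "x")
l["a"]    # [ returns a list containing the selected elements
l[["a"]]  # [[ extracts a single element
l$a       # $ is shorthand for [[ with a name

# A data frame is a list of columns, so the same tools apply
mtcars["mpg"]    # a one-column data frame
mtcars[["mpg"]]  # the mpg vector itself
```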

datetimes.Rmd Normal file

@ -0,0 +1,5 @@
---
layout: default
title: Dates and times
output: bookdown::html_chapter
---


@ -10,15 +10,14 @@ output: bookdown::html_chapter
The goal of "R for Data Science" is to give you a solid foundation into using R to do data science. The goal is not to be exhaustive, but to instead focus on what we think are the critical skills for data science:
* Getting your data into R so you can work with it. On disk, in a database,
  on the web.
* Wrangling your data into a tidy form, so it's easier to work with. This lets
  you spend your time struggling with your questions, not fighting to get data
into the right form for different functions.
* Transforming your data to add variables and compute basic summaries.
* Visualising your data to gain insight. Visualisations are one of the most
important tools of data science because they can surprise you: you can
@ -35,6 +34,8 @@ The goal of "R for Data Science" is to give you a solid foundation into using R
how you can create static reports with rmarkdown, and interactive apps with
shiny.
[Hadley's standard data science diagram]
## Learning data science
Above, I've listed the components of the data science process in roughly the order you'll encounter them in an analysis (although of course you'll iterate multiple times). This, however, is not the order you'll encounter them in this book. This is because:
@ -49,7 +50,7 @@ Above, I've listed the components of the data science process in roughly the ord
We've honed this order based on our experience teaching live classes, and it's been carefully designed to keep you motivated. We try to stick to a similar pattern within each chapter: start with motivating examples so you can see the bigger picture, and then dive into the details.
Each section of the book also comes with exercises to help you practice what you've learned. It's tempting to skip these, but there's no better way to learn than practicing. If you were taking a class with either of us, we'd force you to do them by making them homework. (Sometimes I feel like teaching is the art of tricking people into doing what's in their own best interests.)
## Talking about data science
@ -57,21 +58,23 @@ Throughout the book, we will discuss the principles of data that will help you b
* A _variable_ is a quantity, quality, or property that you can measure.
* A _value_ is the state of a variable when you measure it. The value of a
variable may change from measurement to measurement.
* An _observation_ is a set of measurements you make under similar conditions
  (usually all at the same time or on the same object). Observations contain
  values that you measure on different variables.
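For example, in the toy data frame below (invented for illustration), `name` and `height` are variables, each row is an observation, and each cell is a value:

```{r}
data.frame(
  name = c("A", "B", "C"),
  height = c(1.62, 1.75, 1.68)
)
```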
These terms will help us speak precisely about the different parts of a data set. They will also provide a system for turning data into insights.
This book focuses exclusively on structured data sets: collections of values that are each associated with a variable and an observation. There is lots of data that doesn't naturally fit in this paradigm: images, sounds, trees, text. But data frames are extremely common in science and industry, and we believe that they're a great place to start your data analysis journey.
## R and big data
This book also focuses almost exclusively on in-memory datasets.
* Small data: data that fits in memory on a laptop, ~10 GB. Note that small
data is still big! R is great with small data. Pointer to data.table.
* Medium data: data that fits in memory on a powerful server, ~5 TB. It's
possible to use R with this much data, but it's challenging. Dealing
@ -106,29 +109,44 @@ To run the code in this book, you will need to install both R and the RStudio ID
### R
To install R, visit <http://cran.r-project.org> and click the link that matches your operating system. What you do next will depend on your operating system.
* Mac users should click the `.pkg` file at the top of the page. This file contains the most current release of R. Once the file is downloaded, double click it to open an R installer. Follow the directions in the installer to install R.
* Windows users should click "base" and then download the most current version of R, which will be linked at the top of the page.
* Linux users should select their distribution and then follow the distribution-specific instructions to install R. <https://cran.r-project.org/bin/linux/> includes these instructions alongside the files to download.
### RStudio
After you install R, visit <http://www.rstudio.com/download> to download the RStudio IDE. Choose the installer for your system and click the link to download the application. Once you have the application, installation is easy. After the RStudio IDE is installed, open it as you would open any other application.
Brief RStudio orientation (code, console, and output). Pointers to where to learn more.
Important keyboard shortcuts:
* Cmd + Enter: sends the current line from the editor to the console.
* Tab: suggests possible completions for the text you've typed.
* Cmd + ↑: in the console, searches all commands you've typed that start with
  those characters.
* Cmd + Shift + F10: restarts R.
* Alt + Shift + K: the keyboard shortcut that shows all the keyboard shortcuts.
Note about turning the save/load session behaviour off.
### R Packages
An R _package_ is a collection of functions, data sets, and help files that extends the R language. We will use a lot of R packages in this book. To install them all, open RStudio and run:
```{r eval = FALSE}
install.packages(c("DBI", "devtools", "dplyr", "ggplot2", "haven", "knitr", "lubridate", "packrat", "readr", "rmarkdown", "rsqlite", "rvest", "scales", "shiny", "stringr", "tidyr"))
install.packages(c(
"DBI", "devtools", "dplyr", "ggplot2", "haven", "knitr", "lubridate",
"packrat", "readr", "rmarkdown", "RSQLite", "rvest", "scales", "shiny",
"stringr", "tidyr"
))
```
R will download the packages from CRAN and install them in your system library. If you have problems installing, make sure that you are connected to the internet, and that you haven't blocked <http://cran.r-project.org> in your firewall or proxy.
After you have installed the packages, you can load any of them into your current R session with the `library()` command, e.g.
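For instance (a minimal illustration; any of the packages above works the same way):

```{r eval = FALSE}
library(ggplot2)
```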
@ -140,11 +158,21 @@ You will not be able to use the functions, objects, and help files in a package
### Getting help
* Google. Always a great place to start! Adding "R" to a query is usually
  enough to filter it down. If you ever hit an error message that you
  don't know how to handle, it's a great idea to google it.

  If your operating system defaults to another language, you can use
  `Sys.setenv(LANGUAGE = "en")` to tell R to use English. That's likely to
  get you to common solutions more quickly.

* StackOverflow. How to make a reproducible example
  ([reprex](https://github.com/jennybc/reprex)); see the sketch after this
  list. Unfortunately the R StackOverflow community is not always the
  friendliest.

* Twitter. The #rstats hashtag is very welcoming, and a great way to keep up
  with what's happening in the community.
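A minimal sketch of the reprex workflow (assuming the package is installed from GitHub):

```{r eval = FALSE}
# devtools::install_github("jennybc/reprex")
# 1. Copy a small, self-contained piece of code to the clipboard.
# 2. reprex() re-runs it in a clean session and puts rendered markdown,
#    code plus output, back on the clipboard, ready to paste into a question.
reprex::reprex()
```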
## Acknowledgements

View File

@ -30,6 +30,8 @@ The goal of using purrr functions instead of for loops is to allow you break com
This structure makes it easier to solve new problems. It also makes it easier to understand your solutions to old problems when you re-read your old code.
In later chapters you'll learn how to apply these ideas when modelling. You can often use multiple simple models to help understand a complex dataset, or you might have multiple models because you're bootstrapping or cross-validating. The techniques you learn in this chapter will be invaluable.
<!--
## Warm ups
@ -155,6 +157,7 @@ embed_jpg("images/pepper-3.jpg", 300)
1. Generate the lists corresponding to these nested set diagrams.
1. What happens if you subset a data frame as if you're subsetting a list?
What are the key differences between a list and a data frame?
## A common pattern of for loops
@ -338,6 +341,8 @@ There are a few differences between `map_*()` and `compute_summary()`:
There are a few shortcuts that you can use with `.f` in order to save a little typing. Imagine you want to fit a linear model to each individual in a dataset. The following toy example splits up the `mtcars` dataset into three pieces and fits the same linear model to each piece:
<!-- Haven't covered modelling yet so might need a different motivating example -->
```{r}
models <- mtcars %>%
split(.$cyl) %>%
@ -562,7 +567,7 @@ y <- x %>% map(safe_log)
str(y)
```
This would be easier to work with if we had two lists: one of all the errors and one of all the results. That's easy to get with `transpose()`.
```{r}
y <- y %>% transpose()
@ -834,69 +839,3 @@ i.e. how do dplyr and purrr intersect.
* List columns in a data frame
* Mutate & filter.
* Creating list columns with `group_by()` and `do()`.

model-assess.Rmd Normal file

@ -0,0 +1,78 @@
---
layout: default
title: Model assessment
output: bookdown::html_chapter
---
```{r setup, include=FALSE}
library(purrr)
set.seed(1014)
options(digits = 3)
```
## Multiple models
A natural application of `map2()` is handling test-training pairs when doing model evaluation. This is an important modelling technique: you should never evaluate a model on the same data it was fit to, because that will make you overconfident. Instead, it's better to divide the data up and use one piece to fit the model and the other piece to evaluate it. A popular approach is cross-validation: you randomly hold out x% of the data and fit the model to the rest, repeating the process a number of times to average over the random variation.
Why you should store related vectors (even if they're lists!) in a
data frame. Need example that has some covariates so you can (e.g.)
select all models for females, or under 30s, ...
Let's start by writing a function that partitions a dataset into test and training sets:
```{r}
# Randomly assign each row of `df` to the test set (TRUE) with probability p
partition <- function(df, p) {
  n <- nrow(df)
  n_test <- round(n * p)
  groups <- rep(c(TRUE, FALSE), c(n_test, n - n_test))
  sample(groups)
}
partition(mtcars, 0.1)
```
We'll generate 200 random test-training splits, and then create lists of test and training datasets:
```{r}
partitions <- rerun(200, partition(mtcars, 0.25))
tst <- partitions %>% map(~mtcars[.x, , drop = FALSE])   # test sets
trn <- partitions %>% map(~mtcars[!.x, , drop = FALSE])  # training sets
```
Then fit the models to each training dataset:
```{r}
mod <- trn %>% map(~lm(mpg ~ wt, data = .))
```
If we wanted, we could extract the coefficients using broom, and make a single data frame with `map_df()` and then visualise the distributions with ggplot2:
```{r}
coef <- mod %>%
map_df(broom::tidy, .id = "i")
coef
library(ggplot2)
ggplot(coef, aes(estimate)) +
geom_histogram(bins = 10) +
facet_wrap(~term, scales = "free_x")
```
But we're most interested in the quality of the models, so we make predictions for each test dataset and compute the root-mean-squared difference between predicted and actual values:
```{r}
pred <- map2(mod, tst, predict)  # predict each model on its test set
actl <- map(tst, "mpg")

# Root-mean-squared difference between predicted and actual values
msd <- function(x, y) sqrt(mean((x - y) ^ 2))
mse <- map2_dbl(pred, actl, msd)
mean(mse)

# Baseline: the same model fit (and evaluated) on the full dataset
mod <- lm(mpg ~ wt, data = mtcars)
base_mse <- msd(mtcars$mpg, predict(mod))
base_mse

ggplot(data.frame(mse = mse), aes(mse)) +
  geom_histogram(binwidth = 0.25) +
  geom_vline(xintercept = base_mse, colour = "red")
```

model-linear.Rmd Normal file

@ -0,0 +1,20 @@
---
layout: default
title: Model
output: bookdown::html_chapter
---
# Model
After reading this chapter, what can you do that you couldn't before?
Focus on fitting a single model, and understanding it with broom. Focus on linear models. Focus on intuition and computational tools. No mathematics.
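A sketch of the kind of workflow this chapter might centre on (assuming the broom package is available):

```{r}
mod <- lm(mpg ~ wt, data = mtcars)
broom::tidy(mod)    # coefficients as a data frame
broom::glance(mod)  # one-row summary of model fit
```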
Review caret and mlr.
"Feature engineering":
* Factors
* Interactions
* Splines
* Log transform
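Each of these can be expressed in R's formula interface; a minimal sketch (the variable choices are purely illustrative):

```{r eval = FALSE}
lm(mpg ~ factor(cyl) +    # factor
     wt * hp +            # interaction (plus main effects)
     splines::ns(disp, 3) + # natural spline
     log(qsec),           # log transform
   data = mtcars)
```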

model-other.Rmd Normal file

@ -0,0 +1,20 @@
---
layout: default
title: Other model families
output: bookdown::html_chapter
---
## Extensions of linear models
* Generalised linear models: logistic, ...
* Hierarchical models
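For example, a logistic regression is a one-line change from `lm()` (the variables here are illustrative only):

```{r eval = FALSE}
# Model a binary outcome by switching to glm() with family = binomial
glm(am ~ wt + hp, data = mtcars, family = binomial)
```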
## Non-linear
* Random forests
## Clustering
Show example of clustering babynames by year.
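A rough sketch of what that example might look like (the number of clusters is an arbitrary choice):

```{r eval = FALSE}
library(babynames)
library(dplyr)
library(tidyr)

# One row per name, one column per year, cells holding prop
wide <- babynames %>%
  filter(sex == "F") %>%
  select(name, year, prop) %>%
  spread(year, prop, fill = 0)

# Cluster names by the shape of their popularity over time
km <- kmeans(scale(wide[-1]), centers = 6)
table(km$cluster)
```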

model-vis.Rmd Normal file

@ -0,0 +1,7 @@
---
layout: default
title: Models and visualisation
output: bookdown::html_chapter
---
Gapminder

rmarkdown.Rmd Normal file

@ -0,0 +1,5 @@
---
layout: default
title: R Markdown
output: bookdown::html_chapter
---

shiny.Rmd Normal file

@ -0,0 +1,5 @@
---
layout: default
title: Shiny
output: bookdown::html_chapter
---


@ -4,18 +4,24 @@ title: Data transformation
output: bookdown::html_chapter
---
# Data transformation
Copy from dplyr vignettes.
## Filter
### Missing values
* Why `NA == NA` is not `TRUE`
* Why default is `na.rm = FALSE`.
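A quick illustration of both points:

```{r}
NA == NA          # NA: if both values are unknown, so is the comparison
x <- c(1, 2, NA)
sum(x)            # NA: missing values propagate by default
sum(x, na.rm = TRUE)
```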
## Mutate
## Arrange
## Select
## Grouped summaries
Overview of different data types and useful summary functions for working with them. Strings and dates covered in more detail in future chapters. Anything complicated can be put off until the data structures chapter.
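A minimal sketch of the pattern (assuming dplyr is loaded):

```{r eval = FALSE}
mtcars %>%
  group_by(cyl) %>%
  summarise(n = n(), mean_mpg = mean(mpg))
```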
### Logical
@ -26,3 +32,5 @@ When used with numeric functions, `TRUE` is converted to 1 and `FALSE` to 0. Thi
### Strings (and factors)
### Date/times
## Grouped mutate