Working on intro

hadley 2015-12-07 13:55:44 -06:00
parent c18b4a4f40
commit 6b87228742
7 changed files with 99 additions and 61 deletions

7
common.R Normal file

@ -0,0 +1,7 @@
set.seed(1014)
options(digits = 3)
knitr::opts_chunk$set(
comment = "#>",
collapse = TRUE
)

Binary file not shown.

BIN
diagrams/data-science.png Normal file

Binary file not shown.


135
intro.Rmd

@ -3,48 +3,50 @@ layout: default
title: Welcome
output: bookdown::html_chapter
---
```{r setup, include = FALSE}
source("common.R")
install.packages <- function(...) invisible()
```
# Welcome
## Overview
The goal of "R for Data Science" is to give you a solid foundation in using R to do data science. The goal is not to be exhaustive, but instead to focus on what we think are the critical skills for data science:
Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge. The goal of "R for Data Science" is to introduce you to the most important tools that you need to do data science in R. After reading this book, you'll have the tools to tackle a wide variety of data science challenges, using the best parts of R.
* Getting your data into R so you can work with it, whether it lives on disk,
in a database, or on the web.
## What you will learn
* Wrangling your data into a tidy form, so it's easier to work with. This lets
you spend your time struggling with your questions, not fighting to get data
into the right form for different functions.
* Transforming your data to add variables and compute basic summaries.
Data science is a huge field, and there's no way you can master it by reading a single book. The goal of this book is to give you a solid foundation in the most important tools. These are the tools that, in our experience, people use every day. There's definitely an 80-20 rule at play: you'll do 80% of every project using this handful of tools, but the remaining 20% is much more variable. Our goal is to teach you that 80% and to point you to where you can learn more.
* Visualising your data to gain insight. Visualisations are one of the most
important tools of data science because they can surprise you: you can
see something in a visualisation that you did not expect. Visualisations
also help you refine your questions of the data.
We think about data science as using six main tools:
* Modelling your data to scale visualisations to larger datasets, and to
remove strong patterns. Modelling is a very deep topic - we can't possibly
cover all the details, but we'll give you a taste of how you can use it,
and where you can go to learn more.
`r bookdown::embed_png("diagrams/data-science.png")`
* Communicating your results to others. It doesn't matter how great your
analysis is unless you can communicate the results to others. We'll show
how you can create static reports with rmarkdown, and interactive apps with
shiny.
First you must __import__ your data into R. This typically means that you take data stored in a file, in a database, or in a web API, and load it into a data frame in R. If you can't get your data into R, you can't do data science on it!
[Hadley's standard data science diagram]
Once you've imported your data, it's a good idea to __tidy__ it. Tidying your data means storing it in a standard form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation. Working with tidy data is important because the consistency lets you spend your time struggling with your questions, not fighting to get data into the right form for different functions.
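To make "each column is a variable, and each row is an observation" concrete, here is a minimal base R sketch. The `untidy` data frame and its values are made up for illustration; they do not come from the book.

```r
# Hypothetical untidy data: the year is spread across column names
untidy <- data.frame(
  country = c("A", "B"),
  cases_1999 = c(10, 20),
  cases_2000 = c(11, 22)
)

# Tidy version: one row per country-year observation,
# one column per variable (country, year, cases)
tidy <- data.frame(
  country = rep(untidy$country, times = 2),
  year = rep(c(1999, 2000), each = nrow(untidy)),
  cases = c(untidy$cases_1999, untidy$cases_2000)
)

tidy
```

In the tidy form, `year` is an ordinary column, so you can filter, sort, or plot by it like any other variable.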
## Learning data science
Once you have tidy data, a common first step is to __transform__ it to add new variables that are functions of existing variables (like computing speed from distance and time), to rename the variables to be easier to understand, to sort your data, or to summarise it.
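The four kinds of transformation mentioned here can be sketched in base R. The `trips` data frame is a made-up example, and base R is used only so the snippet runs without extra packages; it is not necessarily the approach the book goes on to teach.

```r
# Hypothetical data: three trips with distance (km) and time (h)
trips <- data.frame(
  trip = c("a", "b", "c"),
  distance = c(120, 45, 300),
  time = c(2, 0.5, 4)
)

# Add a new variable that is a function of existing variables
trips$speed <- trips$distance / trips$time

# Rename a variable to something easier to understand
names(trips)[names(trips) == "time"] <- "hours"

# Sort the rows by the new variable, fastest first
trips <- trips[order(trips$speed, decreasing = TRUE), ]

# Summarise the data with a single number
mean_speed <- mean(trips$speed)
```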
Above, I've listed the components of the data science process in roughly the order you'll encounter them in an analysis (although of course you'll iterate multiple times). This, however, is not the order you'll encounter them in this book. This is because:
There are two main engines of knowledge generation: visualisation and modelling. These have complementary strengths and weaknesses so any real analysis will iterate between them many times. For example, you might see a scatterplot that inspires you to fit a linear model, then you transform the data to add a column of residuals from the model, and look at another scatterplot.
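One pass through that loop might look like the sketch below. It uses the `cars` dataset that ships with R; the choice of dataset and the column name `residual` are assumptions made for illustration.

```r
# Step 1: look at a scatterplot of the raw data
plot(cars$speed, cars$dist)

# Step 2: the plot suggests a roughly linear trend, so fit a linear model
fit <- lm(dist ~ speed, data = cars)

# Step 3: transform the data to add a column of residuals from the model
cars_resid <- cars
cars_resid$residual <- resid(fit)

# Step 4: look at another scatterplot, this time of what the model missed
plot(cars_resid$speed, cars_resid$residual)
```

Any pattern left in the second plot is something the model did not capture, which may prompt the next round of modelling.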
__Visualisation__ is a fundamentally human activity. A good visualisation will show you things that you did not expect, or raise new questions of the data. A good visualisation might also hint that you're asking the wrong question and you need to refine your thinking. In short, visualisations can surprise you, but don't scale particularly well.
__Models__ are the complementary tools to visualisation. Models are fundamentally mathematical or computational tools, so they generally scale well. Even when they don't, it's usually cheaper to buy more computers than it is to buy more brains. But every model makes assumptions, and by its very nature a model cannot question its own assumptions. That means a model cannot fundamentally surprise you.
It doesn't matter how well models and visualisation have led you to understand the data unless you can __communicate__ your results to other people. Communication is an absolutely critical part of any data analysis project.
## How you will learn
Above, I've listed the components of the data science process in roughly the order you'll encounter them in an analysis (although of course you'll iterate multiple times). In our experience, however, this is not the best way to learn them:
* Starting with data ingest is boring. It's much more interesting to learn
some new visualisation and manipulation tools on data that's already been
imported and cleaned. You'll later learn the skills to apply these new ideas
to your own data.
* You need to learn some cross-cutting tools, such as programming and the
RStudio IDE.
* Some topics, like modelling, are best explained with other tools, like
visualisation and manipulation. These topics need to come later in the book.
@ -69,9 +71,13 @@ These terms will help us speak precisely about the different parts of a data set
This book focuses exclusively on structured data sets: collections of values that are each associated with a variable and an observation. There is lots of data that doesn't naturally fit in this paradigm: images, sounds, trees, text. But data frames are extremely common in science and in industry and we believe that they're a great place to start your data analysis journey.
## R and big data
## What you won't learn
This book also focuses almost exclusively on in-memory datasets.
There are some important topics that this book doesn't cover. Here I want to talk about them briefly and tell you why.
### Big data
This book proudly focuses on in-memory, or small, datasets.
* Small data: data that fits in memory on a laptop, ~10 GB. Note that small
data is still big! R is great with small data. Pointer to data.table.
@ -103,23 +109,48 @@ The other thing to bear in mind, is that while all your data might be big, typic
parallel), so you just need a system (like hadoop) that allows you to
send different datasets to different computers for processing.
### Python
In this book, you won't learn anything about Python, or any other programming language. This isn't because we think Python is bad! It's a great tool, and most data science teams use a mix of (at least!) R and Python.
However, we strongly believe that it's best to master one tool at a time. You will get better faster if you dive deep, rather than spreading yourself thinly over many topics. This doesn't mean you should only know one thing, just that you'll generally learn faster if you stick to one thing at a time.
### Non-data-frame data
No trees or graphs, images or sounds.
## Prerequisites
To run the code in this book, you will need to install both R and the RStudio IDE, an application that makes R easier to use. Both are free and easy to install.
We've made few assumptions about what you already know in order to get the most out of this book. You should be generally numerically literate, and it's helpful if you have some programming experience already. If you've never programmed before, you might find [Hands-On Programming with R](http://amzn.com/1449359019) by Garrett Grolemund to be a useful adjunct to this book.
### R
To run the code in this book, you will need to install both R and the RStudio IDE, an application that makes R easier to use. Both are open source, free and easy to install:
To install R, visit <http://cran.r-project.org> and click the link that matches your operating system. What you do next will depend on your operating system.
1. Download and install R from <https://www.r-project.org/alt-home/>.
1. Download and install RStudio from <http://www.rstudio.com/download>.
1. Open RStudio like you would any other application.
* Mac users should click the `.pkg` file at the top of the page. This file contains the most current release of R. Once the file is downloaded, double click it to open an R installer. Follow the directions in the installer to install R.
You'll also need to install some R packages. An R _package_ is a collection of functions, data, and documentation that extends the capabilities of base R. Using packages is key to the successful use of R. To install all the packages used in this book open RStudio and run:
* Windows users should click "base" and then download the most current version of R, which will be linked at the top of the page.
```{r}
pkgs <- c(
"DBI", "devtools", "dplyr", "ggplot2", "haven", "knitr", "lubridate",
"packrat", "readr", "rmarkdown", "RSQLite", "rvest", "scales", "shiny",
"stringr", "tidyr"
)
install.packages(pkgs)
```
* Linux users should select their distribution and then follow the distribution specific instructions to install R. <https://cran.r-project.org/bin/linux/> includes these instructions alongside the files to download.
R will download the packages from CRAN and install them in your system library. If you have problems installing, make sure that you are connected to the internet, and that you haven't blocked <http://cran.r-project.org> in your firewall or proxy.
### RStudio
After you have downloaded the packages, you can load any of the packages into your current R session with the `library()` command, e.g.
After you install R, visit <http://www.rstudio.com/download> to download the RStudio IDE. Choose the installer for your system. Then click the link to download the application. Once you have the application, installation is easy. Once RStudio IDE is installed, open it as you would open any other application.
```{r, eval = FALSE}
library(tidyr)
```
You will not be able to use the functions, objects, and help files in a package until you load it with `library()`. You will need to reload the package if you start a new R session.
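As a small illustration of what `library()` does, the sketch below attaches `splines`, a package that ships with R, so it runs without an internet connection. Using `splines` here is an assumption made for convenience; it is not one of the packages installed above.

```r
# `splines` is installed with R but is not attached in a default session
library(splines)

# Once attached, the package appears on the search path ...
"package:splines" %in% search()

# ... and its functions can be called by name
basis <- bs(1:10, df = 4)
dim(basis)
```

Starting a new R session resets the search path, which is why the `library()` call has to be run again.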
## RStudio
Brief RStudio orientation (code, console, and output). Pointers to where to learn more.
@ -134,29 +165,7 @@ Important keyboard shortcuts:
Note about turning on save/load session off.
### R Packages
An R _package_ is a collection of functions, data sets, and help files that extends the R language. We will use a lot of R packages in this book. To install them all, open RStudio and run:
```{r eval = FALSE}
install.packages(c(
"DBI", "devtools", "dplyr", "ggplot2", "haven", "knitr", "lubridate",
"packrat", "readr", "rmarkdown", "RSQLite", "rvest", "scales", "shiny",
"stringr", "tidyr"
))
```
R will download the packages from CRAN and install them in your system library. If you have problems installing, make sure that you are connected to the internet, and that you haven't blocked <http://cran.r-project.org> in your firewall or proxy.
After you have downloaded the packages, you can load any of the packages into your current R session with the `library()` command, e.g.
```{r eval = FALSE}
library(tidyr)
```
You will not be able to use the functions, objects, and help files in a package until you load it with `library()`. You will need to reload the package if you start a new R session.
### Getting help
## Getting help
* Google. Always a great place to start! Adding "R" to a query is usually
enough to filter it down. If you ever hit an error message that you
@ -178,3 +187,11 @@ You will not be able to use the functions, objects, and help files in a package
* Jenny Bryan and Lionel Henry for many helpful discussions around working
with lists and list-columns.
## Colophon
This book was built with:
```{r}
devtools::session_info(pkgs)
```


@ -6,8 +6,7 @@ output: bookdown::html_chapter
```{r setup, include=FALSE}
library(purrr)
set.seed(1014)
options(digits = 3)
source("common.R")
source("images/embed_jpg.R")
```


@ -5,6 +5,10 @@ output: bookdown::html_chapter
---
[Applied Predictive Modeling](http://amzn.com/1461468485).
[An Introduction to Statistical Learning](http://amzn.com/1461471370)
## Extensions of linear models
* Generalised linear models: logistic, ...


@ -3,3 +3,14 @@ layout: default
title: R Markdown
output: bookdown::html_chapter
---
# Communication
Recommendations for learning more about communication:
For writing: [Style: Lessons in Clarity and Grace](http://amzn.com/0321898680).
For presentations:
For expository visualisations: WSJ guide?