Complete intro

Integrating comments from @behrman
hadley 2016-07-11 10:31:30 -05:00
parent aedadc7a32
commit 7a285374de
6 changed files with 101 additions and 61 deletions

diagrams/intro-rstudio.png Normal file

intro.Rmd

@ -4,7 +4,7 @@
install.packages <- function(...) invisible()
```
Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge. The goal of "R for Data Science" is to introduce you to the most important R tools that you need to do data science. After reading this book, you'll have the tools to tackle a wide variety of data science challenges, using the best parts of R.
Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge. The goal of "R for Data Science" is to help you learn the most important tools in R that will allow you to do data science. After reading this book, you'll have the tools to tackle a wide variety of data science challenges, using the best parts of R.
## What you will learn
@ -16,37 +16,36 @@ knitr::include_graphics("diagrams/data-science.png")
First you must __import__ your data into R. This typically means that you take data stored in a file, database, or web API, and load it into a data frame in R. If you can't get your data into R, you can't do data science on it!
Once you've imported your data, it is a good idea to __tidy__ it. Tidying your data means storing it in a standard form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation. Working with tidy data is important because the consistency lets you spend your time struggling with your questions, not fighting to get data into the right form for different functions.
Once you've imported your data, it is a good idea to __tidy__ it. Tidying your data means storing it in a standard form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation. Tidy data is important because the consistent structure lets you focus your struggle on questions about the data, not fighting to get the data into the right form for different functions.
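As a rough illustration of "each column is a variable, and each row is an observation", the sketch below reshapes a small wide table into a tidy one. The data frame and its counts are invented for this example, and it uses base R only (an assumption made so it runs without extra packages), not the tidying tools the book teaches later.

```r
# Hypothetical counts, invented purely for illustration.
# In this wide layout the years are spread across column names,
# so "year" is not stored as a variable.
wide <- data.frame(
  country = c("A", "B"),
  `1999` = c(10, 20),
  `2000` = c(15, 25),
  check.names = FALSE
)

# Tidy layout: one column per variable (country, year, cases)
# and one row per observation (a country in a year).
tidy <- data.frame(
  country = rep(wide$country, times = 2),
  year = rep(c(1999, 2000), each = nrow(wide)),
  cases = c(wide[["1999"]], wide[["2000"]])
)

print(tidy)
```

Once the data is in this shape, the same column names can be passed to any function that expects one variable per column.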
Once you have tidy data, a common first step is to __transform__ it. You may zero in on a subset of data, add new variables that are functions of existing variables, calculate a set of summary statistics, or sort your data according to values.
Once you have tidy data, a common first step is to __transform__ it. You may zero in on a subset of data, add new variables that are functions of existing variables, or calculate a set of summary statistics.
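The three kinds of transformation mentioned here can be sketched with the built-in `mtcars` dataset. This is a minimal base R sketch, not the approach the book itself teaches; the column name `wt_per_hp` is made up for illustration.

```r
# Zero in on a subset: keep only the four-cylinder cars.
four_cyl <- subset(mtcars, cyl == 4)

# Add a new variable that is a function of existing variables.
# `wt_per_hp` is a hypothetical name chosen for this example.
with_ratio <- transform(mtcars, wt_per_hp = wt / hp)

# Calculate a summary statistic: mean mpg for each number of cylinders.
mean_mpg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

print(mean_mpg)
```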
There are two main engines of knowledge generation: visualisation and modelling. These have complementary strengths and weaknesses so any real analysis will iterate between them many times. For example, you might see a scatterplot that inspires you to fit a linear model. Then you transform the data to add a column of residuals from the model, and look at another scatterplot, this time of the residuals.
Once you have tidy data with the variables you need, there are two main engines of knowledge generation: visualisation and modelling. These have complementary strengths and weaknesses so any real analysis will iterate between them many times.
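One pass through the loop between visualisation and modelling might look like the sketch below. It uses the built-in `mtcars` data and base R plotting as stand-ins; the choice of variables and the column name `residual` are assumptions made for illustration.

```r
# Visualise: a scatterplot suggests a roughly linear relationship.
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon")

# Model: fit a linear model inspired by the plot.
fit <- lm(mpg ~ wt, data = mtcars)

# Transform: add the model's residuals as a new column.
# `residual` is a hypothetical column name.
cars <- transform(mtcars, residual = resid(fit))

# Visualise again: look at what the model did not capture.
plot(cars$wt, cars$residual, xlab = "Weight", ylab = "Residual")
abline(h = 0, lty = 2)
```

Any pattern left in the second plot is a prompt to refine the model and repeat the cycle.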
__Visualisation__ is a fundamentally human activity. A good visualisation will show you things that you did not expect, or raise new questions of the data. A good visualisation might also hint that you're asking the wrong question and you need to refine your thinking. In short, visualisations can surprise you. However, visualisations don't scale particularly well.
__Models__ are the complementary tools to visualisation. Models are a fundamentally mathematical or computational tool, so they generally scale well. Even when they don't, it's usually cheaper to buy more computers than it is to buy more brains. But every model makes assumptions, and by its very nature a model cannot question its own assumptions. That means a model cannot fundamentally surprise you.
The last step of data science is __communication__, an absolutely critical part of any data analysis project. It doesn't matter how well models and visualisation have led you to understand the data, unless you can communicate your results to other people.
The last step of data science is __communication__, an absolutely critical part of any data analysis project. It doesn't matter how well your models and visualisation have led you to understand the data unless you can also communicate your results to others.
There's one important toolset that's not shown in the diagram: programming. Programming is a cross-cutting tool that you use in every part of the project. You don't need to be an expert programmer to be a data scientist, but learning more about programming pays off. Becoming a better programmer will allow you to automate common tasks, and solve new problems with greater ease.
You'll use these tools in every data science project, but for most projects they're not enough. There's a rough 80-20 rule at play: you can probably tackle 80% of every project using the tools that we'll teach you, but you'll need more to tackle the remaining 20%. Throughout this book we'll point you to resources where you can learn more.
You'll use these six tools in every data science project, but for most projects they're not enough. There's a rough 80-20 rule at play: you can probably tackle 80% of every project using the tools that you'll learn in this book, but you'll need more to tackle the remaining 20%. Throughout this book we'll point you to resources where you can learn more.
## How you will learn
The above description of the tools of data science is organised roughly around the order in which you use them in analysis (although of course you'll iterate through them multiple times). In our experience, however, this is not the best way to learn them:
* Starting with data ingest and tidying is sub-optimal because 80% of the time
it's routine and boring, and the other 20% of the time it's horrendously
it's routine and boring, and the other 20% of the time it's weird and
frustrating. Instead, we'll start with visualisation and transformation on
data that's already been imported and tidied. That way, when you ingest
and tidy your own data, you'll be able to keep your motivation high because
you know the pain is worth it because of what you can accomplish once it's
done.
and tidy your own data, your motivation will stay high because you know the
pain is worth it.
* Some topics are best explained with other tools. For example, we believe that
it's easier to understand how models work as a tool for data science if you
already know about visualisation, data transformation, and tidy data.
already know about visualisation, data transformation, and tidy data.
* Programming tools are not necessarily interesting in their own right,
but do allow you to tackle considerably more challenging problems. We'll
@ -54,41 +53,51 @@ The above description of the tools of data science is organised roughly around t
then finish off by showing how they combine with the key data science tools
to tackle interesting problems.
Within each chapter, we try and stick to a similar pattern: start with some motivating examples so you can see the bigger picture, and then dive into the details. Each section of the book is paired with exercises to help you practice what you've learned. It's tempting to skip these, but there's no better way to learn than practicing.
Within each chapter, we try and stick to a similar pattern: start with some motivating examples so you can see the bigger picture, and then dive into the details. Each section of the book is paired with exercises to help you practice what you've learned. While it's tempting to skip the exercises, there's no better way to learn than practicing on real problems.
## What you won't learn
There are some important topics that this book doesn't cover. We believe it's important to stay ruthlessly focused on the essentials so you can get up and running as quickly as possible. That means this book can't cover every important topic.
### Big n data (many observations)
### Big data
This book proudly focuses on small, in-memory datasets. This is the right place to start because you can't tackle big data unless you have experience with small data. The tools you learn in this book will easily handle hundreds of megabytes of data, and with a little care you can typically use them to work with 1-2 Gb of data. If you're routinely working with larger data (10-100 Gb, say), you should learn more about [data.table](https://github.com/Rdatatable/data.table). We don't teach data.table here because it has a very concise interface that is harder to learn because it offers fewer linguistic cues. But if you're working with large data, the performance payoff is worth a little extra effort to learn it.
This book proudly focuses on small, in-memory datasets. This is the right place to start because you can't tackle big data unless you have experience with small data. The tools you learn in this book will easily handle hundreds of megabytes of data, and with a little care you can typically use them to work with 1-2 Gb of data. If you're routinely working with larger data (10-100 Gb, say), you should learn more about [data.table](https://github.com/Rdatatable/data.table). This book doesn't teach data.table because it has a very concise interface that is harder to learn because it offers fewer linguistic cues. But if you're working with large data, the performance payoff is worth the extra effort required to learn it.
Many big data problems are often small data problems in disguise. Often your complete dataset is big, but the data needed to answer a specific question is small. It's often possible to find a subset, subsample, or summary that fits in memory and still allows you to answer the question that you're interested in. The challenge here is finding the right small data, which often requires a lot of iteration. We'll touch on this idea in [transform](#transform).
If your data is bigger than this, carefully consider if your big data problem might actually be a small data problem in disguise. While the complete data might be big, often the data needed to answer a specific question is small. You might be able to find a subset, subsample, or summary that fits in memory and still allows you to answer the question that you're interested in. The challenge here is finding the right small data, which often requires a lot of iteration. We'll touch on this idea in [[Data transformation]].
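A minimal sketch of the subset, subsample, and summary idea follows. The data frame `big` is simulated here as a stand-in for a dataset that would not normally fit comfortably in memory; its columns and sizes are assumptions made for illustration.

```r
set.seed(2016)

# Simulated stand-in for a large dataset (hypothetical columns).
n <- 1e5
big <- data.frame(id = seq_len(n), value = rnorm(n))

# Subset: keep only the rows relevant to the question.
positive <- big[big$value > 0, ]

# Subsample: a random sample of 1,000 rows.
small <- big[sample(nrow(big), 1000), ]

# Summary: a handful of numbers that describe the whole dataset.
summary_stats <- c(mean = mean(big$value), sd = sd(big$value))
```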
Another class of big data problem consists of many small data problems. Each individual problem might fit in memory, but you have millions of them. For example, you might want to fit a model to each person in your dataset. That would be trivial if you had just 10 or 100 people, but instead you have a million. Fortunately each problem is independent (sometimes called embarrassingly parallel), so you just need a system (like Hadoop) that allows you to send different datasets to different computers for processing. Once you've figured out how to answer the question for a single subset using the tools described in this book, you can use packages like SparkR, rhipe, and ddr to solve it for the complete dataset.
Another possibility is that your big data problem is actually a large number of small data problems. Each individual problem might fit in memory, but you have millions of them. For example, you might want to fit a model to each person in your dataset. That would be trivial if you had just 10 or 100 people, but instead you have a million. Fortunately each problem is independent of the others (a setup that is sometimes called embarrassingly parallel), so you just need a system (like Hadoop) that allows you to send different datasets to different computers for processing. Once you've figured out how to answer the question for a single subset using the tools described in this book, you can use packages like sparklyr, rhipe, and ddr to solve it for the full dataset.
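The "many small problems" pattern can be sketched on a single machine with base R. Here the groups of `mtcars` defined by `cyl` stand in for the "one model per person" example; this is an illustration under that assumption, not a distributed implementation.

```r
# Split the data into independent pieces, one per group.
by_cyl <- split(mtcars, mtcars$cyl)

# Solve the same small problem for every piece.
# Each fit depends only on its own subset, so the pieces could be
# sent to different workers (for example by swapping lapply() for
# parallel::mclapply()).
models <- lapply(by_cyl, function(d) lm(mpg ~ wt, data = d))

# Combine the results: one slope per group.
slopes <- vapply(models, function(m) coef(m)[["wt"]], numeric(1))

print(slopes)
```

Working out the function for a single subset first, then applying it to every subset, is the same split-apply-combine shape that larger systems distribute across machines.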
### Big p data (many variables)
### Python
### Python, Julia, and friends
In this book, you won't learn anything about Python, Julia, or any other programming language useful for data science. This isn't because we think these tools are bad. They're not! And in practice, most data science teams use a mix of languages, often at least R and Python.
However, we strongly believe that it's best to master one tool at a time. You will get better faster if you dive deep, rather than spreading yourself thinly over many topics. This doesn't mean you should only know one thing, just that you'll generally learn faster if you stick to one thing at a time.
However, we strongly believe that it's best to master one tool at a time. You will get better faster if you dive deep, rather than spreading yourself thinly over many topics. This doesn't mean you should only know one thing, just that you'll generally learn faster if you stick to one thing at a time. You should strive to learn new things throughout your career, but make sure your understanding is solid before you move on to the next interesting thing.
### Non-rectangular data
This book focuses exclusively on structured data sets: collections of values that are each associated with a variable and an observation. There are lots of data sets that do not naturally fit in this paradigm: images, sounds, trees, text. But data frames are extremely common in science and in industry and we believe that they're a great place to start your data analysis journey.
### Inference
### Hypothesis confirmation
Exploratory vs. confirmatory
It's possible to divide data analysis into two camps: hypothesis generation, and hypothesis confirmation (sometimes called confirmatory analysis). The focus of this book is unabashedly on hypothesis generation, or data exploration. Here you'll look deeply at the data and, in combination with your subject knowledge, generate many interesting hypotheses to help explain why the data behaves the way it does. You evaluate the hypotheses informally, using your scepticism to challenge the data in multiple ways.
Most people think of models as confirmatory and visualisations as exploratory. But you can have confirmatory visualisations and exploratory models. This book focuses on exploration.
The complement of hypothesis generation is hypothesis confirmation. Hypothesis confirmation is hard for two reasons:
### Formal Statistics and Machine Learning
1. You need a precise mathematical model in order to generate falsifiable
predictions. This often requires considerable statistical sophistication.
To learn more about statistical modelling we recommend *Statistical
Modeling: A Fresh Approach* by Danny Kaplan; *An Introduction to
Statistical Learning* by James, Witten, Hastie, and Tibshirani; and
*Applied Predictive Modeling* by Kuhn and Johnson.
This book focuses on practical tools for understanding your data: visualization, modelling, and transformation. You can develop your understanding further by learning probability theory, statistical hypothesis testing, and machine learning methods; but we won't teach you those things here. There are many books that cover these topics, but few that integrate the other parts of the data science process. When you are ready, you can and should read books devoted to each of these topics. We recommend *Statistical Modeling: A Fresh Approach* by Danny Kaplan; *An Introduction to Statistical Learning* by James, Witten, Hastie, and Tibshirani; and *Applied Predictive Modeling* by Kuhn and Johnson.
1. You can only use an observation once to confirm a hypothesis. As soon as
you use it more than once you're back to doing exploratory analysis.
This means to do hypothesis confirmation you need to "preregister"
(write out in advance) your analysis plan, and not deviate from it
even when you have seen the data. We'll talk a little about some
strategies you can use to make this easier in [[model assessment]].
It's common to think about modelling as a tool for hypothesis confirmation, and visualisation as a tool for hypothesis generation. But that's a false dichotomy: models are often used for exploration, and with a little care you can use visualisation for confirmation. The key difference is how often you look at each observation: if you look only once, it's confirmation; if you look more than once, it's exploration.
## Prerequisites
@ -96,7 +105,7 @@ We've made few assumptions about what you already know in order to get the most
To run the code in this book, you will need to install both R and the RStudio IDE, an application that makes R easier to use. Both are open source, free and easy to install:
1. Download and install R, <https://www.r-project.org/alt-home/>.
1. Download and install R, <https://www.r-project.org/>.
1. Download and install RStudio, <http://www.rstudio.com/download>.
1. Install needed packages (see below).
@ -104,23 +113,23 @@ To run the code in this book, you will need to install both R and the RStudio ID
RStudio is an integrated development environment, or IDE, for R programming. There are three key regions:
```{r, echo = FALSE}
knitr::include_graphics("screenshots/rstudio-layout.png")
```{r echo = FALSE}
knitr::include_graphics("diagrams/intro-rstudio.png")
```
You run R code in the __console__ pane. Textual output appears inline, and graphical output appears in the __output__ pane. You write more complex R scripts in the __editor__ pane.
There are three keyboard shortcuts for the RStudio IDE that we strongly encourage you to learn because they'll save you so much time:
* Cmd + Enter: sends the current line (or current selection) from the editor to
the console and runs it. (Ctrl + Enter on a PC)
* Cmd/Ctrl + Enter: sends the current line (or current selection) from the editor to
the console and runs it.
* Tab: suggest possible completions for the text you've typed.
* Cmd + ↑: in the console, searches all commands you've typed that start with
those characters. (Ctrl + ↑ on a PC)
* Cmd/Ctrl + ↑: in the console, searches all commands you've typed that start with
those characters.
If you want to see a list of all keyboard shortcuts, use the meta keyboard shortcut Alt + Shift + K: that's the keyboard shortcut to show all the other keyboard shortcuts.
If you want to see a list of all keyboard shortcuts, use the meta shortcut Alt + Shift + K: that's the keyboard shortcut to show all the other keyboard shortcuts!
We strongly recommend making two changes to the default RStudio options:
@ -132,18 +141,19 @@ This ensures that every time you restart RStudio you get a completely clean slat
### R packages
You'll also need to install some R packages. An R _package_ is a collection of functions, data, and documentation that extends the capabilities of base R. Using packages is key to the successful use of R. To install all the packages used in this book open RStudio and run:
You'll also need to install some R packages. An R _package_ is a collection of functions, data, and documentation that extends the capabilities of base R. Using packages is key to the successful use of R. To install the packages you'll need for this book open RStudio and run:
```{r}
pkgs <- c(
"broom", "dplyr", "ggplot2", "jpeg", "jsonlite",
"knitr", "Lahman", "microbenchmark", "png", "pryr", "purrr",
"rcorpora", "readr", "stringr", "tibble", "tidyr"
"broom", "dplyr", "ggplot2", "jsonlite", "Lahman", "purrr",
"rcorpora", "readr", "rmarkdown", "stringr", "tibble", "tidyr"
)
```
```{r, eval = FALSE}
install.packages(pkgs)
```
R will download the packages from CRAN and install them in your system library. If you have problems installing, make sure that you are connected to the internet, and that you haven't blocked <http://cran.r-project.org> in your firewall or proxy.
R will download the packages from CRAN and install them on to your computer. If you have problems installing, make sure that you are connected to the internet, and that <https://cloud.r-project.org/> isn't blocked by your firewall or proxy.
You will not be able to use the functions, objects, and help files in a package until you load it with `library()`. After you have downloaded the packages, you can load any of the packages into your current R session with the `library()` command, e.g.
@ -153,42 +163,72 @@ library(tidyr)
You will need to reload the package every time you start a new R session.
## Getting help
## Getting help and learning more
* Google. Always a great place to start! Adding "R" to a query is usually
enough to filter it down. If you ever hit an error message that you
don't know how to handle, it is a great idea to Google it.
This book is not an island: there is no single resource that will allow you to master R. As you start to apply the techniques described in this book to your own data you will soon find questions that we do not answer. This section describes a few tips to help you get help, and to help you keep learning.
If you get stuck, start with Google. Typically adding "R" to a query is enough to restrict it to relevant results: if the search isn't useful, it often means that there aren't any R-specific results available. Google is particularly useful for error messages. If you get an error message and you have no idea what it means, try googling it! Chances are that someone else has been confused by it in the past, and there will be help somewhere on the web. (If the error message isn't in English, run `Sys.setenv(LANGUAGE = "en")` and re-run the code; you're more likely to find help for English error messages.)
If Google doesn't help, try [stackoverflow](http://stackoverflow.com). Start by spending a little time searching for an existing answer (including `[R]` to restrict your search). If you don't find anything useful, next prepare a minimal reproducible example or __reprex__. A good reprex makes it easier for other people to help you, and often you'll figure out the problem yourself in the course of making it.
There are three things you need to include to make your example reproducible: required packages, data, and code.
1. **Packages** should be loaded at the top of the script, so it's easy to
see which ones the example needs. This is a good time to check that you're
using the latest version of each package: it's possible you've discovered
a bug that has been fixed since you installed the package.
1. The easiest way to include **data** in a question is to use `dput()` to
generate the R code to recreate it. For example, to recreate the `mtcars`
dataset in R, I'd perform the following steps:
1. Run `dput(mtcars)` in R
2. Copy the output
3. In my reproducible script, type `mtcars <- ` then paste.
If your operating system defaults to another language, you can use
`Sys.setenv(LANGUAGE = "en")` to tell R to use English. That's likely to
get you to common solutions more quickly.
* Stack Overflow. Be sure to read and use [How to make a reproducible example](http://adv-r.had.co.nz/Reproducibility.html)([reprex](https://github.com/jennybc/reprex)) before posting. Unfortunately the R Stack Overflow community is not always the friendliest.
## Keeping up to date
Try and find the smallest subset of your data that still reveals
the problem.
* The best place to keep up with what Hadley and Garrett (and everyone
else at RStudio) are doing is the [RStudio blog](https://blog.rstudio.org);
this is where we post announcements about new packages, new IDE features,
and in-person courses.
1. Spend a little bit of time ensuring that your **code** is easy for others to
read:
* Twitter. You might want to follow
Hadley ([@hadleywickham](https://twitter.com/hadleywickham)) or
Garrett ([@statgarrett](https://twitter.com/statgarrett)) on twitter.
Another resource is the `#rstats` hashtag: if you have a question about
R you can tag it with `#rstats` and other R users will see it. And
you can follow the hashtag to keep up with what's going on in the
community.
* Make sure you've used spaces and your variable names are concise, yet
informative.
* Use comments to indicate where your problem lies.
* Do your best to remove everything that is not related to the problem.
The shorter your code is, the easier it is to understand, and the
easier it is to fix.
Finish by checking that you have actually made a reproducible example by starting a fresh R session and copying and pasting your script in.
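The `dput()` step described above can be checked in a small sketch. It takes a three-row slice of the built-in `mtcars` data, captures the code that `dput()` prints, and confirms that running that code recreates the same data frame. The object names `small`, `code`, and `recreated` are made up for this example.

```r
# A small slice of data that still shows the problem
# (here just three rows and two columns of mtcars).
small <- head(mtcars[, c("mpg", "wt")], 3)

# dput() prints R code that recreates the object; capture it as text.
# In practice you would copy this output into your question.
code <- paste(capture.output(dput(small)), collapse = "\n")
cat(code, "\n")

# Running the captured code rebuilds the data frame.
recreated <- eval(parse(text = code))
all.equal(recreated, small)
```

Keeping the slice small makes the pasted output short enough to read in a question.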
You should also spend some time preparing yourself to solve problems before they occur. Investing a little time in learning R each day will pay off handsomely in the long run. One way is to follow what Hadley, Garrett, and everyone else at RStudio are doing on the [RStudio blog](https://blog.rstudio.org). This is where we post announcements about new packages, new IDE features, and in-person courses. You might also want to follow Hadley ([@hadleywickham](https://twitter.com/hadleywickham)) or Garrett ([@statgarrett](https://twitter.com/statgarrett)) on twitter, or follow [@rstudiotips](https://twitter.com/rstudiotips) to keep up with new features in the IDE.
To keep up with the R community more broadly, we recommend reading <http://www.r-bloggers.com>: it aggregates over 500 blogs about R from around the world. If you're an active twitter user, following the `#rstats` hashtag on twitter is also a great way to keep up with the latest and greatest. That's one of the key tools that Hadley uses to keep up with new developments in the community.
## Acknowledgements
This book isn't just the product of Hadley and Garrett, but is the result of many conversations (in person and online) that we've had with the R community. There are a few people we'd like to thank specifically because they have spent many hours answering our dumb questions and helping us to better think about data science:
* Jenny Bryan and Lionel Henry for many helpful discussions around working
with lists and list-columns.
* Genevera Allen for discussions about models and modelling.
* Genevera Allen for discussions about models, modelling, the statistical
learning perspective, and the difference between hypothesis generation and
hypothesis confirmation.
* Yihui Xie for his work on the bookdown package, and for tirelessly satisfying
all my feature requests.
This book was written in the open, so a special thanks goes to everyone who contributed via GitHub:
__INSERT HERE__
## Colophon
An online version of this book is available at <http://r4ds.had.co.nz>. It will continue to evolve in between reprints of the physical book. The source of the book is available at <http://github.com/hadley/r4ds>. The book is powered by <https://bookdown.org> which makes it easy to turn R markdown files into html, pdf, and epub.
This book was built with:
```{r}
