Intro tweaking

Add contributors
hadley 2016-07-20 10:07:51 -05:00
parent 220f3eaa7c
commit 1c32a4a1d9
2 changed files with 107 additions and 46 deletions

48
contribs.txt Normal file

@ -0,0 +1,48 @@
393 hadley
80 Garrett
77 Hadley Wickham
10 Radu Grosu
6 Brandon Greenwell
6 Bill Behrman
6 jjchern
5 kdpsingh
3 Ian Lyttle
3 Jennifer (Jenny) Bryan
3 Yihui Xie
3 OaCantona
3 behrman
2 MJMarshall
2 sibusiso16
2 Jim Hester
2 Joanne Jang
2 Devin Pastoor
2 Kirill Sevastyanenko
2 spirgel
2 rlzijdeman
2 robinlovelace
1 harrismcgehee
1 jennybc
1 nate-d-olson
1 shoili
1 Peter Hurford
1 Alex
1 Ben Marwick
1 Colin Gillespie
1 Curtis Alexander
1 Daniel Gromer
1 Earl Brown
1 Flemming Villalona
1 Garrett Grolemund
1 Ian Sealy
1 Jakub Nowosad
1 Julia Stewart Lowndes
1 Kenny Darrell
1 KyleHumphrey
1 Lawrence Wu
1 Mustafa Ascha
1 Nelson Areal
1 Patrick Kennedy
1 Ahmed ElGabbas
1 TJ Mahr
1 Tom Prior
1 adi pradhan

105
intro.Rmd

@ -1,63 +1,59 @@
# Introduction
```{r setup-intro, include = FALSE}
install.packages <- function(...) invisible()
```
Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge. The goal of "R for Data Science" is to help you learn the most important tools R that will allow you to do data science. After reading this book, you'll have the tools to tackle a wide variety of data science challenges, using the best parts of R.
Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge. The goal of "R for Data Science" is to help you learn the most important tools in R that will allow you to do data science. After reading this book, you'll have the tools to tackle a wide variety of data science challenges, using the best parts of R.
## What you will learn
Data science is a huge field, and there's no way you can master it by reading a single book. The goal of this book is to give you a solid foundation with the most important tools. Our model of the tools needed in a typical data science project looks something like this:
Data science is a huge field, and there's no way you can master it by reading a single book. The goal of this book is to give you a solid foundation in the most important tools. Our model of the tools needed in a typical data science project looks something like this:
```{r echo = FALSE}
```{r echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/data-science.png")
```
First you must __import__ your data into R. This typically means that you take data stored in a file, database, or web API, and load it into a data frame in R. If you can't get your data into R, you can't do data science on it!
Once you've imported your data, it is a good idea to __tidy__ it. Tidying your data means storing it in a standard form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation. Tidy data is important because the consistent structure lets you focus your struggle on questions about the data, not fighting to get the data into the right form for different functions.
Once you've imported your data, it is a good idea to __tidy__ it. Tidying your data means storing it in a consistent form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation. Tidy data is important because the consistent structure lets you focus your struggle on questions about the data, not fighting to get the data into the right form for different functions.
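As a rough illustration of "each column is a variable, and each row is an observation", here is a minimal base R sketch. The `wide` table and its values are hypothetical and made up for this example, and base R's `stack()` is used only to keep the sketch dependency-free; it is not necessarily the approach the book takes.

```r
# Hypothetical untidy table: the variable "year" is hidden in the column names.
wide <- data.frame(
  country = c("A", "B"),
  `1999` = c(10, 20),
  `2000` = c(15, 25),
  check.names = FALSE
)

# stack() puts the year columns on top of each other:
# `values` holds the counts, `ind` holds the originating column name.
long <- stack(wide[c("1999", "2000")])

# Tidy form: one column per variable, one row per (country, year) observation.
tidy <- data.frame(
  country = rep(wide$country, times = 2),
  year = as.integer(as.character(long$ind)),
  cases = long$values
)
tidy
```

In the tidy version, functions that expect a `year` variable can use it directly instead of parsing column names.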
Once you have tidy data, a common first step is to __transform__ it. You may zero in on a subset of data, add new variables that are functions of existing variables, or calculate a set of summary statistics.
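The three kinds of transformation named above can be sketched in a few lines. This is a minimal base R sketch using the built-in `mtcars` dataset; the choice of dataset and the derived `kpl` column are assumptions made for illustration, not examples from the book.

```r
cars <- datasets::mtcars

# 1. Zero in on a subset: keep only the four-cylinder cars.
four_cyl <- cars[cars$cyl == 4, ]

# 2. Add a new variable that is a function of an existing one
#    (miles per gallon converted to kilometres per litre).
four_cyl$kpl <- four_cyl$mpg * 0.425144

# 3. Calculate summary statistics: mean fuel economy per number of gears.
by_gear <- aggregate(mpg ~ gear, data = cars, FUN = mean)
by_gear
```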
Once you have tidy data with the variables you need, there are two main engines of knowledge generation: visualisation and modelling. These have complementary strengths and weaknesses so any real analysis will iterate between them many times.
__Visualisation__ is a fundamentally human activity. A good visualisation will show you things that you did not expect, or raise new questions of the data. A good visualisation might also hint that you're asking the wrong question and you need to refine your thinking. In short, visualisations can surprise you. However, visualisations don't scale particularly well.
__Visualisation__ is a fundamentally human activity. A good visualisation will show you things that you did not expect, or raise new questions about the data. A good visualisation might also hint that you're asking the wrong question, or that you need to collect different data. Visualisations can surprise you, but they don't scale particularly well because they require a human to interpret them.
__Models__ are the complementary tools to visualisation. Models are a fundamentally mathematical or computational tool, so they generally scale well. Even when they don't, it's usually cheaper to buy more computers than it is to buy more brains. But every model makes assumptions, and by its very nature a model can not question its own assumptions. That means a model cannot fundamentally surprise you.
__Models__ are complementary tools to visualisation. Once you have made your questions sufficiently precise, you can use a model to answer them. Models are a fundamentally mathematical or computational tool, so they generally scale well. Even when they don't, it's usually cheaper to buy more computers than it is to buy more brains! But every model makes assumptions, and by its very nature a model can not question its own assumptions. That means a model cannot fundamentally surprise you.
The last step of data science is __communication__, an absolutely critical part of any data analysis project. It doesn't matter how well your models and visualisation have led you to understand the data unless you can also communicate your results to others.
There's one important toolset that's not shown in the diagram: programming. Programming is a cross-cutting tool that you use in every part of the project. You don't need to be an expert programmer to be a data scientist, but learning more about programming pays off. Becoming a better programmer will allow you to automate common tasks, and solve new problems with greater ease.
Surrounding all these tools is __programming__. Programming is a cross-cutting tool that you use in every part of the project. You don't need to be an expert programmer to be a data scientist, but learning more about programming pays off because becoming a better programmer allows you to automate common tasks, and solve new problems with greater ease.
You'll use these six tools in every data science project, but for most projects they're not enough. There's a rough 80-20 rule at play: you can probably tackle 80% of every project using the tools that you'll learn in this book, but you'll need more to tackle the remaining 20%. Throughout this book we'll point you to resources where you can learn more.
You'll use these six tools in every data science project, but for most projects they're not enough. There's a rough 80-20 rule at play: you can tackle about 80% of every project using the tools that you'll learn in this book, but you'll need other tools to tackle the remaining 20%. Throughout this book we'll point you to resources where you can learn more.
## The tidyverse
The majority of the packages that you will learn in this book are part of the so-called tidyverse. All packages in the tidyverse share a common philosophy of data and R programming, which makes them fit together naturally. Because they are designed with a unifying vision you should experience fewer problems combining multiple packages to solve real problems. The packages in the tidyverse are not perfect, but over time they will continue to evolve towards greater consistency.
The majority of the packages that you will learn in this book are part of the so-called tidyverse. All packages in the tidyverse share a common philosophy of data and R programming, which makes them fit together naturally. Because they are designed with a unifying vision you should experience fewer problems when you combine multiple packages to solve real problems. The packages in the tidyverse are not perfect, but they fit together well, and over time that fit will continue to improve.
There are many other excellent packages that are not part of the tidyverse, because they are designed with a different set of underlying principles. This doesn't make them better or worse, just different. In other words, the complement to the tidyverse is not the messyverse, but many other universes of interrelated packages. As you tackle more data science projects with R you'll learn new packages, and new ways of thinking about data. But we hope that the tidyverse will continue to provide a solid foundation no matter how far you go in R.
## How you will learn
The above description of the tools of data science is organised roughly around the order in which you use them in analysis (although of course you'll iterate through them multiple times). In our experience, however, this is not the best way to learn them:
The previous description of the tools of data science is organised roughly according to the order in which you use them in an analysis (although of course you'll iterate through them multiple times). In our experience, however, this is not the best way to learn them:
* Starting with data ingest and tidying is sub-optimal because 80% of the time
it's routine and boring, and the other 20% of the time it's weird and
frustrating. Instead, we'll start with visualisation and transformation on
data that's already been imported and tidied. That way, when you ingest
and tidy your own data, your motivation will stay high because you know the
pain is worth it.
frustrating. That's a bad place to start learning a new subject! Instead,
we'll start with visualisation and transformation on data that's already been
imported and tidied. That way, when you ingest and tidy your own data, your
motivation will stay high because you know the pain is worth it.
* Some topics are best explained with other tools. For example, we believe that
it's easier to understand how models work as a tool for data science if you
already know about visualisation, data transformation, and tidy data.
it's easier to understand how models work if you already know about
visualisation, tidy data, and programming.
* Programming tools are not necessarily interesting in their own right,
but do allow you to tackle considerably more challenging problems. We'll
give you a selection of programming tools in the middle of the book, and
then finish off by showing how they combine with the key data science tools
to tackle interesting problems.
then you'll see how they combine with the data science tools to tackle
interesting modelling problems.
Within each chapter, we try and stick to a similar pattern: start with some motivating examples so you can see the bigger picture, and then dive into the details. Each section of the book is paired with exercises to help you practice what you've learned. While it's tempting to skip the exercises, there's no better way to learn than practicing on real problems.
@ -67,11 +63,11 @@ There are some important topics that this book doesn't cover. We believe it's im
### Big data
This book proudly focuses on small, in-memory datasets. This is the right place to start because you can't tackle big data unless you have experience with small data. The tools you learn in this book will easily handle hundreds of megabytes of data, and with a little care you can typically use them to work with 1-2 Gb of data. If you're routinely working with larger data (10-100 Gb, say), you should learn more about [data.table](https://github.com/Rdatatable/data.table). This book doesn't teach data.table because it has a very concise interface that is harder to learn because it offers fewer linguistic cues. But if you're working with large data, the performance payoff is worth the extra effort required to learn it.
This book proudly focuses on small, in-memory datasets. This is the right place to start because you can't tackle big data unless you have experience with small data. The tools you learn in this book will easily handle hundreds of megabytes of data, and with a little care you can typically use them to work with 1-2 Gb of data. If you're routinely working with larger data (10-100 Gb, say), you should learn more about [data.table](https://github.com/Rdatatable/data.table). This book doesn't teach data.table because its very concise interface offers fewer linguistic cues, which makes it harder to learn. But if you're working with large data, the performance payoff is worth the extra effort required to learn it.
If your data is bigger than this, carefully consider if your big data problem might actually be a small data problem in disguise. While the complete data might be big, often the data needed to answer a specific question is small. You might be able to find a subset, subsample, or summary that fits in memory and still allows you to answer the question that you're interested in. The challenge here is finding the right small data, which often requires a lot of iteration. We'll touch on this idea in [data transformation].
If your data is bigger than this, carefully consider if your big data problem might actually be a small data problem in disguise. While the complete data might be big, often the data needed to answer a specific question is small. You might be able to find a subset, subsample, or summary that fits in memory and still allows you to answer the question that you're interested in. The challenge here is finding the right small data, which often requires a lot of iteration.
Another possibility is that your big data problem is actually a large number of small data problems. Each individual problem might fit in memory, but you have millions of them. For example, you might want to fit a model to each person in your dataset. That would be trivial if you had just 10 or 100 people, but instead you have a million. Fortunately each problem is independent of the others (a setup that is sometimes called embarrassingly parallel), so you just need a system (like Hadoop) that allows you to send different datasets to different computers for processing. Once you've figured out how to answer the question for a single subset using the tools described in this book, you can use packages like sparklyr, rhipe, and ddr to solve it for the full dataset.
Another possibility is that your big data problem is actually a large number of small data problems. Each individual problem might fit in memory, but you have millions of them. For example, you might want to fit a model to each person in your dataset. That would be trivial if you had just 10 or 100 people, but instead you have a million. Fortunately each problem is independent of the others (a setup that is sometimes called embarrassingly parallel), so you just need a system (like Hadoop or Spark) that allows you to send different datasets to different computers for processing. Once you've figured out how to answer the question for a single subset using the tools described in this book, you can use packages like sparklyr, rhipe, and ddr to solve it for the full dataset.
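To make the "many small problems" idea concrete, here is a minimal in-memory sketch in base R. It assumes the built-in `mtcars` data, with `cyl` standing in for "person", and a simple linear model chosen only for illustration. Each group is fitted independently, which is the property a distributed system would exploit; nothing here uses Hadoop or Spark.

```r
# Split the data into independent pieces, one per group.
by_group <- split(mtcars, mtcars$cyl)

# Fit the same model to every piece. No fit depends on any other,
# so in principle each could run on a different machine.
models <- lapply(by_group, function(d) lm(mpg ~ wt, data = d))

# Collect one number per group: the estimated slope for weight.
slopes <- vapply(models, function(m) coef(m)[["wt"]], numeric(1))
slopes
```

Once the function passed to `lapply()` gives the right answer for a single subset, scaling up is mostly a matter of swapping in a tool that distributes the same per-subset work.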
### Python, Julia, and friends
@ -81,11 +77,11 @@ However, we strongly believe that it's best to master one tool at a time. You wi
### Non-rectangular data
This book focuses exclusively on structured datasets: collections of values that are each associated with a variable and an observation. There are lots of datasets that do not naturally fit in this paradigm: images, sounds, trees, text. But data frames are extremely common in science and in industry and we believe that they're a great place to start your data analysis journey.
This book focuses exclusively on rectangular data: collections of values that are each associated with a variable and an observation. There are lots of datasets that do not naturally fit in this paradigm: images, sounds, trees, text. But rectangular data frames are extremely common in science and in industry and we believe that they're a great place to start your data science journey.
### Hypothesis confirmation
It's possible to divide data analysis into two camps: hypothesis generation, and hypothesis confirmation (sometimes called confirmatory analysis). The focus of this book is unabashedly on hypothesis generation, or data exploration. Here you'll look deeply at the data and combined with your subject knowledge generate many interesting hypotheses to help explain why the data behaves the way it does. You evaluate the hypotheses informally, using your scepticism to challenge the data in multiple ways.
It's possible to divide data analysis into two camps: hypothesis generation and hypothesis confirmation (sometimes called confirmatory analysis). The focus of this book is unabashedly on hypothesis generation, or data exploration. Here you'll look deeply at the data and, in combination with your subject knowledge, generate many interesting hypotheses to help explain why the data behaves the way it does. You evaluate the hypotheses informally, using your scepticism to challenge the data in multiple ways.
The complement of hypothesis generation is hypothesis confirmation. Hypothesis confirmation is hard for two reasons:
@ -97,7 +93,7 @@ The complement of hypothesis generation is hypothesis confirmation. Hypothesis c
This means to do hypothesis confirmation you need to "preregister"
(write out in advance) your analysis plan, and not deviate from it
even when you have seen the data. We'll talk a little about some
strategies you can use to make this easier in [model assessment].
strategies you can use to make this easier in [modelling](#model-intro).
It's common to think about modelling as a tool for hypothesis confirmation, and visualisation as a tool for hypothesis generation. But that's a false dichotomy: models are often used for exploration, and with a little care you can use visualisation for confirmation. The key difference is how often you look at each observation: if you look only once, it's confirmation; if you look more than once, it's exploration.
@ -105,27 +101,15 @@ It's common to think about modelling as a tool for hypothesis confirmation, and
We've made few assumptions about what you already know in order to get the most out of this book. You should be generally numerically literate, and it's helpful if you have some programming experience already. If you've never programmed before, you might find [Hands on Programming with R](http://amzn.com/1449359019) by Garrett to be a useful adjunct to this book.
To run the code in this book, you will need to install both R and the RStudio IDE, an application that makes R easier to use. Both are open source, free and easy to install:
To run the code in this book, you will need to install both R and the RStudio IDE. Both are open source, free, and easy to install:
1. Download and install R, <https://www.r-project.org/>.
1. Download and install RStudio, <http://www.rstudio.com/download>.
1. Install needed packages (see below).
### Code conventions
* In text, we refer to functions in a code font followed by parentheses,
for example, `sum()`, or `mean()`.
* We refer to other R objects (like data or function arguments) without
parentheses: `flights`, `x`, ...
* If we want to make it clear which package an object comes from, we'll use
the package name followed by two colons: `dplyr::mutate()`, or
`nycflights13::flights`. This is the same convention that R uses.
### RStudio
RStudio is an integrated development environment, or IDE, for R programming. There are three key regions:
RStudio is an integrated development environment, or IDE, for R programming. There are three key regions in the interface:
```{r echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/intro-rstudio.png")
@ -151,19 +135,20 @@ We strongly recommend making two changes to the default RStudio options:
knitr::include_graphics("screenshots/rstudio-workspace.png")
```
This ensures that every time you restart RStudio you get a completely clean slate. This is good practice because it encourages you to capture all important interactions in your code. There's nothing worse than discovering three months after the fact that you've only stored the results of an important calculation in your workspace, not the calculation itself in your code. During a project, it's good practice to regularly restart R either using the menu Session | Restart R or the keyboard shortcut Cmd + Shift + F10.
This ensures that every time you restart RStudio you get a completely clean slate. That's good practice because it encourages you to capture all important interactions in your code. There's nothing worse than discovering three months after the fact that you've only stored the results of an important calculation in your workspace, not the calculation itself in your code. During a project, it's good practice to regularly restart R either using the menu Session | Restart R or the keyboard shortcut Cmd + Shift + F10.
### R packages
You'll also need to install some R packages. An R _package_ is a collection of functions, data, and documentation that extends the capabilities of base R. Using packages is key to the successful use of R. To install the packages you'll need for this book open RStudio and run:
```{r include = FALSE}
install.packages <- function(...) invisible()
```
```{r}
pkgs <- c(
"broom", "dplyr", "ggplot2", "jsonlite", "Lahman", "purrr",
"rcorpora", "readr", "rmarkdown", "stringr", "tibble", "tidyr"
)
```
```{r, eval = FALSE}
install.packages(pkgs)
```
@ -177,6 +162,18 @@ library(tidyr)
You will need to reload the package every time you start a new R session.
### Code conventions
* In text, we refer to functions in a code font followed by parentheses,
for example, `sum()`, or `mean()`.
* We refer to other R objects (like data or function arguments) without
parentheses: `flights`, `x`, ...
* If we want to make it clear which package an object comes from, we'll use
the package name followed by two colons: `dplyr::mutate()`, or
`nycflights13::flights`. This is the same convention that R uses.
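
The `package::object` convention can be tried directly at the console. The sketch below deliberately uses only packages that ship with R (`stats` and `datasets`), so the specific functions are stand-ins for the `dplyr` and `nycflights13` examples named above.

```r
x <- c(1, 5, 9)

# A function, qualified with the package it comes from.
stats::median(x)

# A dataset, qualified the same way (no parentheses, since it is not a function).
head(datasets::mtcars$mpg, 3)

# The qualified and unqualified names refer to the same function
# when the package is attached, as `stats` is by default.
same_result <- identical(stats::median(x), median(x))
same_result
```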
## Getting help and learning more
This book is not an island: there is no single resource that will allow you to master R. As you start to apply the techniques described in this book to your own data you will soon find questions that we do not answer. This section describes a few tips to help you get help, and to help you keep learning.
@ -235,9 +232,25 @@ This book isn't just the product of Hadley and Garrett, but is the result of man
* Yihui Xie for his work on the bookdown package, and for tirelessly satisfying
all my feature requests.
* Bill Behrman for thoughtfully reading the entire book and trying it out
with his data science class at Stanford.
This book was written in the open, so a special thanks goes to everyone who contributed via GitHub:
__INSERT HERE__
```{r, results = "asis", echo = FALSE, message = FALSE}
library(dplyr)
# git --no-pager shortlog -ns > contribs.txt
contribs <- readr::read_tsv("contribs.txt", col_names = c("n", "name"))
contribs <- contribs %>%
  filter(!name %in% c("hadley", "Garrett", "Hadley Wickham")) %>%
  arrange(name) %>%
  mutate(uname = ifelse(!grepl(" ", name), paste0("@", name), name))
cat("Thanks go to all contributors in alphabetical order: ")
cat(paste0(contribs$uname, collapse = ", "))
cat(".\n")
```
## Colophon