r4ds/intro.Rmd

---
layout: default
title: Welcome
output: bookdown::html_chapter
---
  
# Welcome
  
## Overview

The goal of "R for Data Science" is to give you a solid foundation into using R to do data science. The goal is not to be exhaustive, but to instead focus on what we think are the critical skills for data science:

* Getting your data into R so you can work with it.

* Wrangling your data into a tidy form, so it's easier to work with and you
  spend your time struggling with your questions, not fighting to get data
  into the right form for different functions.
  
* Manipulating your data to add variables and compute basic summaries. We'll
  show you the broad tools, and focus on three common types of data: numbers, 
  strings, and date/times.

* Visualising your data to gain insight. Visualisations are one of the most
  important tools of data science because they can surprise you: you can
  see something in a visualisation that you did not expect. Visualisations
  are also really helpful for helping you refine your questions of the data.

* Modelling your data to scale visualisations to larger datasets, and to
  remove strong patterns. Modelling is a very deep topic - we can't possibly
  cover all the details, but we'll give you a taste of how you can use it,
  and where you can go to learn more.

* Communicating your results to others. It doesn't matter how great your
  analysis is unless you can communicate the results to others. We'll show
  how you can create static reports with rmarkdown, and interactive apps with
  shiny.

## Learning data science

Above, I've listed the components of the data science process in roughly the order you'll encounter them in an analysis (although of course you'll iterate multiple times). This, howver, is not the order you'll encounter them in this book. This is because:

* Starting with data ingest is boring. It's much more interesting to learn
  some new visualisation and manipulation tools on data that's already been
  imported and cleaned. You'll later learn the skills to apply these new ideas
  to your own data.
  
* Some topics, like modelling, are best explained with other tools, like
  visualisation and manipulation. These need to come later in the book.

We've designed this order based on our experience teaching live classes, and it's been carefully designed to keep you motivated. We try and stick to a similar pattern within each chapter: give some bigger motivating examples so you can see the bigger picture, and then dive into the details.

Each section of the book also comes with exercises to help you practice what you've learned. It's tempting to skip these, but there's no better way to learn than practicing.  If you were taking a class with either of us, we'd force you to do them by making them homework. (Sometimes I feel like the art of teaching is tricking people to do what's in their own best interests.)

## R and big data

This book focuses almost exclusively on in-memory datasets.

* Small data: data that fits in memory on a laptop, ~10 GB. Note that small
  data is still big! R is great with small data.
  
* Medium data: data that fits in memory on a powerful server, ~5 TB. It's
  possible to use R with this much data, but it's challenging. Dealing
  effectively with medium data requires effective use of all cores on a
  computer. It's not that hard to do that from R, but it requires some thought,
  and many packages do not take advantage of R's tools.
  
* Big data: data that must be stored on disk or spread across the memory of
  multiple machines. Writing code that works efficiently with this sort of data
  is a very challenging. Tools for this sort of data will never be written in
  R: they'll be written in a language specially designed for high performance
  computing like C/C++, Fortran or Scala. But R can still talk to these systems.
  
The other thing to bear in mind, is that while all your data might be big, typically you don't need all of it to answer a specific question:

* Many questions can be answered with the right small dataset. It's often
  possible to find a subset, subsample, or summary that fits in memory and
  still allows you to answer the question you're interested in. The challenge
  here is finding the right small data, which often requires a lot of iteration.
  
* Other challenges are because an individual problem might fit in memory,
  but you have hundreds of thousands or millions of them. For example, you 
  might want to fit a model to each person in your dataset. That would be
  trivial if you had just 10 or 100 people, but instead you have a million.
  Fortunately each problem is independent (sometimes called embarassingly
  parallel), so you just need a system (like hadoop) that allows you to
  send different datasets to different computers for processing.

## Prerequisites

To run the code in this book, you will need to have R installed on your computer, as well as the RStudio IDE, an application that makes it easier to use R. Both R and the RStudio IDE are free and easy to install.

### R

To install R, visit [cran.r-project.org](http://cran.r-project.org). Then click the link that matches your operating system. What you do next will depend on your operating system.

* Mac users should click the most current release. This will be the `.pkg` file at the top of the page. Once the file is downloaded, double click it to open an R installer. Follow the directions in the installer to install R.

* Windows users should click "base" and then download the most current version of R, which will be linked at the top of the page.

* Linux users will need to select their distribution and then follow the distribution specific instructions to install R. [cran.r-project.org](https://cran.r-project.org/bin/linux/) includes these instructions along side of the files to download.


### RStudio

Once you have R installed, it is time to download RStudio. To download RStudio, visit [www.rstudio.com/download](http://www.rstudio.com/download). 

Choose the installer for your system. Then click the link to download the application. Once you have the application, installation is easy. Once RStudio is installed, open it as you would open any other application.

### R Packages

Some of the most useful parts of R come in _packages_, collections of functions and code that you can download in addition to base R. We will use several packages in this book. These include the `DBI`, `devtools`, `dplyr`, `DSR`, `ggplot2`, `haven`, `knitr`, `lubridate`, `packrat`, `readr`, `rmarkdown`, `rsqlite`, `rvest`, `scales `, `shiny`, `stringr`, and `tidyr` packages. 

There are two general ways to install packages for R. Both require you to have an internet connection, to start an R session (by opening the RStudio IDE), and to run a command at the command line.

The most common way to install R packages is to download them from the package repository at [cran.r-project.org](http://cran.r-project.org). To do this run the command, `install.packages()`. Give `install.packages()` the name or names of the packages you wish to install as a character vector. R will download the packages from [cran.r-project.org](http://cran.r-project.org) and install them in your system library.

You can use this method to download all but one of the packages listed above. To do so, open R and run the command

```{r eval = FALSE}
install.packages(c("DBI", "devtools", "dplyr", "ggplot2", "haven", "knitr", "lubridate", "packrat", "readr", "rmarkdown", "rsqlite", "rvest", "scales", "shiny", "stringr", "tidyr"))
```

Some R packages are not stored on [cran.r-project.org](http://cran.r-project.org), but are hosted in online repositories maintained by the package's developer. The most common place to host these packages is [www.github.com](http://www.github.com).

For example, `DSR` is a collection of data sets that we have assembled for this book and saved online as a github repository ([github.com/garrettgman/DSR](http://github.com/garrettgman/DSR)). 

You can install packages stored on github with the `install_github()` function in the `devtools` package. (You can install the `devtools` package itself from [cran.r-project.org](http://cran.r-project.org) with `install.packages()`). To use the function, pass it a characterstring with the form `"<github username>/<github repository name>".

To install `DSR`, run the command

```{r eval = FALSE}
devtools::install_github("garrettgman/DSR")
```

#### `library()`

When R installs a package, it downloads the package to your system library. This does not automatically load the contents of the package into your current or future R sessions. To use the functions and data sets that come in an R package saved in your system library, you must load the package into your current R session with `library()`. 

For example, to use the functions in the `tidyr` package, you would need to first run

```{r eval = FALSE}
library("tidyr")
```

You will need to rerun this command each time you open a new R session in which you wish to use the `tidyr` package.

### Getting help

* Google

* StackOverflow ([reprex](https://github.com/jennybc/reprex))

* Twitter
Start moving towards Hadley style 2015-07-29 03:15:28 +08:00			`---`
			`layout: default`
			`title: Welcome`
			`output: bookdown::html_chapter`
			`---`

Add missing headings 2015-09-21 21:45:06 +08:00			`# Welcome`

Some big picture stuff in the overview 2015-09-21 21:21:59 +08:00			`## Overview`

			`The goal of "R for Data Science" is to give you a solid foundation into using R to do data science. The goal is not to be exhaustive, but to instead focus on what we think are the critical skills for data science:`

			`* Getting your data into R so you can work with it.`

			`* Wrangling your data into a tidy form, so it's easier to work with and you`
			`spend your time struggling with your questions, not fighting to get data`
			`into the right form for different functions.`

			`* Manipulating your data to add variables and compute basic summaries. We'll`
			`show you the broad tools, and focus on three common types of data: numbers,`
			`strings, and date/times.`

			`* Visualising your data to gain insight. Visualisations are one of the most`
			`important tools of data science because they can surprise you: you can`
			`see something in a visualisation that you did not expect. Visualisations`
			`are also really helpful for helping you refine your questions of the data.`

			`* Modelling your data to scale visualisations to larger datasets, and to`
			`remove strong patterns. Modelling is a very deep topic - we can't possibly`
			`cover all the details, but we'll give you a taste of how you can use it,`
			`and where you can go to learn more.`

			`* Communicating your results to others. It doesn't matter how great your`
			`analysis is unless you can communicate the results to others. We'll show`
			`how you can create static reports with rmarkdown, and interactive apps with`
			`shiny.`

			`## Learning data science`

			`Above, I've listed the components of the data science process in roughly the order you'll encounter them in an analysis (although of course you'll iterate multiple times). This, howver, is not the order you'll encounter them in this book. This is because:`

			`* Starting with data ingest is boring. It's much more interesting to learn`
			`some new visualisation and manipulation tools on data that's already been`
			`imported and cleaned. You'll later learn the skills to apply these new ideas`
			`to your own data.`

			`* Some topics, like modelling, are best explained with other tools, like`
			`visualisation and manipulation. These need to come later in the book.`

			`We've designed this order based on our experience teaching live classes, and it's been carefully designed to keep you motivated. We try and stick to a similar pattern within each chapter: give some bigger motivating examples so you can see the bigger picture, and then dive into the details.`

			`Each section of the book also comes with exercises to help you practice what you've learned. It's tempting to skip these, but there's no better way to learn than practicing. If you were taking a class with either of us, we'd force you to do them by making them homework. (Sometimes I feel like the art of teaching is tricking people to do what's in their own best interests.)`

			`## R and big data`

			`This book focuses almost exclusively on in-memory datasets.`

			`* Small data: data that fits in memory on a laptop, ~10 GB. Note that small`
			`data is still big! R is great with small data.`

			`* Medium data: data that fits in memory on a powerful server, ~5 TB. It's`
			`possible to use R with this much data, but it's challenging. Dealing`
			`effectively with medium data requires effective use of all cores on a`
			`computer. It's not that hard to do that from R, but it requires some thought,`
			`and many packages do not take advantage of R's tools.`

			`* Big data: data that must be stored on disk or spread across the memory of`
			`multiple machines. Writing code that works efficiently with this sort of data`
			`is a very challenging. Tools for this sort of data will never be written in`
			`R: they'll be written in a language specially designed for high performance`
			`computing like C/C++, Fortran or Scala. But R can still talk to these systems.`

			`The other thing to bear in mind, is that while all your data might be big, typically you don't need all of it to answer a specific question:`

			`* Many questions can be answered with the right small dataset. It's often`
			`possible to find a subset, subsample, or summary that fits in memory and`
			`still allows you to answer the question you're interested in. The challenge`
			`here is finding the right small data, which often requires a lot of iteration.`

			`* Other challenges are because an individual problem might fit in memory,`
			`but you have hundreds of thousands or millions of them. For example, you`
			`might want to fit a model to each person in your dataset. That would be`
			`trivial if you had just 10 or 100 people, but instead you have a million.`
			`Fortunately each problem is independent (sometimes called embarassingly`
			`parallel), so you just need a system (like hadoop) that allows you to`
			`send different datasets to different computers for processing.`

Expanded draft of prerequisites section to include install instructions 2015-07-30 08:58:29 +08:00			`## Prerequisites`
Start moving towards Hadley style 2015-07-29 03:15:28 +08:00
Expanded draft of prerequisites section to include install instructions 2015-07-30 08:58:29 +08:00			`To run the code in this book, you will need to have R installed on your computer, as well as the RStudio IDE, an application that makes it easier to use R. Both R and the RStudio IDE are free and easy to install.`
Start moving towards Hadley style 2015-07-29 03:15:28 +08:00
Expanded draft of prerequisites section to include install instructions 2015-07-30 08:58:29 +08:00			`### R`

			`To install R, visit [cran.r-project.org](http://cran.r-project.org). Then click the link that matches your operating system. What you do next will depend on your operating system.`

			* Mac users should click the most current release. This will be the `.pkg` file at the top of the page. Once the file is downloaded, double click it to open an R installer. Follow the directions in the installer to install R.

			`* Windows users should click "base" and then download the most current version of R, which will be linked at the top of the page.`

			`* Linux users will need to select their distribution and then follow the distribution specific instructions to install R. [cran.r-project.org](https://cran.r-project.org/bin/linux/) includes these instructions along side of the files to download.`


			`### RStudio`

			`Once you have R installed, it is time to download RStudio. To download RStudio, visit [www.rstudio.com/download](http://www.rstudio.com/download).`

			`Choose the installer for your system. Then click the link to download the application. Once you have the application, installation is easy. Once RStudio is installed, open it as you would open any other application.`

			`### R Packages`

			Some of the most useful parts of R come in _packages_, collections of functions and code that you can download in addition to base R. We will use several packages in this book. These include the `DBI`, `devtools`, `dplyr`, `DSR`, `ggplot2`, `haven`, `knitr`, `lubridate`, `packrat`, `readr`, `rmarkdown`, `rsqlite`, `rvest`, `scales `, `shiny`, `stringr`, and `tidyr` packages.

			`There are two general ways to install packages for R. Both require you to have an internet connection, to start an R session (by opening the RStudio IDE), and to run a command at the command line.`

			The most common way to install R packages is to download them from the package repository at [cran.r-project.org](http://cran.r-project.org). To do this run the command, `install.packages()`. Give `install.packages()` the name or names of the packages you wish to install as a character vector. R will download the packages from [cran.r-project.org](http://cran.r-project.org) and install them in your system library.

			`You can use this method to download all but one of the packages listed above. To do so, open R and run the command`
Start moving towards Hadley style 2015-07-29 03:15:28 +08:00
			```{r eval = FALSE}
Expanded draft of prerequisites section to include install instructions 2015-07-30 08:58:29 +08:00			`install.packages(c("DBI", "devtools", "dplyr", "ggplot2", "haven", "knitr", "lubridate", "packrat", "readr", "rmarkdown", "rsqlite", "rvest", "scales", "shiny", "stringr", "tidyr"))`
Start moving towards Hadley style 2015-07-29 03:15:28 +08:00			```

Expanded draft of prerequisites section to include install instructions 2015-07-30 08:58:29 +08:00			`Some R packages are not stored on [cran.r-project.org](http://cran.r-project.org), but are hosted in online repositories maintained by the package's developer. The most common place to host these packages is [www.github.com](http://www.github.com).`

			For example, `DSR` is a collection of data sets that we have assembled for this book and saved online as a github repository ([github.com/garrettgman/DSR](http://github.com/garrettgman/DSR)).

			You can install packages stored on github with the `install_github()` function in the `devtools` package. (You can install the `devtools` package itself from [cran.r-project.org](http://cran.r-project.org) with `install.packages()`). To use the function, pass it a characterstring with the form `"<github username>/<github repository name>".

			To install `DSR`, run the command
Start moving towards Hadley style 2015-07-29 03:15:28 +08:00
			```{r eval = FALSE}
			`devtools::install_github("garrettgman/DSR")`
			```

Expanded draft of prerequisites section to include install instructions 2015-07-30 08:58:29 +08:00			#### `library()`

			When R installs a package, it downloads the package to your system library. This does not automatically load the contents of the package into your current or future R sessions. To use the functions and data sets that come in an R package saved in your system library, you must load the package into your current R session with `library()`.

			For example, to use the functions in the `tidyr` package, you would need to first run

			```{r eval = FALSE}
			`library("tidyr")`
			```

			You will need to rerun this command each time you open a new R session in which you wish to use the `tidyr` package.
Need to also discuss where to get help 2015-09-22 20:59:27 +08:00
			`### Getting help`

			`* Google`

			`* StackOverflow ([reprex](https://github.com/jennybc/reprex))`

			`* Twitter`