Use tidyverse package

Fixes #451
hadley 2016-10-03 12:30:24 -05:00
parent 42cffaf5b3
commit 7768955fe6
17 changed files with 76 additions and 98 deletions

View File

@ -19,8 +19,7 @@ EDA is an important part of any data analysis, even if the questions are handed
In this chapter we'll combine what you've learned about dplyr and ggplot2 to interactively ask questions, answer them with data, and then ask new questions.
```{r setup, message = FALSE}
library(ggplot2)
library(dplyr)
library(tidyverse)
```
## Questions

View File

@ -10,11 +10,10 @@ This chapter focuses on the tools you need to create good graphics. I assume tha
### Prerequisites
In this chapter, we'll focus once again on ggplot2. We'll also use a little dplyr for data manipulation, and a few ggplot2 extension packages, including __ggrepel__ and __viridis__. Rather than loading those extensions here, we'll refer to their functions explicitly, using the `::` notation. This will help make it clear which functions are built into ggplot2, and which come from other packages.
In this chapter, we'll focus once again on ggplot2. We'll also use a little dplyr for data manipulation, and a few ggplot2 extension packages, including __ggrepel__ and __viridis__. Rather than loading those extensions here, we'll refer to their functions explicitly, using the `::` notation. This will help make it clear which functions are built into ggplot2, and which come from other packages. Don't forget you'll need to install those packages with `install.packages()` if you don't already have them.
```{r, message = FALSE}
library(ggplot2)
library(dplyr)
library(tidyverse)
```
## Label

View File

@ -14,17 +14,13 @@ Dates and times are hard because they have to reconcile two physical phenomena (
### Prerequisites
This chapter will focus on the __lubridate__ package, which makes it easier to work with dates and times in R. We will use nycflights13 for practice data, and some packages for EDA.
This chapter will focus on the __lubridate__ package, which makes it easier to work with dates and times in R. lubridate is not part of core tidyverse because you only need it when you're working with dates/times. We will also need nycflights13 for practice data.
```{r setup, message = FALSE}
library(tidyverse)
library(lubridate)
# Data
library(nycflights13)
# EDA
library(dplyr)
library(ggplot2)
```
## Creating date/times

View File

@ -8,15 +8,13 @@ Historically, factors were much easier to work with than characters. As a result
For more historical context on factors, I recommend [_stringsAsFactors: An unauthorized biography_](http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) by Roger Peng, and [_stringsAsFactors = \<sigh\>_](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh) by Thomas Lumley.
### Prerequisites
To work with factors, we'll use the __forcats__ package, which provides tools for dealing with **cat**egorical variables (and it's an anagram of factors!). It provides a wide range of helpers for working with factors. We'll also need dplyr for some data manipulation, and ggplot2 for visualisation.
To work with factors, we'll use the __forcats__ package, which provides tools for dealing with **cat**egorical variables (and it's an anagram of factors!). It provides a wide range of helpers for working with factors. forcats is not part of the core tidyverse, so we need to load it explicitly.
```{r setup, message = FALSE}
library(tidyverse)
library(forcats)
library(ggplot2)
library(dplyr)
```
## Creating factors

View File

@ -6,10 +6,10 @@ Working with data provided by R packages is a great way to learn the tools of da
### Prerequisites
In this chapter, you'll learn how to load flat files in R with the __readr__ package:
In this chapter, you'll learn how to load flat files in R with the __readr__ package, which is part of the core tidyverse.
```{r setup}
library(readr)
```{r setup, message = FALSE}
library(tidyverse)
```
## Getting started

View File

@ -28,12 +28,6 @@ Surrounding all these tools is __programming__. Programming is a cross-cutting t
You'll use these six tools in every data science project, but for most projects they're not enough. There's a rough 80-20 rule at play; you can tackle about 80% of every project using the tools that you'll learn in this book, but you'll need other tools to tackle the remaining 20%. Throughout this book we'll point you to resources where you can learn more.
## The tidyverse
The majority of the packages that you will learn in this book are part of the so-called tidyverse. All packages in the tidyverse share a common philosophy of data and R programming, which makes them fit together naturally. Because they are designed with a unifying vision, you should experience fewer problems when you combine multiple packages to solve real problems. The packages in the tidyverse are not perfect, but they fit together well, and over time that fit will continue to improve.
There are many other excellent packages that are not part of the tidyverse, because they are designed with a different set of underlying principles. This doesn't make them better or worse, just different. In other words, the complement to the tidyverse is not the messyverse, but many other universes of interrelated packages. As you tackle more data science projects with R, you'll learn new packages and new ways of thinking about data. But we hope that the tidyverse will continue to provide a solid foundation no matter how far you go in R.
## How you will learn
The previous description of the tools of data science is organised roughly according to the order in which you use them in an analysis (although of course you'll iterate through them multiple times). In our experience, however, this is not the best way to learn them:
@ -52,7 +46,8 @@ The previous description of the tools of data science is organised roughly accor
* Programming tools are not necessarily interesting in their own right,
but do allow you to tackle considerably more challenging problems. We'll
give you a selection of programming tools in the middle of the book, and
then you'll see they can combine with the data science tools to tackle interesting modelling problems.
then you'll see they can combine with the data science tools to tackle
interesting modelling problems.
Within each chapter, we try and stick to a similar pattern: start with some motivating examples so you can see the bigger picture, and then dive into the details. Each section of the book is paired with exercises to help you practice what you've learned. While it's tempting to skip the exercises, there's no better way to learn than practicing on real problems.
@ -100,15 +95,17 @@ It's common to think about modelling as a tool for hypothesis confirmation, and
We've made a few assumptions about what you already know in order to get the most out of this book. You should be generally numerically literate, and it's helpful if you have some programming experience already. If you've never programmed before, you might find [Hands on Programming with R](http://amzn.com/1449359019) by Garrett to be a useful adjunct to this book.
To run the code in this book, you will need to install both R and the RStudio IDE. Both are open source, free, and easy to install:
There are four things you need to run the code in this book: R, RStudio, a collection of R packages called the __tidyverse__, and a handful of other packages.
1. Download and install R, <https://www.r-project.org/>.
1. Download and install RStudio, <http://www.rstudio.com/download>.
1. Install required packages (see below).
### R
To download R, go to CRAN, the **c**omprehensive **R** **a**rchive **n**etwork. CRAN is composed of a set of mirror servers distributed around the world and is used to distribute R and R packages. Don't try and pick a mirror that's close to you: instead use the cloud mirror, <https://cloud.r-project.org>, which automatically figures it out for you.
### RStudio
RStudio is an integrated development environment, or IDE, for R programming. When you get started, there are two key regions in the interface:
RStudio is an integrated development environment, or IDE, for R programming. Download and install it from <http://www.rstudio.com/download>.
When you start RStudio, you'll see two key regions in the interface:
```{r echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/rstudio-console.png")
@ -116,33 +113,39 @@ knitr::include_graphics("diagrams/rstudio-console.png")
For now, all you need to know is that you type R code in the console pane, and press enter to run it. You'll learn more as we go along!
### R packages
### The tidyverse
You'll also need to install some R packages. An R _package_ is a collection of functions, data, and documentation that extends the capabilities of base R. Using packages is key to the successful use of R. To install the packages you'll need for this book open RStudio and run:
You'll also need to install some R packages. An R _package_ is a collection of functions, data, and documentation that extends the capabilities of base R. Using packages is key to the successful use of R. The majority of the packages that you will learn in this book are part of the so-called tidyverse. The packages in the tidyverse share a common philosophy of data and R programming, and are designed to work together naturally.
```{r include = FALSE}
install.packages <- function(...) invisible()
```
```{r}
pkgs <- c(
"dplyr", "gapminder", "ggplot2", "jsonlite", "Lahman",
"lubridate", "modelr", "nycflights13", "purrr", "readr",
"stringr", "tibble", "tidyr"
)
install.packages(pkgs)
```
R will download the packages from CRAN and install them on to your computer. CRAN is the Comprehensive R Archive Network, and is where R packages are published. If you have problems installing, make sure that you are connected to the internet, and that <https://cloud.r-project.org/> isn't blocked by your firewall or proxy.
You will not be able to use the functions, objects, and help files in a package until you load it with `library()`. After you have downloaded the packages, you can load any of the packages into your current R session with the `library()` command, e.g.
You can install the complete tidyverse with a single line of code:
```{r, eval = FALSE}
library(tidyr)
install.packages("tidyverse")
```
You will need to reload the package every time you start a new R session.
On your own computer, type that line of code in the console, and then press enter to run it. R will download the packages from CRAN and install them on to your computer. If you have problems installing, make sure that you are connected to the internet, and that <https://cloud.r-project.org/> isn't blocked by your firewall or proxy.
### Code conventions
You will not be able to use the functions, objects, and help files in a package until you load it with `library()`. Once you have installed a package, you can load it with the `library()` function:
```{r}
library(tidyverse)
```
This tells you that tidyverse is loading the ggplot2, tibble, tidyr, readr, purrr, and dplyr packages. These are considered to be the __core__ of the tidyverse because you'll use them in almost every analysis.
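If the startup message has scrolled away, one way to check which of the core packages are attached is to inspect the search path. This is a minimal sketch using only base R; `is_attached()` is a hypothetical helper written for illustration, not part of the tidyverse.

```r
# Hypothetical helper: TRUE for each package currently attached via library()
is_attached <- function(pkgs) {
  paste0("package:", pkgs) %in% search()
}

# The core packages named in the loading message
core <- c("ggplot2", "tibble", "tidyr", "readr", "purrr", "dplyr")

# Named logical vector: TRUE where the package is on the search path
setNames(is_attached(core), core)
```

Before `library(tidyverse)` has been run in a session, every entry should be `FALSE`.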
### Other packages
There are many other excellent packages that are not part of the tidyverse, because they solve problems in a different domain, or are designed with a different set of underlying principles. This doesn't make them better or worse, just different. In other words, the complement to the tidyverse is not the messyverse, but many other universes of interrelated packages. As you tackle more data science projects with R, you'll learn new packages and new ways of thinking about data.
In this book we'll use three data packages from outside the tidyverse:
```{r, eval = FALSE}
install.packages(c("nycflights13", "gapminder", "Lahman"))
```
These packages provide data on airline flights, world development, and baseball that we'll use to illustrate key data science ideas.
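Because a package only needs to be installed once, it can be handy to check what is missing before calling `install.packages()`. The following is a rough sketch in base R; `missing_packages()` is a hypothetical helper, and the install call is left commented out so nothing is downloaded by accident.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed.
# system.file() returns "" when a package cannot be found in the library.
missing_packages <- function(pkgs) {
  installed <- vapply(
    pkgs,
    function(p) nzchar(system.file(package = p)),
    logical(1)
  )
  pkgs[!installed]
}

needed <- c("tidyverse", "nycflights13", "gapminder", "Lahman")
to_install <- missing_packages(needed)

if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install the missing packages
}
```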
## Code conventions
Throughout the book we use a consistent set of conventions to refer to code:
@ -251,5 +254,5 @@ An online version of this book is available at <http://r4ds.had.co.nz>. It will
This book was built with:
```{r}
devtools::session_info(pkgs)
devtools::session_info(c("tidyverse"))
```

View File

@ -20,10 +20,10 @@ In this chapter you'll learn about two important iteration paradigms: imperative
### Prerequisites
Once you've mastered the for loops provided by base R, you'll learn some of the powerful programming tools provided by purrr.
Once you've mastered the for loops provided by base R, you'll learn some of the powerful programming tools provided by purrr, one of the tidyverse core packages.
```{r setup}
library(purrr)
```{r setup, message = FALSE}
library(tidyverse)
```
## For loops

View File

@ -43,16 +43,13 @@ The goal of a model is not to uncover truth, but to discover a simple approximat
### Prerequisites
We need a couple of packages specifically designed for modelling, and all the packages you've used before for EDA.
In this chapter we'll use the modelr package which wraps around base R's modelling functions to make them work naturally in a pipe.
```{r setup, message = FALSE, cache = FALSE}
# Modelling functions
library(tidyverse)
library(modelr)
options(na.action = na.warn)
# EDA tools
library(ggplot2)
library(dplyr)
```
## A simple model

View File

@ -21,19 +21,14 @@ It's a challenge to know when to stop. You need to figure out when your model is
### Prerequisites
We'll start with modelling and EDA tools we used in the last chapter. Then we'll add in some real datasets: `diamonds` from ggplot2, and `flights` from nycflights13. We'll also need lubridate to extract useful components of date-times.
We'll use the same tools as in the previous chapter, but add in some real datasets: `diamonds` from ggplot2, and `flights` from nycflights13. We'll also need lubridate in order to work with the date/times in `flights`.
```{r setup, message = FALSE}
# Modelling functions
library(tidyverse)
library(modelr)
options(na.action = na.warn)
# Data
library(nycflights13)
# EDA tools
library(ggplot2)
library(dplyr)
library(lubridate)
```

View File

@ -37,19 +37,11 @@ This chapter is somewhat aspirational: if this book is your first introduction t
### Prerequisites
Working with many models requires a combination of packages that you're already familiar with from data exploration, wrangling, programming, and modelling.
Working with many models requires many of the packages of the tidyverse (for data exploration, wrangling, and programming) and modelr to facilitate modelling.
```{r setup, message = FALSE}
# Standard data manipulation and visualisation
library(dplyr)
library(ggplot2)
# Tools for working with models
library(modelr)
# Tools for working with lots of models
library(purrr)
library(tidyr)
library(tidyverse)
```
## gapminder

View File

@ -23,8 +23,8 @@ The most common place to find relational data is in a _relational_ database mana
We will explore relational data from `nycflights13` using the two-table verbs from dplyr.
```{r setup, message = FALSE}
library(tidyverse)
library(nycflights13)
library(dplyr)
```
## nycflights13 {#nycflights13-relational}

View File

@ -6,11 +6,11 @@ This chapter introduces you to string manipulation in R. You'll learn the basics
### Prerequisites
This chapter will focus on the __stringr__ package for string manipulation. We'll also show a couple of examples of using stringr functions in conjunction with dplyr.
This chapter will focus on the __stringr__ package for string manipulation. stringr is not part of the core tidyverse because you don't always have textual data, so we need to load it explicitly.
```{r setup}
```{r setup, message = FALSE}
library(tidyverse)
library(stringr)
library(dplyr)
```
## String basics

View File

@ -8,10 +8,10 @@ If this chapter leaves you wanting to learn more about tibbles, you might enjoy
### Prerequisites
In this chapter we'll explore the __tibble__ package. Most chapters don't load the tibble package explicitly, because we're just using tibbles, not creating them. Here we're going to create them by hand (not from an existing data source), so we'll need to load it explicitly.
In this chapter we'll explore the __tibble__ package, part of the core tidyverse.
```{r setup}
library(tibble)
library(tidyverse)
```
## Creating tibbles {#tibbles}

View File

@ -14,11 +14,10 @@ This chapter will give you a practical introduction to tidy data and the accompa
### Prerequisites
In this chapter we'll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. We'll also need to use a pinch of dplyr, as is common when tidying data.
In this chapter we'll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. tidyr is a member of the core tidyverse.
```{r setup, message = FALSE}
library(tidyr)
library(dplyr)
library(tidyverse)
```
## Tidy data

View File

@ -6,15 +6,14 @@ Visualisation is an important tool for insight generation, but it is rare that y
### Prerequisites
In this chapter we're going to focus on how to use the dplyr package. We'll illustrate the key ideas using data from the nycflights13 package, and use ggplot2 to help us understand the data.
In this chapter we're going to focus on how to use the dplyr package, another core member of the tidyverse. We'll illustrate the key ideas using data from the nycflights13 package, and use ggplot2 to help us understand the data.
```{r setup}
library(dplyr)
library(nycflights13)
library(ggplot2)
library(tidyverse)
```
Take careful note of the message that's printed when you load dplyr - it tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you'll need to use their full names: `stats::filter()`, `base::intersect()`, etc.
Take careful note of the conflicts message that's printed when you load the tidyverse. It tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you'll need to use their full names: `stats::filter()` and `stats::lag()`.
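To see why the full names matter, here is a small sketch contrasting the two `filter()` functions. The base version is a linear filter for time series, while the dplyr version picks rows; the data are made up for illustration, and the dplyr call only runs if dplyr is installed.

```r
x <- 1:5

# Base R's filter() applies a linear filter: here, a centred
# three-point moving average with equal weights.
moving_avg <- stats::filter(x, rep(1 / 3, 3))
as.numeric(moving_avg)

# dplyr's filter() keeps the rows of a data frame that match a condition.
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::filter(data.frame(x = x), x > 3)
}
```

Writing `stats::filter()` or `dplyr::filter()` in full makes the choice explicit, whatever order the packages were loaded in.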
### nycflights13

View File

@ -11,7 +11,7 @@ Vectors are particularly important as most of the functions you will write will
The focus of this chapter is on base R data structures, so it isn't essential to load any packages. We will, however, use a handful of functions from the __purrr__ package to avoid some inconsistencies in base R.
```{r}
library(purrr)
library(tidyverse)
```
## Vector basics

View File

@ -9,18 +9,19 @@ This chapter will teach you how to visualise your data using ggplot2. R has seve
### Prerequisites
To access the datasets, help pages, and functions that we will use in this chapter, load ggplot2 using the `library()` function. We'll also load tibble, which you'll learn about later. It improves the default printing of datasets.
This chapter focusses on ggplot2, one of the core members of the tidyverse. To access the datasets, help pages, and functions that we will use in this chapter, load the tidyverse by running this code:
```{r setup}
library(ggplot2)
library(tibble)
library(tidyverse)
```
If you run this code and get the error message "there is no package called ggplot2", you'll need to first install it, then run `library()` once again.
That one line of code loads the core tidyverse: packages that you will use in almost every data analysis. It also tells you which functions from the tidyverse conflict with functions in base R (or from other packages you might have loaded).
If you run this code and get the error message "there is no package called tidyverse", you'll need to first install it, then run `library()` once again.
```{r eval = FALSE}
install.packages("ggplot2")
library(ggplot2)
install.packages("tidyverse")
library(tidyverse)
```
You only need to install a package once, but you need to reload it every time you start a new session.