Selected, relevant edits from @pkq. Closes #679.

This commit is contained in:
Mine Çetinkaya-Rundel 2022-05-07 22:17:21 -04:00
parent e6bc512e98
commit e6b958b196
8 changed files with 108 additions and 109 deletions

View File

@ -6,9 +6,9 @@ status("polishing")
## Introduction
Visualisation is an important tool for insight generation, but it's rare that you get the data in exactly the right form you need for it.
Visualisation is an important tool for generating insight, but it's rare that you get the data in exactly the right form you need for it.
Often you'll need to create some new variables or summaries to see the most important patterns, or maybe you just want to rename the variables or reorder the observations to make the data a little easier to work with.
You'll learn how to do all that (and more!) in this chapter, which will introduce you to data transformation using the **dplyr** package and a new dataset on flights departing New York City in 2013.
You'll learn how to do all that (and more!) in this chapter, which will introduce you to data transformation using the **dplyr** package and a new dataset on flights that departed New York City in 2013.
The goal of this chapter is to give you an overview of all the key tools for transforming a data frame.
We'll come back to these functions in more detail in later chapters, as we start to dig into specific types of data (e.g. numbers, strings, dates).
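As a taste of what's to come, here's a minimal sketch that chains several of these verbs together (it assumes the flights data and dplyr are loaded, as in the rest of this chapter):

```{r}
# Average speed of flights to Houston, by month:
# filter rows, create a new variable, then summarise by group
flights |>
  filter(dest == "IAH") |>
  mutate(speed = distance / air_time * 60) |>
  group_by(month) |>
  summarise(avg_speed = mean(speed, na.rm = TRUE))
```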
@ -222,10 +222,10 @@ flights |>
)
```
By default, `mutate()` adds new columns on the right hand side of your dataset, which makes it hard to see what's happening here.
By default, `mutate()` adds new columns on the right hand side of your dataset, which makes it difficult to see what's happening here.
We can use the `.before` argument to instead add the variables to the left hand side[^data-transform-2]:
[^data-transform-2]: Remember that when you're in RStudio, the easiest way to see all the columns is `View()`.
[^data-transform-2]: Remember that in RStudio, the easiest way to see a dataset with many columns is `View()`.
```{r}
flights |>
@ -535,7 +535,7 @@ As you can see, when you summarize an ungrouped data frame, you get a single row
## Case study: aggregates and sample size {#sample-size}
Whenever you do any aggregation, it's always a good idea to include a count (`n()`).
That way you can check that you're not drawing conclusions based on very small amounts of data.
That way, you can ensure that you're not drawing conclusions based on very small amounts of data.
For example, let's look at the planes (identified by their tail number) that have the highest average delays:
```{r}
@ -569,7 +569,7 @@ ggplot(delays, aes(n, delay)) +
geom_point(alpha = 1/10)
```
Not surprisingly, there is much greater variation in the average delay when there are few flights.
Not surprisingly, there is much greater variation in the average delay when there are few flights for a given plane.
The shape of this plot is very characteristic: whenever you plot a mean (or other summary) vs. group size, you'll see that the variation decreases as the sample size increases[^data-transform-4].
[^data-transform-4]: \*cough\* the central limit theorem \*cough\*
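One way to see the underlying pattern more clearly is to filter out the groups with the smallest number of observations; here's a minimal sketch, assuming the `delays` data frame computed above (the cutoff of 25 is arbitrary):

```{r}
# Dropping planes with few flights makes the remaining pattern easier to see
delays |>
  filter(n > 25) |>
  ggplot(aes(n, delay)) +
  geom_point(alpha = 1/10)
```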

View File

@ -9,12 +9,12 @@ R has several systems for making graphs, but ggplot2 is one of the most elegant
ggplot2 implements the **grammar of graphics**, a coherent system for describing and building graphs.
With ggplot2, you can do more faster by learning one system and applying it in many places.
If you'd like to learn more about the theoretical underpinnings of ggplot2, I'd recommend reading "The Layered Grammar of Graphics", <http://vita.had.co.nz/papers/layered-grammar.pdf>.
If you'd like to learn more about the theoretical underpinnings of ggplot2, I recommend reading "The Layered Grammar of Graphics", <http://vita.had.co.nz/papers/layered-grammar.pdf>.
### Prerequisites
This chapter focuses on ggplot2, one of the core members of the tidyverse.
To access the datasets, help pages, and functions that we will use in this chapter, load the tidyverse by running this code:
This chapter focuses on ggplot2, one of the core packages in the tidyverse.
To access the datasets, help pages, and functions used in this chapter, load the tidyverse by running this code:
```{r}
#| label: setup
@ -53,7 +53,7 @@ Nonlinear?
You can test your answer with the `mpg` **data frame** found in ggplot2 (a.k.a. `ggplot2::mpg`).
A data frame is a rectangular collection of variables (in the columns) and observations (in the rows).
`mpg` contains observations collected by the US Environmental Protection Agency on 38 models of car.
`mpg` contains observations collected by the US Environmental Protection Agency on 38 car models.
```{r}
mpg
@ -183,7 +183,7 @@ ggplot(data = mpg) +
(If you prefer British English, like Hadley, you can use `colour` instead of `color`.)
To map an aesthetic to a variable, associate the name of the aesthetic to the name of the variable inside `aes()`.
To map an aesthetic to a variable, associate the name of the aesthetic with the name of the variable inside `aes()`.
ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as **scaling**.
ggplot2 will also add a legend that explains which levels correspond to which values.
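For example, here's a sketch of such a mapping, with `class` mapped to the color aesthetic:

```{r}
#| eval: false
# class is mapped inside aes(), so ggplot2 scales it automatically
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))
```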
@ -194,7 +194,7 @@ In hindsight, these cars were unlikely to be hybrids since they have large engin
In the above example, we mapped `class` to the color aesthetic, but we could have mapped `class` to the size aesthetic in the same way.
In this case, the exact size of each point would reveal its class affiliation.
We get a *warning* here, because mapping an unordered variable (`class`) to an ordered aesthetic (`size`) is not a good idea.
We get a *warning* here, because mapping an unordered variable (`class`) to an ordered aesthetic (`size`) is generally not a good idea.
```{r}
#| fig-alt: "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association. The points representing each car are sized according to the class of the car. The legend on the right of the plot shows the mapping between colours and levels of the class variable -- going from small to large: 2seater, compact, midsize, minivan, pickup, or suv."
@ -203,7 +203,7 @@ ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))
```
Or we could have mapped `class` to the *alpha* aesthetic, which controls the transparency of the points, or to the *shape* aesthetic, which controls the shape of the points.
Similarly, we could have mapped `class` to the *alpha* aesthetic, which controls the transparency of the points, or to the *shape* aesthetic, which controls the shape of the points.
```{r}
#| fig-width: 4
@ -247,13 +247,12 @@ ggplot(data = mpg) +
```
Here, the color doesn't convey information about a variable, but only changes the appearance of the plot.
To set an aesthetic manually, set the aesthetic by name as an argument of your geom function; i.e. it goes *outside* of `aes()`.
You'll need to pick a level that makes sense for that aesthetic:
To set an aesthetic manually, set the aesthetic by name as an argument of your geom function.
In other words, it goes *outside* of `aes()`.
You'll need to pick a value that makes sense for that aesthetic:
- The name of a color as a character string.
- The size of a point in mm.
- The shape of a point as a number, as shown in Figure \@ref(fig:shapes).
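Putting these together, here's a sketch that sets all three manually (the particular values are arbitrary):

```{r}
#| eval: false
# Manual settings go outside aes(); these specific values are arbitrary
ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    color = "blue", size = 2, shape = 17
  )
```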
```{r}
@ -340,8 +339,8 @@ Another great tool is Google: try googling the error message, as it's likely som
## Facets
One way to add additional variables is with aesthetics.
Another way, particularly useful for categorical variables, is to split your plot into **facets**, subplots that each display one subset of the data.
One way to add additional variables to a plot is by mapping them to an aesthetic.
Another way, which is particularly useful for categorical variables, is to split your plot into **facets**, subplots that each display one subset of the data.
To facet your plot by a single variable, use `facet_wrap()`.
The first argument of `facet_wrap()` is a formula, which you create with `~` followed by a variable name (here, "formula" is the name of a data structure in R, not a synonym for "equation").
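For example, here's a sketch that facets the fuel economy scatterplot by `class`:

```{r}
#| eval: false
# One subplot per value of class, arranged in two rows
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)
```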
@ -504,7 +503,7 @@ ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
```
Here `geom_smooth()` separates the cars into three lines based on their `drv` value, which describes a car's drive train.
Here, `geom_smooth()` separates the cars into three lines based on their `drv` value, which describes a car's drive train.
One line describes all of the points that have a `4` value, one line describes all of the points that have an `f` value, and one line describes all of the points that have an `r` value.
Here, `4` stands for four-wheel drive, `f` for front-wheel drive, and `r` for rear-wheel drive.
@ -524,9 +523,9 @@ Notice that this plot contains two geoms in the same graph!
If this makes you excited, buckle up.
You will learn how to place multiple geoms in the same plot very soon.
ggplot2 provides over 40 geoms, and extension packages provide even more (see <https://exts.ggplot2.tidyverse.org/gallery/> for a sampling).
ggplot2 provides more than 40 geoms, and extension packages provide even more (see <https://exts.ggplot2.tidyverse.org/gallery/> for a sampling).
The best way to get a comprehensive overview is the ggplot2 cheatsheet, which you can find at <http://rstudio.com/resources/cheatsheets>.
To learn more about any single geom, use help, e.g. `?geom_smooth`.
To learn more about any single geom, use the help (e.g. `?geom_smooth`).
Many geoms, like `geom_smooth()`, use a single geometric object to display multiple rows of data.
For these geoms, you can set the `group` aesthetic to a categorical variable to draw multiple objects.
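For example, here's a sketch that draws one smooth line per drive train by setting `group`:

```{r}
#| eval: false
# group splits the data into one smooth per drv value,
# but (unlike color or linetype) adds no legend
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))
```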
@ -628,6 +627,7 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
3. What does `show.legend = FALSE` do?
What happens if you remove it?\
Why do you think I used it earlier in the chapter?
4. What does the `se` argument to `geom_smooth()` do?
@ -704,7 +704,7 @@ Other graphs, like bar charts, calculate new values to plot:
- smoothers fit a model to your data and then plot predictions from the model.
- boxplots compute a robust summary of the distribution and then display a specially formatted box.
- boxplots compute a robust summary of the distribution and then display that summary as a specially formatted box.
The algorithm used to calculate new values for a graph is called a **stat**, short for statistical transformation.
The figure below describes how this process works with `geom_bar()`.
@ -719,8 +719,8 @@ knitr::include_graphics("images/visualization-stat-bar.png")
You can learn which stat a geom uses by inspecting the default value for the `stat` argument.
For example, `?geom_bar` shows that the default value for `stat` is "count", which means that `geom_bar()` uses `stat_count()`.
`stat_count()` is documented on the same page as `geom_bar()`, and if you scroll down you can find a section called "Computed variables".
That describes how it computes two new variables: `count` and `prop`.
`stat_count()` is documented on the same page as `geom_bar()`.
If you scroll down, the section called "Computed variables" explains that it computes two new variables: `count` and `prop`.
You can generally use geoms and stats interchangeably.
For example, you can recreate the previous plot using `stat_count()` instead of `geom_bar()`:
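Here's a sketch of that recreation:

```{r}
#| eval: false
# stat_count() defaults to geom "bar", so this draws the same chart
ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))
```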
@ -734,7 +734,7 @@ ggplot(data = diamonds) +
This works because every geom has a default stat; and every stat has a default geom.
This means that you can typically use geoms without worrying about the underlying statistical transformation.
There are three reasons you might need to use a stat explicitly:
However, there are three reasons why you might need to use a stat explicitly:
1. You might want to override the default stat.
In the code below, I change the stat of `geom_bar()` from count (the default) to identity.
@ -759,7 +759,7 @@ There are three reasons you might need to use a stat explicitly:
```
(Don't worry that you haven't seen `<-` or `tribble()` before.
You might be able to guess at their meaning from the context, and you'll learn exactly what they do soon!)
You might be able to guess their meaning from the context, and you'll learn exactly what they do soon!)
2. You might want to override the default mapping from transformed variables to aesthetics.
For example, you might want to display a bar chart of proportions, rather than counts:
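A sketch of such a chart, using `after_stat()` (available in recent versions of ggplot2) to map the computed `prop` variable to y:

```{r}
#| eval: false
# group = 1 computes proportions across all cuts,
# rather than within each bar
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
```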
@ -788,7 +788,7 @@ There are three reasons you might need to use a stat explicitly:
)
```
ggplot2 provides over 20 stats for you to use.
ggplot2 provides more than 20 stats for you to use.
Each stat is a function, so you can get help in the usual way, e.g. `?stat_bin`.
To see a complete list of stats, try the [ggplot2 cheatsheet](http://rstudio.com/resources/cheatsheets).
@ -798,7 +798,7 @@ To see a complete list of stats, try the [ggplot2 cheatsheet](http://rstudio.com
How could you rewrite the previous plot to use that geom function instead of the stat function?
2. What does `geom_col()` do?
How is it different to `geom_bar()`?
How is it different from `geom_bar()`?
3. Most geoms and stats come in pairs that are almost always used in concert.
Read through the documentation and make a list of all the pairs.
@ -809,7 +809,7 @@ To see a complete list of stats, try the [ggplot2 cheatsheet](http://rstudio.com
5. In our proportion bar chart, we need to set `group = 1`.
Why?
In other words what is the problem with these two graphs?
In other words, what is the problem with these two graphs?
```{r}
#| eval: false
@ -846,7 +846,7 @@ ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))
```
The stacking is performed automatically by the **position adjustment** specified by the `position` argument.
The stacking is performed automatically using the **position adjustment** specified by the `position` argument.
If you don't want a stacked bar chart, you can use one of three other options: `"identity"`, `"dodge"` or `"fill"`.
- `position = "identity"` will place each object exactly where it falls in the context of the graph.
@ -886,7 +886,7 @@ If you don't want a stacked bar chart, you can use one of three other options: `
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
```
There's one other type of adjustment that's not useful for bar charts, but it can be very useful for scatterplots.
There's one other type of adjustment that's not useful for bar charts, but can be very useful for scatterplots.
Recall our first scatterplot.
Did you notice that the plot displays only 126 points, even though there are 234 observations in the dataset?
@ -900,7 +900,7 @@ ggplot(data = mpg) +
The underlying values of `hwy` and `displ` are rounded so the points appear on a grid and many points overlap each other.
This problem is known as **overplotting**.
This arrangement makes it hard to see where the mass of the data is.
This arrangement makes it difficult to see the distribution of the data.
Are the data points spread equally throughout the graph, or is there one special combination of `hwy` and `displ` that contains 109 values?
You can avoid this gridding by setting the position adjustment to "jitter".
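Here's a sketch of the jittered version:

```{r}
#| eval: false
# "jitter" adds a small amount of random noise to each point,
# spreading out the overlaps
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
```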
@ -942,7 +942,7 @@ To learn more about a position adjustment, look up the help page associated with
Coordinate systems are probably the most complicated part of ggplot2.
The default coordinate system is the Cartesian coordinate system where the x and y positions act independently to determine the location of each point.
There are a number of other coordinate systems that are occasionally helpful.
There are three other coordinate systems that are occasionally helpful.
- `coord_flip()` switches the x and y axes.
This is useful (for example) if you want horizontal boxplots.
@ -1041,7 +1041,7 @@ There are a number of other coordinate systems that are occasionally helpful.
## The layered grammar of graphics
In the previous sections, you learned much more than how to make scatterplots, bar charts, and boxplots.
In the previous sections, you learned much more than just how to make scatterplots, bar charts, and boxplots.
You learned a foundation that you can use to make *any* type of plot with ggplot2.
To see this, let's add position adjustments, stats, coordinate systems, and faceting to our code template:
@ -1082,8 +1082,7 @@ You would map the values of each variable to the levels of an aesthetic.
knitr::include_graphics("images/visualization-grammar-2.png")
```
You'd then select a coordinate system to place the geoms into.
You'd use the location of the objects (which is itself an aesthetic property) to display the values of the x and y variables.
You'd then select a coordinate system to place the geoms into, using the location of the objects (which is itself an aesthetic property) to display the values of the x and y variables.
At that point, you would have a complete graph, but you could further adjust the positions of the geoms within the coordinate system (a position adjustment) or split the graph into subplots (faceting).
You could also extend the plot by adding one or more additional layers, where each additional layer uses a dataset, a geom, a set of mappings, a stat, and a position adjustment.

View File

@ -2,7 +2,7 @@
knit: "bookdown::render_book"
title: "R for Data Science (2e)"
author: "Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund"
description: "This book will teach you how to do data science with R: You'll learn how to get your data into R, get it into the most useful structure, transform it, visualise it and model it. In this book, you will find a practicum of skills for data science. Just as a chemist learns how to clean test tubes and stock a lab, you'll learn how to clean data and draw plots---and many other things besides. These are the skills that allow data science to happen, and here you will find the best practices for doing each of these things with R. You'll learn how to use the grammar of graphics, literate programming, and reproducible research to save time. You'll also learn how to manage cognitive resources to facilitate discoveries when wrangling, visualising, and exploring data."
description: "This book will teach you how to do data science with R: You'll learn how to get your data into R, get it into the most useful structure, transform it, visualise it, and model it. In this book, you will find a practicum of skills for data science. Just as a chemist learns how to clean test tubes and stock a lab, you'll learn how to clean data and draw plots---and many other things besides. These are the skills that allow data science to happen, and here you will find the best practices for doing each of these things with R. You'll learn how to use the grammar of graphics, literate programming to save time and make your work reproducible. Along the way, you'll also learn how to manage cognitive resources to facilitate discoveries when wrangling, visualising, and exploring data."
url: 'https\://r4ds.had.co.nz/'
github-repo: hadley/r4ds
twitter-handle: hadley

View File

@ -1,13 +1,13 @@
# Introduction
Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge.
The goal of "R for Data Science" is to help you learn the most important tools in R that will allow you to do data science.
Data science is an exciting discipline that allows you to transform raw data into understanding, insight, and knowledge.
The goal of "R for Data Science" is to help you learn the most important tools in R that will allow you to do data science efficiently and reproducibly.
After reading this book, you'll have the tools to tackle a wide variety of data science challenges, using the best parts of R.
## What you will learn
Data science is a huge field, and there's no way you can master it by reading a single book.
The goal of this book is to give you a solid foundation in the most important tools.
Data science is a huge field, and there's no way you can master it all by reading a single book.
The goal of this book is to give you a solid foundation in the most important tools, and enough knowledge to find the resources to learn more when necessary.
Our model of the tools needed in a typical data science project looks something like this:
```{r echo = FALSE, out.width = "75%"}
@ -21,9 +21,9 @@ If you can't get your data into R, you can't do data science on it!
Once you've imported your data, it is a good idea to **tidy** it.
Tidying your data means storing it in a consistent form that matches the semantics of the dataset with the way it is stored.
In brief, when your data is tidy, each column is a variable, and each row is an observation.
Tidy data is important because the consistent structure lets you focus your struggle on questions about the data, not fighting to get the data into the right form for different functions.
Tidy data is important because the consistent structure lets you focus your efforts on answering questions about the data, not fighting to get the data into the right form for different functions.
Once you have tidy data, a common first step is to **transform** it.
Once you have tidy data, a common next step is to **transform** it.
Transformation includes narrowing in on observations of interest (like all people in one city, or all data from the last year), creating new variables that are functions of existing variables (like computing speed from distance and time), and calculating a set of summary statistics (like counts or means).
Together, tidying and transforming are called **wrangling**, because getting your data in a form that's natural to work with often feels like a fight!
@ -32,19 +32,19 @@ These have complementary strengths and weaknesses so any real analysis will iter
**Visualisation** is a fundamentally human activity.
A good visualisation will show you things that you did not expect, or raise new questions about the data.
A good visualisation might also hint that you're asking the wrong question, or you need to collect different data.
Visualisations can surprise you and don't scale particularly well because they require a human to interpret them.
A good visualisation might also hint that you're asking the wrong question, or that you need to collect different data.
Visualisations can surprise you and they don't scale particularly well because they require a human to interpret them.
The last step of data science is **communication**, an absolutely critical part of any data analysis project.
It doesn't matter how well your models and visualisation have led you to understand the data unless you can also communicate your results to others.
Surrounding all these tools is **programming**.
Programming is a cross-cutting tool that you use in every part of the project.
You don't need to be an expert programmer to be a data scientist, but learning more about programming pays off because becoming a better programmer allows you to automate common tasks, and solve new problems with greater ease.
Programming is a cross-cutting tool that you use in nearly every part of a data science project.
You don't need to be an expert programmer to be a successful data scientist, but learning more about programming pays off, because becoming a better programmer allows you to automate common tasks, and solve new problems with greater ease.
You'll use these tools in every data science project, but for most projects they're not enough.
There's a rough 80-20 rule at play; you can tackle about 80% of every project using the tools that you'll learn in this book, but you'll need other tools to tackle the remaining 20%.
Throughout this book we'll point you to resources where you can learn more.
Throughout this book, we'll point you to resources where you can learn more.
## How this book is organised
@ -52,40 +52,42 @@ The previous description of the tools of data science is organised roughly accor
In our experience, however, this is not the best way to learn them: starting with data ingest and tidying is sub-optimal because 80% of the time it's routine and boring, and the other 20% of the time it's weird and frustrating.
That's a bad place to start learning a new subject!
Instead, we'll start with visualisation and transformation of data that's already been imported and tidied.
That way, when you ingest and tidy your own data, your motivation will stay high because you know the pain is worth it.
That way, when you ingest and tidy your own data, your motivation will stay high because you know the pain is worth the effort.
Within each chapter, we try and stick to a similar pattern: start with some motivating examples so you can see the bigger picture, and then dive into the details.
Within each chapter, we try to adhere to a similar pattern: start with some motivating examples so you can see the bigger picture, and then dive into the details.
Each section of the book is paired with exercises to help you practice what you've learned.
While it's tempting to skip the exercises, there's no better way to learn than practicing on real problems.
Although it can be tempting to skip the exercises, there's no better way to learn than practicing on real problems.
## What you won't learn
There are some important topics that this book doesn't cover.
There are a number of important topics that this book doesn't cover.
We believe it's important to stay ruthlessly focused on the essentials so you can get up and running as quickly as possible.
That means this book can't cover every important topic.
### Modelling
<!--# TO DO: Say a few sentences about modelling. -->
### Big data
This book proudly focuses on small, in-memory datasets.
This is the right place to start because you can't tackle big data unless you have experience with small data.
The tools you learn in this book will easily handle hundreds of megabytes of data, and with a little care you can typically use them to work with 1-2 Gb of data.
The tools you learn in this book will easily handle hundreds of megabytes of data, and with a little care, you can typically use them to work with 1-2 Gb of data.
If you're routinely working with larger data (10-100 Gb, say), you should learn more about [data.table](https://github.com/Rdatatable/data.table).
This book doesn't teach data.table because it has a very concise interface which makes it harder to learn since it offers fewer linguistic cues.
But if you're working with large data, the performance payoff is worth the extra effort required to learn it.
This book doesn't teach data.table because it has a very concise interface that offers fewer linguistic cues, which makes it harder to learn.
However, if you're working with large data, the performance payoff is worth the extra effort required to learn it.
If your data is bigger than this, carefully consider if your big data problem might actually be a small data problem in disguise.
While the complete data might be big, often the data needed to answer a specific question is small.
If your data is bigger than this, carefully consider whether your big data problem is actually a small data problem in disguise.
While the complete data set might be big, often the data needed to answer a specific question is small.
You might be able to find a subset, subsample, or summary that fits in memory and still allows you to answer the question that you're interested in.
The challenge here is finding the right small data, which often requires a lot of iteration.
Another possibility is that your big data problem is actually a large number of small data problems.
Another possibility is that your big data problem is actually a large number of small data problems in disguise.
Each individual problem might fit in memory, but you have millions of them.
For example, you might want to fit a model to each person in your dataset.
That would be trivial if you had just 10 or 100 people, but instead you have a million.
Fortunately each problem is independent of the others (a setup that is sometimes called embarrassingly parallel), so you just need a system (like Hadoop or Spark) that allows you to send different datasets to different computers for processing.
Once you've figured out how to answer the question for a single subset using the tools described in this book, you learn new tools like sparklyr, rhipe, and ddr to solve it for the full dataset.
This would be trivial if you had just 10 or 100 people, but instead you have a million.
Fortunately, each problem is independent of the others (a setup that is sometimes called embarrassingly parallel), so you just need a system (like [Hadoop](https://hadoop.apache.org/) or [Spark](https://spark.apache.org/)) that allows you to send different datasets to different computers for processing.
Once you've figured out how to answer your question for a single subset using the tools described in this book, you can learn new tools like **sparklyr**, **rhipe**, and **ddr** to solve it for the full dataset.
### Python, Julia, and friends
@ -100,7 +102,7 @@ This doesn't mean you should only know one thing, just that you'll generally lea
You should strive to learn new things throughout your career, but make sure your understanding is solid before you move on to the next interesting thing.
We think R is a great place to start your data science journey because it is an environment designed from the ground up to support data science.
R is not just a programming language, but it is also an interactive environment for doing data science.
R is not just a programming language, it is also an interactive environment for doing data science.
To support interaction, R is a much more flexible language than many of its peers.
This flexibility comes with its downsides, but the big upside is how easy it is to evolve tailored grammars for specific parts of the data science process.
These mini languages help you think about problems as a data scientist, while supporting fluent interaction between your brain and the computer.
@ -135,7 +137,7 @@ When a new version is available, RStudio will let you know.
It's a good idea to upgrade regularly so you can take advantage of the latest and greatest features.
For this book, make sure you have at least RStudio 1.6.0.
When you start RStudio, you'll see two key regions in the interface:
When you start RStudio, you'll see two key regions in the interface: the console pane and the output pane.
```{r echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/rstudio-console.png")
@ -150,7 +152,7 @@ You'll also need to install some R packages.
An R **package** is a collection of functions, data, and documentation that extends the capabilities of base R.
Using packages is key to the successful use of R.
The majority of the packages that you will learn in this book are part of the so-called tidyverse.
The packages in the tidyverse share a common philosophy of data and R programming, and are designed to work together naturally.
All packages in the tidyverse share a common philosophy of data and R programming, and are designed to work together naturally.
You can install the complete tidyverse with a single line of code:
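For reference, that single line is the standard CRAN installation call:

```{r, eval = FALSE}
install.packages("tidyverse")
```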
@ -162,18 +164,18 @@ On your own computer, type that line of code in the console, and then press ente
R will download the packages from CRAN and install them on to your computer.
If you have problems installing, make sure that you are connected to the internet, and that <https://cloud.r-project.org/> isn't blocked by your firewall or proxy.
You will not be able to use the functions, objects, and help files in a package until you load it with `library()`.
Once you have installed a package, you can load it with the `library()` function:
You will not be able to use the functions, objects, or help files in a package until you load it with `library()`.
Once you have installed a package, you can load it using the `library()` function:
```{r}
library(tidyverse)
```
This tells you that tidyverse is loading the ggplot2, tibble, tidyr, readr, purrr, and dplyr packages.
This tells you that tidyverse is loading eight packages: ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr, and forcats.
These are considered to be the **core** of the tidyverse because you'll use them in almost every analysis.
Packages in the tidyverse change fairly frequently.
You can see if updates are available, and optionally install them, by running `tidyverse_update()`.
You can check whether updates are available, and optionally install them, by running `tidyverse_update()`.
### Other packages
@ -192,7 +194,7 @@ These packages provide data on airline flights, world development, and baseball
## Running R code
The previous section showed you a couple of examples of running R code.
The previous section showed you several examples of running R code.
Code in the book looks like this:
```{r, eval = TRUE}
@ -209,19 +211,19 @@ In your console, you type after the `>`, called the **prompt**; we don't show th
In the book, output is commented out with `#>`; in your console it appears directly after your code.
These two differences mean that if you're working with an electronic version of the book, you can easily copy code out of the book and into the console.
Throughout the book we use a consistent set of conventions to refer to code:
Throughout the book, we use a consistent set of conventions to refer to code:
- Functions are in a code font and followed by parentheses, like `sum()`, or `mean()`.
- Functions are displayed in a code font and followed by parentheses, like `sum()` or `mean()`.
- Other R objects (like data or function arguments) are in a code font, without parentheses, like `flights` or `x`.
- Other R objects (such as data or function arguments) are in a code font, without parentheses, like `flights` or `x`.
- If we want to make it clear what package an object comes from, we'll use the package name followed by two colons, like `dplyr::mutate()`, or\
- Sometimes, to make it clear which package an object comes from, we'll use the package name followed by two colons, like `dplyr::mutate()`, or\
`nycflights13::flights`.
This is also valid R code.
## Acknowledgements
This book isn't just the product of Hadley, Mine, and Garrett, but is the result of many conversations (in person and online) that we've had with the many people in the R community.
This book isn't just the product of Hadley, Mine, and Garrett, but is the result of many conversations (in person and online) that we've had with many people in the R community.
There are a few people we'd like to thank in particular, because they have spent many hours answering our questions and helping us to better think about data science:
- Jenny Bryan and Lionel Henry for many helpful discussions around working with lists and list-columns.

View File

@ -515,7 +515,7 @@ Here are a selection that you might find useful.
So far, we've mostly used `mean()` to summarize the center of a vector of values.
Because the mean is the sum divided by the count, it is sensitive to even just a few unusually high or low values.
An alternative is to use the `median()` which finds a value that lies in the "middle" of the vector, i.e. 50% of the values is above it and 50% are below it.
An alternative is to use the `median()`, which finds a value that lies in the "middle" of the vector, i.e. 50% of the values are above it and 50% are below it.
Depending on the shape of the distribution of the variable you're interested in, mean or median might be a better measure of center.
For example, for symmetric distributions we generally report the mean, while for skewed distributions we usually report the median.
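Here's a sketch comparing the two on departure delays (it assumes the flights data is loaded):

```{r}
# Departure delays are right-skewed, so the mean
# sits well above the median
flights |>
  group_by(month) |>
  summarise(
    mean_delay = mean(dep_delay, na.rm = TRUE),
    median_delay = median(dep_delay, na.rm = TRUE)
  )
```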
@ -556,7 +556,7 @@ For these reasons, the mode tends not to be used by statisticians and there's no
What if you're interested in locations other than the center?
`min()` and `max()` will give you the largest and smallest values.
Another powerful tool is `quantile()` which is a generalization of the median: `quantile(x, 0.25)` will find a value of `x` that is greater than 25% of the values, `quantile(x, 0.5)` is equivalent to the median, and `quantile(x, 0.95)` will find a value that's greater than 95% of the values.
Another powerful tool is `quantile()`, which is a generalization of the median: `quantile(x, 0.25)` will find the value of `x` that is greater than 25% of the values, `quantile(x, 0.5)` is equivalent to the median, and `quantile(x, 0.95)` will find a value that's greater than 95% of the values.
For the flights data, you might want to look at the 95% quantile of delays rather than the maximum, because it will ignore the 5% of most delayed flights, which can be quite extreme.
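Here's a sketch of that calculation:

```{r}
# The 95th percentile of departure delay for each destination;
# assumes the flights data is loaded
flights |>
  group_by(dest) |>
  summarise(delay_p95 = quantile(dep_delay, 0.95, na.rm = TRUE))
```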

View File

@ -14,12 +14,12 @@ knitr::include_graphics("diagrams/data-science-explore.png")
<!--# TO DO: Update figure to include import and tidy as well. -->
In this part of the book you will learn some useful tools that have an immediate payoff:
In this part of the book, you will learn several useful tools that have an immediate payoff:
- Visualisation is a great place to start with R programming, because the payoff is so clear: you get to make elegant and informative plots that help you understand data.
In Chapter \@ref(data-visualisation) you'll dive into visualization, learning the basic structure of a ggplot2 plot, and powerful techniques for turning data into plots.
- Visualisation alone is typically not enough, so in Chapter \@ref(data-transform) you'll learn the key verbs that allow you to select important variables, filter out key observations, create new variables, and compute summaries.
- Visualisation alone is typically not enough, so in Chapter \@ref(data-transform), you'll learn the key verbs that allow you to select important variables, filter out key observations, create new variables, and compute summaries.
- In Chapter \@ref(data-tidy), you'll learn about tidy data, a consistent way of storing your data that makes transformation, visualization, and modelling easier.
You'll learn the underlying principles, and how to get your data into a tidy form.

View File

@ -3,7 +3,7 @@
You now have some experience running R code.
We didn't give you many details, but you've obviously figured out the basics, or you would've thrown this book away in frustration!
Frustration is natural when you start programming in R, because it is such a stickler for punctuation, and even one character out of place will cause it to complain.
But while you should expect to be a little frustrated, take comfort in that it's both typical and temporary: it happens to everyone, and the only way to get over it is to keep trying.
But while you should expect to be a little frustrated, take comfort in that this experience is both typical and temporary: it happens to everyone, and the only way to get over it is to keep trying.
Before we go any further, let's make sure you've got a solid foundation in running R code, and that you know about some of the most helpful RStudio features.
@ -18,7 +18,7 @@ You can use R as a calculator:
sin(pi / 2)
```
You can create new objects with `<-`:
You can create new objects with the assignment operator `<-`:
```{r}
x <- 3 * 4
@ -44,10 +44,10 @@ All R statements where you create objects, **assignment** statements, have the s
object_name <- value
```
When reading that code say "object name gets value" in your head.
When reading that code, say "object name gets value" in your head.
You will make lots of assignments and `<-` is a pain to type.
Don't be lazy and use `=`: it will work, but it will cause confusion later.
Don't be lazy and use `=`; it will work, but it will cause confusion later.
Instead, use RStudio's keyboard shortcut: Alt + - (the minus sign).
Notice that RStudio automagically surrounds `<-` with spaces, which is a good code formatting practice.
Code is miserable to read on a good day, so giveyoureyesabreak and use spaces.
@ -82,7 +82,7 @@ In the following example the first comment for the same code is not as good as t
## What's in a name?
Object names must start with a letter, and can only contain letters, numbers, `_` and `.`.
You want your object names to be descriptive, so you'll need a convention for multiple words.
You want your object names to be descriptive, so you'll need to adopt a convention for multiple words.
We recommend **snake_case** where you separate lowercase words with `_`.
```{r, eval = FALSE}
@ -109,10 +109,10 @@ this_is_a_really_long_name <- 2.5
To inspect this object, try out RStudio's completion facility: type "this", press TAB, add characters until you have a unique prefix, then press return.
Ooops, you made a mistake!
`this_is_a_really_long_name` should have value 3.5 not 2.5.
The value of `this_is_a_really_long_name` should be 3.5, not 2.5.
Use another keyboard shortcut to help you fix it.
Type "this" then press Cmd/Ctrl + ↑.
That will list all the commands you've typed that start with those letters.
Doing so will list all the commands you've typed that start with those letters.
Use the arrow keys to navigate, then press enter to retype the command.
Change 2.5 to 3.5 and rerun.
@ -131,7 +131,7 @@ R_rocks
#> Error: object 'R_rocks' not found
```
There's an implied contract between you and R: it will do the tedious computation for you, but in return, you must be completely precise in your instructions.
This illustrates the implied contract between you and R: R will do the tedious computations for you, but in exchange, you must be completely precise in your instructions.
Typos matter.
Case matters.
@ -143,14 +143,14 @@ R has a large collection of built-in functions that are called like this:
function_name(arg1 = val1, arg2 = val2, ...)
```
Let's try using `seq()` which makes regular **seq**uences of numbers and, while we're at it, learn more helpful features of RStudio.
Let's try using `seq()`, which makes regular **seq**uences of numbers and, while we're at it, learn more helpful features of RStudio.
Type `se` and hit TAB.
A popup shows you possible completions.
Specify `seq()` by typing more (a `q`) to disambiguate, or by using ↑/↓ arrows to select.
Notice the floating tooltip that pops up, reminding you of the function's arguments and purpose.
If you want more help, press F1 to get all the details in the help tab in the lower right pane.
Press TAB once more when you've selected the function you want.
When you've selected the function you want, press TAB again.
RStudio will add matching opening (`(`) and closing (`)`) parentheses for you.
Type the arguments `1, 10` and hit return.
@ -158,7 +158,7 @@ Type the arguments `1, 10` and hit return.
seq(1, 10)
```
Type this code and notice you get similar assistance with the paired quotation marks:
Type this code and notice that RStudio provides similar assistance with the paired quotation marks:
```{r}
x <- "hello world"
@ -172,16 +172,14 @@ If this happens, R will show you the continuation character "+":
+
The `+` tells you that R is waiting for more input; it doesn't think you're done yet.
Usually that means you've forgotten either a `"` or a `)`. Either add the missing pair, or press ESCAPE to abort the expression and try again.
Usually, this means you've forgotten either a `"` or a `)`. Either add the missing pair, or press ESCAPE to abort the expression and try again.
Now look at your environment in the upper right pane:
Note that the environment tab in the upper right pane displays all of the objects that you've created:
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("screenshots/rstudio-env.png")
```
Here you can see all of the objects that you've created.
## Exercises
1. Why does this code not work?

View File

@ -4,7 +4,7 @@
status("restructuring")
```
So far you've been using the console to run code.
So far, you have used the console to run code.
That's a great place to start, but you'll find it gets cramped pretty quickly as you create more complex ggplot2 graphics and dplyr pipes.
To give yourself more room to work, it's a great idea to use the script editor.
Open it up either by clicking the File menu and selecting New File, then R script, or by using the keyboard shortcut Cmd/Ctrl + Shift + N.
@ -66,7 +66,7 @@ This executes the current R expression in the console.
For example, take the code below.
If your cursor is at █, pressing Cmd/Ctrl + Enter will run the complete command that generates `not_cancelled`.
It will also move the cursor to the next statement (beginning with `not_cancelled |>`).
That makes it easy to run your complete script by repeatedly pressing Cmd/Ctrl + Enter.
That makes it easy to step through your complete script by repeatedly pressing Cmd/Ctrl + Enter.
```{r, eval = FALSE}
library(dplyr)
@ -80,15 +80,15 @@ not_cancelled |>
summarise(mean = mean(dep_delay))
```
Instead of running expression-by-expression, you can also execute the complete script in one step: Cmd/Ctrl + Shift + S.
Doing this regularly is a great way to check that you've captured all the important parts of your code in the script.
Instead of running your code expression-by-expression, you can also execute the complete script in one step: Cmd/Ctrl + Shift + S.
Doing this regularly is a great way to ensure that you've captured all the important parts of your code in the script.
I recommend that you always start your script with the packages that you need.
That way, if you share your code with others, they can easily see what packages they need to install.
That way, if you share your code with others, they can easily see which packages they need to install.
Note, however, that you should never include `install.packages()` or `setwd()` in a script that you share.
It's very antisocial to change settings on someone else's computer!
When working through future chapters, I highly recommend starting in the editor and practicing your keyboard shortcuts.
When working through future chapters, I highly recommend starting in the script editor and practicing your keyboard shortcuts.
Over time, sending code to the console in this way will become so natural that you won't even think about it.
## RStudio diagnostics
@ -113,9 +113,9 @@ knitr::include_graphics("screenshots/rstudio-diagnostic-warn.png")
## Workflow: projects
One day you will need to quit R, go do something else and return to your analysis the next day.
One day you will be working on multiple analyses simultaneously that all use R and you want to keep them separate.
One day you will need to bring data from the outside world into R and send numerical results and figures from R back out into the world.
One day, you will need to quit R, go do something else, and return to your analysis later.
One day, you will be working on multiple analyses simultaneously that all use R and you want to keep them separate.
One day, you will need to bring data from the outside world into R and send numerical results and figures from R back out into the world.
To handle these real life situations, you need to make two decisions:
1. What about your analysis is "real", i.e. what will you save as your lasting record of what happened?
@ -129,15 +129,15 @@ However, in the long run, you'll be much better off if you consider your R scrip
With your R scripts (and your data files), you can recreate the environment.
It's much harder to recreate your R scripts from your environment!
You'll either have to retype a lot of code from memory (making mistakes all the way) or you'll have to carefully mine your R history.
You'll either have to retype a lot of code from memory (inevitably, making mistakes along the way) or you'll have to carefully mine your R history.
To foster this behavior, I highly recommend that you instruct RStudio not to preserve your workspace between sessions:
To encourage this behavior, I highly recommend that you instruct RStudio not to preserve your workspace between sessions:
```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("screenshots/rstudio-workspace.png")
```
This will cause you some short-term pain, because now when you restart RStudio it will not remember the results of the code that you ran last time.
This will cause you some short-term pain, because now when you restart RStudio, it will no longer remember the results of the code that you ran last time.
But this short-term pain will save you long-term agony because it forces you to capture all important interactions in your code.
There's nothing worse than discovering three months after the fact that you've only stored the results of an important calculation in your workspace, not the calculation itself in your code.
@ -197,7 +197,7 @@ There are three chief ways in which they differ:
## RStudio projects
R experts keep all the files associated with a given project together --- input data, R scripts, analytical results, figures.
R experts keep all the files associated with a given project together --- input data, R scripts, analytical results, and figures.
This is such a wise and common practice that RStudio has built-in support for this via **projects**.
Let's make a project for you to use while you're working through the rest of this book.
@ -220,7 +220,7 @@ getwd()
#> [1] /Users/hadley/Documents/r4ds/r4ds
```
Whenever you refer to a file with a relative path it will look for it here.
Whenever you refer to a file using a relative path, R will look for it here.
Now enter the following commands in the script editor, and save the file, calling it "diamonds.R".
Next, run the complete script, which will save a PDF and a CSV file into your project directory.
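Here's a sketch of a script that does this (it assumes the tidyverse is installed; `geom_hex()` additionally needs the hexbin package):

```{r, eval = FALSE}
library(tidyverse)

ggplot(diamonds, aes(carat, price)) +
  geom_hex()
ggsave("diamonds.pdf")

write_csv(diamonds, "diamonds.csv")
```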
@ -244,7 +244,7 @@ Because you followed my instructions above, you will, however, have a completely
In your favorite OS-specific way, search your computer for `diamonds.pdf` and you will find the PDF (no surprise) but *also the script that created it* (`diamonds.R`).
This is a huge win!
One day you will want to remake a figure or just understand where it came from.
One day, you will want to remake a figure or just understand where it came from.
If you rigorously save figures to files **with R code** and never with the mouse or the clipboard, you will be able to reproduce old work with ease!
## Summary