Suggestions from Marie

hadley 2016-10-04 11:08:12 -05:00
parent 32bd47212d
commit 5e015bf977
6 changed files with 55 additions and 49 deletions


@@ -37,9 +37,9 @@ EDA is fundamentally a creative process. And like most creative processes, the k
There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:
-1. What type of **variation** occurs **within** my variables?
+1. What type of variation occurs within my variables?
-1. What type of **covariation** occurs **between** my variables?
+1. What type of covariation occurs between my variables?
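To make these concrete before we dive in: with the diamonds data used throughout this chapter, you might examine variation within a single variable with a histogram, and covariation between two variables with boxplots. (A sketch, assuming the tidyverse is loaded; both ideas get fuller treatment below.)

```{r eval = FALSE}
# Variation within one variable: the distribution of carat
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)

# Covariation between two variables: how price varies with cut
ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = cut, y = price))
```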
The rest of this chapter will look at these two questions. I'll explain what variation and covariation are, and I'll show you several ways to answer each question. To make the discussion easier, let's define some terms:
@@ -117,7 +117,7 @@ ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
There are a few challenges with this type of plot, which we will come back to in [visualising a categorical and a continuous variable](#cat-cont).
-Now that you can visualise variation, what should you look for in your plots? And what type of follow-up questions should you ask? I've put together a list below of the most useful types of information that you will find in your graphs, along with some follow-up questions for each type of information. The key to asking good follow-up questions will be to rely on your **curiosity** (What do you want to learn more about?) as well as your **skepticism** (How could this be misleading?).
+Now that you can visualise variation, what should you look for in your plots? And what type of follow-up questions should you ask? I've put together a list below of the most useful types of information that you will find in your graphs, along with some follow-up questions for each type of information. The key to asking good follow-up questions will be to rely on your curiosity (What do you want to learn more about?) as well as your skepticism (How could this be misleading?).
### Typical values


@@ -2,15 +2,13 @@
# Introduction {#explore-intro}
-The goal of the first part of this book is to get you up to speed with the basic tools of data exploration as quickly as possible:
+The goal of the first part of this book is to get you up to speed with the basic tools of __data exploration__ as quickly as possible. Data exploration is the art of looking at your data, rapidly generating hypotheses, quickly testing them, then repeating again and again and again. The goal of data exploration is to generate many promising leads that you can later explore in more depth.
```{r echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/data-science-explore.png")
```
-You will get frustrated when you start programming in R, because it is such a stickler. Even one character out of place will cause it to complain. However, that frustration is both typical and temporary. It happens to everyone, and the only way to get over it is to keep trying.
-The goal of this part of the book is to get you some useful tools with an immediate payoff as quickly as possible:
+In this part of the book you will learn some useful tools that have an immediate payoff:
* Visualisation is a great place to start with R programming, because the
  payoff is so clear: you get to make elegant and informative plots that help
  you understand data.


@@ -14,7 +14,7 @@ First you must __import__ your data into R. This typically means that you take d
Once you've imported your data, it is a good idea to __tidy__ it. Tidying your data means storing it in a consistent form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation. Tidy data is important because the consistent structure lets you focus your struggle on questions about the data, not fighting to get the data into the right form for different functions.
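As a minimal illustration (the numbers are invented), this table is tidy: each variable — `country`, `year`, `population` — is a column, and each row is one observation:

```{r eval = FALSE}
library(tibble)

tidy_pop <- tibble(
  country    = c("Afghanistan", "Afghanistan", "Brazil", "Brazil"),
  year       = c(1999, 2000, 1999, 2000),
  population = c(20e6, 21e6, 172e6, 174e6)
)
```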
-Once you have tidy data, a common first step is to __transform__ it. You may zero in on a subset of data, add new variables that are functions of existing variables, or calculate a set of summary statistics.
+Once you have tidy data, a common first step is to __transform__ it. Transformation includes narrowing in on observations of interest (like all people in one city, or all data from the last year), creating new variables that are functions of existing variables (like computing velocity from speed and time), and calculating a set of summary statistics (like counts or means). Together, tidying and transforming are called __wrangling__, because getting your data in a form that's natural to work with often feels like a fight!
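A sketch of all three transformations, using the dplyr verbs you'll meet in the data transformation chapter and a hypothetical `trips` table:

```{r eval = FALSE}
library(dplyr)

trips <- tibble::tibble(
  city     = c("Houston", "Houston", "Denver"),
  distance = c(10, 24, 8),   # miles
  hours    = c(0.5, 1.2, 0.4)
)

trips %>%
  filter(city == "Houston") %>%                 # narrow to observations of interest
  mutate(speed = distance / hours) %>%          # a new variable from existing ones
  summarise(n = n(), mean_speed = mean(speed))  # a set of summary statistics
```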
Once you have tidy data with the variables you need, there are two main engines of knowledge generation: visualisation and modelling. These have complementary strengths and weaknesses so any real analysis will iterate between them many times.
@@ -26,9 +26,9 @@ The last step of data science is __communication__, an absolutely critical part
Surrounding all these tools is __programming__. Programming is a cross-cutting tool that you use in every part of the project. You don't need to be an expert programmer to be a data scientist, but learning more about programming pays off because becoming a better programmer allows you to automate common tasks, and solve new problems with greater ease.
-You'll use these six tools in every data science project, but for most projects they're not enough. There's a rough 80-20 rule at play; you can tackle about 80% of every project using the tools that you'll learn in this book, but you'll need other tools to tackle the remaining 20%. Throughout this book we'll point you to resources where you can learn more.
+You'll use these tools in every data science project, but for most projects they're not enough. There's a rough 80-20 rule at play; you can tackle about 80% of every project using the tools that you'll learn in this book, but you'll need other tools to tackle the remaining 20%. Throughout this book we'll point you to resources where you can learn more.
-## How you will learn
+## How this book is organised
The previous description of the tools of data science is organised roughly according to the order in which you use them in an analysis (although of course you'll iterate through them multiple times). In our experience, however, this is not the best way to learn them:
@@ -61,7 +61,7 @@ This book proudly focuses on small, in-memory datasets. This is the right place
If your data is bigger than this, carefully consider if your big data problem might actually be a small data problem in disguise. While the complete data might be big, often the data needed to answer a specific question is small. You might be able to find a subset, subsample, or summary that fits in memory and still allows you to answer the question that you're interested in. The challenge here is finding the right small data, which often requires a lot of iteration.
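For example, once the candidate rows are in R (or pulled from a database), drawing a workable subsample is a one-liner; `big_df` here is hypothetical:

```{r eval = FALSE}
# Keep a random 100,000 rows of a (hypothetical) larger table
smaller <- dplyr::sample_n(big_df, 1e5)
```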
-Another possibility is that your big data problem is actually a large number of small data problems. Each individual problem might fit in memory, but you have millions of them. For example, you might want to fit a model to each person in your dataset. That would be trivial if you had just 10 or 100 people, but instead you have a million. Fortunately each problem is independent of the others (a setup that is sometimes called embarrassingly parallel), so you just need a system (like Hadoop or Spark) that allows you to send different datasets to different computers for processing. Once you've figured out how to answer the question for a single subset using the tools described in this book, you can use packages like sparklyr, rhipe, and ddr to solve it for the full dataset.
+Another possibility is that your big data problem is actually a large number of small data problems. Each individual problem might fit in memory, but you have millions of them. For example, you might want to fit a model to each person in your dataset. That would be trivial if you had just 10 or 100 people, but instead you have a million. Fortunately each problem is independent of the others (a setup that is sometimes called embarrassingly parallel), so you just need a system (like Hadoop or Spark) that allows you to send different datasets to different computers for processing. Once you've figured out how to answer the question for a single subset using the tools described in this book, you can learn new tools like sparklyr, rhipe, and ddr to solve it for the full dataset.
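The in-memory version of that pattern looks something like the following sketch (assuming a hypothetical data frame `df` with columns `person`, `x`, and `y`); tools like sparklyr let you run the same idea across a cluster:

```{r eval = FALSE}
library(purrr)

# One small, independent modelling problem per person
models <- df %>%
  split(.$person) %>%
  map(~ lm(y ~ x, data = .x))
```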
### Python, Julia, and friends
@@ -69,6 +69,8 @@ In this book, you won't learn anything about Python, Julia, or any other program
However, we strongly believe that it's best to master one tool at a time. You will get better faster if you dive deep, rather than spreading yourself thinly over many topics. This doesn't mean you should only know one thing, just that you'll generally learn faster if you stick to one thing at a time. You should strive to learn new things throughout your career, but make sure your understanding is solid before you move on to the next interesting thing.
+We think R is a great place to start your data science journey because it is an environment designed from the ground up to support data science. R is not just a programming language, but it is also an interactive environment for doing data science. To support interaction, R is a much more flexible language than many of its peers. This flexibility comes with its downsides, but the big upside is how easy it is to evolve tailored grammars for specific parts of the data science process. These mini languages help you think about problems as a data scientist, while supporting fluent interaction between your brain and the computer.
### Non-rectangular data
This book focuses exclusively on rectangular data: collections of values that are each associated with a variable and an observation. There are lots of datasets that do not naturally fit in this paradigm: including images, sounds, trees, and text. But rectangular data frames are extremely common in science and industry, and we believe that they're a great place to start your data science journey.
@@ -95,7 +97,7 @@ It's common to think about modelling as a tool for hypothesis confirmation, and
We've made a few assumptions about what you already know in order to get the most out of this book. You should be generally numerically literate, and it's helpful if you have some programming experience already. If you've never programmed before, you might find [Hands on Programming with R](http://amzn.com/1449359019) by Garrett to be a useful adjunct to this book.
-There are four things you need to run the code in this book: R, RStudio, a collection of R packages called the __tidyverse__, and a handful on other packages.
+There are four things you need to run the code in this book: R, RStudio, a collection of R packages called the __tidyverse__, and a handful of other packages. Packages are the fundamental units of reproducible R code. They include reusable functions, the documentation that describes how to use them, and sample data.
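You install a package once with `install.packages()`, then load it in each new session with `library()`. A sketch (the extra package names are examples of data packages the book draws on):

```{r eval = FALSE}
install.packages("tidyverse")
install.packages(c("nycflights13", "gapminder", "Lahman"))

library(tidyverse)   # repeat in each new R session
```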
### R
@@ -158,15 +160,14 @@ The previous section showed you a couple of examples of running R code. Code in
#> [1] 3
```
-If you run the same code in you're console, it will look like this:
+If you run the same code in your local console, it will look like this:
```
# In your R console
> 1 + 2
[1] 3
```
-In your console, input starts at `>`, called the __prompt__. In the book, output is commented out with `#>`. Together, these differences mean that if you're working with an electronic version of the book, you can easily copy code out of the book and into the console.
+There are two main differences. In your console, you type after the `>`, called the __prompt__; we don't show the prompt in the book. In the book, output is commented out with `#>`; in your console it appears directly after your code. These two differences mean that if you're working with an electronic version of the book, you can easily copy code out of the book and into the console.
Throughout the book we use a consistent set of conventions to refer to code:


@@ -419,19 +419,22 @@ There are many functions for creating new variables that you can use with `mutat
start with `min_rank()`. It does the most usual type of ranking
(e.g. 1st, 2nd, 2nd, 4th). The default gives smallest values the smallest
ranks; use `desc(x)` to give the largest values the smallest ranks.
-If `min_rank()` doesn't do what you need, look at the variants
-`row_number()`, `dense_rank()`, `percent_rank()`, `cume_dist()`,
-`ntile()`.
```{r}
y <- c(1, 2, 2, NA, 3, 4)
-tibble(
-  row_number(y),
-  min_rank(y),
-  dense_rank(y),
-  percent_rank(y),
-  cume_dist(y)
-)
+min_rank(y)
+min_rank(desc(y))
```
+If `min_rank()` doesn't do what you need, look at the variants
+`row_number()`, `dense_rank()`, `percent_rank()`, `cume_dist()`,
+`ntile()`. See their help pages for more details.
+```{r}
+row_number(y)
+dense_rank(y)
+percent_rank(y)
+cume_dist(y)
+```
### Exercises


@@ -7,6 +7,8 @@
This chapter will teach you how to visualise your data using ggplot2. R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2 implements the __grammar of graphics__, a coherent system for describing and building graphs. With ggplot2, you can do more faster by learning one system and applying it in many places.
+If you'd like to learn more about the theoretical underpinnings of ggplot2 before you start, I'd recommend reading "The Layered Grammar of Graphics", <http://vita.had.co.nz/papers/layered-grammar.pdf>.
### Prerequisites
This chapter focusses on ggplot2, one of the core members of the tidyverse. To access the datasets, help pages, and functions that we will use in this chapter, load the tidyverse by running this code:
@@ -28,17 +30,19 @@ You only need to install a package once, but you need to reload it every time yo
If we need to be explicit about where a function (or dataset) comes from, we'll use the special form `package::function()`. For example, `ggplot2::ggplot()` tells you explicitly that we're using the `ggplot()` function from the ggplot2 package.
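For example, these two calls are equivalent once ggplot2 is loaded; the `::` form just spells out where the function lives:

```{r eval = FALSE}
library(ggplot2)
ggplot2::ggplot(data = mpg)   # explicit about the package
ggplot(data = mpg)            # relies on ggplot2 being loaded
```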
-## A graphing template
+## First steps
Let's use our first graph to answer a question: Do cars with big engines use more fuel than cars with small engines? You probably already have an answer, but try to make your answer precise. What does the relationship between engine size and fuel efficiency look like? Is it positive? Negative? Linear? Nonlinear?
-You can test your answer with the `mpg` dataset in ggplot2, or `ggplot2::mpg`:
+### The `mpg` data frame
+You can test your answer with the `mpg` __data frame__ found in ggplot2 (aka `ggplot2::mpg`). A data frame is a rectangular collection of variables (in the columns) and observations (in the rows). `mpg` contains observations collected by the US Environment Protection Agency on 38 models of cars.
```{r}
mpg
```
-The dataset contains observations collected by the US Environment Protection Agency on 38 models of cars. Among the variables in `mpg` are:
+Among the variables in `mpg` are:
1. `displ`, a car's engine size, in litres.
@@ -48,6 +52,8 @@ The dataset contains observations collected by the US Environment Protection Age
To learn more about `mpg`, open its help page by running `?mpg`.
+### Creating a ggplot
To plot `mpg`, run this code to put `displ` on the x-axis and `hwy` on the y-axis:
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
```
@@ -57,19 +63,14 @@ ggplot(data = mpg) +
The plot shows a negative relationship between engine size (`displ`) and fuel efficiency (`hwy`). In other words, cars with big engines use more fuel. Does this confirm or refute your hypothesis about fuel efficiency and engine size?
-Pay close attention to this code because it is almost a template for making plots with ggplot2.
-```{r eval=FALSE}
-ggplot(data = mpg) +
-  geom_point(mapping = aes(x = displ, y = hwy))
-```
With ggplot2, you begin a plot with the function `ggplot()`. `ggplot()` creates a coordinate system that you can add layers to. The first argument of `ggplot()` is the dataset to use in the graph. So `ggplot(data = mpg)` creates an empty graph, but it's not very interesting so I'm not going to show it here.
You complete your graph by adding one or more layers to `ggplot()`. The function `geom_point()` adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with many geom functions that each add a different type of layer to a plot. You'll learn a whole bunch of them throughout this chapter.
Each geom function in ggplot2 takes a `mapping` argument. This defines how variables in your dataset are mapped to visual properties. The `mapping` argument is always paired with `aes()`, and the `x` and `y` arguments of `aes()` specify which variables to map to the x and y axes. ggplot2 looks for the mapped variable in the `data` argument, in this case, `mpg`.
+### A graphing template
Let's turn this code into a reusable template for making graphs with ggplot2. To make a graph, replace the bracketed sections in the code below with a dataset, a geom function, or a collection of mappings.
```{r eval = FALSE}
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
```
@@ -143,10 +144,12 @@ ggplot(data = mpg) +
Or we could have mapped `class` to the _alpha_ aesthetic, which controls the transparency of the points, or the shape of the points.
-```{r out.width = "50%", fig.align = 'default', warning = FALSE, fig.asp = 1/2}
+```{r out.width = "50%", fig.align = 'default', warning = FALSE, fig.asp = 1/2, fig.cap = ""}
# Left
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
# Right
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
```
@@ -167,12 +170,12 @@ ggplot(data = mpg) +
Here, the color doesn't convey information about a variable, but only changes the appearance of the plot. To set an aesthetic manually, set the aesthetic by name as an argument of your geom function; i.e. it goes _outside_ of `aes()`. You'll need to pick a value that makes sense for that aesthetic:
* The name of a color as a character string.
* The size of a point in mm.
-* The shape of a point as a number, as shown below.
-R has a set of 25 built-in shapes, identified by numbers:
+* The shape of a point as a number, as shown in Figure \@ref(fig:shapes).
-```{r echo = FALSE, out.width = "75%", fig.asp = 1/3}
+```{r shapes, echo = FALSE, out.width = "75%", fig.asp = 1/3, fig.cap="R has 25 built-in shapes that are identified by numbers. There are some seeming duplicates: for example, 0, 15, and 22 are all squares. The difference comes from the interaction of the `colour` and `fill` aesthetics. The hollow shapes (0--14) have a border determined by `colour`; the solid shapes (15--18) are filled with `colour`; the filled shapes (21--24) have a border of `colour` and are filled with `fill`.", warning = FALSE}
shapes <- tibble(
shape = c(0, 1, 2, 5, 3, 4, 6:19, 22, 21, 24, 23, 20),
x = (0:24 %/% 5) / 2,
@ -189,8 +192,6 @@ ggplot(shapes, aes(x, y)) +
theme(aspect.ratio = 1/2.75)
```
-Note that there are some seeming duplicates: 0, 15, and 22 are all squares. The difference comes from the interaction of the `colour` and `fill` aesthetics. The hollow shapes (0--14) have a border determined by `colour`; the solid shapes (15--18) are filled with `colour`; the filled shapes (21--24) have a border of `colour` and are filled with `fill`.
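For instance, a sketch that sets all three aesthetics manually (the values are arbitrary):

```{r eval = FALSE}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy),
             colour = "blue", size = 2, shape = 17)
```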
### Exercises
1. What's gone wrong with this code? Why are the points not blue?
@@ -365,7 +366,7 @@ ggplot(data = mpg) +
To display multiple geoms in the same plot, add multiple geom functions to `ggplot()`:
-```{r}
+```{r, message = FALSE}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))
```
@@ -381,7 +382,7 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
If you place mappings in a geom function, ggplot2 will treat them as local mappings for the layer. It will use these mappings to extend or overwrite the global mappings _for that layer only_. This makes it possible to display different aesthetics in different layers.
-```{r}
+```{r, message = FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
  geom_smooth()
```
@@ -389,13 +390,13 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
You can use the same idea to specify different `data` for each layer. Here, our smooth line displays just a subset of the `mpg` dataset, the subcompact cars. The local data argument in `geom_smooth()` overrides the global data argument in `ggplot()` for that layer only.
-```{r}
+```{r, message = FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
-  geom_smooth(data = dplyr::filter(mpg, class == "subcompact"), se = FALSE)
+  geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)
```
-(Remember, `dplyr::filter()` calls the `filter()` function from the dplyr package. You'll learn how `filter()` works in the next chapter.)
+(You'll learn how `filter()` works in the next chapter: for now, just know that this command selects only the subcompact cars.)
### Exercises
@@ -497,7 +498,7 @@ This works because every geom has a default stat; and every stat has a default g
present in the data, or the previous bar chart where the height of the bar
is generated by counting rows.
-```{r}
+```{r, warning = FALSE}
demo <- tibble(
a = c("bar_1", "bar_2", "bar_3"),
  b = c(20, 30, 40)
)

ggplot(data = demo) +
  geom_bar(mapping = aes(x = a, y = b), stat = "identity")
```
@@ -661,7 +662,8 @@ To learn more about a position adjustment, look up the help page associated with
Coordinate systems are probably the most complicated part of ggplot2. The default coordinate system is the Cartesian coordinate system where the x and y position act independently to find the location of each point. There are a number of other coordinate systems that are occasionally helpful.
* `coord_flip()` switches the x and y axes. This is useful (for example),
-  if you want horizontal boxplots.
+  if you want horizontal boxplots. It's also useful for long labels: it's
+  hard to get them to fit without overlapping on the x-axis.
```{r fig.width = 3, out.width = "50%", fig.align = "default"}
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip()
```


@@ -1,6 +1,8 @@
# Workflow: basics
-You now have some experience running R code. I didn't give you many details, but you've obviously figured out the basics, or you would've thrown this book away in frustration! Before we go any further, let's make sure you've got a solid foundation in running R code and, and that you know about some of the most helpful RStudio features.
+You now have some experience running R code. I didn't give you many details, but you've obviously figured out the basics, or you would've thrown this book away in frustration! Frustration is natural when you start programming in R, because it is such a stickler for punctuation, and even one character out of place will cause it to complain. But while you should expect to be a little frustrated, take comfort in that it's both typical and temporary: it happens to everyone, and the only way to get over it is to keep trying.
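For example, a single mistyped name is enough to stop R cold (a sketch; the exact wording of the message varies slightly by context):

```{r eval = FALSE}
lenght(c(1, 2, 3))    # note the typo in `length`
#> Error: could not find function "lenght"
```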
+Before we go any further, let's make sure you've got a solid foundation in running R code, and that you know about some of the most helpful RStudio features.
## Coding basics