Merge branch 'master' of github.com:hadley/r4ds

hadley 2016-07-29 09:57:09 -05:00
commit 8da00ed69e
3 changed files with 14 additions and 14 deletions

@@ -10,7 +10,7 @@ This chapter will show you how to use visualisation and transformation to explor
1. Use what you learn to refine your questions and or generate new questions.
-EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel be free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will hone in on a few particularly productive areas that you'll eventually write up and communicate to others.
+EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will hone in on a few particularly productive areas that you'll eventually write up and communicate to others.
EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you'll need to deploy all the tools of EDA: visualisation, transformation, and modelling.

@@ -2,7 +2,7 @@
## Introduction
-Visualisation is an important tool for insight generation, but it is rare that you get the data in exactly the right form you need. Often you'll need to create some new variables or summaries, or maybe you just want to rename the variables or reorder the observations in order to make the data a little easier to work with. You'll learn how to do all that (and more!) in this chapter which will teach you how to transform your data using the dplyr package and new dataset on flights departing New York City in 2013.
+Visualisation is an important tool for insight generation, but it is rare that you get the data in exactly the right form you need. Often you'll need to create some new variables or summaries, or maybe you just want to rename the variables or reorder the observations in order to make the data a little easier to work with. You'll learn how to do all that (and more!) in this chapter which will teach you how to transform your data using the dplyr package and a new dataset on flights departing New York City in 2013.
### Prerequisites
@@ -14,7 +14,7 @@ library(nycflights13)
library(ggplot2)
```
-Take careful note of the message that's printed when you load dplyr - it tells you that dplyr overwrite some functions in base R. If you want to use the base version of these functions after loading dplyr, you'll need to use their full names: `stats::filter()`, `base::intersect()`, etc.
+Take careful note of the message that's printed when you load dplyr - it tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you'll need to use their full names: `stats::filter()`, `base::intersect()`, etc.
### nycflights13
@@ -125,7 +125,7 @@ filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
```
-As well as `&` and `|`, R also has `&&` and `||`. Don't use them here! You'll when you should use them in [conditional execution].
+As well as `&` and `|`, R also has `&&` and `||`. Don't use them here! You'll learn when you should use them in [conditional execution].
Sometimes you want to find all rows after the first `TRUE`, or all rows until the first `FALSE`. The window functions `cumany()` and `cumall()` allow you to find these values:
@@ -309,7 +309,7 @@ select(flights, time_hour, air_time, everything())
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
```
-1. Does the result of running the following code suprise you? How do the
+1. Does the result of running the following code surprise you? How do the
select helpers deal with case by default? How can you change that default?
```{r, eval = FALSE}
@@ -784,7 +784,7 @@ daily <- group_by(flights, year, month, day)
(per_year <- summarise(per_month, flights = sum(flights)))
```
-Be careful when progressively rolling up summaries: it's OK for sums and counts, but you need to think about weighting means and variances, and it's not possible to do it exactly for rank-based statistics like the median. In otherwords, the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median.
+Be careful when progressively rolling up summaries: it's OK for sums and counts, but you need to think about weighting means and variances, and it's not possible to do it exactly for rank-based statistics like the median. In other words, the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median.
### Ungrouping
@@ -814,7 +814,7 @@ daily %>%
Which is more important: arrival delay or departure delay?
1. Our definition of cancelled flights (`!is.na(dep_delay) & !is.na(arr_delay)`
-) is slightly sup-optimal. Why? Which is the most important column?
+) is slightly suboptimal. Why? Which is the most important column?
1. Look at the number of cancelled flights per day. Is there a pattern?
Is the proportion of cancelled flights related to the average delay?
@@ -874,7 +874,7 @@ Functions that work most naturally in grouped mutates and filters are known as
1. Delays are typically temporally correlated: even once the problem that
caused the initial delay has been resolved, later flights are delayed
to allow earlier flights to leave. Using `lag()` explore how the delay
-of a flight is related to the delay of the immediately preceeding flight.
+of a flight is related to the delay of the immediately preceding flight.
1. Look at each destination. Can you find flights that are suspiciously
fast? (i.e. flights that represent a potential data entry error). Compute

@@ -211,7 +211,7 @@ ggplot(shapes, aes(x, y)) +
1. What happens if you set an aesthetic to something other than a variable
name, like `displ < 5`?
-1. Vignettes are long-form guides the documentation things about
+1. Vignettes are long-form guides that document things about
a package that affect many functions. ggplot2 has two vignettes.
How can you find them and what do they describe? (Hint: Google is
your friend.)
@@ -220,7 +220,7 @@ ggplot(shapes, aes(x, y)) +
As you start to run R code, you're likely to run into problems. Don't worry --- it happens to everyone. I have been writing R code for years, and every day I still write code that doesn't work!
-Start by carefully comparing the code that you're running to the code in the book. R is extremely picky, and a misplaced character can make all the difference. Make sure that every `(` is matched with a `)` and every `"` is paired with another `"`. Sometimes you'll run the code and nothing happens. Check the left-hand of your console: if it's a `+`, it means that R doesn't think you've typed a complete expression and it's waiting for you to finish it. In this case, it's usually easiest to start from scratch again by pressing `Escape` to abort processing the current command.
+Start by carefully comparing the code that you're running to the code in the book. R is extremely picky, and a misplaced character can make all the difference. Make sure that every `(` is matched with a `)` and every `"` is paired with another `"`. Sometimes you'll run the code and nothing happens. Check the left-hand of your console: if it's a `+`, it means that R doesn't think you've typed a complete expression and it's waiting for you to finish it. In this case, it's usually easy to start from scratch again by pressing `Escape` to abort processing the current command.
One common problem when creating ggplot2 graphics is to put the `+` in the wrong place: it has to come at the end of the line, not the start. In other words, make sure you haven't accidentally written code like this:
@@ -248,8 +248,8 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
To facet your plot on the combination of two variables, add `facet_grid()` to your plot call. The first argument of `facet_grid()` is also a formula. This time the formula should contain two variable names separated by a `~`.
```{r}
-ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
-geom_point() +
+ggplot(data = mpg) +
+geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)
```
@@ -410,7 +410,7 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
A histogram? An area chart?
1. Run this code in your head and predict what the output will look like.
-Run the code in R and check your predictions.
+Then, run the code in R and check your predictions.
```{r, eval = FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
@@ -496,7 +496,7 @@ Stats are the most subtle part of plotting because you can't see them directly.
1. You might want to override the default stat. In the code below, I change
the stat of `geom_bar()` from count (the default) to identity. This lets
-me map to the height of the bars to the raw values of a $y$ variable.
+me map the height of the bars to the raw values of a $y$ variable.
```{r}
demo <- tibble::tibble(