Update intro.Rmd (#183)

* Update intro.Rmd

* Update visualize.Rmd
Nick Clark 2016-07-23 23:36:49 -03:00 committed by Hadley Wickham
parent aea913df94
commit ed7348a083
2 changed files with 15 additions and 15 deletions

@@ -180,7 +180,7 @@ Throughout the book we use a consistent set of conventions to refer to code:
This book is not an island: there is no single resource that will allow you to master R. As you start to apply the techniques described in this book to your own data you will soon find questions that I do not answer. This section describes a few tips to help you get help, and to help you keep learning.
If you get stuck, start with google. Typically adding "R" to a query is enough to restrict it to relevant results: if the search isn't useful, it often means that there aren't any R specific results available. Google is particuarly useful for error messages. If you get an error message and you have no idea what it means, try googling it! Chances are that someone else has been confused by it in the past, and there will be help somewhere on the web. (If the error message isn't in English, run `Sys.setenv(LANGUAGE = "en")` and re-run the code; you're more likely to find help for English error messages.)
If you get stuck, start with google. Typically adding "R" to a query is enough to restrict it to relevant results: if the search isn't useful, it often means that there aren't any R specific results available. Google is particularly useful for error messages. If you get an error message and you have no idea what it means, try googling it! Chances are that someone else has been confused by it in the past, and there will be help somewhere on the web. (If the error message isn't in English, run `Sys.setenv(LANGUAGE = "en")` and re-run the code; you're more likely to find help for English error messages.)
If google doesn't help, try [stackoverflow](http://stackoverflow.com). Start by spending a little time searching for an existing answer (including `[R]` to restrict your search to questions about R). If you don't find anything useful, prepare a minimal reproducible example or __reprex__. A good reprex makes it easier for other people to help you, and often you'll figure out the problem yourself in the course of making it.
@@ -222,7 +222,7 @@ To keep up with the R community more broadly, we recommend reading <http://www.r
## Acknowledgements
This book isn't just the product of Hadley and Garrett, but is the result of many conversations (in person and online) that we've had with the many people in the R community. There are few people we'd like to thank in particularly, because they have spent many hours answering our dumb questions and helping us to better think about data science:
This book isn't just the product of Hadley and Garrett, but is the result of many conversations (in person and online) that we've had with the many people in the R community. There are few people we'd like to thank in particular, because they have spent many hours answering our dumb questions and helping us to better think about data science:
* Jenny Bryan and Lionel Henry for many helpful discussions around working
with lists and list-columns.
@@ -234,10 +234,10 @@ This book isn't just the product of Hadley and Garrett, but is the result of man
* Yihui Xie for his work on the [bookdown](https://github.com/rstudio/bookdown)
package, and for tirelessly responding to my feature requests.
* Bill Behrman for his thoughtful reading of the entinre book, and for trying
* Bill Behrman for his thoughtful reading of the entire book, and for trying
it out with his data science class at Stanford.
This book was written in the open, and many people contributed pull requests to fix minor problems. I special thanks goes to everyone who contributed via GitHub:
This book was written in the open, and many people contributed pull requests to fix minor problems. Special thanks goes to everyone who contributed via GitHub:
```{r, results = "asis", echo = FALSE, message = FALSE}
library(dplyr)

@@ -5,7 +5,7 @@
> "The simple graph has brought more information to the data analysts mind
> than any other device." --- John Tukey
This chapter will teach you how to visualize your data using ggplot2. R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2 implements the __grammar of graphics__, a coherent system for describing and building graphs. With ggplot2, you can do more faster by learning one system and applying it in many places.
This chapter will teach you how to visualise your data using ggplot2. R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2 implements the __grammar of graphics__, a coherent system for describing and building graphs. With ggplot2, you can do more faster by learning one system and applying it in many places.
### Prerequisites
@@ -103,7 +103,7 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(data = dplyr::filter(mpg, displ > 5, hwy > 20), colour = "red", size = 2.2)
```
Let's hypothesize that the cars are hybrids. One way to test this hypothesis is to look at the `class` value for each car. The `class` variable of the `mpg` dataset classifies cars into groups such as compact, midsize, and suv. If the outlying points are hybrids, they should be classified as compact cars or, perhaps, subcompact cars (keep in mind that this data was collected before hybrid trucks and suvs became popular).
Let's hypothesize that the cars are hybrids. One way to test this hypothesis is to look at the `class` value for each car. The `class` variable of the `mpg` dataset classifies cars into groups such as compact, midsize, and SUV. If the outlying points are hybrids, they should be classified as compact cars or, perhaps, subcompact cars (keep in mind that this data was collected before hybrid trucks and SUVs became popular).
You can add a third variable, like `class`, to a two dimensional scatterplot by mapping it to an __aesthetic__. An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points. You can display a point (like the one below) in different ways by changing the values of its aesthetic properties. Since we already use the word "value" to describe data, let's use the word "level" to describe aesthetic properties. Here we change the levels of a point's size, shape, and color to make the point small, triangular, or blue:
@@ -222,7 +222,7 @@ As you start to run R code, you're likely to run into problems. Don't worry ---
Start by carefully comparing the code that you're running to the code in the book. R is extremely picky, and a misplaced character can make all the difference. Make sure that every `(` is matched with a `)` and every `"` is paired with another `"`. Sometimes you'll run the code and nothing happens. Check the left-hand of your console: if it's a `+`, it means that R doesn't think you've typed a complete expression and it's waiting for you to finish it. In this case, it's usually easiest to start from scratch again by pressing `Escape` to abort processing the current command.
One common problem when creating ggplot2 graphics is to put the `+` in the wrong place: it has to come at the end of the line, not the start. In other words, make sure you haven't accidentally written code this:
One common problem when creating ggplot2 graphics is to put the `+` in the wrong place: it has to come at the end of the line, not the start. In other words, make sure you haven't accidentally written code like this:
```R
ggplot(data = mpg)
@@ -287,7 +287,7 @@ If you prefer to not facet in the rows or columns dimension, use a `.` instead o
facet_wrap(~ class, nrow = 2)
```
What are the advantages to using facetting instead of the colour aesthetic?
What are the advantages to using faceting instead of the colour aesthetic?
What are the disadvantages? How might the balance change if you had a
larger dataset?
@@ -333,7 +333,7 @@ ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
```
Here `geom_smooth()` separates the cars into three lines based on their `drv` value, which describes a car's drive train. One line describes all of the points with a `4` value, one line describes all of the points with an `f` value, and one line describes all of the points with an `r` value. Here, `4` stands for four wheel drive, `f` for front wheel drive, and `r` for rear wheel drive.
Here `geom_smooth()` separates the cars into three lines based on their `drv` value, which describes a car's drivetrain. One line describes all of the points with a `4` value, one line describes all of the points with an `f` value, and one line describes all of the points with an `r` value. Here, `4` stands for four wheel drive, `f` for front wheel drive, and `r` for rear wheel drive.
If this sounds strange, we can make it more clear by overlaying the lines on top of the raw data and then coloring everything according to `drv`.
@@ -458,16 +458,16 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(aes(colour = drv))
```
## Statical transformations
## Statistical transformations
Next, lets take a look at a bar chart. Bar charts seem simple, but they are interesting because they reveal something subtle about plots. Consider a basic bar chart, as drawn with `geom_bar()`. The following chart displays the total number of diamonds in the `diamonds` dataset, grouped by `cut`. The `diamonds` dataset comes in ggplot2 and contains information about ~54,000 diamonds, including the `price`, `carat`, `color`, `clarity`, and `cut` of each diamond. The chart shows that more diamonds are available with high quality cuts than with low quality cuts.
Next, let's take a look at a bar chart. Bar charts seem simple, but they are interesting because they reveal something subtle about plots. Consider a basic bar chart, as drawn with `geom_bar()`. The following chart displays the total number of diamonds in the `diamonds` dataset, grouped by `cut`. The `diamonds` dataset comes in ggplot2 and contains information about ~54,000 diamonds, including the `price`, `carat`, `color`, `clarity`, and `cut` of each diamond. The chart shows that more diamonds are available with high quality cuts than with low quality cuts.
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
```
On the x axis, the chart displays `cut`, a variable from `diamonds`. On the y axis, it displays count, but count is not a variable in `diamonds`! Where does count come from? Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot:
On the x-axis, the chart displays `cut`, a variable from `diamonds`. On the y-axis, it displays count, but count is not a variable in `diamonds`! Where does count come from? Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot:
* __bar charts__, __histograms__, and __frequency polygons__ bin your data
and then plot bin counts, the number of points that fall in each bin.
@@ -559,7 +559,7 @@ ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
```
Note what happens if you mapped the fill aesthetic to another variable, like `clarity`: the bars are automatically stacked. Each colored rectangle represents a combination of `cut` and `clarity`.
Note what happens if you map the fill aesthetic to another variable, like `clarity`: the bars are automatically stacked. Each colored rectangle represents a combination of `cut` and `clarity`.
```{r}
ggplot(data = diamonds) +
@@ -618,7 +618,7 @@ ggplot(data = mpg) +
ggtitle('Position = "jitter"')
```
Adding randomness seems like a strange way to improve your plot, but while makes your graph a less accurate at small scales, it makes your graph _more_ revealing at large scales. Because this is such a useful operation, ggplot2 comes with a shorthand for `geom_point(position = "jitter")`: `geom_jitter()`.
Adding randomness seems like a strange way to improve your plot, but while it makes your graph less accurate at small scales, it makes your graph _more_ revealing at large scales. Because this is such a useful operation, ggplot2 comes with a shorthand for `geom_point(position = "jitter")`: `geom_jitter()`.
To learn more about a position adjustment, look up the help page associated with each adjustment: `?position_dodge`, `?position_fill`, `?position_identity`, `?position_jitter`, and `?position_stack`.
@@ -740,7 +740,7 @@ Next, you could choose a geometric object to represent each observation in the t
knitr::include_graphics("images/visualization-grammar-2.png")
```
You'd then select a coordinate system to place the geoms into. You'd use the location of the objects (which is itself an aesthetic property) to display the values of the x and y variables. At that point, you would have a complete graph, but you could further adjust the positions of the geoms within the coordinate system (a position adjustment) or split the graph into subplots (facetting). You could also extend the plot by adding one or more additional layers, where each additional layer uses a dataset, a geom, a set of mappings, a stat, and a position adjustment.
You'd then select a coordinate system to place the geoms into. You'd use the location of the objects (which is itself an aesthetic property) to display the values of the x and y variables. At that point, you would have a complete graph, but you could further adjust the positions of the geoms within the coordinate system (a position adjustment) or split the graph into subplots (faceting). You could also extend the plot by adding one or more additional layers, where each additional layer uses a dataset, a geom, a set of mappings, a stat, and a position adjustment.
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/visualization-grammar-3.png")