Merge branch 'master' of github.com:hadley/r4ds

hadley 2016-08-30 17:32:14 -05:00
commit 26bd6c8a54
5 changed files with 16 additions and 17 deletions


@ -78,7 +78,7 @@ ggplot(df, aes(x, y)) +
### Exercises
1. Create one plot of the fuel economy data with customized `title`,
1. Create one plot on the fuel economy data with customised `title`,
`subtitle`, `caption`, `x`, `y`, and `colour` labels.
1. The `geom_smooth()` is somewhat misleading because the `hwy` for
@ -221,7 +221,7 @@ The only limit is your imagination (and your patience with positioning annotatio
### Exercises
1. Use `geom_text()` with infinite positions to place text at of the
1. Use `geom_text()` with infinite positions to place text at the
four corners of the plot.
1. Read the documentation for `annotate()`. How can you use it to add a text
@ -287,7 +287,7 @@ ggplot(mpg, aes(displ, hwy)) +
scale_y_continuous(labels = NULL)
```
You can also use `breaks` and `labels` to control the appearance of legends. Collectively axes and legends are called __guides__. Axes are used for x and y aesthetics; legends are used used for everything else.
You can also use `breaks` and `labels` to control the appearance of legends. Collectively axes and legends are called __guides__. Axes are used for x and y aesthetics; legends are used for everything else.
Another use of `breaks` is when you have relatively few data points and want to highlight exactly where the observations occur. For example, take this plot that shows when each US president started and ended their term.
@ -484,7 +484,7 @@ In this particular case, you could have simply used faceting, but this technique
## Themes
Finally, you can customize the non-data elements of your plot with a theme:
Finally, you can customise the non-data elements of your plot with a theme:
```{r, message = FALSE}
ggplot(mpg, aes(displ, hwy)) +
@ -521,7 +521,7 @@ Generally, however, I think you should be assembling your final reports using R
### Figure sizing
The biggest challenge of graphics in RMarkdown is getting your figures the right size and shape. There are five main options that control figure sizing: `fig.width`, `fig.height`, `fig.asp`, `out.width` and `out.height`. Image sizing is challenging because there are two sizes (the size of the figure created by R and the size at which it is inserted in the output document), and multiple ways of specifying the size (i.e., height, width, and aspect ratio: pick two of three).
The biggest challenge of graphics in R Markdown is getting your figures the right size and shape. There are five main options that control figure sizing: `fig.width`, `fig.height`, `fig.asp`, `out.width` and `out.height`. Image sizing is challenging because there are two sizes (the size of the figure created by R and the size at which it is inserted in the output document), and multiple ways of specifying the size (i.e., height, width, and aspect ratio: pick two of three).
I only ever use three of the five options:


@ -481,7 +481,7 @@ To find out how many periods fall into an interval, you need to use integer divi
How do you pick between durations, periods, and intervals? As always, pick the simplest data structure that solves your problem. If you only care about physical time, use a duration; if you need to add human times, use a period; if you need to figure out how long a span is in human units, use an interval.
Figure \@{ref:dt-algebra} summarises permitted arithmetic operations between the different data types.
Figure \@ref(fig:dt-algebra) summarises permitted arithmetic operations between the different data types.
```{r dt-algebra, echo = FALSE, fig.cap = "The allowed arithmetic operations between pairs of date/time classes."}
knitr::include_graphics("diagrams/datetimes-arithmetic.png")


@ -92,7 +92,7 @@ ggplot(gss_cat, aes(race)) +
These levels represent valid values that simply did not occur in this dataset. Unfortunately, dplyr doesn't yet have a `drop` option, but it will in the future.
When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels. Those operation are described in the sections below.
When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels. Those operations are described in the sections below.
### Exercise


@ -26,11 +26,11 @@ The last step of data science is __communication__, an absolutely critical part
Surrounding all these tools is __programming__. Programming is a cross-cutting tool that you use in every part of the project. You don't need to be an expert programmer to be a data scientist, but learning more about programming pays off because becoming a better programmer allows you to automate common tasks, and solve new problems with greater ease.
You'll use these six tools in every data science project, but for most projects they're not enough. There's a rough 80-20 rule at play: you can tackle about 80% of every project using the tools that you'll learn in this book, but you'll need other tools to tackle the remaining 20%. Throughout this book we'll point you to resources where you can learn more.
You'll use these six tools in every data science project, but for most projects they're not enough. There's a rough 80-20 rule at play; you can tackle about 80% of every project using the tools that you'll learn in this book, but you'll need other tools to tackle the remaining 20%. Throughout this book we'll point you to resources where you can learn more.
## The tidyverse
The majority of the packages that you will learn in this book are part of the so-called tidyverse. All packages in the tidyverse share a common philosophy of data and R programming, which makes them fit together naturally. Because they are designed with a unifying vision you should experience fewer problems when you combine multiple packages to solve real problems. The packages in the tidyverse are not perfect, but they fit together well, and over time that fit will continue to improve.
The majority of the packages that you will learn in this book are part of the so-called tidyverse. All packages in the tidyverse share a common philosophy of data and R programming, which makes them fit together naturally. Because they are designed with a unifying vision, you should experience fewer problems when you combine multiple packages to solve real problems. The packages in the tidyverse are not perfect, but they fit together well, and over time that fit will continue to improve.
There are many other excellent packages that are not part of the tidyverse, because they are designed with a different set of underlying principles. This doesn't make them better or worse, just different. In other words, the complement to the tidyverse is not the messyverse, but many other universes of interrelated packages. As you tackle more data science projects with R, you'll learn new packages and new ways of thinking about data. But we hope that the tidyverse will continue to provide a solid foundation no matter how far you go in R.
@ -52,8 +52,7 @@ The previous description of the tools of data science is organised roughly accor
* Programming tools are not necessarily interesting in their own right,
but do allow you to tackle considerably more challenging problems. We'll
give you a selection of programming tools in the middle of the book, and
then you'll see they can combine with the data science tools to tackle interesting
modelling problems.
then you'll see they can combine with the data science tools to tackle interesting modelling problems.
Within each chapter, we try and stick to a similar pattern: start with some motivating examples so you can see the bigger picture, and then dive into the details. Each section of the book is paired with exercises to help you practice what you've learned. While it's tempting to skip the exercises, there's no better way to learn than practicing on real problems.
@ -109,7 +108,7 @@ To run the code in this book, you will need to install both R and the RStudio ID
### RStudio
RStudio is an integrated development environment, or IDE, for R programming. When you get started there are two key regions in the interface:
RStudio is an integrated development environment, or IDE, for R programming. When you get started, there are two key regions in the interface:
```{r echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/rstudio-console.png")
@ -159,7 +158,7 @@ Throughout the book we use a consistent set of conventions to refer to code:
## Getting help and learning more
This book is not an island: there is no single resource that will allow you to master R. As you start to apply the techniques described in this book to your own data you will soon find questions that I do not answer. This section describes a few tips to help you get help, and to help you keep learning.
This book is not an island; there is no single resource that will allow you to master R. As you start to apply the techniques described in this book to your own data you will soon find questions that I do not answer. This section describes a few tips to help you get help, and to help you keep learning.
If you get stuck, start with Google. Typically adding "R" to a query is enough to restrict it to relevant results: if the search isn't useful, it often means that there aren't any R-specific results available. Google is particularly useful for error messages. If you get an error message and you have no idea what it means, try googling it! Chances are that someone else has been confused by it in the past, and there will be help somewhere on the web. (If the error message isn't in English, run `Sys.setenv(LANGUAGE = "en")` and re-run the code; you're more likely to find help for English error messages.)
@ -169,7 +168,7 @@ There are three things you need to include to make your example reproducible: re
1. **Packages** should be loaded at the top of the script, so it's easy to
see which ones the example needs. This is a good time to check that you're
using the latest version of each package: it's possible you've discovered
using the latest version of each package; it's possible you've discovered
a bug that's been fixed since you installed the package.
1. The easiest way to include **data** in a question is to use `dput()` to
@ -197,7 +196,7 @@ There are three things you need to include to make your example reproducible: re
Finish by checking that you have actually made a reproducible example by starting a fresh R session and copying and pasting your script in.
You should also spend some time preparing yourself to solve problems before they occur. Investing a little time in learning R each day will pay off handsomely in the long run. One way is to follow what Hadley, Garrett, and everyone else at RStudio are doing on the [RStudio blog](https://blog.rstudio.org). This is where we post announcements about new packages, new IDE features, and in-person courses. You might also want to follow Hadley ([@hadleywickham](https://twitter.com/hadleywickham)) or Garrett ([@statgarrett](https://twitter.com/statgarrett)) on Twitter, or follow [@rstudiotips](https://twitter.com/rstudiotips) to keep up with new features in the IDE.
You should also spend some time preparing yourself to solve problems before they occur. Investing a little time in learning R each day will pay off handsomely in the long run. One way is to follow what Hadley, Garrett, and everyone else at RStudio are doing on the [RStudio blog](https://blog.rstudio.org). This is where we post announcements about new packages, new IDE features, and in-person courses. You might also want to follow Hadley ([\@hadleywickham](https://twitter.com/hadleywickham)) or Garrett ([\@statgarrett](https://twitter.com/statgarrett)) on Twitter, or follow [\@rstudiotips](https://twitter.com/rstudiotips) to keep up with new features in the IDE.
To keep up with the R community more broadly, we recommend reading <http://www.r-bloggers.com>: it aggregates over 500 blogs about R from around the world. If you're an active Twitter user, follow the `#rstats` hashtag. Twitter is one of the key tools that Hadley uses to keep up with new developments in the community.


@ -60,7 +60,7 @@ ggplot(diamonds, aes(carat, price)) +
We can make it easier to see how the other attributes of a diamond affect its relative `price` by fitting a model to separate out the effect of `carat`. But first, let's make a couple of tweaks to the diamonds dataset to make it easier to work with:
1. Focus on diamonds bigger smaller than 2.5 carats (99.7% of the data)
1. Focus on diamonds smaller than 2.5 carats (99.7% of the data)
1. Log-transform the carat and price variables.
```{r}
@ -116,7 +116,7 @@ ggplot(diamonds2, aes(color, lresid)) + geom_boxplot()
ggplot(diamonds2, aes(clarity, lresid)) + geom_boxplot()
```
Now we see the relationship we expect: as the quality of the diamond increases, so too does its relative pirce. To interpret the `y` axis, we need to think about what the residuals are telling us, and what scale they are on. A residual of -1 indicates that `lprice` was 1 unit lower than a prediction based solely on its weight. $2^{-1}$ is 1/2, so points with a value of -1 are half the expected price, and residuals with value 1 are twice the predicted price.
Now we see the relationship we expect: as the quality of the diamond increases, so too does its relative price. To interpret the `y` axis, we need to think about what the residuals are telling us, and what scale they are on. A residual of -1 indicates that `lprice` was 1 unit lower than a prediction based solely on its weight. $2^{-1}$ is 1/2, so points with a value of -1 are half the expected price, and residuals with value 1 are twice the predicted price.
### A more complicated model