Merge branch 'master' of github.com:hadley/r4ds

# Conflicts:
#	factors.Rmd
hadley 2016-11-10 11:10:30 -06:00
commit e772065e05
22 changed files with 72 additions and 75 deletions

View File

@@ -59,12 +59,12 @@ The rest of this chapter will look at these two questions. I'll explain what var
"cell", each variable in its own column, and each observation in its own
row.
So far, all the data you've seen so far has been tidy. In real-life, most data isn't tidy, so we'll come back to these ideas again in [tidy data].
So far, all of the data that you've seen has been tidy. In real-life, most data isn't tidy, so we'll come back to these ideas again in [tidy data].
## Variation
**Variation** is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results. This is true even if you measure quantities that are constant, like the speed of light. Each of your measurements will include a small amount of error that varies from measurement to measurement. Categorical variables can also vary if you measure across different subjects (e.g. the eye colors of different people), or different times (e.g. the energy levels of an electron at different moments).
Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualise the distribution of variable's values.
Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualise the distribution of the variable's values.
### Visualising distributions
@@ -96,7 +96,7 @@ diamonds %>%
count(cut_width(carat, 0.5))
```
A histogram divides the x-axis into equally spaced bins and then uses the height of bar to display the number of observations that fall in each bin. In the graph above, the tallest bar shows that almost 30,000 observations have a `carat` value between 0.25 and 0.75, which are the left and right edges of the bar.
A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin. In the graph above, the tallest bar shows that almost 30,000 observations have a `carat` value between 0.25 and 0.75, which are the left and right edges of the bar.
You can set the width of the intervals in a histogram with the `binwidth` argument, which is measured in the units of the `x` variable. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. For example, here is how the graph above looks when we zoom into just the diamonds with a size of less than three carats and choose a smaller binwidth.
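As a sketch of that zoom (assuming the tidyverse is loaded, as elsewhere in the chapter; `binwidth = 0.1` is one illustrative choice):
```{r}
smaller <- diamonds %>% filter(carat < 3)
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1)
```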
@@ -153,7 +153,7 @@ Clusters of similar values suggest that subgroups exist in your data. To underst
* Why might the appearance of clusters be misleading?
The histogram shows the length (in minutes) of 272 eruptions of the Old Faithful Geyser in Yellowstone National Park. Eruption times appear to be clustered into two groups: there are short eruptions (of around 2 minutes) and long eruptions (4-5 minutes), but little in between.
The histogram below shows the length (in minutes) of 272 eruptions of the Old Faithful Geyser in Yellowstone National Park. Eruption times appear to be clustered into two groups: there are short eruptions (of around 2 minutes) and long eruptions (4-5 minutes), but little in between.
```{r}
ggplot(data = faithful, mapping = aes(x = eruptions)) +
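  geom_histogram(binwidth = 0.25)  # illustrative completion; this binwidth separates the two clusters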

View File

@@ -1,16 +1,14 @@
# R packages
# R for Data Science
This is code and text behind the [R for data science](http://r4ds.had.co.nz)
This is code and text behind the [R for Data Science](http://r4ds.had.co.nz)
book.
The site is built using [bookdown](https://github.com/rstudio/bookdown)
The R packages used in this book can be installed via
```{r}
devtools::install_github("hadley/r4ds")
```
The site is built using [bookdown package](https://github.com/rstudio/bookdown).
To create the site, you also need:
* [pandoc](http://johnmacfarlane.net/pandoc/)

View File

@@ -338,7 +338,7 @@ ggplot(mpg, aes(displ, hwy)) +
Instead of just tweaking the details a little, you can instead replace the scale altogether. There are two types of scales you're mostly likely to want to switch out: continuous position scales and colour scales. Fortunately, the same principles apply to all the other aesthetics, so once you've mastered position and colour, you'll be able to quickly pick up other scale replacements.
It's very useful to plot transformations of your variable. For example, as we've seen in [diamond prices][diamond-prices] it's easier to see the precise relationship between `carat` and `price` if we log transform them:
It's very useful to plot transformations of your variable. For example, as we've seen in [diamond prices](diamond-prices) it's easier to see the precise relationship between `carat` and `price` if we log transform them:
```{r, fig.align = "default", out.width = "50%"}
ggplot(diamonds, aes(carat, price)) +

View File

@@ -182,7 +182,7 @@ Now that you know how to get date-time data into R's date-time data structures,
### Getting components
You can pull out individual parts of the date with the acccessor functions `year()`, `month()`, `mday()` (day of the month), `yday()` (day of the year), `wday()` (day of the week), `hour()`, `minute()`, and `second()`.
You can pull out individual parts of the date with the accessor functions `year()`, `month()`, `mday()` (day of the month), `yday()` (day of the year), `wday()` (day of the week), `hour()`, `minute()`, and `second()`.
```{r}
datetime <- ymd_hms("2016-07-08 12:34:56")
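# A sketch of the accessors named above applied to this date-time:
year(datetime)   # 2016
month(datetime)  # 7
mday(datetime)   # 8
wday(datetime)   # 6, i.e. Friday (numbering starts at Sunday = 1)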
@@ -477,7 +477,7 @@ To find out how many periods fall into an interval, you need to use integer divi
How do you pick between duration, periods, and intervals? As always, pick the simplest data structure that solves your problem. If you only care about physical time, use a duration; if you need to add human times, use a period; if you need to figure out how long a span is in human units, use an interval.
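As a rough sketch of the duration/period distinction (assuming lubridate is loaded; the timestamp is chosen to straddle a US daylight saving change):
```{r}
one_pm <- ymd_hms("2016-03-12 13:00:00", tz = "America/New_York")
one_pm + ddays(1)  # duration: exactly 86400 seconds later, so 2pm the next day
one_pm + days(1)   # period: one calendar day later, so still 1pm
```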
Figure \@(ref:dt-algebra) summarises permitted arithmetic operations between the different data types.
Figure \@ref(fig:dt-algebra) summarises permitted arithmetic operations between the different data types.
```{r dt-algebra, echo = FALSE, fig.cap = "The allowed arithmetic operations between pairs of date/time classes."}
knitr::include_graphics("diagrams/datetimes-arithmetic.png")
@@ -503,7 +503,7 @@ knitr::include_graphics("diagrams/datetimes-arithmetic.png")
Time zones are an enormously complicated topic because of their interaction with geopolitical entities. Fortunately we don't need to dig into all the details as they're not all important for data analysis, but there are a few challenges we'll need to tackle head on.
The first challange is that everyday names of time zones tend to be ambiguous. For example, if you're American you're probably familiar with EST, or Eastern Standard Time. However, both Australia and Canada also have EST! To avoid confusion, R uses the international standard IANA time zones. These use a consistent naming scheme "<area>/<location>", typically in the form "\<continent\>/\<city\>" (there are a few exceptions because not every country lies on a continent). Examples include "America/New_York", "Europe/Paris", and "Pacific/Auckland".
The first challenge is that everyday names of time zones tend to be ambiguous. For example, if you're American you're probably familiar with EST, or Eastern Standard Time. However, both Australia and Canada also have EST! To avoid confusion, R uses the international standard IANA time zones. These use a consistent naming scheme "<area>/<location>", typically in the form "\<continent\>/\<city\>" (there are a few exceptions because not every country lies on a continent). Examples include "America/New_York", "Europe/Paris", and "Pacific/Auckland".
You might wonder why the time zone uses a city, when typically you think of time zones as associated with a country or region within a country. This is because the IANA database has to record decades' worth of time zone rules. In the course of decades, countries change names (or break apart) fairly frequently, but city names tend to stay the same. Another problem is that the name needs to reflect not only the current behaviour, but also the complete history. For example, there are time zones for both "America/New_York" and "America/Detroit". These cities both currently use Eastern Standard Time but in 1969-1972 Michigan (the state in which Detroit is located), did not follow DST, so it needs a different name. It's worth reading the raw time zone database (available at <http://www.iana.org/time-zones>) just to read some of these stories!
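A minimal sketch of how the IANA names remove the ambiguity (assuming lubridate is loaded; `with_tz()` changes the printed zone, not the underlying instant):
```{r}
x <- ymd_hms("2016-07-08 12:34:56", tz = "America/New_York")
with_tz(x, tzone = "Pacific/Auckland")
```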

View File

@@ -64,7 +64,7 @@ y2 <- factor(x2, levels = month_levels)
y2
```
If you want an error, you can use `readr::parse_factor()`:
If you want a warning, you can use `readr::parse_factor()`:
```{r}
y2 <- parse_factor(x2, levels = month_levels)
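# Any values of x2 that are not in month_levels now generate a warning
# (rather than being silently converted to NA, as factor() would do),
# so bad levels are surfaced instead of disappearing.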

View File

@@ -356,7 +356,7 @@ Time
: `%OS` real seconds.
: `%Z` Time zone (as name, e.g. `America/Chicago`). Beware of abbreviations:
if you're American, note that "EST" is a Canadian time zone that does not
have daylight savings time. It is \emph{not} Eastern Standard Time! We'll
have daylight savings time. It is _not_ Eastern Standard Time! We'll
come back to this in [time zones].
: `%z` (as offset from UTC, e.g. `+0800`).
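As a sketch of how these codes combine (the timestamp string is hypothetical; `parse_datetime()` is readr's parser):
```{r}
parse_datetime("2016-07-08 12:34:56 +0800", "%Y-%m-%d %H:%M:%S %z")
```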

View File

@@ -101,7 +101,7 @@ There are four things you need to run the code in this book: R, RStudio, a colle
### R
To download R, go to CRAN, the **comprehensive** **R** **a**rchive **network**. CRAN is composed of a set of mirror servers distributed around the world and is used to distribute R and R packages. Don't try and pick a mirror that's close to you: instead use the cloud mirror, <https://cloud.r-project.org>, which automatically figures it out for you.
To download R, go to CRAN, the **c**omprehensive **R** **a**rchive **n**etwork. CRAN is composed of a set of mirror servers distributed around the world and is used to distribute R and R packages. Don't try and pick a mirror that's close to you: instead use the cloud mirror, <https://cloud.r-project.org>, which automatically figures it out for you.
A new major version of R comes out once a year, and there are 2-3 minor releases each year. It's a good idea to update regularly. Upgrading can be a bit of a hassle, especially for major versions, which require you to reinstall all your packages, but putting it off only make it worse.
@@ -141,7 +141,7 @@ Packages in the tidyverse change fairly frequently. You can see if updates are a
### Other packages
There are many other excellent packages that are not part of the tidyverse, because they solve problems in a different domain, are or designed with a different set of underlying principles. This doesn't make them better or worse, just different. In other words, the complement to the tidyverse is not the messyverse, but many other universes of interrelated packages. As you tackle more data science projects with R, you'll learn new packages and new ways of thinking about data.
There are many other excellent packages that are not part of the tidyverse, because they solve problems in a different domain, or are designed with a different set of underlying principles. This doesn't make them better or worse, just different. In other words, the complement to the tidyverse is not the messyverse, but many other universes of interrelated packages. As you tackle more data science projects with R, you'll learn new packages and new ways of thinking about data.
In this book we'll use three data packages from outside the tidyverse:

View File

@@ -658,7 +658,7 @@ str(safe_log(10))
str(safe_log("a"))
```
When the function succeeds the `result` element contains the result and the `error` element is `NULL`. When the function fails, the `result` element is `NULL` and the `error` element contains an error object.
When the function succeeds, the `result` element contains the result and the `error` element is `NULL`. When the function fails, the `result` element is `NULL` and the `error` element contains an error object.
`safely()` is designed to work with map:
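A minimal sketch of that pattern, using a hypothetical list with one failing element:
```{r}
x <- list(1, 10, "a")
y <- x %>% map(safely(log))
str(y[[3]])  # result is NULL and error holds the error object
```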
@@ -914,7 +914,7 @@ x %>%
### Reduce and accumulate
Sometimes you have a complex list that you want to reduce to a simple list by repeatedly applying a function that reduces a pair to a singleton. This useful if you want to apply a two-table dplyr verb to multiple tables. For example, you might have a list of data frames, and you want to reduce to a single data frame by joining the elements together:
Sometimes you have a complex list that you want to reduce to a simple list by repeatedly applying a function that reduces a pair to a singleton. This is useful if you want to apply a two-table dplyr verb to multiple tables. For example, you might have a list of data frames, and you want to reduce to a single data frame by joining the elements together:
```{r}
dfs <- list(
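  # Hypothetical tables to make the truncated example concrete:
  age = tibble(name = "John", age = 30),
  sex = tibble(name = c("John", "Mary"), sex = c("M", "F"))
)
dfs %>% reduce(full_join)  # repeatedly joins pairs until one data frame remains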

View File

@@ -192,7 +192,7 @@ sim1_mod <- lm(y ~ x, data = sim1)
coef(sim1_mod)
```
These are exactly the same values we got with `optim()`! Behind the scenes `lm()` doesn't use `optim()` but instead takes advantage of the mathematical structure of linear models. Using some connections between geometry, calculus, and linear algebra, `lm()` actually finds the closest model by in a single step, using a sophisticated algorithm. This approach is both faster, and guarantees that there is a global minimum.
These are exactly the same values we got with `optim()`! Behind the scenes `lm()` doesn't use `optim()` but instead takes advantage of the mathematical structure of linear models. Using some connections between geometry, calculus, and linear algebra, `lm()` actually finds the closest model in a single step, using a sophisticated algorithm. This approach is both faster, and guarantees that there is a global minimum.
### Exercises
@@ -488,7 +488,7 @@ Note my use of `seq_range()` inside `data_grid()`. Instead of using every unique
```
* `trim = 0.1` will trim off 10% of the tail values. This is useful if the
variables has an long tailed distribution and you want to focus on generating
variables have a long tailed distribution and you want to focus on generating
values near the center:
```{r}
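# A sketch of the trim argument (seq_range() is from modelr;
# rcauchy() gives a long-tailed sample):
x1 <- rcauchy(100)
seq_range(x1, n = 5)
seq_range(x1, n = 5, trim = 0.10)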
@@ -552,7 +552,7 @@ model_matrix(df, y ~ x^2 + x)
model_matrix(df, y ~ I(x^2) + x)
```
Transformations are useful because you can use them to approximate non-linear functions. If you've taken a calculus class, you may have heard of Taylor's theorem which says you can approximate any smooth function with an infinite sum of polynomials. That means you can use a linear to get arbitrary close to a smooth function by fitting an equation like `y = a_1 + a_2 * x + a_3 * x^2 + a_4 * x ^ 3`. Typing that sequence by hand is tedious, so R provides a helper function: `poly()`:
Transformations are useful because you can use them to approximate non-linear functions. If you've taken a calculus class, you may have heard of Taylor's theorem which says you can approximate any smooth function with an infinite sum of polynomials. That means you can use a polynomial function to get arbitrarily close to a smooth function by fitting an equation like `y = a_1 + a_2 * x + a_3 * x^2 + a_4 * x ^ 3`. Typing that sequence by hand is tedious, so R provides a helper function: `poly()`:
```{r}
model_matrix(df, y ~ poly(x, 2))

View File

@@ -154,7 +154,7 @@ diamonds2 %>%
arrange(price)
```
Nothing really jumps out at me here, but it's probably worth spending time considering if this indicates a problem with our model, or if there are a errors in the data. If there are mistakes in the data, this could be an opportunity to buy diamonds that have been priced low incorrectly.
Nothing really jumps out at me here, but it's probably worth spending time considering if this indicates a problem with our model, or if there are errors in the data. If there are mistakes in the data, this could be an opportunity to buy diamonds that have been priced low incorrectly.
### Exercises
@@ -385,7 +385,7 @@ Either approach is reasonable. Making the transformed variable explicit is usefu
### Time of year: an alternative approach
In the previous section we used our domain knowledge (how the US school term affects travel) to improve the model. An alternative to using making our knowledge explicit in the model is to give the data more room to speak. We could use a more flexible model and allow that to capture the pattern we're interested in. A simple linear trend isn't adequate, so we could try using a natural spline to fit a smooth curve across the year:
In the previous section we used our domain knowledge (how the US school term affects travel) to improve the model. An alternative to using our knowledge explicitly in the model is to give the data more room to speak. We could use a more flexible model and allow that to capture the pattern we're interested in. A simple linear trend isn't adequate, so we could try using a natural spline to fit a smooth curve across the year:
```{r}
library(splines)
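# A sketch of the spline fit described above (`daily` and `wday` are
# assumed from earlier in the chapter; 5 degrees of freedom is an
# illustrative choice):
mod <- lm(n ~ wday * ns(date, 5), data = daily)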

View File

@@ -13,7 +13,7 @@ In this chapter you're going to learn three powerful ideas that help you to work
1. Using the __broom__ package, by David Robinson, to turn models into tidy
data. This is a powerful technique for working with large numbers of models
because once you have tidy data, you can apply all of the techniques that
you've learned about in earlier in the book.
you've learned about earlier in the book.
We'll start by diving into a motivating example using data about life expectancy around the world. It's a small dataset but it illustrates how important modelling can be for improving your visualisations. We'll use a large number of simple models to partition out some of the strongest signal so we can see the subtler signals that remain. We'll also see how model summaries can help us pick out outliers and unusual trends.
@@ -133,7 +133,7 @@ And we want to apply it to every data frame. The data frames are in a list, so w
models <- map(by_country$data, country_model)
```
However, rather than leaving leaving the list of models as a free-floating object, I think it's better to store it as a column in the `by_country` data frame. Storing related objects in columns is a key part of the value of data frames, and why I think list-columns are such a good idea. In the course of working with these countries, we are going to have lots of lists where we have one element per country. So why not store them all together in one data frame?
However, rather than leaving the list of models as a free-floating object, I think it's better to store it as a column in the `by_country` data frame. Storing related objects in columns is a key part of the value of data frames, and why I think list-columns are such a good idea. In the course of working with these countries, we are going to have lots of lists where we have one element per country. So why not store them all together in one data frame?
In other words, instead of creating a new object in the global environment, we're going to create a new variable in the `by_country` data frame. That's a job for `dplyr::mutate()`:
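A sketch of that `mutate()` call, using `by_country` and `country_model` from the surrounding text:
```{r}
by_country <- by_country %>%
  mutate(model = map(data, country_model))
```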
@@ -194,7 +194,7 @@ resids %>%
facet_wrap(~continent)
```
It looks like we've missed some mild pattern. There's also something interesting going on in Africa: we see some very large residuals which suggests our model isn't fitting so well there. We'll explore that more in the next section, attacking it from a slightly different angle.
It looks like we've missed some mild patterns. There's also something interesting going on in Africa: we see some very large residuals which suggests our model isn't fitting so well there. We'll explore that more in the next section, attacking it from a slightly different angle.
### Model quality

View File

@@ -216,7 +216,7 @@ y <- tribble(
)
```
The coloured column represents the "key" variable: these are used to match the rows between the tables. The grey column represents the "value" column that is carried along for the ride. In these examples I'll show a single key variable and single value variable, but idea generalises in a straightforward way to multiple keys and multiple values.
The coloured column represents the "key" variable: these are used to match the rows between the tables. The grey column represents the "value" column that is carried along for the ride. In these examples I'll show a single key variable, but the idea generalises in a straightforward way to multiple keys and multiple values.
A join is a way of connecting each row in `x` to zero, one, or more rows in `y`. The following diagram shows each potential match as an intersection of a pair of lines.
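As a sketch, two common joins on those tables (assuming the key column is literally named `key`, as the text suggests):
```{r}
x %>% inner_join(y, by = "key")  # keep only rows with matching keys
x %>% left_join(y, by = "key")   # keep every row of x
```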

View File

@@ -4,7 +4,7 @@
So far you've seen R Markdown used to produce HTML documents. This chapter gives a brief overview of some of the many other types of output you can produce with R Markdown. There are two ways to set the output of a document:
1. Permanently, by modifying the the YAML header:
1. Permanently, by modifying the YAML header:
```yaml
title: "Viridis Demo"
@@ -88,7 +88,7 @@ output:
## Notebooks
A notebook, `html_notebook`, is a variation on a `html_document`. The rendered outputs are very similar, but the purpose is different. A `html_document` is focussed on communicating with decisions makers, while a notebook is focussed on collaborating with other data scientists. These different purposes lead to using the HTML output in different ways. Both HTML outputs will contain the fully rendered output, but the notebook also contains the full source code. That means you can use the `.nb.html` generated by the notebook in two ways:
A notebook, `html_notebook`, is a variation on a `html_document`. The rendered outputs are very similar, but the purpose is different. A `html_document` is focussed on communicating with decision makers, while a notebook is focussed on collaborating with other data scientists. These different purposes lead to using the HTML output in different ways. Both HTML outputs will contain the fully rendered output, but the notebook also contains the full source code. That means you can use the `.nb.html` generated by the notebook in two ways:
1. You can view it in a web browser, and see the rendered output. Unlike
`html_document`, this rendering always includes an embedded copy of
@@ -238,7 +238,7 @@ Other packages provide even more output formats:
* The __bookdown__ package, <https://github.com/rstudio/bookdown>,
makes it easy to write books, like this one. To learn more, read
[_Authoring Books with R Markdown_](https://bookdown.org/yihui/bookdown/),
by Yihui Xie, which is, of course, written in bookdown, Visit
by Yihui Xie, which is, of course, written in bookdown. Visit
<http://www.bookdown.org> to see other bookdown books written by the
wider R community.

View File

@@ -70,7 +70,7 @@ knitr::include_graphics("images/RMarkdownFlow.png")
To get started with your own `.Rmd` file, select *File > New File > R Markdown...* in the menubar. RStudio will launch a wizard that you can use to pre-populate your file with useful content that reminds you how the key features of R Markdown work.
The following sections dives into the three components of an R Markdown document in more details: the markdown text, the code chunks, and the YAML header.
The following sections dive into the three components of an R Markdown document in more details: the markdown text, the code chunks, and the YAML header.
### Exercises
@@ -187,7 +187,7 @@ The most important set of options controls if your code block is executed and wh
of your report, but can be very useful if you need to debug exactly
what is going on inside your `.Rmd`. It's also useful if you're teaching R
and want to deliberately include an error. The default, `error = FALSE` causes
knitting to failure if there is a single error in the document.
knitting to fail if there is a single error in the document.
The following table summarises which types of output each option suppresses:
@@ -209,7 +209,7 @@ By default, R Markdown prints data frames and matrices as you'd see them in the
mtcars[1:5, ]
```
If you prefer that data be displayed with additional formatting you can use the `knitr::kable` function. The code below generates Table \@ref(kable).
If you prefer that data be displayed with additional formatting you can use the `knitr::kable` function. The code below generates Table \@ref(tab:kable).
```{r kable}
knitr::kable(
@@ -220,7 +220,7 @@ knitr::kable(
Read the documentation for `?knitr::kable` to see the other ways in which you can customise the table. For even deeper customisation, consider the __xtable__, __stargazer__, __pander__, __tables__, and __ascii__ packages. Each provides a set of tools for returning formatted tables from R code.
There are also a rich set of options for controlling how figures embedded. You'll learn about these in [saving your plots].
There is also a rich set of options for controlling how figures are embedded. You'll learn about these in [saving your plots].
### Caching
@@ -232,7 +232,7 @@ The caching system must be used with care, because by default it is based on the
rawdata <- readr::read_csv("a_very_large_file.csv")
`r chunk`
`r chunk`{r processed_data, cached = TRUE}
`r chunk`{r processed_data, cache = TRUE}
processed_data <- rawdata %>%
filter(!is.na(import_var)) %>%
mutate(new_variable = complicated_transformation(x, y, z))
@@ -240,7 +240,7 @@ The caching system must be used with care, because by default it is based on the
Caching the `processed_data` chunk means that it will get re-run if the dplyr pipeline is changed, but it won't get rerun if the `read_csv()` call changes. You can avoid that problem with the `dependson` chunk option:
`r chunk`{r processed_data, cached = TRUE, dependson = "raw_data"}
`r chunk`{r processed_data, cache = TRUE, dependson = "raw_data"}
processed_data <- rawdata %>%
filter(!is.na(import_var)) %>%
mutate(new_variable = complicated_transformation(x, y, z))
@@ -260,7 +260,7 @@ I've used the advice of [David Robinson](https://twitter.com/drob/status/7387866
### Global options
As you work more with knitr, you will discover that some of the default chunk options don't fit your needs and you want to change them. You can do by calling `knitr::opts_chunk$set()` in a code chunk. For example, when writing books and tutorials I set:
As you work more with knitr, you will discover that some of the default chunk options don't fit your needs and you want to change them. You can do this by calling `knitr::opts_chunk$set()` in a code chunk. For example, when writing books and tutorials I set:
```{r, eval = FALSE}
knitr::opts_chunk$set(
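  # Illustrative settings; the original call is truncated in this view:
  comment = "#>",
  collapse = TRUE
)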
@@ -360,7 +360,7 @@ Alternatively, if you need to produce many such paramterised reports, you can ca
rmarkdown::render("fuel-economy.Rmd", params = list(my_class = "suv"))
```
This is particularly powerful in conjunction with `purrr:pwalk()`. The following example creates a report for each value of `class` found in `mpg`. First we create a data frame that has one row for each class, giving the `filename` of report and the `params` it should be given:
This is particularly powerful in conjunction with `purrr:pwalk()`. The following example creates a report for each value of `class` found in `mpg`. First we create a data frame that has one row for each class, giving the `filename` of the report and the `params`:
```{r}
reports <- tibble(
@@ -371,7 +371,7 @@ reports <- tibble(
reports
```
Then we match the column names to the argument names of `render()`, and use purrr's **parallel* walk to call `render()` once for each row:
Then we match the column names to the argument names of `render()`, and use purrr's **parallel** walk to call `render()` once for each row:
```{r, eval = FALSE}
reports %>%
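  # Illustrative continuation: match columns to render()'s arguments
  select(output_file = filename, params) %>%
  purrr::pwalk(rmarkdown::render, input = "fuel-economy.Rmd")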
@@ -406,7 +406,7 @@ Smith says blah [-@smith04].
When R Markdown renders your file, it will build and append a bibliography to the end of your document. The bibliography will contain each of the cited references from your bibliography file, but it will not contain a section heading. As a result it is common practice to end your file with a section header for the bibliography, such as `# References` or `# Bibliography`.
You can change the style of your citations and bibliography by reference a CSL (citation style language) file to the `csl` field:
You can change the style of your citations and bibliography by referencing a CSL (citation style language) file in the `csl` field:
```yaml
bibliography: rmarkdown.bib
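csl: apa.csl  # illustrative file name; point this at your CSL file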
@@ -428,5 +428,5 @@ There are two important topics that we haven't covered here: collaboration, and
1. The "Git and GitHub" chapter of _R Packages_, by Hadley. You can also
read it for free online: <http://r-pkgs.had.co.nz/git.html>.
I have also not touched about what you should actually write in order to clearly communicate the results of your analysis. To improve your writing, I highly recommend reading either [_Style: Lessons in Clarity and Grace_](https://amzn.com/0134080416) by Joseph M. Williams & Joseph Bizup, or [_The Sense of Structure: Writing from the Reader's Perspective_](https://amzn.com/0205296327) by George Gopen. Both books will help you understand the structure of sentences and paragraphs, and give you the tools to make your writing more clear. (These books are rather expensive if purchased new, but they're used by many English classes so there are plenty of cheap second-hand copies). George Gopen also has a number of short articles on writing at <http://georgegopen.com/articles/litigation/>. They are aimed at lawyers, but almost everything applies to data scientists too.
I have also not touched on what you should actually write in order to clearly communicate the results of your analysis. To improve your writing, I highly recommend reading either [_Style: Lessons in Clarity and Grace_](https://amzn.com/0134080416) by Joseph M. Williams & Joseph Bizup, or [_The Sense of Structure: Writing from the Reader's Perspective_](https://amzn.com/0205296327) by George Gopen. Both books will help you understand the structure of sentences and paragraphs, and give you the tools to make your writing more clear. (These books are rather expensive if purchased new, but they're used by many English classes so there are plenty of cheap second-hand copies). George Gopen also has a number of short articles on writing at <http://georgegopen.com/articles/litigation/>. They are aimed at lawyers, but almost everything applies to data scientists too.

View File

@@ -521,7 +521,7 @@ r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\]
This is a somewhat pathological example (because email addresses are actually surprisingly complex), but is used in real code. See the stackoverflow discussion at <http://stackoverflow.com/a/201378> for more details.
Don't forget that you're in a programming language and you have other tools at your disposal. Instead of creating one complex regular expression, it's often easier to a series of simpler regexps. If you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.
Don't forget that you're in a programming language and you have other tools at your disposal. Instead of creating one complex regular expression, it's often easier to write a series of simpler regexps. If you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.
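For instance, here is a sketch of that decomposition with hypothetical strings:
```{r}
x <- c("apple pie", "apple", "pie")
has_apple <- str_detect(x, "apple")
has_pie <- str_detect(x, "pie")
x[has_apple & !has_pie]  # two simple regexps instead of one combined one
```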
### Detect matches
@@ -618,7 +618,7 @@ Note the use of `str_view_all()`. As you'll shortly learn, many stringr function
### Extract matches
To extract the actual text of a match, use `str_extract()`. To show that off, we're going to need a more complicated example. I'm going to use the [Harvard sentences](https://en.wikipedia.org/wiki/Harvard_sentences), which were designed to test VOIP systems, but are also useful for practicing regexes. These are provided in `stringr::sentences`:
To extract the actual text of a match, use `str_extract()`. To show that off, we're going to need a more complicated example. I'm going to use the [Harvard sentences](https://en.wikipedia.org/wiki/Harvard_sentences), which were designed to test VOIP systems, but are also useful for practicing regexps. These are provided in `stringr::sentences`:
```{r}
length(sentences)
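# A sketch of extraction: find the first colour mentioned in each of
# the first few sentences (the colour list is illustrative):
colour_match <- str_c(c("red", "orange", "yellow", "green", "blue", "purple"),
                      collapse = "|")
str_extract(head(sentences), colour_match)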

View File

@@ -2,7 +2,7 @@
## Introduction
Throughout this book we work with "tibbles" instead of R's traditional `data.frame`. Tibbles _are_ data frames, but they tweak some older behaviours to make life a little easier. R is an old language, and some things that were useful 10 or 20 years ago now get in your way. It's difficult to change base R without breaking existing code, so most innovation occurs in packages. Here we will describe the __tibble__ package, which provides opinionated data frames that make working in the tidyverse a little easier. In most places, I'll use the term tibble and data frame interchangeably; when I want to draw particular attention to R's build-in data frame, I'll call them `data.frame`s.
Throughout this book we work with "tibbles" instead of R's traditional `data.frame`. Tibbles _are_ data frames, but they tweak some older behaviours to make life a little easier. R is an old language, and some things that were useful 10 or 20 years ago now get in your way. It's difficult to change base R without breaking existing code, so most innovation occurs in packages. Here we will describe the __tibble__ package, which provides opinionated data frames that make working in the tidyverse a little easier. In most places, I'll use the term tibble and data frame interchangeably; when I want to draw particular attention to R's built-in data frame, I'll call them `data.frame`s.
If this chapter leaves you wanting to learn more about tibbles, you might enjoy `vignette("tibble")`.
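As a quick sketch of the coercion the chapter builds on (`iris` is a built-in data frame):
```{r}
as_tibble(iris)
```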
@@ -140,7 +140,7 @@ Some older functions don't work with tibbles. If you encounter one of these func
class(as.data.frame(tb))
```
The main reason that some older functions don't work with tibble is the `[` function. We don't use `[` much in this book much because `dplyr::filter()` and `dplyr::select()` allow you to solve the same problems with clearer code (but you will learn a little about it in [vector subsetting](#vector-subsetting). With base R data frames, `[` sometimes returns a data frame, and sometimes returns a vector. With tibbles, `[` always returns another tibble.
The main reason that some older functions don't work with tibble is the `[` function. We don't use `[` much in this book because `dplyr::filter()` and `dplyr::select()` allow you to solve the same problems with clearer code (but you will learn a little about it in [vector subsetting](#vector-subsetting)). With base R data frames, `[` sometimes returns a data frame, and sometimes returns a vector. With tibbles, `[` always returns another tibble.
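A minimal sketch of the difference (`df` and `tb` here are hypothetical one-column examples):
```{r}
df <- data.frame(x = 1:3)
tb <- tibble(x = 1:3)
class(df[, "x"])  # "integer": base [ simplified the result to a vector
class(tb[, "x"])  # still a tibble
```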
## Exercises

View File

@@ -277,7 +277,7 @@ There are a number of helper functions you can use within `select()`:
See `?select` for more details.
`select()` can be used to rename variables, but it's rarely useful because it drops all the variables not explicitly mentioned. Instead, use `rename()`, which is a variant of `select()` that keeps all the variables that aren't explicitly mentioned:
`select()` can be used to rename variables, but it's rarely useful because it drops all of the variables not explicitly mentioned. Instead, use `rename()`, which is a variant of `select()` that keeps all the variables that aren't explicitly mentioned:
```{r}
rename(flights, tail_num = tailnum)

View File

@@ -10,7 +10,7 @@ Vectors are particularly important as most of the functions you will write will
The focus of this chapter is on base R data structures, so it isn't essential to load any packages. We will, however, use a handful of functions from the __purrr__ package to avoid some inconsistencies in base R.
```{r}
```{r setup, message = FALSE}
library(tidyverse)
```
@@ -26,7 +26,7 @@ There are two types of vectors:
1. __Lists__, which are sometimes called recursive vectors because lists can
contain other lists.
The chief difference between atomic vectors is that atomic vectors are __homogeneous__, while lists can be __heterogeneous__. There's one other related object: `NULL`. `NULL` is often used to represent the absence of a vector (as opposed to `NA` which is used to represent the absence of a value in a vector). `NULL` typically behaves like a vector of length 0. Figure \@ref(fig:datatypes) summarises the interrelationships.
The chief difference between atomic vectors and lists is that atomic vectors are __homogeneous__, while lists can be __heterogeneous__. There's one other related object: `NULL`. `NULL` is often used to represent the absence of a vector (as opposed to `NA` which is used to represent the absence of a value in a vector). `NULL` typically behaves like a vector of length 0. Figure \@ref(fig:datatypes) summarises the interrelationships.
```{r datatypes, echo = FALSE, out.width = "50%", fig.cap = "The hierarchy of R's vector types"}
knitr::include_graphics("diagrams/data-structures-overview.png")
@@ -72,7 +72,7 @@ c(TRUE, TRUE, FALSE, NA)
### Numeric
Integer and double vectors are known collectively as numeric vectors. In R, numbers are doubles by default. To make an integer, place a `L` after the number:
Integer and double vectors are known collectively as numeric vectors. In R, numbers are doubles by default. To make an integer, place an `L` after the number:
```{r}
typeof(1)
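typeof(1L)  # the L suffix described above makes this an integer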
@@ -99,8 +99,7 @@ The distinction between integers and doubles is not usually important, but there
some numerical tolerance.
1. Integers have one special value: `NA`, while doubles have four:
`NA`, `NaN`, `Inf` and `-Inf`. All three special values can arise in
during division:
`NA`, `NaN`, `Inf` and `-Inf`. All three special values `NaN`, `Inf` and `-Inf` can arise during division:
```{r}
c(-1, 0, 1) / 0
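# -1 / 0 gives -Inf, 0 / 0 gives NaN, and 1 / 0 gives Inf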
@@ -175,7 +174,7 @@ Now that you understand the different types of atomic vector, it's useful to rev
1. How to name the elements of a vector.
1. How pull out elements of interest.
1. How to pull out elements of interest.
### Coercion
@@ -294,7 +293,7 @@ Named vectors are most useful for subsetting, described next.
### Subsetting {#vector-subsetting}
So far we've used `dplyr::filter()` to filter the rows in a tibble. `filter()` only works with tibble, so we'll need new tool for vectors: `[`. `[` is the subsetting function, and is called like `x[a]`. There are four types of thing that you can subset a vector with:
So far we've used `dplyr::filter()` to filter the rows in a tibble. `filter()` only works with tibble, so we'll need new tool for vectors: `[`. `[` is the subsetting function, and is called like `x[a]`. There are four types of things that you can subset a vector with:
1. A numeric vector containing only integers. The integers must either be all
positive, all negative, or zero.
@@ -415,7 +414,7 @@ x_named <- list(a = 1, b = 2, c = 3)
str(x_named)
```
Unlike atomic vectors, `lists()` can contain a mix of objects:
Unlike atomic vectors, `list()` can contain a mix of objects:
```{r}
y <- list("a", 1L, 1.5, TRUE)
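str(y)  # four elements of four different types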
@@ -458,7 +457,7 @@ There are three principles:
### Subsetting
There are three ways to subset a list, which I'll illustrate with `a`:
There are three ways to subset a list, which I'll illustrate with a list named `a`:
```{r}
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
@@ -478,8 +477,8 @@ a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
hierarchy from the list.
```{r}
str(y[[1]])
str(y[[4]])
str(a[[1]])
str(a[[4]])
```
* `$` is a shorthand for extracting named elements of a list. It works
@@ -553,7 +552,7 @@ There are three very important attributes that are used to implement fundamental
1. __Dimensions__ (dims, for short) make a vector behave like a matrix or array.
1. __Class__ is used to implement the S3 object oriented system.
You've seen names above, and we won't cover dimensions because we don't use matrices in this book. It remains to describe the class, which controls how __generic functions__ work. Generic functions are key to object oriented programming in R, because they make functions behave differently for different classes of input. A detailed discussion of object oriented programming is beyond the scope of this book, but you can read more about it _Advanced R_ at <http://adv-r.had.co.nz/OO-essentials.html#s3>.
You've seen names above, and we won't cover dimensions because we don't use matrices in this book. It remains to describe the class, which controls how __generic functions__ work. Generic functions are key to object oriented programming in R, because they make functions behave differently for different classes of input. A detailed discussion of object oriented programming is beyond the scope of this book, but you can read more about it in _Advanced R_ at <http://adv-r.had.co.nz/OO-essentials.html#s3>.
Here's what a typical generic function looks like:
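A minimal sketch (`as.Date` is one such generic; printing it shows the dispatch call):
```{r}
as.Date
```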
@@ -567,7 +566,7 @@ The call to "UseMethod" means that this is a generic function, and it will call
methods("as.Date")
```
For example, if `x` is a character vector, `as.Date()` will call `as.Date.charcter()`; if it's a factor, it'll call `as.Date.factor()`.
For example, if `x` is a character vector, `as.Date()` will call `as.Date.character()`; if it's a factor, it'll call `as.Date.factor()`.
You can see the specific implementation of a method with `getS3method()`:
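For example, a sketch using the `as.Date()` method mentioned above:
```{r}
getS3method("as.Date", "character")
```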
@@ -580,10 +579,11 @@ The most important S3 generic is `print()`: it controls how the object is printe
## Augmented vectors
Atomic vectors and lists are the building blocks for other important vector types like factors and dates. I call these __augmented vectors__, because they are vectors with additional __attributes__, including class. Because augmented vectors has a class, they behave differently to the atomic vector on which they are built. In this book, we make use of four important augmented vectors:
Atomic vectors and lists are the building blocks for other important vector types like factors and dates. I call these __augmented vectors__, because they are vectors with additional __attributes__, including class. Because augmented vectors have a class, they behave differently to the atomic vector on which they are built. In this book, we make use of four important augmented vectors:
* Factors.
* Date-times and times.
* Date-times
* Times.
* Tibbles.
These are described below.
@@ -671,5 +671,5 @@ The main difference is the class. The class of tibble includes "data.frame" whic
1. Try and make a tibble that has columns with different lengths. What
happens?
1. Based of the definition above, is it ok to have a list as a
1. Based on the definition above, is it ok to have a list as a
column of a tibble?

View File

@@ -17,7 +17,7 @@ This chapter focusses on ggplot2, one of the core members of the tidyverse. To a
library(tidyverse)
```
That one line of code loads the core tidyverse; packages which you will use in almost every data analysis. It also tells you which functions from the tidyverse conflicts with functions in base R (or from other packages you might have loaded).
That one line of code loads the core tidyverse; packages which you will use in almost every data analysis. It also tells you which functions from the tidyverse conflict with functions in base R (or from other packages you might have loaded).
If you run this code and get the error message "there is no package called tidyverse", you'll need to first install it, then run `library()` once again.
@@ -154,9 +154,9 @@ ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
```
What happened to the SUVs? ggplot2 will only use six shapes at a time. By default, additional groups will go unplotted when you use this aesthetic.
What happened to the SUVs? ggplot2 will only use six shapes at a time. By default, additional groups will go unplotted when you use the shape aesthetic.
For each aesthetic, you use the `aes()` associate the name of the aesthetic with a variable to display. The `aes()` function gathers together each of the aesthetic mappings used by a layer and passes them to the layer's mapping argument. The syntax highlights a useful insight about `x` and `y`: the x and y locations of a point are themselves aesthetics, visual properties that you can map to variables to display information about the data.
For each aesthetic, you use `aes()` to associate the name of the aesthetic with a variable to display. The `aes()` function gathers together each of the aesthetic mappings used by a layer and passes them to the layer's mapping argument. The syntax highlights a useful insight about `x` and `y`: the x and y locations of a point are themselves aesthetics, visual properties that you can map to variables to display information about the data.
Once you map an aesthetic, ggplot2 takes care of the rest. It selects a reasonable scale to use with the aesthetic, and it constructs a legend that explains the mapping between levels and values. For x and y aesthetics, ggplot2 does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts as a legend; it explains the mapping between locations and values.
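As a sketch, mapping a third variable to colour (using the `mpg` data from this chapter) shows ggplot2 building the scale and legend automatically:
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, colour = class))
```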
@@ -221,7 +221,7 @@ ggplot(shapes, aes(x, y)) +
As you start to run R code, you're likely to run into problems. Don't worry --- it happens to everyone. I have been writing R code for years, and every day I still write code that doesn't work!
Start by carefully comparing the code that you're running to the code in the book. R is extremely picky, and a misplaced character can make all the difference. Make sure that every `(` is matched with a `)` and every `"` is paired with another `"`. Sometimes you'll run the code and nothing happens. Check the left-hand of your console: if it's a `+`, it means that R doesn't think you've typed a complete expression and it's waiting for you to finish it. In this case, it's usually easy to start from scratch again by pressing `ESCAPE` to abort processing the current command.
Start by carefully comparing the code that you're running to the code in the book. R is extremely picky, and a misplaced character can make all the difference. Make sure that every `(` is matched with a `)` and every `"` is paired with another `"`. Sometimes you'll run the code and nothing happens. Check the left-hand of your console: if it's a `+`, it means that R doesn't think you've typed a complete expression and it's waiting for you to finish it. In this case, it's usually easy to start from scratch again by pressing ESCAPE to abort processing the current command.
One common problem when creating ggplot2 graphics is to put the `+` in the wrong place: it has to come at the end of the line, not the start. In other words, make sure you haven't accidentally written code like this:
@@ -359,8 +359,7 @@ ggplot(data = mpg) +
ggplot(data = mpg) +
geom_smooth(
mapping = aes(x = displ, y = hwy, colour = drv),
show.legend = FALSE
mapping = aes(x = displ, y = hwy, group = drv)
)
```
@@ -470,10 +469,10 @@ On the x-axis, the chart displays `cut`, a variable from `diamonds`. On the y-ax
* smoothers fit a model to your data and then plot predictions from the
model.
* boxplots compute a robust summary of the distribution and display as
* boxplots compute a robust summary of the distribution and then display a
specially formatted box.
The algorithm used calculate new values for a graph is called a __stat__, short for statistical transformation. The figure below describes how this process works with `geom_bar()`.
The algorithm used to calculate new values for a graph is called a __stat__, short for statistical transformation. The figure below describes how this process works with `geom_bar()`.
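As a sketch, `geom_bar()` uses a counting stat by default, so this plot computes its own y values from `diamonds`:
```{r}
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
```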
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/visualization-stat-bar.png")

View File

@@ -1,6 +1,6 @@
# Workflow: basics
You now have some experience running R code. I didn't give you many details, but you've obviously figured out the basics, or you would've thrown this book away in frustration! Frustration is natural when you start programming in R, because it is such a stickler for punctuation, and even one character out of place will cause it to complain. But while you should expect to be a little frustrated, take comfort in that it's both typical and temporary: It happens to everyone, and the only way to get over it is to keep trying.
You now have some experience running R code. I didn't give you many details, but you've obviously figured out the basics, or you would've thrown this book away in frustration! Frustration is natural when you start programming in R, because it is such a stickler for punctuation, and even one character out of place will cause it to complain. But while you should expect to be a little frustrated, take comfort in that it's both typical and temporary: it happens to everyone, and the only way to get over it is to keep trying.
Before we go any further, let's make sure you've got a solid foundation in running R code, and that you know about some of the most helpful RStudio features.

View File

@@ -43,7 +43,7 @@ getwd()
#> [1] "/Users/hadley/Documents/r4ds/r4ds"
```
As a beginning R user, it's OK let your home directory, documents directory, or any other weird directory on your computer be R's working directory. But you're six chapters into this book, and you're no longer a rank beginner. Very soon now you should evolve to organising your analytical projects into directories and, when working on a project, setting R's working directory to the associated directory.
As a beginning R user, it's OK to let your home directory, documents directory, or any other weird directory on your computer be R's working directory. But you're six chapters into this book, and you're no longer a rank beginner. Very soon now you should evolve to organising your analytical projects into directories and, when working on a project, setting R's working directory to the associated directory.
__I do not recommend it__, but you can also set the working directory from within R:
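A sketch (the path is hypothetical; `eval = FALSE` keeps it from running):
```{r, eval = FALSE}
setwd("/path/to/my/CoolProject")
```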
@@ -70,7 +70,7 @@ Paths and directories are a little complicated because there are two basic style
letter (e.g. `C:`) or two backslashes (e.g. `\\servername`) and in
Mac/Linux they start with a slash "/" (e.g. `/users/hadley`). You should
__never__ use absolute paths in your scripts, because they hinder sharing:
noone else will have exactly the same directory configuration as you.
no one else will have exactly the same directory configuration as you.
1. The last minor difference is the place that `~` points to. `~` is a
convenient shortcut to your home directory. Windows doesn't really have

View File

@@ -6,7 +6,7 @@ So far you've been using the console to run code. That's a great place to start,
knitr::include_graphics("diagrams/rstudio-editor.png")
```
The script editor is a great place to put code you care about. Keep experimenting in the console, but once you have written code that work and does what you want, put it in the script editor. RStudio will automatically save the contents of the editor when you quit RStudio, and will automatically load it when you re-open. Nevertheless, it's a good idea to save your scripts regularly and to back them up.
The script editor is a great place to put code you care about. Keep experimenting in the console, but once you have written code that works and does what you want, put it in the script editor. RStudio will automatically save the contents of the editor when you quit RStudio, and will automatically load it when you re-open. Nevertheless, it's a good idea to save your scripts regularly and to back them up.
## Running code