From 00ecb39a719253e5b759c846ac239dddab049345 Mon Sep 17 00:00:00 2001 From: Matthew Sedaghatfar Date: Wed, 20 Jun 2018 04:56:46 -0400 Subject: [PATCH 01/38] Update model-assess.Rmd (#602) --- model-assess.Rmd | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/model-assess.Rmd b/model-assess.Rmd index 2957609..952de1b 100644 --- a/model-assess.Rmd +++ b/model-assess.Rmd @@ -51,7 +51,7 @@ There are lots of high-level helpers to do these resampling methods in R. We're . [Applied Predictive Modeling](https://amzn.com/1461468485), by Max Kuhn and Kjell Johnson. -If you're competing in competitions, like Kaggle, that are predominantly about creating good predicitons, developing a good strategy for avoiding overfitting is very important. Otherwise you risk tricking yourself into thinking that you have a good model, when in reality you just have a model that does a good job of fitting your data. +If you're competing in competitions, like Kaggle, that are predominantly about creating good predictions, developing a good strategy for avoiding overfitting is very important. Otherwise you risk tricking yourself into thinking that you have a good model, when in reality you just have a model that does a good job of fitting your data. There is a closely related family that uses a similar idea: model ensembles. However, instead of trying to find the best models, ensembles make use of all the models, acknowledging that even models that don't fit all the data particularly well can still model some subsets well. In general, you can think of model ensemble techniques as functions that take a list of models, and a return a single model that attempts to take the best part of each. @@ -155,7 +155,7 @@ models %>% But do you think this model will do well if we apply it to new data from the same population? -In real-life you can't easily go out and recollect your data. There are two approach to help you get around this problem. I'll introduce them briefly here, and then we'll go into more depth in the following sections. +In real-life you can't easily go out and recollect your data. There are two approaches to help you get around this problem. I'll introduce them briefly here, and then we'll go into more depth in the following sections. ```{r} boot <- bootstrap(df, 100) %>% @@ -181,7 +181,7 @@ last_plot() + Bootstrapping is a useful tool to help us understand how the model might vary if we'd collected a different sample from the population. A related technique is cross-validation which allows us to explore the quality of the model. It works by repeatedly splitting the data into two pieces. One piece, the training set, is used to fit, and the other piece, the test set, is used to measure the model quality. -The following code generates 100 test-training splits, holding out 20% of the data for testing each time. We then fit a model to the training set, and evalute the error on the test set: +The following code generates 100 test-training splits, holding out 20% of the data for testing each time. We then fit a model to the training set, and evaluate the error on the test set: ```{r} cv <- crossv_mc(df, 100) %>% @@ -192,7 +192,7 @@ cv <- crossv_mc(df, 100) %>% cv ``` -Obviously, a plot is going to help us see distribution more easily. I've added our original estimate of the model error as a white vertical line (where the same dataset is used for both training and teseting), and you can see it's very optimistic. +Obviously, a plot is going to help us see distribution more easily. I've added our original estimate of the model error as a white vertical line (where the same dataset is used for both training and testing), and you can see it's very optimistic. ```{r} cv %>% @@ -202,7 +202,7 @@ cv %>% geom_rug() ``` -The distribution of errors is highly skewed: there are a few cases which have very high errors. These respresent samples where we ended up with a few cases on all with low values or high values of x. Let's take a look: +The distribution of errors is highly skewed: there are a few cases which have very high errors. These represent samples where we ended up with a few cases on all with low values or high values of x. Let's take a look: ```{r} filter(cv, rmse > 1.5) %>% @@ -214,13 +214,13 @@ filter(cv, rmse > 1.5) %>% All of the models that fit particularly poorly were fit to samples that either missed the first one or two or the last one or two observation. Because polynomials shoot off to positive and negative, they give very bad predictions for those values. -Now that we've given you a quick overview and intuition for these techniques, lets dive in more more detail. +Now that we've given you a quick overview and intuition for these techniques, let's dive in more detail. ## Resamples ### Building blocks -Both the boostrap and cross-validation are build on top of a "resample" object. In modelr, you can access these low-level tools directly with the `resample_*` functions. +Both the boostrap and cross-validation are built on top of a "resample" object. In modelr, you can access these low-level tools directly with the `resample_*` functions. These functions return an object of class "resample", which represents the resample in a memory efficient way. Instead of storing the resampled dataset itself, it instead stores the integer indices, and a "pointer" to the original dataset. This makes resamples take up much less memory. @@ -250,7 +250,7 @@ If you get a strange error, it's probably because the modelling function doesn't ``` `strap` gives the bootstrap sample dataset, and `.id` assigns a - unique identifer to each model (this is often useful for plotting) + unique identifier to each model (this is often useful for plotting) * `crossv_mc()` return a data frame with three columns: @@ -290,7 +290,7 @@ It's called the $R^2$ because for simple models like this, it's just the square cor(heights$income, heights$height) ^ 2 ``` -The $R^2$ is an ok single number summary, but I prefer to think about the unscaled residuals because it's easier to interpret in the context of the original data. As you'll also learn later, it's also a rather optimistic interpretation of the model. Because you're asssessing the model using the same data that was used to fit it, it really gives more of an upper bound on the quality of the model, not a fair assessment. +The $R^2$ is an ok single number summary, but I prefer to think about the unscaled residuals because it's easier to interpret in the context of the original data. As you'll also learn later, it's also a rather optimistic interpretation of the model. Because you're assessing the model using the same data that was used to fit it, it really gives more of an upper bound on the quality of the model, not a fair assessment. From 84f04262b0a5b614c5aa1eef49cd88e3bddecd4d Mon Sep 17 00:00:00 2001 From: Ranae Dietzel Date: Wed, 20 Jun 2018 03:57:09 -0500 Subject: [PATCH 02/38] Two minor typos (#603) --- model-many.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/model-many.Rmd b/model-many.Rmd index 95e75f5..6afffbd 100644 --- a/model-many.Rmd +++ b/model-many.Rmd @@ -107,7 +107,7 @@ by_country (I'm cheating a little by grouping on both `continent` and `country`. Given `country`, `continent` is fixed, so this doesn't add any more groups, but it's an easy way to carry an extra variable along for the ride.) -This creates an data frame that has one row per group (per country), and a rather unusual column: `data`. `data` is a list of data frames (or tibbles, to be precise). This seems like a crazy idea: we have a data frame with a column that is a list of other data frames! I'll explain shortly why I think this is a good idea. +This creates a data frame that has one row per group (per country), and a rather unusual column: `data`. `data` is a list of data frames (or tibbles, to be precise). This seems like a crazy idea: we have a data frame with a column that is a list of other data frames! I'll explain shortly why I think this is a good idea. The `data` column is a little tricky to look at because it's a moderately complicated list, and we're still working on good tools to explore these objects. Unfortunately using `str()` is not recommended as it will often produce very long output. But if you pluck out a single element from the `data` column you'll see that it contains all the data for that country (in this case, Afghanistan). From e2dc2d9e42811ace4613c457e5a3d39e801284ee Mon Sep 17 00:00:00 2001 From: Abhinav Singh Date: Wed, 20 Jun 2018 09:57:33 +0100 Subject: [PATCH 03/38] Update vectors.Rmd (#604) Typo: you'll still need you understand vectors --> you'll still need to understand vectors --- vectors.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vectors.Rmd b/vectors.Rmd index 0974b28..f1a155b 100644 --- a/vectors.Rmd +++ b/vectors.Rmd @@ -4,7 +4,7 @@ So far this book has focussed on tibbles and packages that work with them. But as you start to write your own functions, and dig deeper into R, you need to learn about vectors, the objects that underlie tibbles. If you've learned R in a more traditional way, you're probably already familiar with vectors, as most R resources start with vectors and work their way up to tibbles. I think it's better to start with tibbles because they're immediately useful, and then work your way down to the underlying components. -Vectors are particularly important as most of the functions you will write will work with vectors. It is possible to write functions that work with tibbles (like ggplot2, dplyr, and tidyr), but the tools you need to write such functions are currently idiosyncratic and immature. I am working on a better approach, , but it will not be ready in time for the publication of the book. Even when complete, you'll still need you understand vectors, it'll just make it easier to write a user-friendly layer on top. +Vectors are particularly important as most of the functions you will write will work with vectors. It is possible to write functions that work with tibbles (like ggplot2, dplyr, and tidyr), but the tools you need to write such functions are currently idiosyncratic and immature. I am working on a better approach, , but it will not be ready in time for the publication of the book. Even when complete, you'll still need to understand vectors, it'll just make it easier to write a user-friendly layer on top. ### Prerequisites From 83b0fa9132680c23649ccd0209b6d4cec9eebe29 Mon Sep 17 00:00:00 2001 From: Mark Beveridge Date: Wed, 20 Jun 2018 09:58:17 +0100 Subject: [PATCH 04/38] Update tidy.Rmd (#613) Minor typo on row 288 : 'case' --> 'cases' --- tidy.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tidy.Rmd b/tidy.Rmd index d9aa996..da83167 100644 --- a/tidy.Rmd +++ b/tidy.Rmd @@ -285,7 +285,7 @@ table3 %>% (Formally, `sep` is a regular expression, which you'll learn more about in [strings].) -Look carefully at the column types: you'll notice that `case` and `population` are character columns. This is the default behaviour in `separate()`: it leaves the type of the column as is. Here, however, it's not very useful as those really are numbers. We can ask `separate()` to try and convert to better types using `convert = TRUE`: +Look carefully at the column types: you'll notice that `cases` and `population` are character columns. This is the default behaviour in `separate()`: it leaves the type of the column as is. Here, however, it's not very useful as those really are numbers. We can ask `separate()` to try and convert to better types using `convert = TRUE`: ```{r} table3 %>% From 9c236cddda68397daf7f52266372fab18982dc43 Mon Sep 17 00:00:00 2001 From: Noah Landesberg Date: Wed, 20 Jun 2018 04:58:35 -0400 Subject: [PATCH 05/38] Update model-many.Rmd (#614) Fixes #598 --- model-many.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/model-many.Rmd b/model-many.Rmd index 6afffbd..9ed8827 100644 --- a/model-many.Rmd +++ b/model-many.Rmd @@ -384,7 +384,7 @@ Another example of this pattern is using the `map()`, `map2()`, `pmap()` from pu ```{r} sim <- tribble( ~f, ~params, - "runif", list(min = -1, max = -1), + "runif", list(min = -1, max = 1), "rnorm", list(sd = 5), "rpois", list(lambda = 10) ) From 3b688ee8a5e7aecea193aa302ca5ab536c105bd4 Mon Sep 17 00:00:00 2001 From: Floris Vanderhaeghe Date: Wed, 20 Jun 2018 10:58:59 +0200 Subject: [PATCH 06/38] Elaborate on capturing groups (#615) Clarify the meaning and use of capturing groups. --- strings.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/strings.Rmd b/strings.Rmd index acf40e2..033316d 100644 --- a/strings.Rmd +++ b/strings.Rmd @@ -389,7 +389,7 @@ str_view(x, 'C[LX]+?') ### Grouping and backreferences -Earlier, you learned about parentheses as a way to disambiguate complex expressions. They also define "groups" that you can refer to with _backreferences_, like `\1`, `\2` etc. For example, the following regular expression finds all fruits that have a repeated pair of letters. +Earlier, you learned about parentheses as a way to disambiguate complex expressions. Parentheses also create a _numbered_ capturing group (number 1, 2 etc.). A capturing group stores _the part of the string_ matched by the part of the regular expression inside the parentheses. You can refer to the same text as previously matched by a capturing group with _backreferences_, like `\1`, `\2` etc. For example, the following regular expression finds all fruits that have a repeated pair of letters. ```{r} str_view(fruit, "(..)\\1", match = TRUE) From 9a65563e9ca0b3794465100c3b4c7ee3e4edcb20 Mon Sep 17 00:00:00 2001 From: Mark Beveridge Date: Wed, 20 Jun 2018 09:59:44 +0100 Subject: [PATCH 07/38] Update pipes.Rmd (minor typo) (#616) --- pipes.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pipes.Rmd b/pipes.Rmd index b6ac2cb..b42bf18 100644 --- a/pipes.Rmd +++ b/pipes.Rmd @@ -127,7 +127,7 @@ Finally, we can use the pipe: ```{r, eval = FALSE} foo_foo %>% hop(through = forest) %>% - scoop(up = field_mouse) %>% + scoop(up = field_mice) %>% bop(on = head) ``` From 877d165d4d4fbeaf362edac77fd787239234cdd7 Mon Sep 17 00:00:00 2001 From: Ben Herbertson Date: Wed, 20 Jun 2018 16:59:54 +0800 Subject: [PATCH 08/38] Update pipes.Rmd (missing word) (#617) --- pipes.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pipes.Rmd b/pipes.Rmd index b42bf18..3a1fa95 100644 --- a/pipes.Rmd +++ b/pipes.Rmd @@ -14,7 +14,7 @@ library(magrittr) ## Piping alternatives -The point of the pipe is to help you write code in a way that easier to read and understand. To see why the pipe is so useful, we're going to explore a number of ways of writing the same code. Let's use code to tell a story about a little bunny named Foo Foo: +The point of the pipe is to help you write code in a way that is easier to read and understand. To see why the pipe is so useful, we're going to explore a number of ways of writing the same code. Let's use code to tell a story about a little bunny named Foo Foo: > Little bunny Foo Foo > Went hopping through the forest From 0d7ba63f77c217d8bf79b8f6089894031dbf6242 Mon Sep 17 00:00:00 2001 From: Hao Chen Date: Wed, 20 Jun 2018 11:00:07 +0200 Subject: [PATCH 09/38] Use pipe for the spread example (#620) To keep consistency with the rest of chapter. --- tidy.Rmd | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/tidy.Rmd b/tidy.Rmd index da83167..8a079e8 100644 --- a/tidy.Rmd +++ b/tidy.Rmd @@ -190,7 +190,8 @@ To tidy this up, we first analyse the representation in similar way to `gather() Once we've figured that out, we can use `spread()`, as shown programmatically below, and visually in Figure \@ref(fig:tidy-spread). ```{r} -spread(table2, key = type, value = count) +table2 %>% + spread(key = type, value = count) ``` ```{r tidy-spread, echo = FALSE, out.width = "100%", fig.cap = "Spreading `table2` makes it tidy"} From 8fee078c9db7956d3e3444bbda68655efb4153fb Mon Sep 17 00:00:00 2001 From: Hao Chen Date: Wed, 20 Jun 2018 11:00:39 +0200 Subject: [PATCH 10/38] Better naming consistency (#621) Now 'key' is used commonly across step-by-step guide, final complex pipe, and the exercise. --- tidy.Rmd | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/tidy.Rmd b/tidy.Rmd index 8a079e8..a7aa088 100644 --- a/tidy.Rmd +++ b/tidy.Rmd @@ -526,9 +526,9 @@ I've shown you the code a piece at a time, assigning each interim result to a ne ```{r, results = "hide"} who %>% - gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>% - mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>% - separate(code, c("new", "var", "sexage")) %>% + gather(key, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>% + mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>% + separate(key, c("new", "var", "sexage")) %>% select(-new, -iso2, -iso3) %>% separate(sexage, c("sex", "age"), sep = 1) ``` From 02502a6ebc506e050bcb93c489775f045ca3f694 Mon Sep 17 00:00:00 2001 From: Matt Herman Date: Wed, 20 Jun 2018 05:00:59 -0400 Subject: [PATCH 11/38] Update factors.Rmd (#624) --- factors.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/factors.Rmd b/factors.Rmd index b212d8b..336f135 100644 --- a/factors.Rmd +++ b/factors.Rmd @@ -209,8 +209,8 @@ Another type of reordering is useful when you are colouring the lines on a plot. ```{r, fig.align = "default", out.width = "50%", fig.width = 4} by_age <- gss_cat %>% filter(!is.na(age)) %>% - group_by(age, marital) %>% - count() %>% + count(age, marital) %>% + group_by(age) %>% mutate(prop = n / sum(n)) ggplot(by_age, aes(age, prop, colour = marital)) + From 74cbc29d627baf12d0c551ea5e6c14606f5136b4 Mon Sep 17 00:00:00 2001 From: Mark Beveridge Date: Wed, 20 Jun 2018 10:01:39 +0100 Subject: [PATCH 13/38] Update model-building.Rmd (#627) --- model-building.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/model-building.Rmd b/model-building.Rmd index 87fdac3..ca97f79 100644 --- a/model-building.Rmd +++ b/model-building.Rmd @@ -165,7 +165,7 @@ Nothing really jumps out at me here, but it's probably worth spending time consi the relationship between `price` and `carat`? 1. Extract the diamonds that have very high and very low residuals. - Is there anything unusual about these diamonds? Are the particularly bad + Is there anything unusual about these diamonds? Are they particularly bad or good, or do you think these are pricing errors? 1. Does the final model, `mod_diamonds2`, do a good job of predicting From 998618073bc130b21c968de16b199cdde3cf2e1e Mon Sep 17 00:00:00 2001 From: Josh Goldberg Date: Wed, 20 Jun 2018 04:01:51 -0500 Subject: [PATCH 14/38] Typo fix (#629) Added question mark ? to end of a sentence, which was a question. --- EDA.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/EDA.Rmd b/EDA.Rmd index 8c934d4..b8a2977 100644 --- a/EDA.Rmd +++ b/EDA.Rmd @@ -494,7 +494,7 @@ ggplot(data = smaller, mapping = aes(x = carat, y = price)) + 1. Visualise the distribution of carat, partitioned by price. 1. How does the price distribution of very large diamonds compare to small - diamonds. Is it as you expect, or does it surprise you? + diamonds? Is it as you expect, or does it surprise you? 1. Combine two of the techniques you've learned to visualise the combined distribution of cut, carat, and price. From 4c82fe68dccb3f7d1e378c177db83b254a3f6118 Mon Sep 17 00:00:00 2001 From: Ben Herbertson Date: Wed, 20 Jun 2018 17:02:03 +0800 Subject: [PATCH 15/38] Update rmarkdown.Rmd (minor typos) (#630) --- rmarkdown.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rmarkdown.Rmd b/rmarkdown.Rmd index 2777da9..880198a 100644 --- a/rmarkdown.Rmd +++ b/rmarkdown.Rmd @@ -336,7 +336,7 @@ You can control many other "whole document" settings by tweaking the parameters R Markdown documents can include one or more parameters whose values can be set when you render the report. Parameters are useful when you want to re-render the same report with distinct values for various key inputs. For example, you might be producing sales reports per branch, exam results by student, or demographic summaries by country. To declare one or more parameters, use the `params` field. -This example use a `my_class` parameter to determines which class of cars to display: +This example uses a `my_class` parameter to determine which class of cars to display: ```{r, echo = FALSE, out.width = "100%", comment = ""} cat(readr::read_file("rmarkdown/fuel-economy.Rmd")) From f4683fce1ee7a21264203178f263c9a5923c65a7 Mon Sep 17 00:00:00 2001 From: Jacob Kaplan Date: Wed, 20 Jun 2018 05:03:40 -0400 Subject: [PATCH 16/38] fixes minor spelling mistakes (#631) --- model-many.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/model-many.Rmd b/model-many.Rmd index 9ed8827..555e547 100644 --- a/model-many.Rmd +++ b/model-many.Rmd @@ -15,7 +15,7 @@ In this chapter you're going to learn three powerful ideas that help you to work because once you have tidy data, you can apply all of the techniques that you've learned about earlier in the book. -We'll start by diving into a motivating example using data about life expectancy around the world. It's a small dataset but it illustrates how important modelling can be for improving your visualisations. We'll use a large number of simple models to partition out some of the strongest signal so we can see the subtler signals that remain. We'll also see how model summaries can help us pick out outliers and unusual trends. +We'll start by diving into a motivating example using data about life expectancy around the world. It's a small dataset but it illustrates how important modelling can be for improving your visualisations. We'll use a large number of simple models to partition out some of the strongest signals so we can see the subtler signals that remain. We'll also see how model summaries can help us pick out outliers and unusual trends. The following sections will dive into more detail about the individual techniques: From 7e6b1e71483adf2c4ece73ee6266e1983452a9f5 Mon Sep 17 00:00:00 2001 From: "John D. Storey" Date: Wed, 20 Jun 2018 05:03:51 -0400 Subject: [PATCH 17/38] Update iteration.Rmd (#633) Fixed typo. --- iteration.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/iteration.Rmd b/iteration.Rmd index 4c25b00..a8e2a3f 100644 --- a/iteration.Rmd +++ b/iteration.Rmd @@ -158,7 +158,7 @@ That's all there is to the for loop! Now is a good time to practice creating som ## For loop variations -Once you have the basic for loop under your belt, there are some variations that you should be aware of. These variations are important regardless of how you do iteration, so don't forget about them once you've master the FP techniques you'll learn about in the next section. +Once you have the basic for loop under your belt, there are some variations that you should be aware of. These variations are important regardless of how you do iteration, so don't forget about them once you've mastered the FP techniques you'll learn about in the next section. There are four variations on the basic theme of the for loop: From 502d91f03610d1dd313404aa83e680e1c3493c62 Mon Sep 17 00:00:00 2001 From: Ben Herbertson Date: Wed, 20 Jun 2018 17:04:10 +0800 Subject: [PATCH 18/38] Minor fixes (#634) --- transform.Rmd | 2 +- visualize.Rmd | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/transform.Rmd b/transform.Rmd index 03a0a45..3c6564d 100644 --- a/transform.Rmd +++ b/transform.Rmd @@ -877,7 +877,7 @@ Functions that work most naturally in grouped mutates and filters are known as 1. What time of day should you fly if you want to avoid delays as much as possible? -1. For each destination, compute the total minutes of delay. For each, +1. For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination. 1. Delays are typically temporally correlated: even once the problem that diff --git a/visualize.Rmd b/visualize.Rmd index 717a007..7e6ef96 100644 --- a/visualize.Rmd +++ b/visualize.Rmd @@ -294,7 +294,7 @@ If you prefer to not facet in the rows or columns dimension, use a `.` instead o 1. Read `?facet_wrap`. What does `nrow` do? What does `ncol` do? What other options control the layout of the individual panels? Why doesn't - `facet_grid()` have `nrow` and `ncol` argument? + `facet_grid()` have `nrow` and `ncol` arguments? 1. When using `facet_grid()` you should usually put the variable with more unique levels in the columns. Why? From 3617c80681434c4d47a0bcb1c3aa529423080a54 Mon Sep 17 00:00:00 2001 From: Michael Henry Date: Wed, 20 Jun 2018 19:05:11 +1000 Subject: [PATCH 19/38] Fixed broken link (#637) The previous URL gives a 404 error. --- rmarkdown.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rmarkdown.Rmd b/rmarkdown.Rmd index 880198a..255483f 100644 --- a/rmarkdown.Rmd +++ b/rmarkdown.Rmd @@ -428,5 +428,5 @@ There are two important topics that we haven't covered here: collaboration, and 1. The "Git and GitHub" chapter of _R Packages_, by Hadley. You can also read it for free online: . -I have also not touched on what you should actually write in order to clearly communicate the results of your analysis. To improve your writing, I highly recommend reading either [_Style: Lessons in Clarity and Grace_](https://amzn.com/0134080416) by Joseph M. Williams & Joseph Bizup, or [_The Sense of Structure: Writing from the Reader's Perspective_](https://amzn.com/0205296327) by George Gopen. Both books will help you understand the structure of sentences and paragraphs, and give you the tools to make your writing more clear. (These books are rather expensive if purchased new, but they're used by many English classes so there are plenty of cheap second-hand copies). George Gopen also has a number of short articles on writing at . They are aimed at lawyers, but almost everything applies to data scientists too. +I have also not touched on what you should actually write in order to clearly communicate the results of your analysis. To improve your writing, I highly recommend reading either [_Style: Lessons in Clarity and Grace_](https://amzn.com/0134080416) by Joseph M. Williams & Joseph Bizup, or [_The Sense of Structure: Writing from the Reader's Perspective_](https://amzn.com/0205296327) by George Gopen. Both books will help you understand the structure of sentences and paragraphs, and give you the tools to make your writing more clear. (These books are rather expensive if purchased new, but they're used by many English classes so there are plenty of cheap second-hand copies). George Gopen also has a number of short articles on writing at . They are aimed at lawyers, but almost everything applies to data scientists too. From d96fdedfbbee948ac5170668d9fa048c2737586c Mon Sep 17 00:00:00 2001 From: Erik Erhardt Date: Wed, 20 Jun 2018 03:05:25 -0600 Subject: [PATCH 20/38] Update tidy.Rmd, typo: Spreading "forms" to "from" (#639) --- tidy.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tidy.Rmd b/tidy.Rmd index a7aa088..025565c 100644 --- a/tidy.Rmd +++ b/tidy.Rmd @@ -184,7 +184,7 @@ To tidy this up, we first analyse the representation in similar way to `gather() * The column that contains variable names, the `key` column. Here, it's `type`. -* The column that contains values forms multiple variables, the `value` +* The column that contains values from multiple variables, the `value` column. Here it's `count`. Once we've figured that out, we can use `spread()`, as shown programmatically below, and visually in Figure \@ref(fig:tidy-spread). From 2ab1949997ad8ecd96c1c659a5fbf3d72cd35267 Mon Sep 17 00:00:00 2001 From: Jeff Boichuk Date: Wed, 20 Jun 2018 05:05:55 -0400 Subject: [PATCH 21/38] Word swap (#640) To remain consistent with this sentence: "Since we already use the word "value" to describe data, let's use the word "level" to describe aesthetic properties." --- visualize.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/visualize.Rmd b/visualize.Rmd index 7e6ef96..1bb813d 100644 --- a/visualize.Rmd +++ b/visualize.Rmd @@ -167,7 +167,7 @@ ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "blue") ``` -Here, the color doesn't convey information about a variable, but only changes the appearance of the plot. To set an aesthetic manually, set the aesthetic by name as an argument of your geom function; i.e. it goes _outside_ of `aes()`. You'll need to pick a value that makes sense for that aesthetic: +Here, the color doesn't convey information about a variable, but only changes the appearance of the plot. To set an aesthetic manually, set the aesthetic by name as an argument of your geom function; i.e. it goes _outside_ of `aes()`. You'll need to pick a level that makes sense for that aesthetic: * The name of a color as a character string. From fcac8eab40381cb4a5dfbde648bd54e62d7852ea Mon Sep 17 00:00:00 2001 From: Mark Beveridge Date: Wed, 20 Jun 2018 10:06:03 +0100 Subject: [PATCH 22/38] Update relational-data.Rmd (minor typo) (#641) Row 371 : In table=`airports`, the variable representing longitude is not `long` ...it is `lon` --- relational-data.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/relational-data.Rmd b/relational-data.Rmd index f523eb9..bf6af4b 100644 --- a/relational-data.Rmd +++ b/relational-data.Rmd @@ -368,7 +368,7 @@ So far, the pairs of tables have always been joined by a single variable, and th variables from `x` will be used in the output. For example, if we want to draw a map we need to combine the flights data - with the airports data which contains the location (`lat` and `long`) of + with the airports data which contains the location (`lat` and `lon`) of each airport. Each flight has an origin and destination `airport`, so we need to specify which one we want to join to: From e8c1dbb4282168e895f7095679d8d8c55f076ecd Mon Sep 17 00:00:00 2001 From: Erik Erhardt Date: Wed, 20 Jun 2018 03:06:51 -0600 Subject: [PATCH 23/38] Added colon in select() section (#645) `num_range("x", 1:3)`:, added missing colon to match the rest of the list --- transform.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/transform.Rmd b/transform.Rmd index 3c6564d..7adcb3e 100644 --- a/transform.Rmd +++ b/transform.Rmd @@ -273,7 +273,7 @@ There are a number of helper functions you can use within `select()`: This one matches any variables that contain repeated characters. You'll learn more about regular expressions in [strings]. -* `num_range("x", 1:3)` matches `x1`, `x2` and `x3`. +* `num_range("x", 1:3)`: matches `x1`, `x2` and `x3`. See `?select` for more details. From 44f3e50fdf4f4fc867d32aeec1bf57dca5d10c63 Mon Sep 17 00:00:00 2001 From: "Yiming (Paul) Li" Date: Wed, 20 Jun 2018 04:07:17 -0500 Subject: [PATCH 24/38] Correct a typo in model-many.Rmd (#647) --- model-many.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/model-many.Rmd b/model-many.Rmd index 555e547..741757c 100644 --- a/model-many.Rmd +++ b/model-many.Rmd @@ -377,7 +377,7 @@ df %>% unnest() ``` -(If you find yourself using this pattern a lot, make sure to check out `tidyr:separate_rows()` which is a wrapper around this common pattern). +(If you find yourself using this pattern a lot, make sure to check out `tidyr::separate_rows()` which is a wrapper around this common pattern). Another example of this pattern is using the `map()`, `map2()`, `pmap()` from purrr. For example, we could take the final example from [Invoking different functions] and rewrite it to use `mutate()`: From f8a9d17d6fc8e2974a5106a6c8e44e72af68bcf7 Mon Sep 17 00:00:00 2001 From: andrewmacfarland Date: Wed, 20 Jun 2018 03:07:28 -0600 Subject: [PATCH 25/38] Update datetimes.Rmd (#649) fix typo --- datetimes.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/datetimes.Rmd b/datetimes.Rmd index c49eb0b..c9a635d 100644 --- a/datetimes.Rmd +++ b/datetimes.Rmd @@ -538,7 +538,7 @@ x1 - x2 x1 - x3 ``` -Unless other specified, lubridate always uses UTC. UTC (Coordinated Universal Time) is the standard time zone used by the scientific community and roughly equivalent to its predecessor GMT (Greenwich Mean Time). It does not have DST, which makes a convenient representation for computation. Operations that combine date-times, like `c()`, will often drop the time zone. In that case, the date-times will display in your local time zone: +Unless otherwise specified, lubridate always uses UTC. UTC (Coordinated Universal Time) is the standard time zone used by the scientific community and roughly equivalent to its predecessor GMT (Greenwich Mean Time). It does not have DST, which makes a convenient representation for computation. Operations that combine date-times, like `c()`, will often drop the time zone. In that case, the date-times will display in your local time zone: ```{r} x4 <- c(x1, x2, x3) From 37c9d02b294f61066186fe41092d989e2d5afdee Mon Sep 17 00:00:00 2001 From: Edwin Thoen Date: Wed, 20 Jun 2018 11:07:52 +0200 Subject: [PATCH 26/38] added root to msd (#650) --- transform.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/transform.Rmd b/transform.Rmd index 7adcb3e..676de52 100644 --- a/transform.Rmd +++ b/transform.Rmd @@ -676,7 +676,7 @@ Just using means, counts, and sum can get you a long way, but R provides many ot ) ``` -* Measures of spread: `sd(x)`, `IQR(x)`, `mad(x)`. The mean squared deviation, +* Measures of spread: `sd(x)`, `IQR(x)`, `mad(x)`. The root mean squared deviation, or standard deviation or sd for short, is the standard measure of spread. The interquartile range `IQR()` and median absolute deviation `mad(x)` are robust equivalents that may be more useful if you have outliers. From 7bcbe88f5b58775c2e0a1438e5d68c985051d8bd Mon Sep 17 00:00:00 2001 From: AlanFeder Date: Wed, 20 Jun 2018 05:09:21 -0400 Subject: [PATCH 27/38] change make_prediction() to model1() (#657) --- model-basics.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/model-basics.Rmd b/model-basics.Rmd index 2f27798..862b6f0 100644 --- a/model-basics.Rmd +++ b/model-basics.Rmd @@ -214,7 +214,7 @@ These are exactly the same values we got with `optim()`! Behind the scenes `lm() ```{r} measure_distance <- function(mod, data) { - diff <- data$y - make_prediction(mod, data) + diff <- data$y - model1(mod, data) mean(abs(diff)) } ``` From 64a1716d71607f9ff6f9d4641805380241ef7892 Mon Sep 17 00:00:00 2001 From: Dirk Eddelbuettel Date: Wed, 20 Jun 2018 04:09:36 -0500 Subject: [PATCH 28/38] suprinsingly -> surprisingly (#658) one char typo --- strings.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/strings.Rmd b/strings.Rmd index 033316d..9a424dc 100644 --- a/strings.Rmd +++ b/strings.Rmd @@ -519,7 +519,7 @@ r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] ?:\r\n)?[ \t])*))*)?;\s*) ``` -This is a somewhat pathological example (because email addresses are actually suprisingly complex), but is used in real code. See the stackoverflow discussion at for more details. +This is a somewhat pathological example (because email addresses are actually surprisingly complex), but is used in real code. See the stackoverflow discussion at for more details. Don't forget that you're in a programming language and you have other tools at your disposal. Instead of creating one complex regular expression, it's often easier to write a series of simpler regexps. If you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one. From c7a2442f64d3eef647e52286efe5678a19895818 Mon Sep 17 00:00:00 2001 From: Garrick Aden-Buie Date: Wed, 20 Jun 2018 05:10:38 -0400 Subject: [PATCH 29/38] Minor typo: dash needs to be first in character class group (#664) --- strings.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/strings.Rmd b/strings.Rmd index 9a424dc..1d500a2 100644 --- a/strings.Rmd +++ b/strings.Rmd @@ -853,7 +853,7 @@ You can use the other arguments of `regex()` to control details of the match: phone <- regex(" \\(? # optional opening parens (\\d{3}) # area code - [)- ]? # optional closing parens, dash, or space + [) -]? # optional closing parens, space, or dash (\\d{3}) # another three numbers [ -]? # optional space or dash (\\d{3}) # three more numbers From 6edfe2c9ed4c8d3d282b781e09d13fc91a7e316c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?St=C3=A9phane=20Guillou?= Date: Wed, 20 Jun 2018 19:10:59 +1000 Subject: [PATCH 30/38] minor typos in chapter 5 (#666) --- transform.Rmd | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/transform.Rmd b/transform.Rmd index 676de52..7517775 100644 --- a/transform.Rmd +++ b/transform.Rmd @@ -101,7 +101,7 @@ There's another common problem you might encounter when using `==`: floating poi ```{r} sqrt(2) ^ 2 == 2 -1/49 * 49 == 1 +1 / 49 * 49 == 1 ``` Computers use finite precision arithmetic (they obviously can't store an infinite number of digits!) so remember that every number you see is an approximation. Instead of relying on `==`, use `near()`: @@ -389,7 +389,7 @@ There are many functions for creating new variables that you can use with `mutat * Offsets: `lead()` and `lag()` allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. `x - lag(x)`) - or find when values change (`x != lag(x))`. They are most useful in + or find when values change (`x != lag(x)`). They are most useful in conjunction with `group_by()`, which you'll learn about shortly. ```{r} @@ -882,7 +882,7 @@ Functions that work most naturally in grouped mutates and filters are known as 1. Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed - to allow earlier flights to leave. Using `lag()` explore how the delay + to allow earlier flights to leave. Using `lag()`, explore how the delay of a flight is related to the delay of the immediately preceding flight. 1. Look at each destination. Can you find flights that are suspiciously From 03c4cc5e62c442f2387746d5ac05a46130087b16 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=92=8B=E9=9B=A8=E8=92=99?= Date: Wed, 20 Jun 2018 17:11:10 +0800 Subject: [PATCH 31/38] Fix a typo. (#667) `quadatric` to `quadratic`. --- model-basics.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/model-basics.Rmd b/model-basics.Rmd index 862b6f0..4eda66a 100644 --- a/model-basics.Rmd +++ b/model-basics.Rmd @@ -10,7 +10,7 @@ There are two parts to a model: 1. First, you define a __family of models__ that express a precise, but generic, pattern that you want to capture. For example, the pattern - might be a straight line, or a quadatric curve. You will express + might be a straight line, or a quadratic curve. You will express the model family as an equation like `y = a_1 * x + a_2` or `y = a_1 * x ^ a_2`. Here, `x` and `y` are known variables from your data, and `a_1` and `a_2` are parameters that can vary to capture From 44663f46137abf22f5b795626a54eb9082518434 Mon Sep 17 00:00:00 2001 From: David Rubinger Date: Wed, 20 Jun 2018 05:12:02 -0400 Subject: [PATCH 32/38] Fix typos and formatting (#670) --- functions.Rmd | 2 +- pipes.Rmd | 2 +- vectors.Rmd | 4 ++-- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/functions.Rmd b/functions.Rmd index d594cd5..f88e532 100644 --- a/functions.Rmd +++ b/functions.Rmd @@ -280,7 +280,7 @@ if (condition) { To get help on `if` you need to surround it in backticks: `` ?`if` ``. The help isn't particularly helpful if you're not already an experienced programmer, but at least you know how to get to it! -Here's a simple function that uses an if statement. The goal of this function is to return a logical vector describing whether or not each element of a vector is named. +Here's a simple function that uses an `if` statement. The goal of this function is to return a logical vector describing whether or not each element of a vector is named. ```{r} has_name <- function(x) { diff --git a/pipes.Rmd b/pipes.Rmd index 3a1fa95..1bfc9bb 100644 --- a/pipes.Rmd +++ b/pipes.Rmd @@ -131,7 +131,7 @@ foo_foo %>% bop(on = head) ``` -This is my favourite form, because it focusses on verbs, not nouns. You can read this series of function compositions like it's a set of imperative actions. Foo Foo hops, then scoops, then bops. The downside, of course, is that you need to be familiar with the pipe. If you've never seen `%>%` before, you'll have no idea what this code does. Fortunately, most people pick up the idea very quickly, so when you share you code with others who aren't familiar with the pipe, you can easily teach them. +This is my favourite form, because it focusses on verbs, not nouns. You can read this series of function compositions like it's a set of imperative actions. Foo Foo hops, then scoops, then bops. The downside, of course, is that you need to be familiar with the pipe. If you've never seen `%>%` before, you'll have no idea what this code does. Fortunately, most people pick up the idea very quickly, so when you share your code with others who aren't familiar with the pipe, you can easily teach them. The pipe works by performing a "lexical transformation": behind the scenes, magrittr reassembles the code in the pipe to a form that works by overwriting an intermediate object. When you run a pipe like the one above, magrittr does something like this: diff --git a/vectors.Rmd b/vectors.Rmd index f1a155b..3a3095e 100644 --- a/vectors.Rmd +++ b/vectors.Rmd @@ -48,7 +48,7 @@ Every vector has two key properties: length(x) ``` -Vectors can also contain arbitrary additional metadata in the form of attributes. These attributes are used to create __augmented vectors__ which build on additional behaviour. There are four important types of augmented vector: +Vectors can also contain arbitrary additional metadata in the form of attributes. These attributes are used to create __augmented vectors__ which build on additional behaviour. There are three important types of augmented vector: * Factors are built on top of integer vectors. * Dates and date-times are built on top of numeric vectors. @@ -194,7 +194,7 @@ There are two ways to convert, or coerce, one type of vector to another: Because explicit coercion is used relatively rarely, and is largely easy to understand, I'll focus on implicit coercion here. -You've already seen the most important type of implicit coercion: using a logical vector in a numeric context. In this case `TRUE` is converted to `1` and `FALSE` converted to 0. That means the sum of a logical vector is the number of trues, and the mean of a logical vector is the proportion of trues: +You've already seen the most important type of implicit coercion: using a logical vector in a numeric context. In this case `TRUE` is converted to `1` and `FALSE` converted to `0`. That means the sum of a logical vector is the number of trues, and the mean of a logical vector is the proportion of trues: ```{r} x <- sample(20, 100, replace = TRUE) From 0fd29dbae41ad004525e4b6d49b379e6cddc8068 Mon Sep 17 00:00:00 2001 From: Mark Beveridge Date: Wed, 20 Jun 2018 10:12:16 +0100 Subject: [PATCH 33/38] Update model-building.Rmd (minor typo) (#671) --- model-building.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/model-building.Rmd b/model-building.Rmd index ca97f79..f2bf8fb 100644 --- a/model-building.Rmd +++ b/model-building.Rmd @@ -168,7 +168,7 @@ Nothing really jumps out at me here, but it's probably worth spending time consi Is there anything unusual about these diamonds? Are they particularly bad or good, or do you think these are pricing errors? -1. Does the final model, `mod_diamonds2`, do a good job of predicting +1. Does the final model, `mod_diamond2`, do a good job of predicting diamond prices? Would you trust it to tell you how much to spend if you were buying a diamond? From ca89c22741a53d6d917263d5b9b61b3051355fea Mon Sep 17 00:00:00 2001 From: Matthew Hendrickson Date: Wed, 20 Jun 2018 05:12:29 -0400 Subject: [PATCH 34/38] line 188 - typo - delete 'is' (#672) Line read "...because it's is a special..." Removed typo "is" --- model-basics.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/model-basics.Rmd b/model-basics.Rmd index 4eda66a..20dee1d 100644 --- a/model-basics.Rmd +++ b/model-basics.Rmd @@ -185,7 +185,7 @@ ggplot(sim1, aes(x, y)) + Don't worry too much about the details of how `optim()` works. It's the intuition that's important here. If you have a function that defines the distance between a model and a dataset, an algorithm that can minimise that distance by modifying the parameters of the model, you can find the best model. The neat thing about this approach is that it will work for any family of models that you can write an equation for. -There's one more approach that we can use for this model, because it's is a special case of a broader family: linear models. A linear model has the general form `y = a_1 + a_2 * x_1 + a_3 * x_2 + ... + a_n * x_(n - 1)`. So this simple model is equivalent to a general linear model where n is 2 and `x_1` is `x`. R has a tool specifically designed for fitting linear models called `lm()`. `lm()` has a special way to specify the model family: formulas. Formulas look like `y ~ x`, which `lm()` will translate to a function like `y = a_1 + a_2 * x`. We can fit the model and look at the output: +There's one more approach that we can use for this model, because it's a special case of a broader family: linear models. A linear model has the general form `y = a_1 + a_2 * x_1 + a_3 * x_2 + ... + a_n * x_(n - 1)`. So this simple model is equivalent to a general linear model where n is 2 and `x_1` is `x`. R has a tool specifically designed for fitting linear models called `lm()`. `lm()` has a special way to specify the model family: formulas. Formulas look like `y ~ x`, which `lm()` will translate to a function like `y = a_1 + a_2 * x`. We can fit the model and look at the output: ```{r} sim1_mod <- lm(y ~ x, data = sim1) From c8c28ea2a0fcac78f4b45093968c3198c0376134 Mon Sep 17 00:00:00 2001 From: Mark Beveridge Date: Wed, 20 Jun 2018 10:12:42 +0100 Subject: [PATCH 35/38] Update model-many.Rmd (missing word) (#674) --- model-many.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/model-many.Rmd b/model-many.Rmd index 741757c..124794b 100644 --- a/model-many.Rmd +++ b/model-many.Rmd @@ -266,7 +266,7 @@ We see two main effects here: the tragedies of the HIV/AIDS epidemic and the Rwa 1. To create the last plot (showing the data for the countries with the worst model fits), we needed two steps: we created a data frame with one row per country and then semi-joined it to the original dataset. - It's possible avoid this join if we use `unnest()` instead of + It's possible to avoid this join if we use `unnest()` instead of `unnest(.drop = TRUE)`. How? ## List-columns From e92922f2ae4955fd17f5379736b0d48b858fa684 Mon Sep 17 00:00:00 2001 From: Jen Ren Date: Wed, 20 Jun 2018 02:13:23 -0700 Subject: [PATCH 36/38] typo correction in "Durations" (#678) --- datetimes.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/datetimes.Rmd b/datetimes.Rmd index c9a635d..e9477f5 100644 --- a/datetimes.Rmd +++ b/datetimes.Rmd @@ -384,7 +384,7 @@ one_pm one_pm + ddays(1) ``` -Why is one day after 1pm on March 12, 2pm on March 13?! If you look carefully at the date you might also notice that the time zones have changed. Because of DST, March 12 only has 23 hours, so if add a full days worth of seconds we end up with a different time. +Why is one day after 1pm on March 12, 2pm on March 13?! If you look carefully at the date you might also notice that the time zones have changed. Because of DST, March 12 only has 23 hours, so if we add a full days worth of seconds we end up with a different time. ### Periods From bbc87c9049cc15b70f14c97db9a03951ed99187a Mon Sep 17 00:00:00 2001 From: Jonas Date: Wed, 20 Jun 2018 11:15:37 +0200 Subject: [PATCH 37/38] Update workflow-basics.Rmd (#681) Fixed typo `dota` to `data` --- workflow-basics.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/workflow-basics.Rmd b/workflow-basics.Rmd index 6c7d2ad..5e51219 100644 --- a/workflow-basics.Rmd +++ b/workflow-basics.Rmd @@ -146,7 +146,7 @@ Here you can see all of the objects that you've created. ```{r, eval = FALSE} library(tidyverse) - ggplot(dota = mpg) + + ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) fliter(mpg, cyl = 8) From 03eb8d06a97a89918083d848b1b7d422d465c9c9 Mon Sep 17 00:00:00 2001 From: "Jennifer (Jenny) Bryan" Date: Wed, 20 Jun 2018 20:08:05 -0700 Subject: [PATCH 38/38] Mention the use of a character class for metacharacters (#687) Closes #673 --- strings.Rmd | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/strings.Rmd b/strings.Rmd index 1d500a2..ae5207f 100644 --- a/strings.Rmd +++ b/strings.Rmd @@ -299,6 +299,17 @@ There are a number of special patterns that match more than one character. You'v Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`. +A character class containing a single character is a nice alternative to backslash escapes when you want to include a single metacharacter in a regex. Many people find this more readable. + +```{r} +# Look for a literal character that normally has special meaning in a regex +str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c") +str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c") +str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]") +``` + +This works for most (but not all) regex metacharacters: `$` `.` `|` `?` `*` `+` `(` `)` `[` `{`. Unfortunately, a few characters have special meaning even inside a character class and must be handled with backslash escapes: `]` `\` `^` and `-`. + You can use _alternation_ to pick between one or more alternative patterns. For example, `abc|d..f` will match either '"abc"', or `"deaf"`. Note that the precedence for `|` is low, so that `abc|xyz` matches `abc` or `xyz` not `abcyz` or `abxyz`. Like with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want: ```{r}