From 4d9998117ff64ec79de95cf7c883e296e882e4b3 Mon Sep 17 00:00:00 2001
From: kdpsingh
Date: Fri, 20 May 2016 02:09:26 -0400
Subject: [PATCH 01/11] Fixed typo - "machinary" to machinery

---
 iteration.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/iteration.Rmd b/iteration.Rmd
index 92609a2..c4fd653 100644
--- a/iteration.Rmd
+++ b/iteration.Rmd
@@ -19,7 +19,7 @@ In [functions], we talked about how important it is to reduce duplication in you
 
 One part of reducing duplication is writing functions. Functions allow you to identify repeated patterns of code and extract them out into indepdent pieces that you can reuse and easily update as code changes. Iteration helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets. (Generally, you won't need to use explicit iteration to deal with different subsets of your data: in most cases the implicit iteration in dplyr will take care of that problem for you.)
 
-In this chapter you'll learn about two important iteration paradigms: imperative programming and functional programming, and the machinary each provides. On the imperative side you have things like for loops and while loops, which are a great place to start because they make iteration very explicit, so it's obvious what's happening. However, for loops are quite verbose, and include quite a bit of book-keeping code, that is duplicated for every for loop. Functional programming (FP) offers tools to extract out this duplicated code, so each common for loop pattern gets its own function. Once you master the vocabulary of FP, you can solve many common iteration problems with less code, more ease, and fewer errors.
+In this chapter you'll learn about two important iteration paradigms: imperative programming and functional programming, and the machinery each provides.
On the imperative side you have things like for loops and while loops, which are a great place to start because they make iteration very explicit, so it's obvious what's happening. However, for loops are quite verbose, and include quite a bit of book-keeping code, that is duplicated for every for loop. Functional programming (FP) offers tools to extract out this duplicated code, so each common for loop pattern gets its own function. Once you master the vocabulary of FP, you can solve many common iteration problems with less code, more ease, and fewer errors.
 
 Some people will tell you to avoid for loops because they are slow. They're wrong! (Well at least they're rather out of date, for loops haven't been slow for many years). The chief benefits of using FP functions like `lapply()` or `purrr::map()` is that they are more expressive and make code both easier to write and easier to read.
 
From 6cf882b276a79be1957b4c6b1c6eee783df0c2ed Mon Sep 17 00:00:00 2001
From: kdpsingh
Date: Fri, 20 May 2016 02:19:51 -0400
Subject: [PATCH 02/11] Minor grammatical corrections.

---
 iteration.Rmd | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/iteration.Rmd b/iteration.Rmd
index 92609a2..4f31705 100644
--- a/iteration.Rmd
+++ b/iteration.Rmd
@@ -21,7 +21,7 @@ One part of reducing duplication is writing functions. Functions allow you to id
 
 In this chapter you'll learn about two important iteration paradigms: imperative programming and functional programming, and the machinary each provides. On the imperative side you have things like for loops and while loops, which are a great place to start because they make iteration very explicit, so it's obvious what's happening. However, for loops are quite verbose, and include quite a bit of book-keeping code, that is duplicated for every for loop. Functional programming (FP) offers tools to extract out this duplicated code, so each common for loop pattern gets its own function.
Once you master the vocabulary of FP, you can solve many common iteration problems with less code, more ease, and fewer errors.
 
-Some people will tell you to avoid for loops because they are slow. They're wrong! (Well at least they're rather out of date, for loops haven't been slow for many years). The chief benefits of using FP functions like `lapply()` or `purrr::map()` is that they are more expressive and make code both easier to write and easier to read.
+Some people will tell you to avoid for loops because they are slow. They're wrong! (Well at least they're rather out of date, as for loops haven't been slow for many years). The chief benefits of using FP functions like `lapply()` or `purrr::map()` is that they are more expressive and make code both easier to write and easier to read.
 
 In later chapters you'll learn how to apply these iterating ideas when modelling. You can often use multiple simple models to help understand a complex dataset, or you might have multiple models because you're bootstrapping or cross-validating. The techniques you'll learn in this chapter will be invaluable.
 
@@ -248,7 +248,7 @@ for (i in seq_along(x)) {
 
 ### Unknown output length
 
-Sometimes you might know now how long the output will be. For example, imagine you want to simulate some random vectors of random lengths. You might be tempted to solve this problem by progressively growing the vector:
+Sometimes you might not know how long the output will be. For example, imagine you want to simulate some random vectors of random lengths. You might be tempted to solve this problem by progressively growing the vector:
 
 ```{r}
 means <- c(0, 1, 2)
From a43fb95cc429737d81438c78afdc02e6f8931511 Mon Sep 17 00:00:00 2001
From: kdpsingh
Date: Fri, 20 May 2016 02:30:10 -0400
Subject: [PATCH 03/11] Minor grammar fix -- removed "type of"

There was a missing word after "type of" -- potentially you meant "type of assignment." I removed "type of" to make it more concise.
---
 iteration.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/iteration.Rmd b/iteration.Rmd
index 92609a2..afd6eb7 100644
--- a/iteration.Rmd
+++ b/iteration.Rmd
@@ -261,7 +261,7 @@ for (i in seq_along(means)) {
 str(output)
 ```
 
-But this type of is not very efficient because in each iteration, R has to copy all the data from the previous iterations. In technical terms you get "quadratic" ($O(n^2)$) behaviour which means that a loop with three times as many elements would take nine times ($3^2$) as long to run.
+But this is not very efficient because in each iteration, R has to copy all the data from the previous iterations. In technical terms you get "quadratic" ($O(n^2)$) behaviour which means that a loop with three times as many elements would take nine times ($3^2$) as long to run.
 
 A better solution to save the results in a list, and then combine into a single vector after the loop is done:
 
From 5b046924ef11fcfab32d638a19d757682e670be2 Mon Sep 17 00:00:00 2001
From: Julia Stewart Lowndes
Date: Sun, 22 May 2016 11:45:14 -0700
Subject: [PATCH 04/11] fix 2 tiny typos

---
 visualize.Rmd | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/visualize.Rmd b/visualize.Rmd
index f8c2309..d2004ca 100644
--- a/visualize.Rmd
+++ b/visualize.Rmd
@@ -493,7 +493,7 @@ You can learn which stat a geom uses, as well as what variables it computes by v
 
 Stats are the most subtle part of plotting because you do not see them in action. `ggplot2` applies the transformation and stores the results behind the scenes. You only see the finished plot. Moreover, `ggplot2` applies stats automatically, with a very intuitive set of defaults. As a result, you rarely need to adjust a geom's stat. However, you can do three things with a geom's stat if you wish to.
 
-First, you can change the stat that the geom uses with the geom's stat argument. In the code below, I change the stat of `geom_bar()` from count (the default) to identity.
This let's me map the height of the bars to the raw values of a $y$ variable.
+First, you can change the stat that the geom uses with the geom's stat argument. In the code below, I change the stat of `geom_bar()` from count (the default) to identity. This lets me map the height of the bars to the raw values of a $y$ variable.
 
 ```{r}
 demo <- data.frame(
@@ -507,7 +507,7 @@ ggplot(data = demo) +
 demo
 ```
 
-I provide a list of the stats that are availalbe to use in ggplot2 at the end of this section. Be careful when you change a geom's stat. Many combinations of geoms and stats will create incompatible results. In practice, you will almost always use a geom's default stat.
+I provide a list of the stats that are available to use in ggplot2 at the end of this section. Be careful when you change a geom's stat. Many combinations of geoms and stats will create incompatible results. In practice, you will almost always use a geom's default stat.
 
 Second, you can give some stats arguments by passing the arguments to your geom function. In the code below, I pass a width argument to the count stat, which controls the widths of the bars. `width = 1` will make the bars wide enough to touch each other.
 
From 16e948f487f6258d0d9d5d7f49543daaacfd7cfb Mon Sep 17 00:00:00 2001
From: jjchern
Date: Mon, 23 May 2016 18:25:38 -0500
Subject: [PATCH 05/11] Fix a type

---
 wrangle.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/wrangle.Rmd b/wrangle.Rmd
index 4df5bed..732fb73 100644
--- a/wrangle.Rmd
+++ b/wrangle.Rmd
@@ -4,7 +4,7 @@
 
 With data, the relationships between values matter as much as the values themselves. Tidy data encodes those relationships.
 
-Throughout this book we work with "tibbles" instead of the traditional data frame. Tibbles _are_ data frames but they encode some patterns that make modern usage of R better. Unfortunately R is an old language, and things that made sense 10 or 20 years a go are no longer as valid.
It's difficult to change base R without breaking existing code, so most innovation occurs in packages, providing new functions that you should use instead of the old ones.
+Throughout this book we work with "tibbles" instead of the traditional data frame. Tibbles _are_ data frames but they encode some patterns that make modern usage of R better. Unfortunately R is an old language, and things that made sense 10 or 20 years ago are no longer as valid. It's difficult to change base R without breaking existing code, so most innovation occurs in packages, providing new functions that you should use instead of the old ones.
 
 ```{r}
 library(tibble)
From 74c27cfd2a7539316bcbb9ab80306716b4181580 Mon Sep 17 00:00:00 2001
From: jjchern
Date: Mon, 23 May 2016 18:30:53 -0500
Subject: [PATCH 06/11] "vs" should be "vs."

---
 wrangle.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/wrangle.Rmd b/wrangle.Rmd
index 732fb73..2327a5f 100644
--- a/wrangle.Rmd
+++ b/wrangle.Rmd
@@ -38,7 +38,7 @@ frame_data(
 )
 ```
 
-## Tibbles vs data frames
+## Tibbles vs. data frames
 
 There are two main differences in the usage of a data frame vs a tibble: printing, and subsetting.
 
From a43a2e82dc598f158b4857bc26e3abd0874934d4 Mon Sep 17 00:00:00 2001
From: jjchern
Date: Mon, 23 May 2016 18:33:30 -0500
Subject: [PATCH 07/11] "vs" should be "vs."

---
 transform.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/transform.Rmd b/transform.Rmd
index 2441de5..d77fbbc 100644
--- a/transform.Rmd
+++ b/transform.Rmd
@@ -681,7 +681,7 @@ ggplot(delays, aes(n, delay)) +
   geom_point()
 ```
 
-Not suprisingly, there is much more variation in the average delay when there are few flights. The shape of this plot is very characteristic: whenever you plot a mean (or many other summaries) vs number of observations, you'll see that the variation decreases as the sample size increases.
+Not suprisingly, there is much more variation in the average delay when there are few flights.
The shape of this plot is very characteristic: whenever you plot a mean (or many other summaries) vs. number of observations, you'll see that the variation decreases as the sample size increases.
 
 When looking at this sort of plot, it's often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups. This is what the following code does, and also shows you a handy pattern for integrating ggplot2 into dplyr flows. It's a bit painful that you have to switch from `%>%` to `+`, but once you get the hang of it, it's quite convenient.
 
From 0eb06e7b7475708977780c02f6f094fcf381bc2d Mon Sep 17 00:00:00 2001
From: jjchern
Date: Mon, 23 May 2016 18:35:41 -0500
Subject: [PATCH 08/11] "vs" should be "vs."

---
 iteration.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/iteration.Rmd b/iteration.Rmd
index 92609a2..a9eb8c1 100644
--- a/iteration.Rmd
+++ b/iteration.Rmd
@@ -375,7 +375,7 @@ I mention while loops briefly, because I hardly ever use them. They're most ofte
 }
 ```
 
-## For loops vs functionals
+## For loops vs. functionals
 
 For loops are not as important in R as they are in other languages because R is a functional programming language. This means that it's possible to wrap up for loops in a function, and call that function instead of using the for loop directly.
 
From ffdf38d2bf06005180ee3526e0e83af76df0c5dc Mon Sep 17 00:00:00 2001
From: jjchern
Date: Mon, 23 May 2016 18:37:05 -0500
Subject: [PATCH 09/11] "vs" should be "vs."

---
 strings.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/strings.Rmd b/strings.Rmd
index 33443a4..fd8c60a 100644
--- a/strings.Rmd
+++ b/strings.Rmd
@@ -815,7 +815,7 @@ There are a few other functions in base R that accept regular expressions:
 
 stringr is built on top of the __stringi__ package.
stringr is useful when you're learning because it exposes a minimal set of functions, that have been carefully picked to handle the most common string manipulation functions. stringi on the other hand is designed to be comprehensive. It contains almost every function you might ever need. stringi has `r length(ls(getNamespace("stringi")))` functions to stringr's `r length(ls("package:stringr"))`.
 
-So if you find yourself struggling to do something that doesn't seem natural in stringr, it's worth taking a look at stringi. The use of the two packages is very similar because stringr was designed to mimic stringi's interface. The main difference is the prefix: `str_` vs `stri_`.
+So if you find yourself struggling to do something that doesn't seem natural in stringr, it's worth taking a look at stringi. The use of the two packages is very similar because stringr was designed to mimic stringi's interface. The main difference is the prefix: `str_` vs. `stri_`.
 
 ### Encoding
 
From a9eb6566db8a9a7b56e5d11b8a46a454db71bbc7 Mon Sep 17 00:00:00 2001
From: hadley
Date: Thu, 26 May 2016 09:43:22 -0500
Subject: [PATCH 10/11] Minor tweaks

---
 model-vis.Rmd | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/model-vis.Rmd b/model-vis.Rmd
index 759a7e8..7e83d59 100644
--- a/model-vis.Rmd
+++ b/model-vis.Rmd
@@ -321,6 +321,8 @@ grid %>%
 
 ## Generating prediction grids
 
+Now that you're learned the basics of generating prediction grids with `expand()`, we need to go into a few more details to cover other types of data you might come across. In each of the following sections, I'll explore in more detail one type of data along with the expansion and visualisation techniques you'll need to understand it.
+
 ### Continuous variables
 
 When you have a continuous variable in the model, rather than using the unique values that you've seen, it's often more useful to generate an evenly spaced grid.
One convenient way to do this is with `modelr::seq_range()` which takes a continuous variable, calculates its range, and then generates an evenly spaced points between the minimum and maximum.
@@ -507,6 +509,8 @@ To help avoid this problem, it's good practice to include "nearby" observed data
 
 One way to do this is to use `condvis::visualweight()`.
 
+
+
 ### Exercises
 
 1. In the use of `rlm` with `poly()`, the model didn't converge. Carefully
@@ -545,3 +549,4 @@ delays %>%
   geom_smooth(se = F)
 ```
 
+
From add27b771f7f786f647615d24c87f89d7721e8ae Mon Sep 17 00:00:00 2001
From: hadley
Date: Thu, 26 May 2016 09:52:02 -0500
Subject: [PATCH 11/11] Install modelr from github

---
 DESCRIPTION | 1 +
 1 file changed, 1 insertion(+)

diff --git a/DESCRIPTION b/DESCRIPTION
index cc6964e..d4321ea 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -35,6 +35,7 @@ Imports:
 Remotes:
     gaborcsardi/rcorpora,
     garrettgman/DSR,
+    hadley/modelr,
     hadley/purrr,
     hadley/stringr,
     hadley/ggplot2,