From b234d4fd730c2ed401e69ad19663fc2e5885b2c5 Mon Sep 17 00:00:00 2001 From: Kenny Darrell Date: Fri, 25 Mar 2016 10:07:24 -0400 Subject: [PATCH 01/14] Update iteration.Rmd the loops vs FP sounded like apples to oranges, moved one layer up in abstraction --- iteration.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/iteration.Rmd b/iteration.Rmd index 95e7b94..78438dc 100644 --- a/iteration.Rmd +++ b/iteration.Rmd @@ -17,9 +17,9 @@ In [functions], we talked about how important it is to reduce duplication in you 1. You're likely to have fewer bugs because each line of code is used in more places. -One part of reducing duplication is writing functions. Functions allow you to identify repeated patterns of code and extract them out in to indepdent pieces that you can reuse and easily update as code changes. Iteration helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets. (Generally, you won't need to use explicit iteration to deal with different subsets of your data: in most cases the implicit iteration in dplyr will take care of that problem for you.) +One part of reducing duplication is writing functions. Functions allow you to identify repeated patterns of code and extract them out into independent pieces that you can reuse and easily update as code changes. Iteration helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets. (Generally, you won't need to use explicit iteration to deal with different subsets of your data: in most cases the implicit iteration in dplyr will take care of that problem for you.) -In this chapter you'll learn about two important iteration tools: for loops and functional programming. For loops are a great place to start because they make iteration very explicit, so it's obvious what's happening. 
However, for loops are quite verbose, and include quite a bit of book-keeping code, that is duplicated for every for loop. Functional programming (FP) offers tools to extract out this duplicated code, so each common for loop pattern gets its own function. Once you master the vocabulary of FP, you can solve many common iteration problems with less code, more ease, and fewer errors. +In this chapter you'll learn about two important iteration paradigms: imperative programming and functional programming, and the machinery each provides. On the imperative side you have things like for loops and while loops, which are a great place to start because they make iteration very explicit, so it's obvious what's happening. However, for loops are quite verbose, and include quite a bit of book-keeping code, that is duplicated for every for loop. Functional programming (FP) offers tools to extract out this duplicated code, so each common for loop pattern gets its own function. Once you master the vocabulary of FP, you can solve many common iteration problems with less code, more ease, and fewer errors. Some people will tell you to avoid for loops because they are slow. They're wrong! (Well at least they're rather out of date, for loops haven't been slow for many years). The chief benefits of using FP functions like `lapply()` or `purrr::map()` are that they are more expressive and make code both easier to write and easier to read. From 4efd7e5da65f56adf8f7d5023f6bce8144bece11 Mon Sep 17 00:00:00 2001 From: Jakub Nowosad Date: Fri, 25 Mar 2016 15:42:33 +0100 Subject: [PATCH 02/14] link fixed --- functions.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/functions.Rmd b/functions.Rmd index bfe5e4f..686685d 100644 --- a/functions.Rmd +++ b/functions.Rmd @@ -106,7 +106,7 @@ df$c <- rescale01(df$c) df$d <- rescale01(df$d) ``` -Compared to the original, this code is easier to understand and we've eliminated one class of copy-and-paste errors. 
There is still quite a bit of duplication since we're doing the same thing to multiple columns. We'll learn how to eliminate that duplication in [iteration], once you've learn more about R's data structures in [data_structures]. +Compared to the original, this code is easier to understand and we've eliminated one class of copy-and-paste errors. There is still quite a bit of duplication since we're doing the same thing to multiple columns. We'll learn how to eliminate that duplication in [iteration], once you've learned more about R's data structures in [data-structures]. Another advantage of functions is that if our requirements change, we only need to make the change in one place. For example, we might discover that some of our variables include infinite values, and `rescale01()` fails: From 0c40bb3faaf3360b4bb82de641ac12f71b659a34 Mon Sep 17 00:00:00 2001 From: kdpsingh Date: Fri, 25 Mar 2016 17:30:28 -0400 Subject: [PATCH 03/14] changed "formally" to "formerly" --- index.rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/index.rmd b/index.rmd index b938e20..2b85b13 100644 --- a/index.rmd +++ b/index.rmd @@ -7,7 +7,7 @@ output: # Welcome -This is the book site for __"R for data science"__. This book will teach you how to do data science with R: You'll learn how to get your data into R, get it into the most useful structure, transform it, visualise it and model it. In this book, you will find a practicum of skills for data science. Just as a chemist learns how to clean test tubes and stock a lab, you'll learn how to clean data and draw plots---and many other things besides. These are the skills that allow data science to happen, and here you will find the best practices for doing each of these things with R. You'll learn how to use the grammar of graphics, literate programming, and reproducible research to save time. You'll also learn how to manage cognitive resources to facilitate discoveries when wrangling, visualizing, and exploring data. 
(__R for Data Science__ was formally called __Data Science with R__ in __Hands-On Programming with R__) +This is the book site for __"R for data science"__. This book will teach you how to do data science with R: You'll learn how to get your data into R, get it into the most useful structure, transform it, visualise it and model it. In this book, you will find a practicum of skills for data science. Just as a chemist learns how to clean test tubes and stock a lab, you'll learn how to clean data and draw plots---and many other things besides. These are the skills that allow data science to happen, and here you will find the best practices for doing each of these things with R. You'll learn how to use the grammar of graphics, literate programming, and reproducible research to save time. You'll also learn how to manage cognitive resources to facilitate discoveries when wrangling, visualizing, and exploring data. (__R for Data Science__ was formerly called __Data Science with R__ in __Hands-On Programming with R__) To be published by O'Reilly in July 2016. From 2e024c9722d575ad8dd71f30678617348496c83b Mon Sep 17 00:00:00 2001 From: Earl Brown Date: Fri, 25 Mar 2016 23:43:16 -0500 Subject: [PATCH 04/14] Update intro.Rmd rogue "in" --- intro.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/intro.Rmd b/intro.Rmd index 016ca13..2251796 100644 --- a/intro.Rmd +++ b/intro.Rmd @@ -4,7 +4,7 @@ install.packages <- function(...) invisible() ``` -Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge. The goal of "R for Data Science" is to introduce you to the most important in R tools that you need to do data science. After reading this book, you'll have the tools to tackle a wide variety of data science challenges, using the best parts of R. +Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge. 
The goal of "R for Data Science" is to introduce you to the most important R tools that you need to do data science. After reading this book, you'll have the tools to tackle a wide variety of data science challenges, using the best parts of R. ## What you will learn From d8f3bbddb457bd256c0cff2093156af6916c7547 Mon Sep 17 00:00:00 2001 From: MJMarshall Date: Sat, 26 Mar 2016 10:07:13 +0000 Subject: [PATCH 05/14] Update iteration.Rmd Correcting typo in exercise code --- iteration.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/iteration.Rmd b/iteration.Rmd index 78438dc..36288c1 100644 --- a/iteration.Rmd +++ b/iteration.Rmd @@ -130,7 +130,7 @@ That's all there is to the for loop! Now is a good time to practice creating som x <- sample(100) sd <- 0 - for (i in seq_along(out)) { + for (i in seq_along(x)) { sd <- sd + (x[i] - mean(x)) ^ 2 } sd <- sqrt(sd) / (length(x) - 1) From a0fb7283b05b9afabe22b06071b44be18a972d27 Mon Sep 17 00:00:00 2001 From: OaCantona Date: Sat, 26 Mar 2016 11:40:27 +0100 Subject: [PATCH 06/14] Difference in code --- iteration.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/iteration.Rmd b/iteration.Rmd index 78438dc..c0ddfc8 100644 --- a/iteration.Rmd +++ b/iteration.Rmd @@ -100,7 +100,7 @@ Every for loop has three components: it's easy to create them accidentally. If you use `1:length(x)` instead of `seq_along(x)`, you're likely to get a confusing error message. -1. The __body__: `output[i] <- median(df[[i]])`. This is the code that does +1. The __body__: `output[[i]] <- median(df[[i]])`. This is the code that does the work. It's run repeatedly, each time with a different value for `i`. The first iteration will run `output[[1]] <- median(df[[1]])`, the second will run `output[[2]] <- median(df[[2]])`, and so on. From 7d95513e74eeaaa4207be88619be18023807fe94 Mon Sep 17 00:00:00 2001 From: Ian Sealy Date: Tue, 29 Mar 2016 00:14:00 +0100 Subject: [PATCH 07/14] Minor typos/suggestions in Introduction. 
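[Editorial sketch: the two iteration fixes above (`seq_along(x)` in the sequence, `output[[i]] <- median(df[[i]])` in the body) both touch the canonical for loop pattern the chapter describes. Putting the three components together on an invented data frame (the data here is made up, not from the book):]

```r
# Illustrative only: a made-up data frame standing in for `df`.
df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10))

output <- vector("double", ncol(df))  # the output, allocated before the loop
for (i in seq_along(df)) {            # the sequence: seq_along(), not 1:length()
  output[[i]] <- median(df[[i]])      # the body, run once per column
}
output
```

Note that `seq_along(df)` handles the zero-column edge case gracefully, which is exactly why the patches above prefer it to `1:length(x)`.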
--- intro.Rmd | 30 +++++++++++++++--------------- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/intro.Rmd b/intro.Rmd index 2251796..fba85c3 100644 --- a/intro.Rmd +++ b/intro.Rmd @@ -24,24 +24,24 @@ There are two main engines of knowledge generation: visualisation and modelling. __Visualisation__ is a fundamentally human activity. A good visualisation will show you things that you did not expect, or raise new questions of the data. A good visualisation might also hint that you're asking the wrong question and you need to refine your thinking. In short, visualisations can surprise you. However, visualisations don't scale particularly well. -__Models__ are the complementary tools to visualisation. Models are a fundamentally mathematical or computation tool, so they generally scale well. Even when they don't, it's usually cheaper to buy more computers than it is to buy more brains. But every model makes assumptions, and by its very nature a model can not question its own assumptions. That means a model can not fundamentally surprise you. +__Models__ are the complementary tools to visualisation. Models are a fundamentally mathematical or computational tool, so they generally scale well. Even when they don't, it's usually cheaper to buy more computers than it is to buy more brains. But every model makes assumptions, and by its very nature a model cannot question its own assumptions. That means a model cannot fundamentally surprise you. The last step of data science is __communication__, an absolutely critical part of any data analysis project. It doesn't matter how well models and visualisation have led you to understand the data, unless you can communicate your results to other people. There's one important toolset that's not shown in the diagram: programming. Programming is a cross-cutting tool that you use in every part of the project. You don't need to be an expert programmer to be a data scientist, but learning more about programming pays off. 
Becoming a better programmer will allow you automate common tasks, and solve new problems with greater ease. +There's one important toolset that's not shown in the diagram: programming. Programming is a cross-cutting tool that you use in every part of the project. You don't need to be an expert programmer to be a data scientist, but learning more about programming pays off. Becoming a better programmer will allow you to automate common tasks, and solve new problems with greater ease. You'll use these tools in every data science project, but for most projects they're not enough. There's a rough 80-20 rule at play: you can probably tackle 80% of every project using the tools that we'll teach you, but you'll need more to tackle the remaining 20%. Throughout this book we'll point you to resources where you can learn more. ## How you will learn -The above description of the tools of data science was organised roughly around the order in which you use them in analysis (although of course you'll iterate through them multiple times). In our experience, however, this is not the best way to learn them: +The above description of the tools of data science is organised roughly around the order in which you use them in analysis (although of course you'll iterate through them multiple times). In our experience, however, this is not the best way to learn them: * Starting with data ingest and tidying is sub-optimal because 80% of the time it's routine and boring, and the other 20% of the time it's horrendously frustrating. Instead, we'll start with visualisation and transformation on data that's already been imported and tidied. That way, when you ingest and tidy your own data, you'll be able to keep your motivation high because - you know the pain is worth it because of what you can accomplish once its + you know the pain is worth it because of what you can accomplish once it's done. * Some topics are best explained with other tools. 
For example, we believe that @@ -58,15 +58,15 @@ Within each chapter, we try and stick to a similar pattern: start with some moti ## What you won't learn -There are some important topics that this book doesn't cover. We believe it's important to stay ruthlessly focussed on the essentials so you can get up and running as quickly as possible. That means this book can't cover every important topic. +There are some important topics that this book doesn't cover. We believe it's important to stay ruthlessly focused on the essentials so you can get up and running as quickly as possible. That means this book can't cover every important topic. ### Big data -This book proudly focusses on small, in-memory datasets. This is the right place to start because you can't tackle big data unless you have experience with small data. The tools you learn in this book will easily handle hundreds of megabytes of data, and with a little care you can typically use them to work with 1-2 Gb of data. If you're routinely working larger data (10-100 Gb, say), you should learn more about [data.table](https://github.com/Rdatatable/data.table). We don't teach data.table here because it has a very concise interface that is harder to learn because it offers fewer linguistic cues. But if you're working with large data, the performance payoff is worth a little extra effort to learn it. +This book proudly focuses on small, in-memory datasets. This is the right place to start because you can't tackle big data unless you have experience with small data. The tools you learn in this book will easily handle hundreds of megabytes of data, and with a little care you can typically use them to work with 1-2 Gb of data. If you're routinely working with larger data (10-100 Gb, say), you should learn more about [data.table](https://github.com/Rdatatable/data.table). We don't teach data.table here because it has a very concise interface that is harder to learn because it offers fewer linguistic cues. 
But if you're working with large data, the performance payoff is worth a little extra effort to learn it. Many big data problems are often small data problems in disguise. Often your complete dataset is big, but the data needed to answer a specific question is small. It's often possible to find a subset, subsample, or summary that fits in memory and still allows you to answer the question that you're interested in. The challenge here is finding the right small data, which often requires a lot of iteration. We'll touch on this idea in [transform](#transform). -Another class of big data problem consists of many small data problems. Each individual problem might fit in memory, but you have millions of them. For example, you might want to fit a model to each person in your dataset. That would be trivial if you had just 10 or 100 people, but instead you have a million. Fortunately each problem is independent (sometimes called embarassingly parallel), so you just need a system (like hadoop) that allows you to send different datasets to different computers for processing. Once you've figured out to how answer the question for a single subset using the tools described in this book, you can use packages like SparkR, rhipe, and ddr to solve it for the complete dataset. +Another class of big data problem consists of many small data problems. Each individual problem might fit in memory, but you have millions of them. For example, you might want to fit a model to each person in your dataset. That would be trivial if you had just 10 or 100 people, but instead you have a million. Fortunately each problem is independent (sometimes called embarrassingly parallel), so you just need a system (like Hadoop) that allows you to send different datasets to different computers for processing. Once you've figured out how to answer the question for a single subset using the tools described in this book, you can use packages like SparkR, rhipe, and ddr to solve it for the complete dataset. 
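[Editorial sketch: the "many small data problems" pattern described above can be prototyped locally before reaching for Hadoop. A sketch with invented data (the `person`, `x`, and `y` columns are hypothetical, not from the book) showing one independent model fit per person:]

```r
# Hypothetical data: one small, independent modelling problem per person.
df <- data.frame(
  person = rep(c("a", "b", "c"), each = 20),
  x = rnorm(60),
  y = rnorm(60)
)

# Fit one model per person. Because each fit is independent
# (embarrassingly parallel), this same split/fit structure is what
# systems like SparkR, rhipe, or ddr distribute across machines.
models <- lapply(split(df, df$person), function(d) lm(y ~ x, data = d))
names(models)
```

With a million people instead of three, only the execution engine changes, not the shape of the code.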
### Python @@ -80,7 +80,7 @@ This book focuses exclusively on structured data sets: collections of values tha ### Formal Statistics and Machine Learning -This book focusses on practical tools for understanding your data: visualization, modelling, and transformation. You can develop your understanding further by learning probability theory, statistical hypothesis testing, and machine learning methods; but we won't teach you those things here. There are many books that cover these topics, but few that integrate the other parts of the data science process. When you are ready, you can and should read books devoted to each of these topics. We recommend *Statistical Modeling: A Fresh Approach* by Danny Kaplan; *An Introduction to Statistical Learning* by James, Witten, Hastie, and Tibshirani; and *Applied Predictive Modeling* by Kuhn and Johnson. +This book focuses on practical tools for understanding your data: visualization, modelling, and transformation. You can develop your understanding further by learning probability theory, statistical hypothesis testing, and machine learning methods; but we won't teach you those things here. There are many books that cover these topics, but few that integrate the other parts of the data science process. When you are ready, you can and should read books devoted to each of these topics. We recommend *Statistical Modeling: A Fresh Approach* by Danny Kaplan; *An Introduction to Statistical Learning* by James, Witten, Hastie, and Tibshirani; and *Applied Predictive Modeling* by Kuhn and Johnson. ## Prerequisites @@ -88,7 +88,7 @@ We've made few assumptions about what you already know in order to get the most To run the code in this book, you will need to install both R and the RStudio IDE, an application that makes R easier to use. Both are open source, free and easy to install: -1. Download R and install R, . +1. Download and install R, . 1. Download and install RStudio, . 1. Install needed packages (see below). 
@@ -104,7 +104,7 @@ You run R code in the __console__ pane. Textual output appears inline, and graph There are three keyboard shortcuts for the RStudio IDE that we strongly encourage that you learn because they'll save you so much time: -* Cmd + Enter: sends current line (or current selection) from the editor to +* Cmd + Enter: sends the current line (or current selection) from the editor to the console and runs it. (Ctrl + Enter on a PC) * Tab: suggest possible completions for the text you've typed. @@ -120,7 +120,7 @@ We strongly recommend making two changes to the default RStudio options: knitr::include_graphics("screenshots/rstudio-workspace.png") ``` -This ensures that every time you restart RStudio you get a completely clean slate. This is good pratice because it encourages you to capture all important interactions in your code. There's nothing worse than discovering three months after the fact that you've only stored the results of important calculation in your workspace, not the calculation itself in your code. During a project, it's good practice to regularly restart R either using the menu Session | Restart R or the keyboard shortcut Cmd + Shift + F10. +This ensures that every time you restart RStudio you get a completely clean slate. This is good practice because it encourages you to capture all important interactions in your code. There's nothing worse than discovering three months after the fact that you've only stored the results of an important calculation in your workspace, not the calculation itself in your code. During a project, it's good practice to regularly restart R either using the menu Session | Restart R or the keyboard shortcut Cmd + Shift + F10. ### R packages @@ -149,15 +149,15 @@ You will need to reload the package every time you start a new R session. * Google. Always a great place to start! Adding "R" to a query is usually enough to filter it down. 
If you ever hit an error message that you - don't know how to handle, it is a great idea to google it. + don't know how to handle, it is a great idea to Google it. If your operating system defaults to another language, you can use - `Sys.setenv(LANGUAGE = "en")` to tell R to use english. That's likely to + `Sys.setenv(LANGUAGE = "en")` to tell R to use English. That's likely to get you to common solutions more quickly. -* StackOverflow. Be sure to read and use [How to make a reproducible example](http://adv-r.had.co.nz/Reproducibility.html)([reprex](https://github.com/jennybc/reprex)) before posting. Unfortunately the R stackoverflow community is not always the friendliest. +* Stack Overflow. Be sure to read and use [How to make a reproducible example](http://adv-r.had.co.nz/Reproducibility.html) ([reprex](https://github.com/jennybc/reprex)) before posting. Unfortunately the R Stack Overflow community is not always the friendliest. -* Twitter. #rstats hashtag is very welcoming. Great way to keep up with +* Twitter. The #rstats hashtag is very welcoming and is a great way to keep up with what's happening in the community. ## Acknowledgements From 762b8e685ee75cf2804be6c048eb41f1f2a3fd8b Mon Sep 17 00:00:00 2001 From: MJMarshall Date: Tue, 29 Mar 2016 10:32:25 +0100 Subject: [PATCH 08/14] Update iteration.Rmd correcting for/while typo --- iteration.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/iteration.Rmd b/iteration.Rmd index bd7e004..f8dce4f 100644 --- a/iteration.Rmd +++ b/iteration.Rmd @@ -335,7 +335,7 @@ while (nheads < 3) { flips ``` -I mention for loops briefly, because I hardly ever use them. They're most often used for simulation, which is outside the scope of this book. 
However, it is good to know they exist in case you encounter a problem where the number of iterations is not known in advance. ### Exercises From 44d1fdcf79e0c07687adb0f76778c8bc66dea803 Mon Sep 17 00:00:00 2001 From: Garrett Date: Thu, 31 Mar 2016 15:52:36 -0400 Subject: [PATCH 09/14] small changes to model.Rmd --- model.Rmd | 41 ++++++++++++++++++++++++++++++++++------- 1 file changed, 34 insertions(+), 7 deletions(-) diff --git a/model.Rmd b/model.Rmd index 32cf91c..888ea34 100644 --- a/model.Rmd +++ b/model.Rmd @@ -3,7 +3,7 @@ layout: default title: Model --- -A model is a function that summarizes how the values of one variable vary in response to the values of other variables. Models play a large role in hypothesis testing and prediction, but for the moment you should think of models just like you think of statistics. A statistic summarizes a *distribution* in a way that is easy to understand; and a model summarizes *covariation* in a way that is easy to understand. In other words, a model is just another way to describe data. +A model is a function that summarizes how the values of one variable vary in relation to the values of other variables. Models play a large role in hypothesis testing and prediction, but for the moment you should think of models just like you think of statistics. A statistic summarizes a *distribution* in a way that is easy to understand; and a model summarizes *covariation* in a way that is easy to understand. In other words, a model is just another way to describe data. This chapter will explain how to build useful models with R. 
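[Editorial sketch: the statistic-versus-model distinction drawn above can be made concrete with a few lines of R on simulated data (purely illustrative, not the NLS data the chapter uses):]

```r
# Simulated data, purely illustrative.
x <- rnorm(100)
y <- 2 * x + rnorm(100)

mean(y)          # a statistic: summarizes the distribution of y
coef(lm(y ~ x))  # a model: summarizes how y varies in relation to x
```

The statistic describes `y` on its own; the model's coefficients describe the covariation between `y` and `x`, which is the sense in which a model is "just another way to describe data".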
To access the functions and data sets that we will use in the chapter, load the `ggplot2`, `dplyr`, `mgcv`, `splines`, and `broom` packages: -```{r} +```{r message = FALSE} # install.packages("") library(ggplot2) library(dplyr) @@ -34,9 +34,9 @@ library(broom) ## Linear models -Have you heard that a relationship exists between your height and your income? It sounds far-fetched---and maybe it is---but many people believe that taller people will be promoted faster and valued more for their work, an effect that directly inflates the income of the vertically gifted. Do you think this is true? +Have you heard that a relationship exists between your height and your income? It sounds far-fetched---and maybe it is---but many people believe that taller people will be promoted faster and valued more for their work, an effect that increases their income. Could this be true? -Luckily, it is easy to measure someone's height, as well as their income (and a swath of other variables besides), which means that we can collect data relevant to the question. In fact, the Bureau of Labor Statistics has been doing this in a controlled way for over 50 years. The BLS [National Longitudinal Surveys (NLS)](https://www.nlsinfo.org/) track the income, education, and life circumstances of a large cohort of Americans across several decades. +Luckily, it is easy to measure a person's height, as well as their income (and a swath of other related variables), which means that we can collect data relevant to the question. In fact, the Bureau of Labor Statistics has been doing this in a controlled way for over 50 years. The BLS [National Longitudinal Surveys (NLS)](https://www.nlsinfo.org/) track the income, education, and life circumstances of a large cohort of Americans across several decades. 
In case you are wondering, the point of the NLS is not to study the relationship between height and income, that's just a lucky accident. You can load the latest cross-section of NLS data, collected in 2013 with the code below. @@ -57,7 +57,6 @@ I've narrowed the data down to 10 variables: * `sat_math` - Each subject's score on the math portion of the Scholastic Aptitude Test (SAT), out of 800. * `bdate` - Month of birth with 1 = January. - ```{r} head(heights) ``` @@ -69,11 +68,39 @@ ggplot(data = heights, mapping = aes(x = height, y = income)) + geom_point() ``` -First, let's address a distraction: the data is censored in an odd way. The y variable is income, which means that there are no y values less than zero. That's not odd. However, there are also no y values above $180,331. In fact, there are a line of unusual values at exactly $180,331. This is because the Burea of Labor Statistics removed the top 2% of income values and replaced them with the mean value of the top 2% of values, an action that was not designed to enhance the usefulness of the data for data science. +First, let's address a distraction: the data is censored in an odd way. The y variable is income, which means that there are no y values less than zero. That's not odd. However, there are also no y values above $180,331. In fact, there is a line of unusual values at exactly $180,331. This is because the Bureau of Labor Statistics removed the top 2% of income values and replaced them with the mean value of the top 2% of values, an action that was not designed to enhance the usefulness of the data. Also, you can see that heights have been rounded to the nearest inch. -Second, the relationship is not very strong. +Setting those concerns aside, we can measure the correlation between height and income with R's `cor()` function. Correlation, $r$ from statistics, measures how strongly the values of two variables are related. 
The sign of the correlation describes whether the variables have a positive or negative relationship. The magnitude of the correlation describes how strongly the values of one variable determine the values of the second. A correlation of 1 or -1 implies that the value of one variable completely determines the value of the second variable.

```{r echo = FALSE, cache=TRUE}
x1 <- rnorm(100)
y1 <- .5 * x1 + rnorm(100, sd = .5)
y2 <- -.5 * x1 + rnorm(100, sd = .5)

cordat <- data.frame(x = rep(x1, 5),
                     y = c(-x1, y2, rnorm(100), y1, x1),
                     cor = rep(1:5, each = 100))

cordat$cor <- factor(cordat$cor, levels = 1:5,
                     labels = c("Correlation = -1.0",
                                "Correlation = -0.5",
                                "Correlation = 0",
                                "Correlation = 0.5",
                                "Correlation = 1.0"))

ggplot(cordat, aes(x = x, y = y)) +
  geom_point() +
  facet_grid(. ~ cor) +
  coord_fixed()
```

These plots illustrate the strength of the relationship between two variables: if the values of the variables fall on a straight line with positive slope (i.e. the value of one variable completely determines the value of the other), the correlation is 1.

The correlation suggests that heights may have a small effect on income. ```{r} cor(heights$height, heights$income, use = "na") From 55ab16f27f463bbb463aa50e63e7037a77810a41 Mon Sep 17 00:00:00 2001 From: Garrett Date: Thu, 31 Mar 2016 17:01:22 -0400 Subject: [PATCH 10/14] Adds Lahman, tibble and rcorpora to list of packages used in book --- intro.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/intro.Rmd b/intro.Rmd index fba85c3..7df1ff1 100644 --- a/intro.Rmd +++ b/intro.Rmd @@ -129,8 +129,8 @@ You'll also need to install some R packages. An R _package_ is a collection of f
An R _package_ is a collection of f ```{r} pkgs <- c( "broom", "dplyr", "ggplot2", "jpeg", "jsonlite", - "knitr", "microbenchmark", "png", "pryr", "purrr", "readr", "stringr", - "tidyr" + "knitr", "Lahman", "microbenchmark", "png", "pryr", "purrr", + "rcorpora", "readr", "stringr", "tibble", "tidyr" ) install.packages(pkgs) ``` From e8def24843e9f6f562197603466cbd5e4b1785cc Mon Sep 17 00:00:00 2001 From: Yihui Xie Date: Fri, 1 Apr 2016 11:31:57 -0500 Subject: [PATCH 11/14] Add author and description to YAML --- index.rmd | 2 ++ 1 file changed, 2 insertions(+) diff --git a/index.rmd b/index.rmd index 2b85b13..61e13fa 100644 --- a/index.rmd +++ b/index.rmd @@ -1,6 +1,8 @@ --- knit: "bookdown::render_book" title: "R for Data Science" +author: ["Garrett Grolemund", "Hadley Wickham"] +description: "This book will teach you how to do data science with R: You'll learn how to get your data into R, get it into the most useful structure, transform it, visualise it and model it. In this book, you will find a practicum of skills for data science. Just as a chemist learns how to clean test tubes and stock a lab, you'll learn how to clean data and draw plots---and many other things besides. These are the skills that allow data science to happen, and here you will find the best practices for doing each of these things with R. You'll learn how to use the grammar of graphics, literate programming, and reproducible research to save time. You'll also learn how to manage cognitive resources to facilitate discoveries when wrangling, visualizing, and exploring data." 
 output:
 - bookdown::gitbook
 ---

From a52def7dd7980e63004cfa7157dd02ab1c2f4a3f Mon Sep 17 00:00:00 2001
From: Yihui Xie
Date: Fri, 1 Apr 2016 11:34:28 -0500
Subject: [PATCH 12/14] Also add the cover image

---
 index.rmd | 1 +
 1 file changed, 1 insertion(+)

diff --git a/index.rmd b/index.rmd
index 61e13fa..5ffac8f 100644
--- a/index.rmd
+++ b/index.rmd
@@ -3,6 +3,7 @@ knit: "bookdown::render_book"
 title: "R for Data Science"
 author: ["Garrett Grolemund", "Hadley Wickham"]
 description: "This book will teach you how to do data science with R: You'll learn how to get your data into R, get it into the most useful structure, transform it, visualise it and model it. In this book, you will find a practicum of skills for data science. Just as a chemist learns how to clean test tubes and stock a lab, you'll learn how to clean data and draw plots---and many other things besides. These are the skills that allow data science to happen, and here you will find the best practices for doing each of these things with R. You'll learn how to use the grammar of graphics, literate programming, and reproducible research to save time. You'll also learn how to manage cognitive resources to facilitate discoveries when wrangling, visualizing, and exploring data."
+cover-image: cover.png
 output:
 - bookdown::gitbook
 ---

From 0557e58ca0bbbfdbceb277c47ae43b403cd31b34 Mon Sep 17 00:00:00 2001
From: Garrett
Date: Fri, 1 Apr 2016 12:46:14 -0400
Subject: [PATCH 13/14] Adds two missing dollar signs to model.Rmd

---
 model.Rmd | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/model.Rmd b/model.Rmd
index b2e79d4..1c2cba7 100644
--- a/model.Rmd
+++ b/model.Rmd
@@ -1,3 +1,6 @@
+---
+output: pdf_document
+---
 # Model
 
 A model is a function that summarizes how the values of one variable vary in relation to the values of other variables. Models play a large role in hypothesis testing and prediction, but for the moment you should think of models just like you think of statistics. A statistic summarizes a *distribution* in a way that is easy to understand; and a model summarizes *covariation* in a way that is easy to understand. In other words, a model is just another way to describe data.
@@ -127,7 +130,7 @@ ggplot(data = heights, mapping = aes(x = height, y = income)) +
 
 `lm()` treats the variable(s) on the right-hand side of the formula as _explanatory variables_ that partially determine the value of the variable on the left-hand side of the formula, which is known as the _response variable_. In other words, it acts as if the _response variable_ is determined by a function of the _explanatory variables_. It then spots the linear function that best fits the data.
 
-Linear models are straightforward to interpret. Incomes have a baseline mean of $`r coef(h)[1]`. Each one inch increase of height above zero is associated with an increase of $`r coef(h)[2]` in income.
+Linear models are straightforward to interpret. Incomes have a baseline mean of $`r coef(h)[1]`$. Each one inch increase of height above zero is associated with an increase of $`r coef(h)[2]`$ in income.
 
 ```{r}
 summary(h)

From 77db912d55c0b32fcd953f881eb726540be1bbcd Mon Sep 17 00:00:00 2001
From: Yihui Xie
Date: Fri, 1 Apr 2016 11:55:02 -0500
Subject: [PATCH 14/14] And add url and github repo info

---
 index.rmd | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/index.rmd b/index.rmd
index 5ffac8f..ae3daa5 100644
--- a/index.rmd
+++ b/index.rmd
@@ -3,6 +3,8 @@ knit: "bookdown::render_book"
 title: "R for Data Science"
 author: ["Garrett Grolemund", "Hadley Wickham"]
 description: "This book will teach you how to do data science with R: You'll learn how to get your data into R, get it into the most useful structure, transform it, visualise it and model it. In this book, you will find a practicum of skills for data science. Just as a chemist learns how to clean test tubes and stock a lab, you'll learn how to clean data and draw plots---and many other things besides. These are the skills that allow data science to happen, and here you will find the best practices for doing each of these things with R. You'll learn how to use the grammar of graphics, literate programming, and reproducible research to save time. You'll also learn how to manage cognitive resources to facilitate discoveries when wrangling, visualizing, and exploring data."
+url: 'http\://r4ds.had.co.nz/'
+github-repo: hadley/r4ds
 cover-image: cover.png
 output:
 - bookdown::gitbook
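
The correlation hunk near the top of this series builds series with Pearson's r of exactly -1 and 1 (via `-x1` and `x1`) and noisy intermediates (`y1`, `y2` with slopes ±0.5). As a language-neutral sanity check of the claim that a correlation of 1 or -1 means one variable completely determines the other, here is a minimal Python sketch; the `pearson` helper is illustrative and not part of the book's code:

```python
import random

def pearson(x, y):
    # Pearson's r: covariance of x and y over the product of their
    # standard deviations (the 1/n factors cancel, so raw sums suffice)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

random.seed(1)
x1 = [random.gauss(0, 1) for _ in range(100)]
y1 = [0.5 * a + random.gauss(0, 0.5) for a in x1]    # mirrors y1 in the patch
y2 = [-0.5 * a + random.gauss(0, 0.5) for a in x1]   # mirrors y2 in the patch

print(round(pearson(x1, x1), 6))                # 1.0: complete positive determination
print(round(pearson(x1, [-a for a in x1]), 6))  # -1.0: complete negative determination
# The noisy slopes give intermediate values: 0 < pearson(x1, y1) < 1
# and -1 < pearson(x1, y2) < 0, matching the middle facets of the plot.
```

The construction mirrors the R chunk in the patch (slopes ±0.5, noise sd 0.5), so the noisy series land between the extremes, which is exactly what the faceted plot added there visualizes.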