From 852f6b98a0d3cac48efad58bd5908f5f12410195 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mine=20=C3=87etinkaya-Rundel?= Date: Sun, 21 Feb 2021 17:12:13 +0000 Subject: [PATCH 01/16] Remove references to iris --- iteration.Rmd | 20 ++++++++++---------- tibble.Rmd | 2 +- 2 files changed, 11 insertions(+), 11 deletions(-) diff --git a/iteration.Rmd b/iteration.Rmd index 322b491..c689a60 100644 --- a/iteration.Rmd +++ b/iteration.Rmd @@ -102,7 +102,7 @@ Then we'll move on some variations of the for loop that help you solve other pro 1. Compute the mean of every column in `mtcars`. 2. Determine the type of each column in `nycflights13::flights`. - 3. Compute the number of unique values in each column of `iris`. + 3. Compute the number of unique values in each column of `palmerpenguins::penguins`. 4. Generate 10 random normals from distributions with means of -10, 0, 10, and 100. Think about the output, sequence, and body **before** you start writing the loop. @@ -346,14 +346,14 @@ However, it is good to know they exist so that you're prepared for problems wher What if the names are not unique? 3. Write a function that prints the mean of each numeric column in a data frame, along with its name. - For example, `show_mean(iris)` would print: + For example, `show_mean(mpg)` would print: ```{r, eval = FALSE} - show_mean(iris) - #> Sepal.Length: 5.84 - #> Sepal.Width: 3.06 - #> Petal.Length: 3.76 - #> Petal.Width: 1.20 + show_mean(mpg) + #> displ: 3.47 + #> year: 2004 + #> cyl: 5.89 + #> cty: 16.86 ``` (Extra challenge: what function did I use to make sure that the numbers lined up nicely, even though the variable names had different lengths?) @@ -636,7 +636,7 @@ I focus on purrr functions here because they have more consistent names and argu 1. Compute the mean of every column in `mtcars`. 2. Determine the type of each column in `nycflights13::flights`. - 3. Compute the number of unique values in each column of `iris`. + 3. 
Compute the number of unique values in each column of `palmerpenguins::penguins`. 4. Generate 10 random normals from distributions with means of -10, 0, 10, and 100. 2. How can you create a single vector that for each column in a data frame indicates whether or not it's a factor? @@ -909,11 +909,11 @@ A number of functions work with **predicate** functions that return either a sin `keep()` and `discard()` keep elements of the input where the predicate is `TRUE` or `FALSE` respectively: ```{r} -iris %>% +gss_cat %>% keep(is.factor) %>% str() -iris %>% +gss_cat %>% discard(is.factor) %>% str() ``` diff --git a/tibble.Rmd b/tibble.Rmd index 5c90546..ce41ede 100644 --- a/tibble.Rmd +++ b/tibble.Rmd @@ -26,7 +26,7 @@ Most other R packages use regular data frames, so you might want to coerce a dat You can do that with `as_tibble()`: ```{r} -as_tibble(iris) +as_tibble(mtcars) ``` You can create a new tibble from individual vectors with `tibble()`. From fbb738e799784fe678f9fb56e6501e93959feb0a Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mine=20=C3=87etinkaya-Rundel?= Date: Sun, 21 Feb 2021 17:32:37 +0000 Subject: [PATCH 02/16] Enumerate exercise subparts with letters --- communicate-plots.Rmd | 10 +++---- iteration.Rmd | 26 ++++++++--------- relational-data.Rmd | 10 +++---- rmarkdown.Rmd | 6 ++-- strings.Rmd | 66 +++++++++++++++++++------------------------ tibble.Rmd | 11 +++----- transform.Rmd | 14 ++++----- vectors.Rmd | 16 ++++------- 8 files changed, 70 insertions(+), 89 deletions(-) diff --git a/communicate-plots.Rmd b/communicate-plots.Rmd index a240703..63dae53 100644 --- a/communicate-plots.Rmd +++ b/communicate-plots.Rmd @@ -495,11 +495,11 @@ Note that all colour scales come in two variety: `scale_colour_x()` and `scale_f 3. Change the display of the presidential terms by: - 1. Combining the two variants shown above. - 2. Improving the display of the y axis. - 3. Labelling each term with the name of the president. - 4. Adding informative plot labels. - 5. 
Placing breaks every 4 years (this is trickier than it seems!). + a. Combining the two variants shown above. + b. Improving the display of the y axis. + c. Labelling each term with the name of the president. + d. Adding informative plot labels. + e. Placing breaks every 4 years (this is trickier than it seems!). 4. Use `override.aes` to make the legend on the following plot easier to see. diff --git a/iteration.Rmd b/iteration.Rmd index c689a60..5a5da55 100644 --- a/iteration.Rmd +++ b/iteration.Rmd @@ -100,10 +100,10 @@ Then we'll move on some variations of the for loop that help you solve other pro 1. Write for loops to: - 1. Compute the mean of every column in `mtcars`. - 2. Determine the type of each column in `nycflights13::flights`. - 3. Compute the number of unique values in each column of `palmerpenguins::penguins`. - 4. Generate 10 random normals from distributions with means of -10, 0, 10, and 100. + a. Compute the mean of every column in `mtcars`. + b. Determine the type of each column in `nycflights13::flights`. + c. Compute the number of unique values in each column of `palmerpenguins::penguins`. + d. Generate 10 random normals from distributions with means of -10, 0, 10, and 100. Think about the output, sequence, and body **before** you start writing the loop. @@ -132,13 +132,9 @@ Then we'll move on some variations of the for loop that help you solve other pro 3. Combine your function writing and for loop skills: - 1. Write a for loop that `prints()` the lyrics to the children's song "Alice the camel". - - 2. Convert the nursery rhyme "ten in the bed" to a function. - Generalise it to any number of people in any sleeping structure. - - 3. Convert the song "99 bottles of beer on the wall" to a function. - Generalise to any number of any vessel containing any liquid on any surface. + a. Write a for loop that `prints()` the lyrics to the children's song "Alice the camel". + b. Convert the nursery rhyme "ten in the bed" to a function. 
Generalise it to any number of people in any sleeping structure. + c. Convert the song "99 bottles of beer on the wall" to a function. Generalise to any number of any vessel containing any liquid on any surface. 4. It's common to see for loops that don't preallocate the output and instead increase the length of a vector at each step: @@ -634,10 +630,10 @@ I focus on purrr functions here because they have more consistent names and argu 1. Write code that uses one of the map functions to: - 1. Compute the mean of every column in `mtcars`. - 2. Determine the type of each column in `nycflights13::flights`. - 3. Compute the number of unique values in each column of `palmerpenguins::penguins`. - 4. Generate 10 random normals from distributions with means of -10, 0, 10, and 100. + a. Compute the mean of every column in `mtcars`. + b. Determine the type of each column in `nycflights13::flights`. + c. Compute the number of unique values in each column of `palmerpenguins::penguins`. + d. Generate 10 random normals from distributions with means of -10, 0, 10, and 100. 2. How can you create a single vector that for each column in a data frame indicates whether or not it's a factor? diff --git a/relational-data.Rmd b/relational-data.Rmd index e11dadb..f11c142 100644 --- a/relational-data.Rmd +++ b/relational-data.Rmd @@ -167,11 +167,11 @@ For example, in this data there's a many-to-many relationship between airlines a 2. Identify the keys in the following datasets - 1. `Lahman::Batting`, - 2. `babynames::babynames` - 3. `nasaweather::atmos` - 4. `fueleconomy::vehicles` - 5. `ggplot2::diamonds` + a. `Lahman::Batting`, + b. `babynames::babynames` + c. `nasaweather::atmos` + d. `fueleconomy::vehicles` + e. `ggplot2::diamonds` (You might need to install some packages and read some documentation.) 
diff --git a/rmarkdown.Rmd b/rmarkdown.Rmd index c65e58b..ce108ac 100644 --- a/rmarkdown.Rmd +++ b/rmarkdown.Rmd @@ -124,9 +124,9 @@ If you forget, you can get to a handy reference sheet with *Help \> Markdown Qui 2. Using the R Markdown quick reference, figure out how to: - 1. Add a footnote. - 2. Add a horizontal rule. - 3. Add a block quote. + a. Add a footnote. + b. Add a horizontal rule. + c. Add a block quote. 3. Copy and paste the contents of `diamond-sizes.Rmd` from in to a local R markdown document. Check that you can run it, then add text after the frequency polygon that describes its most striking features. diff --git a/strings.Rmd b/strings.Rmd index 8da69d9..c89873c 100644 --- a/strings.Rmd +++ b/strings.Rmd @@ -314,10 +314,10 @@ For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, 2. Given the corpus of common words in `stringr::words`, create regular expressions that find all words that: - 1. Start with "y". - 2. End with "x" - 3. Are exactly three letters long. (Don't cheat by using `str_length()`!) - 4. Have seven letters or more. + a. Start with "y". + b. End with "x" + c. Are exactly three letters long. (Don't cheat by using `str_length()`!) + d. Have seven letters or more. Since this list is long, you might want to use the `match` argument to `str_view()` to show only the matching or non-matching words. @@ -360,14 +360,10 @@ str_view(c("grey", "gray"), "gr(e|a)y") 1. Create regular expressions to find all words that: - 1. Start with a vowel. - - 2. That only contain consonants. - (Hint: thinking about matching "not"-vowels.) - - 3. End with `ed`, but not with `eed`. - - 4. End with `ing` or `ise`. + a. Start with a vowel. + b. That only contain consonants. (Hint: thinking about matching "not"-vowels.) + c. End with `ed`, but not with `eed`. + d. End with `ing` or `ise`. 2. Empirically verify the rule "i before e except after c". @@ -423,16 +419,16 @@ str_view(x, 'C[LX]+?') 2. 
Describe in words what these regular expressions match: (read carefully to see if I'm using a regular expression or a string that defines a regular expression.) - 1. `^.*$` - 2. `"\\{.+\\}"` - 3. `\d{4}-\d{2}-\d{2}` - 4. `"\\\\{4}"` + a. `^.*$` + b. `"\\{.+\\}"` + c. `\d{4}-\d{2}-\d{2}` + d. `"\\\\{4}"` 3. Create regular expressions to find all words that: - 1. Start with three consonants. - 2. Have three or more vowels in a row. - 3. Have two or more vowel-consonant pairs in a row. + a. Start with three consonants. + b. Have three or more vowels in a row. + c. Have two or more vowel-consonant pairs in a row. 4. Solve the beginner regexp crosswords at . @@ -454,19 +450,17 @@ str_view(fruit, "(..)\\1", match = TRUE) 1. Describe, in words, what these expressions will match: - 1. `(.)\1\1` - 2. `"(.)(.)\\2\\1"` - 3. `(..)\1` - 4. `"(.).\\1.\\1"` - 5. `"(.)(.)(.).*\\3\\2\\1"` + a. `(.)\1\1` + b. `"(.)(.)\\2\\1"` + c. `(..)\1` + d. `"(.).\\1.\\1"` + e. `"(.)(.)(.).*\\3\\2\\1"` 2. Construct regular expressions to match words that: - 1. Start and end with the same character. - - 2. Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.) - - 3. Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.) + a. Start and end with the same character. + b. Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.) + c. Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.) ## Tools @@ -666,11 +660,9 @@ The second function will have the suffix `_all`. 1. For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple `str_detect()` calls. - 1. Find all words that start or end with `x`. - - 2. Find all words that start with a vowel and end with a consonant. - - 3. Are there any words that contain at least one of each different vowel? + a. Find all words that start or end with `x`. + b. 
Find all words that start with a vowel and end with a consonant. + c. Are there any words that contain at least one of each different vowel? 2. What word has the highest number of vowels? What word has the highest proportion of vowels? @@ -1048,8 +1040,8 @@ The main difference is the prefix: `str_` vs. `stri_`. 1. Find the stringi functions that: - 1. Count the number of words. - 2. Find duplicated strings. - 3. Generate random text. + a. Count the number of words. + b. Find duplicated strings. + c. Generate random text. 2. How do you control the language that `stri_sort()` uses for sorting? diff --git a/tibble.Rmd b/tibble.Rmd index ce41ede..92bf716 100644 --- a/tibble.Rmd +++ b/tibble.Rmd @@ -184,13 +184,10 @@ With tibbles, `[` always returns another tibble. 4. Practice referring to non-syntactic names in the following data frame by: - 1. Extracting the variable called `1`. - - 2. Plotting a scatterplot of `1` vs `2`. - - 3. Creating a new column called `3` which is `2` divided by `1`. - - 4. Renaming the columns to `one`, `two` and `three`. + a. Extracting the variable called `1`. + b. Plotting a scatterplot of `1` vs `2`. + c. Creating a new column called `3` which is `2` divided by `1`. + d. Renaming the columns to `one`, `two` and `three`. ```{r} annoying <- tibble( diff --git a/transform.Rmd b/transform.Rmd index bd7763c..73614ea 100644 --- a/transform.Rmd +++ b/transform.Rmd @@ -229,13 +229,13 @@ filter(df, is.na(x) | x > 1) 1. Find all flights that - 1. Had an arrival delay of two or more hours - 2. Flew to Houston (`IAH` or `HOU`) - 3. Were operated by United, American, or Delta - 4. Departed in summer (July, August, and September) - 5. Arrived more than two hours late, but didn't leave late - 6. Were delayed by at least an hour, but made up over 30 minutes in flight - 7. Departed between midnight and 6am (inclusive) + a. Had an arrival delay of two or more hours + b. Flew to Houston (`IAH` or `HOU`) + c. Were operated by United, American, or Delta + d. 
Departed in summer (July, August, and September) + e. Arrived more than two hours late, but didn't leave late + f. Were delayed by at least an hour, but made up over 30 minutes in flight + g. Departed between midnight and 6am (inclusive) 2. Another useful dplyr filtering helper is `between()`. What does it do? diff --git a/vectors.Rmd b/vectors.Rmd index aea855d..b54b9b6 100644 --- a/vectors.Rmd +++ b/vectors.Rmd @@ -412,14 +412,10 @@ The distinction between `[` and `[[` is most important for lists, as we'll see s 4. Create functions that take a vector as input and returns: - 1. The last value. - Should you use `[` or `[[`? - - 2. The elements at even numbered positions. - - 3. Every element except the last value. - - 4. Only even numbers (and no missing values). + a. The last value. Should you use `[` or `[[`? + b. The elements at even numbered positions. + c. Every element except the last value. + d. Only even numbers (and no missing values). 5. Why is `x[-which(x > 0)]` not the same as `x[x <= 0]`? @@ -561,8 +557,8 @@ knitr::include_graphics("images/pepper-3.jpg") 1. Draw the following lists as nested sets: - 1. `list(a, b, list(c, d), list(e, f))` - 2. `list(list(list(list(list(list(a))))))` + a. `list(a, b, list(c, d), list(e, f))` + b. `list(list(list(list(list(list(a))))))` 2. What happens if you subset a tibble as if you're subsetting a list? What are the key differences between a list and a tibble? 
From f9109aadfe612c36335f955b94ee362d4814c4f4 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mine=20=C3=87etinkaya-Rundel?= Date: Sun, 21 Feb 2021 20:09:08 +0000 Subject: [PATCH 03/16] Add 2nd edition preface and planned major changes --- _bookdown.yml | 2 ++ preface-2e.Rmd | 15 +++++++++++++++ 2 files changed, 17 insertions(+) create mode 100644 preface-2e.Rmd diff --git a/_bookdown.yml b/_bookdown.yml index e9c7c2d..31655de 100644 --- a/_bookdown.yml +++ b/_bookdown.yml @@ -3,6 +3,8 @@ new_session: yes rmd_files: [ "index.Rmd", + + "preface-2e.Rmd", "intro.Rmd", "explore.Rmd", diff --git a/preface-2e.Rmd b/preface-2e.Rmd new file mode 100644 index 0000000..216b7b8 --- /dev/null +++ b/preface-2e.Rmd @@ -0,0 +1,15 @@ +# Preface to the second edition {.unnumbered} + +Welcome to the second edition of "R for Data Science". + +## Major changes {.unnumbered} + +- The first part is renamed to "whole game" to reflect the entire data science cycle, including a chapter on data import. +- The wrangle part highlights improvements to dplyr and tidyr that make data scientists' lives even easier, such as new functions for rectangling data, working with list columns, and column-wise and row-wise operations. +- Data import also gains a whole part that goes beyond importing rectangular data to include chapters on working with spreadsheets, databases, and web scraping. +- The iteration chapter gains a new case study on web scraping from multiple pages. +- The modeling part has been removed. For modeling, we recommend using packages from [tidymodels](https://www.tidymodels.org/) and reading [Tidy Modeling with R](https://www.tmwr.org/) by Max Kuhn and Julia Silge to learn more about them. 
+ +## Acknowledgements {.unnumbered} + +*TO DO: Add acknowledgements.* From 55803fc8a30dbd724a5a4432ee47d6606ff9f528 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mine=20=C3=87etinkaya-Rundel?= Date: Sun, 21 Feb 2021 21:29:24 +0000 Subject: [PATCH 04/16] Remove modelling - Move files to extras/ for now - Adjust references to modelling - Or add TO DO items to adjust later --- EDA.Rmd | 5 +++-- _bookdown.yml | 5 ----- communicate-plots.Rmd | 1 + communicate.Rmd | 2 +- explore.Rmd | 1 - model-basics.Rmd => extra/model/model-basics.Rmd | 0 model-building.Rmd => extra/model/model-building.Rmd | 0 model-many.Rmd => extra/model/model-many.Rmd | 0 model.Rmd => extra/model/model.Rmd | 0 import.Rmd | 2 +- index.Rmd | 2 +- intro.Rmd | 1 - transform.Rmd | 2 +- 13 files changed, 8 insertions(+), 13 deletions(-) rename model-basics.Rmd => extra/model/model-basics.Rmd (100%) rename model-building.Rmd => extra/model/model-building.Rmd (100%) rename model-many.Rmd => extra/model/model-many.Rmd (100%) rename model.Rmd => extra/model/model.Rmd (100%) diff --git a/EDA.Rmd b/EDA.Rmd index 8e1a6e4..7e5e6f7 100644 --- a/EDA.Rmd +++ b/EDA.Rmd @@ -623,6 +623,8 @@ It's possible to use a model to remove the very strong relationship between pric The following code fits a model that predicts `price` from `carat` and then computes the residuals (the difference between the predicted value and the actual value). The residuals give us a view of the price of the diamond, once the effect of carat has been removed. + + ```{r, dev = "png"} library(modelr) @@ -643,8 +645,7 @@ ggplot(data = diamonds2) + geom_boxplot(mapping = aes(x = cut, y = resid)) ``` -You'll learn how models, and the modelr package, work in the final part of the book, [model](#model-intro). -We're saving modelling for later because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand. 
+We're not discussing modelling in this book because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand. ## ggplot2 calls diff --git a/_bookdown.yml b/_bookdown.yml index 31655de..99f4872 100644 --- a/_bookdown.yml +++ b/_bookdown.yml @@ -30,11 +30,6 @@ rmd_files: [ "vectors.Rmd", "iteration.Rmd", - "model.Rmd", - "model-basics.Rmd", - "model-building.Rmd", - "model-many.Rmd", - "communicate.Rmd", "rmarkdown.Rmd", "communicate-plots.Rmd", diff --git a/communicate-plots.Rmd b/communicate-plots.Rmd index 63dae53..c66523d 100644 --- a/communicate-plots.Rmd +++ b/communicate-plots.Rmd @@ -99,6 +99,7 @@ ggplot(df, aes(x, y)) + 2. The `geom_smooth()` is somewhat misleading because the `hwy` for large engines is skewed upwards due to the inclusion of lightweight sports cars with big engines. Use your modelling tools to fit and display a better model. + 3. Take an exploratory graphic that you've created in the last month, and add informative titles to make it easier for others to understand. diff --git a/communicate.Rmd b/communicate.Rmd index 5d0b520..91c0afb 100644 --- a/communicate.Rmd +++ b/communicate.Rmd @@ -2,7 +2,7 @@ # Introduction {#communicate-intro} -So far, you've learned the tools to get your data into R, tidy it into a form convenient for analysis, and then understand your data through transformation, visualisation and modelling. +So far, you've learned the tools to get your data into R, tidy it into a form convenient for analysis, and then understand your data through transformation and visualisation. However, it doesn't matter how great your analysis is unless you can explain it to others: you need to **communicate** your results. 
```{r echo = FALSE, out.width = "75%"} diff --git a/explore.Rmd b/explore.Rmd index 861993d..522a646 100644 --- a/explore.Rmd +++ b/explore.Rmd @@ -20,7 +20,6 @@ In this part of the book you will learn some useful tools that have an immediate - Finally, in [exploratory data analysis], you'll combine visualisation and transformation with your curiosity and scepticism to ask and answer interesting questions about data. Modelling is an important part of the exploratory process, but you don't have the skills to effectively learn or apply it yet. -We'll come back to it in [modelling](#model-intro), once you're better equipped with more data wrangling and programming tools. Nestled among these three chapters that teach you the tools of exploration are three chapters that focus on your R workflow. In [workflow: basics], [workflow: scripts], and [workflow: projects] you'll learn good practices for writing and organising your R code. diff --git a/model-basics.Rmd b/extra/model/model-basics.Rmd similarity index 100% rename from model-basics.Rmd rename to extra/model/model-basics.Rmd diff --git a/model-building.Rmd b/extra/model/model-building.Rmd similarity index 100% rename from model-building.Rmd rename to extra/model/model-building.Rmd diff --git a/model-many.Rmd b/extra/model/model-many.Rmd similarity index 100% rename from model-many.Rmd rename to extra/model/model-many.Rmd diff --git a/model.Rmd b/extra/model/model.Rmd similarity index 100% rename from model.Rmd rename to extra/model/model.Rmd diff --git a/import.Rmd b/import.Rmd index ed51c16..a78f1b8 100644 --- a/import.Rmd +++ b/import.Rmd @@ -639,7 +639,7 @@ There are two alternatives: ``` Feather tends to be faster than RDS and is usable outside of R. -RDS supports list-columns (which you'll learn about in [many models]); feather currently does not. +RDS supports list-columns (which you'll learn about in ); feather currently does not. 
```{r, include = FALSE} file.remove("challenge-2.csv") diff --git a/index.Rmd b/index.Rmd index 33cdedd..7d5d962 100644 --- a/index.Rmd +++ b/index.Rmd @@ -14,7 +14,7 @@ documentclass: book # Welcome {.unnumbered} Buy from amazon This is the website for the work-in-progress 2nd edition of **"R for Data Science"**. This book will teach you how to do data science with R: You'll learn how to get your data into R, get it into the most useful structure, transform it, visualise it and model it. -In this book, you will find a practicum of skills for data science. + In this book, you will find a practicum of skills for data science. Just as a chemist learns how to clean test tubes and stock a lab, you'll learn how to clean data and draw plots---and many other things besides. These are the skills that allow data science to happen, and here you will find the best practices for doing each of these things with R. You'll learn how to use the grammar of graphics, literate programming, and reproducible research to save time. diff --git a/intro.Rmd b/intro.Rmd index f5ba338..52d184c 100644 --- a/intro.Rmd +++ b/intro.Rmd @@ -140,7 +140,6 @@ Hypothesis confirmation is hard for two reasons: 2. You can only use an observation once to confirm a hypothesis. As soon as you use it more than once you're back to doing exploratory analysis. This means to do hypothesis confirmation you need to "preregister" (write out in advance) your analysis plan, and not deviate from it even when you have seen the data. - We'll talk a little about some strategies you can use to make this easier in [modelling](#model-intro). It's common to think about modelling as a tool for hypothesis confirmation, and visualisation as a tool for hypothesis generation. But that's a false dichotomy: models are often used for exploration, and with a little care you can use visualisation for confirmation. 
diff --git a/transform.Rmd b/transform.Rmd index 73614ea..8575335 100644 --- a/transform.Rmd +++ b/transform.Rmd @@ -423,7 +423,7 @@ There's no way to list every possible function that you might use, but here's a - Logs: `log()`, `log2()`, `log10()`. Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude. - They also convert multiplicative relationships to additive, a feature we'll come back to in modelling. + They also convert multiplicative relationships to additive. All else being equal, I recommend using `log2()` because it's easy to interpret: a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving. From b5654a3a0876fc9feccf16178136ec392ad4c2e7 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mine=20=C3=87etinkaya-Rundel?= Date: Sun, 21 Feb 2021 22:32:04 +0000 Subject: [PATCH 05/16] Rename part: explore -> whole game --- _bookdown.yml | 2 +- explore.Rmd => whole-game.Rmd | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) rename explore.Rmd => whole-game.Rmd (98%) diff --git a/_bookdown.yml b/_bookdown.yml index 99f4872..6805c36 100644 --- a/_bookdown.yml +++ b/_bookdown.yml @@ -7,7 +7,7 @@ rmd_files: [ "preface-2e.Rmd", "intro.Rmd", - "explore.Rmd", + "whole-game.Rmd", "visualize.Rmd", "workflow-basics.Rmd", "transform.Rmd", diff --git a/explore.Rmd b/whole-game.Rmd similarity index 98% rename from explore.Rmd rename to whole-game.Rmd index 522a646..a783937 100644 --- a/explore.Rmd +++ b/whole-game.Rmd @@ -1,4 +1,4 @@ -# (PART) Explore {.unnumbered} +# (PART) Whole game {.unnumbered} # Introduction {#explore-intro} From dc31887b1e1430d1228828fd9cb96500c041be83 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mine=20=C3=87etinkaya-Rundel?= Date: Sun, 21 Feb 2021 23:32:01 +0000 Subject: [PATCH 06/16] Rename files for whole game, move import in --- _bookdown.yml | 6 +++--- import.Rmd => data-import.Rmd | 2 ++ transform.Rmd => 
data-transform.Rmd | 0 visualize.Rmd => data-visualize.Rmd | 0 4 files changed, 5 insertions(+), 3 deletions(-) rename import.Rmd => data-import.Rmd (99%) rename transform.Rmd => data-transform.Rmd (100%) rename visualize.Rmd => data-visualize.Rmd (100%) diff --git a/_bookdown.yml b/_bookdown.yml index 6805c36..53319ff 100644 --- a/_bookdown.yml +++ b/_bookdown.yml @@ -8,16 +8,16 @@ rmd_files: [ "intro.Rmd", "whole-game.Rmd", - "visualize.Rmd", + "data-visualize.Rmd", "workflow-basics.Rmd", - "transform.Rmd", + "data-transform.Rmd", + "data-import.Rmd", "workflow-scripts.Rmd", "EDA.Rmd", "workflow-projects.Rmd", "wrangle.Rmd", "tibble.Rmd", - "import.Rmd", "tidy.Rmd", "relational-data.Rmd", "strings.Rmd", diff --git a/import.Rmd b/data-import.Rmd similarity index 99% rename from import.Rmd rename to data-import.Rmd index a78f1b8..c107952 100644 --- a/import.Rmd +++ b/data-import.Rmd @@ -1,5 +1,7 @@ # Data import + + ## Introduction Working with data provided by R packages is a great way to learn the tools of data science, but at some point you want to stop learning and start working with your own data. 
diff --git a/transform.Rmd b/data-transform.Rmd similarity index 100% rename from transform.Rmd rename to data-transform.Rmd diff --git a/visualize.Rmd b/data-visualize.Rmd similarity index 100% rename from visualize.Rmd rename to data-visualize.Rmd From b86a23477fe3ab7664de105e67b83d348eefa239 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mine=20=C3=87etinkaya-Rundel?= Date: Mon, 22 Feb 2021 00:22:15 +0000 Subject: [PATCH 07/16] Restructure wrangle, add stubs for new chapters --- _bookdown.yml | 4 ++++ column-wise.Rmd | 16 ++++++++++++++++ list-columns.Rmd | 16 ++++++++++++++++ rectangle.Rmd | 16 ++++++++++++++++ row-wise.Rmd | 16 ++++++++++++++++ 5 files changed, 68 insertions(+) create mode 100644 column-wise.Rmd create mode 100644 list-columns.Rmd create mode 100644 rectangle.Rmd create mode 100644 row-wise.Rmd diff --git a/_bookdown.yml b/_bookdown.yml index 53319ff..c9c8466 100644 --- a/_bookdown.yml +++ b/_bookdown.yml @@ -19,7 +19,11 @@ rmd_files: [ "wrangle.Rmd", "tibble.Rmd", "tidy.Rmd", + "rectangle.Rmd", "relational-data.Rmd", + "list-columns.Rmd", + "column-wise.Rmd", + "row-wise.Rmd", "strings.Rmd", "factors.Rmd", "datetimes.Rmd", diff --git a/column-wise.Rmd b/column-wise.Rmd new file mode 100644 index 0000000..4f1ac34 --- /dev/null +++ b/column-wise.Rmd @@ -0,0 +1,16 @@ +# Column-wise operations + +## Introduction + + + +### Prerequisites + +In this chapter we'll continue using dplyr. +dplyr is a member of the core tidyverse. + +```{r setup, message = FALSE} +library(tidyverse) +``` + + diff --git a/list-columns.Rmd b/list-columns.Rmd new file mode 100644 index 0000000..2aaaa57 --- /dev/null +++ b/list-columns.Rmd @@ -0,0 +1,16 @@ +# List columns + +## Introduction + + + +### Prerequisites + +In this chapter we'll continue using tidyr, which also provides a bunch of tools to rectangle your datasets. +tidyr is a member of the core tidyverse. 
+ +```{r setup, message = FALSE} +library(tidyverse) +``` + + diff --git a/rectangle.Rmd b/rectangle.Rmd new file mode 100644 index 0000000..53624ca --- /dev/null +++ b/rectangle.Rmd @@ -0,0 +1,16 @@ +# Rectangle data + +## Introduction + + + +### Prerequisites + +In this chapter we'll continue using tidyr, which also provides a bunch of tools to rectangle your datasets. +tidyr is a member of the core tidyverse. + +```{r setup, message = FALSE} +library(tidyverse) +``` + + diff --git a/row-wise.Rmd b/row-wise.Rmd new file mode 100644 index 0000000..8c76617 --- /dev/null +++ b/row-wise.Rmd @@ -0,0 +1,16 @@ +# Row-wise operations + +## Introduction + + + +### Prerequisites + +In this chapter we'll continue using dplyr. +dplyr is a member of the core tidyverse. + +```{r setup, message = FALSE} +library(tidyverse) +``` + + From a6c9e4e6ab0e81b3c32aad01bb777f60e86bf721 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mine=20=C3=87etinkaya-Rundel?= Date: Mon, 22 Feb 2021 11:36:53 +0000 Subject: [PATCH 08/16] Update links and add blurb about new chapters --- wrangle.Rmd | 27 ++++++++++++++++++--------- 1 file changed, 18 insertions(+), 9 deletions(-) diff --git a/wrangle.Rmd b/wrangle.Rmd index 92570b9..a3caa27 100644 --- a/wrangle.Rmd +++ b/wrangle.Rmd @@ -10,25 +10,34 @@ There are three main parts to data wrangling: knitr::include_graphics("diagrams/data-science-wrangle.png") ``` + + This part of the book proceeds as follows: -- In [tibbles], you'll learn about the variant of the data frame that we use in this book: the **tibble**. +- In Chapter \@ref(tibbles), you'll learn about the variant of the data frame that we use in this book: the **tibble**. You'll learn what makes them different from regular data frames, and how you can construct them "by hand". -- In [data import], you'll learn how to get your data from disk and into R. - We'll focus on plain-text rectangular formats, but will give you pointers to packages that help with other types of data. 
- -- In [tidy data], you'll learn about tidy data, a consistent way of storing your data that makes transformation, visualisation, and modelling easier. +- In Chapter \@ref(tidy-data), you'll learn about tidy data, a consistent way of storing your data that makes transformation, visualisation, and modelling easier. You'll learn the underlying principles, and how to get your data into a tidy form. +- In Chapter \@ref(rectangle-data), you'll learn about hierarchical data formats and how to turn them into rectangular data via unnesting. + +- Chapter \@ref(column-wise-operations) will give you tools for performing the same operation on multiple columns. + +- Chapter \@ref(row-wise-operations) will give you tools for performing operations over rows. + Data wrangling also encompasses data transformation, which you've already learned a little about. Now we'll focus on new skills for three specific types of data you will frequently encounter in practice: -- [Relational data] will give you tools for working with multiple interrelated datasets. +- Chapter \@ref(relational-data) will give you tools for working with multiple interrelated datasets. -- [Strings] will introduce regular expressions, a powerful tool for manipulating strings. +- Chapter \@ref(list-columns) will give you tools for working with list columns --- data stored in columns of a tibble as lists. -- [Factors] are how R stores categorical data. +- Chapter \@ref(strings) will give you tools for working with strings and introduce regular expressions, a powerful tool for manipulating strings. + +- Chapter \@ref(factors) will introduce factors --- how R stores categorical data. They are used when a variable has a fixed set of possible values, or when you want to use a non-alphabetical ordering of a string. -- [Dates and times] will give you the key tools for working with dates and date-times. +- Chapter \@ref(dates-and-times) will give you the key tools for working with dates and date-times. 
+ + From 9c2fdc7ee0ad47ea48e5e2a71e0d51304990842c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mine=20=C3=87etinkaya-Rundel?= Date: Mon, 22 Feb 2021 11:47:39 +0000 Subject: [PATCH 09/16] Update chapter references --- data-import.Rmd | 2 +- data-transform.Rmd | 2 +- data-visualize.Rmd | 2 +- whole-game.Rmd | 13 ++++++++----- 4 files changed, 11 insertions(+), 8 deletions(-) diff --git a/data-import.Rmd b/data-import.Rmd index c107952..1f8d8fc 100644 --- a/data-import.Rmd +++ b/data-import.Rmd @@ -1,4 +1,4 @@ -# Data import +# Data import {#data-import} diff --git a/data-transform.Rmd b/data-transform.Rmd index 8575335..870b143 100644 --- a/data-transform.Rmd +++ b/data-transform.Rmd @@ -1,4 +1,4 @@ -# Data transformation {#transform} +# Data transformation {#data-transform} ## Introduction diff --git a/data-visualize.Rmd b/data-visualize.Rmd index 8c7962f..386f45e 100644 --- a/data-visualize.Rmd +++ b/data-visualize.Rmd @@ -1,4 +1,4 @@ -# Data visualisation +# Data visualisation {#data-visualisation} ## Introduction diff --git a/whole-game.Rmd b/whole-game.Rmd index a783937..6cbc884 100644 --- a/whole-game.Rmd +++ b/whole-game.Rmd @@ -13,14 +13,17 @@ knitr::include_graphics("diagrams/data-science-explore.png") In this part of the book you will learn some useful tools that have an immediate payoff: - Visualisation is a great place to start with R programming, because the payoff is so clear: you get to make elegant and informative plots that help you understand data. - In [data visualisation] you'll dive into visualisation, learning the basic structure of a ggplot2 plot, and powerful techniques for turning data into plots. + In Chapter \@ref(data-visualisation) you'll dive into visualisation, learning the basic structure of a ggplot2 plot, and powerful techniques for turning data into plots. 
-- Visualisation alone is typically not enough, so in [data transformation] you'll learn the key verbs that allow you to select important variables, filter out key observations, create new variables, and compute summaries. +- Visualisation alone is typically not enough, so in Chapter \@ref(data-transform) you'll learn the key verbs that allow you to select important variables, filter out key observations, create new variables, and compute summaries. -- Finally, in [exploratory data analysis], you'll combine visualisation and transformation with your curiosity and scepticism to ask and answer interesting questions about data. +- Before you can transform and visualise your data, you need to first get your data into R. + In Chapter \@ref(data-import) you'll learn the basics of getting plain-text rectangular data into R. -Modelling is an important part of the exploratory process, but you don't have the skills to effectively learn or apply it yet. +- Finally, in Chapter \@ref(exploratory-data-analysis), you'll combine visualisation and transformation with your curiosity and scepticism to ask and answer interesting questions about data. + +Modelling is an important part of the exploratory process, but you don't have the skills to effectively learn or apply it yet so we will not cover it in this part. Nestled among these three chapters that teach you the tools of exploration are three chapters that focus on your R workflow. -In [workflow: basics], [workflow: scripts], and [workflow: projects] you'll learn good practices for writing and organising your R code. +In Chapters \@ref(workflow-basics), \@ref(workflow-scripts), and \@ref(workflow-projects), you'll learn good workflow practices for writing and organising your R code. These will set you up for success in the long run, as they'll give you the tools to stay organised when you tackle real projects. 
From 6d977a565a35f453e7e1dcb1c3ec1c9edb90fc8d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mine=20=C3=87etinkaya-Rundel?= Date: Mon, 22 Feb 2021 12:28:26 +0000 Subject: [PATCH 10/16] Add new import part --- _bookdown.yml | 7 +++++++ import-databases.Rmd | 3 +++ import-other.Rmd | 3 +++ import-rectangular.Rmd | 3 +++ import-spreadsheets.Rmd | 3 +++ import-webscrape.Rmd | 3 +++ import.Rmd | 21 +++++++++++++++++++++ 7 files changed, 43 insertions(+) create mode 100644 import-databases.Rmd create mode 100644 import-other.Rmd create mode 100644 import-rectangular.Rmd create mode 100644 import-spreadsheets.Rmd create mode 100644 import-webscrape.Rmd create mode 100644 import.Rmd diff --git a/_bookdown.yml b/_bookdown.yml index c9c8466..e9fd408 100644 --- a/_bookdown.yml +++ b/_bookdown.yml @@ -28,6 +28,13 @@ rmd_files: [ "factors.Rmd", "datetimes.Rmd", + "import.Rmd", + "import-rectangular.Rmd", + "import-spreadsheets.Rmd", + "import-databases.Rmd", + "import-webscrape.Rmd", + "import-other.Rmd", + "program.Rmd", "pipes.Rmd", "functions.Rmd", diff --git a/import-databases.Rmd b/import-databases.Rmd new file mode 100644 index 0000000..eb5812d --- /dev/null +++ b/import-databases.Rmd @@ -0,0 +1,3 @@ +# Databases {#import-databases} + + diff --git a/import-other.Rmd b/import-other.Rmd new file mode 100644 index 0000000..35e3010 --- /dev/null +++ b/import-other.Rmd @@ -0,0 +1,3 @@ +# Other types of data {#import-other} + + diff --git a/import-rectangular.Rmd b/import-rectangular.Rmd new file mode 100644 index 0000000..924215f --- /dev/null +++ b/import-rectangular.Rmd @@ -0,0 +1,3 @@ +# Rectangular data {#import-rectangular} + + diff --git a/import-spreadsheets.Rmd b/import-spreadsheets.Rmd new file mode 100644 index 0000000..d4b3d9a --- /dev/null +++ b/import-spreadsheets.Rmd @@ -0,0 +1,3 @@ +# Spreadsheets {#import-spreadsheets} + + diff --git a/import-webscrape.Rmd b/import-webscrape.Rmd new file mode 100644 index 0000000..74b3669 --- /dev/null +++ b/import-webscrape.Rmd @@ 
-0,0 +1,3 @@ +# Web scraping {#import-webscrape} + + diff --git a/import.Rmd b/import.Rmd new file mode 100644 index 0000000..04db50b --- /dev/null +++ b/import.Rmd @@ -0,0 +1,21 @@ +# (PART) Import {.unnumbered} + +# Introduction {#import-intro} + +In this part of the book, you'll learn how to get your data into R. +We'll focus on plain-text rectangular formats, spreadsheets, databases, and web data. + + + +This part of the book proceeds as follows: + +- In Chapter \@ref(import-rectangular), you'll learn how to get plain-text data in rectangular formats from disk and into R. + +- In Chapter \@ref(import-spreadsheets), you'll learn how to get data from Excel spreadsheets and Google Sheets into R. + +- In Chapter \@ref(import-databases), you'll learn about getting data into R from databases. + + +- In Chapter \@ref(import-webscrape), you'll learn about harvesting data off the web and getting it into R. + +- We'll close up the part with a brief discussion on other types of data and pointers for how to get them into R in Chapter \@ref(import-other). From 5769a02123688be29191dc137f3ef453d6aba3af Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mine=20=C3=87etinkaya-Rundel?= Date: Mon, 22 Feb 2021 12:59:50 +0000 Subject: [PATCH 11/16] Update referencing style --- EDA.Rmd | 2 +- data-transform.Rmd | 4 ++-- functions.Rmd | 2 +- iteration.Rmd | 4 ++-- program.Rmd | 8 ++++---- strings.Rmd | 2 +- workflow-basics.Rmd | 2 +- 7 files changed, 12 insertions(+), 12 deletions(-) diff --git a/EDA.Rmd b/EDA.Rmd index 7e5e6f7..a2aee3f 100644 --- a/EDA.Rmd +++ b/EDA.Rmd @@ -661,7 +661,7 @@ Typically, the first one or two arguments to a function are so important that yo The first two arguments to `ggplot()` are `data` and `mapping`, and the first two arguments to `aes()` are `x` and `y`. In the remainder of the book, we won't supply those names. That saves typing, and, by reducing the amount of boilerplate, makes it easier to see what's different between plots.
-That's a really important programming concern that we'll come back in [functions]. +That's a really important programming concern that we'll come back to in Chapter \@ref(functions). Rewriting the previous plot more concisely yields: diff --git a/data-transform.Rmd b/data-transform.Rmd index 870b143..066f7d5 100644 --- a/data-transform.Rmd +++ b/data-transform.Rmd @@ -564,7 +564,7 @@ Naming things is hard, so this slows down our analysis. There's another way to tackle the same problem with the pipe, `%>%`: ```{r} -delays <- flights %>% +sdelays <- flights %>% group_by(dest) %>% summarise( count = n(), @@ -580,7 +580,7 @@ As suggested by this reading, a good way to pronounce `%>%` when reading code is Behind the scenes, `x %>% f(y)` turns into `f(x, y)`, and `x %>% f(y) %>% g(z)` turns into `g(f(x, y), z)` and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom. -We'll use piping frequently from now on because it considerably improves the readability of code, and we'll come back to it in more detail in [pipes]. +We'll use piping frequently from now on because it considerably improves the readability of code, and we'll come back to it in more detail in Chapter \@ref(pipes). Working with the pipe is one of the key criteria for belonging to the tidyverse. The only exception is ggplot2: it was written before the pipe was discovered. diff --git a/functions.Rmd b/functions.Rmd index a419e23..2c19096 100644 --- a/functions.Rmd +++ b/functions.Rmd @@ -127,7 +127,7 @@ df$d <- rescale01(df$d) Compared to the original, this code is easier to understand and we've eliminated one class of copy-and-paste errors. There is still quite a bit of duplication since we're doing the same thing to multiple columns. -We'll learn how to eliminate that duplication in [iteration], once you've learned more about R's data structures in [vectors]. 
+We'll learn how to eliminate that duplication with iteration in Chapter \@ref(iteration), once you've learned more about R's data structures in Chapter \@ref(vectors). Another advantage of functions is that if our requirements change, we only need to make the change in one place. For example, we might discover that some of our variables include infinite values, and `rescale01()` fails: diff --git a/iteration.Rmd b/iteration.Rmd index 5a5da55..6a981d7 100644 --- a/iteration.Rmd +++ b/iteration.Rmd @@ -2,7 +2,7 @@ ## Introduction -In [functions], we talked about how important it is to reduce duplication in your code by creating functions instead of copying-and-pasting. +In Chapter \@ref(functions), we talked about how important it is to reduce duplication in your code by creating functions instead of copying-and-pasting. Reducing code duplication has three main benefits: 1. It's easier to see the intent of your code, because your eyes are drawn to what's different, not what stays the same. @@ -164,7 +164,7 @@ There are four variations on the basic theme of the for loop: ### Modifying an existing object Sometimes you want to use a for loop to modify an existing object. -For example, remember our challenge from [functions]. +For example, remember our challenge from Chapter \@ref(functions) on functions. We wanted to rescale every column in a data frame: ```{r} diff --git a/program.Rmd b/program.Rmd index 8d36052..7c2bc88 100644 --- a/program.Rmd +++ b/program.Rmd @@ -28,18 +28,18 @@ But this doesn't mean you should rewrite every function: you need to balance wha In the following four chapters, you'll learn skills that will allow you to both tackle new programs and to solve existing problems with greater clarity and ease: -1. In [pipes], you will dive deep into the **pipe**, `%>%`, and learn more about how it works, what the alternatives are, and when not to use it. +1. 
In Chapter \@ref(pipes), you will dive deep into the **pipe**, `%>%`, and learn more about how it works, what the alternatives are, and when not to use it. 2. Copy-and-paste is a powerful tool, but you should avoid doing it more than twice. Repeating yourself in code is dangerous because it can easily lead to errors and inconsistencies. - Instead, in [functions], you'll learn how to write **functions** which let you extract out repeated code so that it can be easily reused. + Instead, in Chapter \@ref(functions), you'll learn how to write **functions** which let you extract out repeated code so that it can be easily reused. -3. As you start to write more powerful functions, you'll need a solid grounding in R's **data structures**, provided by [vectors]. +3. As you start to write more powerful functions, you'll need a solid grounding in R's **data structures**, provided by vectors, which we discuss in Chapter \@ref(vectors). You must master the four common atomic vectors, the three important S3 classes built on top of them, and understand the mysteries of the list and data frame. 4. Functions extract out repeated code, but you often need to repeat the same actions on different inputs. You need tools for **iteration** that let you do similar things again and again. - These tools include for loops and functional programming, which you'll learn about in [iteration]. + These tools include for loops and functional programming, which you'll learn about in Chapter \@ref(iteration). ## Learning more diff --git a/strings.Rmd b/strings.Rmd index c89873c..8ca0981 100644 --- a/strings.Rmd +++ b/strings.Rmd @@ -715,7 +715,7 @@ It returns a list: str_extract_all(more, colour_match) ``` -You'll learn more about lists in [lists](#lists) and [iteration]. +You'll learn more about lists in Section \@ref(lists) on lists and Chapter \@ref(iteration) on iteration. 
If you use `simplify = TRUE`, `str_extract_all()` will return a matrix with short matches expanded to the same length as the longest: diff --git a/workflow-basics.Rmd b/workflow-basics.Rmd index 1d80c39..c9c972c 100644 --- a/workflow-basics.Rmd +++ b/workflow-basics.Rmd @@ -51,7 +51,7 @@ some.people.use.periods And_aFew.People_RENOUNCEconvention ``` -We'll come back to code style later, in [functions]. +We'll come back to code style later, in Chapter \@ref(functions) on functions. You can inspect an object by typing its name: From b7c4498750dfb1fb34448fa0e5fdd757a336cebf Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mine=20=C3=87etinkaya-Rundel?= Date: Mon, 22 Feb 2021 13:00:06 +0000 Subject: [PATCH 12/16] Add stub for new section --- iteration.Rmd | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/iteration.Rmd b/iteration.Rmd index 6a981d7..9ad1f3f 100644 --- a/iteration.Rmd +++ b/iteration.Rmd @@ -1024,3 +1024,7 @@ x %>% accumulate(`+`) ``` What causes the bugs? + +## Case study + + From 90d168246cfaa522b8a9e73a442c5723572b166c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mine=20=C3=87etinkaya-Rundel?= Date: Mon, 22 Feb 2021 13:34:21 +0000 Subject: [PATCH 13/16] Fix chapter reference --- data-import.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/data-import.Rmd b/data-import.Rmd index 1f8d8fc..a7bfe0a 100644 --- a/data-import.Rmd +++ b/data-import.Rmd @@ -641,7 +641,7 @@ There are two alternatives: ``` Feather tends to be faster than RDS and is usable outside of R. -RDS supports list-columns (which you'll learn about in ); feather currently does not. +RDS supports list-columns (which you'll learn about in Chapter \@ref(list-columns)); feather currently does not.
```{r, include = FALSE} file.remove("challenge-2.csv") From 55cdb66df6fd2f3acfe5d5437c61b9d6505050d9 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mine=20=C3=87etinkaya-Rundel?= Date: Mon, 22 Feb 2021 13:41:06 +0000 Subject: [PATCH 14/16] Add feather to imports to see if it helps w/ build --- DESCRIPTION | 1 + 1 file changed, 1 insertion(+) diff --git a/DESCRIPTION b/DESCRIPTION index f9c5c99..d3015f5 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -14,6 +14,7 @@ URL: https://github.com/hadley/r4ds Depends: R (>= 3.1.0) Imports: + feather, gapminder, ggrepel, hexbin, From dc44bde9d900ff955bf2c549149b5f05e8c237f6 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mine=20=C3=87etinkaya-Rundel?= Date: Wed, 3 Mar 2021 15:20:37 +0000 Subject: [PATCH 15/16] Move up tidy data chapter --- _bookdown.yml | 1 + tidy.Rmd => data-tidy.Rmd | 4 +++- 2 files changed, 4 insertions(+), 1 deletion(-) rename tidy.Rmd => data-tidy.Rmd (99%) diff --git a/_bookdown.yml b/_bookdown.yml index e9fd408..062f6af 100644 --- a/_bookdown.yml +++ b/_bookdown.yml @@ -12,6 +12,7 @@ rmd_files: [ "workflow-basics.Rmd", "data-transform.Rmd", "data-import.Rmd", + "data-tidy.Rmd", "workflow-scripts.Rmd", "EDA.Rmd", "workflow-projects.Rmd", diff --git a/tidy.Rmd b/data-tidy.Rmd similarity index 99% rename from tidy.Rmd rename to data-tidy.Rmd index 16b852c..e847d70 100644 --- a/tidy.Rmd +++ b/data-tidy.Rmd @@ -1,4 +1,6 @@ -# Tidy data +# Data tidying {#data-tidy} + + ## Introduction From ad7fb0dd4bb900b64369db8ea5e226e6a9bb240c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mine=20=C3=87etinkaya-Rundel?= Date: Wed, 3 Mar 2021 17:13:14 +0000 Subject: [PATCH 16/16] Second crack and 2e structure --- _bookdown.yml | 17 ++++++++++------- data-types.Rmd | 29 +++++++++++++++++++++++++++++ index.Rmd | 2 +- logicals-numbers.Rmd | 3 +++ missing-values.Rmd | 3 +++ rectangle.Rmd | 2 +- row-wise.Rmd | 16 ---------------- vector-tools.Rmd | 3 +++ whole-game.Rmd | 3 +++ 9 files changed, 53 insertions(+), 25 deletions(-) create mode 100644 
data-types.Rmd create mode 100644 logicals-numbers.Rmd create mode 100644 missing-values.Rmd delete mode 100644 row-wise.Rmd create mode 100644 vector-tools.Rmd diff --git a/_bookdown.yml b/_bookdown.yml index 062f6af..250bab6 100644 --- a/_bookdown.yml +++ b/_bookdown.yml @@ -11,24 +11,27 @@ rmd_files: [ "data-visualize.Rmd", "workflow-basics.Rmd", "data-transform.Rmd", - "data-import.Rmd", "data-tidy.Rmd", + "data-import.Rmd", "workflow-scripts.Rmd", "EDA.Rmd", "workflow-projects.Rmd", - "wrangle.Rmd", + "data-types.Rmd", "tibble.Rmd", - "tidy.Rmd", - "rectangle.Rmd", "relational-data.Rmd", - "list-columns.Rmd", - "column-wise.Rmd", - "row-wise.Rmd", + "logicals-numbers.Rmd", + "vector-tools.Rmd", + "missing-values.Rmd", "strings.Rmd", "factors.Rmd", "datetimes.Rmd", + "wrangle.Rmd", + "column-wise.Rmd", + "list-columns.Rmd", + "rectangle.Rmd", + "import.Rmd", "import-rectangular.Rmd", "import-spreadsheets.Rmd", diff --git a/data-types.Rmd b/data-types.Rmd new file mode 100644 index 0000000..a465654 --- /dev/null +++ b/data-types.Rmd @@ -0,0 +1,29 @@ +# (PART) Data types {.unnumbered} + +# Introduction {#data-types-intro} + +In this part of the book, you'll learn about data types, ... + + + +This part of the book proceeds as follows: + +- In Chapter \@ref(tibbles), you'll learn about the variant of the data frame that we use in this book: the **tibble**. You'll learn what makes them different from regular data frames, and how you can construct them "by hand". + +Data wrangling also encompasses data transformation, which you've already learned a little about. +Now we'll focus on new skills for specific types of data you will frequently encounter in practice: + +- Chapter \@ref(relational-data) will give you tools for working with multiple interrelated datasets. + + + + + + + +- Chapter \@ref(strings) will give you tools for working with strings and introduce regular expressions, a powerful tool for manipulating strings. 
+ +- Chapter \@ref(factors) will introduce factors -- how R stores categorical data. + They are used when a variable has a fixed set of possible values, or when you want to use a non-alphabetical ordering of a string. + +- Chapter \@ref(dates-and-times) will give you the key tools for working with dates and date-times. diff --git a/index.Rmd b/index.Rmd index 7d5d962..739b8cc 100644 --- a/index.Rmd +++ b/index.Rmd @@ -13,7 +13,7 @@ documentclass: book # Welcome {.unnumbered} -Buy from amazon This is the website for the work-in-progress 2nd edition of **"R for Data Science"**. This book will teach you how to do data science with R: You'll learn how to get your data into R, get it into the most useful structure, transform it, visualise it and model it. +[![Buy from amazon](cover.png){.cover width="250"}](http://amzn.to/2aHLAQ1) This is the website for the work-in-progress 2nd edition of **"R for Data Science"**. This book will teach you how to do data science with R: You'll learn how to get your data into R, get it into the most useful structure, transform it, visualise it and model it. In this book, you will find a practicum of skills for data science. Just as a chemist learns how to clean test tubes and stock a lab, you'll learn how to clean data and draw plots---and many other things besides. These are the skills that allow data science to happen, and here you will find the best practices for doing each of these things with R. 
diff --git a/logicals-numbers.Rmd b/logicals-numbers.Rmd new file mode 100644 index 0000000..656a8c8 --- /dev/null +++ b/logicals-numbers.Rmd @@ -0,0 +1,3 @@ +# Logicals and numbers + +## Introduction diff --git a/missing-values.Rmd b/missing-values.Rmd new file mode 100644 index 0000000..f08b770 --- /dev/null +++ b/missing-values.Rmd @@ -0,0 +1,3 @@ +# Missing values + +## Introduction diff --git a/rectangle.Rmd b/rectangle.Rmd index 53624ca..b999fed 100644 --- a/rectangle.Rmd +++ b/rectangle.Rmd @@ -1,4 +1,4 @@ -# Rectangle data +# Rectangling data ## Introduction diff --git a/row-wise.Rmd b/row-wise.Rmd deleted file mode 100644 index 8c76617..0000000 --- a/row-wise.Rmd +++ /dev/null @@ -1,16 +0,0 @@ -# Row-wise operations - -## Introduction - - - -### Prerequisites - -In this chapter we'll continue using dplyr. -dplyr is a member of the core tidyverse. - -```{r setup, message = FALSE} -library(tidyverse) -``` - - diff --git a/vector-tools.Rmd b/vector-tools.Rmd new file mode 100644 index 0000000..463ef46 --- /dev/null +++ b/vector-tools.Rmd @@ -0,0 +1,3 @@ +# General vector tools + +## Introduction diff --git a/whole-game.Rmd b/whole-game.Rmd index 6cbc884..7174cf4 100644 --- a/whole-game.Rmd +++ b/whole-game.Rmd @@ -17,6 +17,9 @@ In this part of the book you will learn some useful tools that have an immediate - Visualisation alone is typically not enough, so in Chapter \@ref(data-transform) you'll learn the key verbs that allow you to select important variables, filter out key observations, create new variables, and compute summaries. +- In Chapter \@ref(data-tidy), you'll learn about tidy data, a consistent way of storing your data that makes transformation, visualisation, and modelling easier. + You'll learn the underlying principles, and how to get your data into a tidy form. + - Before you can transform and visualise your data, you need to first get your data into R. 
In Chapter \@ref(data-import) you'll learn the basics of getting plain-text rectangular data into R.