diff --git a/EDA.Rmd b/EDA.Rmd index a2aee3f..3827a2f 100644 --- a/EDA.Rmd +++ b/EDA.Rmd @@ -272,7 +272,7 @@ You'll need to figure out what caused them (e.g. a data entry error) and disclos What happens if you leave `binwidth` unset? What happens if you try and zoom so only half a bar shows? -## Missing values +## Missing values {#missing-values-eda} If you've encountered unusual values in your dataset, and simply want to move on to the rest of your analysis, you have two options. diff --git a/_bookdown.yml b/_bookdown.yml index 250bab6..0917eac 100644 --- a/_bookdown.yml +++ b/_bookdown.yml @@ -17,7 +17,7 @@ rmd_files: [ "EDA.Rmd", "workflow-projects.Rmd", - "data-types.Rmd", + "transform.Rmd", "tibble.Rmd", "relational-data.Rmd", "logicals-numbers.Rmd", @@ -26,11 +26,7 @@ rmd_files: [ "strings.Rmd", "factors.Rmd", "datetimes.Rmd", - - "wrangle.Rmd", "column-wise.Rmd", - "list-columns.Rmd", - "rectangle.Rmd", "import.Rmd", "import-rectangular.Rmd", @@ -39,6 +35,10 @@ rmd_files: [ "import-webscrape.Rmd", "import-other.Rmd", + "tidy.Rmd", + "list-columns.Rmd", + "rectangle.Rmd", + "program.Rmd", "pipes.Rmd", "functions.Rmd", diff --git a/column-wise.Rmd b/column-wise.Rmd index 4f1ac34..93ed769 100644 --- a/column-wise.Rmd +++ b/column-wise.Rmd @@ -1,4 +1,4 @@ -# Column-wise operations +# Column-wise operations {#column-wise} ## Introduction diff --git a/data-tidy.Rmd b/data-tidy.Rmd index d9d214f..6cba19f 100644 --- a/data-tidy.Rmd +++ b/data-tidy.Rmd @@ -563,7 +563,7 @@ table1 %>% baker ``` -## Missing values +## Missing values {#missing-values-tidy} Changing the representation of a dataset brings up an important subtlety of missing values. 
Surprisingly, a value can be missing in one of two possible ways: diff --git a/data-transform.Rmd b/data-transform.Rmd index 066f7d5..e3477a0 100644 --- a/data-transform.Rmd +++ b/data-transform.Rmd @@ -31,10 +31,11 @@ flights ``` You might notice that this data frame prints a little differently from other data frames you might have used in the past: it only shows the first few rows and all the columns that fit on one screen. +It also displays the number of rows (`r format(nrow(nycflights13::flights), big.mark = ",")`) and columns (`r ncol(nycflights13::flights)`). (To see the whole dataset, you can run `View(flights)` which will open the dataset in the RStudio viewer). It prints differently because it's a **tibble**. Tibbles are data frames, but slightly tweaked to work better in the tidyverse. -For now, you don't need to worry about the differences; we'll come back to tibbles in more detail in [wrangle](#wrangle-intro). +For now, you don't need to worry about the differences; we'll come back to tibbles in more detail in Chapter \@ref(tibbles). You might also have noticed the row of three (or four) letter abbreviations under the column names. These describe the type of each variable: @@ -43,7 +44,7 @@ These describe the type of each variable: - `dbl` stands for doubles, or real numbers. -- `chr` stands for character vectors, or strings. +- `chr` stands for characters, or strings. - `dttm` stands for date-times (a date + a time). @@ -120,8 +121,8 @@ There's another common problem you might encounter when using `==`: floating poi These results might surprise you! ```{r} -sqrt(2) ^ 2 == 2 -1 / 49 * 49 == 1 +(sqrt(2) ^ 2) == 2 +(1 / 49 * 49) == 1 ``` Computers use finite precision arithmetic (they obviously can't store an infinite number of digits!) so remember that every number you see is an approximation. 
@@ -138,7 +139,7 @@ Multiple arguments to `filter()` are combined with "and": every expression must For other types of combinations, you'll need to use Boolean operators yourself: `&` is "and", `|` is "or", and `!` is "not". Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations. -```{r bool-ops, echo = FALSE, fig.cap = "Complete set of boolean operations. `x` is the left-hand circle, `y` is the right-hand circle, and the shaded region show which parts each operator selects."} +```{r bool-ops, echo = FALSE, fig.cap = "Complete set of boolean operations. `x` is the left-hand circle, `y` is the right-hand circle, and the shaded region show which parts each operator selects.", fig.alt = "Six Venn diagrams, each explaining a given logical operator. The circles (sets) in each of the Venn diagrams represent x and y. 1. y & !x is y but none of x, x & y is the intersection of x and y, x & !y is x but none of y, x is all of x none of y, xor(x, y) is everything except the intersection of x and y, y is all of y none of x, and x | y is everything."} knitr::include_graphics("diagrams/transform-logical.png") ``` @@ -151,7 +152,7 @@ filter(flights, month == 11 | month == 12) The order of operations doesn't work like English. You can't write `filter(flights, month == (11 | 12))`, which you might literally translate into "finds all flights that departed in November or December". Instead it finds all months that equal `11 | 12`, an expression that evaluates to `TRUE`. -In a numeric context (like here), `TRUE` becomes one, so this finds all flights in January, not November or December. +In a numeric context (like here), `TRUE` becomes `1`, so this finds all flights in January, not November or December. This is quite confusing! A useful short-hand for this problem is `x %in% y`. @@ -172,15 +173,15 @@ filter(flights, arr_delay <= 120, dep_delay <= 120) As well as `&` and `|`, R also has `&&` and `||`. Don't use them here! 
-You'll learn when you should use them in [conditional execution]. +You'll learn when you should use them in Section \@ref(conditional-execution) on conditional execution. Whenever you start using complicated, multipart expressions in `filter()`, consider making them explicit variables instead. That makes it much easier to check your work. You'll learn how to create new variables shortly. -### Missing values +### Missing values {#missing-values-filter} -One important feature of R that can make comparison tricky are missing values, or `NA`s ("not availables"). +One important feature of R that can make comparison tricky is missing values, or `NA`s ("not availables"). `NA` represents an unknown value so missing values are "contagious": almost any operation involving an unknown value will also be unknown. ```{r} @@ -277,17 +278,17 @@ arrange(df, desc(x)) ### Exercises -1. How could you use `arrange()` to sort all missing values to the start? - (Hint: use `is.na()`). - -2. Sort `flights` to find the most delayed flights. +1. Sort `flights` to find the flights with longest departure delays. Find the flights that left earliest. -3. Sort `flights` to find the fastest (highest speed) flights. +2. Sort `flights` to find the fastest (highest speed) flights. -4. Which flights travelled the farthest? +3. Which flights travelled the farthest? Which travelled the shortest? +4. How could you use `arrange()` to sort all missing values to the start? + (Hint: use `!is.na()`). + ## Select columns with `select()` {#select} It's not uncommon to get datasets with hundreds or even thousands of variables. @@ -326,11 +327,11 @@ Instead, use `rename()`, which is a variant of `select()` that keeps all the var rename(flights, tail_num = tailnum) ``` -Another option is to use `select()` in conjunction with the `everything()` helper. -This is useful if you have a handful of variables you'd like to move to the start of the data frame. 
+If you want to move certain variables to the start of the data frame but not drop the others, you can do this in two ways: using `select()` in conjunction with the `everything()` helper or using `relocate()`. ```{r} select(flights, time_hour, air_time, everything()) +relocate(flights, time_hour, air_time) ``` ### Exercises @@ -343,7 +344,7 @@ select(flights, time_hour, air_time, everything()) Why might it be helpful in conjunction with this vector? ```{r} - vars <- c("year", "month", "day", "dep_delay", "arr_delay") + variables <- c("year", "month", "day", "dep_delay", "arr_delay") ``` 4. Does the result of running the following code surprise you? @@ -446,7 +447,7 @@ There's no way to list every possible function that you might use, but here's a cummean(x) ``` -- Logical comparisons, `<`, `<=`, `>`, `>=`, `!=`, and `==`, which you learned about earlier. +- Logical comparisons: `<`, `<=`, `>`, `>=`, `!=`, and `==`, which you learned about earlier. If you're doing a complex sequence of logical operations it's often a good idea to store the interim values in new variables so you can check that each step is working as expected. - Ranking: there are a number of ranking functions, but you should start with `min_rank()`. @@ -472,6 +473,7 @@ There's no way to list every possible function that you might use, but here's a ### Exercises ```{r, eval = FALSE, echo = FALSE} +# For data checking, not used in results shown in book flights <- flights %>% mutate( dep_time = hour * 60 + minute, arr_time = (arr_time %/% 100) * 60 + (arr_time %% 100), @@ -518,11 +520,11 @@ summarise(flights, delay = mean(dep_delay, na.rm = TRUE)) `summarise()` is not terribly useful unless we pair it with `group_by()`. This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame they'll be automatically applied "by group". 
-For example, if we applied exactly the same code to a data frame grouped by date, we get the average delay per date: +For example, if we applied exactly the same code to a data frame grouped by month, we get the average delay per month: ```{r} -by_day <- group_by(flights, year, month, day) -summarise(by_day, delay = mean(dep_delay, na.rm = TRUE)) +by_month <- group_by(flights, month) +summarise(by_month, delay = mean(dep_delay, na.rm = TRUE)) ``` Together `group_by()` and `summarise()` provide one of the tools that you'll use most commonly when working with dplyr: grouped summaries. @@ -558,7 +560,7 @@ There are three steps to prepare this data: 3. Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport. -This code is a little frustrating to write because we have to give each intermediate data frame a name, even though we don't care about it. +This code is a little frustrating to write because we have to give each intermediate data frame a name, even though we don't care about them. Naming things is hard, so this slows down our analysis. There's another way to tackle the same problem with the pipe, `%>%`: @@ -586,14 +588,14 @@ Working with the pipe is one of the key criteria for belonging to the tidyverse. The only exception is ggplot2: it was written before the pipe was discovered. Unfortunately, the next iteration of ggplot2, ggvis, which does use the pipe, isn't quite ready for prime time yet. -### Missing values +### Missing values {#missing-values-summarise} You may have wondered about the `na.rm` argument we used above. What happens if we don't set it? 
```{r} flights %>% - group_by(year, month, day) %>% + group_by(month) %>% summarise(mean = mean(dep_delay)) ``` @@ -603,11 +605,11 @@ Fortunately, all aggregation functions have an `na.rm` argument which removes th ```{r} flights %>% - group_by(year, month, day) %>% + group_by(month) %>% summarise(mean = mean(dep_delay, na.rm = TRUE)) ``` -In this case, where missing values represent cancelled flights, we could also tackle the problem by first removing the cancelled flights. +In this case, missing values represent cancelled flights, therefore we could also tackle the problem by first removing the cancelled flights. We'll save this dataset so we can reuse it in the next few examples. ```{r} @@ -615,10 +617,63 @@ not_cancelled <- flights %>% filter(!is.na(dep_delay), !is.na(arr_delay)) not_cancelled %>% - group_by(year, month, day) %>% + group_by(month) %>% summarise(mean = mean(dep_delay)) ``` +### Grouping by multiple variables + +You can group a data frame by multiple variables as well. +Note that the grouping information is printed on top of the output. +The number in the square brackets indicates how many groups are created. + +```{r} +daily <- group_by(flights, year, month, day) +daily +``` + +When you group by multiple variables, each summary peels off one level of the grouping by default, and a message is printed that tells you how you can change this behaviour. + +```{r} +summarise(daily, flights = n()) +``` + +If you're happy with this behaviour, you can also explicitly define it, in which case the message won't be printed out. + +```{r} +summarise(daily, flights = n(), .groups = "drop_last") +``` + +Or you can change the default behaviour by setting a different value, e.g. `"drop"` for dropping all levels of grouping or `"keep"` for same grouping structure as `daily`. 
+ +```{r} +# Note the difference between the grouping structures +summarise(daily, flights = n(), .groups = "drop") +summarise(daily, flights = n(), .groups = "keep") +``` + +The fact that each summary peels off one level of the grouping by default makes it easy to progressively roll up a dataset: + +```{r} +(per_day <- summarise(daily, flights = n())) +(per_month <- summarise(per_day, flights = sum(flights))) +(per_year <- summarise(per_month, flights = sum(flights))) +``` + +Be careful when progressively rolling up summaries: it's OK for sums and counts, but you need to think about weighting means and variances, and it's not possible to do it exactly for rank-based statistics like the median. +In other words, the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median. + +### Ungrouping + +You might also want to remove grouping outside of `summarise()`. +You can do this and return to operations on ungrouped data using `ungroup()`. + +```{r} +daily %>% + ungroup() %>% # no longer grouped by date + summarise(flights = n()) # all flights +``` + ### Counts Whenever you do any aggregation, it's always a good idea to include either a count (`n()`), or a count of non-missing values (`sum(!is.na(x))`). @@ -664,7 +719,7 @@ It's a bit painful that you have to switch from `%>%` to `+`, but once you get t delays %>% filter(n > 25) %>% ggplot(mapping = aes(x = n, y = delay)) + - geom_point(alpha = 1/10) + geom_point(alpha = 1/10) ``` ------------------------------------------------------------------------ @@ -722,8 +777,17 @@ Just using means, counts, and sum can get you a long way, but R provides many ot - Measures of location: we've used `mean(x)`, but `median(x)` is also useful. The mean is the sum divided by the length; the median is a value where 50% of `x` is above it, and 50% is below it. 
+ ```{r} + not_cancelled %>% + group_by(month) %>% + summarise( + med_arr_delay = median(arr_delay), + med_dep_delay = median(dep_delay) + ) + ``` + It's sometimes useful to combine aggregation with logical subsetting. - We haven't talked about this sort of subsetting yet, but you'll learn more about it in [subsetting]. + We haven't talked about this sort of subsetting yet, but you'll learn more about it in Section \@ref(vector-subsetting). ```{r} not_cancelled %>% @@ -802,6 +866,13 @@ Just using means, counts, and sum can get you a long way, but R provides many ot count(dest) ``` + Just like with `group_by()`, you can also provide multiple variables to `count()`. + + ```{r} + not_cancelled %>% + count(carrier, dest) + ``` + You can optionally provide a weight variable. For example, you could use this to "count" (sum) the total number of miles a plane flew: @@ -827,31 +898,6 @@ Just using means, counts, and sum can get you a long way, but R provides many ot summarise(hour_prop = mean(arr_delay > 60)) ``` -### Grouping by multiple variables - -When you group by multiple variables, each summary peels off one level of the grouping. -That makes it easy to progressively roll up a dataset: - -```{r} -daily <- group_by(flights, year, month, day) -(per_day <- summarise(daily, flights = n())) -(per_month <- summarise(per_day, flights = sum(flights))) -(per_year <- summarise(per_month, flights = sum(flights))) -``` - -Be careful when progressively rolling up summaries: it's OK for sums and counts, but you need to think about weighting means and variances, and it's not possible to do it exactly for rank-based statistics like the median. -In other words, the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median. - -### Ungrouping - -If you need to remove grouping, and return to operations on ungrouped data, use `ungroup()`. 
- -```{r} -daily %>% - ungroup() %>% # no longer grouped by date - summarise(flights = n()) # all flights -``` - ### Exercises 1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. @@ -886,7 +932,7 @@ daily %>% 6. What does the `sort` argument to `count()` do. When might you use it? -## Grouped mutates (and filters) +## Grouped mutates and filters Grouping is most useful in conjunction with `summarise()`, but you can also do convenient operations with `mutate()` and `filter()`: diff --git a/logicals-numbers.Rmd b/logicals-numbers.Rmd index 656a8c8..deaecbb 100644 --- a/logicals-numbers.Rmd +++ b/logicals-numbers.Rmd @@ -1,3 +1,3 @@ -# Logicals and numbers +# Logicals and numbers {#logicals-numbers} ## Introduction diff --git a/missing-values.Rmd b/missing-values.Rmd index f08b770..abc43a9 100644 --- a/missing-values.Rmd +++ b/missing-values.Rmd @@ -1,3 +1,3 @@ -# Missing values +# Missing values {#missing-values} ## Introduction diff --git a/rectangle.Rmd b/rectangle.Rmd index b999fed..895d79c 100644 --- a/rectangle.Rmd +++ b/rectangle.Rmd @@ -1,4 +1,4 @@ -# Rectangling data +# Data rectangling {#rectangle-data} ## Introduction diff --git a/tidy.Rmd b/tidy.Rmd new file mode 100644 index 0000000..f705612 --- /dev/null +++ b/tidy.Rmd @@ -0,0 +1,21 @@ +# (PART) Tidy {.unnumbered} + +# Introduction {#wrangle-intro} + +In this part of the book, you'll learn about data tidying, the art of getting your data into R in a useful form for visualisation and modelling. +Data wrangling is very important: without it you can't work with your own data! +There are three main parts to data wrangling: + +```{r echo = FALSE, out.width = "75%"} +knitr::include_graphics("diagrams/data-science-wrangle.png") +``` + + + +This part of the book proceeds as follows: + +- Chapter \@ref(list-columns) will give you tools for working with list columns --- data stored in columns of a tibble as lists. 
+ +- In Chapter \@ref(rectangle-data), you'll learn about hierarchical data formats and how to turn them into rectangular data via unnesting. + + diff --git a/data-types.Rmd b/transform.Rmd similarity index 60% rename from data-types.Rmd rename to transform.Rmd index a465654..f3f3d29 100644 --- a/data-types.Rmd +++ b/transform.Rmd @@ -1,24 +1,28 @@ -# (PART) Data types {.unnumbered} +# (PART) Transform {.unnumbered} # Introduction {#data-types-intro} -In this part of the book, you'll learn about data types, ... +In this part of the book, you'll learn about various types of data the columns of a data frame can contain and how to transform them. +The transformations you might want to apply to a column vary depending on the type of data you're working with, for example if you have text strings you might want to extract or remove certain pieces while if you have numerical data, you might want to rescale them. +You've already learned a little about data wrangling in the previous part. +Now we'll focus on new skills for specific types of data you will frequently encounter in practice. This part of the book proceeds as follows: - In Chapter \@ref(tibbles), you'll learn about the variant of the data frame that we use in this book: the **tibble**. You'll learn what makes them different from regular data frames, and how you can construct them "by hand". - -Data wrangling also encompasses data transformation, which you've already learned a little about. -Now we'll focus on new skills for specific types of data you will frequently encounter in practice: - - Chapter \@ref(relational-data) will give you tools for working with multiple interrelated datasets. +- Chapter \@ref(logicals-numbers) ... +- Chapter \@ref(vector-tools) ... + +- Chapter \@ref(missing-values)... + - Chapter \@ref(strings) will give you tools for working with strings and introduce regular expressions, a powerful tool for manipulating strings. 
@@ -27,3 +31,5 @@ Now we'll focus on new skills for specific types of data you will frequently enc They are used when a variable has a fixed set of possible values, or when you want to use a non-alphabetical ordering of a string. - Chapter \@ref(dates-and-times) will give you the key tools for working with dates and date-times. + +- Chapter \@ref(column-wise) will give you tools for performing the same operation on multiple columns. diff --git a/vector-tools.Rmd b/vector-tools.Rmd index 463ef46..00a569d 100644 --- a/vector-tools.Rmd +++ b/vector-tools.Rmd @@ -1,3 +1,3 @@ -# General vector tools +# Vector tools ## Introduction diff --git a/vectors.Rmd b/vectors.Rmd index 32c172d..54ef76b 100644 --- a/vectors.Rmd +++ b/vectors.Rmd @@ -150,7 +150,7 @@ pryr::object_size(y) `y` doesn't take up 1,000x as much memory as `x`, because each element of `y` is just a pointer to that same string. A pointer is 8 bytes, so 1000 pointers to a 152 B string is 8 \* 1000 + 152 = 8.14 kB. -### Missing values +### Missing values {#missing-values-vectors} Note that each type of atomic vector has its own missing value: diff --git a/workflow-basics.Rmd b/workflow-basics.Rmd index c9c972c..8ead5a1 100644 --- a/workflow-basics.Rmd +++ b/workflow-basics.Rmd @@ -52,6 +52,7 @@ And_aFew.People_RENOUNCEconvention ``` We'll come back to code style later, in Chapter \@ref(functions) on functions. +If you're interested in learning more about about best practices for code style, I also recommend The tidyverse style guide: [https://style.tidyverse.org](https://style.tidyverse.org/). You can inspect an object by typing its name: @@ -105,7 +106,7 @@ function_name(arg1 = val1, arg2 = val2, ...) Let's try using `seq()` which makes regular **seq**uences of numbers and, while we're at it, learn more helpful features of RStudio. Type `se` and hit TAB. A popup shows you possible completions. -Specify `seq()` by typing more (a "q") to disambiguate, or by using ↑/↓ arrows to select. 
+Specify `seq()` by typing more (a `q`) to disambiguate, or by using ↑/↓ arrows to select. Notice the floating tooltip that pops up, reminding you of the function's arguments and purpose. If you want more help, press F1 to get all the details in the help tab in the lower right pane. diff --git a/wrangle.Rmd b/wrangle.Rmd deleted file mode 100644 index a3caa27..0000000 --- a/wrangle.Rmd +++ /dev/null @@ -1,43 +0,0 @@ -# (PART) Wrangle {.unnumbered} - -# Introduction {#wrangle-intro} - -In this part of the book, you'll learn about data wrangling, the art of getting your data into R in a useful form for visualisation and modelling. -Data wrangling is very important: without it you can't work with your own data! -There are three main parts to data wrangling: - -```{r echo = FALSE, out.width = "75%"} -knitr::include_graphics("diagrams/data-science-wrangle.png") -``` - - - -This part of the book proceeds as follows: - -- In Chapter \@ref(tibbles), you'll learn about the variant of the data frame that we use in this book: the **tibble**. - You'll learn what makes them different from regular data frames, and how you can construct them "by hand". - -- In Chapter \@ref(tidy-data), you'll learn about tidy data, a consistent way of storing your data that makes transformation, visualisation, and modelling easier. - You'll learn the underlying principles, and how to get your data into a tidy form. - -- In Chapter \@ref(rectangle-data), you'll learn about hierarchical data formats and how to turn them into rectangular data via unnesting. - -- Chapter \@ref(column-wise-operations) will give you tools for performing the same operation on multiple columns. - -- Chapter \@ref(row-wise-operations) will give you tools for performing operations over rows. - -Data wrangling also encompasses data transformation, which you've already learned a little about. 
-Now we'll focus on new skills for three specific types of data you will frequently encounter in practice: - -- Chapter \@ref(relational-data) will give you tools for working with multiple interrelated datasets. - -- Chapter \@ref(list-columns) will give you tools for working with list columns --- data stored in columns of a tibble as lists. - -- Chapter \@ref(strings) will give you tools for working with strings and introduce regular expressions, a powerful tool for manipulating strings. - -- Chapter \@ref(factors) will introduce factors --- how R stores categorical data. - They are used when a variable has a fixed set of possible values, or when you want to use a non-alphabetical ordering of a string. - -- Chapter \@ref(dates-and-times) will give you the key tools for working with dates and date-times. - -
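Note on the relocated "Grouping by multiple variables" text in `data-transform.Rmd`: the caveat that sums roll up exactly while means need weighting and medians do not roll up at all can be checked with a small base R sketch. The `delay` and `grp` vectors below are hypothetical toy data, not taken from `nycflights13`, and `tapply()` stands in here for `group_by()` plus `summarise()` so that the example runs without dplyr.

```r
# Hypothetical toy data: two groups of unequal size (not from nycflights13).
delay <- c(1, 2, 3, 4, 100)
grp   <- c("a", "a", "a", "b", "b")

# Groupwise summaries, standing in for group_by() + summarise().
group_sums    <- as.numeric(tapply(delay, grp, sum))
group_means   <- as.numeric(tapply(delay, grp, mean))
group_medians <- as.numeric(tapply(delay, grp, median))
group_sizes   <- as.numeric(tapply(delay, grp, length))

# Sums roll up exactly: the sum of groupwise sums is the overall sum.
sum_rolls_up <- sum(group_sums) == sum(delay)

# Means roll up only when weighted by group size.
naive_mean    <- mean(group_means)
weighted_mean <- weighted.mean(group_means, group_sizes)
overall_mean  <- mean(delay)

# Medians do not roll up: the median of groupwise medians
# generally differs from the overall median.
median_of_medians <- median(group_medians)
overall_median    <- median(delay)

print(c(naive = naive_mean, weighted = weighted_mean, overall = overall_mean))
print(c(median_of_medians = median_of_medians, overall = overall_median))
```

With unequal group sizes, the unweighted mean of group means and the median of group medians both drift away from the overall values, which is why the text recommends rolling up only sums and counts without further thought.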