Improve cross-references

* Fix broken links
* Update chapter links
Hadley Wickham 2022-09-29 10:49:03 -05:00
parent d9a86edcf0
commit faeeb564a4
18 changed files with 49 additions and 80 deletions

View File

@ -919,7 +919,7 @@ Typically, the first one or two arguments to a function are so important that yo
The first two arguments to `ggplot()` are `data` and `mapping`, and the first two arguments to `aes()` are `x` and `y`.
In the remainder of the book, we won't supply those names.
That saves typing, and, by reducing the amount of boilerplate, makes it easier to see what's different between plots.
That's a really important programming concern that we'll come back to in [Chapter -@sec-functions].
That's a really important programming concern that we'll come back to in @sec-functions.
Rewriting the previous plot more concisely yields:

View File

@ -1,26 +0,0 @@
# Column-wise operations {#sec-column-wise}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("drafting")
```
## Introduction
<!--# TO DO: Write introduction. -->
### Prerequisites
In this chapter we'll continue using dplyr.
dplyr is a member of the core tidyverse.
```{r}
#| label: setup
#| message: false
library(tidyverse)
```
<!--# TO DO: Write chapter around across, etc. -->

View File

@ -11,7 +11,6 @@ status("polishing")
Working with data provided by R packages is a great way to learn the tools of data science, but at some point you want to stop learning and start working with your own data.
In this chapter, you'll learn how to read plain-text rectangular files into R.
Here, we'll only scratch the surface of data import, but many of the principles will translate to other forms of data, which we'll come back to in @sec-wrangle.
### Prerequisites
@ -116,7 +115,7 @@ There are two cases where you might want to tweak this behavior:
read_csv("1,2,3\n4,5,6", col_names = FALSE)
```
(`"\n"` is a convenient shortcut for adding a new line. You'll learn more about it and other types of string escape in [Chapter -@sec-strings].)
(`"\n"` is a convenient shortcut for adding a new line. You'll learn more about it and other types of string escape in @sec-strings.)
Alternatively, you can pass `col_names` a character vector, which will be used as the column names:
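A minimal sketch of what that call might look like (the column names here are arbitrary placeholders):

```{r}
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
```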
@ -171,7 +170,7 @@ Another common task after reading in data is to consider variable types.
For example, `meal_type` is a categorical variable with a known set of possible values.
In R, factors can be used to work with categorical variables.
We can convert this variable to a factor using the `factor()` function.
You'll learn more about factors in [Chapter -@sec-factors].
You'll learn more about factors in @sec-factors.
```{r}
students <- students |>
@ -184,7 +183,7 @@ students
Note that the values in the `meal_type` variable have stayed exactly the same, but the type of variable denoted underneath the variable name has changed from character (`<chr>`) to factor (`<fct>`).
Before you move on to analyzing these data, you'll probably want to fix the `age` column as well: currently it's a character variable because of the one observation that is typed out as `five` instead of a numeric `5`.
We discuss the details of fixing this issue in [Chapter -@sec-import-spreadsheets] in further detail.
We discuss how to fix this issue in more detail in @sec-import-spreadsheets.
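One plausible fix, sketched here with `if_else()` and `parse_number()` (the approach taken in that chapter may differ):

```{r}
students |>
  mutate(age = parse_number(if_else(age == "five", "5", age)))
```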
### Compared to base R
@ -331,7 +330,7 @@ file.remove("students.rds")
In this chapter, you've learned how to use readr to load rectangular flat files from disk into R.
You've learned how csv files work, some of the problems you might encounter, and how to overcome them.
We'll come to data import a few times in this book: @sec-import-databases will show you how to load data from databases, @sec-import-spreadsheets from Excel and googlesheets, @sec-import-rectangling from JSON, and @sec-import-scraping from websites.
We'll come to data import a few times in this book: @sec-import-databases will show you how to load data from databases, @sec-import-spreadsheets from Excel and googlesheets, @sec-rectangling from JSON, and @sec-scraping from websites.
Now that you're writing a substantial amount of R code, it's time to learn more about organizing your code into files and directories.
In the next chapter, you'll learn all about the advantages of scripts and projects, and some of the many tools that they provide to make your life easier.

View File

@ -202,7 +202,7 @@ Take 2 Pac's "Baby Don't Cry", for example.
The above output suggests that it was only in the top 100 for 7 weeks, and all the remaining weeks are filled in with missing values.
These `NA`s don't really represent unknown observations; they're forced to exist by the structure of the dataset[^data-tidy-1], so we can ask `pivot_longer()` to get rid of them by setting `values_drop_na = TRUE`:
[^data-tidy-1]: We'll come back to this idea in [Chapter -@sec-missing-values].
[^data-tidy-1]: We'll come back to this idea in @sec-missing-values.
```{r}
billboard |>
@ -218,7 +218,7 @@ You might also wonder what happens if a song is in the top 100 for more than 76
We can't tell from this data, but you might guess that additional columns `wk77`, `wk78`, ... would be added to the dataset.
This data is now tidy, but we could make future computation a bit easier by converting `week` into a number using `mutate()` and `parse_number()`.
You'll learn more about `parse_number()` and friends in [Chapter -@sec-data-import].
You'll learn more about `parse_number()` and friends in @sec-data-import.
```{r}
billboard_tidy <- billboard |>
@ -365,7 +365,7 @@ who2 |>
)
```
An alternative to `names_sep` is `names_pattern`, which you can use to extract variables from more complicated naming scenarios, once you've learned about regular expressions in [Chapter -@sec-regular-expressions].
An alternative to `names_sep` is `names_pattern`, which you can use to extract variables from more complicated naming scenarios, once you've learned about regular expressions in @sec-regular-expressions.
Conceptually, this is only a minor variation on the simpler case you've already seen.
@fig-pivot-multiple-names shows the basic idea: now, instead of the column names pivoting into a single column, they pivot into multiple columns.
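As an illustration of `names_pattern`, here's a sketch using the classic `who` dataset that ships with tidyr (not the `who2` variant used above): a regular expression with three groups pulls the diagnosis, gender, and age out of column names like `new_sp_m014`.

```{r}
who |>
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    names_to = c("diagnosis", "gender", "age"),
    names_pattern = "new_?(.*)_(.)(.*)",
    values_to = "count"
  )
```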
@ -540,7 +540,7 @@ df |>
It then fills in all the missing values using the data in the input.
In this case, not every cell in the output has a corresponding value in the input as there's no entry for id "B" and name "z", so that cell remains missing.
We'll come back to this idea that `pivot_wider()` can "make" missing values in [Chapter -@sec-missing-values].
We'll come back to this idea that `pivot_wider()` can "make" missing values in @sec-missing-values.
You might also wonder what happens if there are multiple rows in the input that correspond to one cell in the output.
The example below has two rows that correspond to id "A" and name "x":
@ -665,7 +665,7 @@ cluster_id <- cluster$cluster |>
cluster_id
```
You could then combine this back with the original data using one of the joins you'll learn about in [Chapter -@sec-relational-data].
You could then combine this back with the original data using one of the joins you'll learn about in @sec-joins.
```{r}
gapminder |> left_join(cluster_id)

View File

@ -48,7 +48,7 @@ If you've used R before, you might notice that this data frame prints a little d
That's because it's a **tibble**, a special type of data frame used by the tidyverse to avoid some common gotchas.
The most important difference is the way it prints: tibbles are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen.
To see everything, use `View(flights)` to open the dataset in the RStudio viewer.
We'll come back to other important differences in [Chapter -@sec-tibbles].
We'll come back to other important differences in @sec-tibbles.
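If you'd rather stay in the console, a couple of standard alternatives (a sketch):

```{r}
flights |> print(n = 3, width = Inf)  # control rows shown and print every column
glimpse(flights)                      # one line per variable
```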
You might have noticed the short abbreviations that follow each column name.
These tell you the type of each variable: `<int>` is short for integer, `<dbl>` is short for double (aka real numbers), `<chr>` for character (aka strings), and `<dttm>` for date-time.
@ -85,7 +85,7 @@ The code starts with the `flights` dataset, then filters it, then groups it, the
We'll come back to the pipe and its alternatives in @sec-pipes.
dplyr's verbs are organised into four groups based on what they operate on: **rows**, **columns**, **groups**, or **tables**.
In the following sections you'll learn the most important verbs for rows, columns, and groups, then we'll come back to verb that work on tables in [Chapter -@sec-relational-data].
In the following sections you'll learn the most important verbs for rows, columns, and groups, then we'll come back to verbs that work on tables in @sec-joins.
Let's dive in!
## Rows
@ -129,7 +129,7 @@ flights |>
filter(month %in% c(1, 2))
```
We'll come back to these comparisons and logical operators in more detail in [Chapter -@sec-logicals].
We'll come back to these comparisons and logical operators in more detail in @sec-logicals.
When you run `filter()` dplyr executes the filtering operation, creating a new data frame, and then prints it.
It doesn't modify the existing `flights` dataset because dplyr functions never modify their inputs.
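To keep the result you need to save it with the assignment operator, `<-`; a quick sketch:

```{r}
jan1 <- flights |>
  filter(month == 1 & day == 1)
```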
@ -308,7 +308,7 @@ There are a number of helper functions you can use within `select()`:
- `num_range("x", 1:3)`: matches `x1`, `x2` and `x3`.
See `?select` for more details.
Once you know regular expressions (the topic of [Chapter -@sec-regular-expressions]) you'll also be use `matches()` to select variables that match a pattern.
Once you know regular expressions (the topic of @sec-regular-expressions) you'll also be able to use `matches()` to select variables that match a pattern.
You can rename variables as you `select()` them by using `=`.
The new name appears on the left hand side of the `=`, and the old variable appears on the right hand side:
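A sketch with the `flights` data, renaming `tailnum` to the more consistent `tail_num`:

```{r}
flights |>
  select(tail_num = tailnum)
```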
@ -435,7 +435,7 @@ flights |>
Uhoh!
Something has gone wrong and all of our results are `NA` (pronounced "N-A"), R's symbol for a missing value.
We'll come back to discuss missing values in [Chapter -@sec-missing-values], but for now we'll remove them by using `na.rm = TRUE`:
We'll come back to discuss missing values in @sec-missing-values, but for now we'll remove them by using `na.rm = TRUE`:
```{r}
flights |>
@ -671,6 +671,6 @@ You can find a good explanation of this problem and how to overcome it at <http:
In this chapter, you've learned the tools that dplyr provides for working with data frames.
The tools are roughly grouped into three categories: those that manipulate the rows (like `filter()` and `arrange()`), those that manipulate the columns (like `select()` and `mutate()`), and those that manipulate groups (like `group_by()` and `summarise()`).
In this chapter, we've focused on these "whole data frame" tools, but you haven't yet learned much about what you can do with individual variables.
We'll come back to that in @sec-transform-intro, where each chapter will give you tools for a specific type of variable.
We'll come back to that in the Transform part of the book, where each chapter will give you tools for a specific type of variable.
For now, we'll pivot back to workflow, and in the next chapter you'll learn more about the pipe, `|>`, why we recommend it, and a little of the history that led from magrittr's `%>%` to base R's `|>`.

View File

@ -32,9 +32,9 @@ The goal of this chapter is to get you started on your journey with functions wi
The chapter concludes with some advice on function style.
Many of the examples in this chapter were inspired by real data analysis code supplied by folks on twitter.
I've often simplified the code from the original so you might want to look at the original tweets which I list in the comments.
We've often simplified the code from the original so you might want to look at the original tweets, which we list in the comments.
If you just want to see a huge variety of functions, check out the motivating tweets: https://twitter.com/hadleywickham/status/1574373127349575680 and https://twitter.com/hadleywickham/status/1571603361350164486. A big thanks to everyone who contributed!
I won't fully explain all of the functions that I use here, so you might need to do some reading of the documentation.
We won't fully explain all of the functions that we use here, so you might need to do some reading of the documentation.
### Prerequisites

View File

@ -223,7 +223,7 @@ knitr::include_graphics("diagrams/transform.png", dpi = 270)
As well as `&` and `|`, R also has `&&` and `||`.
Don't use them in dplyr functions!
These are called short-circuiting operators and only ever return a single `TRUE` or `FALSE`.
They're important for programming and you'll learn more about them in @sec-conditional-execution.
They're important for programming, not data science.
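A quick illustration of short-circuiting in base R: once the answer is known, the right-hand side is never evaluated.

```{r}
FALSE && stop("never evaluated")  # returns FALSE; the error is never triggered
TRUE  || stop("never evaluated")  # returns TRUE for the same reason
```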
### Missing values {#sec-na-boolean}
@ -402,7 +402,7 @@ This works, but what if we wanted to also compute the average delay for flights
We'd need to perform a separate filter step, and then figure out how to combine the two data frames together[^logicals-3].
Instead you could use `[` to perform an inline filtering: `arr_delay[arr_delay > 0]` will yield only the positive arrival delays.
[^logicals-3]: We'll cover this in [Chapter -@sec-relational-data]
[^logicals-3]: We'll cover this in @sec-joins.
This leads to:

View File

@ -121,9 +121,9 @@ The vast majority of transformation functions are already built into base R.
It's impractical to list them all so this section will show the most useful ones.
As an example, while R provides all the trigonometric functions that you might dream of, we don't list them here because they're rarely needed for data science.
### Arithmetic and recycling rules
### Arithmetic and recycling rules {#sec-recycling}
We introduced the basics of arithmetic (`+`, `-`, `*`, `/`, `^`) in [Chapter -@sec-workflow-basics] and have used them a bunch since.
We introduced the basics of arithmetic (`+`, `-`, `*`, `/`, `^`) in @sec-workflow-basics and have used them a bunch since.
These functions don't need a huge amount of explanation because they do what you learned in grade school.
But we need to briefly talk about the **recycling rules** which determine what happens when the left and right hand sides have different lengths.
This is important for operations like `flights |> mutate(air_time = air_time / 60)` because there are 336,776 numbers on the left of `/` but only one on the right.
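A brief sketch of recycling with simple vectors:

```{r}
x <- c(1, 2, 3, 4)
x * 2           # the single 2 is recycled to length 4
x + c(10, 100)  # the shorter vector is recycled: 11, 102, 13, 104
```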
@ -742,7 +742,7 @@ flights |>
### With `mutate()`
As the names suggest, the summary functions are typically paired with `summarise()`.
However, because of the recycling rules we discussed in @sec-scalars-and-recycling-rules they can also be usefully paired with `mutate()`, particularly when you want do some sort of group standardization.
However, because of the recycling rules we discussed in @sec-recycling they can also be usefully paired with `mutate()`, particularly when you want to do some sort of group standardization.
For example:
- `x / sum(x)` calculates the proportion of a total.
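For instance, pairing that pattern with `group_by()` and `mutate()` might look like this sketch, which computes each flight's share of the total distance flown to its destination:

```{r}
flights |>
  group_by(dest) |>
  mutate(prop_distance = distance / sum(distance))
```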

View File

@ -9,7 +9,7 @@ status("restructuring")
## Introduction
You learned the basics of regular expressions in [Chapter -@sec-strings], but regular expressions are fairly rich language so it's worth spending some extra time on the details.
You learned the basics of regular expressions in @sec-strings, but regular expressions are a fairly rich language so it's worth spending some extra time on the details.
The chapter starts by expanding your knowledge of patterns, to cover six important new topics (escaping, anchoring, character classes, shorthand classes, quantifiers, and alternation).
Here we'll focus mostly on the language itself, not the functions that use it.
@ -51,7 +51,7 @@ It's not R specific, but it covers the most advanced features and explains how r
## Pattern language
You learned the very basics of the regular expression pattern language in [Chapter -@sec-strings], and now its time to dig into more of the details.
You learned the very basics of the regular expression pattern language in @sec-strings, and now it's time to dig into more of the details.
First, we'll start with **escaping**, which allows you to match characters that the pattern language otherwise treats specially.
Next you'll learn about **anchors**, which allow you to match the start or end of the string.
Then you'll learn about **character classes** and their shortcuts, which allow you to match any character from a set.
@ -60,7 +60,7 @@ We'll finish up with **quantifiers**, which control how many times a pattern can
The terms we use here are the technical names for each component.
They're not always the most evocative of their purpose, but it's very helpful to know the correct terms if you later want to Google for more details.
We'll concentrate on showing how these patterns work with `str_view()` and `str_view_all()` but remember that you can use them with any of the functions that you learned about in [Chapter -@sec-strings], i.e.:
We'll concentrate on showing how these patterns work with `str_view()` and `str_view_all()` but remember that you can use them with any of the functions that you learned about in @sec-strings, i.e.:
- `str_detect(x, pattern)` returns a logical vector the same length as `x`, indicating whether each element matches (`TRUE`) or doesn't match (`FALSE`) the pattern.
- `str_count(x, pattern)` returns the number of times `pattern` matches in each element of `x`.
@ -68,7 +68,7 @@ We'll concentrate on showing how these patterns work with `str_view()` and `str_
### Escaping {#sec-regexp-escaping}
In [Chapter -@sec-strings], you'll learned how to match a literal `.` by using `fixed(".")`.
In @sec-strings, you learned how to match a literal `.` by using `fixed(".")`.
But what if you want to match a literal `.` as part of a bigger regular expression?
You'll need to use an **escape**, which tells the regular expression you want it to match exactly, not use its special behavior.
Like strings, regexps use the backslash for escaping, so to match a `.`, you need the regexp `\.`.
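A sketch of that escape in action with `str_view()`:

```{r}
# the regexp \. is written "\\." as an R string
str_view(c("abc", "a.c", "bef"), "a\\.c")
```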
@ -201,7 +201,7 @@ str_view_all("abcd12345!@#%. ", "\\S+")
### Quantifiers
The **quantifiers** control how many times a pattern matches.
In [Chapter -@sec-strings] you learned about `?` (0 or 1 matches), `+` (1 or more matches), and `*` (0 or more matches).
In @sec-strings you learned about `?` (0 or 1 matches), `+` (1 or more matches), and `*` (0 or more matches).
For example, `colou?r` will match American or British spelling, `\d+` will match one or more digits, and `\s?` will optionally match a single whitespace.
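For instance, a sketch with `str_view()`:

```{r}
x <- c("color", "colour", "call 555-1234")
str_view(x, "colou?r")  # ? makes the preceding character optional
str_view(x, "\\d+")     # + matches one or more digits
```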
You can also specify the number of matches precisely:

View File

@ -12,7 +12,7 @@ status("drafting")
So far you have learned about importing data from plain text files, e.g. `.csv` and `.tsv` files.
Sometimes you need to analyze data that lives in a spreadsheet.
In this chapter we will introduce you to tools for working with data in Excel spreadsheets and Google Sheets.
This will build on much of what you've learned in [Chapter -@sec-data-import] and [Chapter -@sec-import-rectangular], but we will also discuss additional considerations and complexities when working with data from spreadsheets.
This will build on much of what you've learned in @sec-data-import and @sec-import-rectangular, but we will also discuss additional considerations and complexities when working with data from spreadsheets.
If you or your collaborators are using spreadsheets for organizing data, we strongly recommend reading the paper "Data Organization in Spreadsheets" by Karl Broman and Kara Woo: <https://doi.org/10.1080/00031305.2017.1375989>.
The best practices presented in this paper will save you much headache down the line when you import the data from a spreadsheet into R to analyse and visualise.
@ -222,7 +222,7 @@ penguins <- bind_rows(penguins_torgersen, penguins_biscoe, penguins_dream)
penguins
```
In [Chapter -@sec-iteration] we'll talk about ways of doing this sort of task without repetitive code <!--# Check to make sure that's the right place to present it -->.
In @sec-iteration we'll talk about ways of doing this sort of task without repetitive code.
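A sketch of what that non-repetitive approach might look like, assuming the islands live in separate sheets of a hypothetical `penguins.xlsx` and using readxl plus purrr:

```{r}
library(readxl)
library(purrr)

path <- "penguins.xlsx"  # hypothetical file with one sheet per island
path |>
  excel_sheets() |>
  set_names() |>
  map(\(sheet) read_excel(path, sheet = sheet)) |>
  list_rbind(names_to = "island")
```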
### Reading part of a sheet

View File

@ -17,10 +17,6 @@ You'll then dive into creating strings from data.
Next, we'll discuss the basics of regular expressions, a powerful tool for describing patterns in strings, then use those tools to extract data from strings.
The chapter finishes up with functions that work with individual letters, including a brief discussion of where your expectations from English might steer you wrong when working with other languages, and a few useful non-stringr functions.
This chapter is paired with two other chapters.
Regular expression are a big topic, so we'll come back to them again in @sec-regular-expressions.
We'll also come back to strings again in @sec-programming-with-strings where we'll look at them from a programming perspective rather than a data analysis perspective.
### Prerequisites
In this chapter, we'll use functions from the stringr package which is part of the core tidyverse.
@ -457,7 +453,6 @@ str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
```
Alternatively, you can provide a replacement function: it's called with a vector of matches, and should return what to replace them with.
We'll come back to this powerful tool in [Chapter -@sec-programming-with-strings].
```{r}
x <- c("1 house", "1 person has 2 cars", "3 people")

View File

@ -185,7 +185,7 @@ tb |> pull(x1) # by name
tb |> pull(1) # by position
```
`pull()` also takes an optional `name` argument that specifies the column to be used as names for a named vector, which you'll learn about in [Chapter -@sec-vectors].
`pull()` also takes an optional `name` argument that specifies the column to be used as names for a named vector, which you'll learn about in @sec-vectors.
```{r}
tb |> pull(x1, name = id)

View File

@ -15,26 +15,24 @@ Now we'll focus on new skills for specific types of data you will frequently enc
This part of the book proceeds as follows:
- In [Chapter -@sec-tibbles], you'll learn about the variant of the data frame that we use in this book: the **tibble**.
- In @sec-tibbles, you'll learn about the variant of the data frame that we use in this book: the **tibble**.
You'll learn what makes them different from regular data frames, and how you can construct them "by hand".
- [Chapter -@sec-relational-data] will give you tools for working with multiple interrelated datasets.
- @sec-joins will give you tools for working with multiple interrelated datasets.
- [Chapter -@sec-numbers] ...
- @sec-numbers ...
- [Chapter -@sec-logicals] ...
- @sec-logicals ...
- [Chapter -@sec-missing-values]...
- @sec-missing-values...
- [Chapter -@sec-strings] will give you tools for working with strings and introduce regular expressions, a powerful tool for manipulating strings.
- @sec-strings will give you tools for working with strings and introduce regular expressions, a powerful tool for manipulating strings.
- [Chapter -@sec-regular-expressions] ...
- @sec-regular-expressions ...
- [Chapter -@sec-factors] will introduce factors -- how R stores categorical data.
- @sec-factors will introduce factors -- how R stores categorical data.
They are used when a variable has a fixed set of possible values, or when you want to use a non-alphabetical ordering of a string.
- [Chapter -@sec-dates-and-times] will give you the key tools for working with dates and date-times.
- [Chapter -@sec-column-wise] will give you tools for performing the same operation on multiple columns.
- @sec-dates-and-times will give you the key tools for working with dates and date-times.
<!-- TO DO: Add chapter descriptions -->

View File

@ -393,7 +393,8 @@ knitr::include_graphics("diagrams/lists-subsetting.png")
```
The difference between `[` and `[[` is very important, but it's easy to get confused.
To help you remember, let me show you an unusual pepper shaker in @fig-pepper-1.If this pepper shaker is your list `pepper`, then, `pepper[1]` is a pepper shaker containing a single pepper packet, as in @fig-pepper-2.
To help you remember, let me show you an unusual pepper shaker in @fig-pepper-1.
If this pepper shaker is your list `pepper`, then, `pepper[1]` is a pepper shaker containing a single pepper packet, as in @fig-pepper-2.
`pepper[2]` would look the same, but would contain the second packet.
`pepper[1:2]` would be a pepper shaker containing two pepper packets.
`pepper[[1]]` would extract the pepper packet itself, as in @fig-pepper-3.
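The same distinction in code, as a sketch (the `pepper` list here is made up for illustration):

```{r}
pepper <- list(packet_1 = "pepper", packet_2 = "pepper", packet_3 = "pepper")
pepper[1]    # a smaller pepper shaker: a list containing one packet
pepper[[1]]  # the packet itself: a character vector
```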
@ -402,7 +403,8 @@ To help you remember, let me show you an unusual pepper shaker in @fig-pepper-1.
#| label: fig-pepper-1
#| echo: false
#| out-width: "25%"
#| fig-cap: A pepper shaker that Hadley once found in his hotel room.
#| fig-cap: >
#| A pepper shaker that Hadley once found in his hotel room.
#| fig-alt: >
#| A photo of a glass pepper shaker. Instead of the pepper shaker
#| containing pepper, it contains many packets of pepper.
@ -608,3 +610,4 @@ The class of tibble includes "data.frame" which means tibbles inherit the regula
2. Try and make a tibble that has columns with different lengths.
What happens?

View File

@ -1,4 +1,4 @@
# Web scraping {#sec-import-webscrape}
# Web scraping {#sec-scraping}
```{r}
#| results: "asis"

View File

@ -105,7 +105,7 @@ some.people.use.periods
And_aFew.People_RENOUNCEconvention
```
We'll come back to names again when we talk more about code style in [Chapter -@sec-workflow-style].
We'll come back to names again when we talk more about code style in @sec-workflow-style.
You can inspect an object by typing its name:

View File

@ -115,7 +115,7 @@ But they're still good to know about even if you've never used `%>%` because you
- The `|>` placeholder is deliberately simple and can't replicate many features of the `%>%` placeholder: you can't pass it to multiple arguments, and it doesn't have any special behavior when the placeholder is used inside another function.
For example, `df %>% split(.$var)` is equivalent to `split(df, df$var)` and `df %>% {split(.$x, .$y)}` is equivalent to `split(df$x, df$y)`.
With `%>%` you can use `.` on the left-hand side of operators like `$`, `[[`, `[` (which you'll learn about in [Chapter -@sec-vectors]), so you can extract a single column from a data frame with (e.g.) `mtcars %>% .$cyl`.
With `%>%` you can use `.` on the left-hand side of operators like `$`, `[[`, `[` (which you'll learn about in @sec-vectors), so you can extract a single column from a data frame with (e.g.) `mtcars %>% .$cyl`.
A future version of R may add similar support for `|>` and `_`.
For the special case of extracting a column out of a data frame, you can also use `dplyr::pull()`:
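A sketch of the `pull()` equivalent:

```{r}
mtcars |> dplyr::pull(cyl)
```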

View File

@ -34,7 +34,7 @@ This part of the book proceeds as follows:
- In @sec-rectangling, you'll learn how to work with hierarchical data that includes deeply nested lists, as is often created when your raw data is in JSON.
- In @sec-import-webscrape, you'll learn about harvesting data off the web and getting it into R.
- In @sec-scraping, you'll learn about harvesting data off the web and getting it into R.
Some other types of data are not covered in this book: