TR feedback 2 (#1318)

Hadley Wickham 2023-03-01 07:45:54 -06:00 committed by GitHub
parent bf07203845
commit 7cd62150c0
8 changed files with 81 additions and 58 deletions


@ -122,7 +122,7 @@ Thanks to arrow, this code will work regardless of how large the underlying data
But it's currently rather slow: on Hadley's computer, it took \~10s to run.
That's not terrible given how much data we have, but we can make it much faster by switching to a better format.
-## The parquet format
+## The parquet format {#sec-parquet}
To make this data easier to work with, let's switch to the parquet file format and split it up into multiple files.
The following sections will first introduce you to parquet and partitioning, and then apply what we learned to the Seattle library data.
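As a preview of where this is heading, the conversion itself is a short arrow pipeline. A sketch only, assuming the `seattle_csv` dataset object opened earlier in the chapter:

```r
library(arrow)
library(dplyr)

# Sketch: partition by CheckoutYear so each year lands in its own
# directory of parquet files. `seattle_csv` is assumed from earlier.
seattle_csv |>
  group_by(CheckoutYear) |>
  write_dataset(path = "data/seattle-library-checkouts", format = "parquet")
```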


@ -174,6 +174,15 @@ You can also force the creation of a date-time from a date by supplying a timezo
ymd("2017-01-31", tz = "UTC")
```
Here I use the UTC[^datetimes-3] timezone which you might also know as GMT, or Greenwich Mean Time, the time at 0° longitude[^datetimes-4]. It doesn't use daylight savings time, making it a bit easier to compute with.
[^datetimes-3]: You might wonder what UTC stands for.
It's a compromise between the English "Coordinated Universal Time" and French "Temps Universel Coordonné".
[^datetimes-4]: No prizes for guessing which country came up with the longitude system.
### From individual components
Instead of a single string, sometimes you'll have the individual components of the date-time spread across multiple columns.
@ -300,6 +309,7 @@ The next section will look at how arithmetic works with date-times.
### Getting components
You can pull out individual parts of the date with the accessor functions `year()`, `month()`, `mday()` (day of the month), `yday()` (day of the year), `wday()` (day of the week), `hour()`, `minute()`, and `second()`.
These are effectively the opposites of `make_datetime()`.
```{r}
datetime <- ymd_hms("2026-07-08 12:34:56")
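# The accessor functions then pull each component back out; for this
# datetime, year() returns 2026, month() 7, and mday() 8.
year(datetime)
month(datetime)
mday(datetime)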
@ -629,8 +639,8 @@ We can fix this by adding `days(1)` to the arrival time of each overnight flight
flights_dt <- flights_dt |>
mutate(
overnight = arr_time < dep_time,
-    arr_time = arr_time + days(if_else(overnight, 0, 1)),
-    sched_arr_time = sched_arr_time + days(overnight * 1)
+    arr_time = arr_time + days(!overnight),
+    sched_arr_time = sched_arr_time + days(overnight)
)
```
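Why does passing a logical straight to `days()` work? A quick sketch of the key fact — R coerces `TRUE`/`FALSE` to 1/0:

```r
library(lubridate)

# Logicals coerce to numbers, so days(overnight) is a one-day period
# for overnight flights and a zero-day period otherwise;
# days(!overnight) flips that.
overnight <- c(TRUE, FALSE)
days(overnight)
days(!overnight)
```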
@ -643,9 +653,10 @@ flights_dt |>
### Intervals {#sec-intervals}
-It's obvious what `dyears(1) / ddays(365)` should return: one, because durations are always represented by a number of seconds, and a duration of a year is defined as 365 days worth of seconds.
+What does `dyears(1) / ddays(365)` return?
+It's not quite one, because `dyears()` is defined as the number of seconds per average year, which is 365.25 days.
-What should `years(1) / days(1)` return?
+What does `years(1) / days(1)` return?
Well, if the year was 2015 it should return 365, but if it was 2016, it should return 366!
There's not quite enough information for lubridate to give a single clear answer.
What it does instead is give an estimate:
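A sketch of the difference between the estimate and an exact answer, pinning the span to concrete dates with an interval:

```r
library(lubridate)

# Periods can only estimate: lubridate uses the average year length here.
years(1) / days(1)

# An interval spans specific dates, so the answer is exact: 2024 is a
# leap year, so this division gives 366.
y2024 <- interval(ymd("2024-01-01"), ymd("2025-01-01"))
y2024 / days(1)
```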
@ -676,8 +687,8 @@ y2024 / days(1)
### Exercises
-1. Explain `days(overnight * 1)` to someone who has just started learning R.
-   How does it work?
+1. Explain `days(!overnight)` and `days(overnight)` to someone who has just started learning R.
+   What is the key fact you need to know?
2. Create a vector of dates giving the first day of every month in 2015.
Create a vector of dates giving the first day of every month in the *current* year.


@ -19,6 +19,8 @@ Writing a function has three big advantages over using copy-and-paste:
3. You eliminate the chance of making incidental mistakes when you copy and paste (i.e. updating a variable name in one place, but not in another).
4. It makes it easier to reuse work from project-to-project, increasing your productivity over time.
A good rule of thumb is to consider writing a function whenever you've copied and pasted a block of code more than twice (i.e. you now have three copies of the same code).
In this chapter, you'll learn about three useful types of functions:
@ -327,12 +329,7 @@ Once you start writing functions, there are two RStudio shortcuts that are super
3. Given a vector of birthdates, write a function to compute the age in years.
4. Write your own functions to compute the variance and skewness of a numeric vector.
-   Variance is defined as $$
-   \mathrm{Var}(x) = \frac{1}{n - 1} \sum_{i=1}^n (x_i - \bar{x}) ^2 \text{,}
-   $$ where $\bar{x} = (\sum_i^n x_i) / n$ is the sample mean.
-   Skewness is defined as $$
-   \mathrm{Skew}(x) = \frac{\frac{1}{n-2}\left(\sum_{i=1}^n(x_i - \bar x)^3\right)}{\mathrm{Var}(x)^{3/2}} \text{.}
-   $$
+   You can look up the definitions on Wikipedia or elsewhere.
5. Write `both_na()`, a summary function that takes two vectors of the same length and returns the number of positions that have an `NA` in both vectors.
@ -340,8 +337,12 @@ Once you start writing functions, there are two RStudio shortcuts that are super
Why are they useful even though they are so short?
```{r}
-is_directory <- function(x) file.info(x)$isdir
-is_readable <- function(x) file.access(x, 4) == 0
+is_directory <- function(x) {
+  file.info(x)$isdir
+}
+is_readable <- function(x) {
+  file.access(x, 4) == 0
+}
```
## Data frame functions
@ -484,7 +485,8 @@ count_prop <- function(df, var, sort = FALSE) {
diamonds |> count_prop(clarity)
```
-This function has three arguments: `df`, `var`, and `sort`, and only `var` needs to be embraced because it's passed to `count()` which uses data-masking for all variables in `...`.
+This function has three arguments: `df`, `var`, and `sort`, and only `var` needs to be embraced because it's passed to `count()` which uses data-masking for all variables.
+Note that we use a default value for `sort` so that if the user doesn't supply their own value it will default to `FALSE`.
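For reference, since the diff only shows its signature, here is a sketch of the full `count_prop()` matching the definition earlier in the chapter:

```r
library(dplyr)
library(ggplot2)  # for the diamonds dataset

count_prop <- function(df, var, sort = FALSE) {
  df |>
    count({{ var }}, sort = sort) |>
    mutate(prop = n / sum(n))
}

# Only `var` is embraced; `sort` is passed along as an ordinary value.
diamonds |> count_prop(clarity)
```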
Or maybe you want to find the sorted unique values of a variable for a subset of the data.
Rather than supplying a variable and a value to do the filtering, we'll allow the user to supply a condition:
@ -499,8 +501,6 @@ unique_where <- function(df, condition, var) {
# Find all the destinations in December
flights |> unique_where(month == 12, dest)
# Which months did plane N14228 fly in?
flights |> unique_where(tailnum == "N14228", month)
```
Here we embrace `condition` because it's passed to `filter()` and `var` because it's passed to `distinct()` and `arrange()`.
@ -509,7 +509,7 @@ We've made all these examples to take a data frame as the first argument, but if
For example, the following function always works with the flights dataset and always selects `time_hour`, `carrier`, and `flight` since they form the compound primary key that allows you to identify a row.
```{r}
-flights_sub <- function(rows, cols) {
+subset_flights <- function(rows, cols) {
flights |>
filter({{ rows }}) |>
select(time_hour, carrier, flight, {{ cols }})
@ -527,7 +527,10 @@ You might try writing something like:
count_missing <- function(df, group_vars, x_var) {
df |>
group_by({{ group_vars }}) |>
-    summarize(n_miss = sum(is.na({{ x_var }})))
+    summarize(
+      n_miss = sum(is.na({{ x_var }})),
+      .groups = "drop"
+    )
}
flights |>
@ -541,7 +544,10 @@ We can work around that problem by using the handy `pick()` function, which allo
count_missing <- function(df, group_vars, x_var) {
df |>
group_by(pick({{ group_vars }})) |>
-    summarize(n_miss = sum(is.na({{ x_var }})))
+    summarize(
+      n_miss = sum(is.na({{ x_var }})),
+      .groups = "drop"
+    )
}
flights |>
@ -605,7 +611,7 @@ While our examples have mostly focused on dplyr, tidy evaluation also underpins
```{r}
#| eval: false
-weather |> standardise_time(sched_dep_time)
+weather |> standardize_time(sched_dep_time)
```
2. For each of the following functions list all arguments that use tidy evaluation and describe whether they use data-masking or tidy-selection: `distinct()`, `count()`, `group_by()`, `rename_with()`, `slice_min()`, `slice_sample()`.
@ -697,9 +703,9 @@ hex_plot <- function(df, x, y, z, bins = 20, fun = "mean") {
diamonds |> hex_plot(carat, price, depth)
```
-### Combining with dplyr
+### Combining with other tidyverse

-Some of the most useful helpers combine a dash of dplyr with ggplot2.
+Some of the most useful helpers combine a dash of data manipulation with ggplot2.
For example, you might want to make a vertical bar chart where you automatically sort the bars in frequency order using `fct_infreq()`.
Since the bar chart is vertical, we also need to reverse the usual order to get the highest values at the top:
@ -839,7 +845,7 @@ This makes it very obvious that something unusual is happening.
```{r}
f1 <- function(string, prefix) {
-  substr(string, 1, nchar(prefix)) == prefix
+  str_sub(string, 1, str_length(prefix)) == prefix
}
f3 <- function(x, y) {
@ -851,6 +857,7 @@ This makes it very obvious that something unusual is happening.
3. Make a case for why `norm_r()`, `norm_d()` etc. would be better than `rnorm()`, `dnorm()`.
Make a case for the opposite.
How could you make the names even clearer?
## Summary


@ -144,7 +144,7 @@ Let's motivate this problem with a simple example: what happens if we have some
```{r}
rnorm_na <- function(n, n_na, mean = 0, sd = 1) {
-  sample(c(rnorm(n - n_na, mean = mean, sd = 1), rep(NA, n_na)))
+  sample(c(rnorm(n - n_na, mean = mean, sd = sd), rep(NA, n_na)))
}
df_miss <- tibble(
@ -397,22 +397,21 @@ If needed, you could `pivot_wider()` this back to the original form.
### Exercises
-1. Compute the number of unique values in each column of `palmerpenguins::penguins`.
+1. Practice your `across()` skills by:
-2. Compute the mean of every column in `mtcars`.
+   1. Computing the number of unique values in each column of `palmerpenguins::penguins`.
-3. Group `diamonds` by `cut`, `clarity`, and `color` then count the number of observations and the mean of each numeric column.
+   2. Computing the mean of every column in `mtcars`.
-4. What happens if you use a list of functions, but don't name them?
+   3. Grouping `diamonds` by `cut`, `clarity`, and `color` then counting the number of observations and computing the mean of each numeric column.
+2. What happens if you use a list of functions in `across()`, but don't name them?
   How is the output named?
-5. It is possible to use `across()` inside `filter()` where it's equivalent to `if_all()`.
-   Can you explain why?
-6. Adjust `expand_dates()` to automatically remove the date columns after they've been expanded.
+3. Adjust `expand_dates()` to automatically remove the date columns after they've been expanded.
   Do you need to embrace any arguments?
-7. Explain what each step of the pipeline in this function does.
+4. Explain what each step of the pipeline in this function does.
What special feature of `where()` are we taking advantage of?
```{r}
@ -656,6 +655,7 @@ write_csv(gapminder, "gapminder.csv")
```
Now when you come back to this problem in the future, you can read in a single csv file.
For larger and richer datasets, using parquet might be a better choice than `.csv`, as discussed in @sec-parquet.
```{r}
#| include: false
@ -733,7 +733,9 @@ files <- paths |>
```
Then a very useful strategy is to capture the structure of the data frames so that you can explore it using your data science skills.
-One way to do so is with this handy `df_types` function that returns a tibble with one row for each column:
+One way to do so is with this handy `df_types` function[^iteration-6] that returns a tibble with one row for each column:

+[^iteration-6]: We're not going to explain how it works, but if you look at the docs for the functions used, you should be able to puzzle it out.
```{r}
df_types <- function(df) {
@ -744,7 +746,7 @@ df_types <- function(df) {
)
}
-df_types(starwars)
+df_types(gapminder)
```
You can then apply this function to all of the files, and maybe do some pivoting to make it easier to see where the differences are.
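The body of `df_types()` is truncated in the diff above; a minimal sketch of such a helper (the book's version may differ in details):

```r
library(dplyr)
library(purrr)

# One row per column: the column's name and its full vctrs type.
df_types <- function(df) {
  tibble(
    col_name = names(df),
    col_type = map_chr(df, vctrs::vec_ptype_full)
  )
}

df_types(dplyr::starwars)
```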
@ -952,9 +954,9 @@ carat_histogram <- function(df) {
carat_histogram(by_clarity$data[[1]])
```
-Now we can use `map()` to create a list of many plots[^iteration-6] and their eventual file paths:
+Now we can use `map()` to create a list of many plots[^iteration-7] and their eventual file paths:

-[^iteration-6]: You can print `by_clarity$plot` to get a crude animation --- you'll get one plot for each element of `plots`.
+[^iteration-7]: You can print `by_clarity$plot` to get a crude animation --- you'll get one plot for each element of `plots`.
NOTE: this didn't happen for me.
```{r}


@ -200,8 +200,7 @@ Surrogate keys can be particularly useful when communicating to other humans: it's
## Basic joins {#sec-mutating-joins}
Now that you understand how data frames are connected via keys, we can start using joins to better understand the `flights` dataset.
-dplyr provides six join functions: `left_join()`, `inner_join()`, `right_join()`, `semi_join()`, `anti_join()`, and `full_join()`.
-They all have the same interface: they take a pair of data frames (`x` and `y`) and return a data frame.
+dplyr provides six join functions: `left_join()`, `inner_join()`, `right_join()`, `full_join()`, `semi_join()`, and `anti_join()`. They all have the same interface: they take a pair of data frames (`x` and `y`) and return a data frame.
The order of the rows and columns in the output is primarily determined by `x`.
In this section, you'll learn how to use one mutating join, `left_join()`, and two filtering joins, `semi_join()` and `anti_join()`.
@ -305,6 +304,10 @@ In older code you might see a different way of specifying the join keys, using a
Now that it exists, we prefer `join_by()` since it provides a clearer and more flexible specification.
`inner_join()`, `right_join()`, and `full_join()` have the same interface as `left_join()`.
The difference is which rows they keep: the left join keeps all rows in `x`, the right join keeps all rows in `y`, the full join keeps all rows in either `x` or `y`, and the inner join keeps only rows that occur in both `x` and `y`.
We'll come back to these in more detail later.
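A sketch of how the four mutating joins differ, on toy data:

```r
library(dplyr)

# key 3 appears only in x; key 4 appears only in y.
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

x |> left_join(y, join_by(key))   # all rows of x
x |> right_join(y, join_by(key))  # all rows of y
x |> full_join(y, join_by(key))   # rows of either x or y
x |> inner_join(y, join_by(key))  # only keys present in both
```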
### Filtering joins
As you might guess the primary action of a **filtering join** is to filter the rows.
@ -464,9 +467,6 @@ knitr::include_graphics("diagrams/join/setup2.png", dpi = 270)
In an actual join, matches will be indicated with dots, as in @fig-join-inner.
The number of dots equals the number of matches, which in turn equals the number of rows in the output, a new data frame that contains the key, the x values, and the y values.
-The join shown here is a so-called **equi** **inner join**, where rows match if the keys are equal, so that the output contains only the rows with keys that appear in both `x` and `y`.
-Equi-joins are the most common type of join, so we'll typically omit the equi prefix, and just call it an inner join.
-We'll come back to non-equi joins in @sec-non-equi-joins.
```{r}
#| label: fig-join-inner
@ -572,6 +572,10 @@ However, this is not a great representation because while it might jog your memo
knitr::include_graphics("diagrams/join/venn.png", dpi = 270)
```
The joins shown here are the so-called **equi** **joins**, where rows match if the keys are equal.
Equi-joins are the most common type of join, so we'll typically omit the equi prefix, and just say "inner join" rather than "equi inner join".
We'll come back to non-equi joins in @sec-non-equi-joins.
### Row matching
So far we've explored what happens if a row in `x` matches zero or one rows in `y`.
@ -620,8 +624,6 @@ df1 |>
inner_join(df2, join_by(key))
```
-This is one reason we like `left_join()` --- if it runs without warning, you know that each row of the output matches the row in the same position in `x`.
You can gain further control over row matching with two arguments:
- `unmatched` controls what happens when a row in `x` fails to match any rows in `y`. It defaults to `"drop"` which will silently drop any unmatched rows.
@ -850,7 +852,7 @@ That leads to the following party days:
```{r}
parties <- tibble(
q = 1:4,
-  party = lubridate::ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03"))
+  party = ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03"))
)
```
@ -859,7 +861,7 @@ Now imagine that you have a table of employee birthdays:
```{r}
employees <- tibble(
name = sample(babynames::babynames$name, 100),
-  birthday = lubridate::ymd("2022-01-01") + (sample(365, 100, replace = TRUE) - 1)
+  birthday = ymd("2022-01-01") + (sample(365, 100, replace = TRUE) - 1)
)
employees
```
@ -896,9 +898,9 @@ So it might be better to to be explicit about the date ranges that each party sp
```{r}
parties <- tibble(
q = 1:4,
-  party = lubridate::ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03")),
-  start = lubridate::ymd(c("2022-01-01", "2022-04-04", "2022-07-11", "2022-10-03")),
-  end = lubridate::ymd(c("2022-04-03", "2022-07-11", "2022-10-02", "2022-12-31"))
+  party = ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03")),
+  start = ymd(c("2022-01-01", "2022-04-04", "2022-07-11", "2022-10-03")),
+  end = ymd(c("2022-04-03", "2022-07-11", "2022-10-02", "2022-12-31"))
)
parties
```
@ -917,9 +919,9 @@ Ooops, there is an overlap, so let's fix that problem and continue:
```{r}
parties <- tibble(
q = 1:4,
-  party = lubridate::ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03")),
-  start = lubridate::ymd(c("2022-01-01", "2022-04-04", "2022-07-11", "2022-10-03")),
-  end = lubridate::ymd(c("2022-04-03", "2022-07-10", "2022-10-02", "2022-12-31"))
+  party = ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03")),
+  start = ymd(c("2022-01-01", "2022-04-04", "2022-07-11", "2022-10-03")),
+  end = ymd(c("2022-04-03", "2022-07-10", "2022-10-02", "2022-12-31"))
)
```


@ -544,7 +544,7 @@ if_else(TRUE, "a", 1)
case_when(
x < -1 ~ TRUE,
-  x > 0 ~ lubridate::now()
+  x > 0 ~ now()
)
```


@ -71,7 +71,7 @@ coalesce(x, 0)
Sometimes you'll hit the opposite problem where some concrete value actually represents a missing value.
This typically arises in data generated by older software that doesn't have a proper way to represent missing values, so it must instead use some special value like 99 or -999.
-If possible, handle this when reading in the data, for example, by using the `na` argument to `readr::read_csv()`.
+If possible, handle this when reading in the data, for example, by using the `na` argument to `readr::read_csv()`, e.g. `read_csv(path, na = "99")`.
If you discover the problem later, or your data source doesn't provide a way to handle it on read, you can use `dplyr::na_if()`:
```{r}
@ -206,7 +206,7 @@ For example, imagine we have a dataset that contains some health information abo
health <- tibble(
name = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
-  age = c(34L, 88L, 75L, 47L, 56L),
+  age = c(34, 88, 75, 47, 56),
)
```
@ -248,6 +248,7 @@ The same problem comes up more generally with `dplyr::group_by()`.
And again you can use `.drop = FALSE` to preserve all factor levels:
```{r}
#| warning: false
health |>
group_by(smoker, .drop = FALSE) |>
summarize(


@ -36,7 +36,7 @@ In the following three chapters, you'll learn skills to improve your programming
1. Copy-and-paste is a powerful tool, but you should avoid doing it more than twice.
Repeating yourself in code is dangerous because it can easily lead to errors and inconsistencies.
-    Instead, in @sec-functions, you'll learn how to write **functions** which let you extract out repeated code so that it can be easily reused.
+    Instead, in @sec-functions, you'll learn how to write **functions** which let you extract out repeated tidyverse code so that it can be easily reused.
2. Functions extract out repeated code, but you often need to repeat the same actions on different inputs.
You need tools for **iteration** that let you do similar things again and again.