Rough out summaries for all transform chapters

Hadley Wickham 2022-10-21 09:14:49 -05:00
parent 47d239b84b
commit 127db0fe81
8 changed files with 80 additions and 95 deletions

View File

@@ -354,7 +354,8 @@ flights_dt |>
geom_freqpoly(binwidth = 60 * 30)
```
Computing the difference between a pair of date-times yields a difftime (more on that in @sec-intervals).
We can convert that to an `hms` object to get a more useful x-axis:
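As a minimal standalone sketch of the conversion itself (the date-times are made up; `hms::as_hms()` does the coercion):

```{r}
diff <- ymd_hms("2026-01-02 03:30:00") - ymd_hms("2026-01-01 00:00:00")
diff               # a difftime, printed in days
hms::as_hms(diff)  # the same span shown as hours:minutes:seconds
```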
```{r}
#| fig-alt: >
@@ -427,9 +428,13 @@ update(ymd("2023-02-01"), hour = 400)
Next you'll learn about how arithmetic with dates works, including subtraction, addition, and division.
Along the way, you'll learn about three important classes that represent time spans:
- **Durations**, which represent an exact number of seconds.
- **Periods**, which represent human units like weeks and months.
- **Intervals**, which represent a starting and ending point.
How do you pick between durations, periods, and intervals?
As always, pick the simplest data structure that solves your problem.
If you only care about physical time, use a duration; if you need to add human times, use a period; if you need to figure out how long a span is in human units, use an interval.
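For example, here's a small sketch of how the three classes differ around a daylight saving time transition (the date assumes the `America/New_York` time zone, where clocks spring forward on 2026-03-08):

```{r}
one_am <- ymd_hms("2026-03-08 01:00:00", tz = "America/New_York")

one_am + ddays(1)  # duration: exactly 86,400 seconds later, the clock reads 2am
one_am + days(1)   # period: one "human" day later, the clock still reads 1am

# An interval records both endpoints, so you can measure the span either way:
interval(one_am, one_am + days(1)) / ddays(1)  # ~0.96 exact days: DST skipped an hour
```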
### Durations
@@ -592,12 +597,6 @@ y2023 / days(1)
y2024 / days(1)
```
### Exercises
1. Explain `days(overnight * 1)` to someone who has just started learning R.
@@ -700,3 +699,13 @@ You can change the time zone in two ways:
x4b
x4b - x4
```
## Summary
This chapter has introduced you to the tools that lubridate provides to help you work with date-time data.
Working with dates and times can seem harder than necessary, but hopefully this chapter has helped you see why --- date-times are more complex than they seem at first glance, and handling every possible situation adds complexity.
Even if your data never crosses a daylight saving time boundary or involves a leap year, the functions need to be able to handle it.
The next chapter gives a roundup of missing values.
You've seen them in a few places and have no doubt encountered them in your own analyses, and now it's time to provide a grab bag of useful techniques for dealing with them.

View File

@@ -428,3 +428,11 @@ There are only two places where you might notice different behavior:
- If you use an ordered factor in a linear model, it will use "polynomial contrasts". These are mildly useful, but you are unlikely to have heard of them unless you have a PhD in Statistics, and even then you probably don't routinely interpret them. If you want to learn more, we recommend `vignette("contrasts", package = "faux")` by Lisa DeBruine.
Given the arguable utility of these differences, we don't generally recommend using ordered factors.
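If you're curious what these contrasts look like, here's a quick sketch with a made-up ordered factor:

```{r}
x <- factor(c("lo", "mid", "hi"), levels = c("lo", "mid", "hi"), ordered = TRUE)
contrasts(x)  # the .L (linear) and .Q (quadratic) polynomial contrasts
```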
## Summary
This chapter introduced you to the handy forcats package for working with factors, covering its most commonly used functions.
forcats contains a wide range of other helpers that we didn't have space to discuss here, so whenever you're facing a factor analysis challenge that you haven't encountered before, we highly recommend skimming the [reference index](https://forcats.tidyverse.org/reference/index.html) to see if there's a canned function that can help solve your problem.
In the next chapter we'll switch gears to start learning about dates and times in R.
Dates and times seem deceptively simple, but as you'll soon see, the more you learn about them, the more complex they seem to get!

View File

@@ -224,6 +224,18 @@ If you look carefully, you might intuit that the columns are named using a
That's not a coincidence!
As you'll learn in the next section, you can use the `.names` argument to supply your own glue spec.
### Missing values {#sec-across-missing-values}

`across()` makes it easy to apply the missing value tools from @sec-missing-values to (e.g.) every numeric column in a data frame:

```{r}
#| eval: false
# Replace every NA in every numeric column with 0
df |>
  mutate(across(where(is.numeric), \(x) coalesce(x, 0)))

# Replace the sentinel value -99 in every numeric column with NA
df |>
  mutate(across(where(is.numeric), \(x) na_if(x, -99)))
```
### Column names
The result of `across()` is named according to the specification provided in the `.names` argument.
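For example, here's a sketch of a custom spec (`df` is a hypothetical data frame; `{.col}` and `{.fn}` are the glue placeholders for the column and function names):

```{r}
#| eval: false
df |>
  summarize(across(
    where(is.numeric),
    list(mean = mean, sd = sd),
    .names = "{.fn}_of_{.col}"
  ))
```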

View File

@@ -15,7 +15,7 @@ It's relatively rare to find logical vectors in your raw data, but you'll create
We'll begin by discussing the most common way of creating logical vectors: with numeric comparisons.
Then you'll learn about how you can use Boolean algebra to combine different logical vectors, as well as some useful summaries.
We'll finish off with `if_else()` and `case_when()`, two useful functions for making conditional changes powered by logical vectors.
### Prerequisites
@@ -187,6 +187,8 @@ flights |>
arrange(desc(is.na(dep_time)), dep_time)
```
We'll come back to cover missing values in more depth in @sec-missing-values.
### Exercises
1. How does `dplyr::near()` work? Type `near` to see the source code.
@@ -543,53 +545,13 @@ flights |>
)
```
## Making groups {#sec-groups-from-logical}

Before we move on to the next chapter, we want to show you one last trick that's useful for grouping data.
Sometimes you want to start a new group every time some event occurs.
For example, when you're looking at website data, it's common to want to break up events into sessions, where a session is defined as a gap of more than x minutes since the last activity.

Here's some made-up data that illustrates the problem.
So far we've computed the time lag between the events, and figured out if there's a gap that's big enough to qualify:

```{r}
events <- tibble(
  time = c(0, 1, 2, 3, 5, 10, 12, 15, 17, 19, 20, 27, 28, 30)
)
events <- events |>
  mutate(
    diff = time - lag(time, default = first(time)),
    gap = diff >= 5
  )
events
```

But how do we go from that logical vector to something that we can `group_by()`?
`consecutive_id()` comes to the rescue:

```{r}
events |> mutate(
  group = consecutive_id(gap)
)
```

`consecutive_id()` starts a new group every time one of its arguments changes.
That makes it useful both here, with logical vectors, and in many other places.
For example, inspired by [this stackoverflow question](https://stackoverflow.com/questions/27482712), imagine you have a data frame with a bunch of repeated values:

```{r}
df <- tibble(
  x = c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b"),
  y = c(1, 2, 3, 2, 4, 1, 3, 9, 4, 8, 10, 199)
)
df
```

You want to keep the first row from each repeated `x`.
That's easier to express with a combination of `consecutive_id()` and `slice_head()`:

```{r}
df |>
  group_by(id = consecutive_id(x)) |>
  slice_head(n = 1)
```

## Summary

The definition of a logical vector is simple because each value must be either `TRUE`, `FALSE`, or `NA`.
But logical vectors provide a huge amount of power.
In this chapter, you learned how to create logical vectors with `>`, `<`, `<=`, `>=`, `==`, `!=`, and `is.na()`, how to combine them with `!`, `&`, and `|`, and how to summarize them with `any()`, `all()`, `sum()`, and `mean()`.
You also learned the powerful `if_else()` and `case_when()` functions that allow you to return values depending on the value of a logical vector.

We'll see logical vectors again in the following chapters.
For example, in @sec-strings you'll learn about `str_detect(x, pattern)`, which returns a logical vector that's `TRUE` for the elements of `x` that match the `pattern`, and in @sec-dates-and-times you'll create logical vectors from the comparison of dates and times.
But for now, we're going to move on to the next most important type of vector: numeric vectors.

View File

@@ -68,17 +68,6 @@ x <- c(1, 4, 5, 7, NA)
coalesce(x, 0)
```
### Sentinel values
Sometimes you'll hit the opposite problem where some concrete value actually represents a missing value.
This typically arises in data generated by older software that doesn't have a proper way to represent missing values, so it must instead use some special value like 99 or -999.
@@ -90,14 +79,7 @@ x <- c(1, 4, 5, 7, -99)
na_if(x, -99)
```
In @sec-across-missing-values, you'll learn how to easily apply these tools to (e.g.) every numeric column in a data frame.
### NaN
@@ -315,3 +297,12 @@ health |>
```
The main drawback of this approach is that you get an `NA` for the count, even though you know that it should be zero.
## Summary
Missing values are weird!
Sometimes they're recorded as an explicit `NA` but other times you only notice them by their absence.
This chapter has given you some tools for working with explicit missing values, tools for uncovering implicit missing values, and discussed some of the ways that implicit missing values can become explicit and vice versa.
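For example, here's a minimal sketch (with made-up data) of turning an implicit missing value into an explicit one using `tidyr::complete()`:

```{r}
stocks <- tibble(
  year  = c(2020, 2020, 2021),
  qtr   = c(1, 2, 1),
  price = c(1.88, 0.59, 2.66)
)
# There's no row for 2021 Q2; complete() makes that missing value explicit:
stocks |> complete(year, qtr)
```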
Next, we tackle the final chapter in this part of the book: joins.
This is a bit of a change from the chapters so far because we're going to discuss tools that work with data frames as a whole, not something that you put inside a data frame.

View File

@@ -423,22 +423,6 @@ slide_vec(x, sum, .before = 2, .after = 2, .complete = TRUE)
The following sections describe some general transformations which are often used with numeric vectors, but can be applied to all other column types.
### Fill in missing values {#sec-missing-values-numbers}
You can fill in missing values with dplyr's `coalesce()`:
```{r}
x <- c(1, NA, 5, NA, 10)
coalesce(x, 0)
```
`coalesce()` is vectorised, so you can find the non-missing values from a pair of vectors:
```{r}
y <- c(2, 3, 4, NA, 5)
coalesce(x, y)
```
### Ranks
dplyr provides a number of ranking functions inspired by SQL, but you should always start with `dplyr::min_rank()`.
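For example, here's a small sketch of how `min_rank()` handles ties and missing values:

```{r}
x <- c(1, 2, 2, 3, 4, NA)
min_rank(x)        # ties share the smallest available rank; NAs stay NA
min_rank(desc(x))  # wrap in desc() to rank from largest to smallest
```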
@@ -765,3 +749,12 @@ For example:
3. Create a plot to further explore the adventures of EGE.
Can you find any evidence that the airport moved locations?
## Summary
You're likely already familiar with many tools for working with numbers, and in this chapter you've learned how they're realized in R.
You also learned a handful of useful general transformations that are commonly, but not exclusively, applied to numeric vectors, like ranks and offsets.
Finally, we worked through a number of numeric summaries, and discussed a few of the statistical challenges that you should consider.
Over the next two chapters, we'll dive into working with strings with the stringr package.
Strings get two chapters because there really are two topics to cover: strings and regular expressions.

View File

@@ -611,7 +611,7 @@ str_view("x X", fixed("X", ignore_case = TRUE))
```
If you're working with non-English text, you should generally use `coll()` instead, as it implements the full rules for capitalization as used by the `locale` you specify.
See @sec-other-languages for more details.
```{r}
str_view("i İ ı I", fixed("İ", ignore_case = TRUE))
@@ -828,5 +828,12 @@ head(dir(pattern = "\\.Rmd$"))
## Summary
To continue learning about regular expressions, start with `vignette("regular-expressions", package = "stringr")`: it documents the full set of syntax supported by stringr.
Don't forget that stringr is implemented on top of stringi, so if you're struggling to find a function that does what you need, don't be afraid to look in stringi too.
You'll find it very easy to pick up because it follows the same conventions as stringr.
Another useful reference is [the regular-expressions.info tutorial](https://www.regular-expressions.info/tutorial.html).
It's not R specific, but it covers the most advanced features and explains how regular expressions work under the hood.
In the next chapter, we'll talk about a data structure closely related to strings: factors.
Factors are used to represent categorical data in R, data where there is a fixed and known set of possible values identified by a vector of strings.

View File

@@ -493,3 +493,6 @@ Fortunately there are three sets of functions where the locale matters:
[^strings-8]: Sorting in languages that don't have an alphabet (like Chinese) is more complicated still.
## Summary
In this chapter you've learned a wide range of tools for working with strings, but you haven't learned one of the most important and powerful tools: regular expressions.
Regular expressions are a very concise, but very expressive, language for describing patterns within strings, and are the topic of the next chapter.
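To give you an early taste, here's a tiny sketch (assuming stringr is loaded):

```{r}
# "^a.*e$" matches strings that start with "a" and end with "e"
str_detect(c("apple", "orange", "apse"), "^a.*e$")
```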