Feedback from O'Reilly + style fixes

Hadley Wickham 2022-11-23 11:55:08 -06:00
parent f0b19065c7
commit 19c89ebf64
1 changed file with 18 additions and 9 deletions


@ -334,7 +334,7 @@ We can use `wday()` to see that more flights depart during the week than on the
flights_dt |>
mutate(wday = wday(dep_time, label = TRUE)) |>
ggplot(aes(x = wday)) +
geom_bar()
geom_bar()
```
There's an interesting pattern if we look at the average departure delay by minute within the hour.
@ -353,9 +353,10 @@ flights_dt |>
group_by(minute) |>
summarize(
avg_delay = mean(dep_delay, na.rm = TRUE),
n = n()) |>
n = n()
) |>
ggplot(aes(minute, avg_delay)) +
geom_line()
geom_line()
```
Interestingly, if we look at the *scheduled* departure time we don't see such a strong pattern:
@ -371,23 +372,30 @@ sched_dep <- flights_dt |>
group_by(minute) |>
summarize(
avg_delay = mean(arr_delay, na.rm = TRUE),
n = n())
n = n()
)
ggplot(sched_dep, aes(minute, avg_delay)) +
geom_line()
```
So why do we see that pattern with the actual departure times?
Well, like much data collected by humans, there's a strong bias towards flights leaving at "nice" departure times.
Well, like much data collected by humans, there's a strong bias towards flights leaving at "nice" departure times, as @fig-human-rounding shows.
Always be alert for this sort of pattern whenever you work with data that involves human judgement!
```{r}
#| label: fig-human-rounding
#| fig-cap: >
#| A frequency polygon showing the number of flights scheduled to
#| depart each hour. You can see a strong preference for round numbers
#| like 0 and 30 and generally for numbers that are a multiple of five.
#| fig-alt: >
#| A line plot with departure minute (0-60) on the x-axis and number of
#| flights (0-60000) on the y-axis. Most flights are scheduled to depart
#| on either the hour (~60,000) or the half hour (~35,000). Otherwise,
#| almost all flights are scheduled to depart on multiples of five,
#| with a few extra at 15, 45, and 55 minutes.
#| echo: false
ggplot(sched_dep, aes(minute, n)) +
geom_line()
```
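The preference for "nice" minutes can also be checked numerically rather than read off a plot. The following is a minimal sketch on a handful of made-up scheduled times (the `sched` vector is hypothetical toy data, not taken from `flights_dt`); it flags the times that fall on a multiple of five minutes and reports their share.

```r
library(lubridate)

# Hypothetical scheduled departure times (toy data for illustration only)
sched <- ymd_hm(c(
  "2013-01-01 06:00", "2013-01-01 06:30", "2013-01-01 07:15",
  "2013-01-01 08:00", "2013-01-01 09:29", "2013-01-01 10:45"
))

# TRUE when the minute component is a multiple of five
on_five <- minute(sched) %% 5 == 0

# Proportion of times that land on a multiple of five
mean(on_five)
```

Applied to real data, the same idea would presumably be a `mutate()` on the scheduled departure time followed by `count()`, but that is an assumption about how you would wire it into the pipeline above.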
@ -421,7 +429,7 @@ You can use rounding to show the distribution of flights across the course of a
flights_dt |>
mutate(dep_hour = dep_time - floor_date(dep_time, "day")) |>
ggplot(aes(dep_hour)) +
geom_freqpoly(binwidth = 60 * 30)
geom_freqpoly(binwidth = 60 * 30)
```
Computing the difference between a pair of date-times yields a difftime (more on that in @sec-intervals).
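To make the difftime result concrete, here is a small sketch on a single date-time. The variable names `dt` and `since_midnight` are only illustrative. It subtracts the start of the day, as the pipeline above does, and then converts the result to a plain number of seconds.

```r
library(lubridate)

dt <- ymd_hms("2026-07-08 12:34:56")

# Time elapsed since midnight on the same day
since_midnight <- dt - floor_date(dt, "day")

class(since_midnight) # "difftime"

# A difftime picks its own units, so request seconds explicitly
as.numeric(since_midnight, units = "secs")
```

Asking for explicit units avoids surprises, because the units a difftime chooses automatically depend on the size of the gap.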
@ -438,12 +446,13 @@ We can convert that to an `hms` object to get a more useful x-axis:
flights_dt |>
mutate(dep_hour = hms::as_hms(dep_time - floor_date(dep_time, "day"))) |>
ggplot(aes(dep_hour)) +
geom_freqpoly(binwidth = 60 * 30)
geom_freqpoly(binwidth = 60 * 30)
```
### Modifying components
You can also use each accessor function to modify the components of a date/time:
You can also use each accessor function to modify the components of a date/time.
This doesn't come up much in data analysis, but can be useful when cleaning data that has clearly incorrect dates.
```{r}
(datetime <- ymd_hms("2026-07-08 12:34:56"))
@ -490,7 +499,7 @@ update(ymd("2023-02-01"), hour = 400)
6. What makes the distribution of `diamonds$carat` and `flights$sched_dep_time` similar?
7. Confirm my hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early.
7. Confirm our hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early.
Hint: create a binary variable that tells you whether or not a flight was delayed.
## Time spans