Feedback from O'Reilly + style fixes

Hadley Wickham 2022-11-23 11:55:08 -06:00
parent f0b19065c7
commit 19c89ebf64
1 changed file with 18 additions and 9 deletions

@@ -334,7 +334,7 @@ We can use `wday()` to see that more flights depart during the week than on the
 flights_dt |>
 mutate(wday = wday(dep_time, label = TRUE)) |>
 ggplot(aes(x = wday)) +
 geom_bar()
 ```
 There's an interesting pattern if we look at the average departure delay by minute within the hour.
@@ -353,9 +353,10 @@ flights_dt |>
 group_by(minute) |>
 summarize(
 avg_delay = mean(dep_delay, na.rm = TRUE),
-n = n()) |>
+n = n()
+) |>
 ggplot(aes(minute, avg_delay)) +
 geom_line()
 ```
 Interestingly, if we look at the *scheduled* departure time we don't see such a strong pattern:
@@ -371,23 +372,30 @@ sched_dep <- flights_dt |>
 group_by(minute) |>
 summarize(
 avg_delay = mean(arr_delay, na.rm = TRUE),
-n = n())
+n = n()
+)
 ggplot(sched_dep, aes(minute, avg_delay)) +
 geom_line()
 ```
 So why do we see that pattern with the actual departure times?
-Well, like much data collected by humans, there's a strong bias towards flights leaving at "nice" departure times.
+Well, like much data collected by humans, there's a strong bias towards flights leaving at "nice" departure times, as @fig-human-rounding shows.
 Always be alert for this sort of pattern whenever you work with data that involves human judgement!
 ```{r}
+#| label: fig-human-rounding
+#| fig-cap: >
+#| A frequency polygon showing the number of flights scheduled to
+#| depart each hour. You can see a strong preference for round numbers
+#| like 0 and 30 and generally for numbers that are a multiple of five.
 #| fig-alt: >
 #| A line plot with departure minute (0-60) on the x-axis and number of
 #| flights (0-60000) on the y-axis. Most flights are scheduled to depart
 #| on either the hour (~60,000) or the half hour (~35,000). Otherwise,
 #| all most all flights are scheduled to depart on multiples of five,
 #| with a few extra at 15, 45, and 55 minutes.
+#| echo: false
 ggplot(sched_dep, aes(minute, n)) +
 geom_line()
 ```
@@ -421,7 +429,7 @@ You can use rounding to show the distribution of flights across the course of a
 flights_dt |>
 mutate(dep_hour = dep_time - floor_date(dep_time, "day")) |>
 ggplot(aes(dep_hour)) +
 geom_freqpoly(binwidth = 60 * 30)
 ```
 Computing the difference between a pair of date-times yields a difftime (more on that in @sec-intervals).
@@ -438,12 +446,13 @@ We can convert that to an `hms` object to get a more useful x-axis:
 flights_dt |>
 mutate(dep_hour = hms::as_hms(dep_time - floor_date(dep_time, "day"))) |>
 ggplot(aes(dep_hour)) +
 geom_freqpoly(binwidth = 60 * 30)
 ```
 ### Modifying components
-You can also use each accessor function to modify the components of a date/time:
+You can also use each accessor function to modify the components of a date/time.
+This doesn't come up much in data analysis, but can be useful when cleaning data that has clearly incorrect dates.
 ```{r}
 (datetime <- ymd_hms("2026-07-08 12:34:56"))
@@ -490,7 +499,7 @@ update(ymd("2023-02-01"), hour = 400)
 6. What makes the distribution of `diamonds$carat` and `flights$sched_dep_time` similar?
-7. Confirm my hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early.
+7. Confirm our hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early.
 Hint: create a binary variable that tells you whether or not a flight was delayed.
 ## Time spans