diff --git a/datetimes.qmd b/datetimes.qmd
index 6f7e85e..3428636 100644
--- a/datetimes.qmd
+++ b/datetimes.qmd
@@ -334,7 +334,7 @@ We can use `wday()` to see that more flights depart during the week than on the
 flights_dt |>
   mutate(wday = wday(dep_time, label = TRUE)) |>
   ggplot(aes(x = wday)) +
-    geom_bar()
+  geom_bar()
 ```
 
 There's an interesting pattern if we look at the average departure delay by minute within the hour.
@@ -353,9 +353,10 @@ flights_dt |>
   group_by(minute) |>
   summarize(
     avg_delay = mean(dep_delay, na.rm = TRUE),
-    n = n()) |>
+    n = n()
+  ) |>
   ggplot(aes(minute, avg_delay)) +
-    geom_line()
+  geom_line()
 ```
 
 Interestingly, if we look at the *scheduled* departure time we don't see such a strong pattern:
@@ -371,23 +372,30 @@ sched_dep <- flights_dt |>
   group_by(minute) |>
   summarize(
     avg_delay = mean(arr_delay, na.rm = TRUE),
-    n = n())
+    n = n()
+  )
 
 ggplot(sched_dep, aes(minute, avg_delay)) +
   geom_line()
 ```
 
 So why do we see that pattern with the actual departure times?
-Well, like much data collected by humans, there's a strong bias towards flights leaving at "nice" departure times.
+Well, like much data collected by humans, there's a strong bias towards flights leaving at "nice" departure times, as @fig-human-rounding shows.
 Always be alert for this sort of pattern whenever you work with data that involves human judgement!
 
 ```{r}
+#| label: fig-human-rounding
+#| fig-cap: >
+#|   A frequency polygon showing the number of flights scheduled to
+#|   depart each hour. You can see a strong preference for round numbers
+#|   like 0 and 30 and generally for numbers that are a multiple of five.
 #| fig-alt: >
 #|   A line plot with departure minute (0-60) on the x-axis and number of
 #|   flights (0-60000) on the y-axis. Most flights are scheduled to depart
 #|   on either the hour (~60,000) or the half hour (~35,000). Otherwise,
 #|   all most all flights are scheduled to depart on multiples of five,
 #|   with a few extra at 15, 45, and 55 minutes.
+#| echo: false
 ggplot(sched_dep, aes(minute, n)) +
   geom_line()
 ```
@@ -421,7 +429,7 @@ You can use rounding to show the distribution of flights across the course of a
 flights_dt |>
   mutate(dep_hour = dep_time - floor_date(dep_time, "day")) |>
   ggplot(aes(dep_hour)) +
-    geom_freqpoly(binwidth = 60 * 30)
+  geom_freqpoly(binwidth = 60 * 30)
 ```
 
 Computing the difference between a pair of date-times yields a difftime (more on that in @sec-intervals).
@@ -438,12 +446,13 @@ We can convert that to an `hms` object to get a more useful x-axis:
 flights_dt |>
   mutate(dep_hour = hms::as_hms(dep_time - floor_date(dep_time, "day"))) |>
   ggplot(aes(dep_hour)) +
-    geom_freqpoly(binwidth = 60 * 30)
+  geom_freqpoly(binwidth = 60 * 30)
 ```
 
 ### Modifying components
 
-You can also use each accessor function to modify the components of a date/time:
+You can also use each accessor function to modify the components of a date/time.
+This doesn't come up much in data analysis, but can be useful when cleaning data that has clearly incorrect dates.
 
 ```{r}
 (datetime <- ymd_hms("2026-07-08 12:34:56"))
@@ -490,7 +499,7 @@ update(ymd("2023-02-01"), hour = 400)
 
 6.  What makes the distribution of `diamonds$carat` and `flights$sched_dep_time` similar?
 
-7.  Confirm my hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early.
+7.  Confirm our hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early.
     Hint: create a binary variable that tells you whether or not a flight was delayed.
 
 ## Time spans