diff --git a/data-transform.qmd b/data-transform.qmd
index 83d7868..032e836 100644
--- a/data-transform.qmd
+++ b/data-transform.qmd
@@ -58,7 +58,7 @@ glimpse(flights)
 ```
 
 In both views, the variables names are followed by abbreviations that tell you the type of each variable: `<int>` is short for integer, `<dbl>` is short for double (aka real numbers), `<chr>` for character (aka strings), and `<dttm>` for date-time.
-These are important because the operations you can perform on a column depend so much on its "type", and these types are used to organize the chapters in the next section of the book.
+These are important because the operations you can perform on a column depend so much on its "type".
 
 ### dplyr basics
 
@@ -102,7 +102,7 @@ We'll also discuss `distinct()` which finds rows with unique values but unlike `
 `filter()` allows you to keep rows based on the values of the columns[^data-transform-1].
 The first argument is the data frame.
 The second and subsequent arguments are the conditions that must be true to keep the row.
-For example, we could find all flights that arrived more than 120 minutes (two hours) late:
+For example, we could find all flights that departed more than 120 minutes (two hours) late:
 
 [^data-transform-1]: Later, you'll learn about the `slice_*()` family which allows you to choose rows based on their positions.
 
@@ -225,7 +225,7 @@ flights |>
 
 ### Exercises
 
-1. In a single pipeline, find all flights that meet all of the following conditions:
+1. In a single pipeline, find all flights that meet each of the following conditions:
 
     - Had an arrival delay of two or more hours
     - Flew to Houston (`IAH` or `HOU`)
@@ -251,7 +251,7 @@ flights |>
 
 ## Columns
 
-There are four important verbs that affect the columns without changing the rows: `mutate()` creates new columns that are derived from the existing columns, `select()` changes which columns are present; `rename()` changes the names of the columns; and `relocate()` changes the positions of the columns.
+There are four important verbs that affect the columns without changing the rows: `mutate()` creates new columns that are derived from the existing columns, `select()` changes which columns are present, `rename()` changes the names of the columns, and `relocate()` changes the positions of the columns.
 
 ### `mutate()` {#sec-mutate}
 
@@ -479,7 +479,7 @@ flights |>
   arrange(desc(speed))
 ```
 
-Even though this pipeline has four steps, it's easy to skim because the verbs come at the start of each line: start with the `flights` data, then filter, then group, then summarize.
+Even though this pipeline has four steps, it's easy to skim because the verbs come at the start of each line: start with the `flights` data, then filter, then mutate, then select, then arrange.
 What would happen if we didn't have the pipe?
 We could nest each function call inside the previous call:
 
@@ -575,7 +575,7 @@ This means subsequent operations will now work "by month".
 ### `summarize()` {#sec-summarize}
 
 The most important grouped operation is a summary, which, if being used to calculate a single summary statistic, reduces the data frame to have a single row for each group.
-In dplyr, this is operation is performed by `summarize()`[^data-transform-3], as shown by the following example, which computes the average departure delay by month:
+In dplyr, this operation is performed by `summarize()`[^data-transform-3], as shown by the following example, which computes the average departure delay by month:
 
 [^data-transform-3]: Or `summarise()`, if you prefer British English.