diff --git a/data-transform.qmd b/data-transform.qmd
index 1b939b9..f1a202f 100644
--- a/data-transform.qmd
+++ b/data-transform.qmd
@@ -96,6 +96,7 @@ Let's dive in!
 
 The most important verbs that operate on rows are `filter()`, which changes which rows are present without changing their order, and `arrange()`, which changes the order of the rows without changing which are present.
 Both functions only affect the rows, and the columns are left unchanged.
+We'll also discuss `distinct()`, which finds rows with unique values; unlike `arrange()` and `filter()`, it can also optionally modify the columns.
 
 ### `filter()`
 
@@ -197,6 +198,23 @@ flights |>
   arrange(desc(arr_delay))
 ```
 
+### `distinct()`
+
+`distinct()` finds all the unique rows in a dataset, so in a technical sense, it primarily operates on the rows.
+Most of the time, however, you'll want the distinct combination of some variables, so you can also optionally supply column names:
+
+```{r}
+# This would remove any duplicate rows if there were any
+flights |>
+  distinct()
+
+# This finds all unique origin and destination pairs.
+flights |>
+  distinct(origin, dest)
+```
+
+Note that if you want to find the number of duplicates, or rows that weren't duplicated, you're better off swapping `distinct()` for `count()` and then filtering as needed.
+
 ### Exercises
 
 1. Find all flights that
@@ -213,10 +231,12 @@ flights |>
 
 3. Sort `flights` to find the fastest flights (Hint: try sorting by a calculation).
 
-4. Which flights traveled the farthest?
-   Which traveled the shortest?
+4. Was there a flight on every day of 2017?
 
-5. Does it matter what order you used `filter()` and `arrange()` in if you're using both?
+5. Which flights traveled the farthest distance?
+   Which traveled the least distance?
+
+6. Does it matter what order you used `filter()` and `arrange()` in if you're using both?
    Why/why not?
    Think about the results and how much work the functions would have to do.
 
@@ -224,6 +244,7 @@ flights |>
 
 There are four important verbs that affect the columns without changing the rows: `mutate()`, `select()`, `rename()`, and `relocate()`.
 `mutate()` creates new columns that are functions of the existing columns; `select()`, `rename()`, and `relocate()` change which columns are present, their names, or their positions.
+We'll also discuss `pull()` since it allows you to get a column out of a data frame.
 
 ### `mutate()` {#sec-mutate}
 