diff --git a/factors.qmd b/factors.qmd
index 2db1a28..ecf6027 100644
--- a/factors.qmd
+++ b/factors.qmd
@@ -339,7 +339,7 @@ gss_cat |>
   count(partyid)
 ```
 
-`fct_recode()` will leave levels that aren't explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn't exist.
+`fct_recode()` will leave the levels that aren't explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn't exist.
 
 To combine groups, you can assign multiple old levels to the same new level:
 
diff --git a/logicals.qmd b/logicals.qmd
index 419870c..3df9235 100644
--- a/logicals.qmd
+++ b/logicals.qmd
@@ -344,7 +344,7 @@ flights |>
 In most cases, however, `any()` and `all()` are a little too crude, and it would be nice to be able to get a little more detail about how many values are `TRUE` or `FALSE`.
 That leads us to the numeric summaries.
 
-### Numeric summaries
+### Numeric summaries of logical vectors
 
 When you use a logical vector in a numeric context, `TRUE` becomes 1 and `FALSE` becomes 0.
 This makes `sum()` and `mean()` very useful with logical vectors because `sum(x)` will give the number of `TRUE`s and `mean(x)` the proportion of `TRUE`s.
@@ -382,7 +382,7 @@ flights |>
 ### Logical subsetting
 
 There's one final use for logical vectors in summaries: you can use a logical vector to filter a single variable to a subset of interest.
-This makes use of the base `[` (pronounced subset) operator, which you'll learn more about this in @sec-vector-subsetting.
+This makes use of the base `[` (pronounced subset) operator, which you'll learn more about in @sec-vector-subsetting.
 
 Imagine we wanted to look at the average delay just for flights that were actually delayed.
 One way to do so would be to first filter the flights:
diff --git a/missing-values.qmd b/missing-values.qmd
index 42372eb..e7a8694 100644
--- a/missing-values.qmd
+++ b/missing-values.qmd
@@ -35,7 +35,7 @@ To begin, let's explore a few handy tools for creating or eliminating missing ex
 ### Last observation carried forward
 
 A common use for missing values is as a data entry convenience.
-Sometimes data that has been entered by hand, missing values indicate that the value in the previous row has been repeated:
+When data is entered by hand, missing values sometimes indicate that the value in the previous row has been repeated (or carried forward):
 
 ```{r}
 treatment <- tribble(
@@ -60,7 +60,7 @@ You can use the `.direction` argument to fill in missing values that have been g
 
 ### Fixed values
 
-Some times missing values represent some fixed and known value, mostly commonly 0.
+Some times missing values represent some fixed and known value, most commonly 0.
 You can use `dplyr::coalesce()` to replace them:
 
 ```{r}
diff --git a/numbers.qmd b/numbers.qmd
index c2c5761..85ab241 100644
--- a/numbers.qmd
+++ b/numbers.qmd
@@ -28,7 +28,7 @@ library(tidyverse)
 library(nycflights13)
 ```
 
-### Counts
+## Counts
 
 It's surprising how much data science you can do with just counts and a little basic arithmetic, so dplyr strives to make counting as easy as possible with `count()`.
 This function is great for quick exploration and checks during analysis:
@@ -59,7 +59,7 @@ flights |>
   )
 ```
 
-`n()` is a special summary function that doesn't take any arguments and instead access information about the "current" group.
+`n()` is a special summary function that doesn't take any arguments and instead accesses information about the "current" group.
 This means that it only works inside dplyr verbs:
 
 ```{r}
@@ -554,7 +554,7 @@ You can lead or lag by more than one position by using the second argument, `n`.
 8. Find all destinations that are flown by at least two carriers.
    Use those destinations to come up with a relative ranking of the carriers based on their performance for the same destination.
 
-## Summaries
+## Numeric summaries
 
 Just using the counts, means, and sums that we've introduced already can get you a long way, but R provides many other useful summary functions.
 Here are a selection that you might find useful.
@@ -621,12 +621,12 @@ flights |>
 
 ### Spread
 
-Sometimes you're not so interested in where the bulk of the data lies, but how it is spread out.
+Sometimes you're not so interested in where the bulk of the data lies, but in how it is spread out.
 Two commonly used summaries are the standard deviation, `sd(x)`, and the inter-quartile range, `IQR()`.
 We won't explain `sd()` here since you're probably already familiar with it, but `IQR()` might be new --- it's `quantile(x, 0.75) - quantile(x, 0.25)` and gives you the range that contains the middle 50% of the data.
 
 We can use this to reveal a small oddity in the `flights` data.
-You might expect that the spread of the distance between origin and destination to be zero, since airports are always in the same place.
+You might expect the spread of the distance between origin and destination to be zero, since airports are always in the same place.
 But the code below makes it looks like one airport, [EGE](https://en.wikipedia.org/wiki/Eagle_County_Regional_Airport), might have moved.
 
 ```{r}