diff --git a/EDA.qmd b/EDA.qmd
index 74aa0e0..75e00e3 100644
--- a/EDA.qmd
+++ b/EDA.qmd
@@ -112,7 +112,7 @@ ggplot(data = diamonds, mapping = aes(x = cut)) +
 ```
 
 The height of the bars displays how many observations occurred with each x value.
-You can compute these values manually with `dplyr::count()`:
+You can compute these values manually with `count()`:
 
 ```{r}
 diamonds |>
@@ -136,7 +136,7 @@ ggplot(data = diamonds, mapping = aes(x = carat)) +
   geom_histogram(binwidth = 0.5)
 ```
 
-You can compute this by hand by combining `dplyr::count()` and `ggplot2::cut_width()`:
+You can compute this by hand by combining `count()` and `cut_width()`:
 
 ```{r}
 diamonds |>
@@ -359,17 +359,17 @@ If you've encountered unusual values in your dataset, and simply want to move on
 
 2.  Instead, I recommend replacing the unusual values with missing values.
     The easiest way to do this is to use `mutate()` to replace the variable with a modified copy.
-    You can use the `ifelse()` function to replace unusual values with `NA`:
+    You can use the `if_else()` function to replace unusual values with `NA`:
 
     ```{r}
     diamonds2 <- diamonds |>
-      mutate(y = ifelse(y < 3 | y > 20, NA, y))
+      mutate(y = if_else(y < 3 | y > 20, NA, y))
     ```
 
-`ifelse()` has three arguments.
+`if_else()` has three arguments.
 The first argument `test` should be a logical vector.
 The result will contain the value of the second argument, `yes`, when `test` is `TRUE`, and the value of the third argument, `no`, when it is false.
-Alternatively to `if_else()`, use `dplyr::case_when()`.
+Alternatively to `if_else()`, use `case_when()`.
 `case_when()` is particularly useful inside mutate when you want to create a new variable that relies on a complex combination of existing variables or would otherwise require multiple `if_else()` statements nested inside one another.
 
 Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing.
@@ -397,10 +397,12 @@ ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
 ```
 
 Other times you want to understand what makes observations with missing values different to observations with recorded values.
-For example, in `nycflights13::flights`, missing values in the `dep_time` variable indicate that the flight was cancelled.
+For example, in `nycflights13::flights`[^eda-1], missing values in the `dep_time` variable indicate that the flight was cancelled.
 So you might want to compare the scheduled departure times for cancelled and non-cancelled times.
 You can do this by making a new variable with `is.na()`.
 
+[^eda-1]: Remember that when we need to be explicit about where a function (or dataset) comes from, we'll use the special form `package::function()` or `package::dataset`.
+
 ```{r}
 #| fig-alt: >
 #|   A frequency polygon of scheduled departure times of flights. Two lines
diff --git a/data-transform.qmd b/data-transform.qmd
index 3b457f0..30b25ff 100644
--- a/data-transform.qmd
+++ b/data-transform.qmd
@@ -31,6 +31,8 @@ library(tidyverse)
 Take careful note of the conflicts message that's printed when you load the tidyverse.
 It tells you that dplyr overwrites some functions in base R.
 If you want to use the base version of these functions after loading dplyr, you'll need to use their full names: `stats::filter()` and `stats::lag()`.
+So far we've mostly ignored which package a function comes from because most of the time it doesn't matter.
+However, knowing the package can help you find help and find related functions, so when we need to be precise about which package a function comes from, we'll use the same syntax as R: `packagename::functionname()`.
 
 ### nycflights13
 
diff --git a/data-visualize.qmd b/data-visualize.qmd
index 42dab9d..c30c4b9 100644
--- a/data-visualize.qmd
+++ b/data-visualize.qmd
@@ -42,9 +42,6 @@ library(tidyverse)
 
 You only need to install a package once, but you need to reload it every time you start a new session.
 
-If we need to be explicit about where a function (or dataset) comes from, we'll use the special form `package::function()`.
-For example, `ggplot2::ggplot()` tells you explicitly that we're using the `ggplot()` function from the ggplot2 package.
-
 ## First steps
 
 Let's use our first graph to answer a question: Do cars with big engines use more fuel than cars with small engines?