base-R typos and comments (#1123)

Jon Harmon 2022-12-06 12:48:23 -06:00 committed by GitHub
parent 4f88cf741f
commit cf823b61fb
1 changed files with 21 additions and 23 deletions

@ -70,11 +70,9 @@ There are five main types of things that you can subset a vector with, i.e. that
x <- c(10, 3, NA, 5, 8, 1, NA)
# All non-missing values of x
!is.na(x)
x[!is.na(x)]
# All even (or missing!) values of x
x %% 2 == 0
x[x %% 2 == 0]
```
@ -96,7 +94,7 @@ There are five main types of things that you can subset a vector with, i.e. that
### Subsetting data frames
There are quite a few different ways[^base-r-1] that you can use `[` with a data frame, but the most important way is to selecting rows and columns independently with `df[rows, cols]`. Here `rows` and `cols` are vectors as described above.
There are quite a few different ways[^base-r-1] that you can use `[` with a data frame, but the most important way is to select rows and columns independently with `df[rows, cols]`. Here `rows` and `cols` are vectors as described above.
For example, `df[rows, ]` and `df[, cols]` select just rows or just columns, using the empty subset to preserve the other dimension.
[^base-r-1]: Read <https://adv-r.hadley.nz/subsetting.html#subset-multiple> to see how you can also subset a data frame like it is a 1d object and how you can subset it with a matrix.
@ -125,8 +123,8 @@ We need to use it here because `[` doesn't use tidy evaluation, so you need to b
There's an important difference between tibbles and data frames when it comes to `[`.
In this book we've mostly used tibbles, which *are* data frames, but they tweak some older behaviors to make your life a little easier.
In most places, you can use tibbles and data frame interchangeably, so when we want to draw particular attention to R's built-in data frame, we'll write `data.frame`s.
So if `df` is a `data.frame`, then `df[, cols]` will return a vector if `col` selects a single column and a data frame if it selects more than one column.
In most places, you can use "tibble" and "data frame" interchangeably, so when we want to draw particular attention to R's built-in data frame, we'll write `data.frame`.
If `df` is a `data.frame`, then `df[, cols]` will return a vector if `cols` selects a single column and a data frame if it selects more than one column.
If `df` is a tibble, then `[` will always return a tibble.
```{r}
@ -140,7 +138,7 @@ df2[, "x"]
One way to avoid this ambiguity with `data.frame`s is to explicitly specify `drop = FALSE`:
```{r}
df1[, "x", drop = FALSE]
df1[, "x" , drop = FALSE]
```
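As a minimal base R sketch of the same point, the chunk below uses a small data frame named `df_demo`, a hypothetical name introduced only for this illustration, to check what each form of `[` returns:

```{r}
# Hypothetical one-column data.frame, defined only for this illustration
df_demo <- data.frame(x = 1:3)

# Selecting a single column drops to a bare vector by default
is.data.frame(df_demo[, "x"])

# drop = FALSE keeps the data frame structure
is.data.frame(df_demo[, "x", drop = FALSE])
```

The first call should report that the result is no longer a data frame, while the second keeps it as one.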
### dplyr equivalents
@ -174,7 +172,7 @@ A number of dplyr verbs are special cases of `[`:
df[order(df$x, df$y), ]
```
You can use `order(decreasing = TRUE)` to sort all columns in descending order or `-rank(col)` to individual sort columns in decreasing order.
You can use `order(decreasing = TRUE)` to sort all columns in descending order or `-rank(col)` to individually sort columns in decreasing order.
- Both `select()` and `relocate()` are similar to subsetting the columns with a character vector:
@ -215,11 +213,11 @@ This function was the inspiration for much of dplyr's syntax.
## Selecting a single element `$` and `[[` {#sec-subset-one}
`[`, which selects many elements, is paired with `[[` and `$`, which extract a single element.
In this section, we'll show you how to use `[[` and `$` to pull columns out of a data frames, discuss a couple more differences between `data.frames` and tibbles, and emphasize some important differences between `[` and `[[` when used with lists.
In this section, we'll show you how to use `[[` and `$` to pull columns out of data frames, discuss a couple more differences between `data.frames` and tibbles, and emphasize some important differences between `[` and `[[` when used with lists.
### Data frames
`[[` and `$` can be used extract columns out of a data frame.
`[[` and `$` can be used to extract columns out of a data frame.
`[[` can access by position or by name, and `$` is specialized for access by name:
```{r}
@ -243,11 +241,11 @@ tb$z <- tb$x + tb$y
tb
```
There are a number other base approaches to creating new columns including with `transform()`, `with()`, and `within()`.
There are a number of other base R approaches to creating new columns including with `transform()`, `with()`, and `within()`.
Hadley collected a few examples at <https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf>.
Using `$` directly is convenient when performing quick summaries.
For example, if you just want find the size of the biggest diamond or the possible values of `cut`, there's no need to use `summarize()`:
For example, if you just want to find the size of the biggest diamond or the possible values of `cut`, there's no need to use `summarize()`:
```{r}
max(diamonds$carat)
@ -289,7 +287,7 @@ For this reason we sometimes joke that tibbles are lazy and surly: they do less
### Lists
`[[` and `$` are also really important for working with lists, and it's important to understand how they differ to `[`.
`[[` and `$` are also really important for working with lists, and it's important to understand how they differ from `[`.
Let's illustrate the differences with a list named `l`:
```{r}
@ -306,6 +304,7 @@ l <- list(
```{r}
str(l[1:2])
str(l[1])
str(l[4])
```
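The contrast between `[` and `[[` can also be checked directly. This is a small sketch using `l_demo`, a hypothetical list defined here only so that the chunk runs on its own:

```{r}
# Hypothetical list, defined only for this illustration
l_demo <- list(a = 1:3, b = "a string")

# `[` returns a (shorter) list, even when it selects one element
str(l_demo[1])

# `[[` and `$` return the element itself
str(l_demo[[1]])
str(l_demo$a)
```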
@ -390,7 +389,7 @@ df[["x"]]
In @sec-iteration, you learned tidyverse techniques for iteration like `dplyr::across()` and the map family of functions.
In this section, you'll learn about their base equivalents, the **apply family**.
In this context apply and maps are synonyms because another way of saying "map a function over each element of a vector" is "apply a function over each element of a vector".
In this context apply and map are synonyms because another way of saying "map a function over each element of a vector" is "apply a function over each element of a vector".
Here we'll give you a quick overview of this family so you can recognize them in the wild.
The most important member of this family is `lapply()`, which is very similar to `purrr::map()`[^base-r-3].
@ -442,13 +441,13 @@ Unfortunately `tapply()` returns its results in a named vector which requires so
If you want to see how you might use `tapply()` or other base techniques to perform other grouped summaries, Hadley has collected a few techniques [in a gist](https://gist.github.com/hadley/c430501804349d382ce90754936ab8ec).
The final member of the apply family is the titular `apply()`, which works with matrices and arrays.
In particular, watch out of `apply(df, 2, something)` which is a slow and potentially dangerous way of doing `lapply(df, something)`.
In particular, watch out for `apply(df, 2, something)`, which is a slow and potentially dangerous way of doing `lapply(df, something)`.
This rarely comes up in data science because we usually work with data frames and not matrices.
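One reason `apply()` can surprise you with data frames is that it converts its input to a matrix first, so columns of mixed types are coerced to a common type. The sketch below assumes a small data frame named `df_mixed`, a hypothetical name used only for this illustration:

```{r}
# Hypothetical data frame with one integer and one character column
df_mixed <- data.frame(
  x = 1:3,
  y = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# apply() goes through as.matrix(), so every column is seen as character
apply(df_mixed, 2, class)

# lapply() works column by column and keeps the original types
lapply(df_mixed, class)
```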
## For loops
For loops are the fundamental building block of iteration that both the apply and map families use under the hood.
For loops are powerful and general tool that are important to learn as you become a more experienced R programmer.
For loops are powerful and general tools that are important to learn as you become a more experienced R programmer.
The basic structure of a for loop looks like this:
```{r}
@ -458,7 +457,7 @@ for (element in vector) {
}
```
The most straightforward use of `for()` loops is achieve the same affect as `walk()`: call some function with a side-effect on each element of a list.
The most straightforward use of `for()` loops is to achieve the same effect as `walk()`: call some function with a side-effect on each element of a list.
For example, in @sec-save-database instead of using walk:
```{r}
@ -519,12 +518,12 @@ for (path in paths) {
```
We recommend avoiding this pattern because it can become very slow when the vector is very long.
This the source of the persistent canard that `for` loops are slow: they're not, but iteratively growing a vector is.
This is the source of the persistent canard that `for` loops are slow: they're not, but iteratively growing a vector is.
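As a rough sketch of the alternative, you can create the output at its final length before the loop and fill it in by position. The names `n`, `grown`, and `filled` are made up for this illustration, and squaring `i` stands in for whatever work the loop really does:

```{r}
n <- 1000

# Growing: each c() call copies the vector built so far
grown <- c()
for (i in seq_len(n)) {
  grown <- c(grown, i^2)
}

# Preallocating: the output is created once and filled in by position
filled <- numeric(n)
for (i in seq_len(n)) {
  filled[[i]] <- i^2
}

# Both loops produce the same values
identical(grown, filled)
```

Both versions use a `for` loop; only the first one repeatedly copies its output, which is where the slowdown comes from as `n` gets large.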
## Plots
Many R users who don't otherwise use the tidyverse prefer ggplot2 for plotting due to helpful features like sensible defaults, automatic legends, modern look.
However, base R plotting functions can still be useful because they're so concise --- it's very little typing to do a basic exploratory plot.
Many R users who don't otherwise use the tidyverse prefer ggplot2 for plotting due to helpful features like sensible defaults, automatic legends, and a modern look.
However, base R plotting functions can still be useful because they're so concise --- it takes very little typing to do a basic exploratory plot.
There are two main types of base plot you'll see in the wild: scatterplots and histograms, produced with `plot()` and `hist()` respectively.
Here's a quick example from the diamonds dataset:
@ -540,11 +539,10 @@ Note that base plotting functions work with vectors, so you need to pull columns
## Summary
In this chapter, we've shown you selection of base R functions useful for subsetting and iteration.
Compared to approaches discussed elsewhere in the book, these functions tend have more of a "vector" flavor than a "data frame" flavor because base R functions tend to take individual vectors, rather than a data frame and some column specification.
In this chapter, we've shown you a selection of base R functions useful for subsetting and iteration.
Compared to approaches discussed elsewhere in the book, these functions tend to have more of a "vector" flavor than a "data frame" flavor because base R functions tend to take individual vectors, rather than a data frame and some column specification.
This often makes life easier for programming and so becomes more important as you write more functions and begin to write your own packages.
This chapter concludes the programming section of the book.
You've made a solid start on your journey to becoming not just a data scientist who uses R, but a data scientist who can *program* in R.
We hope these chapters have sparked your interested in programming and that you're are looking forward to learning more outside of this book.
We hope these chapters have sparked your interest in programming and that you're looking forward to learning more outside of this book.