More TR edits (#1335)

* Remaining dataviz edits

* Workflow basics edits

* Bold first appearance

* Data transform edits

* None of this should be causing the action to fail though...

* Update data-transform.qmd

Co-authored-by: Hadley Wickham <h.wickham@gmail.com>

* Update data-transform.qmd

Co-authored-by: Hadley Wickham <h.wickham@gmail.com>

* Address review comments

* Add missing pipe!

---------

Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
Mine Cetinkaya-Rundel 2023-03-02 16:09:49 -05:00 committed by GitHub
parent bd32ddcfbb
commit 1338fcd169
3 changed files with 121 additions and 90 deletions

data-transform.qmd

@ -10,8 +10,8 @@ status("complete")
## Introduction
Visualization is an important tool for generating insight, but it's rare that you get the data in exactly the right form you need for it.
Often you'll need to create some new variables or summaries to see the most important patterns, or maybe you just want to rename the variables or reorder the observations to make the data a little easier to work with.
Visualization is an important tool for generating insight, but it's rare that you get the data in exactly the right form you need to visualize it.
Often you'll need to create some new variables or summaries to answer your questions with your data, or maybe you just want to rename the variables or reorder the observations to make the data a little easier to work with.
You'll learn how to do all that (and more!) in this chapter, which will introduce you to data transformation using the **dplyr** package and a new dataset on flights that departed New York City in 2013.
The goal of this chapter is to give you an overview of all the key tools for transforming a data frame.
@ -47,12 +47,11 @@ The data comes from the US [Bureau of Transportation Statistics](http://www.tran
flights
```
If you've used R before, you might notice that this data frame prints a little differently to other data frames you've seen.
That's because it's a **tibble**, a special type of data frame used by the tidyverse to avoid some common gotchas.
The most important difference is the way it prints: tibbles are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen.
`flights` is a tibble, a special type of data frame used by the tidyverse to avoid some common gotchas.
The most important difference between tibbles and data frames is the way tibbles print; they are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen.
There are a few options to see everything.
If you're using RStudio, the most convenient is probably `View(flights)`, which will open an interactive scrollable and filterable view.
Otherwise you can use `print(flights, width = Inf)` to show all columns, or use call `glimpse()`:
Otherwise you can use `print(flights, width = Inf)` to show all columns, or use `glimpse()`:
```{r}
glimpse(flights)
@ -63,16 +62,16 @@ These are important because the operations you can perform on a column depend so
### dplyr basics
You're about to learn the primary dplyr verbs which will allow you to solve the vast majority of your data manipulation challenges.
You're about to learn the primary dplyr verbs (functions) which will allow you to solve the vast majority of your data manipulation challenges.
But before we discuss their individual differences, it's worth stating what they have in common:
1. The first argument is always a data frame.
2. The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
2. The subsequent arguments typically describe which columns to operate on, using the variable names (without quotes).
3. The result is always a new data frame.
Since each verb is quite simple, solving complex problems will usually require combining multiple verbs, and we'll do so with the pipe, `|>`.
Because each verb does one thing well, solving complex problems will usually require combining multiple verbs, and we'll do so with the pipe, `|>`.
We'll discuss the pipe more in @the-pipe, but in brief, the pipe takes the thing on its left and passes it along to the function on its right so that `x |> f(y)` is equivalent to `f(x, y)`, and `x |> f(y) |> g(z)` is equivalent to `g(f(x, y), z)`.
The easiest way to pronounce the pipe is "then".
That makes it possible to get a sense of the following code even though you haven't yet learned the details:
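As a rough sketch of the kind of pipeline meant here (a hypothetical example on the `flights` data):

```{r}
#| results: false
flights |>
  filter(dest == "IAH") |>
  group_by(year, month, day) |>
  summarize(arr_delay = mean(arr_delay, na.rm = TRUE))
```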
@ -94,7 +93,7 @@ Let's dive in!
## Rows
The most important verbs that operate on rows are `filter()`, which changes which rows are present without changing their order, and `arrange()`, which changes the order of the rows without changing which are present.
The most important verbs that operate on rows of a dataset are `filter()`, which changes which rows are present without changing their order, and `arrange()`, which changes the order of the rows without changing which are present.
Both functions only affect the rows, and the columns are left unchanged.
We'll also discuss `distinct()` which finds rows with unique values but unlike `arrange()` and `filter()` it can also optionally modify the columns.
@ -166,8 +165,8 @@ flights |>
filter(month == 1 | 2)
```
This works, in the sense that it doesn't throw an error, but it doesn't do what you want.
We'll come back to what it does and why in @sec-boolean-operations.
This "works", in the sense that it doesn't throw an error, but it doesn't do what you want because `|` first checks the condition `month == 1` and then checks the condition `2`, which is not a sensible condition to check.
We'll learn more about what's happening here and why in @sec-boolean-operations.
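As a sketch, the condition you likely want instead can be written with two explicit comparisons or with `%in%`:

```{r}
#| results: false
# Flights that departed in January or February
flights |> filter(month == 1 | month == 2)

# The same thing, a little more compactly
flights |> filter(month %in% c(1, 2))
```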
### `arrange()`
@ -175,20 +174,23 @@ We'll come back to what it does and why in @sec-boolean-operations.
It takes a data frame and a set of column names (or more complicated expressions) to order by.
If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.
For example, the following code sorts by the departure time, which is spread over four columns.
We get the earliest years first, then within a year the earliest months, etc.
```{r}
flights |>
arrange(year, month, day, dep_time)
```
You can use `desc()` to re-order by a column in descending order.
For example, this code shows the most delayed flights:
You can use `desc()` to re-order the data frame based on a column, in descending order.
For example, this code shows the most delayed flights first:
```{r}
flights |>
arrange(desc(dep_delay))
```
Note that the number of rows has not changed -- we're only arranging the data, we're not filtering it.
### `distinct()`
`distinct()` finds all the unique rows in a dataset, so in a technical sense, it primarily operates on the rows.
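For example, a minimal sketch that keeps one row per unique origin-destination pair (the column choice is illustrative):

```{r}
#| results: false
flights |>
  distinct(origin, dest)
```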
@ -208,19 +210,20 @@ Note that if you want to find the number of duplicates, or rows that weren't dup
### Exercises
1. Find all flights that
1. In a single pipeline, find all flights that meet all of the following conditions:
a. Had an arrival delay of two or more hours
b. Flew to Houston (`IAH` or `HOU`)
c. Were operated by United, American, or Delta
d. Departed in summer (July, August, and September)
e. Arrived more than two hours late, but didn't leave late
f. Were delayed by at least an hour, but made up over 30 minutes in flight
- Had an arrival delay of two or more hours
- Flew to Houston (`IAH` or `HOU`)
- Were operated by United, American, or Delta
- Departed in summer (July, August, and September)
- Arrived more than two hours late, but didn't leave late
- Were delayed by at least an hour, but made up over 30 minutes in flight
2. Sort `flights` to find the flights with longest departure delays.
Find the flights that left earliest in the morning.
3. Sort `flights` to find the fastest flights (Hint: try sorting by a calculation).
3. Sort `flights` to find the fastest flights.
(Hint: Try including a math calculation inside of your function.)
4. Was there a flight on every day of 2013?
@ -233,9 +236,7 @@ Note that if you want to find the number of duplicates, or rows that weren't dup
## Columns
There are four important verbs that affect the columns without changing the rows: `mutate()`, `select()`, `rename()`, and `relocate()`.
`mutate()` creates new columns that are derived from the existing columns; `select()`, `rename()`, and `relocate()` change which columns are present, their names, or their positions.
We'll also discuss `pull()` since it allows you to get a column out of data frame.
There are four important verbs that affect the columns without changing the rows: `mutate()` creates new columns that are derived from the existing columns, `select()` changes which columns are present; `rename()` changes the names of the columns; and `relocate()` changes the positions of the columns.
### `mutate()` {#sec-mutate}
@ -265,12 +266,13 @@ flights |>
)
```
The `.` is a sign that `.before` is an argument to the function, not the name of a new variable.
The `.` is a sign that `.before` is an argument to the function, not the name of a third new variable we are creating.
You can also use `.after` to add after a variable, and in both `.before` and `.after` you can use the variable name instead of a position.
For example, we could add the new variables after `day`:
```{r}
#| results: false
flights |>
mutate(
gain = dep_delay - arr_delay,
@ -280,11 +282,12 @@ flights |>
```
Alternatively, you can control which variables are kept with the `.keep` argument.
A particularly useful argument is `"used"` which allows you to see the inputs and outputs from your calculations.
A particularly useful argument is `"used"` which allows you to keep only the inputs and outputs from your calculations.
For example, the following output will contain only the variables `dep_delay`, `arr_delay`, `air_time`, `gain`, `hours`, and `gain_per_hour`.
```{r}
#| results: false
flights |>
mutate(
gain = dep_delay - arr_delay,
@ -294,6 +297,10 @@ flights |>
)
```
Note that since we haven't assigned the result of the above computation back to `flights`, the new variables `gain`, `hours`, and `gain_per_hour` will only be printed but will not be stored in a data frame.
And if we want them to be available in a data frame for future use, we should think carefully about whether we want the result to be assigned back to `flights`, overwriting the original data frame with many more variables, or to a new object.
Often, the right answer is a new object that is named informatively to indicate its contents, e.g., `delay_gain`, but you might also have good reasons for overwriting `flights`.
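For instance, a sketch of storing the result in a new object instead of overwriting `flights` (the name `delay_gain` and the `hours` formula are assumptions for illustration):

```{r}
#| results: false
delay_gain <- flights |>
  mutate(
    gain = dep_delay - arr_delay,
    hours = air_time / 60,
    gain_per_hour = gain / hours,
    .keep = "used"
  )
```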
### `select()` {#sec-select}
It's not uncommon to get datasets with hundreds or even thousands of variables.
@ -305,6 +312,7 @@ In this situation, the first challenge is often just focusing on the variables y
```{r}
#| results: false
flights |>
select(year, month, day)
```
@ -313,6 +321,7 @@ In this situation, the first challenge is often just focusing on the variables y
```{r}
#| results: false
flights |>
select(year:day)
```
@ -321,6 +330,7 @@ In this situation, the first challenge is often just focusing on the variables y
```{r}
#| results: false
flights |>
select(!year:day)
```
@ -329,6 +339,7 @@ In this situation, the first challenge is often just focusing on the variables y
```{r}
#| results: false
flights |>
select(where(is.character))
```
@ -353,15 +364,13 @@ flights |>
### `rename()`
If you just want to keep all the existing variables and just want to rename a few, you can use `rename()` instead of `select()`:
If you want to keep all the existing variables and just want to rename a few, you can use `rename()` instead of `select()`:
```{r}
flights |>
rename(tail_num = tailnum)
```
It works exactly the same way as `select()`, but keeps all the variables that aren't explicitly selected.
If you have a bunch of inconsistently named columns and it would be painful to fix them all by hand, check out `janitor::clean_names()` which provides some useful automated cleaning.
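A minimal sketch of that kind of automated cleanup, assuming the janitor package is installed (the messy column names are made up for illustration):

```{r}
#| eval: false
messy <- tibble(`Flight Num` = 1:3, `Dep.Time` = c(517, 533, 542))
messy |> janitor::clean_names()
# column names become flight_num and dep_time
```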
### `relocate()`
@ -375,7 +384,7 @@ flights |>
relocate(time_hour, air_time)
```
But you can use the same `.before` and `.after` arguments as `mutate()` to choose where to put them:
You can also specify where to put them using the `.before` and `.after` arguments, just like in `mutate()`:
```{r}
#| results: false
@ -410,7 +419,7 @@ ggplot(flights, aes(x = air_time - airtime2)) + geom_histogram()
2. Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from `flights`.
3. What happens if you include the name of a variable multiple times in a `select()` call?
3. What happens if you specify the name of a variable multiple times in a `select()` call?
4. What does the `any_of()` function do?
Why might it be helpful in conjunction with this vector?
@ -420,17 +429,27 @@ ggplot(flights, aes(x = air_time - airtime2)) + geom_histogram()
```
5. Does the result of running the following code surprise you?
How do the select helpers deal with case by default?
How do the select helpers deal with upper and lower case by default?
How can you change that default?
```{r}
#| eval: false
select(flights, contains("TIME"))
flights |> select(contains("TIME"))
```
6. Rename `air_time` to `air_time_min` to indicate units of measurement and move it to the beginning of the data frame.
7. Why doesn't the following work, and what does the error mean?
```{r}
#| error: true
flights |>
select(tailnum) |>
arrange(arr_delay)
```
## The pipe {#the-pipe}
We've shown you simple examples of the pipe above, but its real power arises when you start to combine multiple verbs.
@ -444,7 +463,7 @@ flights |>
arrange(desc(speed))
```
Even though this pipe has four steps, it's easy to skim because the verbs come at the start of each line: start with the `flights` data, then filter, then group, then summarize.
Even though this pipeline has four steps, it's easy to skim because the verbs come at the start of each line: start with the `flights` data, then filter, then group, then summarize.
What would happen if we didn't have the pipe?
We could nest each function call inside the previous call:
@ -467,7 +486,7 @@ arrange(
)
```
Or we could use a bunch of intermediate variables:
Or we could use a bunch of intermediate objects:
```{r}
#| results: false
@ -539,7 +558,7 @@ This means subsequent operations will now work "by month".
### `summarize()` {#sec-summarize}
The most important grouped operation is a summary, which collapses each group to a single row.
The most important grouped operation is a summary, which, if used to calculate a single summary statistic, reduces the data frame to a single row for each group.
In dplyr, this operation is performed by `summarize()`[^data-transform-3], as shown by the following example, which computes the average departure delay by month:
[^data-transform-3]: Or `summarise()`, if you prefer British English.
@ -548,13 +567,14 @@ In dplyr, this is operation is performed by `summarize()`[^data-transform-3], as
flights |>
group_by(month) |>
summarize(
delay = mean(dep_delay)
avg_delay = mean(dep_delay)
)
```
Uhoh!
Something has gone wrong and all of our results are `NA` (pronounced "N-A"), R's symbol for missing value.
We'll come back to discuss missing values in detail in @sec-missing-values, but for now we'll remove them by using `na.rm = TRUE`:
Something has gone wrong and all of our results are `NA`s (pronounced "N-A"), R's symbol for a missing value.
This happened because some of the observed flights had missing data in the delay column, and so when we calculated the mean including those values, we got an `NA` result.
We'll come back to discuss missing values in detail in @sec-missing-values, but for now we'll tell the `mean()` function to ignore all missing values by setting the argument `na.rm` to `TRUE`:
```{r}
flights |>
@ -580,29 +600,30 @@ Means and counts can get you a surprisingly long way in data science!
### The `slice_` functions
There are five handy functions that allow you pick off specific rows within each group:
There are five handy functions that allow you to extract specific rows within each group:
- `df |> slice_head(n = 1)` takes the first row from each group.
- `df |> slice_tail(n = 1)` takes the last row in each group.
- `df |> slice_min(x, n = 1)` takes the row with the smallest value of `x`.
- `df |> slice_max(x, n = 1)` takes the row with the largest value of `x`.
- `df |> slice_min(x, n = 1)` takes the row with the smallest value of column `x`.
- `df |> slice_max(x, n = 1)` takes the row with the largest value of column `x`.
- `df |> slice_sample(n = 1)` takes one random row.
You can vary `n` to select more than one row, or instead of `n =`, you can use `prop = 0.1` to select (e.g.) 10% of the rows in each group.
For example, the following code finds the most delayed flight to each destination:
For example, the following code finds the flights that are most delayed upon arrival at each destination:
```{r}
flights |>
group_by(dest) |>
slice_max(arr_delay, n = 1)
slice_max(arr_delay, n = 1) |>
relocate(dest)
```
This is similar to computing the max delay with `summarize()`, but you get the whole row instead of the single summary.
This is similar to computing the max delay with `summarize()`, but you get the whole corresponding row instead of the single summary statistic.
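For comparison, a sketch of the `summarize()` version, which returns only the summary statistic for each destination:

```{r}
#| results: false
flights |>
  group_by(dest) |>
  summarize(max_arr_delay = max(arr_delay, na.rm = TRUE))
```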
### Grouping by multiple variables
You can create groups using more than one variable.
For example, we could make a group for each day:
For example, we could make a group for each day.
```{r}
daily <- flights |>
@ -616,9 +637,7 @@ To make it obvious what's happening, dplyr displays a message that tells you how
```{r}
daily_flights <- daily |>
summarize(
n = n()
)
summarize(n = n())
```
If you're happy with this behavior, you can explicitly request it in order to suppress the message:
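A sketch of that explicit request, spelling out the default behavior with the `.groups` argument:

```{r}
#| results: false
daily_flights <- daily |>
  summarize(n = n(), .groups = "drop_last")
```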
@ -637,28 +656,35 @@ Alternatively, change the default behavior by setting a different value, e.g. `"
### Ungrouping
You might also want to remove grouping outside of `summarize()`.
You might also want to remove grouping from a data frame without using `summarize()`.
You can do this with `ungroup()`.
```{r}
daily |>
ungroup()
```
Now let's see what happens when you summarize an ungrouped data frame.
```{r}
daily |>
ungroup() |>
summarize(
delay = mean(dep_delay, na.rm = TRUE),
avg_delay = mean(dep_delay, na.rm = TRUE),
flights = n()
)
```
As you can see, when you summarize an ungrouped data frame, you get a single row back because dplyr treats all the rows in an ungrouped data frame as belonging to one group.
You get a single row back because dplyr treats all the rows in an ungrouped data frame as belonging to one group.
### Exercises
1. Which carrier has the worst delays?
1. Which carrier has the worst average delays?
Challenge: can you disentangle the effects of bad airports vs. bad carriers?
Why/why not?
(Hint: think about `flights |> group_by(carrier, dest) |> summarize(n())`)
2. Find the most delayed flight to each destination.
2. Find the flights that are most delayed upon departure from each destination.
3. How do delays vary over the course of the day?
Illustrate your answer with a plot.
@ -678,8 +704,7 @@ As you can see, when you summarize an ungrouped data frame, you get a single row
)
```
a. What does the following code do?
Run it, analyze the result, and describe what `group_by()` does.
a. Write down what you think the output will look like, then check if you were correct, and describe what `group_by()` does.
```{r}
#| eval: false
@ -688,8 +713,7 @@ As you can see, when you summarize an ungrouped data frame, you get a single row
group_by(y)
```
b. What does the following code do?
Run it, analyze the result, and describe what `arrange()` does.
b. Write down what you think the output will look like, then check if you were correct, and describe what `arrange()` does.
Also comment on how it's different from the `group_by()` in part (a).
```{r}
@ -699,8 +723,7 @@ As you can see, when you summarize an ungrouped data frame, you get a single row
arrange(y)
```
c. What does the following code do?
Run it, analyze the result, and describe what the pipeline does.
c. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does.
```{r}
#| eval: false
@ -710,8 +733,7 @@ As you can see, when you summarize an ungrouped data frame, you get a single row
summarize(mean_x = mean(x))
```
d. What does the following code do?
Run it, analyze the result, and describe what the pipeline does.
d. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does.
Then, comment on what the message says.
```{r}
@ -722,8 +744,7 @@ As you can see, when you summarize an ungrouped data frame, you get a single row
summarize(mean_x = mean(x))
```
e. What does the following code do?
Run it, analyze the result, and describe what the pipeline does.
e. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does.
How is the output different from the one in part (d)?
```{r}
@ -734,8 +755,7 @@ As you can see, when you summarize an ungrouped data frame, you get a single row
summarize(mean_x = mean(x), .groups = "drop")
```
f. What do the following pipelines do?
Run both, analyze the results, and describe what each pipeline does.
f. Write down what you think the outputs will look like, then check if you were correct, and describe what each pipeline does.
How are the outputs of the two pipelines different?
```{r}
@ -755,7 +775,7 @@ As you can see, when you summarize an ungrouped data frame, you get a single row
Whenever you do any aggregation, it's always a good idea to include a count (`n()`).
That way, you can ensure that you're not drawing conclusions based on very small amounts of data.
We'll demonstrate this with some baseball data from the **Lahman** package.
Specifically, we will compare what proportion of times a player gets a hit vs. the number of times they try to put the ball in play:
Specifically, we will compare what proportion of times a player gets a hit (`H`) vs. the number of times they try to put the ball in play (`AB`):
```{r}
batters <- Lahman::Batting |>
@ -769,7 +789,7 @@ batters
When we plot the skill of the batter (measured by the batting average, `performance`) against the number of opportunities to hit the ball (measured by times at bat, `n`), you see two patterns:
1. The variation in our aggregate decreases as we get more data points.
1. The variation in `performance` is larger among players with fewer at-bats.
The shape of this plot is very characteristic: whenever you plot a mean (or other summary statistics) vs. group size, you'll see that the variation decreases as the sample size increases[^data-transform-4].
2. There's a positive correlation between skill (`performance`) and opportunities to hit the ball (`n`) because teams want to give their best batters the most opportunities to hit the ball.
@ -793,7 +813,7 @@ batters |>
```
Note the handy pattern for combining ggplot2 and dplyr.
It's a bit annoying that you have to switch from `|>` to `+`, but it's not too much of a hassle once you get the hang of it.
You just have to remember to switch from `|>`, for dataset processing, to `+` for adding layers to your plot.
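For instance, a sketch of the pattern using the `batters` data frame computed above (the filter threshold is arbitrary):

```{r}
#| eval: false
batters |>
  filter(n > 100) |>
  ggplot(aes(x = n, y = performance)) +
  geom_point(alpha = 1 / 10) +
  geom_smooth(se = FALSE)
```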
This also has important implications for ranking.
If you naively sort on `desc(performance)`, the people with the best batting averages are clearly lucky, not skilled:

data-visualize.qmd

@ -83,7 +83,7 @@ A data frame is a rectangular collection of variables (in the columns) and obser
Type the name of the data frame in the console and R will print a preview of its contents.
Note that it says `tibble` on top of this preview.
In the tidyverse, we use special data frames called tibbles that you will learn more about soon.
In the tidyverse, we use special data frames called **tibbles** that you will learn more about soon.
```{r}
penguins
@ -155,7 +155,7 @@ ggplot(data = penguins)
Next, we need to tell `ggplot()` how the information from our data will be visually represented.
The `mapping` argument of the `ggplot()` function defines how variables in your dataset are mapped to visual properties (**aesthetics**) of your plot.
The `mapping` argument is always paired with the `aes()` function, and the `x` and `y` arguments of `aes()` specify which variables to map to the x and y axes.
The `mapping` argument is always defined using the `aes()` function, and the `x` and `y` arguments of `aes()` specify which variables to map to the x and y axes.
For now, we will only map flipper length to the `x` aesthetic and body mass to the `y` aesthetic.
ggplot2 looks for the mapped variables in the `data` argument, in this case, `penguins`.
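As a sketch, that mapping looks like this (the variable names follow the penguins examples above):

```{r}
#| eval: false
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
)
```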
@ -662,7 +662,7 @@ ggplot(penguins, aes(x = body_mass_g, color = species)) +
We've also customized the thickness of the lines using the `linewidth` argument in order to make them stand out a bit more against the background.
Alternatively, we can map `species` to both `color` and `fill` aesthetics and use the `alpha` aesthetic to add transparency to the filled density curves.
Additionally, we can map `species` to both `color` and `fill` aesthetics and use the `alpha` aesthetic to add transparency to the filled density curves.
This aesthetic takes values between 0 (completely transparent) and 1 (completely opaque).
In the following plot it's *set* to 0.5.
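A hedged sketch of what that plot could look like in code (note that `alpha` is *set* outside `aes()` because it's a fixed value, not a mapped variable):

```{r}
#| eval: false
ggplot(penguins, aes(x = body_mass_g, color = species, fill = species)) +
  geom_density(alpha = 0.5)
```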
@ -754,7 +754,7 @@ To facet your plot by a single variable, use `facet_wrap()`.
The first argument of `facet_wrap()` is a formula[^data-visualize-3], which you create with `~` followed by a variable name.
The variable that you pass to `facet_wrap()` should be categorical.
[^data-visualize-3]: Here "formula" is the name of the type of thing created by `~`, not a synonym for "equation".
[^data-visualize-3]: Here "formula" is the name of the thing created by `~`, not a synonym for "equation".
```{r}
#| warning: false
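# A hedged sketch of the faceted plot described above; the penguin variable
# names and aesthetics are assumed from the earlier examples.
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species)) +
  facet_wrap(~island)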

workflow-basics.qmd

@ -18,7 +18,7 @@ Before we go any further, let's ensure you've got a solid foundation in running
## Coding basics
Let's review some basics we've omitted so far in the interest of getting you plotting as quickly as possible.
You can use R as a calculator:
You can use R to do basic math calculations:
```{r}
1 / 200 * 30
@ -32,13 +32,16 @@ You can create new objects with the assignment operator `<-`:
x <- 3 * 4
```
Note that the value of `x` is not printed, it's just stored.
If you want to view the value, type `x` in the console.
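A one-line check, continuing from the assignment above:

```{r}
x
```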
You can **c**ombine multiple elements into a vector with `c()`:
```{r}
primes <- c(2, 3, 5, 7, 11, 13)
```
And basic arithmetic is applied to every element of the vector:
And operations applied to the vector are applied to every element of it:
```{r}
primes * 2
@ -58,7 +61,7 @@ When reading that code, say "object name gets value" in your head.
You will make lots of assignments, and `<-` is a pain to type.
You can save time with RStudio's keyboard shortcut: Alt + - (the minus sign).
Notice that RStudio automatically surrounds `<-` with spaces, which is a good code formatting practice.
Code is miserable to read on a good day, so giveyoureyesabreak and use spaces.
Code can be miserable to read on a good day, so giveyoureyesabreak and use spaces.
## Comments
@ -81,8 +84,7 @@ But as the code you're writing gets more complex, comments can save you (and you
Use comments to explain the *why* of your code, not the *how* or the *what*.
The *what* and *how* of your code are always possible to figure out, even if it might be tedious, by carefully reading it.
But if you describe the "what" in your comments and your code, you'll have to remember to update the comment and code in tandem carefully.
If you change the code and forget to update the comment, they'll be inconsistent, leading to confusion when you return to your code in the future.
If you describe every step in the comments, and then change the code, you will have to remember to update the comments as well or it will be confusing when you return to your code in the future.
Figuring out *why* something was done is much more difficult, if not impossible.
For example, `geom_smooth()` has an argument called `span`, which controls the smoothness of the curve, with larger values yielding a smoother curve.
@ -122,11 +124,10 @@ this_is_a_really_long_name <- 2.5
To inspect this object, try out RStudio's completion facility: type "this", press TAB, add characters until you have a unique prefix, then press return.
Ooops, you made a mistake!
The value of `this_is_a_really_long_name` should be 3.5, not 2.5.
Use another keyboard shortcut to help you fix it.
Type "this" then press Cmd/Ctrl + ↑.
Doing so will list all the commands you've typed that start with those letters.
Let's assume you made a mistake, and that the value of `this_is_a_really_long_name` should be 3.5, not 2.5.
You can use another keyboard shortcut to help you fix it.
For example, you can press ↑ to bring back the last command you typed and edit it.
Or, type "this" then press Cmd/Ctrl + ↑ to list all the commands you've typed that start with those letters.
Use the arrow keys to navigate, then press enter to retype the command.
Change 2.5 to 3.5 and rerun.
@ -148,6 +149,7 @@ R_rocks
```
This illustrates the implied contract between you and R: R will do the tedious computations for you, but in exchange, you must be completely precise in your instructions.
If not, you're likely to get an error that says the object you're looking for was not found.
Typos matter; R can't read your mind and say, "oh, they probably meant `r_rocks` when they typed `r_rock`".
Case matters; similarly, R can't read your mind and say, "oh, they probably meant `r_rocks` when they typed `R_rocks`".
@ -158,7 +160,7 @@ R has a large collection of built-in functions that are called like this:
```{r}
#| eval: false
function_name(arg1 = val1, arg2 = val2, ...)
function_name(argument1 = value1, argument2 = value2, ...)
```
Let's try using `seq()`, which makes regular **seq**uences of numbers, and while we're at it, learn more helpful features of RStudio.
@ -170,13 +172,21 @@ If you want more help, press F1 to get all the details in the help tab in the lo
When you've selected the function you want, press TAB again.
RStudio will add matching opening (`(`) and closing (`)`) parentheses for you.
Type the arguments `1, 10` and hit return.
Type the name of the first argument, `from`, and set it equal to `1`.
Then, type the name of the second argument, `to`, and set it equal to `10`.
Finally, hit return.
```{r}
seq(from = 1, to = 10)
```
We often omit the names of the first arguments in function calls, so we can rewrite this as follows:
```{r}
seq(1, 10)
```
Type this code and notice that RStudio provides similar assistance with the paired quotation marks:
Type the following code and notice that RStudio provides similar assistance with the paired quotation marks:
```{r}
x <- "hello world"
@ -222,10 +232,11 @@ knitr::include_graphics("screenshots/rstudio-env.png")
```{r}
#| eval: false
libary(tidyverse)
libary(todyverse)
ggplot(dota = mpg) +
geom_point(maping = aes(x = displ, y = hwy))
ggplot(dTA = mpg) +
geom_point(maping = aes(x = displ y = hwy)) +
geom_smooth(method = "lm)
```
3. Press Option + Shift + K / Alt + Shift + K.