TR edits - Chp 1-9 (#1312)

* Mention parquet and databases

* Simplify language

* Explain what var and obs mean

* Data View() alternative

* Explain density

* Boxplot definition

* Clarify IQR, hide figure, add exercise

* will -> can

* Transform edits

* Fix typo

* Clarify cases
Mine Cetinkaya-Rundel 2023-02-27 21:54:34 -05:00 committed by GitHub
parent c0f0375d44
commit 9887705f43
5 changed files with 67 additions and 48 deletions


@@ -41,7 +41,7 @@ From this chapter on, we'll suppress the loading message from `library(tidyverse
You can represent the same underlying data in multiple ways.
The example below shows the same data organized in three different ways.
-Each dataset shows the same values of four variables: *country*, *year*, *population*, and *cases* of TB (tuberculosis), but each dataset organizes the values in a different way.
+Each dataset shows the same values of four variables: *country*, *year*, *population*, and number of documented *cases* of TB (tuberculosis), but each dataset organizes the values in a different way.
```{r}
#| echo: false


@@ -111,7 +111,7 @@ For example, we could find all flights that arrived more than 120 minutes (two h
```{r}
flights |>
-filter(arr_delay > 120)
+filter(dep_delay > 120)
```
As well as `>` (greater than), you can use `>=` (greater than or equal to), `<` (less than), `<=` (less than or equal to), `==` (equal to), and `!=` (not equal to).
@@ -192,7 +192,7 @@ flights |>
```
You can combine `arrange()` and `filter()` to solve more complex problems.
-For example, we could look for the flights that were most delayed on arrival that left on roughly on time:
+For example, we could filter for the flights that left roughly on time, then arrange the results to see which flights were most delayed on arrival:
```{r}
flights |>
@@ -210,12 +210,12 @@ Most of the time, however, you'll want the distinct combination of some variable
flights |>
distinct()
-# This finds all unique origin and destination pairs.
+# This finds all unique origin and destination pairs
flights |>
distinct(origin, dest)
```
-Note that if you want to find the number of duplicates, or rows that weren't duplicated, you're better off swapping `distinct()` for `count()` and then filtering as needed.
+Note that if you want to find the number of duplicates, or rows that weren't duplicated, you're better off swapping `distinct()` for `count()`, which will give the number of observations per unique level, and then filtering as needed.
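As a quick sketch of the `distinct()`-to-`count()` swap described above (assuming the chapter's usual setup with dplyr and nycflights13 loaded):

```{r}
# count() returns one row per unique origin/dest combination along with
# the number of flights (n) in each; filtering on n then isolates the
# combinations that appear more than once.
flights |>
  count(origin, dest) |>
  filter(n > 1)
```

Swapping in `filter(n == 1)` would instead keep the combinations that were never duplicated.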
### Exercises
@@ -245,7 +245,7 @@ Note that if you want to find the number of duplicates, or rows that weren't dup
## Columns
There are four important verbs that affect the columns without changing the rows: `mutate()`, `select()`, `rename()`, and `relocate()`.
-`mutate()` creates new columns that are functions of the existing columns; `select()`, `rename()`, and `relocate()` change which columns are present, their names, or their positions.
+`mutate()` creates new columns that are derived from the existing columns; `select()`, `rename()`, and `relocate()` change which columns are present, their names, or their positions.
We'll also discuss `pull()` since it allows you to get a column out of data frame.
### `mutate()` {#sec-mutate}
@@ -421,6 +421,8 @@ ggplot(flights, aes(x = air_time - airtime2)) + geom_histogram()
select(flights, contains("TIME"))
```
6. Rename `air_time` to `air_time_min` to indicate units of measurement and move it to the beginning of the data frame.
## Groups
So far you've learned about functions that work with rows and columns.
@@ -689,12 +691,12 @@ That seems pretty surprising, so lets draw a scatterplot of number of flights vs
```{r}
#| fig-alt: >
-#| A scatterplot showing number of flights versus after delay. Delays
+#| A scatterplot showing number of flights versus average arrival delay. Delays
#| for planes with very small number of flights have very high variability
#| (from -50 to ~300), but the variability rapidly decreases as the
#| number of flights increases.
-ggplot(delays, aes(x = n, y = delay)) +
+ggplot(delays, aes(x = delay, y = n)) +
geom_point(alpha = 1/10)
```
@@ -708,16 +710,14 @@ When looking at this sort of plot, it's often useful to filter out the groups wi
```{r}
#| warning: false
#| fig-alt: >
-#| Now that the y-axis (average delay) is smaller (-20 to 60 minutes),
-#| we can see a more complicated story. The smooth line suggests
-#| an initial decrease in average delay from 10 minutes to 0 minutes
-#| as number of flights per plane increases from 25 to 100.
-#| This is followed by a gradual increase up to 10 minutes for 250
-#| flights, then a gradual decrease to ~5 minutes at 500 flights.
+#| Scatterplot of number of flights of a given plane vs. the average delay
+#| for those flights, for planes with more than 25 flights. As average delay
+#| increases from -20 to 10, the number of flights also increases. For
+#| larger average delayes, the number of flights decreases.
delays |>
filter(n > 25) |>
-ggplot(aes(x = n, y = delay)) +
+ggplot(aes(x = delay, y = n)) +
geom_point(alpha = 1/10) +
geom_smooth(se = FALSE)
```


@@ -72,7 +72,7 @@ And how about by the island where the penguin lives.
You can test your answer with the `penguins` **data frame** found in palmerpenguins (a.k.a. `palmerpenguins::penguins`).
A data frame is a rectangular collection of variables (in the columns) and observations (in the rows).
-`penguins` contains `r nrow(penguins)` observations collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER[^data-visualize-2].
+`penguins` contains `r nrow(penguins)` observations collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER[^data-visualize-2]. In this context, a variable refers to an attribute of all the penguins, and an observation refers to all the attributes of a single penguin.
[^data-visualize-2]: Horst AM, Hill AP, Gorman KB (2020).
palmerpenguins: Palmer Archipelago (Antarctica) penguin data.
@@ -86,7 +86,7 @@ penguins
This data frame contains `r ncol(penguins)` columns.
For an alternative view, where you can see all variables and the first few observations of each variable, use `glimpse()`.
-Or, if you're in RStudio, run `View(penguins)` to open an interactive data viewer.
+Or, if you're in RStudio, click on the name of the data frame in the Environment pane or run `View(penguins)` to open an interactive data viewer.
```{r}
glimpse(penguins)
@@ -157,17 +157,11 @@ ggplot2 looks for the mapped variables in the `data` argument, in this case, `pe
The following plots show the result of adding these mappings, one at a time.
```{r}
-#| layout-ncol: 2
#| fig-alt: >
-#| There are two plots. The plot on the left is shows flipper length on
-#| the x-axis. The values range from 170 to 230 The plot on the right
-#| also shows body mass on the y-axis. The values range from 3000 to
-#| 6000.
+#| The plot shows flipper length on the x-axis, with values that range from
+#| 170 to 230, and body mass on the y-axis, with values that range from 3000
+#| to 6000.
-ggplot(
-  data = penguins,
-  mapping = aes(x = flipper_length_mm)
-)
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
@@ -202,7 +196,7 @@ ggplot(
```
Now we have something that looks like what we might think of as a "scatter plot".
-It doesn't yet match our "ultimate goal" plot, but using this plot we can start answering the question that motivated our exploration: "What does the relationship between flipper length and body mass look like?" The relationship appears to be positive, fairly linear, and moderately strong.
+It doesn't yet match our "ultimate goal" plot, but using this plot we can start answering the question that motivated our exploration: "What does the relationship between flipper length and body mass look like?" The relationship appears to be positive (as flipper length increases, so does body mass), fairly linear (the points are clustered around a line instead of a curve), and moderately strong (there isn't too much scatter around such a line).
Penguins with longer flippers are generally larger in terms of their body mass.
Before we add more layers to this plot, let's pause for a moment and review the warning message we got:
@@ -225,7 +219,8 @@ For the remaining plots in this chapter we will suppress this warning so it's no
### Adding aesthetics and layers
Scatterplots are useful for displaying the relationship between two variables, but it's always a good idea to be skeptical of any apparent relationship between two variables and ask if there may be other variables that explain or change the nature of this apparent relationship.
-Let's incorporate species into our plot and see if this reveals any additional insights into the apparent relationship between flipper length and body mass.
+For example, does the relationship between flipper length and body mass differ by species?
+Let's incorporate species into our plot and see if this reveals any additional insights into the apparent relationship between these variables.
We will do this by representing species with different colored points.
To achieve this, where should `species` go in the ggplot call from earlier?
@@ -483,8 +478,6 @@ penguins |>
geom_point()
```
-This is the most common syntax you'll see in the wild.
## Visualizing distributions
How you visualize the distribution of a variable depends on the type of variable: categorical or numerical.
@@ -525,20 +518,17 @@ You will learn more about factors and functions for dealing with factors (like `
A variable is **numerical** if it can take any of an infinite set of ordered values.
Numbers and date-times are two examples of continuous variables.
-To visualize the distribution of a continuous variable, you can use a histogram or a density plot.
+One commonly used visualization for distributions of continuous variables is a histogram.
```{r}
#| warning: false
-#| layout-ncol: 2
#| fig-alt: >
-#| A histogram (on the left) and density plot (on the right) of body masses
-#| of penguins. The distribution is unimodal and right skewed, ranging
-#| between approximately 2500 to 6500 grams.
+#| A histogram of body masses of penguins. The distribution is unimodal
+#| and right skewed, ranging between approximately 2500 to 6500 grams.
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200)
-ggplot(penguins, aes(x = body_mass_g)) +
-  geom_density()
```
A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin.
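The binning described above can be sketched directly in base R; the toy body-mass values and breaks below are illustrative only:

```{r}
# A histogram bins a numeric vector into equal-width intervals and
# counts the observations that fall in each one.
body_mass <- c(3750, 3800, 3250, 3450, 4675, 3650, 4400, 4650)
breaks <- seq(3000, 5000, by = 500)      # equally spaced bin edges
bins <- cut(body_mass, breaks = breaks)  # assign each value to a bin
table(bins)                              # counts per bin, i.e. the bar heights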
@@ -572,6 +562,23 @@ ggplot(penguins, aes(x = body_mass_g)) +
geom_histogram(binwidth = 2000)
```
+An alternative visualization for distributions of numerical variables is a density plot.
+A density plot is a smoothed-out version of a histogram and a practical alternative, particularly for continuous data that comes from an underlying smooth distribution.
+We won't go into how `geom_density()` estimates the density (you can read more about that in the function documentation), but let's explain how the density curve is drawn with an analogy.
+Imagine a histogram made out of wooden blocks.
+Then, imagine that you drop a cooked spaghetti string over it.
+The shape the spaghetti will take draped over blocks can be thought of as the shape of the density curve.
+It shows fewer details than a histogram but can make it easier to quickly glean the shape of the distribution, particularly with respect to modes and skewness.
+```{r}
+#| fig-alt: >
+#| A density plot of body masses of penguins. The distribution is unimodal
+#| and right skewed, ranging between approximately 2500 to 6500 grams.
+ggplot(penguins, aes(x = body_mass_g)) +
+  geom_density()
+```
### Exercises
1. Make a bar plot of `species` of `penguins`, where you assign `species` to the `y` aesthetic.
@@ -604,10 +611,10 @@ In the following sections you will learn about commonly used plots for visualizi
### A numerical and a categorical variable
To visualize the relationship between a numerical and a categorical variable we can use side-by-side box plots.
-A **boxplot** is a type of visual shorthand for a distribution of values that is popular among statisticians.
+A **boxplot** is a type of visual shorthand for measures of position (percentiles) that describe a distribution that are commonly used in statistical analysis of data.
As shown in @fig-eda-boxplot, each boxplot consists of:
-- A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR).
+- A box that indicates the range of the middle half of the data, a distance known as the interquartile range (IQR), stretching from the 25th percentile of the distribution to the 75th percentile.
In the middle of the box is a line that displays the median, i.e. 50th percentile, of the distribution.
These three lines give you a sense of the spread of the distribution and whether or not the distribution is symmetric about the median or skewed to one side.
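The percentiles that make up the box can also be computed directly; a brief sketch, assuming palmerpenguins is loaded:

```{r}
# 25th, 50th (median), and 75th percentiles of body mass, plus the IQR,
# which is the height of the box in a boxplot of body_mass_g.
quantile(penguins$body_mass_g, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
IQR(penguins$body_mass_g, na.rm = TRUE)
```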
@@ -792,11 +799,7 @@ You will learn about many other geoms for visualizing distributions of variables
```{r}
#| warning: false
-#| fig-alt: >
-#| Scatterplot of bill depth vs. bill length where different color and
-#| shape pairings represent each species. The plot has two legends,
-#| one labelled "species" which shows the shape scale and the other
-#| that shows the color scale.
+#| fig-show: hide
ggplot(
data = penguins,
@@ -809,6 +812,19 @@ You will learn about many other geoms for visualizing distributions of variables
labs(color = "Species")
```
+7. Create the two following segmented bar plots.
+Which question can you answer with the first one?
+Which question can you answer with the second one?
+```{r}
+#| fig-show: hide
+ggplot(penguins, aes(x = island, fill = species)) +
+  geom_bar(position = "fill")
+ggplot(penguins, aes(x = species, fill = island)) +
+  geom_bar(position = "fill")
+```
## Saving your plots {#sec-ggsave}
Once you've made a plot, you might want to get it out of R by saving it as an image that you can use elsewhere.


@@ -97,10 +97,13 @@ This book will teach you the tidymodels family of packages, which, as you might
### Big data
-This book proudly focuses on small, in-memory datasets.
+This book proudly and primarily focuses on small, in-memory datasets.
This is the right place to start because you can't tackle big data unless you have experience with small data.
-The tools you learn in this book will easily handle hundreds of megabytes of data, and with a bit of care, you can typically use them to work with 1-2 Gb of data.
-If you're routinely working with larger data (10-100 Gb, say), you should learn more about [data.table](https://github.com/Rdatatable/data.table).
+The tools you learn in majority of this book will easily handle hundreds of megabytes of data, and with a bit of care, you can typically use them to work with 1-2 Gb of data.
+That being said, the book also touches on getting data out of databases and out of parquet files, both of which are commonly used solutions for storing big data.
+However, if you're routinely working with larger data (10-100 Gb, say), you should learn more about [data.table](https://github.com/Rdatatable/data.table).
This book doesn't teach data.table because it has a very concise interface that offers fewer linguistic cues, which makes it harder to learn.
However, the performance payoff is well worth the effort required to learn it if you're working with large data.
@@ -131,7 +134,7 @@ You should strive to learn new things throughout your career, but make sure your
We think R is a great place to start your data science journey because it is an environment designed from the ground up to support data science.
R is not just a programming language; it is also an interactive environment for doing data science.
To support interaction, R is a much more flexible language than many of its peers.
-This flexibility has its downsides, but the big upside is how easy it is to evolve tailored grammars for specific parts of the data science process.
+This flexibility has its downsides, but the big upside is how easy it is to have code that is structured like the problem you are trying to solve for specific parts of the data science process.
These mini languages help you think about problems as a data scientist while supporting fluent interaction between your brain and the computer.
## Prerequisites


@@ -10,7 +10,7 @@ status("complete")
You now have some experience running R code.
We didn't give you many details, but you've obviously figured out the basics, or you would've thrown this book away in frustration!
-Frustration is natural when you start programming in R because it is such a stickler for punctuation, and even one character out of place will cause it to complain.
+Frustration is natural when you start programming in R because it is such a stickler for punctuation, and even one character out of place can cause it to complain.
But while you should expect to be a little frustrated, take comfort in that this experience is typical and temporary: it happens to everyone, and the only way to get over it is to keep trying.
Before we go any further, let's ensure you've got a solid foundation in running R code and that you know some of the most helpful RStudio features.