Eliminate two plots in EDA.qmd

Noticed these in passing. cc @mine-cetinkaya-rundel.
Hadley Wickham 2023-02-07 10:40:45 -06:00
parent 03f1c6c6f4
commit 504db47630
1 changed files with 14 additions and 23 deletions

EDA.qmd

@@ -637,20 +637,6 @@ ggplot(smaller, aes(x = carat, y = price)) +
 By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it's difficult to tell that each boxplot summaries a different number of points.
 One way to show that is to make the width of the boxplot proportional to the number of points with `varwidth = TRUE`.
-Another approach is to display approximately the same number of points in each bin.
-That's the job of `cut_number()`:
-```{r}
-#| fig-alt: >
-#| Side-by-side box plots of price by carat. Each box plot represents 20
-#| diamonds. The box plots show that as carat increases the median price
-#| increases as well. Cheaper, smaller diamonds have outliers on the higher
-#| end, more expensive, bigger diamonds have outliers on the lower end.
-ggplot(smaller, aes(x = carat, y = price)) +
-  geom_boxplot(aes(group = cut_number(carat, 20)))
-```
 #### Exercises
 1. Instead of summarizing the conditional distribution with a boxplot, you could use a frequency polygon.
@@ -665,21 +651,26 @@ ggplot(smaller, aes(x = carat, y = price)) +
 4. Combine two of the techniques you've learned to visualize the combined distribution of cut, carat, and price.
 5. Two dimensional plots reveal outliers that are not visible in one dimensional plots.
-For example, some points in the plot below have an unusual combination of `x` and `y` values, which makes the points outliers even though their `x` and `y` values appear normal when examined separately.
+For example, some points in the following plot have an unusual combination of `x` and `y` values, which makes the points outliers even though their `x` and `y` values appear normal when examined separately.
+Why is a scatterplot a better display than a binned plot for this case?
 ```{r}
-#| dev: "png"
-#| fig-alt: >
-#| A scatterplot of widths vs. lengths of diamonds. There is a positive,
-#| strong, linear relationship. There are a few unusual observations
-#| above and below the bulk of the data, more below it than above.
-ggplot(diamonds, aes(x = x, y = y)) +
+#| eval: false
+diamonds |>
+  filter(x >= 4) |>
+  ggplot(aes(x = x, y = y)) +
   geom_point() +
   coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
 ```
-Why is a scatterplot a better display than a binned plot for this case?
+6. Instead of creating boxes of equal width with `cut_width()`, we could create boxes that contain roughly equal number of points with `cut_number()`.
+What are the advantages and disadvantages of this approach?
+```{r}
+#| eval: false
+ggplot(smaller, aes(x = carat, y = price)) +
+  geom_boxplot(aes(group = cut_number(carat, 20)))
+```
 ## Patterns and models