diff --git a/EDA.qmd b/EDA.qmd
index 3ed05be..32f6685 100644
--- a/EDA.qmd
+++ b/EDA.qmd
@@ -637,20 +637,6 @@ ggplot(smaller, aes(x = carat, y = price)) +
 By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it's difficult to tell that each boxplot summaries a different number of points.
 One way to show that is to make the width of the boxplot proportional to the number of points with `varwidth = TRUE`.
 
-Another approach is to display approximately the same number of points in each bin.
-That's the job of `cut_number()`:
-
-```{r}
-#| fig-alt: >
-#|   Side-by-side box plots of price by carat. Each box plot represents 20
-#|   diamonds. The box plots show that as carat increases the median price
-#|   increases as well. Cheaper, smaller diamonds have outliers on the higher
-#|   end, more expensive, bigger diamonds have outliers on the lower end.
-
-ggplot(smaller, aes(x = carat, y = price)) +
-  geom_boxplot(aes(group = cut_number(carat, 20)))
-```
-
 #### Exercises
 
 1.  Instead of summarizing the conditional distribution with a boxplot, you could use a frequency polygon.
@@ -665,21 +651,26 @@ ggplot(smaller, aes(x = carat, y = price)) +
 4.  Combine two of the techniques you've learned to visualize the combined distribution of cut, carat, and price.
 
 5.  Two dimensional plots reveal outliers that are not visible in one dimensional plots.
-    For example, some points in the plot below have an unusual combination of `x` and `y` values, which makes the points outliers even though their `x` and `y` values appear normal when examined separately.
+    For example, some points in the following plot have an unusual combination of `x` and `y` values, which makes the points outliers even though their `x` and `y` values appear normal when examined separately.
+    Why is a scatterplot a better display than a binned plot for this case?
 
     ```{r}
-    #| dev: "png"
-    #| fig-alt: >
-    #|   A scatterplot of widths vs. lengths of diamonds. There is a positive,
-    #|   strong, linear relationship. There are a few unusual observations
-    #|   above and below the bulk of the data, more below it than above.
-
-    ggplot(diamonds, aes(x = x, y = y)) +
+    #| eval: false
+    diamonds |>
+      filter(x >= 4) |>
+      ggplot(aes(x = x, y = y)) +
       geom_point() +
       coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
     ```
 
-    Why is a scatterplot a better display than a binned plot for this case?
+6.  Instead of creating boxes of equal width with `cut_width()`, we could create boxes that contain roughly equal number of points with `cut_number()`.
+    What are the advantages and disadvantages of this approach?
+
+    ```{r}
+    #| eval: false
+    ggplot(smaller, aes(x = carat, y = price)) +
+      geom_boxplot(aes(group = cut_number(carat, 20)))
+    ```
 
 ## Patterns and models
 