US spelling

Mine Çetinkaya-Rundel 2022-04-13 22:45:52 -04:00
parent 9a7c0c405c
commit 69002adce6
1 changed file with 23 additions and 23 deletions

EDA.Rmd

@@ -2,13 +2,13 @@
## Introduction
This chapter will show you how to use visualisation and transformation to explore your data in a systematic way, a task that statisticians call exploratory data analysis, or EDA for short.
This chapter will show you how to use visualization and transformation to explore your data in a systematic way, a task that statisticians call exploratory data analysis, or EDA for short.
EDA is an iterative cycle.
You:
1. Generate questions about your data.
2. Search for answers by visualising, transforming, and modelling your data.
2. Search for answers by visualizing, transforming, and modelling your data.
3. Use what you learn to refine your questions and/or generate new questions.
@@ -20,7 +20,7 @@ As your exploration continues, you will home in on a few particularly productive
EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data.
Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not.
To do data cleaning, you'll need to deploy all the tools of EDA: visualisation, transformation, and modelling.
To do data cleaning, you'll need to deploy all the tools of EDA: visualization, transformation, and modelling.
### Prerequisites
@@ -81,11 +81,11 @@ This is true even if you measure quantities that are constant, like the speed of
Each of your measurements will include a small amount of error that varies from measurement to measurement.
Categorical variables can also vary if you measure across different subjects (e.g. the eye colors of different people), or different times (e.g. the energy levels of an electron at different moments).
Every variable has its own pattern of variation, which can reveal interesting information.
The best way to understand that pattern is to visualise the distribution of the variable's values.
The best way to understand that pattern is to visualize the distribution of the variable's values.
### Visualising distributions
### Visualizing distributions
How you visualise the distribution of a variable will depend on whether the variable is categorical or continuous.
How you visualize the distribution of a variable will depend on whether the variable is categorical or continuous.
A variable is **categorical** if it can only take one of a small set of values.
In R, categorical variables are usually saved as factors or character vectors.
To examine the distribution of a categorical variable, use a bar chart:
@@ -143,9 +143,9 @@ ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
geom_freqpoly(binwidth = 0.1)
```
There are a few challenges with this type of plot, which we will come back to in [visualising a categorical and a continuous variable](#cat-cont).
There are a few challenges with this type of plot, which we will come back to in [visualizing a categorical and a continuous variable](#cat-cont).
Now that you can visualise variation, what should you look for in your plots?
Now that you can visualize variation, what should you look for in your plots?
And what type of follow-up questions should you ask?
I've put together a list below of the most useful types of information that you will find in your graphs, along with some follow-up questions for each type of information.
The key to asking good follow-up questions will be to rely on your curiosity (What do you want to learn more about?) as well as your skepticism (How could this be misleading?).
@@ -346,7 +346,7 @@ In the next section we'll explore some techniques for improving this comparison.
If variation describes the behavior *within* a variable, covariation describes the behavior *between* variables.
**Covariation** is the tendency for the values of two or more variables to vary together in a related way.
The best way to spot covariation is to visualise the relationship between two or more variables.
The best way to spot covariation is to visualize the relationship between two or more variables.
How you do that should again depend on the type of variables involved.
### A categorical and continuous variable {#cat-cont}
@@ -369,7 +369,7 @@ ggplot(diamonds) +
```
To make the comparison easier we need to swap what is displayed on the y-axis.
Instead of displaying count, we'll display **density**, which is the count standardised so that the area under each frequency polygon is one.
Instead of displaying count, we'll display **density**, which is the count standardized so that the area under each frequency polygon is one.
```{r}
ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) +
@@ -405,7 +405,7 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
```
We see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot).
It supports the counterintuitive finding that better quality diamonds are cheaper on average!
It supports the counter-intuitive finding that better quality diamonds are cheaper on average!
In the exercises, you'll be challenged to figure out why.
`cut` is an ordered factor: fair is worse than good, which is worse than very good and so on.
@@ -438,7 +438,7 @@ ggplot(data = mpg) +
#### Exercises
1. Use what you've learned to improve the visualisation of the departure times of cancelled vs. non-cancelled flights.
1. Use what you've learned to improve the visualization of the departure times of cancelled vs. non-cancelled flights.
2. What variable in the diamonds dataset is most important for predicting the price of a diamond?
How is that variable correlated with cut?
@@ -453,7 +453,7 @@ ggplot(data = mpg) +
What do you learn?
How do you interpret the plots?
5. Compare and contrast `geom_violin()` with a facetted `geom_histogram()`, or a coloured `geom_freqpoly()`.
5. Compare and contrast `geom_violin()` with a faceted `geom_histogram()`, or a coloured `geom_freqpoly()`.
What are the pros and cons of each method?
6. If you have a small dataset, it's sometimes useful to use `geom_jitter()` to see the relationship between a continuous and categorical variable.
@@ -462,7 +462,7 @@ ggplot(data = mpg) +
### Two categorical variables
To visualise the covariation between categorical variables, you'll need to count the number of observations for each combination.
To visualize the covariation between categorical variables, you'll need to count the number of observations for each combination.
One way to do that is to rely on the built-in `geom_count()`:
```{r}
@@ -480,7 +480,7 @@ diamonds |>
count(color, cut)
```
Then visualise with `geom_tile()` and the fill aesthetic:
Then visualize with `geom_tile()` and the fill aesthetic:
```{r}
diamonds |>
@@ -504,7 +504,7 @@ For larger plots, you might want to try the heatmaply package, which creates int
### Two continuous variables
You've already seen one great way to visualise the covariation between two continuous variables: draw a scatterplot with `geom_point()`.
You've already seen one great way to visualize the covariation between two continuous variables: draw a scatterplot with `geom_point()`.
You can see covariation as a pattern in the points.
For example, you can see an exponential relationship between the carat size and price of a diamond.
@@ -541,7 +541,7 @@ ggplot(data = smaller) +
```
Another option is to bin one continuous variable so it acts like a categorical variable.
Then you can use one of the techniques for visualising the combination of a categorical and a continuous variable that you learned about.
Then you can use one of the techniques for visualizing the combination of a categorical and a continuous variable that you learned about.
For example, you could bin `carat` and then for each group, display a boxplot:
```{r}
@@ -550,7 +550,7 @@ ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
```
`cut_width(x, width)`, as used above, divides `x` into bins of width `width`.
By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it's difficult to tell that each boxplot summarises a different number of points.
By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it's difficult to tell that each boxplot summarizes a different number of points.
One way to show that is to make the width of the boxplot proportional to the number of points with `varwidth = TRUE`.
Another approach is to display approximately the same number of points in each bin.
@@ -563,16 +563,16 @@ ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
#### Exercises
1. Instead of summarising the conditional distribution with a boxplot, you could use a frequency polygon.
1. Instead of summarizing the conditional distribution with a boxplot, you could use a frequency polygon.
What do you need to consider when using `cut_width()` vs `cut_number()`?
How does that impact a visualisation of the 2d distribution of `carat` and `price`?
How does that impact a visualization of the 2d distribution of `carat` and `price`?
2. Visualise the distribution of carat, partitioned by price.
2. Visualize the distribution of carat, partitioned by price.
3. How does the price distribution of very large diamonds compare to small diamonds?
Is it as you expect, or does it surprise you?
4. Combine two of the techniques you've learned to visualise the combined distribution of cut, carat, and price.
4. Combine two of the techniques you've learned to visualize the combined distribution of cut, carat, and price.
5. Two dimensional plots reveal outliers that are not visible in one dimensional plots.
For example, some points in the plot below have an unusual combination of `x` and `y` values, which makes the points outliers even though their `x` and `y` values appear normal when examined separately.
@@ -682,7 +682,7 @@ diamonds |>
## Learning more
If you want to learn more about the mechanics of ggplot2, I'd highly recommend grabbing a copy of the ggplot2 book: <https://amzn.com/331924275X>.
It's been recently updated, so it includes dplyr and tidyr code, and has much more space to explore all the facets of visualisation.
It's been recently updated, so it includes dplyr and tidyr code, and has much more space to explore all the facets of visualization.
Unfortunately the book isn't generally available for free, but if you have a connection to a university you can probably get an electronic version for free through SpringerLink.
Another useful resource is the [*R Graphics Cookbook*](https://amzn.com/1449316956) by Winston Chang.