Minor Typos (#189)

Also changed spellings to British English for consistency.

FYI: the table number for the light speed measurements currently reads (#tab:variation). The code appears to be correct.
Terence Teo 2016-07-25 12:21:00 -04:00 committed by Hadley Wickham
parent f1cc2088f9
commit 149c27ef61
1 changed file with 19 additions and 19 deletions

EDA.Rmd

@@ -2,17 +2,17 @@
## Introduction
-This chapter will show you how to use visualization and transformation to explore your data in a systematic way, a task that statisticians call exploratory data analysis, or EDA for short. EDA is an interative cycle. You:
+This chapter will show you how to use visualisation and transformation to explore your data in a systematic way, a task that statisticians call exploratory data analysis, or EDA for short. EDA is an iterative cycle. You:
1. Generate questions about your data.
-1. Search for answers by visualizing, transforming, and modeling your data.
+1. Search for answers by visualising, transforming, and modelling your data.
1. Use what you learn to refine your questions and or generate new questions.
-EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel be free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be deadends. As your exploration continues you will hone in on a few particularly productive areas that you'll eventually write up and communicate to others.
+EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel be free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues you will hone in on a few particularly productive areas that you'll eventually write up and communicate to others.
-EDA is an important part of any data analysis, even if the questions are handed to your on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you're ask questions about whether your data meets your expectations or not. To do data cleaning, you'll need to deploy all the tools of EDA: visualisation, transformation, and modelling.
+EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you're ask questions about whether your data meets your expectations or not. To do data cleaning, you'll need to deploy all the tools of EDA: visualisation, transformation, and modelling.
### Prerequisites
@@ -32,7 +32,7 @@ library(dplyr)
> vague, than an exact answer to the wrong question, which can always be made
> precise." --- John Tukey
-Your goal during EDA is to develop your understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.
+Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.
EDA is fundamentally a creative process. And like most creative processes, the key to asking _quality_ questions is to generate a large _quantity_ of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data---and develop a set of thought provoking questions---if you follow up each question with a new question based on what you find.
@@ -79,11 +79,11 @@ options(old)
Categorical variables can also vary if you measure across different subjects (e.g. the eye colors of different people), or different times (e.g. the energy levels of an electron at different moments).
-Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualize the distribution of variable's values.
+Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualise the distribution of variable's values.
-### Visualizing distributions
+### Visualising distributions
-How you visualize the distribution of a variable will depend on whether the variable is categorical or continuous. A variable is **categorical** if it can only have a finite (or countably infinite) set of unique values. In R, categorical variables are usually saved as factors or character vectors. To examine the distribution of a categorical variable, use a bar chart:
+How you visualise the distribution of a variable will depend on whether the variable is categorical or continuous. A variable is **categorical** if it can only have a finite (or countably infinite) set of unique values. In R, categorical variables are usually saved as factors or character vectors. To examine the distribution of a categorical variable, use a bar chart:
```{r}
ggplot(data = diamonds) +
@@ -109,7 +109,7 @@ You can compute this by hand by combining `dplyr::count()` and `ggplot2::cut_wid
diamonds %>% count(cut_width(carat, 0.5))
```
-A histogram divides the x axis into equally spaced bins and then uses the height of bar to display the number observations fall in each bun. In the graph above, the tallest bar shows that almost 30,000 observations have a $carat$ value between 0.25 and 0.75, which are the left and right edges of the bar.
+A histogram divides the x axis into equally spaced bins and then uses the height of bar to display the number of observations that fall in each bin. In the graph above, the tallest bar shows that almost 30,000 observations have a $carat$ value between 0.25 and 0.75, which are the left and right edges of the bar.
You can set the width of the intervals in a histogram with the `binwidth` argument, which is measured in the units of the $x$ variable. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. For example, here is how the graph above looks when we zoom into just the diamonds with a binwidth of less than three and choose a smaller binwidth.
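For reference, here is a minimal sketch of the zoomed-in histogram described above; it assumes, as the later chunks do, that `smaller` holds the diamonds below three carats, and the 0.1 binwidth is just one reasonable choice.

```{r}
# ggplot2 and dplyr are assumed to be loaded, as in the chapter's prerequisites
library(ggplot2)
library(dplyr)

# Keep only the diamonds below three carats, then redraw the histogram with a
# much smaller binwidth to reveal the finer structure
smaller <- diamonds %>% filter(carat < 3)
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1)
```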
@@ -127,7 +127,7 @@ ggplot(data = smaller, mapping = aes(x = carat)) +
geom_freqpoly(binwidth = 0.1)
```
-Now that you can visualize variation, what should you look for in your plots? And what type of follow-up questions should you ask? I've put together a list below of the most useful types of information that you will find in your graphs, along with some follow up questions for each type of information. The key to asking good follow up questions will be to rely on your **curiosity** (What do you want to learn more about?) as well as your **skepticism** (How could this be misleading?).
+Now that you can visualise variation, what should you look for in your plots? And what type of follow-up questions should you ask? I've put together a list below of the most useful types of information that you will find in your graphs, along with some follow up questions for each type of information. The key to asking good follow up questions will be to rely on your **curiosity** (What do you want to learn more about?) as well as your **skepticism** (How could this be misleading?).
### Typical values
@@ -135,7 +135,7 @@ In both bar charts and histograms, tall bars show the common values of a variabl
* Which values are the most common? Why?
-* Which values are the rare? Why? Does that match your expectations?
+* Which values are rare? Why? Does that match your expectations?
* Can you see any unusual patterns? What might explain them?
@@ -211,8 +211,8 @@ When you discover an outlier it's a good idea to trace it back as far as possibl
might decide which dimension is the length, width, and depth.
1. Explore the distribution of `price`. Do you discover anything unusual
-or surprising? (Hint: carefully think about reasonsable values of
-`binwidth` and experiment.)
+or surprising? (Hint: carefully think about the `binwidth` and make sure
+you)
1. How many diamonds have 0.99 carats? Why?
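One way to start on this question (a rough sketch of an approach, not the chapter's own solution) is to count the diamonds recorded at 0.99 carat and at a nearby value:

```{r}
# Compare how many diamonds are recorded at 0.99 carat versus exactly 1 carat;
# the contrast suggests a reason for the gap (assumes ggplot2 and dplyr loaded)
library(ggplot2)
library(dplyr)

diamonds %>%
  filter(carat %in% c(0.99, 1)) %>%
  count(carat)
```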
@@ -286,7 +286,7 @@ However this plot isn't great because there are many more non-cancelled flights
## Covariation
-If variation describes the behavior _within_ a variable, covariation describes the behavior _between_ variables. **Covariation** is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualize the relationship between two or more variables. How you do that should again depend on the type of variables involved.
+If variation describes the behavior _within_ a variable, covariation describes the behavior _between_ variables. **Covariation** is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualise the relationship between two or more variables. How you do that should again depend on the type of variables involved.
### A categorical and continuous variable
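For orientation, one common way to visualise this combination, sketched here with the `diamonds` data (the 500-dollar binwidth is an arbitrary choice), is a frequency polygon of the continuous variable coloured by the categorical one:

```{r}
# Distribution of price, drawn separately for each level of cut; the raw
# counts differ a lot between levels, which is why the more compact boxplots
# discussed below are often easier to compare
library(ggplot2)

ggplot(data = diamonds, mapping = aes(x = price)) +
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
```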
@@ -342,7 +342,7 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
We see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot). It supports the counterintuive finding that better quality diamonds are cheaper on average! In the exercises, you'll be challenged to figure out why.
-`cut` is an ordered factor: fair is worse than good, which is wrose than very good and so on. Most factors are unordered, so it's fair game to reorder to display the results better. For example, take the `class` variable in the `mpg` dataset. You might be interested to know how hwy mileage varies across classes:
+`cut` is an ordered factor: fair is worse than good, which is worse than very good and so on. Most factors are unordered, so it's fair game to reorder to display the results better. For example, take the `class` variable in the `mpg` dataset. You might be interested to know how highway mileage varies across classes:
```{r}
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
@@ -378,7 +378,7 @@ ggplot(data = mpg) +
How does this compare to using `coord_flip()`?
1. One problem with boxplots is that they were developed in an era of
-much smaller datasets and tend to display an prohibitively large
+much smaller datasets and tend to display a prohibitively large
number of "outlying values". One approach to remedy this problem is
the letter value plot. Install the lvplot package, and try using
`geom_lvplot()` to display the distribution of price vs cut. What
@@ -430,7 +430,7 @@ If the categorical variables are unordered, you might want to use the seriation
delays vary by destination and month of year. What makes the
plot difficult to read? How could you improve it?
-1. Why is slightly better to use `aes(x = color, y = cut)` rather
+1. Why is it slightly better to use `aes(x = color, y = cut)` rather
than `aes(x = cut, y = color)` in the example above?
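For context, here is a sketch of the kind of plot that exercise is about (assumed here for illustration, since the chapter's own chunk is not shown): count each colour/cut combination and map the count to fill.

```{r}
# Count every combination of color and cut, then draw a tile plot with the
# count mapped to fill; swapping the x and y mappings changes which variable
# runs along the longer axis (assumes ggplot2 and dplyr are loaded)
library(ggplot2)
library(dplyr)

diamonds %>%
  count(color, cut) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
    geom_tile(mapping = aes(fill = n))
```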
### Two continuous variables
@@ -462,14 +462,14 @@ ggplot(data = smaller) +
geom_hex(aes(x = carat, y = price))
```
-Another option is to bin one continuous variable so it acts like a categorical variable. Then you can use one of the techniques for visualising the combination of a discrete and a continuous variable that you learned about. For example, you could bin `carat` and then for each group displaying a boxplot:
+Another option is to bin one continuous variable so it acts like a categorical variable. Then you can use one of the techniques for visualising the combination of a discrete and a continuous variable that you learned about. For example, you could bin `carat` and then for each group, display a boxplot:
```{r}
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(aes(group = cut_width(carat, 0.1)))
```
-`cut_width(x, width)`, as used above, divides `x` into bins of width `width`. By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it's difficult to tell the each boxplot summarises a different number of points. One way to show that is to make the width of the boxplot to be proportional to the number of points with `varwidth = TRUE`.
+`cut_width(x, width)`, as used above, divides `x` into bins of width `width`. By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it's difficult to tell that each boxplot summarises a different number of points. One way to show that is to make the width of the boxplot proportional to the number of points with `varwidth = TRUE`.
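A quick sketch of the `varwidth = TRUE` option described above, reusing `smaller` and the 0.1 binning from the previous chunk:

```{r}
# Same binned boxplots as above, but with box widths that reflect how many
# points fall in each bin
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_boxplot(aes(group = cut_width(carat, 0.1)), varwidth = TRUE)
```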
Another approach is to display approximately the same number of points in each bin. That's the job of `cut_number()`:
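A sketch of how that might look, binning `carat` into (for example) 20 groups that each contain roughly the same number of diamonds:

```{r}
# cut_number() makes bins that hold approximately equal numbers of
# observations, so every boxplot summarises a similar number of points
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_boxplot(aes(group = cut_number(carat, 20)))
```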