Completes draft of variation/EDA chapter.

This commit is contained in:
Garrett 2016-05-22 21:31:00 -04:00
parent 2a1f6b7e5c
commit 718341a2fc
6 changed files with 50 additions and 17 deletions

Binary image files added (not shown): 20 KiB, 21 KiB, 30 KiB, 37 KiB, 60 KiB.


@@ -1,7 +1,3 @@
---
output: html_document
---
# Exploratory Data Analysis (EDA)
```{r include = FALSE}
@@ -516,11 +512,9 @@ You can examine coefficients, model statistics, and residuals of a model fit to
I'll postpone teaching you how to fit and interpret models with R until Part 4. Although a model is something simple, a description of your data set, it is tied into the logic of statistical inference: if a model describes your data accurately _and_ your data is similar to the world at large, then your model should describe the world at large accurately. This chain of reasoning provides a basis for using models to make inferences and predictions. You'll be able to do more with models if you learn a few more skills before you begin to model data.
## A last word on variables and observations
## Exploring further
Variables, observations, and visualization
Every data set contains more information than it displays. You can use the values in your data to calculate new variables, as well as new, group-level, observations. This section will show you how to calculate new variables and observations, which you can use in visualizations, clustering algorithms, and modeling algorithms.
Every data set contains more variables and observations than it displays. You can use the values in your data to compute new variables or to measure new, group-level observations on subgroups of your data. These new variables and observations provide a further source of insights that you can explore with visualizations, clustering algorithms, and models.
### Making new variables
@@ -534,12 +528,11 @@ diamonds %>%
The window functions from Chapter 3 are particularly useful for calculating new variables. To calculate a variable from two or more variables, use basic operators or the `map2()`, `map3()`, and `map_n()` functions from purrr. You will learn more about purrr in Chapter ?.
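As a minimal sketch of both approaches, here is one way a new price-per-carat variable could be computed in `diamonds`; `map2_dbl()` is used here as a typed variant of `map2()` that returns a numeric vector, and the variable itself is only an illustration.

```{r}
library(ggplot2)  # for the diamonds data set
library(dplyr)

diamonds %>%
  mutate(
    price_per_carat  = price / carat,                       # basic operator
    price_per_carat2 = purrr::map2_dbl(price, carat, `/`)   # same result with purrr
  ) %>%
  select(carat, price, price_per_carat, price_per_carat2)
```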
PCA and PFA
Statisticians can use R to extract potential variables with more sophisticated algorithms. R provides `prcomp()` for Principal Components Analysis and `factanal()` for factor analysis. The psych and SEM packages also provide further tools for working with latent variables.
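As a rough sketch of what this might look like (the columns chosen from `diamonds` are only for illustration):

```{r}
# Principal components of a few numeric columns of diamonds
pca <- diamonds %>%
  select(carat, depth, table, price) %>%
  prcomp(scale. = TRUE)

summary(pca)  # proportion of variance captured by each component
head(pca$x)   # component scores: one new value per observation and component
```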
### Making new observations
If your data set contains subgroups, you can derive a new data set from it of observations that describe the subgroups. To do this, first use dplyr's `group_by()` function to group the data into subgroups. Then use dplyr's `summarise()` function to calculate group level values. The measures of location, rank and spread listed in Chapter 3 are particularly useful for describing subgroups.
If your data set contains subgroups, you can derive from your data a new data set of observations that describe the subgroups. To do this, first use dplyr's `group_by()` function to group the data into subgroups. Then use dplyr's `summarise()` function to calculate group level values. The measures of location, rank and spread listed in Chapter 3 are particularly useful for describing subgroups.
```{r}
mpg %>%
@@ -547,14 +540,54 @@ mpg %>%
  group_by(class) %>%   # class assumed as the grouping variable here
  summarise(n_obs = n(), avg_hwy = mean(hwy), sd_hwy = sd(hwy))
```
Group level observations and group geoms
## A last word on variables, values, and observations
## Summary
Variables, values, and observations provide a basis for Exploratory Data Analysis: if a relationship exists between two variables, then the relationship will exist between the values of those variables when those values are measured in the same observation. As a result, relationships between variables will appear as patterns in your data.
Data is difficult to comprehend, which means that you need to visualize, model, and transform it.
Within any particular observation, the exact form of the relationship between values may be obscured by mediating factors, measurement error, or random noise, which means that the patterns in your data will appear as signals obscured by noise.
Once you comprehend the information in your data, you can make inferences from your data.
Due to a quirk of the human cognitive system, the easiest way to spot the signal amidst the noise is to visualize your data. The concepts of variables, values, and observations make this easy to do. To visualize your data, represent each observation with its own geometric object, such as a point. Then map each variable to an aesthetic property of the point, setting specific values of the variable to specific levels of the aesthetic. Or compute group-level statistics (i.e. observations) and map them to geoms, something that `geom_bar()`, `geom_boxplot()` and other geoms do for you automatically.
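As a minimal sketch of both approaches, using the `mpg` data as an example:

```{r}
# One point per observation; variables mapped to position and color
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

# geom_bar() computes a group-level statistic (a count per class) for you
ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class))
```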
But all of this will involve a computer. To make headway, you will need to know how to program in a computer language (R), import data to use with that language, and tidy the data into the format that works best for that language.
## Exploratory Data Analysis and Data Science
When you are finished you will want to report and reproduce your results.
As a term, data science has been used in many ways by different people. This fluidity is necessary for a term that describes a wide range of activity, as data science does. Although different data science activities will take different forms, you can use the principles in this chapter to build a general model of data science. The model requires one limit to the definition of data science: data science must rely in some way on human judgement and expertise.
To judge or interpret the information in a data set, you must first comprehend that information. Data is difficult to comprehend, which means that you need to visualize, model, and transform it, a process that we have referred to as Exploratory Data Analysis.
```{r, echo = FALSE}
knitr::include_graphics("images/EDA-data-science-1.png")
```
Once you comprehend the information in your data, you can use it to make inferences from your data. Often this involves making deductions from a model. This is what you do when you conduct a hypothesis test, make a prediction (with or without a confidence interval), or score cases in a database.
```{r, echo = FALSE}
knitr::include_graphics("images/EDA-data-science-2.png")
```
But all of this will involve a computer; you can make little headway with pencil and paper calculations when you work with data. To work efficiently, you will need to know how to program in a computer language, such as R, import data to use with that language, and tidy the data into the format that works best for that language.
```{r, echo = FALSE}
knitr::include_graphics("images/EDA-data-science-3.png")
```
Finally, if your work is meaningful at all, you will need to report it in a way that your audience can understand. Your audience might be fellow scientists who will want to ensure that the work is reproducible, non-scientists who will need to understand your findings in plain language, or future you who will be thankful if you make it easy to come back up to speed on your work and recreate it as necessary. To satisfy these audiences, you may choose to communicate your results in a report or to bundle your work into some type of useful format, like a package or a Shiny app.
```{r, echo = FALSE}
knitr::include_graphics("images/EDA-data-science-4.png")
```
This model forms a roadmap for the rest of the book.
* Part 1 of the book covered the central tasks of the model above, Exploratory Data Analysis.
* Part 2 will cover the logistical tasks of working with data in a computer language: importing and tidying the data, skills I call Data Wrangling.
* Part 3 will teach you some of the most efficient ways to program in R with data.
* Part 4 discusses models and how to apply them.
* Part 5 will teach you the most popular format for reporting and reproducing the results of an R analysis.
```{r, echo = FALSE}
knitr::include_graphics("images/EDA-data-science-5.png")
```