Rough first pass at summaries for all whole game chapters

Hadley Wickham 2022-09-29 10:36:22 -05:00
parent a1c9cf2ff2
commit d9a86edcf0
10 changed files with 110 additions and 24 deletions

EDA.qmd
View File

@ -1,4 +1,4 @@
# Exploratory Data Analysis {#sec-exploratory-data-analysis}
# Exploratory data analysis {#sec-exploratory-data-analysis}
```{r}
#| results: "asis"
@ -953,9 +953,10 @@ diamonds |>
geom_tile()
```
## Learning more
## Summary
If you want to learn more about the mechanics of ggplot2, we highly recommend reading the [ggplot2 book](https://ggplot2-book.org).
Another useful resource is the [*R Graphics Cookbook*](https://r-graphics.org) by Winston Chang.
In this chapter you've learned a variety of tools to help you understand the variation within your data.
You've seen techniques that work with a single variable at a time and with a pair of variables.
This might seem painfully restrictive if you have tens or hundreds of variables in your data, but these techniques are the foundation upon which all others are built.
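For example, here's a minimal sketch of one single-variable and one two-variable technique, using the `diamonds` data from this chapter:
```{r}
#| eval: false
# Variation within a single variable: a histogram of carat
ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1)

# Covariation between two variables: price broken down by cut
ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot()
```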
<!--# TODO: add Claus + Kieran books -->
In the next chapter, we'll tackle our final piece of workflow advice: how to get help when you're stuck.

View File

@ -326,3 +326,12 @@ RDS supports list-columns (which you'll learn about in @sec-rectangling; feather
file.remove("students-2.csv")
file.remove("students.rds")
```
## Summary
In this chapter, you've learned how to use readr to load rectangular flat files from disk into R.
You've learned how csv files work, some of the problems you might encounter, and how to overcome them.
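As a quick reminder of what that looks like in practice, here's a sketch of a typical call (the file name and column are hypothetical):
```{r}
#| eval: false
# Declare NA strings and column types up front to catch problems early
students <- read_csv(
  "students.csv",
  na = c("", "N/A"),
  col_types = cols(age = col_double())
)
```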
We'll come to data import a few times in this book: @sec-import-databases will show you how to load data from databases, @sec-import-spreadsheets from Excel and Google Sheets, @sec-import-rectangling from JSON, and @sec-import-scraping from websites.
Now that you're writing a substantial amount of R code, it's time to learn more about organizing your code into files and directories.
In the next chapter, you'll learn all about the advantages of scripts and projects, and some of the many tools that they provide to make your life easier.

View File

@ -23,7 +23,6 @@ In this chapter, you'll first learn the definition of tidy data and see it appli
Then we'll dive into the main tool you'll use for tidying data: pivoting.
Pivoting allows you to change the form of your data, without changing any of the values.
We'll finish up with a discussion of usefully untidy data, and how you can create it if needed.
If you particularly enjoy this chapter and want to learn more about the underlying theory, you can learn more about the history and theoretical underpinnings in the [Tidy Data](https://www.jstatsoft.org/article/view/v059i10) paper published in the Journal of Statistical Software.
### Prerequisites
@ -744,3 +743,13 @@ Depending on what you want to do next, you may find any of the following three s
geom_point() +
coord_equal()
```
## Summary
In this chapter, you learned about tidy data: data that has variables in columns and observations in rows.
Tidy data makes working in the tidyverse easier because it's a consistent structure understood by most functions: the main challenge is transforming your data from whatever structure you receive it in into a tidy format.
To that end, you learned about `pivot_longer()` and `pivot_wider()`, which allow you to tidy up many untidy datasets.
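As a reminder, the two functions are inverses of each other; here's a sketch using a hypothetical `df` with `wk1`, `wk2`, ... columns:
```{r}
#| eval: false
# Lengthen: gather the wk* columns into week/rank pairs
df_long <- df |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank"
  )

# Widen: recover the original shape
df_long |>
  pivot_wider(names_from = week, values_from = rank)
```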
Of course, tidy data can't solve every problem, so we also showed you some places where you might want to deliberately untidy your data in order to present it to humans, feed it into statistical models, or just pragmatically get shit done.
If you particularly enjoyed this chapter and want to learn more about the underlying theory, you can learn more about the history and theoretical underpinnings in the [Tidy Data](https://www.jstatsoft.org/article/view/v059i10) paper published in the Journal of Statistical Software.
In the next chapter, we'll pivot back to workflow to discuss the importance of code style: keeping your code "tidy" (ha!) makes it easy for you and others to read and understand it.

View File

@ -665,3 +665,12 @@ batters |>
```
You can find a good explanation of this problem and how to overcome it at <http://varianceexplained.org/r/empirical_bayes_baseball/> and <https://www.evanmiller.org/how-not-to-sort-by-average-rating.html>.
## Summary
In this chapter, you've learned the tools that dplyr provides for working with data frames.
The tools are roughly grouped into three categories: those that manipulate the rows (like `filter()` and `arrange()`), those that manipulate the columns (like `select()` and `mutate()`), and those that manipulate groups (like `group_by()` and `summarise()`).
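Here's a sketch that strings one verb from each family together, assuming the `flights` data used throughout the chapter:
```{r}
#| eval: false
flights |>
  filter(dest == "IAH") |>                    # rows
  mutate(speed = distance / air_time * 60) |> # columns
  group_by(carrier) |>                        # groups
  summarise(avg_speed = mean(speed, na.rm = TRUE))
```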
In this chapter, we've focused on these "whole data frame" tools, but you haven't yet learned much about what you can do with individual variables.
We'll come back to that in @sec-transform-intro, where each chapter will give you tools for a specific type of variable.
For now, we'll pivot back to workflow, and in the next chapter you'll learn more about the pipe, `|>`, why we recommend it, and a little of the history that led from magrittr's `%>%` to base R's `|>`.

View File

@ -15,8 +15,6 @@ R has several systems for making graphs, but ggplot2 is one of the most elegant
ggplot2 implements the **grammar of graphics**, a coherent system for describing and building graphs.
With ggplot2, you can do more and faster by learning one system and applying it in many places.
If you'd like to learn more about the theoretical underpinnings of ggplot2, you might enjoy reading "The Layered Grammar of Graphics", <https://vita.had.co.nz/papers/layered-grammar.pdf>, the scientific paper that discusses them in detail.
### Prerequisites
This chapter focuses on ggplot2, one of the core packages in the tidyverse.
@ -139,7 +137,8 @@ We will begin with the `<MAPPINGS>` component.
> "The greatest value of a picture is when it forces us to notice what we never expected to see." --- John Tukey
In the plot below, one group of points (highlighted in red) seems to fall outside of the linear trend.
These cars have a higher fuel efficiency than you might expect. That is, they have a higher miles per gallon than other cars with similar engine sizes.
These cars have a higher fuel efficiency than you might expect.
That is, they have a higher miles per gallon than other cars with similar engine sizes.
How can you explain these cars?
```{r}
@ -1303,3 +1302,20 @@ knitr::include_graphics("images/visualization-grammar-3.png")
You could use this method to build *any* plot that you imagine.
In other words, you can use the code template that you've learned in this chapter to build hundreds of thousands of unique plots.
If you'd like to learn more about the theoretical underpinnings of ggplot2, you might enjoy reading "[The Layered Grammar of Graphics](https://vita.had.co.nz/papers/layered-grammar.pdf)", the scientific paper that describes the theory of ggplot2 in detail.
## Summary
In this chapter, you've learned the basics of data visualization with ggplot2.
We started with the basic idea that underpins ggplot2: a visualization is a mapping from variables in your data to aesthetic properties like position, color, size, and shape.
You then learned about facets, which allow you to create small multiples, where each panel contains a subgroup from your data.
We then gave you a whirlwind tour of the geoms and stats which control the "type" of graph you get, whether it's a scatterplot, line plot, histogram, or something else.
Position adjustments control the fine details of position when geoms might otherwise overlap, and coordinate systems allow you to fundamentally change what `x` and `y` mean.
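Here's a sketch that combines several of those pieces in a single plot, using the `mpg` data from this chapter:
```{r}
#| eval: false
# Map variables to aesthetics, pick a geom, then facet by drive train
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  facet_wrap(~drv)
```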
We'll use visualizations again and again throughout this book, introducing new techniques as we need them.
If you want a comprehensive understanding of ggplot2, we recommend reading the book [*ggplot2: Elegant Graphics for Data Analysis*](https://ggplot2-book.org).
Other useful resources are the [*R Graphics Cookbook*](https://r-graphics.org) by Winston Chang and [*Fundamentals of Data Visualization*](https://clauswilke.com/dataviz/) by Claus Wilke.
With the basics of visualization under your belt, in the next chapter we're going to switch gears a little and give you some practical workflow advice.
We intersperse workflow advice with data science tools throughout this part of the book because it'll help you stay organized as you write increasing amounts of R code.

View File

@ -78,13 +78,17 @@ primes * 2
With short pieces of code like this, it might not be necessary to leave a comment for every single line of code.
But as the code you're writing gets more complex, comments can save you (and your collaborators) a lot of time in figuring out what was done in the code.
However, ultimately, *what* was done is possible to figure out, even if it might be tedious at times, as the code is self-documenting.
However, remembering or figuring out *why* something was done can be much more difficult, or impossible.
For example, `geom_smooth()`, which draws a smooth curve to represent the patterns of the data has an argument called `span`, which controls the "wiggliness" of the smoother with larger values for `span` yielding a smoother curve.
The default value of this argument is 0.75.
Suppose you decide to change the value of `span`, and set it to 0.3.
It would be very useful to add a comment noting why you decided to make this change, for yourself in the future and others reviewing your code.
In the following example the first comment for the same code is not as good as the second one as it doesn't say why the decision to change the span was made.
Use comments to explain the *why* of your code, not the *how* or the *what*.
The *what* and *how* of your code is always possible to figure out, even if it might be tedious, by carefully reading the code.
But if you describe the *what* in your comments as well as your code, you'll have to remember to carefully update the comment and code in tandem.
If you change the code and forget to update the comment, they'll be inconsistent, which will lead to confusion when you come back to your code in the future.
Figuring out *why* something was done is much more difficult, if not impossible.
For example, `geom_smooth()` has an argument called `span`, which controls the smoothness of the curve, with larger values yielding a smoother curve.
Suppose you decide to change the value of `span` from its default of 0.75 to 0.3: it's easy for a future reader to understand *what* is happening, but unless you note your thinking in a comment, no one will understand *why* you changed the default.
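For example, a why-comment might look something like this (a sketch using the `mpg` data; the rationale in the comment is illustrative):
```{r}
#| eval: false
# Use a smaller span (0.3) so the curve tracks local structure;
# the default (0.75) smooths it away
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(span = 0.3)
```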
For data analysis code, use comments to explain your overall plan of attack and record important insights as you encounter them.
There's no way to re-capture this knowledge from the code itself.
## What's in a name? {#sec-whats-in-a-name}
@ -226,3 +230,8 @@ knitr::include_graphics("screenshots/rstudio-env.png")
3. Press Alt + Shift + K.
What happens?
How can you get to the same place using the menus?
## Summary
You've now learned a little more about how R code works, and picked up some tips to help you understand your code when you come back to it in the future.
In the next chapter, we'll continue your data science journey by teaching you about dplyr, the tidyverse package that helps you transform data, whether it's selecting important variables, filtering down to rows of interest, or computing summary statistics.

View File

@ -121,3 +121,12 @@ To keep up with the R community more broadly, we recommend reading [R Weekly](ht
If you're an active Twitter user, you might also want to follow Hadley ([\@hadleywickham](https://twitter.com/hadleywickham)), Mine ([\@minebocek](https://twitter.com/minebocek)), Garrett ([\@statgarrett](https://twitter.com/statgarrett)), or follow [\@rstudiotips](https://twitter.com/rstudiotips) to keep up with new features in the IDE.
If you want the full fire hose of new developments, you can also read the [`#rstats`](https://twitter.com/search?q=%23rstats) hashtag.
This is one of the key tools that Hadley and Mine use to keep up with new developments in the community.
## Summary
This chapter concludes the Whole Game part of the book.
You've now seen the most important parts of the data science process: visualization, transformation, tidying, and importing.
With that holistic view in place, we'll start to get into the details of the small pieces.
The next part of the book, Transform, goes into depth on the different types of variables that you might encounter: logical vectors, numbers, strings, factors, and date-times, and covers important related topics like tibbles, regular expressions, missing values, and joins.
There's no need to read these chapters in order; dip in and out as needed for the specific data that you're working with.

View File

@ -128,3 +128,14 @@ But they're still good to know about even if you've never used `%>%` because you
- `%>%` allows you to start a pipe with `.` to create a function rather than immediately executing the pipe; this is not supported by the base pipe.
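For example, here's a quick sketch of that last feature:
```{r}
#| eval: false
library(magrittr)

# Starting a magrittr pipe with . creates a reusable function
# instead of running the pipe immediately
top5 <- . %>% sort(decreasing = TRUE) %>% head(5)
top5(c(3, 1, 4, 1, 5, 9, 2, 6))
```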
Luckily there's no need to commit entirely to one pipe or the other --- you can use the base pipe for the majority of cases where it's sufficient, and use the magrittr pipe when you really need its special features.
## Summary
In this chapter, you've learned more about the pipe: why we recommend it and some of the history that led to `|>`.
The pipe is important because you'll use it again and again throughout your analysis, but hopefully it will quickly become invisible and your fingers will type it (or use the keyboard shortcut) without your brain having to think too much about it.
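As a quick reminder of why, compare nesting to piping (assuming the `flights` data from earlier chapters):
```{r}
#| eval: false
# Nested: reads inside-out
arrange(summarise(group_by(flights, dest), n = n()), desc(n))

# Piped: reads top-to-bottom
flights |>
  group_by(dest) |>
  summarise(n = n()) |>
  arrange(desc(n))
```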
In the next chapter, we switch back to data science tools, learning about tidy data.
Tidy data is a consistent way of organizing your data frames that is used throughout the tidyverse.
This consistency makes your life easier because once you have tidy data, it just works with the vast majority of tidyverse functions.
Of course, life is never easy and most datasets that you encounter in the wild will not already be tidy.
So we'll also teach you how to use the tidyr package to tidy your untidy data.

View File

@ -352,3 +352,11 @@ Then everything you need is in one place and cleanly separated from all the othe
2. What other common mistakes will RStudio diagnostics report?
Read <https://support.rstudio.com/hc/en-us/articles/205753617-Code-Diagnostics> to find out.
## Summary
In this chapter, you've learned how to organize your R code in scripts (files) and projects (directories).
Much like code style, this may feel like busywork at first.
But as you accumulate more code across multiple projects, you'll learn to appreciate how a little up-front organization can save you a bunch of time down the road.
Next up, we'll switch back to data science tooling to talk about exploratory data analysis (or EDA for short), a philosophy and set of tools that you can use with your data to start to get a sense of what's going on.

View File

@ -54,7 +54,7 @@ Use `_` to separate words within a name.
short_flights <- flights |> filter(air_time < 60)
# Avoid:
SHORTFLIGHTS <- flights |> filter(air_time < 60)
```
As a general rule of thumb, it's better to prefer long, descriptive names that are easy to understand, rather than concise names that are fast to type.
@ -239,14 +239,9 @@ flights |>
geom_point()
```
## Organization
## Sectioning comments
Use comments to explain the "why" of your code, not the "how" or the "what".
If you simply describe what your code is doing in prose, you'll have to be careful to update the comment and code in tandem: if you change the code and forget to update the comment, they'll be inconsistent which will lead to confusion when you come back to your code in the future.
For data analysis code, use comments to explain your overall plan of attack and record important insight as you encounter them.
There's no way to re-capture this knowledge from the code itself.
As your scripts get longer, use **sectioning** comments to break up your file into manageable pieces:
As your scripts get longer, you can use **sectioning** comments to break up your file into manageable pieces:
```{r}
#| eval: false
@ -281,3 +276,13 @@ knitr::include_graphics("screenshots/rstudio-nav.png")
flights|>filter(carrier=="UA",dest%in%c("IAH","HOU"),sched_dep_time>0900,sched_arr_time<2000)|>group_by(flight)|>summarize(delay=mean(arr_delay,na.rm=TRUE),cancelled=sum(is.na(arr_delay)),n=n())|>filter(n>10)
```
## Summary
In this chapter, you've learned the most important principles of code style.
These may feel like a set of arbitrary rules to start with (because they are!), but over time, as you write more code and share it with more people, you'll see how important a consistent style is.
And don't forget about the styler package: it's a great way to quickly improve the quality of poorly styled code.
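If you haven't tried it yet, it's a single function call (the file name here is hypothetical):
```{r}
#| eval: false
# Restyle a script in place, following the tidyverse style guide
styler::style_file("analysis.R")
```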
So far, we've worked with datasets bundled inside of R packages.
This makes it easier to get some practice on pre-prepared data, but obviously your own data won't be available in this way.
So in the next chapter, you're going to learn how to load data from disk into your R session using the readr package.