Update relational-data.Rmd

typos
2016-01-30 13:13:46 +00:00 · ae06075c35
parent 8101753650
commit ae06075c35
1 changed files with 87 additions and 87 deletions
--- a/relational-data.Rmd
+++ b/relational-data.Rmd
@@ -18,7 +18,7 @@ It's rare that a data analysis involves only a single table of data. Typically y

 Relations are always defined between a pair of tables. All other relations are built up from this simple idea: the relations of three or more tables are always a property of the relations between each pair; sometimes both elements of a pair can be the same table.

-To work with relational data you need verbs that work with pairs of tables. There are three families of verbs design to work with relational data:
+To work with relational data you need verbs that work with pairs of tables. There are three families of verbs designed to work with relational data:

 * __Mutating joins__, which add new variables to one data frame from matching
  rows in another.
@@ -28,11 +28,11 @@ To work with relational data you need verbs that work with pairs of tables. Ther

 * __Set operations__, which treat observations like they were set elements.

-The most common place to find relational data is in a _relational_ database management system, a term that encompasses almost all modern databases. If you've used a database before, you've almost certainly used SQL. If so, you should find the concepts in this chapter familiar, although their expression in dplyr is little different. Generally, dplyr is a little easier to use than SQL because it's specialised to data analysis: it makes common data analysis operations easier, at the expense of making it difficult to do other things.
+The most common place to find relational data is in a _relational_ database management system, a term that encompasses almost all modern databases. If you've used a database before, you've almost certainly used SQL. If so, you should find the concepts in this chapter familiar, although their expression in dplyr is a little different. Generally, dplyr is a little easier to use than SQL because it's specialised to data analysis: it makes common data analysis operations easier, at the expense of making it difficult to do other things.

 ## nycflights13 {#nycflights13-relational}

-You'll learn about relational data with other datasets from the nycflights13 package. As well as the `flights` table that you've worked with so far, nycflights13 contains a four related data frames:
+You'll learn about relational data with other datasets from the nycflights13 package. As well as the `flights` table that you've worked with so far, nycflights13 contains four other related data frames:

 *   `airlines` lets you look up the full carrier name from its abbreviated
    code:
@@ -112,7 +112,7 @@ There are two types of keys:
  each plane.

 * A __foreign key__ uniquely identifies an observation in another table.
-  For example, the `flights$tailnum` is a foregin key because it matches each
+  For example, the `flights$tailnum` is a foreign key because it matches each
  flight to a unique plane.

 A variable can be both part of primary key _and_ a foreign key. For example, `origin` is part of the `weather` primary key, and is also a foreign key for the `airport` table.
@@ -124,16 +124,16 @@ planes %>% count(tailnum) %>% filter(n > 1)
 weather %>% count(year, month, day, hour, origin) %>% filter(n > 1)
 ```

-Sometimes a table does't have an explicit primary key: each row is an observation, but no combination of variables reliably identifies it. For example, what's the primary key in the `flights` table? You might think it would be the date plus the flight or tail number, but neither of those are unique:
+Sometimes a table doesn't have an explicit primary key: each row is an observation, but no combination of variables reliably identifies it. For example, what's the primary key in the `flights` table? You might think it would be the date plus the flight or tail number, but neither of those are unique:

 ```{r}
 flights %>% count(year, month, day, flight) %>% filter(n > 1)
 flights %>% count(year, month, day, tailnum) %>% filter(n > 1)
 ```

-When starting to work with this data, I had naively assumed that each flight number would be only used once per day: that would make it much easiser to communicate problems with a specific flight. Unfortunately that is not the case! If a table lacks a primary key, it's sometimes useful to add one with `row_number()`. That makes it easier to match observations if you've done some filtering and want to check back in with the original data. This is called a surrogate key.
+When starting to work with this data, I had naively assumed that each flight number would be only used once per day: that would make it much easier to communicate problems with a specific flight. Unfortunately that is not the case! If a table lacks a primary key, it's sometimes useful to add one with `row_number()`. That makes it easier to match observations if you've done some filtering and want to check back in with the original data. This is called a surrogate key.

-A primary key and the corresponding foreign key in another table form a __relation__. Relations are typically one-to-many. For example, each flight has one plane, but each plane has many flights. In other data, you'll occassionaly see a 1-to-1 relationship. You can think of this as a special case of 1-to-many. It's possible to model many-to-many relations with a many-to-1 relation plus a 1-to-many relation. For example, in this data there's a many-to-many relationship between airlines and airports: each airport flies to many airlines; each airport hosts many airlines. 
+A primary key and the corresponding foreign key in another table form a __relation__. Relations are typically one-to-many. For example, each flight has one plane, but each plane has many flights. In other data, you'll occasionally see a 1-to-1 relationship. You can think of this as a special case of 1-to-many. It's possible to model many-to-many relations with a many-to-1 relation plus a 1-to-many relation. For example, in this data there's a many-to-many relationship between airlines and airports: each airport flies to many airlines; each airport hosts many airlines.

 ### Exercises

@@ -243,7 +243,7 @@ Graphically, that looks like:
 knitr::include_graphics("diagrams/join-outer.png")
 ```

-The most commonly used join is the left join: you use this when ever you lookup additional data out of another table, becasuse it preserves the original observations even when there isn't a match. The left join should be your default join: use it unless you have a strong reason to prefer one of the others.
+The most commonly used join is the left join: you use this whenever you lookup additional data out of another table, because it preserves the original observations even when there isn't a match. The left join should be your default join: use it unless you have a strong reason to prefer one of the others.

 Another way to depict the different types of joins is with a Venn diagram:

@@ -352,7 +352,7 @@ So far, the pairs of tables have always been joined by a single variable, and th
 1.  What weather conditions make it more likely to see a delay?

 1.  What happened on June 13 2013? Display the spatial pattern of delays,
-    and then use google to cross-reference with the weather.
+    and then use Google to cross-reference with the weather.

    ```{r, eval = FALSE, include = FALSE}
    worst <- filter(not_cancelled, month == 6, day == 13)
@@ -385,17 +385,17 @@ SQL is the inspiration for dplyr's conventions, so the translation is straightfo
 dplyr                        | SQL
 -----------------------------|-------------------------------------------
 `inner_join(x, y, by = "z")` | `SELECT * FROM x INNER JOIN y USING (z)`
-`left_join(x, y, by = "z")`  | `SELECT * FROM x LEFT OUTER JOIN USING (z)`
-`right_join(x, y, by = "z")` | `SELECT * FROM x RIGHT OUTER JOIN USING (z)`
-`full_join(x, y, by = "z")`  | `SELECT * FROM x FULL OUTER JOIN USING (z)`
+`left_join(x, y, by = "z")`  | `SELECT * FROM x LEFT OUTER JOIN y USING (z)`
+`right_join(x, y, by = "z")` | `SELECT * FROM x RIGHT OUTER JOIN y USING (z)`
+`full_join(x, y, by = "z")`  | `SELECT * FROM x FULL OUTER JOIN y USING (z)`

-Note that "INNER" and "OUTER" are optional, and often ommitted. 
+Note that "INNER" and "OUTER" are optional, and often omitted.

-Joining different variables between the tables, e.g. `inner_join(x, y, by = c("a" = "b"))` uses a slightly different syntax in SQL: `SELECT * FROM x INNER JOIN y ON x.a = y.b`. As this syntax suggests SQL supports a wide range of join types than dplyr because you can connect the tables using constraints other than equiality (sometimes called non-equijoins).
+Joining different variables between the tables, e.g. `inner_join(x, y, by = c("a" = "b"))` uses a slightly different syntax in SQL: `SELECT * FROM x INNER JOIN y ON x.a = y.b`. As this syntax suggests SQL supports a wide range of join types than dplyr because you can connect the tables using constraints other than equality (sometimes called non-equijoins).

 ## Filtering joins {#filtering-joins}

-Filtering joins match obserations in the same way as mutating joins, but affect the observations, not the variables. There are two types:
+Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:

 * `semi_join(x, y)` __keeps__ all observations in `x` that have a match in `y`.
 * `anti_join(x, y)` __drops__ all observations in `x` that have a match in `y`.
@@ -494,7 +494,7 @@ Be aware that simply checking the number of rows before and after the join is no

 ## Set operations {#set-operations}

-The final type of two-table verb is set operations. Generally, I use these the least frequently, but they are occassionally useful when you want to break a single complex filter into simpler pieces that you then combine.
+The final type of two-table verb is set operations. Generally, I use these the least frequently, but they are occasionally useful when you want to break a single complex filter into simpler pieces that you then combine.

 All these operations work with a complete row, comparing the values of every variable. These expect the `x` and `y` inputs to have the same variables, and treat the observations like sets: