diff --git a/tidy.Rmd b/tidy.Rmd
index b4ae17e..7c469e9 100644
--- a/tidy.Rmd
+++ b/tidy.Rmd
@@ -119,7 +119,7 @@ The second step is to resolve one of two common problems:
 
 1. One variable might be spread across multiple columns.
 
-1. One observation might be scattered across mutliple rows.
+1. One observation might be scattered across multiple rows.
 
 Typically a dataset will only suffer from one of these problems; it'll only suffer from both if you're really unlucky! To fix these problems, you'll need the two most important functions in tidyr: `gather()` and `spread()`.
 
@@ -185,10 +185,10 @@ To tidy this up, we first analyse the representation in similar way to `gather()
 * The column that contains variable names, the `key` column. Here, it's `type`.
 
-* The column that contains values froms multiple variables, the `value`
+* The column that contains values forms multiple variables, the `value`
   column. Here it's `count`.
 
-Once we've figured that out, we can use `spread()`, as shown progammatically below, and visually in Figure \@ref(fig:tidy-spread).
+Once we've figured that out, we can use `spread()`, as shown programmatically below, and visually in Figure \@ref(fig:tidy-spread).
 
 ```{r}
 spread(table2, key = type, value = count)
@@ -317,7 +317,7 @@ table5 %>%
   unite(new, century, year)
 ```
 
-In this case we also need to use the `sep` arguent. The default will place an underscore (`_`) between the values from different columns. Here we don't want any separator so we use `""`:
+In this case we also need to use the `sep` argument. The default will place an underscore (`_`) between the values from different columns. Here we don't want any separator so we use `""`:
 
 ```{r}
 table5 %>%
@@ -345,7 +345,7 @@ table5 %>%
 
 ## Missing values
 
-Changing the representation of a dataset brings up an important subtlety of missing values. Suprisingly, a value can be missing in one of two possible ways:
+Changing the representation of a dataset brings up an important subtlety of missing values. Surprisingly, a value can be missing in one of two possible ways:
 
 * __Explicitly__, i.e. flagged with `NA`.
 * __Implicitly__, i.e. simply not present in the data.
@@ -442,7 +442,7 @@ The best place to start is almost always to gathering together the columns that
   in the variable names (e.g. `new_sp_m014`, `new_ep_m014`, `new_ep_f014`) these are likely to be values, not variables.
 
-So we need to gather together all the columns from `new_sp_m3544` to `newrel_f65`. We don't know what those values represent yet, so we'll give them the generic name `"key"`. We know the cells repesent the count of cases, so we'll use the variable `cases`. There are a lot of missing values in the current representation, so for now we'll use `na.rm` just so we can focus on the values that are present.
+So we need to gather together all the columns from `new_sp_m3544` to `newrel_f65`. We don't know what those values represent yet, so we'll give them the generic name `"key"`. We know the cells represent the count of cases, so we'll use the variable `cases`. There are a lot of missing values in the current representation, so for now we'll use `na.rm` just so we can focus on the values that are present.
 
 ```{r}
 who1 <- who %>%
@@ -550,7 +550,7 @@ who %>%
 
 ## Non-tidy data
 
-Before we continue on to other topics, it's worth talking briefly about non-tidy data. Earlier in the chapter, I used the perjorative term "messy" to refer to non-tidy data. That's an oversimplification: there are lots of useful and well founded data structures that are not tidy data. There are two mains reasons to use other data structures:
+Before we continue on to other topics, it's worth talking briefly about non-tidy data. Earlier in the chapter, I used the pejorative term "messy" to refer to non-tidy data. That's an oversimplification: there are lots of useful and well founded data structures that are not tidy data. There are two mains reasons to use other data structures:
 
 * Alternative representations may have substantial performance or space advantages.