From 6afdb03666783447af71fbc1a5a00751b509aa52 Mon Sep 17 00:00:00 2001
From: hadley
Date: Fri, 7 Oct 2016 08:16:20 -0500
Subject: [PATCH] More @jennybc comments

---
 import.Rmd          | 4 ++--
 relational-data.Rmd | 8 ++++----
 tibble.Rmd          | 3 +++
 tidy.Rmd            | 9 +++++----
 4 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/import.Rmd b/import.Rmd
index 840521e..6f194c8 100644
--- a/import.Rmd
+++ b/import.Rmd
@@ -285,14 +285,14 @@ Encodings are a rich and complex topic, and I've only scratched the surface here
 
 ### Factors {#readr-factors}
 
-R uses factors to represent categorical variables that have a known set of possible values. Given `parse_factor()` a vector of known `levels` to generate a warning whenever an unexpected value is present:
+R uses factors to represent categorical variables that have a known set of possible values. Give `parse_factor()` a vector of known `levels` to generate a warning whenever an unexpected value is present:
 
 ```{r}
 fruit <- c("apple", "banana")
 parse_factor(c("apple", "banana", "bananana"), levels = fruit)
 ```
 
-If you have problematic entries, it's often easier to read in as strings and then use the tools you'll learn about in [strings] and [factors] to clean them up.
+But if you have many problematic entries, it's often easier to leave them as character vectors and then use the tools you'll learn about in [strings] and [factors] to clean them up.
 
 ### Dates, date-times, and times {#readr-datetimes}
 
diff --git a/relational-data.Rmd b/relational-data.Rmd
index 9760bd7..a46fd0a 100644
--- a/relational-data.Rmd
+++ b/relational-data.Rmd
@@ -90,10 +90,6 @@ For nycflights13:
     it contained weather records for all airports in the USA, what additional
     relation would it define with `flights`?
 
-1.  You might expect that there's an implicit relationship between plane
-    and airline, because each plane is flown by a single airline. Confirm
-    or reject this hypothesis using data.
-
 1.  We know that some days of the year are "special", and fewer people than
     usual fly on them. How might you represent that data as a data frame?
     What would be the primary keys of that table? How would it connect to the
@@ -531,6 +527,10 @@ flights %>%
 1.  What does `anti_join(flights, airports, by = c("dest" = "faa"))` tell you?
     What does `anti_join(airports, flights, by = c("faa" = "dest"))` tell you?
 
+1.  You might expect that there's an implicit relationship between plane
+    and airline, because each plane is flown by a single airline. Confirm
+    or reject this hypothesis using the tools you've learned above.
+
 ## Join problems
 
 The data you've been working with in this chapter has been cleaned up so that you'll have as few problems as possible. Your own data is unlikely to be so nice, so there are a few things that you should do with your own data to make your joins go smoothly.
diff --git a/tibble.Rmd b/tibble.Rmd
index 51d2c3d..9ef2015 100644
--- a/tibble.Rmd
+++ b/tibble.Rmd
@@ -158,6 +158,9 @@ The main reason that some older functions don't work with tibble is the `[` func
     df[, c("abc", "xyz")]
     ```
 
+1.  If you have the name of a variable stored in an object, e.g. `var <- "mpg"`,
+    how can you extract the reference variable from a tibble?
+
 1.  Practice referring to non-syntactic names in the following data frame by:
 
     1.  Extracting the variable called `1`.
diff --git a/tidy.Rmd b/tidy.Rmd
index 770a78d..5d41fe3 100644
--- a/tidy.Rmd
+++ b/tidy.Rmd
@@ -340,7 +340,8 @@ table5 %>%
     do? Why would you set it to `FALSE`?
 
 1.  Compare and contrast `separate()` and `extract()`. Why are there
-    three variations of separation, but only one unite?
+    three variations of separation (by position, by separator, and with
+    groups), but only one unite?
 
 ## Missing values
 
@@ -441,7 +442,7 @@ The best place to start is almost always to gather together the columns that are
   in the variable names (e.g. `new_sp_m014`, `new_ep_m014`, `new_ep_f014`)
   these are likely to be values, not variables.
 
-So we need to gather together all the columns from `new_sp_m3544` to `newrel_f65`. We don't know what those values represent yet, so we'll give them the generic name `"key"`. We know the cells represent the count of cases, so we'll use the variable `cases`. There are a lot of missing values in the current representation, so for now we'll use `na.rm` just so we can focus on the values that are present.
+So we need to gather together all the columns from `new_sp_m014` to `newrel_f65`. We don't know what those values represent yet, so we'll give them the generic name `"key"`. We know the cells represent the count of cases, so we'll use the variable `cases`. There are a lot of missing values in the current representation, so for now we'll use `na.rm` just so we can focus on the values that are present.
 
 ```{r}
 who1 <- who %>%
@@ -539,10 +540,10 @@ who %>%
     missing values? What's the difference between an `NA` and zero?
 
 1.  What happens if you neglect the `mutate()` step?
+    (`mutate(key = stringr::str_replace(key, "newrel", "new_rel"))`)
 
 1.  I claimed that `iso2` and `iso3` were redundant with `country`.
-    Confirm my claim by creating a table that uniquely maps from `country`
-    to `iso2` and `iso3`.
+    Confirm this claim.
 
 1.  For each country, year, and sex compute the total number of cases of
     TB. Make an informative visualisation of the data.