diff --git a/data-tidy.qmd b/data-tidy.qmd
index f53dfd8..6ecb267 100644
--- a/data-tidy.qmd
+++ b/data-tidy.qmd
@@ -44,16 +44,6 @@ You can represent the same underlying data in multiple ways.
 The example below shows the same data organized in three different ways.
 Each dataset shows the same values of four variables: *country*, *year*, *population*, and number of documented *cases* of TB (tuberculosis), but each dataset organizes the values in a different way.
 
-```{r}
-#| echo: false
-
-table2 <- table1 |>
-  pivot_longer(cases:population, names_to = "type", values_to = "count")
-
-table3 <- table2 |>
-  pivot_wider(names_from = year, values_from = count)
-```
-
 ```{r}
 table1
 
@@ -136,7 +126,7 @@ ggplot(table1, aes(x = year, y = cases)) +
 
 1. For each of the sample tables, describe what each observation and each column represents.
 
-2. Sketch out the process you'd use to calculate the `rate` for `table2` and `table3`.
+2. Sketch out the process you'd use to calculate the `rate` from `table2`.
    You will need to perform four operations:
 
    a. Extract the number of TB cases per country per year.
@@ -360,7 +350,7 @@ There are two columns that are already variables and are easy to interpret: `cou
 They are followed by 56 columns like `sp_m_014`, `ep_m_4554`, and `rel_m_3544`.
 If you stare at these columns for long enough, you'll notice there's a pattern.
 Each column name is made up of three pieces separated by `_`.
-The first piece, `sp`/`rel`/`ep`, describes the method used for the diagnosis, the second piece, `m`/`f` is the `gender` (coded as a binary variable in this dataset), and the third piece, `014`/`1524`/`2534`/`3544`/`4554`/`5564/``65` is the `age` range (`014` represents 0-14, for example).
+The first piece, `sp`/`rel`/`ep`, describes the method used for the diagnosis, the second piece, `m`/`f` is the `gender` (coded as a binary variable in this dataset), and the third piece, `014`/`1524`/`2534`/`3544`/`4554`/``` 5564/``65 ``` is the `age` range (`014` represents 0-14, for example).
 So in this case we have six pieces of information recorded in `who2`: the country and the year (already columns); the method of diagnosis, the gender category, and the age range category (contained in the other column names); and the count of patients in that category (cell values).
 To organize these six pieces of information in six separate columns, we use `pivot_longer()` with a vector of column names for `names_to` and instructors for splitting the original variable names into pieces for `names_sep` as well as a column name for `values_to`: