From 86efe55bc294e90a6b651f5577b77ca3044abe6d Mon Sep 17 00:00:00 2001
From: Mine Cetinkaya-Rundel
Date: Fri, 10 Mar 2023 08:12:25 -0500
Subject: [PATCH] Mostly hide msgs to save space (#1356)

---
 data-import.qmd | 40 ++++++++++++++++------------------------
 1 file changed, 16 insertions(+), 24 deletions(-)

diff --git a/data-import.qmd b/data-import.qmd
index 6b8a0e4..312eb63 100644
--- a/data-import.qmd
+++ b/data-import.qmd
@@ -58,8 +58,7 @@ read_csv("data/students.csv") |>
 
 We can read this file into R using `read_csv()`.
 The first argument is the most important: the path to the file.
-You can think about the path as the address of the file.
-The following says that the file is called `students.csv` and that it's in the `data` folder.
+You can think about the path as the address of the file: the file is called `students.csv` and that it lives in the `data` folder.
 
 ```{r}
 #| message: true
@@ -114,7 +113,7 @@ students |>
 
 An alternative approach is to use `janitor::clean_names()` to use some heuristics to turn them all into snake case at once[^data-import-1].
 
-[^data-import-1]: The [janitor](http://sfirke.github.io/janitor/) package is not part of the tidyverse, but it offers handy functions for data cleaning and works well within data pipelines that uses `|>`.
+[^data-import-1]: The [janitor](http://sfirke.github.io/janitor/) package is not part of the tidyverse, but it offers handy functions for data cleaning and works well within data pipelines that use `|>`.
 
 ```{r}
 #| message: false
@@ -128,9 +127,7 @@ For example, `meal_plan` is a categorical variable with a known set of possible
 ```{r}
 students |>
   janitor::clean_names() |>
-  mutate(
-    meal_plan = factor(meal_plan)
-  )
+  mutate(meal_plan = factor(meal_plan))
 ```
 
 Note that the values in the `meal_plan` variable have stayed the same, but the type of variable denoted underneath the variable name has changed from character (`<chr>`) to factor (`<fct>`).
@@ -307,12 +304,14 @@ It then works through the following questions:
 You can see that behavior in action in this simple example:
 
 ```{r}
+#| message: false
+
 read_csv("
   logical,numeric,date,string
   TRUE,1,2021-01-15,abc
   false,4.5,2021-02-15,def
-  T,Inf,2021-02-16,ghi"
-)
+  T,Inf,2021-02-16,ghi
+")
 ```
 
 This heuristic works well if you have a clean dataset, but in real life, you'll encounter a selection of weird and beautiful failures.
@@ -331,13 +330,14 @@ simple_csv <- "
   .
   20
   30"
-
 ```
 
 If we read it without any additional arguments, `x` becomes a character column:
 
 ```{r}
-df <- read_csv(simple_csv)
+#| message: false
+
+read_csv(simple_csv)
 ```
 
 In this very small case, you can easily see the missing value `.`.
@@ -363,7 +363,9 @@ That suggests this dataset uses `.` for missing values.
 So then we set `na = "."`, the automatic guessing succeeds, giving us the numeric column that we want:
 
 ```{r}
-df <- read_csv(simple_csv, na = ".")
+#| message: false
+
+read_csv(simple_csv, na = ".")
 ```
 
 ### Column types
@@ -407,6 +409,8 @@ For example, you might have sales data for multiple months, with each month's da
 With `read_csv()` you can read these data in at once and stack them on top of each other in a single data frame.
 
 ```{r}
+#| message: false
+
 sales_files <- c("data/01-sales.csv", "data/02-sales.csv", "data/03-sales.csv")
 read_csv(sales_files, id = "file")
 ```
@@ -425,7 +429,7 @@ sales_files <- c(
 read_csv(sales_files, id = "file")
 ```
 
-With the additional `id` parameter we have added a new column called `file` to the resulting data frame that identifies the file the data come from.
+The `id` argument adds a new column called `file` to the resulting data frame that identifies the file the data come from.
 This is especially helpful in circumstances where the files you're reading in do not have an identifying column that can help you trace the observations back to their original sources.
 
 If you have many files you want to read in, it can get cumbersome to write out their names as a list.
@@ -515,18 +519,6 @@ tibble(
 )
 ```
 
-Note that every column in tibble must be same size, so you'll get an error if they're not:
-
-```{r}
-#| error: true
-
-tibble(
-  x = c(1, 2),
-  y = c("h", "m", "g"),
-  z = c(0.08, 0.83, 0.6)
-)
-```
-
 Laying out the data by column can make it hard to see how the rows are related, so an alternative is `tribble()`, short for **tr**ansposed t**ibble**, which lets you lay out your data row by row.
 `tribble()` is customized for data entry in code: column headings start with `~` and entries are separated by commas.
 This makes it possible to lay out small amounts of data in an easy to read form: