Mostly hide msgs to save space (#1356)

Mine Cetinkaya-Rundel 2023-03-10 08:12:25 -05:00 committed by GitHub
parent 0a134cb118
commit 86efe55bc2
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
1 changed files with 16 additions and 24 deletions

@@ -58,8 +58,7 @@ read_csv("data/students.csv") |>
We can read this file into R using `read_csv()`.
The first argument is the most important: the path to the file.
You can think about the path as the address of the file.
The following says that the file is called `students.csv` and that it's in the `data` folder.
You can think about the path as the address of the file: the file is called `students.csv` and it lives in the `data` folder.
```{r}
#| message: true
@@ -114,7 +113,7 @@ students |>
An alternative approach is to use `janitor::clean_names()` to use some heuristics to turn them all into snake case at once[^data-import-1].
[^data-import-1]: The [janitor](http://sfirke.github.io/janitor/) package is not part of the tidyverse, but it offers handy functions for data cleaning and works well within data pipelines that uses `|>`.
[^data-import-1]: The [janitor](http://sfirke.github.io/janitor/) package is not part of the tidyverse, but it offers handy functions for data cleaning and works well within data pipelines that use `|>`.
```{r}
#| message: false
@@ -128,9 +127,7 @@ For example, `meal_plan` is a categorical variable with a known set of possible
```{r}
students |>
janitor::clean_names() |>
mutate(
meal_plan = factor(meal_plan)
)
mutate(meal_plan = factor(meal_plan))
```
Note that the values in the `meal_plan` variable have stayed the same, but the type of variable denoted underneath the variable name has changed from character (`<chr>`) to factor (`<fct>`).
@@ -307,12 +304,14 @@ It then works through the following questions:
You can see that behavior in action in this simple example:
```{r}
#| message: false
read_csv("
logical,numeric,date,string
TRUE,1,2021-01-15,abc
false,4.5,2021-02-15,def
T,Inf,2021-02-16,ghi"
)
T,Inf,2021-02-16,ghi
")
```
This heuristic works well if you have a clean dataset, but in real life, you'll encounter a selection of weird and beautiful failures.
@@ -331,13 +330,14 @@ simple_csv <- "
.
20
30"
```
If we read it without any additional arguments, `x` becomes a character column:
```{r}
df <- read_csv(simple_csv)
#| message: false
read_csv(simple_csv)
```
In this very small case, you can easily see the missing value `.`.
@@ -363,7 +363,9 @@ That suggests this dataset uses `.` for missing values.
So then we set `na = "."`, the automatic guessing succeeds, giving us the numeric column that we want:
```{r}
df <- read_csv(simple_csv, na = ".")
#| message: false
read_csv(simple_csv, na = ".")
```
### Column types
@@ -407,6 +409,8 @@ For example, you might have sales data for multiple months, with each month's da
With `read_csv()` you can read these data in at once and stack them on top of each other in a single data frame.
```{r}
#| message: false
sales_files <- c("data/01-sales.csv", "data/02-sales.csv", "data/03-sales.csv")
read_csv(sales_files, id = "file")
```
@@ -425,7 +429,7 @@ sales_files <- c(
read_csv(sales_files, id = "file")
```
With the additional `id` parameter we have added a new column called `file` to the resulting data frame that identifies the file the data come from.
The `id` argument adds a new column called `file` to the resulting data frame that identifies the file the data come from.
This is especially helpful in circumstances where the files you're reading in do not have an identifying column that can help you trace the observations back to their original sources.
If you have many files you want to read in, it can get cumbersome to write out their names as a list.
@@ -515,18 +519,6 @@ tibble(
)
```
Note that every column in tibble must be same size, so you'll get an error if they're not:
```{r}
#| error: true
tibble(
x = c(1, 2),
y = c("h", "m", "g"),
z = c(0.08, 0.83, 0.6)
)
```
Laying out the data by column can make it hard to see how the rows are related, so an alternative is `tribble()`, short for **tr**ansposed t**ibble**, which lets you lay out your data row by row.
`tribble()` is customized for data entry in code: column headings start with `~` and entries are separated by commas.
This makes it possible to lay out small amounts of data in an easy to read form: