Minor edits

Mine Çetinkaya-Rundel 2022-03-04 23:58:23 -05:00
parent eb61248d8c
commit 4f32e9afcc
1 changed files with 10 additions and 10 deletions

@@ -74,7 +74,7 @@ read_csv("a,b,c
```
In both cases `read_csv()` uses the first line of the data for the column names, which is a very common convention.
There are two cases where you might want to tweak this behaviour:
There are two cases where you might want to tweak this behavior:
1. Sometimes there are a few lines of metadata at the top of the file.
You can use `skip = n` to skip the first `n` lines; or use `comment = "#"` to drop all lines that start with (e.g.) `#`.
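
As a minimal sketch of the two options in item 1, the calls below read inline text instead of a file. The sample contents are made up for illustration, `I()` marks the string as literal data, and `show_col_types = FALSE` is only there to silence the column-type message.

```{r}
library(readr)

# Two lines of metadata precede the header row, so skip them.
with_metadata <- read_csv(
  I("The first line of metadata\nThe second line of metadata\nx,y,z\n1,2,3"),
  skip = 2,
  show_col_types = FALSE
)

# Lines that start with "#" are dropped before parsing.
with_comment <- read_csv(
  I("# A comment I want to skip\nx,y,z\n1,2,3"),
  comment = "#",
  show_col_types = FALSE
)
```

In both cases the header row `x,y,z` should end up as the column names, with a single row of data underneath.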
@@ -118,7 +118,7 @@ To read in more challenging files, you'll need to learn more about how readr par
### First steps
Let's take another look at the `students` data.
In the `favourite.food` column, there are a bunch of foot items and then the character string `N/A`, which should have been an real `NA` that R will recognize as "not available".
In the `favourite.food` column, there are a bunch of food items and then the character string `N/A`, which should have been a real `NA` that R will recognize as "not available".
This is something we can address using the `na` argument.
```{r message = FALSE}
@@ -127,7 +127,7 @@ students <- read_csv("data/students.csv", na = c("N/A", ""))
students
```
Once you read data in, the first step is usually involve transforming it in some way to make it easier to work with in the rest of your analysis.
Once you read data in, the first step usually involves transforming it in some way to make it easier to work with in the rest of your analysis.
For example, the column names in the `students` file we read in are formatted in non-standard ways.
You might consider renaming them one by one with `dplyr::rename()` or you might use the `janitor::clean_names()` function to turn them all into snake case at once.[^data-import-1]
This function takes in a data frame and returns a data frame with variable names converted to snake case.
@@ -140,7 +140,7 @@ students |>
clean_names()
```
Another common task after reading in data is to consider the variable types.
Another common task after reading in data is to consider variable types.
For example, `meal_type` is a categorical variable with a known set of possible values.
In R, factors can be used to work with categorical variables.
We can convert this variable to a factor using the `factor()` function.
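
A minimal sketch of that conversion, using a hypothetical stand-in vector rather than the real `students` data (the values below are assumed for illustration):

```{r}
# Hypothetical values standing in for the `meal_type` column
meal_type <- c("Lunch only", "Breakfast and lunch", "Lunch only")

# factor() records the distinct values as levels, sorted alphabetically by default
meal_type_fct <- factor(meal_type)
levels(meal_type_fct)
```

If the set of possible values is known in advance, passing it through the `levels` argument of `factor()` makes the level order explicit instead of relying on the alphabetical default.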
@@ -162,23 +162,22 @@ We discuss the details of fixing this issue in Chapter \@ref(import-spreadsheets
### Compared to base R
If you've used R before, you might wonder why we're not using `read.csv()`.
There are a few good reasons to favour readr functions over the base equivalents:
There are a few good reasons to favor readr functions over the base equivalents:
- They are typically much faster (\~10x) than their base equivalents.
Long running jobs have a progress bar, so you can see what's happening.
If you're looking for raw speed, try `data.table::fread()`.
It doesn't fit quite so well into the tidyverse, but it can be quite a bit faster.
- They produce tibbles, they don't convert character vectors to factors, use row names, or munge the column names.
- They produce tibbles, and they don't use row names or munge the column names.
These are common sources of frustration with the base R functions.
- They are more reproducible.
Base R functions inherit some behaviour from your operating system and environment variables, so import code that works on your computer might not work on someone else's.
Base R functions inherit some behavior from your operating system and environment variables, so import code that works on your computer might not work on someone else's.
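
The second point can be seen in a small side-by-side sketch. The inline CSV text and its column names are invented for illustration, and `show_col_types = FALSE` only silences the column-type message.

```{r}
library(readr)

# Made-up data whose column names are not syntactic R names
csv_text <- "first name,2nd score\nAda,90\nLin,85"

base_df <- read.csv(text = csv_text)
readr_tbl <- read_csv(I(csv_text), show_col_types = FALSE)

names(base_df)    # base R rewrites the names to make them syntactic
names(readr_tbl)  # readr keeps the names as written in the file
class(readr_tbl)  # includes "tbl_df", i.e. a tibble
```

Base R's renaming can be turned off with `check.names = FALSE`, but readr's default avoids the surprise in the first place.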
### Exercises
1. What function would you use to read a file where fields were separated with\
"\|"?
1. What function would you use to read a file where fields were separated with "\|"?
2. Apart from `file`, `skip`, and `comment`, what other arguments do `read_csv()` and `read_tsv()` have in common?
@@ -218,7 +217,7 @@ With the additional `id` parameter we have added a new column called `file` to t
This is especially helpful in circumstances where the files you're reading in do not have an identifying column that can help you trace the observations back to their original sources.
If you have many files you want to read in, it can get cumbersome to write out their names as a list.
Instead, you can use the `dir_ls()` function from the fs package to find the files for you by matching a pattern in the file names.
Instead, you can use the `dir_ls()` function from the [fs](https://fs.r-lib.org/) package to find the files for you by matching a pattern in the file names.
```{r}
library(fs)
@@ -244,6 +243,7 @@ You can also specify how missing values are written with `na`, and if you want t
write_csv(students, "students.csv")
```
Now let's read that csv file back in.
Note that the type information is lost when you save to csv:
```{r, warning = FALSE, message = FALSE}