# Data import {#sec-data-import}

```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("polishing")
```

## Introduction

Working with data provided by R packages is a great way to learn the tools of data science, but at some point you want to stop learning and start working with your own data.
In this chapter, you'll learn how to read plain-text rectangular files into R.
Here, we'll only scratch the surface of data import, but many of the principles will translate to other forms of data, which we'll come back to in @sec-wrangle.

### Prerequisites

In this chapter, you'll learn how to load flat files in R with the **readr** package, which is part of the core tidyverse.

```{r}
#| label: setup
#| message: false

library(tidyverse)
```

## Getting started

Most of readr's functions are concerned with turning flat files into data frames:

-   `read_csv()` reads comma delimited files, `read_csv2()` reads semicolon separated files (common in countries where `,` is used as the decimal separator), `read_tsv()` reads tab delimited files, and `read_delim()` reads in files with any delimiter.

-   `read_fwf()` reads fixed width files.
    You can specify fields either by their widths with `fwf_widths()` or their position with `fwf_positions()`.
    `read_table()` reads a common variation of fixed width files where columns are separated by white space.

-   `read_log()` reads Apache style log files.
    (But also check out [webreadr](https://github.com/Ironholds/webreadr) which is built on top of `read_log()` and provides many more helpful tools.)

These functions all have similar syntax: once you've mastered one, you can use the others with ease.
For the rest of this chapter we'll focus on `read_csv()`.
Not only are csv files one of the most common forms of data storage, but once you understand `read_csv()`, you can easily apply your knowledge to all the other functions in readr.
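To illustrate how similar the syntax is, here is a minimal sketch that reads the same kind of inline data with two of the other functions. The inline strings are made up for illustration; only the function names come from the list above.

```{r}
#| message: false

library(readr)

# Semicolon separated fields, with "," as the decimal separator
semi <- read_csv2("a;b\n1,5;2")
semi

# Tab separated fields
tabbed <- read_tsv("a\tb\n1\t2")
tabbed
```

In both calls the first argument plays the same role as in `read_csv()`; only the expected delimiter differs.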

## Reading data from a file

Here is what a simple CSV file with a row for column names (also commonly referred to as the header row) and six rows of data looks like.

```{r}
#| echo: false
#| message: false

read_lines("data/students.csv") |> cat(sep = "\n")
```

Note that the `,`s separate the columns.
@tbl-students-table shows a representation of the same data as a table.

```{r}
#| label: tbl-students-table
#| echo: false
#| message: false
#| tbl-cap: Data from the students.csv file as a table.

read_csv("data/students.csv") |>
  knitr::kable()
```

The first argument to `read_csv()` is the most important: it's the path to the file to read.

```{r}
#| message: true

students <- read_csv("data/students.csv")
```

When you run `read_csv()` it prints out a message that tells you how many rows (excluding the header row) and columns the data has along with the delimiter used, and the column specifications (names of columns organized by the type of data the column contains).
It also prints out some information about how to retrieve the full column specification as well as how to quiet this message.
This message is an important part of readr, which we'll come back to in @sec-parsing-a-file on parsing a file.
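The two hints in that message can be tried on a small inline file. This is a sketch with made-up data: `show_col_types = FALSE` quiets the message, and `spec()` retrieves the full column specification.

```{r}
library(readr)

# show_col_types = FALSE silences the column specification message
df <- read_csv("x,y\n1,a\n2,b", show_col_types = FALSE)

# spec() returns the full column specification that readr guessed
spec(df)
```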

You can also supply an inline csv file.
This is useful for experimenting with readr and for creating reproducible examples to share with others:

```{r}
#| message: false

read_csv("a,b,c
1,2,3
4,5,6")
```

In both cases `read_csv()` uses the first line of the data for the column names, which is a very common convention.
There are two cases where you might want to tweak this behavior:

1.  Sometimes there are a few lines of metadata at the top of the file.
    You can use `skip = n` to skip the first `n` lines; or use `comment = "#"` to drop all lines that start with (e.g.) `#`.

    ```{r}
    #| message: false

    read_csv("The first line of metadata
      The second line of metadata
      x,y,z
      1,2,3", skip = 2)

    read_csv("# A comment I want to skip
      x,y,z
      1,2,3", comment = "#")
    ```

2.  The data might not have column names.
    You can use `col_names = FALSE` to tell `read_csv()` not to treat the first row as headings, and instead label them sequentially from `X1` to `Xn`:

    ```{r}
    #| message: false

    read_csv("1,2,3\n4,5,6", col_names = FALSE)
    ```

    (`"\n"` is a convenient shortcut for adding a new line. You'll learn more about it and other types of string escape in [Chapter -@sec-strings].)

    Alternatively you can pass `col_names` a character vector which will be used as the column names:

    ```{r}
    #| message: false

    read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
    ```

Another option that commonly needs tweaking is `na`: this specifies the value (or values) that are used to represent missing values in your file:

```{r}
#| message: false

read_csv("a,b,c\n1,2,.", na = ".")
```

This is all you need to know to read \~75% of CSV files that you'll encounter in practice.
You can also easily adapt what you've learned to read tab separated files with `read_tsv()` and fixed width files with `read_fwf()`.
To read in more challenging files, you'll need to learn more about how readr parses each column, turning them into R vectors.

### First steps

Let's take another look at the `students` data.
In the `favourite.food` column, there are a bunch of food items and then the character string `N/A`, which should have been a real `NA` that R will recognize as "not available".
This is something we can address using the `na` argument.

```{r}
#| message: false

students <- read_csv("data/students.csv", na = c("N/A", ""))

students
```

Once you read data in, the first step usually involves transforming it in some way to make it easier to work with in the rest of your analysis.
For example, the column names in the `students` file we read in are formatted in non-standard ways.
You might consider renaming them one by one with `dplyr::rename()` or you might use the `janitor::clean_names()` function to turn them all into snake case at once.[^data-import-1]
This function takes in a data frame and returns a data frame with variable names converted to snake case.

[^data-import-1]: The [janitor](http://sfirke.github.io/janitor/) package is not part of the tidyverse, but it offers handy functions for data cleaning and works well within data pipelines that use `|>`.

```{r}
#| message: false

library(janitor)
students |>
  clean_names()
```
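If only a few columns need new names, renaming them individually may be enough. Here is a minimal sketch with `dplyr::rename()`, using a small inline file whose header imitates some of the `students` column names (the single data row is illustrative, not the full file):

```{r}
library(readr)
library(dplyr)

raw <- read_csv(
  "Student ID,Full Name,AGE\n1,Sunil Huffmann,4",
  show_col_types = FALSE
)

# rename() takes new_name = old_name pairs; names with spaces need backticks
renamed <- raw |>
  rename(
    student_id = `Student ID`,
    full_name = `Full Name`,
    age = AGE
  )

renamed
```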

Another common task after reading in data is to consider variable types.
For example, `meal_plan` is a categorical variable with a known set of possible values.
In R, factors can be used to work with categorical variables.
We can convert this variable to a factor using the `factor()` function.
You'll learn more about factors in [Chapter -@sec-factors].

```{r}
students <- students |>
  clean_names() |>
  mutate(meal_plan = factor(meal_plan))

students
```

Note that the values in the `meal_plan` variable have stayed exactly the same, but the type of variable denoted underneath the variable name has changed from character (`<chr>`) to factor (`<fct>`).

Before you move on to analyzing these data, you'll probably want to fix the `age` column as well: currently it's a character variable because of the one observation that is typed out as `five` instead of a numeric `5`.
We discuss how to fix this issue in more detail in [Chapter -@sec-import-spreadsheets].
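As a preview, one possible approach is to replace the spelled-out value and then parse the column as a number. This is only a sketch on a made-up inline file, and the later chapter may handle the issue differently:

```{r}
library(readr)
library(dplyr)

# One age is typed out as a word and one is missing, so the column is read as character
ages <- read_csv("name,age\nA,4\nB,\nC,five", show_col_types = FALSE)

fixed <- ages |>
  mutate(age = parse_number(if_else(age == "five", "5", age)))

fixed
```

The missing value stays `NA`, and the column becomes numeric.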

### Compared to base R

If you've used R before, you might wonder why we're not using `read.csv()`.
There are a few good reasons to favor readr functions over the base equivalents:

-   They are typically much faster (\~10x) than their base equivalents.
    Long running jobs have a progress bar, so you can see what's happening.
    If you're looking for raw speed, try `data.table::fread()`.
    It doesn't fit quite so well into the tidyverse, but it can be quite a bit faster.

-   They produce tibbles, and they don't use row names or munge the column names.
    These are common sources of frustration with the base R functions.

-   They are more reproducible.
    Base R functions inherit some behavior from your operating system and environment variables, so import code that works on your computer might not work on someone else's.
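The difference in column name handling is easy to see on a small example. This sketch reads the same made-up inline text with both functions:

```{r}
library(readr)

csv_text <- "Student ID,AGE\n1,4"

# Base R rewrites the header into syntactic names and returns a data.frame
base_df <- read.csv(text = csv_text)
names(base_df)

# readr keeps the header as written and returns a tibble
readr_df <- read_csv(csv_text, show_col_types = FALSE)
names(readr_df)
```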

### Exercises

1.  What function would you use to read a file where fields were separated with "\|"?

2.  Apart from `file`, `skip`, and `comment`, what other arguments do `read_csv()` and `read_tsv()` have in common?

3.  What are the most important arguments to `read_fwf()`?

4.  Sometimes strings in a CSV file contain commas.
    To prevent them from causing problems they need to be surrounded by a quoting character, like `"` or `'`. By default, `read_csv()` assumes that the quoting character will be `"`.
    What argument to `read_csv()` do you need to specify to read the following text into a data frame?

    ```{r}
    #| eval: false

    "x,y\n1,'a,b'"
    ```

5.  Identify what is wrong with each of the following inline CSV files.
    What happens when you run the code?

    ```{r}
    #| eval: false

    read_csv("a,b\n1,2,3\n4,5,6")
    read_csv("a,b,c\n1,2\n1,2,3,4")
    read_csv("a,b\n\"1")
    read_csv("a,b\n1,2\na,b")
    read_csv("a;b\n1;3")
    ```

## Reading data from multiple files

Sometimes your data is split across multiple files instead of being contained in a single file.
For example, you might have sales data for multiple months, with each month's data in a separate file: `01-sales.csv` for January, `02-sales.csv` for February, and `03-sales.csv` for March.
With `read_csv()` you can read these data in at once and stack them on top of each other in a single data frame.

```{r}
sales_files <- c("data/01-sales.csv", "data/02-sales.csv", "data/03-sales.csv")
read_csv(sales_files, id = "file")
```

With the additional `id` parameter we have added a new column called `file` to the resulting data frame that identifies the file the data come from.
This is especially helpful in circumstances where the files you're reading in do not have an identifying column that can help you trace the observations back to their original sources.

If you have many files you want to read in, it can get cumbersome to write out their names by hand.
Instead, you can use the `dir_ls()` function from the [fs](https://fs.r-lib.org/) package to find the files for you by matching a pattern in the file names.

```{r}
library(fs)
sales_files <- dir_ls("data", glob = "*sales.csv")
sales_files
```
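Putting the two steps together, the sketch below finds files by pattern and reads them in one call. To stay self-contained it writes two small made-up files to a temporary directory, and it uses base R's `list.files()` in place of `dir_ls()`; both the file contents and that substitution are assumptions for illustration.

```{r}
library(readr)

# Create a temporary directory with two small sales files
tmp <- tempfile("sales-")
dir.create(tmp)
write_csv(data.frame(month = "January", n = c(10, 20)), file.path(tmp, "01-sales.csv"))
write_csv(data.frame(month = "February", n = 30), file.path(tmp, "02-sales.csv"))

# Find the files by pattern, then read and stack them, recording the source file
files <- list.files(tmp, pattern = "sales\\.csv$", full.names = TRUE)
sales <- read_csv(files, id = "file", show_col_types = FALSE)

sales
```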

## Writing to a file {#sec-writing-to-a-file}

readr also comes with two useful functions for writing data back to disk: `write_csv()` and `write_tsv()`.
Both functions increase the chances of the output file being read back in correctly by:

-   Always encoding strings in UTF-8.

-   Saving dates and date-times in ISO8601 format so they are easily parsed elsewhere.

If you want to export a csv file to Excel, use `write_excel_csv()` --- this writes a special character (a "byte order mark") at the start of the file which tells Excel that you're using the UTF-8 encoding.

The most important arguments are `x` (the data frame to save), and `file` (the location to save it).
You can also specify how missing values are written with `na`, and if you want to `append` to an existing file.

```{r}
#| eval: false

write_csv(students, "students.csv")
```
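The `na` and `append` arguments can be sketched on a small made-up data frame written to a temporary file:

```{r}
library(readr)

path <- tempfile(fileext = ".csv")
df <- data.frame(x = c(1, NA), y = c("a", "b"))

# Write missing values as "-" instead of the default "NA"
write_csv(df, path, na = "-")

# Add one more row to the existing file; no header is written when appending
write_csv(data.frame(x = 3, y = "c"), path, append = TRUE)

lines <- read_lines(path)
lines
```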

Now let's read that csv file back in.
Note that the type information is lost when you save to csv:

```{r}
#| warning: false
#| message: false

students
write_csv(students, "students-2.csv")
read_csv("students-2.csv")
```
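One way to get a lost type back is to supply the column specification again when reading. Here is a minimal sketch using a made-up two-row data frame and a temporary file, with `col_types` and `col_factor()` from readr:

```{r}
library(readr)

path <- tempfile(fileext = ".csv")
df <- data.frame(
  id = 1:2,
  meal_plan = factor(c("Lunch only", "Breakfast and lunch"))
)
write_csv(df, path)

# Without a specification the factor comes back as a character column
plain <- read_csv(path, show_col_types = FALSE)
class(plain$meal_plan)

# Declaring the column type restores the factor
restored <- read_csv(path, col_types = cols(meal_plan = col_factor()))
class(restored$meal_plan)
```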

This makes CSVs a little unreliable for caching interim results---you need to recreate the column specification every time you load in.
There are two alternatives:

1.  `write_rds()` and `read_rds()` are uniform wrappers around the base functions `saveRDS()` and `readRDS()`.
    These store data in R's custom binary format called RDS:

    ```{r}
    write_rds(students, "students.rds")
    read_rds("students.rds")
    ```

2.  The feather package implements a fast binary file format that can be shared across programming languages:

    ```{r}
    #| eval: false

    library(feather)
    write_feather(students, "students.feather")
    read_feather("students.feather")
    #> # A tibble: 6 × 5
    #>   student_id full_name        favourite_food     meal_plan             age
    #>        <dbl> <chr>            <chr>              <fct>               <dbl>
    #> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
    #> 2          2 Barclay Lynn     French fries       Lunch only              5
    #> 3          3 Jayendra Lyne    NA                 Breakfast and lunch     7
    #> 4          4 Leon Rossini     Anchovies          Lunch only             NA
    #> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
    #> 6          6 Güvenç Attila    Ice cream          Lunch only              6
    ```

Feather tends to be faster than RDS and is usable outside of R.
RDS supports list-columns (which you'll learn about in @sec-rectangling); feather currently does not.

```{r}
#| include: false
file.remove("students-2.csv")
file.remove("students.rds")
```
-												Convert to Quarto book (#1026)

* Use _quarto.yml
* Rmd -> qmd
* Update build type to None
* Update styling Styling to get close to bs4_book + color
* Convert crossrefs
* Covert chunk options
* Switch to plausible for analytics
* Update action
											
										
										
											2022-05-14 04:46:49 +08:00
+								# Data import {#sec-data-import}
-												Make sure first element is heading

											
										
										
											2015-12-12 02:34:20 +08:00
-												Convert to Quarto book (#1026)

* Use _quarto.yml
* Rmd -> qmd
* Update build type to None
* Update styling Styling to get close to bs4_book + color
* Convert crossrefs
* Covert chunk options
* Switch to plausible for analytics
* Update action
											
										
										
											2022-05-14 04:46:49 +08:00
+								```{r}
 								#| results: "asis"
 								#| echo: false
 								source("_common.R")
 								status("polishing")
-												Add chapter status

											
										
										
											2021-05-04 21:10:39 +08:00
+								```
-												Rough out import chapter

											
										
										
											2016-07-07 02:59:50 +08:00
+								## Introduction
-												Update import.Rmd

typos
											
										
										
											2016-01-27 21:33:35 +08:00
-												Use the visual editor (#923)

* Sentence wrap + visual editor Rmd style updates

* Move files not called in bookdown.yml to extra

* UK spelling + canonical source + sentence wrap

* Rename .rmd -> .Rmd

* Sentence wrap + visual editor Rmd style

* Fix capitalization
											
										
										
											2021-02-21 23:40:40 +08:00
+								Working with data provided by R packages is a great way to learn the tools of data science, but at some point you want to stop learning and start working with your own data.
 								In this chapter, you'll learn how to read plain-text rectangular files into R.
-												Mild import/wrangling reorg

											
										
										
											2022-06-20 23:40:11 +08:00
+								Here, we'll only scratch the surface of data import, but many of the principles will translate to other forms of data, which we'll come back to in @sec-wrangle.
-												Update import.Rmd

typos
											
										
										
											2016-01-27 21:33:35 +08:00
-												Rough out import chapter

											
										
										
											2016-07-07 02:59:50 +08:00
+								### Prerequisites
-												Start writing about readr

											
										
										
											2015-09-23 02:35:39 +08:00
-												Use the visual editor (#923)

* Sentence wrap + visual editor Rmd style updates

* Move files not called in bookdown.yml to extra

* UK spelling + canonical source + sentence wrap

* Rename .rmd -> .Rmd

* Sentence wrap + visual editor Rmd style

* Fix capitalization
											
										
										
											2021-02-21 23:40:40 +08:00
+								In this chapter, you'll learn how to load flat files in R with the **readr** package, which is part of the core tidyverse.
-												Start writing about readr

											
										
										
											2015-09-23 02:35:39 +08:00
-												UK -> US spelling, multi-line alt text, YAML chunk opts

											
										
										
											2022-05-08 13:32:25 +08:00
+								```{r}
 								#| label: setup
 								#| message: false
-												Use tidyverse package

Fixes #451

											
										
										
											2016-10-04 01:30:24 +08:00
+								library(tidyverse)
-												Rough out import chapter

											
										
										
											2016-07-07 02:59:50 +08:00
+								```
-												Start writing about readr

											
										
										
											2015-09-23 02:35:39 +08:00
-												Polishing data import

											
										
										
											2016-07-12 04:38:39 +08:00
+								## Getting started
-												Start writing about readr

											
										
										
											2015-09-23 02:35:39 +08:00
 								Most of readr's functions are concerned with turning flat files into data frames:
-												Use the visual editor (#923)

* Sentence wrap + visual editor Rmd style updates

* Move files not called in bookdown.yml to extra

* UK spelling + canonical source + sentence wrap

* Rename .rmd -> .Rmd

* Sentence wrap + visual editor Rmd style

* Fix capitalization
											
										
										
											2021-02-21 23:40:40 +08:00
+								-   `read_csv()` reads comma delimited files, `read_csv2()` reads semicolon separated files (common in countries where `,` is used as the decimal place), `read_tsv()` reads tab delimited files, and `read_delim()` reads in files with any delimiter.
-												Start writing about readr

											
										
										
											2015-09-23 02:35:39 +08:00
-												Use the visual editor (#923)

* Sentence wrap + visual editor Rmd style updates

* Move files not called in bookdown.yml to extra

* UK spelling + canonical source + sentence wrap

* Rename .rmd -> .Rmd

* Sentence wrap + visual editor Rmd style

* Fix capitalization
											
										
										
											2021-02-21 23:40:40 +08:00
+								-   `read_fwf()` reads fixed width files.
 								    You can specify fields either by their widths with `fwf_widths()` or their position with `fwf_positions()`.
 								    `read_table()` reads a common variation of fixed width files where columns are separated by white space.
-												Start writing about readr

											
										
										
											2015-09-23 02:35:39 +08:00
-												Use the visual editor (#923)

* Sentence wrap + visual editor Rmd style updates

* Move files not called in bookdown.yml to extra

* UK spelling + canonical source + sentence wrap

* Rename .rmd -> .Rmd

* Sentence wrap + visual editor Rmd style

* Fix capitalization
											
										
										
											2021-02-21 23:40:40 +08:00
+								-   `read_log()` reads Apache style log files.
 								    (But also check out [webreadr](https://github.com/Ironholds/webreadr) which is built on top of `read_log()` and provides many more helpful tools.)
-												Start writing about readr

											
										
										
											2015-09-23 02:35:39 +08:00
-												Use the visual editor (#923)

* Sentence wrap + visual editor Rmd style updates

* Move files not called in bookdown.yml to extra

* UK spelling + canonical source + sentence wrap

* Rename .rmd -> .Rmd

* Sentence wrap + visual editor Rmd style

* Fix capitalization
											
										
										
											2021-02-21 23:40:40 +08:00
+								These functions all have similar syntax: once you've mastered one, you can use the others with ease.
 								For the rest of this chapter we'll focus on `read_csv()`.
 								Not only are csv files one of the most common forms of data storage, but once you understand `read_csv()`, you can easily apply your knowledge to all the other functions in readr.
-												Start writing about readr

											
										
										
											2015-09-23 02:35:39 +08:00
-												Draft/outline of spreadsheets (#949)

* Add draft/outline of spreadsheets

* Finish reading from Excel

* Write to Excel + consistency edits

* Add pkgs for writing Excel files

* Release the 🐧s

* Reogranize to highlight tibble/data.frame diffs

* Write again

* Add spreadsheets reference + rename_with janitor

* Move janitor to TO DO in rectangular data

* Need to load tidyverse

* Show csv file

* Add TO DO note

* Use students instead of challenge file in this chp

* Quiet down some of the redundant messages

* Add bit on reading in multiple files

* Fix up students example

* Revert back to old dataset to match spreadsheets chapter

* Fix typos

* Use US spelling

* Comments from Hadley

* Incorporate Hadley's suggestions

Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
											
										
										
											2021-10-27 03:00:33 +08:00
+								## Reading data from a file
 								Here is what a simple CSV file with a row for column names (also commonly referred to as the header row) and six rows of data looks like.
-												UK -> US spelling, multi-line alt text, YAML chunk opts

											
										
										
											2022-05-08 13:32:25 +08:00
+								```{r}
 								#| echo: false
 								#| message: false
-												Convert from %>% to |>

											
										
										
											2022-02-24 03:15:52 +08:00
+								read_lines("data/students.csv") |> cat(sep = "\n")
-												Draft/outline of spreadsheets (#949)

* Add draft/outline of spreadsheets

* Finish reading from Excel

* Write to Excel + consistency edits

* Add pkgs for writing Excel files

* Release the 🐧s

* Reogranize to highlight tibble/data.frame diffs

* Write again

* Add spreadsheets reference + rename_with janitor

* Move janitor to TO DO in rectangular data

* Need to load tidyverse

* Show csv file

* Add TO DO note

* Use students instead of challenge file in this chp

* Quiet down some of the redundant messages

* Add bit on reading in multiple files

* Fix up students example

* Revert back to old dataset to match spreadsheets chapter

* Fix typos

* Use US spelling

* Comments from Hadley

* Incorporate Hadley's suggestions

Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
											
										
										
											2021-10-27 03:00:33 +08:00
+								```
 								Note that the `,`s separate the columns.
-												Convert to Quarto book (#1026)

* Use _quarto.yml
* Rmd -> qmd
* Update build type to None
* Update styling Styling to get close to bs4_book + color
* Convert crossrefs
* Covert chunk options
* Switch to plausible for analytics
* Update action
											
										
										
											2022-05-14 04:46:49 +08:00
+								@tbl-students-table shows a representation of the same data as a table.
-												Draft/outline of spreadsheets (#949)

* Add draft/outline of spreadsheets

* Finish reading from Excel

* Write to Excel + consistency edits

* Add pkgs for writing Excel files

* Release the 🐧s

* Reogranize to highlight tibble/data.frame diffs

* Write again

* Add spreadsheets reference + rename_with janitor

* Move janitor to TO DO in rectangular data

* Need to load tidyverse

* Show csv file

* Add TO DO note

* Use students instead of challenge file in this chp

* Quiet down some of the redundant messages

* Add bit on reading in multiple files

* Fix up students example

* Revert back to old dataset to match spreadsheets chapter

* Fix typos

* Use US spelling

* Comments from Hadley

* Incorporate Hadley's suggestions

Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
											
										
										
											2021-10-27 03:00:33 +08:00
-												UK -> US spelling, multi-line alt text, YAML chunk opts

											
										
										
											2022-05-08 13:32:25 +08:00
+								```{r}
-												Convert to Quarto book (#1026)

* Use _quarto.yml
* Rmd -> qmd
* Update build type to None
* Update styling Styling to get close to bs4_book + color
* Convert crossrefs
* Covert chunk options
* Switch to plausible for analytics
* Update action
											
										
										
											2022-05-14 04:46:49 +08:00
+								#| label: tbl-students-table
-												UK -> US spelling, multi-line alt text, YAML chunk opts

											
										
										
											2022-05-08 13:32:25 +08:00
+								#| echo: false
 								#| message: false
-												Convert to Quarto book (#1026)

* Use _quarto.yml
* Rmd -> qmd
* Update build type to None
* Update styling Styling to get close to bs4_book + color
* Convert crossrefs
* Covert chunk options
* Switch to plausible for analytics
* Update action
											
										
										
											2022-05-14 04:46:49 +08:00
+								#| tbl-cap: Data from the students.csv file as a table.
-												UK -> US spelling, multi-line alt text, YAML chunk opts

											
										
										
											2022-05-08 13:32:25 +08:00
-												Convert from %>% to |>

											
										
										
											2022-02-24 03:15:52 +08:00
+								read_csv("data/students.csv") |>
-												Convert to Quarto book (#1026)

* Use _quarto.yml
* Rmd -> qmd
* Update build type to None
* Update styling Styling to get close to bs4_book + color
* Convert crossrefs
* Covert chunk options
* Switch to plausible for analytics
* Update action
											
										
										
											2022-05-14 04:46:49 +08:00
+								  knitr::kable()
-												Draft/outline of spreadsheets (#949)

* Add draft/outline of spreadsheets

* Finish reading from Excel

* Write to Excel + consistency edits

* Add pkgs for writing Excel files

* Release the 🐧s

* Reogranize to highlight tibble/data.frame diffs

* Write again

* Add spreadsheets reference + rename_with janitor

* Move janitor to TO DO in rectangular data

* Need to load tidyverse

* Show csv file

* Add TO DO note

* Use students instead of challenge file in this chp

* Quiet down some of the redundant messages

* Add bit on reading in multiple files

* Fix up students example

* Revert back to old dataset to match spreadsheets chapter

* Fix typos

* Use US spelling

* Comments from Hadley

* Incorporate Hadley's suggestions

Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
											
										
										
											2021-10-27 03:00:33 +08:00
+								```
-												Data import proofing

											
										
										
											2016-08-12 21:09:18 +08:00
+								The first argument to `read_csv()` is the most important: it's the path to the file to read.
-												Start writing about readr

											
										
										
											2015-09-23 02:35:39 +08:00
-												UK -> US spelling, multi-line alt text, YAML chunk opts

											
										
										
											2022-05-08 13:32:25 +08:00
+								```{r}
 								#| message: true
-												Draft/outline of spreadsheets (#949)

* Add draft/outline of spreadsheets

* Finish reading from Excel

* Write to Excel + consistency edits

* Add pkgs for writing Excel files

* Release the 🐧s

* Reogranize to highlight tibble/data.frame diffs

* Write again

* Add spreadsheets reference + rename_with janitor

* Move janitor to TO DO in rectangular data

* Need to load tidyverse

* Show csv file

* Add TO DO note

* Use students instead of challenge file in this chp

* Quiet down some of the redundant messages

* Add bit on reading in multiple files

* Fix up students example

* Revert back to old dataset to match spreadsheets chapter

* Fix typos

* Use US spelling

* Comments from Hadley

* Incorporate Hadley's suggestions

Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
											
										
										
											2021-10-27 03:00:33 +08:00
+								students <- read_csv("data/students.csv")
-												Rough out import chapter

											
										
										
											2016-07-07 02:59:50 +08:00
+								```
-												Start writing about readr

											
										
										
											2015-09-23 02:35:39 +08:00
-												Draft/outline of spreadsheets (#949)

* Add draft/outline of spreadsheets

* Finish reading from Excel

* Write to Excel + consistency edits

* Add pkgs for writing Excel files

* Release the 🐧s

* Reogranize to highlight tibble/data.frame diffs

* Write again

* Add spreadsheets reference + rename_with janitor

* Move janitor to TO DO in rectangular data

* Need to load tidyverse

* Show csv file

* Add TO DO note

* Use students instead of challenge file in this chp

* Quiet down some of the redundant messages

* Add bit on reading in multiple files

* Fix up students example

* Revert back to old dataset to match spreadsheets chapter

* Fix typos

* Use US spelling

* Comments from Hadley

* Incorporate Hadley's suggestions

Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
											
										
										
											2021-10-27 03:00:33 +08:00
+								When you run `read_csv()` it prints out a message that tells you how many rows (excluding the header row) and columns the data has along with the delimiter used, and the column specifications (names of columns organized by the type of data the column contains).
 								It also prints out some information about how to retrieve the full column specification as well as how to quiet this message.
-												Convert to Quarto book (#1026)

* Use _quarto.yml
* Rmd -> qmd
* Update build type to None
* Update styling Styling to get close to bs4_book + color
* Convert crossrefs
* Covert chunk options
* Switch to plausible for analytics
* Update action
											
										
										
											2022-05-14 04:46:49 +08:00
+								This message is an important part of readr, which we'll come back to in @sec-parsing-a-file on parsing a file.
-												Start writing about readr

											
										
										
											2015-09-23 02:35:39 +08:00
-												Use the visual editor (#923)

* Sentence wrap + visual editor Rmd style updates

* Move files not called in bookdown.yml to extra

* UK spelling + canonical source + sentence wrap

* Rename .rmd -> .Rmd

* Sentence wrap + visual editor Rmd style

* Fix capitalization
											
										
										
											2021-02-21 23:40:40 +08:00
+								You can also supply an inline csv file.
 								This is useful for experimenting with readr and for creating reproducible examples to share with others:
-												Start writing about readr

											
										
										
											2015-09-23 02:35:39 +08:00
-												UK -> US spelling, multi-line alt text, YAML chunk opts

											
										
										
											2022-05-08 13:32:25 +08:00
+								```{r}
 								#| message: false
-												Rough out import chapter

											
										
										
											2016-07-07 02:59:50 +08:00
+								read_csv("a,b,c
 ,2,3
 ,5,6")
 								```
-												Update import.Rmd

typos
											
										
										
In both cases `read_csv()` uses the first line of the data for the column names, which is a very common convention.
There are two cases where you might want to tweak this behavior:
											
										
										
1.  Sometimes there are a few lines of metadata at the top of the file.
    You can use `skip = n` to skip the first `n` lines; or use `comment = "#"` to drop all lines that start with (e.g.) `#`.

    ```{r}
    #| message: false
    read_csv("The first line of metadata
      The second line of metadata
      x,y,z
      1,2,3", skip = 2)

    read_csv("# A comment I want to skip
      x,y,z
      1,2,3", comment = "#")
    ```
											
										
										
2.  The data might not have column names.
    You can use `col_names = FALSE` to tell `read_csv()` not to treat the first row as headings, and instead label the columns sequentially from `X1` to `Xn`:

    ```{r}
    #| message: false
    read_csv("1,2,3\n4,5,6", col_names = FALSE)
    ```

    (`"\n"` is a convenient shortcut for adding a new line. You'll learn more about it and other types of string escape in [Chapter -@sec-strings].)

    Alternatively you can pass `col_names` a character vector which will be used as the column names:

    ```{r}
    #| message: false
    read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
    ```
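The two tweaks above can also be combined.
The following is a minimal sketch, assuming a made-up file that has two metadata lines followed by a header row you would rather replace; the contents of `raw` and the names `x`, `y`, `z` are illustrative only.

```{r}
#| message: false
library(readr)

# Hypothetical inline file: two metadata lines, a header row, two data rows.
raw <- "Exported by a hypothetical tool
Units: arbitrary
col1,col2,col3
1,2,3
4,5,6"

# skip = 3 drops the two metadata lines and the original header row;
# col_names then supplies the replacement column names.
df <- read_csv(raw, skip = 3, col_names = c("x", "y", "z"))
df
```

Skipping the header and supplying your own names is only one option; you could equally keep the original names with `skip = 2` and rename the columns afterwards.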

											
										
										
Another option that commonly needs tweaking is `na`: this specifies the value (or values) that are used to represent missing values in your file:

```{r}
#| message: false
read_csv("a,b,c\n1,2,.", na = ".")
```
											
										
										
This is all you need to know to read \~75% of CSV files that you'll encounter in practice.
You can also easily adapt what you've learned to read tab separated files with `read_tsv()` and fixed width files with `read_fwf()`.
To read in more challenging files, you'll need to learn more about how readr parses each column, turning them into R vectors.
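As a rough illustration of how the same ideas carry over, here is a small sketch using inline data with `read_tsv()` and `read_fwf()`.
The data, the field widths, and the column names `id` and `n` are invented for this example.

```{r}
#| message: false
library(readr)

# Tab-separated inline data; "\t" is the escape for a tab character.
tsv <- read_tsv("a\tb\tc\n1\t2\t3")
tsv

# Fixed-width inline data: the first field is 2 characters wide, the second 3.
# fwf_widths() takes the widths and, optionally, the column names.
fwf <- read_fwf("ab 12\ncd 34", fwf_widths(c(2, 3), c("id", "n")))
fwf
```

The call shape is the same as for `read_csv()`; only the description of how fields are separated changes.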

											
										
										
### First steps

Let's take another look at the `students` data.
In the `favourite.food` column, there are a bunch of food items and then the character string `N/A`, which should have been a real `NA` that R will recognize as "not available".
This is something we can address using the `na` argument.

```{r}
#| message: false
students <- read_csv("data/students.csv", na = c("N/A", ""))
students
```

											
										
										
Once you read data in, the first step usually involves transforming it in some way to make it easier to work with in the rest of your analysis.
For example, the column names in the `students` file we read in are formatted in non-standard ways.
You might consider renaming them one by one with `dplyr::rename()` or you might use the `janitor::clean_names()` function to turn them all into snake case at once.[^data-import-1]
This function takes in a data frame and returns a data frame with variable names converted to snake case.

[^data-import-1]: The [janitor](http://sfirke.github.io/janitor/) package is not part of the tidyverse, but it offers handy functions for data cleaning and works well within data pipelines that use `|>`.

```{r}
#| message: false
library(janitor)

students |>
  clean_names()
```
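For comparison, the one-by-one route with `dplyr::rename()` might look like the sketch below.
It uses a small made-up tibble rather than the `students` file, so the column names here are assumptions chosen only to show the syntax.

```{r}
library(dplyr)

# A hypothetical tibble with awkward names, standing in for a freshly imported file.
df <- tibble(`Student ID` = 1:2, `Full Name` = c("A", "B"))

# rename() takes new_name = old_name pairs;
# backticks quote names that contain spaces.
renamed <- df |>
  rename(student_id = `Student ID`, full_name = `Full Name`)
renamed
```

This gives you full control over each name, at the cost of more typing when there are many columns.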

											
										
										
Another common task after reading in data is to consider variable types.
For example, `meal_plan` is a categorical variable with a known set of possible values.
In R, factors can be used to work with categorical variables.
We can convert this variable to a factor using the `factor()` function.
You'll learn more about factors in [Chapter -@sec-factors].

```{r}
students <- students |>
  clean_names() |>
  mutate(meal_plan = factor(meal_plan))
students
```

Note that the values in the `meal_plan` variable have stayed exactly the same, but the type of variable denoted underneath the variable name has changed from character (`<chr>`) to factor (`<fct>`).
Before you move on to analyzing these data, you'll probably want to fix the `age` column as well: currently it's a character variable because of the one observation that is typed out as `five` instead of a numeric `5`.
We discuss how to fix this issue in [Chapter -@sec-import-spreadsheets].
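In the meantime, here is one possible way to repair a column like `age`, shown on a small made-up tibble rather than the real `students` data.
This is only a sketch of the idea (recode the spelled-out value, then parse the text as numbers) and is not necessarily the approach taken in the later chapter.

```{r}
library(dplyr)
library(readr)

# Hypothetical stand-in for the `age` column: one value is spelled out.
ages <- tibble(age = c("4", "five", "7"))

fixed <- ages |>
  mutate(
    age = if_else(age == "five", "5", age), # recode the spelled-out value
    age = parse_number(age)                 # then convert the text to numbers
  )
fixed
```

After this step `age` is a numeric column, so it can be used in calculations and summaries.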
											
										
										
### Compared to base R

If you've used R before, you might wonder why we're not using `read.csv()`.
There are a few good reasons to favor readr functions over the base equivalents:
											
										
										
-   They are typically much faster (\~10x) than their base equivalents.
    Long running jobs have a progress bar, so you can see what's happening.
    If you're looking for raw speed, try `data.table::fread()`.
    It doesn't fit quite so well into the tidyverse, but it can be quite a bit faster.

-   They produce tibbles, and they don't use row names or munge the column names.
    These are common sources of frustration with the base R functions.

-   They are more reproducible.
    Base R functions inherit some behavior from your operating system and environment variables, so import code that works on your computer might not work on someone else's.
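The point about column names is easy to see with a small inline example.
The sketch below uses made-up data whose column names contain a space and start with a digit, and compares what each function does with them under its default settings.

```{r}
#| message: false
library(readr)

csv_text <- "first name,2020 sales\nAda,10"

# Base R rewrites names into syntactic ones by default (check.names = TRUE).
base_names <- names(read.csv(text = csv_text))
base_names

# readr keeps the names exactly as they appear in the file.
readr_names <- names(read_csv(csv_text))
readr_names
```

With readr you refer to such columns using backticks, for example `` `first name` ``, instead of guessing how the names were altered.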
											
										
										
### Exercises

1.  What function would you use to read a file where fields were separated with "\|"?
2.  Apart from `file`, `skip`, and `comment`, what other arguments do `read_csv()` and `read_tsv()` have in common?
3.  What are the most important arguments to `read_fwf()`?
4.  Sometimes strings in a CSV file contain commas.
    To prevent them from causing problems they need to be surrounded by a quoting character, like `"` or `'`. By default, `read_csv()` assumes that the quoting character will be `"`.
    What argument to `read_csv()` do you need to specify to read the following text into a data frame?

    ```{r}
    #| eval: false
    "x,y\n1,'a,b'"
    ```

5.  Identify what is wrong with each of the following inline CSV files.
    What happens when you run the code?

    ```{r}
    #| eval: false
    read_csv("a,b\n1,2,3\n4,5,6")
    read_csv("a,b,c\n1,2\n1,2,3,4")
    read_csv("a,b\n\"1")
    read_csv("a,b\n1,2\na,b")
    read_csv("a;b\n1;3")
    ```

											
										
										
## Reading data from multiple files

Sometimes your data is split across multiple files instead of being contained in a single file.
For example, you might have sales data for multiple months, with each month's data in a separate file: `01-sales.csv` for January, `02-sales.csv` for February, and `03-sales.csv` for March.
With `read_csv()` you can read these data in at once and stack them on top of each other in a single data frame.

```{r}
sales_files <- c("data/01-sales.csv", "data/02-sales.csv", "data/03-sales.csv")
read_csv(sales_files, id = "file")
```

											
										
										
With the additional `id` argument we have added a new column called `file` to the resulting data frame that identifies the file the data come from.
This is especially helpful in circumstances where the files you're reading in do not have an identifying column that can help you trace the observations back to their original sources.
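If you don't have the sales files at hand, the following self-contained sketch shows the same behavior.
It assumes a version of readr that accepts a vector of paths (as used above), writes two small made-up files to a temporary directory, and reads them back with `id = "file"`; the file contents and the columns `item` and `n` are invented for illustration.

```{r}
#| message: false
library(readr)

# Write two small, made-up monthly files to a temporary directory.
tmp <- tempfile("sales-")
dir.create(tmp)
write_csv(data.frame(item = c("a", "b"), n = c(1, 2)), file.path(tmp, "01-sales.csv"))
write_csv(data.frame(item = "c", n = 3), file.path(tmp, "02-sales.csv"))

# Read both at once; id = "file" records the source path of every row.
paths <- file.path(tmp, c("01-sales.csv", "02-sales.csv"))
combined <- read_csv(paths, id = "file")
combined
```

Each row of `combined` carries the path it was read from, so you can later group or filter by source file.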

											
										
										
											2016-07-07 02:59:50 +08:00
-												Draft/outline of spreadsheets (#949)

* Add draft/outline of spreadsheets

* Finish reading from Excel

* Write to Excel + consistency edits

* Add pkgs for writing Excel files

* Release the 🐧s

* Reogranize to highlight tibble/data.frame diffs

* Write again

* Add spreadsheets reference + rename_with janitor

* Move janitor to TO DO in rectangular data

* Need to load tidyverse

* Show csv file

* Add TO DO note

* Use students instead of challenge file in this chp

* Quiet down some of the redundant messages

* Add bit on reading in multiple files

* Fix up students example

* Revert back to old dataset to match spreadsheets chapter

* Fix typos

* Use US spelling

* Comments from Hadley

* Incorporate Hadley's suggestions

Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
											
										
										
											2021-10-27 03:00:33 +08:00

If you have many files you want to read in, it can get cumbersome to write out their names as a list.
Instead, you can use the `dir_ls()` function from the [fs](https://fs.r-lib.org/) package to find the files for you by matching a pattern in the file names.

```{r}
library(fs)
sales_files <- dir_ls("data", glob = "*sales.csv")
sales_files
```
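
As a rough sketch of how the matched paths might then be used, the example below writes a few small files to a temporary directory, finds the sales files with `dir_ls()`, and passes the result to `read_csv()` with `id = "file"`. The directory, file names, column names, and values are all hypothetical and exist only for illustration.

```r
library(readr)
library(fs)

# Hypothetical setup: a temporary directory holding two tiny sales files
demo_dir <- file.path(tempdir(), "sales-demo")
dir_create(demo_dir)
write_csv(data.frame(month = "January", n = 1), file.path(demo_dir, "01-sales.csv"))
write_csv(data.frame(month = "February", n = 2), file.path(demo_dir, "02-sales.csv"))

# A file that should not match the pattern
write_csv(data.frame(x = 1), file.path(demo_dir, "notes.csv"))

# glob = "*sales.csv" keeps only the paths that end in "sales.csv"
demo_files <- dir_ls(demo_dir, glob = "*sales.csv")

# Reading a vector of paths stacks the files; `id` records each row's source path
demo_sales <- read_csv(demo_files, id = "file", show_col_types = FALSE)
demo_sales
```

Because the pattern is matched against the file names on disk, adding a new file that ends in `sales.csv` to the directory would be picked up the next time the code runs, without editing a hand-written list of names.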

## Writing to a file {#sec-writing-to-a-file}

readr also comes with two useful functions for writing data back to disk: `write_csv()` and `write_tsv()`.
Both functions increase the chances of the output file being read back in correctly by:

-   Always encoding strings in UTF-8.

-   Saving dates and date-times in ISO8601 format so they are easily parsed elsewhere.

If you want to export a csv file to Excel, use `write_excel_csv()`, which writes a special character (a "byte order mark") at the start of the file to tell Excel that you're using the UTF-8 encoding.
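
One way to see the byte order mark is to compare the first few bytes of a file written by `write_csv()` with one written by `write_excel_csv()`. This is only a sketch: the data frame `demo` and the temporary file paths are hypothetical.

```r
library(readr)

# Hypothetical one-column data frame containing non-ASCII characters
demo <- data.frame(name = "Güvenç")

plain_path <- tempfile(fileext = ".csv")
excel_path <- tempfile(fileext = ".csv")
write_csv(demo, plain_path)
write_excel_csv(demo, excel_path)

# Inspect the first three bytes of each file as raw values
plain_start <- readBin(plain_path, what = "raw", n = 3)
excel_start <- readBin(excel_path, what = "raw", n = 3)
plain_start
excel_start
```

The UTF-8 byte order mark is the byte sequence `ef bb bf`, so if `excel_start` shows those three bytes while `plain_start` shows the first characters of the header, the mark was written as described.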

The most important arguments are `x` (the data frame to save), and `file` (the location to save it).
You can also specify how missing values are written with `na`, and if you want to `append` to an existing file.
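
As a small illustration of `na` and `append`, the sketch below writes one data frame with missing values stored as empty fields and then appends a second data frame to the same file. The data frames `scores` and `more_scores` and the temporary path are hypothetical.

```r
library(readr)

# Hypothetical data with one missing value
scores <- data.frame(id = 1:2, score = c(10, NA))
more_scores <- data.frame(id = 3L, score = 7)

out_path <- tempfile(fileext = ".csv")

# Write missing values as empty fields instead of the default "NA"
write_csv(scores, out_path, na = "")

# Add rows to the end of the existing file; with append = TRUE the
# column names are not written again by default
write_csv(more_scores, out_path, append = TRUE)

# Look at the raw lines of the resulting file
written_lines <- read_lines(out_path)
written_lines
```

Reading the file with `read_lines()` rather than `read_csv()` shows the text exactly as it was written, which makes it easier to check how missing values and appended rows ended up on disk.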

```{r}
#| eval: false
write_csv(students, "students.csv")
```

Now let's read that csv file back in.
Note that the type information is lost when you save to csv:

```{r}
#| warning: false
#| message: false
students
write_csv(students, "students-2.csv")
read_csv("students-2.csv")
```
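
To make the cost of the lost type information concrete, here is a minimal sketch of restoring a factor column by supplying a column specification when reading the file back. It uses a small hypothetical data frame, `meals`, rather than `students`, so that it runs on its own.

```r
library(readr)

# Hypothetical data frame with a factor column
meals <- data.frame(
  student_id = 1:3,
  meal_plan = factor(c("Lunch only", "Breakfast and lunch", "Lunch only"))
)

meals_path <- tempfile(fileext = ".csv")
write_csv(meals, meals_path)

# Without a specification, readr guesses the column types from the text
meals_guessed <- read_csv(meals_path, show_col_types = FALSE)
class(meals_guessed$meal_plan)

# Supplying col_types asks readr to rebuild the intended types on import
meals_typed <- read_csv(
  meals_path,
  col_types = cols(student_id = col_integer(), meal_plan = col_factor())
)
class(meals_typed$meal_plan)
```

The specification lives in the reading code, not in the file, so it has to be repeated wherever the csv is loaded.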

This makes CSVs a little unreliable for caching interim results: you need to recreate the column specification every time you load the file.
There are two alternatives:

1.  `write_rds()` and `read_rds()` are uniform wrappers around the base functions `saveRDS()` and `readRDS()`.
    These store data in R's custom binary format called RDS:

    ```{r}
    write_rds(students, "students.rds")
    read_rds("students.rds")
    ```

2.  The feather package implements a fast binary file format that can be shared across programming languages:

    ```{r}
    #| eval: false
    library(feather)
    write_feather(students, "students.feather")
    read_feather("students.feather")
    #> # A tibble: 6 × 5
    #>   student_id full_name        favourite_food     meal_plan             age
    #>        <dbl> <chr>            <chr>              <fct>               <dbl>
    #> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
    #> 2          2 Barclay Lynn     French fries       Lunch only              5
    #> 3          3 Jayendra Lyne    NA                 Breakfast and lunch     7
    #> 4          4 Leon Rossini     Anchovies          Lunch only             NA
    #> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
    #> 6          6 Güvenç Attila    Ice cream          Lunch only              6
    ```

Feather tends to be faster than RDS and is usable outside of R.
RDS supports list-columns (which you'll learn about in @sec-rectangling); feather currently does not.
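
As a brief sketch of the list-column point, the example below round-trips a tibble with a list-column through RDS. The tibble `nested` and the temporary path are hypothetical.

```r
library(readr)
library(tibble)

# Hypothetical tibble whose `values` column is a list-column:
# each element is a vector of a different type and length
nested <- tibble(
  id = 1:2,
  values = list(c(1, 2, 3), c("a", "b"))
)

rds_path <- tempfile(fileext = ".rds")
write_rds(nested, rds_path)
restored <- read_rds(rds_path)

# Compare the restored object with the original
identical(nested, restored)
```

Because RDS stores R objects directly, the restored list-column should keep the type and length of each element, which a plain-text format like csv cannot represent.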

```{r}
#| include: false
file.remove("students-2.csv")
file.remove("students.rds")
```