Made the 'intro' page grammatically correct. (#1452)

* Made the entire page grammatically correct.

* Update intro.qmd

* Update intro.qmd

* Update intro.qmd

* Update intro.qmd

---------

Co-authored-by: Mine Cetinkaya-Rundel <cetinkaya.mine@gmail.com>
SM Raiyyan 2023-05-09 20:49:50 -05:00 committed by GitHub
parent c6f11c6707
commit 87fc313bb0
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
1 changed file with 16 additions and 16 deletions

@@ -39,7 +39,7 @@ If you can't get your data into R, you can't do data science on it!
Once you've imported your data, it is a good idea to **tidy** it.
Tidying your data means storing it in a consistent form that matches the semantics of the dataset with how it is stored.
-In brief, when your data is tidy, each column is a variable, and each row is an observation.
+In brief, when your data is tidy, each column is a variable and each row is an observation.
Tidy data is important because the consistent structure lets you focus your efforts on answering questions about the data, not fighting to get the data into the right form for different functions.
Once you have tidy data, a common next step is to **transform** it.
@@ -56,9 +56,9 @@ Visualizations can surprise you, but they don't scale particularly well because
**Models** are complementary tools to visualization.
Once you have made your questions sufficiently precise, you can use a model to answer them.
-Models are a fundamentally mathematical or computational tool, so they generally scale well.
+Models are fundamentally mathematical or computational tools, so they generally scale well.
Even when they don't, it's usually cheaper to buy more computers than it is to buy more brains!
-But every model makes assumptions, and by its very nature a model cannot question its own assumptions.
+But every model makes assumptions, and by its very nature, a model cannot question its own assumptions.
That means a model cannot fundamentally surprise you.
The last step of data science is **communication**, an absolutely critical part of any data analysis project.
@@ -69,20 +69,20 @@ Programming is a cross-cutting tool that you use in nearly every part of a data
You don't need to be an expert programmer to be a successful data scientist, but learning more about programming pays off because becoming a better programmer allows you to automate common tasks and solve new problems with greater ease.
You'll use these tools in every data science project, but they're not enough for most projects.
-There's a rough 80-20 rule at play; you can tackle about 80% of every project using the tools you'll learn in this book, but you'll need other tools to tackle the remaining 20%.
+There's a rough 80/20 rule at play: you can tackle about 80% of every project using the tools you'll learn in this book, but you'll need other tools to tackle the remaining 20%.
Throughout this book, we'll point you to resources where you can learn more.
## How this book is organized
The previous description of the tools of data science is organized roughly according to the order in which you use them in an analysis (although, of course, you'll iterate through them multiple times).
-In our experience, however, learning data importing and tidying first is sub-optimal because 80% of the time, it's routine and boring, and the other 20% of the time, it's weird and frustrating.
+In our experience, however, learning data importing and tidying first is suboptimal because, 80% of the time, it's routine and boring, and the other 20% of the time, it's weird and frustrating.
That's a bad place to start learning a new subject!
Instead, we'll start with visualization and transformation of data that's already been imported and tidied.
That way, when you ingest and tidy your own data, your motivation will stay high because you know the pain is worth the effort.
-Within each chapter, we try and adhere to a consistent pattern: start with some motivating examples so you can see the bigger picture and then dive into the details.
+Within each chapter, we try to adhere to a consistent pattern: start with some motivating examples so you can see the bigger picture, and then dive into the details.
Each section of the book is paired with exercises to help you practice what you've learned.
-Although it can be tempting to skip the exercises, there's no better way to learn than practicing on real problems.
+Although it can be tempting to skip the exercises, there's no better way to learn than by practicing on real problems.
## What you won't learn
@@ -92,21 +92,21 @@ That means this book can't cover every important topic.
### Modeling
-Modelling is super important for data science, but it's a big topic and unfortunately we just don't have the space to give it the coverage it deserves here.
-To learn more modeling, we highly recommend [Tidy Modeling with R](https://www.tmwr.org) by our colleagues Max Kuhn and Julia Silge.
+Modeling is super important for data science, but it's a big topic, and unfortunately, we just don't have the space to give it the coverage it deserves here.
+To learn more about modeling, we highly recommend [Tidy Modeling with R](https://www.tmwr.org) by our colleagues Max Kuhn and Julia Silge.
This book will teach you the tidymodels family of packages, which, as you might guess from the name, share many conventions with the tidyverse packages we use in this book.
### Big data
This book proudly and primarily focuses on small, in-memory datasets.
This is the right place to start because you can't tackle big data unless you have experience with small data.
-The tools you'll learn throughout the majority of this book will easily handle hundreds of megabytes of data, and with a bit of care, you can typically use them to work a few gigabytes of data.
+The tools you'll learn throughout the majority of this book will easily handle hundreds of megabytes of data, and with a bit of care, you can typically use them to work with a few gigabytes of data.
We'll also show you how to get data out of databases and parquet files, both of which are often used to store big data.
You won't necessarily be able to work with the entire dataset, but that's not a problem because you only need a subset or subsample to answer the question that you're interested in.
-If you're routinely working with larger data (10-100 Gb, say), we recommend learning more about [data.table](https://github.com/Rdatatable/data.table).
-We don't teach it here because it uses a different interface to the tidyverse and requires you to learn some different conventions.
-However, it is incredible faster and the performance payoff is worth investing some time learning it if you're working with large data.
+If you're routinely working with larger data (10–100 GB, say), we recommend learning more about [data.table](https://github.com/Rdatatable/data.table).
+We don't teach it here because it uses a different interface than the tidyverse and requires you to learn some different conventions.
+However, it is incredibly faster, and the performance payoff is worth investing some time in learning it if you're working with large data.
### Python, Julia, and friends
@@ -131,13 +131,13 @@ They include reusable functions, documentation that describes how to use them, a
To download R, go to CRAN, the **c**omprehensive **R** **a**rchive **n**etwork, <https://cloud.r-project.org>.
A new major version of R comes out once a year, and there are 2-3 minor releases each year.
It's a good idea to update regularly.
-Upgrading can be a bit of a hassle, especially for major versions requiring you to re-install all your packages, but putting it off only makes it worse.
+Upgrading can be a bit of a hassle, especially for major versions that require you to re-install all your packages, but putting it off only makes it worse.
We recommend R 4.2.0 or later for this book.
### RStudio
RStudio is an integrated development environment, or IDE, for R programming, which you can download from <https://posit.co/download/rstudio-desktop/>.
-RStudio is updated a couple of times a year, and it will automatically let you know when a new version is out so there's no need to check back.
+RStudio is updated a couple of times a year, and it will automatically let you know when a new version is out, so there's no need to check back.
It's a good idea to upgrade regularly to take advantage of the latest and greatest features.
For this book, make sure you have at least RStudio 2022.02.0.
@@ -194,7 +194,7 @@ You can see if updates are available by running `tidyverse_update()`.
### Other packages
There are many other excellent packages that are not part of the tidyverse because they solve problems in a different domain or are designed with a different set of underlying principles.
-This doesn't make them better or worse, just different.
+This doesn't make them better or worse; it just makes them different.
In other words, the complement to the tidyverse is not the messyverse but many other universes of interrelated packages.
As you tackle more data science projects with R, you'll learn new packages and new ways of thinking about data.