Data science is an exciting discipline that allows you to transform raw data into understanding, insight, and knowledge.
The goal of "R for Data Science" is to help you learn the most important tools in R that will allow you to do data science efficiently and reproducibly.
Data science is a huge field, and there's no way you can master it all by reading a single book.
The goal of this book is to give you a solid foundation in the most important tools, and enough knowledge to find the resources to learn more when necessary.
This typically means that you take data stored in a file, database, or web application programming interface (API), and load it into a data frame in R.
If you can't get your data into R, you can't do data science on it!
Tidy data is important because the consistent structure lets you focus your efforts on answering questions about the data, not fighting to get the data into the right form for different functions.
Transformation includes narrowing in on observations of interest (like all people in one city, or all data from the last year), creating new variables that are functions of existing variables (like computing speed from distance and time), and calculating a set of summary statistics (like counts or means).
Together, tidying and transforming are called **wrangling**, because getting your data in a form that's natural to work with often feels like a fight!
Programming is a cross-cutting tool that you use in nearly every part of a data science project.
You don't need to be an expert programmer to be a successful data scientist, but learning more about programming pays off, because becoming a better programmer allows you to automate common tasks, and solve new problems with greater ease.
You'll use these tools in every data science project, but for most projects they're not enough.
There's a rough 80-20 rule at play; you can tackle about 80% of every project using the tools that you'll learn in this book, but you'll need other tools to tackle the remaining 20%.
The previous description of the tools of data science is organised roughly according to the order in which you use them in an analysis (although of course you'll iterate through them multiple times).
In our experience, however, learning data ingest and tidying first is sub-optimal, because 80% of the time it's routine and boring, and the other 20% of the time it's weird and frustrating.
Within each chapter, we try and adhere to a similar pattern: start with some motivating examples so you can see the bigger picture, and then dive into the details.
To learn more about modeling, we highly recommend [Tidy Modeling with R](https://www.tmwr.org), by our colleagues Max Kuhn and Julia Silge.
This book will teach you the tidymodels family of packages, which, as you might guess from the name, share many conventions with the tidyverse packages we use in this book.
The tools you learn in this book will easily handle hundreds of megabytes of data, and with a little care, you can typically use them to work with 1-2 Gb of data.
This would be trivial if you had just 10 or 100 people, but instead you have a million.
Fortunately, each problem is independent of the others (a setup that is sometimes called embarrassingly parallel), so you just need a system (like [Hadoop](https://hadoop.apache.org/) or [Spark](https://spark.apache.org/)) that allows you to send different datasets to different computers for processing.
Once you've figured out how to answer your question for a single subset using the tools described in this book, you can learn new tools like **sparklyr** to solve it for the full dataset.
However, we strongly believe that it's best to master one tool at a time.
You will get better faster if you dive deep, rather than spreading yourself thinly over many topics.
This doesn't mean you should only know one thing, just that you'll generally learn faster if you stick to one thing at a time.
You should strive to learn new things throughout your career, but make sure your understanding is solid before you move on to the next interesting thing.
To support interaction, R is a much more flexible language than many of its peers.
This flexibility comes with its downsides, but the big upside is how easy it is to evolve tailored grammars for specific parts of the data science process.
These mini languages help you think about problems as a data scientist, while supporting fluent interaction between your brain and the computer.
If you've never programmed before, you might find [Hands on Programming with R](https://rstudio-education.github.io/hopr/) by Garrett to be a useful adjunct to this book.
There are four things you need to run the code in this book: R, RStudio, a collection of R packages called the **tidyverse**, and a handful of other packages.
Packages are the fundamental units of reproducible R code.
They include reusable functions, the documentation that describes how to use them, and sample data.
To download R, go to CRAN, the **c**omprehensive **R** **a**rchive **n**etwork.
CRAN is composed of a set of mirror servers distributed around the world and is used to distribute R and R packages.
Don't try and pick a mirror that's close to you: instead use the cloud mirror, <https://cloud.r-project.org>, which automatically figures it out for you.
Upgrading can be a bit of a hassle, especially for major versions, which require you to re-install all your packages, but putting it off only makes it worse.
On your own computer, type that line of code in the console, and then press enter to run it.
R will download the packages from CRAN and install them on to your computer.
If you have problems installing, make sure that you are connected to the internet, and that <https://cloud.r-project.org/> isn't blocked by your firewall or proxy.
There are many other excellent packages that are not part of the tidyverse, because they solve problems in a different domain, or are designed with a different set of underlying principles.
This doesn't make them better or worse, just different.
In other words, the complement to the tidyverse is not the messyverse, but many other universes of interrelated packages.
As you tackle more data science projects with R, you'll learn new packages and new ways of thinking about data.
In your console, you type after the `>`, called the **prompt**; we don't show the prompt in the book.
In the book, output is commented out with `#>`; in your console it appears directly after your code.
These two differences mean that if you're working with an electronic version of the book, you can easily copy code out of the book and into the console.
- Sometimes, to make it clear which package an object comes from, we'll use we'll use the package name followed by two colons, like `dplyr::mutate()`, or\
This book isn't just the product of Hadley, Mine, and Garrett, but is the result of many conversations (in person and online) that we've had with many people in the R community.
There are a few people we'd like to thank in particular, because they have spent many hours answering our questions and helping us to better think about data science: