More intro polishing

hadley 2015-12-07 17:53:43 -06:00
parent 4db8b3efbc
commit d7d15cad7b
1 changed files with 41 additions and 32 deletions

@@ -11,13 +11,11 @@ install.packages <- function(...) invisible()
# Welcome
Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge. The goal of "R for Data Science" is to introduce you to the most important tools that you need to do data science in R. After reading this book, you'll have the tools to tackle a wide variety of data science challenges, using the best parts of R.
Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge. The goal of "R for Data Science" is to introduce you to the most important tools in R that you need to do data science. After reading this book, you'll have the tools to tackle a wide variety of data science challenges, using the best parts of R.
## What you will learn
Data science is a huge field, and there's no way you can master it by reading a single book. The goal of this book is to give you a solid foundation in the most important tools. These are the tools that, in our experience, people use every day. There's definitely an 80-20 rule at play: you'll do 80% of every project using this handful of tools, but the remaining 20% is much more variable. Our goal is to teach you that 80% and to point you to where you can learn more.
We think about data science as using six main tools:
Data science is a huge field, and there's no way you can master it by reading a single book. The goal of this book is to give you a solid foundation with the most important tools. Our model of the tools needed in a typical data science project looks something like this:
`r bookdown::embed_png("diagrams/data-science.png")`
@@ -35,41 +33,34 @@ __Models__ are the complementary tools to visualisation. Models are a fundamenta
It doesn't matter how well models and visualisation have led you to understand the data, unless you can __communicate__ your results to other people. Communication is an absolutely critical part of any data analysis project.
There's one important toolset that's not shown in the diagram: programming. Programming is a cross-cutting tool that you use in every part of the project. You don't need to be an expert programmer to be a data scientist, but learning more about programming often pays off. Becoming a better programmer will allow you to automate common tasks, and solve new problems with greater ease.
You'll use these tools in every data science project, but for most projects they're not enough. There's a rough 80-20 rule at play: you can probably tackle 80% of every project using the tools we'll teach you, but you'll need more to tackle the remaining 20%. Throughout this book we'll point you to resources where you can learn more.
## How you will learn
Above, I've listed the components of the data science process in roughly the order you'll encounter them in an analysis (although of course you'll iterate multiple times). In our experience, however, this is not the best way to learn them:
The above description of the tools of data science was organised roughly around the order in which you use them in analysis (although of course you'll iterate through them multiple times). In our experience, however, this is not the best way to learn them:
* Starting with data ingest is boring. It's much more interesting to learn
some new visualisation and manipulation tools on data that's already been
imported and cleaned. You'll later learn the skills to apply these new ideas
to your own data.
* You need to learn some cross-cutting tools that help throughout: programming and the RStudio
IDE.
* Starting with data ingest is boring. 80% of the time it's routine and boring,
and the other 20% of the time it's horrendously frustrating. That's not a
good place to start learning! Instead, we'll start with visualisation and
transformation on data that's already been imported and tidied. That will
keep your motivation high so that when you tackle the frustrating parts of
getting your data into R for your own projects, you'll know the payoff and be
motivated to stick with it through the pain.
* Some topics, like modelling, are best explained with other tools, like
visualisation and manipulation. These topics need to come later in the book.
visualisation, transformation and tidying.
* Programming tools are not necessarily interesting in their own right,
but do allow you to tackle considerably more challenging problems. We'll
give you a selection of programming tools in the middle of the book, and
then show you how they synergise with the key data science tools to tackle
interesting problems.
We've honed this order based on our experience teaching live classes, and it's been carefully designed to keep you motivated. We try and stick to a similar pattern within each chapter: give some bigger motivating examples so you can see the bigger picture, and then dive into the details.
We've polished this order based on our experience teaching live classes, and it's been carefully designed to keep you motivated.
Each section of the book also comes with exercises to help you practice what you've learned. It's tempting to skip these, but there's no better way to learn than practicing. If you were taking a class with either of us, we'd force you to do them by making them homework. (Sometimes I feel like teaching is the art of tricking people to do what's in their own best interests.)
## Talking about data science
Throughout the book, we will discuss the principles of data that will help you become a better scientist. That begins here. We will refer to the terms below throughout the book because they are so useful.
* A _variable_ is a quantity, quality, or property that you can measure.
* A _value_ is the state of a variable when you measure it. The value of a
variable may change from measurement to measurement.
* An _observation_ is a set of measurements you make under similar conditions
(usually all at the same time or on the same object). Observations contain
values that you measure on different variables.
These terms will help us speak precisely about the different parts of a data set. They will also provide a system for turning data into insights.
This book focuses exclusively on structured data sets: collections of values that are each associated with a variable and an observation. There is lots of data that doesn't naturally fit in this paradigm: images, sounds, trees, text. But data frames are extremely common in science and in industry and we believe that they're a great place to start your data analysis journey.
Within each chapter, we try and stick to a similar pattern: start with some motivating examples so you can see the bigger picture, and then dive into the details. Each section of the book is paired with exercises to help you practice what you've learned. It's tempting to skip these, but there's no better way to learn than practicing.
## What you won't learn
@@ -119,6 +110,24 @@ However, we strongly believe that it's best to master one tool at a time. You wi
No trees or graphs, images or sounds.
## Talking about data science
Throughout the book, we will discuss the principles of data that will help you become a better scientist. That begins here. We will refer to the terms below throughout the book because they are so useful.
* A _variable_ is a quantity, quality, or property that you can measure.
* A _value_ is the state of a variable when you measure it. The value of a
variable may change from measurement to measurement.
* An _observation_ is a set of measurements you make under similar conditions
(usually all at the same time or on the same object). Observations contain
values that you measure on different variables.
These terms will help us speak precisely about the different parts of a data set. They will also provide a system for turning data into insights.
This book focuses exclusively on structured data sets: collections of values that are each associated with a variable and an observation. There is lots of data that doesn't naturally fit in this paradigm: images, sounds, trees, text. But data frames are extremely common in science and in industry and we believe that they're a great place to start your data analysis journey.
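To make the three terms concrete, here is a minimal sketch of a structured data set in base R. The `heights` data frame, its column names, and its numbers are all invented for illustration. The sketch assumes the common layout in which each column holds one variable and each row holds one observation.

```r
# Hypothetical measurements, invented purely for illustration.
heights <- data.frame(
  person    = c("A", "B", "C"),   # variable: who was measured
  height_cm = c(170, 182, 165),   # variable: height in centimetres
  age       = c(34, 28, 45)       # variable: age in years
)

ncol(heights)         # number of variables (columns)
nrow(heights)         # number of observations (rows)
heights[2, ]          # one observation: every value measured on person "B"
heights$height_cm[2]  # one value: the state of height_cm in that observation
```

Each cell is a single value, identified by the observation (row) and the variable (column) it belongs to, which is the association the paragraph above describes.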
## Prerequisites
We've made few assumptions about what you already know in order to get the most out of this book. You should be generally numerically literate, and it's helpful if you have some programming experience already. If you've never programmed before, you might find [Hands-On Programming with R](http://amzn.com/1449359019) by Garrett to be a useful adjunct to this book.