Some more things we don't cover

hadley 2016-06-17 13:14:18 -05:00
parent 06d1aa6496
commit d289a41a55
1 changed files with 8 additions and 1 deletions


@ -60,7 +60,7 @@ Within each chapter, we try and stick to a similar pattern: start with some moti
There are some important topics that this book doesn't cover. We believe it's important to stay ruthlessly focused on the essentials so you can get up and running as quickly as possible. That means this book can't cover every important topic.
### Big data
### Big n data (many observations)
This book proudly focuses on small, in-memory datasets. This is the right place to start because you can't tackle big data unless you have experience with small data. The tools you learn in this book will easily handle hundreds of megabytes of data, and with a little care you can typically use them to work with 1-2 GB of data. If you're routinely working with larger data (10-100 GB, say), you should learn more about [data.table](https://github.com/Rdatatable/data.table). We don't teach data.table here because its very concise interface offers fewer linguistic cues, which makes it harder to learn. But if you're working with large data, the performance payoff is worth the extra effort required to learn it.
@ -68,6 +68,8 @@ Many big data problems are often small data problems in disguise. Often your com
Another class of big data problem consists of many small data problems. Each individual problem might fit in memory, but you have millions of them. For example, you might want to fit a model to each person in your dataset. That would be trivial if you had just 10 or 100 people, but instead you have a million. Fortunately each problem is independent (sometimes called embarrassingly parallel), so you just need a system (like Hadoop) that allows you to send different datasets to different computers for processing. Once you've figured out how to answer the question for a single subset using the tools described in this book, you can use packages like SparkR, rhipe, and ddr to solve it for the complete dataset.
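To make the "solve it for one subset first" idea concrete, here is a minimal sketch in base R. It assumes a small simulated dataset and a hypothetical helper, `fit_one()`, neither of which comes from the book; it runs everything on one machine rather than on a distributed system.

```r
# Simulated data (an assumption for illustration): 5 people, 20 observations each.
set.seed(1)
people <- data.frame(
  person = rep(paste0("p", 1:5), each = 20),
  x = rnorm(100)
)
people$y <- 2 * people$x + rnorm(100, sd = 0.1)

# Step 1: answer the question for a single subset.
# fit_one() is a hypothetical helper that returns the fitted slope for one person.
fit_one <- function(df) {
  coef(lm(y ~ x, data = df))[["x"]]
}

# Step 2: split the data into one piece per person and apply the same
# function to every piece. No piece depends on any other.
by_person <- split(people, people$person)
slopes <- vapply(by_person, fit_one, numeric(1))
slopes
```

Because each call to `fit_one()` is independent, the `vapply()` step is the part that a distributed system would take over; the function that handles a single subset would likely need little or no change.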
### Big p data (many variables)
### Python
In this book, you won't learn anything about Python, Julia, or any other programming language useful for data science. This isn't because we think these tools are bad. They're not! And in practice, most data science teams use a mix of languages, often at least R and Python.
@ -78,6 +80,11 @@ However, we strongly believe that it's best to master one tool at a time. You wi
This book focuses exclusively on structured data sets: collections of values that are each associated with a variable and an observation. There are lots of data sets that do not naturally fit in this paradigm: images, sounds, trees, text. But data frames are extremely common in science and in industry and we believe that they're a great place to start your data analysis journey.
### Inference
Exploratory vs. confirmatory
### Formal Statistics and Machine Learning
This book focuses on practical tools for understanding your data: visualization, modelling, and transformation. You can develop your understanding further by learning probability theory, statistical hypothesis testing, and machine learning methods; but we won't teach you those things here. There are many books that cover these topics, but few that integrate the other parts of the data science process. When you are ready, you can and should read books devoted to each of these topics. We recommend *Statistical Modeling: A Fresh Approach* by Danny Kaplan; *An Introduction to Statistical Learning* by James, Witten, Hastie, and Tibshirani; and *Applied Predictive Modeling* by Kuhn and Johnson.