19 Books Every Aspiring (Or Experienced) Data Scientist Should Read

Whether you’re an aspiring data scientist or a seasoned professional, these are 19 books that will help you improve your skills in machine learning, data analysis, visualization, statistics, and more. I’ve read and can highly recommend them all.

I’ve split this data science reading list into a few categories, but some books naturally belong in several groups (R for Data Science, for example, is great for learning both data visualization and analysis). And just so you know, I may collect a commission from Amazon for any books you purchase using these links.

Machine Learning (Theoretical)

1. Introduction to Statistical Learning – James, Witten, Hastie, and Tibshirani - $21.72

A good first book on machine learning. It shows how the most popular machine learning algorithms work and teaches a proper workflow for training and evaluating models (e.g. train/test splits, cross-validation, picking a loss function). It gives the reader an intuitive grasp of what is going on inside the “black box”, but it sits a little too far on the qualitative side if you hope to gain a full understanding. For a deeper dive, see the advanced version, Elements of Statistical Learning.

2. Elements of Statistical Learning – Hastie, Tibshirani, and Friedman [HIGHLY RECOMMEND] - $30.60

Similar to Introduction to Statistical Learning, but much more mathematically dense. This is my favorite reference book for machine learning theory. Topics include generalized linear models, additive models, bagging, boosting, tree-fitting algorithms, random forests, gradient boosting, and much more. It’s worth reading several times.

3. Learning from Data – Abu-Mostafa, Magdon-Ismail, and Lin - $45.00

Another machine learning book that focuses on theory. It won’t show you how to train your own models, but it will help you understand why models work and what guarantees we can make about learning and generalization. It is less focused on specific ML algorithms and more focused on the properties of learning itself.

4. Deep Learning – Goodfellow, Bengio and Courville [HIGHLY RECOMMEND] - $25.01

A balance of intuition, applicability, and theory that this field has been lacking. Begins with the nuts and bolts of feedforward networks, and then goes into depth about the state of the art in model regularization, optimization, and various model classes and architectures. Filled with useful tips and tricks for implementing models. This is the best book out there right now for learning how deep learning works.

Machine Learning (Practical / Applied)

5. Deep Learning with Python – Francois Chollet - $24.48

A handy reference for Keras. This book is helpful for bridging the gap between beginner deep learning tutorials and more advanced / state-of-the-art methods. It’s not the best for learning theory, but will help you to implement what you read in papers.

Working with Data (Python and R Programming)

6. R for Data Science – Wickham and Grolemund [HIGHLY RECOMMEND] - $18.17

An absurdly useful book for learning how to manipulate data with R and the Tidyverse (dplyr, ggplot2, forcats, etc.). I read this once when I was first learning R and again after a few years of experience, and I learned new things each time. This book will make anyone better at data analysis, visualization, manipulation, and cleaning.

7. Advanced R – Hadley Wickham - $53.73

This book will teach you how the R language works on a much lower, more technical level. It’s a useful book for helping advanced users write more performant code. It’s also useful for people learning R whose background is primarily in other languages, as it will help to draw parallels between R and other languages.

8. Analyzing Baseball Data with R – Albert and Marchi - $49.66

This book focuses on baseball data specifically, but it is full of data analysis and visualization examples in R. They show how to munge real-world data to answer questions and understand rich data sets (in this case, baseball data).

9. Python for Data Analysis – Wes McKinney - $23.09

This is a great book for learning the ins and outs of Pandas. It will teach you to clean, aggregate, transform, and visualize data in Python using Pandas dataframes. It’s Python’s closest equivalent to R for Data Science.

Data Visualization

10. The Visual Display of Quantitative Information – Edward Tufte - $32.95

A guide to communicating what your data have to say with clarity, precision, and efficiency. Its pretty graphics also make it a great coffee table book.

Econometrics / Applied Statistics

11. Mostly Harmless Econometrics: An Empiricist’s Companion – Angrist and Pischke [HIGHLY RECOMMEND] - $26.28

A handbook on advanced econometrics. Useful for brushing up on linear models (simple and multiple linear regression) and experiment design (instrumental variables, difference-in-differences models, answering causal questions).

12. Mastering Metrics: The Path from Cause to Effect – Angrist and Pischke - $27.67

This book is very similar to Mostly Harmless Econometrics, but more beginner-friendly. Get this one instead if you’re learning econometrics for the first time.

Experiment Design

13. Bit by Bit: Social Research in the Digital Age – Matthew Salganik - $27.73

This book felt like a greatest hits compilation of all the most useful and exciting things I learned about experiment design as an undergrad. It’s the best book I’ve found to date for marrying the strengths of old-school statisticians and newer-school data scientists.

Miscellaneous (lighter reads)

14. The Signal and the Noise – Nate Silver - $11.28

What is data science? How can it be used to solve real-world problems?

15. The Book of Why – Judea Pearl - $21.23

Learn the basics of causality.

16. The Information – James Gleick - $13.98

A short, digestible history of and introduction to information theory. It won’t make you an expert, but you’ll get the main ideas.

17. The Book: Playing the Percentages in Baseball – Tango, Lichtman and Dolphin - $19.95

Learn how data science has revolutionized the game of baseball, with plenty of applied examples using real data.

18. Superforecasting: The Art and Science of Prediction – Philip Tetlock and Dan Gardner - $13.11

What makes a good forecast? See how a mix of quantitative and qualitative techniques can be combined to see the future.

19. Chasing Perfection – Andy Glockner - $12.99

A mostly-qualitative run through the current state of basketball analytics, detailing recent phenomena such as the decline of the mid-range jumper, tanking for draft picks, and the specialized medical analyses being used to ensure player longevity.