Whether you’re an aspiring data scientist or a seasoned professional, these are 19 books that will help you improve your skills in machine learning, data analysis, visualization, statistics, and more. I’ve read and can highly recommend them all.
I’ve split this data science reading list into a few categories, but some books naturally belong in several groups (R for Data Science, for example, is great for learning both data visualization and analysis). And just so you know, I may collect a commission from Amazon for any books you purchase using these links.
Machine Learning (Theoretical)
A good first book on machine learning. It shows how the most popular machine learning algorithms work, and also teaches a proper workflow for training and evaluating models (e.g. train/test splits, cross-validation, picking a loss function). It lets the reader get an intuitive grasp of what is going on inside the “black box”, but it is a little too far on the qualitative side if you hope to gain a full understanding. For a deeper dive, see the more advanced Elements of Statistical Learning.
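If the workflow vocabulary above is new to you, here is a minimal sketch of a train/test split followed by k-fold cross-validation with a squared-error loss. It is not taken from the book: the helper names, the toy "predict the training mean" model, and the data are all made up for illustration, and it uses only the Python standard library.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and hold out a fraction as the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs; each index lands in exactly one validation fold."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, val_idx
        start += size


def mean_squared_error(y_true, y_pred):
    """The loss function: average squared difference."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_mean(ys):
    """Hypothetical toy model: always predict the mean of the training targets."""
    return sum(ys) / len(ys)


def cross_validate(ys, k=5):
    """Average validation loss across k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(ys), k):
        prediction = fit_mean([ys[i] for i in train_idx])
        val = [ys[i] for i in val_idx]
        scores.append(mean_squared_error(val, [prediction] * len(val)))
    return sum(scores) / len(scores)


# Made-up data: hold out a test set first, then cross-validate on the rest.
data = [float(x) for x in range(20)]
train, test = train_test_split(data)
cv_score = cross_validate(train, k=5)
```

The point of the ordering is that the test set is set aside before any cross-validation happens, so it is only ever used for a final check.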
2. Elements of Statistical Learning – Hastie, Tibshirani, and Friedman [HIGHLY RECOMMEND] - $30.60
Similar to Introduction to Statistical Learning, but much more mathematically dense. This is my favorite reference book for machine learning theory. Topics include generalized linear models, additive models, bagging, boosting, tree-fitting algorithms, random forests, gradient boosting, and much more. It’s worth reading several times.
Another machine learning book that focuses on theory. It won’t show you how to train your own models, but it will help you understand why models work and what guarantees we’re able to make about learning and generalization. It is less focused on specific ML algorithms and more focused on those general properties.
4. Deep Learning – Goodfellow, Bengio and Courville [HIGHLY RECOMMEND] - $25.01
A balance of intuition, applicability, and theory that this field has been lacking. Begins with the nuts and bolts of feedforward networks, and then goes into depth about the state of the art in model regularization, optimization, and various model classes and architectures. Filled with useful tips and tricks for implementing models. This is the best book out there right now for learning how deep learning works.
Machine Learning (Practical / Applied)
5. Deep Learning with Python – Francois Chollet - $24.48
A handy reference for Keras. This book is helpful for bridging the gap between beginner deep learning tutorials and more advanced / state-of-the-art methods. It’s not the best for learning theory, but will help you to implement what you read in papers.
Working with Data (Python and R Programming)
6. R for Data Science – Wickham and Grolemund [HIGHLY RECOMMEND] - $18.17
An absurdly useful book for learning how to manipulate data with R and the Tidyverse (dplyr, ggplot2, forcats, etc.). I read this once when I was first learning R and again after a few years of experience, and I learned new things each time. This book will make anyone better at data analysis, visualization, manipulation, and cleaning.
7. Advanced R – Hadley Wickham - $53.73
This book will teach you how the R language works on a much lower, more technical level. It’s a useful book for helping advanced users write more performant code. It’s also useful for people learning R whose background is primarily in other languages, as it will help to draw parallels between R and other languages.
This book focuses on baseball data specifically, but it is packed with real-world data analysis and visualization examples in R that show how to munge data to answer questions and understand rich data sets.
9. Python for Data Analysis – Wes McKinney - $23.09
This is a great book for learning the ins and outs of Pandas. It will teach you to clean, aggregate, transform, and visualize data in Python using Pandas dataframes. It’s Python’s closest equivalent to R for Data Science.
Econometrics / Applied Statistics
11. Mostly Harmless Econometrics: An Empiricist’s Companion – Angrist and Pischke [HIGHLY RECOMMEND] - $26.28
A handbook on advanced econometrics. Useful for brushing up on linear models (simple and multiple linear regression) and experiment design (instrumental variables, difference-in-differences models, answering causal questions).
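To give a flavor of the experiment-design material, here is a rough sketch of the simplest two-group, two-period difference-in-differences estimate: the change in the treated group minus the change in the control group. The function name and the numbers are made up for illustration and do not come from the book, and the estimate only has a causal reading if the two groups would have followed parallel trends without the treatment.

```python
def mean(xs):
    return sum(xs) / len(xs)


def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


# Illustrative, made-up outcomes for each group and period.
estimate = diff_in_diff(
    treated_before=[10, 12, 11],  # mean 11
    treated_after=[15, 17, 16],   # mean 16
    control_before=[9, 10, 11],   # mean 10
    control_after=[11, 12, 13],   # mean 12
)
# estimate is (16 - 11) - (12 - 10) = 3
```

In practice the same quantity is usually estimated with a regression that includes a treatment-by-period interaction term, which also gives you standard errors.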
This book felt like a greatest hits compilation of all the most useful and exciting things I learned about experiment design as an undergrad. It’s the best book I’ve found to date for marrying the strengths of old-school statisticians and newer-school data scientists.