James LeDoux

Data scientist and armchair sabermetrician

Leaving MLB: Lessons Learned in my First Data Science Role

August 14, 2017

For the past three months I have had the exciting opportunity to intern as a data scientist at Major League Baseball Advanced Media, the technology arm of MLB. During this time I’ve built models and analyzed data relating to user-facing products, sales teams, and the sport itself. While it would be impossible to detail everything I’ve learned and worked on in a single post, this will serve as a brief overview of the experience.

Working with the Clubs: Customer Lead Scoring

A core component of my team’s work centered on building lead scoring models for MLB’s 30 teams. The problem, in brief, is this: given a fan’s history of past purchases, games attended, and interaction with a team’s digital media, how likely are they to either upgrade or renew their ticket package for next season? Baseball fans are a diverse group of customers, and there is no one-size-fits-all solution to understanding what drives the purchasing habits of a team’s fan base. For this reason, each team requires its own specially tailored lead scoring model.

The lead scoring models themselves are relatively simple, usually logistic regression models with a small number of variables. The real challenge in building these models comes from simultaneously optimizing lift and accuracy (which are not always positively correlated) while maintaining interpretability, which is of high importance to the teams that rely on the models’ outputs.
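To make the tension between lift and accuracy concrete, here is a minimal sketch in plain Python. The function names, the toy scores, and the top-20% cutoff are all illustrative assumptions, not the evaluation code used at MLB.

```python
def accuracy(scores, labels, threshold=0.5):
    """Share of leads classified correctly when scores >= threshold count as 'will buy'."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def lift_at_top(scores, labels, fraction=0.2):
    """Conversion rate among the top-scored fraction of leads, divided by the overall rate."""
    n_top = max(1, round(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(y for _, y in ranked[:n_top]) / n_top
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


# Made-up data: ten leads, two of whom actually renewed (label 1).
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# Model A ranks both buyers first but never crosses the 0.5 threshold.
scores_a = [0.40, 0.10, 0.12, 0.08, 0.05, 0.11, 0.09, 0.07, 0.06, 0.35]

# Model B is confident about one buyer and scores the other last.
scores_b = [0.90, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.05]

for name, scores in [("A", scores_a), ("B", scores_b)]:
    print(name, accuracy(scores, labels), lift_at_top(scores, labels))
```

In this toy case Model B has the higher accuracy while Model A has the higher lift, which is the kind of disagreement that forces a choice about what the sales team actually needs.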

In the end, a few traits of successful lead scoring models proved consistent across teams. Of the thousands of variables considered, there was almost always one that would stick in the final model representing each of the following categories: purchaser recency, duration, quantity, engagement level, and source. This makes intuitive sense, since these five categories effectively explain the most important aspects of a fan’s relationship with a team. If we know how recently somebody has purchased, we have an idea of how active they are. Knowing how long they have been a customer tells us something about loyalty. The number of games attended is as direct a signal as possible of a fan’s interest in attending future games. Engagement level, measured through interaction with email marketing campaigns, captures something similar to games purchased, but through a different channel. And last, the source of a customer’s purchases (i.e. buying directly from the club or on the secondary market) tends to have an impact on future purchasing habits as well. With one variable from each of these categories placed into a model, the other couple thousand often become redundant.
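As a rough illustration of the five categories, the sketch below derives one feature per category from a fan’s purchase history. The `Purchase` fields, the feature names, and the sample fan are hypothetical; the actual variables used in the club models are not described here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Purchase:
    purchased_on: date
    games: int
    channel: str  # hypothetical values: "club" or "secondary"


def fan_features(purchases, emails_sent, emails_opened, as_of):
    """Build one illustrative feature per category for a single fan."""
    first = min(p.purchased_on for p in purchases)
    last = max(p.purchased_on for p in purchases)
    from_club = sum(1 for p in purchases if p.channel == "club")
    return {
        "recency_days": (as_of - last).days,               # recency
        "tenure_days": (as_of - first).days,               # duration
        "games_purchased": sum(p.games for p in purchases),  # quantity
        "email_open_rate": emails_opened / emails_sent if emails_sent else 0.0,  # engagement
        "share_bought_from_club": from_club / len(purchases),  # source
    }


# A made-up fan with three purchases.
history = [
    Purchase(date(2015, 4, 1), games=2, channel="club"),
    Purchase(date(2016, 7, 4), games=1, channel="secondary"),
    Purchase(date(2017, 6, 1), games=4, channel="club"),
]
features = fan_features(history, emails_sent=20, emails_opened=5, as_of=date(2017, 8, 1))
print(features)
```

Each value would then be one candidate column in the logistic regression, with the many correlated alternatives in the same category left out.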

The end result of this process is something I find exciting. Given one of these models, a team can understand who’s most likely to make a purchase before its sales staff makes a single call. The lift generated by these models makes for a tactical, data-driven sales strategy that allows a club to better understand its fans and make the most of its resources. The signals used in these models, I imagine, apply not only to the pro sports industry, but to consumer sales in general.

Streaming Media: MLB.TV Churn Prediction

Another type of modeling my team worked with was churn prediction for the MLB.TV streaming product. Churn prediction works similarly to lead scoring at the surface level, in the sense that both are binary classification problems, but there are a few interesting nuances to working with churn that warrant a different approach.

Because MLB.TV is a consumer-facing digital product, the most important features for determining whether somebody will churn come from their activity level. The users churning are almost never the active ones using the app multiple times per week. Domain-specific indicators also come into play. Since this is a baseball streaming product, for example, the performance of a fan’s favorite team can play a role in their overall satisfaction with the product. Only a superfan, after all, will pay to watch their team lose day in and day out.

MLB.TV (image: mlb.com)

With churn modeling, however, there is no external client, and a computer doesn’t care about model interpretability. This gave me the opportunity to experiment with replacing our existing models with something more flexible. For this task, I fit an ensemble of a random forest classifier, gradient boosted trees (XGBoost), and a LASSO logistic regression model, which resulted in a moderate performance boost.
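The post does not say how the three models were combined. One common option is soft voting, where each model’s predicted churn probability is averaged, optionally with weights. The sketch below assumes the three models have already been fit and have produced probabilities; the model names and numbers are made up for illustration.

```python
def soft_vote(model_probs, weights=None):
    """Average per-user churn probabilities across models (weighted if weights are given)."""
    names = list(model_probs)
    if weights is None:
        weights = {name: 1.0 for name in names}
    n_users = len(model_probs[names[0]])
    if any(len(model_probs[name]) != n_users for name in names):
        raise ValueError("every model must score the same number of users")
    total_weight = sum(weights[name] for name in names)
    return [
        sum(weights[name] * model_probs[name][i] for name in names) / total_weight
        for i in range(n_users)
    ]


# Hypothetical predicted churn probabilities for three users.
model_probs = {
    "random_forest": [0.9, 0.2, 0.6],
    "gradient_boosted_trees": [0.8, 0.1, 0.4],
    "lasso_logistic": [0.7, 0.3, 0.5],
}
ensemble = soft_vote(model_probs)
print(ensemble)
```

Averaging tends to help most when the member models make different kinds of errors, which is one reason to mix tree-based learners with a penalized linear model.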

Predicting the Lifetime Value of a Fan

A third project I worked on was an attempt to predict the lifetime value of a ticket purchaser. This effort cuts right at the heart of the challenges facing any data science effort to model the habits of a diverse customer base. For this domain in particular, useful signals can be elusive. An individual buyer’s purchasing habits can vary significantly from year to year, and levels of activity range from passive one-game-per-year purchasers to superfans spending thousands of dollars. To add another layer of complexity, MLB’s fans come from vastly different markets. We would expect the behavior of a Kansas City fan, for example, to differ from that of a New York fan.

All of this results in a situation where a single linear model is insufficient. For tasks as complicated as this one, the out-of-the-box solutions taught in academia do not always play nicely with the problem at hand. The solution, of course, is to try a few different approaches and see what works. The answer could lie in fitting a single model per club, one model per audience segment, or in something entirely different. At the very least, I believe a tree-based approach makes the most sense here, since it would be an effective learner of the team- and segment-based interactions that would be challenging to hand-craft otherwise. This task remains a work in progress, however, so I have no perfect solution to offer.
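The “one model per club” option can be sketched with a deliberately trivial stand-in for the model: the average spend within each club, with a fallback to the overall average when a club has too little data. The club codes, spend figures, and `min_rows` cutoff are invented for illustration, and a real attempt would swap the mean for an actual regressor.

```python
from collections import defaultdict
from statistics import mean


def fit_segment_means(rows, min_rows=2):
    """rows: list of (segment, spend). Returns (per-segment means, global mean)."""
    by_segment = defaultdict(list)
    for segment, spend in rows:
        by_segment[segment].append(spend)
    global_mean = mean(spend for _, spend in rows)
    # Only trust a segment-level estimate when the segment has enough rows.
    segment_means = {
        segment: mean(values)
        for segment, values in by_segment.items()
        if len(values) >= min_rows
    }
    return segment_means, global_mean


def predict(segment, segment_means, global_mean):
    """Use the segment's own estimate if it exists, otherwise the global one."""
    return segment_means.get(segment, global_mean)


# Made-up annual spend per fan, keyed by a hypothetical club code.
rows = [("NYY", 400.0), ("NYY", 600.0), ("KC", 100.0), ("KC", 200.0), ("SEA", 300.0)]
segment_means, global_mean = fit_segment_means(rows)
print(predict("NYY", segment_means, global_mean))
print(predict("SEA", segment_means, global_mean))
```

The fallback hints at the main cost of splitting by club or segment: each model sees less data, which is part of why a single tree-based model that can learn those interactions itself is appealing.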

Sabermetrics and Improving Statcast

The last area I had a chance to work on is what I would broadly characterize as baseball-facing work. Working with our in-house baseball researchers, I helped to automate some of our anomaly detection reporting, taking previously manual Excel work and rewriting it in SQL and Python. Parts of this work involved fitting and evaluating mixed-effects logistic regression models. These models were new to me at the time and interesting to learn about. Other ad hoc requests in this domain would come up on occasion, usually involving analysis of Statcast data on the relationships between pitch speed, type, and location and the outcome of a pitch or plate appearance.

Lessons Learned

To wrap this up, here are the top pieces of advice I would offer somebody entering their first data science job:

1. Most tasks don’t require machine learning. Penalized (LASSO or Ridge) logistic regression and ordinary least squares with simple feature engineering will usually do the trick.
2. Some tasks do necessitate machine learning. Whenever you fit a simple model, try a random forest or GBM alongside it just to be sure you aren’t oversimplifying the problem with a linear model.
3. Knowing the difference between a #1 and a #2 problem is a crucial part of real-world data science. This is part of understanding your client’s needs and your problem domain.
4. Data science is more than just building models; most of the work happens before the model is trained. You will spend hours retrieving, joining, aggregating, sanity-checking, and exploring your data.
5. Real-world data is full of surprises. Distributions change over time, features are sparse, and variables don’t always mean what you think they do. Always profile your data to avoid mistakes early on.
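As a small example of what profiling your data can look like, here is a minimal sketch that reports the missing rate, distinct count, and range for each column of a list of records. The column names and rows are made up, and the function assumes the values within a column are mutually comparable.

```python
def profile(rows):
    """Summarize each column of a list of dict records."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat None and empty strings as missing.
        present = [v for v in values if v is not None and v != ""]
        report[col] = {
            "missing_rate": 1 - len(present) / len(rows),
            "distinct": len(set(present)),
            "min": min(present) if present else None,
            "max": max(present) if present else None,
        }
    return report


# Made-up records with a missing value in each column.
records = [
    {"games": 3, "channel": "club"},
    {"games": None, "channel": "secondary"},
    {"games": 81, "channel": "club"},
    {"games": 2, "channel": ""},
]
summary = profile(records)
for column, stats in summary.items():
    print(column, stats)
```

Even a summary this crude surfaces sparse features and suspicious ranges before they end up in a model.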
