For the past three months I have had the exciting opportunity to intern as a data scientist at Major League Baseball Advanced Media, the technology arm of MLB. During this time I’ve built models and analyzed data relating to user-facing products, sales teams, and the sport itself. While it would be impossible to detail everything I’ve learned and worked on in a single post, this will serve as a brief overview of the experience.
Working with the Clubs: Customer Lead Scoring
A core component of my team’s work centered around building lead scoring models for MLB’s 30 teams. The problem, in brief, is this: given a fan’s history of past purchases, games attended, and interaction with a team’s digital media, how likely are they to either upgrade or renew their ticket package for next season? Baseball fans are a diverse group of customers, and there is no one-size-fits-all solution to understanding what drives a team’s fan base’s purchasing habits. For this reason, each team requires its own specially tailored lead scoring model.
The lead scoring models themselves are relatively simple; usually logistic regression models with a small number of variables. The real challenges in building these models come from simulteneously optimizing lift and accuracy (which are not always positively correlated) while maintaining interpretability, which is of high importance to the teams that rely on the models’ outputs.
In the end, there were a few traits of successful lead scoring models that proved consistent across teams. Of the thousands of variables considered, there was almost always one that would stick into the final model representing each of the following categories: purchaser recency, duration, quantity, engagement level, and source. This makes intuitive sense, since these five categories effectively explain the most important aspects of a fan’s relationship with a team. If we know how recent somebody has purchased, we have an idea as to how active they are. Knowing how long they have been a customer tells us something about loyalty. The number of games attended is as direct a signal possible for a fan’s interest level in attending future games. Engagement level, measured through interaction with email marketing campaigns, measures something similar to games purchased, but through a different channel. And last, the source of a customer’s purchases (i.e. buying directly from the club or on the secondary market) tends to have an impact on future purchasing habits as well. With one variable from each of these categories placed into a model, the other couple-thousand often become redundant.
The end result of this process is something I find exciting. Given one of these models, a team can understand who’s most likely to make a purchase before they make a call. The lift generated by these models makes for a tactical, data-driven sales strategy that allows a club to better understand its fans and make the most of its resources. The signals used in these models, I imagine, apply not only to the pro sports industry, but to consumer sales as a field in general.
Streaming Media: MLB.TV Churn Prediction
Another type of modeling my team worked with was churn prediction for the MLB.TV streaming product. Churn prediction works similarly to lead scoring at the surface level, in the sense that both are binary classification problems, but there are a few interesting nuances to working with churn that warrant a different approach.
Being a consumer-facing digital product, the most important features for determining whether somebody will churn come from their activity level. The users churning are almost never the active ones using the app multiple times per week. Domain-specific indicators also come into play. Being a baseball-streaming product, for example, the performance of a fan’s favorite team can play a role in their overall satisfaction with the product. Only a superfan, after all, will pay to watch their team lose day in and day out.
MLB.TV (image: mlb.com)
With churn modeling, however, there is no external client. This means that the model is allowed to be more flexible, since a computer doesn’t care about model interpretability. This gave me the opportunity to experiment with the idea of replacing our existing models with something more flexible. For this task, I fit an ensemble of a random forest classifier, gradient boosted trees (XGBoost), and a LASSO logistic regression model, resulting in a moderate performance boost.
Predicting the Lifetime Value of a Fan
A third project I worked on was an attempt to predict the lifetime value of a ticket purchaser. This effort cuts right at the heart of the challenges facing any data science effort to model the habits of a diverse customer base. For this domain in particular, useful signals can be elusive. An individual buyer’s purchasing habits can vary significantly year-over-year, and levels of activity range from passive one-game-per-year purchasers to superfans spending thousands of dollars. To add an additional layer of complexity, MLB’s fans come from vastly different markets. We would expect the behavior of a Kansas City fan, for example, to differ from that of a New York fan.
All of this results in a situation where a single linear model is insufficient. For tasks as complicated as this one, the out-of-the-box solutions taught in academia do not always play nicely with the problem at hand. The solution, of course, is to try a few different approaches and see what works. The answer could lie in fitting a single model per club, one model per audience segment, or in something entirely different. At very least, I believe a tree-based approach makes the most sense here, since this would be an effective learner of the team and segment-based interactions that would be challenging to hand-craft otherwise. This task remains a work in progress, however, so I have no perfect solution to offer.
Sabermetrics and Improving Statcast
The last area I had a chance to work on is what I would broadly characterize as baseball-facing work. Working with our in-house baseball researchers, I helped to automate some of our anomaly detection reporting, taking previously-manual Excel work and re-writing it in SQL and Python. Parts of this work involved the fitting and evaluation of mixed effect logistic regression models. These models were new to me at the time and interesting to learn about. Other ad-hoc requests in this domain would come on occasion, usually involving some sort of data analysis regarding Statcast data on the relationships between pitch speed, type, and location and the outcome of a pitch or plate appearance.
To wrap this up, here are the top pieces of advice I would offer somebody entering their first data science job:
Most tasks don’t require machine learning. Penalized (LASSO or Ridge) logistic regression and ordinary least squares with simple feature engineering will usually do the trick.
Some tasks do necessitate machine learning. Whenever you fit a simple model, try a random forest or GBM alongside it just to be sure you aren’t oversimplifying the problem with a linear model.
Knowing the difference between a #1 and a #2 problem is a crucial part of real-world data science. This is part of understanding your client’s needs and your problem domain.
Data science is more than just building models; most of the work happens before the model is trained. You will spend hours retrieving, joining, aggregating, sanity-checking, and exploring your data.
Real world data is full of surprises. Distributions change over time, features are sparse, and variables don’t always mean what you think they do. Always profile your data to avoid mistakes early on.