In this post I discuss the multi-armed bandit problem and implementations of four specific bandit algorithms in Python (epsilon greedy, UCB1, a Bayesian UCB, and EXP3). I evaluate their performance as content recommendation systems on a real-world movie ratings dataset and provide simple, reproducible code for applying these algorithms to other tasks.
Multi-armed bandits belong to a class of online learning algorithms that allocate a fixed number of resources to a set of competing choices, attempting to learn an optimal resource allocation policy over time.
The multi-armed bandit problem is often introduced via an analogy of a gambler playing slot machines. Imagine you’re at a casino and are presented with a row of \(k\) slot machines, with each machine having a hidden payoff function that determines how much it will pay out. You enter the casino with a fixed amount of money and want to learn the best strategy to maximize your profits. Initially you have no information about which machine is expected to pay out the most money, so you try one at random and observe its payout. Now that you have a little more information than you had before, you need to decide: do I exploit this machine now that I know more about its payoff function, or do I explore the other options by pulling arms that I have less information about? You want to strike the most profitable balance between exploring all potential machines so that you don’t miss out on a valuable one by simply not trying it enough times, and exploiting the machine that has been most profitable so far. A multi-armed bandit algorithm is designed to learn an optimal balance for allocating resources between a fixed number of choices in a situation such as this one, maximizing cumulative rewards over time by learning an efficient explore vs. exploit policy.
Before looking at any specific algorithms, it’s useful to first establish a few definitions and core principles, since the language and problem setup of the bandit setting differ slightly from those of traditional machine learning. The bandit setting, in short, looks like this:
- There are \(k\) arms (choices) available, each with an unknown reward distribution.
- At each time step \(t\), the algorithm selects an arm according to its current policy and observes a reward for that arm only.
- The algorithm updates its policy based on the observed reward and repeats.
- The goal is to maximize cumulative reward over time, which means balancing the exploration of uncertain arms against the exploitation of arms known to perform well.
This bears several similarities to reinforcement learning techniques such as Q-learning, which similarly learn and modify a policy over time. The time-dependence of a bandit problem (start with zero or minimal information about all arms, learn more over time) is a significant departure from the traditional machine learning problem setting, where the full dataset is available at once and a model can be trained as a one-off process. Bandits require repeated, incremental policy updates.
There are several nuances to running a multi-armed bandit experiment using a real-world dataset. I describe the experiment setup in detail in this post, and I encourage you to read through it before proceeding. If not, here’s the short version of how this experiment is set up:
- Each movie in the dataset is an arm, and a user “liking” a recommended movie yields a reward of 1 (0 otherwise).
- Because the dataset is historic, the bandit is evaluated offline using the replay method: at each step the bandit recommends a slate of movies, and only logged events where the user actually rated one of the recommended movies are kept and learned from.
- Recommendations are served and the policy is updated in batches, rather than one event at a time, to keep the simulation computationally tractable.
But really, read the full version to better understand the ins and outs of evaluating a multi-armed bandit algorithm using a historic dataset.
The simplest bandits follow semi-uniform strategies. The most popular of these is called epsilon greedy.
Like the name suggests, the epsilon greedy algorithm follows a greedy arm selection policy, selecting the best-performing arm at each time step. However, \(\epsilon\) percent of the time, it will go off-policy and choose an arm at random. The value of \(\epsilon\) determines the fraction of time steps on which the algorithm explores available arms; the rest of the time, it exploits the arms that have performed best historically.
This algorithm has a few perks. First, it’s easy to explain: explore on \(\epsilon\%\) of time steps, exploit on the other \((1-\epsilon)\%\). The algorithm fits in a single sentence! Second, \(\epsilon\) is straightforward to optimize. Third, despite its simplicity, it typically yields pretty good results. Epsilon greedy is the linear regression of bandit algorithms.
Much like linear regression can be extended to a broader family of generalized linear models, there are several adaptations of the epsilon greedy algorithm that trade off some of its simplicity for better performance. One such improvement is to use an epsilon-decreasing strategy. In this version of the algorithm, \(\epsilon\) decays over time. The intuition for this is that the need for exploration decreases over time: selecting random arms becomes increasingly inefficient as the algorithm accumulates more complete information about the available arms. Another take on this algorithm is an epsilon-first strategy, where the bandit acts completely at random for a fixed amount of time to sample the available arms, and then purely exploits thereafter. I’m not going to use either of these approaches in this post, but it’s worth mentioning that these options are out there.
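As a quick illustration, an epsilon-decreasing schedule can be as simple as shrinking \(\epsilon\) toward a small floor as \(t\) grows. Here’s a minimal sketch; the starting value, floor, and decay rate are arbitrary choices for illustration, not values used in this post:

```python
import numpy as np

def decayed_epsilon(t, eps_start=0.3, eps_min=0.01, decay=0.001):
    # exponentially decay epsilon toward a small floor as t grows
    return max(eps_min, eps_start * np.exp(-decay * t))

# explore aggressively early on, only rarely later
early, late = decayed_epsilon(0), decayed_epsilon(5000)
```

The decayed value would simply replace the fixed `epsilon` argument in the policy function below at each time step.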
Implementing the traditional epsilon greedy bandit strategy in Python is straightforward:
import numpy as np

def epsilon_greedy_policy(df, arms, epsilon=0.15, slate_size=5, batch_size=50):
    '''
    Applies Epsilon Greedy policy to generate movie recommendations.
    Args:
        df: dataframe. Dataset to apply the policy to.
        arms: list or array. ID of every eligible arm.
        epsilon: float. represents the fraction of time steps where we explore random arms.
        slate_size: int. the number of recommendations to make at each step.
        batch_size: int. the number of users to serve these recommendations to before
            updating the bandit's policy (used by the surrounding simulation loop).
    '''
    # draw a 0 or 1 from a binomial distribution, with epsilon likelihood of drawing a 1
    explore = np.random.binomial(1, epsilon)
    # if exploring (or if no data has been observed yet): choose a random set of recommendations
    if explore == 1 or df.shape[0] == 0:
        recs = np.random.choice(arms, size=slate_size, replace=False)
    # if exploiting: sort movies by "like rate", recommend the movies with the best performance so far
    else:
        scores = df[['movieId', 'liked']].groupby('movieId').agg({'liked': ['mean', 'count']})
        scores.columns = ['mean', 'count']
        scores['movieId'] = scores.index
        scores = scores.sort_values('mean', ascending=False)
        recs = scores.loc[scores.index[0:slate_size], 'movieId'].values
    return recs
# apply the epsilon greedy policy to the historic dataset (all arm pulls prior to the current step that passed the replay filter)
recs = epsilon_greedy_policy(df=history.loc[history.t<=t,], arms=df.movieId.unique())
# get the score for this set of recommendations, and add it to the bandit's history to influence its future decisions
history, action_score = score(history, df, t, batch_size, recs)
Epsilon greedy performs pretty well, but it’s easy to see how selecting arms uniformly at random can be inefficient. If you have one movie that 50% of users have liked, and another that only 5% of users have liked, epsilon greedy is equally likely to pick either of these movies when exploring. Upper Confidence Bound algorithms were introduced as a class of bandit algorithms that explore more efficiently.
Upper Confidence Bound algorithms construct a confidence interval of what each arm’s true performance might be, factoring in the uncertainty caused by variance in the data and the fact that we’re only able to observe a limited sample of pulls for any given arm. The algorithms then optimistically assume that each arm will perform as well as its upper confidence bound (UCB), selecting the arm with the highest UCB.
This has a number of nice qualities. First, you can parameterize the size of the confidence interval to control how aggressively the bandit explores or exploits (e.g. you can run a 99% confidence interval to explore heavily, or a 50% confidence interval to mostly exploit). Second, using upper confidence bounds causes the bandit to explore more efficiently than an epsilon greedy bandit. This happens because confidence intervals shrink as you observe additional data points for a given arm. So, while the algorithm will gravitate toward picking arms with high average performance, it will periodically give less-explored arms a chance, since their confidence intervals are wider.
Seeing this visually helps to understand how these confidence bounds produce an efficient balance of exploration and exploitation. Below I’ve produced an imaginary scenario where a UCB bandit is determining which article to show at the top of a news website. There are three articles, judged according to the upper confidence bound of their click-through-rate (CTR).
Article A has been seen 100 times and has the best CTR. Article B has a slightly worse CTR than article A, but it hasn’t been seen by as many users, so there’s more uncertainty about how well it’s going to perform in the long run. For this reason, it has a larger confidence bound, giving it a slightly higher UCB score than article A. Article C was published just moments ago, so almost no users have seen it. We’re extremely uncertain about how high its CTR will ultimately be, so its UCB is the highest of all for now, despite its initial CTR being low.
Over time, more users will see articles B and C, and their confidence bounds will become more narrow and look more like that of article A. As we learn more about B and C, we’ll shift from exploration toward exploitation as the articles’ confidence intervals collapse toward their means. Unless the CTR of article B or C improves, the bandit will quickly start to favor article A again as the other articles’ confidence bounds shrink.
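To make the intuition concrete, here’s a small sketch of that scenario using UCB1-style exploration bonuses (introduced formally in the next section). The view counts and CTRs are made-up numbers for illustration, not data from this post:

```python
import numpy as np

# hypothetical (clicks, views) for the three articles: A is established,
# B has been seen less, C was just published
articles = {'A': (10, 100), 'B': (3, 40), 'C': (0, 2)}
t = sum(views for clicks, views in articles.values())  # total impressions so far

def ucb1_score(clicks, views, t):
    # observed CTR plus an exploration bonus that shrinks as views accumulate
    return clicks / views + np.sqrt(2 * np.log(t) / views)

scores = {name: ucb1_score(clicks, views, t) for name, (clicks, views) in articles.items()}
# article C, with almost no views, gets the widest bound and the highest score
```

Despite having the lowest observed CTR, article C scores highest, followed by B, then A, which matches the ranking described above.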
A good UCB algorithm to start with is UCB1. UCB1 uses Hoeffding’s inequality to assign an upper bound to an arm’s mean reward where there’s high probability that the true mean will be below the UCB assigned by the algorithm. The inequality states that:
\[P(\mu_{a} > \hat{\mu}_{t,a} + U_{t}(a)) \leq e^{-2tU_{t}(a)^2},\]where \(\mu_{a}\) is arm \(a\)’s true mean reward, \(\hat{\mu}_{t,a}\) is \(a\)’s observed mean reward at time \(t\), and \(U_{t}(a)\) is a confidence term for arm \(a\) which, when added to the mean reward, gives you an upper confidence bound. Setting \(p = e^{-2tU_{t}(a)^2}\) gives us the following value for the UCB term:
\[U_{t}(a) = \sqrt{\frac{-\log{p}}{2n_{a}}}.\]Note that in the denominator I’m replacing \(t\) with \(n_{a}\), since it represents the number of times arm \(a\) has been pulled, which will eventually differ from the total number of time steps \(t\) the algorithm has been running at a given point in time.
Setting the probability \(p\) of the true mean being greater than the UCB to be less than or equal to \(t^{-4}\), a small probability that quickly converges to zero as the number of rounds \(t\) grows, ultimately gives us the UCB1 algorithm, which pulls the arm that maximizes:
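Spelling out that substitution: setting \(p = t^{-4}\) in the expression for \(U_{t}(a)\) above gives

\[U_{t}(a) = \sqrt{\frac{-\log{(t^{-4})}}{2n_{a}}} = \sqrt{\frac{4\log{t}}{2n_{a}}} = \sqrt{\frac{2\log{t}}{n_{a}}},\]

which is exactly the exploration bonus in the UCB1 selection rule.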
\[\bar{x}_{t,a}+ \sqrt{\frac{2\log(t)}{n_{a}}}.\]Here \(\bar{x}_{t,a}\) is the mean observed reward of arm \(a\) at time \(t\), \(t\) is the current time step in the algorithm, and \(n_{a}\) is the number of times arm \(a\) has been pulled so far.
Putting this all together, it means that a high “like” rate for a movie in this dataset will increase the likelihood of an arm being pulled, but so will a lower number of times the arm has been pulled so far, which encourages exploration. Also notice that the part of the function that includes the number of time steps the algorithm has been running (\(t\)) is inside a logarithm, which causes the algorithm’s propensity to explore to decay over time. Jeremy Kun’s blog has a very nice explanation of this algorithm and the proofs that support it. I also found this post from Lilian Weng’s blog helpful for understanding how the confidence bounds are created using Hoeffding’s inequality.
Here’s how the UCB1 policy looks in Python:
def ucb1_policy(df, t, ucb_scale=2.0, n_recs=5):
    '''
    Applies UCB1 policy to generate movie recommendations.
    Args:
        df: dataframe. Dataset to apply UCB policy to.
        t: int. represents the current time step.
        ucb_scale: float. unused by UCB1 itself; scales the confidence bound in the
            Bayesian UCB variant described below.
        n_recs: int. the number of recommendations to make at each step.
    '''
    # std is unused here, but is needed by the Bayesian UCB variant below
    scores = df[['movieId', 'liked']].groupby('movieId').agg({'liked': ['mean', 'count', 'std']})
    scores.columns = ['mean', 'count', 'std']
    # mean observed reward plus the UCB1 exploration bonus, sqrt(2*log(t) / n_a)
    scores['ucb'] = scores['mean'] + np.sqrt(
        (
            (2 * np.log(t)) /
            scores['count']
        )
    )
    scores['movieId'] = scores.index
    scores = scores.sort_values('ucb', ascending=False)
    recs = scores.loc[scores.index[0:n_recs], 'movieId'].values
    return recs
recs = ucb1_policy(df=history.loc[history.t<=t,], t=t, ucb_scale=args.ucb_scale)
history, action_score = score(history, df, t, args.batch_size, recs)
An extension of UCB1 that goes a step further is the Bayesian UCB algorithm. This bandit algorithm takes the same principles as UCB1, but lets you incorporate prior information about the distribution of an arm’s rewards to explore more efficiently (the Hoeffding inequality’s approach to generating UCB1’s confidence bound makes no such assumptions).
Going from UCB1 to a Bayesian UCB can be fairly simple. If you assume the rewards of each arm are normally distributed, you can simply swap out the UCB term from UCB1 with \(\frac{c\sigma(x_{a})}{\sqrt{n_{a}}}\), where \(\sigma(x_{a})\) is the standard deviation of arm \(a\)’s rewards, \(c\) is an adjustable hyperparameter for determining the size of the confidence interval you’re adding to an arm’s mean observed reward, \(n_{a}\) is the number of times arm \(a\) has been pulled, and \(\bar{x}_{a} \pm \frac{c\sigma(x_{a})}{\sqrt{n_{a}}}\) is a confidence interval for arm \(a\) (so a 95% confidence interval can be represented with \(c=1.96\)). It’s common to see this outperform UCB1 in practice. You can see a little more detail about this in these slides from UCL’s reinforcement learning course.
Implementation-wise, turning the above UCB1 policy into a Bayesian UCB policy is pretty simple. All you have to do is replace this logic from the UCB1 policy:
scores['ucb'] = scores['mean'] + np.sqrt(
    (
        (2 * np.log(t)) /
        scores['count']
    )
)
with this:
scores['ucb'] = scores['mean'] + (ucb_scale * scores['std'] / np.sqrt(scores['count']))
and there you have it! Your UCB bandit is now Bayesian.
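Put together, the full Bayesian UCB policy might look like the following sketch, mirroring the UCB1 function above with the swapped confidence bound (the function name and defaults here are my own, not from the original code):

```python
import numpy as np
import pandas as pd

def bayesian_ucb_policy(df, ucb_scale=1.96, n_recs=5):
    '''
    Sketch of a Bayesian UCB policy, assuming normally distributed rewards.
    ucb_scale controls the width of the confidence interval (1.96 ~ 95%).
    '''
    scores = df[['movieId', 'liked']].groupby('movieId').agg({'liked': ['mean', 'count', 'std']})
    scores.columns = ['mean', 'count', 'std']
    # mean reward plus c * sigma / sqrt(n): a normal-approximation confidence bound
    scores['ucb'] = scores['mean'] + (ucb_scale * scores['std'] / np.sqrt(scores['count']))
    scores['movieId'] = scores.index
    scores = scores.sort_values('ucb', ascending=False)
    return scores.loc[scores.index[0:n_recs], 'movieId'].values
```

Note that unlike UCB1, the bound here doesn’t depend on the global time step \(t\) at all; exploration pressure comes entirely from each arm’s own variance and pull count.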
A third popular bandit strategy is an algorithm called EXP3, short for Exponential-weight algorithm for Exploration and Exploitation. EXP3 feels a bit more like a traditional machine learning algorithm than epsilon greedy or UCB1, because it learns weights defining how promising each arm is over time. As with UCB1, EXP3 attempts to be an efficient learner by placing more weight on good arms and less weight on ones that aren’t as promising.
The algorithm starts by initializing a vector of weights \(w\) with one weight per arm in the dataset and each weight initialized to equal 1. It also takes as input an exploration parameter \(\gamma\), which controls the algorithm’s likelihood to explore arms uniformly at random. Then, for each time step, we:
\[\begin{align} &1. \text{ Set } p_{i}(t) = (1-\gamma)\frac{w_{i}(t)}{\sum_{a=1}^{k} w_{a}(t)} + \frac{\gamma}{k} \text{ for each arm } i \\ &2. \text{ Draw the next arm } i_{t} \text{ randomly according to the probabilities } p_{1}(t), ..., p_{k}(t) \\ &3. \text{ Observe reward } x_{i_{t}}(t) \in [0,1] \\ &4. \text{ Define the estimated reward } \hat{x}_{a}(t) \text{ to be } \frac{x_{a}(t)}{p_{a}(t)} \text{ for } a=i_{t} \text{, and 0 for all other } a \\ &5. \text{ Set } w_{i_t}(t+1) = w_{i_t}(t) e^{\gamma \hat{x}_{i_t}(t) / k} \end{align}\]Here \(i_{t}\) represents the arm selected at step \(t\), \(k\) is the number of available arms, and \(a\) is an index over all \(k\) arms, used for summing over the weights in step (1) and for assigning all non-selected arms an estimated reward of zero in step (4).
In English, the algorithm exploits by drawing arms from a learned distribution of weights \(w\) that prioritizes better-performing arms, but in a probabilistic way that still allows every arm to be sampled. The exploration parameter \(\gamma\) mixes in a uniform distribution over all arms, making worse-performing arms more likely to be sampled than their weights alone would allow. Taken to its extreme, \(\gamma=1\) causes the learned weights to be ignored entirely in favor of pure, random exploration.
In Python, the EXP3 recommendation policy looks like this:
import math
import pandas as pd
import numpy as np
from numpy.random import choice

def distr(weights, gamma=0.0):
    # mix the normalized weights with a uniform distribution, controlled by gamma
    weight_sum = float(sum(weights))
    return tuple((1.0 - gamma) * (w / weight_sum) + (gamma / len(weights)) for w in weights)

def draw(arms, probability_distribution, n_recs=1):
    # sample a slate of arms (without replacement) according to the EXP3 probabilities
    arm = choice(arms, size=n_recs, p=probability_distribution, replace=False)
    return arm

def update_weights(weights, gamma, movieId_weight_mapping, probability_distribution, actions):
    # iterate through the scored actions; up to n_recs updates per recommendation slate
    if actions.shape[0] == 0:
        return weights
    for a in range(actions.shape[0]):
        action = actions.iloc[a:a+1]
        weight_idx = movieId_weight_mapping[action.movieId.values[0]]
        # importance-weighted reward estimate: observed reward divided by the arm's sampling probability
        estimated_reward = 1.0 * action.liked.values[0] / probability_distribution[weight_idx]
        weights[weight_idx] *= math.exp(estimated_reward * gamma / len(weights))
    return weights

def exp3_policy(df, history, t, weights, movieId_weight_mapping, gamma, n_recs, batch_size):
    '''
    Applies EXP3 policy to generate movie recommendations.
    Args:
        df: dataframe. Dataset to apply EXP3 policy to.
        history: dataframe. events that the offline bandit has access to (not discarded by the replay evaluation method).
        t: int. represents the current time step.
        weights: array or list. Weights used by the EXP3 algorithm.
        movieId_weight_mapping: dict. Mapping between movie IDs and their index in the array of EXP3 weights.
        gamma: float. hyperparameter controlling the amount of uniform exploration.
        n_recs: int. Number of recommendations to generate in each iteration.
        batch_size: int. Number of observations to show recommendations to in each iteration.
    '''
    probability_distribution = distr(weights, gamma)
    recs = draw(df.movieId.unique(), probability_distribution, n_recs=n_recs)
    history, action_score = score(history, df, t, batch_size, recs)
    weights = update_weights(weights, gamma, movieId_weight_mapping, probability_distribution, action_score)
    action_score = action_score.liked.tolist()
    return history, action_score, weights

movieId_weight_mapping = {movie_id: i for i, movie_id in enumerate(df.movieId.unique())}
history, action_score, weights = exp3_policy(df, history, t, weights, movieId_weight_mapping, args.gamma, args.n, args.batch_size)
rewards.extend(action_score)
rewards.extend(action_score)
Jeremy Kun again provides a great explanation of its theoretical underpinnings and regret bounds. I drew heavily from his post and the EXP3 Wikipedia entry in writing this section.
It’s expected that these bandit algorithms’ performance relative to one another will depend heavily on the task. Frequently introducing new arms might benefit a UCB algorithm’s efficient exploration policy, for example, while an adversarial task such as learning to play a game might favor the randomness baked into EXP3’s policy. Futile as it may be to declare one of them the “best” algorithm, let’s throw them all at a broadly useful task and see which bandit is best fit for the job.
Here I’ll use the Movielens dataset, reporting on the mean and cumulative reward over time for each algorithm. For more details on the experiment setup, see the Dataset and Experiment Setup section of this post at the beginning of the article, or this post which discusses offline bandit evaluation in full detail.
First we’ll need to tune each algorithm’s hyperparameters in order to compare each algorithm’s best possible performance to that of the others. This means finding an optimal value of epsilon for epsilon greedy, of the scale parameter that determines the size of the confidence interval for Bayesian UCB, and of gamma for EXP3. I’ll leave UCB1 alone, since it’s not typically seen as having tunable hyperparameters (although there does exist a parameterized version of it that’s slightly more involved to implement and less theoretically sound). I identified good values for these hyperparameters by trying six values that linearly spanned a range that subjectively seemed reasonable to me, and selecting the value that yielded the highest mean reward over the lifetime of the algorithm. Each parameter search was run using batch sizes of 10,000 events and recommendation slates of 5 movies at each pass of the algorithm.
The above three plots show the mean reward for the three classes of algorithm across different hyperparameter values. The best gamma for EXP3 was 0.1, the best epsilon for Epsilon Greedy was 0.1, and the best UCB algorithm was a Bayesian UCB using a scale parameter of 1.5.
I used a large batch size of 10,000 recommendations per iteration while running the above hyperparameter search to speed things up, since a single bandit runs fairly slowly on a large dataset, let alone the 19 bandits trained in this parameter search. For the final evaluation, now that the best version of each algorithm has been selected, I’ll reduce the batch size to just 100 recommendations per pass of the algorithm, giving each bandit more opportunities to update its explore-exploit policy.
Without further ado, here’s the cumulative and 200-movie trailing average reward generated by each of these parameter-tuned bandits over time:
The first takeaway from this is that EXP3 significantly underperforms Epsilon Greedy and Bayesian UCB. This is fairly consistent with what I’ve seen in other people’s implementations.
More interestingly, we see the UCB bandit achieve a higher cumulative and average reward than the other two algorithms. It’s predictably a slower learner than Epsilon Greedy. All arms start with a large confidence interval since nothing is initially known about them, so it begins its simulation highly biased toward exploration over exploitation. Meanwhile, Epsilon Greedy spends most of its time exploiting, which gives it a faster initial climb toward its eventual peak performance. Due to its more principled and efficient approach to exploration, however, the UCB bandit ultimately learns the best policy, overtaking Epsilon Greedy after roughly 25,000 training iterations.
The final mean rewards yielded by these three approaches, after roughly 1,000,000 training iterations, were 0.567 for the Bayesian UCB algorithm, 0.548 for Epsilon Greedy, and 0.468 for EXP3. To give some additional context to this, random guessing in this task would yield an average reward of 0.309 (the mean “like” rate in this dataset), so all three algorithms have clearly achieved some degree of learning in this task.
In this post I discussed and implemented four multi-armed bandit algorithms: Epsilon Greedy, EXP3, UCB1, and Bayesian UCB. Faced with a content-recommendation task (recommending movies using the Movielens-25m dataset), Epsilon Greedy and both UCB algorithms did particularly well, with the Bayesian UCB algorithm being the most performant of the group. This experiment shows that these algorithms can be viable choices for a production recommender system, all doing significantly better than random guessing and adapting their policies in intelligent ways as they obtain more information about their environments.
One important consideration that this experiment demonstrates is that picking a bandit algorithm isn’t a one-size-fits-all task. The suitability of any given algorithm for your task depends not only on your problem domain, but also on the size of your dataset. While the UCB algorithm was ultimately the most successful in this experiment, it took roughly 25,000 iterations of the algorithm for it to reach a point where it consistently outperformed Epsilon Greedy. This demonstrates that, depending on the volume of your data, you may want a faster-learning algorithm such as Epsilon Greedy, rather than a slower-learning, but ultimately more performant algorithm such as a Bayesian UCB.
A second thing to consider is that none of these algorithms take into account information about their environment or a user’s past behavior. A traditional recommender system may still outperform any of these bandits if you have other features at your disposal to make accurate predictions for a given user, as opposed to making global optimizations that apply uniformly to all users as is the case with these four bandit algorithms. There exists a compromise between these two approaches called Contextual Bandits, which apply a bandit-learning approach but use information about content and users to make more accurate recommendations. I may explore these in a future post to see how a contextual bandit fares compared to these four context-free bandits.
Lastly, I would like to thank Jeremy Kun, Lilian Weng, and Noel Welsh, whose resources I found very helpful in understanding the mathematics behind UCB1 and EXP3. I would recommend any of their above-linked resources for further reading on these topics.
Code for this post can be found on GitHub.
Multi-armed bandit algorithms are seeing renewed excitement in research and industry. Part of this is likely because they address some of the major problems internet companies face today: a need to explore a constantly changing landscape of content (news articles, videos, ads, insert whatever your company does here) while avoiding wasting too much time showing low-quality content to users. Part of this may also be related to advancements in a class of personalizable bandit algorithms, contextual bandits, which pair nicely with recent advances in reinforcement learning.
In either case, bandit algorithms are notoriously hard to work with using real-world datasets. Because they are online learning algorithms, there’s some nuance to evaluating and tuning them offline without exposing an untested algorithm to real users in a live production setting. It’s important to be able to evaluate these algorithms offline, however, for at least two reasons. First, not everybody has access to a production environment with the scale required to experiment with an online learning algorithm. And second, even those who do have a popular product at their disposal should probably be a little more careful with it than blindly throwing algorithms into production and hoping they’re successful.
Whether you’re a hobbyist wanting to experiment with bandits in your free time or someone at a big company who wants to optimize an algorithm before exposing it to users, you’re going to need to evaluate your model offline. This post discusses some methods I’ve found useful in doing this.
For this post I’m using the Movielens 25m dataset. This dataset includes roughly 25 million movie ratings of 62,000 movies provided by 162,000 users of the University of Minnesota’s Movielens service.
To cast this dataset as a bandit problem, we’ll pretend that a user rated every movie that they saw, ignoring any sort of non-rating bias that may exist. Since bandit algorithms have a time component (they can only see data from the past, which is constantly updated as the model learns), I shuffle the data and create a pseudo-timestamp value (which is really just a row number, but this is enough for a simulated bandit environment). To further simplify the problem, I recast it from a 0-5 star rating problem to a binary problem of modeling whether or not a user “liked” a movie: a rating of 4.5 stars or more counts as a “liked” movie, and anything less as a movie the user didn’t like. To further aid learning, I discard movies with fewer than 1,500 ratings, since too few ratings for a movie can cause the model to get stuck in offline evaluation, for reasons that will make more sense soon.
The end result is a dataset of roughly 6.5 million binary like/no-like movie ratings of the form:
\([timestamp, userId, movieId, liked]\).
I do this by constructing the following get_ratings_25m function, which creates the dataset and turns it into a viable bandit problem.
import numpy as np
import pandas as pd

def read_movielens_25m():
    ratings = pd.read_csv('ratings.csv', engine='python')
    movies = pd.read_csv('movies.csv', engine='python')
    links = pd.read_csv('links.csv', engine='python')
    movies = movies.join(movies.genres.str.get_dummies().astype(bool))
    movies.drop('genres', inplace=True, axis=1)
    # join on the movieId column (movies must be indexed by movieId for the join to line up)
    df = ratings.join(movies.set_index('movieId'), on='movieId', how='left', rsuffix='_movie')
    return df

def preprocess_movielens_25m(df, min_number_of_reviews=20000):
    # remove ratings of movies with < N ratings. too few ratings will cause the recsys to get stuck in offline evaluation
    counts = df.movieId.value_counts()
    movies_to_keep = counts[counts >= min_number_of_reviews].index
    df = df.loc[df['movieId'].isin(movies_to_keep)]
    # shuffle rows to debias order of user ids
    df = df.sample(frac=1)
    # create a 't' column to represent time steps for the bandit to simulate a live learning scenario
    df['t'] = np.arange(len(df))
    df.index = df['t']
    # rating >= 4.5 stars is a 'like', anything lower is a 'dislike'
    df['liked'] = df['rating'].apply(lambda x: 1 if x >= 4.5 else 0)
    return df

def get_ratings_25m(min_number_of_reviews=20000):
    df = read_movielens_25m()
    df = preprocess_movielens_25m(df, min_number_of_reviews=min_number_of_reviews)
    return df
Now that we have a dataset, we need to construct a simulation environment to use for training the bandit. A traditional ML model is trained by building a representative training and test set, where you train and tune a model on the training set and evaluate its performance using the test set. A bandit algorithm isn’t so simple. Bandits are algorithms that learn over time. At each time step, the bandit needs to be able to observe data from the past, update its decision rule, take action by serving predictions based on this updated decision-making policy, and observe a reward value for these actions. The time component means that the training data that the bandit has at its disposal is constantly changing, and that the score you use to evaluate it is also changing over time based on small pieces of feedback from the most recent time step, rather than based on feedback from a large test set like you’d use with a traditional ML approach.
This learning process is computationally tedious when there are a large number of time steps. In a perfect world, a bandit would view each event as its own time step and make a large number of small improvements. With large datasets and the need for offline evaluation, this is often unreasonable. Bandits can be very slow to train if they’re updated once for each row in your dataset, and using large datasets is important in an offline evaluation setting because a large number of observations end up needing to be discarded by the algorithm (more on this in the next section). For these reasons, it proves useful to deviate from the theoretical setting by batching the learning process in two ways.
First, we batch in time steps. Instead of updating the algorithm once per rating event, we can update it once every \(n\) events, requiring \(\frac{t}{n}\) training steps instead of \(t\) to make it through the whole dataset. This shortcut is a realistic one, as even a live production environment is probably going to be making updates on some sort of cron schedule.
Second, we can expand this from a single-movie recommendation problem to a slate recommendation problem. In the simplest theoretical setting, a bandit recommends one movie and the user reacts by liking it or not liking it. When we evaluate a bandit using historic data, we don’t always know how a user would have reacted to our recommendation policy, since we only know the user’s reaction to the movie they were served by the system that was in production when they visited the website. We need to discard such recommendations, and for this reason, recommending one movie at a time proves inefficient due to the large volume of recommendations we can’t learn from.
To learn more efficiently, we can instead recommend slates of movies. A slate is just a technical term for recommending more than one movie at a time. In this case, we can recommend the bandit’s top 5 movies to a user, and if the user rated one of those movies, we can use that observation to improve the algorithm. This way, we’re much more likely to receive something from this training iteration that helps the model to improve.
Slate recommendations are picking up research interest due to their practicality. Most modern recommender systems recommend more than one piece of content at a time, after all (see: YouTube, Netflix, Spotify, etc.). These papers from Ie et al. (1) and Chen et al. (2) are helpful examples of modern approaches to slate recommendation problems.
Last, we need to create a second dataset that represents a subset of the full dataset. I call this dataset history in my implementation, because it represents the historic record of events that the bandit is able to use to influence its recommendations. Because a bandit is an online learner, it needs a dataset containing only events prior to the current time step we're simulating in order for it to act like it will in a production setting. I do this by initiating an empty dataframe prior to training with the same format as the full dataset I built in the previous section, and growing this dataset at each time step by appending new rows. The reason it's useful to keep this as a separate dataframe rather than just filtering the complete dataset at each time step is that not all events can be added to the history dataset. I'll explain which events get added to this dataset and which don't in the next section of this post, but for now, you'll see in the code below that the history dataframe is updated by our scoring function at each time step.
Here's how this all looks in Python. Note that this uses a replay_score function which we haven't yet defined. I'm also using a naive recommendation policy that just selects random movies, since this post is about the training methodology rather than the algorithm itself. I'll explore various bandit algorithms in a future post.
# simulation params: slate size, batch size (number of events per training iteration)
slate_size = 5
batch_size = 10
df = get_ratings_25m(min_number_of_reviews=1500)
# initialize empty history
# (the algorithm should be able to see all events and outcomes prior to the current timestep, but no current or future outcomes)
history = pd.DataFrame(data=None, columns=df.columns)
history = history.astype({'movieId': 'int32', 'liked': 'float'})
# initialize empty list for storing scores from each step
rewards = []
for t in range(df.shape[0]//batch_size):
    t = t * batch_size
    # generate recommendations from a random policy
    recs = np.random.choice(df.movieId.unique(), size=(slate_size), replace=False)
    # send recommendations and dataset to a scoring function so the model can learn & adjust its policy in the next iteration
    history, action_score = replay_score(history, df, t, batch_size, recs)
    if action_score is not None:
        action_score = action_score.liked.tolist()
        rewards.extend(action_score)
Your bandit's recommendations will differ from those generated by the model that produced your historic dataset. This mismatch creates some of the key challenges in evaluating these algorithms using historic data.
The first reason this is problematic is that your data is probably biased. An online learner requires a feedback loop where it presents an action, observes a user’s response, and then updates its policy accordingly. A historic dataset is going to be biased by the mechanism that generated it. Your algorithm assumes that it is what generated the recommendation, but in reality, everything in your dataset was generated by a completely separate model or heuristic. An ideal solution to this is to randomize the recommendation policy of the production system that’s generating your training data to create a dataset that’s independent and identically distributed and without algorithmic bias. You may not have the ability to implement this if you’re receiving data from an outside party or if randomizing a recommendation policy for the sake of better training data is too harmful of a user experience, but it’s worth at least being aware of algorithmic bias in your training data if it’s going to affect the bandit you’re training.
The second problem is that your algorithm will often produce recommendations that differ from the recommendations seen by users in the historic dataset. You can't supply a reward value for these recommendations because you don't know how the user would have responded to a recommendation they never saw. You can only know how a user responded to what was supplied to them by the production system. The solution to this is a method called replay (Li et al., 2010). Replay evaluation essentially takes your historic event stream and your algorithm's recommendations at each time step, and throws out all samples except those where your model's recommendation is the same as the one the user saw in the historic dataset. This, paired with an unbiased data-generating mechanism (such as a randomized recommendation policy), proves to be an unbiased method for offline evaluation of an online learning algorithm.
Netflix’s Jaya Kawale and Fernando Amat provide a nice visual explanation of Replay in these slides from a 2018 conference talk. In this image, there is a production movie recommendation policy (top row) and an offline recommendation policy from a bandit they’re training (bottom). Replay selects the cases where the two recommendation policies agree with each other (the columns with black boxes surrounding them: users 1, 4, and 6), and uses only these play/no-play decisions to score the offline model. So, in this example, the model gets a score of 2/3, since 2 of the 3 matches between the two policies were played.
One drawback to this method is that it significantly shrinks the size of your dataset. If you have \(k\) arms and \(T\) samples, you can expect to have \(\frac{T}{k}\) usable recommendations for evaluating the model (Li et al., 2010). For this reason, you’re going to need a large dataset in order to test your algorithm, since replay evaluation is going to discard most of your data. Nicol et al. (2014) explore ways to improve this via bootstrapping, but for this post I’m using the classic replay method for evaluating the models.
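To build intuition for that \(\frac{T}{k}\) estimate, here's a quick toy simulation (my own sketch, not from Li et al.): with \(k\) arms and both the logging policy and the offline policy choosing uniformly at random, roughly one sample in \(k\) survives replay.

```python
import numpy as np

# Toy check of the T/k rule of thumb: with k arms and a uniformly random
# logging policy, an independent offline policy matches the logged action
# about 1/k of the time, so ~T/k samples survive replay.
rng = np.random.default_rng(0)
k, T = 10, 100_000
logged = rng.integers(0, k, size=T)   # actions chosen by the "production" policy
offline = rng.integers(0, k, size=T)  # actions our offline bandit would have chosen
usable = (logged == offline).sum()
print(usable, T / k)  # usable lands close to T/k = 10000
```

A real production logging policy is rarely uniform, so the usable fraction in practice depends on how often your bandit happens to agree with it, but T/k is a reasonable planning estimate.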
def replay_score(history, df, t, batch_size, recs):
    # reward if rec matches logged data, ignore otherwise
    actions = df[t:t+batch_size]
    actions = actions.loc[actions['movieId'].isin(recs)]
    actions['scoring_round'] = t
    # add rows to history if recs match logging policy
    # (pd.concat replaces the since-deprecated DataFrame.append)
    history = pd.concat([history, actions])
    action_liked = actions[['movieId', 'liked']]
    return history, action_liked
It's important to note that replay evaluation is more than just a technique for deciding which events to use for scoring an algorithm's performance. Replay also decides which events from the original dataset your bandit is allowed to see in future time steps. In order to mirror a real-world online learning scenario, a bandit starts with no data and adds new data points to its memory as it observes how users react to its recommendations. It's not realistic to let the bandit have access to data points that didn't come from its recommendation policy. We basically have to pretend those events didn't happen; otherwise the offline bandit is going to receive most of its data from another algorithm's policy and will basically just end up copying the recommendations reflected in the original dataset. For this reason, we should only add a row to the bandit's history dataset when the replay technique returns a match between the online and offline policies. In the above function, this can be seen in the history dataframe, to which we only append actions that are matched between the policies.
The final result of this is a complete bandit setting, constructed using historic data. The bandit steps through the dataset, making recommendations based on a policy it's learning from the data. It begins with zero context on user behavior (an empty history dataframe). It receives user feedback as it recommends movies that match the recommendations present in the historic dataset. Each time it encounters such a match, it adds this context to its history dataset and can use it as future context for improving its recommendation policy. Over time, history grows larger (although never nearly as large as the original dataset, since replay discards most recommendations), and the bandit becomes more effective at its movie recommendation task.
The last piece you'll need to evaluate your bandit is one or more evaluation metrics. The literature around bandits focuses primarily on something called regret as its metric of choice. Regret can be loosely defined as the difference between the reward of the arm chosen by an algorithm and the reward it would have received had it acted optimally and chosen the best possible arm. You will find pages and pages of proofs showing upper bounds on the regret of any particular bandit algorithm. For our purposes, though, regret is a flawed metric: to measure it, you need to know the reward of the arms that the bandit didn't choose, and in the real world you will never know this! Analyses of this optimal, counterfactual world are academically important, but they don't take us far in the applied world. We need another metric.
The good news is that, while we can’t measure an algorithm’s cumulative regret, we can measure its cumulative reward, which, in practical terms, is just as good. This is simply the cumulative sum of all the bandit’s replay scores from the cases where a non-null score exists. This is my preferred metric for evaluating a bandit’s offline performance.
It may also be useful to include some metrics that are more meaningful to the specific task the bandit is performing. If it's recommending articles or ads on a website, for example, you may want to measure an N-timestep trailing click-through rate to see how CTR improves as the algorithm learns. If you're recommending videos or articles, you may want to measure the completion rate of the views the algorithm generates to make sure it's not recommending clickbait.
In the case of this dataset, I’ll implement a cumulative reward metric and a 50-timestep trailing CTR, and return both as lists so they can be analyzed as a time series if needed.
cumulative_rewards = np.cumsum(rewards)
trailing_ctr = np.asarray(pd.Series(rewards).rolling(50).mean())
Training a multi-armed bandit using a historic dataset is a bit cumbersome compared to training a traditional machine learning model, but none of the individual methods involved are prohibitively complex. I hope some of the logic laid out in this post is useful for others as they approach similar problems, allowing you to focus on the important parts without getting too bogged down by methodology.
Another thing worth noting is that I’m figuring this out as I go! If you know a better way to go about this or disagree with the approach I’m laying out in this post, send me a note and I’d be interested in discussing this.
Code for this post can be found on github.
For the past several months I’ve been collecting bid prices from the adtech auctions taking place in my browser. What follows are some findings from this data and what they tell us about monetization strategy in digital media.
If you want to collect your own data, I’ve open sourced the chrome extension I used to collect data for this post. Check it out here!
The primary method through which websites make money is selling ads (we’ll ignore subscriptions, sponsored content, etc. in this post.) In the early days of the Internet ads were sold directly to advertisers. This proved to be profitable, but left both sides of the transaction dissatisfied; the publisher couldn’t sell enough ads to monetize all of their pageviews, and the advertiser had trouble reaching scale, with the limiting factor being that both sides needed to negotiate pricing and logistics before the ad campaign could run. Monetizing a site’s traffic was too slow and required too much human input to work at Internet scale.
The answer to this problem was programmatic ads. Programmatic ads allow a site to auction off its remaining ad inventory on an open market. Advertisers, similarly, can reach essentially the entire world population through these open markets if they're willing to pay. The primary way most sites sell programmatic ads is through an ad exchange that's built into their ad server (AdX and OpenX are two prominent examples). The exchange then sends bid requests to thousands of demand-side platforms that are able to place bids to buy individual ad impressions on behalf of the brands they represent.
Using only one ad exchange, though, leaves a publisher’s business overly dependent on a single outside party, and also leaves them limited to the advertisers working with that particular exchange. The practice of header bidding has taken off in recent years as a response to this. Header bidding allows a publisher to make ad inventory available to several ad exchanges in parallel to the exchange that’s native to their ad server. The winning bids from all the exchanges are then able to compete with each other, with the most valuable bid winning the ad impression. This increased competition means that the publisher is able to get higher prices for their ad inventory. For a more thorough explainer, Digiday explains the practice better than I will.
There’s very little publicly-available data in the adtech space. And for good reason! For anyone whose business is to participate in advertising auctions, data (and what they do with it) is their secret sauce. Similarly, my employer wouldn’t have been too keen on me writing a post using company data. So I made my own dataset.
While most of the advertising world is hidden, we can get a glimpse into one special class of auction: client-side header bidding. Here the bids are placed within a user’s browser, making them accessible if you’re able to identify and understand the requests coming from an auction. I built a chrome extension called Auction House that parses a browser’s requests and collects data from ad auctions, making it easy to run your own adtech data collection.
I collected the following data on each bid using this tool:
In total I collected 96,306 bids from 30,600 auctions from January 1 to July 20, 2019. What follows are my primary findings for what this data can teach us about pricing in online advertising auctions.
The first point of interest is seasonality. At the population level there are several levels of seasonality in the programmatic advertising market. Demand moves according to time of day, day of week, and day of month, as well as quarterly and annual seasonality. Time of day and day of week seasonality are mostly due to consumer behavior; you’re more likely to buy something on nights and weekends, and therefore your attention is worth more to advertisers.
These trends are supported in this dataset. Using boxplots to show both mean and interquartile range of CPM (cost per 1000 impressions), Friday through Sunday ad prices are significantly higher than Monday through Thursday prices. The lower quartile price for Saturday and Sunday is roughly the mean price on weekdays, which is a pretty large gap. The time-of-day pattern is less pronounced, but prices are slightly higher at night.
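The aggregation behind those boxplots is a straightforward pandas groupby. A minimal sketch follows; the timestamp and cpm column names are illustrative assumptions, not the actual schema of my dataset, so substitute whatever your collection format uses.

```python
import pandas as pd

# Sketch: summarize CPM by day of week (column names are assumed, not actual)
bids = pd.DataFrame({
    'timestamp': pd.to_datetime(['2019-01-04', '2019-01-05', '2019-01-07']),
    'cpm': [2.10, 3.40, 1.75],
})
bids['weekday'] = bids['timestamp'].dt.day_name()
# mean and sample count per weekday; quartiles could be added via .quantile
weekday_cpm = bids.groupby('weekday')['cpm'].agg(['mean', 'count'])
```

The same pattern extends to hour-of-day or day-of-month seasonality by swapping the `dt` accessor used to derive the grouping key.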
There are also more global seasonal trends. The demand for online ad impressions is far from uniform throughout the year. A given company’s ads generally don’t make their way onto the Internet until the company has struck a deal with a demand side platform (DSP) who will handle the technical overhead of participating in ad auctions on their behalf. The DSP will set a series of goals for the company’s ads, including how many ads they plan to deliver, the expected cost, and the timeframe in which it will execute its ad buys. As a result of this fairly-traditional purchasing process, the demand for ads mostly follows the same business cycle as traditional business. This means that campaigns are typically set up to run on a monthly, quarterly, and annual basis. Demand rises through each of these cycles, partially re-setting at the end of each cycle until it reaches its peak during the Christmas holiday.
These trends weren't as pronounced as I'd hoped they would be in this data, but they're still visible. January has a low CPM, and you can see a slight decrease at the ends of March and June (when Q1 and Q2 budgets are expiring.) You can also see the more frequent, local spikes in demand coming from weekly seasonality. It will be interesting to look back on this next January if I keep recording this data, as the CPM gain is more dramatic in Q4.
In the below plot, the red line is each day’s average CPM, with the gray region being the interquartile range. The mean CPM is always in the upper end of the interquartile range, as there are many high prices skewing the data in the positive direction (while the other end of the distribution is bounded by a minimum value of zero.)
Last, you can see that the seasonal trend isn’t the same for all sites. BuzzFeed and BuzzFeed News have steady growth, while Vox and USA Today are noisy. ESPN implements a high price floor throughout March causing artificially high CPM. CNN disappears completely for a few weeks in June when they apparently temporarily stopped using Prebid. Recode disappears when their parent company dissolved their site in April. This variety is fitting, as there are many adtech strategies a publisher can pursue, and the collection of sites I examine in this data employ them all to varying extents. Each figure in the below grid could be a case study in itself.
The sell-side partners (SSPs) a publisher chooses to work with are a key component in their monetization strategy. Working with more exchanges generally means more demand has access to a site’s ad inventory, which means there’s a higher chance that it sells for a high price. Client-side header bidding also adds latency to a website, however, so at a certain point adding additional SSPs begins to degrade site performance, leading to a worse user experience and traffic decline.
Here are the most common SSPs I’ve seen from observing bidding patterns over the past 7 months, ordered by the number of impressions they’ve won.
Bidder | CPM (mean) | CPM (stddev) | Impressions |
---|---|---|---|
Rubicon | 3.04 | 4.09 | 8849 |
AppNexus | 2.03 | 2.92 | 6534 |
Index | 2.39 | 2.74 | 5622 |
OpenX | 1.78 | 2.14 | 2367 |
TrustX | 4.94 | 5.29 | 2090 |
AOL | 2.01 | 2.80 | 1523 |
Consumable | 3.55 | 4.10 | 1083 |
AppNexus | 1.13 | 1.79 | 581 |
TripleLift | 3.88 | 2.94 | 538 |
PubMatic | 2.48 | 1.54 | 385 |
These are all big names in the industry, each having collected a massive network of DSPs providing publishers access to the majority of the internet’s display advertising demand. There is also a long tail of less-known SSPs in this data. Among these are Colossus and Aardvark, which are apparently SSPs but I couldn’t even find their company websites.
From the sites I examined, the sweet spot for how many exchanges to include in one’s header seems to be around four. This doesn’t mean it’s optimal, but it does appear to be an industry standard among major publishers.
Related to the number of SSPs a publisher works with is how many are placing bids for a given ad impression. Breaking it out this way, you can see the expected result: more auction participants leads to a higher clearing price. This fits with what existing economic theory teaches us: holding supply constant, an increase in demand (here by way of expanding the number of bidders with access to a site's inventory) will lead to higher prices. The pattern is imperfect (four bids have a lower CPM than three in this data), but this is probably because it's looking at data across several sites.
Bids Submitted | CPM | Count |
---|---|---|
1 | 1.64 | 4264 |
2 | 2.22 | 5980 |
3 | 3.35 | 7775 |
4 | 2.68 | 8717 |
5 | 3.42 | 2329 |
6 | 5.35 | 1092 |
It’s worth noting that this relationship isn’t perfectly causal in this data. The reason why SSPs are submitting bids is an unobserved confounding factor. Maybe the user has unpurchased items in their Amazon shopping cart, for example, which is causing both higher bid prices from advertisers and a higher number of bids to be placed. It’s not possible to remove all these unknowable confounding factors, but the pattern is clear enough where you can feel confident in saying that increasing the amount of demand with access to a given piece of ad inventory leads to higher prices.
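For concreteness, a table like the one above can be produced by grouping bids into auctions, counting participants, taking the winning (highest) bid as the clearing price, and averaging within each bid-count bucket. This is a hedged sketch; auction_id and cpm are assumed column names, not my dataset's actual schema.

```python
import pandas as pd

# Sketch: clearing price as a function of auction depth (assumed column names)
bids = pd.DataFrame({
    'auction_id': [1, 1, 2, 2, 2, 3],
    'cpm': [1.2, 2.0, 0.8, 3.1, 2.5, 1.6],
})
# one row per auction: number of bids and the winning (max) bid
auctions = bids.groupby('auction_id')['cpm'].agg(n_bids='count', clearing_cpm='max')
# average clearing price at each level of auction depth
by_depth = auctions.groupby('n_bids')['clearing_cpm'].mean()
```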
Another important driver of ad value is the size of the ad. Similar to with traditional advertising, larger ads have a higher market rate.
Creative Size | CPM (mean) | CPM (stddev) | Impressions |
---|---|---|---|
728x90 | 2.27 | 3.07 | 11012 |
300x250 | 2.78 | 3.48 | 10011 |
300x600 | 2.68 | 3.42 | 5052 |
970x250 | 3.57 | 4.81 | 2585 |
1030x590 | 16.41 | 2.76 | 261 |
970x90 | 1.34 | 1.37 | 172 |
640x480 | 14.27 | 8.54 | 58 |
160x600 | 1.17 | 1.52 | 48 |
It's a bit deceptive to look at this in cross-site data, since every site's implementation of these ads will be slightly different and drive different values. You can see, however, that a 970x250 ad is worth significantly more than a 728x90, which will often be eligible to serve in the same top-of-page ad slot. Similarly, a 300x600 ad has a higher CPM on average than a 160x600, which it generally competes with for space in a site's sidebar.
Other comparisons from this table are less fair. A 728x90, for example, typically serves in a completely different section of a page than a 300x250, meaning that their performance metrics are vastly different. This causes their CPMs to differ for reasons entirely unrelated to the size itself. Other sizes, such as Vox’s 1030x590 ads, are not industry standard and serve in a completely different way than traditional ads, making it unfair to compare against more standard sizes.
In general, all else equal, bigger is better in terms of an ad’s value.
Another theme I found in this data is the impact that ad density has on pricing. Economics 101 says that if you flood a market with supply without a corresponding decrease in demand, the per-unit price goes down. This is exactly what happens when a website adds additional ads.
It’s not obvious how to define “the market” in this context, but if you think of it as the market for ad impressions on a given site, or for a given subset of users, then it follows that a site increasing the number of ads per pageview can negatively impact the CPM it receives. This is probably a fair way of looking at things, as ad campaigns are often limited to small pools of users and websites, making market definitions very small and specific in scope.
In the above plot I define "density" as the number of ads placed on a given pageview and "CPM" as the average price per 1000 ad impressions across websites at each density. The clear downward trend suggests that adding additional ads to a page has at least a slight negative impact on price. This could be studied more thoroughly by holding sites and ad sizes constant, but it's hard to do that and maintain a useful sample size without access to a large site's internal data.
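Computing that density metric is another small groupby, sketched below with assumed column names (pageview_id, cpm): count the ads observed on each pageview, then average CPM across all impressions at each density level.

```python
import pandas as pd

# Sketch: average CPM by ad density (column names are assumptions)
bids = pd.DataFrame({
    'pageview_id': ['a', 'a', 'a', 'b', 'b', 'c'],
    'cpm': [2.0, 1.0, 3.0, 4.0, 2.0, 5.0],
})
# density = number of ads on the pageview each impression belongs to
density = bids.groupby('pageview_id')['cpm'].transform('count')
cpm_by_density = bids.groupby(density)['cpm'].mean()
```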
There are a few more things I might look into with this data. Namely, the sample size is getting large enough where it might be an interesting task to try to forecast future CPMs. It also may be interesting to parse out keywords from the urls that impressions belong to and see if there’s any relationship between the content of a page and its CPM. For example, do non-brand-safe keywords (“murder”, curse words, etc.) have a negative impact on CPM? Do commercial keywords (“shopping”, “product”, “amazon”) have a positive impact? Clustering for topics and finding their relationship with ad price might have some interesting results.
I’ll keep collecting data for the time being and might revisit this later. If you’ve made it this far, thanks for reading and let me know what you think!
Baseball is an old game, and for the most part, we play it the same way today as we did several decades ago. One maneuver disrupting the old way of play in recent years is the infield shift. MLB.com defines the most common type of shift as "when three (or more, in some cases) infielders are positioned to the same side of second base" (mlb.com). Once a rare maneuver, the LA Times reported in 2015 that usage of The Shift had nearly doubled each year from 2011 to 2015 (LA Times). It seems that, in the aftermath of the early Moneyball era, where most teams have by now made significant investments in analytics, the value of strategic infield positioning has become widely appreciated.
A Typical Infield Shift // mlb.com
The reasons for shifting can be many. Mike Petriello provides an excellent analysis of when and why it happens in 9 things you need to know about the shift. The most obvious recipients of The Shift are lefty hitters whose spray chart shows a heavy skew toward hitting down the first-base line. It's also clear that some teams have embraced The Shift more than others. The Astros, for example, have shifted on over 40% of plate appearances in 2018, while the Cubs barely shift at all.
The goal of this project is to see if shifts can be predicted. Predicting shifts is interesting for a few different reasons. If you’re on the defensive side and want to better know when your infield is supposed to be shifting, it may be helpful to see how likely the rest of the league would be to shift in that same situation. If you’re deciding whether to send in a pinch hitter, it will be useful to know both whether the defense is expected to shift, and how effective your batter is going to be against the maneuver. And, more generally, The Shift is just a fun maneuver that I’d like to better understand.
I will begin by collecting data from Baseball Savant, which tells us how the defense is positioned at a given point in time. From there, I’ll create features to describe game context, player identity, and player ability that will help to form predictions. Last, I’ll build five different models, ranging from simple generalized linear models to more advanced machine learning techniques, to see just how effectively The Shift can be predicted before it happens. Let’s get started.
The LA Dodgers Getting Shifty // DENIS POROY / GETTY
I use pybaseball to collect statcast data. Baseball Savant has made field position classifications available since the beginning of the 2016 season, so I collect data from 2016 to present (August 2018 at the time of writing this post). Data collection is a simple one-liner.
from pybaseball import statcast
df = statcast('2016-03-25', '2018-08-17')
For a long query like this one, the scraper takes a while to complete. I recommend running this and then leaving it alone for a while. Save a copy once you have it to avoid having to re-scrape.
Now, some simple cleaning to make feature engineering and analysis possible. To start, I:

- filter to regular season games only
- convert game_date into a proper datetime type
- create a unique plate appearance key (game_pk + at_bat_number)
- use inning_topbot to determine which team is pitching and which is batting
- create a target variable, is_shift, defined as equaling 1 when if_fielding_alignment is equal to Infield shift

In code, it looks like this:
# only consider regular season games
df = df.loc[df['game_type'] == 'R',]
df['game_date'] = pd.to_datetime(df['game_date'])
df['atbat_pk'] = df['game_pk'].astype(str) + df['at_bat_number'].astype(str)
# we don't have column for which team is pitching, but we know the home team pitches the top and away pitches bottom
df['team_pitching'] = np.where(df['inning_topbot']=='Top', df['home_team'], df['away_team'])
df['team_batting'] = np.where(df['inning_topbot']=='Top', df['away_team'], df['home_team'])
#is_shift == 1 if the defense was shifted at any point during the plate appearance
df['is_shift'] = np.where(df['if_fielding_alignment'] == 'Infield shift', 1, 0)
shifts = pd.DataFrame(df.groupby('atbat_pk')['is_shift'].sum()).reset_index()
shifts.loc[shifts.is_shift > 0, 'is_shift'] = 1
df = df.merge(shifts, on='atbat_pk', suffixes=('_old', ''))
df = df[pd.notnull(df['is_shift'])]
Last, since our goal is to predict whether a shift occurred on a given plate appearance, we’ll want to reshape the data so that each record reflects a single plate appearance. Pybaseball’s statcast data comes in the lowest form of granularity offered by Baseball Savant: the individual pitch. To move the data to plate appearance-level granularity, I group by atbat_pk
and select the first row of each plate appearance, which shows the game state when a player first steps up to bat. This is an oversimplification of what happens throughout the plate appearance (maybe someone stole a base, maybe there was a pitching change), but it’s a good enough representation of reality to predict how the defense would play the situation.
plate_appearances = df.sort_values(['game_date', 'at_bat_number'], ascending=True).groupby(['atbat_pk']).first().reset_index()
Taking the first row of each plate appearance isn’t enough to build useful features, of course. The pitch-level data will still be used to create features representing player ability and game context.
We’ll begin feature engineering by creating the two most obvious features that come to mind for predicting shifts: how often does the current batter get shifted against, and how often does the defensive team shift in general?
To avoid information leakage (information slipping into the model from points in time happening after the present atbat_pk
), these features will be calculated using expanding means. This means that for each point in time, we’ll calculate the shift-rate from the beginning of time up until the present plate appearance, ignoring all future data that is unknown at that point in time. These features are calculated below.
# calculate an expanding mean for a given feature
def get_expanding_mean(df, featurename, base_colname):
    # arrange rows by date + PA # for each batter
    df = df.sort_values(['batter', 'game_date', 'at_bat_number'], ascending=True)
    # calculate each batter's mean-to-date at each PA's point in time
    feature_to_date = df.groupby('batter')[base_colname].expanding(min_periods=1).mean()
    feature_to_date = pd.DataFrame(feature_to_date).reset_index()
    feature_to_date.columns = ['batter', 'index', featurename]
    if 'index' in df.columns:
        df = df.drop(columns='index')
    # join new feature onto the original dataframe
    df = df.reset_index()
    df = pd.merge(left=df, right=feature_to_date, left_on=['batter', 'index'],
                  right_on=['batter', 'index'], suffixes=['old', ''])
    return df
plate_appearances = get_expanding_mean(plate_appearances, 'avg_shifted_against', 'is_shift')
# shift rate to date for each team at each point in time
plate_appearances = plate_appearances.sort_values(['team_pitching', 'game_date'], ascending=True)
shifts_to_date = plate_appearances.sort_values(['team_pitching', 'game_date'], ascending=True).groupby('team_pitching')['is_shift'].expanding(min_periods=1).mean()
shifts_to_date = pd.DataFrame(shifts_to_date).reset_index()
shifts_to_date.columns = ['team_pitching', 'index', 'def_shift_pct']
plate_appearances = plate_appearances.drop(columns='index')
plate_appearances = plate_appearances.reset_index()
plate_appearances = pd.merge(left=plate_appearances, right=shifts_to_date, left_on=['team_pitching','index'], right_on=['team_pitching', 'index'], suffixes=['old',''])
Note that I created a function called get_expanding_mean
in the above code chunk because later in this analysis I’ll repeat this same procedure for other features. The def_shift_pct
feature requires a slightly different grouping, however, so it gets calculated without a dedicated function.
A quick check of our shift leaders looks about right. These numbers aren’t a perfect match with the ones Baseball Savant shows in its leaderboard, but my PA counts match those of Baseball Reference while Savant’s don’t. This suggests that Baseball Savant is applying some sort of filtering to its data, while my PAs are unfiltered.
Name | Shift_Rate |
---|---|
chris davis | 89% |
ryan howard | 88% |
david ortiz | 84% |
joey gallo | 80% |
lucas duda | 77% |
brandon moss | 74% |
brian mccann | 73% |
colby rasmus | 71% |
adam laroche | 70% |
mitch moreland | 69% |
Just for fun, we can also see how it broke down by season and handedness, using the raw average rather than the expanding mean.
Name | Year | Bats | Shift_Rate |
---|---|---|---|
ryan howard | 2016 | L | 94% |
chris davis | 2017 | L | 94% |
chris davis | 2018 | L | 92% |
chris davis | 2016 | L | 92% |
joey gallo | 2018 | L | 89% |
justin smoak | 2018 | L | 89% |
colby rasmus | 2018 | L | 88% |
mark teixeira | 2016 | L | 88% |
carlos santana | 2018 | L | 87% |
curtis granderson | 2018 | L | 86% |
It’s interesting to see Teixeira, a switch hitter, make an appearance now that we’re grouping by handedness. This is the first feature of many that will show that shift rate alone doesn’t tell the full story!
Applying the same procedure to teams shows who leans on this maneuver most heavily.
Team | Shift Rate |
---|---|
HOU | 37% |
TB | 34% |
NYY | 22% |
BAL | 22% |
MIL | 21% |
MIN | 18% |
SEA | 18% |
COL | 17% |
CWS | 16% |
PIT | 15% |
SD | 14% |
OAK | 14% |
ARI | 14% |
LAD | 13% |
BOS | 13% |
PHI | 13% |
TOR | 11% |
KC | 11% |
CLE | 11% |
DET | 11% |
ATL | 10% |
WSH | 10% |
CIN | 10% |
LAA | 9% |
TEX | 9% |
MIA | 8% |
NYM | 7% |
SF | 7% |
STL | 5% |
CHC | 5% |
The Astros and Rays are by far The Shift’s biggest adopters, while the Cubs, Cardinals, and a few others barely shift at all.
As we’ll see in Model #1 later in this post, these two features alone capture much of the information required in order to predict The Shift. There are, of course, reasons not to stop here and call it a day: what about batters with few historic plate appearances? These instances will surely be impacted by the instability of small sample sizes. What about switch hitters? The Teixeira example shows that shift rates lie in these cases.
In the context of shift-decisions, a batter’s identity is really a proxy for several things: his handedness, power, expected launch angle, sprint speed, and even who bats after him. If we capture some of these things directly, we may both improve our model and bring stability to its ability to predict shifts for new players.
Capturing batter and pitcher handedness is simple:
# stand (batter_bats)
plate_appearances['left_handed_batter'] = np.where(plate_appearances['stand'] == 'L', 1, 0)
# pitcher_throws
plate_appearances['pitcher_throws_left'] = np.where(plate_appearances['p_throws'] == 'L', 1, 0)
I then expand the player-profile feature set by repeating the previously-defined expanding average procedure for other variables provided by Baseball Savant: woba_value (how many points each PA contributes to wOBA), babip_value (same, but for BABIP), launch_angle, launch_speed, and hit_distance_sc. A modification of this procedure is also applied to obtain how many plate appearances we’ve seen to date for the current batter within this data, taking an expanding count instead of an expanding average.
plate_appearances = get_expanding_mean(plate_appearances, 'woba', 'woba_value')
plate_appearances = get_expanding_mean(plate_appearances, 'babip', 'babip_value')
plate_appearances = get_expanding_mean(plate_appearances, 'launch_angle', 'launch_angle')
plate_appearances = get_expanding_mean(plate_appearances, 'launch_speed', 'launch_speed')
plate_appearances = get_expanding_mean(plate_appearances, 'hit_distance_sc', 'hit_distance_sc')
# number of plate appearances observed for each player at the time of the current PA
plate_appearances = plate_appearances.sort_values(['batter', 'game_date', 'at_bat_number'], ascending=True)
pas = plate_appearances.sort_values(['batter','game_date','at_bat_number'], ascending=True).groupby('batter')['index'].expanding(min_periods=1).count()
pas = pd.DataFrame(pas).reset_index()
pas.columns = ['batter', 'index', 'pas']
plate_appearances = plate_appearances.drop('index', axis=1)
plate_appearances = plate_appearances.reset_index()
plate_appearances = pd.merge(left=plate_appearances, right=pas, left_on=['batter','index'], right_on=['batter', 'index'], suffixes=['old',''])
Let’s check these numbers and see if they look right. First off, I’m seeing familiar faces on the wOBA leaderboard. That’s a good sign.
Name | wOBA | PAs |
---|---|---|
mike trout | 0.434 | 2316 |
aaron judge | 0.417 | 1208 |
joey votto | 0.415 | 2561 |
j. d. martinez | 0.413 | 2153 |
paul goldschmidt | 0.403 | 2579 |
bryce harper | 0.400 | 2271 |
freddie freeman | 0.399 | 2199 |
josh donaldson | 0.398 | 2064 |
david ortiz | 0.397 | 1238 |
nolan arenado | 0.395 | 2519 |
kris bryant | 0.395 | 2366 |
Checking the leaderboard for exit velocity also looks good, showing Statcast darlings Stanton, Judge, and Ortiz all near the top during this period. These velocities are slightly lower than what we see in the MLB leaderboard, which is probably because I’m including exit velocities for all plate appearance-ending events, including outs, whereas MLB is probably only including exit velocity on hits.
Name | Exit Velocity | PAs |
---|---|---|
david ortiz | 92 | 1238 |
nelson cruz | 91 | 2392 |
pedro alvarez | 91 | 1026 |
miguel cabrera | 90 | 1866 |
kendrys morales | 90 | 2229 |
giancarlo stanton | 90 | 1999 |
aaron judge | 90 | 1208 |
ryan zimmerman | 90 | 1636 |
matt olson | 90 | 740 |
josh donaldson | 90 | 2064 |
We’re off to a good start, having captured several components of the batter’s player profile, as well as measures of how often shifts occur for each batter and defense. Still missing from this data, however, is a sense of context. Shifts don’t happen in a vacuum. Priced into a shift-decision is the context of the current plate appearance (the score, men on base), and perhaps the team’s memory of how the current batter performed in previous plate appearances. If a player successfully hits out of the shift twice in a row, for example, a team may be less likely to try it a third time. This is the next category of feature I’ll create, attempting to capture the context in which a plate appearance occurs.
The first piece of context I’ll create is the base state. I’ll create four features: binary flags stating whether there’s a man on first, second, and third at the beginning of the plate appearance, and a count feature saying how many men are on base in total. Something that I haven’t tried is taking interactions of the binary flags (e.g. man on first AND second, first AND third, second AND third). I might add that in if I revisit this later.
df['man_on_first'] = np.where(df['on_1b'] > 0 , 1, 0)
df['man_on_second'] = np.where(df['on_2b'] > 0 , 1, 0)
df['man_on_third'] = np.where(df['on_3b'] > 0 , 1, 0)
df['men_on_base'] = df['man_on_first'] + df['man_on_second'] + df['man_on_third']
The result looks like this:
on_1b | on_2b | on_3b | man_on_first | man_on_second | man_on_third | men_on_base |
---|---|---|---|---|---|---|
572761 | 607231 | NaN | 1 | 1 | 0 | 2 |
621550 | NaN | NaN | 1 | 0 | 0 | 1 |
621550 | NaN | NaN | 1 | 0 | 0 | 1 |
621550 | NaN | NaN | 1 | 0 | 0 | 1 |
621550 | NaN | NaN | 1 | 0 | 0 | 1 |
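For anyone curious about the interaction flags mentioned above, a minimal sketch could look like the following. The frame and feature names are my own invention for illustration, not something I actually fed the model:

```python
import pandas as pd

# toy base-state flags for three plate appearances
df = pd.DataFrame({
    'man_on_first':  [1, 1, 0],
    'man_on_second': [1, 0, 1],
    'man_on_third':  [0, 0, 1],
})

# pairwise interactions: 1 only when both binary flags are 1
df['first_and_second'] = df['man_on_first'] * df['man_on_second']
df['first_and_third']  = df['man_on_first'] * df['man_on_third']
df['second_and_third'] = df['man_on_second'] * df['man_on_third']
print(df[['first_and_second', 'first_and_third', 'second_and_third']].values.tolist())
```

Multiplying the 0/1 columns is equivalent to a logical AND, so each new feature fires only for the specific two-runner configuration it names.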
Another piece of context that might matter is when the game took place. Maybe The Shift’s likelihood can be attributed in part to the year (teams saw it work last season, so they made it a bigger part of their strategy the following season) and the time of year (teams know less about their opponents earlier in the season, so they shift less). We already have a feature for what year it is. Let’s create one for the month as well.
df['Month'] = df['game_date'].dt.month
Next, let’s make the game’s score easier for the model to work with. We already know each team’s score, but a tree-based model needs to take two steps to make use of this, and a linear model can’t gain much from it at all, because each team’s score is only interesting in the context of their opponent’s. Taking the difference between the two should help.
df['score_differential'] = df['fld_score'] - df['bat_score']
The Savant data has several categorical variables I’d like to use. To make use of these, I’ll create dummies for each one of them: for each unique value in the categorical variable, a binary feature is created as a flag for whether the variable took on that value. I’m doing this for a few features, but it will be particularly interesting for team_pitching
(capturing a team’s defensive strategy) and batter
(capturing the leftover features of a player’s profile that we haven’t been able to control for with the model’s other features).
The time-related features will be cast as dummies as well in case there are month- or year-specific effects that deviate from the linear relationship that a model might glean from representing these as continuous features.
Last, I’ll also create dummies from the events
feature. These can’t be used directly, as they provide future-information that is not known at the beginning of the plate appearance. They can, however, be lagged and used to describe what’s happened in the batter’s most recent plate appearances, which I’ll do next.
# create dummies for pitching team, batting team, pitcher id, batter id
dummies = pd.get_dummies(plate_appearances['team_pitching']).rename(columns=lambda x: 'defense_' + str(x))
plate_appearances = pd.concat([plate_appearances, dummies], axis=1)
dummies = pd.get_dummies(plate_appearances['team_batting']).rename(columns=lambda x: 'atbat_' + str(x))
plate_appearances = pd.concat([plate_appearances, dummies], axis=1)
dummies = pd.get_dummies(plate_appearances['batter']).rename(columns=lambda x: 'batterid_' + str(x))
plate_appearances = pd.concat([plate_appearances, dummies], axis=1)
dummies = pd.get_dummies(plate_appearances['pitcher']).rename(columns=lambda x: 'pitcherid_' + str(x))
plate_appearances = pd.concat([plate_appearances, dummies], axis=1)
# bb_type dummies (to be lagged)
dummies = pd.get_dummies(plate_appearances['bb_type']).rename(columns=lambda x: 'bb_type_' + str(x))
plate_appearances = pd.concat([plate_appearances, dummies], axis=1)
plate_appearances.drop(['bb_type'], inplace=True, axis=1)
# month and year dummies
dummies = pd.get_dummies(plate_appearances['Month']).rename(columns=lambda x: 'Month_' + str(x))
plate_appearances = pd.concat([plate_appearances, dummies], axis=1)
dummies = pd.get_dummies(plate_appearances['game_year']).rename(columns=lambda x: 'Year_' + str(x))
plate_appearances = pd.concat([plate_appearances, dummies], axis=1)
# events
dummies = pd.get_dummies(plate_appearances['events']).rename(columns=lambda x: 'event_' + str(x))
plate_appearances = pd.concat([plate_appearances, dummies], axis=1)
plate_appearances.drop(['team_pitching', 'team_batting', 'home_team', 'away_team', 'inning_topbot'], inplace=True, axis=1)
Lagged variables are a way to encode information about the past. For the features we have access to during the present plate appearance (did they get on base? Did the defense shift? Was the shift successful?), we can also access them for each of the batter’s previous plate appearances. This is interesting information to send to the model, as it’s almost certainly playing through the players’ minds when someone new steps up to the plate.
The first step in creating these features is defining them at the present time. Here I’ll define whether the batter got on base, whether they achieved a hit, whether the plate appearance was successful for the defensive team, and whether the plate appearance represents a shift that can be viewed as successful from the defense’s point of view. Everything else that I’ll lag already exists at present-time as a feature provided by Baseball Savant.
# reached base safely: walk, hit by pitch, or any hit
plate_appearances['onbase'] = plate_appearances.event_walk + plate_appearances.event_hit_by_pitch \
    + plate_appearances.event_single + plate_appearances.event_double \
    + plate_appearances.event_triple + plate_appearances.event_home_run
plate_appearances['hit'] = plate_appearances.event_single + plate_appearances.event_double \
+ plate_appearances.event_triple + plate_appearances.event_home_run
plate_appearances['successful_outcome_defense'] = plate_appearances.event_field_out + plate_appearances.event_strikeout + plate_appearances.event_grounded_into_double_play \
+ plate_appearances.event_double_play + plate_appearances.event_fielders_choice_out + plate_appearances.event_other_out \
+ plate_appearances.event_triple_play
plate_appearances['successful_shift'] = plate_appearances['is_shift'] * plate_appearances['successful_outcome_defense']
Missing values complicate the lagging of features, so these should be imputed before proceeding.
# simple imputations for the features that will be lagged
impute_values = {
    'hit_location': 0, 'hit_distance_sc': 0, 'launch_speed': 0,
    'launch_angle': 0, 'effective_speed': 0,
    'estimated_woba_using_speedangle': 0, 'babip_value': 0,
    'iso_value': 0, 'woba_denom': 1, 'launch_speed_angle': 0,
}
plate_appearances = plate_appearances.fillna(value=impute_values)
Now for the fun part. First sort everything chronologically for each batter. Define which columns should be lagged, and how many plate appearances into the past we should look. Then, for each lagged feature, and for each number of PAs into the past t
that should be captured, group by the batter’s id and shift the column up t
positions.
# finally: lag things for a fuller sense of context
plate_appearances = plate_appearances.sort_values(['batter', 'game_date', 'at_bat_number'], ascending=True)
cols_to_lag = ['is_shift', 'onbase', 'hit', 'successful_outcome_defense', 'successful_shift',
               'woba_value', 'launch_speed', 'launch_angle', 'hit_distance_sc', 'bb_type_popup',
               'bb_type_line_drive', 'bb_type_ground_ball', 'bb_type_fly_ball']
# how many PAs back do we want to consider?
lag_time = 5
for col in cols_to_lag:
    for time in range(lag_time):
        feature_name = col + '_lag' + '_{}'.format(time + 1)
        plate_appearances[feature_name] = plate_appearances.groupby('batter')[col].shift(time + 1)
Since this shifts the column t
positions forward in time, the first t
rows for each player will be missing the lagged value. For every other point in time, however, we will now know what happened up to t
plate appearances ago.
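This behavior is easy to see on toy data: the first t rows in each group come back as NaN, and every other row carries the group's value from t plate appearances earlier. The frame below is illustrative, not the Savant data:

```python
import pandas as pd

toy = pd.DataFrame({
    'batter': ['a', 'a', 'a', 'b', 'b'],
    'is_shift': [1, 0, 1, 1, 0],
})

# lag by one PA within each batter: row i gets that batter's value at row i-1,
# and the first row of each batter has nothing to look back at (NaN)
toy['is_shift_lag_1'] = toy.groupby('batter')['is_shift'].shift(1)
print(toy['is_shift_lag_1'].tolist())
```

Because the shift happens within each group, batter b's first row is NaN rather than inheriting batter a's last value.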
There are two things to consider when picking how many plate appearances into the past we should look. The first is that each additional point in time we choose for this window will provide some extra information, but that the amount of information we gain will decrease as the size of this window increases.
Big-I “information” in this case can be described as the extent to which the added feature surprises us. Knowing nothing about past plate appearances, adding one point of past information tells us something we didn’t know before. Knowing whether the defense shifted at t-1
tells us a lot about whether they’ll shift the next time, greatly improving our prediction at time t
. Compared to what we knew without it, point t-1
surprised us. Extending this one point in time further into the past, point t-2
tells us something new, but it’s not quite as surprising, as t-1
already captures a lot of what t-2
is telling us on average. Pair these diminishing returns with the feature bloat that they bring, and it’s probably not worth lagging more than a few points in time into the past.
The second thing to consider is that the more you lag, the more missing values your data will have. I chose to lag 5 points into the past, so my lagged features will be missing values for each batter’s first 5 plate appearances. This means I’ll have to throw out each player’s first five plate appearances in order to feed this data to a model. This doesn’t feel like much of a sacrifice at t=5
, but it’s another reason to avoid lagging too far into the past.
Here’s an example of what this looks like for the is_shift
lags for Mitch Moreland’s first 8 plate appearances. Note the NaNs in rows 1 through 5, and how a shift stays represented in the data for five plate appearances by moving through the lagged features as time progresses.
PA Number | is_shift | is_shift_lag_1 | is_shift_lag_2 | is_shift_lag_3 | is_shift_lag_4 | is_shift_lag_5 |
---|---|---|---|---|---|---|
1 | 1 | NaN | NaN | NaN | NaN | NaN |
2 | 1 | 1 | NaN | NaN | NaN | NaN |
3 | 0 | 1 | 1 | NaN | NaN | NaN |
4 | 0 | 0 | 1 | 1 | NaN | NaN |
5 | 1 | 0 | 0 | 1 | 1 | NaN |
6 | 1 | 1 | 0 | 0 | 1 | 1 |
7 | 1 | 1 | 1 | 0 | 0 | 1 |
8 | 0 | 1 | 1 | 1 | 0 | 0 |
I chose t=5
somewhat arbitrarily. It seemed like enough to capture most of what will be in a player’s recent memory without exploding the model’s feature space to the point of complicating the training process. There’s room for experimentation here, though, which I’d encourage for anyone planning on using a model like this in a serious way.
That was a marathon, but we now have an interesting and expansive set of features to build models on, covering player profile, ability, and context. Let’s predict some shifts.
My general modeling approach is to use k-fold cross validation for a few different models. For this reason, a fair model evaluation will necessitate a training set (which will be broken into folds for multiple train/test splits) and a holdout set, which will only be accessed once a final and “best” model is selected, for a blind evaluation of its performance.
Given the 655,847 plate appearances in this data, a 3-fold cross validation will entail 349,784 training and 174,893 test samples for each of its three train/test splits, with 131,170 samples remaining in the 20% holdout set to assess model performance.
This is set up as follows:
train_percent = .8
train_samples = int(plate_appearances.shape[0] * train_percent)
holdout_samples = int(plate_appearances.shape[0] * (1 - train_percent))
y = plate_appearances['is_shift']
X = plate_appearances.drop(['is_shift', 'batter', 'pitcher'], axis=1)
X_train = X[:train_samples]
X_holdout = X[train_samples:]
y_train = y[:train_samples]
y_holdout = y[train_samples:]
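As a quick sanity check, the arithmetic behind those split sizes can be reproduced in a few lines, using the row count quoted above:

```python
import math

n = 655_847                       # total plate appearances in the data
train = int(n * 0.8)              # rows available for cross validation
holdout = n - train               # blind 20% holdout set

# 3-fold CV: the largest test fold holds ceil(train / 3) rows,
# and each split trains on the remaining rows
fold_test = math.ceil(train / 3)
fold_train = train - fold_test

print(train, holdout, fold_train, fold_test)
# → 524677 131170 349784 174893
```

These match the 349,784 / 174,893 / 131,170 figures quoted above.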
For the sake of my own sanity, I’m going to sacrifice some of my models’ accuracy in the name of cutting down their training time by doing some preliminary feature selection. This isn’t the only way to do this, but my method of choice is to train a random forest and use Scikit-Learn’s built in gini importances in order to rank features by their importance to the model. Everything with an importance above a cutoff value will remain in the model, while the other features will be thrown out.
#train a random forest
n_estimator = 100
rf = RandomForestClassifier(max_depth=3, n_estimators=n_estimator, n_jobs=3, verbose=2)
rf.fit(X_train, y_train)
# use random forest's feature importances to select only important features
sfm = SelectFromModel(rf, prefit=True, threshold=0.0001)
# prune unimportant features from train and holdout dataframes
feature_idx = sfm.get_support()
feature_names = X_train.columns[feature_idx]
X_train = pd.DataFrame(sfm.transform(X_train))
X_holdout = pd.DataFrame(sfm.transform(X_holdout))
X_train.columns = feature_names
X_holdout.columns = feature_names
As is common in high dimensional data, these data had several sparsely-populated features that contained very little information about shift decisions. Chief among these low-information features were the pitcher and batter dummies, which populated the majority of the feature space. After dropping low-importance features, only a few of these player ID variables remained.
For an idea about which features the final Random Forest model used in this project found most important, the ranked Gini importances of its top features are shown here:
Ranked Feature Importances from Random Forest Model
This shows that most of the information is captured by whether the defense shifted in recent plate appearances, as the three most important features are the two most recent shift-lags and the batter’s historic shifted-against rate. After that, there’s a noticeable dropoff between the shift-related features and those describing game state, player ability, and the more-distant past.
Trimming low-importance features takes us from 2,781 to 109 features, and reduces the training time of a 60-model XGBoost parameter search from 4 hours down to just 29 minutes. The cost of this is essentially nonexistent. In a first-pass at modeling this, the 2,700-feature XGBoost model obtained an AUC score of 0.911 and the 100-feature model scored 0.910. That’s a tradeoff I will happily take.
A model is only interesting in the context of what it’s competing against. A good baseline to start with is the worst model imaginable, taking zero covariates into consideration. The no-model model simply guesses the majority class 100% of the time. In this case, the classes are imbalanced, so even no-model scores well on accuracy. Here y_train.mean()
shows that 14% of plate appearances contain The Shift, so a model that only guesses no shift (y = 0) will be correct 86% of the time. Any lift in accuracy above this point will be learned from data.
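That baseline accuracy falls straight out of the class balance. A quick illustration with a toy label vector at the same 14% shift rate (the vector is illustrative, not the real labels):

```python
import numpy as np

# toy labels: 14 shifts out of 100 plate appearances
y = np.array([1] * 14 + [0] * 86)

# the no-model model: always predict the majority class (no shift)
y_pred = np.zeros_like(y)
accuracy = (y_pred == y).mean()
print(accuracy)  # → 0.86
```

Any real model has to beat this number before its accuracy means anything.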
A reasonable next step from this is a simple model using the bare-minimum set of features needed to understand The Shift. In this case, the model is logistic regression, and the only features are:

* avg_shifted_against, the batter’s historic shifted-against rate
* def_shift_pct, the defending team’s overall shift rate to date
These two features collectively capture a lot of information, giving an understanding of what we know about The Shift without any machine learning or clever feature engineering.
A few lines get us this improved logistic baseline:
lr = LogisticRegression(n_jobs=3)
lr.fit(X_train[['avg_shifted_against', 'def_shift_pct']], y_train)
y_pred_lr = lr.predict_proba(X_test[['avg_shifted_against', 'def_shift_pct']])[:, 1]
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_pred_lr)
So: knowing only how often a batter is shifted against and how often the defense shifts in general, how well can we predict The Shift?
It turns out, the answer is “pretty well.”
This model achieves 88.6% out of sample accuracy and an AUC score of 0.887. The accuracy is slightly better than that of the no-model model, and the high AUC score shows that the model has learned a decent amount.
Given that we have 88.6% accuracy and 0.887 AUC knowing only these two things, we can tell that a large amount of the mental calculus going into a shift-decision can be summarized by the batter’s identity and the defense’s overarching philosophy toward this maneuver.
With two baselines established, the next step in the ladder of modeling complexity is a logistic model with two improvements on the previous one:

* it uses the full set of engineered features rather than the two shift-rate features alone
* it uses LogisticRegressionCV to shrink the model’s weights toward zero, optimizing the shrinkage parameter to maximize accuracy on unseen data

lr2 = LogisticRegressionCV(n_jobs=3)
lr2.fit(X_train, y_train)
y_pred_lr2 = lr2.predict_proba(X_test)[:, 1]
fpr_lr2, tpr_lr2, _ = roc_curve(y_test, y_pred_lr2)
This model performs much better than the simple logit, with an accuracy of 92.2% and an AUC score of 0.940. This is a big leap in performance, considering how close the previous model had already come to perfect accuracy and AUC. This comes at the cost of training time, which is increased due to the larger feature space and tuning of the regularization parameter over three folds.
Next I’ll cross an arbitrary boundary between what one might call purely statistical modeling and Machine Learning
with something tree-based. I opt to use a random forest classifier here because it’s a good baseline for how far machine learning will take you in a modeling task. It’s hard to overfit, relatively quick to train, and typically serves as a good hint for whether it’s worth going all-in on a more powerful nonparametric modeling approach such as gradient boosting or a neural network.
Since this and the following model are slower to train and have more tunable parameters than the previously-used linear models, I’ll first build a timer function and a modeling function so I can be consistent with how the best versions of these models are selected.
I’ll use randomized parameter search to find an optimal set of hyperparameters. This has been shown to produce superior results to grid search in less time for two reasons: it doesn’t need to run an exhaustive search to explore the parameter space, and it can explore the space more completely by sampling parameter values from distributions, rather than a grid search’s approach of sampling only from user-specified lists of values. In my implementation, I’ll accept a draws
parameter, for how many draws from the sample distributions should be tested while exploring the parameter space, and a folds
parameter, for how many folds we’d like to cross validate over.
def timer(start_time=None):
    if not start_time:
        start_time = datetime.now()
        return start_time
    elif start_time:
        thour, temp_sec = divmod((datetime.now() - start_time).total_seconds(), 3600)
        tmin, tsec = divmod(temp_sec, 60)
        print('\n Time taken: %i hours %i minutes and %s seconds.' % (thour, tmin, round(tsec, 2)))
        return None
def model_param_search(model, param_dict, fit_dict=None, folds=3, draws=20):
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=1001)
    random_search_model = RandomizedSearchCV(model, param_distributions=param_dict,
                                             fit_params=fit_dict, n_iter=draws,
                                             scoring='roc_auc', n_jobs=1,
                                             cv=skf.split(X_train, y_train), verbose=10,
                                             random_state=1001)
    start_time = timer(None)
    random_search_model.fit(X_train, y_train)
    timer(start_time)
    print('\n All results:')
    print(random_search_model.cv_results_)
    print('\n Best estimator:')
    print(random_search_model.best_estimator_)
    print('\n Best hyperparameters:')
    print(random_search_model.best_params_)
    return random_search_model
Applying this to the random forest model, I’ll draw from a uniform distribution to tune the number of trees used in the forest, simultaneously toggling the number of features considered in each tree, the criterion used for assessing the quality of decision-splits, and a flag for whether to use down-sampling to overcome the training data’s imbalanced classes caused by The Shift’s rarity.
rf_params = {
'n_estimators': st.randint(4,200),
'max_features': ['sqrt', .5, 1],
'criterion': ['entropy', 'gini'],
'class_weight': ['balanced', None]
}
random_search_rf = model_param_search(model=RandomForestClassifier(verbose=0, n_jobs=4), param_dict=rf_params, draws=10)
In the interest of time, I sample only ten times from this search space while training over three folds, equaling 30 models trained in total. Given access to better hardware, I’d feel more comfortable with the final model having doubled this number, but alas, my 12” macbook wasn’t having it.
This ended up taking just over three hours to complete, with the following as its most successful set of parameters:
{
'class_weight': None,
'criterion': 'entropy',
'max_features': 0.5,
'n_estimators': 184
}
The result was a model with 92.9% accuracy and 0.957 AUC score. This marks an improvement from the logistic score, but not as big a jump as we’d achieved by introducing contextual features in the previous step.
The last solo-model I’ll train is a gradient boosting machine using XGBoost. This is a step up from the Random Forest in complexity, and usually a good choice for a final, best model in a project like this, evidenced by its almost unmatched success in Kaggle competitions. The obvious benefit of gradient boosting is that it’s usually able to produce superior results to other tree-based methods by placing increased weights on the samples it has the most trouble classifying (general info here), but it comes at the cost of having more hyperparameters to tune and a greater propensity to overfit. In my experience it’s generally 2x the work for a 1 - 5% performance boost over a random forest. In this case, my goal is accuracy, so I’ll take it.
My approach to training this model is similar to the previous: I conduct a randomized parameter search, this time taking 20 draws from a set of parameter distributions and cross validating over three folds, making for 60 models in total.
The hyperparameters I tune are:
# using proper random search setup instead of a set list of available options
xgb_params = {
'min_child_weight': st.randint(1,10),
'gamma': [0], #st.uniform(0, 5),
'subsample': st.uniform(0.5, 0.5),
'colsample_bytree': st.uniform(0.4, 0.6),
'max_depth': st.randint(2, 10),
'learning_rate': st.uniform(0.02, 0.2),
'reg_lambda': [1, 10, 100],
'n_estimators': st.randint(10, 1000)
}
min_child_weight
controls complexity by requiring a minimum weight in order to make a new split in a tree. gamma
and reg_lambda
perform similar functions, where reg_lambda
is an L2 regularizer on the model’s weights and gamma
is the minimum loss reduction needed to add a new split to a tree. subsample
is the percentage of rows that a tree is allowed to see, and colsample_bytree
does the same thing, but for features rather than records. Last, max_depth
defines the depth of each tree, n_estimators
defines the number of trees to be built, and learning_rate
sets the pace at which the model is allowed to update its weights at each step during gradient descent.
To manage overfitting, I also pass a dictionary of model-fitting parameters to define an early stopping rule. This just means that, for each model, I will halt training and select the best version of the model so far if the test loss doesn’t improve for ten consecutive iterations. Assuming the out of sample loss score follows a convex pattern, this means we can find the best model without overfitting.
# create a separate holdout set for XGB early stopping
train_percent = .9
train_samples_xgb = int(X_train.shape[0] * train_percent)
test_samples_xgb = int(X_train.shape[0] * (1 - train_percent))
X_train_xgb = X_train[:train_samples_xgb]
X_test_xgb = X_train[train_samples_xgb:]
y_train_xgb = y_train[:train_samples_xgb]
y_test_xgb = y_train[train_samples_xgb:]
xgb_fit = {
'eval_set': [(X_test_xgb, y_test_xgb)],
'eval_metric': 'auc',
'early_stopping_rounds': 10
}
xgb = XGBClassifier(objective='binary:logistic',
silent=False, nthread=3)
random_search_xgb = model_param_search(model=xgb, param_dict=xgb_params, fit_dict = xgb_fit)
One additional step I take in the above code chunk is that I create a new, additional holdout set to use for this model’s training. I’m doing this because XGBoost’s early stopping method doesn’t work well within sklearn’s randomized parameter search module. While an sklearn model will use each cross validation iteration’s holdout fold for parameter tuning using RandomizedSearchCV
, XGBoost requires a static evaluation set to be passed to its eval_set
parameter. It would be cheating to pass the holdout set used for final model evaluation into this parameter, so I create an intermediate holdout set out of our training set to give the model something to evaluate against for parameter tuning.
A better approach would be to define the model’s cross validation and parameter search manually in this case, passing each iteration’s holdout fold as the eval set during cross validation. This would have the benefit of letting the model see 100%, rather than 90%, of the training data, and also of using all of the training data for evaluation rather than just our new and static 10% holdout set. Given more time or a more serious use case for the model, I’d recommend going this route.
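A rough sketch of that manual setup is below. It uses a lightweight stand-in classifier and synthetic data so it runs anywhere; for the real thing you would swap in XGBClassifier and make the commented fit call, passing each fold's validation split as eval_set with early stopping. Everything here is illustrative, not the code I actually ran:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import ParameterSampler, StratifiedKFold

# synthetic stand-in data
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

param_dists = {'C': [0.01, 0.1, 1.0, 10.0]}  # stand-in search space
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=1001)

best_auc, best_params = -np.inf, None
for params in ParameterSampler(param_dists, n_iter=3, random_state=1001):
    fold_aucs = []
    for train_idx, val_idx in skf.split(X, y):
        model = LogisticRegression(**params, max_iter=1000)
        # with XGBoost you would instead call:
        # model.fit(X[train_idx], y[train_idx],
        #           eval_set=[(X[val_idx], y[val_idx])],
        #           eval_metric='auc', early_stopping_rounds=10)
        model.fit(X[train_idx], y[train_idx])
        preds = model.predict_proba(X[val_idx])[:, 1]
        fold_aucs.append(roc_auc_score(y[val_idx], preds))
    mean_auc = float(np.mean(fold_aucs))
    if mean_auc > best_auc:
        best_auc, best_params = mean_auc, params

print(best_params, round(best_auc, 3))
```

This way each fold's validation split doubles as the early-stopping eval set, so no separate static holdout is needed and all of the training data gets used.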
Training 60 models isn’t quick, so I recommend setting this up and forgetting it for a while. About 3 hours later, the best-performing hyperparameters were:
{
'colsample_bytree': 0.8760738910546659,
'gamma': 0,
'learning_rate': 0.21627727424943,
'max_depth': 8,
'min_child_weight': 7,
'n_estimators': 408,
'reg_lambda': 10,
'subsample': 0.8750373222444883
}
None of these values are at the extreme ends of the distributions defined for the parameter search, so I feel safe in saying that the parameter space has been explored adequately.
The resulting model has 93.2% accuracy and 0.957 AUC score. As expected, this is better than the random forest, but not by a lot.
The last model is a simple ensemble of the three best models. The intuition behind this is that each model has learned something from the data, and that one model’s deficiencies may be corrected by the strengths of the others. Regression toward the mean is your friend when all models involved are good models.
In this case, there’s the additional benefit that we have three very different models: a generalized linear model, a random forest, and a gradient boosting machine. The benefits of ensembling are greatest when the models aren’t highly correlated.
I’ll weight the models in order of performance, giving more attention to the XGBoost model and less to the logistic model.
# an ensemble of the three best models
y_pred_ensemble = (3*y_pred_xgb + 2*y_pred_rf + y_pred_lr2) / 6
fpr_ens, tpr_ens, _ = roc_curve(y_holdout, y_pred_ensemble)
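That 3:2:1 weighting is just a weighted average of the models’ predicted probabilities, which can be sketched generically (the probabilities in the usage below are made up):

```python
# Sketch: a weighted average of per-model predicted probabilities,
# generalizing the 3:2:1 weighting used above.
def weighted_ensemble(predictions, weights):
    """predictions: list of equal-length probability lists, one per model."""
    total = sum(weights)
    n = len(predictions[0])
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n)
    ]

# e.g. weighted_ensemble([xgb_probs, rf_probs, lr_probs], [3, 2, 1])
```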
Another approach I tried unsuccessfully was model stacking. In this approach I took the outputs of these same three models and fed them to a logistic regression model. To my surprise, this model-of-models approach didn’t perform any better than this simpler ensembling method, so I’m not going to give its results any further attention here.
Because The Shift is relatively uncommon, I lean on AUC score as my metric of choice. Accuracy is more interpretable as a metric, so I will report on that too, but AUC is the fairest way to compare these models against one another.
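Part of why AUC holds up under class imbalance is its rank-based interpretation: it equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A minimal sketch of that formulation:

```python
# AUC as a rank statistic: the fraction of (positive, negative) pairs where
# the positive example is scored higher, with ties getting half credit.
# This is the Mann-Whitney formulation of AUC.
def auc_score(y_true, y_score):
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because it only depends on the ranking of scores, AUC is unaffected by how rare the positive class is, unlike accuracy.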
Model | AUC | Accuracy |
---|---|---|
Logistic Without Context | 0.887 | 88.6% |
Logistic With Context | 0.940 | 92.2% |
Random Forest | 0.957 | 92.9% |
XGBoost | 0.957 | 93.2% |
Average of Models | 0.959 | 93.2% |
Stacked Models | 0.955 | 92.8% |
My first thought when seeing this is that it’s a good reminder that most of your gains in any modeling task come from feature engineering. Using hand-crafted features improves the logistic AUC from 0.88 to 0.94, and gives it nearly four percentage points of increased accuracy. That’s a huge gain in a problem where only 14% of observations come from the minority class that we’re trying to predict. Moving from the logistic model to more complicated models, however, shows smaller gains. The features got us most of the way there, and machine learning provided the incremental “last mile” gains needed to move from a decent model to an optimal one.
AUC Scores for All Models
The different models’ ROC curves show this visually. There’s a huge gain in going from just the two main variables to including all of our hand-crafted features. After that, there are some gains to show for using more advanced modeling techniques, but the gains are comparatively small.
The best model here is the average of the three best standalone models, which provides an AUC score of almost 0.96 and an accuracy above 93%. Those are both really good scores, indicating that these models have learned a lot from the underlying data.
One thing that can make the modeling experience much better in a case like this is to define a criterion for dropping features that don’t provide much information to the model. An approach I found success with was to use sklearn’s SelectFromModel function in conjunction with RandomForestClassifier’s built-in gini importance scores. Another approach that may have worked here would have been to use an L1-penalized logistic regression model, optimize the shrinkage parameter, and throw out the features whose weights had shrunk to zero.
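A rough sketch of that kind of importance cutoff, mirroring SelectFromModel’s default of keeping features at or above the mean importance (the feature names and scores below are invented for illustration):

```python
# Sketch: keep features whose importance clears a threshold, mirroring
# SelectFromModel's default of the mean importance. The importances used
# in practice would come from a fitted model's feature_importances_.
def select_features(importances, threshold=None):
    """importances: dict of feature name -> importance score."""
    if threshold is None:
        threshold = sum(importances.values()) / len(importances)
    return sorted(f for f, imp in importances.items() if imp >= threshold)
```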
Something that isn’t in this model, which I wish I’d used, was each team’s record at the time of each game. I suspect that strategic positioning happens less when the playoffs are out of reach. This could be captured by using win-percentage and games-left-in-season as features in the model, and I suspect this would provide some lift in the final results.
Another thing I wish I’d been able to capture in this model is a player’s sprint speed. I don’t believe this is present in the data that I was working with, but I would guess that this plays at least a minor role when a team decides whether to shift.
One question this leaves untested is whether I went far enough in exploring the idea of modeling this as a sequence learning problem. Providing lagged features clearly gives considerable lift over using static features alone. What would happen if we represented players’ at-bat histories as sequences and trained an LSTM on the relationship between these histories and their corresponding defensive positioning? The data are reasonably large, so it’s not completely crazy to think that this approach would work.
All in all, I’m pretty happy with how this turned out. Taking player profile, ability, and game context into account, these models show that we can predict The Shift with over 93% accuracy and an AUC score of 0.959. A perfect model is probably impossible here, but I’d like to think up a few new features that could help recover the final 7% of missed predictions. If I come up with anything interesting, I’ll update this post.
The 2018 MLB season has so far been just like every other season: filled with ups, downs, win streaks, teams plagued with injuries, and so on. In this post I aim to catch up on the current season with a single chart, showing how the leagues’ rankings have changed throughout the year. My visualization of choice here is the bump chart, a type of line chart showing changes in rankings over time. If you just want to see the final product, this is what it looks like:
If you’re still reading, here’s how you can create your own.
First we’ll need data on every team’s record at each point within the season so far. My plan is to visualize this data in R with ggplot, and there are several capable R packages for pulling baseball data (baseballr and Lahman, to name two). Since I maintain the pybaseball package, however, I’ll eat my own dogfood and start from there.
I use pybaseball.schedule_and_record(year, team_code) to fetch each team’s 2018 data. Once these data are concatenated together so the whole season’s records live in one dataframe, I clean the Date column to standardize dates across the dataframe, cut off dates that are in the future, and calculate each team’s win percentage at the end of each game day. I then export this csv so it can be used in R with ggplot.
import pandas as pd
from pybaseball import schedule_and_record
teams = ['BOS','NYY','TB','TOR','BAL','CLE','MIN','KC','CHW',
'DET','HOU','LAA','SEA','TEX','OAK','WSN','MIA','ATL',
'NYM','PHI','CHC','MIL','STL','PIT','CIN','LAD','ARI',
'COL','SD','SF']
# collect every team's record for the 2018 season
records = []
for t in teams:
    s = schedule_and_record(2018, t)
    records.append(s)
# concatenate records together so the whole season is in one dataframe
df = pd.concat(records, axis = 0)
# standardize the date formats of double-header games
df.Date = df.Date.str.replace(' (1)','',regex=False)
df.Date = df.Date.str.replace(' (2)','',regex=False)
# turn this into a date format that Pandas will recognize
df.Date = pd.to_datetime(df.Date,format='%A, %b %d')
df.Date = df.Date.map(lambda x: x.replace(year=2018))
# cut out games that haven't happened yet
df = df.loc[df.Date < '2018-08-05']
# extract win and loss values from "w-l" strings
df['W'] = df['W-L'].str.split('-').str[0].astype(int)
df['L'] = df['W-L'].str.split('-').str[1].astype(int)
df['win_pct'] = df['W'] / (df['W'] + df['L'])
df.to_csv('2018-records.csv')
Next the data get loaded into R. It is easiest to rank the teams when their win rate is known at every point in time, not only on game days. For this reason, my first task for preparing the data is to fill in these missing non-game-day dates with the win-percentage of each team’s most recent game day.
library(cowplot)
library(dplyr)
library(tidyr)
win_percentages = read.csv('2018-records.csv')
win_percentages = win_percentages[, c('Tm', 'Date', 'win_pct')]
win_percentages[is.na(win_percentages$win_pct), 'win_pct'] = 0
# create a dummy column to give dplyr left_join the effect of a cross join
dates = setNames(data.frame(unique(win_percentages$Date), dummy=1), c('Date', 'dummy'))
teams = setNames(data.frame(unique(win_percentages$Tm), dummy=1), c('Tm', 'dummy'))
# rejoin tables to have one row per day per team
df = left_join(dates, teams, by='dummy')
df = left_join(df, win_percentages, by=c('Tm','Date'))
# fill non-gameday win percentages with the previous-gameday's win percent
df = df %>% mutate(Date = as.Date(Date)) %>%
complete(Date = seq.Date(min(Date), max(Date), by="day")) %>%
group_by(Tm) %>% fill('win_pct')
# remove NAs generated by the all star break when no games were played
df = df[!is.na(df$Tm),]
Now that the data is formatted how we want it, we’ll need to rank the teams by their records. Because the bump chart would be messy with all 30 teams involved, and also because rankings across leagues don’t have much real-world value, we’ll want to separate NL teams from AL teams first. To do this, create a list of one league’s teams and use a dplyr filter on it.
al_teams = c('BOS', 'NYY', 'BAL', 'TBR', 'TOR', 'CHW','CLE',
'DET', 'KCR', 'MIN','HOU','LAA', 'OAK', 'SEA', 'TEX')
al = df %>% filter(Tm %in% al_teams)
nl = df %>% filter(!(Tm %in% al_teams))
Next, correct for double-header days by grouping on date-team combinations and taking only the last row of each group. This is the record at the end of the team’s double-header. After this, ranking teams by their records is as simple as grouping by Date, sorting by win percentage, and ordering them from best to worst. Ties in this case will naively go to the team that comes first in the alphabet.
by_date = al %>% group_by(Date, Tm) %>% filter(row_number()==n()) %>% unique()
by_date <- by_date %>% group_by(Date) %>%
arrange(Date, desc(win_pct), Tm) %>%
mutate(Rank = rank(-win_pct, ties.method = "first"))
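The ranking logic in that dplyr chain (sort by win percentage descending, break ties alphabetically, then number the result) can be sketched in Python terms, with made-up records:

```python
# Sketch of the ranking step above: sort by win percentage (descending),
# breaking ties alphabetically, then number the teams 1..n.
# The records passed in the usage example are invented.
def rank_teams(win_pcts):
    """win_pcts: dict of team code -> win percentage."""
    ordered = sorted(win_pcts, key=lambda t: (-win_pcts[t], t))
    return {team: i + 1 for i, team in enumerate(ordered)}
```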
Now on to the fun part: graphing it. We can start by defining the colors that will be associated with each team’s line on the graph. Because team colors are well known to fans, this will help the plot’s interpretability. Conveniently, there’s a website built for this exact purpose: teamcolorcodes.com. I selected a hex code for one of each team’s colors and added them to a named vector like so:
team_colors = c(BOS = '#BD3039', NYY = '#003087', TBR = '#8FBCE6', KCR = '#BD9B60',
CHW = '#27251F', BAL = '#DF4601', CLE = '#E31937', MIN = '#002B5C',
DET = '#FA4616', HOU = '#EB6E1F', LAA = '#BA0021', SEA = '#005C5C',
TEX = '#003278', OAK = '#003831', WSN = '#14225A', MIA = '#FF6600',
ATL = '#13274F', NYM = '#002D72', PHI = '#E81828', CHC = '#0E3386',
MIL = '#B6922E', STL = '#C41E3A', PIT = '#FDB827', CIN = '#C6011F',
LAD = '#005A9C', ARI = '#A71930', COL = '#33006F', SDP = '#002D62',
SFG = '#FD5A1E', TOR = '#134A8E')
First let’s create the main plot: a bump chart showing the 15 teams’ rankings throughout the progression of the season. I plot a geom_line for each value of Tm, and then flip the scale so that a lower (better) value of Rank will be at the top of the Y axis. To make this somewhat interpretable, I label the teams at their final positions (yesterday, the most recent date for which I have data) at the tail end of the chart, so that they can be traced back in time throughout the season, and color-code them with scale_color_manual so that each line matches the team’s colors. The code for the main chart is this:
p = ggplot(data = by_date, aes(x = Date, y = Rank, group = Tm)) +
geom_line(aes(color = Tm), size = .75, show.legend=F) +
scale_y_reverse(breaks = 1:32) +
geom_text(data = subset(by_date, Date == as.Date("2018-08-04")),
aes(label=Tm), size = 2.5, hjust = -.1) +
scale_color_manual(values=team_colors)
Which produces an outcome that looks like this:
This is nice, but it’s also a bit noisy. As a visual aid, it will next be nice to show each team’s line in isolation along the border. To do this, we’ll first need a function for generating graph p for each team in isolation. This is the same as above, but with the axes wiped out to minimize noise.
teamplot = function(team_code){
ggplot(data = by_date[by_date$Tm==team_code,], aes(x = Date, y = Rank, group = Tm)) +
geom_line(aes(color = Tm), size = .75, show.legend=F) +
scale_y_reverse(breaks = 1:32) +
scale_color_manual(values=team_colors) +
labs(y=team_code) +
theme(axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
axis.title.y = element_text(angle=0, size=6))
}
With this in place, all that’s left is to arrange the team-charts and league-chart side by side. Luckily, there’s a library for that. Using library(grid)
, I define a 16x4 grid. Row 1 will be a title for the entire figure. Columns 2 - 4 will belong to the main chart showing all teams. Rows 2 - 16 in column 1, finally, will belong to the 15 individual-team charts. This is put together like so:
library(grid)
pushViewport(viewport(layout = grid.layout(nrow = 16, ncol = 4)))
# helper function for defining a region in the layout
# source: http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/81-ggplot2-easy-way-to-mix-multiple-graphs-on-the-same-page/
define_region <- function(row, col){
viewport(layout.pos.row = row, layout.pos.col = col)
}
grid.text(expression(bold("American League Standings")), vp = define_region(row =1:1, col = 1:4), gp=gpar(fontsize=15))
print(p, vp = define_region(row = 2:16, col = 2:4))
teams=unique(by_date[by_date$Date=='2018-08-04','Tm'])
for (idx in 2:16){
print(teamplot(teams[[1]][[idx-1]]), vp = define_region(row = idx, col = 1))
}
which, finally, gives us a finished product that looks like this:
Replace al with nl in the code above to get the same figure for the National League, and then we’re done here. Simple enough!
Tanking becomes a hot topic each season once it becomes apparent which of the NBA’s worst teams will be missing the playoffs. Though never explicitly acknowledged, a team with no playoff hopes will sometimes suspiciously begin to decrease the minutes of its starters, leaving its worst lineups on the floor in ways that can only be described as anti-competitive. With playoff glory out of the question, the process of losing by design in pursuit of better draft picks is one of the only remaining ways to extract value from a disappointing season. And so the race to the bottom begins; the losses pile up and the fan base turns its eyes to the college talent pool, wondering which member of the upcoming draft class might be the one to turn around the struggling franchise.
The league leadership’s distaste for this strategy of designed awfulness is no secret: it’s boring to watch, bad for advertising revenue, and goes against the ideals of competitive play that Commissioner Adam Silver so adamantly supports. This phenomenon of shedding wins for improved lottery odds is of such high concern to league officials that, beginning with the 2019 draft, lottery odds will be adjusted in order to decrease teams’ incentives to tank. This post is an attempt to understand the value of the draft and of tanking in the NBA, before and after the implementation of the upcoming lottery changes.
Namely, I will address the following questions:
The data for this analysis comes from two different sources. First I manually grabbed all draft results from 1960 to present from basketball-reference.com. The second dataset comes from Kaggle, which provides a dataset of NBA season-level data for each player since 1950. Of this, I kept only the observations since the 1960 season, as anybody who’d played before 1960 wouldn’t be found in the draft pick data. Because the draft has undergone some changes since the 60s, I also removed all picks outside the first and second round, as the draft only has two rounds in its present form.
To determine the value of a draft pick, we’ll first need a way to measure player value. Two of the most commonly used metrics for overall player value are win share (WS) and player efficiency rating (PER).
PER strives to measure per-minute performance while adjusting for pace, taking into account field goals, free throws, 3-pointers, assists, blocks, steals, missed shots, turnovers, and personal fouls. The league average is always equal to 15, values in the 20s indicate star-level performance, and below 10 typically puts a player toward the end of his team’s bench (Wikipedia). Some issues with PER are that it’s measured per-minute, sometimes assigning excessive value to low-minute players and equating them with all stars. It’s also challenging to interpret, since PER points have no direct connection with wins or any other metric. Despite this, PER sees wide use due to its quality of taking multiple facets of the game into account and assigning a single value to a player’s overall per-minute performance. As this metric can become volatile for low-minutes players due to small-sample properties, I ignore PER values for players during seasons in which they averaged less than five minutes per game.
Win Share serves as a convenient response to PER’s flaws. Similar to PER, win shares take into account just about every box score statistic relevant to a player’s performance. WS holds slightly more real-world value, however, in that it’s directly tied to wins. Under Basketball Reference’s system, one team win equals one win share. Players’ contributions are weighted, dividing WS into Offensive and Defensive Win Share (OWS/DWS), where WS = OWS + DWS. Each team’s wins are divided amongst its players according to how much each has contributed on offense and defense. For details on how this is calculated, I suggest reading Basketball Reference’s description here. The all-time single-season record for win shares belongs to Kareem Abdul-Jabbar, who was responsible for 25.4 of the Bucks’ wins in 1971-72. A player on a losing team, on the other hand, has little hope of achieving a high win share, as there are fewer win shares to distribute throughout the roster.
The last statistic I’ll use in this analysis is an adaptation of sports-reference’s Approximate Value (AV) statistic. Approximate Value takes an existing stat and scales it to weight a player’s best seasons more heavily. In my case I’ll front-weight a player’s best win share seasons, adding up 1*(best season) + 0.95*(second best) + 0.9*(third best), all the way down to 0.55*(tenth best). My reason for doing this is that career win shares, as they stand, measure greatness over the span of a player’s career. Some may prefer this, but others will object that it disproportionately rewards players with better longevity and health, which usually can’t be reliably forecast on draft day. Since this is an exercise in the value of draft picks, I find it necessary to include AV as an alternative to WS. As we’ll see later, AV and WS tell approximately the same story, but I do like that this statistic evens the playing field between a player like Kareem, who had 20 healthy seasons (Career WS 273, AV 144), and Michael Jordan, who only played 15 seasons (Career WS 214, AV 147). It also allows some current players to enter the conversation, such as Kevin Durant, currently in his 11th season, who is ranked 35th in career WS (since his career is likely far from over) but 17th in ten-year AV. This adaptation of the AV statistic was inspired by this piece from FiveThirtyEight, which uses AV to evaluate the value of draft picks in the injury-plagued, short-careered National Football League.
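The AV weighting described above is simple to express directly (the win-share inputs in the usage are made up):

```python
# Approximate Value as described above: take a player's best ten win-share
# seasons and weight them 1.00, 0.95, 0.90, ..., down to 0.55 before summing.
def approximate_value(season_win_shares):
    best_ten = sorted(season_win_shares, reverse=True)[:10]
    weights = [1.0 - 0.05 * i for i in range(len(best_ten))]
    return sum(w * ws for w, ws in zip(weights, best_ten))
```

Because only the best ten seasons count, an eleventh or twelfth season adds nothing, which is exactly how AV removes the longevity advantage that raw career WS rewards.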
As a quick sanity check, let’s see who some of the all-time greats are according to these three measures.
Table 1: Top Ten Players Drafted Since 1960 According to Approximate Value (AV)
This looks about right. Career Win Share rewards greatness over the course of a career, Approximate Value rewards greatness at a player’s peak over a ten year period, and PER rewards a player’s average efficiency. Accordingly, all players on this list are deserving of legend status, AV rewards the short-careered all-time greats such as David Robinson and Magic Johnson, and PER rewards those who packed the most productivity into their time on the floor.
Since PER doesn’t appear to match the other two measures too closely, we can double-check that these metrics tell roughly the same story using a measure called Kendall correlation. Kendall correlation is a measure of the pairwise similarity or dissimilarity of two sets of rankings. A value close to 1 indicates near-perfect agreement between the ranking systems, a value close to -1 indicates near-perfect disagreement, and a value close to 0 indicates little-to-no relationship between them.
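A minimal sketch of that pairwise comparison, for tie-free rankings:

```python
# Kendall correlation: (concordant pairs - discordant pairs) / total pairs.
# This simple version assumes tie-free rankings of the same items.
def kendall_tau(a, b):
    pairs = [(i, j) for i in range(len(a)) for j in range(i + 1, len(a))]
    concordant = sum((a[i] < a[j]) == (b[i] < b[j]) for i, j in pairs)
    discordant = len(pairs) - concordant
    return (concordant - discordant) / len(pairs)
```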
Here we see a correlation of 0.89 between AV and WS, 0.61 between AV and PER, and 0.58 between PER and WS. So while AV and WS are much more similar to one another than either is to PER, all three are similar enough to accept as valid ranking metrics: each pair is significantly positively correlated (P<0.001), and each clearly succeeds at identifying high achievers.
With this in mind, we should now feel comfortable accepting that these are useful metrics that tell similar but non-identical stories of player value. Taken holistically, these should be all we need to evaluate the values of positions in the draft.
Given each player’s draft position, AV, WS and PER, the first step is to see how many points of each of these metrics can be expected from each draft position.
AV, WS, and PER by Pick Number, with Means Shown in Red
The above plot shows two of the things we’d expect: a downward trend in player value as the draft progresses, and a relatively high variance in player value at each pick. The relationship between pick number and each of these player-ability metrics is highly significant, with each having an absolute Pearson correlation of >0.41 with P<0.001. What’s more interesting, however, is that each of these three relationships appears to be nonlinear in a similar way, indicating a potentially complex relationship between tanking and draft value. Also worth noting is that many players selected within the last 10 picks of their respective draft classes had negative PER ratings, which aren’t shown on the above figure but can be inferred where the average (red) falls below the visible points on the scatter plot.
We’ll need to take a step beyond these simple averages in order to give this relationship a more believable functional form. It defies logic to say that the 8th pick, for example, is worth 3 more win shares than the 7th pick. For this reason I fit a fifth-order polynomial to this data, enforcing smoothness and a consistent decrease in pick value across the draft.
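A sketch of that smoothing step with NumPy’s polyfit/polyval, fitting a quintic to a synthetic decaying value curve rather than the real pick data:

```python
import numpy as np

# Fit a fifth-order polynomial to noisy per-pick averages so that expected
# value varies smoothly with draft position. The "pick values" here are a
# synthetic decaying curve, not the real draft data.
picks = np.linspace(0.0, 1.0, 30)          # draft positions, rescaled to [0, 1]
true_value = 40 * np.exp(-3 * picks)       # made-up decaying value curve
rng = np.random.default_rng(0)
noisy = true_value + rng.normal(0.0, 0.5, size=picks.size)

coeffs = np.polyfit(picks, noisy, 5)       # least-squares quintic fit
smoothed = np.polyval(coeffs, picks)       # smoothed expected value per pick
```

In the real analysis the x values would be pick numbers 1 through 60 and the y values the observed WS/AV/PER averages; rescaling x before fitting keeps the quintic numerically stable.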
Smoothed Expected Values for Each Pick Number. Error Bar Represents One Standard Deviation in Either Direction.
The outputs of these smoothed functions are what I’ll use for the rest of this analysis when referring to pick values. The relationships between WS, AV and pick number resemble exponential decay until these values fall off in the draft’s final picks. Expected PER decreases in a much more linear fashion outside the top ten picks, falling off similarly to AV and WS at the tail end of the draft.
The error bars in these plots represent standard deviations from the mean, meaning that 68.27% of values lie within these ranges. For WS and AV the standard deviations decrease in the draft’s later picks, indicating that as players get worse we can become more confident what their output will be. PER, on the other hand, gets less stable in the tail end of the graph. This is because PER is measured per minute, meaning that low-minutes players will sometimes have illogical ratings due to small sample properties. All three measures see higher standard deviations in the draft’s final two picks due to the fact that the league hasn’t always had 30 teams.
The expected AV, WS, and PER for each draft position is shown below.
Table 2: Expected Values and Standard Deviations of Each Pick Position
Before getting into tanking, the above numbers are valuable to teams and worth a closer look. With these values alone we can objectively evaluate the quality of a draft-day trade that involves only draft picks.
Table 3 does exactly that. The first example reads as follows:
In 2017 the Kings received picks 15 and 20 from the Blazers in exchange for the 10th pick. The picks received by the Kings were worth an expected 37.19 career win shares and combined 20.54 PER, while the Blazers’ pick was worth an expected 29.86 win shares and 11.70 PER. The Kings came out on top in this trade, netting an additional 7.33 expected win shares and 8.84 points of PER.
Table 3: WS and PER Gained/Lost in Recent Draft Day Trades Involving Only Draft Picks
The lesson learned here? Much like Richard Thaler teaches us about NFL draft pick valuations, NBA teams should trade down in the draft far more often than they currently do. The 76ers and Timberwolves appear to be on to this strategy, but the rest of the league has yet to follow suit.
This same analysis can be extended to trades involving players and future draft picks, but it involves a little more work and uncertainty. Some thoughts on how to take into account the three non-draft pick assets that get tossed around on draft day:
Under its current rules, the lottery provides the league’s worst 14 teams a chance at winning each of the first three picks. After the first three picks are determined according to randomly generated numbers, each of the following picks is assigned to the worst remaining team. Outside the worst 14 teams, draft position is predetermined according to the inverse of a team’s rank within the league. Each team’s odds of receiving each lottery placement are shown below in highlighted cells, where rows represent inverse rankings (i.e. rankings in terms of how bad a team was) and columns represent pick numbers.
Table 4: Odds of Winning Each Pick by Inverse Record Ranking. Lottery Picks are Highlighted.
Outside the lottery (picks 1 - 14), the 15th worst team has a 100% chance of earning the 15th pick, the 16th worst team gets the 16th pick, and so on down the line. Second round picks are unaffected by the lottery, meaning that the worst team gets the first pick of the draft’s second round regardless of whether it won the lottery for the first pick overall. For more detail on this process, Wikipedia has a nice overview.
Taking the above lottery odds and earlier-discussed valuations of each pick position, it’s easy to combine these in order to place a value on each end-of-season rank in terms of its expected draft-day value.
Starting with Table 4’s lottery odds, multiply the percentages in each column by the value of the corresponding pick number from Table 2. Once pick value and probability have been multiplied together for all 30 teams and 60 picks (most of these values will be zero, since there’s no uncertainty in the assignment of picks 15 - 60), take the sum of these values for each end-of-season ranking to obtain its expected draft-day value in terms of draft picks. Stated a different way, this is simply the dot product of the (ranking x pick probability) matrix and the vector of pick valuations.
Two examples to illustrate this:
Worst team in the league: 0.250(69.28) + 0.215(61.53) + 0.178(54.77) + 0.357(48.91) + 1.000(10.43) = 68.19 expected career win shares
Tenth worst team: 0.011(69.28) + 0.013(61.53) + 0.016(54.77) + 0.870(27.56) + 0.089(25.61) + 0.002(23.97) + 1.000(6.29) = 35.03 expected career win shares
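The worked examples above reduce to a dot product of lottery odds and pick values. Reproducing the worst-team case with the numbers quoted in the text (the fourth-pick probability is the remaining mass, 1 - 0.250 - 0.215 - 0.178 = 0.357):

```python
# Expected draft-day value for the league's worst team under the old lottery:
# odds for picks 1-4 times those picks' expected career win shares, plus the
# guaranteed first pick of the second round (pick 31).
pick_odds = [0.250, 0.215, 0.178, 0.357]    # pick-4 odds = remaining mass
pick_values = [69.28, 61.53, 54.77, 48.91]  # expected career WS, picks 1-4
second_round_value = 10.43                  # pick 31, received with certainty

expected_ws = sum(p * v for p, v in zip(pick_odds, pick_values)) + second_round_value
# expected_ws rounds to 68.19 career win shares
```

The full table of end-of-season values is exactly this computation repeated for all 30 rankings and all 60 picks.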
Below are the expected values of all 30 end-of-season rankings in terms of win shares, approximate value, and points of player efficiency rating. As we’d expect, these show similar patterns to the relationships between pick number and value. The two main differences are that these curves are shifted vertically and compressed, since the values now include both first and second round picks, and that they’re flatter, since each of the bottom 14 ranked teams has a chance at the top three picks. The exact values from these plots are shown in the next section in Table 5.
Expected AV, WS, and PER of Each End-of-Season Ranking
With the value of each end-of-season ranking established, we’re finally able to face the question this analysis set out to answer: what’s to gain from tanking? Should a team act anticompetitively in order to improve its lottery odds?
The value of tanking one position in the league standings is the difference between the value of a team’s current ranking and the value of falling one place closer to the bottom of the league. These values come from the weighting of draft pick odds and their values discussed in the previous section. These season-end rankings’ values are shown alongside their “tanking values” in Table 5, where a tanking value is defined as the difference between one end-of-season ranking’s expected draft-day value and that of the next worse ranking. These values are positive for all rankings and for all three measures of value, with no tanking value assigned to the worst team in the league, since there’s nowhere left to fall for a team that’s already hit rock bottom.
Table 5: Expected Value and Tanking Opportunity of Each Inverse Team Ranking
Ranking these positions by how much a team would gain by tanking one additional spot shows an interesting relationship. For all three measures of value, the worst non-lottery teams have the least to gain by tanking. It’s also interesting to see that teams expected to be placed in the middle of the lottery have more to gain than those closest to the league’s worst ranking. If we discard playoff-eligible teams from these rankings, the sweet spot for tanking is ranks five through eight. It’s valuable to tank for teams of all ranks, of course, but these mid-lottery rankings are where we see that teams have the most to gain from losing a few additional games.
To understand this relationship better, let’s plot tanking value against inverse league rank. Again, by “inverse rank” I’m referring to the opposite of a team’s place in the league standings, where the league’s worst team has an inverse rank of one, the second worst team has an inverse rank of two, and so on.
Draft-Pick Value of Falling One Position in the Rankings
After seeing this visually, a few additional things become apparent about tanking. First, there’s a significant jump in tanking value for the league’s 15th-worst team. This is because falling from 15th to 14th worst enters a team into the lottery, giving it a chance at the much-more-valuable first, second, and third picks, as well as a guarantee of the slightly-more-valuable 14th pick if the better options don’t pan out. Second, the power-law nature of draft-pick value also shows through here: tanking becomes increasingly valuable the closer a team gets to the bottom of the standings, since a disproportionate amount of the draft’s value tends to come from the first half of the first round.
Since no sensible team in the playoff hunt is going to throw wins for a better draft position, we can safely ignore the right-hand side of these plots. What’s most important is that the relationship between ranking and expected draft-pick value is nonlinear in a way that gives the league’s fifth through eighth worst teams the most to gain from losing games down the stretch.
Turning to the real-world value of falling in the rankings, however, these values individually aren’t shocking. Falling one position in the league rankings is worth an additional 0.62 points of PER, 4.35 career wins, and 3.11 approximate value points at best. This is hardly going to turn around a franchise.
A formal long-term strategy of tanking, on the other hand, might be a different story. Imagine a team whose non-tanking league ranking is seventh worst. If this team were to adopt a policy of end-of-year tanking for just two seasons in a row, falling three spots each year, they would gain an impressive 3.53 points of PER, 26.21 career win shares, and 18.28 points of AV. Stretch this out to a third year of designed awfulness and you have 5.29 PER, 39.32 WS, and 27.42 AV, all in addition to the already-high expected values of their mid-lottery odds before tanking. Suddenly this looks valuable, and it’s not far off from what the 76ers have been accused of doing in recent seasons.
The NBA is aware of this strategy, and in September 2017 came to an agreement with the Board of Governors to change the lottery’s odds in a way that disincentivizes tanking. The first major change is that, starting in 2019, the league’s three worst teams will have identical odds of winning the first pick, rather than the current odds, which favor worse records. The second change is that lottery teams will be able to fall further down in the draft than is possible under the current rules. Where the worst possible outcome for the league’s worst team is currently the fourth overall pick, starting in 2019 they will be able to fall as far as fifth.
Table 6: Lottery Odds Effective 2019
So the rewards for being among the league’s absolute-worst will decrease, and the worst-case scenario for these same teams also becomes slightly less desirable. Will this be enough to stop teams from tanking? Let’s see how severely these changes affect each end-of-season ranking and tanking value.
Table 7: Tanking Values under Current and Future Lottery Odds
Table 7 shows that the values of each end-of-season ranking will remain similar in magnitude after the new lottery rules go into effect. The values of tanking for the worst nine teams, however, will decrease, while these values increase for the five best lottery teams (ranks 10 - 14). Non-lottery teams’ odds are unaffected, as there’s no uncertainty surrounding their draft positions.
Value of Tanking Before and After Lottery Change
Plotting these new values against the current ones, it’s clear that the relationship between season-end ranking and draft value is visibly flatter under the new lottery. For those taking the anti-tanking stance, this is a good thing. For those who believe that the league’s worst teams deserve better lottery odds in order to ascend from mediocrity, this is a bad thing.
Tanking Values Before and After Rule Change
Plotting tanking values before and after this change shows a similar story. The new odds make it less valuable for inverse ranks 1 - 9 to lose games, more valuable for ranks 10 - 14, and leave ranks 15 - 30 untouched.
A final question worth answering is how this all nets out. At the league level, will it be less valuable to tank than under the current rules? The answer to this is a simple “yes.” The sum of the differences in tanking value for each rank between the old and new lottery system is negative in terms of AV, WS, and PER. Specifically, the new lottery odds result in 0.64, 5.98, and 4.10 point drops in available PER, WS, and AV to be gained from tanking respectively. Tanking will still be valuable to the league’s worst teams, but the incentive to do so has been ever-so-slightly reduced, and will be shifted slightly toward teams that are still in the playoff hunt (the 10th - 14th worst teams). As a result, one can expect a slightly more competitive league going forward.
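The league-level netting described above amounts to summing the per-rank change in tanking value between the two lottery systems. A minimal sketch, using invented placeholder values rather than the actual per-rank PER/WS/AV figures:

```python
# Net league-wide change in tanking value: sum each rank's difference
# between the new and old lottery odds. Values below are hypothetical.
old_value = {1: 5.0, 2: 4.2, 10: 1.1}   # tanking value by inverse rank, old odds
new_value = {1: 4.4, 2: 3.9, 10: 1.4}   # same ranks under the new odds

net_change = sum(new_value[r] - old_value[r] for r in old_value)
# A negative net_change means less total value is available from tanking
# league-wide, even though some individual ranks (here, rank 10) gain.
```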
I’ve been borderline obsessed with the eephus for some time now. Every time I see a player pull this pitch out of their arsenal I become equal parts excited and bamboozled, thinking both “I could throw that” and “how on earth didn’t he hit that?”
For those who aren’t familiar, here’s a quick description and history of the eephus. In short, an eephus is a blooper pitch: it has a lazy, rec-league style delivery, can arc well above the batter’s head en route to the plate, and tends to travel anywhere from 40 to 70 mph as it leaves the pitcher’s hand. It’s often difficult to tell whether it was thrown on purpose or if the pitcher temporarily forgot how to throw a baseball.
The eephus is said to have first been thrown by Bill Phillips, who made the pitch a part of his game from 1890 to 1903. Rip Sewell brought it to prominence roughly 40 years later, and it has seen sporadic use since. The pitch has gone by a variety of names over the years, including “junk pitch”, “dead fish”, “LaLob”, and “spaceball”, the last for its high arc (source: A Brief History of the Eephus Pitch - NYTimes).
The eephus travels well below the speed of an average changeup and typically arrives with no deception as to what’s coming, so why does anyone throw this bizarre pitch? The prevailing theory is that its comically slow speed throws off a batter’s calibration, making the pitches that follow appear blazing fast. In other instances, people speculate that the pitch is simply a mistake, having slipped out of the pitcher’s hand. Regardless, little research has been done to date on this uncommon pitch, and I think it deserves better than that. Thus, this post will serve as an exploratory analysis of and tribute to the mythical eephus.
Before going any further in this post, here’s some quick suggested viewing for context on the big league pitch that you could probably throw just as effectively as Clayton Kershaw:
Now that this pitch has received a sufficient amount of hype, let’s get up close and personal with the eephus and see what it looks like by the numbers. To do this we’ll need data on every eephus that’s been thrown during the Statcast and PITCHf/x eras. For this, I used the pybaseball library to retrieve the Statcast and PITCHf/x data on every Major League pitch that’s been thrown since the 2008 season. Among these 7,212,136 observations, only 2,090 represent eephus pitches. That’s roughly 0.03 percent - a rare pitch indeed!
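The filtering step itself is simple once the data is in hand. Here's a minimal sketch assuming a DataFrame shaped like pybaseball's Statcast output, where eephus pitches carry the pitch_type code 'EP'; the rows below are invented stand-ins for the real 7-million-row pull.

```python
import pandas as pd

# Toy stand-in for the full Statcast/PITCHf/x pull.
pitches = pd.DataFrame({
    'pitch_type': ['FF', 'EP', 'CH', 'EP', 'KN'],
    'release_speed': [95.2, 63.1, 84.0, 66.8, 76.5],
})

# Keep only the eephuses and compute their share of all pitches.
eephuses = pitches[pitches['pitch_type'] == 'EP']
share = len(eephuses) / len(pitches) * 100  # percent of all pitches
```

On the real data, the same two lines yield the 2,090 eephuses and the roughly 0.03 percent share quoted above.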
Eephuses thrown by season
The eephus saw its modern-era golden age in 2014, when over 400 were thrown. With the exception of the 2012 - 2015 seasons, it is most common to see fewer than 200 thrown in a given year. Turning to the list of pitchers who’ve used this pitch, it becomes clear that it’s no coincidence that the 2012 - 2015 spike in eephus use coincided with the era of a healthy R.A. Dickey. This eephus-throwing knuckleballer, in fact, is responsible for more than twice as many eephus pitches as the next-most prolific user of the pitch.
Eephus count by pitcher, 2008 - 2017
In recent history, only Dickey, Padilla, Despaigne, and Chen have been prolific enough users of the pitch to have more than 100 in-game examples under their belt. It makes sense that this would be an uncommon pitch for most of those who use it; once the eephus loses its element of surprise, it’s no longer a novel and disorienting pitch, but essentially a Little League World Series-level fastball that any major league batter worth his place on a roster would hit out of the park.
Since data on any particular pitch type is only relevant in the context of other pitches, we’ll first compare the eephus against the closest things it has to peers: the fastball, knuckleball, and changeup.
The most relevant data point here is speed: the eephus has an average speed of just 64.5 mph. That’s 23% slower than the average changeup, and 30% slower than the average fastball. Despite slowness being its defining characteristic, however, the pitch doesn’t demonstrate the low spin rate of other purposefully-slow pitches. While the knuckleball and changeup show spin rates in the 1500s and 1700s, the eephus spins at a lofty 2301 rpm - a solid 100 rpm faster than the average fastball. Because spin rate is a relatively new metric, the experts aren’t completely certain what a high or low spin rate means for pitch quality. Early research, however, suggests that a high spin rate is a good thing for a non-breaking ball.
Statcast Zones (source: Baseball Savant)
The last summary stat shown in the table above is the percentage of each pitch type that’s placed down the middle of the strike zone, along its edges, and outside. Here I use the Statcast zones shown above, defining “down the middle” as being in zone 5, “edge of strike zone” as zones 1, 2, 3, 4, 6, 7, 8, and 9, and “outside strike zone” as zones 11 through 14. At a high level, the farther pitches tend to be placed from the middle of the strike zone, the more likely it is that pitchers are using this pitch for strategic reasons and the less likely it is that a pitcher is confident in the pitch’s ability to get past a batter without being expertly placed. Here we see about what we’d expect. Fastballs are placed within the strike zone relatively more often than the slow-speed changeup and eephus, with the eephus being thrown outside the strike zone two percentage points more often than the changeup and 12 percentage points more often than the fastball. This makes intuitive sense, since one can imagine that a well-prepared power hitter could do some damage to a 60mph pitch thrown down the middle. Due to the eephus’ high arc, it may be challenging to place accurately as well, which would also contribute to how often it lands outside the strike zone.
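The zone grouping described above is just a bucketing of Statcast's numeric zone column. A small helper, written against the zone definitions given in the text:

```python
# Bucket a Statcast zone number into the three placement categories used
# in the table: zone 5 is the middle, zones 1-9 are the strike zone,
# and zones 11-14 are outside it.

def zone_bucket(zone):
    if zone == 5:
        return 'down the middle'
    if 1 <= zone <= 9:
        return 'edge of strike zone'
    if 11 <= zone <= 14:
        return 'outside strike zone'
    return 'unknown'  # missing or unexpected zone values
```

Applying this function to each pitch's zone and tabulating by pitch type reproduces the placement percentages in the summary table.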
Eephus (L) and Fastball (R) Placement from Batter's View
The above figure shows this same idea in slightly more detail. While the sample size is much smaller for the eephus than the fastball, it’s clear that eephus pitchers make a concerted effort to keep this pitch well out of reach, at the expense of it often having no chance of entering the strike zone.
While summary stats are useful, a simple average never tells the full story. To better understand baseball’s slowest pitch, let’s take a look at how its release speeds are distributed relative to these other pitches.
From this figure we can see that the eephus’ slowness is even more pronounced than one may have thought! In fact, if we throw out the fastest 1% of eephus pitches, outliers that appear to have been misclassified, we see that the remaining 99% of recorded eephus pitches are slower than 97% of recorded changeups. So while there is some overlap between the two pitches in terms of speed, the eephus is essentially in a league of its own in terms of slowness.
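The trim-and-compare step above is easy to express with numpy quantiles. This sketch runs on simulated speeds (normal draws with roughly the right means), not the real Statcast distributions:

```python
import numpy as np

# Simulated release speeds standing in for the real data.
rng = np.random.default_rng(0)
eephus = rng.normal(64.5, 4.0, 2000)
changeup = rng.normal(84.0, 3.0, 50000)

# Trim the fastest 1% of eephuses as suspected misclassifications,
# then ask what share of changeups is faster than every remaining eephus.
cutoff = np.quantile(eephus, 0.99)
trimmed = eephus[eephus <= cutoff]
changeup_share_faster = (changeup > trimmed.max()).mean()
```

On the real data, this kind of comparison is what backs the "99% of eephuses are slower than 97% of changeups" claim.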
The speed gap between the eephus and the fastball is even more pronounced. One can imagine how disorienting it would be to see an eephus float by after a 95mph fastball, or how blazingly fast this same fastball would appear after a 60mph eephus. As a side note, the bi-modality of knuckleball speeds suggests that Statcast may be misclassifying some of these pitches as knuckleballs when they’re actually eephuses. Since there’s no accurate way of saying which declared-knuckleballs are actually eephuses, however, we’ll have to leave those pitches be.
This brings us to a more practical question: does the eephus actually work? The most salient argument for its use is the one alluded to earlier: the extreme speed differential between an eephus and any other pitch both catches batters off guard for the eephus itself, and makes a non-eephus follow-up pitch appear faster and harder to track. But does this theory hold up in practice? Let’s examine the effectiveness of the eephus vs. a few more common pitches, and then test whether an eephus actually makes the following pitch harder to hit.
For examining the effectiveness of the eephus vs. all other pitches, the following five metrics provide a nice overview of how batters fare against it: contact percentage, hit percentage, launch angle, exit velocity, and barrel percent. These metrics collectively represent how hittable the pitch is, how high quality a batter’s contact with an eephus tends to be, and whether people hit the eephus for power or for contact.
First, perhaps surprisingly, batters make contact with this pitch about as often as any other, connecting with the eephus just 0.33 percentage points more often than with an average pitch. The quality of this contact, however, tends to be lower. Despite being put in play slightly more often, for example, the eephus becomes a hit almost 11% less often. A second way of looking at this is through barrel percent, measured as the percentage of pitches with an expected batting average above 0.500 based on the ball’s speed and angle off the bat: it is a tenth of a percentage point lower for eephus pitches, amounting to a 2% drop. This isn’t a large decrease, but paired with the pitch’s higher contact percent and lower hit percent, it paints a picture of frequent but low-quality contact.
Barrel percent is calculated using the ball’s exit velocity and launch angle off the bat, but these factors can be examined in isolation as well to better understand what type of contact is being made. Here both the average and distribution of these metrics show that batters’ launch angles are about the same for an eephus vs. non-eephus pitch, but the speed of the ball off their bat is slower. This is reflected by the ball’s average exit velocity being 4.29mph slower and the distribution of this metric being shifted noticeably toward the slower side for the eephus vs. every other pitch.
Now that we’ve established that the eephus itself may have the desirable quality of drawing out low-quality contact, let’s return to the theory posed earlier: is a fastball harder to hit if it’s thrown after an eephus? Do pitchers strategically throw fastballs more frequently after an eephus? These same questions could be posed for pitch types other than the fastball, but if this effect exists, this is where we’d expect it to be most pronounced, so we’ll leave the other pitches out for now. The answer to the first of these questions is a definitive “not really.” An average batter makes contact with 19.18% of fastballs thrown. When the previous pitch was an eephus, this contact percentage actually increases to 22.60%. Further, this contact tends to be high quality contact. 8.49% of eephus-preceded fastballs turned into hits, while this number is only 6.26% on average. Measuring barrels shares a similar story, where a near-average 5.4% of fastballs are barreled on average, but a much-higher 6.4% are barreled when the previous pitch was an eephus. It’s difficult to make a strong claim about the impact of an eephus on a follow-up fastball, however, due to sample size constraints. 703 post-eephus fastballs have been thrown during the PITCHf/x and Statcast eras, and only 203 of these happened since barrels became measurable in 2015. This is hardly enough data to trust these particular numbers out of sample. It appears from this analysis, however, that a fastball thrown after an eephus performs either identically or slightly better than an identical fastball under other circumstances. Based on these results, I would take any claim that a fastball is extra hard to hit after an eephus pitch with a grain of salt.
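Identifying "the pitch after an eephus" comes down to ordering pitches within each plate appearance and shifting the pitch type by one. A sketch with toy rows, assuming columns named like Statcast's (game_pk, at_bat_number, pitch_number, pitch_type):

```python
import pandas as pd

# Toy pitch sequence data standing in for the real Statcast pull.
df = pd.DataFrame({
    'game_pk':       [1, 1, 1, 1, 1],
    'at_bat_number': [1, 1, 1, 2, 2],
    'pitch_number':  [1, 2, 3, 1, 2],
    'pitch_type':    ['EP', 'FF', 'FF', 'FF', 'EP'],
})

# Order pitches within each plate appearance, then look one pitch back.
df = df.sort_values(['game_pk', 'at_bat_number', 'pitch_number'])
df['prev_pitch'] = df.groupby(['game_pk', 'at_bat_number'])['pitch_type'].shift(1)

# Fastballs whose immediately preceding pitch was an eephus.
post_eephus_fastballs = df[(df['pitch_type'] == 'FF') & (df['prev_pitch'] == 'EP')]
```

Computing contact, hit, and barrel rates on this filtered subset versus all fastballs is what produces the comparison above.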
The second of these questions is easier to answer. While approximately 64% of major league pitches are fastballs, only 47% of eephuses whose plate appearance contained a follow-up pitch were followed by a fastball. Even if we remove eephus-throwing knuckleballer R.A. Dickey from this data, the number is still below average at 61%. It looks like non-knuckleball pitchers throw fastballs at approximately their normal frequency after eephus pitches, and that R.A. Dickey steers away from the post-eephus fastball almost entirely. Perhaps this means that pitchers already understand that the extra-fast-looking post-eephus fastball is only a myth.
Since the eephus doesn’t appear to be any better than a fastball as an isolated pitch, and we’ve also debunked the theory that a fastball is more deadly when thrown after an eephus, is there any reason to consider using this pitch? Perhaps. Examining the on-base percentage (OBP) of plate appearances where the eephus was featured, and comparing this to the OBP of non-eephus plate appearances, we do see a slight decrease when the eephus is used. An eephus-containing at-bat sees the batter get on base 30.8% of the time, whereas an average plate appearance has a slightly higher OBP of 31.9%. A difference of more than an entire percentage point is larger than I would have expected here, and suggests that something about this rare pitch may indeed work in a pitcher’s favor.
Despite its incredibly slow speed, the eephus pitch manages to hold its own. Batters have trouble making high quality contact with the pitch, and in general get on base less often when the pitch is utilized in a plate appearance. That said, analyzing a rare pitch inevitably means working with small sample sizes, which means that it’s hard to gain many deep insights into this pitch beyond some simple summary stats. A word of caution, however: a pitcher should always be careful not to throw this “surprise” pitch twice in a row, lest they end up like poor Orlando Hernandez.
A core component of my team’s work centered around building lead scoring models for MLB’s 30 teams. The problem, in brief, is this: given a fan’s history of past purchases, games attended, and interaction with a team’s digital media, how likely are they to either upgrade or renew their ticket package for next season? Baseball fans are a diverse group of customers, and there is no one-size-fits-all solution to understanding what drives a team’s fan base’s purchasing habits. For this reason, each team requires its own specially tailored lead scoring model.
The lead scoring models themselves are relatively simple: usually logistic regression models with a small number of variables. The real challenges in building these models come from simultaneously optimizing lift and accuracy (which are not always positively correlated) while maintaining interpretability, which is of high importance to the teams that rely on the models’ outputs.
In the end, there were a few traits of successful lead scoring models that proved consistent across teams. Of the thousands of variables considered, there was almost always one variable that would stick in the final model representing each of the following categories: purchase recency, duration, quantity, engagement level, and source. This makes intuitive sense, since these five categories effectively explain the most important aspects of a fan’s relationship with a team. If we know how recently somebody has purchased, we have an idea of how active they are. Knowing how long they have been a customer tells us something about loyalty. The number of games attended is as direct a signal as possible of a fan’s interest in attending future games. Engagement level, measured through interaction with email marketing campaigns, measures something similar to games purchased, but through a different channel. And last, the source of a customer’s purchases (i.e., buying directly from the club or on the secondary market) tends to have an impact on future purchasing habits as well. With one variable from each of these categories placed into a model, the other couple thousand often become redundant.
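The shape of such a model can be sketched with a tiny from-scratch logistic regression: one feature per category, fit by gradient descent. Everything here is synthetic and illustrative; the real models were fit on proprietary team data with standard tooling.

```python
import numpy as np

# Synthetic lead-scoring data: one feature per category
# (recency, duration, quantity, engagement, source).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
true_w = np.array([1.0, 0.8, 1.2, 0.5, -0.7])   # invented "true" effects
y = (1 / (1 + np.exp(-(X @ true_w))) > rng.random(500)).astype(float)

# Fit logistic regression by full-batch gradient descent.
w = np.zeros(5)
for _ in range(2000):
    p = 1 / (1 + np.exp(-(X @ w)))        # predicted purchase probabilities
    w -= 0.1 * (X.T @ (p - y)) / len(y)   # gradient of the log loss

scores = 1 / (1 + np.exp(-(X @ w)))       # lead scores in [0, 1]
```

The fitted coefficients stay directly interpretable, which is the property the teams cared about: each weight says how one category of fan behavior moves the predicted purchase probability.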
The end result of this process is something I find exciting. Given one of these models, a team can understand who’s most likely to make a purchase before they make a call. The lift generated by these models makes for a tactical, data-driven sales strategy that allows a club to better understand its fans and make the most of its resources. The signals used in these models, I imagine, apply not only to the pro sports industry, but to consumer sales as a field in general.
Another type of modeling my team worked with was churn prediction for the MLB.TV streaming product. Churn prediction works similarly to lead scoring at the surface level, in the sense that both are binary classification problems, but there are a few interesting nuances to working with churn that warrant a different approach.
Because MLB.TV is a consumer-facing digital product, the most important features for determining whether somebody will churn come from their activity level. The users who churn are almost never the active ones using the app multiple times per week. Domain-specific indicators also come into play. With a baseball-streaming product, for example, the performance of a fan’s favorite team can play a role in their overall satisfaction with the product. Only a superfan, after all, will pay to watch their team lose day in and day out.
MLB.TV (image: mlb.com)
With churn modeling, however, there is no external client. This means that the model is allowed to be more flexible, since no salesperson needs to interpret its output. This gave me the opportunity to experiment with replacing our existing models with something more powerful. For this task, I fit an ensemble of a random forest classifier, gradient boosted trees (XGBoost), and a LASSO logistic regression model, resulting in a moderate performance boost.
A third project I worked on was an attempt to predict the lifetime value of a ticket purchaser. This effort cuts right at the heart of the challenges facing any data science effort to model the habits of a diverse customer base. For this domain in particular, useful signals can be elusive. An individual buyer’s purchasing habits can vary significantly year-over-year, and levels of activity range from passive one-game-per-year purchasers to superfans spending thousands of dollars. To add an additional layer of complexity, MLB’s fans come from vastly different markets. We would expect the behavior of a Kansas City fan, for example, to differ from that of a New York fan.
All of this results in a situation where a single linear model is insufficient. For tasks as complicated as this one, the out-of-the-box solutions taught in academia do not always play nicely with the problem at hand. The solution, of course, is to try a few different approaches and see what works. The answer could lie in fitting a single model per club, one model per audience segment, or in something entirely different. At the very least, I believe a tree-based approach makes the most sense here, since this would be an effective learner of the team- and segment-based interactions that would be challenging to hand-craft otherwise. This task remains a work in progress, however, so I have no perfect solution to offer.
The last area I had a chance to work on is what I would broadly characterize as baseball-facing work. Working with our in-house baseball researchers, I helped to automate some of our anomaly detection reporting, taking previously-manual Excel work and re-writing it in SQL and Python. Parts of this work involved the fitting and evaluation of mixed effect logistic regression models. These models were new to me at the time and interesting to learn about. Other ad-hoc requests in this domain would come on occasion, usually involving some sort of data analysis regarding Statcast data on the relationships between pitch speed, type, and location and the outcome of a pitch or plate appearance.
To wrap this up, here are the top pieces of advice I would offer somebody entering their first data science job:
Most tasks don’t require machine learning. Penalized (LASSO or Ridge) logistic regression and ordinary least squares with simple feature engineering will usually do the trick.
Some tasks do necessitate machine learning. Whenever you fit a simple model, try a random forest or GBM alongside it just to be sure you aren’t oversimplifying the problem with a linear model.
Knowing the difference between a #1 and a #2 problem is a crucial part of real-world data science. This is part of understanding your client’s needs and your problem domain.
Data science is more than just building models; most of the work happens before the model is trained. You will spend hours retrieving, joining, aggregating, sanity-checking, and exploring your data.
Real world data is full of surprises. Distributions change over time, features are sparse, and variables don’t always mean what you think they do. Always profile your data to avoid mistakes early on.
Unfortunately for Python-loving statisticians like myself, there has historically been no tool for bringing this data into one’s work without a painful process of manual curation. For this reason, I’m releasing pybaseball: an open source package for 21st century baseball data analysis in Python.
How does Aaron Judge hit baseballs into the stratosphere? Pybaseball and Statcast can help you find out (image: mlb.com)
Pybaseball takes the pain out of collecting and cleaning baseball data from the internet. In short, I scraped Baseball Savant, FanGraphs, and Baseball Reference so you don’t have to. Currently, this means that you can retrieve pitch, season, and game-level data on individual players and teams, historic schedule and record data, and division standings with simple, Pythonic one-liners. The stats that this library provides range from the classics (BA, RBI, HR, W, L, K, IP), to the slightly more sophisticated (OBP, SLG, WHIP, WAR), to what would have sounded like science fiction a few short years ago (exit velocity, spin speed, pitch x, y, and z coordinates). The goal of this library is to provide the data necessary to answer any baseball research question.
The statcast function returns Statcast pitch-level data.
>>> from pybaseball import statcast
>>> data = statcast(start_dt='2017-06-15', end_dt='2017-06-28')
>>> data.head(2)
index pitch_type game_date release_speed release_pos_x release_pos_z
0 314 CU 2017-06-27 79.7 -1.3441 5.4075
1 332 FF 2017-06-27 98.1 -1.3547 5.4196
player_name batter pitcher events ... release_pos_y
0 Matt Bush 608070.0 456713.0 field_out ... 54.8585
1 Matt Bush 429665.0 456713.0 field_out ... 54.3470
estimated_ba_using_speedangle estimated_woba_using_speedangle woba_value
0 0.100 0.137 0.0
1 0.269 0.258 0.0
woba_denom babip_value iso_value launch_speed_angle at_bat_number pitch_number
0 1.0 0.0 0.0 3.0 64.0 1.0
1 1.0 0.0 0.0 3.0 63.0 3.0
[2 rows x 79 columns]
Similarly, if you want player-level stats aggregated by season, you can pull 299 different features per player per season from FanGraphs using the pitching_stats and batting_stats functions.
>>> from pybaseball import pitching_stats
>>> data = pitching_stats(2012, 2016)
>>> data.head()
Season Name Team Age W L ERA WAR G GS
336 2015.0 Clayton Kershaw Dodgers 27.0 16.0 7.0 2.13 8.6 33.0 33.0
236 2014.0 Clayton Kershaw Dodgers 26.0 21.0 3.0 1.77 7.6 27.0 27.0
472 2014.0 Corey Kluber Indians 28.0 18.0 9.0 2.44 7.4 34.0 34.0
235 2015.0 Jake Arrieta Cubs 29.0 22.0 6.0 1.77 7.3 33.0 33.0
256 2013.0 Clayton Kershaw Dodgers 25.0 16.0 9.0 1.83 7.1 33.0 33.0
... wSL/C (pi) wXX/C (pi) O-Swing% (pi) Z-Swing% (pi)
336 ... 1.76 22.85 0.364 0.665
236 ... 2.62 NaN 0.371 0.670
472 ... 3.92 NaN 0.336 0.598
235 ... 2.42 NaN 0.329 0.618
256 ... 0.74 NaN 0.339 0.635
Swing% (pi) O-Contact% (pi) Z-Contact% (pi) Contact% (pi) Zone% (pi)
336 0.511 0.478 0.811 0.689 0.487
236 0.525 0.536 0.831 0.730 0.515
472 0.468 0.485 0.886 0.744 0.505
235 0.468 0.595 0.856 0.762 0.483
256 0.484 0.563 0.873 0.763 0.492
Pace (pi)
336 23.4
236 23.7
472 24.6
235 23.3
256 23.4
[5 rows x 299 columns]
But wait, there’s more! Say you’re interested in comparing the performances of historic teams. That, too, is easy with pybaseball. With this package, one can dissect the 1927 “Murderers Row” Yankees season with a single line of Python.
>>> from pybaseball import schedule_and_record
>>> data = schedule_and_record(1927, 'NYY')
>>> data.head()
Date Tm Home_Away Opp W/L R RA Inn W-L Rank \
1 Tuesday, Apr 12 NYY Home PHA W 8.0 3.0 9.0 1-0 1.0
2 Wednesday, Apr 13 NYY Home PHA W 10.0 4.0 9.0 2-0 1.0
3 Thursday, Apr 14 NYY Home PHA T 9.0 9.0 10.0 2-0 1.0
4 Friday, Apr 15 NYY Home PHA W 6.0 3.0 9.0 3-0 1.0
5 Saturday, Apr 16 NYY Home BOS W 5.0 2.0 9.0 4-0 1.0
GB Win Loss Save Time D/N Attendance Streak
1 Tied Hoyt Grove None 2:05 D 72000.0 1
2 up 0.5 Ruether Gray None 2:15 D 8000.0 2
3 Tied None None None 2:50 D 9000.0 2
4 Tied Pennock Ehmke None 2:27 D 16000.0 3
5 up 1.0 Shocker Ruffing None 2:05 D 25000.0 4
These examples are just the tip of the iceberg, but hopefully this gives a taste of the power and versatility of pybaseball.
Pybaseball is pip installable. Simply run pip install pybaseball and it’s on your machine.
If any of this piqued your interest, full documentation and complete examples are available on the GitHub repo here. If you like what you’ve seen so far, please give it a star. If you really like what you’ve seen so far, drop me a suggestion or submit a code improvement. Last, if you end up using pybaseball for any type of project or analysis, I would love to hear about it. Send me a note or reach out on Twitter!
Currently Reading: Bayesian Data Analysis 3
A solid overview of the statistical learning theory that underlies machine learning. Allows a reader to get an intuitive grasp of what is going on inside the “black box”, but is a little too far on the qualitative side if one hopes to gain a full understanding. For a deeper dive, see its advanced counterpart, The Elements of Statistical Learning.
Like the above, but much more dense. Worth suffering through its first several chapters. Builds character.
Another machine learning book that focuses on theory. It won’t show you how to train your own models, but it will help to understand why models work and what guarantees we’re able to make about learning and generalization.
A wonderful balance of intuition and theory that the field has been lacking. Begins with the nuts and bolts of feedforward networks, and then goes into depth on the state of the art in model regularization, optimization, and various model classes and architectures. Filled with useful tips and tricks for implementing models.
A handy reference for Keras. This book is helpful for bridging the gap between beginner deep learning tutorials and more advanced / state-of-the-art methods. It’s not helpful for learning theory, but will help you to implement what you read in papers.
Communicating what your data have to say with clarity, precision, and efficiency. Its pretty graphics also make it a great coffee table book.
A handbook on advanced econometrics. Useful for brushing up on linear models (simple and multiple linear regression) and experiment design (instrumental variables, difference-in-difference models, answering causal questions.)
This book is very similar to Mostly Harmless Econometrics, but more beginner-friendly. Get this one instead if you’re learning econometrics for the first time.
This book felt like a greatest hits compilation of all the most useful and exciting things I learned about experiment design as an undergrad. It’s the best book I’ve found to date for marrying the strengths of old-school statisticians and newer-school data scientists.
An absurdly useful book for learning how to manipulate data with R and the Tidyverse (dplyr, ggplot, forcats, etc.) I read this once when I was first learning R and again after a few years of experience and learned new things each time. This book will make anyone better at data analysis, visualization, manipulation, and cleaning.
Python’s closest equivalent to R for Data Science. Useful for understanding Pandas dataframes more deeply, and helped me to rely on stack overflow a lot less.
My life would have turned out quite different had it not been for Nate Silver and this book. The Signal and the Noise helped me to discover my love for data science!
I can’t shake the feeling that, as a data scientist, this book hates me. Nonetheless, it opened my eyes to an approach to causal inference that was entirely different than anything I’d been exposed to before.
A reference book on sabermetrics with code samples in R. This book is useful to keep around when working with some of the main publicly-available baseball data sources such as Retrosheet and the Lahman database.
A short, digestible history of and introduction to information theory. It won’t make you an expert, but you’ll get the main ideas.
This book covers the rise of behavioral economics from one of its earliest practitioners. Thaler draws from his experience to give an often Freakonomics-esque run-down of how economic models fail to describe real-world human behavior.
A layman’s version of the theories that laid the groundwork for behavioral economics. Kahneman explains the two chief mechanisms in our brains (fast and slow thought), and how they cause predictable biases.
This read a lot like Thinking Fast and Slow, but Ariely is a much better writer. Another book about how our brains take shortcuts that lead to irrational decision making.
A psychology-computer science fusion piece on how fundamental computer science algorithms and data structures can aid decision making. A fun way to tie stacks, queues, and sorting algorithms into your everyday life.
My love for Steven Levitt’s work is second only to that for Nate Silver’s. Freakonomics showed me that the economics tool set can be used to advance causes much greater than economics itself.
A book on incentives, and how just about everything has a price tag if framed correctly.
A collection of posts from the Freakonomics blog strung together into a greater narrative. A nice mix of incentive schemes, economic ramblings, and musings on irrational behavior.
A poker professional’s take on training yourself to think rationally and probabilistically.
This book came to my attention because people were arguing about it on Twitter. Duckworth’s research attempts to measure people’s level of grittiness. I didn’t find it very useful or interesting.
I was lucky enough to work with Tango while working as a statistician at MLB. This book is essentially the bible for a modern-day sabermetrician, answering baseball’s most fundamental strategic questions with an empirical approach and interpretable models.
An engaging read for statheads and trivia fanatics. Tetlock draws from his experience as the head of the Good Judgment Project (a lengthy study on forecasting) to break down what exactly makes a great forecaster able to see the future better than the rest of us. The key findings are based in psychology and methods of improvement through self-evaluation.
The story of how Billy Beane’s Oakland A’s are able to build successful teams in one of baseball’s smallest markets. Lewis’ walk through the logic of sabermetrics and sorting signal from noise in baseball data was eye-opening for me as a stats geek and sports fan alike.
A mostly-qualitative run through the current state of basketball analytics, detailing recent phenomena such as the decline of the mid-range jumper, tanking for draft picks, and the specialized medical analyses being used to ensure player longevity.
The rise of Bezos and Amazon. A handbook on long-term thinking and execution in complex environments.
Musk’s success story seems similar to that of Bezos: defined by an obsessive focus on a small number of long-term goals and a superhuman work ethic.
Advice from a VC and former CEO about how to get through the low points as a leader, and how to lead when things are not going well.
How to build and scale an organization where innovation comes naturally, told by two of the leaders responsible for doing this at Google.
Peter Thiel’s notes on how to start a startup. Advice on market positioning, culture, and overcoming the challenges of early-stage entrepreneurship.
A breakdown of the dangers of journalist- and pundit-fed pseudo-economics. The stories of those who “peddle prosperity” in this way are seldom grounded in facts. This book breaks down the rise and farce of Reaganomics, and how to be wary of such false theories in the future.
Simple, easy to follow value investing principles from a hedge fund manager.
An uncharacteristically interesting book about HR from Google’s former HR chief. Ideas on using data for better HR decisions.
More value investing, this time greatly simplified. One of the most useful reads for an investor who is not a finance pro.
The bible for any value investor. Graham’s Mr. Market illustration remains relevant today.
Updated: 2019/08/02