<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://jamesrledoux.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://jamesrledoux.com/" rel="alternate" type="text/html" /><updated>2026-04-05T20:55:25+00:00</updated><id>https://jamesrledoux.com/feed.xml</id><title type="html">Data are.</title><subtitle>Data scientist and armchair sabermetrician.</subtitle><author><name>James LeDoux</name><email>ledoux.james.r@gmail.com</email></author><entry><title type="html">AxeDB: Guitar Pricing Intelligence</title><link href="https://jamesrledoux.com/projects/axedb-guitar-site/" rel="alternate" type="text/html" title="AxeDB: Guitar Pricing Intelligence" /><published>2026-03-23T00:00:00+00:00</published><updated>2026-03-23T00:00:00+00:00</updated><id>https://jamesrledoux.com/projects/axedb-guitar-site</id><content type="html" xml:base="https://jamesrledoux.com/projects/axedb-guitar-site/"><![CDATA[<meta property="og:image" content="/images/fulls/axedb-listing-page.png" />

<meta property="og:image:type" content="image/png" />

<meta property="og:image:width" content="200" />

<meta property="og:image:height" content="200" />

<meta name="twitter:card" content="summary_large_image" />

<meta name="twitter:site" content="@ledoux_james" />

<meta name="twitter:creator" content="@ledoux_james" />

<meta name="twitter:title" content="AxeDB: Guitar Pricing Intelligence" />

<meta name="twitter:image" content="http://jamesrledoux.com/images/fulls/axedb-listing-page.png" />

<p>I’ve come out of side project retirement for something I’m calling <a href="https://axedb.com/">AxeDB</a>. AxeDB is a platform for vintage and used guitar pricing intelligence. Using data from millions of used guitar sales, it shows pricing trends on specific current and historic models and serves as a platform for research and analysis to quantify what’s previously been a largely qualitative domain.</p>

<h1 id="why-axedb">Why AxeDB</h1>
<p>I built AxeDB because it’s the kind of niche website I wished existed for my own music gear hobby. As an amateur musician, gear nerd, and data scientist, I find myself browsing Reverb and YouTube guitar content for both the gear and the market aspect. Yes, I like to look at new and vintage instruments, pedals, amps, etc., listen to demos, and ask myself if I need a new piece of gear (I don’t.) But I also see a vintage Blackguard Telecaster sell for the price of a midwestern starter home and like to ask myself: why’s it worth that?</p>

<p>AxeDB is a place you can scratch that kind of itch. You can check a modern and vintage guitar sales market index like it’s the S&amp;P 500, see whether the newly released American Ultra Luxe Vintage line is trading at a discount on the secondary market yet, and read a deep dive analysis of where exactly the value comes from on that Blackguard (and how you’re going to find a more affordable one for yourself one day.) It’s for the musicians, the collectors, and the nerds.</p>

<h1 id="what-it-does">What It Does</h1>

<h2 id="model-level-secondary-market-trends-a-tool-for-buyers-and-sellers">Model-Level Secondary Market Trends: A Tool for Buyers and Sellers</h2>

<p>AxeDB offers series- and model-level views into secondary market trends. For a given model (e.g. the American Vintage II Telecaster), see how secondary market sales have trended relative to the retail price of the brand new guitar, browse live listings from Reverb and Sweetwater, and compare those listings to previous sales.</p>

<p>It’s a great tool for scouting deals and tracking the market.</p>

<p align="center">
    <img src="/images/fulls/axedb-model-page.png" alt="" />
</p>

<h2 id="vintage-and-modern-aftermarket-market-indexes">Vintage and Modern-Aftermarket Market Indexes</h2>

<p>Stock market-style indexes for modern and vintage guitar sales: a cross-market pulse on how the used markets are trending.</p>

<p align="center">
    <img src="/images/fulls/axedb-market-index.png" alt="" />
</p>

<h2 id="market-research">Market Research</h2>

<p>Quantitative research on guitars, music, and instrument value. So far I’ve posted research on the <a href="https://axedb.com/research/vintage-telecaster-market">vintage Telecaster market</a> and the <a href="https://axedb.com/research/fender-custom-colors-refinish-value">history of Fender’s custom colors, refinishes, and how they affect sales price</a>. I plan to keep sharing research using the data foundation I’ve developed from tens of thousands of vintage guitar sales.</p>

<p align="center">
    <img src="/images/fulls/axedb-tele-market.png" alt="" />
</p>

<h2 id="editorial-content">Editorial Content</h2>

<p>Detail-heavy editorial content such as histories of the specs and provenance of famous guitars. See the piece on <a href="https://axedb.com/famous-axes/kurt-cobain-skystang">Cobain’s Sky-Stang</a> as an example. The thought here is that, eventually, I could have a searchable catalog of every famous artist’s guitars, their specs, and how you can get your own.</p>

<h1 id="data-foundation-and-ai-exploration">Data Foundation and AI Exploration</h1>

<p>The foundation of AxeDB is what its name implies: a database of axes (guitars). I pursued this project, in part, to practice building data and AI infrastructure from scratch and to experiment with LLMs.</p>

<p>I largely vibecoded the foundations of this project, relying heavily on Claude Code and Codex to develop the app and its foundational data pipelines, and using OpenAI’s API for data enrichment and labeling to enhance the quality of listing and sales data collected from messy online sources. Agentic AI capabilities have gotten really impressive over the past couple months, so it was exciting to get to lean in and develop something I’m excited about with an AI-first mindset.</p>

<p>The result, in addition to a user-facing web app, is a fascinating data foundation that a data scientist and guitar hobbyist like myself can dig into. AxeDB is built on top of cleaned, structured data from millions of live listings and past sales of guitars from online marketplaces, individual dealers, and auction houses. The data refreshes frequently from live sources, so the data presented on the site remains fresh and actionable to a user. It’s a large and rich dataset that I hope will continue to power analyses and ML projects for a long time to come.</p>

<p>My hope is that the site is not only an interesting data and engineering exercise for myself, but also a powerful utility and source of entertainment for musicians and gear nerds like myself. Check it out and feel free to send me a note if you have feedback or ideas for site features or future research projects.</p>]]></content><author><name>James LeDoux</name><email>ledoux.james.r@gmail.com</email></author><category term="projects" /><summary type="html"><![CDATA[New project for studying the used and vintage guitar markets]]></summary></entry><entry><title type="html">Multi-Armed Bandits in Python: Epsilon Greedy, UCB1, Bayesian UCB, and EXP3</title><link href="https://jamesrledoux.com/algorithms/bandit-algorithms-epsilon-ucb-exp-python/" rel="alternate" type="text/html" title="Multi-Armed Bandits in Python: Epsilon Greedy, UCB1, Bayesian UCB, and EXP3" /><published>2020-03-24T00:00:00+00:00</published><updated>2020-03-24T00:00:00+00:00</updated><id>https://jamesrledoux.com/algorithms/bandit-algorithms-epsilon-ucb-exp-python</id><content type="html" xml:base="https://jamesrledoux.com/algorithms/bandit-algorithms-epsilon-ucb-exp-python/"><![CDATA[<meta property="og:image" content="/images/fulls/final_bandit_results.png" />

<meta property="og:image:type" content="image/png" />

<meta property="og:image:width" content="200" />

<meta property="og:image:height" content="200" />

<meta name="twitter:card" content="summary_large_image" />

<meta name="twitter:site" content="@jmzledoux" />

<meta name="twitter:creator" content="@jmzledoux" />

<meta name="twitter:title" content="Multi-Armed Bandits in Python: Epsilon Greedy, UCB1, Bayesian UCB, and EXP3" />

<meta name="twitter:image" content="http://jamesrledoux.com/images/fulls/final_bandit_results.png" />

<p>In this post I discuss the multi-armed bandit problem and implementations of four specific bandit algorithms in Python (epsilon greedy, UCB1, a Bayesian UCB, and EXP3). I evaluate their performance as content recommendation systems on a real-world movie ratings dataset and provide simple, reproducible code for applying these algorithms to other tasks.</p>

<h1 id="whats-a-bandit">What’s a Bandit?</h1>

<p>Multi-armed bandits belong to a class of online learning algorithms that allocate a fixed number of resources to a set of competing choices, attempting to learn an optimal resource allocation policy over time.</p>

<p>The multi-armed bandit problem is often introduced via an analogy of a gambler playing slot machines. Imagine you’re at a casino and are presented with a row of \(k\) slot machines, with each machine having a hidden payoff function that determines how much it will pay out. You enter the casino with a fixed amount of money and want to learn the best strategy to maximize your profits. Initially you have no information about which machine is expected to pay out the most money, so you try one at random and observe its payout. Now that you have a little more information than you had before, you need to decide: do I exploit this machine now that I know more about its payoff function, or do I explore the other options by pulling arms that I have less information about? You want to strike the most profitable balance between exploring all potential machines so that you don’t miss out on a valuable one by simply not trying it enough times, and exploiting the machine that has been most profitable so far. A multi-armed bandit algorithm is designed to learn an optimal balance for allocating resources between a fixed number of choices in a situation such as this one, maximizing cumulative rewards over time by learning an efficient explore vs. exploit policy.</p>

<p align="center">
    <img src="/images/fulls/bandit-octopus.jpg" alt="" />
</p>

<p>Before looking at any specific algorithms, it’s useful to first establish a few definitions and core principles, since the language and problem setup of the bandit setting differs slightly from those of traditional machine learning. The bandit setting, in short, looks like this:</p>

<ul>
  <li>You’re presented with \(k\) distinct “arms” to choose from. An arm can be a piece of content for a recommender system, a stock pick, a promotional offer, etc.</li>
  <li>You observe information about how these arms have performed in the past, such as how many times each arm has been pulled and what its payoff value was each time</li>
  <li>You “pull” the arm (choose the action) deemed best by the algorithm’s policy</li>
  <li>You observe its reward (how positive the outcome was) and/or its regret (how much worse this action was than the best-possible action would have been in hindsight)</li>
  <li>You use this reward and/or regret information to update the policy used to select arms in the future</li>
  <li>You continue this process over time, attempting to learn a policy that balances exploration and exploitation in order to minimize cumulative regret</li>
</ul>
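<p>Concretely, the loop above can be sketched in a few lines of Python. This is a minimal simulation, not code from the experiments below: the Bernoulli reward probabilities and the placeholder <code>select_arm</code> policy are hypothetical stand-ins for a real application.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.05, 0.25, 0.50]  # hidden payoff probability of each arm
counts = np.zeros(3)             # times each arm has been pulled
rewards = np.zeros(3)            # cumulative reward observed per arm

def select_arm():
    # placeholder policy: pull each arm once, then greedily
    # exploit the arm with the best observed mean reward
    if counts.min() == 0:
        return int(counts.argmin())
    return int((rewards / counts).argmax())

for t in range(1000):
    a = select_arm()                    # choose the arm the policy deems best
    r = rng.binomial(1, true_means[a])  # observe that arm's reward
    counts[a] += 1                      # fold the observation back into
    rewards[a] += r                     # the policy's information
```

<p>Each algorithm in this post is, in essence, a smarter <code>select_arm</code> for this loop.</p>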

<p>This bears several similarities to reinforcement learning techniques such as <a href="https://en.wikipedia.org/wiki/Q-learning">Q-learning</a>, which similarly learn and modify a policy over time. The time-dependence of a bandit problem (start with zero or minimal information about all arms, learn more over time) is a significant departure from the traditional machine learning setting, where the full dataset is available at once and a model can be trained as a one-off process. Bandits require repeated, incremental policy updates.</p>

<h1 id="dataset-and-experiment-setup">Dataset and Experiment Setup</h1>
<p>There are several nuances to running a multi-armed bandit experiment using a real-world dataset. I describe the experiment setup in detail in <a href="https://jamesrledoux.com/algorithms/offline-bandit-evaluation">this post</a>, and I encourage you to read through it before proceeding. If you’d rather not, here’s the short version of how this experiment is set up:</p>

<ul>
  <li>I use the MovieLens 25M dataset of 25 million movie ratings</li>
  <li>The problem is re-cast from a 0-5 star rating problem to a binary like/no-like problem, with 4.5 stars and above representing a “liked” movie</li>
  <li>I use a method called Replay to remove bias in the historic dataset and simulate how the bandit would perform in a live production environment</li>
  <li>I judge algorithm quality by the percentage of the bandits’ Replay-matched recommendations that were “liked”</li>
  <li>To speed up the time it takes to run these algorithms, I recommend slates of movies instead of one movie at a time, and I also serve recommendations to batches of users rather than updating the bandit’s policy once for each data point</li>
</ul>

<p>But really, read the <a href="https://jamesrledoux.com/algorithms/offline-bandit-evaluation">full version</a> to better understand the ins and outs of evaluating a multi-armed bandit algorithm using a historic dataset.</p>
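<p>For a concrete feel for two of the bullet points above, here is a rough sketch of the like/no-like binarization and the Replay match, using a toy stand-in for the ratings frame (column names follow the MovieLens schema; the replay logic here is heavily simplified relative to the full post):</p>

```python
import pandas as pd

# toy stand-in for the MovieLens ratings data
ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 3],
    'movieId': [10, 20, 10, 30, 20],
    'rating':  [5.0, 3.5, 4.5, 2.0, 4.0],
})

# re-cast the 0-5 star rating as a binary reward: 4.5+ stars is a "like"
ratings['liked'] = (ratings['rating'] >= 4.5).astype(int)

# Replay, simplified: an event only scores the bandit when the movie the
# user actually rated appears in the slate the bandit recommended
recs = [10, 20]
scored = ratings[ratings['movieId'].isin(recs)]
like_rate = scored['liked'].mean()  # share of matched recommendations that were liked
```

<p>The real experiment does this incrementally in batches, updating the bandit’s history after each batch of matched events.</p>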

<p><br /></p>

<h1 id="epsilon-greedy">Epsilon Greedy</h1>

<p>The simplest bandits follow semi-uniform strategies. The most popular of these is called <strong>epsilon greedy</strong>.</p>

<p>Like the name suggests, the epsilon greedy algorithm follows a greedy arm selection policy, selecting the best-performing arm at each time step. However, \(\epsilon\) percent of the time, it will go off-policy and choose an arm at random. The value of \(\epsilon\) thus determines the fraction of time steps on which the algorithm explores the available arms; the rest of the time it exploits the arms that have performed best historically.</p>

<p>This algorithm has a few perks. First, it’s easy to explain (explore \(\epsilon \%\) of time steps, exploit the other \((1-\epsilon)\%\); the algorithm fits in a single sentence). Second, \(\epsilon\) is straightforward to optimize. Third, despite its simplicity, it typically yields pretty good results. Epsilon greedy is the linear regression of bandit algorithms.</p>

<p>Much like linear regression can be extended to a broader family of generalized linear models, there are several adaptations of the epsilon greedy algorithm that trade off some of its simplicity for better performance. One such improvement is to use an epsilon-decreasing strategy. In this version of the algorithm, \(\epsilon\) decays over time. The intuition for this is that the need for exploration decreases over time, and selecting random arms becomes increasingly inefficient as the algorithm eventually has more complete information about the available arms. Another available take on this algorithm is an epsilon-first strategy, where the bandit acts completely at random for a fixed number of time steps to sample the available arms, and then purely exploits thereafter. I’m not going to use either of these approaches in this post, but it’s worth mentioning that these options are out there.</p>
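<p>Both variants amount to making \(\epsilon\) a function of the time step. A quick sketch (the schedules and their parameters are illustrative, not tuned for this experiment):</p>

```python
def epsilon_decreasing(t, epsilon_0=0.5, decay=0.01):
    # epsilon decays toward zero as the bandit accumulates information
    return epsilon_0 / (1 + decay * t)

def epsilon_first(t, explore_steps=1000):
    # explore uniformly at random for a fixed window, then purely exploit
    return 1.0 if t < explore_steps else 0.0
```

<p>Either schedule drops into the policy below by replacing the fixed <code>epsilon</code> with a call like <code>epsilon_decreasing(t)</code>.</p>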

<p>Implementing the traditional epsilon greedy bandit strategy in Python is straightforward:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">epsilon_greedy_policy</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">arms</span><span class="p">,</span> <span class="n">epsilon</span><span class="o">=</span><span class="mf">0.15</span><span class="p">,</span> <span class="n">slate_size</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">50</span><span class="p">):</span>
    <span class="s">'''
    Applies Epsilon Greedy policy to generate movie recommendations.
    Args:
        df: dataframe. Dataset to apply the policy to
        arms: list or array. ID of every eligible arm.
        epsilon: float. represents the % of timesteps where we explore random arms
        slate_size: int. the number of recommendations to make at each step.
        batch_size: int. the number of users to serve these recommendations to before updating the bandit's policy.
    '''</span>
    <span class="c1"># draw a 0 or 1 from a binomial distribution, with epsilon% likelihood of drawing a 1
</span>    <span class="n">explore</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">binomial</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">epsilon</span><span class="p">)</span>
    <span class="c1"># if explore: shuffle movies to choose a random set of recommendations
</span>    <span class="k">if</span> <span class="n">explore</span> <span class="o">==</span> <span class="mi">1</span> <span class="ow">or</span> <span class="n">df</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">==</span><span class="mi">0</span><span class="p">:</span>
        <span class="n">recs</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">arms</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="n">slate_size</span><span class="p">),</span> <span class="n">replace</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="c1"># if exploit: sort movies by "like rate", recommend movies with the best performance so far
</span>    <span class="k">else</span><span class="p">:</span>
        <span class="n">scores</span> <span class="o">=</span> <span class="n">df</span><span class="p">[[</span><span class="s">'movieId'</span><span class="p">,</span> <span class="s">'liked'</span><span class="p">]].</span><span class="n">groupby</span><span class="p">(</span><span class="s">'movieId'</span><span class="p">).</span><span class="n">agg</span><span class="p">({</span><span class="s">'liked'</span><span class="p">:</span> <span class="p">[</span><span class="s">'mean'</span><span class="p">,</span> <span class="s">'count'</span><span class="p">]})</span>
        <span class="n">scores</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">'mean'</span><span class="p">,</span> <span class="s">'count'</span><span class="p">]</span>
        <span class="n">scores</span><span class="p">[</span><span class="s">'movieId'</span><span class="p">]</span> <span class="o">=</span> <span class="n">scores</span><span class="p">.</span><span class="n">index</span>
        <span class="n">scores</span> <span class="o">=</span> <span class="n">scores</span><span class="p">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s">'mean'</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
        <span class="n">recs</span> <span class="o">=</span> <span class="n">scores</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">scores</span><span class="p">.</span><span class="n">index</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="n">slate_size</span><span class="p">],</span> <span class="s">'movieId'</span><span class="p">].</span><span class="n">values</span>
    <span class="k">return</span> <span class="n">recs</span>

<span class="c1"># apply epsilon greedy policy to the historic dataset (all arm-pulls prior to the current step that passed the replay-filter)
</span><span class="n">recs</span> <span class="o">=</span> <span class="n">epsilon_greedy_policy</span><span class="p">(</span><span class="n">df</span><span class="o">=</span><span class="n">history</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">history</span><span class="p">.</span><span class="n">t</span><span class="o">&lt;=</span><span class="n">t</span><span class="p">,],</span> <span class="n">arms</span><span class="o">=</span><span class="n">df</span><span class="p">.</span><span class="n">movieId</span><span class="p">.</span><span class="n">unique</span><span class="p">())</span>

<span class="c1"># get the score from this set of recommendations, add this to the bandit's history to influence its future decisions
</span><span class="n">history</span><span class="p">,</span> <span class="n">action_score</span> <span class="o">=</span> <span class="n">score</span><span class="p">(</span><span class="n">history</span><span class="p">,</span> <span class="n">df</span><span class="p">,</span> <span class="n">t</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">,</span> <span class="n">recs</span><span class="p">)</span>
</code></pre></div></div>

<p><br /></p>

<h1 id="ucb1-upper-confidence-bound-algorithm">UCB1 (Upper Confidence Bound Algorithm)</h1>

<p>Epsilon greedy performs pretty well, but it’s easy to see how selecting arms at random can be inefficient. If you have one movie that 50% of users have liked, and another that 5% have liked, epsilon greedy is equally likely to pick either of these movies when exploring random arms. Upper Confidence Bound algorithms were introduced as a class of bandit algorithm that explores more efficiently.</p>

<p>Upper Confidence Bound algorithms construct a confidence interval of what each arm’s true performance might be, factoring in the uncertainty caused by variance in the data and the fact that we’re only able to observe a limited sample of pulls for any given arm. The algorithms then optimistically assume that each arm will perform as well as its upper confidence bound (UCB), selecting the arm with the highest UCB.</p>

<p>This has a number of nice qualities. First, you can parameterize the size of the confidence interval to control how aggressively the bandit explores or exploits (e.g. you can use a 99% confidence interval to explore heavily, or a 50% confidence interval to mostly exploit.) Second, using upper confidence bounds causes the bandit to explore more efficiently than an epsilon greedy bandit. This happens because confidence intervals shrink as you see additional data points for a given arm. So, while the algorithm will gravitate toward picking arms with high average performance, it will periodically give less-explored arms a chance since their confidence intervals are wider.</p>

<p>Seeing this visually helps to understand how these confidence bounds produce an efficient balance of exploration and exploitation. Below I’ve produced an imaginary scenario where a UCB bandit is determining which article to show at the top of a news website. There are three articles, judged according to the upper confidence bound of their click-through-rate (CTR).</p>

<p align="center">
    <img src="/images/fulls/ucb-drawing.jpg" alt="" />
</p>

<p>Article A has been seen 100 times and has the best CTR. Article B has a slightly worse CTR than article A, but it hasn’t been seen by as many users, so there’s also more uncertainty about how well it’s going to perform in the long run. For this reason, it has a larger confidence bound, giving it a slightly higher UCB score than article A. Article C was published just moments ago, so almost no users have seen it. We’re extremely uncertain about how high its CTR will ultimately be, so its UCB is highest of all for now despite its initial CTR being low.</p>

<p>Over time, more users will see articles B and C, and their confidence bounds will become more narrow and look more like that of article A. As we learn more about B and C, we’ll shift from exploration toward exploitation as the articles’ confidence intervals collapse toward their means. Unless the CTR of article B or C improves, the bandit will quickly start to favor article A again as the other articles’ confidence bounds shrink.</p>
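<p>Plugging hypothetical numbers into the UCB1 exploration bonus \(\sqrt{2\log(t)/n_{a}}\) introduced below makes this concrete; the impression counts and CTRs here are made up to mirror the drawing:</p>

```python
import numpy as np

def ucb_bonus(t, n):
    # UCB1 exploration bonus: sqrt(2 ln(t) / n); shrinks as pulls (n) grow
    return np.sqrt(2 * np.log(t) / n)

t = 131  # total impressions across all articles so far (hypothetical)
articles = [('A', 0.10, 100), ('B', 0.08, 30), ('C', 0.02, 1)]
scores = {name: ctr + ucb_bonus(t, n) for name, ctr, n in articles}
# C's enormous bonus outweighs its low observed CTR, so it ranks first
```

<p>Article C’s bonus dwarfs its observed CTR, so it gets shown until its interval collapses, which is exactly the behavior described above.</p>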

<p>A good UCB algorithm to start with is <strong>UCB1</strong>. UCB1 uses <a href="https://en.wikipedia.org/wiki/Hoeffding%27s_inequality">Hoeffding’s inequality</a> to assign an upper bound to an arm’s mean reward where there’s high probability that the true mean will be below the UCB assigned by the algorithm. The inequality states that:</p>

\[P(\mu_{a} &gt; \hat{\mu}_{t,a} + U_{t}(a)) \leq e^{-2tU_{t}(a)^2},\]

<p>where \(\mu_{a}\) is arm \(a\)’s true mean reward, \(\hat{\mu}_{t,a}\) is \(a\)’s observed mean reward at time \(t\), and \(U_{t}(a)\) is the confidence term for arm \(a\) which, when added to the observed mean reward, gives you the arm’s upper confidence bound. Setting \(p = e^{-2tU_{t}(a)^2}\) gives us the following value for the UCB term:</p>

\[U_{t}(a) = \sqrt{\frac{-\log{p}}{2n_{a}}}.\]

<p>Note that in the denominator I’m replacing \(t\) with \(n_{a}\), the number of times arm \(a\) has been pulled, since that is the relevant sample count and it will eventually differ from the total number of time steps \(t\) the algorithm has been running at a given point in time.</p>

<p>Setting the probability \(p\) of the true mean being greater than the UCB to be less than or equal to \(t^{-4}\), a small probability that quickly converges to zero as the number of rounds \(t\) grows, ultimately gives us the UCB1 algorithm, which pulls the arm that maximizes:</p>

\[\bar{x}_{t,a}+ \sqrt{\frac{2\log(t)}{n_{a}}}.\]

<p>Here \(\bar{x}_{t,a}\) is the mean observed reward of arm \(a\) at time \(t\),  \(t\) is the current time step in the algorithm, and \(n_{a}\) is the number of times arm \(a\) has been pulled so far.</p>

<p>Putting this all together, it means that a high “like” rate for a movie in this dataset will increase the likelihood of an arm being pulled, but so will a lower number of times the arm has been pulled so far, which encourages exploration. Also notice that the part of the function that includes the number of time steps the algorithm has been running (\(t\)) is inside a logarithm, which causes the algorithm’s propensity to explore to decay over time. <a href="https://jeremykun.com/2013/10/28/optimism-in-the-face-of-uncertainty-the-ucb1-algorithm/">Jeremy Kun’s blog</a> has a very nice explanation of this algorithm and the proofs that support it. I also found <a href="https://lilianweng.github.io/lil-log/2018/01/23/the-multi-armed-bandit-problem-and-its-solutions.html#hoeffdings-inequality">this post</a> from Lilian Weng’s blog helpful for understanding how the confidence bounds are created using Hoeffding’s inequality.</p>

<p>Here’s how the UCB1 policy looks in Python:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">ucb1_policy</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">t</span><span class="p">,</span> <span class="n">ucb_scale</span><span class="o">=</span><span class="mf">2.0</span><span class="p">,</span> <span class="n">slate_size</span><span class="o">=</span><span class="mi">5</span><span class="p">):</span>
    <span class="s">'''
    Applies UCB1 policy to generate movie recommendations
    Args:
        df: dataframe. Dataset to apply UCB policy to.
        ucb_scale: float. Most implementations use 2.0.
        t: int. represents the current time step.
        slate_size: int. the number of recommendations to make at each step.
    '''</span>
    <span class="n">scores</span> <span class="o">=</span> <span class="n">df</span><span class="p">[[</span><span class="s">'movieId'</span><span class="p">,</span> <span class="s">'liked'</span><span class="p">]].</span><span class="n">groupby</span><span class="p">(</span><span class="s">'movieId'</span><span class="p">).</span><span class="n">agg</span><span class="p">({</span><span class="s">'liked'</span><span class="p">:</span> <span class="p">[</span><span class="s">'mean'</span><span class="p">,</span> <span class="s">'count'</span><span class="p">,</span> <span class="s">'std'</span><span class="p">]})</span>
    <span class="n">scores</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">'mean'</span><span class="p">,</span> <span class="s">'count'</span><span class="p">,</span> <span class="s">'std'</span><span class="p">]</span>
    <span class="n">scores</span><span class="p">[</span><span class="s">'ucb'</span><span class="p">]</span> <span class="o">=</span> <span class="n">scores</span><span class="p">[</span><span class="s">'mean'</span><span class="p">]</span> <span class="o">+</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span>
            <span class="p">(</span>
                <span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">t</span><span class="p">))</span> <span class="o">/</span>
                <span class="n">scores</span><span class="p">[</span><span class="s">'count'</span><span class="p">]</span>
            <span class="p">)</span>
        <span class="p">)</span>
    <span class="n">scores</span><span class="p">[</span><span class="s">'movieId'</span><span class="p">]</span> <span class="o">=</span> <span class="n">scores</span><span class="p">.</span><span class="n">index</span>
    <span class="n">scores</span> <span class="o">=</span> <span class="n">scores</span><span class="p">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s">'ucb'</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="n">recs</span> <span class="o">=</span> <span class="n">scores</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">scores</span><span class="p">.</span><span class="n">index</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="n">slate_size</span><span class="p">],</span> <span class="s">'movieId'</span><span class="p">].</span><span class="n">values</span>
    <span class="k">return</span> <span class="n">recs</span>

<span class="n">recs</span> <span class="o">=</span> <span class="n">ucb1_policy</span><span class="p">(</span><span class="n">df</span><span class="o">=</span><span class="n">history</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">history</span><span class="p">.</span><span class="n">t</span><span class="o">&lt;=</span><span class="n">t</span><span class="p">,],</span> <span class="n">t</span><span class="o">=</span><span class="n">t</span><span class="p">)</span>
<span class="n">history</span><span class="p">,</span> <span class="n">action_score</span> <span class="o">=</span> <span class="n">score</span><span class="p">(</span><span class="n">history</span><span class="p">,</span> <span class="n">df</span><span class="p">,</span> <span class="n">t</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">,</span> <span class="n">recs</span><span class="p">)</span>
</code></pre></div></div>

<p><br /></p>

<h1 id="from-ucb1-to-a-bayesian-ucb">From UCB1 to a Bayesian UCB</h1>

<p>An extension of UCB1 that goes a step further is the Bayesian UCB algorithm. This bandit algorithm takes the same principles as UCB1, but lets you incorporate prior information about the distribution of an arm’s rewards to explore more efficiently (the Hoeffding inequality used to generate UCB1’s confidence bound makes no such distributional assumptions).</p>

<p>Going from UCB1 to a Bayesian UCB can be fairly simple. If you assume the rewards of each arm are normally distributed, you can simply swap out the UCB term from UCB1 for \(\frac{c\sigma(x_{a})}{\sqrt{n_{a}}}\), where \(\sigma(x_{a})\) is the standard deviation of arm \(a\)’s rewards, \(c\) is an adjustable hyperparameter that determines the width of the confidence interval added to an arm’s mean observed reward, and \(n_{a}\) is the number of times arm \(a\) has been pulled. The resulting \(\bar{x}_{a} \pm \frac{c\sigma(x_{a})}{\sqrt{n_{a}}}\) is a confidence interval for arm \(a\)’s mean reward (so \(c=1.96\) corresponds to a 95% confidence interval). It’s common to see this outperform UCB1 in practice. You can find more detail in <a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/XX.pdf">these slides</a> from UCL’s reinforcement learning course.</p>

<p>Implementation-wise, turning the above UCB1 policy into a Bayesian UCB policy is pretty simple. All you have to do is replace this logic from the UCB1 policy:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">scores</span><span class="p">[</span><span class="s">'ucb'</span><span class="p">]</span> <span class="o">=</span> <span class="n">scores</span><span class="p">[</span><span class="s">'mean'</span><span class="p">]</span> <span class="o">+</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span>
        <span class="p">(</span>
            <span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">log10</span><span class="p">(</span><span class="n">t</span><span class="p">))</span> <span class="o">/</span>
            <span class="n">scores</span><span class="p">[</span><span class="s">'count'</span><span class="p">]</span>
        <span class="p">)</span>
    <span class="p">)</span>
</code></pre></div></div>

<p>with this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">scores</span><span class="p">[</span><span class="s">'ucb'</span><span class="p">]</span> <span class="o">=</span> <span class="n">scores</span><span class="p">[</span><span class="s">'mean'</span><span class="p">]</span> <span class="o">+</span> <span class="p">(</span><span class="n">ucb_scale</span> <span class="o">*</span> <span class="n">scores</span><span class="p">[</span><span class="s">'std'</span><span class="p">]</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">scores</span><span class="p">[</span><span class="s">'count'</span><span class="p">]))</span>
</code></pre></div></div>

<p>and there you have it! Your UCB bandit is now Bayesian.</p>
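<p>To sanity-check the swap, here’s a minimal, self-contained sketch (the arms and their reward histories are invented for illustration) that applies the Bayesian UCB scoring rule to a toy reward log and selects the highest-scoring arm:</p>

```python
import numpy as np
import pandas as pd

# hypothetical per-arm binary reward histories (1 = liked, 0 = not liked)
reward_log = {
    'A': [1, 0, 1, 1, 0, 1],   # often liked, pulled 6 times
    'B': [1, 1],               # promising but barely explored
    'C': [0, 0, 0, 1, 0, 0],   # rarely liked
}

c = 1.96  # ~95% confidence interval; plays the role of ucb_scale above

# mean, std, and pull count per arm
scores = pd.DataFrame({
    'mean': {a: np.mean(r) for a, r in reward_log.items()},
    'std': {a: np.std(r) for a, r in reward_log.items()},
    'count': {a: len(r) for a, r in reward_log.items()},
})

# Bayesian UCB score: mean reward plus a shrinking confidence bonus
scores['ucb'] = scores['mean'] + c * scores['std'] / np.sqrt(scores['count'])

best_arm = scores['ucb'].idxmax()
```

One caveat worth noticing in the toy numbers: an arm whose observed rewards are all identical (like arm B here) has a standard deviation of zero, so it receives no exploration bonus at all, which is one way this rule differs in behavior from UCB1’s count-based bonus.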

<p><br /></p>

<h1 id="exp3">EXP3</h1>

<p>A third popular bandit strategy is an algorithm called EXP3, short for <em>Exponential-weight algorithm for Exploration and Exploitation</em>. EXP3 feels a bit more like a traditional machine learning algorithm than epsilon greedy or UCB1 do, because it learns a weight for each arm capturing how promising that arm has looked over time. As with UCB1, EXP3 tries to explore efficiently by placing more weight on good arms and less weight on ones that aren’t as promising.</p>

<p>The algorithm starts by initializing a vector of weights \(w\), with one weight per arm, each initialized to 1. It also takes an exploration parameter \(\gamma\), which controls how likely the algorithm is to explore arms uniformly at random. Then, at each time step, we:</p>

\[\begin{align}
&amp;1. \text{ Set } p_{i}(t) = (1-\gamma)\frac{w_{i}(t)}{\sum_{a=1}^{k} w_{a}(t)} + \frac{\gamma}{k} \text{ for each arm } i
\\
&amp;2. \text{ Draw the next arm } i_{t} \text{ randomly according to the probabilities } p_{1}(t), \ldots, p_{k}(t)
\\
&amp;3. \text{ Observe reward } x_{i_{t}}(t) \in [0,1]
\\
&amp;4. \text{ Define the estimated reward } \hat{x}_{a}(t) \text{ to be } \frac{x_{a}(t)}{p_{a}(t)} \text{ for } a=i_{t} \text{, and 0 for all other } a
\\
&amp;5. \text{ Set } w_{i_t}(t+1) = w_{i_t}(t)\, e^{\gamma \hat{x}_{i_t}(t) / k}
\end{align}\]

<p>Here \(i_{t}\) is the arm selected at step \(t\), \(k\) is the number of available arms, and \(a\) is an index over all \(k\) arms, used for summing over the weights in step (1) and for assigning every non-selected arm an estimated reward of zero in step (4).</p>

<p>In English, the algorithm exploits by drawing from a learned distribution of weights \(w\) which prioritize better-performing arms, but in a probabilistic way that still lets all arms be sampled from. The exploration parameter \(\gamma\) gives an additional nudge of favoritism to all arms, making worse-performing arms more likely to be sampled. Taken to its extreme, \(\gamma=1\) would cause the learned weights to be ignored entirely in favor of pure, random exploration.</p>
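<p>To make the role of \(\gamma\) concrete, here’s a tiny sketch of step (1) with invented arm weights, showing how the weight-proportional distribution gets mixed with the uniform one, and how \(\gamma=1\) washes the learned weights out entirely:</p>

```python
import numpy as np

def exp3_probs(weights, gamma):
    """Step (1): mix the weight-proportional distribution with the uniform one."""
    weights = np.asarray(weights, dtype=float)
    k = len(weights)
    return (1.0 - gamma) * weights / weights.sum() + gamma / k

weights = [4.0, 1.0, 1.0, 2.0]  # one learned weight per arm (made up for illustration)

p_exploit = exp3_probs(weights, gamma=0.0)  # purely weight-proportional
p_mixed   = exp3_probs(weights, gamma=0.1)  # mostly weights, small uniform nudge
p_uniform = exp3_probs(weights, gamma=1.0)  # weights ignored: uniform over 4 arms
```

With \(\gamma=0\) the first arm gets probability 0.5 (its share of the total weight); with \(\gamma=1\) every arm gets 0.25 regardless of the weights.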

<p>In Python, the EXP3 recommendation policy looks like this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">math</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">numpy.random</span> <span class="kn">import</span> <span class="n">choice</span>

<span class="k">def</span> <span class="nf">distr</span><span class="p">(</span><span class="n">weights</span><span class="p">,</span> <span class="n">gamma</span><span class="o">=</span><span class="mf">0.0</span><span class="p">):</span>
    <span class="n">weight_sum</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="nb">sum</span><span class="p">(</span><span class="n">weights</span><span class="p">))</span>
    <span class="k">return</span> <span class="nb">tuple</span><span class="p">((</span><span class="mf">1.0</span> <span class="o">-</span> <span class="n">gamma</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">w</span> <span class="o">/</span> <span class="n">weight_sum</span><span class="p">)</span> <span class="o">+</span> <span class="p">(</span><span class="n">gamma</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">weights</span><span class="p">))</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">weights</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">draw</span><span class="p">(</span><span class="n">probability_distribution</span><span class="p">,</span> <span class="n">n_recs</span><span class="o">=</span><span class="mi">1</span><span class="p">):</span>
    <span class="n">arm</span> <span class="o">=</span> <span class="n">choice</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">movieId</span><span class="p">.</span><span class="n">unique</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="n">n_recs</span><span class="p">,</span>
        <span class="n">p</span><span class="o">=</span><span class="n">probability_distribution</span><span class="p">,</span> <span class="n">replace</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">arm</span>

<span class="k">def</span> <span class="nf">update_weights</span><span class="p">(</span><span class="n">weights</span><span class="p">,</span> <span class="n">gamma</span><span class="p">,</span> <span class="n">movieId_weight_mapping</span><span class="p">,</span> <span class="n">probability_distribution</span><span class="p">,</span> <span class="n">actions</span><span class="p">):</span>
    <span class="c1"># iter through actions. up to n updates / rec
</span>    <span class="k">if</span> <span class="n">actions</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">weights</span>
    <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">actions</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]):</span>
        <span class="n">action</span> <span class="o">=</span> <span class="n">actions</span><span class="p">[</span><span class="n">a</span><span class="p">:</span><span class="n">a</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span>
        <span class="n">weight_idx</span> <span class="o">=</span> <span class="n">movieId_weight_mapping</span><span class="p">[</span><span class="n">action</span><span class="p">.</span><span class="n">movieId</span><span class="p">.</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span>
        <span class="n">estimated_reward</span> <span class="o">=</span> <span class="mf">1.0</span> <span class="o">*</span> <span class="n">action</span><span class="p">.</span><span class="n">liked</span><span class="p">.</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">/</span> <span class="n">probability_distribution</span><span class="p">[</span><span class="n">weight_idx</span><span class="p">]</span>
        <span class="n">weights</span><span class="p">[</span><span class="n">weight_idx</span><span class="p">]</span> <span class="o">*=</span> <span class="n">math</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">estimated_reward</span> <span class="o">*</span> <span class="n">gamma</span> <span class="o">/</span> <span class="n">num_arms</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">weights</span>

<span class="k">def</span> <span class="nf">exp3_policy</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">history</span><span class="p">,</span> <span class="n">t</span><span class="p">,</span> <span class="n">weights</span><span class="p">,</span> <span class="n">movieId_weight_mapping</span><span class="p">,</span> <span class="n">gamma</span><span class="p">,</span> <span class="n">n_recs</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">):</span>
    <span class="s">'''
    Applies EXP3 policy to generate movie recommendations
    Args:
        df: dataframe. Dataset to apply EXP3 policy to
        history: dataframe. events that the offline bandit has access to (not discarded by replay evaluation method)
        t: int. represents the current time step.
        weights: array or list. Weights used by EXP3 algorithm.
        movieId_weight_mapping: dict. Mapping between movie IDs and their index in the array of EXP3 weights.
        gamma: float. hyperparameter for algorithm tuning.
        n_recs: int. Number of recommendations to generate in each iteration. 
        batch_size: int. Number of observations to show recommendations to in each iteration.
    '''</span>
    <span class="n">probability_distribution</span> <span class="o">=</span> <span class="n">distr</span><span class="p">(</span><span class="n">weights</span><span class="p">,</span> <span class="n">gamma</span><span class="p">)</span>
    <span class="n">recs</span> <span class="o">=</span> <span class="n">draw</span><span class="p">(</span><span class="n">probability_distribution</span><span class="p">,</span> <span class="n">n_recs</span><span class="o">=</span><span class="n">n_recs</span><span class="p">)</span>
    <span class="n">history</span><span class="p">,</span> <span class="n">action_score</span> <span class="o">=</span> <span class="n">score</span><span class="p">(</span><span class="n">history</span><span class="p">,</span> <span class="n">df</span><span class="p">,</span> <span class="n">t</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">,</span> <span class="n">recs</span><span class="p">)</span>
    <span class="n">weights</span> <span class="o">=</span> <span class="n">update_weights</span><span class="p">(</span><span class="n">weights</span><span class="p">,</span> <span class="n">gamma</span><span class="p">,</span> <span class="n">movieId_weight_mapping</span><span class="p">,</span> <span class="n">probability_distribution</span><span class="p">,</span> <span class="n">action_score</span><span class="p">)</span>
    <span class="n">action_score</span> <span class="o">=</span> <span class="n">action_score</span><span class="p">.</span><span class="n">liked</span><span class="p">.</span><span class="n">tolist</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">history</span><span class="p">,</span> <span class="n">action_score</span><span class="p">,</span> <span class="n">weights</span>


<span class="n">movieId_weight_mapping</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">movieId</span><span class="p">.</span><span class="n">unique</span><span class="p">())))</span>
<span class="n">history</span><span class="p">,</span> <span class="n">action_score</span><span class="p">,</span> <span class="n">weights</span> <span class="o">=</span> <span class="n">exp3_policy</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">history</span><span class="p">,</span> <span class="n">t</span><span class="p">,</span> <span class="n">weights</span><span class="p">,</span> <span class="n">movieId_weight_mapping</span><span class="p">,</span> <span class="n">args</span><span class="p">.</span><span class="n">gamma</span><span class="p">,</span> <span class="n">args</span><span class="p">.</span><span class="n">n</span><span class="p">,</span> <span class="n">args</span><span class="p">.</span><span class="n">batch_size</span><span class="p">)</span>	
<span class="n">rewards</span><span class="p">.</span><span class="n">extend</span><span class="p">(</span><span class="n">action_score</span><span class="p">)</span>
</code></pre></div></div>

<p>Jeremy Kun again provides a great explanation of its <a href="https://jeremykun.com/2013/11/08/adversarial-bandits-and-the-exp3-algorithm/">theoretical underpinnings and regret bounds</a>. I drew heavily from his post and the <a href="https://en.wikipedia.org/wiki/Multi-armed_bandit#Exp3">EXP3 Wikipedia entry</a> in writing this section.</p>

<h1 id="results-from-a-movie-recommendation-task">Results from a Movie Recommendation Task</h1>

<p>It’s expected that these bandit algorithms’ performance relative to one another will depend heavily on the task. Frequently introducing new arms might benefit a UCB algorithm’s efficient exploration policy, for example, while an adversarial task such as learning to play a game might favor the randomness baked into EXP3’s policy. Futile as it may be to declare one of them the “best” algorithm, let’s throw them all at a broadly useful task and see which bandit is best fit for the job.</p>

<p>Here I’ll use the Movielens dataset, reporting the mean and cumulative reward over time for each algorithm. For more details on the experiment setup, see the <strong>Dataset and Experiment Setup</strong> section at the beginning of this post, or <a href="https://jamesrledoux.com/algorithms/offline-bandit-evaluation/">this post</a>, which discusses offline bandit evaluation in full detail.</p>

<p>First we’ll need to tune each algorithm’s hyperparameters so that we compare each algorithm at its best. This means finding an optimal value of <code class="language-plaintext highlighter-rouge">epsilon</code> for epsilon greedy, the scale parameter that determines the size of the confidence interval for Bayesian UCB, and <code class="language-plaintext highlighter-rouge">gamma</code> for EXP3. I’ll leave UCB1 alone, since it’s not typically seen as having tunable hyperparameters (although a parameterized version does exist that’s slightly more involved to implement and less theoretically sound). I identified good values for these hyperparameters by trying six linearly spaced values across a range that subjectively seemed reasonable to me, and selecting the value that yielded the highest mean reward over the lifetime of the algorithm. Each parameter search was run with batch sizes of 10,000 events and recommendation slates of 5 movies at each pass of the algorithm.</p>
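<p>The search itself doesn’t need anything fancy. Here’s a sketch of the loop, assuming a hypothetical <code class="language-plaintext highlighter-rouge">run_simulation</code> helper (not from the post’s code) that replays the dataset with one hyperparameter value and returns the lifetime mean reward:</p>

```python
import numpy as np

def tune(run_simulation, candidate_values):
    """Try each candidate value; keep the one with the highest lifetime mean reward."""
    results = {v: run_simulation(v) for v in candidate_values}
    best = max(results, key=results.get)
    return best, results

# e.g. six linearly spaced candidates for epsilon
epsilon_grid = np.linspace(0.01, 0.5, 6)
```

The same loop works for <code class="language-plaintext highlighter-rouge">gamma</code> or the Bayesian UCB scale parameter; only the grid and the simulation function change.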

<p align="center">
    <img src="/images/fulls/bandit_tuning_all_plots.png" alt="" />
</p>

<p>The above three plots show the mean reward for the three classes of algorithm across different hyperparameter values. The best gamma for EXP3 was 0.1, the best epsilon for Epsilon Greedy was 0.1, and the best UCB algorithm was a Bayesian UCB using a scale parameter of 1.5.</p>

<p>I used a large batch size of 10,000 recommendations per iteration while running the above hyperparameter search to speed things up, since a single bandit runs fairly slowly on a large dataset, let alone the 19 runs this parameter search required. For the final evaluation, now that we’ve selected the best possible version of each algorithm, I’ll reduce the batch size to just 100 recommendations per pass, giving each bandit more opportunities to update its explore-exploit policy.</p>

<p>Without further ado, here’s the cumulative and 200-movie trailing average reward generated by each of these parameter-tuned bandits over time:</p>

<p align="center">
    <img src="/images/fulls/final_bandit_results.png" alt="" />
</p>

<p>The first takeaway is that EXP3 significantly underperforms Epsilon Greedy and Bayesian UCB. This is fairly consistent with results I’ve seen from other people’s implementations.</p>

<p>More interestingly, we see the UCB bandit achieve a higher cumulative and average reward than the other two algorithms. It’s predictably a slower learner than Epsilon Greedy. All arms start with a large confidence interval since nothing is initially known about them, so it begins its simulation highly biased toward exploration over exploitation. Meanwhile, Epsilon Greedy spends most of its time exploiting, which gives it a faster initial climb toward its eventual peak performance. Due to its more principled and efficient approach to exploration, however, the UCB bandit ultimately learns the best policy, overtaking Epsilon Greedy after roughly 25,000 training iterations.</p>

<p>The final mean rewards yielded by these three approaches, after roughly 1,000,000 training iterations, were 0.567 for the Bayesian UCB algorithm, 0.548 for Epsilon Greedy, and 0.468 for EXP3. To give some additional context to this, random guessing in this task would yield an average reward of 0.309 (the mean “like” rate in this dataset), so all three algorithms have clearly achieved some degree of learning in this task.</p>
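<p>Those final numbers are easier to appreciate as lift over the random-guessing baseline; a quick back-of-the-envelope calculation using the figures above:</p>

```python
baseline = 0.309  # mean "like" rate: the expected reward of random guessing

final_rewards = {
    'bayesian_ucb': 0.567,
    'epsilon_greedy': 0.548,
    'exp3': 0.468,
}

# reward relative to random guessing; every algorithm beats the baseline
lift = {name: reward / baseline for name, reward in final_rewards.items()}
```

Even the weakest algorithm here, EXP3, serves a liked movie roughly 1.5x as often as random guessing, while the Bayesian UCB manages about 1.8x.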

<h1 id="parting-thoughts">Parting Thoughts</h1>

<p>In this post I discussed and implemented four multi-armed bandit algorithms: Epsilon Greedy, EXP3, UCB1, and Bayesian UCB. Faced with a content-recommendation task (recommending movies using the Movielens-25m dataset), Epsilon Greedy and both UCB algorithms did particularly well, with the Bayesian UCB algorithm being the most performant of the group. This experiment shows that these algorithms can be viable choices for a production recommender system, all doing significantly better than random guessing and adapting their policies in intelligent ways as they obtain more information about their environments.</p>

<p>One important consideration that this experiment demonstrates is that picking a bandit algorithm isn’t a one-size-fits-all task. The suitability of any given algorithm for your task depends not only on your problem domain, but also on the size of your dataset. While the UCB algorithm was ultimately the most successful in this experiment, it took roughly 25,000 iterations of the algorithm for it to reach a point where it consistently outperformed Epsilon Greedy. This demonstrates that, depending on the volume of your data, you may want a faster-learning algorithm such as Epsilon Greedy, rather than a slower-learning, but ultimately more performant algorithm such as a Bayesian UCB.</p>

<p>A second thing to consider is that none of these algorithms take into account information about their environment or a user’s past behavior. A traditional recommender system may still outperform any of these bandits if you have other features at your disposal to make accurate predictions for a given user, as opposed to making global optimizations that apply uniformly to all users as is the case with these four bandit algorithms. There exists a compromise between these two approaches called Contextual Bandits, which apply a bandit-learning approach but use information about content and users to make more accurate recommendations. I may explore these in a future post to see how a contextual bandit fares compared to these four context-free bandits.</p>

<p>I would last like to thank <a href="https://jeremykun.com/2013/11/08/adversarial-bandits-and-the-exp3-algorithm/">Jeremy</a> <a href="https://jeremykun.com/2013/10/28/optimism-in-the-face-of-uncertainty-the-ucb1-algorithm/">Kun</a>, <a href="https://lilianweng.github.io/lil-log/2018/01/23/the-multi-armed-bandit-problem-and-its-solutions.html#upper-confidence-bounds">Lilian Weng</a>, and <a href="https://www.cs.bham.ac.uk/internal/courses/robotics/lectures/ucb1.pdf">Noel Welsh</a>, whose resources I found very helpful in understanding the mathematics behind UCB1 and EXP3. I would recommend any of their above-linked resources for further reading on these topics.</p>

<p>Code for this post can be found <a href="https://github.com/jldbc/bandits">on github</a>.</p>

<h3 id="want-to-learn-more-about-multi-armed-bandit-algorithms-i-recommend-reading-bandit-algorithms-for-website-optimization-by-john-myles-white">Want to learn more about multi-armed bandit algorithms? I recommend reading <a href="https://www.amazon.com/Bandit-Algorithms-Website-Optimization-Developing/dp/1449341330?tag=ledoux-20">Bandit Algorithms for Website Optimization by John Myles White</a>.</h3>

<p><img src="https://images-na.ssl-images-amazon.com/images/I/51sDue2Z-9L._SX375_BO1,204,203,200_.jpg" alt="Bandits Book Cover" /></p>

<p><a href="https://www.amazon.com/Bandit-Algorithms-Website-Optimization-Developing/dp/1449341330?tag=ledoux-20">Get it on Amazon here for $17.77</a></p>]]></content><author><name>James LeDoux</name><email>ledoux.james.r@gmail.com</email></author><category term="algorithms" /><summary type="html"><![CDATA[This post explores four algorithms for solving the multi-armed bandit problem (Epsilon Greedy, EXP3, Bayesian UCB, and UCB1), with implementations in Python and discussion of experimental results using the Movielens-25m dataset.]]></summary></entry><entry><title type="html">Offline Evaluation of Multi-Armed Bandit Algorithms in Python using Replay</title><link href="https://jamesrledoux.com/algorithms/offline-bandit-evaluation/" rel="alternate" type="text/html" title="Offline Evaluation of Multi-Armed Bandit Algorithms in Python using Replay" /><published>2020-01-20T00:00:00+00:00</published><updated>2020-01-20T00:00:00+00:00</updated><id>https://jamesrledoux.com/algorithms/offline-bandit-evaluation</id><content type="html" xml:base="https://jamesrledoux.com/algorithms/offline-bandit-evaluation/"><![CDATA[<meta property="og:image" content="/images/fulls/robot-cowboy.png" />

<meta property="og:image:type" content="image/png" />

<meta property="og:image:width" content="200" />

<meta property="og:image:height" content="200" />

<meta name="twitter:card" content="summary_large_image" />

<meta name="twitter:site" content="@jmzledoux" />

<meta name="twitter:creator" content="@jmzledoux" />

<meta name="twitter:title" content="Offline Evaluation of Multi-Armed Bandit Algorithms in Python using Replay" />

<meta name="twitter:image" content="http://jamesrledoux.com/images/fulls/robot-cowboy.png" />

<p align="center">
    <img src="/images/fulls/robot-cowboy.png" alt="" />
</p>

<p>Multi-armed bandit algorithms are seeing renewed excitement in research and industry. Part of this is likely because they address some of the major problems internet companies face today: a need to explore a constantly changing landscape of content (news articles, videos, ads, insert whatever your company does here) while avoiding wasting too much time showing low-quality content to users. Part of it may also be related to advances in a class of personalizable bandit algorithms, <em>contextual bandits</em>, which pair nicely with recent progress in reinforcement learning.</p>

<p>In either case, bandit algorithms are notoriously hard to work with using real-world datasets. Because they are online learning algorithms, there’s some nuance to evaluating and tuning them offline without exposing an untested algorithm to real users in a live production setting. It’s important to be able to evaluate these algorithms offline, however, for at least two reasons. First, not everybody has access to a production environment with the scale required to experiment with an online learning algorithm. And second, even those who do have a popular product at their disposal should probably be more careful with it than blindly throwing algorithms into production and hoping they’re successful.</p>

<p>Whether you’re a hobbyist wanting to experiment with bandits in your free time or someone at a big company who wants to optimize an algorithm before exposing it to users, you’re going to need to evaluate your model offline. This post discusses some methods I’ve found useful in doing this.</p>

<p><br /></p>

<h1 id="creating-a-dataset">Creating a Dataset</h1>
<p>For this post I’m using the <a href="https://grouplens.org/datasets/movielens/25m/">Movielens 25m</a> dataset. This dataset includes roughly 25m movie ratings for 27,000 movies provided by 138,000 users of the University of Minnesota’s <a href="https://movielens.org/">Movielens</a> service.</p>

<p>To cast this dataset as a bandit problem, we’ll pretend that a user rated every movie that they saw, ignoring any sort of non-rating bias that may exist. Since bandit algorithms have a time component (they can only see data from the past, which is constantly updated as the model learns), I shuffle the data and create a pseudo-timestamp value (really just a row number, but enough for a simulated bandit environment). To simplify further, I recast the problem from a 0–5 star rating problem to a binary problem of modeling whether or not a user “liked” a movie: a rating of 4.5 stars or more counts as a “liked” movie, and anything else as a movie the user didn’t like. To further aid learning, I discard movies with fewer than 1,500 ratings, since too few ratings for a movie can cause the model to get stuck in offline evaluation, for reasons that will make more sense soon.</p>

<p>The end result is a dataset of roughly 6.5 million binary like/no-like movie ratings of the form:</p>

<p>\([timestamp, userId, movieId, liked]\).</p>

<p>I do this by constructing the following <code class="language-plaintext highlighter-rouge">get_ratings_25m</code> function, which creates the dataset and turns it into a viable bandit problem.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">read_movielens_25m</span><span class="p">():</span>
    <span class="n">ratings</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'ratings.csv'</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'python'</span><span class="p">)</span>
    <span class="n">movies</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'movies.csv'</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'python'</span><span class="p">)</span>
    <span class="n">links</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'links.csv'</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'python'</span><span class="p">)</span>
    <span class="n">movies</span> <span class="o">=</span> <span class="n">movies</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">movies</span><span class="p">.</span><span class="n">genres</span><span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">get_dummies</span><span class="p">().</span><span class="n">astype</span><span class="p">(</span><span class="nb">bool</span><span class="p">))</span>
    <span class="n">movies</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'genres'</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">ratings</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">movies</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="s">'movieId'</span><span class="p">,</span> <span class="n">how</span><span class="o">=</span><span class="s">'left'</span><span class="p">,</span> <span class="n">rsuffix</span><span class="o">=</span><span class="s">'_movie'</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">df</span>

<span class="k">def</span> <span class="nf">preprocess_movielens_25m</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">min_number_of_reviews</span><span class="o">=</span><span class="mi">20000</span><span class="p">):</span>
    <span class="c1"># remove ratings of movies with &lt; N ratings. too few ratings will cause the recsys to get stuck in offline evaluation
</span>    <span class="n">counts</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">movieId</span><span class="p">.</span><span class="n">value_counts</span><span class="p">()</span>
    <span class="n">movies_to_keep</span> <span class="o">=</span> <span class="n">counts</span><span class="p">[</span><span class="n">counts</span> <span class="o">&gt;=</span> <span class="n">min_number_of_reviews</span><span class="p">].</span><span class="n">index</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s">'movieId'</span><span class="p">].</span><span class="n">isin</span><span class="p">(</span><span class="n">movies_to_keep</span><span class="p">)]</span>
    <span class="c1"># shuffle rows to debias order of user ids
</span>    <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="n">frac</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="c1"># create a 't' column to represent time steps for the bandit to simulate a live learning scenario
</span>    <span class="n">df</span><span class="p">[</span><span class="s">'t'</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>
    <span class="n">df</span><span class="p">.</span><span class="n">index</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'t'</span><span class="p">]</span>
    <span class="c1"># rating &gt;= 4.5 stars is a 'like', &lt; 4.5 stars is a 'dislike'
</span>    <span class="n">df</span><span class="p">[</span><span class="s">'liked'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'rating'</span><span class="p">].</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="mi">1</span> <span class="k">if</span> <span class="n">x</span> <span class="o">&gt;=</span> <span class="mf">4.5</span> <span class="k">else</span> <span class="mi">0</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">df</span>

<span class="k">def</span> <span class="nf">get_ratings_25m</span><span class="p">(</span><span class="n">min_number_of_reviews</span><span class="o">=</span><span class="mi">20000</span><span class="p">):</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">read_movielens_25m</span><span class="p">()</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">preprocess_movielens_25m</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">min_number_of_reviews</span><span class="o">=</span><span class="n">min_number_of_reviews</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">df</span>
</code></pre></div></div>

<p><br /></p>

<h1 id="setting-up-a-simulation-environment">Setting Up a Simulation Environment</h1>
<p>Now that we have a dataset, we need to construct a simulation environment for training the bandit. A traditional ML model is trained by building representative training and test sets: you train and tune the model on the training set and evaluate its performance on the test set. A bandit algorithm isn’t so simple. Bandits are algorithms that learn over time. At each time step, the bandit needs to observe data from the past, update its decision rule, take action by serving predictions based on this updated policy, and observe a reward value for those actions. The time component means that the training data at the bandit’s disposal is constantly changing, and that the score used to evaluate it also changes over time, driven by small pieces of feedback from the most recent time step rather than by a large held-out test set like you’d use with a traditional ML approach.</p>

<p>This learning process is computationally tedious when there are a large number of time steps. In a perfect world, a bandit would view each event as its own time step and make a large number of small improvements. With large datasets and the need for offline evaluation, this is often unreasonable. Bandits can be very slow to train if they’re updated once for each row in your dataset, and using large datasets is important in an offline evaluation setting because a large number of observations end up needing to be discarded by the algorithm (more on this in the next section). For these reasons, it proves useful to deviate from the theoretical setting by batching the learning process in two ways.</p>

<p>First, we batch in time steps. Instead of updating the algorithm once per rating event, we can update it once every \(n\) events, requiring \(\frac{t}{n}\) training steps instead of \(t\) to make it through the whole dataset. This shortcut is a realistic one, as even a live production environment is probably going to be making updates on some sort of cron schedule.</p>

<p>Second, we can expand this from a single-movie recommendation problem to a slate recommendation problem. In the simplest theoretical setting, a bandit recommends one movie and the user reacts by liking it or not liking it. When we evaluate a bandit using historic data, we don’t always know how a user would have reacted to our recommendation policy, since we only know the user’s reaction to the movie they were served by the system that was in production when they visited the website. We need to discard such recommendations, and for this reason, recommending one movie at a time proves inefficient due to the large volume of recommendations we can’t learn from.</p>

<p>To learn more efficiently, we can instead recommend <strong>slates</strong> of movies. A slate is just a technical term for recommending more than one movie at a time. In this case, we can recommend the bandit’s top 5 movies to a user, and if the user rated one of those movies, we can use that observation to improve the algorithm. This way, we’re much more likely to receive something from this training iteration that helps the model to improve.</p>

<p>Slate recommendations are picking up research interest due to their practicality. Most modern recommender systems are recommending more than one piece of content at a time, after all (see: YouTube, Netflix, Spotify, etc.) These papers (<a href="https://arxiv.org/pdf/1905.12767.pdf">1</a>, <a href="https://arxiv.org/pdf/1812.02353.pdf">2</a>) from Ie et al. (1) and Chen et al. (2) are helpful examples of modern approaches to slate recommendation problems.</p>

<p>Last, we need to create a second dataset that represents a subset of the full dataset. I call this dataset <code class="language-plaintext highlighter-rouge">history</code> in my implementation, because it represents the historic record of events that the bandit is able to use to influence its recommendations. Because a bandit is an online learner, it needs a dataset containing only events prior to the current timestep we’re simulating in order for it to act like it will in a production setting. I do this by initializing an empty dataframe prior to training with the same format as the full dataset I built in the previous section, then growing this dataset at each time step by appending new rows. The reason it’s useful to keep this as a separate dataframe, rather than just filtering the complete dataset at each time step, is that not all events can be added to the <code class="language-plaintext highlighter-rouge">history</code> dataset. I’ll explain which events get added to this dataset and which don’t in the next section of this post, but for now, you’ll see in the code below that the <code class="language-plaintext highlighter-rouge">history</code> dataframe is updated by our scoring function at each time step.</p>

<p>Here’s how this all looks in Python. Note that this uses a <code class="language-plaintext highlighter-rouge">score</code> function which we haven’t yet defined. I’m also using a naive recommendation policy that just selects random movies, since this post is about the training methodology rather than the algorithm itself. I’ll explore various bandit algorithms in a future post.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># simulation params: slate size, batch size (number of events per training iteration)
</span><span class="n">slate_size</span> <span class="o">=</span> <span class="mi">5</span>
<span class="n">batch_size</span> <span class="o">=</span> <span class="mi">10</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">get_ratings_25m</span><span class="p">(</span><span class="n">min_number_of_reviews</span><span class="o">=</span><span class="mi">1500</span><span class="p">)</span>

<span class="c1"># initialize empty history 
# (the algorithm should be able to see all events and outcomes prior to the current timestep, but no current or future outcomes)
</span><span class="n">history</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">)</span>
<span class="n">history</span> <span class="o">=</span> <span class="n">history</span><span class="p">.</span><span class="n">astype</span><span class="p">({</span><span class="s">'movieId'</span><span class="p">:</span> <span class="s">'int32'</span><span class="p">,</span> <span class="s">'liked'</span><span class="p">:</span> <span class="s">'float'</span><span class="p">})</span>

<span class="c1"># initialize empty list for storing scores from each step
</span><span class="n">rewards</span> <span class="o">=</span> <span class="p">[]</span>

<span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">//</span><span class="n">batch_size</span><span class="p">):</span>
    <span class="n">t</span> <span class="o">=</span> <span class="n">t</span> <span class="o">*</span> <span class="n">batch_size</span>
    <span class="c1"># generate recommendations from a random policy
</span>    <span class="n">recs</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">movieId</span><span class="p">.</span><span class="n">unique</span><span class="p">(),</span> <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="n">slate_size</span><span class="p">),</span> <span class="n">replace</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="c1"># send recommendations and dataset to a scoring function so the model can learn &amp; adjust its policy in the next iteration
</span>    <span class="n">history</span><span class="p">,</span> <span class="n">action_score</span> <span class="o">=</span> <span class="n">replay_score</span><span class="p">(</span><span class="n">history</span><span class="p">,</span> <span class="n">df</span><span class="p">,</span> <span class="n">t</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">,</span> <span class="n">recs</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">action_score</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">action_score</span> <span class="o">=</span> <span class="n">action_score</span><span class="p">.</span><span class="n">liked</span><span class="p">.</span><span class="n">tolist</span><span class="p">()</span>
        <span class="n">rewards</span><span class="p">.</span><span class="n">extend</span><span class="p">(</span><span class="n">action_score</span><span class="p">)</span>
</code></pre></div></div>

<p><br /></p>

<h1 id="offline-evaluation-of-an-online-learnering-algorithm">Offline Evaluation of an Online Learning Algorithm</h1>
<p>Your bandit’s recommendations will be different from those generated by the model whose recommendations are reflected in your historic dataset. This creates problems which lead to some of the key challenges in evaluating these algorithms using historic data.</p>

<p>The first reason this is problematic is that your data is probably biased. An online learner requires a feedback loop where it presents an action, observes a user’s response, and then updates its policy accordingly. A historic dataset is going to be biased by the mechanism that generated it. Your algorithm assumes that <em>it</em> is what generated the recommendation, but in reality, everything in your dataset was generated by a completely separate model or heuristic. An ideal solution is to randomize the recommendation policy of the production system that’s generating your training data, creating a dataset that’s independent and identically distributed and free of algorithmic bias. You may not be able to implement this if you’re receiving data from an outside party, or if randomizing a recommendation policy for the sake of better training data is too harmful to the user experience, but it’s worth at least being aware of algorithmic bias in your training data if it’s going to affect the bandit you’re training.</p>

<p>The second problem is that your algorithm will often produce recommendations that are different from the recommendations seen by users in the historic dataset. You can’t supply a reward value for these recommendations because you don’t know what the user’s response would have been to a recommendation they never saw. You can only know how a user responded to what was supplied to them by the production system. The solution to this is a method called <strong>replay</strong> (<a href="https://arxiv.org/abs/1003.5956">Li et al., 2010</a>). Replay evaluation essentially takes your historic event stream and your algorithm’s recommendations at each time step, and throws out all samples except those where your model’s recommendation is the same as the one the user saw in the historic dataset. This, paired with an unbiased data-generating mechanism (such as a randomized recommendation policy), proves to be an unbiased method for offline evaluation of an online learning algorithm.</p>

<p>Netflix’s Jaya Kawale and Fernando Amat provide a nice visual explanation of Replay in <a href="https://www.slideshare.net/JayaKawale/a-multiarmed-bandit-framework-for-recommendations-at-netflix">these slides</a> from a 2018 conference talk. In this image, there is a production movie recommendation policy (top row) and an offline recommendation policy from a bandit they’re training (bottom). Replay selects the cases where the two recommendation policies agree with each other (the columns with black boxes surrounding them: users 1, 4, and 6), and uses only these play/no-play decisions to score the offline model. So, in this example, the model gets a score of 2/3, since 2 of the 3 matches between the two policies were played.</p>

<p align="center">
    <img src="/images/fulls/netflix-replay.png" alt="" />
</p>
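<p>Replay scoring boils down to a filter followed by an average. Here’s a minimal toy example in the same spirit as the slide above (the recommendations and play outcomes are invented for illustration, not taken from the slide’s data):</p>

```python
# Toy replay example: six users, what the production policy showed each of
# them, what the offline bandit would have shown, and whether the user played
# what they actually saw. All values are invented for illustration.
online_recs  = ['A', 'B', 'C', 'A', 'D', 'B']   # shown by the production policy
offline_recs = ['A', 'C', 'B', 'A', 'C', 'B']   # what the bandit would have shown
played       = [1, 0, 1, 0, 1, 1]               # did the user play what they saw?

# replay keeps only the time steps where the two policies agree
matches = [p for on, off, p in zip(online_recs, offline_recs, played) if on == off]
replay_score = sum(matches) / len(matches)
print(matches, replay_score)  # [1, 0, 1] -> a replay score of 2/3
```

Two of the three matched recommendations were played, reproducing the 2/3 score from the slide.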

<p>One drawback to this method is that it significantly shrinks the size of your dataset. If you have \(k\) arms and \(T\) samples, you can expect to have \(\frac{T}{k}\) usable recommendations for evaluating the model (<a href="https://arxiv.org/abs/1003.5956">Li et al., 2010</a>). For this reason, you’re going to need a large dataset in order to test your algorithm, since replay evaluation is going to discard most of your data. Nicol et al. <a href="https://arxiv.org/pdf/1405.3536.pdf">(2014)</a> explore ways to improve this via bootstrapping, but for this post I’m using the classic replay method for evaluating the models.</p>
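<p>That \(\frac{T}{k}\) figure is easy to sanity-check with a quick simulation. This is a sketch under the assumption that both the logging policy and the policy being evaluated choose arms uniformly at random:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
T, k = 100_000, 20  # T logged events, k arms

logged = rng.integers(k, size=T)     # arm shown by the randomized production policy
evaluated = rng.integers(k, size=T)  # arm the offline policy would have chosen

# replay can only score the events where the two policies happen to agree
usable = int((logged == evaluated).sum())
print(usable)  # close to T / k = 5,000 usable events out of 100,000
```

Roughly 95% of the logged data is discarded here, which is why a large source dataset matters so much for replay evaluation.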

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">replay_score</span><span class="p">(</span><span class="n">history</span><span class="p">,</span> <span class="n">df</span><span class="p">,</span> <span class="n">t</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">,</span> <span class="n">recs</span><span class="p">):</span>
    <span class="c1"># reward if rec matches logged data, ignore otherwise
</span>    <span class="n">actions</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">t</span><span class="p">:</span><span class="n">t</span><span class="o">+</span><span class="n">batch_size</span><span class="p">]</span>
    <span class="n">actions</span> <span class="o">=</span> <span class="n">actions</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">actions</span><span class="p">[</span><span class="s">'movieId'</span><span class="p">].</span><span class="n">isin</span><span class="p">(</span><span class="n">recs</span><span class="p">)]</span>
    <span class="n">actions</span><span class="p">[</span><span class="s">'scoring_round'</span><span class="p">]</span> <span class="o">=</span> <span class="n">t</span>
    <span class="c1"># add row to history if recs match logging policy
</span>    <span class="n">history</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">history</span><span class="p">,</span> <span class="n">actions</span><span class="p">])</span>
    <span class="n">action_liked</span> <span class="o">=</span> <span class="n">actions</span><span class="p">[[</span><span class="s">'movieId'</span><span class="p">,</span> <span class="s">'liked'</span><span class="p">]]</span>
    <span class="k">return</span> <span class="n">history</span><span class="p">,</span> <span class="n">action_liked</span>
</code></pre></div></div>

<p>It’s important to note that replay evaluation is more than just a technique for deciding which events to use for scoring an algorithm’s performance. Replay also decides which events from the original dataset your bandit is allowed to see in future time steps. In order to mirror a real-world online learning scenario, a bandit starts with no data and adds new data points to its memory as it observes how users react to its recommendations. It’s not realistic to let the bandit have access to data points that didn’t come from its recommendation policy. We have to pretend those events didn’t happen; otherwise, the offline bandit would receive most of its data from another algorithm’s policy and would simply end up copying the recommendations reflected in the original dataset. For this reason, we should only add a row to the bandit’s <code class="language-plaintext highlighter-rouge">history</code> dataset when the replay technique returns a match between the online and offline policies. In the above function, this can be seen in the <code class="language-plaintext highlighter-rouge">history</code> dataframe, to which we only append actions that are matched between the policies.</p>

<p>The final result of this is a complete bandit setting, constructed using historic data. The bandit steps through the dataset, making recommendations based on a policy it’s learning from the data. It begins with zero context on user behavior (an empty <code class="language-plaintext highlighter-rouge">history</code> dataframe). It receives user feedback as it recommends movies that match with the recommendations present in the historic dataset. Each time it encounters such a match, it adds this context to its <code class="language-plaintext highlighter-rouge">history</code> dataset and can use this as future context for improving its recommendation policy. Over time, <code class="language-plaintext highlighter-rouge">history</code> grows larger (although never nearly as large as the original dataset, since replay discards most recommendations), and the bandit becomes more effective in completing its movie recommendation task.</p>

<p><br /></p>

<h1 id="evaluation-metrics-for-a-bandit">Evaluation Metrics for a Bandit</h1>
<p>The last piece you’ll need to evaluate your bandit is one or more evaluation metrics. The literature around bandits focuses primarily on something called <strong>regret</strong> as its metric of choice. Regret can be loosely defined as the difference between the reward of the arm chosen by an algorithm and the reward it <em>would have</em> received had it acted optimally and chosen the best possible arm. You will find pages and pages of proofs showing upper bounds on the regret of any particular bandit algorithm. For our purposes, though, regret is a flawed metric. To measure regret, you need to know the reward of the arms that the bandit didn’t choose. In the real world you will never know this! Analyses of this optimal, counterfactual world are academically important, but they don’t take us far in the applied world. We need another metric.</p>
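<p>A small simulation makes the distinction concrete. The arm probabilities below are invented for illustration; the point is that computing regret requires knowing them, while cumulative reward requires only the outcomes you actually observed:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
true_probs = np.array([0.02, 0.05, 0.11])  # known in a simulation, never in production
best = true_probs.max()

n = 10_000
chosen = rng.integers(len(true_probs), size=n)  # some policy's arm choices
rewards = rng.random(n) < true_probs[chosen]    # observed Bernoulli rewards

cumulative_reward = int(rewards.sum())                        # measurable from logs alone
cumulative_regret = float((best - true_probs[chosen]).sum())  # needs the true probabilities
```

Only <code class="language-plaintext highlighter-rouge">cumulative_reward</code> is computable from logged data, which is why it’s the metric used below.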

<p>The good news is that, while we can’t measure an algorithm’s cumulative regret, we <em>can</em> measure its cumulative reward, which, in practical terms, is just as good. This is simply the cumulative sum of all the bandit’s replay scores from the cases where a non-null score exists. This is my preferred metric for evaluating a bandit’s offline performance.</p>

<p>It may also be useful to include some metrics that are more meaningful to the specific task the bandit is performing. If it’s recommending articles or ads on a website, for example, you may want to measure an N-timestep trailing click-through rate to see how CTR improves as the algorithm learns. If you’re recommending videos or articles, you may want to measure the completion rate of the views the algorithm generates to make sure it’s not recommending clickbait.</p>

<p>In the case of this dataset, I’ll implement a cumulative reward metric and a 50-timestep trailing CTR, and return both as lists so they can be analyzed as a time series if needed.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cumulative_rewards</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">cumsum</span><span class="p">(</span><span class="n">rewards</span><span class="p">)</span>
<span class="n">trailing_ctr</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">(</span><span class="n">rewards</span><span class="p">).</span><span class="n">rolling</span><span class="p">(</span><span class="mi">50</span><span class="p">).</span><span class="n">mean</span><span class="p">())</span>
</code></pre></div></div>

<p><br /></p>

<h1 id="parting-thoughts">Parting Thoughts</h1>
<p>Training a multi-armed bandit using a historic dataset is a bit cumbersome compared to training a traditional machine learning model, but none of the individual methods involved are prohibitively complex. I hope some of the logic laid out in this post is useful for others as they approach similar problems, allowing you to focus on the important parts without getting too bogged down by methodology.</p>

<p>Another thing worth noting is that I’m figuring this out as I go! If you know a better way to go about this or disagree with the approach I’m laying out in this post, <a href="mailto:ledoux.james.r@gmail.com">send me a note</a> and I’d be interested in discussing this.</p>

<p>Code for this post can be found <a href="https://github.com/jldbc/bandits">on github</a>.</p>

<h3 id="want-to-learn-more-about-multi-armed-bandit-algorithms-i-recommend-reading-bandit-algorithms-for-website-optimization-by-john-myles-white">Want to learn more about multi-armed bandit algorithms? I recommend reading <a href="https://www.amazon.com/Bandit-Algorithms-Website-Optimization-Developing/dp/1449341330?tag=ledoux-20">Bandit Algorithms for Website Optimization by John Myles White</a>.</h3>

<p><img src="https://images-na.ssl-images-amazon.com/images/I/51sDue2Z-9L._SX375_BO1,204,203,200_.jpg" alt="Bandits Book Cover" /></p>

<p><a href="https://www.amazon.com/Bandit-Algorithms-Website-Optimization-Developing/dp/1449341330?tag=ledoux-20">Get it on Amazon here for $17.77</a></p>]]></content><author><name>James LeDoux</name><email>ledoux.james.r@gmail.com</email></author><category term="algorithms" /><summary type="html"><![CDATA[Multi-armed bandit algorithms are seeing renewed excitement, but evaluating their performance using a historic dataset is challenging. Here's how I go about implementing offline bandit evaluation techniques, with examples shown in Python.]]></summary></entry><entry><title type="html">Understanding the AdTech Auctions in Your Browser: an Analysis of 30,000 Prebid.js Auctions</title><link href="https://jamesrledoux.com/projects/prebid-analysis/" rel="alternate" type="text/html" title="Understanding the AdTech Auctions in Your Browser: an Analysis of 30,000 Prebid.js Auctions" /><published>2019-07-28T00:00:00+00:00</published><updated>2019-07-28T00:00:00+00:00</updated><id>https://jamesrledoux.com/projects/prebid-analysis</id><content type="html" xml:base="https://jamesrledoux.com/projects/prebid-analysis/"><![CDATA[<meta property="og:image" content="/images/fulls/sites.png" />

<meta property="og:image:type" content="image/png" />

<meta property="og:image:width" content="200" />

<meta property="og:image:height" content="200" />

<meta name="twitter:card" content="summary_large_image" />

<meta name="twitter:site" content="@jmzledoux" />

<meta name="twitter:creator" content="@jmzledoux" />

<meta name="twitter:title" content="Understanding the AdTech Auctions in Your Browser: an Analysis of 30,000 Prebid.js Auctions" />

<meta name="twitter:image" content="http://jamesrledoux.com/images/fulls/sites.png" />

<p>For the past several months I’ve been collecting bid prices from the adtech auctions taking place in my browser. What follows are some findings from this data and what they tell us about monetization strategy in digital media.</p>

<p>If you want to collect your own data, I’ve open sourced the chrome extension I used to collect data for this post. Check it out <a href="https://www.github.com/jldbc/auctionhouse">here</a>!</p>

<h1 id="a-quick-primer-on-adtech-and-header-bidding">A Quick Primer on AdTech and Header Bidding</h1>

<p>The primary method through which websites make money is selling ads (we’ll ignore subscriptions, sponsored content, etc. in this post.) In the early days of the Internet, ads were sold directly to advertisers. This proved to be profitable, but left both sides of the transaction dissatisfied: the publisher couldn’t sell enough ads to monetize all of their pageviews, and the advertiser had trouble reaching scale, since both sides needed to negotiate pricing and logistics before an ad campaign could run. Monetizing a site’s traffic was too slow and required too much human input to work at Internet scale.</p>

<p>The answer to this problem was programmatic ads. Programmatic ads allow a site to auction off its remaining ad inventory on an open market. Advertisers, similarly, can reach essentially the entire world population through these open markets if they’re willing to pay. The primary way most sites sell programmatic ads is through an ad exchange that’s built into their ad server (<a href="https://admanager.google.com/home/?authuser=0">AdX</a>, and <a href="https://www.openx.com/">OpenX</a> are two prominent examples.) The exchange, then, sends bid requests to thousands of demand-side platforms who are able to place bids to buy individual ad impressions on behalf of the brands they represent.</p>

<p>Using only one ad exchange, though, leaves a publisher’s business overly dependent on a single outside party, and also leaves them limited to the advertisers working with that particular exchange. The practice of header bidding has taken off in recent years as a response to this. Header bidding allows a publisher to make ad inventory available to several ad exchanges in parallel to the exchange that’s native to their ad server. The winning bids from all the exchanges are then able to compete with each other, with the most valuable bid winning the ad impression. This increased competition means that the publisher is able to get higher prices for their ad inventory. For a more thorough explainer, <a href="https://digiday.com/media/wtf-header-bidding/">Digiday</a> explains the practice better than I will.</p>
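<p>Stripped of the machinery, the decision at the end of a header-bidding auction is a simple one: the highest CPM across all participating exchanges wins the impression. The exchange names and prices below are purely illustrative, not real Prebid.js output:</p>

```python
# Best bid returned by each exchange for a single impression (CPM, in dollars).
# Exchange names and prices are made up for illustration.
bids = {
    'ad_server_exchange': 2.10,  # the ad server's native exchange
    'rubicon': 2.45,             # header-bidding partners queried in parallel
    'index_exchange': 1.95,
    'openx': 2.30,
}

winner = max(bids, key=bids.get)
clearing_cpm = bids[winner]
print(winner, clearing_cpm)  # rubicon 2.45
```

Without the header-bidding partners, the impression would have cleared at the native exchange’s $2.10 instead, which is the revenue argument for the extra competition.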

<h1 id="data-collection-with-auctionhouse">Data Collection with AuctionHouse</h1>

<p>There’s very little publicly-available data in the adtech space. And for good reason! For anyone whose business is to participate in advertising auctions, data (and what they do with it) is their secret sauce. Similarly, my employer wouldn’t have been too keen on me writing a post using company data. So I made my own dataset.</p>

<p>While most of the advertising world is hidden, we can get a glimpse into one special class of auction: client-side header bidding. Here the bids are placed within a user’s browser, making them accessible if you’re able to identify and understand the requests coming from an auction. I built a chrome extension called <a href="https://www.github.com/jldbc/auctionhouse">Auction House</a> that parses a browser’s requests and collects data from ad auctions, making it easy to run your own adtech data collection.</p>

<p>I collected the following data on each bid using this tool:</p>
<ul>
  <li>Timestamp</li>
  <li>URL</li>
  <li>Auction Provider (e.g. Rubicon, Index Exchange)</li>
  <li>Bid Price</li>
  <li>Ad Size (length x width, in pixels)</li>
</ul>

<p>In total I collected 96,306 bids across 30,600 auctions between January 1 and July 20, 2019. What follows are my primary findings on what this data can teach us about pricing in online advertising auctions.</p>

<h1 id="seasonality">Seasonality</h1>

<p>The first point of interest is seasonality. At the population level there are several levels of seasonality in the programmatic advertising market. Demand moves according to time of day, day of week, and day of month, as well as quarterly and annual seasonality. Time of day and day of week seasonality are mostly due to consumer behavior; you’re more likely to buy something on nights and weekends, and therefore your attention is worth more to advertisers.</p>
<p align="center">
    <img src="/images/fulls/dow2.png" alt="" />
</p>

<p align="center">
    <img src="/images/fulls/hour-fixed-axes.png" alt="" />
</p>

<p>These trends are supported in this dataset. The boxplots above show the mean and interquartile range of CPM (cost per 1,000 impressions): Friday through Sunday ad prices are significantly higher than Monday through Thursday prices. The lower quartile price for Saturday and Sunday is roughly the mean price on weekdays, which is a pretty large gap. The time-of-day pattern is less pronounced, but prices are slightly higher at night.</p>
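<p>For reference, the aggregation behind a day-of-week comparison like this is a one-line groupby in pandas. The rows below are made up, and the column names are assumptions for illustration rather than the actual AuctionHouse export schema:</p>

```python
import pandas as pd

# Tiny stand-in for the bid log; 'timestamp' and 'cpm' are assumed column names.
bids = pd.DataFrame({
    'timestamp': pd.to_datetime(['2019-01-04', '2019-01-05', '2019-01-07', '2019-01-12']),
    'cpm': [1.80, 2.60, 1.40, 2.90],
})

# label each bid with its day of week, then compare average CPM across days
bids['day_of_week'] = bids['timestamp'].dt.day_name()
daily_cpm = bids.groupby('day_of_week')['cpm'].mean()
print(daily_cpm)
```

The same pattern, swapping <code class="language-plaintext highlighter-rouge">dt.day_name()</code> for <code class="language-plaintext highlighter-rouge">dt.hour</code>, produces the time-of-day comparison.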

<p>There are also more global seasonal trends. The demand for online ad impressions is far from uniform throughout the year. A given company’s ads generally don’t make their way onto the Internet until the company has struck a deal with a demand-side platform (DSP), which handles the technical overhead of participating in ad auctions on its behalf. The DSP sets a series of goals for the company’s ads, including how many ads it plans to deliver, the expected cost, and the timeframe in which it will execute its ad buys. Because of this fairly traditional purchasing process, the demand for ads mostly follows the same business cycle as traditional business: campaigns are typically set up to run on a monthly, quarterly, and annual basis. Demand rises through each of these cycles, partially resetting at the end of each one until it peaks during the Christmas holiday.</p>

<p>These trends aren’t as pronounced in this data as I’d hoped, but they’re still visible. January has a low CPM, and you can see a slight decrease at the ends of March and June (when Q1 and Q2 budgets are expiring). You can also see the more frequent, local spikes in demand coming from weekly seasonality. It will be interesting to look back on this next January if I keep recording this data, as the CPM gain is more dramatic in Q4.</p>

<p>In the below plot, the red line is each day’s average CPM and the gray region is the interquartile range. The mean CPM always sits in the upper end of the interquartile range, since a number of very high prices skew the data in the positive direction while the other end of the distribution is bounded below by zero.</p>

<p align="center">
    <img src="/images/fulls/seasonality.png" alt="" />
</p>
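That gap between the mean and the middle of the distribution is just what right-skewed, zero-bounded data does; a toy example with made-up prices makes the mechanics concrete:

```python
# most bids are cheap, a few are expensive, and none can go below zero
cpms = [0.5, 1.0, 1.5, 2.0, 12.0]

mean_cpm = sum(cpms) / len(cpms)            # 3.4
median_cpm = sorted(cpms)[len(cpms) // 2]   # 1.5

# one large bid drags the mean well above the median; the low end,
# bounded below by zero, can't pull it back down symmetrically
```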

<p>Last, you can see that the seasonal trend isn’t the same for all sites. BuzzFeed and BuzzFeed News show steady growth, while Vox and USA Today are noisy. ESPN implements a high price floor throughout March, causing artificially high CPMs. CNN disappears completely for a few weeks in June, when it apparently stopped using Prebid temporarily. Recode disappears when its parent company shut the site down in April. This variety is fitting: there are many adtech strategies a publisher can pursue, and the sites I examine in this data employ them all to varying extents. Each figure in the below grid could be a case study in itself.</p>

<p align="center">
    <img src="/images/fulls/sites.png" alt="" />
</p>

<h1 id="ssps-the-common-the-obscure">SSPs: The Common, The Obscure</h1>

<p>The supply-side platforms (SSPs) a publisher chooses to work with are a key component of its monetization strategy. Working with more exchanges generally means more demand has access to a site’s ad inventory, which means a higher chance that it sells for a high price. Client-side header bidding also adds latency to a website, however, so at a certain point each additional SSP degrades site performance, leading to a worse user experience and a decline in traffic.</p>

<p>Here are the most common SSPs I’ve seen from observing bidding patterns over the past 7 months, ordered by the number of impressions they’ve won.</p>

<table>
  <thead>
    <tr>
      <th>Bidder</th>
      <th>CPM (mean)</th>
      <th>CPM (stddev)</th>
      <th>Impressions</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Rubicon</td>
      <td>3.04</td>
      <td>4.09</td>
      <td>8849</td>
    </tr>
    <tr>
      <td>AppNexus</td>
      <td>2.03</td>
      <td>2.92</td>
      <td>6534</td>
    </tr>
    <tr>
      <td>Index</td>
      <td>2.39</td>
      <td>2.74</td>
      <td>5622</td>
    </tr>
    <tr>
      <td>OpenX</td>
      <td>1.78</td>
      <td>2.14</td>
      <td>2367</td>
    </tr>
    <tr>
      <td>TrustX</td>
      <td>4.94</td>
      <td>5.29</td>
      <td>2090</td>
    </tr>
    <tr>
      <td>AOL</td>
      <td>2.01</td>
      <td>2.80</td>
      <td>1523</td>
    </tr>
    <tr>
      <td>Consumable</td>
      <td>3.55</td>
      <td>4.10</td>
      <td>1083</td>
    </tr>
    <tr>
      <td>AppNexus</td>
      <td>1.13</td>
      <td>1.79</td>
      <td>581</td>
    </tr>
    <tr>
      <td>TripleLift</td>
      <td>3.88</td>
      <td>2.94</td>
      <td>538</td>
    </tr>
    <tr>
      <td>PubMatic</td>
      <td>2.48</td>
      <td>1.54</td>
      <td>385</td>
    </tr>
  </tbody>
</table>

<p>These are all big names in the industry, each having built a massive network of DSPs that gives publishers access to the majority of the internet’s display advertising demand. There is also a long tail of lesser-known SSPs in this data. Among these are Colossus and Aardvark, which are apparently SSPs, though I couldn’t even find their company websites.</p>

<p>From the sites I examined, the sweet spot for how many exchanges to include in one’s header seems to be around four. This doesn’t mean it’s optimal, but it does appear to be an industry standard among major publishers.</p>

<h1 id="relationship-between-auction-participants-and-price">Relationship Between Auction Participants and Price</h1>

<p>Related to the number of SSPs a publisher works with is how many of them place bids on a given ad impression. Breaking it out this way, you can see the expected result: more auction participants lead to a higher clearing price. This fits with what basic economic theory teaches us: holding supply constant, an increase in demand (here, by way of expanding the number of bidders with access to a site’s inventory) leads to higher prices. The pattern is imperfect (four bids have a lower CPM than three in this data), but this is probably because it pools data across several sites.</p>

<table>
  <thead>
    <tr>
      <th>Bids Submitted</th>
      <th>CPM</th>
      <th>Count</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>1.64</td>
      <td>4264</td>
    </tr>
    <tr>
      <td>2</td>
      <td>2.22</td>
      <td>5980</td>
    </tr>
    <tr>
      <td>3</td>
      <td>3.35</td>
      <td>7775</td>
    </tr>
    <tr>
      <td>4</td>
      <td>2.68</td>
      <td>8717</td>
    </tr>
    <tr>
      <td>5</td>
      <td>3.42</td>
      <td>2329</td>
    </tr>
    <tr>
      <td>6</td>
      <td>5.35</td>
      <td>1092</td>
    </tr>
  </tbody>
</table>
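A table like the one above is a short aggregation away from the raw bid log. A sketch, assuming per-bid rows carry an `auction_id` and `cpm` (hypothetical column names) and taking the top bid in each auction as its clearing price:

```python
import pandas as pd

bids = pd.DataFrame({
    "auction_id": ["a1", "a2", "a2", "a3", "a3", "a3"],
    "cpm":        [1.6,  2.0,  2.4,  3.0,  3.3,  3.9],
})

# per auction: how many bids came in, and what the top bid was
per_auction = bids.groupby("auction_id")["cpm"].agg(
    n_bids="count", clearing_price="max"
)

# average clearing price at each level of auction participation
price_by_participation = per_auction.groupby("n_bids")["clearing_price"].mean()
```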

<p>It’s worth noting that this relationship isn’t perfectly causal in this data. Why the SSPs are submitting bids at all is an unobserved confounding factor. Maybe the user has unpurchased items in their Amazon shopping cart, for example, which causes both higher bid prices from advertisers and a higher number of bids. It’s not possible to remove all of these unknowable confounders, but the pattern is clear enough that you can feel confident in saying that increasing the amount of demand with access to a given piece of ad inventory leads to higher prices.</p>

<h1 id="creative-sizes">Creative Sizes</h1>

<p>Another important driver of an ad’s value is its size. As with traditional advertising, larger ads command a higher market rate.</p>

<table>
  <thead>
    <tr>
      <th>Creative Size</th>
      <th>CPM (mean)</th>
      <th>CPM (stddev)</th>
      <th>Impressions</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>728x90</td>
      <td>2.27</td>
      <td>3.07</td>
      <td>11012</td>
    </tr>
    <tr>
      <td>300x250</td>
      <td>2.78</td>
      <td>3.48</td>
      <td>10011</td>
    </tr>
    <tr>
      <td>300x600</td>
      <td>2.68</td>
      <td>3.42</td>
      <td>5052</td>
    </tr>
    <tr>
      <td>970x250</td>
      <td>3.57</td>
      <td>4.81</td>
      <td>2585</td>
    </tr>
    <tr>
      <td>1030x590</td>
      <td>16.41</td>
      <td>2.76</td>
      <td>261</td>
    </tr>
    <tr>
      <td>970x90</td>
      <td>1.34</td>
      <td>1.37</td>
      <td>172</td>
    </tr>
    <tr>
      <td>640x480</td>
      <td>14.27</td>
      <td>8.54</td>
      <td>58</td>
    </tr>
    <tr>
      <td>160x600</td>
      <td>1.17</td>
      <td>1.52</td>
      <td>48</td>
    </tr>
  </tbody>
</table>

<p>It’s a bit deceptive to look at this in cross-site data, since every site’s implementation of these ads is slightly different and drives different values. You can see, however, that a 970x250 ad is worth significantly more than a 728x90, which is often eligible to serve in the same top-of-page ad slot. Similarly, a 300x600 ad has a higher CPM on average than a 160x600, which it generally competes with for space in a site’s sidebar.</p>

<p>Other comparisons from this table are less fair. A 728x90, for example, typically serves in a completely different section of a page than a 300x250, meaning that their performance metrics are vastly different. This causes their CPMs to differ for reasons entirely unrelated to the size itself. Other sizes, such as Vox’s 1030x590 ads, are not industry standard and serve in a completely different way than traditional ads, making it unfair to compare against more standard sizes.</p>

<p>In general, all else equal, bigger is better in terms of an ad’s value.</p>

<h1 id="ad-density">Ad Density</h1>

<p>Another theme I found in this data is the impact that ad density has on pricing. Economics 101 says that if you flood a market with supply without a corresponding increase in demand, the per-unit price goes down. This is exactly what happens when a website adds additional ads.</p>

<p>It’s not obvious how to define “the market” in this context, but if you think of it as the market for ad impressions on a given site, or for a given subset of users, then it follows that a site increasing the number of ads per pageview can negatively impact the CPM it receives. This is probably a fair way of looking at things, as ad campaigns are often limited to small pools of users and websites, making market definitions very small and specific in scope.</p>

<p align="center">
    <img src="/images/fulls/density3.png" alt="" />
</p>

<p>In the above plot I define “density” as the number of ads placed on a given pageview and “CPM” as the average price per 1,000 ad impressions across websites at each density. The clear downward trend suggests that adding ads to a page has at least a slight negative impact on price. This could be studied more thoroughly by holding sites and ad sizes constant, but it’s hard to do so while maintaining a useful sample size without access to a large site’s internal data.</p>
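The density figure comes from the same kind of aggregation; a sketch, assuming each row is one served impression tagged with a `pageview_id` (a column name I’m inventing for illustration):

```python
import pandas as pd

impressions = pd.DataFrame({
    "pageview_id": ["p1", "p2", "p2", "p3", "p3", "p3"],
    "cpm":         [4.0,  3.0,  3.4,  2.0,  2.2,  2.4],
})

# density = number of ads served on the pageview each impression belongs to
impressions["density"] = (
    impressions.groupby("pageview_id")["cpm"].transform("count")
)

# average price per impression at each density
cpm_by_density = impressions.groupby("density")["cpm"].mean()
```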

<h1 id="whats-next">What’s Next</h1>

<p>There are a few more things I might look into with this data. For one, the sample size is getting large enough that it might be interesting to try to forecast future CPMs. It may also be interesting to parse keywords out of the URLs that impressions belong to and see whether there’s any relationship between the content of a page and its CPM. For example, do non-brand-safe keywords (“murder”, curse words, etc.) have a negative impact on CPM? Do commercial keywords (“shopping”, “product”, “amazon”) have a positive impact? Clustering for topics and finding their relationship with ad price might yield some interesting results.</p>
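The brand-safety question could start as simply as flagging URLs whose paths contain blocklisted terms and comparing group means; a sketch with a hypothetical blocklist:

```python
import pandas as pd

UNSAFE_TERMS = {"murder", "shooting", "crash"}  # hypothetical blocklist

def is_brand_unsafe(url: str) -> bool:
    """True if any path token of the URL is on the blocklist."""
    tokens = url.lower().replace("-", "/").split("/")
    return any(token in UNSAFE_TERMS for token in tokens)

bids = pd.DataFrame({
    "url": ["https://example.com/news/murder-trial",
            "https://example.com/shopping/deals",
            "https://example.com/sports/recap"],
    "cpm": [1.1, 3.5, 2.7],
})
bids["unsafe"] = bids["url"].map(is_brand_unsafe)

# a negative value would suggest unsafe content depresses CPM
unsafe_vs_safe = (bids.loc[bids["unsafe"], "cpm"].mean()
                  - bids.loc[~bids["unsafe"], "cpm"].mean())
```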

<p>I’ll keep collecting data for the time being and might revisit this later. If you’ve made it this far, thanks for reading and let me know what you think!</p>]]></content><author><name>James LeDoux</name><email>ledoux.james.r@gmail.com</email></author><category term="projects" /><summary type="html"><![CDATA[An analysis of auction dynamics in client-side header bidding]]></summary></entry><entry><title type="html">Predicting The Shift: Boosting and Bagging for Strategic Infield Positioning</title><link href="https://jamesrledoux.com/projects/predicting-the-shift/" rel="alternate" type="text/html" title="Predicting The Shift: Boosting and Bagging for Strategic Infield Positioning" /><published>2018-09-30T00:00:00+00:00</published><updated>2018-09-30T00:00:00+00:00</updated><id>https://jamesrledoux.com/projects/predicting-the-shift</id><content type="html" xml:base="https://jamesrledoux.com/projects/predicting-the-shift/"><![CDATA[<meta property="og:image" content="/images/fulls/theshift.png" />

<meta property="og:image:type" content="image/png" />

<meta property="og:image:width" content="200" />

<meta property="og:image:height" content="200" />

<meta name="twitter:card" content="summary_large_image" />

<meta name="twitter:site" content="@jmzledoux" />

<meta name="twitter:creator" content="@jmzledoux" />

<meta name="twitter:title" content="Predicting The Shift: Boosting and Bagging for Strategic Infield Positioning" />

<meta name="twitter:image" content="http://jamesrledoux.com/images/fulls/theshift.png" />

<h1 id="the-shift">“The Shift”</h1>

<p>Baseball is an old game, and for the most part, we play it the same way today as we did several decades ago. One maneuver disrupting the old way of play in recent years is the infield shift. MLB.com defines the most common type of shift as “when three (or more, in some cases) infielders are positioned to the same side of second base (<a href="http://m.mlb.com/glossary/statcast/shifts">mlb.com</a>).” Once a rare maneuver, The Shift nearly doubled in usage each year from 2011 to 2015, according to the LA Times (<a href="http://www.latimes.com/sports/la-sp-baseball-defensive-shifts-20150719-story.html">LA Times</a>). It seems that, in the aftermath of the early Moneyball era, with most teams having by now made significant investments in analytics, the value of strategic infield positioning has become widely appreciated.</p>

<p align="center">
    <img src="/images/fulls/theshift.png" alt="" />
    A Typical Infield Shift // mlb.com
</p>

<p>The reasons for shifting are many. Mike Petriello provides an excellent analysis of when and why it happens in <a href="https://www.mlb.com/news/9-things-you-need-to-know-about-the-shift/c-276706888">9 things you need to know about the shift</a>. The most obvious recipients of The Shift are lefty hitters whose spray charts show a heavy skew toward hitting down the first-base line. It’s also clear that some teams have embraced The Shift more than others. The Astros, for example, have shifted on over 40% of plate appearances in 2018, while the Cubs barely shift at all.</p>

<p>The goal of this project is to see if shifts can be predicted. Predicting shifts is interesting for a few different reasons. If you’re on the defensive side and want to better know when your infield is <strong>supposed</strong> to be shifting, it may be helpful to see how likely the rest of the league would be to shift in that same situation. If you’re deciding whether to send in a pinch hitter, it will be useful to know both whether the defense is expected to shift, and how effective your batter is going to be against the maneuver. And, more generally, The Shift is just a fun maneuver that I’d like to better understand.</p>

<p>I will begin by collecting data from Baseball Savant, which tells us how the defense is positioned at a given point in time. From there, I’ll create features to describe game context, player identity, and player ability that will help to form predictions. Last, I’ll build five different models, ranging from simple generalized linear models to more advanced machine learning techniques, to see just how effectively The Shift can be predicted before it happens. Let’s get started.</p>

<p align="center">
    <img src="/images/fulls/gettyimages-455266466.jpg" alt="" />
    The LA Dodgers Getting Shifty // DENIS POROY / GETTY
</p>

<h1 id="data-acquisition-and-cleaning">Data Acquisition and Cleaning</h1>
<p>I use <a href="https://www.github.com/jldbc/pybaseball">pybaseball</a> to collect statcast data. Baseball Savant has made field position classifications available since the beginning of the 2016 season, so I collect data from 2016 to present (August 2018 at the time of writing this post). Data collection is a simple one-liner.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pybaseball</span> <span class="kn">import</span> <span class="n">statcast</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">statcast</span><span class="p">(</span><span class="s">'2016-03-25'</span><span class="p">,</span> <span class="s">'2018-08-17'</span><span class="p">)</span>
</code></pre></div></div>

<p>For a long query like this one, the scraper takes a while to complete. I recommend running this and then leaving it alone for a while. Save a copy once you have it to avoid having to re-scrape.</p>
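One way to do that is a small cache wrapper: reload from disk when a saved copy exists, and only hit the scraper otherwise (the filename and helper function here are my own, not part of pybaseball):

```python
import os

import pandas as pd

CACHE_PATH = "statcast_2016_2018.csv"  # arbitrary local filename

def load_statcast(cache_path: str = CACHE_PATH) -> pd.DataFrame:
    """Return the saved scrape if present; otherwise scrape and save it."""
    if os.path.exists(cache_path):
        return pd.read_csv(cache_path, parse_dates=["game_date"])
    from pybaseball import statcast  # imported lazily: only needed to re-scrape
    df = statcast('2016-03-25', '2018-08-17')
    df.to_csv(cache_path, index=False)
    return df
```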

<p>Now, some simple cleaning to make feature engineering and analysis possible. To start, I:</p>
<ul>
  <li>remove games from outside the regular season</li>
  <li>turn <code class="language-plaintext highlighter-rouge">game_date</code> into a proper <code class="language-plaintext highlighter-rouge">datetime</code> type</li>
  <li>create a pk unique to each plate appearance (<code class="language-plaintext highlighter-rouge">game_pk</code> + <code class="language-plaintext highlighter-rouge">at_bat_number</code>)</li>
  <li>identify which team is batting and pitching using <code class="language-plaintext highlighter-rouge">inning_topbot</code></li>
  <li>create the feature of interest, <code class="language-plaintext highlighter-rouge">is_shift</code>, defined as equaling 1 when <code class="language-plaintext highlighter-rouge">if_fielding_alignment</code> is equal to <code class="language-plaintext highlighter-rouge">Infield shift</code></li>
  <li>drop rows where we don’t know whether the defense was shifted</li>
</ul>

<p>In code, it looks like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># only consider regular season games
</span><span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s">'game_type'</span><span class="p">]</span> <span class="o">==</span> <span class="s">'R'</span><span class="p">,]</span>

<span class="n">df</span><span class="p">[</span><span class="s">'game_date'</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'game_date'</span><span class="p">])</span>

<span class="n">df</span><span class="p">[</span><span class="s">'atbat_pk'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'game_pk'</span><span class="p">].</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">)</span> <span class="o">+</span> <span class="n">df</span><span class="p">[</span><span class="s">'at_bat_number'</span><span class="p">].</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">)</span>

<span class="c1"># we don't have column for which team is pitching, but we know the home team pitches the top and away pitches bottom
</span><span class="n">df</span><span class="p">[</span><span class="s">'team_pitching'</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'inning_topbot'</span><span class="p">]</span><span class="o">==</span><span class="s">'Top'</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s">'home_team'</span><span class="p">],</span> <span class="n">df</span><span class="p">[</span><span class="s">'away_team'</span><span class="p">])</span>
<span class="n">df</span><span class="p">[</span><span class="s">'team_batting'</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'inning_topbot'</span><span class="p">]</span><span class="o">==</span><span class="s">'Top'</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s">'away_team'</span><span class="p">],</span> <span class="n">df</span><span class="p">[</span><span class="s">'home_team'</span><span class="p">])</span>

<span class="c1">#is_shift == 1 if the defense was shifted at any point during the plate appearance
</span><span class="n">df</span><span class="p">[</span><span class="s">'is_shift'</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'if_fielding_alignment'</span><span class="p">]</span> <span class="o">==</span> <span class="s">'Infield shift'</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">shifts</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'atbat_pk'</span><span class="p">)[</span><span class="s">'is_shift'</span><span class="p">].</span><span class="nb">sum</span><span class="p">()).</span><span class="n">reset_index</span><span class="p">()</span>
<span class="n">shifts</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">shifts</span><span class="p">.</span><span class="n">is_shift</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">,</span> <span class="s">'is_shift'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">merge</span><span class="p">(</span><span class="n">shifts</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="s">'atbat_pk'</span><span class="p">,</span> <span class="n">suffixes</span><span class="o">=</span><span class="p">(</span><span class="s">'_old'</span><span class="p">,</span> <span class="s">''</span><span class="p">))</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">pd</span><span class="p">.</span><span class="n">notnull</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'is_shift'</span><span class="p">])]</span>
</code></pre></div></div>

<p>Last, since our goal is to predict whether a shift occurred on a given plate appearance, we’ll want to reshape the data so that each record reflects a single plate appearance. Pybaseball’s statcast data comes in the lowest form of granularity offered by Baseball Savant: the individual pitch. To move the data to plate appearance-level granularity, I group by <code class="language-plaintext highlighter-rouge">atbat_pk</code> and select the first row of each plate appearance, which shows the game state when a player first steps up to bat. This is an oversimplification of what happens throughout the plate appearance (maybe someone stole a base, maybe there was a pitching change), but it’s a good enough representation of reality to predict how the defense would play the situation.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plate_appearances</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">sort_values</span><span class="p">([</span><span class="s">'game_date'</span><span class="p">,</span> <span class="s">'at_bat_number'</span><span class="p">],</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">True</span><span class="p">).</span><span class="n">groupby</span><span class="p">([</span><span class="s">'atbat_pk'</span><span class="p">]).</span><span class="n">first</span><span class="p">().</span><span class="n">reset_index</span><span class="p">()</span>
</code></pre></div></div>
<p>Taking the first row of each plate appearance isn’t enough to build useful features, of course. The pitch-level data will still be used to create features representing player ability and game context.</p>

<h1 id="feature-engineering-pt1-player-ability-and-profile">Feature Engineering Pt 1: Player Ability and Profile</h1>

<p>We’ll begin feature engineering by creating the two most obvious features that come to mind for predicting shifts: how often does the current batter get shifted against, and how often does the defensive team shift in general?</p>

<p>To avoid information leakage (information slipping into the model from points in time happening after the present <code class="language-plaintext highlighter-rouge">atbat_pk</code>), these features will be calculated using expanding means. This means that for each point in time, we’ll calculate the shift-rate from the beginning of time up until the present plate appearance, ignoring all future data that is unknown at that point in time. These features are calculated below.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># calculate an expanding mean for a given feature
</span><span class="k">def</span> <span class="nf">get_expanding_mean</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">featurename</span><span class="p">,</span> <span class="n">base_colname</span><span class="p">):</span>
    <span class="c1"># arrange rows by date + PA # for each batter
</span>    <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">sort_values</span><span class="p">([</span><span class="s">'batter'</span><span class="p">,</span> <span class="s">'game_date'</span><span class="p">,</span> <span class="s">'at_bat_number'</span><span class="p">],</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="c1"># calculate mean-to-date at each PA's point in time
</span>    <span class="n">feature_to_date</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">sort_values</span><span class="p">([</span><span class="s">'batter'</span><span class="p">,</span><span class="s">'game_date'</span><span class="p">,</span><span class="s">'at_bat_number'</span><span class="p">],</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">True</span><span class="p">).</span><span class="n">groupby</span><span class="p">(</span><span class="s">'batter'</span><span class="p">)[</span><span class="n">base_colname</span><span class="p">].</span><span class="n">expanding</span><span class="p">(</span><span class="n">min_periods</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">mean</span><span class="p">()</span>
    <span class="n">feature_to_date</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">feature_to_date</span><span class="p">).</span><span class="n">reset_index</span><span class="p">()</span>
    <span class="n">feature_to_date</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">'batter'</span><span class="p">,</span> <span class="s">'index'</span><span class="p">,</span> <span class="n">featurename</span><span class="p">]</span>
    <span class="k">if</span> <span class="s">'index'</span> <span class="ow">in</span> <span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">:</span>
        <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'index'</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span>
    <span class="c1"># join new feature onto the original dataframe
</span>    <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">reset_index</span><span class="p">()</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">merge</span><span class="p">(</span><span class="n">left</span><span class="o">=</span><span class="n">df</span><span class="p">,</span> <span class="n">right</span><span class="o">=</span><span class="n">feature_to_date</span><span class="p">,</span> <span class="n">left_on</span><span class="o">=</span><span class="p">[</span><span class="s">'batter'</span><span class="p">,</span><span class="s">'index'</span><span class="p">],</span>
                                 <span class="n">right_on</span><span class="o">=</span><span class="p">[</span><span class="s">'batter'</span><span class="p">,</span> <span class="s">'index'</span><span class="p">],</span> <span class="n">suffixes</span><span class="o">=</span><span class="p">[</span><span class="s">'old'</span><span class="p">,</span><span class="s">''</span><span class="p">])</span>
    <span class="k">return</span> <span class="n">df</span>

<span class="n">plate_appearances</span> <span class="o">=</span> <span class="n">get_expanding_mean</span><span class="p">(</span><span class="n">plate_appearances</span><span class="p">,</span> <span class="s">'avg_shifted_against'</span><span class="p">,</span> <span class="s">'is_shift'</span><span class="p">)</span>

<span class="c1"># shift rate to date for each team at each point in time
</span><span class="n">plate_appearances</span> <span class="o">=</span> <span class="n">plate_appearances</span><span class="p">.</span><span class="n">sort_values</span><span class="p">([</span><span class="s">'team_pitching'</span><span class="p">,</span> <span class="s">'game_date'</span><span class="p">],</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">shifts_to_date</span> <span class="o">=</span> <span class="n">plate_appearances</span><span class="p">.</span><span class="n">sort_values</span><span class="p">([</span><span class="s">'team_pitching'</span><span class="p">,</span> <span class="s">'game_date'</span><span class="p">],</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">True</span><span class="p">).</span><span class="n">groupby</span><span class="p">(</span><span class="s">'team_pitching'</span><span class="p">)[</span><span class="s">'is_shift'</span><span class="p">].</span><span class="n">expanding</span><span class="p">(</span><span class="n">min_periods</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">mean</span><span class="p">()</span>
<span class="n">shifts_to_date</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">shifts_to_date</span><span class="p">).</span><span class="n">reset_index</span><span class="p">()</span>
<span class="n">shifts_to_date</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">'team_pitching'</span><span class="p">,</span> <span class="s">'index'</span><span class="p">,</span> <span class="s">'def_shift_pct'</span><span class="p">]</span>
<span class="n">plate_appearances</span> <span class="o">=</span> <span class="n">plate_appearances</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'index'</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span>
<span class="n">plate_appearances</span> <span class="o">=</span> <span class="n">plate_appearances</span><span class="p">.</span><span class="n">reset_index</span><span class="p">()</span>
<span class="n">plate_appearances</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">merge</span><span class="p">(</span><span class="n">left</span><span class="o">=</span><span class="n">plate_appearances</span><span class="p">,</span> <span class="n">right</span><span class="o">=</span><span class="n">shifts_to_date</span><span class="p">,</span> <span class="n">left_on</span><span class="o">=</span><span class="p">[</span><span class="s">'team_pitching'</span><span class="p">,</span><span class="s">'index'</span><span class="p">],</span> <span class="n">right_on</span><span class="o">=</span><span class="p">[</span><span class="s">'team_pitching'</span><span class="p">,</span> <span class="s">'index'</span><span class="p">],</span> <span class="n">suffixes</span><span class="o">=</span><span class="p">[</span><span class="s">'old'</span><span class="p">,</span><span class="s">''</span><span class="p">])</span>
</code></pre></div></div>
<p>Note that I created a function called <code class="language-plaintext highlighter-rouge">get_expanding_mean</code> in the above code chunk because later in this analysis I’ll repeat this same procedure for other features. The <code class="language-plaintext highlighter-rouge">def_shift_pct</code> feature requires a slightly different grouping, however, so it gets calculated without a dedicated function.</p>
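<p>As a quick illustration of what the expanding mean is doing (a toy sketch with made-up values, not the real <code class="language-plaintext highlighter-rouge">get_expanding_mean</code> helper or its actual columns):</p>

```python
import pandas as pd

# Toy data: three chronological PAs for each of two batters.
df = pd.DataFrame({
    'batter':   ['a', 'a', 'a', 'b', 'b', 'b'],
    'is_shift': [1, 0, 1, 0, 0, 1],
})

# Shift rate observed so far for each batter, including the current row.
df['shift_rate_to_date'] = (
    df.groupby('batter')['is_shift']
      .expanding(min_periods=1)
      .mean()
      .reset_index(level=0, drop=True)  # drop the group level to realign
)
print(df['shift_rate_to_date'].round(2).tolist())
# [1.0, 0.5, 0.67, 0.0, 0.0, 0.33]
```

<p>Note that this version includes the current plate appearance in the average; shifting each group down one row first would exclude it if leakage into the target were a concern.</p>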

<p>A quick check of our shift leaders looks about right. These numbers aren’t a perfect match with the ones <a href="https://baseballsavant.mlb.com/visuals/batter-positioning?playerId=undefined&amp;teamId=&amp;opponent=&amp;firstBase=0&amp;shift=1&amp;season=2017&amp;attempts=25&amp;batSide=L&amp;gb=1&amp;fb=0">Baseball Savant</a> shows in its leaderboard, but my PA counts match those of Baseball Reference while Savant’s don’t. This suggests that Baseball Savant is applying some sort of filtering to its data, while my PAs are unfiltered.</p>


<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Shift_Rate</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>chris davis</td>
      <td>89%</td>
    </tr>
    <tr>
      <td>ryan howard</td>
      <td>88%</td>
    </tr>
    <tr>
      <td>david ortiz</td>
      <td>84%</td>
    </tr>
    <tr>
      <td>joey gallo</td>
      <td>80%</td>
    </tr>
    <tr>
      <td>lucas duda</td>
      <td>77%</td>
    </tr>
    <tr>
      <td>brandon moss</td>
      <td>74%</td>
    </tr>
    <tr>
      <td>brian mccann</td>
      <td>73%</td>
    </tr>
    <tr>
      <td>colby rasmus</td>
      <td>71%</td>
    </tr>
    <tr>
      <td>adam laroche</td>
      <td>70%</td>
    </tr>
    <tr>
      <td>mitch moreland</td>
      <td>69%</td>
    </tr>
  </tbody>
</table>

<p>Just for fun, we can also see how it broke down by season and handedness, using the raw average rather than the expanding mean.</p>
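<p>A sketch of that grouping, with hypothetical column names (<code class="language-plaintext highlighter-rouge">player_name</code>, <code class="language-plaintext highlighter-rouge">game_year</code>, <code class="language-plaintext highlighter-rouge">stand</code>) standing in for the real Statcast fields, and toy numbers rather than the actual data:</p>

```python
import pandas as pd

# Toy rows; the real frame holds millions of Statcast plate appearances.
df = pd.DataFrame({
    'player_name': ['ryan howard'] * 3 + ['chris davis'] * 3,
    'game_year':   [2016, 2016, 2016, 2017, 2017, 2017],
    'stand':       ['L'] * 6,
    'is_shift':    [1, 1, 0, 1, 1, 1],
})

# Raw shift rate per player-season-handedness, highest first.
season_rates = (
    df.groupby(['player_name', 'game_year', 'stand'])['is_shift']
      .agg(['mean', 'count'])
      .sort_values('mean', ascending=False)
)
print(season_rates)
```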

<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Year</th>
      <th>Bats</th>
      <th>Shift_Rate</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>ryan howard</td>
      <td>2016</td>
      <td>L</td>
      <td>94%</td>
    </tr>
    <tr>
      <td>chris davis</td>
      <td>2017</td>
      <td>L</td>
      <td>94%</td>
    </tr>
    <tr>
      <td>chris davis</td>
      <td>2018</td>
      <td>L</td>
      <td>92%</td>
    </tr>
    <tr>
      <td>chris davis</td>
      <td>2016</td>
      <td>L</td>
      <td>92%</td>
    </tr>
    <tr>
      <td>joey gallo</td>
      <td>2018</td>
      <td>L</td>
      <td>89%</td>
    </tr>
    <tr>
      <td>justin smoak</td>
      <td>2018</td>
      <td>L</td>
      <td>89%</td>
    </tr>
    <tr>
      <td>colby rasmus</td>
      <td>2018</td>
      <td>L</td>
      <td>88%</td>
    </tr>
    <tr>
      <td>mark teixeira</td>
      <td>2016</td>
      <td>L</td>
      <td>88%</td>
    </tr>
    <tr>
      <td>carlos santana</td>
      <td>2018</td>
      <td>L</td>
      <td>87%</td>
    </tr>
    <tr>
      <td>curtis granderson</td>
      <td>2018</td>
      <td>L</td>
      <td>86%</td>
    </tr>
  </tbody>
</table>

<p>It’s interesting to see Teixeira, a switch hitter, make an appearance now that we’re grouping by handedness. This is the first of many signs that shift rate alone doesn’t tell the full story!</p>

<p>Applying the same procedure to teams shows which clubs lean on this maneuver most heavily.</p>

<table>
  <thead>
    <tr>
      <th>Team</th>
      <th>Shift Rate</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>HOU</td>
      <td>37%</td>
    </tr>
    <tr>
      <td>TB</td>
      <td>34%</td>
    </tr>
    <tr>
      <td>NYY</td>
      <td>22%</td>
    </tr>
    <tr>
      <td>BAL</td>
      <td>22%</td>
    </tr>
    <tr>
      <td>MIL</td>
      <td>21%</td>
    </tr>
    <tr>
      <td>MIN</td>
      <td>18%</td>
    </tr>
    <tr>
      <td>SEA</td>
      <td>18%</td>
    </tr>
    <tr>
      <td>COL</td>
      <td>17%</td>
    </tr>
    <tr>
      <td>CWS</td>
      <td>16%</td>
    </tr>
    <tr>
      <td>PIT</td>
      <td>15%</td>
    </tr>
    <tr>
      <td>SD</td>
      <td>14%</td>
    </tr>
    <tr>
      <td>OAK</td>
      <td>14%</td>
    </tr>
    <tr>
      <td>ARI</td>
      <td>14%</td>
    </tr>
    <tr>
      <td>LAD</td>
      <td>13%</td>
    </tr>
    <tr>
      <td>BOS</td>
      <td>13%</td>
    </tr>
    <tr>
      <td>PHI</td>
      <td>13%</td>
    </tr>
    <tr>
      <td>TOR</td>
      <td>11%</td>
    </tr>
    <tr>
      <td>KC</td>
      <td>11%</td>
    </tr>
    <tr>
      <td>CLE</td>
      <td>11%</td>
    </tr>
    <tr>
      <td>DET</td>
      <td>11%</td>
    </tr>
    <tr>
      <td>ATL</td>
      <td>10%</td>
    </tr>
    <tr>
      <td>WSH</td>
      <td>10%</td>
    </tr>
    <tr>
      <td>CIN</td>
      <td>10%</td>
    </tr>
    <tr>
      <td>LAA</td>
      <td>9%</td>
    </tr>
    <tr>
      <td>TEX</td>
      <td>9%</td>
    </tr>
    <tr>
      <td>MIA</td>
      <td>8%</td>
    </tr>
    <tr>
      <td>NYM</td>
      <td>7%</td>
    </tr>
    <tr>
      <td>SF</td>
      <td>7%</td>
    </tr>
    <tr>
      <td>STL</td>
      <td>5%</td>
    </tr>
    <tr>
      <td>CHC</td>
      <td>5%</td>
    </tr>
  </tbody>
</table>

<p>The Astros and Rays are by far The Shift’s biggest adopters, while the Cubs, Cardinals, and a few others barely shift at all.</p>

<p>As we’ll see in Model #1 later in this post, these two features alone capture much of the information required to predict The Shift. There are, of course, reasons not to stop here and call it a day: what about batters with few historical plate appearances? Their shift rates will be unstable thanks to small sample sizes. And what about switch hitters? The Teixeira example shows that overall shift rates lie in those cases.</p>

<p>In the context of shift decisions, a batter’s identity is really a proxy for several things: his handedness, power, expected launch angle, sprint speed, and even who bats after him. If we capture some of these things directly, we may both improve our model and stabilize its predictions for players we’ve rarely seen.</p>

<p>Capturing batter and pitcher handedness is simple:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># stand (batter_bats)
</span><span class="n">plate_appearances</span><span class="p">[</span><span class="s">'left_handed_batter'</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">plate_appearances</span><span class="p">[</span><span class="s">'stand'</span><span class="p">]</span> <span class="o">==</span> <span class="s">'L'</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="c1"># pitcher_throws
</span><span class="n">plate_appearances</span><span class="p">[</span><span class="s">'pitcher_throws_left'</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">plate_appearances</span><span class="p">[</span><span class="s">'p_throws'</span><span class="p">]</span> <span class="o">==</span> <span class="s">'L'</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>

<p>I then expand the player-profile feature set by repeating the previously defined expanding-average procedure on other variables provided by Baseball Savant: woba_value (how many points each PA contributes to wOBA), babip_value (same, but for BABIP), launch_angle, launch_speed, and hit_distance_sc. A modified version of the same procedure, taking an expanding count instead of an expanding average, captures how many plate appearances we’ve seen to date for the current batter within this data.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plate_appearances</span> <span class="o">=</span> <span class="n">get_expanding_mean</span><span class="p">(</span><span class="n">plate_appearances</span><span class="p">,</span> <span class="s">'woba'</span><span class="p">,</span> <span class="s">'woba_value'</span><span class="p">)</span>
<span class="n">plate_appearances</span> <span class="o">=</span> <span class="n">get_expanding_mean</span><span class="p">(</span><span class="n">plate_appearances</span><span class="p">,</span> <span class="s">'babip'</span><span class="p">,</span> <span class="s">'babip_value'</span><span class="p">)</span>
<span class="n">plate_appearances</span> <span class="o">=</span> <span class="n">get_expanding_mean</span><span class="p">(</span><span class="n">plate_appearances</span><span class="p">,</span> <span class="s">'launch_angle'</span><span class="p">,</span> <span class="s">'launch_angle'</span><span class="p">)</span>
<span class="n">plate_appearances</span> <span class="o">=</span> <span class="n">get_expanding_mean</span><span class="p">(</span><span class="n">plate_appearances</span><span class="p">,</span> <span class="s">'launch_speed'</span><span class="p">,</span> <span class="s">'launch_speed'</span><span class="p">)</span>
<span class="n">plate_appearances</span> <span class="o">=</span> <span class="n">get_expanding_mean</span><span class="p">(</span><span class="n">plate_appearances</span><span class="p">,</span> <span class="s">'hit_distance_sc'</span><span class="p">,</span> <span class="s">'hit_distance_sc'</span><span class="p">)</span>

<span class="c1"># number of plate appearances observed for each player at the time of the current PA
</span><span class="n">plate_appearances</span> <span class="o">=</span> <span class="n">plate_appearances</span><span class="p">.</span><span class="n">sort_values</span><span class="p">([</span><span class="s">'batter'</span><span class="p">,</span> <span class="s">'game_date'</span><span class="p">,</span> <span class="s">'at_bat_number'</span><span class="p">],</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">pas</span> <span class="o">=</span> <span class="n">plate_appearances</span><span class="p">.</span><span class="n">sort_values</span><span class="p">([</span><span class="s">'batter'</span><span class="p">,</span><span class="s">'game_date'</span><span class="p">,</span><span class="s">'at_bat_number'</span><span class="p">],</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">True</span><span class="p">).</span><span class="n">groupby</span><span class="p">(</span><span class="s">'batter'</span><span class="p">)[</span><span class="s">'index'</span><span class="p">].</span><span class="n">expanding</span><span class="p">(</span><span class="n">min_periods</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">count</span><span class="p">()</span>
<span class="n">pas</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">pas</span><span class="p">).</span><span class="n">reset_index</span><span class="p">()</span>
<span class="n">pas</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">'batter'</span><span class="p">,</span> <span class="s">'index'</span><span class="p">,</span> <span class="s">'pas'</span><span class="p">]</span>
<span class="n">plate_appearances</span> <span class="o">=</span> <span class="n">plate_appearances</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'index'</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span>
<span class="n">plate_appearances</span> <span class="o">=</span> <span class="n">plate_appearances</span><span class="p">.</span><span class="n">reset_index</span><span class="p">()</span>
<span class="n">plate_appearances</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">merge</span><span class="p">(</span><span class="n">left</span><span class="o">=</span><span class="n">plate_appearances</span><span class="p">,</span> <span class="n">right</span><span class="o">=</span><span class="n">pas</span><span class="p">,</span> <span class="n">left_on</span><span class="o">=</span><span class="p">[</span><span class="s">'batter'</span><span class="p">,</span><span class="s">'index'</span><span class="p">],</span> <span class="n">right_on</span><span class="o">=</span><span class="p">[</span><span class="s">'batter'</span><span class="p">,</span> <span class="s">'index'</span><span class="p">],</span> <span class="n">suffixes</span><span class="o">=</span><span class="p">[</span><span class="s">'old'</span><span class="p">,</span><span class="s">''</span><span class="p">])</span>
</code></pre></div></div>

<p>Let’s check these numbers and see if they look right. First off, I’m seeing familiar faces on the wOBA leaderboard. That’s a good sign.</p>


<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>wOBA</th>
      <th>PAs</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>mike trout</td>
      <td>0.434</td>
      <td>2316</td>
    </tr>
    <tr>
      <td>aaron judge</td>
      <td>0.417</td>
      <td>1208</td>
    </tr>
    <tr>
      <td>joey votto</td>
      <td>0.415</td>
      <td>2561</td>
    </tr>
    <tr>
      <td>j. d. martinez</td>
      <td>0.413</td>
      <td>2153</td>
    </tr>
    <tr>
      <td>paul goldschmidt</td>
      <td>0.403</td>
      <td>2579</td>
    </tr>
    <tr>
      <td>bryce harper</td>
      <td>0.400</td>
      <td>2271</td>
    </tr>
    <tr>
      <td>freddie freeman</td>
      <td>0.399</td>
      <td>2199</td>
    </tr>
    <tr>
      <td>josh donaldson</td>
      <td>0.398</td>
      <td>2064</td>
    </tr>
    <tr>
      <td>david ortiz</td>
      <td>0.397</td>
      <td>1238</td>
    </tr>
    <tr>
      <td>nolan arenado</td>
      <td>0.395</td>
      <td>2519</td>
    </tr>
    <tr>
      <td>kris bryant</td>
      <td>0.395</td>
      <td>2366</td>
    </tr>
  </tbody>
</table>


<p>Checking the leaderboard for exit velocity also looks good, showing Statcast darlings Stanton, Judge, and Ortiz all near the top during this period. These velocities are slightly lower than what we see in the <a href="http://m.mlb.com/statcast/leaderboard#avg-hit-velo,r,2018">MLB leaderboard</a>, which is probably because I’m including exit velocities for all plate appearance-ending events, including outs, whereas MLB is probably only including exit velocity on hits.</p>

<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Exit Velocity</th>
      <th>PAs</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>david ortiz</td>
      <td>92</td>
      <td>1238</td>
    </tr>
    <tr>
      <td>nelson cruz</td>
      <td>91</td>
      <td>2392</td>
    </tr>
    <tr>
      <td>pedro alvarez</td>
      <td>91</td>
      <td>1026</td>
    </tr>
    <tr>
      <td>miguel cabrera</td>
      <td>90</td>
      <td>1866</td>
    </tr>
    <tr>
      <td>kendrys morales</td>
      <td>90</td>
      <td>2229</td>
    </tr>
    <tr>
      <td>giancarlo stanton</td>
      <td>90</td>
      <td>1999</td>
    </tr>
    <tr>
      <td>aaron judge</td>
      <td>90</td>
      <td>1208</td>
    </tr>
    <tr>
      <td>ryan zimmerman</td>
      <td>90</td>
      <td>1636</td>
    </tr>
    <tr>
      <td>matt olson</td>
      <td>90</td>
      <td>740</td>
    </tr>
    <tr>
      <td>josh donaldson</td>
      <td>90</td>
      <td>2064</td>
    </tr>
  </tbody>
</table>
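<p>One way to probe that hypothesis would be to recompute the average over hits only. A minimal sketch with toy numbers, assuming Statcast-style <code class="language-plaintext highlighter-rouge">events</code> labels (whatever filtering MLB actually applies isn’t documented here):</p>

```python
import pandas as pd

# Toy rows; 'events' values mirror Statcast's labels.
df = pd.DataFrame({
    'player_name':  ['david ortiz'] * 4,
    'events':       ['single', 'home_run', 'field_out', 'strikeout'],
    'launch_speed': [98.0, 105.0, 88.0, None],
})

hit_events = {'single', 'double', 'triple', 'home_run'}

# Average over all batted balls (outs included) vs. hits only:
# restricting to hits pushes the average up, consistent with the
# discrepancy described above.
all_avg  = df['launch_speed'].mean()          # NaN (strikeout) is skipped
hits_avg = df.loc[df['events'].isin(hit_events), 'launch_speed'].mean()
print(all_avg, hits_avg)
# 97.0 101.5
```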

<p>We’re off to a good start, having captured several components of the batter’s player profile, as well as measures of how often shifts occur for each batter and defense. Still missing from this data, however, is a sense of context. Shifts don’t happen in a vacuum. Priced into a shift-decision is the context of the current plate appearance (the score, men on base), and perhaps the team’s memory of how the current batter performed in previous plate appearances. If a player successfully hits out of the shift twice in a row, for example, a team may be less likely to try it a third time. This is the next category of feature I’ll create, attempting to capture the context in which a plate appearance occurs.</p>

<h1 id="features-pt-2-creating-context">Features Pt. 2: Creating Context</h1>
<p>The first piece of context I’ll create is the base state, as four features: binary flags for whether there’s a man on first, second, and third at the beginning of the plate appearance, and a count of how many men are on base in total. Something that I haven’t tried is taking interactions of the binary flags (e.g. man on first AND second, first AND third, second AND third). I might add that in if I revisit this later.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">[</span><span class="s">'man_on_first'</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'on_1b'</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s">'man_on_second'</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'on_2b'</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s">'man_on_third'</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'on_3b'</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s">'men_on_base'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'man_on_first'</span><span class="p">]</span> <span class="o">+</span> <span class="n">df</span><span class="p">[</span><span class="s">'man_on_second'</span><span class="p">]</span> <span class="o">+</span> <span class="n">df</span><span class="p">[</span><span class="s">'man_on_third'</span><span class="p">]</span>
</code></pre></div></div>

<p>The result looks like this:</p>

<table>
  <thead>
    <tr>
      <th>on_1b</th>
      <th>on_2b</th>
      <th>on_3b</th>
      <th>man_on_first</th>
      <th>man_on_second</th>
      <th>man_on_third</th>
      <th>men_on_base</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>572761</td>
      <td>607231</td>
      <td>NaN</td>
      <td>1</td>
      <td>1</td>
      <td>0</td>
      <td>2</td>
    </tr>
    <tr>
      <td>621550</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
    </tr>
    <tr>
      <td>621550</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
    </tr>
    <tr>
      <td>621550</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
    </tr>
    <tr>
      <td>621550</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
    </tr>
  </tbody>
</table>

<p>Another piece of context that might matter is when the game took place. Maybe The Shift’s likelihood can be attributed in part to the year (teams saw it work last season, so they made it a bigger part of their strategy the following season) and the time of year (teams know less about their opponents earlier in the season, so they shift less). We already have a feature for what year it is. Let’s create one for the month as well.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">df</span><span class="p">[</span><span class="s">'Month'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'game_date'</span><span class="p">].</span><span class="n">dt</span><span class="p">.</span><span class="n">month</span>
</code></pre></div></div>

<p>Next, let’s make the game’s score easier for the model to work with. We already know each team’s score, but a tree-based model needs to take two steps to make use of this, and a linear model can’t gain much from it at all, because each team’s score is only interesting in the context of its opponent’s. Taking the difference between the two should help.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">[</span><span class="s">'score_differential'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'fld_score'</span><span class="p">]</span> <span class="o">-</span> <span class="n">df</span><span class="p">[</span><span class="s">'bat_score'</span><span class="p">]</span>
</code></pre></div></div>

<p>The Savant data has several categorical variables I’d like to use. To make use of these, I’ll create dummies for each one of them: for each unique value in the categorical variable, a binary feature is created as a flag for whether the variable took on that value. I’m doing this for a few features, but it will be particularly interesting for <code class="language-plaintext highlighter-rouge">team_pitching</code> (capturing a team’s defensive strategy) and <code class="language-plaintext highlighter-rouge">batter</code> (capturing the leftover features of a player’s profile that we haven’t been able to control for with the model’s other features).</p>

<p>The time-related features will be cast as dummies as well in case there are month- or year-specific effects that deviate from the linear relationship that a model might glean from representing these as continuous features.</p>

<p>Last, I’ll also create dummies from the <code class="language-plaintext highlighter-rouge">events</code> feature. These can’t be used directly, as they provide future information that isn’t known at the beginning of the plate appearance. They can, however, be lagged and used to describe what happened in the batter’s most recent plate appearances, which I’ll do next.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># create dummies for pitching team, batting team, pitcher id, batter id
</span><span class="n">dummies</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">get_dummies</span><span class="p">(</span><span class="n">plate_appearances</span><span class="p">[</span><span class="s">'team_pitching'</span><span class="p">]).</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="s">'defense_'</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="n">plate_appearances</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">plate_appearances</span><span class="p">,</span> <span class="n">dummies</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

<span class="n">dummies</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">get_dummies</span><span class="p">(</span><span class="n">plate_appearances</span><span class="p">[</span><span class="s">'team_batting'</span><span class="p">]).</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="s">'atbat_'</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="n">plate_appearances</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">plate_appearances</span><span class="p">,</span> <span class="n">dummies</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

<span class="n">dummies</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">get_dummies</span><span class="p">(</span><span class="n">plate_appearances</span><span class="p">[</span><span class="s">'batter'</span><span class="p">]).</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="s">'batterid_'</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="n">plate_appearances</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">plate_appearances</span><span class="p">,</span> <span class="n">dummies</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

<span class="n">dummies</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">get_dummies</span><span class="p">(</span><span class="n">plate_appearances</span><span class="p">[</span><span class="s">'pitcher'</span><span class="p">]).</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="s">'pitcherid_'</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="n">plate_appearances</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">plate_appearances</span><span class="p">,</span> <span class="n">dummies</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

<span class="c1"># bb_type dummies (to be lagged)
</span><span class="n">dummies</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">get_dummies</span><span class="p">(</span><span class="n">plate_appearances</span><span class="p">[</span><span class="s">'bb_type'</span><span class="p">]).</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="s">'bb_type_'</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="n">plate_appearances</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">plate_appearances</span><span class="p">,</span> <span class="n">dummies</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">plate_appearances</span><span class="p">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'bb_type'</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

<span class="c1"># month and year dummies
</span><span class="n">dummies</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">get_dummies</span><span class="p">(</span><span class="n">plate_appearances</span><span class="p">[</span><span class="s">'Month'</span><span class="p">]).</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="s">'Month_'</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="n">plate_appearances</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">plate_appearances</span><span class="p">,</span> <span class="n">dummies</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">dummies</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">get_dummies</span><span class="p">(</span><span class="n">plate_appearances</span><span class="p">[</span><span class="s">'game_year'</span><span class="p">]).</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="s">'Year_'</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="n">plate_appearances</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">plate_appearances</span><span class="p">,</span> <span class="n">dummies</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

<span class="c1"># events
</span><span class="n">dummies</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">get_dummies</span><span class="p">(</span><span class="n">plate_appearances</span><span class="p">[</span><span class="s">'events'</span><span class="p">]).</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="s">'event_'</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="n">plate_appearances</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">plate_appearances</span><span class="p">,</span> <span class="n">dummies</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

<span class="n">plate_appearances</span><span class="p">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'team_pitching'</span><span class="p">,</span> <span class="s">'team_batting'</span><span class="p">,</span> <span class="s">'home_team'</span><span class="p">,</span> <span class="s">'away_team'</span><span class="p">,</span> <span class="s">'inning_topbot'</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>

<p>Lagged variables are a way to encode information about the past. For the features we have access to during the present plate appearance (did they get on base? Did the defense shift? Was the shift successful?), we can also access them for each of the batter’s previous plate appearances. This is interesting information to send to the model, as it’s almost certainly playing through the players’ minds when someone new steps up to the plate.</p>

<p>The first step in creating these features is defining them at present time. Here I’ll define whether the batter got on base, whether they achieved a hit, whether the plate appearance was successful for the defensive team, and whether it represents a shift that can be viewed as successful from the defense’s point of view. Everything else that I’ll lag already exists at present time as a feature provided by Baseball Savant.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># reached base safely: any hit, walk, or hit-by-pitch
</span><span class="n">plate_appearances</span><span class="p">[</span><span class="s">'onbase'</span><span class="p">]</span> <span class="o">=</span> <span class="n">plate_appearances</span><span class="p">.</span><span class="n">event_single</span> <span class="o">+</span> <span class="n">plate_appearances</span><span class="p">.</span><span class="n">event_double</span> <span class="o">+</span> <span class="n">plate_appearances</span><span class="p">.</span><span class="n">event_triple</span> \
                           <span class="o">+</span> <span class="n">plate_appearances</span><span class="p">.</span><span class="n">event_home_run</span> <span class="o">+</span> <span class="n">plate_appearances</span><span class="p">.</span><span class="n">event_walk</span> <span class="o">+</span> <span class="n">plate_appearances</span><span class="p">.</span><span class="n">event_hit_by_pitch</span>

<span class="n">plate_appearances</span><span class="p">[</span><span class="s">'hit'</span><span class="p">]</span> <span class="o">=</span> <span class="n">plate_appearances</span><span class="p">.</span><span class="n">event_single</span> <span class="o">+</span> <span class="n">plate_appearances</span><span class="p">.</span><span class="n">event_double</span> \
                           <span class="o">+</span> <span class="n">plate_appearances</span><span class="p">.</span><span class="n">event_triple</span>  <span class="o">+</span> <span class="n">plate_appearances</span><span class="p">.</span><span class="n">event_home_run</span>
    
<span class="n">plate_appearances</span><span class="p">[</span><span class="s">'successful_outcome_defense'</span><span class="p">]</span> <span class="o">=</span> <span class="n">plate_appearances</span><span class="p">.</span><span class="n">event_field_out</span> <span class="o">+</span> <span class="n">plate_appearances</span><span class="p">.</span><span class="n">event_strikeout</span> <span class="o">+</span> <span class="n">plate_appearances</span><span class="p">.</span><span class="n">event_grounded_into_double_play</span> \
                        <span class="o">+</span> <span class="n">plate_appearances</span><span class="p">.</span><span class="n">event_double_play</span> <span class="o">+</span> <span class="n">plate_appearances</span><span class="p">.</span><span class="n">event_fielders_choice_out</span> <span class="o">+</span> <span class="n">plate_appearances</span><span class="p">.</span><span class="n">event_other_out</span> \
                        <span class="o">+</span> <span class="n">plate_appearances</span><span class="p">.</span><span class="n">event_triple_play</span>
<span class="n">plate_appearances</span><span class="p">[</span><span class="s">'successful_shift'</span><span class="p">]</span> <span class="o">=</span> <span class="n">plate_appearances</span><span class="p">[</span><span class="s">'is_shift'</span><span class="p">]</span> <span class="o">*</span> <span class="n">plate_appearances</span><span class="p">[</span><span class="s">'successful_outcome_defense'</span><span class="p">]</span>
</code></pre></div></div>

<p>Missing values complicate the lagging of features, so these should be imputed before proceeding.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># simple imputations: hit_location, hit_distance_sc, launch_speed, launch_angle, effective_speed, 
# estimated_woba_using_speedangle, babip_value, iso_value
</span><span class="n">plate_appearances</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">pd</span><span class="p">.</span><span class="n">isna</span><span class="p">(</span><span class="n">plate_appearances</span><span class="p">.</span><span class="n">hit_location</span><span class="p">),</span> <span class="s">'hit_location'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">plate_appearances</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">pd</span><span class="p">.</span><span class="n">isna</span><span class="p">(</span><span class="n">plate_appearances</span><span class="p">.</span><span class="n">hit_distance_sc</span><span class="p">),</span> <span class="s">'hit_distance_sc'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">plate_appearances</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">pd</span><span class="p">.</span><span class="n">isna</span><span class="p">(</span><span class="n">plate_appearances</span><span class="p">.</span><span class="n">launch_speed</span><span class="p">),</span> <span class="s">'launch_speed'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">plate_appearances</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">pd</span><span class="p">.</span><span class="n">isna</span><span class="p">(</span><span class="n">plate_appearances</span><span class="p">.</span><span class="n">launch_angle</span><span class="p">),</span> <span class="s">'launch_angle'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">plate_appearances</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">pd</span><span class="p">.</span><span class="n">isna</span><span class="p">(</span><span class="n">plate_appearances</span><span class="p">.</span><span class="n">effective_speed</span><span class="p">),</span> <span class="s">'effective_speed'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">plate_appearances</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">pd</span><span class="p">.</span><span class="n">isna</span><span class="p">(</span><span class="n">plate_appearances</span><span class="p">.</span><span class="n">estimated_woba_using_speedangle</span><span class="p">),</span> <span class="s">'estimated_woba_using_speedangle'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">plate_appearances</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">pd</span><span class="p">.</span><span class="n">isna</span><span class="p">(</span><span class="n">plate_appearances</span><span class="p">.</span><span class="n">babip_value</span><span class="p">),</span> <span class="s">'babip_value'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">plate_appearances</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">pd</span><span class="p">.</span><span class="n">isna</span><span class="p">(</span><span class="n">plate_appearances</span><span class="p">.</span><span class="n">iso_value</span><span class="p">),</span> <span class="s">'iso_value'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">plate_appearances</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">pd</span><span class="p">.</span><span class="n">isna</span><span class="p">(</span><span class="n">plate_appearances</span><span class="p">.</span><span class="n">woba_denom</span><span class="p">),</span> <span class="s">'woba_denom'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">plate_appearances</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">pd</span><span class="p">.</span><span class="n">isna</span><span class="p">(</span><span class="n">plate_appearances</span><span class="p">.</span><span class="n">launch_speed_angle</span><span class="p">),</span> <span class="s">'launch_speed_angle'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span>
</code></pre></div></div>

<p>Now for the fun part. First, sort everything chronologically for each batter. Define which columns should be lagged and how many plate appearances into the past we should look. Then, for each lagged feature, and for each number of PAs into the past <code class="language-plaintext highlighter-rouge">t</code> that should be captured, group by the batter’s id and shift the column down <code class="language-plaintext highlighter-rouge">t</code> positions, so that each row sees the value from <code class="language-plaintext highlighter-rouge">t</code> plate appearances earlier.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># finally: lag things for a fuller sense of context
</span><span class="n">plate_appearances</span> <span class="o">=</span> <span class="n">plate_appearances</span><span class="p">.</span><span class="n">sort_values</span><span class="p">([</span><span class="s">'batter'</span><span class="p">,</span> <span class="s">'game_date'</span><span class="p">,</span> <span class="s">'at_bat_number'</span><span class="p">],</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">cols_to_lag</span> <span class="o">=</span> <span class="p">[</span><span class="s">'is_shift'</span><span class="p">,</span> <span class="s">'onbase'</span><span class="p">,</span> <span class="s">'hit'</span><span class="p">,</span> <span class="s">'successful_outcome_defense'</span><span class="p">,</span> <span class="s">'successful_shift'</span><span class="p">,</span>
              <span class="s">'woba_value'</span><span class="p">,</span> <span class="s">'launch_speed'</span><span class="p">,</span> <span class="s">'launch_angle'</span><span class="p">,</span> <span class="s">'hit_distance_sc'</span><span class="p">,</span> <span class="s">'bb_type_popup'</span><span class="p">,</span> <span class="s">'bb_type_line_drive'</span><span class="p">,</span> 
               <span class="s">'bb_type_ground_ball'</span><span class="p">,</span> <span class="s">'bb_type_fly_ball'</span><span class="p">]</span>

<span class="c1"># how many PAs back do we want to consider?
</span><span class="n">lag_time</span> <span class="o">=</span> <span class="mi">5</span>
<span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">cols_to_lag</span><span class="p">:</span>
    <span class="k">for</span> <span class="n">time</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">lag_time</span><span class="p">):</span>
        <span class="n">feature_name</span> <span class="o">=</span> <span class="n">col</span> <span class="o">+</span> <span class="s">'_lag'</span> <span class="o">+</span> <span class="s">'_{}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">time</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">plate_appearances</span><span class="p">[</span><span class="n">feature_name</span><span class="p">]</span> <span class="o">=</span> <span class="n">plate_appearances</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'batter'</span><span class="p">)[</span><span class="n">col</span><span class="p">].</span><span class="n">shift</span><span class="p">(</span><span class="n">time</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>

<p>Since this shifts the column <code class="language-plaintext highlighter-rouge">t</code> positions forward in time, the first <code class="language-plaintext highlighter-rouge">t</code> rows for each player will be missing the lagged value. For every other point in time, however, we will now know what happened up to <code class="language-plaintext highlighter-rouge">t</code> plate appearances ago.</p>

<p>There are two things to consider when picking how many plate appearances into the past we should look. The first is that each additional point in time we choose for this window will provide some extra information, but that the amount of information we gain will decrease as the size of this window increases.</p>

<p>Big-I “information” in this case can be described as the extent to which the added feature surprises us. Knowing nothing about past plate appearances, adding one point of past information tells us something we didn’t know before. Knowing whether the defense shifted at <code class="language-plaintext highlighter-rouge">t-1</code>  tells us a lot about whether they’ll shift the next time, greatly improving our prediction at time <code class="language-plaintext highlighter-rouge">t</code>. Compared to what we knew without it, point <code class="language-plaintext highlighter-rouge">t-1</code> surprised us. Extending this one point in time further into the past, point <code class="language-plaintext highlighter-rouge">t-2</code> tells us something new, but it’s not quite as surprising, as <code class="language-plaintext highlighter-rouge">t-1</code> already captures a lot of what <code class="language-plaintext highlighter-rouge">t-2</code> is telling us on average. Pair these diminishing returns with the feature bloat that they bring, and it’s probably not worth lagging more than a few points in time into the past.</p>

<p>The second thing to consider is that the more you lag, the more missing values your data will have. I chose to lag 5 points into the past, so my lagged features will be missing values for each batter’s first 5 plate appearances. This means I’ll have to throw out each player’s first five plate appearances in order to feed this data to a model. This doesn’t feel like much of a sacrifice at <code class="language-plaintext highlighter-rouge">t=5</code>, but it’s another reason to avoid lagging too far into the past.</p>
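<p>Concretely, once the lags exist, the rows with incomplete histories can be dropped in one call. A minimal sketch on a toy frame, assuming the <code class="language-plaintext highlighter-rouge">_lag_</code> naming convention used above:</p>

```python
import pandas as pd

# toy frame: one batter, three plate appearances, one lagged feature
df = pd.DataFrame({'batter': [1, 1, 1], 'hit': [1, 0, 1]})
df['hit_lag_1'] = df.groupby('batter')['hit'].shift(1)

# drop rows whose lag window reaches back before the batter's first PA
lag_cols = [c for c in df.columns if '_lag_' in c]
df = df.dropna(subset=lag_cols).reset_index(drop=True)
```

<p>With a five-lag window, the same <code class="language-plaintext highlighter-rouge">dropna</code> call removes each batter’s first five plate appearances.</p>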

<p>Here’s an example of what this looks like for the <code class="language-plaintext highlighter-rouge">is_shift</code> lags over Mitch Moreland’s first 8 plate appearances. Note the NaNs in rows 1–5, and how a shift stays represented in the data for five plate appearances by moving through the lagged features as time progresses.</p>

<table>
  <thead>
    <tr>
      <th>PA Number</th>
      <th>is_shift</th>
      <th>is_shift_lag_1</th>
      <th>is_shift_lag_2</th>
      <th>is_shift_lag_3</th>
      <th>is_shift_lag_4</th>
      <th>is_shift_lag_5</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>1</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
    <tr>
      <td>2</td>
      <td>1</td>
      <td>1</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
    <tr>
      <td>3</td>
      <td>0</td>
      <td>1</td>
      <td>1</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
    <tr>
      <td>4</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>1</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
    <tr>
      <td>5</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>1</td>
      <td>NaN</td>
    </tr>
    <tr>
      <td>6</td>
      <td>1</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>1</td>
    </tr>
    <tr>
      <td>7</td>
      <td>1</td>
      <td>1</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
    </tr>
    <tr>
      <td>8</td>
      <td>0</td>
      <td>1</td>
      <td>1</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
    </tr>
  </tbody>
</table>
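<p>The pattern in this table falls directly out of <code class="language-plaintext highlighter-rouge">groupby().shift()</code>. A toy reproduction of the same shift sequence (illustrative values only, not a real data pull):</p>

```python
import pandas as pd

pa = pd.DataFrame({
    'batter': [1] * 8,                      # a single batter's 8 PAs
    'is_shift': [1, 1, 0, 0, 1, 1, 1, 0],   # same sequence as the table
})
for t in range(1, 6):
    pa['is_shift_lag_{}'.format(t)] = pa.groupby('batter')['is_shift'].shift(t)
```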

<p>I chose <code class="language-plaintext highlighter-rouge">t=5</code> somewhat arbitrarily. It seemed like enough to capture most of what will be in a player’s recent memory without exploding the model’s feature space to the point of complicating the training process. There’s room for experimentation here, though, which I’d encourage for anyone planning on using a model like this in a serious way.</p>

<p>That was a marathon, but we now have an interesting and expansive set of features to build models on, covering player profile, ability, and context. <strong>Let’s predict some shifts.</strong></p>

<h1 id="modeling-the-shift-boosting-bagging-logits-and-stacking">Modeling The Shift: Boosting, Bagging, Logits, and Stacking</h1>

<p>My general modeling approach is to compare a few different models using k-fold cross validation. A fair evaluation therefore requires a training set (which will be broken into folds for multiple train/test splits) and a holdout set, which is only accessed once a final “best” model has been selected, for a blind evaluation of its performance.</p>

<p>Given the 655,847 plate appearances in this data, a 3-fold cross validation will entail 349,784 training and 174,893 test samples for each of its three train/test splits, with 131,170 samples remaining in the 20% holdout set to assess model performance.</p>

<p>This is set up as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">train_percent</span> <span class="o">=</span> <span class="p">.</span><span class="mi">8</span>
<span class="n">train_samples</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">plate_appearances</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="n">train_percent</span><span class="p">)</span>
<span class="n">holdout_samples</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">plate_appearances</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">train_percent</span><span class="p">))</span>

<span class="n">y</span> <span class="o">=</span> <span class="n">plate_appearances</span><span class="p">[</span><span class="s">'is_shift'</span><span class="p">]</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">plate_appearances</span><span class="p">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'is_shift'</span><span class="p">,</span> <span class="s">'batter'</span><span class="p">,</span> <span class="s">'pitcher'</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

<span class="n">X_train</span> <span class="o">=</span> <span class="n">X</span><span class="p">[:</span><span class="n">train_samples</span><span class="p">]</span>
<span class="n">X_holdout</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">train_samples</span><span class="p">:]</span>
<span class="n">y_train</span> <span class="o">=</span> <span class="n">y</span><span class="p">[:</span><span class="n">train_samples</span><span class="p">]</span>
<span class="n">y_holdout</span> <span class="o">=</span> <span class="n">y</span><span class="p">[</span><span class="n">train_samples</span><span class="p">:]</span>
</code></pre></div></div>
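<p>The three train/test splits inside the training set (which supply the <code class="language-plaintext highlighter-rouge">X_test</code> and <code class="language-plaintext highlighter-rouge">y_test</code> that appear in the model snippets below) can be produced with scikit-learn’s <code class="language-plaintext highlighter-rouge">KFold</code>. A minimal sketch on synthetic stand-ins for the real training arrays:</p>

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X_train = rng.normal(size=(90, 4))       # stand-in for the real feature matrix
y_train = rng.integers(0, 2, size=90)    # stand-in for is_shift

fold_sizes = []
for train_idx, test_idx in KFold(n_splits=3).split(X_train):
    X_tr, X_test = X_train[train_idx], X_train[test_idx]
    y_tr, y_test = y_train[train_idx], y_train[test_idx]
    fold_sizes.append((len(X_tr), len(X_test)))  # 2/3 train, 1/3 test per fold
```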

<h2 id="feature-selection">Feature Selection</h2>
<p>For the sake of my own sanity, I’m going to sacrifice a little of my models’ accuracy in the name of cutting down their training time with some preliminary feature selection. This isn’t the only way to do it, but my method of choice is to train a random forest and use scikit-learn’s built-in Gini importances to rank features by their importance to the model. Everything with an importance above a cutoff value stays in the model; the rest gets thrown out.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#train a random forest
</span><span class="n">n_estimator</span> <span class="o">=</span> <span class="mi">100</span>
<span class="n">rf</span> <span class="o">=</span> <span class="n">RandomForestClassifier</span><span class="p">(</span><span class="n">max_depth</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">n_estimators</span><span class="o">=</span><span class="n">n_estimator</span><span class="p">,</span> <span class="n">n_jobs</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">rf</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>

<span class="c1"># use random forest's feature importances to select only important features
</span><span class="n">sfm</span> <span class="o">=</span> <span class="n">SelectFromModel</span><span class="p">(</span><span class="n">rf</span><span class="p">,</span> <span class="n">prefit</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">threshold</span><span class="o">=</span><span class="mf">0.0001</span><span class="p">)</span>

<span class="c1"># prune unimportant features from train and holdout dataframes
</span><span class="n">feature_idx</span> <span class="o">=</span> <span class="n">sfm</span><span class="p">.</span><span class="n">get_support</span><span class="p">()</span>
<span class="n">feature_names</span> <span class="o">=</span> <span class="n">X_train</span><span class="p">.</span><span class="n">columns</span><span class="p">[</span><span class="n">feature_idx</span><span class="p">]</span>
<span class="n">X_train</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">sfm</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">))</span>
<span class="n">X_holdout</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">sfm</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_holdout</span><span class="p">))</span>
<span class="n">X_train</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="n">feature_names</span>
<span class="n">X_holdout</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="n">feature_names</span>
</code></pre></div></div>

<p>As is common with high-dimensional data, these data had many sparsely-populated features that carried very little information about shift decisions. Chief among these low-information features were the pitcher and batter dummies, which made up the majority of the feature space. After dropping low-importance features, only a few of these player ID variables remained.</p>

<p>For an idea about which features the final Random Forest model used in this project found most important, the ranked Gini importances of its top features are shown here:</p>

<p align="center">
    <img src="/images/fulls/feature_importance.png" alt="" />
    Ranked Feature Importances from Random Forest Model
</p>

<p>This shows that most of the information is captured by whether the defense shifted in recent plate appearances, as the three most important features are the two most recent shift-lags and the batter’s historic shifted-against rate. After that, there’s a noticeable dropoff between the shift-related features and those describing game state, player ability, and the more-distant past.</p>
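<p>A ranking like the one plotted above can be read straight off a fitted forest’s <code class="language-plaintext highlighter-rouge">feature_importances_</code>. A sketch on synthetic data (the real version would use the <code class="language-plaintext highlighter-rouge">rf</code> fitted earlier):</p>

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame({'signal': rng.normal(size=200), 'noise': rng.normal(size=200)})
y = (X['signal'] > 0).astype(int)   # only 'signal' carries information

rf = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)
rf.fit(X, y)

# Gini importances, ranked the way the chart above is
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
```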

<p>Trimming low-importance features takes us from 2,781 to 109 features, and reduces the training time of a 60-model XGBoost parameter search from 4 hours down to just 29 minutes. The cost of this is essentially nonexistent. In a first-pass at modeling this, the 2,700-feature XGBoost model obtained an AUC score of 0.911 and the 100-feature model scored 0.910. That’s a tradeoff I will happily take.</p>

<h2 id="model-0-the-no-model-model">Model 0: The No-Model Model</h2>
<p>A model is only interesting in the context of what it’s competing against. A good baseline to start with is the worst model imaginable, taking zero covariates into consideration. The no-model model simply guesses the majority class 100% of the time. In this case, the classes are imbalanced, so even no-model scores well on accuracy. Here <code class="language-plaintext highlighter-rouge">y_train.mean()</code> shows that 14% of plate appearances contain The Shift, so a model that only guesses no shift (y = 0) will be correct 86% of the time. Any lift in accuracy above this point will be learned from data.</p>
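<p>That baseline requires no fitting at all; a minimal sketch of the arithmetic, using a synthetic target built to match the real data’s ~14% positive rate:</p>

```python
import numpy as np

# stand-in target with the real data's ~14% shift rate
y_train = np.array([1] * 14 + [0] * 86)

shift_rate = y_train.mean()                          # ≈ 0.14
baseline_accuracy = max(shift_rate, 1 - shift_rate)  # always guess "no shift"
```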

<h2 id="model-1-zero-context-logit-baseline">Model 1: Zero-Context Logit Baseline</h2>

<p>A reasonable next step from this is a simple model using the bare-minimum set of features needed to understand The Shift. In this case, the model is logistic regression, and the only features are:</p>

<ul>
  <li>Batter shift %: the percentage of a batter’s plate appearances to date in which the defense has shifted against them</li>
  <li>Defensive shift %: the percentage of plate appearances to date in which the defense has shifted</li>
</ul>

<p>These two features collectively capture a lot of information, giving an understanding of what we know about The Shift without any machine learning or clever feature engineering.</p>

<p>A few lines get us this improved logistic baseline:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># X_test / y_test come from the current cross-validation fold's split of the training set
</span><span class="n">lr</span> <span class="o">=</span> <span class="n">LogisticRegression</span><span class="p">(</span><span class="n">n_jobs</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="n">lr</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">[[</span><span class="s">'avg_shifted_against'</span><span class="p">,</span> <span class="s">'def_shift_pct'</span><span class="p">]],</span> <span class="n">y_train</span><span class="p">)</span>

<span class="n">y_pred_lr</span> <span class="o">=</span> <span class="n">lr</span><span class="p">.</span><span class="n">predict_proba</span><span class="p">(</span><span class="n">X_test</span><span class="p">[[</span><span class="s">'avg_shifted_against'</span><span class="p">,</span> <span class="s">'def_shift_pct'</span><span class="p">]])[:,</span> <span class="mi">1</span><span class="p">]</span>
<span class="n">fpr_lr</span><span class="p">,</span> <span class="n">tpr_lr</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">roc_curve</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_pred_lr</span><span class="p">)</span>
</code></pre></div></div>

<p>So: knowing only how often a batter is shifted against and how often the defense shifts in general, how well can we predict The Shift?</p>

<p>It turns out, the answer is “pretty well.”</p>

<p>This model achieves 88.6% out of sample accuracy and an AUC score of 0.887. The accuracy is slightly better than that of the no-model model, and the high AUC score shows that the model has learned a decent amount.</p>

<p>Given that we have 88.6% accuracy and 0.887 AUC knowing only these two things, we can tell that a large amount of the mental calculus going into a shift-decision can be summarized by the batter’s identity and the defense’s overarching philosophy toward this maneuver.</p>

<h2 id="model-2-logit-with-context">Model 2: Logit with Context</h2>
<p>With two baselines established, the next step in the ladder of modeling complexity is a logistic model with two improvements from the previous:</p>
<ul>
  <li>feature set: this model gets to see the full feature set, including player identifiers, handedness, lagged variables, etc.</li>
  <li>regularization: to avoid overfitting given all these features, I use L2 regularization with sklearn’s <code class="language-plaintext highlighter-rouge">LogisticRegressionCV</code> to shrink the model’s weights toward zero, optimizing the shrinkage parameter to maximize accuracy on unseen data</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">lr2</span> <span class="o">=</span> <span class="n">LogisticRegressionCV</span><span class="p">(</span><span class="n">n_jobs</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="n">lr2</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>

<span class="n">y_pred_lr2</span> <span class="o">=</span> <span class="n">lr2</span><span class="p">.</span><span class="n">predict_proba</span><span class="p">(</span><span class="n">X_test</span><span class="p">)[:,</span> <span class="mi">1</span><span class="p">]</span>
<span class="n">fpr_lr2</span><span class="p">,</span> <span class="n">tpr_lr2</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">roc_curve</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_pred_lr2</span><span class="p">)</span>
</code></pre></div></div>

<p>This model performs much better than the simple logit, with an accuracy of 92.2% and an AUC score of 0.940. This is a big leap in performance, considering how close the previous model had already come to perfect accuracy and AUC. This comes at the cost of training time, which is increased due to the larger feature space and tuning of the regularization parameter over three folds.</p>

<h2 id="model-3-random-forest-classifier">Model 3: Random Forest Classifier</h2>
<p>Next I’ll cross an arbitrary boundary between what one might call purely statistical modeling and <code class="language-plaintext highlighter-rouge">Machine Learning</code> with something tree-based. I opt to use a random forest classifier here because it’s a good baseline for how far machine learning will take you in a modeling task. It’s hard to overfit, relatively quick to train, and typically serves as a good hint for whether it’s worth going all-in on a more powerful nonparametric modeling approach such as gradient boosting or a neural network.</p>

<p>Since this and the following model are slower to train and have more tunable parameters than the previously-used linear models, I’ll first build a timer function and a modeling function so I can be consistent with how the best versions of these models are selected.</p>

<p>I’ll use randomized parameter search to find an optimal set of hyperparameters. This <a href="http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf">has been shown to produce superior results to grid search</a> in less time for two reasons: it doesn’t need to run an exhaustive search to explore the parameter space, and it can explore the space more completely by sampling parameter values from distributions, rather than a grid search’s approach of sampling only from user-specified lists of values. In my implementation, I’ll accept a <code class="language-plaintext highlighter-rouge">draws</code> parameter, for how many draws from the parameter distributions should be tested while exploring the parameter space, and a <code class="language-plaintext highlighter-rouge">folds</code> parameter, for how many folds we’d like to cross-validate over.</p>
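To make the contrast with grid search concrete, here’s a toy, pure-Python sketch (the hyperparameter names and ranges below are made up for illustration). With the same budget of nine trials, a 3x3 grid only ever tests three distinct values per hyperparameter, while nine random draws can test nine distinct values on each axis:

```python
import random

random.seed(0)

# Nine trials either way: a 3x3 grid vs. nine random draws.
# Hypothetical hyperparameters: a learning rate and a tree depth.
grid_trials = [(lr, depth) for lr in (0.01, 0.1, 1.0) for depth in (2, 5, 8)]
random_trials = [(random.uniform(0.01, 1.0), random.randint(2, 8))
                 for _ in range(9)]

# the grid tests only three distinct learning rates; random search tests nine
distinct_grid_lrs = len({lr for lr, _ in grid_trials})
distinct_random_lrs = len({lr for lr, _ in random_trials})
print(distinct_grid_lrs, distinct_random_lrs)  # 3 9
```

This is the core of the Bergstra and Bengio argument linked above: when only a few hyperparameters matter, random search spends its budget probing many values of each one instead of repeating the same few grid values.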

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">timer</span><span class="p">(</span><span class="n">start_time</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">start_time</span><span class="p">:</span>
        <span class="n">start_time</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">()</span>
        <span class="k">return</span> <span class="n">start_time</span>
    <span class="k">elif</span> <span class="n">start_time</span><span class="p">:</span>
        <span class="n">thour</span><span class="p">,</span> <span class="n">temp_sec</span> <span class="o">=</span> <span class="nb">divmod</span><span class="p">((</span><span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">()</span> <span class="o">-</span> <span class="n">start_time</span><span class="p">).</span><span class="n">total_seconds</span><span class="p">(),</span> <span class="mi">3600</span><span class="p">)</span>
        <span class="n">tmin</span><span class="p">,</span> <span class="n">tsec</span> <span class="o">=</span> <span class="nb">divmod</span><span class="p">(</span><span class="n">temp_sec</span><span class="p">,</span> <span class="mi">60</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s"> Time taken: %i hours %i minutes and %s seconds.'</span> <span class="o">%</span> <span class="p">(</span><span class="n">thour</span><span class="p">,</span> <span class="n">tmin</span><span class="p">,</span> <span class="nb">round</span><span class="p">(</span><span class="n">tsec</span><span class="p">,</span> <span class="mi">2</span><span class="p">)))</span>
    <span class="k">return</span> <span class="bp">None</span>

<span class="k">def</span> <span class="nf">model_param_search</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">param_dict</span><span class="p">,</span> <span class="n">fit_dict</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">folds</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">draws</span><span class="o">=</span><span class="mi">20</span><span class="p">):</span>
    <span class="n">skf</span> <span class="o">=</span> <span class="n">StratifiedKFold</span><span class="p">(</span><span class="n">n_splits</span><span class="o">=</span><span class="n">folds</span><span class="p">,</span> <span class="n">shuffle</span> <span class="o">=</span> <span class="bp">True</span><span class="p">,</span> <span class="n">random_state</span> <span class="o">=</span> <span class="mi">1001</span><span class="p">)</span>
    <span class="n">random_search_model</span> <span class="o">=</span> <span class="n">RandomizedSearchCV</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">param_distributions</span><span class="o">=</span><span class="n">param_dict</span><span class="p">,</span> 
                                   <span class="n">fit_params</span><span class="o">=</span><span class="n">fit_dict</span><span class="p">,</span> <span class="n">n_iter</span><span class="o">=</span><span class="n">draws</span><span class="p">,</span> 
                                   <span class="n">scoring</span><span class="o">=</span><span class="s">'roc_auc'</span><span class="p">,</span> <span class="n">n_jobs</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">cv</span><span class="o">=</span><span class="n">skf</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span><span class="n">y_train</span><span class="p">),</span> <span class="n">verbose</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> 
                                   <span class="n">random_state</span><span class="o">=</span><span class="mi">1001</span><span class="p">)</span>
    <span class="n">start_time</span> <span class="o">=</span> <span class="n">timer</span><span class="p">(</span><span class="bp">None</span><span class="p">)</span>
    <span class="n">random_search_model</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
    <span class="n">timer</span><span class="p">(</span><span class="n">start_time</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s"> All results:'</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="n">random_search_model</span><span class="p">.</span><span class="n">cv_results_</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s"> Best estimator:'</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="n">random_search_model</span><span class="p">.</span><span class="n">best_estimator_</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s"> Best hyperparameters:'</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="n">random_search_model</span><span class="p">.</span><span class="n">best_params_</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">random_search_model</span>
</code></pre></div></div>

<p>Applying this to the random forest model, I’ll draw from a uniform distribution to tune the number of trees used in the forest, simultaneously toggling the number of features considered at each split, the criterion used for assessing the quality of decision-splits, and a flag for whether to re-weight the classes to overcome the training data’s imbalance caused by The Shift’s rarity.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rf_params</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">'n_estimators'</span><span class="p">:</span> <span class="n">st</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span><span class="mi">200</span><span class="p">),</span>
    <span class="s">'max_features'</span><span class="p">:</span> <span class="p">[</span><span class="s">'sqrt'</span><span class="p">,</span> <span class="p">.</span><span class="mi">5</span><span class="p">,</span>  <span class="mi">1</span><span class="p">],</span>
    <span class="s">'criterion'</span><span class="p">:</span> <span class="p">[</span><span class="s">'entropy'</span><span class="p">,</span> <span class="s">'gini'</span><span class="p">],</span>
    <span class="s">'class_weight'</span><span class="p">:</span> <span class="p">[</span><span class="s">'balanced'</span><span class="p">,</span> <span class="bp">None</span><span class="p">]</span>
<span class="p">}</span>

<span class="n">random_search_rf</span> <span class="o">=</span> <span class="n">model_param_search</span><span class="p">(</span><span class="n">model</span><span class="o">=</span><span class="n">RandomForestClassifier</span><span class="p">(</span><span class="n">verbose</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">n_jobs</span><span class="o">=</span><span class="mi">4</span><span class="p">),</span> <span class="n">param_dict</span><span class="o">=</span><span class="n">rf_params</span><span class="p">,</span> <span class="n">draws</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div></div>

<p>In the interest of time, I sample only ten times from this search space while training over three folds, for a total of 30 trained models. Given access to better hardware I’d have doubled this number before settling on a final model, but alas, my 12” MacBook wasn’t having it.</p>

<p>This ended up taking just over three hours to complete, with the following as its most successful set of parameters:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
	'class_weight': None,
	'criterion': 'entropy',
	'max_features': 0.5,
	'n_estimators': 184
}
</code></pre></div></div>

<p>The result was a model with 92.9% accuracy and a 0.957 AUC score. This marks an improvement over the contextual logistic model, but not as big a jump as the one we achieved by introducing contextual features in the previous step.</p>

<h2 id="model-4-gradient-boosted-trees">Model 4: Gradient Boosted Trees</h2>
<p>The last standalone model I’ll train is a gradient boosting machine using <a href="https://github.com/dmlc/xgboost">XGBoost</a>. This is a step up from the random forest in complexity, and usually a good choice for a final, best model in a project like this, as evidenced by its <a href="https://www.kaggle.com/sudalairajkumar/winning-solutions-of-kaggle-competitions">almost unmatched success in Kaggle competitions</a>. The main benefit of gradient boosting is that it usually produces superior results to other tree-based methods by focusing each new learner on the samples it has the most trouble classifying (general info <a href="https://en.wikipedia.org/wiki/Boosting_(machine_learning)#Boosting_algorithms">here</a>), but this comes at the cost of having more hyperparameters to tune and a greater propensity to overfit. In my experience it’s generally 2x the work for a 1-5% performance boost over a random forest. In this case my goal is accuracy, so I’ll take it.</p>
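For intuition on the boosting idea, here’s a minimal pure-Python sketch in which each new learner (a decision stump) is fit to the residual errors of the ensemble so far; the data and the stump learner are toy stand-ins, not the shift dataset or XGBoost’s actual algorithm:

```python
# Toy 1-D dataset: the label flips from 0 to 1 halfway through.
X = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 0, 1, 1, 1, 1]

def fit_stump(X, residuals):
    """Find the threshold split minimizing squared error on the residuals."""
    best = None
    for t in X:
        left = [r for x, r in zip(X, residuals) if x <= t]
        right = [r for x, r in zip(X, residuals) if x > t]
        lmean = sum(left) / len(left) if left else 0.0
        rmean = sum(right) / len(right) if right else 0.0
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

learning_rate = 0.5
pred = [sum(y) / len(y)] * len(y)  # start from the mean prediction
for _ in range(10):
    # each round, the new stump targets what the ensemble still gets wrong
    residuals = [yi - pi for yi, pi in zip(y, pred)]
    stump = fit_stump(X, residuals)
    pred = [pi + learning_rate * stump(xi) for pi, xi in zip(pred, X)]

mse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)
print(round(mse, 6))  # training error shrinks toward zero
```

Each round the stump concentrates on the remaining errors, and the learning rate controls how aggressively each new learner’s correction is applied, which is the same role the hyperparameters below play in XGBoost.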

<p>My approach to training this model is similar to the previous: I conduct a randomized parameter search, this time taking 20 draws from a set of parameter distributions and cross validating over three folds, making for 60 models in total.</p>

<p>The hyperparameters I tune are:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># using proper random search setup instead of a set list of available options
</span><span class="n">xgb_params</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">'min_child_weight'</span><span class="p">:</span> <span class="n">st</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">10</span><span class="p">),</span>
    <span class="s">'gamma'</span><span class="p">:</span> <span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="c1">#st.uniform(0, 5),
</span>    <span class="s">'subsample'</span><span class="p">:</span> <span class="n">st</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="mf">0.5</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">),</span>
    <span class="s">'colsample_bytree'</span><span class="p">:</span> <span class="n">st</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="mf">0.4</span><span class="p">,</span> <span class="mf">0.6</span><span class="p">),</span>
    <span class="s">'max_depth'</span><span class="p">:</span> <span class="n">st</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">10</span><span class="p">),</span>
    <span class="s">'learning_rate'</span><span class="p">:</span> <span class="n">st</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="mf">0.02</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">),</span>
    <span class="s">'reg_lambda'</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">100</span><span class="p">],</span>
    <span class="s">'n_estimators'</span><span class="p">:</span> <span class="n">st</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">1000</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">min_child_weight</code> controls complexity by requiring a minimum amount of instance weight in a node before making a new split in a tree. <code class="language-plaintext highlighter-rouge">gamma</code> and <code class="language-plaintext highlighter-rouge">reg_lambda</code> perform similar functions, where <code class="language-plaintext highlighter-rouge">reg_lambda</code> is an L2 regularizer on the model’s weights and <code class="language-plaintext highlighter-rouge">gamma</code> is the minimum loss reduction needed to add a new split to a tree. <code class="language-plaintext highlighter-rouge">subsample</code> is the percentage of rows that a tree is allowed to see, and <code class="language-plaintext highlighter-rouge">colsample_bytree</code> does the same thing, but for features rather than records. Last, <code class="language-plaintext highlighter-rouge">max_depth</code> defines the depth of each tree, <code class="language-plaintext highlighter-rouge">n_estimators</code> defines the number of trees to be built, and <code class="language-plaintext highlighter-rouge">learning_rate</code> shrinks each new tree’s contribution, setting the pace at which the model is allowed to update its predictions at each boosting step.</p>

<p>To manage overfitting, I also pass a dictionary of model-fitting parameters to define an early stopping rule. This means that, for each model, I halt training and keep the best version seen so far once the evaluation loss fails to improve for ten consecutive rounds. Assuming the out-of-sample loss follows a roughly convex pattern, this finds the best model without overfitting.</p>
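The stopping rule itself can be sketched in a few lines of plain Python; the loss sequence below is made up for illustration:

```python
def early_stop(eval_losses, patience=10):
    """Return the best round and its loss, halting once `patience`
    consecutive rounds pass without improvement."""
    best_round, best_loss = 0, float("inf")
    for i, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_round, best_loss = i, loss
        elif i - best_round >= patience:
            break  # no improvement in `patience` rounds: stop training
    return best_round, best_loss

# made-up evaluation losses that bottom out and then start overfitting
losses = [0.9, 0.7, 0.6, 0.55, 0.56, 0.57, 0.58, 0.59, 0.60]
print(early_stop(losses, patience=3))  # -> (3, 0.55)
```

XGBoost implements the same idea internally when given an eval set and an `early_stopping_rounds` value, as in the snippet that follows.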

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># create a separate holdout set for XGB early stopping
</span><span class="n">train_percent</span> <span class="o">=</span> <span class="p">.</span><span class="mi">9</span>
<span class="n">train_samples_xgb</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">X_train</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="n">train_percent</span><span class="p">)</span>
<span class="n">test_samples_xgb</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">X_train</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">train_percent</span><span class="p">))</span>

<span class="n">X_train_xgb</span> <span class="o">=</span> <span class="n">X_train</span><span class="p">[:</span><span class="n">train_samples_xgb</span><span class="p">]</span>
<span class="n">X_test_xgb</span> <span class="o">=</span> <span class="n">X_train</span><span class="p">[</span><span class="n">train_samples_xgb</span><span class="p">:]</span>
<span class="n">y_train_xgb</span> <span class="o">=</span> <span class="n">y_train</span><span class="p">[:</span><span class="n">train_samples_xgb</span><span class="p">]</span>
<span class="n">y_test_xgb</span> <span class="o">=</span> <span class="n">y_train</span><span class="p">[</span><span class="n">train_samples_xgb</span><span class="p">:]</span>

<span class="n">xgb_fit</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">'eval_set'</span><span class="p">:</span> <span class="p">[(</span><span class="n">X_test_xgb</span><span class="p">,</span> <span class="n">y_test_xgb</span><span class="p">)],</span>
    <span class="s">'eval_metric'</span><span class="p">:</span> <span class="s">'auc'</span><span class="p">,</span> 
    <span class="s">'early_stopping_rounds'</span><span class="p">:</span> <span class="mi">10</span>
<span class="p">}</span>

<span class="n">xgb</span> <span class="o">=</span> <span class="n">XGBClassifier</span><span class="p">(</span><span class="n">objective</span><span class="o">=</span><span class="s">'binary:logistic'</span><span class="p">,</span>
                    <span class="n">silent</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">nthread</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>

<span class="n">random_search_xgb</span> <span class="o">=</span> <span class="n">model_param_search</span><span class="p">(</span><span class="n">model</span><span class="o">=</span><span class="n">xgb</span><span class="p">,</span> <span class="n">param_dict</span><span class="o">=</span><span class="n">xgb_params</span><span class="p">,</span> <span class="n">fit_dict</span> <span class="o">=</span> <span class="n">xgb_fit</span><span class="p">)</span>
</code></pre></div></div>

<p>One additional step I take in the above code chunk is that I create a new, additional holdout set to use for this model’s training. I’m doing this because XGBoost’s early stopping method doesn’t work well within sklearn’s randomized parameter search module. While an sklearn model will use each cross validation iteration’s holdout fold for parameter tuning using <code class="language-plaintext highlighter-rouge">RandomizedSearchCV</code>, XGBoost requires a static evaluation set to be passed to its <code class="language-plaintext highlighter-rouge">eval_set</code> parameter. It would be cheating to pass the holdout set used for final model evaluation into this parameter, so I create an intermediate holdout set out of our training set to give the model something to evaluate against for parameter tuning.</p>

<p>A better approach would be to define the model’s cross validation and parameter search manually in this case, passing each iteration’s holdout fold as the eval set during cross validation. This would have the benefit of letting the model see 100%, rather than 90%, of the training data, and also of using all of the training data for evaluation rather than just our new and static 10% holdout set. Given more time or a more serious use case for the model, I’d recommend going this route.</p>
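A skeleton of that manual loop might look like the following, on synthetic stand-in data. In the real workflow the model would be an XGBClassifier and each fold’s holdout would be passed as its eval set for early stopping; that call is shown as a comment so the sketch runs without XGBoost:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# synthetic stand-ins for the training data
rng = np.random.RandomState(0)
X_train = rng.randn(90, 4)
y_train = rng.randint(0, 2, 90)

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=1001)
fold_sizes = []
for train_idx, eval_idx in skf.split(X_train, y_train):
    X_tr, y_tr = X_train[train_idx], y_train[train_idx]
    X_ev, y_ev = X_train[eval_idx], y_train[eval_idx]
    # with XGBoost, the fold's holdout would serve as the eval set:
    # model.fit(X_tr, y_tr, eval_set=[(X_ev, y_ev)], early_stopping_rounds=10)
    fold_sizes.append((len(train_idx), len(eval_idx)))

# every row is used for training in two folds and for evaluation in one
print(fold_sizes)
```

This way no static 10% carve-out is needed: the eval set rotates with the folds, so all of the training data ends up used for both fitting and evaluation.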

<p>Training 60 models isn’t quick, so I recommend setting this up and forgetting it for a while. About 3 hours later, the best-performing hyperparameters were:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
	'colsample_bytree': 0.8760738910546659, 
	'gamma': 0, 
	'learning_rate': 0.21627727424943, 
	'max_depth': 8, 
	'min_child_weight': 7, 
	'n_estimators': 408, 
	'reg_lambda': 10, 
	'subsample': 0.8750373222444883
}
</code></pre></div></div>

<p>None of these values sit at the extreme ends of the distributions defined for the parameter search, so I feel safe in saying that the parameter space has been explored adequately.</p>

<p>The resulting model has 93.2% accuracy and 0.957 AUC score. As expected, this is better than the random forest, but not by a lot.</p>

<h2 id="model-5-ensemble-model">Model 5: Ensemble Model</h2>
<p>The last model is a simple ensemble of the three best models. The intuition behind this is that each model has learned something from the data, and that one model’s deficiencies may be corrected by the strengths of the others. Regression toward the mean is your friend when all models involved are good models.</p>

<p>In this case, there’s the additional benefit that we have three very-different models: a generalized linear model, a random forest, and a gradient boosting machine. The benefits of ensembling are greatest when the models aren’t highly correlated.</p>

<p>I’ll weight the models in order of performance, giving more attention to the XGBoost model and less to the logistic model.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># an ensemble of the three best models
</span><span class="n">y_pred_ensemble</span> <span class="o">=</span> <span class="p">(</span><span class="mi">3</span><span class="o">*</span><span class="n">y_pred_xgb</span> <span class="o">+</span> <span class="mi">2</span><span class="o">*</span><span class="n">y_pred_rf</span> <span class="o">+</span> <span class="n">y_pred_lr2</span><span class="p">)</span> <span class="o">/</span> <span class="mi">6</span>
<span class="n">fpr_ens</span><span class="p">,</span> <span class="n">tpr_ens</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">roc_curve</span><span class="p">(</span><span class="n">y_holdout</span><span class="p">,</span> <span class="n">y_pred_ensemble</span><span class="p">)</span>
</code></pre></div></div>

<p>Another approach I tried unsuccessfully was model stacking. In this approach I took the outputs of these same three models and fed them to a logistic regression model. To my surprise, this model-of-models approach didn’t perform any better than this simpler ensembling method, so I’m not going to give its results any further attention here.</p>
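For reference, the stacking attempt looked roughly like this, with synthetic stand-in arrays in place of the real models’ predictions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic stand-ins for the holdout labels and the three base models'
# predicted probabilities (each correlated with the label plus noise)
rng = np.random.RandomState(0)
y_holdout = rng.randint(0, 2, 200)
y_pred_lr2 = np.clip(y_holdout * 0.6 + rng.rand(200) * 0.4, 0, 1)
y_pred_rf = np.clip(y_holdout * 0.7 + rng.rand(200) * 0.3, 0, 1)
y_pred_xgb = np.clip(y_holdout * 0.8 + rng.rand(200) * 0.2, 0, 1)

# the base models' outputs become features for a second-stage model
stacked_features = np.column_stack([y_pred_lr2, y_pred_rf, y_pred_xgb])
stacker = LogisticRegression().fit(stacked_features, y_holdout)
y_pred_stacked = stacker.predict_proba(stacked_features)[:, 1]
print(y_pred_stacked.shape)
```

In a proper stacking setup the second-stage model should be fit on out-of-fold predictions rather than the same holdout it’s evaluated on; this sketch glosses over that detail.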

<h1 id="howd-we-do">How’d we do?</h1>

<p>Because The Shift is relatively uncommon, I lean on AUC score as my metric of choice. Accuracy is more interpretable as a metric, so I report it too, but AUC is the fairest way to compare these models against one another.</p>
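A quick pure-Python illustration of why: with The Shift occurring in roughly 14% of observations, a degenerate model that always predicts “no shift” scores 86% accuracy while learning nothing, and its AUC is only 0.5, i.e. chance level:

```python
# 14% minority class, mirroring the shift data's imbalance
labels = [1] * 14 + [0] * 86
always_no_shift_scores = [0.0] * 100  # a "model" that never predicts a shift

# accuracy looks strong despite the model being useless
predictions = [1 if s > 0.5 else 0 for s in always_no_shift_scores]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(accuracy)  # 0.86

# AUC: the probability that a random positive outranks a random negative
pos = [s for s, y in zip(always_no_shift_scores, labels) if y == 1]
neg = [s for s, y in zip(always_no_shift_scores, labels) if y == 0]
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
auc = sum(pairs) / len(pairs)
print(auc)  # 0.5
```

AUC is immune to this failure mode because it measures ranking quality rather than agreement with a fixed threshold.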

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>AUC</th>
      <th>Accuracy</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Logistic Without Context</td>
      <td>0.887</td>
      <td>88.6%</td>
    </tr>
    <tr>
      <td>Logistic With Context</td>
      <td>0.940</td>
      <td>92.2%</td>
    </tr>
    <tr>
      <td>Random Forest</td>
      <td>0.957</td>
      <td>92.9%</td>
    </tr>
    <tr>
      <td>XGBoost</td>
      <td>0.957</td>
      <td>93.2%</td>
    </tr>
    <tr>
      <td>Average of Models</td>
      <td>0.959</td>
      <td>93.2%</td>
    </tr>
    <tr>
      <td>Stacked Models</td>
      <td>0.955</td>
      <td>92.8%</td>
    </tr>
  </tbody>
</table>

<p>My first thought when seeing this is that it’s a good reminder that most of your gains in any modeling task come from feature engineering. Using hand-crafted features improves the logistic AUC from 0.88 to 0.94, and gives it nearly four percentage points of increased accuracy. That’s a huge gain in a problem where only 14% of observations come from the minority class that we’re trying to predict. Moving from the logistic model to more complicated models, however, shows smaller gains. The features got us most of the way there, and machine learning provided the incremental “last mile” gains needed to move from a decent model to an optimal one.</p>

<p align="center">
    <img src="/images/fulls/results-auc.png" alt="" />
    AUC Scores for All Models
</p>

<p>The different models’ ROC curves show this visually. There’s a huge gain in going from just the two main variables to including all of our hand-crafted features. After that, there are some gains to show for using more advanced modeling techniques, but the gains are comparatively small.</p>

<p>The best model here is the average of the three best standalone models, which provides an AUC score of almost 0.96 and an accuracy above 93%. Those are both really good scores, indicating that these models have learned a lot from the underlying data.</p>

<h1 id="discussion">Discussion</h1>
<p>One thing I noticed that can make the modeling experience <strong>much</strong> better in this case is to come up with a criterion for dropping features that don’t provide much information to the model. An approach I found success with was to use sklearn’s <code class="language-plaintext highlighter-rouge">SelectFromModel</code> in conjunction with <code class="language-plaintext highlighter-rouge">RandomForestClassifier</code>’s built-in gini importance scores. Another approach that may have worked here would have been to use an L1-penalized logistic regression model, optimize the shrinkage parameter, and throw out the features whose weights had shrunk to zero.</p>
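A minimal sketch of that feature-pruning approach on synthetic data (the dataset and settings here are placeholders, not the shift data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# 20 features, only 5 of which carry signal
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

# keep only features whose gini importance clears the default (mean) threshold
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0))
X_reduced = selector.fit_transform(X, y)
print(X.shape[1], "->", X_reduced.shape[1])  # fewer features survive
```

The reduced matrix can then be fed back into the modeling pipeline, shrinking the search space and speeding up every subsequent fit.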

<p>Something that isn’t in this model, which I wish I’d used, was each team’s record at the time of each game. I suspect that strategic positioning happens less when the playoffs are out of reach. This could be captured by using win-percentage and games-left-in-season as features in the model, and I suspect this would provide some lift in the final results.</p>

<p>Another thing I wish I’d been able to capture in this model is a player’s sprint speed. I don’t believe this is present in the data that I was working with, but I would guess that this plays at least a minor role when a team decides whether to shift.</p>

<p>One last idea this leaves untested is whether I went far enough in exploring this as a sequence learning problem. Providing lagged features clearly gives considerable lift over using only static features. What would happen if we represented players’ at-bat histories as sequences and trained an LSTM on the relationship between those histories and their corresponding defensive positioning? The data are reasonably large, so it’s not completely crazy to think that this approach would work.</p>

<p>All in all, I’m pretty happy with how this turned out. Taking player profile, ability, and game context into account, these models show that we can predict The Shift with over 93% accuracy, with a near-perfect AUC score of 0.959. A perfect model is probably impossible here, but I’d like to think up a few new features that can help to fix these final 7% of missed predictions. If I come up with anything interesting, I’ll be updating this post.</p>]]></content><author><name>James LeDoux</name><email>ledoux.james.r@gmail.com</email></author><category term="projects" /><summary type="html"><![CDATA[Using machine learning to predict strategic infield positioning using statcast data and contextual feature engineering.]]></summary></entry><entry><title type="html">Visualizing MLB Team Rankings with ggplot2 and Bump Charts</title><link href="https://jamesrledoux.com/visualization/mlb-rankings/" rel="alternate" type="text/html" title="Visualizing MLB Team Rankings with ggplot2 and Bump Charts" /><published>2018-08-05T00:00:00+00:00</published><updated>2018-08-05T00:00:00+00:00</updated><id>https://jamesrledoux.com/visualization/mlb-rankings</id><content type="html" xml:base="https://jamesrledoux.com/visualization/mlb-rankings/"><![CDATA[<meta property="og:image" content="/images/fulls/bump_chart_final.png" />

<meta property="og:image:type" content="image/png" />

<meta property="og:image:width" content="200" />

<meta property="og:image:height" content="200" />

<meta name="twitter:card" content="summary_large_image" />

<meta name="twitter:site" content="@jmzledoux" />

<meta name="twitter:creator" content="@jmzledoux" />

<meta name="twitter:title" content="Visualizing MLB Team Rankings with ggplot2 and Bump Charts" />

<meta name="twitter:image" content="http://jamesrledoux.com/images/fulls/bump_chart_final.png" />

<p>The 2018 MLB season has so far been just like every other season: filled with ups, downs, win streaks, teams plagued with injuries, and so-on. In this post I aim to catch up on the current season with a single chart, showing how the leagues’ rankings have changed throughout the year. My visualization of choice here is the <a href="https://www.google.com/search?q=bump+chart&amp;rlz=1C5CHFA_enUS776US776&amp;source=lnms&amp;tbm=isch&amp;sa=X&amp;ved=0ahUKEwjtker9ltfcAhUh64MKHaWcAn0Q_AUICigB&amp;biw=1215&amp;bih=638">bump chart</a>, a type of line chart showing changes in rankings over time. If you just want to see the final product, this is what it looks like:</p>

<p align="center">
    <img src="/images/fulls/bump_chart_final.png" alt="" />
</p>

<p>If you’re still reading, here’s how you can create your own.</p>

<h2 id="data-acquisition-pybaseball">Data Acquisition: Pybaseball</h2>

<p>First we’ll need data on every team’s record at each point within the season so far. My plan is to visualize this data in R with ggplot, and there are several capable R packages for pulling baseball data (<a href="https://github.com/BillPetti/baseballr">baseballr</a> and <a href="https://cran.r-project.org/web/packages/Lahman/Lahman.pdf">Lahman</a>, to name two). Since I maintain the <a href="https://github.com/jldbc/pybaseball">pybaseball</a> package, however, I’ll eat my own dogfood and start from there.</p>

<p>I use <code class="language-plaintext highlighter-rouge">pybaseball.schedule_and_record(year, team_code)</code> to fetch each team’s 2018 data. Once these records are concatenated into a single dataframe covering the whole season, I clean the <code class="language-plaintext highlighter-rouge">Date</code> column to standardize dates across the dataframe, cut off dates that are in the future, and calculate each team’s win percentage at the end of each game day. I then export this as a csv so it can be used in R with ggplot.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">from</span> <span class="nn">pybaseball</span> <span class="kn">import</span> <span class="n">schedule_and_record</span>

<span class="n">teams</span> <span class="o">=</span> <span class="p">[</span><span class="s">'BOS'</span><span class="p">,</span><span class="s">'NYY'</span><span class="p">,</span><span class="s">'TB'</span><span class="p">,</span><span class="s">'TOR'</span><span class="p">,</span><span class="s">'BAL'</span><span class="p">,</span><span class="s">'CLE'</span><span class="p">,</span><span class="s">'MIN'</span><span class="p">,</span><span class="s">'KC'</span><span class="p">,</span><span class="s">'CHW'</span><span class="p">,</span>
         <span class="s">'DET'</span><span class="p">,</span><span class="s">'HOU'</span><span class="p">,</span><span class="s">'LAA'</span><span class="p">,</span><span class="s">'SEA'</span><span class="p">,</span><span class="s">'TEX'</span><span class="p">,</span><span class="s">'OAK'</span><span class="p">,</span><span class="s">'WSN'</span><span class="p">,</span><span class="s">'MIA'</span><span class="p">,</span><span class="s">'ATL'</span><span class="p">,</span>
         <span class="s">'NYM'</span><span class="p">,</span><span class="s">'PHI'</span><span class="p">,</span><span class="s">'CHC'</span><span class="p">,</span><span class="s">'MIL'</span><span class="p">,</span><span class="s">'STL'</span><span class="p">,</span><span class="s">'PIT'</span><span class="p">,</span><span class="s">'CIN'</span><span class="p">,</span><span class="s">'LAD'</span><span class="p">,</span><span class="s">'ARI'</span><span class="p">,</span>
         <span class="s">'COL'</span><span class="p">,</span><span class="s">'SD'</span><span class="p">,</span><span class="s">'SF'</span><span class="p">]</span>

<span class="c1"># collect every team's record for the 2018 season
</span><span class="n">records</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">teams</span><span class="p">:</span>
    <span class="n">s</span> <span class="o">=</span> <span class="n">schedule_and_record</span><span class="p">(</span><span class="mi">2018</span><span class="p">,</span> <span class="n">t</span><span class="p">)</span>
    <span class="n">records</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">s</span><span class="p">)</span>

<span class="c1">#concatenate records together so the whole season is in one dataframe
</span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="n">records</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>

<span class="c1"># standardize the date formats of double-header games
</span><span class="n">df</span><span class="p">.</span><span class="n">Date</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">Date</span><span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">' (1)'</span><span class="p">,</span><span class="s">''</span><span class="p">,</span><span class="n">regex</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">df</span><span class="p">.</span><span class="n">Date</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">Date</span><span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">' (2)'</span><span class="p">,</span><span class="s">''</span><span class="p">,</span><span class="n">regex</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>

<span class="c1"># turn this into a date format that Pandas will recognize
</span><span class="n">df</span><span class="p">.</span><span class="n">Date</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">Date</span><span class="p">,</span><span class="nb">format</span><span class="o">=</span><span class="s">'%A, %b %d'</span><span class="p">)</span>
<span class="n">df</span><span class="p">.</span><span class="n">Date</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">Date</span><span class="p">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="n">year</span><span class="o">=</span><span class="mi">2018</span><span class="p">))</span>

<span class="c1"># cut out games that havent happened yet
</span><span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">df</span><span class="p">.</span><span class="n">Date</span> <span class="o">&lt;</span> <span class="s">'2018-08-05'</span><span class="p">]</span>

<span class="c1"># extract win and loss values from "w-l" strings
</span><span class="n">df</span><span class="p">[</span><span class="s">'W'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'W-L'</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">'-'</span><span class="p">).</span><span class="nb">str</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s">'L'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'W-L'</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">'-'</span><span class="p">).</span><span class="nb">str</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s">'win_pct'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'W'</span><span class="p">]</span> <span class="o">/</span> <span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'W'</span><span class="p">]</span> <span class="o">+</span> <span class="n">df</span><span class="p">[</span><span class="s">'L'</span><span class="p">])</span>

<span class="n">df</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">'2018-records.csv'</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="data-preparation">Data Preparation</h2>

<p>Next the data get loaded into R. It is easiest to rank the teams when their win rate is known at every point in time, not only on game days. For this reason, my first step in preparing the data is to fill in the missing non-game days with the win percentage from each team’s most recent game day.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">cowplot</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tidyr</span><span class="p">)</span><span class="w">

</span><span class="n">win_percentages</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">read.csv</span><span class="p">(</span><span class="s1">'2018-records.csv'</span><span class="p">)</span><span class="w">
</span><span class="n">win_percentages</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">win_percentages</span><span class="p">[,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'Tm'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Date'</span><span class="p">,</span><span class="w"> </span><span class="s1">'win_pct'</span><span class="p">)]</span><span class="w">
</span><span class="n">win_percentages</span><span class="p">[</span><span class="nf">is.na</span><span class="p">(</span><span class="n">win_percentages</span><span class="o">$</span><span class="n">win_pct</span><span class="p">),</span><span class="w"> </span><span class="s1">'win_pct'</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="w">

</span><span class="c1"># create a dummy column to give dplyr left_join the effect of a cross join</span><span class="w">
</span><span class="n">dates</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">setNames</span><span class="p">(</span><span class="n">data.frame</span><span class="p">(</span><span class="n">unique</span><span class="p">(</span><span class="n">win_percentages</span><span class="o">$</span><span class="n">Date</span><span class="p">),</span><span class="w"> </span><span class="n">dummy</span><span class="o">=</span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'Date'</span><span class="p">,</span><span class="w"> </span><span class="s1">'dummy'</span><span class="p">))</span><span class="w">
</span><span class="n">teams</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">setNames</span><span class="p">(</span><span class="n">data.frame</span><span class="p">(</span><span class="n">unique</span><span class="p">(</span><span class="n">win_percentages</span><span class="o">$</span><span class="n">Tm</span><span class="p">),</span><span class="w"> </span><span class="n">dummy</span><span class="o">=</span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'Tm'</span><span class="p">,</span><span class="w"> </span><span class="s1">'dummy'</span><span class="p">))</span><span class="w">

</span><span class="c1"># rejoin tables to have one row per day per team</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">left_join</span><span class="p">(</span><span class="n">dates</span><span class="p">,</span><span class="w"> </span><span class="n">teams</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="o">=</span><span class="s1">'dummy'</span><span class="p">)</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">left_join</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">win_percentages</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s1">'Tm'</span><span class="p">,</span><span class="s1">'Date'</span><span class="p">))</span><span class="w">

</span><span class="c1"># fill non-gameday win percentages with the previous-gameday's win percent</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">Date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.Date</span><span class="p">(</span><span class="n">Date</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
            </span><span class="n">complete</span><span class="p">(</span><span class="n">Date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq.Date</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="n">Date</span><span class="p">),</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">Date</span><span class="p">),</span><span class="w"> </span><span class="n">by</span><span class="o">=</span><span class="s2">"day"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
            </span><span class="n">group_by</span><span class="p">(</span><span class="n">Tm</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">fill</span><span class="p">(</span><span class="s1">'win_pct'</span><span class="p">)</span><span class="w">

</span><span class="c1"># remove NAs generated by the all star break when no games were played</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="p">[</span><span class="o">!</span><span class="nf">is.na</span><span class="p">(</span><span class="n">df</span><span class="o">$</span><span class="n">Tm</span><span class="p">),]</span><span class="w">
</span></code></pre></div></div>
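<p>For readers who’d rather stay in Python, the same gap-filling step can be sketched in pandas. This is a minimal sketch using a toy frame standing in for the real records (the column names match those used above):</p>

```python
import pandas as pd

# toy stand-in for the records frame: one row per game day per team
df = pd.DataFrame({
    'Tm':      ['BOS', 'BOS', 'NYY'],
    'Date':    pd.to_datetime(['2018-04-01', '2018-04-03', '2018-04-02']),
    'win_pct': [1.0, 0.5, 0.0],
})

# full calendar of days in the observed range
all_days = pd.date_range(df['Date'].min(), df['Date'].max(), freq='D')

# reindex each team onto the full calendar, forward-filling win_pct
# across non-game days (the analogue of tidyr's complete() + fill())
filled = (
    df.set_index('Date')
      .groupby('Tm')['win_pct']
      .apply(lambda s: s.reindex(all_days).ffill())
      .reset_index()
)
filled.columns = ['Tm', 'Date', 'win_pct']
```

<p>Here BOS’s idle day (2018-04-02) inherits its 2018-04-01 win percentage, just as the R code above carries each team’s record across off days.</p>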

<p>Now that the data are formatted how we want them, we’ll need to rank the teams by their records. Because a bump chart with all 30 teams would be messy, and because rankings across leagues don’t have much real-world value, we’ll want to separate NL teams from AL teams first. To do this, create a list of one league’s teams and use a dplyr <code class="language-plaintext highlighter-rouge">filter</code> on it.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">al_teams</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'BOS'</span><span class="p">,</span><span class="w"> </span><span class="s1">'NYY'</span><span class="p">,</span><span class="w"> </span><span class="s1">'BAL'</span><span class="p">,</span><span class="w"> </span><span class="s1">'TBR'</span><span class="p">,</span><span class="w"> </span><span class="s1">'TOR'</span><span class="p">,</span><span class="w"> </span><span class="s1">'CHW'</span><span class="p">,</span><span class="s1">'CLE'</span><span class="p">,</span><span class="w">
             </span><span class="s1">'DET'</span><span class="p">,</span><span class="w"> </span><span class="s1">'KCR'</span><span class="p">,</span><span class="w"> </span><span class="s1">'MIN'</span><span class="p">,</span><span class="s1">'HOU'</span><span class="p">,</span><span class="s1">'LAA'</span><span class="p">,</span><span class="w"> </span><span class="s1">'OAK'</span><span class="p">,</span><span class="w"> </span><span class="s1">'SEA'</span><span class="p">,</span><span class="w"> </span><span class="s1">'TEX'</span><span class="p">)</span><span class="w">
</span><span class="n">al</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">Tm</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="n">al_teams</span><span class="p">)</span><span class="w">
</span><span class="n">nl</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">Tm</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="n">al_teams</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>

<p>Next, correct for double-header days by grouping on date-team combinations and taking only the last row of each group. This is the team’s record at the end of its double-header. After this, ranking teams by their records is as simple as grouping by <code class="language-plaintext highlighter-rouge">Date</code>, sorting by win percentage, and ordering the teams from best to worst. Ties in this case naively go to the team that comes first in the alphabet.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">by_date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">al</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">Tm</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">row_number</span><span class="p">()</span><span class="o">==</span><span class="n">n</span><span class="p">())</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">unique</span><span class="p">()</span><span class="w">
</span><span class="n">by_date</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">by_date</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">Date</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
           </span><span class="n">arrange</span><span class="p">(</span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">desc</span><span class="p">(</span><span class="n">win_pct</span><span class="p">),</span><span class="w"> </span><span class="n">Tm</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
           </span><span class="n">mutate</span><span class="p">(</span><span class="n">Rank</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rank</span><span class="p">(</span><span class="o">-</span><span class="n">win_pct</span><span class="p">,</span><span class="w"> </span><span class="n">ties.method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"first"</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>
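<p>The same daily ranking, with ties broken alphabetically, can be sketched in pandas. A minimal toy example (the real frame has one row per team per day):</p>

```python
import pandas as pd

# toy frame: three teams' records on one day, with a tie at 0.5
df = pd.DataFrame({
    'Date':    ['2018-04-01'] * 3,
    'Tm':      ['NYY', 'BOS', 'TBR'],
    'win_pct': [0.5, 0.5, 0.0],
})

# sort so that within a day, higher win_pct comes first and ties fall back
# to alphabetical order; rank(method='first') then awards the better rank
# to whichever tied team appears earlier in that sorted order
by_date = df.sort_values(['Date', 'win_pct', 'Tm'],
                         ascending=[True, False, True])
by_date['Rank'] = (by_date.groupby('Date')['win_pct']
                          .rank(method='first', ascending=False)
                          .astype(int))
```

<p>With the tie at 0.5, BOS ranks ahead of NYY because it comes first alphabetically, mirroring the <code class="language-plaintext highlighter-rouge">ties.method = "first"</code> behavior in the R code above.</p>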

<h2 id="visualization">Visualization</h2>

<p>Now on to the fun part: graphing it. We can start by defining the color associated with each team’s line on the graph. Because team colors are well known to fans, this will help the plot’s interpretability. Conveniently, there’s a website built for this exact purpose: <a href="https://teamcolorcodes.com/">teamcolorcodes.com</a>. I selected a hex code for each team and added them to a named vector like so:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">team_colors</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">BOS</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#BD3039'</span><span class="p">,</span><span class="w"> </span><span class="n">NYY</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#003087'</span><span class="p">,</span><span class="w"> </span><span class="n">TBR</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#8FBCE6'</span><span class="p">,</span><span class="w"> </span><span class="n">KCR</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#BD9B60'</span><span class="p">,</span><span class="w">
                </span><span class="n">CHW</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#27251F'</span><span class="p">,</span><span class="w"> </span><span class="n">BAL</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#DF4601'</span><span class="p">,</span><span class="w"> </span><span class="n">CLE</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#E31937'</span><span class="p">,</span><span class="w"> </span><span class="n">MIN</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#002B5C'</span><span class="p">,</span><span class="w">
                </span><span class="n">DET</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#FA4616'</span><span class="p">,</span><span class="w"> </span><span class="n">HOU</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#EB6E1F'</span><span class="p">,</span><span class="w"> </span><span class="n">LAA</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#BA0021'</span><span class="p">,</span><span class="w"> </span><span class="n">SEA</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#005C5C'</span><span class="p">,</span><span class="w"> 
                </span><span class="n">TEX</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#003278'</span><span class="p">,</span><span class="w"> </span><span class="n">OAK</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#003831'</span><span class="p">,</span><span class="w"> </span><span class="n">WSN</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#14225A'</span><span class="p">,</span><span class="w"> </span><span class="n">MIA</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#FF6600'</span><span class="p">,</span><span class="w">
                </span><span class="n">ATL</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#13274F'</span><span class="p">,</span><span class="w"> </span><span class="n">NYM</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#002D72'</span><span class="p">,</span><span class="w"> </span><span class="n">PHI</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#E81828'</span><span class="p">,</span><span class="w"> </span><span class="n">CHC</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#0E3386'</span><span class="p">,</span><span class="w">
                </span><span class="n">MIL</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#B6922E'</span><span class="p">,</span><span class="w"> </span><span class="n">STL</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#C41E3A'</span><span class="p">,</span><span class="w"> </span><span class="n">PIT</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#FDB827'</span><span class="p">,</span><span class="w"> </span><span class="n">CIN</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#C6011F'</span><span class="p">,</span><span class="w">
                </span><span class="n">LAD</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#005A9C'</span><span class="p">,</span><span class="w"> </span><span class="n">ARI</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#A71930'</span><span class="p">,</span><span class="w"> </span><span class="n">COL</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#33006F'</span><span class="p">,</span><span class="w"> </span><span class="n">SDP</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#002D62'</span><span class="p">,</span><span class="w">
                </span><span class="n">SFG</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#FD5A1E'</span><span class="p">,</span><span class="w"> </span><span class="n">TOR</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'#134A8E'</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>First let’s create the main plot: a bump chart showing the 15 teams’ rankings throughout the progression of the season. I plot a <code class="language-plaintext highlighter-rouge">geom_line</code> for each value of <code class="language-plaintext highlighter-rouge">Tm</code>, and then flip the scale so that a lower (better) value of <code class="language-plaintext highlighter-rouge">Rank</code> will be at the top of the Y axis. To make this somewhat interpretable, I label the teams at their final positions (yesterday, the most recent date for which I have data) at the tail-end of the chart, so that they can be traced back in time throughout the season, and color-code them with <code class="language-plaintext highlighter-rouge">scale_color_manual</code> so that each line matches the team’s colors. The code for the main chart is this:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">by_date</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Rank</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Tm</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Tm</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.75</span><span class="p">,</span><span class="w"> </span><span class="n">show.legend</span><span class="o">=</span><span class="nb">F</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">scale_y_reverse</span><span class="p">(</span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">32</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">geom_text</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">subset</span><span class="p">(</span><span class="n">by_date</span><span class="p">,</span><span class="w"> </span><span class="n">Date</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">as.Date</span><span class="p">(</span><span class="s2">"2018-08-04"</span><span class="p">)),</span><span class="w"> 
              </span><span class="n">aes</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="n">Tm</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2.5</span><span class="p">,</span><span class="w"> </span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-.1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">scale_color_manual</span><span class="p">(</span><span class="n">values</span><span class="o">=</span><span class="n">team_colors</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>Which produces an outcome that looks like this:</p>
<p align="center">
    <img src="/images/fulls/bump_chart_main.png" alt="" />
</p>

<p>This is nice, but it’s also a bit noisy. As a visual aid, we’ll next show each team’s line in isolation along the border. To do this, we first need a function that generates plot <code class="language-plaintext highlighter-rouge">p</code> for a single team. This is the same as above, but with the axes wiped out to minimize noise.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">teamplot</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">team_code</span><span class="p">){</span><span class="w">
    </span><span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">by_date</span><span class="p">[</span><span class="n">by_date</span><span class="o">$</span><span class="n">Tm</span><span class="o">==</span><span class="n">team_code</span><span class="p">,],</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Rank</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Tm</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Tm</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.75</span><span class="p">,</span><span class="w"> </span><span class="n">show.legend</span><span class="o">=</span><span class="nb">F</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">scale_y_reverse</span><span class="p">(</span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">32</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">scale_color_manual</span><span class="p">(</span><span class="n">values</span><span class="o">=</span><span class="n">team_colors</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> 
    </span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="o">=</span><span class="n">team_code</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.title.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
          </span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
          </span><span class="n">axis.ticks.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
          </span><span class="n">axis.ticks.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
          </span><span class="n">axis.text.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
          </span><span class="n">axis.title.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">angle</span><span class="o">=</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="o">=</span><span class="m">6</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>With this in place, all that’s left is to arrange the team charts and league chart side by side. Luckily, there’s a library for that. Using <code class="language-plaintext highlighter-rouge">library(grid)</code>, I define a 16x4 grid. Row 1 holds a title spanning the entire figure; rows 2 - 16 of columns 2 - 4 hold the main chart showing all teams; and rows 2 - 16 of column 1 hold the 15 individual-team charts. This is put together like so:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">grid</span><span class="p">)</span><span class="w">
</span><span class="n">pushViewport</span><span class="p">(</span><span class="n">viewport</span><span class="p">(</span><span class="n">layout</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">grid.layout</span><span class="p">(</span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">16</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">)))</span><span class="w">

</span><span class="c1"># helper function for defining a region in the layout</span><span class="w">
</span><span class="c1"># source: http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/81-ggplot2-easy-way-to-mix-multiple-graphs-on-the-same-page/</span><span class="w">
</span><span class="n">define_region</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">row</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="p">){</span><span class="w">
  </span><span class="n">viewport</span><span class="p">(</span><span class="n">layout.pos.row</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">row</span><span class="p">,</span><span class="w"> </span><span class="n">layout.pos.col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">col</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w"> 
</span><span class="n">grid.text</span><span class="p">(</span><span class="nf">expression</span><span class="p">(</span><span class="n">bold</span><span class="p">(</span><span class="s2">"American League Standings"</span><span class="p">)),</span><span class="w"> </span><span class="n">vp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">define_region</span><span class="p">(</span><span class="n">row</span><span class="w"> </span><span class="o">=</span><span class="m">1</span><span class="o">:</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">4</span><span class="p">),</span><span class="w"> </span><span class="n">gp</span><span class="o">=</span><span class="n">gpar</span><span class="p">(</span><span class="n">fontsize</span><span class="o">=</span><span class="m">15</span><span class="p">))</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="n">vp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">define_region</span><span class="p">(</span><span class="n">row</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="o">:</span><span class="m">16</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="o">:</span><span class="m">4</span><span class="p">))</span><span class="w"> 
</span><span class="n">teams</span><span class="o">=</span><span class="n">unique</span><span class="p">(</span><span class="n">by_date</span><span class="p">[</span><span class="n">by_date</span><span class="o">$</span><span class="n">Date</span><span class="o">==</span><span class="s1">'2018-08-04'</span><span class="p">,</span><span class="s1">'Tm'</span><span class="p">])</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">idx</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">2</span><span class="o">:</span><span class="m">16</span><span class="p">){</span><span class="w">
  </span><span class="n">print</span><span class="p">(</span><span class="n">teamplot</span><span class="p">(</span><span class="n">teams</span><span class="p">[[</span><span class="m">1</span><span class="p">]][[</span><span class="n">idx</span><span class="m">-1</span><span class="p">]]),</span><span class="w"> </span><span class="n">vp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">define_region</span><span class="p">(</span><span class="n">row</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">idx</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>which, finally, gives us a finished product that looks like this:</p>
<p align="center">
    <img src="/images/fulls/bump_chart_final.png" alt="" />
</p>

<p>Replace <code class="language-plaintext highlighter-rouge">al</code> with <code class="language-plaintext highlighter-rouge">nl</code> in the code above to get the same figure for the National League, and then we’re done here. Simple enough!</p>]]></content><author><name>James LeDoux</name><email>ledoux.james.r@gmail.com</email></author><category term="visualization" /><summary type="html"><![CDATA[A quick tutorial on fetching MLB win-loss data with pybaseball and cleaning and visualizing it with the tidyverse (dplyr and ggplot).]]></summary></entry><entry><title type="html">On Draft Pick Value, the New Lottery, and Tanking</title><link href="https://jamesrledoux.com/projects/nba-draft/" rel="alternate" type="text/html" title="On Draft Pick Value, the New Lottery, and Tanking" /><published>2017-11-30T00:00:00+00:00</published><updated>2017-11-30T00:00:00+00:00</updated><id>https://jamesrledoux.com/projects/nba-draft</id><content type="html" xml:base="https://jamesrledoux.com/projects/nba-draft/"><![CDATA[<meta property="og:image" content="http://jamesrledoux.com/images/fulls/pick_value_smooth_final.png" />

<meta property="og:image:type" content="image/png" />

<meta property="og:image:width" content="200" />

<meta property="og:image:height" content="200" />

<meta name="twitter:card" content="summary_large_image" />

<meta name="twitter:site" content="@jmzledoux" />

<meta name="twitter:creator" content="@jmzledoux" />

<meta name="twitter:title" content="On Draft Pick Value, the New Lottery, and Tanking" />

<meta name="twitter:image" content="http://jamesrledoux.com/images/fulls/pick_value_smooth_final.png" />

<p>Tanking becomes a hot topic each season once it becomes apparent which of the NBA’s worst teams will be missing the playoffs. Though never explicitly acknowledged, a team with no playoff hopes will sometimes suspiciously begin to decrease the minutes of its starters, leaving its worst lineups on the floor in ways that can only be described as anti-competitive. With playoff glory out of the question, the process of losing by design in pursuit of better draft picks is one of the only remaining ways to extract value from a disappointing season. And so the race to the bottom begins; the losses pile up and the fan base turns its eyes to the college talent pool, wondering which member of the upcoming draft class might be the one to turn around the struggling franchise.</p>

<p>The league leadership’s distaste for this strategy of designed awfulness is no secret: it’s boring to watch, bad for advertising revenue, and goes against the ideals of competitive play that Commissioner Adam Silver so adamantly supports. This phenomenon of shedding wins for improved lottery odds is of such high concern to league officials that, beginning with the 2019 draft, lottery odds will be adjusted in order to decrease teams’ incentives to tank. This post is an attempt to understand the value of the draft and of tanking in the NBA, before and after the implementation of the upcoming lottery changes.</p>

<p>Namely, I will address the following questions:</p>
<ul>
  <li>How many wins is a draft pick worth?</li>
  <li>How much is there to gain by tanking one position in the league standings?</li>
  <li>What’s a good draft day trade?</li>
  <li>How will the new lottery odds impact this when implemented in 2019?</li>
</ul>

<h2 id="data">Data</h2>
<p>The data for this analysis comes from two different sources. First I manually grabbed all draft results from 1960 to present from <a href="https://www.basketball-reference.com">basketball-reference.com</a>.  The second dataset comes from <a href="https://www.google.com/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=1&amp;cad=rja&amp;uact=8&amp;ved=0ahUKEwi7n-a-19vXAhUIhlQKHcunA90QFggoMAA&amp;url=https%3A%2F%2Fwww.kaggle.com%2Fdrgilermo%2Fnba-players-stats&amp;usg=AOvVaw2DbzUk5ojLaP5E3kOEB5_d">Kaggle</a>, which provides a dataset of NBA season-level data for each player since 1950. Of this, I kept only the observations since the 1960 season, as anybody who’d played before 1960 wouldn’t be found in the draft pick data. Because the draft has undergone some changes since the 60s, I also removed all picks outside the first and second round, as the draft only has two rounds in its present form.</p>

<h2 id="metrics-of-interest-win-share-approximate-value-and-player-efficiency-rating">Metrics of Interest: Win Share, Approximate Value, and Player Efficiency Rating</h2>
<p>To determine the value of a draft pick, we first need a way to measure the value of the player selected with it. Two of the most commonly used metrics for overall player value are win share (WS) and player efficiency rating (PER).</p>

<p>PER strives to measure per-minute performance while adjusting for pace, taking into account field goals, free throws, 3-pointers, assists, blocks, steals, missed shots, turnovers, and personal fouls. The league average is always equal to 15, values in the 20s indicate star-level performance, and below 10 typically puts a player toward the end of his team’s bench <a href="https://en.wikipedia.org/wiki/Player_efficiency_rating">(Wikipedia)</a>. Some issues with PER are that it’s measured per-minute, sometimes assigning excessive value to low-minute players and equating them with all stars. It’s also challenging to interpret, since PER points have no direct connection with wins or any other metric. Despite this, PER sees wide use due to its quality of taking multiple facets of the game into account and assigning a single value to a player’s overall per-minute performance. As this metric can become volatile for low-minutes players due to small-sample properties, I ignore PER values for players during seasons in which they averaged less than five minutes per game.</p>

<p><a href="https://www.basketball-reference.com/about/ws.html">Win Share</a> serves as a convenient response to PER’s flaws. Similar to PER, win shares take into account just about every box score statistic relevant to a player’s performance. WS holds slightly more real-world value, however, in that it’s directly tied to wins. Under Basketball Reference’s system, one team win equals one win share. Players’ contributions are weighted, dividing WS into Offensive and Defensive Win Share (OWS/DWS), where WS = OWS + DWS. Each team’s wins are divided amongst its players according to how much each has contributed on offense and defense. For details on how this is calculated, I suggest reading Basketball Reference’s description <a href="https://www.basketball-reference.com/about/ws.html">here</a>. The all-time single-season record for win shares belongs to Kareem Abdul-Jabbar, who was responsible for 25.4 of the Bucks’ wins in 1971-72. A player on a losing team, on the other hand, has little hope of achieving a high win share, as there are fewer win shares to distribute throughout the roster.</p>

<p>The last statistic I’ll use in this analysis is an adaptation of sports-reference’s <a href="https://www.sports-reference.com/blog/approximate-value/">Approximate Value (AV)</a> statistic. Approximate Value takes an existing stat and scales it to weight a player’s best seasons heavier. In my case I’ll front-weight a player’s best win share seasons, adding up 1*(best season) + 0.95*(second best) + 0.9*(third best), all the way down to 0.55*(tenth best). My reason for doing this is that career win shares, as they stand, measure greatness over the span of a player’s career. Some may prefer this, but others will object that this disproportionately rewards players with better longevity and health, which usually can’t be reliably forecast on draft day. Since this is an exercise in the value of draft picks, I find it necessary to include AV as an alternative to WS for this reason. As we’ll see later, AV and WS tell approximately the same story, but I do like that this statistic evens the playing field between a player like Kareem, who had 20 healthy seasons (Career WS 273, AV 144), and Michael Jordan, who only played 15 seasons (Career WS 214, AV 147). It also allows some of the current players to enter the conversation, such as Kevin Durant, currently in his 11th season, who is ranked 35th in career WS (since his career is likely far from over) but 17th in ten-year AV. This adaptation of the AV statistic was inspired by <a href="https://fivethirtyeight.com/features/are-some-positions-riskier-to-pick-than-others-in-the-nfl-draft/">this piece</a> from FiveThirtyEight, which uses AV to evaluate the value of draft picks in the injury-plagued, short-careered National Football League.</p>
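The front-weighting described above is simple to implement. Here’s a minimal sketch in Python (the function name and the handling of careers shorter than ten seasons are my own assumptions; the post doesn’t show its implementation):

```python
def approximate_value(season_win_shares):
    """Approximate Value from a list of single-season win shares.

    Front-weights the ten best seasons: weights run 1.00, 0.95, 0.90,
    ... down to 0.55 for the tenth-best season, in steps of 0.05.
    """
    best = sorted(season_win_shares, reverse=True)[:10]
    weights = [1.0 - 0.05 * i for i in range(len(best))]
    return sum(w * ws for w, ws in zip(weights, best))
```

A player with ten seasons of exactly 10 win shares apiece would score 77.5 AV against 100 career WS, which illustrates how AV discounts raw longevity.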

<p>As a quick sanity check, let’s see who some of the all-time greats are according to these three measures.</p>

<p align="center">
    <img src="/images/fulls/top_players_av.png" alt="" />
    Table 1: Top Ten Players Drafted Since 1960 According to Approximate Value (AV)
</p>

<p>This looks about right. Career Win Share rewards greatness over the course of a career, Approximate Value rewards greatness at a player’s peak over a ten year period, and PER rewards a player’s average efficiency. Accordingly, all players on this list are deserving of legend status, AV rewards the short-careered all-time greats such as David Robinson and Magic Johnson, and PER rewards those who packed the most productivity into their time on the floor.</p>

<p>Since PER doesn’t appear to match the other two measures too closely, we can double-check that these metrics tell roughly the same story using a measure called <a href="https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient">Kendall Correlation</a>. Kendall correlation is a measure of the pairwise similarity or dissimilarity of two sets of rankings. A value close to 1 indicates near-perfect agreement between the ranking systems, close to -1 indicates near-perfect disagreement, and close to 0 indicates little-to-no relationship between them.</p>

<p>Here we see a correlation of 0.89 between AV and WS, of 0.61 between AV and PER, and of 0.58 between PER and WS. So while AV and WS are much more similar to one another than either is to PER, they’re all similar enough to accept as valid ranking metrics: all three are significantly positively correlated (P&lt;0.001) and clearly succeed at identifying high achievers.</p>
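For intuition, Kendall correlation just compares every pair of items across two rankings. A bare-bones tau-a sketch (no tie correction; practical implementations, such as tau-b, handle ties):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a over two equal-length rankings: the share of
    concordant pairs minus the share of discordant pairs."""
    pairs = list(combinations(range(len(x)), 2))
    concordant = sum((x[i] - x[j]) * (y[i] - y[j]) > 0 for i, j in pairs)
    discordant = sum((x[i] - x[j]) * (y[i] - y[j]) < 0 for i, j in pairs)
    return (concordant - discordant) / len(pairs)
```

Identical rankings give 1.0, fully reversed rankings give -1.0, and unrelated rankings land near 0.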

<p>With this in mind, we should now feel comfortable accepting that these are useful metrics that tell similar but non-identical stories of player value. Taken holistically, these should be all we need to evaluate the values of positions in the draft.</p>

<h2 id="whats-the-relationship-between-draft-position-and-pick-value">What’s the Relationship Between Draft Position and Pick Value?</h2>

<p>Given each player’s draft position, AV, WS and PER, the first step is to see how many points of each of these metrics can be expected from each draft position.</p>

<p align="center">
    <img src="/images/fulls/initial_scatter_final.png" alt="" />
    AV, WS, and PER by Pick Number, with Means Shown in Red
</p>

<p>The above plot shows two of the things we’d expect: a downward trend in player value as the draft progresses, and a relatively high variance in player value at each pick. The relationship between pick number and each of these player-ability metrics is highly significant, with each having an absolute Pearson correlation of &gt;0.41 with P&lt;0.001. What’s more interesting, however, is that each of these three relationships appears to be nonlinear in a similar way, indicating a potentially complex relationship between tanking and draft value. Also worth noting is that many players selected within the last 10 picks of their respective draft classes had negative PER ratings, which aren’t shown on the above figure but can be inferred by the average (red) falling below the visible points on the scatter plot.</p>

<p>We’ll need to take a step beyond these simple averages in order to give this relationship a more believable functional form. It defies logic to say that the 8th pick, for example, is worth 3 more win shares than the 7th pick. For this reason I fit a fifth-order polynomial to these data, smoothing the averages so that pick values decrease steadily with draft position.</p>
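To illustrate the smoothing step, here’s a sketch using synthetic data in place of the real per-pick means (the decay curve and noise level are made up; only the degree-5 polynomial fit mirrors the post):

```python
import numpy as np

rng = np.random.default_rng(0)
picks = np.arange(1, 61)
# Synthetic per-pick mean win shares: a decaying curve plus noise,
# standing in for the real averages computed from the draft data.
mean_ws = 70 * np.exp(-picks / 15) + rng.normal(0, 0.5, picks.size)

# Fit a fifth-order polynomial and evaluate it at every pick number.
coeffs = np.polyfit(picks, mean_ws, deg=5)
smoothed = np.polyval(coeffs, picks)
```

The `smoothed` vector replaces the jagged per-pick averages with a curve that declines across the draft.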

<p align="center">
    <img src="/images/fulls/pick_value_smooth_final.png" alt="" />
    Smoothed Expected Values for Each Pick Number. Error Bar Represents One Standard Deviation in Either Direction. 
</p>

<p>The outputs of these smoothed functions are what I’ll use for the rest of this analysis when referring to pick values. The relationships between WS, AV and pick number resemble exponential decay until these values fall off in the draft’s final picks. Expected PER decreases in a much more linear fashion once outside the top ten picks, falling off similarly to AV and WS at the tail end of the draft.</p>

<p>The error bars in these plots represent standard deviations from the mean, meaning that about 68% of values lie within these ranges if the distributions are roughly normal. For WS and AV the standard deviations decrease in the draft’s later picks, indicating that as players get worse we can become more confident in what their output will be. PER, on the other hand, gets less stable toward the tail end of the draft. This is because PER is measured per minute, meaning that low-minutes players will sometimes have illogical ratings due to small-sample properties. All three measures see higher standard deviations in the draft’s final two picks because the league hasn’t always had 30 teams.</p>

<p>The expected AV, WS, and PER for each draft position is shown below.</p>

<p align="center">
    <img src="/images/fulls/allvals_before_final.png" alt="" />
    <img src="/images/fulls/allvals_before_final_22.png" alt="" />
    Table 2: Expected Values and Standard Deviations of Each Pick Position
</p>

<h2 id="so-whats-a-good-draft-day-trade">So What’s a Good Draft Day Trade?</h2>
<p>Before getting into tanking, the above numbers are valuable to teams and worth a closer look. With these values alone we can objectively evaluate the quality of a draft-day trade that involves only draft picks.</p>

<p>Table 3 does exactly that. The first example reads as follows:</p>
<blockquote>
  <p>In 2017 the Kings received picks 15 and 20 from the Blazers in exchange for the 10th pick. The picks received by the Kings were worth an expected 37.19 career win shares and combined 20.54 PER, while the Blazers’ pick was worth an expected 29.86 win shares and 11.70 PER. The Kings came out on top in this trade, netting an additional 7.33 expected win shares and 8.84 points of PER.</p>
</blockquote>

<p align="center">
    <img src="/images/fulls/trades_final.png" alt="" />
    Table 3: WS and PER Gained/Lost in Recent Draft Day Trades Involving Only Draft Picks
</p>

<p>The lesson learned here? Much like Richard Thaler teaches us about <a href="http://www.nber.org/papers/w11270.pdf">NFL draft pick valuations</a>, NBA teams should trade down in the draft far more often than they currently do. The 76ers and Timberwolves appear to be on to this strategy, but the rest of the league has yet to follow suit.</p>

<p>This same analysis can be extended to trades involving players and future draft picks, but it involves a little more work and uncertainty. Some thoughts on how to take into account the three non-draft pick assets that get tossed around on draft day:</p>
<ul>
  <li>Future draft picks: estimate where the team sending the pick will stand in the league rankings the year of the pick. Weight this pick’s value according to the lottery odds of that ranking (more on that soon)</li>
  <li>A player: use their PER or estimate how many win shares they have left in their career</li>
<li>Cash considerations: The way I understand it, these are usually used to keep a team under the salary cap in order to make the trade possible. Other times a team will trade a pick they don’t need for a worse pick plus money. I think you’re generally safe considering cash to be worth 0 WS, 0 AV, 0 PER.</li>
</ul>

<h2 id="how-the-lottery-works">How the Lottery Works</h2>
<p>Under its current rules, the lottery provides the league’s worst 14 teams a chance at winning each of the first three picks. After the first three picks are determined according to randomly generated numbers, each of the following picks is assigned to the worst remaining team. Outside the worst 14 teams, draft position is predetermined according to the inverse of a team’s rank within the league. Each team’s odds of receiving each lottery placement are shown below in highlighted cells, where rows represent inverse rankings (i.e. rankings in terms of how bad a team was) and columns represent pick numbers.</p>

<p align="center">
    <img src="/images/fulls/old_odds.png" alt="" />
    Table 4: Odds of Winning Each Pick by Inverse Record Ranking. Lottery Picks are Highlighted. 
</p>

<p>Outside the lottery (which covers picks 1 - 14), the 15th worst team has a 100% chance of earning the 15th pick, the 16th worst team gets the 16th pick, and so on down the line. Second round picks are unaffected by the lottery, meaning that the worst team gets the first pick of the draft’s second round regardless of whether they won the lottery for the first pick overall. For more detail on this process, <a href="https://en.wikipedia.org/wiki/NBA_draft_lottery">Wikipedia</a> has a nice overview.</p>
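To make these mechanics concrete, the process can be simulated. This is a deliberately simplified sketch: it draws the top three picks by sampling teams without replacement in proportion to their first-pick odds, which only approximates the league’s actual combination-drawing procedure, and the odds used in the example are hypothetical values for a toy five-team lottery:

```python
import random

def run_lottery(first_pick_odds, seed=None):
    """Simplified draft lottery. Teams are indexed worst-first
    (0 = worst record). Picks 1-3 are won by weighted draws; every
    later pick goes to the worst team that hasn't already won one."""
    rng = random.Random(seed)
    teams = list(range(len(first_pick_odds)))
    weights = list(first_pick_odds)
    order = []
    for _ in range(3):  # only the top three picks are up for grabs
        winner = rng.choices(teams, weights=weights, k=1)[0]
        i = teams.index(winner)
        teams.pop(i)
        weights.pop(i)
        order.append(winner)
    order.extend(teams)  # remaining picks in inverse order of record
    return order

# Hypothetical first-pick odds for a toy five-team lottery:
draft_order = run_lottery([0.25, 0.20, 0.16, 0.12, 0.09], seed=42)
```

Each call returns a full draft order: some permutation of the top three slots followed by the leftover teams, worst record first.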

<h2 id="whats-the-draft-pick-value-of-an-end-of-season-ranking">What’s the Draft-Pick Value of an End of Season Ranking?</h2>

<p>Taking the above lottery odds and earlier-discussed valuations of each pick position, it’s easy to combine these in order to place a value on each end-of-season rank in terms of its expected draft-day value.</p>

<p>Starting with Table 4’s lottery odds, multiply the percentages in each column by the value of the corresponding pick number from Table 2. Once pick value and probability have been multiplied together for all 30 teams and 60 picks (most of these values will be zero, since there’s no uncertainty in the assignment of picks 15 - 60), take the sum of these values for each end-of-season ranking to obtain its expected draft-day value in terms of draft picks. Stated a different way, this is simply the dot product of the (ranking x pick probability) matrix and the vector of pick valuations.</p>

<p>Two examples to illustrate this:</p>
<blockquote>
  <p>Worst team in the league: 0.250(69.28) + 0.215(61.53) + 0.178(54.77) + 0.357(48.91) + 1.000(10.43) = 68.19 expected career win shares</p>
</blockquote>

<blockquote>
  <p>Tenth worst team: 0.011(69.28) + 0.013(61.53) + 0.016(54.77) + 0.870(27.56) + 0.089(25.61) + 0.002(23.97) + 1.000(6.29) = 35.03 expected career win shares</p>
</blockquote>
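Numerically, the worst-team example is just a dot product between a probability vector and a pick-value vector. A quick check in Python, using the old lottery’s published odds for the worst team (35.7% at the fourth pick) and the pick values quoted above:

```python
import numpy as np

# Worst team under the old lottery: 25.0% / 21.5% / 17.8% at picks 1-3,
# 35.7% at pick 4, plus a guaranteed pick 31 (first pick of round two).
probs  = np.array([0.250, 0.215, 0.178, 0.357, 1.000])
values = np.array([69.28, 61.53, 54.77, 48.91, 10.43])  # expected career WS

expected_ws = probs @ values
print(round(expected_ws, 2))  # → 68.19
```

Repeating this for every row of the odds table yields the full set of end-of-season ranking values.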

<p>Below are the expected values of all 30 end-of-season rankings in terms of win shares, approximate value, and points of player efficiency rating. As we’d expect, these show similar patterns to the relationships between pick number and value. The two main differences are that these curves are shifted vertically and compressed, since the values now include both first and second round picks, and that they’re flatter, since each of the bottom 14 ranked teams has a chance at the top three picks. The exact values from these plots are shown in the next section in Table 5.</p>

<p align="center">
    <img src="/images/fulls/rank_val_final.png" alt="" />
    Expected AV, WS, and PER of Each End-of-Season Ranking
</p>

<h2 id="whats-the-value-of-tanking-one-position">What’s the Value of Tanking One Position?</h2>
<p>With the value of each end-of-season ranking established, we’re finally able to face the question this analysis set out to answer: what’s to gain from tanking? Should a team play anti-competitively in order to improve its lottery odds?</p>

<p>The value of tanking one position in the league standings is the difference between the value of a team’s current ranking and the value of falling one place closer to the bottom of the league. These values come from the weighting of draft pick odds and pick values discussed in the previous section. Each season-end ranking’s value is shown alongside its “tanking value” in Table 5, where a tanking value is defined as the difference between one end-of-season ranking’s expected draft-day value and that of the next worse ranking. These values are positive for all rankings and for all three measures of value; no tanking value is assigned to the worst team in the league, since there’s nowhere left to fall for a team that’s already hit rock bottom.</p>

<p align="center">
    <img src="/images/fulls/pre_table_final.png" alt="" />
    Table 5: Expected Value and Tanking Opportunity of Each Inverse Team Ranking
</p>

<p>Ranking these positions by how much a team would gain by tanking one additional spot shows an interesting relationship. For all three measures of value, the worst non-lottery teams have the least to gain by tanking. It’s also interesting to see that teams expected to be placed in the middle of the lottery have more to gain than those closest to the league’s worst ranking. If we discard playoff-eligible teams from these rankings, the sweet spot for tanking is ranks five through eight. It’s valuable to tank for teams of all ranks, of course, but these mid-lottery rankings are where we see that teams have the most to gain from losing a few additional games.</p>

<p>To understand this relationship better, let’s plot tanking value against inverse league rank. Again, by “inverse rank” I’m referring to the opposite of a team’s place in the league standings, where the league’s worst team has an inverse rank of one, the second worst team has an inverse rank of two, and so on.</p>

<p align="center">
    <img src="/images/fulls/tank_bar_final.png" alt="" />
    Draft-Pick Value of Falling One Position in the Rankings
</p>

<p>After seeing this visually, a few additional things become apparent about tanking. First, there’s a significant increase in tanking value for the league’s 15th-worst team. This is because falling from 15th to 14th worst enters a team into the lottery, giving it a chance at the much-more-valuable first, second, or third pick, as well as a guarantee of the slightly-more-valuable 14th pick if the better options don’t pan out. Second, the power-law nature of the value of draft picks also shows through here. Tanking becomes increasingly valuable as rank improves for the league’s best teams, since a disproportionate amount of the draft’s value tends to come from the first half of the first round.</p>

<p>Since no sensible team in the playoff hunt is going to throw wins for a better draft position, we can safely ignore the right-hand side of these plots. What’s most important is that the relationship between ranking and expected draft-pick value is nonlinear in a way that gives the league’s fifth through eighth worst teams the most to gain from losing games down the stretch.</p>

<p>Turning to the real-world value of falling in the rankings, however, these values individually aren’t shocking. Falling one position in the league rankings is worth an additional 0.62 points of PER, 4.35 career win shares, and 3.11 approximate value points <strong>at best.</strong> This is hardly going to turn around a franchise.</p>

<p>A formal long-term strategy of tanking, on the other hand, might be a different story. Imagine a team whose non-tanking league ranking is seventh worst. If this team were to adopt a policy of end-of-year tanking for just two seasons in a row, falling three spots each year, they would gain an impressive 3.53 points of PER, 26.21 career win shares, and 18.28 points of AV. Stretch this out to a third year of designed awfulness and you have 5.29 PER, 39.32 WS, and 27.42 AV, all in addition to the already-high expected values of their mid-lottery odds before tanking. Suddenly this looks valuable, and it’s not far off from what the 76ers have been <a href="https://www.nytimes.com/2014/12/05/sports/basketball/philadelphia-76ers-take-the-low-road-to-the-top.html">accused of doing</a> in recent seasons.</p>
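The multi-season arithmetic here can be made explicit with a small helper that sums the one-spot tanking deltas a team collects each year. The per-rank values below are hypothetical placeholders for illustration, not the actual numbers from Table 5:

```python
def cumulative_gain(start_rank, spots_per_year, years, delta_by_rank):
    """Total draft-pick value gained by a team whose natural inverse rank is
    `start_rank` and that tanks `spots_per_year` spots in each of `years` seasons."""
    total = 0.0
    for _ in range(years):
        rank = start_rank                 # each season restarts from the natural rank
        for _ in range(spots_per_year):
            total += delta_by_rank[rank]  # value of falling one spot from `rank`
            rank -= 1                     # one spot closer to the league's worst record
    return total

# Hypothetical one-spot PER deltas for inverse ranks 7, 6, and 5
# (placeholders; the real numbers come from Table 5)
deltas = {7: 0.62, 6: 0.60, 5: 0.58}
two_year_gain = cumulative_gain(start_rank=7, spots_per_year=3, years=2,
                                delta_by_rank=deltas)
```

Swapping in the real per-rank deltas for PER, win shares, or AV reproduces the two- and three-year totals quoted above.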

<h2 id="how-will-the-new-lottery-odds-impact-tanking">How will the New Lottery Odds Impact Tanking?</h2>

<p>The NBA is aware of this strategy, and in September 2017 came to an agreement with the Board of Governors to change the lottery’s odds in a way that disincentivizes the practice of tanking. The first major change is that, starting in 2019, the league’s three worst teams will have identical odds of winning the first pick, rather than the current odds, which favor worse records. The second change is that lottery teams will be able to fall further down in the draft than is possible under the current rules. Where the worst possible outcome for the league’s worst team is currently the fourth overall pick, starting in 2019 they will be able to fall as far as fifth.</p>

<p align="center">
    <img src="/images/fulls/new_odds.png" alt="" />
    Table 6: Lottery Odds Effective 2019
</p>

<p>So the rewards for being among the league’s absolute-worst will decrease, and the worst-case scenario for these same teams also becomes slightly less desirable. Will this be enough to stop teams from tanking? Let’s see how severely these changes affect each end-of-season ranking and tanking value.</p>

<p align="center">
    <img src="/images/fulls/tv_change_2019.png" alt="" />
    Table 7: Tanking Values under Current and Future Lottery Odds
</p>

<p>Table 7 shows that the values of each end-of-season ranking will remain similar in magnitude after the new lottery rules go into effect. The values of tanking for the worst nine teams, however, will decrease, while these values increase for the five best lottery teams (ranks 10 - 14). Non-lottery teams’ odds are unaffected, as there’s no uncertainty surrounding their draft positions.</p>

<p align="center">
    <img src="/images/fulls/rank_comps_final.png" alt="" />
    Value of Tanking Before and After Lottery Change
</p>

<p>Plotting these new values against the current ones, it’s clear that the relationship between season-end ranking and draft value is visibly flatter under the new lottery. For those taking the anti-tanking stance, this is a good thing. For those who believe that the league’s worst teams deserve better lottery odds in order to ascend from mediocrity, this is a bad thing.</p>

<p align="center">
    <img src="/images/fulls/tank_bar_comps_final.png" alt="" />
    Tanking Values Before and After Rule Change
</p>

<p>Plotting tanking values before and after this change shows a similar story. The new odds make it less valuable for inverse ranks 1 - 9 to lose games, more valuable for ranks 10 - 14, and leave ranks 15 - 30 untouched.</p>

<p>A final question worth answering is how this all nets out. At the league level, will it be less valuable to tank than under the current rules? The answer to this is a simple “yes.” The sum of the differences in tanking value for each rank between the old and new lottery system is negative in terms of AV, WS, and PER. Specifically, the new lottery odds result in 0.64, 5.98, and 4.10 point drops in available PER, WS, and AV to be gained from tanking respectively. Tanking will still be valuable to the league’s worst teams, but the incentive to do so has been ever-so-slightly reduced, and will be shifted slightly toward teams that are still in the playoff hunt (the 10th - 14th worst teams). As a result, one can expect a slightly more competitive league going forward.</p>]]></content><author><name>James LeDoux</name><email>ledoux.james.r@gmail.com</email></author><category term="projects" /><summary type="html"><![CDATA[Tanking becomes a hot topic each season once it becomes apparent which of the NBA's worst teams will be missing the playoffs. In this post I address the value of a draft pick and of tanking in the league's end-of-season rankings, with applications to trade valuation and the impact of the league's recently proposed changes to the draft.]]></summary></entry><entry><title type="html">A Statcast Tribute to Baseball’s Strangest Pitch: the Eephus</title><link href="https://jamesrledoux.com/projects/eephus/" rel="alternate" type="text/html" title="A Statcast Tribute to Baseball’s Strangest Pitch: the Eephus" /><published>2017-11-14T00:00:00+00:00</published><updated>2017-11-14T00:00:00+00:00</updated><id>https://jamesrledoux.com/projects/eephus</id><content type="html" xml:base="https://jamesrledoux.com/projects/eephus/"><![CDATA[<meta property="og:image" content="http://jamesrledoux.com/images/fulls/speed_dist.png" />

<meta property="og:image:type" content="image/png" />

<meta property="og:image:width" content="200" />

<meta property="og:image:height" content="200" />

<meta name="twitter:card" content="summary_large_image" />

<meta name="twitter:site" content="@jmzledoux" />

<meta name="twitter:creator" content="@jmzledoux" />

<meta name="twitter:title" content="A Statcast Tribute to Baseball's Strangest Pitch: the Eephus" />

<meta name="twitter:image" content="http://jamesrledoux.com/images/fulls/speed_dist.png" />

<p>I’ve been borderline obsessed with the eephus for some time now. Every time I see a player pull this pitch out of their arsenal I become equal parts excited and bamboozled: half “I could throw that,” and half “how on earth didn’t he hit that?”</p>

<p>For those who aren’t familiar, here’s a quick description and history of the eephus. In short, an eephus is a blooper pitch: it has a lazy, rec-league style delivery, can arc well above the batter’s head en route to the plate, and tends to travel anywhere from 40 to 70 mph as it leaves the pitcher’s hand. It’s oftentimes difficult to tell whether it was thrown on purpose or if the pitcher temporarily forgot how to throw a baseball.</p>

<p>This pitch is said to have first been thrown by <a href="https://www.baseball-reference.com/players/p/phillbi02.shtml">Bill Phillips</a>, who made the pitch a part of his game <a href="https://en.wikipedia.org/wiki/Eephus_pitch">from 1890 to 1903</a>. The pitch was brought to prominence by <a href="https://www.baseball-reference.com/players/s/sewelri01.shtml">Rip Sewell</a> roughly 40 years later, and has seen sporadic use since. It has gone by a variety of names over the years, including “junk pitch”, “dead fish”, “LaLob”, and, for its high arc, “spaceball” <a href="https://bats.blogs.nytimes.com/2008/07/30/a-brief-history-of-the-eephus-pitch/">(source: A Brief History of the Eephus Pitch - NYTimes)</a>.</p>

<p>Well below the speed of an average changeup, and typically lacking any element of deception as to what’s coming in its delivery, why does anyone throw this bizarre pitch? The prevailing theory is that the comically slow speed of this pitch throws off a batter’s calibration, making the pitches that follow appear blazing fast. In other instances, people speculate that the pitch is simply a mistake, having slipped out of the pitcher’s hand. Regardless, little research has been done to date on this uncommon pitch, and I think it deserves better than that. Thus, this post is going to serve as an exploratory analysis of and tribute to the mythical eephus.</p>

<p>Before going any further in this post, here’s some quick suggested viewing for context on the big league pitch that you could probably throw just as effectively as Clayton Kershaw:</p>

<p><a href="https://www.youtube.com/watch?v=VfWXADedncM"><img src="/images/fulls/ep_compilation.png" alt="Eephus Pitch Compilation" /></a></p>

<p>Now that this pitch has received a sufficient amount of hype, let’s get up close and personal with the eephus and see what it looks like by the numbers. To do this we’ll need data on every eephus that’s been thrown during the Statcast and PITCHf/x eras. For this, I used the <a href="https://github.com/jldbc/pybaseball">pybaseball</a> library to retrieve the Statcast and PITCHf/x data on every Major League pitch that’s been thrown since the 2008 season. Among these 7,212,136 observations, only 2,090 represent eephus pitches. That’s just 0.03 percent - a rare pitch indeed!</p>
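For the curious, the pull-and-filter step can be sketched with pybaseball and pandas. Statcast labels the eephus with the pitch_type code “EP”; the tiny frame below stands in for the full multi-million-row pull so the snippet runs on its own, and the date range in the comment is illustrative:

```python
import pandas as pd

# The real pull uses pybaseball, roughly:
#   from pybaseball import statcast
#   df = statcast(start_dt='2008-03-25', end_dt='2017-11-01')
# A tiny stand-in frame keeps this snippet self-contained:
df = pd.DataFrame({'pitch_type': ['FF', 'EP', 'CH', 'FF', 'KN', 'EP']})

eephus = df[df['pitch_type'] == 'EP']   # 'EP' is Statcast's eephus label
share = 100 * len(eephus) / len(df)
print(f"{len(eephus)} eephuses ({share:.2f}% of pitches)")
```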

<p align="center">
    <img src="/images/fulls/ep_by_year.png" alt="" />
    Eephuses thrown by season
</p>

<p>The eephus saw its Statcast-era golden age in the year 2014, when over 400 were thrown. With the exception of the 2012 - 2015 seasons, it appears most common to see fewer than 200 thrown in a given year. Turning to the list of pitchers who’ve used this pitch, it becomes clear that it’s no coincidence that the 2012 - 2015 spike in eephus use coincided with the era of a healthy R.A. Dickey. This eephus-throwing knuckleballer, in fact, is responsible for more than twice as many eephus pitches as the next-most prolific user of the pitch.</p>

<p align="center">
    <img src="/images/fulls/ep_by_pitcher.png" />
    Eephus count by pitcher, 2008 - 2017
</p>

<p>In recent history, only Dickey, Padilla, Despaigne, and Chen have been prolific enough users of the pitch to have more than 100 in-game examples under their belt. It makes sense that this would be an uncommon pitch for most of those who use it; once the eephus loses its element of surprise, it’s no longer a novel and disorienting pitch, but essentially a Little League World Series-level fastball that any major league batter worth his place on a roster would hit out of the park.</p>

<p>Since data on any particular pitch type is only relevant in the context of other pitches, we’ll first compare the eephus against the closest things it has to peers: the fastball, knuckleball, and changeup.</p>

<p align="center">
    <img src="/images/fulls/eephus_summary.png" />
</p>

<p>The most relevant data point here is speed: the eephus has an average speed of just 64.5mph. That’s 23% slower than the average changeup, and 30% slower than the average fastball. The pitch doesn’t demonstrate the same low spin rate of other purposefully-slow pitches, however, despite slowness being its defining characteristic. While the knuckleball and changeup show spin rates in the 1500s and 1700s, the eephus spins at a lofty 2301 rpm - a solid 100rpm faster than the average fastball. As spin rate is a relatively new metric to have access to, the experts aren’t completely certain what a high or low spin rate means for pitch quality. Early research, however, suggests that <a href="http://m.mlb.com/news/article/212735620/what-statcast-spin-rate-means-for-fastballs/">high spin rate is a good thing</a> for a non-breaking ball.</p>

<p align="center">
    <img src="/images/fulls/statcast_zones.png" /> 
    <br />
    Statcast Zones (source: Baseball Savant)
</p>

<p>The last summary stat shown in the table above is the percentage of each pitch type that’s placed down the middle of the strike zone, along its edges, and outside. Here I use the Statcast zones shown above, defining “down the middle” as being in zone 5, “edge of strike zone” as zones 1, 2, 3, 4, 6, 7, 8, and 9, and “outside strike zone” as zones 11 through 14. At a high level, the farther pitches tend to be placed from the middle of the strike zone, the more likely it is that pitchers are using this pitch for strategic reasons and the less likely it is that a pitcher is confident in the pitch’s ability to get past a batter without being expertly placed. Here we see about what we’d expect. Fastballs are placed within the strike zone relatively more often than the slow-speed changeup and eephus, with the eephus being thrown outside the strike zone two percentage points more often than the changeup and 12 percentage points more often than the fastball. This makes intuitive sense, since one can imagine that a well-prepared power hitter could do some damage to a 60mph pitch thrown down the middle. Due to the eephus’ high arc, it may be challenging to place accurately as well, which would also contribute to how often it lands outside the strike zone.</p>
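Stated as code, the bucketing used here (following this post’s own zone definitions, where 1 - 9 are in-zone cells and 11 - 14 are out-of-zone regions) is:

```python
def zone_bucket(zone: int) -> str:
    """Collapse a Statcast `zone` value into the three placement buckets."""
    if zone == 5:
        return "down the middle"
    if 1 <= zone <= 9:              # zones 1-4 and 6-9 ring the center cell
        return "edge of strike zone"
    if 11 <= zone <= 14:            # the four out-of-zone regions
        return "outside strike zone"
    raise ValueError(f"unexpected zone: {zone}")
```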

<p align="center">
  <!--figure-->
      <img src="/images/fulls/ep_position.png" width="45%" height="45%" />
      <img src="/images/fulls/fb_position.png" width="45%" height="45%" /> 
      <!--figcaption align="center">Eephus (L) and Fastball (R) Placement from Batter's View</figcaption-->
  <!--/figure-->
  <br />
  Eephus (L) and Fastball (R) Placement from Batter's View
</p>

<p>The above figure shows this same idea in slightly more detail. While the sample size is much smaller for the eephus than the fastball, it’s clear that eephus pitchers make a concerted effort to keep this pitch well out of reach, at the expense of it often having no chance of entering the strike zone.</p>

<p>While summary stats are useful, a simple average never tells the full story. To better understand baseball’s slowest pitch, let’s take a look at how its release speeds are distributed relative to these other pitches.</p>

<p align="center">
    <img src="/images/fulls/speed_dist.png" alt="" />
</p>
<p>From this figure we can see that the eephus’ slowness is even more pronounced than one may have thought! In fact, if we throw out the fastest 1% of eephus pitches, outliers that appear to have been misclassified, we see that the remaining 99% of recorded eephus pitches are slower than 97% of recorded changeups. So while there is <em>some</em> overlap between the two pitches in terms of speed, the eephus is essentially in a league of its own in terms of slowness.</p>
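The trimming step described here is a one-liner with NumPy percentiles. The speed samples below are synthetic stand-ins drawn around the averages reported earlier, not the real Statcast arrays:

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic speed samples (mph), not the real Statcast pull
eephus_speeds = rng.normal(64.5, 5.0, 2_000)
changeup_speeds = rng.normal(84.0, 3.0, 50_000)

# Drop the fastest 1% of eephuses (likely misclassified pitches), then ask
# what share of changeups the remaining eephuses are all slower than
cutoff = np.percentile(eephus_speeds, 99)
trimmed = eephus_speeds[eephus_speeds <= cutoff]
share_of_changeups_faster = (changeup_speeds > trimmed.max()).mean()
```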

<p>The speed gap between the eephus and the fastball is even more pronounced. One can imagine how disorienting it would be to see an eephus float by after a 95mph fastball, or how blazingly fast this same fastball would appear after a 60mph eephus. As a side note, the bi-modality of knuckleball speeds suggests that Statcast may be misclassifying some of these pitches as knuckleballs when they’re actually eephuses. Since there’s no accurate way of saying which declared-knuckleballs are actually eephuses, however, we’ll have to leave those pitches be.</p>

<p>This brings us to a more practical question: <em>does the eephus actually work?</em> The most salient argument for its use is the one alluded to earlier: the extreme speed differential between an eephus and any other pitch both catches batters off guard for the eephus itself, and makes a non-eephus follow-up pitch appear faster and harder to track. But does this theory hold up in practice? Let’s examine the effectiveness of the eephus vs. a few more common pitches, and then test whether an eephus actually makes the following pitch harder to hit.</p>

<p>For examining the effectiveness of the eephus vs. all other pitches, the following five metrics provide a nice overview of how batters fare against it: contact percentage, hit percentage, launch angle, exit velocity, and barrel percent. These metrics collectively represent how hittable the pitch is, how high quality a batter’s contact with an eephus tends to be, and whether people hit the eephus for power or for contact.</p>

<p align="center">
    <img src="/images/fulls/hitability_stats.png" alt="" />
</p>

<p>First, perhaps surprisingly, batters make contact with this pitch about as often as every other pitch, making contact with the eephus just 0.33 percentage points more often than an average pitch. The quality of this contact, however, tends to be lower. Despite making contact with this slightly more often, for example, it becomes a hit almost 11% less often. A second way of looking at this is that its barrel percent, measured as the percentage of eephus pitches with an expected batting average of above 0.500 based on the ball’s speed and angle off the bat, is a tenth of a percentage point lower for eephus pitches, amounting to a 2% drop. This isn’t a large decrease, but paired with the pitch’s higher contact percent and lower hit percent, it paints a picture of frequent but low-quality contact.</p>

<p align="center">
  <!--figure-->
      <img src="/images/fulls/speeds.png" width="45%" height="45%" />
      <img src="/images/fulls/angles.png" width="45%" height="45%" /> 
      <!--figcaption align="center">Eephus (L) and Fastball (R) Placement from Batter's View</figcaption-->
  <!--/figure-->
</p>

<p>Barrel percent is calculated using the ball’s exit velocity and launch angle off the bat, but these factors can be examined in isolation as well to better understand what type of contact is being made. Here both the average and distribution of these metrics show that batters’ launch angles are about the same for an eephus vs. non-eephus pitch, but the speed of the ball off their bat is slower. This is reflected by the ball’s average exit velocity being 4.29mph slower and the distribution of this metric being shifted noticeably toward the slower side for the eephus vs. every other pitch.</p>

<p>Now that we’ve established that the eephus itself may have the desirable quality of drawing out low-quality contact, let’s return to the theory posed earlier: is a fastball harder to hit if it’s thrown after an eephus? Do pitchers strategically throw fastballs more frequently after an eephus? These same questions could be posed for pitch types other than the fastball, but if this effect exists, this is where we’d expect it to be most pronounced, so we’ll leave the other pitches out for now. The answer to the first of these questions is a definitive “not really.” An average batter makes contact with 19.18% of fastballs thrown. When the previous pitch was an eephus, this contact percentage actually increases to 22.60%. Further, this contact tends to be high quality contact. 8.49% of eephus-preceded fastballs turned into hits, while this number is only 6.26% on average. Measuring barrels shares a similar story, where a near-average 5.4% of fastballs are barreled on average, but a much-higher 6.4% are barreled when the previous pitch was an eephus. It’s difficult to make a strong claim about the impact of an eephus on a follow-up fastball, however, due to sample size constraints. 703 post-eephus fastballs have been thrown during the PITCHf/x and Statcast eras, and only 203 of these happened since barrels became measurable in 2015. This is hardly enough data to trust these particular numbers out of sample. It appears from this analysis, however, that a fastball thrown after an eephus performs either identically or slightly better than an identical fastball under other circumstances. Based on these results, I would take any claim that a fastball is extra hard to hit after an eephus pitch with a grain of salt.</p>
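The “previous pitch” comparison above can be set up in pandas with a grouped shift. The toy pitch log below stands in for the real Statcast pull (and its `contact` flag is a made-up simplification of Statcast’s `description` field), but the grouping logic is the same:

```python
import pandas as pd

# Toy pitch log in plate-appearance order; the real frame is the Statcast pull.
# `contact` here is a hypothetical 0/1 simplification of the `description` field.
df = pd.DataFrame({
    'game_pk':       [1, 1, 1, 1, 2, 2],
    'at_bat_number': [1, 1, 1, 2, 1, 1],
    'pitch_type':    ['EP', 'FF', 'FF', 'FF', 'EP', 'FF'],
    'contact':       [0, 1, 0, 1, 0, 1],
})

# Tag each pitch with the previous pitch thrown in the same plate appearance
df['prev_pitch'] = df.groupby(['game_pk', 'at_bat_number'])['pitch_type'].shift(1)

post_eephus_fb = df[(df['pitch_type'] == 'FF') & (df['prev_pitch'] == 'EP')]
all_fb = df[df['pitch_type'] == 'FF']
print(post_eephus_fb['contact'].mean(), all_fb['contact'].mean())
```

Because the shift is grouped by plate appearance, the first pitch of each at-bat gets no previous-pitch label, so eephuses never “carry over” across batters.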

<p>The second of these questions is easier to answer. While approximately 64% of major league pitches are fastballs,  only 47% of eephuses whose plate appearance contained a follow-up pitch were followed by a fastball. Even if we remove eephus-throwing knuckleballer R.A. Dickey from this data, the number is still below average at 61%. It looks like non-knuckleball pitchers throw fastballs at approximately their normal frequency after eephus pitches, and that R.A. Dickey steers away from the post-eephus fastball almost entirely. Perhaps this means that pitchers already understand that the extra-fast-looking post-eephus fastball is only a myth.</p>

<p>Since the eephus doesn’t appear to be any better than a fastball as an isolated pitch, and we’ve also debunked the theory that a fastball is more deadly when thrown after an eephus, is there any reason to consider using this pitch? Perhaps. Examining the on base percentage (OBP) of plate appearances where the eephus was featured, and comparing this to the OBP of non-eephus plate appearances, we do see a slight decrease when the eephus is used. An eephus-containing at-bat sees the batter get on base 30.8% of the time, whereas an average plate appearance has a slightly higher OBP of 31.9%. A difference of more than an entire percentage point is larger than I would have expected here, and suggests that something about this rare pitch may, indeed, work in a pitcher’s favor.</p>

<p>Despite its incredibly slow speed, the eephus pitch manages to hold its own. Batters have trouble making high quality contact with the pitch, and in general get on base less often when the pitch is utilized in a plate appearance. That said, analyzing a rare pitch inevitably means working with small sample sizes, which means that it’s hard to gain many deep insights into this pitch beyond some simple summary stats. A word of caution, however: a pitcher should always be careful not to throw this “surprise” pitch twice in a row, lest they end up like poor Orlando Hernandez.</p>

<p><a href="https://www.youtube.com/watch?v=uW0V6OsxDBo"><img src="/images/fulls/arod_eephus.png" alt="A-Rod vs. the eephus" /></a></p>]]></content><author><name>James LeDoux</name><email>ledoux.james.r@gmail.com</email></author><category term="projects" /><summary type="html"><![CDATA[I've been borderline obsessed with the eephus pitch for some time now. Every time I see a player pull this pitch out of their arsenal I become equal parts excited and bamboozled. Startlingly little research has been done to date on this uncommon pitch, and thus, this post is going to serve as an exploratory analysis of and tribute to the mythical eephus.]]></summary></entry><entry><title type="html">Leaving MLB: Lessons Learned in my First Data Science Role</title><link href="https://jamesrledoux.com/data-science/mlb-stats/" rel="alternate" type="text/html" title="Leaving MLB: Lessons Learned in my First Data Science Role" /><published>2017-08-14T00:00:00+00:00</published><updated>2017-08-14T00:00:00+00:00</updated><id>https://jamesrledoux.com/data-science/mlb-stats</id><content type="html" xml:base="https://jamesrledoux.com/data-science/mlb-stats/"><![CDATA[<p>For the past three months I have had the exciting opportunity to intern as a data scientist at Major League Baseball Advanced Media, the technology arm of MLB. During this time I’ve built models and analyzed data relating to user-facing products, sales teams, and the sport itself. While it would be impossible to detail everything I’ve learned and worked on in a single post, this will serve as a brief overview of the experience.</p>
<p align="center">
    <img src="/images/fulls/bam.png" alt="" />
</p>

<h1 id="working-with-the-clubs-customer-lead-scoring">Working with the Clubs: Customer Lead Scoring</h1>

<p>A core component of my team’s work centered around building lead scoring models for MLB’s 30 teams. The problem, in brief, is this: given a fan’s history of past purchases, games attended, and interaction with a team’s digital media, how likely are they to either upgrade or renew their ticket package for next season? Baseball fans are a diverse group of customers, and there is no one-size-fits-all solution to understanding what drives a team’s fan base’s purchasing habits. For this reason, each team requires its own specially tailored lead scoring model.</p>

<p>The lead scoring models themselves are relatively simple; usually logistic regression models with a small number of variables. The real challenges in building these models come from simultaneously optimizing lift and accuracy (which are not always positively correlated) while maintaining interpretability, which is of high importance to the teams that rely on the models’ outputs.</p>
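As a rough illustration (not MLB’s actual model or features), a minimal lead-scoring workflow with scikit-learn might look like the sketch below; the five feature columns and the renewal labels are synthetic stand-ins:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Columns stand in for hypothetical signal families: recency, duration,
# quantity, engagement, and purchase source (all synthetic)
X = rng.random((500, 5))
y = (X[:, 1] + X[:, 2] > 1.0).astype(int)  # toy renewal label

model = LogisticRegression(max_iter=1000).fit(X, y)
scores = model.predict_proba(X)[:, 1]   # renewal likelihood per fan
call_list = np.argsort(scores)[::-1]    # highest-probability leads first
```

The coefficients of a model this small remain directly readable, which is what makes logistic regression attractive when a sales team needs to understand why a lead was ranked highly.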

<p>In the end, there were a few traits of successful lead scoring models that proved consistent across teams. Of the thousands of variables considered, there was almost always one that would stick into the final model representing each of the following categories: purchaser recency, duration, quantity, engagement level, and source. This makes intuitive sense, since these five categories effectively explain the most important aspects of a fan’s relationship with a team. If we know how recently somebody has purchased, we have an idea as to how active they are. Knowing how long they have been a customer tells us something about loyalty. The number of games attended is as direct a signal as possible for a fan’s interest level in attending future games. Engagement level, measured through interaction with email marketing campaigns, measures something similar to games purchased, but through a different channel. And last, the source of a customer’s purchases (i.e. buying directly from the club or on the secondary market) tends to have an impact on future purchasing habits as well. With one variable from each of these categories placed into a model, the other couple-thousand often become redundant.</p>

<p>The end result of this process is something I find exciting. Given one of these models, a team can understand who’s most likely to make a purchase before they make a call. The lift generated by these models makes for a tactical, data-driven sales strategy that allows a club to better understand its fans and make the most of its resources. The signals used in these models, I imagine, apply not only to the pro sports industry, but to consumer sales as a field in general.</p>

<h1 id="streaming-media-mlbtv-churn-prediction">Streaming Media: MLB.TV Churn Prediction</h1>

<p>Another type of modeling my team worked with was churn prediction for the MLB.TV streaming product. Churn prediction works similarly to lead scoring at the surface level, in the sense that both are binary classification problems, but there are a few interesting nuances to working with churn that warrant a different approach.</p>

<p>Being a consumer-facing digital product, the most important features for determining whether somebody will churn come from their activity level. The users churning are almost never the active ones using the app multiple times per week. Domain-specific indicators also come into play. Being a baseball-streaming product, for example, the performance of a fan’s favorite team can play a role in their overall satisfaction with the product. Only a superfan, after all, will pay to watch their team lose day in and day out.</p>

<p align="center">
    <img src="/images/fulls/mlbtv.jpg" alt="" />
</p>
<p align="center">
    <em align="center">MLB.TV (image: mlb.com)</em>
</p>
<p>With churn modeling, however, there is no external client. This means that the model is allowed to be more flexible, since a computer doesn’t care about model interpretability. This gave me the opportunity to experiment with the idea of replacing our existing models with something more flexible. For this task, I fit an ensemble of a random forest classifier, gradient boosted trees (XGBoost), and a LASSO logistic regression model, resulting in a moderate performance boost.</p>
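A hedged sketch of such an ensemble with scikit-learn is below, substituting `GradientBoostingClassifier` for XGBoost to keep it dependency-free and using an L1 penalty for the LASSO-style logistic model; the churn data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression

# Synthetic churn data; real features would be activity-level signals
X, y = make_classification(n_samples=400, n_features=10, random_state=1)

# GradientBoostingClassifier stands in for XGBoost here; penalty='l1'
# gives the LASSO-style logistic model
ensemble = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100, random_state=1)),
        ('gbm', GradientBoostingClassifier(random_state=1)),
        ('lasso', LogisticRegression(penalty='l1', solver='liblinear')),
    ],
    voting='soft',  # average the three models' predicted probabilities
).fit(X, y)
churn_prob = ensemble.predict_proba(X)[:, 1]
```

Soft voting averages the three models’ probability estimates, which tends to smooth out the individual learners’ errors when their mistakes are uncorrelated.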

<h1 id="predicting-the-lifetime-value-of-a-fan">Predicting the Lifetime Value of a Fan</h1>
<p>A third project I worked on was an attempt to predict the lifetime value of a ticket purchaser. This effort cuts right at the heart of the challenges facing any data science effort to model the habits of a diverse customer base. For this domain in particular, useful signals can be elusive. An individual buyer’s purchasing habits can vary significantly year-over-year, and levels of activity range from passive one-game-per-year purchasers to superfans spending thousands of dollars. To add an additional layer of complexity, MLB’s fans come from vastly different markets. We would expect the behavior of a Kansas City fan, for example, to differ from that of a New York fan.</p>

<p>All of this results in a situation where a single linear model is insufficient. For tasks as complicated as this one, the out-of-the-box solutions taught in academia do not always play nicely with the problem at hand. The solution, of course, is to try a few different approaches and see what works. The answer could lie in fitting a single model per club, one model per audience segment, or in something entirely different. At the very least, I believe a tree-based approach makes the most sense here, since this would be an effective learner of the team and segment-based interactions that would be challenging to hand-craft otherwise. This task remains a work in progress, however, so I have no perfect solution to offer.</p>

<h1 id="sabermetrics-and-improving-statcast">Sabermetrics and Improving Statcast</h1>
<p>The last area I had a chance to work on is what I would broadly characterize as baseball-facing work. Working with our in-house baseball researchers, I helped to automate some of our anomaly detection reporting, taking previously-manual Excel work and re-writing it in SQL and Python. Parts of this work involved the fitting and evaluation of <a href="https://stats.idre.ucla.edu/r/dae/mixed-effects-logistic-regression/">mixed effect logistic regression</a> models. These models were new to me at the time and interesting to learn about. Other ad-hoc requests in this domain would come on occasion, usually involving some sort of data analysis regarding Statcast data on the relationships between pitch speed, type, and location and the outcome of a pitch or plate appearance.</p>

<h1 id="lessons-learned">Lessons Learned</h1>
<p>To wrap this up, here are the top pieces of advice I would offer somebody entering their first data science job:</p>

<ol>
  <li>
    <p>Most tasks don’t require machine learning. Penalized (LASSO or Ridge) logistic regression and ordinary least squares with simple feature engineering will usually do the trick.</p>
  </li>
  <li>
    <p>Some tasks <strong>do</strong> necessitate machine learning. Whenever you fit a simple model, try a random forest or GBM alongside it just to be sure you aren’t oversimplifying the problem with a linear model.</p>
  </li>
  <li>
    <p>Knowing the difference between a #1 and a #2 problem is a crucial part of real-world data science. This is part of understanding your client’s needs and your problem domain.</p>
  </li>
  <li>
    <p>Data science is more than just building models; most of the work happens before the model is trained. You will spend hours retrieving, joining, aggregating, sanity-checking, and exploring your data.</p>
  </li>
  <li>
    <p>Real world data is full of surprises. Distributions change over time, features are sparse, and variables don’t always mean what you think they do. Always profile your data to avoid mistakes early on.</p>
  </li>
</ol>]]></content><author><name>James LeDoux</name><email>ledoux.james.r@gmail.com</email></author><category term="data-science" /><summary type="html"><![CDATA[For the past three months I have had the exciting opportunity to intern as a data scientist at Major League Baseball Advanced Media, the technology arm of MLB. This post gives an overview of what I've been working on and the advice I would give a fellow first-time data scientist on their first day on the job.]]></summary></entry><entry><title type="html">Introducing pybaseball: an Open Source Package for Baseball Data Analysis</title><link href="https://jamesrledoux.com/projects/open-source/introducing-pybaseball/" rel="alternate" type="text/html" title="Introducing pybaseball: an Open Source Package for Baseball Data Analysis" /><published>2017-07-27T00:00:00+00:00</published><updated>2017-07-27T00:00:00+00:00</updated><id>https://jamesrledoux.com/projects/open-source/introducing-pybaseball</id><content type="html" xml:base="https://jamesrledoux.com/projects/open-source/introducing-pybaseball/"><![CDATA[<p>Baseball and statistics go together like peanut butter and jelly; it’s almost hard to imagine following one without involving the other. In recent years, the data that make this game enjoyable for so many have only gotten better. With the introduction of the <a href="http://m.mlb.com/statcast/leaderboard" target="_blank">Statcast</a> system for measuring sci-fi-sounding statistics such as the spin speed of a thrown ball and its launch angle off a player’s bat, I believe we are at the beginning of an exciting new era for baseball statistics.</p>

<p>Unfortunately for Python-loving statisticians like myself, there has historically been no tool for bringing these data into one’s work without a painful process of manual curation. For this reason, I’m releasing pybaseball: an open source package for 21st century baseball data analysis in Python.</p>

<p align="center">
    <img src="/images/fulls/judge-derby-mlbdotcom.jpg" alt="" />
</p>
<p align="center">
    <em align="center">How does Aaron Judge hit baseballs into the stratosphere? Pybaseball and Statcast can help you find out (image: mlb.com)</em>
</p>

<h1 id="what-it-does">What It Does</h1>
<p>Pybaseball takes the pain out of collecting and cleaning baseball data from the internet. In short, I scraped <a href="https://baseballsavant.mlb.com/statcast_leaderboard" target="_blank">Baseball Savant</a>, <a href="http://www.fangraphs.com/leaders.aspx?pos=all&amp;stats=pit&amp;lg=all&amp;qual=y&amp;type=8&amp;season=2017&amp;month=0&amp;season1=2017&amp;ind=0" target="_blank">FanGraphs</a>, and <a href="https://www.baseball-reference.com/" target="_blank">Baseball Reference</a> so you don’t have to. Currently, this means that you can retrieve pitch, season, and game-level data on individual players and teams, historic schedule and record data, and division standings with simple, Pythonic one-liners. The stats that this library provides range from the classics (BA, RBI, HR, W, L, K, IP), to the slightly more sophisticated (OBP, SLG, WHIP, WAR), to what would have sounded like science fiction a few short years ago (exit velocity, spin speed, pitch x, y, and z coordinates). The goal of this library is to provide the data necessary to answer any baseball research question.</p>

<h1 id="how-it-works">How it Works</h1>
<p>The <code class="language-plaintext highlighter-rouge">statcast</code> function returns pitch-level Statcast data for a given date range.</p>

<pre>
  <code class="python">
&gt;&gt;&gt; from pybaseball import statcast 
&gt;&gt;&gt; data = statcast(start_dt='2017-06-15', end_dt='2017-06-28')
&gt;&gt;&gt; data.head(2)  
   index pitch_type  game_date  release_speed  release_pos_x  release_pos_z  
0    314         CU 2017-06-27           79.7        -1.3441         5.4075
1    332         FF 2017-06-27           98.1        -1.3547         5.4196

  player_name    batter   pitcher     events     ...      release_pos_y  
0   Matt Bush  608070.0  456713.0  field_out     ...            54.8585
1   Matt Bush  429665.0  456713.0  field_out     ...            54.3470

   estimated_ba_using_speedangle  estimated_woba_using_speedangle  woba_value  
0                          0.100                            0.137         0.0
1                          0.269                            0.258         0.0

   woba_denom babip_value iso_value launch_speed_angle at_bat_number pitch_number  
0         1.0         0.0       0.0                3.0          64.0          1.0
1         1.0         0.0       0.0                3.0          63.0          3.0  
[2 rows x 79 columns]
  </code>
</pre>

<p>Similarly, if you want player-level stats aggregated by season, you can pull 299 different features per player per season from FanGraphs using the <code class="language-plaintext highlighter-rouge">pitching_stats</code> and <code class="language-plaintext highlighter-rouge">batting_stats</code> functions.</p>

<pre>
  <code class="python">
&gt;&gt;&gt; from pybaseball import pitching_stats
&gt;&gt;&gt; data = pitching_stats(2012, 2016)
&gt;&gt;&gt; data.head()
     Season             Name     Team   Age     W    L   ERA  WAR     G    GS  
336  2015.0  Clayton Kershaw  Dodgers  27.0  16.0  7.0  2.13  8.6  33.0  33.0
236  2014.0  Clayton Kershaw  Dodgers  26.0  21.0  3.0  1.77  7.6  27.0  27.0
472  2014.0     Corey Kluber  Indians  28.0  18.0  9.0  2.44  7.4  34.0  34.0
235  2015.0     Jake Arrieta     Cubs  29.0  22.0  6.0  1.77  7.3  33.0  33.0
256  2013.0  Clayton Kershaw  Dodgers  25.0  16.0  9.0  1.83  7.1  33.0  33.0

       ...      wSL/C (pi)  wXX/C (pi)  O-Swing% (pi)  Z-Swing% (pi)  
336    ...            1.76       22.85          0.364          0.665
236    ...            2.62         NaN          0.371          0.670
472    ...            3.92         NaN          0.336          0.598
235    ...            2.42         NaN          0.329          0.618
256    ...            0.74         NaN          0.339          0.635

     Swing% (pi)  O-Contact% (pi)  Z-Contact% (pi)  Contact% (pi)  Zone% (pi)  
336        0.511            0.478            0.811          0.689       0.487
236        0.525            0.536            0.831          0.730       0.515
472        0.468            0.485            0.886          0.744       0.505
235        0.468            0.595            0.856          0.762       0.483
256        0.484            0.563            0.873          0.763       0.492

     Pace (pi)
336       23.4
236       23.7
472       24.6
235       23.3
256       23.4

[5 rows x 299 columns]

  </code>
</pre>

<p>But wait, there’s more! Say you’re interested in comparing the performances of historic teams. That, too, is easy with pybaseball: you can dissect the 1927 “Murderers’ Row” Yankees season with a single line of Python.</p>

<pre>
  <code class="python">
&gt;&gt;&gt; from pybaseball import schedule_and_record
&gt;&gt;&gt; data = schedule_and_record(1927, 'NYY')
&gt;&gt;&gt; data.head()
                Date   Tm Home_Away  Opp W/L     R   RA   Inn  W-L  Rank  \
1    Tuesday, Apr 12  NYY      Home  PHA   W   8.0  3.0   9.0  1-0   1.0
2  Wednesday, Apr 13  NYY      Home  PHA   W  10.0  4.0   9.0  2-0   1.0
3   Thursday, Apr 14  NYY      Home  PHA   T   9.0  9.0  10.0  2-0   1.0
4     Friday, Apr 15  NYY      Home  PHA   W   6.0  3.0   9.0  3-0   1.0
5   Saturday, Apr 16  NYY      Home  BOS   W   5.0  2.0   9.0  4-0   1.0

       GB      Win     Loss  Save  Time D/N  Attendance  Streak
1    Tied     Hoyt    Grove  None  2:05   D     72000.0       1
2  up 0.5  Ruether     Gray  None  2:15   D      8000.0       2
3    Tied     None     None  None  2:50   D      9000.0       2
4    Tied  Pennock    Ehmke  None  2:27   D     16000.0       3
5  up 1.0  Shocker  Ruffing  None  2:05   D     25000.0       4

  </code>
</pre>
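
<p>The division standings mentioned earlier work the same way. A minimal sketch (this assumes the <code class="language-plaintext highlighter-rouge">standings</code> function returns a list of DataFrames, one per division, for the requested season):</p>

<pre>
  <code class="python">
&gt;&gt;&gt; from pybaseball import standings
&gt;&gt;&gt; # one DataFrame per division for the 2016 season
&gt;&gt;&gt; division_tables = standings(2016)
&gt;&gt;&gt; al_east = division_tables[0]  # e.g. the first table returned
  </code>
</pre>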

<p>These examples are just the tip of the iceberg, but hopefully this gives a taste of the power and versatility of pybaseball.</p>

<h1 id="how-to-install-pybaseball">How to Install Pybaseball</h1>
<p>Pybaseball is pip installable. Simply run <code class="language-plaintext highlighter-rouge">pip install pybaseball</code> and it’s on your machine.</p>

<h1 id="where-to-read-more">Where to Read More</h1>
<p>If any of this piqued your interest, full documentation and complete examples are available on the GitHub repo <a href="http://github.com/jldbc/pybaseball" target="_blank">here</a>. If you like what you’ve seen so far, please give it a star. If you <strong>really</strong> like what you’ve seen so far, drop me a suggestion or submit a code improvement. Lastly, if you end up using pybaseball for any type of project or analysis, I would love to hear about it. <a href="mailto:ledoux.james.r@gmail.com">Send me a note</a> or reach out <a href="http://twitter.com/jmzledoux" target="_blank">on Twitter</a>!</p>