Visualizing the 2018 MLB Season with a Bump Chart
A quick tutorial on fetching MLB win-loss data with pybaseball and cleaning and visualizing it with the tidyverse (dplyr and ggplot).
August 05, 2018
The 2018 MLB season has so far been just like every other season: filled with ups, downs, win streaks, teams plagued with injuries, and so on. In this post I aim to catch up on the current season with a single chart showing how the leagues’ rankings have changed throughout the year. My visualization of choice here is the bump chart, a type of line chart showing changes in rankings over time. If you just want to see the final product, this is what it looks like:
If you’re still reading, here’s how you can create your own.
First we’ll need data on every team’s record at each point within the season so far. My plan is to visualize this data in R with ggplot, and there are several capable R packages for pulling baseball data (baseballr and Lahman, to name two). Since I maintain the pybaseball package, however, I’ll eat my own dogfood and start from there.
I use pybaseball.schedule_and_record(year, team_code) to fetch each team’s 2018 data. Once these data are concatenated together to create the whole season’s record file, I clean the Date column to standardize dates across the dataframe, cut off dates that are in the future, and calculate each team’s win percentage at the end of each game day. I then export this as a csv so it can be used in R with ggplot.
import pandas as pd
from pybaseball import schedule_and_record

# baseball-reference team codes, which pybaseball uses
teams = ['BOS','NYY','TBR','TOR','BAL','CLE','MIN','KCR','CHW',
         'DET','HOU','LAA','SEA','TEX','OAK','WSN','MIA','ATL',
         'NYM','PHI','CHC','MIL','STL','PIT','CIN','LAD','ARI',
         'COL','SDP','SFG']

# collect every team's record for the 2018 season
records = []
for t in teams:
    s = schedule_and_record(2018, t)
    records.append(s)

# concatenate records together so the whole season is in one dataframe
df = pd.concat(records, axis=0)

# standardize the date formats of double-header games,
# which are suffixed with " (1)" and " (2)"
df.Date = df.Date.str.replace(' (1)', '', regex=False)
df.Date = df.Date.str.replace(' (2)', '', regex=False)

# turn this into a date format that Pandas will recognize
df.Date = pd.to_datetime(df.Date, format='%A, %b %d')
df.Date = df.Date.map(lambda x: x.replace(year=2018))

# cut out games that haven't happened yet
df = df.loc[df.Date < '2018-08-05']

# extract win and loss counts from the cumulative "W-L" strings
df['W'] = df['W-L'].str.split('-').str[0].astype(int)
df['L'] = df['W-L'].str.split('-').str[1].astype(int)
df['win_pct'] = df['W'] / (df['W'] + df['L'])

df.to_csv('2018-records.csv')
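As an optional sanity check before moving to R (a quick sketch of my own, assuming the scrape succeeded for every team), something like this prints each team’s latest record, sorted by win percentage:

# optional sanity check: each team's most recent record in the dataframe
latest = df.sort_values('Date').groupby('Tm').tail(1)
print(latest[['Tm', 'Date', 'W', 'L', 'win_pct']]
      .sort_values('win_pct', ascending=False)
      .to_string(index=False))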
Next, the data get loaded into R. It is easiest to rank the teams when their win rate is known at every point in time, not only on their game days. For this reason, my first task in preparing the data is to fill in the missing non-game-day dates with the win percentage from each team’s most recent game day.
library(cowplot)
library(dplyr)
library(tidyr)

win_percentages = read.csv('2018-records.csv')
win_percentages = win_percentages[, c('Tm', 'Date', 'win_pct')]
win_percentages[is.na(win_percentages$win_pct), 'win_pct'] = 0

# create a dummy column to give dplyr left_join the effect of a cross join
dates = setNames(data.frame(unique(win_percentages$Date), dummy = 1), c('Date', 'dummy'))
teams = setNames(data.frame(unique(win_percentages$Tm), dummy = 1), c('Tm', 'dummy'))

# rejoin the tables to get one row per day per team
df = left_join(dates, teams, by = 'dummy')
df = left_join(df, win_percentages, by = c('Tm', 'Date'))

# fill non-gameday win percentages with the previous gameday's win percentage
df = df %>%
  mutate(Date = as.Date(Date)) %>%
  complete(Date = seq.Date(min(Date), max(Date), by = "day")) %>%
  group_by(Tm) %>%
  fill(win_pct)

# remove NAs generated by the all-star break, when no games were played
df = df[!is.na(df$Tm),]
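Each team should now cover the same set of dates. As a quick optional check (a sketch of my own, not required for the chart), count rows per team and confirm there is a single distinct count:

# optional check: every team should now have the same number of rows
df %>% group_by(Tm) %>% summarise(n_days = n()) %>% distinct(n_days)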
Now that the data are formatted how we want them, we’ll need to rank the teams by their records. Because the bump chart would be messy with all 30 teams involved, and because rankings across leagues don’t have much real-world value, we’ll want to separate NL teams from AL teams first. To do this, create a list of one league’s teams and use a dplyr filter on it.
al_teams = c('BOS', 'NYY', 'BAL', 'TBR', 'TOR', 'CHW','CLE',
'DET', 'KCR', 'MIN','HOU','LAA', 'OAK', 'SEA', 'TEX')
al = df %>% filter(Tm %in% al_teams)
nl = df %>% filter(!(Tm %in% al_teams))
Next, correct for double-header days by grouping on date-team combinations and taking only the last row of each group. This is the record at the end of the team’s double-header. After this, ranking teams by their records is as simple as grouping by Date, sorting by win percentage, and ordering the teams from best to worst. Ties in this case will naively go to the team that comes first in the alphabet.
by_date = al %>% group_by(Date, Tm) %>% filter(row_number() == n()) %>% unique()
by_date = by_date %>%
  group_by(Date) %>%
  arrange(Date, desc(win_pct), Tm) %>%
  mutate(Rank = rank(-win_pct, ties.method = "first"))
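To see what this produces, here’s an optional peek (illustrative only) at the top of the AL table on the most recent date:

# optional peek: the top five AL teams on the most recent date
by_date %>%
  filter(Date == as.Date('2018-08-04')) %>%
  arrange(Rank) %>%
  select(Date, Tm, win_pct, Rank) %>%
  head(5)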
Now on to the fun part: graphing it. We can start by defining the colors that will be associated with each team’s line on the graph. Because team colors are well known to fans, this will help the plot’s interpretability. Conveniently, there’s a website built for this exact purpose: teamcolorcodes.com. I selected a hex code for one of each team’s colors and added them to a named vector like so:
team_colors = c(BOS = '#BD3039', NYY = '#003087', TBR = '#8FBCE6', KCR = '#BD9B60',
CHW = '#27251F', BAL = '#DF4601', CLE = '#E31937', MIN = '#002B5C',
DET = '#FA4616', HOU = '#EB6E1F', LAA = '#BA0021', SEA = '#005C5C',
TEX = '#003278', OAK = '#003831', WSN = '#14225A', MIA = '#FF6600',
ATL = '#13274F', NYM = '#002D72', PHI = '#E81828', CHC = '#0E3386',
MIL = '#B6922E', STL = '#C41E3A', PIT = '#FDB827', CIN = '#C6011F',
LAD = '#005A9C', ARI = '#A71930', COL = '#33006F', SDP = '#002D62',
SFG = '#FD5A1E', TOR = '#134A8E')
First let’s create the main plot: a bump chart showing the 15 AL teams’ rankings throughout the progression of the season. I plot a geom_line for each value of Tm, and then flip the scale so that a lower (better) value of Rank will be at the top of the Y axis. To make this somewhat interpretable, I label the teams at their final positions (yesterday, the most recent date for which I have data) at the tail end of the chart, so that they can be traced back in time throughout the season, and color-code them with scale_color_manual so that each line matches the team’s colors. The code for the main chart is this:
p = ggplot(data = by_date, aes(x = Date, y = Rank, group = Tm)) +
  geom_line(aes(color = Tm), size = .75, show.legend = F) +
  scale_y_reverse(breaks = 1:15) +
  geom_text(data = subset(by_date, Date == as.Date("2018-08-04")),
            aes(label = Tm), size = 2.5, hjust = -.1) +
  scale_color_manual(values = team_colors)
Which produces an outcome that looks like this:
This is nice, but it’s also a bit noisy. As a visual aid, it will help to show each team’s line in isolation along the border. To do this, we’ll first need a function for generating graph p for each team in isolation. This is the same plot as above, but with the axes wiped out to minimize noise.
teamplot = function(team_code){
  ggplot(data = by_date[by_date$Tm == team_code,], aes(x = Date, y = Rank, group = Tm)) +
    geom_line(aes(color = Tm), size = .75, show.legend = F) +
    scale_y_reverse(breaks = 1:15) +
    scale_color_manual(values = team_colors) +
    labs(y = team_code) +
    theme(axis.title.x = element_blank(),
          axis.text.x = element_blank(),
          axis.ticks.x = element_blank(),
          axis.ticks.y = element_blank(),
          axis.text.y = element_blank(),
          axis.title.y = element_text(angle = 0, size = 6))
}
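As a quick usage example, calling this for a single team draws one mini-chart on its own:

# example: draw a single team's mini-chart
teamplot('BOS')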
With this in place, all that’s left is to arrange the team charts and the league chart side by side. Luckily, there’s a library for that. Using library(grid), I define a 16x4 grid layout. Row 1 will be a title for the entire figure. Columns 2-4 will belong to the main chart showing all teams. Rows 2-16 in column 1, finally, will belong to the 15 individual-team charts. This is put together like so:
library(grid)
pushViewport(viewport(layout = grid.layout(nrow = 16, ncol = 4)))

# helper function for defining a region in the layout
# source: http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/81-ggplot2-easy-way-to-mix-multiple-graphs-on-the-same-page/
define_region <- function(row, col){
  viewport(layout.pos.row = row, layout.pos.col = col)
}

grid.text(expression(bold("American League Standings")),
          vp = define_region(row = 1, col = 1:4), gp = gpar(fontsize = 15))
print(p, vp = define_region(row = 2:16, col = 2:4))

teams = unique(by_date[by_date$Date == '2018-08-04', 'Tm'])
for (idx in 2:16){
  print(teamplot(teams[[1]][[idx - 1]]), vp = define_region(row = idx, col = 1))
}
which, finally, gives us a finished product that looks like this:
Swap nl in for al in the code above to get the same figure for the National League.
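Concretely, that means rebuilding by_date from the nl subset (a sketch mirroring the AL code above) and updating the grid.text title to "National League Standings" before re-running the plotting and grid code:

# rebuild the ranked table from the NL subset, then re-run the plotting code
by_date = nl %>% group_by(Date, Tm) %>% filter(row_number() == n()) %>% unique()
by_date = by_date %>%
  group_by(Date) %>%
  arrange(Date, desc(win_pct), Tm) %>%
  mutate(Rank = rank(-win_pct, ties.method = "first"))

And with that, we’re done here. Simple enough!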