Create Dummy Variables in Pandas

June 01, 2019

This post shows how to create dummy variables using Pandas’ pd.get_dummies function.

Background

A dummy variable is a binary (0/1) variable that indicates whether a categorical variable takes on a specific value. For a categorical variable with several unique values, it is useful to create one dummy variable for each unique value. Here’s how you can do that in Python:
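To make the definition concrete, a single dummy can be sketched by hand with a plain comparison. This is only an illustration: the series `sex` and the variable `is_male` are hypothetical names, not part of the walkthrough that follows.

```python
import pandas as pd

# Hypothetical toy series that mirrors the Sex column used later in this post
sex = pd.Series(["Male", "Female", "Male"], name="Sex")

# A dummy for the value "Male": 1 where the category matches, 0 otherwise
is_male = (sex == "Male").astype(int)
print(is_male.tolist())  # [1, 0, 1]
```

Doing this by hand for every unique value gets tedious quickly, which is the work that get_dummies automates.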

Imports

import pandas as pd

Create a Toy Dataset

data = {"Name": ["James", "Alice", "Phil"],
        "Age": [24, 28, 40],
        "Sex": ["Male", "Female", "Male"]}
df = pd.DataFrame(data)
print(df)
    Name  Age     Sex
0  James   24    Male
1  Alice   28  Female
2   Phil   40    Male

Create Dummy Variables

Create dummy variables with Pandas’ get_dummies function. You will often create dummies for several categorical features in the same dataset, so I like to add a descriptive prefix to the dummy columns’ names. The example works fine without the .rename() at the end, though, so feel free to omit it if it doesn’t help.

# create dummies for the Sex column and prefix each new column name with 'Sex_'
# dtype=int keeps 0/1 values (pandas 2.0+ returns True/False by default)
dummies = pd.get_dummies(df['Sex'], dtype=int).rename(columns=lambda x: 'Sex_' + str(x))
# bring the dummies back into the original dataset
df = pd.concat([df, dummies], axis=1)
print(df)
    Name  Age     Sex  Sex_Female  Sex_Male
0  James   24    Male           0         1
1  Alice   28  Female           1         0
2   Phil   40    Male           0         1
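When several categorical features need dummies at once, an alternative is to pass the whole DataFrame to get_dummies with its columns argument, which prefixes each dummy with the source column name automatically. The sketch below is a minimal illustration rather than part of the original example: the "Team" column and its values are made up to provide a second categorical feature.

```python
import pandas as pd

# Toy dataset from this post, plus a hypothetical "Team" column
df = pd.DataFrame({
    "Name": ["James", "Alice", "Phil"],
    "Age": [24, 28, 40],
    "Sex": ["Male", "Female", "Male"],
    "Team": ["Red", "Blue", "Red"],  # assumed column, for illustration only
})

# Encode both categorical columns in one call; each dummy column is named
# <source column>_<value>. dtype=int gives 0/1 instead of True/False.
encoded = pd.get_dummies(df, columns=["Sex", "Team"], dtype=int)
print(encoded)
```

Note that this approach drops the original Sex and Team columns from the result, whereas the pd.concat approach above keeps the original column alongside its dummies.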