Create Dummy Variables in Pandas
June 01, 2019
This post shows how to create dummy variables using Pandas’ pd.get_dummies function.
Background
A dummy variable is a binary (0/1) variable that indicates whether a categorical variable takes on a specific value. For a categorical variable that takes on more than one value, it is useful to create one dummy variable for each unique value. Here’s how you can do that in Python:
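To make the definition concrete, here is a minimal sketch that builds the dummies by hand, one comparison per unique value. The `colors` series and its values are hypothetical and used only for illustration; the rest of this post uses pd.get_dummies instead.

```python
import pandas as pd

# Hypothetical example values, used only to illustrate the definition
colors = pd.Series(["red", "green", "red", "blue"], name="color")

# One dummy per unique value: 1 where the row matches that value, else 0
manual_dummies = pd.DataFrame(
    {f"color_{value}": (colors == value).astype(int)
     for value in sorted(colors.unique())}
)
print(manual_dummies)
```

Each row has exactly one dummy set to 1, because each row holds exactly one category.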
Imports
import pandas as pd
Create a Toy Dataset
data = {"Name": ["James", "Alice", "Phil"],
"Age": [24, 28, 40],
"Sex": ["Male", "Female", "Male"]}
df = pd.DataFrame(data)
print(df)
Name | Age | Sex |
---|---|---|
James | 24 | Male |
Alice | 28 | Female |
Phil | 40 | Male |
Create Dummy Variables
Create dummy variables with Pandas’ get_dummies function. You will often create dummies for several categorical features in the same dataset, so I like to add a descriptive prefix to the dummy columns’ names. The example works fine without the .rename() call at the end, though, so feel free to omit it if it doesn’t help.
# create dummies for the Sex column and prefix each new column name with 'Sex_'
dummies = pd.get_dummies(df['Sex']).rename(columns=lambda x: 'Sex_' + str(x))
# bring the dummies back into the original dataset
df = pd.concat([df, dummies], axis=1)
print(df)
Name | Age | Sex | Sex_Female | Sex_Male |
---|---|---|---|---|
James | 24 | Male | 0 | 1 |
Alice | 28 | Female | 1 | 0 |
Phil | 40 | Male | 0 | 1 |
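As an alternative to .rename(), get_dummies accepts a prefix argument that names the columns in the same way. The sketch below also passes dtype=int, on the assumption that you want 0/1 integers as shown above; recent pandas versions return True/False columns when no dtype is given. The name df_with_dummies is introduced here only for illustration.

```python
import pandas as pd

# Same toy dataset as above
df = pd.DataFrame({"Name": ["James", "Alice", "Phil"],
                   "Age": [24, 28, 40],
                   "Sex": ["Male", "Female", "Male"]})

# prefix= names the new columns Sex_<value>; dtype=int requests 0/1 integers
# instead of the boolean columns that newer pandas versions produce by default
dummies = pd.get_dummies(df["Sex"], prefix="Sex", dtype=int)

# bring the dummies back into the original dataset
df_with_dummies = pd.concat([df, dummies], axis=1)
print(df_with_dummies)
```

Whether to use prefix or .rename() is mostly a matter of taste; prefix keeps the naming inside the get_dummies call itself.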