Drop Duplicate Rows in a DataFrame

June 01, 2019

This post shows how to remove duplicate records and combinations of columns in a Pandas dataframe and keep only the unique values.

Imports

import pandas as pd

Create Example Data

data = {"Name": ["James", "Alice", "Phil", "James"],
		"Age": [24, 28, 40, 24],
		"Sex": ["Male", "Female", "Male", "Male"]}
df = pd.DataFrame(data)
print(df)

Name	Age	Sex
James	24	Male
Alice	28	Female
Phil	40	Male
James	24	Male

Drop Duplicate Rows

drop_duplicates returns only the dataframe’s unique values. Removing duplicate records is sample.

df = df.drop_duplicates()
print(df)

Name	Age	Sex
James	24	Male
Alice	28	Female
Phil	40	Male

To remove duplicates of only one or a subset of columns, specify subset as the individual column or list of columns that should be unique. To do this conditional on a different column’s value, you can sort_values(colname) and specify keep equals either first or last.

In our example data, this could be useful if we had two entries for Name = James, one with Age = 24 and one with Age = 25. If we know we only want the oldest example for each person, we can sort by Age and drop duplicates of the name column, keeping only the observation with the highest age.

data = {"Name": ["James", "Alice", "Phil", "James"],
		"Age": [24, 28, 40, 25],
		"Sex": ["Male", "Female", "Male", "Male"]}
df = pd.DataFrame(data)
print(df)

Name	Age	Sex
James	24	Male
Alice	28	Female
Phil	40	Male
James	25	Male

df = df.sort_values('Age', ascending=False)
df = df.drop_duplicates(subset='Name', keep='first')
print(df)

Name	Age	Sex
Phil	40	Male
Alice	28	Female
James	25	Male

Despite the full records not being duplicated, our duplicatation problem is once again resolved.

Python for Data Analysis book cover

Get it on Amazon here for $36.26

James LeDoux

Drop Duplicate Rows in a DataFrame

Imports

Create Example Data

Drop Duplicate Rows

Want to improve your Python skills? I recommend reading Python for Data Analysis by Wes McKinney, the creator of Pandas