Drop Duplicate Rows in a DataFrame
June 01, 2019
This post shows how to remove duplicate records and combinations of columns in a Pandas dataframe and keep only the unique values.
Imports
import pandas as pd
Create Example Data
data = {"Name": ["James", "Alice", "Phil", "James"],
"Age": [24, 28, 40, 24],
"Sex": ["Male", "Female", "Male", "Male"]}
df = pd.DataFrame(data)
print(df)
Name | Age | Sex |
---|---|---|
James | 24 | Male |
Alice | 28 | Female |
Phil | 40 | Male |
James | 24 | Male |
Drop Duplicate Rows
drop_duplicates
returns only the dataframe’s unique values. Removing duplicate records is sample.
df = df.drop_duplicates()
print(df)
Name | Age | Sex |
---|---|---|
James | 24 | Male |
Alice | 28 | Female |
Phil | 40 | Male |
To remove duplicates of only one or a subset of columns, specify subset
as the individual column or list of columns that should be unique. To do this conditional on a different column’s value, you can sort_values(colname)
and specify keep
equals either first
or last
.
In our example data, this could be useful if we had two entries for Name = James, one with Age = 24 and one with Age = 25. If we know we only want the oldest example for each person, we can sort by Age and drop duplicates of the name column, keeping only the observation with the highest age.
data = {"Name": ["James", "Alice", "Phil", "James"],
"Age": [24, 28, 40, 25],
"Sex": ["Male", "Female", "Male", "Male"]}
df = pd.DataFrame(data)
print(df)
Name | Age | Sex |
---|---|---|
James | 24 | Male |
Alice | 28 | Female |
Phil | 40 | Male |
James | 25 | Male |
df = df.sort_values('Age', ascending=False)
df = df.drop_duplicates(subset='Name', keep='first')
print(df)
Name | Age | Sex |
---|---|---|
Phil | 40 | Male |
Alice | 28 | Female |
James | 25 | Male |
Despite the full records not being duplicated, our duplicatation problem is once again resolved.