Drop Duplicate Rows in a DataFrame
June 01, 2019
This post shows how to remove duplicate records and combinations of columns in a Pandas dataframe and keep only the unique values.
Imports
import pandas as pd
Create Example Data
data = {"Name": ["James", "Alice", "Phil", "James"],
"Age": [24, 28, 40, 24],
"Sex": ["Male", "Female", "Male", "Male"]}
df = pd.DataFrame(data)
print(df)
| Name | Age | Sex |
|---|---|---|
| James | 24 | Male |
| Alice | 28 | Female |
| Phil | 40 | Male |
| James | 24 | Male |
Drop Duplicate Rows
drop_duplicates returns only the dataframe’s unique values. Removing duplicate records is sample.
df = df.drop_duplicates()
print(df)
| Name | Age | Sex |
|---|---|---|
| James | 24 | Male |
| Alice | 28 | Female |
| Phil | 40 | Male |
To remove duplicates of only one or a subset of columns, specify subset as the individual column or list of columns that should be unique. To do this conditional on a different column’s value, you can sort_values(colname) and specify keep equals either first or last.
In our example data, this could be useful if we had two entries for Name = James, one with Age = 24 and one with Age = 25. If we know we only want the oldest example for each person, we can sort by Age and drop duplicates of the name column, keeping only the observation with the highest age.
data = {"Name": ["James", "Alice", "Phil", "James"],
"Age": [24, 28, 40, 25],
"Sex": ["Male", "Female", "Male", "Male"]}
df = pd.DataFrame(data)
print(df)
| Name | Age | Sex |
|---|---|---|
| James | 24 | Male |
| Alice | 28 | Female |
| Phil | 40 | Male |
| James | 25 | Male |
df = df.sort_values('Age', ascending=False)
df = df.drop_duplicates(subset='Name', keep='first')
print(df)
| Name | Age | Sex |
|---|---|---|
| Phil | 40 | Male |
| Alice | 28 | Female |
| James | 25 | Male |
Despite the full records not being duplicated, our duplicatation problem is once again resolved.
Want to improve your Python skills? I recommend reading Python for Data Analysis by Wes McKinney, the creator of Pandas
