Group and Aggregate by One or More Columns in Pandas

June 01, 2019

Pandas comes with a whole host of sql-like aggregation functions you can apply when grouping on one or more columns. This is Python’s closest equivalent to dplyr’s group_by + summarise logic. Here’s a quick example of how to group on one or multiple columns and summarise data with aggregation functions using Pandas.

Imports

import pandas as pd

Create a Toy Dataset

In this case, say we have data on baseball players. We know their team, whether they’re a pitcher or a position player, and their age. With this data we can compare the average ages of the different teams, and then break this out further by pitchers vs. non-pitchers.

data = {"Team": ["Red Sox", "Red Sox", "Red Sox", "Red Sox", "Red Sox", "Red Sox", "Yankees", "Yankees", "Yankees", "Yankees", "Yankees", "Yankees"],
		"Pos": ["Pitcher", "Pitcher", "Pitcher", "Not Pitcher", "Not Pitcher", "Not Pitcher", "Pitcher", "Pitcher", "Pitcher", "Not Pitcher", "Not Pitcher", "Not Pitcher"],
		"Age": [24, 28, 40, 22, 29, 33, 31, 26, 21, 36, 25, 31]}
df = pd.DataFrame(data)
print(df)

Team	Pos	Age
Red Sox	Pitcher	24
Red Sox	Pitcher	28
Red Sox	Pitcher	40
Red Sox	Not Pitcher	22
Red Sox	Not Pitcher	29
Red Sox	Not Pitcher	33
Yankees	Pitcher	31
Yankees	Pitcher	26
Yankees	Pitcher	21
Yankees	Not Pitcher	36
Yankees	Not Pitcher	25
Yankees	Not Pitcher	31

Group By One Column and Get Mean, Min, and Max values by Group

First we’ll group by Team with Pandas’ groupby function. After grouping we can pass aggregation functions to the grouped object as a dictionary within the agg function. This dict takes the column that you’re aggregating as a key, and either a single aggregation function or a list of aggregation functions as its value. To apply aggregations to multiple columns, just add additional key:value pairs to the dictionary.

# group by Team, get mean, min, and max value of Age for each value of Team.
grouped_single = df.groupby('Team').agg({'Age': ['mean', 'min', 'max']})

print(grouped_single)

Team	mean	min	max
Red Sox	29.333333	22	40
Yankees	28.333333	21	36

Applying multiple aggregation functions to a single column will result in a multiindex. Working with multi-indexed columns is a pain and I’d recommend flattening this after aggregating by renaming the new columns.

You’ll also see that your grouping column is now the dataframe’s index. Reset your index to make this easier to work with later on.

# rename columns
grouped_single.columns = ['age_mean', 'age_min', 'age_max']

# reset index to get grouped columns back
grouped_single = grouped_single.reset_index()

print(grouped_single)

Team	age_mean	age_min	age_max
Red Sox	29.333333	22	40
Yankees	28.333333	21	36

Grouping by Multiple Columns

It’s simple to extend this to work with multiple grouping variables. Say you want to summarise player age by team AND position. You can do this by passing a list of column names to groupby instead of a single string value.

grouped_multiple = df.groupby(['Team', 'Pos']).agg({'Age': ['mean', 'min', 'max']})
grouped_multiple.columns = ['age_mean', 'age_min', 'age_max']
grouped_multiple = grouped_multiple.reset_index()
print(grouped_multiple)

Team	Pos	age_mean	age_min	age_max
Red Sox	Not Pitcher	28.000000	22	33
Red Sox	Pitcher	30.666667	24	40
Yankees	Not Pitcher	30.666667	25	36
Yankees	Pitcher	26.000000	21	31

Python for Data Analysis book cover

Get it on Amazon here for $36.26

James LeDoux

Group and Aggregate by One or More Columns in Pandas

Imports

Create a Toy Dataset

Group By One Column and Get Mean, Min, and Max values by Group

Grouping by Multiple Columns

Want to improve your Python skills? I recommend reading Python for Data Analysis by Wes McKinney, the creator of Pandas