Group and Aggregate by One or More Columns in Pandas

Pandas comes with a whole host of sql-like aggregation functions you can apply when grouping on one or more columns. This is Python’s closest equivalent to dplyr’s group_by + summarise logic. Here’s a quick example of how to group on one or multiple columns and summarise data with aggregation functions using Pandas.

Imports

import pandas as pd

Create a Toy Dataset

In this case, say we have data on baseball players. We know their team, whether they’re a pitcher or a position player, and their age. With this data we can compare the average ages of the different teams, and then break this out further by pitchers vs. non-pitchers.

data = {"Team": ["Red Sox", "Red Sox", "Red Sox", "Red Sox", "Red Sox", "Red Sox", "Yankees", "Yankees", "Yankees", "Yankees", "Yankees", "Yankees"],
		"Pos": ["Pitcher", "Pitcher", "Pitcher", "Not Pitcher", "Not Pitcher", "Not Pitcher", "Pitcher", "Pitcher", "Pitcher", "Not Pitcher", "Not Pitcher", "Not Pitcher"],
		"Age": [24, 28, 40, 22, 29, 33, 31, 26, 21, 36, 25, 31]}
df = pd.DataFrame(data)
print(df)
Team Pos Age
Red Sox Pitcher 24
Red Sox Pitcher 28
Red Sox Pitcher 40
Red Sox Not Pitcher 22
Red Sox Not Pitcher 29
Red Sox Not Pitcher 33
Yankees Pitcher 31
Yankees Pitcher 26
Yankees Pitcher 21
Yankees Not Pitcher 36
Yankees Not Pitcher 25
Yankees Not Pitcher 31

Group By One Column and Get Mean, Min, and Max values by Group

First we’ll group by Team with Pandas’ groupby function. After grouping we can pass aggregation functions to the grouped object as a dictionary within the agg function. This dict takes the column that you’re aggregating as a key, and either a single aggregation function or a list of aggregation functions as its value. To apply aggregations to multiple columns, just add additional key:value pairs to the dictionary.

# group by Team, get mean, min, and max value of Age for each value of Team.
grouped_single = df.groupby('Team').agg({'Age': ['mean', 'min', 'max']})

print(grouped_single)
Team mean min max
Red Sox 29.333333 22 40
Yankees 28.333333 21 36

Applying multiple aggregation functions to a single column will result in a multiindex. Working with multi-indexed columns is a pain and I’d recommend flattening this after aggregating by renaming the new columns.

You’ll also see that your grouping column is now the dataframe’s index. Reset your index to make this easier to work with later on.

# rename columns
grouped_single.columns = ['age_mean', 'age_min', 'age_max']

# reset index to get grouped columns back
grouped_single = grouped_single.reset_index()

print(grouped_single)
Team age_mean age_min age_max
Red Sox 29.333333 22 40
Yankees 28.333333 21 36

Grouping by Multiple Columns

It’s simple to extend this to work with multiple grouping variables. Say you want to summarise player age by team AND position. You can do this by passing a list of column names to groupby instead of a single string value.

grouped_multiple = df.groupby(['Team', 'Pos']).agg({'Age': ['mean', 'min', 'max']})
grouped_multiple.columns = ['age_mean', 'age_min', 'age_max']
grouped_multiple = grouped_multiple.reset_index()
print(grouped_multiple)
Team Pos age_mean age_min age_max
Red Sox Not Pitcher 28.000000 22 33
Red Sox Pitcher 30.666667 24 40
Yankees Not Pitcher 30.666667 25 36
Yankees Pitcher 26.000000 21 31