Group and Aggregate by One or More Columns in Pandas
June 01, 2019
Pandas comes with a whole host of sql-like aggregation functions you can apply when grouping on one or more columns. This is Python’s closest equivalent to dplyr’s group_by
+ summarise
logic. Here’s a quick example of how to group on one or multiple columns and summarise data with aggregation functions using Pandas.
Imports
import pandas as pd
Create a Toy Dataset
In this case, say we have data on baseball players. We know their team, whether they’re a pitcher or a position player, and their age. With this data we can compare the average ages of the different teams, and then break this out further by pitchers vs. non-pitchers.
data = {"Team": ["Red Sox", "Red Sox", "Red Sox", "Red Sox", "Red Sox", "Red Sox", "Yankees", "Yankees", "Yankees", "Yankees", "Yankees", "Yankees"],
"Pos": ["Pitcher", "Pitcher", "Pitcher", "Not Pitcher", "Not Pitcher", "Not Pitcher", "Pitcher", "Pitcher", "Pitcher", "Not Pitcher", "Not Pitcher", "Not Pitcher"],
"Age": [24, 28, 40, 22, 29, 33, 31, 26, 21, 36, 25, 31]}
df = pd.DataFrame(data)
print(df)
Team | Pos | Age |
---|---|---|
Red Sox | Pitcher | 24 |
Red Sox | Pitcher | 28 |
Red Sox | Pitcher | 40 |
Red Sox | Not Pitcher | 22 |
Red Sox | Not Pitcher | 29 |
Red Sox | Not Pitcher | 33 |
Yankees | Pitcher | 31 |
Yankees | Pitcher | 26 |
Yankees | Pitcher | 21 |
Yankees | Not Pitcher | 36 |
Yankees | Not Pitcher | 25 |
Yankees | Not Pitcher | 31 |
Group By One Column and Get Mean, Min, and Max values by Group
First we’ll group by Team
with Pandas’ groupby
function. After grouping we can pass aggregation functions to the grouped object as a dictionary within the agg
function. This dict takes the column that you’re aggregating as a key, and either a single aggregation function or a list of aggregation functions as its value. To apply aggregations to multiple columns, just add additional key:value pairs to the dictionary.
# group by Team, get mean, min, and max value of Age for each value of Team.
grouped_single = df.groupby('Team').agg({'Age': ['mean', 'min', 'max']})
print(grouped_single)
Team | mean | min | max |
---|---|---|---|
Red Sox | 29.333333 | 22 | 40 |
Yankees | 28.333333 | 21 | 36 |
Applying multiple aggregation functions to a single column will result in a multiindex. Working with multi-indexed columns is a pain and I’d recommend flattening this after aggregating by renaming the new columns.
You’ll also see that your grouping column is now the dataframe’s index. Reset your index to make this easier to work with later on.
# rename columns
grouped_single.columns = ['age_mean', 'age_min', 'age_max']
# reset index to get grouped columns back
grouped_single = grouped_single.reset_index()
print(grouped_single)
Team | age_mean | age_min | age_max |
---|---|---|---|
Red Sox | 29.333333 | 22 | 40 |
Yankees | 28.333333 | 21 | 36 |
Grouping by Multiple Columns
It’s simple to extend this to work with multiple grouping variables. Say you want to summarise player age by team AND position. You can do this by passing a list of column names to groupby
instead of a single string value.
grouped_multiple = df.groupby(['Team', 'Pos']).agg({'Age': ['mean', 'min', 'max']})
grouped_multiple.columns = ['age_mean', 'age_min', 'age_max']
grouped_multiple = grouped_multiple.reset_index()
print(grouped_multiple)
Team | Pos | age_mean | age_min | age_max |
---|---|---|---|---|
Red Sox | Not Pitcher | 28.000000 | 22 | 33 |
Red Sox | Pitcher | 30.666667 | 24 | 40 |
Yankees | Not Pitcher | 30.666667 | 25 | 36 |
Yankees | Pitcher | 26.000000 | 21 | 31 |