Whether you’ve just started working with pandas and want to master one of its core capabilities, or you’re looking to fill in some gaps in your understanding about .groupby(), this tutorial will help you to break down and visualize a pandas GroupBy operation from start to finish.
This tutorial is meant to complement the official pandas documentation and the pandas Cookbook, where you’ll see self-contained, bite-sized examples. Here, however, you’ll focus on three more involved walkthroughs that use real-world datasets.
In this tutorial, you’ll cover:
- How to use pandas GroupBy operations on real-world data
- How the split-apply-combine chain of operations works
- How to decompose the split-apply-combine chain into steps
- How to categorize methods of a pandas GroupBy object based on their intent and result
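As a preview of the split-apply-combine chain named in the list above, here's a minimal sketch that performs the three steps by hand and then compares the result with a single .groupby() call. The toy DataFrame and its column names are made up for illustration and aren't part of the tutorial's datasets.

```python
import pandas as pd

# Hypothetical toy data, not taken from the tutorial's datasets.
toy = pd.DataFrame(
    {
        "state": ["VA", "GA", "VA", "GA", "NC"],
        "terms": [2, 1, 4, 3, 5],
    }
)

# Split: iterating over a GroupBy object yields (key, sub-DataFrame) pairs.
# Apply: compute one value per group, here the sum of "terms".
partial_sums = {}
for state, frame in toy.groupby("state"):
    partial_sums[state] = frame["terms"].sum()

# Combine: assemble the per-group results into a single Series.
manual = pd.Series(partial_sums, name="terms").sort_index()

# The one-liner performs all three steps in a single chain.
one_liner = toy.groupby("state")["terms"].sum()

print(manual)
print(one_liner)
```

Both printed Series should hold the same per-state totals, which is the point: .groupby() followed by an aggregation is shorthand for the explicit loop.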
This tutorial assumes that you have some experience with pandas itself, including how to read CSV files into memory as pandas objects with read_csv(). If you need a refresher, then check out Reading CSVs With pandas and pandas: How to Read and Write Files.
You can download the source code for all the examples in this tutorial by clicking on the link below:
Download Datasets: Click here to download the datasets that you’ll use to learn about pandas’ GroupBy in this tutorial.
Prerequisites
Before you proceed, make sure that you have the latest version of pandas installed in a new virtual environment.
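If you want to confirm what your current interpreter sees, one option is a quick check from Python itself. This is a minimal sketch rather than an official setup procedure. It assumes that pandas is already installed and that your environment was created with venv or a compatible tool. The variable name in_virtual_env is only illustrative.

```python
import sys

import pandas as pd

# sys.prefix differs from sys.base_prefix when a venv-style virtual
# environment is active.
in_virtual_env = sys.prefix != sys.base_prefix

print("Virtual environment active:", in_virtual_env)
print("pandas version:", pd.__version__)
```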
In this tutorial, you’ll focus on three datasets:
- The U.S. Congress dataset contains public information on historical members of Congress and illustrates several fundamental capabilities of .groupby().
- The air quality dataset contains periodic gas sensor readings. This will allow you to work with floats and time series data.
- The news aggregator dataset holds metadata on several hundred thousand news articles. You’ll be working with strings and doing text munging with .groupby().
Once you’ve downloaded the .zip file, unzip the file to a folder called groupby-data/ in your current directory. Before you read on, ensure that your directory tree looks like this:
./
│
└── groupby-data/
    │
    ├── legislators-historical.csv
    ├── airqual.csv
    └── news.csv
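If you'd rather check the layout programmatically, the following sketch reports any of the three CSV files that are missing. The helper name missing_datasets() and the constant EXPECTED_FILES are hypothetical and introduced only for this example. The folder and file names come from the tree above.

```python
from pathlib import Path

# File names taken from the directory tree shown in this tutorial.
EXPECTED_FILES = ["legislators-historical.csv", "airqual.csv", "news.csv"]


def missing_datasets(data_dir="groupby-data"):
    """Return the expected CSV file names that are absent from data_dir."""
    folder = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]


missing = missing_datasets()
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All three datasets are in place.")
```

Run it from the directory that contains groupby-data/, since the default path is relative to your current working directory.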
With pandas installed, your virtual environment activated, and the datasets downloaded, you’re ready to jump in!
Example 1: U.S. Congress Dataset
You’ll jump right into things by dissecting a dataset of historical members of Congress. You can read the CSV file into a pandas DataFrame with read_csv():
# pandas_legislators.py

import pandas as pd

dtypes = {
    "first_name": "category",
    "gender": "category",
    "type": "category",
    "state": "category",
    "party": "category",
}
df = pd.read_csv(
    "groupby-data/legislators-historical.csv",
    dtype=dtypes,
    usecols=list(dtypes) + ["birthday", "last_name"],
    parse_dates=["birthday"]
)

The dataset contains members’ first and last names, birthday, gender, type ("rep" for House of Representatives or "sen" for Senate), U.S. state, and political party. You can use df.tail() to view the last few rows of the dataset:
>>> from pandas_legislators import df
>>> df.tail()
       last_name  first_name    birthday  gender  type  state       party
11970    Garrett      Thomas  1972-03-27       M   rep     VA  Republican
11971     Handel       Karen  1962-04-18       F   rep     GA  Republican
11972      Jones      Brenda  1959-10-24       F   rep     MI    Democrat
11973     Marino         Tom  1952-08-15       M   rep     PA  Republican
11974      Jones      Walter  1943-02-10       M   rep     NC  Republican

The DataFrame uses categorical dtypes for space efficiency:
>>> df.dtypes
last_name             object
first_name          category
birthday      datetime64[ns]
gender              category
type                category
state               category
party               category
dtype: object

Read the full article at https://realpython.com/pandas-groupby/
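To get a feel for the space savings from categorical dtypes mentioned above, you can compare the memory footprint of the same column stored as object and as category. This is a rough sketch on a synthetic Series, an assumption made for illustration, so the exact byte counts will differ from those for the Congress dataset and across pandas versions.

```python
import pandas as pd

# Synthetic column (an assumption for illustration): a few distinct
# values repeated many times, similar in shape to "state" or "party".
states = pd.Series(["VA", "GA", "MI", "PA", "NC"] * 2_000)

as_object = states.astype("object")
as_category = states.astype("category")

# deep=True also counts the memory held by the Python string objects,
# not only the references stored in the column.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The category version stores each distinct string once and keeps small integer codes per row, which is why it tends to win when a column has few unique values relative to its length.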