Introduction
One of the reasons I like using pandas instead of Excel for data analysis is that it is
easier to avoid certain types of copy-paste Excel errors. As great as pandas is, there
is still plenty of opportunity to make errors with pandas code. This article discusses a
subtle issue with pandas `groupby` code that can lead to big errors if you're not careful. I'm writing this because I have happened upon this in the past, but it still bit me big time just recently. I hope this article can help a few of you avoid this mistake.
The Problem
To illustrate this problem, we’ll use a simple data set that shows sales for 20 customers and includes their region and an internal customer segment designation of Platinum, Gold or Silver. Here is the full data set:
| | Customer ID | Customer Name | Region | Segment | Sales |
|---|---|---|---|---|---|
| 0 | 740150 | Barton LLC | US | Gold | 215000 |
| 1 | 714466 | Trantow-Barrows | EMEA | Silver | 430000 |
| 2 | 218895 | Kulas Inc | EMEA | Platinum | 410000 |
| 3 | 307599 | Kassulke, Ondricka and Metz | EMEA | NaN | 91000 |
| 4 | 412290 | Jerde-Hilpert | EMEA | Gold | 630000 |
| 5 | 729833 | Koepp Ltd | US | NaN | 230000 |
| 6 | 737550 | Fritsch, Russel and Anderson | US | Gold | 630000 |
| 7 | 146832 | Kiehn-Spinka | US | Silver | 615000 |
| 8 | 688981 | Keeling LLC | US | Platinum | 515000 |
| 9 | 786968 | Frami, Hills and Schmidt | US | Gold | 215000 |
| 10 | 239344 | Stokes LLC | US | Silver | 230000 |
| 11 | 672390 | Kuhn-Gusikowski | APAC | Platinum | 630000 |
| 12 | 141962 | Herman LLC | APAC | Gold | 215000 |
| 13 | 424914 | White-Trantow | US | NaN | 230000 |
| 14 | 527099 | Sanford and Sons | US | Platinum | 615000 |
| 15 | 642753 | Pollich LLC | US | Gold | 419000 |
| 16 | 383080 | Will LLC | US | Silver | 415000 |
| 17 | 257198 | Cronin, Oberbrunner and Spencer | US | Platinum | 425000 |
| 18 | 604255 | Halvorson, Crona and Champlin | US | NaN | 430000 |
| 19 | 163416 | Purdy-Kunde | APAC | Silver | 410000 |
The data looks pretty simple. There’s only one numeric column so let’s see what it totals to.
import pandas as pd

df = pd.read_excel('https://github.com/chris1610/pbpython/raw/master/data/sales_9_2022.xlsx')
df["Sales"].sum()
8000000
We have $8,000,000 in sales in the spreadsheet. Keep that number in mind.
Let’s do some simple analysis to summarize sales by region:
df.groupby(['Region']).agg({'Sales':'sum'})
| Region | Sales |
|---|---|
| APAC | 1255000 |
| EMEA | 1561000 |
| US | 5184000 |
We can double check the math:
df.groupby(['Region']).agg({'Sales':'sum'}).sum()
Sales    8000000
dtype: int64
Looks good. That's what we expect. Now let's see what sales look like by Region and Segment:
df.groupby(['Region','Segment']).agg({'Sales':'sum'})
Which yields this table:
| Region | Segment | Sales |
|---|---|---|
| APAC | Gold | 215000 |
| | Platinum | 630000 |
| | Silver | 410000 |
| EMEA | Gold | 630000 |
| | Platinum | 410000 |
| | Silver | 430000 |
| US | Gold | 1479000 |
| | Platinum | 1555000 |
| | Silver | 1260000 |
This looks good. No errors, and the table seems reasonable. We should continue our analysis, right?
Nope. There’s a potentially subtle issue here. Let’s sum the data to double check:
df.groupby(['Region','Segment']).agg({'Sales':'sum'}).sum()
Sales    7019000
dtype: int64
This only includes $7,019,000. Where did the other $981,000 go? Is pandas broken?
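Before blaming pandas, one quick sanity check (a minimal sketch of my own, using the `df` loaded above) is to look directly at the rows whose Segment is missing:

```python
# Rows with a missing Segment are silently dropped by the default groupby
missing_segment = df[df["Segment"].isna()]
missing_segment["Sales"].sum()  # 981000 -- exactly the amount that "disappeared"
```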
You can see the issue clearly if we use the `dropna=False` parameter to explicitly include `NaN` values in our results:
df.groupby(['Region','Segment'],dropna=False).agg({'Sales':'sum'})
Now we can see the `NaN` combinations within the EMEA and US groupings:
| Region | Segment | Sales |
|---|---|---|
| APAC | Gold | 215000 |
| | Platinum | 630000 |
| | Silver | 410000 |
| EMEA | Gold | 630000 |
| | Platinum | 410000 |
| | Silver | 430000 |
| | NaN | 91000 |
| US | Gold | 1479000 |
| | Platinum | 1555000 |
| | Silver | 1260000 |
| | NaN | 890000 |
If we check the sum, we can see it totals to $8M.
df.groupby(['Region','Segment'],dropna=False).agg({'Sales':'sum'}).sum()
Sales    8000000
dtype: int64
The pandas documentation is very clear on this:
`dropna` : bool, default True. If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups.
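To make that behavior concrete, here is a tiny toy frame (my own illustration, not part of the sales data) showing both settings side by side:

```python
import pandas as pd

# Toy illustration of the dropna parameter on a made-up two-row frame
toy = pd.DataFrame({"key": ["a", None], "val": [1, 2]})

toy.groupby("key").sum()                # only 'a' appears; the NaN row is dropped
toy.groupby("key", dropna=False).sum()  # NaN becomes its own group with val == 2
```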
The takeaway is that if your `groupby` columns contain any NA values, then you need to make a conscious decision about whether or not you want to include those values in the grouped results. If you are ok dropping those values, then use the default `dropna=True`. However, if you want to ensure that all values (Sales in this particular case) are included, then make sure to use `dropna=False` in your `groupby` call.
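One lightweight habit that would have caught this for me (a sketch, not a required step) is to reconcile the grouped total against the raw total before trusting the summary:

```python
# Reconcile the grouped total against the raw total before trusting the summary
raw_total = df["Sales"].sum()
grouped_total = df.groupby(["Region", "Segment"]).agg({"Sales": "sum"})["Sales"].sum()

# With the default dropna=True, this assertion fails and reports the missing 981000
assert raw_total == grouped_total, f"Lost {raw_total - grouped_total} in the groupby"
```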
An ounce of prevention
The main way to deal with this potential issue is to understand whether you have any `NaN` values in your data. There are a couple of ways to do this.
You can use pure pandas:
df.isnull().sum()
Customer ID      0
Customer Name    0
Region           0
Segment          4
Sales            0
dtype: int64
There are other tools like missingno which provide a more robust interface for exploring the data.
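For example, a minimal missingno sketch (assuming the library is installed and imported as `msno`) might look like this:

```python
import missingno as msno

# Draw a matrix of present vs. missing values; the Segment column shows the gaps
msno.matrix(df)
```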
I’m partial to sidetable. Here’s how to use it after it’s installed and imported:
df.stb.missing()
| | missing | total | percent |
|---|---|---|---|
| Segment | 4 | 20 | 20.0 |
| Customer ID | 0 | 20 | 0.0 |
| Customer Name | 0 | 20 | 0.0 |
| Region | 0 | 20 | 0.0 |
| Sales | 0 | 20 | 0.0 |
Regardless of the approach you use, it's worth keeping in mind that you need to know whether you have any null or `NaN` values in your data and how you would like to handle them in your analysis.
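One option (a small sketch of my own) is to check only the columns you plan to group by, right before the groupby:

```python
# Check the intended groupby keys for missing values before aggregating
group_cols = ["Region", "Segment"]
df[group_cols].isna().any()  # Region: False, Segment: True
```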
The other alternative to using the `dropna` parameter is to explicitly fill in the missing values using `fillna`:
df.fillna('unknown').groupby(['Region','Segment']).agg({'Sales':'sum'})
Now the unknown values are explicitly called out:
| Region | Segment | Sales |
|---|---|---|
| APAC | Gold | 215000 |
| | Platinum | 630000 |
| | Silver | 410000 |
| EMEA | Gold | 630000 |
| | Platinum | 410000 |
| | Silver | 430000 |
| | unknown | 91000 |
| US | Gold | 1479000 |
| | Platinum | 1555000 |
| | Silver | 1260000 |
| | unknown | 890000 |
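One possible refinement (a sketch, not the only way to do this) is to fill only the grouping column so that missing values in any other columns are left untouched:

```python
# Fill only the Segment column; other columns keep their original NaN semantics
df.assign(Segment=df["Segment"].fillna("unknown")).groupby(
    ["Region", "Segment"]
).agg({"Sales": "sum"})
```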
Conclusion
When working with pandas `groupby`, the results can be surprising if you have `NaN` values in your dataframe columns. The default behavior is to drop those values, which means you can effectively "lose" some of your data during the process.
I have been bit by this behavior several times in the past. In some cases, it might not be a big deal. In others, you might need to sheepishly explain why your numbers aren’t adding up.
Have you seen this before? Let me know in the comments below.