
Chris Moffitt: Pandas Groupby Warning


Introduction

One of the reasons I like using pandas instead of Excel for data analysis is that it is easier to avoid certain types of copy-paste Excel errors. As great as pandas is, there is still plenty of opportunity to make errors with pandas code. This article discusses a subtle issue with pandas groupby code that can lead to big errors if you’re not careful. I’m writing this because I have run into this issue in the past, but it still bit me big time just recently. I hope this article can help a few of you avoid this mistake.

The Problem

To illustrate this problem, we’ll use a simple data set that shows sales for 20 customers and includes their region and an internal customer segment designation of Platinum, Gold or Silver. Here is the full data set:

    Customer ID  Customer Name                     Region  Segment      Sales
0        740150  Barton LLC                        US      Gold        215000
1        714466  Trantow-Barrows                   EMEA    Silver      430000
2        218895  Kulas Inc                         EMEA    Platinum    410000
3        307599  Kassulke, Ondricka and Metz       EMEA    NaN          91000
4        412290  Jerde-Hilpert                     EMEA    Gold        630000
5        729833  Koepp Ltd                         US      NaN         230000
6        737550  Fritsch, Russel and Anderson      US      Gold        630000
7        146832  Kiehn-Spinka                      US      Silver      615000
8        688981  Keeling LLC                       US      Platinum    515000
9        786968  Frami, Hills and Schmidt          US      Gold        215000
10       239344  Stokes LLC                        US      Silver      230000
11       672390  Kuhn-Gusikowski                   APAC    Platinum    630000
12       141962  Herman LLC                        APAC    Gold        215000
13       424914  White-Trantow                     US      NaN         230000
14       527099  Sanford and Sons                  US      Platinum    615000
15       642753  Pollich LLC                       US      Gold        419000
16       383080  Will LLC                          US      Silver      415000
17       257198  Cronin, Oberbrunner and Spencer   US      Platinum    425000
18       604255  Halvorson, Crona and Champlin     US      NaN         430000
19       163416  Purdy-Kunde                       APAC    Silver      410000

The data looks pretty simple. There’s only one numeric column so let’s see what it totals to.

import pandas as pd

df = pd.read_excel('https://github.com/chris1610/pbpython/raw/master/data/sales_9_2022.xlsx')
df["Sales"].sum()
8000000

We have $8,000,000 in sales in the spreadsheet. Keep that number in mind.

Let’s do some simple analysis to summarize sales by region:

df.groupby(['Region']).agg({'Sales': 'sum'})
          Sales
Region
APAC    1255000
EMEA    1561000
US      5184000

We can double check the math:

df.groupby(['Region']).agg({'Sales': 'sum'}).sum()
Sales    8000000
dtype: int64

Looks good. That’s what we expect. Now let’s see what sales look like by Segment:

df.groupby(['Region', 'Segment']).agg({'Sales': 'sum'})

Which yields this table:

                   Sales
Region Segment
APAC   Gold       215000
       Platinum   630000
       Silver     410000
EMEA   Gold       630000
       Platinum   410000
       Silver     430000
US     Gold      1479000
       Platinum  1555000
       Silver    1260000

This looks good. No errors and the table seems reasonable. We should continue our analysis right?

Nope. There’s a potentially subtle issue here. Let’s sum the data to double check:

df.groupby(['Region', 'Segment']).agg({'Sales': 'sum'}).sum()
Sales    7019000
dtype: int64

This only includes $7,019,000. Where did the other $981,000 go? Is pandas broken?
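One quick way to track down the gap is to total the rows where Segment is missing; with this data, those rows account for exactly the difference (a minimal check using the same df):

# Rows with a missing Segment hold the "lost" sales
df[df['Segment'].isna()]['Sales'].sum()
981000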

We can see the issue clearly by using the dropna=False parameter to explicitly include NaN values in the results:

df.groupby(['Region', 'Segment'], dropna=False).agg({'Sales': 'sum'})

Now we can see the NaN combinations with EMEA and the US groupings:

                    Sales
Region Segment
APAC   Gold        215000
       Platinum    630000
       Silver      410000
EMEA   Gold        630000
       Platinum    410000
       Silver      430000
       NaN          91000
US     Gold       1479000
       Platinum   1555000
       Silver     1260000
       NaN         890000

If we check the sum, we can see it totals to $8M.

df.groupby(['Region', 'Segment'], dropna=False).agg({'Sales': 'sum'}).sum()
Sales    8000000
dtype: int64

The pandas documentation is very clear on this:

dropna: bool, default True
If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups.

The takeaway is that if your groupby columns contain any NA values, you need to make a conscious decision about whether or not you want to include those values in the grouped results.

If you are ok with dropping those values, then use the default dropna=True.

However, if you want to ensure that all values (Sales in this particular case) are included, then make sure to use dropna=False in your groupby call.
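A simple guard is to compare the grouped total against the raw column total. With the default dropna=True and this data, the check below fails, which is exactly the early warning you want (a sketch using the same df):

# Sanity check: the grouped total should match the raw total
grouped = df.groupby(['Region', 'Segment']).agg({'Sales': 'sum'})
assert grouped['Sales'].sum() == df['Sales'].sum(), "groupby dropped rows with NaN keys"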

An ounce of prevention

The main way to deal with this potential issue is to understand if you have any NaN values in your data. There are a couple of ways to do this.

You can use pure pandas:

df.isnull().sum()
Customer ID      0
Customer Name    0
Region           0
Segment          4
Sales            0
dtype: int64

There are other tools like missingno which provide a more robust interface for exploring the data.
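For instance, a couple of missingno calls will show the gaps visually (a minimal sketch, assuming the package is installed with pip install missingno):

import missingno as msno

# Bar chart of non-null counts per column - the Segment bar comes up short
msno.bar(df)

# Matrix view showing where the gaps fall, row by row
msno.matrix(df)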

I’m partial to sidetable. Here’s how to get it set up and put it to use.
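Installation and import are quick (a minimal sketch; importing sidetable is what registers the .stb accessor on pandas DataFrames):

# Install once from the command line: pip install sidetable
import pandas as pd
import sidetable  # registers the .stb accessor

With that in place, the missing-value summary is a single call: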

df.stb.missing()
               missing  total  percent
Segment              4     20     20.0
Customer ID          0     20      0.0
Customer Name        0     20      0.0
Region               0     20      0.0
Sales                0     20      0.0

Regardless of the approach you use, it’s worth keeping in mind that you need to know if you have any null or NaN values in your data and how you would like to handle them in your analysis.

The other alternative to using dropna is to explicitly fill in the missing values using fillna:

df.fillna('unknown').groupby(['Region', 'Segment']).agg({'Sales': 'sum'})

Now the unknown values are explicitly called out:

                    Sales
Region Segment
APAC   Gold        215000
       Platinum    630000
       Silver      410000
EMEA   Gold        630000
       Platinum    410000
       Silver      430000
       unknown      91000
US     Gold       1479000
       Platinum   1555000
       Silver     1260000
       unknown     890000
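Note that df.fillna('unknown') fills every column in the DataFrame. If other columns have missing values you would rather leave alone, one option is to fill just the groupby column before aggregating (a sketch):

# Fill only the Segment column, leaving the other columns untouched
(
    df.assign(Segment=df['Segment'].fillna('unknown'))
      .groupby(['Region', 'Segment'])
      .agg({'Sales': 'sum'})
)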

Conclusion

When working with pandas groupby, the results can be surprising if you have NaN values in your DataFrame columns. The default behavior is to drop those values, which means you can effectively “lose” some of your data during the process.

I have been bitten by this behavior several times in the past. In some cases, it might not be a big deal. In others, you might need to sheepishly explain why your numbers aren’t adding up.

Have you seen this before? Let me know in the comments below.

