Peter Bengtsson: Conditional aggregation in Django 2.0

Django 2.0 came out a couple of weeks ago. It now supports "conditional aggregation", which is an SQL standard feature I didn't even know about.
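In practice it means the aggregate functions (Count, Sum, and friends) now accept a filter keyword argument that takes a Q object. A minimal sketch, assuming a hypothetical Book model with a published_at field:

from django.db.models import Count, Q

# Count all books and, in the same single query, only those published in 2017.
# (Book and published_at are made-up names, purely for illustration.)
Book.objects.aggregate(
    total=Count('pk'),
    published_2017=Count('pk', filter=Q(published_at__year=2017)),
)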

Before

So I have a Django app which has an endpoint that generates some human-friendly stats about the number of uploads (and their total size) in various time intervals.

First of all, this is how it sets up the time intervals:

today = timezone.now()
start_today = today.replace(hour=0, minute=0, second=0)
start_yesterday = start_today - datetime.timedelta(days=1)
start_this_month = today.replace(day=1)
start_this_year = start_this_month.replace(month=1)

And then, for each of these, there's a little function that returns a dict for each time interval:

def count_and_size(qs, start, end):
    sub_qs = qs.filter(created_at__gte=start, created_at__lt=end)
    return {
        'count': sub_qs.count(),
        'total_size': sub_qs.aggregate(size=Sum('size'))['size'],
    }

numbers['uploads'] = {
    'today': count_and_size(upload_qs, start_today, today),
    'yesterday': count_and_size(upload_qs, start_yesterday, start_today),
    'this_month': count_and_size(upload_qs, start_this_month, today),
    'this_year': count_and_size(upload_qs, start_this_year, today),
}

What you get is exactly 2 x 4 = 8 queries. One COUNT and one SUM for each time interval. E.g.

SELECT SUM("upload_upload"."size") AS "size"
FROM "upload_upload"
WHERE ("upload_upload"."created_at" >= ...

SELECT COUNT(*) AS "__count"
FROM "upload_upload"
WHERE ("upload_upload"."created_at" >= ...

...6 more queries...

Middle

Oops. I think this code comes from a slightly rushed job. We can do the COUNT and the SUM at the same time for each query.

# New, improved count_and_size() function!
def count_and_size(qs, start, end):
    sub_qs = qs.filter(created_at__gte=start, created_at__lt=end)
    return sub_qs.aggregate(
        count=Count('id'),
        total_size=Sum('size'),
    )

numbers['uploads'] = {
    'today': count_and_size(upload_qs, start_today, today),
    'yesterday': count_and_size(upload_qs, start_yesterday, start_today),
    'this_month': count_and_size(upload_qs, start_this_month, today),
    'this_year': count_and_size(upload_qs, start_this_year, today),
}

Much better: now there's only one query per time bucket, so 4 queries in total. E.g.

SELECT COUNT("upload_upload"."id") AS "count", SUM("upload_upload"."size") AS "total_size"
FROM "upload_upload"
WHERE ("upload_upload"."created_at" >= ...

...3 more queries...

After

But we can do better than that! Instead, we use conditional aggregation. The syntax gets a bit hairy because there are so many keyword arguments, but I hope I've indented it nicely so it's easy to see how it works:

def make_q(start, end):
    return Q(created_at__gte=start, created_at__lt=end)

q_today = make_q(start_today, today)
q_yesterday = make_q(start_yesterday, start_today)
q_this_month = make_q(start_this_month, today)
q_this_year = make_q(start_this_year, today)

aggregates = upload_qs.aggregate(
    today_count=Count('pk', filter=q_today),
    today_total_size=Sum('size', filter=q_today),

    yesterday_count=Count('pk', filter=q_yesterday),
    yesterday_total_size=Sum('size', filter=q_yesterday),

    this_month_count=Count('pk', filter=q_this_month),
    this_month_total_size=Sum('size', filter=q_this_month),

    this_year_count=Count('pk', filter=q_this_year),
    this_year_total_size=Sum('size', filter=q_this_year),
)

numbers['uploads'] = {
    'today': {
        'count': aggregates['today_count'],
        'total_size': aggregates['today_total_size'],
    },
    'yesterday': {
        'count': aggregates['yesterday_count'],
        'total_size': aggregates['yesterday_total_size'],
    },
    'this_month': {
        'count': aggregates['this_month_count'],
        'total_size': aggregates['this_month_total_size'],
    },
    'this_year': {
        'count': aggregates['this_year_count'],
        'total_size': aggregates['this_year_total_size'],
    },
}

Voila! One single query to get all those pieces of data.
The SQL sent to PostgreSQL looks something like this:

SELECT 
  COUNT("upload_upload"."id") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "today_count",
  SUM("upload_upload"."size") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "today_total_size",

  COUNT("upload_upload"."id") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "yesterday_count",
  SUM("upload_upload"."size") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "yesterday_total_size",

  ...

FROM "upload_upload";

Is this the best thing to do? I'm starting to have my doubts.

Watch Out!

When I take this now-single monster query for a spin with an EXPLAIN ANALYZE prefix, I notice something worrying!
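For reference, here's roughly how one can get that plan out of Django. This is just a sketch; it assumes DEBUG=True (so that connection.queries records the SQL that was sent) and reuses upload_qs and the Q objects from the snippet above:

from django.db import connection, reset_queries

# Run the big aggregate once so Django records the SQL it sent.
reset_queries()
upload_qs.aggregate(
    today_count=Count('pk', filter=q_today),
    today_total_size=Sum('size', filter=q_today),
    # ...and so on for the other buckets...
)
sql = connection.queries[-1]['sql']

# Ask PostgreSQL for the execution plan of that exact query.
with connection.cursor() as cursor:
    cursor.execute('EXPLAIN ANALYZE ' + sql)
    for (line,) in cursor.fetchall():
        print(line)

It prints a plan along these lines: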

                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=74.33..74.34 rows=1 width=16) (actual time=0.587..0.587 rows=1 loops=1)
   ->  Seq Scan on upload_upload  (cost=0.00..62.13 rows=813 width=16) (actual time=0.012..0.210 rows=813 loops=1)
 Planning time: 0.427 ms
 Execution time: 0.674 ms
(4 rows)

A sequential scan! That's terrible. The created_at column is indexed with a BTREE index, so why can't it use the index?

The short answer is: I don't know!
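For context, there's nothing exotic about the index or the model. A rough sketch (field names guessed purely for illustration), where db_index=True is what gives you the plain BTREE index on created_at in PostgreSQL:

from django.db import models
from django.utils import timezone

class Upload(models.Model):
    # Only the fields relevant to this post; the exact definitions are a guess.
    size = models.PositiveIntegerField()
    created_at = models.DateTimeField(default=timezone.now, db_index=True)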
I've uploaded a reduced, but still complete, example demonstrating this in a gist. It's very similar to the example in the stackoverflow question I asked.

So what did I do? I went back to the "middle" solution: one SELECT query per time bucket. So 4 queries in total, but at least all 4 are able to use an index.

