Django 2.0 came out a couple of weeks ago. It now supports "conditional aggregation", a standard SQL feature I didn't even know about.
## Before
So I have a Django app with an endpoint that generates some human-friendly stats about the number of uploads (and their total size) in various time intervals.
First of all, this is how it sets up the time intervals:
```python
today = timezone.now()
start_today = today.replace(hour=0, minute=0, second=0)
start_yesterday = start_today - datetime.timedelta(days=1)
start_this_month = today.replace(day=1)
start_this_year = start_this_month.replace(month=1)
```
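As a side note, here's a runnable plain-`datetime` sketch of those boundaries. A fixed, made-up timestamp stands in for `timezone.now()`, and this version also zeroes `microsecond` and pins the month/year starts to midnight, which the snippet above doesn't do:

```python
import datetime

# Hypothetical fixed "now", standing in for timezone.now()
today = datetime.datetime(2017, 12, 20, 15, 30, 45)

start_today = today.replace(hour=0, minute=0, second=0, microsecond=0)
start_yesterday = start_today - datetime.timedelta(days=1)
# Deriving from start_today (not today) keeps these at midnight too
start_this_month = start_today.replace(day=1)
start_this_year = start_this_month.replace(month=1)

assert start_today == datetime.datetime(2017, 12, 20)
assert start_yesterday == datetime.datetime(2017, 12, 19)
assert start_this_month == datetime.datetime(2017, 12, 1)
assert start_this_year == datetime.datetime(2017, 1, 1)
```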
And then, for each of these, there's a little function that returns a dict for each time interval:
```python
def count_and_size(qs, start, end):
    sub_qs = qs.filter(created_at__gte=start, created_at__lt=end)
    return {
        'count': sub_qs.count(),
        'total_size': sub_qs.aggregate(size=Sum('size'))['size'],
    }


numbers['uploads'] = {
    'today': count_and_size(upload_qs, start_today, today),
    'yesterday': count_and_size(upload_qs, start_yesterday, start_today),
    'this_month': count_and_size(upload_qs, start_this_month, today),
    'this_year': count_and_size(upload_qs, start_this_year, today),
}
```
What you get is exactly 2 x 4 = 8 queries: one `COUNT` and one `SUM` for each time interval. E.g.
```sql
SELECT SUM("upload_upload"."size") AS "size"
FROM "upload_upload"
WHERE ("upload_upload"."created_at" >= ...

SELECT COUNT(*) AS "__count"
FROM "upload_upload"
WHERE ("upload_upload"."created_at" >= ...
```

...6 more queries...
## Middle
Oops. I think this code comes from a slightly rushed job. We can do the `COUNT` and the `SUM` at the same time, in a single query per time interval.
```python
# New, improved count_and_size() function!
def count_and_size(qs, start, end):
    sub_qs = qs.filter(created_at__gte=start, created_at__lt=end)
    return sub_qs.aggregate(
        count=Count('id'),
        total_size=Sum('size'),
    )


numbers['uploads'] = {
    'today': count_and_size(upload_qs, start_today, today),
    'yesterday': count_and_size(upload_qs, start_yesterday, start_today),
    'this_month': count_and_size(upload_qs, start_this_month, today),
    'this_year': count_and_size(upload_qs, start_this_year, today),
}
```
Much better: now there's only one query per time bucket, so 4 queries in total. E.g.
```sql
SELECT COUNT("upload_upload"."id") AS "count",
       SUM("upload_upload"."size") AS "total_size"
FROM "upload_upload"
WHERE ("upload_upload"."created_at" >= ...
```

...3 more queries...
## After
But we can do better than that! Instead, we use conditional aggregation. The syntax gets a bit hairy because there are so many keyword arguments, but I hope I've indented it nicely so it's easy to see how it works:
```python
def make_q(start, end):
    return Q(created_at__gte=start, created_at__lt=end)


q_today = make_q(start_today, today)
q_yesterday = make_q(start_yesterday, start_today)
q_this_month = make_q(start_this_month, today)
q_this_year = make_q(start_this_year, today)

aggregates = upload_qs.aggregate(
    today_count=Count('pk', filter=q_today),
    today_total_size=Sum('size', filter=q_today),
    yesterday_count=Count('pk', filter=q_yesterday),
    yesterday_total_size=Sum('size', filter=q_yesterday),
    this_month_count=Count('pk', filter=q_this_month),
    this_month_total_size=Sum('size', filter=q_this_month),
    this_year_count=Count('pk', filter=q_this_year),
    this_year_total_size=Sum('size', filter=q_this_year),
)

numbers['uploads'] = {
    'today': {
        'count': aggregates['today_count'],
        'total_size': aggregates['today_total_size'],
    },
    'yesterday': {
        'count': aggregates['yesterday_count'],
        'total_size': aggregates['yesterday_total_size'],
    },
    'this_month': {
        'count': aggregates['this_month_count'],
        'total_size': aggregates['this_month_total_size'],
    },
    'this_year': {
        'count': aggregates['this_year_count'],
        'total_size': aggregates['this_year_total_size'],
    },
}
```
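If the repetition bothers you, the reshaping at the end can be done with a dict comprehension over the bucket names. A minimal, runnable sketch of the idea, where the flat `aggregates` dict is filled with made-up placeholder values standing in for what `upload_qs.aggregate(...)` would return:

```python
buckets = ['today', 'yesterday', 'this_month', 'this_year']

# Placeholder for the flat dict that aggregate() returns;
# the numbers are made up purely for illustration.
aggregates = {
    'today_count': 5, 'today_total_size': 500,
    'yesterday_count': 3, 'yesterday_total_size': 300,
    'this_month_count': 40, 'this_month_total_size': 4000,
    'this_year_count': 813, 'this_year_total_size': 81300,
}

# Reshape '<bucket>_count' / '<bucket>_total_size' keys into nested dicts
numbers_uploads = {
    name: {
        'count': aggregates[f'{name}_count'],
        'total_size': aggregates[f'{name}_total_size'],
    }
    for name in buckets
}

assert numbers_uploads['yesterday'] == {'count': 3, 'total_size': 300}
```

The same loop trick works for building the keyword arguments to `aggregate()` itself, at the cost of making the query a bit less obvious to read.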
Voila! One single query to get all those pieces of data.
The SQL sent to PostgreSQL looks something like this:
```sql
SELECT
    COUNT("upload_upload"."id") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "today_count",
    SUM("upload_upload"."size") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "today_total_size",
    COUNT("upload_upload"."id") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "yesterday_count",
    SUM("upload_upload"."size") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "yesterday_total_size",
    ...
FROM "upload_upload";
```
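Incidentally, on databases that don't support the standard `FILTER` clause, the classic portable spelling of conditional aggregation puts a `CASE WHEN` inside the aggregate. A self-contained SQLite sketch (the table and sample rows are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE upload_upload (size INTEGER, created_at TEXT)")
conn.executemany(
    "INSERT INTO upload_upload (size, created_at) VALUES (?, ?)",
    [
        (100, '2017-12-20'),  # "today"
        (200, '2017-12-20'),  # "today"
        (300, '2017-12-19'),  # "yesterday"
    ],
)

# CASE WHEN ... END is the portable equivalent of FILTER (WHERE ...)
row = conn.execute("""
    SELECT
        SUM(CASE WHEN created_at >= '2017-12-20' THEN 1 ELSE 0 END),
        SUM(CASE WHEN created_at >= '2017-12-20' THEN size END),
        SUM(CASE WHEN created_at >= '2017-12-19'
                  AND created_at < '2017-12-20' THEN 1 ELSE 0 END),
        SUM(CASE WHEN created_at >= '2017-12-19'
                  AND created_at < '2017-12-20' THEN size END)
    FROM upload_upload
""").fetchone()

# row == (2, 300, 1, 300): today's count and size, yesterday's count and size
```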
Is this the best thing to do? I'm starting to have my doubts.
## Watch Out!
When I take this, now one monster query, for a spin with an `EXPLAIN ANALYZE` prefix, I notice something worrying!
```
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=74.33..74.34 rows=1 width=16) (actual time=0.587..0.587 rows=1 loops=1)
   ->  Seq Scan on upload_upload  (cost=0.00..62.13 rows=813 width=16) (actual time=0.012..0.210 rows=813 loops=1)
 Planning time: 0.427 ms
 Execution time: 0.674 ms
(4 rows)
```
A sequential scan! That's terrible. The `created_at` column has a `BTREE` index, so why can't it use the index?
The short answer is: I don't know!
I've uploaded a reduced, but still complete, example demonstrating this in a gist. It's very similar to the example in the Stack Overflow question I asked.
So what did I do? I went back to the "middle" solution: one `SELECT` query per time bucket. So 4 queries in total, but at least all 4 are able to use an index.