Quantcast
Channel: Planet Python
Viewing all articles
Browse latest Browse all 22462

Peter Bengtsson: Optimization of QuerySet.get() with or without select_related

$
0
0

If you know you're going to look up a related Django ORM object from another one, Django automatically takes care of that for you.

To illustrate, imaging a mapping that looks like this:

classArtist(models.Models):name=models.CharField(max_length=200)...classSong(models.Models):artist=models.ForeignKey(Artist)...

And with that in mind, suppose you do this:

>>>Song.objects.get(id=1234567).artist.name'Frank Zappa'

Internally, what Django does is that it looks the Song object first, then it does a look up automatically on the Artist. In PostgreSQL it looks something like this:

SELECT "main_song"."id", "main_song"."artist_id", ... FROM "main_song" WHERE "main_song"."id" = 1234567
SELECT "main_artist"."id", "main_artist"."name", ... FROM "main_artist" WHERE "main_artist"."id" = 111

Pretty clear. Right.

Now if you know you're going to need to look up that related field you can ask Django to make a join before the lookup even happens. It looks like this:

>>>Song.objects.select_related('artist').get(id=1234567).artist.name'Frank Zappa'

And the SQL needed looks like this:

SELECT "main_song"."id", ... , "main_artist"."name", ... 
FROM "main_song" INNER JOIN "main_artist" ON ("main_song"."artist_id" = "main_artist"."id") WHERE "main_song"."id" = 1234567

The question is; which is fastest?

Well, there's only one way to find out and that is to measure with some relatistic data.

Here's the benchmarking code:

deff1(id):try:returnSong.objects.get(id=id).artist.nameexceptSong.DoesNotExist:passdeff2(id):try:returnSong.objects.select_related('artist').get(id=id).artist.nameexceptSong.DoesNotExist:passdef_stats(r):#returns the median, average and standard deviation of a sequencetot=sum(r)avg=tot/len(r)sdsq=sum([(i-avg)**2foriinr])s=list(r)s.sort()returns[len(s)//2],avg,(sdsq/(len(r)-1or1))**.5times=defaultdict(list)functions=[f1,f2]foridinrange(100000,103000):forfinfunctions:t0=time.time()r=f(id)t1=time.time()ifr:times[f.__name__].append(t1-t0)# Shuffle the order so that one doesn't benefit more# from deep internal optimizations/caching in Postgre.random.shuffle(functions)fork,valuesintimes.items():print(k,[round(x*1000,2)forxin_stats(values)])

For the record, here are the parameters of this little benchmark:

  • The server is a 4Gb DigitialOcean server that is being used but not super busy
  • The PostgreSQL server is running on the same virtual machine as the Python process used in the benchmark
  • Django 1.10, Python 3.5, PostgreSQL 9.5.4
  • 1,588,258 "Song" objects, 102,496 "Artist" objects
  • Note that the range of IDs has gaps but they're not counted

The Result

FunctionMedianAverageStd Dev
f13.19ms9.17ms19.61ms
f22.28ms6.28ms15.30ms

The Conclusion

If you use the median, using select_related is 30% faster and if you use the average, using select_related is 46% faster.

So, if you know you're going to need to do that lookup put in .select_related(relation) before every .get(id=...) in your Django code.

Deep down in PostgreSQL, the inner join is ultimately two ID-by-index lookups. And that's what the first method is too. It's likely that the reason the inner join approach is faster is simply because there's less connection overheads.

Lastly, YOUR MILEAGE WILL VARY. Every benchmark is flawed but this quite realistic because it's not trying to be optimized in either way.


Viewing all articles
Browse latest Browse all 22462

Trending Articles