Sebastian Witowski: How to Benchmark (Python) Code

While preparing to write the "Writing Faster Python" series, the first problem I faced was: "How do I benchmark a piece of code in an objective, yet uncomplicated way?"

I could run python -m timeit <piece of code>, which is probably the simplest way of measuring how long it takes to execute some code[1]. But maybe it's too simple, and I owe my readers a way of benchmarking that won't be skewed by sudden CPU spikes on my computer?

So here are a couple of different tools and techniques I tried. At the end of the article, I will tell you which one I chose and why, plus give you some rules of thumb for when each tool might come in handy.

python -m timeit #

The easiest way to measure how long it takes to run some code is to use the timeit module. You can write python -m timeit your_code() and Python will print out how long it took to run whatever your_code() does. I like to put the code I want to benchmark inside a function for more clarity, but you don't have to do this. You can directly write multiple Python statements separated by semicolons and that will work just fine. For example, to see how long it takes to sum up the first 1,000,000 numbers, we can run this code:

python -m timeit "sum(range(1_000_001))"
20 loops, best of 5: 11.5 msec per loop
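
As mentioned, you can also chain several statements with semicolons. For example, this (arbitrary) variant splits the same computation into two statements:

python -m timeit "numbers = range(1_000_001); sum(numbers)"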

However, the python -m timeit approach has a major drawback: it doesn't separate the setup code from the code you want to benchmark. Let's say you have an import that takes a relatively long time compared to executing a function from that module. One such example is import numpy. If we benchmark those two lines of code:

import numpy
numpy.arange(10)

the import will take most of the time during the benchmark. But you probably don't want to benchmark how long it takes to import modules. You just want to see how long it takes to execute some functions from that module.
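
For illustration, this is what the naive invocation looks like when both lines are benchmarked together (using the semicolon style from earlier):

python -m timeit "import numpy; numpy.arange(10)"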

python -m timeit -s "setup code" #

To separate the setup code from the code that is benchmarked, timeit supports the -s parameter. Whatever code you pass to that parameter will be executed, but it won't be part of the benchmark. So we can improve the above code and run it like this: python -m timeit -s "import numpy" "numpy.arange(10)".
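
The same separation is available from Python code through the timeit module's API. A minimal sketch (the loop count of 10,000 is arbitrary):

import timeit

# the setup string runs once; only the statement is executed `number` times and timed
total = timeit.timeit("numpy.arange(10)", setup="import numpy", number=10_000)
print(f"{total / 10_000 * 1e6:.2f} usec per loop")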

python -m timeit -s "setup code" -n 10000 #

We can be a bit more strict and decide to execute our code the same number of times on each run. By default, if you don't specify the -n (or --number) parameter, timeit will try to run your code 1, 2, 5, 10, 20, ... times until the total execution time exceeds 0.2 seconds. A slow function might be executed only once, while a very fast one will run thousands of times. If you think that executing different code snippets a different number of times affects your benchmarks, you can set this parameter to a predefined number.
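
For example, to pin the numpy benchmark from the previous section to exactly 10,000 loops:

python -m timeit -s "import numpy" -n 10000 "numpy.arange(10)"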

docker #

One of the issues with running benchmarks with python -m timeit is that sometimes other processes on your computer might affect the Python process and randomly slow it down. For example, I've noticed that if I run my benchmarks with all the usual applications open (multiple Chrome instances with plenty of tabs, Teams and other messenger apps, etc.) all the benchmarks will take a bit longer than if I close basically all the apps on my computer.

So while trying to figure out how to avoid this situation, I decided to try running my benchmarks in Docker. I came up with the following command:

docker run -w /home -it -v $(pwd):/home python:3.10.4-alpine python -m timeit -s "<some setup code>" "my_function()"

The above code will:

  1. Run the Python Alpine Docker container (a small, bare-bones image with Python).
  2. Mount the current folder inside the Docker container (so we have access to the scripts with the code we want to benchmark).
  3. Run the same timeit command as before.
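
To make this concrete: assuming the current folder contains a (hypothetical) my_script.py defining my_function, the full invocation could look like this:

docker run -w /home -it -v $(pwd):/home python:3.10.4-alpine python -m timeit -s "from my_script import my_function" "my_function()"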

And the results were more consistent than without using Docker. Rerunning the benchmarks multiple times, I was getting results with a smaller deviation. There was still some deviation (some runs were slightly slower, some slightly faster), but the spread was smaller than without Docker.

Python benchmarking libraries #

At some point, you might decide that the "best of 5" number that timeit returns by default is not enough. What if I need to know the most pessimistic scenario (the maximum time it took to run my code)? Or the difference between the slowest and the fastest run? Is that difference huge, meaning my function runs in a completely unpredictable amount of time? Or is it so tiny that it's almost negligible?

There are better benchmarking tools that offer more statistics about your code.

rich-bench #

The first tool I checked was the rich-bench package, created by Anthony Shaw together with his anti-patterns repository for a PyCon talk. It's a small tool that can benchmark a set of files with different code examples and present the results in a nicely formatted table. Each benchmark compares two different functions and reports the mean of the difference between their execution times, as well as the min and max difference, so you can easily see the spread between the results.
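
As far as I remember the project's conventions (the exact structure here is my assumption, so double-check the rich-bench README), a benchmark file defines pairs of functions and registers them in a __benchmarks__ list, roughly like this:

# benchmarks/bench_sum.py - structure based on my recollection of rich-bench
def sum_squares_loop():
    total = 0
    for n in range(1_000_001):
        total += n * n
    return total

def sum_squares_genexp():
    return sum(n * n for n in range(1_000_001))

__benchmarks__ = [
    (sum_squares_loop, sum_squares_genexp, "for loop vs generator expression"),
]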

pyperf #

If you need a more advanced benchmarking tool, you probably can't go wrong with the official tool used by the Python Performance Benchmark Suite - an authoritative source of benchmarks for all Python implementations. pyperf is a really exhaustive tool with many different features, including automatic calibration, detection of unstable results, tracking of memory usage, and different modes of operation, depending on whether you want to compare different pieces of code or get a bunch of stats for one function.

Let's see an example. For the benchmarks, I will use a simple but inefficient way to calculate the sum of squares of the numbers up to 1,000,000: sum(n * n for n in range(1_000_001)).

Here is the output from the timeit module:

$ python -m timeit "sum(n * n for n in range(1_000_001))"
5 loops, best of 5: 41 msec per loop

And here is the output from pyperf:

$ python -m pyperf timeit "sum(n * n for n in range(1_000_001))" -o bench.json
.....................
Mean +- std dev: 41.5 ms +- 1.1 ms

The results are very similar, but with the -o parameter we told pyperf to store the benchmark results in a JSON file, so now we can analyze them and get much more information:

$ python -m pyperf stats bench.json
Total duration: 14.5 sec
Start date: 2022-11-09 18:19:37
End date: 2022-11-09 18:19:53
Raw value minimum: 163 ms
Raw value maximum: 198 ms

Number of calibration run: 1
Number of run with values: 20
Total number of run: 21

Number of warmup per run: 1
Number of value per run: 3
Loop iterations per value: 4
Total number of values: 60

Minimum: 40.8 ms
Median +- MAD: 41.3 ms +- 0.2 ms
Mean +- std dev: 41.5 ms +- 1.1 ms
Maximum: 49.6 ms

0th percentile: 40.8 ms (-2% of the mean) -- minimum
5th percentile: 40.9 ms (-1% of the mean)
25th percentile: 41.2 ms (-1% of the mean) -- Q1
50th percentile: 41.3 ms (-0% of the mean) -- median
75th percentile: 41.5 ms (+0% of the mean) -- Q3
95th percentile: 41.9 ms (+1% of the mean)
100th percentile: 49.6 ms (+20% of the mean) -- maximum

Number of outlier (out of 40.7 ms..41.9 ms): 3
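
pyperf also has a Python API, so you can keep your benchmarks in a script instead of typing them on the command line. A minimal sketch using pyperf.Runner:

# bench_sum.py
import pyperf

# the Runner takes care of calibration, warmups, and spawning worker processes
runner = pyperf.Runner()
runner.timeit("sum of squares", stmt="sum(n * n for n in range(1_000_001))")

Running python bench_sum.py -o bench.json should give you a JSON file that you can feed to python -m pyperf stats, just like above.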

hyperfine #

And in case you want to benchmark code that is not Python code, there is always hyperfine, which can benchmark any CLI command. hyperfine has a similar set of features to pyperf. It automatically does warmup runs, clears the cache, and detects statistical outliers. And the nice progress bars and colors make the output look beautiful.

You can run it with one command, and it will return the usual information like the mean, min, and max time, standard deviation, number of runs, etc. But you can also pass multiple commands, and you will get a comparison of which one was faster:

[Screenshot: hyperfine comparing two commands]
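
For reference, such a comparison could look something like this (the second snippet is just an arbitrary alternative I picked for illustration; note that hyperfine measures the whole python -c invocation, so interpreter startup is included in both timings):

hyperfine --warmup 3 'python -c "sum(n * n for n in range(1_000_001))"' 'python -c "sum(map(lambda n: n * n, range(1_000_001)))"'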

timeit is just fine...for me #

In the end, I chose a very simple way of benchmarking: python -m timeit -s "setup code" "code to benchmark". I don't have to use the perfect way of benchmarking (if such a thing even exists). That would be necessary if I were benchmarking one piece of code and sharing the results with the world. I couldn't use a random, inefficient method of measuring and tell you "this piece of code is bad because it runs in 15 seconds". You could use a better benchmarking tool, run it on a powerful computer, and end up with the same code running in 1.5 seconds.

Comparing two pieces of code is a different story. Sure, a good, reliable benchmarking methodology is important. But in the end, we care about the relative speed difference between the code examples. If my computer runs "Example A" in 10 seconds and "Example B" in 20 seconds, and your computer runs them in 5 and 10 seconds respectively, we can both conclude that "Example B" is twice as slow.

Using timeit is good enough. It lets me separate the setup code from the actual code I want to benchmark. And if you want to run the same benchmarks on your computer, you can do so straight away. timeit already ships with your distribution of Python, so you don't have to install any additional library or set up Docker.

Much more important than having the most accurate tool is how you set up your benchmarks.

Beware of how you structure your code #

Running benchmarks is the easy part. The tricky part is remembering to write your code in a way that won't "cheat". When I first wrote the Sorting Lists article, I was so happy to find that sort() was so much faster than sorted(). "OMG, I found the holy grail of sorting in Python", I thought. Then someone pointed out that list.sort() sorts the list in place. So when I ran my benchmarks, the first iteration sorted the list (which is slow) and every following iteration sorted an already sorted list (which is much faster). I had to update my article and start paying more attention to how I organize my benchmarks.
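
To illustrate the pitfall with a sketch (this is not necessarily how the original article was fixed): sorting a fresh copy in every run keeps the comparison fair, at the cost of adding the same copying overhead to both variants.

# cheating: lst.sort() mutates lst, so every run after the first sorts an already sorted list
python -m timeit -s "import random; lst = [random.random() for _ in range(10_000)]" "lst.sort()"

# fairer: each run sorts a fresh list, and both variants pay the same copy cost
python -m timeit -s "import random; lst = [random.random() for _ in range(10_000)]" "list(lst).sort()"
python -m timeit -s "import random; lst = [random.random() for _ in range(10_000)]" "sorted(lst)"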

I had to make some trade-offs. For example, a lot of my examples use global variables because I want to exclude the time it takes to build all the input data from my benchmarks. So I put the input data into global variables and only benchmark the code that really matters. Keeping the code examples as small as possible means a lower chance that some irrelevant code (like constructing a very long list of input numbers) will take most of the time during the benchmark, or that the interpreter will apply some micro-optimization in one of the examples and not in the other because of a very specific corner case that might not always hold. But global variables are slower than local variables - they simply take longer to look up. So my benchmarks are slower because of them, and I could have faster code by using local variables. But they are ALL slower by the same amount. If two examples both use a global variable, that's fine - they both pay the same penalty.
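
In timeit terms, this simply means building the input data in the setup string (so it acts like a global for the benchmarked statement) and timing only the operation itself:

python -m timeit -s "data = list(range(1_000_000))" "sum(data)"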

Conclusion #

Depending on your use case, you might reach for a different tool to benchmark your code:

  • python -m timeit "some code" for the simplest, easiest-to-run benchmarks where you just want to get "a number".
  • python -m timeit -s "setup code" "some code" is a much more useful version if you want to separate the setup code from the actual benchmarks.
  • docker is a good alternative if you want to isolate your benchmark process from other processes running on your computer.
  • rich-bench is a nice solution if you need a dedicated tool with some additional statistics like min, max, and median, plus nice output formatting, and you're willing to set up your benchmarks in the specific structure that rich-bench requires.
  • pyperf gives you the most advanced set of statistics about your code. And it's used by the official Python benchmarks, so it's an excellent tool for advanced benchmarking.
  • hyperfine is a great tool if you want to benchmark any command, not only Python code, or if you want to compare two commands against each other.


  1. Ok, technically I could print the current time with time.time(), run my code, print time.time() again, and subtract those two values. But, come on, that's not simple, that's rudimentary. ↩︎

