Quantcast
Viewing all articles
Browse latest Browse all 23045

Reuven Lerner: Speedy string concatenation in Python

As many people know, one of the mantras of the Python programming language is, “There should be one, and only one, way to do it.”  (Use “import this” in your Python interactive shell to see the full list.)  However, there are often times when you could accomplish something in any of several ways. In such cases, it’s not always obvious which is the best one.

A student of mine recently e-mailed me, asking which is the most efficient way to concatenate strings in Python.

The results surprised me a bit — and gave me an opportunity to show her (and others) how to test such things.  I’m far from a benchmarking expert, but I do think that what I found gives some insights into concatenation.

First of all, let’s remember that Python provides us with several ways to concatenate strings.  We can use the + operator, for example:

>> 'abc' + 'def'
'abcdef'

We can also use the % operator, which can do much more than just concatenation, but which is a legitimate option:

>>> "%s%s" % ('abc', 'def')
'abcdef'

And as I’ve mentioned in previous blog posts, we also have a more modern way to do this, with the str.format method:

>>> '{0}{1}'.format('abc', 'def')
'abcdef'

As with the % operator, str.format is far more powerful than simple concatenation requires. But I figured that this would give me some insights into the relative speeds.

Now, how do we time things? In Jupyter (aka IPython), we can use the magic “timeit” command to run code.  I thus wrote four functions, each of which concatenates in a different way. I purposely used global variables (named “x” and “y”) to contain the original strings, and a local variable “z” in which to put the result.  The result was then returned from the function.  (We’ll play a bit with the values and definitions of “x” and “y” in a little bit.)

def concat1(): 
    z = x + y 
    return z 

 def concat2(): 
    z = "%s%s" % (x, y) 
    return z 

def concat3(): 
    z = "{}{}".format(x, y) 
    return z 

def concat4(): 
    z = "{0}{1}".format(x, y) 
    return z

I should note that concat3 and concat4 are almost identical, in that they both use str.format. The first uses the implicit locations of the parameters, and the second uses the explicit locations.  I decided that if I’m already benchmarking string concatenation, I might as well also find out if there’s any difference in speed when I give the parameters’ iindexes.

I then defined the two global variables:

x = 'abc' 
y = 'def'

Finally, I timed running each of these functions:

%timeit concat1()
%timeit concat2()
%timeit concat3()
%timeit concat4()

The results were as follows:

  • concat1: 153ns/loop
  • concat2: 275ns/loop
  • concat3: 398ns/loop
  • concat4: 393ns/loop

From this benchmark, we can see that concat1, which uses +, is significantly faster than any of the others.  Which is a bit sad, given how much I love to use str.format — but it also means that if I’m doing tons of string processing, I should stick to +, which might have less power, but is far faster.

The thing is, the above benchmark might be a bit problematic, because we’re using short strings.  Very short strings in Python are “interned,” meaning that they are defined once and then kept in a table so that they need not be allocated and created again.  After all, since strings are immutable, why would we create “abc” more than once?  We can just reference the first “abc” that we created.

This might mess up our benchmark a bit.  And besides, it’s good to check with something larger. Fortunately, we used global variables — so by changing those global variables’ definitions, we can run our benchmark and be sure that no interning is taking place:

x = 'abc' * 10000 
y = 'def' * 10000

Now, when we benchmark our functions again, here’s what we get:

  • concat1: 2.64µs/loop
  • concat2: 3.09µs/loop
  • concat3: 3.33µs/loop
  • concat4: 3.48µs/loop

Each loop took a lot longer — but we see that our + operator is still the fastest.  The difference isn’t as great, but it’s still pretty obvious and significant.

What about if we no longer use global variables, and if we allocate the strings within our function?  Will that make a difference?  Almost certainly not, but it’s worth a quick investigation:

def concat1(): 
     x = 'abc' * 10000 
     y = 'def' * 10000 
     z = x + y 
     return z 

def concat2(): 
     x = 'abc' * 10000 
     y = 'def' * 10000 
     z = "%s%s" % (x, y) 
     return z 

def concat3(): 
     x = 'abc' * 10000 
     y = 'def' * 10000 
     z = "{}{}".format(x, y) 
     return z 

def concat4(): 
     x = 'abc' * 10000 
     y = 'def' * 10000 
     z = "{0}{1}".format(x, y) 
     return z 

And our final results are:

  • concat1: 4.89µs/loop
  • concat2: 5.78µs/loop
  • concat3: 6.22µs/loop
  • concat4: 6.19µs/loop

Once again, we see that + is the big winner here, but (again) but less of a margin than was the case with the short strings.  str.format is clearly shorter.  And we can see that in all of these tests, the difference between “{0}{1}” and “{}{}” in str.format is basically zero.

Upon reflection, this shouldn’t be a surprise. After all, + is a pretty simple operator, whereas % and str.format do much more.  Moreover, str.format is a method, which means that it’ll have greater overhead.

Now, there are a few more tests that I could have run — for example, with more than two strings.  But I do think that this demonstrates to at least some degree that + is the fastest way to achieve concatenation in Python.  Moreover, it shows that we can do simple benchmarking quickly and easily, conducting experiments that help us to understand which is the best way to do something in Python.

The post Speedy string concatenation in Python appeared first on Lerner Consulting Blog.


Viewing all articles
Browse latest Browse all 23045

Trending Articles