
Whatever your programs are doing, they often have to deal with vast amounts of data. This data is usually represented and manipulated in the form of strings. However, handling such a large quantity of input as strings can be very inefficient once you start manipulating the data by copying, slicing, and modifying it. Why?
Let's consider a small program which reads a large file of binary data and copies it partially into another file. To examine the memory usage of this program, we will use https://pypi.python.org/pypi/memory_profiler[memory_profiler], an excellent Python package that allows us to see the memory usage of a program line by line.
@profile
def read_random():
    with open("/dev/urandom", "rb") as source:
        content = source.read(1024 * 10000)
        content_to_write = content[1024:]
    print("Content length: %d, content to write length %d" %
          (len(content), len(content_to_write)))
    with open("/dev/null", "wb") as target:
        target.write(content_to_write)

if __name__ == '__main__':
    read_random()
Running the above program using memory_profiler produces the following:
$ python -m memory_profiler memoryview/copy.py
Content length: 10240000, content to write length 10238976
Filename: memoryview/copy.py

Mem usage    Increment   Line Contents
======================================
                         @profile
 9.883 MB     0.000 MB   def read_random():
 9.887 MB     0.004 MB       with open("/dev/urandom", "rb") as source:
19.656 MB     9.770 MB           content = source.read(1024 * 10000)
29.422 MB     9.766 MB           content_to_write = content[1024:]
29.422 MB     0.000 MB       print("Content length: %d, content to write length %d" %
29.434 MB     0.012 MB             (len(content), len(content_to_write)))
29.434 MB     0.000 MB       with open("/dev/null", "wb") as target:
29.434 MB     0.000 MB           target.write(content_to_write)
The call to source.read reads 10 MB from /dev/urandom. Python needs to allocate around 10 MB of memory to store this data as a string. The instruction on the line just after, content[1024:], copies the entire block of data minus the first kilobyte, allocating 10 more megabytes.
The interesting thing to notice here is that the memory usage of the program increases by about 10 MB when building the variable content_to_write: the slice operator copies the entirety of content, minus the first kilobyte, into a new string object.
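You can verify the copy by comparing object sizes. Here is a quick check; the exact figures below assume a 64-bit CPython build, where a bytes object carries 33 bytes of overhead on top of its payload:
>>> import sys
>>> content = b"x" * (1024 * 10000)
>>> sys.getsizeof(content)
10240033
>>> sys.getsizeof(content[1024:])
10239009
Both objects carry their own full payload: the slice produced a second, nearly as large, allocation.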
When dealing with extensive data, performing this kind of operation on large byte arrays is going to be a disaster. If you have ever written C code, you know that using memcpy() has a significant cost, both in terms of memory usage and in terms of general performance: copying memory is slow.
However, as a C programmer, you also know that strings are arrays of characters and that nothing stops you from looking at only part of such an array without copying it, through the use of basic pointer arithmetic, assuming that the entire string is in a contiguous memory area.
This is possible in Python using objects which implement the buffer protocol. The buffer protocol is defined in http://www.python.org/dev/peps/pep-3118/[PEP 3118], which explains the C API used to provide this protocol to various types, such as strings.
When an object implements this protocol, you can use the memoryview class constructor on it to build a new memoryview object that references the original object's memory.
>>> s = b"abcdefgh"
>>> view = memoryview(s)
>>> view[1]
98
>>> limited = view[1:3]
>>> limited
<memory at 0x7fca18b8d460>
>>> bytes(view[1:3])
b'bc'
Note: 98 is the ASCII code for the letter b.
In the example above, we use the fact that the memoryview object's slice operator itself returns a memoryview object. That means it does not copy any data, but merely references a particular slice of it.
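To convince yourself that the view shares memory rather than holding a copy, you can build a memoryview over a mutable bytearray and then mutate the underlying object; the change is visible through the view (a small illustrative sketch):
>>> ba = bytearray(b"abcdefgh")
>>> view = memoryview(ba)[2:5]
>>> ba[2] = ord("Z")
>>> bytes(view)
b'Zde'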
Therefore, it is possible to rewrite the program above in a more efficient manner. We need to reference the data that we want to write using a memoryview object, rather than allocating a new string.
@profile
def read_random():
    with open("/dev/urandom", "rb") as source:
        content = source.read(1024 * 10000)
        content_to_write = memoryview(content)[1024:]
    print("Content length: %d, content to write length %d" %
          (len(content), len(content_to_write)))
    with open("/dev/null", "wb") as target:
        target.write(content_to_write)

if __name__ == '__main__':
    read_random()
Let's run the program above with the memory profiler:
$ python -m memory_profiler memoryview/copy-memoryview.py
Content length: 10240000, content to write length 10238976
Filename: memoryview/copy-memoryview.py

Mem usage    Increment   Line Contents
======================================
                         @profile
 9.887 MB     0.000 MB   def read_random():
 9.891 MB     0.004 MB       with open("/dev/urandom", "rb") as source:
19.660 MB     9.770 MB           content = source.read(1024 * 10000)
19.660 MB     0.000 MB           content_to_write = memoryview(content)[1024:]
19.660 MB     0.000 MB       print("Content length: %d, content to write length %d" %
19.672 MB     0.012 MB             (len(content), len(content_to_write)))
19.672 MB     0.000 MB       with open("/dev/null", "wb") as target:
19.672 MB     0.000 MB           target.write(content_to_write)
In this case, the source.read call still allocates 10 MB of memory to read the content of the file. However, when using memoryview to reference the content at an offset, no further memory is allocated.
This version of the program ends up allocating 50% less memory than the original version!
This kind of trick is especially useful when dealing with sockets. When sending data over a socket, all the data might not be sent in a single call: send() returns the number of bytes actually transmitted, which may be less than the size of the buffer.
import socket

s = socket.socket(…)
s.connect(…)
# Build a bytes object with more than 100 million times the letter `a`
data = b"a" * (1024 * 100000)

while data:
    sent = s.send(data)
    # Remove the first `sent` bytes that were sent
    data = data[sent:]
With a mechanism implemented as above, the program copies the data over and over until the socket has sent everything. By using memoryview, it is possible to achieve the same functionality with zero copying, and therefore higher performance:
import socket

s = socket.socket(…)
s.connect(…)
# Build a bytes object with more than 100 million times the letter `a`
data = b"a" * (1024 * 100000)
mv = memoryview(data)

while mv:
    sent = s.send(mv)
    # Build a new memoryview object pointing to the data that remains to be sent
    mv = mv[sent:]
As this won't copy anything, it won't use any more memory than the 100 MB initially needed for the data variable.
So far, we have used memoryview objects to write data efficiently, but the same method can also be used to read data. Most I/O operations in Python know how to deal with objects implementing the buffer protocol: they can read from them, but also write to them. In this case, we don't even need memoryview objects; we can ask an I/O function to write into our pre-allocated object:
>>> ba = bytearray(8)
>>> ba
bytearray(b'\x00\x00\x00\x00\x00\x00\x00\x00')
>>> with open("/dev/urandom", "rb") as source:
...     source.readinto(ba)
...
8
>>> ba
bytearray(b'`m.z\x8d\x0fp\xa1')
With such techniques, it's easy to pre-allocate a buffer (as you would do in C to mitigate the number of calls to malloc()) and fill it at your convenience.
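For example, you can reuse a single pre-allocated buffer across many reads. The sketch below (the chunk size and iteration count are arbitrary choices for illustration) streams data in fixed-size chunks without allocating a new bytes object per read:
# Reuse one pre-allocated buffer for every read instead of letting
# read() allocate a fresh bytes object each time.
buf = bytearray(65536)
view = memoryview(buf)

with open("/dev/urandom", "rb") as source, open("/dev/null", "wb") as target:
    for _ in range(16):
        n = source.readinto(buf)   # fills buf in place, returns bytes read
        target.write(view[:n])     # write only the filled part, without copying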
Using memoryview, you can even place data at any point in the memory area:
>>> ba = bytearray(8)
>>> # Reference the _bytearray_ from offset 4 to its end
>>> ba_at_4 = memoryview(ba)[4:]
>>> with open("/dev/urandom", "rb") as source:
...     # Write the content of /dev/urandom from offset 4 to the end of the
...     # bytearray, effectively reading 4 bytes only
...     source.readinto(ba_at_4)
...
4
>>> ba
bytearray(b'\x00\x00\x00\x00\x0b\x19\xae\xb2')
The buffer protocol is fundamental to achieving low memory overhead and great performance. As Python hides all the memory allocations, developers tend to forget what happens under the hood, at a high cost for the speed of their programs!
It's also good to know that both the objects in the array module and the functions in the struct module can handle the buffer protocol correctly, and can therefore perform efficiently when targeting zero copy.
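As a small illustration of that point (the format string and values here are arbitrary), a memoryview can wrap an array.array without copying it, and struct.pack_into() can serialize values directly into a pre-allocated writable buffer:
import array
import struct

# array.array exposes the buffer protocol, so a memoryview over it
# references the same memory instead of copying it.
numbers = array.array('i', [1, 2, 3, 4])
view = memoryview(numbers)
numbers[0] = 42
print(view[0])  # 42: the view sees the mutation, so no copy was made

# struct.pack_into() writes packed bytes straight into a writable buffer,
# avoiding the intermediate bytes object that struct.pack() would allocate.
buf = bytearray(8)
struct.pack_into('<ii', buf, 0, 1, 2)
print(struct.unpack_from('<ii', buf, 0))  # (1, 2)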