I'm porting chemfp from Python2 to Python3. I read a lot of ASCII files. I'm trying to figure out if it's better to read them as binary bytes or as text strings.
No matter how I tweak Python3's open() parameters, I can't get the string read performance to within a factor of 2 of the bytes read performance. As I haven't seen much discussion of this, I figured I would document it here.
chemfp reads chemistry file formats which are specified as ASCII. They contain user-specified fields which are 8-bit clean, so sometimes people use them to encode non-ASCII data. For example, the SD tag field "price" might include the price in £GBP or €EUR, and include the currency symbol either as Latin-1 or UTF-8. (I haven't come across other encodings, but I've also never worked with SD files used internally in, say, a Japanese pharmaceutical company.)
These are text files, so it makes sense to read them as text, right? The main problem is that reading in "r" mode is a lot slower than reading in "rb" mode. Here's my benchmark, which uses Python 3.5.2 on a Mac OS X 10.10.5 machine to read the first 10MiB from a 3.1GiB file:
% python -V
Python 3.5.2
% python -m timeit 'open("chembl_21.sdf", "r").read(10*1024*1024)'
100 loops, best of 3: 10.3 msec per loop
% python -m timeit 'open("chembl_21.sdf", "rb").read(10*1024*1024)'
100 loops, best of 3: 3.74 msec per loop

The Unicode string read() is much slower than the byte string read(), with a performance ratio of 2.75. (I'll give all numbers in ratios.)
Python2 had a similar problem. I originally used "U"niversal mode in chemfp to read the text files in FPS format, but found that if I switched from "rU" to "rb", and wrote my code to support both '\n' and '\r\n' conventions, I could double my overall system read performance - the "U" option gives a 10x slowdown!
% python2.7 -m timeit 'open("chembl_21.sdf", "rb").read(10*1024*1024)'
100 loops, best of 3: 3.7 msec per loop
% python2.7 -m timeit 'open("chembl_21.sdf", "rU").read(10*1024*1024)'
10 loops, best of 3: 36.7 msec per loop
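The bytes-side newline handling isn't hard. Here's a minimal sketch of the idea, assuming line-oriented parsing; the function name is mine for illustration, not chemfp's API:

def iter_lines_binary(filename):
    # Read in binary mode and strip the newline by hand, so that both
    # '\n' and '\r\n' files are handled without universal newline
    # translation.
    with open(filename, "rb") as infile:
        for line in infile:
            if line.endswith(b"\r\n"):
                yield line[:-2]
            elif line.endswith(b"\n"):
                yield line[:-1]
            else:
                yield line  # final line with no trailing newline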
This observation is not new. A quick Duck Duck Go search found a 2015 blog post by Nelson Minar which concluded:
- Python 2 and Python 3 read bytes at the same speed
- In Python 2, decoding Unicode is 10x slower than reading bytes
- In Python 3, decoding Unicode is 3-7x slower than reading bytes
- In Python 3, universal newline conversion is ~1.5x slower than skipping it, at least if the file has DOS newlines
- In Python 3, codecs.open() is faster than open().
The Python3 open() function takes more parameters than Python2, including 'newline', which affects how the text mode reader identifies newlines, and 'encoding':
open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True)
...
newline controls how universal newlines works (it only applies to text mode). It can be None, '', '\n', '\r', and '\r\n'. It works as follows:

* On input, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newline mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.
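To see what those settings actually do, here's a small demonstration I put together; the file name is only for illustration. It writes a file with DOS line endings and reads it back with each setting:

with open("newline_demo.txt", "wb") as f:
    f.write(b"first\r\nsecond\r\n")

print(repr(open("newline_demo.txt", "r", newline=None).read()))
# 'first\nsecond\n'       (universal newlines; '\r\n' translated to '\n')
print(repr(open("newline_demo.txt", "r", newline="").read()))
# 'first\r\nsecond\r\n'   (recognized, but returned untranslated)
print(repr(open("newline_demo.txt", "r", newline="\n").read()))
# 'first\r\nsecond\r\n'   (only '\n' ends a line; '\r' is left in place)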
I'll 'deuniversalize' the text reader and benchmark newline="\n" and newline="":
% python -m timeit 'open("chembl_21.sdf", "rb").read(10*1024*1024)' 100 loops, best of 3: 3.81 msec per loop % python -m timeit 'open("chembl_21.sdf", "r", newline="\n").read(10*1024*1024)' 100 loops, best of 3: 8.8 msec per loop python -m timeit 'open("chembl_21.sdf", "r", newline="").read(10*1024*1024)' 100 loops, best of 3: 10.2 msec per loop % python -m timeit 'open("chembl_21.sdf", "r", newline=None).read(10*1024*1024)' 100 loops, best of 3: 10.2 msec per loopThe ratio of 2.3 for newline="\n" slowndown is better than the 2.75 for univeral newlines and the newline="" case that Nelson Minar tested, but still less than half the performance of the byte reader.
I also wondered if the encoding made a difference:
% python -m timeit 'open("chembl_21.sdf", "r", newline="\n", encoding="ascii").read(10*1024*1024)' 100 loops, best of 3: 8.8 msec per loop % python -m timeit 'open("chembl_21.sdf", "r", newline="\n", encoding="utf8").read(10*1024*1024)' 100 loops, best of 3: 8.8 msec per loop % python -m timeit 'open("chembl_21.sdf", "r", newline="\n", encoding="latin-1").read(10*1024*1024)' 100 loops, best of 3: 10.1 msec per loopMy benchmark shows that ASCII and UTF-8 encodings are equally fast, and Latin-1 is 14% slower, even though my data set contains only ASCII. I did not expect any difference. I assume a lot of time has been spent making the UTF-8 code go fast, but don't know why the Latin-1 reader is noticably slower on ASCII data. Nelson Minar also tested the codecs.open() performance, so I'll repeat it:
% python -m timeit -s 'import codecs' 'codecs.open("chembl_21.sdf", "r").read(10*1024*1024)'
100 loops, best of 3: 10.2 msec per loop

I noticed no performance difference between codecs.open() and the builtin open() for this test case.
I'm left with a bit of a quandary. I work with ASCII text data, with only the occasional non-ASCII field. For example, chemfp has specialized code to read an id tag and encoded fingerprint field from an SD file. In rare and non-standard cases, the handful of characters in the id/title line might be non-ASCII, but the hex-encoded fingerprint is never anything other than ASCII. It makes sense to use the text reader. But if I use the text reader, it will decode everything in each record (typically 2K-8K bytes), when I only need to decode at most 100 bytes of the record.
In chemfp, I used to have a two-pass solution to find records in an SD file. The first pass found the fields of interest, and the second counted newlines for better error reporting. I found that even that level of data re-scanning caused an observable slowdown, so I shouldn't be surprised that an extra pass to check for non-ASCII characters might also be a problem. But a two-fold slowdown?
This performance overhead leads me to conclude that I need to process my performance critical files as bytes, rather than strings, and delay the byte-to-string decoding as much as possible.
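As a rough sketch of what "delay the decoding" means in practice, here's the direction I have in mind; the helper name is mine for illustration, not chemfp's API. The record stays as bytes, and only the title line is decoded when someone asks for it as text:

def get_record_id(record_bytes, encoding="utf-8"):
    # The id is the first line (the title line) of the record.
    title, _, _ = record_bytes.partition(b"\n")
    return title.decode(encoding)

with open("chembl_21.sdf", "rb") as infile:
    block = infile.read(10000)
    record = block[:block.find(b"$$$$\n") + 5]
    print(get_record_id(record))  # CHEMBL153534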
RDKit and (non-)Unicode
I checked with the RDKit, which is a cheminformatics toolkit. The core is in C++, with Python extensions through Boost.Python. It treats the files as bytes, and lazily exposes the data to Python as Unicode. For example, if I place a byte string which is not valid UTF-8 in the title or tag field, then it will read and write the data without problems, because the data is stored in C++ data structures based on the byte string. But if I try to get the properties from Python, I get a UnicodeDecodeError.
Here's an example. First, I'll get a valid record, which is all ASCII:
>>> block = open("chembl_21.sdf", "rb").read(10000)
>>> i = block.find(b"$$$$\n")
>>> i
973
>>> record = block[:i+5]
>>> chr(max(record))
'm'

I'll use the RDKit to parse the record, then show that I can read the molecule id, which comes from the first line (the title line) of the file:
>>> from rdkit import Chem
>>> mol = Chem.MolFromMolBlock(record)
>>> mol.GetProp("_Name")
'CHEMBL153534'

If I then create a 'bad_record' by prefixing the byte 0xC2 to the record, then I can still process the record, but I cannot get the title:
>>> bad_record = b"\xC2" + record
>>> bad_mol = Chem.MolFromMolBlock(bad_record)
>>> bad_mol.GetNumAtoms()
16
>>> bad_mol.GetProp("_Name")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: invalid continuation byte

This is because 0xC2 starts a two-byte UTF-8 sequence, but the byte which follows it is not a valid continuation byte, so the RDKit's strict UTF-8 decoding of the property fails.
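The failure is easy to reproduce without the RDKit, and the same bytes decode without complaint as Latin-1:

>>> b"\xC2CHEMBL153534".decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: invalid continuation byte
>>> b"\xC2CHEMBL153534".decode("latin-1")
'ÂCHEMBL153534'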
I can even use C++ to write the string to a file, since the C++ code treats everything as bytes:
>>> outfile = Chem.SDWriter("/dev/tty") >>> outfile.write(mol); outfile.flush() CHEMBL153534 RDKit 2D 16 17 0 0 0 0 0 0 0 0999 V2000 ... >>> outfile.write(bad_mol); outfile.flush() ?CHEMBL153534 RDKit 2D 16 17 0 0 0 0 0 0 0 0999 V2000 ...(The symbol could not be represented in my UTF-8 terminal, so it uses a "?".)
On the other hand, I get a UnicodeDecodeError if I use a Python file object:
>>> from io import BytesIO
>>> f = BytesIO()
>>> outfile = Chem.SDWriter(f)
>>> outfile.write(bad_mol)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: invalid continuation byte

(It doesn't matter if I replace the BytesIO with a StringIO; the code hasn't gotten to that point yet. When the SDWriter() is initialized with a Python file handle and then told to write a molecule, it writes the molecule to a C++ byte string, converts that to a Python string, and passes that Python string to the file handle. The failure here is in the C++-to-Python translation.)
The simple conclusion from this is the same as the punchline of the old joke: "Doctor, doctor, it hurts when I do this." "Then don't do that." But that's too simple. SD files come from all sorts of places, including sources which may use '\xC2' as the Latin-1 encoding of Â. You don't want your seemingly well-tested system to crash because of something like this.
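One defensive approach on the reading side is to decode with a fallback instead of letting the strict UTF-8 decoder raise. This is only a sketch of the idea, with a helper name I made up; it's not what the RDKit or chemfp currently does:

def decode_field(raw_bytes):
    # Try UTF-8 first; if that fails, fall back to Latin-1, which
    # accepts every possible byte sequence.
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return raw_bytes.decode("latin-1")

With that policy, decode_field(b"\xC2" + b"CHEMBL153534") returns 'ÂCHEMBL153534' instead of raising.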
I'm not sure what the right solution is for the RDKit, but I can conclude that I need a test case something like this for any of my SD readers, and that correct support for the bytes/string translation is not easy.
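In outline, such a test might look like this. It exercises the decode_field() sketch from above rather than a real SD reader, but the shape is the same: a title byte which is valid Latin-1 but not valid UTF-8 must not make the reader blow up:

import unittest

class TestNonUTF8Title(unittest.TestCase):
    def test_latin1_title_does_not_raise(self):
        # 0xC2 is valid Latin-1 ('Â') but not valid UTF-8 on its own.
        title = b"\xC2CHEMBL153534"
        self.assertEqual(decode_field(title), "\u00C2CHEMBL153534")

if __name__ == "__main__":
    unittest.main()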
Want to leave a comment?
If you have a better suggestion than using bytes, like a faster way to read ASCII text as Unicode strings, or have some insight into why reading an ASCII file as a string is so relatively slow in Python3, let me know.
But don't get me wrong. I do scientific programming, which with rare exceptions is centered around the 7-bit ASCII world of the 1960s. Python3 has made great efforts to make Unicode parsing and decoding fast, which is important for most real-world data outside of my niche.