
Malthe Borch: Python's Missing String Type

Thu, 17 Apr 2014 17:35:00 GMT

When Python 3.0 came out in late 2008, it was expected that the eventual wide adoption of the 3.x series would take roughly five years.

And on some Linux systems today, it's even the default interpreter.

$ python
Python 3.5.2 (default, Jun 28 2016, 08:46:01)
>>>

Yet, I don't know anyone who actually uses Python 3 for application development. I think there are two primary reasons for this:

The most controversial change in Python 3 was that the string type was changed from an 8-bit raw byte string to a Unicode-based string type. That makes sense, because the string type is for human-readable text and Unicode is able to represent any text.

Unfortunately, the change broke almost every existing library. It also missed the mark.

In Python 2 we have str and unicode. In Python 3 we have str and bytes. But there's a design that allows us to combine the functionality of both in a single type.

We can use a rope-like data structure where each leaf is a sequence of bytes with an encoding such as utf-8 (see also the paper from 1994 by Boehm, Atkinson and Plass).

[Figure: rope data structure]

We can add any two str instances together, regardless of encoding, and use all of the common string methods and operators such as len and split. In all cases, the methods would respect the encodings of the various segments.

To "flatten" a rope, we encode it:

>>> string.encode('utf-8')

This is typically necessary only for I/O or use with external libraries.
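To make the idea concrete, here is a minimal sketch of such a type in today's Python 3. It is an illustration only, not a proposed implementation: the class name Rope and its leaves attribute are hypothetical, and a flat list of leaves stands in for the balanced tree a real rope would use. Each leaf keeps its original bytes together with the name of their encoding.

```python
class Rope:
    """Hypothetical rope-like string: a flat list of (raw bytes, encoding) leaves."""

    def __init__(self, data=b"", encoding="utf-8"):
        # Each leaf keeps its original bytes and the encoding they are in.
        self.leaves = [(bytes(data), encoding)] if data else []

    def __add__(self, other):
        # Concatenation only joins the leaf lists; nothing is re-encoded.
        result = Rope()
        result.leaves = self.leaves + other.leaves
        return result

    def __len__(self):
        # Length is counted in code points, respecting each leaf's encoding.
        return sum(len(data.decode(enc)) for data, enc in self.leaves)

    def encode(self, encoding):
        # "Flatten" the rope into one byte string in the target encoding.
        return b"".join(
            data.decode(enc).encode(encoding) for data, enc in self.leaves
        )


# Two segments in different encodings ("h\xe9llo " is "héllo ").
greeting = Rope("h\xe9llo ".encode("latin-1"), "latin-1")
name = Rope("w\xf6rld".encode("utf-8"), "utf-8")

combined = greeting + name           # still two leaves, no transcoding yet
flat = combined.encode("utf-8")      # transcoding happens only here
print(len(combined))                 # 11
```

Adding the two ropes is cheap because the segments are left untouched; the cost of transcoding is paid once, when the rope is flattened for I/O. This sketch decodes on every len call, which a serious implementation would presumably avoid by caching lengths per leaf.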

What about raw bytes? Easy:

>>> with open('foo.png', 'rb') as f:
...     data = f.read()

And if we know that a particular substring is actually encoded:

>>> header = data[1:4].decode('utf-8')

This works because data was read as raw bytes from a file. When we decode this data we get a rope that's composed of a single segment with a Unicode-compatible encoding – utf-8.
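For comparison, the same slice-and-decode step can be tried in current Python 3, where the result is an ordinary str rather than a rope. In this sketch the eight-byte PNG signature stands in for the start of foo.png, so no file is needed; the variable whole_file_is_text is only there for illustration.

```python
# The 8-byte PNG signature, standing in for the first bytes of foo.png.
data = b"\x89PNG\r\n\x1a\n"

# Bytes 1..3 are plain ASCII, so they decode cleanly as utf-8.
header = data[1:4].decode("utf-8")
print(header)  # PNG

# The data as a whole is not text: the leading 0x89 byte is not valid utf-8.
try:
    data.decode("utf-8")
except UnicodeDecodeError:
    whole_file_is_text = False
else:
    whole_file_is_text = True
```

Only the substring known to be encoded text is decoded; the rest stays as raw bytes.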

Discussion on Hacker News.

