This month I held a Q&A at PuPPy (the Puget Sound Python users group) that eventually led to me explaining why Python 3 came into existence and the whole string/bytes deal. I ended up receiving a compliment on the explanation, which somewhat surprised me since I naively assumed people knew at this point why Python 3 was created. In hindsight it was silly of me to assume a majority of people, new or old to Python, would have either been told or been struck with enough curiosity to seek out an explanation. So this blog post is meant to simply explain why Python 3 exists, and specifically why we chose to make the whole backwards-incompatible unicode/str/bytes change happen, since that's the really tricky part in porting code to Python 3.
Text and binary data in Python 2 are a mess
Quick, what does the following literal represent semantically?
'abcd'
If you're a Python 3 user you would say it's the string consisting of the letters "a", "b", "c", and "d" in that order.
If you're a Python 2 user you may have said the same thing. You may have also said it was the bytes representing 97, 98, 99, and 100. And it's the fact that there are two correct answers in Python 2 for what the str object represents that led to changing the language so that the single Python 3 answer was the only answer.
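To make the two answers concrete, here is a small Python 3 sketch (the variable names are just for illustration). The text literal and the bytes literal are different types, and getting from one to the other is always an explicit step:

```python
# Python 3: text and binary data are separate types with separate literals.
text = 'abcd'    # str: a sequence of Unicode code points
data = b'abcd'   # bytes: a sequence of integers in range(256)

first_char = text[0]        # indexing str gives a one-character str
first_byte = data[0]        # indexing bytes gives an int
byte_values = list(data)    # the four numbers from the "bytes" answer above

# Moving between the two is always an explicit, named step.
round_tripped = data.decode('ascii')
```

In Python 2 the plain literal had to play both roles, which is exactly the ambiguity described above.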
The Zen of Python says that "there should be one -- and preferably only one -- obvious way to do it". Having literals in the language that could represent either textual data or binary data was a problem. If you read something from the network, for instance, you had to be very careful to say whether the str object you returned represented binary data or textual data, because there was no way to know once the object left your control. Or you might have a bug in your code where you meant to translate that str object into textual data -- or something else entirely -- but you messed up and accidentally skipped that step. With the str object potentially representing two different semantic types, it was hard to notice when this kind of slip-up occurred.
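As a rough sketch of the "skipped that step" scenario in Python 3 terms: the greet() function below is hypothetical, the payload merely stands in for bytes read off a socket, and UTF-8 is an assumed encoding. The point is that forgetting to decode is no longer silent:

```python
def greet(payload: bytes) -> str:
    """Hypothetical handler; `payload` stands in for bytes read off a socket."""
    name = payload.decode('utf-8')  # the explicit bytes -> text step
    return 'Hello, ' + name


payload = 'caf\u00e9'.encode('utf-8')  # pretend this arrived over the network

# Forgetting the decode step fails loudly in Python 3.
try:
    'Hello, ' + payload
except TypeError as exc:
    skipped_decode_error = exc
else:
    skipped_decode_error = None

# The version that decodes works as intended.
greeting = greet(payload)
```

The TypeError shows up at the exact line where text and binary data were mixed, rather than somewhere far downstream.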
Now you might try to argue that these issues are all solvable in Python 2 if you avoid the str type for textual data and instead rely upon the unicode type for text. While that's strictly true, people don't do that in practice. Either people get lazy and don't want to bother decoding to Unicode because it's extra work, or people get performance-hungry and try to avoid the cost of decoding. Either way it's making an assumption that you will code well enough to not mess up, and we all know that we are fallible human beings who are in fact not perfect. If people's hopes of writing bug-free code in Python 2 actually panned out, then I wouldn't consistently hear from basically every person who ports their project to Python 3 that they found latent bugs in their code regarding encoding and decoding of text and binary data.
This point of avoiding bugs is a big deal that people forget. The simplification of the language and the removal of the implicitness of what a str object might represent makes code less bug-prone. The Zen of Python points out that "explicit is better than implicit" for a reason: ambiguity and implicit knowledge that is not easily communicated in code are easy to get wrong and lead to bugs. Forcing developers to explicitly separate their binary data from their textual data leads to better code that is less likely to contain a certain class of bug.
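One member of that class of bug, sketched in Python 3 with an illustrative word and an assumed UTF-8 encoding: treating encoded bytes as if they were characters gives wrong lengths and can cut a character in half.

```python
text = 'na\u00efve'             # five characters; the third is i-with-diaeresis
encoded = text.encode('utf-8')  # six bytes: that character needs two

char_count = len(text)
byte_count = len(encoded)

# Slicing the bytes as if they were characters cuts a character in half.
try:
    encoded[:3].decode('utf-8')
except UnicodeDecodeError:
    split_mid_character = True
else:
    split_mid_character = False

# Slicing the text is safe because str indexes by code point.
prefix = text[:3]
```

With a single type doing both jobs, nothing tells you which kind of length or slice you are getting; with separate types, the answer follows from the type in hand.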
The rest of the world had gone all-in on Unicode (for good reason)
People sometimes forget how old Python is; Guido started coding Python in December 1989 and it was first released as open source in February 1991. This means that Python itself predates the first volume of the Unicode standard, which came out in October 1991. Over the intervening years, languages created after Unicode was standardized chose to use it as their implementation for strings. This placed Python 2 in the unfortunate position where it was gaining significant traction in 2004 (when Python 3 planning began), but it had arguably the weakest support for Unicode text because the unicode type was entirely optional and people were not using it for all textual data.
Supporting Unicode and text from any written language is important. Python is a language for the world, not just for those languages written in the Roman alphabet that ASCII covers. This is why Python 3 makes it "Unicode or bust" when it comes to text; it guarantees that all Python 3 code will support everyone in the world whether the developer who wrote the code explicitly meant for it to or not. In Python 2 there is a schism between projects that have taken the time to properly support the unicode type for textual data and those that have not; in Python 3 there is no such schism and supporting all languages comes for free.
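A small illustration of what "comes for free" looks like in practice (the sample words and the dictionary are only for demonstration): ordinary str operations in Python 3 behave the same way regardless of script, with no special type to opt into.

```python
# The same str type and methods handle every script.
greetings = {
    'German': 'Straße',
    'Russian': 'Привет',
    'Japanese': 'こんにちは',
}

# len() counts code points, not bytes, whatever the script.
lengths = {lang: len(word) for lang, word in greetings.items()}

# Case mapping follows Unicode rules; scripts without case are left alone.
shouted = {lang: word.upper() for lang, word in greetings.items()}
```

Code written with only English in mind gets the same behavior for every other language, because there is just one text type.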
We assumed Python was just going to keep getting more popular
In 2004 we started PEP 3100 and thus began designing Python 3 (aside: the PEP originally was numbered 3000, but we renumbered it to 3100 so that the PEP which had the number 3000 would be the PEP on how we would handle the development of Python 3). We knew Python's popularity was on an upward trend and we hoped its growth would continue (which it thankfully has ☺). But this also meant that if we were going to fix any design mistakes and help continue the language's popularity, we needed to do it now rather than later. We assumed that, provided we didn't botch Python 3, more code would be written in Python 3 than in Python 2 over a long-enough time frame, since Python 3 would last longer than Python 2 and be used more once Python 2.7 was only used for legacy projects and not new ones. So we decided to bear the pain of the Python 2/3 transition and created Python 3 under this assumption. Obviously it will take decades to see if Python 3 code in the world outstrips Python 2 code in terms of lines of code.
We will never do this kind of backwards-incompatible change again
We have decided as a team that a change as big as unicode/str/bytes will never happen so abruptly again. When we started Python 3 we thought/hoped that the community would do what Python did: do one last feature release supporting Python 2, then cut over to Python 3 for feature development while doing bugfix-only releases for the Python 2 version. That obviously didn't happen and we have learned our lesson. Plus we don't see any shortcomings in the fundamental design of the language that could warrant such a major change. So expect Python 4 to not do anything more drastic than maybe remove deprecated modules from the standard library.
Conclusion
So that's why Python 3 is the way it is. We realized that there was a collection of bugs that people kept having due to the overloaded use of the str type in Python 2, and so we fixed them in Python 3 by clearly separating textual data from binary data. It also helps that, by making all textual data automatically support Unicode, projects suddenly became much easier to use with multiple languages. And we made the change when we did because we figured the sooner the better. We structured the transition thinking the community would come along with us in leaving Python 2 behind, but that turned out not to be the case and instead we have taken some more time and are using a Python 2/3 compatible subset of the language to manage the transition.