One of the more puzzling aspects of Python for newcomers to the language is the stark usability difference between the standard library's urllib module and the popular (and well-recommended) third party module, requests, when it comes to writing HTTP(S) protocol clients. When your problem is "talk to an HTTP server", the difference in usability isn't immediately obvious, but it becomes clear as soon as additional requirements like SSL/TLS, authentication, redirect handling, session management, and JSON request and response bodies enter the picture.
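As a rough illustration (using a hypothetical https://api.example.com/items endpoint, and the third party requests module as published on PyPI), retrieving a JSON response over HTTPS with HTTP basic authentication looks something like this with each library:

```python
import base64
import json
import urllib.request

import requests  # third party, from PyPI

URL = "https://api.example.com/items"  # hypothetical endpoint for illustration

# With requests: basic auth and JSON decoding are a keyword argument
# and a method call away.
resp = requests.get(URL, auth=("user", "secret"), timeout=10)
resp.raise_for_status()
items = resp.json()

# With urllib: the equivalent request means assembling the auth header
# and decoding the response body by hand.
req = urllib.request.Request(URL)
token = base64.b64encode(b"user:secret").decode("ascii")
req.add_header("Authorization", "Basic " + token)
with urllib.request.urlopen(req, timeout=10) as response:
    items = json.loads(response.read().decode("utf-8"))
```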
It's tempting, and entirely understandable, to want to chalk this difference in ease of use up to requests being "Pythonic" (in 2016 terms), while urllib has now become un-Pythonic (despite being included in the standard library). While there are certainly a few elements of that (e.g. the property builtin was only added in Python 2.2, while urllib2 was included in the original Python 2.0 release and hence couldn't take that into account in its API design), the vast majority of the usability difference relates to an entirely different question we often forget to ask about the software we use:
What problem does it solve?
That is, many otherwise surprising discrepancies between urllib/urllib2 and requests are best explained by the fact that they solve different problems, and the problems most HTTP client developers have today are closer to those Kenneth Reitz designed requests to solve in 2010/2011 than they are to the problems that Jeremy Hylton was aiming to solve more than a decade earlier.
It's all in the name
To quote the current Python 3 urllib package documentation: "urllib is a package that collects several modules for working with URLs". And the docstring from Jeremy's original commit message adding urllib2 to CPython: "An extensible library for opening URLs using a variety [of] protocols".
Wait, what? We're just trying to write an HTTP client, so why is the documentation talking about working with URLs in general?
While it may seem strange to developers accustomed to the modern HTTPS+JSON powered interactive web, it wasn't always clear that that was how things were going to turn out.
At the turn of the century, the expectation was instead that we'd retain a rich variety of data transfer protocols with different characteristics optimised for different purposes, and that the most useful client to have in the standard library would be one that could be used to talk to multiple different kinds of servers (like HTTP, FTP, NFS, etc.), without client developers needing to worry too much about the specific protocol used (as indicated by the URL scheme).
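That protocol-agnostic design is still visible in today's urllib.request API: the same urlopen call dispatches on the URL scheme, so fetching over HTTP(S), FTP, or from the local filesystem all looks identical to the caller. A rough sketch (the FTP and file URLs below are placeholders rather than real resources):

```python
from urllib.request import urlopen

# One call, several protocols - urlopen picks the handler from the URL scheme.
for url in (
    "https://www.python.org/",            # HTTP(S)
    "ftp://ftp.example.com/readme.txt",   # FTP (placeholder host)
    "file:///etc/hostname",               # local file access
):
    with urlopen(url) as response:
        print(url, "->", len(response.read()), "bytes")
```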
In practice, things didn't work out that way (mostly due to restrictive institutional firewalls meaning HTTP servers were the only remote services that could be accessed reliably), so folks in 2016 are now regularly comparing the usability of a dedicated HTTP(S)-only client library with a general purpose URL handling library that has to be specifically configured for HTTP(S) before you gain access to most HTTP(S) features.
When it was written, urllib2 was a square peg that was designed to fit into the square hole of "generic URL processing". By contrast, most modern client developers are looking for a round peg to fit into the round hole that is HTTPS+JSON processing - urllib/urllib2 will fit if you shave the corners off first, but requests comes pre-rounded.
So why not add requests to the standard library?
Answering the not-so-obvious question of "What problem does it solve?" then leads to a more obvious follow-up question: if the problems that urllib/urllib2 were designed to solve are no longer common, while the problems that requests solves are common, why not add requests to the standard library?
If I recall correctly, Guido gave in-principle approval to this idea at a language summit back in 2013 or so (after the requests 1.0 release), and it's a fairly common assumption amongst the core development team that either requests itself (perhaps as a bundled snapshot of an independently upgradable component) or a compatible subset of the API with a different implementation will eventually end up in the standard library.
However, even putting aside the misgivings of the requests developers about the idea, there are still some non-trivial system integration problems to solve in getting requests to a point where it would be acceptable as a standard library component.
In particular, one of the things that requests does to more reliably handle SSL/TLS certificates in a cross-platform way is to bundle the Mozilla Certificate Bundle included in the certifi project. This is a sensible thing to do by default (due to the difficulties of obtaining reliable access to system security certificates in a cross-platform way), but it conflicts with the security policy of the standard library, which specifically aims to delegate certificate management to the underlying operating system. That policy aims to address two needs: allowing Python applications access to custom institutional certificates added to the system certificate store (most notably, private CA certificates for large organisations), and avoiding adding an additional certificate store to end user systems that needs to be updated when the root certificate bundle changes for any other reason.
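The distinction is easiest to see with the standard library's ssl module (a sketch only; requests itself wires the certifi bundle in through urllib3 rather than building contexts like this directly):

```python
import ssl

import certifi  # third party; ships the Mozilla CA certificate bundle

# The requests-style default: trust the CA certificates bundled with certifi.
bundled_ctx = ssl.create_default_context(cafile=certifi.where())
print("bundle file:", certifi.where())

# The standard library default: trust whatever the operating system's
# certificate store provides, including any privately installed CAs.
system_ctx = ssl.create_default_context()
print("system store:", system_ctx.cert_store_stats())
```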
These kinds of problems are technically solvable, but they're not fun to solve, and the folks in a position to help solve them already have a great many other demands on their time. This means we're not likely to see much in the way of progress in this area as long as most of the CPython and requests developers are pursuing their upstream contributions as a spare time activity, rather than as something they're specifically employed to do.