
Roberto Alsina: Airflow By Example


Apache Airflow is a very interesting, popular and free tool to create, manage and monitor workflows, for example if you want to do ETL (Extract / Transform / Load) on data.

This sort of enterprise software often may seem complicated or overly unrelated to our everyday experience as developers but ... is it, really? How about if I just want to watch some TV shows? And experiment with some enterprise-level software at the same time?

Let's do that by learning how to use Airflow to watch TV.


Caveat: This post was originally a twitter thread, that's why all the examples are images and you can't copy/paste them. But hey, at least they are short. Also, typos, because I really just did this while tweeting it, no preparation beforehand.

Just in case: I did not download any "Star Trek: Picard" episodes, and I have a Prime Video subscription, so I don't need to download them via torrent. OTOH, if Sir Patrick ever reads this (which he won't): good job, sir!


A thread by Roberto Alsina

This is a script that gives you the information about the latest already aired episode of a TV series.
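The original tweet shows the script only as an image, so here is a rough sketch of what it might look like; the get_latest_episode helper and the TVDB-style endpoint are hypothetical stand-ins, not the actual code from the thread.

import datetime

import requests


def get_latest_episode(series_name):
    # Hypothetical TVDB-style API call; the real script queried TVDB.
    response = requests.get(
        "https://api.example-tvdb.invalid/series", params={"name": series_name}
    )
    response.raise_for_status()
    episodes = response.json()["episodes"]

    # Keep only episodes that have already aired and pick the newest one.
    today = datetime.date.today().isoformat()
    aired = [e for e in episodes if e["firstAired"] <= today]
    latest = max(aired, key=lambda e: e["firstAired"])

    return {
        "series_name": series_name,
        "season": latest["airedSeason"],
        "episode": latest["airedEpisodeNumber"],
    }


if __name__ == "__main__":
    print(get_latest_episode("Star Trek: Picard"))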

And this is a script that gives you the link to a torrent to download that episode.

This, on the other hand, is a script to download that torrent.

Of course, one thing about this script is not like the other scripts.

While the others take a couple of seconds to run, this one may take hours or days. But let's ignore that for now. This is a script that moves your downloaded file into a nice directory hierarchy.

However, I think this is nicer because it does the same thing but with names "guaranteed" to be right and more uniform, and it transcodes the video to, say, something a Chromecast would like.

I could add extra tiny scripts that get subtitles for the languages you like and put them in the right location, and so on, but you get the idea.

Basically: it's easy to automate "I want to watch the latest episode of Picard"

However, it's totally impractical, because:

1) You have to go and tell it to get it
2) It takes hours
3) It may fail, and then you have to do it again
4) It will take hours again

But what if there was a way to define this set of tasks so that they run automatically, you know when they are working, when they start and when they finish, and you have to do nothing except wait for a message in Telegram that tells you "go watch Picard"? And that's where Apache Airflow enters the picture.

If you have a script that is a number of steps which depend on one another and are executed in order, then it's possible to convert them into very simple Airflow DAGs. Now I will take a little while to learn exactly HOW to do that and will continue this thread in a bit. Because, really, ETL (Extract / Transform / Load) is not as complicated as it may appear to be in most cases. BTW, if you want to have Airflow with Python 3.8 (because it's nicer) ...

Now, this may look complicated, but really, I am defining a DAG (Directed Acyclic Graph) or "thingies connected with arrows that has no loops in it"
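This too was an image in the thread; a minimal sketch of such a declaration for Airflow 1.10 could look like this (the DAG id, schedule and default arguments are assumptions):

from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# A DAG is just a named container for tasks, with a schedule attached.
dag = DAG(
    "tv_episodes",
    default_args=default_args,
    description="Check for new episodes of the series I follow",
    schedule_interval=timedelta(days=1),
    start_date=datetime(2020, 2, 1),
)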

What are the "thingies" inside an airflow DAG? Operators. They come in many flavors, but I am using this one now, which sets up a venv, runs a function, then removes everything. Nice and clean. (airflow.apache.org/docs/stable/_a…) So, let's convert this into an Airflow operator.

It's not hard! Operators are (in the case of Python operators) simply functions.

Details: do all the imports inside the function.

Have a list of requirements ready if you require things.
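Putting those two details together, the callable for the virtualenved operator might look roughly like this (the body is a simplified stand-in for the real TVDB lookup):

def check_latest_episode(series_name, **kwargs):
    # Imports live inside the function, so they are resolved in the
    # operator's freshly created virtualenv, not in Airflow's own environment.
    import requests  # assumed to be listed in the operator's requirements

    # ... query the TVDB API here and work out the latest aired episode ...
    # Return a plain, serializable dict describing it.
    return {
        "series_name": series_name,
        "season": 1,
        "episode": 1,
    }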

Now I need to put this operator inside the DAG I created earlier. Again, it's a matter of declaring things.
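The original declaration is again an image; it amounts to something like this, with the task id taken from the screenshots later in the thread and the rest assumed:

from airflow.operators.python_operator import PythonVirtualenvOperator

check_picard = PythonVirtualenvOperator(
    task_id="Check_Picard",
    python_callable=check_latest_episode,
    op_kwargs={"series_name": "Picard"},
    requirements=["requests"],  # assumption: whatever the callable imports
    dag=dag,  # the DAG declared earlier
)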

So, now that we have a (very stupid, one node, no arrows) DAG ... what can we do with it?

Well, we can make sure airflow sees it (you need to tell airflow where your dags live)

We can check that our DAG has some task in it!

A task is just an instance of an Operator, we added one, so there should be one.

And of course, we can TEST the thing, and have our task do its trick. Note that I have to pass a date. We could even use that in a 0.0.2 version to check episodes we missed!

Hey, that worked! (BTW, of course it did not work THE FIRST TIME, come on).

Backfill means "start at this date and run whatever would have run if we had actually started at that date"

Now, if you run "airflow scheduler" and "airflow webserver" you can see things like this

And yes, that means this task will run daily and report everything in the nice web UI and all that.

But of course a single task is a lame DAG, so let's make it a bit more interesting. Now, let's create a second operator, which will run AFTER the one we already have is finished, and use its output.

It's based on this:

Following the same mechanical changes as before (imports in the function, etc) it will look like this:

This uses two pieces of data from the previous task.
So, we need to do 2 things:

1) Connect the two operators
2) Pass data from one to the other

Connecting two tasks in a DAG is simple. Declare them both and tell airflow they are connected and in what direction.
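In code this is a single line, using the bitshift syntax Airflow overloads for dependencies (the variable names for the two tasks are assumed):

check_picard >> search_torrent  # run Search_Torrent after Check_Picard
# equivalent to: check_picard.set_downstream(search_torrent)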

And of course that's now reflected in the airflow UI. Here you can see that Check_Picard has been successful (dark green border) and Search_Torrent has no status because it never ran (white border)

It's probably worth mentioning that patience is important at this point in the project, since "it runs quickly with immediate feedback" is not one of the benefits we are getting here.

This will be slower than just running the scripts by hand. And now we have the search_torrent task failing.

Why?

Well, luckily we are using airflow, so we have logs!

The problem is, Search_Torrent is not getting the right arguments. It "wants" a dict with at least series_name, season and episode in it.

And ... that's not how these things work in airflow :-)

Slight detour, I just ran into this: (issues.apache.org/jira/browse/AI…)

So, I need to rewrite my nice virtualenved operators into uglier not-virtualenved ones.

Shame, airflow, shame. BTW, this is a minor but important part of developing software. Sometimes, you are doing things right and it will not work because there is a bug somewhere else.

Suck it up! So, back to the code, let's recap. I now have two operators. One operator looks for the latest episode of a TV series (example: Picard), and returns all the relevant data in a serializable thing, like a dict of dicts of strings and ints.

The second one will search for a torrent of a given episode of a series. It uses the series name, the season number, and the episode number.

How does it know the data it needs to search? Since it was returned by the previous task in the DAG, it gets it from the "context". Specifically, it's there as an XCom with the key "return_value".
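As a sketch, the callable behind the torrent-search task could pull that data like this (the torrent lookup itself is a hypothetical placeholder):

def search_torrent_for_episode(**context):
    # xcom_pull with just task_ids reads the XCom stored under the default
    # "return_value" key, i.e. whatever Check_Picard returned.
    episode = context["ti"].xcom_pull(task_ids="Check_Picard")

    # The search needs the series name, the season and the episode number.
    series_name = episode["series_name"]
    season = episode["season"]
    number = episode["episode"]

    # Hypothetical torrent search; the real operator queried rarbg.
    torrent = {"magnet": "magnet:?xt=urn:btih:..."}

    # Add the torrent info to the episode data and return the whole thing,
    # so the next task in the DAG can in turn pick it up from XCom.
    episode.update(torrent)
    return episode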

And once it has all the interesting torrent information, then it just adds it to the data it got and returns THAT

How do we use these operators in our DAG?

To search for Picard, I use my tvdb operator with an argument "Picard"

To get the Picard torrent, I use my rarbg operator with provide_context=True, so it can access the output of the other operator.
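Because of the bug mentioned above, this task ends up declared with the plain PythonOperator, roughly like this (the callable name is assumed):

from airflow.operators.python_operator import PythonOperator

search_torrent = PythonOperator(
    task_id="Search_Torrent",
    python_callable=search_torrent_for_episode,
    provide_context=True,  # pass the task context (and thus XComs) to the callable
    dag=dag,
)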

And then I hook them up

Does it work? Yes it does!

So, let's make it more interesting. A DAG of 2 nodes with one arrow is not the least interesting DAG ever, but it's close!

So, what happens if I also want to watch ... "The Rookie"? I can just set up a new tvdb operator with a different argument, and connect them both to the operator that searches for torrents.
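The fan-in is again a couple of dependency declarations (check_rookie being the assumed name of the second tvdb task):

check_picard >> search_torrent
check_rookie >> search_torrent
# or, more compactly: [check_picard, check_rookie] >> search_torrent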

In the airflow graph it looks like this:

And in another view, the tree view, it looks like this:

So, this DAG will look every day for new episodes of Picard or The Rookie and, if there is one, will trigger a torrent search for it. Adding further operators and tasks to actually download the torrent, transcode it, put it in the right place, get subtitles, and so on ... is left as an exercise for the reader (I gave some hints at the beginning of the thread)

If you enjoyed this thread, consider hiring me ;-)

Senior Python Dev, eng mgmt experience, remote preferred, based near Buenos Aires, Argentina. (ralsina.me/weblog/posts/l…)


The Digital Cat: Dissecting a Web stack


It was gross. They wanted me to dissect a frog.

(Beetlejuice, 1988)

Introduction

Having recently worked with young web developers who were exposed for the first time to proper production infrastructure, I received many questions about the various components that one can find in the architecture of a "Web service". These questions clearly expressed the confusion (and sometimes the frustration) of developers who understand how to create endpoints in a high-level language such as Node.js or Python, but were never introduced to the complexity of what happens between the user's browser and their framework of choice. Most of the time they don't know why the framework itself is there in the first place.

The challenge is clear if we just list (in random order), some of the words we use when we discuss (Python) Web development: HTTP, cookies, web server, Websockets, FTP, multi-threaded, reverse proxy, Django, nginx, static files, POST, certificates, framework, Flask, SSL, GET, WSGI, session management, TLS, load balancing, Apache.

In this post, I want to review all the words mentioned above (and a couple more) trying to build a production-ready web service from the ground up. I hope this might help young developers to get the whole picture and to make sense of these "obscure" names that senior developers like me tend to drop in everyday conversations (sometimes arguably out of turn).

As the focus of the post is the global architecture and the reasons behind the presence of specific components, the example service I will use will be a basic HTML web page. The reference language will be Python but the overall discussion applies to any language or framework.

My approach will be that of first stating the rationale and then implementing a possible solution. After this, I will point out missing pieces or unresolved issues and move on with the next layer. At the end of the process, the reader should have a clear picture of why each component has been added to the system.

The perfect architecture

A very important underlying concept of system architectures is that there is no perfect solution devised by some wise genius that we just need to apply. Unfortunately, people often mistake design patterns for such a "magic solution". The original "Design Patterns" book, however, states that

Your design should be specific to the problem at hand but also general enough to address future problems and requirements. You also want to avoid redesign, or at least minimize it.

And later

Design patterns make it easier to reuse successful designs and architectures. [...] Design patterns help you choose design alternatives that make a system reusable and avoid alternatives that compromise reusability.

The authors of the book are discussing Object-oriented Programming, but these sentences can be applied to any architecture. As you can see, we have a "problem at hand" and "design alternatives", which means that the most important thing to understand is the requirements, both the present and future ones. Only with clear requirements in mind, one can effectively design a solution, possibly tapping into the great number of patterns that other designers already devised.

A very last remark. A web stack is a complex beast, made of several components and software packages developed by different programmers with different goals in mind. It is perfectly understandable, then, that such components have some degree of superposition. While the division line between theoretical layers is usually very clear, in practice the separation is often blurry. Expect this a lot, and you will never be lost in a web stack anymore.

Some definitions

Let's briefly review some of the most important concepts involved in a Web stack, the protocols.

TCP/IP

TCP/IP is a network protocol, that is, a set of established rules two computers have to follow to get connected over a physical network to exchange messages. TCP/IP is composed of two different protocols covering two different layers of the OSI stack, namely the Transport (TCP) and the Network (IP) ones. TCP/IP can be implemented on top of any physical interface (Data Link and Physical OSI layers), such as Ethernet and Wireless. Actors in a TCP/IP network are identified by a socket, which is a tuple made of an IP address and a port number.

As far as we are concerned when developing a Web service, however, we need to be aware that TCP/IP is a reliable protocol, which in telecommunications means that the protocol itself takes care of retransmissions when packets get lost. In other words, while the speed of the communication is not guaranteed, we can be sure that once a message is sent it will reach its destination without errors.

HTTP

TCP/IP can guarantee that the raw bytes one computer sends will reach their destination, but this leaves completely untouched the problem of how to send meaningful information. In particular, in 1989 the problem Tim Berners-Lee wanted to solve was how to uniquely name hypertext resources in a network and how to access them.

HTTP is the protocol that was devised to solve such a problem and has since greatly evolved. With the help of other protocols such as WebSocket, HTTP invaded areas of communication for which it was originally considered unsuitable such as real-time communication or gaming.

At its core, HTTP is a protocol that states the format of a text request and the possible text responses. The initial version 0.9 published in 1991 defined the concept of URL and allowed only the GET operation that requested a specific resource. HTTP 1.0 and 1.1 added crucial features such as headers, more methods, and important performance optimisations. At the time of writing the adoption of HTTP/2 is around 45% of the websites in the world, and HTTP/3 is still a draft.

The most important feature of HTTP we need to keep in mind as developers is that it is a stateless protocol. This means that the protocol doesn't require the server to keep track of the state of the communication between requests, basically leaving session management to the developer of the service itself.

Session management is crucial nowadays because you usually want to have an authentication layer in front of a service, where a user provides credentials and accesses some private data. It is, however, useful in other contexts such as visual preferences or choices made by the user and re-used in later accesses to the same website. Typical solutions to the session management problem of HTTP involve the use of cookies or session tokens.

HTTPS

Security has become a very important word in recent years, and with a reason. The amount of sensitive data we exchange on the Internet or store on digital devices is increasing exponentially, but unfortunately so is the number of malicious attackers and the level of damage they can cause with their actions.

HTTP is inherently insecure, being a plain text communication between two servers that usually happens on a completely untrustable network such as the Internet. While security wasn't an issue when the protocol was initially conceived, it is nowadays a problem of paramount importance, as we exchange private information, often vital for people's security or for businesses. We need to be sure we are sending information to the correct server and that the data we send cannot be intercepted.

HTTPS solves both the problem of tampering and eavesdropping, encrypting HTTP with the Transport Layer Security (TLS) protocol, that also enforces the usage of digital certificates, issued by a trusted authority. At the time of writing, approximately 80% of websites loaded by Firefox use HTTPS by default. When a server receives an HTTPS connection and transforms it into an HTTP one it is usually said that it terminates TLS (or SSL, the old name of TLS).

WebSocket

One great disadvantage of HTTP is that communication is always initiated by the client and that the server can send data only when this is explicitly requested. Polling can be implemented to provide an initial solution, but it cannot guarantee the performances of proper full-duplex communication, where a channel is kept open between server and client and both can send data without being requested. Such a channel is provided by the WebSocket protocol.

WebSocket is a killer technology for applications like online gaming, real-time feeds like financial tickers or sports news, or multimedia communication like conferencing or remote education.

It is important to understand that WebSocket is not HTTP, and can exist without it. It is also true that this new protocol was designed to be used on top of an existing HTTP connection, so a WebSocket communication is often found in parts of a Web page, which was originally retrieved using HTTP in the first place.

Implementing a service over HTTP

Let's finally start discussing bits and bytes. The starting point for our journey is a service over HTTP, which means there is an HTTP request-response exchange. As an example, let us consider a GET request, the simplest of the HTTP methods.

GET / HTTP/1.1
Host: localhost
User-Agent: curl/7.65.3
Accept: */*

As you can see, the client is sending a pure text message to the server, with the format specified by the HTTP protocol. The first line contains the method name (GET), the URL (/) and the protocol we are using, including its version (HTTP/1.1). The remaining lines are called headers and contain metadata that can help the server to manage the request. The complete value of the Host header is in this case localhost:80, but as the standard port for HTTP services is 80, we don't need to specify it.

If the server localhost is serving HTTP (i.e. running some software that understands HTTP) on port 80 the response we might get is something similar to

HTTP/1.0 200 OK
Date: Mon, 10 Feb 2020 08:41:33 GMT
Content-type: text/html
Content-Length: 26889
Last-Modified: Mon, 10 Feb 2020 08:41:27 GMT

<!DOCTYPE HTML><html>
...
</html>

As happened for the request, the response is a text message, formatted according to the standard. The first line mentions the protocol and the status of the request (200 in this case, that means success), while the following lines contain metadata in various headers. Finally, after an empty line, the message contains the resource the client asked for, the source code of the base URL of the website in this case. Since this HTML page probably contains references to other resources like CSS, JS, images, and so on, the browser will send several other requests to gather all the data it needs to properly show the page to the user.

So, the first problem we have is that of implementing a server that understands this protocol and sends a proper response when it receives an HTTP request. We should try to load the requested resource and return either a success (HTTP 200) if we can find it, or a failure (HTTP 404) if we can't.

1 Sockets and parsers

1.1 Rationale

TCP/IP is a network protocol that works with sockets. A socket is a tuple of an IP address (unique in the network) and a port (unique for a specific IP address) that the computer uses to communicate with others. A socket is a file-like object in an operating system, that can be thus opened and closed, and that we can read from or write to. Socket programming is a pretty low-level approach to the network, but you need to be aware that every software in your computer that provides network access has ultimately to deal with sockets (most probably through some library, though).

Since we are building things from the ground up, let's implement a small Python program that opens a socket connection, receives an HTTP request, and sends an HTTP response. As port 80 is a "low port" (a number smaller than 1024), we usually don't have permissions to open sockets there, so I will use port 8080. This is not a problem for now, as HTTP can be served on any port.

1.2 Implementation

Create the file server.py and type this code. Yes, type it, don't just copy and paste, you will not learn anything otherwise.

import socket

# Create a socket instance
# AF_INET: use IP protocol version 4
# SOCK_STREAM: full-duplex byte stream
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Allow reuse of addresses
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

# Bind the socket to any address, port 8080, and listen
s.bind(('', 8080))
s.listen()

# Serve forever
while True:
    # Accept the connection
    conn, addr = s.accept()

    # Receive data from this socket using a buffer of 1024 bytes
    data = conn.recv(1024)

    # Print out the data
    print(data.decode('utf-8'))

    # Close the connection
    conn.close()

This little program accepts a connection on port 8080 and prints the received data on the terminal. You can test it executing it and then running curl localhost:8080 in another terminal. You should see something like

$ python3 server.py 
GET / HTTP/1.1
Host: localhost:8080
User-Agent: curl/7.65.3
Accept: */*

The server keeps running the code in the while loop, so if you want to terminate it you have to do it with Ctrl+C. So far so good, but this is not an HTTP server yet, as it sends no response; you should actually receive an error message from curl that says curl: (52) Empty reply from server.

Sending back a standard response is very simple, we just need to call conn.sendall passing the raw bytes. A minimal HTTP response contains the protocol and the status, an empty line, and the actual content, for example

HTTP/1.1 200 OK

Hi there!

Our server becomes then

import socket

# Create a socket instance
# AF_INET: use IP protocol version 4
# SOCK_STREAM: full-duplex byte stream
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Allow reuse of addresses
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

# Bind the socket to any address, port 8080, and listen
s.bind(('', 8080))
s.listen()

# Serve forever
while True:
    # Accept the connection
    conn, addr = s.accept()

    # Receive data from this socket using a buffer of 1024 bytes
    data = conn.recv(1024)

    # Print out the data
    print(data.decode('utf-8'))

    conn.sendall(bytes("HTTP/1.1 200 OK\n\nHi there!\n", 'utf-8'))

    # Close the connection
    conn.close()

At this point, we are not really responding to the user's request, however. Try different curl command lines like curl localhost:8080/index.html or curl localhost:8080/main.css and you will always receive the same response. We should try to find the resource the user is asking for and send that back in the response content.

This version of the HTTP server properly extracts the resource and tries to load it from the current directory, returning either a success or a failure

import socket
import re

# Create a socket instance
# AF_INET: use IP protocol version 4
# SOCK_STREAM: full-duplex byte stream
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Allow reuse of addresses
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

# Bind the socket to any address, port 8080, and listen
s.bind(('', 8080))
s.listen()

HEAD_200 = "HTTP/1.1 200 OK\n\n"
HEAD_404 = "HTTP/1.1 404 Not Found\n\n"

# Serve forever
while True:
    # Accept the connection
    conn, addr = s.accept()

    # Receive data from this socket using a buffer of 1024 bytes
    data = conn.recv(1024)
    request = data.decode('utf-8')

    # Print out the data
    print(request)

    resource = re.match(r'GET /(.*) HTTP', request).group(1)

    try:
        with open(resource, 'r') as f:
            content = HEAD_200 + f.read()
        print('Resource {} correctly served'.format(resource))
    except FileNotFoundError:
        content = HEAD_404 + "Resource /{} cannot be found\n".format(resource)
        print('Resource {} cannot be loaded'.format(resource))

    print('--------------------')

    conn.sendall(bytes(content, 'utf-8'))

    # Close the connection
    conn.close()

As you can see this implementation is extremely simple. If you create a simple local file named index.html with this content

<head>
    <title>This is my page</title>
    <link rel="stylesheet" href="main.css">
</head>
<html>
    <p>Some random content</p>
</html>

and run curl localhost:8080/index.html you will see the content of the file. At this point, you can even use your browser to open http://localhost:8080/index.html and you will see the title of the page and the content. A Web browser is a piece of software capable of sending HTTP requests and of interpreting the content of the responses, if this is HTML (or many other file types like images or videos), so that it can render the content of the message. The browser is also responsible for retrieving the missing resources needed for the rendering, so when you provide links to style sheets or JS scripts with the <link> or the <script> tags in the HTML code of a page, you are instructing the browser to send an HTTP GET request for those files as well.

The output of server.py when I access http://localhost:8080/index.html is

GET /index.html HTTP/1.1
Host: localhost:8080
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Language: en-GB,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Pragma: no-cache
Cache-Control: no-cache


Resource index.html correctly served
--------------------
GET /main.css HTTP/1.1
Host: localhost:8080
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
Accept: text/css,*/*;q=0.1
Accept-Language: en-GB,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Referer: http://localhost:8080/index.html
Pragma: no-cache
Cache-Control: no-cache


Resource main.css cannot be loaded
--------------------
GET /favicon.ico HTTP/1.1
Host: localhost:8080
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
Accept: image/webp,*/*
Accept-Language: en-GB,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-cache


Resource favicon.ico cannot be loaded
--------------------

As you can see the browser sends rich HTTP requests, with a lot of headers, automatically requesting the CSS file mentioned in the HTML code and automatically trying to retrieve a favicon image.

1.3 Resources

These resources provide more detailed information on the topics discussed in this section

1.4 Issues

It gives a certain dose of satisfaction to build something from scratch and discover that it works smoothly with full-fledged software like the browser you use every day. I also think it is very interesting to discover that technologies like HTTP, that basically run the world nowadays, are at their core very simple.

That said, there are many features of HTTP that we didn't cover with our simple socket programming. For starters, HTTP/1.0 introduced other methods besides GET, such as POST, which is of paramount importance for today's websites, where users keep sending information to servers through forms. To implement all 9 HTTP methods we need to properly parse the incoming request and add the relevant functions to our code.
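As a hint of what that would involve, here is a small sketch, not part of the original toy server, of parsing the request line and dispatching on the method; the handler functions are hypothetical:

def parse_request(request):
    # The request line is the first line of the message,
    # e.g. "POST /index.html HTTP/1.1".
    method, resource, version = request.splitlines()[0].split()
    return method, resource, version


def handle(request):
    method, resource, version = parse_request(request)
    if method == 'GET':
        return load_resource(resource)  # hypothetical handler
    if method == 'POST':
        return store_resource(resource, request)  # hypothetical handler
    return "HTTP/1.1 405 Method Not Allowed\n\n"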

At this point, however, you might notice that we are dealing a lot with low-level details of the protocol, which is usually not the core of our business. When we build a service over HTTP we believe that we have the knowledge to properly implement some code that can simplify a certain process, be it searching for other websites, shopping for books or sharing pictures with friends. We don't want to spend our time understanding the subtleties of the TCP/IP sockets and writing parsers for request-response protocols. It is nice to see how these technologies work, but on a daily basis, we need to focus on something at a higher level.

The situation of our small HTTP server is possibly worsened by the fact that HTTP is a stateless protocol. The protocol doesn't provide any way to connect two successive requests, thus keeping track of the state of the communication, which is the cornerstone of modern Internet. Every time we authenticate on a website and we want to visit other pages we need the server to remember who we are, and this implies keeping track of the state of the connection.

Long story short: to work as a proper HTTP server, our code should at this point implement all HTTP methods and cookie management. We also need to support other protocols like WebSockets. These are anything but trivial tasks, so we definitely need to add some component to the whole system that lets us focus on the business logic and not on the low-level details of application protocols.

2 Web framework

2.1 Rationale

Enter the Web framework!

As I discussed many times (see the book on clean architectures or the related post), the role of the Web framework is that of converting HTTP requests into function calls, and function return values into HTTP responses. The framework's true nature is that of a layer that connects a working business logic to the Web, through HTTP and related protocols. The framework takes care of session management for us and maps URLs to functions, allowing us to focus on the application logic.

In the grand scheme of an HTTP service, this is what the framework is supposed to do. Everything the framework provides out of this scope, like layers to access DBs, template engines, and interfaces to other systems, is an addition that you, as a programmer, may find useful, but is not in principle part of the reason why we added the framework to the system. We add the framework because it acts as a layer between our business logic and HTTP.

2.2 Implementation

Thanks to Miguel Grinberg and his amazing Flask mega-tutorial I can set up Flask in seconds. I will not run through the tutorial here, as you can follow it on Miguel's website. I will only use the content of the first article (out of 23!) to create an extremely simple "Hello, world" application.

To run the following example you will need a virtual environment and you will have to pip install flask. Follow Miguel's tutorial if you need more details on this.

The app/__init__.py file is

from flask import Flask

application = Flask(__name__)

from app import routes

and the app/routes.py file is

from app import application


@application.route('/')
@application.route('/index')
def index():
    return "Hello, world!"

You can already see here the power of a framework in action. We defined an index function and connected it with two different URLs (/ and /index) in 3 lines of Python. This leaves us time and energy to properly work on the business logic, that in this case is a revolutionary "Hello, world!". Nobody ever did this before.

Finally, the service.py file is

from app import application

Flask comes with a so-called development web server (do these words ring any bell now?) that we can run on a terminal

$ FLASK_APP=service.py flask run
 * Serving Flask app "service.py"
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)

You can now visit the given URL with your browser and see that everything works properly. Remember that 127.0.0.1 is the special IP address that refers to "this computer"; the name localhost is usually created by the operating system as an alias for that, so the two are interchangeable. As you can see the standard port for Flask's development server is 5000, so you have to mention it explicitly, otherwise your browser would try to access port 80 (the default HTTP one). When you connect with the browser you will see some log messages about the HTTP requests

127.0.0.1 - - [14/Feb/2020 14:54:27] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [14/Feb/2020 14:54:28] "GET /favicon.ico HTTP/1.1" 404 -

You can recognise both now, as those are the same request we got with our little server in the previous part of the article.

2.3 Resources

These resources provide more detailed information on the topics discussed in this section

2.4 Issues

Apparently, we solved all our problems, and many programmers just stop here. They learn how to use the framework (which is a big achievement!), but as we will shortly discover, this is not enough for a production system. Let's have a closer look at the output of the Flask server. It clearly says, among other things

   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.

The main issue we have when we deal with any production system is represented by performances. Think about what we do with JavaScript when we minimise the code: we consciously obfuscate the code in order to make the file smaller, but this is done for the sole purpose of making the file faster to retrieve.

For HTTP servers the story is not very different. The Web framework usually provides a development Web server, as Flask does, which properly implements HTTP, but does it in a very inefficient way. For starters, this is a blocking framework, which means that if our request takes seconds to be served (for example because the endpoint retrieves data from a very slow database), any other request will have to wait to be served in a queue. That ultimately means that the user will see a spinner in the browser's tab and just shake their head thinking that we can't build a modern website. Other performances concerns might be connected with memory management or disk caches, but in general, we are safe to say that this web server cannot handle any production load (i.e. multiple users accessing the web site at the same time and expecting good quality of service).

This is hardly surprising. After all, we didn't want to deal with TCP/IP connections to focus on our business, so we delegated this to other coders who maintain the framework. The framework's authors, in turn, want to focus on things like middleware, routes, proper handling of HTTP methods, and so on. They don't want to spend time trying to optimise the performances of the "multi-user" experience. This is especially true in the Python world (and somehow less true for Node.js, for example): Python is not heavily concurrency-oriented, and both the style of programming and the performances are not favouring fast, non-blocking applications. This is changing lately, with async and improvements in the interpreter, but I leave this for another post.

So, now that we have a full-fledged HTTP service, we need to make it so fast that users won't even notice this is not running locally on their computer.

3 Concurrency and façades

3.1 Rationale

Well, whenever you have performance issues, just go for concurrency. Now you have many problems! (see here)

Yes, concurrency solves many problems and it's the source of just as many, so we need to find a way to use it in the safest and least complicated way. We basically want to add a layer that runs the framework in some concurrent way, without requiring us to change anything in the framework itself.

And whenever you have to homogenise different things just create a layer of indirection. This solves any problem but one. (see here)

So we need to create a layer that runs our service in a concurrent way, but we also want to keep it detached from the specific implementation of the service, that is independent of the framework or library that we are using.

3.2 Implementation

In this case, the solution is that of giving a specification of the API that web frameworks have to expose, in order to be usable by independent third-party components. In the Python world, this set of rules has been named WSGI, the Web Server Gateway Interface, but such interfaces exist for other languages such as Java or Ruby. The "gateway" mentioned here is the part of the system outside the framework, which in this discussion is the part that deals with production performances. Through WSGI we are defining a way for frameworks to expose a common interface, leaving people interested in concurrency free to implement something independently.
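To give an idea of how small this common interface is, here is a minimal WSGI application written without any framework (this example is not part of the original article):

def application(environ, start_response):
    # environ is a dict describing the HTTP request (method, path,
    # headers, ...); start_response sends the status line and the headers.
    status = '200 OK'
    headers = [('Content-Type', 'text/plain')]
    start_response(status, headers)

    # The body is returned as an iterable of bytes.
    return [b'Hello, WSGI!\n']

Saved in a file called wsgi.py, such a callable can be served directly by any WSGI server; Gunicorn, used below, looks for a callable named application by default.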

If the framework is compatible with the gateway interface, we can add software that deals with concurrency and uses the framework through the compatibility layer. Such a component is a production-ready HTTP server, and two common choices in the Python world are Gunicorn and uWSGI.

Production-ready HTTP server means that the software understands HTTP as the development server already did, but at the same time pushes performances in order to sustain a bigger workload, and as we said before this is done through concurrency.

Flask is compatible with WSGI, so we can make it work with Gunicorn. To install it in our virtual environment run pip install gunicorn and set it up creating a file named wsgi.py with the following content

from app import application

if __name__ == "__main__":
    application.run()

To run Gunicorn specify the number of concurrent instances and the external port

$ gunicorn --workers 3 --bind 0.0.0.0:8000 wsgi
[2020-02-12 18:39:07 +0000] [13393] [INFO] Starting gunicorn 20.0.4
[2020-02-12 18:39:07 +0000] [13393] [INFO] Listening at: http://0.0.0.0:8000 (13393)
[2020-02-12 18:39:07 +0000] [13393] [INFO] Using worker: sync
[2020-02-12 18:39:07 +0000] [13396] [INFO] Booting worker with pid: 13396
[2020-02-12 18:39:07 +0000] [13397] [INFO] Booting worker with pid: 13397
[2020-02-12 18:39:07 +0000] [13398] [INFO] Booting worker with pid: 13398

As you can see, Gunicorn has the concept of workers which are a generic way to express concurrency. Specifically, Gunicorn implements a pre-fork worker model, which means that it (pre)creates a different Unix process for each worker. You can check this running ps

$ ps ax | grep gunicorn
14919 pts/1    S+     0:00 ~/venv3/bin/python3 ~/venv3/bin/gunicorn --workers 3 --bind 0.0.0.0:8000 wsgi
14922 pts/1    S+     0:00 ~/venv3/bin/python3 ~/venv3/bin/gunicorn --workers 3 --bind 0.0.0.0:8000 wsgi
14923 pts/1    S+     0:00 ~/venv3/bin/python3 ~/venv3/bin/gunicorn --workers 3 --bind 0.0.0.0:8000 wsgi
14924 pts/1    S+     0:00 ~/venv3/bin/python3 ~/venv3/bin/gunicorn --workers 3 --bind 0.0.0.0:8000 wsgi

Using processes is just one of the two ways to implement concurrency in a Unix system, the other being using threads. The benefits and demerits of each solution are outside the scope of this post, however. For the time being just remember that you are dealing with multiple workers that process incoming requests asynchronously, thus implementing a non-blocking server, ready to accept multiple connections.

3.3 Resources

These resources provide more detailed information on the topics discussed in this section

3.4 Issues

Using Gunicorn we now have a production-ready HTTP server, and apparently we have implemented everything we need. There are still many considerations and missing pieces, though.

Performances (again)

Are 3 workers enough to sustain the load of our new killer mobile application? We expect thousands of visitors per minute, so maybe we should add some. But while we increase the number of workers, we have to keep in mind that the machine we are using has a finite amount of CPU power and memory. So, once again, we have to focus on performances, and in particular on scalability: how can we keep adding workers without having to stop the application, replace the machine with a more powerful one, and restart the service?

Embrace change

This is not the only problem we have to face in production. An important aspect of technology is that it changes over time, as new and (hopefully) better solutions become widespread. We usually design systems dividing them as much as possible into communicating layers exactly because we want to be free to replace a layer with something else, be it a simpler component or a more advanced one, one with better performances or maybe just a cheaper one. So, once again, we want to be able to evolve the underlying system keeping the same interface, exactly as we did in the case of web frameworks.

HTTPS

Another missing part of the system is HTTPS. Gunicorn and uWSGI do not understand the HTTPS protocol, so we need something in front of them that will deal with the "S" part of the protocol, leaving the "HTTP" part to the internal layers.

Load balancers

In general, a load balancer is just a component in a system that distributes work among a pool of workers. Gunicorn is already distributing load among its workers, so this is not a new concept, but we generally want to do it on a bigger level, among machines or among entire systems. Load balancing can be hierarchical and be structured on many levels. We can also assign more importance to some components of the system, flagging them as ready to accept more load (for example because their hardware is better). Load balancers are extremely important in network services, and the definition of load can be extremely different from system to system: generally speaking, in a Web service the number of connections is the standard measure of the load, as we assume that on average all connections bring the same amount of work to the system.

Reverse proxies

Load balancers are forward proxies, as they allow a client to contact any server in a pool. At the same time, a reverse proxy allows a client to retrieve data produced by several systems through the same entry point. Reverse proxies are a perfect way to route HTTP requests to sub-systems that can be implemented with different technologies. For example, you might want to have part of the system implemented with Python, using Django and Postgres, and another part served by an AWS Lambda function written in Go and connected with a non-relational database such as DynamoDB. Usually, in HTTP services this choice is made according to the URL (for example routing every URL that begins with /api/).

Logic

We also want a layer that can implement a certain amount of logic, to manage simple rules that are not related to the service we implemented. A typical example is that of HTTP redirections: what happens if a user accesses the service with an http:// prefix instead of https://? The correct way to deal with this is through an HTTP 301 code, but you don't want such a request to reach your framework, wasting resources for such a simple task.
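At the protocol level such a redirection is nothing more than a tiny response like the following (the Location value here is just an example), which the web server can produce on its own without ever involving the framework:

HTTP/1.1 301 Moved Permanently
Location: https://localhost/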

4 The Web server

4.1 Rationale

The general label of Web server is given to software that performs the tasks we discussed. Two very common choices for this part of the system are nginx and Apache, two open source projects that are currently leading the market. With different technical approaches, they both implement all the features we discussed in the previous section (and many more).

4.2 Implementation

To test nginx without having to fight with the OS and install too many packages we can use Docker. Docker is useful to simulate a multi-machine environment, but it might also be your technology of choice for the actual production environment (AWS ECS works with Docker containers, for example).

The base configuration that we will run is very simple. One container will contain the Flask code and run the framework with Gunicorn, while the other container will run nginx. Gunicorn will serve HTTP on the internal port 8000, not exposed by Docker and thus not reachable from our browser, while nginx will expose port 80, the traditional HTTP port.

In the same directory of the file wsgi.py, create a Dockerfile

FROM python:3.6

ADD app /app
ADD wsgi.py /

WORKDIR .

RUN pip install flask gunicorn

EXPOSE 8000

This starts from a Python Docker image, adds the app directory and the wsgi.py file, and installs Gunicorn. Now create a configuration for nginx in a file called nginx.conf in the same directory

server {
    listen 80;
    server_name localhost;

    location / {
        proxy_pass http://application:8000/;
    }
}

This defines a server that listens on port 80 and that connects all the URLs starting with / with a server called application on port 8000, which is the container running Gunicorn.

Last, create a file docker-compose.yml that will describe the configuration of the containers.

version: "3.7"

services:
  application:
    build:
      context: .
      dockerfile: Dockerfile
    command: gunicorn --workers 3 --bind 0.0.0.0:8000 wsgi
    expose:
      - 8000
  nginx:
    image: nginx
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf
    ports:
      - 8080:80
    depends_on:
      - application

As you can see the name application that we mentioned in the nginx configuration file is not a magic string, but is the name we assigned to the Gunicorn container in the Docker Compose configuration.

To create this infrastructure we need to install Docker Compose in our virtual environment through pip install docker-compose. I also created a file named .env with the name of the project

COMPOSE_PROJECT_NAME=service

At this point you can run Docker Compose with docker-compose up -d

$ docker-compose up -d
Creating network "service_default" with the default driver
Creating service_application_1 ... done
Creating service_nginx_1       ... done

If everything is working correctly, opening the browser and visiting localhost should show you the HTML page Flask is serving.

Through docker-compose logs we can check what services are doing. We can recognise the output of Gunicorn in the logs of the service named application

$ docker-compose logs application
Attaching to service_application_1
application_1  | [2020-02-14 08:35:42 +0000] [1] [INFO] Starting gunicorn 20.0.4
application_1  | [2020-02-14 08:35:42 +0000] [1] [INFO] Listening at: http://0.0.0.0:8000 (1)
application_1  | [2020-02-14 08:35:42 +0000] [1] [INFO] Using worker: sync
application_1  | [2020-02-14 08:35:42 +0000] [8] [INFO] Booting worker with pid: 8
application_1  | [2020-02-14 08:35:42 +0000] [9] [INFO] Booting worker with pid: 9
application_1  | [2020-02-14 08:35:42 +0000] [10] [INFO] Booting worker with pid: 10

but the one we are mostly interested in now is the service named nginx, so let's follow the logs in real time with docker-compose logs -f nginx. Refresh the localhost page you visited with the browser, and the container should output something like

$ docker-compose logs -f nginx
Attaching to service_nginx_1
nginx_1        | 192.168.192.1 - - [14/Feb/2020:08:42:20 +0000] "GET / HTTP/1.1" 200 13 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0" "-"

which is the standard log format of nginx. It shows the IP address of the client (192.168.192.1), the connection timestamp, the HTTP request and the response status code (200), plus other information on the client itself.

Let's now increase the number of services, to see the load balancing mechanism in action. To do this, first we need to change the log format of nginx to show the IP address of the machine that served the request. Change the nginx.conf file adding the log_format and access_log options

log_format upstreamlog '[$time_local] $host to: $upstream_addr: $request $status';

server {
    listen 80;
    server_name localhost;

    location / {
        proxy_pass http://application:8000;
    }

    access_log /var/log/nginx/access.log upstreamlog;
}

The $upstream_addr variable is the one that contains the IP address of the server proxied by nginx. Now run docker-compose down to stop all containers and then docker-compose up -d --scale application=3 to start them again

$ docker-compose down
Stopping service_nginx_1       ... done
Stopping service_application_1 ... done
Removing service_nginx_1       ... done
Removing service_application_1 ... done
Removing network service_default
$ docker-compose up -d --scale application=3
Creating network "service_default" with the default driver
Creating service_application_1 ... done
Creating service_application_2 ... done
Creating service_application_3 ... done
Creating service_nginx_1       ... done

As you can see, Docker Compose runs now 3 containers for the application service. If you open the logs stream and visit the page in the browser you will now see a slightly different output

$ docker-compose logs -f nginx
Attaching to service_nginx_1
nginx_1        |[14/Feb/2020:09:00:16 +0000] localhost to: 192.168.240.4:8000: GET / HTTP/1.1 200

where you can spot to: 192.168.240.4:8000 which is the IP address of one of the application containers. If you now visit the page again multiple times you should notice a change in the upstream address, something like

$ docker-compose logs -f nginx
Attaching to service_nginx_1
nginx_1        |[14/Feb/2020:09:00:16 +0000] localhost to: 192.168.240.4:8000: GET / HTTP/1.1 200
nginx_1        |[14/Feb/2020:09:00:17 +0000] localhost to: 192.168.240.2:8000: GET / HTTP/1.1 200
nginx_1        |[14/Feb/2020:09:00:17 +0000] localhost to: 192.168.240.3:8000: GET / HTTP/1.1 200
nginx_1        |[14/Feb/2020:09:00:17 +0000] localhost to: 192.168.240.4:8000: GET / HTTP/1.1 200
nginx_1        |[14/Feb/2020:09:00:17 +0000] localhost to: 192.168.240.2:8000: GET / HTTP/1.1 200

This shows that nginx is performing load balancing, but to tell the truth this is happening through Docker's DNS, and not by an explicit action performed by the web server. We can verify this accessing the nginx container and running dig application (you need to run apt update and apt install dnsutils to install dig)

root@99c2f348140e:/# dig application

;<<>> DiG 9.11.5-P4-5.1-Debian <<>> application
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7221
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;application.                   IN      A

;; ANSWER SECTION:
application.            600     IN      A       192.168.240.2
application.            600     IN      A       192.168.240.4
application.            600     IN      A       192.168.240.3

;; Query time: 1 msec
;; SERVER: 127.0.0.11#53(127.0.0.11)
;; WHEN: Fri Feb 14 09:57:24 UTC 2020
;; MSG SIZE  rcvd: 110

To see load balancing performed by nginx we can explicitly define two services and assign them different weights. Run docker-compose down and change the nginx configuration to

upstream app {
    server application1:8000 weight=3;
    server application2:8000;
}

log_format upstreamlog '[$time_local] $host to: $upstream_addr: $request $status';

server {
    listen 80;
    server_name localhost;

    location / {
        proxy_pass http://app;
    }

    access_log /var/log/nginx/access.log upstreamlog;
}

We defined here an upstream structure that lists two different services, application1 and application2, giving the first one a weight of 3. This means that out of every 4 requests, 3 will be routed to the first service and one to the second. Now nginx is not just relying on the DNS, but consciously choosing between two different services.

Let's define the services accordingly in the Docker Compose configuration file

version: "3"

services:
  application1:
    build:
      context: .
      dockerfile: Dockerfile
    command: gunicorn --workers 6 --bind 0.0.0.0:8000 wsgi
    expose:
      - 8000
  application2:
    build:
      context: .
      dockerfile: Dockerfile
    command: gunicorn --workers 3 --bind 0.0.0.0:8000 wsgi
    expose:
      - 8000
  nginx:
    image: nginx
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf
    ports:
      - 80:80
    depends_on:
      - application1
      - application2

I basically duplicated the definition of application, but the first service is running now 6 workers, just for the sake of showing a possible difference between the two. Now run docker-compose up -d and docker-compose logs -f nginx. If you refresh the page on the browser multiple times you will see something like

$ docker-compose logs -f nginx
Attaching to service_nginx_1
nginx_1         |[14/Feb/2020:11:03:25 +0000] localhost to: 172.18.0.2:8000: GET / HTTP/1.1 200
nginx_1         |[14/Feb/2020:11:03:25 +0000] localhost to: 172.18.0.2:8000: GET /favicon.ico HTTP/1.1 404
nginx_1         |[14/Feb/2020:11:03:30 +0000] localhost to: 172.18.0.3:8000: GET / HTTP/1.1 200
nginx_1         |[14/Feb/2020:11:03:31 +0000] localhost to: 172.18.0.2:8000: GET / HTTP/1.1 200
nginx_1         |[14/Feb/2020:11:03:32 +0000] localhost to: 172.18.0.2:8000: GET / HTTP/1.1 200
nginx_1         |[14/Feb/2020:11:03:33 +0000] localhost to: 172.18.0.2:8000: GET / HTTP/1.1 200
nginx_1         |[14/Feb/2020:11:03:33 +0000] localhost to: 172.18.0.3:8000: GET / HTTP/1.1 200
nginx_1         |[14/Feb/2020:11:03:34 +0000] localhost to: 172.18.0.2:8000: GET / HTTP/1.1 200
nginx_1         |[14/Feb/2020:11:03:34 +0000] localhost to: 172.18.0.2:8000: GET / HTTP/1.1 200
nginx_1         |[14/Feb/2020:11:03:35 +0000] localhost to: 172.18.0.2:8000: GET / HTTP/1.1 200
nginx_1         |[14/Feb/2020:11:03:35 +0000] localhost to: 172.18.0.3:8000: GET / HTTP/1.1 200

where you can clearly notice the load balancing between 172.18.0.2 (application1) and 172.18.0.3 (application2) in action.

I will not show here an example of reverse proxy or HTTPS, to prevent this post from becoming too long. You can find resources on those topics in the next section.

4.3 Resources

These resources provide more detailed information on the topics discussed in this section

4.4 Issues

Well, finally we can say that the job is done. Now we have a production-ready web server in front of our concurrent web framework, and we can focus on writing Python code instead of dealing with HTTP headers.

Using a web server allows us to scale the infrastructure just adding new instances behind it, without interrupting the service. The HTTP concurrent server runs multiple instances of our framework, and the framework itself abstracts HTTP, mapping it to our high-level language.

Bonus: cloud infrastructures

Back in the early years of the Internet, companies used to have their own servers on-premise, and system administrators used to run the whole stack directly on the bare operating system. Needless to say, this was complicated, expensive, and failure-prone.

Nowadays "the cloud" is the way to go, so I want to briefly mention some components that can help you run such a web stack on AWS, which is the platform I know the most and the most widespread cloud provider in the world at the time of writing.

Elastic Beanstalk

This is the entry-level solution for simple applications, being a managed infrastructure that provides load balancing, auto-scaling, and monitoring. You can use several programming languages (among which Python and Node.js) and choose between different web servers like for example Apache or nginx. The components of an EB service are not hidden, but you don't have direct access to them, and you have to rely on configuration files to change the way they work. It's a good solution for simple services, but you will probably soon need more control.

Go to Elastic Beanstalk

Elastic Container Service (ECS)

With ECS you can run Docker containers grouping them in clusters and setting up auto-scale policies connected with metrics coming from CloudWatch. You have the choice of running them on EC2 instances (virtual machines) managed by you or on a serverless infrastructure called Fargate. ECS will run your Docker containers, but you still have to create DNS entries and load balancers on your own. You also have the choice of running your containers on Kubernetes using EKS (Elastic Kubernetes Service).

Go to Elastic Container Service

Elastic Compute Cloud (EC2)

This is the bare metal of AWS, where you spin up stand-alone virtual machines or auto-scaling group of them. You can SSH into these instances and provide scripts to install and configure software. You can install here your application, web servers, databases, whatever you want. While this used to be the way to go at the very beginning of the cloud computing age I don't think you should go for it. There is so much a cloud provider can give you in terms of associated services like logs or monitoring, and in terms of performances, that it doesn't make sense to avoid using them. EC2 is still there, anyway, and if you run ECS on top of it you need to know what you can and what you can't do.

Go to Elastic Compute Cloud

Elastic Load Balancing

While Network Load Balancers (NLB) manage pure TCP/IP connections, Application Load Balancers are dedicated to HTTP, and they can perform many of the services we need. They can reverse proxy through rules (that were recently improved) and they can terminate TLS, using certificates created in ACM (AWS Certificate Manager). As you can see, ALBs are a good replacement for a web server, even though they clearly lack the extreme configurability of a software. You can, however, use them as the first layer of load balancing, still using nginx or Apache behind them if you need some of the features they provide.

Go to Elastic Load Balancing

CloudFront

CloudFront is a Content Delivery Network, that is a geographically-distributed cache that provides faster access to your content. While CDNs are not part of the stack that I discussed in this post I think it is worth mentioning CF as it can speed-up any static content, and also terminate TLS in connection with AWS Certificate Manager.

Go to CloudFront

Conclusion

Glyph Lefkowitz: Modularity for Maintenance


Never send a human to do a machine’s job.

One of the best things about maintaining open source in the modern era is that there are so many wonderful, free tools to let machines take care of the busy-work associated with collaboration, code-hosting, continuous integration, code quality maintenance, and so on.

There are lots of great resources that explain how to automate various things that make maintenance easier.

Here are some things you can configure your Python project to do:

  1. Continuous integration, using any one of a number of providers:
    1. GitHub Actions
    2. CircleCI
    3. Azure Pipelines
    4. Appveyor
    5. GitLab CI/CD
    6. Travis CI
  2. Separate multiple test jobs with tox
  3. Lint your code with flake8
  4. Type-Check your code with MyPy
  5. Auto-update your dependencies, with one of:
    1. pyup.io
    2. requires.io, or
    3. Dependabot
  6. automatically find common security issues with Bandit
  7. check the status of your code coverage, with:
    1. Coveralls, or
    2. Codecov
  8. Auto-format your code with:
    1. Black for style
    2. autopep8 to fix common errors
    3. isort to keep your imports tidy
  9. Automatically update your dependencies
  10. Help your developers remember to do all of those steps with pre-commit
  11. Automatically release your code to PyPI via your CI provider
    1. including automatically building any C code for multiple platforms as a wheel so your users won’t have to
    2. and checking those build artifacts:
    3. to make sure they include all the files they should, with check-manifest
    4. and also that the binary artifacts have the correct dependencies for Linux
    5. and also for macOS
  12. Organize your release notes and versioning with towncrier

All of these tools are wonderful.

But... let’s say you[1] maintain a few dozen Python projects. Being a good maintainer, you’ve started splitting up your big monolithic packages into smaller ones, so your utility modules can be shared as widely as possible rather than re-implemented once for each big framework. This is great!

However, every one of those numbered list items above is now a task per project that you have to repeat from scratch. So imagine a matrix with all of those down one side and dozens of projects across the top - the full Cartesian product of these little administrative tasks is a tedious and exhausting pile of work.

If you’re lucky enough to start every project close to perfect already, you can skip some of this work, but that partially just front-loads the tedium; plus, projects tend to start quite simple, then gradually escalate in complexity, so it’s helpful to be able to apply these incremental improvements one at a time, as your project gets bigger.

I really wish there were a tool that could take each of these steps and turn them into a quick command-line operation; like, I type pyautomate pypi-upload and the tool notices which CI provider I use, whether I use tox or not, and adds the appropriate configuration entries to both my CI and tox configuration to allow me to do that, possibly prompting me for a secret. Same for pyautomate code-coverage or what have you. All of these automations are fairly straightforward; almost all of the files you need to edit are easily parse-able either as yaml, toml, or ConfigParser[2] files.
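
To give a feel for how small each individual automation could be, here is a sketch of what the tox half of a hypothetical pyautomate code-coverage command might boil down to; the environment name and commands are invented for illustration, and a real tool would also have to preserve comments and formatting, which ConfigParser happily throws away:

from configparser import ConfigParser

def add_coverage_env(path='tox.ini'):
    # Read the existing tox configuration (an "INI-like" file).
    config = ConfigParser(interpolation=None)
    config.read(path)

    # Add a hypothetical "coverage" environment to the envlist...
    envlist = config['tox'].get('envlist', '')
    if 'coverage' not in envlist:
        config['tox']['envlist'] = f'{envlist},coverage' if envlist else 'coverage'

    # ...and give it its own section with deps and commands to run.
    if not config.has_section('testenv:coverage'):
        config.add_section('testenv:coverage')
        config['testenv:coverage']['deps'] = 'coverage'
        config['testenv:coverage']['commands'] = 'coverage run -m pytest'

    with open(path, 'w') as f:
        config.write(f)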

A few years ago, I asked for this to be added to CookieCutter, but I think the task is just too big and complicated to reasonably expect the existing maintainers to ever get around to it.

If you have a bunch of spare time, and really wanted to turbo-charge the Python open source community, eliminating tons of drag on already-over-committed maintainers, such a tool would be amazing.


  1. and by you, obviously, I mean “I” 

  2. “INI-like files”, I guess? what is this format even called? 

Robin Parmar: A review of Processing books

Processing is the free and open Java development environment that targets artists who are intrigued by generative code. In essence it is the Java language with a friendly development interface and built-in libraries to get you started.

There are plenty of ways to learn Processing, including the tutorials on the organisation's website, and the built-in examples that come with the distribution. But if you prefer a printed book, keep reading. This article will review nine available publications, so you can make an informed purchase decision.

For the sake of completeness I will also append information on two books I haven't had a chance to read.

Recommended books

Reas, Casey and Ben Fry. 2014. Processing: A Programming Handbook for Visual Designers and Artists. Second edition. London: The MIT Press. website

This book is straight from the horse's mouth; Reas and Fry created Processing. The 38 chapters cover an extensive array of topics: drawing and shapes, interactivity, variables and program flow, typography, images and transformations, animation, data manipulation, 3D graphics and rendering. The four main sections are bookended by informative practitioner interviews. There's a glossary, reading list, and index. If 640 pages isn't enough, supplementary chapters are available on the website. If you had to buy just one Processing book, this would be the one, since it covers both the basics and more advanced applications.


Shiffman, Daniel. 2015. Learning Processing: A Beginner’s Guide to Programming Images, Animation, and Interaction. Second edition. Burlington, MA: Morgan Kaufmann. website

If you have done little previous coding, this is a good guide, since it approaches Processing from scratch. The thorough topic list includes language constructs (loops, functions, objects), maths and transformations, working with images and video, data manipulation and networking, sound, and exporting. 525 pages.

Many of the examples will be familiar from the tutorials on the Processing website. Shiffman is also known for his cheerleader-style videos, which cover just about every topic under the sun with a ridiculous amount of enthusiasm. If you avail of the resources Shiffman's generosity makes available for free, you might find little need for this book. But for those who prefer the printed page, it's a good alternative to Reas and Fry.


Shiffman, Daniel. 2012. The Nature of Code. New York: self-published. website

This book is a more advanced consideration of techniques for generating algorithmic visuals. It's the perfect follow-up to Shiffman's Learning Processing. Indeed, those who already have some coding background should skip directly to this volume, after looking at the online tutorials.

The first sections cover physics (vectors, forces, oscillation) and particle systems. With this groundwork in hand, topics in algorithmic visuals are tackled, including autonomous agents, cellular automata, fractals, genetic algorithms, and neural networks. All of this in Shiffman's usual style: pleasant, accurate, thorough. I would buy this alongside Reas and Fry, since there's not much overlap in the topics. 520 pages.


Fry, Ben. 2008. Visualizing Data. Sebastopol, California: O'Reilly. publisher and author websites

This book provides a rigorous methodological framework for approaching the complex topic of data visualisation, alongside useful code examples. Two introductory chapters cover the visualisation workflow and the basics of Processing. Then we launch into the nitty-gritty of acquiring and parsing data, mapping, providing interaction, time series, scatterplots, tree maps, recursion, networks and graphs, etc. You will notice immediately that this book is more focused and technical than others. An updated edition would ice the cake, since this volume is getting rather old. 365 pages including index.


Greenberg, Ira; Dianna Xu; Deepak Kumar. 2013. Processing: Creative Coding and Generative Art in Processing 2. New York: Apress. website

This book provides an introduction to Processing for those new to the environment, teaching programming fundamentals and the elements of the language alongside sections tackling specific problems. There is nothing wrong with this approach, but if you already own one of the other introductions, the first half of this book is redundant. Thankfully, the topics do get more involved: advanced OOP (encapsulation, inheritance), data visualisation (parsing, graphs, heat map, word clouds, interaction, tree map), motion (vectors, boundary conditions, verlet integration), recursion and L-systems, image manipulation (bitwise manipulation, masking, filtering, convolution), 3D graphics (perspective, projection). 445 pages including index. While the contents overlap others here, these authors offer a deep and broad approach of significant value.


Other titles

Richardson, Andrew. 2016. Data-driven Graphic Design: Creative Coding For Visual Communications. London: Bloomsbury. website

This is an odd book that doesn't seem to know its audience. In part, it's a coffee-table volume highlighting creative use of code with full-page illustrations. Such a book should present interviews with practitioners, detailed background on artworks, and analysis of the intriguing interactions between technology, aesthetics, and the marketplace. But very little of this is actually found between these covers.

Instead, each chapter concludes with code examples, as though this is a book for programmers. But coders are hardly edified by statements like "The ability of a computer program to infinitely repeat calculations and processes gives it an enormous potential for creating complex drawings and graphics." This obvious statement is found, not in the preface, but well into chapter two! Topics include generative drawing, growth and form, dynamic typography, interactive projections, and data visualisation.


Pearson, Matt. 2011. Generative Art. Shelter Island, NY: Manning. website

Similar to the Richardson volume, this book attempts an overview of generative art while at the same time introducing the coding skills that might produce such wonders. With its greater emphasis on programming, it does a better job. But at 200 pages, the volume is slim. The first chapter provides a sketchy (pun intended) introduction to the topic, with insufficient historical examples. The second chapter contains a very basic introduction to Processing. Further topics include: randomness and Perlin noise, simple shapes, OOP, cellular automata, and fractals. The code examples reward study, and are the strength of this volume. But this material is covered with greater focus elsewhere.


Lees, Antony, ed. 2019. Beginning Graphics Programming with Processing 3. self-published. website

This book was assembled as an independent publishing project by a group of students. Section 1 covers programming principles (algorithms and operators through to OOP); section 2 examines shapes and interaction; section 3 covers images, rendering, curves, and 3D. These contents are most suitable for a beginner; hence the book competes head-on with Shiffman. 650 pages.


Gradwohl, Nikolaus. 2013. Processing 2: Creative Coding Hotshot. Birmingham, UK: Packt Publishing.

This book consists of nine projects: cardboard robots re-enacting Romeo and Juliet, a sound-reactive dance floor, a moon lander game for Android, etc. These emphasise interactivity and physical computing, integrating Processing with environments such as Arduino. You'll want to get this if you are tasked with running a workshop for tweens, but otherwise it's rather difficult to know the audience.


And more

Two further books come highly recommended, but I haven't yet had the opportunity to check them out. Send me a copy if you'd like a review!

Bohnacker, Hartmut; Benedikt Gross; Julia Laub; Claudius Lazzeroni. 2012. Generative Design: Visualize, Program, and Create with Processing. Hudson, New York: Princeton Architectural Press. website

Don't confuse this with the newer version that covers similar topics, but implements them in Javascript rather than Processing. 472 pages.


Glassner, Andrew. 2011. Processing for Visual Artists: How to Create Expressive Images and Interactive Art. Natick, MA: A. K. Peters. website

Designed for code newcomers, this includes advice on workflow and standards not commonly found in other books. An enormous 937 pages including index. Freakishly expensive.

Anwesha Das: February PyLadies Pune workshop

It was time for “learning Python with hardware” in February 2020 with PyLadies in Pune. Coding in Python becomes fun when one can see the changes it makes in the hardware.

Selecting a venue is always a difficult task for any organizer. College Of Engineering Pune (COEP) has always been supportive of PyLadies Pune. When I approached Abhijit for the venue he readily agreed. I cannot thank him, the Women Engineers Group, and the FOSSMeet Pune team enough for that.

Once I reached the venue it was already a full house, and people were still coming in. We had more than 55 students, from first to third year, attending the workshop. The first-year students already knew Python. Around 12-14 people were writing Python for the first time.

The workshop started with the very basics of the language on the terminal.

[Image: feb1pyladies]

Then came the exciting part: trying Python on hardware. We were using Circuit Playground Express boards by Adafruit. Nina and Kattni provided these boards to us. We, people on the other side of the world, do not have easy access to Adafruit hardware; it takes a lot of time and money to get it. The students were holding such a board for the first time. I cannot thank Nina, Kattni, and Adafruit enough for that.

We started with blinking the first LED on the board. When the students lit their first LED, the smile and light in their eyes were precious :). Following that we spent some time with simple code and tried our hands at different modules of CircuitPython, taking help from the tutorials on the Adafruit website. The students were enjoying themselves and getting creative, so I decided to give them problem statements instead of showing them code. I was happy to see how fast they were solving them and experimenting with different patterns and colours.
The workshop finished sooner than I expected. We bade the students goodbye with a promise to return with another workshop like this.
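
For anyone curious, the very first exercise looked roughly like the classic CircuitPython blink example below (based on Adafruit's documentation, not the exact code we used); on the Circuit Playground Express the small built-in red LED is available as board.D13:

# Saved as code.py on the CIRCUITPY drive; the board runs it automatically.
import time
import board
import digitalio

led = digitalio.DigitalInOut(board.D13)   # the little red LED on the board
led.direction = digitalio.Direction.OUTPUT

while True:
    led.value = True    # LED on
    time.sleep(0.5)
    led.value = False   # LED off
    time.sleep(0.5)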

[Image: feb2pyladies]

This was the first time I had run a hardware workshop for a larger group. I was scared at the beginning, so I spent many days preparing for it. But I still learned a lot of new things there:

A workshop cannot be fully planned. One might have a structure, but not a plan. All of it depends on the participants and how they respond, and of course on the mercy of the hardware. I was amazed to see the problems that came up, be it:

  • code.py becoming read-only;
  • how many different approaches there could be to solve one problem;
  • people writing Python like C, as they do for college :)

This post would be incomplete if I did not thank Kushal (https://twitter.com/kushaldas) for being a patient Teaching Assistant for my workshop and helping the students throughout.

It was a great experience for me. The feedback from one student, “I was sure that I am going to be bored. But you taught so well that 3 and half hours just flew”, will be on my inspiration board from now on :).

Mike Driscoll: PyDev of the Week: Martin Fitzpatrick

This week we welcome Martin Fitzpatrick (@mfitzp) as our PyDev of the Week! Martin is the author of “Create Simple GUI Applications with Python and Qt 5” and the creator of the LearnPyQt website. You can also check out his personal site or see what he’s up to by visiting his Github profile. Let’s spend some time getting to know Martin better!

Martin Fitzpatrick

Can you tell us a little about yourself (hobbies, education, etc):

I’m a developer from the United Kingdom, who’s been working with Python for the past 12 years, and living in the Netherlands (Amersfoort) for the past 5.

I started coding on 8 bit machines back in the early 90s, creating platform games of dubious quality— in my defence we didn’t have StackOverflow back then. Later I moved onto the PC, first writing DOS games and then, after someone invented the internet, doing a stint of web dev. I’ve been programming on and off ever since.

Rather than pursue software development as a career, I instead took a long detour into healthcare/biology. I worked first in the ambulance service, then as a physiotherapy assistant and finally completed a degree and PhD in Bioinformatics & Immunology. This last step was where I discovered Python, ultimately leading me to where I am now.

In my spare time I tinker in my workshop, creating daft electronic games and robots.

I like robots.

Why did you start using Python?

I first used Python back in 2007 when I was looking at alternatives to building websites with Drupal/PHP. That led me to Django and Python. It felt so much simpler and more logical than what I’d used before; after knocking something together in an afternoon I was basically hooked.

For the next few years I was using Python almost exclusively for web development, and it probably would have stayed that way were it not for my PhD.

My thesis project was looking at the effects of metabolism on rheumatoid arthritis, and required me to analyse some big chunks of data. Having worked with Python for the previous 4 years it only seemed natural to try and use it here, rather than stop and learn R or MATLAB. The Python data analysis landscape was still a bit rough back then, but improving quickly — pandas and Jupyter notebooks first appeared during this time. The final couple of years of my PhD I was looking to make the tools I’d written more accessible to non-technical users and started building GUI applications with PyQt5.

In the past couple of years I discovered microcontrollers (ESP8266 and Raspberry Pi) and have built some silly things with MicroPython.

What other programming languages do you know and which is your favorite?

Python is my favourite, hands down. There is something about the language that lines up very well with my brain, might be all the empty space.

I have learnt and forgotten quite a few languages including PHP, Pascal, Perl, Prolog and Z80 assembler. I can still bash something together in C and does MicroPython count as another language?

What projects are you working on now?

For the past month I’ve been working on updates to my PyQt5 book, Create Simple GUI Applications with Python & Qt, and the accompanying PyQt5/PySide2 tutorials. There are plenty of good beginner resources available, but once you get into developing real applications you’re a bit out of luck. I’ve got a small team of writers together now, which is really kicking things up a notch; it’s a lot of fun.

The site itself is custom built on top of Wagtail/Django, and I would like to make that available to other Python developers who want to host their own courses and tutorials.

Over on my workbench I have a few electronic game projects in progress using Python but can’t share anything yet. My biggest success to date was Etch-A-Snap a Raspberry Pi powered Etch-A-Sketch camera. Basically, you take a photo and it draws it automatically on a mini Etch-A-Sketch. That was bought by a Japanese collector which was kind of mind-blowing.

If anyone is interested in working with me I’m always open to collaborate!

Which Python libraries are your favorite (core or 3rd party)?

Python numeric libraries (numpy, scipy, pandas) and Jupyter notebooks are the big standouts for me. Collectively they have saved me hours upon hours of work, and having these kinds of tools available has been a major boost to Python’s popularity.

I think I have to say PyQt5/PySide2 here as well.

How did you become a self-published author?

When I started writing GUI applications in PyQt5 it was a bit of an uphill struggle. There were very few examples or tutorials available at the time, so I was stuck converting examples from C++. Once I had it figured out I just sat and wrote down everything I’d needed to learn on the way — hopefully to save other people from the same fate. That was the first edition of the book, and I’ve been updating it regularly ever since.

Self publishing was just the simplest option to begin with, allowing me to concentrate on writing the book I wanted to read. Offers I have had from publishers all come with changes or restrictions on distribution I don’t agree with (the book is currently Creative Commons licensed).

Why did you choose PyQt over the other Python GUI toolkits?

During my PhD I was developing a tool for visualizing chemical pathways in cells and needed a good way to visualize overlaid pathways. The only solution I could find for this involved rendering using HTML/Javascript in an embedded window.

I built prototypes of the app with Tkinter, wxWidgets and Qt, and actually quite liked wxWidgets for its native widgets. But Qt was the only toolkit which shipped with its own built-in browser and so worked reliably across different platforms (I was on MacOS, my supervisor on Windows).

The rest, as they say, is history. The big benefit of Qt for me now is the QGraphicsScene vector graphics canvas which is great for interactive visualizations.

Do you have any advice for other potential authors?

Firstly, everybody knows something, or some combination of things or some things from a certain angle that is unique enough to form the basis of a book. Even if a book already exists in your niche, your own voice or experience is valuable.

Secondly, pick a topic that you’re interested in but that you also enjoy doing. Writing a programming book means writing a lot of example code, and if you don’t enjoy creating these things it’s going to be an unpleasant experience.

One of my first steps writing my book was to start a collection of example PyQt5 applications including clones of classics like Paint and Solitaire. This helped me figure out what was *required* knowledge, but — more importantly — it was a lot of fun.

Is there anything else you’d like to say?

I often get questions from Python developers who’ve come to programming a little later, asking “am I too old to start doing this professionally?”

The answer is no! I’ve had a career spanning admin (finance, human resources), healthcare (ambulance service, physiotherapy assistant, disability support worker), research scientist (PhD and post-doc) and landed my first actual “programming job” at 37.

One great advantage of learning to code in Python is that it can be applied to so many different problem domains. Whether you build your experience automating some office reports, doing data analysis or building websites, those skills will transfer and your Python will improve.

Thanks for doing the interview, Martin!

The post PyDev of the Week: Martin Fitzpatrick appeared first on The Mouse Vs. The Python.

Stories in My Pocket: Refactoring and asking for forgiveness

Recently, I had a great interaction with one of my coworkers that I think is worth sharing, with the hope you may learn a bit about refactoring and python.

My colleague came to me to help him think through a problem that surfaced with a change to a project. The code in question sends a file to a remote storage service. It looked like this:


Read more...

Real Python: A Guide to the Newer Python String Format Techniques

In the previous tutorial in this introductory series, you learned how to format string data using the string modulo operator. The string modulo operator is useful, and it’s good for you to be familiar with it because you’re likely to encounter it in older Python code. However, there are two newer ways that you can use Python to format strings that are arguably more preferable.

In this tutorial, you’ll learn about:

  1. The string .format() method
  2. The formatted string literal, or f-string

You’ll learn about these formatting techniques in detail and add them to your Python string formatting toolkit. Note that there’s a standard module called string containing a class called Template, which provides some string formatting through interpolation. The string modulo operator provides more or less the same functionality, so you won’t cover string.Template here.
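
If you're curious, here's a quick taste of what Template interpolation looks like, using the same values as the examples below:

>>> from string import Template
>>> t = Template('$quantity $item cost $$$price')   # '$$' produces a literal '$'
>>> t.substitute(quantity=6, item='bananas', price=1.74)
'6 bananas cost $1.74'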

The Python String .format() Method

The Python string .format() method was introduced in version 2.6. It’s similar in many ways to the string modulo operator, but .format() goes well beyond in versatility. The general form of a Python .format() call is shown below:

<template>.format(<positional_argument(s)>,<keyword_argument(s)>)

Note that this is a method, not an operator. You call the method on <template>, which is a string containing replacement fields. The <positional_arguments> and <keyword_arguments> to the method specify values that are inserted into <template> in place of the replacement fields. The resulting formatted string is the method’s return value.

In the <template> string, replacement fields are enclosed in curly braces ({}). Anything not contained in curly braces is literal text that’s copied directly from the template to the output. If you need to include a literal curly bracket character, like { or }, in the template string, then you can escape this character by doubling it:

>>> '{{ {0} }}'.format('foo')
'{ foo }'

Now the curly braces are included in your output.

The String .format() Method: Arguments

Let’s start with a quick example to get you acquainted before you dive into more detail on how to use this method in Python to format strings. For review, here’s the first example from the previous tutorial on the string modulo operator:

>>> print('%d %s cost $%.2f' % (6, 'bananas', 1.74))
6 bananas cost $1.74

Here, you used the string modulo operator in Python to format the string. Now, you can use Python’s string .format() method to obtain the same result, like this:

>>> print('{0} {1} cost ${2}'.format(6, 'bananas', 1.74))
6 bananas cost $1.74

In this example, <template> is the string '{0} {1} cost ${2}'. The replacement fields are {0}, {1}, and {2}, which contain numbers that correspond to the zero-based positional arguments 6, 'bananas', and 1.74. Each positional argument is inserted into the template in place of its corresponding replacement field:

[Image: Using the string .format() method in Python to format a string with positional arguments]

The next example uses keyword arguments instead of positional parameters to produce the same result:

>>> print('{quantity} {item} cost ${price}'.format(
...     quantity=6,
...     item='bananas',
...     price=1.74))
6 bananas cost $1.74

In this case, the replacement fields are {quantity}, {item}, and {price}. These fields specify keywords that correspond to the keyword arguments quantity=6, item='bananas', and price=1.74. Each keyword value is inserted into the template in place of its corresponding replacement field:

[Image: Using the string .format() method in Python to format a string with keyword arguments]

You’ll learn more about positional and keyword arguments in the next tutorial in this introductory series, which explores functions and argument passing. For now, the two sections that follow will show you how these are used with the Python .format() method.

Positional Arguments

Positional arguments are inserted into the template in place of numbered replacement fields. Like list indexing, the numbering of replacement fields is zero-based. The first positional argument is numbered 0, the second is numbered 1, and so on:

>>> '{0}/{1}/{2}'.format('foo', 'bar', 'baz')
'foo/bar/baz'

Note that replacement fields don’t have to appear in the template in numerical order. They can be specified in any order, and they can appear more than once:

>>> '{2}.{1}.{0}/{0}{0}.{1}{1}.{2}{2}'.format('foo', 'bar', 'baz')
'baz.bar.foo/foofoo.barbar.bazbaz'

When you specify a replacement field number that’s out of range, you’ll get an error. In the following example, the positional arguments are numbered 0, 1, and 2, but you specify {3} in the template:

>>> '{3}'.format('foo', 'bar', 'baz')
Traceback (most recent call last):
  File "<pyshell#26>", line 1, in <module>
    '{3}'.format('foo', 'bar', 'baz')
IndexError: tuple index out of range

This raises an IndexError exception.

Starting with Python 3.1, you can omit the numbers in the replacement fields, in which case the interpreter assumes sequential order. This is referred to as automatic field numbering:

>>> '{}/{}/{}'.format('foo', 'bar', 'baz')
'foo/bar/baz'

When you specify automatic field numbering, you must provide at least as many arguments as there are replacement fields:

>>> '{}{}{}{}'.format('foo', 'bar', 'baz')
Traceback (most recent call last):
  File "<pyshell#27>", line 1, in <module>
    '{}{}{}{}'.format('foo', 'bar', 'baz')
IndexError: tuple index out of range

In this case, there are four replacement fields in the template but only three arguments, so an IndexError exception occurs. On the other hand, it’s fine if the arguments outnumber the replacement fields. The excess arguments simply aren’t used:

>>> '{}{}'.format('foo', 'bar', 'baz')
'foobar'

Here, the argument 'baz' is ignored.

Note that you can’t intermingle these two techniques:

>>> '{1}{}{0}'.format('foo', 'bar', 'baz')
Traceback (most recent call last):
  File "<pyshell#28>", line 1, in <module>
    '{1}{}{0}'.format('foo', 'bar', 'baz')
ValueError: cannot switch from manual field specification to automatic field numbering

When you use Python to format strings with positional arguments, you must choose between either automatic or explicit replacement field numbering.

Keyword Arguments

Keyword arguments are inserted into the template string in place of keyword replacement fields with the same name:

>>> '{x}/{y}/{z}'.format(x='foo', y='bar', z='baz')
'foo/bar/baz'

In this example, the values of the keyword arguments x, y, and z take the place of the replacement fields {x}, {y}, and {z}, respectively.

If you refer to a keyword argument that’s missing, then you’ll see an error:

>>> '{x}/{y}/{w}'.format(x='foo', y='bar', z='baz')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'w'

Here, you specify replacement field {w}, but there’s no corresponding keyword argument named w. Python raises a KeyError exception.

While you have to specify positional arguments in sequential order, you can specify keyword arguments in any arbitrary order:

>>> '{0}/{1}/{2}'.format('foo', 'bar', 'baz')
'foo/bar/baz'
>>> '{0}/{1}/{2}'.format('bar', 'baz', 'foo')
'bar/baz/foo'
>>> '{x}/{y}/{z}'.format(x='foo', y='bar', z='baz')
'foo/bar/baz'
>>> '{x}/{y}/{z}'.format(y='bar', z='baz', x='foo')
'foo/bar/baz'

You can specify both positional and keyword arguments in one Python .format() call. Just note that, if you do so, then all the positional arguments must appear before any of the keyword arguments:

>>> '{0}{x}{1}'.format('foo', 'bar', x='baz')
'foobazbar'
>>> '{0}{x}{1}'.format('foo', x='baz', 'bar')
  File "<stdin>", line 1
SyntaxError: positional argument follows keyword argument

In fact, the requirement that all positional arguments appear before any keyword arguments doesn’t apply only to Python format methods. This is generally true of any function or method call. You’ll learn more about this in the next tutorial in this series, which explores functions and function calls.

In all the examples shown so far, the values you passed to the Python .format() method have been literal values, but you may specify variables as well:

>>> x = 'foo'
>>> y = 'bar'
>>> z = 'baz'
>>> '{0}/{1}/{s}'.format(x, y, s=z)
'foo/bar/baz'

In this case, you pass the variables x and y as positional parameter values and z as a keyword parameter value.

The String .format() Method: Simple Replacement Fields

As you’ve seen, when you call Python’s .format() method, the <template> string contains replacement fields. These indicate where in the template to insert the arguments to the method. A replacement field consists of three components:

{[<name>][!<conversion>][:<format_spec>]}

These components are interpreted as follows:

Component       Meaning
<name>          Specifies the source of the value to be formatted
<conversion>    Indicates which standard Python function to use to perform the conversion
<format_spec>   Specifies more detail about how the value should be converted

Each component is optional and may be omitted. Let’s take a look at each component in more depth.

The <name> Component

The <name> component is the first portion of a replacement field:

{[<name>][!<conversion>][:<format_spec>]}

<name> indicates which argument from the argument list is inserted into the Python format string in the given location. It’s either a number for a positional argument or a keyword for a keyword argument. In the following example, the <name> components of the replacement fields are 0, 1, and baz, respectively:

>>> x, y, z = 1, 2, 3
>>> '{0}, {1}, {baz}'.format(x, y, baz=z)
'1, 2, 3'

If an argument is a list, then you can use indices with <name> to access the list’s elements:

>>> a = ['foo', 'bar', 'baz']
>>> '{0[0]}, {0[2]}'.format(a)
'foo, baz'
>>> '{my_list[0]}, {my_list[2]}'.format(my_list=a)
'foo, baz'

Similarly, you can use a key reference with <name> if the corresponding argument is a dictionary:

>>> d = {'key1': 'foo', 'key2': 'bar'}
>>> d['key1']
'foo'
>>> '{0[key1]}'.format(d)
'foo'
>>> d['key2']
'bar'
>>> '{my_dict[key2]}'.format(my_dict=d)
'bar'

You can also reference object attributes from within a replacement field. In the previous tutorial in this series, you learned that virtually every item of data in Python is an object. Objects may have attributes assigned to them that are accessed using dot notation:

obj.attr

Here, obj is an object with an attribute named attr. You use dot notation to access the object’s attribute. Let’s see an example. Complex numbers in Python have attributes named .real and .imag that represent the real and imaginary portions of the number. You can access these using dot notation as well:

>>> z = 3 + 5j
>>> type(z)
<class 'complex'>
>>> z.real
3.0
>>> z.imag
5.0

There are several upcoming tutorials in this series on object-oriented programming, in which you’ll learn a great deal more about object attributes like these.

The relevance of object attributes in this context is that you can specify them in a Python .format() replacement field:

>>> z
(3+5j)
>>> 'real = {0.real}, imag = {0.imag}'.format(z)
'real = 3.0, imag = 5.0'

As you can see, it’s relatively straightforward in Python to format components of complex objects using the .format() method.

The <conversion> Component

The <conversion> component is the middle portion of a replacement field:

{[<name>][!<conversion>][:<format_spec>]}

Python can format an object as a string using three different built-in functions:

  1. str()
  2. repr()
  3. ascii()

By default, the Python .format() method uses str(), but in some instances, you may want to force .format() to use one of the other two. You can do this with the <conversion> component of a replacement field. The possible values for <conversion> are shown in the table below:

Value   Meaning
!s      Convert with str()
!r      Convert with repr()
!a      Convert with ascii()

The following examples force Python to perform string conversion using str(), repr(), and ascii(), respectively:

>>> '{0!s}'.format(42)
'42'
>>> '{0!r}'.format(42)
'42'
>>> '{0!a}'.format(42)
'42'

In many cases, the result is the same regardless of which conversion function you use, as you can see in the example above. That being said, you won’t often need the <conversion> component, so you won’t spend a lot of time on it here. However, there are situations where it makes a difference, so it’s good to be aware that you have the capability to force a specific conversion function if you need to.

The <format_spec> Component

The <format_spec> component is the last portion of a replacement field:

{[<name>][!<conversion>][:<format_spec>]}

<format_spec> represents the guts of the Python .format() method’s functionality. It contains information that exerts fine control over how values are formatted prior to being inserted into the template string. The general form is this:

:[[<fill>]<align>][<sign>][#][0][<width>][<group>][.<prec>][<type>]

The ten subcomponents of <format_spec> are specified in the order shown. They control formatting as described in the table below:

Subcomponent   Effect
:              Separates the <format_spec> from the rest of the replacement field
<fill>         Specifies how to pad values that don't occupy the entire field width
<align>        Specifies how to justify values that don't occupy the entire field width
<sign>         Controls whether a leading sign is included for numeric values
#              Selects an alternate output form for certain presentation types
0              Causes values to be padded on the left with zeros instead of ASCII space characters
<width>        Specifies the minimum width of the output
<group>        Specifies a grouping character for numeric output
.<prec>        Specifies the number of digits after the decimal point for floating-point presentation types, and the maximum output width for string presentation types
<type>         Specifies the presentation type, which is the type of conversion performed on the corresponding argument

These functions are analogous to the components you’ll find in the string modulo operator’s conversion specifier, but with somewhat greater capability. You’ll see their capabilities explained more fully in the following sections.

The <type> Subcomponent

Let’s start with <type>, which is the final portion of <format_spec>. The <type> subcomponent specifies the presentation type, which is the type of conversion that’s performed on the corresponding value to produce the output. The possible values are shown below:

Value    Presentation Type
b        Binary integer
c        Single character
d        Decimal integer
e or E   Exponential
f or F   Floating point
g or G   Floating point or Exponential
o        Octal integer
s        String
x or X   Hexadecimal integer
%        Percentage

These are like the conversion types used with the string modulo operator, and in many cases, they function the same. The following examples demonstrate the similarity:

>>> '%d' % 42
'42'
>>> '{:d}'.format(42)
'42'
>>> '%f' % 2.1
'2.100000'
>>> '{:f}'.format(2.1)
'2.100000'
>>> '%s' % 'foobar'
'foobar'
>>> '{:s}'.format('foobar')
'foobar'
>>> '%x' % 31
'1f'
>>> '{:x}'.format(31)
'1f'

However, there are some minor differences between some of the Python .format() presentation types and the string modulo operator conversion types:

  • b: .format() designates binary integer conversion; the string modulo operator does not support it.
  • i, u: not supported by .format(); the string modulo operator designates integer conversion.
  • c: .format() designates character conversion, and the corresponding value must be an integer; the string modulo operator designates character conversion, but the corresponding value may be either an integer or a single-character string.
  • g, G: .format() chooses between floating-point or exponential output, but the rules governing the choice are slightly more complicated; the string modulo operator chooses between floating-point or exponential output depending on the magnitude of the exponent and the value specified for <prec>.
  • r, a: not supported by .format() (though the functionality is provided by the !r and !a conversion components in the replacement field); the string modulo operator designates conversion with repr() or ascii(), respectively.
  • %: .format() converts a numeric argument to a percentage; the string modulo operator inserts a literal '%' character into the output.

Next, you’ll see several examples illustrating these differences, as well as some of the added features of the Python .format() method presentation types. The first presentation type you’ll see is b, which designates binary integer conversion:

>>> '{:b}'.format(257)
'100000001'

The modulo operator doesn’t support binary conversion type at all:

>>> '%b' % 257
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unsupported format character 'b' (0x62) at index 1

However, the modulo operator does allow decimal integer conversion with any of the d, i, and u types. Only the d presentation type indicates decimal integer conversion with the Python .format() method. The i and u presentation types aren’t supported and aren’t really necessary.

Next up is single character conversion. The modulo operator allows either an integer or single character value with the c conversion type:

>>> '%c' % 35
'#'
>>> '%c' % '#'
'#'

On the other hand, Python’s .format() method requires that the value corresponding to the c presentation type be an integer:

>>> '{:c}'.format(35)
'#'
>>> '{:c}'.format('#')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: Unknown format code 'c' for object of type 'str'

When you try to pass a value of a different type, you’ll get a ValueError.

For both the string modulo operator and Python’s .format() method, the g conversion type chooses either floating-point or exponential output, depending on the magnitude of the exponent and the value specified for <prec>:

>>> '{:g}'.format(3.14159)
'3.14159'
>>> '{:g}'.format(-123456789.8765)
'-1.23457e+08'

The exact rules governing the choice are slightly more complicated with .format() than they are with the modulo operator. Generally, you can trust that the choice will make sense.

G is identical to g except for when the output is exponential, in which case the 'E' will be in uppercase:

>>> '{:G}'.format(-123456789.8765)
'-1.23457E+08'

The result is the same as in the previous example, but this time with an uppercase 'E'.

Note: There are a couple of other situations where you’ll see a difference between the g and G presentation types.

Under some circumstances, a floating-point operation can result in a value that’s essentially infinite. The string representation of such a number in Python is 'inf'. It may also happen that a floating-point operation produces a value that can’t be represented as a number. Python represents this with the string 'NaN', which stands for Not a Number.

When you pass these values to Python’s .format() method, the g presentation type produces lowercase output, and G produces uppercase output:

>>> x = 1e300 * 1e300
>>> x
inf
>>> '{:g}'.format(x)
'inf'
>>> '{:g}'.format(x * 0)
'nan'
>>> '{:G}'.format(x)
'INF'
>>> '{:G}'.format(x * 0)
'NAN'

You’ll see similar behavior with the f and F presentations types as well:

>>> '{:f}'.format(x)
'inf'
>>> '{:F}'.format(x)
'INF'
>>> '{:f}'.format(x * 0)
'nan'
>>> '{:F}'.format(x * 0)
'NAN'

For more information on floating-point representation, inf, and NaN, check out the Wikipedia page on IEEE 754-1985.

The modulo operator supports r and a conversion types to force conversion by repr() and ascii(), respectively. Python’s .format() method doesn’t support r and a presentation types, but you can accomplish the same thing with the !r and !a conversion components in the replacement field.

Finally, you can use the % conversion type with the modulo operator to insert a literal '%' character into the output:

>>> '%f%%' % 65.0
'65.000000%'

You don’t need anything special to insert a literal '%' character into the Python .format() method’s output, so the % presentation type serves a different handy purpose for percent output. It multiplies the specified value by 100 and appends a percent sign:

>>> '{:%}'.format(0.65)
'65.000000%'

The remaining parts of <format_spec> indicate how the chosen presentation type is formatted, in much the same way as the string modulo operator’s width and precision specifiers and conversion flags. These are described more fully in the following sections.

The <fill> and <align> Subcomponents

<fill> and <align> control how formatted output is padded and positioned within the specified field width. These subcomponents only have meaning when the formatted field value doesn’t occupy the entire field width, which can only happen if a minimum field width is specified with <width>. If <width> isn’t specified, then <fill> and <align> are effectively ignored. You’ll cover <width> later on in this tutorial.

Here are the possible values for the <align> subcomponent:

  • <
  • >
  • ^
  • =

A value using the less than sign (<) indicates that the output is left-justified:

>>> '{0:<8s}'.format('foo')
'foo     '
>>> '{0:<8d}'.format(123)
'123     '

This behavior is the default for string values.

A value using the greater than sign (>) indicates that the output should be right-justified:

>>> '{0:>8s}'.format('foo')
'     foo'
>>> '{0:>8d}'.format(123)
'     123'

This behavior is the default for numeric values.

A value using a caret (^) indicates that the output should be centered in the output field:

>>> '{0:^8s}'.format('foo')
'  foo   '
>>> '{0:^8d}'.format(123)
'  123   '

Finally, you can also specify a value using the equals sign (=) for the <align> subcomponent. This only has meaning for numeric values, and only when a sign is included in the output.

When numeric output includes a sign, it’s normally placed directly to the left of the first digit in the number, as shown above. If <align> is set to the equals sign (=), then the sign appears at the left edge of the output field, and padding is placed in between the sign and the number:

>>> '{0:+8d}'.format(123)
'    +123'
>>> '{0:=+8d}'.format(123)
'+    123'
>>> '{0:+8d}'.format(-123)
'    -123'
>>> '{0:=+8d}'.format(-123)
'-    123'

You’ll cover the <sign> component in detail in the next section.

<fill> specifies how to fill in extra space when the formatted value doesn’t completely fill the output width. It can be any character except for curly braces ({}). (If you really feel compelled to pad a field with curly braces, then you’ll just have to find another way!)

Some examples of the use of <fill> are shown below:

>>> '{0:->8s}'.format('foo')
'-----foo'
>>> '{0:#<8d}'.format(123)
'123#####'
>>> '{0:*^8s}'.format('foo')
'**foo***'

If you specify a value for <fill>, then you should also include a value for <align> as well.

The <sign> Subcomponent

You can control whether a sign appears in numeric output with the <sign> component. For example, in the following, the plus sign (+) specified in the <format_spec> indicates that the value should always be displayed with a leading sign:

>>> '{0:+6d}'.format(123)
'  +123'
>>> '{0:+6d}'.format(-123)
'  -123'

Here, you use the plus sign (+), so a sign will always be included for both positive and negative values. If you use the minus sign (-), then only negative numeric values will include a leading sign, and positive values won’t:

>>> '{0:-6d}'.format(123)
'   123'
>>> '{0:-6d}'.format(-123)
'  -123'

When you use a single space (' '), it means a sign is included for negative values, and an ASCII space character for positive values:

>>> '{0:*> 6d}'.format(123)
'** 123'
>>> '{0:*>6d}'.format(123)
'***123'
>>> '{0:*> 6d}'.format(-123)
'**-123'

Since the space character is the default fill character, you’d only notice the effect of this if an alternate <fill> character is specified.

Lastly, recall from above that when you specify the equals sign (=) for <align> and you include a <sign> specifier, the padding goes between the sign and the value, rather than to the left of the sign.

The # Subcomponent

When you specify a hash character (#) in <format_spec>, Python will select an alternate output form for certain presentation types. This is analogous to the # conversion flag for the string modulo operator. For binary, octal, and hexadecimal presentation types, the hash character (#) causes inclusion of an explicit base indicator to the left of the value:

>>> '{0:b}, {0:#b}'.format(16)
'10000, 0b10000'
>>> '{0:o}, {0:#o}'.format(16)
'20, 0o20'
>>> '{0:x}, {0:#x}'.format(16)
'10, 0x10'

As you can see, the base indicator can be 0b, 0o, or 0x.

For floating-point or exponential presentation types, the hash character forces the output to contain a decimal point, even if the output consists of a whole number:

>>> '{0:.0f}, {0:#.0f}'.format(123)
'123, 123.'
>>> '{0:.0e}, {0:#.0e}'.format(123)
'1e+02, 1.e+02'

For any presentation type other than those shown above, the hash character (#) has no effect.

The 0 Subcomponent

If output is smaller than the indicated field width and you specify the digit zero (0) in <format_spec>, then values will be padded on the left with zeros instead of ASCII space characters:

>>> '{0:05d}'.format(123)
'00123'
>>> '{0:08.1f}'.format(12.3)
'000012.3'

You’ll typically use this for numeric values, as shown above. However, it works for string values as well:

>>> '{0:>06s}'.format('foo')
'000foo'

If you specify both <fill> and <align>, then <fill> overrides 0:

>>> '{0:*>05d}'.format(123)
'**123'

<fill> and 0 essentially control the same thing, so there really isn’t any need to specify both. In fact, 0 is really superfluous, and was probably included as a convenience for developers who are familiar with the string modulo operator’s similar 0 conversion flag.

The <width> Subcomponent

<width> specifies the minimum width of the output field:

>>> '{0:8s}'.format('foo')
'foo     '
>>> '{0:8d}'.format(123)
'     123'

Note that this is a minimum field width. Suppose you specify a value that’s longer than the minimum:

>>> '{0:2s}'.format('foobar')
'foobar'

In this case, <width> is effectively ignored.

The <group> Subcomponent

<group> allows you to include a grouping separator character in numeric output. For decimal and floating-point presentation types, <group> may be specified as either a comma character (,) or an underscore character (_). That character then separates each group of three digits in the output:

>>> '{0:,d}'.format(1234567)
'1,234,567'
>>> '{0:_d}'.format(1234567)
'1_234_567'
>>> '{0:,.2f}'.format(1234567.89)
'1,234,567.89'
>>> '{0:_.2f}'.format(1234567.89)
'1_234_567.89'

A <group> value using an underscore (_) may also be specified with the binary, octal, and hexadecimal presentation types. In that case, each group of four digits is separated by an underscore character in the output:

>>> '{0:_b}'.format(0b111010100001)
'1110_1010_0001'
>>> '{0:#_b}'.format(0b111010100001)
'0b1110_1010_0001'
>>> '{0:_x}'.format(0xae123fcc8ab2)
'ae12_3fcc_8ab2'
>>> '{0:#_x}'.format(0xae123fcc8ab2)
'0xae12_3fcc_8ab2'

If you try to specify <group> with any presentation type other than those listed above, then your code will raise an exception.

The .<prec> Subcomponent

.<prec> specifies the number of digits after the decimal point for floating point presentation types:

>>> '{0:8.2f}'.format(1234.5678)
' 1234.57'
>>> '{0:8.4f}'.format(1.23)
'  1.2300'
>>> '{0:8.2e}'.format(1234.5678)
'1.23e+03'
>>> '{0:8.4e}'.format(1.23)
'1.2300e+00'

For string types, .<prec> specifies the maximum width of the converted output:

>>> '{:.4s}'.format('foobar')
'foob'

If the output would be longer than the value specified, then it will be truncated.

The String .format() Method: Nested Replacement Fields

Recall that you can specify either <width> or <prec> by an asterisk with the string modulo operator:

>>> w = 10
>>> p = 2
>>> '%*.*f' % (w, p, 123.456)  # Width is 10, precision is 2
'    123.46'

The associated values are then taken from the argument list. This allows <width> and <prec> to be evaluated dynamically at run-time, as shown in the example above. Python’s .format() method provides similar capability using nested replacement fields.

Inside a replacement field, you can specify a nested set of curly braces ({}) that contains a name or number referring to one of the method’s positional or keyword arguments. That portion of the replacement field will then be evaluated at run-time and replaced using the corresponding argument. You can accomplish the same effect as the above string modulo operator example with nested replacement fields:

>>> w = 10
>>> p = 2
>>> '{2:{0}.{1}f}'.format(w, p, 123.456)
'    123.46'

Here, the <name> component of the replacement field is 2, which indicates the third positional parameter whose value is 123.456. This is the value to be formatted. The nested replacement fields {0} and {1} correspond to the first and second positional parameters, w and p. These occupy the <width> and <prec> locations in <format_spec> and allow field width and precision to be evaluated at run-time.

You can use keyword arguments with nested replacement fields as well. This example is functionally equivalent to the previous one:

>>> w = 10
>>> p = 2
>>> '{val:{wid}.{pr}f}'.format(wid=w, pr=p, val=123.456)
'    123.46'

In either case, the values of w and p are evaluated at run-time and used to modify the <format_spec>. The result is effectively the same as this:

>>> '{0:10.2f}'.format(123.456)
'    123.46'

The string modulo operator only allows <width> and <prec> to be evaluated at run-time in this way. By contrast, with Python’s .format() method you can specify any portion of <format_spec> using nested replacement fields.

In the following example, the presentation type <type> is specified by a nested replacement field and determined dynamically:

>>> bin(10), oct(10), hex(10)
('0b1010', '0o12', '0xa')
>>> for t in ('b', 'o', 'x'):
...     print('{0:#{type}}'.format(10, type=t))
...
0b1010
0o12
0xa

Here, the grouping character <group> is nested:

>>> '{0:{grp}d}'.format(123456789, grp='_')
'123_456_789'
>>> '{0:{grp}d}'.format(123456789, grp=',')
'123,456,789'

Whew! That was an adventure. The specification of the template string is virtually a language unto itself!

As you can see, string formatting can be very finely tuned when you use Python’s .format() method. Next, you’ll see one more technique for string and output formatting that affords all the advantages of .format(), but with more direct syntax.

The Python Formatted String Literal (f-String)

In version 3.6, a new Python string formatting syntax was introduced, called the formatted string literal. These are also informally called f-strings, a term that was initially coined in PEP 498, where they were first proposed.

f-String Syntax

An f-string looks very much like a typical Python string except that it’s prepended by the character f:

>>> f'foo bar baz'
'foo bar baz'

You can also use an uppercase F:

>>> s = F'qux quux'
>>> s
'qux quux'

The effect is exactly the same. Just like with any other type of string, you can use single, double, or triple quotes to define an f-string:

>>> f'foo'
'foo'
>>> f"bar"
'bar'
>>> f'''baz'''
'baz'

The magic of f-strings is that you can embed Python expressions directly inside them. Any portion of an f-string that’s enclosed in curly braces ({}) is treated as an expression. The expression is evaluated and converted to string representation, and the result is interpolated into the original string in that location:

>>> s = 'bar'
>>> print(f'foo.{s}.baz')
foo.bar.baz

The interpreter treats the remainder of the f-string—anything not inside curly braces—just as it would an ordinary string. For example, escape sequences are processed as expected:

>>> s = 'bar'
>>> print(f'foo\n{s}\nbaz')
foo
bar
baz

Here’s the example from earlier using an f-string:

>>> quantity = 6
>>> item = 'bananas'
>>> price = 1.74
>>> print(f'{quantity} {item} cost ${price}')
6 bananas cost $1.74

This is equivalent to the following:

>>> quantity = 6
>>> item = 'bananas'
>>> price = 1.74
>>> print('{0} {1} cost ${2}'.format(quantity, item, price))
6 bananas cost $1.74

Expressions embedded in f-strings can be almost arbitrarily complex. The examples below show some of the possibilities:

  • Variables:

    >>> quantity, item, price = 6, 'bananas', 1.74
    >>> f'{quantity} {item} cost ${price}'
    '6 bananas cost $1.74'
  • Arithmetic expressions:

    >>> quantity, item, price = 6, 'bananas', 1.74
    >>> print(f'Price per item is ${price/quantity}')
    Price per item is $0.29
    >>> x = 6
    >>> print(f'{x} cubed is {x**3}')
    6 cubed is 216
  • Objects of composite types:

    >>> a = ['foo', 'bar', 'baz']
    >>> d = {'foo': 1, 'bar': 2}
    >>> print(f'a = {a} | d = {d}')
    a = ['foo', 'bar', 'baz'] | d = {'foo': 1, 'bar': 2}
  • Indexing, slicing, and key references:

    >>> a = ['foo', 'bar', 'baz']
    >>> d = {'foo': 1, 'bar': 2}
    >>> print(f'First item in list a = {a[0]}')
    First item in list a = foo
    >>> print(f'Last two items in list a = {a[-2:]}')
    Last two items in list a = ['bar', 'baz']
    >>> print(f'List a reversed = {a[::-1]}')
    List a reversed = ['baz', 'bar', 'foo']
    >>> print(f"Dict value for key 'bar' is {d['bar']}")
    Dict value for key 'bar' is 2
  • Function and method calls:

    >>> a = ['foo', 'bar', 'baz', 'qux', 'quux']
    >>> print(f'List a has {len(a)} items')
    List a has 5 items
    >>> s = 'foobar'
    >>> f'--- {s.upper()} ---'
    '--- FOOBAR ---'
    >>> d = {'foo': 1, 'bar': 2}
    >>> print(f"Dict value for key 'bar' is {d.get('bar')}")
    Dict value for key 'bar' is 2
  • Conditional expressions:

    >>> x = 3
    >>> y = 7
    >>> print(f'The larger of {x} and {y} is {x if x > y else y}')
    The larger of 3 and 7 is 7
    >>> age = 13
    >>> f'I am {"a minor" if age < 18 else "an adult"}.'
    'I am a minor.'
  • Object attributes:

    >>> z = 3+5j
    >>> z
    (3+5j)
    >>> print(f'real = {z.real}, imag = {z.imag}')
    real = 3.0, imag = 5.0

To include a literal curly brace in an f-string, escape it by doubling it, the same as you would in a template string for Python’s .format() method:

>>> z = 'foobar'
>>> f'{{ {z[::-1]} }}'
'{ raboof }'

You may prefix an f-string with 'r' or 'R' to indicate that it is a raw f-string. In that case, backslash sequences are left intact, just like with an ordinary string:

>>> z = 'bar'
>>> print(f'foo\n{z}\nbaz')
foo
bar
baz
>>> print(rf'foo\n{z}\nbaz')
foo\nbar\nbaz
>>> print(fr'foo\n{z}\nbaz')
foo\nbar\nbaz

Note that you can specify the 'r' first and then the 'f', or vice-versa.

f-String Expression Limitations

There are a few minor restrictions on f-string expressions. These aren’t too limiting, but you should know what they are. First, an f-string expression can’t be empty:

>>> f'foo{}bar'
  File "<stdin>", line 1
SyntaxError: f-string: empty expression not allowed

It isn’t obvious why you’d want to do this. But if you feel compelled to try, then just know that it won’t work.

Additionally, an f-string expression can’t contain a backslash (\) character. Among other things, that means you can’t use a backslash escape sequence in an f-string expression:

>>> print(f'foo{\n}bar')
  File "<stdin>", line 1
SyntaxError: f-string expression part cannot include a backslash
>>> print(f'foo{\'}bar')
  File "<stdin>", line 1
SyntaxError: f-string expression part cannot include a backslash

You can get around this limitation by creating a temporary variable that contains the escape sequence you want to insert:

>>> nl = '\n'
>>> print(f'foo{nl}bar')
foo
bar
>>> quote = '\''
>>> print(f'foo{quote}bar')
foo'bar

Lastly, an expression in an f-string can’t contain comments, even when the f-string is triple-quoted:

>>> z = 'bar'
>>> print(f'''foo{
... z
... }baz''')
foobarbaz
>>> print(f'''foo{
... z    # Comment
... }''')
  File "<stdin>", line 3
SyntaxError: f-string expression part cannot include '#'

Note, however, that the multiline f-string may contain embedded newlines.
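
In other words, only the expression part is restricted; newlines in the literal portions of a triple-quoted f-string are preserved in the result:

>>> z = 'bar'
>>> print(f'''foo
... {z}
... baz''')
foo
bar
baz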

f-String Formatting

Like Python’s .format() method, f-strings support extensive modifiers that control the final appearance of the output string. There’s more good news, too. If you’ve mastered the Python .format() method, then you already know how to use Python to format f-strings!

Expressions in f-strings can be modified by a <conversion> or <format_spec>, just like replacement fields used in the .format() template. The syntax is identical. In fact, in both cases Python will format the replacement field using the same internal function. In the following example, !r is specified as a <conversion> component in the .format() template string:

>>> s = 'foo'
>>> '{0!r}'.format(s)
"'foo'"

This forces conversion to be performed by repr(). You can get essentially the same behavior using an f-string instead:

>>> s = 'foo'
>>> f'{s!r}'
"'foo'"

All the <format_spec> components that work with .format() also work with f-strings:

>>> n = 123
>>> '{:=+8}'.format(n)
'+    123'
>>> f'{n:=+8}'
'+    123'
>>> s = 'foo'
>>> '{0:*^8}'.format(s)
'**foo***'
>>> f'{s:*^8}'
'**foo***'
>>> n = 0b111010100001
>>> '{0:#_b}'.format(n)
'0b1110_1010_0001'
>>> f'{n:#_b}'
'0b1110_1010_0001'

Nesting works as well, like nested replacement fields with Python’s .format() method:

>>> a = ['foo', 'bar', 'baz', 'qux', 'quux']
>>> w = 4
>>> f'{len(a):0{w}d}'
'0005'
>>> n = 123456789
>>> sep = '_'
>>> f'{(n * n):{sep}d}'
'15_241_578_750_190_521'

This means formatting items can be evaluated at run-time.
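
For example, a column width computed at run-time can be fed straight into the format spec (the sample names here are just for illustration):

>>> names = ['Bob', 'Alexandra', 'Sue']
>>> width = max(len(name) for name in names)
>>> for name in names:
...     print(f'{name:>{width}}')
...
      Bob
Alexandra
      Sue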

f-strings and Python’s .format() method are, more or less, two different ways of doing the same thing, with f-strings being a more concise shorthand. The following expressions are essentially the same:

f'{<expr>!<conversion>:<format_spec>}'
'{0!<conversion>:<format_spec>}'.format(<expr>)

If you want to explore f-strings further, then check out Python 3’s f-Strings: An Improved String Formatting Syntax (Course).

Conclusion

In this tutorial, you mastered two additional techniques that you can use in Python to format string data. You should now have all the tools you need to prepare string data for output or display!

You might be wondering which Python formatting technique you should use. Under what circumstances would you choose .format() over the f-string? See Python String Formatting Best Practices for some considerations to take into account.

In the next tutorial, you’re going to learn more about functions in Python. Throughout this tutorial series, you’ve seen many examples of Python’s built-in functions. In Python, as in most programming languages, you can define your own custom user-defined functions as well. If you can’t wait to learn how then continue on to the next tutorial!




Mike Driscoll: Python 101 2nd Edition Kickstarter is Live!


I am excited to announce that my newest book, Python 101, 2nd Edition is launching on Kickstarter today!

Python 101 2nd Ed Kickstarter (click the photo to jump to Kickstarter)

Python 101 holds a special place in my heart as it was the very first book I ever wrote. Frankly, I don’t think I would have even written a book if it weren’t for the readers of this blog who encouraged me to do so.

The new edition of Python 101 will be an entirely new, rewritten from scratch, book. While I will be covering most of the same things as in the original, I have reorganized the book a lot and I am adding all new content. I have also removed old content that is no longer relevant.

I hope you will join me by backing the book and giving me feedback as I write it so that I can put together a really great learning resource for you!

The post Python 101 2nd Edition Kickstarter is Live! appeared first on The Mouse Vs. The Python.

Roberto Alsina: Learning Serverless in GCP


Usually, when I want to learn how to use a tool, the thing that works best for me is to try to build something using it. Watching someone build something instead is the second best thing.

So, join me while I build a little thing using "serverless" Google Cloud Platform, Python and some other bits and pieces.


Caveat: this was originally a twitter thread, so there will be typos and things. Sorry! Also it's possible that it will look better here in threaderapp


A thread by Roberto Alsina

Ready for an afternoon learning with uncle Roberto? Come along, today's topic is "serverless"! Start thread.

As usual, serverless seems a bit daunting. So, our code will run in that vast, anonymous, vaguely lovecraftian infrastructure? Like AWS or GCP?

Well, yes. But you only need be scared of Cthulhu if you have not seen Cthulhu as a puppy.

So, let's start with a puppy-like thing. Yesterday @quenerapu mentioned https://carbon.now.sh which shows pretty code snippets but just as images. Pretty? Good.

Only pretty and no copy/paste? Bad baby Cthulhu

So, let's do something about it. This is a script that will produce either a colorized image of code, or HTML. Yes, that's a screenshot. Yes, I notice the irony.

It produces, when ran as shown, this:

Is it pretty? No. But remember, no successful complicated thing grew out of a complicated thing. As long as the inputs and outputs of our initial code are within the same postal code as what we really would like to have, it's good enough for now. So, what can I do with that code? I can do a flask app.

And if I run it, it works
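
The code itself only appears as screenshots in the original thread. As a rough sketch of the idea, assuming Pygments does the colorizing, a Flask endpoint along these lines would return highlighted HTML (the route and parameter names are made up):

from flask import Flask, request
from pygments import highlight
from pygments.lexers import get_lexer_by_name
from pygments.formatters import HtmlFormatter

app = Flask(__name__)

@app.route('/color', methods=['POST'])
def color():
    # Take raw code plus a language name, return a standalone HTML page
    code = request.form.get('code', '')
    lexer = get_lexer_by_name(request.form.get('lang', 'python'))
    return highlight(code, lexer, HtmlFormatter(full=True))

if __name__ == '__main__':
    app.run()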

So, this is the seed of an app that could replace https://carbon.now.sh ... it would need a frontend, and (to make it better) something a lot like a URL shortener, but that's beside the point.

And now I can choose different paths to follow.

One way I could go forward is to deploy it to my server. Because I have a server. It's a tiny server, and it's very cheap, but this? It can run it a million times a day and I won't notice. But I said this thread is about serverless, right? So you know I am not doing that.

There are 2 main serverless environments I could try (because they are the 2 ones I know about)

* AWS Lambda
* Google cloud functions

Let's try google's. Later maybe we'll try the other one. But the important bit here is: doing serverless and doing "serverfull" is not really all that different. As long as:

1. Your endpoints are stateless
2. You don't rely on the filesystem for state
3. All the state you want, you put on the database

Now I have to go setup my billing information in Google Cloud, BRB. Done, there is a free tier and some free credit for starting up, and it's not going to cost any real money anyway.

So

I changed like, two lines.

And added requirements
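
The actual diff is only shown as an image, but the change is small because an HTTP Cloud Function receives a Flask request object directly. A plausible sketch of the converted function, with Pygments listed in requirements.txt, looks something like this (the names are illustrative, not the ones in the thread):

from pygments import highlight
from pygments.lexers import get_lexer_by_name
from pygments.formatters import HtmlFormatter

def color_coding(request):
    # request is a flask.Request provided by the Cloud Functions runtime
    code = request.args.get('code', '')
    lexer = get_lexer_by_name(request.args.get('lang', 'python'))
    return highlight(code, lexer, HtmlFormatter(full=True))

# requirements.txt
#   pygments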

It took some 30 seconds to deploy.

And it works for HTML output (…ering-force-268517.cloudfunctions.net/color-coding?f…)

But not for PNG, because while the code runs just fine the actual environment it's running on is ... limited.

So it has no fonts, which it needs to do the image.

These are the packages GCP provides in the image where our functions run. So, it's not the same as my desktop machine, to say the least.

Looks like we have liberation fonts, so "Liberation Mono" should be available? Yep. (…ering-force-268517.cloudfunctions.net/color-coding?l…) So, how do we make this useful?

Let's define a user story. I, a developer, want to show my code in twitter, but pretty and longer than 240 characters.

Also, I want people to be able to copy all or part of that text without typing it manually as if it were 1989. Now I will take a short break. Go have some coffee.

This uses this twitter feature: (developer.twitter.com/en/docs/tweets…)

So, if I make a second serverless function that generates something like that, I am almost there. Then it's a matter of UI. Mind you, it's perfectly ok to implement this as a flask app and then just once it does what I want it to do redeploy to google cloud.

The code looks exactly the same anyway.

That is just a matter of some basic string templating.

So, there you go, that Google cloud function is all the backend you need to implement something like https://carbon.now.sh

But I don't really want to, so you go ahead.

EOT

(Tomorrow: running this on AWS using Zappa and on GCP using gcloud)

tryexceptpass: Uniquely Managing Test Execution Resources using WebSockets


Executing tests for simple applications is complicated. You have to think about the users, how they interact with it, how those interactions propagate through different components, as well as how to handle error situations gracefully. But things get even more complicated when you start looking at more extensive systems, like those with multiple external dependencies.

Dependencies come in various forms, including third-party modules, cloud services, compute resources, networks, and others.

This level of complexity is standard in almost all projects involving a large organization, whether delivering internal tools or external products.

It means you must put emphasis on developing test systems and mechanisms good enough to validate not just code, but those third-party dependencies as well. After all, they’re part of the final product, and failing to interact with them means the product fails.

PyBites: Productivity Mondays - 5 tips that will boost your performance


The following things are relatively easy to do, but also easy not to do. Do them consistently and they can change your career and life.

1. Follow up

How many interactions die after the first meeting? Not if you follow up.

It shows you're interested (to be interesting, be interested - Carnegie), it keeps the momentum going, and it creates ongoing opportunities.

Keep nurturing your network, you never know where the next opportunity will come from.

2. Audit your time

How often do you feel exhausted at the end of the day, asking "where did my time go?" Like money and calories, what gets measured gets managed (Drucker).

Be in control of your time, or somebody else inevitably will!

3. Control your mood

Willpower and positive energy are finite. Start your day early using an empowering ritual.

For me that is steps and listening to an inspiring podcast or audiobook.

It sets the tone and confidence of the day. Improving our morning routine has been a game changer for us this year.

4. Use the right tool

Mobiles are great but highly interruptive. Avoid email (social) first thing in the morning.

Flight mode is not only for airplanes, it can be your new best friend in the early morning, especially when you have to eat that ugly frog :)

Talking about the right tool for the job: calls can be great. Sometimes email is just not the right tool. It starts clean but people get cc'd and some people go off on a tangent resulting in long, unfocused email chains.

Break the pattern: host a 15 min call with a clear agenda and come out with follow up actions. Win/win: not only will you save a lot of time, people will see you as a leader.

5. Just ask

There are no dumb questions! (Unless you did not do your homework of course.)

There is nothing more frustrating than being stuck on a problem for hours while somebody else can guide you in the right direction in minutes.

Don't let self imposed ceilings hold you back from asking for help.

You are not bothering the other person, you actually give him/her an opportunity to feel great by helping you!

Another reason to be assertive is to stay focused on your longer term goal. Without speaking up, your manager/team/audience does not know where you can be(come) more valuable and therefore risk getting stuck in a rut.


I hope this gives you some healthy inspiration to start off your week.

Now go crush it this week and comment below which of these tips boosted your productivity/ motivation/ moved you closer to your goal. See you next week.

-- Bob

With so many avenues to pursue in Python it can be tough to know what to do. If you're looking for some direction or want to take your Python code and career to the next level, schedule a call with us now. We can help you!

Anwesha Das: The scary digital world


The horror

Some years ago, my husband and I were looking for houses to rent. We were in different cities at the time and discussed it over three or four phone calls. Afterwards, I opened my laptop and started browsing. Advertisements began popping up, showing ads for houses to rent in the very same location and at the very budget I was looking for. A chill went down my spine. How did this particular website know that we were looking for a house?

The internet

The internet was designed to give a home to the mind. It is the place for genuine independence and liberty: a new global social space exclusive of any authority, government, sovereignty, and the “weary giants of flesh and steel” of the industrial world. Anonymity was there in the very ethos of the internet. It offered users the opportunity not to be discriminated against on religious, economic, or social grounds. It provided a platform for people to be themselves. And the Right to Privacy lies at the very core of people's self-being. It was our chance to leave behind the world's nastiness and selfishness and build an open and equal world. It was our chance to be better.

The last decade has seen a surge in the usage of the internet. The growth can be observed most prominently in the area of social media, aided by smartphones, smartwatches, and every other smart device. There is a substantial number of people for whom using the internet is synonymous with using Facebook. There is a parallel universe built around Facebook and WhatsApp, and it is growing every day, every second. According to a survey by brandwatch.com, Facebook adds 500,000 new users every day, six new profiles every second.

This mammoth growth of social media and our dependency on the internet have blurred the line of individual privacy. What was once considered private is now in the public domain, be it our first date, our breakups, dinner plans, or childbirth; the list goes on. And it does not end there. Our behavior is also under watch.

Different types of Online activities

We react to different situations and people, and we promote, support, or reject issues on social media. We have conversations about what food we like to eat, where we want to go shopping, or when we are getting married. That is the information-sharing aspect of the internet. The other principal use of the internet is gaining knowledge. We browse to ask random things, like: What is the total area of the earth? What is global warming? What is the best brand of lingerie suitable for thin women?
The first set of information (the activity on social media) we share with our full consent and knowledge. The second set we presume to be private, something between us, the browser, and the website we are getting the information from. But in reality, it is not.

Tracking and it’s kind

First party tracking

There are certain rights we waive and certain information we give up to avail a service over the internet, such as our name and contact details for Facebook or Instagram. Signing up for Instagram means voluntarily agreeing to their “Terms, Data Policy, and Cookies Policy.” It is precisely like an agreement in the real world; the primary distinction between the two lies in our approach. In the case of a real-world agreement, we make sure we read it, but we rarely care about understanding what we are signing up for when signing something in the digital world. Here all we care about is the service and nothing else. Signing up for the service then means letting the service provider (in this case, Instagram) insert cookies into our browser and collect data from us. These services know a lot about us: Facebook knows who our friends are, what we “like,” and how much.
Similarly, Amazon and Flipkart know what we want to buy and when we are buying it. This is called First Party Tracking, which we are fully aware of and have agreed to.

Third-party tracking

There is another kind of tracking that happens behind our back, without our knowledge and consent: Third Party Tracking. Third-party trackers are present in almost all mobile apps and web pages, everywhere we go online. A regular mobile app collects and shares our private data, as sensitive as call records and location data, with dozens of third-party companies. By third-party company, we mean a company other than the service provider (in this case, the company making the mobile application). An average web page does the same thing to the user. The physical world is also not spared by these trackers. Whenever we connect to the WiFi network of a coffee shop, hotel, or restaurant, the service provider (the coffee shop) can monitor our activity online. They also use Bluetooth and WiFi beacons for passive monitoring of people in that locality.

Who performs these third party tracking?

The data brokers, advertisers, and tech companies are the ones tracking us behind our backs. A research paper published by the EFF describes the situation aptly: “Corporations have built a hall of one-way mirrors: from the inside, you can see only apps, web pages, ads, and yourself reflected by social media. But in the shadows behind the glass, trackers quietly take notes on nearly everything you do. These trackers are not omniscient, but they are widespread and indiscriminate.” To learn about the deep-down technical part of third-party tracking, go through the paper published by the EFF. The data the trackers collect may be benign, but together with public information it tends to reveal a lot, such as whether someone is political or not, ambitious or not, whether they like to play it safe or are prone to taking risks. We, our lives, are being sold, and we are nothing but an accumulation of data to them.

Therefore, the information we feed in and the third-party tracking together subject us to constant surveillance, and profiles of us are created based on this data.

These profiles form an invisible but unavoidable panopticon around us, a nearly unbreakable chain.

Why would someone want to track me? I have nothing to hide.

This is the general response we get when we start a discussion about privacy. Glenn Greenwald has a great reply to it: ‘if you have nothing to hide, please write down all your email ids, not just the work ones, the respectable ones, but all of them, along with the passwords, and send them to me.’ Though people claim to have nothing to hide, no one has ever gotten back to him :)

Everyone needs privacy. We flourish and can be true to ourselves when we are free of the fear and knowledge of being watched by someone. Everyone cares about privacy. If they did not, there would be no passwords on their accounts, no lockers, no keys.

The evolution of physical self to digital self and protecting that

Now that we have entered the digital world, we are measured in bits of data. Our data is an extension of our physical being; the information forms a core part of a person. We are familiar with the norms of the physical world, but the digital world is still a maze in which we are trying to find a way to safety, success, and survival. With this very blog post, I am starting a new series on protecting ourselves in the digital world. This series is meant for beginners and newbies. In the coming posts, I will be dealing with:

  • the threats of the digital world,
  • how to stay safe there, and
  • what the law says about it, along with other relevant topics

Till we meet the next time, stay safe.

Python Insider: Python 3.8.2rc2 is now available for testing

Python 3.8.2rc2 is the second release candidate of the second maintenance release of Python 3.8. Go get it here:

https://www.python.org/downloads/release/python-382rc2/


Why a second release candidate?

The major reason for RC2 is that GH-16839 has been reverted.

The original change was supposed to fix some edge cases in urlparse (numeric paths, recognizing netlocs without //; details in BPO-27657). Unfortunately it broke third parties relying on the pre-existing undefined behavior.

Sadly, the reverted fix has already been released as part of 3.8.1 (and 3.7.6 where it’s also reverted now). As such, even though the revert is itself a bug fix, it is incompatible with the behavior of 3.8.1.

Please test.

Timeline

Assuming no critical problems are found prior to 2020-02-24, the currently scheduled release date for 3.8.2 (as well as 3.9.0 alpha 4!), no code changes are planned between this release candidate and the final release.

That being said, please keep in mind that this is a pre-release of 3.8.2 and as such its main purpose is testing.

Maintenance releases for the 3.8 series will continue at regular bi-monthly intervals, with 3.8.3 planned for April 2020 (during sprints at PyCon US).

What’s new?

The Python 3.8 series is the newest feature release of the Python language, and it contains many new features and optimizations. See the “What’s New in Python 3.8” document for more information about features included in the 3.8 series.

Detailed information about all changes made in version 3.8.2 specifically can be found in its change log.

We hope you enjoy Python 3.8!

Thanks to all of the many volunteers who help make Python Development and these releases possible! Please consider supporting our efforts by volunteering yourself or through organization contributions to the Python Software Foundation.

Chris Moffitt: Python Tools for Record Linking and Fuzzy Matching


Introduction

Record linking and fuzzy matching are terms used to describe the process of joining two data sets together that do not have a common unique identifier. Examples include trying to join files based on people’s names or merging data that only have organization’s name and address.

This problem is a common business challenge and difficult to solve in a systematic way - especially when the data sets are large. A naive approach using Excel and vlookup statements can work but requires a lot of human intervention. Fortunately, python provides two libraries that are useful for these types of problems and can support complex matching algorithms with a relatively simple API.

The first one is called fuzzymatcher and provides a simple interface to link two pandas DataFrames together using probabilistic record linkage. The second option is the appropriately named Python Record Linkage Toolkit which provides a robust set of tools to automate record linkage and perform data deduplication.

This article will discuss how to use these two tools to match two different data sets based on name and address information. In addition, the techniques used to do matching can be applied to data deduplication and will be briefly discussed.

The problem

Anyone that has tried to merge disparate data sets together has likely run across some variation of this challenge. In the simple example below, we have a customer record in our system and need to determine the data matches - without the use of a common identifier.

Simple manual lookup

With a small sample set and our intuition, it looks like account 18763 is the same as account number A1278. We know that Brothers and Bro as well as Lane and LN are equivalent so this process is relatively easy for a person. However, trying to program logic to handle this is a challenge.

In my experience, most people start using excel to vlookup the various components of the address and try to find the best match based on the state, street number or zip code. In some cases, this can work. However there are more sophisticated ways to perform string comparisons that we might want to use. For example, I wrote briefly about a package called fuzzy wuzzy several years ago.

The challenge is that these algorithms (e.g. Levenshtein, Damerau-Levenshtein, Jaro-Winkler, q-gram, cosine) are computationally intensive. Trying to do a lot of matching on large data sets is not scalable.

If you are interested in more mathematical details on these concepts, wikipedia is a good place to start and this article contains much more additional detail. Finally, this blog post discusses some of the string matching approaches in more detail.
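
To get a feel for what these similarity scores look like, the fuzzy wuzzy package mentioned above is handy for quick experiments. Here is a small illustration using the kinds of variations described earlier (the strings are made up):

from fuzzywuzzy import fuzz

# Plain ratio compares the raw strings character by character
print(fuzz.ratio('Holiday Brothers Lane', 'Holiday Bro LN'))

# Token-based comparisons ignore word order entirely
print(fuzz.token_sort_ratio('Saint Lukes Hospital Kansas City',
                            'Kansas City Saint Lukes Hospital'))  # 100, same tokens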

Fortunately there are python tools that can help us implement these methods and solve some of these challenging problems.

The data

For this article, we will be using US hospital data. I chose this data set because hospital data has some unique qualities that make it challenging to match:

  • Many hospitals have similar names across different cities (Saint Lukes, Saint Mary, Community Hospital)
  • In urban areas, hospitals can occupy several city blocks so addresses can be ambiguous
  • Hospitals tend to have many clinics and other associated and related facilities nearby
  • Hospitals also get acquired and name changes are common - making this process even more difficult
  • Finally, there are thousands of medical facilities in the US, so the problem is challenging to scale

In these examples, I have two data sets. The first is an internal data set that contains basic hospital account number, name and ownership information.

Hospital account data

The second data set contains hospital information (called provider) as well as the number of discharges and Medicare payment for a specific Heart Failure procedure.

Hospital account data

The full data sets are available from Medicare.gov and CMS.gov and the simplified and cleaned version are available on github.

The business scenario is that we want to match up the hospital reimbursement information with our internal account data so we have more information to analyze our hospital customers. In this instance we have 5339 hospital accounts and 2697 hospitals with reimbursement information. Unfortunately we do not have a common ID to join on so we will see if we can use these python tools to merge the data together based on a combination of name and address information.

Approach 1 - fuzzymatcher

For the first approach, we will try using fuzzymatcher. This package leverages sqlite’s full text search capability to try to match records in two different DataFrames.

To install fuzzy matcher, I found it easier to conda install the dependencies (pandas, metaphone, fuzzywuzzy) then use pip to install fuzzymatcher. Given the computational burden of these algorithms you will want to use the compiled c components as much as possible and conda made that easiest for me.

If you wish to follow along, this notebook contains a summary of all the code.

After everything is setup, let’s import and get the data into our DataFrames:

import pandas as pd
from pathlib import Path
import fuzzymatcher

hospital_accounts = pd.read_csv('hospital_account_info.csv')
hospital_reimbursement = pd.read_csv('hospital_reimbursement.csv')

Here is the hospital account information:

Hospital account data

Here is the reimbursement information:

Hospital account data

Since the columns have different names, we need to define which columns to match for the left and right DataFrames. In this case, our hospital account information will be the left DataFrame and the reimbursement info will be the right.

left_on = ["Facility Name", "Address", "City", "State"]
right_on = ["Provider Name", "Provider Street Address", "Provider City", "Provider State"]

Now we let fuzzymatcher try to figure out the matches using fuzzy_left_join :

matched_results = fuzzymatcher.fuzzy_left_join(hospital_accounts,
                                               hospital_reimbursement,
                                               left_on,
                                               right_on,
                                               left_id_col='Account_Num',
                                               right_id_col='Provider_Num')

Behind the scenes, fuzzymatcher determines the best match for each combination. For this data set we are analyzing over 14 million combinations. On my laptop, this takes about 2 min and 11 seconds to run.

The matched_results DataFrame contains all the data linked together, as well as a best_match_score column that shows the quality of the link.

Here’s a subset of the columns rearranged in a more readable format for the top 5 best matches:

cols = ["best_match_score", "Facility Name", "Provider Name", "Address", "Provider Street Address",
        "Provider City", "City", "Provider State", "State"]
matched_results[cols].sort_values(by=['best_match_score'], ascending=False).head(5)
Matched information

The first item has a match score of 3.09 and certainly looks like a clean match. You can see that the Facility Name and Provider Name for the Mayo Clinic in Red Wing has a slight difference but we were still able to get a good match.

We can check on the opposite end of the spectrum to see where the matches don’t look as good:

matched_results[cols].sort_values(by=['best_match_score'], ascending=True).head(5)

Which shows some poor scores as well as obvious mismatches:

Bad matches

This example highlights that part of the issue is that one set of data includes data from Puerto Rico and the other does not. This discrepancy highlights the need to make sure you really understand your data and what cleaning and filtering you may need to do before trying to match.

We’ve looked at the extreme cases, let’s take a look at some of the matches that might be a little more challenging by looking at scores < 80:

matched_results[cols].query("best_match_score <= .80").sort_values(by=['best_match_score'], ascending=False).head(5)
Partial Matches

This example shows how some of the matches get a little more ambiguous. For example, is ADVENTIST HEALTH UKIAH VALLEY the same as UKIAH VALLEY MEDICAL CENTER? Depending on your data set and your needs, you will need to find the right balance of automated and manual match review.

Overall, fuzzymatcher is a useful tool to have for medium sized data sets. As you start to get to 10,000’s of rows, it will take a lot of time to compute, so plan accordingly. However the ease of use - especially when working with pandas makes it a great first place to start.

Approach 2 - Python Record Linkage Toolkit

The Python Record Linkage Toolkit provides another robust set of tools for linking data records and identifying duplicate records in your data.

The Python Record Linkage Toolkit has several additional capabilities:

  • Ability to define the types of matches for each column based on the column data types
  • Use “blocks” to limit the pool of potential matches
  • Provides ranking of the matches using a scoring algorithm
  • Multiple algorithms for measuring string similarity
  • Supervised and unsupervised learning approaches
  • Multiple data cleaning methods

The trade-off is that it is a little more complicated to wrangle the results in order to do further validation. However, the steps are relatively standard pandas commands so do not let that intimidate you.

For this example, make sure you install the library using pip . We will use the same data set but we will read in the data with an explicit index column. This makes subsequent data joins a little easier to interpret.

import pandas as pd
import recordlinkage

hospital_accounts = pd.read_csv('hospital_account_info.csv', index_col='Account_Num')
hospital_reimbursement = pd.read_csv('hospital_reimbursement.csv', index_col='Provider_Num')

Because the Record Linkage Toolkit has more configuration options, we need to perform a couple of steps to define the linkage rules. The first step is to create an indexer object:

indexer = recordlinkage.Index()
indexer.full()
WARNING:recordlinkage:indexing - performance warning - A full index can result in large number of record pairs.

This WARNING points us to a difference between the record linkage library and fuzzymatcher. With record linkage, we have some flexibility to influence how many pairs are evaluated. By using the full indexer, all potential pairs are evaluated (which we know is over 14M pairs). I will come back to some of the other options in a moment. Let’s continue with the full index and see how it performs.

The next step is to build up all the potential candidates to check:

candidates = indexer.index(hospital_accounts, hospital_reimbursement)
print(len(candidates))
14399283

This quick check just confirmed the total number of comparisons.

Now that we have defined the left and right data sets and all the candidates, we can define how we want to perform the comparison logic using Compare()

compare = recordlinkage.Compare()
compare.exact('City', 'Provider City', label='City')
compare.string('Facility Name', 'Provider Name', threshold=0.85, label='Hosp_Name')
compare.string('Address', 'Provider Street Address', method='jarowinkler', threshold=0.85, label='Hosp_Address')
features = compare.compute(candidates, hospital_accounts, hospital_reimbursement)

We can define several options for how we want to compare the columns of data. In this specific example, we look for an exact match on the city. I have also shown some examples of string comparison along with the threshold and algorithm to use for comparison. In addition to these options, you can define your own or use numeric, dates and geographic coordinates. Refer to the documentation for more examples.
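
For instance, a numeric comparison could sit alongside the exact and string rules. This is just a small sketch of the idea; the bed-count and opening-date columns below are hypothetical and are not part of this data set:

# Hypothetical columns, only to illustrate the other comparison types mentioned above
compare.numeric('Total Beds', 'Provider Beds', label='Beds')
compare.date('Opened', 'Provider Opened', label='Opened')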

The final step is to perform all the feature comparisons using compute . In this example, using the full index, this takes 3 min and 41 s.

Let’s go back and look at alternatives to speed this up. One key concept is that we can use blocking to limit the number of comparisons. For instance, we know that it is very likely that we only want to compare hospitals that are in the same state. We can use this knowledge to setup a block on the state columns:

indexer = recordlinkage.Index()
indexer.block(left_on='State', right_on='Provider State')
candidates = indexer.index(hospital_accounts, hospital_reimbursement)
print(len(candidates))
475830

With the block on state, the candidates will be filtered to only include those where the state values are the same. We have filtered down the candidates to only 475,830. If we run the same comparison code, it only takes 7 seconds. A nice speedup!

In this data set, the state data is clean, but if it were a little messier, we could use another blocking algorithm like SortedNeighborhood to add some flexibility for minor spelling mistakes.

For instance, what if the state names contained “Tenessee” and “Tennessee”? Using blocking would fail but sorted neighborhood would handle this situation more gracefully.

indexer = recordlinkage.Index()
indexer.sortedneighbourhood(left_on='State', right_on='Provider State')
candidates = indexer.index(hospital_accounts, hospital_reimbursement)
print(len(candidates))
998860

In this case, sorted neighbors takes 15.9 seconds on 998,860 candidates which seems like a reasonable trade-off.

Regardless of which option you use, the result is a features DataFrame that looks like this:

Feature matrix

This DataFrame shows the results of all of the comparisons. There is one row for each row in the account and reimbursement DataFrames. The columns correspond to the comparisons we defined. A 1 is a match and 0 is not.

Given the large number of records with no matches, it is a little hard to see how many matches we might have. We can sum up the individual scores to see about the quality of the matches.

features.sum(axis=1).value_counts().sort_index(ascending=False)
3.0      2285
2.0       451
1.0      7937
0.0    988187
dtype: int64

Now we know that there are 988,187 rows with no matching values whatsoever. 7937 rows have at least one match, 451 have 2 and 2285 have 3 matches.

To make the rest of the analysis easier, let’s get all the records with 2 or 3 matches and add a total score:

potential_matches = features[features.sum(axis=1) > 1].reset_index()
potential_matches['Score'] = potential_matches.loc[:, 'City':'Hosp_Address'].sum(axis=1)
Match scoring

Here is how to interpret the table. For the first row, Account_Num 26270 and Provider_Num 868740 match on city, hospital name and hospital address.

Let’s look at these two and see how close they are:

hospital_accounts.loc[26270,:]
Facility Name         SCOTTSDALE OSBORN MEDICAL CENTER
Address                          7400 EAST OSBORN ROAD
City                                        SCOTTSDALE
State                                               AZ
ZIP Code                                         85251
County Name                                   MARICOPA
Phone Number                            (480) 882-4004
Hospital Type                     Acute Care Hospitals
Hospital Ownership                         Proprietary
Name: 26270, dtype: object
hospital_reimbursement.loc[868740,:]
Provider Name                SCOTTSDALE OSBORN MEDICAL CENTER
Provider Street Address                 7400 EAST OSBORN ROAD
Provider City                                      SCOTTSDALE
Provider State                                             AZ
Provider Zip Code                                       85251
Total Discharges                                           62
Average Covered Charges                               39572.2
Average Total Payments                                6551.47
Average Medicare Payments                             5451.89
Name: 868740, dtype: object

Yep. Those look like good matches.

Now that we know the matches, we need to wrangle the data to make it easier to review all the data together. I am going to make a concatenated name and address lookup for each of these source DataFrames.

hospital_accounts['Acct_Name_Lookup'] = hospital_accounts[
    ['Facility Name', 'Address', 'City', 'State']].apply(lambda x: '_'.join(x), axis=1)

hospital_reimbursement['Reimbursement_Name_Lookup'] = hospital_reimbursement[
    ['Provider Name', 'Provider Street Address', 'Provider City', 'Provider State']].apply(lambda x: '_'.join(x), axis=1)

account_lookup = hospital_accounts[['Acct_Name_Lookup']].reset_index()
reimbursement_lookup = hospital_reimbursement[['Reimbursement_Name_Lookup']].reset_index()

Now merge in with the account data:

account_merge = potential_matches.merge(account_lookup, how='left')
Account merge

Finally, merge in the reimbursement data:

final_merge = account_merge.merge(reimbursement_lookup, how='left')

Let’s see what the final data looks like:

cols = ['Account_Num', 'Provider_Num', 'Score', 'Acct_Name_Lookup', 'Reimbursement_Name_Lookup']
final_merge[cols].sort_values(by=['Account_Num', 'Score'], ascending=False)
Final account lookup

One of the differences between the toolkit approach and fuzzymatcher is that we are including multiple matches. For instance, account number 32725 could match two providers:

final_merge[final_merge['Account_Num'] == 32725][cols]
Account num 32725 matches

In this case, someone will need to investigate and figure out which match is the best. Fortunately it is easy to save all the data to Excel and do more analysis:

final_merge.sort_values(by=['Account_Num', 'Score'], ascending=False).to_excel('merge_list.xlsx', index=False)

As you can see from this example, the Record Linkage Toolkit allows a lot more flexibility and customization than fuzzymatcher. The downside is that there is a little more manipulation to get the data stitched back together in order to hand the data over to a person to complete the comparison.

Deduplicating data with Record Linkage Toolkit

Yo Dawg

One of the additional uses of the Record Linkage Toolkit is for finding duplicate records in a data set. The process is very similar to matching, except you match a single DataFrame against itself.

Let’s walk through an example using a similar data set:

hospital_dupes = pd.read_csv('hospital_account_dupes.csv', index_col='Account_Num')

Then create our indexer with a sorted neighbourhood block on State.

dupe_indexer = recordlinkage.Index()
dupe_indexer.sortedneighbourhood(left_on='State')
dupe_candidate_links = dupe_indexer.index(hospital_dupes)

We should check for duplicates based on city, phone number, name and address:

compare_dupes = recordlinkage.Compare()
compare_dupes.string('City', 'City', threshold=0.85, label='City')
compare_dupes.string('Phone Number', 'Phone Number', threshold=0.85, label='Phone_Num')
compare_dupes.string('Facility Name', 'Facility Name', threshold=0.80, label='Hosp_Name')
compare_dupes.string('Address', 'Address', threshold=0.85, label='Hosp_Address')
dupe_features = compare_dupes.compute(dupe_candidate_links, hospital_dupes)

Because we are only comparing with a single DataFrame, the resulting DataFrame has an Account_Num_1 and Account_Num_2 :

Dupe Detect

Here is how we score:

dupe_features.sum(axis=1).value_counts().sort_index(ascending=False)
3.0         7
2.0       206
1.0      7859
0.0    973205
dtype: int64

Add the score column:

potential_dupes = dupe_features[dupe_features.sum(axis=1) > 1].reset_index()
potential_dupes['Score'] = potential_dupes.loc[:, 'City':'Hosp_Address'].sum(axis=1)

Here’s a sample:

High likelihood of dupes

These 9 records have a high likelihood of being duplicated. Let’s look at an example to see if they might be dupes:

hospital_dupes.loc[51567,:]
Facility Name                SAINT VINCENT HOSPITAL
Address                      835 SOUTH VAN BUREN ST
City                                      GREEN BAY
State                                            WI
ZIP Code                                      54301
County Name                                   BROWN
Phone Number                         (920) 433-0112
Hospital Type                  Acute Care Hospitals
Hospital Ownership    Voluntary non-profit - Church
Name: 51567, dtype: object
hospital_dupes.loc[41166,:]
Facility Name                   ST VINCENT HOSPITAL
Address                          835 S VAN BUREN ST
City                                      GREEN BAY
State                                            WI
ZIP Code                                      54301
County Name                                   BROWN
Phone Number                         (920) 433-0111
Hospital Type                  Acute Care Hospitals
Hospital Ownership    Voluntary non-profit - Church
Name: 41166, dtype: object

Yes. That looks like a potential duplicate. The name and address are similar and the phone number is off by one digit. How many hospitals do they really need to treat all those Packer fans? :)

As you can see, this method can be a powerful and relatively easy tool to inspect your data and check for duplicate records.

Advanced Usage

In addition to the matching approaches shown here, the Record Linkage Toolkit contains several machine learning approaches to matching records. I encourage interested readers to review the documentation for examples.
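
As a taste of what that looks like, an unsupervised classifier such as the toolkit's ECMClassifier can be pointed at the comparison vectors we built earlier. This is a hedged sketch of the workflow; refer to the toolkit's documentation for the exact details:

# Sketch: unsupervised classification of the comparison vectors from above
ecm = recordlinkage.ECMClassifier()
links = ecm.fit_predict(features)
print(len(links))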

One of the pretty handy capabilities is that there is a browser based tool that you can use to generate record pairs for the machine learning algorithms.

Both tools include some capability for pre-processing the data to make the matching more reliable. Here is the preprocessing content in the Record Linkage Toolkit. This example data was pretty clean so you will likely need to explore some of these capabilities for your own data.

Summary

Linking different record sets on text fields like names and addresses is a common but challenging data problem. The python ecosystem contains two useful libraries that can take data sets and use multiple algorithms to try to match them together.

Fuzzymatcher uses sqlite’s full text search to simply match two pandas DataFrames together using probabilistic record linkage. If you have a larger data set or need to use more complex matching logic, then the Python Record Linkage Toolkit is a very powerful set of tools for joining data and removing duplicates.

Part of my motivation for writing this long article is that there are lots of commercial options out there for these problems and I wanted to raise awareness about these python options. Before you engage with an expensive consultant or try to pay for solution, you should spend an afternoon with these two options and see if it helps you out. All of the relevant code examples to get you started are in this notebook.

I always like to hear if you find these topics useful and applicable to your own needs. Feel free to comment below and let me know if you use these or any other similar tools.

credits: Title image - Un compositeur à sa casse


Stack Abuse: Integrating MongoDB with Python Using PyMongo


Introduction

In this post, we will dive into MongoDB as a data store from a Python perspective. To that end, we'll write a simple script to showcase what we can achieve and any benefits we can reap from it.

Web applications, like many other software applications, are powered by data. The organization and storage of this data are important as they dictate how we interact with the various applications at our disposal. The kind of data handled can also have an influence on how we undertake this process.

Databases allow us to organize and store this data, while also controlling how we store, access, and secure the information.

NoSQL Databases

There are two main types of databases - relational and non-relational databases.

Relational databases allow us to store, access, and manipulate data in relation to another piece of data in the database. Data is stored in organized tables with rows and columns with relationships linking the information among tables. To work with these databases, we use the Structured Query Language (SQL) and examples include MySQL and PostgreSQL.

Non-relational databases do not store data in relations or tables the way relational databases do. They are also referred to as NoSQL databases since we do not use SQL to interact with them.

Furthermore, NoSQL databases can be divided into Key-Value stores, Graph stores, Column stores, and Document Stores, which MongoDB falls under.

MongoDB and When to Use it

MongoDB is a document store and non-relational database. It allows us to store data in collections that are made up of documents.

In MongoDB, a document is simply stored in a JSON-like binary serialization format referred to as BSON, or Binary-JSON, and has a maximum size of 16 megabytes. This size limit is in place to ensure efficient memory and bandwidth usage during transmission.

MongoDB also provides the GridFS specification in case there is a need to store files larger than the set limit.

Documents are made up of field-value pairs, just like in regular JSON data. However, this BSON format can also contain more data types, such as Date types and Binary Data types. BSON was designed to be lightweight, easily traversable, and efficient when encoding and decoding data to and from BSON.

Being a NoSQL datastore, MongoDB allows us to enjoy the advantages that come with using a non-relational database over a relational one. One advantage is that it offers high scalability by efficiently scaling horizontally through sharding or partitioning of the data and placing it on multiple machines.

MongoDB also allows us to store large volumes of structured, semi-structured, and unstructured data without having to maintain relationships between it. Being open-source, the cost of implementing MongoDB is kept low to just maintenance and expertise.

Like any other solution, there are downsides to using MongoDB. The first one is that it does not maintain relationships between stored data. Due to this, it is hard to perform ACID transactions that ensure consistency.

Complexity is increased when trying to support ACID transactions. MongoDB, like other NoSQL data stores, is not as mature as relational databases and this can make it hard to find experts.

The non-relational nature of MongoDB makes it ideal for the storage of data in specific situations over its relational counterparts. For instance, a scenario where MongoDB is more suitable than a relational database is when the data format is flexible and has no relations.

With flexible/non-relational data, we don't need to maintain ACID properties when storing data as opposed to relational databases. MongoDB also allows us to easily scale data into new nodes.

However, with all its advantages, MongoDB is not ideal when our data is relational in nature. For instance, if we are storing customer records and their orders.

In this situation, we will need a relational database to maintain the relationships between our data, which are important. It is also not suitable to use MongoDB if we need to comply with ACID properties.

Interacting with MongoDB via Mongo Shell

To work with MongoDB, we will need to install the MongoDB Server, which we can download from the official homepage. For this demonstration, we will use the free Community Server.

The MongoDB server comes with a Mongo Shell that we can use to interact with the server via the terminal.

To activate the shell, just type mongo in your terminal. You'll be greeted with information about the MongoDB server set-up, including the MongoDB and Mongo Shell version, alongside the server URL.

For instance, our server is running on:

mongodb://127.0.0.1:27017

In MongoDB, a database is used to hold collections that contains documents. Through the Mongo shell, we can create a new database or switch to an existing one using the use command:

> use SeriesDB

Every operation we execute after this will be effected in our SeriesDB database. In the database, we will store collections, which are similar to tables in relational databases.

For example, for the purposes of this tutorial, let's add a few series to the database:

> db.series.insertMany([
... { name: "Game of Thrones", year: 2012},
... { name: "House of Cards", year: 2013 },
... { name: "Suits", year: 2011}
... ])

We're greeted with:

{
    "acknowledged" : true,
    "insertedIds" : [
        ObjectId("5e300724c013a3b1a742c3b9"),
        ObjectId("5e300724c013a3b1a742c3ba"),
        ObjectId("5e300724c013a3b1a742c3bb")
    ]
}

To fetch all the documents stored in our series collection, we use db.series.find({}), whose SQL equivalent is SELECT * FROM series. Passing an empty query (i.e. {}) will return all the documents:

> db.series.find({})

{ "_id" : ObjectId("5e3006258c33209a674d1d1e"), "name" : "The Blacklist", "year" : 2013 }
{ "_id" : ObjectId("5e300724c013a3b1a742c3b9"), "name" : "Game of Thrones", "year" : 2012 }
{ "_id" : ObjectId("5e300724c013a3b1a742c3ba"), "name" : "House of Cards", "year" : 2013 }
{ "_id" : ObjectId("5e300724c013a3b1a742c3bb"), "name" : "Suits", "year" : 2011 }

We can also query data using the equality condition, for instance, to return all the TV series that premiered in 2013:

> db.series.find({ year: 2013 })
{ "_id" : ObjectId("5e3006258c33209a674d1d1e"), "name" : "The Blacklist", "year" : 2013 }
{ "_id" : ObjectId("5e300724c013a3b1a742c3ba"), "name" : "House of Cards", "year" : 2013 }

The SQL equivalent would be SELECT * FROM series WHERE year=2013.
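
Beyond exact equality, the query document can use comparison operators such as $gt, $lt, or $in. For instance, to return every series that premiered after 2011, we could run:

> db.series.find({ year: { $gt: 2011 } })
{ "_id" : ObjectId("5e3006258c33209a674d1d1e"), "name" : "The Blacklist", "year" : 2013 }
{ "_id" : ObjectId("5e300724c013a3b1a742c3b9"), "name" : "Game of Thrones", "year" : 2012 }
{ "_id" : ObjectId("5e300724c013a3b1a742c3ba"), "name" : "House of Cards", "year" : 2013 }

The SQL equivalent would be SELECT * FROM series WHERE year > 2011.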

MongoDB also allows us to update individual documents using db.collection.UpdateOne(), or perform batch updates using db.collection.UpdateMany(). For example, to update the release year for Suits:

> db.series.updateOne(
{ name: "Suits" },
{
    $set: { year: 2010 }
}
)
{ "acknowledged" : true, "matchedCount" : 1, "modifiedCount" : 1 }

Finally, to delete documents, the Mongo Shell offers the db.collection.deleteOne() and db.collection.deleteMany() functions.

For instance, to delete all the series that premiered in 2012, we'd run:

> db.series.deleteMany({ year: 2012 })
{ "acknowledged" : true, "deletedCount" : 2 }

More information on the CRUD operations on MongoDB can be found in the online reference including more examples, performing operations with conditions, atomicity, and mapping of SQL concepts to MongoDB concepts and terminology.

Integrating Python with MongoDB

MongoDB provides drivers and tools for interacting with a MongoDB datastore using various programming languages including Python, JavaScript, Java, Go, and C#, among others.

PyMongo is the official MongoDB driver for Python, and we will use it to create a simple script that we will use to manipulate data stored in our SeriesDB database.

With Python 3.6+ and Virtualenv installed in our machines, let us create a virtual environment for our application and install PyMongo via pip:

$ virtualenv --python=python3 env --no-site-packages
$ source env/bin/activate
$ pip install pymongo

Using PyMongo, we are going to write a simple script that we can execute to perform different operations on our MongoDB database.

Connecting to MongoDB

First, we import pymongo in our mongo_db_script.py and create a client connected to our locally running instance of MongoDB:

import pymongo

# Create the client
client = pymongo.MongoClient('localhost', 27017)

# Connect to our database
db = client['SeriesDB']

# Fetch our series collection
series_collection = db['series']

So far, we have created a client that connects to our MongoDB server and used it to fetch our 'SeriesDB' database. We then fetch our 'series' collection and store it in an object.

Creating Documents

To make our script more convenient, we will write functions that wrap around PyMongo to enable us to easily manipulate data. We will use Python dictionaries to represent documents and we will pass these dictionaries to our functions. First, let us create a function to insert data into our 'series' collection:

# Imports truncated for brevity

def insert_document(collection, data):
    """ Function to insert a document into a collection and
    return the document's id.
    """
    return collection.insert_one(data).inserted_id

This function receives a collection and a dictionary of data and inserts the data into the provided collection. The function then returns an identifier that we can use to accurately query the individual object from the database.

We should also note that MongoDB automatically adds an additional _id key to our documents when one is not provided at creation time.

Now let's try adding a show using our function:

new_show = {
    "name": "FRIENDS",
    "year": 1994
}
print(insert_document(series_collection, new_show))

The output is:

5e4465cfdcbbdc68a6df233f

When we run our script, the _id of our new show is printed on the terminal and we can use this identifier to fetch the show later on.

We can provide an _id value instead of having it assigned automatically, which we'd provide in the dictionary:

new_show = {
    "_id": "1",
    "name": "FRIENDS",
    "year": 1994
}

And if we were to try and store a document with an existing _id, we'd be greeted with an error similar to the following:

DuplicateKeyError: E11000 duplicate key error index: SeriesDB.series.$id dup key: { : 1}

Retrieving Documents

To retrieve documents from the database we'll use find_document(), which queries our collection for single or multiple documents. Our function will receive a dictionary that contains the elements we want to filter by, and an optional argument to specify whether we want one document or multiple documents:

# Imports and previous code truncated for brevity

def find_document(collection, elements, multiple=False):
    """ Function to retrieve single or multiple documents from a provided
    Collection using a dictionary containing a document's elements.
    """
    if multiple:
        results = collection.find(elements)
        return [r for r in results]
    else:
        return collection.find_one(elements)

And now, let's use this function to find some documents:

result = find_document(series_collection, {'name': 'FRIENDS'})
print(result)

When executing our function, we did not provide the multiple parameter and the result is a single document:

{'_id': ObjectId('5e3031440597a8b07d2f4111'), 'name': 'FRIENDS', 'year': 1994}

When the multiple parameter is set to True, the result is a list of all the documents in our collection that have a name attribute set to FRIENDS.
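
For example, assuming we had inserted a few more shows named FRIENDS, a call with multiple=True would look like this:

results = find_document(series_collection, {'name': 'FRIENDS'}, multiple=True)
for document in results:
    print(document)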

Updating Documents

Our next function, update_document(), will be used to update a single specific document. We will use the _id of the document and the collection it belongs to when locating it:

# Imports and previous code truncated for brevity

def update_document(collection, query_elements, new_values):
    """ Function to update a single document in a collection.
    """
    collection.update_one(query_elements, {'$set': new_values})

Now, let's insert a document:

new_show = {
    "name": "FRIENDS",
    "year": 1995
}
id_ = insert_document(series_collection, new_show)

With that done, let's update the document, which we'll specify using the _id returned from adding it:

update_document(series_collection, {'_id': id_}, {'name': 'F.R.I.E.N.D.S'})

And finally, let's fetch it to verify that the new value has been put in place and print the result:

result = find_document(series_collection, {'_id': id_})
print(result)

When we execute our script, we can see that our document has been updated:

{'_id': ObjectId('5e30378e96729abc101e3997'), 'name': 'F.R.I.E.N.D.S', 'year': 1995}

Deleting Documents

And finally, let's write a function for deleting documents:

# Imports and previous code truncated for brevity

def delete_document(collection, query):
    """ Function to delete a single document from a collection.
    """
    collection.delete_one(query)

Since we're using the delete_one method, only one document can be deleted per call, even if the query matches multiple documents.

Now, let's use the function to delete an entry:

delete_document(series_collection, {'_id': id_})

If we try retrieving that same document:

result = find_document(series_collection, {'_id': id_})
print(result)

We're greeted with the expected result:

None
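
If we ever needed to remove every document matching a query rather than just one, PyMongo also provides delete_many. A minimal variant of our helper, with the delete_documents name being our own:

def delete_documents(collection, query):
    """ Delete every document matching the query and return the number removed. """
    return collection.delete_many(query).deleted_count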

Next Steps

We have highlighted and used a few of PyMongo's methods to interact with our MongoDB server from a Python script. However, we have not utilized all the methods available to us through the module.

All the available methods can be found in the official PyMongo documentation and are classified according to the submodules.

We've written a simple script that performs rudimentary CRUD functionality on a MongoDB database. While we could import the functions into a more complex codebase, or into a Flask/Django application for example, these frameworks already have libraries to achieve the same results. These libraries make it easier and more convenient to work with MongoDB, and help us connect to it more securely.

For example, with Django we can use libraries such as Django MongoDB Engine and Djongo, while Flask has Flask-PyMongo that helps bridge the gap between Flask and PyMongo to facilitate seamless connectivity to a MongoDB database.
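
As a rough illustration of the Flask route, a Flask-PyMongo setup can be as small as the sketch below; the URI, route, and collection access here are our own assumptions for this example:

from flask import Flask, jsonify
from flask_pymongo import PyMongo

app = Flask(__name__)
# Flask-PyMongo picks the database name ('SeriesDB') from the URI
app.config["MONGO_URI"] = "mongodb://localhost:27017/SeriesDB"
mongo = PyMongo(app)

@app.route("/series/<name>")
def get_series(name):
    # Exclude _id so the document is directly JSON-serializable
    document = mongo.db.series.find_one({"name": name}, {"_id": 0})
    return jsonify(document)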

Conclusion

MongoDB is a document store and falls under the category of non-relational databases (NoSQL). It has certain advantages compared to relational databases, as well as some disadvantages.

While it is not suitable for all situations, we can still use MongoDB to store data and manipulate the data from our Python applications using PyMongo among other libraries - allowing us to harness the power of MongoDB in situations where it is best suited.

It is therefore up to us to carefully examine our requirements before making the decision to use MongoDB to store data.

The script we have written in this post can be found on GitHub.

Wing Tips: Using "python -m" in Wing 7.2


Wing version 7.2 has been released, and the next couple Wing Tips look at some of its new features. We've already looked at reformatting with Black and YAPF and Wing 7.2's expanded support for virtualenv.

Now let's look at how to set up debugging modules that need to be launched with python -m. This command line option for Python searches the Python Path for the name of a module or package, and then loads and executes it. The capability was introduced way back in Python 2.4 and extended in Python 2.5 through PEP 338. However, it only came into widespread use relatively recently, for example to launch venv, black, or other command line tools that are shipped as Python packages.

Launching Modules

To configure Wing to launch a module by name with python -m, create a Named Entry Point from the Debug menu, select Named Module, and enter the module or package name and any run arguments:

/images/blog/python-m/named-entry-point-module.png

The above is equivalent to this command line:

python -m mymodule one two

The named entry point can be set as the main entry point for your project under the Debug/Execute tab of Project Properties, from the Project menu:

/images/blog/python-m/main-entry-point.png

Or it can be launched from the Debug > Debug Named Entry Point menu, or by assigning a key binding to it in the named entry point manager dialog.

Launching Packages

Packages can also be launched in this way, if they include a file named __main__.py to define the package's main entry point:

/images/blog/python-m/named-entry-point.png
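
For reference, a package becomes runnable with python -m as soon as it contains a __main__.py next to its other modules; running python -m mypackage then executes that file. A minimal sketch, with the mypackage name and its output being our own illustration rather than anything Wing requires:

# mypackage/__main__.py
import sys

def main(args):
    # This runs when the package is launched with "python -m mypackage"
    print("Hello from mypackage, arguments:", args)

if __name__ == "__main__":
    main(sys.argv[1:])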

Setting Python Path

Whether launching a module or a package, the name has to be found on the Python Path that you've configured for your project. If Wing fails to find the module, add its parent directory to the Python Path under the Environment tab in Project Properties:

/images/blog/python-m/python-path.png

That's it for now! We'll be back soon with more Wing Tips for Wing Python IDE.

As always, please don't hesitate to email support@wingware.com if you run into problems or have any questions.

Real Python: Finding the Perfect Python Code Editor


Find your perfect Python development setup with this review of Python IDEs and code editors. Writing Python using IDLE or the Python REPL is great for simple things, but not ideal for larger programming projects. With this course you’ll get an overview of the most common Python coding environments to help you make an informed decision.

By the end of this course, you’ll know how to:

  • Choose the Python editing environment that’s right for you
  • Perform common tasks like creating, running, and debugging code
  • Dig deeper into optimizing your favorite editing setup


EuroPython: EuroPython 2020: Presenting our conference logo for Dublin


We’re pleased to announce our official conference logo for EuroPython 2020, July 20-26, in Dublin, Ireland:

[Image: EuroPython 2020 conference logo]

The logo is inspired by the colors and symbols often associated with Ireland: the shamrock and the Celtic harp. It was again created by our designer Jessica Peña Moro from Simétriko, who had already helped us in previous years with the conference design.

Some more updates:

  • We’re working on launching the main website, the CFP and ticket sales in March.
  • We are also preparing the sponsorship packages and should have them ready early in March as well. Early bird sponsors will again receive a 10% discount on the package price. If you’re interested in becoming a launch sponsor, please contact our sponsor team at sponsoring@europython.eu.

Enjoy,

EuroPython 2020 Team
https://ep2020.europython.eu/

Podcast.__init__: APIs, Sustainable Open Source and The Async Web With Tom Christie


Summary

Tom Christie is probably best known as the creator of Django REST Framework, but his contributions to the state of the web in Python extend well beyond that. In this episode he shares his story of getting involved in web development, his work on various projects to power the asynchronous web in Python, and his efforts to make his open source contributions sustainable. This was an excellent conversation about the state of asynchronous frameworks for Python and the challenges of making a career out of open source.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, node balancers, a 40 Gbit/s public network, and a brand new managed Kubernetes platform, all controlled by a convenient API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they’ve got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host as usual is Tobias Macey and today I’m interviewing Tom Christie about the Encode organization and the work he is doing to drive the state of the art in async for Python

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by describing what the Encode organization is and how it came to be?
    • What are some of the other approaches to funding and sustainability that you have tried in the past?
    • What are the benefits to the developers provided by an organization which you were unable to achieve through those other means?
    • What benefits are realized by your sponsors as compared to other funding arrangements?
  • What projects are part of the Encode organization?
  • How do you determine fund allocation for projects and participants in the organization?
  • What is the process for becoming a member of the Encode organization and what benefits and responsibilities does that entail?
  • A large number of the projects that are part of the organization are focused on various aspects of asynchronous programming in Python. Is that intentional, or just an accident of your own focus and network?
  • For those who are familiar with Python web programming in the context of WSGI, what are some of the practices that they need to unlearn in an async world, and what are some new capabilities that they should be aware of?
  • Beyond Encode and your recent work on projects such as Starlette you are also well known as the creator of Django Rest Framework. How has your experience building and growing that project influenced your current focus on a technical, community, and professional level?
  • Now that Python 2 is officially unsupported and asynchronous capabilities are part of the core language, what future directions do you foresee for the community and ecosystem?
    • What are some areas of potential focus that you think are worth more attention and energy?
  • What do you have planned for the future of Encode, your own projects, and your overall engagement with the Python ecosystem?

Keep In Touch

Picks

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
  • To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA.
