
Zero-with-Dot (Oleg Żero): Training on batch: how to split data effectively?


Introduction

With increasing volumes of the data, a common approach to train machine-learning models is to apply the so-called training on batch. This approach involves splitting a dataset into a series of smaller data chunks that are handed to the model one at a time.

In this post, we will present three ideas to split the dataset for batches:

  • creating a “big” tensor,
  • loading partial data with HDF5,
  • python generators.

For illustration purposes, we will pretend that the model is a sound-based detector, but the analysis presented in this post is generic. Although the example is framed as a particular case, the steps discussed here (splitting, preprocessing and iterating over the data) form a common procedure. Whether the data comes in the form of image files, a table derived from a SQL query or an HTTP response, it is the procedure that is our main concern.

Specifically, we will compare our methods by looking into the following aspects:

  • code quality,
  • memory footprint,
  • time efficiency.

What is a batch?

Formally, a batch is understood as an input-output pair (X[i], y[i]), being a subset of the data. Since our model is a sound-based detector, it expects a processed audio sequence as input and returns the probability of occurrence of a certain event. Naturally, in our case, the batch consists of:

  • X[t] - a matrix representing processed audio track sampled within a time-window, and
  • y[t] - a binary label denoting the presence of the event,

where t denotes the time-window (figure 1.).

Figure 1. An example of data input (/assets/splitting-to-batches/data-input.png). Top: simple binary label (random), middle: raw audio channel (mono), bottom: spectrogram represented as the natural logarithm of the spectrum. The vertical lines represent slicing of the sequence into batches of 1 second length.

Spectrogram

As for the spectrogram, you can think of it as a way of describing how much of each “tune” is present within the audio track. For instance, when a bass guitar is being played, the spectrogram would reveal high intensity more concentrated on the lower side of the spectrum. Conversely, with a soprano singer we would observe the opposite. With this kind of “encoding”, a spectrogram naturally represents useful features for the model.

Comparing ideas

As a common prerequisite for our comparison, let’s briefly define the following imports and constants.

from scipy.signal import spectrogram
from os.path import join
from math import ceil
import numpy as np

FILENAME = 'test'
FILEPATH = 'data'
CHANNEL = 0      # mono track only
SAMPLING = 8000  # sampling rate (audio at 8k samples per s)
NFREQS = 512     # 512 frequencies for the spectrogram
NTIMES = 400     # 400 time-points for the spectrogram
SLEN = 1         # 1 second of audio for a batch

N = lambda x: (x - x.mean()) / x.std()  # normalization

filename = join(FILEPATH, FILENAME)

Here, the numbers are somewhat arbitrary. We decide to go for the lowest sampling rate (other common values are 16k and 22.05k samples per second), and let every X-chunk be a spectrogram of 512 FFT points that is calculated from a non-overlapping audio sequence of 1 s, using 400 data points along the time axis. In other words, each batch will be a pair: a spectrogram matrix (512 // 2 + 1 = 257 frequency rows by 400 time columns) and a binary label.
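As a quick sanity check of these numbers (a sketch on random data, not part of the original setup), we can confirm what scipy.signal.spectrogram returns for one second of audio:

import numpy as np
from scipy.signal import spectrogram

fs = 8000
audio = np.random.randn(fs)  # one second of fake mono audio

freqs, times, spec = spectrogram(audio, fs=fs,
                                 nperseg=fs // 400,  # 20 samples per segment -> 400 segments
                                 noverlap=0,
                                 nfft=512)

print(spec.shape)  # (257, 400): nfft=512 yields nfft // 2 + 1 frequency bins

This is also why the snippets below allocate bsize[0] // 2 + 1 frequency rows rather than 512.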

Idea #1 - A “big” tensor

The input to the model is a 2-dimensional tensor. As the last step involves iterating over the batches, it makes sense to increase the rank of the tensor and reserve the third dimension for the batch count. Consequently, the whole process can be outlined as follows:

  1. Load the x-data.
  2. Load the y-label.
  3. Slice X and y into batches.
  4. Extract features on each batch (here: the spectrogram).
  5. Collate X[t] and y[t] together.

Why wouldn’t that be a good idea? Let’s see an example of the implementation.

def create_X_tensor(audio, fs, slen=SLEN, bsize=(NFREQS, NTIMES)):
    n_batches = ceil(audio.shape[0] / (slen * fs))
    X = np.zeros((n_batches, bsize[0], bsize[1]))
    for bn in range(n_batches):
        aslice = slice(bn * slen * fs, (bn + 1) * slen * fs)
        *_, spec = spectrogram(N(audio[aslice]),
                fs=fs,
                nperseg=int(fs / bsize[1]),
                noverlap=0,
                nfft=bsize[0])
        X[bn, :spec.shape[0], :spec.shape[1]] = spec
    return np.log(X + 1e-6)  # to avoid -Inf

def get_batch(X, y, bn):
    return X[bn, :, :], y[bn]

if __name__ == '__main__':
    audio = np.load(filename + '.npy')[:, CHANNEL]
    label = np.load(filename + '-lbl.npy')
    X = create_X_tensor(audio, SAMPLING)
    for t in range(X.shape[0]):
        batch = get_batch(X, label, t)
        print('Batch #{}, shape={}, label={}'.format(t, batch[0].shape, batch[1]))

The essence of this method can best be described as load it all now, worry about it later.

While the fact that X is a self-contained piece of data can be viewed as an advantage, this approach has clear disadvantages:

  1. We load all of the data into RAM, regardless of whether the RAM can actually hold it.
  2. We use the first dimension of X for the batch count. However, this is solely based on a convention. What if the next time somebody decides that it should be the last one instead?
  3. Although X.shape[0] tells us exactly how many batches we have, we still have to create an auxiliary variable t to help us keep track of the batches. This design enforces the model training code to adhere to this decision.
  4. Finally, it requires the get_batch function to be defined. Its only purpose is to select a subset of X and y and collate them together. It seems undesirable at best.

Idea #2 - Loading batches with HDF5

Let’s start by eliminating the most dreaded problem: having to load all of the data into RAM. If the data comes from a file, it should be possible to load only portions of it and operate on those portions.

Using the skiprows and nrows arguments of Pandas’ read_csv, it is possible to load fragments of a .csv file, as sketched below. However, with the CSV format being rather impractical for storing sound data, the Hierarchical Data Format (HDF5) is a better choice. The format allows us to store multiple numpy-like arrays and access them in a numpy-like way.
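A minimal sketch of the pandas variant (the file layout, one audio sample per row, is an assumption made here):

import pandas as pd

t, rows_per_batch = 3, 8000  # hypothetical: one audio sample per CSV row

chunk = pd.read_csv('data/test.csv',
                    skiprows=1 + t * rows_per_batch,  # +1 skips the header row
                    nrows=rows_per_batch,
                    header=None)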

Here, we assume that the file contains intrinsic datasets called 'audio' and 'label'. Check out Python h5py library for more information.

import h5py as h5

def get_batch(filepath, t, slen=SLEN, bsize=(NFREQS, NTIMES)):
    with h5.File(filepath + '.h5', 'r') as f:
        fs = f['audio'].attrs['sampling_rate']
        audio = f['audio'][t * slen * fs:(t + 1) * slen * fs, CHANNEL]
        label = f['label'][t]
    *_, spec = spectrogram(N(audio),
            fs=fs,
            nperseg=int(fs / bsize[1]),
            noverlap=0,
            nfft=bsize[0])
    X = np.zeros((bsize[0] // 2 + 1, bsize[1]))
    X[:, :spec.shape[1]] = spec
    return np.log(X + 1e-6), label

def get_number_of_batches(filepath):
    with h5.File(filepath + '.h5', 'r') as f:
        fs = f['audio'].attrs['sampling_rate']
        sp = f['audio'].shape[0]
    return ceil(sp / fs)

if __name__ == '__main__':
    n_batches = get_number_of_batches(filename)
    for t in range(n_batches):
        batch = get_batch(filename, t)
        print('Batch #{}, shape={}, label={}'.format(t, batch[0].shape, batch[1]))

Hopefully, our data is now manageable (if it was not before)! Moreover, we have also achieved some progress when it comes to the overall quality:

  1. We got rid of the previous get_batch function and replaced it with one that is more meaningful. It computes what is necessary and delivers the data. Simple.
  2. Our X tensor no longer needs to be artificially modified.
  3. In fact, by changing get_batch(X, y, t) to get_batch(filename, t), we have abstracted access to our dataset and removed X and y from the namespace.
  4. The dataset has also become a single file. We do not need to source the data and the labels from two different files.
  5. We do not need to supply the fs (sampling rate) argument. Thanks to the so-called attributes in HDF5, it can be a part of the dataset file.

Despite the advantages, we are still left with two… inconveniences.

Because the new get_batch does not remember its state, we have to rely on controlling t using a loop, as before. However, there is no mechanism within get_batch to tell how large that loop needs to be (apart from adding a third output argument, which would make the function rather weird), so we need to check the size of our data beforehand. This requires us to create a second function: get_number_of_batches.

Unfortunately, this does not make the solution as elegant as it could be. If we only transform get_batch into a form that preserves its state, we can do better.

Idea #3 - Generators

Let’s recognize the pattern: we are only interested in accessing, processing and delivering data pieces one after the other. We do not need it all at once.

For these occasions, Python has a special construct, namely generators. Generators are functions that return generator iterators. Instead of eagerly performing the computation, the iterators deliver a bit of the result at a time and wait to be asked to continue. Perfect, right?

Generator iterators can be constructed in three ways:

  • through an expression that is similar to a list comprehension: e.g. (i for i in iterable), but using () instead of [],
  • from a generator function - by replacing return with yield, or
  • from a class object that defines custom __iter__ (or __getitem__) and __next__ methods (see docs).
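A minimal sketch of the three constructions side by side:

# 1. Generator expression: () instead of []
squares = (i * i for i in range(5))

# 2. Generator function: yield instead of return
def count_up_to(n):
    i = 0
    while i < n:
        yield i
        i += 1

# 3. Class with custom __iter__ and __next__
class CountUpTo:
    def __init__(self, n):
        self.i, self.n = 0, n

    def __iter__(self):
        return self

    def __next__(self):
        if self.i >= self.n:
            raise StopIteration
        self.i += 1
        return self.i - 1

print(list(squares))          # [0, 1, 4, 9, 16]
print(list(count_up_to(3)))   # [0, 1, 2]
print(list(CountUpTo(3)))     # [0, 1, 2]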

Here, using yield fits naturally in what we need to do.

def get_batches(filepath, slen=SLEN, bsize=(NFREQS, NTIMES)):
    with h5.File(filepath + '.h5', 'r') as f:
        fs = f['audio'].attrs['sampling_rate']
        n_batches = ceil(f['audio'].shape[0] / fs)
        for t in range(n_batches):
            audio = f['audio'][t * slen * fs:(t + 1) * slen * fs, CHANNEL]
            label = f['label'][t]
            *_, spec = spectrogram(N(audio),
                    fs=fs,
                    nperseg=int(fs / bsize[1]),
                    noverlap=0,
                    nfft=bsize[0])
            X = np.zeros((bsize[0] // 2 + 1, bsize[1]))
            X[:, :spec.shape[1]] = spec
            yield np.log(X + 1e-6), label

if __name__ == '__main__':
    for b in get_batches(filename):
        print('shape={}, label={}'.format(b[0].shape, b[1]))

The loop is now inside the function. Thanks to the yield statement, the (X[t], y[t]) pair is only returned after the generator has been advanced t + 1 times. The model training code does not need to manage the state of the loop: the function remembers its state between calls, allowing the user to iterate over batches rather than over some artificial batch index.

It is useful to compare generator iterators to containers of data: batches get removed with every iteration, and at some point the container becomes empty. Consequently, neither indexing nor a stop condition is necessary. Data gets consumed until there is no more data, and the process stops.
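To make the statefulness concrete, here is how consumption plays out with the get_batches function defined above:

batches = get_batches(filename)  # nothing is computed yet
first = next(batches)            # runs the body up to the first yield
second = next(batches)           # resumes right after the yield
rest = list(batches)             # consumes all remaining batches
# one more next(batches) would raise StopIteration: the "container" is empty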

Performance: time and memory

We intentionally started with a discussion of code quality, as it is tightly related to the way our solution has evolved. However, it is just as important to consider resource constraints, especially when data grows in volume.

Figure 2 presents the time it takes to deliver the batches using the three methods described earlier. As we can see, the time it takes to process and hand over the data is nearly the same: regardless of whether we load all of the data and then slice it, or load and process it bit by bit from the beginning, the total time is almost equal. This could, of course, be a consequence of having an SSD that allows fast access to the data. Still, the chosen strategy seems to have little impact on overall time performance.

Figure 2. Time performance comparison (/assets/splitting-to-batches/time-performance.png). The red solid line refers to timing both loading the data into memory and performing the computation. The red dotted line times only the loop where slices are delivered, assuming that data was precomputed. The green dotted line refers to loading batches from an HDF5 file, and the blue dash-dotted line implements a generator. Comparing the red lines, we can see that just accessing the data once it is in RAM is almost free. When data is local, the differences between the other cases are minimal anyway.

A much bigger difference can be observed in figure 3. The first approach is the most memory-hungry of all: a 1-hour-long audio sample throws a MemoryError. Conversely, when loading data in chunks, the allocated RAM is determined by the batch size, leaving us safely below the limit.

/assets/splitting-to-batches/ram-consumption.png Figure 3. Memory consumption comparison, expressed in terms of the percentage of the available RAM being consumed by the python script, evaluated using: (env)$ python idea.py & top -b -n 10 > capture.log; cat capture.log | egrep python > analysis.log, and post-processed.

Surprisingly (or not), there is no significant difference between the second and the third approach. What the figure tells us is that choosing to implement a generator iterator has no impact on the memory footprint of our solution.

This is an important take-away. Generators are often encouraged as a more efficient way to save both time and memory. Instead, the figure shows that generators alone do not make a solution more resource-efficient. What matters is how quickly we can access the data and how much of it we can handle at once.

Using an HDF5 file proves to be efficient, since we can access the data very quickly, and flexible enough that we do not need to load it all at once. At the same time, implementing a generator improves code readability and quality. Although we could also frame the first approach as a generator, it would not make any sense: without the ability to load data in smaller quantities, a generator would only improve the syntax. Consequently, the best approach is to combine partial data loading with a generator, which is exactly what the third approach does.

Final remarks

In this post, we have presented three ways to split and process data in batches, comparing both the performance of each approach and the overall code quality. We have also shown that generators on their own do not make code more efficient: the final performance is dictated by time and memory constraints. Generators can, however, make the solution more elegant.

What solution do you find the most appealing?


Learn PyQt: LearnPyQt — One year in, and more to come.


It's been a very good year.

Back in May I was looking through my collection of PyQt tutorials and videos and trying to decide what to do with them. They were pretty popular, but being hosted on multiple sites meant they lacked structure between them and were less useful than they could be. I needed somewhere to put them.

Having looked at the options available for hosting tutorials and courses, I couldn't find something that fit my requirements. So I committed the #1 programmer mistake of building my own.

LearnPyQt.com was born, and it turned out pretty great.

The site uses a freemium model — long detailed text tutorials, with an upgrade to buy video courses and books for those that want them. Built on the Django-based Wagtail CMS it has been extended with some custom apps into a fully-fledged learning management system. But it's far from complete. Plans include adding progress tracking, certificates and some lightweight gamification. The goal here is to provide little hooks and challenges, to keep you inspired and experimenting with PyQt (and Python).

The availability of the free tutorials is key — not everyone wants videos or books and not wanting those things is no reason not to learn something. Even so, the upgrade is a one-off payment to keep it affordable for as many people as possible, no subscriptions here!

New Tutorials

Once the existing tutorials and videos were up and running I set about creating more. These new tutorials were modelled on the popular multithreading tutorial, taking frequently asked PyQt5 questions and pain points and tackling them in detail together with working examples. This led first to the (often dreaded) ModelView architecture, which really isn't that bad, and then later to bitmap graphics, which unlocks the power of QPainter and gives you the ability to create your own custom widgets.

Custom widgets

As the list of obvious targets dries up I'll be adding a topic-voting system on site to allow students to request and vote for their particular topics of interest, to keep me on topic with what people actually want.

New Videos

The video tutorials were where it all started, however in the past year these have fallen a little behind. This will be rectified in the coming months, with new video tutorials recorded for the advanced tutorials and updates to the existing videos following shortly after. The issue has been balancing between writing new content and recording new content, but that problem is solved now we have...

New Writers

With the long list of things to tackle I was very happy to be joined this year by a new writer — John Lim. John is a Python developer from Malaysia, who's been developing with PyQt5 for over 2 years and still remembers all the pain points getting started. His first tutorials covered embedding custom widgets from Qt Designer and basic plotting with PyQtGraph both of which were a huge success.

If you're interested in becoming a writer, you can! You get paid, and — assuming you enjoy writing about PyQt — it's a lot of fun.

New Types of Content

In addition to all the new tutorials and videos, we've been experimenting with new types of content on the site. First of all we have been working on a set of example apps and widgets which you can use for inspiration — or just plain use the code from — for your own projects. Everything on the site is open source and free to use.


We've also been experimenting with alternative short-form tutorials/documentation for core Qt widgets and features. The first of these, by John, covers adding scrollable regions with QScrollArea to your app. We'll have more of these, together with more complete documentation re-written for Python, coming soon.

New Year

That's all for this year.

To help the year go out with a bang, we're currently running a 50% discount on all courses and books with the code NEWYEAR20. Every purchase gets unlimited access to all future updates and upgrades, so this is a great way to get in ahead of all the good stuff coming down the pipeline.

The same code will give 10% off after New Year. Feel free to share it with the people you love, or wait a few days and share it with people you love slightly less.

Here's to another year building GUI apps with Python!

Codementor: Making Your First GUI: Python3, Tkinter

Your First GUI: Python3, Tkinter! Making a conversion app with a GUI

Moshe Zadka: Meditations on the Zen of Python


(This is based on the series published in opensource.com as 9 articles: 1, 2, 3, 4, 5, 6, 7, 8, 9)

Python contributor Tim Peters introduced us to the Zen of Python in 1999. Twenty years later, its 19 guiding principles continue to be relevant within the community.

The Zen of Python is not "the rules of Python" or "guidelines of Python". It is full of contradiction and allusion. It is not intended to be followed: it is intended to be meditated upon.

In this spirit, I offer this series of meditations on the Zen of Python.

Beautiful is better than ugly.

It was in Structure and Interpretation of Computer Programs (SICP) that the point was made: "Programs must be written for people to read and only incidentally for machines to execute." Machines do not care about beauty, but people do.

A beautiful program is one that is enjoyable to read. This means first that it is consistent. Tools like Black, flake8, and Pylint are great for making sure things are reasonable on a surface layer.

But even more important, only humans can judge what humans find beautiful. Code reviews and a collaborative approach to writing code are the only realistic way to build beautiful code. Listening to other people is an important skill in software development.

Finally, all the tools and processes are moot if the will is not there. Without an appreciation for the importance of beauty, there will never be an emphasis on writing beautiful code.

This is why this is the first principle: it is a way of making "beauty" a value in the Python community. It immediately answers: "Do we really care about beauty?" We do.

Explicit is better than implicit.

We humans celebrate light and fear the dark. Light helps us make sense of vague images. In the same way, programming with more explicitness helps us make sense of abstract ideas. It is often tempting to make things implicit.

"Why is self explicitly there as the first parameter of methods?"

There are many technical explanations, but all of them are wrong. It is almost a Python programmer's rite of passage to write a metaclass that makes explicitly listing self unnecessary. (If you have never done this before, do so; it makes a great metaclass learning exercise!)

The reason self is explicit is not because the Python core developers did not want to make a metaclass like that the "default" metaclass. The reason it is explicit is because there is one less special case to teach: the first argument is explicit.

Even when Python does allow non-explicit things, such as context variables, we must always ask: Are we sure we need them? Could we not just pass arguments explicitly? Sometimes, for many reasons, this is not feasible. But prioritizing explicitness means, at least, asking the question and estimating the effort.

Simple is better than complex.

When it is possible to choose at all, choose the simple solution. Python is rarely in the business of disallowing things. This means it is possible, and even straightforward, to design baroque programs to solve straightforward problems.

It is worthwhile to remember at each point that simplicity is one of the easiest things to lose and the hardest to regain when writing code.

This can mean choosing to write something as a function, rather than introducing an extraneous class. This can mean avoiding a robust third-party library in favor of writing a two-line function that is perfect for the immediate use-case. Most often, it means avoiding predicting the future in favor of solving the problem at hand.

It is much easier to change the program later, especially if simplicity and beauty were among its guiding principles, than to load the code down with all possible future variations.

Complex is better than complicated.

This is possibly the most misunderstood principle because understanding the precise meanings of the words is crucial. Something is complex when it is composed of multiple parts. Something is complicated when it has a lot of different, often hard to predict, behaviors.

When solving a hard problem, it is often the case that no simple solution will do. In that case, the most Pythonic strategy is to go "bottom-up." Build simple tools and combine them to solve the problem.

This is where techniques like object composition shine. Instead of having a complicated inheritance hierarchy, have objects that forward some method calls to a separate object. Each of those can be tested and developed separately and then finally put together.
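A minimal sketch of that forwarding style (the class names are invented for illustration):

class Engine:
    def start(self):
        return "engine started"

class Car:
    def __init__(self, engine):
        self._engine = engine           # composition: a Car *has an* Engine

    def start(self):
        return self._engine.start()     # forward the call instead of inheriting

print(Car(Engine()).start())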

Another example of "building up" is using singledispatch, so that instead of one complicated object, we have a simple, mostly behavior-less object and separate behaviors.
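For instance, a small sketch with functools.singledispatch, where each type-specific behavior lives in its own simple function:

from functools import singledispatch

@singledispatch
def describe(value):
    return "something else"

@describe.register
def _(value: int):
    return "an integer"

@describe.register
def _(value: str):
    return "a string"

print(describe(3), describe("x"), describe(0.5))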

Flat is better than nested.

Nowhere is the pressure to be "flat" more obvious than in Python's strong insistence on indentation. Other languages will often introduce an implementation that "cheats" on the nested structure by reducing indentation requirements. To appreciate this point, let's take a look at JavaScript.

JavaScript is natively async, which means that programmers write code in JavaScript using a lot of callbacks.

a(function(resultsFromA) {
  b(resultsFromA, function(resultsFromB) {
    c(resultsFromB, function(resultsFromC) {
      console.log(resultsFromC);
    });
  });
});

Ignoring the code, observe the pattern and the way indentation leads to a right-most point. This distinctive "arrow" shape is tough on the eye to quickly walk through the code, so it's seen as undesirable and even nicknamed "callback hell." However, in JavaScript, it is possible to "cheat" and not have indentation reflect nesting.

a(function(resultsFromA) {
b(resultsFromA,
  function(resultsFromB) {
c(resultsFromB,
  function(resultsFromC) {
    console.log(resultsFromC);
})})});

Python affords no such options to cheat: every nesting level in the program must be reflected in the indentation level. So deep nesting in Python looks deeply nested. That means "callback hell" was a worse problem in Python than in JavaScript: nesting callbacks means indenting, with no option to "cheat" with braces.

This challenge, in combination with the Zen principle, has led to an elegant solution by a library I worked on. In the Twisted framework, we came up with the deferred abstraction, which would later inspire the popular JavaScript promise abstraction. In this way, Python's unwavering commitment to clear code forces Python developers to discover new, powerful abstractions.

future_value = future_result()
future_value.addCallback(a)
future_value.addCallback(b)
future_value.addCallback(c)

(This might look familiar to modern JavaScript programmers: Promises were heavily influenced by Twisted's deferreds.)

Sparse is better than dense.

The easiest way to make something less dense is to introduce nesting. This habit is why the principle of sparseness follows the previous one: after we have reduced nesting as much as possible, we are often left with dense code or data structures. Density, in this sense, is jamming too much information into a small amount of code, making it difficult to decipher when something goes wrong.

Reducing that denseness requires creative thinking, and there are no simple solutions. The Zen of Python does not offer simple solutions. All it offers are ways to find what can be improved in the code, without always giving guidance for "how."

Take a walk. Take a shower. Smell the flowers. Sit in a lotus position and think hard, until finally, inspiration strikes. When you are finally enlightened, it is time to write the code.

Readability counts.

In some sense, this middle principle is indeed the center of the entire Zen of Python. The Zen is not about writing efficient programs. It is not even about writing robust programs, for the most part. It is about writing programs that other people can read.

Reading code, by its nature, happens after the code has been added to the system. Often, it happens long after. Neglecting readability is the easiest choice since it does not hurt right now. Whatever the reason for adding new code -- a painful bug or a highly requested feature -- it does hurt. Right now.

In the face of immense pressure to throw readability to the side and just "solve the problem," the Zen of Python reminds us: readability counts. Writing the code so it can be read is a form of compassion for yourself and others.

Special cases aren't special enough to break the rules.

There is always an excuse. This bug is particularly painful; let's not worry about simplicity. This feature is particularly urgent; let's not worry about beauty. The domain rules covering this case are particularly hairy; let's not worry about nesting levels.

Once we allow special pleading, the dam wall breaks, and there are no more principles; things devolve into a Mad Max dystopia with every programmer for themselves, trying to find the best excuses.

Discipline requires commitment. It is only when things are hard, when there is a strong temptation, that a software developer is tested. There is always a valid excuse to break the rules, and that's why the rules must be kept the rules. Discipline is the art of saying no to exceptions. No amount of explanation can change that.

Although, practicality beats purity.

“If you think only of hitting, springing, striking, or touching the enemy, you will not be able actually to cut him.” (Miyamoto Musashi, The Book of Water)

Ultimately, software development is a practical discipline. Its goal is to solve real problems, faced by real people. Practicality beats purity: above all else, we must solve the problem. If we think only about readability, simplicity, or beauty, we will not be able to actually solve the problem.

As Musashi suggested, the primary goal of every code change should be to solve a problem. The problem must be foremost in our minds. If we waver from it and think only of the Zen of Python, we have failed the Zen of Python. This is another one of those contradictions inherent in the Zen of Python.

Errors should never pass silently...

Before the Zen of Python was a twinkle in Tim Peters' eye, before Wikipedia became informally known as "wiki," the first WikiWiki site, C2, existed as a trove of programming guidelines. These are principles that mostly came out of a Smalltalk programming community. Smalltalk's ideas influenced many object-oriented languages, Python included.

The C2 wiki defines the Samurai Principle: "return victorious, or not at all." In Pythonic terms, it encourages eschewing sentinel values, such as returning None or -1 to indicate an inability to complete the task, in favor of raising exceptions. A None is silent: it looks like a value and can be put in a variable and passed around. Sometimes, it is even a valid return value.

The principle here is that if a function cannot accomplish its contract, it should "fail loudly": raise an exception. The raised exception will never look like a possible value. It will skip past the returned_value = call_to_function(parameter) line and go up the stack, potentially crashing the program.

A crash is straightforward to debug: there is a stack trace indicating the problem as well as the call stack. The failure might mean that a necessary condition for the program was not met, and human intervention is needed. It might mean that the program's logic is faulty. In either case, the loud failure is better than a hidden, "missing" value, infecting the program's valid data with None, until it is used somewhere and an error message says "None does not have method split," which you probably already knew.
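A small sketch of the contrast (the function is invented for illustration):

def find_user(users, name):
    for user in users:
        if user["name"] == name:
            return user
    # Fail loudly instead of returning a silent None sentinel
    raise LookupError("no user named {!r}".format(name))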

Unless explicitly silenced.

Exceptions sometimes need to be explicitly caught. We might anticipate some of the lines in a file are misformatted and want to handle those in a special way, maybe by putting them in a "lines to be looked at by a human" file, instead of crashing the entire program.

Python allows us to catch exceptions with except. This means errors can be explicitly silenced. This explicitness means that the except line is visible in code reviews. It makes sense to question why this is the right place to silence, and potentially recover from, the exception. It makes sense to ask if we are catching too many exceptions or too few.

Because this is all explicit, it is possible for someone to read the code and understand which exceptional conditions are recoverable.
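A sketch of the misformatted-lines scenario described above (the line format is invented):

def parse(line):
    key, sep, value = line.partition("=")
    if not sep:
        raise ValueError("misformatted line: {!r}".format(line))
    return key.strip(), value.strip()

clean, suspicious = [], []
for line in ["a = 1", "oops", "b = 2"]:
    try:
        clean.append(parse(line))
    except ValueError:
        suspicious.append(line)  # set aside for a human instead of crashing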

In the face of ambiguity, refuse the temptation to guess.

What should the result of 1 + "1" be? Both "11" and 2 would be valid guesses. This expression is ambiguous: there is no single thing it can do that would not be a surprise to at least some people.

Some languages choose to guess. In JavaScript, the result is "11". In Perl, the result is 2. In C, naturally, the result is the empty string. In the face of ambiguity, JavaScript, Perl, and C all guess.

In Python, this raises a TypeError: an error that is not silent. It is atypical to catch TypeError: it will usually terminate the program or at least the current task (for example, in most web frameworks, it will terminate the handling of the current request).

Python refuses to guess what 1 + "1" means. The programmer is forced to write code with clear intention: either 1 + int("1"), which would be 2 or str(1) + "1", which would be "11"; or "1"[1:], which would be an empty string. By refusing to guess, Python makes programs more predictable.

There should be one -- and preferably only one -- obvious way to do it.

Prediction also goes the other way. Given a task, can you predict the code that will be written to achieve it? It is impossible, of course, to predict perfectly. Programming, after all, is a creative task.

However, there is no reason to intentionally provide multiple, redundant ways to achieve the same thing. There is a sense in which some solutions are "better" or "more Pythonic."

Part of the appreciation for the Pythonic aesthetic is that it is OK to have healthy debates about which solution is better. It is even OK to disagree and keep programming. It is even OK to agree to disagree for the sake of harmony. But beneath it all, there has to be a feeling that, eventually, the right solution will come to light. There must be the hope that eventually we can live in true harmony by agreeing on the best way to achieve a goal.

Although that way may not be obvious at first (unless you're Dutch).

This is an important caveat: It is often not obvious, at first, what is the best way to achieve a task. Ideas are evolving. Python is evolving. The best way to read a file block-by-block is, probably, to wait until Python 3.8 and use the walrus operator.

This common task, reading a file block-by-block, did not have a "single best way to do it" for almost 30 years of Python's existence.
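For reference, a sketch of that Python 3.8 idiom (the filename is a placeholder):

total = 0
with open("data.bin", "rb") as f:
    while block := f.read(8192):   # the 3.8 "walrus" assignment expression
        total += len(block)
print(total, "bytes read")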

When I started using Python in 1998 with Python 1.5.2, there was no single best way to read a file line-by-line. For many years, the best way to know if a dictionary had a key was to use .has_key(), until the in operator became the best way.

It is only by appreciating that finding the one (and only one) way of achieving a goal can sometimes take 30 years of trying out alternatives that Python can keep aiming to find those ways. This view of history, where 30 years is an acceptable time for something to take, often feels foreign to people in the United States, where the country has existed for just over 200 years.

The Dutch, whether it’s Python creator Guido van Rossum or famous computer scientist Edsger W. Dijkstra, have a different worldview, reflected in this part of the Zen of Python. A certain European appreciation for time is essential.

Now is better than never.

There is always the temptation to delay things until they are perfect. They will never be perfect, though. When they look "ready" enough, that is when it is time to take the plunge and put them out there. Ultimately, a change always happens at some now: the only thing that delaying does is move it to a future person's "now."

Although never is often better than right now.

This, however, does not mean things should be rushed. Decide the criteria for release in terms of testing, documentation, user feedback, and so on. "Right now," as in before the change is ready, is not a good time.

This is a good lesson not just for popular languages like Python, but also for your personal little open source project.

If the implementation is hard to explain, it's a bad idea.

The most important thing about programming languages is predictability. Sometimes we explain the semantics of a certain construct in terms of abstract programming models, which do not correspond exactly to the implementation. However, the best of all explanations just explains the implementation.

If the implementation is hard to explain, it means the avenue is not viable.

If the implementation is easy to explain, it may be a good idea.

Just because something is easy does not mean it is worthwhile. However, once it is explained, it is much easier to judge whether it is a good idea.

This is why the second half of this principle intentionally equivocates: nothing is certain to be a good idea, but it always allows people to have that discussion.

Namespaces in Python

Python uses namespaces for everything. Though simple, they are sparse data structures, which is often the best way to achieve a goal.

Modules are namespaces. This means that correctly predicting module semantics often just requires familiarity with how Python namespaces work. Classes are namespaces. Objects are namespaces. Functions have access to their local namespace, their parent namespace, and the global namespace.

The simple model, where the . operator accesses an object, which in turn will usually, but not always, do some sort of dictionary lookup, makes Python hard to optimize, but easy to explain.
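A short sketch of that lookup model at work across modules and classes:

import math

print(math.pi)                 # the . operator on a module namespace...
print(math.__dict__["pi"])     # ...is (usually) just a dictionary lookup

class Point:
    x = 0                      # classes are namespaces too

print(Point.__dict__["x"], Point.x)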

Indeed, some third-party modules take this guideline and run with it. For example, the variants package turns functions into namespaces of "related functionality." It is a good example of how the Zen of Python can inspire new abstractions.

S. Lott: Christmas Ornament

See https://github.com/slott56/cpx-xmas-ornament

You'll need a Circuit Playground Express https://www.adafruit.com/product/3333

Install the code. Enjoy the noise and blinky lights.

The MML translation isn't as complete as you might like. The upper/lower case handling for the various commands isn't quite as clean as it could be. AFAIK, case shouldn't matter, but I omitted any lower() calls, making the MML parser case-sensitive. It only mattered for one of the four songs, and it was easier to edit the song.

The processing leaves a great deal of "clickiness" in the start_tone() processing. I think I know how to address it.

There are barely 96 or so different tones available in MML compositions. It might be possible to generate the wave shapes in advance to have a smoother music experience.

One could imagine having an off-line translator to transform the MML text into a sequence of bytes with note number and duration. This would slightly compress the song and would speed up processing by eliminating the overhead of parsing.
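As a rough sketch of that idea (the byte format here is entirely hypothetical):

import struct

# Hypothetical compiled song: one (note_number, duration_ms) pair per note.
song = [(60, 250), (62, 250), (64, 500)]

packed = b''.join(struct.pack('<BH', note, dur) for note, dur in song)

# The on-device player can then unpack fixed-size records with no parsing.
for note, dur in struct.iter_unpack('<BH', packed):
    print(note, dur)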

Additionally, having 96 wave tables could speed up tone production. The tiny bit of time to recompute the sine wave at a given frequency would be eliminated. But. Memory is limited.

Real Python: Sorting Data With Python


All programmers will have to write code to sort items or data at some point. Sorting can be critical to the user experience in your application, whether it’s ordering a user’s most recent activity by timestamp, or putting a list of email recipients in alphabetical order by last name. Python sorting functionality offers robust features to do basic sorting or customize ordering at a granular level.

In this course, you’ll learn how to sort various types of data in different data structures, customize the order, and work with two different methods of sorting in Python.

By the end of this tutorial, you’ll know how to:

  • Implement basic Python sorting and ordering on data structures
  • Differentiate between sorted() and .sort()
  • Customize a complex sort order in your code based on unique requirements

For this course, you’ll need a basic understanding of lists and tuples as well as sets. Those data structures will be used in this course, and some basic operations will be performed on them. Also, this course uses Python 3, so example output might vary slightly from what you’d see if you were using Python 2.
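As a quick taste of the sorted() versus .sort() distinction the course covers (a sketch, not course material):

numbers = [3, 1, 2]
print(sorted(numbers))      # returns a new list: [1, 2, 3]
numbers.sort(reverse=True)  # sorts in place and returns None
print(numbers)              # [3, 2, 1]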



John Cook: Area of sinc and jinc function lobes


Someone left a comment this morning on my blog post on sinc and jinc integrals regarding the area of the lobes.

It would be nice to have the values of integrals of each lobe, i.e. integrals between 0 and multiples of pi. Anyone knows of such a table?

This post will include Python code to address that question.

First, let me back up and explain the context. The sinc function is defined as [1]

sinc(x) = sin(x) / x

and the jinc function is defined analogously as

jinc(x) = J1(x) / x,

substituting the Bessel function J1 for the sine function. You could think of Bessel functions as analogs of sines and cosines. Bessel functions often come up when vibrations are described in polar coordinates, just as sines and cosines come up when using rectangular coordinates.

Here’s a plot of the sinc and jinc functions:

The lobes are the regions between crossings of the x-axis. For the sinc function, the lobe in the middle runs from -π to π, and for n > 0 the nth lobe runs from nπ to (n+1)π. The zeros of Bessel functions are not uniformly spaced like the zeros of the sine function, but they come up in applications frequently, and so it’s easy to find software to compute their locations.

First of all we’ll need some imports.

    from scipy import sin, pi
    from scipy.special import jn, jn_zeros
    from scipy.integrate import quad

The sinc and jinc functions are continuous at zero, but the computer doesn’t know that [2]. To prevent division by zero, we return the limiting value of each function for very small arguments.

    def sinc(x):
        return 1 if abs(x) < 1e-8 else sin(x)/x

    def jinc(x):
        return 0.5 if abs(x) < 1e-8 else jn(1,x)/x

You can show via Taylor series that these functions are exact to the limits of floating point precision for |x| < 10^-8: sin(x)/x = 1 - x²/6 + O(x⁴) and J₁(x)/x = 1/2 - x²/16 + O(x⁴), so for such small x the neglected terms are below 10^-17, smaller than the double-precision machine epsilon of roughly 2.2 × 10^-16.

Here’s code to compute the area of the sinc lobes.

    def sinc_lobe_area(n):
        n = abs(n)
        integral, info = quad(sinc, n*pi, (n+1)*pi)
        return 2*integral if n == 0 else integral

The corresponding code for the jinc function is a little more complicated because we need to compute the zeros for the Bessel function J1. Our solution is a little clunky because we have an upper bound N on the lobe number. Ideally we’d work out an asymptotic value for the lobe area and compute zeros up to the point where the asymptotic approximation became sufficiently accurate, and switch over to the asymptotic formula for sufficiently large n.

    N = 100                   # upper bound on the lobe number (see note above)
    jzeros = jn_zeros(1, N)   # locations of the first N positive zeros of J1

    def jinc_lobe_area(n):
        n = abs(n)
        assert n < N
        a = 0 if n == 0 else jzeros[n-1]   # the middle lobe starts at 0
        integral, info = quad(jinc, a, jzeros[n])
        return 2*integral if n == 0 else integral

Note that the 0th element of the array returned by jn_zeros is the first positive zero of J1; it doesn’t include the zero at the origin.
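A quick check (not from the original post) makes this concrete:

    from scipy.special import jn_zeros
    print(jn_zeros(1, 3))   # [ 3.83170597  7.01558667 10.17346814]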

For both sinc and jinc, the even numbered lobes have positive area and the odd numbered lobes have negative area. Here’s a plot of the absolute values of the lobe areas.


[1] Some authors define sinc(x) as sin(πx)/πx. Both definitions are common.

[2] Scipy has a sinc function in scipy.special, defined as sin(πx)/πx, but it doesn’t have a jinc function.

Catalin George Festila: News: Python 2.7 no longer supported by the Python team.

The 1st of January 2020 will mark the sunset of Python 2.7. It’s clear that Python 3 is more popular these days; you can learn more about the popularity of both on Google Trends. Python 3.0 was released in December 2008, with the main goal of fixing problems existing in Python 2. From the 1st of January 2020, Python 2 will no longer receive any support whatsoever from the core Python team. Migrating to

Test and Code: 97: 2019 Retrospective, 2020 Plans, and an amazing decade


This episode is not just a look back on 2019 and a look forward to 2020.
2019 is also the end of an amazingly transformative decade for me, so I'm going to discuss that as well.

top 10 episodes of 2019

  • 10: episode 46, Testing Hard To Test Applications - Anthony Shaw
  • 9: episode 64, Practicing Programming to increase your value
  • 8: episode 70, Learning Software without a CS degree - Dane Hillard
  • 7: episode 75, Modern Testing Principles - Alan Page
  • 6: episode 72, Technical Interview Fixes - April Wensel
  • 5: episode 69, Andy Hunt - The Pragmatic Programmer
  • 4: episode 73, PyCon 2019 Live Recording
  • 3: episode 71, Memorable Tech Talks, The Ultimate Guide - Nina Zakharenko
  • 2: episode 76, TDD: Don’t be afraid of Test-Driven Development - Chris May
  • 1: episode 89, Improving Programming Education - Nicholas Tollervey

Looking back on the last decade
Some amazing events, like 2 podcasts, a book, a blog, speaking events, and teaching, have led me to where we're at now.

Looking forward to 2020 and beyond
I discussed what's in store for the next year and beyond.

A closing quote
Software is a blast. At least, it should be.
I want everyone to have fun writing software.
Leaning on automated tests is the best way I know to give myself the confidence and freedom to:

  • rewrite big chunks of code
  • play with the code
  • try new things
  • have fun without fear
  • go home feeling good about what I did
  • be proud of my code

I want everyone to have that.

That's why I promote and teach automated testing.

I hope you had an amazing decade.
And I wish you a productive and fun 2020 and the upcoming decade.
If we work together and help each other reach new heights, we can achieve some pretty amazing things.

Sponsored By:

  • Raygun: Detect, diagnose, and destroy Python errors that are affecting your customers. With smart Python error monitoring software from Raygun.com, you can be alerted to issues affecting your users the second they happen.

Support Test & Code: Python Software Testing & Engineering

Links:

  • Thanks, 201X! — Mahmoud Hashemi's blog


PyCoder’s Weekly: Issue #401 (Dec. 31, 2019)


#401 – DECEMBER 31, 2019



Python 2.7 Retires Today

Python 2.7 will not be maintained past Jan 1st, 2020. So long Python 2, and thank you for your years of faithful service. Python 3, your time is now!
PYTHONCLOCK.ORG

Meditations on the Zen of Python

“The Zen of Python is not ‘the rules of Python’ or ‘guidelines of Python’. It is full of contradiction and allusion. It is not intended to be followed: it is intended to be meditated upon. In this spirit, I offer this series of meditations on the Zen of Python.”
MOSHE ZADKA

Scout APM for Python


Check out Scout’s developer-friendly application performance monitoring solution for Python. Scout continually tracks down N+1 database queries, sources of memory bloat, performance abnormalities, and more. Get back to coding with Scout →
SCOUT APM sponsor

Python Timer Functions: Three Ways to Monitor Your Code

Learn how to use Python timer functions to monitor how fast your programs are running. You’ll use classes, context managers, and decorators to measure your program’s running time. You’ll learn the benefits of each method and which to use given the situation.
REAL PYTHON

Open Source Migrates With Emotional Distress

The creator of Flask reflects on the Python 2 to 3 migration and how the Python community handled the transition. Interesting read!
ARMIN RONACHER

Python REPL and Shell Integration Tips

Some good tips and ways to minimize the context interruption when moving between the shell and a Python session.
JOHN D. COOK

My Business Card Runs Linux & MicroPython

Embedded systems engineer builds a card-sized computer that boots Linux and runs MicroPython. Cool!
GEORGE HILLIARD

Python Jobs

Python Web Developer (Remote)

Premiere Digital

Software Engineer (Bristol, UK)

Envelop Risk

Python Contractor RaspPi/EPICS (Nelson, BC, Canada)

D-Pace Inc

More Python Jobs >>>

Articles & Tutorials

The Python Packaging Ecosystem

“[It] seems worthwhile for me to write-up my perspective as one of the lead architects for that ecosystem on how I characterize the overall problem space of software publication and distribution, where I think we are at the moment, and where I’d like to see us go in the future.”
NICK COGHLAN

Top 10 Real Python Articles of 2019

I was a guest on Mike Kennedy’s Talk Python podcast and we discussed a shortlist of 10 interesting tutorials published on Real Python in 2019. So if you’re looking to expand your year-end reading list you’ll find some inspiration there. It’s always a treat to be on Mike’s show—definitely check out his podcast!
TALK PYTHON podcast

Python Developers Are in Demand on Vettery


Vettery is an online hiring marketplace that’s changing the way people hire and get hired. Ready for a bold career move? Make a free profile, name your salary, and connect with hiring managers from top employers today →
VETTERY sponsor

Sorting Data With Python

In this step-by-step course, you’ll learn how to sort in Python. You’ll know how to sort various types of data in different data structures, customize the order, and work with two different ways of sorting in Python.
REAL PYTHON video

Training on Batch: How Do You Split the Data?

Creating data batches for model training, evaluated in the context of loading data using Python generators, HDF5 files and NumPy, with a sound-processing machine-learning model as an example.
OLEG ŻERO

How to use Pandas get_dummies to Create Dummy Variables in Python

Dummy variables (or binary/indicator variables) are often used in statistical analyses as well as in more simple descriptive statistics.
ERIK MARSJA

Python Type Hints & MyPy Tutorial

This post covers mypy in general terms as well many examples demonstrating the syntax and capabilities of this type checker.
GUILHERME KUNIGAMI

Pipx: Installing, Uninstalling & Upgrading Python Packages in Virtual Envs

Here you will learn how to install, uninstall, & upgrade Python packages using the pipx tool.
ERIK MARSJA

Projects & Code

drf_dynamics: Dynamic Queryset and Serializer Setup for Django REST Framework

Handles the hassle of handling the amount of fields to be serialized and queryset changes for each request for you.
GITHUB.COM/IMBOKOV • Shared by Ilya Bokov

Events

PyStaDa

January 8, 2020
PYSTADA.GITHUB.IO

PyMNTos

January 9, 2020
PYTHON.MN

Python Atlanta

January 9, 2020
MEETUP.COM


Happy Pythoning!
This was PyCoder’s Weekly Issue #401.



Tryton News: Newsletter January 2020


@ced wrote:

The Tryton team wishes you a happy new year.
Here are the new features the team has already prepared for the next version.

Contents:

Changes For The User

We prevent posting draft moves that were created when a statement was validated. Such moves are posted when the statement is actually posted. This ensures a coherent state between the statement and the moves.

In every case the production cost is now allocated automatically to the outgoing moves. The allocation is based on the list price of each of the outgoing products. Any products with no list price are considered as waste and do not have a cost allocated.

We added the list of selection criteria to the carrier. So if you duplicate a carrier, the criteria are also automatically duplicated.

The company module now has its own menu entry and its own administrator group. This provides finer access control.

The “project invoice” group no longer gives access to the timesheets. This provides a better separation of roles.

On small screens (like mobile), the web client no longer displays empty cells. This optimizes the space available for use with other information.

You no longer need to enter a work center for production works that are in a request or draft state.

We now use the multiselection widget to select the week days that a supplier can deliver on.

When starting a CSV export from a client, by default all of the columns in the current view are selected. Previously the Many2One fields were skipped for technical reasons, but these are now also selected. This gives the user better, more expected behavior.

You can now define a supervisor for each employee. This can be used to define access rules based on the company’s organization.

To improve the display performance on the web client, the number of records displayed is reset to its default value when the list is reloaded.

We implemented a new strategy for positioning new records in the list. The client now tries to position the new record depending on the list’s current ordering.

Changes For The Developer

Tryton can now use WeasyPrint to convert HTML and XHTML reports into PDFs. This avoids using LibreOffice, which isn’t always as good at rendering HTML reports.

The back-end classes can now be imported directly instead of using an indirect function.

The current employee is now in the evaluation context of the record rules. This allows you to create rules that depend on the user’s business role instead of just their user account.

You can now update the action from the ModelView.button_action by returning a dictionary with the value to change. This avoids needing to use a wizard to create a dynamic action.

All the tests are run on the new docker images before publishing them. This reinforces the stability of the published images.

It is now possible to add help text to the keys of Dict fields.

We have extended the generic tests on the wizards to ensure that the buttons point to an existing state.

The views for One2Many and Many2Many fields are no longer pre-fetched if the relation field is displayed with a different widget such as the multiselection widget which doesn’t need a view.

We now run the desktop client tests in our drone CI. This normalizes how tests are executed for all our packages and ensures no regressions are introduced by mistake.

It is now possible to load WSGI middleware using the configuration file. For example, to load the InternetExplorerFix from werkzeug:

[wsgi middleware]
ie = werkzeug.contrib.fixers.InternetExplorerFix

[wsgi ie]
kwargs={'fix_attach': False}

We now use the non-standard SKIP LOCKED instead of the also non-standard advisory lock when pulling tasks from the queue. It has better performance and is more SQL-ish, and it provides a nicer fallback if the feature is not implemented by the back-end.

It is now possible to use a MultiSelection field as the key for a Dict field.


IslandT: Create a simple python project on Google Colaboratory


Hello and happy 2020 to you all! This year I will continue to write more Python-related articles and continue to build up this website, which is not only for Python programming but also for other interesting topics such as game creation with Godot and other game engines, Linux and Windows related topics, and online earning opportunities. In the Python part, we will start to build a few online projects using Google Colaboratory and a few offline projects using various IDEs. So let us get started immediately!

In this article, I will create a simple Python program just to familiarize myself with Google Colaboratory, a free Jupyter notebook environment that requires no setup and runs entirely in the cloud. With Colaboratory we can write and execute code, save and share our analyses, and access powerful computing resources, all for free from our own browser. We also do not need to download the extra Python modules our program requires, for example numpy, which is used to perform various mathematical jobs.

We will create a simple Python function with Google Colaboratory that returns True if the string entered as a parameter is a digit from 0-9, and False otherwise.

The first step is to start a new python 3 project, go to File->New Python 3 notebook after you have logged into your own Google account.

Create a new Python 3 project

We can further rename the Python file, but for now let's just leave that alone and concentrate on creating the simple Python function first. Click on the Code button to create a new code cell, then type in the below lines of code and press the play button next to the code editor to run the program.

Press the Code button to create a new python code.
The program will return True if we have entered “6” into the above function.

After we have created the first program, we can rename the above Python file by going to File->Rename and typing in the new name for that file. Below is the full Python program.
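
The program itself only appears as a screenshot in the original post; a minimal sketch of such a digit-checking function (the function name is hypothetical) could be:

def is_digit(value):
    # True only if the string is a single digit from 0-9.
    return len(value) == 1 and value in "0123456789"

print(is_digit("6"))   # True
print(is_digit("66"))  # False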

The filename has been changed.

There is a lot of good stuff that this online IDE can provide, which we will start to explore when we create another Python project in the coming chapter! What are your thoughts about Colaboratory? Leave them below this post.

Mike C. Fletcher: DRM Names for EGL Enumerated Devices


So it turns out that there's an extension for getting the DRM name for an EGL queried device that seems to work on Ubuntu 19.10. With that it should be relatively easy to target an off-screen render to a particular device. (The extension allows `eglQueryDeviceStringEXT` to respond to `EGL_DRM_DEVICE_FILE_EXT`). Happy New Year all.

Armin Ronacher: I'm not feeling the async pressure


Async is all the rage. Async Python, async Rust, go, node, .NET: pick your favorite ecosystem and it will have some async going. How well this async business works depends quite a lot on the ecosystem and the runtime of the language, but overall it has some nice benefits. It makes one thing really simple: awaiting an operation that can take some time to finish. It makes it so simple, in fact, that it creates innumerable new ways to blow one's foot off. The one I want to discuss is the one where you don't realize you're blowing your foot off until the system starts overloading, and that's the topic of back pressure management. A related term in protocol design is flow control.

What's Back Pressure

There are many explanations for back pressure and a great one is Backpressure explained — the resisted flow of data through software, which I recommend reading. So instead of going into detail about what back pressure is, I just want to give a very short definition and explanation: back pressure is resistance that opposes the flow of data through a system. Back pressure sounds quite negative — who does not imagine a bathtub overflowing due to a clogged pipe — but it's here to save your day.

The setup we're dealing with here is more or less the same in all cases: we have a system composed of different components forming a pipeline, and that pipeline has to accept a certain number of incoming messages.

You could imagine this like you would model luggage delivery at airports. Luggage arrives, gets sorted, loaded into the aircraft and finally unloaded. At any point an individual piece of luggage is thrown together with other luggage into containers for transportation. When a container is full it will need to be picked up. When no containers are left, that's a natural example of back pressure. Now the person that would want to throw luggage into a container can't, because there is no container. A decision has to be made now. One option is to wait: that's often referred to as queueing or buffering. The other option is to throw away some luggage until a container arrives — this is called dropping. That sounds bad, but we will get into why this is sometimes important later. However there is another thing that plays in here. Imagine the person tasked with putting luggage into a container does not receive a container for an extended period of time (say a week). If they did not throw any luggage away, eventually they will have an awful lot of luggage standing around — so much that they run out of physical space to store it. At that point they are better off telling the airport not to accept any more incoming luggage until their container issue is resolved. This is commonly referred to as flow control and is a crucial aspect of networking.

All these processing pipelines are normally scaled for a certain number of messages (or in this case luggage) per time period. If the number exceeds this — or, worst of all, if the pipeline stalls — terrible things can happen. An example of this in the real world was the London Heathrow Terminal 5 opening, where 42,000 bags failed to be routed correctly over 10 days because the IT infrastructure did not work correctly. They had to cancel more than 500 flights, and for a while airlines chose to permit carry-on luggage only.

Back Pressure is Important

What we learn from the Heathrow disaster is that being able to communicate back pressure is crucial. In real life, as well as in computing, time is always finite. Eventually someone gives up waiting on something. In particular, even if internally something would wait forever, externally it wouldn't.

A real-world example of this: if your bag is supposed to go via London Heathrow to your destination in Paris, but you will only be there for 7 days, then it is completely pointless for your luggage to arrive with a 10 day delay. In fact you want your luggage to be re-routed back to your home airport.

It's in fact better to admit defeat — that you're overloaded — than to pretend that you're operational and keep buffering up forever because at one point it will only make matters worse.

So why is back pressure all of a sudden a topic to discuss, when we wrote thread-based software for years and it did not seem to come up? A combination of many factors, some of which simply make it easy to shoot yourself in the foot.

Bad Defaults

To understand why back pressure matters in async code I want to give you a seemingly simple piece of code with Python's asyncio that showcases a handful of situations where we accidentally forgot about back pressure:

from asyncio import start_server, run

async def on_client_connected(reader, writer):
    while True:
        data = await reader.readline()
        if not data:
            break
        writer.write(data)

async def server():
    srv = await start_server(on_client_connected, '127.0.0.1', 8888)
    async with srv:
        await srv.serve_forever()

run(server())

If you are new to the concept of async/await, just imagine that at any point where await is called, the function suspends until the expression resolves. Here the start_server function that is provided by Python's asyncio system runs a hidden accept loop. It listens on a socket and spawns an independent task running the on_client_connected function for each socket that connects.

Now this looks pretty straightforward. You could remove all the await and async keywords and you end up with code that looks very similar to how you would write code with threads.

However that hides one very crucial issue, which is the root of all our problems here: function calls that do not have an await in front of them. In threaded code any function can yield. In async code only async functions can. This means for instance that the writer.write method cannot block. So how does this work? It will try to write the data right into the operating system's socket buffer, which is non-blocking. However, what happens if the buffer is full and the socket would block? In the threading case we could just block here, which would be ideal because it means we're applying some back pressure. But because there are no threads here, we can't do that. So we're left with buffering or dropping data. Because dropping data would be pretty terrible, Python instead chooses to buffer. Now what happens if someone sends a lot of data in but does not read? Well, in that case the buffer will grow and grow and grow. This API deficiency is why the Python documentation says not to use write on its own but to follow up with drain:

writer.write(data)
await writer.drain()

Drain will drain some of the excess from the buffer. It will not cause the entire buffer to flush out, but just enough to prevent things from running out of control. So why is write not doing an implicit drain? Well, it's a massive API oversight and I'm not exactly sure how it happened.

An important point here is that most sockets are based on TCP, and TCP has built-in flow control. A writer will only write as fast as the reader is willing to accept (give or take some buffering involved). This is hidden from you entirely as a developer, because not even the BSD socket libraries expose this implicit flow control handling.

So did we fix our back pressure issue here? Well, let's see how this whole thing would look in a threading world. In a threading world our code most likely would have had a fixed number of threads running, and the accept loop would have waited for a thread to become available to take over the request. In our async example however we now have an unbounded number of connections we're willing to handle. This means we're willing to accept a very high number of connections even if it means the system would potentially overload. In this very simple example this is probably less of an issue, but imagine what would happen if we were to do some database access.

Picture a database connection pool that will give out up to 50 connections. What good is it to accept 10000 connections when most of them will bottleneck on that connection pool?

Waiting vs Waiting to Wait

So this finally leads me to where I wanted to go in the first place. In most async systems, and definitely in most of what I encountered in Python, even if you fix all the socket-level buffering behavior, you end up in a world where you chain a bunch of async functions together with no regard for back pressure.

If we take our database connection pool example let's say there are only 50 connections available. This means at most we can have 50 concurrent database sessions for our code. So let's say we want to let 4 times as many requests be processed as we're expecting that a lot of what the application does is independent of the database. One way to go about it would be to make a semaphore with 200 tokens and to acquire one at the beginning. If we're out of tokens we would start waiting for the semaphore to release a token.

But hold on. Now we're back to queueing! We're just queueing a bit earlier. If we were to severely overload the system now we would queue all the way at the beginning. So now everybody would wait for the maximum amount of time they are willing to wait and then give up. Worse: the server might still process these requests for a while until it realizes the client has disappeared and is no longer interested in the response.

So instead of waiting straight away we would want some feedback. Imagine you're in a post office and you are drawing a ticket from a machine that tells you when it's your turn. This ticket gives you a pretty good indication of how long you will have to wait. If the waiting time is too long you can decide to abandon your ticket and head out to try again later. Note that the waiting time you have until it's your turn at the post office is independent of the waiting time you have for your request (for instance because someone needs to fetch your parcel, check documents and collect a signature).

So here is the naive version where we can only notice we're waiting:

from asyncio import Semaphore

semaphore = Semaphore(200)

async def handle_request(request):
    await semaphore.acquire()
    try:
        return generate_response(request)
    finally:
        semaphore.release()

For the caller of the handle_request async function we can only see that we're waiting and nothing is happening. We can't tell whether we're waiting because we're overloaded or because generating the response just takes this long. We're basically endlessly buffering here until the server finally runs out of memory and crashes.

The reason for this is that we have no communication channel for back pressure. So how would we go about fixing this? One option is to add a layer of indirection. Now here, unfortunately, asyncio's semaphore is of no use because it only lets us wait. But let's imagine we could ask the semaphore how many tokens are left; then we could do something like this:

from hypothetical_asyncio.sync import Semaphore, Service

semaphore = Semaphore(200)

class RequestHandlerService(Service):
    async def handle(self, request):
        await semaphore.acquire()
        try:
            return generate_response(request)
        finally:
            semaphore.release()

    @property
    def is_ready(self):
        return semaphore.tokens_available()

Now we have changed the system somewhat. We now have a RequestHandlerService which has a bit more information. In particular it has the concept of readiness: the service can be asked whether it's ready. That operation is inherently non-blocking and a best estimate. It has to be, because we're inherently racy here.

The caller would now turn from this:

response = await handle_request(request)

Into this:

request_handler = RequestHandlerService()
if not request_handler.is_ready:
    response = Response(status_code=503)
else:
    response = await request_handler.handle(request)

There are multiple ways to skin the cat, but the idea is the same: before we actually commit ourselves to doing something, we have a way to figure out how likely it is that we're going to succeed, and if we're overloaded we communicate this upwards.

Now, I did not come up with this definition of a service; the design comes from Rust's tower and actix-service crates, which both define a very similar service trait.

Now there is still a chance of piling up on the semaphore because of how racy this is. You can either take that risk, or still fail when handle is invoked.
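
Here is a sketch of that fail-fast variant, using asyncio's real Semaphore.locked() method (the exception type and the surrounding Service pieces remain the hypothetical ones from above):

class ServiceOverloadedError(Exception):
    """Hypothetical error raised instead of queueing when overloaded."""

class RequestHandlerService(Service):
    async def handle(self, request):
        if semaphore.locked():
            # Refuse immediately instead of queueing; still racy, since
            # the semaphore may fill up between this check and acquire().
            raise ServiceOverloadedError()
        await semaphore.acquire()
        try:
            return generate_response(request)
        finally:
            semaphore.release()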

A library that has solved this better than asyncio is trio, which exposes the internal counter on its semaphore and provides a CapacityLimiter, a semaphore optimized for the purpose of capacity limiting that protects against some common pitfalls.
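
With trio, the fail-fast pattern could look like this (a sketch: CapacityLimiter, acquire_nowait and WouldBlock are real trio APIs, while Response and generate_response are the same illustrative stand-ins as above):

import trio

limiter = trio.CapacityLimiter(200)

async def handle_request(request):
    try:
        # Take a token without waiting; fail fast when at capacity.
        limiter.acquire_nowait()
    except trio.WouldBlock:
        return Response(status_code=503)
    try:
        return generate_response(request)
    finally:
        limiter.release()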

Streams and Protocols

Now the example above solves RPC-style situations for us. For every call we can be informed well ahead of time if the system is overloaded. A lot of these protocols have pretty straightforward ways to communicate that the server is under load. In HTTP for instance you can emit a 503, which can also carry a retry-after header that tells the client when it's a good idea to retry. This retry adds a natural point to re-evaluate whether what you want to retry is still the same request or whether something has changed. For instance if you can't retry in 15 seconds, maybe it's better to surface this inability to the user instead of showing an endless loading icon.
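
With the hypothetical Response type from earlier, communicating that retry window could look like this:

# Tell the client we are overloaded and when a retry makes sense.
response = Response(status_code=503, headers={"Retry-After": "15"})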

However request/response style protocols are not the only ones. A lot of protocols have persistent connections open and let you stream a lot of data through. Traditionally a lot of these protocols were based on TCP which as was mentioned earlier has built-in flow control. This flow control is however not really exposed through socket libraries which is why high level protocols typically need to add their own flow control to it. In HTTP 2 for instance a custom flow control protocol exists because HTTP 2 multiplexes multiple independent streams over a single TCP connection.

Coming from a TCP background where flow control is managed silently behind the scenes can set a developer down a dangerous path where one just reads bytes from a socket and assumes this is all there is to know. However the TCP API is misleading, because flow control is — from an API perspective — completely hidden from the user. When you design your own streaming-based protocol, you will absolutely need to make sure that there is a bidirectional communication channel and that the sender is not just sending, but also reading, to see if it is allowed to continue.

With streams concerns are typically different. A lot of streams are just streams of bytes or data frames and you can't just drop packets in between. Worse: it's often not easy for a sender to check if they should slow down. In HTTP2 you need to interleave reads and writes constantly on the user level. You absolutely must handle flow control there. The server will send you (while you are writing) WINDOW_UPDATE frames when you're allowed to continue writing.

This means that streaming code becomes a lot more complex because you need to write yourself a framework first that can act on incoming flow control information. The hyper-h2 Python library for instance has a surprisingly complex file upload server example with flow control based on curio and that example is not even complete.
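
To make the shape of such code concrete, here is a heavily simplified, blocking sketch of a flow-control-aware sender built on the h2 library (the socket handling is illustrative and error handling is omitted):

import h2.connection

def send_with_flow_control(sock, conn, stream_id, data):
    # conn is an h2.connection.H2Connection with an already-open stream.
    while data:
        # How many bytes the peer currently allows us to send.
        window = conn.local_flow_control_window(stream_id)
        chunk_size = min(window, len(data), conn.max_outbound_frame_size)
        if chunk_size > 0:
            conn.send_data(stream_id, data[:chunk_size])
            sock.sendall(conn.data_to_send())
            data = data[chunk_size:]
        else:
            # Window exhausted: block until the peer sends frames
            # (hopefully WINDOW_UPDATE) that grow the window again.
            conn.receive_data(sock.recv(65536))
            sock.sendall(conn.data_to_send())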

New Footguns

async/await is great, but it encourages writing stuff that will behave catastrophically when overloaded. On the one hand because it's just so easy to queue, but also because making a function async after the fact is an API breakage. I can only assume this is why Python still has a non-awaitable write function on the stream writer.

The biggest reason though is that async/await lets you write code many people wouldn't have written with threads in the first place. That's, I think, a good thing, because it lowers the barrier to actually writing larger systems. The downside is that it also means many more developers who previously had little experience with distributed systems now have many of the problems of a distributed system, even if they only write a single program. As an example, HTTP2 is a protocol complex enough, due to its multiplexing nature, that the only reasonable way to implement it is based on async/await.

And it's not just async/await code that suffers from these issues. Dask, for instance, is a parallelism library for Python used by data science programmers, and despite not using async/await there are bug reports of the system running out of memory due to the lack of back pressure. These issues are rather fundamental.

The lack of back pressure, however, is a type of footgun the size of a bazooka. If you realize too late that you built a monster, it will be almost impossible to fix without major changes to the code base, because you might have forgotten to make some functions async that should have been. And a different programming environment does not help here: people have the same issues in all programming environments, including the latest additions like Go and Rust. It's not uncommon to find issues about "handle flow control" or "handle back pressure" that stay open for a lengthy period of time even on very popular projects, because it turns out that it's really hard to add after the fact. For instance Go has an open issue from 2014 about adding a semaphore to all filesystem IO because it can overload the host. aiohttp has an issue dating back to 2016 about clients being able to break the server due to insufficient back pressure. There are many, many more examples.

If you look at the Python hyper-h2 docs there is a shocking number of examples that say something along the lines of "does not handle flow control" or "It does not obey HTTP/2 flow control, which is a flaw, but it is otherwise functional". I believe that the complexity of flow control once it surfaces, and the temptation to just pretend it's not an issue, is why we're in this mess in the first place. Flow control also adds significant overhead and doesn't look good in benchmarks.

So for you developers of async libraries here is a new year's resolution for you: give back pressure and flow control the importance they deserve in documentation and API.

Caktus Consulting Group: Our Top 19 Blogs of 2019


During the last year we gave our popular technical blog an official name: Developer Access. We published 32 posts on the blog, including technical how-to’s, conference information, web development best practices and detailed guides. Among all those posts, 19 rose to the top of the popularity list (based on total pageviews):

1. A Guide To Creating An API Endpoint With Django REST Framework: Our most popular blog post was published on February 1, and the title is self-explanatory. Adding an API endpoint can take considerable time, but with the Django REST Framework tools, it can be done more quickly.

2. How to Use Django Bulk Inserts for Greater Efficiency: If you have an application that needs to insert a lot of data into a Django model, it pays to "chunk" those updates to the database.

3. How to Switch to a Custom Django User Model Mid-Project: Django documentation recommends starting projects with a custom user model, but what if you didn’t? See how to add a custom user model to an existing project on Django 2.0+.

4. Coding for Time Zones & Daylight Saving Time — Oh, the Horror: It’s difficult to program correctly when using times, dates, time zones, and daylight saving time. See why it’s so challenging and learn how to account for the nuances.

5. How to Set Up a Centralized Log Server with rsyslog: There are a myriad of ways to configure rsyslog (and centralized logging in general), often with little documentation about how best to do so. This post helps you consolidate logs with minimal resource overhead.

6. How to Import Multiple Excel Sheets in Pandas: Pandas is a powerful Python data analysis tool, used heavily in the data science community. At Caktus, in addition to using it for data exploration, we also incorporate it into Extract, Transform, and Load (ETL) processes.

7. Why We Love Wagtail (and You Will, Too): Wagtail is a user-friendly content management system (CMS) that also provides a robust technical solution with customizable content management tools.

8. Django: Recommended Reading: It's important to read up on the latest industry trends and technologies to stay current and address client needs. Review our list of the books, blogs, and other documents that we’ve found to be the most accurate, helpful, and practical for Django development.

9. A Review of ReportLab: PDF Processing with Python: Python has a great library for generating and manipulating PDFs: ReportLab. We read more about this extremely useful library in “ReportLab: PDF Processing with Python,” by Michael Driscoll. With a few caveats, it’s an excellent resource.

10. One Team’s Development Patterns With Vue.js and Django REST Framework: On a recent project, one of our development teams chose to use Vue.js, along with a Django back-end with Django REST Framework (DRF). See some of the development patterns they chose as they worked through a number of issues.

11. Our Favorite PyCon 2019 Presentations: PyCon 2019 attracted 3,393 attendees, including a group of six Cakti. Read about the fascinating, clever presentations.

12. How to Do Wagtail Data Migrations: Here’s a detailed guide for doing data migrations with StreamFields in the Wagtail CMS.

13. Book Review: Creating GUI Applications with wxPython: The book “Creating GUI Applications with wxPython” by Michael Driscoll, provides various techniques for programming GUI applications in Python using wxPython.

14. Be Quick or Eat Potatoes: A Newbie’s Guide to PyCon: PyCon 2019 was held in Cleveland from May 1 - 9. Read about the experience through the eyes of a first-time attendee.

15. Caktus Blog: Top 18 Posts of 2018: If you’re interested in looking back even further into our blog archive, check out our most popular posts from 2018.

16. DjangoCon 2019 Delivered Again: Again this year, DjangoCon more than delivered on its promise of something for everyone. The conference took place in San Diego and ran from September 22 - 26.

17. Caktus Adopts New Web Framework: Our April Fool’s joke announced that Caktus would build new projects using our new COBOL-based framework, ADD COBOL TO WEB. If you missed it, check it out for a good laugh.

18. Suggestions For Picking Up Old Projects: Often, we pick up a project that we either have not worked on in a long time, or haven’t worked on at all. In our efforts to work on such projects, a few things have been helpful both for becoming familiar with the projects more quickly, and for making the same projects easier to pick up in the future.

19. 7 Conferences We’re Looking Forward To: Cakti attended a number of conferences around the country. This list highlights ones from 2019 that we were looking forward to attending.

Thank you for reading our blog during the past year. We look forward to providing more sharp content in 2020, and we welcome any questions, suggestions, or feedback. Simply leave a comment below. We love hearing from you!

Happy New Year!


Matt Layman: Make A Custom User Model - Building SaaS #40

In this episode, we started a users app and hooked up the custom user model feature of Django to unlock the full extensibility of that model in the future. The stream was cut short this week because of some crashing issues in the OBS streaming software. The goal of the episode was to add django-allauth so that users can sign into the service with an email and password instead of the default username and password combination.
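
The standard Django pattern for this (a sketch of the usual approach, not necessarily the episode's exact code; the app name "users" is an assumption) is:

# users/models.py
from django.contrib.auth.models import AbstractUser

class User(AbstractUser):
    # No extra fields yet; they can be added later without the painful
    # migration that switching user models mid-project would require.
    pass

# settings.py
AUTH_USER_MODEL = "users.User"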

Django Weblog: Django bugfix release: 3.0.2


Today we've issued the 3.0.2 bugfix release.

The release package and checksums are available from our downloads page, as well as from the Python Package Index. The PGP key ID used for this release is Mariusz Felisiak: 2EF56372BA48CD1B.

Doing Math with Python: Number of trailing zeros in the factorial of an integer



Hi all, I recently learned about a cool formula to calculate the number of trailing zeros in the factorial of a number. It has been a while since I wrote a program to do something like this, so I decided to change that and write this blog post. In the spirit of writing various "calculators", we will write a "number of trailing zeros" calculator. First up though, let's refresh some key relevant concepts.

Factorial: The factorial of a number n, denoted by n!, is the product n*(n-1)*(n-2)*...*1. For example, 5! = 5*4*3*2*1 = 120.

Trailing zeros: The trailing zeros of a number are the zeros at the end of it. For example, the number 567100 has two trailing zeros.

Floor: The floor of a number x is the greatest integer less than or equal to x. That is, the floor of 3.2 is 3, the floor of 3.5 is 3, and the floor of 3 is 3 as well.

Now, coming back to the focus of this post, this document at brilliant.org wiki explains the process in detail.

The key bit therein is this formula:

zeros = sum over i = 1 to k of floor(n / 5^i)
      = floor(n/5) + floor(n/5^2) + ... + floor(n/5^k)

where n is the number for whose factorial we want to find the number of trailing zeros, and k is defined as:

k = floor(log_5(n))
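For example, for n = 100 we have k = floor(log_5(100)) = 2, so the number of trailing zeros is floor(100/5) + floor(100/25) = 20 + 4 = 24; and indeed 100! ends in 24 zeros.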

The following Python program implements the above formula:

import math


def is_positive_integer(x):
    try:
        x = float(x)
    except ValueError:
        return False
    else:
        if x.is_integer() and x > 0:
            return True
        else:
            return False


def trailing_zeros(num):
    if is_positive_integer(num):
        # The above function call has done all the sanity checks for us,
        # so we can safely convert the input to an integer here. Going
        # through float() first also accepts inputs such as "5.0".
        num = int(float(num))

        k = math.floor(math.log(num, 5))
        zeros = 0
        # Iterate one step past k to guard against floating-point
        # inaccuracy in log() underestimating k; the extra term is 0
        # whenever 5**i > num, so the result is unchanged.
        for i in range(1, k + 2):
            zeros = zeros + math.floor(num/math.pow(5, i))
        return zeros
    else:
        print("Factorial is only defined for positive integers")


if __name__ == "__main__":
    fact_num = input(
        "Enter the number whose factorial's trailing zeros you want to find: "
    )
    num_zeros = trailing_zeros(fact_num)
    print("Number of trailing zeros: {0}".format(num_zeros))

When we run this program using Python 3, it will ask for the number whose factorial's number of trailing zeros we want to find and then print it out, like so:

Enter the number whose factorial's trailing zeros you want to find: 5
Number of trailing zeros: 1

If you enter a number which is not a positive integer, you will get an error message:

Enter the number whose factorial's trailing zeros you want to find: 5.1
Factorial is only defined for positive integers
Number of trailing zeros: None

Some key standard library functions we use in the above program are:

  • math.floor: This function is used to find the floor of a number
  • math.log: This function is used to find the logarithm of a number for a specified base (defaulting to e, the natural logarithm, when no base is given)
  • math.pow: This function is used to find out the power of a number raised to another

The above functions are defined in the math module.

Besides the above, we use the is_integer() method defined on floating point objects to check if the floating point value is actually an integer.

The latest version of the code is available here.

Codementor: Build a Rest API with Python and Django - The easiest way

How you can build a REST API with Python

Catalin George Festila: Python 3.7.5 : Testing the Falcon framework - part 001.

I start the new year with this Python framework named Falcon. Falcon is a low-level, high-performance Python framework for building HTTP APIs, app backends, and higher-level frameworks. The main reason I chose it was the speed of this framework; see this article about the Falcon benchmark. You can see it is faster than Flask and Django. The installation is easy with the pip tool; you can also read the
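
To give part 001 a concrete starting point, a minimal Falcon application (a sketch based on Falcon's documented API at the time, not code from the original post) looks like this:

import falcon

class HelloResource:
    def on_get(self, req, resp):
        # Falcon serializes `media` to JSON by default.
        resp.media = {"message": "Hello, Falcon!"}

api = falcon.API()
api.add_route("/hello", HelloResource())

# Serve with any WSGI server, e.g.: gunicorn module_name:api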