Channel: Planet Python

Ned Batchelder: Coverage 5.0 beta 1


I want to finish coverage.py 5.0. It has some big changes, so I need people to try it and tell me if it’s ready. Please install coverage.py 5.0 beta 1 and try it in your environment.

I especially want to hear from you if you tried the earlier alphas of 5.0. There have been some changes in the SQLite database that were needed to make measurement efficient enough for large test suites, but that hinder ad-hoc querying.

If you haven’t taken a look at coverage.py 5.0 yet, the big change is the addition of “contexts.” These can record not just that a line was executed, but something about why it was executed. Any number of contexts can be recorded for a line. They could be different operating systems, or versions of Python, or the name of the test that was running. I think it could enable some really interesting tooling.

If you are interested in recording test names as contexts, the pytest-cov pytest plugin now has a “--cov-context” option to do just that.

Contexts increase the data requirements, so data storage is now a SQLite file rather than a JSON file. The summary of what’s new in 5.0 is here: Major changes in 5.0.

Please try this. Soon 5.0 will be done, and people will begin installing it unknowingly. I would really like to minimize the turmoil when that happens.


Chris Moffitt: Book Review: Machine Learning Pocket Reference


Introduction

This article is a review of O’Reilly’s Machine Learning Pocket Reference by Matt Harrison. Since Machine Learning can cover a lot of topics, I was very interested to see what content a “Pocket Reference” would contain. Overall, I really enjoyed this book and think it deserves a place on many data science practitioners’ bookshelves. Read on for more details about what is included in this reference and who should consider purchasing it.

Physical Size

I purchased this book from Amazon shortly after it was released. Since I was interested in the content and the price was relatively low for a new O’Reilly book ($24.99), I impulsively purchased it without any research. When it showed up, I laughed a little. I did not realize that the book was as small as it was. Obviously I should not have been surprised. It is a “Pocket Reference” and the product dimensions are listed on the page, but I never put two and two together.

Just for comparison, here’s a picture comparing this book to Chris Albon’s book:

Book Comparison

I bring up the size for two reasons. First, the small size means I would not hesitate to carry it around in my laptop bag. I realize many people like electronic copies, but I like the idea of a paper reference book. From this perspective, the portability is a positive for me; it might not be for you.

The second point is that the small size means there is not a lot of real estate on the pages. For short code snippets, this is not an issue. However, for longer code sections or large visualizations it is not optimal. For example, on page 205 there is a complex decision tree that is rendered really small. There are a handful of other places in the book where the small physical size makes the visuals difficult to see.

However, I don’t view the size as a huge negative. The author graciously includes Jupyter notebooks in his GitHub repo, so it is easy to see the details if you need to. Since most readers will likely buy this without seeing it in person, I wanted to specifically mention this aspect so you can keep it in mind.

Who is this for?

There are many aspects of this book that I really like. One of the decisions that I appreciate is that Matt explicitly narrows down the Machine Learning topics he covers. This book’s subtitle is “Working with Structured Data in Python” which means that there is no discussion of deep learning libraries like TensorFlow or PyTorch nor is there any discussion about Natural Language Processing (NLP). This specific decision is smart because it focuses the content and gives the author the opportunity to go deeper in the topics he does choose to cover.

The other aspect of this book that I enjoy is that the author expects the reader to have basic python familiarity, including a base level understanding of scikit-learn and pandas. Most of the code samples are relatively short and use consistent and idiomatic python. Therefore, anyone who has done a little bit of work in the python data science space should be able to follow along with the examples.

There is no discussion of how to program with python and there is only a very brief intro to using pip or conda to get libraries installed. I appreciate the fact that he does not try to cram in a python introduction and instead focuses on teaching the data science concepts in a crisp and clear manner.

The final point I want to mention is that this is truly a practical guide. There is almost no discussion about the mathematical theory behind the algorithms. In addition, this is not a book solely about scikit-learn. Matt chooses to highlight many libraries that a practitioner would use for real world problems.

Throughout the book, he introduces about 36 different python data science libraries, including familiar ones like seaborn, numpy, pandas, and scikit-learn as well as others like Yellowbrick, mlxtend, pyjanitor, missingno, and many more. In many cases, he shows how to perform similar functions in two different libraries. For example, in Chapter 6 there are examples of similar plots done with both seaborn and Yellowbrick.

Some may think it is not necessary to show more than one way to solve a problem. However, I really enjoyed seeing how to use multiple approaches to solving a problem and the relative merits of the different approaches.

Book Organization

The Machine Learning Pocket Reference contains 19 chapters but is only 295 pages long (excluding indices and intro). For the most part, the chapters are very concise. For instance, chapter 2 is only 1 page and chapter 5 is 2 pages. Most chapters are 8-10 pages of clear code and explanation.

Chapter 3 is a special case in that it is the longest chapter and serves as a road map for the rest of the book. It provides a comprehensive walkthrough of working with the Titanic data set to solve a classification problem. The step-by-step process includes cleaning the data, building features, and normalizing the data, then using this data to build, evaluate, and deploy a machine learning model. The rest of the book breaks down these various steps and goes into more detail on each respective data analysis topic. Here is how the chapters are laid out:

  1. Introduction
  2. Overview of the Machine Learning Process
  3. Classification Walkthrough: Titanic Dataset
  4. Missing Data
  5. Cleaning Data
  6. Exploring
  7. Preprocess Data
  8. Feature Selection
  9. Imbalanced Classes
  10. Classification
  11. Model Selection
  12. Metrics and Classification Evaluation
  13. Explaining Models
  14. Regression
  15. Metrics and Regression Evaluation
  16. Explaining Regression Models
  17. Dimensionality Reduction
  18. Clustering
  19. Pipelines

Chapter 13 is a good illustrative example of the overall approach of the book. The topic of model interpretability is very timely and constantly evolving, with many advancements over the past couple of years. This chapter starts with a short discussion of regression coefficients, then moves on to more recent tools like treeinterpreter, lime, and SHAP. It also includes a discussion about how to use surrogate models in place of models that do not lend themselves to the interpretive approaches shown in the chapter. All of this content is discussed with code examples, output visualizations, and guidance on how to interpret the results.

How to Read

When I received the book, I read through it in a couple of sittings. As I read through it, I pulled out lots of interesting notes and insights. Some of them were related to new libraries and some were clever code snippets for analyzing data. The other benefit of going through cover to cover is that I had a good feel for what was in the book and how to reference it in the future when I find myself trying to solve a data science problem.

The pocket reference nature of this book means that it can be helpful for a quick refresher of a topic that is difficult or new to you. A quick review of the chapter may be enough to get you through the problem. It can also be useful for pointing out some of the challenges and trade-offs with different approaches. Finally, the book can be a good jumping off point for further in-depth research when needed.

Other Thoughts

I did not run much of the code from the book but I did not notice any glaring syntax issues. The code uses modern and idiomatic python, pandas and scikit-learn. As mentioned earlier, there is a brief introduction and some caveats about using pip or conda for installation. There is reference to pandas 0.24 and the new Int64 data type so the book is as up to date as can be expected for a book published in September 2019.

In the interest of full disclosure, I purchased this book on my own and received no compensation for this review. I am an Amazon affiliate, so if you choose to buy this book through a link, I will receive a small commission.

Summary

It is clear that Matt has a strong understanding of practical approaches to using python data science tools to solve real world problems. I can definitely recommend Machine Learning Pocket Reference as a book to have at your side when you are dealing with structured data in python. Thank you to Matt for creating such a useful resource. I have added it to my recommended resources list.

Programiz: How to get current date and time in Python?

In this article, you will learn to get today's date and the current date and time in Python. We will also format the date and time in different formats using the strftime() method.
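
As a quick taste of what the tutorial covers, here is a minimal sketch using the standard datetime module; the format string is just one example of what strftime() accepts:

from datetime import datetime

# Get the current local date and time
now = datetime.now()
print(now)                                # e.g. 2019-11-13 09:30:15.123456

# Today's date only
print(now.date())                         # e.g. 2019-11-13

# Format the timestamp with strftime()
print(now.strftime("%d/%m/%Y %H:%M:%S"))  # e.g. 13/11/2019 09:30:15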

Real Python: Thinking Recursively in Python


In this course, you’ll learn about recursion. Recursion is a powerful tool you can use to solve a problem that can be broken down into smaller variations of itself. You can create very complex recursive algorithms with only a few lines of code.

You’ll cover:

  • What recursion is
  • How to define a recursive function
  • How practical examples of recursive functions work
  • How to maintain state
  • How to optimize recursion
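
As a small taste of the topic, here is a minimal sketch of a recursive function; factorial is a stand-in example and not necessarily one used in the course:

def factorial(n):
    """Compute n! by reducing the problem to a smaller version of itself."""
    if n <= 1:
        return 1                     # base case: stops the recursion
    return n * factorial(n - 1)      # recursive case: a smaller variation of the problem

print(factorial(5))  # 120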

[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

Data School: How to encode categorical features with scikit-learn (video)


In order to include categorical features in your Machine Learning model, you have to encode them numerically using "dummy" or "one-hot" encoding. But how do you do this correctly using scikit-learn?

In this 28-minute video, you'll learn:

  • How to use OneHotEncoder and ColumnTransformer to encode your categorical features and prepare your feature matrix in a single step
  • How to include this step within a Pipeline so that you can cross-validate your model and preprocessing steps simultaneously
  • Why you should use scikit-learn (rather than pandas) for preprocessing your dataset

If you want to follow along with the code, you can download the Jupyter notebook from GitHub.
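
If you just want a feel for the pattern the video walks through, here is a minimal sketch along those lines; the column names, toy data, and model below are placeholders rather than the exact ones used in the lesson:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Placeholder dataset: two categorical features and one numeric feature
X = pd.DataFrame({
    "Embarked": ["S", "C", "S", "Q"],
    "Sex": ["male", "female", "female", "male"],
    "Fare": [7.25, 71.28, 8.05, 8.46],
})
y = [0, 1, 1, 0]

# One-hot encode the categorical columns, pass the numeric column through unchanged
preprocessor = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["Embarked", "Sex"])],
    remainder="passthrough",
)

# Chain preprocessing and the model so they are cross-validated together
pipe = Pipeline([("preprocess", preprocessor), ("model", LogisticRegression())])
print(cross_val_score(pipe, X, y, cv=2).mean())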

Click on a timestamp below to jump to a particular section:

0:22 Why should you use a Pipeline?
2:30 Preview of the lesson
3:35 Loading and preparing a dataset
6:11 Cross-validating a simple model
10:00 Encoding categorical features with OneHotEncoder
15:01 Selecting columns for preprocessing with ColumnTransformer
19:00 Creating a two-step Pipeline
19:54 Cross-validating a Pipeline
21:44 Making predictions on new data
23:43 Recap of the lesson
24:50 Why should you use scikit-learn (rather than pandas) for preprocessing?

Related Resources

P.S. Want to master Machine Learning in Python? Enroll in my online course, Machine Learning with Text in Python!

PyCoder’s Weekly: Issue #394 (Nov. 12, 2019)


#394 – NOVEMBER 12, 2019
View in Browser »



PSF Seeking Developers for Paid Contract Improving Pip

The Python Software Foundation Packaging Working Group is receiving funding to work on the design, implementation, and rollout of pip’s next-generation dependency resolver. Funding has been allocated to secure a senior developer and an intermediate developer, starting in December 2019 or January 2020. RFP open now through November 22.
PYFOUND.BLOGSPOT.COM • Shared by Brian Rutledge

My Python Development Environment, 2020 Edition

The co-creator of Django explains his Python environment: “My setup pieces together pyenv, poetry, and pipx. It’s probably a tad more complex than is ideal for most Python users, but for the things I need, it’s perfect.” Related discussion on Hacker News.
JACOB KAPLAN-MOSS

Collaborative Python and R Notebooks Integrated With SQL. All in One Platform. Free Forever.


Mode Studio combines a SQL editor, Python & R notebooks, and visualization builder in one platform. Connect your data warehouse and analyze with your preferred language. Make custom viz (D3.js, HTML/CSS) or use out-of-the-box charts →
MODE ANALYTICS sponsor

When to Use a List Comprehension in Python

Python list comprehensions make it easy to create lists while performing sophisticated filtering, mapping, and conditional logic on their members. In this tutorial, you’ll learn when to use a list comprehension in Python and how to create them effectively.
REAL PYTHON
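
For instance, a comprehension that filters and maps in a single expression might look like this (a generic illustration, not an example taken from the tutorial):

words = ["python", "list", "comprehension", "syntax"]
# Keep only the longer words and upper-case them in one expression
long_upper = [w.upper() for w in words if len(w) > 5]
print(long_upper)  # ['PYTHON', 'COMPREHENSION', 'SYNTAX']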

“Parsing” in Python

“Don’t be afraid to create new, more specific data types for your specific use cases. It’s okay to represent different data, used for different purposes, with different data structures, and makes later generalization easier!”
ASTHASR.GITHUB.IO

Detecting Natural Disasters With Keras and Deep Learning

In this tutorial, you will learn how to automatically detect natural disasters (earthquakes, floods, wildfires, cyclones/hurricanes) with up to 95% accuracy using Keras, Computer Vision, and Deep Learning.
ADRIAN ROSEBROCK

Python Becomes 2nd Most Popular Language on GitHub

GitHub has published its latest State of the Octoverse report which provides fascinating insights into the development industry.
DEVELOPER-TECH.COM

The Complex Path for a Simple Portable Python Interpreter

“We needed a Python interpreter that can be shipped everywhere. You won’t believe what happened next!”
GLAUBER COSTA

Python Jobs

Senior Python/Django Developer (Eindhoven, Netherlands)

Sendcloud

Django Full Stack Web Developer (Austin, TX, USA)

Zeitcode

Full Stack Developer (Toronto, ON, Canada)

Beanfield Metroconnect

Sr. Security Software Engineer (Remote)

TER Consulting Group

Full Stack Software Developer (Remote)

CyberCoders

Scientific Software Engineer (Remote)

Incendia Partners

More Python Jobs >>>

Articles & Tutorials

Python Lambda Functions Quiz

Python lambdas are little, anonymous functions, subject to a more restrictive but more concise syntax than regular Python functions. Test your understanding on how you can use them better!
REAL PYTHON
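
As a one-line refresher before taking the quiz (a generic illustration, not an actual quiz question):

# A lambda is an anonymous, single-expression function
square = lambda x: x ** 2
print(square(4))                                         # 16
print(sorted(["bb", "a", "ccc"], key=lambda s: len(s)))  # ['a', 'bb', 'ccc']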

Teaching Python Episode 31: Python in the School of 2024

“In this episode, Kelly and Sean discuss plausible trends in machine learning, artificial intelligence, augmented and virtual reality, and data science that we may see in schools by 2024. We focus on 5 areas from microscale in the classroom to macro across the entire educational system.”
TEACHINGPYTHON.FM podcast

Automated Python Code Reviews, Directly From Your Git Workflow


Codacy lets developers spend more time shipping code and less time fixing it. Set custom standards and automatically track quality measures like coverage, duplication, complexity and errors. Integrates with GitHub, GitLab and Bitbucket, and works with 28 different languages. Get started today for free →
CODACY sponsor

How to Handle Coroutines With Asyncio in Python

Learn about coroutines in Python by example. More specifically, you’ll see how to handle coroutines using asyncio.
ERIK MARSJA
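
For context, a coroutine in modern Python is defined with async def and driven by the event loop; a minimal sketch (not taken from the article) might look like this:

import asyncio

async def greet(name, delay):
    # Suspend this coroutine without blocking the event loop
    await asyncio.sleep(delay)
    print(f"Hello, {name}!")

async def main():
    # Run two coroutines concurrently
    await asyncio.gather(greet("world", 1), greet("asyncio", 0.5))

asyncio.run(main())  # asyncio.run() requires Python 3.7+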

Thinking Recursively in Python

In this course, you’ll learn how to work with recursion in your Python programs by mastering concepts such as recursive functions and recursive data structures.
REAL PYTHON video

How to Read Stata Files in Python With Pandas

Learn how to read Stata (.dta) files in Python and how to write a Stata file to CSV and Excel files.
ERIK MARSJA
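
The gist of that workflow is a couple of pandas calls; the file names below are placeholders:

import pandas as pd

# Read a Stata file into a DataFrame (the path is a placeholder)
df = pd.read_stata("survey_data.dta")
print(df.head())

# Write it back out as CSV and Excel
df.to_csv("survey_data.csv", index=False)
df.to_excel("survey_data.xlsx", index=False)  # requires an Excel writer such as openpyxl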

Developing a Single Page App With Flask and Vue.js

A step-by-step walkthrough of how to set up a basic CRUD app with Vue and Flask.
MICHAEL HERMAN

Tornado Framework for the Modern Web

Exploring a Tornado use case in a low-memory environment.
MANADOMA.COM

Measure and Improve Python Code Performance With Blackfire.io

Profile in development, test/staging, and production, with no overhead for end users! Blackfire supports any Python version from 2.7.x and 3.x. Find bottlenecks in wall-time, I/O, CPU, memory, HTTP requests, and SQL queries.
BLACKFIRE sponsor

Projects & Code

Events

Python Atlanta

November 14, 2019
MEETUP.COM

BangPypers

November 16, 2019
MEETUP.COM

PyLadies Dublin

November 21, 2019
PYLADIES.COM

MadPUG

November 21, 2019
MEETUP.COM


Happy Pythoning!
This was PyCoder’s Weekly Issue #394.
View in Browser »


[ Subscribe to 🐍 PyCoder’s Weekly 💌 – Get the best Python news, articles, and tutorials delivered to your inbox once a week >> Click here to learn more ]

Quansight Labs Blog: File management improvements in Spyder4


Version 4.0 of Spyder—a powerful Python IDE designed for scientists, engineers and data analysts—is almost ready! It has been in the making for well over two years, and it contains lots of interesting new features. We will focus on the Files pane in this post, where we've made several improvements to the interface and file management tools.

Simplified interface

In order to simplify the Files pane's interface, the columns corresponding to size and kind are hidden by default. To change which columns are shown, use the top-right pane menu or right-click the header directly.

Pane Menu

Read more… (7 min remaining to read)

Stack Abuse: Advanced OpenGL in Python with PyGame and PyOpenGL


Introduction

Following the previous article, Understanding OpenGL through Python where we've set the foundation for further learning, we can jump into OpenGL using PyGame and PyOpenGL.

PyOpenGL is the standardized library used as a bridge between Python and the OpenGL APIs, and PyGame is a standardized library used for making games in Python. It offers built-in handy graphical and audio libraries and we'll be using it to render the result more easily at the end of the article.

As mentioned in the previous article, OpenGL is very old so you won't find many tutorials online on how to properly use it and understand it because all of the top dogs are already knee-deep in new technologies.

In this article, we'll jump into several fundamental topics you'll need to know.

Initializing a Project Using PyGame

First off, we need to install PyGame and PyOpenGL if you haven't already:

$ python3 -m pip install -U pygame --user
$ python3 -m pip install PyOpenGL PyOpenGL_accelerate

Note: You can find a more detailed installation in the previous OpenGL article.

If you have problems concerning the installation, PyGame's "Getting Started" section might be a good place to visit.

Since there's no point in unloading 3 books worth of graphics theory on you, we'll be using the PyGame library to give us a head start. It will essentially just shorten the process from project initialization to actual modeling and animating.

To start off, we need to import everything necessary from both OpenGL and PyGame:

import pygame as pg
from pygame.locals import *

from OpenGL.GL import *
from OpenGL.GLU import *

Next, we get to the initialization:

pg.init()
display = (1920, 1080)
pg.display.set_mode(display, DOUBLEBUF|OPENGL)

While the initialization is only three lines of code, each deserves at least a simple explanation:

  • pg.init(): Initialization of all the PyGame modules - this function is a godsend
  • display = (1920, 1080): Defining a fixed window size
  • pg.display.set_mode(display, DOUBLEBUF|OPENGL): Here, we specify that we'll be using OpenGL with double buffering

Double buffering means that there are two images at any given time - one that we can see and one that we can transform as we see fit. We get to see the actual change caused by the transformations when the two buffers swap.

Since we have our viewport set up, next we need to specify what we'll be seeing, or rather where the "camera" will be placed, and how far and wide it can see.

This is known as the frustum - which is just a cut off pyramid that visually represents the camera's sight (what it can and can't see).

A frustum is defined by 4 key parameters:

  1. The FOV (Field of View): Angle in degrees
  2. The Aspect Ratio: Defined as the ratio of the width and height
  3. The z coordinate of the near Clipping Plane: The minimum draw distance
  4. The z coordinate of the far Clipping Plane: The maximum draw distance

So, let's go ahead and implement the camera with these parameters in mind. The underlying OpenGL C function has this signature:

void gluPerspective(GLdouble fovy, GLdouble aspect, GLdouble zNear, GLdouble zFar);

And in Python, via PyOpenGL, we call it like this:

gluPerspective(60, (display[0]/display[1]), 0.1, 100.0)

To better understand how a frustum works, here's a reference picture:

frustum view

Near and far planes are used for better performance. Realistically, rendering anything outside our field of vision is a waste of hardware performance that could be used to render something we can actually see.

Everything the player can't see is still kept in memory, even though it isn't rendered on screen. Here's a great video of what rendering only within the frustum looks like.

Drawing Objects

After this setup, I imagine we're asking ourselves the same question:

Well this is all fine and dandy, but how do I make a Super Star Destroyer?

Well... with dots. Every model in OpenGL is stored as a set of vertices and a set of their relations (which vertices are connected). So theoretically, if you knew the position of every single dot used to draw a Super Star Destroyer, you could very well draw one!

There are a few ways we can model objects in OpenGL:

  1. Drawing using vertices, and depending on how OpenGL interprets these vertices, we can draw with:
    • points: as in literal points that are not connected in any way
    • lines: every pair of vertices constructs a connected line
    • triangles: every three vertices make a triangle
    • quadrilateral: every four vertices make a quadrilateral
    • polygon: you get the point
    • many more...
  2. Drawing using the built in shapes and objects that were painstakingly modeled by OpenGL contributors
  3. Importing fully modeled objects

So, to draw a cube for example, we first need to define its vertices:

cubeVertices = ((1,1,1),(1,1,-1),(1,-1,-1),(1,-1,1),(-1,1,1),(-1,-1,-1),(-1,-1,1),(-1, 1,-1))

cube

Then, we need to define how they're all connected. If we want to make a wire cube, we need to define the cube's edges:

cubeEdges = ((0,1),(0,3),(0,4),(1,2),(1,7),(2,5),(2,3),(3,6),(4,6),(4,7),(5,6),(5,7))

This is pretty intuitive - the point 0 has an edge with points 1, 3, and 4; the point 1 has an edge with points 0, 2, and 7; and so on.

And if we want to make a solid cube, then we need to define the cube's quadrilaterals:

cubeQuads = ((0,3,6,4),(2,5,6,3),(1,2,5,7),(1,0,4,7),(7,4,6,5),(2,3,0,1))

This is also intuitive - to make a quadrilateral on the top side of the cube, we'd want to "color" everything in-between the points 0, 3, 6, and 4.

Keep in mind there's an actual reason we label the vertices as indexes of the array they're defined in. This makes writing code that connects them very easy.

The following function is used to draw a wired cube:

def wireCube():
    glBegin(GL_LINES)
    for cubeEdge in cubeEdges:
        for cubeVertex in cubeEdge:
            glVertex3fv(cubeVertices[cubeVertex])
    glEnd()

glBegin() is a function that indicates we'll be defining the vertices of a primitive in the code below. When we're done defining the primitive, we use the function glEnd().

GL_LINES is a macro that indicates we'll be drawing lines.

glVertex3fv() is a function that defines a vertex in space. There are a few versions of this function, so for the sake of clarity let's look at how the names are constructed:

  • glVertex: a function that defines a vertex
  • glVertex3: a function that defines a vertex using 3 coordinates
  • glVertex3f: a function that defines a vertex using 3 coordinates of type GLfloat
  • glVertex3fv: a function that defines a vertex using 3 coordinates of type GLfloat which are put inside a vector (tuple) (the alternative is glVertex3f, which takes the 3 coordinates as separate arguments instead of a vector)

Following similar logic, the following function is used to draw a solid cube:

def solidCube():
    glBegin(GL_QUADS)
    for cubeQuad in cubeQuads:
        for cubeVertex in cubeQuad:
            glVertex3fv(cubeVertices[cubeVertex])
    glEnd()

Iterative Animation

For our program to be "killable" we need to insert the following code snippet:

for event in pg.event.get():
    if event.type == pg.QUIT:
        pg.quit()
        quit()

It's basically just a listener that scrolls through PyGame's events, and if it detects that we clicked the "kill window" button, it quits the application.

We'll cover more of PyGame's events in a future article - this one was introduced right away because it would be quite uncomfortable for you and your users to have to fire up the task manager every time you want to quit the application.

In this example, we'll be using double buffering, which just means that we'll be using two buffers (you can think of them as canvases for drawing) which will swap in fixed intervals and give the illusion of motion.

Knowing this, our code has to have the following pattern:

handleEvents()
glClear(GL_COLOR_BUFFER_BIT|GL_DEPTH_BUFFER_BIT)
doTransformationsAndDrawing()
pg.display.flip()
pg.time.wait(1)

  • glClear: Function that clears the specified buffers (canvases), in this case, the color buffer (which contains color information for drawing the generated objects) and the depth buffer (which stores the in-front-of or in-back-of relations of all the generated objects)
  • pg.display.flip(): Function that updates the window with the active buffer contents
  • pg.time.wait(1): Function that pauses the program for a period of time

glClear has to be used because if we don't use it, we'll just be painting over an already painted canvas (in this case, our screen), and we're going to end up with a mess.

Next, if we want to continuously update our screen, just like an animation, we have to put all our code inside a while loop in which we:

  1. Handle events (in this case, just quitting)
  2. Clear the color and depth buffers so that they can be drawn on again
  3. Transform and draw objects
  4. Update the screen
  5. GOTO 1.

The code ought to look something like this:

while True:
    handleEvents()
    glClear(GL_COLOR_BUFFER_BIT|GL_DEPTH_BUFFER_BIT)
    doTransformationsAndDrawing()
    pg.display.flip()
    pg.time.wait(1)

Utilizing Transformation Matrices

In the previous article, we explained how, in theory, we need to construct a transformation that has a referral point.

OpenGL works the same way, as can be seen in the following code:

glTranslatef(1,1,1)
glRotatef(30,0,0,1)
glTranslatef(-1,-1,-1)

In this example, we did a z-axis rotation in the xy-plane with the center of rotation being (1,1,1) by 30 degrees.

Let's have a little refresher if these terms sound a bit confusing:

  1. z-axis rotation means that we're rotating around the z-axis

    This just means we're approximating a 2D plane within the 3D space; the whole transformation is basically like doing a normal rotation around a referral point in 2D space.

  2. We get the xy-plane by squashing an entire 3D space into a plane that has z=0 (we eliminate the z parameter in every way)
  3. Center of rotation is a vertex around which we will be rotating a given object (the default center of rotation is the origin vertex (0,0,0))

But there's a catch - OpenGL understands the code above by constantly remembering and modifying one global transformation matrix.

So when you write something in OpenGL, what you're saying is:

# This part of the code is not translated
# transformation matrix = E (neutral)
glTranslatef(1,1,1)
# transformation matrix = TxE
# ALL OBJECTS FROM NOW ON ARE TRANSLATED BY (1,1,1)

As you might imagine, this poses a huge problem, because sometimes we want to utilize a transformation on a single object, not on the whole source code. This is a very common reason for bugs in low-level OpenGL.

To combat this problematic feature of OpenGL, we're presented with pushing and popping transformation matrices - glPushMatrix() and glPopMatrix():

# Transformation matrix is T1 before this block of code
glPushMatrix()
glTranslatef(1,0,0)
generateObject() # This object is translated
glPopMatrix()
generateSecondObject() # This object isn't translated

These work on a simple Last-In-First-Out (LIFO) principle. When we wish to apply a transformation, we first duplicate the current transformation matrix and then push the copy on top of the stack of transformation matrices.

In other words, it isolates all the transformations we're performing in this block by creating a local matrix that we can scrap after we're done.

Once the object is translated, we pop the transformation matrix from the stack, leaving the rest of the matrices untouched.

Multiple Transformation Execution

In OpenGL, as previously mentioned, transformations are added to the active transformation matrix that's on top of the stack of transformation matrices.

This means that the transformations are executed in reverse order. For example:

######### First example ##########
glTranslatef(-1,0,0)
glRotatef(30,0,0,1)
drawObject1()
##################################

######## Second Example #########
glRotatef(30,0,0,1)
glTranslatef(-1,0,0)
drawObject2()
#################################

In this example, Object1 is first rotated, then translated, and Object2 is first translated, and then rotated. The last two concepts won't be used in the implementation example, but will be practically used in the next article in the series.

Implementation Example

The code below draws a solid cube on the screen and continuously rotates it by 1 degree around the (1,1,1) vector. And it can be very easily modified to draw a wire cube by swapping out the cubeQuads with the cubeEdges:

import pygame as pg
from pygame.locals import *

from OpenGL.GL import *
from OpenGL.GLU import *

cubeVertices = ((1,1,1),(1,1,-1),(1,-1,-1),(1,-1,1),(-1,1,1),(-1,-1,-1),(-1,-1,1),(-1,1,-1))
cubeEdges = ((0,1),(0,3),(0,4),(1,2),(1,7),(2,5),(2,3),(3,6),(4,6),(4,7),(5,6),(5,7))
cubeQuads = ((0,3,6,4),(2,5,6,3),(1,2,5,7),(1,0,4,7),(7,4,6,5),(2,3,0,1))

def wireCube():
    glBegin(GL_LINES)
    for cubeEdge in cubeEdges:
        for cubeVertex in cubeEdge:
            glVertex3fv(cubeVertices[cubeVertex])
    glEnd()

def solidCube():
    glBegin(GL_QUADS)
    for cubeQuad in cubeQuads:
        for cubeVertex in cubeQuad:
            glVertex3fv(cubeVertices[cubeVertex])
    glEnd()

def main():
    pg.init()
    display = (1680, 1050)
    pg.display.set_mode(display, DOUBLEBUF|OPENGL)

    gluPerspective(45, (display[0]/display[1]), 0.1, 50.0)

    glTranslatef(0.0, 0.0, -5)

    while True:
        for event in pg.event.get():
            if event.type == pg.QUIT:
                pg.quit()
                quit()

        glRotatef(1, 1, 1, 1)
        glClear(GL_COLOR_BUFFER_BIT|GL_DEPTH_BUFFER_BIT)
        solidCube()
        #wireCube()
        pg.display.flip()
        pg.time.wait(10)

if __name__ == "__main__":
    main()

Running this piece of code, a PyGame window will pop up, rendering the rotating cube animation.


Conclusion

There is a lot more to learn about OpenGL - lighting, textures, advanced surface modeling, composite modular animation, and much more.

But fret not, all of this will be explained in the following articles teaching the public about OpenGL the proper way, from the ground up.

And don't worry, in the next article, we'll actually draw something semi-decent.


Sumana Harihareswara - Cogito, Ergo Sumana: My New Title, Improving pip, Availability For Work, And SSL (No, The Other One)

A few professional announcements.

Seeking developers for paid contract on pip; apply by Nov. 22

One is that I helped the Packaging Working Group of the Python Software Foundation get funding for a long-needed improvement to pip. I led the writing of a few proposals -- grantwriting, to oversimplify -- and, starting possibly as soon as next month, contractors will start work. As Dustin Ingram explains:

Big news: the Python Packaging Working Group has secured >$400K in grants from multiple funders (TBA) to improve one of the most fundamental parts of pip: its dependency resolver. https://pyfound.blogspot.com/2019/11/seeking-developers-for-paid-contract.html

The dependency resolver is the algorithm which takes multiple constrained requirements (e.g. "some_package>=1.0,<2.0") and finds a version of all dependencies (and sub-dependencies) which satisfy all the constraints.
https://pip.pypa.io/en/stable/user_guide/#requirements-files

Right now, pip's resolver mostly works for most use cases... However the algorithm it uses is naïve, and isn't always guaranteed to produce an optimal (or correct) result.

.....

These funds will pay multiple developers to work on completing the design, implementation and rollout of this new dependency resolver for pip, finally closing issue #988.

Not only will this give pip a better resolver, but it will "enable us to untangle pip’s internals from the resolver, enabling pip to share code for dependency resolution with other packaging tooling". https://pradyunsg.me/blog/2019/06/23/oss-update-1/

This is great news for pip and Python packaging in general. Huge shout out to @pradyunsg for his existing work on the resolver issue and guidance here, and to @brainwane for all her tireless work acquiring and directing funding for Python projects.

If you or your organization is interested in participating in this project, we've just posted the RFP, which includes instructions for submitting proposals, evaluation criteria and scope of work.
https://github.com/python/request-for/blob/master/2020-pip/RFP.md

If you're interested, please apply by 22 November.

NYU, Secure Systems Lab, and my new title

Working at the new space on NYU Tandon's campus, left to right: Sumana Harihareswara, a volunteer with the PSF's Packaging Working Group, a contracted project manager for the Python Packaging Index, and a visiting scholar in NYU Tandon Professor Justin Cappos's Secure Systems Lab; Stephanie Whited, communications director for the Tor Project and visiting researcher in the Secure Systems Lab; and Santiago Torres, a computer science doctoral candidate working in the Secure Systems Lab.

In further news: New York University's Tandon School of Engineering has now announced that:

...pioneering open-source software nonprofits the Tor Project and Python Software Foundation (PSF) are the newest tenants at 370 Jay Street, a recently renovated addition to the University’s engineering and applied sciences programs in Downtown Brooklyn. NYU Tandon is donating work space to both organizations for their first offices in New York City. ....

Sumana Harihareswara, a volunteer with the PSF’s Packaging Working Group and a contracted project manager for the Python Packaging Index, is among the first to move into the NYU facility. Harihareswara is also a visiting scholar in [Professor Justin] Cappos's Secure Systems Lab.

Yes, I am now a Visiting Scholar at NYU's Secure Systems Lab and I get to use an office with a door, shelves, whiteboards, and so on (per the picture at right). If you contribute to Python packaging/distribution tools and live in/near or sometimes visit New York City, let me know and perhaps we could cowork a bit?

The Secure Systems Lab stewards The Update Framework (TUF) and related projects, and works to improve the security of the software supply chain. The Python Package Index is likely going to implement TUF to add cryptographic signatures to packages on PyPI, and so I've gotten to give TUF's developers some advice to help that work move along. (I won't be the manager on that project but I'll be watching with great interest.)

PSF projects

I'm grateful to get to help connect the Python Software Foundation with more resources and volunteers. Changeset's current and recent projects have mostly been for the PSF. Last month we finished accessibility, security, and internationalization work on PyPI that was funded by the Open Technology Fund, and Changeset's work on communicating about the sunsetting of Python 2.x continues and will go through April 2020.

Availability for one-day engagements in San Francisco in February

But I am interested in taking on new clients for short engagements starting in February 2020. In particular, I will be in the San Francisco Bay Area in mid- to late February. If you're in SF or nearby, I could offer you a one-day engagement doing one of the following:

  • developing a contributor outreach/intake strategy
  • researching potential funders and writing a rough draft of a grant proposal
  • auditing and improving your developer onboarding documents

I'd spend a little time talking with you, then sit in your office and finish the document before leaving that afternoon. (Photo at right provides a sample of how I look while sitting.) Drop me a line for a free initial 30-minute chat and we can talk pricing.

Real Python: Getting Started With Python IDLE


If you’ve recently downloaded Python onto your computer, then you may have noticed a new program on your machine called IDLE. You might be wondering, “What is this program doing on my computer? I didn’t download that!” While you may not have downloaded this program on your own, IDLE comes bundled with every Python installation. It’s there to help you get started with the language right out of the box. In this tutorial, you’ll learn how to work in Python IDLE and a few cool tricks you can use on your Python journey!

In this tutorial, you’ll learn:

  • What Python IDLE is
  • How to interact with Python directly using IDLE
  • How to edit, execute, and debug Python files with IDLE
  • How to customize Python IDLE to your liking

Free Bonus: Click here to get a Python Cheat Sheet and learn the basics of Python 3, like working with data types, dictionaries, lists, and Python functions.

What Is Python IDLE?

Every Python installation comes with an Integrated Development and Learning Environment, which you’ll see shortened to IDLE or even IDE. These are a class of applications that help you write code more efficiently. While there are many IDEs for you to choose from, Python IDLE is very bare-bones, which makes it the perfect tool for a beginning programmer.

Python IDLE comes included in Python installations on Windows and Mac. If you’re a Linux user, then you should be able to find and download Python IDLE using your package manager. Once you’ve installed it, you can then use Python IDLE as an interactive interpreter or as a file editor.

An Interactive Interpreter

The best place to experiment with Python code is in the interactive interpreter, otherwise known as a shell. The shell is a basic Read-Eval-Print Loop (REPL). It reads a Python statement, evaluates the result of that statement, and then prints the result on the screen. Then, it loops back to read the next statement.

The Python shell is an excellent place to experiment with small code snippets. You can access it through the terminal or command line app on your machine. You can simplify your workflow with Python IDLE, which will immediately start a Python shell when you open it.

A File Editor

Every programmer needs to be able to edit and save text files. Python programs are files with the .py extension that contain lines of Python code. Python IDLE gives you the ability to create and edit these files with ease.

Python IDLE also provides several useful features that you’ll see in professional IDEs, like basic syntax highlighting, code completion, and auto-indentation. Professional IDEs are more robust pieces of software and they have a steep learning curve. If you’re just beginning your Python programming journey, then Python IDLE is a great alternative!

How to Use the Python IDLE Shell

The shell is the default mode of operation for Python IDLE. When you click on the icon to open the program, the shell is the first thing that you see:

Blank Python Interpreter in IDLE

This is a blank Python interpreter window. You can use it to start interacting with Python immediately. You can test it out with a short line of code:

Hello World program shown in the IDLE python interpreter

Here, you used print() to output the string "Hello, from IDLE!" to your screen. This is the most basic way to interact with Python IDLE. You type in commands one at a time and Python responds with the result of each command.

Next, take a look at the menu bar. You’ll see a few options for using the shell:

the menu bar for IDLE with the Shell menu brought up showing the options of view last restart, restart shell, and interrupt execution

You can restart the shell from this menu. If you select that option, then you’ll clear the state of the shell. It will act as though you’ve started a fresh instance of Python IDLE. The shell will forget about everything from its previous state:

The result of executing some code in the IDLE shell, and then restarting the shell. The shell no longer knows about anything that happened in its previous state.

In the image above, you first declare a variable, x = 5. When you call print(x), the shell shows the correct output, which is the number 5. However, when you restart the shell and try to call print(x) again, you can see that the shell prints a traceback. This is an error message that says the variable x is not defined. The shell has forgotten about everything that came before it was restarted.
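
In plain text, that shell session looks roughly like this (the exact wording of IDLE's restart banner varies by version):

>>> x = 5
>>> print(x)
5

=============== RESTART: Shell ===============
>>> print(x)
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    print(x)
NameError: name 'x' is not defined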

You can also interrupt the execution of the shell from this menu. This will stop any program or statement that’s running in the shell at the time of interruption. Take a look at what happens when you send a keyboard interrupt to the shell:

Sending a keyboard interrupt with the Interrupt Execution option in the IDLE menu bar will result in something similar to this. The execution of this program was halted when that option was selected.

A KeyboardInterrupt error message is displayed in red text at the bottom of your window. The program received the interrupt and has stopped executing.

How to Work With Python Files

Python IDLE offers a full-fledged file editor, which gives you the ability to write and execute Python programs from within this program. The built-in file editor also includes several features, like code completion and automatic indentation, that will speed up your coding workflow. First, let’s take a look at how to write and execute programs in Python IDLE.

Opening a File

To start a new Python file, select File → New File from the menu bar. This will open a blank file in the editor, like this:

shows the blank file after opening a new file for editing in IDLE

From this window, you can write a brand new Python file. You can also open an existing Python file by selecting File → Open… in the menu bar. This will bring up your operating system’s file browser. Then, you can find the Python file you want to open.

If you’re interested in reading the source code for a Python module, then you can select File → Path Browser. This will let you view the modules that Python IDLE can see. When you double click on one, the file editor will open up and you’ll be able to read it.

The content of this window will be the same as the paths that are returned when you call sys.path. If you know the name of a specific module you want to view, then you can select File → Module Browser and type in the name of the module in the box that appears.

Editing a File

Once you’ve opened a file in Python IDLE, you can then make changes to it. When you’re ready to edit a file, you’ll see something like this:

an opened python file in IDLE containing a single line of code

The contents of your file are displayed in the open window. The bar along the top of the window contains three pieces of important information:

  1. The name of the file that you’re editing
  2. The full path to the folder where you can find this file on your computer
  3. The version of Python that IDLE is using

In the image above, you’re editing the file myFile.py, which is located in the Documents folder. The Python version is 3.7.1, which you can see in parentheses.

There are also two numbers in the bottom right corner of the window:

  1. Ln: shows the line number that your cursor is on.
  2. Col: shows the column number that your cursor is on.

It’s useful to see these numbers so that you can find errors more quickly. They also help you make sure that you’re staying within a certain line width.

There are a few visual cues in this window that will help you remember to save your work. If you look closely, then you’ll see that Python IDLE uses asterisks to let you know that your file has unsaved changes:

shows what an unsaved file looks like in the idle editor

The file name shown in the top of the IDLE window is surrounded by asterisks. This means that there are unsaved changes in your editor. You can save these changes with your system’s standard keyboard shortcut, or you can select File → Save from the menu bar. Make sure that you save your file with the .py extension so that syntax highlighting will be enabled.

Executing a File

When you want to execute a file that you’ve created in IDLE, you should first make sure that it’s saved. Remember, you can see if your file is properly saved by looking for asterisks around the filename at the top of the file editor window. Don’t worry if you forget, though! Python IDLE will remind you to save whenever you attempt to execute an unsaved file.

To execute a file in IDLE, simply press the F5 key on your keyboard. You can also select Run → Run Module from the menu bar. Either option will restart the Python interpreter and then run the code that you’ve written with a fresh interpreter. The process is the same as when you run python3 -i [filename] in your terminal.

When your code is done executing, the interpreter will know everything about your code, including any global variables, functions, and classes. This makes Python IDLE a great place to inspect your data if something goes wrong. If you ever need to interrupt the execution of your program, then you can press Ctrl+C in the interpreter that’s running your code.
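
For example, suppose you run a small file like this with F5 (a made-up file for illustration):

# example.py
greeting = "Hello, from IDLE!"

def shout(text):
    return text.upper()

After the run finishes, the shell already knows about greeting and shout, so you can inspect them directly:

>>> greeting
'Hello, from IDLE!'
>>> shout(greeting)
'HELLO, FROM IDLE!'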

How to Improve Your Workflow

Now that you’ve seen how to write, edit, and execute files in Python IDLE, it’s time to speed up your workflow! The Python IDLE editor offers a few features that you’ll see in most professional IDEs to help you code faster. These features include automatic indentation, code completion and call tips, and code context.

Automatic Indentation

IDLE will automatically indent your code when it needs to start a new block. This usually happens after you type a colon (:). When you hit the enter key after the colon, your cursor will automatically move over a certain number of spaces and begin a new code block.

You can configure how many spaces the cursor will move in the settings, but the default is the standard four spaces. The developers of Python agreed on a standard style for well-written Python code, and this includes rules on indentation, whitespace, and more. This standard style was formalized and is now known as PEP 8. To learn more about it, check out How to Write Beautiful Python Code With PEP 8.

Code Completion and Call Tips

When you’re writing code for a large project or a complicated problem, you can spend a lot of time just typing out all of the code you need. Code completion helps you save typing time by trying to finish your code for you. Python IDLE has basic code completion functionality. It can only autocomplete the names of functions and classes. To use autocompletion in the editor, just press the tab key after a sequence of text.

Python IDLE will also provide call tips. A call tip is like a hint for a certain part of your code to help you remember what that element needs. After you type the left parenthesis to begin a function call, a call tip will appear if you don’t type anything for a few seconds. For example, if you can’t quite remember how to append to a list, then you can pause after the opening parenthesis to bring up the call tip:

displays a simple call tip for the append method with a python list

The call tip will display as a popup note, reminding you how to append to a list. Call tips like these provide useful information as you’re writing code.

Code Context

The code context functionality is a neat feature of the Python IDLE file editor. It will show you the scope of a function, class, loop, or other construct. This is particularly useful when you’re scrolling through a lengthy file and need to keep track of where you are while reviewing code in the editor.

To turn it on, select Options → Code Context in the menu bar. You’ll see a gray bar appear at the top of the editor window:

shows the code context feature of the idle editor

As you scroll down through your code, the context that contains each line of code will stay inside of this gray bar. This means that the print() functions you see in the image above are a part of a main function. When you reach a line that’s outside the scope of this function, the bar will disappear.

How to Debug in IDLE

A bug is an unexpected problem in your program. They can appear in many forms, and some are more difficult to fix than others. Some bugs are tricky enough that you won’t be able to catch them by just reading through your program. Luckily, Python IDLE provides some basic tools that will help you debug your programs with ease!

Interpreter DEBUG Mode

If you want to run your code with the built-in debugger, then you’ll need to turn this feature on. To do so, select Debug → Debugger from the Python IDLE menu bar. In the interpreter, you should see [DEBUG ON] appear just before the prompt (>>>), which means the interpreter is ready and waiting.

When you execute your Python file, the debugger window will appear:

shows a blank debugger window in python idle

In this window, you can inspect the values of your local and global variables as your code executes. This gives you insight into how your data is being manipulated as your code runs.

You can also click the following buttons to move through your code:

  • Go: Press this to advance execution to the next breakpoint. You’ll learn about these in the next section.
  • Step: Press this to execute the current line and go to the next one.
  • Over: If the current line of code contains a function call, then press this to step over that function. In other words, execute that function and go to the next line, but don’t pause while executing the function (unless there is a breakpoint).
  • Out: If the current line of code is in a function, then press this to step out of this function. In other words, continue the execution of this function until you return from it.

Be careful, because there is no reverse button! You can only step forward in time through your program’s execution.

You’ll also see four checkboxes in the debug window:

  1. Globals: your program’s global information
  2. Locals: your program’s local information during execution
  3. Stack: the functions that run during execution
  4. Source: your file in the IDLE editor

When you select one of these, you’ll see the relevant information in your debug window.

Breakpoints

A breakpoint is a line of code that you’ve identified as a place where the interpreter should pause while running your code. They will only work when DEBUG mode is turned on, so make sure that you’ve done that first.

To set a breakpoint, right-click on the line of code that you wish to pause. This will highlight the line of code in yellow as a visual indication of a set breakpoint. You can set as many breakpoints in your code as you like. To undo a breakpoint, right-click the same line again and select Clear Breakpoint.

Once you’ve set your breakpoints and turned on DEBUG mode, you can run your code as you would normally. The debugger window will pop up, and you can start stepping through your code manually.

Errors and Exceptions

When you see an error reported to you in the interpreter, Python IDLE lets you jump right to the offending file or line from the menu bar. All you have to do is highlight the reported line number or file name with your cursor and select Debug → Go to file/line from the menu bar. This will open up the offending file and take you to the line that contains the error. This feature works regardless of whether or not DEBUG mode is turned on.

Python IDLE also provides a tool called a stack viewer. You can access it under the Debug option in the menu bar. This tool will show you the traceback of an error as it appears on the stack of the last error or exception that Python IDLE encountered while running your code. When an unexpected or interesting error occurs, you might find it helpful to take a look at the stack. Otherwise, this feature can be difficult to parse and likely won’t be useful to you unless you’re writing very complicated code.

How to Customize Python IDLE

There are many ways that you can give Python IDLE a visual style that suits you. The default look and feel is based on the colors in the Python logo. If you don’t like how anything looks, then you can almost always change it.

To access the customization window, select Options → Configure IDLE from the menu bar. To preview the result of a change you want to make, press Apply. When you’re done customizing Python IDLE, press OK to save all of your changes. If you don’t want to save your changes, then simply press Cancel.

There are 5 areas of Python IDLE that you can customize:

  1. Fonts/Tabs
  2. Highlights
  3. Keys
  4. General
  5. Extensions

Let’s take a look at each of them now.

Fonts/Tabs

The first tab allows you to change things like font color, font size, and font style. You can change the font to almost any style you like, depending on what’s available for your operating system. The font settings window looks like this:

the font settings window of the idle customization pane

You can use the scrolling window to select which font you prefer. (I recommend you select a fixed-width font like Courier New.) Pick a font size that’s large enough for you to see well. You can also click the checkbox next to Bold to toggle whether or not all text appears in bold.

This window will also let you change how many spaces are used for each indentation level. By default, this will be set to the PEP 8 standard of four spaces. You can change this to make the width of your code more or less spread out to your liking.

Highlights

The second customization tab will let you change highlights. Syntax highlighting is an important feature of any IDE that highlights the syntax of the language that you’re working in. This helps you visually distinguish between the different Python constructs and the data used in your code.

Python IDLE allows you to fully customize the appearance of your Python code. It comes pre-installed with three different highlight themes:

  1. IDLE Day
  2. IDLE Night
  3. IDLE New

You can select from these pre-installed themes or create your own custom theme right in this window:

shows the syntax highlighting customization pane

Unfortunately, IDLE does not allow you to install custom themes from a file. You have to create a custom theme from this window. To do so, you can simply start changing the colors for different items. Select an item, and then press Choose color for. You’ll be brought to a color picker, where you can select the exact color that you want to use.

You’ll then be prompted to save this theme as a new custom theme, and you can enter a name of your choosing. You can then continue changing the colors of different items if you’d like. Remember to press Apply to see your changes in action!

Keys

The third customization tab lets you map different key presses to actions, also known as keyboard shortcuts. These are a vital component of your productivity whenever you use an IDE. You can either come up with your own keyboard shortcuts, or you can use the ones that come with IDLE. The pre-installed shortcuts are a good place to start:

python idle settings keyboard shortcut customization pane

The keyboard shortcuts are listed in alphabetical order by action. They’re listed in the format Action - Shortcut, where Action is what will happen when you press the key combination in Shortcut. If you want to use a built-in key set, then select a mapping that matches your operating system. Pay close attention to the different keys and make sure your keyboard has them!

Creating Your Own Shortcuts

The customization of the keyboard shortcuts is very similar to the customization of syntax highlighting colors. Unfortunately, IDLE does not allow you to install custom keyboard shortcuts from a file. You must create a custom set of shortcuts from the Keys tab.

Select one pair from the list and press Get New Keys for Selection. A new window will pop up:

idle settings new keys popup window

Here, you can use the checkboxes and scrolling menu to select the combination of keys that you want to use for this shortcut. You can select Advanced Key Binding Entry >> to manually type in a command. Note that this cannot pick up the keys you press. You have to literally type in the command as you see it displayed to you in the list of shortcuts.

General

The fourth tab of the customization window is a place for small, general changes. The general settings tab looks like this:

shows the general settings available for idle

Here, you can customize things like the window size and whether the shell or the file editor opens first when you start Python IDLE. Most of the things in this window are not that exciting to change, so you probably won’t need to fiddle with them much.

Extensions

The fifth tab of the customization window lets you add extensions to Python IDLE. Extensions allow you to add new, awesome features to the editor and the interpreter window. You can download them from the internet and install them right into Python IDLE.

To view what extensions are installed, select Options → Configure IDLE → Extensions. There are many extensions available on the internet for you to read more about. Find the ones you like and add them to Python IDLE!

Conclusion

In this tutorial, you’ve learned all the basics of using IDLE to write Python programs. You know what Python IDLE is and how you can use it to interact with Python directly. You’ve also learned how to work with Python files and customize Python IDLE to your liking.

You’ve learned how to:

  • Work with the Python IDLE shell
  • Use Python IDLE as a file editor
  • Improve your workflow with features to help you code faster
  • Debug your code and view errors and exceptions
  • Customize Python IDLE to your liking

Now you’re armed with a new tool that will let you productively write Pythonic code and save you countless hours down the road. Happy programming!



Artem Rys: 5 Scraping Tips

Not Invented Here: RelStorage 3.0


We're happy to announce the release of RelStorage 3.0, the relational storage engine for ZODB. Compared to RelStorage 2, highlights include a 30% reduction in memory usage, and up to 98% faster performance! (Ok, yes, that's from one specific benchmark and not everything is 98% faster, but improved performance was a major goal.)

RelStorage 3.0 is a major release of RelStorage with a focus on performance and scalability. It's the result of a concentrated development effort spanning six months, with each pre-release being in production usage with large databases.

Read on to find out what's new.

Overview

Please note that this document is only an overview. For details on the extensive changes between the previous release, RelStorage 2.1, and this one, please view the detailed changelog.

If you're not familiar with ZODB (the native Python object database) and how it uses pluggable storage engines like RelStorage, please take a moment to review this introduction.

This document will cover a few important things to know about RelStorage 3.0, and then go over some of the most important changes in it. Next, we'll show how those changes affect performance. Finally, we'll wrap it up with a whirlwind tour of some of the minor changes.

Backwards Incompatible Changes

Before we get to the good stuff, it's important to highlight the small number of backwards incompatible changes and other things to be aware of when migrating from RelStorage 2 to RelStorage 3.

Schema Changes

In history preserving schemas, the empty column of the transaction table has been renamed to is_empty. In MySQL 8.0.4, empty became a reserved word. The table is altered automatically when first opened with RelStorage 3.0. This makes the schema incompatible with opening under RelStorage 2. [1]

A new table, used during the commit process, is also automatically added when first opened with RelStorage 3.

Under MySQL, any remaining tables that were using the MyISAM engine are converted to InnoDB when the schema is first opened. The only tables remaining that were MyISAM were the pack tables and the new_oid table, all of which should ordinarily be empty, so this conversion shouldn't take long.

Option Changes

The shared-blob-dir default has changed from true to false. If you were using a shared blob-dir, meaning that blobs were only stored on the filesystem, you'll need to explicitly set this option to true. The previous default could easily lead to accidental data loss, and there is now a performance penalty for a true value. See blob_cache for more information.

The previously deprecated option poll-interval has been removed.

Several of the cache persistence options are now deprecated and ignored. They'll generate a warning on startup if found in the configuration.

Concurrent Deployment With RelStorage 2 Is Not Possible

Caution!

It is not possible for RelStorage 3 to write to a database at the same time that RelStorage 2 is writing to it.

The specifics around locking have changed entirely, and are not compatible between the two versions. If RelStorage 3 and RelStorage 2 are both writing to a database, corruption is the very likely result. For this reason, shutting down all RelStorage 2 instances, or at least placing them into read-only mode, is required.

RelStorage does not take specific steps to prevent this. It is up to you to ensure any RelStorage 2 instances are shutdown or at least read-only before deploying RelStorage 3.

Major Changes

Benchmarking Notes

The benchmark data was collected with zodbshootout 0.8. Recent revisions of zodbshootout have adopted pyperf as the underlying benchmark engine. This helps ensure much more stable, consistent results. It also allows collecting a richer set of data for later analysis. The data was passed through the seaborn statistical visualization library, which is built using pandas, numpy and matplotlib, to produce the plots shown here.

For the comparisons between RelStorage 3 and RelStorage 2, the RDBMS servers (MySQL 8 and PostgreSQL 11) were run on one computer running Gentoo Linux using SSD storage. The RelStorage client was run on a different computer, and the two were connected with gigabit ethernet. The machines were otherwise idle.

The comparisons used Python 2.7.16 (because 2.7 was the only version of Python that had a native code gevent driver compatible with RelStorage 2). The database drivers were mysqlclient 1.4.4 and psycopg2 2.8.4. The gevent driver used this version of ultramysql, umysqldb 1.0.4.dev2, PyMySQL 0.9.3 and gevent 1.5a2.

Why transactions of 1, 5 and 20 objects? A review of a database containing 60 million objects showed the average transaction involved 2.6 objects with a standard deviation of 7.4. A database with 30 million objects had an average transaction of 6.1 objects and a standard deviation of 13.

In all the examples that follow, results for PostgreSQL are green while those for MySQL are in blue. The darker shades are RelStorage 3, while the lighter shades are RelStorage 2. The y axis is time, and shorter bars are better (no units or tickmarks are shown because the scale differs between graphs and we're generally focused on deltas not absolute values); the black line that appears in the middle of some bars is the confidence interval.

Most examples show a cross section of RDBMS server by concurrency kind (threads or process) by concurrency level (1, 5, or 20 concurrent threads or processes) by object count (1, 5 or 20 objects).


To accomplish its goals of improving performance (especially under high loads distributed across many processes on many machines), reducing memory usage, and being usable in small environments without access to a full RDBMS server (such as containers or test environments), RelStorage features several major internal changes.

Pickle Cache

The shared in-memory pickle cache has been redesigned to be precise and MVCC based; it no longer uses the old checkpoint system. This means that old revisions of objects can proactively be removed from the cache when they are no longer needed. Together, this means that connections within a process are able to share polling information, with the upshot being that there are no longer large, stop-the-world poll queries in order to rebuild checkpoints. Individual poll queries are usually smaller too.

It has also been changed to have substantially less overhead for each cached object value. Previously, it would take almost 400 bytes of overhead to store one cache entry. In examining a database of 60 million objects, it turned out that the average object size was only a little over 200 bytes. Using 400 bytes to store 200 bytes was embarrassing, and because the cache limit computations didn't take the overhead into account it meant that if you configured an in-memory cache size of 200MB, the cache could actually occupy up to 600MB.

Now, storing a cached value needs only a little more than 100 bytes, and the exact amount of overhead is included when enforcing the cache limit (so a 200MB limit means 200MB of memory usage). The Python cache implementation, using a CFFI-based segmented LRU and a Python lookup dictionary, was replaced with a minimal Cython extension using a C++ boost.intrusive list and map. This also essentially eliminates the cache's impact on the Python garbage collector, which should improve garbage collection times [2].

In concrete terms, one set of production processes that required almost 19GB of memory now requires only about 12GB: a 36% reduction.

Persistent Cache

Along with rearchitecting the in-memory cache, the on-disk persistent cache has been rebuilt on top of SQLite. Its hit rate is much improved and can easily reach 100% if nothing in the database changed. If you haven't deployed the persistent cache before, now would be a great time to give it a try.

If you had used the persistent cache in the past, the new cache should just work. Old cache files will be ignored and you might want to manually remove any that exist to reclaim disk space.

ZODB 5 Parallel Commit

RelStorage 3 now requires ZODB 5, and implements ZODB 5's parallel commit feature. During most of the ZODB commit process, including conflict resolution, only objects being modified are exclusively locked. Objects that were provided to Connection.readCurrent() are locked only in share mode so they may be locked that way by several transactions at the same time. (This fact is particularly important because BTrees call readCurrent() for every node traversed while searching for the correct leaf node to add/remove/update, meaning there can be a surprising amount of contention.)

Only at the very end of the ZODB commit process when it is time to commit to the database server is a database-wide lock taken while the transaction ID is allocated. This should be a very brief time, so transactions that are operating on distinct sets of objects can continue concurrently for much longer (especially if conflicts occur that must be resolved; one thread resolving conflicts no longer prevents other threads from resolving non-overlapping conflicts).

This works on most databases, but it works best on a database that supports NOWAIT share locks, like MySQL 8 or PostgreSQL. SQLite doesn't support object-level locking or parallel commit. Oracle doesn't support shared object-level locks.
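For context, readCurrent() is something application code (or libraries like BTrees) calls to declare a read dependency on an object it doesn't modify. A minimal sketch of how that looks in application code (the object names and the already-opened db are assumptions for illustration):

import transaction

conn = db.open()          # assumes an already-opened ZODB.DB instance
root = conn.root()

index = root['index']     # hypothetical object we read but don't modify
conn.readCurrent(index)   # only a share lock is needed for this object

root['counter'] = root.get('counter', 0) + 1  # the object we actually modify
transaction.commit()      # only the modified object is exclusively locked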

ZODB 5 Prefetch

RelStorage 3 implements efficient object prefetching through Connection.prefetch(). This is up to 74% faster than reading objects individually on demand.

Prefetching is much faster than reading serially.

RelStorage 2 did not implement prefetch so this benchmark falls back to reading objects individually. RelStorage 3 is able to query the database in a single bulk operation.
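As a rough sketch of what using the prefetch API looks like (the object names and the already-opened db are made up for illustration):

conn = db.open()          # assumes an already-opened ZODB.DB instance
root = conn.root()

needed = [root['users'], root['catalog'], root['settings']]  # hypothetical objects

# One bulk query to the storage instead of one query per object...
conn.prefetch(*needed)

# ...so activating the objects afterwards is served from the cache.
for obj in needed:
    obj._p_activate()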

Support for SQLite

Note

The SQLite support is relatively new and hasn't received much production-level testing.

On some systems, the underlying sqlite3 module may experience crashes when lots of threads are used (even though a ZODB connection and its RelStorage instance and sqlite connection are not thread safe and must only ever be used from a single thread at a time, the sequential use from multiple threads can still cause issues).

RelStorage 3 can use a local SQLite3 database file. I'll quote the RelStorage FAQ to explain why:

Why does RelStorage support a SQLite backend? Doesn't that defeat the point?

The SQLite backend fills a gap between FileStorage and an external RDBMS server.

FileStorage is fast, requires few resources, and has no external dependencies. This makes it well suited to small applications, embedded applications, or applications where resources are constrained or ease of deployment is important (for example, in containers).

However, a FileStorage can only be opened by one process at a time. Within that process, as soon as a thread begins committing, other threads are locked out of committing.

An external RDBMS server (e.g., PostgreSQL) is fast, flexible and provides lots of options for managing backups and replications and performance. It can be used concurrently by many clients on many machines, any number of which can be committing in parallel. But that flexibility comes with a cost: it must be setup and managed. Sometimes running additional processes complicates deployment scenarios or is undesirable (for example, in containers).

A SQLite database combines the low resource usage and deployment simplicity of FileStorage with the ability for many processes to read from and write to the database concurrently. Plus, it's typically faster than ZEO. The tradeoff: all processes using the database must be on a single machine in order to share memory.


I'll leave the rest to the FAQ, with the exception of these two performance graphs showing SQLite slotting in comfortably next to FileStorage and ZEO.

Adding objects in file-based storages.

For small to medium sized write transactions, SQLite can actually outperform FileStorage and ZEO when threads are in use. When separate processes are in use, SQLite always beats ZEO.

Reading objects in file-based storages.

When reading objects, SQLite is always faster than ZEO, but slower than FileStorage.

Performance Improvements

Much effort was spent on improving RelStorage's performance, in terms of overall speed and memory usage as well as concurrency and scalability. Here, let's compare the performance of RelStorage 2.1.1 with RelStorage 3.0 graphically from a speed perspective.

Writing

Note

RelStorage 2 failed to complete a number of the write benchmarks that used 20 objects and 20 concurrent processes or threads: some tasks would fail to obtain the commit lock using the default 10 second timeout. Those tasks were excluded from the results.

RelStorage 3 did not have this problem.

We'll start with writing to the database:

Adding objects got much faster.

When simply adding new objects to the database, PostgreSQL is between 29% and 72% faster. MySQL is essentially within the margin of error for the non-concurrent cases, and up to 79% faster for larger, more concurrent tests.

Updating existing objects:

Updating existing objects got much faster.

When updating objects that already existed in the database, the difference for MySQL ranges from statistically insignificant up to 79% faster. PostgreSQL likewise ranges from statistical insignificance up to 80% faster.

Handling conflicts received extra attention.

Updating existing objects that had conflicts got much faster.

Updating objects that have to resolve conflicts.

Once again, some cases were statistically insignificant for both databases. The cases that were statistically significant show a 15% to 84% improvement for PostgreSQL, with the range being from 11% to 89% for MySQL. Not only is it faster; going by the much tighter confidence intervals, it's also less volatile and more consistent.

Reading

RelStorage, like ZEO, includes a secondary pickle cache that's shared amongst all the connections in a process. Here's what it looks like to read data out of that cache (not hitting the database at all).

Reading objects from RelStorage's cache got much faster.

Reading from the pickle cache, by concurrency kind and object count.

Ideally the bars for both databases of a particular RelStorage release would be equal heights in any given test case because this test is database independent. That's not quite the case. When they're not, they're within each other's confidence intervals.

Because the bottom row of graphs is separate processes, they don't benefit much from the actual sharing of the cache. The improvements there, 30 – 40%, show the effect of the cache changes in isolation. The top row of graphs shows the improvement in the intended use case, when the cache is shared by multiple threads in a process. In that case, the difference can be up to 90%. [5]

Reading directly from the database, on the other hand, is harder to qualify.

Reading directly from the database is mixed.

Oh no! It looks like RelStorage 3 actually got slower when reading directly from the database. That's one of its core tasks. How could that be allowed to happen? (Spoiler alert: it didn't.)

Look closely at the pattern. In the top row, when we're testing with threads, RelStorage 3 is always at least as good as RelStorage 2 and frequently better [4]. It's only the bottom row, dealing with processes, that RelStorage 3 looks bad. But as you move further to the right, where more processes are making more queries to load more objects, the gap begins to close. At 20 objects in 5 processes, the gap is essentially gone. (And then we fall off a cliff at 20 processes querying 20 objects. It's not entirely clear exactly what's going on, but that's more CPUs/hardware threads than either the client machine or server machine has so it's not surprising that efficiency begins to fall.)

It turns out the benchmark includes the cost of opening a ZODB connection. For processes, that's a connection using a whole new ZODB instance, so there will be no prior connections open. But for threads, the ZODB instance is shared, so there will be other connections in the pool.

Working with brand new RelStorage connections got a bit slower in RelStorage 3 compared to RelStorage 2. They use more prepared statements (especially on PostgreSQL), and they use more database session state (especially on MySQL). Performing the first poll of the database state may also be a bit more expensive. So when the connection doesn't get used to do much before being closed, these costs outweigh the other speedups in RelStorage 3. But somewhere between making 5 and 20 queries for objects, the upfront costs are essentially amortized away. As always in ZODB, connection pooling with appropriate pool settings is important.

gevent

In RelStorage 2, gevent was supported on Python 2 and Python 3 for MySQL and PostgreSQL when using a pure-Python database driver (typically PyMySQL for the former and pg8000 for the latter). There was special gevent support for MySQL using a custom database driver, but only on Python 2. This driver took a hybrid approach, providing some C acceleration of low-level operations, but delegating most operations to PyMySQL.

RelStorage 3 supports gevent-aware, fully native drivers, for both PostgreSQL and MySQL on both Python 2 and Python 3. Moreover, the MySQL driver has special support for RelStorage's two-phase commit protocol, essentially boosting the priority of a greenlet that's committing [6]. This avoids situations where a greenlet takes database-wide locks and then yields control to a different greenlet that starves the event loop, leaving the database locked for an unacceptable amount of time and halting the forward progress of other processes.

Do these things make a difference? We can compare the performance of gevent with MySQL to find out.

Adding objects is faster.

Adding objects, by concurrency type.

Adding objects improved across the board for all concurrency types.

Updating objects is faster.

Updating objects, by concurrency type.

Updating objects improved across the board for all concurrency types. Updating conflicting objects shows a similar gain.

Reading from the database is largely unchanged.

Reading objects, by concurrency type.

Reading individual objects, by contrast, shows no distinct trend. I suspect that it's essentially unchanged (the transactional and polling parts around it that were changed are not measured here), but we'd need more samples to be able to properly show that.

Minor Changes

This section documents some of the other changes in RelStorage 3.

Supported Versions

Support for PostgreSQL 12 and MySQL 8 was added, as well as support for Python 3.8.

Support for MySQL 5.6 and PostgreSQL 9.5 was removed, as was support for old versions of ZODB. Also, RelStorage no longer depends on ZEO (so it's theoretically possible that Python 2.7.8 and earlier could run RelStorage, but this isn't tested or recommended).

Most tested database drivers were updated to newer versions, and in some cases the minimum supported versions were updated.

  • mysqlclient must be 1.4, up from 1.3.7.
  • psycopg2 must be 2.8, up from 2.6.1.
  • psycopg2cffi must be 2.8.1, up from 2.7.4.
  • cx_Oracle must be 6.0, up from 5.0.
  • Support was removed for the Python 2-only driver umysqldb.

This table summarizes the support various databases have for RelStorage 3 features.

Supported Features

Feature | PostgreSQL | MySQL | Oracle | SQLite
Parallel commit | Yes | Yes | Yes | No
Shared readCurrent locks | Yes | Yes | No | No
Non-blocking readCurrent locks | Yes | Native on MySQL 8, emulated on MySQL 5.7 | Yes | N/A (there is no distinction in lock type)
Streaming blobs | Yes | No (emulated via chunking) | Yes | No (consider configuring a shared-blob-dir)
Central transaction ID allocation | Yes | Yes | No (could probably be implemented) | N/A (but essentially yes, because it only involves one machine)
Atomic lock and commit without Python involvement | Yes (except with PG8000) | Yes | No (could probably be implemented) | No

The Blob Cache

Using a shared-blob-dir (where all blobs are only stored on a filesystem and never in the database) disables much of the parallel commit features. This is because testing whether we can actually store the blob successfully during the "vote" phase of ZODB's two-phase commit requires knowing the transaction ID, and knowing the transaction ID requires taking the database-wide commit lock. This is much sooner than is otherwise required and the lock is held for much longer (e.g., during conflict resolution).

Increasing popularity, and ever-growing databases, make the implementation of the blob cache all the more important. This release focused on blob cache maintenance, specifically the process whereby the blob-cache-size limit (if any) is enforced.

First, for history free databases, when a new revision of a blob is uploaded to replace an older one, if RelStorage has the old revision cached on disk and can determine that it's not in use, it will be deleted as part of the commit process. This applies whether or not a cache size limit is in place.

If it becomes necessary to prune the blob cache, the process of doing so has been streamlined. It spawns far fewer unnecessary threads than it used to. If the process is using gevent, it uses an actual native thread to do the disk scan and IO instead of a greenlet, which would have blocked the event loop.

Finally, if running the pruning process is still too expensive and the thread interferes with the work of the process, there's a new option that spawns a separate process to do the cleanup. This can also be used manually to perform a cleanup without opening a storage.

Packing and GC

History-preserving databases now support zc.zodbdgc for multi-database garbage collection.

RelStorage's native packing is now safer for concurrent use in history-free databases thanks to correcting several race conditions.

For both types of databases, packing and pre-packing require substantially less memory. Pre-packing a large database was measured to use 9 times less memory on CPython 3 and 15 times less on CPython 2 (from 3GB to 200 MB).

Performance Grab Bag

Here's a miscellaneous selection of interesting changes, mostly performance related.

  • Reduce the number of network communications with the database.

    RelStorage tries harder to avoid talking to the database more times than necessary. Each round-trip introduces extra latency that was measurable, even on fast connections. Also, native database drivers usually release the GIL during a database operation, so there could be extra overhead introduced acquiring it again. And under gevent, making a query yields to the event loop, which is good, but it could be an arbitrary amount of time before the greenlet regains control to process the response. If locks are being held, too many queries could spell disaster.

    This was accomplished in several ways. One way was to move larger sequences of commands into stored procedures (MySQL and PostgreSQL only). For example, previously to finish committing a transaction, RelStorage required 7 database interactions in a history-preserving database: 1 to acquire the lock, 1 to get the previous transaction id, 1 to store transaction metadata, 1 to store objects, 1 to store blobs, 1 to update the current object pointers, and finally one to commit. Now, that's all handled by a single stored procedure using one database operation. The Python process doesn't need to acquire the GIL (or cycle through the event loop) to commit and release locks, that happens immediately on the database server regardless of how responsive the Python process is.

    More careful control of transactions eliminated several superfluous COMMIT or ROLLBACK queries in all databases. Similarly, more careful tracking of allocated object identifiers (_p_oid) eliminated some unnecessary database updates.

    The use of upserts, which eliminate at least one query, was previously limited to a select few places for PostgreSQL and MySQL. That has been extended to more places for all supported databases.

  • Allocate transaction IDs on the database server.

    This was primarily about reducing database communications. However, because transaction IDs are based on the current time, it also has the important side-effect of ensuring that they're more consistently meaningful with only one clock to consider.

    Previously, all transaction IDs could be at least as inaccurate as the least-accurate clock writing to the database (if that clock was in the future).

  • PostgreSQL uses the COPY command to upload data.

    Specifically, it uses the binary format of the bulk-loading COPY command to stream data to the server. This can improve storage times by 20% or so.
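    As an illustration of the general client-side technique (RelStorage itself uses the binary COPY format and its own schema; the table, columns, and DSN below are made up), psycopg2 can stream rows to the server in one operation like this:

    import io
    import psycopg2

    conn = psycopg2.connect("dbname=zodb")  # hypothetical DSN

    # Rows in COPY's text format: tab-separated columns, one row per line.
    rows = io.StringIO("1\t5000\n2\t5001\n")

    with conn, conn.cursor() as cur:
        # Stream everything to the server in a single COPY operation.
        cur.copy_expert("COPY example_objects (zoid, tid) FROM STDIN", rows)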

Conclusion

RelStorage 3 represents a substantial change from RelStorage 2. The pickle cache—both in-memory and on-disk—has been completely rewritten, the locking process has been re-imagined in support of parallel commit, time-sensitive logic moved into stored procedures, and more. Despite that, it should be a drop-in replacement in most situations.

Although RelStorage 3 has been in production usage under heavy load at NextThought for its entire development cycle, and we haven't encountered any problems with it that could lead to data loss, it's still software, and all software has bugs. Please exercise appropriate care when upgrading. Bug reports and pull requests are encouraged and appreciated.

We're very happy with the enhancements, especially around performance, and hope those improvements are applicable to most users. We welcome feedback on whether they are or are not, and also want to hear about where else RelStorage could improve.

Finally, I'd like to say thank you to everyone who has contributed to the development of RelStorage 3, whether through testing pre-releases, filing bug reports, or sharing enhancement ideas and use cases. It's greatly appreciated.

Footnotes

[1]Why not simply "quote" the reserved word? Many of the SQL queries that RelStorage uses are shared among all supported databases. By default, MySQL uses a different, non-standard quoting syntax that wouldn't work with the other databases. That can be changed by altering the SQL Mode, but I was trying to avoid having to do that. In the end it turned out that another change [3] forced the alteration of the mode, so I should have just done that in the first place. But since there are good reasons to prevent RelStorage 2 and 3 from ever trying to use the same database, the incompatibility didn't seem like a big deal.
[2]In CPython, the generational (cyclic) garbage collector uses time proportional to the number of objects in a generation (all objects in a generation are stored in a linked list). The more objects that exist, the longer it takes to perform a collection. In RelStorage 2, storing a value in the cache required creating several objects, and the Python garbage collector would have to examine these. The RelStorage 3 cache does not create objects that the garbage collector needs to traverse. Similar remarks hold for PyPy.
[3]That change was support for SQLite, which requires the entire "transaction" table to be quoted. Amusingly, "transaction" is a word reserved by the SQL standard, while "empty" is not.
[4]With the exception of PostgreSQL using one thread for 20 objects. The large error bar indicates an outlier event. We're working with relatively small sample sizes, so that throws things off.
[5]RelStorage 2 implemented some of its cache functions using native code called by CFFI. When CFFI calls native code, Python's GIL is dropped, allowing other threads to run (as long as they weren't trying to use the cache, which used Python locks for thread safety). In contrast, RelStorage 3 uses a thin layer of cython to call into C++ and does not drop the GIL—it depends on the GIL for thread safety. The substantial speed improvement should outweigh the loss of the tiny window where the GIL was dropped.
[6]

We don't have that much granular control over the PostgreSQL driver (psycopg2). With MySQL (mysqlclient), on a connection-by-connection basis we can control if a particular query is going to yield to gevent or block. But with psycopg2, whether it yields or not is global to the entire process.

The reverse is that psycopg2 actually gives us exactly the same control as a gevent socket (yield each time any read/write would block), whereas in mysqlclient we can only wait for the first packet to arrive/go, after that it blocks for the duration (server-side cursors give us a bit more control, allowing yields between fetching groups of rows).

Not Invented Here: Introduction to ZODB Data Storage


ZODB is a powerful native object database for Python, widely known for its use in the Zope web framework and the Plone content management system. By enabling transparent object graph persistence with no need to predefine schemas, ZODB enables extremely flexible application development. With pluggable storage engines such as FileStorage, ZEO, and RelStorage, it also provides flexible ways to store data.

This post provides an introduction to ZODB, focusing on some of the lower-level mechanics of storing data. This post doesn't discuss persistent objects.

Disclaimer

This was written in support of the RelStorage 3.0 release so it may be biased in that direction. It is not an exhaustive list of all storage options. For example, it doesn't discuss NEO, a distributed, redundant storage. Partial lists of included and non-included storages may be found in the ZODB storage documentation.

What is ZODB?

ZODB[1] is a native object database for Python, enabling transparent object persistence. It provides the illusion of an infinite memory space holding application-defined objects. That memory space is shared between processes running at different times on the same or different machines. Only those objects actually used are brought into physical memory. Think of it as something like operating system paging, but for objects, and distributed across time and space. (Apple's CoreData framework has a similar technique it calls "faulting".)

In addition, ZODB provides a transactional view of these objects with snapshot isolation. Any given connection to the database sees a consistent view of all the objects in the database (whether it reads or writes to any particular object or not) as-of the moment it began. When adding or updating objects, no changes are published and made visible to other connections until the writing connection commits its transaction, at which point either all the changes are made visible or none of them are. Existing connections that continue reading (or even writing!) will still not see those changes; they're "stuck" at the snapshot view of the objects they started with. (The ability for readers to continue to be able to retrieve old data that's been replaced in newer transactions is known as multi-version concurrency control, or MVCC.)

Many connections may be reading and writing to the database at once. ZODB uses optimistic concurrency control. Readers don't block other readers or writers, and writers are allowed to proceed as if they were the only one making changes right up until they commit. Writes are defined to occur in a strict order. If a writer discovers that an earlier transaction had modified objects that it too wants to modify, a conflict occurs. Instead of just rolling back the writing transaction and forcing it to start over, taking the modified object into account, ZODB gives the application the chance to resolve the conflict using a three-way merge between the object as it existed when the transaction began, the object that the connection wants to commit, and the object that was committed by the other writer. Only if it cannot do so is the transaction rolled back.
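To make the idea of transparent persistence concrete, here is a minimal (hypothetical) session using the FileStorage that ships with ZODB; anything reachable from the root mapping is saved when the transaction commits:

import ZODB, ZODB.FileStorage
import transaction

storage = ZODB.FileStorage.FileStorage('data.fs')  # hypothetical file name
db = ZODB.DB(storage)
conn = db.open()
root = conn.root()

root['scores'] = {'alice': 10, 'bob': 7}  # ordinary objects hang off the root...
transaction.commit()                      # ...and are persisted atomically

conn.close()
db.close()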

What is a ZODB storage?

ZODB uses a pluggable storage architecture, allowing different ways to store the objects it manages. Storage engines are responsible for allocating persistent object identifiers (OIDs) for each object ZODB manages, storing object state data [2] when an object is added or changed [3], and later retrieving the data for that particular object given its OID. The storage is also responsible for implementing snapshot isolation, ordering (serializing) writes and assigning incrementing transaction identifiers (TIDs), and detecting and handling conflicting writes.

FileStorage

Out of the box, in addition to a few different transient (in-memory) storage engines, ZODB comes with one persistent (on-disk) storage engine. FileStorage uses a single file to store an append-only transaction log for all the objects in the database. An additional in-memory and on-disk structure is used to record the relationship between objects (OIDs) and the transactions (TIDs) they appear in.

As an append-only file, writing to FileStorage can be quite fast. It requires memory (and extra storage space) proportional to the size of the database to record object positions for fast access. If that extra index data isn't saved to disk, it requires time proportional to the size of the database to scan the file on startup to re-create that index.

Because of its append-only nature, previous versions of objects are still found in the file and can be accessed by providing a proper TID. A FileStorage is thus said to be "history preserving." That's how snapshot isolation is implemented: each connection is explicitly associated with a TID and when it needs to read an object it asks the FileStorage to provide the revision of the object most recently written before that TID [4]. This can also be used like a version control system to view and even recover or undo changes to objects. Periodically, a FileStorage must be "packed" to remove obsolete historical data and prevent the file from growing forever.

FileStorage is widely deployed and has a long history of stability. It can only be used by a single process at a time, however. Within that process, only a single thread can be in the process of committing a transaction at a time (FileStorage uses a database-wide lock to provide serialization).

ZEO

A common method to extend access to a FileStorage to more than one process and/or to more than one machine is to deploy a ZEO[5] server. ZEO uses a client/server architecture. The server process opens one or more storages (in practice, always a FileStorage [6]) and exposes a network API to provide access to this storage. Client processes connect to this server and send it read and write requests. The server mediates access to the underlying storage for the clients.

ZEO inherits many of the strengths and weaknesses of its underlying storage and adds some of its own. For example, clients can be configured with a persistent local cache for cheap access to common objects or even read-only access when the server isn't available. But the central ZEO process has to contend with Python's GIL, which may limit scalability, and it defaults to resolving conflicts by loading application code into the server process, which can complicate deployments due to the need to keep client and server processes all running compatible code.
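Connecting to a ZEO server from a client process looks roughly like this (the address and port are made up):

import ZODB
from ZEO.ClientStorage import ClientStorage

# The ZEO server mediates access to the underlying (File)storage.
storage = ClientStorage(('zeo.example.com', 8100))
db = ZODB.DB(storage)
conn = db.open()
root = conn.root()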

ZRS[7] is a storage wrapper implemented in Python and commonly wrapped around a ZEO storage that provides replication of data.

What is RelStorage?

RelStorage is a ZODB storage engine that's meant to solve many of the same problems as ZEO and ZRS, but taking a different approach with a different set of tradeoffs. RelStorage uses a relational database—MySQL, PostgreSQL, Oracle, or SQLite—to provide the final storage for object state data. It pushes the responsibility for OID allocation, locks, transaction management and snapshot isolation, and replication down to these systems.

The next section is mostly a copy of RelStorage's own description of its features. It makes references to ZEO and FileStorage described above.

Features

  • It is a drop-in replacement for FileStorage and ZEO, with several enhancements:
    • Supports undo, packing, and object history preservation just like FileStorage.
    • RelStorage can be configured not to keep object histories for reduced disk space usage and improved performance.
    • Multiple processes on a single machine can read and write a local ZODB database using SQLite without needing to start and manage another process (i.e., ZEO).
    • Blobs can be stored on a shared filesystem, or (recommended) in the relational database and only cached locally.
    • Multiple threads in the same process share a high-performance in-memory pickle cache to reduce the number of queries to the RDBMS. This is similar to ZEO, and the ZEO cache trace tools are supported.
    • The in-memory pickle cache can be saved to disk and read when a process starts up. This can dramatically speed up site warmup time by eliminating a flood of RDBMS queries. Unlike ZEO, this cache is automatically shared by all processes on the machine (no need to configure separate client identifiers.)
  • Ideal for large, high volume sites.
    • Multiple Python processes on multiple machines can read and write the same ZODB database concurrently. This is similar to ZEO, but RelStorage does not require ZEO.
    • Supports ZODB 5's parallel commit feature: Database writers only block each other when they would conflict (except for a small window at the end of the two-phase commit protocol when the transaction ID is allocated; that still requires a global database lock).
    • According to some tests, RelStorage handles concurrency better than the standard combination of ZEO and FileStorage.
    • Whereas FileStorage takes longer to start as the database grows due to an in-memory index of all objects, RelStorage starts quickly regardless of database size.
    • Capable of failover to replicated SQL databases.
  • Tested integration with gevent for PostgreSQL and MySQL.
  • There is a simple way (zodbconvert) to (incrementally) convert FileStorage to RelStorage and back again. You can also convert a RelStorage instance to a different relational database. This is a general tool that can be used to convert between any two ZODB storage implementations.
  • There is a simple way (zodbpack) to pack databases.
  • Supports zodburi .
  • Free, open source (ZPL 2.1)

MVCC and History Free Storage

One thing in particular I'd like to highlight is that RelStorage can implement snapshot isolation and conflict resolution without preserving history. To do this, it relies on the RDBMS's native implementation of MVCC, the repeatable read isolation level, and the read committed isolation level.

When a transaction begins, a RDBMS transaction is opened on a connection at the repeatable read (or higher) level. This connection is used for loading data from the database. This isolation level causes the RDBMS to establish its own snapshot view of the database as-of that moment of time.

A second connection is used to write data to the database. This connection is in the lower isolation level of simply read committed. This level ensures that each query it makes to the database returns the latest committed data. Objects being written are first placed in a temporary table; they are moved to their final table (overwriting an old revision for history free storages) only after any possible conflicts have been found and resolved.

The difference in the two connections' isolation levels matters specifically because of conflict resolution, as does the use of a temporary table. Recall that resolving conflicts needs three versions of the object: the object that existed when the transaction began (the original object), the object that is currently committed and was changed by someone else (the committed object), and the object that the writer would like to store (the new object). The task of the conflict resolution is to find the delta between the original object and the new object and apply those same changes to the committed object. This produces a new object to store which will become the committed object.

Ignoring caches, the only place that original object can come from is that load connection at repeatable read isolation level. By definition, any fresh connection or transaction that looked at the database now would see the currently committed object (or something even later)—the original object has been overwritten and that change committed, so it's gone. RelStorage relies on the underlying database to keep it visible to the load connection.

Likewise, getting the currently committed object requires a connection that can read the current state of the database. That's where the second connection comes in. It can see the current data in the database.
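A rough conceptual sketch of the two-connection arrangement, using psycopg2 directly (this is not RelStorage's actual code, and the DSN, table, and column names are only illustrative):

import psycopg2

# Load connection: pinned to a snapshot of the database (REPEATABLE READ).
load_conn = psycopg2.connect("dbname=zodb")
load_conn.set_session(isolation_level='REPEATABLE READ')

# Store connection: always sees the latest committed data (READ COMMITTED).
store_conn = psycopg2.connect("dbname=zodb")
store_conn.set_session(isolation_level='READ COMMITTED')

oid = 42  # hypothetical object id

with load_conn.cursor() as cur:
    cur.execute("SELECT state FROM object_state WHERE zoid = %s", (oid,))
    original_state = cur.fetchone()   # the object as of our snapshot

with store_conn.cursor() as cur:
    cur.execute("SELECT state FROM object_state WHERE zoid = %s", (oid,))
    committed_state = cur.fetchone()  # whatever is committed right now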

Q & A

Why two connections? Why not put the data in the temporary table, commit, and begin a new transaction to update the current view of the database?

Because that would lose access to the original object.

Why not preemptively store off all the original objects somewhere (e.g., download them or copy them to a temp table) before committing?

Because ZODB uses an optimistic concurrency model. We assume that conflicts are few and far between. If that's true, that would be doing a bunch of extra work that we don't usually need to do. Remember, there's no way to know if there's going to be a conflict or not without a current view of the database.

Well then, why not just have a single shared connection for the current view of the database and use it to check for conflicts and only then save the original objects that have conflicts?

Because that connection wouldn't know what objects to check for conflicts on. Those objects are already in the database in temporary tables that are connection specific and unreadable to a different connection. We'd have to pass a list of object IDs back to the database, and not all databases support array operations to do that efficiently. Or we'd have to write to a persistent table, which doesn't sound appealing (we'd have to arrange to delete from it too.)

Also, because RDBMS connections aren't thread-safe, that would introduce a per-process lock into the commit process.

Still, perhaps that's worth looking into more.

Couldn't a history-preserving database implement snapshot isolation just like FileStorage and use only one read committed connection?

Quite possibly, yes. That could make for some moderately ugly or inefficient SQL queries though.

SELECT * FROM object_state WHERE zoid = :zoid AND tid <= :tid ORDER BY tid DESC LIMIT 1

Why temp tables? Why not store directly to the final table?

For history free databases, the final table is where we get the data to resolve conflicts, so we can't overwrite it.

For history preserving databases, we don't yet have the necessary transaction ID we need to store to the final table. (The primary key is (OID, TID), and the TID is a foreign key reference to another table as well.)

We could allocate the TID earlier, before storing temporary data, but that defeats much of the benefit of ZODB 5 parallel commit.

We could use a fake TID and update it in-place, but altering primary keys tends to be expensive.

Conflict Resolution

RelStorage supports conflict resolution. Conflict resolution is performed in each individual process in a distributed fashion. There's no central server that has to be updated with application code in order to resolve conflicts. ZEO 5 supports a similar feature.
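The merge hook itself lives on the persistent class. Here is a minimal sketch (a hypothetical counter class, not anything shipped with RelStorage) of how ZODB's _p_resolveConflict hook is typically written:

import persistent

class Counter(persistent.Persistent):
    """A counter whose concurrent increments can be merged."""

    def __init__(self):
        self.value = 0

    def increment(self, amount=1):
        self.value += amount

    def _p_resolveConflict(self, old_state, committed_state, new_state):
        # The three states are the object states described above: the
        # original object, what the other writer committed, and what we
        # tried to commit. Apply our delta on top of the committed state.
        delta = new_state['value'] - old_state['value']
        resolved = dict(committed_state)
        resolved['value'] += delta
        return resolved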

Summary

ZODB is a flexible and powerful object database for Python, supporting transactions, optimistic concurrency, and conflict resolution. It uses a layered architecture with the definition and serialization of individual objects handled by the persistent library, the generic transactional API provided by transaction, and data storage and MVCC semantics provided by the pluggable storage layer.

ZODB comes with a storage implementation using an append-only file, as well as an in-memory dict-based storage plus a change-tracking demo storage. These are all restricted to a single process, but ZEO allows utilizing them from multiple processes.

RelStorage is a storage layer based on a SQL database, intended to be highly scalable.

Updates

  • Add additional links to more resources about included and third-party storages.

Footnotes

[1]ZODB may stand for "Zope Object Database," or it may stand for "Z Object Database."
[2]To the storage engine, the object state data is just an opaque sequence of bytes. In reality, the ZODB Connection uses Python's standard pickle protocol to serialize objects into bytes.
[3]Storages also handle non-object data in the form of BLOBs, each of which is associated with an object and assigned an OID.
[4]Actually, the layer that implements snapshot isolation on top of an arbitrary history preserving storage is found in the core of ZODB. This was one of the major changes in ZODB 5.
[5]Previously "Zope Enterprise Objects".
[6]Though one possibly wrapped in something like zlibstorage to provide compression or cipher.encryptingstorage to provide encryption.
[7]"ZODB Replicated Storage"

Robin Wilson: Easily specifying colours from the default colour cycle in matplotlib


Another quick matplotlib tip today: specifically, how to easily specify colours from the standard matplotlib colour cycle.

A while back, when matplotlib overhauled their themes and colour schemes, they changed the default cycle of colours used for lines in matplotlib. Previously the first line was pure blue (color='b' in matplotlib syntax), then red, then green etc. They, very sensibly, changed this to a far nicer selection of colours.

However, this change made one thing a bit more difficult – as I found recently. I had plotted a couple of simple lines:

import numpy as np
import matplotlib.pyplot as plt

x_values = [0, 1, 2]
line1 = np.array([10, 20, 30])
line2 = line1[::-1]

plt.plot(x_values, line1)
plt.plot(x_values, line2)

which gives

plot of the two lines

I then wanted to plot a shaded area around the second line (the yellow one) – for example, to show the uncertainty in that line.

You can do this with the plt.fill_between function, like this:

plt.fill_between(x_values, line2 - 5, line2 + 5, alpha=0.3, color='y')

This produces a shaded line which extends from 5 below the line to 5 above the line:

plot with a partially-transparent yellow band around the second line

Unfortunately the colours don’t look quite right: the line isn’t the same yellow as the shading, so the partially-transparent yellow background doesn’t quite match it.

I spent a while looking into how to extract the colour of the line so I could use this for the shading, before finding a really easy way to do it. To get the colours in the default colour cycle you can simply use the strings 'C0', 'C1', 'C2' etc. So, in this case just

plt.fill_between(x_values, line2 - 5, line2 + 5, alpha=0.3, color='C1')

The result looks far better now the colours match:

plot with the shaded band matching the line colour

I found out about this from a wonderful graphical matplotlib cheatsheet created by Nicolas Rougier – I’d strongly suggest you check it out, there are all sorts of useful things on there that I never knew about!

Just in case you need to do this the manual way, then there are two fairly straightforward ways to get the colour of the second line.

The first is to get the default colour cycle from the matplotlib settings, and extract the relevant colour:

cycle_colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

which gives a list of colours like this:

['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', ...]

You can then just use one of these colours in the call to plt.fill_between – for example:

plt.fill_between(x_values, line2 - 5, line2 + 5, alpha=0.3, color=cycle_colors[1])

The other way is to extract the colour of the actual line you plotted, and then use that for the plt.fill_between call:

x_values = [0, 1, 2]
line1 = np.array([10, 20, 30])
line2 = line1[::-1]

plt.plot(x_values, line1)
plotted_line = plt.plot(x_values, line2)
plt.fill_between(x_values, line2 - 5, line2 + 5, alpha=0.3, color=plotted_line[0].get_color())

Here we save the result of the plt.plot call when we plot the second line. This gives us a list of the Line2D objects that were created, and we then extract the first (and only) element and call the get_color() method to extract the colour.

I do freelance work in data science and data visualisation – including using matplotlib. If you’d like to work with me, have a look at my freelance website or email me.

Erik Marsja: Tutorial: How to Read Stata Files in Python with Pandas



In this post, we are going to learn how to read Stata (.dta) files in Python.

As previously described (in the read .sav files in Python post), Python is a general-purpose language that can also be used for data analysis and data visualization. One example of data visualization can be found in this post.

One potential downside, however, is that Python is not really user-friendly for data storage. This has, of course, led to our data often being stored in Excel, SPSS, SAS, or similar formats. See, for instance, the posts about reading .sav and SAS files in Python.

Can I Open a Stata File in Python?

We are soon going to answer, in practice, how to open a Stata file in Python. In Python, there are two useful packages, Pyreadstat and Pandas, that enable us to open .dta files. If we are working with Pandas, the read_stata method will help us import a .dta file into a Pandas dataframe. Furthermore, the package Pyreadstat, which depends on Pandas, will also create a Pandas dataframe from a .dta file.

How to install Pyreadstat:

First, before learning how to read .dta files using Python and Pyreadstat, we need to install it. Like many Python packages, this package can be installed using pip or conda:

  1. Install Pyreadstat using pip:
    Open up the Windows Command Prompt and type pip install pyreadstat
  2. Install using Conda:
    Open up the Anaconda Prompt, and type conda install -c conda-forge pyreadstat

How to Open a Stata file in Python

In this section, we are finally ready to learn how to read a .dta file in Python using the Python packages Pyreadstat and Pandas.

How to Load a Stata File in Python Using Pyreadstat

In this section, we are going to use pyreadstat to import a .dta file into a Pandas dataframe. First, we import pyreadstat:

import pyreadstat

Second, we are ready to import Stata files using the method read_dta. Note that, when we load a file using the Pyreadstat package, it will look for the .dta file in Python’s working directory. In the read Stata files example below, the FifthDaydata.dta is located in a subdirectory (i.e., “SimData”).

dtafile = './SimData/FifthDayData.dta'
df, meta = pyreadstat.read_dta(dtafile)

In the code chunk above, two variables were created: df and meta. If we use the Python function type, we can see that df is a Pandas dataframe:
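For example (the expected output is shown as a comment):

print(type(df))
# <class 'pandas.core.frame.DataFrame'>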

This means that we can use all the available methods for Pandas dataframe objects. In the next line of code, we use the Pandas head method to print the first 5 rows.

df.head()

Learn more about working with Pandas dataframes in the following tutorials:

How to Read a Stata file with Python Using Pandas

In this section, we are going to read the same Stata file into a Pandas dataframe. However, this time we will use the Pandas read_stata method. This has the advantage that we can load the Stata file from a URL.

Before we continue, we need to import Pandas:

import pandas as pd

Now, when we have done that, we can read the .dta file into a Pandas dataframe using the read_stata method. In the read Stata example here, we are importing the same data file as in the previous example.

After we have loaded the Stata file using Python Pandas, we print the last 5 rows of the dataframe with the tail method.

dtafile = './SimData/FifthDayData.dta'

df = pd.read_stata(dtafile)
df.tail()

How to Read .dta Files from URL

In this section, we are going to use Pandas read_stata method, again. However, this time we will read the Stata file from a URL.

url = 'http://www.principlesofeconometrics.com/stata/broiler.dta'

df = pd.read_stata(url)
df.head()

Note, the only thing we changed was we used a URL as input (url) and Pandas read_stata will import the .dta file that the URL is pointing to.

Pandas Scatter Plot

Here, we will create a scatter plot in Python using Pandas scatter method. This is to illustrate how we can work with data imported from .dta files.

df.plot.scatter(x='pchick',
                y='cpi')
Scatter plot in Python

Learn more about data visualization in Python:

How to Read Specific Columns from a Stata file

Both Pyreadstat's read_dta and Pandas' read_stata enable us to read specific columns from a Stata file. Note that read_dta has the argument usecols, and Pandas read_stata the argument columns.

Reading Specific Columns using Pyreadstat

In this Python read dta example, we use the argument usecols, which takes a list as a parameter.

import pyreadstat

dtafile = './SimData/FifthDayData.dta'
df, meta = pyreadstat.read_dta(dtafile,
                               usecols=['index', 'Name', 'ID', 'Gender'])
df.head()
Dataframe created from the Stata file

Reading Specific Columns using Pandas read_stata

Here, we are going to use Pandas read_stata method and the argument columns. This argument, as in the example above, takes a list as input.

import pandas as pd
url = 'http://www.principlesofeconometrics.com/stata/broiler.dta'

df = pd.read_stata(url,
                   columns=['year', 'pchick', 'time', 'meatex'])
df.head()
Dataframe from Pandas read_stata

Note the behavior of Pandas read_stata: in the resulting dataframe, the order of the columns will be the same as in the list we passed in.

How to Save a Stata file

In this section of the Python Stata tutorial, we are going to save the dataframe as a .dta file. This is easily done: we just have to use the write_dta method with pyreadstat, or the dataframe method to_stata in Pandas.

Saving a dataframe as a Stata file using Pyreadstat

In the example below, we use the dataframe we created in the previous section and write it out as a .dta file.

pyreadstat.write_dta(df, 'broilerdata_edited.dta')

Now, between the parentheses is where the important stuff happens. The first argument is our dataframe and the second is the file path. Note, providing only the filename, as in the example above, makes the write_dta method write the Stata file to the current directory.
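For example, to write the Stata file to the SimData subdirectory used earlier instead of the current directory, we could pass a relative path (a minimal sketch; the subdirectory must already exist):

pyreadstat.write_dta(df, './SimData/broilerdata_edited.dta')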

How to Save a dataframe as .dta with Pandas to_stata

In this example, we are going to save the same dataframe using Pandas to_stata:

df.to_stata('broilerdata_edited.dta')

As can be seen above, the dataframe object has the to_stata method. Within the parentheses we put the file path.

Save a CSV file as a Stata File

In this section, we are going to use Pandas read_csv to read a CSV file. After we have imported the CSV into a dataframe, we save it as a .dta file using Pandas to_stata:

df = pd.read_csv('./SimData/FifthDayData.csv')
df.to_stata('./SimData/FifthDayData.dta')

Export an Excel file as a Stata File

In the final example, we are going to use Pandas read_excel to import an .xlsx file and then save this dataframe as a Stata file using Pandas to_stata:

df = pd.read_excel('./SimData/example_concat.xlsx')
df.to_stata('./SimData/example_concat.dta')

Note that in both of the last two examples above we save the data to a folder called SimData. If we want to save the converted files to the current directory, we simply remove the “./SimData/” part of the string.


Note, all the files we have read using read_dta, read_stata, read_csv, and read_excel can be found here and a Jupyter Notebook here. It is, of course, possible to open SPSS and SAS files using Pandas and save them as .dta files as well.
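For example, a minimal sketch of converting SPSS and SAS files to Stata with Pandas might look like this (the file names are hypothetical, and pd.read_spss requires pyreadstat to be installed):

import pandas as pd

# SPSS (.sav) to Stata (.dta)
df_spss = pd.read_spss('./SimData/example.sav')
df_spss.to_stata('./SimData/example_from_spss.dta')

# SAS (.sas7bdat) to Stata (.dta)
df_sas = pd.read_sas('./SimData/example.sas7bdat')
df_sas.to_stata('./SimData/example_from_sas.dta')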

Summary: Read Stata Files using Python

In this post, we have learned how to read Stata files in Python. Furthermore, we have learned how to write Pandas dataframes to Stata files.

The post Tutorial: How to Read Stata Files in Python with Pandas appeared first on Erik Marsja.


Wingware Blog: Navigating Python Code with Wing Pro 7 (part 1 of 3)


Wing Python IDE includes a boatload of features aimed at making it easier to navigate and understand the structure of Python code. Some of these allow for quick navigation between the definition and uses of a symbol. Others provide a convenient index into source code. And still others quickly find and open files or navigate to symbols matching a name fragment.

In this and the next two Wing Tips, we'll take a look at each of these in turn.

Goto Definition

To get from any use of a symbol in Python code to its definition, use Goto Selected Symbol Defn in the Source menu. This jumps to the def, class, or the point at which a variable or attribute was first defined.

Another way to do this is to right-click on the symbol in the editor and select Goto Definition or Goto Definition in Other Split:

/images/blog/code-navigation/goto-definition.gif

The menus also give the key bindings for the commands, or you can bind your own key to the command goto-selected-symbol-defn with the User Interface > Keyboard > Custom Key Bindings preference.

In some cases, jumping to a definition successfully depends on resolving imported modules correctly using the configured Python Path. In most cases you will not need to add to this configuration, but doing so is possible with Project Properties from Wing's Project menu.

Navigation History

For this and all the other code navigation options, the history-back button at the top left of the editor may be used to return to the previous file or focus. Or move forward again in your navigation history with the history-forward button.

Find Uses

In Wing Pro only, Find Points of Use in the Source menu or the editor's right-click context menu finds all points of use of a symbol in Python code:

/images/blog/code-navigation/find-uses.png

This search distinguishes between different but like-named symbols and will cover all project files and other files Wing finds on the configured Python Path. The tool's Options menu provides control over which files are searched and what types of matches are shown.

Search in Files

To find all occurrences of other strings in Python files or in project files of any type, use the Search in Files tool from the Tools menu with Look in set to Project Files and Filter set to the narrowest filter that includes the files that you wish to search:

/images/blog/code-navigation/search-in-files.png

This tool supports text matching, wildcard, and regular expression searching and automatically updates the search results as files change.

Searching on Project Files assumes that you have used Add Existing Directory in the Project menu to add your source code to your project. Typically the project should contain the code you are actively working on. Packages that your code uses can be left out of the project, unless you anticipate often wanting to search them with Search in Files.



That's it for now! We'll be back next week to continue this Wing Tips mini-series on navigating Python code with Wing.

As always, please don't hesitate to email support@wingware.com if you run into problems or have any questions.

Continuum Analytics Blog: Essential Open-Source Library pandas Awarded CZI Grant to Further Development

Talk Python to Me: #238 Collaborative data science with Gigantum

Collaborative data science has a few challenges. First of all, the people you are collaborating with might not be savvy with computer science tooling (for example, git and source control, or Docker and Linux). Second, seeing the work and changes others have made is a challenge too.

Roberto Alsina: Episodio 17: Esto es Personal


Sometimes people ask me for advice about their careers and things like that... I'm probably not the right person to give it.

Logilab: Typing Mercurial with pytype


Following the recent introduction of Python type annotations (aka "type hints") in Mercurial (see, e.g. this changeset by Augie Fackler), I've been playing a bit with this and pytype.

pytype is a static type analyzer for Python code. It compares with the more popular mypy but I don't have enough perspective to make a meaningful comparison at the moment. In this post, I'll illustrate how I worked with pytype to gradually add type hints in a Mercurial module and while doing so, fix bugs!

The module I focused on is mercurial.mail, which contains mail utilities and which I know quite well. Other modules are also being worked on; this one is a good starting point because it has a limited number of "internal" dependencies, which both makes it faster to iterate with pytype and reduces the side effects of other modules not being correctly typed yet.

$ pytype mercurial/mail.py
Computing dependencies
Analyzing 1 sources with 36 local dependencies
ninja: Entering directory `.pytype'
[19/19] check mercurial.mail
Success: no errors found

The good news is that the module apparently already type-checks. Let's go deeper and merge the type annotations generated by pytype:

$ merge-pyi -i mercurial/mail.py out/mercurial/mail.pyi

(In practice, we'd use --as-comments option to write type hints as comments, so that the module is still usable on Python 2.)
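For reference, a type comment keeps the signature valid on Python 2, since it is just a comment at runtime; an annotation like the one we will add below would be written roughly as follows (a sketch, not the exact merge-pyi output):

def codec2iana(cs):
    # type: (bytes) -> bytes
    ...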

Now we have all declarations annotated with types. Typically, we'd get many things like:

def codec2iana(cs) -> Any:
    cs = pycompat.sysbytes(email.charset.Charset(cs).input_charset.lower())
    # "latin1" normalizes to "iso8859-1", standard calls for "iso-8859-1"
    if cs.startswith(b"iso") and not cs.startswith(b"iso-"):
        return b"iso-" + cs[3:]
    return cs

The function signature has been annotated with Any (omitted for parameters, explicit for the return value). This essentially means that type inference failed to find the type of that function. As the correct type is (quite) obvious, let's change that into:

def codec2iana(cs: bytes) -> bytes:
    ...

And re-run pytype:

$ pytype mercurial/mail.py
Computing dependencies
Analyzing 1 sources with 36 local dependencies
ninja: Entering directory `.pytype'
[1/1] check mercurial.mail
FAILED: .pytype/pyi/mercurial/mail.pyi
pytype-single --imports_info .pytype/imports/mercurial.mail.imports --module-name mercurial.mail -V 3.7 -o .pytype/pyi/mercurial/mail.pyi --analyze-annotated --nofail --quick mercurial/mail.py
File "mercurial/mail.py", line 253, in codec2iana: Function Charset.__init__ was called with the wrong arguments [wrong-arg-types]
  Expected: (self, input_charset: str = ...)
  Actually passed: (self, input_charset: bytes)

For more details, see https://google.github.io/pytype/errors.html#wrong-arg-types.
ninja: build stopped: subcommand failed.

Interesting! email.charset.Charset is apparently instantiated with the wrong argument type. While it's not exactly a bug, because Python will handle bytes instead of str well in general, we can again change the signature (and code) to:

def codec2iana(cs: str) -> str:
    cs = email.charset.Charset(cs).input_charset.lower()
    # "latin1" normalizes to "iso8859-1", standard calls for "iso-8859-1"
    if cs.startswith("iso") and not cs.startswith("iso-"):
        return "iso-" + cs[3:]
    return cs

Obviously, this involves a larger refactoring of the client code of this simple function; see the respective changeset for details.
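To illustrate the kind of change this implies at call sites (a hypothetical caller, not the actual Mercurial code; pycompat.sysstr converts bytes to str on Python 3):

# Before: a bytes charset was passed straight through
# cs = codec2iana(charset_bytes)

# After: convert at the boundary, since codec2iana now takes and returns str
cs = codec2iana(pycompat.sysstr(charset_bytes))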

Another example is this function:

def _encode(ui, s, charsets) -> Any:
    '''Returns (converted) string, charset tuple.
    Finds out best charset by cycling through sendcharsets in descending
    order. Tries both encoding and fallbackencoding for input. Only as
    last resort send as is in fake ascii.
    Caveat: Do not use for mail parts containing patches!'''
    sendcharsets = charsets or _charsets(ui)
    if not isinstance(s, bytes):
        # We have unicode data, which we need to try and encode to
        # some reasonable-ish encoding. Try the encodings the user
        # wants, and fall back to garbage-in-ascii.
        for ocs in sendcharsets:
            try:
                return s.encode(pycompat.sysstr(ocs)), ocs
            except UnicodeEncodeError:
                pass
            except LookupError:
                ui.warn(_(b'ignoring invalid sendcharset: %s\n') % ocs)
        else:
            # Everything failed, ascii-armor what we've got and send it.
            return s.encode('ascii', 'backslashreplace')
    # We have a bytes of unknown encoding. We'll try and guess a valid
    # encoding, falling back to pretending we had ascii even though we
    # know that's wrong.
    try:
        s.decode('ascii')
    except UnicodeDecodeError:
        for ics in (encoding.encoding, encoding.fallbackencoding):
            ics = pycompat.sysstr(ics)
            try:
                u = s.decode(ics)
            except UnicodeDecodeError:
                continue
            for ocs in sendcharsets:
                try:
                    return u.encode(pycompat.sysstr(ocs)), ocs
                except UnicodeEncodeError:
                    pass
                except LookupError:
                    ui.warn(_(b'ignoring invalid sendcharset: %s\n') % ocs)
    # if ascii, or all conversion attempts fail, send (broken) ascii
    return s, b'us-ascii'

It is quite clear from the return value (last line) that we can change its type signature to:

def _encode(ui, s: Union[bytes, str], charsets: List[bytes]) -> Tuple[bytes, bytes]:
    ...

And re-run pytype:

$ pytype mercurial/mail.py
Computing dependencies
Analyzing 1 sources with 36 local dependencies
ninja: Entering directory `.pytype'
[1/1] check mercurial.mail
FAILED: .pytype/pyi/mercurial/mail.pyi
pytype-single --imports_info .pytype/imports/mercurial.mail.imports --module-name mercurial.mail -V 3.7 -o .pytype/pyi/mercurial/mail.pyi --analyze-annotated --nofail --quick mercurial/mail.py
File "mercurial/mail.py", line 342, in _encode: bad option in return type [bad-return-type]
  Expected: Tuple[bytes, bytes]
  Actually returned: bytes
[...]

For more details, see https://google.github.io/pytype/errors.html.
ninja: build stopped: subcommand failed.

That's a real bug. Line 342 contains return s.encode('ascii', 'backslashreplace') in the middle of the function. We indeed return a bytes value instead of a Tuple[bytes, bytes] as elsewhere in the function. The following change fixes the bug:

diff --git a/mercurial/mail.py b/mercurial/mail.py
--- a/mercurial/mail.py
+++ b/mercurial/mail.py
@@ -339,7 +339,7 @@ def _encode(ui, s, charsets) -> Any:
                 ui.warn(_(b'ignoring invalid sendcharset: %s\n') % ocs)
         else:
             # Everything failed, ascii-armor what we've got and send it.
-            return s.encode('ascii', 'backslashreplace')
+            return s.encode('ascii', 'backslashreplace'), b'us-ascii'
     # We have a bytes of unknown encoding. We'll try and guess a valid
     # encoding, falling back to pretending we had ascii even though we
     # know that's wrong.

Step by step, by replacing Any with real types, adjusting "wrong" types as in the first example, and fixing bugs, we finally get the whole module (almost) completely annotated.
