I've just published a Jupyter Notebook I used to present an introduction to the excellent Python attrs package at the November 2018 Christchurch Python meetup.
You can find it on Github.
Another alpha of Coverage.py 5.0 is available: 5.0a4. This fixes a few problems with the new SQLite-based storage. Please give it a try, especially to experiment with dynamic contexts.
The challenge with this release was something that started as a seemingly simple fix. Coverage.py tries to emulate how Python runs programs, including how the first element of sys.path is set. A few people run coverage with sys.path fully configured, and coverage’s setting of sys.path[0] was breaking their stuff.
The proposal was simple: delete the one line of code that set sys.path[0]. I tried that, and it seemed to work. Fixed!
Not so fast: the Windows builds failed. This started a multi-week adventure of debugging and refactoring. The Windows builds were failing not because of Windows itself, but because on Windows, I don’t use pytest-xdist, which parallelizes tests into worker processes. With xdist, the tests were all passing. Without xdist, a few sys.path-related tests were failing.
It turns out that xdist manipulates sys.path itself, which was masking the fact that I had removed an important step from coverage.py. First thing to do was to adjust my test code so that even with xdist, my tests didn’t get xdist’s path changes.
Then I had to re-think how to adjust sys.path. That required refactoring how I ran the user’s Python code, so that I could apply the path changes a little earlier than I used to. That made me look at how I was testing that layer of code with mocks, and I changed it from explicit dependency injection to implicit mock patching.
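To illustrate that last change in general terms, here is a generic sketch of the two styles (not coverage.py's actual code, just an illustration of explicit dependency injection versus implicit mock patching):

from unittest import mock


def real_executor(path):
    # Stand-in for the code that actually runs the user's program.
    print("running", path)


# Style 1: explicit dependency injection -- the collaborator is a parameter,
# so a test simply passes a fake in.
def run_file_injected(path, executor=real_executor):
    executor(path)


# Style 2: the function calls the real collaborator directly; a test patches
# it implicitly with mock.patch.
def run_file(path):
    real_executor(path)


def test_run_file_with_patching():
    with mock.patch(__name__ + ".real_executor") as fake:
        run_file("prog.py")
        fake.assert_called_once_with("prog.py")


if __name__ == "__main__":
    test_run_file_with_patching()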
A few more little fixes were needed here and there along the way. All told, the “one line” fix ended up being 14 files changed, 587 insertions, 427 deletions.
Digital books are convenient and useful ways to have easy access to large volumes of information. Unfortunately, keeping track of them all can be difficult as you gain more books from different sources. Keeping your reading device synchronized with the material that you want to read is also challenging. In this episode Kovid Goyal explains how he created the Calibre digital library manager to solve these problems for himself, how it grew to be the most popular application for organizing ebooks, and how it works under the covers. Calibre is an incredibly useful piece of software with a lot of hidden complexity and a great story behind it.
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Paul talks about the beginning years of Python.
Talking about Python's beginnings is also talking about the Python community beginnings.
Yes, it's reminiscing, but it's fun.
Special Guest: Paul Everitt.
Sponsored By:
DigitalOcean: Get started with a free $100 credit toward your first project on DigitalOcean and experience everything the platform has to offer, such as: cloud firewalls, real-time monitoring and alerts, global datacenters, object storage, and the best support anywhere. Claim your credit today at: do.co/testandcode
Support Test and Code: https://www.patreon.com/testpodcast
Links:
Python 1994 Panel Discussion PyCon 2017: https://www.youtube.com/watch?v=iOUrf7WSSLc
"Python 1994", PyBay2017: https://www.youtube.com/watch?v=7NrPCsH0mBU
Spam, Spam, Spam, ...: https://www.youtube.com/watch?v=anwy2MPT5RE
This week we welcome Reimar Bauer (@ReimarBauer) as our PyDev of the Week! Reimar is a core developer of the popular Python wiki package, MoinMoin. He has spoken at PyCon DE, FOSDEM and EuroPython about Python. Let's take a few minutes to get to know him better!
Can you tell us a little about yourself (hobbies, education, etc):
I am a programmer from Jülich, Germany. That's a small town between Aachen and Cologne.
I work at Forschungszentrum Jülich GmbH, where employees do research in the fields of energy and the environment, information, and the brain, with the aim of providing society with options for action facilitating sustainable development.
My work is related to atmospheric science.
In the past years I wrote a few books about data structures and about IDL, one of my first science-related programming languages, developed by Ray Sterner in Boulder, Colorado.
In 2012 I was elected to the board of the German Python Software Verband. I helped to organize several Python events in Germany, several PyConDE conferences, and the famous EuroPython in Berlin. I stepped down from this position and left in 2017 because I needed and wanted to spend more time with my family.
I have been a fellow of the Python Software Foundation since 2013.
I used to play trombone in a big band, but had to take a break because I just couldn't find the time this hobby needs. I enjoy sightseeing, hiking, Sociocracy 3.0, and meeting a lot of different people. And sometimes I play an augmented reality game named Ingress.
In 2019, the 10th PythonCamp in Cologne, Germany, a classic barcamp, will be the next community project Reimar is involved in.
Why did you start using Python?
It was because of the MoinMoin Software. I had an idea. I wanted to have a Gantt chart view in MoinMoin.
I got all the help I needed doing that, learned to use IRC ;) too, and got my start in the Python open source world. In 2003 I was using Python 1.5.2. Later I wrote many plugins for MoinMoin and a description of the API with examples.
What other programming languages do you know and which is your favorite?
I have only one favorite, and that's Python. What my colleagues need for their work also plays a role. I still maintain a large IDL library, but since its development froze around 2013, we have moved on to Python. I also know Fortran, but stopped using it.
What projects are you working on now?
Since 2016 I have been maintaining and developing the Mission Support System. This project is completely open source, Apache 2.0 licensed. This software is used to interactively plan aircraft missions for atmospheric science.
Because of that project I have also become a maintainer of a number of conda-forge feedstocks. And because of my love for pyfilesystem2, I started a Qt5 file picker.
Which Python libraries are your favorite (core or 3rd party)?
Pyfilesystem2:
Work with files and directories in archives, memory, the cloud etc. as easily as your local drive. Write code now, decide later where the data will be stored; unit test without writing real files; upload files to the cloud without learning a new API; sandbox your file writing code; etc.
https://pyfilesystem2.readthedocs.io/en/latest/
requests: HTTP for Humans
http://docs.python-requests.org/en/master/
netcdf4:
NetCDF is a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.
http://unidata.github.io/netcdf4-python/
borgbackup:
Borgbackup is a de-duplicating backup program. Optionally, it supports compression and authenticated encryption.
http://borgbackup.readthedocs.io/en/stable/
Which Python communities do you want to mention?
The German Python Software Verband, which is focused on the PyConDE and EuroSciPy conference series but, similar to the PSF, also supports a lot of smaller communities.
The EuroPython Society, which is focused on the EuroPython conference.
The Python Software Foundation, the organisation which over the years has tied everything together. It organizes PyCon US and also helps smaller communities with grants.
Thanks to all of you for your time, and for doing such a great job!
What made you decide to work on the MoinMoin project?
The community: helpful, warm and welcoming folks. I continued working on it because we were able to use MoinMoin as a project wiki on any of our institute's measurement campaigns. There we have to show model forecast data, manage pages with different access rules, and provide a lot of information to participants. Another use case is creating Sphinx LaTeX documents from reST-formatted wiki pages.
Can you tell us some of the things you learned as a dev on MoinMoin?
I learned about a lot of tools, how to write proper unit tests, and web server options; I learned from bugs, too. I also enjoyed making new contacts in other communities. My time as a mentor for students during Google Summer of Code was great as well. We were at the bleeding edge and solved a lot of browser-based issues, or at least found ways to bypass them.
Would you do anything differently if you were able to start over from scratch?
We started with moin-2, but got stuck. One reason may be that the unique selling point of the wiki itself has meanwhile become widespread: many comparable tools exist, and many developers who just want to stabilize some project are not that interested in this kind of software. There are just too many options for this nowadays. As we have already spent much time replacing a lot of our self-made models and views in moin-2, I would prefer to finish this.
We are currently reuniting devs and building a new core team.
If you would like to join us, feel free to get in contact: https://mail.python.org/mailman/listinfo/moin-devel
Thanks for doing the interview, Reimar!
For the last couple of months, the SecureDrop team has been working on a new set of applications and a system for journalists, based on Qubes OS, with a desktop application written particularly for Qubes. A major portion of the work is on the Qubes OS part, where we are setting up the right templateVMs, AppVMs on top of those templateVMs, the qrexec services, and the right configuration to allow/deny services as required.
The other major piece of work was to develop a proxy service (on top of the Qubes qrexec service) which will allow our desktop application (written in PyQt) to talk to a SecureDrop server. This work finally ends up in two different Debian packages.
The legacy way of building the SecureDrop server side has many steps and also installs wheels into the main Python site-packages, which is something we plan to remove in the future. While discussing this during PyCon this year, Donald Stufft suggested using dh-virtualenv. It allows packaging a virtualenv for the application, along with the actual application code, into a Debian package.
We use pipenv for the development of the projects. pipenv lock -r can create a requirements.txt, but it does not contain any sha256sums. We also wanted to make sure that doing these steps becomes much easier. We have added a makefile target in our new packaging repo, which will first create the standard requirements.txt and then try to find the corresponding binary wheel sha256sums from a list of wheels+sha256sums; before anything else, it verifies that list (signed with developers' gpg keys).
PKG_DIR=~/code/securedrop-proxy make requirements
If it finds any missing wheels (say a new dependency or an updated package version), it informs the developer, who can then use another makefile target to build the new wheels; the new wheels+sources get synced to our simple index hosted on S3. The hashes of the wheels+sources also get signed and committed into the repository. Then the developer retries creating the requirements.txt for the project.
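The actual makefile targets live in our packaging repo, but the hash-pinning idea can be sketched in a few lines of Python; the file names and formats below are illustrative assumptions, not the real SecureDrop tooling.

# Rough sketch: append pip-style --hash entries to a requirements.txt,
# given an already verified list of "sha256sum  wheel-filename" lines.
from pathlib import Path


def load_wheel_hashes(sha256sums_path):
    hashes = {}
    for line in Path(sha256sums_path).read_text().splitlines():
        digest, filename = line.split()
        # e.g. "securedrop_proxy-0.0.1-py3-none-any.whl" -> "securedrop-proxy==0.0.1"
        name, version = filename.split("-")[:2]
        key = "{}=={}".format(name.replace("_", "-").lower(), version)
        hashes.setdefault(key, []).append(digest)
    return hashes


def hashed_requirements(requirements_path, hashes):
    lines = []
    for req in Path(requirements_path).read_text().splitlines():
        req = req.strip()
        if not req or req.startswith("#"):
            continue
        digests = " ".join("--hash=sha256:" + d for d in hashes.get(req.lower(), []))
        lines.append((req + " " + digests).strip())
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    wheel_hashes = load_wheel_hashes("sha256sums.txt")
    print(hashed_requirements("requirements.txt", wheel_hashes))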
We also have makefile targets to build the Debian package. It actually creates a directory structure (only in parts) like rpmbuild does in the home directory, then copies over the source tarball, untars it, copies the debian directory from the packaging repository, and then reverifies each hash in the project requirements file against the current signed (and also verified) list of hashes. If everything looks good, it goes on to build the final Debian package. This happens via the following environment variable, exported in the above-mentioned script.
DH_PIP_EXTRA_ARGS="--no-cache-dir --require-hashes"
Our debian/rules file makes sure that we use our own packaging index for building the Debian package.
#!/usr/bin/make -f
%:
dh $@ --with python-virtualenv --python /usr/bin/python3.5 --setuptools --index-url https://dev-bin.ops.securedrop.org/simple
For example, the following command will build the package securedrop-proxy version 0.0.1.
PKG_PATH=~/code/securedrop-proxy/dist/securedrop-proxy-0.0.1.tar.gz PKG_VERSION=0.0.1 make securedrop-proxy
The following image describes the whole process.
We would love to get your feedback and any suggestions to improve the whole process. Feel free to comment on this post, or create issues in the corresponding GitHub project.
In this article we will create an explosion manager as well as an explosion class to manage the on-scene explosions, but first of all let's watch the raw video below, which shows the explosions on the game scene. http://gamingdirectional.com/wp-content/uploads/2018/11/2018-11-26-at-14-49-02.mp4 Before we create the above classes, we will create one single explosion sprite sheet which contains many...
Mu is not a solo effort. Many folks have contributed to Mu, and I will be eternally grateful for their work. With the spirit of recognising the voluntary contributions of others in mind, I’m going to write about some of our most prodigious programmers. First up is Zander Brown.
Zander, like Top Gear’s the Stig, is a bit of a mystery.
Until recently, nobody had ever met him in real life, and all interactions with Zander were via copious GitHub comments, pull requests and bug fixes.
Some said he was an alien being from another dimension, others claimed him to be a rogue artificial intelligence running rampant in random Git repositories, yet more people whispered in hushed tones that he was merely a pseudonym for Bill Gates, coming out of retirement to contribute to free software projects.
All I can say is that I’m thankful for his considerable contributions to Mu’s code base, eagle-eyed code reviews and seemingly limitless Pythonic knowledge.
Actually, when I met Zander for the first time in July, it turned out he’s a 17 year-old studying for his A-levels (the exams teenagers sit in the UK to help them to get into university). He’s doing A-levels in Maths, Physics and Computer Science. He’s third from the left in the picture below:
I want to bring your attention to how wonderful this is ~ we have several related things going on:
I recently asked Zander how he got into Mu. His reply was interesting in that he typed “micro:bit Python” into a search engine after his school had some of the early versions of the device delivered to teachers.
The head of department got the blue one, the other CS teacher had the green, leaving the yellow spare and the rest, as they say, is history…
He started coding with MSWLogo but has more recently discovered Python and started making contributions to various open source projects:
It’s nice to work on things that help others be that the developer (fix this, add that) or the users (CS teachers!)
Recently I’ve been contributing the GNOME which has the ‘coolness’ that you are in country A working with person in country B to make something that’ll be used in A, B, C & D but also being a tad ‘meta’ when you’re fixing a bug older than you.
Right now, I feel very very old. :-)
How does he find working on Mu?
Great fun, first & only time i’ve met in person with others working on a project and the T-shirt isn’t bad either :-)
Been nice to play on a range of things within the same project from fighting QSS to walking the python syntax tree whilst knowing people are using it in the real world (TAA just switched form GCSE Java to Python so Mu is now used with the lower years)
(When we released Mu 1.0 I made sure all of the major contributors and friends of Mu got t-shirts to celebrate. Thanks to my hugely talented friend Steve Hawkes, Zander got a t-shirt with the “Stig” avatar shown at the top of this post.)
Zander, thank you for your continued contributions and best of luck with your studies (he wants to go on to study Computer Science at university). I have no doubt that wherever you end up studying will soon learn how fortunate they are to have such a talented student as you.
:-)
Today (Monday) is the last day to benefit from my weekend sale, with 40% off of my books and courses! Just enter the coupon code "BF2018" for any product, and take 40% off.
This offer includes my new “Intro Python: Fundamentals” course, containing the 1st day of the 4-day training I give to experienced developers around the world.
Here are the courses I’m currently offering:
Any questions? Just e-mail me at reuven@lerner.co.il. But don’t delay; today’s the last day to take advantage of this sale!
The post Last chance to save 40% on Python and Git courses appeared first on Lerner Consulting Blog.
One of the contracts I've been working on recently is with Gurock, building a test automation system for a PHP application, their test management app TestRail. As well as building the infrastructure for the application testing and the API testing, I've once again been involved in the nitty-gritty of testing a web application with Selenium and all the fun that involves.
And actually it has been fun. We’ve had a bunch of issues to overcome and despite the usual pain and trauma and running round in circles we seem to have overcome most of them and have a test suite that is robust against the three different platforms we’re testing against.
For those who don't know, the Selenium WebDriver interface allows you to connect your Python tests (or just about any language you care to choose) and test your web application as a "black box", interacting with it in the same way as your users do. These are "functional tests", as opposed to "unit tests", that check that the application as a whole meets its specifications and requirements. As with all testing, you can't guarantee that it makes your application bug-free, but you can eliminate whole classes of bugs and guarantee a minimum level of application quality.
This application is written in PHP, but we're using Python, py.test and Selenium to automate the tests, and the front end is built with jQuery. There are various fun aspects of testing this app that we've encountered. A couple of these stem from the fact that, like any modern web application, much of the UI updating is done via AJAX calls. This means that there's no global page state ready event to wait for to know that loading has finished and the page is ready to interact with.
One of the plugins in use is the BlockUI plugin. This puts a semi-opaque overlay over the user interface in the browser whilst asynchronous AJAX requests are being made, to prevent other elements of the user interface being interacted with. As the request is an asynchronous one the browser isn’t blocked so our Selenium tests don’t know that the user interface is blocked and it should wait before attempting any more interactions. This causes tests to fail with the dreaded error:
Exception in thread "main" org.openqa.selenium.WebDriverException: unknown error: Element <input type="button" class="btn btn-default" data-toggle="modal" data-target="#adduser" data-localize="adduser" value="Add user"> is not clickable at point (1397, 97). Other element would receive the click: <div class="blockUI blockOverlay" style="z-index: 1000; border: none; margin: 0px; padding: 0px; width: 100%; height: 100%; top: 0px; left: 0px; background-color: rgb(0, 0, 0); cursor: wait; position: absolute; opacity: 0.304712;"></div>
The dreaded part is specifically: is not clickable at point (1397, 97). Other element would receive the click: <div class="blockUI blockOverlay">. The blockUI element is intercepting the click because the AJAX request has not completed; or, more to the point, a blockUI element is intercepting it. The normal way round this would be to find the blockUI element and wait for it to no longer be displayed. Unfortunately there's more than one of them! So this is the code we came up with to wait until none of them are displayed:
import time

from selenium.common.exceptions import StaleElementReferenceException, TimeoutException
from selenium.webdriver.common.by import By


class GeneralLocators:
    blockUI = (By.CLASS_NAME, "blockUI")
    busy = (By.CLASS_NAME, "busy")


def any_elements_displayed(elements):
    for element in elements:
        try:
            if element.is_displayed():
                return True
        except StaleElementReferenceException:
            pass
    return False


class BasePageElement(object):
    def wait_for_blockui_to_close(self, seconds=5):
        self.driver.implicitly_wait(0)
        try:
            stop = time.time() + seconds
            while time.time() < stop:
                blockUIs = self.driver.find_elements(*GeneralLocators.blockUI)
                if not any_elements_displayed(blockUIs):
                    return
                time.sleep(0.1)
            raise TimeoutException("Timed out waiting for blockUI to go away")
        finally:
            self.driver.implicitly_wait(10)
We have a similar problem with AJAX elements that don't block the page but take several seconds to update, showing a busy indicator whilst they're updating. Again, we need to wait for the busy indicators to complete before we interact with any of the elements. Thankfully that is similarly easy. Note that we set the global implicitly_wait timeout to zero whilst we're checking.
    def wait_until_not_busy(self, seconds=5):
        self.driver.implicitly_wait(0)
        try:
            stop = time.time() + seconds
            while time.time() < stop:
                busy = self.driver.find_elements(*GeneralLocators.busy)
                if not any_elements_displayed(busy):
                    return
                time.sleep(0.1)
            raise TimeoutException("Timed out waiting to not be busy")
        finally:
            self.driver.implicitly_wait(10)
It's well worth noting that with the selenium library in Python, the implicitly_wait value is a global: setting it anywhere sets it for the rest of the session.
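One possible refinement of that pattern (my own suggestion, not something from the test suite described here) is to wrap the juggling in a context manager so the implicit wait is always restored, even if an exception escapes:

from contextlib import contextmanager


@contextmanager
def implicit_wait(driver, seconds, restore=10):
    # Temporarily change the session-wide implicit wait, then put it back.
    driver.implicitly_wait(seconds)
    try:
        yield
    finally:
        driver.implicitly_wait(restore)

The wait_for_blockui_to_close and wait_until_not_busy methods above could then use with implicit_wait(self.driver, 0): around their polling loops instead of repeating the try/finally.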
We put all the element locators into classes, like GeneralLocators, so that as locators change (inevitable in an evolving user interface) there is only one place to change them rather than having them scattered throughout our code.
Here are a few more tricks and tips we've discovered along the way. Whilst text boxes have a nice and straightforward .clear() method to clear existing text in them, this doesn't work with a textarea (which, confusingly enough, has a .clear() method which appears to do nothing). The right way to clear the textarea is to send a CTRL-A followed by a backspace:
# CTRL-A plus BACKSPACE are needed for selenium to clear the textarea as .clear() doesn't work.
self.send_keys_to_element(CustomizationsLocators.add_custom_field_description, Keys.CONTROL + "a")
self.send_keys_to_element(CustomizationsLocators.add_custom_field_description, Keys.BACKSPACE)
If you want to provide a command line option to run the tests with a headless browser, this little function (Firefox only) will do the trick. You could further customize it to switch between browsers:
import pytest
from selenium import webdriver
from selenium.webdriver.firefox.options import Options


def get_driver():
    options = Options()
    if pytest.config.getoption('headless'):
        options.headless = True
    return webdriver.Firefox(options=options)
And these final two are interesting: uploading files with Selenium. Because the file upload dialog is a native dialog it's very hard (impossible, I think) to interact with it from Selenium. However, it does come along with a hidden input field that you can enter file paths into directly. So for a normal file dialog this works fine:
from selenium.webdriver.common.by import By
file_inputs = driver.find_elements(By.CSS_SELECTOR, 'input.dz-hidden-input')
input_element = file_inputs[input_index]
driver.execute_script('arguments[0].style = ""; arguments[0].style.display = "block"; arguments[0].style.visibility = "visible";', input_element)
time.sleep(0.1)
input_element.send_keys(filename)
So long as you know, or work out by trial and error, which file input to send the path to, it will work fine. The useful thing is that it exposes all the hidden file inputs in the user interface so you can see what you're interacting with.
This still unfortunately doesn't work for file uploads via Dropzone, some kind of JavaScript extension. For this you need to base64 encode the file yourself and attach it to the dropzone, made all the more interesting by the fact that the driver.execute_script API will only take a single line of input. Still, as horrible as it is, this works! It takes the base64 encoded version of the file and attaches it to the dropzone element as a blob, with the filename attached as metadata.
import base64
import os


def add_dropzone_attachment(self, locator, attachment_path):
    filename = os.path.basename(attachment_path)
    with open(attachment_path, 'rb') as f:
        content = f.read()
    content = base64.b64encode(content).decode('ascii')
    script = (
        "var myZone, blob, base64Image; myZone = Dropzone.forElement('{}');"
        "base64content = '{}';"
        # JavaScript helper that turns the base64 string into a Blob; the braces
        # are doubled because this string goes through .format().
        "function base64toBlob(r,e,n){{e=e||'',n=n||512;for(var t=atob(r),a=[],o=0;"
        "o<t.length;o+=n){{for(var l=t.slice(o,o+n),h=new Array(l.length),w=0;"
        "w<l.length;w++)h[w]=l.charCodeAt(w);a.push(new Uint8Array(h))}}"
        "var c=new Blob(a,{{type:e}});return c}}"
        "blob = base64toBlob(base64content, 'image/png');"
        "blob.name = '{}';"
        "myZone.addFile(blob);"
    ).format(locator, content, filename)
    self.driver.execute_script(script)
The locator is the locator of the dropzone area itself, usually something like #attachmentDropzone.
Hopefully all this painfully won information proves useful to someone!
When writing code on your own, the only priority is making it work. However, working in a team of professional software developers brings a plethora of challenges. One of those challenges is coordinating many people working on the same code.
How do professional teams make dozens of changes per day while making sure everyone is coordinated and nothing is broken? Enter continuous integration!
In this tutorial you’ll:
Free Bonus: 5 Thoughts On Python Mastery, a free course for Python developers that shows you the roadmap and the mindset you'll need to take your Python skills to the next level.
Continuous integration (CI) is the practice of frequently building and testing each change done to your code automatically and as early as possible. Prolific developer and author Martin Fowler defines CI as follows:
“Continuous Integration is a software development practice where members of a team integrate their work frequently, usually each person integrates at least daily - leading to multiple integrations per day. Each integration is verified by an automated build (including test) to detect integration errors as quickly as possible.” (Source)
Let’s unpack this.
Programming is iterative. The source code lives in a repository that is shared by all members of the team. If you want to work on that product, you must obtain a copy. You will make changes, test them, and integrate them back into the main repo. Rinse and repeat.
Not so long ago, these integrations were big and weeks (or months) apart, causing headaches, wasting time, and losing money. Armed with experience, developers started making minor changes and integrating them more frequently. This reduces the chances of introducing conflicts that you need to resolve later.
After every integration, you need to build the source code. Building means transforming your high-level code into a format your computer knows how to run. Finally, the result is systematically tested to ensure your changes did not introduce errors.
On a personal level, continuous integration is really about how you and your colleagues spend your time.
Using CI, you’ll spend less time:
Conversely, you’ll spend more time:
How does that sound?
On a team level, it allows for a better engineering culture, where you deliver value early and often. Collaboration is encouraged, and bugs are caught much sooner. Continuous integration will:
There are several key ideas and practices that you need to understand to work effectively with continuous integration. Also, there might be some words and phrases you aren’t familiar with but are used often when you’re talking about CI. This chapter will introduce you to these concepts and the jargon that comes with them.
If you are collaborating with others on a single code base, it’s typical to have a shared repository of source code. Every developer working on the project creates a local copy and makes changes. Once they are satisfied with the changes, they merge them back into the central repository.
It has become a standard to use version control systems (VCS) like Git to handle this workflow for you. Teams typically use an external service to host their source code and handle all the moving parts. The most popular are GitHub, BitBucket, and GitLab.
Git allows you to create multiple branches of a repository. Each branch is an independent copy of the source code and can be modified without affecting other branches. This is an essential feature, and most teams have a mainline branch (often called a master branch) that represents the current state of the project.
If you want to add or modify code, you should create a copy of the main branch and work in your new, development branch. Once you are done, merge those changes back into the master branch.
Version control holds more than just code. Documentation and test scripts are usually stored along with the source code. Some programs look for external files used to configure their parameters and initial settings. Other applications need a database schema. All these files should go into your repository.
If you have never used Git or need a refresher, check out our Introduction to Git and GitHub for Python Developers.
As previously mentioned, building your code means taking the raw source code, and everything necessary for its execution, and translating it into a format that computers can run directly. Python is an interpreted language, so its “build” mainly revolves around test execution rather than compilation.
Running those steps manually after every small change is tedious and takes valuable time and attention from the actual problem-solving you’re trying to do. A big part of continuous integration is automating that process and moving it out of sight (and out of mind).
What does that mean for Python? Think about a more complicated piece of code you have written. If you used a library, package, or framework that doesn't come with the Python standard library (think anything you needed to install with pip or conda), Python needs to know about that, so the program knows where to look when it finds commands that it doesn't recognize.
You store a list of those packages in requirements.txt or a Pipfile. These are the dependencies of your code and are necessary for a successful build.
You will often hear the phrase “breaking the build.” When you break the build, it means you introduced a change that rendered the final product unusable. Don’t worry. It happens to everyone, even battle-hardened senior developers. You want to avoid this primarily because it will block everyone else from working.
The whole point of CI is to have everyone working on a known stable base. If they clone a repository that is breaking the build, they will work with a broken version of the code and won’t be able to introduce or test their changes. When you break the build, the top priority is fixing it so everyone can resume work.
When the build is automated, you are encouraged to commit frequently, usually multiple times per day. It allows people to quickly find out about changes and notice if there’s a conflict between two developers. If there are numerous small changes instead of a few massive updates, it’s much easier to locate where the error originated. It will also encourage you to break your work down into smaller chunks, which is easier to track and test.
Since everyone is committing changes multiple times per day, it’s important to know that your change didn’t break anything else in the code or introduce bugs. In many companies, testing is now a responsibility of every developer. If you write code, you should write tests. At a bare minimum, you should cover every new function with a unit test.
Running tests automatically, with every change committed, is a great way to catch bugs. A failing test automatically causes the build to fail. It will draw your attention to the problems revealed by testing, and the failed build will make you fix the bug you introduced. Tests don’t guarantee that your code is free of bugs, but it does guard against a lot of careless changes.
Automating test execution gives you some peace of mind because you know the server will test your code every time you commit, even if you forgot to do it locally.
If something works on your computer, will it work on every computer? Probably not. It’s a cliché excuse and a sort of inside joke among developers to say, “Well, it worked on my machine!” Making the code work locally is not the end of your responsibility.
To tackle this problem, most companies use an external service to handle integration, much like using GitHub for hosting your source code repository. External services have servers where they build code and run tests. They act as monitors for your repository and stop anyone from merging to the master branch if their changes break the build.
There are many such services out there, with various features and pricing. Most have a free tier so that you can experiment with one of your repositories. You will use a service called CircleCI in an example later in the tutorial.
A production environment is where your software will ultimately run. Even after successfully building and testing your application, you can’t be sure that your code will work on the target computer. That’s why teams deploy the final product in an environment that mimics the production environment. Once you are sure everything works, the application is deployed in the production environment.
Note: This step is more relevant to application code than library code. Any Python libraries you write still need to be tested on a build server, to ensure they work in environments different from your local computer.
You will hear people talking about this clone of the production environment using terms like development environment, staging environment, or testing environment. It’s common to use abbreviations like DEV for the development environment and PROD for the production environment.
The development environment should replicate production conditions as closely as possible. This setup is often called DEV/PROD parity. Keep the environment on your local computer as similar as possible to the DEV and PROD environments to minimize anomalies when deploying applications.
We mention this to introduce you to the vocabulary, but continuously deploying software to DEV and PROD is a whole other topic. The process is called, unsurprisingly, continuous deployment (CD). You can find more resources about it in the Next Steps section of this article.
The best way to learn is by doing. You now understand all the essential practices of continuous integration, so it’s time to get your hands dirty and create the whole chain of steps necessary to use CI. This chain is often called a CI pipeline.
This is a hands-on tutorial, so fire up your editor and get ready to work through these steps as you read!
We assume that you know the basics of Python and Git. We will use Github as our hosting service and CircleCI as our external continuous integration service. If you don’t have accounts with these services, go ahead and register. Both of these have free tiers!
Remember, your focus here is adding a new tool to your utility belt, continuous integration. For this example, the Python code itself will be straightforward. You want to spend the bulk of your time internalizing the steps of building a pipeline, instead of writing complicated code.
Imagine your team is working on a simple calculator app. Your task is to write a library of basic mathematical functions: addition, subtraction, multiplication, and division. You don’t care about the actual application, because that’s what your peers will be developing, using functions from your library.
Log in to your GitHub account, create a new repository and call it CalculatorLibrary. Add a README and .gitignore, then clone the repository to your local machine. If you need more help with this process, have a look at GitHub’s walkthrough on creating a new repository.
For others (and the CI server) to replicate your working conditions, you need to set up an environment. Create a virtual environment somewhere outside your repo and activate it:
$ # Create virtual environment
$ python3 -m venv calculator
$ # Activate virtual environment (Mac and Linux)
$ . calculator/bin/activate
The previous commands work on macOS and Linux. If you are a Windows user, check the Platforms table in the official documentation. This will create a directory that contains a Python installation and tell the interpreter to use it. Now we can install packages knowing that it will not influence your system’s default Python installation.
Create a new file called calculator.py in the top-level directory of your repository, and copy the following code:
"""Calculator library containing basic math operations."""defadd(first_term,second_term):returnfirst_term+second_termdefsubtract(first_term,second_term):returnfirst_term-second_term
This is a bare-bones example containing two of the four functions we will be writing. Once we have our CI pipeline up and running, you will add the remaining two functions.
Go ahead and commit those changes:
$ # Make sure you are in the correct directory
$ cd CalculatorLibrary
$ git add calculator.py
$ git commit -m "Add functions for addition and subtraction"
Your CalculatorLibrary folder should have the following files right now:
CalculatorLibrary/
|
├── .git
├── .gitignore
├── README.md
└── calculator.py
Great, you have completed one part of the required functionality. The next step is adding tests to make sure your code works the way it’s supposed to.
You will test your code in two steps.
The first step involves linting: running a program, called a linter, to analyze code for potential errors. flake8 is commonly used to check if your code conforms to the standard Python coding style. Linting makes sure your code is easy to read for the rest of the Python community.
The second step is unit testing. A unit test is designed to check a single function, or unit, of code. Python comes with a standard unit testing library, but other libraries exist and are very popular. This example uses pytest.
A standard practice that goes hand in hand with testing is calculating code coverage. Code coverage is the percentage of source code that is "covered" by your tests. pytest has an extension, pytest-cov, that helps you understand your code coverage.
These are external dependencies, and you need to install them:
$ pip install flake8 pytest pytest-cov
These are the only external packages you will use. Make sure to store those dependencies in a requirements.txt file so others can replicate your environment:
$ pip freeze > requirements.txt
To run your linter, execute the following:
$ flake8 --statistics
./calculator.py:3:1: E302 expected 2 blank lines, found 1
./calculator.py:6:1: E302 expected 2 blank lines, found 1
2     E302 expected 2 blank lines, found 1
The --statistics option gives you an overview of how many times a particular error happened. Here we have two PEP 8 violations, because flake8 expects two blank lines before a function definition instead of one. Go ahead and add an empty line before each function definition. Run flake8 again to check that the error messages no longer appear.
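If you are unsure what the result should look like, the fixed calculator.py is the same code with two blank lines before each function definition:

"""Calculator library containing basic math operations."""


def add(first_term, second_term):
    return first_term + second_term


def subtract(first_term, second_term):
    return first_term - second_term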
Now it's time to write the tests. Create a file called test_calculator.py in the top-level directory of your repository and copy the following code:
"""Unit tests for the calculator library"""importcalculatorclassTestCalculator:deftest_addition(self):assert4==calculator.add(2,2)deftest_subtraction(self):assert2==calculator.subtract(4,2)
These tests make sure that our code works as expected. It is far from extensive because you haven’t tested for potential misuse of your code, but keep it simple for now.
The following command runs your test:
$ pytest -v --cov
collected 2 items

test_calculator.py::TestCalculator::test_addition PASSED      [ 50%]
test_calculator.py::TestCalculator::test_subtraction PASSED   [100%]

---------- coverage: platform darwin, python 3.6.6-final-0 -----------
Name                                              Stmts   Miss  Cover
---------------------------------------------------------------------
calculator.py                                         4      0   100%
test_calculator.py                                    6      0   100%
/Users/kristijan.ivancic/code/learn/__init__.py       0      0   100%
---------------------------------------------------------------------
TOTAL                                                10      0   100%
pytest is excellent at test discovery. Because you have a file with the prefix test, pytest knows it will contain unit tests for it to run. The same principles apply to the class and method names inside the file.
The -v flag gives you a nicer output, telling you which tests passed and which failed. In our case, both tests passed. The --cov flag makes sure pytest-cov runs and gives you a code coverage report for calculator.py.
You have completed the preparations. Commit the test file and push all those changes to the master branch:
$ git add test_calculator.py
$ git commit -m "Add unit tests for calculator"
$ git push
At the end of this section, your CalculatorLibrary folder should have the following files:
CalculatorLibrary/
|
├── .git
├── .gitignore
├── README.md
├── calculator.py
├── requirements.txt
└── test_calculator.py
Excellent, both your functions are tested and work correctly.
At last, you are ready to set up your continuous integration pipeline!
CircleCI needs to know how to run your build and expects that information to be supplied in a particular format. It requires a .circleci folder within your repo and a configuration file inside it. A configuration file contains instructions for all the steps that the build server needs to execute. CircleCI expects this file to be called config.yml.
A .yml file uses a data serialization language, YAML, and it has its own specification. The goal of YAML is to be human readable and to work well with modern programming languages for common, everyday tasks.
In a YAML file, there are three basic ways to represent data: mappings (key-value pairs), sequences (lists), and scalars (individual values such as strings and numbers). It is very simple to read.
Create the .circleci folder in your repo and a config.yml file with the following content:
# Python CircleCI 2.0 configuration file
version: 2
jobs:
  build:
    docker:
      - image: circleci/python:3.7
    working_directory: ~/repo
    steps:
      # Step 1: obtain repo from GitHub
      - checkout
      # Step 2: create virtual env and install dependencies
      - run:
          name: install dependencies
          command: |
            python3 -m venv venv
            . venv/bin/activate
            pip install -r requirements.txt
      # Step 3: run linter and tests
      - run:
          name: run tests
          command: |
            . venv/bin/activate
            flake8 --exclude=venv* --statistics
            pytest -v --cov=calculator
Some of these words and concepts might be unfamiliar to you. For example, what is Docker, and what are images? Let’s go back in time a bit.
Remember the problem programmers face when something works on their laptop but nowhere else? Before, developers used to create a program that isolates a part of the computer’s physical resources (memory, hard drive, and so on) and turns them into a virtual machine.
A virtual machine pretends to be a whole computer on its own. It would even have its own operating system. On that operating system, you deploy your application or install your library and test it.
Virtual machines take up a lot of resources, which sparked the invention of containers. The idea is analogous to shipping containers. Before shipping containers were invented, manufacturers had to ship goods in a wide variety of sizes, packaging, and modes (trucks, trains, ships).
By standardizing the shipping container, these goods could be transferred between different shipping methods without any modification. The same idea applies to software containers.
Containers are a lightweight unit of code and its runtime dependencies, packaged in a standardized way, so they can quickly be plugged in and run on the Linux OS. You don’t need to create a whole virtual operating system, as you would with a virtual machine.
Containers only replicate parts of the operating system they need in order to work. This reduces their size and gives them a big performance boost.
Docker is currently the leading container platform, and it’s even able to run Linux containers on Windows and macOS. To create a Docker container, you need a Docker image. Images provide blueprints for containers much like classes provide blueprints for objects. You can read more about Docker in their Get Started guide.
CircleCI maintains pre-built Docker images for several programming languages. In the above configuration file, you have specified a Linux image that has Python already installed. That image will create a container in which everything else happens.
Let’s look at each line of the configuration file in turn:
version: Every config.yml starts with the CircleCI version number, used to issue warnings about breaking changes.

jobs: Jobs represent a single execution of the build and are defined by a collection of steps. If you have only one job, it must be called build.

build: As mentioned before, build is the name of your job. You can have multiple jobs, in which case they need to have unique names.

docker: The steps of a job occur in an environment called an executor. The common executor in CircleCI is a Docker container. It is a cloud-hosted execution environment, but other options exist, like a macOS environment.

image: A Docker image is a file used to create a running Docker container. We are using an image that has Python 3.7 preinstalled.

working_directory: Your repository has to be checked out somewhere on the build server. The working directory represents the file path where the repository will be stored.

steps: This key marks the start of a list of steps to be performed by the build server.

checkout: The first step the server needs to do is check the source code out to the working directory. This is performed by a special step called checkout.

run: Executing command-line programs or commands is done inside the command key. The actual shell commands will be nested within.

name: The CircleCI user interface shows you every build step in the form of an expandable section. The title of the section is taken from the value associated with the name key.

command: This key represents the command to run via the shell. The | symbol specifies that what follows is a literal set of commands, one per line, exactly like you'd see in a shell/bash script.
You can read the CircleCI configuration reference document for more information.
Our pipeline is very simple and consists of 3 steps: checking out the repository, installing the dependencies, and running the linter and tests.
We now have everything we need to start our pipeline. Log in to your CircleCI account and click on Add Projects. Find your CalculatorLibrary repo and click Set Up Project. Select Python as your language. Since we already have a config.yml, we can skip the next steps and click Start building.
CircleCI will take you to the execution dashboard for your job. If you followed all the steps correctly, you should see your job succeed.
The final version of your CalculatorLibrary folder should look like this:
CalculatorRepository/
|
├── .circleci
├── .git
├── .gitignore
├── README.md
├── calculator.py
├── requirements.txt
└── test_calculator.py
Congratulations! You have created your first continuous integration pipeline. Now, every time you push to the master branch, a job will be triggered. You can see a list of your current and past jobs by clicking on Jobs in the CircleCI sidebar.
Time to add multiplication to our calculator library.
This time, we will first add a unit test without writing the function. Without the code, the test will fail, which will also fail the CircleCI job. Add the following code to the end of your test_calculator.py:
    def test_multiplication(self):
        assert 100 == calculator.multiply(10, 10)
Push the code to the master branch and see the job fail in CircleCI. This shows that continuous integration works and watches your back if you make a mistake.
Now add the code to calculator.py that will make the test pass:
def multiply(first_term, second_term):
    return first_term * second_term
Make sure there are two empty lines between the multiplication function and the previous one, or else your code will fail the linter check.
The job should be successful this time. This workflow of writing a failing test first and then adding the code to pass the test is called test driven development (TDD). It’s a great way to work because it makes you think about your code structure in advance.
Now try it on your own. Add a test for the division function, see it fail, and write the function to make the test pass.
When working on big applications that have a lot of moving parts, it can take a while for the continuous integration job to run. Most teams set up a notification procedure to let them know if one of their jobs fail. They can continue working while waiting for the job to run.
The most popular options are:
By default, CircleCI should send you an email when a job fails.
You have understood the basics of continuous integration and practiced setting up a pipeline for a simple Python program. This is a big step forward in your journey as a developer. You might be asking yourself, “What now?”
To keep things simple, this tutorial skimmed over some big topics. You can grow your skill set immensely by spending some time going more in-depth into each subject. Here are some topics you can look into further.
There is much more to Git than what you used here. Each developer team has a workflow tailored to their specific needs. Most of them include branching strategies and something called peer review. They make changes on branches separate from the master branch. When you want to merge those changes with master, other developers must first look at your changes and approve them before you're allowed to merge.
Note: If you want to learn more about different workflows teams use, have a look at the tutorials on GitHub and BitBucket.
If you want to sharpen your Git skills, we have an article called Advanced Git Tips for Python Developers.
Apart from virtualenv, there are other popular package and environment managers. Some of them deal with just virtual environments, while some handle both package installation and environment management. One of them is Conda:
“Conda is an open source package management system and environment management system that runs on Windows, macOS, and Linux. Conda quickly installs, runs and updates packages and their dependencies. Conda easily creates, saves, loads and switches between environments on your local computer. It was designed for Python programs, but it can package and distribute software for any language.” (Source)
Another option is Pipenv, a younger contender that is rising in popularity among application developers. Pipenv brings together pip and virtualenv into a single tool and uses a Pipfile instead of requirements.txt. Pipfiles offer deterministic environments and more security. This introduction doesn't do it justice, so check out Pipenv: A Guide to the New Python Packaging Tool.
Simple unit tests with pytest are only the tip of the iceberg. There's a whole world out there to explore! Software can be tested on many levels, including integration testing, acceptance testing, regression testing, and so forth. To take your knowledge of testing Python code to the next level, head over to Getting Started With Testing in Python.
In this tutorial, you started to build a library of functions for other developers to use in their project. You need to package that library into a format that is easy to distribute and install using, for example, pip.
Creating an installable package requires a different layout and some additional files like __init__.py and setup.py. Read Python Application Layouts: A Reference for more information on structuring your code.
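To give a flavour of what that involves, a minimal setup.py for a single-module library like this one might look roughly like the sketch below; the name and metadata are placeholders, not part of the tutorial.

from setuptools import setup

setup(
    name="calculator-library",          # placeholder distribution name
    version="0.1.0",
    description="Basic math operations: add, subtract, multiply, divide",
    py_modules=["calculator"],          # the single module from this tutorial
    python_requires=">=3.6",
)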
To learn how to turn your repository into an installable Python package, read Packaging Python Projects by the Python Packaging Authority.
You covered all the basics of CI in this tutorial, using a simple example of Python code. It’s common for the final step of a CI pipeline to create a deployable artifact. An artifact represents a finished, packaged unit of work that is ready to be deployed to users or included in complex products.
For example, to turn your calculator library into a deployable artifact, you would organize it into an installable package. Finally, you would add a step in CircleCI to package the library and store that artifact where other processes can pick it up.
For more complex applications, you can create a workflow to schedule and connect multiple CI jobs into a single execution. Feel free to explore the CircleCI documentation.
You can think of continuous deployment as an extension of CI. Once your code is tested and built into a deployable artifact, it is deployed to production, meaning the live application is updated with your changes. One of the goals is to minimize lead time, the time elapsed between writing a new line of code and putting it in front of users.
Note: To add a bit of confusion to the mix, the acronym CD is not unique. It can also mean Continuous Delivery, which is almost the same as continuous deployment but has a manual verification step between integration and deployment. You can integrate your code at any time but have to push a button to release it to the live application.
Most companies use CI/CD in tandem, so it’s worth your time to learn more about Continuous Delivery/Deployment.
You have used CircleCI, one of the most popular continuous integration services. However, this is a big market with a lot of strong contenders. CI products fall into two basic categories: remote and self-hosted services.
Jenkins is the most popular self-hosted solution. It is open-source and flexible, and the community has developed a lot of extensions.
In terms of remote services, there are many popular options like TravisCI, CodeShip, and Semaphore. Big enterprises often have their custom solutions, and they sell them as a service, such as AWS CodePipeline, Microsoft Team Foundation Server, and Oracle’s Hudson.
Which option you choose depends on the platform and features you and your team need. For a more detailed breakdown, have a look at Best CI Software by G2Crowd.
With the knowledge from this tutorial under your belt, you can now answer the following questions:
You have acquired a programming superpower! Understanding the philosophy and practice of continuous integration will make you a valuable member of any team. Awesome work!
A couple weeks ago I gave a talk about usernames at North Bay Python. The content came mostly from things I’ve learned in roughly 12 years of maintaining django-registration, which has taught me more than I ever wanted to know about how complex even “simple” things can be.
I mentioned toward the beginning of the talk, though, that it wasn’t going to be one of those “falsehoods programmers believe about X” things. If you’re not ...
In this tutorial we will learn how to work with comma separated (CSV) files in Python and Pandas. We will get an overview of how to use Pandas to load CSV to dataframes and how to write dataframes to CSV.
In the first section, we will go through, with examples, how to read a CSV file, how to read specific columns from a CSV, how to read multiple CSV files and combine them to one dataframe, and, finally, how to convert data according to specific datatypes (e.g., using Pandas read_csv dtypes). In the last section we will continue by learning how to write CSV files. That is, we will learn how to export dataframes to CSV files.
In the first example of this Pandas read CSV tutorial we will just use read_csv to load a CSV file, located in the same directory as the script, into a dataframe. If we have the file in another directory we have to remember to add the full path to the file. Here's the first, very simple, Pandas read_csv example:
import pandas as pd

df = pd.read_csv('amis.csv')
df.head()
Dataframe
The data can be downloaded here but in the following examples we are going to use Pandas read_csv to load data from a URL.
In the next read_csv example we are going to read the same data from a URL. It's very simple: we just put the URL in as the first parameter of the read_csv method:
url_csv = 'https://vincentarelbundock.github.io/Rdatasets/csv/boot/amis.csv'
df = pd.read_csv(url_csv)
As can be seen in the image above, we get a column named ‘Unnamed: 0’. We can also see that it contains numbers, so we can use this column as the index column. In the next code example we are going to use Pandas read_csv with the index_col parameter. This parameter can take an integer or a sequence. In our case we are going to use the integer 0, and we will get a much nicer dataframe:
df = pd.read_csv(url_csv, index_col=0)
df.head()
The index_col parameter can also take a string as input, and we will now use a different data file. In the next example we will read a CSV into a Pandas dataframe and use the idNum column as the index.
csv_url = 'http://vincentarelbundock.github.io/Rdatasets/csv/carData/MplsStops.csv'
df = pd.read_csv(csv_url, index_col='idNum')
df.iloc[:, 0:6].head()
Note, to get the above output we used Pandas iloc to select only the first six columns (and head() to show the first rows). This was done to get output that is easier to illustrate. That said, we are now continuing to the next section, where we are going to read certain columns from a CSV file into a dataframe.
In some cases we don’t want to parse every column in the CSV file. To read only certain columns we can use the parameter usecols. Note, if we want the first column to be the index column and we want to parse the first three data columns, we need a list with 4 elements (compare my read_excel usecols example here):
cols = [0, 1, 2, 3]
df = pd.read_csv(url_csv, index_col=0, usecols=cols)
df.head()
read_csv usecols
Of course, using read_csv usecols makes more sense if we have a CSV file with more columns. We can use Pandas read_csv usecols with a list of strings as well. In the next example we return to the larger file we used previously:
csv_url = 'http://vincentarelbundock.github.io/Rdatasets/csv/carData/MplsStops.csv'
df = pd.read_csv(csv_url, index_col='idNum',
                 usecols=['idNum', 'date', 'problem', 'MDC'])
df.head()
In some of the previous read_csv examples we got an unnamed column. We solved this by setting that column as the index, or by using usecols to select specific columns from the CSV file. However, we may not want to do either of those. Here’s one example of how to use Pandas read_csv to get rid of the column “Unnamed: 0”:
csv_url = 'http://vincentarelbundock.github.io/Rdatasets/csv/carData/MplsStops.csv'
cols = pd.read_csv(csv_url, nrows=1).columns
df = pd.read_csv(csv_url, usecols=cols[1:])
df.iloc[:, 0:6].head()
It’s of course also possible to remove the unnamed columns after we have loaded the CSV into a dataframe. To remove the unnamed columns we can use two different methods, loc and drop, together with other Pandas dataframe methods. When using the drop method we can use the inplace parameter and get a dataframe without unnamed columns.
df.drop(df.columns[df.columns.str.contains('unnamed', case=False)],
        axis=1, inplace=True)

# The following line will give us the same result as the line above
# df = df.loc[:, ~df.columns.str.contains('unnamed', case=False)]

df.iloc[:, 0:7].head()
To explain the code example above: we select the columns that do not contain the string ‘unnamed’. Furthermore, we used the case parameter so that the contains method is not case-sensitive; thus, columns named both “Unnamed” and “unnamed” will match. In the first line, using Pandas drop, we also use the inplace parameter so that our dataframe is changed in place. The axis parameter, in turn, is used to drop columns instead of indices (i.e., rows).
If we have missing data in our CSV file and it’s coded in a way that makes it impossible for Pandas to detect it, we can use the parameter na_values. In the example below, the amis.csv file has been changed and there are some cells with the string “Not Available”.
That is, we are going to convert “Not Available” to something that we can easily remove when carrying out data analysis later.
df = pd.read_csv('Simdata/MissingData.csv', index_col=0,
                 na_values="Not Available")
df.head()
What if our data file(s) contain extra, non-data information in the first few rows? For instance, how can we skip the first three rows in a file looking like this:
We will now learn how to use Pandas read_csv to skip a given number of rows. Luckily, it’s very simple: we just use the skiprows parameter. In the following example we use read_csv with skiprows=3 to skip the first 3 rows.
df = pd.read_csv('Simdata/skiprow.csv', index_col=0, skiprows=3)
df.head()
Note that we can obtain the same result as above by using the header parameter (i.e., data = pd.read_csv('Simdata/skiprow.csv', header=3)).
If we don’t want to read every row in the CSV file we can use the parameter nrows. In the example below we read the first 8 rows of a CSV file.
df = pd.read_csv(url_csv, nrows=8)
df
If we want to select random rows we can load the complete CSV file and use Pandas sample to randomly select rows (learn more about this by reading the Pandas Sample tutorial).
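A minimal sketch of that approach, reusing the amis.csv file from the earlier examples (the sample sizes and the random seed below are just for illustration):

import pandas as pd

df = pd.read_csv('amis.csv')

# Randomly select 10 rows...
random_rows = df.sample(n=10)

# ...or a random 25% of all rows, with a seed for reproducibility
random_fraction = df.sample(frac=0.25, random_state=1)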
We can also set the data types for the columns. Although all columns in the amis dataset contain integers, we can set some of them to the string data type. This is exactly what we will do in the next Pandas read_csv example. We will use the Pandas read_csv dtype parameter and pass in a dictionary:
url_csv = 'https://vincentarelbundock.github.io/Rdatasets/csv/boot/amis.csv'
df = pd.read_csv(url_csv, dtype={'speed': int, 'period': str,
                                 'warning': str, 'pair': int})
df.info()
It’s, of course, possible to force other data types such as integer and float. All we have to do is change str to float, for instance (given that we have decimal numbers in that column, of course).
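As a minimal sketch of that (the choice of column here is just illustrative, assuming the amis data used above):

import pandas as pd

url_csv = 'https://vincentarelbundock.github.io/Rdatasets/csv/boot/amis.csv'

# Force the 'speed' column to float instead of the inferred integer type
df = pd.read_csv(url_csv, dtype={'speed': float})
df.info()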
If we have data from many sources, such as experiment participants, we may have them in multiple CSV files. If the data from the different CSV files are going to be analyzed together, we may want to load them all into one dataframe. In the next examples we are going to use Pandas read_csv to read multiple files.
First, we are going to use Python os and fnmatch to list all files with the word “Day” of the file type CSV in the directory “SimData”. Next, we are using Python list comprehension to load the CSV files into dataframes (stored in a list, see the type(dfs) output).
import os, fnmatch

csv_files = fnmatch.filter(os.listdir('./SimData'), '*Day*.csv')
dfs = [pd.read_csv(os.path.join('SimData', csv_file)) for csv_file in csv_files]

type(dfs)
# Output: list
Finally, we use the method concat to concatenate the dataframes in our list. In the example files there is a column called ‘Day’ so that each day (i.e., CSV file) is unique.
df = pd.concat(dfs, sort=False)
df.Day.unique()
The second method we are going to use is a bit simpler: using Python glob. If we compare the two methods (os + fnmatch vs. glob), we can see that in the list comprehension we don’t have to prepend the path. This is because glob already includes the path to our files. Handy!
import glob

csv_files = glob.glob('SimData/*Day*.csv')
dfs = [pd.read_csv(csv_file) for csv_file in csv_files]
df = pd.concat(dfs, sort=False)
If we don’t have a column in each CSV file identifying which dataset it is (e.g., data from different days), we could add the filename as a new column in each dataframe:
import glob
import os

csv_files = glob.glob('SimData/*Day*.csv')
dfs = []

for csv_file in csv_files:
    temp_df = pd.read_csv(csv_file)
    # Store the source filename in a new column (works with both Windows and Unix paths)
    temp_df['DataF'] = os.path.basename(csv_file)
    dfs.append(temp_df)
In this section we will learn how to export dataframes to CSV files. We will start by creating a dataframe with some variables, but first we need to import the Pandas module:
import pandas as pd
The next step is to create a dataframe. We will create the dataframe using a dictionary. The keys will be the column names and the values will be lists containing our data:
df = pd.DataFrame({'Names': ['Andreas', 'George', 'Steve', 'Sarah', 'Joanna', 'Hanna'],
                   'Age': [21, 22, 20, 19, 18, 23]})
df.head()
Then we write the dataframe to CSV file using Pandas to_csv method. In the example below we don’t use any parameters but the path_or_buf which is, in our case, the file name.
df.to_csv('NamesAndAges.csv')
Here’s what the exported dataframe looks like:
As can be seen in the image above we get a new column when we are not using any parameters. This column is the index column from our Pandas dataframe. We can use the parameter index and set it to False to get rid of this column.
df.to_csv('NamesAndAges.csv', index=False)
If we have many dataframes and we want to export them all to the same CSV file it is, of course, possible. In the Pandas to_csv example below we have 3 dataframes. We are going to use Pandas concat with the parameters keys and names.
This is done to create two new columns, named Group and Row Num. The important part is Group which will identify the different dataframes. In the last row of the code example we use Pandas to_csv to write the dataframes to CSV.
df1 = pd.DataFrame({'Names': ['Andreas', 'George', 'Steve', 'Sarah', 'Joanna', 'Hanna'],
                    'Age': [21, 22, 20, 19, 18, 23]})
df2 = pd.DataFrame({'Names': ['Pete', 'Jordan', 'Gustaf', 'Sophie', 'Sally', 'Simone'],
                    'Age': [22, 21, 19, 19, 29, 21]})
df3 = pd.DataFrame({'Names': ['Ulrich', 'Donald', 'Jon', 'Jessica', 'Elisabeth', 'Diana'],
                    'Age': [21, 21, 20, 19, 19, 22]})

df = pd.concat([df1, df2, df3],
               keys=['Group1', 'Group2', 'Group3'],
               names=['Group', 'Row Num']).reset_index()

df.to_csv('MultipleDfs.csv', index=False)
In the CSV file we get 4 columns. The keys parameter with the list (['Group1', 'Group2', 'Group3']) will enable identification of the different dataframes we wrote. We also get the column “Row Num”, which will contain the row numbers for each dataframe:
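For reference, reconstructing it from the example data above (this stands in for the screenshot), the beginning of MultipleDfs.csv should look roughly like this:

Group,Row Num,Names,Age
Group1,0,Andreas,21
Group1,1,George,22
Group1,2,Steve,20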
In this tutorial we have learned about importing CSV files into Pandas dataframes. More specifically, we have learned how to:
The post Pandas Read CSV Tutorial appeared first on Erik Marsja.
I’m running a sale that ends in 24 hours, but I’m not the only one. This post is a compilation of the different Cyber Monday deals I’ve found related to Python and Python learning.
This is my weekly Python skill-building service. I’m offering 52 weeks of Python Morsels for 50% off.
You can find more details on this sale here.
Michael Kennedy of Talk Python is offering a 4-course bundle covering a number of his Python courses.
Reuven Lerner is also offering a 40% off sale on his courses. Reuven has courses on Python, Git, and regular expressions.
Bob and Julian of PyBites are offering a 50% off sale on their Code Challenges. These are a different variety of code challenges than Python Morsels. You could sign up for both PyBites and Python Morsels if you wanted extra learning every week.
A number of Python book publishers, such as No Starch, are running big sales on their books. I took a look at their sale, but haven’t gone searching for more of them.
If you have questions about the Python Morsels sale, email me.
The Python Morsels sale and likely all the other sales above will end in the next 24 hours, probably sooner depending on when you’re reading this.
So go check them out!
Did I miss a deal that you know about? Link to it in the comments!
(Image: demonstrating the app for a pygame presentation at an Open Source diversity event in Berlin.)
(Image: a music making and performance app for pygame 2, looping and mixing sounds recorded from the microphone.)
I recently visited the UK Meteorology Office, a moderately large organization that serves the weather and climate forecasting needs of the UK (and several other nations). I was there with other open source colleagues including Joe Hamman and Ryan May from open source projects like Dask, Xarray, JupyterHub, MetPy, Cartopy, and the broader Pangeo community.
This visit was like many other visits I’ve had over the years that are centered around showing open source tooling to large institutions, so I thought I’d write about it in hopes that it helps other people in this situation in the future.
My goals for these visits are the following:
We were invited by early adopters within the institution: the UK Met Office’s Informatics Lab, a research/incubation group within the broader organization, and the Analysis, Visualization, and Data group (AVD), which serves 500 analysts at the Met Office with its suite of open source tooling.
Both of these groups are forward thinking, already use and appreciate the tools that we were talking about, and hope to leverage our presence to evangelize what they’ve already been saying throughout the company. They need outside experts to provide external validation within the company; that’s our job.
The goals for the early adopters are the following:
So our visit was split between meeting a variety of groups within the institution (analysts, IT, …) and talking shop.
One of our first visits was a discussion with a cross-department team of people architecting a variety of data processing systems throughout the company. Joe Hamman and I gave a quick talk about Dask, XArray, and the Pangeo community. Because this was more of an IT-focused group I went first, answered the standard onslaught of IT-related questions about Dask, and established credibility. Then Joe took over and demonstrated the practical relevance of the approach from their users’ perspective.
We’ve done this tag-team approach a number of times and it’s always effective. Having a technical person speak to technical concerns while also having a scientist demonstrate organizational value seems to establish credibility across a wide range of people.
However it’s still important to tailor the message to the group at hand. IT-focused groups like this one are usually quite conservative about adding new technology, and they have a constant pressure of users asking them for things that will generally cause problems. We chose to start with low-level technical details because it lets them engage with the problem at a level that they can meaningfully test and assess the situation.
Our early-adopter allies had also arranged a tech-talk with a wider audience across the office. This was part of a normal lecture series, so we had a large crowd, along with a video recording within the institution for future viewers. The audience this time was a combination of analysts (users of our software), some IT, and an executive or two.
Joe and I gave essentially the same talk, but this time we reversed the order, focusing first on the scientific objectives, and then following up with a more brief summary on how the software accomplishes this. A pretty constant message in this talk was …
other institutions like yours already do this and are seeing transformative change
We provided social proof by showing that lots of other popular projects and developer communities integrate with these tools, and that many large government organizations (peers to the UK Met Office) are already adopting these tools and seeing efficiency gains.
Our goals for this section are the following:
Encourage the users within the audience to apply pressure to their management/IT to make it easier for them to integrate these tools to their everyday workflow
Convince management that this is a good approach.
This means two things for them:
After we gave the talk to the larger audience we met with many smaller groups. These were groups that managed the HPC systems, were in charge of storing data on the cloud, ran periodic data processing pipelines, etc. Doing this after the major talk is useful, because people arrive with a pretty good sense of what the software does and how it might help them. Conversations then become more specific quickly.
During this process I had the good fortune to work with Peter Killick and Bill Little, who had done a bit of work on Dask in the past and were interested in doing more. Before coming to the Met Office we found a bug that was relevant to them, but that also involved learning some more Dask skills. We worked on it off and on during the visit. It was great to get to know them better, and with that added familiarity they’re hopefully more likely to fix issues that arise in the future.
Between the visitors and our hosts we had several core developers present on related projects (XArray, Iris, Dask, Cartopy, Metpy, …). This was a good time not just for evangelism and growing the community, but also for making long-term plans about existing projects, identifying structural issues in the ecosystem, and identifying new projects to fix those issues.
There was good conversation around the future relationship between Xarray and Iris, two similar packages that could play better together. We discussed current JupyterHub deployments both within the UK Met Office and elsewhere. Developers of the popular Cartopy library got together. A couple of us prototyped a very early stage unstructured mesh data structure.
These visits are one of the few times when a mostly distributed community gets together and can make in-person plans. Sitting down with a blank sheet of paper is a useful exercise that is still remarkably difficult to replicate remotely.
It turns out that the Southwest corner of England is full of fine pubs, and even better walking. I’m thankful to Phil Elson and Jo Camp for hosting me over the weekend where we succeeded in chatting about things other than work.
In this article we will create a score manager class to render the player’s score on the scene for our pygame project. At the moment we will only increase the player’s score each time the player’s missile hits the enemy ship, and deduct from it each time the player gets hit by an enemy missile. In the future we will introduce more features into the score manager class, but for now...
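As a rough sketch of what such a score manager class could look like (the class and method names below are my own illustration, not necessarily the ones used in this project):

import pygame

class ScoreManager:
    """Keeps track of the player score and renders it on the scene."""

    def __init__(self, font_size=30, color=(255, 255, 255), position=(10, 10)):
        pygame.font.init()  # safe to call even if already initialized
        self.score = 0
        self.color = color
        self.position = position
        self.font = pygame.font.Font(None, font_size)  # None = pygame's default font

    def increase(self, amount=1):
        # Called when the player's missile hits an enemy ship
        self.score += amount

    def decrease(self, amount=1):
        # Called when the player gets hit by an enemy missile
        self.score = max(0, self.score - amount)

    def draw(self, screen):
        # Render the current score onto the given surface once per frame
        text_surface = self.font.render(f"Score: {self.score}", True, self.color)
        screen.blit(text_surface, self.position)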
This article is the second part of a series on using Python for developing asynchronous web applications. The first part provides a more in-depth coverage of concurrency in Python and asyncio, as well as aiohttp.
If you'd like to read more about Asynchronous Python for Web Development, we've got it covered.
Due to the non-blocking nature of asynchronous libraries like aiohttp, we would hope to be able to make and handle more requests in a given amount of time compared to analogous synchronous code. This is due to the fact that asynchronous code can rapidly switch between contexts in order to minimize time spent waiting for I/O.
Testing client-side performance of an asynchronous library like aiohttp is relatively straightforward. We choose some website as reference, and then make a certain number of requests, timing how long it takes our code to complete them. We'll be looking at the relative performance of aiohttp and requests when making requests to https://example.com.
Testing server-side performance is a little more tricky. Libraries like aiohttp come with built-in development servers, which are fine for testing routes on a local network. However, these development servers are not suited to deploying applications on the public web, as they cannot handle the load expected of a publicly available website, and they are not good at serving up static assets, like Javascript, CSS, and image files.
In order to get a better idea of the relative performance of aiohttp and an analogous synchronous web framework, we're going to re-implement our web app using Flask and then we'll compare development and production servers for both implementations.
For the production server, we're going to be using gunicorn.
For a traditional, synchronous approach, we just use a simple for loop. Before you run the code, though, make sure to install the requests module:
$ pip install --user requests
With that out of the way, let's go ahead and implement it in a more traditional manner:
# multiple_sync_requests.py
import requests

def main():
    n_requests = 100
    url = "https://example.com"
    session = requests.Session()
    for i in range(n_requests):
        print(f"making request {i} to {url}")
        resp = session.get(url)
        if resp.status_code == 200:
            pass

main()
Analogous asynchronous code is a little more complicated though. Making multiple requests with aiohttp leverages the asyncio.gather method to make requests concurrently:
# multiple_async_requests.py
import asyncio
import aiohttp

async def make_request(session, req_n):
    url = "https://example.com"
    print(f"making request {req_n} to {url}")
    async with session.get(url) as resp:
        if resp.status == 200:
            await resp.text()

async def main():
    n_requests = 100
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(
            *[make_request(session, i) for i in range(n_requests)]
        )

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
Running both synchronous and asynchronous code with the bash time utility:
me@local:~$ time python multiple_sync_requests.py
real 0m13.112s
user 0m1.212s
sys 0m0.053s
me@local:~$ time python multiple_async_requests.py
real 0m1.277s
user 0m0.695s
sys 0m0.054s
The concurrent/asynchronous code is far faster. But what happens if we multi-thread the synchronous code? Could it match the speed of concurrent code?
# multiple_sync_request_threaded.py
import threading
import argparse
import requests

def create_parser():
    parser = argparse.ArgumentParser(
        description="Specify the number of threads to use"
    )
    parser.add_argument("-nt", "--n_threads", default=1, type=int)
    return parser

def make_requests(session, n, url, name=""):
    for i in range(n):
        print(f"{name}: making request {i} to {url}")
        resp = session.get(url)
        if resp.status_code == 200:
            pass

def main():
    parsed = create_parser().parse_args()
    n_requests = 100
    n_requests_per_thread = n_requests // parsed.n_threads
    url = "https://example.com"
    session = requests.Session()
    threads = [
        threading.Thread(
            target=make_requests,
            args=(session, n_requests_per_thread, url, f"thread_{i}")
        ) for i in range(parsed.n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

main()
Running this rather verbose piece of code will yield:
me@local:~$ time python multiple_sync_request_threaded.py -nt 10
real 0m2.170s
user 0m0.942s
sys 0m0.104s
And we can increase performance by using more threads, but returns diminish rapidly:
me@local:~$ time python multiple_sync_request_threaded.py -nt 20
real 0m1.714s
user 0m1.126s
sys 0m0.119s
By introducing threading, we can come close to matching the performance of the asynchronous code, at the cost of increased code complexity.
While it does offer a similar response time, it's not worth the price of complicating code that could be simple: the quality of code isn't improved by its complexity or the number of lines we use.
We'll use the Apache Benchmark (ab) tool to test the performance of different servers.
With ab we can specify the total number of requests to make, in addition to the number of concurrent requests.
Before we can start testing, we have to reimplement our planet tracker app (from the previous article) using a synchronous framework. We'll use Flask, as the API is similar to aiohttp (in reality the aiohttp routing API is based off of Flask):
# flask_app.py
from flask import Flask, jsonify, render_template, request

from planet_tracker import PlanetTracker

__all__ = ["app"]

app = Flask(__name__, static_url_path="",
            static_folder="./client",
            template_folder="./client")

@app.route("/planets/<planet_name>", methods=["GET"])
def get_planet_ephmeris(planet_name):
    data = request.args
    try:
        geo_location_data = {
            "lon": str(data["lon"]),
            "lat": str(data["lat"]),
            "elevation": float(data["elevation"])
        }
    except KeyError as err:
        # default to Greenwich observatory
        geo_location_data = {
            "lon": "-0.0005",
            "lat": "51.4769",
            "elevation": 0.0,
        }
    print(f"get_planet_ephmeris: {planet_name}, {geo_location_data}")
    tracker = PlanetTracker()
    tracker.lon = geo_location_data["lon"]
    tracker.lat = geo_location_data["lat"]
    tracker.elevation = geo_location_data["elevation"]
    planet_data = tracker.calc_planet(planet_name)
    return jsonify(planet_data)

@app.route('/')
def hello():
    return render_template("index.html")

if __name__ == "__main__":
    app.run(
        host="localhost",
        port=8000,
        threaded=True
    )
If you're jumping in without reading the previous article, we have to set up our project a little before testing. I've put all the Python server code in a directory planettracker, itself a sub-directory of my home folder.
me@local:~/planettracker$ ls
planet_tracker.py
flask_app.py
aiohttp_app.py
I strongly suggest that you visit the previous article and get familiar with the application we've already built before proceeding.
Let's see how long it takes our servers to handle 1000 requests, made 20 at a time.
First, I'll open two terminal windows. In the first, I run the server:
# terminal window 1
me@local:~/planettracker$ pipenv run python aiohttp_app.py
In the second, let's run ab:
# terminal window 2
me@local:~/planettracker$ ab -k -c 20 -n 1000 "localhost:8000/planets/mars?lon=145.051&lat=-39.754&elevation=0"
...
Concurrency Level: 20
Time taken for tests: 0.494 seconds
Complete requests: 1000
Failed requests: 0
Keep-Alive requests: 1000
Total transferred: 322000 bytes
HTML transferred: 140000 bytes
Requests per second: 2023.08 [#/sec] (mean)
Time per request: 9.886 [ms] (mean)
Time per request: 0.494 [ms] (mean, across all concurrent requests)
Transfer rate: 636.16 [Kbytes/sec] received
...
ab outputs a lot of information, and I've only displayed the most relevant bit. Of this, the number we should pay most attention to is the "Requests per second" field.
Now, exiting out of the server in the first window, let's fire up our Flask app:
# terminal window 1
me@local:~/planettracker$ pipenv run python flask_app.py
Running the test script again:
# terminal window 2
me@local:~/planettracker$ ab -k -c 20 -n 1000 "localhost:8000/planets/mars?lon=145.051&lat=-39.754&elevation=0"
...
Concurrency Level: 20
Time taken for tests: 1.385 seconds
Complete requests: 1000
Failed requests: 0
Keep-Alive requests: 0
Total transferred: 210000 bytes
HTML transferred: 64000 bytes
Requests per second: 721.92 [#/sec] (mean)
Time per request: 27.704 [ms] (mean)
Time per request: 1.385 [ms] (mean, across all concurrent requests)
Transfer rate: 148.05 [Kbytes/sec] received
...
It looks like the aiohttp app is 2.5x to 3x faster than the Flask app when using each library's respective development server.
What happens if we use gunicorn to serve up our apps?
Before we can test our apps in production mode, we have to first install gunicorn and figure out how to run our apps using an appropriate gunicorn worker class. In order to test the Flask app we can use the standard gunicorn worker, but for aiohttp we have to use the gunicorn worker bundled with aiohttp. We can install gunicorn with pipenv:
me@local:~/planettracker$ pipenv install gunicorn
We can run the aiohttp app with the appropriate gunicorn worker:
# terminal window 1
me@local:~/planettracker$ pipenv run gunicorn aiohttp_app:app --worker-class aiohttp.GunicornWebWorker
Moving forward, when displaying ab test results I'm only going to show the "Requests per second" field for the sake of brevity:
# terminal window 2
me@local:~/planettracker$ ab -k -c 20 -n 1000 "localhost:8000/planets/mars?lon=145.051&lat=-39.754&elevation=0"
...
Requests per second: 2396.24 [#/sec] (mean)
...
Now let's see how the Flask app fares:
# terminal window 1
me@local:~/planettracker$ pipenv run gunicorn flask_app:app
Testing with ab:
# terminal window 2
me@local:~/planettracker$ ab -k -c 20 -n 1000 "localhost:8000/planets/mars?lon=145.051&lat=-39.754&elevation=0"
...
Requests per second: 1041.30 [#/sec] (mean)
...
Using gunicorn definitely resulted in increased performance for both the aiohttp and Flask apps. The aiohttp app still performs better, although not by as much of a margin as with the development server.
gunicorn allows us to use multiple workers to serve up our apps. We can use the -w command line argument to tell gunicorn to spawn more worker processes. Using 4 workers results in a significant performance bump for our apps:
# terminal window 1
me@local:~/planettracker$ pipenv run gunicorn aiohttp_app:app -w 4
Testing with ab:
# terminal window 2
me@local:~/planettracker$ ab -k -c 20 -n 1000 "localhost:8000/planets/mars?lon=145.051&lat=-39.754&elevation=0"
...
Requests per second: 2541.97 [#/sec] (mean)
...
Moving on to the Flask version:
# terminal window 1
me@local:~/planettracker$ pipenv run gunicorn flask_app:app -w 4
Testing with ab:
# terminal window 2
me@local:~/planettracker$ ab -k -c 20 -n 1000 "localhost:8000/planets/mars?lon=145.051&lat=-39.754&elevation=0"
...
Requests per second: 1729.17 [#/sec] (mean)
...
The Flask app saw a more significant boost in performance when using multiple workers!
Let's take a step back and look at the results of testing development and production servers for both aiohttp and Flask implementations of our planet tracker app in a table:
|   | aiohttp | Flask | % difference |
|---|---|---|---|
| Development server (Requests/sec) | 2023.08 | 721.92 | 180.24 |
| gunicorn (Requests/sec) | 2396.24 | 1041.30 | 130.12 |
| % increase over development server | 18.45 | 44.24 |   |
| gunicorn -w 4 (Requests/sec) | 2541.97 | 1729.17 | 47.01 |
| % increase over development server | 25.65 | 139.52 |   |
In this article, we've compared the performance of an asynchronous web application to that of its synchronous counterpart, and used several tools to do so.
I have been maintaining my blog for a while. It is a self-hosted Ghost blog using Casper, the default Ghost theme. In September 2018, Ghost released version 2.0, so now it is time to update my installation.
It is always advisable to test changes before rolling them out to the production server. I maintain a staging instance for exactly this purpose and test any and all changes there before touching production. I did the same thing here.
I exported the Ghost data into a JSON file and prettified it for easier reading. I then removed the old database, started the container for the new Ghost version, and reimported the data from the JSON file.
I had another problem to solve: the theme. I used to have Casper as my theme, but I do not like its new look for my blog, which is predominantly a text blog. I was unable to make the old version of the theme work with the new Ghost, so I chose to use Attila as my theme instead. I made some modifications, uploaded it, and enabled it for my blog. A huge thanks to the Ghost community and the developers; it was a really smooth job.