Channel: Planet Python

Bhishan Bhandari: Debugging with breakpoint in Python3.7


Python has long had a default debugger named pdb in the standard library. pdb provides an interactive source code debugger for Python programs. The intention of this post is to clarify, through examples and explanations, how the new built-in breakpoint() in Python 3.7 compares with pdb in earlier versions. Breakpoints are generally the point in […]
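
As a quick, hedged illustration of the comparison (not taken from the full post; the divide functions are made up for illustration):

import pdb

def divide_old(a, b):
    pdb.set_trace()        # pre-3.7 idiom: import pdb and call set_trace()
    return a / b

def divide_new(a, b):
    breakpoint()           # Python 3.7+: built-in, drops into pdb by default
    return a / b

# The built-in can be redirected or disabled via the PYTHONBREAKPOINT
# environment variable, e.g. PYTHONBREAKPOINT=0 skips all breakpoints.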

The post Debugging with breakpoint in Python3.7 appeared first on The Tara Nights.


Bhishan Bhandari: Magic Methods in Python – Dunder Methods


Magic methods are methods that have two underscores as a prefix and suffix to the method name. They are also called dunder methods, a name adopted for the double underscores. __init__ and __str__ are examples of magic methods. These are a set of special methods that can be used to enhance your […]
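
As a minimal illustration of the idea (the Point class is a made-up example, not from the post):

class Point:
    def __init__(self, x, y):       # runs when the object is constructed
        self.x = x
        self.y = y

    def __str__(self):              # used by str() and print()
        return f'Point({self.x}, {self.y})'

print(Point(2, 3))                  # -> Point(2, 3)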

The post Magic Methods in Python – Dunder Methods appeared first on The Tara Nights.

EuroPython Society: List of EPS Board Candidates for 2018/2019


At this year’s General Assembly we will vote in a new board of the EuroPython Society for the term 2018/2019.

List of Board Candidates

The EPS bylaws require one chair and 2 - 8 board members. The following candidates have stated their willingness to work on the EPS board. We are presenting them here (in alphabetical order by surname).

Prof. Martin Christen

Teaching Python / using Python for research projects

Martin Christen is a professor of Geoinformatics and Computer Graphics at the Institute of Geomatics at the University of Applied Sciences Northwestern Switzerland (FHNW). His main research interests are geospatial Virtual- and Augmented Reality, 3D geoinformation, and interactive 3D maps.
Martin is very active in the Python community. He teaches various Python-related courses and uses Python in most research projects. He organizes the PyBasel meetup - the local Python User Group Northwestern Switzerland. He also organizes the yearly GeoPython conference. He is a board member of the Python Software Verband e.V.

I would be glad to help with EuroPython, to be part of a great team that makes the next edition of EuroPython even better, wherever it will be hosted.

Dr. Darya Chyzhyk

PhD / Python programming enthusiast for research and science

Currently, Darya is a Post-Doc at INRIA Saclay research center, France.

She has a degree in applied mathematics and defended her thesis in computer science. For the last 7 years Darya has been working on computer-aided diagnostic systems for brain diseases at the University of the Basque Country, Spain, and the University of Florida, USA, and she has been a member of the Computational Intelligence Group since 2009. Her aim is to develop computational methods for brain MRI processing and analysis, including open source tools, that help medical people in their research studies of specific pathologies.

She has experience in international conference organization and takes part in events for teenagers and kids such as the Week of Science. She has participated in more than 10 international science conferences, trainings and summer courses.

Board member of Python San Sebastian Society (ACPySS), on-site team of EuroPython 2015 and 2016, EPS board member since 2015.

Artur Czepiel

Pythonista / Web Programmer

Artur started writing in Python around 2008. Since then he has used it for fun, profit, and automation, mostly writing web backends and sysadmin scripts. In the last few years he has slowly been expanding that list of use cases with the help of data analysis tools like pandas.

At EuroPython 2017 he saw a talk about EuroPython’s codebase and started contributing patches, later joining the Web and Support Workgroups. His plan for next year is to write more patches, focusing on how the website (and other related software, like the helpdesk) can be modified to improve the workflows of other WGs.

Anders Hammarquist

Pythonista / Consultant / Software architect

Anders brought Python to Open End (née Strakt), a Python software company
focusing on data organization, when we founded it in 2001. He has used
Python in various capacities since 1995.

He helped organize EuroPython 2004 and 2005, and has attended and given
talks at several EuroPythons since then. He has handled the Swedish financials of the EuroPython Society since 2016 and has served as board member since 2017.

Marc-André Lemburg

Pythonista / CEO / Consultant / Coach

Marc-Andre is the CEO and founder of eGenix.com, a Python-focused project and consulting company based in Germany. He has a degree in mathematics from the University of Düsseldorf. His work with and for Python started in 1994. He became Python Core Developer in 1997, designed and implemented the Unicode support in Python and continued to maintain the Python Unicode implementation for more than a decade. Marc-Andre is a founding member of the Python Software Foundation (PSF) and has served on the PSF Board several times.

In 2002, Marc-Andre was on the executive committee to run the first EuroPython conference in Charleroi, Belgium. He also co-organized the second EuroPython 2003 conference. Since then, he has attended every single EuroPython conference and continued being involved in the workings of the conference organization.

He was elected as board member of the EuroPython Society (EPS) in 2012 and enjoyed the last few years working with the EPS board members on steering the EuroPython conference to the new successful EuroPython Workgroup structures to sustain the continued growth, while maintaining the EuroPython spirit and fun aspect of the conference.

For the EuroPython 2017 and 2018 edition, Marc-Andre was chair of the EuroPython Society and ran lots of activities around the conference organization, e.g. managing the contracts and budget, helping with sponsors,  the website, setting up the conference app, writing blog posts and many other things that were needed to make EuroPython happen.

Going forward, he would like to intensify work on turning the EPS into an organization which aids Python adoption in Europe not only by running the EuroPython conference, but also by helping build organizer networks and providing financial help to other Python conferences in Europe.

Dr. Valeria Pettorino

PhD in physics / Astrophysics / Data Science / Space Missions / Python user

Valeria has more than 12 years of experience in research, communication and project management, in Italy, the US, Switzerland, Germany and France. Since December 2016 she has been a permanent research engineer at CEA (Commissariat à l’énergie atomique) in Paris-Saclay. She is part of the international collaborations for the ESA/NASA Planck and Euclid space missions; among other projects, she is leading the forecast Taskforce that predicts Euclid performance.

She has been using Python both in astrophysics (for plotting and data interpretation) and for applications to healthcare IoT. She is an alumna of the Science to Data Science (S2DS) program and is passionate about the transfer of knowledge between industry and academia.

She took part in EuroPython 2016 as a speaker and since then has helped co-organize EuroPython 2017 and 2018 in different WGs. She is an invited mentor for women in physics for the Supernova Foundation (http://supernovafoundation.org/), a remote worldwide program.

Mario Thiel

Pythonista

Mario has been helping a lot with EuroPython in recent years, mostly working on supporting attendees through the helpdesk, on-site to make sure setup and tear-down run smoothly and more recently also on the sponsors WG.

Mario will unfortunately not be able to attend EuroPython this year, but would still feel honored to be voted in to the board.

Silvia Uberti

Sysadmin / IT Consultant

She is a Sysadmin with a degree in Network Security, really passionate about technology, traveling and her piano.  

She’s an advocate for women in STEM disciplines and supports inclusiveness of underrepresented people in tech communities.

She fell in love with Python and its warm community during PyCon Italia in 2014 and became a member of the EuroPython Sponsor Workgroup in 2017.
She enjoys working in it a lot and wants to help more!


What does the EPS Board do ?

The EPS board runs the day-to-day business of the EuroPython Society, including running the EuroPython conference events. It is allowed to enter contracts for the society and handle any issues that have not been otherwise regulated in the bylaws or by the General Assembly. Most business is handled by email on the board mailing list or in the board’s Telegram group, and board meetings are usually run as conference calls.

It is important to note that the EPS board is an active board, i.e. the board members are expected to put in a significant amount of time and effort towards the goals of the EPS and for running the EuroPython conference. This usually means at least 100-200h work over a period of one year, with most of this being needed in the last six months before the conference. Many board members put in even more work to make sure that the EuroPython conferences become a success.

Board members are generally expected to take on leadership roles within the EuroPython Workgroups.

Enjoy,

EuroPython Society

Weekly Python StackOverflow Report: (cxxxv) stackoverflow python report


Justin Mayer: Python Development Environment on macOS High Sierra


While installing Python and Virtualenv on macOS High Sierra can be done several ways, this tutorial will guide you through the process of configuring a stock Mac system into a solid Python development environment.

First steps

This guide assumes that you have already installed Homebrew. For details, please follow the steps in the macOS Configuration Guide.

Python

We are going to install the latest version of Python via Homebrew. Why bother, you ask, when Apple includes Python along with macOS? Here are some reasons:

  • When using the bundled Python, macOS updates can nuke your Python packages, forcing you to re-install them.
  • As new versions of Python are released, the Python bundled with macOS will become out-of-date. Homebrew always has the most recent version.
  • Apple has made significant changes to its bundled Python, potentially resulting in hidden bugs.
  • Homebrew’s Python includes the latest versions of Pip and Setuptools (Python package management tools).

Use the following command to install Python 3.x via Homebrew:

brew install python

You’ve already modified your PATH as mentioned in the macOS Configuration Guide, right? If not, please do so now.

If you need to install the deprecated, legacy Python version 2.7, you can also install it via:

brew install python@2

This makes it easy to test your code on both Python 3 and 2.7. More importantly, since Python 3 is the present and future of all things Python, the examples below assume you have installed Python 3.

Pip

Let’s say you want to install a Python package, such as the fantastic Virtualenv environment isolation tool. While nearly every Python-related article for macOS tells the reader to install it via sudo pip install virtualenv, the downsides of this method include:

  1. installs with root permissions
  2. installs into the system /Library
  3. yields a less reliable environment when using Homebrew’s Python

As you might have guessed by now, we’re going to use the tools provided by Homebrew to install the Python packages that we want to be globally available. When installing via Homebrew’s Pip, packages will be installed to /usr/local/lib/python{version}/site-packages, with binaries placed in /usr/local/bin.

Homebrew recently changed the names of Python-related binaries to avoid potential confusion with those bundled with macOS. As a result, pip became pip2, et cetera. Between this change and the many new improvements in Python 3, it seems a good time to start using pip3 for all the examples that will follow below. If you don’t want to install Python 3 or would prefer your global packages to use the older, deprecated Python 2.7, you can replace the relevant invocations below with pip2 instead.

Version control (optional)

The first thing I pip-install is Mercurial, since I have Mercurial repositories that I push to both Bitbucket and GitHub. If you don’t want to install Mercurial, you can skip ahead to the next section.

The following command will install Mercurial and hg-git:

pip3 install Mercurial hg-git

At a minimum, you’ll need to add a few lines to your .hgrc file in order to use Mercurial:

vim ~/.hgrc

The following lines should get you started; just be sure to change the values to your name and email address, respectively:

[ui]
username = YOUR NAME <address@example.com>

To test whether Mercurial is configured and ready for use, run the following command:

hg debuginstall

If the last line in the response is “no problems detected”, then Mercurial has been installed and configured properly.

Virtualenv

Python packages installed via the steps above are global in the sense that they are available across all of your projects. That can be convenient at times, but it can also create problems. For example, sometimes one project needs the latest version of Django, while another project needs an older Django version to retain compatibility with a critical third-party extension. This is one of many use cases that Virtualenv was designed to solve. On my systems, only a handful of general-purpose Python packages (such as Mercurial and Virtualenv) are globally available — every other package is confined to virtual environments.

With that explanation behind us, let’s install Virtualenv:

pip3 install virtualenv

Create some directories to store our projects, virtual environments, and Pip configuration file, respectively:

mkdir -p ~/Projects ~/Virtualenvs ~/Library/Application\ Support/pip

We’ll then open Pip’s configuration file (which may be created if it doesn’t exist yet)…

vim ~/Library/Application\ Support/pip/pip.conf

… and add some lines to it:

[install]
require-virtualenv = true

[uninstall]
require-virtualenv = true

Now we have Virtualenv installed and ready to create new virtual environments, which we will store in ~/Virtualenvs. New virtual environments can be created via:

cd ~/Virtualenvs
virtualenv foobar

If you have both Python 2.x and 3.x and want to create a Python 3.x virtualenv:

virtualenv -p python3 foobar-py3

… which makes it easier to switch between Python 2.x and 3.x foobar environments.

Restricting Pip to virtual environments

What happens if we think we are working in an active virtual environment, but there actually is no virtual environment active, and we install something via pip3 install foobar? Well, in that case the foobar package gets installed into our global site-packages, defeating the purpose of our virtual environment isolation.

In an effort to avoid mistakenly Pip-installing a project-specific package into my global site-packages, I previously used easy_install for global packages and the virtualenv-bundled Pip for installing packages into virtual environments. That accomplished the isolation objective, since Pip was only available from within virtual environments, making it impossible for me to pip3 install foobar into my global site-packages by mistake. But easy_install has some deficiencies, such as the inability to uninstall a package, and I found myself wanting to use Pip for both global and virtual environment packages.

Thankfully, Pip has an undocumented setting (source) that tells it to bail out if there is no active virtual environment, which is exactly what I want. In fact, we’ve already set that above, via the require-virtualenv = true directive in Pip’s configuration file. For example, let’s see what happens when we try to install a package in the absence of an activated virtual environment:

$ pip3 install markdown
Could not find an activated virtualenv (required).

Perfect! But once that option is set, how do we install or upgrade a global package? We can temporarily turn off this restriction by defining a new function in ~/.bash_profile:

gpip() {
    PIP_REQUIRE_VIRTUALENV="0" pip3 "$@"
}

(As usual, after adding the above you must run source ~/.bash_profile for the change to take effect.)

If in the future we want to upgrade our global packages, the above function enables us to do so via:

gpip install --upgrade pip setuptools wheel virtualenv

You could achieve the same effect via env PIP_REQUIRE_VIRTUALENV="0" pip3 install --upgrade foobar, but that’s much more cumbersome to type.

Creating virtual environments

Let’s create a virtual environment for Pelican, a Python-based static site generator:

cd ~/Virtualenvs
virtualenv pelican

Change to the new environment and activate it via:

cd pelican
source bin/activate

To install Pelican into the virtual environment, we’ll use pip:

pip3 install pelican markdown

For more information about virtual environments, read the Virtualenv docs.

Dotfiles

These are obviously just the basic steps to getting a Python development environment configured. Feel free to also check out my dotfiles (GitHub mirror).

If you found this article to be useful, please follow me on Twitter. Also, if you are interested in server security monitoring, be sure to sign up for early access to Monitorial!

Philip Semanchuk: A Python 2 to 3 Migration Guide


July 2018 update: I’ll be giving a talk based on this guide at PyOhio next week. If you’re there, please come say hello!

It’s not always obvious, but migrating from Python 2 to 3 doesn’t have to be an overwhelming effort spike. I’ve done Python 2-to-3 migration assessments with several organizations, and in each case we were able to turn the unknowns into a set of straightforward to-do lists.

I’ve written a Python 2-to-3 migration guide [PDF] to help others who want to make the leap but aren’t sure where to start, or have maybe already begun but would like another perspective. It outlines some high level steps for the migration and also contains some nitty-gritty technical details, so it’s useful for both those who will plan the migration and the technical staff that will actually perform it.

The (very brief) summary is that most of the work can be done in advance without sacrificing Python 2 compatibility. What’s more, you can divide the work into manageable chunks that you can tick off one by one as you have time to work on them. Last but not least, many of the changes are routine and mechanical (for example, changing the print statement to a function), and there are tools that do a lot of the work for you.
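
For example, the print change can usually be made while keeping Python 2 compatibility. This is a generic sketch rather than an excerpt from the guide:

# Python 2 only:
# print "total:", total

# Works on both Python 2 (thanks to the __future__ import) and Python 3:
from __future__ import print_function

total = 42
print("total:", total)

Tools such as 2to3 (shipped with Python) can apply many rewrites of this kind automatically.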

You can download the migration guide here [PDF]. Please feel free to share; it’s licensed under a Creative Commons Attribution-ShareAlike license.

Feedback is welcome, either via email or in the comments below.

 

Mike Driscoll: PyDev of the Week: Christopher Neugebauer


This week we welcome Christopher Neugebauer (@chrisjrn) as our PyDev of the Week! Christopher helped organize North Bay Python and PyCon Australia. He is also a fellow of the Python Software Foundation. You can catch up with him on Github or on his website. If you are interested in being a part of the North Bay Python 2018 conference, you can submit a proposal here.

Now let’s take a few moments to get to know him better!

Can you tell us a little about yourself (hobbies, education, etc):

I’m originally from Hobart, the capital city of the Australian state of Tasmania, but these days, I live with my fiancé, Josh, in Petaluma, California. I hold a BSc in Computer Science and Mathematics from the University of Tasmania, and my current day job is in software engineering: I currently work on search ranking at Shutterstock. I’m a volunteer conference organiser for the rest of the day; my current project is the North Bay Python conference, but I’ve also been a co-lead organiser of PyCon Australia. I’ve been a Fellow of the Python Software Foundation since 2013.

On that note, we just announced dates for North Bay Python 2018 — it’ll be on November 3 & 4 in Historic Downtown Petaluma, California. You should follow us on Twitter at @northbaypython, or at least sign up to our newsletter at northbaypython.org.

Why did you start using Python?

In year 10 at school, I was taking part in a programming competition with one of my friends. One question involved finding some facts about a really big number, and I’d heard that Python supports bignum arithmetic out of the box. I tried it out, and of course, using Python made it really easy.

From there, I realised that Python might be able to make lots of other things easy… and it did! I’ve stuck with Python ever since.

What other programming languages do you know and which is your favorite?

My first programming language was Microsoft QBasic, and I’ve also been known to write C, C++, PHP, tiny bits of Perl, Objective-C, bash, and AWK. The one that I keep ending up having to write, though, is Java.

Over the last few months, I’ve had the pleasure of working on a well-written Java 8 codebase, and I’m super-impressed with the work they’ve done adding anonymous functions and stream processing to the language. It’s nowhere near as tidy as Python’s list comprehensions and generators, but the number of complicated for loops I’ve needed to write has drastically reduced.

For this Python programmer, having the language I often use out of necessity fit my brain better is great!

What projects are you working on now?

The North Bay Python conference is my main side project, and alongside that I maintain a conference ticket sales system that sits alongside Symposion. It’s called Registrasion, and I wrote it as part of a PSF grant over the course of 2016. It’s built around selling tickets for an event like linux.conf.au (which has multiple dinners and networking sessions that are available to different people), but it proved flexible enough to work with smaller, simpler events like North Bay Python.

Which Python libraries are your favorite (core or 3rd party)?

Lately I’ve been doing a lot of work testing differences between different search ranking techniques. My toolbelt for a lot of that work involves `concurrent.futures`, and `csv` from the standard library, and Kenneth Reitz et al’s wonderful `requests` library.

While Python’s great for building big and maintainable bits of code, being able to throw together small library modules to solve ad-hoc tasks has been a huge win for me.

I see you do a lot of Python conference organizing. How did you get into that?

For starters, I ran my University’s Computer Science society/club for a few years, and that involved finding speakers and scheduling meetings and all that… but my organising career _really_ started when Richard Jones messaged me out of the blue in 2010 and reminded me that I’d volunteered to run PyCon Australia in 2012. I recalled doing no such thing, but I went along with it anyway.

Two PyCon AU’s later and I was hooked. I’ve since run linux.conf.au 2017, and after I moved to the US, I’ve helped found and run North Bay Python!

Can you give any advice to someone who would like to start a regional conference?

Of course! Talk to other conference organisers. It turns out there’s a lot of shared knowledge out there, especially in the Python world, and like with most other Python things, we’re a pretty friendly bunch. Finding sponsorship leads, getting an idea of what tradeoffs you need to make, or getting warm introductions to people who you really want to speak at your event is _so_ much easier when there are people who’ve done it all before. We also run a yearly “Regional PyCons” BoF session at PyCon US, which also makes it really easy to make yourself known.

Is there anything else you’d like to say?

One thing I really appreciate about the Python community is that I’ve been able to build up my reputation and skillset as an organiser and community builder. I’ve only been working in a full-time job that’s primarily Python for the last 5 months, and I’ve only made substantial code contributions to open source Python projects in the last two years.

I’m pretty sure I only had the means or ability to do either of these because the community recognises me as capable, and that definitely wasn’t from my code!

Thanks for doing the interview!

Bhishan Bhandari: File Handling in Python


Python has convenient built-ins to work with files. The intention of this post is to discuss the various modes of open() and see them through examples. open() is a built-in function that returns a file object, also called a handle, which is used to read or modify the file accordingly. We will start by […]
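
As a brief, hedged illustration of the kind of modes the post walks through (the file name notes.txt is just an example):

# 'w' truncates or creates the file, 'a' appends, 'r' (the default) reads.
with open('notes.txt', 'w') as handle:
    handle.write('first line\n')

with open('notes.txt', 'a') as handle:
    handle.write('second line\n')

with open('notes.txt') as handle:   # mode defaults to 'r'
    print(handle.read())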

The post File Handling in Python appeared first on The Tara Nights.


Matthew Rocklin: Pickle isn't slow, it's a protocol


This work is supported by Anaconda Inc

tl;dr: Pickle isn’t slow, it’s a protocol. Protocols are important for ecosystems.

A recent Dask issue showed that using Dask with PyTorch was slow because sending PyTorch models between Dask workers took a long time (Dask GitHub issue).

This turned out to be because serializing PyTorch models with pickle was very slow (1 MB/s for GPU based models, 50 MB/s for CPU based models). There is no architectural reason why this needs to be this slow. Every part of the hardware pipeline is much faster than this.

We could have fixed this in Dask by special-casing PyTorch models (Dask has its own optional serialization system for performance), but being good ecosystem citizens, we decided to raise the performance problem in an issue upstream (PyTorch Github issue). This resulted in a five-line fix to PyTorch that turned a 1-50 MB/s serialization bandwidth into a 1 GB/s bandwidth, which is more than fast enough for many use cases (PR to PyTorch).

     def __reduce__(self):
-        return type(self), (self.tolist(),)
+        b = io.BytesIO()
+        torch.save(self, b)
+        return (_load_from_bytes, (b.getvalue(),))
+def _load_from_bytes(b):
+    return torch.load(io.BytesIO(b))

Thanks to the PyTorch maintainers this problem was solved pretty easily. PyTorch tensors and models now serialize efficiently in Dask or in any other Python library that might want to use them in distributed systems like PySpark, IPython parallel, Ray, or anything else without having to add special-case code or do anything special. We didn’t solve a Dask problem, we solved an ecosystem problem.

However before we solved this problem we discussed things a bit. This comment stuck with me:

[GitHub image of a maintainer saying that PyTorch's pickle implementation is slow]

This comment contains two beliefs that are both very common, and that I find somewhat counter-productive:

  1. Pickle is slow
  2. You should use our specialized methods instead

I’m sort of picking on the PyTorch maintainers here a bit (sorry!) but I’ve found that they’re quite widespread, so I’d like to address them here.

Pickle is slow

Pickle is not slow. Pickle is a protocol. We implement pickle. If it’s slow then it is our fault, not Pickle’s.

To be clear, there are many reasons not to use Pickle.

  • It’s not cross-language
  • It’s not very easy to parse
  • It doesn’t provide random access
  • It’s insecure
  • etc..

So you shouldn’t store your data or create public services using Pickle, but for things like moving data on a wire it’s a great default choice if you’re moving strictly from Python processes to Python processes in a trusted and uniform environment.

It’s great because it’s as fast as you can make it (up to a memory copy), and other libraries in the ecosystem can use it without needing to special-case your code into theirs.

This is the change we did for PyTorch.

     def __reduce__(self):
-        return type(self), (self.tolist(),)
+        b = io.BytesIO()
+        torch.save(self, b)
+        return (_load_from_bytes, (b.getvalue(),))
+def _load_from_bytes(b):
+    return torch.load(io.BytesIO(b))

The slow part wasn’t Pickle, it was the .tolist() call within __reduce__ that converted a PyTorch tensor into a list of Python ints and floats. I suspect that the common belief of “Pickle is just slow” stopped anyone else from investigating the poor performance here. I was surprised to learn that a project as active and well maintained as PyTorch hadn’t fixed this already.

As a reminder, you can implement the pickle protocol by providing the __reduce__ method on your class. The __reduce__ function returns a loading function and sufficient arguments to reconstitute your object. Here we used torch’s existing save/load functions to create a bytestring that we could pass around.
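
As a generic sketch of the protocol (the Blob class and _rebuild helper are made up for illustration; this is not PyTorch's actual code):

import pickle

def _rebuild(payload):
    # Loading function: given the saved arguments, reconstruct the object
    return Blob(payload)

class Blob:
    def __init__(self, payload):
        self.payload = payload

    def __reduce__(self):
        # Return a callable plus the arguments pickle should pass to it
        return (_rebuild, (self.payload,))

restored = pickle.loads(pickle.dumps(Blob(b'hello')))
assert restored.payload == b'hello'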

Just use our specialized option

Specialized options can be great. They can have nice APIs with many options, they can tune themselves to specialized communication hardware if it exists (like RDMA or NVLink), and so on. But people need to learn about them first, and learning about them can be hard in two ways.

Hard for users

Today we use a large and rapidly changing set of libraries. It’s hard for users to become experts in all of them. Increasingly we rely on new libraries making it easy for us by adhering to standard APIs, providing informative error messages that lead to good behavior, and so on.

Hard for other libraries

Other libraries that need to interact definitely won’t read the documentation, and even if they did it’s not sensible for every library to special case every other library’s favorite method to turn their objects into bytes. Ecosystems of libraries depend strongly on the presence of protocols and a strong consensus around implementing them consistently and efficiently.

Sometimes Specialized Options are Appropriate

There are good reasons to support specialized options. Sometimes you need more than 1GB/s bandwidth. While this is rare in general (very few pipelines process faster than 1GB/s/node), it is true in the particular case of PyTorch when they are doing parallel training on a single machine with multiple processes. Soumith (PyTorch maintainer) writes the following:

When sending Tensors over multiprocessing, our custom serializer actually shortcuts them through shared memory, i.e. it moves the underlying Storages to shared memory and restores the Tensor in the other process to point to the shared memory. We did this for the following reasons:

  • Speed: we save on memory copies, especially if we amortize the cost of moving a Tensor to shared memory before sending it into the multiprocessing Queue. The total cost of actually moving a Tensor from one process to another ends up being O(1), and independent of the Tensor’s size

  • Sharing: If Tensor A and Tensor B are views of each other, once we serialize and send them, we want to preserve this property of them being views. This is critical for neural-nets where it’s common to re-view the weights / biases and use them for another. With the default pickle solution, this property is actually lost.

Dataquest: Top 20 Python AI and Machine Learning Open Source Projects


Getting into Machine Learning and AI is not an easy task. Many aspiring professionals and enthusiasts find it hard to establish a proper path into the field, given the enormous amount of resources available today. The field is evolving constantly and it is crucial that we keep up with the pace of this rapid development. In order to cope with this overwhelming speed of evolution and innovation, a good way to stay updated and knowledgeable on the advances of ML, is to engage with the community by contributing to the many open-source projects and tools that are used daily by advanced professionals.

Here we update the information and examine the trends since our previous post Top 20 Python Machine Learning Open Source Projects (Nov 2016).

Tensorflow has moved to the first place with triple-digit growth in contributors. Scikit-learn dropped to 2nd place, but still has a very large base of contributors.

Compared to 2016, the projects with the fastest growth in number of contributors were

  1. TensorFlow, 169% up, from 493 to 1324 contributors
  2. Deap, 86% up, from 21 to 39 contributors
  3. Chainer, 83% up, from 84 to 154 contributors
  4. Gensim, 81% up, from 145 to 262 contributors
  5. Neon, 66% up, from 47 to 78 contributors
  6. Nilearn, 50% up, from 46 to 69 contributors

Also new in 2018:

  1. Keras, 629 contributors
  2. PyTorch, 399 contributors


Fig. 1: Top 20 Python AI and Machine Learning projects on Github.

Size is proportional to the number of contributors, and color represents the change in the number of contributors - red is higher, blue is lower. Snowflake shape is for Deep Learning projects, round for other projects.

We see that Deep Learning projects like TensorFlow, Theano, and Caffe are among the most popular.

The list below gives projects in descending order based on the number of contributors on Github. The change in number of contributors is versus 2016 KDnuggets Post on Top 20 Python Machine Learning Open Source Projects.

We hope you enjoy going through the documentation pages of each of these to start collaborating and learning the ways of Machine Learning using Python.

  1. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google’s Machine Intelligence research organization. The system is designed to facilitate research in machine learning, and to make it quick and easy to transition from research prototype to production system.
    Contributors: 1324 (168% up), Commits: 28476, Stars: 92359. Github URL: Tensorflow

  2. Scikit-learn provides simple and efficient tools for data mining and data analysis, accessible to everybody and reusable in various contexts. It is built on NumPy, SciPy, and matplotlib, open source, and commercially usable under the BSD license.
    Contributors: 1019 (39% up), Commits: 22575, Github URL: Scikit-learn

  3. Keras, a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano.
    Contributors: 629 (new), Commits: 4371, Github URL: Keras

  4. PyTorch, Tensors and Dynamic neural networks in Python with strong GPU acceleration.
    Contributors: 399 (new), Commits: 6458, Github URL: pytorch

  5. Theano allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.
    Contributors: 327 (24% up), Commits: 27931, Github URL: Theano

  6. Gensim is a free Python library with features such as scalable statistical semantics; it can analyze plain-text documents for semantic structure and retrieve semantically similar documents.
    Contributors: 262 (81% up), Commits: 3549, Github URL: Gensim

  7. Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and community contributors.
    Contributors: 260 (21% up), Commits: 4099, Github URL: Caffe

  8. Chainer is a Python-based, standalone open source framework for deep learning models. Chainer provides a flexible, intuitive, and high performance means of implementing a full range of deep learning models, including state-of-the-art models such as recurrent neural networks and variational auto-encoders.
    Contributors: 154 (84% up), Commits: 12613, Github URL: Chainer

  9. Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics are available for different types of data and each estimator.
    Contributors: 144 (33% up), Commits: 9729, Github URL: Statsmodels

  10. Shogun is a machine learning toolbox which provides a wide range of unified and efficient Machine Learning (ML) methods. The toolbox seamlessly allows you to combine multiple data representations, algorithm classes, and general purpose tools.
    Contributors: 139 (32% up), Commits: 16362, Github URL: Shogun

  11. Pylearn2 is a machine learning library. Most of its functionality is built on top of Theano. This means you can write Pylearn2 plugins (new models, algorithms, etc) using mathematical expressions, and Theano will optimize and stabilize those expressions for you, and compile them to a backend of your choice (CPU or GPU).
    Contributors: 119 (3.5% up), Commits: 7119, Github URL: Pylearn2

  12. NuPIC is an open source project based on a theory of neocortex called Hierarchical Temporal Memory (HTM). Parts of HTM theory have been implemented, tested, and used in applications, and other parts of HTM theory are still being developed.
    Contributors: 85 (12% up), Commits: 6588, Github URL: NuPIC

  13. Neon is Nervana's Python-based deep learning library. It provides ease of use while delivering the highest performance.
    Contributors: 78 (66% up), Commits: 1112, Github URL: Neon

  14. Nilearn is a Python module for fast and easy statistical learning on NeuroImaging data. It leverages the scikit-learn Python toolbox for multivariate statistics with applications such as predictive modelling, classification, decoding, or connectivity analysis.
    Contributors: 69 (50% up), Commits: 6198, Github URL: Nilearn

  15. Orange3 is open source machine learning and data visualization for novices and experts, offering interactive data analysis workflows with a large toolbox.
    Contributors: 53 (33% up), Commits: 8915, Github URL: Orange3

  16. Pymc is a python module that implements Bayesian statistical models and fitting algorithms, including Markov chain Monte Carlo. Its flexibility and extensibility make it applicable to a large suite of problems.
    Contributors: 39 (5.4% up), Commits: 2721, Github URL: Pymc

  17. Deap is a novel evolutionary computation framework for rapid prototyping and testing of ideas. It seeks to make algorithms explicit and data structures transparent. It works in perfect harmony with parallelisation mechanism such as multiprocessing and SCOOP.
    Contributors: 39 (86% up), Commits: 1960, Github URL: Deap

  18. Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mapped into memory so that many processes may share the same data.
    Contributors: 35 (46% up), Commits: 527, Github URL: Annoy

  19. PyBrain is a modular Machine Learning Library for Python. Its goal is to offer flexible, easy-to-use yet still powerful algorithms for Machine Learning Tasks and a variety of predefined environments to test and compare your algorithms.
    Contributors: 32 (3% up), Commits: 992, Github URL: PyBrain

  20. Fuel is a data pipeline framework which provides your machine learning models with the data they need. It is planned to be used by both the Blocks and Pylearn2 neural network libraries.
    Contributors: 32 (10% up), Commits: 1116, Github URL: Fuel

The contributor and commit numbers were recorded in February 2018.

Editor's note: This was originally posted on KDNuggets, and has been reposted with permission. Author Ilan Reinstein is a physicist and data scientist.

Sylvain Hellegouarch: How Python and Guido got me my first job and many afterwards

$
0
0

Recently, Guido van Rossum, creator and leader of the Python programming language announced, quite out of the blue for the distant pythonista, that he was resigning from his role as the leader of the project, the well-known BDFL.

This is no small news as Guido has created the language far back in the early 90s and has stuck with its community forever since.

What saddened me was to see that he left, to a certain degree, due to the harshness of the discussion around PEP 572. All things considered, this PEP doesn’t seem like it should have led to such an outcome, yet it did. As a community we ought to reflect on this event. I assume it was not just this single PEP that forced GvR to make that decision, but years of fighting; maybe this one simply went too far.

I started with Python back in 2001 with a first personal project, an IRC client for Python. The code is long gone (and that’s probably better for my ego). But, I do recall joining the #python channel back then, asking a newbie question and being left with a feeling I was stupid. I left that channel and never came back. But I stuck with Python, as a language, because it is such a pleasure to work with (even though, back then I mostly had to work on Zope and Plone. Ouch).

When I joined the CherryPy project a year later, I found a very welcoming community and, when I created the project’s IRC channel a couple of years later, I always made sure that newbies wouldn’t feel the way I had. If a given question is asked repeatedly, I think it’s best to question our documentation quality rather than the person who asked. That led me to propose a new documentation for the project a few years ago.

In all those years, I worked at various companies and learnt different new languages, some of which I had a lot of fun with (for instance, Erlang was really sweet to learn, and a couple of years ago I played a little with Clojure with some interest). But for my personal projects, I always ended up with Python. This language is so powerful and versatile. It’s not better in every context, but it does a fine job of finding the balance between capabilities, readability, maintainability and performance. Its ecosystem is rich and some of its communities are really nice and kind.

Python landed me a few jobs and, today is the backbone of my own company’s products at ChaosIQ such as the Chaos Toolkit and ChaosHub.

So, thank you Mr van Rossum and all the folks leading the project. Not only have you given me tools to build my career, but you also made it fun for so long, and even more so with recent Python 3 versions. I hope you’ll stick by the project for a long time!

Real Python: Fast, Flexible, Easy and Intuitive: How to Speed Up Your Pandas Projects


If you work with big data sets, you probably remember the “aha” moment along your Python journey when you discovered the Pandas library. Pandas is a game-changer for data science and analytics, particularly if you came to Python because you were searching for something more powerful than Excel and VBA.

So what is it about Pandas that has data scientists, analysts, and engineers like me raving? Well, the Pandas documentation says that it uses:

“fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.”

Fast, flexible, easy, and intuitive? That sounds great! If your job involves building complicated data models, you don’t want to spend half of your development hours waiting for modules to churn through big data sets. You want to dedicate your time and brainpower to interpreting your data, rather than painstakingly fumbling around with less powerful tools.

But I Heard That Pandas Is Slow…

When I first started using Pandas, I was advised that, while it was a great tool for dissecting data, Pandas was too slow to use as a statistical modeling tool. Starting out, this proved true. I spent more than a few minutes twiddling my thumbs, waiting for Pandas to churn through data.

But then I learned that Pandas is built on top of the NumPy array structure, and so many of its operations are carried out in C, either via NumPy or through Pandas’ own library of Python extension modules that are written in Cython and compiled to C. So, shouldn’t Pandas be fast too?

It absolutely should be, if you use it the way it was intended!

The paradox is that what may otherwise “look like” Pythonic code can be suboptimal in Pandas as far as efficiency is concerned. Like NumPy, Pandas is designed for vectorized operations that operate on entire columns or datasets in one sweep. Thinking about each “cell” or row individually should generally be a last resort, not a first.

This Tutorial

To be clear, this is not a guide about how to over-optimize your Pandas code. Pandas is already built to run quickly if used correctly. Also, there’s a big difference between optimization and writing clean code.

This is a guide to using Pandas Pythonically to get the most out of its powerful and easy-to-use built-in features. Additionally, you will learn a couple of practical time-saving tips, so you won’t be twiddling those thumbs every time you work with your data.

In this tutorial, you’ll cover the following:

  • Advantages of using datetime data with time series
  • The most efficient route to doing batch calculations
  • Saving time by storing data with HDFStore

To demonstrate these topics, I’ll take an example from my day job that looks at a time series of electricity consumption. After loading the data, you’ll successively progress through more efficient ways to get to the end result. One adage that holds true for most of Pandas is that there is more than one way to get from A to B. This doesn’t mean, however, that all of the available options will scale equally well to larger, more demanding datasets.

Assuming that you already know how to do some basic data selection in Pandas, let’s get started.

The Task at Hand

The goal of this example will be to apply time-of-use energy tariffs to find the total cost of energy consumption for one year. That is, at different hours of the day, the price for electricity varies, so the task is to multiply the electricity consumed for each hour by the correct price for the hour in which it was consumed.

Let’s read our data from a CSV file that has two columns: one for date plus time and one for electrical energy consumed in kilowatt hours (kWh):

CSV data

The rows contain the electricity used in each hour, so there are 365 x 24 = 8760 rows for the whole year. Each row indicates the usage for the “hour starting” at that time, so 1/1/13 0:00 indicates the usage for the first hour of January 1st.

Saving Time With Datetime Data

The first thing you need to do is to read your data from the CSV file with one of Pandas’ I/O functions:

>>> import pandas as pd
>>> pd.__version__
'0.23.1'

# Make sure that `demand_profile.csv` is in your
# current working directory.
>>> df = pd.read_csv('demand_profile.csv')
>>> df.head()
     date_time  energy_kwh
0  1/1/13 0:00       0.586
1  1/1/13 1:00       0.580
2  1/1/13 2:00       0.572
3  1/1/13 3:00       0.596
4  1/1/13 4:00       0.592

This looks okay at first glance, but there’s a small issue. Pandas and NumPy have a concept of dtypes (data types). If no arguments are specified, date_time will take on an object dtype:

>>> df.dtypes
date_time      object
energy_kwh    float64
dtype: object

>>> type(df.iat[0, 0])
str

This is not ideal. object is a container for not just str, but any column that can’t neatly fit into one data type. It would be arduous and inefficient to work with dates as strings. (It would also be memory-inefficient.)

For working with time series data, you’ll want the date_time column to be formatted as an array of datetime objects. (Pandas calls this a Timestamp.) Pandas makes each step here rather simple:

>>> df['date_time'] = pd.to_datetime(df['date_time'])
>>> df['date_time'].dtype
datetime64[ns]

(Note that you could alternatively use a Pandas PeriodIndex in this case.)

You now have a DataFrame called df that looks much like our CSV file. It has two columns and a numerical index for referencing the rows.

>>> df.head()
               date_time    energy_kwh
0    2013-01-01 00:00:00         0.586
1    2013-01-01 01:00:00         0.580
2    2013-01-01 02:00:00         0.572
3    2013-01-01 03:00:00         0.596
4    2013-01-01 04:00:00         0.592

The code above is simple and easy, but how fast is it? Let’s put it to the test using a timing decorator, which I have unoriginally called @timeit. This decorator largely mimics timeit.repeat() from Python’s standard library, but it allows you to return the result of the function itself and print its average runtime from multiple trials. (Python’s timeit.repeat() returns the timing results, not the function result.)
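
The decorator itself isn’t reproduced in this excerpt; a rough sketch of one way it could be written (the details here are an assumption, not the article’s exact code) might be:

import functools
import time

def timeit(repeat=3, number=10):
    """Sketch of a timing decorator: `number` calls per trial, `repeat` trials."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            per_call_times = []
            for _ in range(repeat):                 # outer loop: trials
                start = time.perf_counter()
                for _ in range(number):             # inner loop: calls per trial
                    result = func(*args, **kwargs)
                per_call_times.append((time.perf_counter() - start) / number)
            print(f'Best of {repeat} trials with {number} function calls per trial:')
            print(f'Function `{func.__name__}` ran in average of '
                  f'{min(per_call_times):.3f} seconds.')
            return result                           # unlike timeit.repeat(), return the result
        return wrapper
    return decorator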

Creating a function and placing the @timeit decorator directly above it will mean that every time the function is called, it will be timed. The decorator runs an outer loop and an inner loop:

>>> @timeit(repeat=3, number=10)
... def convert(df, column_name):
...     return pd.to_datetime(df[column_name])

>>> # Read in again so that we have `object` dtype to start
>>> df['date_time'] = convert(df, 'date_time')
Best of 3 trials with 10 function calls per trial:
Function `convert` ran in average of 1.610 seconds.

The result? 1.6 seconds for 8760 rows of data. “Great,” you might say, “that’s no time at all.” But what if you encounter larger data sets—say, one year of electricity use at one-minute intervals? That’s 60 times more data, so you’ll end up waiting around one and a half minutes. That’s starting to sound less tolerable.

In actuality, I recently analyzed 10 years of hourly electricity data from 330 sites. Do you think I waited 88 minutes to convert datetimes? Absolutely not!

How can you speed this up? As a general rule, Pandas will be far quicker the less it has to interpret your data. In this case, you will see huge speed improvements just by telling Pandas what your time and date data looks like, using the format parameter. You can do this by using the strftime codes found here and entering them like this:

>>> @timeit(repeat=3, number=100)
>>> def convert_with_format(df, column_name):
...     return pd.to_datetime(df[column_name],
...                           format='%d/%m/%Y %H:%M')
Best of 3 trials with 100 function calls per trial:
Function `convert_with_format` ran in average of 0.032 seconds.

The new result? 0.032 seconds, which is 50 times faster! So you’ve just saved about 86 minutes of processing time for my 330 sites. Not a bad improvement!

One finer detail is that the datetimes in the CSV are not in ISO 8601 format: you’d need YYYY-MM-DD HH:MM. If you don’t specify a format, Pandas will use the dateutil package to convert each string to a date.

Conversely, if the raw datetime data is already in ISO 8601 format, Pandas can immediately take a fast route to parsing the dates. This is one reason why being explicit about the format is so beneficial here. Another option is to pass the infer_datetime_format=True parameter, but it generally pays to be explicit.
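
For example, a single ISO 8601 string parses without any hint (illustrative only):

>>> pd.to_datetime('2013-01-01 00:00:00')   # ISO 8601, so no format argument is needed
Timestamp('2013-01-01 00:00:00')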

Simple Looping Over Pandas Data

Now that your dates and times are in a convenient format, you are ready to get down to the business of calculating your electricity costs. Remember that cost varies by hour, so you will need to conditionally apply a cost factor to each hour of the day. In this example, the time-of-use costs will be defined as follows:

Tariff Type    Cents per kWh    Time Range
Peak           28               17:00 to 24:00
Shoulder       20               7:00 to 17:00
Off-Peak       12               0:00 to 7:00

If the price were a flat 28 cents per kWh for every hour of the day, most people familiar with Pandas would know that this calculation could be achieved in one line:

>>> df['cost_cents'] = df['energy_kwh'] * 28

This will result in the creation of a new column with the cost of electricity for that hour:

               date_time  energy_kwh  cost_cents
0    2013-01-01 00:00:00       0.586      16.408
1    2013-01-01 01:00:00       0.580      16.240
2    2013-01-01 02:00:00       0.572      16.016
3    2013-01-01 03:00:00       0.596      16.688
4    2013-01-01 04:00:00       0.592      16.576
# ...

But our cost calculation is conditional on the time of day. This is where you will see a lot of people using Pandas the way it was not intended: by writing a loop to do the conditional calculation.

For the rest of this tutorial, you’ll start from a less-than-ideal baseline solution and work up to a Pythonic solution that fully leverages Pandas.

But what is Pythonic in the case of Pandas? The irony is that it is those who are experienced in other (less user-friendly) coding languages such as C++ or Java who are particularly susceptible to this, because they instinctively “think in loops.”

Let’s look at a loop approach that is not Pythonic and that many people take when they are unaware of how Pandas is designed to be used. We will use @timeit again to see how fast this approach is.

First, let’s create a function to apply the appropriate tariff to a given hour:

def apply_tariff(kwh, hour):
    """Calculates cost of electricity for given hour."""
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    elif 17 <= hour < 24:
        rate = 28
    else:
        raise ValueError(f'Invalid hour: {hour}')
    return rate * kwh

Here’s the loop that isn’t Pythonic, in all its glory:

>>> # NOTE: Don't do this!
>>> @timeit(repeat=3, number=100)
... def apply_tariff_loop(df):
...     """Calculate costs in loop.  Modifies `df` inplace."""
...     energy_cost_list = []
...     for i in range(len(df)):
...         # Get electricity used and hour of day
...         energy_used = df.iloc[i]['energy_kwh']
...         hour = df.iloc[i]['date_time'].hour
...         energy_cost = apply_tariff(energy_used, hour)
...         energy_cost_list.append(energy_cost)
...     df['cost_cents'] = energy_cost_list
...
>>> apply_tariff_loop(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_loop` ran in average of 3.152 seconds.

For people who picked up Pandas after having written “pure Python” for some time prior, this design might seem natural: you have a typical “for each x, conditional on y, do z.”

However, this loop is clunky. You can consider the above to be an “antipattern” in Pandas for several reasons. Firstly, it needs to initialize a list in which the outputs will be recorded.

Secondly, it uses the opaque object range(0, len(df)) to loop over, and then after applying apply_tariff(), it has to append the result to a list that is used to make the new DataFrame column. It also does what is called chained indexing with df.iloc[i]['date_time'], which often leads to unintended results.

But the biggest issue with this approach is the time cost of the calculations. On my machine, this loop took over 3 seconds for 8760 rows of data. Next, you’ll look at some improved solutions for iteration over Pandas structures.

Looping with .itertuples() and .iterrows()

What other approaches can you take? Well, Pandas has actually made the for i in range(len(df)) syntax redundant by introducing the DataFrame.itertuples() and DataFrame.iterrows() methods. These are both generator methods that yield one row at a time.

.itertuples() yields a namedtuple for each row, with the row’s index value as the first element of the tuple. A namedtuple is a data structure from Python’s collections module that behaves like a Python tuple but has fields accessible by attribute lookup.
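
For comparison, an .itertuples() version of the loop would look roughly like this (a sketch for illustration; it isn’t one of the article’s timed benchmarks):

def apply_tariff_itertuples(df):
    energy_cost_list = []
    for row in df.itertuples():
        # Column values are available as attributes on the namedtuple
        energy_cost = apply_tariff(row.energy_kwh, row.date_time.hour)
        energy_cost_list.append(energy_cost)
    df['cost_cents'] = energy_cost_list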

.iterrows() yields pairs (tuples) of (index, Series) for each row in the DataFrame.

While .itertuples() tends to be a bit faster, let’s stay in Pandas and use .iterrows() in this example, because some readers might not have run across namedtuple. Let’s see what this achieves:

>>> @timeit(repeat=3, number=100)
... def apply_tariff_iterrows(df):
...     energy_cost_list = []
...     for index, row in df.iterrows():
...         # Get electricity used and hour of day
...         energy_used = row['energy_kwh']
...         hour = row['date_time'].hour
...         # Append cost list
...         energy_cost = apply_tariff(energy_used, hour)
...         energy_cost_list.append(energy_cost)
...     # Create new column with cost list
...     df['cost_cents'] = energy_cost_list
...
>>> apply_tariff_iterrows(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_iterrows` ran in average of 0.713 seconds.

Some marginal gains have been made. The syntax is more explicit, and there is less clutter in your row value references, so it’s more readable. In terms of time gains, it’s almost five times quicker!

However, there is more room for improvement. You’re still using some form of a Python for-loop, meaning that each and every function call is done in Python when it could ideally be done in a faster language built into Pandas’ internal architecture.

Pandas’ .apply()

You can further improve this operation using the .apply() method instead of .iterrows(). Pandas’ .apply() method takes functions (callables) and applies them along an axis of a DataFrame (all rows, or all columns). In this example, a lambda function will help you pass the two columns of data into apply_tariff():

>>> @timeit(repeat=3, number=100)
... def apply_tariff_withapply(df):
...     df['cost_cents'] = df.apply(
...         lambda row: apply_tariff(
...             kwh=row['energy_kwh'],
...             hour=row['date_time'].hour),
...         axis=1)
...
>>> apply_tariff_withapply(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_withapply` ran in average of 0.272 seconds.

The syntactic advantages of .apply() are clear, with a significant reduction in the number of lines and very readable, explicit code. In this case, the time taken was roughly half that of the .iterrows() method.

However, this is not yet “blazingly fast.” One reason is that .apply() will try internally to loop over Cython iterators. But in this case, the lambda that you passed isn’t something that can be handled in Cython, so it’s called in Python, which is consequently not all that fast.

If you were to use .apply() for my 10 years of hourly data for 330 sites, you’d be looking at around 15 minutes of processing time. If this calculation were intended to be a small part of a larger model, you’d really want to speed things up. That’s where vectorized operations come in handy.

Selecting Data With .isin()

Earlier, you saw that if there were a single electricity price, you could apply that price across all the electricity consumption data in one line of code (df['energy_kwh'] * 28). This particular operation was an example of a vectorized operation, and it is the fastest way to do things in Pandas.

But how can you apply conditional calculations as vectorized operations in Pandas? One trick is to select and group parts of the DataFrame based on your conditions and then apply a vectorized operation to each selected group.

In this next example, you will see how to select rows with Pandas’ .isin() method and then apply the appropriate tariff in a vectorized operation. Before you do this, it will make things a little more convenient if you set the date_time column as the DataFrame’s index:

df.set_index('date_time', inplace=True)

@timeit(repeat=3, number=100)
def apply_tariff_isin(df):
    # Define hour range Boolean arrays
    peak_hours = df.index.hour.isin(range(17, 24))
    shoulder_hours = df.index.hour.isin(range(7, 17))
    off_peak_hours = df.index.hour.isin(range(0, 7))

    # Apply tariffs to hour ranges
    df.loc[peak_hours, 'cost_cents'] = df.loc[peak_hours, 'energy_kwh'] * 28
    df.loc[shoulder_hours, 'cost_cents'] = df.loc[shoulder_hours, 'energy_kwh'] * 20
    df.loc[off_peak_hours, 'cost_cents'] = df.loc[off_peak_hours, 'energy_kwh'] * 12

Let’s see how this compares:

>>> apply_tariff_isin(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_isin` ran in average of 0.010 seconds.

To understand what’s happening in this code, you need to know that the .isin() method is returning an array of Boolean values that looks like this:

[False, False, False, ..., True, True, True]

These values identify which DataFrame indices (datetimes) fall within the hour range specified. Then, when you pass these Boolean arrays to the DataFrame’s .loc indexer, you get a slice of the DataFrame that only includes rows that match those hours. After that, it is simply a matter of multiplying the slice by the appropriate tariff, which is a speedy vectorized operation.

How does this compare to our looping operations above? Firstly, you may notice that you no longer need apply_tariff(), because all the conditional logic is applied in the selection of the rows. So there is a huge reduction in the lines of code you have to write and in the Python code that is called.

What about the processing time? 315 times faster than the loop that wasn’t Pythonic, around 71 times faster than .iterrows(), and 27 times faster than .apply(). Now you are moving at the kind of speed you need to get through big data sets nice and quickly.

Can We Do Better?

In apply_tariff_isin(), we are still admittedly doing some “manual work” by calling df.loc and df.index.hour.isin() three times each. You could argue that this solution isn’t scalable if we had a more granular range of time slots. (A different rate for each hour would require 24 .isin() calls.) Luckily, you can do things even more programmatically with Pandas’ pd.cut() function in this case:

@timeit(repeat=3, number=100)
def apply_tariff_cut(df):
    cents_per_kwh = pd.cut(x=df.index.hour,
                           bins=[0, 7, 17, 24],
                           include_lowest=True,
                           labels=[12, 20, 28]).astype(int)
    df['cost_cents'] = cents_per_kwh * df['energy_kwh']

Let’s take a second to see what’s going on here. pd.cut() is applying an array of labels (our costs) according to which bin each hour belongs in. Note that the include_lowest parameter indicates whether the first interval should be left-inclusive or not. (You want to include time=0 in a group.)

This is a fully vectorized way to get to your intended result, and it comes out on top in terms of timing:

>>> apply_tariff_cut(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_cut` ran in average of 0.003 seconds.

So far, you’ve built up from taking potentially over an hour to under a second to process the full 330-site dataset. Not bad! There is one last option, though, which is to use NumPy functions to manipulate the underlying NumPy arrays for each DataFrame, and then to integrate the results back into Pandas data structures.

Don’t Forget NumPy!

One point that should not be forgotten when you are using Pandas is that Pandas Series and DataFrames are designed on top of the NumPy library. This gives you even more computation flexibility, because Pandas works seamlessly with NumPy arrays and operations.
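
A quick way to see this for yourself, assuming the df from the examples above, is to check the type of the array that backs a column:

>>> type(df['energy_kwh'].values)
<class 'numpy.ndarray'>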

In this next case you’ll use NumPy’s digitize() function. It is similar to Pandas’ cut() in that the data will be binned, but this time it will be represented by an array of indexes representing which bin each hour belongs to. These indexes are then applied to a prices array:

@timeit(repeat=3, number=100)
def apply_tariff_digitize(df):
    prices = np.array([12, 20, 28])
    bins = np.digitize(df.index.hour.values, bins=[7, 17, 24])
    df['cost_cents'] = prices[bins] * df['energy_kwh'].values

Like the cut() function, this syntax is wonderfully concise and easy to read. But how does it compare in speed? Let’s see:

>>> apply_tariff_digitize(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_digitize` ran in average of 0.002 seconds.

At this point, there’s still a performance improvement, but it’s becoming more marginal in nature. This is probably a good time to call it a day on hacking away at code improvement and think about the bigger picture.

With Pandas, it can help to maintain a hierarchy, if you will, of preferred options for doing batch calculations like you’ve done here. These will usually rank from fastest to slowest (and most to least flexible):

  1. Use vectorized operations: Pandas methods and functions with no for-loops.
  2. Use the .apply() method with a callable.
  3. Use itertuples(): iterate over DataFrame rows as namedtuples from Python’s collections module.
  4. Use iterrows(): iterate over DataFrame rows as (index, pd.Series) pairs. While a Pandas Series is a flexible data structure, it can be costly to construct each row into a Series and then access it.
  5. Use “element-by-element” for loops, updating each cell or row one at a time with df.loc or df.iloc.

Don’t Take My Word For It: The order of precedence above is a suggestion straight from a core Pandas developer.

Here’s the “order of precedence” above at work, with each function you’ve built here:

Function                    Runtime (seconds)
apply_tariff_loop()         3.152
apply_tariff_iterrows()     0.713
apply_tariff_withapply()    0.272
apply_tariff_isin()         0.010
apply_tariff_cut()          0.003
apply_tariff_digitize()     0.002

Prevent Reprocessing with HDFStore

Now that you have looked at quick data processes in Pandas, let’s explore how to avoid reprocessing time altogether with Pandas’ HDFStore.

Often when you are building a complex data model, it is convenient to do some pre-processing of your data. For example, if you had 10 years of minute-frequency electricity consumption data, simply converting the date and time to datetime might take 20 minutes, even if you specify the format parameter. You really only want to have to do this once, not every time you run your model, for testing or analysis.
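
As a rough sketch of the kind of one-off conversion meant here (the format string below is only an example and would need to match how your timestamps are actually written):

# One-off conversion from strings to datetime; format is an assumption
df['date_time'] = pd.to_datetime(df['date_time'], format='%d/%m/%y %H:%M')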

A very useful thing you can do here is pre-process and then store your data in its processed form to be used when needed. But how can you store data in the right format without having to reprocess it again? If you were to save as CSV, you would simply lose your datetime objects and have to re-process it when accessing again.

Pandas has a built-in solution for this which uses HDF5 , a high-performance storage format designed specifically for storing tabular arrays of data. Pandas’ HDFStore class allows you to store your DataFrame in an HDF5 file so that it can be accessed efficiently, while still retaining column types and other metadata. It is a dictionary-like class, so you can read and write just as you would for a Python dict object.

Here’s how you would go about storing your pre-processed electricity consumption DataFrame, df, in an HDF5 file:

# Create storage object with filename `processed_data`
data_store = pd.HDFStore('processed_data.h5')

# Put DataFrame into the object setting the key as 'preprocessed_df'
data_store['preprocessed_df'] = df
data_store.close()

Now you can shut your computer down and take a break knowing that you can come back and your processed data will be waiting for you when you need it. No reprocessing required. Here’s how you would access your data from the HDF5 file, with data types preserved:

# Access data store
data_store = pd.HDFStore('processed_data.h5')

# Retrieve data using key
preprocessed_df = data_store['preprocessed_df']
data_store.close()

A data store can house multiple tables, with the name of each as a key.
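
For example, here is a sketch of storing several DataFrames in the same file (df_2016 and df_2017 are hypothetical names, not objects built earlier in this article):

data_store = pd.HDFStore('processed_data.h5')
data_store['sites_2016'] = df_2016
data_store['sites_2017'] = df_2017
print(data_store.keys())  # keys are listed with a leading slash, e.g. '/sites_2016'
data_store.close()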

Just a note about using the HDFStore in Pandas: you will need to have PyTables >= 3.0.0 installed, so after you have installed Pandas, make sure to update PyTables like this:

pip install --upgrade tables

Conclusions

If you don’t feel like your Pandas project is fast, flexible, easy, and intuitive, consider rethinking how you’re using the library.

The examples you’ve explored here are fairly straightforward but illustrate how the proper application of Pandas features can make vast improvements to runtime and code readability to boot. Here are a few rules of thumb that you can apply next time you’re working with large data sets in Pandas:

  • Try to use vectorized operations where possible rather than approaching problems with the for x in df... mentality. If your code is home to a lot of for-loops, it might be better suited to working with native Python data structures, because Pandas otherwise comes with a lot of overhead.

  • If you have more complex operations where vectorization is simply impossible or too difficult to work out efficiently, use the .apply() method.

  • If you do have to loop over your array (which does happen), use iterrows() to improve speed and syntax.

  • Pandas has a lot of optionality, and there are almost always several ways to get from A to B. Be mindful of this, compare how different routes perform, and choose the one that works best in the context of your project.

  • Once you’ve got a data cleaning script built, avoid reprocessing by storing your intermediate results with HDFStore.

  • Integrating NumPy into Pandas operations can often improve speed and simplify syntax.



PyCharm: PyCharm 2018.2 RC 2


Today we’re making the second release candidate for PyCharm 2018.2 available on our website. We’ve only made one small fix to how usage statistics are collected.

Usage statistics collection in PyCharm is opt-in and completely voluntary. It greatly helps us to see which functionality in PyCharm actually gets used, and which areas should receive the most attention for new features. To all users who opt-in: thank you very much!

See the blog post of the previous RC to read what’s new since the last EAP.

Interested?

Download the RC from our website. Alternatively, you can use the JetBrains Toolbox App to stay up to date.

If you’re on Ubuntu 16.04 or later, you can use snap to get PyCharm RC versions, and stay up to date. You can find the installation instructions on our website.

The release candidate (RC) is not an early access program (EAP) build, and does not bundle an EAP license. If you get PyCharm Professional Edition RC, you will either need a currently active PyCharm subscription, or you will receive a 30-day free trial.

Bhishan Bhandari: Python Operators


Operators are the constructs that enable performing operations on operands(values and variables). The operators in python are represented by special symbols and keywords. The intentions of this blog is to familiarize with the various operators in Python. Arithmetic Operators These operators are used to perform mathematical operations ranging from addition, subtraction, multiplication, division to modulus, […]

The post Python Operators appeared first on The Tara Nights.

Doing Math with Python: Doing Math with Python in Linux Geek Humble Bundle


"Doing Math with Python" is part of No Starch Press's "Pay what you want" Linux Geek Humble Bundle running for the next 7 days. Your purchases will help support EFF or a charity of your choice.

Humble Bundle

Get the bundle here!


Mike Driscoll: Understanding Tracebacks in Python


When you are first starting out learning how to program, one of the first things you will want to learn is what an error message means. In Python, error messages are usually called tracebacks. Here are some common traceback errors:

  • SyntaxError
  • ImportError or ModuleNotFoundError
  • AttributeError
  • NameError

When you get an error, it is usually recommended that you trace through it backwards (i.e. traceback). So start at the bottom of the traceback and read it backwards.

Let’s take a look at a few simple examples of tracebacks in Python.


Syntax Error

A very common error (or exception) is the SyntaxError. A syntax error happens when the programmer makes a mistake when writing the code out. They might forget to close an open parenthesis, or use a mix of quotes around a string by accident, for instance. Let’s take a look at an example I ran in IDLE:

>>> print('This is a test)

SyntaxError: EOL while scanning string literal

Here we attempt to print out a string and we receive a SyntaxError. It tells us that the error has something to do with it not finding the End of Line (EOL). In this case, we didn’t finish the string by ending the string with a single quote.

Let’s look at another example that will raise a SyntaxError:

def func
    return 1

When you run this code from the command line, you will receive the following message:

File "syn.py", line 1def func
           ^
SyntaxError: invalid syntax

Here the SyntaxError says that we used “invalid syntax”. Then Python helpfully uses an arrow (^) to point out exactly where we messed up the syntax. Finally we learn that the line of code we need to look at is on “line 1”. Using all of these facts, we can quickly see that we forgot to add a colon to the end of our function definition.


Import Errors

Another common error that I see even with experienced developers is the ImportError. You will see this error whenever Python cannot find the module that you are trying to import. Here is an example:

>>> import some
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named some

Here we learn that Python could not find the “some” module. Note that in Python 3, you might get a ModuleNotFoundError error instead of ImportError. ModuleNotFoundError is just a subclass of ImportError and means virtually the same thing. Regardless which exception you end up seeing, the reason you see this error is because Python couldn’t find the module or package. What this means in practice is that the module is either incorrectly installed or not installed at all. Most of the time, you just need to figure out what package that module is a part of and install it using pip or conda.
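
If you are curious, you can confirm the subclass relationship yourself in a Python 3.6 or newer interpreter:

>>> issubclass(ModuleNotFoundError, ImportError)
True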


AttributeError

The AttributeError is really easy to accidentally hit, especially if you don’t have code completion in your IDE. You will get this error when you try to call an attribute that does not exist:

>>> my_string = 'Python'
>>> my_string.up()
Traceback (most recent call last):
  File "<pyshell#8>", line 1, in <module>
    my_string.up()
AttributeError: 'str' object has no attribute 'up'

Here I tried to use a non-existent string method called “up” when I should have called “upper”. Basically the solution to this problem is to read the manual or check the data type and make sure you are calling the correct attributes on the object at hand.


NameError

The NameError occurs when the local or global name is not found. If you are new to programming that explanation seems vague. What does it mean? Well in this case it means that you are trying to interact with a variable or object that hasn’t been defined. Let’s pretend that you open up a Python interpreter and type the following:

>>> print(var)
Traceback (most recent call last):
  File "<pyshell#10>", line 1, in <module>
    print(var)
NameError: name 'var' is not defined

Here you find out that ‘var’ is not defined. This is easy to fix in that all we need to do is set “var” to something. Let’s take a look:

>>> var = 'Python' 
>>> print(var)
Python

See how easy that was?


Wrapping Up

There are lots of errors that you will see in Python and knowing how to diagnose the cause of those errors is really useful when it comes to debugging. Soon it will become second nature to you and you will be able to just glance at the traceback and know exactly what happened. There are many other built-in exceptions in Python that are documented on their website and I encourage you to become familiar with them so you know what they mean. Most of the time, it should be really obvious though.



EuroPython: EuroPython 2018: Get to know other attendees


As with every larger conference, it is sometimes a bit intimidating to approach other fellow attendees to get to know them.

At EuroPython, we’re generally a very friendly bunch and open to helping people, start conversations and interact based on our common field of interest which is Python for starters and can easily extend to many other fields as well.

Let’s talk…

In order to facilitate this interaction, we are providing a number of tools attendees can use to communicate with each other, organize ad-hoc meetups or plan activities:

  • EuroPython Pulse in the Conference App: This is a channel where attendees can post messages, pictures, and even send direct messages to other attendees registered in the app.
  • EuroPython Telegram Group: Telegram is a messenger application, which is becoming increasingly popular and has a low barrier to entry. We started a public group two years ago and it’s been working really well as a platform for reaching out to other attendees.
  • Twitter: Using the hash tag #EuroPython you can easily reach out to other attendees.
  • Attendee Profiles: The website offers the possibility to setup a public profile in your account, where you can put your picture, interests, contact details and bio. The profiles are searchable on our “Who is coming” page.

and, of course, we have coffee breaks, open spaces and lunch at the conference as well, to provide you with plenty of possibilities to get in touch in person.

Selfie / Group Photo Spot

We’d also like to draw some more attention to a selfie or group photo spot we have installed on the outside of the venue (if you exit the venue, walk left to the next corner). It makes a great background for photos to post on social networks, send to your friends or take home as a memory.


Socializing in Edinburgh

Since we did not find a venue for holding a social event this year, we would like to simply suggest a common place to use as hub for EuroPython attendees, with enough pubs and restaurants to accommodate everyone.

For this we’d like to propose the Edinburgh Grassmarket, which is located just south of the castle and within walking distance of the EICC. 


Enjoy,

EuroPython 2018 Team
https://ep2018.europython.eu/
https://www.europython-society.org/

Stack Abuse: Cross Validation and Grid Search for Model Selection in Python


Introduction

A typical machine learning process involves training different models on the dataset and selecting the one with the best performance. However, evaluating the performance of an algorithm is not always a straightforward task. There are several factors that can help you determine which algorithm performs best. One such factor is performance on a cross-validation set, and another is the choice of parameters for an algorithm.

In this article we will explore these two factors in detail. We will first study what cross validation is, why it is necessary, and how to perform it via Python's Scikit-Learn library. We will then move on to the Grid Search algorithm and see how it can be used to automatically select the best parameters for an algorithm.

Cross Validation

Normally in a machine learning process, data is divided into training and test sets; the training set is then used to train the model and the test set is used to evaluate the performance of the model. However, this approach may lead to variance problems. In simpler words, a variance problem refers to the scenario where the accuracy obtained on one test set is very different from the accuracy obtained on another test set using the same algorithm.

The solution to this problem is to use K-Fold Cross-Validation for performance evaluation, where K is any number. The process of K-Fold Cross-Validation is straightforward. You divide the data into K folds. Out of the K folds, K-1 sets are used for training while the remaining set is used for testing. The algorithm is trained and tested K times; each time, a new set is used as the testing set while the remaining sets are used for training. Finally, the result of the K-Fold Cross-Validation is the average of the results obtained on each set.

Suppose we want to perform 5-fold cross validation. To do so, the data is divided into 5 sets, for instance we name them SET A, SET B, SET C, SET D, and SET E. The algorithm is trained and tested K times. In the first fold, SET A to SET D are used as training set and SET E is used as testing set as shown in the figure below:

[Figure: 5-fold cross-validation, with SET A to SET D used for training and SET E used for testing]

In the second fold, SET A, SET B, SET C, and SET E are used for training and SET D is used as testing. The process continues until every set is at least used once for training and once for testing. The final result is the average of results obtained using all folds. This way we can get rid of the variance. Using standard deviation of the results obtained from each fold we can in fact find the variance in the overall result.
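
If you want to see the fold assignments for yourself, here is a small sketch using Scikit-Learn's KFold class on toy data (this is just an illustration, not part of the wine example below):

from sklearn.model_selection import KFold
import numpy as np

X = np.arange(20).reshape(10, 2)  # toy data: 10 samples, 2 features
kf = KFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print("Fold {}: train={}, test={}".format(fold, train_idx, test_idx))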

Cross Validation with Scikit-Learn

In this section we will use cross validation to evaluate the performance of Random Forest Algorithm for classification. The problem that we are going to solve is to predict the quality of wine based on 12 attributes. The details of the dataset are available at the following link:

https://archive.ics.uci.edu/ml/datasets/wine+quality

We are only using the data for red wine in this article.

Follow these steps to implement cross validation using Scikit-Learn:

1. Importing Required Libraries

The following code imports a few of the required libraries:

import pandas as pd  
import numpy as np  

2. Importing the Dataset

Download the dataset, which is available online at this link:

https://www.kaggle.com/piyushgoyal443/red-wine-dataset

Once we have downloaded it, we placed the file in the "Datasets" folder of our "D" drive for the sake of this article. The dataset name is "winequality-red.csv". Note that you'll need to change the file path to match the location in which you saved the file on your computer.

Execute the following command to import the dataset:

dataset = pd.read_csv(r"D:/Datasets/winequality-red.csv", sep=';')  

The dataset is semicolon-separated, therefore we passed ";" as the value of the sep parameter so pandas is able to properly parse the file.

3. Data Analysis

Execute the following script to get an overview of the data:

dataset.head()  

The output looks like this:

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8        5
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8        5
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8        6
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5

4. Data Preprocessing

Execute the following script to divide data into label and feature sets.

X = dataset.iloc[:, 0:11].values  
y = dataset.iloc[:, 11].values  

Since we are using cross validation, we don't need to divide our data into training and test sets. We want all of the data in the training set so that we can apply cross validation on that. The simplest way to do this is to set the value for the test_size parameter to 0. This will return all the data in the training set as follows:

from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0, random_state=0)  

5. Scaling the Data

If you look at the dataset you'll notice that it is not scaled well. For instance, columns such as chlorides and density have values around or below 1, while columns such as total sulfur dioxide have much larger values. Therefore, before training the algorithm, we will need to scale our data down.

Here we will use the StandardScalar class.

from sklearn.preprocessing import StandardScaler  
feature_scaler = StandardScaler()  
X_train = feature_scaler.fit_transform(X_train)  
X_test = feature_scaler.transform(X_test)  

6. Training and Cross Validation

The first step in the training and cross validation phase is simple. You just have to import the algorithm class from the sklearn library as shown below:

from sklearn.ensemble import RandomForestClassifier  
classifier = RandomForestClassifier(n_estimators=300, random_state=0)  

Next, to implement cross validation, the cross_val_score function of the sklearn.model_selection library can be used. cross_val_score returns the accuracy for all the folds. Values for 4 parameters are required to be passed to cross_val_score. The first parameter is estimator, which basically specifies the algorithm that you want to use for cross validation. The second and third parameters, X and y, contain the X_train and y_train data, i.e. features and labels. Finally, the number of folds is passed to the cv parameter as shown in the following code:

from sklearn.model_selection import cross_val_score  
all_accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=5)  

Once you've executed this, let's simply print the accuracies returned for five folds by the cross_val_score method by calling print on all_accuracies.

print(all_accuracies)  

Output:

[ 0.72360248  0.68535826  0.70716511  0.68553459  0.68454259 ]

To find the average of all the accuracies, simply use the mean() method of the object returned by the cross_val_score function, as shown below:

print(all_accuracies.mean())  

The mean value is 0.6972, or 69.72%.

Finally, let's find the standard deviation of the results to see the degree of variance in the results obtained by our model. To do so, call the std() method on the all_accuracies object.

print(all_accuracies.std())  

The result is: 0.01572 which is 1.57%. This is extremely low, which means that our model has a very low variance, which is actually very good since that means that the prediction that we obtained on one test set is not by chance. Rather, the model will perform more or less similar on all test sets.

Grid Search for Parameter Selection

A machine learning model has two types of parameters. The first type are the parameters that are learned from the data during training, while the second type are the hyperparameters that we pass to the machine learning model.

In the last section, while predicting the quality of wine, we used the Random Forest algorithm. The number of estimators we used for the algorithm was 300. Similarly, for the KNN algorithm we have to specify the value of K, and for the SVM algorithm we have to specify the type of kernel. The number of estimators, the K value, and the kernel are all examples of hyperparameters.
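
As a quick illustration (a sketch, not code from the wine example), hyperparameters are simply the values you pass in when the model object is created:

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rf = RandomForestClassifier(n_estimators=300)  # number of trees in the forest
knn = KNeighborsClassifier(n_neighbors=5)      # the K in KNN
svm = SVC(kernel='rbf')                        # the kernel type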

Normally we randomly set the values for these hyperparameters and see which ones result in the best performance. However, randomly selecting the parameters for the algorithm can be exhausting.

Also, it is not easy to compare the performance of different algorithms by randomly setting the hyperparameters, because one algorithm may perform better than another with a different set of parameters. And if the parameters are changed, the algorithm may perform worse than the other algorithms.

Therefore, instead of randomly selecting the values of the parameters, a better approach would be to develop an algorithm which automatically finds the best parameters for a particular model. Grid Search is one such algorithm.

Grid Search with Scikit-Learn

Let's implement the grid search algorithm with the help of an example. The script in this section should be run after the script that we created in the last section.

To implement the Grid Search algorithm we need to import the GridSearchCV class from the sklearn.model_selection library.

The first step you need to perform is to create a dictionary of all the parameters and their corresponding set of values that you want to test for best performance. The name of the dictionary items corresponds to the parameter name and the value corresponds to the list of values for the parameter.

Let's create a dictionary of parameters and their corresponding values for our Random Forest algorithm. Details of all the parameters for the random forest algorithm are available in the Scikit-Learn docs.

To do this, execute the following code:

grid_param = {  
    'n_estimators': [100, 300, 500, 800, 1000],
    'criterion': ['gini', 'entropy'],
    'bootstrap': [True, False]
}

Take a careful look at the above code. Here we create a grid_param dictionary with three parameters: n_estimators, criterion, and bootstrap. The parameter values that we want to try out are passed in the list. For instance, in the above script we want to find which value of n_estimators (out of 100, 300, 500, 800, and 1000) provides the highest accuracy.

Similarly, we want to find which value results in the highest performance for the criterion parameter: "gini" or "entropy"? The Grid Search algorithm basically tries all possible combinations of parameter values and returns the combination with the highest accuracy. For instance, in the above case the algorithm will check 20 combinations (5 x 2 x 2 = 20).

The Grid Search algorithm can be very slow, owing to the potentially huge number of combinations to test. Furthermore, cross validation further increases the execution time and complexity.

Once the parameter dictionary is created, the next step is to create an instance of the GridSearchCV class. You need to pass a value for the estimator parameter, which basically is the algorithm that you want to execute. The param_grid parameter takes the parameter dictionary that we just created, the scoring parameter takes the performance metric, the cv parameter corresponds to the number of folds, which is 5 in our case, and finally the n_jobs parameter refers to the number of CPUs that you want to use for execution. A value of -1 for the n_jobs parameter means use all available computing power. This can be handy if you have a large amount of data.

Take a look at the following code:

from sklearn.model_selection import GridSearchCV

gd_sr = GridSearchCV(estimator=classifier,
                     param_grid=grid_param,
                     scoring='accuracy',
                     cv=5,
                     n_jobs=-1)

Once the GridSearchCV class is initialized, the last step is to call the fit method of the class and pass it the training data, as shown in the following code:

gd_sr.fit(X_train, y_train)  

This method can take some time to execute because we have 20 combinations of parameters and a 5-fold cross validation. Therefore the algorithm will execute a total of 100 times.

Once the method completes execution, the next step is to check the parameters that return the highest accuracy. To do so, print the gd_sr.best_params_ attribute of the GridSearchCV object, as shown below:

best_parameters = gd_sr.best_params_  
print(best_parameters)  

Output:

{'bootstrap': True, 'criterion': 'gini', 'n_estimators': 1000}

The result shows that the highest accuracy is achieved when the n_estimators are 1000, bootstrap is True and criterion is "gini".

Note: It would be a good idea to try a larger number of estimators and see if performance increases further, since the highest value of n_estimators allowed in our grid (1000) was chosen.

The last and final step of Grid Search algorithm is to find the accuracy obtained using the best parameters. Previously we had a mean accuracy of 69.72% with 300 n_estimators.

To find the best accuracy achieved, execute the following code:

best_result = gd_sr.best_score_  
print(best_result)  

The accuracy achieved is 0.6985, or 69.85%, which is only slightly better than 69.72%. To improve this further, it would be good to test values for other parameters of the Random Forest algorithm, such as max_features, max_depth, and max_leaf_nodes, to see if the accuracy improves further.

Conclusion

In this article we studied two very commonly used techniques for performance evaluation and model selection of an algorithm. K-Fold Cross-Validation can be used to evaluate performance of a model by handling the variance problem of the result set. Furthermore, to identify the best algorithm and best parameters, we can use the Grid Search algorithm.

Davy Wybiral: Running python-RQ on a Raspberry Pi 3 Cluster

I keep getting asked to show some examples of Python code running on a Raspberry Pi cluster so here's a distributed task queue using python-RQ, Redis, and 16 ARM cores-worth of Raspberry Pi 3's.
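
For readers who have not used python-RQ before, the producer side of such a setup looks roughly like the sketch below; the module name, task function, and Redis host are assumptions for illustration, not details taken from the video:

# tasks.py -- a module that every worker node can import (hypothetical)
def slow_task(n):
    return sum(i * i for i in range(n))

# enqueue.py -- push work onto the shared queue (hypothetical Redis host)
from redis import Redis
from rq import Queue
from tasks import slow_task

q = Queue(connection=Redis(host='192.168.0.100'))
job = q.enqueue(slow_task, 10000000)

Each Pi in the cluster would then run an RQ worker process pointed at the same Redis instance to pull jobs off the queue.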

Reuven Lerner: Avoiding Windows backslash problems with Python’s raw strings


I’m a Unix guy, but the participants in my Python classes overwhelmingly use Windows. Inevitably, when we get to talking about working with files in Python, someone will want to open a file using the complete path to the file.  And they’ll end up writing something like this:

filename = 'c:\abc\def\ghi.txt'

But when my students try to open the file, they discover that Python gives them an error, indicating that the file doesn’t exist!  In other words, they write:

for one_line in open(filename):
    print(one_line)

What’s the problem?  This seems like pretty standard Python, no?

Remember that strings in Python normally contain characters. Those characters are normally printable, but there are times when you want to include a character that isn’t really printable, such as a newline.  In those cases, Python (like many programming languages) includes special codes that will insert the special character.

The best-known example is newline, aka ‘\n’, or ASCII 10. If you want to insert a newline into your Python string, then you can do so with ‘\n’ in the middle.  For example:

s = 'abc\ndef\nghi'

When we print the string, we’ll see:

>>> print(s)

abc

def

ghi

What if you want to print a literal ‘\n’ in your code? That is, you want a backslash, followed by an “n”? Then you’ll need to double the backslash: the “\\” in a string will result in a single backslash character. The following “n” will then be normal. For example:

s = 'abc\\ndef\\nghi'

When we say:

>>> print(s)

abc\ndef\nghi

It’s pretty well known that you have to guard against this translation when you’re working with \n. But what other characters require it? It turns out, more than many people might expect:

  • \a — alarm bell (ASCII 7)
  • \b — backspace (ASCII 8)
  • \f — form feed
  • \n — newline
  • \r — carriage return
  • \t — tab
  • \v — vertical tab
  • \ooo —  character with octal value ooo
  • \xhh — character with hex value hh
  • \N{name} — Unicode character {name}
  • \uxxxx — Unicode character with 16-bit hex value xxxx
  • \Uxxxxxxxx — Unicode character with 32-bit hex value xxxxxxxx
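
A few of these in action, in case you’ve never seen them (run in any Python 3 interpreter):

>>> print('col1\tcol2')
col1    col2
>>> print('\x48\x69')
Hi
>>> print('\N{GREEK SMALL LETTER ALPHA}')
α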

In my experience, you’re extremely unlikely to use some of these on purpose. I mean, when was the last time you needed to use a form feed character? Or a vertical tab?  I know — it was roughly the same day that you drove your dinosaur to work, after digging a well in your backyard for drinking water.

But nearly every time I teach Python — which is, every day — someone in my class bumps up against one of these characters by mistake. That’s because the combination of the backslashes used by these characters and the backslashes used in Windows paths makes for inevitable, and frustrating, bugs.

Remember that path I mentioned at the top of the blog post, which seems so innocent?

filename = 'c:\abc\def\ghi.txt'

It contains a “\a” character. Which means that when we print it:

>>> print(filename)
c:bc\def\ghi.txt

See? The “\a” is gone, replaced by an alarm bell character. If you’re lucky.

So, what can we do about this? Double the backslashes, of course. You only need to double those that would be turned into special characters, from the table I’ve reproduced above. But come on, are you really likely to remember that “\f” is special, but “\g” is not?  Probably not.

So my general rule, and what I tell my students, is that they should always double the backslashes in their Windows paths. In other words:

>>> filename = 'c:\\abc\\def\\ghi.txt'

>>> print(filename)
c:\abc\def\ghi.txt

It works!

But wait: No one wants to really wade through their pathnames, doubling every backslash, do they?  Of course not.

That’s where Python’s raw strings can help. I think of raw strings in two different ways:

  • what-you-see-is-what-you-get strings
  • automatically doubled backslashes in strings

Either way, the effect is the same: All of the backslashes are doubled, so all of these pesky and weird special characters go away.  Which is great when you’re working with Windows paths.

All you need to do is put an “r” before the opening quotes (single or double):

>>> filename = r'c:\abc\def\ghi.txt'

>>> print(filename)
c:\abc\def\ghi.txt

Note that a “raw string” isn’t really a different type of string at all. It’s just another way of entering a string into Python.  If you check, type(filename) will still be “str”, but its backslashes will all be doubled.
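
A quick check in the interpreter shows both points: the type is still str, and the backslashes are stored doubled:

>>> filename = r'c:\abc\def\ghi.txt'
>>> type(filename)
<class 'str'>
>>> filename
'c:\\abc\\def\\ghi.txt'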

Bottom line: If you’re using Windows, then you should just write all of your hard-coded pathname strings as raw strings.  Even if you’re a Python expert, I can tell you from experience that you’ll bump up against this problem sometimes. And even for the best of us, finding that stray “\f” in a string can be time consuming and frustrating.

PS: Yes, it’s true that Windows users can get around this by using forward slashes, like we Unix folks do. But my students find this to be particularly strange looking, and so I don’t see it as a general-purpose solution.
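
For completeness, that forward-slash alternative looks like this (Python’s open() on Windows accepts either separator):

>>> filename = 'c:/abc/def/ghi.txt'
>>> print(filename)
c:/abc/def/ghi.txt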

The post Avoiding Windows backslash problems with Python’s raw strings appeared first on Lerner Consulting Blog.


