This weekend, I visited PyCon Italy in the picturesque town of Firenze. It was a great conference with excellent talks and encounters (many thanks to all the volunteers who made it happen) and amazing coffee.
I gave a talk titled "Python Data Science Going Functional" in the Science Track, where I mostly presented Dask, one of the libraries to watch in the Python data science ecosystem. Slides are available on Speaker Deck.
Dask
The talk introduces Dask as a functional abstraction in the Python data science stack. While creating the slides, I had stumbled upon an exciting tweet about Dask by Travis Oliphant:
> Cool to see dask.array achieving similar performance to Cython + OpenMP: https://t.co/3tsWCAgWWQ Much simpler code with #dask. @PyData
>
> — Travis Oliphant (@teoliphant) April 4, 2016
And after verifying the results on my machine (with some modifications, as I do not trust timeit), I included a very similar benchmark in my slides. While reproducing and adapting the benchmarks, I stumbled upon some weirdly long execution times for Dask's from_array function. So I included this finding in my talk's slides without really being able to attribute the delay to a specific cause.
After delivering my talk, I felt a bit unsatisfied about this. Why did from_array perform so badly? So I decided to ask. The answer: Dask hashes the whole array in from_array to generate a key for it, which is what makes it so slow. The solution is surprisingly simple: by passing name='identifier' to from_array, one can provide a custom key, and from_array suddenly becomes a cheap operation.
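To see the effect in isolation, here is a minimal timing sketch (the array size and chunking are my own choices, not taken from the original benchmark):

```python
import numpy as np
import dask.array as da
from timeit import default_timer as timer

x_np = np.random.random(50_000_000)

# Without a name, from_array hashes the full array to derive its key.
start = timer()
da.from_array(x_np, chunks=len(x_np) // 4)
print('hashed key:   %.3fs' % (timer() - start))

# With an explicit name, the hashing step is skipped entirely.
start = timer()
da.from_array(x_np, chunks=len(x_np) // 4, name='x')
print('explicit key: %.3fs' % (timer() - start))
```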

With this fix in place, the current state of my benchmark shows that Dask improves upon pure numpy and numexpr performance, but does not quite reach the performance of a Cython implementation:

A corrected benchmark showing execution times for numexpr (NX), numpy (np), Cython (with OpenMP parallelization), and Dask (including from_array).
The expression evaluated in that benchmark was:

```python
# x_np is the input numpy array, CPU_COUNT the number of available cores.
x = da.from_array(x_np, chunks=x_np.shape[0] // CPU_COUNT, name='x')
mx = x.max()
x = (x / mx).sum() * mx
x.compute()
```

I plan to upload a revised edition of the slides to Speaker Deck (once I have a decent internet connection again) to include the improved benchmark, so that they are not misleading for people who stumble upon them without context.
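For reference, the pure numpy baseline of the same expression (my own reconstruction, not the exact benchmark script) is simply:

```python
# Evaluated eagerly and single-threaded on the same numpy array x_np.
mx = x_np.max()
result = (x_np / mx).sum() * mx
```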
Learnings
What can we conclude from this?
- The overhead of converting a dask array back to a numpy array is not as bad as I feared.
- There are two aspects to a benchmark: performance and usability.
- Dask should be watched not only for out-of-core computations, but also for parallelizing simple, blocking numpy expressions (see the sketch below).
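To make that last point concrete, here is a minimal sketch of the pattern (array shape, chunking, and the computed expression are my own choices):

```python
import numpy as np
import dask.array as da

x_np = np.random.random((10_000, 10_000))

# Wrap the in-memory numpy array; the chunks define the unit of
# parallelism, and an explicit name skips the hashing discussed above.
x = da.from_array(x_np, chunks=(1_000, 10_000), name='x')

# This reads like plain numpy, but each chunk is processed in
# parallel once .compute() is called.
result = ((x - x.mean()) ** 2).mean().compute()
```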