This weekend, I visited PyCon Italy in the picturesque town of Firenze. It was a great conference with excellent talks and encounters (many thanks to all the volunteers who made it happen) and amazing coffee.
I gave a talk titled “Python Data Science Going Functional” in the Science Track, where I mostly presented Dask, one of the libraries to watch in the Python data science ecosystem. Slides are available on Speaker Deck.
The talk introduces Dask as a functional abstraction in the Python data science stack. While creating the slides, I stumbled over an exciting tweet about Dask by Travis Oliphant:
And after verifying the results on my machine (with some modifications, as I do not trust timeit), I included a very similar benchmark in my slides. While reproducing and adapting the benchmarks, I stumbled over some weirdly long execution times for Dask's from_array. So I included this finding in my talk's slides without really being able to attribute the delay to a specific cause.
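For verifying timings by hand, a simple alternative to timeit is to take the best of a few repetitions with time.perf_counter. This is only a sketch of that pattern; the expression being timed here is a stand-in, not the one from the talk:

```python
import time
import numpy as np

x = np.random.random(1_000_000)

def bench(fn, repeats=5):
    """Return the best-of-N wall-clock time of calling fn()."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return min(times)

# Stand-in expression for illustration only.
t = bench(lambda: np.sin(x) ** 2 + np.cos(x) ** 2)
print(f"best of 5: {t:.4f} s")
```

Taking the minimum rather than the mean reduces the influence of background noise on the measurement.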
After delivering my talk I felt a bit unsatisfied about this. Why did from_array perform so badly? So I decided to ask. The answer: Dask hashes the whole array in from_array to generate a key for it, which is why it is so slow. The solution is surprisingly simple: by passing name='identifier' to from_array, one can provide a custom key, and from_array suddenly becomes a cheap operation.
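In code, the fix looks like this (a minimal sketch; 'my-data' is an arbitrary identifier chosen for illustration):

```python
import numpy as np
import dask.array as da

x = np.random.random(1_000_000)

# Default: Dask hashes the full array contents to derive a task key,
# which costs time proportional to the array size.
d_hashed = da.from_array(x, chunks=len(x) // 4)

# Passing an explicit name skips the hashing step. The name must be
# unique per distinct array, or Dask will treat two arrays as the same.
d_named = da.from_array(x, chunks=len(x) // 4, name='my-data')
```

The caveat in the comment matters: since the name replaces the content hash as the identity of the data, reusing one name for two different arrays will silently confuse them.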
So the current state of my benchmark shows that Dask improves upon pure numpy or numexpr performance, but does not quite reach the performance of a Cython implementation:
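The shape of such a comparison can be sketched as follows. This is not the benchmark from the talk; the element-wise expression is a hypothetical stand-in, and the chunk count is an arbitrary choice:

```python
import numpy as np
import dask.array as da

x = np.random.random(1_000_000)

def numpy_version():
    # Single-threaded numpy evaluation of a stand-in expression.
    return np.sin(x) ** 2 + np.cos(x) ** 2

# The same expression on a chunked dask array: the threaded scheduler
# can evaluate the chunks in parallel. name= avoids the hashing cost.
d = da.from_array(x, chunks=len(x) // 8, name='bench-data')

def dask_version():
    return (da.sin(d) ** 2 + da.cos(d) ** 2).compute()
```

Whether Dask wins depends on the array size and the cost of the expression; for small arrays the graph-scheduling overhead dominates.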
The expression evaluated in that benchmark was
I plan to upload a revised edition of the slides to Speaker Deck (once I have a decent internet connection again) to include the improved benchmark, so that they are not misleading for people who stumble over them without context.
What can we conclude from this?
- The overhead of converting a dask array to a numpy array is not as bad as I feared.
- There are two aspects to a benchmark: performance and usability.
- Dask should be watched not only for out-of-core computations, but also for parallelizing simple, blocking numpy expressions.