Overview
Dask is a flexible, open source parallel computation framework that lets you comfortably scale up and scale out your analytics. If you’re running into memory issues, storage limitations, or CPU boundaries on a single machine when using Pandas, NumPy, or other Python computations, Dask can help you scale up to use all of the cores on a single machine, or scale out to use all of the cores and memory across your cluster.
Dask enables distributed computing in pure Python and complements the existing numerical and scientific computing capability within Anaconda. Dask works well on a single machine to make use of all of the cores on your laptop and process larger-than-memory data, and it scales up resiliently and elastically on clusters with hundreds of nodes.
Dask works natively from Python with data in different formats and storage systems, including the Hadoop Distributed File System (HDFS) and Amazon S3. Anaconda and Dask can work with your existing enterprise Hadoop distribution, including Cloudera CDH and Hortonworks HDP.
In this post, we’ll show you how you can use Anaconda with Dask for distributed computations, including distributed dataframes, arrays, text processing, and custom parallel workflows that can help you make the most of Anaconda and Dask on your cluster. We’ll work with Anaconda and Dask interactively from the Jupyter Notebook while the heavy computations run on the cluster.
Installing Anaconda and Dask on your Hadoop or High Performance Computing (HPC) Cluster
There are many different ways to get started with Anaconda and Dask on your Hadoop or HPC cluster, including manual setup via SSH; by integrating with resource managers such as YARN, SGE, or Slurm; launching instances on Amazon EC2; or by using the enterprise-ready Anaconda for cluster management.
Anaconda for cluster management makes it easy to install familiar packages from Anaconda (including NumPy, SciPy, Pandas, NLTK, scikit-learn, scikit-image, and access to 720+ more packages in Anaconda) and the Dask parallel processing framework on all of your bare-metal or cloud-based cluster nodes. You can provision centrally managed installations of Anaconda, Dask and the Jupyter notebook using two simple commands with Anaconda for cluster management:
$ acluster create dask-cluster -p dask-cluster
$ acluster install dask notebook
Additional features of Anaconda for cluster management include:
- Easily install Python and R packages across multiple cluster nodes
- Manage multiple conda environments across a cluster
- Push local conda environments to all cluster nodes
- Works on cloud-based and bare-metal clusters with existing Hadoop installations
- Remotely SSH into cluster nodes and upload/download files to and from them
Once you’ve installed Anaconda and Dask on your cluster, you can perform many types of distributed computations, including text processing (similar to Spark), distributed dataframes, distributed arrays, and custom parallel workflows. We’ll show some examples in the following sections.
Distributed Text and Language Processing (Dask Bag)
Dask works well with standard computations such as text processing and natural language processing, and with data in different formats and storage systems (e.g., HDFS, Amazon S3, local files). The Dask Bag collection is similar to other parallel frameworks and supports operations like filter, count, fold, frequencies, pluck, and take, which are useful for working with a collection of Python objects such as text.
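To get a feel for the Bag API before moving to a cluster, here is a minimal local sketch; the word list below is made up purely for illustration:
>>> import dask.bag as db
>>> # A tiny in-memory bag, standing in for text loaded from HDFS or S3
>>> words = db.from_sequence(['dask', 'spark', 'dask', 'pandas', 'dask', 'numpy'])
>>> # Filter and count lazily, then compute the result
>>> words.filter(lambda w: w.startswith('d')).count().compute()
3
>>> # Word frequencies and the single most common word
>>> words.frequencies().topk(1, key=lambda word_count: word_count[1]).compute()
[('dask', 3)]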
For example, we can use the Natural Language Toolkit (NLTK) in Anaconda to perform distributed language processing on a Hadoop cluster, all while working interactively in a Jupyter notebook.
In this example, we’ll use a subset of the data set that contains comments from the reddit website from January 2015 to August 2015, which is about 242 GB on disk. The data set was made available in July 2015 in a reddit post. It is in JSON format (one comment per line) and consists of the comment body, author, subreddit, timestamp of creation, and other fields.
First, we import libraries from Dask and connect to the Dask distributed scheduler:
>>> import dask
>>> from distributed import Executor, hdfs, progress
>>> e = Executor('54.164.41.213:8786')
Next, we load 242 GB of JSON data from HDFS using pure Python:
>>> import json
>>> lines = hdfs.read_text('/user/ubuntu/RC_2015-*.json')
>>> js = lines.map(json.loads)
We can filter and load the data into distributed memory across the cluster:
>>> movies = js.filter(lambda d: 'movies' in d['subreddit'])
>>> movies = e.persist(movies)
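As a quick sanity check (a minimal sketch, not part of the original walkthrough), we could count how many comments are now held in distributed memory, using the same executor pattern as the rest of this example:
>>> # Count the filtered comments loaded into distributed memory
>>> count_future = e.compute(movies.count())
>>> count_future.result()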
Once we’ve loaded the data into distributed memory, we can import the NLTK library from Anaconda and construct stacked expressions to tokenize words, tag parts of speech, and filter out non-words from the dataset:
>>> import nltk
>>> pos = e.persist(movies.pluck('body')
... .map(nltk.word_tokenize)
... .map(nltk.pos_tag)
... .concat()
... .filter(lambda word_pos: word_pos[0].isalpha()))
Next, we’ll generate a list of the top 10 proper nouns from the movies subreddit:
>>> f = e.compute(pos.filter(lambda word_type: word_type[1] == 'NNP')
... .pluck(0)
... .frequencies()
... .topk(10, lambda word_count: word_count[1]))
>>> f.result()
[(u'Marvel', 35452),
(u'Star', 34849),
(u'Batman', 31749),
(u'Wars', 28875),
(u'Man', 26423),
(u'John', 25304),
(u'Superman', 22476),
(u'Hollywood', 19840),
(u'Max', 19558),
(u'CGI', 19304)]
Finally, we can use Bokeh to generate an interactive plot of the resulting data.
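The rendered Bokeh figure isn’t reproduced here, but a minimal sketch of such a bar chart, assuming a recent version of Bokeh and reusing the (word, count) pairs computed above, might look like this:
>>> from bokeh.io import output_notebook, show
>>> from bokeh.plotting import figure
>>> output_notebook()
>>> # Unpack the (word, count) pairs computed on the cluster
>>> results = f.result()
>>> words = [word for word, count in results]
>>> counts = [count for word, count in results]
>>> # Bar chart of the top proper nouns in the movies subreddit
>>> p = figure(x_range=words, title='Top proper nouns in /r/movies')
>>> p.vbar(x=words, top=counts, width=0.8)
>>> show(p)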
View the full notebook for this distributed language processing example on Anaconda Cloud.
Analysis with Distributed Dataframes (Dask DataFrame)
Dask allows you to work with familiar Pandas dataframe syntax on a single machine or on many nodes of a Hadoop or HPC cluster. You can work with data stored in different formats and storage systems (e.g., HDFS, Amazon S3, local files). The Dask DataFrame collection mimics the Pandas API, uses Pandas under the hood, and supports operations like head, groupby, value_counts, merge, and set_index.
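As a quick local illustration before the cluster example below (the small dataframe here is made up), the same API works on a pandas DataFrame split into partitions:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> # A small pandas DataFrame split into two partitions, standing in for CSVs on HDFS
>>> pdf = pd.DataFrame({'payment_type': [1, 2, 1, 1, 2],
...                     'tip_amount': [1.5, 0.0, 2.0, 0.0, 0.0]})
>>> ddf = dd.from_pandas(pdf, npartitions=2)
>>> # Operations are lazy until .compute() is called
>>> ddf.payment_type.value_counts().compute()
1    3
2    2
Name: payment_type, dtype: int64
>>> ddf.groupby('payment_type').tip_amount.mean().compute()
payment_type
1    1.166667
2    0.000000
Name: tip_amount, dtype: float64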
For example, we can use Dask to perform computations with dataframes on a Hadoop cluster with data stored in HDFS, all while working interactively in a Jupyter notebook.
First, we import libraries from Dask and connect to the Dask distributed scheduler:
>>> import dask
>>> from distributed import Executor, hdfs, progress, wait, s3
>>> e = Executor('54.164.41.213:8786')
Next, we’ll load the NYC taxi data in CSV format from HDFS using pure Python and persist the data in memory:
>>> df = hdfs.read_csv('/user/ubuntu/nyc/yellow_tripdata_2015-*.csv',
...                    parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
...                    header='infer')
>>> df = e.persist(df)
We can perform familiar operations such as computing value counts on columns and statistical correlations:
>>> df.payment_type.value_counts().compute()
1 91574644
2 53864648
3 503070
4 170599
5 28
Name: payment_type, dtype: int64
>>> df2 = df.assign(payment_2=(df.payment_type == 2),
... no_tip=(df.tip_amount == 0))
>>> df2.astype(int).corr().compute()
no_tip payment_2
no_tip 1.000000 0.943123
payment_2 0.943123 1.000000
Dask runs entirely asynchronously, leaving us free to explore other cells in the notebook while computations happen in the background. Dask also takes care of the messy CSV schema details for us automatically.
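For example, we can launch a long-running computation, monitor it, and only block when we actually need the answer; a minimal sketch using the same Executor and the progress function imported above:
>>> # Kick off the correlation computation without blocking the notebook
>>> future = e.compute(df2.astype(int).corr())
>>> progress(future)    # shows a progress bar while the cluster works
>>> # ...continue working in other cells, then collect the result when needed
>>> future.result()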
Finally, we can use Bokeh to generate an interactive plot of the resulting data.
View the full notebook for this distributed dataframe example on Anaconda Cloud.
Numerical, Statistical and Scientific Computations with Distributed Arrays (Dask Array)
Dask works well with numerical and scientific computations on n-dimensional array data. The Dask Array collection mimics a subset of the NumPy API, uses NumPy under the hood, and supports operations like dot, flatten, max, mean, and std.
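As a quick local illustration before the cluster example below (the array contents here are arbitrary), we can wrap an in-memory NumPy array in a chunked Dask array:
>>> import numpy as np
>>> import dask.array as da
>>> # A 1000 x 1000 array split into 250 x 250 chunks
>>> x = da.from_array(np.arange(1e6).reshape(1000, 1000), chunks=(250, 250))
>>> # NumPy-style reductions, evaluated lazily until .compute()
>>> x.mean().compute()
499999.5
>>> x.max().compute()
999999.0
>>> # Normalization stays lazy until we ask for the result
>>> z = (x - x.mean()) / x.std()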
For example, we can use Dask to perform computations with arrays on a cluster with global temperature/weather data stored in NetCDF format (similar to HDF5), all while working interactively in a Jupyter notebook. The data files contain measurements taken every six hours at every quarter degree of latitude and longitude.
First, we import the netCDF4 library and point to the data files stored on disk:
>>> import netCDF4
>>> from glob import glob
>>> filenames = sorted(glob('2014-*.nc3'))
>>> t2m = [netCDF4.Dataset(fn).variables['t2m'] for fn in filenames]
>>> t2m[0]
<class 'netCDF4._netCDF4.Variable'>
int16 t2m(time, latitude, longitude)
scale_factor: 0.00159734395579
add_offset: 268.172358066
_FillValue: -32767
missing_value: -32767
units: K
long_name: 2 metre temperature
unlimited dimensions:
current shape = (4, 721, 1440)
filling off
We then import Dask and wrap the NetCDF variables, which behave like NumPy arrays, in Dask arrays:
>>> import dask.array as da
>>> xs = [da.from_array(t, chunks=t.shape) for t in t2m]
>>> x = da.concatenate(xs, axis=0)
We can then perform distributed computations on the cluster, such as computing the mean temperature, the standard deviation of the temperature over time, and the normalized temperature. We can view the progress of the computations as they run on the cluster nodes and continue to work in other cells in the notebook:
>>> avg, std = da.compute(x.mean(axis=0), x.std(axis=0))
>>> z = (x - avg) / std
>>> progress(z)
We can plot the resulting normalized temperature using matplotlib.
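The rendered figure isn’t reproduced here, but a minimal sketch of that kind of plot, pulling one normalized time slice back to the local machine, might look like this:
>>> import matplotlib.pyplot as plt
>>> # Bring one normalized time slice back as a NumPy array and plot it
>>> frame = z[0].compute()
>>> plt.figure(figsize=(12, 6))
>>> plt.imshow(frame, cmap='RdBu_r')
>>> plt.colorbar(label='Normalized 2 metre temperature')
>>> plt.show()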
We can also create interactive widgets in the notebook to interact with and visualize the data in real-time while the computations are running across the cluster:
View the full notebook for this distributed array example on Anaconda Cloud.
Creating Custom Parallel Workflows
When one of the standard Dask collections isn’t a good fit for your workflow, Dask gives you the flexibility to work with different file formats and custom parallel workflows. The Dask Imperative interface (dask.delayed) lets you wrap the functions in your existing Python code and run the computations on a single machine or across a cluster.
In this example, we have multiple files stored hierarchically on disk in the Feather file format (a format for reading and writing Python and R dataframes). We can build a custom workflow by wrapping the code with dask.delayed and making use of the Feather library:
>>> import os
>>> from glob import glob
>>> import feather
>>> import pandas as pd
>>> from dask import delayed
>>> lazy_dataframes = []
>>> # Lazily read each Feather file and tag it with its date and symbol
>>> for directory in glob('2016-*'):
...     for symbol in os.listdir(directory):
...         filename = os.path.join(directory, symbol)
...         df = delayed(feather.read_dataframe)(filename)
...         df = delayed(pd.DataFrame.assign)(df,
...                                           date=pd.Timestamp(directory),
...                                           symbol=symbol)
...         lazy_dataframes.append(df)
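As a sketch of one possible next step (not shown above), the lazy dataframes can be combined into a single Dask DataFrame and computed on:
>>> import dask.dataframe as dd
>>> # Combine the lazily-read Feather files into one logical Dask DataFrame
>>> df = dd.from_delayed(lazy_dataframes)
>>> # Trigger the whole workflow, e.g., count the rows per symbol
>>> df.groupby('symbol').size().compute()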
View the full notebook for this custom parallel workflow example on Anaconda Cloud.
Additional Resources
View more examples in the Dask documentation. For more information about using Anaconda and Dask to scale out Python on your cluster, check out our recent webinar on High Performance Hadoop with Python.
You can get started with Anaconda and Dask using Anaconda for cluster management for free on up to 4 cloud-based or bare-metal cluster nodes by logging in with your Anaconda Cloud account:
$ conda install anaconda-client -n root
$ anaconda login
$ conda install anaconda-cluster -c anaconda-cluster
In addition to Anaconda subscriptions, there are many different ways that Continuum can help you get started with Anaconda and Dask to construct parallel workflows, parallelize your existing code, or integrate with your existing Hadoop or HPC cluster, including:
- Architecture consulting and review
- Manage Python packages and environments on a cluster
- Develop custom package management solutions on existing clusters
- Migrate and parallelize existing code with Python and Dask
- Architect parallel workflows and data pipelines with Dask
- Build proof of concepts and interactive applications with Dask
- Custom product/OSS core development
- Training on parallel development with Dask
For more information about the above solutions, or if you’d like to test-drive the on-premises, enterprise features of Anaconda with additional nodes on a bare-metal, on-premises, or cloud-based cluster, get in touch with us at sales@continuum.io.