This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project
All code in this post is experimental. It should not be relied upon. For people looking to deploy dask.distributed on a cluster, please refer to the documentation instead.
Dask is deployed today on the following systems in the wild:
- SGE
- SLURM
- Torque
- Condor
- LSF
- Mesos
- Marathon
- Kubernetes
- SSH and custom scripts
- … there may be more. This is what I know of first-hand.
These systems provide users access to cluster resources and ensure that many distributed services / users play nicely together. They’re essential for any modern cluster deployment.
The people deploying Dask on these cluster resource managers are power-users; they know how their resource managers work and they read the documentation on how to set up Dask clusters. Generally these users are pretty happy; however, we should reduce this barrier so that non-power-users with access to a cluster resource manager can use Dask on their cluster just as easily.
Unfortunately, there are a few challenges:
- Several cluster resource managers exist, each with significant adoption. Finite developer time stops us from supporting all of them.
- Policies for scaling out vary widely. For example we might want a fixed number of workers, or we might want workers that scale out based on current use. Different groups will want different solutions.
- Individual cluster deployments are highly configurable. Dask needs to get out of the way quickly and let existing technologies configure themselves.
This post talks about some of these issues. It does not contain a definitive solution.
Example: Kubernetes
For example, both Olivier Grisel (INRIA, scikit-learn) and Tim O’Donnell (Mount Sinai, Hammer lab) publish instructions on how to deploy Dask.distributed on Kubernetes.
These instructions are well organized. They include Dockerfiles, published images, Kubernetes config files, and instructions on how to interact with cloud providers’ infrastructure. Olivier and Tim both obviously know what they’re doing and care about helping others to do the same.
Tim (who came second) wasn’t aware of Olivier’s solution and wrote up his own. Tim was capable of doing this but many beginners wouldn’t be.
One solution would be to include a prominent registry of solutions like these within Dask documentation so that people can find quality references to use as starting points. I’ve started a list of resources here: dask/distributed #547; comments pointing to other resources would be most welcome.
However, even if Tim did find Olivier’s solution I suspect he would still need to change it. Tim has different software and scalability needs than Olivier. This raises the question of “What should Dask provide and what should it leave to administrators?” It may be that the best we can do is to support copy-paste-edit workflows.
What is Dask-specific, what is resource-manager-specific, and what needs to be configured by hand each time?
Adaptive Deployments
In order to explore this topic of separable solutions I built a small adaptive deployment system for Dask.distributed on Marathon, an orchestration platform on top of Mesos.
This solution does two things:
- It scales a Dask cluster dynamically based on current use. If there are too many tasks queued in the scheduler, it asks for more workers.
- It deploys those workers using Marathon.
To encourage replication, these two different aspects are solved in two different pieces of code with a clean API boundary.
- A backend-agnostic piece for adaptivity that says when to scale workers up and how to scale them down safely
- A Marathon-specific piece that deploys or destroys dask-workers using the Marathon HTTP API
This combines a policy, adaptive scaling, with a backend, Marathon, such that either can be replaced easily. For example, we could replace the adaptive policy with a fixed one that always keeps N workers online, or we could replace Marathon with Kubernetes or Yarn.
My hope is that this demonstration encourages others to develop third party packages. The rest of this post will be about diving into this particular solution.
Adaptivity
The distributed.deploy.Adaptive class wraps around a Scheduler and determines when we should scale up and by how many nodes, and when we should scale down, specifying which idle workers to release.
The current policy is fairly straightforward:
- If there are unassigned tasks or any stealable tasks and no idle workers, or if the average memory use is over 50%, then increase the number of workers by a fixed factor (defaults to two).
- If there are idle workers and the average memory use is below 50%, then reclaim the idle workers with the least data on them (after moving data to nearby workers) until we’re near 50%.
Think this policy could be improved or have other thoughts? Great. It was easy to implement and entirely separable from the main code so you should be able to edit it easily or create your own. The current implementation is about 80 lines (source).
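To make the shape of that logic concrete, here is a rough sketch in Python. This is not the actual implementation; every helper function in it (worker_count, unassigned_tasks, stealable_tasks, idle_workers, memory_fraction, sorted_by_data) is a hypothetical stand-in for scheduler state that the real Adaptive class inspects.

```python
# A rough sketch of the policy described above, not the real implementation.
# All helper functions here are hypothetical stand-ins for scheduler state.

SCALE_FACTOR = 2        # grow by this factor when overloaded (the default)
MEMORY_TARGET = 0.50    # aim for roughly 50% average memory use


def adapt_once(cluster, scheduler):
    pending = unassigned_tasks(scheduler) or stealable_tasks(scheduler)
    idle = idle_workers(scheduler)

    if (pending and not idle) or memory_fraction(scheduler) > MEMORY_TARGET:
        # Overloaded: ask the backend for more workers
        cluster.scale_up(worker_count(scheduler) * SCALE_FACTOR)
    elif idle and memory_fraction(scheduler) < MEMORY_TARGET:
        # Underloaded: retire idle workers with the least data on them,
        # after their data has been moved to the remaining workers
        cluster.scale_down(sorted_by_data(idle))
```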
However, this Adaptive class doesn’t actually know how to perform the scaling. Instead it depends on being handed a separate object with two methods, scale_up and scale_down:
```python
class MyCluster(object):
    def scale_up(self, n):
        """
        Bring the total count of workers up to ``n``

        This function/coroutine should bring the total number of workers up to
        the number ``n``.
        """
        raise NotImplementedError()

    def scale_down(self, workers):
        """
        Remove ``workers`` from the cluster

        Given a list of worker addresses this function should remove those
        workers from the cluster.
        """
        raise NotImplementedError()
```
This cluster object contains the backend-specific bits of how to scale up and down, but none of the adaptive logic of when to scale up and down. The single-machine LocalCluster object serves as a reference implementation.
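To make the interface concrete, here is a toy backend that satisfies it by launching dask-worker processes on the local machine with subprocess. This is only a sketch of the shape of a backend, not something to deploy; the process bookkeeping is simplistic and assumes each worker registers under the name we pass on the command line.

```python
import subprocess


class SubprocessCluster(object):
    """Toy backend: run dask-worker processes on the local machine."""

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.processes = {}   # map worker name -> Popen object

    def scale_up(self, n):
        """Bring the total count of workers up to ``n``."""
        while len(self.processes) < n:
            name = 'worker-%d' % len(self.processes)
            proc = subprocess.Popen(['dask-worker', self.scheduler.address,
                                     '--name', name])
            self.processes[name] = proc

    def scale_down(self, workers):
        """Remove the given workers (a list of addresses) from the cluster."""
        for w in workers:
            name = self.scheduler.worker_info[w]['name']
            proc = self.processes.pop(name, None)
            if proc is not None:
                proc.terminate()
```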
So we combine this adaptive scheme with a deployment scheme. We’ll use a tiny Dask-Marathon deployment library available here.
```python
from dask_marathon import MarathonCluster
from distributed import Scheduler
from distributed.deploy import Adaptive

s = Scheduler()
mc = MarathonCluster(s, cpus=1, mem=4000,
                     docker_image='mrocklin/dask-distributed')
ac = Adaptive(s, mc)
```
This combines a policy, Adaptive, with a deployment scheme, Marathon, in a composable way. The Adaptive cluster watches the scheduler and calls the scale_up/down methods on the MarathonCluster as necessary.
Marathon code
Because we’ve isolated all of the “when” logic to the Adaptive code, the Marathon-specific code is blissfully short and specific. We include a slightly simplified version below. There is a fair amount of Marathon-specific setup in the constructor and then simple scale_up/down methods below:
```python
import uuid

from marathon import MarathonClient, MarathonApp
from marathon.models.container import MarathonContainer


class MarathonCluster(object):
    def __init__(self, scheduler,
                 executable='dask-worker',
                 docker_image='mrocklin/dask-distributed',
                 marathon_address='http://localhost:8080',
                 name=None, cpus=1, mem=4000, **kwargs):
        self.scheduler = scheduler

        # Create Marathon App to run dask-worker
        args = [executable, scheduler.address,
                '--nthreads', str(cpus),
                '--name', '$MESOS_TASK_ID',  # use Mesos task ID as worker name
                '--worker-port', '$PORT_WORKER',
                '--nanny-port', '$PORT_NANNY',
                '--http-port', '$PORT_HTTP']

        ports = [{'port': 0,
                  'protocol': 'tcp',
                  'name': name}
                 for name in ['worker', 'nanny', 'http']]

        args.extend(['--memory-limit',
                     str(int(mem * 0.6 * 1e6))])

        kwargs['cmd'] = ' '.join(args)
        container = MarathonContainer({'image': docker_image})

        app = MarathonApp(instances=0,
                          container=container,
                          port_definitions=ports,
                          cpus=cpus, mem=mem, **kwargs)

        # Connect and register app
        self.client = MarathonClient(marathon_address)
        self.app = self.client.create_app(name or 'dask-%s' % uuid.uuid4(), app)

    def scale_up(self, instances):
        self.client.scale_app(self.app.id, instances=instances)

    def scale_down(self, workers):
        for w in workers:
            self.client.kill_task(self.app.id,
                                  self.scheduler.worker_info[w]['name'],
                                  scale=True)
```
This isn’t trivial; you need to know about Marathon for this to make sense, but fortunately you don’t need to know much else. My hope is that people familiar with other cluster resource managers will be able to write similar objects and publish them as third-party libraries, as I have with this Marathon solution here: https://github.com/mrocklin/dask-marathon (thanks go to Ben Zaitlen for setting up a great testing harness for this and getting everything started).
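For completeness, here is roughly what driving the cluster object by hand might look like, without the adaptive policy. This is an untested sketch that assumes a Marathon endpoint on localhost:8080 and a scheduler running on an event loop.

```python
from distributed import Scheduler
from dask_marathon import MarathonCluster

s = Scheduler()                       # assumed to be started on an event loop
mc = MarathonCluster(s, cpus=1, mem=4000,
                     docker_image='mrocklin/dask-distributed')

mc.scale_up(4)                        # ask Marathon for four dask-worker instances
# ... later, retire two workers by address once they have registered
mc.scale_down(list(s.worker_info)[:2])
```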
Adaptive Policies
Similarly, we can design new policies for deployment. You can read more about the policies for the Adaptive class in the documentation or the source (about eighty lines long). I encourage people to implement and use other policies and to contribute back those that are useful in practice.
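For example, the fixed policy mentioned earlier, which keeps N workers online regardless of load, needs very little code. The sketch below is hypothetical and untested; it reuses the same cluster objects but replaces Adaptive with a periodic check on a Tornado event loop.

```python
from tornado.ioloop import PeriodicCallback


class Fixed(object):
    """Hypothetical policy: keep exactly ``n`` workers online."""

    def __init__(self, scheduler, cluster, n=10, interval=1000):
        self.scheduler = scheduler
        self.cluster = cluster
        self.n = n
        # check the worker count every `interval` milliseconds
        self._pc = PeriodicCallback(self._check, interval)
        self._pc.start()

    def _check(self):
        current = len(self.scheduler.worker_info)
        if current < self.n:
            self.cluster.scale_up(self.n)
        elif current > self.n:
            # retire the most recently joined extras
            extra = list(self.scheduler.worker_info)[self.n:]
            self.cluster.scale_down(extra)
```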
Final thoughts
We laid out a problem:
- How does a distributed system support a variety of cluster resource managers and a variety of scheduling policies while remaining sensible?
We proposed two solutions:
- Maintain a registry of links to solutions, supporting copy-paste-edit practices
- Develop an API boundary that encourages separable development of third party libraries.
It’s not clear that either solution is sufficient, or that the current implementation of either is any good. This is an important problem, though, as Dask.distributed today is still mostly used by power-users. I would like to engage community creativity here as we search for a good solution.