Overview

Dask is a flexible open source parallel computation framework that lets you comfortably scale up and scale out your analytics. If you’re running into memory issues, storage limitations, or CPU boundaries on a single machine when using Pandas, NumPy, or other computations with Python, Dask can help you scale up on all of the cores on a single machine, or scale out on all of the cores and memory across your cluster.

Dask enables distributed computing in pure Python and complements the existing numerical and scientific computing capability within Anaconda. Dask works well on a single machine to make use of all of the cores on your laptop and process larger-than-memory data, and it scales out resiliently and elastically to clusters with hundreds of nodes.
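
For example, many Pandas workflows can be scaled by switching to Dask's DataFrame, which partitions the data and schedules work across all local cores. Here is a minimal sketch, where the file path and column names are hypothetical placeholders:

import dask.dataframe as dd

# Lazily read CSV files that together may be larger than memory;
# Dask splits them into many Pandas partitions under the hood.
df = dd.read_csv('data/transactions-*.csv')

# Operations build up a task graph instead of executing immediately.
result = df.groupby('customer_id')['amount'].mean()

# compute() executes the graph in parallel across all local cores.
print(result.compute())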

Dask works natively from Python with data in different formats and storage systems, including the Hadoop Distributed File System (HDFS) and Amazon S3. Anaconda and Dask can work with your existing enterprise Hadoop distribution, including Cloudera CDH and Hortonworks HDP.
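
As a brief sketch of what this looks like in practice (the bucket names and paths below are assumptions, and reading from S3 or HDFS requires the corresponding filesystem libraries, such as s3fs or hdfs3, to be installed):

import dask.bag as db
import dask.dataframe as dd

# Read CSV files directly from Amazon S3 into a Dask DataFrame.
df = dd.read_csv('s3://my-bucket/logs/2016-*.csv')

# Read semi-structured text files from HDFS into a Dask Bag.
lines = db.read_text('hdfs:///user/me/data/*.json')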

View the full notebook for this custom parallel workflow example on Anaconda Cloud.
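
To give a flavor of what such a workflow looks like, here is a minimal, self-contained sketch using dask.delayed with stand-in functions; the notebook itself differs in detail:

from dask import delayed

def load(i):
    # Stand-in for loading one chunk of data.
    return list(range(i * 10, (i + 1) * 10))

def process(chunk):
    # Stand-in for per-chunk work that can run in parallel.
    return sum(chunk)

def combine(results):
    # Stand-in for a final aggregation step.
    return sum(results)

# Wrapping calls in delayed() builds a lazy task graph
# rather than executing the functions immediately.
chunks = [delayed(load)(i) for i in range(4)]
partials = [delayed(process)(c) for c in chunks]
total = delayed(combine)(partials)

# compute() runs the whole graph in parallel.
print(total.compute())  # 780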

Additional Resources

You can find more examples in the Dask documentation. For more information about using Anaconda and Dask to scale out Python on your cluster, check out our recent webinar on High Performance Hadoop with Python.

You can get started with Anaconda and Dask using Anaconda for cluster management, free on up to four cloud-based or bare-metal cluster nodes. Log in with your Anaconda Cloud account and run:

$ conda install anaconda-client -n root
$ anaconda login
$ conda install anaconda-cluster -c anaconda-cluster

In addition to Anaconda subscriptions, there are many different ways that Continuum can help you get started with Anaconda and Dask to construct parallel workflows, parallelize your existing code, or integrate with your existing Hadoop or HPC cluster, including:

  • Architecture consulting and review
  • Managing Python packages and environments on a cluster
  • Developing custom package management solutions on existing clusters
  • Migrating and parallelizing existing code with Python and Dask
  • Architecting parallel workflows and data pipelines with Dask
  • Building proofs of concept and interactive applications with Dask
  • Custom product/OSS core development
  • Training on parallel development with Dask

For more information about the above solutions, or if you’d like to test-drive the enterprise features of Anaconda with additional nodes on a bare-metal, on-premises, or cloud-based cluster, get in touch with us at [email protected].

About the Authors

Kristopher Overholt

Product Manager

Kristopher works with DevOps, product engineering and platform infrastructure for the Anaconda Enterprise data science platform. His areas of expertise include distributed systems, data engineering and computational science workflows.

