Introducing Dask for Scalable Machine Learning

Python has several powerful libraries for machine learning, but they don’t always scale well to large datasets. This has forced data scientists to use tools outside of the Python ecosystem (e.g., Spark) when they need to process data that can’t fit on a single machine.

But thanks to Dask, data scientists can now use the Python tools they already know and love to process large volumes of data in parallel.

The Challenge of Scaling to Large Datasets

Python users benefit from a rich analytics ecosystem. NumPy, pandas, matplotlib, and many other packages make data scientists wildly productive with Python across myriad data-related problems. However, one thing these libraries haven’t done well is scale to large datasets. Popular libraries like NumPy and pandas were designed primarily for a single machine: most of their operations run on a single core and assume the data fits in RAM.

This is a real problem for data scientists. As data volumes continue to grow, it’s no longer feasible to solve business problems with the computing power of a single machine. Modern data volumes often necessitate the use of a cluster, especially in the enterprise data science environment.

Dask to the Rescue

In the past, data scientists were forced to switch from Python to a distributed computing framework like Spark to process big data on a cluster. This meant learning entirely new APIs, which was at best annoying and at worst a large drain on productivity. Thanks to Dask, that is no longer the case. As Matt Rocklin shared in his Scaling Python with Dask webinar last month, data scientists can easily scale numeric Python tools to large datasets with Dask.
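To make the "same tools, bigger data" idea concrete, here is a minimal sketch comparing a NumPy computation with its Dask counterpart. The array size and chunk size are arbitrary values chosen for illustration, not figures from the webinar.

```python
import numpy as np
import dask.array as da

# NumPy: the whole array lives in memory and is computed eagerly.
x_np = np.arange(1_000_000, dtype="float64")
mean_np = x_np.mean()

# Dask: the same API, but the array is split into chunks and evaluated lazily.
# The chunk size (100,000 elements) is an illustrative assumption.
x_da = da.arange(1_000_000, dtype="float64", chunks=100_000)
mean_lazy = x_da.mean()        # builds a task graph; no work happens yet
mean_da = mean_lazy.compute()  # runs the graph, processing chunks in parallel

print(x_da.numblocks)  # number of chunks along each axis
print(float(mean_np), float(mean_da))
```

The only visible differences are the `chunks` argument and the explicit `.compute()` call. On a laptop the chunks are processed by local threads; the same code can run against a cluster once a distributed scheduler is attached.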

While Matt provided a high-level overview of Dask for a variety of data science tasks, we know that scaling machine learning in particular is very important to data scientists. Scikit-learn, for example, is a popular machine learning library that works extremely well with data that can fit on a laptop. But when that is no longer the case, Dask-ML provides several options for scaling machine learning workloads with scikit-learn (as well as many other machine learning packages such as TensorFlow and XGBoost).
