Introducing Dask for Scalable Machine Learning


Python has several powerful libraries for machine learning, but they don’t always scale well to large datasets. This has pushed data scientists toward tools outside of the Python ecosystem (e.g., Spark) when they need to process data that can’t fit on a single machine.

But thanks to Dask, data scientists can now use the Python tools they already know and love to process large volumes of data in parallel.

The Challenge of Scaling to Large Datasets

Python users benefit from a rich analytics ecosystem. NumPy, pandas, matplotlib, and many other packages make data scientists wildly productive with Python across myriad data-related problems. However, one thing these libraries haven’t done well is scale to large datasets. Popular libraries like NumPy and pandas are largely designed to run on a single core and to work with data that fits in RAM.

This is a real problem for data scientists. As data volumes continue to grow, it’s no longer feasible to solve business problems with the computing power of a single machine. Modern data volumes often necessitate the use of a cluster, especially in the enterprise data science environment.

Dask to the Rescue

In the past, data scientists were forced to switch from Python to a distributed computing framework like Spark to process big data on a cluster. This meant learning entirely new APIs, which was at best annoying and at worst a large drain on productivity. Thanks to Dask, that is no longer the case. As Matt Rocklin shared in his Scaling Python with Dask webinar last month, data scientists can easily scale numeric Python tools to large datasets with Dask.
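
To give a feel for what scaling a numeric Python tool looks like, here is a minimal sketch that uses `dask.array`, which mirrors much of the NumPy API. The array contents, its shape, and the chunk sizes are made up for illustration, and the data here is small enough to fit in memory; the same pattern is meant to apply when the chunks live on disk or across a cluster.

```python
import numpy as np
import dask.array as da

# Illustrative data: a small in-memory NumPy array standing in for a
# dataset that would normally be too large for a single machine.
data = np.arange(1_000_000, dtype="float64").reshape(1000, 1000)

# Wrap it as a Dask array split into 250 x 250 blocks (a 4 x 4 grid of chunks).
x = da.from_array(data, chunks=(250, 250))

# This looks like NumPy, but it only builds a lazy task graph.
column_means = x.mean(axis=0)

# compute() executes the graph, working through the chunks in parallel,
# and returns an ordinary NumPy array.
result = column_means.compute()
```

The call to `compute()` is the only step that does real work; everything before it describes the computation, which is what lets Dask schedule it across cores or machines.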

While Matt provided a high-level overview of Dask for a variety of data science tasks, we know that scaling machine learning in particular is very important to data scientists. Scikit-learn, for example, is a popular machine learning library that works extremely well with data that can fit on a laptop. But when that is no longer the case, Dask-ml provides several options for scaling machine learning workloads with scikit-learn (as well as many other machine learning packages such as TensorFlow and XGBoost).

In our upcoming webinar, Scalable Machine Learning with Dask, Anaconda Data Scientist Tom Augspurger will share how easy Dask-ml makes it for data scientists to scale their machine learning workloads from their laptops to thousands of nodes on a cluster. If you’re interested in applying machine learning to large datasets with the friendly Python APIs you already know, be sure to tune in!

Want to learn more? Register now for our live webinar, Scalable Machine Learning with Dask, taking place Thursday, June 21, at 2 PM CT.

