Introducing Dask for Scalable Machine Learning

Although Python offers several powerful libraries for machine learning, they don't always scale well to large datasets. This has forced data scientists to reach for tools outside the Python ecosystem (e.g., Spark) when they need to process data that can't fit on a single machine.

But thanks to Dask, data scientists can now use the Python tools they already know and love to process large volumes of data in parallel.

The Challenge of Scaling to Large Datasets

Python users benefit from a rich analytics ecosystem. NumPy, pandas, matplotlib, and many other packages make data scientists wildly productive across myriad data-related problems. One thing these libraries haven't historically done well, however, is scale to large datasets: popular libraries like NumPy and pandas are designed to run on a single core and to work with data that fits in RAM.

This is a real problem for data scientists. As data volumes continue to grow, it's often no longer feasible to solve business problems with the computing power of a single machine. Modern data volumes frequently necessitate a cluster, especially in enterprise data science environments.

Dask to the Rescue

In the past, data scientists were forced to switch from Python to a distributed computing framework like Spark to process big data on a cluster. This meant learning entirely new APIs, which was at best annoying and at worst a large drain on productivity. Thanks to Dask, that is no longer the case. As Matt Rocklin shared in his Scaling Python with Dask webinar last month, data scientists can easily scale numeric Python tools to large datasets with Dask.
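To see what that looks like in practice, here is a minimal sketch of Dask's pandas-mirroring DataFrame API. The file path and column names below are hypothetical, chosen purely for illustration:

```python
import dask.dataframe as dd

# Read a directory of CSV files as one logical DataFrame; the glob
# pattern and the column names are placeholders for illustration.
df = dd.read_csv("data/transactions-*.csv")

# The same groupby/aggregate API as pandas, built up lazily...
mean_amount = df.groupby("customer_id").amount.mean()

# ...and executed in parallel, across cores or a cluster, on .compute().
print(mean_amount.compute())
```

Because Dask DataFrames copy the pandas API, most of this code is identical to what a pandas user would write; the main difference is that the computation is lazy until .compute() triggers the parallel execution.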

While Matt provided a high-level overview of Dask for a variety of data science tasks, we know that scaling machine learning in particular is very important to data scientists. Scikit-learn, for example, is a popular machine learning library that works extremely well with data that fits on a laptop. But when data outgrows a single machine, Dask-ML provides several options for scaling machine learning workloads with scikit-learn (as well as with many other machine learning packages, such as TensorFlow and XGBoost).
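As one illustration of those options, here is a minimal sketch of a pattern described in the Dask-ML documentation: routing scikit-learn's joblib-based parallelism to Dask workers. A local cluster and synthetic data are assumed below purely for demonstration:

```python
import joblib
from dask.distributed import Client  # registers the "dask" joblib backend
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Connect to a cluster; with no arguments, Client() starts a local one.
client = Client()

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

# n_jobs=-1 tells scikit-learn to parallelize tree training via joblib.
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1)

# This context manager redirects scikit-learn's internal joblib work
# to the Dask workers instead of local threads or processes.
with joblib.parallel_backend("dask"):
    clf.fit(X, y)
```

This approach fits best when the data still fits in memory but the computation (training many trees, or a large hyperparameter search) is the bottleneck; for data that doesn't fit in memory, Dask-ML also provides estimators that operate directly on Dask arrays and DataFrames.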

In our upcoming webinar, Scalable Machine Learning with Dask, Anaconda Data Scientist Tom Augspurger will show how Dask-ML makes it easy for data scientists to scale their machine learning workloads from a laptop to thousands of nodes on a cluster. If you're interested in applying machine learning to large datasets with the friendly Python APIs you already know, be sure to tune in!

Want to learn more? Register now for our live webinar, Scalable Machine Learning with Dask, taking place Thursday, June 21, at 2 p.m. CT.
