Introducing Dask for Scalable Machine Learning


Python has several powerful libraries for machine learning, but they don’t always scale well to large datasets. This has forced data scientists to use tools outside of the Python ecosystem (e.g., Spark) when they need to process data that can’t fit on a single machine.

But thanks to Dask, data scientists can now use the Python tools they already know and love to process large volumes of data in parallel.

The Challenge of Scaling to Large Datasets

Python users benefit from a rich analytics ecosystem. NumPy, pandas, matplotlib, and many other packages make data scientists wildly productive with Python across myriad data-related problems. However, one thing these libraries haven’t done well is scale to large datasets. Popular libraries like NumPy and pandas were designed to run on a single machine, largely on a single core, and with data that fits in RAM.

This is a real problem for data scientists. As data volumes continue to grow, it’s no longer feasible to solve business problems with the computing power of a single machine. Modern data volumes often necessitate the use of a cluster, especially in the enterprise data science environment.

Dask to the Rescue

In the past, data scientists were forced to switch from Python to a distributed computing framework like Spark to process big data on a cluster. This meant learning entirely new APIs, which was at best annoying and at worst a large drain on productivity. Thanks to Dask, that is no longer the case. As Matt Rocklin shared in his Scaling Python with Dask webinar last month, data scientists can easily scale numeric Python tools to large datasets with Dask.
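To make the "APIs you already know" point concrete, here is a minimal sketch using `dask.array`, which mirrors much of the NumPy interface. The array size and chunk size are arbitrary values chosen for illustration; in practice the data would usually be loaded lazily from disk or a cluster rather than built from an in-memory NumPy array.

```python
import numpy as np
import dask.array as da

# Illustrative data: a small NumPy array standing in for a much larger dataset.
x_np = np.arange(1_000_000, dtype="float64")

# Split the array into 10 chunks of 100,000 elements each.
# Dask can process the chunks in parallel, one task per chunk.
x = da.from_array(x_np, chunks=100_000)

# The expression looks like NumPy, but it only builds a task graph.
lazy_mean = x.mean()

# Nothing is computed until .compute() is called.
mean_value = lazy_mean.compute()
print(mean_value)
```

The same pattern applies to `dask.dataframe`, which follows the pandas interface: operations stay lazy until `.compute()` is called.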

While Matt provided a high-level overview of Dask for a variety of data science tasks, we know that scaling machine learning in particular is very important to data scientists. Scikit-learn, for example, is a popular machine learning library that works extremely well with data that can fit on a laptop. But when that is no longer the case, Dask-ml provides several options for scaling machine learning workloads with scikit-learn (as well as many other machine learning packages such as TensorFlow and XGBoost).
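One commonly documented option is to keep the scikit-learn estimator unchanged and let Dask run the parallel work that scikit-learn already hands to joblib. The sketch below assumes that `dask`, `distributed`, `joblib`, and `scikit-learn` are installed. It uses a local in-process `Client` as a stand-in for a real cluster, and the dataset and parameter grid are made up for illustration.

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Assumption: a local, thread-based scheduler stands in for a cluster.
# On a real deployment, pass the scheduler address to Client instead.
client = Client(processes=False, dashboard_address=None)

# Small synthetic dataset that fits in memory (illustrative only).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# An ordinary scikit-learn hyperparameter search.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=3,
)

# Route joblib's parallel tasks (the individual fits) to the Dask scheduler.
with joblib.parallel_backend("dask"):
    search.fit(X, y)

best_C = search.best_params_["C"]
print(best_C)

client.close()
```

This approach helps when the computation is the bottleneck but the data still fits in memory. For datasets larger than memory, the estimators that Dask-ml provides for Dask arrays and dataframes are the more appropriate route.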

In our upcoming webinar, Scalable Machine Learning with Dask, Anaconda Data Scientist Tom Augspurger will share how easy Dask-ml makes it for data scientists to scale their machine learning workloads from their laptops to thousands of nodes on a cluster. If you’re interested in applying machine learning to large datasets with the friendly Python APIs you already know, be sure to tune in!

Want to learn more? Register now for our live webinar, Scalable Machine Learning with Dask, taking place Thursday, June 21, at 2 PM CT.

