Introducing Dask for Scalable Machine Learning
Jun 15, 2018 | By Anaconda Team
Python offers several powerful libraries for machine learning, but they don’t always scale well to large datasets. This has forced data scientists to use tools outside of the Python ecosystem (e.g., Spark) when they need to process data that can’t fit on a single machine.
But thanks to Dask, data scientists can now use the Python tools they already know and love to process large volumes of data in parallel.
The Challenge of Scaling to Large Datasets
Python users benefit from a rich analytics ecosystem. NumPy, pandas, matplotlib, and many other packages make data scientists wildly productive with Python across myriad data-related problems. However, one thing these libraries haven’t done well is scale to large datasets. Popular libraries like NumPy and pandas were designed primarily to run on a single core and to work with data that fits in RAM.
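To make the memory constraint concrete, here is a minimal sketch in plain Python of what working around it by hand can look like: the data is read one chunk at a time, each chunk is reduced to a small partial result, and the partial results are combined at the end. This is not Dask's API. The helper names `iter_chunks` and `chunked_mean` and the chunk size are hypothetical, chosen only for illustration.

```python
from itertools import islice


def iter_chunks(values, chunk_size):
    """Yield successive lists of at most chunk_size items from an iterable."""
    iterator = iter(values)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_mean(values, chunk_size=50_000):
    """Compute a mean while holding only one chunk in memory at a time."""
    total = 0
    count = 0
    for chunk in iter_chunks(values, chunk_size):
        # Reduce each chunk to two small numbers, then discard the chunk.
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("cannot compute the mean of an empty input")
    return total / count


# range() stands in for a data source that is too large to load at once.
result = chunked_mean(range(1_000_000))
print(result)
```

Even this simple reduction needs bookkeeping that the single-machine libraries do not provide, and it still runs on one core. Spreading the chunks across cores or machines adds scheduling and data-movement concerns on top.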
This is a real problem for data scientists. As data volumes continue to grow, it’s often no longer feasible to solve business problems with the computing power of a single machine. Modern data volumes often necessitate the use of a cluster, especially in the enterprise data science environment.
Dask to the Rescue
In the past, data scientists were forced to switch from Python to a distributed computing framework like Spark to process big data on a cluster. This meant learning entirely new APIs, which was at best annoying and at worst a large drain on productivity. Thanks to Dask, that is no longer the case. As Matt Rocklin shared in his Scaling Python with Dask webinar last month, data scientists can easily scale numeric Python tools to large datasets with Dask.
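As a rough illustration of what scaling a numeric Python tool with Dask can look like, the sketch below wraps a NumPy array in a chunked Dask array and computes column means with the same method name NumPy uses. It is not taken from the webinar. The array shape and chunk sizes are arbitrary assumptions, and the example runs on a single machine rather than a cluster.

```python
import numpy as np
import dask.array as da

# A small in-memory array stands in for a dataset that would not fit in RAM.
# The shape and chunk sizes below are illustrative assumptions.
x_np = np.arange(1_000_000, dtype="float64").reshape(1000, 1000)

# Split the array into 250 x 250 blocks that Dask can process independently.
x = da.from_array(x_np, chunks=(250, 250))

# Same call as NumPy, but this only builds a task graph (no work is done yet).
column_means = x.mean(axis=0)

# compute() executes the graph and returns an ordinary NumPy array.
result = column_means.compute()
print(result.shape)
```

The operation is written as it would be in NumPy, and the chunking determines how the work can be split up. Running the same code on a cluster would additionally require connecting to a distributed scheduler, which is omitted here.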
While Matt provided a high-level overview of Dask for a variety of data science tasks, we know that scaling machine learning in particular is very important to data scientists. Scikit-learn, for example, is a popular machine learning library that works extremely well with data that can fit on a laptop. But when that is no longer the case, Dask-ML provides several options for scaling machine learning workloads with scikit-learn (as well as with other machine learning packages such as TensorFlow and XGBoost).
In our upcoming webinar, Scalable Machine Learning with Dask, Anaconda Data Scientist Tom Augspurger will share how Dask-ML makes it easy for data scientists to scale their machine learning workloads from their laptops to thousands of nodes on a cluster. If you’re interested in applying machine learning to large datasets with the friendly Python APIs you already know, be sure to tune in!
Want to learn more? Register now for our live webinar, Scalable Machine Learning with Dask, taking place Thursday, June 21, at 2 PM CT.