Building powerful machine learning models often requires more computing power than a laptop can provide. Although it’s fairly easy to provision compute instances in the cloud these days, all the computing power in the world won’t help you if your machine learning library cannot scale. Unfortunately, popular libraries like scikit-learn, XGBoost, and TensorFlow don’t offer native scaling to users.
This is a real problem. Data scientists must either build models on subsets of data small enough to fit on their laptops, or rewrite their code against different libraries that can scale to a cluster of machines. Neither approach is ideal.
On June 21, Anaconda Data Scientist and newly minted Python fellow Tom Augspurger held a webinar on scaling machine learning with Dask. With Dask, data scientists can scale their machine learning workloads from their laptops to thousands of nodes on a cluster, all without having to rewrite their code. This makes life easy for data scientists who want to build powerful models trained on large amounts of data, without having to leave the comfort of the familiar API they know. Moreover, enterprises love Dask because it enables their data scientists to iterate faster in model development and model training, leading to improved business outcomes.
We received a number of great questions during Tom’s webinar that we’d like to highlight here:
What advantages does Dask have in comparison to Apache Spark? In what situations would Dask be preferable?
At a user level, the two projects do have some overlapping goals. This is a question we’re frequently asked, so we’ve even dedicated a page of the Dask documentation to it, with some (probably biased) answers. We’ll quote the reasons you might want to choose Dask:
- You prefer Python or native code, or have large legacy code bases that you do not want to entirely rewrite
- Your use case is complex or does not cleanly fit the Spark computing model
- You want a lighter-weight transition from local computing to cluster computing
- You want to interoperate with other technologies and don’t mind installing multiple packages
But note that there are tradeoffs involved in every choice of technology, so Dask may not be the right tool for your problem.
I can rent an AWS EC2 instance with 36 cores. Why should I use Dask?
Dask is still useful for a single, large workstation. The majority of the scientific Python ecosystem is not parallelized. Libraries like NumPy and pandas, for the most part, don’t use multiple cores. Dask parallelizes these libraries. And if your dataset is larger than RAM, Dask will run the computation with a low memory footprint by operating on one chunk of data at a time.
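To make that concrete, here’s a minimal sketch of Dask’s NumPy-style arrays (the shapes and chunk sizes below are illustrative, not from the webinar). Dask splits one large computation into chunked tasks that run across all of a workstation’s cores:

```python
import dask.array as da

# A 10,000 x 10,000 array of random numbers, stored as 1,000 x 1,000 chunks.
# Each chunk is an ordinary NumPy array that fits comfortably in memory.
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

# Operations build a lazy task graph rather than executing immediately...
result = (x + x.T).mean(axis=0)

# ...and .compute() runs that graph in parallel, chunk by chunk.
print(result.compute().shape)
```

Because only a few chunks need to be in memory at once, the same code keeps working when the full array would not fit in RAM.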
Focusing specifically on machine learning, scikit-learn already has good single-machine parallelism via joblib. The first example we saw, which used `joblib.parallel_backend('dask')`, offers no advantage on a single machine; it’s only useful in distributed settings. If scikit-learn is working well for you on a single machine, then there’s little reason to use Dask-ML (some of Dask-ML’s pre-processing estimators may be faster, due to the parallelism, but that may not matter relative to the overall runtime).
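For reference, here is a rough sketch of that pattern. The dataset, estimator, and parameter grid are made up for illustration; the one line that hands scikit-learn’s work to Dask is the `joblib.parallel_backend("dask")` context manager:

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# With no scheduler address, this starts a small local cluster.
# On a real cluster you would pass the scheduler's address instead.
client = Client(processes=False)

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50], "max_depth": [3, None]},
    cv=3,
    n_jobs=-1,
)

# Inside this block, joblib dispatches the CV fits as Dask tasks.
with joblib.parallel_backend("dask"):
    grid.fit(X, y)

print(grid.best_params_)
client.close()
```

The appeal is that the modeling code is unchanged: only the context manager (and where the `Client` points) decides whether the fits run locally or on a cluster.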
Can you share the code from the examples?
Of course! The exact notebooks Tom ran are available at https://anaconda.org/TomAugspurger/notebooks:
- Distributed Scikit-Learn for CPU-bound Problems
- Distributed Scikit-Learn for RAM-bound Problems
- Distributed KMeans
To run these notebooks, you’ll need to connect to your Dask cluster by changing the scheduler address.
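Switching clusters is a one-line change: a notebook talks to whichever scheduler its `Client` points at. A minimal sketch (the scheduler host below is a placeholder, not the webinar’s cluster):

```python
from dask.distributed import Client

# To use your own cluster, pass its scheduler address, e.g.
#     client = Client("tcp://<scheduler-host>:8786")   # placeholder host
# With processes=False (or no arguments at all), Dask instead starts a
# small cluster on the local machine, which is handy for trying things out.
client = Client(processes=False)
print(client)
client.close()
```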
If you want to try out smaller variants of these notebooks without having to set anything up, check out the machine learning examples on Dask’s Binder.
How did you set up the cluster?
We didn’t have time during the webinar to go into this in detail, but Tom’s cluster was set up using Dask’s helm chart to deploy to Google Kubernetes Engine. Tom ran a simple
helm install stable/dask -f config.yaml
See Dask’s documentation for more on using Dask with Kubernetes.
There’s nothing particularly special about using Kubernetes or Google Cloud. Dask can be used in many environments, including on a single machine, Kubernetes, HPC systems, and YARN. See the setup documentation for more.
If scalable machine learning interests you, then you absolutely must check out our upcoming webinar, Accelerating Deep Learning with GPUs. Stan Seibert, Anaconda’s Director of Community Innovation and a phenomenal speaker in his own right, will introduce today’s hot topic in data science: Deep Learning with GPUs. Stan will help you gain an understanding of Deep Learning, the role GPUs play in parallelizing model training, and where AI can make a big impact in business today.
Even better, Stan will run all of his examples with Anaconda Enterprise. With Anaconda Enterprise, data scientists can access the open source libraries and leverage the hardware they need to build powerful Deep Learning models while also meeting IT requirements. Save your seat now—this is one you don’t want to miss!