Scalable Machine Learning in the Enterprise with Dask
Jun 27, 2018 | By Anaconda Team
You’ve been hearing the hype for years: machine learning can have a magical, transformative impact on your business, putting key insights into the hands of decision-makers and driving industries forward. But many organizations today still struggle to extract value from their machine learning initiatives. Why?
Building & Training Models on Your Laptop is No Longer Good Enough
One of the biggest reasons machine learning projects fail to produce tangible results in the enterprise is the inability to scale models. Leading data scientists understand that, by harnessing larger training sets, they can build models that are more effective.
According to OpenAI, the amount of compute used in the largest AI training runs has been growing exponentially, doubling roughly every 3.5 months. Their research shows that this compute has grown by more than 300,000x since 2012, the year AlexNet was released.
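As a rough sanity check, and assuming the 3.5-month figure is a doubling time, the two numbers above can be related with a few lines of Python. This is only back-of-the-envelope arithmetic, not a reproduction of OpenAI's analysis; the variable names are ours.

```python
import math

DOUBLING_MONTHS = 3.5    # assumed doubling time, taken from the figure above
GROWTH_FACTOR = 300_000  # total growth in compute, taken from the figure above

# Number of doublings needed to reach the growth factor: 2**doublings == GROWTH_FACTOR
doublings = math.log2(GROWTH_FACTOR)

# Time those doublings take at one doubling every 3.5 months
months = doublings * DOUBLING_MONTHS

print(f"{doublings:.1f} doublings over about {months / 12:.1f} years")
# prints "18.2 doublings over about 5.3 years"
```

A 300,000x increase is about 18 doublings, which at 3.5 months each spans a little over five years, in line with the period since 2012.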
What this means is that data scientists must plan to scale their model training; simply building and training models on a laptop is no longer good enough. But even the most popular tools for machine learning are not designed to scale out. Scikit-learn, for example, works well with data that fits in memory on a single machine, but once your data outgrows that machine or your workload needs multiple nodes, it offers little help on its own.
Historically, data scientists have turned to distributed computing frameworks like Spark to train models on large datasets. This approach is not ideal, however. Even if an enterprise data science team has access to a Spark cluster, they still must rewrite their code for Spark, which is time-consuming and introduces the potential for reproducibility errors.
As we learned in Anaconda Data Scientist Tom Augspurger’s recent webinar, Scalable Machine Learning with Dask, there is now an easier alternative for scaling model training: Dask.
Scale Your Machine Learning Systems in the Enterprise with Dask
With Dask, data scientists can keep using the familiar APIs they know (including scikit-learn, XGBoost, and TensorFlow) and, with slight modification, scale their model training to thousands of nodes on a cluster. As Tom demonstrated, data scientists can now perform compute-intensive tasks like hyperparameter optimization on large datasets with relative ease. As a result, model training takes less time, and the ability to train on more data and search more hyperparameters can improve model accuracy.
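To make the "slight modification" concrete, here is a minimal sketch of one way to run an ordinary scikit-learn grid search on Dask, using joblib's Dask backend. It is not necessarily the approach shown in the webinar. The synthetic dataset, the parameter grid, and the in-process local client are illustrative assumptions; on a real cluster you would point Client at your scheduler instead.

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Assumption: a local, in-process Dask cluster stands in for a real one.
client = Client(processes=False)

# Small synthetic dataset, used here only for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Plain scikit-learn hyperparameter search; nothing Dask-specific yet.
param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=3, n_jobs=-1)

# The only change: run the fits through joblib's Dask backend,
# so each candidate/fold fit becomes a task on the Dask workers.
with joblib.parallel_backend("dask"):
    search.fit(X, y)

print(search.best_params_)
client.close()
```

The estimator and search code stay unchanged; only the context manager around fit decides where the work runs. Note that this pattern parallelizes the search itself and still expects the training data to fit in memory on each worker.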
Even better, Dask scales down nicely. For data volumes that don’t fit in memory but are not so large that a cluster is required, Dask can still parallelize computation on a single machine. This is extremely useful for data scientists who experiment with samples of their data. With Dask, they do not have to rewrite their code as they move from sample data to larger training sets, saving them significant time.
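The single-machine case can be sketched with a chunked Dask array. The example below is only an illustration: the array is deliberately small and randomly generated, and the shape and chunk size are arbitrary assumptions. The same code applies when the array is too large to hold in memory at once, because Dask works through it one chunk at a time.

```python
import dask.array as da

# A 10,000 x 1,000 array stored as 10 chunks of 1,000 rows each.
# Nothing is generated yet; Dask only records how to build each chunk.
x = da.random.random((10_000, 1_000), chunks=(1_000, 1_000))

# Still lazy: this builds a task graph for the column means.
col_means = x.mean(axis=0)

# compute() executes the graph chunk by chunk on local threads
# and returns an ordinary NumPy array.
result = col_means.compute()

print(result.shape)  # prints (1000,)
```

Because the computation is expressed against chunks rather than the whole array, growing from a sample to the full dataset is mostly a matter of changing the input, not the code.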
While Dask is open source and freely available, many of our enterprise customers leverage Dask as part of their Anaconda Enterprise deployments. Anaconda Enterprise is the AI enablement platform for data science teams at scale. It offers not only a scalable environment to build and train models on infrastructure ranging from a single machine to thousands of nodes, but also a simple, robust platform for model deployment and management.
With a single click, data scientists can deploy models into production in seconds, quickly delivering insights into the hands of decision-makers without IT intervention or laborious server-side coding. Meanwhile, Anaconda Enterprise keeps IT happy by offering security, governance, and scale, giving IT teams control without sacrificing data scientists' productivity.