Scalable Machine Learning in the Enterprise with Dask

 

You’ve been hearing the hype for years: machine learning can have a magical, transformative impact on your business, putting key insights into the hands of decision-makers and driving industries forward. But many organizations today still struggle to extract value from their machine learning initiatives. Why?

Building & Training Models on Your Laptop is No Longer Good Enough

One of the biggest reasons machine learning projects fail to produce tangible results in the enterprise is the inability to scale models. Leading data scientists understand that, by harnessing larger training sets, they can build models that are more effective.

According to OpenAI, the compute used in the largest AI training runs has been increasing exponentially, doubling roughly every 3.5 months. Their research shows that this compute has grown by more than 300,000x since AlexNet was released in 2012.

What this means is that data scientists must plan to scale their model training; simply building and training models on a laptop is no longer good enough. But even the most popular tools for machine learning are not designed to scale out. Scikit-learn, for example, works well with data that fits in memory on a single machine, but when your data volumes outgrow that machine and require multiple nodes, it cannot help you on its own.

Historically, data scientists have turned to distributed computing frameworks like Spark to train models on large datasets. This approach is not ideal, however. Even if an enterprise data science team has access to a Spark cluster, they must still rewrite their code for Spark, which is time-consuming and introduces the potential for reproducibility errors.

As we learned in Anaconda Data Scientist Tom Augspurger’s recent webinar, Scalable Machine Learning with Dask, there is now an easier alternative for scaling model training: Dask.

Scale Your Machine Learning Systems in the Enterprise with Dask

With Dask, data scientists can use the familiar APIs they know (including scikit-learn, XGBoost, and TensorFlow) and, with slight modification, scale their model training to thousands of nodes on a cluster. As Tom demonstrated, data scientists can now perform compute-intensive tasks like hyper-parameter optimization on large datasets with relative ease. This means that model training can take less time, and a broader search over larger datasets can improve model accuracy.
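To make the "slight modification" concrete, here is a minimal sketch of a scikit-learn grid search whose work is handed to Dask through joblib's "dask" backend. It is one common pattern and not necessarily the approach shown in the webinar. The synthetic dataset and the parameter grid are assumptions chosen purely for illustration, and the client here starts a small local cluster rather than connecting to a real one.

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic dataset: a stand-in for real training data.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Hypothetical search space, chosen only for illustration.
param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=3, n_jobs=-1)

# With no scheduler address, Client starts a local cluster
# (thread-based here). Passing the address of a running Dask
# scheduler would send the same work to that cluster instead.
client = Client(processes=False)
try:
    # The only change to ordinary scikit-learn code: run the fit
    # inside joblib's "dask" backend so each candidate/fold fit
    # becomes a Dask task.
    with joblib.parallel_backend("dask"):
        search.fit(X, y)
finally:
    client.close()

print(search.best_params_)
```

The estimator, the grid, and the call to `fit` are unchanged scikit-learn code; only the surrounding context manager and the client differ from a laptop-only run.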

Even better, Dask scales down nicely. For data volumes that don’t fit in memory but are not so large that a cluster is required, Dask can still parallelize computation on a single machine. This is extremely useful for data scientists who experiment with samples of their data. With Dask, they do not have to rewrite their code as they move from sample data to larger training sets, saving them significant time.
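As a rough illustration of scaling down, the sketch below uses a chunked Dask array on a single machine. The array size and chunk size are assumptions, kept small so the example runs quickly; with real data the chunks would typically be loaded lazily from disk so that only a few are in memory at a time.

```python
import dask
import dask.array as da

# A 20,000 x 50 array split into ten 2,000-row chunks. The sizes are
# illustrative; each chunk is an ordinary NumPy array once computed.
X = da.random.random((20_000, 50), chunks=(2_000, 50))

# These lines only build a task graph; no data is generated yet.
col_means = X.mean(axis=0)
col_stds = (X - col_means).std(axis=0)

# compute() executes the graph, processing chunks in parallel on the
# local machine and returning plain NumPy arrays.
means, stds = dask.compute(col_means, col_stds)

print(means.shape, stds.shape)
```

The same array code should run unchanged against a cluster once a `dask.distributed` client is connected, which is what lets a workflow grow from a sample to a larger training set without a rewrite.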

While Dask is open source and freely available, many of our enterprise customers leverage Dask as part of their Anaconda Enterprise deployments. Anaconda Enterprise is the AI enablement platform for data science teams at scale. It offers not only a scalable environment to build and train models on infrastructure ranging from a single machine to thousands of nodes, but also a simple, robust platform for model deployment and management.

With the single click of a button, data scientists can deploy models into production in seconds, quickly delivering insights into the hands of decision-makers without IT intervention or laborious server-side coding. Meanwhile, Anaconda Enterprise keeps IT happy by offering security, governance, and scale that provide them with control without sacrificing productivity.

If you missed Tom’s webinar, you can check it out here on-demand. To learn more about how Anaconda Enterprise can take your organization to the next level, contact us anytime for a demonstration.

