Distributed Auto-ML with TPOT and Dask

By Tom Augspurger

This work is supported by Anaconda, Inc.

This post describes a recent improvement made to TPOT, an automated machine learning library for Python that does feature engineering and hyper-parameter optimization for you. TPOT uses genetic algorithms to evaluate which models are performing well and to choose new models to try out in the next generation.

Parallelizing TPOT

In TPOT-730, we made some modifications to TPOT to support distributed training. As a TPOT user, the only changes you need to make to your code are:

  1. Connect a client to your Dask cluster
  2. Pass the `use_dask=True` argument to your TPOT estimator

From there, all the training will use your cluster of machines. This screencast shows an example on an 80-core Dask cluster.

Commentary

Fitting a TPOT estimator consists of several stages. The bulk of the time is spent evaluating individual scikit-learn pipelines. Dask-ML already had code for splitting a scikit-learn `Pipeline.fit` call into individual tasks; this is used in Dask-ML’s hyper-parameter optimization to avoid repeating work. We were able to swap in Dask-ML’s fit-and-score method for the one already used in TPOT. That small change lets the many individual models in a generation be fit on a cluster.

There’s still some room for improvement. Internally, TPOT spends some time determining the next set of models to try out (the “mutation and crossover” phase). That hasn’t (yet) been parallelized with Dask, so you’ll notice some periods of inactivity on the cluster.

Next Steps

This will be available in the next release of TPOT. You can try out a small example now on the dask-examples binder.

Stepping back a bit, I think this is a good example of how libraries can use Dask internally to parallelize workloads for their users. Deep down in TPOT there was a single method for fitting many scikit-learn models on some data and collecting the results. Dask-ML has code for building a task graph that does the same thing. We were able to swap out the eager TPOT code for the lazy Dask version, and get things distributed on a cluster. Projects like xarray have been able to do a similar thing with Dask arrays in place of NumPy arrays. If Dask-ML hadn’t already had that code, `dask.delayed` could have been used instead.
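As a rough illustration of that fallback, `dask.delayed` can wrap a plain fit-and-score function so that many scikit-learn pipelines are evaluated in parallel. The pipelines and data below are made up for the example; this is not TPOT’s actual internals.

```python
# Sketch: fitting several scikit-learn pipelines in parallel
# with dask.delayed.
import dask
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

@dask.delayed
def fit_and_score(pipeline, X, y):
    # Each call becomes one task in the graph; Dask schedules
    # the tasks across threads, processes, or a cluster.
    pipeline.fit(X, y)
    return pipeline.score(X, y)

pipelines = [
    make_pipeline(StandardScaler(), LogisticRegression(C=c))
    for c in (0.1, 1.0, 10.0)
]

# Nothing runs until compute(); then all the fits execute in parallel.
scores = dask.compute(*[fit_and_score(p, X, y) for p in pipelines])
```

Because `fit_and_score` is lazy, the full set of fits is described up front as a task graph, which is what lets a scheduler farm them out to a cluster.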

If you have a library that you think could take advantage of Dask, please reach out!

