Dask core contributor Jim Crist has put together a series of posts discussing some recent experiments combining Dask and scikit-learn on his blog, Marginally Stable. From these experiments, a small library has been built up, and can be found here.

The tutorial spans three posts, which cover model parallelism, data parallelism, and combining the two on a real-life dataset.

Part I: Dask & scikit-learn: Model Parallelism

In this post we’ll look instead at model-parallelism (using the same data across different models), and dive into a daskified implementation of GridSearchCV.
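To make the idea of model parallelism concrete, here is a minimal sketch using only the Python standard library. It is not the Dask or scikit-learn implementation discussed in the post; the toy data, the one-parameter ridge model, and the helper names `fit_ridge_slope` and `score` are all assumptions made for illustration. Each candidate hyperparameter is fitted independently against the same data, so the fits can run concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy data shared by every candidate model (hypothetical values, y is roughly 2 * x).
X_train, y_train = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
X_val, y_val = [5.0, 6.0], [10.1, 11.8]


def fit_ridge_slope(alpha):
    """Closed-form 1-D ridge regression without intercept."""
    sxy = sum(x * y for x, y in zip(X_train, y_train))
    sxx = sum(x * x for x in X_train)
    return sxy / (sxx + alpha)


def score(alpha):
    """Fit one candidate and return (validation MSE, alpha, slope)."""
    w = fit_ridge_slope(alpha)
    mse = sum((w * x - y) ** 2 for x, y in zip(X_val, y_val)) / len(X_val)
    return mse, alpha, w


# The "grid": one model per candidate value of alpha.
param_grid = [0.0, 0.1, 1.0, 10.0]

# Every candidate sees the same data, so the fits are independent tasks.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(score, param_grid))

# Pick the candidate with the lowest validation error.
best_mse, best_alpha, best_w = min(results)
```

A thread pool stands in here for whatever scheduler actually runs the tasks; the point is only that the work splits per model, not per data chunk.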

Part II: Dask & scikit-learn: Data Parallelism

In the last post we discussed model-parallelism: fitting several models across the same data. In this post we’ll look into simple patterns for data-parallelism, which will allow fitting a single model on larger datasets.
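As a rough illustration of data parallelism, the sketch below fits a single straight-line model by computing partial statistics on each chunk of the data and then combining them. This is one simple pattern chosen for illustration, not necessarily one of the patterns used in the post, and it uses only the standard library rather than Dask. The helper names `partial_stats`, `combine`, and `solve`, along with the synthetic data, are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_stats(chunk):
    """Sufficient statistics for simple linear regression on one chunk."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return n, sx, sy, sxx, sxy


def combine(stats):
    """Add the per-chunk statistics element-wise."""
    return tuple(sum(col) for col in zip(*stats))


def solve(n, sx, sy, sxx, sxy):
    """Ordinary least squares slope and intercept from the combined sums."""
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Synthetic data on the line y = 3x + 1, split into three chunks of four rows.
data = [(float(x), 3.0 * x + 1.0) for x in range(12)]
chunks = [data[i:i + 4] for i in range(0, len(data), 4)]

# Each chunk is processed independently; only small summaries are gathered.
with ThreadPoolExecutor() as pool:
    stats = list(pool.map(partial_stats, chunks))

slope, intercept = solve(*combine(stats))
```

Because only the five running sums leave each chunk, the full dataset never has to sit in one place, which is what lets a single model be fitted on data larger than one worker's memory.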

Part III: Dask & scikit-learn: Putting it All Together

In this post we’ll combine the above concepts to do distributed learning and grid search on a real dataset, namely the airline dataset, which contains information on every flight in the USA between 1987 and 2008.

Keep up with Jim and his blog by following him on Twitter, @jiminy_crist