Today we are excited to talk about the RAPIDS GPU dataframe release along with our partners in this effort: NVIDIA, BlazingDB, and Quansight. RAPIDS is the culmination of 18 months of open source development to address a common need in data science: fast, scalable processing of tabular data for extract-transform-load (ETL) operations. ETL tasks are typically not GPU-accelerated, but recent developments in both hardware and software brought together by RAPIDS make it practical for data scientists to accelerate ETL on NVIDIA GPUs.
Data wrangling is the new bottleneck
The explosive growth of GPU performance on model training tasks has made ETL stages in the data science workflow the new bottleneck. We’ve seen an explosion of GPU-accelerated deep learning use cases, like image recognition, document classification, and time series prediction. But any data scientist will tell you a model is only as good as the data you put into it. Preparing training and validation data is often as much work as designing and training the model itself. Loading, selecting, transforming, and aggregating data is an important part of the data science workflow, and seldom done only once.
When these ETL tasks are combined during data exploration, feature engineering, and model verification, the data scientist often needs to repeat them over and over with increasingly larger datasets, at which point performance becomes important.
At its core, RAPIDS is a set of libraries built upon the Apache Arrow standard for in-memory representation and sharing of structured data. Arrow has a large community and promotes data interoperability between programming languages and libraries. RAPIDS brings Arrow to the GPU, accelerating common operations like column transforms, selection, sorting, and table joins. These low-level primitives can be used directly by application developers, or they can be exposed to data scientists via a friendly Python interface. We modeled the Python interface after the very popular Pandas project to make it easier to learn and incorporate into notebooks and scripts for working with dataframes, but now on the GPU.
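Because the interface is modeled on Pandas, typical dataframe operations carry over directly. Here is a minimal sketch written with Pandas itself; the data and column names are invented for illustration, and the intent is that the RAPIDS GPU dataframe library exposes matching methods, so code like this runs on the GPU by swapping the import:

```python
import pandas as pd  # the RAPIDS GPU dataframe mirrors this interface

# A small table of transactions
df = pd.DataFrame({
    "store": ["east", "west", "east", "west", "east"],
    "sales": [100, 250, 75, 300, 125],
})

# Column transform: add a tax-adjusted sales column
df["sales_with_tax"] = df["sales"] * 1.08

# Selection: keep only the larger transactions
big = df[df["sales"] > 100]

# Aggregation: total sales per store
totals = df.groupby("store")["sales"].sum()
```

On the GPU, each of these steps runs as a data-parallel kernel over Arrow-format columns rather than a loop over rows, which is where the acceleration comes from.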
Parallel GPU-accelerated Pandas
As useful as a “GPU-accelerated Pandas” is, RAPIDS takes things one step further. Datasets can easily grow beyond the capacity of one GPU, so RAPIDS builds on Anaconda’s Dask distributed computing project to scale out. Dask was a natural choice because of its Python interface, modular design, and adaptability to distributed IT infrastructure, including Kubernetes and YARN clusters. By using Dask, RAPIDS can process data split across many GPUs in one server, or many GPUs in a whole cluster. This gives RAPIDS extreme scalability to very large datasets, while still retaining the ease of use that Python data scientists prefer. Our benchmarks with ~100 GB of tabular data show speedups of 5-50x on dataframe manipulation tasks commonly performed during feature engineering and data preparation.
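The scale-out model can be illustrated in miniature: the table is split into partitions, a partial result is computed for each partition independently (in RAPIDS, each partition lives on a GPU), and the partials are combined. The following stdlib-only sketch shows that split-apply-combine pattern; the partition size and worker count are arbitrary illustrations, not RAPIDS defaults, and Dask's real scheduler handles the task placement this toy version does by hand:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sums(partition):
    """Compute per-key sums for one partition of (key, value) rows."""
    sums = {}
    for key, value in partition:
        sums[key] = sums.get(key, 0) + value
    return sums

def merge(partials):
    """Combine per-partition partial results into global per-key sums."""
    combined = {}
    for part in partials:
        for key, value in part.items():
            combined[key] = combined.get(key, 0) + value
    return combined

# A toy "large" table of (key, value) rows, split into 4 partitions
rows = [("even", i) if i % 2 == 0 else ("odd", i) for i in range(100)]
partitions = [rows[i:i + 25] for i in range(0, 100, 25)]

# Each partition is processed independently; in the real system Dask
# schedules these tasks across GPUs in a server or nodes in a cluster
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sums, partitions))

result = merge(partials)
```

The key property is that the per-partition step never needs to see the whole dataset, which is what lets the workload spread across many GPUs whose individual memory is smaller than the data.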
Although the big announcement for RAPIDS was today, there is still a lot planned. RAPIDS has grown from a proof-of-concept that Anaconda prototyped in 2017 into a large collaboration between multiple companies, open source developers, and end users, and we’re getting ready to take the next step. Work is continuing to grow the functionality of RAPIDS, including accelerated data loaders, more table operations, and higher performance communication protocols between nodes. At this stage, the software is ready for beta testing by early adopters and interested contributors. Conda packages are available in the GPU Open Analytics channel on Anaconda Cloud, and there is a public Slack channel for discussion. To get started, you will need a Linux system with NVIDIA GPUs installed (Pascal architecture or later).
Our goal at Anaconda is to bring the best data science tools together, and we are excited to soon be adding this new project to our growing toolbox of GPU-enabled packages. If you have questions about how to accelerate your data science teams with Anaconda and Anaconda Enterprise, please contact us.