Getting Started with GPU Computing in Anaconda

GPU computing has become a big part of the data science landscape. Computational needs continue to grow, and a large number of GPU-accelerated projects are now available. In addition, GPUs are now available from every major cloud provider, so access to the hardware has never been easier. However, building GPU software on your own can be quite intimidating. Fortunately, Anaconda Distribution makes it easy to get started with GPU computing with several GPU-enabled packages that can be installed directly from our package repository. In this blog post, we’ll give you some pointers on where to get started with GPUs in Anaconda Distribution.

Note that we won’t talk about hybrid architectures, like the Xeon Phi, which combine aspects of both GPUs and CPUs. The Xeon Phi is a very interesting chip for data scientists, but really needs its own blog post. We’ll also focus specifically on GPUs made by NVIDIA, as they have built-in support in Anaconda Distribution. AMD’s Radeon Open Compute initiative is also rapidly improving the AMD GPU computing ecosystem, and we may talk about it in a future post.

When is a GPU a good idea?

It is important to remember that GPUs are not general purpose compute devices. They are specialized coprocessors that are extremely good (>10x performance increases) for some tasks, and not very good for others. The most successful applications of GPUs have been in problem domains which can take advantage of the large parallel floating point throughput and high memory bandwidth of the device. Some examples include:

Linear algebra
Signal and image processing (FFTs)
Neural networks and deep learning
Other machine learning algorithms, including generalized linear models, gradient boosting, etc.
Monte Carlo simulation and particle transport
Fluid simulation
In-memory databases
… and the list gets longer every day

Given how quickly the field is moving, it is a good idea to search for new GPU accelerated algorithms and projects to find out if someone has figured out how to apply GPUs to your area of interest.

What’s the commonality to all these successful use cases? Broadly speaking, applications ready for GPU acceleration have the following features:

1. High “arithmetic intensity”

For every memory access, how many math operations are performed? If the ratio of math to memory operations is high, the algorithm has high arithmetic intensity, and is a good candidate for GPU acceleration. These algorithms take advantage of the GPU’s high math throughput, and its ability to queue up memory access in the background while doing math operations on other data at the same time. The GPU can easily execute many math instructions in the time it takes to request and receive one number stored in GPU memory. As a result, it can sometimes be better to recompute a value than to save it to memory and reload it later.

What counts as high arithmetic intensity? A good rule of thumb for the GPU is that, for every number you input, you want at least ten basic math operations (add, subtract, multiply, divide, etc.) or at least one special math function call, such as exp() or cos(). This is not a rigid requirement, as careful use of data locality and caching also matters, but the rule of thumb is a guide toward the kinds of problems best suited for the GPU.
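To make the rule of thumb concrete, here is a rough sketch that estimates arithmetic intensity for two familiar operations. The helper names (arithmetic_intensity, meets_rule_of_thumb) are made up for illustration, and the operation counts are back-of-the-envelope estimates that assume each input value is read from memory only once.

```python
def arithmetic_intensity(math_ops, values_read):
    """Return the number of math operations performed per input value read."""
    if values_read <= 0:
        raise ValueError("values_read must be positive")
    return math_ops / values_read


def meets_rule_of_thumb(basic_ops_per_value, special_calls_per_value=0):
    """Rule of thumb: at least 10 basic ops or 1 special function call per input."""
    return basic_ops_per_value >= 10 or special_calls_per_value >= 1


# Element-wise add, c[i] = a[i] + b[i]: one add for every two inputs.
vector_add = arithmetic_intensity(math_ops=1, values_read=2)

# Naive n x n matrix multiply: about 2 * n**3 multiplies and adds
# over 2 * n**2 input values (assumes ideal reuse of loaded values).
n = 1024
matmul = arithmetic_intensity(math_ops=2 * n**3, values_read=2 * n**2)

print("vector add:", vector_add, meets_rule_of_thumb(vector_add))
print("matrix multiply:", matmul, meets_rule_of_thumb(matmul))
```

By this estimate, an element-wise add falls well below the threshold, while a large matrix multiply clears it easily, which matches the place of linear algebra at the top of the list of successful GPU applications.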

2. A high degree of parallelism

GPUs are ideal for array processing, where elements of a large array can be computed in parallel. If a calculation can only be divided into a small number of independent tasks, it may be more suited for a multicore CPU. Note that sometimes the way to find parallelism is to replace your current serial algorithm with a different one that solves the same problem in a highly parallel fashion. It is valuable to do a quick web search to see if something that “clearly can’t be parallelized” actually can be.
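A running total is a classic example of something that looks inherently serial but is not. The sketch below compares the obvious loop with a Hillis-Steele scan, a well-known parallel formulation. It runs on the CPU in plain Python purely to show the structure of the algorithm; the function names are illustrative, and a real GPU version would assign one thread per element in each pass.

```python
def serial_prefix_sum(values):
    """Running total, one element at a time; each step depends on the previous one."""
    out = []
    total = 0
    for v in values:
        total += v
        out.append(total)
    return out


def parallel_style_prefix_sum(values):
    """Hillis-Steele inclusive scan.

    Within a pass, every output element depends only on the previous pass,
    so all elements could be updated at the same time. The number of passes
    grows with log2(n) rather than n.
    """
    data = list(values)
    offset = 1
    while offset < len(data):
        # Build a new list so every read sees the previous pass,
        # as if all threads ran in lockstep.
        data = [
            data[i] + data[i - offset] if i >= offset else data[i]
            for i in range(len(data))
        ]
        offset *= 2
    return data


numbers = list(range(1, 9))
print(serial_prefix_sum(numbers))
print(parallel_style_prefix_sum(numbers))
```

The parallel version does more total additions than the serial loop, but it needs far fewer dependent steps, which is usually the better trade on hardware with thousands of threads.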

3. Working dataset can fit into the GPU memory

High-end GPUs with 16 GB of memory (or even 24 GB in one case) are readily available now. That’s very impressive, but also an order of magnitude smaller than the amount of system RAM that can be installed in a high-end server. If a dataset doesn’t fit into GPU memory, all is not lost, however. Some algorithms can split their data across multiple GPUs in the same computer, and there are cases where data can be split across GPUs in different computers. It is also possible to stream data from system RAM into the GPU, but the bandwidth of the PCI-E bus that connects the GPU to the CPU will be a limiting factor unless computation and memory transfer are carefully overlapped.
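When streaming is the only option, the first step is deciding how to cut the dataset into pieces that fit. The following sketch shows only that bookkeeping; plan_chunks is a hypothetical helper, and the row size and memory budget are made-up numbers. It does not perform any transfers or overlap them with computation.

```python
def plan_chunks(n_rows, bytes_per_row, memory_budget_bytes):
    """Split n_rows into (start, stop) ranges that each fit in the memory budget."""
    rows_per_chunk = memory_budget_bytes // bytes_per_row
    if rows_per_chunk < 1:
        raise ValueError("a single row does not fit in the memory budget")
    return [
        (start, min(start + rows_per_chunk, n_rows))
        for start in range(0, n_rows, rows_per_chunk)
    ]


# Hypothetical workload: 1 billion rows of 8 float32 columns (32 bytes per row),
# processed with an 8 GiB budget to leave room for intermediate results.
chunks = plan_chunks(
    n_rows=1_000_000_000,
    bytes_per_row=8 * 4,
    memory_budget_bytes=8 * 1024**3,
)
print(len(chunks), "chunks; first:", chunks[0], "last:", chunks[-1])
```

Leaving part of the GPU memory unallocated is a deliberate choice here, since most algorithms need scratch space in addition to their input.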

4. I/O is not a bottleneck

A lot of data science tasks are primarily constrained by I/O speed. For example, an application that filters 500 GB of records on disk to find the subset that matches a simple search pattern is going to spend most of the time waiting for data to load from disk. The GPU will provide no additional benefits. If this data filtering is followed by six hours of training a deep learning model, then having a GPU will be very beneficial (for the model training stage). It is always a good idea to profile your Python application to measure where the time is actually being spent before embarking on any performance optimization effort.
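As a starting point for that kind of measurement, here is a minimal sketch using cProfile and pstats from the Python standard library. The two stage functions are hypothetical stand-ins for a loading step and a compute-heavy step; in a real application you would profile your own entry point instead.

```python
import cProfile
import io
import pstats


def load_records(n):
    """Stand-in for a data loading stage (hypothetical)."""
    return [i * 0.5 for i in range(n)]


def train_model(records):
    """Stand-in for a compute-heavy stage (hypothetical)."""
    total = 0.0
    for _ in range(20):
        for x in records:
            total += x * x
    return total


def pipeline():
    return train_model(load_records(50_000))


# Collect timing data for one run of the pipeline.
profiler = cProfile.Profile()
profiler.enable()
result = pipeline()
profiler.disable()

# Report the functions with the largest cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

The cumulative time column shows which stage dominates the run. If most of the time is spent in loading rather than in math, a GPU is unlikely to help until that stage is addressed.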

Prerequisites

Before starting your GPU exploration, you will need a few things:

Access to a system with an NVIDIA GPU: The cheaper GeForce cards are very good for experimentation, with the more expensive Tesla cards generally having better double precision performance and more memory. Mobile NVIDIA GPUs can also work, but they will be very limited in performance. Cloud GPU instances are also an option, though somewhat more expensive than normal cloud systems.
CUDA-supporting drivers: Although CUDA is supported on Mac, Windows, and Linux, we find the best CUDA experience is on Linux. Macs stopped getting NVIDIA GPUs in 2014, and on Windows the limitations of the graphics driver system hurt the performance of GeForce cards running CUDA (Tesla cards run at full speed on Windows). Up-to-date NVIDIA drivers (not Nouveau) on Linux are sufficient for GPU-enabled Anaconda packages. You do not need to install the full CUDA toolkit unless you want to compile your own GPU software from scratch.
Anaconda: The easiest way to install the packages described in this post is with the conda command line tool in Anaconda Distribution. If you are new to Anaconda Distribution, the recently released Version 5.0 is a good place to start, but older versions of Anaconda Distribution also can install the packages described below.
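One quick way to confirm the first two prerequisites is to ask the nvidia-smi utility, which is installed along with the NVIDIA driver. The sketch below wraps that call in a small helper; list_nvidia_gpus is a name chosen for this example, and it simply returns an empty list when the tool is missing or fails, which is what you would expect on a machine without a working NVIDIA driver.

```python
import shutil
import subprocess


def list_nvidia_gpus():
    """Return one description string per GPU reported by nvidia-smi, or [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []  # driver utilities are not on the PATH
    try:
        completed = subprocess.run(
            [
                "nvidia-smi",
                "--query-gpu=name,driver_version,memory.total",
                "--format=csv,noheader",
            ],
            capture_output=True,
            text=True,
            check=True,
            timeout=30,
        )
    except (subprocess.SubprocessError, OSError):
        return []  # tool present but not working, e.g. driver not loaded
    return [line.strip() for line in completed.stdout.splitlines() if line.strip()]


gpus = list_nvidia_gpus()
if gpus:
    for gpu in gpus:
        print(gpu)
else:
    print("No NVIDIA GPU detected through nvidia-smi")
```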

GPU Projects To Check Out

Deep Learning: Keras, TensorFlow, PyTorch

Training neural networks (often called “deep learning,” referring to the large number of network layers commonly used) has become a hugely successful application of GPU computing. Neural networks have proven their utility for image captioning, language translation, speech generation, and many other applications. However, they require large data sets and computing power for training, and the ability to easily experiment with different models. Once trained, models can be run on CPUs and mobile devices with much more modest computing abilities.

For people getting started with deep learning, we really like Keras. Keras is a Python library for constructing, training, and evaluating neural network models, and it supports multiple high-performance backend libraries, including TensorFlow, Theano, and Microsoft’s Cognitive Toolkit. TensorFlow is the default, and that is a good place to start for new Keras users. The documentation is very informative, with links back to research papers to learn more. Keras also does not require a GPU, although for many models, training can be 10x faster if you have one.

Keras and the GPU-enabled version of TensorFlow can be installed in Anaconda with the command:

conda install keras-gpu

We also like recording our Keras experiments in Jupyter notebooks, so you might also want to run:

conda install notebook

jupyter notebook

Some great starting points are the CIFAR10 and MNIST convolutional neural network examples on GitHub.

It is also worth remembering that libraries like TensorFlow and PyTorch (also available in Anaconda Distribution) can be used directly for a variety of computational and machine learning tasks, and not just deep learning. Because they make it so easy to switch between CPU and GPU computation, they can be very powerful tools in the data science toolbox.

GPU Accelerated Math Libraries: pyculib

NVIDIA also releases libraries with GPU-accelerated implementations of standard math algorithms. Our pyculib project provides Python wrappers around many of these algorithms, including:

Linear algebra
Fast Fourier Transforms
Sparse Matrices
Random number generation
Sorting

These Python wrappers take standard NumPy arrays, and handle all the copying to and from the GPU for you. Note that because of the copying overhead, you may find that these functions are not any faster than NumPy for small arrays. Performance also depends strongly on the kind of GPU you use, and the array data type. The float32 type is much faster than float64 (the NumPy default), especially with GeForce graphics cards. Always remember to benchmark before and after you make any changes to verify the expected performance improvement.
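A before-and-after comparison does not need much machinery. The following sketch uses timeit from the standard library; best_time and speedup are helper names invented for this example, and the two functions being compared are plain Python stand-ins rather than pyculib calls. The same harness could wrap a NumPy function and its GPU counterpart.

```python
import timeit


def best_time(func, *args, repeat=5, number=10):
    """Return the best per-call time in seconds over several timing runs."""
    timings = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(timings) / number


def speedup(baseline, candidate, *args):
    """How many times faster `candidate` is than `baseline` on the same input."""
    return best_time(baseline, *args) / best_time(candidate, *args)


def loop_sum(values):
    """Baseline: explicit Python loop."""
    total = 0
    for v in values:
        total += v
    return total


data = list(range(100_000))

# Check that both versions agree before comparing their speed.
assert loop_sum(data) == sum(data)
print("speedup of built-in sum over loop:", round(speedup(loop_sum, sum, data), 2))
```

Taking the minimum over several repeats reduces noise from other activity on the machine. When timing GPU code, it also helps to try several array sizes, since the copying overhead matters most for small inputs.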

GPU Kernel Programming: Numba

For someone who wants to dig into the details of GPU programming, Numba can be a great option. Numba is our open source Python compiler, which includes just-in-time compilation tools for both CPU and GPU targets. Not only does it compile Python functions for execution on the CPU, it includes an entirely Python-native API for programming NVIDIA GPUs through the CUDA driver. The code that runs on the GPU is also written in Python, and has built-in support for sending NumPy arrays to the GPU and accessing them with familiar Python syntax.

Numba’s GPU support is optional, so to enable it you need to install both the Numba and CUDA toolkit conda packages:

conda install numba cudatoolkit

The CUDA programming model is based on a two-level data parallelism concept. A “kernel function” (not to be confused with the kernel of your operating system) is launched on the GPU with a “grid” of threads (usually thousands) executing the same function concurrently. The grid is composed of many identical blocks of threads, with threads within a block able to synchronize and share data more easily and efficiently than threads in different blocks. This style of programming is quite different from traditional multithreaded programming on the CPU, and is optimized for “data parallel” algorithms, where each thread is running the same instructions at the same time, but with different input data elements. The first few chapters of the CUDA Programming Guide give a good discussion of how to use CUDA, although the code examples will be in C.
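The grid and block arithmetic is easier to see in a toy version. The sketch below is a serial, CPU-only emulation written in plain Python so that it runs without a GPU; launch and add_one_kernel are illustrative names, not part of Numba or CUDA. It shows how each thread derives the array element it owns from its block index and its position within the block.

```python
def launch(kernel, blocks_per_grid, threads_per_block, *args):
    """Serial CPU stand-in for a kernel launch: call `kernel` once per thread."""
    for block_idx in range(blocks_per_grid):
        for thread_idx in range(threads_per_block):
            kernel(block_idx, thread_idx, threads_per_block, *args)


def add_one_kernel(block_idx, thread_idx, block_dim, data):
    """Every thread runs this same function on a different element."""
    i = block_idx * block_dim + thread_idx  # global position of this thread
    if i < len(data):  # the grid may be larger than the array
        data[i] += 1


data = list(range(10))
threads_per_block = 4
# Round up so the grid covers every element: 3 blocks of 4 threads for 10 items.
blocks_per_grid = (len(data) + threads_per_block - 1) // threads_per_block

launch(add_one_kernel, blocks_per_grid, threads_per_block, data)
print(data)
```

In Numba, the corresponding kernel would be decorated with @cuda.jit, would obtain its global index from cuda.grid(1), and would be launched as kernel[blocks_per_grid, threads_per_block](data). The bounds check stays the same, because the grid is usually rounded up past the end of the array.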

Once you have some familiarity with the CUDA programming model, your next stop should be the Jupyter notebooks from our tutorial at the 2017 GPU Technology Conference. The notebooks cover the basic syntax for programming the GPU with Python, and also include more advanced topics like ufunc creation, memory management, and debugging techniques.

GPU Dataframes: PyGDF

The GPU Dataframe (“GDF” for short) concept is something that Anaconda has been developing with other members of the GPU Open Analytics Initiative. The GDF is a dataframe in the Apache Arrow format, stored in GPU memory. Thanks to support in the CUDA driver for transferring sections of GPU memory between processes, a GDF created by a query to a GPU-accelerated database, like MapD, can be sent directly to a Python interpreter, where operations on that dataframe can be performed, and then the data moved along to a machine learning library like H2O, all without having to take the data off the GPU. While we have been developing the core GDF functionality in a separate library, called libgdf, our goal is to integrate the basic GPU support back into the Apache Arrow project in the future.

Python support for the GPU Dataframe is provided by the PyGDF project, which we have been working on since March 2017. It offers a subset of the Pandas API for operating on GPU dataframes, using the parallel computing power of the GPU (and the Numba JIT) for sorting, columnar math, reductions, filters, joins, and group by operations. PyGDF is not yet present in the main Anaconda Distribution due to its early alpha status, but the development is totally open source, and examples are available in our demo repository.

Conclusion

We’ve only scratched the surface of possibilities with the GPU, but hopefully some of the aforementioned projects will inspire you to dive in. Are there GPU projects you would like to see in Anaconda Distribution? Let us know!

Here are some links to get you started: