We are excited to announce that Numba version 0.21 adds support for a new target architecture: the Heterogeneous System Architecture (HSA)! HSA is a new standard aimed at allowing CPUs and GPUs to cooperate more closely and share a common memory space. AMD has implemented this standard in their recent Accelerated Processing Units (APUs), and now Numba can compile code to run on these devices.

It is Numba’s mission to help Python developers take advantage of the full power of their computers with the help of just-in-time (JIT) compilation. We’ve seen numerous examples where compiling numerical Python code has resulted in 2-5x speed improvements over code calling NumPy functions, and more than 150x over pure Python code.
Numba currently supports 32-bit and 64-bit Intel and AMD CPUs, as well as NVIDIA GPUs. Thanks to the diverse range of hardware supported by LLVM, the open source compiler library used by Numba, we plan to support several more computing architectures in the future. The future of computing is heterogeneous, and we want to make sure that Python users are not left behind as new technologies are developed.
AMD’s approach with their APU devices is to put an x86-compatible CPU and a desktop-class GPU onto the same chip, both sharing access to system memory. This design reduces communication overhead between the CPU and GPU, allowing sequential workloads to be performed by the CPU with lower latency, while parallel workloads are handled by the GPU with higher throughput. Such an approach to computing has the potential to be very useful for workloads where communication overhead has previously made GPU acceleration challenging. These situations come up often when working with financial data sets, implementing graph analytics algorithms, and creating other complex data processing workflows.
We have worked closely with AMD to develop and test Numba’s HSA support specifically with their APUs. The CPU cores were already supported by Numba, and with the new Numba release you can now compile code for the GPU cores on HSA-supported devices as well.
Programming HSA with Python
The HSA programming model is very similar to that of OpenCL and CUDA. The computation to be performed is defined by a kernel function (not to be confused with the OS kernel) that is executed in parallel by a collection of work-items. The work-items are collected into work-groups, and the set of all work-groups is an NDRange. (The equivalent terminology for CUDA describes the computation as a collection of threads grouped together into blocks, and the set of all blocks is called a grid.)
The syntax for writing an HSA kernel with Numba looks very similar to CUDA:
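A minimal sketch of such a kernel, converting polar coordinates to Cartesian ones (the function name and signature here are illustrative; the `@hsa.jit` decorator and `hsa.get_global_id` come from Numba’s HSA API, and running this requires an HSA-enabled APU):

```python
import math

from numba import hsa


@hsa.jit
def polar_to_cartesian(rho, theta, x, y):
    # Each work-item handles one element of the input arrays.
    i = hsa.get_global_id(0)
    if i < rho.size:  # guard against extra work-items in the last work-group
        x[i] = rho[i] * math.cos(theta[i])
        y[i] = rho[i] * math.sin(theta[i])
```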
Here we’ve used hsa.get_global_id(0) to get a globally unique work-item ID. The corresponding polar coordinates in the input arrays are retrieved and used to compute Cartesian coordinates that are saved in the output arrays.
To execute this kernel, we use a similar syntax to that for CUDA Python:
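A sketch of such a launch, assuming a compiled HSA kernel named `polar_to_cartesian` that takes two input and two output arrays (the kernel name and group size are illustrative; as in CUDA Python, the launch configuration goes in square brackets before the argument list, and this requires an HSA-enabled APU to run):

```python
import numpy as np

n = 1_000_000
rho = np.random.random(n)
theta = np.random.uniform(0.0, 2.0 * np.pi, n)
x = np.empty_like(rho)
y = np.empty_like(theta)

# Launch configuration: number of work-groups, then work-items per group,
# given in CUDA-style square-bracket syntax before the arguments.
items_per_group = 256
groups = (n + items_per_group - 1) // items_per_group
polar_to_cartesian[groups, items_per_group](rho, theta, x, y)

# x and y can be read immediately: the launch is synchronous, and on an
# APU no device-to-host copy is needed since memory is shared.
```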
Unlike the Numba CUDA driver, the HSA kernel executes synchronously (though we plan to support asynchronous behavior in the future as well), so you do not have to worry about whether the kernel has finished executing before accessing your output data. Also note that there is no memory transfer overhead before or after the kernel execution, since the GPU and CPU share the same memory!
Like other GPUs, the GPU inside an AMD APU reaches its highest performance when working on large problems. We can see this if we plot the relative speedup of the above HSA kernel vs. a CPU implementation for different size input arrays:
For large input data sets, it is possible to achieve speedups between 10x and 18x relative to the standard single-threaded NumPy implementation of the same function.
The release of HSA support in Numba is only the first step in our planned support of AMD APUs. In the coming months we will be expanding our APU support to include ufunc generation as well as adding APU-optimized BLAS and FFT functionality to the Anaconda Platform.
Today, HSA is supported on Kaveri and Carrizo architecture APUs on Ubuntu Linux 14.04 64-bit. The HSA kernel driver and runtime must be installed, as described in the HSA Platforms & Installation Guide. The installation guide will point you to the packages you need, describe how to install them, and show you how to verify a successful installation. Once the driver is installed, you will need to install Anaconda or Miniconda, and then you can run:
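Presumably the standard conda install command (assuming the `numba` package from the default Anaconda channel):

```shell
conda install numba
```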
and you will have the latest Numba with HSA support. The full Numba HSA documentation describes the Python syntax for HSA targets in more detail. Try it out and let us know what you would like to see in future releases!