We are excited to announce that Numba version 0.21 adds support for a new target architecture: the Heterogeneous System Architecture (HSA)! HSA is a new standard aimed at allowing CPUs and GPUs to cooperate more closely and share a common memory space. AMD has implemented this standard in their recent Accelerated Processing Units (APUs), and now Numba can compile code to run on these devices.

It is Numba’s mission to help Python developers take advantage of the full power of their computers with the help of just-in-time (JIT) compilation. We’ve seen numerous examples where compiling numerical Python code has resulted in 2-5x speed improvements over code calling NumPy functions, and more than 150x over pure Python code.

Numba currently supports 32-bit and 64-bit Intel and AMD CPUs, as well as NVIDIA GPUs. Thanks to the diverse range of hardware supported by LLVM, the open source compiler library used by Numba, we plan to support several more computing architectures in the future. The future of computing is heterogeneous, and we want to make sure that Python users are not left behind as new technologies are developed.

AMD’s approach with their APU devices is to put an x86-compatible CPU and a desktop-class GPU onto the same chip, both sharing access to system memory. This design reduces communication overhead between the CPU and GPU, allowing sequential workloads to be performed by the CPU with lower latency, while parallel workloads are handled by the GPU with higher throughput. Such an approach to computing has the potential to be very useful for workloads where communication overhead has previously made GPU acceleration challenging. These situations come up often when working with financial data sets, implementing graph analytics algorithms, and creating other complex data processing workflows.

We have worked closely with AMD to develop and test Numba’s HSA support specifically with their APUs. The CPU cores were already supported by Numba, and with the new Numba release you can now compile code for the GPU cores on HSA-supported devices as well.

Programming HSA with Python

The HSA programming model is very similar to that of OpenCL and CUDA. The computation to be performed is defined by a kernel function (not to be confused with the OS kernel) that is executed in parallel by a collection of work-items. The work-items are collected into work-groups, and the set of all work-groups is an NDRange. (The equivalent terminology for CUDA describes the computation as a collection of threads grouped together into blocks, and the set of all blocks is called a grid.)

The syntax for writing an HSA kernel with Numba looks very similar to CUDA:

from numba import hsa
import numpy as np
import math
 
@hsa.jit
def polar_to_cartesian(r_in, theta_in, x_out, y_out):
    "Convert 2D polar coordinates to Cartesian coordinates"
    i = hsa.get_global_id(0)  # globally unique work-item index

    if i < r_in.size:  # guard: the grid may contain more work-items than elements
        x_out[i] = r_in[i] * math.cos(theta_in[i])
        y_out[i] = r_in[i] * math.sin(theta_in[i])


Here we’ve used hsa.get_global_id(0) to get a globally unique work-item ID. The corresponding polar coordinates in the input arrays are retrieved and used to compute Cartesian coordinates that are saved in the output arrays.
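To make the kernel's per-work-item computation concrete, here is the same transform written as an ordinary vectorized NumPy function, which also runs without any HSA hardware. The function name is ours, for illustration only, and is not part of Numba's API:

```python
import numpy as np

def polar_to_cartesian_numpy(r, theta):
    """Compute, for the whole array at once, what the kernel
    computes one element at a time per work-item."""
    return r * np.cos(theta), r * np.sin(theta)

r = np.array([1.0, 2.0], dtype=np.float32)
theta = np.array([0.0, np.pi / 2], dtype=np.float32)
x, y = polar_to_cartesian_numpy(r, theta)
# x is approximately [1, 0] and y approximately [0, 2] (float32 rounding aside)
```

Each work-item in the HSA kernel handles exactly one index `i` of this computation.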

To execute this kernel, we use a similar syntax to that for CUDA Python:

# Create some input data
n = 10000
r = np.linspace(1.0, 3.0, n).astype(np.float32)
theta = np.linspace(0, 8 * np.pi, n).astype(np.float32)
 
# Create output storage
x = np.zeros_like(r)
y = np.zeros_like(r)
 
# Ensure we have enough work-items to cover the whole input
items_per_group = 64
groups = (n + items_per_group - 1) // items_per_group
 
# Execute the kernel
polar_to_cartesian[groups, items_per_group](r, theta, x, y)
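The work-group count above is a standard ceiling division, and it is also why the `i < r_in.size` bounds check in the kernel matters: the last work-group may contain surplus work-items. A quick sketch in plain Python (no HSA device required, and the helper name is ours) confirms the grid always covers the input:

```python
def groups_needed(n, items_per_group):
    """Smallest number of work-groups whose combined work-items
    cover all n input elements (ceiling division)."""
    return (n + items_per_group - 1) // items_per_group

for n in (1, 63, 64, 65, 10000):
    g = groups_needed(n, 64)
    assert g * 64 >= n          # enough work-items to cover every element
    assert (g - 1) * 64 < n     # and no entirely idle trailing group
```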


Unlike the Numba CUDA driver, the HSA kernel executes synchronously (though we plan to also support asynchronous behavior in the future), so you do not have to worry about whether the kernel has finished executing before accessing your output data. Also note that there is no memory transfer overhead before or after the kernel execution, since the GPU and CPU share the same memory!

Like other GPUs, the GPU inside an AMD APU reaches its highest performance when working on large problems. We can see this if we plot the relative speedup of the above HSA kernel vs. a CPU implementation for different size input arrays:

HSA performance compared to NumPy

For large input data sets, it is possible to achieve speedups between 10x and 18x relative to the standard single-threaded NumPy implementation of the same function.
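For context, a single-threaded NumPy baseline of the kind used in such comparisons can be timed as follows. This is only a sketch: the function name and problem size are ours, and the absolute numbers will vary by machine:

```python
import timeit
import numpy as np

def polar_to_cartesian_numpy(r, theta):
    """Single-threaded NumPy version of the kernel, serving as the baseline."""
    return r * np.cos(theta), r * np.sin(theta)

n = 1_000_000
r = np.linspace(1.0, 3.0, n).astype(np.float32)
theta = np.linspace(0, 8 * np.pi, n).astype(np.float32)

# Average wall-clock time over a few repetitions
elapsed = timeit.timeit(lambda: polar_to_cartesian_numpy(r, theta), number=5) / 5
print(f"NumPy baseline: {elapsed * 1e3:.2f} ms per call for n={n}")
```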

Next Steps

The release of HSA support in Numba is only the first step in our planned support of AMD APUs. In the coming months we will be expanding our APU support to include ufunc generation as well as adding APU-optimized BLAS and FFT functionality to the Anaconda Platform.

Today, HSA is supported on Kaveri and Carrizo architecture APUs on Ubuntu Linux 14.04 64-bit. The HSA kernel driver and runtime must be installed, as described in the HSA Platforms & Installation Guide. The installation guide will point you to the packages you need, describe how to install them, and show you how to verify a successful installation. Once the driver is installed, you will need to install Anaconda or Miniconda, and then you can run:

conda install numba


and you will have the latest Numba with HSA support. The full Numba HSA documentation describes the Python syntax for HSA targets in more detail. Try it out and let us know what you would like to see in future releases!


About the Author

Stanley Seibert

Director, Community Innovation

Stan leads the Community Innovation team at Anaconda, where his work focuses on high performance GPU computing and designing data analysis, simulation and processing pipelines. He is a longtime advocate of the use of Python and GPU computin …
