With this latest release of Numba, we’re excited to be able to deliver several frequently requested features:

  • JIT classes
  • np.dot support in nopython mode
  • a multithreaded CPU target for @guvectorize.

Experimental Support for JIT-friendly Classes

Numba derives its performance from the combination of two important ingredients:

  1. Data structures that can be accessed directly, bypassing the Python interpreter (such as the NumPy ndarray).
  2. Just-in-time generation of machine code by the LLVM compiler library.

While much of our focus is on the compiler-side of things, we also want to promote machine-friendly data structures in Python. We look forward to someday being able to pass Pandas DataFrames, xray DataArrays, and DyND arrays to Numba-compiled functions. (More on that below…)

However, sometimes you just want to use some simple objects as your data structure, along with compiled methods that operate on the attributes. For those cases, we’ve created a new decorator that can be applied to a Python class: @jitclass.

Here’s an example of how this works:

# conda create -n numba_023_test python=3.4 numba scipy bokeh jupyter
import numpy as np
from numba import jit, jitclass, int64, float64, guvectorize
from bokeh.plotting import figure, output_notebook, show
output_notebook()

 

@jitclass([    
    ('xmin', float64),
    ('xmax', float64),
    ('nbins', int64),
    ('xstep', float64),
    ('xcenter', float64[:]),
    ('bins', int64[:]),
    ('moments', float64[:])
])
class Hist1D(object):
    '''A 1D histogram that can be updated, and computes mean and stddev incrementally'''
    def __init__(self, xmin, xmax, nbins):
        self.xmin = xmin
        self.xmax = xmax
        self.nbins = nbins
        self.xstep = (xmax - xmin) / nbins
        # Bin centers start half a step above xmin
        self.xcenter = (np.arange(nbins) + 0.5) * self.xstep + self.xmin
        self.bins = np.zeros(self.nbins, dtype=np.int64)
        self.moments = np.zeros(3, dtype=np.float64)
    
    def fill_many(self, values):
        for value in values:
            # Floor before casting so values just below xmin do not truncate into bin 0
            bin_index = np.int64(np.floor((value - self.xmin) / self.xstep))
            if 0 <= bin_index < len(self.bins):
                self.bins[bin_index] += 1
                self.moments[0] += 1
                self.moments[1] += value
                self.moments[2] += value**2
    
    @property
    def count(self):
        return np.int64(self.moments[0])
    
    @property
    def mean(self):
        return self.moments[1] / self.moments[0]
    
    @property
    def stddev(self):
        return np.sqrt(self.moments[2] / self.moments[0] - self.mean**2)
        

 

This example shows all the basic features that @jitclass currently supports. The attributes and Numba types are described in a specification that is passed to the @jitclass decorator.

h = Hist1D(-4, 4, 25)
h.fill_many(np.random.normal(size=5000))

fig = figure(plot_width=600, plot_height=300)
fig.line(h.xcenter, h.bins)
show(fig)

print('Count: %f, Mean: %f, StdDev: %f' % (h.count, h.mean, h.stddev))
Count: 4999.000000, Mean: -0.013222, StdDev: 0.997465
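
The mean and stddev properties rely on only three running sums stored in moments: the count, the sum of the values, and the sum of the squared values. As a quick sanity check, here is a plain NumPy sketch (no Numba involved, and the sample values are made up for illustration) that accumulates the same three sums and compares the results against np.mean and np.std:

```python
import numpy as np

# Arbitrary sample values, chosen only for illustration.
values = np.array([1.0, 2.0, 4.0, 7.0])

# moments[0] = count, moments[1] = sum(x), moments[2] = sum(x**2),
# accumulated one value at a time, in the same way as Hist1D.fill_many.
moments = np.zeros(3, dtype=np.float64)
for value in values:
    moments[0] += 1
    moments[1] += value
    moments[2] += value**2

mean = moments[1] / moments[0]
stddev = np.sqrt(moments[2] / moments[0] - mean**2)

# np.std defaults to the population standard deviation (ddof=0),
# which is the quantity the moment formula computes.
print(np.isclose(mean, values.mean()), np.isclose(stddev, values.std()))
```

Keep in mind that this single-pass formula subtracts two potentially large numbers, so it can lose precision when the mean is large compared to the spread of the data.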

The great thing about JIT classes is that they can also be passed to any nopython mode functions:

@jit(nopython=True)
def add_uniform_noise(hist, noise_fraction):
    '''Add uniformly distributed noise to a histogram.
    The final histogram will have the specified fraction of noise samples.
    '''
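    # For n noise samples to be noise_fraction of the final total:
    # n / (count + n) = noise_fraction
    n = np.int64(hist.count * noise_fraction / (1 - noise_fraction))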
    samples = np.empty(n, dtype=np.float64)
    for i in range(n):
        samples[i] = np.random.uniform(hist.xmin, hist.xmax)
    hist.fill_many(samples)
    

Let’s try it out:

h2 = Hist1D(-4, 4, 25)
h2.fill_many(np.random.normal(size=5000))

add_uniform_noise(h2, noise_fraction=0.3)

fig = figure(plot_width=600, plot_height=300, y_range=(0, h2.bins.max() * 1.1))
fig.line(h2.xcenter, h2.bins)
_ = show(fig)
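
To make the meaning of noise_fraction concrete: if the final histogram should contain a given fraction f of noise samples, and it already holds N samples, then the number of noise samples k has to satisfy k / (N + k) = f, which gives k = N * f / (1 - f). The sketch below works through that arithmetic in plain Python. The helper name noise_samples_needed is hypothetical and is not part of Numba or of the code above:

```python
def noise_samples_needed(count, noise_fraction):
    """Hypothetical helper: number of noise samples k such that
    k / (count + k) is (approximately) noise_fraction."""
    if not 0 <= noise_fraction < 1:
        raise ValueError("noise_fraction must be in [0, 1)")
    return int(round(count * noise_fraction / (1 - noise_fraction)))

count = 7000                      # samples already in the histogram
k = noise_samples_needed(count, 0.3)

# Fraction of the final histogram that consists of noise samples.
print(k, k / (count + k))
```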

One important caveat about JIT classes is that accessing their attributes from Python is significantly slower than accessing the attributes of a normal Python object:

class PythonObject(object):
  def __init__(self):
      self.count = 1

python_obj = PythonObject()

%timeit python_obj.count + 2
%timeit h2.nbins + 2
10000000 loops, best of 3: 103 ns per loop
The slowest run took 5270.13 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 8.28 µs per loop

This performance difference is because JIT classes store their attribute data in a non-Python object form, so when the attribute value is requested from the Python interpreter, Numba has to wrap the value in a new Python object to return it. This is similar to the tradeoffs associated with accessing NumPy array elements from Python. In moderation, attribute access is fine, but if you need to do it frequently, then you should consider compiling that code with Numba as well.
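
The NumPy analogy mentioned above can be sketched without Numba. In the example below (illustrative only, with made-up data), the first loop pulls every element across the Python boundary one at a time, while the second version makes a single call and only the final result is wrapped in a Python object:

```python
import numpy as np

data = np.arange(1000, dtype=np.float64)

# Element-by-element access from Python: each data[i] is wrapped in a new
# Python-level scalar object before it can be added to the total.
total_loop = 0.0
for i in range(len(data)):
    total_loop += data[i]

# A single call keeps the loop in compiled code and crosses the
# Python boundary only once, for the final result.
total_vectorized = data.sum()

print(total_loop == total_vectorized)
```

The same reasoning applies to JIT class attributes: a handful of reads from Python is fine, but a hot loop over them is better placed inside a compiled method or nopython function.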

Our support for JIT classes is very limited today. Only nopython mode is supported, and our interface for handling recursive types (types that contain instances of themselves) is still in flux. We will be working to remove the rough spots, and expand the functionality to cover more use cases over the next few releases.

Initial support for np.dot in nopython mode (requires SciPy 0.16 or later)

This simple-sounding request turned out to be much more complicated than expected because we wanted to make sure that Numba would use the same high performance BLAS library (MKL, OpenBLAS, etc.) likely being used by NumPy and SciPy. We discovered that SciPy exports C-callable BLAS functions for Cython, which Numba will also take advantage of. Those SciPy functions in turn will call whatever BLAS implementation SciPy was configured with.

The feature itself is fairly straightforward. We support the np.dot() function in nopython mode for contiguous arrays when doing:

  • 1D vector × 1D vector dot product
  • 2D matrix × 1D vector multiplication
  • 2D matrix × 2D matrix multiplication

Future releases will implement the broadcast rules for higher dimensional dot products, and also support non-contiguous arrays by copying to temporary storage before calling the BLAS library. (Note that np.dot inside nopython mode will be no faster than outside nopython mode, since both will use the same optimized BLAS library to do the heavy lifting.)
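
Until non-contiguous arrays are handled automatically, one possible workaround is to make the contiguous copy yourself before passing arrays to a compiled function. The following plain NumPy sketch shows how to check for the supported cases and how to pack a strided view. The helper is_supported_dot is hypothetical (it is not a Numba function), and it assumes that "contiguous" means either C-ordered or Fortran-ordered:

```python
import numpy as np

def is_supported_dot(a, b):
    """Hypothetical helper: True if (a, b) matches the cases listed above."""
    contiguous = all(x.flags.c_contiguous or x.flags.f_contiguous for x in (a, b))
    supported_dims = (a.ndim, b.ndim) in {(1, 1), (2, 1), (2, 2)}
    return contiguous and supported_dims

matrix = np.arange(12, dtype=np.float64).reshape(3, 4)
vector = np.ones(4)

strided = matrix[:, ::2]                          # every other column: not contiguous
short_vector = np.ascontiguousarray(vector[::2])  # explicit contiguous copy
packed = np.ascontiguousarray(strided)            # explicit contiguous copy

print(is_supported_dot(matrix, vector))           # contiguous 2D x 1D
print(is_supported_dot(strided, vector[::2]))     # strided views
print(is_supported_dot(packed, short_vector))     # packed copies
```

The copies cost extra memory and time, so this only makes sense when the arrays really are non-contiguous.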

Multithreaded CPU @guvectorize

We love Universal Functions (“ufuncs”) and Generalized Universal Functions (“gufuncs”)! They are an underappreciated abstraction for expressing array-oriented functions that are intuitive to write and easy for a compiler to parallelize. In fact, many people may not realize that Numba supports four different targets for both ufuncs and gufuncs:

  • target='cpu': Single-threaded, CPU execution
  • target='parallel': Multi-threaded, CPU execution
  • target='cuda': Execution on NVIDIA GPUs that support CUDA (most of them)
  • target='hsa': Execution on AMD APUs that support HSA (Kaveri and Carrizo)

With the Numba 0.22.1 release, we open sourced the parallel, cuda, and hsa targets that had been in our numbapro package (see Deprecating NumbaPro: The New State of Accelerate in Anaconda for more details).

However, there was one missing implementation: @guvectorize(target='parallel'). During the Numba 0.23 release cycle we filled in that gap, so now you can take advantage of gufuncs on your multicore processors:

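# Layout '(n)->()': each call receives one 1D vector and writes one scalar.
# Note: both functions compute the squared L2 norm (no square root is taken).
@guvectorize([(float64[:], float64[:])], '(n)->()')
def l2norm_cpu(vec, result):
    result[0] = (vec**2).sum()

# Same kernel, compiled for the multithreaded CPU target
@guvectorize([(float64[:], float64[:])], '(n)->()', target='parallel')
def l2norm_parallel(vec, result):
    result[0] = (vec**2).sum()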
    

On my quad-core MacBook Pro laptop:

n = 100000
dims = 10
random_vectors = np.random.uniform(size=n*dims).reshape((n, dims))
%timeit l2norm_cpu(random_vectors)
%timeit l2norm_parallel(random_vectors)
10 loops, best of 3: 48 ms per loop
10 loops, best of 3: 19.5 ms per loop

For more details on gufuncs, check out our @guvectorize documentation.

What’s Next?

In the next release cycle, we’ll be refining and improving the features described above, upgrading Numba to LLVM 3.7, and documenting an official API for 3rd parties to extend Numba. Our hope is that this will allow other libraries to add new types and function implementations to the Numba compiler without needing to modify the Numba codebase itself. This opens up a lot of possibilities (nopython support for accessing Pandas DataFrames, anyone?) and we look forward to seeing what you can build with Numba.

As always, you can install the latest Numba with conda:

conda install numba # Don't forget to install scipy for np.dot support!

or find our release at PyPI.

If you have questions or suggestions, we welcome feedback in our GitHub repository and on our mailing list.

(You can download this blog post in notebook form from https://notebooks.anaconda.org/seibert/numba-0-23-release)

 


About the Author

Stanley Seibert

Director, Community Innovation

Stan leads the Community Innovation team at Anaconda, where his work focuses on high performance GPU computing and designing data analysis, simulation and processing pipelines. He is a longtime advocate of the use of Python and GPU computin …
