*This post was originally published on Jupyter Notebooks nbviewer by Joshua Adelman and is reposted here with his permission.*

**TL;DR** – The python CFFI library provides an easy and efficient way to call C code from within a function jitted (just-in-time compiled) by Numba. This makes it simple to produce fast code with functionality that is not yet available directly in Numba. As a simple demonstration, I wrap several statistical functions from the Rmath library.

*This post was originally published on Jupyter Notebooks nbviewer by Joshua Adelman and is reposted here with his permission.*

**TL;DR** – The python CFFI library provides an easy and efficient way to call C code from within a function jitted (just-in-time compiled) by Numba. This makes it simple to produce fast code with functionality that is not yet available directly in Numba. As a simple demonstration, I wrap several statistical functions from the Rmath library.

## Background and Motivation

A large fraction of the code that I write has a performance requirement attached to it. Either I’m churning through a large amount of data in an analysis pipeline, or it is part of a real-time system and needs to complete a specific calculation in a constrained amount of time. Sometimes I can rely on numpy, pandas or existing Python packages that wrap C or Fortran code under the covers to get sufficient performance. Often times though, I’m dealing with algorithms that are difficult to implement efficiently using these tools.

Since I started coding primarily in Python ~6 years ago, in those instances I’d typically reach for Cython to either wrap something I or others wrote in C/C++/Fortran or to provide sufficient type information to my code so that Cython could generate a performant C-extension that I could call from Python. Although Cython has been a pretty rock solid solution for me, the amount of boilerplate often required and some of the strange semantic of mixing python and low-level C code often feels less than ideal. I also collaborate with people who know Python, but don’t have backgrounds in C and/or haven’t had enough experience with Cython to understand how it all fits together.

More and more frequently, I find myself using Numba in instances that I had traditionally used Cython. In short, through a simple decorator mechanism, Numba converts a subset of Python code into efficient machine code using LLVM. It uses type inference so you don’t have to specify the type of every variable in a function like you do in Cython to generate fast code. This subset primarily deals with numerical code operating on scalars or Numpy arrays, but that covers 95% of the cases where I need efficient code so it does not feel that limiting. That said, the most common mistake I see people making with Numba is trying to use it as a general Python compiler and then being confused/disappointed when it doesn’t speed up their code. The library has matured incredibly over the last 6-12 months to the point where at work we have it deployed in a couple of critical pieces of production code. When I first seriously prototyped it maybe a year and a half ago, it was super buggy and missing a number of key features (e.g. caching of jitted functions, memory management of numpy arrays, etc). But now it feels stable and I rarely run into problems, although I’ve written a very extensive unit test suite for every bit of code that it touches.

One of the limitations that I do encounter semi-regularly though is when I need some specialized function that is available in Numpy or Scipy, but that function has not been re-implemented in the Numba core library so it can be called in the so-called "nopython" mode. Basically this means that if you want to call one of these functions, you have to go through Numba’s object mode, which typically cannot generate nearly as efficient code.

While there is a proposal under development that should allow external libraries to define an interface to make usable in nopython mode, it is not complete and will them require adoption within the larger Scipy/PyData communities. I’m looking forward to that day, but currently you have to choose a different option. The first is to re-implement a function yourself using Numba. This is often possible for functionality that is small and limited in scope, but for anything non-trivial this approach can rapidly become untenable.

In the remainder of this notebook, I’m going to describe a second technique that involves using CFFI to call external C code directly from within Numba jitted code. This turns out to be a really great solution if the functionality you want has already been written either in C or a language with a C interface. It is mentioned in the Numba docs, but there aren’t any examples that I have seen, and looking at the tests only helped a little.

I had not used CFFI before integrating it with Numba for a recent project. I had largely overlooked it for two reasons: (1) Cython covered the basic usecase of exposing external C code to python and I was already very comfortable with Cython, and (2) I had the (incorrect) impression that CFFI was mostly useful in the PyPy ecosystem. Since PyPy is a non-starter for all of my projects, I largely just ignored its existence. I’m thankfully correcting that mistake now.

## Rmath. It’s not just for R

Every once in a while I fire up R, usually through rpy2, to do something that I can’t do using Statsmodel or Scikit-Learn. But for the most part I live squarely in the Python world, and my experience with R is rudimentary. So it wasn’t totally surprising that I only recently discovered that the math library that underpins R, Rmath, can be built in a standalone mode without invoking R at all. In fact, the Julia programming language uses Rmath for its probability distributions library and maintains a fork of the package called Rmath-julia.

Discovering Rmath over the summer, led to the following tweet (apologies for the Jupyter Notebook input cell) and a horrific amalgamation of code that worked, but was pretty difficult to maintain and extend:

display(HTML('''Today's

the sort of day that I wrote C code that used the Rmath-julia library and then called that from Python via Cython.

Don't ask— Joshua Adelman (@synapticarbors)

July 11, 2015”’))

Today's

the sort of day that I wrote C code that used the Rmath-julia library and then called that from Python via Cython.

Don't ask— Joshua Adelman (@synapticarbors) July 11, 2015

As I began to introduce more and more Numba into various code bases at work, I recently decided to revisit this particular bit and see if I could re-implement the whole thing using Numba + CFFI + Rmath. This would cut out the C code that I wrote, the Cython wrapper that involved a bunch of boilerplate strewn across multiple .pyx and .pxd files, and hopefully would make the code easier to extend in the future by people who didn’t know C or Cython, but could write some Python and apply the appropriate Numba jit decorator.

So to begin with, I vendorized the whole Rmath-julia library into our project under externals/Rmath-julia. I’ll do the same here in this example. Now the fun begins…

## Building the Rmath library using CFFI

Since we are going to use what cffi calls the "API-level, out-of-line", we need to define a build script (build_rmath.py) that we will use to compile the Rmath source and produce an importable extension module. The notebook "cell magic", %%file will write the contents of the below cell to a file.

%%file build_rmath.py import glob import os import platform from cffi import FFI include_dirs = [os.path.join('externals', 'Rmath-julia', 'src'), os.path.join('externals', 'Rmath-julia', 'include')] rmath_src = glob.glob(os.path.join('externals', 'Rmath-julia', 'src', '*.c')) # Take out dSFMT dependant files; Just use the basic rng rmath_src = [f for f in rmath_src if ('librandom.c' not in f) and ('randmtzig.c' not in f)] extra_compile_args = ['-DMATHLIB_STANDALONE'] if platform.system() == 'Windows': extra_compile_args.append('-std=c99') ffi = FFI() ffi.set_source('_rmath_ffi', '#include', include_dirs=include_dirs, sources=rmath_src, libraries=[], extra_compile_args=extra_compile_args) # This is an incomplete list of the available functions in Rmath # but these are sufficient for our example purposes and gives a sense of # the types of functions we can get ffi.cdef(''' // Normal Distribution double dnorm(double, double, double, int); double pnorm(double, double, double, int, int); // Uniform Distribution double dunif(double, double, double, int); double punif(double, double, double, int, int); // Gamma Distribution double dgamma(double, double, double, int); double pgamma(double, double, double, int, int); ''') if __name__ == '__main__': # Normally set verbose to `True`, but silence output # for reduced notebook noise ffi.compile(verbose=False)

Overwriting build_rmath.py

Then we simply run the script as below assuming we have a properly configured C compiler on our system. For larger projects integration with setuptools is supported. The exclamation point tells the notebook to execute the following command in a system shell.

!python build_rmath.py

We should now have an extension module named _rmath_ffi that gives us access to the functions whose prototypes we enumerated in the ffi.cdef(…).

## An example of replicating scipy.stats with Numba

Now that we have built our module wrapping Rmath using cffi, we can write Numba jit-able versions of scipy stats functions that we can call without additional overhead

import numpy as np import numba as nb import scipy.stats # Import our Rmath module import _rmath_ffi

Now we can define a number of shorter aliases to the Rmath functions for use in the python namespace

dnorm = _rmath_ffi.lib.dnorm pnorm = _rmath_ffi.lib.pnorm dunif = _rmath_ffi.lib.dunif punif = _rmath_ffi.lib.punif dgamma = _rmath_ffi.lib.dgamma pgamma = _rmath_ffi.lib.pgamma

In order for us to use these methods from within a function that we’ve jit-ed with Numba, we need to import cffi_support and register the module:

from numba import cffi_support cffi_support.register_module(_rmath_ffi)

I’ll start off by writing a function that is equivalent to scipy.stats.norm.cdf using two different styles. In the first, pnorm_nb, I’ll make the assumption that I’m going to be working on a 1d array, and in the second, I’ll use numba.vectorize to lift that constraint and create a universal function that can operate on arbitrary dimensional numpy arrays:

@nb.jit(nopython=True) def pnorm_nb(x): y = np.empty_like(x) for k in xrange(x.shape[0]): y[k] = pnorm(x[k], 0.0, 1.0, 1, 0) return y @nb.vectorize(nopython=True) def pnorm_nb_vec(x): return pnorm(x, 0.0, 1.0, 1, 0)

x = np.random.normal(size=(100,)) y1 = scipy.stats.norm.cdf(x) y2 = pnorm_nb(x) y3 = pnorm_nb_vec(x) # Check that they all give the same results print np.allclose(y1, y2) print np.allclose(y1, y3)

True True

And now let’s do the same calculation for 2D data, demonstrating that the vectorized form of the Numba function automatically create the appropriate universal function for the given dimensionality of the inputs:

x = np.random.normal(size=(100,100)) y1 = scipy.stats.norm.cdf(x) y2 = pnorm_nb_vec(x) # Check that they all give the same results print np.allclose(y1, y2)

True

Timing the scipy and numba versions:

%timeit scipy.stats.norm.cdf(x) %timeit pnorm_nb_vec(x)

1000 loops, best of 3: 618 µs per loop 1000 loops, best of 3: 336 µs per loop

We can see that our Numba version is almost 2x faster than the scipy version, with the added bonus that it can be called from within other Numba-ized methods without going through the python object layer, which can be quite slow.

Just for kicks, lets also try to take advantage of multiple cores using the target argument:

@nb.vectorize([nb.float64(nb.float64),], nopython=True, target='parallel') def pnorm_nb_vec_parallel(x): return pnorm(x, 0.0, 1.0, 1, 0)

y3 = pnorm_nb_vec_parallel(x) print np.allclose(y1, y3) print %timeit pnorm_nb_vec_parallel(x)

True The slowest run took 17.27 times longer than the fastest. This could mean that an intermediate result is being cached 1000 loops, best of 3: 131 µs per loop

So on my laptop with 4 physical cores, we get a nice additional speed-up over the serial numba version and the scipy.stats function.

Finally, I’m going to programmatically wrap the all of the Rmath functions I exposed and compare them to the equivalent scipy functions.

from collections import OrderedDict func_map = OrderedDict([ ('norm_pdf', (scipy.stats.norm.pdf, dnorm)), ('norm_cdf', (scipy.stats.norm.cdf, pnorm)), ('unif_pdf', (scipy.stats.uniform.pdf, dunif)), ('unif_cdf', (scipy.stats.uniform.cdf, punif)), ('gamma_pdf', (scipy.stats.gamma.pdf, dgamma)), ('gamma_cdf', (scipy.stats.gamma.cdf, pgamma)), ]) def generate_function(name, rmath_func): if 'norm' in name or 'unif' in name: def cdf_func(x): return rmath_func(x, 0.0, 1.0, 1, 0) def pdf_func(x): return rmath_func(x, 0.0, 1.0, 0) elif 'gamma' in name: def cdf_func(x, shape): return rmath_func(x, shape, 1.0, 1, 0) def pdf_func(x, shape): return rmath_func(x, shape, 1.0, 0) if 'cdf' in name: return cdf_func elif 'pdf' in name: return pdf_func x = np.random.normal(size=(100,100)) for name, (scipy_func, rmath_func) in func_map.iteritems(): nb_func = nb.vectorize(nopython=True)(generate_function(name, rmath_func)) print name if 'norm' in name or 'unif' in name: y1 = scipy_func(x) y2 = nb_func(x) print 'allclose: ', np.allclose(y1, y2) print 'scipy timing:' %timeit scipy_func(x) print 'numba timing:' %timeit nb_func(x) elif 'gamma' in name: shape = 1.0 y1 = scipy_func(x, shape) y2 = nb_func(x, shape) print 'allclose: ', np.allclose(y1, y2) print 'scipy timing:' %timeit scipy_func(x, shape) print 'numba timing:' %timeit nb_func(x, shape) print

norm_pdf allclose: True scipy timing: 1000 loops, best of 3: 545 µs per loop numba timing: 1000 loops, best of 3: 212 µs per loop norm_cdf allclose: True scipy timing: 1000 loops, best of 3: 634 µs per loop numba timing: 1000 loops, best of 3: 328 µs per loop unif_pdf allclose: True scipy timing: 1000 loops, best of 3: 436 µs per loop numba timing: 10000 loops, best of 3: 68.4 µs per loop unif_cdf allclose: True scipy timing: 1000 loops, best of 3: 495 µs per loop numba timing: 10000 loops, best of 3: 128 µs per loop gamma_pdf allclose: True scipy timing: 1000 loops, best of 3: 616 µs per loop numba timing: 1000 loops, best of 3: 277 µs per loop gamma_cdf allclose: True scipy timing: 1000 loops, best of 3: 1.13 ms per loop numba timing: 1000 loops, best of 3: 1.11 ms per loop

## Conclusion

To wrap up, CFFI + Numba provides a powerful and surprisingly simple way to generate fast python code and extend the currently limited repertoire of functionality that is baked into Numba. Pairing this approach with Rmath specifically has been particularly useful in my own work.

## Appendix

For completeness, I’ll use Sebastian Raschka’s watermark package to specify the libraries and hardware used to run these examples:

%install_ext https://raw.githubusercontent.com/rasbt/watermark/master/watermark.py

%load_ext watermark %watermark -vm -p numpy,scipy,numba,cffi

CPython 2.7.11 IPython 4.0.3 numpy 1.10.2 scipy 0.16.1 numba 0.23.1 cffi 1.5.0 compiler : GCC 4.2.1 (Apple Inc. build 5577) system : Darwin release : 13.4.0 machine : x86_64 processor : i386 CPU cores : 8 interpreter: 64bit