Welcome. This post is part of a series of Continuum Analytics Open Notebooks showcasing our projects, products, and services.
In this Continuum Open Notebook, you’ll learn more about how Numba works and how it reduces your programming effort, and see that it achieves performance comparable to C and Cython across a range of benchmarks.
If you are reading the blog form of this notebook, you can run the code yourself on our cloud-based Python-in-the-browser app, Wakari. Wakari gives you a full Scientific Python stack, right from your browser, and allows you to write and share your own IPython Notebooks. Sign up for free here.
How Does Numba Work?
Numba is a Continuum Analytics-sponsored open source project. Numba’s job is to make Python + NumPy code as fast as its C and Fortran equivalents without sacrificing any of the power and flexibility of Python. Python can be slower than C and Fortran because it features a generic, dynamic object system. If you were to look at the CPython source code, you would see that every object, even a simple integer constant, lives in a large, generic PyObject structure. The Python interpreter has to unwind several layers of abstraction each time it operates on a generic object. Let’s consider a simple statement to demonstrate this concept:
c = a+b
We’ll assume that a and b are both floating-point numbers. Adding them together is a single instruction on any modern CPU. This statement in C or Fortran will usually generate just this single floating-point add instruction at compile-time. At run-time, dispatching this instruction will likely only require a single CPU cycle, and it will complete in less than five cycles.
The same statement in Python will generate dozens of instructions. Because a and b are dynamically typed, the interpreter must first determine their types, which requires memory lookups for both a and b. Then the interpreter has to determine whether that type provides an add method. A new object, c, may need to be created, and creating c requires a memory allocation on the heap. Finally, the floating-point add operation is performed and the result is stored in c. The many additional function calls account for the first order of magnitude of the performance difference between Python and compiled languages such as C and Fortran, but it is the memory allocations and pointer dereferencing that account for the next several orders of magnitude. Python does not feature a native just-in-time compiler, so every time it encounters this statement again (such as in a for loop), it has to repeat all the work it just did.
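This overhead is easy to observe with nothing but the standard library: a boxed Python float is several times larger than the 8 bytes a C double occupies, and even a one-line addition compiles to a handful of bytecode instructions, each dispatched through the interpreter loop. A quick illustration (exact sizes and instruction counts vary by CPython version):

```python
import dis
import sys

def add(a, b):
    c = a + b
    return c

# A Python float is a full PyObject with a type pointer and reference count,
# so it is far larger than a raw 8-byte C double.
print(sys.getsizeof(1.0))

# Even this one-line function expands to several interpreter-dispatched
# bytecode instructions (argument loads, the add, a store, a return).
instructions = list(dis.get_instructions(add))
print(len(instructions))
```

Every one of those instructions goes back through the interpreter’s dispatch loop each time the statement runs.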
Numba is our bridge between Python and performance. Numba takes over for the Python interpreter on decorated functions and classes, and intelligently adds type information to as many objects as possible in an expression. When Numba can’t figure out what type an object is, it falls back to the same expensive type queries the Python interpreter uses. Numba then compiles the Python and NumPy functions and classes into performant code. Numba can compile just-in-time with the autojit decorator, or ahead of time with the jit decorator.
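As a rough sketch of the decorator workflow: in current Numba releases, a bare @jit without a type signature gives you the lazy, type-inferred compilation that autojit provided when this post was written. The fallback below is only so the example still runs where Numba is not installed; it is not part of the real API.

```python
import numpy as np

try:
    from numba import jit  # bare @jit compiles lazily, once per argument type
except ImportError:
    # Fallback for environments without Numba: leave the function as plain Python.
    def jit(func):
        return func

@jit
def sum_of_squares(y):
    # A typed loop like this is exactly what Numba compiles to native code.
    total = 0.0
    for i in range(y.shape[0]):
        total += y[i] * y[i]
    return total

print(sum_of_squares(np.arange(4.0)))  # 0 + 1 + 4 + 9 = 14.0
```

The first call triggers compilation for the argument type it sees; later calls with the same type reuse the compiled machine code.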
This notebook provides a benchmark comparison between Python, C interfaced through ctypes, Cython, and Numba – all from an IPython notebook in the cloud that you can run yourself! The notebook is self-validating, with integrated tests checking the correctness of each kernel function before timing it. We encourage you to experiment with the code, try out new ideas, or even improve the code performance or the benchmarks themselves. Feel free to reuse any of this code for your own work.
We start by importing the libraries we need and defining a plotting function. We also install an IPython extension, cmagic, for compiling C code using the same compiler and flags that were used to build Python. By default, we have hidden some of the longer code snippets. Click on the title to view them inline.
Our first benchmark is a simple loop calculating a vector sum over $N$ values. This is a native NumPy function, so we’ll define that first.
We have to write the same loop explicitly in Python.
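The hidden cells define something along these lines (the names mirror the benchmark; the notebook’s exact code may differ slightly):

```python
import numpy as np

def numpy_sum(y):
    """Vector sum using NumPy's optimized native routine."""
    return np.sum(y)

def python_sum(y):
    """The same sum written as an explicit loop in pure Python."""
    total = 0
    for i in range(len(y)):
        total += y[i]
    return total

y = np.arange(10)
print(numpy_sum(y), python_sum(y))  # both give 45
```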
Note that the Python code does not require us to specify what y is, beyond the fact that it must be indexable. Python’s dynamic types are flexible, but that flexibility comes at a performance cost.
Next, we define the Numba code.
With a single line of code, we create a high-performance but equally flexible version of python_sum. When numba_sum is called with a numpy ndarray object, numba_sum will execute at the same speed as C or Cython. Don’t believe me? Let’s time it!
Note: We set the func_name attribute to numba_sum to distinguish it from python_sum, the func_name inherited by default.
Here’s the C code we will compare against. See the notebook for details on how it is interfaced using magic functions.
Notice that because C is statically typed, we have to state ahead of time what the contents of y are.
Next up is Cython.
Cython is an optimizing static compiler for Python. This Cython code will generate a Python extension module. Note that the Cython language is neither C nor Python, but a creole constructed from the two languages. Again, see the notebook for details on how this code is compiled and run using magic functions.
Correctly Measuring Performance
We will use the timeit module to handle our performance comparison. By default, timeit doesn’t have access to the variables in our namespace, so we attach them to the __main__ module and import them back in the timing setup code.
Performance measuring function
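A minimal version of such a timing helper might look like the sketch below. The notebook’s actual function also validates results and sweeps over problem sizes; this one only times a single call, and the names (measure, bench_func, bench_arg) are illustrative, not the notebook’s.

```python
import timeit
import __main__

def measure(func, arg, repeat=3, number=10):
    """Time func(arg) with timeit, passing the objects through __main__."""
    # timeit statements execute in a fresh namespace, so we expose our
    # function and argument on the __main__ module and import them in setup.
    __main__.bench_func = func
    __main__.bench_arg = arg
    timer = timeit.Timer("bench_func(bench_arg)",
                         setup="from __main__ import bench_func, bench_arg")
    # Best-of-`repeat` is less noisy than the mean; divide to get per-call time.
    return min(timer.repeat(repeat=repeat, number=number)) / number

best = measure(sum, list(range(1000)))
print(best >= 0.0)  # a best-of-three per-call time, in seconds
```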
Here are the results of our first call to the timer for three of the benchmarks:
The first time we run the Numba code, we notice that it is much slower than C or Cython. This is a feature. Remember that Numba is a just-in-time compiler; the code is not compiled until the very last moment before execution (hence “just in time”). If we re-run the Numba code a second time:
We see that execution time is consistently much faster. Numba only pays the compilation cost once for a given set of argument types. It caches the compiled result between function calls and recognizes that numba_sum has already been called with an integer array, saving a recompile!
Let’s see how NumPy, Python, Cython, Numba, and C actually stack up!
Whoa! We’re not going to have time to run Python on big arrays, so let’s drop it from the rest of the comparison.
Cython and Numba both do very well for small arrays, although Numba eventually loses some ground for very large arrays. Numba is not quite as fast as C or Cython for very large problems in this case; this will be addressed in an upcoming release.
For loop benchmark (Floating-Point)
This next benchmark demonstrates the true flexibility of Numba. We don’t need to modify the Numba code at all; we simply pass an array of doubles this time instead of integers. In both the C and the Cython code, we have to write new functions with a different type for y.
Whoa! Whoa! Easy with the pitchforks and torches! I have delicate skin!
Yes, we know that this problem could be solved by using typedefs and macros, or templates in C++. But we would still need multiple functions, one for each possible case, and the number would quickly explode combinatorially when multiple options are combined. Besides, the whole point of this exercise is to get performance while keeping the developer’s job as simple as possible.
One of Python’s greatest attributes is its support for clean, generic functions. Numba really shines in supporting generic functions while providing performance through autojit.
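To make the claim concrete, here is the kind of genericity being described, sketched with modern Numba’s lazy @jit (the spiritual successor of the autojit used in this post). With Numba installed, each new argument type triggers one specialized compilation; the ImportError fallback below is only so the example runs everywhere.

```python
import numpy as np

try:
    from numba import jit  # lazy @jit: one compiled specialization per argument type
except ImportError:
    def jit(func):  # fallback: plain Python if Numba is absent
        return func

@jit
def vector_sum(y):
    total = 0.0
    for i in range(y.shape[0]):
        total += y[i]
    return total

# One definition handles both integer and floating-point arrays;
# the C and Cython versions each need a separate function per element type.
print(vector_sum(np.arange(5)))    # int64 input   -> 10.0
print(vector_sum(np.arange(5.0)))  # float64 input -> 10.0
```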
We measure the performance over a range of vector sizes.
From a performance perspective, the different versions behave almost identically for large vectors. Both Cython and Numba really shine for smaller array sizes, though, even beating NumPy!
Artificial benchmarks always leave us with a minor sense of dissatisfaction, similar to the feeling we’re left with after eating hot dogs made out of that unidentifiable bright red meat. Let’s go back to a real application kernel and consider the impacts of using Numba there.
Next, we autojit the pure Python kernels to create accelerated Numba variants.
If we want maximum performance, we need to autojit the two functions used in the iteration loop as well. We could have inlined these functions by hand, but keeping them separate improves the readability of the code. Currently, Numba does not support inlining (Pull Requests welcome!), which makes it harder to keep function calls in innermost loops fast.
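The pattern being described — jitting the small helpers as well as the kernel that calls them, so the calls in the inner loop stay cheap — can be sketched as follows. The window_floor/window_ceil helpers and the local_max kernel are hypothetical stand-ins in the spirit of the GrowCut code, not the notebook’s exact functions, and the fallback decorator leaves everything as plain Python where Numba is not installed.

```python
import numpy as np

try:
    from numba import jit
except ImportError:
    def jit(func):  # fallback so the sketch runs without Numba
        return func

@jit
def window_floor(idx, radius):
    # Clamp the lower edge of a window at 0 (hypothetical helper).
    return max(idx - radius, 0)

@jit
def window_ceil(idx, ceil, radius):
    # Clamp the upper edge of a window at the array bound (hypothetical helper).
    return min(idx + radius, ceil)

@jit
def local_max(image, radius):
    """Toy kernel: for each pixel, take the max over a clamped window."""
    ny, nx = image.shape
    out = np.empty_like(image)
    for i in range(ny):
        for j in range(nx):
            best = image[i, j]
            # These helper calls sit in the innermost loops; jitting the
            # helpers keeps them from falling back to slow Python dispatch.
            for ii in range(window_floor(i, radius),
                            window_ceil(i, ny - 1, radius) + 1):
                for jj in range(window_floor(j, radius),
                                window_ceil(j, nx - 1, radius) + 1):
                    if image[ii, jj] > best:
                        best = image[ii, jj]
            out[i, j] = best
    return out

img = np.array([[0.0, 2.0], [1.0, 3.0]])
print(local_max(img, 1))  # every pixel sees the whole 2x2 image -> all 3.0
```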
Finally, here’s a pure C version of the GrowCut kernel, ready to interface into Python.
We time the verified kernels over a range of square image sizes. Again, very small images in Python, then larger images in Cython, Numba, and C.
We don’t observe a significant performance difference between the C, Cython, and Numba kernels. Of course, only one of the three kernels is written in clean, dynamic Python 🙂
In this open notebook, we compared Numba against Cython and C. First, we explored some simple benchmarks. Then, we returned to the GrowCut example. Our experiments reveal that Numba performs as well as Cython and native C interfaced directly into Python. At the same time, the Numba code is clearly the easiest to understand and write from a Python programmer’s perspective.
We still have much more ground to cover, including how the professional version of Numba, NumbaPro, can accelerate code on GPUs. NumbaPro is available as part of our Anaconda Accelerate product.
At the request of several commenters, here are a test script and benchmarks that we ran on PyPy and Anaconda Python (with Numba). The results are not tuned (I am not a PyPy expert!), so we did not post them in the blog. We’d be happy to look deeper into this with the PyPy developers. While PyPy is not currently installed on Wakari, we are looking at a number of ways we can install and support the PyPy community.