Automatic generation of vectorized functions with Blaze and Numba

tl;dr Using Blaze in conjunction with Numba helps speed up code by reducing temporary variables from intermediate computations while allowing you to express computations at a high level.

To run the code examples in this post, you need to install Blaze and Numba, which you can do with

``````conda install -c blaze -c numba blaze numba
``````

NumPy and intermediate calculations

We’re often faced with a computation we need to express that potentially has many intermediate variables. In these situations, memory bandwidth is the computation bottleneck, not the CPU’s ability to perform the operations we’re doing. To illustrate the effect on the runtime of an expression that many intermediate results can have, let’s compare how Numpy performs when computing a simple addition operation versus a more complex operation such as a polynomial.

```In [17]: import numpy as np
In [18]: x = np.arange(100000000, dtype='float64')
In [19]: %timeit x + 1
1 loops, best of 3: 360 ms per loop
In [20]: def poly(t):
....:     return (5.0 * t + 4.0) * t + 1.0
....:
In [21]: %timeit poly(x)
1 loops, best of 3: 1.57 s per loop
```

`poly(x)` is about 4x slower than `x + 1`, not because the computation is that much more expensive, but because there’s a lot of data transfer between the CPU and main memory due to the generation of intermediate results!

Here’s the expression tree for the polynomial calculation:

The above image shows that each binary operation—for example `5.0 * t`—will generate a new temporary array that has to be passed across the memory bus to the CPU, computed and written to a temporary that then has to be fed into the next operation which will trigger the same operation again until we have our final result.

For a reasonably large expression, there are many intermediates generated, and you’ll spend most of your time waiting for things to get to and from the CPU rather than waiting on the CPU to do the computation. This is where Numba comes in.

Numba

Numba is a library that accelerates computations on NumPy arrays through the use of a just-in-time compiler provided by LLVM. One thing I really like about Numba is that I don’t have to write any C extensions or Cython to get native performance.

How does this help us mitigate the effects of temporaries?

Numba can reduce or eliminate the effects of temporaries by computing the entire expression one element at a time instead of piecemeal with many intermediates. The input array is read into memory once the computation is done element by element in a loop, and the result is returned. There’s a single memory read operation (the input array) and a single memory write operation (the output array). Let’s see how we can use Numba to maximize performance:

```In [31]: from numba import vectorize, float64
In [32]: @vectorize([float64(float64)])
....: def fast_poly(t):
....:     return (5.0 * t + 4.0) * t + 1.0
....:
In [33]: %timeit fast_poly(x)
1 loops, best of 3: 403 ms per loop
```

We can get extremely close to native speed. In fact, we’re not that much slower than just adding 1 to `x`. What this means is that adding a few extra math operations to your expression adds virtually no cost to the running time of your program.

How does Numba fit into the larger PyData ecosystem?

We want to write expressive code at a very high level but sacrifice little, if any, performance because of this. Often there’s a tension between these goals as more abstraction generally increases the running time of a program.

With a combination of Blaze and Numba, we can simultaneously achieve both goals, writing abstract expressions and making them run at or close to native speed.

Blaze

Blaze separates computation from data via an expression system together with a `compute` API that knows how to tell specific back-ends what to do with a particular kind of expression and some data source such as a `numpy.ndarray`.

NB: When using Blaze, often the simplest way to interact with your data quickly is through the `Data` object:

```In [41]: from blaze import Data
In [42]: import pandas as pd
In [43]: df = pd.DataFrame({'a': [1, 2, 3], 'b': list('abc')})
In [44]: data = Data(df)
In [45]: data
Out[45]:
a  b
0  1  a
1  2  b
2  3  c
In [47]: data.a.sum()
Out[47]: 6
In [48]: type(_)
Out[48]: blaze.expr.reductions.sum
In [49]: compute(data.a.sum())
Out[49]: 6
In [50]: type(_)
Out[50]: int
```

Blaze is uniquely set up to handle the translation of an expression into a Numba function. Blaze can manipulate expressions before any computation occurs, therefore it can easily take advantage of Numba in ways that other libraries cannot – such as pandas.

Blaze translates an expression into a Numba function in a few simple steps:

1. We turn our Blaze expression into a Python `lambda` expression
2. Based on the `dshape` of the expression’s return value and its arguments’ types, compute a Numba compatible function prototype
3. Call `numba.vectorize` with both of the above to yield a Numba function

NB: Blaze is not a magic bullet. It will not compile every expression into a Numba function. Additionally, it will only create a Numba function when computing expressions over NumPy arrays.

Blaze + Numba

Blaze helps by automating Numba code generation. For example, instead of you having to write down the types of the arguments and the return type multiple times—once when you construct your symbols and again when you write your vectorized function—Blaze can use the type information you provide when you construct your expression to generate the appropriate function signature for `vectorize`. The net result is automation of generating vectorized functions.

Numba code generation doesn’t require anything beyond installing Numba. Once you install Numba, you simply write down a Blaze expression and call `compute`

```In [1]: from numba import vectorize, float64
In [2]: from numpy import linspace, pi
In [3]: from blaze import Data, discover, sqrt, exp
In [4]: x = Data(linspace(-5, 5, 100000000))
In [5]: mu, sigma = -1.33, 1.25
In [6]: expr = 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu) ** 2 / (2 * sigma ** 2))
```

And the timings:

```In [7]: %timeit compute(expr)
1 loops, best of 3: 1.33 s per loop
In [8]: %timeit compute(expr, optimize=False)
1 loops, best of 3: 2.91 s per loop
```

We only get a ≈2x speedup here because the cost of computing `exp` is generally expensive relative to the cost of computing a polynomial.

Final Thoughts

We can use Numba and Blaze together to write efficient high-level code. While Numba and Blaze aren’t magic bullets, and we can’t just pipe arbitrary expressions through them and hope for the best, we can leverage them in key areas to help make our code more efficient while retaining the expressive power of Blaze.

Other Projects

People are doing interesting things with Numba. Here are a few interesting projects and talks:

• numbagg by Stephan Hoyer. It provides near C speed `nan`-aware aggregation functions (i.e., `sum`, `count`, `mean`, etc.) using Numba.
• Ville Tuulos’s talk from PyData Silicon Valley 2014

[1] We use `nopython=True` in Blaze so that Numba will refuse to compile if we’ve written code that will generate Python objects. In general Numba usage, this is not necessary and these examples would run at the same speed without that flag.

[2] The `optimize` function isn’t tied to Numba in any way. It refers to the optimization of Blaze expressions. In the case of a machine with Numba installed, an arithmetic operation on a NumPy array will trigger an attempt to generate a vectorized expression via a special `Broadcast` dispatch implementation.

Thanks to Matthew Rocklin and Stan Seibert for providing valuable feedback on the preliminary drafts of this post.

About the Author

Contractor

Max Gamurar has been with the Anaconda Global Inc. team for over 2 years.