Last week, I wrote about the features Numba has and what our long-term development goals are. Today, I want to dig into the details of how to use Numba more effectively in your programs.

Tip 1: Profile, profile, profile

The very first and most important tip on this list has nothing to do with Numba. Before trying to optimize the performance of any program, it is extremely important to profile its execution. Profiling tools track the amount of time your program spends executing each function, and sometimes also the time required to execute each line of code. Although you might think you know which parts of your program take the longest, you will frequently be surprised (I certainly am!) when you do the measurement.

Python has several profiling tools available. The Python standard library includes the cProfile module, which records the time spent executing each function. To profile the execution of your entire Python application (myprog.py), you can run:

python -m cProfile -o myprog.prof myprog.py

This will create the output file myprog.prof with profiling data, which you can query with the command:

python -m pstats myprog.prof

The pstats tool presents an interactive prompt where you can run commands to sort the results and print the names of the functions that take the longest total time. Here is an example session:

Welcome to the profile statistics browser.
myprog.prof% sort time
myprog.prof% stats 5
Tue Sep 30 18:25:05 2014    myprog.prof

         410681 function calls (410582 primitive calls) in 1.689 seconds

   Ordered by: internal time
   List reduced from 464 to 5 due to restriction <5>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.549    0.549    1.510    1.510 myprog.py:6(normalize)
   100000    0.525    0.000    0.525    0.000 {method 'reduce' of 'numpy.ufunc' objects}
   100000    0.295    0.000    0.961    0.000 myprog.py:3(norm)
   100000    0.086    0.000    0.612    0.000 /Users/stan/anaconda/lib/python2.7/site-packages/numpy/core/_methods.py:23(_sum)
   100000    0.054    0.000    0.666    0.000 {method 'sum' of 'numpy.ndarray' objects}
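
For context, here is a hypothetical myprog.py consistent with the session above (an illustrative sketch, not the actual program that produced these numbers): normalize() is called once, and it calls norm() once per row, which in turn calls ndarray.sum().

import numpy as np

def norm(row):
    # One call to ndarray.sum() per row shows up in the profile as
    # 100000 calls to numpy.ufunc.reduce and ndarray.sum
    return np.sqrt((row ** 2).sum())

def normalize(rows):
    # Called only once, but loops over every row
    return np.array([row / norm(row) for row in rows])

rows = np.random.random((100000, 3))
normalize(rows)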

Once you have identified the slowest parts of your application, you can better strategize how to use Numba (and other tools) to improve performance.

Tip 2: Use @vectorize for element-wise operations

Often, a complex array expression can be reduced to a ufunc (short for “Universal Function”). Ufuncs are functions that operate on scalar arguments (like floats) which NumPy will automatically broadcast over arrays for you. Most standard functions in NumPy that operate element-wise, like cos(), are implemented as ufuncs.

As an example, suppose you have the following array expression in your code:

d = 2 * (a - b) / (a + b)

This can be written as a ufunc and compiled with Numba:

from numba import vectorize

@vectorize('float64(float64, float64)')
def rel_diff(x, y):
    return 2 * (x - y) / (x + y)

and used in your code this way:

d = rel_diff(a, b)

The Numba-compiled ufunc is 3x faster than the NumPy array expression!
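
The exact speedup will depend on your hardware and array sizes. A quick way to check on your own machine is to time both versions (a minimal sketch, using an arbitrary one-million-element array and the rel_diff ufunc defined above):

import timeit
import numpy as np

a = np.random.random(1000000)
b = np.random.random(1000000)

# Plain NumPy array expression
print(timeit.timeit(lambda: 2 * (a - b) / (a + b), number=100))
# Numba-compiled ufunc
print(timeit.timeit(lambda: rel_diff(a, b), number=100))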

Note that unlike the @jit decorator described in the next tip, @vectorize requires that you give the types of the arguments and the type of the return value. Multiple type signatures can be specified as a list:

@vectorize(['float64(float64, float64)', 'float32(float32, float32)'])
def rel_diff(x, y):
    return 2 * (x - y) / (x + y)

Tip 3: Use @jit with explicit looping over array elements

For computations that cannot be described with a ufunc, the @jit decorator can be used to convert a Python function directly to machine code.

Numba has two modes of compilation: object mode and nopython mode. The “nopython” mode is as fast as if you had written the code in a compiled language like C. We can only compile certain kinds of language features in nopython mode, so object mode is provided as the fallback. Object mode is less optimized than nopython mode, but can still be effective if your function contains loops that do operations on numeric data.
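
If you want to be sure a function is compiling in nopython mode rather than silently falling back to object mode, you can pass nopython=True to the decorator, which raises an error instead of falling back. A minimal sketch (the function here is just an illustration):

import numpy as np
from numba import jit

# nopython=True forces Numba to compile the entire function to native
# code, or raise an error if it cannot
@jit(nopython=True)
def sum_of_squares(arr):
    total = 0.0
    for i in range(arr.shape[0]):
        total += arr[i] * arr[i]
    return total

print(sum_of_squares(np.arange(10.0)))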

Numba cannot currently compile array expressions in “nopython” mode. We are working on a solution for that, but in the meantime array expressions compiled with @jit will compile in object mode and run at the same speed they do in NumPy. That said, Numba can be very useful when you have an algorithm that is difficult or impossible to express efficiently using NumPy array expressions and functions.

For example, Jake Vanderplas has an excellent writeup on Conway’s Game of Life in Python in which he gives this implementation of the update step using NumPy:

def numpy_life_step(X):
    """Game of life step using generator expressions"""
    nbrs_count = sum(np.roll(np.roll(X, i, 0), j, 1)
                     for i in (-1, 0, 1) for j in (-1, 0, 1)
                     if (i != 0 or j != 0))
    return (nbrs_count == 3) | (X & (nbrs_count == 2))

This is a very clever bit of NumPy code, though it can be a little hard to see what is happening if you are not familiar with the numpy.roll() function. Eight copies of the game state are made in memory, each displaced by a cell in one of the eight possible directions, and summed together.
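
If numpy.roll() is new to you, a tiny example may help (a sketch unrelated to the game state itself): rolling shifts every element along an axis, wrapping around at the edge.

import numpy as np

X = np.arange(9).reshape(3, 3)
# X is:
# [[0 1 2]
#  [3 4 5]
#  [6 7 8]]
print(np.roll(X, 1, 0))  # rows shifted down by one, with wrap-around:
# [[6 7 8]
#  [0 1 2]
#  [3 4 5]]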

With Numba, we can implement the update step differently, taking advantage of the fact that most cells in the game state are inactive:

import numpy as np
from numba import jit

@jit
def wrap(k, max_k):
    if k == -1:
        return max_k - 1
    elif k == max_k:
        return 0
    else:
        return k

@jit
def increment_neighbors(i, j, neighbors):
    ni, nj = neighbors.shape
    for delta_i in (-1, 0, 1):
        neighbor_i = wrap(i + delta_i, ni)
        for delta_j in (-1, 0, 1):
            if (delta_i != 0 or delta_j != 0):
                neighbor_j = wrap(j + delta_j, nj)
                neighbors[neighbor_i, neighbor_j] += 1

@jit
def numba_life_step(X):
    neighbors = np.zeros_like(X, dtype=np.int8)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            if X[i,j]:
                increment_neighbors(i, j, neighbors)
    return (neighbors == 3) | (X & (neighbors == 2))

Although not as compact as the NumPy version, this implementation is probably easier to follow. It only reads the game state once, and the number of writes is proportional to the number of active cells. As a result, the Numba version is twice as fast as the NumPy version when updating a 1000×1000 cell game state in which 30% of the cells are active.
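
The exact numbers will vary by machine, but you can reproduce a comparison like this with a random board (a sketch, assuming both step functions above are defined):

import timeit
import numpy as np

# 1000x1000 board with roughly 30% of cells active
X = np.random.random((1000, 1000)) < 0.3

print(timeit.timeit(lambda: numpy_life_step(X), number=10))
print(timeit.timeit(lambda: numba_life_step(X), number=10))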

Tip 4: Move up a level when calling small functions many times

In order to transition execution from the Python interpreter to a compiled function, Numba has to insert a wrapper that translates Python objects into their native representation, and translates the function’s return value back again when the function is finished. This wrapper introduces some overhead which can overwhelm the performance gains from compiling very small functions.

Suppose a program spends a lot of time evaluating a cubic polynomial:

def poly(x):
    # Nested form of 3x**3 + x**2 + 2x + 1.5
    return ((3 * x + 1) * x + 2) * x + 1.5

We can have Numba compile this function just by adding the @jit decorator before the function definition:

@jit
def poly(x):
    # Nested form of 3x**3 + x**2 + 2x + 1.5
    return ((3 * x + 1) * x + 2) * x + 1.5

However, the compiled function is “only” 75% faster than Python because of the wrapper overhead. You will see additional performance improvements by compiling any function that calls poly() many times: when one compiled function calls another compiled function, the wrapper overhead is eliminated.

As an example of this, consider the bisection algorithm to solve for the value of x such that poly(x) is equal to zero:

@jit
def bisection(left, right, iterations, tol):
    f_left = poly(left)
    for i in range(iterations+1):
        mid = (left + right) / 2.0
        if right - left < tol:
            break
        f_mid = poly(mid)
 
        if f_mid * f_left > 0:  # same signs
            left = mid
            f_left = f_mid
        else:
            right = mid
    return mid

Compiling just poly() makes bisection() run 30% faster, but compiling both bisection() and poly() makes bisection() run 45 times faster!
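
As a quick sanity check (a sketch; the cubic defined above is monotonically increasing, so it has a single real root near x = -0.6), you can call the compiled functions directly:

root = bisection(-1.0, 0.0, 100, 1e-8)
print(root)        # approximately -0.6
print(poly(root))  # very close to zero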

Tip 5: Ask Questions

There are several situations where Numba might not improve the performance of your code. However, sometimes there are simple changes to your code that can make a huge performance impact. If you run into difficulty, or find something you don’t expect, we strongly encourage you to post questions to the Numba Users Mailing List or the Numba GitHub Issue Tracker. We can often give advice on how to restructure your code to take the most advantage of Numba. In addition, feedback from Numba users helps us prioritize new feature development and direct our efforts toward making your code run faster with each release.


About the Author

Stanley Seibert

Director, Community Innovation

Stan leads the Community Innovation team at Anaconda, where his work focuses on high performance GPU computing and designing data analysis, simulation and processing pipelines. He is a longtime advocate of the use of Python and GPU computing …
