With CPU core counts on the rise, Python developers and data scientists often struggle to take advantage of all of the computing power available to them. CPUs with 20 or more cores are now available, and at the extreme end, the Intel® Xeon Phi™ has 68 cores with 4-way Hyper-Threading. (That’s 272 active threads!)
To use multiple cores in a Python program, there are three options: multiple processes, multiple threads, or both. Multiple processes are a common way to split work across multiple CPU cores in Python. Each process runs independently of the others, but coordination and communication between processes is a challenge. The multiprocessing package in the standard library, and distributed computing tools like Dask and Spark, can help with this coordination. The main problems with multiple processes – especially for systems with a large number of CPU cores – are memory usage and communication overhead, along with the need for the programmer to think about these issues. Every process needs a working copy of the relevant data, and communication is typically through network sockets. There are ways to mitigate both of these problems, but doing so is not straightforward for most programmers.
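As a rough sketch of the process-based approach, the standard library’s multiprocessing.Pool can split an array into chunks and farm them out to worker processes. The function names here (chunk_sum, parallel_sum) are ours, purely for illustration:

```python
import multiprocessing

import numpy as np

def chunk_sum(chunk):
    # Each worker process receives a pickled copy of its chunk over a pipe,
    # which is exactly the memory and communication overhead described above.
    return float(np.sum(chunk))

def parallel_sum(a, numprocs=4):
    chunks = np.array_split(a, numprocs)
    with multiprocessing.Pool(processes=numprocs) as pool:
        partials = pool.map(chunk_sum, chunks)
    return sum(partials)

if __name__ == "__main__":
    a = np.arange(1_000_000, dtype=np.float64)
    print(parallel_sum(a))
```

For anything beyond embarrassingly parallel chunking like this, the coordination logic grows quickly, which is where tools like Dask and Spark earn their keep.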
On the other hand, threads can be very lightweight and operate on shared data structures in memory without making copies, but suffer from the effects of the Global Interpreter Lock (GIL). The GIL is designed to protect the Python interpreter from race conditions caused by multiple threads, but it also ensures only one thread is executing in the Python interpreter at a time. Fortunately, compiled code called by the Python interpreter can release the GIL and execute on multiple threads at the same time. Libraries like NumPy and Pandas release the GIL automatically, so multithreading can be fairly efficient for Python code where NumPy and Pandas operations are the bulk of the computation time. Tools like Dask can also manage distributing tasks to worker threads for you, as well as the combination of multiple threads and processes at the same time.
But what if you want to multithread some custom algorithm you have written in Python? Normally, this would require rewriting it in some other compiled language (C, FORTRAN, Cython, etc). This is a huge hit to programmer productivity, and makes future maintenance harder. This is one of the reasons we created Numba, as compiling numerical code written in Python syntax is something we want to make as easy and high performance as possible. In the rest of this post, we’ll talk about some of the old and new capabilities in Numba for multithreading your code.
Several years ago, we added the nogil=True option to the @jit compilation decorator. This option causes Numba to release the GIL whenever the function is called, which allows the function to be run concurrently on multiple threads. This assumes the function can be compiled in “nopython” mode, which Numba will attempt by default before falling back to “object” mode. You can force the compiler to attempt “nopython” mode, and raise an exception if that fails, using the nopython=True option. Here’s an example with all this put together:
<pre class="language-python"><code>@numba.jit(nopython=True, nogil=True)<br />def test_nogil_func(result, a, b):<br />    for i in range(result.shape[0]):<br />        result[i] = math.exp(2.1 * a[i] + 3.2 * b[i])</code></pre>
Note that, in this case, Numba does not create or manage threads. To execute this function in multiple threads, you need to use something like Dask or concurrent.futures:
<pre class="language-python"><code>import concurrent.futures<br /><br />import numpy as np<br /><br />def run_multithreaded(func, a, b, numthreads=1):<br />    result = np.zeros_like(a, dtype=np.float64)<br /><br />    # These are views on the original array<br />    a_chunks = np.array_split(a, numthreads)<br />    b_chunks = np.array_split(b, numthreads)<br />    result_chunks = np.array_split(result, numthreads)<br /><br />    with concurrent.futures.ThreadPoolExecutor(max_workers=numthreads) as executor:<br />        futures = []<br />        for a_chunk, b_chunk, result_chunk in zip(a_chunks, b_chunks, result_chunks):<br />            futures.append(executor.submit(func, result_chunk, a_chunk, b_chunk))<br />        # The workers write into views of result, so we only need to wait<br />        concurrent.futures.wait(futures)<br /><br />    return result</code></pre>
Numba also offers fully automatic multithreading when using the special @vectorize and @guvectorize decorators. These decorators are used to create universal functions (AKA “ufuncs”), which execute some elementwise (or subarray, in the case of @guvectorize) operation across an entire array. Most of the functions you are familiar with from NumPy are ufuncs, which broadcast operations across arrays of different dimensions. For example, the built-in arctan2 function can be used this way in NumPy:
<pre class="language-python"><code>a = np.array([-3.0, 4.0, 2.0]) # 1D array<br />b = 1.5 # scalar<br />np.arctan2(a, b) # combines each element of the array with the scalar</code></pre>
Numba lets you create your own ufuncs, and supports different compilation “targets.” One of these is the “parallel” target, which automatically divides the input arrays into chunks and gives each chunk to a different thread to execute in parallel. Here is an example ufunc that computes a piecewise function:
<pre class="language-python"><code>@numba.vectorize('float64(float64, float64)', target='parallel')<br />def parallel_response(v, gamma):<br />    if v < 0:<br />        return 0.0<br />    elif v < 1:<br />        return v ** gamma<br />    else:<br />        return v</code></pre>
Note that multithreading has some overhead, so the “parallel” target can be slower than the single threaded target (the default) for small arrays. Always try a single threaded implementation first!
Earlier this year, a team from Intel Labs approached the Numba team with a proposal to port the automatic multithreading techniques from their Julia-based ParallelAccelerator.jl library to Numba. We were very excited to collaborate on this, as this functionality would make multithreading more accessible to Numba users. The Intel Labs team has contributed a series of compiler optimization passes that recognize different code patterns and transform them to run on multiple threads with no user intervention.
Since ParallelAccelerator support is still new, it is off by default, and a special toggle is required to turn on the parallelization compiler passes. To take advantage of any of these features, you need to add parallel=True to the @jit decorator. Since multithreading also requires nopython mode to be effective, we recommend you decorate your functions this way:
<pre class="language-python"><code>@numba.jit(nopython=True, parallel=True)<br />def example_func(a, b):<br />    return a**2 + b**2</code></pre>
Note that the compiler is not guaranteed to parallelize every function. I’ll give some guidelines below, but if you set NUMBA_DEBUG_ARRAY_OPT_STATS=1 in your environment, Numba will print information to the console about when parallelization occurs.
<pre class="language-python"><code>@numba.jit(nopython=True, parallel=True)<br />def logistic_regression(Y, X, w, iterations):<br />    for i in range(iterations):<br />        w -= np.dot(((1.0 / (1.0 + np.exp(-Y * np.dot(X, w))) - 1.0) * Y), X)<br />    return w</code></pre>
The array operations will be extracted and fused together in a single loop and chunked for execution by different threads.
ParallelAccelerator can parallelize a wide range of operations, including:
- Basic math and comparison operators
- NumPy ufuncs (that are supported in nopython mode)
- User-defined ufuncs created with numba.vectorize
- Reduction functions: sum, prod
- Array creation: np.ones and np.zeros
- Dot products: vector-vector and matrix-vector
Multidimensional arrays are supported, but broadcasting between arrays of different dimensions is not yet supported. Some of what ParallelAccelerator does here was technically possible with the @guvectorize decorator and the parallel target, but it was much harder to write. Using parallel=True results in much easier to read code, and works for a wider range of use cases. The Intel team has benchmarked the speedup on multicore systems for a wide range of algorithms.
Long ago (more than 20 releases!), Numba used to have support for an idiom to write parallel for loops called prange(). After a major refactoring of the code base in 2014, this feature had to be removed, but it has been one of the most frequently requested Numba features since that time. After the Intel developers parallelized array expressions, they realized that bringing back prange would be fairly easy:
<pre class="language-python"><code>@numba.jit(nopython=True, parallel=True)<br />def normalize(x):<br />    ret = np.empty_like(x)<br /><br />    for i in numba.prange(x.shape[0]):<br />        acc = 0.0<br />        for j in range(x.shape[1]):<br />            acc += x[i,j]**2<br /><br />        norm = np.sqrt(acc)<br />        for j in range(x.shape[1]):<br />            ret[i,j] = x[i,j] / norm<br /><br />    return ret</code></pre>
By using prange() instead of range(), the function author is declaring that there are no dependencies between different loop iterations, except perhaps through a reduction variable using a supported operation (like *= or +=). In that situation, the compiler is free to break the range into chunks and execute them in different threads. This is a very simple, but powerful abstraction, familiar to anyone who has used OpenMP in C/C++ or FORTRAN.
The most recent addition to ParallelAccelerator is the @stencil decorator, which joins Numba’s other compilation decorators: @jit, @vectorize, and @guvectorize. Similar to @vectorize, @stencil is used to decorate a “kernel” which is then implicitly broadcast over an array input. However, @stencil is used to describe stencil calculations, where each output element is computed from a neighborhood of elements from the input arrays. Examples of such calculations are found in implementations of moving averages, convolutions, and PDE solvers. Aside from some very hacky stride tricks, there were not very good ways to describe stencil operations on NumPy arrays before, so we are very excited to have this capability in Numba, and to have the implementation multithreaded right out of the gate.
The @stencil decorator is very flexible, allowing stencils to be compiled as top level functions, or as inner functions inside a parallel @jit function. Stencil neighborhoods can be asymmetric, as in the case of a trailing average, or symmetric, as would be typical in a convolution. Currently, border elements are handled with constant padding (zero by default), but we plan to extend the system to support other border modes in the future, such as wrap-around indexing.
A basic stencil kernel accesses the array neighborhood using relative indexing and returns the scalar value that should appear in the output:
<pre class="language-python"><code>@numba.stencil<br />def neighbor_mean(a):<br />    return (a[-1] + a[0] + a[1]) / 3</code></pre>
<pre class="language-python"><code>>>> x = np.array([1, 2, 3, 4, 5])<br />>>> neighbor_mean(x)<br />array([ 0., 2., 3., 4., 0.])</code></pre>
(Note that the default shown here is to zero-pad the output array.)
If the array neighborhood is only accessed with constant indexing, the neighborhood will be inferred by the compiler. However, if you want to loop over the neighborhood (much more convenient for a large neighborhood), the neighborhood needs to be explicitly described in the @stencil decorator:
<pre class="language-python"><code>N = 10<br />GAMMA = 2.2<br /><br />@numba.jit(nopython=True, parallel=True)<br />def blur(x):<br />    def stencil_kernel(a):<br />        acc = 0.0<br />        for i in range(-N, N+1):<br />            for j in range(-N, N+1):<br />                acc += a[i,j]**GAMMA<br /><br />        avg = acc/((2*N+1)*(2*N+1))<br />        return np.uint8(avg**(1/GAMMA))<br /><br />    return numba.stencil(stencil_kernel, neighborhood=((-N,N),(-N,N)))(x)</code></pre>
The result of this blur operation looks something like this (input on left, output on right):
Bonus: Array Comprehensions and Inner Functions
Along the way to implementing ParallelAccelerator, the Intel team also implemented support for initializing NumPy arrays with list comprehensions, an idiom we are calling “array comprehensions”:
<pre class="language-python"><code>@numba.jit(nopython=True)<br />def get_array(n):<br />    def band(i, j):<br />        if abs(i - j) < 3:<br />            return 3 - abs(i - j)<br />        else:<br />            return 0<br /><br />    return np.array([[band(i, j) for i in range(n)] for j in range(n)])</code></pre>
<pre class="language-python"><code>>>> get_array(6)<br />array([[3, 2, 1, 0, 0, 0],<br />       [2, 3, 2, 1, 0, 0],<br />       [1, 2, 3, 2, 1, 0],<br />       [0, 1, 2, 3, 2, 1],<br />       [0, 0, 1, 2, 3, 2],<br />       [0, 0, 0, 1, 2, 3]])</code></pre>
Nesting a list comprehension inside the NumPy array() function is standard practice for NumPy users, but in Numba things work a little differently. Rather than constructing a temporary list of lists to pass to the NumPy array constructor, the entire expression is translated to an efficient set of loops that fill in the target array directly. The support for inner functions (like band() in the example) makes it easy to create complex logic to populate the array elements without having to declare these functions globally.
There are more things we want to do with ParallelAccelerator in Numba in the future. In particular, we want to take a look at how to make better use of Intel® Threading Building Blocks (Intel® TBB) library internally. While the overhead of Numba’s thread pool implementation is tolerable for parallel functions with large inputs, we know that it is less efficient than Intel TBB for medium size inputs when threads could still be beneficial. Additionally, using Intel TBB would allow for better handling of nested parallelism when a multithreaded Numba function is calling out to other multithreaded code, like the linear algebra functions in Intel® Math Kernel Library (Intel® MKL).
We have also learned about ways we can refactor the internals of Numba to make extensions like ParallelAccelerator easier for groups outside the Numba team to write. We want Numba to be a compiler toolbox that anyone can extend for general or specific usage.