The GPU revolution of the past few years provides inexpensive access to hundreds of specialized computational units on a single silicon die. The challenge is accessing these units efficiently and developing or adapting algorithms that can harness their power. Here, I’ll show how the NumbaPro module from Anaconda Accelerate can be used to parallelize a standard option pricing algorithm onto a GPU, giving a 14x speedup with only a few extra lines of code. Option pricing is well suited to GPUs because the algorithms rely on many independent Monte Carlo simulations that can be computed in parallel.

All code samples are available on GitHub; however, you will need a GPU and NumbaPro installed on your system to try them out.


In finance, an option gives the holder the right to buy or sell an asset at a predetermined price during a particular period of time. To trade options effectively, one must know the fair price of the option, and that value depends strongly on how the price of the underlying asset changes during the life of the option. Of course, we can’t know in advance how the price of the asset will change. Fortunately, we do have a statistical model for how these prices evolve over time. Sadly, for many types of options, it is difficult or impossible to find a closed-form expression for the expected value of the option given the model and its parameters. Because of this, a common approach to these valuation problems is to run a large number of simulations of the model and take the average of the results. This is the Monte Carlo method.
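To make this concrete, here is a minimal sketch of Monte Carlo valuation of a European call option under the standard geometric-Brownian-motion price model. The parameter values are illustrative assumptions of mine, not taken from the article:

```python
import numpy as np

rng = np.random.default_rng(42)

s0, strike = 100.0, 105.0      # initial asset price, option strike (assumed)
rate, vol, t = 0.02, 0.3, 1.0  # interest rate, volatility, maturity in years
n_trials = 200_000

# Simulate terminal prices in one shot:
#   S_T = S_0 * exp((r - 0.5*vol^2)*T + vol*sqrt(T)*Z),  Z ~ N(0, 1)
z = rng.standard_normal(n_trials)
s_t = s0 * np.exp((rate - 0.5 * vol**2) * t + vol * np.sqrt(t) * z)

# The discounted average payoff estimates the fair price of the option
payoff = np.maximum(s_t - strike, 0.0)
price = np.exp(-rate * t) * payoff.mean()
```

Each trial is independent of the others, which is exactly the property the GPU versions below exploit.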

A Simple NumPy Implementation

In the Monte Carlo simulation, the next value of the asset depends on the current price, the interest rate, the volatility of the asset, and a normally distributed random variable. The simulation is repeated with different random draws to generate many different paths for the value of the asset. The following is a simple NumPy implementation (source):

import numpy as np
def step(dt, prices, c0, c1, noises):
    return prices * np.exp(c0 * dt + c1 * noises)
def monte_carlo_pricer(paths, dt, interest, volatility):
    c0 = interest - 0.5 * volatility ** 2
    c1 = volatility * np.sqrt(dt)
    for j in range(1, paths.shape[1]):  # for each time step
        prices = paths[:, j - 1]
        # generate normally distributed random numbers, one per trial
        noises = np.random.normal(0., 1., prices.size)
        # calculate the next batch of prices for all trials
        paths[:, j] = step(dt, prices, c0, c1, noises)



In the above code, the paths variable is an array that contains all prices at various times from the start date to the expiration date. It is a 2D array of shape (number of trials, number of time divisions). The initial prices for all trials are contained in paths[:, 0].
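For reference, here is how the NumPy version might be driven; the trial count, step size, and initial price below are illustrative assumptions (the function definitions are repeated so the snippet is self-contained):

```python
import numpy as np

def step(dt, prices, c0, c1, noises):
    return prices * np.exp(c0 * dt + c1 * noises)

def monte_carlo_pricer(paths, dt, interest, volatility):
    c0 = interest - 0.5 * volatility ** 2
    c1 = volatility * np.sqrt(dt)
    for j in range(1, paths.shape[1]):  # for each time step
        prices = paths[:, j - 1]
        noises = np.random.normal(0., 1., prices.size)
        paths[:, j] = step(dt, prices, c0, c1, noises)

# Illustrative parameters (assumptions, not from the article)
n_trials, n_steps = 10_000, 100
dt = 1.0 / n_steps                  # one year divided into n_steps
paths = np.zeros((n_trials, n_steps))
paths[:, 0] = 100.0                 # same initial price for every trial
monte_carlo_pricer(paths, dt, interest=0.02, volatility=0.3)
mean_final = paths[:, -1].mean()    # average simulated price at expiry
```

Note that the pricer fills `paths` in place, column by column, which is the loop structure the GPU versions preserve.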

Exploiting Parallelism with CUDA Python

A disadvantage of the Monte Carlo method is that it is very compute intensive. Numba alone (example) is not the best candidate for a significant speedup here, because its array expressions are simply lowered to NumPy calls or equivalent implementations. Instead, we can exploit the data parallelism: each simulation path is completely independent of the others, and the only data dependency at each step of a trial is the previous price. With a CUDA device, we can execute hundreds of trials in parallel. The following is a sample NumbaPro CUDA Python implementation using the @vectorize decorator and cuRAND for random number generation (source):

import math
import numpy as np
from numbapro import vectorize, cuda
from numbapro.cudalib import curand
@vectorize(['f8(f8, f8, f8, f8, f8)'], target='gpu')
def step(last, dt, c0, c1, noise):
    return last * math.exp(c0 * dt + c1 * noise)
def monte_carlo_pricer(paths, dt, interest, volatility):
    n = paths.shape[0]
    blksz = cuda.get_current_device().MAX_THREADS_PER_BLOCK
    gridsz = int(math.ceil(float(n) / blksz))
    # Instantiate cuRAND PRNG
    prng = curand.PRNG(curand.PRNG.MRG32K3A)
    # Allocate device side array
    d_normdist = cuda.device_array(n, dtype=np.double)
    c0 = interest - 0.5 * volatility ** 2
    c1 = volatility * math.sqrt(dt)
    # Simulation loop
    d_last = cuda.to_device(paths[:, 0])
    for j in range(1, paths.shape[1]):
        prng.normal(d_normdist, mean=0, sigma=1)
        d_paths = cuda.to_device(paths[:, j])
        step(d_last, dt, c0, c1, d_normdist, out=d_paths)
        d_paths.copy_to_host(paths[:, j])
        d_last = d_paths



The @vectorize decorator turns a scalar function into a vector function that takes array arguments. It has a disadvantage in that it converts all scalar arguments into arrays of one element: loading an element from an array requires an additional memory access, consuming scarce memory bandwidth. Rewriting the kernel using the @jit decorator with the GPU target lets us specify which arguments are scalars (source):

import math
from numbapro import jit, cuda
@jit('void(double[:], double[:], double, double, double, double[:])',
     target='gpu')
def step(last, paths, dt, c0, c1, normdist):
    # expands to i = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    i = cuda.grid(1)
    if i >= paths.shape[0]:
        return  # out-of-range threads do nothing
    noise = normdist[i]
    paths[i] = last[i] * math.exp(c0 * dt + c1 * noise)
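With the @jit kernel, the simulation loop changes only in how step is invoked: the kernel is launched with an explicit [grid, block] configuration instead of the out= argument used by the @vectorize version. A hedged sketch, assuming the same setup (prng, d_normdist, gridsz, blksz) as before — it requires a CUDA device, so it is illustrative rather than directly runnable here:

```python
d_last = cuda.to_device(paths[:, 0])
for j in range(1, paths.shape[1]):
    prng.normal(d_normdist, mean=0, sigma=1)
    d_paths = cuda.to_device(paths[:, j])
    # dt, c0 and c1 are now passed as true scalars, not 1-element arrays
    step[gridsz, blksz](d_last, d_paths, dt, c0, c1, d_normdist)
    d_paths.copy_to_host(paths[:, j])
    d_last = d_paths
```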




The following figure shows the performance benchmark for the four versions:

CUDA Python can speed up the Monte Carlo application by 14x.

Advanced Optimization Technique: Overlapping Copy and Compute

It is possible to optimize further by overlapping copy and compute. To do this, the work is divided equally among multiple CUDA streams. Assuming we use two CUDA streams and have N trials, the first stream computes the first N/2 trials and the second stream computes the remaining N/2 trials.
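The chunking logic itself is plain Python; here is a minimal sketch of dividing N trials across a number of streams (the stream objects, asynchronous copies, and kernel launches are omitted, and the helper name is my own):

```python
import math

def partition(n, n_streams):
    """Split n trials into n_streams contiguous, (nearly) equal chunks."""
    chunk = int(math.ceil(n / n_streams))
    return [(start, min(start + chunk, n)) for start in range(0, n, chunk)]

# Each (begin, end) slice of the paths array would be processed on its
# own CUDA stream, so one stream's host-to-device copy can overlap
# another stream's compute.
bounds = partition(1000, 2)
```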

It can be difficult to determine whether we have successfully overlapped copy and compute without a profiler. NVIDIA Visual Profiler (NVVP) is a great tool for visualizing the performance of a CUDA application. The timeline plot it generates can be used to check whether the copy and compute operations overlap.

The following figure generated using NVVP illustrates the normal execution:

The following figure generated using NVVP illustrates the overlapped execution:

In the two figures, the brown boxes represent memory copy operations; the cyan boxes represent calls to cuRAND; and the purple boxes represent calls to the step kernel. The y-axis shows the CUDA streams. The x-axis is time.

The CUDA device overlaps a copy operation with a compute operation. On a GT 650M device (CC 3.0), the overlapping speeds up the execution by 1.3x.

It is important to avoid memory allocation and deallocation inside the asynchronous simulation loop because these memory operations are implicitly synchronous. We do so by implementing a simple memory manager that controls memory reuse, using CUDA events to track the progress of the execution. (source)
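One way to sketch such a manager (CPU-side pooling logic only; the class name is my own, and the real version must consult CUDA events before recycling a buffer):

```python
class DeviceArrayPool:
    """Reuse previously allocated buffers instead of freeing them,
    so no allocation happens inside the asynchronous loop."""

    def __init__(self, allocate):
        self.allocate = allocate   # e.g. a cuda.device_array factory
        self.idle = {}             # size -> list of recyclable buffers

    def get(self, size):
        # Hand back a recycled buffer if one is available
        buffers = self.idle.get(size, [])
        return buffers.pop() if buffers else self.allocate(size)

    def release(self, buf, size):
        # In real code, recycle only after a CUDA event records that
        # all work using `buf` has completed on its stream.
        self.idle.setdefault(size, []).append(buf)
```

With a stand-in allocator such as `lambda n: bytearray(n)`, a `get` after a `release` returns the same buffer object rather than allocating a new one, which is the behavior that keeps the simulation loop free of synchronous memory operations.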


With NumbaPro CUDA Python, it is easy to leverage massively parallel CUDA devices without writing low-level C code. We have demonstrated the use of NumbaPro to parallelize a Monte Carlo option pricer that takes advantage of the cuRAND library for random number generation. We have also shown an advanced optimization technique that overlaps copy and compute operations, hiding the time spent copying memory between the host and the device.