Pandas releases the Global-Interpreter-Lock.

In the upcoming 0.17.0 version, pandas will release the Global-Interpreter-Lock (GIL) on groupby operations. In this post, we are going to answer some important questions:

  • What is the GIL?
  • Why is this important?
  • How did we do it?

What is the GIL?

The Global-Interpreter-Lock (GIL) is a mutex that prevents multiple native threads from running Python bytecode in parallel. In essence, this means that a Python program cannot do more than one thing at a time via threading.
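As a quick illustration of the GIL's effect (a hypothetical example, not from pandas), a CPU-bound pure-Python function gains nothing from running in two threads:

```python
import threading
import time

def countdown(n):
    # pure-Python, CPU-bound loop: it holds the GIL the whole time it runs
    while n > 0:
        n -= 1

N = 2_000_000

# run twice, serially
t0 = time.perf_counter()
countdown(N)
countdown(N)
serial = time.perf_counter() - t0

# run twice, in two threads
t0 = time.perf_counter()
threads = [threading.Thread(target=countdown, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - t0

# On standard CPython the threaded version is typically no faster than the
# serial one: only one thread can execute Python bytecode at a time.
print(f"serial: {serial:.3f}s, threaded: {threaded:.3f}s")
```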

However, extension modules (compiled C/C++/Fortran code linked into the Python interpreter) CAN release the GIL. Some Python packages, notably NumPy, do release the GIL.

Why is this important?

It is easy to see that when processing data with pandas, doing more than one thing at once could be really useful! We could do more things faster, or potentially keep working while a blocking API call runs. In fact, some of the motivation for releasing the GIL in pandas was driven by the desire to use this type of in-process parallelism with another project, dask.

We are going to simulate an embarrassingly parallel operation, namely, calculating groupby means. One could imagine using threads to operate on different groups at the same time. This is what I mean by embarrassingly parallel: there are no interactions between the calculations; they are wholly independent.
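One way to picture this (a hypothetical sketch, not pandas code) is to hand each group to a worker in a thread pool. Each task reads only its own group, so there is no shared state to coordinate. Note that before the GIL is released, such threads would still run one at a time for pure-Python work:

```python
from concurrent.futures import ThreadPoolExecutor

# hypothetical stand-in data: each dict entry is one independent group
groups = {k: [float((i * (k + 1)) % 7) for i in range(100)] for k in range(8)}

def group_mean(item):
    # each task touches only its own group's values: no interaction
    # between calculations, hence "embarrassingly parallel"
    k, values = item
    return k, sum(values) / len(values)

with ThreadPoolExecutor(max_workers=4) as ex:
    parallel = dict(ex.map(group_mean, groups.items()))

# the serial computation gives the same answer
serial = {k: sum(v) / len(v) for k, v in groups.items()}
```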

First we set up the environment by importing pandas and NumPy. Then we create a large DataFrame with a bunch of randomly created data.

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: np.random.seed(1234)
In [4]: N = 1000000
In [5]: ngroups = 1000
In [6]: df = pd.DataFrame({'key' : np.random.randint(0,ngroups,size=N),
                           'data' : np.random.randn(N) })
In [7]: df.head()
       data  key
0  0.395838  815
1  0.377571  723
2 -0.452345  294
3 -0.806461   53
4  0.026369  204
In [8]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
data    1000000 non-null float64
key     1000000 non-null int64
dtypes: float64(1), int64(1)
memory usage: 22.9 MB

This is a typical groupby scenario: we want to group and reduce. In this case we will group by the key and retrieve the mean of each group.

In [9]: result = df.groupby('key').mean()
In [10]: result.head()
         data
key
0   -0.059063
1    0.022777
2    0.005029
3    0.021176
4   -0.013846
In [11]: result.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 1 columns):
data    1000 non-null float64
dtypes: float64(1)
memory usage: 15.6 KB

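Under the hood, a groupby-mean is just a "group and reduce" pass over the data. A plain-Python sketch of the same reduction (using hypothetical toy data, not the DataFrame above):

```python
import random

random.seed(1234)
keys = [random.randrange(10) for _ in range(1000)]
data = [random.random() for _ in range(1000)]

# accumulate a running sum and count per key, then divide: this is the
# reduction that pandas performs in compiled code
sums, counts = {}, {}
for k, v in zip(keys, data):
    sums[k] = sums.get(k, 0.0) + v
    counts[k] = counts.get(k, 0) + 1
means = {k: sums[k] / counts[k] for k in sums}
```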
How long did it take to run this operation?

In [12]: %timeit df.groupby('key')['data'].mean()
10 loops, best of 3: 24.8 ms per loop


Now, we are going to time how long it takes to do this operation twice. This is a serial operation, so it clearly should take about twice as long.

In [13]: def g2():
   ....:     for i in range(2):
   ....:         df.groupby('key')['data'].mean()
In [14]: %timeit g2()
10 loops, best of 3: 51.63 ms per loop

Shocker, we were right: about twice as long. The time scales linearly with the number of times we repeat the operation.

This effectively simulates using 2 threads in the current version of pandas: they work one after the other.

How did we do it?

pandas achieves high performance by using NumPy intelligently, by using several external packages (notably numexpr and bottleneck), and through a lot of Cython code.

I gave away the solution at the top of this post! Cython allows us to release the GIL during the execution of C code. For operations that involve only basic, non-pointer data types, like floats and integers (notably, this excludes strings), we can release the GIL fairly easily. This allows multi-threaded programs to use more of the machine's hardware and, consequently, do more work.
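Schematically, the pattern looks like this in Cython (an illustrative sketch with hypothetical names, not the actual pandas source):

```cython
# cython: boundscheck=False, wraparound=False
import numpy as np

def group_add(double[:] values, long[:] labels, Py_ssize_t ngroups):
    # accumulate a per-group sum; `group_add` is a hypothetical name
    cdef:
        double[:] out = np.zeros(ngroups)
        Py_ssize_t i
    # The loop below touches only C-level floats/ints via typed
    # memoryviews (no Python objects), so we can drop the GIL and let
    # other threads run concurrently.
    with nogil:
        for i in range(values.shape[0]):
            out[labels[i]] += values[i]
    return np.asarray(out)
```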

We have created a decorator in pandas, test_parallel, to allow us to run a function with a specified number of threads. It runs the function once per thread.

from pandas.util.testing import test_parallel

@test_parallel(num_threads=2)
def pg2():
    df.groupby('key')['data'].mean()


This is analogous to the g2() function above, doing the same amount of work. We will time it next.

In [18]: %timeit pg2()
10 loops, best of 3: 26.24 ms per loop

Wow. We did the same amount of work in a little more than half the time. This is great!

Taking this exercise further, let’s get some data by running this on 2, 4 and 8 threads.

graph of timings for single vs multi threaded groupbys

Furthermore, we calculate a speedup factor: the ratio of the single-threaded time to the multi-threaded time. This tells us how much more work we could get done in an equivalent amount of time. So using more threads helps! Awesome!

graph of speed ups for releasing the gil for multi-threads

We have seen that it is possible to speed up certain operations in pandas by releasing the Global-Interpreter-Lock. In the future we hope to extend this to other operations and techniques.