We announced the release of Anaconda Distribution 5 back in October 2017, but we’re only now catching up with a blog post on the security and performance implications of that release. Improving security and enabling new language features were our primary goals, but we also reaped some performance improvements along the way. This blog post covers the recent improvements you can expect to see from updating to the latest Anaconda packages, as well as what you can expect from using Anaconda packages vs pip wheels. We’ll also talk about the partnership between Anaconda and Intel, and the performance implications of that partnership.
Python and library versions
As a security benchmark, we used Debian’s hardening-check script, which checks a binary for several currently available hardening techniques. More information about this script can be found at https://manpages.debian.org/testing/devscripts/hardening-check.1.en.html. We ran this script on the Python binary from each Python distribution. The resulting security flags are generally representative of other packages from the official Anaconda or Intel channels, but may vary depending on the build system (make, cmake, scons, etc.) used for a given package. Because of the more distributed nature of PyPI, security flags should be examined there on a per-package basis.
The check results for these attributes are shown:
- Position Independent Executable (PIE): Enables address space layout randomization (ASLR), which makes buffer overflow attacks more difficult. Prior knowledge of the target executable’s memory layout makes many stack overflow attacks trivial.
- Stack Smashing Protector (SSP): On top of ASLR, this feature forces executables to terminate immediately when a stack overflow is detected. Combining the two considerably reduces the number of buffer exploit targets. Many of the more modern attack vectors (e.g. ROP) still require initiation via a stack buffer overflow.
- Fortified Functions (FFs): Terminate programs when buffer overflows are detected.
- Relocation Read-Only (RELRO): Prevents modification of the Global Offset Table (GOT), which maps the locations of dynamically linked functions. This helps prevent arbitrary code execution.
- Immediate Symbol Binding (NOW): Immediate binding ensures the GOT is read-only for the entire execution of the executable, making it impossible to re-vector its entries during execution.
Note that the benefits of PIE and RELRO/NOW are properties of the launched executable, not of the shared libraries loaded by that executable. Using a protected Python executable is therefore good protection against buffer overflow attacks that target compiled extensions used within that Python process.
* Ubuntu and Anaconda 5 both statically link libpython into their executables, so the Debian check script can read them accurately. Intel’s Python executable depends dynamically on libpython.so instead, which results in the Debian check script showing that Intel’s Python executable does not have SSP. It actually does: it’s just a property of libpython.so.
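As a rough illustration of the kind of property the PIE check looks at, the sketch below reads the ELF header of a binary and reports whether its type field is ET_DYN, the value used by position-independent executables (and by shared libraries). This is a much narrower test than the one hardening-check performs, and the function names here are our own, chosen for illustration only.

```python
import struct
import sys

ET_EXEC = 2  # fixed-address executable (not PIE)
ET_DYN = 3   # shared object or position-independent executable


def elf_type(header: bytes) -> int:
    """Return the e_type field from the first 18 bytes of an ELF file."""
    if len(header) < 18 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # e_ident[5] (EI_DATA) gives the byte order: 1 = little-endian, 2 = big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_type is a 16-bit field that follows the 16-byte e_ident block.
    (e_type,) = struct.unpack_from(byte_order + "H", header, 16)
    return e_type


def looks_position_independent(path: str) -> bool:
    """True if the file at `path` is an ELF file whose type is ET_DYN."""
    with open(path, "rb") as f:
        return elf_type(f.read(18)) == ET_DYN


if __name__ == "__main__":
    try:
        print(sys.executable, looks_position_independent(sys.executable))
    except (OSError, ValueError) as exc:
        print("could not inspect interpreter:", exc)
```

Because shared libraries also report ET_DYN, a positive result here is only a hint; the hardening-check script remains the reference for the results discussed in this post.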
- Performance, the Python benchmark tool: https://pypi.python.org/pypi/performance/0.6.0
- A Black-Scholes benchmark suite created by Intel: https://github.com/IntelPython/BlackScholes_bench
- A basic BLAS benchmark: https://github.com/ContinuumIO/mkl-optimizations-benchmarks
Each benchmark was run five times, and the minimum time and standard deviation for each test within each benchmark were recorded.
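The run-five-times, keep-the-minimum approach can be sketched in a few lines of standard-library Python. This is only an illustration of the measurement idea, not the harness used by the benchmark suites; `time_runs` is a hypothetical helper name.

```python
import statistics
import time


def time_runs(func, repeats=5):
    """Call func() `repeats` times; return (minimum, standard deviation) in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    # The minimum is the run least disturbed by other activity on the machine;
    # the sample standard deviation shows how noisy the runs were.
    return min(durations), statistics.stdev(durations)


if __name__ == "__main__":
    best, spread = time_runs(lambda: sum(i * i for i in range(100_000)))
    print(f"min = {best:.6f} s, stdev = {spread:.6f} s")
```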
We employed an in-house data-science machine to avoid any virtualization overhead. This machine has:
- Intel(R) Xeon(R) CPU E7-8857 v2 @ 3.00GHz (4x 12 core CPUs, 48 total cores)
- 512 GB RAM
- CentOS 6 base OS
Tests were run in Docker containers, for reproducibility and to obtain the OS of choice (Ubuntu 16.04). Docker images can be obtained from https://hub.docker.com/r/continuumio/python_benchmarking/. The Dockerfile for these images, as well as source code for all of these benchmarks, is at https://github.com/continuumio/anaconda-benchmarking.
You can run these benchmarks for yourself using commands like the following:
git clone --recursive https://github.com/continuumio/anaconda-benchmarking
For parallel benchmarks:
docker run -w /project -v ~/anaconda-benchmarking:/project -ti continuumio/python_benchmarking:ubuntu_1604_anaconda_36 bash run.sh
For single-threaded benchmarks:
docker run -w /project -v ~/anaconda-benchmarking:/project -e OMP_THREAD_LIMIT=1 -e MKL_NUM_THREADS=1 -e OMP_NUM_THREADS=1 -ti continuumio/python_benchmarking:ubuntu_1604_anaconda_36 bash run.sh
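If you prefer to launch these runs from a script, the two commands above differ only in the three thread-limiting environment variables. The sketch below assembles either variant as an argument list; it uses only the flags and image tag shown above, and the helper name and the default checkout path are assumptions for illustration.

```python
import os
import shlex

IMAGE = "continuumio/python_benchmarking:ubuntu_1604_anaconda_36"
SINGLE_THREAD_ENV = {
    "OMP_THREAD_LIMIT": "1",
    "MKL_NUM_THREADS": "1",
    "OMP_NUM_THREADS": "1",
}


def docker_command(checkout="~/anaconda-benchmarking", single_threaded=False):
    """Build the argument list for one benchmark run (hypothetical helper)."""
    # Expand "~" ourselves, since no shell is involved when passing a list.
    volume = f"{os.path.expanduser(checkout)}:/project"
    argv = ["docker", "run", "-w", "/project", "-v", volume]
    if single_threaded:
        for name, value in SINGLE_THREAD_ENV.items():
            argv += ["-e", f"{name}={value}"]
    argv += ["-ti", IMAGE, "bash", "run.sh"]
    return argv


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run() to actually start the container.
    print(shlex.join(docker_command(single_threaded=True)))
```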
You can collect and plot results using the Jupyter notebook included in the anaconda-benchmarking repository. You’ll need to adjust the paths coded therein to match the paths where data is dumped on your system.
Results are normalized by the results for the Ubuntu 16.04 Python, with NumPy installed from pip. Values are expressed in terms of t, the time taken for a given benchmark, as t_ubuntu / t_distro, such that values greater than 1 indicate performance increases relative to Ubuntu’s system Python. The plot shown is a histogram of the 60 tests contained in the Python benchmark suite.
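The normalization described above (Ubuntu time divided by distribution time, per test) can be sketched as follows. The timings in the example are made-up placeholders, not measured results, and `relative_speedup` is a hypothetical helper rather than code from the benchmarking notebook.

```python
def relative_speedup(t_ubuntu, t_distro):
    """Map test name -> t_ubuntu / t_distro for tests present in both results."""
    return {
        name: t_ubuntu[name] / t_distro[name]
        for name in t_ubuntu
        if name in t_distro and t_distro[name] > 0
    }


# Hypothetical timings in seconds, for illustration only.
baseline = {"test_a": 3.0, "test_b": 1.0}
candidate = {"test_a": 2.0, "test_b": 2.0}

ratios = relative_speedup(baseline, candidate)
for name, ratio in sorted(ratios.items()):
    verdict = "faster" if ratio > 1 else "not faster"
    print(f"{name}: {ratio:.2f} ({verdict} than the baseline)")
```

A ratio of 1.5 means the distribution finished that test in two thirds of the baseline time; a ratio below 1 means it was slower.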
Here we see the performance improvements that we gained by using new compilers, and by building Python with the features of those compilers, such as link-time optimization and profile-guided optimization. We’ve gained 10-20% on average relative to Ubuntu’s Python, and 30-40% over our earlier builds of Python. If you’d like to see how we achieved this, our recipe for building Python is available at https://github.com/anacondarecipes/python-feedstock.
This benchmark expresses performance in millions of options simulated per second (MOPS). The values plotted here are MOPS_distro / MOPS_ubuntu, so numbers greater than 1 indicate improved performance relative to the NumPy package available on PyPI.
This benchmark highlights the great acceleration that Intel has been able to achieve with random number generation and the erf and invsqrt functions, as well as by using the Intel Threading Building Blocks library for multithreading. Anaconda has incorporated the random number generation and erf advancements from Intel, and as a result, shares many of the performance gains. The invsqrt enhancements are under investigation for future inclusion in Anaconda. Intel also uses their C/C++ compiler for their NumPy package, while Anaconda utilizes GCC. The Intel compiler may be responsible for some of the additional performance increases observed here. Additionally, the Intel implementation of these functions is able to utilize the many cores available on our system for greater speed-up, relative to the pip-installed NumPy package.
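To see why erf matters so much for this workload, here is a minimal scalar sketch of the standard closed-form Black-Scholes price of a European option, together with the MOPS metric. It is not the code from Intel’s benchmark suite, which operates on large arrays; the function and parameter names are our own.

```python
import math


def black_scholes(spot, strike, years, rate, vol):
    """Return (call, put) prices of a European option, closed-form Black-Scholes."""
    sqrt_t = math.sqrt(years)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * years) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    # Standard normal CDF written with erf: N(x) = 0.5 * (1 + erf(x / sqrt(2))).
    n_d1 = 0.5 * (1.0 + math.erf(d1 / math.sqrt(2.0)))
    n_d2 = 0.5 * (1.0 + math.erf(d2 / math.sqrt(2.0)))
    discounted_strike = strike * math.exp(-rate * years)
    call = spot * n_d1 - discounted_strike * n_d2
    put = call - spot + discounted_strike  # put-call parity
    return call, put


def mops(n_options, seconds):
    """Millions of options priced per second."""
    return n_options / seconds / 1e6


if __name__ == "__main__":
    call, put = black_scholes(spot=100.0, strike=100.0, years=1.0, rate=0.05, vol=0.2)
    print(f"call = {call:.4f}, put = {put:.4f}")
```

Every option priced requires two erf evaluations plus a logarithm, an exponential, and a square root, so faster implementations of these elementary functions translate directly into higher MOPS.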
One of the major ways that scientific computing can be sped up is the use of a high-quality BLAS/LAPACK implementation, such as MKL or OpenBLAS. MKL is Intel’s BLAS/LAPACK implementation, and is what Anaconda provides as its default BLAS/LAPACK implementation. NumPy wheels from PyPI use OpenBLAS. We benchmarked a few fundamental BLAS/LAPACK operations, as well as FFTs. Array sizes were integer powers of 2.
The results are expressed in GFLOPS, or billions of floating point operations per second. These are calculated by estimating a number of multiplication and addition operations necessary for a particular math operation, then dividing by the time taken for the overall operation. Because we don’t know exactly what algorithm is used for the given overall operation, these figures are not exact, but establish a general performance ballpark. The values plotted here are GFLOPS_distro / GFLOPS_pip, so numbers greater than 1 indicate improved performance relative to the NumPy package available on PyPI.
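As a worked example of this kind of estimate, the sketch below uses the conventional operation count for a dense matrix product: multiplying an m x k matrix by a k x n matrix takes about one multiplication and one addition per term, or 2*m*n*k operations. We are assuming this textbook count for illustration; the exact counts used in the benchmark repository may differ, and the helper names are hypothetical.

```python
def gemm_flop_estimate(m, n, k):
    """Estimated flops for an (m x k) @ (k x n) product: one multiply, one add per term."""
    return 2 * m * n * k


def gflops(flop_count, seconds):
    """Billions of floating point operations per second."""
    return flop_count / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical timing, for illustration only: a 1000 x 1000 DGEMM in 2 seconds.
    estimate = gemm_flop_estimate(1000, 1000, 1000)
    print(f"{gflops(estimate, 2.0):.1f} GFLOPS")
```

Because the same estimate is divided out when forming the ratio between two distributions, any inaccuracy in the operation count affects the absolute GFLOPS figures but not the relative comparison.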
Here we see that OpenBLAS is highly competitive with MKL in BLAS/LAPACK operations, while Intel and Anaconda behave similarly due to their similar usage of MKL. MKL shows a large performance increase at smaller array sizes for DGEMM, perhaps due to a larger overhead in determining appropriate parallelism with OpenBLAS. MKL also shows a significant performance improvement over OpenBLAS for SVDs. For FFTs, the work that Intel put into integrating MKL’s FFT acceleration into NumPy has tremendous results. Adding other FFT libraries, such as FFTW, can certainly speed up FFTs with pip-installed NumPy, but that’s unnecessary with Anaconda and Intel’s NumPy packages: they are already extremely fast, without any additional user effort or code alteration.
About the Anaconda-Intel partnership
Anaconda and Intel have been collaborating since early 2016, when Anaconda made MKL-powered NumPy its default offering. Since then, Anaconda and Intel have worked closely, with Intel contributing performance-enhancing patches and packages and Anaconda providing package-building tools, recipes, and advice. Keep an eye on this space for future performance improvements!