Accelerating Python for scientific research
Optimize your Python code for research
Scientists tackle complex scientific problems by breaking them down into much simpler chunks. They then apply the optimal tool to each chunk to obtain a solution. The optimal approach to solving each set of problems may be different, and thus each problem may need different tools.
Python is well suited to data science, machine learning, and deep learning, all of which are gaining in popularity as tools to solve scientific problems. The strengths of Python lie in its integration of multiple approaches to problem solving. By integrating all the problem-solving tools in one container, Python serves as a wonderful toolkit.
In contrast, Fortran remains one of the fastest languages for number crunching, and scientists use it to program numerical algorithms for scientific research. Python gives scientists a powerful way to wrap special-purpose programs and make them easily accessible from a common application layer. In addition, Python can use an appropriate back end, such as a CPU back end, GP-GPU back end, or a quantum processing back end, to accelerate these tools (for example, integrating deep learning with quantum computing).
In this article, we’ll discuss how to optimize Python for scientific research. Specifically, we’ll cover:
- Barriers to using Python and tips for overcoming them
- Python concurrency and extensibility
- Python compilers
- Accelerating hardware to work with Python
- Major Python implementations
Overcoming the barriers to entry
Python is very popular with developers. Opinion polls rank it among the top 10 programming languages, as seen in this IEEE Spectrum report. Despite its popularity, Python has not seen wide usage in core science and technology. Scientists in particular have been concerned about the performance of Python, and that concern is justifiable.
Rather than use Python, scientists have continued to rely on Fortran, C or C++. Some have explored newer languages like Julia designed from the ground up to overcome the deficiencies of Python while retaining the performance of Fortran. Bearing in mind that we are not trying to replace Fortran or C/C++ with Python helps narrow the scope of the discussion. So, at this point, it’s worthwhile to put things in perspective and instead focus our attention on what the real barriers to entry are for Python in scientific research and how we can overcome them.
Pure, native Python programs are slow. Hybrid programming using C/C++ or Fortran can accelerate performance. In the section “Python extensibility,” I explore some of these methods.
Packages such as NumPy and SciPy accelerate linear algebra and matrix operations by using platform-specific, optimized math libraries such as the Linear Algebra Package (LAPACK) and Basic Linear Algebra Subprograms (BLAS) that use hardware acceleration. For more on this, see the section “Hardware acceleration and Python.”
Python trades performance for simplicity and robustness when using threads. Python threads are simple abstractions of the actual operating system threads. CPython has a global interpreter lock (GIL) that limits execution of Python bytecode to a single thread at a time. This mutex also limits the ability of Python to exploit multicore CPUs. All C-based Python extensions are subject to this limitation unless they explicitly release the lock.
Some compilers can explicitly release the GIL around a section of code. Extensions can therefore release the GIL when handing off control to an external library and reacquire it when control returns to the Python code. In the section “Python concurrency,” I discuss this concept more.
Not only does Python2 continue to exist a decade after the first release of Python3, but Python2 is still quite a bit faster than Python3 on some workloads. Python 3.6 is currently the fastest version of Python3, and early reports suggest that Python 3.7 will be faster than Python 3.6 but still not as fast as Python 2.7, which is slated to be the last version of Python2. Another problem is that Python3 support in PyPy, one of the most promising CPython performance-enhancing replacements, lags behind its Python2 support.
Python3 has many performance improvements, including changes to the GIL (from Python 3.2 onward), a new concurrent.futures module, and many language enhancements. All these improvements, as well as the fact that Python2 is going into maintenance mode, make Python3 the Python version of choice moving forward.
The design of Python puts productivity ahead of raw performance and makes Python a popular choice with developers. Python implementers have already overcome or worked around many performance shortcomings in Python, yet concerns persist mainly because of the lack of good documentation about these solutions. In part, this is because some of the topics are esoteric and deal with Python internals rather than with Python application development.
This article goes over some of the important aspects of Python optimization and how to accelerate Python performance to make it suitable for use in scientific research computations.
Python concurrency
By design, Python leaves concurrency to the operating system and provides only a simple wrapper around the OS threading mechanisms. The CPython interpreter executes Python bytecode on only one thread at a time: the executing thread locks out other threads by acquiring the GIL. This simple and robust design also makes it easy to write extensions.
Multithreaded performance remains good when some threads are I/O bound, but when all threads are CPU bound, performance suffers—a big deal for scientific computing because most tasks are CPU bound. It may seem trivial to get rid of the GIL, but doing so is not simple. Attempts to remove the GIL from CPython have not been very successful so far. Most attempts result in either a performance hit to the single-thread mode of operation or in breaking compatibility with extensions, neither of which is a desirable outcome.
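The I/O-bound case can be illustrated with a minimal sketch. It assumes that time.sleep is a fair stand-in for a blocking I/O call (CPython releases the GIL while a thread sleeps or waits on I/O). The names fake_io_task and run_threaded and the 0.2-second delay are hypothetical, chosen only for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

DELAY = 0.2  # seconds each simulated I/O call blocks (illustrative value)


def fake_io_task(task_id):
    """Stand-in for a blocking I/O call; the GIL is released while sleeping."""
    time.sleep(DELAY)
    return task_id


def run_threaded(n_tasks):
    """Run n_tasks simulated I/O calls, one per thread; return (results, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        results = list(pool.map(fake_io_task, range(n_tasks)))
    return results, time.perf_counter() - start


results, elapsed = run_threaded(4)
print(f"4 tasks finished in {elapsed:.2f}s "
      f"(run one after another they would need about {4 * DELAY:.1f}s)")
```

Because the threads spend their time waiting rather than executing bytecode, the waits overlap. Replacing the sleep with a pure-Python computation would likely remove most of that benefit, because the threads would then compete for the GIL.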
An alternate method is to spawn processes instead of threads. Python implements this functionality using the multiprocessing module. Each process has its own GIL and therefore won’t block the other. In Python3, this functionality is part of the new concurrent.futures module. The method has some overheads, but Python can use multicore CPUs and CPU clusters this way.
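As a rough sketch of the process-based approach, the following example spreads a CPU-bound function across worker processes with concurrent.futures. The functions count_primes and parallel_count, the worker count, and the input sizes are illustrative assumptions, not part of any library.

```python
from concurrent.futures import ProcessPoolExecutor


def count_primes(limit):
    """CPU-bound work: count the primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def parallel_count(limits, workers=2):
    """Evaluate count_primes for each limit in separate worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Each worker is its own interpreter with its own GIL.
        return list(pool.map(count_primes, limits))


if __name__ == "__main__":
    # The guard matters on platforms that start workers by re-importing this file.
    print(parallel_count([10_000, 20_000, 30_000, 40_000]))
```

The overhead mentioned above comes from starting the processes and from pickling arguments and results between them, so this pattern tends to pay off only when each task does a substantial amount of work.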
As we saw earlier, some compilers can explicitly release the GIL around a section of code. Extensions can therefore release the GIL when handing off control to an external library and reacquire it when control returns to the Python code. This feature allows Python to hand off to C code, which can handle multithreading or multiprocessing. Scientific programming packages in Python such as NumPy and SciPy use this approach.
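The standard library offers a small way to see this hand-off in action. The hashlib module is implemented in C, and its documentation states that it releases the GIL while hashing large buffers. The sketch below hashes several buffers on separate threads. The buffer sizes and the helper names digest and digest_all_threaded are assumptions made for illustration, and any speed-up depends on the number of cores available.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# Four 4 MiB buffers with distinct contents (sizes are illustrative).
BUFFERS = [bytes([i]) * (4 * 1024 * 1024) for i in range(4)]


def digest(data):
    """Hash one buffer; the C implementation can drop the GIL while it works."""
    return hashlib.sha256(data).hexdigest()


def digest_all_threaded(buffers):
    """Hash every buffer on its own thread and keep the input order."""
    with ThreadPoolExecutor(max_workers=len(buffers)) as pool:
        return list(pool.map(digest, buffers))


sequential = [digest(b) for b in BUFFERS]
threaded = digest_all_threaded(BUFFERS)
print("threaded results match sequential results:", threaded == sequential)
```

The Python code here stays single-threaded in spirit: each thread does almost nothing except call into C, which is the same division of labor that NumPy and SciPy rely on.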
Python extensibility
Python is highly extensible, and many methods exist for writing extensions in C or Fortran. Python code can call these extensions directly as subroutines, if necessary. This section discusses some of the major compilers used to build extensions (it is by no means a complete list).
Cython (which is distinct from CPython) refers to both a language and a compiler. The Cython language is a superset of the Python language that adds C language syntax. Cython can explicitly release the GIL either in code sections or in complete functions. The declaration of C types on variables and class attributes as well as calls to C functions use C syntax. The rest of the code uses Python syntax. From this hybrid Cython code, the Cython compiler generates highly efficient C code. Any regular optimizing C/C++ compiler can compile this C code, resulting in highly optimized runtime code for the extension, with performance close to that of native C code.
Numba is a dynamic, just-in-time (JIT), NumPy-aware Python compiler. Numba uses the LLVM compiler infrastructure to generate optimized machine code and a wrapper to call it from Python. Unlike Cython, coding is in regular Python. Numba reads type information from annotations embedded in the decorator and optimizes the code. For programs that use NumPy data structures, such as arrays, and a lot of math functions, it can achieve similar performance to C or Fortran. Because NumPy delegates linear algebra and matrix functions to hardware-accelerated LAPACK and BLAS libraries, combining it with Numba can improve performance dramatically, as seen in the IBM blog post, A Speed Comparison Of C, Julia, Python, Numba, and Cython on LU Factorization.
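A minimal sketch of the decorator-based workflow might look like the following. It assumes Numba is installed. If it is not, a hypothetical pass-through fallback keeps the example runnable, just without compilation. The function sum_of_squares is an illustrative name, and the sketch uses Numba's lazy mode, where types are inferred on the first call, rather than an explicit signature in the decorator.

```python
try:
    from numba import njit  # third-party; compiles the function on first call
except ImportError:
    def njit(func):
        """Fallback so the sketch still runs (uncompiled) without Numba."""
        return func


@njit
def sum_of_squares(n):
    """Sum i*i for i in [0, n) with an explicit loop that a JIT can compile."""
    total = 0.0
    for i in range(n):
        total += i * i
    return total


print(sum_of_squares(1_000_000))
```

The first call includes compilation time, so any timing comparison against plain Python should measure a second call.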
Numba is also capable of using a GP-GPU back end in addition to the CPU. Anaconda, Inc., the company behind one of the major Python distributions, also develops Numba and Numba Pro, a commercial version.
The Fortran to Python Interface Generator
The Fortran to Python Interface Generator (F2Py) started as a separate package but is now part of NumPy. F2Py allows Python calls to numerical routines written in Fortran as if they were another Python module. Because the Python interpreter cannot understand Fortran source code, F2Py compiles Fortran into native machine code in the form of a dynamic library file—a shared object with functions that have the interfaces of a Python module. As a result, Python can call those functions directly as subroutines that execute with the speed and performance of the native Fortran code.
The standard built-in Python compiler is CPython. Unlike the compilers discussed in the previous section, CPython is really the Python interpreter implemented in C. In interactive mode, CPython executes instructions as they are entered, just like any classical interpreter. When it runs a script or imports a module, however, CPython transparently translates Python code into bytecode that resides in memory or on disk in a cache folder (named __pycache__ in Python3).
Python bytecode is a lower-level, yet platform-independent translation of the Python source code. In this respect, bytecode is different from machine code produced by compiling a C or Fortran program, for example. Execution of Python bytecode happens on a Python virtual machine (PVM), which is the Python runtime engine and part of CPython. This PVM is nothing but an interpreter loop that interprets and executes bytecode. As such, Python programs don’t run as fast as C, for example.
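To make the idea of bytecode concrete, the standard dis module can show what CPython generates for a function. This is only an illustrative sketch: the function add is a made-up example, and the exact instruction names vary between Python versions.

```python
import dis


def add(a, b):
    """A tiny function whose bytecode is easy to read."""
    return a + b


# The compiled bytecode is stored on the function's code object as raw bytes.
raw_bytecode = add.__code__.co_code

# dis decodes those bytes into the named instructions the interpreter loop executes.
opnames = [instruction.opname for instruction in dis.get_instructions(add)]

print(type(raw_bytecode).__name__, len(raw_bytecode))
print(opnames)
```

Each name in the list is one step for the interpreter loop, which is why even a one-line addition costs several dispatches in CPython where compiled C would need only a handful of machine instructions.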
This fact is the source of much of the confusion and many of the misconceptions about Python. Standard Python libraries are certainly slower than C, but Python packages can exploit the power of Fortran, C/C++, and hardware acceleration to enhance performance. All this discussion has been about CPython, the standard implementation of Python, but many alternative implementations exist, and some distributions of Python may repackage or even replace CPython.
Here’s a list of major CPython alternatives:
- IronPython: This implementation of Python is integrated with the Microsoft .NET Framework and designed to run on the Microsoft Common Language Runtime (CLR). It is one of the few Python implementations that does not implement the GIL.
- PyPy: This is an alternative implementation of Python built around a JIT compiler. It often runs much faster than CPython and can use less memory, but it also implements the GIL and consequently is subject to the same limitations as CPython. An experimental branch of PyPy implements Software Transactional Memory (STM), which eliminates the GIL.
- Jython: This implementation, which runs Python on the Java™ Virtual Machine (JVM), is another that does not implement the GIL. Jython is truly multithreaded and supports extensions using the Java Foreign Function Interface (JFFI). Support for the C Foreign Function Interface (CFFI) and C extensions is still a road map item.
- Stackless Python: This enhanced version of CPython supports microthreads. According to the implementers, microthreads, or “threadlets,” run at least an order of magnitude faster than OS threads and scale much better. Contrary to popular belief, Stackless Python does implement the GIL and consequently cannot take advantage of any multicore features of the OS or hardware processors. Standard Python can also reap the benefits of microthreading by using the greenlet module, a spinoff of this development branch.
Hardware acceleration and Python
Python can run on many different back ends, ranging from single CPUs to quantum computers. As Python is a high-level language, hardware acceleration needs the help of an external hardware driver and library. In this section, I go over some of the widely used hardware back ends and how Python uses them as well as some of the newer back ends, such as quantum computing.
CPU on most machines
Most machines, whether local desktops or remote cloud-based servers, have multicore CPUs:
- Single-core CPU: The interpreter executes Python bytecode on only one thread at a time. The executing thread locks out other threads by acquiring the GIL. This is the baseline environment on which Python runs.
- Multicore CPU: The multiprocessing module and the concurrent.futures module are the preferred ways for Python to take advantage of multicore CPUs and achieve modest performance gains.
There are multiple ways to take advantage of the CPU clusters on supercomputers or the CPU coprocessors typically found in high-performance computing workstations. The most popular is an implementation of the Message Passing Interface (MPI) such as OpenMPI. Python bindings for MPI are in the MPI4Py module. MPI is most effective on distributed memory machines, and MPI4Py lets a Python program run in parallel across the processes of an MPI job.
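The following sketch shows one common MPI pattern: each process computes a partial result and the root process combines them. It assumes that mpi4py and an MPI runtime are installed. The helper local_slice, the problem size, and the script name used below are hypothetical and serve only as illustration.

```python
def local_slice(n_items, rank, size):
    """Return the [start, stop) range of items that process `rank` handles."""
    base, extra = divmod(n_items, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def main():
    from mpi4py import MPI  # third-party; requires an MPI runtime

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()  # index of this process within the job
    size = comm.Get_size()  # total number of processes in the job

    # Every process sums the squares in its own share of the range.
    start, stop = local_slice(1_000_000, rank, size)
    partial = sum(i * i for i in range(start, stop))

    # Combine the partial sums on the root process (rank 0).
    total = comm.reduce(partial, op=MPI.SUM, root=0)
    if rank == 0:
        print(total)


if __name__ == "__main__":
    try:
        main()
    except ImportError:
        print("mpi4py is not installed; skipping the MPI run.")
```

A script like this is typically started through the MPI launcher, for example mpiexec -n 4 python sum_squares_mpi.py, so that four copies run and each one sees a different rank.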
GP-GPU on graphics chip sets
Depending on whether you use NVIDIA or other GPU chipsets, you could use one of two modules:
- NVIDIA PyCUDA: This module maps NVIDIA CUDA onto Python so that Python can take advantage of GP-GPU programming on NVIDIA GPU chipsets.
- PyOpenCL: This module allows Python to access the OpenCL API, giving Python the ability to use GP-GPU back ends from GPU chipset vendors such as AMD and Intel.
Google Tensor Processing back ends
Currently, the only way for Python to access a Tensor Processing Unit (TPU) back end is through the TensorFlow framework. The advantage of doing so is that you can use tensors, which are n-dimensional arrays, with the TensorFlow Python API. TPUs are matrix processors rather than vector processors and can potentially perform hundreds of thousands of operations per cycle without needing to access memory.
Quantum computing back ends
Quantum computing platforms such as the IBM quantum processor chip are relatively new platforms aimed at the high-end computing needed in scientific research. Certain classes of problems that could theoretically take conventional supercomputers millions of years to solve might take a quantum computer only a few hours. Modeling of complex molecules is another application where quantum computers are expected to excel.
IBM makes this powerful new resource available as a cloud back end as part of the IBM Q™ Experience for academic and scientific research. The IBM Q Experience gives researchers and academic institutions access to the 5-qubit and 16-qubit IBM quantum computers through IBM Cloud™ services. IBM Q™ is a commercially available implementation of the 20-qubit IBM quantum computer, also offered as an IBM Cloud service. An SDK that implements a Python API client to the online back end allows access to the system.
The IBMQuantumExperience Python package is the official API client for using the IBM Q Experience in Python.
Major Python distributions
By now, you may have realized that optimizing or accelerating Python is not a trivial task and is best left to the experts. To that end, rather than starting with a standard distribution and spending a lot of time optimizing it, it’s better to start with one of the major Python distributions customized for scientific research listed below:
- ActiveState ActivePython: This commercial distribution of Python includes most of the major scientific computing modules. A free community edition is also available.
- Anaconda Python: This is one of the most popular commercial Python distributions. It includes an easy-to-use, graphical development environment manager, with the ability to manage multiple versions of Python from different development channels. Most major vendors, including Intel, Microsoft, and IBM, have integrated their offerings into the Anaconda distribution. A free community edition is also available.
- Enthought Canopy: This commercial distribution also offers a free lite version, and more than 450 scientific computing modules are available.
Python on Power architectures
IBM has partnered with Anaconda, Inc., to bring the Anaconda Python distribution to IBM® POWER® platforms such as IBM POWER8® and POWER9™. This distribution integrates the IBM PowerAI software distribution for machine learning and deep learning. It can also use the NVIDIA high-speed NVLink interface to the NVIDIA Tesla Pascal P100 GPU accelerators integrated with this platform, giving a performance boost to deep learning and analytics applications.