A Speed Comparison Of C, Julia, Python, Numba, and Cython on LU Factorization
JeanFrancoisPuget 2700028FGP Comments (11) Visits (90686)
How fast can compiled Python be compared to, say C? You'd be surprised by the answer. The study below contradicts common wisdom that you cannot get close to C for matrix oriented computation.
A good example of a study supporting the common wisdom is Sebastian F. Walter's Spee
The code we use is described below, after the results section.
We benchmark the LU factorization of square random matrices of increasing sizes.
Here is the picture I get when comparing Python, Numba, and C.
This is a bit different from the picture Walter had 18 months ago:
Numba is closer to C now. Where does the difference comes from? Numba is now using a new LLVM interface called llvmlite rather than the deprecated llvmpy. This speeds Numba by about 40% on this benchmark.
The story isn't that simple. Walter did not use any C compilation flag except for -03. When we use more, we improve C running times, and the situation becomes.
Numba is not the only compiler that can be used with Python. The most popular one is Cython. Here is how it compares to C.
It is quite close to C.
Using a compiler is not the only way to speed our code. We can also use Numpy array operations. Using them gives significant speedup over pure python:
It is way faster, but it is not good enough to match C.
So, should we use Cython or Numba if we want to have efficient LU factorization in Python? The answer is none of them. We should rather use Scipy LU factorization routines. There are two of them, a Python one, and one that uses LAPACK. Benchmarking these show they are quite effective as the size grows:
This is because these routines use a much more elaborate algorithm than the one we have used so far. The comparison isn't really fair, but I felt compelled to remind readers that one should always look first at existing Python libraries before writing new code.
What about Julia? Adding it to the picture is interesting. A naive translation of the Python code is not great. We must write the code to take into account the Fortran ordering used by Julia. We must also add decorators to speed the code. Then, running times improve significantly and become similar to C.
When we increase the sizes, we see that C, Cython, and Julia performances become very close.
This shows, as InsideLoop commented below, that cpu isn't the limiting factor probably. It is rather memory throughput. It would be interesting to perform the comparison using more cpu intensive code. I provide one in How To Quickly Compute The Mandelbrot Set In Python.
The numerical results are:
We use the following versions.
Garbage collection has been disabled for all code including Julia, except for C of course. All software, except for Julia, is installed via Anaconda. The gcc compiler is installed via mingw in Anaconda.
The code we used is available on github. We comment below the main parts.
The code is from Walter's blog. It is a naive implementation of computing the determinant of a matrix via LU decomposition. It is not immune to overflows. In particular, it can have bad numerics if the value of x[k,k] is close to zero during the main loop. In short, don't take this as a model for implementing LU factorization.
Anyway, we'll use this code as it is representative of array intensive code. Not surprisingly, pure Python performs poorly on that. The code takes two arguments, the matrix x, and a place holder y for the value of the determinant.
def det_by_lu(y, x): y = 1. N = x.shape for k in range(N): y *= x[k,k] for i in range(k+1, N): x[i,k] /= x[k,k] for j in range(k+1, N): x[i,j] -= x[i,k] * x[k,j]
The timing code for it is the following. Similar code is used for all variants, except when noted. We loop over the LU factorization because small timings using wall clock aren't reliable on a Windows machine.
import numpy as npdef run_
One can speed the innermost loop by replacing it with a Numpy array operation. The codes becomes:
def numpy_det_by_lu(y, x): y = 1. N = x.shape with n
The timing code is similar to the one for Python, we simply replace det_by_lu with numpy_det_by_lu in it.
The Numba version is obtained by compiling the naive Python code. One way to do it is this:
from numba import jit, void, doublefastdet_by_lu = jit(void(double[:], doub
The timing code is similar to the one for Python, we simply replace det_by_lu with fastdet_by_lu in it.
Numba cannot compile the code with Numpy array operations.
The code is straightforward, but we add a pragma to say that the innermost loop can be parallelized via simp instructions.
It is compiled with
where gcc.bat is the command line provided by mingw installed with Anaconda.
We import it in Python with the cffi package:
from cffi import FFI ffi = FFI() ffi.cdef('void det_by_lu(double *y, double *B, int N);') C = ff
The timing code is similar to the one for Python, we simply replace det_by_lu with c_det_by_lu in it.
I also tried #pragma GCC ivdep, but it does not change timings significantly.
We can code in Julia in a way similar to Python's code. However, that code is slower than it should. One way to speed Julia is to take into account the Fortran ordering it uses by looping on j before looping on i in the second loop. We also add decorators to speed up the code:
This is much better and we used that version for comparison. Timing code is quite simple:
We repeat this for each matrix size.
Cython code is similar tot he Python code, with added types. I compile and run it via magic commands:
%%cython import cython @cyt
Scipy provides a direct implementation of LU factorization. The signature of the function is a bit different, it returns 3 matrices instead of one. The difference shows in how we check that the result is correct.
def run_scipy(A,B,y,N): # check that result is correct np.copyto(B,A) (P,L,U) = l
Scipy provides another LU factorization routine that calls the LAPACK library. Its signature is different from the above, and it makes the result checking code a bit cumbersome. I skipped it and I assume LAPACK was correct given how proven this library is.
Overall timing code is
def timings(n=7, \ series=['pure python', 'c', 'numba', 'numpy', \ 'cython', 'scipy', 'lapack', 'julia']): Ns = n
The display is using matplotlib. Julia timings were measured outside Python, and added to the Python timings.
def plot_times(times, cols = ,\ nam
Update on January 16, 2016. Added a Numpy array section. Improved Cython, C, and Julia code thanks to remarks from readers.
Update on January 27, 2016. Uploaded the code on github, and used pandas for storing timing results.