How To Quickly Compute The Mandelbrot Set In Python
JeanFrancoisPuget 2700028FGP Visits (99235)
My Christmas Gift was about creating nice images of the Mandelbrot set. A comment on reddit make me write this sequel. The comment is suggesting that I should use a vectorized version of the code rather than the sequential one I am using. I take this excellent suggestion as an excuse to review several ways of computing the Mandelbrot set in Python using vectorized code and gpu computing. I will specifically have a look at Numpy, NumExpr, Numba, Cython, TensorFlow, PyOpenCl, and PyCUDA. All timings, except for TensorFlow, are measured using Python 3.5 provided by Anaconda. I will compare these to C code.
Let me first define the benchmark I will be using throughout this post. You can skip this explanation and directly go to the code if you wish. Each point in the plane of coordinates (x,y) can be interpreted as a complex number with real part x and imaginary part y. Let c be such point. We define a series of complex numbers for it:
z0 = 0
z1 = c
z2 = z12 + c
zn+1 = zn2 + c
If the size |zn| of these complex numbers stays bounded as n tends to infinity, then c belongs to the Mandelbrot set.
In order to compute images, people usually compute the series until either zn exceed a given horizon value, say 2, or n reaches a max number of iterations. In both cases we store n. In the later case, we assume that the point belongs to the Mandelbrot set.
In order to benchmark various versions of the code I will use the time needed to compute the pixel values for two 1M pixels images. The first image is obtained with 80 iterations, while the second one requires a much larger number of iterations, namely 2048. We only review how to compute pixel values here
The full Mandelbrot set
A detail magnified 50,000 times.
Let's first establish a baseline with a naive code. I took the code from Juli
Timing it on my laptop for our two images with
1 loops, best of 3: 9.21 s per loop
1 loops, best of 3: 3min 11s per loop
Quite slow, isn't it?
Numpy With Loops
The above code uses list comprehensions. Can we speed up the code by using Numpy arrays? Here is a code that loops over arrays instead of using lists.
import numpy as np
Timing it yields
1 loops, best of 3: 19.9 s per loop
1 loops, best of 3: 4min 57s per loop
This is slower than using lists! It reinforces what I already explained in Python is Not C: looping over arrays is perhaps the slowest way of using Python. On way to speed such code is to use a compiler for Python. We'll have a look at two of them, Numba and Cython.
Numba, which is a recent just in time compiler (jit) for Python can do marvel on C like code with Numpy arrays. In order to use it we simply need to import Numba and add a decorator to the functions we want to compile. My code becomes:
from numba import jit @jit def mand
Timing it yields
1 loops, best of 3: 166 ms per loop
1 loops, best of 3: 3.96 s per loop
This more than 100 times faster than the above code! It is also faster than the corresponding Julia code from the same benchmark set, see How To Make Python Run As Fast As Julia.
After writing the above I read read Numba documentation (shame on me, I should have done it before, but it is so easy to use...) and I saw a Mand
z.real * z.real + z.imag * z.imag > 4
We can do even better, by breaking the complex number into its constituents, i.e. into two floating point numbers. The code becomes:
@jit def mand
Timing it yields:
1 loops, best of 3: 101 ms per loop
1 loops, best of 3: 2.43 s per loop
This is a nice improvement. We are more than 100 times faster than the non compiled code. This is a bit fatser than C code (see appendix)! The difference can be due to the machine, as it is not a proper benchmarking platform when it comes to few milliseconds difference. We can safely say that Numba produces code that is as fast as C code.
My version is also slightly faster than the one
1 loops, best of 3: 113 ms per loop
1 loops, best of 3: 2.72 s per loop
We'll see below that we can do even better with vectorized operations.
Another way to speed Python code is to use Cython with static typing. The syntax to declare types is similar to C syntax, but the resulting code is still a Python function we can call from within Python interpreter. I am using a notebook, hence I first load Cython into it with:
Then I add types to the code, and compilation directives that speed up array operations. The code becomes:
%%cython import cython import numpy as np cdef int mandelbrot(double creal, double cimag, int maxiter): cdef: double real2, imag2 double real = creal, imag = cimag int n for n in range(maxiter): real2 = real*real imag2 = imag*imag if real2 + imag2 > 4.0: return n imag = 2* real*imag + cimag real = real2 - imag2 + creal; return 0 @cyt
Timing it yields
1 loops, best of 3: 109 ms per loop
1 loops, best of 3: 2.61 s per loop
This is a tad slower than Numba, and similar to C code. Note that Numba code is closer to Python code.
Numpy Array Operations
Vectorized operations are a way to write code without explicit loops over arrays. The Numpy package provides vectorized operations that extend traditional arithmetic operations to whole arrays. For instance, if z is an array, then z*z returns the array whose elements are the square of the elements of z. Numpy provides many other types of array operations. For instance if done is an array containing truth values, then z[done] = 0 assigns 0 to all the elements of z where the corresponding element of done is True.
There are various ways to vectorize our computation using Numpy. Here is the best I could find. It is about 3 times faster than the one available on PyOpenCl github.
def mandelbrot_numpy(c, maxiter): output = np.zeros(c.shape) z = np.zeros(c.shape, np.complex64) for it in range(maxiter): notdone = n
Given this function takes an array as first argument we need to also change the calling code. it is now:
Timing it for our two images yields
1 loops, best of 3: 1.07 s per loop
1 loops, best of 3: 29.7 s per loop
This is way faster than the non compiled sequential code, but it is slower than compiled code. Let's see how we might improve it further.
One reason the above code is not as efficient as it could be is the creation of temporary arrays to hold intermediate computation result. For instance the expression
will create a temporary array to hold
One way to avoid these temporary arrays is to use the NumExpr package. Our code becomes:
import numexpr as ne def mandelbrot_numpy(c, maxiter): output = np.zeros(c.shape) z = np.zeros(c.shape, np.complex64) for it in range(maxiter): notdone = n
Timing it for our two images yields
1 loops, best of 3: 686 ms per loop
1 loops, best of 3: 20.4 s per loop
This is a nice 30% speedup.
Another way to avoid intermediate arrays is to use Numba vectorize. We will use it where we used NumExpr. Here is the code.
from numba import vectorize, complex64, boolean, jit @vec
Timing it yields:
1 loops, best of 3: 555 ms per loop
1 loops, best of 3: 17.5 s per loop
This is even faster, but it is still slower than the sequential code compiled with Numba. Does it mean that we should forget about vectorized operations? Maybe not.
Numpy is not the only tool we can use for vectorizing our code, here is another one: TensorFlow is a recent open source library from Google for machine learning, and especially neural networks (deep learning). One of its tuto
TensorFlow comes in two flavors: one that uses cpus, and one that uses gpus. The latter uses the graphic processing chip that you may have on your computer (some don't have such chip, but most have at least one). My laptop has a NVIDIA Quadro 1000M. It only supports CUDA Compute 2.1, when Tens
Here is my test with the cpu version. I ran it in a Docker container. The bulk of the code for computing the Mandelbrot set looks like:
This code is quite different from the Numpy one. Here, the code constructs the computation graph. First, there are few constants and variables definitions. Then, few computations are defined (zs, not_diverged). Last, we define an iteration to be the execution of these computations, followed by some assignments.
Timing it with 2 cores yields:
Clearly, the setup time is quite high, but time per iteration is comparable to that of Numpy. The tensorFlow code can probably be optimized a bit, for instance by avoiding the square root in the complex_abs call.
Note that we used the cpu version of it, when its value clearly comes from its gpu version.
Good news is that there is are a couple of Python libraries that can help use gpus, PyOpenCl and PyCUDA. [Side comment:] there is always a Python library that can help, just look for it. [end of side comment]
GPU and PyOpenCl
I am in no way a PyOpenCl expert, and I will start from the code provided on the PyOp
This makes the code at least twice as fast..
The core of the resulting code is:
import pyopencl as cl
The code is yet another completely different beast. It is basically split in two. A first piece, shown in red above, is a piece of C code that will be compiled and run on the selected processing unit. Second, the rest of the code sets the input and output for the C code, and runs it. The calling code is similar to that of vectorized code, but we first turn the input array into a one dimensional array:
PyOpenCl can run on cpus and gpus. I ran both on my laptop and here are the timings I got.
With my Intel cpu:
10 loops, best of 3: 22 ms per loop
1 loops, best of 3: 181 ms per loop
With my NVIDIA gpu:
The results are impressive. The timings are way faster than anything else we tried. The gpu isn't faster. It is probably due to the fact that my NVIDIA chip is rather old. My cpu is 4 cores: Intel(R) Core(TM) i7-2760QM CPU @ 2.40GHz. With hyper threading it yields up to 8 full speed threads.
It is also interesting that I can run the PyOpenCl code in the same notebook as the rest of my code. I actually used the imaging code from My Christmas Gift to check that all the codes I tried compute the right image...
Interestingly, we can do slightly better on gpu by avoiding redundant computation. We simply need to replace the C code with this.
prg = cl.Program(ctx, """ #pragma OPENCL EXTENSION cl_k
Timing it with my Intel cpu:
10 loops, best of 3: 25.6 ms per loop
1 loops, best of 3: 344 ms per loop
And my NVIDIA device:
10 loops, best of 3: 23.9 ms per loop
10 loops, best of 3: 187 ms per loop
Interestingly, the new code is way lower on the cpu, but it gets faster on the gpu.
PyOpenCl can be used to run code on a variety of platforms, including Intel, AMD, NVIDIA, and ATI chips. If we target NVIDIA chips only, then we can use PyCUDA. I provide instructions to install it on Windows and Anaconda here.
The code is similar to the one of PyOpenCl: there is a ++C code that will be executed on the NVIDIA chip, and Python code to compile, execute, and get the results of the C++ code.
Here is the code I am using. It is 6 times faster than the one found on PyCu
import pycuda.driver as drv import pycuda.tools import pycuda.autoinit from pycuda.compiler import SourceModule import pycuda.gpuarray as gpuarray from pycuda.elementwise import ElementwiseKernel complex_gpu = ElementwiseKernel( "pyc
Timing it yields:
10 loops, best of 3: 21.9 ms per loop
1 loops, best of 3: 184 ms per loop
This is similar to PyOpenCl on gpu. No surprise s the same driver and the same device is used in this case.
Numba 0.23.0 provides a new argument that can be used to parallelize code. We simply have to add target='parallel' where we want to use it. Adding it to each of the vectorize calls above does not help. I think it is because parallel incurs some overhead that is only offset when we parallelize more significant chunks of code. Let's parallelize the top loop, the one over the image itself. For this we will use the function guvectorize(). This function accepts arrays as arguments, but these arrays can be of different shape. For instance, we can pass a scalar as a one element array. Starting from the sequential code, we get this code:
from numba import jit, vectorize, guvectorize, float64, complex64, int32, float32 @ji
Timing it yields:
10 loops, best of 3: 44.7 ms per loop
1 loops, best of 3: 798 ms per loop
Win! This is way faster than anything we've tried so far.
Guvectorize can also target CUDA. The code is similar, we change the target to be 'cuda'. We also had to explicitly create an array for passing the maxiteration as there seems to be a bug with passing scalars at this point. You may have to install the cudatoolkit package to get it running. I installed it with conda install cudatoolkit.
Timing it yields
10 loops, best of 3: 53.8 ms per loop
1 loops, best of 3: 906 ms per loop
This is similar t what we get with OpenCl on my machine: Using my Icpu is slightly better than using my NVIDA gpu. Results could be different on a different machine.
The table below recaps all the running times we got. The last line is a sequential C code described in the appendix.
When looking at overall running times PyOpenCl and PyCUDA are the fastest. They are about 5 times faster per iteration than Numba guvectorize(). This is really significant, but it requires writing C or C++ code in addition to Python. If we want to stick to Python code, then the sequential version parallelized with with Numba guvectorize() is the clear winner. What is also quite impressive is that Numba sequential code is as fast as C code, if not faster. The time difference may come from the back end compiler, LLVM for Numba versus Microsoft Visual C++ for Cython and C, but I may be wrong.
(time2 - time1)/(2048 - 80) for vectorized codes
(time2 - time1)/(582- 23.5) for sequential and parallel codes
where time1 is the time for the first image, and time2 the time for the second image, assuming the time per iteration is constant.
Let me close with some warnings. Running the above code on another machine may yield different results. In particular, I am using a rather old NVIDIA chip on a laptop. Its cpu is relatively good in comparison. A more recent machine with a powerful NVIDIA chip may run gpu code way faster.
There are things I haven't tried. The main missing one is Pypy, which is a faster Python. I haven't benchmarked it because I cannot use Pypy for my day to day job. Indeed, I rely on Python scientific stack for that (including scikit-learn, scipy and pandas). That stack does not run on Pypy. I know, this has nothing to do with this benchmark, but that is my excuse for not having pypy up and running on my machine.
Updated on January 9, 2016. Improved code for Numpy with and without Numba vectorize. Added sections on NumExpr, Cython, and PyCUDA.
Updated on January 11, 2016. Added the timing code where relevant.
Updated on January 12, 2016. Addded a section on C code. I also improved the gpu code by avoiding some recomputation. Lots of good comments on reddit here and here, leading to better code, and new ideas still to be tested. I will have a major update to that post, or a new one, after I have tried all of these. I particulalry want to thank the following readers:
Update on January 13, 2016.
Update on January 18, 2016. Added a section on guvectorize().
Appendix: C code
Let's compare with C. Here is the code I am using. This code is faster than th
I compile it using the same compiler used for Cython, i.e. Visual C++ 2013. It is compiled in release mode for the x64 target, and with all optimizations that were effective.
It uses all tricks that were effective in Python, for instance breaking complex into two floats. Inlining the first function does not improve running time.
Code for timing it is the following:
The main() function reads command line arguments, then executes the above loop and prints the overall time. I needed to loop in order to get reliable timings. The timings for the two images are:
1000 loops, time spent 0.109 seconds
50 loops, time spent 2.6 seconds