
## Fast Computation of Julia Set in Python

I just stumbled upon this nice wallpaper, displayed above, and the associated reddit discussion. The code used to produce that image was quite simple, and it looked close to the one I used to compute the Mandelbrot set. I decided to see whether the lessons I learned there could be applied to this new fractal computation. The whole code used in this post is available as a notebook on github and nbviewer.

Before proceeding, let me add the usual caveat: I am running Anaconda with Python 3.5 on a Windows laptop, and timings could differ significantly on another machine or with another Python version.

The code used for the above image is reproduced below. More precisely, I just replaced the parallel range instruction with a standard range; I therefore run a sequential version of the code, while the author ran a parallel version.

```
%%cython
import numpy as np
cimport numpy as np
import cython
@cyt
```

Timing it on my machine yields:

```
1 loops, best of 3: 1.94 s per loop
```

Can we do better? Here is a revised version, modeled after what I did for the Mandelbrot set computation: I made the array declaration more up to date, added the `@cython.wraparound(False)` decorator, and split complex numbers into their real and imaginary parts. The code becomes:

```
%%cython
import numpy as np
cimport numpy as np
import cython
@cyt
```

Timing it yields:

```
10 loops, best of 3: 74.2 ms per loop
```

Wow! More than 25 times faster! A little investigation shows that most of the difference comes from replacing

```
while z.imag**2 + z.real**2 <= 4:
```

with

```
while z.imag*z.imag + z.real*z.real <= 4:
```

With this change alone, the running time of the original code goes from about 2 seconds to about 200 ms. Most of the additional speedup comes from replacing complex numbers with two floating point numbers.

Let's try Numba. The original author's code was significantly slower with Numba than with Cython. Would this still be the case with my improved code? Let's see.
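To make the split-into-real-and-imaginary-parts optimization concrete, here is a minimal pure-Python sketch of the same per-pixel iteration (the function names are mine, not from the notebook, and I add a 255-iteration cap so the sketch always terminates; the notebook code relies on escape instead):

```python
import numpy as np

def julia_point(zreal, zimag, creal=-0.835, cimag=-0.2321, maxiter=255):
    """Count iterations until |z|^2 > 4, using four real multiplications
    per step instead of complex arithmetic and the slow ** operator."""
    n = 0
    zreal2 = zreal * zreal
    zimag2 = zimag * zimag
    while zimag2 + zreal2 <= 4 and n < maxiter:
        # z = z*z + c, with the complex square expanded by hand
        zimag = 2 * zreal * zimag + cimag
        zreal = zreal2 - zimag2 + creal
        zreal2 = zreal * zreal
        zimag2 = zimag * zimag
        n += 1
    return n

def julia_python(N, maxiter=255):
    """Escape-time image on the [-2,2] x [-1,1] grid, as in the post."""
    T = np.empty((N, 2 * N), dtype=np.uint8)
    h = 2.0 / N
    for J in range(N):
        for I in range(2 * N):
            T[J, I] = julia_point(-2.0 + I * h, -1.0 + J * h,
                                  maxiter=maxiter)
    return T
```

This is of course far slower than the Cython or Numba versions; it only illustrates the arithmetic that both of them compile.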
The Numba code is similar to the Cython code, except that all typing information is removed:

```python
import numpy as np
from numba import jit

@jit
def julia_numba(N):
    T = np.empty((N, 2*N), dtype=np.uint8)
    creal = -0.835
    cimag = -0.2321
    h = 2.0/N
    for J in range(N):
        for I in range(2*N):
            zimag = -1.0 + J*h
            zreal = -2.0 + I*h
            T[J, I] = 0
            zreal2 = zreal*zreal
            zimag2 = zimag*zimag
            while zimag2 + zreal2 <= 4:
                zimag = 2*zreal*zimag + cimag
                zreal = zreal2 - zimag2 + creal
                zreal2 = zreal*zreal
                zimag2 = zimag*zimag
                T[J, I] += 1
    return T
```

Timing it yields something similar to Cython:

```
10 loops, best of 3: 71 ms per loop
```

Can we do better? In my Mandelbrot computation, I could improve the code by using the guvectorize feature of Numba. Let's try it here:

```python
from numba import jit, guvectorize, int32, complex128, complex64, uint8

@guv
```

Timing it yields:

```
10 loops, best of 3: 57.6 ms per loop
```

This is faster, but not that much faster. The requirement to construct an array of complex numbers offsets most of what we gain from parallel execution. We could probably do even better using GPU computing, as shown in How To Quickly Compute The Mandelbrot Set In Python. However, my machine's GPU memory isn't large enough for computing the 10,000 by 10,000 image that we target.

Let's time our various codes on that image size:

```
%timeit julia_cython(10000)
%timeit juli
```

It yields:

```
1 loops, best of 3: 3min 16s per loop
1 loops, best of 3: 7.19 s per loop
1 loops, best of 3: 6.79 s per loop
1 loops, best of 3: 3.46 s per loop
```

We see that the Numba sequential code is a tad faster than the Cython sequential code. Both are more than 25 times faster than the original sequential code. The Cython code can be made parallel, which could speed it up significantly. Parallelizing the Numba code yields an almost 2x speedup.
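To illustrate the "array of complex numbers" construction that the guvectorize version operates on, here is a NumPy-only, whole-grid sketch (the function name and the 255-iteration cap are mine, not the notebook's):

```python
import numpy as np

def julia_numpy(N, creal=-0.835, cimag=-0.2321, maxiter=255):
    """Vectorized escape-time computation over a pre-built complex grid."""
    # Build the N x 2N grid of starting points as one complex array;
    # this up-front allocation is the overhead discussed above.
    re = -2.0 + np.arange(2 * N) * (2.0 / N)
    im = -1.0 + np.arange(N) * (2.0 / N)
    Z = re[np.newaxis, :] + 1j * im[:, np.newaxis]
    C = creal + 1j * cimag
    T = np.zeros(Z.shape, dtype=np.uint8)
    alive = (Z.real * Z.real + Z.imag * Z.imag) <= 4.0
    for _ in range(maxiter):
        if not alive.any():
            break
        # Iterate and count only the points that have not escaped yet
        Z[alive] = Z[alive] * Z[alive] + C
        T[alive] += 1
        alive &= (Z.real * Z.real + Z.imag * Z.imag) <= 4.0
    return T
```

Unlike the scalar versions, every step here touches the whole boolean mask and the complex array, which is exactly the kind of memory traffic that eats into the parallel speedup.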
In some environments, replacing

```
while z.imag**2 + z.real**2 <= 4:
```

with

```
while z.imag*z.imag + z.real*z.real <= 4:
```

has no effect. I also ran the above code in a Python 2.7 environment of my Anaconda3 installation. Timings for the 10,000 x 10,000 image are:

```
1 loops, best of 3: 21 s per loop
1 loops, best of 3: 7.34 s per loop
1 loops, best of 3: 7.15 s per loop
1 loops, best of 3: 4 s per loop
```

It seems that we hit a performance bug with our combination of Python 3.5, Cython, and the C compiler.
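As an aside, the `%timeit` measurements above can be reproduced outside a notebook with the standard `timeit` module; here is a small helper mimicking the "best of 3" convention (the helper name and the stand-in workload are mine):

```python
import timeit

def best_of_3(stmt, setup="pass", number=1):
    """Rough equivalent of %timeit's 'best of 3': take the minimum
    wall-clock time over three repeats, divided by loops per repeat."""
    times = timeit.repeat(stmt, setup=setup, repeat=3, number=number)
    return min(times) / number

# Stand-in workload; substitute e.g. "julia_numba(10000)" as the
# statement and import your implementation in `setup`.
t = best_of_3("sum(i * i for i in range(100000))")
print("best of 3: %.1f ms per loop" % (t * 1000))
```

Taking the minimum, rather than the mean, is the usual convention because system noise can only make a run slower, never faster.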