Python Is Not C: Take Two
Jean-François Puget
When I wrote Python Is Not C 6 months ago I did not imagine that it would be my most popular post ever, with more than 67k views. The conclusion of that post reads:
The lesson is clear: do not write Python code as you would do in C. Use numpy array operations rather than iterate on arrays. For me it meant a mental shift.
Given that the Python ecosystem is rapidly evolving, I decided to revisit this conclusion using the performance improvement tools that I discussed in my previous post.
Let me briefly introduce the topic again for those who haven't read the original post. Given a set of track points and a set of way points, each defined by a latitude and a longitude, the task is to find, for each way point, the closest track point. I use two datasets: the raam dataset and a randomly generated one.
These data sets are pretty different in nature, as a plot of the way points by latitude and longitude shows.
A Naive Code
When I first coded that problem, I had little time. My first reaction was to look for a code snippet that computes the distance between two points given their latitudes and longitudes. This is called the great circle distance. It so happens that John D. Cook made such code available on his blog.
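His function computes the arc length between two points on a unit sphere from their coordinates in degrees; multiplying by the Earth's radius gives an actual distance, but for comparing distances the arc alone is enough. A lightly commented sketch of it (the clamp against rounding errors is my addition):

import math

def distance_on_unit_sphere(lat1, long1, lat2, long2):
    # Convert latitude and longitude to spherical coordinates in radians.
    degrees_to_radians = math.pi / 180.0
    phi1 = (90.0 - lat1) * degrees_to_radians
    phi2 = (90.0 - lat2) * degrees_to_radians
    theta1 = long1 * degrees_to_radians
    theta2 = long2 * degrees_to_radians
    # Spherical law of cosines; the result is the arc length on a unit sphere.
    cos = (math.sin(phi1) * math.sin(phi2) * math.cos(theta1 - theta2)
           + math.cos(phi1) * math.cos(phi2))
    # Clamp guards against rounding errors that would take acos out of range.
    arc = math.acos(min(1.0, max(-1.0, cos)))
    return arc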
I was all set! All I needed to do was to write a little Python function that returns the index of the point closest to the input coordinates:
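A minimal sketch of such a function follows; the names closest_point, tracks, and the Latitude/Longitude columns are assumptions, and the original listing may differ:

def closest_point(latitude, longitude, tracks):
    # Naive loop: compute the great circle distance to every track point and
    # keep the closest one.
    best, best_distance = -1, float('inf')
    for row in tracks.itertuples():
        distance = distance_on_unit_sphere(latitude, longitude,
                                           row.Latitude, row.Longitude)
        if distance < best_distance:
            best, best_distance = row.Index, distance
    return best, best_distance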
In order to profile it, I now use the following function. It takes two data frames as input and computes, for each point in the second data frame, the closest point in the first one, storing the result in the second data frame.
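A sketch of that profiling function, under the same naming assumptions:

def closest_points(tracks, tests):
    # For each way point in tests, find the closest track point and store its
    # index and distance back into the tests data frame.
    for index, row in tests.iterrows():
        best, distance = closest_point(row['Latitude'], row['Longitude'], tracks)
        tests.loc[index, 'closest'] = best
        tests.loc[index, 'distance'] = distance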
This code takes 3 minutes 42 seconds on my laptop for the raam dataset, and 3 minutes 44 seconds with the random dataset.
Using Vectorized Operations
Cutting a long story short, I could slash the running time using Numpy vectorized functions. Here is the resulting function for computing the closest point without explicit Python loops. It is modeled on John D. Cook's function:
import numpy as np
import math

def closest_point(latitude, longitude, tracks):
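    # The body below is a reconstruction sketch rather than the original
    # listing: the vectorized great circle distance from one point to all
    # track points, modeled on John D. Cook's function. The Latitude and
    # Longitude column names are assumptions.
    latitudes = tracks['Latitude'].values
    longitudes = tracks['Longitude'].values
    degrees_to_radians = math.pi / 180.0
    phi1 = (90.0 - latitude) * degrees_to_radians
    phi2 = (90.0 - latitudes) * degrees_to_radians
    theta1 = longitude * degrees_to_radians
    theta2 = longitudes * degrees_to_radians
    cos = (np.sin(phi1) * np.sin(phi2) * np.cos(theta1 - theta2)
           + np.cos(phi1) * np.cos(phi2))
    # Clip guards against rounding errors that would take arccos out of range.
    arc = np.arccos(np.clip(cos, -1.0, 1.0))
    best = np.argmin(arc)
    return best, arc[best]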
That code runs in 2.3 seconds with the raam dataset and in 2.17 seconds with the random dataset.
That was the best I could do 6 months ago with the great circle distance. I was very happy because it is a 100x speedup compared to the naive code!
Can we do better now? We can, by switching from Pandas to Numpy, as recommended in the Pandas documentation on enhancing performance. Instead of calling the function on the data frame, as in
best, distance = closest_point(latitude, longitude, tracks)
we pass the arrays corresponding to the columns of the data frame:
track_latitudes = tracks['Latitude'].values
track_longitudes = tracks['Longitude'].values
The code of the closest distance function is modified accordingly:
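A sketch of the modified function, which now receives plain Numpy arrays instead of a data frame (names remain assumptions, and the original listing may differ):

def closest_point(latitude, longitude, latitudes, longitudes):
    # Same vectorized great circle computation, operating on raw Numpy arrays.
    degrees_to_radians = math.pi / 180.0
    phi1 = (90.0 - latitude) * degrees_to_radians
    phi2 = (90.0 - latitudes) * degrees_to_radians
    theta1 = longitude * degrees_to_radians
    theta2 = longitudes * degrees_to_radians
    cos = (np.sin(phi1) * np.sin(phi2) * np.cos(theta1 - theta2)
           + np.cos(phi1) * np.cos(phi2))
    arc = np.arccos(np.clip(cos, -1.0, 1.0))
    best = np.argmin(arc)
    return best, arc[best]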
We also rewrite the profiling function as follows:
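A sketch of the Numpy version of the profiling function; returning arrays instead of updating the data frame is an assumption on my part:

def closest_points(track_latitudes, track_longitudes,
                   test_latitudes, test_longitudes):
    # Loop over the way points, calling the array-based closest_point;
    # results are returned as two Numpy arrays.
    n = test_latitudes.shape[0]
    closest = np.empty(n, dtype=np.int64)
    distances = np.empty(n)
    for i in range(n):
        closest[i], distances[i] = closest_point(
            test_latitudes[i], test_longitudes[i],
            track_latitudes, track_longitudes)
    return closest, distances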
The code now runs in 806 milliseconds with the raam dataset, and 797 milliseconds with the random dataset. More than twice as fast!
Can we do better using the tools discussed in my previous post?
Using Numba yields another 25% speedup, with 678 milliseconds with the raam dataset, and 614 milliseconds with the random dataset.
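Applying Numba here can be as simple as wrapping the existing function with jit, which is equivalent to decorating its definition; the sketch below is one possible way to do it and not necessarily the exact code from the post:

from numba import jit

# Wrap the array-based function; without nopython=True, Numba falls back to
# object mode for anything it cannot compile, so this is safe to try as is.
closest_point_numba = jit(closest_point)

The first call triggers compilation, so timings should exclude that first run.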
I didn't stop there 6 months ago. I decided to use an approximation, replacing the great circle distance with the Manhattan distance (also known as the city block distance). That approximation was valid on the raam dataset I was using: it yielded the same closest points as the great circle distance. The approximation isn't valid for points near the poles, where degrees of latitude and longitude are no longer commensurate.
Using array operations, our closest distance function became:
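A sketch of that function, under the same naming assumptions as before (the exact original listing may differ):

def closest_point(latitude, longitude, tracks):
    # City block (L1) distance computed with vectorized array operations on
    # the Latitude/Longitude columns.
    latitudes = tracks['Latitude'].values
    longitudes = tracks['Longitude'].values
    distances = np.abs(latitudes - latitude) + np.abs(longitudes - longitude)
    best = np.argmin(distances)
    return best, distances[best]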
This code runs in 884 milliseconds with the raam dataset and in 809 milliseconds with the random dataset. It is about 3 times faster than using the great circle distance. That was the best I found 6 months ago.
As explained above, we can now speed up this code by using Numpy arrays instead of Pandas. The code becomes:
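Again as a sketch, with the coordinate arrays now extracted once by the caller and passed in directly:

def closest_point(latitude, longitude, latitudes, longitudes):
    # Same city block distance, operating on plain Numpy arrays.
    distances = np.abs(latitudes - latitude) + np.abs(longitudes - longitude)
    best = np.argmin(distances)
    return best, distances[best]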
Using the Numpy profiling function, the code now runs in 76.7 milliseconds with the raam data set and in 52.2 milliseconds with the random dataset. This is a 10 times speedup!
Revisiting the Naive Code
Can we speed up the naive code using the tools I discussed in my previous post?
Let's rewrite it using Numpy instead of Pandas, and compile it with Numba:
@jit
def closest_points(track_latitudes, track_longitudes, test_latitudes, test_longitudes):
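    # The body below is a reconstruction sketch: plain nested loops over
    # Numpy arrays, using the city block (L1) distance described above,
    # compiled by Numba.
    n = test_latitudes.shape[0]
    closest = np.empty(n, dtype=np.int64)
    distances = np.empty(n)
    for i in range(n):
        best = -1
        best_distance = np.inf
        for j in range(track_latitudes.shape[0]):
            distance = (abs(track_latitudes[j] - test_latitudes[i])
                        + abs(track_longitudes[j] - test_longitudes[i]))
            if distance < best_distance:
                best = j
                best_distance = distance
        closest[i] = best
        distances[i] = best_distance
    return closest, distances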
This code runs in 35.7 milliseconds with the raam dataset and in 30.4 milliseconds with the random dataset. We are now twice as fast as Numpy vectorized operations!
This is somewhat incredible. Let me say it again:
Python code compiled with Numba runs twice as fast as vectorized Numpy code, whose array operations are implemented in C.
Well, there may be faster C code around. Computing nearest neighbors has received a lot of attention since it became a key ingredient of clustering algorithms used in machine learning. Not surprisingly, the scientific Python stack offers several implementations. Let's use them on our datasets. I will specifically use the tree implementations available in scipy.spatial and sklearn.neighbors.
I haven't looked in detail at how these implementations differ. A general comparison between some of these tree implementations has been performed by Jake Vanderplas; see his excellent Benchmarking Nearest Neighbor Searches in Python post.
Let me evaluate these trees on my datasets.
They have very similar APIs, and using them is rather simple. We first create a tree with the track points:
import scipy.spatial
import sklearn.neighbors
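# The lines below are a reconstruction sketch of the tree construction; the
# exact set of trees and parameters used in the original post may differ.
track_points = tracks[['Latitude', 'Longitude']].values

scipy_kd_tree = scipy.spatial.KDTree(track_points)
scipy_ckd_tree = scipy.spatial.cKDTree(track_points)
sklearn_kd_tree = sklearn.neighbors.KDTree(track_points, metric='manhattan')
sklearn_ball_tree = sklearn.neighbors.BallTree(track_points, metric='manhattan')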
We must pass the data as one Numpy array that we construct from our dataframe. Note that for the last two trees we must specify that we are not using the default (Euclidean) distance; we use the Manhattan distance, or L1 norm, instead.
Then we query the resulting tree with each point of our test set. We must also say in the query that we are using the L1 norm. The profiling function becomes:
def test_tree(tree, test_points):
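    # Body reconstructed as a sketch. Build a single Numpy array of way point
    # coordinates, then query the tree for the nearest neighbor of every
    # point in one call.
    tests = np.array(test_points[['Latitude', 'Longitude']])
    # For the scipy trees the L1 norm is requested at query time with p=1;
    # for the sklearn trees the metric was fixed at construction time and the
    # call is simply tree.query(tests, k=1).
    distances, closest = tree.query(tests, p=1)
    return closest, distances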
Note that the query function accepts an array of point coordinates in a single call.
We run each tree implementation on both datasets.
The table below summarizes the running times we get in milliseconds. The construction time is significantly higher in the raam case for three of them. The query time is also significantly higher for one tree implementation.
The table also includes our code for comparison. We see that the running times are quite similar between the two datasets for the code I wrote. Note that these times are for computing the nearest neighbor for each of the way points. Average time for one way point is about 1/400 of the reported time.
When it is compiled with Numba, our native Manhattan distance code using loops over Numpy arrays is about 7,000 times faster than the code we started with 6 months ago. This is awesome, but there is something even more impressive: it is also competitive with the tree implementations. Indeed, these have a short query time, but the time to construct the tree largely offsets it. In particular, our code is way faster overall for the raam dataset.
I think it is still safe to say that Python is not C, but we can also say that Numba makes it close. This is true both in terms of running times and of coding style.
Note that the above comparison may not be representative of the performance of the code with other data sets. The performance of trees can vary with dataset size and parameter settings, as documented in Benchmarking Nearest Neighbor Searches in Python.
Update on December 16, 2015. I was contacted by Serge Guelton, one of the developers of Pythran, a Python compiler. I'll blog about it soon, as it yields additional speedups. I also keep getting comments on my Python performance posts asking why I am not using PyPy. I haven't used PyPy so far because it only supports part of Numpy, and not Pandas, Scipy, or Sklearn. It may be that PyPy supports enough of Numpy to run my closest distance code, though. If that's the case, then I'll blog about it soon.
Update on January 10, 2017. Ari Hartikainen tells me that cKDTree with the keyword argument 'balanced_tree' set to False has a much lower construction time.