Python Is Not C
JeanFrancoisPuget 2700028FGP Comments (12) Visits (104166)
Update on December 8, 2015. An updated version of this post is available at Python Is Not C: Take Two.
I have been using Python a lot lately, for various data science projects. Python is known for its ease of use. Someone with coding experience can use it effectively in few days.
This sounds good but there may be an issue if you program in Python as you would in another language, say C.
Here are some details.
As part of a recent project I had to deal with geospatial data. I was given a gps track with about 25,000 points, and I repeatedly needed to find the closest point on that track given a latitude and a longitude. My first reaction has been to look for a code snippet that would compute the distance between two points given their latitudes and longitudes. It so happens that John D. Cook made such code
I was all set! All I needed to do was to write a little Python function that returns the index of the point closest to the input coordinates:
The above function is similar to what I would have written in C. It iterates over the trkpts array, and it keeps the index of the closest point so far in the local variable best.
So far so good, Python syntax is very different from C, but at least I could write the code I needed very quickly.
The code was quick to write, but it is rather slow to execute. For instance I was also given 428 waypoints with names. I wanted to find for each of the waypoints the closest point on the track. Running the search for the closest point for all the 428 waypoints takes 3 minutes 6 seconds on my laptop.
I then looked for an approximation called the Manhattan distance. Rather than computing the exact distance between points, I compute the distance along the East-West axis plus the distance along the South-North axis. Here is the function that computes Manhattan distance:
Actually, I used an even simpler function, ignoring the fact that 1 degree difference in latitude yields a larger distance than 1 degree difference in longitude. Here is my simple function:
Is it safe to use this approximation? It may not be safe in general, but in my case it is safe. Here is why. All my points are in mainland USA. Input latitude and longitude are for points on the same road as the gps track, hence they should never be far from at least one the points in the gps track.
The approximation would not be reasonable in several circumstances. First, near the poles, because a large longitude difference does not translate into a large distance. Second, when the longitudes are more than 180° apart. In that case we should take the complement to 360° of the difference. Third if the input latitude and longitude are far from all points in the gps track. None of these cases occur in my project.
Our closest function becomes:
We can speed it a bit via inlining the Manhattan distance function:
Using that function yields the same closest points than using John's function. I was hoping for that, and was pleased to see my intuition was right. This simpler function is also faster. Finding the closest points for all my waypoints now takes 2minutes 37 seconds on my laptop. It is a nice 18% speedup. Nice, but nothing dramatic.
I then decided to use Python correctly. In this case it means leveraging the array operations that pandas support. These arrays operations come from the numpy package. Using these array operations we can get the closest point in a very concise code:
This function returns the same closests points as before for all my waypoints. Running it for all the waypoints takes 0.5 second on my laptop. This is a 300x speedup over the above code! 300x, i.e. 30,000 %. This is dramatic. The reason for that speedup is that numpy array operations are written in C. Hence, we get the best of both world: we get speed comparable to C, with the conciseness of Python.
Without this speedup I would have been forced to use a more complex data structure such as a quadtree implemented in SciP
The lesson is clear: do not write Python code as you would do in C. Use numpy array operations rather than iterate on arrays. For me it meant a mental shift.
Update on July 2, 2015. That post is discussed on Hacker News. Some of the comments missed the fact that I am using pandas data frames. I am using them because they are useful for data analysis. If my problem was just to quickly find the closest point, and if I had enough time, then I would code some quad trees in C or C++.
Several readers also made a comment that numba would provide substantial speedup as well. I did give it a try.
Here is my, non conclusive, experience. First, let me say that experiment with another Python install may yield different result. I am using Anaconda on a Windows machine, and I have installed quite a few packages. Maybe some of these packages interfere with numba.
First thing I did was to follow the instructions for installing numba:
$ conda install numba
Here is what I got:
Then I discovered that numba was already installed in my anaconda distribution. Maybe someone working on numba should update the install instructions.
I then used numba as recommended:
I did not see running time improvement. I also tried the more aggressive compilation setting:
This time I got an error when running the code:
Seems that pandas code is a bit too smart for numba to compile.
Sure, I could spend time changing my data structure until numba can compile it. But why would I want to do that? I got a code that runs fast enough using numpy. I was already using numpy via pandas anyway. Why not leverage it?
There were also suggestions to use pypy. This could make sense but... I had to use Jupyter notebooks on a hosted server. I have to use the kernels they provide, i.e. a regular Python 2.7.x kernel. Pypy was not an option.
Update on July 3. A comment on HackerNews made me code the numpy equivalent of John D. Cook's function. I did it and ran it with my 428 waypoints. It runs in 2 seconds on my laptop. This is much faster than looping over the array indeed. However, it is 4 times slower than using my approximation. It also creates 12 intermediate arrays, compared 5 with my simplified function, resulting in more memory used.
The code for that function is given below. I stayed as close as possible to John's code.
import numpy as np