Is Python Slow As Molasses?
JeanFrancoisPuget
Python is a popular language for machine learning. It is even the most popular one according to a study of mine recently published here and on KDnuggets. That study generated quite a few reactions on social media. One that drew my attention reads:
I just recently switched to Scala. Somewhat similar to python but with a number of advanced concepts. It's definitely more complex to learn than Java, but from a performance perspective much faster. Although considering that Python is slow as molasses yet is leading the pack, it's a struggle.
The last sentence captures a sentiment I have seen many times: Python has a reputation for being very slow, which may prevent some from considering it for tasks such as machine learning.
The first answer to that is simple: Python is slow when you use just Python. But when you use Python for machine learning you mostly use packages written in compiled languages (C, C++, Cython, even Fortran) and you get good performance.
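To make the point concrete, here is a minimal sketch that computes the same sum of squares twice: once with an interpreted Python loop, and once by handing the work to NumPy, whose core routines are compiled. The function names and the data size are illustrative assumptions, not something taken from a particular package.

```python
import timeit

import numpy as np


def sum_of_squares_pure(values):
    """Sum of squares using an interpreted Python loop."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_of_squares_numpy(array):
    """Same computation, delegated to NumPy's compiled routines."""
    return float(np.dot(array, array))


# Hypothetical test data: 100,000 floats, as an array and as a plain list.
data = np.arange(100_000, dtype=np.float64)
data_list = data.tolist()

pure_result = sum_of_squares_pure(data_list)
numpy_result = sum_of_squares_numpy(data)

# Timings depend on the machine, so none are quoted here.
pure_time = timeit.timeit(lambda: sum_of_squares_pure(data_list), number=5)
numpy_time = timeit.timeit(lambda: sum_of_squares_numpy(data), number=5)
print(f"pure Python: {pure_time:.4f}s, NumPy: {numpy_time:.4f}s")
```

On most machines the NumPy version should be much faster, but the exact ratio depends on hardware and library versions.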
Cython deserves a special mention as it may be less familiar than the others. It is a language obtained from Python by adding optional static type declarations to Python variables. The result is translated to C and can be compiled rather efficiently. Cython is heavily used in the popular scikit-learn machine learning package, for instance.
Lots of prominent machine learning packages are written in C++ and can be used via a Python API, e.g. TensorFlow, XGBoost, MXNet, Caffe, etc. There are many more, and I apologize to all the ones I do not explicitly list here.
All right, we can import efficient packages into Python. One could argue that you still need to write some glue code in Python. This glue code could run too slowly. The answer to that second concern is to look at the many ways one can make Python code run faster, including:
Jake VanderPlas gives a compelling example of how the combination of Numba and NumPy can lead to code that runs almost as fast as Fortran in Optimizing Python in the Real World: NumPy, Numba, and the NUFFT.
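As a rough illustration of the Numba and NumPy combination (this is not the NUFFT code from that post, just a small sketch), the function below is a plain Python triple loop over a NumPy array that Numba's `njit` decorator can compile to machine code. The function name and the sample data are hypothetical, and the fallback decorator is an assumption added so that the sketch still runs, without any speed-up, when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption: without Numba, fall back to a no-op decorator so the
    # sketch still runs as ordinary (slow) Python.
    def njit(func):
        return func


@njit
def pairwise_distance_sum(points):
    """Sum of Euclidean distances between all pairs of rows in `points`."""
    n, dim = points.shape
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            sq = 0.0
            for k in range(dim):
                diff = points[i, k] - points[j, k]
                sq += diff * diff
            total += np.sqrt(sq)
    return total


# Three collinear points: pairwise distances are 5, 10 and 5.
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
result = pairwise_distance_sum(points)
print(result)
```

The first call includes compilation time when Numba is present; the benefit shows up on later calls and on larger arrays.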
I also blogged on how to use the above techniques, and additional ones, to get efficient Python code, see for instance:
This should clear up any doubt: it is possible to write efficient Python code.
If Python code can run quickly, then why isn't Python the answer to everything? Python is great, but it has some limits. The most important one is that the standard Python interpreter (CPython) can only execute Python code in one thread at a time, because of the global interpreter lock (GIL). This means that in order to use several cores, Python code needs multiprocessing rather than multithreading. That, in turn, often means duplicating data sets in memory, one copy per process, which may prevent dealing with large data sets.
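The difference between the two approaches can be sketched with the standard `concurrent.futures` module. The code below runs the same CPU-bound function through a thread pool and through a process pool; `count_primes`, `run_with` and the workload size are hypothetical names and values chosen for illustration. On an interpreter with a GIL, one would expect the thread version to show little or no speed-up, while the process version can use several cores at the cost of sending a copy of the arguments to each worker.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_primes(limit):
    """Count primes below `limit` by trial division (deliberately CPU-bound)."""
    count = 0
    for n in range(2, limit):
        is_prime = True
        d = 2
        while d * d <= n:
            if n % d == 0:
                is_prime = False
                break
            d += 1
        if is_prime:
            count += 1
    return count


def run_with(executor_cls, limits, workers=4):
    """Map count_primes over `limits`; return (results, elapsed seconds)."""
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        results = list(pool.map(count_primes, limits))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    limits = [50_000] * 4  # hypothetical workload: four identical tasks
    thread_results, thread_time = run_with(ThreadPoolExecutor, limits)
    process_results, process_time = run_with(ProcessPoolExecutor, limits)
    assert thread_results == process_results
    print(f"threads: {thread_time:.2f}s, processes: {process_time:.2f}s")
```

Here the arguments are just integers, so copying them is cheap. With a large data set as the argument, each worker process would receive its own copy, which is the memory cost described above.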
A number of projects aim at removing this fundamental issue. PyPy STM is a Python interpreter that has no GIL. Unfortunately, it does not support all of Python yet, and as far as I know, none of the major machine learning packages works with it yet. The other interesting avenue is to distribute computation. Let me point to two projects worth looking at: Dask and Spark ML. Again, there are more, and I apologize to those I don't explicitly list here.
The bottom line is that using Python for machine learning yields good performance when you deal with moderate-size data sets.