JeanFrancoisPuget
I am not the only one saying Python has the lead. Here is a first fact supporting this. It is the job
These job trends are for: Python and ("data science" or "big data" or "statistical analysis" or "data mining" or "machine learning"), Scala and ("data science" or "big data" or "statistical analysis" or "data mining" or "machine learning"), R and ("data science" or "big data" or "statistical analysis" or "data mining" or "machine learning").
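For readers who want to rerun this comparison with other languages, here is a minimal sketch of how the three queries are built from one shared list of topics. The names `TOPICS`, `LANGUAGES`, and `build_query` are mine, for illustration only; the code only assembles the query text quoted above and does not call any job site.

```python
# Topic phrases copied from the job trends caption above.
TOPICS = [
    "data science",
    "big data",
    "statistical analysis",
    "data mining",
    "machine learning",
]

# The three languages compared in this post.
LANGUAGES = ["Python", "Scala", "R"]


def build_query(language, topics=TOPICS):
    """Return the query text: <language> and ("topic 1" or "topic 2" ...).

    build_query is a hypothetical helper, not part of any job site API.
    """
    quoted_topics = " or ".join(f'"{topic}"' for topic in topics)
    return f"{language} and ({quoted_topics})"


# One query per language, in the same order as in the caption.
queries = {language: build_query(language) for language in LANGUAGES}

for query in queries.values():
    print(query)
```

Keeping the topic list in one place guarantees that every language is matched against exactly the same set of phrases, which is what makes the comparison fair.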
I selected R, Python, and Scala for this comparison because they are the most popular open source languages for data science. R has long been the dominant open source language for statisticians, and by extension, for data science. But we see that Python has been taking over for the past couple of years. Scala is a recent contender because of its link to Spark and Spark ML, but it is still a fairly distant follower.
What about commercial software? I do think that SPSS Modeler, for instance, is here to stay as well. But its target is a bit different from that of R, Python, or Scala. Indeed, SPSS Modeler is point-and-click software aimed at non-programmers. With SPSS Modeler one draws the machine learning pipeline, whereas one programs it in Python, R, or Scala. It is because of this difference that I did not include SPSS Modeler in the comparison, as it would be comparing apples to oranges.
Back to open source, here are other signs of Python's popularity. The table below includes the number of questions on Stack Overflow, the number of packages in the main package repository for the language, and the prog
These measure the strength and popularity of the ecosystems built around these languages. Indeed, when comparing languages, one should not just do a feature by feature comparison, or efficiency benchmarks. Having a vibrant community that can help newcomers, and that can further advance the language, is key.
There are probably additional ways to evaluate the importance of an ecosystem, and I welcome suggestions.
We can also get facts about the main data science IDE for the lang
We see that here too IPython/Jupyter has the lead, but that RStudio is quite popular too. Zeppelin, being very recent, is not very popular yet, but it is actively being developed. Popularity differences show on Google Trends too:
There we see that the renaming of IPython to Jupyter isn't fully recognized, as IPython's search popularity is still ahead of Jupyter's even if it is stalling. We also see that RStudio is very popular too. And we see Zeppelin only starting.
The above facts support the view that Python is the leading open source language for data scientists. They also support the view that R is a bit less popular, but growing too.
Enough facts, let's move on to opinions to make our final decision between R and Python for data science. You can stop reading here if you have strong opinions on Python, R, or Scala, as you may disagree with me.
You've been warned, so here it is. I am clearly biased towards delivering commercial software that uses data science. For this use case, Python is a better fit than R. In a nutshell, Python becomes way better than R when it comes to turning data science into production at scale. R may still be better in the exploration phase of a data science project.
Let me hint at some reasons why I believe the above is true. Each of these is probably worth a post on its own, but let me list them here as a starting point.
I don't think Python is better than R for data analysis per se. In that respect, R is more comprehensive, as nearly all the statistical techniques you can think of exist in R. And lots of machine learning too. The Python ecosystem is quickly catching up with its scientific stack and packages such as sklearn, pandas, statsmodels, matplotlib, seaborn, etc. (I can't cite them all), but it is not as comprehensive as the R ecosystem yet.
So, why Python and not R? Because Python can be used beyond data analysis. You can build web sites in Python, you can connect to almost any data source, you can leverage an incredible number of systems and tools as many of them expose a Python API, you can visualize results, you can implement arbitrary algorithms, you can comp
Another reason for me to select Python is that R comes with a GPL license. As I read it, this license forces you to open source any software you ship that includes R. Therefore, using R is restricted either to those who do not care about shipping software, or to those building open source software. The R license is just not a fit for those developing commercial software like me.
I hope I answered the initial question correctly: why Python? The answer in general is all about the ecosystem and the breadth of what it covers. And when it comes to commercial software development, the R license is the straw that broke the camel's back.
This does not mean that Python should be the only language we use when building solutions. But we certainly leverage Python more than other programming languages for the data science pieces.