Data Science Is Hard: A Look At The Sochi Olympics
JeanFrancoisPuget
Data Science is hard. I'll use an example that generated lots of buzz to show some of the issues with data science. Two brothers, Dan and Tim Graettinger, who work for Discovery Corps, Inc., devised a predictive model that predicts the medal count per country for the Sochi Olympics. The Graettinger brothers' model was discussed on most data science and analytics sites, in OR blogs (see Laura McLay's entry), and even beyond. The question is: did they predict the medal count correctly?
Before answering that question, let me get the inevitable discussion about what Data Science is out of the way. I'll let other, more qualified people answer that. For instance, I like this definition from DawenP, as it speaks to an Operations Research person like me. For the sake of this article, let us say we are discussing how to analyze data in order to find actionable insights. Actionable insights can be trends, in which case they can be used to "predict" a probable future. Note that before "data science", people were using other terms, ranging from statistics to data mining, via machine learning.
Then they used a standard technique known as linear regression to find which set of features was best for predicting medal count. I was reading their blog post with great interest until I saw the most meaningful features found by the linear regression algorithm:
I was frankly puzzled. None of them had anything to do with sport. I would have expected more specific features, such as the number of people practicing winter sports, or the number of facilities (e.g. ski resorts), or even the number of participants. Nope.
I guess this had to do with the difficulty of getting data. Indeed, in their post, the brothers explain how difficult it was to gather enough data to perform an interesting analysis. Anyway, they must have been puzzled too, given how they comment on their findings:
Let's now look at their prediction.
Image credit: Dan and Tim Graettinger (link)
How does this compare to the actual results?
Not that good.
Even if we forget the actual count and look at ranking, the predicted results are really different from the actual results.
What does it tell us?
First, that prediction is hard. Despite all the hype we're hearing, and despite great successes (e.g. Nate Silver's prediction of US election results), prediction is hard. Note that prediction seems easier and more accurate when it is about a binary outcome (a state votes Democrat or Republican, team A wins, etc.). Predicting a wider range of numbers, such as medal counts, is harder.
Second, that prediction is only as good as the data allows. Whatever algorithm or method you use, you can only find what is in the data. If you do not have relevant indicators, such as the ones I mentioned above, it is no wonder your predictions are poor. As I said, I can't blame the Graettinger brothers here, as collecting the relevant data is often a daunting task. It may well be that they did quite a good job given the data they had. It may well be that their model is the most accurate one could build with this data set.
Third, and this is where I might be a bit critical of their work, any predictive model must be tested. The standard way of performing such a test is to not use all the data you have when you create the model. A typical approach is to set aside 20% of the data for testing and use the remaining 80% to create the model. When you think that your model is final, you test it on the held-out 20%. Then you have an estimate of how likely your model is to provide good predictions. There are various ways to perform this test, see for instance 3 Wa
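To make the 80/20 idea concrete, here is a minimal sketch in Python. It is not the Graettinger brothers' model and it does not use their data: the data is synthetic and invented for illustration, and the helper names (`train_test_split`, `fit_line`, `mean_absolute_error`) are hypothetical ones I picked for this example. It fits a one-feature linear regression on 80% of the rows, then measures the error on the 20% the model never saw.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(rows):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(rows, slope, intercept):
    """Average absolute gap between actual and predicted y."""
    return sum(abs(y - (slope * x + intercept)) for x, y in rows) / len(rows)


# Synthetic data (an assumption for this sketch): y = 2x + 1 plus Gaussian noise.
rng = random.Random(42)
data = [(x, 2.0 * x + 1.0 + rng.gauss(0, 3)) for x in range(50)]

# Build the model on 80% of the data only.
train, test = train_test_split(data, test_fraction=0.2)
slope, intercept = fit_line(train)

# The error on the held-out 20% is the honest estimate of accuracy.
print("train MAE:", mean_absolute_error(train, slope, intercept))
print("test MAE: ", mean_absolute_error(test, slope, intercept))
```

If the test error is much worse than the training error, the model has probably learned quirks of the training data rather than a real trend, and its predictions should be treated with caution.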
The bottom line is that predictive analytics is difficult, and that one should assess model accuracy before drawing any conclusion.
Let me conclude with kudos to Dan and Tim Graettinger. They dared to publish their prediction, exposing themselves to criticism like mine. If there weren't people like them, we would probably miss great actionable insights from time to time.