Start With A Question
JeanFrancoisPuget
The view that storing large amounts of data is enough to get insights out of it is losing ground, fortunately. It is well known that data gathering should have a purpose. See for instance this citation from 1942, shar
A more recent way of saying it is from Seth Godin:
Analytics without action
Don't measure anything unless the data helps you make a better decision or change your actions.
If you're not prepared to change your diet or your workouts, don't get on the scale.
What does it say about data science? It says that any data science project should start with a purpose in mind, for instance some actions to be taken. Usually, this would come as a question asked by some stakeholder. Examples include: Which customers should I target with my marketing campaign? Which products should I recommend to which customers? What maintenance operations should I perform first? How should I replenish my inventory to best meet future demand?
I thought this was common knowledge in data science circles. Yet I was extremely surprised when I read Nir Kaldero's Why
They hypothesized that, if they all received the same dataset, worked on it, and came back together, then they would find they all independently used the same techniques. So, they got a very large dataset and shared it between them.
The machine learner used the whole dataset and built a complex predictive model. The statistician took a 1% sample of the dataset, discarded the rest, and showed that the data met certain assumptions.
The mathematician, believe it or not, didn’t even look at the dataset. Rather, he proved the characteristics of various formulas that could (in theory) be applied to the data.
Nir's take was that the mathematician, statistician, and machine learner solved the same problem using different techniques.
I do not think this is a correct account of what happened in the experiment. In reality, the mathematician, the statistician, and the machine learner solved totally unrelated problems.
Why is that?
It is because the experiment did not include a definition of the problem to be solved! The dataset itself was not a problem definition. Each participant in the experiment selected a problem in his own comfort zone, one that could be tackled with the techniques he was familiar with.
I bet that they would have come up with similar results if they had agreed upon a question to answer before working on the dataset separately.