Data Science Automation
Jean-François Puget
Will data scientists disappear soon? I am asking the question because I see more and more articles about why data scientists may be a parenthesis in history. The latest I read is Will
Does this mean that data scientists are useless now?
I don't think so.
First of all, there are many flavors of data science, and Watson Analytics automates one of them. More precisely, it automates what I would label as data mining. It implements a simple workflow:
Note that simplicity is a plus here, not a defect. Making data science simpler is what makes it usable by non-specialists. Note also that the above is an oversimplification of Watson Analytics, and may not represent it faithfully. I encourage readers to give it a try. I bet you'll be positively surprised by the natural language interface and the great visualization capabilities. Anyway, for the sake of this discussion, I think we can abstract it to the above workflow.
Another flavor of data science is machine learning. I discussed at length how machine learning goes beyond data analysis in Machine Learning Algorithm != Learning Machine. Here is the workflow I derived:
Watson Analytics could be seen as a way to automate some of it, namely the "train" step. There are other efforts aiming at automating this step in the context of machine learning, including, but not limited to: IBM
These three tasks (feature engineering, algorithm selection, and parameter tuning) can be automated in principle because they are essentially trial-and-error processes. For each of these tasks, there are a number of things a data scientist can use. Selecting among them often amounts to just trying them on some training data, and seeing what quality of prediction they give on some validation data.
The difficulty comes from the combinatorial nature of the task. If you have 10 different ways to do feature engineering, 10 different algorithms to consider, and 4 parameters with 10 possible values each for each algorithm, then you end up with 1 million possibilities to try. Each possibility is one feature engineering algorithm followed by one machine learning algorithm with one value for each of its 4 parameters. This combination defines one machine learning pipeline.
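To make the combinatorics concrete, the count from the paragraph above can be checked directly (the specific numbers mirror the hypothetical example in the text, not any particular toolkit):

```python
# Hypothetical search space matching the numbers in the text
n_feature_engineering = 10   # ways to do feature engineering
n_algorithms = 10            # candidate learning algorithms
n_params = 4                 # tunable parameters per algorithm
n_values = 10                # possible values per parameter

# Each pipeline = one FE method + one algorithm + one value per parameter
n_pipelines = n_feature_engineering * n_algorithms * n_values ** n_params
print(n_pipelines)  # 1000000
```

Note how the parameter values dominate: the 10,000 parameter combinations per algorithm multiply every feature engineering and algorithm choice.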
The goal is to find the best possible pipeline among the million possible pipelines. A brute force approach isn't doable by hand, and is hardly doable at all even in an automated way. But one could see this as an optimization problem, and try to be clever about how to explore this large space of possible machine learning pipelines. This is basically what the efforts listed above try to do.
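One of the simplest ways to treat pipeline selection as an optimization problem is random search: sample configurations and keep the best one seen so far. Here is a minimal sketch; the `validation_score` function is a made-up stand-in for "train the pipeline, then score it on validation data", and the search space dimensions are the hypothetical ones from the text:

```python
import random

random.seed(0)

# Hypothetical search space: one FE method, one algorithm, 4 parameters
search_space = {
    "fe":   list(range(10)),
    "algo": list(range(10)),
    "p1":   list(range(10)),
    "p2":   list(range(10)),
    "p3":   list(range(10)),
    "p4":   list(range(10)),
}

def validation_score(pipeline):
    """Stand-in for: build the pipeline, train it on training data,
    and return its prediction quality on held-out validation data."""
    # Toy objective so the sketch runs; a real score comes from the model.
    return -sum((v - 5) ** 2 for v in pipeline.values())

def random_search(space, n_iter=200):
    best, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Sample one candidate pipeline configuration at random
        candidate = {k: random.choice(v) for k, v in space.items()}
        score = validation_score(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

best_pipeline, best_score = random_search(search_space)
print(best_pipeline, best_score)
```

Real systems replace the random sampling with smarter strategies (Bayesian optimization, evolutionary search), but the structure — propose a pipeline, evaluate it on validation data, keep the best — is the same.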
So, aren't we on the verge of machine learning automation? I'd say yes and no. Yes, lots of the tedious part of machine learning is going to be automated. Selecting the right learning rate will no longer be a key human skill. But I do think there is room for smart data scientists. Feature engineering can be quite tricky and I doubt all of it can be automated.
Interpreting the quality of a machine learning algorithm's output is also a challenge: it can't be summarized by a single number. We said above that selecting the best machine learning pipeline is an optimization problem. It is, more precisely, a multi-objective optimization problem. Indeed, there are many ways to quantify the quality of the predictions made by a machine learning pipeline. Common evaluation metrics include F1 score, area under the ROC curve, recall, precision, accuracy, etc. The fact that there are many evaluation metrics simply indicates that human judgement is key, and will remain key.
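To illustrate why a single number can't settle the matter, here are precision, recall, and F1 computed from the confusion counts of two hypothetical classifiers evaluated on the same data: one wins on precision, the other on recall, and which is "best" depends on the business context:

```python
def metrics(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Two hypothetical pipelines on the same validation set (made-up counts)
a = metrics(tp=80, fp=10, fn=40)   # cautious model: high precision, lower recall
b = metrics(tp=110, fp=50, fn=10)  # eager model: high recall, lower precision

print(a)  # precision ~0.889, recall ~0.667
print(b)  # precision ~0.688, recall ~0.917
```

Neither pipeline dominates the other on both objectives, so an automated optimizer can at best return a set of trade-offs; a human still has to pick, say, fewer false alarms versus fewer missed cases.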
From the above, my advice to data scientists is to prepare for a shift in their practice: if their value comes from their skill at selecting machine learning algorithm parameters, then they should worry. If their skills are about mapping a business problem into a machine learning problem, and driving the machine learning workflow through to a deployed application, then they are on the safe side.