Operational Data Science: Part 2
JeanFrancoisPuget
Data science projects are often depicted as following a circular methodology, suggesting that the process repeats itself forever. We provided two examples in Operational Data Science: Part 1.
Truth is that lots of data science projects run the process only once. They start with a business question and stop once an answer has been delivered.
There are many great examples of this linear process. Lucius Riccio's Pothole Analytics is a good one. Riccio did a great job of answering a business question, namely "how to better repair potholes on NYC streets?" The result is a report that the city mayor's office can use to make better decisions. I picked this one, but there are many others available online. Randy Olson's blog is a good source of examples.
Experimental science research relies on a process very similar to the one above. It starts with a scientific question, e.g. "what is the mass of the Higgs boson, if it exists?" Scientists design an experiment, collect the data it produces, then follow the same data science process until they find something worth publishing:
Two recent major discoveries in physics exemplify this process. Both the discovery of the Higgs boson and the discovery of gravitational waves followed the same pattern. The data science process for the latter can be replayed by any reader thanks to a publicly available web tutorial.
Often overlooked is a second type of data science project: one that deals with a business process. Let me take an example for the sake of clarity. Assume you are running a payment system for credit card transactions. Your process is to execute transactions, i.e. debit and credit the right banking accounts for each incoming transaction. A question you will quickly ask, if you haven't already, is how to detect fraudulent transactions. Indeed, you'd like to refuse these and only process legitimate transactions. Therefore, you will ask your favorite data science team (in house or third party, depending on how mature your organization is) how to detect fraudulent transactions. If the data science team is any good, they will come back at some point with a description of fraudulent transaction patterns. You can then contract your favorite IT team to implement a fraud detection system based on the patterns the data science team found. The IT team then deploys what they have coded into your business process.
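To make the hand-off concrete, here is a minimal sketch of what the IT team might code from the data science team's findings. The field names, thresholds, and country codes are invented for illustration; real fraud patterns would come from the data science team's report.

```python
# Hypothetical sketch: fraud patterns reported by a data science team,
# hard-coded as rules inside the payment system. All values are invented.

def is_fraudulent(txn):
    """Return True if the transaction matches a known fraud pattern."""
    # Pattern 1: unusually large amount from a high-risk country.
    if txn["amount"] > 5000 and txn["country"] in {"XX", "YY"}:
        return True
    # Pattern 2: many transactions from the same card within an hour.
    if txn["txn_count_last_hour"] > 10:
        return True
    return False

def process(txn):
    """The business decision: accept or reject the transaction."""
    return "reject" if is_fraudulent(txn) else "accept"
```

Note that the patterns live in application code here, which is exactly why updating them later requires another pass through the IT team.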
It can be displayed as follows:
You're happy because your system rejects most frauds now.
More generally, you have an existing business process that consumes input data (transactions in our example), and that outputs some business decision (accept or reject a transaction in our example). The data science process starts with a question about how to improve that process, and it outputs insights that are given to the IT team. The IT team codes something based on these insights, and deploys it in the business process.
This process looks straightforward, but there is a major gap where the data science team hands their findings to the development team. This gap is caused by different teams using different tools. Data scientists use tools geared towards their needs for data exploration, data wrangling, model fitting, and model evaluation. They may use proprietary tools such as SPSS, SAS, or Matlab, or open source tools such as R and Python. The IT team responsible for developing, maintaining, and operating a business application will use enterprise application development tools such as Java. This means that predictive models developed with data science tools need to be re-implemented in, say, Java. This re-implementation is error prone. It is also a major obstacle to quickly deploying new insights into existing business processes.
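The following sketch illustrates why hand re-implementation is error prone. Suppose the data science team fit a logistic regression and reported its coefficients; the IT team must transcribe every constant by hand into the production code. The coefficients and feature names below are invented for illustration.

```python
import math

# Coefficients as transcribed from the data science team's report
# (illustrative values). In practice the model was fit in R, Python,
# SPSS, etc., and these numbers are copied by hand into Java or similar.
COEFS = {"amount": 0.0004, "txn_count_last_hour": 0.35}
INTERCEPT = -4.2

def fraud_probability(txn):
    """Hand-translated logistic regression score. A single transcription
    error in the constants above silently changes every decision the
    production system makes -- and nothing crashes to warn you."""
    z = INTERCEPT + sum(coef * txn[name] for name, coef in COEFS.items())
    return 1.0 / (1.0 + math.exp(-z))
```

The model logic itself is trivial; the risk is entirely in the manual copy step, repeated at every model update.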
The need for seamless deployment of predictive models has been recognized by the industry. We'll expand on that in a subsequent post, but let's just say for now that there are ways to deploy insights from a data science team without having to re-implement them. In that case, the last step of the data science process is a seamless deployment to the business process:
This looks like the holy grail for operational data science, doesn't it?
In fact it isn't. There is another shortcoming worth discussing. Let's reuse our fraud detection example for the sake of clarity. Assume your data science team came up with great insights about fraud patterns. Assume further that your IT team seamlessly deployed these patterns into a fraud detection system. As a result, your rate of fraudulent transactions drops dramatically. You're happy.
Guess what? This won't last.
Fraudsters will soon understand that their scam is exposed. They will start looking for alternative ways of using stolen credit card information until they find some that work again. Your fraud detection system's accuracy will start to decrease until it is no longer effective enough. You therefore need to monitor the effectiveness of your process, and restart the data science process when accuracy isn't good enough.
What is true for credit card fraud detection is true for most domains. The environment keeps changing: customer behavior evolves, competitors go after your business in new ways, new technology may disrupt your business, etc. As a result, your business process needs to adapt to these changing conditions. This is where the data science process loops again. Once insights are deployed, the execution of the business process needs to be monitored. When its effectiveness isn't good enough, a new cycle starts with an analysis of what is wrong with the business process, followed by a full data science process.
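The monitoring step described above can be sketched as a rolling accuracy check that flags when a new data science cycle is due. The window size and threshold below are invented for illustration; in practice they would be tuned to the business.

```python
from collections import deque

class EffectivenessMonitor:
    """Track recent detection outcomes and flag when accuracy drops
    below an acceptable level. Window and threshold are illustrative."""
    def __init__(self, window=1000, threshold=0.90):
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = miss
        self.threshold = threshold

    def record(self, correct):
        self.outcomes.append(1 if correct else 0)

    def needs_retraining(self):
        if not self.outcomes:
            return False
        accuracy = sum(self.outcomes) / len(self.outcomes)
        return accuracy < self.threshold

monitor = EffectivenessMonitor(window=100, threshold=0.90)
for _ in range(95):
    monitor.record(True)
for _ in range(15):
    monitor.record(False)   # fraudsters adapt; accuracy slips
print(monitor.needs_retraining())  # True -- time to restart the cycle
```

When the flag trips, the loop closes: analyze what changed, then run the full data science process again.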
We will dive deeper into how to implement this integrated view of data science and business processes in subsequent posts in this series.