Optimization Is Ready For Big Data: Part 4, Veracity
The promise of Big Data is to enable better decisions based on data. The idea is appealing, yet there is a caveat: is the data reliable enough to base decisions on? To what extent can we trust data? My experience shows that cleaning data can take up to 80% of an analytics project. This is well known, and it is often called the veracity dimension of Big Data. The point is that most data in the Big Data era is uncertain; see for instance the figure below, taken from a post. The red curve shows the proportion of data whose veracity is unknown. By the end of 2015 we should be at 80% uncertain data!

Uncertainty about the veracity of data can come from various sources: for instance, measurement error in the case of sensors, or lack of credentials in the case of social media (should you trust all tweets about a given company?). We could discuss how to cope with data veracity in general, but I will focus on answering this question: can optimization be used with uncertain data? This question is in line with my previous posts discussing whether optimization can be used with large data volumes, data in motion, and data of all kinds. My answer was that optimization can be used provided raw data is preprocessed by some predictive analytics. The general pattern is the following (see my previous posts for details): predictive analytics compresses large data volumes, abstracts data variety, and aggregates data in motion. Predictive analytics outputs models that capture the essence of the business process the data is about. One example model we reviewed was price elasticity, which captures customer purchasing behavior; another was traffic prediction, which captures driver behavior. These predictive models are then used as input to one or more optimization models.

So far so good: we have found a quite generic pattern for optimization and Big Data. The issue is that we still have to face the data veracity caveat, for two reasons.
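The predict-then-optimize pattern can be made concrete with a toy price-elasticity example. This is a minimal sketch with purely illustrative numbers (the observations, the linear demand model, and the revenue objective are all assumptions, not the actual models discussed in my previous posts): we fit demand as a linear function of price, then choose the price that maximizes predicted revenue.

```python
# Minimal sketch of the predict-then-optimize pattern (hypothetical data):
# fit a linear demand model from raw sales observations, then pick the
# price that maximizes predicted revenue.

# Raw observations: (price, units sold) -- illustrative numbers only
observations = [(8.0, 120.0), (10.0, 100.0), (12.0, 78.0), (14.0, 62.0)]

# Step 1: predictive analytics -- least-squares fit of demand = a - b * price
n = len(observations)
sum_p = sum(p for p, _ in observations)
sum_d = sum(d for _, d in observations)
sum_pp = sum(p * p for p, _ in observations)
sum_pd = sum(p * d for p, d in observations)
b = -(n * sum_pd - sum_p * sum_d) / (n * sum_pp - sum_p ** 2)  # slope magnitude
a = (sum_d + b * sum_p) / n                                    # intercept

# Step 2: optimization -- revenue(price) = price * (a - b * price)
# is a concave parabola, maximized at price = a / (2 * b)
best_price = a / (2 * b)

print(f"demand model: {a:.1f} - {b:.1f} * price")
print(f"revenue-maximizing price: {best_price:.2f}")
```

Note that step 2 treats the fitted model as if it were exact -- precisely the habit the veracity discussion below calls into question.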
First, the raw data from which the predictive models are built may not be reliable. Second, the data coming out of a predictive model is uncertain. As Niels Bohr supposedly said: "prediction is very difficult, especially about the future". Yet many optimization practitioners (myself included, I must admit) will simply pick the most likely predicted value (e.g. the mean) as if it were certain. In our defense, many machine learning techniques focus on fitting a model without providing any estimate of its accuracy. Statistical methods are a bit better in that respect, as they provide ways to assess how reliable a statistical model is (confidence intervals, standard deviations, etc.). Yet, as said above, practitioners often ignore this information and merely use the most likely values.

If we want to do a better job, then we need to revise our nice-looking pattern. We have to take into account that input data can be uncertain, either because it is predicted, or because it is part of the 80% of uncertain raw data. Can optimization still be used when input data is uncertain? The good news is that optimization researchers did not wait for Big Data to look at this question. Many methods and algorithms have been proposed under the names of robust optimization and stochastic optimization. Moreover, some modeling language editors, including us, provide ways to automatically transform optimization models that assume certain data into models that handle uncertain data. I will briefly describe our own uncertainty toolkit below.

The optimization methods that can deal with uncertain data depend on the nature of the uncertainty. A first family, robust optimization, assumes that data is contained within ranges: for each uncertain value we know a minimum and a maximum. Optimization then optimizes for the worst case that could happen given these ranges. A second family, stochastic optimization, assumes that we have an idea of how the uncertain data behaves.
In mathematical terms, we say we have a probability distribution over the uncertain data. In practice, we start with a finite set of scenarios, each with its own data set. These scenarios represent a sample of the data uncertainty; they can be obtained from the probability distribution via Monte Carlo simulation, for instance. We can then optimize the worst-case scenario, or optimize the average across scenarios, for instance.

The choice of optimization method can be further refined depending on where the uncertainty lies. Is it in the objective function? Is it in the rest of the data? We have captured these questions and more in an automated tool that starts with a model for certain data. Then, via a wizard, optimization practitioners can specify where and in which form (robust or stochastic) uncertainty is provided. Last, our toolkit automatically transforms the model into one that deals with data uncertainty. We use the following methods to generate the model. One could argue that we miss some key ones, for instance light robustness (Fischetti et al. 2009). That's true, but these methods can be added to the toolkit as we go: the toolkit has been designed in a quite open and extensible way. More information about our uncertainty toolkit is available in this paper.

Let us conclude with the question I started with: can optimization be used with uncertain data? I think we can answer yes, because of the large body of robust optimization techniques, stochastic optimization techniques, and automated modeling tools. This positive answer is important for two reasons:
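The range-based robust approach described above can be sketched with a toy production problem. Everything here is a hypothetical illustration (the profit function, prices, and demand interval are my own assumptions, not from the toolkit): we pick the production quantity that maximizes the worst-case profit when demand is only known to lie in a range.

```python
# Toy robust optimization sketch (illustrative numbers): demand is only
# known to lie in [demand_min, demand_max]; choose the production quantity
# that maximizes the WORST-CASE profit over that range.

price, unit_cost = 10.0, 4.0
demand_min, demand_max = 60.0, 100.0

def profit(quantity, demand):
    # Revenue on units actually sold, minus cost of everything produced
    return price * min(quantity, demand) - unit_cost * quantity

def worst_case(quantity):
    # Profit is non-decreasing in demand here, so the worst case is
    # demand_min; we still scan both endpoints to keep the pattern generic.
    return min(profit(quantity, d) for d in (demand_min, demand_max))

# Brute-force search over integer candidate quantities (fine for a toy model)
robust_q = max(range(0, 121), key=worst_case)
print(robust_q, worst_case(robust_q))
```

Unsurprisingly, the robust answer is the conservative one: produce exactly the minimum demand, since any extra unit can be wasted in the worst case.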
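The scenario-based stochastic approach can be sketched on the same toy problem. Again, all numbers and the assumed normal demand distribution are illustrative: we sample demand scenarios via Monte Carlo, then pick the quantity with the best average profit across scenarios (swapping `mean` for `min` over scenarios would give the worst-case variant instead).

```python
import random

# Toy scenario-based stochastic optimization sketch (illustrative numbers):
# sample demand scenarios from an assumed normal distribution, then choose
# the quantity maximizing the AVERAGE profit across scenarios.

random.seed(42)  # reproducible scenario sample

price, unit_cost = 10.0, 4.0

# Monte Carlo scenarios: 500 demand draws, truncated at zero
scenarios = [max(0.0, random.gauss(80.0, 15.0)) for _ in range(500)]

def profit(quantity, demand):
    return price * min(quantity, demand) - unit_cost * quantity

def average_profit(quantity):
    return sum(profit(quantity, d) for d in scenarios) / len(scenarios)

# Brute-force search over integer candidate quantities
stochastic_q = max(range(0, 131), key=average_profit)
print(stochastic_q, round(average_profit(stochastic_q), 1))
```

Compared with the robust variant, the stochastic answer produces more than the minimum demand, because overproducing in low-demand scenarios is traded off against lost sales in high-demand ones.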
