Overfitting In Mathematical Optimization
JeanFrancoisPuget 2700028FGP Comments (2) Visits (13740)
Have you ever heard of overfitting? I bet you did if you are using machine learning one way or another. Avoiding overfitting is a key concern for machine learning practitioners and researchers, and it led to the development of many techniques to avoid it, such as cross validation and regularization. I claim that OR practitioners, especially those using mathematical optimization solvers, should be equally worried about overfitting.
In a nutshell, overfitting in machine learning happens when the model you trained is too good to be true on the training data: your model predicts the training data very well, but your model performs poorly on new data. An extreme case of overfitting is rote learning: in that case your model achieves 100% performance on data you have already seen, and probably won't do any better than random guess on new data. You can read Overfitting In Machine Learning if you want to know more.
Overfitting In General
How could overfitting be relevant to optimization?
Let me abstract a bit overfitting in order to answer that question. We need few ingredients:
Our goal is to select the parameters that lead to the best value for the evaluation metric on new, unforeseen, data sets. These data sets are called test data.
For our sorting algorithm, finding the best parameter settings amounts to finding the number of bins that yield best performance on new, unforeseen, arrays. I machine learning, finding good parameter settings is called hyper parameter optimization (HPO). HPO finds the values of hyper parameters that lead to the best evaluation metric on new, unforeseen, examples.
The catch is that we don't have the new, unforeseen, data, when we select parameters, hence we cannot compute evaluation metric on it. As explained in Overfitting In Machine Learning, the solution to this is to assume that the training data is representative of the new, unforeseen, data. Then we get a simpler problem:
Find parameter settings that lead to the best evaluation metric for the training data.
These parameter setting would then yield best possible evaluation on new, unforeseen, test data.
The above is true only if we made a significant assumption. Let me state it again:
We assume that the training data is representative (close to) the data to which the algorithm will be applied in the future.
If this assumption is violated, then the best parameter setting for the training data may not be the best parameter setting for test data. Violating this assumption results in overfitting: we get good evaluation metric on training data, and poor evaluation metric on test data.
Overfitting In Optimization
Let us now look at mathematical optimization solvers. These are mostly a collection of clever implementations of well known algorithms such as simplex, log-barrier (interior point), and branch&bound for MIP. These implementations come with a number of variants that are controlled via public parameters and hidden parameters: different parameter settings lead to different running times. For instance, in a MIP solver, you will have quite a few parameters, including on/off switches for presolve reductions and cut families.
One of the main focus of solver developers is to improve performance, i.e. find ways to solve optimization problems faster and faster. One way to do it is to try to select the best default values for parameter settings. This fits the framework described above. Using CPLEX MIP as an example:
Overfitting happens if the training data is not representative of the new data to which CPLEX MIP will be applied. In optimization terms:
We overfit if the MIP instances we use to select best parameter settings are not representative of the MIP instances our users will have.
In order to avoid this, we try to have a set of MIP instances that are representative of our users MIP instances. We do it via a very straightforward way: we collect instances from our users (with their explicit permission of course). Having done that for decades ensures we have a reasonable representative set containing thousands of models.
What about our users? OR practitioners usually spent time tuning the solver they use so that problems get solved faster. They do so by collecting a set of problem instances, then try manually or automated tuning tools that selects parameters leading to better performance. If the problem set they use is not representative of the future problem the solver will be applied to, then they will suffer from overfitting-
We can further generalize overfitting to the modeling part of optimization. This part is somehow the equivalent of feature engineering in machine learning. It is how to translate a business problem into a mathematical optimization problem. i gave an example of optimization modeling in Step By Step Modeling Of PuzzlOr Electrifying Problem. Point is that the same business problem can be represented in many different ways as an optimization problem. Some ways will lead to better performance than others. It is therefore no surprise that OR practitioners spend significant time trying various models in order to find the one leading to best performance. Here again there is a risk of overfitting. Indeed, think of the model as one (major) parameter to the algorithm:
Users could suffer from overfitting if the training data is not representative of the new data to which CPLEX MIP will be applied. In optimization terms:
Users overfit if the business problem instances used to select best model are not representative of the business problem instances users will have.
This is often overlooked as there is pressure to show best possible performance during the development of models.
Let me conclude with yet another way overfiting can creep in. It is when one uses a public benchmark to infer solver performance comparison. There are public benchmarks for mathematical optimization solvers, like MIPLIB. CPLEX performance is one of the top performance on these public benchmarks. Yet we could do better. Indeed, It would be tempting to use these public benchmarks as the training data when we select CPLEX default parameter settings: this would yield even better performance. However, given these benchmarks usually contain a small number of instances that are not representative of our users problems, we would be overfittting. Another consequence is that you should not use these public benchmarks as an indication of how various solvers would behave on your problems. There is only one way to know this: try the various solvers on a set of problems that are representative of the problems you'll need to solve later. Anything that departs from this will lead to overfitting and lead to disappointing performance in the future.