Benchmarking Is Tricky
JeanFrancoisPuget 2700028FGP Comments (4) Visits (9519)
We benchmark all the time. Why?
There are mainly two reasons. First, our customers keep asking for performance improvements, as they apply CPLEX to larger and more complex problems. We therefore need to make sure newer releases of CPLEX are faster for our customers. The only way to know is to benchmark our code. Second, marketing people like to be able to claim speedup in their messaging. Indeed, this boils down to a simple number that can be reused everywhere. The latter is a nice side effect, but our focus is really on the former: we use benchmarking to make sure that we improve the performance of CPLEX for our customers.
Specifically, benchmark results are used to drive CPLEX code evolution. For each proposed change to CPLEX code we compare the performance of the code after the change (let's call it version B) with the performance of the code before the change (let's call it version A). If the performance increases from A to B and if the change does not yield wrong results, then the change is committed.
Pretty simple isn't it?
Devil being in details, the issue lies with how we define and measure performance. The common way used by most people working in mathematical optimization is to select a set of test problems (test set T), and to measure a function f(T,X) of the test set T and the version X of the code used. Then version B is committed if
(1) f(T,B) < f(T,A)
This leads to two questions:
Q1. What should be the function f ?
Q2. What should be the test set T ?
Let's answer Q1 first. Traditionally, in mathematical optimization, people define f as the geometric average of the time in which problems in T are solved by version X. Here, solved means find the optimal solution and prove it is optimal. A good property of this definition is that it is independent of hardware speed in the following sense. If we perform our tests on, say, a 2x faster hardware, then the value of f is consistently divided by 2, and the result of test (1) above does not change.
In the recent years there has been a growing interest from our customers in quickly finding good feasible solutions. We use a proxy for evaluating this, which is the time to first feasible solution. This measure is also independent of hardware speed as defined above.
However, it is far from being perfect, and we are looking at alternative ways of measuring time to find good solutions. We still want test (1) above to be unchanged if we move to a faster hardware. This precludes the use of a measure like the following: the average value of best solution found in 5 minutes on hardware X. Indeed, running on a different hardware could change the relative value of f for different versions of the code.
Let us now focus on answering Q2: What should be the test set T ?
Ideally, the test set should be the union of all the problems solved by our customers. This is not realistic, if only because some customer problems are confidential and cannot be shared with us. We therefore use a sample of our customer problems. We took great care of building this sample set over decades, continuously adding to it. The result is a set of several thousand models with a large variety of features and complexity. Feedback from customers show that it is quite representative as they experience similar average speedups as what we measure.
Let me finish with a discussion of a common trap that many people fall into. It happens when one wants to focus on hardest problems. Indeed, it is our experience (and that of some competitors as well) that when we improve our MIP solver then the improvements are greater for problems that are harder to solve. Reporting this improvement separately is now customary.
How should one measure how better is version B compared to version A on hard problems? Here is how it is often done. One splits the test suite by looking at how A solves them. The problems taking the longest are classified as hard, while the rest is classified as easy. One then compares the performance of A and B using the hard problems only.
This may seem right but it is fundamentally flawed. The flaw has been recognized and documented, for instance in a paper by Tobias Achterberg and Roland Wunderling, or in this post by Mike Trick commenting a presentation by Matteo Fischetti. Here is another stab at explaining the flaw.
Assume we have 4 problems in our test set. Table below gives the running times for each version on these problems
Following the methodology above, we would define as hard problems P3 and P4. On these problems version B geometric average is 10 while A geometric mean is 100. We can therefore claim that B is 10 times faster on average then A on hard problems.
Issue is that the claim is not backed by reality. If we define hard problems using version B results, then hard problems are P2 and P4, and version B is 10 times slower on average than A this time! This example shows that arbitrary speedup claims can be done when the selection of which problems are hard is based on only one of the two versions that are compared.
The cure is to define hard problems in a symmetric way, i.e. such that the hard problems do not change if we swap versions. We could define as hard problems the ones that are hard for at least one version, ie P2, P3, and P4. Or, we could define as hard problems the ones that are hard for both versions, i.e. P4. Whichever definition we chose, we see that versions A and B have same average performance on hard problems. This does not look as nice for version B, but it is closer to reality.
The above example is of course artificially constructed to make the issue look as bad as possible, but the problem remains with more realistic data. Truth is that many people publishing benchmarks fall in the trap: in order to promote a new algorithm B they only report results on problems that are hard for the reference algorithm A. What is typically missing is B results on problems that are easy for A.
I must confess that we have been guilty of this in the past when reporting CPLEX speedups at ILOG, before acquisition by IBM. Numbers we reported since then are no longer suffering from this flaw.
I hope I've shown that benchmarking is not as straightforward as it seems. This topic deserves more discussion and I encourage interested reader to read the paper by Tobias Achterberg and Roland Wunderling.