## Benchmarking Is Tricky

We benchmark all the time. Why? There are mainly two reasons. First, our customers keep asking for performance improvements as they apply CPLEX to larger and more complex problems. We therefore need to make sure newer releases of CPLEX are faster for our customers, and the only way to know is to benchmark our code. Second, marketing people like to be able to claim speedups in their messaging, since a speedup boils down to a simple number that can be reused everywhere. The latter is a nice side effect, but our focus is really on the former: we use benchmarking to make sure that we improve the performance of CPLEX for our customers.

Specifically, benchmark results are used to drive the evolution of the CPLEX code. For each proposed change we compare the performance of the code after the change (let's call it version B) with the performance of the code before the change (version A). If performance improves from A to B and the change does not yield wrong results, then the change is committed. Pretty simple, isn't it?
The devil being in the details, the issue lies in how we define and measure performance. The common approach used by most people working in mathematical optimization is to select a set of test problems (a test set for short), together with a function f that measures performance on a given problem, and to aggregate f over the test set.

This leads to two questions:

- Q1: How should we measure performance on a given problem, i.e. what should f be?
- Q2: What should the test set be?
Let's answer Q1 first. Traditionally, in mathematical optimization, performance on a given problem is defined as the time needed to solve it to proven optimality. In recent years there has been a growing interest from our customers in quickly finding good feasible solutions. We use a proxy for evaluating this, which is the time to the first feasible solution. Like the time to optimality, this measure is independent of hardware speed.
However, it is far from perfect, and we are looking at alternative ways of measuring the time needed to find good solutions.
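As a side note on aggregation, running times across a test set are conventionally summarized with a (shifted) geometric mean rather than an arithmetic mean, so that no single long run dominates. Here is a minimal sketch; the shift value is an arbitrary illustrative choice, not a CPLEX setting:

```python
import math

def shifted_geo_mean(times, shift=1.0):
    """Shifted geometric mean of running times (in seconds).

    The shift damps the influence of very small times; with
    shift=0 this is the plain geometric mean.
    """
    n = len(times)
    return math.exp(sum(math.log(t + shift) for t in times) / n) - shift

# The plain geometric mean of 1s and 100s is 10s,
# while the arithmetic mean would be 50.5s.
print(shifted_geo_mean([1.0, 100.0], shift=0.0))
```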
Let us now focus on answering Q2: what should the test set be? Ideally, the test set would be the union of all the problems solved by our customers. This is not realistic, if only because some customer problems are confidential and cannot be shared with us. We therefore use a sample of our customer problems. We took great care building this sample set over decades, continuously adding to it. The result is a set of several thousand models with a large variety of features and complexity. Feedback from customers shows that it is quite representative, as they experience average speedups similar to what we measure.

Let me finish with a discussion of a common trap that many people fall into. It happens when one wants to focus on the hardest problems. Indeed, it is our experience (and that of some competitors as well) that when we improve our MIP solver, the improvements are greater for problems that are harder to solve. Reporting this improvement separately is now customary. How should one measure how much better version B is compared to version A on hard problems? Here is how it is often done. One splits the test suite by looking at how A solves the problems: the problems taking the longest are classified as hard, while the rest are classified as easy. One then compares the performance of A and B using the hard problems only. This may seem right, but it is fundamentally flawed. The flaw has been recognized and documented, for instance in a paper by Tobias Achterberg and Roland Wunderling, or in a post by Mike Trick commenting on a presentation by Matteo Fischetti.

Here is another stab at explaining the flaw. Assume we have 4 problems in our test set. The table below gives the running times (in seconds) for each version on these problems:

| Problem | Version A | Version B |
|---------|-----------|-----------|
| P1      | 1         | 1         |
| P2      | 1         | 100       |
| P3      | 100       | 1         |
| P4      | 100       | 100       |
Following the methodology above, we would classify P3 and P4 as hard problems. On these problems, version B's geometric mean is 10 while version A's geometric mean is 100. We can therefore claim that B is 10 times faster on average than A on hard problems. The issue is that the claim is not backed by reality. If we define hard problems using version B's results instead, then the hard problems are P2 and P4, and version B is 10 times slower on average than A this time! This example shows that arbitrary speedup claims can be made when the selection of hard problems is based on only one of the two versions being compared.

The cure is to define hard problems in a symmetric way, i.e. such that the set of hard problems does not change if we swap versions. We could define as hard the problems that are hard for at least one version, i.e. P2, P3, and P4. Or we could define as hard the problems that are hard for both versions, i.e. P4. Whichever definition we choose, we see that versions A and B have the same average performance on hard problems. This does not look as nice for version B, but it is closer to reality.

The above example is of course artificially constructed to make the issue look as bad as possible, but the problem remains with more realistic data. The truth is that many people publishing benchmarks fall into the trap: in order to promote a new algorithm B, they only report results on problems that are hard for the reference algorithm A. What is typically missing is B's results on problems that are easy for A. I must confess that we were guilty of this in the past when reporting CPLEX speedups at ILOG, before the acquisition by IBM. The numbers we have reported since then no longer suffer from this flaw.

I hope I've shown that benchmarking is not as straightforward as it seems. This topic deserves more discussion, and I encourage interested readers to read the paper by Tobias Achterberg and Roland Wunderling.

msoos commented:

Also, there are more parameters than just time. You hinted at this with the "first solution" example, where it may be better to give a good enough first solution quickly rather than a much better one a bit later. I don't have this problem (with SAT), but memory, for example, is very important, and many algorithms do eat quite a big chunk of it. Thus, comparing times becomes rather useless as the memory limit of the user is reached long before any solution can be found. I've found this particularly problematic, as small but hard problems would greatly benefit from the algorithms storing their data differently, but then larger problems would not fit. This sometimes means the algorithm has to be written twice, or in a polymorphic/templated way which makes it difficult to debug, etc.

JeanFrancoisPuget commented:

Right, there are many possible definitions for f, and memory use could be one. I've documented the ones we use given that they suit our needs. Other definitions may make better sense in another context. Note that the trap I discuss is relevant whatever definition of f is used.

Marc-André Carle commented:

You seem to have a very systematic approach to testing and benchmarking; that's great, but it also requires significant computational resources to benchmark each potential change. In some projects I wanted to do this, but testing would have been too slow to keep pace with code development. I considered bundling small changes together (ones that are as independent as possible from one another) for the purpose of testing; the alternative was to use smaller test samples. I ended up doing a bit of both. Is that something you have to do sometimes?

JeanFrancoisPuget commented:

You are right, this requires significant compute resources. We have several farms of identical machines that help us expedite testing. The key is to have strictly identical machines, in which case you can run performance tests in parallel.

That said, performing a full performance test for each change could indeed be too much. What we can do is test using a subset of the test set as a filter. Proposed changes that improve performance on the subset are then lumped together, and the combined change is tested against the full test set. The point is that no commit is permanent before a full test is done.
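The two-stage process described in this reply could be sketched as follows; `run_benchmark`, the 1.0 speedup threshold, and the bundling step are hypothetical placeholders for illustration, not the actual CPLEX infrastructure:

```python
def two_stage_gate(candidates, run_benchmark, combine, subset, full_set):
    """Commit gate: a cheap filter on a small subset, then a full run.

    run_benchmark(change, problems) is assumed to return the speedup
    of the change versus the current baseline (> 1.0 means faster).
    """
    # Stage 1: screen each proposed change on the small subset.
    promising = [c for c in candidates if run_benchmark(c, subset) > 1.0]
    if not promising:
        return None
    # Stage 2: lump the survivors together and validate on the full
    # test set; nothing is committed until this run passes.
    bundle = combine(promising)
    return bundle if run_benchmark(bundle, full_set) > 1.0 else None

# Toy usage with stubbed-out speedup numbers.
speedups = {("fix1", "small"): 1.2, ("fix2", "small"): 0.9,
            (("fix1",), "full"): 1.1}
run = lambda change, problems: speedups[(change, problems)]
result = two_stage_gate(["fix1", "fix2"], run, tuple, "small", "full")
print(result)  # ('fix1',) -- fix2 was filtered out at stage 1
```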
