The Good, The Bad and The Ugly - Rule Engine Benchmarks
Daniel Selman
Last night I watched "Unforgiven", so I thought I'd continue the spaghetti Western theme today by looking at the challenge of developing, running and interpreting benchmarks for Business Rule Management Systems. I recently read the excellent article "Behind the benchmarks: SPEC, GFLOPS, MIPS et al" by Jon "Hannibal" Stokes. Although Jon discusses CPU benchmarking, many of his points are relevant to rule engine benchmarking as well. For example, here is how he describes the various techniques used to measure CPU performance:
The current "Academic Benchmarks" for BRMS fall somewhere within the definition of a "Kernel" or a "Toy Application", probably closer to the latter. Although they can be useful for implementors of rule engines, I'd argue they are of no use to evaluators of rule engine technology and may be a significant distraction. ILOG has been guilty of publishing academic benchmark numbers in the past, of course, so I am not putting us on a pedestal!
The issues with the academic benchmarks for BRMS are fairly well documented, and there seems to be a general consensus that they should not be used for reliable product performance comparison; however, they keep cropping up! Even in Ruby!
Here are some of the problems:
For reference, I have included the descriptions for the three most famous academic benchmarks below (paraphrased from the paper "Effects of Database Size on Rule System Performance: Five Case Studies" by Daniel Miranker et al.). How representative do they sound of your application!?
Manners
Manners is based on a depth-first search solution to the problem of finding an acceptable seating arrangement for guests at a dinner party. The seating protocol ensures that each guest is seated next to someone of the opposite sex who shares at least one hobby.
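To show what that search amounts to, here is a minimal procedural sketch of a depth-first seating search with backtracking. The benchmark itself is written as production rules and runs on its own generated data set; this is only an illustration of the search it describes. The Guest type, the function names, the sample guests and the choice to seat guests in a row (rather than round a table) are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Guest:
    name: str
    sex: str             # "m" or "f"
    hobbies: frozenset   # set of hobby names

def compatible(a: Guest, b: Guest) -> bool:
    """Neighbours must be of opposite sex and share at least one hobby."""
    return a.sex != b.sex and bool(a.hobbies & b.hobbies)

def seat_guests(guests):
    """Depth-first search, with backtracking, for an ordering of guests
    (seats in a row, an assumption) in which every pair of neighbours is
    compatible. Returns the ordering as a list, or None if none exists."""
    seated = []
    used = set()

    def extend() -> bool:
        if len(seated) == len(guests):
            return True
        for i, guest in enumerate(guests):
            if i in used:
                continue
            if seated and not compatible(seated[-1], guest):
                continue
            seated.append(guest)
            used.add(i)
            if extend():
                return True
            seated.pop()        # backtrack: undo this choice, try the next guest
            used.discard(i)
        return False

    return seated if extend() else None

# Hypothetical sample guests, for illustration only.
guests = [
    Guest("Ann", "f", frozenset({"chess", "golf"})),
    Guest("Bob", "m", frozenset({"golf"})),
    Guest("Cat", "f", frozenset({"golf", "tennis"})),
    Guest("Dan", "m", frozenset({"tennis", "chess"})),
]
print([g.name for g in seat_guests(guests)])  # ['Ann', 'Bob', 'Cat', 'Dan']
```

Note how little this resembles a typical business application: it is a combinatorial search over a handful of homogeneous facts.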
Waltz and Waltzdb
Waltz was developed at Columbia University. It is an expert system designed to aid in the 3-dimensional interpretation of a 2-dimensional line drawing. It does so by labeling all lines in the scene based on constraint propagation. Only scenes containing junctions composed of two and three lines are permitted. The knowledge that Waltz uses is embedded in the rules. The constraint propagation consists of 17 rules that irrevocably assign labels to lines based on the labels that already exist. Additionally, there are 4 rules that establish the initial labels.
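The core idea of constraint propagation over a line drawing (often called Waltz filtering) can be sketched procedurally: each junction starts with a list of candidate labelings, and any labeling that a neighbouring junction cannot agree with on a shared line is discarded, repeatedly, until nothing changes. This is a sketch under stated assumptions, not the benchmark's 17 rules: the junction names, lines and "+"/"-" labels are invented, and it ignores the fact that real boundary (arrow) labels depend on which end of a line you view them from.

```python
from collections import defaultdict

def waltz_filter(junction_lines, candidates):
    """Prune candidate labelings until every junction agrees with its neighbours.

    junction_lines: junction id -> tuple of the line ids meeting there.
    candidates:     junction id -> list of labelings; each labeling is a
                    dict giving a label for every line of that junction.
    Returns a new dict of surviving labelings (an empty list means there is
    no consistent interpretation under these labels).
    """
    # Which junctions sit at each end of each line.
    touching = defaultdict(list)
    for j, lines in junction_lines.items():
        for line in lines:
            touching[line].append(j)

    remaining = {j: list(labelings) for j, labelings in candidates.items()}

    def supported(j, labeling):
        # Each label must be matched by some surviving labeling of every
        # other junction that shares the same line.
        return all(
            any(other[line] == labeling[line] for other in remaining[k])
            for line in junction_lines[j]
            for k in touching[line]
            if k != j
        )

    changed = True
    while changed:  # propagate until a fixed point is reached
        changed = False
        for j in list(remaining):
            kept = [lab for lab in remaining[j] if supported(j, lab)]
            if len(kept) != len(remaining[j]):
                remaining[j] = kept
                changed = True
    return remaining

# Hypothetical chain A -ab- B -bc- C: the constraint at C propagates to A.
junction_lines = {"A": ("ab",), "B": ("ab", "bc"), "C": ("bc",)}
candidates = {
    "A": [{"ab": "+"}, {"ab": "-"}],
    "B": [{"ab": "+", "bc": "+"}, {"ab": "-", "bc": "-"}],
    "C": [{"bc": "-"}],
}
print(waltz_filter(junction_lines, candidates))
```

The only constraint on C rules out one labeling of B, which in turn rules out one labeling of A, so each junction ends up with a single labeling.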
Waltzdb was developed at the University of Texas at Austin. It is a more general version of the Waltz program. Waltzdb is designed so that it can be easily adapted to support junctions of 4, 5 and 6 lines. The method used to solve the labeling problem is a version of the algorithm described by Winston [Winston, 1984]. The key difference between the problem-solving techniques used in Waltz and Waltzdb is that Waltzdb uses a database of legal line labels that are applied to the junctions in a constrained manner, whereas in Waltz the constraints are enforced by constant tests within the rules. The input data for Waltz is a set of lines defined by Cartesian coordinate pairs.
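The difference between the two approaches is where the labeling knowledge lives. The following procedural analogy (not the benchmarks' actual rules) contrasts the two styles. The two-symbol label alphabet and the tables of "legal" combinations are invented for illustration and are not the real line-labeling catalogue. In the first style the legal combinations are hard-wired into test code, much as Waltz hard-wires them as constant tests; in the second they are rows in a table, so, as with Waltzdb, supporting a larger junction is a data change.

```python
from itertools import product

LABELS = ("+", "-")  # hypothetical two-label alphabet, for illustration only

# Waltz-like style: the legal combinations are baked into the tests themselves.
def legal_two_line_hardcoded(a: str, b: str) -> bool:
    return (a == "+" and b == "-") or (a == "-" and b == "+")

def legal_three_line_hardcoded(a: str, b: str, c: str) -> bool:
    return a == b == c

# Waltzdb-like style: the same (invented) knowledge held as data,
# keyed by the number of lines meeting at the junction.
LEGAL_LABELINGS = {
    2: {("+", "-"), ("-", "+")},
    3: {("+", "+", "+"), ("-", "-", "-")},
}

def is_legal(labels: tuple) -> bool:
    """Look the labeling up in the table; unknown junction sizes are never legal."""
    return labels in LEGAL_LABELINGS.get(len(labels), set())

# Supporting 4-line junctions means adding rows, not writing new tests.
LEGAL_LABELINGS[4] = {("+", "-", "+", "-"), ("-", "+", "-", "+")}

# Both styles agree on every 2-line labeling.
print(all(is_legal((a, b)) == legal_two_line_hardcoded(a, b)
          for a, b in product(LABELS, repeat=2)))  # True
print(is_legal(("+", "-", "+", "-")))              # True
```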
If performance is one of your major buying criteria, then I would strongly encourage you to build a proof-of-concept set of rules and data and verify rule engine performance in your own environment. It is impossible to meaningfully extrapolate from published academic benchmark results to your running application, with your rules and data, deployed to your OS and hardware. In addition, a proof of concept allows you to evaluate the BRMS from as many angles as possible: ease of use, business user accessibility, support and professional services, performance, deployment platforms, scalability and so on.
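A proof of concept still needs a fair measurement. The sketch below assumes a Python driver and a hypothetical `run_poc` function standing in for whatever loads your rules and data and fires your engine. It discards a few warm-up runs, then reports the minimum, median and maximum of repeated timings. The warm-up and trial counts are arbitrary defaults, not recommendations from any vendor.

```python
import statistics
import time
from typing import Callable

def measure(workload: Callable[[], object], warmup: int = 3, trials: int = 10) -> dict:
    """Time a workload in seconds. Warm-up runs are discarded so that one-off
    start-up costs (class loading, JIT compilation in a JVM-based engine,
    cold caches) do not dominate the reported numbers."""
    for _ in range(warmup):
        workload()
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }

# Hypothetical stand-in: replace with a call that loads *your* rules,
# inserts *your* data and fires the engine on *your* deployment stack.
def run_poc():
    return sum(i * i for i in range(100_000))

if __name__ == "__main__":
    print(measure(run_poc))
```

Reporting the spread rather than a single number makes noisy runs easier to spot; run it on the OS and hardware you will actually deploy to.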