Last night I watched "Unforgiven", so I thought I'd continue the spaghetti Western theme today by looking at the challenge of developing, running and interpreting benchmarks for Business Rule Management Systems. I recently read the excellent article "Behind the benchmarks: SPEC, GFLOPS, MIPS et al" by Jon "Hannibal" Stokes. Although Jon discusses CPU benchmarking, many of his points are relevant to rule engine benchmarking as well. For example, here is how he describes the various techniques used to measure CPU performance:
- Real-world Applications: one of the most popular, and best, ways to benchmark a system is to run a real program on it and see how long it takes to complete a task.
- Kernels: a small, CPU-intensive portion of a real program that’s intended to be run on its own or as part of a benchmarking suite.
- Toy Applications: small programs like Quicksort or the Sieve of Eratosthenes that people cook up to compile and run on any computer (see the sketch after this list).
- Synthetic Benchmarks: attempt to determine the average instruction mix of a typical program, then replicate that mix by running those types of instructions on the CPU.
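To make the "Toy Application" category concrete, here is a minimal sketch of what such a benchmark looks like: a Sieve of Eratosthenes kernel timed with a simple wall-clock measurement. The problem size and the single warm-up run are my own illustrative choices, not part of any published suite.

```java
// A minimal "toy application" benchmark: time a Sieve of Eratosthenes run.
// Illustrative only; serious CPU benchmarks control for JIT warm-up,
// clock resolution and compiler optimisations far more carefully.
public class SieveToyBenchmark {

    // Count the primes below 'limit' using the Sieve of Eratosthenes.
    static int sieve(int limit) {
        boolean[] composite = new boolean[limit];
        int count = 0;
        for (int i = 2; i < limit; i++) {
            if (!composite[i]) {
                count++;
                for (long j = (long) i * i; j < limit; j += i) {
                    composite[(int) j] = true;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int limit = 10_000_000;
        sieve(limit); // warm-up run so the JIT has compiled the hot loop
        long start = System.nanoTime();
        int primes = sieve(limit);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(primes + " primes below " + limit + " in " + elapsedMs + " ms");
    }
}
```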
The current "Academic Benchmarks" for BRMS fall somewhere within the definition of a "Kernel" or a "Toy Application", probably closer to the latter. Although they can be useful for implementors of rule engines, I'd argue they are of no use to evaluators of rule engine technology and may be a significant distraction. ILOG has been guilty of publishing academic benchmark numbers in the past of course, so I am not putting us on a pedestal!
The issues with the academic benchmarks for BRMS are fairly well documented, and there seems to be a general consensus that they should not be used for reliable product performance comparison; however, they keep cropping up! Even in Ruby!
Here are some of the problems:
- Very small number of rules (fewer than 50)
- Test worst-case RETE inference and agenda usage, which is rarely encountered in well-designed applications
- Implement solutions to combinatorial search problems, which are typically better tackled with a constraint programming (CP) engine than with a RETE engine
- The solutions are not verified, meaning that correctness may be inadvertently (?) sacrificed for speed. This is particularly the case for Waltz and WaltzDB, for which I've never seen any verification of the results.
- The benchmarks are easy to "game" because every vendor ports the original OPS5 source code into its own rule language, taking advantage of its product's particular features along the way
For reference, I have included the descriptions for the three most famous academic benchmarks below (paraphrased from the paper "Effects of Database Size on Rule System Performance: Five Case Studies" published by Daniel Miranker et al). How representative do they sound of your application!?
Manners
Manners is based on a depth-first search solution to the problem of finding an acceptable seating arrangement for guests at a dinner party. The seating protocol ensures that each guest is seated next to someone of the opposite sex who shares at least one hobby.
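To give a feel for the problem (this is not the OPS5 benchmark code, just a plain-Java sketch of the underlying search, with invented guest data), the depth-first search amounts to something like this:

```java
import java.util.*;

// Plain-Java sketch of the Manners seating problem: depth-first search for an
// ordering in which every adjacent pair is of opposite sex and shares a hobby.
// The Guest record and the sample data are invented for illustration.
public class MannersSketch {

    record Guest(String name, char sex, Set<String> hobbies) {}

    // The Manners seating protocol, applied to two adjacent guests.
    static boolean compatible(Guest a, Guest b) {
        return a.sex() != b.sex() && !Collections.disjoint(a.hobbies(), b.hobbies());
    }

    // Depth-first search: extend the partial seating one guest at a time,
    // backtracking when no remaining guest can sit next to the last one seated.
    static boolean seat(List<Guest> remaining, List<Guest> seated) {
        if (remaining.isEmpty()) {
            System.out.println(seated.stream().map(Guest::name).toList());
            return true;
        }
        for (Guest g : new ArrayList<>(remaining)) {
            if (seated.isEmpty() || compatible(seated.get(seated.size() - 1), g)) {
                remaining.remove(g);
                seated.add(g);
                if (seat(remaining, seated)) return true;
                seated.remove(seated.size() - 1);
                remaining.add(g);
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Guest> guests = new ArrayList<>(List.of(
            new Guest("Ann", 'f', Set.of("chess", "tennis")),
            new Guest("Bob", 'm', Set.of("chess")),
            new Guest("Col", 'm', Set.of("tennis", "golf")),
            new Guest("Dee", 'f', Set.of("golf"))));
        seat(guests, new ArrayList<>());
    }
}
```

The published versions of the benchmark simply scale up the number of guests, which is what drives the combinatorial growth in partial matches.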
Waltz and WaltzDB
Waltz was developed at Columbia University. It is an expert system designed to aid in the 3-dimensional interpretation of a 2-dimensional line drawing. It does so by labeling all lines in the scene based on constraint propagation. Only scenes containing junctions composed of two and three lines are permitted. The knowledge that Waltz uses is embedded in the rules. The constraint propagation consists of 17 rules that irrevocably assign labels to lines based on the labels that already exist. Additionally, there are 4 rules that establish the initial labels.
WaltzDB was developed at the University of Texas at Austin. It is a more general version of the Waltz program. WaltzDB is designed so that it can easily be adapted to support junctions of 4, 5, and 6 lines. The method used to solve the labeling problem is a version of the algorithm described by Winston [Winston, 1984]. The key difference between the problem-solving techniques used in Waltz and WaltzDB is that WaltzDB uses a database of legal line labels that are applied to the junctions in a constrained manner, whereas in Waltz the constraints are enforced by constant tests within the rules. The input data for Waltz is a set of lines defined by Cartesian coordinate pairs.
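As a rough illustration of the constraint-propagation style both programs rely on (with invented junction and label data; the real benchmarks work from the full catalogue of legal junction labelings), the core pruning loop looks something like this:

```java
import java.util.*;

// Toy sketch of Waltz-style constraint propagation: each junction holds a set
// of candidate labelings, and labelings that cannot agree with any surviving
// labeling of a neighbouring junction are pruned until a fixpoint is reached.
// The data in main() is invented purely for illustration.
public class WaltzFilterSketch {

    // A labeling assigns one label (e.g. '+', '-', '>') to each line of a junction.
    record Labeling(Map<String, Character> lineLabels) {}

    // Two labelings agree if they give the same label to every line they share.
    static boolean consistent(Labeling a, Labeling b) {
        for (var e : a.lineLabels().entrySet()) {
            Character other = b.lineLabels().get(e.getKey());
            if (other != null && !other.equals(e.getValue())) return false;
        }
        return true;
    }

    // Remove candidate labelings until no junction can be pruned any further.
    static void propagate(Map<String, List<Labeling>> candidates,
                          Map<String, List<String>> neighbours) {
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String junction : candidates.keySet()) {
                for (String n : neighbours.getOrDefault(junction, List.of())) {
                    List<Labeling> theirs = candidates.get(n);
                    changed |= candidates.get(junction)
                        .removeIf(l -> theirs.stream().noneMatch(t -> consistent(l, t)));
                }
            }
        }
    }

    public static void main(String[] args) {
        // Two junctions sharing line "L1": only labelings that agree on L1 survive.
        Map<String, List<Labeling>> candidates = new HashMap<>();
        candidates.put("J1", new ArrayList<>(List.of(
            new Labeling(Map.of("L1", '+', "L2", '-')),
            new Labeling(Map.of("L1", '-', "L2", '-')))));
        candidates.put("J2", new ArrayList<>(List.of(
            new Labeling(Map.of("L1", '+', "L3", '>')))));
        propagate(candidates, Map.of("J1", List.of("J2"), "J2", List.of("J1")));
        System.out.println(candidates);
    }
}
```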
If performance is one of your major buying criteria, then I would strongly encourage you to build a proof-of-concept set of rules and data and verify rule engine performance in your own environment. It is impossible to meaningfully extrapolate from published academic benchmark results to your running application, with your rules and data, deployed on your OS and hardware. In addition, this will allow you to evaluate the BRMS from as many angles as possible, spanning ease-of-use, business user accessibility, support and professional services, performance, deployment platforms, scalability, etc.
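If you do go down that route, even a crude harness along the lines below, run on your own hardware with your own fact data, will tell you more than any published number. The executeRules callback is just a placeholder for whichever API the engine you are evaluating exposes; nothing here is a real vendor interface.

```java
import java.util.List;
import java.util.function.Consumer;

// Minimal sketch of a proof-of-concept timing harness: feed representative
// fact sets through the candidate engine, after a warm-up pass, and report
// throughput. The executeRules callback is a placeholder, not a vendor API.
public class PocBenchmarkHarness {

    static void time(String label, List<Object[]> factSets, Consumer<Object[]> executeRules) {
        // Warm-up pass: let the engine build its networks and the JIT settle.
        factSets.forEach(executeRules);

        long start = System.nanoTime();
        for (Object[] facts : factSets) {
            executeRules.accept(facts);
        }
        long elapsedMs = Math.max(1, (System.nanoTime() - start) / 1_000_000);
        System.out.printf("%s: %d fact sets in %d ms (%.1f sets/s)%n",
            label, factSets.size(), elapsedMs, factSets.size() * 1000.0 / elapsedMs);
    }

    public static void main(String[] args) {
        List<Object[]> sampleData = List.of(new Object[]{"fact1"}, new Object[]{"fact2"});
        // Replace the empty lambda with calls into the engine you are evaluating,
        // e.g. inserting the facts into a session and firing the rules.
        time("my-poc-rules", sampleData, facts -> { });
    }
}
```

Measuring several rule-set and data-set sizes this way also gives you a feel for how the engine scales, which single-figure academic results never show.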
- OPS5 Benchmark Suite by Daniel Miranker (University of Texas)
- OPS5 Benchmark Suite by Daniel Miranker (CMU Artificial Intelligence Repository)
- Lessons learned in benchmarking — Floating point benchmarks: can you trust them? John Markus Bjørndalen, Otto Anshus
- Behind the benchmarks: SPEC, GFLOPS, MIPS et al by Jon "Hannibal" Stokes
- Academic Benchmark Performance