Standards and specs: Lies, statistics, and benchmarks

Measure twice, run everywhere

Benchmarks can be an excellent tool for predicting performance and estimating requirements. They can also be misleading, possibly catastrophically so. Benchmark standardization helps distinguish between a good estimate and a meaningless number.

Peter Seebach, Freelance author, Plethora.net

Peter Seebach likes to measure his writing output in words per hour per dollar spent on the keyboard. His useless-numbers-per-benchmark-performed ratio is roughly three to one. Peter Seebach wrote this article at 60fps.



23 May 2006

Since the second computer was built, users have compared the performance of different computers. Performance matters. Sometimes, a performance difference is just a question of whether a job will be done sooner or later; in other cases, a performance difference might prevent a job from being done at all.

Measuring performance of a particular task is not too daunting, but developing a prediction of how quickly other tasks will run can be nigh impossible. A benchmark is a task designed such that a measure of performance on this particular task will be a good proxy for performance across a wide variety of tasks.

Benchmarks can measure the performance of hardware or software. In some cases, it can be very difficult to figure out exactly what a benchmark is measuring. The difference between two runs of a benchmark might reflect architectural differences between systems, operating system differences, or even choice of compilers or compiler options. Good benchmarks try to control for at least some of these differences.

The operative word here is "try," in many cases. A huge number of factors might influence performance in a particular case. Code run from ROM might run at a different speed than code run from RAM. One inner loop might fit inside a cache, while another doesn't. Some systems might have other limited resources. A system that can process data from disk fast enough, or push it to the network fast enough, might run out of PCI bandwidth trying to do both at once. Unfortunately, it's very hard, and in some cases impossible, to change only one variable at a time. For this reason, benchmarks are invariably approximate.

This month, in a departure from more rigidly formal specifications, Standards & Specs looks at what makes benchmarks worth using and how they can contribute to standardization and development. Benchmarks are often used as a kind of certification. It's not exactly a standard, but it's interesting.

A little history

One of the very first benchmarks involved the VAX 11/780, which was marketed as being able to perform a million instructions per second. Thus, one early benchmark was simply to compare performance to a VAX. More generally, MIPS might represent the number of clocks per second divided by the average number of clocks per instruction; this estimate can be fairly good, or fairly bad. A system that could do a task twice as fast as a VAX was estimated to run at 2 MIPS. Many people argue that this measurement quickly diverged from an actual count of instructions; in fact, there's really no point at which it was an actual count of instructions. MIPS numbers can still be calculated, although it's hard to make much sense of them. For instance, one source claims 21,800 MIPS for the Cell Broadband Engine™ processor running at 3.2GHz... But is that with or without the SPEs? We aren't told. Wikipedia's article on MIPS is interesting: the 68000 is reported as 1 MIPS running at 8MHz, while the PowerPC® G3 is reported at 525 MIPS running at 233MHz -- one of the first CPUs whose MIPS rating handily exceeds its clock speed in megahertz.
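
To see how such an estimate is derived, here is a back-of-the-envelope calculation in C; the clock rate and CPI below are invented for illustration, not any real processor's figures:

    /* Rough MIPS estimate: clock rate divided by average clocks per
       instruction (CPI), scaled to millions. All numbers are invented. */
    #include <stdio.h>

    int main(void)
    {
        double clock_hz = 500e6; /* hypothetical 500MHz processor           */
        double avg_cpi  = 1.25;  /* hypothetical average clocks/instruction */
        double mips = clock_hz / avg_cpi / 1e6;

        printf("Estimated MIPS: %.0f\n", mips); /* prints 400 */
        return 0;
    }

The estimate is only as good as the assumed CPI, which varies wildly with the workload -- which is exactly why the number is hard to make sense of.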

Today, whole organizations are devoted to benchmarking systems. The most famous is probably SPEC (Standard Performance Evaluation Corporation), whose broad variety of benchmarks is typically well respected. SPEC was founded in 1988 and now has over 60 member organizations. It acts in many ways like a standards organization, standardizing measurements. The member organizations benefit from a more consistent standard, and with so many members, the fear that a benchmark will be tuned to favor a given vendor's products can reasonably be dismissed.


What makes a benchmark matter

What makes a benchmark matter is how useful it is in predicting performance. That's simple enough to describe; it's not nearly so simple to implement. Many factors come together in trying to design a good benchmark. In practice, you can't meet all of these goals at once in a single test. As a result, you might want to run a variety of tests and look at their results together.

One key component of an effective benchmark is controls. Modern computer systems are prone to variances, and benchmarks that don't account for those variances might not help the user at all. As an example, video card benchmarks are often run in multiple color depths. It's common for a system to do very well in one depth and comparatively poorly in another. Similarly, resolution and related features, such as antialiasing, can reveal strengths or weaknesses of a given design. On one card, use of antialiasing might have very little effect on performance, while on another, it might have a very noticeable effect.

Predicting "overall" performance is essentially meaningless. A developer whose system will spend 75% of its cycles running compilers and simulators has little reason to care about 3D graphics performance. A gamer might have even less reason to care about the efficiency of byte-swapping operations. Because of this, benchmarks generally try to come up with a measurement of some particular kind of task.

The units in which a benchmark is measured often tell you a great deal about the benchmark. A benchmark generally measures in terms of items per unit of time, for instance, frames per second for a video card, or transactions per second for a database. Benchmarks often provide both breakdowns of specific tests and composite or derived statistics -- averages, for example. One of the most heavily used derived statistics is price/performance, which you obtain by dividing performance numbers by estimated cost.
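
Here's a minimal sketch of how these derived figures fall out of the raw measurements; every number in it is made up for illustration:

    /* Derived benchmark statistics from raw measurements.
       All figures here are invented for illustration. */
    #include <stdio.h>

    int main(void)
    {
        double transactions = 180000.0; /* work completed during the run */
        double seconds      = 60.0;     /* measured wall-clock time      */
        double system_cost  = 12000.0;  /* estimated price, in dollars   */

        double tps = transactions / seconds;   /* items per unit of time */
        double price_perf = tps / system_cost; /* performance per dollar */

        printf("Throughput:        %.0f transactions/sec\n", tps);
        printf("Price/performance: %.2f transactions/sec per dollar\n",
               price_perf);
        return 0;
    }

The composite is only as trustworthy as the cost estimate that goes into it, which is one reason the per-test breakdowns still matter.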

Cost itself is subject to benchmarking. Anyone who has ever owned a printer has probably found out how hard it is to get a reasonable estimate of what it actually costs to print a single page on it. The cost of ink alone isn't enough to tell you about printing costs; some printers use ink more efficiently. The amount of ink used isn't enough; different ink has wildly different costs. Initial cost of a system isn't enough to tell you about its long-term costs, but tests of hardware longevity are, in a way, a test of cost over time.

Some benchmarks are simple tests to see whether a product lives up to quoted specifications. For instance, I've never seen a printer that actually printed real documents at its rated speed in pages per minute. Hard drives are another device where reported specifications and actual performance can diverge wildly. Often, reported specifications have little relevance; every ATA/100 drive has a theoretical transfer rate of 100MB/second, but very few drives can actually provide that much data! Only some vendors bother to provide relevant performance figures for drives.


Designing benchmarks

A good benchmark ought to give users reasonable feedback on aspects of a product that they care about. A reasonable level of detail is important, but so is giving users something they can make sense of. A chart of instruction timings for a CPU is not a very good benchmark, even though it's very detailed. A clock speed is not a very good benchmark, even though it's easy to summarize.

One thing a benchmark should have is controls. After you've decided exactly what you're testing, try to eliminate or control for other variables. If you're comparing printer speed, perform all your tests with the same data files and host computer. Comparing the print performance of Printer A, hooked up to a Pentium 90, and Printer B, hooked up to a 2GHz dual-core Athlon64, might not give you much information about the printers. On the other hand, it might be even more informative to test each printer on both systems. Perhaps Printer A has inefficient Macintosh drivers, while Printer B has horrible Windows® drivers. Isolating the printer from its drivers is hard, and since the drivers might well be proprietary and closed, it might also be useless.

If more than a few people are going to use your benchmark, the most important thing to control for might be efforts to skew the benchmark results. Because benchmarks are frequently a major influence on buying decisions, there's a strong incentive to try to improve performance on a particular benchmark. The benchmark goes from being representative of possible workloads to being a workload the system is specifically targeted to.

The question of how much you're allowed to tweak your system for benchmarks is both an ethical and a pragmatic one. Companies whose benchmark results diverge too far from reality generally get caught with "real-world numbers" that don't line up. Stories were told a while back about a company whose video cards scored unusually well in well-known gaming benchmarks, but if you watched the screen during the benchmark, it was full of errors. Allegedly, the drivers detected the benchmark and took shortcuts!

You might need to update benchmarks over time. Updates, unfortunately, make it hard for people to track progress. If I can't run the same benchmark on two systems, I can't make a fair comparison of their performance. Updates are thus a mixed bag. If you publish numbers, it's important to maintain some overlap between systems tested with the new version and systems tested with the old version, to give people at least some idea of what changed.

Don't try to be all things to all people. Figure out what you're testing, and test it. Overviews are important, but don't hide all the real numbers behind vague abstractions. Don't condense everything to a single unitless number. A measurement of video card performance should let me know how much resolution matters, how much antialiasing matters, and how much color depth matters. These things vary; I have one video card on which resolution changes have almost no effect, and another where performance slows to a crawl at high resolutions. In some cases, other architectural features might come into play. The performance characteristics of the same video chipset might be substantially different on PCI, AGP, and PCI-Express systems.

Example benchmarks

A couple of examples of informal benchmarks can give some insight into designing benchmarks. One example is the set of benchmarks IBM® recently published for the first implementation of the Cell Broadband Engine™ Architecture (see Resources). These results were based on a number of specific tasks -- ones, such as matrix multiplication, whose performance characteristics on other processors are well known. Of particular interest are the bandwidth benchmarks for the Element Interconnect Bus (EIB). For these benchmarks, the researchers took advantage of domain knowledge (an understanding of the EIB architecture) to design tests that contrast best-case and worst-case scenarios. This kind of information helps developers make informed decisions.
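
IBM's EIB measurements were made with knowledge of that specific hardware, but the general pattern -- contrasting a best-case access pattern with a worst-case one -- is easy to sketch. The following is a generic memory-bandwidth microbenchmark, not the Cell benchmark itself; the buffer size and stride are arbitrary, and clock() only approximates elapsed time:

    /* A generic memory-bandwidth sketch: read the same 64MB buffer
       sequentially (best case) and with a large stride (worst case).
       This is not IBM's EIB benchmark; sizes and the stride are
       arbitrary, and clock() only approximates elapsed time. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define BUF_WORDS (16 * 1024 * 1024)   /* 64MB of ints */
    #define STRIDE    1024                 /* worst case: one int every 4KB */

    static long sweep(const int *buf, size_t step)
    {
        /* Touch every word exactly once: in order when step == 1,
           or jumping step words at a time otherwise. */
        long sum = 0;
        for (size_t start = 0; start < step; start++)
            for (size_t i = start; i < BUF_WORDS; i += step)
                sum += buf[i];
        return sum;
    }

    int main(void)
    {
        int *buf = malloc(BUF_WORDS * sizeof *buf);
        if (!buf) return 1;
        for (size_t i = 0; i < BUF_WORDS; i++)
            buf[i] = (int)i;

        clock_t t0 = clock();
        long a = sweep(buf, 1);        /* sequential: best case */
        clock_t t1 = clock();
        long b = sweep(buf, STRIDE);   /* strided: worst case   */
        clock_t t2 = clock();

        double mb = BUF_WORDS * sizeof(int) / 1e6;
        printf("sequential: %.1f MB/s\n", mb * CLOCKS_PER_SEC / (t1 - t0));
        printf("strided:    %.1f MB/s\n", mb * CLOCKS_PER_SEC / (t2 - t1));

        free(buf);
        /* Use the sums so the compiler can't discard the loops; both
           sweeps add up the same values, so this still returns 0. */
        return (int)((a - b) & 1);
    }

Both sweeps read the same data; only the access pattern changes, which is exactly the kind of contrast that tells a developer what to expect in the best and worst case.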

For another example, consider the entirely informal benchmarks for the Art of Illusion (AoI) renderer, hosted by Kevin Lynn (see Resources). These benchmarks simply collect reported times to render a sample scene on a variety of platforms. The benchmark image itself stays the same; new result sets are collected for major releases of the software. Users report CPU, physical memory, host operating system, Java™ version, and special flags or notes. This benchmark shows both machine differences and differences between versions of AoI. (Kevin asks that anyone with dual-core Opterons who has a moment please run the benchmark and submit results.)


Using benchmarks

Before you start running a benchmark, make sure you understand what it's supposed to be measuring. Benchmarks which have source code might have unusual requirements for compilation. Some benchmarks might impose requirements on software. If a benchmark does a huge amount of file access, and you leave your anti-virus software running, your results will be low. (They'll also be realistic, for many users, but it won't be the intended measurement.)

For benchmark results to be meaningful, you have to do the controls correctly. Don't run a benchmark that specifies system memory on a system with twice the memory specified.

If the results you get seem way out of line with expectations, check your work carefully in case you overlooked something. Unrealistically high or low numbers might indicate a control you forgot to apply. (In one test I did, replacing a cheap built-in video device with a regular plug-in card made about a 10% difference in performance of a Java application that did almost no graphics; I still don't know why.)
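
One simple habit that catches this kind of surprise is to run the measurement several times and look at the spread before trusting any single number. Here's a minimal sketch of such a harness; workload() is a placeholder for whatever you're actually timing:

    /* Repeat a measurement several times and report the spread; a wide
       spread suggests something uncontrolled is interfering with the run.
       workload() is a placeholder for whatever is actually being timed. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define RUNS 9

    static void workload(void)
    {
        /* Placeholder busy-work; replace with the real task. */
        volatile double x = 0.0;
        for (long i = 0; i < 20000000L; i++)
            x += (double)i * 0.5;
    }

    static int cmp_double(const void *a, const void *b)
    {
        double d = *(const double *)a - *(const double *)b;
        return (d > 0) - (d < 0);
    }

    int main(void)
    {
        double t[RUNS];
        for (int r = 0; r < RUNS; r++) {
            clock_t start = clock();
            workload();
            t[r] = (double)(clock() - start) / CLOCKS_PER_SEC;
        }
        qsort(t, RUNS, sizeof t[0], cmp_double);
        printf("min %.3fs  median %.3fs  max %.3fs\n",
               t[0], t[RUNS / 2], t[RUNS - 1]);
        return 0;
    }

If the minimum and maximum are far apart, something other than the workload is probably varying between runs, and that's worth finding before you publish the number.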

If the product you're testing is still in development, please resist the urge to tune it specifically for the benchmark. You might get better results on the benchmark, but you might well get worse results on real-world applications. Better, perhaps, is to use detailed benchmarks to try to identify weak points.


One benchmark is not enough

Even the "best" benchmark is not enough. The Top 500 World Supercomputers list has been built around a single benchmark -- Linpack. This is a pretty good benchmark, and using it allows reasonable long-term comparisons of a particular variety of computation: high-speed floating-point math. But there are application domains that Linpack doesn't measure. Some allege that consistent top placements on the Top 500 list might reflect compilers, or even whole systems, tuned specifically for the Linpack benchmark rather than for overall performance. Whether there's intentional subterfuge or not, though, it's probably important to get tests of other kinds of capacity, too. One defense of such tweaks is that they might give a better picture of what an application can do when tuned for a specific platform.

A variety of vendors are working on a broader suite of benchmarks for use in testing supercomputers. While individual benchmarks in the suite might be "worse" benchmarks in some sense than Linpack, they will give users more information. On the down side, users will have to figure out which benchmark numbers are closest to their application domain, and possibly combine scores in some way to get an overall value that tells them what they need to know. One hopes their existing computers will be up to the task.


Using benchmark results

Benchmarks can test components or whole systems. A benchmark of a CPU can tell you whether it is even possible for that CPU to meet your performance needs. If you're developing an embedded system, you need to know how much the processor can actually do, or how much CPU time it'll take to saturate the network with that particular network card. This is where benchmark results, even informal ones, can be necessary. If a given CPU can't possibly handle the workload you anticipate, then you don't have to build a system around it; that could save you a lot of time.

Benchmark results are always a little approximate. Do not build your system such that, if the benchmarks are accurate, it will be exactly capable of performing as desired. You should allow a fair bit of leeway for various problems: benchmark flaws, differences between your workload and the benchmark, and feature creep. In general, any computer will eventually be put to uses you can't possibly anticipate, so leave yourself some slack.
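
As a back-of-the-envelope version of that leeway (the capacity figure and the 40% margin below are invented, not recommendations):

    /* Capacity planning with headroom; all figures are invented.
       If the benchmark says 1,000 requests per second, don't design
       the system to need all 1,000 of them. */
    #include <stdio.h>

    int main(void)
    {
        double benchmarked_capacity = 1000.0; /* requests/sec, per benchmark */
        double headroom             = 0.40;   /* reserve for benchmark error,
                                                 workload mismatch, and
                                                 feature creep */
        double planned_load = benchmarked_capacity * (1.0 - headroom);

        printf("Plan for no more than %.0f requests/sec\n", planned_load);
        return 0;
    }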

Not every benchmark is relevant to your tasks, although the emotional impact of a benchmark can be significant. IBM has some high-end printers which can allegedly paper the outside of a building in minutes; it's not that you need to do this very often, but it certainly gives an impression of ludicrous speed. (I'll be looking them up next Halloween, though.)

Resources
