Starting out as a government project 10 years ago, IBM Research’s high performance computing project, PERCS (pronounced “perks”), led to one of the world’s most powerful supercomputers, the Power 775. This July, the Power 775 continued to prove its power by earning the top spot in a series of benchmark components of the HPC Challenge suite.
IBM Research scientist Ram Rajamony, who was the chief performance architect for the Power 775, talks about how the system beat this HPC Challenge.
How did PERCS become the Power 775?
Ram Rajamony: In 2002, DARPA (U.S. Defense Advanced Research Projects Agency) put out a call for the creation of commercially viable high performance computing systems that would also be highly productive.
Our response was named PERCS: Productive Easy-to-use Reliable Computing System. From the start, our goal was to combine ease of use with significantly higher efficiency than the state of the art at the time (Japan’s Earth Simulator was the top-ranked supercomputer that year with a peak speed of 41 TFLOPS).
After four years of research, the third phase of the DARPA project, which started in 2006, resulted in today’s IBM Power 775.
How did PERCS make the Power 775 unique?
RR: It’s all in the software and hardware magic we put into the system!
[Image: PERCS chip design]
PERCS blazed the trail for a whole set of new technologies in the industry. We produced the first 8-core, 4-way simultaneous multithreaded (SMT) processor: the POWER7 chip.
The compute workhorse is the 3.84 GHz POWER7 processor. We house four of these on a ceramic substrate to create a compute monster that has a peak performance of 982 GFLOPS, a peak memory bandwidth of 512 GB/s, and a peak bi-directional interconnect bandwidth of 192 GB/s. These advances resulted directly from the PERCS program.
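As a rough check on the peak figure quoted above, the arithmetic can be sketched in a few lines of Python. The clock speed, core count, chip count, and memory bandwidth come from the interview; the figure of 8 floating point operations per core per cycle is an assumption that the interview does not state, and the function names are purely illustrative.

```python
# Back-of-the-envelope peak rate for one four-chip POWER7 module.
CLOCK_GHZ = 3.84            # from the interview
CORES_PER_CHIP = 8          # from the interview
CHIPS_PER_MODULE = 4        # from the interview
MEMORY_BANDWIDTH_GBS = 512  # from the interview
FLOPS_PER_CYCLE = 8         # ASSUMPTION: per core, not stated in the interview


def peak_gflops(chips=CHIPS_PER_MODULE, cores=CORES_PER_CHIP,
                clock_ghz=CLOCK_GHZ, flops_per_cycle=FLOPS_PER_CYCLE):
    """Peak rate in GFLOPS: cores x clock (GHz) x operations per cycle."""
    return chips * cores * clock_ghz * flops_per_cycle


def bytes_per_flop(bandwidth_gbs=MEMORY_BANDWIDTH_GBS):
    """Peak memory bytes available per peak floating point operation."""
    return bandwidth_gbs / peak_gflops()


if __name__ == "__main__":
    print(f"peak: {peak_gflops():.2f} GFLOPS")
    print(f"memory balance: {bytes_per_flop():.2f} bytes per FLOP")
```

Under that assumption the result lands at roughly 983 GFLOPS, within rounding of the 982 GFLOPS quoted, with about half a byte of memory bandwidth per peak floating point operation.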
Then, we coupled each set of four POWER7 chips with an interconnect Hub chip codenamed Torrent, which in turn connects to other Hub chips through 47 copper and optical links and moves data over these links at more than 8 Tbps. (No typo here. That is indeed eight terabits per second!)
Cool features abound, but one in particular is how the Hub chip can translate program variable addresses in incoming packets into physical memory addresses. When used in conjunction with a special arithmetic logic unit in the POWER7 memory controllers, we get amazingly fast atomic operations.
But it’s not just about the hardware. Through PERCS we added numerous innovations in areas such as the operating system, compilers, systems management tools, programmer aids, and debuggers. We even have a new language called X10 that developers can use.
What is the HPC Challenge, compared to the Top500, Graph500, and others?
Fast Fourier Transform
The FFT is an algorithm to compute the Fourier transform of a signal, transforming it from one domain, such as the time domain, to another, such as the frequency domain. FFTs are the backbone of signal processing and are used in a wide variety of areas, such as music, medicine and astronomy.
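To make the time-to-frequency idea concrete, here is a minimal sketch of a textbook recursive radix-2 FFT in plain Python. It is an illustration only, not the tuned distributed FFT that the HPC Challenge measures; the function names and the 5 Hz test tone are made up for the example.

```python
import cmath
import math


def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT of a power-of-two-length sequence."""
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])  # transform of the even-indexed samples
    odd = fft(samples[1::2])   # transform of the odd-indexed samples
    result = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        result[k] = even[k] + twiddle
        result[k + n // 2] = even[k] - twiddle
    return result


def dominant_frequency(samples, sample_rate):
    """Return the frequency (in Hz) of the strongest non-DC component."""
    spectrum = fft(samples)
    half = len(samples) // 2
    peak_bin = max(range(1, half), key=lambda k: abs(spectrum[k]))
    return peak_bin * sample_rate / len(samples)


if __name__ == "__main__":
    sample_rate = 64  # samples per second (illustrative)
    tone = [math.sin(2 * math.pi * 5 * t / sample_rate) for t in range(64)]
    print(dominant_frequency(tone, sample_rate))  # a 5 Hz tone gives 5.0
```

Each level of the recursion combines results from every part of the input, which hints at why a large FFT spread across many nodes leans so heavily on the interconnect.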
RR: The HPC Challenge suite was constructed to stress different parts of a system such as compute, memory bandwidth, and communication capability. It also contains components such as the FFT, which is difficult to run at high efficiency on computing systems but which is often indicative of how entire classes of workloads will perform.
The HPC Challenge gives you a nice fingerprint of your system’s performance across numerous dimensions that show how a system may perform on a real-world workload.
For comparison, the Top500 rankings order systems based on their FLOP rate when computing the Linpack Benchmark. These rankings are biased towards indicating only a system’s compute capability. The newer Graph500 benchmark measures how fast you can traverse a graph and compute metrics similar to the Bacon number over a social network.
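A Bacon-number-style metric is simply the hop count from one chosen node to every other node, which a breadth-first search computes. The sketch below is a toy, single-machine illustration with a made-up graph and hypothetical names; it is not the Graph500 reference code.

```python
from collections import deque


def hop_counts(graph, source):
    """Breadth-first search: fewest hops from source to each reachable node."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in distances:  # first visit is the shortest path
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


if __name__ == "__main__":
    # Hypothetical "worked together" graph; the names are placeholders.
    graph = {
        "hub": ["a", "b"],
        "a": ["hub", "c"],
        "b": ["hub"],
        "c": ["a", "d"],
        "d": ["c"],
        "e": [],  # not connected to anyone
    }
    print(hop_counts(graph, "hub"))
```

Nodes that cannot be reached from the source, such as "e" here, simply do not appear in the result.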
Giga-Updates per Second (GUPS) and MegaFlops (MFLOPS) are as different as apples and oranges. (Actually, I should rephrase that because recent research has shown how apples and oranges are indeed very much alike, calling into question the validity of that analogy.)
MFLOPS measure the compute characteristic of a system: the number of floating point operations, in millions, that can be executed every second. Systems have a peak FLOP rating as well as an achieved FLOP rating when executing different workloads, such as the Top500’s Linpack.
GUPS measure the rate at which the system can perform random updates to a large set of values that are distributed across the memory in the system. The idea is to find out how well a system can handle a workload that requires extremely fine-grained communication with no locality. The lack of locality in this context refers to the fact that contiguous operations in time are directed at values stored in very different places. The GUPS workload has traditionally been brutal on systems, but is representative of workloads that just don’t have the locality characteristics that machines are optimized to handle well.
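The access pattern described above can be sketched on a single machine as follows. This is a simplified illustration in the spirit of the HPC Challenge RandomAccess test, not the benchmark itself: the default seed, the table size, and the function names are assumptions chosen for the example, and a pure-Python loop says nothing about what real hardware can achieve.

```python
import time

POLY = 0x7                 # feedback polynomial for the shift-register sequence
MASK64 = (1 << 64) - 1     # keep values within 64 bits


def next_random(ran):
    """Advance a 64-bit shift-register pseudo-random sequence by one step."""
    high_bit = ran >> 63
    return ((ran << 1) & MASK64) ^ (POLY if high_bit else 0)


def random_updates(table, num_updates, seed=0x0123456789ABCDEF):
    """XOR pseudo-random values into pseudo-randomly chosen table slots."""
    size_mask = len(table) - 1  # table length must be a power of two
    ran = seed                  # illustrative seed, not the benchmark's
    for _ in range(num_updates):
        ran = next_random(ran)
        table[ran & size_mask] ^= ran  # consecutive updates hit unrelated slots
    return table


def measure_gups(log2_size=16, updates_per_entry=4):
    """Time the update loop and return giga-updates per second."""
    size = 1 << log2_size
    table = list(range(size))
    num_updates = updates_per_entry * size
    start = time.perf_counter()
    random_updates(table, num_updates)
    elapsed = time.perf_counter() - start
    return num_updates / elapsed / 1e9


if __name__ == "__main__":
    print(f"{measure_gups():.6f} GUPS")
```

Because XOR is its own inverse, replaying the same update sequence restores the original table, which gives a simple way to check that no update was lost. On a distributed system each update may target memory on a different node, which is why the workload stresses fine-grained communication.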
What did your team do that put the Power 775 in the #1 position on the HPC Challenge?
RR: Well, we began many years ago with the goal of disrupting the status quo of interconnect-intensive workloads. Many of our performance metrics show linear scaling as the system size increases. In other words, for workloads like GUPS, PTRANS (which is a measure of the interconnect bisection bandwidth), and FFT (which is a workload that stresses all three elements: compute, memory, and interconnect), the system performance increases linearly with the addition of hardware.
This is unheard of in a typical system. In that sense, the Power 775 has been extremely disruptive as evidenced by the large margins by which we have taken over the number one position in the HPC Challenge results.
What does being #1 on this list mean for the Power 775’s capabilities?
RR: People have always grappled with how to structure large-scale computing systems. If you look at HPC systems in existence today, there is a spectrum of solutions with different compute and interconnect characteristics. Each of these solutions works well for the particular problem that it is used to solve.
The advantage of the Power 775 is that it is a general purpose system. It has a completely homogeneous compute component, which leads to a simple mental model of how the system works. The communication prowess of the system is forgiving of how programmers write their programs, making it easy to write high-performance programs for the Power 775.
And while the system is suited for general purpose high-performance computing, it shines especially well on workloads that need more interconnect performance and capabilities.