Meet the experts: John McCalpin

On the POWER5, Simultaneous Multi-Threading -- and the true origins of AIX

This question and answer article features IBM Senior Technical Staff Member John McCalpin on his work on the POWER5 and in high performance computing; on the Hypervisor and the size of POWER5 chips; on 128-bit computing -- and even on why he became a computer scientist instead of an entomologist.


John D. McCalpin is a physicist, an oceanographer, a computer scientist, a supercomputer specialist -- and one of hundreds of engineers and scientists working on the IBM® line of high-end POWER™ processors. He recently took time from his very busy schedule to grant this interview:

developerWorks: I wanted to just ask two questions about you, to start, if I can. One was, when you were small, what did you want to be when you grew up and how did that end up turning into what you do now?

John D. McCalpin: It has changed more times than I could easily count [laughs]. I believe that when I was in fourth grade, I wanted to be an entomologist -- except that it turns out I hated bugs, so that was not something that worked. I ended up with an interest in physics and electronics, and chose to get a bachelor's degree in physics because it was 18 hours less coursework than a bachelor's degree in electrical engineering. Then, I ended up with my master's and Ph.D. in oceanography, and was a professor of oceanography for six years at the University of Delaware. For a variety of reasons, that didn't work out, and I moved into the computer industry in 1996. But it wasn't as large a switch as it may sound. I was doing high performance computing for large-scale climate modeling for that period of my life. So switching that to doing performance modeling for a computer company was not terribly different.

dW: So on your IBM Blue Pages [IBM internal directory -eds.] profile, under your name where job titles usually go, there's Technical Computing Performance. That doesn't sound like a job title.

McCalpin: Oh, I mean, I'm an STSM [Senior Technical Staff Member -eds.]. That's my classification, but we don't have any other kinds of titles here. On my office door the sign says, "Dr. Bandwidth" -- is that a title? [John developed the STREAM benchmark, which is the de facto industry standard benchmark for measuring sustainable memory bandwidth -eds.]
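For readers who have not met STREAM, the idea of measuring sustainable memory bandwidth can be sketched in a few lines. What follows is a minimal illustration in Python, not the official benchmark: it times a triad-style kernel (a[i] = b[i] + scalar * c[i]) and counts 8 bytes per element for each of the three arrays, as one would for arrays of doubles. The function name triad_bandwidth and its parameters are our own assumptions, and because Python lists carry interpreter overhead, the number it prints is illustrative only.

```python
import time

def triad_bandwidth(n=200_000, scalar=3.0, repeats=5):
    """Time a[i] = b[i] + scalar * c[i]; return the result and the best MB/s seen."""
    b = [1.0] * n
    c = [2.0] * n
    a = []
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        a = [bi + scalar * ci for bi, ci in zip(b, c)]
        best = min(best, time.perf_counter() - start)
    # Two arrays are read and one is written per pass, at 8 bytes per element
    # (an assumption that matches doubles, not Python's real list layout).
    bytes_moved = 3 * 8 * n
    return a, bytes_moved / best / 1e6

if __name__ == "__main__":
    _, mb_per_s = triad_bandwidth()
    print(f"triad-style kernel: about {mb_per_s:.0f} MB/s (illustrative only)")
```

Taking the best of several repeats is a common way to reduce timing noise in micro benchmarks of this kind.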

dW: Can you describe what you do all day? Or, if it's not possible to typify a typical day, then what does your work week usually look like?

McCalpin: I lead a lot of the, or the majority of the, HPC performance analysis on POWER5™, particularly with respect to the memory subsystem. And so I spend a lot of time working on developing and applying micro benchmarks to test the hardware to make sure that the performance is what it's supposed to be, and correlating those to application performance. I maintain a set of data and a methodology for predicting performance on the SPEC CPU benchmarks on all of our POWER5 systems.

So that's sort of the technical side of the work. I'm involved in system architecture definition and performance analysis for POWER7™ and other future systems.

dW: POWER seven?

McCalpin: That's sort of a far off thing. It looks far off, but IBM is in the middle of receiving a US$53 million grant from DARPA to develop high-performance computing systems for the 2010 time frame. I'm the technical leader of the system architecture team in that project. So, I'm spending time looking at system architecture, and looking at how we can get the processor to play nicely with the system, how to drive costs out of the system so that we can deliver what the customer is interested in in that time frame.

dW: That actually leads into one of the things I did want to ask about, about the way that the design process at IBM is structured. I did know that the POWER5 and 6 are being worked on, even as POWER4™ maybe is finishing up. I've also seen, oftentimes on the Internet, that some people think that POWER chips are in mainframes -- which they aren't, are they?

McCalpin: Not currently.

dW: But my understanding is that we do borrow a lot of technology from the mainframes. And a lot of our technology trickles down into PowerPC® and maybe sometimes trickles up as well. I was wondering if you could talk a little bit about that process, how it is at IBM. How do so many different groups manage to communicate and work together?

McCalpin: We spend a great deal of time in meetings [laughs]. What has primarily happened with the mainframe input is that...many of the senior people from the mainframe group have been moved into the POWER Development Group, and they have brought their expertise in reliability issues. Many of these reliability-enhancing features were implemented in POWER4, many more in POWER5, and many more in POWER6™.

dW: Those are things like the error-checking in hardware?

McCalpin: Correct. The ability to recover from errors while continuing to run. One example, some of that even showed up as early as POWER3™. One of the jobs that my colleagues and I had to do in POWER3 was add performance tests to the manufacturing line, because we discovered that we had so much ability to cope with errors that machines that were very badly broken were still passing functional tests, they just passed them too slowly. So we had to add additional tests to ensure that they were actually passing the functional tests at the right speed. And that continues to be an issue, because our machines do tolerate a large number of different types of failures.
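The idea of a functional test that must also finish on time can be sketched as follows. This is a hypothetical illustration, not IBM's manufacturing test: run_timed_check, healthy_unit, and degraded_unit are names we made up, and the sleep call merely stands in for time lost to error recovery.

```python
import time

def run_timed_check(check, max_seconds):
    """Run a functional check; pass only if it is correct AND fast enough."""
    start = time.perf_counter()
    correct = check()
    elapsed = time.perf_counter() - start
    return {"correct": correct,
            "elapsed": elapsed,
            "passed": correct and elapsed <= max_seconds}

def healthy_unit():
    # Produces the right answer at normal speed.
    return sum(range(1000)) == 499500

def degraded_unit():
    # Produces the right answer, but only after (simulated) recovery delays.
    time.sleep(0.05)
    return sum(range(1000)) == 499500

if __name__ == "__main__":
    for unit in (healthy_unit, degraded_unit):
        result = run_timed_check(unit, max_seconds=0.02)
        print(unit.__name__, "passed" if result["passed"] else "failed")
```

Both units return the correct answer, so a purely functional test would accept both; only the time budget separates them.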

So, the issues as far as the design of the system are argued out in many meetings by the technical leadership, some of whom are first and second line managers and some of whom are STSMs and DEs [Distinguished Engineers -eds.] and Fellows. We eventually come to a level of consensus about which of the items that have been proposed are fundable and the remainder are not.

SMT, Hypervisor, and virtualization

dW: Can I ask you about the Simultaneous Multi-Threading?

McCalpin: Of course.

dW: That is considered a very big deal. I have a couple of different questions about it. One is, if I were already familiar with multi-threading in hardware or software, is it going to be difficult for me to understand this?

McCalpin: Simultaneous Multi-Threading is not hard to understand. In traditional designs, the entire collection of functional units in the CPU belongs to one process at a time. And so, a process can be executing instructions in some or all of the functional units of the processor and nobody else can be using it at that instant. With a feature that we called hardware multi-threading in the RS64 line of processors a few years ago, we provided additional hardware resources that allowed two processes to have their state, essentially, on chip. And when a process had a cache miss that would normally stall it, it would switch to the other process, the other thread, with a three-cycle pause. So this was still only one process executing at a time, but could switch back and forth between two of them very, very rapidly.

In Simultaneous Multi-Threading we have widened the data path somewhat to allow a thread indicator on each instruction. So we actually can fetch from two different instruction streams and have instructions from two different instruction streams issuing simultaneously to the different functional units on the chip. ...

We currently support two threads on the system and it's a very general-purpose mechanism. You can have instructions from different threads in different pipeline stages of the same functional unit just following each other through. And it provides the ability to use the hardware, the processor functional units, much more efficiently.
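The three approaches described above can be compared with a toy cycle counter. This is a deliberately simplified sketch and not a model of any real processor: only the three-cycle switch pause comes from the interview, while the miss latency, the issue width, and the rule that each thread issues at most one instruction per cycle are our own assumptions, as are all the function names.

```python
MISS_LATENCY = 10  # assumed stall, in cycles, after a cache miss
SWITCH_COST = 3    # the three-cycle pause described for hardware multi-threading
ISSUE_WIDTH = 2    # assumed number of instructions that can issue per cycle

def run_one_at_a_time(threads):
    """Traditional design: each thread runs alone and waits out its own misses."""
    return sum(1 + (MISS_LATENCY if ins == "miss" else 0)
               for thread in threads for ins in thread)

def run_hmt(threads):
    """Hardware multi-threading: one thread issues; a miss triggers a switch."""
    pos = [0] * len(threads)
    ready_at = [0] * len(threads)
    cycle, cur = 0, 0
    while any(p < len(t) for p, t in zip(pos, threads)):
        can_issue = pos[cur] < len(threads[cur]) and ready_at[cur] <= cycle
        if not can_issue:
            others = [i for i in range(len(threads)) if i != cur
                      and pos[i] < len(threads[i]) and ready_at[i] <= cycle]
            if others:
                cur = others[0]
                cycle += SWITCH_COST
            else:
                cycle += 1  # every unfinished thread is waiting on memory
            continue
        ins = threads[cur][pos[cur]]
        pos[cur] += 1
        cycle += 1
        if ins == "miss":
            ready_at[cur] = cycle + MISS_LATENCY
    return cycle

def run_smt(threads):
    """SMT: every ready thread may issue one instruction in the same cycle."""
    pos = [0] * len(threads)
    ready_at = [0] * len(threads)
    cycle = 0
    while any(p < len(t) for p, t in zip(pos, threads)):
        issued = 0
        for i, thread in enumerate(threads):
            if issued < ISSUE_WIDTH and pos[i] < len(thread) and ready_at[i] <= cycle:
                if thread[pos[i]] == "miss":
                    ready_at[i] = cycle + 1 + MISS_LATENCY
                pos[i] += 1
                issued += 1
        cycle += 1
    return cycle

if __name__ == "__main__":
    work = ["op", "op", "miss", "op", "op"]
    threads = [list(work), list(work)]
    print(run_one_at_a_time(threads), run_hmt(threads), run_smt(threads))
```

With these assumed numbers, two identical five-instruction threads take 30 cycles back to back, 24 cycles with switch-on-miss, and 15 cycles when both can issue in the same cycle. The exact figures mean nothing; the ordering is the point.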

dW: I have seen a lot of articles about SMT, but not so much space has been devoted to some of the other features in POWER5, which seem like they are getting sort of short shrift. Things like eFUSE haven't gotten so much discussion -- or the Hypervisor™. Can you tell me about the Hypervisor? Is it important?

McCalpin: The Hypervisor's important for IBM's virtualization strategy. Traditionally, the operating system's job is to provide an interface between the user and the hardware and to provide protection so that the user can't access any part of the hardware in an uncontrolled way. In current directions, especially in server consolidation projects, one finds that you want to run multiple operating systems on the same piece of hardware, especially with all of the operating system exploits and security problems that are happening.

So, what we have done is added a new operating system, essentially, called the Hypervisor that sits between the operating systems and the hardware. The Hypervisor is modestly complicated, but it's enough smaller than an operating system that one can have a lot more confidence about its reliability. And now when the operating system wants to interact with the hardware, it has to do it only through the Hypervisor or with the permission of the Hypervisor. So that you can have, for example, multiple Linux kernels running on the same hardware, and even if there's a security problem in Linux and someone compromises the kernel, that kernel is still prevented from interfering with any of the other partitions on the machine.

dW: Is there any conflict between SMT and the virtualization? Again, my understanding, I could have as many as hundreds of virtual machines running on one server. Can I do that and multithreading, or is that too many things?

McCalpin: They're done at the same time -- well sort of. When you do virtualization with subprocessor partitioning, those partitions are timeshared on the processor. So they're not literally running at the same time, but they're alternated. So each of the partitions can use the multi-threading while it's running on the processor.

dW: Is that efficient, or is there any downside to doing that?

McCalpin: There's some loss of efficiency whenever you do timesharing as you move one set of, well, as you switch to a different set of processes, they're going to want different data in the cache. So they're going to bring their data in the cache and displace the data from the previous process. And if you do that often, it will slow you down somewhat.
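The cache effect described here can be sketched with a small simulation. It is a toy under stated assumptions: a least-recently-used cache that holds exactly one partition's working set, two partitions named A and B, and helper names (count_misses, timeshared) that are ours rather than anything from IBM.

```python
from collections import OrderedDict

def count_misses(schedule, capacity):
    """Count misses in an LRU cache for a sequence of (partition, line) accesses."""
    cache = OrderedDict()
    misses = 0
    for key in schedule:
        if key in cache:
            cache.move_to_end(key)          # mark as most recently used
        else:
            misses += 1
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)   # evict the least recently used line
    return misses

def timeshared(slice_len, total_per_partition=64, working_set=8):
    """Alternate partitions A and B, each looping over its own working set."""
    schedule = []
    done = 0
    while done < total_per_partition:
        for part in ("A", "B"):
            schedule += [(part, (done + i) % working_set) for i in range(slice_len)]
        done += slice_len
    return schedule

if __name__ == "__main__":
    for slice_len in (64, 8):
        misses = count_misses(timeshared(slice_len), capacity=8)
        print(f"slice of {slice_len} accesses: {misses} misses out of 128")
```

In this toy, long slices let each partition warm the cache once and then hit, while short slices make every access a miss because each partition keeps displacing the other's data.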

dW: Is there any potential that SMT would disappear and come back later again? Because it did, actually, didn't it? I mean, it had been HMT, and it wasn't in several generations and now it's back again (with changes)?

McCalpin: Right. At this point, I don't anticipate SMT disappearing, although it's certainly possible that we could choose to not use it in some processor somewhere. It is generally considered to be a cost effective thing to do.

dW: And the numbers I saw were that it can increase performance by 30 to 40%?

McCalpin: It varies from negative, in some cases, to up to 60% in some cases. So it is not at all unusual to see 20 to 30% speed up on applications without even doing anything special.

dW: I had one question which might be sort of naive, but -- the processor has two cores: I was wondering, do the logical processors map physically to a physical core, is it that simple of a one-to-one?

McCalpin: It is, but we tend to not provide guarantees on what that mapping is going to be because it's allowed to change. The operating system does know which processor numbers correspond to independent physical resources.

For example, a four-chip system has eight processor cores and, running in SMT mode, appears to have 16 processors.

But if you're only running four processes at a time, the operating system will spread the processes out and run one process on each chip to get the best performance.

dW: But it won't run them on both cores at the same time, will it?

McCalpin: It will if it needs to. So, if you have, again, on a four-chip system, if you're running eight processes, you'll get one on each core, two on each chip. And then if you start running more processes, it will start adding them to essentially the virtual processors provided by Simultaneous Multi-Threading.
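The fill order described in this exchange (one process per chip, then the second core on each chip, then the SMT threads) can be written down as a simple ordering. This is only a sketch of the policy as stated in the interview: the (chip, core, thread) tuples and the function names are hypothetical, and, as noted above, the real mapping is not guaranteed and can change.

```python
def placement_order(chips=4, cores_per_chip=2, threads_per_core=2):
    """List logical processors as (chip, core, thread) in the order they get filled."""
    return [(chip, core, thread)
            for thread in range(threads_per_core)   # SMT threads are used last
            for core in range(cores_per_chip)       # second cores come next
            for chip in range(chips)]               # spread across chips first

def place(n_processes, **system):
    """Return the logical processors used by the first n_processes processes."""
    return placement_order(**system)[:n_processes]

if __name__ == "__main__":
    print("logical processors:", len(placement_order()))
    print("4 processes:", place(4))
    print("8 processes:", place(8))
```

With the defaults, four processes land on four different chips, eight processes occupy every core once, and only the ninth process starts sharing a core through SMT.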

dW: Is that very hard to program for at the operating-system level?

McCalpin: For the most part, it's not particularly hard. The difficult part is handling the cases when processors are deconfigured because of errors or failures, and you have to figure out how to jump over the holes and things like that.

Operating systems (and a digression about documentation)

dW: The POWER line runs AIX® and Linux. Is there any work being done for other operating systems?

McCalpin: We also run OS/400®.

dW: Oh! I'm sorry -- I forgot that one. I was thinking particularly BSD®.

McCalpin: That's an interesting question. I haven't heard anybody talk about BSD on POWER.

dW: Because I know the BSD people would be happy.

McCalpin: Well, they're welcome to purchase machines and port it.

dW: Well, one thing that I have heard from not only BSD people, but many people, is -- it can be really hard to get documentation, especially, maybe -- and tell me if I'm wrong about this, but since the POWER chips are only in IBM machines --

McCalpin: Right.

dW: Is it the case, then, that there's much less in the way of data sheets and so on that are available than there would be for a processor like a 970 or some industry processors, which do get integrated into another maker's systems?

McCalpin: Right. There are several levels to that. The PowerPC architecture is published, it's a three-volume architecture definition. It's actually available in PDF format (see Resources). So Book I describes the instruction set, and Books II and III describe the virtual memory and operating system environment. Book IV describes, for each processor, the implementation-specific details. Then what you need beyond that to implement an operating system is you need to know about the service processor and the firmware hooks, and where the memory-mapped I/O addresses are, and a lot of ugly stuff that IBM tends to not publish.

dW: Do you know why they don't publish that? Because it's not only the people who are implementing, but I know there are so many enthusiasts especially who just love the POWER line and would just love to read about it. Is it trade secret stuff?

McCalpin: There are probably different answers. Some of it, well the Book IV for each of the processors is considered "IBM Confidential," because it does include information on trade secrets. Most of the other material that you would need to write an operating system, for example, it's not so much secret as it is that there's nobody whose job it is to make that available. That is, of course, not the case for our embedded processors, where we provide a lot of support.

dW: That is a shame. I have one more on the subject of OSes -- was AIX really designed by space aliens?

McCalpin: I hadn't heard that one. It feels like it was designed by, [not space aliens, but] in some part by people with mainframe backgrounds, but I certainly wasn't working for IBM back in the early days of AIX. I did notice that when I was a customer, [and] bought my first IBM RISC System/6000® in 1990, that AIX didn't look like the other UNIXes I was used to. ...

From the user perspective, it's essentially the same as any other UNIX®, just sometimes things are found in different places. From the system administration perspective, there's a little bit of extra overhead in learning how to use it, partly because AIX offers additional features that are not offered in other UNIX operating systems. And therefore, you have to have more extensive and different kinds of administrative tools to control those features.

Who's on first?

dW: Can we talk a little bit about the size of the actual POWER5 chip?

McCalpin: I don't actually know the number off the top of my head.

dW: I have a few numbers.

McCalpin: I think I saw it in a presentation. [laughs].

dW: I have this, and you just tell me if it sounds more or less right. I have 389 square millimeters for each processor, which are then packaged in the modules.

McCalpin: Right, it's -- that's for the chip. That's not a processor size.

dW: That's for the chip.

McCalpin: Right.

dW: What's the difference?

McCalpin: Chip includes two processors and 1.875 megabyte Level 2 cache.

dW: I thought a chip had two cores, it actually has two processors?

McCalpin: Well, a core and a processor, I'm going to call those the same thing.

dW: Okay...

McCalpin: That's actually a very important point. IBM tries to be very consistent in calling a chip a chip, and saying that a processor and a core are the same thing.

dW: So if there are two processors on one chip, is that still a microprocessor then, if we're talking about these kinds of definitions?

McCalpin: It's still a microprocessor, because the microprocessor means, in colloquial terms, no more than one chip per processor. This is a half a chip per processor.


dW: Okay. So we have 389 square millimeters per... chip.

McCalpin: Per chip.

dW: And 4 chips per module.

McCalpin: There are four POWER5 chips plus four L3 chips on a multi-chip module in the p5 590 and p5 595.

dW: Okay... And the size of the multi-chip module is 95 millimeters square?

McCalpin: That sounds about right, yes.

dW: Which is about the size of a man's palm. Which is huge!

McCalpin: Yes, it's almost four inches on a side.

dW: But one thing I couldn't find was anything about the power consumption and heat dissipation?

McCalpin: We are not going into details externally, but we do emphasize that we have aggressive and extensive power management functions that allow us to get a lot more performance out of this chip than we would if we didn't have these features that allow us to power down certain parts of the chip without any performance penalties.

Automatically laid-out versus hand-tuned circuits

dW: Especially at Ars Technica, but also on many of the other enthusiast sites, I've seen comments where people say that we should be hand-tuning the POWER chips instead of having that automated. Can you defend that design decision? That we don't hand-tune them?

McCalpin: Well, the last company that spent a huge amount of headcount on manual tuning and layout of circuits was DEC and it killed DEC and then it killed Compaq and now the Alpha line of processors has reached its end of life, so...

dW: Is it comparable to the difference between having a compiler compile your code and hand-tuning the assembly?

McCalpin: Yes, you could certainly make that analogy. And over time, what you find is the fraction of code that actually benefits from being hand tuned gets smaller and smaller, because compilers are very good. So you have the choice, we could certainly figure out how to make our processors, well, "certainly" ... there are certain limiters to processor frequency and/or performance that could be reduced by spending hundreds of man years and schedule to tune various circuits. But you don't -- schedule is not something that you want to lose and this sort of extensive manual tuning takes time.

Gazing into the future: POWER6, POWER7, and 128-bit processing

dW: I'm not trying to get you to say anything you're not supposed to say -- but can you tell us anything about POWER6 or POWER7?

McCalpin: Almost nothing. Even under nondisclosure, I'm not allowed to say very much about them.

dW: How about, again, I really don't want to get you to say something you're not supposed to, could we talk about then maybe some of the areas that are being worked on? Not with a view to saying that these things are going to be delivered or even might be delivered -- but where can we improve as far as, I mean -- packaging is a problem for everybody in the industry right now; heat dissipation is, too --

McCalpin: Right. We have clear efforts underway in technology related to reducing leakage current that allows us to drive the frequencies higher, while consuming less power. We have efforts that are targeted at allowing us to build larger SMPs and efforts that are targeted at allowing us to build cheaper entry-level systems. In the long-term, our government-funded research is pursuing a wide variety of technology investigations related to memory system architectures and operating systems and packaging... interconnect technologies are being -- a lot of optical interconnect technologies are being -- investigated as part of our DARPA-funded proposal.

So, there's a lot of exciting stuff happening. And the reason that we are not talking about POWER6 or POWER7 in detail is really one of expectation setting, that there are hundreds of people working on these projects. We have cycle-accurate models of the current design proposals. This is very definitely real, but even though the microprocessor design is pinned down at this point, for the most part, the actual systems that we're going to support are not fully decided yet. So details like frequencies and processor counts and bandwidths, we don't want to be getting expectations set to specific numbers at this point.

dW: I have one future computing question -- the POWER processor line, as far as I know, has been 64 bit for quite a long time -- 64 bits isn't new in UNIX circles, although it has gotten a lot of press on the desktop now. But will we ever move to 128-bit? Or is that just too big?

McCalpin: 128 bits is probably too big. The "bit size" usually refers to how big the addresses are and the corresponding registers that operate on addresses. 32 bits was good for a long time, because it allows you 4 billion addresses; 64 bits allows you 4 billion times 4 billion addresses -- and we don't anticipate technology changes that will make 64 bits inadequate for a long time. On the other hand, there are some 128-bit features in some PowerPC processors already. What Apple® calls Velocity Engine®, Motorola® calls it AltiVec™, and IBM calls it VMX -- the media extensions that are implemented in the PowerPC 970 are 128-bit registers, so they do operate on 128-bit chunks.
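The arithmetic behind "4 billion times 4 billion" is easy to check. The short sketch below only restates the numbers in the answer above; the variable names are ours.

```python
addresses_32 = 2 ** 32   # distinct addresses with 32-bit addressing
addresses_64 = 2 ** 64   # distinct addresses with 64-bit addressing

# "4 billion times 4 billion": the 64-bit space is the 32-bit space squared.
assert addresses_64 == addresses_32 * addresses_32

# A 128-bit vector register holds several smaller values side by side,
# for example four 32-bit elements.
elements_per_register = 128 // 32

print(f"32-bit: {addresses_32:,} addresses")
print(f"64-bit: {addresses_64:,} addresses")
print(f"32-bit elements in a 128-bit register: {elements_per_register}")
```

The first line prints 4,294,967,296, which is the "4 billion" in the answer.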


dW: There are so many supercomputers these days. Would you say we need that many and that much? Do the people with these systems, do they really end up using that many teraflops?

McCalpin: The majority of high-end systems are used for throughput workloads of one kind or another. The vector machines, both from NEC and Cray™, are very well-liked by the end users because vectorization is a relatively easy thing to understand, how to write code that will vectorize. And the machines -- with relatively little effort -- give you a good utilization, you'll get a good fraction of the peak theoretical performance without a whole lot of work. And customers find that comforting. You put the code on there, you get 35% of the theoretical peak performance and you say, "Well, that's pretty good and I don't need to mess with it anymore."

On the machines that IBM sells and that HP sells and AMD™ and all of the others, the costs are much lower, but it's harder to get very high utilization on those machines, in part because they don't have so much expensive memory bandwidth. So there's an interesting discrepancy between the end users who love vector machines because they're easy to use and then the purchasing manager who doesn't like vector machines because they cost too much.

dW: And what should I be calling the non-vector machines? Are they ... scalar?

McCalpin: That's one, yes. That's not the, I guess that's one way to say it. That's a traditional word. It really doesn't get to the heart of the difference, but it's a multi-faceted topic.

dW: Are they much harder to program for?

McCalpin: Not necessarily harder to program for. It can be very difficult to get very high utilization of these processors. So, for example, it's pretty easy to put a code on a RISC-based scalar system and get 7% of peak performance. And the same code might deliver 35% of peak performance on the vector machine.
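The trade-off between utilization and cost comes down to simple arithmetic on the two percentages quoted above. The sketch below uses only those figures (7% and 35% of peak); the peak value of 100 is a placeholder we chose, and no real prices or machines are implied.

```python
def sustained(peak, percent_of_peak):
    """Delivered performance, given a peak rating and the percentage achieved."""
    return peak * percent_of_peak / 100

vector_percent = 35   # fraction of peak quoted for the vector machine
scalar_percent = 7    # fraction of peak quoted for the RISC-based scalar system

# How many times more peak capability the scalar system needs
# to deliver the same sustained performance on this code.
breakeven_ratio = vector_percent / scalar_percent

print("sustained on a peak of 100:",
      sustained(100, scalar_percent), "vs", sustained(100, vector_percent))
print("peak needed on the scalar side to match:", breakeven_ratio, "times as much")
```

Put another way, for this one code the scalar system comes out ahead only if a unit of peak performance costs less than one fifth of what it costs on the vector machine.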

dW: I hate to cut it off at this point, and I'm sorry -- it's entirely my fault -- but we spent an awful lot of time talking about general topics. And I think I did that because you are our first interview in this series, so it's ground we haven't covered before, because we are such a new zone. But I have already taken more of your time than I was supposed to -- can I ask if sometime in the future, you'll speak to us again? In future installments to this series, we'll be soliciting questions from readers -- would you be willing to field more questions on supercomputing, and especially questions from readers?

McCalpin: Sure!

dW: Thanks so much! And thank you for the interview.

McCalpin: Thank you.

So there you have it. Among the obvious questions we will want to ask next time -- like, "So how does one predict performance on the SPEC CPU benchmarks, anyway?" and more on the difference between vector and scalar supercomputers -- we must add also "What advice exactly are you giving people on where profits and losses are in supercomputing?" For, as John also told us during the interview:

Well, I only told you a little bit about what I do for a living, because the POWER5 and POWER7 things are only a part of my job at this point.
The other thing that I spend a lot of time on is working on business strategy for the high performance computing market, which involves a whole lot of different things. It involves competitive analysis, it involves market segmentation analysis and profitability analysis of different pieces of the market. It involves characterizing application requirements in different parts of the market and how those relate to different architectural approaches and cost structures of different computers.
I am very often invited by government agencies to participate in discussions on the future of supercomputing in the United States; what needs to be done in the supercomputing industry. I'm especially well known for what we call micro benchmarks and figures of merit for computer performance based on different kinds of benchmark performance measurements. So I spend a lot of time both inside IBM and outside IBM describing to people where the money is in the high performance computing market, where the profits are, where businesses lose money in high performance computing and go out of business [laughter], and where businesses make money in high performance computing and stay in business.

Fodder for -- well, another interview! You'll get a chance to ask John questions the next time he is interviewed for Meet the experts. He agreed to try to do that around the time that Blue Gene®/L comes out of prototype and is installed at Lawrence Livermore National Labs, because of course John is involved in IBM's HPC product strategy and future roadmap decisions related to Blue Gene as well.

Note that although Blue Gene/L is still in prototype, it has just been named the world's top supercomputer. Although this interview is being posted after that announcement, it was conducted before -- so we weren't able to cover that, nor the announcement of Blue Gene being offered commercially off the shelf (see Resources). The date for John's next interview will be announced as soon as it is finalized, but you can send questions for him to the developerWorks Power Architecture editors any time from now until then, and we'll present as many of them as we can to John in the next interview.

Until then, next month's guest will be Regina Darmoni, Program Director of PowerPC Licensing at IBM. So if you have any questions about licensing -- or about the Power Everywhere™ initiative, of which Regina was an early proponent -- send those in also by December 14, 2004, and we will include as many of them as we can in our next Meet the experts interview with Regina Darmoni.


