developerWorks: To begin, could each of you just say a little bit about yourselves, just one or two lines? Could we start with David?
David Murrell: Okay. I'm David Murrell. I've been working here at the IBM Austin Research Lab for about two years now working on the Mambo project, which is now termed Full-System Simulator for Cell Broadband Engine, for the alphaWorks release. Prior to that, I worked at Motorola, Rockwell and a prior stint also at IBM.
Mike Kistler: This is Mike Kistler. I've been working on the Mambo simulator, as well, for the past two years, working primarily on the STI support, which is now being released as the alphaWorks release of the Cell BE simulator.
Patrick Bohrer: Pat Bohrer. I manage the group here in Austin. We've been doing full-system simulation for a number of years, so I started engaging the STI project early on to do their simulation, and it's grown into what it is today. I started out doing a lot of the technical work, but now I'm doing a lot more of the management of the team and what have you.
dW: And I saved you for last. I understand you also made up the name Mambo.
dW: Where does it come from?
Bohrer: It had to be called something. Before, it was based on a previous simulator called SimOS for PowerPC®, and we had to have a new name for it when we made it an IBM-only, proprietary tool. So, it was just a name that didn't have the word SIM in it, since there are so many simulators that have 'SIM' in their name. Then, for alphaWorks, we were forced to give it a more docile name. So, on alphaWorks I guess there is a reference that internally we call it Mambo, but it's called the IBM Full-System Simulator for Cell, or systemsim.
dW: So, have there been a lot of "Dances with snakes" references?
Bohrer: No. That's the first time I've heard about snakes getting involved, but a lot of dance references, yes. A lot of people mispronounce it "Mumbo" and a lot of different things, so I always get the question of "why Mambo?" But it was a song I heard the day I was asked to name it.
dW: Can somebody step up and describe the simulator in like 20 words or less?
Bohrer: It's a software application that mimics real hardware in sufficient detail that you can run a full software stack [19 words!! --eds].
dW: This Mambo that we're talking about here, is this Cell-only, or can it do both the 970 and the Cell?
Bohrer: If you want the 970 support, you need to download that particular simulator. If you want the Cell support, you have to download the Cell version of the simulator.
dW: Can you discuss briefly the differences between a simulator and an emulator?
Kistler: Well, in the simulator, we're actually trying to mimic the exact behavior of the real hardware. An emulator, at least in my experience of how the term is used, is just trying to get the effects of the program at the program level. You're not as interested in mimicking the behavior down to the details of the hardware. However, I think the terms often do get used interchangeably.
dW: Okay. But a simulator is supposed to be an exact representation of the hardware?
dW: When we talk about exact representation of the hardware, are we talking cycle-accurate or functionally accurate?
Bohrer: There's a fast mode in the [release] simulator for running "fast functional." It's more intended for doing debug of your applications or porting applications when you're not really trying to do any performance tuning. Cycle-accurate modeling, which the full-blown version of the simulator has, can be used more for performance tuning. There was a decision made early on to limit the amount of cycle accuracy in the model that we're releasing on alphaWorks due to the fact that performance models are sort of tricky to make sure you have them absolutely correct.
The simulator does support cycle-accurate modeling of code running on the SPUs. But if someone wants a version of the simulator that has all of that performance modeling turned on: support for the memory nests, the buses, and what have you -- I think the plan is to have them engage E&TS to do that level of performance work on Cell.
Kistler: Yes. That's right. The simulator that we're putting out for download on alphaWorks won't have the performance models enabled for anything except the "SPU." And let me just respond to something you said in your question about whether it would be accurate down to the cycle. In fact, because Mambo is a full-system simulator, it models things at a much higher level than typical hardware simulators like VHDL simulators and so forth. VHDL simulators, because they model so much detail, are able to actually model performance down to the cycle [see Resources for more on VHDL --eds].
Because the System Simulator models at a much higher level, it gets close, but it's not going to be cycle-accurate in the sense that some people may think of when they think of VHDL simulators. So, we like to talk about performance models and we do performance simulation and we give timing estimates, but as far as accuracy down to the cycle with the hardware, that is something that we try to avoid saying.
dW: Is it possible to come up with some code where a real Cell will act slightly differently from the sim, for example if there is a race condition in the code, it might trip on the simulator, but not on the real chip, or vice versa?
Kistler: That's correct.
Murrell: It really depends on how abstractly certain parts of the system are modeled, and you have some control over that in Mambo. Typically, you trade abstraction level of detail for simulation speed, so you decide what components you're really interested in, whether you want to run the simulation in fast mode where you're moving ahead in simulation time to the area of interest. When you arrive at that area of interest, then you typically turn on the models at a much higher level of detail in order to get your timing performance modeling done.
dW: When you're doing the modeling, is it an op code interpreter within the software, within Mambo?
Murrell: Again, that depends on what abstraction level of modeling you're using. Internally, yes, there's a mode where we, in most cases, just take a single op code and crack that into its various pieces, the operation and the operands, perform that operation on simulated memory and simulated resources, and do that over and over again. When you're talking cycle-accurate models, then we will probably model also the pipeline that that instruction would flow down through, the various operations done at each stage in the pipeline, protocols that exist between components, arbitration, and so on.
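To make the "crack and execute" mode concrete, here is a minimal sketch of that kind of interpreter loop. Everything here is invented for illustration: the 32-bit encoding (8-bit opcode plus three 8-bit register operands), the opcode names, and the register file size have nothing to do with the real PowerPC or SPU encodings that Mambo actually decodes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy ISA: 8-bit opcode, then destination and two
 * source register numbers, one per byte. */
enum { OP_HALT = 0, OP_ADD = 1, OP_SUB = 2 };

typedef struct {
    uint32_t pc;        /* simulated program counter (word index) */
    int32_t  regs[8];   /* simulated register file               */
} SimCpu;

static void interpret(SimCpu *cpu, const uint32_t *mem, size_t nwords)
{
    while (cpu->pc < nwords) {
        uint32_t insn = mem[cpu->pc++];      /* fetch                 */
        uint8_t op = insn >> 24;             /* crack: the operation  */
        uint8_t rd = (insn >> 16) & 0xff;    /* crack: the operands   */
        uint8_t ra = (insn >> 8) & 0xff;
        uint8_t rb = insn & 0xff;
        switch (op) {                        /* execute on simulated  */
        case OP_ADD:                         /* resources             */
            cpu->regs[rd] = cpu->regs[ra] + cpu->regs[rb]; break;
        case OP_SUB:
            cpu->regs[rd] = cpu->regs[ra] - cpu->regs[rb]; break;
        case OP_HALT:
            return;
        }
    }
}
```

The cycle-accurate mode Murrell describes would wrap a loop like this with per-stage pipeline state rather than retiring each instruction in one step.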
dW: So, do you have a pretty good grasp on exactly how long a specific operation takes in absolute time, or is it just relative to other instructions?
Kistler: The SPU model of Cell is actually simple enough, as far as its timing characteristics, that it allows pretty reasonable estimation of these times. So, for example, since the load always goes against local store and there are no translation operations or protection checks, the timing of those operations is very deterministic. So, in the case of the SPU core, which is the thing that will actually have a performance model available in the alphaWorks release, those timings are actually very accurate.
It does get a lot fuzzier when you start to talk about the PowerPC processor because there is a lot more variation in what can happen, address translation and so forth, exceptions. Those things are more difficult as far as getting timing information. But, in general, that's what needs to be done. You need to find out the latencies of those individual events that are occurring in the processor and then apply those latencies as the events are recognized in the simulation.
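The bookkeeping Kistler describes, recognizing events and applying their latencies, can be sketched in a few lines. The event names and cycle counts below are invented for illustration; the real latencies come from the hardware documentation and characterization.

```c
#include <stdint.h>

/* Hypothetical event set with assumed fixed latencies. */
typedef enum { EV_ISSUE, EV_LS_LOAD, EV_DMA_SETUP, EV_BRANCH_MISS } SimEvent;

static const uint64_t latency_cycles[] = {
    [EV_ISSUE]       = 1,   /* assumed single-issue cost            */
    [EV_LS_LOAD]     = 6,   /* assumed local-store load latency     */
    [EV_DMA_SETUP]   = 20,  /* assumed DMA command setup cost       */
    [EV_BRANCH_MISS] = 18,  /* assumed mispredict penalty           */
};

typedef struct {
    uint64_t cycles;        /* running simulated-time estimate      */
} PerfModel;

/* As the functional model recognizes an event, charge its latency. */
static void apply_event(PerfModel *pm, SimEvent ev)
{
    pm->cycles += latency_cycles[ev];
}
```

The SPU case is tractable precisely because most of its events really do have fixed latencies like this; the PPE side needs conditional logic for translation misses, exceptions, and so on.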
dW: Is there a common code base in that PPE simulation part?
Bohrer: Yes, there is common code base that we build those simulators from.
Kistler: Part of the way that we get the performance of the simulator to a level that we think is best for our users is we actually use compile time selection of features. So, it would perhaps be possible to build a single simulator that you could run either as a 970 or as a Cell, but then we would wind up dragging in all the features of both and have to do runtime checks and so forth and it would slow down the simulation time. So, what we've chosen to do instead is to make many of those decisions at compile time and then build specialized binaries for one processor or the other.
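The compile-time selection Kistler describes might look like the following sketch: the target processor is fixed when the simulator binary is built, so the hot simulation loop carries no runtime "which chip is this?" checks. The macro and function names here are invented for illustration.

```c
/* Build with -DBUILD_970 for the 970 variant; this sketch defaults
 * to the Cell target so it compiles standalone. */
#ifndef BUILD_970
#define BUILD_CELL 1
#endif

#ifdef BUILD_CELL
static const char *target_name(void) { return "Cell BE"; }
static int num_spes(void) { return 8; }
#else  /* BUILD_970 */
static const char *target_name(void) { return "PowerPC 970"; }
static int num_spes(void) { return 0; }
#endif
```

The trade-off is exactly the one he names: two specialized binaries to maintain instead of one general one, in exchange for a faster inner loop.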
dW: How do you handle differences between like a 970 core doing out-of-order execution and then the Cell core doing in-order execution?
Bohrer: That would be more of a performance model issue there. So, there are really two significant levels that we worry about: one, getting our functional accuracy to be dead-on, and the next one is then to start looking at performance modeling. For the Cell project, the initial goal was really only to target the functional accuracy. There are lots of issues involved in just being sure that you have all the right Book IV-type of behavior that you see in the PPE core versus the 970 core, so that's a layer.
We haven't rolled out our performance modeling across all the different implementations that we support. At the minimum, we have functional models and lately, we've been adding performance model layers on top of those. That's still work in progress: some performance models are available now, others will become available later.
dW: How do you represent the hardware to the simulated environment?
Kistler: It's virtual hardware and the simulator has a configuration object that describes in basic terms what that hardware looks like. So, for example, the configuration object says what the size of the L1 cache is, what the size of the L2 cache is, how big memory is, certain characteristics of those components.
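A configuration object of this kind is essentially a record of machine parameters read at start-up. The sketch below is hypothetical: the field names and default values are invented (the real simulator exposes its configuration through Tcl), though the 32KB L1 and 512KB L2 figures do match the Cell PPE.

```c
#include <stdint.h>

/* Hypothetical configuration object describing the virtual hardware. */
typedef struct {
    uint32_t l1_dcache_bytes;   /* size of the L1 data cache   */
    uint32_t l2_cache_bytes;    /* size of the L2 cache        */
    uint64_t memory_bytes;      /* size of simulated memory    */
    uint32_t num_spes;          /* SPEs on the simulated chip  */
} MachineConfig;

static MachineConfig default_cell_config(void)
{
    MachineConfig cfg = {
        .l1_dcache_bytes = 32u * 1024,            /* 32 KB   */
        .l2_cache_bytes  = 512u * 1024,           /* 512 KB  */
        .memory_bytes    = 256ull * 1024 * 1024,  /* assumed */
        .num_spes        = 8,
    };
    return cfg;
}
```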
dW: Does this include like a network card, or a hard drive?
Kistler: It could. In the version of the simulator that we're releasing via alphaWorks, there is no provision for network cards that I'm aware of, right, Pat?
Bohrer: Well, there are lots of different layers that we support. Some of our models do have exactly that, like a serial ATA and what have you, but for this alphaWorks release of Cell, they have chosen to go this simple route where we have a simple machine. So, we have a Cell chip and then we have what we call bogus devices, so there is a bogus disk and a bogus console and there are device drivers in the kernel that know how to talk to the bogus disk and the bogus console, which are really just simulator-specific devices.
dW: And does that "bogus" extend to a bogus firmware?
dW: Is there a published API on that bogus firmware, if someone wanted to explore alternative kernels or even operating systems?
Bohrer: Right. We have support in there for Open Firmware, so, basically, what we have today is that there is, through some Tcl commands, the ability to specify the device tree for the firmware. [See also Resources for more information on Open Firmware --eds.]
dW: Much like an Open Firmware device tree.
Bohrer: Right. So, then the STI team adapted the Linux® kernels to make sure that everything the Linux kernel needed was in that device tree so it can boot up, so that Linux agrees with what's in the Open Firmware. If something is missing, there is some flexibility to modify the device tree with Tcl.
dW: How is the device interrupt controller modeled?
Bohrer: Internally, we have developed support for a variety of interrupt controllers which are used in various mostly internal projects. For the Cell simulator, we support the on-chip interrupt controller of the Cell processor, which is documented in the CBEA V1.0 spec. The external interrupt controller is typically part of the southbridge. Since the bogus disk and bogus console driver do not require interrupt support, we did not include a southbridge model in the initial release of the Cell Simulator. However, we are looking at providing this functionality in a future release.
dW: And how much documentation for this stuff that's been generically referred to as bogus, how much of that will be coming out?
Bohrer: Over time, if people want to write different device drivers for the bogus devices ... it would be worth doing a write up of how you talk to these bogus devices, bogus console, bogus disk.
dW: Now, as I understand it, you can run a full operating system under the simulator, but you can also just run Cell SPE or SPU executables by themselves.
Kistler: Yes. That's right.
dW: How does the simulator tell the difference?
Murrell: Well, you configure the simulator to do one or the other, so you can tell the simulator that you plan on running a particular application in stand-alone mode versus Linux. In stand-alone mode, the simulator will field the operating system calls that the application tries to make.
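The "fielding" of operating-system calls in stand-alone mode amounts to the simulator intercepting each call and servicing it on the host instead of in a simulated kernel. The sketch below is hypothetical: the call numbers, the dispatch function, and the state structure are all invented for illustration.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical stand-alone-mode call numbers. */
enum { SYS_WRITE_CHAR = 1, SYS_EXIT = 2 };

typedef struct {
    int exited;      /* has the simulated program finished?  */
    int exit_code;   /* its reported exit status             */
} SimState;

/* When the simulated program makes an OS call, the simulator
 * services it directly rather than trapping into a guest kernel. */
static void field_syscall(SimState *s, uint32_t nr, uint32_t arg)
{
    switch (nr) {
    case SYS_WRITE_CHAR:
        putchar((int)arg);     /* forward output to the host console */
        break;
    case SYS_EXIT:
        s->exited = 1;
        s->exit_code = (int)arg;
        break;
    }
}
```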
dW: I'm a little fuzzy on the concept of mailboxes. I can't quite wrap my head around whether it's a register or a memory mapped address, a DMA. Can you do more exploring with mailboxes, both the polled and the interrupt mailbox, within the simulator?
dW: Where would someone want to start in order to do that?
Kistler: Basically, we model all those mailbox registers and, in fact, they are channels that are supported by the SPE, and they are available to other elements in the system, either the PPE or other SPEs, through memory mapped addresses. So, they're both local registers and memory mapped IO. And, if you wanted to experiment with them, what you would do is you'd have to go look at the architecture and come up with a program that attempted to write to them and then you would run that program on the simulator.
Now, the simulator actually allows you to step through your program instruction-by-instruction, and you would run that little sample program, step through it, and at the point where you were writing to the mailbox channels, or trying to read from them, you could stop and look at the status of the channels. We actually have in our GUI a little channel window that shows you the state of all the channels, the contents of them, and what their counts are, so you could go look at the channel for the mailboxes and see whether there was something in there or whether you had read it out, things like that.
Murrell: And you can also set break points on channel activity.
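The dual nature Kistler describes, a local channel on the SPU side and a memory-mapped register to everyone else, can be sketched as a small FIFO with a count. Everything below is invented for illustration (the real mailboxes have architected depths and blocking semantics defined in the CBEA specification).

```c
#include <stdint.h>

#define MBOX_DEPTH 4   /* assumed FIFO depth for this sketch */

typedef struct {
    uint32_t fifo[MBOX_DEPTH];
    int count;         /* entries available to read          */
} Mailbox;

/* PPE side: write through the memory-mapped address.
 * Returns 0 if the mailbox is full. */
static int mbox_mmio_write(Mailbox *mb, uint32_t value)
{
    if (mb->count == MBOX_DEPTH)
        return 0;
    mb->fifo[mb->count++] = value;
    return 1;
}

/* SPU side: read through the channel interface.
 * Returns 0 if nothing is available (caller polls or blocks). */
static int mbox_channel_read(Mailbox *mb, uint32_t *out)
{
    if (mb->count == 0)
        return 0;
    *out = mb->fifo[0];
    for (int i = 1; i < mb->count; i++)   /* shift FIFO down */
        mb->fifo[i - 1] = mb->fifo[i];
    mb->count--;
    return 1;
}
```

The channel window in the GUI is essentially displaying state like `count` and `fifo` here, which is why stepping to a mailbox read or write and inspecting it is so direct.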
dW: During your step-by-step, you can look at all 128 registers on an SPE?
Murrell: Yes. That's correct.
Kistler: And, not only that, but the simulator has support where you can display the registers and then, as you step, the registers that change are highlighted, so it will actually show you, not only the whole register set, but will highlight for you what things are changing as you're stepping through your program.
dW: The Cell PPE is dual-threaded. How is this represented in Mambo?
Kistler: Mambo maintains separate architectural state for both threads of the Cell PPE, just as the hardware does. The separate state can be viewed in the GUI by selecting either "PPE0:0", for thread 0, or "PPE0:1", for thread 1 in the component window on the left side of the main panel.
dW: Is this debugging support source-code level or instruction level?
Murrell: Mambo's native debugging support that we supply through the GUI is more reminiscent of a very low-level debugger where you see registers, disassembled instructions, and set break points at that level. Within that particular GUI environment, we don't supply any source-level debugging capabilities. Now, that's available through a GDB attach that we provide.
dW: Do you do any modeling on the Element Interconnect Bus (EIB)?
Kistler: We do. However, we won't be providing that support in the alphaWorks release, at least not initially. For the most part, the bus operation, as well as the cache operation, really have to do mostly with the performance models. You don't need them in order to do simple functional modeling, which is really what we're supporting in this first alphaWorks release.
dW: This will, basically, be then a single SPE functional emulation?
Kistler: No, the simulator models a full Cell chip with eight SPEs and a PPE, but when they do memory operations, they will effectively do them straight to memory and there won't be any visibility to the path that is taken to get there.
dW: Are you going to be able to model a local store to local store pipeline model?
Kistler: Yes. We will.
dW: And you'll also be able to do like a job queue where it goes in, gets written out to memory, and then comes back into another SPE?
Kistler: Yes. We essentially just collapse the memory hierarchy and where you might have been taking it from local store and putting it into the PPE cache and sending it over the bus and then it eventually gets to memory, we just collapse all that and we just send it straight to memory, which is the architectural view, right?
dW: Overall, when you're doing the modeling or running through an SPE executable, how do you handle a memory access violation or say a divide-by-zero exception?
Kistler: Generally, we handle it in a way that's faithful to the architecture. Now, in certain cases, there are things where the simulator provides you the option of, rather than doing the architected behavior, simply stopping the simulation to indicate that something unexpected has happened. For example, on illegal instructions, you can configure the simulator through Tcl commands to stop when an illegal instruction is encountered, rather than the architected behavior, which is to raise the proper exception and go through all that.
This is because we found that people generally want to know exactly when that illegal instruction happened so they can go look at the code and things of that nature. They want to stop the simulator right at that point, so we give them the opportunity to do that, but you can set it up so that it does exactly what the architecture would do in almost all these cases. A floating-point error will cause a floating-point exception. An address-translation error will cause an address-translation exception.
dW: How architecturally accurate is the display? Can I see SLB entries?
Murrell: The Mambo GUI has a display that shows you all of the SLB and TLB entries for the PPEs, and then you can just count them up; I happen to be looking at it now. There are 64 SLB entries and 1,024 TLB entries for the PPE and each of the SPEs.
Bohrer: I think we should really emphasize the fact that, while they were doing bring ups, the STI design center was able to develop and debug their system software on Mambo and then run it on the hardware. So, as far as the functional accuracy at the architectural level of the CBEA, it's identical to the hardware.
dW: Are there any future plans to have Mambo hosted on AIX?
Bohrer: First off, just let me say Mambo runs really anywhere. It doesn't matter to us. We'll run on Linux x86, Linux PowerPC. We've done some work on AIX. It's not really our big focus here. We have also gotten the simulator working on other platforms like on Mac OS X and what have you, so it's not really a problem. Based on some feedback we got early on with our 970 release on alphaWorks, we put out a Linux PowerPC version of that simulator [see Resources --eds].
[However], on the Cell side of things, we're really dependent on having the rest of the environment. The simulator by itself is not very useful without development tools and what have you. For the first alphaWorks release of Cell software, the decision was made to have Fedora Core 4 on x86 be the main platform, so that means your cross-compilers, all that stuff just all works in that environment. That's why we're providing Mambo there.
dW: I just want to thank everybody for their time and their insightful comments behind it. Thank you.
Cell Broadband Engine is a trademark of Sony Computer Entertainment Inc.
- Full-System Simulator for the Cell BE
- XL C Alpha Edition for the Cell BE
- Cell BE Software Sample and Library Source Code
- GCC Toolchain for the Cell BE
- Cell BE SPE Management Library
- Linux kernel patch for the Cell BE
- Fedora Core 4
- SDK installation script
Learn more about SimOS: Stanford University's Complete Machine Simulator. See what IBM Research thinks of SimOS.
See Peter Seebach's Get started with the Cell BE SDK series of articles to install and begin using the SDK.
The Mambo team recommends: the Cell BE Tutorial, as found in the docs/ directory of the SDK, as well as specifications, user manuals, and reference documentation for Cell BE and PowerPC as found in the IBM Semiconductor solutions technical library.
The original Mambo, the dance, had its heyday in the 1940s; it eventually led to the much tamer Cha-Cha. The Black Mamba snake is not black. It is, however, both fast and deadly, so you really might not want to do the Mambo with a Mamba. Hear Mambo #5 now; as you listen, be thankful that Mr. Bohrer was not listening to Ashlee Simpson on the day he was asked to name the simulator.
VHDL is sometimes also known as VHSIC-HDL: both stand for "Very High Speed Integrated Circuit Hardware Description Language," but one is easier to pronounce in Real Life. There is perhaps no better VHDL resource than the Hamburg VHDL archive (unless it is the VHDL Analysis and Standardization Group (VASG) or maybe the VHDL ...
Learn more about Open Firmware in the developerWorks article, "Open Firmware -- the bridge between power-up and OS" (October 2004).
The Cell Broadband Engine project page at IBM Research offers a wealth of links, diagrams, information, and articles.
Find all Cell BE-related articles, discussion forums, downloads, and more at the IBM developerWorks Cell Broadband Engine resource center: your definitive resource for all things Cell BE.
Keep abreast of all the Cell BE- and other Power Architecture-related news: subscribe to the Power Architecture Community Newsletter.
Get products and technologies
Get the IBM Full-System Simulator for Cell BE and many other fine Cell BE-related downloads from alphaWorks.
The 970 Full-System Simulator is also available on alphaWorks. Find it and all Power Architecture-related downloads on one page in the developerWorks Power Architecture zone.
Get the Cell BE: Contact IBM E&TS for custom Cell BE-based or custom processor-based solutions.
The developerWorks Power Architecture editors welcome your comments on this article. E-mail them at email@example.com.