Level: Intermediate
Power Architecture editors, developerWorks, IBM
18 Apr 2006
The Power Architecture™ PowerPC® core and the Cell Broadband Engine™ (Cell BE) PPE unit: how different are they? Find out why there is "nothing to fear" from Cell BE programming, after all.
developerWorks:
We're taking Five Minutes of Mark Nutter's and Max Aguilar's time to talk about the memory model of the Cell BE processor. Max and Mark work in the IBM Systems & Technology Group on Cell BE software development.
dW:
The Cell BE processor PPE core is not based on any existing PowerPC core. Is that correct?
Mark Nutter:
The PPU is very similar to other PowerPC cores. The only difference now is that we have these additional supplementary processors, and they actually have their own MMU.
dW:
Sometimes there's discussion of how close is the Cell PPE to a PowerPC core. I know that there's a lot of emphasis that Cell is PowerPC-compliant. It's not an out-of-order execution unit, so how different is it?
Max Aguilar:
I think we could start by going all the way back to 601 and the PowerPC architecture, and look from there all the way through POWER5™ and such, and you could say that to a certain extent there's been more or less in-order or out-of-order execution in memory communication. There is variability within that PowerPC architecture for in-order or out-of-order execution.
dW:
But it hasn't affected the actual ability to run code.
Aguilar:
Right, because each processor that takes the out-of-order execution is responsible for producing internally the correct results, and analyzing register dependencies and whatnot.
Nutter:
For reference [see Resources for a link --eds], you may want to look at the MMU description for the SPEs, because that's the important piece: using, again, the same address translation machinery and storage protection machinery that the PowerPC core uses. They are fully PowerPC [ISA] compliant, so when you issue MFC/DMAs, they step through the same address translation and protection mechanism that the PowerPC core steps through.
dW:
So there is really nothing to fear about Cell.
Aguilar:
As a PowerPC programmer no, other than the fact that it's not a high-performance PowerPC core in the POWER5 vein.
dW:
Now if we take and we start with, say, the PowerPC Architecture Book III and we look at that memory model from there, where do we go differently with the Cell BE processor?
Nutter:
From the SPU's perspective, its loads and stores are relative to the local storage area. Internal to the SPU there is no address translation or local memory protection. So there is no segment or page or any of that.
dW:
Could this create a situation in which an SPE is writing to real memory that is not owned by the thread using the SPE?
Aguilar:
No. From the SPU's perspective, the only way that it accesses system memory is through an MFC/DMA.
Nutter:
And for MFC/DMA accesses to main memory, all of the TLB and SLB translations take effect. Each MFC includes its own MMU, which is compatible with the PowerPC semantics. So the usual rules are applied.
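Editors' note: the point that MFC/DMA accesses go through translation and protection can be illustrated with a toy model. The sketch below is not the Cell SDK API; the names `PageEntry`, `ToyMMU`, `dma_get`, and `dma_put` are hypothetical, the 4 KB page size is an assumption, and a single dictionary stands in for the SLB and TLB lookups. It shows only the idea that every DMA address is translated and checked before any data moves.

```python
# Toy model: every MFC/DMA address is translated and protection-checked.
# All names are hypothetical; this is not the Cell SDK API.

PAGE_SIZE = 4096  # assumed page size for the illustration


class PageEntry:
    def __init__(self, real_page, writable):
        self.real_page = real_page
        self.writable = writable


class ToyMMU:
    """Maps effective pages to real pages and enforces write protection."""

    def __init__(self):
        self.pages = {}  # effective page number -> PageEntry

    def map_page(self, effective_page, real_page, writable=True):
        self.pages[effective_page] = PageEntry(real_page, writable)

    def translate(self, ea, write):
        entry = self.pages.get(ea // PAGE_SIZE)
        if entry is None:
            raise LookupError("no translation for EA 0x%x" % ea)
        if write and not entry.writable:
            raise PermissionError("EA 0x%x is read-only" % ea)
        return entry.real_page * PAGE_SIZE + ea % PAGE_SIZE


def dma_get(mmu, main_memory, local_store, ls_addr, ea, size):
    """Copy main memory -> local store; only the EA side is translated."""
    sources = [mmu.translate(ea + i, write=False) for i in range(size)]
    for i, real in enumerate(sources):
        local_store[ls_addr + i] = main_memory[real]


def dma_put(mmu, main_memory, local_store, ls_addr, ea, size):
    """Copy local store -> main memory; fails before writing if not allowed."""
    targets = [mmu.translate(ea + i, write=True) for i in range(size)]
    for i, real in enumerate(targets):
        main_memory[real] = local_store[ls_addr + i]


if __name__ == "__main__":
    mmu = ToyMMU()
    mmu.map_page(2, 0, writable=True)   # EA page 2 -> real page 0
    mmu.map_page(3, 1, writable=False)  # EA page 3 -> real page 1, read-only
    memory = bytearray(2 * PAGE_SIZE)
    ls = bytearray(256 * 1024)
    ls[0:4] = b"cell"
    dma_put(mmu, memory, ls, 0, 2 * PAGE_SIZE, 4)
    print(bytes(memory[0:4]))
    try:
        dma_put(mmu, memory, ls, 0, 3 * PAGE_SIZE, 4)
    except PermissionError as err:
        print("rejected:", err)
```

Note that local store addresses are used directly, with no lookup, while effective addresses always pass through `translate`; that asymmetry is the one described in the answers above.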
dW:
What about the reverse, from a PPE to SPE or to a global memory?
Nutter:
From the PPE to an SPE, there are two convenient ways to do this. The first is load store to or from a memory mapped local storage area. The second is to use the PowerPC side of the DMA queue in the MFC engine.
dW:
When you're doing a memory mapped local store, who is responsible for configuring and discovering those memory regions?
Nutter:
The operating system, usually at an application's request.
Aguilar:
We really don't advise people to memory map local store into the PPE's effective address space.
Nutter:
Or to make heavy use of it if they do.
dW:
You recommend going with the DMA?
Aguilar:
Yes. The DMA is definitely the way you want to transfer, because when you access the local store it's done through MMIO, very slow compared to the MFC/DMA.
dW:
That's a big clarification, that the recommended approach is using the DMA approach as opposed to the global memory mapping.
Nutter:
Let's consider for a moment an application that might need to copy content to the local store from the PPE side. One way it could do that would be to call memcpy on that memory-mapped local storage area, and essentially copy all 256 kilobytes as it would with any other anonymous memory chunk in the system. That would be very slow relative to the DMA engine. It would have, potentially, various side effects -- depending on the memory target where you were copying to, it would potentially displace contents of the L2, and so on. So every load and store to an MMIO region is something to be avoided, if you can, and 256 kilobytes' worth of them certainly is.
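Editors' note: a rough count of operations shows why the memcpy route is discouraged. The sketch below only does arithmetic; it makes no timing claims. The 8-byte store width is an assumption for illustration, and the 16 KB figure is the largest single MFC DMA transfer as we understand it.

```python
# Back-of-the-envelope count: PPE stores vs. DMA commands for one local store.
# STORE_BYTES is an assumed store width; no timing is modeled here.

LOCAL_STORE_BYTES = 256 * 1024  # local store size quoted in the interview
MAX_DMA_BYTES = 16 * 1024       # largest single MFC DMA transfer
STORE_BYTES = 8                 # assumed width of one PPE store


def operations_needed(total_bytes, bytes_per_operation):
    """Ceiling division: how many operations cover total_bytes."""
    return -(-total_bytes // bytes_per_operation)


mmio_stores = operations_needed(LOCAL_STORE_BYTES, STORE_BYTES)
dma_commands = operations_needed(LOCAL_STORE_BYTES, MAX_DMA_BYTES)

print("MMIO stores needed: ", mmio_stores)
print("DMA commands needed:", dma_commands)
```

Under these assumptions, filling one local store takes tens of thousands of individual MMIO stores but only a handful of DMA commands, each of which the MFC can process without occupying the PPE.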
dW:
Alex Chow's pipeline abstraction [see Resources --eds] uses Local Store to Local Store. Who handles that transference from one SPE's LS to another LS? Is that in the SPU executable code, or is that handled at the operating system level?
Nutter:
Local storage area could be memory mapped into an effective address space. We also mentioned that an SPU transfers either to its local store from EA or vice versa. The piece that we didn't tie together for you there, but we are now, hopefully, is that you can transfer to another SPU's local storage area that has been memory mapped.
So you essentially are just treating the target local store as an effective address, and the SPU program initiates the transfer. It's actually a very high-bandwidth transfer that stays entirely on-chip. It's 10x or more, the kind of improvement over what you would see if you had to spill out to memory.
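Editors' note: the idea of treating another SPE's local store as just another effective address can be sketched with a toy address map. Everything here is hypothetical (`ToySystem`, `ToySPE`, the base address `0x10000000`); it is not the Cell SDK API, and translation and protection are left out to keep the focus on addressing.

```python
# Toy model: a local store mapped into the effective address (EA) space is
# reached by the same DMA "put" that reaches main memory.
# All names and addresses are hypothetical.

LS_SIZE = 256 * 1024


class ToySystem:
    def __init__(self, main_memory_size):
        self.main_memory = bytearray(main_memory_size)
        self.regions = [(0, self.main_memory)]  # (base EA, backing storage)

    def map_local_store(self, base_ea, local_store):
        """Expose an SPE's local store at base_ea in the EA space."""
        self.regions.append((base_ea, local_store))

    def resolve(self, ea, size):
        """Find which backing storage an EA range falls in."""
        for base, backing in self.regions:
            if base <= ea and ea + size <= base + len(backing):
                return backing, ea - base
        raise LookupError("EA 0x%x is not mapped" % ea)


class ToySPE:
    def __init__(self, system):
        self.system = system
        self.local_store = bytearray(LS_SIZE)

    def dma_put(self, ls_addr, ea, size):
        """Copy from this SPE's local store to wherever the EA points."""
        if ls_addr < 0 or ls_addr + size > LS_SIZE:
            raise ValueError("source range is outside the local store")
        backing, offset = self.system.resolve(ea, size)
        backing[offset:offset + size] = self.local_store[ls_addr:ls_addr + size]


if __name__ == "__main__":
    system = ToySystem(main_memory_size=1024 * 1024)
    spe0, spe1 = ToySPE(system), ToySPE(system)
    system.map_local_store(0x10000000, spe1.local_store)

    payload = b"stage-0 output"
    spe0.local_store[0:len(payload)] = payload
    spe0.dma_put(0, 0x10000000 + 0x100, len(payload))
    print(bytes(spe1.local_store[0x100:0x100 + len(payload)]))
```

The sending SPE's program issues the same kind of put in both cases; only the effective address decides whether the data lands in main memory or in the neighboring local store.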
dW:
In the case of a pipeline model, there are two (or more) approaches. One is where only the resulting data is pipelined and the executable code already resides on the next SPE in the pipe, and the other is where both data and executable code are written to the next SPE. Are the execute and write protections on memory segments enforced when writing from one SPE to another, and if so, how?
Nutter:
Yes. As we mentioned before, MFC/DMA commands targeting effectively addressed memory go through address translation and protection. This is true both for accesses to regular memory and for accesses to memory-mapped I/O. MFC/DMA commands targeting another SPU's local storage area are just like any other memory access, from the MMU's point of view.
dW:
Does staying within the Element Interconnect Bus (EIB) by pipelining to another SPE require some pre-knowledge by the compiler to take advantage of the performance gain?
Nutter:
Somebody, either the programmer or ideally the compiler, has to communicate where the target local storage area is and organize the data flow. Whether it's the PPE side of the application or an automated compiler toolchain that does it, that is the piece that has to be tied together for the SPU.
dW:
With the GCC toolchain, what abstractions does it support out of the box?
Aguilar:
It's a low-level compiler, in the sense that it compiles to the SPU or to the PPE ISA, but it doesn't do any of the higher-level abstraction or tying together of the programming model. For instance, what we're talking about here is the "task parallel programming" model in the pipeline. That [GCC] compiler right now doesn't have a great deal of abstraction capability in it for handling that automatically. So the programmers have to set that up for themselves.
A good way to think about this is that depending on your workload, the programming model you choose to break up your workload is kind of the flexibility we're giving the programmers here. For example, if it makes sense for you to break down your workload and have multiple SPUs executing on different chunks of work, then a parallel programming model makes sense.
There are other programming models where we take work and pass it from one SPU to another in a serial fashion, and that may be another way to break up your workload. What we've done in the SDK is give the programmer the flexibility to choose among programming models, so that they can decide how they feel their workload would take greatest advantage of the Cell processor.
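Editors' note: the two ways of breaking up a workload described above can be contrasted in a few lines. This is a sequential sketch with hypothetical names (`run_parallel`, `run_pipeline`, `stage_scale`, `stage_offset`); nothing here runs concurrently or uses the SDK. It only shows how the work is divided in each model.

```python
# Two ways to divide a workload, simulated sequentially.
# All function names are hypothetical.


def stage_scale(values):
    return [v * 2 for v in values]


def stage_offset(values):
    return [v + 1 for v in values]


def run_parallel(work, num_workers, kernel):
    """Parallel model: each worker runs the whole kernel on its own chunk."""
    chunk = -(-len(work) // num_workers)  # ceiling division
    results = []
    for worker in range(num_workers):
        results.extend(kernel(work[worker * chunk:(worker + 1) * chunk]))
    return results


def run_pipeline(work, stages):
    """Pipeline model: each worker runs one stage and hands data onward."""
    data = work
    for stage in stages:
        data = stage(data)
    return data


def full_kernel(chunk):
    return stage_offset(stage_scale(chunk))


if __name__ == "__main__":
    work = list(range(8))
    print(run_parallel(work, 4, full_kernel))
    print(run_pipeline(work, [stage_scale, stage_offset]))
```

Both calls produce the same result; what differs is whether each worker holds all of the code and part of the data, or part of the code and all of the data passing through it.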
dW:
Thank you for your time and responses.
About the author: The developerWorks Power Architecture editors welcome your comments on this article. E-mail them at dwpower@us.ibm.com.