The new "synergistic" processing units in the Cell Broadband Engine (Cell BE) Architecture have received a lot of attention, but this overlooks a critical component of the Cell BE - the Element Interconnect Bus (EIB). While the SPUs are the nitromethane-breathing explosions in a bottle, the EIB is the intake and exhaust system. The data has to flow in and out with the least amount of restriction possible. What does it take to do this? The Smokey Yunicks of EIB, lead designer David Krolak, and resource manager Jim Mikos, explain (see Resources).
developerWorks: This is a nice little creature you have here. How many years have you guys been working on it?
Dave Krolak: I think I started working on the STI project in January of 2002, so that would make me starting up on the fourth year now.
dW: In looking at some of the specs and how the EIB works, it does seem to me that there's some correlation with a token ring implementation. What were your inspirations for designing this?
Krolak: Actually, no, not token ring at all; it's more of a mainframe bus on a chip. My background was working on the IBM UNIX® machines -- the S70, S80, S85 -- and I was familiar with the 6XX bus that was used on those machines. The S85 had 24 processors that they had to connect together, and we, in some way, shape, or form, took those ideas and applied them to this on-chip interconnect, where we have eleven units -- or I should say, nine processors and a couple of I/O ports.
dW: Okay, so even though there are phrases like "token" and it looks like a ring topology, there's no real connection?
Krolak: No, there's no connection to the token ring concept. The "token" actually comes from the Resource Allocation Manager, which handles resource allocation control.
Is it a bus with a chip on its shoulder?
dW: How did the ring design come out, instead of a traditional interconnect bus?
Figure 1. The data topology of the EIB
Krolak: Well, in the beginning, early in the development process, several people were pushing for a crossbar switch, and the way the bus is architected, you could actually pull out the EIB and put in a crossbar switch if you were willing to devote more silicon space on the chip to wiring. We had to find a balance between connectivity and area, and there just wasn't enough room to put a full crossbar switch in. So we came up with this ring structure which we think is very interesting. It fits within the area constraints and still has very impressive bandwidth.
dW: Okay, and on that ring, with the data arbiter there, you're able to do concurrent transfers on the same ring as long as they don't overlap, is that correct?
Krolak: That's correct. A ring can start a new op every three cycles. Each transfer always takes eight beats. That was one of the simplifications we made, it's optimized for streaming a lot of data. If you do small ops, it doesn't work quite as well. If you think of eight-car trains running around this track, as long as the trains aren't running into each other, they can coexist on the track.
Jim Mikos: Each beat is 16 bytes [so the eight-beat transfer moves 128 bytes --eds].
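A quick back-of-the-envelope calculation, using only the figures given in this interview (16-byte beats, eight-beat transfers, four rings, and the "twelve trains" Mikos mentions later), sketches the peak data rate. The one-beat-per-cycle assumption is ours, not the designers':

```python
# Back-of-the-envelope EIB throughput, using only figures from the interview.
# Assumption (ours): each in-flight transfer moves one beat per EIB cycle.

BYTES_PER_BEAT = 16        # "Each beat is 16 bytes"
BEATS_PER_TRANSFER = 8     # "Each transfer always takes eight beats"
RINGS = 4                  # two clockwise, two counter-clockwise
TRANSFERS_PER_RING = 3     # "up to twelve trains on these four tracks"

bytes_per_transfer = BYTES_PER_BEAT * BEATS_PER_TRANSFER
peak_bytes_per_cycle = RINGS * TRANSFERS_PER_RING * BYTES_PER_BEAT

print(bytes_per_transfer)      # 128 bytes per transaction
print(peak_bytes_per_cycle)    # 192 bytes per cycle with all twelve trains moving
```

Actual sustained bandwidth depends on traffic patterns and memory limits, as the rest of the interview makes clear.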
dW: When the data arbiter is scheduling these transactions, what is the mechanism presented to each of the elements? Is it similar to promiscuous mode on Ethernet, where everyone is listening to the bus, or is it a pass-through, where the data arbiter tells each target, you're the one this is for?
Krolak: The bus units don't know that there's a ring. It's a very simple handshake interface: if you want to send data, you say, I want to send data to unit four, and eventually you will get a grant. When you get your grant, you ship your data into the network, and the central arbiter handles all the scheduling, so the data will move around the ring and eventually end up at the destination, unit four. Unit four then gets a signal that says, hey, you've got data coming, so it knows to look at its data port. The data comes with a tag that tells it, this is transaction such-and-such, and since the unit has already got all that information from a preceding command, it knows what to do with the data and it just streams in.
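The request-grant-tag handshake Krolak describes can be caricatured in a few lines. All names here are hypothetical; the real interface is hardware request and grant signals, not method calls:

```python
# Toy model of the request/grant/tag handshake Krolak describes.
# All names are hypothetical; the real EIB uses hardware signals.

class CentralArbiter:
    def __init__(self):
        self.pending = []                  # requests waiting for a grant

    def request(self, source, dest, tag, data):
        """A unit asks to send; it doesn't know a ring exists."""
        self.pending.append((source, dest, tag, data))

    def step(self, units):
        """Grant the oldest request and deliver it to its destination."""
        if self.pending:
            source, dest, tag, data = self.pending.pop(0)
            units[dest].data_arrived(tag, data)   # "hey, you've got data coming"

class BusUnit:
    def __init__(self, name):
        self.name = name
        self.expected = {}                 # tag -> what the command set up
        self.received = []

    def expect(self, tag, purpose):
        """A preceding command told this unit what transaction 'tag' means."""
        self.expected[tag] = purpose

    def data_arrived(self, tag, data):
        # The tag matches incoming data to a command the unit already saw,
        # so it knows what to do with it and just streams it in.
        purpose = self.expected[tag]
        self.received.append((purpose, data))
        print(f"{self.name}: {purpose}: {data}")

arbiter = CentralArbiter()
units = {4: BusUnit("unit4")}
units[4].expect(tag=7, purpose="DMA read reply")
arbiter.request(source=0, dest=4, tag=7, data="...128 bytes...")
arbiter.step(units)
```

The point of the sketch is the division of labor: units see only request, grant, and tagged data; all ring-awareness lives in the central arbiter.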
dW: So the managing of workloads is from the data arbiter itself?
Krolak: Just the management of the individual transactions. The fact that the transaction was on the bus at all started off with a command request by the device. For instance, an SPE may want to do a DMA from memory -- somebody programmed it to do that DMA -- and since memory is a critical resource, it's probably going to use the Resource Allocation Manager to make sure it doesn't overwhelm it. Each transaction can move up to 128 bytes, so as the SPE is doing this DMA, it first gets a token for the transaction. When it gets the token, it puts a command on the bus that says, I want to fetch 128 bytes from this address in memory. That command goes through the network and gets reflected to all the units, and the memory will see it and say, okay, I'll send it to you. The memory will eventually arbitrate for the bus and send the data to the SPE, and the SPE will receive it. So the ring portion of the EIB is just responding to requests for data that have already been scheduled implicitly by the command.
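The token-then-command-then-data sequence Krolak walks through can be sketched as a rate-limited pipeline. The class and function names are ours; the real tokens are issued in hardware by the Resource Allocation Manager:

```python
# Sketch of the token -> command -> data sequence for one 128-byte DMA.
# Hypothetical names; in hardware, the Resource Allocation Manager issues
# tokens to keep a critical resource (memory) from being overwhelmed.

class ResourceAllocationManager:
    def __init__(self, tokens_per_interval):
        self.tokens = tokens_per_interval

    def get_token(self, requester, resource):
        if self.tokens == 0:
            return False               # requester must wait and retry
        self.tokens -= 1
        return True

def dma_read(ram, address, log):
    # Step 1: get a token before touching the managed resource.
    if not ram.get_token("SPE0", "memory bank"):
        log.append("stalled: no token")
        return
    # Step 2: the command is reflected to all units; memory sees it.
    log.append(f"command: fetch 128 bytes at {address:#x}")
    # Step 3: memory arbitrates for a ring and sends the data.
    log.append("memory: arbitrates for a ring, sends data")
    log.append("SPE0: receives 128 bytes")

log = []
ram = ResourceAllocationManager(tokens_per_interval=1)
dma_read(ram, 0x1000, log)   # succeeds
dma_read(ram, 0x2000, log)   # throttled until tokens are replenished
print("\n".join(log))
```

In this toy version a starved request simply stalls; the real mechanism replenishes tokens over time, so software tuning of the allocation (discussed below) changes who stalls and how often.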
Four rings were given to the elements...
dW: In your diagram, there are two clockwise rings and two counter-clockwise rings, and you show every single unit talking to another unit, which is maximum efficiency. What's the difference between that theoretical efficiency and what you're actually seeing in the real world?
Figure 2. The EIB grapples with eight concurrent transactions
Krolak: Again, I allude to it a little bit in the presentation [see Resources --eds]. It's highly dependent on the workload. Obviously, if everybody wants to send data to memory, or everybody wants to fetch data from memory, you're going to be limited by what memory can do. You can also set up bad traffic patterns: have everybody try to talk to their clockwise-adjacent neighbor, and that's going to pretty much limit you to two rings, so you won't get the bandwidth you would get if you had chosen some other pairing.
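Why does all-clockwise neighbor traffic cut you to two rings? If a transfer is routed in whichever direction gives the shorter hop count, then uniformly clockwise traffic always picks the same direction and leaves the counter-clockwise rings idle. A toy routing check illustrates this; the element positions and the shortest-path routing rule are our simplification, not the documented arbiter policy:

```python
# Toy model: elements sit on a ring, and a transfer is routed in the
# direction with the shorter hop count. If every unit talks to its
# clockwise-adjacent neighbor, every transfer picks the same direction,
# so only the clockwise rings carry traffic. Positions are illustrative.

N = 12  # nine processor elements, memory, and a couple of I/O ports

def preferred_direction(src, dst):
    clockwise_hops = (dst - src) % N
    counter_hops = N - clockwise_hops
    return "clockwise" if clockwise_hops <= counter_hops else "counter-clockwise"

# Bad pattern: everybody talks to the clockwise-adjacent neighbor.
neighbor_traffic = {preferred_direction(u, (u + 1) % N) for u in range(N)}
print(neighbor_traffic)        # only one direction used; half the rings idle

# Mixed pairings spread the load across both directions.
mixed_traffic = {preferred_direction(0, 1), preferred_direction(5, 2)}
print(mixed_traffic)           # both directions in use
```

The same reasoning explains Krolak's advice below about choosing source/target pairings judiciously.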
dW: Can we explore this a little, then? There is a layer where the software talks to the Resource Allocation Manager, and I took it to mean that the management of the load, once it's actually on the bus, is up to the data arbiter. Is the data arbiter the one scheduling "this goes to the next one," or is the software interface to the Resource Allocation Manager the one that schedules which element talks to which?
Krolak: The Resource Allocation Manager can only schedule access to memory and to I/O. Software is, in broad strokes, scheduling which SPE talks to which SPE and when. In Alex's presentation [see Resources --eds], I think he had some charts where you partition up the workload and you're loading your next job while you execute your current job and unload your prior job. Do you remember those?
dW: Yes, he was referring to it as a double buffering or multibuffering.
Krolak: Yes, multibuffering. Well, judiciously choosing which SPEs are the source and target for what you're doing can impact your performance.
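The multibuffering pattern Krolak and Chow refer to -- loading the next job while executing the current one -- looks roughly like this in schematic form. `dma_get()` and `compute()` are stand-ins for the real asynchronous SPE DMA operations and compute kernel, and this synchronous sketch only shows the buffer rotation, not the actual overlap:

```python
# Schematic of double buffering on an SPE: rotate between two local-store
# buffers so the next block can be loaded while the current one is used.
# dma_get() and compute() are hypothetical stand-ins; real SPE DMA is
# asynchronous, which is what makes the load overlap the compute.

def dma_get(block):
    return [x * 2 for x in block]          # pretend transfer + staging

def compute(data):
    return sum(data)

def process(blocks):
    results = []
    buffers = [None, None]                 # two local-store buffers
    buffers[0] = dma_get(blocks[0])        # prime the pipeline
    for i in range(len(blocks)):
        if i + 1 < len(blocks):
            buffers[(i + 1) % 2] = dma_get(blocks[i + 1])  # start next load
        results.append(compute(buffers[i % 2]))            # work on current
    return results

print(process([[1, 2], [3, 4], [5, 6]]))   # [6, 14, 22]
```

With asynchronous DMA, the `dma_get` for block i+1 would be issued, the compute on block i would run, and a completion wait would sit between them.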
Never name your children Biff, but it's a good acronym for a protocol
dW: In looking at the presentation you gave for FPF [see Resources --eds], I see a discussion of a bus protocol called Broadband Interface Protocol (BIF). Is that an industry standard or is that IBM?
Krolak: No, that's something we came up with as part of this project. We wanted to be able to connect Cell BEs into a multiprocessor configuration, and actually the first protocol we came up with was the BIF; the EIB is sort of a logical extension of it. The BIF defines command formats, transaction types, the snooper cache coherence protocol, and all of that, and it's also got definitions of the physical link layer, routing, and other things that are in some ways similar to PCI Express -- some of the same concepts are used in the BIF. The difference, though, is that the BIF has packets that allow for coherent MP communication.
dW: That was the next question coming up -- how is coherency maintained in a multi-Cell environment?
Krolak: The key is any operation that is coherent has to be routed through a single central serialization point. We have that logic on the Cell itself. It's part of the EIB. So if you have two Cells hooked together, one Cell can be the central serialization point, and all the commands that are coherent will flow through that point, and the other Cell will be a slave chip. If you have more than two Cells, then that's more than we can handle, and then you have to design a switch chip of some sort that would act as that central point, and all the Cell chips would hook to that switch chip.
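The single-serialization-point idea reduces to this: every coherent command, whichever chip it originates on, gets its place in one global order from one place. A minimal sketch, with hypothetical names:

```python
# Minimal sketch of a single central serialization point for coherent
# commands: whichever Cell chip a command originates on, its global order
# is assigned in one place. Names are hypothetical.

import itertools

class SerializationPoint:
    """Lives on one 'master' Cell (or on a switch chip for >2 Cells)."""
    def __init__(self):
        self._order = itertools.count()

    def serialize(self, chip, command):
        # Assign the next global sequence number to this coherent command.
        return (next(self._order), chip, command)

point = SerializationPoint()
log = [point.serialize("cell0", "store A"),
       point.serialize("cell1", "load A"),    # slave chip's commands flow here too
       point.serialize("cell0", "store B")]
print(log)   # every coherent op gets a unique global sequence number
```

Funneling everything through one point is what makes the order unambiguous; it is also why more than two Cells need a dedicated switch chip to play that role.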
dW: Is the broadband interface protocol going to be a published API specification?
dW: There are two I/O controllers on the Cell chip, and the EIB document we've seen says that one can be configured for either IOIF [I/O Interface Protocol --eds] or BIF. Is that a software-configurable setting, or is it done in hardware?
Krolak: It is programmed immediately after power-up.
dW: So would that be a firmware command?
dW: In general, is it correct to say that the software won't need to interact directly with EIB much?
Krolak: That is correct. The one part of the EIB that software would interact with is the Resource Allocation Manager, for resource allocation. It can control access to the 16 memory banks and to the I/O ports.
Mikos: That can be reprogrammed on the fly, right?
Krolak: Yes, that can be reprogrammed on the fly, so as your workload changes, you may need to rearrange who gets access to what and how much of it, and software would be very active in that.
dW: The Resource Allocation Manager then, is that API-published?
Krolak: The Resource Allocation Manager will also be covered in the Resource Allocation section of the CBE Handbook, which is expected to be published in the first half of next year.
dW: If there's a disconnect between what you know has to be done with scheduling algorithms and what the software people are getting, is there going to be an attempt to overcome that?
Krolak: I think you would want to, because we make all these claims for performance, and if you can't hit them because you did something [wrong], but you didn't know [this], that's a disconnect, and we look bad when you do that [see Resources --eds].
dW: How do you see Cell initially performing on independent benchmark tests? Do you anticipate there being any issues with its utilization of the EIB?
Mikos: I'd say chip performance is a function of the specific SPE communications that are chosen by software, as well as dynamic configuration of the Resource Allocation Manager. So, initially performance may not be optimized, but will still be impressive.
dW: That is actually -- we're very near the end of the time that we have for today. Is there anything you would like to add, that we haven't covered?
Mikos: I have one question for Dave. You've got a lot of experience with buses and interconnecting processors, and you mentioned the crossbar switch. From a design perspective, would it have been easier for you to do a crossbar switch? It's just that the area is so much larger and the power is so much higher.
Krolak: Yes, the crossbar switch would be simpler, because arbitration would basically be... there would be a path from everybody to a destination, so you would only have to arbitrate based on the destination: who wants to talk to a particular destination? Then you get them routed through the network to that guy. That would be easier to deal with than these rings, where transactions for different destinations share a ring. When can you schedule a transaction on a ring? How do you keep two successive transactions from arriving at a destination at the same time?
Mikos: So it's kind of back to the question dW asked you earlier, when you used the analogy of the trains: I can have up to twelve trains on these four tracks at any one time, and that's what your logic has to keep track of. In a crossbar switch, that would have been much simpler, I think.
Krolak: Yes, it probably would have been simpler.
Mikos: So although the crossbar switch would have been simpler -- it gives you the point-to-point interconnect, and it relieves the performance concerns in software because you don't need as much knowledge of who's talking to whom -- the reason you went with the ring structure is that it takes up much less chip area.
Krolak: Yes, there's just more wire than we had space for and all the related buffering.
Mikos: And power was a big concern, so it saves area, it saves power to do it this way.
dW: Thank you so much for joining us today.
Cell Broadband Engine is a trademark of Sony Computer Entertainment Inc.
- Smokey Yunick's engine tuning abilities are both larger than life and understated. He only revealed ten patentable inventions and kept the rest to himself.
- Read Dave Krolak's Unleashing the Cell Processor: The Element Interconnect Bus, originally presented at Fall Processor Forum 2005.
- Read Alex Chow's Unleashing the Cell Processor: A programming model approach, originally presented at Fall Processor Forum 2005.
- See Thomas Chen's article Cell Broadband Engine Architecture and its first implementation for more on Cell BE performance (developerWorks, November 2005).
- Learn how a Token Ring works.
- Read about BIF/IOIF in this Cell BE forum thread.
- The Cell Broadband Engine project page at IBM Research offers a wealth of links, diagrams, information, and articles.
- The Cell Broadband Engine Registers document is posted to the Cell Broadband Engine section of the IBM Semiconductor solutions Technical Library, where it lives with many interesting friends and relatives.
- Find all Cell BE-related articles, discussion forums, downloads, and more at the IBM developerWorks Cell Broadband Engine resource center: your definitive resource for all things Cell BE.
- Keep abreast of all the Cell BE -- and other Power Architecture-related news: Subscribe to the Power Architecture Community Newsletter.
Get products and technologies
- Get Cell BE: Contact IBM E&TS for custom Cell BE-based or custom-processor based solutions.
- Get the alphaWorks Cell Broadband Engine downloads -- including the IBM Full-System Simulator.
- See all Power Architecture-related downloads on one page.
- Full-System Simulator for the Cell BE
- XL C Alpha Edition for the Cell BE
- Cell BE Software Sample and Library Source Code
- GCC Toolchain for the Cell BE
- Cell BE SPE Management Library
- Linux kernel patch for the Cell BE
- Fedora Core 4
- SDK installation script