Much of the literature on the Cell Broadband Engine (Cell BE) Processor is focused on the processors in it: the eight synergistic processor elements (SPEs) and the power processor element (PPE) which organizes them. However, in a picture of the Cell BE processor as a whole, it turns out that the physical center of the processor is not any of the processor elements, but the bus which connects them. A nine-core processor running at 3GHz and faster imposes huge bandwidth demands, which are met by the Element Interconnect Bus (EIB). This article reviews the function and design of the EIB.
As an overview, the EIB provides the Cell BE with an aggregate main memory bandwidth (at 3.2GHz) of about 25.6GB/s, I/O bandwidth of 35GB/s outbound and another 25GB/s inbound, and a fair amount of bandwidth left over for moving data within the processor. (All numbers in this article are based on the 3.2GHz Cell BE.)
Figure 1. The data topology of the EIB
The Cell BE connects a number of devices to each other: the PPE, the eight SPEs, a memory interface controller (MIC), and a bus interface controller (BIC). The EIB runs at half of the processor's core frequency, and can shift a maximum of 96 bytes per processor cycle, or 192 bytes per bus cycle.
The EIB has independent networks for commands (requests for data from other sources) and for the data being moved. It has 12 ports for elements, each of which can produce and consume up to 16 bytes per bus cycle. The BIC has two ports, called IOIF0 (or BIF) and IOIF1, and each of the other components gets a single port.
Commands are filtered through address concentrators (ACs), which handle collision detection and prevention and ensure that all units have equal access to the command bus. There are multiple address concentrators, all of which forward commands to a single serial command reflection point, called AC0. The others are AC1, AC2 (of which there are two), and AC3.
Data transfer is more elaborate. There are four "rings," each of which is a chain connecting all of the data ports. Data can move down a ring only in one direction. For instance, a connection that allows data to move from the PPE to SPE1 cannot be used to move data from SPE1 back to the PPE. Two rings go clockwise, and two counterclockwise, and all four rings have the components attached in the same order, as Figure 2 shows:
Figure 2. The physical layout of the EIB
Each ring can move 16 bytes at a time from any position on the ring to any other position. In fact, each ring can carry up to three concurrent transfers, but those transfers cannot overlap. The ring interface is not exposed to the data ports; the ring structure is transparent to them.
Each port has a theoretical bandwidth of 25.6GB/s in each direction. The command bus streams commands fast enough to support 102.4GB/s for coherent commands, and 204.8GB/s for non-coherent commands. The data rings can sustain 204.8GB/s for some workloads, with transient rates as high as 307.2GB/s. (This would represent all four rings managing three concurrent transfers at once!)
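These figures follow directly from the clock and transfer widths described above. As a quick sanity check, here is the arithmetic in a small Python sketch (the constant names are illustrative, not from any Cell BE toolchain):

```python
# Back-of-the-envelope check of the EIB bandwidth figures quoted above.
CORE_CLOCK_GHZ = 3.2
BUS_CLOCK_GHZ = CORE_CLOCK_GHZ / 2   # the EIB runs at half the core frequency
BYTES_PER_PORT = 16                  # each port moves 16 bytes per bus cycle
RINGS = 4
TRANSFERS_PER_RING = 3               # maximum concurrent transfers per ring

# Per-port bandwidth in each direction: 16 bytes x 1.6G bus cycles/s = 25.6GB/s
port_gbps = BYTES_PER_PORT * BUS_CLOCK_GHZ

# Transient peak: all four rings carrying three 16-byte transfers at once
# (192 bytes per bus cycle x 1.6G bus cycles/s = 307.2GB/s)
peak_gbps = RINGS * TRANSFERS_PER_RING * BYTES_PER_PORT * BUS_CLOCK_GHZ
```

Note that the sustained figure of 204.8GB/s works out to eight concurrent 16-byte transfers per bus cycle, an average of two per ring.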
Contrasting this with typical front-side bus arrangements is informative. A typical front-side bus has total available data bandwidth of around 6 to 8GB/s, and DDR2 memory interfaces can provide 6-11GB/s. The Cell BE's external memory bandwidth is 25.6GB/s inbound and outbound to the Rambus Dual XDR memory controller, roughly 3-8 times the bandwidth of a typical DDR memory bus.
Furthermore, the I/O interfaces offer additional bandwidth. The first I/O interface, called BIF/IOIF0 can be switched between two protocols: the Broadband Engine Interface (BIF) coherent protocol, or the I/O Interface (IOIF) non-coherent protocol. The second I/O interface, called IOIF1, only supports the IOIF protocol.
For multiprocessor environments, BIF is the preferred protocol for interconnecting Cell BE processors. A pair of Cell BE processors can communicate directly using their BIF/IOIF0 ports, or a larger set can communicate over a switched bus.
The IOIF0 interface's bandwidth is scalable, from 0 to 7 bytes outbound and 0 to 5 bytes inbound; each byte lane gives 5GB/s of bandwidth, for a peak of 35GB/s outbound and 25GB/s inbound. (The external interfaces of the IOIF elements are not locked to the internal clock of the Cell BE.) Meanwhile, IOIF1 can be scaled from 0 to 2 bytes inbound and outbound, for another 10GB/s of inbound and outbound I/O capacity, but one byte in each direction is shared with IOIF0; if IOIF1 is configured for 2 bytes in and out, IOIF0 is limited to 6 out and 4 in. Each of these interfaces exceeds the typical data bandwidth of a traditional system.
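The byte-lane arithmetic above, including the lane shared between the two interfaces, can be captured in a short sketch. The function and its limits are illustrative assumptions drawn from this article's figures, not part of any Cell BE API:

```python
# Illustrative model of IOIF byte-lane bandwidth: each configured lane
# contributes 5GB/s, and IOIF1's second lane in each direction is
# borrowed from IOIF0's allotment.
GBPS_PER_BYTE_LANE = 5

def ioif_bandwidth(ioif0_out, ioif0_in, ioif1_out=0, ioif1_in=0):
    """Return (outbound, inbound) GB/s for a lane configuration."""
    assert 0 <= ioif1_out <= 2 and 0 <= ioif1_in <= 2
    # If IOIF1 uses its second lane, IOIF0 loses one lane in that direction.
    max0_out = 7 - max(0, ioif1_out - 1)
    max0_in = 5 - max(0, ioif1_in - 1)
    assert 0 <= ioif0_out <= max0_out and 0 <= ioif0_in <= max0_in
    return ((ioif0_out + ioif1_out) * GBPS_PER_BYTE_LANE,
            (ioif0_in + ioif1_in) * GBPS_PER_BYTE_LANE)
```

For example, `ioif_bandwidth(7, 5)` gives IOIF0 its full 35GB/s out and 25GB/s in, while `ioif_bandwidth(6, 4, 2, 2)` shows the shared-lane case described above.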
The theoretical bandwidth of the Cell BE processor is not always attainable. The four data rings are a shared resource. While it is possible to have multiple transactions on a single ring, they cannot "overlap;" if a given ring is being used for SPE to SPE communications, this might block communication from the PPE to the IOIF units, depending on which SPEs are involved. Physical locality matters; it is important to ensure that tasks which exchange data between SPEs do not tie up an entire ring.
Drawing diagrams of the simultaneous transfers an algorithm is trying to perform can make it easier to see whether they are possible. While multiple simultaneous transactions can occur, transactions can block each other. In effect, the further apart two components are on the rings, the more expensive communication between them will be. One key feature is that the PPE and MIC are adjacent, so communications between them can generally occur with minimal disruption of other communications. In general, the even-numbered SPEs can talk to each other more cheaply than they can talk to the odd-numbered SPEs.
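As a rough aid to that kind of diagramming, the sketch below models one clockwise ring and asks whether two transfers would contend for the same segment. The port order is an assumption based on Figure 2's layout, and the code is an illustration of the reasoning, not a model of the actual arbiter:

```python
# Hypothetical ring order inferred from Figure 2 (an assumption, not
# taken from IBM documentation).
RING_ORDER = ["PPE", "SPE1", "SPE3", "SPE5", "SPE7", "IOIF1",
              "IOIF0", "SPE6", "SPE4", "SPE2", "SPE0", "MIC"]

def segment(src, dst):
    """Set of hop indices a clockwise transfer occupies from src to dst."""
    i, j = RING_ORDER.index(src), RING_ORDER.index(dst)
    hops = set()
    while i != j:
        hops.add(i)
        i = (i + 1) % len(RING_ORDER)
    return hops

def transfers_overlap(a, b):
    """True if two clockwise (src, dst) transfers share a ring segment."""
    return bool(segment(*a) & segment(*b))
```

For example, `transfers_overlap(("PPE", "SPE1"), ("SPE3", "SPE5"))` is `False`, so those two transfers could share one clockwise ring, while a PPE-to-SPE5 transfer would block an SPE1-to-SPE3 transfer on the same ring.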
The question of which ring is used is transparent to software. However, the SPEs on the north side of the chip are numbered 1, 3, 5, and 7, and those on the south side are numbered 0, 2, 4, and 6. Any communication between, say, SPE1 and SPE2 must "pass through" the PPE and the memory controller; transactions cannot go more than halfway around the EIB in a given direction. If a route would be too long clockwise, the communication must go counterclockwise instead, and vice versa. While truly an engineering marvel, the EIB can underachieve when a workload's communication pattern fights its topology.
Another class of bottlenecks is contention. For instance, if four SPEs are trying to move data to or from the MIC at the same time, their aggregate bandwidth of 102.4GB/s completely swamps the MIC's bandwidth of 25.6GB/s. Similarly, while the SPEs are trying to interact with the MIC, the PPE may have degraded access to main memory. When a unit is overwhelmed, it might need to retry commands, which in turn slows traffic down even further.
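The arithmetic of that example is simple but worth making explicit (numbers are from this article; the variable names are illustrative):

```python
# Four SPEs streaming at full port speed versus the MIC's own bandwidth.
PORT_GBPS = 25.6   # per-port bandwidth in one direction
MIC_GBPS = 25.6    # memory interface bandwidth

demand = 4 * PORT_GBPS                 # aggregate demand: 102.4GB/s
oversubscription = demand / MIC_GBPS   # the MIC is oversubscribed 4:1
```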
Thus, balancing access is very important. Some tolerance is built in: the bus allows each element to have up to 64 outstanding requests, although individual units might have a shorter queue. The MIC can queue the full 64 requests, which is a long enough queue to give some insurance against clashes, and allow the MIC to optimize its use of the Rambus interface by reordering the commands when dependencies allow it.
Resource allocation management is an optional facility used to minimize the effects of over-allocating critical resources. It is an independent function, but complementary to the EIB. Each critical resource is apportioned among groups of requestors in a way designed to keep the processor moving.
Several resources can be managed. One set is the XDR memory, divided into 16 banks. The inbound and outbound data paths of the IOIF0 and IOIF1 controllers can also be managed. Requestors get allocated into four resource allocation groups. Possible requestors include the PPE, the SPEs, and inbound and outbound I/O.
Within this framework, there is a central token manager; requestors must ask permission before issuing EIB commands to managed resources. Tokens are granted to resource allocation groups (RAGs), and holding a token entitles a requestor to issue a command. Within each RAG, allocation is handled on a round-robin basis. The token manager is configured dynamically in software to adjust allocation rates; for instance, favoring the PPE or the SPEs.
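A minimal software sketch of this scheme might look like the following. It assumes a simple per-RAG grant rate and serves queued requestors in arrival order as a stand-in for the hardware's round-robin; none of these names come from IBM's implementation:

```python
# Hypothetical token manager: grants per-RAG tokens each allocation
# interval, serving queued requestors within a RAG in order.
from collections import deque

class TokenManager:
    def __init__(self, rag_rates):
        # rag_rates: tokens granted per allocation interval, per RAG
        self.rag_rates = dict(rag_rates)
        self.queues = {rag: deque() for rag in rag_rates}

    def request(self, rag, requestor):
        """A requestor asks permission to issue a command."""
        self.queues[rag].append(requestor)

    def allocate(self):
        """Grant up to each RAG's rate of tokens; return who may proceed."""
        granted = []
        for rag, rate in self.rag_rates.items():
            q = self.queues[rag]
            for _ in range(min(rate, len(q))):
                granted.append(q.popleft())
        return granted

    def throttle(self, rag, new_rate):
        """Lower a RAG's rate, modeling congestion feedback."""
        self.rag_rates[rag] = new_rate
```

Lowering a RAG's rate with `throttle` models the congestion feedback from managed resources: an overloaded resource's requestors get fewer tokens until traffic clears.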
Feedback from managed resources detects congestion and can then throttle token allocation for that overloaded resource, giving the congestion time to clear up.
The EIB offers next-generation data bandwidth. The good news is that this will allow for applications that are qualitatively unlike anything we know today. The bad news is that this gives a whole new class of potential bottlenecks and performance issues to resolve. Careful use of workload assignments and planning for access to critical elements, such as memory, are necessary to obtain maximum performance.
This article was adapted by Peter Seebach, working from the original presentation "Unleashing the Cell Processor: The Element Interconnect Bus," presented at the MPR Fall Processor Forum 2005 by David Krolak of IBM. Peter would like to thank Tim Kelly and Dave Krolak for technical and editorial review during the writing process.
Cell Broadband Engine is a trademark of Sony Computer Entertainment Inc.
David is a Senior Engineer in the Engineering and Technology Services division. He received a B.S. degree in Electrical Engineering from the University of Wisconsin at Madison in 1979. In his 25 years at IBM, he has worked on DRAM controllers, was the lead designer for the L2 cache controllers used in the RS/6000 S70, S80, and S85 models, and is the lead designer of the Element Interconnect Bus used in the Cell chip. He holds nine patents and is currently working on future Cell designs.