
Just like being there: Papers from the Fall Processor Forum 2005: Unleashing the Cell Broadband Engine Processor

The Element Interconnect Bus

David Krolak, Senior Engineer, IBM Engineering and Technology Services
David is a Senior Engineer in the Engineering and Technology Services division. He received a B.S. degree in Electrical Engineering from the University of Wisconsin at Madison in 1979. In his 25 years at IBM, he has worked on DRAM controllers, was the lead designer for the L2 cache controllers used in the RS/6000 S70, S80, and S85 models, and is the lead designer of the Element Interconnect Bus used in the Cell chip. He holds nine patents and is currently working on future Cell designs.

Summary:  This paper from the MPR Fall Processor Forum 2005 explores the Cell Broadband Engine™ (Cell BE) Processor's Element Interconnect Bus (EIB). Designed to handle the bandwidth demands of a nine-core processor running at 3GHz, it's like no bus you have ever met before. Read why.


Date:  29 Nov 2005
Level:  Introductory

Much of the literature on the Cell Broadband Engine (Cell BE) Processor is focused on the processors in it: the eight synergistic processor elements (SPEs) and the power processor element (PPE) which organizes them. However, in a picture of the Cell BE processor as a whole, it turns out that the physical center of the processor is not any of the processor elements, but the bus which connects them. A nine-core processor running at 3GHz and faster imposes huge bandwidth demands, which are met by the Element Interconnect Bus (EIB). This article reviews the function and design of the EIB.

As an overview, the EIB provides the Cell BE with an aggregate main memory bandwidth (at 3.2GHz) of about 25.6GB/s, I/O bandwidth of 35GB/s inbound and another 40GB/s outbound, and a fair amount of bandwidth left over for moving data within the processor. (All numbers in this article are based on the 3.2GHz Cell BE.)


Figure 1. The data topology of the EIB

Terminology and layout

The Cell BE connects a number of devices to each other: the PPE, the eight SPEs, a memory interface controller (MIC), and a bus interface controller (BIC). The EIB runs at half of the processor's core frequency, and can shift a maximum of 96 bytes per processor cycle, or 192 bytes per bus cycle.

The EIB has independent networks for commands (requests for data from other sources) and for the data being moved. It has 12 ports for elements, each of which can produce and consume up to 16 bytes per bus cycle. The BIC has two ports, called IOIF0 (or BIF) and IOIF1, and each of the other components gets a single port.

Commands are filtered through address concentrators (ACs), which handle collision detection and prevention and ensure that all units have equal access to the command bus. There are multiple address concentrators, all of which forward commands to a single serial command reflection point, called AC0. The others are AC1, AC2 (of which there are two), and AC3.

Data transfer is more elaborate. There are four "rings," each of which is a chain connecting all of the data ports. Data can move down a ring only in one direction. For instance, a connection that allows data to move from the PPE to SPE1 cannot be used to move data from SPE1 back to the PPE. Two rings go clockwise, and two counterclockwise, and all four rings have the components attached in the same order, as Figure 2 shows:


Figure 2. The physical layout of the EIB

Each ring can move 16 bytes at a time from any position on the ring to any other position. In fact, each ring can carry up to three concurrent transfers, provided that the paths of those transfers do not overlap. The ring interface is not exposed to the data ports; which ring carries a given transfer is transparent to them.

Each port has a theoretical bandwidth of 25.6GB/s in each direction. The command bus streams commands fast enough to support 102.4GB/s for coherent commands, and 204.8GB/s for non-coherent commands. The data rings can sustain 204.8GB/s for some workloads, with transient rates as high as 307.2GB/s. (This would represent all four rings managing three concurrent transfers at once!)
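As a quick check, the headline figures can be re-derived from the per-port numbers given earlier. The following is a minimal Python sketch of that arithmetic only; the constant and function names are made up for illustration and do not come from any IBM tool or API.

```python
# Re-derive the EIB peak figures quoted in the text (3.2GHz Cell BE).
# All names here are illustrative; only the numbers come from the article.
CORE_HZ = 3_200_000_000
BUS_HZ = CORE_HZ // 2            # the EIB runs at half the core frequency
PORTS = 12                       # PPE + 8 SPEs + MIC + 2 BIC ports
BYTES_PER_PORT_PER_BUS_CYCLE = 16
RINGS = 4
TRANSFERS_PER_RING = 3           # concurrent, non-overlapping transfers


def bytes_per_bus_cycle():
    """Maximum bytes all ports together can move in one bus cycle."""
    return PORTS * BYTES_PER_PORT_PER_BUS_CYCLE


def port_bandwidth_gbs():
    """Peak bandwidth of one port, in one direction, in GB/s."""
    return BYTES_PER_PORT_PER_BUS_CYCLE * BUS_HZ / 1e9


def transient_ring_peak_gbs():
    """All four rings carrying three 16-byte transfers each, in GB/s."""
    return RINGS * TRANSFERS_PER_RING * BYTES_PER_PORT_PER_BUS_CYCLE * BUS_HZ / 1e9


if __name__ == "__main__":
    print(bytes_per_bus_cycle(), "bytes per bus cycle")
    print(port_bandwidth_gbs(), "GB/s per port per direction")
    print(transient_ring_peak_gbs(), "GB/s transient ring peak")
```

The transient peak is simply twelve simultaneous 16-byte transfers per bus cycle, which is also why it matches the 192 bytes per bus cycle quoted for the ports.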

Contrasting this with typical front-side bus arrangements is informative. A typical front-side bus has total available data bandwidth of around 6 to 8GB/s, and DDR2 memory interfaces can provide 6-11GB/s. The Cell BE's external memory bandwidth is 25.6GB/s inbound and outbound to the Rambus Dual XDR memory controller, roughly two to four times the 6-11GB/s of a typical DDR2 memory interface.

Furthermore, the I/O interfaces offer additional bandwidth. The first I/O interface, called BIF/IOIF0, can be switched between two protocols: the Broadband Engine Interface (BIF) coherent protocol, or the I/O Interface (IOIF) non-coherent protocol. The second I/O interface, called IOIF1, supports only the IOIF protocol.

For multiprocessor environments, BIF is the preferred protocol for interconnecting Cell BE processors. A pair of Cell BE processors can communicate directly using their BIF/IOIF0 ports, or a larger set can communicate over a switched bus.

The IOIF0 interface's bandwidth is scalable, from 0 to 7 bytes outbound, and 0 to 5 bytes inbound; each byte of transfer gives 5GB/s of bandwidth, for a peak of 35GB/s outbound and 25GB/s inbound. (The external interfaces of the IOIF elements are not locked to the internal clock of the Cell BE.) Meanwhile, IOIF1 can be scaled from 0 to 2 bytes inbound and outbound, for another 10GB/s of inbound and outbound I/O capacity, but one byte in each direction is shared with IOIF0; if IOIF1 is configured for 2 bytes in and out, IOIF0 is limited to 6 out and 4 in. Each of these interfaces exceeds the typical data bandwidth of a traditional system.
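The lane arithmetic in this paragraph can be sketched in a few lines of Python. This is an illustrative model only: the function name `io_bandwidth` is hypothetical, and treating the shared byte as a cap on the combined lane count per direction is an assumption drawn from the description above, not from IBM documentation.

```python
# Illustrative model of the IOIF lane configuration described in the text.
# Assumption: the byte lane "shared" between IOIF0 and IOIF1 is modeled as a
# cap on the combined number of lanes in each direction.
GBS_PER_LANE = 5  # each byte-wide lane carries 5GB/s

MAX_LANES = {
    "IOIF0": {"out": 7, "in": 5},
    "IOIF1": {"out": 2, "in": 2},
}
SHARED_LANES = 1  # one lane per direction is shared between the interfaces


def io_bandwidth(ioif0_out, ioif0_in, ioif1_out, ioif1_in):
    """Return GB/s per interface and direction, or raise ValueError."""
    lanes = {
        "IOIF0": {"out": ioif0_out, "in": ioif0_in},
        "IOIF1": {"out": ioif1_out, "in": ioif1_in},
    }
    for direction in ("out", "in"):
        cap = (MAX_LANES["IOIF0"][direction]
               + MAX_LANES["IOIF1"][direction]
               - SHARED_LANES)
        combined = 0
        for name in ("IOIF0", "IOIF1"):
            n = lanes[name][direction]
            if not 0 <= n <= MAX_LANES[name][direction]:
                raise ValueError(f"{name} {direction}: {n} lanes out of range")
            combined += n
        if combined > cap:
            raise ValueError(f"{direction}: {combined} lanes exceeds {cap}")
    return {
        name: {d: n * GBS_PER_LANE for d, n in dirs.items()}
        for name, dirs in lanes.items()
    }


if __name__ == "__main__":
    # IOIF1 at 2 bytes in and out leaves IOIF0 with at most 6 out and 4 in.
    print(io_bandwidth(6, 4, 2, 2))
```

Asking for the IOIF0 maximum together with a fully configured IOIF1, as in `io_bandwidth(7, 5, 2, 2)`, raises `ValueError` under this model because both interfaces would need the shared lane.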


Bottlenecks

The theoretical bandwidth of the Cell BE processor is not always attainable. The four data rings are a shared resource. While it is possible to have multiple transactions on a single ring, they cannot "overlap;" if a given ring is being used for SPE to SPE communications, this might block communication from the PPE to the IOIF units, depending on which SPEs are involved. Physical locality matters; it is important to ensure that tasks which exchange data between SPEs do not tie up an entire ring.

Drawing diagrams of the simultaneous transfers an algorithm is trying to perform can make it easier to see whether they are possible. While multiple simultaneous transactions can occur, transactions can block each other. In effect, the further apart two components are on the rings, the more expensive communication between them will be. One key feature is that the PPE and MIC are adjacent, so communications between them can generally occur with minimal disruption of other communications. In general, the even-numbered SPEs can talk to each other more cheaply than they can talk to the odd-numbered SPEs.

The choice of ring for a given transfer is transparent. However, the SPEs on the north end of the chip are numbered 1, 3, 5, and 7, and those on the south end are numbered 0, 2, 4, and 6. Any communication between, say, SPE1 and SPE2 must "pass through" the PPE and the memory controller; transactions cannot go more than halfway around the EIB in a given direction. If a transaction would be too long clockwise, the communication must go counterclockwise instead, and vice versa. As a result, a workload with many long or crossing transfers can leave the EIB well short of its peak bandwidth.
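The routing rule described here (take the shorter way around, never more than halfway) is easy to experiment with in code. The sketch below is a toy model in Python, not a description of the actual arbiter. The ring order in `RING` is an assumption, since Figure 2 is not reproduced in this text; it was chosen to agree with the statements that the PPE and MIC are adjacent and that odd and even SPEs sit on opposite sides. Which index direction is labeled "clockwise" is arbitrary, and the helper names are hypothetical.

```python
# Toy model of EIB data-ring routing. RING is an ASSUMED port order.
RING = ["PPE", "SPE1", "SPE3", "SPE5", "SPE7", "IOIF1",
        "IOIF0", "SPE6", "SPE4", "SPE2", "SPE0", "MIC"]


def route(src, dst):
    """Return (direction, hops, path) for the shorter way around the ring."""
    n = len(RING)
    i, j = RING.index(src), RING.index(dst)
    forward = (j - i) % n      # hops in increasing-index order
    backward = (i - j) % n     # hops in decreasing-index order
    if forward <= backward:    # never more than halfway (n // 2 hops)
        step, hops, direction = 1, forward, "clockwise"
    else:
        step, hops, direction = -1, backward, "counterclockwise"
    path = [RING[(i + step * k) % n] for k in range(hops + 1)]
    return direction, hops, path


def segments(src, dst):
    """Directed ring segments a transfer occupies."""
    direction, _, path = route(src, dst)
    return {(direction, a, b) for a, b in zip(path, path[1:])}


def would_overlap(transfer_a, transfer_b):
    """True if two (src, dst) transfers need a common directed segment,
    so they could not share a single ring at the same time."""
    return bool(segments(*transfer_a) & segments(*transfer_b))


if __name__ == "__main__":
    print(route("SPE1", "SPE2"))
    print(would_overlap(("SPE1", "SPE2"), ("PPE", "MIC")))
    print(would_overlap(("SPE0", "SPE2"), ("SPE4", "SPE6")))
```

Under the assumed order, a transfer from SPE1 to SPE2 takes four hops through the PPE and MIC, and it competes for ring segments with a PPE-to-MIC transfer travelling the same way, while two short transfers between neighboring even-numbered SPEs do not interfere. Listing the segments for each planned transfer is a mechanical version of drawing the diagram suggested above.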

Another class of bottlenecks is contention. For instance, if four SPEs are trying to move data to or from the MIC at the same time, their aggregate bandwidth of 102.4GB/s completely swamps the MIC's bandwidth of 25.6GB/s. Similarly, while the SPEs are trying to interact with the MIC, the PPE may have degraded access to main memory. When a unit is overwhelmed, it might need to retry commands, which in turn slows traffic down even further.
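The scale of the mismatch is easy to quantify. The short Python sketch below reproduces the arithmetic for any number of requestors; the function name is hypothetical, and the "even share" figure assumes the target's bandwidth is divided equally, which the article does not claim and which ignores retries.

```python
# Oversubscription arithmetic for several units targeting one port.
PORT_GBS = 25.6   # peak per-port bandwidth in one direction
MIC_GBS = 25.6    # peak main memory bandwidth


def oversubscription(n_requestors, port_gbs=PORT_GBS, target_gbs=MIC_GBS):
    """Return (aggregate demand in GB/s, demand/capacity ratio,
    per-requestor GB/s under an ASSUMED even split of the target)."""
    demand = n_requestors * port_gbs
    ratio = demand / target_gbs
    even_share = target_gbs / n_requestors
    return demand, ratio, even_share


if __name__ == "__main__":
    demand, ratio, share = oversubscription(4)
    print(f"demand {demand:.1f}GB/s, {ratio:.0f}x capacity, "
          f"about {share:.1f}GB/s each if shared evenly")
```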

Thus, balancing access is very important. Some tolerance is built in: the bus allows each element to have up to 64 outstanding requests, although individual units might have a shorter queue. The MIC can queue the full 64 requests, which is a long enough queue to give some insurance against clashes, and allow the MIC to optimize its use of the Rambus interface by reordering the commands when dependencies allow it.


Resource allocation management

Resource allocation management is an optional facility used to minimize over-allocation effects of critical resources. It is an independent function, but complementary to the EIB. A critical resource is distributed among groups of requestors in a way designed to keep the processor moving.

Several resources can be managed. One set is the XDR memory, divided into 16 banks. The inbound and outbound data paths of the IOIF0 and IOIF1 controllers can also be managed. Requestors get allocated into four resource allocation groups. Possible requestors include the PPE, the SPEs, and inbound and outbound I/O.

Within this framework, there is a central token manager controller; requestors ask permission to issue EIB commands to managed resources. Tokens granted across resource allocation groups (RAGs) allow requestors access to issue a command. Within each RAG, allocation is handled on a round-robin basis. The token manager is configured dynamically in software to adjust allocation rates; for instance, favoring the PPE or the SPEs.

Feedback from managed resources detects congestion and can then throttle token allocation for that overloaded resource, giving the congestion time to clear up.
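To make the token scheme more concrete, here is a toy Python model of a token manager for a single managed resource. It is a sketch under stated assumptions, not the hardware algorithm: the class and method names are hypothetical, allocation rates are expressed as whole tokens per round for each RAG, and congestion feedback is modeled crudely as withholding all tokens until the congestion flag clears.

```python
from collections import deque


class TokenManager:
    """Toy token manager: per-RAG rates, round-robin inside each RAG."""

    def __init__(self, rag_members, rag_rates):
        # rag_members: {rag_id: [requestor names]}
        # rag_rates:   {rag_id: tokens granted to that RAG per round}
        self.queues = {rag: deque(m) for rag, m in rag_members.items()}
        self.rates = dict(rag_rates)
        self.congested = False

    def set_rate(self, rag, tokens_per_round):
        """Software reconfiguration, e.g. to favor the PPE or the SPEs."""
        self.rates[rag] = tokens_per_round

    def report_congestion(self, congested):
        """Feedback from the managed resource (assumed to be a simple flag)."""
        self.congested = congested

    def allocate_round(self):
        """Grant tokens for one round; returns a list of (rag, requestor)."""
        grants = []
        if self.congested:
            return grants  # throttle: give the congestion time to clear
        for rag, queue in self.queues.items():
            for _ in range(self.rates.get(rag, 0)):
                if not queue:
                    break
                grants.append((rag, queue[0]))
                queue.rotate(-1)  # next requestor in the RAG goes first
        return grants


if __name__ == "__main__":
    tm = TokenManager({0: ["PPE"], 1: ["SPE0", "SPE1", "SPE2"]},
                      {0: 1, 1: 2})
    print(tm.allocate_round())
    print(tm.allocate_round())
    tm.report_congestion(True)
    print(tm.allocate_round())
```

In this model, raising a RAG's rate with `set_rate` is the knob that favors one group of requestors over another, while the rotation inside each RAG keeps any single requestor from monopolizing that group's tokens.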


Summary

The EIB offers next-generation data bandwidth. The good news is that this will allow for applications that are qualitatively unlike anything we know today. The bad news is that this gives a whole new class of potential bottlenecks and performance issues to resolve. Careful use of workload assignments and planning for access to critical elements, such as memory, are necessary to obtain maximum performance.


Acknowledgments

This article was adapted by Peter Seebach, working from the original presentation "Unleashing the Cell Processor: The Element Interconnect Bus," presented at the MPR Fall Processor Forum 2005 by David Krolak of IBM. Peter would like to thank Tim Kelly and Dave Krolak for technical and editorial review during the writing process.


Attributions

Cell Broadband Engine is a trademark of Sony Computer Entertainment Inc.

