Just like being there: Papers from the Fall Processor Forum 2005: Unleashing the Cell Broadband Engine Processor

The Element Interconnect Bus

David Krolak, Senior Engineer, IBM Engineering and Technology Services
David is a Senior Engineer in the Engineering and Technology Services division. He received a B.S. degree in Electrical Engineering from the University of Wisconsin at Madison in 1979. In his 25 years at IBM, he has worked on DRAM controllers, was the lead designer for the L2 cache controllers used in the RS/6000 S70, S80, and S85 models, and is the lead designer of the Element Interconnect Bus used in the Cell chip. He holds nine patents and is currently working on future Cell designs.

Summary:  This paper from the MPR Fall Processor Forum 2005 explores the Cell Broadband Engine™ (Cell BE) Processor's Element Interconnect Bus (EIB). Designed to handle the bandwidth demands of a nine-core processor running at 3GHz, it's like no bus you have ever met before. Read why.


Date:  29 Nov 2005
Level:  Introductory


Much of the literature on the Cell Broadband Engine (Cell BE) Processor is focused on the processors in it: the eight synergistic processor elements (SPEs) and the power processor element (PPE) which organizes them. However, in a picture of the Cell BE processor as a whole, it turns out that the physical center of the processor is not any of the processor elements, but the bus which connects them. A nine-core processor running at 3GHz and faster imposes huge bandwidth demands, which are met by the Element Interconnect Bus (EIB). This article reviews the function and design of the EIB.

As an overview, the EIB provides the Cell BE with an aggregate main memory bandwidth (at 3.2GHz) of about 25.6GB/s, I/O bandwidth of 35GB/s inbound and another 40GB/s outbound, and a fair amount of bandwidth left over for moving data within the processor. (All numbers in this article are based on the 3.2GHz Cell BE.)


Figure 1. The data topology of the EIB

Terminology and layout

The EIB connects a number of devices to each other: the PPE, the eight SPEs, a memory interface controller (MIC), and a bus interface controller (BIC). The EIB runs at half of the processor's core frequency, and can shift a maximum of 96 bytes per processor cycle, or 192 bytes per bus cycle.

The EIB has independent networks for commands (requests for data from other sources) and for the data being moved. It has 12 ports for elements, each of which can produce and consume up to 16 bytes per bus cycle. The BIC has two ports, called IOIF0 (or BIF) and IOIF1, and each of the other components gets a single port.

Commands are filtered through address concentrators (ACs), which handle collision detection and prevention and ensure that all units have equal access to the command bus. There are multiple address concentrators, all of which forward commands to a single serial command reflection point, called AC0. The others are AC1, AC2 (of which there are two), and AC3.

Data transfer is more elaborate. There are four "rings," each of which is a chain connecting all of the data ports. Data can move down a ring only in one direction. For instance, a connection that allows data to move from the PPE to SPE1 cannot be used to move data from SPE1 back to the PPE. Two rings go clockwise, and two counterclockwise, and all four rings have the components attached in the same order, as Figure 2 shows:


Figure 2. The physical layout of the EIB

Each ring can move 16 bytes at a time from any position on the ring to any other position. In fact, each ring can carry up to three transfers concurrently, provided that their paths do not overlap. The ring interface is transparent to the data ports: a port does not see which ring carries its data.

Each port has a theoretical bandwidth of 25.6GB/s in each direction. The command bus streams commands fast enough to support 102.4GB/s for coherent commands, and 204.8GB/s for non-coherent commands. The data rings can sustain 204.8GB/s for some workloads, with transient rates as high as 307.2GB/s. (This would represent all four rings managing three concurrent transfers at once!)
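The headline figures above follow from a few constants given in this article. The short Python sketch below reproduces the arithmetic; the function and constant names are illustrative only, and the assumption that every port and ring slot is busy at once is a best case, not a measured result.

```python
# Back-of-the-envelope EIB bandwidth arithmetic, using figures from the article.
CORE_GHZ = 3.2             # Cell BE core clock assumed throughout the article
BUS_GHZ = CORE_GHZ / 2     # the EIB runs at half the core frequency
BYTES_PER_PORT = 16        # each port moves up to 16 bytes per bus cycle
PORTS = 12                 # PPE, 8 SPEs, MIC, and two BIC ports
RINGS = 4                  # four data rings
TRANSFERS_PER_RING = 3     # up to three non-overlapping transfers per ring


def port_bandwidth_gbps():
    """Theoretical bandwidth of one port, in one direction, in GB/s."""
    return BYTES_PER_PORT * BUS_GHZ


def bytes_per_bus_cycle():
    """Maximum bytes moved per bus cycle if all ports are active."""
    return PORTS * BYTES_PER_PORT


def peak_ring_bandwidth_gbps():
    """Transient peak: all four rings carrying three transfers at once."""
    return RINGS * TRANSFERS_PER_RING * BYTES_PER_PORT * BUS_GHZ


if __name__ == "__main__":
    print(f"per port:      {port_bandwidth_gbps():.1f} GB/s")
    print(f"per bus cycle: {bytes_per_bus_cycle()} bytes")
    print(f"transient max: {peak_ring_bandwidth_gbps():.1f} GB/s")
```

Twelve ports at 16 bytes each is consistent with the 192 bytes per bus cycle quoted earlier, and twelve simultaneous 16-byte transfers at the 1.6GHz bus clock give the 307.2GB/s transient figure.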

Contrasting this with typical front-side bus arrangements is informative. A typical front-side bus has total available data bandwidth of around 6 to 8GB/s, and DDR2 memory interfaces can provide 6 to 11GB/s. The Cell BE's external memory bandwidth is 25.6GB/s inbound and outbound to the Rambus Dual XDR memory controller, roughly 3-8 times the bandwidth of a typical DDR memory bus.

Furthermore, the I/O interfaces offer additional bandwidth. The first I/O interface, called BIF/IOIF0, can be switched between two protocols: the Broadband Engine Interface (BIF) coherent protocol, or the I/O Interface (IOIF) non-coherent protocol. The second I/O interface, called IOIF1, supports only the IOIF protocol.

For multiprocessor environments, BIF is the preferred protocol for interconnecting Cell BE processors. A pair of Cell BE processors can communicate directly using their BIF/IOIF0 ports, or a larger set can communicate over a switched bus.

The IOIF0 interface's bandwidth is scalable, from 0 to 7 bytes outbound, and 0 to 5 bytes inbound; each byte of transfer gives 5GB/s of bandwidth, for a peak of 30GB/s outbound and 25GB/s inbound. (The external interfaces of the IOIF elements are not locked to the internal clock of the Cell BE.) Meanwhile, IOIF1 can be scaled from 0 to 2 bytes inbound and outbound, for another 10GB/s of inbound and outbound I/O capacity, but one byte in each direction is shared with IOIF0; if IOIF1 is configured for 2 bytes in and out, IOIF0 is limited to 6 out and 4 in. Each of these interfaces exceeds the typical data bandwidth of a traditional system.


Bottlenecks

The theoretical bandwidth of the Cell BE processor is not always attainable. The four data rings are a shared resource. While it is possible to have multiple transactions on a single ring, they cannot "overlap;" if a given ring is being used for SPE to SPE communications, this might block communication from the PPE to the IOIF units, depending on which SPEs are involved. Physical locality matters; it is important to ensure that tasks which exchange data between SPEs do not tie up an entire ring.

Drawing diagrams of the simultaneous transfers an algorithm is trying to perform can make it easier to see whether they are possible. While multiple simultaneous transactions can occur, transactions can block each other. In effect, the further apart two components are on the rings, the more expensive communication between them will be. One key feature is that the PPE and MIC are adjacent, so communications between them can generally occur with minimal disruption of other communications. In general, the even-numbered SPEs can talk to each other more cheaply than they can talk to the odd-numbered SPEs.

Which ring carries a given transfer is transparent to the units involved. However, the SPEs on the north end of the chip are numbered 1, 3, 5, and 7, and those on the south end are numbered 0, 2, 4, and 6. Any communication between, say, SPE1 and SPE2 must "pass through" the PPE and the memory controller; transactions cannot go more than halfway around the EIB in a given direction. If a transaction would be too long clockwise, it must go counterclockwise instead, and vice versa. While truly an engineering marvel, the EIB may underachieve when a workload's communication pattern is complex.
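Instead of drawing diagrams by hand, the routing rules described in this section can be modeled in a few lines. The Python sketch below is a toy model, not the EIB arbiter. The port order in RING is an assumption (Figure 2 is not reproduced in this text), chosen to agree with the statements above that the PPE and MIC are adjacent and that odd and even SPEs sit on opposite sides. Calling increasing index "clockwise" is an arbitrary convention, the model always takes the shorter way around, and the function names are hypothetical.

```python
# Toy model of EIB ring routing. ASSUMPTION: this port order is illustrative;
# the article's Figure 2 is not available in this text.
RING = ["PPE", "SPE1", "SPE3", "SPE5", "SPE7", "IOIF1",
        "IOIF0", "SPE6", "SPE4", "SPE2", "SPE0", "MIC"]
N = len(RING)


def route(src, dst):
    """Pick the shorter direction, so a transfer never goes more than
    halfway around the ring. Returns (direction, hops, path)."""
    s, d = RING.index(src), RING.index(dst)
    cw_hops = (d - s) % N    # hops if travelling toward increasing index
    ccw_hops = (s - d) % N   # hops if travelling toward decreasing index
    if cw_hops <= ccw_hops:
        step, hops, direction = 1, cw_hops, "clockwise"
    else:
        step, hops, direction = -1, ccw_hops, "counterclockwise"
    path = [RING[(s + step * i) % N] for i in range(hops + 1)]
    return direction, hops, path


def links(src, dst):
    """Return the direction and the set of ring links the transfer occupies.
    Link i joins RING[i] and RING[(i + 1) % N]."""
    direction, hops, _ = route(src, dst)
    s = RING.index(src)
    if direction == "clockwise":
        return direction, {(s + i) % N for i in range(hops)}
    return direction, {(s - 1 - i) % N for i in range(hops)}


def would_overlap(a, b):
    """True if two (src, dst) transfers travel in the same direction and
    share a link, so they could not use the same ring at the same time."""
    dir_a, links_a = links(*a)
    dir_b, links_b = links(*b)
    return dir_a == dir_b and not links_a.isdisjoint(links_b)


if __name__ == "__main__":
    print(route("SPE1", "SPE2"))
    print(would_overlap(("SPE1", "SPE2"), ("PPE", "MIC")))
    print(would_overlap(("SPE1", "SPE3"), ("SPE5", "SPE7")))
```

Under the assumed layout, a transfer from SPE1 to SPE2 is routed through the PPE and the MIC, and it overlaps a PPE-to-MIC transfer travelling in the same direction, while two transfers between neighboring odd-numbered SPEs do not overlap.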

Another class of bottlenecks is contention. For instance, if four SPEs are trying to move data to or from the MIC at the same time, their aggregate bandwidth of 102.4GB/s completely swamps the MIC's bandwidth of 25.6GB/s. Similarly, while the SPEs are trying to interact with the MIC, the PPE may have degraded access to main memory. When a unit is overwhelmed, it might need to retry commands, which in turn slows traffic down even further.
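The contention example above is simple arithmetic, sketched here in Python. The even split of MIC bandwidth among requestors is an assumption made for illustration; it ignores retries and arbitration, so real per-requestor throughput would be no better than this figure. The function name is hypothetical.

```python
# Oversubscription of the MIC when several ports target it at once.
PORT_GBPS = 25.6   # theoretical per-port bandwidth, from the article
MIC_GBPS = 25.6    # main memory bandwidth at 3.2GHz, from the article


def mic_contention(n_requestors):
    """Return (aggregate demand, oversubscription ratio, even share per
    requestor). ASSUMPTION: bandwidth divides evenly, with no retry cost."""
    demand = n_requestors * PORT_GBPS
    ratio = demand / MIC_GBPS
    share = MIC_GBPS / n_requestors
    return demand, ratio, share


if __name__ == "__main__":
    demand, ratio, share = mic_contention(4)
    print(f"demand {demand:.1f} GB/s, {ratio:.0f}x oversubscribed, "
          f"at most {share:.1f} GB/s each")
```

With four SPEs, demand is four times what the MIC can supply, so each SPE sees at most a quarter of its port bandwidth on that path.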

Thus, balancing access is very important. Some tolerance is built in: the bus allows each element to have up to 64 outstanding requests, although individual units might have a shorter queue. The MIC can queue the full 64 requests, which is a long enough queue to give some insurance against clashes, and allow the MIC to optimize its use of the Rambus interface by reordering the commands when dependencies allow it.


Resource allocation management

Resource allocation management is an optional facility used to minimize the effects of over-allocating critical resources. It is an independent function, but complementary to the EIB. A critical resource is distributed among groups of requestors in a way designed to keep the processor moving.

Several resources can be managed. One set is the XDR memory, divided into 16 banks. The inbound and outbound data paths of the IOIF0 and IOIF1 controllers can also be managed. Requestors get allocated into four resource allocation groups. Possible requestors include the PPE, the SPEs, and inbound and outbound I/O.

Within this framework, there is a central token manager; requestors ask it for permission to issue EIB commands to managed resources. Tokens are granted across the resource allocation groups (RAGs), and a granted token allows a requestor to issue a command. Within each RAG, allocation is handled on a round-robin basis. Software can reconfigure the token manager dynamically to adjust allocation rates, for instance to favor the PPE or the SPEs.

Feedback from managed resources detects congestion and can then throttle token allocation for that overloaded resource, giving the congestion time to clear up.
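The token scheme can be pictured with a small software model. The Python sketch below is a simplification under stated assumptions, not the hardware's algorithm: the class name, the per-interval rates, and the group assignments are all hypothetical. It shows only the two properties described in this section: each group receives tokens at a configurable rate, and requestors within a group are served round-robin.

```python
from collections import deque


class TokenManager:
    """Toy model of token allocation for one managed resource."""

    def __init__(self, groups, rates):
        # groups: group name -> list of requestors in that group
        # rates:  group name -> tokens granted per interval (hypothetical unit)
        self.queues = {rag: deque(members) for rag, members in groups.items()}
        self.rates = dict(rates)

    def set_rate(self, rag, tokens_per_interval):
        """Software can adjust a group's allocation rate at any time."""
        self.rates[rag] = tokens_per_interval

    def grant_interval(self):
        """Grant one interval's worth of tokens; return (group, requestor) pairs."""
        grants = []
        for rag, queue in self.queues.items():
            for _ in range(self.rates.get(rag, 0)):
                if not queue:
                    break
                grants.append((rag, queue[0]))
                queue.rotate(-1)   # round-robin: served requestor goes to the back
        return grants


if __name__ == "__main__":
    tm = TokenManager(
        groups={"RAG0": ["PPE"], "RAG1": ["SPE0", "SPE1", "SPE2"]},
        rates={"RAG0": 1, "RAG1": 2},
    )
    print(tm.grant_interval())
    print(tm.grant_interval())
    tm.set_rate("RAG1", 1)   # for example, throttle the SPE group
    print(tm.grant_interval())
```

Lowering a group's rate with set_rate stands in for both software reconfiguration and congestion feedback; in this model the two differ only in who makes the call.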


Summary

The EIB offers next-generation data bandwidth. The good news is that this will allow for applications that are qualitatively unlike anything we know today. The bad news is that this gives a whole new class of potential bottlenecks and performance issues to resolve. Careful use of workload assignments and planning for access to critical elements, such as memory, are necessary to obtain maximum performance.


Acknowledgments

This article was adapted by Peter Seebach, working from the original presentation "Unleashing the Cell Processor: The Element Interconnect Bus," presented at the MPR Fall Processor Forum 2005 by David Krolak of IBM. Peter would like to thank Tim Kelly and Dave Krolak for technical and editorial review during the writing process.


Attributions

Cell Broadband Engine is a trademark of Sony Computer Entertainment Inc.

