Level: Introductory
David Krolak, Senior Engineer, IBM Engineering and Technology Services
29 Nov 2005
This paper from the MPR Fall Processor Forum 2005 explores the Cell Broadband Engine™ (Cell BE) Processor's Element Interconnect Bus (EIB). Designed to handle the bandwidth demands of a nine-core processor running at 3GHz, it's like no bus you have ever met before. Read why.
Much of the literature on the Cell Broadband Engine (Cell BE) Processor focuses on its processors:
the eight synergistic processor elements (SPEs) and the power
processor element (PPE), which organizes them. However, in a picture of
the Cell BE processor as a whole, the physical center of
the processor is not any of the processor elements but the bus that
connects them. A nine-core processor running at 3GHz and faster imposes huge
bandwidth demands, which are met by the Element Interconnect Bus (EIB).
This article reviews the function and design of the EIB.
As an overview, the EIB provides the Cell BE with an aggregate main memory
bandwidth (at 3.2GHz) of about 25.6GB/s, I/O bandwidth of 35GB/s inbound
and another 40GB/s outbound, and a fair amount of bandwidth left over for
moving data within the processor. (All numbers in this article are based
on the 3.2GHz Cell BE.)
Figure 1. The data topology of the EIB
Terminology and layout
The Cell BE connects a number of devices to each other: the
PPE, the eight SPEs, a memory interface controller (MIC), and a bus interface
controller (BIC). The EIB runs at half of the processor's core frequency,
and can shift a maximum of 96 bytes per processor cycle, or 192 bytes per
bus cycle.
The EIB has independent networks for commands (requests for data from
other sources) and for the data being moved. It has 12 ports for
elements, each of which can produce and consume up to 16 bytes per bus
cycle. The BIC has two ports, called IOIF0 (or BIF) and IOIF1, and each
of the other components gets a single port.
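The port count and the per-cycle figures above follow from one another. The short Python sketch below checks the arithmetic; the variable names are illustrative only and do not come from any Cell BE toolkit.

```python
# Ports per attached unit, as described in the text: one each for the PPE,
# every SPE and the MIC, and two for the BIC (IOIF0 and IOIF1).
ports = {"PPE": 1, "SPE0-SPE7": 8, "MIC": 1, "BIC": 2}

BYTES_PER_PORT_PER_BUS_CYCLE = 16
CORE_CYCLES_PER_BUS_CYCLE = 2  # the EIB runs at half the core frequency

total_ports = sum(ports.values())
bytes_per_bus_cycle = total_ports * BYTES_PER_PORT_PER_BUS_CYCLE
bytes_per_core_cycle = bytes_per_bus_cycle // CORE_CYCLES_PER_BUS_CYCLE

print(total_ports, bytes_per_bus_cycle, bytes_per_core_cycle)  # 12 192 96
```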
Commands are filtered through address concentrators (ACs), which handle
collision detection and prevention and ensure that all units have equal access to the command bus. There are multiple
address concentrators, all of which forward commands to a single serial
command reflection point, called AC0. The others are AC1, AC2 (of which
there are two), and AC3.
Data transfer is more elaborate. There are four "rings," each of which
is a chain connecting all of the data ports. Data can move down a ring
only in one direction. For instance, a connection that allows data to move
from the PPE to SPE1 cannot be used to move data from SPE1 back to the
PPE. Two rings go clockwise, and two counterclockwise, and all four rings
have the components attached in the same order, as Figure 2 shows:
Figure 2. The physical layout of the EIB
Each ring can move 16 bytes at a time from any position on the ring to
any other position. In fact, each ring can carry up to three concurrent
transfers, provided that their paths do not overlap. The ring interface is
transparent to the data ports, which do not see which ring carries their data.
Each port has a theoretical bandwidth of 25.6GB/s in each direction.
The command bus streams commands fast enough to support 102.4GB/s for
coherent commands, and 204.8GB/s for non-coherent commands. The data
rings can sustain 204.8GB/s for some workloads, with transient rates as
high as 307.2GB/s. (This would represent all four rings managing three
concurrent transfers at once!)
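Two of these figures can be reproduced from numbers already given in this article: the per-port rate and the 307.2GB/s transient peak. The sketch below is only a back-of-the-envelope check with made-up constant names; it does not model the command bus or the sustained 204.8GB/s figure.

```python
CORE_GHZ = 3.2               # all numbers in the article assume a 3.2GHz part
BUS_GHZ = CORE_GHZ / 2       # the EIB runs at half the core frequency
BYTES_PER_TRANSFER = 16      # bytes moved per port (or ring slot) per bus cycle
RINGS = 4
TRANSFERS_PER_RING = 3       # concurrent, non-overlapping transfers per ring

# One port moving 16 bytes on every bus cycle, in GB/s (1GB = 10^9 bytes).
port_gb_s = BYTES_PER_TRANSFER * BUS_GHZ

# All four rings carrying three transfers each at the same time.
peak_gb_s = RINGS * TRANSFERS_PER_RING * port_gb_s

print(f"{port_gb_s:.1f} {peak_gb_s:.1f}")  # 25.6 307.2
```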
Contrasting this with typical front-side bus arrangements is informative.
A typical front-side bus has total available data bandwidth around 6 to 8GB/s, and
DDR2 memory interfaces can provide 6-11GB/s. The Cell BE's external memory
bandwidth is 25.6GB/s inbound and outbound to the Rambus Dual XDR memory
controller, roughly 3-8 times the bandwidth of a typical DDR memory bus.
Furthermore, the I/O interfaces offer additional bandwidth. The first I/O
interface, called BIF/IOIF0, can be switched between two protocols: the
Broadband Engine Interface (BIF) coherent protocol, or the I/O Interface
(IOIF) non-coherent protocol. The second I/O interface, called IOIF1,
only supports the IOIF protocol.
For multiprocessor environments, BIF is the preferred protocol for
interconnecting Cell BE processors. A pair of Cell BE processors can
communicate directly using their BIF/IOIF0 ports, or a larger set can
communicate over a switched bus.
The IOIF0 interface's bandwidth is scalable, from 0 to 7 bytes outbound, and 0 to 5 bytes inbound; each byte of transfer gives 5GB/s of bandwidth, for a peak of 30GB/s outbound and 25GB/s inbound. (The external interfaces of the IOIF elements are not locked to the internal clock of the Cell BE.) Meanwhile, IOIF1 can be scaled from 0 to 2 bytes inbound and outbound, for another 10GB/s of inbound and outbound I/O capacity, but one byte in each direction is shared with IOIF0; if IOIF1 is configured for 2 bytes in and out, IOIF0 is limited to 6 out and 4 in. Each of these interfaces exceeds the typical data bandwidth of a traditional system.
Bottlenecks
The theoretical bandwidth of the Cell BE processor is not always attainable.
The four data rings are a shared resource. While it is possible to have
multiple transactions on a single ring, they cannot "overlap;" if a given ring is being used for SPE to SPE communications, this
might block communication from the PPE to the IOIF units, depending
on which SPEs are involved.
Physical locality matters; it is important to ensure that tasks which
exchange data between SPEs do not tie up an entire ring.
Drawing diagrams of the simultaneous transfers an algorithm is trying to
perform can make it easier to see whether they are possible. While
multiple simultaneous transactions can occur, transactions can block each other. In effect, the further apart two
components are on the rings, the more expensive communication between them
will be. One key feature is that the PPE and MIC are adjacent, so
communications between them can generally occur with minimal disruption of
other communications. In general, the even-numbered SPEs can talk to each
other more cheaply than they can talk to the odd-numbered SPEs.
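The same check that a diagram makes visible can be sketched in a few lines of Python. The ring order below is an assumption, since Figure 2 is not reproduced here; it was chosen to match the text (PPE next to MIC, odd-numbered SPEs on one side, even-numbered SPEs on the other). The function names are hypothetical, and the model covers a single ring in a single direction, not the real EIB arbiter.

```python
# Assumed order of elements around the ring (not taken from Figure 2).
RING = ["PPE", "SPE1", "SPE3", "SPE5", "SPE7", "IOIF1",
        "IOIF0", "SPE6", "SPE4", "SPE2", "SPE0", "MIC"]

MAX_TRANSFERS_PER_RING = 3


def links(src, dst):
    """Ring links used when data travels in list order from src to dst.

    Link k joins RING[k] to RING[(k + 1) % len(RING)].
    """
    n = len(RING)
    start, end = RING.index(src), RING.index(dst)
    hops = (end - start) % n
    return {(start + k) % n for k in range(hops)}


def can_share_ring(transfers):
    """True if the (src, dst) transfers fit on one ring without overlapping."""
    if len(transfers) > MAX_TRANSFERS_PER_RING:
        return False
    used = set()
    for src, dst in transfers:
        path = links(src, dst)
        if used & path:      # some link is already occupied
            return False
        used |= path
    return True


# Three short neighbor-to-neighbor transfers fit together ...
print(can_share_ring([("PPE", "SPE1"), ("SPE3", "SPE5"), ("SPE7", "IOIF1")]))
# ... but a long transfer blocks a short one that lies inside its path.
print(can_share_ring([("SPE1", "SPE7"), ("SPE3", "SPE5")]))
```

In this toy model, the long SPE1-to-SPE7 transfer occupies every link between those two elements, which is why the second call reports a conflict.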
The choice of ring for a given transfer is transparent. However, the
SPEs on the north end of the chip are numbered 1, 3, 5, and 7, and those on
the south end are numbered 0, 2, 4, and 6. Any communication between, say,
SPE1 and SPE2 must "pass through" the PPE and the memory controller;
transactions cannot go more than halfway around the EIB in a given direction.
If a transaction would be too long clockwise, the communication must go
counterclockwise instead, and vice versa. While the EIB is truly
an engineering marvel, it may underachieve when workloads have
complex communication patterns.
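The halfway rule can be illustrated with a small routing sketch. As before, the ring order is an assumption chosen to agree with the text rather than a copy of Figure 2, and `route` is a hypothetical helper. Because the list does not say which way is clockwise, the two directions are simply called "forward" (list order) and "backward".

```python
# Assumed order of elements around the ring (not taken from Figure 2).
RING = ["PPE", "SPE1", "SPE3", "SPE5", "SPE7", "IOIF1",
        "IOIF0", "SPE6", "SPE4", "SPE2", "SPE0", "MIC"]


def route(src, dst):
    """Pick the direction that needs at most half a lap of the ring.

    Returns (direction, hops, elements passed on the way).
    """
    n = len(RING)
    start, end = RING.index(src), RING.index(dst)
    forward = (end - start) % n    # hops when following list order
    backward = (start - end) % n   # hops when going against list order
    if forward <= backward:
        step, hops, direction = 1, forward, "forward"
    else:
        step, hops, direction = -1, backward, "backward"
    via = [RING[(start + step * k) % n] for k in range(1, hops)]
    return direction, hops, via


print(route("SPE1", "SPE2"))  # ('backward', 4, ['PPE', 'MIC', 'SPE0'])
print(route("SPE0", "SPE2"))  # ('backward', 1, [])
```

With this assumed layout, the SPE1-to-SPE2 transfer crosses the PPE and the MIC, as the text describes, while two neighboring even-numbered SPEs are a single hop apart.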
Another class of bottlenecks is contention. For instance, if four SPEs
are trying to move data to or from the MIC at the same time, their
aggregate bandwidth of 102.4GB/s completely swamps the MIC's bandwidth
of 25.6GB/s. Similarly, while the SPEs are trying to interact with the
MIC, the PPE may have degraded access to main memory. When a unit is
overwhelmed, it might need to retry commands, which in turn slows traffic
down even further.
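The size of the mismatch is easy to quantify. The sketch below uses the port and MIC figures from this article; the equal split among requestors is an assumption made for illustration, not a description of how the MIC actually arbitrates, and `fair_share` is a hypothetical helper.

```python
def fair_share(requestors, per_port_gb_s=25.6, target_gb_s=25.6):
    """Return (aggregate demand, oversubscription factor, per-requestor share).

    Assumes every requestor asks for its full port bandwidth and that the
    target's bandwidth is split equally among them.
    """
    demand = requestors * per_port_gb_s
    oversubscription = demand / target_gb_s
    share = min(per_port_gb_s, target_gb_s / requestors)
    return demand, oversubscription, share


demand, factor, share = fair_share(4)
print(f"{demand:.1f} {factor:.1f} {share:.1f}")  # 102.4 4.0 6.4
```

Under these assumptions, four SPEs streaming to or from memory each see about a quarter of the bandwidth their port could carry.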
Thus, balancing access is very important. Some tolerance is built
in: the bus allows each element to have up to 64 outstanding
requests, although individual units might have a shorter queue.
The MIC can queue the full 64 requests, which is a long
enough queue to give some insurance against clashes, and allow the
MIC to optimize its use of the Rambus interface by reordering the commands when dependencies allow it.
Resource allocation management
Resource allocation management is an optional facility used to minimize
over-allocation effects of critical resources. It is an independent
function, but complementary to the EIB. A critical resource is
distributed among groups of requestors in a way designed to keep the
processor moving.
Several resources can be managed. One set is the XDR
memory, divided into 16 banks. The inbound and outbound data paths of the
IOIF0 and IOIF1 controllers can also be managed. Requestors get allocated
into four resource allocation groups. Possible requestors include the
PPE, the SPEs, and inbound and outbound I/O.
Within this framework, there is a central token manager;
requestors ask it for permission to issue EIB commands to managed resources.
Tokens are distributed across the resource allocation groups (RAGs), and a
requestor needs a token to issue a command. Within each RAG, allocation is
handled on a round-robin basis. The token manager is configured dynamically in
software to adjust allocation rates, for instance to favor the PPE or the
SPEs.
Feedback from managed resources detects congestion and can then throttle
token allocation for that overloaded resource, giving the congestion
time to clear up.
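The token scheme can be pictured with a toy model for a single managed resource. Everything in this sketch is an assumption made for illustration: the class and method names, the per-round token budgets, and the rule that congestion halves each budget. It is not the hardware's actual interface or policy. It only shows the two ideas from the text, round-robin grants inside each RAG and throttling under congestion.

```python
from collections import deque


class TokenManager:
    """Toy token manager for one managed resource (for example, a memory bank)."""

    def __init__(self, groups, tokens_per_round):
        # groups: RAG name -> non-empty list of requestors in that group
        # tokens_per_round: RAG name -> tokens handed out per round
        self.queues = {name: deque(members) for name, members in groups.items()}
        self.tokens_per_round = dict(tokens_per_round)
        self.congested = False  # set from resource feedback

    def grant_round(self):
        """Hand out one round of tokens; returns requestors in grant order."""
        grants = []
        for name, queue in self.queues.items():
            budget = self.tokens_per_round[name]
            if self.congested:
                budget //= 2          # assumed throttling policy
            for _ in range(budget):
                grants.append(queue[0])
                queue.rotate(-1)      # round-robin within the RAG
        return grants


manager = TokenManager(
    groups={"RAG0": ["PPE"], "RAG1": ["SPE0", "SPE1", "SPE2"]},
    tokens_per_round={"RAG0": 1, "RAG1": 2},
)
print(manager.grant_round())  # ['PPE', 'SPE0', 'SPE1']
print(manager.grant_round())  # ['PPE', 'SPE2', 'SPE0']
manager.congested = True
print(manager.grant_round())  # ['SPE1']
```

Changing `tokens_per_round` at run time plays the role of software adjusting allocation rates, for instance to favor the PPE over the SPEs.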
Summary
The EIB offers next-generation data bandwidth. The good news is that this will allow for applications that are qualitatively
unlike anything we know today. The bad news is that this gives a whole
new class of potential bottlenecks and performance issues to resolve.
Careful use of workload assignments and planning for access to critical
elements, such as memory, are necessary to obtain maximum performance.
Acknowledgments
This article was adapted by Peter Seebach, working from the original
presentation "Unleashing the
Cell Processor: The Element Interconnect Bus," presented at the MPR Fall
Processor Forum 2005 by David Krolak of IBM.
Peter would like to thank Tim Kelly and Dave Krolak for technical and editorial review during the writing process.
Attributions
Cell Broadband Engine is a trademark of Sony Computer Entertainment Inc.
Resources
Learn
- This paper is based on a presentation given at Fall Processor Forum 2005: The Road to Multicore. See the rest in this series.
- The Cell Broadband Engine project page at IBM Research offers a wealth of links, diagrams, information, and articles.
- Introduction to the Cell multiprocessor (IBM Journal of Research and Development, 2005) has a good discussion of the history of the Cell BE project.
- Power Efficient Processor Design and the Cell Processor by Peter Hofstee was presented at the 11th International Symposium on High-Performance Computer Architecture (HPCA 2005).
- The IBM Semiconductor Solutions Technical Library Cell Broadband Engine documentation section lists specifications, user manuals, and articles of general interest.
- The SPU Application Binary Interface Specification V1.3 discusses register usage and calling conventions, data type sizes and alignment, low-level system and language binding information, information on loading and linking, and coding examples. This specification defines the system interface for SPU-targeted object files to help ensure maximum binary portability across implementations.
- Find related articles, downloads, discussion forums, and more at the IBM developerWorks Cell Broadband Engine resource center: your definitive resource for all things Cell BE.
- Keep abreast of all the Cell BE news: subscribe to the Power Architecture Community Newsletter.
About the author
David is a Senior Engineer in the Engineering and Technology Services division. He received a B.S. degree in Electrical Engineering from the University of Wisconsin at Madison in 1979. In his 25 years at IBM, he has worked on DRAM controllers, was the lead designer for the L2 cache controllers used in the RS/6000 S70, S80, and S85 models, and is the lead designer of the Element Interconnect Bus used in the Cell chip. He holds nine patents and is currently working on future Cell designs.