 | Level: Intermediate Thomas Chen (wtchen@us.ibm.com), Systems Performance, IBM Ram Raghavan (ramr@us.ibm.com), Systems Performance, IBM Jason Dale (jndale@us.ibm.com), Systems Performance, IBM Eiji Iwata (iwata@rd.scei.sony.co.jp), Microprocessor Development, Sony Computer Entertainment Inc.
29 Nov 2005 Explore the first implementation of the Cell Broadband Engine™ (Cell BE) Architecture, developed jointly by Sony, Toshiba, and IBM, and get an up-close look at its performance figures and characteristics.
The Cell Broadband Engine (Cell BE) processor is the first
implementation of the Cell Broadband Engine Architecture (CBEA), developed
jointly by Sony, Toshiba, and IBM. In addition to Cell BE's use in the
upcoming Sony PlayStation® 3 console, there is a great deal of
interest in using it in Cell BE-based workstations, media-rich electronics
devices, video and image processing systems, as well as several other
emerging applications.
The Cell BE includes one POWER™ Processing Element (PPE) and eight
Synergistic Processing Elements (SPEs). The Cell BE architecture is designed
to be well-suited for a wide variety of programming models, and allows for
partitioning of work between the PPE and the eight SPEs. This paper shows that Cell BE can outperform other state-of-the-art processors by
approximately an order of magnitude, or more in some cases.
Introduction
Until recently, improvements in the performance of general-purpose
processor systems were primarily derived from higher processor clock
frequencies, wider issue super-scalar or deeper super-pipeline
designs. However, without a commensurate increase in the memory speed,
these approaches only led to increased memory latencies and even more
complex logic to hide them. Further, not being able to have a large number
of concurrent accesses to memory, complex cores end up under-utilizing the
execution pipelines and memory bandwidth. Invariably the end results can
be poor utilization of the chip area and a disproportionate increase in
power dissipation relative to overall performance.
The approach taken by the Cell BE design was to focus on improving
performance/area and performance/power ratios (see Introduction to the
Cell Multiprocessor, listed in Resources. These
goals are largely achieved by using powerful, yet simple cores that use
area more efficiently with less power dissipation. Supported by an
interconnect with high data bandwidth, these cores could work both
independently and cooperatively. By supporting a large number of
simultaneous memory accesses from the cores, the memory bandwidth can be
used more efficiently as well. The design philosophy is somewhat similar
to the recent trend of having multiple general-purpose cores in the same
chip; in the Cell BE, all cores are just much simpler, yet still powerful.
The following section, Cell BE Overview and its Performance Characteristics, introduces the performance characteristics of the Cell BE, focusing on the
PPE, the SPEs, the Element Interconnect Bus (EIB), the XDR memory, and the
IO interfaces (IOIFs). Finally, Application Examples
and their Performance Characteristics, characterizes performance of
several applications that exploit the Cell BE processor features and compares
the results to other general-purpose processors.
Cell BE overview and its performance characteristics
Figure 1. Cell BE Processor Block Diagram
Figure 1 shows a high-level view of the first implementation of Cell BE. It
includes a general-purpose 64-bit POWER Processing Element (PPE). In
addition, the Cell BE incorporates eight Synergistic Processing Elements
(SPEs) interconnected by a high-speed, memory-coherent Element
Interconnect Bus (EIB). This initial implementation of Cell BE is targeted
to run at 3.2GHz.
The SIMD units on the eight SPEs provide the majority of the
computational power of the Cell BE. When using single-precision
floating-point fused multiply-add instructions, the eight SPEs in the
first-generation Cell BE chip can perform a total of 64 floating-point
operations per cycle.
The integrated memory controller (MIC) provides a peak bandwidth of 25.6GB/s to an external XDR memory, while the integrated I/O controller
provides peak bandwidths of 25GB/s (for inbound) and 35GB/s (for
outbound). The EIB supports a peak bandwidth of 204.8GB/s for
intra-chip data transfers among the PPE, the SPEs, and the memory and the
I/O interface controllers.
The POWER Processing Element (PPE)
The PPE consists of a POWER Processing Unit (PPU) connected to a 512KB L2
cache. The PPE is the main processor of the Cell BE, and is responsible for
running the operating system and coordinating the SPEs. The key design
goals of the PPE are to maximize the performance/power ratio as well as
the performance/area ratio. The PPU is a dual-issue, in-order processor
with dual-thread support. A pipeline timing diagram, as detailed in
Introduction to the Cell Multiprocessor (see Resources for a link), is shown in Figure 2.
Figure 2. The PPU pipeline
The PPE core can fetch four instructions at a time, and issue two. In order
to improve performance from its in-order pipeline, the PPE utilizes
delayed-execution pipelines and allows limited out-of-order execution of
load instructions. This allows the PPE to get some of the advantages of
out-of-order execution without any significant increase in complexity.
We do not focus on the PPE in this paper since most of the algorithms
presented here do not utilize the PPE.
SPE performance
The SPE is a modular design consisting of a Synergistic Processing Unit
(SPU) and a Memory Flow Controller (MFC). An SPU is a compute engine with
SIMD support and 256KB of dedicated local storage. The MFC contains a DMA
controller with an associated MMU, as well as an Atomic Unit to handle
synchronization operations with other SPUs and the PPU.
An SPU is a dual-issue, in-order machine with a large 128-entry, 128-bit
register file used for both floating-point and integer operations. The
SPU operates directly on instructions and data from its dedicated local
store, and relies on a channel interface to access the main memory and
other local stores. The channel interface, which is in the MFC, runs
independently of the SPU and is capable of translating addresses and doing
DMA transfers while the SPU continues with the program execution.
Figure 3. SPE block diagram
The SPUs SIMD support can perform operations on sixteen 8-bit integers,
eight 16-bit integers, four 32-bit integers, or four single-precision
floating-point numbers per cycle. At 3.2GHz, each SPU is capable of
performing up to 51.2 billion 8-bit integer operations or 25.6GFLOPs in
single precision. Figure 3 shows the main functional units in an SPU:
(1) an SPU floating point unit for single-precision, double-precision, and
integer multiplies; (2) an even SPU fixed-point unit for arithmetic,
logical operations, and word shifts; (3) an odd SPU fixed-point unit for
permutes, shuffles, and quadword rotates; (4) an SPU control unit for
instruction sequencing and branch execution; (5) an SPU local store unit
for load and stores; also to supply instructions to the control unit; (6)
an SPU channel/DMA transport which is responsible for controlling input
and output through the MFC.
As Figure 3 shows, each functional unit is assigned to one of the two
execution pipelines. The floating point and fixed point units are on the
even pipeline while the rest of the functional units are on the odd
pipeline. The SPU can issue and complete up to two instructions per
cycle, one on each of the execution pipelines. A dual issue occurs when a
group of fetched instructions has two issueable instructions, one of which
is executed by a unit on the even pipeline and the other executed by a
unit on the odd pipeline.
There are three types of instruction fetches: flush-initiated fetches,
inline prefetches, and hint fetches. To fetch instructions, the
instruction fetch logic reads 32 instructions at a time into its
instruction line buffer (ILB), from which two instructions at a time are
sent to the issue logic. When the operands are ready, the issue logic
sends the instructions to the functional units for execution. Functional
unit pipelines vary from two to seven cycles. Hint instructions can
preload instructions into the ILB.
Features such as a deterministic LS access time, simple issue rules,
software-inserted branch hints, a large register file, and so on, are exposed
to the compiler and applications for performance tuning. With some tuning
efforts, we have seen a wide variety of applications approach the
theoretical IPC of 2 in the SPU. The SPEs have DMA support for excellent
data streaming bandwidth that is much higher than many modern processors.
The Element Interconnect Bus
The Element Interconnect Bus (EIB) in the Cell BE allows for communication
among the PPE, the SPEs, the off-chip memory, and the external I/O (see Figure 4). The EIB consists of one address bus and four 16B-wide data rings, two
of which run clockwise and the other two counter-clockwise. Each ring can
potentially allow up to three concurrent data transfers as long as their
paths do not overlap. The EIB operates at half the speed of the processor.
Each requestor on the EIB starts with a small number of initial command
credits to send out requests on the bus. The number of credits is the size
of the command buffer inside the EIB for that particular requestor. One
command credit is used for each request on the bus. When a slot becomes
open in the command buffer as a previous request makes progress in the EIB
request pipeline, the EIB returns the credit back to the requestor.
Figure 4. The EIB and the connected units
When a requestor needs a data ring, it makes a single request to the EIB
data bus arbiter. The arbiter arbitrates among multiple requestors and
decides which data ring is granted to which requestor and when. The memory
controller is given the highest priority to prevent stalling of the
requestor on the read data, while all others are treated equally with a
round-robin priority. Any bus requestor can use any of the four rings to
send or receive data. The data arbiter does not grant a data ring to a
requestor if the transfer would cross more than halfway around the ring on
its way to destination or would interfere with another data transfer
already in progress at the time.
Each unit on the EIB can simultaneously send and receive 16B of data
every bus cycle. The maximum data bandwidth of the entire EIB is limited
by the maximum rate at which addresses are snooped across all units in the
system, which is one per bus cycle. Since each snooped address request can
potentially transfer up to 128B, the theoretical peak data bandwidth on
the EIB at 3.2GHz is 128Bx1.6GHz = 204.8GB/s.
The sustained data bandwidth on the EIB will however be lower than the
peak bandwidth due to several factors: the locations of the destination
and the source relative to each other, the potential for the new transfer
interfering with those already in progress, the number of Cell BE chips in
the system, whether the data transfers are to/from memory or between local
stores in the SPEs, and the efficiency of the data arbiter.
Reduced bus bandwidths can result such as when: (1) all requestors access
the same destination at the same time, such as memory on the same local
store; (2) all transfers are in the same direction and cause idling on two
of the four data rings; (3) there are a large number of partial cache line
transfers lowering the bus efficiency; and (4) each data transfer is a
6-hop, inhibiting the units on the way from using the same ring.
We ran a series of experiments on the hardware in our lab, using a core
frequency of 3.2GHz. In our experiments, all four pairs of SPEs did
streaming reads or writes to each other's local stores. For instance: SPE0
reads from SPE2, SPE4 from SPE6, SPE1 from SPE3, and SPE5 from SPE7, and
vice versa in each case. Table 1 below shows the results from a few of the
runs. We use the notation SPEx <-> SPEy to indicate that SPEx reads
from SPEy's local store, and vice versa.
Table 1. Sustained EIB bandwidth achieved for some SPE-to-SPE DMA transfers
|
Test configuration
|
Aggregate EIB BW
3.2 GHz
|
|---|
|
SPE1 <-> SPE3, SPE5 <-> SPE7, SPE0 <-> SPE2, SPE4
<-> SPE6
|
186 GB/s
| |
SPE0 <-> SPE4, SPE1 <-> SPE5, SPE2 <-> SPE6, SPE3
<-> SPE7
|
197 GB/s
| |
SPE0 <-> SPE1, SPE2 <-> SPE3, SPE4 <-> SPE5, SPE6
<-> SPE7
|
197 GB/s
| |
SPE0 <-> SPE3, SPE1 <-> SPE2, SPE4 <-> SPE7, SPE5
<-> SPE6
|
197 GB/s
| |
SPE0 <-> SPE7, SPE1 <-> SPE6, SPE2 <-> SPE5, SPE3
<-> SPE4
|
78 GB/s
| |
SPE0 <-> SPE5, SPE1 <-> SPE4, SPE2 <-> SPE7, SPE3
<-> SPE6
|
95 GB/s
| |
SPE0 <-> SPE6, SPE1 <-> SPE7, SPE2 <-> SPE4, SPE3
<-> SPE5
|
197 GB/s
|
The sustained effective EIB data bandwidth in our experiments varied from 78GB/s (38% of peak) to 197GB/s (96% of peak). In the case where the bandwidth is only 78GB/s, the communicating SPEs are farthest apart (see Figure 4), and only one transfer happens on each of the four rings. In the case where the bandwidth is 95GB/s, the communicating SPEs are five hops away from each other, still preventing other transfers to take place because of path overlap. In the remaining cases the achieved bandwidth is close to the peak of 204.8GB/s.
The memory subsystem
The memory interface controller (MIC) in the Cell BE chip is connected to
the external RAMBUS XDR memory through two XIO channels that operate at a
maximum effective frequency of 3.2GHz (400 MHz, Octal Data Rate). Both
RAMBUS channels can have eight concurrently operating banks and a maximum size
of 256MB, for a total of 512MB.
The MIC has separate read and write request queues for each XIO channel
operating independently. For each channel, the MIC arbiter alternates the
dispatch between read and write queues after a minimum of every eight
dispatches from each queue or until the queue becomes empty, whichever is
shorter. High-priority read requests are given priority over reads and
writes.
Writes of sizes 16B or more, but less than 128B, can be written directly
to memory using a masked-write operation, but writes smaller than 16B need
a read-modify-write operation. Due to a small number of buffers for
read-modify-write operations, the read part of the read-modify-write
operation is given a higher priority over normal reads, while the write
part is given a higher priority over normal writes.
Other performance-enhancing features in the MIC include a fast-path mode
where an incoming request with the request queues empty is dispatched
immediately to save a few cycles of latency for reads, and a speculative
read mode where reads can be dispatched even before the combined response
is received from the EIB.
With both XIO channels operating at 3.2GHz, the peak raw memory
bandwidth is 25.6GB/s. However, normal memory operations such as refresh,
scrubbing, and so on, typically reduce the bandwidth by about 1GB/s. The peak
bandwidth assumes that all the banks are kept active all the time by the
incoming request streams, and requests are all of the same type (read or
write), and each is exactly 128B in size. If streaming reads and writes
are intermingled, the effective bandwidth can be reduced to about 21GB/s;
the bandwidth loss in this case arises from the overhead of turning around
the MIC-to-XIO bidirectional bus.
Finally, the Cell BE also implements resource allocation to provide
controlled access to the critical shared resources such as the memory or
the I/O links. This controlled access allows time-critical applications to
meet timing targets by preventing contention for the shared resources.
Flexible I/O Interface
There are seven transmit and five receive RAMBUS RRAC FlexIO links that
are 1B wide each. These links can be configured as two logical interfaces.
With the FlexIO links operating at 5GHz, the I/O interface provides a raw
bandwidth of 35GB/s outbound and 25GB/s inbound. A typical configuration
may have one I/O interface configured with raw bandwidths of 30GB/s
outbound and 20GB/s inbound; and the other I/O interface with raw
bandwidths of 5GB/s outbound and 5GB/s inbound.
Data and commands on the I/O interface are transmitted as packets. In
addition to the command, response, and data, each packet may carry
information such as the data tag, data size, command id, and flow control
information, as well as other information. Because of these overheads,
and non-optimal arrival times of data and commands, the effective
bandwidth on the two interfaces is typically lower, and ranges from 50% to
80% of the raw bandwidth. Of course, other factors such as the prevailing
data traffic on the EIB, resource allocation, speed of the I/O devices,
ordering characteristics of the I/O data traffic, interrupts, and so on, can
potentially reduce the I/O bandwidth further.
Application examples and their performance characteristics
In this section, we present the results for a number of applications that
showcase the performance of the Cell BE chip. The applications cover a wide
range: Matrix Multiplication, Linpack, MPEG-2 video decoding, Triangle
Transform and Lighting, and cryptography algorithms such as AES, TDES,
MD5, and so on.
Some of the performance results in this section were obtained from our
cycle-accurate SPEsim performance model, which is an internal tool, as it
provides a great deal of insight into pipeline behavior. In other cases,
the results were measured on the hardware with no operating system support
and overheads. The results from the hardware indicate that the simulator
is typically more than 98% accurate for compute intensive code, while in
some cases such as Linpack, which has significant memory activity, the
modeling accuracy can drop to about 88%, due to simplifying assumptions in
the EIB and memory system model.
Optimization of Matrix Multiplication
Our matrix multiplication program calculates C = A x B, where A, B, and C
are square matrices of order N. Each element cij
of C is computed as follows:
A well-known optimization to reduce the required memory bandwidth is to
partition the matrices into smaller square matrices of order M <
N:
where Aij, Bij and
Cij are square matrices of order M. Each matrix
block Cij is then solved by:
The first optimization for the SPE architecture was to take advantage of
the four way-SIMD using 32-bit multiply-and-add operations to perform up to
eight single-precision floating point operations per cycle.
The next optimization was to take advantage of DMA engines in SPE by
adopting a double-buffer approach to overlap computations on the
previously fetched data blocks with transfers of subsequent data blocks
to and from memory. This effectively removes memory stalls.
Additional optimizations, such as overlapping the execution of blocks,
loop exchange, and software pipelining were applied to balance the usage
of the two SPU pipelines and maximize the dual-issue rate. With these
tuning efforts, the matrix multiplication program on a single SPU improved
its performance by almost 60x over the original scalar code running on an
SPU.
Table 2. Performance of Matrix Multiplication on one SPU
|
# of Cycles
|
# of Insts.
|
CPI
|
Dual Issue
|
Channel Stalls
|
Other Stalls
|
# of Used Registers
|
GFLOPS
|
|---|
|
Original (scalar)
|
258.9M
|
247.1M
|
1.05
|
26.1%
|
11.4%
|
26.3%
|
47
|
0.42
|
|---|
|
SIMD optimized
|
9.78M
|
13.8M
|
0.711
|
40.3%
|
3.0%
|
9.8%
|
60
|
10.96
|
|---|
|
SIMD + dbl buf
|
9.68M
|
13.6M
|
0.711
|
41.4%
|
2.6%
|
10.2%
|
65
|
11.12
|
|---|
|
Optimized code
|
4.27M
|
8.42M
|
0.508
|
80.1%
|
0.2%
|
0.4%
|
69
|
25.12
|
|---|
The data presented in Table 2 above was obtained from a cycle-accurate
SPEsim simulator. The accuracy of the results from our SPEsim simulator
for matrix multiplication (with a matrix size of 256x256), as shown in
the table below, is higher than 99%.
|
Number of SPUs
|
1
|
|---|
|
SPEsim (GFLOPS)
|
25.12
| |
Hardware (GFLOPS)
|
25.01
| |
Accuracy (%)
|
99.6%
|
Since operations in each data block are independent from those in other
blocks, the matrix multiplication algorithm is easily parallelized to all
eight SPUs. Figure 5 shows that the matrix multiplication performance
increases almost linearly with the number of SPUs, especially with large
matrix sizes. Using eight SPUs, the parallel version of matrix
multiplication achieves 201GFLOPS, very close to the theoretical maximum
of 204.8GFLOPS.
Figure 5. Performance of parallelized matrix multiplication for different sized matrices
Assuming that matrix multiplication can achieve its peak single-precision
floating point capability, a Pentium4 with SSE3 at 3.2GHz can achieve
25.6GFLOPS, while the Cell BE could achieve 201GFLOPS, better by almost a
factor of eight.
Optimization of Linpack
Linpack solves a dense system of linear equations Ax = b, where
A is a matrix of order N, and x and b are
single dimension arrays of size N. Matrix multiplication is a key
part of Linpack. The Linpack implementation on the Cell BE was based upon the
Block Partitioned Algorithm for LU Factorization (see Software
Libraries for Linear Algebra Computations on High Performance
Computers listed in Resources).
Linpack single precision floating point performance
The initial implementation of single-precision Linpack was a scalar
version. When run on the cycle-accurate SPEsim simulator with a matrix
size of 1024x1024 using partitioned blocks of 64x64, it achieved only
0.26GFLOPS with a 3.2GHz SPE.
With optimizations added to hide the memory latency by overlapping data
transfers with computation using double buffers, as well as working to
maximize dual issue with optimal instruction scheduling, the final
optimized version of the program achieved 16.5GFLOPS on a 1Kx1K matrix,
22.0GFLOPS on a 4Kx4K matrix, and 23.5GFLOPS on a 8Kx8K
matrix (Table 3).
Table 3. Performance of optimized Linpack on a single SPU
|
Code
|
Matrix size
|
# of Cycles
|
# of Insts.
|
CPI
|
Dual Issue
|
Channel Stalls
|
Other Stalls
|
# of Regs
|
GFLOPS
|
Efficiency
| |
Orig.
|
1024x1024
|
9.11G
|
6.57G
|
1.39
|
16.0%
|
1.9%
|
50%
|
32
|
0.26
|
1.02%
| |
Opt.
|
1024x1024
|
140M
|
205M
|
0.68
|
56.7%
|
18.1%
|
3.9%
|
72
|
16.5
|
64.5%
| |
Opt
|
4096x4096
|
6.66G
|
11.9G
|
0.56
|
71.7%
|
6.4%
|
1.7%
|
68
|
22.0
|
85.9%
| |
Opt.
|
8192x8192
|
50.0G
|
94G
|
0.53
|
75.8%
|
3.8%
|
1.0%
|
68
|
23.5
|
91.8%
|
* Note that there is an approximately 10% error in this SPEsim
performance results.
Table 4 shows the Linpack performance when optimized to run on all eight
SPEs. Note that the computational efficiency improved significantly with
the larger matrix size. Eight SPUs running Linpack at 3.2GHz achieves
155.5GFLOPS on a 4Kx4K matrix.
Table 4. Performance of parallelized Linpack on eight SPUs
|
Matrix size
|
Cycles
|
# of Insts.
|
CPI
|
Single Issue
|
Dual Issue
|
Channel Stalls
|
Other Stalls
|
# of Used Regs
|
SPEsim
|
Mea-
sured
|
Model accuracy
|
Effi-
ciency
| |
1024x1024
|
27.6M
|
2.92M
|
0.95
|
27.9%
|
32.6%
|
26.9%
|
12.6%
|
126
|
83.12
|
73.04
|
87.87%
|
35.7%
| |
4096x4096
|
918.0M
|
1.51G
|
0.61
|
29.0%
|
56.7%
|
10.8%
|
3.4%
|
126
|
160
|
155.5
|
97.2%
|
75.9%
|
Table 5 compares the performance results of parallel Linpack from the
SPEsim and the hardware.
Table 5. Performance of parallelized Linpack on SPEsim and hardware
|
1kx1k matrix
|
4kx4k matrix
| |
Number of SPUs
|
1
|
2
|
3
|
4
|
5
|
8
|
8
| |
SPEsim (GFLOPS)
|
16.03
|
32.79
|
45.62
|
56.33
|
64.94
|
83.12
|
160
| |
Hardware (GFLOPS)
|
14.94
|
30.46
|
42.04
|
50.8
|
58.21
|
73.04
|
155.5
| |
Model Accuracy
|
93.20%
|
92.89%
|
92.15%
|
90.18%
|
89.64%
|
87.87%
|
97.19%
| |
Efficiency
|
58.36%
|
59.49%
|
54.74%
|
49.61%
|
45.48%
|
35.66%
|
75.93%
|
As can be seen from Table 5, the SPEsim results for the Linpack are off
by 7% to 12%, except for in the 4Kx4K case, due to not modeling all the
complexities of DMA and the memory system.
Linpack double precision floating point performance
Although the SPU double-precision (DP) floating-point is not as high as
the single-precision performance, it is still good. Each SPU is capable
of executing two DP instructions every seven cycles. With Fused-Multiply-Add,
an SPU can achieve a peak 1.83GFLOPS at 3.2GHz. With eight SPUs and fully
pipelined DP floating-point support in the PPE's VMX, the Cell BE is capable
of a peak 21.03GFLOPS DP floating-point, compared to a peak of 230.4GFLOPS SP floating point.
The initial implementation of double-precision Linpack was a scalar
version. When run on the SPEsim with a matrix size of 1024x1024, using
partitioned blocks of 64x64, it achieved 0.27GFLOPS on a 3.2GHz SPE.
With additional tuning, the final version achieved 1.55GFLOPS on a 4Kx4K matrix, or about 85% of peak.
Table 6. Performance of double-precision FP Linpack on one SPU
|
Single-SPU Linpack (DP)
|
matrix size
|
# of Cycles
|
# of Insts.
|
CPI
|
Dual Issue
|
Channel Stalls
|
Other Stalls
|
Used Regs
|
3.2 GHz GFLOPs
|
Efficiency
| |
Orig.
|
1Kx1K
|
8.46G
|
4.91G
|
1.72
|
9.5%
|
3.00%
|
58.0%
|
33
|
0.27
|
14.88%
| |
Opt.
|
1Kx1K
|
1.57G
|
466M
|
3.36
|
2.8%
|
0.80%
|
74.1%
|
74
|
1.46
|
80.06%
| |
Opt.
|
4Kx4K
|
94.5G
|
26.0G
|
3.63
|
1.9%
|
0.20%
|
75.8%
|
74
|
1.55
|
84.88%
|
The double-precision version of Linpack on the Cell BE was also parallelized
to run on all eight SPUs. Table 7 summarizes Linpack performance
parallelized to eight SPUs, running on SPEsim.
Table 7. Performance of parallelized double-precision Linpack on eight SPUs
|
matrix size
|
# of Cycles
|
# of Insts.
|
CPI
|
Dual Issue
|
Channel Stalls
|
Other Stalls
|
Used Regs
|
SPEsim GFLOPS
|
Measured GFLOPS
|
Model Accuracy
|
Efficiency
| |
1Kx1K
|
236.7M
|
69.1M
|
3.42
|
2.9%
|
6.7%
|
68.5%
|
128
|
9.704
|
9.46
|
97.49%
|
64.66%
| |
2Kx2K
|
1.64G
|
44.9M
|
3.65
|
2.2%
|
3.3%
|
72.5%
|
128
|
11.184
|
11.05
|
98.80%
|
75.53%
|
Table 8. Comparison of SPEsim and hardware performance results
|
|
1kx1k matrix
| |
Number of SPUs
|
1
|
2
|
3
|
4
|
5
|
6
|
7
|
8
| |
SPEsim (GFLOPS)
|
1.46
|
2.84
|
4.15
|
5.39
|
6.56
|
7.66
|
8.67
|
9.71
| |
Hardware (GFLOPS)
|
1.45
|
2.81
|
4.11
|
5.32
|
6.46
|
7.52
|
8.51
|
9.46
| |
Model Accuracy
|
99.14%
|
98.82%
|
99.12%
|
98.79%
|
98.52%
|
98.12%
|
98.21%
|
97.45%
| |
Efficiency
|
79.23%
|
76.78%
|
74.86%
|
72.68%
|
70.60%
|
68.49%
|
66.43%
|
64.62%
|
Table 8 shows the results measured on hardware as compared to the SPEsim
model. The modeling accuracy for DP Linpack is greater than 97% in all
cases. With a 1024x1024 matrix, the computational efficiency of
parallel Linpack is greater than 64% when running on all eight SPUs. The
best parallel Linpack DP floating-point results measured from hardware is
11.82GFLOPS for a 4096x4096 matrix, with an efficiency of 81%. The
Linpack performance on IA-32 and IA-64 machines (Survey of High
Performance Machines; see Resources)
is summarized in Table 9 along with the SPU results.
Table 9. Comparison of Linpack performance between Cell BE and other processors
|
Linpack 1kx1k (DP)
|
Peak GFLOPS
|
Actual GFLOPS
|
Efficiency
| |
SPU, 3.2GHz
|
1.83
|
1.45
|
79.23%
| |
8 SPUs, 3.2GHz
|
14.63
|
9.46
|
64.66%
| |
Pentium4, 3.2GHz
|
6.4
|
3.1
|
48.44%
| |
Pentium4 + SSE3, 3.6GHz
|
14.4
|
7.2
|
50.00%
| |
Itanium, 1.6GHz
|
6.4
|
5.95
|
92.97%
|
The Linpack implementation on the Cell BE has the highest DP floating point
performance in the chart, exceeding the most recent IA-32 and IA-64
machines available today.
Optimization of MPEG-2 video decoding
The Cell BE is primarily targeted for game applications (Introduction to the Cell Multiprocessor), some of
which demand a high video processing capability. Consequently, significant
effort was devoted to optimizing MPEG-2 video decoding. Figure 6 shows the
major components of an MPEG-2 video decoder; a Variable Length Decoder
(VLD), an Inverse Quantizer (IQ), an Inverse Discrete Cosine Transform
(IDCT), the Motion Compensation section (MC), and control logic.
The VLD code in an MPEG2 decoder has a lot of branches and needs
significant optimization to run well on the SPU. Our optimizations
include algorithmic tuning, lookup-table modification, fast bit
manipulation with SPU intrinsics, static branch prediction with the gcc
compiler's __builtin_expect pragma, function
inlining, and global register assignment. Most of the optimizations also
reduced the instruction count and eliminated or minimized branch
penalties.
Figure 6. Process flow of MPEG2 decoding
A fast IDCT Algorithm (see Fast computational Algorithm for the
Discrete Cosine Transform listed in Resources),
was adopted to operate an 8x8 2-D IDCT using 512 mpya (multiply and add)
and 128 add/subtract instructions. This IDCT algorithm was regular and it
was easy to keep the precision required by the MPEG2 standard for 16bx16b multiplication. Most operations in IDCT were four-way SIMD for the
inner-product, except matrix transpose, which was eight-way SIMD.
The Motion Compensation part was implemented primarily with eight-way SIMD,
and some 16-way SIMD. Function inlining and nested if
else to switch case conversion were
adopted to eliminate some branches and to improve code scheduling by
enlarging basic block size.
Motion compensation required small pixel block transfers (for example, 16x16
pixels) from system memory to local store in order to construct the
predicted blocks. If a video frame were stored in raster scan manner,
then motion compensation would require numerous small DMA transfers (for example, 16 transfers of 16 bytes). In the Cell BE, however, the DMA transfers are
most efficient when performed on 128B naturally aligned boundaries. In
order to increase the efficiency, the data structure was rearranged to put
each macroblock in a 384B contiguous area, consisting of a 16x16 pixel
luminance block and two 8x8 pixel chrominance blocks (in the case of the
4:2:0 format). With this arrangement, a macroblock could be retrieved
from the main memory to local store with fewer 128B DMA transfers.
Table 10 shows our results from the optimized decoder running on
cycle-accurate SPU simulator. As is evident, a single 3.2GHz SPU can
easily support any kind of real-time MPEG-2 video decoding.
Table 10. Single SPU MPEG2 decode performance at different resolutions
| |
# of Cycles
|
# of Insts.
|
CPI
|
# of Used Registers
|
Frames/sec. @3.2GHz
|
|---|
|
CIF
|
1Mbps
|
63.4M
|
51.9M
|
1.22
|
126
|
1514
|
|---|
|
SDTV
|
5Mbps
|
263M
|
220M
|
1.20
|
126
|
365
|
|---|
|
SDTV
|
8Mbps
|
324M
|
290M
|
1.12
|
126
|
296
|
|---|
|
HDTV
|
18Mbps
|
1.25G
|
1.01G
|
1.24
|
126
|
77
|
|---|
* Note that there is an approximately 10% error in this SPEsim
performance results.
On hardware, our optimized MPEG2 decoder was able to decode about 1379
CIF frames/sec, about 10% lower than the simulation result. Adjusted for
modeling error, each SPU is capable of decoding about 324 frames/sec with
a 5Mbps SDTV video stream. Intel reports that its Pentium4 processor
running at 2.8GHz is capable of decoding about 310 frames of a SDTV
720x480x30fps video stream every second when using the highly tuned Intel
Performance Primitive (IPP) library (see Resources).
With linear scaling to 3.2GHz, this frame rate increases to 354fps.
Although differences in the video stream data can change these numbers to
some extent, given that the Cell BE has eight SPUs, Cell BE should outperform a
general purpose processor with SIMD capability by a fair margin.
Optimization of Triangle Transform and Lighting
Transform-light is a critical stage commonly found in vertex-based
graphic pipelines. Because of its structured data, the transform-light
implementation on Cell BE can benefit greatly from the SIMD support in the
SPU. The optimizations here were primarily focused on loop unrolling,
increasing the dual-issue rate, overlapping the DMAs with computation, and
avoiding branches.
With many more registers than general-purpose processors, our
Transform-Light on the SPU benefited significantly from aggressive loop
unrolling.
Table 11. Performance of SPE versus PowerPC 970 for Transform-Lighting
|
Loop Unrolling
|
PPC970 2GHz (Mvtx/sec)
|
PPC970 2.7GHz (Scaled) (Mvtx/sec)
|
SPE 3.2GHz (Mvtx/sec)
|
SPE Advantage
|
|---|
|
1
|
82.95
|
112
|
139.44
|
1.25
| |
2
| 94.8
| 128
|
155.92
|
1.22
| |
4
|
89.47
|
121
|
208.48
|
1.73
| | 8
|
58.45
|
79
| 217.2
|
2.75
|
For comparative analysis, we implemented a best-effort Transform-Light on
a single-core PowerPC 970 (PPC970), system running at 2GHz. The results
in Table 11 show that Transform-Light on one SPU performs about 70% better
than the PPC970/VMX system when scaled to 2.7GHz. Given that the Cell BE
has eight SPUs, the Cell BE should outperform the PPC970/VMX by a wide margin
Cryptography performance
Cryptography is a fast-emerging workload, and is increasingly found in
many applications. The Cell BE, with its SIMD pipelines and a large register
file, can execute cryptography functions efficiently. The code
optimization here was focused on aggressive loop unrolling and
register-based table lookups to take advantage of the SPU's large register
file. Tuning was also done for dual issue, bit permutation, and byte
shuffling. Table 12 summarizes the performance results of several key
cryptography algorithms and hash functions on a single SPE, compared to a
leading brand CPU with SIMD.
Table 12. Performance of a single SPE and another processor for Cryptography
|
Algorithms
|
SPE (Gb/sec)
|
Leading Brand CPU with SIMD (Gb/sec)
|
SPE Performance Advantage
|
|---|
|
AES ECB Encrypt - 128 bit key
|
2.059
|
1.029
|
2.002
| |
- 192 bit key
|
1.710
|
0.877
|
1.950
| |
- 256 bit key
|
1.462
|
0.762
|
1.918
| |
AES CBC Encrypt - 128 bit key
|
0.795
|
0.968
|
0.821
| |
- 192 bit key
|
0.664
|
0.823
|
0.807
| |
- 256 bit key
|
0.570
|
0.725
|
0.786
| |
AES ECB Decrypt - 128 bit key
|
1.499
|
1.035
|
1.448
| |
- 192 bit key
|
1.252
|
0.870
|
1.438
| |
- 256 bit key
|
1.068
|
0.758
|
1.410
| |
AES CBC Decrypt - 128 bit key
|
1.507
|
0.966
|
1.560
| |
- 192 bit key
|
1.249
|
0.829
|
1.507
| |
- 256 bit key
|
1.066
|
0.724
|
1.472
| |
DES - ECB encrypt
|
0.492
|
0.426
|
1.156
| |
- CBC encrypt
|
0.275
|
0.417
|
0.660
| |
- ECB decrypt
|
0.492
|
0.425
|
1.158
| |
- CBC decrypt
|
0.489
|
0.421
|
1.162
| |
TDES - ECB encrypt
|
0.174
|
0.133
|
1.313
| |
- CBC encrypt
|
0.097
|
0.132
|
0.733
| |
- ECB decrypt
|
0.174
|
0.133
|
1.313
| |
- CBC decrypt
|
0.174
|
0.132
|
1.321
| |
MD5
|
2.448
|
2.862
|
0.855
| |
SHA-1
|
2.116
|
0.902
|
2.347
| |
SHA-256
|
0.854
|
0.518
|
1.649
|
Results demonstrate that the cryptography performance on a single SPE is
typically better (up to 2.3x at the same frequency) than a leading brand
processor with SIMD.
Cell BE performance summary
In Terrain Rendering Engine (TRE): Cell Broadband Optimized Real-Time
Ray-Caster , Minor, et al. showed
that their implementation of the Terrain Rendering Engine on the Cell BE runs
50x faster than a PPC970 with the VMX running at 2GHz.
With Cell BE's innovative micro-architectural features, we showed that a
wide variety of algorithms on the Cell BE achieve performance that is equal
to or significantly better than a general-purpose processor. For
applications that can take advantage of SPU's SIMD pipelines and PPU's
thread-level parallelism, the results will be similar. Table 13
summarizes the performance advantage of the Cell BE (or its one SPU) over
other processors.
Table 13. Performance comparison of Cell BE versus other processors for different applications
|
Type
|
Algorithm
|
3.2 GHz GPP
|
3.2 GHz Cell BE
|
Perf Advantage
|
|---|
|
HPC
| Matrix Multiplication (S.P.)
|
25.6 Gflops* (w/SIMD)
|
200 GFlops (8SPEs)
|
8x (8SPEs)
|
|---|
| Linpack (S.P.) 4kx4k
|
25.6 GFlops* (w/SIMD)
|
156 GFlops (8SPEs)
|
6x (8SPEs)
|
|---|
| Linpack (D.P.) 1kx1k
|
7.2 GFlops (3.6GHz IA32/SSE3)
|
9.67 GFLops (8SPEs)
|
1.3x (8SPEs)
|
|---|
|
graphics
| TRE
|
.85 fps (2.7GHz G5/VMX)
|
30 fps (Cell BE)
|
35x (Cell BE)
|
|---|
| transform-light
|
128 MVPS (2.7GHz G5/VMX)
|
217 MVPS (one SPE)
|
1.7x (one SPE)
|
|---|
|
security
| AES ECB encryp. 128b key
|
1.03 Gbps
|
2.06Gbps (one SPE)
|
2x (one SPE)
|
|---|
| AES ECB decryp. 128b key
|
1.04 Gbps
|
1.5Gbps (one SPE)
|
1.4x (one SPE)
|
|---|
| TDES ECB encryp.
|
0.13 Gbps
|
0.17 Gbps (one SPE)
|
1.3x (one SPE)
|
|---|
| DES ECB encryp.
|
0.43 Gbps
|
0.49 Gbps (one SPE)
|
1.1x (one SPE)
|
|---|
|
SHA-1
|
0.9 Gbps
|
2.12 Gbps (one SPE)
|
2.3x (one SPE)
|
|---|
|
video processing
| mpeg2 decoder (sdtv)
|
354 fps (w/SIMD)
|
329 fps (one SPE)
|
0.9x (one SPE)
|
|---|
* assuming 100% compute efficiency, achieving theoretical peak of
25.6GLOPS, in its single precision MatrixMultiply & Linpack implementation
Summary
The Cell BE is a best-of-breed design that delivers significant
computational power, high bandwidth, excellent power and performance, and
leading area efficiency within the constraints of a process technology.
With eight decoupled SPU SIMD engines, each with dedicated resources
including DMA channels, the Cell BE has eight times more SIMD capability (for
up to 16-way data parallelism) than traditional processors with vector
architecture extensions, such as the PPC970 in the G5. The PPE is also
free to process other applications even while all SPUs are in use. The
Cell BE can perform especially well in cases where a general-purpose
processor would normally get tied up by a single SIMD application. With
its performance density and efficiency, the Cell BE can outperform
leading-edge processors on a variety of workloads by approximately an
order of magnitude, in some cases more, when software is tuned and
optimized for this architecture.
Attributions
Cell Broadband Engine is a trademark of Sony Computer Entertainment Inc.
Resources Learn
-
"Introduction
to the Cell Multiprocessor," by J. Kahle, et al., IBM Journal of
Research and Development, Vol. 49, No. 4/5, July/Sept. 2005, pp. 589-604.
-
Block Partitioned Algorithm for LU Factorization as discussed in "
Software Libraries for Linear Algebra Computations on High Performance
Computers," by J.J. Dongarra, D. W. Walker.
-
"Survey
of High Performance Machines," by J.J. Dongarra of the University
of Tennessee / Oak Ridge national Laboratory.
-
"A Fast
computational Algorithm for the Discrete Cosine Transform;" by
W.A. Chen, C. Harrison, S.C. Fralick; published in IEEE Transactions on
Communications; Vol. COM-25, No. 9, Sept. 1977 (page 1004; Subscribe to see entire article).
-
Intel White paper, "Performance
Benchmarks for Intel Integrate Performance Primitives," Intel
Corp.
-
"Terrain
Rendering Engine (TRE): Cell Broadband Optimized Real-Time
Ray-Caster," by B. Minor, G. Fossum, V. To, from Proceedings of
GPSx conference, October 2005, will soon be posted to the IBM
Semiconductor solutions Technical library.
-
The
Cell BE Broadband Engine project
page at IBM Research
offers a wealth of links, diagrams, information and articles.
-
The
IBM Semiconductor Solutions Technical Library
Cell Broadband Engine
documentation section
lists
specifications, user manuals, and more.
-
Find all Cell BE-related articles, discussion forums, downloads,
and more at the IBM
developerWorks Cell
Broadband Engine resource center: your definitive resource for all
things CBE.
-
Keep abreast of all the Cell BE news: subscribe to the Power
Architecture Community Newsletter.
Get products and technologies
Discuss
About the authors  | |  | Dr. Thomas Chen is a Senior Technical Staff Member working in the Systems Performance area of the Systems and Technology Group. He has been working on performance and architecture design of Cell Processor. His other responsibilities include future workload analysis & characterization, storage & I/O workload analysis, as well as the performance of both processor and system architecture & designs. Dr. Chen received his Master and Ph.D. degrees, both in Computer and Electrical Engineering, from the State University of New York at Buffalo.
|
 | |  | Dr. Ram Raghavan is a senior performance analyst at the STI Design Center, where he has been mainly focused on the performance of the Cell processor beyond the cores. Before joining the STI project, he was with the IBM Systems and Technology Group working on the performance aspects of IBM Power servers and before that in the development of system verification tools. He holds a Ph.D. degree in the area of Computer Science and Engineering from the University of Michigan at Ann Arbor.
|
 | |  | Jason Dale is an Engineer with IBM Server and Technology Group, and has been focused on the performance of microprocessor core performance inside the CELL Broadband Engine. He is currently working on performance analysis for a next-generation processor core. Jason holds a B.S. degree from Clarkson University and a M.S. degree from U.T. Austin, both in computer engineering. |
 | |  | Eiji Iwata received the M.E. degree in information systems engineering from Kyushu University in Fukuoka, Japan and joined Sony Corporation, Tokyo in 1991. In 1996, he was a Visiting Researcher at Hydra project in CSL, Stanford University. Since 1997, he developed a CMP for media applications in Sony and joined STI Design Center in 2001. Currently he works on Cell related project in Sony Computer Entertainment, Inc. |
Rate this page
|  |