Cell Broadband Engine Architecture and its first implementation

A performance view

The Cell Broadband Engine (Cell BE) processor is the first implementation of the Cell Broadband Engine Architecture (CBEA), developed jointly by Sony, Toshiba, and IBM. In addition to Cell BE's use in the upcoming Sony PlayStation® 3 console, there is a great deal of interest in using it in Cell BE-based workstations, media-rich electronics devices, video and image processing systems, as well as several other emerging applications.

The Cell BE includes one POWER™ Processing Element (PPE) and eight Synergistic Processing Elements (SPEs). The Cell BE architecture is designed to be well-suited for a wide variety of programming models, and allows for partitioning of work between the PPE and the eight SPEs. This paper shows that Cell BE can outperform other state-of-the-art processors by approximately an order of magnitude, or more in some cases.

Until recently, improvements in the performance of general-purpose processor systems were primarily derived from higher processor clock frequencies, wider issue super-scalar or deeper super-pipeline designs. However, without a commensurate increase in the memory speed, these approaches only led to increased memory latencies and even more complex logic to hide them. Further, not being able to have a large number of concurrent accesses to memory, complex cores end up under-utilizing the execution pipelines and memory bandwidth. Invariably the end results can be poor utilization of the chip area and a disproportionate increase in power dissipation relative to overall performance.

The approach taken by the Cell BE design was to focus on improving performance/area and performance/power ratios (see Introduction to the Cell Multiprocessor, listed in Related topics. These goals are largely achieved by using powerful, yet simple cores that use area more efficiently with less power dissipation. Supported by an interconnect with high data bandwidth, these cores could work both independently and cooperatively. By supporting a large number of simultaneous memory accesses from the cores, the memory bandwidth can be used more efficiently as well. The design philosophy is somewhat similar to the recent trend of having multiple general-purpose cores in the same chip; in the Cell BE, all cores are just much simpler, yet still powerful.

The following section, Cell BE Overview and its Performance Characteristics, introduces the performance characteristics of the Cell BE, focusing on the PPE, the SPEs, the Element Interconnect Bus (EIB), the XDR memory, and the IO interfaces (IOIFs). Finally, Application Examples and their Performance Characteristics, characterizes performance of several applications that exploit the Cell BE processor features and compares the results to other general-purpose processors.

Cell BE overview and its performance characteristics

Figure 1. Cell BE Processor Block Diagram
Figure 1. Cell BE Processor Block Diagram
Figure 1. Cell BE Processor Block Diagram

Figure 1 shows a high-level view of the first implementation of Cell BE. It includes a general-purpose 64-bit POWER Processing Element (PPE). In addition, the Cell BE incorporates eight Synergistic Processing Elements (SPEs) interconnected by a high-speed, memory-coherent Element Interconnect Bus (EIB). This initial implementation of Cell BE is targeted to run at 3.2GHz.

The SIMD units on the eight SPEs provide the majority of the computational power of the Cell BE. When using single-precision floating-point fused multiply-add instructions, the eight SPEs in the first-generation Cell BE chip can perform a total of 64 floating-point operations per cycle.

The integrated memory controller (MIC) provides a peak bandwidth of 25.6GB/s to an external XDR memory, while the integrated I/O controller provides peak bandwidths of 25GB/s (for inbound) and 35GB/s (for outbound). The EIB supports a peak bandwidth of 204.8GB/s for intra-chip data transfers among the PPE, the SPEs, and the memory and the I/O interface controllers.

The POWER Processing Element (PPE)

The PPE consists of a POWER Processing Unit (PPU) connected to a 512KB L2 cache. The PPE is the main processor of the Cell BE, and is responsible for running the operating system and coordinating the SPEs. The key design goals of the PPE are to maximize the performance/power ratio as well as the performance/area ratio. The PPU is a dual-issue, in-order processor with dual-thread support. A pipeline timing diagram, as detailed in Introduction to the Cell Multiprocessor (see Related topics for a link), is shown in Figure 2.

Figure 2. The PPU pipeline
Figure 2. The PPU pipeline
Figure 2. The PPU pipeline

The PPE core can fetch four instructions at a time, and issue two. In order to improve performance from its in-order pipeline, the PPE utilizes delayed-execution pipelines and allows limited out-of-order execution of load instructions. This allows the PPE to get some of the advantages of out-of-order execution without any significant increase in complexity. We do not focus on the PPE in this paper since most of the algorithms presented here do not utilize the PPE.

SPE performance

The SPE is a modular design consisting of a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC). An SPU is a compute engine with SIMD support and 256KB of dedicated local storage. The MFC contains a DMA controller with an associated MMU, as well as an Atomic Unit to handle synchronization operations with other SPUs and the PPU.

An SPU is a dual-issue, in-order machine with a large 128-entry, 128-bit register file used for both floating-point and integer operations. The SPU operates directly on instructions and data from its dedicated local store, and relies on a channel interface to access the main memory and other local stores. The channel interface, which is in the MFC, runs independently of the SPU and is capable of translating addresses and doing DMA transfers while the SPU continues with the program execution.

Figure 3. SPE block diagram
Figure 3. SPE block diagram
Figure 3. SPE block diagram

The SPUs SIMD support can perform operations on sixteen 8-bit integers, eight 16-bit integers, four 32-bit integers, or four single-precision floating-point numbers per cycle. At 3.2GHz, each SPU is capable of performing up to 51.2 billion 8-bit integer operations or 25.6GFLOPs in single precision. Figure 3 shows the main functional units in an SPU: (1) an SPU floating point unit for single-precision, double-precision, and integer multiplies; (2) an even SPU fixed-point unit for arithmetic, logical operations, and word shifts; (3) an odd SPU fixed-point unit for permutes, shuffles, and quadword rotates; (4) an SPU control unit for instruction sequencing and branch execution; (5) an SPU local store unit for load and stores; also to supply instructions to the control unit; (6) an SPU channel/DMA transport which is responsible for controlling input and output through the MFC.

As Figure 3 shows, each functional unit is assigned to one of the two execution pipelines. The floating point and fixed point units are on the even pipeline while the rest of the functional units are on the odd pipeline. The SPU can issue and complete up to two instructions per cycle, one on each of the execution pipelines. A dual issue occurs when a group of fetched instructions has two issueable instructions, one of which is executed by a unit on the even pipeline and the other executed by a unit on the odd pipeline.

There are three types of instruction fetches: flush-initiated fetches, inline prefetches, and hint fetches. To fetch instructions, the instruction fetch logic reads 32 instructions at a time into its instruction line buffer (ILB), from which two instructions at a time are sent to the issue logic. When the operands are ready, the issue logic sends the instructions to the functional units for execution. Functional unit pipelines vary from two to seven cycles. Hint instructions can preload instructions into the ILB.

Features such as a deterministic LS access time, simple issue rules, software-inserted branch hints, a large register file, and so on, are exposed to the compiler and applications for performance tuning. With some tuning efforts, we have seen a wide variety of applications approach the theoretical IPC of 2 in the SPU. The SPEs have DMA support for excellent data streaming bandwidth that is much higher than many modern processors.

The Element Interconnect Bus

The Element Interconnect Bus (EIB) in the Cell BE allows for communication among the PPE, the SPEs, the off-chip memory, and the external I/O (see Figure 4). The EIB consists of one address bus and four 16B-wide data rings, two of which run clockwise and the other two counter-clockwise. Each ring can potentially allow up to three concurrent data transfers as long as their paths do not overlap. The EIB operates at half the speed of the processor.

Each requestor on the EIB starts with a small number of initial command credits to send out requests on the bus. The number of credits is the size of the command buffer inside the EIB for that particular requestor. One command credit is used for each request on the bus. When a slot becomes open in the command buffer as a previous request makes progress in the EIB request pipeline, the EIB returns the credit back to the requestor.

Figure 4. The EIB and the connected units
Figure 4. The EIB and the connected units
Figure 4. The EIB and the connected units

When a requestor needs a data ring, it makes a single request to the EIB data bus arbiter. The arbiter arbitrates among multiple requestors and decides which data ring is granted to which requestor and when. The memory controller is given the highest priority to prevent stalling of the requestor on the read data, while all others are treated equally with a round-robin priority. Any bus requestor can use any of the four rings to send or receive data. The data arbiter does not grant a data ring to a requestor if the transfer would cross more than halfway around the ring on its way to destination or would interfere with another data transfer already in progress at the time.

Each unit on the EIB can simultaneously send and receive 16B of data every bus cycle. The maximum data bandwidth of the entire EIB is limited by the maximum rate at which addresses are snooped across all units in the system, which is one per bus cycle. Since each snooped address request can potentially transfer up to 128B, the theoretical peak data bandwidth on the EIB at 3.2GHz is 128Bx1.6GHz = 204.8GB/s.

The sustained data bandwidth on the EIB will however be lower than the peak bandwidth due to several factors: the locations of the destination and the source relative to each other, the potential for the new transfer interfering with those already in progress, the number of Cell BE chips in the system, whether the data transfers are to/from memory or between local stores in the SPEs, and the efficiency of the data arbiter.

Reduced bus bandwidths can result such as when: (1) all requestors access the same destination at the same time, such as memory on the same local store; (2) all transfers are in the same direction and cause idling on two of the four data rings; (3) there are a large number of partial cache line transfers lowering the bus efficiency; and (4) each data transfer is a 6-hop, inhibiting the units on the way from using the same ring.

We ran a series of experiments on the hardware in our lab, using a core frequency of 3.2GHz. In our experiments, all four pairs of SPEs did streaming reads or writes to each other's local stores. For instance: SPE0 reads from SPE2, SPE4 from SPE6, SPE1 from SPE3, and SPE5 from SPE7, and vice versa in each case. Table 1 below shows the results from a few of the runs. We use the notation SPEx <-> SPEy to indicate that SPEx reads from SPEy's local store, and vice versa.

Table 1. Sustained EIB bandwidth achieved for some SPE-to-SPE DMA transfers
Test configuration Aggregate EIB BW 3.2 GHz
SPE1 <-> SPE3, SPE5 <-> SPE7, SPE0 <-> SPE2, SPE4 <-> SPE6 186 GB/s
SPE0 <-> SPE4, SPE1 <-> SPE5, SPE2 <-> SPE6, SPE3 <-> SPE7 197 GB/s
SPE0 <-> SPE1, SPE2 <-> SPE3, SPE4 <-> SPE5, SPE6 <-> SPE7 197 GB/s
SPE0 <-> SPE3, SPE1 <-> SPE2, SPE4 <-> SPE7, SPE5 <-> SPE6 197 GB/s
SPE0 <-> SPE7, SPE1 <-> SPE6, SPE2 <-> SPE5, SPE3 <-> SPE4 78 GB/s
SPE0 <-> SPE5, SPE1 <-> SPE4, SPE2 <-> SPE7, SPE3 <-> SPE6 95 GB/s
SPE0 <-> SPE6, SPE1 <-> SPE7, SPE2 <-> SPE4, SPE3 <-> SPE5 197 GB/s

The sustained effective EIB data bandwidth in our experiments varied from 78GB/s (38% of peak) to 197GB/s (96% of peak). In the case where the bandwidth is only 78GB/s, the communicating SPEs are farthest apart (see Figure 4), and only one transfer happens on each of the four rings. In the case where the bandwidth is 95GB/s, the communicating SPEs are five hops away from each other, still preventing other transfers to take place because of path overlap. In the remaining cases the achieved bandwidth is close to the peak of 204.8GB/s.

The memory subsystem

The memory interface controller (MIC) in the Cell BE chip is connected to the external RAMBUS XDR memory through two XIO channels that operate at a maximum effective frequency of 3.2GHz (400 MHz, Octal Data Rate). Both RAMBUS channels can have eight concurrently operating banks and a maximum size of 256MB, for a total of 512MB.

The MIC has separate read and write request queues for each XIO channel operating independently. For each channel, the MIC arbiter alternates the dispatch between read and write queues after a minimum of every eight dispatches from each queue or until the queue becomes empty, whichever is shorter. High-priority read requests are given priority over reads and writes.

Writes of sizes 16B or more, but less than 128B, can be written directly to memory using a masked-write operation, but writes smaller than 16B need a read-modify-write operation. Due to a small number of buffers for read-modify-write operations, the read part of the read-modify-write operation is given a higher priority over normal reads, while the write part is given a higher priority over normal writes.

Other performance-enhancing features in the MIC include a fast-path mode where an incoming request with the request queues empty is dispatched immediately to save a few cycles of latency for reads, and a speculative read mode where reads can be dispatched even before the combined response is received from the EIB.

With both XIO channels operating at 3.2GHz, the peak raw memory bandwidth is 25.6GB/s. However, normal memory operations such as refresh, scrubbing, and so on, typically reduce the bandwidth by about 1GB/s. The peak bandwidth assumes that all the banks are kept active all the time by the incoming request streams, and requests are all of the same type (read or write), and each is exactly 128B in size. If streaming reads and writes are intermingled, the effective bandwidth can be reduced to about 21GB/s; the bandwidth loss in this case arises from the overhead of turning around the MIC-to-XIO bidirectional bus.

Finally, the Cell BE also implements resource allocation to provide controlled access to the critical shared resources such as the memory or the I/O links. This controlled access allows time-critical applications to meet timing targets by preventing contention for the shared resources.

Flexible I/O Interface

There are seven transmit and five receive RAMBUS RRAC FlexIO links that are 1B wide each. These links can be configured as two logical interfaces. With the FlexIO links operating at 5GHz, the I/O interface provides a raw bandwidth of 35GB/s outbound and 25GB/s inbound. A typical configuration may have one I/O interface configured with raw bandwidths of 30GB/s outbound and 20GB/s inbound; and the other I/O interface with raw bandwidths of 5GB/s outbound and 5GB/s inbound.

Data and commands on the I/O interface are transmitted as packets. In addition to the command, response, and data, each packet may carry information such as the data tag, data size, command id, and flow control information, as well as other information. Because of these overheads, and non-optimal arrival times of data and commands, the effective bandwidth on the two interfaces is typically lower, and ranges from 50% to 80% of the raw bandwidth. Of course, other factors such as the prevailing data traffic on the EIB, resource allocation, speed of the I/O devices, ordering characteristics of the I/O data traffic, interrupts, and so on, can potentially reduce the I/O bandwidth further.

Application examples and their performance characteristics

In this section, we present the results for a number of applications that showcase the performance of the Cell BE chip. The applications cover a wide range: Matrix Multiplication, Linpack, MPEG-2 video decoding, Triangle Transform and Lighting, and cryptography algorithms such as AES, TDES, MD5, and so on.

Some of the performance results in this section were obtained from our cycle-accurate SPEsim performance model, which is an internal tool, as it provides a great deal of insight into pipeline behavior. In other cases, the results were measured on the hardware with no operating system support and overheads. The results from the hardware indicate that the simulator is typically more than 98% accurate for compute intensive code, while in some cases such as Linpack, which has significant memory activity, the modeling accuracy can drop to about 88%, due to simplifying assumptions in the EIB and memory system model.

Optimization of Matrix Multiplication

Our matrix multiplication program calculates C = A x B, where A, B, and C are square matrices of order N. Each element cij of C is computed as follows:

A well-known optimization to reduce the required memory bandwidth is to partition the matrices into smaller square matrices of order M < N:

where Aij, Bij and Cij are square matrices of order M. Each matrix block Cij is then solved by:

The first optimization for the SPE architecture was to take advantage of the four way-SIMD using 32-bit multiply-and-add operations to perform up to eight single-precision floating point operations per cycle.

The next optimization was to take advantage of DMA engines in SPE by adopting a double-buffer approach to overlap computations on the previously fetched data blocks with transfers of subsequent data blocks to and from memory. This effectively removes memory stalls.

Additional optimizations, such as overlapping the execution of blocks, loop exchange, and software pipelining were applied to balance the usage of the two SPU pipelines and maximize the dual-issue rate. With these tuning efforts, the matrix multiplication program on a single SPU improved its performance by almost 60x over the original scalar code running on an SPU.

Table 2. Performance of Matrix Multiplication on one SPU
# of Cycles # of Insts. CPI Dual Issue Channel Stalls Other Stalls # of Used Registers GFLOPS
Original (scalar) 258.9M 247.1M 1.05 26.1% 11.4% 26.3% 47 0.42
SIMD optimized 9.78M 13.8M 0.711 40.3% 3.0% 9.8% 60 10.96
SIMD + dbl buf 9.68M 13.6M 0.711 41.4% 2.6% 10.2% 65 11.12
Optimized code 4.27M 8.42M 0.508 80.1% 0.2% 0.4% 69 25.12

The data presented in Table 2 above was obtained from a cycle-accurate SPEsim simulator. The accuracy of the results from our SPEsim simulator for matrix multiplication (with a matrix size of 256x256), as shown in the table below, is higher than 99%.

Number of SPUs 1
SPEsim (GFLOPS) 25.12
Hardware (GFLOPS) 25.01
Accuracy (%) 99.6%

Since operations in each data block are independent from those in other blocks, the matrix multiplication algorithm is easily parallelized to all eight SPUs. Figure 5 shows that the matrix multiplication performance increases almost linearly with the number of SPUs, especially with large matrix sizes. Using eight SPUs, the parallel version of matrix multiplication achieves 201GFLOPS, very close to the theoretical maximum of 204.8GFLOPS.

Figure 5. Performance of parallelized matrix multiplication for different sized matrices
Figure 5. Performance of parallelized matrix multiplication for different sized matrices
Figure 5. Performance of parallelized matrix multiplication for different sized matrices

Assuming that matrix multiplication can achieve its peak single-precision floating point capability, a Pentium4 with SSE3 at 3.2GHz can achieve 25.6GFLOPS, while the Cell BE could achieve 201GFLOPS, better by almost a factor of eight.

Optimization of Linpack

Linpack solves a dense system of linear equations Ax = b, where A is a matrix of order N, and x and b are single dimension arrays of size N. Matrix multiplication is a key part of Linpack. The Linpack implementation on the Cell BE was based upon the Block Partitioned Algorithm for LU Factorization (see Software Libraries for Linear Algebra Computations on High Performance Computers listed in Related topics).

Linpack single precision floating point performance

The initial implementation of single-precision Linpack was a scalar version. When run on the cycle-accurate SPEsim simulator with a matrix size of 1024x1024 using partitioned blocks of 64x64, it achieved only 0.26GFLOPS with a 3.2GHz SPE.

With optimizations added to hide the memory latency by overlapping data transfers with computation using double buffers, as well as working to maximize dual issue with optimal instruction scheduling, the final optimized version of the program achieved 16.5GFLOPS on a 1Kx1K matrix, 22.0GFLOPS on a 4Kx4K matrix, and 23.5GFLOPS on a 8Kx8K matrix (Table 3).

Table 3. Performance of optimized Linpack on a single SPU
Code Matrix size # of Cycles # of Insts. CPI Dual Issue Channel Stalls Other Stalls # of Regs GFLOPS Efficiency
Orig. 1024x1024 9.11G 6.57G 1.39 16.0% 1.9% 50% 32 0.26 1.02%
Opt. 1024x1024 140M 205M 0.68 56.7% 18.1% 3.9% 72 16.5 64.5%
Opt 4096x4096 6.66G 11.9G 0.56 71.7% 6.4% 1.7% 68 22.0 85.9%
Opt. 8192x8192 50.0G 94G 0.53 75.8% 3.8% 1.0% 68 23.5 91.8%

* Note that there is an approximately 10% error in this SPEsim performance results.

Table 4 shows the Linpack performance when optimized to run on all eight SPEs. Note that the computational efficiency improved significantly with the larger matrix size. Eight SPUs running Linpack at 3.2GHz achieves 155.5GFLOPS on a 4Kx4K matrix.

Table 4. Performance of parallelized Linpack on eight SPUs
Matrix size Cycles # of Insts. CPI Single Issue Dual Issue Channel Stalls Other Stalls # of Used Regs SPEsim Mea- sured Model accuracy Effi- ciency
1024x1024 27.6M 2.92M 0.95 27.9% 32.6% 26.9% 12.6% 126 83.12 73.04 87.87% 35.7%
4096x4096 918.0M 1.51G 0.61 29.0% 56.7% 10.8% 3.4% 126 160 155.5 97.2% 75.9%

Table 5 compares the performance results of parallel Linpack from the SPEsim and the hardware.

Table 5. Performance of parallelized Linpack on SPEsim and hardware
1kx1k matrix 4kx4k matrix
Number of SPUs 1 2 3 4 5 8 8
SPEsim (GFLOPS) 16.03 32.79 45.62 56.33 64.94 83.12 160
Hardware (GFLOPS) 14.94 30.46 42.04 50.8 58.21 73.04 155.5
Model Accuracy 93.20% 92.89% 92.15% 90.18% 89.64% 87.87% 97.19%
Efficiency 58.36% 59.49% 54.74% 49.61% 45.48% 35.66% 75.93%

As can be seen from Table 5, the SPEsim results for the Linpack are off by 7% to 12%, except for in the 4Kx4K case, due to not modeling all the complexities of DMA and the memory system.

Linpack double precision floating point performance

Although the SPU double-precision (DP) floating-point is not as high as the single-precision performance, it is still good. Each SPU is capable of executing two DP instructions every seven cycles. With Fused-Multiply-Add, an SPU can achieve a peak 1.83GFLOPS at 3.2GHz. With eight SPUs and fully pipelined DP floating-point support in the PPE's VMX, the Cell BE is capable of a peak 21.03GFLOPS DP floating-point, compared to a peak of 230.4GFLOPS SP floating point.

The initial implementation of double-precision Linpack was a scalar version. When run on the SPEsim with a matrix size of 1024x1024, using partitioned blocks of 64x64, it achieved 0.27GFLOPS on a 3.2GHz SPE. With additional tuning, the final version achieved 1.55GFLOPS on a 4Kx4K matrix, or about 85% of peak.

Table 6. Performance of double-precision FP Linpack on one SPU
Single-SPU Linpack (DP) matrix size # of Cycles # of Insts. CPI Dual Issue Channel Stalls Other Stalls Used Regs 3.2 GHz GFLOPs Efficiency
Orig. 1Kx1K 8.46G 4.91G 1.72 9.5% 3.00% 58.0% 33 0.27 14.88%
Opt. 1Kx1K 1.57G 466M 3.36 2.8% 0.80% 74.1% 74 1.46 80.06%
Opt. 4Kx4K 94.5G 26.0G 3.63 1.9% 0.20% 75.8% 74 1.55 84.88%

The double-precision version of Linpack on the Cell BE was also parallelized to run on all eight SPUs. Table 7 summarizes Linpack performance parallelized to eight SPUs, running on SPEsim.

Table 7. Performance of parallelized double-precision Linpack on eight SPUs
matrix size # of Cycles # of Insts. CPI Dual Issue Channel Stalls Other Stalls Used Regs SPEsim GFLOPS Measured GFLOPS Model Accuracy Efficiency
1Kx1K 236.7M 69.1M 3.42 2.9% 6.7% 68.5% 128 9.704 9.46 97.49% 64.66%
2Kx2K 1.64G 44.9M 3.65 2.2% 3.3% 72.5% 128 11.184 11.05 98.80% 75.53%
Table 8. Comparison of SPEsim and hardware performance results
1kx1k matrix
Number of SPUs 1 2 3 4 5 6 7 8
SPEsim (GFLOPS) 1.46 2.84 4.15 5.39 6.56 7.66 8.67 9.71
Hardware (GFLOPS) 1.45 2.81 4.11 5.32 6.46 7.52 8.51 9.46
Model Accuracy 99.14% 98.82% 99.12% 98.79% 98.52% 98.12% 98.21% 97.45%
Efficiency 79.23% 76.78% 74.86% 72.68% 70.60% 68.49% 66.43% 64.62%

Table 8 shows the results measured on hardware as compared to the SPEsim model. The modeling accuracy for DP Linpack is greater than 97% in all cases. With a 1024x1024 matrix, the computational efficiency of parallel Linpack is greater than 64% when running on all eight SPUs. The best parallel Linpack DP floating-point results measured from hardware is 11.82GFLOPS for a 4096x4096 matrix, with an efficiency of 81%. The Linpack performance on IA-32 and IA-64 machines (Survey of High Performance Machines; see Related topics) is summarized in Table 9 along with the SPU results.

Table 9. Comparison of Linpack performance between Cell BE and other processors
Linpack 1kx1k (DP) Peak GFLOPS Actual GFLOPS Efficiency
SPU, 3.2GHz 1.83 1.45 79.23%
8 SPUs, 3.2GHz 14.63 9.46 64.66%
Pentium4, 3.2GHz 6.4 3.1 48.44%
Pentium4 + SSE3, 3.6GHz 14.4 7.2 50.00%
Itanium, 1.6GHz 6.4 5.95 92.97%

The Linpack implementation on the Cell BE has the highest DP floating point performance in the chart, exceeding the most recent IA-32 and IA-64 machines available today.

Optimization of MPEG-2 video decoding

The Cell BE is primarily targeted for game applications (Introduction to the Cell Multiprocessor), some of which demand a high video processing capability. Consequently, significant effort was devoted to optimizing MPEG-2 video decoding. Figure 6 shows the major components of an MPEG-2 video decoder; a Variable Length Decoder (VLD), an Inverse Quantizer (IQ), an Inverse Discrete Cosine Transform (IDCT), the Motion Compensation section (MC), and control logic.

The VLD code in an MPEG2 decoder has a lot of branches and needs significant optimization to run well on the SPU. Our optimizations include algorithmic tuning, lookup-table modification, fast bit manipulation with SPU intrinsics, static branch prediction with the gcc compiler's __builtin_expect pragma, function inlining, and global register assignment. Most of the optimizations also reduced the instruction count and eliminated or minimized branch penalties.

Figure 6. Process flow of MPEG2 decoding
Figure 6. Process flow of MPEG2 decoding
Figure 6. Process flow of MPEG2 decoding

A fast IDCT Algorithm (see Fast computational Algorithm for the Discrete Cosine Transform listed in Related topics), was adopted to operate an 8x8 2-D IDCT using 512 mpya (multiply and add) and 128 add/subtract instructions. This IDCT algorithm was regular and it was easy to keep the precision required by the MPEG2 standard for 16bx16b multiplication. Most operations in IDCT were four-way SIMD for the inner-product, except matrix transpose, which was eight-way SIMD.

The Motion Compensation part was implemented primarily with eight-way SIMD, and some 16-way SIMD. Function inlining and nested if else to switch case conversion were adopted to eliminate some branches and to improve code scheduling by enlarging basic block size.

Motion compensation required small pixel block transfers (for example, 16x16 pixels) from system memory to local store in order to construct the predicted blocks. If a video frame were stored in raster scan manner, then motion compensation would require numerous small DMA transfers (for example, 16 transfers of 16 bytes). In the Cell BE, however, the DMA transfers are most efficient when performed on 128B naturally aligned boundaries. In order to increase the efficiency, the data structure was rearranged to put each macroblock in a 384B contiguous area, consisting of a 16x16 pixel luminance block and two 8x8 pixel chrominance blocks (in the case of the 4:2:0 format). With this arrangement, a macroblock could be retrieved from the main memory to local store with fewer 128B DMA transfers.

Table 10 shows our results from the optimized decoder running on cycle-accurate SPU simulator. As is evident, a single 3.2GHz SPU can easily support any kind of real-time MPEG-2 video decoding.

Table 10. Single SPU MPEG2 decode performance at different resolutions
# of Cycles # of Insts. CPI # of Used Registers Frames/sec. @3.2GHz
CIF 1Mbps 63.4M 51.9M 1.22 126 1514
SDTV 5Mbps 263M 220M 1.20 126 365
SDTV 8Mbps 324M 290M 1.12 126 296
HDTV 18Mbps 1.25G 1.01G 1.24 126 77

* Note that there is an approximately 10% error in this SPEsim performance results.

On hardware, our optimized MPEG2 decoder was able to decode about 1379 CIF frames/sec, about 10% lower than the simulation result. Adjusted for modeling error, each SPU is capable of decoding about 324 frames/sec with a 5Mbps SDTV video stream. Intel reports that its Pentium4 processor running at 2.8GHz is capable of decoding about 310 frames of a SDTV 720x480x30fps video stream every second when using the highly tuned Intel Performance Primitive (IPP) library (see Related topics). With linear scaling to 3.2GHz, this frame rate increases to 354fps. Although differences in the video stream data can change these numbers to some extent, given that the Cell BE has eight SPUs, Cell BE should outperform a general purpose processor with SIMD capability by a fair margin.

Optimization of Triangle Transform and Lighting

Transform-light is a critical stage commonly found in vertex-based graphic pipelines. Because of its structured data, the transform-light implementation on Cell BE can benefit greatly from the SIMD support in the SPU. The optimizations here were primarily focused on loop unrolling, increasing the dual-issue rate, overlapping the DMAs with computation, and avoiding branches.

With many more registers than general-purpose processors, our Transform-Light on the SPU benefited significantly from aggressive loop unrolling.

Table 11. Performance of SPE versus PowerPC® 970 for Transform-Lighting
Loop Unrolling PPC970 2GHz (Mvtx/sec) PPC970 2.7GHz (Scaled) (Mvtx/sec) SPE 3.2GHz (Mvtx/sec) SPE Advantage
1 82.95 112 139.44 1.25
294.8128 155.92 1.22
4 89.47 121 208.48 1.73
8 58.45 79 217.2 2.75

For comparative analysis, we implemented a best-effort Transform-Light on a single-core PowerPC 970 (PPC970), system running at 2GHz. The results in Table 11 show that Transform-Light on one SPU performs about 70% better than the PPC970/VMX system when scaled to 2.7GHz. Given that the Cell BE has eight SPUs, the Cell BE should outperform the PPC970/VMX by a wide margin

Cryptography performance

Cryptography is a fast-emerging workload, and is increasingly found in many applications. The Cell BE, with its SIMD pipelines and a large register file, can execute cryptography functions efficiently. The code optimization here was focused on aggressive loop unrolling and register-based table lookups to take advantage of the SPU's large register file. Tuning was also done for dual issue, bit permutation, and byte shuffling. Table 12 summarizes the performance results of several key cryptography algorithms and hash functions on a single SPE, compared to a leading brand CPU with SIMD.

Table 12. Performance of a single SPE and another processor for Cryptography
Algorithms SPE (Gb/sec) Leading Brand CPU with SIMD (Gb/sec) SPE Performance Advantage
AES ECB Encrypt  - 128 bit key 2.059 1.029 2.002
                        - 192 bit key 1.710 0.877 1.950
                        - 256 bit key 1.462 0.762 1.918
AES CBC Encrypt - 128 bit key 0.795 0.968 0.821
                        - 192 bit key 0.664 0.823 0.807
                        - 256 bit key 0.570 0.725 0.786
AES ECB Decrypt - 128 bit key 1.499 1.035 1.448
                        - 192 bit key 1.252 0.870 1.438
                        - 256 bit key 1.068 0.758 1.410
AES CBC Decrypt - 128 bit key 1.507 0.966 1.560
                        - 192 bit key 1.249 0.829 1.507
                        - 256 bit key 1.066 0.724 1.472
DES   - ECB encrypt 0.492 0.426 1.156
        - CBC encrypt 0.275 0.417 0.660
        - ECB decrypt 0.492 0.425 1.158
        - CBC decrypt 0.489 0.421 1.162
TDES - ECB encrypt 0.174 0.133 1.313
        - CBC encrypt 0.097 0.132 0.733
        - ECB decrypt 0.174 0.133 1.313
        - CBC decrypt 0.174 0.132 1.321
MD5 2.448 2.862 0.855
SHA-1 2.116 0.902 2.347
SHA-256 0.854 0.518 1.649

Results demonstrate that the cryptography performance on a single SPE is typically better (up to 2.3x at the same frequency) than a leading brand processor with SIMD.

Cell BE performance summary

In Terrain Rendering Engine (TRE): Cell Broadband Optimized Real-Time Ray-Caster , Minor, et al. showed that their implementation of the Terrain Rendering Engine on the Cell BE runs 50x faster than a PPC970 with the VMX running at 2GHz.

With Cell BE's innovative micro-architectural features, we showed that a wide variety of algorithms on the Cell BE achieve performance that is equal to or significantly better than a general-purpose processor. For applications that can take advantage of SPU's SIMD pipelines and PPU's thread-level parallelism, the results will be similar. Table 13 summarizes the performance advantage of the Cell BE (or its one SPU) over other processors.

Table 13. Performance comparison of Cell BE versus other processors for different applications
Type Algorithm 3.2 GHz GPP 3.2 GHz Cell BE Perf Advantage
HPC Matrix Multiplication (S.P.) 25.6 Gflops* (w/SIMD) 200 GFlops (8SPEs) 8x (8SPEs)
Linpack (S.P.) 4kx4k 25.6 GFlops* (w/SIMD) 156 GFlops (8SPEs) 6x (8SPEs)
Linpack (D.P.) 1kx1k 7.2 GFlops (3.6GHz IA32/SSE3) 9.67 GFLops (8SPEs) 1.3x (8SPEs)
graphics TRE .85 fps (2.7GHz G5/VMX) 30 fps (Cell BE) 35x (Cell BE)
transform-light 128 MVPS (2.7GHz G5/VMX) 217 MVPS (one SPE) 1.7x (one SPE)
security AES ECB encryp. 128b key 1.03 Gbps 2.06Gbps (one SPE) 2x (one SPE)
AES ECB decryp. 128b key 1.04 Gbps 1.5Gbps (one SPE) 1.4x (one SPE)
TDES ECB encryp. 0.13 Gbps 0.17 Gbps (one SPE) 1.3x (one SPE)
DES ECB encryp. 0.43 Gbps 0.49 Gbps (one SPE) 1.1x (one SPE)
SHA-1 0.9 Gbps 2.12 Gbps (one SPE) 2.3x (one SPE)
video processing mpeg2 decoder (sdtv) 354 fps (w/SIMD) 329 fps (one SPE) 0.9x (one SPE)

* assuming 100% compute efficiency, achieving theoretical peak of 25.6GLOPS, in its single precision MatrixMultiply & Linpack implementation


The Cell BE is a best-of-breed design that delivers significant computational power, high bandwidth, excellent power and performance, and leading area efficiency within the constraints of a process technology.

With eight decoupled SPU SIMD engines, each with dedicated resources including DMA channels, the Cell BE has eight times more SIMD capability (for up to 16-way data parallelism) than traditional processors with vector architecture extensions, such as the PPC970 in the G5. The PPE is also free to process other applications even while all SPUs are in use. The Cell BE can perform especially well in cases where a general-purpose processor would normally get tied up by a single SIMD application. With its performance density and efficiency, the Cell BE can outperform leading-edge processors on a variety of workloads by approximately an order of magnitude, in some cases more, when software is tuned and optimized for this architecture.


Cell Broadband Engine is a trademark of Sony Computer Entertainment Inc.

Downloadable resources

Related topics

Zone=Multicore acceleration
ArticleTitle=Cell Broadband Engine Architecture and its first implementation