The Cell Broadband Engine (Cell/B.E.) processor, jointly developed by Sony, Toshiba, and IBM, has nine processor cores: eight Synergistic Processor Engines (SPEs) and one general-purpose, dual-threaded PowerPC®-based core (the PPE). Sony uses the Cell/B.E. processor as the processing unit of its PLAYSTATION® 3, released in late 2006, and others are testing the processor in applications such as medical imaging, media processing, and scientific computing. (Supercomputers and mainframes are getting in on the action, too: IBM is producing a hybrid Cell/B.E.-Opteron supercomputer for LANL and has plans to link the processor to mainframes through blade systems.)
The processor has a bright future in many industries, but to fully unleash its power, you need to keep its unique programming architecture in mind when writing, configuring, and porting applications for it. We've already done some of that (the porting part), and we believe our experiences and the techniques we've learned can help you understand what to consider when porting computationally intensive applications to the Cell/B.E. architecture.
The Cell/B.E. PPE implements the PowerPC architecture, so Linux® for PowerPC and its existing applications can run on the Cell/B.E. chip without any change. But if you want to use the SPEs' computing power, you need to follow some porting guidelines.
The SPE is a SIMD (vector) processor. Its architecture has the following characteristics:
- Each SPE has dual pipelines and can issue two instructions per cycle. The even pipeline handles arithmetic, and the odd pipeline handles memory operations such as loads and stores.
- Each SPE has a 256KB memory space called the local store.
- The SPE uses DMA to manage the data between system memory and its local store. (Direct Memory Access is a means of handling data transfer between memory and a peripheral device that bypasses the central processing unit.)
- The SPE has no hardware branch prediction; it relies on software-assisted methods, such as branch hints, to reduce the cost of branches.
JPEG (Joint Photographic Experts Group) is a popular standard for still-image compression and is used in image-processing products such as digital cameras, printers, browsers, and so on. The JPEG algorithm has two functions: compressing a bitmap (BMP) image into a JPEG image and decompressing it again. One of the most popular implementations of the JPEG algorithm comes from the Independent JPEG Group (IJG), and this article shows how we ported IJG's JPEG compression implementation to the SPE.
Figure 1 illustrates how the JPEG compression algorithm works.
Figure 1. The JPEG compression flow
For more details on the JPEG compression algorithm, please see Resources.
The six key porting considerations
Consider the following six technology issues when attempting to port a compute-intensive application to the Cell/B.E. SPE:
- Compiler tool chains
- Workload characteristics
- Memory
- DMA transfer issues
- SPE-PPE communication
- 1-to-8 SPE performance scaling
The remaining sections of this article look at each issue in detail and include a discussion on performance.
Porting an application to the Cell/B.E. SPE requires two sets of tool chains because the SPE and the PPE have different instruction sets. Both tool chains are included in the latest SDK release (2.1; see Resources).
The IBM XLC compiler (see Resources) is optimized for the Cell/B.E. processor. We used it in our port by setting the SPU_COMPILER environment variable to XLC.
When you offload computing work to the SPE, remember that the PPE is an ordinary PowerPC core and that the SPE is better at vector computing than at scalar computing. This makes it very important to analyze the characteristics of the workload before you divide the modules between the PPE and the SPE.
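To make the vector-versus-scalar distinction concrete, here is a minimal sketch that scales coefficients four at a time in a 128-bit value, the same width as an SPE register. It is written with the GCC/Clang vector extension so that it runs on an ordinary workstation; it is not SPE code and is not taken from our port. The function name `scale_coeffs` and the idea of scaling by precomputed per-element factors are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Four 32-bit floats packed into one 128-bit value (GCC/Clang extension).
 * On a real SPE the corresponding type would be "vector float". */
typedef float vec4f __attribute__((vector_size(16)));

/* Hypothetical helper: multiply n coefficients by per-element factors,
 * four elements per operation. n must be a multiple of 4. */
void scale_coeffs(float *coef, const float *factor, size_t n)
{
    assert(coef != NULL && factor != NULL && n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        vec4f c, f;
        memcpy(&c, coef + i, sizeof c);    /* load four coefficients */
        memcpy(&f, factor + i, sizeof f);  /* load four factors */
        c = c * f;                         /* one element-wise multiply */
        memcpy(coef + i, &c, sizeof c);    /* store four results */
    }
}
```

A scalar loop would perform one multiply per iteration; the vector form performs four. Code that cannot be expressed this way is usually a better fit for the PPE.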
In the JPEG application, DCT, quantization, and color space conversion are all computationally intensive, so we partitioned the JPEG compression algorithm into two modules, one running on the PPE and one on the SPE, with the compute-intensive stages assigned to the SPE. Figure 2 shows the details.
Figure 2. Partitioning the JPEG algorithm between the SPE and PPE
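As an illustration of why a stage such as color space conversion suits the SPE, the following sketch converts RGB pixels to YCbCr using the standard JFIF equations in 16-bit fixed point. Every pixel is handled independently with the same multiply-and-add pattern, which is the kind of work that vectorizes well. This is a plain scalar C sketch, not IJG's table-driven code and not our SPE version; the function names are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIX_SHIFT 16
#define FIX(x)    ((int32_t)((x) * 65536.0 + 0.5))  /* coefficient * 2^16, rounded */
#define FIX_HALF  ((int32_t)1 << (FIX_SHIFT - 1))   /* rounding term before the shift */
#define CENTER    ((int32_t)128 << FIX_SHIFT)       /* chroma offset of 128 */

/* Clamp an intermediate result to the 0..255 range of an 8-bit sample. */
static uint8_t clamp_u8(int32_t v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Convert one pixel using the JFIF RGB-to-YCbCr equations. */
void rgb_to_ycbcr(uint8_t r, uint8_t g, uint8_t b,
                  uint8_t *y, uint8_t *cb, uint8_t *cr)
{
    assert(y != NULL && cb != NULL && cr != NULL);
    int32_t yy  =  FIX(0.299) * r + FIX(0.587) * g + FIX(0.114) * b + FIX_HALF;
    int32_t cbb = -FIX(0.168736) * r - FIX(0.331264) * g + FIX(0.5) * b
                  + CENTER + FIX_HALF;
    int32_t crr =  FIX(0.5) * r - FIX(0.418688) * g - FIX(0.081312) * b
                  + CENTER + FIX_HALF;
    *y  = clamp_u8(yy  >> FIX_SHIFT);
    *cb = clamp_u8(cbb >> FIX_SHIFT);
    *cr = clamp_u8(crr >> FIX_SHIFT);
}

/* Convert a row of interleaved RGB pixels to interleaved YCbCr.
 * No iteration depends on another, so the loop is data-parallel. */
void rgb_row_to_ycbcr(const uint8_t *rgb, uint8_t *ycc, size_t npixels)
{
    assert(rgb != NULL && ycc != NULL);
    for (size_t i = 0; i < npixels; ++i)
        rgb_to_ycbcr(rgb[3 * i], rgb[3 * i + 1], rgb[3 * i + 2],
                     &ycc[3 * i], &ycc[3 * i + 1], &ycc[3 * i + 2]);
}
```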
Each SPE can hold only 256KB of code and data. If the code is small enough, keep all of it in the SPE's local store for better performance, because the alternative, overlaying program code onto the SPE, can degrade performance. The code of the JPEG compression program is small enough to fit in the local store.
We also analyzed the JPEG compression program on an x86 system to learn more about its memory requirements. We divided raw data into blocks and transferred them block by block through DMA. And speaking of DMA . . . .
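A block size for this kind of transfer can be estimated with a few lines of arithmetic. The sketch below computes how many image rows fit into a given local-store buffer budget, rounded down to a whole number of MCU rows so that every block can be compressed on its own. The function name, the budget, and the choice to align blocks on MCU rows are assumptions for illustration; the article does not state the block size we actually used.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of image rows per DMA block.
 *   width_px     - image width in pixels
 *   bytes_per_px - 3 for 24-bit BMP data
 *   budget_bytes - local-store bytes reserved for one input buffer
 *   mcu_rows     - rows per MCU (8 without chroma subsampling)
 * Returns 0 if not even one MCU row fits in the budget. */
size_t rows_per_block(size_t width_px, size_t bytes_per_px,
                      size_t budget_bytes, size_t mcu_rows)
{
    assert(mcu_rows > 0);
    size_t row_bytes = width_px * bytes_per_px;
    if (row_bytes == 0)
        return 0;
    size_t rows = budget_bytes / row_bytes;  /* rows that fit in the budget */
    return rows - rows % mcu_rows;           /* round down to whole MCU rows */
}
```

For a 1024-pixel-wide, 24-bit image and a 64KB buffer, each row takes 3072 bytes, 21 rows fit, and the block is rounded down to 16 rows.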
DMA plays an important role in the Cell/B.E. memory architecture. The Memory Flow Controller (MFC) in each SPE serves as the interface between that SPE's local store and main memory or the local stores of other SPEs.
If data is too large for the local store, you can keep it in main memory and use software-driven DMA operations to transfer it. As Figure 2 shows, the SPE initiates a DMA transfer to read blocks of raw BMP data from main memory into the local store. After the SPE finishes working on the data and the compression result is available, it uses another DMA transfer to write the result back to main memory.
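One practical DMA issue is that a single MFC transfer moves at most 16KB, and transfers of 16 bytes or more must be 16-byte aligned and a multiple of 16 bytes long, so a larger block has to be split into several transfers. The sketch below shows that splitting logic. It runs on any host because `sim_dma` is a `memcpy` stand-in that only checks the constraints; on a real SPE that call would be replaced by `mfc_get` or `mfc_put` from `spu_mfcio.h` plus a wait on the tag group. The function names here are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DMA_MAX_BYTES 16384u  /* largest single MFC transfer */
#define DMA_ALIGN     16u     /* required alignment and size granularity */

/* Simulated transfer: checks the MFC size and alignment rules, then copies.
 * This is a host-side stand-in, not a real DMA command. */
static void sim_dma(void *dst, const void *src, size_t size)
{
    assert(size > 0 && size <= DMA_MAX_BYTES && size % DMA_ALIGN == 0);
    assert((uintptr_t)dst % DMA_ALIGN == 0);
    assert((uintptr_t)src % DMA_ALIGN == 0);
    memcpy(dst, src, size);
}

/* Move `size` bytes using as many transfers of at most 16KB as needed.
 * Returns the number of transfers issued. */
size_t dma_copy_large(void *dst, const void *src, size_t size)
{
    uint8_t *d = dst;
    const uint8_t *s = src;
    size_t transfers = 0;

    while (size > 0) {
        size_t chunk = size < DMA_MAX_BYTES ? size : DMA_MAX_BYTES;
        sim_dma(d, s, chunk);
        d += chunk;
        s += chunk;
        size -= chunk;
        ++transfers;
    }
    return transfers;
}
```

Copying a 40,000-byte block this way issues three transfers: two of 16,384 bytes and one of 7,232 bytes.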
Each SPE has three mailboxes for communicating with the PPE. Two of them send outbound messages to the PPE, and the third receives inbound messages from the PPE. In addition, an SPE can receive inbound messages from the PPE through two signal notification channels.
In the JPEG port, the PPE works as a controller and passes parameters to the SPE by mailbox when a compression job is launched. The SPE also uses a mailbox to notify the PPE when the current job is complete. No synchronization is required among the SPEs because they are designed to work independently.
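Mailbox messages are 32 bits wide, so one common way to pass parameters is to send the 64-bit main-memory address of a parameter block as two mailbox words and let the SPE fetch the block by DMA. The sketch below simulates that handshake on the host with a four-entry queue standing in for the SPE's inbound mailbox. The `sim_mbox` type and all function names are assumptions; on real hardware the SPE side would read with `spu_read_in_mbox` and report completion with `spu_write_out_mbox`, and a read from an empty mailbox would stall instead of returning 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MBOX_DEPTH 4u  /* the SPE inbound mailbox holds four entries */

/* Simulated mailbox: a small FIFO of 32-bit messages. */
typedef struct {
    uint32_t slot[MBOX_DEPTH];
    unsigned head;   /* index of the oldest entry */
    unsigned count;  /* number of queued entries */
} sim_mbox;

/* Queue one word; returns 0 if the mailbox is full. */
int mbox_write(sim_mbox *m, uint32_t v)
{
    assert(m != NULL);
    if (m->count == MBOX_DEPTH)
        return 0;
    m->slot[(m->head + m->count) % MBOX_DEPTH] = v;
    m->count++;
    return 1;
}

/* Dequeue one word; returns 0 if the mailbox is empty. */
int mbox_read(sim_mbox *m, uint32_t *v)
{
    assert(m != NULL && v != NULL);
    if (m->count == 0)
        return 0;
    *v = m->slot[m->head];
    m->head = (m->head + 1) % MBOX_DEPTH;
    m->count--;
    return 1;
}

/* Controller side: send a 64-bit address as high word, then low word. */
int send_address(sim_mbox *m, uint64_t ea)
{
    assert(m != NULL);
    if (MBOX_DEPTH - m->count < 2)
        return 0;  /* not enough room for both halves */
    mbox_write(m, (uint32_t)(ea >> 32));
    mbox_write(m, (uint32_t)(ea & 0xFFFFFFFFu));
    return 1;
}

/* Worker side: reassemble the address from two mailbox words. */
int recv_address(sim_mbox *m, uint64_t *ea)
{
    uint32_t hi = 0, lo = 0;
    assert(m != NULL && ea != NULL);
    if (m->count < 2)
        return 0;
    mbox_read(m, &hi);
    mbox_read(m, &lo);
    *ea = ((uint64_t)hi << 32) | lo;
    return 1;
}
```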
Scaling from a single SPE to eight SPEs affects performance, and there is more than one way to do it: the eight SPEs can work on different stages of one task as a pipeline, or each SPE can work on a different task in parallel.
In our JPEG compression port, we expected the Cell/B.E. processor to handle many JPEG compression tasks at once, so we assigned an independent task to each SPE. What matters here is the aggregate throughput of the eight SPEs. In an earlier version of the port, we found that the PPE was the bottleneck: running the Huffman encoding function on the PPE limited total performance. After we moved Huffman encoding to each SPE, total performance improved, although the performance of each individual SPE decreased somewhat.
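The trade-off can be reasoned about with a simple throughput model, sketched below: the SPEs work in parallel, the PPE share of each image is serialized, and the slower of the two stages limits the total. The model and the function name are assumptions, and it ignores DMA contention and the PPE's second hardware thread. Any numbers fed into it are hypothetical inputs, not measurements from our port.

```c
#include <assert.h>

/* Hypothetical model: images per second when n_spe SPEs each spend
 * spe_ms_per_image on an image and the single PPE spends
 * ppe_ms_per_image on every image. The slower stage sets the rate. */
double images_per_second(unsigned n_spe, double spe_ms_per_image,
                         double ppe_ms_per_image)
{
    assert(n_spe > 0 && spe_ms_per_image > 0.0 && ppe_ms_per_image >= 0.0);
    double spe_rate = n_spe * 1000.0 / spe_ms_per_image;
    if (ppe_ms_per_image == 0.0)
        return spe_rate;  /* no PPE work per image, so SPEs are the limit */
    double ppe_rate = 1000.0 / ppe_ms_per_image;
    return spe_rate < ppe_rate ? spe_rate : ppe_rate;
}
```

With made-up costs of 8 ms per image on an SPE and 4 ms of Huffman encoding on the PPE, the model yields 250 images per second, limited by the PPE. If moving Huffman encoding to the SPEs raised the SPE cost to 12.5 ms and left 0.5 ms on the PPE, the model yields 640 images per second, even though each SPE is now slower per image.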
Some last-minute performance considerations
To improve the performance of JPEG compression on an SPE, we used several optimization methods. We used double buffering to transfer BMP data and hide the latency of data transfer. We used many intrinsic functions to vectorize the fDCT and color space conversion modules. In addition, we analyzed the context of frequently mispredicted branches and then used static branch prediction to reduce their miss rates. The article "Maximizing the power of the Cell Broadband Engine processor" (see Resources) describes more optimization methods.
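The double-buffering pattern can be sketched as follows: while one buffer is being processed, the transfer for the next block is already in flight into the other buffer. This host-side sketch uses `memcpy` and an empty wait function as stand-ins, so the copy completes immediately and nothing actually overlaps; on a real SPE, `start_get` would issue an asynchronous `mfc_get` and `wait_tag` would wait on the matching tag group. The block size, the checksum "processing" step, and all function names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 4096u  /* assumed block size for this sketch */

/* Stand-in for starting an asynchronous DMA get into a local buffer. */
static void start_get(uint8_t *ls_buf, const uint8_t *main_mem, size_t n)
{
    memcpy(ls_buf, main_mem, n);
}

/* Stand-in for waiting until the transfer with this tag has finished. */
static void wait_tag(unsigned tag)
{
    (void)tag;  /* the simulated copy is already complete */
}

/* Placeholder for the real work (fDCT, quantization, and so on). */
static uint32_t process_block(const uint8_t *buf, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += buf[i];
    return sum;
}

/* Process nblocks blocks of BLOCK_BYTES each with two alternating buffers. */
uint32_t process_double_buffered(const uint8_t *src, size_t nblocks)
{
    uint8_t buf[2][BLOCK_BYTES];
    uint32_t total = 0;
    unsigned cur = 0;

    assert(src != NULL || nblocks == 0);
    if (nblocks == 0)
        return 0;

    start_get(buf[cur], src, BLOCK_BYTES);  /* fetch block 0 up front */
    for (size_t i = 0; i < nblocks; ++i) {
        unsigned next = cur ^ 1u;
        if (i + 1 < nblocks)                /* request block i+1 early */
            start_get(buf[next], src + (i + 1) * BLOCK_BYTES, BLOCK_BYTES);
        wait_tag(cur);                      /* make sure block i has arrived */
        total += process_block(buf[cur], BLOCK_BYTES);
        cur = next;                         /* swap buffer roles */
    }
    return total;
}
```

The cost of this technique is local-store space: two input buffers are resident instead of one, which matters when code and data share 256KB.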
When porting an application to a Cell/B.E. SPE, developers have to do two things to take advantage of the benefits the processor has to offer:
- Make some changes to original source code.
- Experiment with various specific optimization methods to fine tune for better performance.
We hope that by offering our experiences, we can jump start your porting process.
Learn
- For optimization techniques (25 of them), try "Maximizing the power of the Cell Broadband Engine processor: 25 tips to optimal application performance" (developerWorks, June 2006).
- For experienced C/C++ programmers who are interested in developing applications or libraries for the Cell/B.E. processor, the Cell/B.E. programming tutorial can help with your exploits.
- The "Cell Broadband Engine Programming Handbook" provides information for developing applications, libraries, middleware, drivers, compilers, or operating systems -- the whole nine yards -- for the processor.
- The Cell/B.E. specifications take a close look at the processor structure based upon the 64-bit Power Architecture technology, but with unique features directed toward distributed processing and media-rich applications.
- The JPEG page and JPEG library will provide loads of information on compression algorithms.
- The "Software Development Kit 2.1 Installation Guide Version 2.1" (PDF) will walk you through installation and configuration and many of the basics you need to know to get started with development. Two companion pieces, "Cell/B.E. SDK 2.1: Setting up Fedora Core 6" and "Cell/B.E. SDK 2.1: Understanding the terminology" (developerWorks, April 2007), can help get the requisite FC6 up and running and provide a quick reference to Cell/B.E. terminology.
- For more on Cell/B.E. programming, try the developerWorks series "Programming high-performance applications on the Cell/B.E. processor," "PS3 fab to lab," and "The little broadband engine that could."
- The IBM microNews newsletter delivers Cell/B.E. happenings to your desktop twice a month.
Get products and technologies
- Here is the centerpiece of Cell/B.E. development, the latest Cell/B.E. SDK release, version 2.1.
- We used the IBM XLC compiler in our porting efforts -- it is optimized for the Cell/B.E. processor.
- The developerWorks Cell Broadband Engine Resource Center is your clearinghouse for Cell/B.E.-related resources, downloads, and news.
- The Linux on Cell/B.E.-based systems site at the Barcelona Supercomputing Center provides information about how to enable Linux on the processor.
Discuss
- Participate in the discussion forum.
- The Cell Broadband Engine Architecture forum is the place to get your technical questions about the processor answered. (Six juicy problems and answers from the forums are rounded up periodically and highlighted in the blog series, "Forum watch.")
- The Power Architecture blog provides news, downloads, instructional resources, and event notifications for Cell/B.E. and other Power Architecture-related technologies and is the home of two blog series -- "Forum watch" (Q&A roundup) and the "FixIt" technology updates.
- This contact page will enable you to discuss customized Cell/B.E. processor solutions with an IBM representative.

Yang Pu is a software engineer in IBM China System Technology Group. Since joining IBM in 2005, he has worked on performance tools, including performance optimization of benchmarks on Cell/B.E. processors, for about two years.

Cheng Long is a staff software engineer in IBM China Systems and Technology lab. He has extensive experience in system and software performance analysis, performance tuning, and performance tools development. Currently he is the team lead for developing an emerging IBM performance analysis toolset, Visual Performance Analyzer. More information about Visual Performance Analyzer can be found at IBM alphaWorks: http://www.alphaworks.ibm.com/tech/vpa.





