Level: Intermediate
Yang Pu (puyang@cn.ibm.com), Software engineer, IBM Systems & Technology Group, Development
Cheng Long (clong@cn.ibm.com), Software engineer, IBM Systems & Technology Group, Development
Rui Jianhua (ruijh@cn.ibm.com), Software engineer, IBM Systems & Technology Group, Development
19 Jun 2007
The Cell Broadband Engine™ (Cell/B.E.) processor has powerful computation capabilities, but to fully
unleash its power, you need to adopt a programming paradigm suited to it. In this article,
learn best practices for porting a JPEG compression application to the Cell/B.E.
Synergistic Processor Engine (SPE), and see how to take advantage of the processor's unique architecture and avoid its shortcomings.
The Cell Broadband Engine processor, jointly developed by Sony, Toshiba, and IBM, has nine
processor cores -- eight Synergistic Processor Engines (SPEs) and one general-purpose, dual-threaded
PowerPC®-based core (the PPE). Sony uses the Cell/B.E. processor as the
processing unit of its PLAYSTATION® 3, released in late 2006, and others are testing
the processor in such applications as medical imaging, media processing, and scientific
computing. (In fact, supercomputers and mainframes seem to be getting in on the action,
too; IBM is producing a hybrid
Cell/B.E.-Opteron supercomputer for LANL and has plans to link
the processor to mainframes through blade systems.)
The processor obviously has a bright future in many industries, but to fully unleash its
power, you need to keep its unique programming architecture in mind when writing,
configuring, and porting applications for it. We've already done some of that (the
porting part), and we believe our experiences and the techniques we've learned can
help you understand what to consider when porting computationally
intensive applications to the Cell/B.E. architecture.
A quick look at the hardware
The Cell/B.E. PPE implements the PowerPC architecture, so Linux® for PowerPC and
its existing applications can run on the Cell/B.E. chip without any change. But if you
want to use the SPEs' computing power, you need to follow some porting guidelines.
The SPE is a vector-only processor. Its architecture has the following characteristics:
- Each SPE has dual pipelines and supports dual issue. The even pipeline handles
arithmetic, and the odd pipeline handles memory operations.
- Each SPE has a 256KB memory space called the local store.
- The SPE uses DMA to move data between system memory and its local store. (Direct Memory Access is a means of handling data transfer between memory and a peripheral device that bypasses the central processing unit.)
- The SPE has no hardware branch prediction; it relies on software-assisted methods, such as branch hints, to reduce the cost of branches.
So what's JPEG?
JPEG (Joint Photographic Experts Group) is a popular standard for still-image compression.
It is used for photographs and in image-processing products such as printers, browsers, and so
on. A JPEG implementation has two functions -- compressing a bitmap (BMP) image into a JPEG image,
and decompressing a JPEG image back into a bitmap. One of the most popular implementations of the JPEG
algorithm comes from the Independent JPEG Group (IJG), and this article shows how we ported IJG's JPEG compression implementation to the SPE.
Figure 1 illustrates how the JPEG compression algorithm works.
Figure 1. The JPEG compression flow
For more details on the JPEG compression algorithm, please see Resources.
The six key porting considerations
Consider the following six technology issues when attempting to port
a compute-intensive application to the Cell/B.E. SPE:
- Compiler tool chains
- Workload characteristics
- Memory
- DMA transfer issues
- SPE-PPE communication
- 1-to-8 SPE performance scaling
The remaining sections of this article look at each issue in detail and include a
discussion on performance.
Compiler tool chains
Two tool chains are required to port an application to the Cell/B.E. SPE because
the SPE and PPE have different instruction sets. Both are included in the latest SDK release
(2.1; see Resources).
The IBM XLC compiler (see Resources) is optimized for the Cell/B.E. processor; we used it in our port by setting the SPU_COMPILER environment variable to XLC.
What's in a workload?
When you offload computing workload to the SPE, remember that the PPE is just a
normal PowerPC core and that the SPE is better at vector computing than at scalar
computing. This makes it very important to analyze the characteristics of the workload
before you divide the modules between the PPE and SPE.
In the JPEG application, DCT, quantization, and color space conversion are all
computationally intensive, so you partition the JPEG compression algorithm into two
modules, one for the PPE and one for the SPE, with the compute-intensive stages assigned to the SPE. Figure 2 illustrates the details.
Figure 2. Partitioning the JPEG algorithm between the SPE and PPE
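To see why a stage such as color space conversion suits a vector processor, consider the sketch below. It is written in Python for brevity and is not SPE code; the function name `rgb_to_ycbcr` is our own, and the coefficients are the standard JFIF RGB-to-YCbCr values rather than anything taken from the IJG source. Every pixel is converted independently with the same multiply-add pattern, which is the kind of work a SIMD unit can apply to several pixels per instruction.

```python
def rgb_to_ycbcr(pixels):
    """Convert a list of (R, G, B) tuples (0-255) to (Y, Cb, Cr) tuples.

    Illustrative only: uses the standard JFIF coefficients. Each pixel is
    independent of every other pixel, so the loop body is data-parallel.
    """
    out = []
    for r, g, b in pixels:
        y = 0.299 * r + 0.587 * g + 0.114 * b
        cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128
        cr = 0.5 * r - 0.418688 * g - 0.081312 * b + 128
        # Round to the nearest integer and clamp to the 8-bit range.
        out.append(tuple(min(255, max(0, round(v))) for v in (y, cb, cr)))
    return out


if __name__ == "__main__":
    print(rgb_to_ycbcr([(255, 255, 255), (0, 0, 0), (255, 0, 0)]))
```

Control-heavy, bit-serial stages show no such regularity, which is one reason to look at each stage separately before deciding where it runs.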
Limits on memory
Each SPE's 256KB local store must hold both code and data.
If the code is small enough, you should keep the whole program in the local store of the
SPE for better performance, because the alternative -- overlaying program code onto the SPE in pieces -- can degrade performance. The code of the JPEG compression program is small enough to fit in the local store.
We also analyzed the JPEG compression program on an x86 system to learn more about its memory requirements. Because the raw image data can exceed what the local store can hold, we divided the raw data into blocks and transferred them block by block through DMA. And speaking of DMA . . . .
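The arithmetic behind choosing a block size can be sketched as follows. This is a back-of-the-envelope helper in Python, not code from our port: the function name `rows_per_block`, the code size, and the reserved-data size are all hypothetical values chosen for illustration. The only figures taken from the text are the 256KB local store and the idea of splitting the raw image into blocks; rounding down to a multiple of 8 rows reflects the fact that JPEG operates on 8x8 sample blocks.

```python
LOCAL_STORE_BYTES = 256 * 1024  # per-SPE local store size


def rows_per_block(image_width, bytes_per_pixel, code_bytes, reserved_bytes,
                   num_buffers=2, mcu_rows=8):
    """Return how many image rows fit in one input buffer.

    code_bytes and reserved_bytes (stack, tables, output buffer) are
    hypothetical inputs you would measure for your own program.
    """
    free = LOCAL_STORE_BYTES - code_bytes - reserved_bytes
    if free <= 0:
        raise ValueError("code and reserved data exceed the local store")
    row_bytes = image_width * bytes_per_pixel
    rows = free // (num_buffers * row_bytes)
    rows -= rows % mcu_rows  # keep whole 8-row strips
    if rows == 0:
        raise ValueError("not even one 8-row strip fits; shrink the buffers")
    return rows


if __name__ == "__main__":
    # Assumed: 96KB of code, 64KB reserved, 24-bit pixels, two input buffers.
    print(rows_per_block(1024, 3, 96 * 1024, 64 * 1024))
    print(rows_per_block(1920, 3, 96 * 1024, 64 * 1024))
```

With these assumed sizes, a wider image leaves room for fewer rows per block, so the block height has to be computed per image rather than fixed in advance.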
The role of DMA transfer
DMA plays an important role in the Cell/B.E. memory architecture. The Memory Flow Controller (MFC) in each SPE serves as the interface from that SPE's local store to main memory or to the local stores of other SPEs.
If data is larger than the local store, you can keep it in main memory and
use software-driven DMA operations to transfer it piece by piece. As Figure 2 shows, the SPE
initiates a DMA transfer to read blocks of raw BMP data from main memory into the local store. After
the SPE finishes working on the data and the compression result is available, it uses another DMA transfer to write the result back to main memory.
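One practical detail when moving blocks this way: a single MFC DMA command transfers at most 16KB, so a larger block has to be issued as several commands. The Python sketch below only plans the chunks; it does not issue any DMA, and the name `plan_dma` and the simplified 16-byte alignment check are our own assumptions for illustration (the hardware's full alignment and size rules are in the Cell/B.E. documentation).

```python
MAX_DMA_BYTES = 16 * 1024  # largest single MFC DMA transfer
DMA_ALIGN = 16             # simplified alignment rule used in this sketch


def plan_dma(effective_address, size):
    """Split one logical transfer into (address, length) chunks of <= 16KB."""
    if effective_address % DMA_ALIGN or size % DMA_ALIGN:
        raise ValueError("address and size must be multiples of 16 bytes")
    chunks = []
    offset = 0
    while offset < size:
        length = min(MAX_DMA_BYTES, size - offset)
        chunks.append((effective_address + offset, length))
        offset += length
    return chunks


if __name__ == "__main__":
    # A hypothetical 40KB block of raw BMP data at address 0x10000.
    for address, length in plan_dma(0x10000, 40 * 1024):
        print(hex(address), length)
```

On a real SPE, each planned chunk would correspond to one DMA get or put command, and the program would wait for all of them to complete before touching the buffer.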
Chit-chat between SPE and PPE
Each SPE has three mailboxes for communication between the SPE and PPE. Two of them send outbound messages to the PPE, and the third receives inbound messages from the PPE. In addition, an SPE can receive inbound messages from the PPE through two signal notification channels.
In our JPEG port, the PPE works as a controller and passes parameters to the SPE by mailbox when a compression job is launched. The SPE also uses a mailbox to notify the PPE that the current job is complete. No synchronization is required among different SPEs because they are designed to work independently.
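The controller pattern described above can be modeled in a few lines. The sketch below is a Python simulation using threads and queues, not the SDK mailbox API: the queue depths mirror the hardware (a four-entry inbound mailbox and a one-entry outbound mailbox, each carrying 32-bit words), while the message layout (width, height, quality), the `DONE` code, and the function names are hypothetical choices of ours.

```python
import queue
import threading

DONE = 0x1  # hypothetical completion code


def spe_worker(inbound, outbound, results):
    """Stand-in for the SPE program: read parameters, work, report back."""
    width = inbound.get()
    height = inbound.get()
    quality = inbound.get()
    results.append((width, height, quality))  # placeholder for compression
    outbound.put(DONE)                        # notify the PPE of completion


def run_job(width, height, quality):
    """Stand-in for the PPE controller: send parameters, wait for DONE."""
    inbound = queue.Queue(maxsize=4)   # SPE inbound mailbox: 4 entries
    outbound = queue.Queue(maxsize=1)  # SPE outbound mailbox: 1 entry
    results = []
    worker = threading.Thread(target=spe_worker,
                              args=(inbound, outbound, results))
    worker.start()
    for word in (width, height, quality):
        inbound.put(word & 0xFFFFFFFF)  # mailbox messages are 32-bit words
    status = outbound.get()             # blocks until the worker reports
    worker.join()
    return status, results[0]


if __name__ == "__main__":
    print(run_job(640, 480, 75))
```

Because each worker talks only to its controller, running eight of these side by side needs no coordination among the workers themselves.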
From one to eight
How you scale from a single SPE to eight SPEs affects performance. The eight SPEs can work on different stages of one task as a pipeline, or they can work on different tasks in parallel.
Huffman coding
Huffman coding is an entropy encoding algorithm used for lossless
data compression. It encodes each source symbol (such as a character in a file) using a variable-length code table derived from the estimated probability of occurrence of each possible value of the source symbol. Its method for choosing the representation of each symbol produces a prefix-free code (sometimes called a "prefix code") -- that is, the bit string representing one symbol is never a prefix of the bit string representing any other symbol -- in which the most common symbols get shorter bit strings than less common ones. For a set of symbols with a uniform probability distribution and a number of members that is a power of two, Huffman coding is equivalent to simple fixed-length binary block encoding (as in ASCII coding). Thank you, David A. Huffman.
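A minimal sketch of the textbook construction follows, in Python. It builds a code table from symbol frequencies by repeatedly merging the two least frequent subtrees; it is a general illustration, not the table-driven encoder in the IJG library, and the name `huffman_code` and the sample frequencies are our own.

```python
import heapq
from itertools import count


def huffman_code(freqs):
    """Return {symbol: bit string} for a dict of symbol frequencies."""
    if len(freqs) == 1:
        return {next(iter(freqs)): "0"}
    tiebreak = count()  # unique counter so dicts are never compared
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)  # least frequent subtree
        f2, _, codes2 = heapq.heappop(heap)  # next least frequent subtree
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


if __name__ == "__main__":
    table = huffman_code({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5})
    for symbol in sorted(table):
        print(symbol, table[symbol])
```

With the sample frequencies, the most frequent symbol receives a one-bit code and the two rarest receive four-bit codes. Feeding in four equally likely symbols gives every symbol a two-bit code, which is the fixed-length special case mentioned above.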
In our JPEG compression port, we expected the Cell/B.E. processor to
handle many JPEG compression tasks at once, so we assigned each independent task to its own SPE.
What matters here is the aggregate throughput of the eight SPEs. In an earlier port, we found
that the PPE was the bottleneck: running the Huffman encode
function on the PPE limited total performance. After we moved the Huffman encode function to each SPE, total performance improved, although the performance of each single SPE decreased.
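A simple throughput model shows how both observations can hold at once. The Python sketch below uses made-up per-image timings (10 ms of SPE work, 4 ms for Huffman encoding on the PPE, 6 ms for Huffman encoding on an SPE); none of these numbers are measurements from our port, and the function names are ours. The model assumes the PPE encodes one image at a time and that SPE and PPE work overlap perfectly.

```python
def throughput_huffman_on_ppe(num_spes, spe_ms, ppe_huffman_ms):
    """Images per ms when every SPE hands its output to a single PPE."""
    spe_rate = num_spes / spe_ms       # what the SPEs can produce
    ppe_rate = 1.0 / ppe_huffman_ms    # what the one PPE can encode
    return min(spe_rate, ppe_rate)     # the slower stage sets the pace


def throughput_huffman_on_spe(num_spes, spe_ms, spe_huffman_ms):
    """Images per ms when each SPE also does its own Huffman encoding."""
    return num_spes / (spe_ms + spe_huffman_ms)


if __name__ == "__main__":
    for n in (1, 8):
        on_ppe = throughput_huffman_on_ppe(n, 10.0, 4.0)
        on_spe = throughput_huffman_on_spe(n, 10.0, 6.0)
        print(n, "SPE(s):", on_ppe * 1000, "vs", on_spe * 1000, "images/s")
```

Under these assumptions, one SPE is faster with the PPE's help (100 versus 62.5 images per second), but with eight SPEs the PPE caps the first design at 250 images per second while the second reaches 500, even though each SPE now spends 16 ms per image instead of 10 ms.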
Some last-minute performance considerations
To improve the performance of JPEG compression on an SPE, we used several optimization
methods. We used double buffering to transfer BMP data and hide the latency of data
transfer. We used many intrinsic functions to vectorize the fDCT and color
space conversion modules. In addition, we analyzed the context of frequently mispredicted
branches and then used static branch prediction to reduce their miss rates. The article
"Maximizing the power of the Cell Broadband Engine processor" (see Resources) describes more optimization methods.
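The control flow of double buffering can be sketched as follows. This Python version only shows the ordering of operations; `fetch` and `process` are hypothetical callables standing in for a DMA read and the compression step, and in Python the fetch runs synchronously. On an SPE the fetch would be an asynchronous DMA command, so the transfer of the next block would overlap with the processing of the current one, and the program would wait for that transfer to complete before using the buffer.

```python
def compress_double_buffered(blocks, fetch, process):
    """Process blocks with two buffers, fetching block i+1 before block i
    is processed."""
    results = []
    if not blocks:
        return results
    buffers = [None, None]
    buffers[0] = fetch(blocks[0])  # prime the first buffer
    for i in range(len(blocks)):
        current = i % 2
        if i + 1 < len(blocks):
            # Start filling the other buffer before working on this one.
            buffers[1 - current] = fetch(blocks[i + 1])
        results.append(process(buffers[current]))
    return results


if __name__ == "__main__":
    log = []

    def fetch(block):
        log.append(("fetch", block))
        return block

    def process(data):
        log.append(("process", data))
        return data * 2

    print(compress_double_buffered([1, 2, 3], fetch, process))
    print(log)
```

The log shows each fetch for the next block being issued ahead of the processing of the current block, which is the ordering that lets transfer latency hide behind computation.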
In conclusion
When porting an application to a Cell/B.E. SPE, developers have to do two things to take advantage of what the processor has to offer:
- Make some changes to the original source code.
- Experiment with various optimization methods to fine-tune performance.
We hope that by offering our experiences, we can jump start your porting process.
Resources
Learn
- For optimization techniques (25 of them), try "Maximizing the power of the Cell Broadband Engine processor: 25 tips to optimal application performance" (developerWorks, June 2006).
- For experienced C/C++ programmers who are interested in developing applications or libraries for the Cell/B.E. processor, the Cell/B.E. programming tutorial can help with your exploits.
- The "Cell Broadband Engine Programming Handbook" provides information for developing applications, libraries, middleware, drivers, compilers, or operating systems -- the whole nine yards -- for the processor.
- The Cell/B.E. specifications take a close look at the processor structure based upon the 64-bit Power Architecture technology, but with unique features directed toward distributed processing and media-rich applications.
- The JPEG page and JPEG library will provide loads of information on compression algorithms.
- The "Software Development Kit 2.1 Installation Guide Version 2.1" (PDF) will walk you through installation and configuration and many of the basics you need to know to get started with development. Two companion pieces, "Cell/B.E. SDK 2.1: Setting up Fedora Core 6" and "Cell/B.E. SDK 2.1: Understanding the terminology" (developerWorks, April 2007), can help get the requisite FC6 up and running and provide a quick reference to Cell/B.E. terminology.
- For more on Cell/B.E. programming, try the developerWorks series "Programming high-performance applications on the Cell/B.E. processor," "PS3 fab to lab," and "The little broadband engine that could."
- The IBM microNews newsletter delivers Cell/B.E. happenings to your desktop twice a month.
About the authors
Yang Pu is a software engineer in IBM China System Technology Group. Since joining IBM in 2005, he has worked on performance tools -- including performance optimization of benchmarks on Cell/B.E. processors -- for about two years.
Cheng Long is a staff software engineer in IBM China Systems and Technology lab. He has extensive experience in system and software performance analysis, performance tuning, and performance tools development. Currently he is the team lead for developing an emerging IBM performance analysis toolset, Visual Performance Analyzer. More information about Visual Performance Analyzer can be found at IBM alphaWorks http://www.alphaworks.ibm.com/tech/vpa.
Rui Jianhua is a staff software engineer in IBM China System Technology Group. His responsibilities include performance tools development and benchmark optimization. He has extensive experience in system architecture and system performance.