Porting practices: Compute-intensive applications

These practices can help when you want to bring a compute-intensive application to the Cell/B.E. architecture

Yang Pu (puyang@cn.ibm.com), Software engineer, IBM Systems & Technology Group, Development
Yang Pu is a software engineer in IBM China System Technology Group. Since joining IBM in 2005, he has worked on performance tools, including performance optimization of benchmarks on Cell/B.E. processors, for about two years.
Cheng Long (clong@cn.ibm.com), Software engineer, IBM Systems & Technology Group, Development
Cheng Long is a staff software engineer in IBM China Systems and Technology lab. He has extensive experience in system and software performance analysis, performance tuning, and performance tools development. He currently leads the team developing an emerging IBM performance analysis toolset, Visual Performance Analyzer. More information about Visual Performance Analyzer can be found at IBM alphaWorks http://www.alphaworks.ibm.com/tech/vpa.
Rui Jianhua (ruijh@cn.ibm.com), Software engineer, IBM Systems & Technology Group, Development
Rui Jianhua is a staff software engineer in IBM China System Technology Group. His responsibilities include performance tools development and benchmark optimization. He has extensive experience in system architecture and system performance.

Summary:  The Cell Broadband Engine™ (Cell/B.E.) processor has powerful computation capabilities, but to fully unleash its power, you need to adopt its unique programming paradigm. In this article, learn best practices for porting a JPEG compression application to the Cell/B.E. Synergistic Processor Engine (SPE), and see how to take advantage of the processor's unique architecture and avoid its shortcomings.

Date:  19 Jun 2007
Level:  Intermediate


The Cell Broadband Engine processor, jointly developed by Sony, Toshiba, and IBM, has nine processor cores: eight Synergistic Processor Engines (SPEs) and one general-purpose, dual-threaded PowerPC®-based core (the PPE). Sony uses the Cell/B.E. processor as the processing unit of its PLAYSTATION® 3, released in late 2006, and others are testing the processor in applications such as medical imaging, media processing, and scientific computing. (Supercomputers and mainframes are getting in on the action, too: IBM is producing a hybrid Cell/B.E.-Opteron supercomputer for LANL and has plans to link the processor to mainframes through blade systems.)

The processor obviously has a bright future in many industries, but to fully unleash its power, you need to keep its unique programming architecture in mind when writing, configuring, and porting applications for it. We've already done some of that (the porting part), and we believe our experiences and the techniques we've learned can help you understand what to consider when porting compute-intensive applications to the Cell/B.E. architecture.

A quick look at the hardware

The Cell/B.E. PPE implements the PowerPC architecture, so Linux® for PowerPC and its existing applications can run on the Cell/B.E. chip without any change. But if you want to use the SPEs' computing power, you need to follow some porting guidelines.

The SPE is a vector (SIMD) processor. Its architecture has the following characteristics:

  • Each SPE has two pipelines and can issue two instructions per cycle. The even pipeline handles arithmetic, and the odd pipeline handles memory operations such as loads and stores.
  • Each SPE has 256KB of private memory, called the local store, which holds both code and data.
  • The SPE uses DMA to move data between system memory and its local store. (Direct memory access is a means of transferring data to and from memory without tying up the processing unit.)
  • The SPE has no hardware branch prediction; it relies on software-supplied branch hints to reduce the cost of branches.

So what's JPEG?

JPEG (Joint Photographic Experts Group) is a popular standard for still-image compression. It is used for photographs and in image-processing products such as printers, browsers, and so on. The JPEG algorithm has two functions: compression of a bitmap (BMP) image into a JPEG image and decompression of a JPEG image back into a bitmap. One of the most popular implementations of the JPEG algorithm comes from the Independent JPEG Group (IJG), and this article describes porting IJG's JPEG compression implementation to the SPE.

Figure 1 illustrates how the JPEG compression algorithm works.


Figure 1. The JPEG compression flow

For more details on the JPEG compression algorithm, please see Resources.

The six key porting considerations

Consider the following six technology issues when attempting to port a compute-intensive application to the Cell/B.E. SPE:

  • Compiler tool chains
  • Workload characteristics
  • Memory
  • DMA transfer issues
  • SPE-PPE communication
  • 1-to-8 SPE performance scaling

The remaining sections of this article look at each issue in detail and include a discussion on performance.

Compiler tool chains

Two tool chains are required to port an application to the Cell/B.E. SPE because the SPE and PPE have different instruction sets. Both are included in the latest SDK release (2.1; see Resources). The IBM XLC compiler (see Resources) is optimized for the Cell/B.E. processor; we used it in our port by setting the SPU_COMPILER environment variable to XLC.

What's in a workload?

When you offload computing work to the SPE, remember that the PPE is a conventional PowerPC core and that the SPE is better at vector computing than at scalar computing. That makes it important to analyze the characteristics of the workload before you divide its modules between the PPE and SPE.

In the JPEG application, DCT, quantization, and color space conversion are all computationally intensive, so we partitioned the JPEG compression algorithm into two modules: one on the PPE and one, containing those stages, on the SPE. Figure 2 shows the details.


Figure 2. Partitioning the JPEG algorithm between the SPE and PPE
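To make one of those SPE-side stages concrete, here is a minimal Python sketch of color space conversion using the standard JFIF RGB-to-YCbCr formulas. It is a portable model of the arithmetic, not SPE code and not the IJG implementation; the function names are hypothetical. The point is that every pixel goes through the same multiply-and-add pattern, which is the kind of work that maps well onto a vector unit.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to 8-bit YCbCr using the JFIF formulas."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128.0
    cr = 0.5 * r - 0.418688 * g - 0.081312 * b + 128.0

    def clamp(value):
        # Round to the nearest integer and keep the result in 0..255.
        return max(0, min(255, int(round(value))))

    return clamp(y), clamp(cb), clamp(cr)


def convert_row(pixels):
    """Apply the same conversion to every (r, g, b) tuple in a row."""
    return [rgb_to_ycbcr(r, g, b) for (r, g, b) in pixels]


row = [(255, 255, 255), (0, 0, 0), (255, 0, 0)]
converted = convert_row(row)
print(converted)  # [(255, 128, 128), (0, 128, 128), (76, 85, 255)]
```

On an SPE, the per-pixel loop in `convert_row` would be replaced by SIMD operations that handle several pixels per instruction.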

Limits on memory

Each SPE can hold only 256KB of code and data. If the code is small enough, keep the whole program in the SPE's local store, because the alternative, overlaying program code onto the SPE in pieces, can degrade performance. The code of the JPEG compression program is small enough to fit in the local store.
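A quick budget calculation shows how the 256KB limit shapes buffer sizes. The Python sketch below is only an illustration: the code size, stack reserve, buffer count, and 128-byte alignment are hypothetical assumptions, not figures from our port.

```python
LOCAL_STORE_BYTES = 256 * 1024  # SPE local store size


def max_block_bytes(code_bytes, stack_bytes, n_buffers, align=128):
    """Largest per-buffer size that fits next to code and stack.

    The alignment default is an assumption; the result is rounded
    down to a multiple of it.
    """
    free = LOCAL_STORE_BYTES - code_bytes - stack_bytes
    if free <= 0 or n_buffers <= 0:
        return 0
    per_buffer = free // n_buffers
    return per_buffer - (per_buffer % align)


# Hypothetical layout: 96KB of code, 16KB of stack, and four data
# buffers (two for input and two for output).
block = max_block_bytes(96 * 1024, 16 * 1024, 4)
print(block)  # 36864
```

If the function returns 0, the program does not fit as planned, and you would need to shrink the code, reduce the number of buffers, or consider overlays.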

We also analyzed the JPEG compression program on an x86 system to learn more about its memory requirements. We divided raw data into blocks and transferred them block by block through DMA. And speaking of DMA . . . .

The role of DMA transfer

DMA plays an important role in the Cell/B.E. memory architecture. The Memory Flow Controller in each SPE serves as the interface from that SPE's local store to main memory or to the local stores of other SPEs.

If data is larger than the local store, you can keep it in main memory and use software-driven DMA operations to transfer it piece by piece. As Figure 2 shows, the SPE initiates a DMA transfer to read blocks of raw BMP data from main memory into the local store. After the SPE finishes working on the data and the compression result is available, it uses another DMA transfer to write the result back to main memory.
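The read, process, and write-back cycle can be modeled in a few lines of Python. This sketch is a simulation under stated assumptions, not SPE code: a `bytearray` stands in for main memory, slicing stands in for DMA, and the helper names are hypothetical. It splits the data into pieces of at most 16KB, which is the largest size a single MFC DMA command can move. For simplicity it writes results back in place, whereas a real compressor writes its output to a separate buffer.

```python
MAX_DMA_BYTES = 16 * 1024  # largest single MFC DMA transfer


def plan_dma_chunks(total_bytes, max_chunk=MAX_DMA_BYTES):
    """Split a transfer into (offset, size) pieces no larger than max_chunk."""
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    chunks = []
    offset = 0
    while offset < total_bytes:
        size = min(max_chunk, total_bytes - offset)
        chunks.append((offset, size))
        offset += size
    return chunks


def process_in_place(main_memory, transform):
    """Model of get -> compute -> put, one chunk at a time."""
    for offset, size in plan_dma_chunks(len(main_memory)):
        local = bytes(main_memory[offset:offset + size])  # "DMA get"
        result = transform(local)                         # work on the SPE
        main_memory[offset:offset + size] = result        # "DMA put"


print(plan_dma_chunks(40000))
# [(0, 16384), (16384, 16384), (32768, 7232)]

image = bytearray(range(256)) * 160  # 40,960 bytes of stand-in pixel data
process_in_place(image, lambda block: bytes(255 - b for b in block))
```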

Chit-chat between SPE and PPE

Each SPE has three mailboxes for communication between the SPE and PPE. Two of them send outbound messages to the PPE, and the third receives inbound messages from the PPE. In addition, an SPE can receive inbound messages from the PPE through two signal notification channels.

In our JPEG port, the PPE works as a controller and passes parameters to the SPE by mailbox when a compression job is launched. The SPE also uses a mailbox to notify the PPE that the current job is complete. No synchronization is required among the SPEs because they are designed to work independently.
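The controller pattern described above can be modeled with two bounded queues and a worker thread. This Python sketch is a simulation only: the queue depths mirror the four-entry inbound and one-entry outbound SPU mailboxes, while the parameter layout (width, then height) and the `DONE` code are hypothetical choices made for illustration.

```python
import queue
import threading

DONE = 0x1  # hypothetical completion code


def spe_worker(inbound, outbound):
    """Model of the SPE side: read parameters, work, report completion."""
    width = inbound.get(timeout=5)
    height = inbound.get(timeout=5)
    _pixels = width * height  # stand-in for the real compression work
    outbound.put(DONE)


def run_job(width, height):
    """Model of the PPE side: launch a job and wait for its completion."""
    inbound = queue.Queue(maxsize=4)   # inbound mailbox holds four entries
    outbound = queue.Queue(maxsize=1)  # outbound mailbox holds one entry
    worker = threading.Thread(target=spe_worker, args=(inbound, outbound))
    worker.start()
    for word in (width, height):
        inbound.put(word & 0xFFFFFFFF)  # mailbox messages are 32-bit values
    status = outbound.get(timeout=5)
    worker.join()
    return status


status = run_job(640, 480)
print(status == DONE)  # True
```

Because a mailbox message is only 32 bits wide, larger parameter sets are usually placed in main memory, with the mailbox carrying just enough information for the SPE to fetch them by DMA.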

From one to eight

Scaling from a single SPE to eight SPEs affects performance. The eight SPEs can work on different stages of one task as a pipeline, or they can work on different tasks in parallel.

Huffman coding

Huffman coding is an entropy encoding algorithm used for lossless data compression. It encodes each source symbol (such as a character in a file) with a variable-length code taken from a table that is derived from the estimated probability of occurrence of each possible symbol value. The method for choosing each representation produces a prefix-free code (sometimes called a "prefix code"): the bit string representing one symbol is never a prefix of the bit string representing any other symbol. The most common symbols get shorter bit strings than the less common ones. For a set of symbols with a uniform probability distribution and a number of members that is a power of two, Huffman coding is equivalent to simple fixed-length binary block encoding (as in ASCII coding). Thank you, David A. Huffman.
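The properties in this sidebar are easy to check with a small Python sketch. It builds a code table from symbol frequencies using the textbook heap-based construction. It is not the JPEG Huffman encoder from the IJG code, which works from predefined or optimized tables, and the function names are hypothetical.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Return {symbol: bit string} for the symbols in text."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker keeps the dicts from ever being compared.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


def is_prefix_free(codes):
    """True if no code word is a prefix of another."""
    words = sorted(codes.values())
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))


skewed = huffman_code("aaaabbc")
print(skewed)                  # {'c': '00', 'b': '01', 'a': '1'}
print(is_prefix_free(skewed))  # True

uniform = huffman_code("abcd")
print(sorted(len(c) for c in uniform.values()))  # [2, 2, 2, 2]
```

The first example gives the most frequent symbol the shortest code. The second shows the uniform, power-of-two case, where every symbol gets a code of the same length.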

In our JPEG compression port, we expected the Cell/B.E. processor to handle many JPEG compression tasks, so we assigned each independent task to its own SPE. What matters here is the combined throughput of all eight SPEs. In an earlier port, we found that the PPE was the bottleneck: total performance suffered when the Huffman encode function ran on the PPE. After we moved the Huffman encode function to each SPE, total performance improved, although the performance of each single SPE decreased somewhat.
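A back-of-the-envelope model shows why this trade-off can pay off. In the Python sketch below, total throughput is limited by whichever is slower: the combined rate of the SPEs or the single serial stage on the PPE. The timings are hypothetical numbers chosen for illustration, not measurements from our port, and the model ignores DMA and scheduling overhead.

```python
def throughput(n_spes, spe_ms, ppe_ms):
    """Images per second when n_spes workers feed one serial PPE stage.

    spe_ms is the SPE time per image; ppe_ms is the PPE time per image.
    """
    spe_rate = n_spes * 1000.0 / spe_ms
    if ppe_ms <= 0:
        return spe_rate
    return min(spe_rate, 1000.0 / ppe_ms)


# Hypothetical timings: Huffman encoding on the PPE costs 4 ms per image.
huffman_on_ppe = throughput(8, spe_ms=8.0, ppe_ms=4.0)

# Moving it to the SPEs makes each SPE slower (12 ms per image) but
# leaves the PPE with only 0.5 ms of control work per image.
huffman_on_spe = throughput(8, spe_ms=12.0, ppe_ms=0.5)

print(huffman_on_ppe)                   # 250.0
print(huffman_on_spe > huffman_on_ppe)  # True
```

In this model, eight SPEs are held to the rate of the one PPE stage until that stage is made cheap, which is the effect described above.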

Some last-minute performance considerations

To improve the performance of JPEG compression on an SPE, we used several optimization methods. We used double buffering to transfer BMP data and hide the latency of data transfer. We used many intrinsic functions to vectorize the fDCT and color space conversion modules. In addition, we analyzed the context of some frequently mispredicted branches and then used static branch prediction to reduce their miss rates. The article "Maximizing the power of the Cell Broadband Engine processor" (see Resources) describes more optimization methods.
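The double-buffering idea can be sketched structurally in Python. This is a model of the control flow only: Python runs the fetch synchronously, whereas on an SPE the DMA for the next block proceeds in the background while the current block is being computed. The `fetch` and `compute` callbacks are hypothetical stand-ins for the DMA read and the compression work.

```python
def process_double_buffered(blocks, fetch, compute):
    """Process blocks with two buffers, issuing fetch i+1 before compute i."""
    results = []
    if not blocks:
        return results
    buffers = [None, None]
    buffers[0] = fetch(blocks[0])  # prime the first buffer
    for i in range(len(blocks)):
        cur = i % 2
        nxt = 1 - cur
        if i + 1 < len(blocks):
            # On an SPE this transfer would overlap with the compute below.
            buffers[nxt] = fetch(blocks[i + 1])
        results.append(compute(buffers[cur]))
    return results


log = []


def fetch(block):
    log.append(("fetch", block))
    return block


def compute(data):
    log.append(("compute", data))
    return data * 2


out = process_double_buffered([1, 2, 3], fetch, compute)
print(out)  # [2, 4, 6]
print(log)
# [('fetch', 1), ('fetch', 2), ('compute', 1),
#  ('fetch', 3), ('compute', 2), ('compute', 3)]
```

The log shows the key property: the request for block i+1 is always issued before the computation on block i begins, so transfer time can hide behind compute time.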

In conclusion

When porting an application to a Cell/B.E. SPE, developers need to do two things to take advantage of what the processor has to offer:

  • Make some changes to original source code.
  • Experiment with various specific optimization methods to fine tune for better performance.

We hope that by offering our experiences, we can jump start your porting process.


Resources

Learn

Get products and technologies

Discuss
