The Cmpware Configurable Multiprocessor Development Kit (CMP-DK) for the Cell/B.E. processor is a multicore simulation and software development environment that provides fast and efficient simulation of the Cell/B.E. architecture combined with an interactive, display-rich environment that permits large amounts of information to be displayed in a fast, simple, intuitive format. These capabilities are essential to analyzing the complex behavior of multicore systems.
One component of the CMP-DK is a scheduling tool for the Cell/B.E.'s Synergistic Processing Units (SPUs), of which there are eight. The SPUs, powerful DSP-like devices that contain four floating-point arithmetic and logic units (ALUs) operating in SIMD fashion, do the bulk of the computation in most Cell/B.E. applications. Because of their DSP-like architecture, the order of the instructions the SPUs execute can have a significant effect on performance. It is not unusual to see a more-than-fivefold performance improvement from hand-tuning an SPU program. So, to achieve maximum performance from a Cell/B.E. application, you need an understanding of SPU scheduling.
That's where the Cmpware CMP-DK comes in. The SPU Scheduling Tool is part of the CMP-DK software development suite, which is available as a free, standard Eclipse plug-in. The tool permits fast and easy analysis of SPU code in a highly intuitive, interactive, graphical way. You install it the way you would any Eclipse plug-in (see sidebar).
Perhaps the most important issue in improving SPU performance through instruction scheduling involves data dependencies. Because of the pipelining of the SPU, results from one operation might not be available for several cycles after an instruction executes. Instructions attempting to use these results can stall, often for 6 to 13 cycles. This can have a dramatic impact on performance.
One problem with attempting to write code for the SPU is that the number of cycles an instruction can stall in such a data-dependent situation depends on which instructions are being executed. Some instructions produce no stalling at all, while some stall for two cycles, and others can stall for six cycles (and so on).
Another factor that can affect the stall rate is the register usage. Keeping track of registers across as many as a dozen instructions can be challenging, even when writing small programs. How to rearrange code or even how to implement a particular algorithm can depend heavily on the particular mix of instructions and their ordering.
In addition to instruction scheduling for data dependencies, the SPU has a second performance-enhancing architectural feature that can double performance in some cases. Of course, this comes with a tradeoff: it can also further complicate instruction scheduling. The SPU has two instruction pipelines that run in parallel, permitting two instructions to be executed in each clock cycle. Unfortunately, these two pipelines are not identical. One pipeline handles (in general) memory access functions while the second pipeline handles arithmetic and logical operations.
On each clock cycle, two instructions can be executed if they are the correct types of instruction and if they are aligned at the correct addresses. Keeping track of which instructions go in which pipeline and which instruction is at a given address can be daunting. Dual-issue scheduling is also particularly sensitive to small changes: the addition or removal of a single instruction can, in some cases, cause a twofold change in performance. Combining data dependency scheduling with dual-issue scheduling can be extremely difficult, even for experienced programmers.
The Cmpware SPU Scheduling Tool addresses the problems associated with scheduling SPU code in several ways.
The scheduling process is split into two separate tasks. The data dependency stalls are handled separately from the dual pipeline issue scheduling. Because these two activities are interdependent, the SPU Scheduling Tool features rapid display of data for both scheduling modes. The click of a button toggles the display from one optimization phase to the other, permitting fast, interactive experimentation with the different scheduling approaches.
The displays themselves also provide scheduling information in simple and easy-to-use formats. The interface indicates not only exactly where performance bottlenecks occur but also how severe each bottleneck is.
Figure 1 shows the only three controls in the SPU Scheduling Tool interface: three buttons at the top of the main window that are used to load data files and to display scheduling information.
Figure 1. The three controls
Once an SPU binary file (in either ELF or raw binary format) is loaded into the viewer using the File Load button, the disassembly for this SPU code can be displayed in the main window. Clicking either the Pipeline Instruction Issue View button or the Data Dependency View button colors each instruction in the disassembly according to its schedule properties. In general, the coloring goes from green (no stall) to yellow (stalling 1-2 cycles) to orange (stalling 3-4 cycles) to red (stalling 5+ cycles).
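The color bands described above amount to a small lookup. The following Python sketch is only an illustration of that mapping; the function name stall_color is hypothetical and is not part of the CMP-DK.

```python
def stall_color(stall_cycles: int) -> str:
    """Map a stall count to the color bands described in the article.

    green: no stall, yellow: 1-2 cycles, orange: 3-4 cycles, red: 5 or more.
    The function name is hypothetical; it is not a CMP-DK API.
    """
    if stall_cycles < 0:
        raise ValueError("stall_cycles must be non-negative")
    if stall_cycles == 0:
        return "green"
    if stall_cycles <= 2:
        return "yellow"
    if stall_cycles <= 4:
        return "orange"
    return "red"


if __name__ == "__main__":
    # Print the band for a few representative stall counts.
    for cycles in (0, 1, 2, 3, 4, 5, 13):
        print(cycles, stall_color(cycles))
```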
The goal in optimizing SPU software is to achieve a schedule with all instructions registering green. Depending on your application, it might not be necessary to achieve maximum performance. In other cases, some code might not be in critical loops, in which case it might not require optimization. CMP-DK enables you to profile so you can easily see which sections of SPU code are most often used (and therefore which portion of the SPU code can benefit most from schedule optimization).
In the Data Dependency View (Figure 2), SPU instructions are colored according to the type of instruction and by the registers each instruction uses.
Figure 2. The SPU pipeline data dependency view
If a source register that an instruction uses is written with a result from a previous instruction, that register might not yet be ready to be used. The Data Dependency View colors an instruction to indicate how many cycles the instruction needs to stall before all source data from input registers is available.
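To make the idea concrete, here is a minimal Python sketch of how per-instruction stall counts could be estimated. It assumes a much simpler machine than the real SPU: one instruction issues per cycle, in order, and each instruction has a single fixed latency. The latency values, register names, and the helper data_stalls are illustrative assumptions, not figures from this article or the tool's actual algorithm.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    name: str                 # mnemonic, for display only
    dest: Optional[str]       # register written, or None (for example, a store)
    srcs: Tuple[str, ...]     # registers read
    latency: int              # cycles from issue until dest is readable (assumed)


def data_stalls(program: List[Instr]) -> List[int]:
    """Return the number of cycles each instruction waits for its sources.

    Simplified model: in-order, single issue, no dual-issue effects.
    """
    ready_at: Dict[str, int] = {}   # register -> first cycle it can be read
    cycle = 0                       # cycle at which the next instruction could issue
    stalls: List[int] = []
    for ins in program:
        earliest = max((ready_at.get(r, 0) for r in ins.srcs), default=0)
        stall = max(0, earliest - cycle)
        stalls.append(stall)
        issue = cycle + stall
        if ins.dest is not None:
            ready_at[ins.dest] = issue + ins.latency
        cycle = issue + 1
    return stalls


if __name__ == "__main__":
    # Illustrative latencies only: a load feeding an add feeding a store.
    program = [
        Instr("lqd", "r1", ("r10",), 6),
        Instr("fa", "r3", ("r1", "r2"), 6),
        Instr("stqd", None, ("r3", "r11"), 6),
    ]
    for ins, stall in zip(program, data_stalls(program)):
        print(f"{ins.name:5s} stalls {stall} cycle(s)")
```

In this model an instruction that immediately consumes a result with latency N waits N-1 cycles, which is why back-to-back dependent instructions show up as the most severe stalls.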
Improving SPU performance involves rearranging instructions to reduce the number of cycles stalled. Typically, between an instruction that produces a result and the instruction that consumes it, you can insert instructions that do not use the registers involved. This enables the processor to perform useful work while waiting for data for other instructions to become available. Various techniques to perform this sort of optimization exist, but they are beyond the scope of this article. Making more liberal use of registers and not reusing registers can greatly improve performance.
Loads from memory are often a source of stalls, so loading a group of registers, then using those registers later in the code is one technique used to reduce stalls. But many times simple trial and error, and even changes to the underlying algorithm, can produce large increases in performance.
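The effect of grouping loads can be sketched with a toy model. The Python example below compares two orderings of the same work under the assumption of an in-order, single-issue machine with fixed latencies; the latency constants, register names, and the helper total_stalls are hypothetical and chosen only for illustration.

```python
def total_stalls(program):
    """Sum the stall cycles for a list of (dest, srcs, latency) tuples.

    Simplified model: in-order, one instruction issued per cycle,
    a result becomes readable `latency` cycles after its instruction issues.
    """
    ready_at = {}   # register -> first cycle it can be read
    cycle = 0
    total = 0
    for dest, srcs, latency in program:
        earliest = max((ready_at.get(r, 0) for r in srcs), default=0)
        stall = max(0, earliest - cycle)
        total += stall
        if dest is not None:
            ready_at[dest] = cycle + stall + latency
        cycle += stall + 1
    return total


LOAD, ADD = 6, 2   # assumed latencies, for illustration only

# Each load is used immediately by the next instruction.
interleaved = [
    ("r1", ("r10",), LOAD), ("r5", ("r1", "r9"), ADD),
    ("r2", ("r10",), LOAD), ("r6", ("r2", "r9"), ADD),
    ("r3", ("r10",), LOAD), ("r7", ("r3", "r9"), ADD),
]

# Same work, but all loads are issued first into separate registers.
grouped = [
    ("r1", ("r10",), LOAD), ("r2", ("r10",), LOAD), ("r3", ("r10",), LOAD),
    ("r5", ("r1", "r9"), ADD), ("r6", ("r2", "r9"), ADD), ("r7", ("r3", "r9"), ADD),
]

if __name__ == "__main__":
    print("interleaved stalls:", total_stalls(interleaved))
    print("grouped stalls:    ", total_stalls(grouped))
```

Note that the grouped version relies on each load having its own destination register, which is the "more liberal use of registers" the text recommends.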
Once data dependency stalls are improved, select the Pipeline Instruction Issue View. As shown in Figure 3, this view indicates the alignment of instructions and the ability to use the dual-issue pipeline.
Figure 3. The SPU dual-issue pipeline view
A green-colored instruction indicates a correctly aligned instruction. A yellow-colored instruction indicates an incorrectly aligned instruction. The goal is to correctly align all instructions, producing a completely green view. This permits two instructions to be executed per cycle.
As with the data dependency scheduling, the amount of optimization required is specific to the application. If the code is not in a key loop or performance-critical section, it might be acceptable to ignore a relatively bad schedule in order to concentrate efforts on parts of the code that have a greater impact on overall performance.
Note that SPU dual-issue instruction scheduling is very sensitive to instruction ordering, and it might be somewhat non-intuitive at first. For example, a section of code with all yellow (unaligned) instructions can require the shift of only a single instruction at the top of the block, often through the insertion of a noop (no-operation) instruction, to completely align the entire block, changing all of the yellow instructions to green. So while the worst possible performance comes from an all-yellow region, it is also the easiest to fix. In general, a striped region (alternating yellow and green instructions) requires the most work to optimize.
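A toy model shows why one padding instruction can fix an entire block. The Python sketch below ignores data dependencies and assumes that a pair dual-issues only when an arithmetic-pipeline instruction sits at an 8-byte-aligned address and a memory-pipeline instruction immediately follows it. That pairing rule, the 'A'/'M' labels, the helper issue_cycles, and the choice to model the padding no-op as an arithmetic-pipeline instruction are all assumptions made for illustration.

```python
def issue_cycles(pipes):
    """Count issue cycles for a block that starts at an 8-byte-aligned address.

    pipes: sequence of 'A' (arithmetic pipeline) or 'M' (memory pipeline),
    one entry per 4-byte instruction, so even indexes are 8-byte aligned.
    Assumed rule: an aligned 'A' followed by an 'M' issues in one cycle;
    every other instruction issues alone.
    """
    cycles = 0
    i = 0
    while i < len(pipes):
        aligned = (i % 2 == 0)
        if aligned and i + 1 < len(pipes) and pipes[i] == "A" and pipes[i + 1] == "M":
            i += 2      # both instructions issue together
        else:
            i += 1      # single issue
        cycles += 1
    return cycles


# Every pair is in the wrong order for its address: nothing dual-issues.
block = ["M", "A"] * 4

# One padding no-op (modeled as an 'A' instruction) at the top shifts
# every following instruction by one slot.
padded = ["A"] + block

if __name__ == "__main__":
    print("original block:", len(block), "instructions,", issue_cycles(block), "cycles")
    print("padded block:  ", len(padded), "instructions,", issue_cycles(padded), "cycles")
```

Under these assumptions the padded block contains one more instruction but issues in fewer cycles, because every subsequent pair now lands on the alignment the pairing rule requires.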
Optimizing data: Never in a vacuum
Finally, note that optimizing data dependency stalls and optimizing dual-issue pipeline usage are not completely independent of each other. Some small amount of iteration can be useful to achieve the highest performance. Check both views to be sure that changes to optimize one mode do not cause setbacks in the other mode.
SPU instruction scheduling can provide dramatic increases in SPU performance. However, instruction scheduling might not be a familiar technique for many programmers. Most desktop and server processors do not expose this level of pipeline detail to the programmer, although it is common in DSP processors. The Cmpware SPU Scheduling Tool attempts to span the range of programmer experience levels by simplifying the complex task of scheduling the SPU and by making the techniques to optimize performance available to less experienced programmers while still providing a powerful tool for those with more experience.
While the Cmpware SPU Scheduling Tool is described in terms of assembly language coding, it can also be used to view and examine the output of compilers and other high-level tools. Many of these tools automatically provide schedule optimizations of various types, but frequently they only optimize based on the source code supplied. It is often possible to make changes to high-level code implementation and to algorithms that can provide dramatically improved SPU performance.
Learn
- Read the Cmpware CMP-DK for the Cell B.E. tutorial (PDF) to find out more about CMP-DK, how to use the SPU Scheduling Tool, or how to install the Eclipse plug-in. The tutorial includes sample applications that deal with shared memory and mailboxes, the dependencies and pipeline optimizations outlined in this article, and how to do a system-level analysis of performance.
- Refer to the Eclipse Web site for more detailed information about Eclipse, Eclipse plug-ins, and using update sites to install Eclipse plug-ins.
- Review an oldie-but-goodie (thanks to Daniel Brokenshire), "Maximizing the power of the Cell Broadband Engine processor: 25 tips to optimal application performance" (developerWorks, June 2006) to see the platform's architectural characteristics so you can milk near-theoretical performance from it.
- Peruse the rest of the articles in this unofficial "partners" series:
  - "Core partners, Part 1: Build high-performance apps for multicore processors" (developerWorks, May 2007) about the RapidMind Development Platform, which provides a simple single-source mechanism to develop portable high-performance applications for multicore processors.
  - "Core partners, Part 2: Using DDT to clean up Cell/B.E. app bugs" (developerWorks, February 2008), which describes how to use Allinea Software's Distributed Debugging Tool (DDT) to debug complete Cell/B.E. applications, including multiple threads within a single Cell/B.E. processor and among clusters of Cell/B.E. processors.
  - "Core partners, Part 3: Transforming Gedae-built portable apps" (developerWorks, April 2008), which examines the portability of applications developed in Gedae by analyzing the work required to move an example application from a simulation on a PC to actually running on a DSP board (the Mercury Computer System AdapDev system) to running on a multicore Cell/B.E. system.
  - "Core partners, Part 4: Managing the PlayStation 3 Wi-Fi network" (developerWorks, June 2008), in which Terra Soft Solutions IT Manager Aaron Johnson shows you how to configure and encrypt the built-in PS3 Wi-Fi network and how to switch from a wireless network back to a wired network on the PS3.
- Learn more about Cell/B.E. programming from the developerWorks series:
  - "Programming high-performance applications on the Cell/B.E. processor"
  - "PS3 fab-to-lab"
  - "The little broadband engine that could"
- Refer to the Cell Broadband Engine documentation section of the IBM Semiconductor Solutions Technical Library for a wealth of downloadable manuals, specifications, and more.
Get products and technologies
- Get a demo version of the CMP-DK, read some of the documentation in the library, or see what services Cmpware offers to help you optimize your Cell/B.E. applications.
- Get your copy of the IBM SDK for Multicore Acceleration 3.0 or browse through the library of Cell/B.E. documentation.
- Find all Cell/B.E.-related articles, discussion forums, downloads, and more at the IBM developerWorks Cell Broadband Engine resource center: your definitive resource for all things Cell/B.E.
- Contact IBM about custom Cell/B.E.-based or custom-processor based solutions.
Discuss
- Check out the Cell Broadband Engine Architecture forum to get your technical questions about the processor answered. Juicy problems and answers from the forums are rounded up periodically and highlighted in the "Forum watch" blog series.
- Go to the Cell Broadband Engine/Power Architecture blog for news, downloads, instructional resources, and event notifications for Cell/B.E. and other Power Architecture-related technologies. You can find the popular "Forum watch" blog series (Q&A roundup), the "FixIt" technology updates, and the Infobomb quick-read technology introductions.

Steven A. Guccione received a BSEE from Boston University, an MSEE from the University of Minnesota, and a PhD from the University of Texas at Austin. Dr. Guccione is the author of approximately 40 papers and 20 patents, primarily in the areas of high performance and reconfigurable computing. Dr. Guccione is the Chief Scientist and founder of Cmpware Inc., which makes tools for programming multicore architectures.