
Core partners, Part 5: Increasing SPU performance with instruction scheduling

A good scheduler keeps data dependencies from killing processor performance

Steven Guccione (Steven.Guccione@cmpware.com), Chief Scientist, Cmpware Inc.
Steven A. Guccione received a BSEE from Boston University, an MSEE from the University of Minnesota, and a PhD from the University of Texas at Austin. Dr. Guccione is the author of approximately 40 papers and 20 patents, primarily in the areas of high performance and reconfigurable computing. Dr. Guccione is the Chief Scientist and founder of Cmpware Inc., which makes tools for programming multicore architectures.

Summary:  The collection of processors in a Cell Broadband Engine™ (Cell/B.E.) processor displays a DSP-like architecture. This means that the order in which the SPUs execute the instructions can have a significant effect on performance. Without a good scheduling mechanism in place, data dependencies can stall processor performance. In this article, learn from a Cmpware expert how and why to use the Cmpware CMP-DK Cell/B.E. SPU Scheduling Tool, which permits fast and easy analysis of SPU code in an intuitive, graphical format.


Date:  19 Aug 2008
Level:  Introductory

Introduction

The Cmpware Configurable Multiprocessor Development Kit (CMP-DK) for the Cell/B.E. processor is a multicore simulation and software development environment. It combines fast, efficient simulation of the Cell/B.E. architecture with an interactive, display-rich environment that presents large amounts of information in a simple, intuitive format. These capabilities are essential to analyzing the complex behavior of multicore systems.

One component of the CMP-DK is a scheduling tool for the Cell/B.E.'s Synergistic Processing Units (SPUs), of which there are eight. The SPUs are powerful DSP-like devices, each containing four floating-point arithmetic and logic units (ALUs) that operate in SIMD fashion, and they do the bulk of the computation in most Cell/B.E. applications. Because of their DSP-like architecture, the order of the instructions the SPUs execute can have a significant effect on performance. It is not unusual to see a more-than-fivefold performance improvement from hand-tuning an SPU program. So, to achieve maximum performance from a Cell/B.E. application, you need an understanding of SPU scheduling.

That's where the Cmpware CMP-DK comes in. The Cmpware CMP-DK contains an SPU scheduling tool as a part of the overall software development suite.

Installing the plug-in

The Cmpware SPU Scheduling Tool is a standard Eclipse plug-in. The Eclipse update site URL for the Cmpware SPU Scheduling Tool is http://www.cmpware.com/sputool/. You can install the plug-in using the same procedure you would use for installing any Eclipse plug-in. From the main Eclipse menu, select Help > Software Updates > Find and Install, and follow the prompts to install the Cmpware SPU Scheduling Tool.

You can find more information about installing Eclipse plug-ins in the Resources section.

Cmpware has made the SPU Scheduling Tool part of the CMP-DK, which is available as a free, standard Eclipse plug-in. The tool permits fast and easy analysis of SPU code in a highly intuitive, interactive, graphical way. You install it the way you would any Eclipse plug-in (see "Installing the plug-in").

Scheduling the SPU

Perhaps the most important issue in improving SPU performance through instruction scheduling involves data dependencies. Because of the pipelining of the SPU, results from one operation might not be available for several cycles after an instruction executes. Instructions attempting to use these results can stall, often for 6 to 13 cycles. This can have a dramatic impact on performance.

One problem with attempting to write code for the SPU is that the number of cycles an instruction can stall in such a data-dependent situation depends on which instructions are being executed. Some instructions produce no stalling at all, while some stall for two cycles, and others can stall for six cycles (and so on).
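The way stalls depend on the instruction mix can be illustrated with a small model. The following is a minimal sketch, not the tool's algorithm: it assumes a single-issue, in-order pipeline, and the latency table is made up for illustration rather than taken from SPU documentation. The names `ASSUMED_LATENCY` and `count_stalls` are hypothetical.

```python
# Illustrative model only. These latencies are assumptions chosen for the
# example; they are not values from the SPU documentation.
ASSUMED_LATENCY = {"add": 2, "load": 6, "fmul": 6}


def count_stalls(program):
    """Return the stall cycles of each instruction in a single-issue model.

    program is a list of (opcode, dest_register, source_registers) tuples
    that are issued strictly in order.
    """
    ready_at = {}  # register -> first cycle at which its value can be read
    cycle = 0      # cycle at which the next instruction would like to issue
    stalls = []
    for opcode, dest, sources in program:
        # An instruction cannot issue before all of its sources are ready.
        earliest = max([ready_at.get(reg, 0) for reg in sources] + [cycle])
        stalls.append(earliest - cycle)
        cycle = earliest + 1
        ready_at[dest] = earliest + ASSUMED_LATENCY[opcode]
    return stalls


example = [
    ("load", "r1", ["r10"]),       # result is not ready for several cycles
    ("add", "r2", ["r1", "r1"]),   # depends on the load: long stall
    ("add", "r3", ["r2", "r1"]),   # depends on the add: short stall
    ("add", "r4", ["r11", "r12"]), # independent: no stall
]
print(count_stalls(example))
```

Under these assumed latencies, the instruction that consumes a load result stalls far longer than the one that consumes an add result, and the independent instruction does not stall at all.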

Another factor that can affect the stall rate is the register usage. Keeping track of registers across as many as a dozen instructions can be challenging, even when writing small programs. How to rearrange code or even how to implement a particular algorithm can depend heavily on the particular mix of instructions and their ordering.

In addition to instruction scheduling for data dependencies, the SPU has a second performance-enhancing architectural feature that can double performance in some cases. Of course, this comes with a tradeoff: it can also further complicate instruction scheduling. The SPU ALU has two instruction pipelines that run in parallel, permitting two instructions to be executed in each clock cycle. Unfortunately, these two pipelines are not identical. One pipeline handles (in general) memory access functions while the second pipeline handles arithmetic and logical operations.

On each clock cycle, two instructions can be executed if they are the correct types of instruction and if they are aligned at the correct addresses. Keeping track of which instructions go in which pipeline and which instruction is at a given address can be daunting. Dual-issue scheduling is also particularly sensitive to small changes: the addition or removal of a single instruction can, in some cases, cause a twofold change in performance. Combining these two scheduling activities can be extremely difficult, even for experienced programmers.
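The sensitivity to a single added instruction can be seen in a toy model of dual issue. This sketch ignores data stalls and assumes a simplified pairing rule: instructions are 4 bytes long, and an arithmetic instruction at an 8-byte-aligned address issues together with a memory instruction that immediately follows it. The rule, the `"arith"` and `"mem"` labels, and the function name `issue_cycles` are assumptions made for illustration.

```python
def issue_cycles(kinds, start_address=0):
    """Count issue cycles for a straight-line block, ignoring data stalls.

    kinds holds "arith" or "mem", one entry per 4-byte instruction.
    Assumed rule: an "arith" instruction at an 8-byte-aligned address
    issues in the same cycle as a "mem" instruction directly after it.
    """
    cycles = 0
    i = 0
    while i < len(kinds):
        address = start_address + 4 * i
        pairs = (
            address % 8 == 0
            and kinds[i] == "arith"
            and i + 1 < len(kinds)
            and kinds[i + 1] == "mem"
        )
        i += 2 if pairs else 1  # a pair consumes two instructions per cycle
        cycles += 1
    return cycles


block = ["arith", "mem"] * 4
print(issue_cycles(block))              # every pair dual-issues
print(issue_cycles(["arith"] + block))  # one extra instruction shifts them all
```

In this model the eight-instruction block issues in 4 cycles, while the same block preceded by one extra arithmetic instruction takes 9 cycles, because every later instruction lands at the wrong alignment for pairing.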


Exploring the Scheduling Tool

The Cmpware SPU Scheduling Tool addresses the problems associated with scheduling SPU code in several ways.

The scheduling process is split into two separate tasks: data dependency stalls are handled separately from dual-pipeline issue scheduling. Because these two activities are interdependent, the SPU Scheduling Tool features rapid display of data for both scheduling modes. The click of a button toggles the display from one optimization phase to the other, permitting fast, interactive experimentation with the different scheduling approaches.

The displays themselves also provide scheduling information in simple and easy-to-use formats. The interface indicates not only exactly where performance bottlenecks occur but also how severe they are.

Figure 1 shows the SPU Scheduling Tool interface's only three controls: three buttons on the top of the main window that are used to load data files and to display scheduling information.


Figure 1. The three controls

Once an SPU binary file (in either ELF or raw binary format) is loaded into the viewer using the File Load button, the disassembly for this SPU code can be displayed in the main window. Clicking either the Pipeline Instruction Issue View button or the Data Dependency View button colors each instruction in the disassembly according to its schedule properties. In general, the coloring goes from green (no stall) to yellow (stalling 1-2 cycles) to orange (stalling 3-4 cycles) to red (stalling 5+ cycles).
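The color bands described above amount to a simple threshold mapping. The sketch below restates those bands in code; the function name `stall_color` is hypothetical, and how the tool itself computes colors is not documented here.

```python
def stall_color(stall_cycles):
    """Map a stall count to the color bands described in the text."""
    if stall_cycles < 0:
        raise ValueError("stall_cycles must be non-negative")
    if stall_cycles == 0:
        return "green"   # no stall
    if stall_cycles <= 2:
        return "yellow"  # stalling 1-2 cycles
    if stall_cycles <= 4:
        return "orange"  # stalling 3-4 cycles
    return "red"         # stalling 5 or more cycles


for cycles in (0, 2, 3, 6):
    print(cycles, stall_color(cycles))
```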

The goal in optimizing SPU software is to achieve a schedule with all instructions registering green. Depending on your application, it might not be necessary to achieve maximum performance. In other cases, some code might not be in critical loops, in which case it might not require optimization. CMP-DK enables you to profile so you can easily see which sections of SPU code are most often used (and therefore which portion of the SPU code can benefit most from schedule optimization).

In the Data Dependency View (Figure 2), SPU instructions are colored according to the type of instruction and by the registers each instruction uses.


Figure 2. The SPU pipeline data dependency view

If a source register that an instruction uses is written with a result from a previous instruction, that register might not yet be ready to be used. The Data Dependency View colors an instruction to indicate how many cycles the instruction needs to stall before all source data from input registers is available.

Improving SPU performance involves rearranging instructions to reduce the number of cycles stalled. Typically, you can place instructions that do not use the pending result between the instruction that produces a value and the instruction that consumes it. This enables the processor to perform useful work while waiting for data for other instructions to become available. Making more liberal use of registers and not reusing registers can also greatly improve performance. Various other techniques to perform this sort of optimization exist, but they are beyond the scope of this article.

Loads from memory are often a source of stalls, so loading a group of registers, then using those registers later in the code is one technique used to reduce stalls. But many times simple trial and error, and even changes to the underlying algorithm, can produce large increases in performance.
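The effect of grouping loads ahead of their uses can be sketched with a toy comparison of two orderings of the same four instructions. This is an illustrative single-issue model with assumed latencies, not measured SPU behavior; `ASSUMED_LATENCY` and `total_stalls` are hypothetical names.

```python
# Illustrative latencies only; these are assumptions, not SPU data.
ASSUMED_LATENCY = {"load": 6, "add": 2}


def total_stalls(program):
    """Sum the stall cycles of (opcode, dest, sources) tuples issued in order."""
    ready_at, cycle, total = {}, 0, 0
    for opcode, dest, sources in program:
        # Wait until every source register has been produced.
        earliest = max([ready_at.get(reg, 0) for reg in sources] + [cycle])
        total += earliest - cycle
        cycle = earliest + 1
        ready_at[dest] = earliest + ASSUMED_LATENCY[opcode]
    return total


# Each load is used by the very next instruction.
interleaved = [
    ("load", "r1", ["r10"]),
    ("add", "r3", ["r1", "r9"]),
    ("load", "r2", ["r11"]),
    ("add", "r4", ["r2", "r9"]),
]
# Same instructions, with both loads issued before either result is used.
grouped = [interleaved[0], interleaved[2], interleaved[1], interleaved[3]]

print(total_stalls(interleaved), total_stalls(grouped))
```

With these assumed latencies, the interleaved order accumulates 10 stall cycles and the grouped order 4, because the second load's latency overlaps with the wait for the first.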

Once data dependency stalls are reduced, select the Pipeline Instruction Issue View. As shown in Figure 3, this view indicates the alignment of instructions and the ability to use the dual-issue pipeline.


Figure 3. The SPU dual-issue pipeline view

A green-colored instruction indicates a correctly aligned instruction. A yellow-colored instruction indicates an incorrectly aligned instruction. The goal is to correctly align all instructions, producing a completely green view. This permits two instructions to be executed per cycle.

As with the data dependency scheduling, the amount of optimization required is specific to the application. If the code is not in a key loop or performance-critical section, it might be acceptable to ignore a relatively bad schedule in order to concentrate efforts on parts of the code that have a greater impact on overall performance.

Note that SPU dual-issue instruction scheduling is very sensitive to instruction ordering, and it might be somewhat non-intuitive at first. For example, a section of code with all yellow (unaligned) instructions can require shifting the block by only a single instruction position, often by inserting a no-op (no-operation) instruction at the top of the block, to align the entire block and change all of the yellow instructions to green. So while the worst possible performance comes from an all-yellow region, it is also the easiest to fix. In general, a striped region (alternating yellow and green instructions) requires the most work to optimize.
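The no-op fix for an all-yellow block can be sketched with a toy coloring function. It assumes 4-byte instructions and a simplified pairing rule (an arithmetic instruction at an 8-byte-aligned address pairs with a memory instruction directly after it), and it treats the inserted `"nop"` as an instruction that never pairs. The rule, the labels, and the name `issue_colors` are assumptions for illustration, not the tool's actual logic.

```python
def issue_colors(kinds, start_address):
    """Label each 4-byte instruction "green" (dual-issued) or "yellow".

    Assumed rule: an "arith" instruction at an 8-byte-aligned address
    pairs with a "mem" instruction immediately after it.
    """
    colors = ["yellow"] * len(kinds)
    i = 0
    while i < len(kinds):
        aligned = (start_address + 4 * i) % 8 == 0
        # The slice is empty at the end of the list, so no index error occurs.
        if aligned and kinds[i] == "arith" and kinds[i + 1 : i + 2] == ["mem"]:
            colors[i] = colors[i + 1] = "green"
            i += 2
        else:
            i += 1
    return colors


block = ["arith", "mem"] * 3
print(issue_colors(block, start_address=4))            # misaligned block
print(issue_colors(["nop"] + block, start_address=4))  # shifted by one no-op
```

In this model the block starting at address 4 is entirely yellow, and prepending a single no-op turns all six of its instructions green; only the no-op itself stays yellow.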


Optimizing data: Never in a vacuum

Finally, note that the processes of optimizing data dependency stalls and dual-issue pipeline optimizations are not completely independent of each other. Some small amount of iteration can be useful to achieve the highest performance. Check both views to be sure that changes to optimize one mode do not cause setbacks in the other mode.


Conclusion

SPU instruction scheduling can provide dramatic increases in SPU performance. However, instruction scheduling might not be a familiar technique for many programmers. Most desktop and server processors do not expose pipeline issues at this level to the programmer, although doing so is common in DSP processors. The Cmpware SPU Scheduling Tool attempts to serve the full range of programmer experience levels: it simplifies the complex task of scheduling the SPU and makes performance optimization techniques available to less experienced programmers, while still providing a powerful tool for those with more experience.

While the Cmpware SPU Scheduling Tool is described in terms of assembly language coding, it can also be used to view and examine the output of compilers and other high-level tools. Many of these tools automatically provide schedule optimizations of various types, but frequently they only optimize based on the source code supplied. It is often possible to make changes to high-level code implementation and to algorithms that can provide dramatically improved SPU performance.


Resources

Learn

Get products and technologies

Discuss
