Skip to main content

By clicking Submit, you agree to the developerWorks terms of use.

The first time you sign into developerWorks, a profile is created for you. Select information in your developerWorks profile is displayed to the public, but you may edit the information at any time. Your first name, last name (unless you choose to hide them), and display name will accompany the content that you post.

All information submitted is secure.

  • Close [x]

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerworks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

By clicking Submit, you agree to the developerWorks terms of use.

All information submitted is secure.

  • Close [x]

An introduction to compiling for the Cell Broadband Engine architecture, Part 4: Partitioning large tasks

Divide and conquer: Parallelizing partitioned jobs

Power Architecture editors, developerWorks, IBM
The developerWorks Power Architecture editors welcome your comments on this article. E-mail them at dwpower@us.ibm.com.

Summary:  This tutorial, fourth and penultimate in the "An introduction to compiling for the Cell Broadband Engine™ architecture" series, discusses ways to partition code to run across the multiple cores available in a Cell Broadband Engine™ (Cell BE) processor. It gives particular attention to efficient partitioning of code to allow larger programs or data sets to be manipulated using the 256KB of local store available on the Synergistic Processor Elements (SPEs).

View more content in this series

Date:  07 Feb 2006
Level:  Intermediate PDF:  A4 and Letter (114 KB | 18 pages)Get Adobe® Reader®

Activity:  6181 views
Comments:  

Partitioning using OpenMP

Overview

A single-source program contains C or Fortran with user directives or pragmas hinting at partitioning. The compiler outlines all code within the pragmas into separate functions compiled for the SPE, then replaces the outlined code with a call to the parallel runtime, and compiles this wrapper code for the PPE. The master thread continues to execute on the PPE and participate in workshare constructs.

The PPE runtime places outlined functions on a work queue containing information about the scope of a job (such as number of iterations to execute, or the size of data set to work on in each iteration). The runtime has up to eight SPE threads to pull work items (outlined parallel functions) from the queue and execute them on SPEs. The PPE can then wait for an SPE task to complete, or continue executing code while the SPEs work.

The outlining process

Outlining happens early in pass one of TPO, right after intra-procedural optimization. It does not depend on knowledge of the target processor or processors. Instead, it creates an extra node in the call graph. There must be no jumps outside the region being outlined (interprocedural jumps). Procedures are nested to access shared variables. Alias sets for outlines must accurately represent uses and definitions, otherwise later optimizations might move (or eliminate) loads and stores.

Some variables might be turned into automatics for the new procedure, which has to transfer alias relationships.


The cloning process

After a function has been outlined, the next stage is cloning. In pass two of TPO, the outlined function is cloned, and one version is specialized into a PPE version, and one into an SPE version. All called functions must also be cloned, and SPE call sites are modified to call the SPE versions of cloned subroutines.

The partitioning pass creates SPE and PPE partitions and invokes a lower-level optimizer for machine-specific optimization.

Cloning outlined functions for heterogeneous processors

The cloning process assumes that outlined functions are nested, for instance, that local variables may be found in the parent's stack frame. For heterogeneous cloning, that means accessing the stack of a separate processor. Even if you had the system memory address of the parent stack frame in the PPE, you couldn't compute the offset when executing in a different compilation path.

To resolve this, all such references are identified and gathered into a structure (within which offsets are known), and this structure can have its address computed at runtime and passed as a parameter to the SPE routine, which can then use the softcache code to access these variables in the SPE function.

More about cloning

Cloning has substantial implications. The same techniques that potentially allow the same function to run simultaneously on the PPE and an SPE could allow it to run simultaneously on six or eight SPEs as well. With no programmer-visible change in the code, there's potential for a huge amount of parallelism.


PPE runtime

The first OpenMP construct sets up the runtime system. This involves creating SPE threads and a work queue, and getting the DMA queue addresses. The address of the work queue is sent to each SPE, and global options are set. After the partitioning and scheduling setup is complete, a setup_done signal is sent to the SPE runtime.

Each OpenMP parallel/work-sharing construct invokes the runtime system. The PPE itself calls outlined procedures to do its part of the work. The runtime uses barrier synchronization.

SPE runtime

The SPE runtime is much simpler than the PPE runtime. It sits in an infinite loop waiting for signals from the PPE runtime. When there's work to be done, the runtime fetches work items from the work queue, using DMA. Depending on the work type, it then either invokes an SPE-outlined procedure directly, or translates the address of an SPE-outlined procedure from a PPE-outlined procedure. Once again, a barrier object handles synchronization.


Communicating between the PPE and SPE runtime

The PPE uses DMA signals to communicate work to be done to the SPEs. The SPEs use mailboxes to notify the PPE of the completion of tasks. The PPE takes items off the work queue when it doesn't have any management tasks to handle, helping share some of the load.

Items in the work queue indicate the function to be run, the lower and upper bounds, and the work type to be done; the SPEs pick these items up and run them, then go back to checking the queue again.

Runtime interaction

The PPE runtime is responsible for partitioning and scheduling decisions, and handling synchronization. So, PPE functions tend to be stubs which set up the work queue, and then outlined functions get executed on the SPEs or on the PPE as appropriate.

Limitations

This design doesn't really provide for nested parallelism. Parallel constructs with system calls end up serialized, because system calls execute on the PPE. The aggregated SPE binary isn't partitioned yet. In general, some features are still under development.

Tune in next time

This tutorial referred in passing to techniques for sharing memory between the SPEs and main memory (and potentially each other). The next tutorial, the last in the series, takes a detailed look at some of the memory management techniques and code developed for this, such as the softcache code.

Acknowledgement

This tutorial series is based on the original presentation Optimizing Compiler for the Cell Processor given at PACT 2005 by Alexandre Eichenberger, Kathryn O'Brien, Kevin O'Brien, Peng Wu, Tong Chen, Peter Oden, Daniel Prener, Janice Shepherd, Byoungro So, Zehra Sura, Amy Wang, Tao Zhang, Peng Zhao, and Michael Gschwind of IBM Research.

This Part 4 is based on the section "Parallelization and partitioning in the Single Source Cell compiler."

Cell Broadband Engine is a trademark of Sony Computer Entertainment Inc.

7 of 10 | Previous | Next

Comments



Help: Update or add to My dW interests

What's this?

This little timesaver lets you update your My developerWorks profile with just one click! The general subject of this content (AIX and UNIX, Information Management, Lotus, Rational, Tivoli, WebSphere, Java, Linux, Open source, SOA and Web services, Web development, or XML) will be added to the interests section of your profile, if it's not there already. You only need to be logged in to My developerWorks.

And what's the point of adding your interests to your profile? That's how you find other users with the same interests as yours, and see what they're reading and contributing to the community. Your interests also help us recommend relevant developerWorks content to you.

View your My developerWorks profile

Return from help

Help: Remove from My dW interests

What's this?

Removing this interest does not alter your profile, but rather removes this piece of content from a list of all content for which you've indicated interest. In a future enhancement to My developerWorks, you'll be able to see a record of that content.

View your My developerWorks profile

Return from help

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Multicore acceleration
ArticleID=98032
TutorialTitle=An introduction to compiling for the Cell Broadband Engine architecture, Part 4: Partitioning large tasks
publish-date=02072006
author1-email=dwpower@us.ibm.com
author1-email-cc=

Tags

Help
Use the search field to find all types of content in My developerWorks with that tag.

Use the slider bar to see more or fewer tags.

Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere).

My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Use the search field to find all types of content in My developerWorks with that tag. Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere). My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Try IBM PureSystems. No charge.