Partitioning using OpenMP
A single-source program contains C or Fortran with user directives or pragmas hinting at partitioning. The compiler outlines all code within the pragmas into separate functions compiled for the SPE, then replaces the outlined code with a call to the parallel runtime, and compiles this wrapper code for the PPE. The master thread continues to execute on the PPE and participate in workshare constructs.
The PPE runtime places outlined functions on a work queue containing information about the scope of a job (such as number of iterations to execute, or the size of data set to work on in each iteration). The runtime has up to eight SPE threads to pull work items (outlined parallel functions) from the queue and execute them on SPEs. The PPE can then wait for an SPE task to complete, or continue executing code while the SPEs work.
Outlining happens early in pass one of TPO, right after intra-procedural optimization. It does not depend on knowledge of the target processor or processors. Instead, it creates an extra node in the call graph. There must be no jumps outside the region being outlined (interprocedural jumps). Procedures are nested to access shared variables. Alias sets for outlines must accurately represent uses and definitions, otherwise later optimizations might move (or eliminate) loads and stores.
Some variables might be turned into automatics for the new procedure, which has to transfer alias relationships.
After a function has been outlined, the next stage is cloning. In pass two of TPO, the outlined function is cloned, and one version is specialized into a PPE version, and one into an SPE version. All called functions must also be cloned, and SPE call sites are modified to call the SPE versions of cloned subroutines.
The partitioning pass creates SPE and PPE partitions and invokes a lower-level optimizer for machine-specific optimization.
Cloning outlined functions for heterogeneous processors
The cloning process assumes that outlined functions are nested, for instance, that local variables may be found in the parent's stack frame. For heterogeneous cloning, that means accessing the stack of a separate processor. Even if you had the system memory address of the parent stack frame in the PPE, you couldn't compute the offset when executing in a different compilation path.
To resolve this, all such references are identified and gathered into a structure (within which offsets are known), and this structure can have its address computed at runtime and passed as a parameter to the SPE routine, which can then use the softcache code to access these variables in the SPE function.
Cloning has substantial implications. The same techniques that potentially allow the same function to run simultaneously on the PPE and an SPE could allow it to run simultaneously on six or eight SPEs as well. With no programmer-visible change in the code, there's potential for a huge amount of parallelism.
The first OpenMP construct sets up the runtime system. This involves creating SPE threads and a work queue, and getting the DMA queue addresses. The address of the work queue is sent to each SPE, and global options are set. After the partitioning and scheduling setup is complete, a setup_done signal is sent to the SPE runtime.
Each OpenMP parallel/work-sharing construct invokes the runtime system. The PPE itself calls outlined procedures to do its part of the work. The runtime uses barrier synchronization.
The SPE runtime is much simpler than the PPE runtime. It sits in an infinite loop waiting for signals from the PPE runtime. When there's work to be done, the runtime fetches work items from the work queue, using DMA. Depending on the work type, it then either invokes an SPE-outlined procedure directly, or translates the address of an SPE-outlined procedure from a PPE-outlined procedure. Once again, a barrier object handles synchronization.
Communicating between the PPE and SPE runtime
The PPE uses DMA signals to communicate work to be done to the SPEs. The SPEs use mailboxes to notify the PPE of the completion of tasks. The PPE takes items off the work queue when it doesn't have any management tasks to handle, helping share some of the load.
Items in the work queue indicate the function to be run, the lower and upper bounds, and the work type to be done; the SPEs pick these items up and run them, then go back to checking the queue again.
The PPE runtime is responsible for partitioning and scheduling decisions, and handling synchronization. So, PPE functions tend to be stubs which set up the work queue, and then outlined functions get executed on the SPEs or on the PPE as appropriate.
This design doesn't really provide for nested parallelism. Parallel constructs with system calls end up serialized, because system calls execute on the PPE. The aggregated SPE binary isn't partitioned yet. In general, some features are still under development.
This tutorial referred in passing to techniques for sharing memory between the SPEs and main memory (and potentially each other). The next tutorial, the last in the series, takes a detailed look at some of the memory management techniques and code developed for this, such as the softcache code.
This tutorial series is based on the original presentation Optimizing Compiler for the Cell Processor given at PACT 2005 by Alexandre Eichenberger, Kathryn O'Brien, Kevin O'Brien, Peng Wu, Tong Chen, Peter Oden, Daniel Prener, Janice Shepherd, Byoungro So, Zehra Sura, Amy Wang, Tao Zhang, Peng Zhao, and Michael Gschwind of IBM Research.
This Part 4 is based on the section "Parallelization and partitioning in the Single Source Cell compiler."
Cell Broadband Engine is a trademark of Sony Computer Entertainment Inc.

