The RapidMind Development Platform lets you develop applications that effectively exploit the Cell/B.E. architecture by writing a single-source, single-threaded C++ program with an existing C++ compiler. Before discussing the platform itself, let me explain why Cell/B.E. processors and the RapidMind platform combine to make a good application development environment.
Introducing RapidMind and Cell/B.E.
The nine cores in the Cell/B.E. processor include:
- One standard "host" core, the PPU, which uses the PowerPC® architecture.
- Eight Synergistic Processing Elements (SPEs) which are vector co-processors highly tuned for numerical computation.
The PPU is a general-purpose processor and is capable of running traditional operating system and application code. However, most of the computational performance of the Cell/B.E. processor resides in the more specialized SPEs. Each SPE core has a number of unique features, including a large vector register file, a vector ALU, and an explicitly managed high-speed local memory. The Cell/B.E. processor also includes a number of features that allow the PPU and the SPEs to communicate and synchronize with each other.
Why RapidMind on the Cell/B.E. processor?
To use the RapidMind platform on the Cell/B.E. processor, it is not necessary to understand the details of the SPE cores or perform any SPE-specific programming. It is only necessary to write a single-source, single-threaded program using an existing C++ compiler (such as g++) and run it on the PPU. This program only needs to include a single header file and link to the RapidMind platform library.
To execute a computation on the SPEs, express it as a data-parallel computation and run it using the RapidMind platform. The interface to the RapidMind platform is implemented as a library that integrates cleanly with existing IDEs and build environments. Although it is used as a library, the RapidMind platform can also be thought of as a high-performance, embedded parallel programming language with its own code management and parallel runtime. It is possible to express arbitrary computations with the RapidMind interface and have these execute at performance levels rivaling that of native hardware-dependent tools, but with a simple, maintainable, and portable programming model.
The RapidMind interface is based on a small set of types. These types mimic standard C++ types -- floating point numbers, arrays, and functions -- making it straightforward to port existing code. However, computations expressed using RapidMind types can be collected into computationally intense kernels, dynamically compiled directly to optimized SPE machine language, and run in parallel under the management of a sophisticated runtime system that includes automatic load balancing and synchronization with the host. The RapidMind platform automates most of the low-level tasks involved in using the additional high-performance SPE cores, making it easier to take advantage of the extreme computational performance provided by the Cell/B.E. processor for a range of applications.
The C++ programming language is not usually considered suitable for high-performance programming. The C++ modularity mechanisms and memory model unfortunately incur overhead that is hard to eliminate with traditional compilation strategies. However, by using a dynamic compilation strategy, the RapidMind platform architecture bypasses these problems and makes it possible to eliminate overhead while targeting non-traditional architectures such as the SPEs.
In fact, using dynamic compilation, RapidMind implementations of benchmark applications can match or even significantly exceed the performance of the same applications written at a lower level with native tools. The RapidMind implementation is also often significantly simpler and more portable than comparable implementations.
In the following sections, I present a simple introduction to the RapidMind platform, the basic types and programming model, and an example showing how a simple loop kernel can be converted to run on the SPEs.
RapidMind interface and programming model
The RapidMind interface is based on three main types:
- the Value type,
- the Array type, and
- the Program type.
These are defined in the rapidmind/platform.hpp include file. Including this file and linking to the rmplatform library is all that is needed to use the platform. RapidMind type declarations are protected from name collisions with user-defined types by the rapidmind namespace. Also, to express control flow, some macros are used that are usually protected with an "RM_" prefix. By including rapidmind/shortcuts.hpp, some aliases can be defined that omit these prefixes. For brevity I will use these aliases in my examples, and will also omit namespace qualifications.
The Value<N,T> type represents a fixed-size container, a homogeneous tuple, with N instances of type T. The element type T can be any basic C++ numerical type, such as float or int. The type T may also be a bool.
There are also short forms for value tuples of up to length four. For example, a Value3f is a 3-tuple of single-precision floating point numbers, and a Value4ui is a 4-tuple of unsigned integers. The Value<1,T> type is in most cases a direct drop-in replacement for a single instance of type T. For example, single-precision complex numbers can be implemented using the RapidMind platform with std::complex<Value1f>.
All the usual operators are overloaded on values; they operate component by component. Most of the standard C library functions are also extended to values and execute in parallel on each element. The value types also provide support for swizzling and writemasking:
- Swizzling allows the arbitrary reshuffling of components of a value tuple.
- Writemasking allows the programmer to write to only a subset of the components of a value tuple in the destination of an assignment.
For example, given a 4-tuple Value4f c, you can extract the first three components of c and reverse their order using the swizzle expression c(2,1,0). In a single operation this extracts components 2, 1, and 0 and packs them together into a 3-tuple.
The value type permits the explicit specification of vectorized computations which are a good match to the vector register architecture of the SPEs.
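To make component-wise operators, swizzling, and writemasking concrete, here is a minimal standalone sketch in plain C++. It does not use the RapidMind headers: the names Tuple, swizzle, and write_masked are hypothetical stand-ins invented for this illustration, and the real Value type uses the c(2,1,0) call syntax shown above rather than a free function.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a value tuple; NOT the RapidMind Value type.
template <std::size_t N, typename T>
struct Tuple {
    std::array<T, N> v;

    T& operator[](std::size_t i) { assert(i < N); return v[i]; }
    const T& operator[](std::size_t i) const { assert(i < N); return v[i]; }
};

// Component-wise addition.
template <std::size_t N, typename T>
Tuple<N, T> operator+(const Tuple<N, T>& a, const Tuple<N, T>& b) {
    Tuple<N, T> r{};
    for (std::size_t i = 0; i < N; ++i) r[i] = a[i] + b[i];
    return r;
}

// Component-wise multiplication.
template <std::size_t N, typename T>
Tuple<N, T> operator*(const Tuple<N, T>& a, const Tuple<N, T>& b) {
    Tuple<N, T> r{};
    for (std::size_t i = 0; i < N; ++i) r[i] = a[i] * b[i];
    return r;
}

// Swizzle: build a new tuple from the listed components of src, in order.
template <std::size_t N, typename T, typename... Idx>
Tuple<sizeof...(Idx), T> swizzle(const Tuple<N, T>& src, Idx... idx) {
    static_assert(sizeof...(Idx) > 0, "swizzle needs at least one index");
    const std::size_t order[] = {static_cast<std::size_t>(idx)...};
    Tuple<sizeof...(Idx), T> out{};
    for (std::size_t i = 0; i < sizeof...(Idx); ++i) out[i] = src[order[i]];
    return out;
}

// Writemask: store src into only the listed components of dst.
template <std::size_t N, typename T, std::size_t M>
void write_masked(Tuple<N, T>& dst, const std::size_t (&mask)[M],
                  const Tuple<M, T>& src) {
    for (std::size_t i = 0; i < M; ++i) dst[mask[i]] = src[i];
}
```

With c holding (1, 2, 3, 4), swizzle(c, 2, 1, 0) yields the 3-tuple (3, 2, 1), and write_masked with the mask {0, 2} changes only components 0 and 2 of the destination.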
The Array<D,V> type represents a variable-sized multidimensional container in which D is the dimensionality (1, 2, or 3) and V is the element type (a RapidMind value). This type is intended for managing large amounts of data. Like the normal pointer arrays used in C++, RapidMind arrays can be assigned to one another in O(1) time. However, RapidMind arrays use by-value rather than by-reference semantics. This simplifies memory management and avoids unnecessary side effects.
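One common way to combine O(1) assignment with by-value semantics is copy-on-write. The sketch below shows that general technique in plain C++. It is an assumption made for illustration, not a description of how the RapidMind Array type is implemented, and the CowArray class and its member names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical copy-on-write buffer; an illustration, NOT the RapidMind Array.
// This sketch is not thread-safe.
class CowArray {
public:
    explicit CowArray(std::size_t n, float init = 0.0f)
        : data_(std::make_shared<std::vector<float>>(n, init)) {}

    // Copy construction and assignment are compiler-generated: they copy
    // only the shared_ptr, so they cost O(1) regardless of array size.

    std::size_t size() const { return data_->size(); }

    float read(std::size_t i) const {
        assert(i < size());
        return (*data_)[i];
    }

    // A write first detaches from any other owner, so other copies never
    // observe the change. That is what gives by-value behavior.
    void write(std::size_t i, float x) {
        assert(i < size());
        if (data_.use_count() > 1)
            data_ = std::make_shared<std::vector<float>>(*data_);
        (*data_)[i] = x;
    }

    bool shares_storage_with(const CowArray& other) const {
        return data_ == other.data_;
    }

private:
    std::shared_ptr<std::vector<float>> data_;
};
```

After CowArray b = a; the two objects share storage. The first b.write(...) gives b its own copy and leaves a unchanged.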
Finally, RapidMind supports a unique Program type which represents a computation. It is literally a container for program code and can be constructed dynamically. It can be thought of as a function that, unlike ordinary C++ functions, can be created and manipulated at run time. Basically, sequences of computations on RapidMind values can be "recorded" and stored in a RapidMind program object. Following is a simple code example of how a program object can be constructed:
Program prog = BEGIN {
    In<Value3f> a, b;
    Out<Value3f> c;
    Value3f d = func(a, b);
    c = d + a * 2.0f;
} END;
In this example, a C++ function func() is called. This function can be defined with ordinary C++ mechanisms and many other C++ modularity constructs, such as classes, can also be used. The value d is a local variable only visible inside the program. The values a and b are marked as inputs, and the value c is marked as an output. Inputs are initialized with the values of actual arguments, and values marked as outputs will be copied into actual outputs when the program is later bound and executed. Programs can have more than one input and also more than one output.
The BEGIN macro switches from "immediate" to "retained" mode. By default, operations on RapidMind types actually execute on the host just like the C++ numerical types they emulate. In retained mode, however, operations are not executed; instead, they are recorded and stored in the associated program object. When the END tag is encountered, the RapidMind platform switches back to immediate mode but also prepares the captured operations stored in the program object for execution on the SPEs.
Note that only operations on RapidMind types are captured when a program object is built. This is the mechanism that allows the platform to avoid C++ overhead. In effect, normal C++ modularity constructs act as scaffolding for generating intense sequences of numerical operations, but the overhead for these modularity constructs only has to be executed once when the program is created, not every time it is used.
Basically, C++ constructs are "baked" into the program object -- pointers are de-referenced, functions are inlined, loops are unrolled, and C++ variables are interpreted as constants, resulting in very dense and efficient code. The RapidMind platform, however, has dynamic versions of all these constructs, so they can be used when -- and only when -- they are necessary.
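The idea of recording operations while ordinary C++ runs once can be sketched with a toy "tape" in plain C++. Everything here (Tape, Var, record_horner, run) is hypothetical and far simpler than the real platform, which compiles the recorded operations to SPE machine code rather than interpreting them. The sketch only shows how a C++ loop and a C++ container disappear into an unrolled sequence of numerical operations with baked-in constants.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical miniature of "retained mode"; none of these names are RapidMind API.
enum class Op { Input, Const, Add, Mul };

struct Instr {
    Op op;
    std::size_t lhs;  // operand register (Add, Mul), unused otherwise
    std::size_t rhs;  // operand register (Add, Mul), unused otherwise
    float imm;        // literal value (Const), unused otherwise
};

struct Tape {
    std::vector<Instr> code;

    // Appends one instruction; its position doubles as its result register.
    std::size_t emit(Op op, std::size_t lhs, std::size_t rhs, float imm) {
        code.push_back(Instr{op, lhs, rhs, imm});
        return code.size() - 1;
    }
};

// A recorded value: a handle to a register on some tape.
struct Var {
    Tape* tape;
    std::size_t reg;
};

Var input(Tape& t) { return Var{&t, t.emit(Op::Input, 0, 0, 0.0f)}; }
Var constant(Tape& t, float x) { return Var{&t, t.emit(Op::Const, 0, 0, x)}; }
Var operator+(Var a, Var b) { return Var{a.tape, a.tape->emit(Op::Add, a.reg, b.reg, 0.0f)}; }
Var operator*(Var a, Var b) { return Var{a.tape, a.tape->emit(Op::Mul, a.reg, b.reg, 0.0f)}; }

// Ordinary C++ that runs once, at record time. The loop and the coefficient
// vector leave no trace on the tape except unrolled Mul/Const/Add entries.
Tape record_horner(const std::vector<float>& coeffs) {
    assert(!coeffs.empty());
    Tape t;
    Var x = input(t);
    Var acc = constant(t, coeffs[0]);
    for (std::size_t i = 1; i < coeffs.size(); ++i)
        acc = acc * x + constant(t, coeffs[i]);
    return t;
}

// Replays the tape for one input value; the result is the last register.
float run(const Tape& t, float x0) {
    assert(!t.code.empty());
    std::vector<float> regs(t.code.size());
    for (std::size_t i = 0; i < t.code.size(); ++i) {
        const Instr& in = t.code[i];
        switch (in.op) {
            case Op::Input: regs[i] = x0; break;
            case Op::Const: regs[i] = in.imm; break;
            case Op::Add:   regs[i] = regs[in.lhs] + regs[in.rhs]; break;
            case Op::Mul:   regs[i] = regs[in.lhs] * regs[in.rhs]; break;
        }
    }
    return regs.back();
}
```

Recording the coefficients {2, 3, 4} produces a straight-line tape of eight instructions for 2x^2 + 3x + 4, and run(tape, 2.0f) returns 18. The loop overhead is paid once during recording, not on each replay.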
Dynamic code generation has other implications for performance. It is easy to create alternative versions of program objects whenever necessary to exploit special cases in the input data. It is also straightforward to tune program objects (either explicitly or automatically) by modifying construction parameters, such as a blocking factor or the number of times a loop is unrolled, until performance is maximized.
Syntactically, the construction of a program object looks like a dynamic function definition and the resulting program object can in fact be used much like an ordinary function. However, RapidMind program objects are normally applied to entire arrays. The computation represented by a program is executed concurrently on all elements of the array, using all the cores on the Cell/B.E. processor in parallel. If A, B, and C are RapidMind array objects, then the statement C = prog(A,B); initiates parallel execution.
Program objects can include arbitrary computations, including dynamic data-dependent control flow and random access reads from other arrays. An extension of the previous example, including dynamic control flow and random access into a 1D array B rather than streaming access, could be expressed as follows:
Program p = BEGIN {
    In<Value3f> a;
    In<Value1i> u;
    Out<Value3f> c;
    Value3f d = func(a, B[u]);
    IF (all(a > 0.0f)) {
        c = d + a * 2.0f;
    } ELSE {
        c = d - a * 2.0f;
    } ENDIF;
} END;
The inclusion of control flow makes the programming model very general. Technically, RapidMind uses an SPMD (single program, multiple data) data-parallel programming model. A number of collective operations are also available, including higher-order reductions which can take an associative "combiner" program as an argument.
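To see why a reduction wants an associative combiner, consider the following plain C++ sketch of a pairwise (tree) reduction. The function name tree_reduce and its signature are hypothetical and are not the RapidMind reduction interface. The point is only that a tree groups operands differently from a left-to-right loop, which is what lets each level be evaluated in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper, NOT the RapidMind reduction call.
// Combines neighbors pairwise, level by level, until one value remains.
// The combiner must be associative because operands are regrouped;
// their left-to-right order is preserved.
float tree_reduce(std::vector<float> data,
                  const std::function<float(float, float)>& combine) {
    assert(!data.empty());
    while (data.size() > 1) {
        std::vector<float> next;
        next.reserve((data.size() + 1) / 2);
        // The pairs in one level are independent of each other, so a
        // parallel runtime could evaluate them concurrently.
        for (std::size_t i = 0; i + 1 < data.size(); i += 2)
            next.push_back(combine(data[i], data[i + 1]));
        if (data.size() % 2 == 1)
            next.push_back(data.back());  // carry the unpaired last element
        data.swap(next);
    }
    return data[0];
}
```

Passing an addition lambda over {1, 2, 3, 4, 5} gives 15, and passing a maximum lambda gives 5. Both combiners are associative, so the tree result matches a sequential loop.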
Access patterns also make it possible to read from only part of an input array and update only part of an output array. The combination of control flow in kernels and general collective operations on arrays makes it possible to use a bulk-synchronous style of parallel programming which has been shown to apply to a wide range of problems.
The data-parallel programming model used by the RapidMind platform allows the system to scale gracefully to variable numbers of cores. For example, under Yellow Dog Linux (from Terra Soft), a RapidMind-enabled program can execute in parallel on the six available SPE cores on the Sony® PLAYSTATION® 3. On an IBM QS20 Cell/B.E. blade which has two connected Cell/B.E. processors, the same program can automatically take advantage of all 16 available SPE cores.
Note that the amount of parallelism expressed in a RapidMind computation is usually much larger than the number of cores. After dividing the work over the available cores, the "extra" parallelism is used internally for various additional optimizations, including vectorization and latency hiding. The platform runtime also includes automatic load-balancing, a necessary feature since control flow can lead to differences in execution times for different instances of the computational kernel represented by a RapidMind program object.
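The need for load balancing can be illustrated with a small cost model in plain C++. This is a sketch under stated assumptions, not the platform's scheduler: cost[i] is a made-up execution time for kernel instance i, and the functions static_makespan and dynamic_makespan are hypothetical names. The first gives each worker one fixed contiguous block; the second hands the next chunk to whichever worker becomes free first.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Cost model only: cost[i] is the (hypothetical) time kernel instance i takes.

// Static split: worker w gets one contiguous block of about n / workers items.
// Returns the time at which the slowest worker finishes.
long static_makespan(const std::vector<long>& cost, std::size_t workers) {
    assert(workers > 0);
    const std::size_t n = cost.size();
    long worst = 0;
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t begin = n * w / workers;
        const std::size_t end = n * (w + 1) / workers;
        long busy = 0;
        for (std::size_t i = begin; i < end; ++i) busy += cost[i];
        worst = std::max(worst, busy);
    }
    return worst;
}

// Dynamic split: the worker that is free earliest claims the next chunk.
long dynamic_makespan(const std::vector<long>& cost, std::size_t workers,
                      std::size_t chunk) {
    assert(workers > 0 && chunk > 0);
    std::vector<long> busy_until(workers, 0);
    for (std::size_t begin = 0; begin < cost.size(); begin += chunk) {
        auto freest = std::min_element(busy_until.begin(), busy_until.end());
        const std::size_t end = std::min(begin + chunk, cost.size());
        for (std::size_t i = begin; i < end; ++i) *freest += cost[i];
    }
    return *std::max_element(busy_until.begin(), busy_until.end());
}
```

With eight instances whose costs are {10, 10, 1, 1, 1, 1, 1, 1} and two workers, the static split finishes at time 22 because one worker receives both expensive instances. Dynamic assignment with a chunk size of 1 finishes at time 13.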
This additional example should make the RapidMind interface concepts clearer. In the following section, I'll convert a nested loop kernel to a parallel implementation that runs on all SPEs:
#include <cmath>
const int w = 512, h = 512;
float f;
float a[w][h][4], b[w][h][4];
void compute() {
    for (int i = 0; i < w; i++)
        for (int j = 0; j < h; j++)
            for (int k = 0; k < 4; k++) {
                a[i][j][k] += f * b[i][j][k];
            }
}
The conversion process has three steps as Figure 1 illustrates.
Figure 1. The RapidMind conversion workflow
- First, convert the numerical types used in the kernel to the equivalent RapidMind types.
- Second, create a RapidMind program object.
- Third, apply the program object to an array of data. This invokes a parallel computation that is coordinated by the platform's runtime execution manager.
To enable use of the RapidMind platform, you'll need to include the RapidMind header files. For brevity in these examples, I'll also load the short forms of the macros and specify the global use of the rapidmind namespace:
#include <rapidmind/platform.hpp>
#include <rapidmind/shortcuts.hpp>
using namespace rapidmind;
Now I'll convert the data to use the RapidMind types. Later I'll create a RapidMind program to implement the innermost loop using component-wise operations on the value tuples, so here I'll declare the arrays to hold value tuples of the appropriate length:
const int w = 512, h = 512;
Value1f f;
Array<2,Value4f> a(w,h), b(w,h);
You should build the program object representing the computation during an initialization phase. After you build it, you can use it repeatedly to perform computation. In more complex programs, you normally build program objects in class constructors; here, I'll use a simple initialization function to build the program object and will store it in a global variable. This function should be called during system initialization:
Program compute_prog;
void init_compute_prog () {
    compute_prog = BEGIN {
        In<Value4f> r, s;
        Out<Value4f> t;
        t = r + f * s;
    } END;
}
Note that the program object actually refers to a non-local variable f of type Value1f. When the program object runs on the SPEs, it will use whatever the current value of this variable is when it runs, just like an ordinary C++ function. The RapidMind platform automatically tracks dependencies between program objects and references to non-local variables declared using RapidMind types and sets up the appropriate communication without further intervention from the developer. The length of the f 1-tuple is also automatically promoted to a 4-tuple by duplicating its scalar value before applying the component-wise multiplication operation.
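The two behaviors described here, reading the current value of a non-local variable at run time and applying a scalar to every component of a 4-tuple, have a rough analogue in plain C++. The sketch below is only an analogy: scale, Float4, and make_kernel are hypothetical names, and the platform's dependency tracking and host-to-SPE communication are not modeled.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>

using Float4 = std::array<float, 4>;

// Plays the role of the non-local f; the name is hypothetical.
float scale = 1.0f;

// Built once. The lambda refers to the global `scale` itself, not to a
// copy taken at build time, so every call sees the current value.
std::function<Float4(const Float4&, const Float4&)> make_kernel() {
    return [](const Float4& r, const Float4& s) {
        Float4 t{};
        // The single scalar is applied to each of the four components,
        // mirroring the promotion of a 1-tuple to a 4-tuple.
        for (std::size_t k = 0; k < 4; ++k) t[k] = r[k] + scale * s[k];
        return t;
    };
}
```

Building the kernel once, then setting scale to 2 and later to 0, changes the result of subsequent calls without rebuilding the kernel.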
Finally, the program can be executed whenever necessary:
a = compute_prog(a,b);
This example illustrates one further point. Note that the array a is both an input and an output of the program. The RapidMind platform uses parallel assignment semantics -- inputs are always bound to the "previous" value, not the "new" value.
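For a purely element-wise kernel the distinction is invisible, but it matters as soon as a kernel reads elements other than its own. The following plain C++ sketch contrasts the two behaviors using a hypothetical neighbor-reading computation. The function names are invented for this illustration and this is not RapidMind code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Parallel assignment semantics: results go to a fresh buffer, so every
// read sees the "previous" contents. Computes out[i] = in[i - 1] + in[i],
// treating the missing left neighbor of element 0 as 0.
std::vector<float> add_left_neighbor(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = (i > 0 ? in[i - 1] : 0.0f) + in[i];
    return out;
}

// Naive in-place loop for contrast: later iterations read elements that
// earlier iterations have already overwritten, so the result differs.
void add_left_neighbor_in_place(std::vector<float>& a) {
    for (std::size_t i = 1; i < a.size(); ++i) a[i] = a[i - 1] + a[i];
}
```

Starting from {1, 1, 1, 1}, the statement a = add_left_neighbor(a); produces {1, 2, 2, 2}, while the in-place loop produces {1, 2, 3, 4}. The first matches the "previous value" rule described above.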
This article gave a brief introduction to the RapidMind platform. For more information, including benchmark results and more detailed code examples, please visit the RapidMind Web site (see Resources), where a number of white papers are available. You can also sign up for an evaluation version of the RapidMind platform, complete with full documentation and sample code for a variety of applications.
Learn
- Use an RSS feed to request notification for the upcoming articles in this series. (Find out more about RSS feeds of developerWorks content.)
- For more detailed examples, including comparisons of code complexity and benchmarks on various hardware targets including the Cell/B.E. processor, try the RapidMind Web site (including an overview of the technology).
- Jonathon Bartlett's series, "Programming high-performance applications on the Cell/B.E. processor," (developerWorks, started January 2007) covers the following:
  - Linux on PS3
  - Programming the SPEs
  - Introducing the SPU
  - Programming the SPU for performance
  - Programming the SPU in C/C++
- "Debugging Cell Broadband Engine systems" (developerWorks, August 2006) provides debugging tools (and describes how to use new versions of the GNU Debugger) to diagnose problems in both the Cell/B.E. PPU and SPU programs.
- "Unleashing the power of the Cell Broadband Engine" (developerWorks, November 2005) explores programming models for the Cell/B.E. processor, from the simple to the progressively more advanced.
- Don't have time to read the forums? Try this issue of "Forum watch: Is hardware better than sim when modeling SPU-to-SPU performance?" (developerWorks, March 2007), which packages up six of the recent, real-world, hottest forum Qs and As on Cell/B.E. processor technology.
Get products and technologies
- Drop into the Cell Broadband Engine resource center for an overview of the processor plus news, downloads, documentation, and community efforts.
- Grab the latest version of the Cell/B.E. SDK.
Discuss
- Participate in the discussion forum.
- Need answers right away? Drop into the Cell Broadband Engine Architecture forum with your questions.

An Associate Professor at the University of Waterloo and co-founder of RapidMind, Dr. McCool continues to perform research at the university and sit on the Board of Directors at RapidMind Inc. His research interests include high-quality real-time rendering, global and local illumination, hardware algorithms, parallel computing, reconfigurable computing, interval and Monte Carlo methods and applications, end-user programming and metaprogramming, image and signal processing, and sampling.