Core partners, Part 1: Build high-performance apps for multicore processors

See how RapidMind delivers a single-source development platform for Cell/B.E. applications

Michael McCool (mmccool@rapidmind.net), Founder and Chief Scientist, RapidMind
An Associate Professor at the University of Waterloo and co-founder of RapidMind, Dr. McCool continues to perform research at the university and sit on the Board of Directors at RapidMind Inc. His research interests include high-quality real-time rendering, global and local illumination, hardware algorithms, parallel computing, reconfigurable computing, interval and Monte Carlo methods and applications, end-user programming and metaprogramming, image and signal processing, and sampling.

Summary:  The RapidMind Development Platform provides a simple single-source mechanism to develop portable high-performance applications for multicore processors. In particular, you can use it to develop applications that fully exploit the power of the Cell Broadband Engine™ (Cell/B.E.) processor's unique architecture by writing only one, single-threaded C++ program using an existing C++ compiler. In this article, author Michael McCool takes you on a guided tour of the RapidMind Development Platform.


Date:  01 May 2007
Level:  Intermediate

Before discussing the RapidMind Development Platform, which lets you exploit the Cell/B.E. architecture by writing a single, single-threaded C++ program with an existing C++ compiler, let me explain why Cell/B.E. processors and the RapidMind platform combine to make a good application development environment.

Introducing RapidMind and Cell/B.E.

The nine cores in the Cell/B.E. processor include:

  • One standard "host" core, the PPU, which uses the PowerPC® architecture.
  • Eight Synergistic Processing Elements (SPEs), which are vector co-processors highly tuned for numerical computation.

The PPU is a general-purpose processor and is capable of running traditional operating system and application code. However, most of the computational performance of the Cell/B.E. processor resides in the more specialized SPEs. Each SPE core has a number of unique features, including a large vector register file, a vector ALU, and an explicitly managed high-speed local memory. The Cell/B.E. processor also includes a number of features that allow the PPU and the SPEs to communicate and synchronize with each other.

Why RapidMind on the Cell/B.E. processor?

To use the RapidMind platform on the Cell/B.E. processor, it is not necessary to understand the details of the SPE cores or perform any SPE-specific programming. It is only necessary to write a single-source, single-threaded program using an existing C++ compiler (such as g++) and run it on the PPU. This program only needs to include a single header file and link to the RapidMind platform library.

To execute a computation on the SPEs, it should be expressed as a data-parallel computation and executed using the RapidMind platform. The interface to the RapidMind platform is implemented as a library that integrates cleanly with existing IDEs and build environments. Although it is used as a library, the RapidMind platform can also be thought of as a high-performance, embedded parallel programming language with its own code management and parallel runtime. It is possible to express arbitrary computations with the RapidMind interface and have these execute at performance levels rivaling that of native hardware-dependent tools, but with a simple, maintainable, and portable programming model.

The RapidMind interface is based on a small set of types. These types mimic standard C++ types -- floating point numbers, arrays, and functions -- making it straightforward to port existing code. However, computations expressed using RapidMind types can be collected into computationally intense kernels, dynamically compiled directly to optimized SPE machine language, and run in parallel under the management of a sophisticated runtime system that includes automatic load balancing and synchronization with the host. The RapidMind platform automates most of the low-level tasks involved in using the additional high-performance SPE cores, making it easier to take advantage of the extreme computational performance provided by the Cell/B.E. processor for a range of applications.

Making C++ high performance

The C++ programming language is not usually considered suitable for high-performance programming. The C++ modularity mechanisms and memory model unfortunately incur overhead that is hard to eliminate with traditional compilation strategies. However, by using a dynamic compilation strategy, the RapidMind platform architecture bypasses these problems and makes it possible to eliminate overhead while targeting non-traditional architectures such as the SPEs.

In fact, using dynamic compilation, RapidMind implementations of benchmark applications can match or even significantly exceed the performance of the same applications written at a lower level with native tools. The RapidMind implementation is also often significantly simpler and more portable than comparable implementations.

In the following sections, I present a simple introduction to the RapidMind platform, the basic types and programming model, and an example showing how a simple loop kernel can be converted to run on the SPEs.

RapidMind interface and programming model

The RapidMind interface is based on three main types:

  • the Value type,
  • the Array type, and
  • the Program type.

These are defined in the rapidmind/platform.hpp include file. Including this file and linking to the rmplatform library is all that is needed to use the platform. RapidMind type declarations are protected from name collisions with user-defined types by the rapidmind namespace. Control flow is expressed with macros, which are usually protected with an "RM_" prefix. Including rapidmind/shortcuts.hpp defines aliases that omit these prefixes. For brevity I will use these aliases in my examples, and will also omit namespace qualifications.

The Value type

The Value<N,T> type represents a fixed-size container, a homogeneous tuple, with N instances of type T. The element type T can be any basic C++ numerical type, such as float or int. The type T may also be a bool.

There are also short forms for value tuples of up to length four. For example, a Value3f is a 3-tuple of single-precision floating point numbers, and a Value4ui is a 4-tuple of unsigned integers. The Value<1,T> type is in most cases a direct drop-in replacement for a single instance of type T. For example, single-precision complex numbers can be implemented using the RapidMind platform with std::complex<Value1f>.

All the usual operators are overloaded on values; they operate component by component. Most of the standard C library functions are also extended to values and execute in parallel on each element. The value types also provide support for swizzling and writemasking:

  • Swizzling allows the arbitrary reshuffling of components of a value tuple.
  • Writemasking allows the programmer to write to only a subset of the components of a value tuple in the destination of an assignment.

For example, given a 4-tuple Value4f c, you can extract the first three components of c and reverse their order using the swizzle expression c(2,1,0). In a single operation this extracts components 2, 1, and 0 and packs them together into a 3-tuple. The value type permits the explicit specification of vectorized computations which are a good match to the vector register architecture of the SPEs.
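To make swizzling and writemasking concrete without relying on the RapidMind headers, here is a minimal sketch in plain C++. The Tuple type and its swizzle() and write_masked() members are hypothetical names invented for this illustration; they are not the RapidMind Value interface, which uses the c(2,1,0) call syntax shown above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <initializer_list>

// Hypothetical stand-in for a value tuple; NOT the RapidMind Value type.
template <std::size_t N>
struct Tuple {
    std::array<float, N> v;

    // Swizzle: build a new tuple from the listed component indices,
    // in the order given (indices may repeat).
    template <typename... Idx>
    Tuple<sizeof...(Idx)> swizzle(Idx... idx) const {
        Tuple<sizeof...(Idx)> out{};
        std::size_t k = 0;
        for (std::size_t i : {static_cast<std::size_t>(idx)...}) {
            assert(i < N);
            out.v[k++] = v[i];
        }
        return out;
    }

    // Writemask: copy src into only the listed destination slots,
    // leaving every other component untouched.
    template <std::size_t M>
    void write_masked(const std::array<std::size_t, M>& slots,
                      const Tuple<M>& src) {
        for (std::size_t k = 0; k < M; ++k) {
            assert(slots[k] < N);
            v[slots[k]] = src.v[k];
        }
    }
};

// Component-by-component addition, as the overloaded operators behave.
template <std::size_t N>
Tuple<N> operator+(const Tuple<N>& a, const Tuple<N>& b) {
    Tuple<N> out{};
    for (std::size_t i = 0; i < N; ++i) out.v[i] = a.v[i] + b.v[i];
    return out;
}
```

With a 4-tuple c holding (1, 2, 3, 4), c.swizzle(2, 1, 0) yields the 3-tuple (3, 2, 1), which mirrors the c(2,1,0) expression in the text.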

The Array type

The Array<D,V> type represents a variable-sized multidimensional container in which D is the dimensionality (1, 2, or 3) and V is the element type (a RapidMind value). This type is intended for managing large amounts of data. Like the normal pointer arrays used in C++, RapidMind arrays can be assigned to one another in O(1) time. However, RapidMind arrays use by-value rather than by-reference semantics. This simplifies memory management and avoids unnecessary side effects.
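One common way to combine O(1) assignment with by-value semantics is copy-on-write. The article does not say how RapidMind arrays are implemented, so the following is only a sketch of the observable behavior, using a hypothetical CowArray class written in standard C++ for single-threaded use.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical copy-on-write array; illustrates the semantics only and
// is an assumption, not a description of RapidMind internals.
class CowArray {
public:
    explicit CowArray(std::size_t n, float init = 0.0f)
        : data_(std::make_shared<std::vector<float>>(n, init)) {}

    // Copying shares the buffer: O(1), no element is touched.
    CowArray(const CowArray&) = default;
    CowArray& operator=(const CowArray&) = default;

    float read(std::size_t i) const {
        assert(i < data_->size());
        return (*data_)[i];
    }

    // Writing detaches first if the buffer is shared, so other copies
    // never observe the change (by-value semantics).
    void write(std::size_t i, float x) {
        assert(i < data_->size());
        if (data_.use_count() > 1) {
            data_ = std::make_shared<std::vector<float>>(*data_);
        }
        (*data_)[i] = x;
    }

    bool shares_storage_with(const CowArray& other) const {
        return data_ == other.data_;
    }

private:
    std::shared_ptr<std::vector<float>> data_;
};
```

After CowArray b = a; the two objects share storage, but a later b.write(0, 5.0f) leaves a unchanged, which is the side-effect-free behavior the paragraph above describes.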

The unique Program type

Finally, RapidMind supports a unique Program type which represents a computation. It is literally a container for program code and can be constructed dynamically. It can be thought of as a function that, unlike ordinary C++ functions, can be created and manipulated at run time. Basically, sequences of computations on RapidMind values can be "recorded" and stored in a RapidMind program object. Following is a simple code example of how a program object can be constructed:

Program prog = BEGIN {
  In<Value3f> a, b;
  Out<Value3f> c;

  Value3f d = func(a, b);
  c = d + a * 2.0f;
} END;

In this example, a C++ function func() is called. This function can be defined with ordinary C++ mechanisms, and many other C++ modularity constructs, such as classes, can also be used. The value d is a local variable visible only inside the program. The values a and b are marked as inputs, and the value c is marked as an output. Inputs are initialized with the values of actual arguments, and values marked as outputs are copied into actual outputs when the program is later bound and executed. Programs can have more than one input and more than one output.

The BEGIN macro switches from "immediate" to "retained" mode. By default, operations on RapidMind types actually execute on the host just like the C++ numerical types they emulate. In retained mode, however, operations are not executed; instead, they are recorded and stored in the associated program object. When the END tag is encountered, the RapidMind platform switches back to immediate mode but also prepares the captured operations stored in the program object for execution on the SPEs.

Note that only operations on RapidMind types are captured when a program object is built. This is the mechanism that allows the platform to avoid C++ overhead. In effect, normal C++ modularity constructs act as scaffolding for generating intense sequences of numerical operations, but the overhead for these modularity constructs only has to be executed once when the program is created, not every time it is used.

Basically, C++ constructs are "baked" into the program object -- pointers are de-referenced, functions are inlined, loops are unrolled, and C++ variables are interpreted as constants, resulting in very dense and efficient code. The RapidMind platform, however, has dynamic versions of all these constructs, so they can be used when -- and only when -- they are necessary.
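The "baking" idea can be sketched in a few lines of standard C++. In this illustration, which is an assumption-laden toy and not the RapidMind recording mechanism, a hypothetical record_polynomial() function runs an ordinary C++ loop once, at build time, and stores only the resulting straight-line sequence of operations. The loop counter and the coefficient vector leave no trace in the recorded program except as constants.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical recorded operations on a single accumulator.
enum class OpKind { MulByInput, AddConst };

struct Op {
    OpKind kind;
    float constant;  // used only by AddConst
};

// Hypothetical stand-in for a program object: a flat list of recorded ops.
struct Recorded {
    float initial = 0.0f;
    std::vector<Op> ops;

    // Replay the recorded operations for one input value.
    float run(float x) const {
        float acc = initial;
        for (const Op& op : ops) {
            if (op.kind == OpKind::MulByInput) acc *= x;
            else acc += op.constant;
        }
        return acc;
    }
};

// Build-time code: the loop and the vector lookups execute once, here.
// coeffs holds c0, c1, ..., cn for c0 + c1*x + ... + cn*x^n (Horner form).
Recorded record_polynomial(const std::vector<float>& coeffs) {
    assert(!coeffs.empty());
    Recorded prog;
    prog.initial = coeffs.back();
    for (std::size_t k = coeffs.size() - 1; k-- > 0;) {
        prog.ops.push_back({OpKind::MulByInput, 0.0f});
        prog.ops.push_back({OpKind::AddConst, coeffs[k]});  // value baked in
    }
    return prog;
}
```

Because the coefficients are copied into the recorded operations, changing the coeffs vector after recording has no effect on the recorded program. Rebuilding the program with different parameters is how specialized variants would be produced in this toy model.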

Dynamic code generation has other implications for performance. It is easy to create alternative versions of program objects whenever necessary to exploit special cases in the input data. It is also straightforward to tune program objects (either explicitly or automatically) by modifying construction parameters, such as a blocking factor or the number of times a loop is unrolled, until performance is maximized.

Syntactically, the construction of a program object looks like a dynamic function definition, and the resulting program object can in fact be used much like an ordinary function. However, RapidMind program objects are normally applied to entire arrays. The computation represented by a program is executed concurrently on all elements of the array, using all the cores on the Cell/B.E. processor in parallel. If A, B, and C are RapidMind array objects, then the statement C = prog(A,B); initiates parallel execution.

Program objects can include arbitrary computations, including dynamic data-dependent control flow and random access reads from other arrays. An extension of the previous example, including dynamic control flow and random access into a 1D array B rather than streaming access, could be expressed as follows:

Program p = BEGIN {
  In<Value3f> a;
  In<Value1i> u;
  Out<Value3f> c;
  
  Value3f d = func(a, B[u]);
  IF (all(a > 0.0f)) {
    c = d + a * 2.0f;
  } ELSE {
    c = d - a * 2.0f;
  } ENDIF;
} END;

The inclusion of control flow makes the programming model very general. Technically, RapidMind uses an SPMD (single program, multiple data) data-parallel programming model. A number of collective operations are also available, including higher-order reductions which can take an associative "combiner" program as an argument.
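To show why a reduction's combiner needs to be associative, here is a small pairwise (tree) reduction in standard C++. The tree_reduce() function is a hypothetical helper written for this illustration; it is not the RapidMind collective interface, and it runs sequentially even though each pair within a pass is independent.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Pairwise (tree) reduction. The grouping differs from a left-to-right
// fold, so the result is only well defined for an associative combiner.
// Hypothetical helper; not part of any RapidMind API.
float tree_reduce(std::vector<float> values,
                  const std::function<float(float, float)>& combine) {
    assert(!values.empty());
    while (values.size() > 1) {
        std::vector<float> next;
        next.reserve(values.size() / 2 + 1);
        // Each pair is independent of the others within one pass, so a
        // parallel runtime could hand the pairs to different cores.
        for (std::size_t i = 0; i + 1 < values.size(); i += 2) {
            next.push_back(combine(values[i], values[i + 1]));
        }
        // An odd element out is carried to the next pass unchanged.
        if (values.size() % 2 == 1) next.push_back(values.back());
        values.swap(next);
    }
    return values[0];
}
```

Passing std::plus<float>() as the combiner gives a sum, and a lambda returning the larger of its two arguments gives a maximum; both are associative, so the tree grouping is safe.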

Access patterns also make it possible to read from only part of an input array and update only part of an output array. The combination of control flow in kernels and general collective operations on arrays makes it possible to use a bulk-synchronous style of parallel programming which has been shown to apply to a wide range of problems.

The data-parallel programming model used by the RapidMind platform allows the system to scale gracefully to variable numbers of cores. For example, under Yellow Dog Linux (from Terra Soft), a RapidMind-enabled program can execute in parallel on the six available SPE cores on the Sony® PLAYSTATION® 3. On an IBM QS20 Cell/B.E. blade which has two connected Cell/B.E. processors, the same program can automatically take advantage of all 16 available SPE cores.

Note that the amount of parallelism expressed in a RapidMind computation is usually much larger than the number of cores. After dividing the work over the available cores, the "extra" parallelism is used internally for various additional optimizations, including vectorization and latency hiding. The platform runtime also includes automatic load-balancing, a necessary feature since control flow can lead to differences in execution times for different instances of the computational kernel represented by a RapidMind program object.

Loop conversion example

This additional example should make the RapidMind interface concepts clearer. In the following section, I'll convert a nested loop kernel to a parallel implementation that runs on all SPEs:

#include <cmath>

const int w = 512, h = 512;
float f;
float a[w][h][4], b[w][h][4];

void compute() {
   for (int i = 0; i < w; i++)
      for (int j = 0; j < h; j++)
         for (int k = 0; k < 4; k++) {
            a[i][j][k] += f * b[i][j][k];
         }
}

The conversion process has three steps, as Figure 1 illustrates.


Figure 1. The RapidMind conversion workflow

  • First, convert the numerical types used in the kernel to the equivalent RapidMind types.
  • Second, create a RapidMind program object.
  • Third, apply the program object to an array of data. This invokes a parallel computation coordinated by the platform's runtime execution manager.

To enable use of the RapidMind platform, you'll need to include the RapidMind header files. For brevity in these examples, I'll also load the short forms of the macros and specify the global use of the rapidmind namespace:

#include <rapidmind/platform.hpp>
#include <rapidmind/shortcuts.hpp>
using namespace rapidmind;

Now I'll convert the data to use the RapidMind types. Later I'll create a RapidMind program to implement the innermost loop using component-wise operations on the value tuples, so here I'll declare the arrays to hold value tuples of the appropriate length:

const int w = 512, h = 512;
Value1f f;
Array<2,Value4f> a(w,h), b(w,h);

You should build the program object representing the computation during an initialization phase. After you build it, you can use it repeatedly to perform computation. In more complex programs, you normally build program objects in class constructors; here, I'll use a simple initialization function to build the program object and will store it in a global variable. This function should be called during system initialization:

Program compute_prog;
void init_compute_prog () {
   compute_prog = BEGIN {
      In<Value4f> r, s;
      Out<Value4f> t;
      t = r + f * s;
   } END;
}

Note that the program object actually refers to a non-local variable f of type Value1f. When the program object runs on the SPEs, it uses the value this variable holds at that time, just like an ordinary C++ function. The RapidMind platform automatically tracks dependencies between program objects and references to non-local variables declared using RapidMind types and sets up the appropriate communication without further intervention from the developer. The f 1-tuple is also automatically promoted to a 4-tuple by duplicating its scalar value before the component-wise multiplication is applied.
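The promotion of a scalar to a 4-tuple can be spelled out in standard C++. The Vec4 alias and the broadcast(), mul(), add(), and kernel() functions below are hypothetical names for this sketch; on the RapidMind platform the promotion happens implicitly inside the expression r + f * s.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec4 = std::array<float, 4>;

// Promote a scalar to a 4-tuple by duplicating it into every component.
Vec4 broadcast(float s) { return Vec4{{s, s, s, s}}; }

// Component-wise multiplication.
Vec4 mul(const Vec4& a, const Vec4& b) {
    Vec4 out{};
    for (std::size_t i = 0; i < 4; ++i) out[i] = a[i] * b[i];
    return out;
}

// Component-wise addition.
Vec4 add(const Vec4& a, const Vec4& b) {
    Vec4 out{};
    for (std::size_t i = 0; i < 4; ++i) out[i] = a[i] + b[i];
    return out;
}

// Plain C++ analogue of the recorded statement t = r + f * s, with the
// scalar promotion written out explicitly.
Vec4 kernel(const Vec4& r, float f, const Vec4& s) {
    return add(r, mul(broadcast(f), s));
}
```

Each call to kernel() does the work of one pass through the innermost k loop of the original compute() function.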

Finally, the program can be executed whenever necessary:

a = compute_prog(a,b);

This example illustrates one further point. Note that the array a is both an input and an output of the program. The RapidMind platform uses parallel assignment semantics -- inputs are always bound to the "previous" value, not the "new" value.
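For a purely element-wise kernel such as compute_prog, updating in place and updating from the previous value give the same answer. The difference shows up once a kernel reads elements other than its own, as with the random access described earlier. The following standard C++ sketch uses a hypothetical neighbor-reading kernel to contrast the two behaviors; it illustrates the semantics only and makes no claim about how the platform implements them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel: each element reads its own input and its right
// neighbor (wrapping around at the end).
float kernel(const std::vector<float>& in, std::size_t i) {
    return in[i] + in[(i + 1) % in.size()];
}

// Parallel-assignment style: every kernel instance reads the previous
// contents, and the results land in a fresh buffer.
std::vector<float> apply_parallel(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = kernel(in, i);
    return out;
}

// In-place update, for contrast: later elements can observe values that
// earlier iterations have already overwritten.
void apply_in_place(std::vector<float>& a) {
    for (std::size_t i = 0; i < a.size(); ++i) a[i] = kernel(a, i);
}
```

Starting from (1, 2, 3), apply_parallel() produces (3, 5, 4), whereas apply_in_place() produces (3, 5, 6) because the last element sees the already updated first element. Writing a = apply_parallel(a) corresponds to the a = compute_prog(a,b) pattern above.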

If you want to know more

This article gave a brief introduction to the RapidMind platform. For more information, including benchmark results and more detailed code examples, please visit the RapidMind Web site (see Resources), where a number of white papers are available. You can also sign up for an evaluation version of the RapidMind platform, complete with full documentation and sample code for a variety of applications.


Resources

Learn

Get products and technologies

Discuss

