 | Level: Intermediate Peter Seebach (developerworks@seebs.plethora.net), Freelance author, Plethora.net
07 Aug 2007 The SIMD-only architecture of the Cell Broadband Engine™ (Cell/B.E.)
processor's Synergistic Processor Engine (SPE) is an architecture that has no scalar
operations -- all operations are performed on 16-byte vectors. Design code that helps
the Cell/B.E. compiler make efficient use of this architecture.
The Cell/B.E. processor's SPEs offer a fairly unusual architecture, unique because they have no scalar operations -- all operations are performed
on 16-byte vectors. The performance trade-off favors many kinds of processing and isn't
recommended for some types; this poses potential challenges to developers.
In this article, I'll take a brief
break from communication models to look specifically at code designed to run
on the SPE.
A number of good overviews of the SPE architecture are available (see Resources). The SPE architecture is built to give good performance
in a small die area. One of the ways it accomplishes this is by
simplifying its instruction set and instruction-processing capabilities
dramatically, offloading some work to the compiler or developer. Even a very
good compiler benefits a lot from efficient use of the SPE.
The following section looks at some of the tasks that are designed for the SPE's structure.
Good tasks for the SPE
To start with, some tasks are simply better matches for the SPE than others.
A task which is easily parallelized and runs naturally on 32-bit objects in
groups will be a very comfortable fit. A task which requires a lot of
branching and works primarily with individual scalars will be a poor fit; offloading
such a task may well still be better than using only the PPE, but performance won't be anywhere
near the theoretical computational power of the processor.
The SPE does not have native scalar instructions in the way that a
conventional processor usually would. While the PPE has both scalar and SIMD
operations available to it, the SPE has only SIMD operations. No
single instruction adds a pair of 32-bit words together. There are,
however, a number of ways of approximating this functionality in the cases
when it's necessary.
The fastest solution is simply to store three unused 32-bit words adjacent
to each of the 32-bit words you need to operate on, perform SIMD instructions,
then ignore the other three words.
When space is at a premium (which it can
be given the 256KB size of local store), another option is to use masking
operations to leave other words alone when modifying a given word of interest.
Each of these options will be a good match for some applications and a poor
match for others. Since data and code share the local store, it's rarely
justifiable to use masking operations unless the object being modified is one
of a largish set; otherwise, the space savings of storing multiple objects
in a given 16-byte word are wiped out quickly by the increased code size.
Either way, since there are no native scalar instructions, parallel operations
are always free on the SPE; any time you can accomplish multiple similar
tasks at once, you should do so.
Using selection operations
On a scalar processor, it is often (though not always) rewarding to avoid
performing unnecessary operations. On an SIMD processor, it is often better
to perform the operations unconditionally, then use masking operations to
merge modified and unmodified values.
As a trivial example, consider a function that takes the absolute
value of an array full of numbers:
Listing 1. Look at those abs()!
void
int_abs(int a[], int len) {
int i;
for (i = 0; i < len; ++i)
if (a[i] < 0)
a[i] = 0 - a[i];
}
|
For the sake of brevity, assume that the values are always small enough in
magnitude that this works. This solution is simple enough, but implementing
it on the SPE would be painfully slow -- every iteration through the loop
involves a branch, and branches are not friendly to the SPE's
instruction-decoding buffers.
Written using spu intrinsics, the branch can be eliminated:
Listing 2. Using selection instead of branches
void
int_abs(vector signed int a[], int len) {
int i;
vector signed int z = spu_splats(0);
for (i = 0; i < len; ++i) {
vector int vec_abs = spu_sub(z, a[i]);
a[i] = spu_sel(a[i], vec_abs, spu_cmpgt(z, a[i]));
}
}
|
This may look complicated, but it really isn't bad at all. Rather than
branching between two code paths, the code simply calculates the inverse
of every item in a given vector, then uses a selection operation to mix
the unmodified results with the inverted results. Comparison operators create
special selection masks which the spu_sel()
intrinsic uses to select components from the two vectors given as its
other arguments. In this example, for every element where zero is greater
than the corresponding element of a[i], that element is replaced with the
corresponding element of vec_abs.
Iterative tasks
Some tasks are inherently iterative â succeeding
steps depend on the results of earlier steps. While you cannot parallelize
the steps of a single such task, it may be possible to perform several such
tasks at once. In some cases, the exact task to be performed might vary between
items, but it's cheap enough to perform two different tasks and use a
selection operator to merge the results.
 |
Collatz conjecture
The Collatz conjecture (see resources) is an unsolved conjecture in mathematics named after Lothar Collatz who first proposed it in 1937. The conjecture is also known as the 3n + 1 conjecture, the Ulam conjecture (after Stanislaw Ulam), the Syracuse problem, as the hailstone sequence or hailstone numbers, or as Wondrous numbers as per Gödel, Escher, Bach. It asks whether a certain kind of number sequence always ends in the same way, regardless of the starting number. Although the conjecture has not been proven, most mathematicians who have looked into the problem believe intuitively that the conjecture is true due to one or both of two reasons:
-
Experimental evidence. The conjecture has been checked by computer for all start values up
to 10 X 258 â 2.88 X 1018. While impressive, such computer bounds are of very limited evidential value since several conjectures have turned out to have exceptionally large-valued counterexamples (such as the Pólya conjecture).
- Probabilistic evidence. If you consider only the odd numbers in the sequence generated by the Collatz process, then you can argue that on average (specifically, the geometric mean of the ratios) the next odd number should be about 3/4 of the previous one which suggests that they should decrease in the long run (although this is not evidence against cycles, only against divergence).
|
|
It's hard to predict in advance which kinds of tasks will be good examples
of these, but a mathematically interesting one I'm fond of provides a good
example of the genre. A mathematical function has been defined, with f(x)
equal to (3x+1) when x is odd and (x/2) when x is even. The Collatz
conjecture is that, for all integer x, iterating on this function long enough
produces the value 1. It has been verified for all x up to 10x2^58, meaning
that it is true for any value you can store in a 32-bit signed value.
The number of iterations it takes to reach the value 1 is widely variable, and
calculating the number of iterations a given value needs is an iterative
process:
Listing 3. Scratching that iterative process
int
series(int x) {
int i = 0;
while (x != 1) {
++i;
if (x % 2 == 0) {
x = x / 2;
} else {
x = x * 3 + 1;
}
}
return i;
}
|
While this is arguably a contrived example, it serves as an interesting
example for parallelization in that you cannot express in code the number
of iterations needed, and the number of iterations needed for each item in
a given vector may be different.
The following implementation is hardly a pinnacle of optimization, but does
illustrate key techniques.
Listing 4. The Collatz sequence done on a vector
vector signed int
collatz(vector signed int x) {
vector signed int unity = spu_splats(1);
vector signed int i = spu_splats(0);
vector signed int timesthreeplusone;
vector signed int divtwo;
vector unsigned int done;
vector unsigned int odd;
vector unsigned int donesum = spu_splats(0u);
done = spu_cmpeq(x, unity);
while (spu_extract(donesum, 0) != -4) {
/* add one to i for values which aren't 1 */
i = spu_sel(spu_add(i, unity), i, done);
/* odd values have 1 bits set */
odd = spu_cmpeq(spu_and(x, unity), unity);
timesthreeplusone = spu_addx(x, spu_add(x, x), unity);
divtwo = spu_convts(spu_convtf(x, 1), 0);
x = spu_sel(spu_sel(divtwo, timesthreeplusone, odd), x, done);
done = spu_cmpeq(x, unity);
donesum = spu_add(done, spu_rlqwbyte(done, 8));
donesum = spu_add(donesum, spu_rlqwbyte(donesum, 4));
}
return i;
}
|
This function uses a special vector named unity containing the constant value
1 in a number of places. It is used both to increment members of the vector i, counting
iterations performed on each member of the input vector, and to check for completion. The
increment operation uses spu_sel() to merge i with the
incremented vector. The done vector is also used to determine when the loop is
complete. Each member of the vector has all bits zero, or all bits 1. Summing
the vector with itself rotated by 8 bytes, and then summing the result with itself
rotated by 4 bytes, results in a vector where each of the four components is
the sum of all four members of the vector named done. If all four have all bits
1, this compares equal to -4.
The next stage is to calculate the function.
There are three cases: A given vector may be already done (in which case, it
is left alone), or it may be odd, or it may be even. Results are calculated
for both the odd and even cases, then selected together based on the vector
odd and the results merged into x for the positions which were not previously
identified as done.
This function is not maximally efficient in several ways, but it is
substantially more efficient than performing scalar operations on each value
in turn. It may seem that a huge number of operations are wasted; however,
on the SPE the scalar code continues to perform parallel operations on
multiple data objects, too, so there's no real downside.
The only additional
cost the parallelized version imposes is the selection operation to modify
only the values which were not already done. Everything else would happen
anyway in the scalar code: The division by two would affect all four words
at once (even though three are empty), and the same would happen for the other
calculations. It might make more sense to use a right shift operator instead
of using the conversion to float (scaling by 2^1) and conversion back.
Thinking sideways
Trying to get your head around what's efficient when working on parallel code
is a bit of a stretch for people used to scalar operations. It's easy enough
to modify an algorithm to use a few parallel operations; what's tricky is
using them all the time.
On a scalar processor, branching is often the only
sane way to express something.
Note that some
out-of-order execution systems will calculate a result before the branch that
determines whether it will be used; this is surprisingly similar to the effect
of the selection operators on the SPE.
On a scalar processor, performing
operations you don't need is almost always a performance penalty. On the
SPE, it may often be a substantial performance boost; code which uses
selection operators instead of branches is generally smaller and faster.
Skipping an operation using a branch is more expensive than performing it
and discarding the results.
If it seems as though you can't possibly perform related operations in
parallel, start thinking about whether there's a higher-level opportunity
to parallelize. One example I found striking was published by developerWorks in
a previous article on maximizing the power of the Cell Broadband Engine processor (see resources).
A common problem in 3D work is subdividing polygons. One strategy for
parallelizing these would be to try to unroll a loop for processing
triangles and this does produce some speed improvement at a noticeable
cost in code size. However, performing a much simpler algorithm on multiple
triangles in parallel was much faster and used substantially less code space.
It's hard to parallelize the operation on a single triangle, but it's easy
to perform the simple (and therefore probably less buggy, as well!) algorithm
on multiple triangles in parallel.
The SPU provides excellent resources for shuffling data items, making it easy
to convert vector-across objects (where an object's components are all stored
as a single vector) into parallel arrays (where corresponding components of
several objects are stored in a single vector). Even with the need to
shuffle, the gain in performance and algorithmic simplicity is quite
attractive.
Finally, last but not least: As has been the case with every optimization
effort since the first person wondered whether an algorithm could maybe run
faster, focus on correctness first and don't spend a lot of time optimizing
something until you have real-world data showing that optimizing it would matter.
Developer time is still at a premium and a blindingly fast algorithm which
doesn't work is still useless.
Next up: A real-world program
In the next installment I'll provide a more in-depth look at a real program -- a simple, iterative-function fractal generator -- then I'll explain how to use multiple SPEs to vectorize a single task employing the job queue model.
Resources Learn
- Use an RSS
feed to request notification for the upcoming articles in this series. (Find out more about RSS feeds of developerWorks content.)
- Check out the other articles in the "Little broadband engine that could" series.
-
See Wikipedia's entry on the Collatz conjecture.
-
"Maximizing the
power of the Cell Broadband Engine processor: 25 tips to optimal application
performance" (developerWorks, June 2006) demonstrates SPE programming tips (it's where the one on subdividing triangles comes from).
-
The Unrolling AltiVec series (developerWorks, 2005) is an oldie but goodie that exposes you to the various guises of this vector processing SIMD technology.
-
Jonathon Bartlett's series on "Programming high performance applications on the Cell/B.E. processor" (developerWorks, January 2007 to present) provides an intro to Linux on the PS3, programming the PS3's SPE, an intro to the SPU, SPU performance programming, C/C++ SPU programming, and managing smart buffer DMA transfers.
-
You know, to use the Cell/B.E. SDK 2.1, you'll have to be running Fedora Core 6 -- this quick install guide (developerWorks, April 2007) should help you get FC6 up and running.
-
The
IBM Semiconductor Solutions Technical Library
Cell Broadband Engine
documentation section contains a wealth of downloadable manuals,
specifications, and much more.
-
Find all Cell/B.E.-related articles, discussion forums, downloads,
and more at the IBM
developerWorks Cell
Broadband Engine resource center: your definitive resource for all
things Cell/B.E.
-
The IBM microNews newsletter delivers Cell/B.E happenings to your desktop twice a month.
Get products and technologies
Discuss
About the author  | 
|  | Peter Seebach has a better motto than the US Postal Service: "He always delivers!" That often means "better ways for more effective communication." |
Rate this page
|  |