IBM®
Skip to main content
    Country/region [select]      Terms of use
 
 
    
     Home      Products      Services & solutions      Support & downloads      My account     
The little broadband engine that could: Why is my scalar code so slow?
skip to main content

developerWorks  >  Power Architecture technology  >

The little broadband engine that could: Why is my scalar code so slow?

Discover which tasks are best suited for the SPE and how to simulate scalar functionality when needed

developerWorks
Document options

Document options requiring JavaScript are not displayed

Discuss


Rate this page

Help us improve this content


Level: Intermediate

Peter Seebach (developerworks@seebs.plethora.net), Freelance author, Plethora.net

07 Aug 2007

The SIMD-only architecture of the Cell Broadband Engine™ (Cell/B.E.) processor's Synergistic Processor Engine (SPE) is an architecture that has no scalar operations -- all operations are performed on 16-byte vectors. Design code that helps the Cell/B.E. compiler make efficient use of this architecture.

The Cell/B.E. processor's SPEs offer a fairly unusual architecture, unique because they have no scalar operations -- all operations are performed on 16-byte vectors. The performance trade-off favors many kinds of processing and isn't recommended for some types; this poses potential challenges to developers.

In this article, I'll take a brief break from communication models to look specifically at code designed to run on the SPE.

A number of good overviews of the SPE architecture are available (see Resources). The SPE architecture is built to give good performance in a small die area. One of the ways it accomplishes this is by simplifying its instruction set and instruction-processing capabilities dramatically, offloading some work to the compiler or developer. Even a very good compiler benefits a lot from efficient use of the SPE.

The following section looks at some of the tasks that are designed for the SPE's structure.

Good tasks for the SPE

To start with, some tasks are simply better matches for the SPE than others. A task which is easily parallelized and runs naturally on 32-bit objects in groups will be a very comfortable fit. A task which requires a lot of branching and works primarily with individual scalars will be a poor fit; offloading such a task may well still be better than using only the PPE, but performance won't be anywhere near the theoretical computational power of the processor.

The SPE does not have native scalar instructions in the way that a conventional processor usually would. While the PPE has both scalar and SIMD operations available to it, the SPE has only SIMD operations. No single instruction adds a pair of 32-bit words together. There are, however, a number of ways of approximating this functionality in the cases when it's necessary.

The fastest solution is simply to store three unused 32-bit words adjacent to each of the 32-bit words you need to operate on, perform SIMD instructions, then ignore the other three words.

When space is at a premium (which it can be given the 256KB size of local store), another option is to use masking operations to leave other words alone when modifying a given word of interest.

Each of these options will be a good match for some applications and a poor match for others. Since data and code share the local store, it's rarely justifiable to use masking operations unless the object being modified is one of a largish set; otherwise, the space savings of storing multiple objects in a given 16-byte word are wiped out quickly by the increased code size.

Either way, since there are no native scalar instructions, parallel operations are always free on the SPE; any time you can accomplish multiple similar tasks at once, you should do so.

Using selection operations

On a scalar processor, it is often (though not always) rewarding to avoid performing unnecessary operations. On an SIMD processor, it is often better to perform the operations unconditionally, then use masking operations to merge modified and unmodified values.

As a trivial example, consider a function that takes the absolute value of an array full of numbers:


Listing 1. Look at those abs()!
                
     void
     int_abs(int a[], int len) {
           int i;

           for (i = 0; i < len; ++i)
                 if (a[i] < 0)
                       a[i] = 0 - a[i];
     }
      

For the sake of brevity, assume that the values are always small enough in magnitude that this works. This solution is simple enough, but implementing it on the SPE would be painfully slow -- every iteration through the loop involves a branch, and branches are not friendly to the SPE's instruction-decoding buffers.

Written using spu intrinsics, the branch can be eliminated:


Listing 2. Using selection instead of branches
                
             void
             int_abs(vector signed int a[], int len) {
                         int i;
                         vector signed int z = spu_splats(0);
                         for (i = 0; i < len; ++i) {
                                     vector int vec_abs = spu_sub(z, a[i]);
                                     a[i] = spu_sel(a[i], vec_abs, spu_cmpgt(z, a[i]));
                         }
             }

This may look complicated, but it really isn't bad at all. Rather than branching between two code paths, the code simply calculates the inverse of every item in a given vector, then uses a selection operation to mix the unmodified results with the inverted results. Comparison operators create special selection masks which the spu_sel() intrinsic uses to select components from the two vectors given as its other arguments. In this example, for every element where zero is greater than the corresponding element of a[i], that element is replaced with the corresponding element of vec_abs.

Iterative tasks

Some tasks are inherently iterative — succeeding steps depend on the results of earlier steps. While you cannot parallelize the steps of a single such task, it may be possible to perform several such tasks at once. In some cases, the exact task to be performed might vary between items, but it's cheap enough to perform two different tasks and use a selection operator to merge the results.

Collatz conjecture
The Collatz conjecture (see resources) is an unsolved conjecture in mathematics named after Lothar Collatz who first proposed it in 1937. The conjecture is also known as the 3n + 1 conjecture, the Ulam conjecture (after Stanislaw Ulam), the Syracuse problem, as the hailstone sequence or hailstone numbers, or as Wondrous numbers as per Gödel, Escher, Bach. It asks whether a certain kind of number sequence always ends in the same way, regardless of the starting number. Although the conjecture has not been proven, most mathematicians who have looked into the problem believe intuitively that the conjecture is true due to one or both of two reasons:
  1. Experimental evidence. The conjecture has been checked by computer for all start values up to 10 X 258 ≈ 2.88 X 1018. While impressive, such computer bounds are of very limited evidential value since several conjectures have turned out to have exceptionally large-valued counterexamples (such as the Pólya conjecture).
  2. Probabilistic evidence. If you consider only the odd numbers in the sequence generated by the Collatz process, then you can argue that on average (specifically, the geometric mean of the ratios) the next odd number should be about 3/4 of the previous one which suggests that they should decrease in the long run (although this is not evidence against cycles, only against divergence).

It's hard to predict in advance which kinds of tasks will be good examples of these, but a mathematically interesting one I'm fond of provides a good example of the genre. A mathematical function has been defined, with f(x) equal to (3x+1) when x is odd and (x/2) when x is even. The Collatz conjecture is that, for all integer x, iterating on this function long enough produces the value 1. It has been verified for all x up to 10x2^58, meaning that it is true for any value you can store in a 32-bit signed value. The number of iterations it takes to reach the value 1 is widely variable, and calculating the number of iterations a given value needs is an iterative process:


Listing 3. Scratching that iterative process
                
     int
     series(int x) {
            int i = 0;

            while (x != 1) {
                  ++i;
                  if (x % 2 == 0) {
                        x = x / 2;
                  } else {
                        x = x * 3 + 1;
                  }
             }
             return i;
     }

While this is arguably a contrived example, it serves as an interesting example for parallelization in that you cannot express in code the number of iterations needed, and the number of iterations needed for each item in a given vector may be different.

The following implementation is hardly a pinnacle of optimization, but does illustrate key techniques.


Listing 4. The Collatz sequence done on a vector
                
vector signed int
collatz(vector signed int x) {
        vector signed int unity = spu_splats(1);
        vector signed int i = spu_splats(0);
        vector signed int timesthreeplusone;
        vector signed int divtwo;
        vector unsigned int done;
        vector unsigned int odd;
		 vector unsigned int donesum = spu_splats(0u);

        done = spu_cmpeq(x, unity);
        while (spu_extract(donesum, 0) != -4) {
                /* add one to i for values which aren't 1 */
                i = spu_sel(spu_add(i, unity), i, done);

                /* odd values have 1 bits set */
                odd = spu_cmpeq(spu_and(x, unity), unity);
                timesthreeplusone = spu_addx(x, spu_add(x, x), unity);
                divtwo = spu_convts(spu_convtf(x, 1), 0);
                x = spu_sel(spu_sel(divtwo, timesthreeplusone, odd), x, done);
                done = spu_cmpeq(x, unity);
                donesum = spu_add(done, spu_rlqwbyte(done, 8));
                donesum = spu_add(donesum, spu_rlqwbyte(donesum, 4));
        }
        return i;
}

This function uses a special vector named unity containing the constant value 1 in a number of places. It is used both to increment members of the vector i, counting iterations performed on each member of the input vector, and to check for completion. The increment operation uses spu_sel() to merge i with the incremented vector. The done vector is also used to determine when the loop is complete. Each member of the vector has all bits zero, or all bits 1. Summing the vector with itself rotated by 8 bytes, and then summing the result with itself rotated by 4 bytes, results in a vector where each of the four components is the sum of all four members of the vector named done. If all four have all bits 1, this compares equal to -4.

The next stage is to calculate the function. There are three cases: A given vector may be already done (in which case, it is left alone), or it may be odd, or it may be even. Results are calculated for both the odd and even cases, then selected together based on the vector odd and the results merged into x for the positions which were not previously identified as done.

This function is not maximally efficient in several ways, but it is substantially more efficient than performing scalar operations on each value in turn. It may seem that a huge number of operations are wasted; however, on the SPE the scalar code continues to perform parallel operations on multiple data objects, too, so there's no real downside.

The only additional cost the parallelized version imposes is the selection operation to modify only the values which were not already done. Everything else would happen anyway in the scalar code: The division by two would affect all four words at once (even though three are empty), and the same would happen for the other calculations. It might make more sense to use a right shift operator instead of using the conversion to float (scaling by 2^1) and conversion back.

Thinking sideways

Trying to get your head around what's efficient when working on parallel code is a bit of a stretch for people used to scalar operations. It's easy enough to modify an algorithm to use a few parallel operations; what's tricky is using them all the time.

On a scalar processor, branching is often the only sane way to express something. Note that some out-of-order execution systems will calculate a result before the branch that determines whether it will be used; this is surprisingly similar to the effect of the selection operators on the SPE.

On a scalar processor, performing operations you don't need is almost always a performance penalty. On the SPE, it may often be a substantial performance boost; code which uses selection operators instead of branches is generally smaller and faster. Skipping an operation using a branch is more expensive than performing it and discarding the results.

If it seems as though you can't possibly perform related operations in parallel, start thinking about whether there's a higher-level opportunity to parallelize. One example I found striking was published by developerWorks in a previous article on maximizing the power of the Cell Broadband Engine processor (see resources). A common problem in 3D work is subdividing polygons. One strategy for parallelizing these would be to try to unroll a loop for processing triangles and this does produce some speed improvement at a noticeable cost in code size. However, performing a much simpler algorithm on multiple triangles in parallel was much faster and used substantially less code space. It's hard to parallelize the operation on a single triangle, but it's easy to perform the simple (and therefore probably less buggy, as well!) algorithm on multiple triangles in parallel.

The SPU provides excellent resources for shuffling data items, making it easy to convert vector-across objects (where an object's components are all stored as a single vector) into parallel arrays (where corresponding components of several objects are stored in a single vector). Even with the need to shuffle, the gain in performance and algorithmic simplicity is quite attractive.

Finally, last but not least: As has been the case with every optimization effort since the first person wondered whether an algorithm could maybe run faster, focus on correctness first and don't spend a lot of time optimizing something until you have real-world data showing that optimizing it would matter. Developer time is still at a premium and a blindingly fast algorithm which doesn't work is still useless.

Next up: A real-world program

In the next installment I'll provide a more in-depth look at a real program -- a simple, iterative-function fractal generator -- then I'll explain how to use multiple SPEs to vectorize a single task employing the job queue model.



Resources

Learn

Get products and technologies

Discuss


About the author

Peter Seebach

Peter Seebach has a better motto than the US Postal Service: "He always delivers!" That often means "better ways for more effective communication."




Rate this page


Please take a moment to complete this form to help us better serve you.



YesNoDon't know
 


 


12345
Not
useful
Extremely
useful
 


Back to top



    About IBMPrivacyContact