Skip to main content

The little broadband engine that could: Use multiple SPEs for a single task

Vectorizing a single task under the job queue model using multiple SPEs

Peter Seebach, Freelance author, Plethora.net
Peter Seebach
Peter Seebach believes that a sample program is worth at least three articles (so where's his money?).

Summary:  Peter Seebach uses a simple, iterative-function fractal generator program to describe how to use multiple Synergistic Processor Engines (SPEs) to vectorize a single task using the job queue model.

View more content in this series

Date:  18 Sep 2007
Level:  Intermediate
Activity:  2578 views

It's about time to give a more in-depth look at a real program. The program I selected is the right sort of thing to vectorize. I selected a fairly simple fractal generator. You might recognize the algorithm as being the same as the one I used as a Java™ example program in previous articles (see Resources).

This is an iterative function type of fractal. It is most well-known as the generator of those snowflakes people often use to illustrate what a fractal looks like. In general, you start with a series of line segments, and then replace each of them with the same pattern of line segments. This involves a fair amount of calculation, and it chews up a fair amount of memory. This makes it a good candidate to begin applying some of the optimization techniques I have presented so far in this series.

Starting the coding

The code assumes that the definition of a fractal is a series of points starting with {0, 0} and ending with {1, 0}. For my initial setup, I developed a simple version of the algorithm coded in generic C language. This version, running on the PPE, is a good way to perform regression tests. Optimizations often break code, sometimes in subtle ways; so having clean data sets to compare with is useful. Another reason to code in C is that the performance of plain old native C code offers a good minimum performance standard to shoot for in optimized code.


Listing 1. Generating a fractal
                
for (i = 0; i < length - 1; ++i) {
            affine a = affinetransform(old[i], old[i+1]);
            for (j = 0; j < points; ++j) {
                        new[k++] = transform(a, frac[j]);
            }
}

Listing 1 is pretty self-explanatory. For each segment in the previous iteration, generate an affine transform that maps a unit line onto that segment. Then apply it, in turn, to each point in the original structure. If you start with a triangle, this begins the laborious process of turning it into a snowflake, as shown in Figure 1.


Figure 1. A snowflake
A snowflake

The affinetransform() and transform() functions are both trivial, as shown in Listing 2.


Listing 2. Transforms
                
affine
affinetransform(point one, point two) {
            affine a;
            float len = sqrt((two.y - one.y) * (two.y - one.y) +
                                     (two.x - one.x) * (two.x - one.x));
            float t = atan2(two.y - one.y, two.x - one.x);
            float sin_t = sin(t);
            float cos_t = cos(t);
            a.x0 = one.x;
            a.y0 = one.y;
            a.x1 = len * cos_t;
            a.x2 = len * sin_t;
            a.y1 = -len * sin_t;
            a.y2 = len * cos_t;
            return a;
}

point
transform(affine a, point p) {
            point p2;
            p2.x = (a.x1 * p.x) + (a.y1 * p.y) + a.x0;
            p2.y = (a.x2 * p.x) + (a.y2 * p.y) + a.y0;
            return p2;
}

None of this code is particularly complicated, but it is fairly clear what's going on. The focus is on making sure you know what the code does and that it produces the right results. I used gnuplot to plot output data sets so I could see the results: sometimes an error that is not obvious when looking at the numbers is painfully clear when looking at a chart.

The IDL tool

As the code overhead of even a fairly simple function shows, hand-coding to move arguments and data structures to one of the SPEs is a significant amount of work. One simple solution would be to hide the complexity behind a library that can automatically marshal arguments, transmit them, and report back when a function call is completed. The functions to be offloaded to the SPE can be defined through an Interface Definition Language (IDL), which is translated into stub functions and glue code.

The Cell SDK's idl tool provides exactly this. It uses an interface definition language that has its roots in the Distributed Computing Environment (DCE) developed for AIX. While that environment was built for networked distributed computing over RPC, the essential design is similar.

What does the idl tool do? It interprets files written in IDL, and then generates stub code for the defined functions. When you define a function named foo, the generated glue code hides all the overhead and setup code. The provided foo function generated for the PPU calls the stub function for the SPU. In fact, it might be more complicated than this. The SPU might not be able to process an entire array at once because of limited local store. There is code in the library to handle automatic double-buffering and to move data around in the background. You never see this; you just call a function and eventually get results.

This article does not discuss every little experiment I tried along the way. Instead it focuses on the major design ideas as my code evolved from something that could be running on the SPE up through a superior implementation. I used the IDL tool (see sidebar) for this task rather than creating my own code. While it seems likely that custom, hand-tuned code could do a better job, it also seems likely that I could calculate the fractals by hand with a pencil in the time it would take me to get it fully debugged.

Summarizing the early attempts

The obvious annoyance factor of this operation is simply that it is built around an irregular relationship that seems to require nested loops. For each line segment and for each point in the definition, you need to perform a calculation.

My first conclusion was that the easiest thing to do would be to hand the program sets of affine transform values and points, letting it do the hard work of combining them. In other words, I would populate arrays with the results of the affinetransform() function and copies of the fractal points, and then perform transform() on the SPE. This approach would have been, to put it mildly, moderately expensive. The SPE had only to make a single pass through its data sources, but it needed to receive 8 values per pair it generated. It worked.

Unfortunately, the resulting program took substantially more time to run than the original unoptimized program. That was a bit discouraging. It's not just the (admittedly huge) bandwidth cost. It is also that the PPE was doing all the work of packing and unpacking affine transforms and points into vectors. It was better on really large data sets in which the setup costs for the IDL code were harder to spot, but it was still definitely slower. The gap was even larger when compiling without optimization, where speed difference with a factor of two was not uncommon. Ouch.

One obvious target was that in hoping to avoid nested loops on the SPE, I was copying in each affine transform multiple times (once for each fractal point) and similarly that I was copying in each fractal multiple times (once for each line segment). For N line segments and M points, the PPE was generating the whole N*M set of combinations to keep the SPE from having to do any nested loops.

So, the next version still had the PPE provide affine transforms, but the SPE did the nested loops. Of course, this creates a new problem: the data returned to the PPE are out of order. Still, it's easy enough to unpack them in the right order on the PPE. Well, maybe it's not really easy. I hang my head in shame recalling how many tries it took me to get this right.


Listing 3. A simple (?) matter of programming
                
for (j = 0; j < CHUNKSIZE; ++j) {
            int o = (slot * CHUNKSIZE) + 
                        ((j / (PASSES*4)) * (PASSES*4) * points) + 
                        j % (PASSES*4); 
            int m = l * points; 
            for (k = 0; k < points; ++k) { 
                        new[m + k].x = px[o + (k * (PASSES * 4))]; 
                        new[m + k].y = py[o + (k * (PASSES * 4))]; 
            } 
            ++l; 
}

With this change, I managed to get the SPE-enhanced version of the code to run more efficiently for medium-sized data sets—not much more efficiently, but enough to be noticeable. The PASSES constant is the number of vectors being shifted per iteration. For reasons not obvious to me, doing two in a row improved performance. There is actually a subtle bug in here, but thanks to a quirk of the IDL implementation, I was never actually bitten by it.

Even in this version, it's clear that the PPE's use of resources is a serious issue. Furthermore, even though I was beating the PPE version of code with this, the -O3 compiler level reversed my fortunes. At -O3, gcc began inlining function calls, and the loss of function-call overhead left the native code, once again, in the lead. Oops.

Hating trigonometry

My guess was that one of the big holdups was the affine transforms. Not only are they computationally intensive, but their six output values are larger than their four input values. This seemed like a good candidate for migration to the SPE.

Unfortunately, the atan2() function [in many cases, atan2(y, x) is the same as atan(y/x)] is not provided in the SPE's simdmath library. There is documentation for it, but it is not available right now. But, thanks to the completeness of Berkeley man pages, I found a nice, clear definition that I could implement. (The Linux® man page said only "the signs of both arguments are used to determine the quadrant of the result," which didn't help me much.) So, I wrote a nice, simple atan2f4() function that was intended to be essentially a wrapper around the simpler atanf4() function provided by the library.

The results surprised me. I started getting huge numbers in the trillions. Now, I'm not exactly a numerical analysis sort of person. The reasonable inference was an error in my code. After some work, I tracked down a typo in the simdmath library that caused erratic behavior of atan() for some values (see the sidebar for details).

A subtle bug in atanf4

I had some very weird results while trying to implement my atan2f4() function. I'm not normally a floating-point genius, so I made the reasonable assumption that the problem was in my code. It wasn't. The particularly unnerving thing was that it was unpredictable. I wouldn't always get the same errors for the same values! The errors were obvious, though. Values on the order of three trillion are a little outside the normal range for an arctangent.

The problem was a bug in the SIMD math library that came with the version of the SDK I was using at the time. The atanf4() function uses the constant SM_PI_2. Because it operates on multiple objects, of course, a vector must be created. The code for this was bias = (vector float)spu_or(sign, (vector unsigned int)(spu_splats(SM_PI_2)));

The use of spu_or to set the sign bit was clever, but it required that the vector be treated as a vector of unsigned ints, which hid the type mismatch. Because SM_PI_2 is a double-precision float, the generic spu_splats produced a vector containing two double-precision floats instead of four single-precision floats. So, the bias value was calculated incorrectly for the second and fourth slots in the vector, but this only showed up for some values! I believe the bug is fixed now.

After fixing that, I had a working program. This one still had the PPE pass in two arrays: one array of starting points for line segments and one array of ending points.

Now, if you are astute, you realize that because all the line segments are connected, the array of ending points is the same array offset by one. However, the SPE has no capacity for unaligned access. That makes it a little harder than it might be otherwise. So rather than fuss with it, I just passed in four values instead of six per point, and I saw a noticeable speed increase. In fact, the new version of the program was measurably faster than the PPE-only code, even with the compiler inlining. Finally!

This led to some hard thinking about argument passing. While it is true that the SPE is not very fast at unaligned access, it might still be faster to pass in only one array. My next revision of the program passed in a single array of points, but it passed in one more point than before. The following trivial code was enough to patch up the data:


Listing 4. An off-by-one correctness
                
nx = spu_or(spu_slqwbyte(vox[0], 4), spu_rlmaskqwbyte(vox[1], (signed) -12));
ny = spu_or(spu_slqwbyte(voy[0], 4), spu_rlmaskqwbyte(voy[1], (signed) -12));

The vox and voy variables are vector pointers that are walked through the array of old points. This code shifts the first vector to the left by 4 bytes and to the second right by 12. The code then mashes the bits together. This produces yet another noticeable improvement in performance. In fact, on larger data sets, my code was over twice the speed of the PPE-only code I started with.

Completing the final optimization

Throughout all of my debugging, I noticed a tendency for duplicate values to show up in my fractal calculations. In fact, the deeper the fractal went, the more duplicates there were. Early on, I had briefly considered stripping the duplicates out, but I never got around to it. However, at one point it suddenly occurred to me how these duplicates were generated.

Because the definition ends at {1, 0}, the last point of the fractal definition always transforms to the end of a given line segment. Because the definition starts at {0, 0}, the first point of the fractal definition always transforms to the beginning of a line segment. So, because every line segment begins at the end of the previous one, each iteration through produces a pair of duplicate points. Worse, each iteration after that generates a whole set of additional duplicates, as the generator dutifully transforms each point in its definition onto a single point. That turns out to be a fairly large number of points, especially with a bit of recursion.

Following is a little table of the number of points used by the simple, five-point snowflake fractal and by a more complicated 7-point double snowflake design:

Depth:             5 fractal points        7 fractal points  
--------------------------------------------------------------
1                    5                        7
2                    20                       42
3                    95                       287
4                    470                      2002
5                    2345                     14007
6                    11720                    98042
7                    58595                    686287

Now, imagine that instead of rendering that last point of each pattern, you omit it. Of course, this would leave you without the final {1,0} on each line, but it's easy enough to add it, because you know exactly where it goes each time. That simple change produces these numbers:

Depth:             5 fractal points        7 fractal points  
--------------------------------------------------------------
1                    5                        7
2                    17                       37
3                    65                       217
4                    257                      1297
5                    1025                     7777
6                    4097                     46657
7                    16385                    279937

As you note, these quickly fall well below half the size of their previous counterparts and without a single actual point of data lost. (And yes, the five-point fractal's pattern is powers of four, plus one. You get no points for guessing the seven-point fractal's pattern.)

Now, this is rather embarrassing. All that previous work, from setting up the SPE code to hand-coding my atan2f4() function to debugging the SIMD math library—all that gained me was about a factor of three performance increase, and it cost me a full working day to do it. By contrast, eliminating the harmless duplicate numbers on comparable data set sizes (I was using depths of 7 or so with the seven-point fractal) gained me a factor of three performance increase, and it took me at most fifteen minutes to debug.

More importantly, because my optimization reduced the size of the working set so much, it enabled me to start working with substantially larger data sets for testing. My Cell/B.E. system has fairly limited memory, and many of my early tests were crippled by the huge performance impact of swapping, which made about a factor of 10 or 20 difference in performance.

So, there is one more lesson of Cell/B.E. performance optimization. As in any programming exercise, you often find that simply giving your algorithm more than a moment's thought pays off just as much as hours spent working away over a hot vector unit. A lesson to remember, certainly.


Conclusion

Using a fractal generating program, you can see what a difference various optimization techniques can make.


Resources

Learn

Get products and technologies

Discuss

About the author

Peter Seebach

Peter Seebach believes that a sample program is worth at least three articles (so where's his money?).

Comments (Undergoing maintenance)



Trademarks  |  My developerWorks terms and conditions

Help: Update or add to My dW interests

What's this?

This little timesaver lets you update your My developerWorks profile with just one click! The general subject of this content (AIX and UNIX, Information Management, Lotus, Rational, Tivoli, WebSphere, Java, Linux, Open source, SOA and Web services, Web development, or XML) will be added to the interests section of your profile, if it's not there already. You only need to be logged in to My developerWorks.

And what's the point of adding your interests to your profile? That's how you find other users with the same interests as yours, and see what they're reading and contributing to the community. Your interests also help us recommend relevant developerWorks content to you.

View your My developerWorks profile

Return from help

Help: Remove from My dW interests

What's this?

Removing this interest does not alter your profile, but rather removes this piece of content from a list of all content for which you've indicated interest. In a future enhancement to My developerWorks, you'll be able to see a record of that content.

View your My developerWorks profile

Return from help

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Multicore acceleration
ArticleID=254644
ArticleTitle=The little broadband engine that could: Use multiple SPEs for a single task
publish-date=09182007
author1-email=developerworks@seebs.plethora.net
author1-email-cc=developerworks@seebs.plethora.net

My developerWorks community

Tags

Help
Use the search field to find all types of content in My developerWorks with that tag.

Use the slider bar to see more or fewer tags.

Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere).

My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Use the search field to find all types of content in My developerWorks with that tag. Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere). My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Special offers