It's about time to give a more in-depth look at a real program. The program I selected is the right sort of thing to vectorize. I selected a fairly simple fractal generator. You might recognize the algorithm as being the same as the one I used as a Java™ example program in previous articles (see Resources).
This is an iterative function type of fractal. It is most well-known as the generator of those snowflakes people often use to illustrate what a fractal looks like. In general, you start with a series of line segments, and then replace each of them with the same pattern of line segments. This involves a fair amount of calculation, and it chews up a fair amount of memory. This makes it a good candidate to begin applying some of the optimization techniques I have presented so far in this series.
The code assumes that the definition of a fractal is a series of points starting with {0, 0} and ending with {1, 0}. For my initial setup, I developed a simple version of the algorithm coded in generic C language. This version, running on the PPE, is a good way to perform regression tests. Optimizations often break code, sometimes in subtle ways; so having clean data sets to compare with is useful. Another reason to code in C is that the performance of plain old native C code offers a good minimum performance standard to shoot for in optimized code.
Listing 1. Generating a fractal
for (i = 0; i < length - 1; ++i) {
affine a = affinetransform(old[i], old[i+1]);
for (j = 0; j < points; ++j) {
new[k++] = transform(a, frac[j]);
}
}
|
Listing 1 is pretty self-explanatory. For each segment in the previous iteration, generate an affine transform that maps a unit line onto that segment. Then apply it, in turn, to each point in the original structure. If you start with a triangle, this begins the laborious process of turning it into a snowflake, as shown in Figure 1.
Figure 1. A snowflake
The affinetransform() and
transform() functions are both trivial, as shown in
Listing 2.
Listing 2. Transforms
affine
affinetransform(point one, point two) {
affine a;
float len = sqrt((two.y - one.y) * (two.y - one.y) +
(two.x - one.x) * (two.x - one.x));
float t = atan2(two.y - one.y, two.x - one.x);
float sin_t = sin(t);
float cos_t = cos(t);
a.x0 = one.x;
a.y0 = one.y;
a.x1 = len * cos_t;
a.x2 = len * sin_t;
a.y1 = -len * sin_t;
a.y2 = len * cos_t;
return a;
}
point
transform(affine a, point p) {
point p2;
p2.x = (a.x1 * p.x) + (a.y1 * p.y) + a.x0;
p2.y = (a.x2 * p.x) + (a.y2 * p.y) + a.y0;
return p2;
}
|
None of this code is particularly complicated, but it is fairly clear what's going on. The focus is on making sure you know what the code does and that it produces the right results. I used gnuplot to plot output data sets so I could see the results: sometimes an error that is not obvious when looking at the numbers is painfully clear when looking at a chart.
This article does not discuss every little experiment I tried along the way. Instead it focuses on the major design ideas as my code evolved from something that could be running on the SPE up through a superior implementation. I used the IDL tool (see sidebar) for this task rather than creating my own code. While it seems likely that custom, hand-tuned code could do a better job, it also seems likely that I could calculate the fractals by hand with a pencil in the time it would take me to get it fully debugged.
Summarizing the early attempts
The obvious annoyance factor of this operation is simply that it is built around an irregular relationship that seems to require nested loops. For each line segment and for each point in the definition, you need to perform a calculation.
My first conclusion was that the easiest thing to do would be to hand the
program sets of affine transform values and points, letting it do the hard work of
combining them. In other words, I would populate arrays with the results of the
affinetransform() function and copies of the fractal
points, and then perform transform() on the SPE. This
approach would have been, to put it mildly, moderately expensive. The SPE had only
to make a
single pass through its data sources, but it needed to receive 8 values per pair
it generated. It worked.
Unfortunately, the resulting program took substantially more time to run than the original unoptimized program. That was a bit discouraging. It's not just the (admittedly huge) bandwidth cost. It is also that the PPE was doing all the work of packing and unpacking affine transforms and points into vectors. It was better on really large data sets in which the setup costs for the IDL code were harder to spot, but it was still definitely slower. The gap was even larger when compiling without optimization, where speed difference with a factor of two was not uncommon. Ouch.
One obvious target was that in hoping to avoid nested loops on the SPE, I was copying in each affine transform multiple times (once for each fractal point) and similarly that I was copying in each fractal multiple times (once for each line segment). For N line segments and M points, the PPE was generating the whole N*M set of combinations to keep the SPE from having to do any nested loops.
So, the next version still had the PPE provide affine transforms, but the SPE did the nested loops. Of course, this creates a new problem: the data returned to the PPE are out of order. Still, it's easy enough to unpack them in the right order on the PPE. Well, maybe it's not really easy. I hang my head in shame recalling how many tries it took me to get this right.
Listing 3. A simple (?) matter of programming
for (j = 0; j < CHUNKSIZE; ++j) {
int o = (slot * CHUNKSIZE) +
((j / (PASSES*4)) * (PASSES*4) * points) +
j % (PASSES*4);
int m = l * points;
for (k = 0; k < points; ++k) {
new[m + k].x = px[o + (k * (PASSES * 4))];
new[m + k].y = py[o + (k * (PASSES * 4))];
}
++l;
}
|
With this change, I managed to get the SPE-enhanced version of the code to run more
efficiently for medium-sized data sets—not much more
efficiently, but enough to be noticeable. The PASSES
constant is the number of vectors being shifted per iteration. For reasons not
obvious to me, doing two in a row improved performance. There is actually a subtle
bug in here, but thanks to a quirk of the IDL implementation, I was never actually
bitten by it.
Even in this version, it's clear that the PPE's use of resources is a serious issue. Furthermore, even though I was beating the PPE version of code with this, the -O3 compiler level reversed my fortunes. At -O3, gcc began inlining function calls, and the loss of function-call overhead left the native code, once again, in the lead. Oops.
My guess was that one of the big holdups was the affine transforms. Not only are they computationally intensive, but their six output values are larger than their four input values. This seemed like a good candidate for migration to the SPE.
Unfortunately, the atan2() function [in many cases,
atan2(y, x) is the same as
atan(y/x)] is not provided in the SPE's simdmath
library. There is documentation for it, but it is not available right now. But,
thanks to the completeness of Berkeley man pages, I found a nice, clear definition
that I could implement. (The Linux® man page said only "the signs of
both arguments are used to determine the quadrant of the result," which didn't help
me much.) So, I wrote a nice, simple atan2f4()
function that was intended to be essentially a wrapper around the simpler
atanf4() function provided by the library.
The results surprised me. I started getting huge numbers in
the trillions. Now, I'm not exactly a numerical analysis sort of person.
The reasonable inference was an error in my code. After some work, I
tracked down a typo in the simdmath library that caused erratic behavior of
atan() for some values (see the
sidebar for details).
After fixing that, I had a working program. This one still had the PPE pass in two arrays: one array of starting points for line segments and one array of ending points.
Now, if you are astute, you realize that because all the line segments are connected, the array of ending points is the same array offset by one. However, the SPE has no capacity for unaligned access. That makes it a little harder than it might be otherwise. So rather than fuss with it, I just passed in four values instead of six per point, and I saw a noticeable speed increase. In fact, the new version of the program was measurably faster than the PPE-only code, even with the compiler inlining. Finally!
This led to some hard thinking about argument passing. While it is true that the SPE is not very fast at unaligned access, it might still be faster to pass in only one array. My next revision of the program passed in a single array of points, but it passed in one more point than before. The following trivial code was enough to patch up the data:
Listing 4. An off-by-one correctness
nx = spu_or(spu_slqwbyte(vox[0], 4), spu_rlmaskqwbyte(vox[1], (signed) -12));
ny = spu_or(spu_slqwbyte(voy[0], 4), spu_rlmaskqwbyte(voy[1], (signed) -12));
|
The vox and voy variables
are vector pointers that are walked through the array of old points. This code
shifts the first vector to the left by 4 bytes and to the second right by 12. The
code then mashes
the bits together. This produces yet another noticeable improvement in
performance. In fact, on larger data sets, my code was over twice the
speed of the PPE-only code I started with.
Completing the final optimization
Throughout all of my debugging, I noticed a tendency for duplicate values to show up in my fractal calculations. In fact, the deeper the fractal went, the more duplicates there were. Early on, I had briefly considered stripping the duplicates out, but I never got around to it. However, at one point it suddenly occurred to me how these duplicates were generated.
Because the definition ends at {1, 0}, the last point of the fractal definition always transforms to the end of a given line segment. Because the definition starts at {0, 0}, the first point of the fractal definition always transforms to the beginning of a line segment. So, because every line segment begins at the end of the previous one, each iteration through produces a pair of duplicate points. Worse, each iteration after that generates a whole set of additional duplicates, as the generator dutifully transforms each point in its definition onto a single point. That turns out to be a fairly large number of points, especially with a bit of recursion.
Following is a little table of the number of points used by the simple, five-point snowflake fractal and by a more complicated 7-point double snowflake design:
Depth: 5 fractal points 7 fractal points -------------------------------------------------------------- 1 5 7 2 20 42 3 95 287 4 470 2002 5 2345 14007 6 11720 98042 7 58595 686287 |
Now, imagine that instead of rendering that last point of each pattern, you omit it. Of course, this would leave you without the final {1,0} on each line, but it's easy enough to add it, because you know exactly where it goes each time. That simple change produces these numbers:
Depth: 5 fractal points 7 fractal points -------------------------------------------------------------- 1 5 7 2 17 37 3 65 217 4 257 1297 5 1025 7777 6 4097 46657 7 16385 279937 |
As you note, these quickly fall well below half the size of their previous counterparts and without a single actual point of data lost. (And yes, the five-point fractal's pattern is powers of four, plus one. You get no points for guessing the seven-point fractal's pattern.)
Now, this is rather embarrassing. All that previous work,
from setting up the SPE code to hand-coding my
atan2f4() function to debugging the SIMD math
library—all that gained me was about a factor of three performance
increase, and it cost me a full working day to do it. By contrast, eliminating the harmless duplicate numbers on comparable data set
sizes (I was using depths of 7 or so with the seven-point fractal) gained me a factor
of three performance increase, and it took me at most fifteen minutes to debug.
More importantly, because my optimization reduced the size of the working set so much, it enabled me to start working with substantially larger data sets for testing. My Cell/B.E. system has fairly limited memory, and many of my early tests were crippled by the huge performance impact of swapping, which made about a factor of 10 or 20 difference in performance.
So, there is one more lesson of Cell/B.E. performance optimization. As in any programming exercise, you often find that simply giving your algorithm more than a moment's thought pays off just as much as hours spent working away over a hot vector unit. A lesson to remember, certainly.
Using a fractal generating program, you can see what a difference various optimization techniques can make.
Learn
- Use an RSS
feed to request notification for the upcoming articles in this series. (Find out more about RSS feeds of developerWorks content.)
- Check out the
other
articles in the "Little broadband engine that could" series.
- See the article
"Buoy makes
simple Java UI programming a snap"
(developerWorks, March 2005). Buoy, a free user-interface toolkit built on top of
Swing, is the same program offered in this article in Java, only a little slower.
- Refer to
"Maximizing the power of the Cell Broadband Engine processor: 25 tips to optimal application performance"
(developerWorks, June 2006) for
SPE programming tips.
- Check out the
"Unrolling
AltiVec"
series (developerWorks, 2005), which is an oldie but goodie that introduces the
various guises of this vector processing SIMD technology.
- Refer to Jonathon Bartlett's series on "Programming high
performance applications on the Cell/B.E. processor" (developerWorks, January 2007
to present) for the following:
- "Intro to Linux on the PS3"
- "Programming the PS3's SPE"
- "Intro to the SPU"
- "SPU performance programming"
- "C/C++ SPU programming"
- "Managing smart buffer DMA transfers"
- Use this
quick install guide
(developerWorks, April 2007) to install and run Fedora Core 6, which you need in order to
use the Cell/B.E. SDK 2.1.
- Refer to the Cell
Broadband Engine documentation section of the IBM Semiconductor Solutions Technical Library for a wealth of downloadable manuals,
specifications, and more.
- Sign up for the developerWorks newsletter
and get the latest developer news and Cell/B.E. happenings delivered to your inbox each week.
Check Power Architecture when you sign up to receive Cell/B.E. news in your newsletter.
Get products and technologies
- Find all Cell/B.E.-related articles, discussion forums, downloads,
and more at the IBM developerWorks Cell
Broadband Engine resource center: your definitive resource for all
things Cell/B.E.
- Contact IBM about custom
Cell/B.E.-based or custom-processor based solutions.
Discuss
- Participate in the discussion forum.
- Check out the Cell Broadband
Engine Architecture forum to get your technical questions about the processor answered.
Juicy problems and answers from the forums are rounded up periodically and highlighted
in the "Forum watch" blog series.
- Go to the Power Architecture blog for news, downloads,
instructional resources, and event notifications for Cell/B.E. and other Power Architecture-related technologies. You can find
the popular "Forum watch" blog series (Q&A roundup) and the "FixIt" technology updates.
Comments (Undergoing maintenance)





