Motorola® AltiVec™ can dramatically improve the performance of many tasks, even tasks that you might initially think are too linear to get much advantage from parallelizing. As always, the most important part of effective optimization is determining where the system is spending most of its time, and focusing your efforts there. For instance, if you were working on an e-mail client, the most useful thing to optimize would be the code for deleting spam. Pay careful attention to the code examples in this article and you should be able to get truly incredible performance.
Since Apple makes the source code for Darwin available, it's possible to even look into optimizing the kernel. One routine comes immediately to mind: as it turns out, it takes as much as 85% of the CPU time on my dual G4 tower. This routine simply updates a counter, then continues operation. It's in the file src/xnu/osfmk/kern/sched_prim.c. Listing 1 shows the complete source:
Listing 1. Where 85% of CPU time goes
void
idle_thread(void)
{
counter(c_idle_thread_block++);
thread_block(idle_thread_continue);
/*NOTREACHED*/
}
|
The counter() function is in fact simply a
macro that expands to its argument. So, if you focus on the work done in
this function, you'll see that all that's really happening is an increment
of a value of type mach_counter_t, which is an
unsigned int.
The most important thing in optimization is understanding what the code actually does. This function can essentially be reduced to the simpler form in Listing 2, once you've removed the inefficient thread-blocking:
Listing 2. A slight simplification
static int x;
void
function(void)
{
++x;
}
|
On the PowerPC®, this expands into a fairly large chunk of assembly; you
can see for yourself using gcc -S -c file.c.
The output file, file.s, looks like Listing 3:
Listing 3. Assembly code for a simple function
_function: mflr r0 bcl 20,31,"L00000000001$pb" "L00000000001$pb": mflr r10 mtlr r0 addis r9,r10,ha16(_x-"L00000000001$pb") la r9,lo16(_x-"L00000000001$pb")(r9) lwz r2,0(r9) addi r2,r2,1 stw r2,0(r9) blr .lcomm _x,4,2 |
Most of this code is setup. (For more on PowerPC assembly, see the Resources section.) The actual work is just done in the three lines right before the return, illustrated in Listing 4:
Listing 4. Incrementing a variable in PowerPC assembly
lwz r2,0(r9) addi r2,r2,1 stw r2,0(r9) |
Load the value, add one to it, and store it back. That's fine as far as
it goes -- but remember, the computer is spending something like 85% of its time calling this function over and over. The first thing
to do is figure out how long the loop is really taking. I used gettimeofday() to get time stamps before and after
running the loop a few times, as shown in Listing 5:
Listing 5. A minimalist test harness
int
main(void) {
struct timezone dontcare = { 0, 0 };
struct timeval before, after;
long long microsec;
gettimeofday(&before, &dontcare);
for (x = 0; x < 100000000; function())
;
gettimeofday(&after, &dontcare);
microsec = (after.tv_usec - before.tv_usec) +
1000000 * (after.tv_sec - before.tv_sec);
printf("%lld microseconds\n", microsec);
return 0;
}
|
The output from a test run is 3197848
microseconds -- a tad over 3 million microseconds. Three seconds is
a pretty long time in computing...
The first thing to do is just unroll the loop. This is, for a simple loop like this, quite easy, as you can see in Listing 6:
Listing 6. A first pass at unrolling the loop
void
function(void)
{
++x;
++x;
++x;
++x;
}
|
With this simple tweak, processing time is down to 1,332,169 microseconds, a substantial improvement. The function is now better than twice as fast as it was before. It could be unrolled further, but when you're performing the same operation on a 32-bit value four times in a row, that suggests that maybe it'd be more efficient to use AltiVec. (Of course, this first unrolling would improve performance even on a non-AltiVec CPU.)
Converting to vector operations
Just as the main processor works by loading, modifying, and then storing values, AltiVec also works by loading, modifying, and then storing values. The difference is that the AltiVec unit works on 16-byte chunks of memory at a time; for instance, it can operate on four 32-bit values at once, instead of a single value at once. The load and store overhead will be a bit high, but getting the values into a vector register should produce a measurable speedup.
Consider Listing 7:
Listing 7. Naive vectorization of addition
typedef vector unsigned int v_u_int;
void
function(void)
{
unsigned int xs[4] = { x + 1, x + 2, x + 3, x + 4 };
register v_u_int v = vec_ld(0, xs);
register v_u_int inc4 = (vector unsigned int) { 4, 4, 4, 4 };
v = vec_add(v, inc4);
vec_st(v, 0, xs);
x = xs[3];
}
|
The code in Listing 7 initializes an array with the next four values
after x, then loads the array into a vector
register. Next, the value 4 is loaded into all four 32-bit values of a
second register, and the registers are added. Finally, the incremented
values are stored back into the array, and the last member of the array is
read out. The time to run this version of the loop is only about a million
microseconds -- not exactly the huge speedup AltiVec should offer when
well-tuned, but it's nice to note that, even without a serious effort at
further unrolling, performance has improved.
The next step is to unroll a little further. The simplest thing to do
is to just repeat the line adding inc4 to v. This gets time down to about 430,000 microseconds.
That's pretty good. At this point, however, the dependency of each
operation on previous operations is starting to show.
To get past this limit, the code needs to think ahead, and perform more operations at once. That means having more than one set of source and destination registers active at a time, and adding more than four at a time to each value, as Listing 8 shows:
Listing 8. Unrolling further
unsigned int xs[4] = { x + 1, x + 2, x + 3, x + 4 };
register v_u_int v1 = vec_ld(0, xs);
register v_u_int v2, v3, v4, v5, v6, v7, v8;
register v_u_int inc4 = (vector unsigned int) { 4, 4, 4, 4 };
register v_u_int inc8 = (vector unsigned int) { 8, 8, 8, 8 };
register v_u_int inc16 = (vector unsigned int) { 16, 16, 16, 16
};
v2 = vec_add(v1, inc4);
v3 = vec_add(v1, inc8);
v4 = vec_add(v2, inc8);
v5 = vec_add(v1, inc16);
v6 = vec_add(v2, inc16);
v7 = vec_add(v3, inc16);
v8 = vec_add(v4, inc16);
vec_st(v8, 0, xs);
x = xs[3];
|
The intent here is to reduce the number of times that the processor
needs to stall waiting for the results of a previous operation. So, for
instance, before adding anything to v2, the
code adds something to v1, storing it in a
different register. At the end of these operations, it stores the highest
value it has back into x. This brings the time
for the main loop down to just over 310,000 microseconds. At this point,
the vectorized version is faster than the original, unoptimized, version
by a factor of ten.
In fact, in this particular case, it's actually marginally faster just
to keep looping on adding the inc4 vector in
place; the overhead of setting up all the additional registers ends up
costing too much, and running eight adds in a row can trim the time to
250,000 microseconds.
However, once the set-up cost has been paid, it's possible to get a
dramatic improvement by making more use of your array of vectors. The four
lines filling vectors v5 through v8 can be expanded as Listing 9 shows:
Listing 9. Unrolling the unrolled code even further
v5 = vec_add(v1, inc16);
v6 = vec_add(v2, inc16);
v7 = vec_add(v3, inc16);
v8 = vec_add(v4, inc16);
v1 = vec_add(v5, inc16);
v2 = vec_add(v6, inc16);
v3 = vec_add(v7, inc16);
v4 = vec_add(v8, inc16);
v5 = vec_add(v1, inc16);
v6 = vec_add(v2, inc16);
v7 = vec_add(v3, inc16);
v8 = vec_add(v4, inc16);
|
This allows you to take full advantage of interleaving, making sure that there are three operations between writing a value in a given vector and using that value in another operation. This gets the execution time down to a fairly zippy 200,000 microseconds.
Analyzing performance: Color wheels and color models
Unfortunately, while the previous example was an excellent theoretical illustration, it does not work out so well in practice. Because its code runs in the system idle loop, it ends up running the most (and thus getting the largest speedup) when the system is fairly quiet. As the system becomes more heavily loaded, and the system spends less time idle, other functions step in to sabotage the efficiency of the system and make the user aware that performance is lagging. I am talking, as any user of a modern Mac OS X system will well know, of the times that the spinning ball animation, sometimes called the "beach ball of death," appears.
This suggests an avenue for performance optimization: if the generation of that spinning color wheel could be sped up substantially, the performance enhancement would be immediately apparent. The easiest thing to do would be to convert each pixel's RGB values to HSV, increment the hue by some small amount, then convert back to RGB. The AltiVec pixel type is, unfortunately, not particularly useful for this task; what's really needed is a big array of RGB values.
The code for this is a lot more complicated, but it's still reasonably
approachable. The starting function, called inc_hsv(), takes a set of input values, converts them
to HSV, rotates them, and converts them back. Running this conversion on
65,536 data points takes only a fraction of a second, so the test case is
to run that same loop 100 times. That takes a little under 5 seconds.
Unrolling this loop will be a little more challenging. In fact, trying to
write out the unrolled version would be very painful, especially because
the natural size for AltiVec operations on unsigned characters is batches
of 16. In the interest of brevity, I have not provided an inline
listing of the non-vectorized version of this loop unrolled 16 times.
To follow along with this example, see the colors.c sidefile and the color_vec.c sidefile. The original version is colors.c; the vectorized version is color_vec.c.
The sample code provided does make a few assumptions. Most noticeably,
it simply assumes that all of the scratch spaces it allocates will be
16-byte aligned. Production code would want to ensure this, possibly by
putting objects (such as an array of 16 bytes) in a union with some sort
of vector type.
Unfortunately, the algorithm used in this example is a painfully bad one for AltiVec, as it involves division. AltiVec simply does not have a division operator. It has multiplication, and a reciprocal "estimate," but no division. This means that some significant chunks of the process can't be done in vector code. Still, enough of it is vectorized that, even on a first attempt, there's a noticeable improvement: a hundred times through the loop goes from just under 5 seconds, to a bit under 4. That's about a 20% speedup.
Going through the code line-by-line would would soon grow tedious. But
a few things are worth pointing out specifically. The AltiVec code gets a
huge advantage from having the vec_min()
operator. In Listing 10, you can compare the code to select minimum and
maximum values in the original program with the AltiVec version:
Listing 10. Getting the maximum of three values
/* original version */
if (r > g) {
if (r > b) {
max = r;
if (g > b) {
min = b;
} else {
min = g;
}
} else {
max = b;
min = g;
}
} /* second half of this code omitted */
/* AltiVec version */
max16 = vec_max(vec_max(r16, g16), b16);
min16 = vec_min(vec_min(r16, g16), b16);
|
Experienced programmers might well use a macro to calculate the minimum and maximum values, making the code substantially shorter and easier to read -- but not actually any faster. AltiVec wins dramatically on this.
The AltiVec processor has unpack instructions, which sign-extend
values. For instance, unpacking a vector signed
char gives you a vector signed short.
That's a nice feature; unfortunately, there are no corresponding unsigned
operations. One way to deal with this is to use the vec_mul function, and multiply by all ones, as Listing 11 shows:
Listing 11. Unpacking unsigned vectors
scratch[0] = vec_mule(hue16, u8_1); scratch[1] = vec_mulo(hue16, u8_1); |
This looks wonderful, but it introduces a bug: the 16 elements of
the vector are no longer in order. The first vector holds even-numbered
elements, and the second odd-numbered elements. The solution I found for
this is the inorder vector, which is used to
permute a 16-byte vector such that expanding it out with vec_mule() and vec_mulo()
produces the first four values in one vector, the next four values in
another, and so on.
One thing that might seem a bit confusing is trying to perform conditional operations. In general, you don't perform conditional operations on an AltiVec unit. You perform operations unconditionally, then select results from vectors based on the output of comparison functions. For instance, in the original code, wraparound of the hue value looks like Listing 12:
Listing 12. Conditional operations without AltiVec
if (h > 251) h -= 251; else if (h < 21) h += 4; |
In the AltiVec code (where hue has a range from 0 to 6), it's done quite differently, as you can see in Listing 13:
Listing 13. Conditional operations with AltiVec
vector float six;
vector float sub6;
vector bool int dosub;
six = (vector float) { 6.0, 6.0, 6.0, 6.0 };
dosub = vec_cmpgt(fh[j], six);
sub6 = vec_sub(fh[j], six);
fh[j] = vec_sel(fh[j], sub6, dosub);
|
The dosub vector is filled with 1s for the fields where the vector fh[j] has a value greater than 6. A new vector, sub6 is created by subtracting 6 from every value in
the original vector; then fields are selected either from the original
vector, or from the modified one, depending on the bits in the comparison
vector.
With this in mind, look at the complete vectorized code listing, color_vec.c. There are two pieces of code being handled in loops. The
first is the actual hue calculation, which is made more complicated by the
need to divide by the delta value for each
pixel. It looks like it should be vectorizable; in particular, the
essential calculation made each time is identical, with only the source
arrays changing.
The second is the assignment of output values, drawing from the p, q, and t vectors. This is not a very large loop, but it's a
substantial chunk of code. If it could be vectorized, it would improve
things quite a bit. In the first attempt, these calculations were made
16 times, without vectors. Vectorizing the generation of these
numbers improved the speedup from 20% to 50%.
The updated code as is runs nearly twice as fast as the original, unvectorized, version. You may be able to improve it even further. Once you've done that, go ahead and put it into the relevant piece of the Darwin kernel and see how much less time you spend waiting for the system to render that endearing beach ball.
As you can see, an experienced programmer can easily avoid the pitfalls of premature optimization, carefully selecting appropriate sections of code to optimize. The AltiVec unit does have occasional shortcomings, such as the lack of a division operator, but careful usage can nonetheless improve performance dramatically. Attention to detail is important, and making sure you understand the code you wish to optimize is absolutely necessary.
- See the sample code for the second example in this article: colors.c (the original pre-vectorization code) and color_vec.c (the AltiVec-optimized vectorized
code).
- Check out the previous parts of the Unrolling AltiVec series.
- Find the PowerPC Microprocessor Family: Vector/SIMD Multimedia
Extension Technology Programming Environments Manual from the Semiconductor solutions technical library.
- SIMDtech has working groups
involved with various SIMD extensions, including AltiVec, MMX, and
others.
- Apple's page about
the Velocity Engine is a slightly buzzword-heavy description of the
AltiVec variants used in Mac systems.
- A previous two-part developerWorks article, "TCP/IP
checksum vectorization using AltiVec" by Ayal Zaks, Dorit Naishlos,
and Daniel Citron, discussed TCP checksum vectorization using AltiVec;
Part 2 of the series addresses Scalar
optimizations and simple vectorization ( developerWorks, October, November 2004) .
- A
discussion of throughput vs. latency, on Apple's site, is of
particular interest.
- Apple provides detailed
performance information about the G4, G4+, and G5.
- Work is being done on auto-vectorization
in GCC.
- Crescent Bay
Software sells software to automatically vectorize C code.
- Apple's page
on performance tools gives links to a number of useful tools,
including
simg4andsimg5. - The GCC Wiki serves as a
repository for information about GCC, with
up-to-the minute reports on status, useful tips, and everything else you
might want.
- IBM Senior Processor Architect Peter Sandon discusses vector
processing in the G5 in the Ars
Technica interview, "The real structure of the VMX (a.k.a. Altivec)
unit."
- "Save
your code from meltdown using PowerPC instructions," Jonathan Rentzsch
gets into some of the gritty detail of
PowerPC assembly code (developerWorks, November 2004).
- For more on the joys and dangers of writing code that directly
accesses memory, check out "Data
alignment on PowerPC," Jonathan Rentzsch (developerWorks, February
2005).
- Learn everything you needed to know about converting color
spaces, whether you were afraid to ask or not.
- It appears that rumors
of the CD-ROM cursor's death have been greatly exaggerated.
- Intel plans to release hardware in 2006 to
enhance TCP/IP stacks; but did you know you can do something similar
right now with AltiVec?
- The macstl project
tries to unify the architectures in a simple C++ template library. Read
more about it in the Slashdot posting Grand
Unified Theory of SIMD.
- The SIMD
Cross-Platform Headers project does something similar.
- How does SIMD architecture really work? An ancient but still useful
article by Holger Bettag, "Why
AltiVec is a Good Thing," has some answers.
- This
MacTech article about PowerPC assembly is roughly 10 years old, but
covers everything discussed here except AltiVec. Isn't stable
architecture nice?
- Snopes has
more information on April Fools' Pranks.
- You will also appreciate this list of the Top 100 April Fool's
Day Hoaxes of All Time.




