The Motorola® AltiVec™ instruction set offers a variety of vector operations. Vendors and reviewers claim that AltiVec is a particularly efficiently designed set of vector operations, allowing programmers to get excellent performance and make the most use of the multiple execution units it provides. But how do you get this performance?
This article, the second in the Unrolling AltiVec series, gives an overview of performance considerations that will help you get started making the best possible use of an AltiVec processor. (If you're unfamiliar with the basics of AltiVec, read Part 1 of this series before you go any further.) The compiler may get some or all of the ideas outlined here on its own, but if you're careful, you can probably do a better job. Still, check for likely compiler options before you spend a lot of time hand-tweaking code. If you can get acceptable performance by setting a compiler flag, do it and be happy.
One word of advice, though. Before you spend a lot of time
optimizing a process, remember that the first rule of optimization is
to find out where the time is going in the first place. If you spend a
couple of days carefully getting the best possible performance out of a
routine that accounts for one percent of all the CPU time your program
uses, you are probably wasting your time. Before you even start on
optimizing (unless you're doing it for fun, or to learn about
vectorizing code), you should take a profiler to your program and find
out where the time is spent. (simg4 and
simg5 are both good profilers; see Resources for links.)
Great processors don't always think alike
One thing to keep in mind when optimizing code for AltiVec is that different AltiVec-equipped processors have different capabilities. The previous article in this series briefly mentioned these differences; this article goes into more depth about the differences. In short, the "AltiVec unit" is not really just a single processing component: it's a number of different components, and in principle, all of them can be running at once.
Up to a point, the goal of optimization is to try to get as many components as possible working at the same time. That's the rule of thumb. However, an algorithm that goes through elaborate contortions to try to ensure that a given unit gets used for something may spend more time setting up the calculation than it saves by using that unit.
Thus far in the history of AltiVec processors, newer processors have generally been able to run more instructions at once. This doesn't necessarily make them faster on a single instruction; in fact, on average, they will take a cycle or two longer to perform a given operation. The newer processors still perform better, thanks to increased clock speed.
More subtly, the performance hit that these extra cycles cause is mitigated by the ability to start more instructions in a given cycle. For instance, on an original G4, most permutation instructions can run in a single cycle, while on a G4+, they may take two. However, on the original G4, two vector operations can start on the same cycle only if at least one of them is a permute operation; otherwise, the second operation must be delayed at least one cycle. On the G4+, three instructions may be dispatched in a single cycle, going to any two (different) vector processing units. On the G5, only one arithmetic operation and one permutation may be dispatched simultaneously, but up to two other instructions, plus a branch, may be dispatched to other parts of the processor, including the vector load/store units.
Unfortunately, the net result of these differences is that a program that runs brilliantly on one AltiVec processor may stall frequently on another. When optimizing code for general production, you must test it (or simulate it) on all the available types of processors, and focus on general principles. On the other hand, if you have a guarantee that your code will only be running on a specific processor, you can take advantage of more detailed knowledge of its instruction timing and its dispatch capabilities and limitations. General optimization strategies will give you code that works pretty well everywhere. Even if you end up moving code written for one specific processor to another, the porting process is not particularly difficult.
If your code doesn't have any loops, it probably can't be vectorized much. There's still some room for taking advantage of multiple execution units, but AltiVec probably won't particularly help you. SIMD instructions are primarily good for doing the same thing over and over, and that normally means loops. If you aren't doing something multiple times, it probably doesn't take long enough to be worth trying to optimize. So, you want to vectorize your loops. Even if you can't do that, you can improve performance by unrolling them. This will also make it easier to see whether there are any opportunities for vectorization.
To unroll a loop is to make each pass through the loop do the work of two or more passes. The simple code example in Listing 1 illustrates this:
Listing 1. Unrolling a loop
int
rolled_sum(char bytes[16]) {
int i;
int sum = 0;
for (i = 0; i < 16; ++i) {
sum += bytes[i];
}
return sum;
}
int
unrolled_sum(char bytes[16]) {
int i;
int sum[4] = 0;
for (i = 0; i < 16; i += 4) {
sum[0] += bytes[i + 0];
sum[1] += bytes[i + 1];
sum[2] += bytes[i + 2];
sum[3] += bytes[i + 3];
}
return sum[0] + sum[1] + sum[2] + sum[3];
}
|
Both versions of this function add 16 values to sum. However, the second does so in a way that
makes it easy to see that there are four similar operations happening
in a row. This might make it easier for a compiler -- or a programmer
-- to see the opportunity to vectorize the code.
When you've unrolled a loop a bit, you can often begin to see possible opportunities for vectorization. Sometimes, you won't immediately; it may seem that each phase of the calculation depends on the previous one, so you can't do multiple parts simultaneously. However, in some cases, these dependencies can be dealt with by splitting a process up into parts. Even if this requires a small amount of extra calculation, the speed benefits of vectorizing might pay for it.
Unrolling is often an optimization in and of itself, because branching is typically comparatively expensive. However, some of the work that goes into preparing an algorithm for vectorization could actually reduce performance. Don't abandon the effort just because the intermediate stage isn't working out. It can take a while to get an optimization correct.
Now that you've got a nicely unrolled loop, it's time to start looking for patterns in it. If every run through the loop involves adding two numbers together, you might benefit from putting those numbers in a pair of vectors, and adding them once.
Loading vectors, performing a single operation, and writing back to memory isn't all that good of a deal. What you ideally want is a series of operations that occurs on vectors, writing back to memory only at the end. The more of your loop's core logic you can perform in vectors, the more benefit you're getting, and the more likely you are to get enough payoff to cover the cost of the setup involved in loading to, and storing from, the vector registers.
There is another concern here, though. Depending on the processor you're using, you might face some limitations. For instance, on a G5, you can't dispatch two vector arithmetic operations in a single cycle, although you can combine a vector arithmetic operation with a permute. Don't overuse permutes just because you can launch them cheaply, though; moving data back and forth between the arithmetic and permutation units will cost you a cycle each way.
If you can arrange to interleave vector operations with ongoing non-vector operations, you will improve performance noticeably. Similarly, if you have to do a lot of loads and stores along with your arithmetic operations, interleave them if you can.
Now, let's say you have a vectorized version of some loop. You may find that it's not particularly efficient. In particular, if you have a frequently called function that does a small amount of vector math, you might find that performance actually drops to an abysmal level.
Here is where you may wish to unroll your loop again. AltiVec gives you 32 vector registers. That's enough that you may well be able to do two or more versions of essentially the same vectorized loop in a row. This gives you a substantial performance advantage. With a single vectorized pass through a loop, all the data must be loaded in, processed, and stored before the next phase can start. If you have two or three passes through the loop, using different sets of registers, the third phase may well have loaded all of its data before the first phase has finished computing, giving you a substantial boost in speed.
Often, getting data into or out of the processor is as much a bottleneck as any actual computation. This is another area where unrolling can help dramatically. The earlier you start loading data, the sooner you can start processing it. The G5's multiple load/store units improve this situation a lot, allowing you to start two loads, or a load and a store, in a single cycle.
In the particular case where a moderately vectorizable function is called many times, it is almost certainly worth building a version of the function that runs on four or more sets of data at once, essentially unrolling the loop. Apple gives a very good example of how this would work using the regular scalar floating point unit. For another look at loop unrolling to take advantage of AltiVec, read the two-part developerWorks series on TCP/IP checksum vectorization (see Resources).
Assembly and processor simulations
If your best efforts using the C interface aren't getting the
results you think you should be getting, it might be time to consider
writing your own assembly code. This is not a step to be undertaken
lightly. Most obviously, if you make a mistake involving the VRSAVE register, you may cause your program to
fail catastrophically!
This is a good time to stop and profile your code. It is easy to get bogged down optimizing code which isn't even consuming all that much processor time. Start by finding out where the processor time is being spent, and optimize that code.
If you've already got C code, a good starting place may be to compile the code to assembly and look at the resulting code. It might be immediately obvious to you what went wrong, and indeed, you might find that a simple change to your original source gives the compiler enough hints to do the right thing.
If that doesn't work, it's time to get down and dirty. You may want
to use a processor simulator, such as simg4
or simg5 (see Resources), to try to find out exactly
where your code is getting bogged down. What you're looking for are
stalls, where an instruction must be delayed as it waits for the
results of another instruction, or for some processor resource limit
that has been reached. Try to eliminate dependencies, or separate
dependent operations, as much as possible, by putting other code
between them.
This article has given something of an overview of what's involved in optimizing AltiVec code. You can find a great deal of additional information on Apple's developer pages. Apple clearly has a substantial interest in making sure that programmers targeting the Mac are getting the best possible performance out of AltiVec processors.
Compiler documentation is often full of tips, tricks, and tidbits
about optimization techniques supported, or ways to give hints to the
compiler about what you're trying to do. Read it carefully. Generally
this documentation will come with the compiler itself. If you're using
gcc, you may want to check out the new GCC
Wiki (see Resources).
Most importantly, if you have a piece of code that really needs to be optimized this heavily, profile it, study it, and try to find out what's happening. Don't be afraid to rethink the algorithm from the top. The world is full of hand-tuned assembly for bubble sort.
If you find all this discussion a bit too theoretical, you'll want to tune in for the third part of the series. In it, you'll see a more detailed, real-world example of unrolling and optimizing a loop, showing how to put some of these principles into practice. Stay tuned!
- Check out the first part of this
series (developerWorks, March 2005) .
- Read the PowerPC Microprocessor Family: Vector/SIMD Multimedia Extension Technology Programming Environments Manual.
- SIMD has working groups
involved with various SIMD extensions, including AltiVec, MMX, and others.
Don't miss their AltiVec
tutorial (in PDF format) by Ian Ollman.
- Apple's page
about the Velocity Engine is a slightly buzzword-heavy description
of the AltiVec variants used in Mac systems.
- Duff's
Device is the most famous example of loop unrolling.
- Users looking for a non-Macintosh PowerPC might be interested in Pegasos.
- Motorola recently spun off its chipmaking division into a separate
company called Freescale. The Freescale site also has a page about AltiVec.
- A previous two-part developerWorks article, "TCP/IP checksum
vectorization using AltiVec," by Ayal Zaks, Dorit Naishlos, and Daniel
Citron, discussed TCP checksum vectorization using AltiVec; start with
Part
1 and Part
2.
- A discussion of throughput vs. latency,
on Apple's site, is of particular interest.
- Apple provides detailed
performance information about the G4, G4+, and G5.
- Work is being done on auto-vectorization in
gcc. - Crescent Bay
Software sells software to automatically vectorize C code.
- Apple's
page on performance tools gives links to a number of useful tools,
including
simg4andsimg5. - The GCC Wiki serves as a
repository for information about
gcc, with up-to-the minute reports on status, useful tips, and everything else you might want. - IBM Senior Processor Architect Peter Sandon discusses vector
processing in the G5 in this interview.
- "Save your
code from meltdown using PowerPC instructions," Jonathan Rentzsch
gets into the gritty detail of PowerPC
assembly code (developerWorks, November 2004).
- For more on the joys and dangers of writing code that directly
accesses memory, check out "Data
alignment on PowerPC," Jonathan Rentzsch (developerWorks, February
2005).
-
The world
is full of hand-tuned assembly for bubble
sort.
- Have experience you'd be willing to share with Power Architecture zone
readers? Article submissions on all aspects of Power Architecture technology from authors inside and outside
IBM are welcomed. Check out the Power Architecture author
FAQ to learn more.
- Have a question or comment on this story, or
on Power Architecture technology in general?
Post it in the Power Architecture technical forum
or send in a letter to the editors.
- The Power Architecture Community Newsletter includes full-length articles as well as recent news about members of the Power Architecture community and upcoming events of interest. Subscribe to the newsletter today!
- All things Power are chronicled in the developerWorks Power
Architecture editors' blog, which is just one of many developerWorks
blogs.
- Find more articles and resources on Power Architecture
technology and all things
related in the developerWorks Power
Architecture technology content area.
- Download a IBM PowerPC 405 Evaluation Kit to demo a SoC in a simulated
environment, or just to explore the fully licensed version of
Power Architecture technology. This and other fine Power Architecture-related downloads are listed in
the developerWorks Power Architecture technology content area's downloads section.




