On August 31, 1999, Apple® announced that it would begin selling computers that were considered "supercomputers" by the U.S. government, and could thus not be exported to some countries. This caused a lot of controversy at the time. Now, five years later, lots of computers are faster. But what is still interesting is that the technology that Apple used as the basis for this fairly dramatic claim is still in use, and it's still of major interest to developers trying to get the best performance out of certain kinds of tasks.
Motorola® AltiVec™ is one of the names (see the sidebar, "The nine billion names of AltiVec," for the others) for a specific example of Single Instruction, Multiple Data, or SIMD, execution. Normally, a single instruction to a computer does a single thing. A single SIMD instruction will also generally do a single thing -- but it will do it to multiple pieces of data at once, thus the fairly unimaginative (but at least pronounceable!) name. The multiple pieces of data being operated on at once are often called vectors, hence the name AltiVec. More traditional supercomputer vector processors might well store fifty or more pieces of data in a single vector register. AltiVec is a little more conservative. (See the sidebar, "Why vectors?" below for more on vector processing.)
If a processor has AltiVec support, that means that the processor has support for an additional set of instructions, as well as the special registers used by those instructions. However, the question of whether a particular chip has AltiVec support can't be answered by just seeing if a particular feature is present or absent. Different tasks performed using AltiVec instructions are handled by different components of the chip, and different chips may have different sets of components. For instance, the original G4, the first shipping processor with AltiVec support, had only a single unit to handle permutation instructions, but later models, called the G4+ by Motorola, had two. Each of these units can process one operation at a time, but if you have two units, you can start another operation before the first one is finished.
Software support for AltiVec mostly consists of generating instructions that the hardware can use. There are extensions to C that you can use to specify these precise operations, or you can drop straight into assembly. On the other hand, some compilers can do an acceptable job of transforming code automatically into vector operations.
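As a rough illustration of what the C extensions look like, here is a minimal sketch that adds two float arrays four elements at a time. The function name add_floats is hypothetical. The sketch assumes a compiler that defines __ALTIVEC__ and ships altivec.h when AltiVec code generation is enabled; on any other target it quietly falls back to the plain loop, so the same source builds everywhere.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ALTIVEC__)
#include <altivec.h>
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for i in [0, n). */
void add_floats(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;

#if defined(__ALTIVEC__)
    /* vec_ld and vec_st ignore the low four address bits, so take the
       vector path only when all three pointers are 16-byte aligned. */
    if ((((uintptr_t)dst | (uintptr_t)a | (uintptr_t)b) & 15u) == 0) {
        for (; i + 4 <= n; i += 4) {
            vector float va = vec_ld(0, a + i);  /* load four floats */
            vector float vb = vec_ld(0, b + i);
            vec_st(vec_add(va, vb), 0, dst + i); /* four sums, one add */
        }
    }
#endif

    /* Scalar loop: handles the tail, or everything without AltiVec. */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

The alignment check is a deliberately cautious choice for a sketch; tuned code would more likely arrange for aligned buffers up front or handle misalignment explicitly.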
The existence of noticeably different implementations of AltiVec has implications for programmers. The best way to get an algorithm to run may vary from one processor to another. This isn't a problem specific to AltiVec; the best possible optimization of a piece of code often varies widely from one generation of any processor to another.
To get use out of AltiVec, you need a chip that has the hardware, an operating system that supports it, and a compiler or assembler that can generate code to use it.
The only processors currently supporting AltiVec are the G4 and G5. The G4 (including model numbers 7400 and 7410) and G4+ (7450 and 7455) processors are made by Motorola. (There are more models than just the ones listed here, but these are the most widely discussed.) The G5 chips include the IBM 970 and 970FX; these are essentially POWER4™ cores with an AltiVec unit bolted on. So far, only PowerPC® processors have had AltiVec support, not the POWER™ line. If you want to buy "a computer with AltiVec," Apple's Mac line is your most likely option. For evaluation boards and custom designs, however, you can go with any of the many vendors who do development kits based on either the G4 or G5.
Any operating system that runs on PowerPC and has been updated since the year 2000 will almost certainly work with AltiVec; if there are counterexamples, I have been unable to find them. The highest profile OSes would be Linux™, Mac OS X, and AIX. In theory, an operating system not specifically written for AltiVec-based hardware might not save information stored in the AltiVec registers when switching from one task to another. However, most OSes seem to have been long since patched to handle this difficulty. Even this theoretical problem will not arise as long as only one user program at a time uses the AltiVec instructions; it's only when more than one program on a system uses AltiVec that problems can arise.
As for compiler support, the GNU project's GCC compiler supports AltiVec; so does the Metrowerks
CodeWarrior compiler, and, of course, IBM VisualAge®. All three produce
functional code using the AltiVec extensions. (This, of course, applies to
current compiler versions; no one is promising that a 1995 copy of any
compiler will have support for the vector instructions.)
Now that you know what AltiVec is, you're probably wondering what
exactly it does. It provides thirty-two 128-bit registers to hold
vectors. Each register can be seen as providing sixteen 8-bit values,
eight 16-bit values, or four 32-bit values. The 32-bit values can be
integer or floating point, and all integer values can be signed or
unsigned. AltiVec does not support 64-bit values, which can be a bit of a
crimp; but in AltiVec's defense, a 128-bit register would hold only two
such values, and getting two operations at once might not justify the
overhead of getting vectors arranged. Furthermore, AltiVec
supports an additional type, called pixel, that
holds eight 16-bit pixel values. Past this, there's a fairly large number
of instructions that perform various operations: loading registers from
memory, performing arithmetic on them (in various types), and writing them
back.
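One way to picture the register layout is as a single 16-byte block that can be read at three different element widths. The union below is only a portable model for illustration (the name vreg128_model is hypothetical, and real AltiVec code would use the compiler's vector types instead); it shows the sixteen, eight, and four lane views sharing the same 128 bits.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one 128-bit vector register. Every member
   overlays the same 16 bytes; only the lane width differs. */
typedef union {
    uint8_t  u8[16];   /* sixteen 8-bit lanes */
    uint16_t u16[8];   /* eight 16-bit lanes  */
    uint32_t u32[4];   /* four 32-bit lanes   */
    float    f32[4];   /* four 32-bit floating-point lanes */
} vreg128_model;

/* Set every byte of the modeled register to the same value. */
void vreg_fill(vreg128_model *v, uint8_t value)
{
    memset(v->u8, value, sizeof v->u8);
}
```

Filling the register with the byte 0xAB and then reading u32[0] gives 0xABABABAB: the wider views are just regroupings of the same bytes.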
Of particular interest is the existence of permutation instructions, which allow a register to be populated with bits or pieces of another register, possibly shuffled. These instructions allow bytes to be reshuffled from two source registers into a single destination register, in any order or pattern whatsoever. This is an incredibly general tool, useful anywhere from image processing to cryptography.
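The two-source byte shuffle can be modeled in a few lines of portable C. This is a sketch of the semantics as I understand them, not the hardware implementation: each byte of a control vector uses its low five bits to pick one of the 32 bytes formed by the two sources placed end to end. The names vec16_u8 and perm_model are hypothetical.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit register viewed as 16 bytes. */
typedef struct { uint8_t byte[16]; } vec16_u8;

/* Model of a two-source permute: result byte i is chosen by control
   byte i. Selectors 0..15 pick from a, selectors 16..31 pick from b;
   the top three bits of each control byte are ignored. */
vec16_u8 perm_model(vec16_u8 a, vec16_u8 b, vec16_u8 control)
{
    vec16_u8 r;
    for (int i = 0; i < 16; i++) {
        unsigned sel = control.byte[i] & 0x1Fu;
        r.byte[i] = (sel < 16) ? a.byte[sel] : b.byte[sel - 16];
    }
    return r;
}
```

A control vector of 15, 14, ..., 0 reverses the bytes of a, while 0, 16, 1, 17, ... interleaves a and b. The operation stays the same and only the control data changes, which is where the generality comes from.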
You might think that the overhead of saving thirty-two 128-bit
registers would be a little steep, and the designers of AltiVec would
agree. To this end, a 32-bit register called VRSAVE has been provided. When the operating system
switches context, it saves only the registers whose corresponding bit in
the VRSAVE register has been set to one. This
is an excellent compromise, allowing a programmer who needs only a few
registers to save and restore those registers, and no others, on context
switches. On the down side, this register must be manually updated if you
are writing in assembly. (A C compiler targeting AltiVec will keep the
register up to date for you.)
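The bookkeeping behind VRSAVE amounts to a 32-bit mask with one bit per vector register. The sketch below models it in portable C with hypothetical helper names. It assumes that register v0 maps to the most significant bit and v31 to the least significant, following PowerPC bit numbering; check the programming environments manual before relying on that ordering.

```c
#include <assert.h>
#include <stdint.h>

/* Mask bit for vector register n (0..31), assuming v0 is the MSB. */
uint32_t vrsave_bit(unsigned n)
{
    assert(n < 32);
    return UINT32_C(0x80000000) >> n;
}

/* Return the mask with register n marked as in use. */
uint32_t vrsave_mark(uint32_t mask, unsigned n)
{
    return mask | vrsave_bit(n);
}

/* Nonzero if register n is marked as in use. */
int vrsave_in_use(uint32_t mask, unsigned n)
{
    return (mask & vrsave_bit(n)) != 0;
}

/* How many registers a context switch would need to save. */
unsigned vrsave_count(uint32_t mask)
{
    unsigned count = 0;
    while (mask) {
        mask &= mask - 1;  /* clear the lowest set bit */
        count++;
    }
    return count;
}
```

A routine that touches only v0 and v31 would end up with the mask 0x80000001, so a context switch in this model saves two registers rather than all thirty-two.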
In a couple of cases, AltiVec instructions deviate a little from the
"one operation performed on multiple operands" model that is generally
associated with vector processing. For instance, the vmaddfp instruction multiplies two operands together
and adds a third in a single operation. Furthermore, there are a few
operations that operate on the register as though it were a single 128-bit
value, such as bit shifts or boolean operations, which don't care what
type of data you think the register holds.
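Both departures can be modeled in portable C. The sketch below uses hypothetical names and makes no attempt to reproduce the real instruction's operand order or rounding behavior; it only shows the shape of the work: a multiply and an add per lane in one step, and a boolean operation that treats all 128 bits alike.

```c
#include <stdint.h>

/* Hypothetical stand-in for a register holding four floats. */
typedef struct { float lane[4]; } vec4_f32;

/* Model of a multiply-add: each lane computes a * b + c. */
vec4_f32 madd_model(vec4_f32 a, vec4_f32 b, vec4_f32 c)
{
    vec4_f32 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = a.lane[i] * b.lane[i] + c.lane[i];
    return r;
}

/* Model of a whole-register AND: the 16 bytes are combined bit by
   bit, whatever element type the caller thinks they hold. */
void and128_model(uint8_t dst[16], const uint8_t a[16], const uint8_t b[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = a[i] & b[i];
}
```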
AltiVec units face competition from three sources. The first, and sometimes the most effective, is the rest of the CPU. While AltiVec is incredible for some kinds of tasks, there are others on which it simply doesn't perform very well. Some algorithms that don't lend themselves well to vectorization will be hard or impossible to convert to AltiVec. Any process in which every stage of computation depends on the results of the previous stage will probably see very little improvement from using AltiVec.
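The difference between the two kinds of work shows up even in scalar C. In the sketch below (function names are hypothetical), the first loop computes each output from its own input, so four or more elements could be handled at once. The second carries a value from one iteration to the next, so each step has to wait for the one before it, at least in this straightforward form.

```c
#include <stddef.h>

/* Independent work: dst[i] depends only on src[i], so several
   elements can be processed at the same time. */
void scale_all(float *dst, const float *src, float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* Dependent work: each step needs the state left by the previous
   step, so the iterations form a chain. */
float recurrence(const float *src, float k, size_t n)
{
    float state = 0.0f;
    for (size_t i = 0; i < n; i++)
        state = state * k + src[i];
    return state;
}
```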
AltiVec units also face competition from other AltiVec units, since, as this article has noted, not all AltiVec units are created equal. The G4, G4+, and G5 each have different performance characteristics. The original G4 needs the fewest cycles per instruction, but can run fewer instructions at a time; the G5 needs the most cycles per instruction, but can run more at a time. There are a number of additional complexities involved, but in general, for a given clock speed, the G5 will get the most work done on well-pipelined code. Ironically, however, badly pipelined code may run in fewer cycles on an older processor. This doesn't mean that you actually get better performance on the older processor, though, because the newer processors have uniformly higher clock speeds.
AltiVec could also be compared to other SIMD architectures. The x86 world has provided us with a great number of these, including MMX, MMX2, 3DNow!, SSE, SSE2, and SSE3. The MMX instruction set, dating back to the days of 166 MHz Pentium processors, was carefully designed to require no changes to operating system software. To this end, it shared registers with the floating-point unit. Unfortunately, this made it impossible for a program to use MMX acceleration and floating point at the same time, which limited the usefulness of the original MMX instruction set.
More recent efforts, such as SSE2, are somewhat better. SSE2 provides eight registers, which are not shared with the floating point unit. SSE2 does have 64-bit floating point types, which is a plus. However, AltiVec's selection of instructions is more complete, and most of them work from two registers into a third, letting the processor perform moderately complicated vector operations entirely in registers, without touching memory until the final data is ready to come out. This, and the larger pool of registers, favors deeply pipelined operations that can come close to saturating the processor's multiple execution units. AltiVec still wins.
The competing 3DNow! instruction set, developed by AMD, has some features of MMX and some of SSE, and has gone through a few revisions of its own. However, since only AMD's chips use it, it's not as widely supported. To add insult to injury, some programs that test for the availability of 3DNow! support may generate false positives and try to use these instructions on processors that don't support them, such as the Pentium 4M. This has made support for these extensions less common than it might otherwise be.
This highlights one of the real advantages that AltiVec has over the various SIMD instruction sets available for x86 processors: its comparative stability. Every AltiVec processor since the original G4 has had the same essential functionality, the same large register pool that isn't shared with anything, and a reasonably complete set of likely operations. This has made it easier for support to become widespread: a program designed to take advantage of the original G4 will still get a noticeable performance improvement on today's G5.
Apple has been very aggressive about getting AltiVec optimizations into core components of its operating system, to make sure that users feel they're getting some benefit from it. Graphics applications on the Mac are very likely to be AltiVec enhanced. The consistent architecture has made the return on investment of AltiVec optimization quite good. Now that every Mac shipped comes with an AltiVec processor, every user will benefit if a program is able to make effective use of AltiVec for processing.
Any operating system can use AltiVec. As an example, consider the TCP checksum algorithm (see Resources for an article on vectorizing this), which chews up a substantial number of cycles on a heavily loaded server, and which can usually be sped up dramatically using AltiVec. This operation happens in the operating system, not in specific network applications -- but all applications relying on the network benefit from the performance boost.
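To see why this checksum suits vector hardware, it helps to look at a scalar version. The sketch below follows the ones' complement sum described in RFC 1071; it is not the code from the article listed in Resources, and the function names are hypothetical. The second function splits the same sum across four independent accumulators, which is legal because the order of additions does not affect the result, and that same property is what lets a vector unit accumulate several words per instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold carries back into the low 16 bits, then complement. */
static uint16_t fold_and_complement(uint64_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Scalar reference: sum big-endian 16-bit words, padding an odd
   trailing byte with zero. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint64_t sum = 0;
    size_t i;
    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (i < len)
        sum += (uint32_t)data[i] << 8;
    return fold_and_complement(sum);
}

/* Same sum spread over four accumulators, mimicking vector lanes. */
uint16_t inet_checksum_lanes(const uint8_t *data, size_t len)
{
    uint64_t acc[4] = {0, 0, 0, 0};
    size_t words = len / 2;
    for (size_t w = 0; w < words; w++)
        acc[w % 4] += ((uint32_t)data[2 * w] << 8) | data[2 * w + 1];
    if (len % 2)
        acc[0] += (uint32_t)data[len - 1] << 8;
    return fold_and_complement(acc[0] + acc[1] + acc[2] + acc[3]);
}
```

Both functions return 0x220d for the eight bytes 00 01 f2 03 f4 f5 f6 f7, the worked example from RFC 1071.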
Even if you don't want to specifically target AltiVec, you can still
get some benefit from it. Automatic vectorization of code, while not up to
the best human optimizations, can produce a noticeable performance
improvement. There are both commercial and free products in this arena.
There is support for autovectorization in a branch of GCC, appropriately enough called
autovec-branch. Automatic vectorization provides a substantial
benefit: because the compiler is doing the vectorizing, the original code
is not dependent on a specific processor's vector execution model, so your
code can remain portable.
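The kind of loop an autovectorizer tends to handle well looks something like the sketch below: a fixed stride, no calls in the loop body, and no overlap between the arrays. The function name is hypothetical, and whether any particular compiler version actually vectorizes it is not guaranteed. The C99 restrict qualifiers are a promise from the programmer that x and y do not alias.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i] in plain portable C. Nothing here names a
   vector instruction; any vectorization is left to the compiler. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Because the source contains no processor-specific code, the same function builds on a machine without a vector unit and simply runs as an ordinary loop there.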
The next article in this series will look in more detail at getting good performance out of AltiVec when programming it directly, using either C or assembly. That gives you enough time to go out and get a machine with AltiVec support!
- SIMD has working groups involved with various SIMD extensions, including AltiVec, MMX, and others.
- Check out the PowerPC Microprocessor Family: Vector/SIMD Multimedia Extension Technology Programming Environments Manual.
- Apple's page about the Velocity Engine is a slightly buzzword-heavy description of the AltiVec variants used in Mac systems.
- Motorola recently spun off its chipmaking division into a separate company called Freescale. The Freescale site also has a page about AltiVec (pdf).
- A previous two-part developerWorks article, "TCP/IP checksum vectorization using AltiVec," by Ayal Zaks, Dorit Naishlos, and Daniel Citron, discussed TCP checksum vectorization using AltiVec; start with Part 1 (developerWorks, October 2004).
- A discussion of throughput vs. latency, on Apple's site, is of particular interest.
- Apple provides detailed performance information about the G4, G4+, and G5.
- Work is being done on auto-vectorization in GCC.
- Crescent Bay Software sells software to automatically vectorize C code.
- Apple's page on performance tools gives links to a number of useful tools, including simg4 and simg5.
- The GCC Wiki serves as a repository for information about GCC, with up-to-the-minute reports on status, useful tips, and everything else you might want.
- IBM Senior Processor Architect Peter Sandon discusses vector processing in the G5 in this interview (Ars Technica).
- "Save your code from meltdown using PowerPC atomic instructions," by Jonathan Rentzsch, gets into the gritty detail of PowerPC assembly code (developerWorks, November 2004).
- For more on the joys and dangers of writing code that directly accesses memory, check out "Data alignment on PowerPC," by Jonathan Rentzsch (developerWorks, February 2005).




