Unrolling AltiVec, Part 1: Introducing the PowerPC SIMD unit

Get acquainted with AltiVec's abilities, rivals, and aliases

AltiVec? Velocity Engine? VMX? If you've only been casually following PowerPC development, you might be confused by the various guises of this vector processing SIMD technology. In this first installment of a three-part series, Peter Seebach gives you the basics on what AltiVec is, what it does -- and how it stacks up against its competition.


Peter Seebach (crankyuser@seebs.plethora.net), Author, Freelance

Peter Seebach uses vector processing a lot, and is personally able to cook up to three eggs at once, making him something of an expert in the field.



01 March 2005

On August 31, 1999, Apple® announced that it would begin selling computers that were considered "supercomputers" by the U.S. government, and could thus not be exported to some countries. This caused a lot of controversy at the time. Now, five years later, lots of computers are faster. But what is still interesting is that the technology that Apple used as the basis for this fairly dramatic claim is still in use, and it's still of major interest to developers trying to get the best performance out of certain kinds of tasks.

Motorola® AltiVec™ is one of the names (see the sidebar, "The nine billion names of AltiVec," for the others) for a specific example of Single Instruction, Multiple Data, or SIMD, execution. Normally, a single instruction to a computer does a single thing. A single SIMD instruction will also generally do a single thing -- but it will do it to multiple pieces of data at once, thus the fairly unimaginative (but at least pronounceable!) name. The multiple pieces of data being operated on at once are often called vectors, hence the name AltiVec. More traditional supercomputer vector processors might well store fifty or more pieces of data in a single vector register. AltiVec is a little more conservative. (See the sidebar, "Why vectors?" below for more on vector processing.)

The nine billion names of AltiVec (well, OK, only three)

VMX was the original code name for this extension inside IBM. That term isn't in widespread use, with generic terms like SIMD or vector processor preferred in IBM documentation. AltiVec is Motorola's trade name for this set of extensions, and the company thoughtfully trademarked the term. That's why Apple uses the name Velocity Engine, which is nicely generic and refers to either company's implementation of the technology.

The interaction of these names is confusing, and they are often bandied about interchangeably. Whenever you hear one of these terms used, it's probably safe to assume that it is technically interchangeable with the others. This article uses the term AltiVec because it's the prettiest.

Is this hardware or software?

If a processor has AltiVec support, that means that the processor has support for an additional set of instructions, as well as the special registers used by those instructions. However, the question of whether a particular chip has AltiVec support can't be answered by just seeing if a particular feature is present or absent. Different tasks performed using AltiVec instructions are handled by different components of the chip, and different chips may have different sets of components. For instance, the original G4, the first shipping processor with AltiVec support, had only a single unit to handle permutation instructions, but later models, called the G4+ by Motorola, had two. Each of these units can process one operation at a time, but if you have two units, you can start another operation before the first one is finished.

Software support for AltiVec mostly consists of generating instructions that the hardware can use. There are extensions to C that you can use to specify these precise operations, or you can drop straight into assembly. On the other hand, some compilers can do an acceptable job of transforming code automatically into vector operations.

The existence of noticeably different implementations of AltiVec has implications for programmers. The best way to get an algorithm to run may vary from one processor to another. This isn't a problem specific to AltiVec; the best possible optimization of a piece of code often varies widely from one generation of any processor to another.


Who supports this?

To get use out of AltiVec, you need a chip that has the hardware, an operating system that supports it, and a compiler or assembler that can generate code to use it.

The only processors currently supporting AltiVec are the G4 and G5. The G4 (including model numbers 7400 and 7410) and G4+ (7450 and 7455) processors are made by Motorola. (There are more models than just the ones listed here, but these are the most widely discussed.) The G5 chips include the IBM 970 and 970FX; these are essentially POWER4™ cores with an AltiVec unit bolted on. So far, only PowerPC® processors have had AltiVec support, not the POWER™ line. If you want to buy "a computer with AltiVec," Apple's Mac line is your most likely option. For evaluation boards and custom designs, however, you can go with any of the many vendors who do development kits based on either the G4 or G5.

Our apologies!
The original version of this article contained an erroneous reference to a "970GX." We apologize for any confusion this may have caused.
--Editors

Any operating system that runs on PowerPC and has been updated since the year 2000 will almost certainly work with AltiVec; if there are counterexamples, I have been unable to find them. The highest profile OSes would be Linux™, Mac OS X, and AIX. In theory, an operating system not specifically written for AltiVec-based hardware might not save information stored in the AltiVec registers when switching from one task to another. However, most OSes seem to have been long since patched to handle this difficulty. Even this theoretical problem will not arise as long as only one user program at a time uses the AltiVec instructions; it's only when more than one program on a system uses AltiVec that problems can arise.

As for compiler support, the GNU project's GCC compiler supports AltiVec; so does the Metrowerks CodeWarrior compiler, and, of course, IBM VisualAge®. All three produce functional code using the AltiVec extensions. (This, of course, applies to current compiler versions; no one is promising that a 1995 copy of any compiler will have support for the vector instructions.)


A few technical details

Now that you know what AltiVec is, you're probably wondering what exactly it does. It provides thirty-two 128-bit registers to hold vectors. Each register can be seen as providing sixteen 8-bit values, eight 16-bit values, or four 32-bit values. The 32-bit values can be integer or floating point, and all integer values can be signed or unsigned. AltiVec does not support 64-bit values, which can be a bit of a crimp; but in AltiVec's defense, a 128-bit register would hold only two of them, and getting two operations at once might not justify the overhead of getting vectors arranged. Furthermore, AltiVec supports an additional type, called pixel, that holds eight 16-bit pixel values. Past this, there's a fairly large number of instructions that perform various operations: loading registers from memory, performing arithmetic on them (in various types), and writing them back.
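To make the register layout concrete, here is a minimal sketch in portable C. It is not AltiVec code: the vreg128 union and the add_to_bytes function are hypothetical names invented for illustration, and the loop stands in for what a single vector instruction would do to all sixteen byte lanes at once.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for one 128-bit vector register (not a real
   AltiVec type): the same 16 bytes viewed at different element widths. */
typedef union {
    uint8_t  u8[16];   /* sixteen 8-bit values */
    uint16_t u16[8];   /* eight 16-bit values */
    uint32_t u32[4];   /* four 32-bit integers */
    float    f32[4];   /* four 32-bit floats */
} vreg128;

/* Add k to every byte lane. Each lane wraps modulo 256 on its own;
   nothing carries into the neighboring lane. */
vreg128 add_to_bytes(vreg128 v, uint8_t k)
{
    assert(sizeof(vreg128) == 16);   /* the views must overlay exactly */
    for (int i = 0; i < 16; i++)
        v.u8[i] = (uint8_t)(v.u8[i] + k);
    return v;
}
```

The point of the sketch is the lane boundary: adding 1 to a byte lane holding 0xFF yields 0 in that lane and leaves its neighbor alone, which is different from adding 1 to the same bits viewed as one wider integer.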

Why vectors?

Early vector processors didn't so much perform multiple operations at once; rather, they simply queued operations up so they could be performed in series. You'd start loading values into the vector register, and a little later start getting outputs at a ferocious pace. Instead of each operation taking a full load/modify/store cycle, followed by another load/modify/store, you would set up a series of loads, and then a little later a series of stores would follow. Even naive use of AltiVec can approximate this, by letting you load four (or more) values at once, operate on them, then store them all back at once.

That said, AltiVec works best when you're doing multiple sets of operations at once, which is one of the reasons it has a large number of registers: you can load one register while another is being processed, and so on. This matters to users simply because the result is still many times faster than performing the operations individually. Furthermore, AltiVec's design allows a little more flexibility than some larger vector processors and is well-suited to the variety of tasks desktop computers face.

Of particular interest is the existence of permutation instructions, which allow a register to be populated with bits or pieces of another register, possibly shuffled. These instructions allow bytes to be reshuffled from two source registers into a single destination register, in any order or pattern whatsoever. This is an incredibly general tool, useful anywhere from image processing to cryptography.
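The byte-shuffling behavior described above can be sketched in portable C. This is an emulation under stated assumptions, not the hardware instruction: permute_bytes is a hypothetical name, the registers are modeled as plain 16-byte arrays, and each selector byte is assumed to use its low five bits to pick one of the 32 source bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates a permute of the kind described in the text. For each of
   the 16 result bytes, the low 5 bits of the selector choose one byte
   from the 32 bytes formed by a followed by b. The output array must
   not overlap any of the inputs. */
void permute_bytes(uint8_t out[16], const uint8_t a[16],
                   const uint8_t b[16], const uint8_t sel[16])
{
    assert(out != a && out != b && out != sel);
    for (int i = 0; i < 16; i++) {
        unsigned idx = sel[i] & 0x1Fu;            /* 0..31 */
        out[i] = (idx < 16) ? a[idx] : b[idx - 16];
    }
}
```

With a selector of 15, 14, ... 0 this reverses the bytes of a; with a selector of 0, 16, 1, 17, ... it interleaves the first halves of a and b. The same routine covers both cases because the pattern is data rather than part of the instruction.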

You might think that the overhead of saving thirty-two 128-bit registers would be a little steep, and the designers of AltiVec would agree. To this end, a 32-bit register called VRSAVE has been provided. When the operating system switches context, it saves only the registers whose corresponding bit in the VRSAVE register has been set to one. This is an excellent compromise, allowing a programmer who needs only a few registers to save and restore those registers, and no others, on context switches. On the downside, this register must be manually updated if you are writing in assembly. (A C compiler targeting AltiVec will keep the register up to date for you.)
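The saving scheme can be modeled in a few lines of portable C. This is only a sketch of the idea, not operating system code: the vreg type and both function names are hypothetical, and the bit assignment (most significant bit for register 0) is an assumption of the sketch that should be checked against the processor documentation.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for one 128-bit vector register. */
typedef struct { uint8_t bytes[16]; } vreg;

/* Mask bit for register n, assuming the most significant bit of the
   32-bit mask corresponds to register 0. */
uint32_t vrsave_bit(unsigned n)
{
    assert(n < 32);
    return UINT32_C(0x80000000) >> n;
}

/* Copy only the registers flagged in vrsave into the save area and
   return how many were copied. Unflagged slots are left untouched. */
unsigned save_live_registers(vreg save_area[32], const vreg regs[32],
                             uint32_t vrsave)
{
    unsigned saved = 0;
    for (unsigned n = 0; n < 32; n++) {
        if (vrsave & vrsave_bit(n)) {
            save_area[n] = regs[n];
            saved++;
        }
    }
    return saved;
}
```

A program that touches only the first three registers would carry a mask of 0xE0000000 under this assumption, so a context switch copies 48 bytes instead of 512.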

In a couple of cases, AltiVec instructions deviate a little from the "one operation performed on multiple operands" model that is generally associated with vector processing. For instance, the vmaddfp instruction multiplies two operands together and adds a third in a single operation. Furthermore, there are a few operations that operate on the register as though it were a single 128-bit value, such as bit shifts or boolean operations, which don't care what type of data you think the register holds.
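As a rough illustration of the multiply-add case, the following portable C sketch applies the operation lane by lane to groups of four floats. The name madd_vectors and the operand order are choices made for this sketch, not the instruction's encoding, and plain C arithmetic may round the intermediate product where the hardware instruction might not, so treat it as a model of the data flow only.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4   /* four 32-bit floats per 128-bit register */

/* For every lane: out = a * b + c. The length must be a whole number
   of four-float vectors, mirroring how the registers are filled. */
void madd_vectors(float *out, const float *a, const float *b,
                  const float *c, size_t n)
{
    assert(n % LANES == 0);
    for (size_t v = 0; v < n; v += LANES) {
        /* One iteration of this outer loop models one vector instruction. */
        for (size_t lane = 0; lane < LANES; lane++)
            out[v + lane] = a[v + lane] * b[v + lane] + c[v + lane];
    }
}
```

Folding the multiply and the add into one step is what makes operations such as dot products and polynomial evaluation cheap in vector form: each outer iteration does eight floating-point operations.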


Compare and contrast

AltiVec units face competition from three sources. The first, and sometimes the most effective, is the rest of the CPU. While AltiVec is incredible for some kinds of tasks, there are others on which it simply doesn't perform very well. Some algorithms that don't lend themselves well to vectorization will be hard or impossible to convert to AltiVec. Any process in which every stage of computation depends on the results of the previous stage will probably see very little improvement from using AltiVec.
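The difference between the two kinds of work is easy to see in plain C. Both functions below are hypothetical examples written for this comparison: the first has no dependence between elements, while the second is a chain in which every step consumes the previous step's result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Independent work: out[i] depends only on in[i], so several elements
   could be processed by one vector instruction. */
void square_each(uint32_t *out, const uint32_t *in, size_t n)
{
    assert(n == 0 || (out != NULL && in != NULL));
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * in[i];
}

/* Dependent chain: x = x*x + c, wrapping modulo 2^32. Step k cannot
   begin until step k-1 has finished, so there is nothing to run
   side by side. */
uint32_t iterate_chain(uint32_t x, uint32_t c, unsigned steps)
{
    while (steps-- > 0)
        x = x * x + c;
    return x;
}
```

A vector unit could still help with the second function if many independent chains had to be advanced together, one chain per lane, but a single chain gains nothing from the extra lanes.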

AltiVec units also face competition from other AltiVec units, since, as this article has noted, not all AltiVec units are created equal. The G4, G4+, and G5 each have different performance characteristics. The original G4 needs the fewest cycles per instruction, but can run fewer instructions at a time; the G5 needs the most cycles per instruction, but can run more at a time. There are a number of additional complexities involved, but in general, for a given clock speed, the G5 will get the most work done on well-pipelined code. Ironically, however, badly pipelined code may run in fewer cycles on an older processor. This doesn't mean that you actually get better performance on the older processor, though. The newer processors have uniformly higher clock speeds.

AltiVec could also be compared to other SIMD architectures. The x86 world has provided us with a great number of these, including MMX, MMX2, 3DNow!, SSE, SSE2, and SSE3. The MMX instruction set, dating back to the days of 166 MHz Pentium processors, was carefully designed to require no changes to operating system software. To this end, it shared registers with the floating-point unit. Unfortunately, this made it impossible for a program to use MMX acceleration and floating point at the same time, which limited the usefulness of the original MMX instruction set.

More recent efforts, such as SSE2, are somewhat better. SSE2 provides eight registers, which are not shared with the floating-point unit. SSE2 does have 64-bit floating-point types, which is a plus. However, AltiVec's selection of instructions is more complete, and most of them work from two registers into a third, letting the processor perform moderately complicated vector operations entirely in registers, without touching memory until the final data is ready to come out. This, and the larger pool of registers, favors deeply pipelined operations that can come close to saturating the processor's multiple execution units. AltiVec still wins.

The competing 3DNow! instruction set, developed by AMD, has some features of MMX and some of SSE, and has gone through a few revisions of its own. However, since only AMD's chips use it, it's not as widely supported. To add insult to injury, some programs that test for the availability of 3DNow! support may generate false positives and try to use these instructions on processors that don't support them, such as the Pentium 4M. This has made support for these extensions less common than it might otherwise be.

This highlights one of the real advantages that AltiVec has over the various SIMD instruction sets available for x86 processors: its comparative stability. Every AltiVec processor since the original G4 has had the same essential functionality, the same large register pool that isn't shared with anything, and a reasonably complete set of likely operations. This has made it easier for support to become widespread: a program designed to take advantage of the original G4 will still get a noticeable performance improvement on today's G5.


Making use of AltiVec

Apple has been very aggressive about getting AltiVec optimizations into core components of its operating system, to make sure that users feel they're getting some benefit from it. Graphics applications on the Mac are very likely to be AltiVec enhanced. The consistent architecture has made the return on investment of AltiVec optimization quite good. Now that every Mac shipped comes with an AltiVec processor, every user will benefit if a program is able to make effective use of AltiVec for processing.

Any operating system can use AltiVec. As an example, consider the TCP checksum algorithm (see Resources for an article on vectorizing this), which chews up a substantial number of cycles on a heavily loaded server, and which can usually be sped up dramatically using AltiVec. This operation happens in the operating system, not in specific network applications -- but all applications relying on the network benefit from the performance boost.
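To show why this checksum suits a vector unit, here is a sketch in portable C of the 16-bit one's-complement Internet checksum, written two ways. It is not the implementation from the article mentioned above, and the function names are hypothetical. The second version keeps four independent partial sums, which is the shape a vector version would take with one partial sum per lane; because one's-complement addition can be done in any order, both versions give the same answer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a wide sum down to 16 bits, adding carries back in. */
static uint16_t fold16(uint64_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)sum;
}

/* Sum the remaining big-endian 16-bit words starting at offset i;
   an odd trailing byte is padded with zero. */
static uint64_t sum_tail(const uint8_t *p, size_t i, size_t len)
{
    uint64_t sum = 0;
    for (; i + 1 < len; i += 2)
        sum += ((uint32_t)p[i] << 8) | p[i + 1];
    if (i < len)
        sum += (uint32_t)p[i] << 8;
    return sum;
}

/* Reference version: one word at a time. */
uint16_t checksum_scalar(const uint8_t *p, size_t len)
{
    assert(p != NULL || len == 0);
    return (uint16_t)~fold16(sum_tail(p, 0, len));
}

/* Four independent partial sums, eight bytes per pass. */
uint16_t checksum_lanes(const uint8_t *p, size_t len)
{
    uint64_t lane[4] = {0, 0, 0, 0};
    size_t i = 0;
    assert(p != NULL || len == 0);
    for (; i + 8 <= len; i += 8)
        for (int k = 0; k < 4; k++)
            lane[k] += ((uint32_t)p[i + 2 * k] << 8) | p[i + 2 * k + 1];
    return (uint16_t)~fold16(lane[0] + lane[1] + lane[2] + lane[3]
                             + sum_tail(p, i, len));
}
```

The partial sums never interact until the final fold, so the inner loop is exactly the kind of independent, lane-by-lane work that a 128-bit register handles in a single operation.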

Even if you don't want to specifically target AltiVec, you can still get some benefit from it. Automatic vectorization of code, while not up to the best human optimizations, can produce a noticeable performance improvement. There are both commercial and free products in this arena. There is support for autovectorization in a branch of GCC, appropriately enough called autovec-branch. Automatic vectorization provides a substantial benefit: because the compiler is doing the vectorizing, the original code is not dependent on a specific processor's vector execution model, so your code can remain portable.
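As an illustration of the kind of code an autovectorizer tends to handle well, consider this small portable C loop. The function name is hypothetical, and whether a given compiler actually vectorizes it depends on the compiler version, target, and options, so take it as a sketch of vectorizer-friendly style rather than a guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rounded average of two byte arrays, as used in simple image blends.
   The loop has a known trip count, no branches, and no dependence
   between iterations; restrict promises the arrays do not overlap. */
void average_bytes(uint8_t *restrict out, const uint8_t *restrict a,
                   const uint8_t *restrict b, size_t n)
{
    assert(n == 0 || (out != NULL && a != NULL && b != NULL));
    for (size_t i = 0; i < n; i++)
        out[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
}
```

Nothing here mentions vectors, registers, or a particular processor, which is the portability benefit described above: the same source can be compiled for a machine with no vector unit at all.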

The next article in this series will look in more detail at getting good performance out of AltiVec when programming it directly, using either C or assembly. That gives you enough time to go out and get a machine with AltiVec support!

Resources
