AIX vector programming

Some PowerPC® processors implement a Single Instruction Multiple Data (SIMD)-style vector extension.

Often referred to as AltiVec or VMX, the vector extension to the PowerPC architecture provides an additional instruction set for performing vector and matrix mathematical functions.

The Vector Arithmetic Logic Unit is an SIMD-style arithmetic unit, in which a single instruction performs the same operation on all the data elements of each vector. AIX® 5.3 with recommended technology level 5300-30 is the first AIX release to enable vector programming. The IBM® PowerPC 970 processor is the first processor supported by AIX that implements the vector extension. These processors are currently found in the JS20 blade servers offered with the BladeCenter.

Vector extension overview

The vector extension consists of an additional set of 32 128-bit registers that can contain a variety of vectors including signed or unsigned 8-bit, 16-bit, or 32-bit integers, or 32-bit IEEE single-precision floats. There is a vector status and control register that contains a sticky status bit indicating saturation, as well as a control bit for enabling Java™ or non-Java mode for floating-point operations.

The default mode initialized by AIX for every new process is Java-mode enabled, which provides IEEE-compliant floating-point operations. The alternate non-Java mode results in a less precise mode for floating point computations, which might be significantly faster on some implementations and for specific operations. For example, on the PowerPC 970 processor running in Java mode, some vector floating-point instructions will encounter an exception if the input operands or result are denormal, resulting in costly emulation by the operating system. For this reason, you are encouraged to consider explicitly enabling the non-Java mode if the rounding is acceptable, or to carefully attempt to avoid vector computations on denormal values.
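One way to avoid vector computations on denormal values is to screen the input data in scalar code before it reaches the vector code path. The following is a minimal sketch in portable C, not an AIX interface; the function names is_denormal and flush_denormals are hypothetical, and replacing denormals with zero is only appropriate when that loss of precision is acceptable to the application.

```c
#include <float.h>
#include <stddef.h>

/* Returns nonzero if x is nonzero but smaller in magnitude than the
 * smallest normal float, that is, a denormal. NaN yields 0. */
static int is_denormal(float x)
{
    float mag = (x < 0.0f) ? -x : x;
    return mag != 0.0f && mag < FLT_MIN;
}

/* Replaces every denormal in data[0..n) with a zero of the same sign
 * and returns how many elements were changed. */
static size_t flush_denormals(float *data, size_t n)
{
    size_t flushed = 0;
    for (size_t i = 0; i < n; i++) {
        if (is_denormal(data[i])) {
            data[i] = (data[i] < 0.0f) ? -0.0f : 0.0f;
            flushed++;
        }
    }
    return flushed;
}
```

A caller would typically run flush_denormals() once over an input buffer and then hand the buffer to the vector code.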

The vector extension also includes more than 160 instructions providing load and store access between vector registers and memory, in-register manipulation, floating-point arithmetic, integer arithmetic and logical operations, and vector comparison operations. The floating-point arithmetic instructions use the IEEE 754-1985 single-precision format, but do not report IEEE exceptions. Default results are produced for all exception conditions as specified by IEEE for untrapped exceptions. Only the IEEE default round-to-nearest rounding mode is provided. No floating-point division or square-root instructions are provided; instead, a reciprocal estimate instruction is provided for division, and a reciprocal square root estimate instruction is provided for square root.
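Because only an estimate of the reciprocal is available, division is commonly built from the estimate plus one or more Newton-Raphson refinement steps. The following scalar C sketch illustrates the arithmetic only. The function crude_recip is a hypothetical stand-in for the hardware estimate (it keeps about 12 bits of the exact reciprocal, which is an assumption made for illustration and not a statement about the accuracy of the real instruction); in vector code the estimate would come from the reciprocal estimate instruction and the same refinement would be applied to all four elements at once.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a hardware reciprocal estimate: the exact
 * reciprocal with the low 12 mantissa bits cleared. */
static float crude_recip(float x)
{
    float r = 1.0f / x;
    uint32_t bits;
    memcpy(&bits, &r, sizeof bits);
    bits &= 0xFFFFF000u;            /* keep sign, exponent, top mantissa bits */
    memcpy(&r, &bits, sizeof r);
    return r;
}

/* One Newton-Raphson step for 1/x: r' = r + r * (1 - x * r). */
static float refine_recip(float x, float r)
{
    return r + r * (1.0f - x * r);
}

/* Division expressed as multiplication by a refined reciprocal. */
static float div_by_estimate(float a, float b)
{
    float r = crude_recip(b);
    r = refine_recip(b, r);
    return a * r;
}
```

Each refinement step roughly doubles the number of correct bits, so the number of steps needed depends on the accuracy of the initial estimate and on the precision the application requires.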

There is also a 32-bit special purpose register that is managed by software to represent a bitmask of vector registers in use. This allows the operating system to optimize vector save and restore algorithms as part of context switch management.
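The behavior of the sticky saturation bit in the vector status and control register can be modeled in scalar C. The sketch below is a software model only and does not touch any hardware register; sat_sticky, sat_add_u8, and sat_add_16xu8 are hypothetical names chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of the sticky saturation status bit. */
static bool sat_sticky = false;

/* Saturating add of one unsigned 8-bit element. When the true sum does
 * not fit, the result is clamped and the sticky flag is set. */
static uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;
    if (sum > UINT8_MAX) {
        sat_sticky = true;          /* stays set until cleared explicitly */
        return UINT8_MAX;
    }
    return (uint8_t)sum;
}

/* The same operation applied to all 16 elements of a 128-bit "vector". */
static void sat_add_16xu8(const uint8_t a[16], const uint8_t b[16],
                          uint8_t out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = sat_add_u8(a[i], b[i]);
}
```

"Sticky" means that a later operation that does not saturate leaves the flag set, so software can check once after a sequence of operations instead of after each one.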

Runtime determination of vector capability

A program can determine whether a system supports the vector extension by reading the vmx_version field of the _system_configuration structure. If this field is non-zero, then the system processors and operating system contain support for the vector extension. A __power_vmx() macro is provided in /usr/include/sys/systemcfg.h for performing this test. This can be useful for software that conditionally exploits the vector extension when present, or to use functionally equivalent scalar code paths when not present.
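The test can be wrapped in a small helper so that the rest of the program selects a code path without referring to AIX headers directly. This is a minimal sketch: have_vector_extension and pick_code_path are hypothetical names, and the non-AIX branch (which always reports no vector support) is an assumption added so that the example also builds on other systems.

```c
#ifdef _AIX
#include <sys/systemcfg.h>   /* _system_configuration, __power_vmx() */
#endif

/* Returns 1 when the processors and operating system support the
 * vector extension, 0 otherwise. */
static int have_vector_extension(void)
{
#if defined(_AIX) && defined(__power_vmx)
    return __power_vmx() ? 1 : 0;
#else
    /* Not AIX, or a header without the macro: use scalar code only. */
    return 0;
#endif
}

/* Names the code path a caller should use. */
static const char *pick_code_path(void)
{
    return have_vector_extension() ? "vector" : "scalar";
}
```

A program would normally call have_vector_extension() once at startup and store a function pointer to either the vector or the scalar implementation of each performance-critical routine.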

AIX ABI extension

The AIX Application Binary Interface (ABI) has been extended to support the addition of vector register state and conventions. Refer to the Assembler Language Reference for a complete description of the ABI extensions.

AIX supports the AltiVec programming interface specification. Below is a table of the C and C++ vector data types. All vector data types are 16 bytes in size, and must be aligned on a 16-byte boundary. Aggregates containing vector types must follow normal conventions of aligning the aggregate to the requirement of its largest member. If an aggregate containing a vector type is packed, then there is no guarantee of 16-byte alignment of the vector type. An AIX compiler supporting the AltiVec programming interface specification is required.

Table 1. New C and C++ Vector Data Types
New C and C++ types      Contents
vector unsigned char     16 unsigned char
vector signed char       16 signed char
vector bool char         16 unsigned char
vector unsigned short    8 unsigned short
vector signed short      8 signed short
vector bool short        8 unsigned short
vector unsigned int      4 unsigned int
vector signed int        4 signed int
vector bool int          4 unsigned int
vector float             4 float
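The alignment rule for aggregates that contain a vector type can be checked at compile time. The sketch below uses a C11 struct with a 16-byte alignment specifier as a stand-in for a real vector type, so that it builds without a vector-enabled compiler; vec4f_standin and struct particle are hypothetical names, and a real program would declare the member as, for example, vector float.

```c
#include <stddef.h>

/* Stand-in for a 16-byte vector type (assumption: C11 _Alignas). */
typedef struct { _Alignas(16) float lane[4]; } vec4f_standin;

/* A non-packed aggregate takes the alignment of its most strictly
 * aligned member, so padding is inserted before the vector member. */
struct particle {
    char          tag;        /* 1 byte, followed by 15 bytes of padding */
    vec4f_standin position;   /* starts on a 16-byte boundary */
};

_Static_assert(sizeof(vec4f_standin) == 16,
               "vector stand-in occupies 16 bytes");
_Static_assert(_Alignof(struct particle) == 16,
               "aggregate inherits 16-byte alignment");
_Static_assert(offsetof(struct particle, position) == 16,
               "vector member is padded to a 16-byte offset");
```

If the aggregate were packed, these assertions would no longer be guaranteed to hold, which is why packed aggregates give no guarantee of 16-byte alignment for vector members.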

The following table outlines the vector register usage conventions.

Table 2. Vector Register Conventions
Register type    Register   Status        Use
VRs              VR0        Volatile      Scratch register
VRs              VR1        Volatile      Scratch register
VRs              VR2        Volatile      First vector argument; first vector of function return value
VRs              VR3        Volatile      Second vector argument, scratch
VRs              VR4        Volatile      Third vector argument, scratch
VRs              VR5        Volatile      Fourth vector argument, scratch
VRs              VR6        Volatile      Fifth vector argument, scratch
VRs              VR7        Volatile      Sixth vector argument, scratch
VRs              VR8        Volatile      Seventh vector argument, scratch
VRs              VR9        Volatile      Eighth vector argument, scratch
VRs              VR10       Volatile      Ninth vector argument, scratch
VRs              VR11       Volatile      Tenth vector argument, scratch
VRs              VR12       Volatile      Eleventh vector argument, scratch
VRs              VR13       Volatile      Twelfth vector argument, scratch
VRs              VR14:19    Volatile      Scratch
VRs              VR20:31    Reserved      Default vector-enabled mode: these registers are reserved and must not be used.
                            Nonvolatile   Extended ABI vector-enabled mode: these registers are nonvolatile and their values are preserved across function calls.
Special purpose  VRSAVE     Reserved      In the AIX ABI, VRSAVE is not used. An ABI-compliant program must not use or alter VRSAVE.
Special purpose  VSCR       Volatile      The Vector Status and Control Register contains the saturation status bit and non-Java mode control bit.

The AltiVec Programming Interface Specification defines the VRSAVE register to be used as a bitmask of vector registers in use. AIX requires that an application never modify the VRSAVE register.

The first 12 vector parameters in a function are placed in VR2 through VR13. Unneeded vector parameter registers contain undefined values upon entry to the function. Vector parameters in a non-variable-length argument list are not shadowed in general purpose registers (GPRs). Any additional vector parameters, from the 13th onward, are passed through memory on the program stack, 16-byte aligned, in their appropriate mapped location within the parameter region corresponding to their position in the parameter list.

For variable length argument lists, va_list continues to be a pointer to the memory location of the next parameter. When va_arg() accesses a vector type, va_list must first be aligned to a 16-byte boundary. The receiver or consumer of a variable-length argument list is responsible for performing this alignment prior to retrieving the vector type parameter.
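The alignment step that the receiver performs before fetching a vector argument amounts to rounding the argument pointer up to the next 16-byte boundary. The following sketch models this on a plain byte pointer; it does not manipulate a real va_list, and align_up_16 and next_vector_arg are hypothetical names used for illustration.

```c
#include <stdint.h>

/* Rounds an address up to the next 16-byte boundary. An address that
 * is already aligned is returned unchanged. */
static uintptr_t align_up_16(uintptr_t addr)
{
    return (addr + 15u) & ~(uintptr_t)15u;
}

/* Model of fetching one 16-byte vector argument from a byte-addressed
 * argument area: align the cursor, take the address of the argument,
 * then advance the cursor past it. */
static const unsigned char *next_vector_arg(const unsigned char **cursor)
{
    const unsigned char *p =
        (const unsigned char *)align_up_16((uintptr_t)*cursor);
    *cursor = p + 16;
    return p;
}
```

In practice this work is done by the compiler's implementation of va_arg() for vector types; the sketch only shows why padding can appear between a preceding scalar argument and a vector argument.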

A non-packed structure or union passed by value that has a vector member anywhere within it is aligned to a 16-byte boundary on the stack.

A function that takes a variable-length argument list has all parameters mapped in the argument area ordered and aligned according to their types. The first eight words (32-bit) or doublewords (64-bit) of a variable-length argument list are shadowed in GPRs r3 - r10. This includes vector parameters.

Functions that have a return value declared as a vector data type place the return value in VR2. Any function that returns a vector type or has vector parameters requires a function prototype. This avoids the compiler shadowing the VRs in GPRs for the general case.

Legacy ABI compatibility and interoperability

Due to the nature of interfaces (such as setjmp(), longjmp(), sigsetjmp(), siglongjmp(), _setjmp(), _longjmp(), getcontext(), setcontext(), makecontext(), and swapcontext()) that must save and restore nonvolatile machine state, risk is introduced when considering dependencies between legacy and vector-extended ABI modules. To complicate matters, the setjmp family of functions in libc resides in a static member of libc, which means that every existing AIX binary has a statically bound copy of the setjmp family and others that existed with the version of AIX it was linked against. Furthermore, existing AIX binaries have jmpbufs and ucontext data structure definitions that are insufficient to house any additional nonvolatile vector register state.

Whenever legacy modules and new modules interleave calls or call-backs, and a legacy module could perform a longjmp() or setcontext() that bypasses the normal linkage convention of a vector-extended module, there is a risk of compromising the nonvolatile vector register state.

For this reason, while the AIX ABI defines nonvolatile vector registers, the default compilation mode when using vectors (AltiVec) in AIX compilers is to not use any of the nonvolatile vector registers. This results in a default compilation environment that safely allows exploitation of vectors (AltiVec) while introducing no risk with respect to interoperability with legacy binaries.

For applications where interoperability and module dependence is completely known, an additional compilation option can be enabled that will allow the use of nonvolatile vector registers. This mode should only be used when all dependent legacy modules and behaviors are fully known and understood as either having no dependence on functions such as setjmp(), sigsetjmp(), _setjmp(), or getcontext(), or ensuring that all module transitions are performed using normal subroutine linkage convention, and that no call-backs to an upstream legacy module are used.

The default AltiVec compilation environment predefines __VEC__, in accordance with the AltiVec Technology Programming Interface Manual.

When the option to use nonvolatile vector registers is enabled, the compilation environment must also predefine __EXTABI__. You can compile non-vector enabled modules to be extended ABI-aware by explicitly defining __AIXEXTABI. This will ensure that those modules can safely interact with vector-enabled modules that are enabled to use nonvolatile vector registers.
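Source code can test these macros to find out which mode a translation unit is being compiled in. The following is a minimal sketch; vector_abi_mode is a hypothetical helper, and the strings it returns are illustrative only.

```c
/* Reports which of the macros described above are visible to this
 * translation unit. The extended ABI check comes first because a
 * compiler that predefines __EXTABI__ also predefines __VEC__. */
static const char *vector_abi_mode(void)
{
#if defined(__EXTABI__) || defined(__AIXEXTABI)
    return "extended ABI aware (nonvolatile vector registers possible)";
#elif defined(__VEC__) || defined(__AIXVEC)
    return "default vector mode (VR20:31 reserved)";
#else
    return "not vector enabled";
#endif
}
```

A check of this kind can be useful in a build-time diagnostic to confirm that every module of an application was compiled with the intended options.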

Extended context

In order to support the additional machine state required by the vector extension as well as other extensions such as user keys, AIX 5.3 introduced support for extended context structures. The primary application-visible use of machine-context information is its presence in the sigcontext structure provided to signal handlers, and the resulting activation of the machine context in the sigcontext upon return from the signal handler. The sigcontext structure is actually a subset of the larger ucontext structure. The two structures are identical for up to sizeof(struct sigcontext). When AIX builds a signal context to be passed to a signal handler, it actually builds a ucontext structure on the signal handler's stack. The machine-context portion of a signal context must contain all of the active machine state, volatile and nonvolatile, for the involuntarily interrupted context. To accomplish this without affecting binary compatibility with existing signal handlers, space previously reserved in the ucontext structure now serves as an indication of whether extended context information is available.

A newly defined field in the ucontext, __extctx, is the address of an extended context structure, struct __extctx, as defined in the sys/context.h file. A new field, __extctx_magic, within the ucontext structure indicates whether the extended context information is valid when the value of __extctx_magic is equal to __EXTCTX_MAGIC. The additional vector machine state for a thread using the vector extension is saved and restored as a member of this new context extension to the ucontext structure as a part of signal delivery and return.
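A signal handler installed with SA_SIGINFO receives the ucontext as its third argument and can test the magic value before relying on the extended context. The sketch below is hedged accordingly: the field and constant names (__extctx, __extctx_magic, __EXTCTX_MAGIC) are taken from the description above, the assumption is that __EXTCTX_MAGIC is a preprocessor macro made visible by defining __AIXEXTABI before the headers are included, and the AIX-specific branch is compiled out on other systems. The names handler_ran, extended_context_seen, on_signal, and install_handler are hypothetical.

```c
#ifdef _AIX
#ifndef __AIXEXTABI
#define __AIXEXTABI              /* request the extended ucontext layout */
#endif
#else
#define _XOPEN_SOURCE 700        /* expose sigaction() on strict builds */
#endif

#include <signal.h>
#include <string.h>
#include <ucontext.h>

static volatile sig_atomic_t handler_ran = 0;
static volatile sig_atomic_t extended_context_seen = 0;

/* Records whether the interrupted context carries valid extended
 * context information. */
static void on_signal(int signo, siginfo_t *info, void *context)
{
    (void)signo;
    (void)info;
    handler_ran = 1;
#if defined(_AIX) && defined(__EXTCTX_MAGIC)
    ucontext_t *uc = (ucontext_t *)context;
    if (uc->__extctx_magic == __EXTCTX_MAGIC && uc->__extctx != 0)
        extended_context_seen = 1;
#else
    (void)context;               /* no extended context on this system */
#endif
}

/* Installs on_signal for signo; returns 0 on success, -1 on failure. */
static int install_handler(int signo)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_signal;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}
```

The layout of the extended context itself is not shown here; consult the sys/context.h header before reading any of its members.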

The ucontext structure is also used by APIs such as getcontext(), setcontext(), swapcontext(), and makecontext(). In these cases, the context is saved as the result of a voluntary action, for which the calling linkage convention requires only that nonvolatile machine state be saved. Because the default mode of vector enablement on AIX, as described in the ABI section, is to not use nonvolatile vector registers, no extension of the ucontext structure is required for the majority of applications. If an application chooses to explicitly enable the use of nonvolatile vector registers, it picks up an extended-size ucontext structure that already has space for the __extctx field, which is included through the implicit definition of __EXTABI__ by the compiler. The extended ucontext can also be picked up by an explicit definition of __AIXEXTABI.

Similarly, the jmp_buf for use with setjmp() or longjmp() requires no change for default-mode vector-enabled applications, since nonvolatile vector registers are not used. The explicit enablement of nonvolatile vector registers results in larger jmp_buf allocations, due to the implicit definition of __EXTABI__ by the compiler. The extended jump buffers can also be activated by explicit definition of __AIXEXTABI.

See the sys/context.h header file for a more detailed layout of the extended context information.

Vector memory allocation and alignment

Vector data types introduce a data type requiring 16-byte alignment. In accordance with the AltiVec programming interface specification, a set of malloc subroutines (vec_malloc, vec_free, vec_realloc, vec_calloc) are provided by AIX that give 16-byte aligned allocations.
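A program that needs vector-safe buffers can hide the choice of allocator behind a pair of helpers and verify the alignment it receives. This is a sketch under stated assumptions: vec_malloc and vec_free are assumed to be declared by stdlib.h when __VEC__ or __AIXVEC is defined, the C11 aligned_alloc fallback is an addition for building on other systems, and alloc_vector_buffer, free_vector_buffer, and is_16_byte_aligned are hypothetical names.

```c
#include <stdint.h>
#include <stdlib.h>

/* Returns nonzero if p lies on a 16-byte boundary. */
static int is_16_byte_aligned(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Allocates at least `bytes` bytes aligned to 16 bytes, or returns
 * NULL on failure. */
static void *alloc_vector_buffer(size_t bytes)
{
#if defined(_AIX) && (defined(__VEC__) || defined(__AIXVEC))
    return vec_malloc(bytes);
#else
    if (bytes > SIZE_MAX - 15u)
        return NULL;
    /* aligned_alloc expects a size that is a multiple of the alignment. */
    size_t rounded = (bytes + 15u) & ~(size_t)15u;
    return aligned_alloc(16, rounded);
#endif
}

/* Releases a buffer obtained from alloc_vector_buffer(). */
static void free_vector_buffer(void *p)
{
#if defined(_AIX) && (defined(__VEC__) || defined(__AIXVEC))
    vec_free(p);
#else
    free(p);
#endif
}
```

Checking is_16_byte_aligned() in a debug build is a cheap way to catch buffers that were obtained from an allocator without the alignment guarantee.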

Vector-enabled compilation, with __VEC__ implicitly defined by the compiler, will result in any calls to legacy malloc and calloc being redirected to their vector-safe counterparts, vec_malloc and vec_calloc, respectively. Non-vector code can also be explicitly compiled to pick up these same malloc and calloc redirections by explicitly defining __AIXVEC. The alignment of the default malloc(), realloc(), and calloc() allocations can also be controlled at runtime.

First, externally to any program, a new environment variable, MALLOCALIGN, can be set to the default alignment desired for every malloc() allocation. An example is
MALLOCALIGN=16;  export MALLOCALIGN

The MALLOCALIGN environment variable can be set to any power of 2 greater than or equal to the size of a pointer in the corresponding execution mode (4 bytes for 32-bit mode, 8 bytes for 64-bit mode). If MALLOCALIGN is set to an invalid value, then the value is rounded up to the next power of 2, and all subsequent malloc() allocations will be aligned to that value.

Also, internally to a program, the program can use a new command option to the mallopt() interface to specify the desired alignment for future allocations. An example is
rc = mallopt(M_MALIGN, 16);

Refer to mallopt and MALLOCALIGN for more information.
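The rounding rule described for MALLOCALIGN can be expressed as a short function. This is a model of the documented behavior, not the code used by the AIX malloc subsystem; effective_malloc_alignment is a hypothetical name, and treating a request smaller than the pointer size as a request for the pointer size is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the alignment that would take effect for a requested value:
 * the smallest power of 2 that is at least the pointer size and at
 * least the requested value. */
static size_t effective_malloc_alignment(size_t requested)
{
    size_t align = sizeof(void *);      /* 4 in 32-bit mode, 8 in 64-bit */
    while (align < requested && align <= SIZE_MAX / 2)
        align *= 2;                     /* next power of 2 */
    return align;
}
```

For example, a request of 16 is already valid and is returned unchanged, while a request of 24 is rounded up to 32.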

printf and scanf of vector data types

In accordance with the AltiVec programming interface specification, support is added to the AIX versions of scanf, fscanf, sscanf, wsscanf, printf, fprintf, sprintf, snprintf, wsprintf, vprintf, vfprintf, vsprintf, and vwsprintf for the new vector conversion format strings. The new size formatters are as follows:

  • vl or lv consumes one argument and modifies an existing integer conversion, resulting in vector signed int, vector unsigned int, or vector bool int for output conversions or vector signed int * or vector unsigned int * for input conversions. The data is then treated as a series of four 4-byte components, with the subsequent conversion format applied to each.
  • vh or hv consumes one argument and modifies an existing short integer conversion, resulting in vector signed short, or vector unsigned short for output conversions or vector signed short * or vector unsigned short * for input conversions. The data is treated as a series of eight 2-byte components, with the subsequent conversion format applied to each.
  • v consumes one argument and modifies a 1-byte integer, 1-byte character, or 4-byte floating point conversion. If the conversion is a floating point conversion the result is vector float for output conversion or vector float * for input conversion. The data is treated as a series of four 4-byte floating point components with the subsequent conversion format applied to each. If the conversion is an integer or character conversion, the result is either vector signed char, vector unsigned char, or vector bool char for output conversion or vector signed char * or vector unsigned char * for input conversions. The data is treated as a series of sixteen 1-byte components, with the subsequent conversion format applied to each.

Any conversion format that can be applied to the singular form of a vector-data type can be used with a vector form. The %d, %x, %X, %u, %i, and %o integer conversions can be applied with the %lv, %vl, %hv, %vh, and %v vector-length qualifiers. The %c character conversion can be applied with the %v vector length qualifier. The %a, %A, %e, %E, %f, %F, %g, and %G float conversions can be applied with the %v vector length qualifier.

For input conversions, an optional separator character can be specified excluding whitespace preceding the separator. If no separator is specified, the default separator is a space including whitespace characters preceding the separator, unless the conversion is c, in which case the default separator is null.

For output conversions, an optional separator character can be specified immediately preceding the vector size conversion. If no separator is specified, the default separator is a space, unless the conversion is c, and then the default separator is null.

Threaded applications

Multithreaded applications exploiting the vector extension are also supported. These applications are supported in both system scope (1:1 threading model) and process scope (M:N threading model). If a multithreaded application is compiled with nonvolatile vector registers enabled, the pthreads for that application will be flagged as extended ABI pthreads. The result will be larger context-save buffer allocations within the pthread library for those threads. The dbx AIX debugger also provides full support for machine-level debugging of vector-enabled multithreaded programs.

Compilers

An AIX compiler supporting the vector extension must conform to the AIX ABI Vector Extension. As previously described, the default vector-enabled compilation mode on AIX should be with the use of nonvolatile vector registers disabled. An explicit option to enable the use of nonvolatile vector registers can be provided and enabled at your discretion, after understanding the issues and risks regarding new and old module interoperability.

When enabling the use of nonvolatile vector registers, a C or C++ compiler must predefine __EXTABI__. Also, when enabled for any form of vector compilation, a C or C++ compiler is expected to predefine __VEC__. If compiling non-vector enabled C or C++ modules for linkage with vector-enabled Fortran modules, it is best that the C or C++ modules be explicitly compiled with at least __AIXVEC defined (explicit definition analogous to __VEC__), and also __AIXEXTABI (explicit definition analogous to __EXTABI__) if nonvolatile vector registers are enabled in the Fortran modules.

In addition to the AltiVec programming interface specification, which provides an explicit extension to the C and C++ languages for vector programming, some compilers might also allow the exploitation of the vector extension at certain optimization settings when targeting a processor that supports the vector extension.

Refer to your compiler documentation for more details.

Assembler

The AIX assembler, /usr/ccs/bin/as, now supports the additional instruction set defined by the vector extension, and specifically as implemented by the PowerPC 970 processor. You can use the new -m970 assembly mode or the .machine 970 pseudo op within the source file to enable the assembly of the new vector instructions. Refer to the Assembler Language Reference for more information.

Debugger

The dbx AIX debugger, in /usr/ccs/bin/dbx, supports machine-level debugging of vector-enabled programs. This support includes the ability to disassemble the new vector instructions, and to display and set vector registers. A new $instructionset value of 970 has been defined for enabling disassembly of the PowerPC 970-specific instructions, including the vector instructions, when not running dbx on a PowerPC 970 system. Note that if running dbx on a PowerPC 970, the default $instructionset will be 970.

To view vector registers, the subcommand unset $novregs must be used, as vector registers are not displayed by default. Also, if the processor does not support the vector extension, or the process or thread being examined is not using the vector extension, then no vector register state will be displayed. Otherwise, the registers subcommand will print all of the vector registers and their contents in raw hexadecimal.

You can also display the vector registers individually, formatted according to a fundamental type. For instance, print $vr0 will display the contents of register VR0 as an array of 4 integers. print $vr0c will display the contents of register VR0 as an array of 16 characters. print $vr0s will display the contents of register VR0 as an array of 8 shorts, and print $vr0f will display the contents of register VR0 as an array of 4 floats.

You can assign entire vector registers, for example assign $vr0 = $vr1, or assign individual vector elements of the vector register as if assigning an element of an array. For example, assign $vr0[3] = 0x11223344 sets only the 4th integer member of VR0. assign $vr0f[0] = 1.123 results in only the first float member of VR0 being set to the value 1.123.

You can trace vector registers throughout the execution of a function or program, for example tracei $vr0 in main will display the contents of VR0 each time it is modified in main(). Likewise, by specifying one of the format registers ($vr0f, $vr0c, $vr0s) to tracei, each display of the contents will be formatted accordingly.

As long as compilers represent vector data types as arrays of their fundamental types, dbx should also be able to display a vector data type formatted as an array.

Refer to the dbx command documentation for more information.

Enablement for third-party debuggers is also provided in the form of two new ptrace operations, PTT_READ_VEC and PTT_WRITE_VEC, for reading or writing vector register state for a thread. Refer to the ptrace documentation for details.

The /proc filesystem is also enhanced to support a /proc-based debugger. The status and lwpstatus files for a vector-enabled process and thread, respectively, are extended to include vector register state. A new control message, PCSVREG, is supported on the write of a process or thread control file for setting vector register state. Refer to the /proc File Reference for more details.

Core files

AIX also supports the inclusion of vector machine state as part of the core file for a vector-enabled process or thread. Only if a process or thread is using the vector extension will the vector machine state be included in the core image for that thread. Note that if you select the pre-AIX 4.3 core file format, vector state will not be included. The vector state is only supported in the current core file formats. You can use the dbx command to read and display the vector machine state of a vector-enabled core file.