By: Paul Clarke.
The IBM Power processor family has a somewhat hidden little gem which, when exploited, can yield significant performance improvements. IBM Power processors have a vector processing facility (known as AltiVec, VMX, and VSX in its various incarnations) which can perform multiple computations with a single instruction, also known as SIMD (single instruction, multiple data). POWER8 has 64 "vector-scalar registers" (VSRs), the first 32 of which share space with the 32 floating-point registers. Each VSR is 128 bits wide. Thus, each VSR can hold 2 double-precision (64 bits each) or 4 single-precision (32 bits each) floating-point quantities. Vector instructions operate on all of these quantities simultaneously. Among many other capabilities, there are instructions for:

vector add, subtract, multiply, divide

vector multiply-add

vector square root
With vector multiply-add, for example, it is thus possible to perform 4 single-precision floating-point multiplications and 4 single-precision floating-point additions, all with a single instruction.
Compilers generally have some support for "auto-vectorization," but they must also be conservative in exploiting the facility because of unknowns, like alignment or aliasing, which cannot be determined at compilation time. Thus, it is often necessary to vectorize code explicitly in order to fully exploit the hardware.
Fortunately, the compilers provide data types and functions which allow higher-level languages like C to exploit the vector capabilities of the processor without having to write in assembly language. The "vector add" instruction can be invoked by calling the "vec_add" function (after including the "altivec.h" header file and compiling with the "-mvsx" flag).
As an example, straightforward code to determine the maximum value in an array of 32 single-precision floating-point numbers might look like this:
max = array[0];
for (i = 1; i < 32; i++) {
    if (max < array[i]) max = array[i];
}
This code is very compact and understandable. However, it does not fully exploit the capabilities of a vector-enabled processor.
Note that there is no VSX instruction which takes a vector and returns the maximum value of its elements. However, there is an instruction that takes two vectors and returns a vector containing the maximums of the corresponding elements. While this obviously does not give the final answer by itself, it certainly helps us get there more quickly.
We can successively load pairs of vectors and compute the maximums of their respective elements, thereby performing 4 compares and no conditionals in a single instruction. After that first "vector maximum", we will have eliminated 4 elements from consideration. If we do this carefully, reusing vectors whose contents have already been eliminated from consideration, we can avoid needing a larger set of registers.
Let's start with an array of 8 single-precision floating-point numbers. The code for a vector approach might look like this:
/* include "altivec.h" to get the data types and functions */
#include <altivec.h>
/* take an array of 8 floats, return the max */
float max8(const float *data) {
__vector float v[2]; /* the "working" vectors */
float m0, m1; /* some helpers for the final answer */
/* load the vectors by simple unitary assignments,
* which should translate to a single "vector load" instruction each
*/
v[0] = *(__vector float *)(&data[0]);
v[1] = *(__vector float *)(&data[4]);
/* compare the two vectors
* return a vector of the maximums of the individual elements
* note that we reuse v[0]
*/
v[0] = vec_max(v[0],v[1]);
/* now dig into the resulting "max" vector
* and find the maximum element explicitly
*/
m0 = v[0][0] > v[0][1] ? v[0][0] : v[0][1];
m1 = v[0][2] > v[0][3] ? v[0][2] : v[0][3];
return m0 > m1 ? m0 : m1;
}
Note that the code is significantly more complex.
If we add some iterations and time stamping around that function, we can look at its performance. Avoiding specifics, since this is only an example, here are some performance results:
mode            time
non-vectorized  3.693498
vectorized      3.744294
There's not much difference. A quick analysis shows that the "simple" code does 7 floating-point comparisons and 7 conditionals. The vectorized code does 1 vector compare, 3 floating-point comparisons, plus 4 vector-to-scalar floating-point conversions for the final step. Without deeper analysis, it is plausible that this roughly evens out. Let's try scaling our example up.
For an array of 32 floating-point numbers, the non-vectorized code will look exactly the same. If we continue to avoid loops, the vectorized code will look like this (edited slightly for brevity):
float max32(const float *data) {
__vector float v[4]; /* the "working" vectors */
float m0, m1; /* some helpers for the final answer */
v[0] = *(__vector float *)(&data[0]);
v[1] = *(__vector float *)(&data[4]);
v[0] = vec_max(v[0],v[1]);
v[1] = *(__vector float *)(&data[8]);
v[2] = *(__vector float *)(&data[12]);
v[1] = vec_max(v[1],v[2]);
v[0] = vec_max(v[0],v[1]);
v[1] = *(__vector float *)(&data[16]);
v[2] = *(__vector float *)(&data[20]);
v[1] = vec_max(v[1],v[2]);
v[2] = *(__vector float *)(&data[24]);
v[3] = *(__vector float *)(&data[28]);
v[2] = vec_max(v[2],v[3]);
v[1] = vec_max(v[1],v[2]);
v[0] = vec_max(v[0],v[1]);
m0 = v[0][0] > v[0][1] ? v[0][0] : v[0][1];
m1 = v[0][2] > v[0][3] ? v[0][2] : v[0][3];
return m0 > m1 ? m0 : m1;
}
Note again that the code is fairly complex.
Here are some performance results for an array of 32 floats:
mode            time
non-vectorized  474.958120
vectorized      3.733729
A very significant difference.
A quick analysis for a 32-float array has the "simple" code doing 31 floating-point comparisons and 31 conditionals. The vectorized code does 7 vector comparisons, 3 scalar comparisons, 3 conditionals, and 4 vector-to-scalar conversions.
With some effort, converting computationally intensive code to incorporate vectorization can yield significant performance benefits.