The MASS high-performance libraries
MASS stands for Mathematical Acceleration Subsystem. It consists of libraries of mathematical functions specifically tuned for optimum performance on various IBM computing platforms. MASS was originally launched by IBM in 1995, and has been continuously improved and expanded since then.
There are currently versions of MASS for all the IBM® POWER™ processors, running AIX® or Linux® operating systems. There are also versions for the IBM System BlueGene®/L and the IBM System BlueGene®/P supercomputers, as well as the IBM Cell Broadband Engine™ (Cell/B.E.™) solution.The libraries contain accelerated implementations of elementary functions such as the trigonometric and hyperbolic functions and their inverses, power, logarithm, exponential, error function, and others. Complete lists of the included functions are available on the IBM Mathematical Acceleration Subsystem page.
There are both scalar and vector libraries, and for Cell/B.E. and POWER7 there are single instruction multiple data (SIMD) libraries as well. Note that the accuracy and exception handling might not be identical in MASS functions and system library functions.The MASS libraries are packaged with the IBM XL C/C++ and XL Fortran compilers, and are also available on the MASS Web site, for users of other compilers (such as gcc) for the target hardware.
The libraries are callable from C, C++, or Fortran source programs. The IBM XL C/C++ and IBM XL Fortran compilers are capable of recognizing opportunities to use MASS to accelerate the source program, and invoking it automatically without the need for source-program changes. This article shows you how to implement a technique that can help your organization leverage this powerful capability.
Any C, C++, or Fortran program that contains calls to mathematical library functions (such as exp, log, sin, cos, etc.) in a loop can potentially benefit from the techniques described in this article.
Auto-vectorization is a process whereby the IBM XL C/C++ or Fortran compilers recognize an opportunity to improve the performance of the program being compiled by replacing a call to a standard library (C/C++ libm or Fortran intrinsic) function in a loop with a call to the corresponding MASS vector function. Since the MASS vector functions are much faster (by a factor of up to about 30) than a repeated call to a standard library function, the resulting performance improvement can be substantial.
A simple example is a loop that computes a particular function for multiple arguments, such as the following Fortran program.
subroutine sub (y, x, n)
real*8 y(*), x(*)
integer n, i
intrinsic exp
do i=1,n
y(i) = exp(x(i)) ! standard intrinsic
end do
end subroutine
|
With the appropriate compiler options, the compiler will recognize the opportunity to accelerate the program by replacing the repeated call to exp() with a single call to the corresponding MASS vector function vexp(), resulting in a program that behaves as though it were originally written as follows:
include 'massv.include'
subroutine sub (y, x, n)
real*8 y(*), x(*)
integer n
call vexp (y, x, n) ! MASS vector function
end subroutine
|
This is only a simple example, demonstrating the basic idea behind auto-vectorization. The XL compilers are actually able to recognize much more complicated opportunities, and will also try to rearrange the instructions in the source program if necessary in order to create auto-vectorization opportunities.
A more complicated and realistic example is examined in the case study section of this article.
Compiler options for auto-vectorization
You can compile programs with any of the following sets of options:
- -qhot -qnostrict (for Fortran)
- -qhot -qnostrict -qignerrno (for C/C++)
- -qhot -O3
- -O4
- -O5
When you compile a program using one of these option sets, the compiler automatically attempts to vectorize calls to system math functions by calling the equivalent MASS vector functions (except for calls to the following functions, which are exceptions: vatan2, vsatan2, vdnint, vdint, vcosisin, vscosisin, vqdrt, vsqdrt, vrqdrt, vsrqdrt, vpopcnt4, vpopcnt8, vexp2, vexp2m1, vsexp2, vsexp2m1, vlog2, vlog21p, vslog2, and vslog21p). If the program is such that the compiler cannot vectorize, it automatically tries to call the equivalent MASS scalar functions. For automatic vectorization or scalarization, the compiler uses versions of the MASS functions contained in the compiler library libxlopt.a. You do not need to add any special calls to the MASS functions in your code, or to link explicitly to the xlopt library.
In addition to any of the preceding sets of options, when the -qipa option is in effect, if the compiler cannot vectorize, it tries to inline the MASS scalar functions before deciding to call them.
If you want to disable auto-vectorization, you can add the option -qhot=novector.
What follows is a case study of an actual application — a discrete Fourier transform (DFT) — showing the performance improvements that result when compiling with various compiler options. The application is simple enough to be clearly illustrative, yet complex enough to provide non-trivial optimization opportunities.
Timing for both programs was done with the driver program given in Appendix 3, on a POWER6 computer running at 4.704 GHz.
Appendix 1 shows the Fortran DFT source program. It contains a nested loop that calls exp(), cos(), and sin(), followed by a loop that calls sin() and sqrt(). The program was compiled with -O3 (which does not do auto-vectorization) and also with -O4 (which does auto-vectorization). The results are shown in Figure 1.
Note that the benefit of auto-vectorization grows as the problem size is increased, reaching a factor of 8.94x improvement at a problem size of 2000.
Figure 1: Comparison of DFT Fortran performance with compile options -O3 and -O4, for various problem sizes.
Appendix 2 shows a C version of the Fortran DFT program in Appendix 1 (It contains a dummy consume() routine so that the compiler's inter-procedural analysis [IPA] cannot see that the result of the computation is not actually used in this illustrative example, and would therefore optimize away the entire program).
The program was compiled with -O3 (which does not offer auto-vectorization), with -O4 (which does offer auto-vectorization), and with -O5 (which does offer auto-vectorization as well as IPA). The results are shown in Figure 2.
As illustrated in the Fortran example, the benefit of auto-vectorization grows as the problem size is increased, reaching a factor of 6.00x at n=2000. In addition, the activation of IPA at -O5 provides an additional speedup of 1.22x since it can determine that the inputs and outputs are not aliased (that is, they do not overlap in memory), allowing it to vectorize the call to atan2 in the conversion to polar coordinates. The speedup of -O5 over -O3 is 7.33x at n=2000.
Figure 2: Comparison of DFT C performance with compile options -O3, -O4, and -O5, for various problem sizes.
This article has offered you a description of the IBM MASS libraries, and the auto-vectorization capabilities of the IBM XL C/C++ and XL Fortran compilers. In addition, the article has demonstrated the use of various compiler options on an example program (discrete Fourier transform), showing you how accelerations of up to 8.94 times over previous speeds are obtainable using the compilers through automatic invocation of MASS by auto-vectorization.
This demonstration is intended to encourage users with programs that call mathematical functions to experiment with the available compiler options, and benefit from speedups resulting from IBM XL C/C++ or XL Fortran compiler auto-vectorization.
Appendix 1 – Fortran DFT source program
subroutine dft (x, a, phi , n)
real*8 x(n), a(n), phi(n)
integer n
! Compute discrete Fourier transform of real inputs
! x(i) and convert to polar form.
real*8, parameter :: pi=3.1415926535897932384d0
real*8 y_re(n), y_im(n), t, term_re, term_im
intrinsic exp, cos, sin, sqrt, atan
y_re(1:n) = 0.d0
y_im(1:n) = 0.d0
do k=1,n
! compute y(k), k-th DFT output
do i=1,n
! compute i-th term of y(k):
! x(k)*exp(-2*pi*I*(k-1)*(i-1)/n)
! compute real and imaginary parts of i-th term
! using exp(I*t)=exp(t)*(cos(t)+I*sin(t))
t = -2.d0*pi*(k-1)*(i-1)/n
term_re = x(i) * cos(t) * exp(t)
term_im = x(i) * sin(t) * exp(t)
! add term to sum
y_re(k) = y_re(k) + term_re
y_im(k) = y_im(k) + term_im
end do
end do
! transform y to polar coordinates
do k=1,n
! compute amplitude of y(k)
a(k) = sqrt (y_re(k)**2 + y_im(k)**2)
! compute phase of y(k)
phi(k) = atan (y_im(k) / y_re(k))
end do
end subroutine
! initialize input data
subroutine init (a, n)
real*8 a(n)
integer n
intrinsic sin,sqrt
do j=1,n
a(j)=sin(1.d0/sqrt(real(j,8)))
end do
end subroutine
|
Appendix 2 – C DFT source program
#include <math.h>
#define PI 3.1415926535897932384
void dft(double x[],double a[],double phi[],int *m)
{
double y_re[NMAX], y_im[NMAX], t, s, term_re, term_im;
int i,j,k,n=*m;
for(i=0;i<n;++i) {
y_re[i]=y_im[i]=0;
}
for(k=0;k<n;++k)
{
// compute y(k), k-th DFT output
for(i=0;i<n;++i)
{
// compute i-th term of y(k):
// x(k)*exp(-2*pi*I*(k-1)*(i-1)/n)
// compute real and imaginary parts of i-th term
// using exp(I*t)=exp(t)*(cos(t)+I*sin(t))
t=-2.*PI*k*i/(double)n;
term_re=x[i]*exp(t)*cos(t);
term_im=x[i]*exp(t)*sin(t);
// add term to sum
y_re[k]+=term_re;
y_im[k]+=term_im;
}
}
// transform y to polar coordinates
for(k=0;k<n;++k)
{
// compute amplitude of y(k)
a[k]=sqrt(y_re[k]*y_re[k]+y_im[k]*y_im[k]);
// compute phase of y(k)
phi[k]=atan2(y_im[k],y_re[k]);
}
}
// initialize input data
void init(double a[],int *m)
{
int j,n=*m;
for(j=0;j<n;++j)
{
a[j]=sin(1./sqrt((double)j+1.0));
}
}
// Dummy function to use result, preventing compiler from
// optimizing away the computation.
void consume(double a[],double b[],double c[])
{
}
|
Here is the Fortran driver program that was used to time the DFT codes.
program main
interface
subroutine dft(x,a,phi,n)
real*8 x(n),a(n),phi(n)
integer n
end subroutine
subroutine init(a,n)
real*8 a(n)
integer n
end subroutine
subroutine consume(a,b,c)
real*8 a(*),b(*),c(*)
end subroutine
end interface
! Parameters:
! nmax is the problem size.
! nrep is the number of repetitions of the
! problem. This should be chosen so that
! the elapsed time is long enough to give
! sufficient timing resolution.
! cyc is the clock frequency in Hz for the
! processor that the program is to be run on.
! (Can be found from AIX command pmcycles.)
integer, parameter :: nmax=1000
integer, parameter :: nrep=100
real*8, parameter :: cyc=4704000000.d0
real*8 x(nmax), a(nmax), phi(nmax)
real*8 tx, ty, accum, del(4)
intrinsic sin, sqrt
real*8 rtc
del(4)=0.d0
acc = 0.d0
do k=1,nrep
tx=rtc()
call init(x,nmax)
call dft(x,a,phi,nmax)
ty=rtc()
call consume(x,a,phi)
do j=1,nmax
acc = acc + a(j) + phi(j)
end do
del(4) = del(4) + (ty-tx)
end do
del(1) = del(4)/real(nmax,8)
del(2) = del(1)/real(nrep,8)
del(3) = cyc*del(2)
print *,'acc=',acc/real(nrep,8),' n=',nmax,
& ' r=',nrep, ' a=',del(1),' b=',del(2),
& ' c=',del(3),' w=',del(4)
end program
|
The exact performance obtained may vary depending on the model of processor used and its configuration, as well as on the version of the compilers used. Therefore, you may experience performance different from that obtained in the experiments described here.
Learn
- IBM Mathematical Acceleration Subsystem: Provides contents, usage, performance, and accuracy information for the MASS libraries.
- IBM XL C/C++ for AIX User's Guide: Provides more details on compiler flags.
- IBM XL C/C++ for AIX Optimization and Programming Guide: Provides more details on the high-performance libraries.
- IBM XL Fortran for AIX User's Guide: Provides more details on compiler flags.
- IBM XL Fortran for AIX Optimization and Programming Guide: Provides more details on the high-performance libraries.
- The POWER4 Processor Introduction and Tuning Guide: This Redbook has more details on compiler optimization flags.
- AIX and UNIX: Visit the developerWorks AIX and UNIX zone to expand your UNIX skills.
- New to AIX and UNIX: Visit the New to AIX and UNIX page to learn more about AIX and UNIX.
- developerWorks technical events and webcasts: Stay current with developerWorks technical events and webcasts.
- AIX 5L Wiki: A collaborative environment for technical information related to AIX.
- Podcasts: Tune in and catch up with IBM technical experts.
Get products and technologies
- IBM trial software: Build your next development project with software for download directly from developerWorks.
Discuss
- Participate in the AIX and UNIX forums:
- AIX 5L — technical
- AIX for Developers Forum
- Cluster Systems Management
- IBM Support Assistant
- Performance Tools — technical
- Virtualization — technical
- More AIX and UNIX forums
- Participate in the developerWorks blogs and get involved in the developerWorks community.
- Participate in the C/C++ Café, which includes the Scientific Computing with C/C++ Blog

Robert Enenkel works in the Optimizing Compiler Group at the IBM Toronto Laboratory; he was previously a Research Associate at the IBM Centre for Advanced Studies (CAS). Enenkel received his B.S., M.S., and Ph.D. degrees from the University of Toronto, with thesis work in the area of parallel numerical methods for differential equations. He currently performs research and development in numerical computing as it relates to compilers and operating systems, including floating-point arithmetic, mathematical function libraries, and the performance tuning of algorithms. Dr. Enenkel has received several IBM invention achievement awards and author recognition awards. More information may be found on his Web page at https://www-927.ibm.com/ibm/cas/toronto/people/members/renekel.shtml.

Daniel Zabawa works in the Optimizing Compiler Group at the IBM Toronto Laboratory. He received his B.S. and M.S. degrees in Computer Science from the University of Toronto. Zabawa currently performs research and development in loop optimization and scheduling, floating-point algorithms for superscalar architectures and performance tuning of floating-point math functions.




