The MASS highperformance libraries
MASS stands for Mathematical Acceleration Subsystem. It consists of libraries of mathematical functions specifically tuned for optimum performance on various IBM computing platforms. MASS was originally launched by IBM in 1995, and has been continuously improved and expanded since then.
There are currently versions of MASS for all the IBM® POWER™ processors, running AIX® or Linux® operating systems. There are also versions for the IBM System BlueGene®/L and the IBM System BlueGene®/P supercomputers, as well as the IBM Cell Broadband Engine™ (Cell/B.E.™) solution.The libraries contain accelerated implementations of elementary functions such as the trigonometric and hyperbolic functions and their inverses, power, logarithm, exponential, error function, and others. Complete lists of the included functions are available on the IBM Mathematical Acceleration Subsystem page.
There are both scalar and vector libraries, and for Cell/B.E. and POWER7 there are single instruction multiple data (SIMD) libraries as well. Note that the accuracy and exception handling might not be identical in MASS functions and system library functions.The MASS libraries are packaged with the IBM XL C/C++ and XL Fortran compilers, and are also available on the MASS Web site, for users of other compilers (such as gcc) for the target hardware.
The libraries are callable from C, C++, or Fortran source programs. The IBM XL C/C++ and IBM XL Fortran compilers are capable of recognizing opportunities to use MASS to accelerate the source program, and invoking it automatically without the need for sourceprogram changes. This article shows you how to implement a technique that can help your organization leverage this powerful capability.
What programs can benefit?
Any C, C++, or Fortran program that contains calls to mathematical library functions (such as exp, log, sin, cos, etc.) in a loop can potentially benefit from the techniques described in this article.
What is autovectorization?
Autovectorization is a process whereby the IBM XL C/C++ or Fortran compilers recognize an opportunity to improve the performance of the program being compiled by replacing a call to a standard library (C/C++ libm or Fortran intrinsic) function in a loop with a call to the corresponding MASS vector function. Since the MASS vector functions are much faster (by a factor of up to about 30) than a repeated call to a standard library function, the resulting performance improvement can be substantial.
A simple example is a loop that computes a particular function for multiple arguments, such as the following Fortran program.
subroutine sub (y, x, n) real*8 y(*), x(*) integer n, i intrinsic exp do i=1,n y(i) = exp(x(i)) ! standard intrinsic end do end subroutine
With the appropriate compiler options, the compiler will recognize the opportunity to accelerate the program by replacing the repeated call to exp() with a single call to the corresponding MASS vector function vexp(), resulting in a program that behaves as though it were originally written as follows:
include 'massv.include' subroutine sub (y, x, n) real*8 y(*), x(*) integer n call vexp (y, x, n) ! MASS vector function end subroutine
This is only a simple example, demonstrating the basic idea behind autovectorization. The XL compilers are actually able to recognize much more complicated opportunities, and will also try to rearrange the instructions in the source program if necessary in order to create autovectorization opportunities.
A more complicated and realistic example is examined in the case study section of this article.
Compiler options for autovectorization
You can compile programs with any of the following sets of options:
 qhot qnostrict (for Fortran)
 qhot qnostrict qignerrno (for C/C++)
 qhot O3
 O4
 O5
When you compile a program using one of these option sets, the compiler automatically attempts to vectorize calls to system math functions by calling the equivalent MASS vector functions (except for calls to the following functions, which are exceptions: vatan2, vsatan2, vdnint, vdint, vcosisin, vscosisin, vqdrt, vsqdrt, vrqdrt, vsrqdrt, vpopcnt4, vpopcnt8, vexp2, vexp2m1, vsexp2, vsexp2m1, vlog2, vlog21p, vslog2, and vslog21p). If the program is such that the compiler cannot vectorize, it automatically tries to call the equivalent MASS scalar functions. For automatic vectorization or scalarization, the compiler uses versions of the MASS functions contained in the compiler library libxlopt.a. You do not need to add any special calls to the MASS functions in your code, or to link explicitly to the xlopt library.
In addition to any of the preceding sets of options, when the qipa option is in effect, if the compiler cannot vectorize, it tries to inline the MASS scalar functions before deciding to call them.
If you want to disable autovectorization, you can add the option qhot=novector.
Case study
What follows is a case study of an actual application — a discrete Fourier transform (DFT) — showing the performance improvements that result when compiling with various compiler options. The application is simple enough to be clearly illustrative, yet complex enough to provide nontrivial optimization opportunities.
Timing for both programs was done with the driver program given in Appendix 3, on a POWER6 computer running at 4.704 GHz.
Appendix 1 shows the Fortran DFT source program. It contains a nested loop that calls exp(), cos(), and sin(), followed by a loop that calls sin() and sqrt(). The program was compiled with O3 (which does not do autovectorization) and also with O4 (which does autovectorization). The results are shown in Figure 1.
Note that the benefit of autovectorization grows as the problem size is increased, reaching a factor of 8.94x improvement at a problem size of 2000.
Figure 1: Comparison of DFT Fortran performance with compile options O3 and O4, for various problem sizes.
Appendix 2 shows a C version of the Fortran DFT program in Appendix 1 (It contains a dummy consume() routine so that the compiler's interprocedural analysis [IPA] cannot see that the result of the computation is not actually used in this illustrative example, and would therefore optimize away the entire program).
The program was compiled with O3 (which does not offer autovectorization), with O4 (which does offer autovectorization), and with O5 (which does offer autovectorization as well as IPA). The results are shown in Figure 2.
As illustrated in the Fortran example, the benefit of autovectorization grows as the problem size is increased, reaching a factor of 6.00x at n=2000. In addition, the activation of IPA at O5 provides an additional speedup of 1.22x since it can determine that the inputs and outputs are not aliased (that is, they do not overlap in memory), allowing it to vectorize the call to atan2 in the conversion to polar coordinates. The speedup of O5 over O3 is 7.33x at n=2000.
Figure 2: Comparison of DFT C performance with compile options O3, O4, and O5, for various problem sizes.
Conclusion
This article has offered you a description of the IBM MASS libraries, and the autovectorization capabilities of the IBM XL C/C++ and XL Fortran compilers. In addition, the article has demonstrated the use of various compiler options on an example program (discrete Fourier transform), showing you how accelerations of up to 8.94 times over previous speeds are obtainable using the compilers through automatic invocation of MASS by autovectorization.
This demonstration is intended to encourage users with programs that call mathematical functions to experiment with the available compiler options, and benefit from speedups resulting from IBM XL C/C++ or XL Fortran compiler autovectorization.
Appendix 1 – Fortran DFT source program
subroutine dft (x, a, phi , n) real*8 x(n), a(n), phi(n) integer n ! Compute discrete Fourier transform of real inputs ! x(i) and convert to polar form. real*8, parameter :: pi=3.1415926535897932384d0 real*8 y_re(n), y_im(n), t, term_re, term_im intrinsic exp, cos, sin, sqrt, atan y_re(1:n) = 0.d0 y_im(1:n) = 0.d0 do k=1,n ! compute y(k), kth DFT output do i=1,n ! compute ith term of y(k): ! x(k)*exp(2*pi*I*(k1)*(i1)/n) ! compute real and imaginary parts of ith term ! using exp(I*t)=exp(t)*(cos(t)+I*sin(t)) t = 2.d0*pi*(k1)*(i1)/n term_re = x(i) * cos(t) * exp(t) term_im = x(i) * sin(t) * exp(t) ! add term to sum y_re(k) = y_re(k) + term_re y_im(k) = y_im(k) + term_im end do end do ! transform y to polar coordinates do k=1,n ! compute amplitude of y(k) a(k) = sqrt (y_re(k)**2 + y_im(k)**2) ! compute phase of y(k) phi(k) = atan (y_im(k) / y_re(k)) end do end subroutine ! initialize input data subroutine init (a, n) real*8 a(n) integer n intrinsic sin,sqrt do j=1,n a(j)=sin(1.d0/sqrt(real(j,8))) end do end subroutine
Appendix 2 – C DFT source program
#include <math.h> #define PI 3.1415926535897932384 void dft(double x[],double a[],double phi[],int *m) { double y_re[NMAX], y_im[NMAX], t, s, term_re, term_im; int i,j,k,n=*m; for(i=0;i<n;++i) { y_re[i]=y_im[i]=0; } for(k=0;k<n;++k) { // compute y(k), kth DFT output for(i=0;i<n;++i) { // compute ith term of y(k): // x(k)*exp(2*pi*I*(k1)*(i1)/n) // compute real and imaginary parts of ith term // using exp(I*t)=exp(t)*(cos(t)+I*sin(t)) t=2.*PI*k*i/(double)n; term_re=x[i]*exp(t)*cos(t); term_im=x[i]*exp(t)*sin(t); // add term to sum y_re[k]+=term_re; y_im[k]+=term_im; } } // transform y to polar coordinates for(k=0;k<n;++k) { // compute amplitude of y(k) a[k]=sqrt(y_re[k]*y_re[k]+y_im[k]*y_im[k]); // compute phase of y(k) phi[k]=atan2(y_im[k],y_re[k]); } } // initialize input data void init(double a[],int *m) { int j,n=*m; for(j=0;j<n;++j) { a[j]=sin(1./sqrt((double)j+1.0)); } } // Dummy function to use result, preventing compiler from // optimizing away the computation. void consume(double a[],double b[],double c[]) { }
Appendix 3 – Driver program
Here is the Fortran driver program that was used to time the DFT codes.
program main interface subroutine dft(x,a,phi,n) real*8 x(n),a(n),phi(n) integer n end subroutine subroutine init(a,n) real*8 a(n) integer n end subroutine subroutine consume(a,b,c) real*8 a(*),b(*),c(*) end subroutine end interface ! Parameters: ! nmax is the problem size. ! nrep is the number of repetitions of the ! problem. This should be chosen so that ! the elapsed time is long enough to give ! sufficient timing resolution. ! cyc is the clock frequency in Hz for the ! processor that the program is to be run on. ! (Can be found from AIX command pmcycles.) integer, parameter :: nmax=1000 integer, parameter :: nrep=100 real*8, parameter :: cyc=4704000000.d0 real*8 x(nmax), a(nmax), phi(nmax) real*8 tx, ty, accum, del(4) intrinsic sin, sqrt real*8 rtc del(4)=0.d0 acc = 0.d0 do k=1,nrep tx=rtc() call init(x,nmax) call dft(x,a,phi,nmax) ty=rtc() call consume(x,a,phi) do j=1,nmax acc = acc + a(j) + phi(j) end do del(4) = del(4) + (tytx) end do del(1) = del(4)/real(nmax,8) del(2) = del(1)/real(nrep,8) del(3) = cyc*del(2) print *,'acc=',acc/real(nrep,8),' n=',nmax, & ' r=',nrep, ' a=',del(1),' b=',del(2), & ' c=',del(3),' w=',del(4) end program
Disclaimer
The exact performance obtained may vary depending on the model of processor used and its configuration, as well as on the version of the compilers used. Therefore, you may experience performance different from that obtained in the experiments described here.
Resources
Learn
 IBM Mathematical Acceleration Subsystem: Provides contents, usage, performance, and accuracy information for the MASS libraries.
 IBM XL C/C++ for AIX User's Guide: Provides more details on compiler flags.
 IBM XL C/C++ for AIX Optimization and Programming Guide: Provides more details on the highperformance libraries.
 IBM XL Fortran for AIX User's Guide: Provides more details on compiler flags.
 IBM XL Fortran for AIX Optimization and Programming Guide: Provides more details on the highperformance libraries.
 The POWER4 Processor Introduction and Tuning Guide: This Redbook has more details on compiler optimization flags.
 AIX and UNIX: Visit the developerWorks AIX and UNIX zone to expand your UNIX skills.
 New to AIX and UNIX: Visit the New to AIX and UNIX page to learn more about AIX and UNIX.
 developerWorks technical events and webcasts: Stay current with developerWorks technical events and webcasts.
 AIX 5L Wiki: A collaborative environment for technical information related to AIX.
 Podcasts: Tune in and catch up with IBM technical experts.
Get products and technologies
 IBM trial software: Build your next development project with software for download directly from developerWorks.
Discuss
 Participate in the AIX and UNIX forums:
 Participate in the developerWorks blogs and get involved in the developerWorks community.
 Participate in the C/C++ Café, which includes the Scientific Computing with C/C++ Blog
Comments
Dig deeper into Rational software on developerWorks
 Overview
 Free online course: Getting started with IBM Bluemix
 IBM Bluemix Garage Method
 Technical library (tutorials and more)
 Forums
 Communities

developerWorks Premium
Exclusive tools to build your next great app. Learn more.

developerWorks Labs
Technical resources for innovators and early adopters to experiment with.

IBM evaluation software
Evaluate IBM software and solutions, and transform challenges into opportunities.