How to improve the performance of programs calling mathematical functions

Taking advantage of IBM XL C/C++ or XL Fortran compiler auto-vectorization

This article introduces the IBM MASS high-performance mathematical libraries and demonstrates how to benefit from them, without any source-program changes, by using the auto-vectorization capability of the IBM® XL C/C++ and XL Fortran compilers. After introducing the concept of auto-vectorization and the associated compiler options, the article presents a case study of a discrete Fourier transform program as a real-life example. Timing results show that the compilers obtain speedups of up to 8.94x on the example program through the automatic invocation of MASS by auto-vectorization.

Robert F. Enenkel (enenkel@ca.ibm.com), Software developer, IBM

Robert Enenkel works in the Optimizing Compiler Group at the IBM Toronto Laboratory; he was previously a Research Associate at the IBM Centre for Advanced Studies (CAS). Enenkel received his B.S., M.S., and Ph.D. degrees from the University of Toronto, with thesis work in the area of parallel numerical methods for differential equations. He currently performs research and development in numerical computing as it relates to compilers and operating systems, including floating-point arithmetic, mathematical function libraries, and the performance tuning of algorithms. Dr. Enenkel has received several IBM invention achievement awards and author recognition awards. More information may be found on his Web page at https://www-927.ibm.com/ibm/cas/toronto/people/members/renekel.shtml.



Daniel M. Zabawa (dzabawa@ca.ibm.com), Software developer, IBM

Daniel Zabawa works in the Optimizing Compiler Group at the IBM Toronto Laboratory. He received his B.S. and M.S. degrees in Computer Science from the University of Toronto. Zabawa currently performs research and development in loop optimization and scheduling, floating-point algorithms for superscalar architectures, and performance tuning of floating-point math functions.



13 April 2010


The MASS high-performance libraries

MASS stands for Mathematical Acceleration Subsystem. It consists of libraries of mathematical functions specifically tuned for optimum performance on various IBM computing platforms. MASS was originally launched by IBM in 1995, and has been continuously improved and expanded since then.

There are currently versions of MASS for all the IBM® POWER™ processors running the AIX® or Linux® operating systems. There are also versions for the IBM System Blue Gene®/L and IBM System Blue Gene®/P supercomputers, as well as for the IBM Cell Broadband Engine™ (Cell/B.E.™) solution. The libraries contain accelerated implementations of elementary functions such as the trigonometric and hyperbolic functions and their inverses, power, logarithm, exponential, the error function, and others. Complete lists of the included functions are available on the IBM Mathematical Acceleration Subsystem page.

There are both scalar and vector libraries, and for Cell/B.E. and POWER7 there are single-instruction, multiple-data (SIMD) libraries as well. Note that the accuracy and exception handling of MASS functions might not be identical to those of the system library functions. The MASS libraries are packaged with the IBM XL C/C++ and XL Fortran compilers, and are also available on the MASS Web site for users of other compilers (such as gcc) on the target hardware.

The libraries are callable from C, C++, and Fortran source programs. The IBM XL C/C++ and XL Fortran compilers can recognize opportunities to use MASS to accelerate a source program and invoke it automatically, with no source-program changes required. This article shows you how to take advantage of this powerful capability.
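
For users who prefer to call the vector functions explicitly, here is a minimal C sketch. It assumes the massv.h header and the documented MASS vector calling convention, in which the element count is passed by reference for Fortran compatibility; when calling MASS directly like this, link against the vector library (for example, -lmassv):

#include <massv.h>  /* MASS vector library header */

/* Compute y[i] = exp(x[i]) for i = 0..n-1 with one vector call. */
void exp_all(double y[], double x[], int n)
{
    vexp(y, x, &n);  /* count is passed by reference */
}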

What programs can benefit?

Any C, C++, or Fortran program that contains calls to mathematical library functions (such as exp, log, sin, or cos) in a loop can potentially benefit from the techniques described in this article.


What is auto-vectorization?

Auto-vectorization is a process whereby the IBM XL C/C++ or XL Fortran compiler recognizes an opportunity to improve the performance of the program being compiled by replacing repeated calls to a standard library function (C/C++ libm or a Fortran intrinsic) in a loop with a single call to the corresponding MASS vector function. Because a MASS vector function is much faster (by a factor of up to about 30) than the equivalent repeated calls to a standard library function, the resulting performance improvement can be substantial.

A simple example is a loop that computes a particular function for multiple arguments, such as the following Fortran program.

subroutine sub (y, x, n)
    real*8 y(*), x(*)
    integer n, i
    intrinsic exp
    do i=1,n
        y(i) = exp(x(i))  ! standard intrinsic
    end do
end subroutine

With the appropriate compiler options, the compiler will recognize the opportunity to accelerate the program by replacing the repeated call to exp() with a single call to the corresponding MASS vector function vexp(), resulting in a program that behaves as though it were originally written as follows:

include 'massv.include'
subroutine sub (y, x, n)
    real*8 y(*), x(*)
    integer n
    call vexp (y, x, n)  ! MASS vector function
end subroutine

This is only a simple example that demonstrates the basic idea behind auto-vectorization. The XL compilers can recognize much more complicated opportunities, and will also rearrange the instructions in the source program, if necessary, to create auto-vectorization opportunities; the sketch below gives an example.
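
For instance, consider a loop in which the math call is embedded in other arithmetic. The following C sketch is hypothetical (the function and array names are illustrative, not from the compiler documentation):

#include <math.h>

/* With auto-vectorization enabled, the compiler can split this loop:
   it gathers the exp() arguments, replaces the repeated exp() call
   with a single MASS vexp() call over the whole array, and performs
   the remaining multiply and add in a separate loop. */
void scale_exp(double y[], const double x[], const double w[], int n)
{
    int i;
    for (i = 0; i < n; ++i)
        y[i] = w[i] * exp(x[i]) + 1.0;
}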

A more complicated and realistic example is examined in the case study section of this article.


Compiler options for auto-vectorization

You can compile programs with any of the following sets of options:

  1. -qhot -qnostrict (for Fortran)
  2. -qhot -qnostrict -qignerrno (for C/C++)
  3. -qhot -O3
  4. -O4
  5. -O5

When you compile a program using one of these option sets, the compiler automatically attempts to vectorize calls to system math functions by calling the equivalent MASS vector functions. (The exceptions are the functions vatan2, vsatan2, vdnint, vdint, vcosisin, vscosisin, vqdrt, vsqdrt, vrqdrt, vsrqdrt, vpopcnt4, vpopcnt8, vexp2, vexp2m1, vsexp2, vsexp2m1, vlog2, vlog21p, vslog2, and vslog21p, which the compiler does not generate automatically.) If the compiler cannot vectorize a call, it automatically tries to call the equivalent MASS scalar function instead. For automatic vectorization or scalarization, the compiler uses the versions of the MASS functions contained in the compiler library libxlopt.a; you do not need to add any special calls to the MASS functions in your code or link explicitly to the xlopt library.

In addition to any of the preceding sets of options, when the -qipa option is in effect and the compiler cannot vectorize a call, it tries to inline the MASS scalar function before deciding to call it.

If you want to disable auto-vectorization, you can add the option -qhot=novector.
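
For example, assuming the usual XL compiler driver names, and with dft.f and driver.f as illustrative file names for the case study sources below, the Fortran version can be built with xlf -O3 dft.f driver.f -o dft for the non-vectorized baseline, with xlf -O4 dft.f driver.f -o dft to enable auto-vectorization, or with xlf -O4 -qhot=novector dft.f driver.f -o dft to keep the other -O4 optimizations while suppressing vectorization. The C version is compiled analogously with xlc.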


Case study

What follows is a case study of an actual application — a discrete Fourier transform (DFT) — showing the performance improvements that result when compiling with various compiler options. The application is simple enough to be clearly illustrative, yet complex enough to provide non-trivial optimization opportunities.

Timing for both the Fortran and C versions was done with the driver program given in Appendix 3, on a POWER6 computer running at 4.704 GHz.

Appendix 1 shows the Fortran DFT source program. It contains a nested loop that calls exp(), cos(), and sin(), followed by a loop that calls sqrt() and atan(). The program was compiled with -O3 (which does not perform auto-vectorization) and with -O4 (which does). The results are shown in Figure 1.

Note that the benefit of auto-vectorization grows as the problem size is increased, reaching an improvement factor of 8.94x at a problem size of 2000.

Figure 1: Comparison of DFT Fortran performance with compile options -O3 and -O4, for various problem sizes.
Graph shows seconds per element vs. problem size.

Appendix 2 shows a C version of the Fortran DFT program of Appendix 1. (It contains a dummy consume() routine so that the compiler's inter-procedural analysis (IPA) cannot detect that the result of the computation is unused in this illustrative example and optimize away the entire program.)

The program was compiled with -O3 (no auto-vectorization), with -O4 (auto-vectorization), and with -O5 (auto-vectorization plus IPA). The results are shown in Figure 2.

As in the Fortran example, the benefit of auto-vectorization grows as the problem size is increased, reaching a factor of 6.00x at n=2000. In addition, the activation of IPA at -O5 provides a further speedup of 1.22x, because IPA can determine that the inputs and outputs are not aliased (that is, they do not overlap in memory), allowing the compiler to vectorize the call to atan2() in the conversion to polar coordinates. The overall speedup of -O5 over -O3 is 7.33x at n=2000.

Figure 2: Comparison of DFT C performance with compile options -O3, -O4, and -O5, for various problem sizes.
Graph shows seconds per element vs. problem size.
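
As a side note, and not part of the original experiment: in C99, restrict qualifiers are another way to give the compiler the no-alias guarantee that IPA derives at -O5. The following hypothetical variant of the dft() prototype from Appendix 2 is only a sketch; whether it enables the atan2() vectorization at -O4 depends on the compiler version.

/* Hypothetical prototype variant: the restrict qualifiers assert
   that x, a, and phi never overlap in memory. */
void dft(const double * restrict x,
         double * restrict a,
         double * restrict phi,
         int *m);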

Conclusion

This article has described the IBM MASS libraries and the auto-vectorization capabilities of the IBM XL C/C++ and XL Fortran compilers. It has also demonstrated the use of various compiler options on an example program (a discrete Fourier transform), showing that speedups of up to 8.94x are obtainable through the automatic invocation of MASS by auto-vectorization.

This demonstration is intended to encourage users with programs that call mathematical functions to experiment with the available compiler options, and benefit from speedups resulting from IBM XL C/C++ or XL Fortran compiler auto-vectorization.


Appendix 1 – Fortran DFT source program

  subroutine dft (x, a, phi, n)
    real*8 x(n), a(n), phi(n)
    integer n

    ! Compute discrete Fourier transform of real inputs         
    ! x(i) and convert to polar form.

    real*8, parameter :: pi=3.1415926535897932384d0
    real*8 y_re(n), y_im(n), t, term_re, term_im
    intrinsic exp, cos, sin, sqrt, atan

    y_re(1:n) = 0.d0
    y_im(1:n) = 0.d0
    do k=1,n
      ! compute y(k), k-th DFT output
      do i=1,n
        ! compute i-th term of y(k):
        ! x(i)*exp((1+I)*t), where t = -2*pi*(k-1)*(i-1)/n
        ! compute real and imaginary parts of the i-th term
        ! using exp((1+I)*t) = exp(t)*(cos(t)+I*sin(t))

        t = -2.d0*pi*(k-1)*(i-1)/n
        term_re = x(i) * cos(t) * exp(t)
        term_im = x(i) * sin(t) * exp(t)

        ! add term to sum
        y_re(k) = y_re(k) + term_re
        y_im(k) = y_im(k) + term_im
      end do
    end do

    ! transform y to polar coordinates
    do k=1,n
      ! compute amplitude of y(k)
      a(k) = sqrt (y_re(k)**2 + y_im(k)**2)
      ! compute phase of y(k)
      phi(k) = atan (y_im(k) / y_re(k))
    end do

  end subroutine

  ! initialize input data
  subroutine init (a, n)
    real*8 a(n)
    integer n
    intrinsic sin,sqrt
    
    do j=1,n
      a(j)=sin(1.d0/sqrt(real(j,8)))
    end do
  end subroutine

Appendix 2 – C DFT source program

#include <math.h>
#define PI 3.1415926535897932384
#define NMAX 2000  /* maximum problem size; must cover the sizes timed */
void dft(double x[],double a[],double phi[],int *m)
{
  double y_re[NMAX], y_im[NMAX], t, s, term_re, term_im;
  int i,j,k,n=*m;
  for(i=0;i<n;++i) {
    y_re[i]=y_im[i]=0;
  }
  for(k=0;k<n;++k) 
  {
    // compute y(k), k-th DFT output
    for(i=0;i<n;++i)
    {
      // compute i-th term of y(k):
      // x[i]*exp((1+I)*t), where t = -2*pi*k*i/n
      // compute real and imaginary parts of the i-th term
      // using exp((1+I)*t) = exp(t)*(cos(t)+I*sin(t))

      t=-2.*PI*k*i/(double)n;
      term_re=x[i]*exp(t)*cos(t);
      term_im=x[i]*exp(t)*sin(t);

      // add term to sum
      y_re[k]+=term_re;
      y_im[k]+=term_im;
    }
  }

  // transform y to polar coordinates
  for(k=0;k<n;++k)
  {
    // compute amplitude of y(k)
    a[k]=sqrt(y_re[k]*y_re[k]+y_im[k]*y_im[k]);
    // compute phase of y(k)
    phi[k]=atan2(y_im[k],y_re[k]);
  }
}

// initialize input data
void init(double a[],int *m)
{
  int j,n=*m;
  for(j=0;j<n;++j) 
  {
    a[j]=sin(1./sqrt((double)j+1.0));
  }
}

// Dummy function to use result, preventing compiler from
// optimizing away the computation.
void consume(double a[],double b[],double c[])
{
}

Appendix 3 – Driver program

Here is the Fortran driver program that was used to time the DFT codes.

  program main
    interface
      subroutine dft(x,a,phi,n)
      real*8 x(n),a(n),phi(n)
      integer n
      end subroutine
      subroutine init(a,n)
      real*8 a(n)
      integer n
      end subroutine
      subroutine consume(a,b,c)
      real*8 a(*),b(*),c(*)
      end subroutine
    end interface

    ! Parameters:
    !   nmax is the problem size.
    !   nrep is the number of repetitions of the
    !        problem. This should be chosen so that
    !        the elapsed time is long enough to give
    !        sufficient timing resolution.
    !   cyc is the clock frequency in Hz for the
    !       processor that the program is to be run on.
    !       (Can be found from AIX command pmcycles.)
    
    integer, parameter :: nmax=1000
    integer, parameter :: nrep=100
    real*8, parameter :: cyc=4704000000.d0

    real*8 x(nmax), a(nmax), phi(nmax)
    real*8 tx, ty, acc, del(4)
    intrinsic sin, sqrt
    real*8 rtc

    del(4)=0.d0
    acc = 0.d0
    do k=1,nrep
      tx=rtc()
      call init(x,nmax)
      call dft(x,a,phi,nmax)
      ty=rtc()
      call consume(x,a,phi)
      do j=1,nmax
        acc = acc + a(j) + phi(j)
      end do
      del(4) = del(4) + (ty-tx)
    end do
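    ! del(4) = total wall time over all repetitions
    ! del(1) = time per element, summed over repetitions
    ! del(2) = seconds per element per call
    ! del(3) = processor cycles per element per call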
    del(1) = del(4)/real(nmax,8)
    del(2) = del(1)/real(nrep,8)
    del(3) = cyc*del(2)
    print *, 'acc=', acc/real(nrep,8), ' n=', nmax, &
             ' r=', nrep, ' a=', del(1), ' b=', del(2), &
             ' c=', del(3), ' w=', del(4)
  end program

Disclaimer

The exact performance obtained may vary depending on the model of processor used and its configuration, as well as on the version of the compilers used. Therefore, you may experience performance different from that obtained in the experiments described here.
