Skip to main content

By clicking Submit, you agree to the developerWorks terms of use.

The first time you sign into developerWorks, a profile is created for you. Select information in your developerWorks profile is displayed to the public, but you may edit the information at any time. Your first name, last name (unless you choose to hide them), and display name will accompany the content that you post.

All information submitted is secure.

  • Close [x]

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerworks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

By clicking Submit, you agree to the developerWorks terms of use.

All information submitted is secure.

  • Close [x]

How to improve the performance of programs calling mathematical functions

Taking advantage of IBM XL C/C++ or XL Fortran compiler auto-vectorization

Robert F. Enenkel (enenkel@ca.ibm.com), Software developer, IBM
Photo of Robert Enenkel
Robert Enenkel works in the Optimizing Compiler Group at the IBM Toronto Laboratory; he was previously a Research Associate at the IBM Centre for Advanced Studies (CAS). Enenkel received his B.S., M.S., and Ph.D. degrees from the University of Toronto, with thesis work in the area of parallel numerical methods for differential equations. He currently performs research and development in numerical computing as it relates to compilers and operating systems, including floating-point arithmetic, mathematical function libraries, and the performance tuning of algorithms. Dr. Enenkel has received several IBM invention achievement awards and author recognition awards. More information may be found on his Web page at https://www-927.ibm.com/ibm/cas/toronto/people/members/renekel.shtml.
Daniel M. Zabawa (dzabawa@ca.ibm.com ), Software developer, IBM
Photo of Daniel Zabawa
Daniel Zabawa works in the Optimizing Compiler Group at the IBM Toronto Laboratory. He received his B.S. and M.S. degrees in Computer Science from the University of Toronto. Zabawa currently performs research and development in loop optimization and scheduling, floating-point algorithms for superscalar architectures and performance tuning of floating-point math functions.

Summary:  This article introduces the IBM MASS high-performance mathematical libraries, and demonstrates how to benefit from them — without the need for source program changes — by using the auto-vectorization capability of the IBM® XL C/C++ and XL Fortran compilers. After introducing the concept of auto-vectorization and the associated compiler options, a case study of a discrete Fourier transform program is offered as a real life example of auto-vectorization. Timing results demonstrate that speedups of up to 8.94 times are obtained by the compilers on the example program, via the automatic invocation of MASS by auto-vectorization.

Date:  13 Apr 2010
Level:  Intermediate PDF:  A4 and Letter (99KB | 12 pages)Get Adobe® Reader®
Also available in:   Chinese  Vietnamese  Portuguese

Activity:  19396 views
Comments:  

The MASS high-performance libraries

MASS stands for Mathematical Acceleration Subsystem. It consists of libraries of mathematical functions specifically tuned for optimum performance on various IBM computing platforms. MASS was originally launched by IBM in 1995, and has been continuously improved and expanded since then.

There are currently versions of MASS for all the IBM® POWER™ processors, running AIX® or Linux® operating systems. There are also versions for the IBM System BlueGene®/L and the IBM System BlueGene®/P supercomputers, as well as the IBM Cell Broadband Engine™ (Cell/B.E.™) solution.The libraries contain accelerated implementations of elementary functions such as the trigonometric and hyperbolic functions and their inverses, power, logarithm, exponential, error function, and others. Complete lists of the included functions are available on the IBM Mathematical Acceleration Subsystem page.

There are both scalar and vector libraries, and for Cell/B.E. and POWER7 there are single instruction multiple data (SIMD) libraries as well. Note that the accuracy and exception handling might not be identical in MASS functions and system library functions.The MASS libraries are packaged with the IBM XL C/C++ and XL Fortran compilers, and are also available on the MASS Web site, for users of other compilers (such as gcc) for the target hardware.

The libraries are callable from C, C++, or Fortran source programs. The IBM XL C/C++ and IBM XL Fortran compilers are capable of recognizing opportunities to use MASS to accelerate the source program, and invoking it automatically without the need for source-program changes. This article shows you how to implement a technique that can help your organization leverage this powerful capability.

What programs can benefit?

Any C, C++, or Fortran program that contains calls to mathematical library functions (such as exp, log, sin, cos, etc.) in a loop can potentially benefit from the techniques described in this article.


What is auto-vectorization?

Auto-vectorization is a process whereby the IBM XL C/C++ or Fortran compilers recognize an opportunity to improve the performance of the program being compiled by replacing a call to a standard library (C/C++ libm or Fortran intrinsic) function in a loop with a call to the corresponding MASS vector function. Since the MASS vector functions are much faster (by a factor of up to about 30) than a repeated call to a standard library function, the resulting performance improvement can be substantial.

A simple example is a loop that computes a particular function for multiple arguments, such as the following Fortran program.

subroutine sub (y, x, n)
    real*8 y(*), x(*)
    integer n, i
    intrinsic exp
    do i=1,n
        y(i) = exp(x(i))  ! standard intrinsic
    end do
end subroutine

With the appropriate compiler options, the compiler will recognize the opportunity to accelerate the program by replacing the repeated call to exp() with a single call to the corresponding MASS vector function vexp(), resulting in a program that behaves as though it were originally written as follows:

include 'massv.include'
subroutine sub (y, x, n)
    real*8 y(*), x(*)
    integer n
    call vexp (y, x, n)  ! MASS vector function
end subroutine

This is only a simple example, demonstrating the basic idea behind auto-vectorization. The XL compilers are actually able to recognize much more complicated opportunities, and will also try to rearrange the instructions in the source program if necessary in order to create auto-vectorization opportunities.

A more complicated and realistic example is examined in the case study section of this article.


Compiler options for auto-vectorization

You can compile programs with any of the following sets of options:

  1. -qhot -qnostrict (for Fortran)
  2. -qhot -qnostrict -qignerrno (for C/C++)
  3. -qhot -O3
  4. -O4
  5. -O5

When you compile a program using one of these option sets, the compiler automatically attempts to vectorize calls to system math functions by calling the equivalent MASS vector functions (except for calls to the following functions, which are exceptions: vatan2, vsatan2, vdnint, vdint, vcosisin, vscosisin, vqdrt, vsqdrt, vrqdrt, vsrqdrt, vpopcnt4, vpopcnt8, vexp2, vexp2m1, vsexp2, vsexp2m1, vlog2, vlog21p, vslog2, and vslog21p). If the program is such that the compiler cannot vectorize, it automatically tries to call the equivalent MASS scalar functions. For automatic vectorization or scalarization, the compiler uses versions of the MASS functions contained in the compiler library libxlopt.a. You do not need to add any special calls to the MASS functions in your code, or to link explicitly to the xlopt library.

In addition to any of the preceding sets of options, when the -qipa option is in effect, if the compiler cannot vectorize, it tries to inline the MASS scalar functions before deciding to call them.

If you want to disable auto-vectorization, you can add the option -qhot=novector.


Case study

What follows is a case study of an actual application — a discrete Fourier transform (DFT) — showing the performance improvements that result when compiling with various compiler options. The application is simple enough to be clearly illustrative, yet complex enough to provide non-trivial optimization opportunities.

Timing for both programs was done with the driver program given in Appendix 3, on a POWER6 computer running at 4.704 GHz.

Appendix 1 shows the Fortran DFT source program. It contains a nested loop that calls exp(), cos(), and sin(), followed by a loop that calls sin() and sqrt(). The program was compiled with -O3 (which does not do auto-vectorization) and also with -O4 (which does auto-vectorization). The results are shown in Figure 1.

Note that the benefit of auto-vectorization grows as the problem size is increased, reaching a factor of 8.94x improvement at a problem size of 2000.


Figure 1: Comparison of DFT Fortran performance with compile options -O3 and -O4, for various problem sizes.
Graph shows seconds per element vs. problem size.

Appendix 2 shows a C version of the Fortran DFT program in Appendix 1 (It contains a dummy consume() routine so that the compiler's inter-procedural analysis [IPA] cannot see that the result of the computation is not actually used in this illustrative example, and would therefore optimize away the entire program).

The program was compiled with -O3 (which does not offer auto-vectorization), with -O4 (which does offer auto-vectorization), and with -O5 (which does offer auto-vectorization as well as IPA). The results are shown in Figure 2.

As illustrated in the Fortran example, the benefit of auto-vectorization grows as the problem size is increased, reaching a factor of 6.00x at n=2000. In addition, the activation of IPA at -O5 provides an additional speedup of 1.22x since it can determine that the inputs and outputs are not aliased (that is, they do not overlap in memory), allowing it to vectorize the call to atan2 in the conversion to polar coordinates. The speedup of -O5 over -O3 is 7.33x at n=2000.


Figure 2: Comparison of DFT C performance with compile options -O3, -O4, and -O5, for various problem sizes.
Graph shows seconds per element vs. problem size.

Conclusion

This article has offered you a description of the IBM MASS libraries, and the auto-vectorization capabilities of the IBM XL C/C++ and XL Fortran compilers. In addition, the article has demonstrated the use of various compiler options on an example program (discrete Fourier transform), showing you how accelerations of up to 8.94 times over previous speeds are obtainable using the compilers through automatic invocation of MASS by auto-vectorization.

This demonstration is intended to encourage users with programs that call mathematical functions to experiment with the available compiler options, and benefit from speedups resulting from IBM XL C/C++ or XL Fortran compiler auto-vectorization.


Appendix 1 – Fortran DFT source program

  subroutine dft (x, a, phi , n)
    real*8 x(n), a(n), phi(n)
    integer n

    ! Compute discrete Fourier transform of real inputs         
    ! x(i) and convert to polar form.

    real*8, parameter :: pi=3.1415926535897932384d0
    real*8 y_re(n), y_im(n), t, term_re, term_im
    intrinsic exp, cos, sin, sqrt, atan

    y_re(1:n) = 0.d0
    y_im(1:n) = 0.d0
    do k=1,n
      ! compute y(k), k-th DFT output
      do i=1,n
        ! compute i-th term of y(k):
        ! x(k)*exp(-2*pi*I*(k-1)*(i-1)/n)
        ! compute real and imaginary parts of i-th term
        ! using exp(I*t)=exp(t)*(cos(t)+I*sin(t))

        t = -2.d0*pi*(k-1)*(i-1)/n
        term_re = x(i) * cos(t) * exp(t)
        term_im = x(i) * sin(t) * exp(t)

        ! add term to sum
        y_re(k) = y_re(k) + term_re
        y_im(k) = y_im(k) + term_im
      end do
    end do

    ! transform y to polar coordinates
    do k=1,n
      ! compute amplitude of y(k)
      a(k) = sqrt (y_re(k)**2 + y_im(k)**2)
      ! compute phase of y(k)
      phi(k) = atan (y_im(k) / y_re(k))
    end do

  end subroutine

  ! initialize input data
  subroutine init (a, n)
    real*8 a(n)
    integer n
    intrinsic sin,sqrt
    
    do j=1,n
      a(j)=sin(1.d0/sqrt(real(j,8)))
    end do
  end subroutine


Appendix 2 – C DFT source program

#include <math.h>
#define PI 3.1415926535897932384
void dft(double x[],double a[],double phi[],int *m)
{
  double y_re[NMAX], y_im[NMAX], t, s, term_re, term_im;
  int i,j,k,n=*m;
  for(i=0;i<n;++i) {
    y_re[i]=y_im[i]=0;
  }
  for(k=0;k<n;++k) 
  {
    // compute y(k), k-th DFT output
    for(i=0;i<n;++i)
    {
      // compute i-th term of y(k):
      // x(k)*exp(-2*pi*I*(k-1)*(i-1)/n)
      // compute real and imaginary parts of i-th term
      // using exp(I*t)=exp(t)*(cos(t)+I*sin(t))

      t=-2.*PI*k*i/(double)n;
      term_re=x[i]*exp(t)*cos(t);
      term_im=x[i]*exp(t)*sin(t);

      // add term to sum
      y_re[k]+=term_re;
      y_im[k]+=term_im;
    }
  }

  // transform y to polar coordinates
  for(k=0;k<n;++k)
  {
    // compute amplitude of y(k)
    a[k]=sqrt(y_re[k]*y_re[k]+y_im[k]*y_im[k]);
    // compute phase of y(k)
    phi[k]=atan2(y_im[k],y_re[k]);
  }
}

// initialize input data
void init(double a[],int *m)
{
  int j,n=*m;
  for(j=0;j<n;++j) 
  {
    a[j]=sin(1./sqrt((double)j+1.0));
  }
}

// Dummy function to use result, preventing compiler from
// optimizing away the computation.
void consume(double a[],double b[],double c[])
{
}


Appendix 3 – Driver program

Here is the Fortran driver program that was used to time the DFT codes.

  program main
    interface
      subroutine dft(x,a,phi,n)
      real*8 x(n),a(n),phi(n)
      integer n
      end subroutine
      subroutine init(a,n)
      real*8 a(n)
      integer n
      end subroutine
      subroutine consume(a,b,c)
      real*8 a(*),b(*),c(*)
      end subroutine
    end interface

    ! Parameters:
    !   nmax is the problem size.
    !   nrep is the number of repetitions of the
    !        problem. This should be chosen so that
    !        the elapsed time is long enough to give
    !        sufficient timing resolution.
    !   cyc is the clock frequency in Hz for the
    !       processor that the program is to be run on.
    !       (Can be found from AIX command pmcycles.)
    
    integer, parameter :: nmax=1000
    integer, parameter :: nrep=100
    real*8, parameter :: cyc=4704000000.d0

    real*8 x(nmax), a(nmax), phi(nmax)
    real*8 tx, ty, accum, del(4)
    intrinsic sin, sqrt
    real*8 rtc

    del(4)=0.d0
    acc = 0.d0
    do k=1,nrep
      tx=rtc()
      call init(x,nmax)
      call dft(x,a,phi,nmax)
      ty=rtc()
      call consume(x,a,phi)
      do j=1,nmax
        acc = acc + a(j) + phi(j)
      end do
      del(4) = del(4) + (ty-tx)
    end do
    del(1) = del(4)/real(nmax,8)
    del(2) = del(1)/real(nrep,8)
    del(3) = cyc*del(2)
    print *,'acc=',acc/real(nrep,8),' n=',nmax,
 &    ' r=',nrep, ' a=',del(1),' b=',del(2),
 &    ' c=',del(3),' w=',del(4)
  end program


Disclaimer

The exact performance obtained may vary depending on the model of processor used and its configuration, as well as on the version of the compilers used. Therefore, you may experience performance different from that obtained in the experiments described here.


Resources

Learn

Get products and technologies

  • IBM trial software: Build your next development project with software for download directly from developerWorks.

Discuss

About the authors

Photo of Robert Enenkel

Robert Enenkel works in the Optimizing Compiler Group at the IBM Toronto Laboratory; he was previously a Research Associate at the IBM Centre for Advanced Studies (CAS). Enenkel received his B.S., M.S., and Ph.D. degrees from the University of Toronto, with thesis work in the area of parallel numerical methods for differential equations. He currently performs research and development in numerical computing as it relates to compilers and operating systems, including floating-point arithmetic, mathematical function libraries, and the performance tuning of algorithms. Dr. Enenkel has received several IBM invention achievement awards and author recognition awards. More information may be found on his Web page at https://www-927.ibm.com/ibm/cas/toronto/people/members/renekel.shtml.

Photo of Daniel Zabawa

Daniel Zabawa works in the Optimizing Compiler Group at the IBM Toronto Laboratory. He received his B.S. and M.S. degrees in Computer Science from the University of Toronto. Zabawa currently performs research and development in loop optimization and scheduling, floating-point algorithms for superscalar architectures and performance tuning of floating-point math functions.

Report abuse help

Report abuse

Thank you. This entry has been flagged for moderator attention.


Report abuse help

Report abuse

Report abuse submission failed. Please try again later.


developerWorks: Sign in


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Select information in your developerWorks profile is displayed to the public, but you may edit the information at any time. Your first name, last name (unless you choose to hide them), and display name will accompany the content that you post.

Choose your display name

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


Rate this article

Comments

Help: Update or add to My dW interests

What's this?

This little timesaver lets you update your My developerWorks profile with just one click! The general subject of this content (AIX and UNIX, Information Management, Lotus, Rational, Tivoli, WebSphere, Java, Linux, Open source, SOA and Web services, Web development, or XML) will be added to the interests section of your profile, if it's not there already. You only need to be logged in to My developerWorks.

And what's the point of adding your interests to your profile? That's how you find other users with the same interests as yours, and see what they're reading and contributing to the community. Your interests also help us recommend relevant developerWorks content to you.

View your My developerWorks profile

Return from help

Help: Remove from My dW interests

What's this?

Removing this interest does not alter your profile, but rather removes this piece of content from a list of all content for which you've indicated interest. In a future enhancement to My developerWorks, you'll be able to see a record of that content.

View your My developerWorks profile

Return from help

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Rational, Multicore acceleration
ArticleID=481621
ArticleTitle=How to improve the performance of programs calling mathematical functions
publish-date=04132010
author1-email=enenkel@ca.ibm.com
author1-email-cc=
author2-email=dzabawa@ca.ibm.com
author2-email-cc=

Tags

Help
Use the search field to find all types of content in My developerWorks with that tag.

Use the slider bar to see more or fewer tags.

For articles in technology zones (such as Java technology, Linux, Open source, XML), Popular tags shows the top tags for all technology zones. For articles in product zones (such as Info Mgmt, Rational, WebSphere), Popular tags shows the top tags for just that product zone.

For articles in technology zones (such as Java technology, Linux, Open source, XML), My tags shows your tags for all technology zones. For articles in product zones (such as Info Mgmt, Rational, WebSphere), My tags shows your tags for just that product zone.

Use the search field to find all types of content in My developerWorks with that tag. Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere). My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Try IBM PureSystems. No charge.

Special offers