How to use the Advance Toolchain for Linux on POWER

Draft in progress: this article is being adapted from the previous Advance Toolchain article, which was based on Version 3.0, to the newer Advance Toolchain 4.0.

This article introduces the Advance Toolchain, one of the key performance-improving technologies available to Linux on POWER users. Numerous versions of the Advance Toolchain are now available for RHEL and SLES releases.

This article is actively updated as new data points and examples are executed with each new release of the Advance Toolchain (now at Advance Toolchain 4.0-1).

Introductory instructions are provided on how to use the Advance Toolchain to generate executables for IBM POWER processor-based systems running Linux.

  • The examples in this article were all executed on RHEL 5.2 and confirmed on RHEL 5.3. The Advance Toolchain is also available for RHEL 6, SLES 10, and SLES 11 (and service packs). The examples are being re-done on RHEL 5.5 and SLES 11 SP1.

For discussions on this page, see this thread

For additional questions or observations, post on the Advance Toolchain for Linux on Power forum

Additional Linux for POWER performance information is available on the Performance page


Introduction

The Advance Toolchain is a set of open source software development tools and runtime libraries that allows users to take leading-edge advantage of IBM's latest hardware features:

  1. POWER6 enablement
  2. POWER6-optimized scheduler
  3. POWER6 native DFP instruction support
  4. POWER6 VMX enablement with auto-vectorization
  5. POWER7 enablement
  6. POWER7-optimized scheduler
  7. POWER7 native DFP instruction support
  8. POWER7 VMX/VSX enablement with auto-vectorization
  9. ppc970, POWER4, POWER5, POWER5+, POWER6, POWER6x, POWER7 optimized system and math libraries
  10. libhugetlbfs 2.0 support

The Advance Toolchain is a self-contained toolchain that does not rely on the base system toolchain for operability; in fact, it is designed to coexist with the toolchain shipped with the operating system (a quick check of this is shown after the component list below). The Advance Toolchain package includes the following components:

  • GNU Compiler Collection (gcc, g++, gfortran)
  • GNU C Library (glibc)
  • Decimal Floating Point Library (libdfp)
  • GNU Binary Utilities (ld, ldd, objcopy, objdump, nm, and others)
  • GNU Debugger (gdb)
  • Performance analysis tools (Oprofile, Valgrind, gprof, mtrace, xtrace, iTrace)
  • The AUXV Library (libauxv)
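
As noted above, the Advance Toolchain installs alongside the distro toolchain rather than replacing it. A quick way to confirm this is to query both compilers side by side (a minimal sketch; the /opt/at4.0 path assumes a default Advance Toolchain 4.0 installation):

/usr/bin/gcc --version        # distro compiler
/opt/at4.0/bin/gcc --version  # Advance Toolchain compiler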

For more information on Decimal Floating Point support, we refer you to the papers listed in the References section. We will cover the performance analysis tools in a future article.

In this article, we will provide some examples of the performance improvements available when using the Advance Toolchain as compared to the toolchain provided in each distro.

Be aware that after you build your executables with the Advance Toolchain on one system, you will need the Advance Toolchain runtime package installed on any system where the executable is to be run.
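
A quick way to confirm that the runtime is present on a target system is an rpm query (a sketch; the package name shown corresponds to the AT 4.0 packages listed later in this article):

rpm -q advance-toolchain-at4.0-runtime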

 

What is Different? What is provided?

The Advance Toolchain provides access to libraries and enhancements that have not yet been incorporated into the Red Hat and Novell/SUSE operating system base releases. In general, the Advance Toolchain provides newer, more up-to-date versions of the libraries as the code evolves in the toolchain development communities.

The following tables compare the Linux Distribution system toolchain package versions with those provided by the Advance Toolchain.

Red Hat RHEL

Tool      RHEL 5.5            Advance Toolchain 2.1-1       Advance Toolchain 3.0-0   Advance Toolchain 4.0-1
gcc       4.1.2-48.el5        4.3.3-ibm-r153685             4.4.4-ibm-r160160         4.5.3-ibm-r171016
binutils  2.17.50.0.6-14.el5  2.20-20091003                 2.20.51-20100526          2.21.0-20110311
glibc     2.5-49              2.8-ibm with libdfp add-on    2.11-ibm                  2.12-ibm
libdfp    N/A                 N/A                           1.0.3                     1.0.7
gmp       4.1.4-10.el5        4.2.2                         4.2.4                     4.3.2
mpfr      N/A                 2.4.0                         2.4.1                     3.0.0
gdb       7.0.1-23.el5        7.0                           7.1                       7.2
oprofile  0.9.4-15.el5        0.9.5-20091021                0.9.6                     0.9.7
valgrind  3.5.0-1.el5         3.4.1 with iTrace tool        3.4.1 with iTrace tool    3.6.0 with iTrace tool and POWER7 support

Novell/SUSE SLES

Tool      SLES 10 SP3             SLES 11        Advance Toolchain 2.1-1       Advance Toolchain 3.0-0   Advance Toolchain 4.0-1
gcc       4.1.2_20070115-0.29.6   4.3-62.198     4.3.3-ibm-r153685             4.4.4-ibm-r160160         4.5.3-ibm-r171016
binutils  2.16.91.0.5-23.34.33    2.20.0-0.7.9   2.20-20091003                 2.20.51-20100526          2.21.0-20110311
glibc     2.4-31.71.1             2.11.1-0.17.4  2.8-ibm with libdfp add-on    2.11-ibm                  2.12-ibm
libdfp    N/A                     1.0.1-0.4.17   N/A                           1.0.3                     1.0.7
gmp       4.1.4-20.10             4.2.3-10.99    4.2.2                         4.2.4                     4.3.2
mpfr      2.2.1-6.6               2.3.2-3.115    2.4.0                         2.4.1                     3.0.0
gdb       6.8.50.20090302-1.5.18  7.0-0.4.16     7.0                           7.1                       7.2
oprofile  not shipped             0.9.4-51.4     0.9.5-20091021                0.9.6                     0.9.7
valgrind  not shipped             not shipped    3.4.1 with iTrace tool        3.4.1 with iTrace tool    3.6.0 with iTrace tool and POWER7 support

With the Advance Toolchain, Linux on Power customers have easy access to the latest technologies and package versions of these tools and libraries.

Feedback on the latest Advance Toolchain also helps IBM developers push performance enhancements, bug fixes, and package updates into future Linux distribution service packs, releases, and versions in a more timely manner.

 

Installation

Today, customers can download the Advance Toolchain from the University of Illinois NCSA ftp site located here:

Advance Toolchain 2.1-1

Advance Toolchain 3.0-0

Advance Toolchain 4.0-1

There are specific Advance Toolchain packages for the various Linux distribution releases. The following example shows the (partial) directory structure for the RHEL6 Advance Toolchain found here.

advance-toolchain-at4.0-cross-4.0-0.i686.rpm	        212 MB	1/4/11 2:17:00 PM
advance-toolchain-at4.0-cross-4.0-1.i686.rpm	        231 MB	3/31/11 4:14:00 PM
advance-toolchain-at4.0-devel-4.0-0.ppc64.rpm	        213 MB	2/15/11 1:41:00 PM
advance-toolchain-at4.0-devel-4.0-1.ppc64.rpm	        239 MB	3/25/11 2:33:00 PM
advance-toolchain-at4.0-perf-4.0-0.ppc64.rpm	        94.7 MB	2/15/11 1:41:00 PM
advance-toolchain-at4.0-perf-4.0-1.ppc64.rpm	        95.6 MB	3/25/11 2:33:00 PM
advance-toolchain-at4.0-runtime-4.0-0.ppc64.rpm	        288 MB	2/15/11 1:41:00 PM
advance-toolchain-at4.0-runtime-4.0-1.ppc64.rpm	        289 MB	3/25/11 2:33:00 PM
advance-toolchain-at4.0-runtime-compat-4.0-0.ppc64.rpm  142 MB	2/15/11 1:42:00 PM
advance-toolchain-at4.0-runtime-compat-4.0-1.ppc64.rpm  141 MB	3/25/11 2:33:00 PM
advance-toolchain-at4.0-src-4.0-0.tgz                   510 MB	2/15/11 1:51:00 PM
advance-toolchain-at4.0-src-4.0-1.tgz                   515 MB	3/25/11 2:50:00 PM
gpg-pubkey-00f50ac5-45e497dc	                        1649 B	4/13/11 8:39:00 AM
release_notes.at4.0-4.0-0.html	                        14.1 kB	2/15/11 1:51:00 PM
release_notes.at4.0-4.0-1.html	                        18.8 kB	3/29/11 10:06:00 AM

Direct your browser to the latest release notes (ftp://linuxpatch.ncsa.uiuc.edu/toolchain/at/at4.0/redhat/RHEL6/release_notes.at4.0-4.0-1.html) to see what has changed since the last release.

The recommended installation method is to use YaST or YUM commands so that the authenticity of the packages can be verified; the release notes contain instructions for doing this. There is nothing wrong with manually downloading and installing the RPMs either.
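
If you download the RPMs manually, you can still verify them against the GPG public key published in the same ftp directory (a minimal sketch, not taken from the release notes; the key file name matches the gpg-pubkey entry in the listing above):

rpm --import gpg-pubkey-00f50ac5-45e497dc
rpm --checksig advance-toolchain-at4.0-runtime-4.0-1.ppc64.rpm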

To manually install the Advance Toolchain, download the latest three RPM files:

advance-toolchain-at4.0-runtime-4.0-1.ppc64.rpm
advance-toolchain-at4.0-devel-4.0-1.ppc64.rpm
advance-toolchain-at4.0-perf-4.0-1.ppc64.rpm

Advance Toolchain Compatibility Libraries for RHEL4/SLES9

The advance-toolchain-at4.0-runtime-compat-4.0-1.ppc64.rpm package is only for use on RHEL4 and SLES9 systems. These systems are too old to allow Advance Toolchain development, but they can still run executables built on more modern systems and linked against the Advance Toolchain libraries.

Do not install advance-toolchain-at4.0-runtime-compat-4.0-1.ppc64.rpm on RHEL5, RHEL6, SLES10, or SLES11, or you will experience problems.

Install the RPMs in the following order using the rpm command, as shown.

rpm -ivh advance-toolchain-at4.0-runtime-4.0-1.ppc64.rpm
rpm -ivh advance-toolchain-at4.0-devel-4.0-1.ppc64.rpm
rpm -ivh advance-toolchain-at4.0-perf-4.0-1.ppc64.rpm

By default, you will find the Advance Toolchain installed at /opt/at4.0/ on your system.

# ls -l /opt/at4.0/
total 56
drwxr-xr-x  2 root root 4096 Jun 10 12:27 bin
drwxr-xr-x  2 root root 4096 Jun 10 12:26 bin64
drwxr-xr-x  2 root root 4096 Jun 10 12:26 etc
drwxr-xr-x 28 root root 4096 Jun 10 12:27 include
drwxr-xr-x  2 root root 4096 Jun 10 12:27 info
drwxr-xr-x 15 root root 4096 Jun 10 12:27 lib
drwxr-xr-x 10 root root 4096 Jun 10 12:27 lib64
drwxr-xr-x  4 root root 4096 Jun 10 12:27 libexec
drwxr-xr-x  3 root root 4096 Jun 10 12:27 libexec64
drwxr-xr-x  4 root root 4096 Jun 10 12:27 man
drwxr-xr-x  4 root root 4096 Jun 10 12:27 powerpc-linux
drwxr-xr-x  4 root root 4096 Jun 10 12:27 powerpc64-linux
drwxr-xr-x  2 root root 4096 Jun 10 12:27 sbin
drwxr-xr-x  2 root root 4096 Jun 10 12:27 sbin64
drwxr-xr-x  7 root root 4096 Jun 10 12:27 scripts
drwxr-xr-x  7 root root 4096 Jun 10 12:27 share
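
The examples in this article invoke the Advance Toolchain compiler by its full path (for example, /opt/at4.0/bin/gcc). If you prefer not to type full paths, one common convenience (a sketch, not a requirement) is to put the Advance Toolchain bin directory first in PATH for the build shell:

export PATH=/opt/at4.0/bin:$PATH
gcc --version    # now resolves to the Advance Toolchain gcc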

 

How to use the Advance Toolchain

As an example, we use a C program, 429.mcf from the SPEC CPU2006 benchmark suite, to demonstrate how to build with the Advance Toolchain. See http://www.spec.org/cpu2006/ for details on the CPU2006 components.

First, we build 429.mcf with the Advance Toolchain. This can be done by simply invoking the gcc located in /opt/at05/bin/ directly, rather than the one that comes with the distro (/usr/bin/gcc). In this example, we also use the -mcpu and -mtune compiler options to tell the compiler that we want to generate code optimized for POWER6.

/opt/at05/bin/gcc -c -o mcf.o      -DSPEC_CPU -DNDEBUG  -DWANT_STDC_PROTO  -O3 -mcpu=power6 -mtune=power6   mcf.c
/opt/at05/bin/gcc -c -o mcfutil.o  -DSPEC_CPU -DNDEBUG  -DWANT_STDC_PROTO  -O3 -mcpu=power6 -mtune=power6   mcfutil.c
/opt/at05/bin/gcc -c -o readmin.o  -DSPEC_CPU -DNDEBUG  -DWANT_STDC_PROTO  -O3 -mcpu=power6 -mtune=power6   readmin.c
/opt/at05/bin/gcc -c -o implicit.o -DSPEC_CPU -DNDEBUG  -DWANT_STDC_PROTO  -O3 -mcpu=power6 -mtune=power6   implicit.c
/opt/at05/bin/gcc -c -o pstart.o   -DSPEC_CPU -DNDEBUG  -DWANT_STDC_PROTO  -O3 -mcpu=power6 -mtune=power6   pstart.c
/opt/at05/bin/gcc -c -o output.o   -DSPEC_CPU -DNDEBUG  -DWANT_STDC_PROTO  -O3 -mcpu=power6 -mtune=power6   output.c
/opt/at05/bin/gcc -c -o treeup.o   -DSPEC_CPU -DNDEBUG  -DWANT_STDC_PROTO  -O3 -mcpu=power6 -mtune=power6   treeup.c
/opt/at05/bin/gcc -c -o pbla.o     -DSPEC_CPU -DNDEBUG  -DWANT_STDC_PROTO  -O3 -mcpu=power6 -mtune=power6   pbla.c
/opt/at05/bin/gcc -c -o pflowup.o  -DSPEC_CPU -DNDEBUG  -DWANT_STDC_PROTO  -O3 -mcpu=power6 -mtune=power6   pflowup.c
/opt/at05/bin/gcc -c -o psimplex.o -DSPEC_CPU -DNDEBUG  -DWANT_STDC_PROTO  -O3 -mcpu=power6 -mtune=power6   psimplex.c
/opt/at05/bin/gcc -c -o pbeampp.o  -DSPEC_CPU -DNDEBUG  -DWANT_STDC_PROTO  -O3 -mcpu=power6 -mtune=power6   pbeampp.c

/opt/at05/bin/gcc   -O3 -mcpu=power6 -mtune=power6 -m32   mcf.o mcfutil.o readmin.o implicit.o pstart.o output.o \
                     treeup.o pbla.o pflowup.o psimplex.o pbeampp.o  -lm -o mcf

Then we can use the normal ldd command to print the shared library dependencies of the executable that we just built. Note that the SPEC run-time harness renames the "mcf" executable in this case to mcf_base.at05.

# ldd mcf_base.at05

        linux-vdso32.so.1 =>  (0x00100000)
        libm.so.6 => /opt/at05/lib/power6/libm.so.6 (0x0ff30000)
        libc.so.6 => /opt/at05/lib/power6/libc.so.6 (0x0fd90000)
        /opt/at05/lib/ld.so.1 (0xf7fc0000)

Note that the executable we just built with the Advance Toolchain depends on the C and math libraries (libm.so.6, libc.so.6) that come with the Advance Toolchain (/opt/at05/lib/power6/). This means that when you move the executable to another system, it will expect to find the Advance Toolchain runtime at that same location.
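
The Advance Toolchain dynamic linker is also recorded directly in the binary. One way to see this (a sketch; readelf is one of the binutils shipped with both toolchains) is to inspect the program interpreter, which for this build should point at /opt/at05/lib/ld.so.1:

readelf -l mcf_base.at05 | grep interpreter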

Now, let us build the same executable with the gcc compiler that comes with the distro (/usr/bin/gcc).

/usr/bin/gcc -c -o mcf.o      -DSPEC_CPU -DNDEBUG  -DWANT_STDC_PROTO  -O3 -mcpu=power6 -mtune=power6  mcf.c
/usr/bin/gcc -c -o mcfutil.o  -DSPEC_CPU -DNDEBUG  -DWANT_STDC_PROTO  -O3 -mcpu=power6 -mtune=power6  mcfutil.c
/usr/bin/gcc -c -o readmin.o  -DSPEC_CPU -DNDEBUG  -DWANT_STDC_PROTO  -O3 -mcpu=power6 -mtune=power6  readmin.c
/usr/bin/gcc -c -o implicit.o -DSPEC_CPU -DNDEBUG  -DWANT_STDC_PROTO  -O3 -mcpu=power6 -mtune=power6  implicit.c
/usr/bin/gcc -c -o pstart.o   -DSPEC_CPU -DNDEBUG  -DWANT_STDC_PROTO  -O3 -mcpu=power6 -mtune=power6  pstart.c
/usr/bin/gcc -c -o output.o   -DSPEC_CPU -DNDEBUG  -DWANT_STDC_PROTO  -O3 -mcpu=power6 -mtune=power6  output.c
/usr/bin/gcc -c -o treeup.o   -DSPEC_CPU -DNDEBUG  -DWANT_STDC_PROTO  -O3 -mcpu=power6 -mtune=power6  treeup.c
/usr/bin/gcc -c -o pbla.o     -DSPEC_CPU -DNDEBUG  -DWANT_STDC_PROTO  -O3 -mcpu=power6 -mtune=power6  pbla.c
/usr/bin/gcc -c -o pflowup.o  -DSPEC_CPU -DNDEBUG  -DWANT_STDC_PROTO  -O3 -mcpu=power6 -mtune=power6  pflowup.c
/usr/bin/gcc -c -o psimplex.o -DSPEC_CPU -DNDEBUG  -DWANT_STDC_PROTO  -O3 -mcpu=power6 -mtune=power6  psimplex.c
/usr/bin/gcc -c -o pbeampp.o  -DSPEC_CPU -DNDEBUG  -DWANT_STDC_PROTO  -O3 -mcpu=power6 -mtune=power6  pbeampp.c

/usr/bin/gcc   -O3 -mcpu=power6 -mtune=power6 -m32 mcf.o mcfutil.o readmin.o implicit.o pstart.o output.o \
               treeup.o pbla.o pflowup.o psimplex.o pbeampp.o -lm -o mcf

The ldd output shows the following. The run-time directives in this case renamed the mcf executable to mcf_base.gcc412.

# ldd mcf_base.gcc412

        linux-vdso32.so.1 =>  (0x00100000)
        libm.so.6 => /lib/power6/libm.so.6 (0x0fd50000)
        libc.so.6 => /lib/power6/libc.so.6 (0x0fe30000)
        /lib/ld.so.1 (0x0ffc0000)

If you simply want to relink a pre-built application against the Advance Toolchain, those instructions are available in the Advance Toolchain release notes (listed in the References section).
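
The release notes describe the exact procedure for each release; the general idea (a sketch only, with paths and flags that may differ for your Advance Toolchain version and for 32-bit versus 64-bit builds) is to keep compiling with the distro gcc but point the final link at the Advance Toolchain dynamic linker and library directory:

/usr/bin/gcc -O3 mcf.o mcfutil.o readmin.o implicit.o pstart.o output.o \
             treeup.o pbla.o pflowup.o psimplex.o pbeampp.o -lm -o mcf \
             -Wl,--dynamic-linker=/opt/at3.0/lib/ld.so.1 -Wl,-rpath,/opt/at3.0/lib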

With Advance Toolchain 3.0, we can again simply use the gcc located in /opt/at3.0/bin:

/opt/at3.0/bin/gcc -c -o mcf.o      -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 -m32  mcf.c
/opt/at3.0/bin/gcc -c -o mcfutil.o  -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 -m32  mcfutil.c
/opt/at3.0/bin/gcc -c -o readmin.o  -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 -m32  readmin.c
/opt/at3.0/bin/gcc -c -o implicit.o -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 -m32  implicit.c
/opt/at3.0/bin/gcc -c -o pstart.o   -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 -m32  pstart.c
/opt/at3.0/bin/gcc -c -o output.o   -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 -m32  output.c
/opt/at3.0/bin/gcc -c -o treeup.o   -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 -m32  treeup.c
/opt/at3.0/bin/gcc -c -o pbla.o     -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 -m32  pbla.c
/opt/at3.0/bin/gcc -c -o pflowup.o  -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 -m32  pflowup.c
/opt/at3.0/bin/gcc -c -o psimplex.o -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 -m32  psimplex.c
/opt/at3.0/bin/gcc -c -o pbeampp.o  -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 -m32  pbeampp.c

/opt/at3.0/bin/gcc -O3 -mcpu=power6 -mtune=power6 -m32 mcf.o mcfutil.o readmin.o implicit.o pstart.o output.o \
                   treeup.o pbla.o pflowup.o psimplex.o pbeampp.o -lm -o mcf

# ldd mcf_base.at30
        linux-vdso32.so.1 =>  (0x00100000)
        libm.so.6 => /opt/at3.0/lib/power6/libm.so.6 (0x0ff30000)
        libc.so.6 => /opt/at3.0/lib/power6/libc.so.6 (0x0fd90000)
        /opt/at3.0/lib/ld.so.1 (0xf7fb0000)

 

Performance Data

In our testing environment, we saw performance improvements in several benchmark components of SPEC CPU2006 when using the Advance Toolchain 1.1, as compared with GCC 4.1.2, on RHEL 5.2. A good gain was seen in one component (464.h264ref, a 7.4% improvement), while significant gains were seen in our engineering tests for two components:

  • 483.xalancbmk: 39% improvement
  • 410.bwaves: 77% improvement!

The same compiler options were used for both gcc 4.1.2 and the Advance Toolchain gcc and libraries. Those options are -O3 -mcpu=power6 -mtune=power6.

The gap between the Advance Toolchain and the GCC that comes with the distributions is, in general, narrower on newer distros such as SLES 11 SP1. This is because Advance Toolchain 3.0 is based on gcc 4.4.4, while the SLES 11 SP1 toolchain provides gcc 4.3.4. In our test environment with SLES 11 SP1 running on a POWER7-based system, 410.bwaves with Advance Toolchain 3.0 is 16% better than with the distro GCC, while 483.xalancbmk is only 2% better. Both compilers used the same compiler options: -O3 -mcpu=power7 -mtune=power7.

The -ffast-math option, however, has a very positive impact on 410.bwaves with Advance Toolchain 3.0 on SLES 11 SP1. The gap between the Advance Toolchain and the distro compiler in this case is large: the Advance Toolchain outperforms the distro compiler by 68%. Using the -ffast-math option with Advance Toolchain 3.0 enables the pow(x, 0.75) optimization. Profile data with -ffast-math is also provided below.

To better understand some of these improvements, we gathered some basic Oprofile performance analysis data. The following performance data was collected from speed runs on an IBM Power 550 (4.2 GHz POWER6 cores) running RHEL 5.2.

We use Oprofile to monitor the processor cycles (PM_CYC_GRP1) and instructions completed (PM_INST_CMPL_GRP1) events during each run. We generally monitor these two events in order to calculate the CPI (cycles per instruction) metric. With the profiling output, we can understand why the Advance Toolchain provides a performance improvement.
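
The following is a minimal sketch of how such a profile can be collected with the opcontrol/opreport commands (the event names and sample counts match the listings below; the vmlinux path and workload invocation are illustrative assumptions):

opcontrol --init
opcontrol --vmlinux=/boot/vmlinux    # path is an assumption; use --no-vmlinux to skip kernel symbols
opcontrol --event=PM_CYC_GRP1:50000000 --event=PM_INST_CMPL_GRP1:50000000
opcontrol --start
./mcf inp.in                         # illustrative workload invocation
opcontrol --dump
opcontrol --stop
opreport -l ./mcf                    # per-symbol samples; CPI = cycle samples / instruction samples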

 

483.xalancbmk comparison

Below is the Oprofile output of 483.xalancbmk with gcc 4.1.2 (out of the box with RHEL 5.2):

CPU: ppc64 POWER6, speed 4204 MHz (estimated)
Counted PM_CYC_GRP1 events ((Group 1 pm_utilization) Processor cycles) with a unit mask of 0x00 
                           (No unit mask) count 50000000
Counted PM_INST_CMPL_GRP1 events ((Group 1 pm_utilization) Instructions completed) with a unit mask of 0x00 
                           (No unit mask) count 50000000

samples  %        samples  %        image name               app name                 symbol name
19267    27.7139  1449      5.3989  libc-2.5.so              libc-2.5.so              _int_malloc
5273      7.5848  7146     26.6254  Xalan_base.gcc412        Xalan_base.gcc412        xalanc_1_8::
4991      7.1791  2805     10.4512  Xalan_base.gcc412        Xalan_base.gcc412        xercesc_2_5::
4989      7.1762  1354      5.0449  Xalan_base.gcc412        Xalan_base.gcc412        xercesc_2_5::
3258      4.6864  934       3.4800  Xalan_base.gcc412        Xalan_base.gcc412        xalanc_1_8::
2161      3.1084  651       2.4256  Xalan_base.gcc412        Xalan_base.gcc412        xalanc_1_8::
2030      2.9200  1327      4.9443  Xalan_base.gcc412        Xalan_base.gcc412        xercesc_2_5::
1670      2.4022  742       2.7646  Xalan_base.gcc412        Xalan_base.gcc412        xalanc_1_8::
1647      2.3691  137       0.5105  libc-2.5.so              libc-2.5.so              malloc

Here is the Oprofile output of 483.xalancbmk with the Advance Toolchain:

CPU: ppc64 POWER6, speed 4204 MHz (estimated)
Counted PM_CYC_GRP1 events ((Group 1 pm_utilization) Processor cycles) with a unit mask of 0x00 
                           (No unit mask) count 50000000
Counted PM_INST_CMPL_GRP1 events ((Group 1 pm_utilization) Instructions completed) with a unit mask of 0x00 
                           (No unit mask) count 50000000

samples  %        samples  %        image name               app name                 symbol name
5510     10.9523  2882     11.0481  Xalan_base.at05          Xalan_base.at05          xercesc_2_5::
5211     10.3580  7205     27.6202  Xalan_base.at05          Xalan_base.at05          xalanc_1_8::
4872      9.6842  1325      5.0794  Xalan_base.at05          Xalan_base.at05          xercesc_2_5::
3108      6.1778  793       3.0399  Xalan_base.at05          Xalan_base.at05          xalanc_1_8::
2365      4.7009  1325      5.0794  Xalan_base.at05          Xalan_base.at05          xercesc_2_5::
1988      3.9516  706       2.7064  Xalan_base.at05          Xalan_base.at05          xalanc_1_8::
1651      3.2817  849       3.2546  Xalan_base.at05          Xalan_base.at05          xalanc_1_8::
1615      3.2102  132       0.5060  libc-2.6.90.so           libc-2.6.90.so           malloc
1419      2.8206  414       1.5871  libc-2.6.90.so           libc-2.6.90.so           _int_malloc

With gcc 4.1.2, almost 28% of the time is spent in the _int_malloc routine, compared to only 2.8% with the Advance Toolchain. Note also that the number of samples for the 'Instructions completed' event in the _int_malloc routine is significantly lower with the Advance Toolchain than with gcc 4.1.2 (414 versus 1449). The CPI for the _int_malloc routine is 13.3 (19267/1449) with gcc 4.1.2, while with the Advance Toolchain it is a significantly lower 3.4 (1419/414). Clearly, for 483.xalancbmk, the _int_malloc routine in the Advance Toolchain performs much more efficiently than the one in gcc 4.1.2. This speedup is due to an improved malloc implementation in the GLIBC-2.7 version (versus GLIBC-2.5) combined with better code generation from GCC-4.1.3 (versus GCC-4.1.2).

The performance of the Xalan code built with GCC 4.3.4 and with Advance Toolchain 3.0 on SLES 11 SP1 is not that different (only about 2%), so the profiling data are quite similar.

 

410.bwaves comparison

Below is the Oprofile output of 410.bwaves with gcc 4.1.2.

CPU: ppc64 POWER6, speed 4204 MHz (estimated)
Counted PM_CYC_GRP1 events ((Group 1 pm_utilization) Processor cycles) with a unit mask of 0x00 
                           (No unit mask) count 50000000
Counted PM_INST_CMPL_GRP1 events ((Group 1 pm_utilization) Instructions completed) with a unit mask of 0x00 
                           (No unit mask) count 50000000

samples  %        samples  %        image name               app name                 symbol name
84778    56.4126  26152    39.2696  libm-2.5.so              libm-2.5.so              __mul
34871    23.2037  27592    41.4319  bwaves_base.gcc412       bwaves_base.gcc412       mat_times_vec_
5937      3.9506  4644      6.9734  bwaves_base.gcc412       bwaves_base.gcc412       bi_cgstab_block_
5483      3.6485  2050      3.0783  bwaves_base.gcc412       bwaves_base.gcc412       shell_
3593      2.3908  1558      2.3395  libm-2.5.so              libm-2.5.so              sub_magnitudes
2435      1.6203  8         0.0120  vmlinux                  vmlinux                  .pseries_dedicated_idle_sleep
2189      1.4566  1155      1.7343  libm-2.5.so              libm-2.5.so              __ieee754_pow
1861      1.2383  768       1.1532  bwaves_base.gcc412       bwaves_base.gcc412       jacobian_
1367      0.9096  8         0.0120  vmlinux                  vmlinux                  .ppc64_runlatch_off
1162      0.7732  452       0.6787  libm-2.5.so              libm-2.5.so              __exp1
838       0.5576  325       0.4880  bwaves_base.gcc412       bwaves_base.gcc412       flux_
729       0.4851  244       0.3664  libm-2.5.so              libm-2.5.so              powl@GLIBC_2.0
439       0.2921  75        0.1126  libm-2.5.so              libm-2.5.so              isnanf
411       0.2735  102       0.1532  libm-2.5.so              libm-2.5.so              norm

Here is the Oprofile output of 410.bwaves with Advance Toolchain.

CPU: ppc64 POWER6, speed 4204 MHz (estimated)
Counted PM_CYC_GRP1 events ((Group 1 pm_utilization) Processor cycles) with a unit mask of 0x00 
                           (No unit mask) count 50000000
Counted PM_INST_CMPL_GRP1 events ((Group 1 pm_utilization) Instructions completed) with a unit mask of 0x00 
                           (No unit mask) count 50000000

samples  %        samples  %        image name               app name                 symbol name
34521    38.5086  27600    47.1505  bwaves_base.at05         bwaves_base.at05         mat_times_vec_
22632    25.2462  18553    31.6950  libm-2.6.90.so           libm-2.6.90.so           __mul
5873      6.5514  4630      7.9097  bwaves_base.at05         bwaves_base.at05         bi_cgstab_block_
5496      6.1308  2063      3.5243  bwaves_base.at05         bwaves_base.at05         shell_
3965      4.4230  27        0.0461  vmlinux                  vmlinux                  .pseries_dedicated_idle_sleep
2924      3.2618  1165      1.9902  libm-2.6.90.so           libm-2.6.90.so           sub_magnitudes
2431      2.7118  1139      1.9458  libm-2.6.90.so           libm-2.6.90.so           __ieee754_pow
2319      2.5869  835       1.4265  bwaves_base.at05         bwaves_base.at05         jacobian_
2294      2.5590  8         0.0137  vmlinux                  vmlinux                  .ppc64_runlatch_off
1233      1.3754  432       0.7380  libm-2.6.90.so           libm-2.6.90.so           __exp1
806       0.8991  335       0.5723  bwaves_base.at05         bwaves_base.at05         flux_

The hot routine in the gcc 4.1.2 case is __mul in the math library (libm-2.5). We spend about 56% of the time there, as opposed to 25% with the Advance Toolchain (libm-2.6.90). This speedup is due to changes in libm-2.6.90.

For example, the __mul math component in libm originally contained this inner loop:

for (i=i1,j=i2-1; i<i2; i++,j--)  zk += X[i]*Y[j];

That loop was optimized in the new libm. The new code is shown below.

/* Rearrange this inner loop to allow the fmadd instructions to be
   independent and execute in parallel on processors that have
   dual symmetrical FP pipelines.  */
if (i1 < (i2-1))
{
    /* Make sure we have at least 2 iterations.  */
    if (((i2 - i1) & 1L) == 1L)
    {
        /* Handle the odd iterations case.  */
        zk2 = x->d[i2-1]*y->d[i1];
    }
    else
        zk2 = zero.d;
    /* Do two multiply/adds per loop iteration, using independent
       accumulators zk and zk2.  */
    for (i=i1,j=i2-1; i<i2-1; i+=2,j-=2)
    {
        zk += x->d[i]*y->d[j];
        zk2 += x->d[i+1]*y->d[j-1];
    }
    zk += zk2; /* Final sum.  */
}
else
{
    /* Special case when the iteration count is 1.  */
    zk += x->d[i1]*y->d[i1];
}

By doing this, two fmadd instructions can be executed in parallel on POWER4, POWER5, and POWER6.

Here is how the profile looks on POWER7 with SLES 11 SP1.

With the distro GCC 4.3.4:

CPU: ppc64 POWER7, speed 3550 MHz (estimated)
Counted PM_CYC_GRP1 events ((Group 1 pm_utilization) Processor Cycles) with a unit mask of 0x00 (No unit mask) count 100000
Counted PM_INST_CMPL_GRP1 events ((Group 1 pm_utilization) Number of PowerPC Instructions that completed.) with a unit mask of 0x00 (No unit mask) count 100000

samples      %     samples   %        image name              app name               symbol name
17148263    53.3  16014958  51.2  bwaves_base.gcc434       bwaves_base.gcc434       .mat_times_vec_
6540182     20.3   8239712  26.3  libm-2.11.1.so           libm-2.11.1.so           .__mul
2161148      6.7    693857   2.2  bwaves_base.gcc434       bwaves_base.gcc434       .shell_
1637263      5.0   2918598   9.3  bwaves_base.gcc434       bwaves_base.gcc434       .bi_cgstab_block_
838656       2.6    417653   1.3  vmlinux                  vmlinux                  .mutex_spin_on_owner
664063       2.0    421670   1.3  libm-2.11.1.so           libm-2.11.1.so           .__ieee754_pow
496053       1.5    397871   1.2  bwaves_base.gcc434       bwaves_base.gcc434       .jacobian_
381957       1.1    514454   1.6  libm-2.11.1.so           libm-2.11.1.so           .sub_magnitudes
248906       0.7    169724   0.5  libm-2.11.1.so           libm-2.11.1.so           .__exp1
231802       0.7    159470   0.5  bwaves_base.gcc434       bwaves_base.gcc434       .flux_

With Advance Toolchain 3.0:

CPU: ppc64 POWER7, speed 3550 MHz (estimated)
Counted PM_CYC_GRP1 events ((Group 1 pm_utilization) Processor Cycles) with a unit mask of 0x00 (No unit mask) count 100000
Counted PM_INST_CMPL_GRP1 events ((Group 1 pm_utilization) Number of PowerPC Instructions that completed.) with a unit mask of 0x00 (No unit mask) count 100000
samples     %     samples    %        image name            app name              symbol name
13663262   50.6  13822146   48.9    bwaves_base.at30     bwaves_base.at30         .mat_times_vec_
6304590    23.3   8126221   28.8    libm-2.11.1.so       libm-2.11.1.so           .__mul
1942634     7.1   2656349    9.4    bwaves_base.at30     bwaves_base.at30         .bi_cgstab_block_
1810912     6.7    609084    2.1    bwaves_base.at30     bwaves_base.at30         .shell_
607147      2.2    421814    1.4    bwaves_base.at30     bwaves_base.at30         .jacobian_
565505      2.0    437193    1.5    libm-2.11.1.so       libm-2.11.1.so           .__ieee754_pow
334724      1.2    465156    1.6    libm-2.11.1.so       libm-2.11.1.so           .sub_magnitudes
233800      0.8    180191    0.6    libm-2.11.1.so       libm-2.11.1.so           .__exp1
228305      0.8    159124    0.5    bwaves_base.at30     bwaves_base.at30         .flux_

Note that the CPI for the mat_times_vec_ routine is 0.99 (13663262/13822146) with AT 3.0, versus 1.07 (17148263/16014958) with GCC 4.3.4.

Here is how the profile looks on POWER7 with SLES 11 SP1 with -ffast-math.

With the distro GCC 4.3.4:

CPU: ppc64 POWER7, speed 3550 MHz (estimated)
Counted PM_CYC_GRP1 events ((Group 1 pm_utilization) Processor Cycles) with a unit mask of 0x00 (No unit mask) count 100000
Counted PM_INST_CMPL_GRP1 events ((Group 1 pm_utilization) Number of PowerPC Instructions that completed.) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        samples  %        image name               app name                 symbol name
12467464 44.2  16228953 48.9  bwaves_base.gcc434-fastmath bwaves_base.gcc434-fastmath .mat_times_vec_
7585627  26.9  9568920  28.8  libm-2.11.1.so              libm-2.11.1.so           .__mul
1893055   6.7  640687    1.9  bwaves_base.gcc434-fastmath bwaves_base.gcc434-fastmath .shell_
1720410   6.1  3191830   9.6  bwaves_base.gcc434-fastmath bwaves_base.gcc434-fastmath .bi_cgstab_block_
925265    3.2  460656    1.3  vmlinux-2.6.32.12-0.7-ppc64 vmlinux-2.6.32.12-0.7-ppc64 .mutex_spin_on_owner
659790    2.3  408821    1.2  libm-2.11.1.so              libm-2.11.1.so           .__ieee754_pow
426744    1.5  584904    1.7  libm-2.11.1.so              libm-2.11.1.so           .sub_magnitudes
250722    0.8  178445    0.5  libm-2.11.1.so              libm-2.11.1.so           .__exp1

The CPI for the mat_times_vec_ routine is even lower, at 0.77 (12467464/16228953).

With Advance Toolchain 3.0:

CPU: ppc64 POWER7, speed 3550 MHz (estimated)
Counted PM_CYC_GRP1 events ((Group 1 pm_utilization) Processor Cycles) with a unit mask of 0x00 (No unit mask) count 100000
Counted PM_INST_CMPL_GRP1 events ((Group 1 pm_utilization) Number of PowerPC Instructions that completed.) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        samples  %        image name               app name                 symbol name
11532685 72.3  14087777 76.6  bwaves_base.at30-fastmath bwaves_base.at30-fastmath .mat_times_vec_
1856526  11.6  2767796  15.0  bwaves_base.at30-fastmath bwaves_base.at30-fastmath .bi_cgstab_block_
1370895   8.6  538636    2.9  bwaves_base.at30-fastmath bwaves_base.at30-fastmath .shell_
358421    2.2  386151    2.1  bwaves_base.at30-fastmath bwaves_base.at30-fastmath .jacobian_
165240    1.0  148917    0.8  bwaves_base.at30-fastmath bwaves_base.at30-fastmath .flux_

Note that the .__ieee754_pow and .__mul routines disappear from the profile entirely.

Libhugetlbfs

Since Advance Toolchain version 2.0, the use of libhugetlbfs has been supported. Consult the release notes for instructions.
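
As a minimal sketch of the common libhugetlbfs runtime approach (not taken from the release notes; the library path is an assumption based on a default /opt/at4.0 install, and huge pages must already be reserved on the system), the library can be preloaded so that malloc is backed by huge pages:

HUGETLB_MORECORE=yes LD_PRELOAD=/opt/at4.0/lib64/libhugetlbfs.so ./mcf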

 

Support

As mentioned in the release notes listed below in the References, for questions regarding the use of the Advance Toolchain, or to report suspected defects, please go to:

http://www-128.ibm.com/developerworks/forums/dw_forum.jsp?forum=937&cat=72

  • Open the Advance Toolchain topic.
  • Select 'Post a New Reply'.
  • Enter and submit your question or problem.

 

References

Advance Toolchain 2.1:

Advance Toolchain 3.0:

Advance Toolchain 4.0:

Decimal Floating Point

Technical preview: DFP functionality for XL C/C++ Advanced Edition for Linux, V9.0

Nigel Griffiths's wiki page on Decimal Floating Point

Advance Toolchain performance improvements

 

Acknowledgements

Originally written by: Chakarat Skawratananond

We would like to thank Michael Meissner, Steve Munroe, Bill Buros, Peter Wong, Dan Jones, Jenifer Hopper, Ryan Arnold, and Carlos Eduardo Seo for their input and review of drafts of this article.


For additional questions or observations on taking advantage of the Advance Toolchain, post them on the Advance Toolchain for Linux on Power forum.