This article introduces the Advance Toolchain, one of the key performance-improving technologies available to Linux on Power users. Version 2.0-5 of the Advance Toolchain is available for Linux on Power systems.
This article provides introductory instructions on how to use the Advance Toolchain to generate executables for IBM Power processor-based systems running Linux. These examples were all executed on RHEL 5.2 and confirmed on RHEL 5.3. The Advance Toolchain is available for SLES 10 and SLES 11 as well.
Introduction
The Advance Toolchain is a set of open source software development extensions and tools that lets users take advantage of the latest IBM hardware features:
- POWER6 enablement and exploitation for improved performance
- ppc970, POWER4, POWER5, POWER5+, POWER6, and POWER6x optimized system and math libraries, and
- Decimal Floating Point capability.
The Advance Toolchain is a self-contained toolchain that does not rely on the base system toolchain to operate. In fact, it is designed to coexist with the toolchain shipped with the operating system. The Advance Toolchain package includes the following components:
- GNU Compiler Collection (gcc, g++, gfortran),
- C Libraries (libc, libmpfr, and others),
- binary utilities (ld, ldd, objcopy, objdump, nm, and others),
- debugger (gdb32, gdb64), and
- performance analysis tools (Oprofile, Valgrind, gprof, mtrace, xtrace).
For more information on Decimal Floating Point support, we refer you to other papers listed under the Reference section. We will cover the performance analysis tools in a future article.
In this article, we will provide some examples of the performance improvements available when using the Advance Toolchain as compared to the toolchain provided in each distro.
Be aware that executables built with the Advance Toolchain require the Advance Toolchain to be installed on every system where they are run.
What is Different? What is provided?
The Advance Toolchain approach provides a mechanism for access to libraries and enhancements which have not yet been incorporated into the Red Hat and SUSE operating system bases. In general, the Advance Toolchain provides newer, more up-to-date, versions of the libraries as the code evolves in the community.
The table below compares the versions of the binaries and libraries for Red Hat:
| Tool | RHEL 5.1 | RHEL 5.2 | RHEL 5.3 | Advance Toolchain 2.0-5 |
| --- | --- | --- | --- | --- |
| gcc | 4.1.2 | 4.1.2 | 4.1.3 | |
| binutils | 2.17.50.0.6-5.el5 | 2.17.50.0.6-6.el5 | 2.17.50.0.6-9.el5 | |
| glibc | 2.5-18 | 2.5-24 | 2.7-2007-08-02 | |
| libm | 2.5 | 2.5 | 2.6.90 | |
| glibc-powerpc-cpu-addon | v0.06 | v0.07 | v0.06 | |
| oprofile | | | 0.9.3-18.el5 | |
The corresponding versions for Novell SUSE:
| Tool | SLES 10 sp2 | SLES 11 | Advance Toolchain 2.0-5 |
| --- | --- | --- | --- |
| gcc | 4.1.2_20070115-0.21 | 4.3-62.198 | 4.3. 20080606 |
| binutils | 2.16.91.0.5-23.31 | 2.19-11.28 | |
| glibc | 2.4-31.54 | 2.9-13.2 | |
| libm | 2.4 | 2.9 | |
| glibc-powerpc-cpu-addon | ? | | |
| oprofile | not shipped | 0.9.4-51.4 | |
So why is this provided?
By having access to the latest Linux toolchain for Power systems, customers and programmers have easy access to the latest technologies and versions of the tools, libraries, and executables from the community.
In turn, feedback from users on the latest toolchain makes it easier to integrate changes, fixes, and updates into future distro service packs, releases, and versions.
Installation
Today, customers can download the Advance Toolchain from the University of Illinois FTP site. There are specific versions for varying distro levels and releases. For example, the RHEL 5 packages are located at:
ftp://linuxpatch.ncsa.uiuc.edu/toolchain/at/at05/redhat/RHEL5
To install, first download the latest three rpm files.
Install them using the rpm command.
The recommended installation method is to use the YaST or YUM commands, which verify the authenticity of the packages. Please consult the Release Notes for the Advance Toolchain 05 for detailed instructions.
Links to the readme files:
By default, you will find the Advance Toolchain installed at /opt/at05 on your system.
How to use the Advance Toolchain
As an example, we use a C program called 429.mcf from the SPECint2006 benchmark suite to demonstrate how to build with the Advance Toolchain. See http://www.spec.org/cpu2006/ for details on the CPU2006 components.
First, we will build 429.mcf with the Advance Toolchain. This is done simply by using the gcc located in /opt/at05/bin rather than the one that comes with the distro (/usr/bin/gcc). In this example, we also use the -mcpu and -mtune compiler options to tell the compiler that we want to generate code optimized for POWER6.
Then, we can use the normal ldd command to print the shared library dependencies of the executable that we just built. Note that the SPEC.org run-time harness renames the "mcf" executable in this case to mcf_base.at05.
Note that the executable we just built with the Advance Toolchain depends on the C and math libraries (libm.so.6, libc.so.6) that come with the Advance Toolchain (/opt/at05/lib/power6/). This means that when you move the executable to another system, it will expect to find the Advance Toolchain at that location.
Now, let us build the same executable with the gcc compiler that comes with the distro (/usr/bin/gcc).
The ldd output shows the following. The runtime directives in this case renamed the mcf executable to mcf_base.gcc412.
If you simply want to relink your pre-built application with the Advance Toolchain, the instructions are available in the Advance Toolchain release notes (listed in the References section).
Performance Data
In our testing environment, we saw performance improvements in several SPECcpu2006 benchmark components when using the Advance Toolchain 1.1, as compared with GCC 4.1.2, on RHEL 5.2. A good gain was seen in one component (464.h264ref, a 7.4% improvement), while significant gains were seen in our engineering tests for two components:
- 483.xalancbmk: 39% improvement
- 410.bwaves: 77% improvement
The same compiler options were used for both gcc 4.1.2 and the Advance Toolchain gcc and libraries. Those options are -O3 -mcpu=power6 -mtune=power6.
To better understand some of these improvements, we gathered some basic "oprofile" performance analysis data. The following performance data was collected from the speed runs on an IBM Power 550 (POWER6 4.2 GHz cores) running RHEL 5.2.
We use Oprofile to monitor the Processor cycles (PM_CYC_GRP1) and Instructions completed (PM_INST_CMPL_GRP1) events during each run. We generally monitor these two events in order to calculate the CPI (cycles per instruction) metric. The profiling output helps explain where the performance improvement with the Advance Toolchain comes from.
483.xalancbmk comparison
Below is the Oprofile output of 483.xalancbmk with gcc 4.1.2 (out of the box with RHEL 5.2)
Oprofile profiling output of 483.xalancbmk with the Advance Toolchain.
With gcc 4.1.2, we spend almost 28% of the time in the _int_malloc routine, compared to only 2.8% with the Advance Toolchain. Note also that the number of samples for the 'Instructions completed' event in the _int_malloc routine is significantly lower with the Advance Toolchain than with gcc 4.1.2 (414 versus 1449). The CPI for the _int_malloc routine with gcc 4.1.2 is 13.3 (19267/1449), while the CPI for the same routine with the Advance Toolchain is 3.4 (1419/414), which is significantly lower. Clearly, for 483.xalancbmk, the _int_malloc routine in the Advance Toolchain performs much more efficiently than the one used with gcc 4.1.2. This speedup is due to an improved malloc implementation in GLIBC-2.7 (vs GLIBC-2.5) combined with better code generation associated with GCC-4.1.3 (vs GCC-4.1.2).
410.bwaves comparison
Below is the Oprofile output of 410.bwaves with gcc 4.1.2.
Here is the Oprofile output of 410.bwaves with Advance Toolchain.
The hot routine in the gcc 4.1.2 case is __mul in the math library (libm-2.5). We spend 57% of the time there, as opposed to 25% with the Advance Toolchain (libm-2.6.90). This speedup is due to changes in libm-2.6.90.
For example, the original __mul routine in libm contains the following inner loop:
    for (i=i1,j=i2-1; i<i2; i++,j--) zk += X[i]*Y[j];
That loop was optimized in the new libm. The new code is shown below.
    /* rearrange this inner loop to allow the fmadd instructions to be
       independent and execute in parallel on processors that have
       dual symmetrical FP pipelines. */
    if (i1 < (i2-1))
    {
        /* make sure we have at least 2 iterations */
        if (((i2 - i1) & 1L) == 1L)
        {
            /* Handle the odd iterations case. */
            zk2 = x->d[i2-1]*y->d[i1];
        }
        else
            zk2 = zero.d;
        /* Do two multiply/adds per loop iteration, using independent
           accumulators; zk and zk2. */
        for (i=i1,j=i2-1; i<i2-1; i+=2,j-=2)
        {
            zk += x->d[i]*y->d[j];
            zk2 += x->d[i+1]*y->d[j-1];
        }
        zk += zk2; /* final sum. */
    }
    else {
        /* Special case when iterations is 1. */
        zk += x->d[i1]*y->d[i1];
    }
By doing this, two fmadd instructions can be executed in parallel on POWER4, POWER5, and POWER6.
Libhugetlbfs
This version of the Advance Toolchain does not officially support libhugetlbfs. More formal support will be provided in a future release.
Support
As mentioned in the Release Notes listed below in the References, for questions regarding the use of the Advance Toolchain or to report suspected defects in the Advance Toolchain, please go to:
http://www-128.ibm.com/developerworks/forums/dw_forum.jsp?forum=937&cat=72
- Open the Advance Toolchain topic.
- Select 'Post a New Reply'
- Enter and submit your question or problem
References
Release Notes for the Advance Toolchain 05 Version 1.1-0
GLIBC PowerPC CPU-tuned add-on website
Decimal Floating Point
Technical preview: DFP functionality for XL C/C++ Advanced Edition for Linux, V9.0
Nigel Griffiths's wiki page on Decimal Floating Point
Acknowledgements
Written by: Chakarat Skawratananond
We would like to thank Bill Buros, Peter Wong, Dan Jones, Jenifer Hopper, Steve Munroe, Ryan Arnold, and Carlos Eduardo Seo for their input and review of drafts of this article.