| | This article introduces one of the key performance-improving technologies available to Linux on Power users: the Advance Toolchain. Version 2.0-5 is available for Linux on Power systems. |
| | |
| | This article provides introductory instructions on how to use the Advance Toolchain to generate executables for IBM Power processor-based systems running Linux. These examples were all executed on RHEL 5.2 and confirmed on RHEL 5.3. The Advance Toolchain is available for SLES 10 and SLES 11 as well. |
| | |
| | HPC Central is a joint IBM/Customer accessible and editable forum to provide improved HPC technical communications. See [HPC Central|http://www.ibm.com/developerworks/wikis/display/hpccentral/HPC+Central] and [Terms of Use|http://www.ibm.com/developerworks/wikis/display/hpccentral/HPC+Central+Terms+of+Use] |
| | |
| | {tip:title=For discussions...} |
| | For discussions on this page, see [this thread|http://www.ibm.com/developerworks/forums/thread.jspa?threadID=232373&tstart=0] |
| | |
| | For additional questions or observations, post them on the [Advance Toolchain for Linux on Power forum|http://www.ibm.com/developerworks/forums/forum.jspa?forumID=1518] |
| | |
| | Additional Linux on Power performance information is available on the [Performance page|http://www.ibm.com/developerworks/wikis/display/LinuxP/Performance] |
| | {tip} |
| | |
| | *Contents* |
| | {toc:minLevel=2} |
| | |
| | \\ |
| | h2. Introduction |
| | |
| | The Advance Toolchain is a set of open source software development tools and extensions that lets users take advantage of the latest IBM hardware features: |
| | # POWER6 enablement and exploitation for improved performance, |
| | # ppc970, POWER4, POWER5, POWER5+, POWER6, and POWER6x optimized system and math libraries, and |
| | # Decimal Floating Point capability. |
| | |
| | The Advance Toolchain is a self-contained toolchain that does not rely on the base system toolchain to operate; it is designed to coexist with the toolchain shipped with the operating system. The Advance Toolchain package includes the following components: |
| | * GNU Compiler Collection (gcc, g++, gfortran), |
| | * C Libraries (libc, libmpfr, and others), |
| | * binary utilities (ld, ldd, objcopy, objdump, nm, and others), |
| | * debugger (gdb32, gdb64), and |
| | * performance analysis tools (Oprofile, Valgrind, gprof, mtrace, xtrace). |
| | |
| | For more information on Decimal Floating Point support, see the papers listed in the References section. We will cover the performance analysis tools in a future article. |
| | |
| | In this article, we provide some examples of the performance improvements available when using the Advance Toolchain as compared to the toolchain provided with each distro. |
| | |
| | Be aware that executables built with the Advance Toolchain require the Advance Toolchain to be installed on every system where they run. |
| | |
| | \\ |
| | h2. What is Different? What is provided? |
| | |
| | The Advance Toolchain approach provides a mechanism for access to libraries and enhancements which have not yet been incorporated into the Red Hat and SUSE operating system bases. In general, the Advance Toolchain provides newer, more up-to-date, versions of the libraries as the code evolves in the community. |
| | |
| | The binaries and libraries are newer versions. |
| | | Tool | RHEL 5.1 | RHEL 5.2 | RHEL 5.3 | Advance Toolchain 2.0-5 | |
| | | gcc | 4.1.2 | 4.1.2 | 4.1.3 | | |
| | | binutils | 2.17.50.0.6-5.el5 | 2.17.50.0.6-6.el5 | 2.17.50.0.6-9.el5 | | |
| | | glibc | 2.5-18 | 2.5-24 | 2.7-2007-08-02 | | |
| | | libm | 2.5 | 2.5 | 2.6.90 | | |
| | | glibc-powerpc-cpu-addon | v0.06 | v0.07 | v0.06 | | |
| | | oprofile | | | 0.9.3-18.el5 | | |
| | |
| | |
| | The corresponding versions for Novell SUSE are: |
| | |
| | | Tool | SLES 10 sp2 | SLES 11 | Advance Toolchain 2.0-5 | |
| | | gcc | 4.1.2_20070115-0.21 | 4.3-62.198 | 4.3. 20080606 | |
| | | binutils | 2.16.91.0.5-23.31 | 2.19-11.28 | | |
| | | glibc | 2.4-31.54 | 2.9-13.2 | | |
| | | libm | 2.4 | 2.9 | | |
| | | glibc-powerpc-cpu-addon | ? | | | |
| | | oprofile | not shipped | 0.9.4-51.4 | | |
| | |
| | |
| | \\ |
| | h3. So why is this provided? |
| | |
| | By having access to the latest Linux toolchain for Power systems, customers and programmers have easy access to the latest technologies and versions of the tools, libraries, and executables from the community. |
| | |
| | In turn, feedback on the latest toolchain makes it easier to integrate changes, fixes, and updates into future distro service packs, releases, and versions. |
| | |
| | \\ |
| | h2. Installation |
| | |
| | Today, customers can download the Advance Toolchain from the University of Illinois FTP site: |
| | |
| | * [ftp://linuxpatch.ncsa.uiuc.edu/toolchain/at/at05/] |
| | |
| | There are specific versions for varying distro levels and releases. |
| | |
| | Index of ftp://linuxpatch.ncsa.uiuc.edu/toolchain/at/at05/redhat/RHEL5 |
| | {noformat} |
| | advance-toolchain-devel-1.1-0.ppc64.rpm 87256 KB 09/15/2007 12:00:00 AM |
| | advance-toolchain-devel-2.0-5.ppc64.rpm 83170 KB 10/31/2008 03:40:00 PM |
| | advance-toolchain-perf-1.1-0.ppc64.rpm 40909 KB 09/15/2007 12:00:00 AM |
| | advance-toolchain-perf-2.0-5.ppc64.rpm 53251 KB 10/31/2008 03:40:00 PM |
| | advance-toolchain-runtime-1.1-0.ppc64.rpm 207433 KB 09/15/2007 12:00:00 AM |
| | advance-toolchain-runtime-2.0-5.ppc64.rpm 227874 KB 10/31/2008 03:40:00 PM |
| | advance-toolchain-src-1.1-0.tgz 172723 KB 09/15/2007 12:00:00 AM |
| | advance-toolchain-src-2.0-5.tgz 222533 KB 10/31/2008 03:52:00 PM |
| | gpg-pubkey-00f50ac5-45e497dc 2 KB 09/15/2007 12:00:00 AM |
| | release_notes.at05-1.1-0.html 14 KB 11/11/2008 08:02:00 PM |
| | release_notes.at05-2.0-5.html 18 KB 11/06/2008 10:55:00 PM |
| | {noformat} |
| | |
| | To install, first download the latest three rpm files. |
| | {noformat} |
| | advance-toolchain-devel-2.0-5.ppc64.rpm |
| | advance-toolchain-perf-2.0-5.ppc64.rpm |
| | advance-toolchain-runtime-2.0-5.ppc64.rpm |
| | {noformat} |
| | |
| | Install them using the rpm command. For example, |
| | {noformat} |
| | rpm -ivh advance-toolchain-*.rpm |
| | {noformat} |
| | |
| | The recommended installation method is to use YaST or YUM, because these tools verify the authenticity of the packages. Please consult the Release Notes for the Advance Toolchain 05 for detailed instructions. |
| | |
| | Links to the release notes: |
| | * [Release notes for SLES|ftp://linuxpatch.ncsa.uiuc.edu/toolchain/at/at05/suse/SLES_10/release_notes.at05-2.0-5.html] |
| | |
| | |
| | By default, you will find the Advance Toolchain installed at /opt/at05 on your system. |
| | {noformat} |
| | # ls -l /opt/at05/ |
| | total 56 |
| | drwxr-xr-x 2 root root 4096 Jun 10 12:27 bin |
| | drwxr-xr-x 2 root root 4096 Jun 10 12:26 bin64 |
| | drwxr-xr-x 2 root root 4096 Jun 10 12:26 etc |
| | drwxr-xr-x 28 root root 4096 Jun 10 12:27 include |
| | drwxr-xr-x 2 root root 4096 Jun 10 12:27 info |
| | drwxr-xr-x 15 root root 4096 Jun 10 12:27 lib |
| | drwxr-xr-x 10 root root 4096 Jun 10 12:27 lib64 |
| | drwxr-xr-x 4 root root 4096 Jun 10 12:27 libexec |
| | drwxr-xr-x 3 root root 4096 Jun 10 12:27 libexec64 |
| | drwxr-xr-x 4 root root 4096 Jun 10 12:27 man |
| | drwxr-xr-x 4 root root 4096 Jun 10 12:27 powerpc-linux |
| | drwxr-xr-x 2 root root 4096 Jun 10 12:27 sbin |
| | drwxr-xr-x 2 root root 4096 Jun 10 12:27 sbin64 |
| | drwxr-xr-x 7 root root 4096 Jun 10 12:27 share |
| | {noformat} |
| | |
| | \\ |
| | h2. How to use the Advance Toolchain |
| | |
| | As an example, we use a C program called 429.mcf from the SPECint2006 benchmark suite to demonstrate how to build with the Advance Toolchain. See [http://www.spec.org/cpu2006/] for details on the CPU2006 components. |
| | |
| | First, we build 429.mcf with the Advance Toolchain. This only requires using the gcc located in /opt/at05/bin rather than the one that comes with the distro (/usr/bin/gcc). In this example, we also use the \-mcpu and \-mtune compiler options to tell the compiler to generate code optimized for POWER6. |
| | {noformat} |
| | /opt/at05/bin/gcc -c -o mcf.o -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 mcf.c |
| | /opt/at05/bin/gcc -c -o mcfutil.o -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 mcfutil.c |
| | /opt/at05/bin/gcc -c -o readmin.o -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 readmin.c |
| | /opt/at05/bin/gcc -c -o implicit.o -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 implicit.c |
| | /opt/at05/bin/gcc -c -o pstart.o -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 pstart.c |
| | /opt/at05/bin/gcc -c -o output.o -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 output.c |
| | /opt/at05/bin/gcc -c -o treeup.o -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 treeup.c |
| | /opt/at05/bin/gcc -c -o pbla.o -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 pbla.c |
| | /opt/at05/bin/gcc -c -o pflowup.o -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 pflowup.c |
| | /opt/at05/bin/gcc -c -o psimplex.o -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 psimplex.c |
| | /opt/at05/bin/gcc -c -o pbeampp.o -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 pbeampp.c |
| | |
| | /opt/at05/bin/gcc -O3 -mcpu=power6 -mtune=power6 -m32 mcf.o mcfutil.o readmin.o implicit.o pstart.o output.o treeup.o pbla.o pflowup.o psimplex.o pbeampp.o -lm -o mcf |
| | {noformat} |
| | |
| | Then, we can use the normal ldd command to print the shared library dependencies of the executable that we just built. Note that the SPEC.org run-time harness renames the "mcf" executable to {{mcf_base.at05}} in this case. |
| | {noformat} |
| | # ldd mcf_base.at05 |
| | |
| | linux-vdso32.so.1 => (0x00100000) |
| | libm.so.6 => /opt/at05/lib/power6/libm.so.6 (0x0ff30000) |
| | libc.so.6 => /opt/at05/lib/power6/libc.so.6 (0x0fd90000) |
| | /opt/at05/lib/ld.so.1 (0xf7fc0000) |
| | {noformat} |
| | Note that the executable we just built with the Advance Toolchain depends on the C and math libraries (libc.so.6, libm.so.6) that come with the Advance Toolchain (/opt/at05/lib/power6/). This means that when you move the executable to another system, it will expect to find the Advance Toolchain at that location. |
| | |
| | Now, let us build the same executable with the gcc compiler that comes with the distro (/usr/bin/gcc). |
| | {noformat} |
| | /usr/bin/gcc -c -o mcf.o -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 mcf.c |
| | /usr/bin/gcc -c -o mcfutil.o -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 mcfutil.c |
| | /usr/bin/gcc -c -o readmin.o -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 readmin.c |
| | /usr/bin/gcc -c -o implicit.o -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 implicit.c |
| | /usr/bin/gcc -c -o pstart.o -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 pstart.c |
| | /usr/bin/gcc -c -o output.o -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 output.c |
| | /usr/bin/gcc -c -o treeup.o -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 treeup.c |
| | /usr/bin/gcc -c -o pbla.o -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 pbla.c |
| | /usr/bin/gcc -c -o pflowup.o -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 pflowup.c |
| | /usr/bin/gcc -c -o psimplex.o -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 psimplex.c |
| | /usr/bin/gcc -c -o pbeampp.o -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -O3 -mcpu=power6 -mtune=power6 pbeampp.c |
| | |
| | /usr/bin/gcc -O3 -mcpu=power6 -mtune=power6 -m32 mcf.o mcfutil.o readmin.o implicit.o pstart.o output.o treeup.o pbla.o pflowup.o psimplex.o pbeampp.o -lm -o mcf |
| | {noformat} |
| | The ldd output is shown below. In this case, the SPEC run-time harness renamed the mcf executable to {{mcf_base.gcc412}}. |
| | {noformat} |
| | # ldd mcf_base.gcc412 |
| | |
| | linux-vdso32.so.1 => (0x00100000) |
| | libm.so.6 => /lib/power6/libm.so.6 (0x0fd50000) |
| | libc.so.6 => /lib/power6/libc.so.6 (0x0fe30000) |
| | /lib/ld.so.1 (0x0ffc0000) |
| | {noformat} |
| | If you simply want to relink a pre-built application with the Advance Toolchain, instructions are available in the Advance Toolchain release notes (listed in the References section). |
| | |
| | \\ |
| | h2. Performance Data |
| | |
| | In our testing environment, we saw performance improvements in several SPECcpu2006 benchmark components when using the Advance Toolchain 1.1, as compared with GCC 4.1.2, on RHEL 5.2. A good gain was seen in one component (464.h264ref: 7.4% improvement), while significant gains were seen in our engineering tests for two components: |
| | * 483.xalancbmk: 39% improvement |
| | * 410.bwaves: 77% improvement |
| | |
| | The same compiler options were used for both gcc 4.1.2 and the Advance Toolchain gcc and libraries. Those options are {{-O3 -mcpu=power6 -mtune=power6}}. |
| | |
| | To better understand some of these improvements, we gathered some basic Oprofile performance analysis data. The following data was collected from the speed runs on an IBM Power 550 (POWER6 4.2 GHz cores) running RHEL 5.2. |
| | |
| | We use Oprofile to monitor the Processor cycles (PM_CYC_GRP1) and Instructions completed (PM_INST_CMPL_GRP1) events during each run. Monitoring these two events lets us calculate the CPI (cycles per instruction) metric. The profiling output helps explain where the performance improvement with the Advance Toolchain comes from. |
| | |
| | \\ |
| | h3. 483.xalancbmk comparison |
| | |
| | Below is the Oprofile output of 483.xalancbmk with gcc 4.1.2 (out of the box with RHEL 5.2) |
| | |
| | {noformat} |
| | CPU: ppc64 POWER6, speed 4204 MHz (estimated) |
| | Counted PM_CYC_GRP1 events ((Group 1 pm_utilization) Processor cycles) with a unit mask of 0x00 |
| | (No unit mask) count 50000000 |
| | Counted PM_INST_CMPL_GRP1 events ((Group 1 pm_utilization) Instructions completed) with a unit mask of 0x00 |
| | (No unit mask) count 50000000 |
| | |
| | samples % samples % image name app name symbol name |
| | 19267 27.7139 1449 5.3989 libc-2.5.so libc-2.5.so _int_malloc |
| | 5273 7.5848 7146 26.6254 Xalan_base.gcc412 Xalan_base.gcc412 xalanc_1_8:: |
| | 4991 7.1791 2805 10.4512 Xalan_base.gcc412 Xalan_base.gcc412 xercesc_2_5:: |
| | 4989 7.1762 1354 5.0449 Xalan_base.gcc412 Xalan_base.gcc412 xercesc_2_5:: |
| | 3258 4.6864 934 3.4800 Xalan_base.gcc412 Xalan_base.gcc412 xalanc_1_8:: |
| | 2161 3.1084 651 2.4256 Xalan_base.gcc412 Xalan_base.gcc412 xalanc_1_8:: |
| | 2030 2.9200 1327 4.9443 Xalan_base.gcc412 Xalan_base.gcc412 xercesc_2_5:: |
| | 1670 2.4022 742 2.7646 Xalan_base.gcc412 Xalan_base.gcc412 xalanc_1_8:: |
| | 1647 2.3691 137 0.5105 libc-2.5.so libc-2.5.so malloc |
| | {noformat} |
| | |
| | Oprofile profiling output of 483.xalancbmk with the Advance Toolchain. |
| | |
| | {noformat} |
| | CPU: ppc64 POWER6, speed 4204 MHz (estimated) |
| | Counted PM_CYC_GRP1 events ((Group 1 pm_utilization) Processor cycles) with a unit mask of 0x00 |
| | (No unit mask) count 50000000 |
| | Counted PM_INST_CMPL_GRP1 events ((Group 1 pm_utilization) Instructions completed) with a unit mask of 0x00 |
| | (No unit mask) count 50000000 |
| | |
| | samples % samples % image name app name symbol name |
| | 5510 10.9523 2882 11.0481 Xalan_base.at05 Xalan_base.at05 xercesc_2_5:: |
| | 5211 10.3580 7205 27.6202 Xalan_base.at05 Xalan_base.at05 xalanc_1_8:: |
| | 4872 9.6842 1325 5.0794 Xalan_base.at05 Xalan_base.at05 xercesc_2_5:: |
| | 3108 6.1778 793 3.0399 Xalan_base.at05 Xalan_base.at05 xalanc_1_8:: |
| | 2365 4.7009 1325 5.0794 Xalan_base.at05 Xalan_base.at05 xercesc_2_5:: |
| | 1988 3.9516 706 2.7064 Xalan_base.at05 Xalan_base.at05 xalanc_1_8:: |
| | 1651 3.2817 849 3.2546 Xalan_base.at05 Xalan_base.at05 xalanc_1_8:: |
| | 1615 3.2102 132 0.5060 libc-2.6.90.so libc-2.6.90.so malloc |
| | 1419 2.8206 414 1.5871 libc-2.6.90.so libc-2.6.90.so _int_malloc |
| | |
| | {noformat} |
| | |
| | With gcc 4.1.2, we spend almost 28% of the time in the _int_malloc routine, compared to only 2.8% with the Advance Toolchain. Note also that the number of samples for the 'Instructions completed' event in _int_malloc is significantly lower with the Advance Toolchain than with gcc 4.1.2 (414 versus 1449). Because both events are sampled at the same count (50000000), the ratio of cycle samples to instruction samples approximates CPI: for _int_malloc it is 13.3 (19267/1449) with gcc 4.1.2 and 3.4 (1419/414) with the Advance Toolchain. Clearly, for 483.xalancbmk, the _int_malloc routine in the Advance Toolchain performs much more efficiently than the one in the distro toolchain. This speedup is due to an improved malloc implementation in GLIBC-2.7 (vs GLIBC-2.5) combined with better code generation in GCC-4.1.3 (vs GCC-4.1.2). |
| | |
| | |
| | \\ |
| | h3. 410.bwaves comparison |
| | |
| | Below is the Oprofile output of 410.bwaves with gcc 4.1.2. |
| | |
| | {noformat} |
| | CPU: ppc64 POWER6, speed 4204 MHz (estimated) |
| | Counted PM_CYC_GRP1 events ((Group 1 pm_utilization) Processor cycles) with a unit mask of 0x00 |
| | (No unit mask) count 50000000 |
| | Counted PM_INST_CMPL_GRP1 events ((Group 1 pm_utilization) Instructions completed) with a unit mask of 0x00 |
| | (No unit mask) count 50000000 |
| | |
| | samples % samples % image name app name symbol name |
| | 84778 56.4126 26152 39.2696 libm-2.5.so libm-2.5.so __mul |
| | 34871 23.2037 27592 41.4319 bwaves_base.gcc412 bwaves_base.gcc412 mat_times_vec_ |
| | 5937 3.9506 4644 6.9734 bwaves_base.gcc412 bwaves_base.gcc412 bi_cgstab_block_ |
| | 5483 3.6485 2050 3.0783 bwaves_base.gcc412 bwaves_base.gcc412 shell_ |
| | 3593 2.3908 1558 2.3395 libm-2.5.so libm-2.5.so sub_magnitudes |
| | 2435 1.6203 8 0.0120 vmlinux vmlinux .pseries_dedicated_idle_sleep |
| | 2189 1.4566 1155 1.7343 libm-2.5.so libm-2.5.so __ieee754_pow |
| | 1861 1.2383 768 1.1532 bwaves_base.gcc412 bwaves_base.gcc412 jacobian_ |
| | 1367 0.9096 8 0.0120 vmlinux vmlinux .ppc64_runlatch_off |
| | 1162 0.7732 452 0.6787 libm-2.5.so libm-2.5.so __exp1 |
| | 838 0.5576 325 0.4880 bwaves_base.gcc412 bwaves_base.gcc412 flux_ |
| | 729 0.4851 244 0.3664 libm-2.5.so libm-2.5.so powl@GLIBC_2.0 |
| | 439 0.2921 75 0.1126 libm-2.5.so libm-2.5.so isnanf |
| | 411 0.2735 102 0.1532 libm-2.5.so libm-2.5.so norm |
| | |
| | {noformat} |
| | |
| | Here is the Oprofile output of 410.bwaves with Advance Toolchain. |
| | |
| | {noformat} |
| | CPU: ppc64 POWER6, speed 4204 MHz (estimated) |
| | Counted PM_CYC_GRP1 events ((Group 1 pm_utilization) Processor cycles) with a unit mask of 0x00 |
| | (No unit mask) count 50000000 |
| | Counted PM_INST_CMPL_GRP1 events ((Group 1 pm_utilization) Instructions completed) with a unit mask of 0x00 |
| | (No unit mask) count 50000000 |
| | |
| | samples % samples % image name app name symbol name |
| | 34521 38.5086 27600 47.1505 bwaves_base.at05 bwaves_base.at05 mat_times_vec_ |
| | 22632 25.2462 18553 31.6950 libm-2.6.90.so libm-2.6.90.so __mul |
| | 5873 6.5514 4630 7.9097 bwaves_base.at05 bwaves_base.at05 bi_cgstab_block_ |
| | 5496 6.1308 2063 3.5243 bwaves_base.at05 bwaves_base.at05 shell_ |
| | 3965 4.4230 27 0.0461 vmlinux vmlinux .pseries_dedicated_idle_sleep |
| | 2924 3.2618 1165 1.9902 libm-2.6.90.so libm-2.6.90.so sub_magnitudes |
| | 2431 2.7118 1139 1.9458 libm-2.6.90.so libm-2.6.90.so __ieee754_pow |
| | 2319 2.5869 835 1.4265 bwaves_base.at05 bwaves_base.at05 jacobian_ |
| | 2294 2.5590 8 0.0137 vmlinux vmlinux .ppc64_runlatch_off |
| | 1233 1.3754 432 0.7380 libm-2.6.90.so libm-2.6.90.so __exp1 |
| | 806 0.8991 335 0.5723 bwaves_base.at05 bwaves_base.at05 flux_ |
| | |
| | {noformat} |
| | The hot routine in the gcc 4.1.2 case is \_\_mul in the math library (libm-2.5). We spend about 56% of the time there, as opposed to 25% with the Advance Toolchain (libm-2.6.90). This speedup is due to changes in libm-2.6.90. |
| | |
| | For example, the original \_\_mul routine in libm contains the following inner loop. |
| | {code} |
| | for (i=i1,j=i2-1; i<i2; i++,j--) zk += X[i]*Y[j]; |
| | {code} |
| | |
| | That loop was optimized in the new libm. The new code is shown below. |
| | |
| | {code} |
| | /* rearrange this inner loop to allow the fmadd instructions to be |
| |    independent and execute in parallel on processors that have |
| |    dual symmetrical FP pipelines. */ |
| | if (i1 < (i2-1)) |
| |   { |
| |     /* make sure we have at least 2 iterations */ |
| |     if (((i2 - i1) & 1L) == 1L) |
| |       { |
| |         /* Handle the odd iterations case. */ |
| |         zk2 = x->d[i2-1]*y->d[i1]; |
| |       } |
| |     else |
| |       zk2 = zero.d; |
| |     /* Do two multiply/adds per loop iteration, using independent |
| |        accumulators; zk and zk2. */ |
| |     for (i=i1,j=i2-1; i<i2-1; i+=2,j-=2) |
| |       { |
| |         zk += x->d[i]*y->d[j]; |
| |         zk2 += x->d[i+1]*y->d[j-1]; |
| |       } |
| |     zk += zk2; /* final sum. */ |
| |   } |
| | else |
| |   { |
| |     /* Special case when iterations is 1. */ |
| |     zk += x->d[i1]*y->d[i1]; |
| |   } |
| | {code} |
| | |
| | With this rearrangement, the two fmadd instructions in each loop iteration no longer depend on each other and can execute in parallel on POWER4, POWER5, and POWER6. |
| | |
| | \\ |
| | h2. Libhugetlbfs |
| | |
| | This version of the Advance Toolchain does not officially support libhugetlbfs. More formal support will be provided in a future release. |
| | |
| | \\ |
| | h2. Support |
| | |
| | As mentioned in the Release Notes listed below in the References, for questions regarding the use of the Advance Toolchain or to report suspected defects in the Advance Toolchain, please go to: |
| | [http://www-128.ibm.com/developerworks/forums/dw_forum.jsp?forum=937&cat=72] |
| | |
| | * Open the Advance Toolchain topic. |
| | * Select 'Post a New Reply' |
| | * Enter and submit your question or problem |
| | |
| | \\ |
| | h2. References |
| | |
| | Release Notes for the Advance Toolchain 05 Version 1.1-0 |
| | * [ftp://linuxpatch.ncsa.uiuc.edu/toolchain/at/at05/redhat/RHEL5/release_notes.at05-1.1-0.html] |
| | |
| | GLIBC PowerPC CPU-tuned add-on website |
| | * http://penguinppc.org/dev/glibc/glibc-powerpc-cpu-addon.html |
| | |
| | *Decimal Floating Point* |
| | |
| | Technical preview: DFP functionality for XL C/C+\+ Advanced Edition for Linux, V9.0 |
| | * [http://www-1.ibm.com/support/docview.wss?rs=2239&context=SSJT9L&uid=swg27010218] |
| | |
| | Nigel Griffiths's wiki page on Decimal Floating Point |
| | * http://www-941.ibm.com/collaboration/wiki/display/WikiPtype/Decimal+Floating+Point |
| | |
| | \\ |
| | h2. Acknowledgements |
| | |
| | Written by: Chakarat Skawratananond |
| | |
| | We would like to thank Bill Buros, Peter Wong, Dan Jones, Jenifer Hopper, Steve Munroe, Ryan Arnold, and Carlos Eduardo Seo for their input and review of drafts of this article. |
| | |
| | {tip:title=For discussions...} |
| | For additional questions or observations on taking advantage of the Advance Toolchain, post them on the [Advance Toolchain for Linux on Power forum|http://www.ibm.com/developerworks/forums/forum.jspa?forumID=1518] |
| | {tip} |