How to Leverage Decimal Floating-Point unit on POWER6 for Linux

Updated 4/11/13 10:13 AM by hraval

This page gives software developers the steps needed to take advantage of the decimal floating-point unit (DFU) available on IBM POWER6 processor-based systems running Linux. It first gives a brief introduction to decimal floating-point (DFP) technology, and then explains how to use the compilers to exploit DFP.

Today, there are two packages with compilers that can be used to exploit the decimal floating-point functionality on POWER6-based Linux systems:

  • Advance Toolchain v1.1, which provides a newer, alternative gcc compiler for your system, and
  • IBM XL C/C++ Advanced Edition for Linux, V9.0.

Newer versions of the Advance Toolchain are now available, in particular Advance Toolchain version 2.1-1. The advantage of this latest release is that official support can be made available for it, and POWER7 exploitation features are being introduced.

The information on this page applies to both Red Hat Enterprise Linux (RHEL) and SUSE Linux Enterprise Server (SLES) 10. The examples below were carried out on a system running RHEL 5.2. Java applications can also take advantage of the Decimal Floating Point enhancements, but that is beyond the scope of this article. See the references at the end of this page for more information on Java exploitation.

Additional Linux on Power performance information is available at http://www.ibm.com/developerworks/wikis/display/LinuxP/Home

For discussions or questions...

To start a discussion or get a question answered, consider posting on the Linux for Power Architecture forum.

 

 


Decimal Floating-Point

Decimal (the classic day-to-day base 10) data is widely used in commercial and financial applications. However, most computer systems provide only binary (base two) arithmetic, using 0 and 1 to represent numbers. There are two binary number systems in computers: integer (fixed-point) and floating-point. Unfortunately, decimal fractions cannot always be represented exactly in binary floating-point. For example, the value 0.1 would need an infinitely recurring binary fraction, while a decimal number system can represent it exactly, as one tenth. So binary floating-point cannot guarantee that results will be the same as those obtained with decimal arithmetic.

There's a good web page available which describes General Decimal Arithmetic - see http://speleotrove.com/decimal/. Mike Cowlishaw, an IBM Fellow, has consolidated a lot of good information on this page.

Traditionally, decimal floating-point operations have been emulated in software using binary fixed-point integers, with decimal numbers held in a binary-coded decimal (BCD) format. While BCD provides sufficient accuracy for decimal calculation, it imposes a heavy cost in performance because it is usually implemented in software.

IBM POWER6 processor-based systems provide hardware support for decimal floating-point arithmetic. The POWER6 microprocessor core includes a decimal floating-point unit that accelerates decimal floating-point arithmetic. The IBM POWER instruction set has been expanded: 54 new instructions were added to support the decimal floating-point unit architecture.

Next, we show how developers can exploit decimal floating point math on Linux.

 


The Advance Toolchain version 1.1-0

The Advance Toolchain is a set of free-software development tools that lets users take advantage of IBM's latest hardware features: (1) POWER6 enablement and exploitation, (2) system libraries optimized for ppc970, POWER4, POWER5, POWER5+, POWER6, and POWER6x, and (3) Decimal Floating Point capability.

The Advance Toolchain is a self-contained toolchain that does not rely on the base system toolchain to operate, and in fact is designed to coexist with the toolchain shipped with the operating system. That is, you do not have to uninstall the regular GCC compilers that come with your Linux distribution in order to use the Advance Toolchain.

The Advance Toolchain package includes the following components:

  • GNU Compiler Collection (gcc, g++, gfortran),
  • C Libraries (libc, libmpfr, and others),
  • binary utilities (ld, ldd, objcopy, objdump, nm, and others),
  • debugger (gdb32, gdb64), and
  • performance analysis tools (Oprofile, Valgrind, gprof, mtrace, xtrace).

Customers can download the Advance Toolchain from

Download and install the following three rpms (rpm -ivh *.rpm)

  • advance-toolchain-devel-1.1-0.ppc64.rpm
  • advance-toolchain-perf-1.1-0.ppc64.rpm
  • advance-toolchain-runtime-1.1-0.ppc64.rpm

The simple release notes are available at:

The recommended installation method is to use YaST or YUM commands in order to verify the authenticity of the packages. Please consult the Release Notes for the Advance Toolchain for the detailed instructions. In our experience, installing with the rpm method is just fine.

The following is a list of gcc compiler options for Advance Toolchain related to Decimal Floating Point:

  • -D__STDC_WANT_DEC_FP__ : makes the DFP-related symbols defined in the standard headers (for example DEC32_MAX in float.h) available to your code.
  • -ldfp : links in the decimal floating-point functionality provided by the Advance Toolchain.
  • -mno-dfp : instructs the compiler to use calls to library functions to handle decimal floating-point computation, regardless of the architecture level. You may experience performance degradation when using software emulation.

 


IBM XL C/C++ Compilers

IBM XL C/C++ Advanced Edition for Linux is a standards-based compiler with advanced optimizing features for select Linux distributions running on POWER-based systems. It is not free, but there is a 60-day trial program in case you would like to try it out.

http://www-01.ibm.com/software/awdtools/fortran/xlfortran/features/linux/xlf-linux.html

http://www-01.ibm.com/software/awdtools/xlcpp/features/linux/xlcpp-linux.html

To try out DFP functionality with the IBM XL compiler, you first need to install the Advance Toolchain and then configure the IBM compiler to use it. The Advance Toolchain provides the runtime support for DFP.

Assuming that you installed the Advance Toolchain and the IBM compiler at their default locations, configure the IBM compiler by executing the following commands:

# cd /opt/ibmcmp/vacpp/9.0/bin
# ./vac_configure -gcc /opt/at05 -gcc64 /opt/at05 -ibmcmp /opt/ibmcmp \
         -o /etc/opt/ibmcmp/vac/9.0/vac.dfp.cfg \
         -dfp /opt/ibmcmp/vac/9.0/etc/vac.base.cfg

Detailed instruction to configure the IBM compiler for DFP can be found at

http://www-1.ibm.com/support/docview.wss?rs=2239&context=SSJT9L&uid=swg27010218

We recommend upgrading your IBM XL compiler with the latest ptf which can be downloaded at

http://www-1.ibm.com/support/docview.wss?uid=swg24018145

The following compiler options for the IBM XL compilers relate to Decimal Floating Point:

  • -qdfp : enables decimal floating-point support. Specifically, this option makes the compiler recognize decimal floating-point literal suffixes and the _Decimal32, _Decimal64, and _Decimal128 keywords.
  • -qfloat=dfpemulate : instructs the compiler to use calls to library functions to handle decimal floating-point computation, regardless of the architecture level. You may experience performance degradation when using software emulation.
  • -qfloat=nodfpemulate : disables the software emulation; this is the default when -qarch=pwr6 or -qarch=pwr6e is specified.
  • -D__STDC_WANT_DEC_FP__ : makes the DFP-related symbols defined in the standard headers (for example DEC32_MAX in float.h) available to your code.
  • -F/path/to/my/configfile : specifies the full path name of the compiler configuration file to use.
  • -ldfp : links in the decimal floating-point functionality provided by the Advance Toolchain.

Below, we provide two sample programs demonstrating the use of DFP. For each program, we show how to build it using the Advance Toolchain and the IBM XL compilers.

The first sample is just a simple program. We recommend using it to check that your compiler setup is correct.

 


Sample Code 1

sample1.c
#include <stdio.h>
#include <float.h>

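int main(void) {
    _Decimal128 d128;
    double fl;

    printf("Hello DFP world\n");
    /* H, D and DD are the printf length modifiers for
       _Decimal32, _Decimal64 and _Decimal128 respectively. */
    printf("DEC32_MAX = %Hf\n", DEC32_MAX);
    printf("DEC64_MAX = %Df\n", DEC64_MAX);
    printf("DEC128_MAX = %DDf\n", DEC128_MAX);
    d128 = 1.000001DL;   /* DL suffix: _Decimal128 literal */
    printf("1.000001 as _Decimal128: \n = '%40.30DDf'\n", d128);
    fl = 1.000001;       /* binary double, for comparison */
    printf("1.000001 as a float: \n = '%40.30f'\n",fl);
    return 0;
}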

Here is how to build it with the IBM compiler. The program's output is also shown below.

# export PATH=$PATH:/opt/ibmcmp/vac/9.0/bin
# xlc -qarch=pwr6 -qdfp -F/etc/opt/ibmcmp/vac/9.0/vac.dfp.cfg sample1.c -o sample1 -ldfp -D__STDC_WANT_DEC_FP__

# ./sample1
Hello DFP world
DEC32_MAX = 9.999999E+96
DEC64_MAX = 9.999999999999999E+384
DEC128_MAX = 9.999999999999999999999999999999999E+6144
1.000001 as _Decimal128:
 = '                                1.000001'
1.000001 as a float:
 = '        1.000000999999999917733362053696'

Now build the same program with the Advance Toolchain.

# /opt/at05/bin/gcc -Wall sample1.c -o sample1 -std=gnu99 -ldfp -D__STDC_WANT_DEC_FP__
# ./sample1
Hello DFP world
DEC32_MAX = 9.999999E+96
DEC64_MAX = 9.999999999999999E+384
DEC128_MAX = 9.999999999999999999999999999999999E+6144
1.000001 as _Decimal128:
 = '                                1.000001'
1.000001 as a float:
 = '        1.000000999999999917733362053696'

As you can see, DFP represents 1.000001 exactly, whereas binary floating-point can only store the nearest representable value, which is not quite "exactly correct".

The second program is originally from Nigel Griffiths's AIX Decimal Floating Point wiki page (See References Section). We only need to add a few header files so that the code can be compiled on Linux.

 


Sample Code 2

sample2.c
/* Code from Nigel Griffiths */ 
#include <ctype.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

/* Takes a string with a decimal number and returns a _Decimal128.
 * Format: [+ -]digits.digits
 */
_Decimal128 atodecimal(char *s)
{
_Decimal128 top=0, bot=0, result;
int negative=0, i;

        if( s[0] == '-') {
                negative=1;
                s++;
        }
        if( s[0] == '+') s++;
         for(; isdigit(*s); s++) {
                top = top * 10;
                top = top + *s - '0';
        }
        if(*s == '.') {
                s++;
                for(i=strlen(s)-1; isdigit(s[i]);i--) {
                        bot = bot / 10;
                        bot = bot + (_Decimal128)(s[i] - '0')/(_Decimal128)10;
                }
        }
        result = top + bot;
        if(negative)
                result = -result;
        return result;
}
int main(int argc, char **argv)
{
long i, count;
double dfund, dinterest;
_Decimal128 Dfund, Dinterest;                 /* Declaring the new data type*/

        dfund     = atof(argv[1]);
        dinterest = atof(argv[2]);
        Dfund     = atodecimal(argv[1]);      /* Assigning values just like other data types */
        Dinterest = atodecimal(argv[2]);
        count    = atoi(argv[3]);

        printf("double  fund=%20.10f interest=%40.30f\n",dfund,dinterest);
        printf("Decimal fund=%20.10DDf interest=%40.30DDf\n",Dfund,Dinterest);  

        for(i=0;i<count;i++) {
                dfund=dfund*dinterest;
                Dfund=Dfund*Dinterest;        /* performing maths */
        }

        printf("Print final funds\n");
        printf("double  fund=%30.10f\n",dfund);
        printf("Decimal fund=%30.10DDf\n",Dfund);
}

Here is how to build this program using the IBM XL compilers on POWER6 processor-based systems.

# xlc sample2.c -o dfphw -qarch=pwr6 -qdfp -F/etc/opt/ibmcmp/vac/9.0/vac.dfp.cfg -D__STDC_WANT_DEC_FP__ -ldfp

Here is the output and its timing when we run the program on a POWER6 processor-based p 550 running at 4.2 GHz.

# time ./dfphw 10 1.000001 60000000
double  fund=       10.0000000000 interest=        1.000000999999999917733362053696
Decimal fund=                  10 interest=                                1.000001
Print final funds
double  fund=1141973124493563816969240576.0000000000
Decimal fund=1141973130130727445029596475.971760

real    0m1.325s
user    0m1.317s
sys     0m0.003s

Now, to demonstrate the benefit of DFU over the software-based decimal floating-point computation, we will force the compiler to use the software emulation mode.

# xlc sample2.c -o dfpsw -qarch=pwr6 -qdfp -F/etc/opt/ibmcmp/vac/9.0/vac.dfp.cfg -D__STDC_WANT_DEC_FP__ -qfloat=dfpemulate -ldfp
#  time ./dfpsw 10 1.000001 60000000
double  fund=       10.0000000000 interest=        1.000000999999999917733362053696
Decimal fund=                  10 interest=                                1.000001
Print final funds
double  fund=1141973124493563816969240576.0000000000
Decimal fund=1141973130130727445029596475.971760

real    3m1.189s
user    3m0.715s
sys     0m0.315s

As you can see, using the DFU is roughly 137 times faster than software emulation (about 1.3 seconds versus about 181 seconds of elapsed time). Keep in mind that your application may not see an improvement this large.

If you want to use the Advance Toolchain to build sample2.c, here is how.

# /opt/at05/bin/gcc sample2.c -o at11 -D__STDC_WANT_DEC_FP__ -ldfp

To build this program without the DFU hardware support with the Advance Toolchain, here is how to do it:

# /opt/at05/bin/gcc sample2.c -o at11sw -D__STDC_WANT_DEC_FP__ -mno-dfp

Basically, you only need to replace the flag -ldfp with -mno-dfp.

Next, we will show how to use OProfile to determine whether your code is really using the Decimal Floating-point Unit. OProfile uses hardware performance counters to profile all running programs with little overhead. In addition to event-based profiling, OProfile can also provide basic time-based profiling. At the time of this writing, the latest version of OProfile (v0.9.3) supports DFU-related events on POWER6. Those events can be found in the following groups: Group 89 pm_dfu and Group 90 pm_dfu2.

#Group 89 pm_dfu, DFU events
event:0X0590 counters:0 um:zero minimum:1000 name:PM_DFU_ADD_GRP89 : (Group 89 pm_dfu) DFU add type instruction
event:0X0591 counters:1 um:zero minimum:1000 name:PM_DFU_ADD_SHIFTED_BOTH_GRP89 : (Group 89 pm_dfu) DFU add type with both operands shifted
event:0X0592 counters:2 um:zero minimum:1000 name:PM_DFU_BACK2BACK_GRP89 : (Group 89 pm_dfu) DFU back to back operations executed
event:0X0593 counters:3 um:zero minimum:1000 name:PM_DFU_CONV_GRP89 : (Group 89 pm_dfu) DFU convert from fixed op

#Group 90 pm_dfu2, DFU events
event:0X05A0 counters:0 um:zero minimum:1000 name:PM_DFU_ENC_BCD_DPD_GRP90 : (Group 90 pm_dfu2) DFU Encode BCD to DPD
event:0X05A1 counters:1 um:zero minimum:1000 name:PM_DFU_EXP_EQ_GRP90 : (Group 90 pm_dfu2) DFU operand exponents are equal for add type
event:0X05A2 counters:2 um:zero minimum:1000 name:PM_DFU_FIN_GRP90 : (Group 90 pm_dfu2) DFU instruction finish
event:0X05A3 counters:3 um:zero minimum:1000 name:PM_DFU_SUBNORM_GRP90 : (Group 90 pm_dfu2) DFU result is a subnormal

In the following example, we will configure OProfile to monitor the event called DFU instruction finish when we run the program dfphw.

Here is what you need to do before running the program.

### To clear Oprofile log and cache
# rm -rf /var/lib/oprofile/samples/current
# rm -f /var/lib/oprofile/samples/oprofiled.log
# rm -f /root/.oprofile/daemonrc

### To configure Oprofile to monitor the PM_DFU_FIN_GRP90 (DFU instruction finish) event with the count 5000.
# opcontrol --vmlinux=/boot/vmlinux-2.6.18-92.el5 -e PM_DFU_FIN_GRP90:5000

# opcontrol --reset
# opcontrol --init
# opcontrol --status
# opcontrol --start

Then, run the dfphw program. Now, issue the following commands to get the profile output and annotated source.

To get annotated source, you need to compile your code with the -g flag.

# opcontrol --dump
# opcontrol --stop
# opcontrol --shutdown

# opreport   -l -p /lib/modules/2.6.18-92.el5/kernel  > oprofile.out

# opannotate --source > oprofile.source-annotate

Below is the profiling output (oprofile.out). All of the samples for the DFU instruction finish event fall in the main routine, which confirms that the program is really using the DFU.

CPU: ppc64 POWER6, speed 4204 MHz (estimated)
Counted PM_DFU_FIN_GRP90 events ((Group 90 pm_dfu2) DFU instruction finish) with a unit mask of 0x00 (No unit mask) count 5000
samples  %        symbol name
12000    100.000  main

Here is a part of our annotated source showing where in the main routine the DFU operation has taken place.

               :int main(int argc, char **argv) /* main total:  12000 100.000 */
               :{
               :long i, count;
               :double dfund, dinterest;
               :_Decimal128 Dfund, Dinterest;                  /* Declaring the new data type*/
               :
               :        dfund     = atof(argv[1]);
               :        dinterest = atof(argv[2]);
               :        Dfund     = atodecimal(argv[1]);       /* Assigning values just like other data types */
               :        Dinterest = atodecimal(argv[2]);
               :        count    = atoi(argv[3]);
               :
               :        printf("double  fund=%20.10f interest=%40.30f\n",dfund,dinterest);
               :        printf("Decimal fund=%20.10DDf interest=%40.30DDf\n",Dfund,Dinterest); 
               :
 12000 100.000 :        for(i=0;i<count;i++) {
               :                dfund=dfund*dinterest;
               :                Dfund=Dfund*Dinterest;         /* performing maths */
               :        }
               :
               :        printf("Print final funds\n");
               :        printf("double  fund=%30.10f\n",dfund);
               :        printf("Decimal fund=%30.10DDf\n",Dfund);
               :}

 


Summary

Decimal numbers are widely used in commercial and financial applications. Software support for DFP is generally available today but suffers from performance problems. The Decimal Floating-Point Unit provides hardware support for decimal floating-point arithmetic on POWER6 processor-based systems. Two compilers are available on Linux to exploit this feature: the IBM XL C/C++ compilers and the Advance Toolchain. This hardware support will in general give you a performance boost; the level of improvement, however, depends on the nature of your application.

 


Summarized References