Skip to main content

skip to main content

developerWorks  >  Power Architecture technology  >

Porting workshop, Part 5: Mixed-precision workloads

Affecting mixed-precision calculations while porting compute-intense applications to the Cell Broadband Engine architecture

developerWorks
Document options

Document options requiring JavaScript are not displayed

Discuss


Rate this page

Help us improve this content


Level: Introductory

John Easton (JKJ@uk.ibm.com), Infrastructure Architect, Emerging Technologies, IBM
Ingo Meents (MEENTS@de.ibm.com), Architect for Cell Solutions, Advanced Planning, Simulation, and Optimization, IBM
Olaf Stephan (STEPHANO@de.ibm.com), Server Specialist, DB2, Warehousing BI Solutions, IBM
Horst Zisgen (horst_zisgen@de.ibm.com), Program Manager Simulation/Operations Research, IBM
Sei Kato (SEIKATO@jp.ibm.com), Research Staff Member, IBM

02 Oct 2007

The seven quick-read parts of this "Porting workshop" series take you on a real-world trip from strategy and planning through workload execution, performance tweaking, optimization, and a solid conclusion. The series describes how to most effectively port compute-intensive applications to the Cell Broadband Engine platform. In this Part 5, the authors determine how to make mixed-precision calculations work with the sample application.

This seven-part, quick-read workshop series is taken from the real-world case study whitepaper, "Porting Financial Markets Applications to the Cell Broadband Engine Architecture" (written by John Easton, Ingo Meents, Olaf Stephan, Horst Zisgen, and Sei Kato, IBM Systems and Technology Group, June 2007; see Resources). You can probably spend less than 10 minutes reading each installment and come out at the end with a strong basic knowledge of the requirements for effectively porting a compute-intensive application (in this case, a financial market application) to the Cell/B.E. processor.

Editor's note: The performance results in this series were obtained using Versions 1 and 2.1 of the Cell Broadband Engine Software Developer Kit (SDK). The current version of the SDK, the IBM Software Development Kit for Multicore Acceleration, Version 3.0, has recently become available and offers many enhancements in functionality, ease of use, and performance over the earlier versions. While the results documented in this article are correct for the earlier versions of the SDK, different results will be obtained with SDK 3.0. Watch for updates to the articles in this series that will describe the latest performance improvements obtained using SDK 3.0.

Workshop series
|_ 1. Porting strategies (developerWorks, August 2007)
|
|_ 2. Analysis of the original code (developerWorks, August 2007)
|
|_ 3. Initial performance results (developerWorks, September 2007)
|
|_ 4. Mersenne-Twister (developerWorks, September 2007)
|
|_ 5. Mixed-precision workloads.
|
|_ 6. Tying it all together.
|
|_ 7. Getting the most performance.

Introducing the application

The example application modified in this article is a piece of code used to price a European Option to highlight the benefits of the Cell/B.E. blade. A European Option is a simple financial contract with strict terms and properties that gives the buyer the right to trade a given asset at a specific price on a specific date. It is generally an option that can be exercised only at the end of its life. By contrast, an American Option can be traded at any time between its purchase date and the date at which the contract expires. Because a European Option is traded on a fixed date, it is a simpler calculation to perform because the time variability of the American Option is removed.

You can use several different models price a European Option, depending on the type of asset that underlies it. For example, an option based on currency is calculated using a slightly different model than an option based on futures. In the example described in this series, the calculation is based on a simple Monte Carlo simulation technique. You will generate 200,000,000 uniform, pseudo-random numbers. These numbers are transformed to a log-normal distribution using a Box-Müller transform. Using the random numbers generated, you will execute the financial model repeatedly to simulate a random walk. The final stage of the analysis will be the calculation of the relevant statistics, such as the minimum, maximum, and average and the 95 percent quartile for losses.



Back to top


Evaluating mixed-precision workloads

Given that the current implementation of the Cell/B.E. platform has limited double-precision floating point capabilities, there is a lot of interest in the work done by Kurzak and Dongarra (see "Implementation of a Mixed-Precision in Solving Systems of Linear Equations on the CELL Processor" in Resources) on mixed-precision calculations on the Cell/B.E. platform. In their paper, an approximate solution to a system of linear equations is generated using single-precision arithmetic. This solution is then plugged into the equation, and the delta is computed. Then a second iteration using double-precision arithmetic is started, based on the delta.

This approach cannot be mapped directly to the problem of pricing European Options using a Monte Carlo method, but you can do something similar by generating 32-bit (single-precision) random numbers and using these for the double-precision cases. In essence, rather than performing all of the calculations exclusively using single- or double-precision, only those parts of the calculation that actually need double-precision are calculated using double-precision. This increases the programming effort needed slightly: first, to identify which parts of the code should use which sort of precision and, second, to make the appropriate changes to the code. But it also delivers a substantial performance improvement due to the outstanding single-precision performance of the chip.

We recently investigated a couple of potential uses of this mixed-precision technique for the European Option pricing code. In both cases, the pricing of the European Option is still performed using double-precision arithmetic. The Mersenne-Twister random-number generator is where we applied the mixed-precision mathematics.

In the first case, a double-precision random variant is generated by concatenating two single-precision random variables. In the second case, the double-precision variant is formed by generating one single-precision random variable and then turning it into a double-precision random variable by doing a double-precision division. The second case requires generating half the number of random numbers (200,000,000) compared to the first case (400,000,000), but the second case does require additional double-precision arithmetic. The determining factor here is whether the doubling of the single-precision mathematical component in the first case is offset by the double-precision activities in the second case.

See Table 6 for an example. (Tables and figures are numbered consecutively throughout the series to match the versions in the original whitepaper.)


Table 6. Performance by number of SPUs (different precision types)
# SPUCC_DP_MTCC_DP_SDKM_DP_MTSP_MTSP_SDK
140.3340.3345.7612.0111.16
220.3320.3322.886.065.70
313.5613.5615.264.053.80
410.1710.1711.443.042.85
58.138.139.162.432.29
66.786.787.642.031.91
75.825.826.551.751.64
85.095.095.751.531.44
94.534.525.111.361.28
104.084.084.601.221.15
113.703.704.181.111.05
123.403.393.841.020.96
133.143.143.540.940.89
142.922.923.290.880.83
152.722.723.070.820.78
162.552.562.880.770.73

Table 6 shows the results in which:

  • CC_DP_MT = Concatenation Double-Precision Mersenne-Twister
  • CC_DP_SDK = Concatenation Double-Precision SDK
  • M_DP_MT = Division Double-Precision Mersenne-Twister
  • SP_MT = Single-Precision Mersenne-Twister
  • SP_SDK = Single-Precision SDK

Figure 4 shows the results.


Figure 4. Plot of the different precision runtimes against the number of SPUs
Plot of the different precision runtimes against the number of SPUs

Because we completed the various optimizations over a number of months, many things changed, so unfortunately, we cannot make a direct comparison between these mixed-precision numbers and those described in earlier experiments. In Part 4, you could see a factor-of-four improvement of the Mersenne-Twister random-number generator over the SDK-supplied one. After that, the optimized code in the SDK changed significantly with the way that vectorized random-number generation is performed. The rejection method is further causing uncertainty due to the scalar nature of that process and the randomness involved.

The SDK also changed considerably. The original porting of the code was performed using SDK 1.0. The current release is SDK 2.1. We observed that running the binaries on a QS20 blade running SDK 2.1 was approximately twice as fast as running the same binaries on a QS20 running SDK 2.0. For a proper comparison, the firmware and kernel versions should also be taken into consideration, but this shows the continuing advances being made in terms of the SDK and its performance. The following table shows examples of the speedup between SDK 2.0 and SDK 2.1:

SDKCC_DP_SDKSP_MTM_DP_MT
2.07.83s2.045.71
2.13.94s1.042.88

We've since used some additional optimization techniques to improve performance even further. These techniques include:

  • Unrolling more parts of the Mersenne-Twister random-number generator function that generates 32-bit unsigned integers.

  • Adding additional software pipelining by parallelizing computation and data loads into the two pipes of the SPE's execution unit.

  • Pre-calculating items, such as:
    a[0]=<something>;
    for (i=0;i<N;i++)
    {
          sinf4(a[0]) ;
          sinf4(a[i+1));
          ......
    }
    

    Precalculating items reduces the runtime and enables the compiler to make a better job of optimizing the code.

  • Introducing new variables to eliminate dependencies and allow the compiler to make better use of the two pipelines.

This demonstrates that you can use the outstanding single precision capabilities of the Cell/B.E. environment to outweigh the constraints of the limited double-precision performance of the current Cell/B.E. processor implementation.



Back to top


Acknowledgments

Many other individuals contributed (both knowingly and unknowingly) to this piece of work. The authors wish to acknowledge their kind contributions. Without this assistance, this paper would never have been written.

Share this...

digg Digg this story
del.icio.us Post to del.icio.us
Slashdot Slashdot it!



Resources

Learn

Get products and technologies

Discuss


About the authors

John is currently leading a worldwide emerging technologies team within IBM Systems and Technology Group. He has several roles competing for his time, all of which revolve around advising organizations on how best to exploit new technologies. John has been working for IBM for over 20 years in a variety of technical roles. He worked in Distributed Systems Development in Austin before the launch of the RS/6000, and he holds several patents in the areas of security and systems software. Before taking his current role, he was the European technical leader for grid computing.


Ingo Meents joined IBM nine years ago and works currently as an IT Architect in IBM Global Engineering Solutions (GES). His current focus is to provide IBM customers with knowledge of the latest Cell/B.E. software technology by consulting, educating, briefing, and creating solutions for this platform. Before his work on the Cell/B.E. platform, he was lead architect for a modeling, simulation, and production planning solution used by the IBM 300mm semiconductor line in Fishkill. Starting as a research student at IBM, Ingo Meents received his doctor's degree from the University of Clausthal in 2001.


Olaf Stephan joined IBM in 1998 and currently works as an IT Specialist in IBM Global Engineering Solutions (GES). His focus is to provide IBM customers with knowledge of the latest Cell/B.E. software technology by consulting, educating, briefing, and development for this platform. Before his work on the Cell/B.E. platform, he worked in the areas of data management, data warehousing, business intelligence, and data integration. Olaf holds a Masters degree in Electrical Engineering, specializing in Communications Technology, from the University of Applied Sciences, Koblenz, Germany.


Horst has over 10 years of experience in the application of simulation methods and the development of mathematical models in different areas. He is currently leading a development team in IBM Global Engineering Solutions (GES) that is working on a simulation and planning solution used by IBM 300mm manufacturing in Fishkill and by external customers as well. Horst is also the European subject matter expert for the GES supply chain offerings. In addition, Horst regularly gives lectures at universities about simulation and mathematical modeling. Horst is a member of a standardization group for simulation and optimization.


Sei Kato is a staff member in IBM Research, Tokyo Research Laboratory. He joined IBM in 2002 after receiving his PhD in Mathematical Science from the University of Tokyo. After joining IBM, Sei has worked on modeling and simulating the performance of Web systems. His is currently working on the acceleration of financial calculations and on large-scale traffic simulations.




Rate this page


Please take a moment to complete this form to help us better serve you.



YesNoDon't know
 


 


12345
Not
useful
Extremely
useful
 


Back to top


IBM is a trademark of IBM Corporation in the United States, other countries, or both. Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others.