In this seven-part, quick-read workshop series, taken from the real-world case study whitepaper, "Porting Financial Markets Applications to the Cell Broadband Engine™ Architecture" (see Resources), you can spend minimal time reading each installment and complete the series with a strong basic knowledge of the requirements for effectively porting a compute-intensive application (in this case, a financial market application) to the Cell/B.E. processor.
Editor's note: The performance results in this series were obtained using Versions 1 and 2.1 of the Cell Broadband Engine Software Developer Kit (SDK). The current version of the SDK, the IBM Software Development Kit for Multicore Acceleration, Version 3.0, has recently become available and offers many enhancements in functionality, ease of use, and performance over the earlier versions. While the results documented in this article are correct for the earlier versions of the SDK, different results will be obtained with SDK 3.0. Watch for updates to the articles in this series that will describe the latest performance improvements obtained using SDK 3.0.
A description of the application
The example application is a piece of code that prices a European Option; we ported it to highlight the benefits of the Cell/B.E. blade. A European Option is a simple financial contract with strict terms and properties that gives the buyer the right to trade a given asset at a specific price on a specific date -- that is, it can generally be exercised only at the end of its life. In contrast, an American Option may be exercised at any time between its purchase date and the date on which the contract expires.
Because a European Option is exercised on a fixed date, it is simpler to price: the time variability of the American Option has been removed.
A number of different models can be used to price a European Option depending on the type of asset that underlies it. For instance, an option based on currency is priced using a slightly different model than an option based on futures. In the case described in this series, the calculation is based on a simple Monte Carlo simulation technique. A large number (200,000,000 in this case) of uniform pseudo-random numbers is generated. These numbers are converted to normally distributed values via a Box-Muller transform, from which the log-normal distribution used by the model is obtained. Using the random numbers generated, the financial model is executed repeatedly to simulate a random walk. The final stage of the analysis is the calculation of the relevant statistics: the minimum, the maximum, the average, and the 95 percent quantile for losses.
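As a rough illustration of the pipeline just described (uniform numbers, Box-Muller transform, repeated evaluation of the model, summary statistics), the following Python sketch prices a European call. It is not the code from the case study, which is C; Python is used only for brevity. The single-step geometric Brownian motion model, the function names, the parameter values, and the much smaller path count are all assumptions made for the example. The 95 percent quantile is taken over the simulated payoffs, because this installment does not define the loss measure.

```python
import math
import random


def box_muller(u1, u2):
    """Map uniform samples u1 in (0, 1] and u2 in [0, 1) to two independent standard normals."""
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


def simulate_payoffs(n_paths, spot, strike, rate, vol, maturity, seed=0):
    """Discounted European call payoffs, one per simulated terminal price (assumed model)."""
    rng = random.Random(seed)  # Python's built-in generator is a Mersenne Twister
    drift = (rate - 0.5 * vol * vol) * maturity
    diffusion = vol * math.sqrt(maturity)
    discount = math.exp(-rate * maturity)
    payoffs = []
    while len(payoffs) < n_paths:
        u1 = 1.0 - rng.random()  # shift [0, 1) to (0, 1] so that log(u1) is defined
        u2 = rng.random()
        for z in box_muller(u1, u2):
            terminal = spot * math.exp(drift + diffusion * z)  # log-normal terminal price
            payoffs.append(discount * max(terminal - strike, 0.0))
    return payoffs[:n_paths]  # each Box-Muller step yields two values; trim any extra


def summarize(values):
    """Minimum, maximum, mean, and nearest-rank 95 percent quantile of a non-empty sample."""
    ordered = sorted(values)
    rank = max(math.ceil(0.95 * len(ordered)) - 1, 0)
    return {
        "min": ordered[0],
        "max": ordered[-1],
        "mean": sum(ordered) / len(ordered),
        "q95": ordered[rank],
    }
```

Calling `summarize(simulate_payoffs(200_000, 100.0, 100.0, 0.05, 0.2, 1.0))` returns the four statistics for a hypothetical at-the-money call; the mean is the Monte Carlo estimate of the option price.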
Analysis of the original code
We used standard Linux profiling tools and techniques to analyze the original code. The result of this profiling exercise using gprof is shown in Table 1.
Table 1. gprof output for the original code (flat profile -- each sample counts as 0.01 seconds)
| % time | Self seconds | Calls | Function name |
|---|---|---|---|
| 62.70 | 118.32 | 200000000 | getRandom() |
| 37.18 | 70.16 | 1 | simulateEuropeanOptionValue() |
| 0.14 | 0.27 | 1 | hpcMonteCarlo::random() |
| 0.00 | 0.00 | 2 | hpcBlackScholes() |
Although other calls are made by the application, they consume no meaningful time. As you can see from Table 1, the 200,000,000 calls to the getRandom() function account for 62.7 percent of the total run time, and the single call to simulateEuropeanOptionValue() accounts for a further 37.18 percent.
Based on these findings, these are the functions where the most effort should be expended in porting the code to the Cell/B.E. environment and optimizing it.
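One way to see why both functions matter is to apply Amdahl's law to the profile. The short Python sketch below is an illustration added here, not part of the original analysis: the function name is hypothetical, the fractions come from the percentage-of-time column of Table 1, and the 10x factor is an arbitrary assumption.

```python
def overall_speedup(fraction, factor):
    """Amdahl's law: whole-program speedup when `fraction` of the run time becomes `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Fractions of total run time, taken from Table 1.
GET_RANDOM = 0.6270
SIMULATE = 0.3718

# Upper bound if only getRandom() were made infinitely fast.
random_only_limit = overall_speedup(GET_RANDOM, float("inf"))

# Hypothetical 10x gain applied to both functions together.
both_at_10x = overall_speedup(GET_RANDOM + SIMULATE, 10.0)
```

Under these assumptions, speeding up getRandom() alone cannot yield more than roughly a 2.7x overall gain, so simulateEuropeanOptionValue() has to be ported as well to get a larger improvement.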
The software development kit (SDK) for the Cell/B.E. platform provides a random-number generator that is optimized to exploit the SPU cores. Although the code in this article uses the Mersenne-Twister method of generating random numbers (more on this in Part 4 of this series), and the SDK uses a different algorithm, we determined that the fastest way to get an initial comparison between the Cell/B.E. system and other systems was simply to substitute the SDK random-number generator for the one provided by the customer code.
Our porting strategy for the Cell/B.E. technology was thus to use this SDK random-number generator and to spread the 200,000,000 simulation runs across the SPUs. Because the SPUs are 128-bit SIMD processing units, each one can perform four single-precision (32-bit) or two double-precision (64-bit) simulations at a time. So by spreading the random-number generation across either 8 or 16 SPUs (using one or both of the Cell/B.E. chips on the QS20 blade server) and initializing each SPU thread separately with a distinct random-number seed, we can generate a maximum of 64 single-precision random numbers at once on a blade (16 SPUs times four values each). After the random numbers have been generated and the simulations performed, the results can be aggregated.
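The arithmetic behind the figure of 64 can be checked with a small Python sketch. The helper names are hypothetical, and seeding each SPU with consecutive integers is only a placeholder assumption: the text says that the seeds are distinct, not how they are chosen.

```python
SIMD_WIDTH_BITS = 128  # width of one SPU register, as stated in the text


def simd_lanes(element_bits):
    """Number of values of the given width that fit in one 128-bit register."""
    return SIMD_WIDTH_BITS // element_bits


def values_per_step(n_spus, element_bits):
    """Values produced per SIMD step across all SPUs, assuming one register per SPU."""
    return n_spus * simd_lanes(element_bits)


def split_work(total_runs, n_spus, base_seed=1):
    """Give each SPU a distinct seed and a near-equal share of the simulation runs."""
    share, remainder = divmod(total_runs, n_spus)
    return [
        {"spu": i, "seed": base_seed + i, "runs": share + (1 if i < remainder else 0)}
        for i in range(n_spus)
    ]
```

Here `values_per_step(16, 32)` gives 64, matching the maximum quoted above, and `split_work(200_000_000, 16)` assigns 12,500,000 runs to each SPU. A production generator would need more care than consecutive seeds to keep the per-SPU streams statistically independent.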
The initialization and aggregation are performed by the PPU of the Cell/B.E. chip. To do this, we needed to change the simulateEuropeanOptionValue() simulation code. We also converted the code from C++ (as supplied) to C, because the alpha version of the XLC compiler that we were working with at the time did not have C++ support. C++ support in the IBM XLC compiler has since been released, so this step is no longer necessary.
However, it is worth noting that C++ code, because of its object-oriented nature, is frequently inefficient, and major improvements in performance may be realized by using C instead. At the time we performed this activity, the GNU gcc compiler did provide C++ support but, as you can see, it did not generate code that performed as well as that generated by the XLC compiler. Again, this situation has improved since then.
In addition to changing from C++ to C, we also changed the formula to use the SDK built-in random-number generation functions and to maximize parallelism. The flow of the resultant code is as follows:
- Process the input parameters to determine the number of simulations required and the number of SPUs to use to run them.
- Extract the required Monte Carlo initialization code.
- Add Cell/B.E.-specific functionality to spread the calculation over several SPUs.
- Fire up multiple SPU threads.
- Set up an initialization context for each thread.
- Run the simulations on the SPUs.
- Get back the results.
- Aggregate the results.
- Report back to the user.
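The flow above can be sketched on an ordinary host. In this Python sketch a thread pool stands in for the SPU threads and the calling thread plays the role of the PPU; none of it uses the Cell/B.E. SDK. The function names, the context dictionary, the consecutive seeds, and the single-step pricing model are assumptions made for illustration, and the quantile is left out because it cannot be combined from simple per-worker totals.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def run_worker(context):
    """Simulate one worker's share of the runs and return partial aggregates."""
    rng = random.Random(context["seed"])  # each worker owns a separately seeded generator
    drift = (context["rate"] - 0.5 * context["vol"] ** 2) * context["maturity"]
    diffusion = context["vol"] * math.sqrt(context["maturity"])
    discount = math.exp(-context["rate"] * context["maturity"])
    total, low, high = 0.0, math.inf, -math.inf
    for _ in range(context["runs"]):
        terminal = context["spot"] * math.exp(drift + diffusion * rng.gauss(0.0, 1.0))
        payoff = discount * max(terminal - context["strike"], 0.0)
        total += payoff
        low = min(low, payoff)
        high = max(high, payoff)
    return {"runs": context["runs"], "sum": total, "min": low, "max": high}


def price_option(total_runs, n_workers, model, base_seed=1):
    """Split the runs, start one thread per worker, then aggregate the partial results."""
    share, remainder = divmod(total_runs, n_workers)
    # One initialization context per worker: model parameters, seed, and run count.
    contexts = [
        dict(model, seed=base_seed + i, runs=share + (1 if i < remainder else 0))
        for i in range(n_workers)
    ]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(run_worker, contexts))
    runs = sum(p["runs"] for p in partials)
    return {
        "runs": runs,
        "mean": sum(p["sum"] for p in partials) / runs,
        "min": min(p["min"] for p in partials),
        "max": max(p["max"] for p in partials),
    }
```

For example, `price_option(40_000, 8, {"spot": 100.0, "strike": 100.0, "rate": 0.05, "vol": 0.2, "maturity": 1.0})` runs eight workers and returns the combined statistics. Python threads do not run this kind of arithmetic in parallel, so the sketch shows only the structure of the flow, not the speedup.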
To provide useful performance data, we also changed the timing routines to use the gettimeofday() function, which reports elapsed real time, to avoid confusion about what is included in the timing report (CPU time, children's CPU time, and so forth).
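The difference between elapsed real time and CPU time is easy to demonstrate. The Python sketch below is only an analogy for the C timing routines: `time.time()` plays the role of gettimeofday() as a wall-clock reading, `time.process_time()` reports the CPU time of the current process, and the `timed` helper is a hypothetical name.

```python
import time


def timed(func, *args):
    """Run func(*args) and return its result, elapsed wall-clock seconds, and CPU seconds."""
    wall_start = time.time()         # wall-clock reading, seconds since the epoch
    cpu_start = time.process_time()  # user plus system CPU time of this process only
    result = func(*args)
    wall = time.time() - wall_start
    cpu = time.process_time() - cpu_start
    return result, wall, cpu
```

Timing a call such as `timed(time.sleep, 0.2)` shows the two clocks diverging: the wall-clock figure covers the whole wait, while the CPU figure stays close to zero because the process is idle. Work done elsewhere, such as in helper processes, is likewise visible only in the wall-clock figure.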
Many other individuals contributed (both knowingly and unknowingly) to this piece of work. The authors wish to acknowledge their kind contributions. Without their assistance, this paper would never have been written.
So why should you read the original whitepaper? The original whitepaper combines the contents of this entire series -- everything's available now. The paper also provides a tidy intro to the Cell/B.E. architecture, and it explains why the processor is important, especially for compute-intensive financial market applications.
Learn
- Use an RSS feed to request notification for the upcoming articles in this series. (Find out more about RSS feeds of developerWorks content.)
- "Porting Financial Markets Applications to the Cell Broadband Engine Architecture" (PDF; alphaWorks, June 2007), written by John Easton, Ingo Meents, Olaf Stephan, Horst Zisgen, and Sei Kato, is the whitepaper this series was taken from.
- "Introduction to the Cell Multiprocessor" (IBM Journal of Research and Development, 2005) provides an introductory overview of the Cell/B.E. multiprocessor's history, the program objectives and challenges, the design concept, the architecture and programming models, and the implementation.
- "Porting practices: Compute-intensive applications" (developerWorks, June 2007) can help when you want to bring a compute-intensive application to the Cell/B.E. architecture.
- "Tech tips: SPU vector intrinsics at your fingertips" (developerWorks, May 2007) is a handy list to keep you on the right side of common Cell/B.E. SPU vector intrinsics (and was taken from a fuller article, "Programming high-performance applications on the Cell BE processor, Part 5").
- "Cell Broadband Engine Architecture and its first implementation" (developerWorks, November 2005) provides an up-close look at the performance figures and characteristics of the first implementation.
- The QuantLib project is a free/open-source library written in C++ with a clean object model for modeling, trading, and risk management in real life -- it is then exported to different languages such as C#, Objective Caml, Java, Perl, Python, GNU R, Ruby, and Scheme.
- The Mersenne-Twister is a very fast pseudo-random number-generating algorithm that uses memory quite efficiently and has a far longer period and far higher order of equidistribution than any other implemented generator.
- "Implementation of a Mixed-Precision in Solving Systems of Linear Equations on the CELL Processor" describes in detail the implementation of code to solve linear systems of equations using Gaussian elimination in single precision with iterative refinement of the solution to full double-precision accuracy.
- The Software Development Kit 2.1 Installation Guide Version 2.1 (PDF) will walk you through installation and configuration and many of the basics you need to know to get started with development. Two companion pieces, "Cell/B.E. SDK 2.1: Setting up Fedora Core 6" and "Cell/B.E. SDK 2.1: Understanding the terminology" (developerWorks, April 2007), can help get the requisite FC6 up and running and provide a quick reference to Cell/B.E. terminology.
- To learn more on Cell/B.E. programming, try the developerWorks series:
  - "Programming high-performance applications on the Cell/B.E. processor"
  - "PS3 fab to lab"
  - "The little broadband engine that could"
Get products and technologies
- The OpenMP API, a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications, supports multi-platform shared-memory parallel programming in C/C++ and Fortran on all architectures, including Unix and Windows NT platforms.
- Here is the centerpiece of Cell/B.E. development, the Cell/B.E. SDK release, version 2.1.
- We mentioned the IBM XLC compiler for porting efforts -- it is optimized for the Cell/B.E. processor.
- The developerWorks Cell Broadband Engine Resource Center is your clearinghouse for Cell/B.E.-related resources, downloads, and news.
Discuss
- Participate in the discussion forum.
- The Cell Broadband Engine Architecture forum is the place to get your technical questions about the processor answered. (Juicy problems and answers from the forums are rounded up periodically and highlighted in the blog series, "Forum watch.")
- The Power Architecture blog provides news, downloads, instructional resources, and event notifications for Cell/B.E. and other Power Architecture-related technologies and is the home of two blog series -- "Forum watch" (Q&A roundup) and the "FixIt" technology updates.
- This contact page will enable you to discuss customized Cell/B.E. processor solutions with an IBM rep.
John Easton has worked for IBM for 18 years in a variety of UNIX technical roles. He worked in Distributed Filesystems development in Austin during the development of the RS/6000 and holds several patents pertaining to security and distributed systems. From 1990 to 2002, he focused on high availability and clustering, becoming the worldwide technical support leader for these areas and part of the Poughkeepsie lab team responsible for architecting and developing the HACMP and HAGEO products. He designed carrier-grade Linux solutions for several major telecommunications companies and represented IBM to the Service Availability Forum. Since 2002, he has been part of IBM's Grid Computing organization and the senior grid architect for EMEA. He is responsible for designing and implementing grid solutions for major companies across Europe. He brings expertise from his previous role, designing mission-critical grid solutions and influencing IBM product strategy in these areas.
Ingo Meents joined IBM nine years ago and works currently as an IT Architect in IBM Global Engineering Solutions (GES). His current focus is to provide IBM customers with knowledge of the latest Cell/B.E. software technology by consulting, educating, briefing, and creating solutions for this platform. Before his work on the Cell/B.E. platform, he was lead architect for a modeling, simulation, and production planning solution used by the IBM 300mm semiconductor line in Fishkill. Starting as a research student at IBM, Ingo Meents received his doctor's degree from the University of Clausthal in 2001.
Olaf Stephan joined IBM in 1998 and works currently as an IT Specialist in IBM Global Engineering Solutions (GES). His focus is to provide IBM's customers with knowledge of the latest Cell/B.E. software technology by consulting, educating, briefing, and developing for this platform. Prior to his work on the Cell/B.E. platform, he worked in the areas of data management, data warehousing, business intelligence, and data integration. Olaf holds a Master's degree in Electrical Engineering, specializing in Communications Technology, from the University of Applied Sciences, Koblenz, Germany.
Horst Zisgen has more than 10 years of experience in the application of simulation methods and the development of mathematical models in different areas. In IBM's Global Engineering Solutions (GES) division, he currently leads the development team for a simulation and planning solution that is used by IBM's 300mm manufacturing site in Fishkill as well as by external customers. Furthermore, he is the European subject matter expert for the GES supply chain offerings. In addition, he regularly gives university lectures in the field of simulation and mathematical modeling, and he is a member of a standardization group concerning simulation and optimization.
Sei Kato is a research staff member at IBM Research, Tokyo Research Laboratory. He joined IBM in 2002 after receiving his PhD in Mathematical Science from the University of Tokyo. Since joining IBM, he has worked on modeling and simulating the performance of Web systems. His current research covers speeding up financial calculations and large-scale traffic simulation.