Porting workshop, Part 3: Initial performance results

Run performance tests on modified code in Part 3 of this series

John Easton (JKJ@uk.ibm.com), Senior Software Engineer, IBM Global Services
John Easton has worked for IBM for 18 years in a variety of UNIX technical roles. He worked in Distributed Filesystems development in Austin during the development of the RS/6000 and holds several patents pertaining to security and distributed systems. From 1990 to 2002, he focused on high availability and clustering, becoming the worldwide technical support leader for these areas and part of the Poughkeepsie lab team responsible for architecting and developing the HACMP and HAGEO products. He designed carrier-grade Linux solutions for several major telecommunications companies and represented IBM to the Service Availability Forum. Since 2002, he has been part of IBM's Grid Computing organization and the senior grid architect for EMEA. He is responsible for designing and implementing grid solutions for major companies across Europe. He brings expertise from his previous role, designing mission-critical grid solutions and influencing IBM product strategy in these areas.
Ingo Meents (MEENTS@de.ibm.com), Architect for Cell Solutions, Advanced Planning, Simulation, and Optimization, IBM
Ingo Meents joined IBM nine years ago and currently works as an IT Architect in IBM Global Engineering Solutions (GES). His current focus is to provide IBM customers with knowledge of the latest Cell/B.E. software technology by consulting, educating, briefing, and creating solutions for this platform. Before his work on the Cell/B.E. platform, he was lead architect for a modeling, simulation, and production planning solution used by the IBM 300mm semiconductor line in Fishkill. Starting as a research student at IBM, Ingo Meents received his doctorate from the University of Clausthal in 2001.
Olaf Stephan, Server Specialist, DB2, Warehousing BI Solutions, IBM
Olaf Stephan joined IBM in 1998 and currently works as an IT Specialist in IBM Global Engineering Solutions (GES). His focus is to provide IBM's customers with knowledge of the latest Cell/B.E. software technology by consulting, educating, briefing, and developing for this platform. Prior to his work on the Cell/B.E. platform, he worked in the areas of data management, data warehousing, business intelligence, and data integration. Olaf holds a master's degree in Electrical Engineering, specializing in Communications Technology, from the University of Applied Sciences, Koblenz, Germany.
Horst Zisgen, Program Manager Simulation/Operations Research, IBM
Horst has more than 10 years of experience in the application of simulation methods and the development of mathematical models in different areas. In IBM's Global Engineering Solutions (GES) division, he currently leads the development team of a simulation and planning solution that is used by IBM's 300mm manufacturing site in Fishkill and by external customers. Furthermore, he is the European subject matter expert for the GES supply chain offerings. In addition, he regularly gives university lectures in the field of simulation and mathematical modeling, and he is a member of a standardization group concerning simulation and optimization.
Sei Kato, Research Staff Member, IBM
Sei Kato is a research staff member at IBM Research, Tokyo Research Laboratory. He joined IBM in 2002 after receiving his PhD in Mathematical Science from the University of Tokyo. Since joining IBM, he has worked on modeling and simulating the performance of Web systems. His current studies are the speed-up of financial calculations and large-scale traffic simulation.

Summary:  The seven quick-read parts of this series, "Porting workshop," take you on a real-world trip from strategy and planning through workload execution, performance tweaking, and optimization to a solid conclusion: how to most effectively port compute-intensive applications to the Cell Broadband Engine™ platform. In Part 3, the authors run and review performance tests and data on the modified code.

Date:  04 Sep 2007
Level:  Introductory

In this seven-part, quick-read workshop series, taken from the real-world case study whitepaper, "Porting Financial Markets Applications to the Cell Broadband Engine™ Architecture" (see Resources), you can spend minimal time reading each installment and complete the series with a strong basic knowledge of the requirements for effectively porting a compute-intensive application (in this case, a financial market application) to the Cell/B.E. processor.

Editor's note: The performance results in this series were obtained using Versions 1 and 2.1 of the Cell Broadband Engine Software Developer Kit (SDK). The current version of the SDK, the IBM Software Development Kit for Multicore Acceleration, Version 3.0, has recently become available and offers many enhancements in functionality, ease of use, and performance over the earlier versions. While the results documented in this article are correct for the earlier versions of the SDK, different results will be obtained with SDK 3.0. Watch for updates to the articles in this series that will describe the latest performance improvements obtained using SDK 3.0.

Workshop series

Part 1: Porting strategies (developerWorks, August 2007)

Part 2: Analysis of the original code (developerWorks, August 2007)

Part 3: Initial performance results (developerWorks, September 2007)

Part 4: Mersenne-Twister (developerWorks, September 2007)

Part 5: Mixed-precision workloads (developerWorks, September 2007)

Part 6: Tying it all together (developerWorks, October 2007)

Part 7: Getting the most performance (developerWorks, November 2007)

A description of the application

The example application to which the modifications apply is a piece of code used to price a European Option, chosen to highlight the benefits of the Cell/B.E. blade. A European Option is a simple financial contract with strict terms and properties that gives the buyer the right to trade a given asset at a specific price on a specific date -- it is generally an option that can only be exercised at the end of its life. In contrast, an American Option may be exercised at any time between its purchase date and the date at which the contract expires.

Because a European Option can be exercised only on a fixed date, pricing it is a simpler calculation: the time variability of the American Option has been removed.

A number of different models can be used to price a European Option, depending on the type of asset that underlies it. For instance, an option based on currency is priced using a slightly different model than an option based on futures. In the case described in this series, the calculation is based on a simple Monte Carlo simulation technique. A large number (200,000,000 in this case) of uniform pseudo-random numbers is generated. A Box-Muller transform converts these numbers into normally distributed values, which give rise to the log-normal distribution used by the model. Using these random numbers, the financial model is executed repeatedly to simulate a random walk. The final stage of the analysis is the calculation of the relevant statistics: the minimum, maximum, and average, and the 95 percent quantile for losses.
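The pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the case study: the single-step geometric Brownian motion model, the call payoff, every parameter value (spot, strike, rate, vol, maturity), and all function names are assumptions chosen for the example. The sample count is also far smaller than the 200,000,000 evaluations used in the tests, and the 95 percent quantile is taken over discounted payoffs rather than over losses.

```python
import math
import random


def box_muller(u1, u2):
    """Turn two uniform samples (u1 in (0, 1], u2 in [0, 1)) into two standard normal samples."""
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


def simulate_payoffs(n_pairs, spot, strike, rate, vol, maturity, seed=0):
    """Return discounted call payoffs; each uniform pair yields two payoffs (hypothetical model)."""
    rng = random.Random(seed)
    drift = (rate - 0.5 * vol * vol) * maturity
    diffusion = vol * math.sqrt(maturity)
    discount = math.exp(-rate * maturity)
    payoffs = []
    for _ in range(n_pairs):
        u1 = 1.0 - rng.random()  # shift to (0, 1] so that log(u1) is defined
        u2 = rng.random()
        for z in box_muller(u1, u2):
            # Exponentiating the normal sample gives a log-normally distributed price
            terminal_price = spot * math.exp(drift + diffusion * z)
            payoffs.append(discount * max(terminal_price - strike, 0.0))
    return payoffs


def summarize(values):
    """Compute the minimum, maximum, average, and 95 percent quantile of the samples."""
    ordered = sorted(values)
    index_95 = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {
        "min": ordered[0],
        "max": ordered[-1],
        "mean": sum(ordered) / len(ordered),
        "q95": ordered[index_95],
    }


payoffs = simulate_payoffs(50_000, spot=100.0, strike=100.0, rate=0.05, vol=0.2, maturity=1.0)
stats = summarize(payoffs)
print(stats)
```

In a sketch like this, the mean of the payoffs approximates the option price, and the statistical error of that estimate shrinks with the square root of the sample count, which is why a very large number of evaluations is attractive.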

TOPIC: Initial performance results

To run the performance tests, the following parameters were used on the modified code:

  • Compiler used: spuxlc, ppuxlc.
  • Compiler optimization setting: -O3 -qstrict.
  • Random-number generation method: sdk.
  • Precision: single.
  • Number of evaluations: 200,000,000.

Note that the code was developed using the IBM® Full-System Simulator for the Cell Broadband Engine Processor, which can be downloaded as part of the SDK (see Resources). The simulator provides a complete Cell/B.E. development environment, including a Linux kernel for Cell/B.E. blades, Linux support libraries, tool chains, a system simulator, source code for libraries, and samples.

Although the simulator provides a cycle-accurate simulation, we ran the benchmark code on an early prototype version of the Cell/B.E.-based blade, the IBM BladeCenter QS20, to get a better feel for the real performance of the system. The major difference between this prototype and the version of the blade that is now generally available is that the prototype ran at 2.4 GHz instead of 3.2 GHz.
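The 3.2 GHz run-times reported alongside the measurements in this part are estimates. They appear consistent with scaling each measured time by the clock ratio 2.4 / 3.2, which assumes that run-time is inversely proportional to clock frequency. The article does not state the method, so treat the following as an assumption rather than the authors' procedure. The function name scale_to_clock is hypothetical, and the sample times are taken from Table 2.

```python
def scale_to_clock(measured_seconds, measured_ghz=2.4, target_ghz=3.2):
    """Estimate run-time at another clock rate, assuming time is inversely proportional to frequency."""
    return measured_seconds * measured_ghz / target_ghz


# Measured single-precision run-times in seconds on the 2.4 GHz prototype, keyed by SPU count
measured = {1: 65.7, 5: 13.18, 16: 4.1}
estimated = {spus: scale_to_clock(seconds) for spus, seconds in measured.items()}

for spus, seconds in estimated.items():
    print(f"{spus:2d} SPUs: about {seconds:.3f} s at 3.2 GHz")
```

A purely clock-based estimate ignores anything that does not scale with the core clock, such as memory latency, so it is best read as an approximation.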

Table 2 shows the initial performance results that were achieved. (Tables and figures are numbered consecutively throughout the series to match the versions in the original whitepaper.)


Table 2. Performance by number of SPUs (single precision)
Number of SPUs | Elapsed time (seconds), 2.4 GHz Cell/B.E. processor (measured) | Elapsed time (seconds), 3.2 GHz Cell/B.E. processor (estimated) | Speedup
1 | 65.7 | 49.27 | 1
2 | 32.9 | 24.6 | 1.99
3 | 21.9 | 16.42 | 3
4 | 16.4 | 12.3 | 4
5 | 13.18 | 9.88 | 4.98
6 | 10.9 | 8.17 | 6.02
7 | 9.4 | 7.05 | 6.98
8 | 8.2 | 6.15 | 8.01
9 | 7.3 | 5.4 | 9
10 | 6.6 | 4.95 | 9.95
11 | 6 | 4.5 | 10.95
12 | 5.5 | 4.12 | 11.94
13 | 5.1 | 3.8 | 12.88
14 | 4.7 | 3.52 | 13.97
15 | 4.4 | 3.3 | 14.93
16 | 4.1 | 3.07 | 16.02

As the Speedup column shows, performance scales almost linearly as more SPUs are applied to the problem: 16 SPUs complete the run about 16 times faster than one. Figure 1 illustrates this by plotting both run-time and speedup against the number of SPUs.
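The speedup values can be reproduced from the measured run-times as the one-SPU time divided by the n-SPU time. The short sketch below does this for a subset of the Table 2 measurements and also derives parallel efficiency (speedup divided by SPU count), a figure the article does not report. The function names are hypothetical.

```python
# Measured single-precision run-times in seconds at 2.4 GHz, keyed by SPU count (subset of Table 2)
MEASURED_SECONDS = {1: 65.7, 2: 32.9, 8: 8.2, 16: 4.1}


def speedup(times, spus):
    """Speedup relative to the single-SPU run."""
    return times[1] / times[spus]


def parallel_efficiency(times, spus):
    """Fraction of ideal linear scaling achieved with the given SPU count."""
    return speedup(times, spus) / spus


for spus in sorted(MEASURED_SECONDS):
    print(
        f"{spus:2d} SPUs: speedup {speedup(MEASURED_SECONDS, spus):5.2f}, "
        f"efficiency {parallel_efficiency(MEASURED_SECONDS, spus):.3f}"
    )
```

Efficiencies marginally above 1.0 in a calculation like this most likely reflect the measured times being reported to only one or two decimal places.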


Figure 1. Plot of the single precision run-time and speedup against the number of SPUs

Many organizations require double-precision arithmetic. The initial target marketplace for the Cell/B.E. technology -- home entertainment systems -- typically doesn't need this precision, so the initial implementation of the Cell/B.E. environment provides very limited double-precision support in hardware.

The SPU supports both single- and double-precision floating-point operations. Single-precision instructions are performed in four-way SIMD fashion and are fully pipelined, whereas double-precision instructions are only partially pipelined. The data formats for single- and double-precision instructions are those defined by IEEE Standard 754; however, the results calculated by single-precision instructions depart from IEEE Standard 754 because they emphasize the real-time graphics requirements that are typical of multimedia processing.

Although some believe that the Cell/B.E. platform cannot do double-precision mathematics, Table 3 shows that it can: A primarily software-driven approach is used rather than the double-precision capabilities of the hardware.


Table 3. Performance by number of SPUs (double precision)
Number of SPUs | Elapsed time (seconds), 2.4 GHz Cell/B.E. processor (measured) | Elapsed time (seconds), 3.2 GHz Cell/B.E. processor (estimated) | Speedup
1 | 157.3 | 117.9 | 1
2 | 78.6 | 58.9 | 2
3 | 52.4 | 39.3 | 3
4 | 39.3 | 29.47 | 4
5 | 31.49 | 23.61 | 4.99
6 | 26.25 | 19.68 | 5.99
7 | 22.5 | 16.8 | 6.99
8 | 19.7 | 14.7 | 7.98
9 | 17.5 | 13.12 | 8.98
10 | 15.78 | 11.8 | 9.96
11 | 14.3 | 10.7 | 11
12 | 13.1 | 9.82 | 12
13 | 12.1 | 9.1 | 13
14 | 11.3 | 8.47 | 13.92
15 | 10.5 | 7.87 | 14.98
16 | 9.9 | 7.42 | 15.89

In Table 3, run-times are roughly twice those of Table 2 (157.3 versus 65.7 seconds on one SPU, a factor of about 2.4). The main cause is the size of the SIMD vector: only two double-precision (64-bit) numbers are generated per SPU at once instead of four single-precision (32-bit) numbers, so double precision requires twice as many loop iterations. Note that as the number of SPUs increases from 1 to 16, the run-time still improves by a factor of 157.3 / 9.9 = 15.89, again equating to a nearly linear speedup. This linear relationship is clearly seen in Figure 2.
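The iteration-count argument can be made concrete. The sketch below assumes a 128-bit SIMD register, which is what four 32-bit or two 64-bit values per operation implies. The function names are hypothetical, and the one-SPU times come from Tables 2 and 3.

```python
SIMD_REGISTER_BITS = 128  # assumption implied by 4 x 32-bit or 2 x 64-bit values per operation
EVALUATIONS = 200_000_000


def lanes(element_bits):
    """Number of values of the given width that fit in one SIMD register."""
    return SIMD_REGISTER_BITS // element_bits


def vector_iterations(evaluations, element_bits):
    """Loop iterations needed when each iteration handles one full vector of values."""
    width = lanes(element_bits)
    return (evaluations + width - 1) // width  # round up to cover a partial final vector


single_iterations = vector_iterations(EVALUATIONS, 32)
double_iterations = vector_iterations(EVALUATIONS, 64)
iteration_ratio = double_iterations / single_iterations
measured_ratio = 157.3 / 65.7  # one-SPU run-times: double precision over single precision

print(single_iterations, double_iterations, iteration_ratio, round(measured_ratio, 2))
```

The measured ratio comes out somewhat larger than the factor of two that the iteration count alone predicts. The partially pipelined double-precision instructions described earlier are one plausible contributor, although the article does not break the difference down.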


Figure 2. Plot of the double precision run-time and speedup against the number of SPUs

To get a feel for the capabilities of the Cell/B.E. platform in running this application, it is worth comparing the previous figures with the figures produced from running the original customer code on the typical technologies being deployed by organizations. Table 4 shows this comparison.


Table 4. Comparison of code runtimes (in seconds) between different processor types and compilers
System | Processor / Speed (GHz) | RAM (GB) | Operating system / Compiler | Code version (original/modified) | Time SP/DP
IBM BladeCenter HS20 | Intel Xeon / 3.8 | 4 | RedHat Enterprise Linux 4 / gcc | Original | 64.89
IBM BladeCenter LS20 | AMD Opteron dual core / 2.0 | 4 | RedHat Enterprise Linux 4 / gcc | Original | 57.13
IBM System x 3650 | Intel Xeon dual-core (Woodcrest) / 3.0 | 8 | RedHat Enterprise Linux 4 / Intel ICPC | Original | 24.83
Intel X5355 | Intel quad-core (Clovertown) / 2.66 | 16 | RedHat Enterprise Linux 4 / Intel ICPC | Original | 28.59
IBM BladeCenter QS20 (pre-GA) | Cell/B.E. / 2.4 | 1 | Fedora Core 5 / gcc | Original | 312.41/-
IBM BladeCenter QS20 | Cell/B.E. / 3.2 (simulated) | 1 | Fedora Core 5 / gcc | Original | 234.3/-
IBM BladeCenter QS20 (pre-GA) | Cell/B.E. / 2.4 | 1 | Fedora Core 5 / gcc | Modified | 51.46/-
IBM BladeCenter QS20 | Cell/B.E. / 3.2 (simulated) | 1 | Fedora Core 5 / gcc | Modified | 38.59/-
IBM BladeCenter QS20 (pre-GA) | Cell/B.E. / 2.4 | 1 | Fedora Core 5 / xlC | Modified | 4.2/9.9
IBM BladeCenter QS20 | Cell/B.E. / 3.2 (simulated) | 1 | Fedora Core 5 / xlC | Modified | 3.15/7.42

As Table 4 shows, the modified code on the Cell/B.E. system runs considerably faster than the original code on the more traditional x86 processors: 4.2 seconds measured at 2.4 GHz in single precision (xlC), against 24.83 seconds for the fastest x86 system listed. The unmodified code, by contrast, takes 312.41 seconds on the same prototype. The table also illustrates the effect of the optimizations in the XLC compiler: the same modified code takes 51.46 seconds when compiled with gcc and 4.2 seconds when compiled with XLC.
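To put the Table 4 comparison into ratios, the following sketch divides selected single-precision run-times by one another. The labels and the ratio helper are hypothetical; the times are copied from Table 4. Keep in mind that the x86 figures are for the original code and the fastest Cell/B.E. figures are for the modified code, so the ratios compare more than hardware alone.

```python
# Single-precision run-times in seconds, taken from Table 4
RUNTIMES = {
    "x86_best_original": 24.83,            # IBM System x 3650, Intel ICPC
    "cell_2_4_original_gcc": 312.41,       # QS20 pre-GA, original code
    "cell_2_4_modified_gcc": 51.46,        # QS20 pre-GA, modified code, gcc
    "cell_2_4_modified_xlc": 4.2,          # QS20 pre-GA, modified code, xlC
    "cell_3_2_modified_xlc": 3.15,         # QS20 at 3.2 GHz (simulated), modified code, xlC
}


def ratio(slower, faster):
    """How many times longer the slower configuration takes than the faster one."""
    return RUNTIMES[slower] / RUNTIMES[faster]


comparisons = {
    "x86 best vs Cell 2.4 GHz xlC": ratio("x86_best_original", "cell_2_4_modified_xlc"),
    "x86 best vs Cell 3.2 GHz xlC": ratio("x86_best_original", "cell_3_2_modified_xlc"),
    "gcc vs xlC on Cell (modified)": ratio("cell_2_4_modified_gcc", "cell_2_4_modified_xlc"),
    "original vs modified on Cell (gcc)": ratio("cell_2_4_original_gcc", "cell_2_4_modified_gcc"),
}

for label, value in comparisons.items():
    print(f"{label}: {value:.2f}x")
```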

Acknowledgements and notes

Many other individuals contributed (both knowingly and unknowingly) to this piece of work. The authors wish to acknowledge their kind contributions. Without their assistance, this paper would never have been written.

So why should you read the original whitepaper? The original whitepaper combines the contents of this entire series -- everything's available now. The paper also provides a tidy intro to the Cell/B.E. architecture, and it explains why the processor is important, especially for compute-intensive financial market applications.


Resources

Get products and technologies

  • The OpenMP API, a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications, supports multi-platform shared-memory parallel programming in C/C++ and Fortran on all architectures, including Unix and Windows NT platforms.

  • Here is the centerpiece of Cell/B.E. development, the latest Cell/B.E. SDK release, version 2.1.

  • We mentioned the IBM XLC compiler for porting efforts -- it is optimized for the Cell/B.E. processor.

  • The developerWorks Cell Broadband Engine Resource Center is your clearinghouse for Cell/B.E.-related resources, downloads, and news.
