Enabling applications, Part 1: Is your application ready for Cell/B.E.?

Follow these suggestions to determine whether the Cell/B.E. platform is right for your application

Kane Scarlett, Editor, Multicore acceleration, IBM
Kane Scarlett is a technology journalist/analyst with 20 years in the business, working for such publishers as National Geographic, Population Reference Bureau, Miller Freeman, and IDG, and managing, editing, and writing for such august journals as JavaWorld, LinuxWorld, and of course, developerWorks.

Summary:  Learn from the experts how to evaluate your application's appropriateness for the Cell/B.E.™ platform from the standpoints of performance and power needs, the opportunities that exist for parallelism, whether the algorithms line up nicely, and whether your application has access to a Cell/B.E.-enabled library. This article is Part 1 of a 3-part series from the IBM Redbook® Programming the Cell Broadband Engine: Examples and Best Practices. [09/10/08 update: Made various changes based on updates since the IBM Redbook was published. --Ed.]

Date:  10 Sep 2008 (Published 09 Sep 2008)
Level:  Introductory

Introduction

How do you begin to enable your application on the Cell/B.E. platform? You begin by asking yourself three questions:

  • Is the application likely to perform well on the Cell/B.E. platform?
  • Which parallel programming model should I use for this application?
  • Which framework should I use to support the programming model?

Based on the information in the IBM Redbook Programming the Cell Broadband Engine: Examples and Best Practices (see Resources), this series includes three short articles, one answering each of these questions.

This first article in the series helps you to determine whether the Cell/B.E. platform is right for your application.

Making some decisions

The decision tree in Figure 1 provides an overview of whether you should build your application to leverage the speed and power of a Cell/B.E. processor.


Figure 1. Oh, decision tree!

Achieving higher performance per watt

One of the driving forces behind enabling applications on the Cell/B.E. platform is the desire (or need) for a higher level of performance per watt. The design choices behind the Cell/B.E. platform make it more than twice as power efficient (as expressed in peak gflops per watt) as conventional, general-purpose processors.


Understanding where parallelism can rule

The Cell/B.E. platform offers parallelism at four levels:

  • Across multiple System x™ servers in a hybrid environment. This level is expressed using either the Message Passing Interface (MPI), a language-independent communication protocol used to program parallel computers at the cluster level, or some form of grid computing middleware.
  • Across multiple Cell/B.E. processors or servers. This level can use MPI communication between the Cell/B.E. servers in the case of a homogeneous cluster of standalone Cell/B.E. servers, or this level can use ALF or DaCS for hybrid clusters. (See Resources for more about ALF and DaCS.)
  • Across multiple SPEs inside the Cell/B.E. processor or server using libspe2, ALF, DaCS, or a single-source compiler.
  • At the word level with SIMD instructions on each SPE using SIMD intrinsics or the auto-SIMDization capabilities of the compilers.

The more parallel-processing opportunities the application can leverage, the better.
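
To make the two innermost levels concrete, here is a minimal sketch in portable C of how one computation (a dot product) can be decomposed first into per-SPE chunks and then into four-wide groups that a SIMD unit could handle in one instruction. It only illustrates the decomposition: the eight "workers" run one after another in an ordinary loop, and the names partitioned_dot, chunk_dot, NUM_WORKERS, and SIMD_WIDTH are hypothetical, not part of libspe2, ALF, or DaCS.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_WORKERS 8   /* stand-in for the 8 SPEs; a plain loop here, no real threads */
#define SIMD_WIDTH  4   /* four 32-bit floats fit in one 128-bit SIMD register */

/* Word-level parallelism: handle SIMD_WIDTH elements per iteration.
 * SIMD intrinsics or an auto-SIMDizing compiler could map the four
 * lanes onto a single vector instruction. */
static float chunk_dot(const float *a, const float *b, size_t n)
{
    float lane[SIMD_WIDTH] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;
    for (; i + SIMD_WIDTH <= n; i += SIMD_WIDTH)
        for (size_t l = 0; l < SIMD_WIDTH; l++)
            lane[l] += a[i + l] * b[i + l];

    float sum = lane[0] + lane[1] + lane[2] + lane[3];
    for (; i < n; i++)              /* scalar tail for leftover elements */
        sum += a[i] * b[i];
    return sum;
}

/* SPE-level parallelism: split the input into NUM_WORKERS contiguous
 * chunks, compute one partial result per chunk, then combine them. */
float partitioned_dot(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);
    float partial[NUM_WORKERS];
    size_t base = n / NUM_WORKERS, extra = n % NUM_WORKERS, start = 0;

    for (size_t w = 0; w < NUM_WORKERS; w++) {
        size_t len = base + (w < extra ? 1 : 0);   /* spread the remainder */
        partial[w] = chunk_dot(a + start, b + start, len);
        start += len;
    }

    float total = 0.0f;
    for (size_t w = 0; w < NUM_WORKERS; w++)
        total += partial[w];
    return total;
}
```

Because each chunk depends only on its own slice of the input, the outer loop is the part that a framework could hand to separate SPEs, and the inner four-lane loop is the part that SIMD instructions could absorb.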


Matching algorithms

You want to look for a match between the main computational kernels of the application and the Cell/B.E. strengths as listed in Table 1.


Table 1. Important Cell/B.E. features as seen from a programmer's perspective
Good | Not so good
Large register file | Not applicable
DMA (memory latency hiding) | DMA latency
EIB bandwidth | Not applicable
Memory performance | Memory size
SIMD | Scalar performance (scalar on vector)
Local Store (latency/bandwidth) | Local Store (limited size)
8 SPEs per processor (high level of achievable parallelism) | PPE performance
NUMA (good scaling) | SMP scaling
Not applicable | Branching
Single or double-precision floating point | Not applicable
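
Table 1 lists branching in the "Not so good" column. One common way to work around that weakness is to replace a data-dependent branch with a compare that produces a bit mask, followed by a select. The sketch below shows the idea in portable C for clamping values to an upper limit; the function name clamp_max_branch_free is hypothetical, and it assumes two's-complement integers. On an SPE the same pattern would typically be written with the SIMD compare and select intrinsics rather than scalar code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp every element of v to at most `limit` without an if statement
 * in the loop body. The comparison yields 0 or 1; negating it gives a
 * mask of all zeros or all ones (two's complement assumed). */
void clamp_max_branch_free(int32_t *v, size_t n, int32_t limit)
{
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int32_t mask = -(int32_t)(v[i] > limit);   /* all ones if v[i] > limit */
        /* select: keep v[i] where mask is 0, take limit where mask is all ones */
        v[i] = (v[i] & ~mask) | (limit & mask);
    }
}
```

The loop body is a straight line of arithmetic, which is also the form that is easiest to spread across SIMD lanes.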

Because the computational characteristics and data movement patterns of all applications can be characterized by a composition of 13 dwarfs (as coined for 13 different kernels by a study from David Patterson and others), it is important to know which kernels construct a given application. This is usually fairly easy to determine: the chosen kernel is usually one that is suited for the numerical methods used in the application. Table 2 provides a description of the 13 dwarfs, an example (in the form of application or benchmark), the performance bottleneck common to the kernel in question (if known), and the affinity of each to leveraging the Cell Broadband Engine™ Architecture.


Table 2. The 13 dwarfs computational kernels
Dwarf name | Description | Example, app, or benchmark | Performance bottleneck | Cell/B.E. affinity (1=poor, 5=good)
Dense matrices | BLAS, matrix-matrix operations | HPCC:HPL, ScaLAPACK, NAS:LU | CPU limited | 5
Sparse matrices | Matrix-vector operations with sparse matrices | SuperLU, SpMV, NAS:CG | CPU limited 50%, bandwidth limited 50% | 4
Spectral methods | FFT transforms | HPCC:FFT, NAS:FT, FFTW | Memory latency limited | 5
N-body methods | Interactions between particles, external, near and far | NAMD, GROMACS | CPU limited | 4-5
Structured grids | Regular grids, can be automatically refined | WRF, Cactus, NAS:MG | Memory bandwidth limited | 5
Unstructured grids | Irregular grids, finite elements and nodes | ABAQUS, FIDAP (Fluent) | Memory latency limited | 3
Map-reduce | Independent data sets, simple reduction at the end | Monte-Carlo, NAS:EP, Ray tracing | Unknown | 5
Combinatorial logic | Logical functions on large data sets, encryption | AES, DES | Memory bandwidth limited for CRC, CPU limited for cryptography | 4
Graph traversal | Decision tree, searching | XML parsing, Quicksort | Memory latency limited | 3
Dynamic programming | Hidden Markov models, sequence alignment | BLAST | Memory latency limited | 4
Back-track and Branch+Bound | Constraint optimization | Simplex algorithm | Unknown | ?
Graphical models | Hidden Markov models, Bayesian networks | HMMER, bioinformatics, genomics | Unknown | 5
Finite state machine | XML transformation, Huffman decoding | SPECInt:gcc | Unknown | ?
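
One hypothetical way to put the affinity column of Table 2 to work is to weight each kernel's score by the share of run time your application spends in it. The sketch below is only an illustration of that idea, not a method from the IBM Redbook: the function weighted_affinity and the run-time fractions passed to it are assumptions, the "?" entries are stored as 0 (unknown), and the "4-5" entry for N-body methods is stored as its lower bound.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_DWARFS = 13 };

/* Affinity column of Table 2, in table order. 0 marks the "?" entries;
 * N-body methods is listed as 4-5 and the lower bound is used here. */
static const int affinity[NUM_DWARFS] = {
    5, /* dense matrices   */ 4, /* sparse matrices     */
    5, /* spectral methods */ 4, /* N-body methods      */
    5, /* structured grids */ 3, /* unstructured grids  */
    5, /* map-reduce       */ 4, /* combinatorial logic */
    3, /* graph traversal  */ 4, /* dynamic programming */
    0, /* back-track and branch+bound */
    5, /* graphical models */ 0  /* finite state machine */
};

/* weights[i] is the fraction of run time spent in dwarf i (hypothetical
 * profile data supplied by the caller). Returns the run-time-weighted
 * affinity, or -1.0 when a dwarf in use has no known score or when no
 * weight is positive. */
double weighted_affinity(const double weights[NUM_DWARFS])
{
    assert(weights != NULL);
    double sum = 0.0, total = 0.0;
    for (size_t i = 0; i < NUM_DWARFS; i++) {
        if (weights[i] <= 0.0)
            continue;                 /* dwarf not used by this application */
        if (affinity[i] == 0)
            return -1.0;              /* Table 2 gives no score for it */
        sum += weights[i] * affinity[i];
        total += weights[i];
    }
    return total > 0.0 ? sum / total : -1.0;
}
```

For instance, a profile split evenly between dense matrices and unstructured grids averages the scores 5 and 3. A result like that is a rough first filter at best; it does not replace profiling the actual kernels.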

In a study of how the Cell/B.E. processor performs on four of the 13 dwarfs (dense matrix algebra, sparse matrix algebra, spectral methods, and structured grids), the IBM Redbook authors compared the Cell/B.E.-based performance of these kernels with those of a superscalar processor (the Opteron), a VLIW processor (the Itanium2), and a vector processor (the Cray X1E). The results were favorable for the Cell/B.E. processor, which is significant because these kernels are extremely common in many HPC applications.

Other results from other testing include the following:

  • Relatively successful numbers resulted for the graphical models, dynamic programming, unstructured grids, and combinatorial logic kernels.
  • The map-reduce kernel is embarrassingly parallel, and it is a perfect fit. Look for examples of this in ray tracing or Monte Carlo simulations.
  • The graph traversal dwarf is a more difficult target because it employs random memory accesses. Some new sorting algorithms, such as AA-sort, seem to exploit the Cell/B.E. architecture better.
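
The map-reduce pattern mentioned in the list above can be sketched with a small Monte Carlo estimate of pi. Each of eight independent streams (one per SPE in spirit, although this portable C version simply loops over them) counts random points that fall inside the unit quarter circle, and a single reduction at the end combines the counts. The function names, the xorshift generator, and the seeding scheme are illustrative choices, not taken from the IBM Redbook.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_STREAMS 8   /* stand-in for one independent task per SPE */

/* Small xorshift generator so that every stream owns its own state
 * (no shared global state, unlike rand()). The state must be nonzero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Uniform double in [0, 1) built from the top 24 bits. */
static double next_unit(uint32_t *state)
{
    return (xorshift32(state) >> 8) * (1.0 / 16777216.0);
}

/* Map step: one independent stream counts points inside the quarter circle. */
static uint64_t count_hits(uint32_t seed, uint32_t samples)
{
    uint32_t state = seed;
    uint64_t hits = 0;
    for (uint32_t i = 0; i < samples; i++) {
        double x = next_unit(&state);
        double y = next_unit(&state);
        if (x * x + y * y < 1.0)
            hits++;
    }
    return hits;
}

/* Reduce step: sum the per-stream counts and scale to an estimate of pi. */
double estimate_pi(uint32_t samples_per_stream)
{
    assert(samples_per_stream > 0);
    uint64_t partial[NUM_STREAMS];
    for (uint32_t s = 0; s < NUM_STREAMS; s++)
        partial[s] = count_hits(0x9E3779B9u * (s + 1u), samples_per_stream);

    uint64_t total = 0;
    for (uint32_t s = 0; s < NUM_STREAMS; s++)
        total += partial[s];
    return 4.0 * (double)total / ((double)NUM_STREAMS * samples_per_stream);
}
```

The streams never exchange data until the final sum, which is what makes this kind of workload embarrassingly parallel.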

Certain architecture features are more important to the individual kernels. The following list shows which features are important for each kernel.

  • Dense matrices: 8 SPEs per processor, SIMD, large register file for deep unrolling, fused multiply-add
  • Sparse matrices: 8 SPEs, memory latency hiding with DMA, high memory-sustainable load
  • Spectral methods: 8 SPEs, large register file, 6 cycles Local Store latency, memory latency hiding with DMA
  • Structured grids: 8 SPEs, SIMD, high memory bandwidth, memory latency hiding with DMA
  • Unstructured grids: 8 SPEs, high memory throughput
  • Map-reduce: 8 SPEs
  • Combinatorial logic: Large register file
  • Graph traversal: Memory latency hiding
  • Dynamic programming: SIMD
  • Graphical models: 8 SPEs, SIMD
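
Several entries in the list above rely on memory latency hiding with DMA. The usual technique is double buffering: while the SPE computes on one Local Store buffer, the next block of data is already being transferred into a second buffer. The sketch below shows only the control flow in portable C. The helper fake_dma_get is a hypothetical stand-in built on memcpy, so the copy here is synchronous; on real hardware the transfer would be issued asynchronously and overlap with the computation. CHUNK is an arbitrary illustrative size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 1024   /* elements per buffer; illustrative, must fit the Local Store */

/* Hypothetical stand-in for an asynchronous DMA transfer from main
 * memory into a Local Store buffer. memcpy completes immediately. */
static void fake_dma_get(float *local, const float *remote, size_t count)
{
    memcpy(local, remote, count * sizeof *local);
}

/* Sum n floats from "main memory" using two alternating buffers. */
float sum_double_buffered(const float *main_mem, size_t n)
{
    assert(main_mem != NULL || n == 0);
    static float buf[2][CHUNK];          /* the two Local Store buffers */
    size_t chunks = (n + CHUNK - 1) / CHUNK;
    float total = 0.0f;
    if (chunks == 0)
        return total;

    /* Prime the pipeline: fetch the first chunk before the loop. */
    fake_dma_get(buf[0], main_mem, n < CHUNK ? n : CHUNK);

    for (size_t k = 0; k < chunks; k++) {
        size_t cur = k & 1;
        size_t len = (k + 1 == chunks) ? n - k * CHUNK : CHUNK;

        /* Start fetching chunk k+1 into the other buffer. */
        if (k + 1 < chunks) {
            size_t off = (k + 1) * CHUNK;
            size_t next_len = (n - off < CHUNK) ? n - off : CHUNK;
            fake_dma_get(buf[cur ^ 1], main_mem + off, next_len);
        }

        /* A real SPE program would wait here until the transfer into
         * buf[cur] has completed, then compute on it. */
        for (size_t i = 0; i < len; i++)
            total += buf[cur][i];
    }
    return total;
}
```

The price of the technique is Local Store space: two buffers are needed instead of one, which is one reason the limited Local Store size appears as a constraint in Table 1.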

The algorithm match also depends on the data types being used. The current Cell/B.E. implementation has both single-precision and double-precision floating-point capabilities.

As you can see from the Affinity column in Table 2 and from the previous bullet list, the Cell/B.E. platform is a good match for many of the common computational kernels. This is the result of design decisions that address the main bottlenecks, memory latency and throughput, and that provide very high computational density. Each of the eight SPEs per processor has a very large register file and an extremely low Local Store latency (6 cycles, compared with 15 for the current crop of general-purpose processors).


Getting ready to go

As you may have noticed, the decision tree (Figure 1) doesn't yet address the ability to call a Cell/B.E.-enabled library or whether the application (or a portion of the application) can be rewritten. According to the IBM Redbook, "The Cell BE may be easy on the electricity bill but can be hard on the programmer. Enabling an application on the Cell BE may result in very substantial algorithmic and coding efforts. But the results are usually worth the efforts."

Some other gems from the IBM Redbook about this topic include the following:

  • As for parallelization, the effort might have already been made using OpenMP at the process level. If this is the case, using the XLC single-source compiler might be the only viable alternative. It offers the code portability that could be a key requirement for some customers. But currently these compilers don't match the level of performance you can get from native SPE programming.

  • For new developments, some development environments can offer a higher level of abstraction. And the portability of code is maintained among the Cell/B.E., GPU, and general multicore processors. But then your application is tied to the development environment.

  • A new standardized language for writing applications to run on massively multicore systems might emerge (such as X10 or Chapel). But adopting new languages is a slow process, and it doesn't address the fate of the existing C/C++ and FORTRAN code.

  • A standard API for the host-accelerator model might be a more viable option (think of ALF). APIs might simply have a faster adoption rate than languages, as MPI did in the 1990s.

Planning tips

Chances are that if you're reading this, your answer to these concerns is that you can't wait—you need to program now. If that's the case, here are some planning considerations and potential problem workarounds.

  • Source code changes: There are portability concerns and potential limits to the scope of code changes you should make. The Cell/B.E. APIs are written in standard C/C++ and FORTRAN, and approaches such as host-accelerator can limit the number of source code changes.

  • Operating systems: Windows® applications can be a problem because Cell/B.E. runs only on Linux. If you're stuck with a Windows application, you could use IBM DAV to offload the computational part (and only this part) to the Cell/B.E. processor.

  • Languages: C/C++ and FORTRAN are fully supported. Ada has some support, but other languages aren't supported. You might have to rewrite the compute-intensive sections in C and use some form of offloading for Java™ or VBA applications running on Windows.

  • Libraries: Although the supported libraries list grows daily, you might find that some libraries are not supported. Some ISVs offer some library support. The best workaround here is to use the workload libraries provided in the IBM SDK.

Conclusion

As always, the decision to move to a Cell/B.E. system depends on where you are and where you are trying to go. You don't have to wait for the other articles in this series to forge ahead in your decision making—just go directly to the IBM Redbook source material.

Acknowledgments

I'd like to thank Chris Almond, Abraham Arevalo, Ricardo M. Matinata, Maharaja (Raj) Pandian, Eitan Peri, Kurtis Ruby, and Francois Thomas for their marvelous work on the IBM Redbook Programming the Cell Broadband Engine: Examples and Best Practices from which this article is derived.


Resources

Learn

Get products and technologies

Discuss
