How do you begin to enable your application on the Cell/B.E. platform? You begin by asking yourself three questions:
- Is the application likely to perform well on the Cell/B.E. platform?
- Which parallel programming model should I use for this application?
- Which framework should I use to support the programming model?
Based on the information in the IBM Redbook Programming the Cell Broadband Engine: Examples and Best Practices (see Resources), this series includes three short articles, one answering each of these questions.
This first article in the series helps you to determine whether the Cell/B.E. platform is right for your application.
The decision tree in Figure 1 provides an overview of whether you should build your application to leverage the speed and power of a Cell/B.E. processor.
Figure 1. Oh, decision tree!

Achieving higher performance per watt
One of the driving forces behind enabling applications on the Cell/B.E. platform is the desire (or need) for a higher level of performance per watt. The design choices for the Cell/B.E. platform make it more than twice as power efficient (as expressed in peak GFLOPS per watt) as conventional, general-purpose processors.
Understanding where parallelism can rule
The Cell/B.E. platform offers parallelism at four levels:
- Across multiple System x™ servers in a hybrid environment. This level is expressed using either Message Passing Interface (MPI), a language-independent communication protocol used to program parallel computers at the cluster level, or some sort of grid computing middleware.
- Across multiple Cell/B.E. processors or servers. This level can use MPI communication between the Cell/B.E. servers in the case of a homogeneous cluster of standalone Cell/B.E. servers, or this level can use ALF or DaCS for hybrid clusters. (See Resources for more about ALF and DaCS.)
- Across multiple SPEs inside the Cell/B.E. processor or server using libspe2, ALF, DaCS, or a single-source compiler.
- At the word level with SIMD instructions on each SPE using SIMD intrinsics or the auto-SIMDization capabilities of the compilers.
The more parallel-processing opportunities the application can leverage, the better.
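To make the four levels concrete, here is a minimal sketch of how a flat workload might be divided from the cluster level down to the SIMD level. The `Topology` class, its default counts, and the `decompose` function are hypothetical names and values chosen for illustration; only the eight SPEs per processor and the four 32-bit lanes of a 128-bit SIMD register reflect the hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    """Hypothetical machine shape; the node and processor counts are assumptions."""
    nodes: int = 2            # cluster level (MPI or grid middleware)
    cells_per_node: int = 2   # Cell/B.E. processors per node
    spes_per_cell: int = 8    # SPEs per processor
    simd_lanes: int = 4       # 32-bit lanes in one 128-bit SIMD register


def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up, so no element is left unassigned."""
    return -(-a // b)


def decompose(n_elements: int, topo: Topology) -> dict:
    """Return the largest share of work each unit receives at every level."""
    per_node = ceil_div(n_elements, topo.nodes)
    per_cell = ceil_div(per_node, topo.cells_per_node)
    per_spe = ceil_div(per_cell, topo.spes_per_cell)
    simd_ops = ceil_div(per_spe, topo.simd_lanes)
    return {
        "per_node": per_node,
        "per_cell": per_cell,
        "per_spe": per_spe,
        "simd_ops_per_spe": simd_ops,
    }
```

With the default (assumed) topology, `decompose(1_000_000, Topology())` assigns 31,250 elements to each SPE, which is 7,813 four-wide SIMD operations per SPE. An application that cannot be split at one of these levels simply keeps a divisor of 1 there.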
You want to look for a match between the main computational kernels of the application and the Cell/B.E. strengths as listed in Table 1.
Table 1. Important Cell/B.E. features as seen from a programmer's perspective
| Good | Not so good |
|---|---|
| Large register file | Not applicable |
| DMA (memory latency hiding) | DMA latency |
| EIB bandwidth | Not applicable |
| Memory performance | Memory size |
| SIMD | Scalar performance (scalar on vector) |
| Local Store (latency/bandwidth) | Local Store (limited size) |
| 8 SPEs per processor (high level of achievable parallelism) | PPE performance |
| NUMA (good scaling) | SMP scaling |
| Not applicable | Branching |
| Single or double-precision floating point | Not applicable |
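The "Local Store (limited size)" and "DMA latency" entries usually translate into a tiling question: how much data can each SPE hold at once? The following sketch sizes tiles for a double-buffered kernel. It relies on two hardware facts (a 256 KB local store per SPE and a 16 KB maximum for a single DMA transfer); the 64 KB reserve for code and stack, the 128-byte rounding policy, and the `plan_tiles` name are assumptions for illustration.

```python
LOCAL_STORE_BYTES = 256 * 1024   # SPE local store size
MAX_DMA_BYTES = 16 * 1024        # largest single DMA transfer


def plan_tiles(total_bytes, reserved_bytes=64 * 1024, n_buffers=2):
    """Pick a tile size so that n_buffers tiles fit beside code and stack.

    reserved_bytes is an assumed allowance for program text, stack, and
    other data; measure the real figure for your own SPE program.
    """
    budget = LOCAL_STORE_BYTES - reserved_bytes
    if budget <= 0 or n_buffers <= 0:
        raise ValueError("no local store left for data buffers")
    tile = budget // n_buffers
    tile -= tile % 128               # round down to a 128-byte multiple (assumed policy)
    if tile == 0:
        raise ValueError("buffers too small to hold a 128-byte tile")
    n_tiles = -(-total_bytes // tile)              # ceiling division
    dma_per_tile = -(-tile // MAX_DMA_BYTES)       # DMA requests needed per full tile
    return {"tile_bytes": tile, "n_tiles": n_tiles, "dma_per_tile": dma_per_tile}
```

Under these assumptions, a 64 MiB array splits into 683 tiles of 96 KiB, and each full tile takes six DMA requests to fetch. If the plan produces tiles too small to amortize the DMA latency, that is a hint that the kernel sits on the "not so good" side of the table.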
Because the computational characteristics and data movement patterns of all applications can be characterized by a composition of 13 dwarfs (a term coined for 13 different kernels in a study by David Patterson and others), it is important to know which kernels make up a given application. This is usually fairly easy to determine: the kernel is usually the one suited to the numerical methods used in the application. Table 2 provides a description of each of the 13 dwarfs, an example (in the form of an application or benchmark), the performance bottleneck common to the kernel in question (if known), and the affinity of each to leveraging the Cell Broadband Engine™ Architecture.
Table 2. The 13 dwarfs (computational kernels)
| Dwarf name | Description | Example, app, or benchmark | Performance bottleneck | Cell/B.E. affinity (1=poor, 5=good) |
|---|---|---|---|---|
| Dense matrices | BLAS, matrix-matrix operations | HPCC:HPL, ScaLAPACK, NAS:LU | CPU limited | 5 |
| Sparse matrices | Matrix-vector operations with sparse matrices | SuperLU, SpMV, NAS:CG | CPU limited 50%, bandwidth limited 50% | 4 |
| Spectral methods | FFT transforms | HPCC:FFT, NAS:FT, FFTW | Memory latency limited | 5 |
| N-body methods | Interactions between particles, external, near and far | NAMD, GROMACS | CPU limited | 4-5 |
| Structured grids | Regular grids, can be automatically refined | WRF, Cactus, NAS:MG | Memory bandwidth limited | 5 |
| Unstructured grids | Irregular grids, finite elements and nodes | ABAQUS, FIDAP (Fluent) | Memory latency limited | 3 |
| Map-reduce | Independent data sets, simple reduction at the end | Monte-Carlo, NAS:EP, Ray tracing | Unknown | 5 |
| Combinatorial logic | Logical functions on large data sets, encryption | AES, DES | Memory bandwidth limited for CRC, CPU limited for cryptography | 4 |
| Graph traversal | Decision tree, searching | XML parsing, Quicksort | Memory latency limited | 3 |
| Dynamic programming | Hidden Markov models, sequence alignment | BLAST | Memory latency limited | 4 |
| Back-track and Branch+Bound | Constraint optimization | Simplex algorithm | Unknown | ? |
| Graphical models | Hidden Markov models, Bayesian networks | HMMER, bioinformatics, genomics | Unknown | 5 |
| Finite state machine | XML transformation, Huffman decoding | SPECInt:gcc | Unknown | ? |
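One way to use Table 2 is to profile your application, estimate how its run time divides among the dwarfs, and weight the affinity ratings accordingly. The sketch below does that. The ratings are copied from the table, but the weighting heuristic itself, the midpoint used for the "4-5" rating, and the `weighted_affinity` name are assumptions and do not come from the Redbook.

```python
# Affinity ratings from Table 2 (1 = poor, 5 = good); None means not rated ("?").
AFFINITY = {
    "dense matrices": 5,
    "sparse matrices": 4,
    "spectral methods": 5,
    "n-body methods": 4.5,   # the table gives 4-5; the midpoint is an assumption
    "structured grids": 5,
    "unstructured grids": 3,
    "map-reduce": 5,
    "combinatorial logic": 4,
    "graph traversal": 3,
    "dynamic programming": 4,
    "back-track and branch+bound": None,
    "graphical models": 5,
    "finite state machine": None,
}


def weighted_affinity(profile):
    """Average the table ratings, weighted by each dwarf's share of run time.

    profile maps a dwarf name to its (estimated) fraction of run time.
    Returns (score, unrated), where unrated lists dwarfs the table leaves open.
    """
    score = 0.0
    rated_share = 0.0
    unrated = []
    for name, share in profile.items():
        rating = AFFINITY[name.lower()]
        if rating is None:
            unrated.append(name)
            continue
        score += rating * share
        rated_share += share
    if rated_share == 0:
        return None, unrated
    return score / rated_share, unrated
```

For example, a profile of 70% dense matrices and 30% graph traversal scores about 4.4. Treat the number as a rough screening aid only; a single poorly matched kernel on the critical path can matter more than the average suggests.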
In a study of how the Cell/B.E. processor performs on four of the 13 dwarfs (dense matrix algebra, sparse matrix algebra, spectral methods, and structured grids), the IBM Redbook authors compared the Cell/B.E.-based performance of these kernels with that of a superscalar processor (the Opteron), a VLIW processor (the Itanium2), and a vector processor (the Cray X1E). The results were favorable for the Cell/B.E. processor, which is significant because these kernels are extremely common in many HPC applications.
Other results from other testing include the following:
- Relatively successful numbers resulted for the graphical models, dynamic programming, unstructured grids, and combinatorial logic kernels.
- The map-reduce kernel is embarrassingly parallel, and it is a perfect fit. Look for examples of this in ray tracing or Monte Carlo simulations.
- The graph traversal dwarf is a more difficult target because it employs random memory accesses. Some new sorting algorithms, such as AA-sort, seem to exploit the Cell/B.E. architecture better.
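The map-reduce pattern mentioned above is easy to see in a Monte Carlo estimate of pi: each task draws its own random samples with no communication, and a single sum at the end combines the partial counts. This is a plain-Python sketch of the pattern, not Cell/B.E. code; the function names, the seeds, and the choice of eight tasks (one per SPE) are assumptions for illustration.

```python
import random


def count_hits(task):
    """Map step: count random points that land inside the unit quarter circle."""
    seed, n_samples = task
    rng = random.Random(seed)        # private generator, so tasks share no state
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(n_workers=8, samples_per_worker=20_000):
    """Run one independent task per worker, then reduce with a single sum."""
    tasks = [(seed, samples_per_worker) for seed in range(1, n_workers + 1)]
    partial_counts = list(map(count_hits, tasks))   # map: order and timing do not matter
    total_hits = sum(partial_counts)                # reduce: one small combine step
    return 4.0 * total_hits / (n_workers * samples_per_worker)
```

Because `count_hits` needs only a seed and a sample count as input and returns a single integer, each task would fit comfortably in an SPE's local store and generate almost no memory traffic, which is why this dwarf rates a 5.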
Certain architecture features are more important to the individual kernels. The following list shows which features are important for each kernel.
- Dense matrices: 8 SPEs per processor, SIMD, large register file for deep unrolling, fused multiply-add
- Sparse matrices: 8 SPEs, memory latency hiding with DMA, high memory-sustainable load
- Spectral methods: 8 SPEs, large register file, 6 cycles Local Store latency, memory latency hiding with DMA
- Structured grids: 8 SPEs, SIMD, high memory bandwidth, memory latency hiding with DMA
- Unstructured grids: 8 SPEs, high memory throughput
- Map-reduce: 8 SPEs
- Combinatorial logic: Large register file
- Graph traversal: Memory latency hiding
- Dynamic programming: SIMD
- Graphical models: 8 SPEs, SIMD
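"Memory latency hiding with DMA" appears in several entries of this list. The usual technique is double buffering: while the SPE computes on one tile, the DMA engine fetches the next. A simple timing model shows why it helps. This sketch assumes fixed per-tile transfer and compute times and ignores writing results back; the function names are hypothetical.

```python
def serial_time(n_tiles, t_dma, t_compute):
    """Fetch a tile, compute on it, then fetch the next: nothing overlaps."""
    return n_tiles * (t_dma + t_compute)


def double_buffered_time(n_tiles, t_dma, t_compute):
    """Fetch tile i+1 while computing on tile i.

    Only the first fetch and the last compute are exposed; every step in
    between costs whichever of the two activities is slower.
    """
    if n_tiles == 0:
        return 0
    return t_dma + (n_tiles - 1) * max(t_dma, t_compute) + t_compute
```

For 100 tiles with a transfer time of 2 units and a compute time of 3 units, the model gives 500 units without overlap and 302 with double buffering. When the transfer time exceeds the compute time, the kernel stays bandwidth limited no matter how well the latency is hidden.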
The algorithm match also depends on the data types being used. The current Cell/B.E. implementation has both single-precision and double-precision floating-point capabilities.
As you can see from the Affinity column in Table 2 and from the preceding feature list, the Cell/B.E. platform is a good match for many of the common computational kernels. This is the result of design decisions that address the main bottlenecks, memory latency and throughput, while delivering a very high computational density. Each of the eight SPEs per processor has a very large register file and an extremely low local store latency (6 cycles compared with 15 for the current crop of general-purpose processors).
As you might notice, the decision tree (Figure 1) doesn't yet address the ability to call a Cell/B.E.-enabled library or whether the application (or a portion of the application) can be rewritten. According to the IBM Redbook, "The Cell BE may be easy on the electricity bill but can be hard on the programmer. Enabling an application on the Cell BE may result in very substantial algorithmic and coding efforts. But the results are usually worth the efforts."
Some other gems from the IBM Redbook about this topic include the following:
- As for parallelization, the effort might have already been made using OpenMP at the process level. If this is the case, using the XLC single-source compiler might be the only viable alternative. It offers the code portability that could be a key requirement for some customers. But currently these compilers don't match the level of performance you can get from native SPE programming.
- For new developments, some development environments can offer a higher level of abstraction. And the portability of code is maintained among the Cell/B.E., GPU, and general multicore processors. But then your application is tied to the development environment.
- A new standardized language for writing applications to run on massively multicore systems might emerge (such as X10 or Chapel). But adopting new languages is a slow process, and it doesn't address the fate of the existing C/C++ and FORTRAN code.
- A standard API for the host-accelerator model might be a more viable option (think of ALF). APIs might just have a faster adoption rate than languages, as MPI did in the 1990s.
Chances are that if you're reading this, your answer to these concerns is that you can't wait—you need to program now. If that's the case, here are some planning considerations and potential problem workarounds.
- Source code changes: There are portability concerns and potential limits to the scope of code changes you should make. The Cell/B.E. APIs are written in standard C/C++ and FORTRAN, and approaches such as host-accelerator can limit the number of source code changes.
- Operating systems: Windows® applications can be a problem because Cell/B.E. runs only on Linux. If you're stuck with a Windows application, you could use IBM DAV to offload the computational part (and only this part) to the Cell/B.E. processor.
- Languages: C/C++ and FORTRAN are fully supported. Ada has some support, but other languages aren't supported. You might have to rewrite the compute-intensive sections in C and use some form of offloading for Java™ or VBA applications running on Windows.
- Libraries: Although the supported libraries list grows daily, you might find that some libraries are not supported. Some ISVs offer some library support. The best workaround here is to use the workload libraries provided in the IBM SDK.
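The point about the host-accelerator approach limiting source code changes can be illustrated with a small sketch: the compute kernel stays behind one function, and only the dispatch around it changes. This is a conceptual stand-in written with Python's standard thread pool, not the ALF or DaCS API; all function names, the block size, and the worker count are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def saxpy_kernel(a, xs, ys):
    """The compute-intensive part: the only code that would move to an accelerator."""
    return [a * x + y for x, y in zip(xs, ys)]


def run_on_host(a, xs, ys):
    """Original path: the host calls the kernel on the whole data set."""
    return saxpy_kernel(a, xs, ys)


def run_offloaded(a, xs, ys, n_workers=8, block=1024):
    """Offload path: split the input into work blocks and hand them to workers.

    The thread pool stands in for a set of accelerators; the rest of the
    application still sees one call with the same arguments and result.
    """
    starts = range(0, len(xs), block)

    def work(start):
        return saxpy_kernel(a, xs[start:start + block], ys[start:start + block])

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(work, starts)     # results come back in block order
        return [value for part in parts for value in part]
```

Both paths return the same list, so the calling code can switch between them without further changes. That interchangeability is the property to look for when you evaluate how invasive a port will be.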
As always, the decision to move to a Cell/B.E. system depends on where you are and where you are trying to go. You don't have to wait for the other articles in this series to forge ahead in your decision making—just go directly to the IBM Redbook source material.
I'd like to thank Chris Almond, Abraham Arevalo, Ricardo M. Matinata, Maharaja (Raj) Pandian, Eitan Peri, Kurtis Ruby, and Francois Thomas for their marvelous work on the IBM Redbook Programming the Cell Broadband Engine: Examples and Best Practices from which this article is derived.
Resources
Learn
- Use an RSS feed to request notification for the upcoming articles in this series. (Find out more about RSS feeds of developerWorks content.)
- Refer to the original source material for this article in the IBM Redbook Programming the Cell Broadband Engine: Examples and Best Practices (IBM Redbooks, August 2008).
- Read "The Landscape of Parallel Computing Research: A View from Berkeley" (EECS Department, University of California at Berkeley, December 2006) for more about the 13 dwarfs (computation kernels for applications).
- Explore the IDC study "Solutions for the Datacenter's Thermal Challenges" to see what data center designers want in future hardware.
- Find performance numbers for the various computational kernels on various platforms at:
  - "Scientific Computing Kernels on the Cell Processor" (Computational Research Division, Lawrence Berkeley National Laboratory, 2007).
  - "Biological sequence analysis on the Cell BE: HMMer-Cell" (IBM, 2007).
  - "Synthetic programming on the Cell/B.E." (UT/IBM Cell Workshop, 2006).
  - "Multigrid Finite Element Solver on the Cell/B.E." (Digital Medics, 2006).
  - "Cell Broadband Engine Architecture and its first implementation" (developerWorks, November 2005).
  - "Charm++, offload API, and the Cell processor" (University of Illinois at Urbana, 2006).
- Take a look at these ALF-related quick-read guides:
  - "Introducing ALF."
  - "10 major ALF concepts."
  - "Programming with ALF: Basic ALF application structure."
  - "Programming with ALF: Double buffering."
  - "Programming with ALF: Handling ALF constraints."
  - "Programming with ALF: Optimizing ALF applications."
  - "Programming with ALF: Accelerator buffer management."
  - "ALF and hybrid x86."
- Take a look at these DaCS-related quick-read guides:
  - "Intro to DaCS."
  - "APIs, apps, versions, and PDT."
  - "Reservation services."
  - "Process management."
  - "Group functions."
  - "Intro to data communications."
  - "Datacomm details: rDMA."
  - "Datacomm details: rDMA block transfers."
  - "Datacomm details: rDMA list transfers."
  - "Datacomm details: Message-passing services."
  - "Datacomm details: Mailbox services."
  - "Wait identifier management."
  - "Transfer completion routines."
  - "Locking primitives."
  - "Element types."
  - "Error handling."
  - "Error codes glossary."
  - "Trace events glossary."
  - "DaCS and hybrid x86."
- In the "Fun with ALF" series, you get code examples that show you how to add large matrices together, convert I/O data, find minimum and maximum values, overcome memory limits with multiple vector dot products, perform matrix math using overlapped I/O buffers, and use task dependency in a two-stage pipeline application (developerWorks, March-July 2008).
- Learn more about Cell/B.E. programming from the developerWorks series:
  - "Programming high-performance applications on the Cell/B.E. processor"
  - "PS3 fab-to-lab"
  - "The little broadband engine that could"
- Refer to the Cell Broadband Engine documentation section of the IBM Semiconductor Solutions Technical Library for a wealth of downloadable manuals, specifications, and more.
- Sign up for the developerWorks newsletter and get the latest developer news and Cell/B.E. happenings delivered to your inbox each week. Check Power Architecture® when you sign up to receive Cell/B.E. news in your newsletter.
Get products and technologies
- Get your copy of the IBM SDK for Multicore Acceleration 3.0 or browse through the library of Cell/B.E. documentation.
- Find all Cell/B.E.-related articles, discussion forums, downloads, and more at the IBM developerWorks Cell Broadband Engine resource center: your definitive resource for all things Cell/B.E.
- Contact IBM about custom Cell/B.E.-based or custom-processor based solutions.
Discuss
- Check out the Cell Broadband Engine Architecture forum to get your technical questions about the processor answered. Juicy problems and answers from the forums are rounded up periodically and highlighted in the "Forum watch" blog series.
- Go to the Cell Broadband Engine/Power Architecture blog for news, downloads, instructional resources, and event notifications for Cell/B.E. and other Power Architecture-related technologies. You can find the popular "Forum watch" blog series (Q&A roundup), the "FixIt" technology updates, and the Infobomb quick-read technology introductions.

Kane Scarlett is a technology journalist/analyst with 20 years in the business, working for such publishers as National Geographic, Population Reference Bureau, Miller Freeman, and IDG, and managing, editing, and writing for such august journals as JavaWorld, LinuxWorld, and of course, developerWorks.




