The Cell Broadband Engine (Cell/B.E.) architecture is the result of collaboration among IBM, Sony, and Toshiba to design a high-performance and power-efficient processor that can drive applications in fields as diverse as gaming, HDTV, and supercomputing. The Cell/B.E. multiprocessor consists of:
- The Power Processing Element (PPE) that contains a 64-bit Power Architecture core: the Power Processing Unit (PPU).
- Eight Synergistic Processor Elements (SPEs), which are specialized coprocessor units, each containing a Synergistic Processing Unit (SPU); a coherent on-chip bus provides communication between the elements.
While the PPE is a familiar processor with at least 256MB of RAM (global memory) available, each SPE has a large register set and a 256KB local store. The SPEs access global memory either by Direct Memory Access (DMA) over the bus or by exchanging messages with the PPE through a mailbox mechanism.
Standard PowerPC™ programs can run unmodified on a Cell/B.E. system, such as the IBM QS20 BladeCenter® or the Sony PS3, both of which run the Fedora or Yellow Dog Linux® distributions using just the PPE. However, high-performance computing users are well advised to exploit the capability of the SPUs fully. The choices for doing so fit broadly into three models:
- Transparent model—Using libraries that have been optimized for the Cell/B.E. architecture. Many high-performance computing (HPC) codes use proprietary or open source libraries for the computational kernel of the application. These libraries might have been ported to the Cell/B.E. platform. Simple relinking enables an application to use the new libraries.
- Intermediate model—Using advanced languages or compiler directives. A number of third parties provide compilers and libraries that can optimize the transfer of data and the computation between the SPUs and the PPU. This can involve rewriting the computational kernel of an application or just adding compiler directives rather similar to parallelizing using OpenMP.
- Direct model—Using applications written for both the SPUs and the PPU. The applications are responsible for data transfer and synchronization. This is typically used to directly control the behavior of the Cell/B.E. processor, to hand-optimize the performance of the computational kernel, or to use code that does not fit well with the patterns that the advanced languages support.
Inevitably, where programming exists, there is debugging (or at least the need
to debug). That's where Allinea Software's DDT (now in Version 2.1.1) comes in.
DDT is a debugger for multithreaded and
parallel applications. Like programming for the Cell/B.E. processor, the
process of debugging can also differ from your previous experiences.
Trying to find bugs by placing print statements throughout the code is slow, tedious, and often misleading. With graphical debugging tools like DDT, you can control the progress of a program at runtime, and you can see all the values of its variables, its memory, and the current execution stack. This capability provides far more information and flexibility than print statements do, making bug repair a quicker and less frustrating task.
With applications written in the direct model for Cell/B.E. platforms, the debugging need is greatest. At this level, extensions to enable SPU debugging and PPU debugging concurrently are essential. You need to be able to see variables and memory across every part of the processor.
For the intermediate model, in which you use the advanced compiler tools and languages for Cell/B.E. apps, a full Cell/B.E. debugger can give a better view of the program state than a standard debugger. For example, it lets you see the active SPE threads and check their progress while monitoring the PPU.
DDT is designed to debug programs with high degrees of parallelism, sometimes thousands of processors simultaneously running the same application, but it is equally effective on smaller clusters of Linux computers. This is what makes DDT a good choice for Cell/B.E. application debugging: its inherently parallel model of execution, its intuitive process controls, and its sophisticated memory debugging make it easy to find common heap-memory errors and to detect illegal reads.
DDT is a source level debugger. It enables you to see source files and examine the detail of various threads as they progress through execution.
Cell/B.E. applications consist of two components:
- Code for the PPU
- Code for the SPUs
The SPU code is usually embedded in the PPU binary at link time using special tools in the IBM SDK for Multicore Acceleration 3.0 (Cell/B.E. SDK 3.0). When DDT loads an application for debugging, it automatically detects both the PPU and SPU code, and it finds the source files for both sections.
Initially the program stops at
main(), which is the entry
point to the code. Then, as each SPU thread is started, DDT can stop the process so
that you can see how the thread was created. And, at the end of an SPU
thread, DDT can stop, allowing the reason for termination to be determined easily.
Tracking thread termination is an important task. Many errors inside SPU threads
can cause immediate, and often difficult to understand, thread termination.
Examples include exhausting the stack or the lack of available space in the local store.
When a program is first paused inside DDT, there are many parts of the DDT interface that can help you understand what is going on with the code.
The source code is highlighted to show which lines have threads on them; hovering the cursor over a line identifies the threads present there.
The Parallel Stack View is another very popular feature for showing divergent behavior across processors and threads. This window displays the call stacks of each thread in a single tree view, optionally including every thread of every process for a program with multiple Cell/B.E. processors.
Where threads share a common stack (meaning they are in the same part of a program), they are on the same branch of the tree. The number of threads at each part of the tree is shown, meaning it is easy to find the threads that are behaving abnormally just by looking at those branches that contain only one thread.
The SPU and PPU threads are shown across the top of the DDT main window,
indicated by the
SPU acronyms beneath the thread numbers. Clicking
on a thread to select it makes that thread current, which means that the
variables shown are those present on that particular thread.
You can add breakpoints to the program to stop PPU and SPU threads when a particular point in the program is reached. DDT also allows you to set breakpoints that apply to all or just one of the threads.
DDT can control all the threads or just one at a time, which lets you focus on an individual PPU or SPU thread (by stepping through one line of code at a time while the other threads are paused). Following individual threads through a computation is one of the most useful capabilities of a debugger.
Often Cell/B.E. code exhibits SIMD (Single Instruction Multiple Data)
parallelism, with each SPU thread executing the same part of the code at the
same time. In these situations, it can be instructive to step all the threads
forward a line at a time: the point at which the threads diverge can reveal
the source of the problem. DDT does this with a special mode, which you can
select by toggling the threads together button.
Each SPE and PPE has its own addressable memory region. For the SPE, the local
store holds the stack for automatic variables as well as global and static
variables. It also contains local heap memory allocated by calling
malloc() within the SPE.
There are some traps in SPU programming. A global variable is global only within a single SPE: each SPE and the PPE has its own copy of the variable, unlike conventional multithreaded programming, in which threads share the same memory. DDT can help locate problems like this with tools such as the Cross Thread Comparison window, which examines the variables with the same identifier across each SPU and PPU thread. This is ideal for spotting rogue values, and common values are grouped together so that you can focus on the differences.
If an individual SPU or PPU thread is selected, the values of its variables are shown in the Local Variables tab. Dragging the current-line marker in the source code selects the variables on each highlighted line and evaluates them in the Current Line tab. For user-defined data types (structures, classes, and derived types), the contents can be opened by clicking on the variables.
Variables can also be dropped into the Evaluations box for more detailed
analysis, and you can follow pointers from the evaluation window by right-clicking.
There are additional tabs in DDT that show the current status of DMA requests and mailbox events; DDT displays this information as provided by the kernel. DDT can also show other important data, such as the current content of the mailboxes used to communicate between the PPU and SPUs. Each SPE has three mailboxes: two for SPU-to-PPU communication (one interrupts the PPU, while the other must be polled) and one for PPU-to-SPU communication (polled).
This quick overview of DDT should give you an idea of how its capabilities can improve the experience of debugging Cell/B.E. applications. DDT for the Cell/B.E. platform is available now for the IBM QS20 BladeCenter and the Sony PS3 running Fedora Core 6 and the Cell SDK 2.1. Watch for a version of DDT that works with SDK 3.0 soon.
- Use an RSS feed to request notification for the upcoming articles in this series. (Find out more about RSS feeds of developerWorks content.)
- This article is part of an unofficial "partners" series. The first article in this series is
"Part 1: Build high-performance apps for multicore processors"
(developerWorks, May 2007) about the RapidMind Development Platform, which provides a
simple single-source mechanism to develop portable high-performance applications
for multicore processors.
- Get your own copy of
"Using DDT to debug
applications on the IBM Cell BE processor," which is
the original whitepaper from which this article was adapted.
- Use the tutorial
"Cell/B.E. SDK 3.0, Part 5: Debug and complete dynamic or static performance"
(developerWorks, November 2007) to debug Cell/B.E. applications. This tutorial is
part of a six-part series.
- See "Debugging Cell Broadband Engine systems"
(developerWorks, August 2006) for the essential concepts and tools
developers need to perform successful Cell/B.E.-related debugging.
- To learn more about Cell/B.E. programming, try these articles:
- "Programming high-performance applications on the Cell/B.E. processor"
- "PS3 fab-to-lab"
- "The little broadband engine that could"
- Refer to the Cell
Broadband Engine documentation section of the IBM Semiconductor Solutions Technical Library for a wealth of downloadable manuals,
specifications, and more.
- Find all Cell/B.E.-related articles, discussion forums, downloads,
and more at the IBM developerWorks Cell
Broadband Engine resource center: your definitive resource for all things Cell/B.E.
Get products and technologies
- Learn more about
the DDT debugger.
You can download an evaluation version of DDT.
- Get your copy of the
IBM SDK for Multicore Acceleration 3.0
or browse through the library of Cell/B.E. documentation.
- Contact IBM about custom
Cell/B.E.-based or custom-processor-based solutions.
- Sign up for the developerWorks newsletter
and get the latest developer news and Cell/B.E. happenings delivered to your inbox each week.
Check Power Architecture when you sign up to receive Cell/B.E. news in your newsletter.
- Participate in the discussion forum.
- For the fastest, best answers to your
Cell/B.E.-related questions, try the peer-and-expert
Cell Broadband Engine Architecture forum.
- The Cell Broadband Engine/Power Architecture notebook
is a blog-based resource that hosts two instructional features: a digest
of interesting questions and hot topics from the forum, and a
series of short, precise, task-specific, quick-read knowledge "bombs" gleaned from the forum.
David's history in High Performance Computing began with the Oxford BSP group in 1993, working on an early alternative model for parallel programming to the emerging but complex MPI standard. He obtained a DPhil in Parallel Computing, producing work on the simulation of shared-memory systems using, and formal semantics for, distributed-memory clusters. He subsequently continued to work at Oxford in post-doctoral and teaching positions, researching parallel libraries and languages. After two years developing software for high-volume online services on clusters of Java computers, he returned to the fold of High Performance Computing at Allinea, researching the development tools needed for parallel programming.