Level: Introductory
Power Architecture editors, developerWorks, IBM
16 Nov 2005 This paper from the MPR Fall Processor Forum 2005 explores programming models for the Cell Broadband Engine (CBE) Processor, from the simple to the progressively more advanced. With nine cores on a single die, programming for the CBE is like programming for no processor you've ever met before. Read why.
The Cell Broadband Engine (CBE) Processor offers the potential for increased processor performance for a broad variety of applications. However,
coming anywhere close to the theoretical performance capability of the
processor requires a good understanding of the processor's capabilities,
and the choice of a programming model which matches the processor's
architecture.
This paper reviews the basic architecture of the CBE processor and some
of the programming models which fit well with its design.
The basic CBE architecture
Figure 1. The CBE block diagram
The CBE processor contains a PowerPC® processor element, used as a primary processor, and eight "synergistic" processor
elements. These are called the PPE and
SPEs, respectively (see the sidebar on acronyms). The architecture
also allows an SPE to take up a controller role if needed.
The CBE processor also provides a great deal of bandwidth. Each of the eight SPEs has 256KB
of local storage for code and data, and 128 registers, each 128 bits wide.
The instruction set for the SPEs is designed to favor SIMD processing.
The SPEs do not have any hardware cache of main memory.
Because each SPE sits so close to its 256KB of local store, it is easy to
see an abstract similarity to a cache, and SPE programmers can manage
the local store to hold frequently used pieces of data. From the
hardware-architecture point of view, however, the two are not the same.
Caches hold temporary copies of physical memory. The local storage on each
SPE is not associated with some region of physical memory; it is a private,
non-coherent, local store. Data can be transferred directly from one
SPE's local storage to another, without going through physical memory. The
internal bus, called the Element Interconnect Bus, has a bandwidth of
roughly 100GB/sec; for more information on the ways data moves around the
CBE, see the related article, "Unleashing the Cell Processor: The Element Interconnect Bus," to be published next week.
Each SPE can communicate with the PPE through mailboxes, which are
registers available to a given SPE and the main processor. The
three mailboxes are: inbound and outbound non-interrupting mailboxes and an
outbound interrupting mailbox. The interrupts allow notification and
communication in a quick and efficient manner.
Levels of parallelism
Acronym alley
The CBE processor has a broad feature set for which a number of terms
have been coined. Following are a few of the acronyms you will see used in
discussions of the Cell Broadband Engine:
PPE - PowerPC Processing Element. This is a two-thread,
dual-issue processor. Unlike most modern PowerPC chips, it executes
in-order, and has a comparatively short pipeline.
SPE - Synergistic Processing Element. This is a specialized
processor with a small amount of local storage and its own access to
system memory. SPEs are capable of running independently of the PPE as
independent processors. They may also run under the control of the PPE,
and are used to offload computationally intensive tasks.
CESOF - Cell Embedded SPE Object Format. An object format used to
bundle SPE code together with PPE code. It gives the PPE a simple handle
that can be used to hand code off to a given SPE. The format also enables
the SPE code to reference global objects defined in the main memory. In
fact, the CESOF is an ELF format with additional sections to support
symbol resolution and embedding SPE ELF objects within PPE ELF objects.
DMA - Direct Memory Access. Generically, transfer of memory
directly from one part of a system to another without processor
interaction. Within the CBE, DMA is also used to refer to moving data between,
for instance, processor elements and main memory, or between two processor
elements.
EIB - Element Interconnect Bus. This is the bus that connects the
various processor elements to memory and other I/O.
MFC - Memory Flow Controller. The part of an SPE which carries
out DMA operations and moves data around.
LS - Local storage. 256KB of shared instruction and data storage
local to each SPE. (Not to be mistaken for a cache.)
SIMD - Single Instruction Multiple Data. The simple way to do
parallelism, as in the PowerPC's VMX/AltiVec instructions.
CBEA - Cell Broadband Engine Architecture. Strictly speaking, the
Cell Broadband Engine is the first CBEA processor, and is often called the
Cell.
IDL - Interface Definition Language.
Several kinds of parallelism are available when developing
applications for the CBE. For starters, both the PPE and the SPEs have SIMD
instructions available, so a single instruction can perform multiple
simultaneous operations. They are also superscalar, capable of executing
two instructions per clock cycle. This level of parallelism is familiar
to PowerPC developers already from the VMX/AltiVec instructions available
on recent PowerPC processors.
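As a rough illustration of the SIMD level described above, the following sketch models one 128-bit register holding four 32-bit floats. It assumes the GCC/Clang vector extension as a portable stand-in; real SPE code would use the SPU intrinsics from the SDK instead. The names vec4f and scale_add are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Four 32-bit floats packed into one 128-bit value, the same width as
 * an SPE register. This uses the GCC/Clang vector extension as a
 * portable stand-in; it is not the SPU intrinsic API. */
typedef float vec4f __attribute__((vector_size(16)));

/* Compute out[i] = a * x[i] + y[i], four elements at a time.
 * n must be a multiple of 4 (hypothetical helper for illustration). */
static void scale_add(float a, const float *x, const float *y,
                      float *out, size_t n)
{
    assert(n % 4 == 0);
    vec4f va = { a, a, a, a };
    for (size_t i = 0; i < n; i += 4) {
        vec4f vx = { x[i], x[i + 1], x[i + 2], x[i + 3] };
        vec4f vy = { y[i], y[i + 1], y[i + 2], y[i + 3] };
        vec4f vr = va * vx + vy;   /* one expression, four lanes */
        out[i]     = vr[0];
        out[i + 1] = vr[1];
        out[i + 2] = vr[2];
        out[i + 3] = vr[3];
    }
}
```

Calling scale_add(2.0f, x, y, out, 4) on four-element arrays performs all four multiply-adds in a single vector expression, which is the pattern a SIMD instruction set is designed to favor.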
Each processing element can be performing a different ongoing task,
which allows for task-level parallelism. The PPE is dual-threaded, and
there are eight SPE cores, allowing a total of ten tasks at once (two on the
PPE, and one on each SPE). Each task, in turn, could be using SIMD
instructions to process large amounts of data.
While all of this processing is going on, the DMA engines (MFCs) on each
SPE can also be moving data around. This is a separate component of the
architecture, and need not prevent the processors from operating on the
data already available to them. Thus, the CBE on-chip cores don't need
to spend much time moving data around.
Finally, you can have multiple CBE processors in a system, or
even multiple systems in a cluster. This level of parallelism is fairly
well understood and is not specific to Cell. For more on data coherency,
see "Unleashing the Cell Processor: The Element Interconnect Bus," to be published next week.
The array of options here is potentially bewildering. You should
adopt programming models designed to make efficient use of the available
resources. A good understanding of the ways in which you can use the CBE processor
makes it much easier to develop efficient, reliable code that
can be delivered on a useful schedule. A good programming model makes
good use of the huge computational capacity and bandwidth the CBE
provides, dividing work up among the various processing components
available. Furthermore, use of consistent models allows the development
of language constructs, libraries, frameworks, or even operating system
support to simplify development.
This paper reviews both "small" (local-store only) and "large"
(using external code or data) single-SPE programming models, as well as multiSPE
parallel models. At the end, it also describes the multitasking aspect of
sharing the SPEs.
Programming the PPE
Before getting into the details of programming the SPEs, you need to understand the role of the PPE. The PPE is a 64-bit PowerPC core,
designed to run general-purpose code and to support the SPEs. The CESOF
object format (see the Acronyms
sidebar) is used to store a chunk of code for delivery to an SPE.
Each SPE image is associated with a handle which can be used at runtime to load specific
code on an SPE. The PPE handles memory mapping and exception
handling, loads code on the SPEs, and starts and stops their execution.
In general, but not always, scheduling of the SPEs is left up to the PPE,
and OS services (such as file I/O) also run on the PPE.
Small single-SPE models
The simplest way to use a single SPE is to load a single chunk of code on
it, along with the data it needs to process, and let it run. The SPE runs
entirely out of its local data store, with no access to (or bandwidth load
on) main memory. Many workloads can be handled entirely this way, but
code and data together must fit in 256KB.
In this model, input and output are always explicit. The SPE program is
given arguments (passed in as arguments to a main function) and returns an
exit status. It can also communicate using the mailboxes, or by system
calls. This model might be supported by an IDL. SPE executables are
compiled and linked separately, then embedded as read-only data in the PPE
executable, using the CESOF format. (The CESOF object is bundled into a
section of the ELF object file for the PPE executable.) At runtime, the
PPE loads and initializes the SPE, then starts it running on the code.
See also the "SPU Application Binary Interface Specification," listed in the Resources section of this document.
The following code listings show how this works:
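Listing 1. A program to run on the SPE
/* spe_foo.c
 * A C program to be compiled into an executable called "spe_foo".
 * addr64 is assumed to be the 64-bit address type supplied by the SDK headers.
 */
extern int func_foo(addr64 argp);   /* assumed prototype; the real work lives here */

int main(unsigned long long speid, addr64 argp, addr64 envp)
{
    int i;

    /* func_foo would be the real code */
    i = func_foo(argp);
    return i;   /* becomes the exit status seen by the PPE */
}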
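Listing 2. A program to run another program on the SPE
/* spe_runner.c
 * A C program to be linked with spe_foo and run on the PPE.
 */
#include <stddef.h>   /* NULL */
#include <libspe.h>   /* assumed header for the spe_* calls and types */

/* Handle for the embedded SPE image, resolved at link time */
extern spe_program_handle_t spe_foo;

int main(void)
{
    int rc, status = 0;
    speid_t spe_id;

    /* Load spe_foo on an SPE and start it running */
    spe_id = spe_create_thread(0, &spe_foo, 0, NULL, -1, 0);
    /* Block until the SPE program exits; rc reports whether the wait succeeded */
    rc = spe_wait(spe_id, &status, 0);
    return status;
}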
Note that the result of spe_wait and the status code returned by the SPE
program are distinct; if spe_wait succeeds, then status will be filled in
with the value returned from the SPE program. Interaction between the SPE and the PPE is not needed until execution completes. The
spe_wait operation blocks until the SPE program exits.
Another option is to load multiple pieces of code and data to a single
SPE; this can be the basis of a primitive multitasking environment on the
SPE, useful for multiple small jobs which do not need a
dedicated processor. You cannot provide memory protection between
tasks running on a single SPE, and such multitasking is necessarily
cooperative, not preemptive. However, in cases where the tasks are small
enough, and complete reliably enough, this can dramatically improve
performance, freeing up other SPEs for dedicated tasks. The cost of
transferring a small program to the SPE might be too high in some cases,
however.
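The cooperative arrangement described above can be sketched as a simple run-to-completion loop. This is a minimal model in portable C, not SPE runtime code; the struct task layout and the run_tasks, sum_to, and square names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* A task is a function run to completion; it returns a status code. */
typedef int (*task_fn)(void *arg);

struct task {
    task_fn fn;
    void   *arg;
    int     status;   /* filled in once the task has run */
};

/* Run each resident task in turn. There is no preemption and no memory
 * protection: a task that never returns stalls every task behind it. */
static size_t run_tasks(struct task *tasks, size_t count)
{
    size_t completed = 0;
    assert(tasks != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        tasks[i].status = tasks[i].fn(tasks[i].arg);
        completed++;
    }
    return completed;
}

/* Two tiny example jobs (hypothetical). */
static int sum_to(void *arg) { int n = *(int *)arg; return n * (n + 1) / 2; }
static int square(void *arg) { int n = *(int *)arg; return n * n; }
```

Because every task shares one address space and one thread of control, the scheme only works when each job is small and returns reliably, which matches the conditions given above.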
Large single-SPE programming models
When all your code and data together cannot fit within 256KB,
you might need to use a "large" model. In this model, the PPE reserves chunks of effective address space for use by the SPE program, which accesses
them through DMA. Memory mapping into effective address space is set up
to give the SPE a secondary memory store which it can access using the DMA engine (MFC).
You can use this model in many ways. One is the streaming model: load a regular chunk of sequential data, modify it, write it back to main memory, and repeat.
Another is to use the local store as a temporary buffer, copying randomly accessed data back and forth as needed.
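The streaming model can be sketched in portable C as follows. Here memcpy stands in for the MFC transfers, which is an assumption made so the example runs anywhere; dma_get, dma_put, stream_increment, and the CHUNK size are hypothetical names and values.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 64   /* elements per transfer; an arbitrary choice */

/* Stand-ins for MFC transfers between main memory and local store.
 * Real SPE code would issue DMA commands; memcpy is only a model. */
static void dma_get(int *ls, const int *main_mem, size_t n)
{ memcpy(ls, main_mem, n * sizeof *ls); }
static void dma_put(int *main_mem, const int *ls, size_t n)
{ memcpy(main_mem, ls, n * sizeof *ls); }

/* Stream `total` elements through a fixed-size local buffer:
 * load a chunk, modify it in place, write it back, repeat. */
static void stream_increment(int *main_mem, size_t total)
{
    int local[CHUNK];                 /* models a buffer in local store */
    assert(main_mem != NULL || total == 0);
    for (size_t off = 0; off < total; off += CHUNK) {
        size_t n = (total - off < CHUNK) ? total - off : CHUNK;
        dma_get(local, main_mem + off, n);
        for (size_t i = 0; i < n; i++)
            local[i] += 1;            /* the "real work" */
        dma_put(main_mem + off, local, n);
    }
}
```

The working set never exceeds one chunk, so the data set in main memory can be far larger than the local store.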
In some cases, the same techniques could be used for code, not just data;
the primary code loaded on the SPE could use overlay segments located in main
memory as needed. The compiler can generate automatic overlaying code to
handle this case. CESOF, and the toolchains in use, allow SPE code to
refer to objects defined in the effective address space.
The simple LS-resident multitasking possible on a single SPE becomes
somewhat more flexible when combined with automatic loading and storing of
job code and data; a small kernel running on the SPE can load tasks, or
swap them out when blocked or completed, allowing the SPE to manage
multiple tasks from a job queue. Once again, this is still not
preemptive.
One consideration when using DMA to span the effective memory space of
an SPE is that it imposes significant latency. With that in mind,
prefetching (also known as double-buffering or multibuffering,
particularly in game programming) becomes an important technique. If, while
buffer N is being processed, buffer N-1 is being written out, and then
buffer N+1 is being read in, the processor can execute continuously, even
if the time required to transfer the data is a substantial fraction (up to
half) of the time it takes to perform an operation.
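The buffer rotation behind double-buffering might look like the sketch below. It is a structural model only: the transfers here are synchronous memcpy calls, whereas real MFC transfers would be started and then waited on later so that they overlap with computation. start_get, start_put, chunk_len, double_buffered_scale, and BUF_ELEMS are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_ELEMS 32   /* elements per buffer; an arbitrary choice */

/* Synchronous stand-ins for asynchronous MFC transfers (assumption:
 * on real hardware these would be started here and waited on later). */
static void start_get(int *ls, const int *src, size_t n)
{ memcpy(ls, src, n * sizeof *ls); }
static void start_put(int *dst, const int *ls, size_t n)
{ memcpy(dst, ls, n * sizeof *ls); }

static size_t chunk_len(size_t total, size_t off)
{ return (total - off < BUF_ELEMS) ? total - off : BUF_ELEMS; }

/* Double every element, alternating between two local buffers so the
 * fetch of chunk N+1 is issued before chunk N is processed. */
static void double_buffered_scale(int *main_mem, size_t total)
{
    int buf[2][BUF_ELEMS];
    int cur = 0;
    assert(main_mem != NULL || total == 0);
    if (total == 0)
        return;
    start_get(buf[cur], main_mem, chunk_len(total, 0));    /* prime */
    for (size_t off = 0; off < total; off += BUF_ELEMS) {
        size_t n = chunk_len(total, off);
        size_t next = off + BUF_ELEMS;
        if (next < total)                                  /* prefetch N+1 */
            start_get(buf[cur ^ 1], main_mem + next, chunk_len(total, next));
        for (size_t i = 0; i < n; i++)                     /* process N */
            buf[cur][i] *= 2;
        start_put(main_mem + off, buf[cur], n);            /* write back N */
        cur ^= 1;
    }
}
```

With truly asynchronous transfers, the code would also have to wait for the write-back of a buffer to finish before fetching new data into it; that wait is omitted here because the stand-in transfers complete immediately.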
MultiSPE programming models
It is possible to combine the work of multiple SPEs. Synchronization
becomes an issue here. Options include MFC atomic update commands, mailboxes, SPE signal notification
registers, events and interrupts, or even just polling of
shared memory. As with large single-SPE models, a compiler (for example,
one supporting OpenMP) might transparently manage access to shared memory with proper
critical section access controls.
The job queue model is a popular model for programming the CBE,
which allows any idling SPE to obtain another task quickly, providing
automatic load balancing. One special case of this that optimizes
particularly well for regular and sequential chunks is the streaming
model. If a given piece of data can be processed quickly enough
by a single SPE, but there is too much sequential data of that sort
for a single SPE to process quickly enough, multiple SPEs can be
assigned to process the array of data, pulling new blocks of data
off a FIFO queue and processing them simultaneously.
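To show why a shared job queue balances load automatically, the sketch below simulates it in a single thread: whichever worker becomes idle first takes the next job. It is a timing model under simplifying assumptions, not concurrent code; on real hardware, claiming a job would need one of the synchronization options listed above. run_job_queue and MAX_WORKERS are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WORKERS 8   /* one per SPE in this model */

/* Simulate a shared FIFO job queue: whichever worker becomes idle
 * first takes the next job. job_cost[j] is the time job j needs.
 * jobs_done[w] receives the number of jobs worker w handled.
 * Returns the time at which the last job finishes. */
static unsigned run_job_queue(const unsigned *job_cost, size_t njobs,
                              size_t nworkers, size_t *jobs_done)
{
    unsigned busy_until[MAX_WORKERS] = { 0 };
    unsigned finish = 0;
    assert(nworkers >= 1 && nworkers <= MAX_WORKERS);
    for (size_t w = 0; w < nworkers; w++)
        jobs_done[w] = 0;
    for (size_t j = 0; j < njobs; j++) {        /* FIFO order */
        size_t idle = 0;                        /* earliest-free worker */
        for (size_t w = 1; w < nworkers; w++)
            if (busy_until[w] < busy_until[idle])
                idle = w;
        busy_until[idle] += job_cost[j];
        jobs_done[idle]++;
        if (busy_until[idle] > finish)
            finish = busy_until[idle];
    }
    return finish;
}
```

Given one job of cost 8 followed by eight jobs of cost 1 and two workers, the first worker handles the long job while the second drains all eight short ones, and both finish at time 8. No scheduler had to know the job costs in advance.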
Another option for streaming is the pipeline, where each SPE handles part
of a task, processing the output from the previous SPE. This might make
heavy use of direct LS to LS DMA, bypassing main memory to reduce
bandwidth use. This allows SPEs to perform tasks which need too much code
to leave any room for efficient data handling. The code can be split up
among multiple SPEs, allowing data to be handed across quickly. This is a
good example of the use of the DMA channels for message-passing. On the
down side, load balancing is much harder than it is with the previously mentioned streaming setup. If a given chunk of data
cannot be processed quickly enough by a single SPE, code overlays
might be a better fit.
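The pipeline arrangement can be modeled tick by tick, as in the following sketch. Each array slot stands in for a hand-off between neighboring SPEs; nothing here performs real DMA. run_pipeline, add_one, times_two, and MAX_STAGES are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_STAGES 8

typedef int (*stage_fn)(int);

/* Advance a software pipeline one tick at a time. In each tick every
 * stage works on the item its predecessor finished in the previous
 * tick, modelling an LS-to-LS hand-off between neighbouring SPEs.
 * Returns the number of ticks needed to drain the pipeline. */
static size_t run_pipeline(const stage_fn *stages, size_t nstages,
                           const int *in, int *out, size_t nitems)
{
    int    slot[MAX_STAGES];          /* item held by each stage */
    int    full[MAX_STAGES] = { 0 };  /* does the stage hold an item? */
    size_t fed = 0, drained = 0, ticks = 0;
    assert(nstages >= 1 && nstages <= MAX_STAGES);
    while (drained < nitems) {
        /* Walk backwards so each item moves exactly one stage per tick. */
        for (size_t s = nstages; s-- > 0; ) {
            int have = 0, item = 0;
            if (s == 0) {
                if (fed < nitems) { item = in[fed++]; have = 1; }
            } else if (full[s - 1]) {
                item = slot[s - 1]; full[s - 1] = 0; have = 1;
            }
            if (!have)
                continue;
            item = stages[s](item);
            if (s == nstages - 1) out[drained++] = item;
            else { slot[s] = item; full[s] = 1; }
        }
        ticks++;
    }
    return ticks;
}

/* Two trivial stages (hypothetical). */
static int add_one(int x)   { return x + 1; }
static int times_two(int x) { return x * 2; }
```

In this model every stage takes exactly one tick. If one stage were slower than the others, the rest would sit idle waiting for it, which is the load-balancing difficulty noted above.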
Kernel management of SPEs
An OS running on the PPE can manage and allocate SPEs, providing and
arbitrating access to them. Making the SPEs available like this allows
preemptive multitasking of more tasks than there are SPEs to run them.
Running tasks or threads can be mapped onto SPEs, paused and copied out,
resumed, and so on. Because the context-switch cost is relatively large
(the whole 256KB of local store, 128x16B register file, and DMA command
queues), a run-to-completion policy is of course strongly favored, but the
option of preemption is there. This can allow for memory protection
between tasks, because a task will be swapped out completely before
another task is loaded on the same SPE.
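To put a number on the context-switch cost, the arithmetic from the figures above can be written out as a small sketch. It counts only the local store and the register file; the DMA command queue state is left out because its size is not given here. The names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum {
    LS_BYTES       = 256 * 1024,  /* local store */
    NUM_REGISTERS  = 128,
    REGISTER_BYTES = 16           /* 128 bits each */
};

/* The register file alone comes to 2KB. */
static_assert(NUM_REGISTERS * REGISTER_BYTES == 2048,
              "register file should be 2KB");

/* Bytes of local store plus register file that must be saved and later
 * restored on an SPE context switch. DMA command queue state is left
 * out because the text does not give its size. */
static size_t spe_context_bytes(void)
{
    return (size_t)LS_BYTES + (size_t)NUM_REGISTERS * REGISTER_BYTES;
}
```

The result is 264,192 bytes, or 258KB, to be written out and the same amount read back in for each switch, which is why run-to-completion is favored.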
Application development
Everything you've ever been told about application optimization applies
more so to the CBE architecture. Choice of algorithms, interactions
between algorithms, and related issues are crucial to effective
development. Budget some time for experimentation; partition the
algorithm and program, see how it works, and be prepared to try again.
Start with the code that will run on the PPE, then offload specific tasks
to SPEs. Switch to SIMD code on the SPEs if you need to, but be sure your
overall algorithm partitioning is working before you spend a lot of time
vectorizing code that will have to be rewritten anyway.
When targeting a CBE processor, you have to budget both computation and
bandwidth. There's plenty of both, but the sheer volume of computation
available can swamp your bandwidth, and the sheer volume of data available
can swamp your processing. If you have bandwidth crunches, look for ways to calculate data rather
than copying data, and look for ways to do more calculations before
passing the data onward.
Look for bottlenecks when benchmarking your code. The PPE can easily
become a bottleneck if you rely too heavily on precalculating what you
want the SPEs to do. If an operation is taking too long on a single SPE,
look into splitting the work.
Conclusion
A good choice of programming model and a clear understanding of the many
models possible on the CBE architecture can reduce development cost
while improving performance. Abstractions, such as streaming and job-queue programming models, and
development tools are crucial to reliable and efficient development. Don't be afraid to mix
programming models; you may find the best design has two SPEs running
unique tasks, two SPEs streaming a common task, and a four-SPE pipeline
handling a particularly complicated task. You are not required to use all
the SPEs in the same way. New applications may suggest new programming
models; budget time for experimentation. Streaming can emulate the
function of pipelining, although it might impose a slight additional cost.
The CBE architecture makes it likely that even a fairly modest
development effort will yield an impressive performance gain over a more
traditional processor architecture, but achieving top performance is difficult.
Acknowledgments
This article was adapted by Peter Seebach, working from the original
presentation "Unleashing the power of Cell Broadband Engine: A programming
model approach," presented at MPR Fall Processor Forum 2005 by Alex
Chow of IBM. Peter would like to thank Tim Kelly, Alex Chow, and Daniel
Brokenshire for technical and editorial review during the writing process.
Resources
Learn
- This paper is based on a presentation given at Fall Processor Forum 2005: The Road to Multicore. See the rest in this series.
- The Cell Broadband Engine project page at IBM Research offers a wealth of links, diagrams, information, and articles.
- Introduction to the Cell multiprocessor (IBM Journal of Research and Development, 2005) has a good discussion of the history of the CBE project.
- The IBM Semiconductor Solutions Technical Library Cell Broadband Engine documentation section lists specifications, user manuals, and articles of general interest.
- The SPU Application Binary Interface Specification V1.3 discusses register usage and calling conventions, data type sizes and alignment, low-level system and language binding information, information on loading and linking, and coding examples. This specification defines the system interface for SPU-targeted object files to help ensure maximum binary portability across implementations.
- Find related articles, downloads, discussion forums, and more at the IBM developerWorks Cell Broadband Engine resource center: your definitive resource for all things CBE.
- Keep abreast of all the CBE news: subscribe to the Power Architecture Community Newsletter.
About the author
The developerWorks Power Architecture editors welcome your comments on this article. E-mail them at dwpower@us.ibm.com.