The Cell Broadband Engine processor has a novel architecture. Programmers familiar with single-core systems or even with homogeneous multicore systems may find it challenging to make effective use of the Cell/B.E. processor's unusual architecture. In this series, I introduce Cell/B.E. programming from the perspective of an experienced programmer (me) who is new to the Cell/B.E. architecture, and I'll show you how to develop a feel for programming models that work on the Cell/B.E. processor. The series will be much more enlightening if you download the SDK 2.1 and follow along (see Resources).
If you have actual Cell/B.E. hardware (I used a Sony® PLAYSTATION® 3), you can use the SDK's development tools to target it. If you don't have Cell/B.E. hardware, the SDK also provides a simulator that offers a usable emulation of the hardware for testing purposes, although obviously the simulated hardware is slower. Use whichever one you want to play along with in this series. The simulator offers some additional debugging features that end-user hardware lacks, so you might find it useful even if you have hardware.
Because other articles discuss the Cell/B.E. architecture in more detail (see Resources), for now a brief overview should suffice. The Cell/B.E. processor offers a dual-threaded PowerPC® processor core (the Power Processor Element or PPE) and eight single-threaded SIMD processor cores (the Synergistic Processor Elements or SPEs). The PPE is a fairly standard PowerPC design, including AltiVec/VMX vector enhancements; for more information about the PPE's vector processing, you might look to the Unrolling AltiVec series from 2005 (see Resources). The PPE, although architecturally similar to other 64-bit PowerPC chips, isn't directly related to a specific product line. The SPEs are where the Cell/B.E. architecture gets interesting.
The SPE is a vector-only processor. Every operation works on multiple data elements in parallel, stored in 128-bit registers. There are no scalar (single-element) registers or operations. Each SPE has 256KB of local storage for both instructions and data. This storage is not a "cache" -- it does not share address space with main memory.
Communication between elements is also somewhat unusual, although it won't have significant effects on our early explorations of Cell/B.E. development. Data does not need to be managed entirely by the PPE. Each SPE can communicate directly with main memory and furthermore, the SPEs can communicate directly with each other.
Communication between the PPE and the SPEs happens through "mailboxes" (dedicated registers which can be written to by one core and retrieved by the other), through DMA, or both. This series begins with an overview of the API used to communicate between the PPE and the SPEs.
The API for offloading code to the SPEs is distributed as a code library for the PPE, libspe. This series looks at the revised libspe2 implementation which offers a substantial API revision from the implementation released in earlier versions of the Cell/B.E. SDK. You can find complete documentation in the Cell/B.E. SDK (see Resources).
The SPE API is built around the notion of an SPE context; this is a representation of the current state of a single SPE containing the complete data set to be loaded to it (including any executable code). In the previous SDK, the API provided for launching "threads" on the SPEs. In the revised API, the call to run a context is synchronous: The call does not complete until the called program completes. Your code must create multiple threads (or processes, if you really prefer) to run multiple SPEs. On the other hand, you have more direct control over choices related to thread scheduling since you use the thread API directly.
SPE programs are compiled as standalone programs with a main function which simply performs the necessary work. Setup and tear down are handled automatically by the startup code linked in with the SPE program and by the library code running on the PPE; you can generally ignore them. SPE programs often do at least some of their own data transfer, however.
The SPE API defines a number of opaque data types which are used as arguments to various API functions. Without going into the exhaustive details of all of them, here are the ones I'll be using in the first few examples:
- spe_context_ptr_t
- spe_program_handle_t
- spe_stop_info_t
Okay, you probably need a little more detail.
The first, spe_context_ptr_t, represents a
virtualized SPE state; it includes the register states and the contents
of local store. This is the core item which indicates a program on an SPE,
whether it's being loaded, or running, or being queried as to why it stopped.
This is an opaque handle; do not peer into its internals.
The second, spe_program_handle_t, is a handle
that can be used to identify an SPE executable program built for use with
libspe2. This type can be created from a file containing an SPE binary
or embedded into a PPE program using the ppu-embedspu
utility. This is another opaque handle type whose contents are used only
by the library.
Finally, the spe_stop_info_t type is used to
record the reason for which an SPE program stopped execution. Unlike the
other two, this is not an opaque type; its structure is documented. The
most important member is stop_reason which
indicates the reason for which the SPE program stopped execution. The most
common value should be SPE_EXIT (the program was
successful and an exit status has been stored in the structure). Find a complete
description of this type in the documentation for the SPE Runtime
Management Library, found in the "pdfs" directory of the SDK distribution
(see Resources).
To run a program on an SPE, create an SPE context, load a program into it,
and run it. If you have an embedded SPE binary of type spe_program_handle_t, this can be reduced to four
lines of code:
Listing 1. The SPE API
entry = SPE_DEFAULT_ENTRY;
context = spe_context_create(0, 0);
spe_program_load(context, &my_program);
spe_context_run(context, &entry, 0, 0, 0, 0);
Of course, like any trivialized example, this has a number of potential flaws. First, there's no error checking; that's always a bad thing. Second, the program is called without any arguments provided, meaning, it will do exactly the same thing every time; it has no way to acquire any data at runtime because it hasn't been given any information to tell it where the data would be. This is good enough to run "hello, world!," but not much else.
The entry argument allows you to have multiple entry points into a single
SPE executable. If you have a few small programs, you could load them all
at once, then use different entries when starting a run to run different
programs without redoing the whole load process. The special value
SPE_DEFAULT_ENTRY causes the default entry point in the SPE program's ELF
headers to be used.
It's instructive to understand what happens on the SPE when the PPE is using
the libspe2 API. The libspe2 code provides a runtime in which your program
runs. From your perspective, execution begins at a function called main();
some library startup code obtains the arguments passed
in from the PPE and then calls your main function with those arguments. This
function returns an int (which is stored into the spe_stop_info_t structure on the PPE side) and
takes three arguments -- an unsigned long long indicating which SPU the code
is running on and a pair of 64-bit arguments which contain the argp and
envp arguments to spe_context_run. These arguments are canonically of
type unsigned long long, although some programs
engage in mild type punning, treating them as 64-bit pointers or other
types. In contrast to the UNIX® environment norm of multiple string arguments,
there are a fixed number of arguments (two) and each is a pointer into main memory.
That's a pretty big difference because the SPE does not have direct access to
main memory. Instead of having multiple directly accessible arguments,
you get a pointer into main memory and the SPE has to issue DMA
requests to get the pointed-to data from main memory. For instance, the
following code would gather a chunk of data pointed to by argp:
Listing 2. Issuing a DMA request
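/* Select tag group 0 for the completion check below. */
spu_writech(MFC_WrTagMask, 1 << 0);
/* Start a GET: copy sizeof(data) bytes from main memory at argp into data. */
spu_mfcdma32((void *)(&data), (unsigned int)argp, sizeof(data), 0,
             MFC_GET_CMD);
/* Block until every transfer in the selected tag group has completed. */
(void)spu_mfcstat(2);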
This request asks for sizeof(data) bytes to be copied from the address in
main memory denoted by argp into the object data. Nothing tells the program
how much data has been provided; the
SPE program has to know what data it will be called with, and it's up to the
calling program on the PPE to provide the right data in the right order.
The spu_mfcdma32() function takes a 32-bit value even though the target
processor (the PPE) has a native 64-bit address space. The spu_mfcdma64()
function takes an additional argument specifying the top 32 bits of an
effective address. The effective address is, under the hood, always split
into a pair of 32-bit registers; the high-order register is optional and
if it's not specified, it is treated as zero. On current systems, it turns
out to be safe enough to assume that the address space of the Cell/B.E. processor will
always fit in the first 32 bits. (And yes, you probably ought to feel very
afraid when someone makes a statement like that.)
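To make the split concrete, here is a minimal sketch of how a 64-bit effective address breaks into the two 32-bit halves described above. The helper names are my own inventions for illustration, not SDK functions; the point is only to show what gets dropped when you pass just the low half.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers (not part of the SDK): split a 64-bit effective
 * address into the two 32-bit halves that a 64-bit DMA call expects. */
uint32_t ea_high(uint64_t ea) { return (uint32_t)(ea >> 32); }
uint32_t ea_low(uint64_t ea)  { return (uint32_t)(ea & 0xFFFFFFFFu); }

/* True when the address fits in 32 bits, that is, when passing only the
 * low half (as a 32-bit DMA call does) loses no information. */
int ea_fits_in_32_bits(uint64_t ea)
{
    return ea_high(ea) == 0;
}
```

A cautious program could check ea_fits_in_32_bits(argp) before casting argp to unsigned int, and bail out if the assumption ever stops holding.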
SPE programs that need more than one argument will generally build a structure containing named members and pass that over to be used the way arguments normally would be.
In some cases, the data passed in as arguments will contain more addresses whose contents are, in turn, DMA-ed over in whole or in part. For instance, the argument list might contain the address of a 16MB chunk of data to process. The SPE can't hold that much data at once in the 256KB local store, but it can DMA over one chunk at a time for processing. Whether it's an argument passed in or an address gotten some other way, the DMA targets must be aligned on 16-byte boundaries (alignment on 128-byte cache lines is better still for performance).
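The chunk-at-a-time idea can be sketched in portable C. This is not SPE code: memcpy() stands in for the DMA GET, a local array plays the role of a local-store buffer, and the 16KB chunk size is my assumption, chosen because a single DMA command moves at most 16KB. The function names are made up for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_BYTES (16 * 1024)  /* assumed chunk size: one DMA command's worth */

/* Stand-in for the real work done on one chunk; here it just sums bytes. */
static unsigned long process_chunk(const unsigned char *chunk, size_t len)
{
    unsigned long sum = 0;
    size_t i;
    for (i = 0; i < len; i++)
        sum += chunk[i];
    return sum;
}

/* Walk a large buffer one chunk at a time, the way an SPE program would
 * walk a region of main memory that is too big for local store. */
unsigned long process_in_chunks(const unsigned char *data, size_t total,
                                size_t *chunks_out)
{
    unsigned char local[CHUNK_BYTES];  /* plays the role of local store */
    unsigned long sum = 0;
    size_t offset = 0, chunks = 0;

    while (offset < total) {
        size_t len = total - offset;
        if (len > CHUNK_BYTES)
            len = CHUNK_BYTES;
        memcpy(local, data + offset, len);  /* stand-in for the DMA GET */
        sum += process_chunk(local, len);
        offset += len;
        chunks++;
    }
    if (chunks_out)
        *chunks_out = chunks;
    return sum;
}
```

On real hardware the last, partial chunk would also have to respect the DMA size rules, so a real program would usually round the buffer size up to a multiple of 16 bytes.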
Finally, the main function can return a status to the calling environment
in a way familiar to most C programmers; the return value of the main()
function is passed back to the calling environment. The result is a moderately
simplistic, but sufficient, interface for handling a pretty good range of
problems.
You have a lot of flexibility in how you move data to and from an SPE. As you might expect, the simplest models are not the highest performers; however, they're a good starting point for understanding the problem. Furthermore, if simple code is "fast enough," you might as well use it during your prototyping stage. In particular, if you can simplify communication with the SPE, then you can focus your debugging effort on the actual calculations, which is a good thing.
The simplest usage is to have the SPE program take two pointers: one for an input buffer and one for an output buffer. The SPE program reads in the input buffer, processes the data, and then writes it to the output buffer. If the data set is small enough to fit into available local store, you don't need a more elaborate setup. You could even just use a single pointer and write data back into the buffer you got it from, although this might be a nuisance to program for on the other end. Paired buffers are easier to keep track of.
For
now, though, I'm going to present a fairly minimalist sample program which
uses the SPE to perform a memcpy() operation.
Just as a simpler communications protocol lets you focus on
your algorithm, a simpler algorithm lets you focus on the communications
protocol. Here's the algorithm:
Listing 3. Maybe not the best use ever of the SPE's talents
memcpy(output, input, spe_args.count * sizeof(int));
Well, the obvious question here is, what's spe_args? Back to
the API: The PPE
can pass a single pointer into the program, so multiple arguments have to be
bundled up. My solution is this:
Listing 4. A rigorous argument
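typedef struct {
    int *input;            /* address of the source buffer in main memory */
    int *output;           /* address of the destination buffer           */
    int count;             /* number of ints to copy                      */
    unsigned char pad[4];  /* pad to 16 bytes (assumes 4-byte pointers)   */
} spe_arg_t;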
This uncovers a couple of quirks of the API. In particular, note that the structure is padded out to 16 bytes; the DMA engine works in 16-byte units for transfers of this size, so a structure whose size is a multiple of 16 is the easiest to move. (The arithmetic assumes 4-byte pointers, as in a 32-bit PPE program.) Apart from that, we have a pretty straightforward structure definition: two pointers and a number of objects to copy. By the way, this is not an API type; this is a user-defined type invented for this article. You can -- and have to -- make up your own.
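Because the size of a structure like this depends on pointer width, one defensive option is to carry the addresses as fixed-width integers and let the compiler verify the size. The following is a sketch under that assumption; spe_arg_fixed_t and its field names are hypothetical and are not used elsewhere in this article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant of the argument structure: addresses travel as
 * 64-bit integers rather than pointers, so the layout does not depend on
 * whether the PPE program is built as 32-bit or 64-bit. */
typedef struct {
    uint64_t input_ea;   /* effective address of the input buffer  */
    uint64_t output_ea;  /* effective address of the output buffer */
    int32_t  count;      /* number of ints to copy                 */
    uint8_t  pad[12];    /* pad the structure out to 32 bytes      */
} spe_arg_fixed_t;

/* Fail the build, rather than the DMA, if the size drifts. */
_Static_assert(sizeof(spe_arg_fixed_t) % 16 == 0,
               "argument block must be a multiple of 16 bytes");
```

The cost is a few explicit casts when filling in and using the addresses; the benefit is that both sides agree on the layout by construction.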
The SPE program is fairly short. As formatted originally, it's about 36 lines of
code. (I have adapted the formatting a bit for display in the article.)
I'll go through it one section at a time. Note that there's only one
user-written function, main(), in the whole thing. I start out with a few
headers:
Listing 5. Headers
#include "dma_1.h"
#include <spu_mfcio.h>
#include <stdio.h>
#include <stdlib.h>    /* malloc(), used when allocating the buffers */
#include <string.h>
The "dma_1.h" file contains the spe_arg_t typedef which is shared between
the SPE code and the PPE code. The API for interacting with the memory flow
controller (MFC) is provided by the mfc headers. Error messages (used in
debugging) are printed
to standard error, using stdio. This is a great idea when debugging, but you
probably won't want to use it in production code. Finally, memcpy() is
declared in <string.h>. Onto the actual code:
Listing 6. I hear your arguments
int
main(unsigned long long id, unsigned long long argp)
{
    spe_arg_t spe_args;
    int *input, *output;

    /* Fetch the argument structure from main memory and wait for it. */
    spu_writech(MFC_WrTagMask, -1);
    spu_mfcdma32(&spe_args,
                 (unsigned int) argp,
                 sizeof(spe_args),
                 0,
                 MFC_GET_CMD);
    (void)spu_mfcstat(2);

    /* Validate the arguments before allocating anything. */
    if (spe_args.count < 0) {
        fputs("Unable to perform any task a negative number "
              "of times. Buy a quantum computer.\n",
              stderr);
        return 1;
    }
Getting arguments is pretty straightforward. The spu_writech() call selects
which DMA tag groups to watch, and the spu_mfcstat() call waits until the
transfers in those groups have completed.
The spu_mfcdma32() function allows DMA from a 32-bit subset of the PPE's
theoretical 64-bit address space, which is plenty for our purposes. The
arguments are the local address, the remote address, the number of bytes
to transmit, a tag ID (used to group transfers when waiting for completion),
and the command to perform (in this case, GET).
When the DMA transfer is complete, the spe_arg_t structure passed as an
argument by the PPE program has been copied into the SPE program for argument
validation.
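The fourth argument to spu_mfcdma32() (0 in the listings) names a tag group, and the value written to MFC_WrTagMask is a bit mask with one bit per tag group. A small sketch of that relationship follows; the helper names and the NUM_TAG_GROUPS constant are mine, chosen for illustration, and I'm assuming the usual 32 tag groups numbered 0 through 31.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_TAG_GROUPS 32  /* assumed: tag IDs run from 0 to 31 */

/* Mask selecting a single tag group; tag 0 gives 1 << 0, as in Listing 2. */
uint32_t tag_mask(unsigned int tag)
{
    assert(tag < NUM_TAG_GROUPS);
    return (uint32_t)1 << tag;
}

/* Mask selecting every tag group; same bits as the -1 used in Listing 6. */
uint32_t all_tags_mask(void)
{
    return 0xFFFFFFFFu;
}
```

Masks can be OR-ed together, so a program that issues transfers under tags 0 and 5 could wait on both with tag_mask(0) | tag_mask(5).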
Listing 7. Getting down to brass tacks
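    /* Allocate local-store buffers for the data. */
    input = malloc(spe_args.count * sizeof(int));
    output = malloc(spe_args.count * sizeof(int));

    /* GET the input from main memory and wait for it to arrive. */
    spu_writech(MFC_WrTagMask, 1 << 0);
    spu_mfcdma32(input,
                 (unsigned int) spe_args.input,
                 spe_args.count * sizeof(int),
                 0,
                 MFC_GET_CMD);
    (void)spu_mfcstat(2);

    /* The "algorithm." */
    memcpy(output, input, spe_args.count * sizeof(int));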
Having found out how many items the PPE needs copied, the SPE program can allocate local storage, read the items in (using the same code as before), and then perform its algorithm, reviewed above. No problem here. Now how do you send the data back?
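One thing the listing takes on faith is that the whole transfer is legal as a single DMA command. As I understand the rules from the Cell/B.E. architecture documentation, a single command moves 1, 2, 4, or 8 bytes, or a multiple of 16 bytes up to 16KB, with matching alignment on both ends. The checker below is a sketch of those rules in portable C; dma_request_ok() is a hypothetical helper, not an SDK function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DMA_MAX_BYTES (16 * 1024)  /* largest single DMA command */

/* Returns 1 if one DMA command with these parameters satisfies the size
 * and alignment rules sketched above, 0 otherwise. A size of 0 is
 * rejected here simply because there is nothing to transfer. */
int dma_request_ok(uint32_t ls_addr, uint64_t ea, size_t size)
{
    if (size == 0 || size > DMA_MAX_BYTES)
        return 0;
    if (size == 1 || size == 2 || size == 4 || size == 8) {
        /* Small transfers: naturally aligned, and both addresses must
         * sit at the same offset within a 16-byte quadword. */
        return (ls_addr % size == 0) && (ea % size == 0)
            && ((ls_addr & 15) == (ea & 15));
    }
    if (size % 16 != 0)
        return 0;
    /* Larger transfers: both ends on 16-byte boundaries. */
    return (ls_addr % 16 == 0) && (ea % 16 == 0);
}
```

With count set to 1024, the transfer is 4096 bytes, which passes as long as both buffers are 16-byte aligned. A count above 4096 ints would exceed 16KB and would need to be split into several commands.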
Listing 8. Enough with the arguments, how about some backtalk?
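    /* PUT the output buffer back into main memory and wait for it. */
    spu_writech(MFC_WrTagMask, 1 << 0);
    spu_mfcdma32(output,
                 (unsigned int) spe_args.output,
                 spe_args.count * sizeof(int),
                 0,
                 MFC_PUT_CMD);
    (void)spu_mfcstat(2);

    /* The exit status also travels back to the PPE. */
    return 0;
}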
Yes, it really is that easy. Note that there are two communications with the
PPE in this code fragment, not one! The PUT command sends back the contents
of the output buffer, but the return statement at the end of main() also sends
back data; the status generated will be passed back to the PPE.
The PPE program is conceptually not much more elaborate, but it exposes some features (or perhaps quirks) of SPE/PPE communications. The core of the program looks roughly like this:
Listing 9. And I say to the SPE, copy, and it copieth
extern spe_program_handle_t spu_prog;
spe_context_ptr_t context;
unsigned int entry = SPE_DEFAULT_ENTRY;
spe_arg_t spe_args;
int status;
int *input;
int *output;
/* initialize input and output */
[...]
/* fill in spe_args with two pointers and a count */
spe_args.input = input;
spe_args.output = output;
spe_args.count = 1024;
/* create a context, load the embedded program, and run it to completion */
context = spe_context_create(0, 0);
spe_program_load(context, &spu_prog);
status = spe_context_run(context, &entry, 0, &spe_args, NULL, 0);
The address (in PPE address space) of the spe_args structure is passed in
as an argument which will eventually become the argp argument of
the SPE program's main. The spu_prog object is created by the ppu-embedspu
program and consists of a PPE ELF object containing an SPE
executable and some metadata about it.
The NULL argument to spe_context_run would become a third argument
to the main function on the SPE; there's no need for it in this example.
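The spe_context_run function runs the program until it stops. Its return
value, stored into the variable status in the PPE program, is 0 when the
SPE program ran to a normal exit and -1 on error; the value returned from
main() in the SPE program is recorded in the spe_stop_info_t structure if
you pass one as the last argument (this example passes 0, so the exit
status is discarded). Previous versions of libspe used a threaded
system in which you invoked a thread and then waited for it. In libspe2,
the operation is synchronous, so if you want to run a thread in the
background, you have to make your own pthread calls.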
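The "make your own pthread calls" pattern looks roughly like the sketch below. To keep it runnable anywhere, the blocking call is a placeholder function, run_blocking(), standing in for where spe_context_run() would go; worker_t, MAX_WORKERS, and the other names are likewise invented for this sketch.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_WORKERS 8  /* assumed upper bound: one worker per SPE */

/* Per-thread bookkeeping; the field names are made up for this sketch. */
typedef struct {
    int id;      /* which worker this is                      */
    int status;  /* filled in when the blocking call returns  */
} worker_t;

/* Placeholder for the blocking call. In a real PPE program this is where
 * the context would be run; here it just computes a value. */
static int run_blocking(int id)
{
    return id * 10;
}

static void *worker_main(void *arg)
{
    worker_t *w = arg;
    w->status = run_blocking(w->id);
    return NULL;
}

/* Start n workers, let them run concurrently, then wait for all of them.
 * Returns 0 on success, -1 if n is out of range or a thread call fails. */
int run_workers(worker_t *workers, int n)
{
    pthread_t threads[MAX_WORKERS];
    int i, started = 0, rc = 0;

    if (n < 0 || n > MAX_WORKERS)
        return -1;
    for (i = 0; i < n; i++) {
        workers[i].id = i;
        workers[i].status = -1;
        if (pthread_create(&threads[i], NULL, worker_main, &workers[i]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }
    for (i = 0; i < started; i++) {
        if (pthread_join(threads[i], NULL) != 0)
            rc = -1;
    }
    return rc;
}
```

Each thread blocks in its own call while the others proceed, which is exactly the behavior the synchronous API leaves you to arrange for yourself.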
The initialization of the "input" and "output" objects was omitted from the above sample code for a good reason: It's more complicated than the rest of the code.
In fact, the code above won't quite work: It gets a bus error when the SPE
tries to retrieve its arguments. The problem is that there are unusually
strict alignment requirements on data that will be transferred through DMA.
Even the alignment guaranteed by malloc() may not be strict enough. Variables
can be declared using gcc's __attribute__ extension to compel alignment:
Listing 10. Lawful neutral
spe_arg_t spe_args __attribute__ ((aligned (16)));
Allocated memory is a bit trickier. You can't tell in advance what alignment an allocated region will have, so your best bet is to allocate a bit of extra storage and get an aligned region within it. Here's one way to do that:
Listing 11. Generating an aligned pointer
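#define ALIGN 16  /* must be a power of two */

intptr_t addrfixup;
int *input, *input_save;
/* over-allocate by ALIGN bytes so an aligned region always fits */
input_save = malloc(count * sizeof(int) + ALIGN);
addrfixup = (intptr_t) input_save;
addrfixup &= ~(ALIGN-1);   /* round down to an ALIGN boundary */
addrfixup += ALIGN;        /* step up to the next boundary    */
input = (int *) addrfixup;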
This code is a little disconcerting. It carries the hidden assumption
that ALIGN is a power of two; 16 works just fine for DMA to
the SPE. For any power of two, though, it guarantees that whatever the
starting address, the result is aligned to a multiple of that many bytes
and still lies inside the allocated block with room for all count items.
The assumptions made about
pointer/integer conversions aren't perfectly portable, but then neither
are "DMA operations to the SPE." The original pointer has to be saved so
it can be freed later.
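If your C library provides the POSIX posix_memalign() function, you can let the allocator do the alignment instead. This is a sketch under that assumption (I have not checked which SDK toolchains ship it); alloc_aligned_ints(), is_dma_aligned(), and DMA_ALIGN are names invented here.

```c
#define _POSIX_C_SOURCE 200112L  /* ask for the posix_memalign() declaration */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DMA_ALIGN 16  /* power of two, and a multiple of sizeof(void *) */

/* Allocate room for count ints on a DMA_ALIGN-byte boundary.
 * Returns NULL on failure; release the memory with free(). */
int *alloc_aligned_ints(size_t count)
{
    void *p = NULL;
    if (posix_memalign(&p, DMA_ALIGN, count * sizeof(int)) != 0)
        return NULL;
    return p;
}

/* Check that a pointer is suitably aligned before handing it to the SPE. */
int is_dma_aligned(const void *p)
{
    return ((uintptr_t)p % DMA_ALIGN) == 0;
}
```

The pointer that comes back is the one you pass to free(), so there is no second "save" pointer to keep track of.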
With all that done, though, this is a viable framework for getting data to and from an SPE. When you want to do unit tests on SPE code or algorithms or get timing information on them, this is probably enough. It's not particularly efficient for a number of reasons, not the least of which is that running a task on only a single SPE and waiting for it to complete probably saves you no time at all. Similarly, the lack of buffering on the SPE end means that all the data has to be copied before the SPE even starts computing, and none of the output is copied back until the computations are finished. But this provides a framework for interacting with the SPEs and shows the basics of how the API works.
As the series continues, I'll update, tweak, and in some cases outright replace this framework, but the basic model of loading code to an SPE will generally serve.
While DMA to and from the SPE is adequate for moving blocks of data, there's often a need for the ability to send smaller messages faster. This is done using mailboxes, special purpose registers available to the SPEs and to the PPE. The next article in this series looks at mailboxes and how they can be used to send data back and forth.
Learn
- Use an RSS feed to request notification for the upcoming articles in this series. (Find out more about RSS feeds of developerWorks content.)
- Check out the other articles in the "Little broadband engine that could" series.
- The Unrolling AltiVec series (developerWorks, 2005) is an oldie but goodie that exposes you to the various guises of this vector processing SIMD technology.
- Jonathan Bartlett's series on Programming high performance applications on the Cell/B.E. processor (developerWorks, January 2007 to present) provides an intro to Linux on the PS3, programming the PS3's SPE, an intro to the SPU, SPU performance programming, C/C++ SPU programming, and managing smart buffer DMA transfers.
- You know, to use the Cell/B.E. SDK 2.1, you'll have to be running Fedora Core 6 -- this quick install guide (developerWorks, April 2007) should help you get FC6 up and running.
- The IBM Semiconductor Solutions Technical Library Cell Broadband Engine documentation section contains a wealth of downloadable manuals, specifications, and much more.
- Find all Cell/B.E.-related articles, discussion forums, downloads, and more at the IBM developerWorks Cell Broadband Engine resource center: your definitive resource for all things Cell/B.E.
- The IBM microNews newsletter delivers Cell/B.E. happenings to your desktop twice a month.
Get products and technologies
- Get Cell: Contact IBM about custom Cell-based or custom-processor based solutions.
- Get the alphaWorks Cell Broadband Engine downloads -- including the IBM Full System Simulator, support libraries, toolchains, source code for libraries and samples.
Discuss
- Participate in the discussion forum.
- Post your question to the IBM developerWorks Power Architecture Cell Broadband Engine discussion forum.





