Previous articles in the series reviewed the IDL tool included with the Cell Broadband Engine SDK in versions up through 2.1. In version 3.0, though, the IDL tool is no longer provided. Instead, you get comparable functionality through the Data Communication and Synchronization library (DaCS), a set of services designed to simplify the development of distributed applications on Cell/B.E. (or similar) processors. The version of the library distributed with the SDK is moderately specific to the Cell/B.E. environment, but it is well suited to program execution in this environment.
DaCS provides abstractions supporting development models in which the PPE assigns tasks to SPE units which then go about these tasks asynchronously. The DaCS library defines a variety of DaCS elements (DEs), which can be divided into:
- Host elements (HEs)
- Accelerator elements (AEs)
For purposes of Cell/B.E. development, you can consider these as abstractions of the PPE (a host element) and the SPEs (accelerator elements). On a typical Cell/B.E. blade server with two processors, it is possible to allocate a single blade with 16 SPE children or to allocate two CBE devices, each with eight children. The latter view might allow better control of processor affinity at some cost in complexity. In the Cell/B.E. SDK running on a PlayStation3 system, it is possible to allocate only the 6 available SPE children; the higher-level framework is not available.
While DaCS has abstractions allowing for multiple processes on a single DE, the SDK currently supports only a single process on each element. Each process is started in the usual way for SPE programs, and it can then initialize the SPU side of the DaCS library, allowing communication with the library running on the PPE side.
The simplest way to set up an SPE process using DaCS will be eerily familiar to anyone who's used libspe, because the DaCS library uses the same embedded binaries. An SPE program compiled with spu-gcc can be turned into an embedded object using the ppu-embedspu program, and then it can be linked into a program using DaCS. Setup is slightly more complicated than with libspe, although you don't need to manage your own threading. DaCS handles this transparently.
Listing 1. Running a program on an SPE
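#include <libspe2.h>  /* spe_program_handle_t */
#include <dacs.h>

extern spe_program_handle_t spu_prog;  /* SPE binary embedded with ppu-embedspu */
int main(void)
{
    de_id_t spes[16];
    dacs_process_id_t pids[16];
    uint32_t children;
    int32_t status;

    dacs_runtime_init(NULL, NULL);
    /* Find out how many SPEs are free, then try to reserve all of them. */
    dacs_get_num_avail_children(DACS_DE_SPE, &children);
    dacs_reserve_children(DACS_DE_SPE, &children, spes);
    /* Start the embedded program on the first SPE, with no argv or envp. */
    dacs_de_start(spes[0], &spu_prog, NULL, NULL,
                  DACS_PROC_EMBEDDED, &pids[0]);
    /* Block until it exits; status receives the return value of its main(). */
    dacs_de_wait(spes[0], pids[0], &status);
    dacs_runtime_exit();
    return 0;
}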
This program does the following:
- It queries the number of available children (the example assumes that at least one is available).
- It attempts to reserve all of those children.
- It begins a program on the first one.
- It waits for the program to terminate.
- It exits the DaCS runtime.
It's not a complicated program. The omission of error-checking is
strictly for introductory purposes. In real code, you would want to check all of
these operations carefully. The spu_prog handle is the
same sort of embedded SPU code that was used in
previous articles in the series,
although the DaCS library obliges you to change from
-m32 to -m64 when creating
embedded binaries.
The status value is filled in with the return value of the
main function of the SPE program. The two NULL
arguments immediately after &spu_prog are
passed in as argv and envp, respectively.
This is the simplest case: a program that runs entirely on the data provided in its arguments. However, the real strength of DaCS lies in manipulating data once the child program is up and running.
Understanding the general principles
There are a number of common patterns throughout the entire DaCS API. The most significant patterns are:
- The way the API handles return values
- The need to consistently release resources
- The alignment requirements across all the API calls
The example code in Listing 1 omits error checking for brevity. As a result, it
never uses the return values of any DaCS functions. Each function in the DaCS
library returns only a success or failure indicator. Any data generated by a
function are stored into objects pointed to by its arguments. For example, to
get a count of available children, you can call
dacs_get_num_avail_children(DACS_DE_SPE, &children).
The address of children is passed to the function, which
modifies the object through this pointer. The return value is usually the
constant DACS_SUCCESS, indicating a successful
operation, but it could also be an error code.
This setup avoids the quirk of needing special sentinel values to indicate error returns. This setup also provides greater consistency across the API. On the other hand, it can be a little confusing at times, and there are a number of cases where you need to pass the address of an object to one function but the object itself to another. Keep a close eye on that. Don't just ignore compiler warnings about type mismatches!
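The status-plus-out-parameter convention can be sketched without the DaCS headers. In this minimal sketch, demo_err_t, demo_get_num_avail, query_children, and the CHECK macro are all hypothetical stand-ins invented for illustration; none of them is part of DaCS. A real DaCS call would be compared against DACS_SUCCESS in the same way.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical status codes standing in for a library's error type. */
typedef enum { DEMO_SUCCESS = 0, DEMO_ERR_INVALID_ADDR = -1 } demo_err_t;

/* Hypothetical query: the result travels through *count, and the
 * return value only reports success or failure. */
static demo_err_t demo_get_num_avail(uint32_t *count)
{
    if (count == NULL)
        return DEMO_ERR_INVALID_ADDR;
    *count = 6;   /* pretend six accelerators are free */
    return DEMO_SUCCESS;
}

/* Evaluate a call once; on failure, report it and return the error
 * code from the enclosing function. */
#define CHECK(call)                                               \
    do {                                                          \
        demo_err_t err_ = (call);                                 \
        if (err_ != DEMO_SUCCESS) {                               \
            fprintf(stderr, "%s failed: %d\n", #call, (int)err_); \
            return err_;                                          \
        }                                                         \
    } while (0)

/* The caller passes the address of its own variable to receive the
 * count, then hands the value (not the address) onward. */
static demo_err_t query_children(uint32_t *out)
{
    uint32_t children = 0;
    CHECK(demo_get_num_avail(&children));
    *out = children;
    return DEMO_SUCCESS;
}
```

Wrapping every call in a macro like this keeps the checks from drowning out the logic, which makes it less tempting to skip them.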
Allocating and releasing resources
Each resource allocation API call has a corresponding call to release the
resource. These calls are not optional, even if your program is about to exit. In
fact, dacs_runtime_exit() can hang if resources have
not been freed correctly! So, if you create a remote memory region with
dacs_remote_mem_create, you must destroy it with
dacs_remote_mem_destroy when you are done with it. If
you have accepted access to a memory region with
dacs_remote_mem_accept, you must release it with
dacs_remote_mem_release when you are done. The
dacs_remote_mem_destroy call blocks until every client
that accepted the region has released it.
While this might sound fairly intrusive, it's not bad in practice (although returning an error condition would be preferable to blocking). These calls tend to be easy to get right, because real code usually has a clearly defined point at which a resource is no longer in use.
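One common C idiom for keeping acquire and release calls paired is to release in reverse order through cleanup labels, so that every exit path runs the matching release. The sketch below assumes nothing about DaCS itself: region_create, region_destroy, wid_reserve, and wid_release are hypothetical stand-ins that only count outstanding resources.

```c
#include <assert.h>

static int live_regions = 0;   /* outstanding hypothetical regions  */
static int live_wids = 0;      /* outstanding hypothetical wait ids */

/* Hypothetical stand-ins; each returns 0 on success, nonzero on failure. */
static int region_create(int *r)  { *r = ++live_regions; return 0; }
static int region_destroy(int *r) { (void)r; --live_regions; return 0; }
static int wid_release(int *w)    { (void)w; --live_wids; return 0; }
static int wid_reserve(int *w, int fail)
{
    if (fail)
        return -1;             /* simulate a failed reservation */
    *w = ++live_wids;
    return 0;
}

/* Acquire two resources, do some work, and release them in reverse
 * order. A failure part-way through still releases what was taken. */
static int do_transfer(int fail_second)
{
    int region, wid, rc;

    rc = region_create(&region);
    if (rc != 0)
        return rc;
    rc = wid_reserve(&wid, fail_second);
    if (rc != 0)
        goto destroy_region;

    /* ... the actual work would go here ... */

    wid_release(&wid);
destroy_region:
    region_destroy(&region);
    return rc;
}
```

After do_transfer returns, both counters are back at zero whether or not the second acquisition failed, which is the property you want before calling a shutdown function that may hang on leaked resources.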
Nearly everything needs to be aligned suitably for the processor, which generally means aligned to a 16-byte boundary. Misaligned objects don't always have the effects you'd expect, and sometimes they aren't detected gracefully. If you start seeing mysterious crashes, look for unaligned accesses, and not necessarily on the DE that's crashing! For example, a misaligned object on the PPE can cause an SPE to crash. (Worse yet, changing the alignment of other objects on the PPE can influence the alignment of the object in question.)
While gcc's __attribute__ ((aligned (16))) is your
friend, it does not necessarily have any effect (or at least, the desired effect)
on stack variables. This means that in some cases, you will have to declare objects
globally or allocate memory for them. Unfortunately, the alignment requirements
don't mesh well with object sizes. For example, you can't simply declare an array
of N integers to use as arguments for dacs_send
messages, because if the first item in the array is aligned correctly, later ones
won't be!
The documentation doesn't really give much detail for the alignment requirements. In practice, just align everything to 16 bytes, and you should be fine.
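One way around the array problem is to pad each element to a full 16-byte slot. The following is a minimal sketch using GCC's aligned attribute on a type rather than on a variable; aligned_word_t, args, plain, and is_aligned16 are names made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* A 32-bit payload padded out to a full 16-byte slot. Putting the
 * attribute on the type raises sizeof() to 16, so every array
 * element starts on a 16-byte boundary, not just the first one. */
typedef struct {
    uint32_t value;
} __attribute__((aligned(16))) aligned_word_t;

/* File-scope storage, sidestepping any doubt about stack alignment. */
static aligned_word_t args[4];

/* For contrast: only the first element of this array is aligned. */
static uint32_t plain[4] __attribute__((aligned(16)));

/* Nonzero if p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

Here is_aligned16(&args[i]) holds for every i, while is_aligned16(&plain[1]) does not, because plain[1] sits only 4 bytes past an aligned address. The cost is 12 bytes of padding per value.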
Regarding communication and memory access
The SPEs have local storage, but they have no direct access to the main memory the PPE uses. Most readers are probably aware of this, but it bears emphasis because all non-trivial Cell/B.E. programming ends up involving a fair amount of code to shuffle data. With IDL, data were moved automatically by the library without explicit operations, much as data are sent to a remote system using a remote procedure call (RPC) interface. In DaCS, data management is somewhat more explicit.
DaCS provides functions to send and receive messages, and it also provides a family of functions to share regions of memory between elements. In general, DaCS provides basic tools as opposed to a complete protocol. Operations are asynchronous, and overall structure is somewhat subject to negotiation. In general, you can send messages and then wait for confirmation of their receipt. What this means is that both sides need to cooperate. If you send a message and wait for the other side to receive it but the other side never receives it, you will wait forever.
Many of the communication functions offer byte-swapping primitives. While these
are doubtless of significant importance in a mixed environment, such as an Intel-based
or AMD-based server using Cell/B.E. blade servers, the only code you need to know
to make effective use of a pure-Cell/B.E. system is
DACS_BYTE_SWAP_DISABLE.
Wait identifiers and transfer completion
While wait identifiers in and of themselves don't do much, you can't use any of the communication systems without understanding wait identifiers. The message-passing functions and memory transfer functions defined by DaCS happen asynchronously. Rather than requiring a different test or wait protocol for each, they use a common facility called a wait identifier. A wait identifier, once allocated (or reserved), is passed as an argument to a communication function, and it can then be queried to see whether a given communication has completed. Note that a wait identifier can be used only for a transaction initiated on the local DE. You can't query someone else's wait identifiers.
Wait identifiers standardize the interaction with a variety of functions.
Whether you're calling dacs_put or
dacs_recv, you pass in a wait identifier to the
function, then you call dacs_test to see whether it's done
or dacs_wait to block until it completes.
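The life cycle of a wait identifier (reserve, start an operation, test or wait, release) can be modeled in a few lines. This is a toy simulation, not DaCS code: every toy_ name is hypothetical, and the "transfer" is just a counter that advances each time it is polled.

```c
#include <assert.h>

#define TOY_MAX_WIDS 4

/* Toy model: each wait id holds the number of steps left before its
 * outstanding operation completes. -1 marks an unreserved slot. */
static int toy_pending[TOY_MAX_WIDS] = { -1, -1, -1, -1 };

/* Reserve a free wait id; returns 0 and stores the id, or -1 if none. */
static int toy_wid_reserve(int *wid)
{
    for (int i = 0; i < TOY_MAX_WIDS; i++) {
        if (toy_pending[i] < 0) {
            toy_pending[i] = 0;
            *wid = i;
            return 0;
        }
    }
    return -1;
}

/* Start an asynchronous operation that needs `steps` polls to finish. */
static void toy_start(int wid, int steps) { toy_pending[wid] = steps; }

/* Non-blocking check: returns 1 if done, 0 if still busy. Each poll
 * lets the simulated transfer advance by one step. */
static int toy_test(int wid)
{
    if (toy_pending[wid] > 0)
        toy_pending[wid]--;
    return toy_pending[wid] == 0;
}

/* Blocking wait: poll until the operation completes; returns the
 * number of polls it took. */
static int toy_wait(int wid)
{
    int polls = 1;
    while (!toy_test(wid))
        polls++;
    return polls;
}

/* Give the wait id back once nothing is outstanding on it. */
static void toy_wid_release(int wid) { toy_pending[wid] = -1; }
```

The point of the model is the shape of the calling code: the same test and wait functions serve any kind of operation, because the operation is identified only by the wait id that was handed to it.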
For large blocks of data, your best choice by far is access to remote memory. The procedure seems a little convoluted at first, but it works well in practice. In particular, the Cell/B.E. processor does not have a shared memory map across all cores. The SPEs have local store, and they simply can't address main memory. Similarly, there is generally no way to address the local store of an SPE from the PPE. The only way to access remote memory is through DMA transfers.
DaCS abstracts this to the concept of a memory region, which is created on
the DE that actually has the memory, and then it's shared out to another DE. For
example, the PPE could share out a chunk of system main memory that would then be
accepted by one of the SPEs. Once a chunk of memory has been accepted, it can be
accessed through the dacs_put and
dacs_get functions (or their corresponding list
equivalents, which can perform multiple transfers).
A typical use of these might be to allocate a number of such buffers on the PPE and then share them out to the SPEs. The SPEs can then copy data out of these buffers as needed. It is possible for more than one SPE to have access to the same shared memory region. If you don't feel like tracking a large number of these regions, you could, in theory, just make a single large one and do reads and writes to different parts of it. It might be easier to understand, however, if you provide multiple separate buffers.
For smaller chunks of data, DaCS provides a message passing system that allows
you to send and receive messages. Messages are sent asynchronously, and a
dacs_wait on the corresponding wait identifier blocks
until the remote side has received the message.
A particular caveat: if the receiver's specified buffer isn't large enough to hold the whole message, the operation might fail silently. (Ideally, a future release of the library should yield an error, rather than failing silently.) Nonetheless, the simplest thing to do is to ensure that you never send messages larger than the receive buffers that you plan to use. A standard fixed message size might be the best choice.
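A fixed message size is easy to enforce at compile time. In this sketch, MSG_SIZE, fixed_msg_t, and msg_fill are hypothetical names, and 64 bytes is an arbitrary choice; the only assumptions carried over from the text are that buffers should be 16-byte aligned and that sender and receiver must agree on the size.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MSG_SIZE 64   /* assumed fixed size, a multiple of 16 */

/* One message: a small header plus payload, exactly MSG_SIZE bytes.
 * If both sides share this definition, the receive buffer is always
 * as large as anything the sender can produce. */
typedef struct {
    uint32_t kind;
    uint32_t length;                  /* payload bytes actually used */
    uint8_t  payload[MSG_SIZE - 8];
} __attribute__((aligned(16))) fixed_msg_t;

_Static_assert(sizeof(fixed_msg_t) == MSG_SIZE,
               "message must be exactly MSG_SIZE bytes");

/* Fill a message; refuses payloads that would not fit.
 * Returns 0 on success, -1 if the payload is too large. */
static int msg_fill(fixed_msg_t *m, uint32_t kind,
                    const void *data, uint32_t len)
{
    if (len > sizeof m->payload)
        return -1;
    memset(m, 0, sizeof *m);
    m->kind = kind;
    m->length = len;
    memcpy(m->payload, data, len);
    return 0;
}
```

Oversized payloads are rejected on the sending side, where the failure is visible, instead of being discovered (or not) on the receiving side.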
Messages are sent on streams, which are 32-bit values used to identify the
kind of message. A later dacs_recv call can check for
only messages on a particular stream, or it can specify the magic value
DACS_STREAM_ALL to accept any messages that are
waiting. (Stream identifiers must be between 0 and the predefined constant
DACS_STREAM_UB, inclusive.)
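Stream matching amounts to filtering a queue by tag. The toy model below is not DaCS code: TOY_STREAM_ALL, toy_send, and toy_recv are hypothetical, and the wildcard value is made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_STREAM_ALL 0xFFFFFFFFu   /* hypothetical wildcard value */
#define TOY_QUEUE_LEN  8

typedef struct {
    uint32_t stream;   /* tag identifying the kind of message */
    uint32_t value;
    int      used;     /* nonzero while the slot holds a message */
} toy_msg_t;

static toy_msg_t toy_queue[TOY_QUEUE_LEN];

/* Queue a value tagged with a stream id; returns 0, or -1 if full. */
static int toy_send(uint32_t stream, uint32_t value)
{
    for (int i = 0; i < TOY_QUEUE_LEN; i++) {
        if (!toy_queue[i].used) {
            toy_queue[i] = (toy_msg_t){ stream, value, 1 };
            return 0;
        }
    }
    return -1;
}

/* Take a queued message whose stream matches `stream`, or any
 * message when `stream` is TOY_STREAM_ALL. Returns 0 and stores the
 * value, or -1 if nothing matches. */
static int toy_recv(uint32_t stream, uint32_t *value)
{
    for (int i = 0; i < TOY_QUEUE_LEN; i++) {
        if (toy_queue[i].used &&
            (stream == TOY_STREAM_ALL || toy_queue[i].stream == stream)) {
            *value = toy_queue[i].value;
            toy_queue[i].used = 0;
            return 0;
        }
    }
    return -1;
}
```

A receiver that asks for stream 2 skips a waiting stream-1 message, and a later wildcard receive picks that message up.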
The DaCS library provides an abstraction of the mailbox facility on the Cell/B.E. system. If you are sending simple 32-bit values, mailboxes can offer substantial improvements in efficiency. Furthermore, the alignment requirements are somewhat looser, so you can send individual members out of an array of 32-bit values without worrying about alignment. The obvious weakness of mailboxes is that their operations always block; you can't perform them asynchronously. Mailboxes also can't perform byte swapping.
What you can do, then, is to use dacs_mailbox_test to see
whether a read or write operation will block. Remember though: As with most such
operations, this is subject to race conditions. If all six of your child DEs check
at the same time whether they can write without blocking, they might end up all
thinking they can and all writing at once.
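The check-then-act race is easy to reproduce in miniature. In this toy model (all names hypothetical, and a single-threaded stand-in for what would really be concurrent SPEs), two writers both probe a mailbox that has one free slot before either of them writes.

```c
#include <assert.h>

/* Toy mailbox with room for exactly one more entry. */
static int toy_free_slots = 1;

/* Non-blocking probe: nonzero if a write would succeed right now. */
static int toy_mailbox_test(void) { return toy_free_slots > 0; }

/* Returns 0 if the write went through, 1 if a real blocking write
 * would have stalled because the mailbox filled up in the meantime. */
static int toy_mailbox_write(void)
{
    if (toy_free_slots == 0)
        return 1;
    toy_free_slots--;
    return 0;
}

/* Two writers probe before either writes, so both see free space.
 * Returns how many of the two writes would have blocked. */
static int toy_race(void)
{
    int a_ok = toy_mailbox_test();   /* writer A checks */
    int b_ok = toy_mailbox_test();   /* writer B checks */
    int blocked = 0;

    if (a_ok)
        blocked += toy_mailbox_write();
    if (b_ok)
        blocked += toy_mailbox_write();
    return blocked;
}
```

Both probes report free space, yet one of the two writes finds the mailbox full. A successful probe is a hint about the recent past, not a guarantee about the write that follows.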
DaCS offers a rather breathtaking array of tools. Because they're fairly specialized, you really need to know about most of them before you can start designing a useful protocol. Without the message passing or mailbox functions, you can't make much use of the remote memory access features. Without wait identifiers, you can't do much at all except send data back and forth through mailboxes.
The next article in the series shows you how these various tools can be combined to build a DaCS-based version of the fractal program from "The little broadband engine that could: Use multiple SPEs for a single task."
Learn
- Check out the other articles in the "Little broadband engine that could" series.
- Check out two flavors of
documentation on how great DaCS is:
"DaCS for Cell BE Programmer's Guide and API Reference"
and
"DaCS for Hybrid-x86 Programmer's Guide and API Reference"
(IBM Semiconductor solutions library, October 2007).
- Read
"Maximizing the power of the Cell Broadband Engine processor: 25 tips to optimal application performance"
(developerWorks, June 2006) to learn some
SPE programming tips.
- Jonathan Bartlett's series "Programming high performance applications on the
Cell/B.E. processor" (developerWorks, January 2007 to present) provides an
intro to Linux on the PS3, programming the PS3's SPE, an intro to the SPU,
SPU performance programming, C/C++ SPU programming, and managing smart buffer
DMA transfers.
- To learn more on Cell/B.E. programming, try the
developerWorks series:
- "Programming high-performance applications on the Cell/B.E. processor"
- "PS3 fab-to-lab"
- "The little broadband engine that could"
- Speaking of Cell/B.E. SDK documentation,
there's a new blog series that abstracts important topic sections of some of the
major SDK documentation to give you a quick-read on the topic (in case you don't
need a fuller explanation) -- they're called
Infobombs,
and some topics already covered include:
- Getting a successful FC7 install on a PS3.
- The basic structure of an ALF application.
- Double buffering on ALF as an optimization.
- ALF and DaCS for x86 hybrids.
- Configuring and using the Basic Linear Algebra Subprograms.
- Glossaries on ALF and DaCS error codes, trace events, and ALF attributes.
- Refer to the Cell
Broadband Engine documentation section of the IBM Semiconductor Solutions Technical Library for a wealth of downloadable manuals,
specifications, and more.
- Find all Cell/B.E.-related articles, discussion forums, downloads,
and more at the IBM developerWorks Cell
Broadband Engine resource center: your definitive resource for all things Cell/B.E.
Discuss
- Participate in the discussion forum.
- Get fast answers from IBM experts and
real-world practitioners in developerWorks
Cell
Broadband Engine discussion forum.
- The
Cell Broadband Engine/Power Architecture notebook
is a blog-based resource that hosts
news,
as well as two instructional features: the
"Forum watch"
of interesting questions and hot topics from the forum, and the
"Infobomb"
series (short, precise, task-specific, quick-read knowledge "bombs" gleaned from
Cell/B.E. documentation).