Level: Intermediate
Peter Seebach (developerworks@seebs.net), Freelance writer, Plethora.net
04 Mar 2008
In SDK 3.0, the Data Communication and Synchronization library (DaCS)
provides a sparkling substitute for IDL. DaCS is a set of services to aid the development
of applications and application frameworks in a heterogeneous multi-tiered system.
This article takes you on a tour of the DaCS process model and
explores general DaCS principles, including communication and memory access.
Introduction
Previous articles in the series reviewed the IDL tool included with the Cell Broadband
Engine SDK versions up through 2.1. In version 3.0, though, the IDL tool
is no longer provided. Instead, you're given comparable functionality through the
Data Communication and Synchronization library (DaCS), which is a set of services designed
to simplify the development of distributed applications on the Cell/B.E.
(or similar) processors. The version of the library distributed with the SDK is
moderately specific to the Cell/B.E. environment, and it is well-suited to program
execution in this environment.
DaCS provides abstractions supporting development models in which the PPE
assigns tasks to SPE units, which then go about these tasks asynchronously. The
DaCS library defines a variety of DaCS elements (DEs), which can be divided into:
- Host elements (HEs)
- Accelerator elements (AEs)
For purposes of Cell/B.E. development, you can consider these as abstractions of
the PPE (a host element) and the SPEs (accelerator elements). On a typical
Cell/B.E. blade server with two processors, it is possible to allocate a
single blade with 16 SPE children or to allocate two CBE devices, each with eight
children.
The latter view might allow better control of processor affinity at some cost in
complexity. In the Cell/B.E. SDK running on a PlayStation3 system, it is
possible to allocate only the six available SPE children; the higher-level framework
is not available.
Introducing the process model
While DaCS has abstractions allowing for multiple processes on a single DE, the
SDK currently supports only a single process on each element. Each process is
started in the usual way for SPE programs, and it can then initialize the SPU side of
the DaCS library, allowing communication with the library running on the PPE side.
The simplest way to set up an SPE process using DaCS will be eerily familiar to
anyone who's used libspe, because the DaCS library uses the same embedded binaries.
An SPE program compiled with spu-gcc can be turned into an embedded object using
the ppu-embedspu program, and then it can be linked into a program using DaCS. Setup is
slightly more complicated than with libspe, although you don't need to manage your
own threading. DaCS handles this transparently.
Listing 1. Running a program on an SPE
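#include <libspe2.h>  /* spe_program_handle_t */
#include <dacs.h>
/* Embedded SPE program, produced by ppu-embedspu. */
extern spe_program_handle_t spu_prog;
de_id_t spes[16];
dacs_process_id_t pids[16];
uint32_t children;
int32_t status;
/* Start the DaCS runtime on the PPE side. */
dacs_runtime_init(NULL, NULL);
/* Ask how many SPE children are available, then try to reserve them all. */
dacs_get_num_avail_children(DACS_DE_SPE, &children);
dacs_reserve_children(DACS_DE_SPE, &children, spes);
/* Run the embedded program on the first SPE; argv and envp are both NULL. */
dacs_de_start(spes[0], &spu_prog, NULL, NULL,
              DACS_PROC_EMBEDDED, &pids[0]);
/* Block until it finishes; status receives the return value of its main(). */
dacs_de_wait(spes[0], pids[0], &status);
/* Shut down the DaCS runtime. */
dacs_runtime_exit();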
This program does the following:
- It queries the number of available children (the example assumes that at least one is
available).
- It attempts to reserve all of those children.
- It begins a program on the first one.
- It waits for the program to terminate.
- It exits the DaCS runtime.
It's not a complicated program. The omission of error-checking is
strictly for introductory purposes. In real code, you would want to check all of
these operations carefully. The spu_prog handle is the
same sort of embedded SPU code that was used in
previous articles in the series,
although the DaCS library obliges you to change from
-m32 to -m64 when creating
embedded binaries.
The status value is filled in with the return value of the
main function of the SPE program. The two NULL
arguments immediately after &spu_prog are
passed in as argv and envp, respectively.
This covers the simplest case, in which a program runs
entirely on the data provided in its arguments (and so on). However, the real strength
of DaCS is in manipulation of data once the child program is up and running.
Understanding the general principles
There are a number of common patterns throughout the entire DaCS API. The most
significant patterns are:
- The way the API handles return values
- The need to consistently release resources
- The alignment requirements across all the API calls
DaCS API and return values
The example code in Listing 1 omits error checking for brevity. As a result, it
never uses the return values of any DaCS functions. Each function in the DaCS
library returns only a success or failure indicator. Any data generated by a
function are stored into objects pointed to by its arguments. For example, to
get a count of available children, you can call
dacs_get_num_avail_children(DACS_DE_SPE, &children).
The address of children is passed to the function, which
modifies the object through this pointer. The return value is usually the
constant DACS_SUCCESS, indicating a successful
operation, but it could also be an error code.
This setup avoids the quirk of needing special sentinel values to indicate error
returns. This setup also provides greater consistency across the API. On the other hand, it
can be a little confusing at times, and there are a number of cases where you need
to pass the address of an object to one function but the object itself to another.
Keep a close eye on that. Don't just ignore compiler warnings about type
mismatches!
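The return-value convention described above can be sketched without the DaCS headers. In the following minimal example, every `sk_` name and both status values are hypothetical stand-ins invented for illustration; they are not the real DACS_ERR_T codes. The stand-in function stores its result through a pointer and returns only a status, and a small helper reports any failure.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical status codes for this sketch; not the real DaCS values. */
typedef enum { SK_SUCCESS = 0, SK_ERR_INVALID_ADDR = -1 } sk_status_t;

/* Models a call such as dacs_get_num_avail_children(): the result is
 * stored through the pointer, and the return value reports only
 * success or failure. */
static sk_status_t sk_get_num_children(uint32_t *out)
{
    if (out == NULL)
        return SK_ERR_INVALID_ADDR;
    *out = 6;   /* pretend six children are available */
    return SK_SUCCESS;
}

/* Print a diagnostic for a failed call, then pass the status through
 * so the caller can still branch on it. */
static sk_status_t sk_check(sk_status_t rc, const char *what)
{
    assert(what != NULL);
    if (rc != SK_SUCCESS)
        fprintf(stderr, "%s failed with status %d\n", what, (int)rc);
    return rc;
}
```

A caller would wrap each operation, for example `if (sk_check(sk_get_num_children(&n), "query children") != SK_SUCCESS) return 1;`, so that no status is silently dropped.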
Allocating and releasing resources
Each resource allocation API call has a corresponding call to release the
resource. These calls are not optional, even if your program is about to exit. In
fact, dacs_runtime_exit() can hang if resources have
not been freed correctly! So, if you create a remote memory region with
dacs_remote_mem_create, you must destroy it with
dacs_remote_mem_destroy when you are done with it. If
you have accepted access to a memory region with
dacs_remote_mem_accept, you must release it with
dacs_remote_mem_release when you are done. The
dacs_remote_mem_destroy call blocks until every client
that accepted the region has released it.
While this might sound fairly intrusive, it's not bad in practice (although
returning an error condition would be preferable to blocking).
The calls tend to be easy to get right, because real code usually has
a clearly defined boundary after which a resource is no longer being used.
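The pairing rules above (create with destroy, accept with release, and destroy only after every client has released) can be modeled in a few lines. This is a sketch under stated assumptions: the `sk_region_` functions are hypothetical stand-ins, not DaCS calls, and where the real dacs_remote_mem_destroy is described as blocking, the stand-in returns -1 so the ordering mistake shows up in a single thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a shared region; real code would hold a
 * dacs_remote_mem_t instead. */
typedef struct {
    bool created;
    int  clients;   /* accepters that have not yet released the region */
} sk_region_t;

/* Owner side: stands in for dacs_remote_mem_create. */
static int sk_region_create(sk_region_t *r)
{
    assert(r != NULL);
    r->created = true;
    r->clients = 0;
    return 0;
}

/* Client side: stands in for dacs_remote_mem_accept. */
static int sk_region_accept(sk_region_t *r)
{
    if (!r->created)
        return -1;
    r->clients++;
    return 0;
}

/* Client side: stands in for dacs_remote_mem_release. */
static int sk_region_release(sk_region_t *r)
{
    if (r->clients == 0)
        return -1;
    r->clients--;
    return 0;
}

/* Owner side: stands in for dacs_remote_mem_destroy. The real call is
 * described as blocking while clients remain; this model returns -1
 * instead of blocking. */
static int sk_region_destroy(sk_region_t *r)
{
    if (!r->created || r->clients > 0)
        return -1;
    r->created = false;
    return 0;
}
```

The safe order is create, accept, release, destroy. Calling the destroy stand-in while a client still holds the region fails here, which corresponds to the hang you would see with the real library.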
Alignment requirements
Nearly everything needs to be aligned suitably for the processor, which generally
means aligned to a 16-byte boundary. Misaligned objects don't always have the
effects you'd expect, and sometimes the misalignment isn't detected gracefully. If you start seeing
mysterious crashes, look for unaligned accesses, and not necessarily on the DE
that's crashing! For example, a
misaligned object on the PPE can cause an SPE to crash. (Worse yet, changing the
alignment of other objects on the PPE can influence the alignment of the object in
question.)
While gcc's __attribute__ ((aligned (16))) is your
friend, it does not necessarily have any effect (or at least, the desired effect)
on stack variables. This means that in some cases, you will have to declare objects
globally or allocate memory for them. Unfortunately, the alignment requirements
don't mesh well with object sizes. For example, you can't simply declare an array
of N integers to use as arguments for dacs_send
messages, because if the first item in the array is aligned correctly, later ones
won't be!
The documentation doesn't give much detail about the
alignment requirements. In practice, just align everything to 16 bytes, and you
should be fine.
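One way around the array problem described above is to pad each value out to its own 16-byte slot, so that every element of the array is aligned rather than just the first. The following is a minimal sketch using the gcc attribute mentioned earlier; the `sk_` names and the slot count of four are made up for illustration. The array lives at file scope, following the advice about stack variables.

```c
#include <assert.h>
#include <stdint.h>

/* One 32-bit payload padded out to a full 16-byte slot. With gcc, the
 * aligned attribute on the type also rounds sizeof up to 16. */
typedef struct {
    uint32_t value;
} __attribute__ ((aligned (16))) sk_slot_t;

#define SK_SLOT_COUNT 4u

/* Declared at file scope rather than on the stack. */
static sk_slot_t sk_slots[SK_SLOT_COUNT];

/* Nonzero if the pointer sits on a 16-byte boundary. */
static int sk_is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Address of slot i, suitable for passing as a message buffer. */
static sk_slot_t *sk_slot(unsigned i)
{
    assert(i < SK_SLOT_COUNT);
    return &sk_slots[i];
}
```

The cost is 12 bytes of padding per value, which is usually an acceptable trade for being able to send any element without copying it first.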
Regarding communication and memory access
The SPEs have local storage, but they have no direct access to the main memory
the PPE uses. Most readers are probably aware of this, but it bears emphasis because all
non-trivial Cell/B.E. programming ends up involving a fair amount of code to
shuffle data. With IDL, data were moved automatically by the library without
explicit operations, much as data are sent to a remote system using a remote
procedure call (RPC) interface. In DaCS, data management is somewhat more
explicit.
DaCS provides functions to send and receive messages, and it also provides a family of
functions to share regions of memory between elements. DaCS provides
basic tools as opposed to a complete protocol. Operations are asynchronous, and
the overall structure of an exchange is left for your code to negotiate. Typically, you send
messages and then wait for confirmation of their receipt. What this means is that
both sides need to cooperate. If you send a message and wait for the other side to
receive it but the other side never receives it, you will wait forever.
Many of the communication functions offer byte-swapping primitives. While these
are doubtless of significant importance in a mixed environment, such as an Intel-based
or AMD-based server using Cell/B.E. blade servers, the only byte-swapping value you need to know
to make effective use of a pure-Cell/B.E. system is
DACS_BYTE_SWAP_DISABLE.
Wait identifiers and transfer completion
While wait identifiers in and of themselves don't do much, you can't use any of
the communication systems without understanding wait identifiers. The
message-passing functions and memory transfer functions defined by DaCS operate
asynchronously. Rather than requiring a different test or wait protocol for each,
they use a common facility called a wait identifier. A wait identifier,
once allocated (or reserved), is passed as an argument to a communication
function, and it can then be queried to see whether a given communication has
completed. Note that a wait identifier can be used only for a transaction
initiated on the local DE. You can't query someone else's wait identifiers.
Wait identifiers standardize the interaction with a variety of functions.
Whether you're calling dacs_put or
dacs_recv, you pass in a wait identifier to the
function, then you call dacs_test to see whether it's done
or dacs_wait to block until it completes.
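The life cycle of a wait identifier (reserve it, hand it to a transfer, poll or block on it, then give it back) can be illustrated with a small simulation. Nothing here calls DaCS: the `sk_` functions, the table of four identifiers, and the idea that a transfer finishes after a fixed number of polls are all assumptions made for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_MAX_WIDS 4u
enum { SK_DONE = 0, SK_BUSY = 1, SK_ERR = -1 };

typedef struct {
    int in_use;      /* reserved by the caller */
    int polls_left;  /* simulated work remaining on pending transfers */
} sk_wid_slot_t;

static sk_wid_slot_t sk_wids[SK_MAX_WIDS];

static int sk_wid_valid(uint32_t wid)
{
    return wid < SK_MAX_WIDS && sk_wids[wid].in_use;
}

/* Reserve a free wait identifier and store it through the pointer. */
static int sk_wid_reserve(uint32_t *wid)
{
    assert(wid != NULL);
    for (uint32_t i = 0; i < SK_MAX_WIDS; i++) {
        if (!sk_wids[i].in_use) {
            sk_wids[i].in_use = 1;
            sk_wids[i].polls_left = 0;
            *wid = i;
            return SK_DONE;
        }
    }
    return SK_ERR;
}

/* Start a pretend transfer that needs `cost` polls to complete. */
static int sk_start_transfer(uint32_t wid, int cost)
{
    if (!sk_wid_valid(wid) || cost < 0)
        return SK_ERR;
    sk_wids[wid].polls_left += cost;
    return SK_DONE;
}

/* Non-blocking check, in the spirit of dacs_test. Each call also
 * advances the simulated transfer by one step. */
static int sk_test(uint32_t wid)
{
    if (!sk_wid_valid(wid))
        return SK_ERR;
    if (sk_wids[wid].polls_left > 0) {
        sk_wids[wid].polls_left--;
        return SK_BUSY;
    }
    return SK_DONE;
}

/* Blocking wait, in the spirit of dacs_wait: poll until not busy. */
static int sk_wait(uint32_t wid)
{
    int rc;
    while ((rc = sk_test(wid)) == SK_BUSY)
        ;
    return rc;
}

/* Give the identifier back once nothing is pending on it. */
static int sk_wid_release(uint32_t wid)
{
    if (!sk_wid_valid(wid) || sk_wids[wid].polls_left > 0)
        return SK_ERR;
    sk_wids[wid].in_use = 0;
    return SK_DONE;
}
```

The same three-step shape applies whether the operation behind the identifier is a message or a memory transfer, which is the point of having a common facility.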
Remote Direct Memory Access
For large blocks of data, your best choice by far is access to remote memory.
The procedure seems a little convoluted at first, but it works well in practice.
The underlying constraint is that the Cell/B.E. processor does not have a shared memory map across
all cores. The SPEs have local store, and they simply can't address main memory.
Similarly, there is generally no way to address the local store of an SPE from the
PPE. The only way to access remote memory is through DMA transfers.
DaCS abstracts this to the concept of a memory region, which is created on
the DE that actually has the memory, and then it's shared out to another DE. For
example, the PPE could share out a chunk of system main memory that would then be
accepted by one of the SPEs. Once a chunk of memory has been accepted, it can be
accessed through the dacs_put and
dacs_get functions (or their corresponding list
equivalents, which can perform multiple transfers).
A typical use of these might be to allocate a number of such buffers on the PPE
and then share them out to the SPEs. The SPEs can then copy data out of these buffers
as needed. It is possible for more than one SPE to have access to the same shared
memory region. If you don't feel like tracking a large number of these regions,
you could, in theory, just make a single large one and do reads and writes to
different parts of it. It might be easier to understand, however, if you provide
multiple separate buffers.
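If you do go with a single large region, each SPE needs an offset into it, and the alignment rule from earlier suggests padding every slice to a multiple of 16 bytes. The arithmetic might look like the following sketch; the `sk_` helper names and the per-SPE slice layout are assumptions for illustration, not part of DaCS.

```c
#include <assert.h>
#include <stdint.h>

#define SK_ALIGN 16u

/* Round a byte count up to the next multiple of 16. */
static uint64_t sk_round_up16(uint64_t n)
{
    return (n + (SK_ALIGN - 1)) & ~(uint64_t)(SK_ALIGN - 1);
}

/* Offset of slice `index` within the region, with every slice padded
 * to a 16-byte multiple so that each one starts aligned. */
static uint64_t sk_slice_offset(uint32_t index, uint64_t slice_bytes)
{
    assert(slice_bytes > 0);
    return (uint64_t)index * sk_round_up16(slice_bytes);
}

/* Total size the region needs in order to hold `count` slices. */
static uint64_t sk_region_bytes(uint32_t count, uint64_t slice_bytes)
{
    return sk_slice_offset(count, slice_bytes);
}
```

For example, six slices of 100 bytes each are padded to 112 bytes, so the fourth slice (index 3) starts at offset 336 and the whole region needs 672 bytes. The offset is the kind of value you would pass to dacs_put or dacs_get along with the region.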
Message passing
For smaller chunks of data, DaCS provides a message passing system that allows
you to send and receive messages. Messages are sent asynchronously, and a
dacs_wait on the corresponding wait identifier blocks
until the remote side has received the message.
A particular caveat: if the receiver's specified buffer isn't
large enough to hold the whole message, the operation might fail silently.
(Ideally, a future release of the library should yield an error, rather than failing silently.)
Until then, the simplest thing to do is to ensure that you never send messages
larger than the receive buffers that you plan to use. A standard fixed message size
might be the best choice.
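A fixed message size is easy to enforce at compile time. In this sketch the 64-byte size, the field layout, and the `sk_` names are all arbitrary choices for illustration; the idea is simply that sender and receiver share one type, so a message can never be larger than the receive buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SK_MSG_BYTES 64u   /* assumed size; choose what suits the protocol */

/* Every message has the same size and 16-byte alignment. */
typedef struct {
    uint32_t kind;                       /* application-defined type tag */
    uint32_t length;                     /* payload bytes actually used */
    uint8_t  payload[SK_MSG_BYTES - 8];
} __attribute__ ((aligned (16))) sk_msg_t;

_Static_assert(sizeof(sk_msg_t) == SK_MSG_BYTES,
               "message must stay fixed-size");

/* Fill a message, refusing payloads that do not fit.
 * Returns 0 on success and -1 if the payload is too large or missing. */
static int sk_msg_fill(sk_msg_t *m, uint32_t kind,
                       const void *data, uint32_t len)
{
    assert(m != NULL);
    if (len > sizeof m->payload || (len > 0 && data == NULL))
        return -1;
    memset(m, 0, sizeof *m);
    m->kind = kind;
    m->length = len;
    if (len > 0)
        memcpy(m->payload, data, len);
    return 0;
}
```

Both sides would then always pass `sizeof(sk_msg_t)` as the length when sending and receiving, which sidesteps the silent-failure case entirely.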
Messages can be sent on streams, which are 32-bit values used to identify the
kind of message. A later dacs_recv call can check for
only messages on a particular stream, or it can specify the magic value
DACS_STREAM_ALL to accept any messages that are
waiting. (Stream identifiers must be between 0 and the predefined constant
DACS_STREAM_UB, inclusive.)
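The stream-matching rule can be written down as a small predicate over a queue of waiting messages. This is only a model of the behavior described above: the `SK_STREAM_UB` and `SK_STREAM_ALL` values below are placeholders chosen for the sketch, and real code should take DACS_STREAM_UB and DACS_STREAM_ALL from the DaCS header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder values for this sketch only. */
#define SK_STREAM_UB  0xFFFFFEFFu
#define SK_STREAM_ALL 0xFFFFFFFFu

/* A message waiting to be received, tagged with its stream. */
typedef struct {
    uint32_t stream;
    uint32_t value;
} sk_pending_t;

/* A sender may use any stream from 0 up to the upper bound, inclusive. */
static int sk_stream_valid(uint32_t stream)
{
    return stream <= SK_STREAM_UB;
}

/* Index of the first waiting message that a receive on `wanted` would
 * take, or -1 if none matches. The ALL value matches every stream. */
static int sk_first_match(const sk_pending_t *queue, size_t count,
                          uint32_t wanted)
{
    assert(queue != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (wanted == SK_STREAM_ALL || queue[i].stream == wanted)
            return (int)i;
    }
    return -1;
}
```

A protocol might reserve one stream per message kind, for instance one for work requests and another for results, so that a receiver waiting for results is not handed a work request by accident.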
Mailboxes, too
The DaCS library provides an abstraction of the mailbox facility on the
Cell/B.E. system. If you are sending simple 32-bit values, these might offer
substantial improvements in efficiency. Furthermore, the alignment requirements
are somewhat looser, so you can send individual members out of an array of 32-bit
values without worrying about alignment. The obvious weakness of mailboxes is that
mailbox operations are automatically blocking operations; you can't perform them
asynchronously. They also can't perform byte swapping.
What you can do, then, is to use dacs_mailbox_test to see
whether a read or write operation will block. Remember though: As with most such
operations, this is subject to race conditions. If all six of your child DEs check
at the same time whether they can write without blocking, they might end up all
thinking they can and all writing at once.
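The race is easier to see in a deterministic model. The sketch below treats the mailbox as a single queue with a fixed number of slots, which is a deliberate simplification, and all of the `sk_` names are hypothetical. Every writer checks first, then every writer writes, and the function counts the writers that were told "free" but would have blocked anyway.

```c
#include <assert.h>
#include <stddef.h>

#define SK_MAX_WRITERS 16

/* Simplified mailbox: a queue with a fixed number of slots. */
typedef struct {
    int capacity;
    int used;
} sk_mailbox_t;

/* Models the test call: would a write go through right now? */
static int sk_can_write(const sk_mailbox_t *mb)
{
    assert(mb != NULL);
    return mb->used < mb->capacity;
}

/* Models the write: returns 1 if it would have blocked, else 0. */
static int sk_write(sk_mailbox_t *mb)
{
    if (mb->used >= mb->capacity)
        return 1;
    mb->used++;
    return 0;
}

/* Worst-case interleaving: all writers check before any of them
 * writes. Returns how many saw a free slot but would then block. */
static int sk_surprised_writers(sk_mailbox_t *mb, int writers)
{
    int saw_free[SK_MAX_WRITERS] = {0};
    int surprised = 0;

    assert(writers >= 0 && writers <= SK_MAX_WRITERS);
    for (int i = 0; i < writers; i++)
        saw_free[i] = sk_can_write(mb);
    for (int i = 0; i < writers; i++) {
        if (sk_write(mb) && saw_free[i])
            surprised++;
    }
    return surprised;
}
```

With one free slot and six writers, all six see the slot as free, one write succeeds, and the other five would block. The test result is a hint, not a reservation, so code that must never block needs some other form of coordination.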
Conclusion
DaCS offers a rather breathtaking array of tools. Because they're fairly
specialized, you really need to know about most of them before you can start
designing a useful protocol. Without the message passing or mailbox functions, you
can't make much use of the remote memory access features. Without wait
identifiers, you can't do much at all except send data back and forth through
mailboxes.
The next article in the series shows you how these various tools can be combined to build a
DaCS-based version of the fractal program from "The little broadband engine that
could: Use multiple SPEs for a single task."
About the author
Peter Seebach usually responds to a new API the way most people respond to a
brightly-wrapped present--by shaking it. He keeps wrapping paper and old
documentation alike for the sentimental value.