Because the synergistic processing unit (SPU) is focused on vector, not scalar, processing, it can only load and store 16 bytes at a time (the size of a register), and only at local store locations that are aligned on 16-byte boundaries. Therefore, you cannot just load a word from, say, memory location 12. To get that word, you need to load the quadword at memory location 0 and then rotate the bytes so that the value you want ends up in the preferred slot. Storing takes similar care: the original quadword must be loaded, the new value inserted at the right location in the quadword, and the result stored back. Because of these issues, it is usually advisable to store all data aligned to 16 bytes. Loading a value that crosses a 16-byte boundary is harder still, because you have to load it into two registers, shift them, and then mask and combine them. Storing such values is even more difficult, so it is best never to use values that cross 16-byte boundaries.
While it allows you to use data that is not aligned to 16-byte boundaries, the loading and storing technique I will discuss requires that the data be naturally aligned to prevent it from crossing the 16-byte boundary. That means that words will be 4-byte aligned, halfwords will be 2-byte aligned, and bytes don't have to be aligned at all.
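The natural-alignment requirement comes down to a little address arithmetic. Here is a minimal sketch in plain C (not SPU code; the helper names crosses_quadword and is_naturally_aligned are my own, not part of any SDK) that shows why a naturally aligned byte, halfword, or word can never straddle a 16-byte boundary:

```c
#include <stdbool.h>
#include <stdint.h>

/* True if a value of `size` bytes starting at `addr` spills past the
   end of the 16-byte quadword that contains `addr`. */
bool crosses_quadword(uint32_t addr, uint32_t size) {
    return (addr & 15u) + size > 16u;
}

/* True if `addr` is a multiple of `size` (size must be a power of two). */
bool is_naturally_aligned(uint32_t addr, uint32_t size) {
    return (addr & (size - 1u)) == 0u;
}
```

For example, a word at address 12 occupies bytes 12 through 15 and stays inside one quadword, while a word at address 14 would need bytes from two different quadwords.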
Doing an unaligned load requires two or three instructions, depending on the size of the data. The reason for this is that if you are loading a single value, you probably want it in the preferred slot of the register. The first instruction does the load and the second instruction rotates the value so that the requested address is at the beginning of the register. Then, if the data is smaller than a word, a shift is needed to move it away from the beginning into the preferred slot (if it is a word or a doubleword, the beginning of the register is the preferred slot). Here is the code for a byte load, which takes an address in the preferred slot of register 3 and uses it to load a byte into the preferred slot of register 4:
Listing 3. Load from non-aligned memory
###Load byte unaligned address $3 into preferred slot of register $4###
#Loads from nearest quadword boundary
lqd $4, 0($3)
#Rotate value to the beginning of the register
rotqby $4, $4, $3
#Rotate value to the preferred slot (-3 for bytes, -2 for halfwords, and nothing for words or doublewords)
rotqbyi $4, $4, -3
Remember, the lqd instruction only loads from 16-byte
boundaries. It will therefore ignore the four least significant bits during the
load, and just load an aligned quadword from memory. Therefore, for arbitrary
addresses, we have no idea where in the loaded quadword the value we wanted is.
The rotqby instruction, "rotate (left) quadword by
bytes," uses the address you loaded from to indicate how far to rotate the
register. It only uses the four least significant bits of the address in the
register (the ones ignored by the load) to determine how far to rotate. This is
always the number of bytes it needs to rotate left to move the addressed byte
to the beginning of the register. Finally, for bytes, the preferred
slot is not at the beginning of the register, but three bytes to the
right. So the rotqbyi instruction does one more rotate,
this time by an immediate-mode value. Word- and doubleword-sized transfers
do not need this last instruction, because their preferred slot is at the
beginning of the register anyway. At the end of this, register 4 has the final
value, with the byte in the preferred slot.
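To make the byte movement concrete, here is a small model of the load sequence in plain C. It is only a sketch: the quad_t type and the sim_ functions are hypothetical stand-ins I made up to mimic lqd, rotqby, and rotqbyi on an ordinary byte array, and they are not SPU intrinsics.

```c
#include <stdint.h>
#include <string.h>

/* Model of a 128-bit SPU register: byte 0 is the leftmost byte. */
typedef struct { uint8_t b[16]; } quad_t;

/* lqd: load the aligned quadword containing addr (low 4 bits ignored). */
quad_t sim_lqd(const uint8_t *local_store, uint32_t addr) {
    quad_t q;
    memcpy(q.b, local_store + (addr & ~15u), 16);
    return q;
}

/* rotqby / rotqbyi: rotate the quadword left by (count mod 16) bytes. */
quad_t sim_rotqby(quad_t in, int count) {
    quad_t out;
    unsigned n = (unsigned)count & 15u;   /* a count of -3 becomes 13 */
    for (unsigned i = 0; i < 16; i++)
        out.b[i] = in.b[(i + n) & 15u];
    return out;
}

/* The three-step byte load from Listing 3. */
uint8_t load_byte_unaligned(const uint8_t *local_store, uint32_t addr) {
    quad_t r = sim_lqd(local_store, addr);
    r = sim_rotqby(r, (int)addr);   /* wanted byte is now byte 0 */
    r = sim_rotqby(r, -3);          /* move it to byte 3, the preferred byte slot */
    return r.b[3];
}
```

Rotating left by -3 is the same as rotating left by 13 (or right by 3), which is why the last step lands the byte in position 3.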
Storing is more difficult. Here is the code to store a byte that is in the preferred slot of register $4 into the address specified by register $3:
Listing 4. Store to non-aligned address
###Store preferred byte slot $4 into unaligned address $3
#Load the data into a temporary register
lqd $5, 0($3)
#Generate the controls for a byte insertion
cbd $6, 0($3)
#Shuffle the data in
shufb $7, $4, $5, $6
#Store it back
stqd $7, 0($3)
To understand this cryptic-looking sequence, again keep in mind that
the SPU only loads and stores a quadword at a time, on quadword-aligned
addresses. Therefore, if you tried to store a single byte directly to an
unaligned address, it would both go into the wrong location and
clobber the remaining bytes in the quadword. To avoid this, you need to first
load the quadword from memory, insert the value into the appropriate byte in the
quadword, and then store it back. The hard part is inserting it into the proper
location based only on the address. Thankfully, two instructions help out: cbd ("generate controls for byte insertion")
and shufb ("shuffle bytes"). The cbd instruction takes an address and generates a control
quadword that shufb can use to insert a byte at the
proper location in the quadword for that address. cbd $6,
0($3) uses the address in register 3 to generate the control quadword and
puts it in register 6. The instruction shufb $7, $4,
$5, $6 then uses that control quadword to build a new value
in register 7, which consists of the original quadword from memory (now
in register 5) with the byte from the preferred slot of register 4 inserted at
the target position. Once the byte is shuffled in, the value is stored back
into memory.
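The read-modify-write store can be modeled the same way. The following plain C sketch uses hypothetical helpers (quad_t, sim_cbd, sim_shufb) that I wrote to imitate the instructions on a byte array. The sim_shufb model only covers the control values that cbd generates; the real shufb also has special control values that produce constants, which are left out here.

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[16]; } quad_t;

/* cbd: control quadword that keeps every byte of the second shufb
   operand (selectors 0x10..0x1F) except at position (addr & 15),
   where it selects byte 3 of the first operand (the preferred byte slot). */
quad_t sim_cbd(uint32_t addr) {
    quad_t c;
    for (unsigned i = 0; i < 16; i++)
        c.b[i] = (uint8_t)(0x10u + i);
    c.b[addr & 15u] = 0x03;
    return c;
}

/* shufb, restricted to the selectors cbd produces:
   0x00..0x0F pick a byte from ra, 0x10..0x1F pick a byte from rb. */
quad_t sim_shufb(quad_t ra, quad_t rb, quad_t ctl) {
    quad_t out;
    for (unsigned i = 0; i < 16; i++) {
        uint8_t sel = ctl.b[i] & 0x1Fu;
        out.b[i] = (sel < 16u) ? ra.b[sel] : rb.b[sel - 16u];
    }
    return out;
}

/* The read-modify-write sequence from Listing 4. */
void store_byte_unaligned(uint8_t *local_store, uint32_t addr, uint8_t value) {
    uint32_t base = addr & ~15u;
    quad_t src = {{0}}, mem, ctl, merged;
    src.b[3] = value;                          /* $4: byte in preferred slot */
    memcpy(mem.b, local_store + base, 16);     /* lqd   $5, 0($3) */
    ctl = sim_cbd(addr);                       /* cbd   $6, 0($3) */
    merged = sim_shufb(src, mem, ctl);         /* shufb $7, $4, $5, $6 */
    memcpy(local_store + base, merged.b, 16);  /* stqd  $7, 0($3) */
}
```

Only the addressed byte changes; the other 15 bytes of the quadword pass through from the copy that was loaded from memory.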
To illustrate the technique, I'll write a function that takes the
address of an ASCII character, loads it, converts it to uppercase, and stores it
back. I'll put the function convert_to_upper in a separate file from the main function so that I can reuse it in another program
later on. Here is the code for the main
function (save it as convert_main.s):
Listing 5. Uppercase conversion program start
.data
string_start:
.ascii "We will convert the following letter, "
letter_to_convert:
.ascii "q"
remaining:
.ascii ", to uppercase\n\0"
.text
.global main
.type main, @function
main:
.equ MAIN_FRAME_SIZE, 32
.equ LR_OFFSET, 16
#PROLOGUE
stqd $lr, LR_OFFSET($sp)
stqd $sp, -MAIN_FRAME_SIZE($sp)
ai $sp, $sp, -MAIN_FRAME_SIZE
#MAIN FUNCTION
ila $3, letter_to_convert
brsl $lr, convert_to_upper
ila $3, string_start
brsl $lr, printf
#EPILOGUE
ai $sp, $sp, MAIN_FRAME_SIZE
lqd $lr, LR_OFFSET($sp)
bi $lr
Now enter the function that actually does the uppercase conversion (enter as
convert_to_upper.s):
Listing 6. Function to convert to uppercase
.text
.global convert_to_upper
.type convert_to_upper, @function
convert_to_upper:
#Register usage
# $3 - parameter 1 -- address of byte to be converted
# $4 - byte value to be converted
# $5 - $4 greater than 'a' - 1?
# $6 - $4 greater than 'z'?
# $7 - $4 less than or equal to 'z'?
# $8 - $4 between 'a' and 'z' (inclusive)?
# $9 through $12 - temporary storage for final store
# $13 - conversion factor
#address of letter stored in unaligned address in $3
#UNALIGNED LOAD
lqd $4, 0($3)
rotqby $4, $4, $3
rotqbyi $4, $4, -3
#IS IN RANGE 'a'-'z'?
cgtbi $5, $4, 'a' - 1
cgtbi $6, $4, 'z'
nand $7, $6, $6
and $8, $5, $7
#Mask out irrelevant bits
andi $8, $8, 255
#Skip uppercase conversion and store if $4 is not lowercase (based on $8)
brz $8, end_convert
is_lowercase:
#Perform Conversion
il $13, 'a' - 'A'
absdb $4, $4, $13
#Unaligned Store
lqd $9, 0($3)
cbd $10, 0($3)
shufb $11, $4, $9, $10
stqd $11, 0($3)
end_convert:
#no stack frame, no return value, just return
bi $lr
To compile and run, perform the following commands:
spu-gcc convert_main.s convert_to_upper.s -o convert
./convert
The main function works much the same as
before, so I won't discuss it here. Note, however, that it passes the
address of the letter to convert_to_upper, not
the letter itself.
The convert_to_upper function takes the address of an
arbitrary character, converts it to uppercase, and then stores it back and
returns nothing. It never calls another function, so it doesn't need a stack
frame.
The first thing the function does is an unaligned load as described previously
into register 4. It then checks to see if the byte is in the range a through z. It does that by
checking whether it is greater than 'a' - 1, and then
whether it is greater than 'z'. I did not do a
"less than" comparison, because there isn't one on the SPU! SPUs
only have comparisons for "greater than" and "equal to." Therefore, if you want
to do a "less than or equal to" comparison, you must do a "greater than"
comparison and then do a "not" on it, which is performed using the nand instruction with both source arguments being the same
register. You then combine the comparisons using the and instruction (note that you could have combined all the
logical instructions into one with an xor, but the
code would have been much less clear). Finally, because the branch instructions
only operate on halfword or word values, you have to mask out the non-relevant
portions of the register. (I didn't have to do that in the factorial example
because I was dealing with a full word.)
If the bits in the preferred slot of register 8 are all zero (false), you skip to
the end of the function. If they are set (true), you perform the conversion. One of the
few byte-oriented arithmetic instructions on the SPU is absdb,
"absolute difference of bytes," which gives the absolute value of the difference
between two operands. You use that, combined with the difference between the
lowercase and uppercase values, to perform the conversion. Finally, you perform
an unaligned store. Since you did not call any functions or use any local
storage, you did not need a stack frame at all, so you can now just exit through
the link register.
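The comparison-and-mask logic is easier to follow when written out for a single byte. This is a plain C sketch, not SPU code: sim_cgtb and sim_absdb are hypothetical one-byte models of cgtbi and absdb that I wrote for illustration, and the register numbers in the comments refer to Listing 6.

```c
#include <stdint.h>

/* cgtbi on one byte: signed "greater than", all-ones mask if true. */
static uint8_t sim_cgtb(int8_t value, int8_t imm) {
    return (value > imm) ? 0xFFu : 0x00u;
}

/* absdb on one byte: absolute difference of two unsigned bytes. */
static uint8_t sim_absdb(uint8_t a, uint8_t b) {
    return (a > b) ? (uint8_t)(a - b) : (uint8_t)(b - a);
}

uint8_t to_upper_spu_style(uint8_t ch) {
    uint8_t gt_a_minus_1 = sim_cgtb((int8_t)ch, 'a' - 1);  /* $5 */
    uint8_t gt_z         = sim_cgtb((int8_t)ch, 'z');      /* $6 */
    uint8_t le_z         = (uint8_t)~(gt_z & gt_z);        /* $7: nand x,x is "not x" */
    uint8_t in_range     = gt_a_minus_1 & le_z;            /* $8 */
    if (in_range == 0)                                     /* brz $8, end_convert */
        return ch;
    return sim_absdb(ch, 'a' - 'A');                       /* |ch - 32| */
}
```

Because every lowercase letter is larger than 'a' - 'A' (32), the absolute difference is simply the letter minus 32, which is its uppercase form.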
So far I have concentrated on SPE-only programs. Now I will look into PPE-controlled programs, and for that, I need to know how to get the PPE and the SPE to communicate.
Remember that SPEs have a memory that is separate from the processor's main memory, called the local store. The SPE cannot read main memory directly, but instead must import and export data between the local store and main memory using DMA commands to a unit called the memory flow controller, or MFC. The local store address space is limited to 32 bits, but it is usually much smaller (in the Sony® PLAYSTATION® 3, for instance, it is only 18 bits). The local store is kept separate so that memory accesses by SPE code can be deterministic. Main memory can get swapped out, moved around, cached, uncached, or memory mapped. Therefore, the amount of time required for any particular main memory access is unpredictable (if the memory is swapped out, who knows how long it will take). By separating out the SPE memory into a local store, the SPE can have a deterministic access time for any memory it accesses, and schedule the MFC to asynchronously move data in and out of main memory as needed. Addresses within an SPE's local store are called local store addresses (LSAs), while addresses within the main memory are called effective addresses (EAs). This will be important as you learn how to use the memory flow controller's DMA facilities.
SPEs communicate with the outside world by using channels. A channel is a 32-bit area which can be written to or read from (but not both -- they are unidirectional) using special instructions. A channel can also have a depth, or channel count. The channel count is the amount of data waiting to be read (for read channels), or the amount of data which can still be written (for write channels). Channels are used for all SPE input and output. They are used for issuing DMA commands to the memory flow controller, handling SPE events, and reading and writing messages to and from the PPE. The next program I'll show you utilizes the MFC and the channel interface to do character conversions on data specified by the PPE.
Creating and running SPE tasks
So far, the main function has not been using any
parameters. However, when it is run from a PPE program, it actually receives
three 64-bit parameters -- the SPE task identifier in register 3, a
pointer to
application parameters in register 4, and a pointer to runtime environment
information in register 5. The contents of the areas pointed to by application
and environment pointers are actually user-defined. However, remember that they
point to memory in the main storage of the application (an effective
address), not to the SPE's local store. Therefore, they cannot be accessed
directly, but must be moved in through DMA.
SPE tasks are created with the function speid_t
spe_create_thread(spe_gid_t spe_gid, spe_program_handle_t *spe_program_handle,
void *argp, void *envp, unsigned long mask, int flags). The parameters
work as follows:
- spe_gid
This is the SPE thread group to assign this task to. It can simply be set to zero.
- spe_program_handle
This is a pointer to a structure which holds the data about the SPE program itself. This data is normally defined either automatically by embedding an SPU application within a PPU executable (this will be shown later), by using dlopen()/dlsym() on a library containing an SPU application, or by using spe_open_image() to directly load an SPU application.
- argp
This is a pointer to application-specific data for program initialization. Set to null if it is not going to be used.
- envp
This is a pointer to environment data for the program. Set to null if it is not going to be used.
- mask
This is the processor affinity mask. Set it to -1 to assign the process to any available SPE. Otherwise, it contains a bitmask for each available processor. 1 means that the processor should be used, 0 means that it should not. Most applications set this to -1.
- flags
This is a set of bit flags which modify how the SPE is set up. These are all outside the scope of this article.
As an example of DMA communication, I will write a program where the PPE takes a string, and invokes an SPE program which copies over the string, converts it to uppercase, and copies it back into main storage. All of the data transfers will use the MFC's DMA facilities, controlled through SPE channels.
The main SPE program will receive an effective address pointer to a struct
containing the size and pointer of a string in main memory. It will then copy it
into its buffer, perform the conversion, and copy it back. Here is the SPE code
(enter as convert_dma_main.s):
Listing 7. SPU code to perform uppercase conversion for PPU program
.data
.align 4
conversion_info:
conversion_length:
.octa 0
conversion_data:
.octa 0
.equ CONVERSION_STRUCT_SIZE, 32
.section .bss #Uninitialized Data Section
.align 4
.lcomm conversion_buffer, 16384
.text
.global main
.type main, @function
#MFC Constants
.equ MFC_GET_CMD, 0x40
.equ MFC_PUT_CMD, 0x20
#Stack Frame Constants
.equ MAIN_FRAME_SIZE, 80
.equ MAIN_REG_SAVE_OFFSET, 32
.equ LR_OFFSET, 16
main:
#Prologue
stqd $lr, LR_OFFSET($sp)
stqd $sp, -MAIN_FRAME_SIZE($sp)
ai $sp, $sp, -MAIN_FRAME_SIZE
#Save Registers
#Save register $127 (will be used for current index)
stqd $127, MAIN_REG_SAVE_OFFSET($sp)
#Save register $126 (will be used for base pointer)
stqd $126, MAIN_REG_SAVE_OFFSET+16($sp)
#Save register $125 (will be used for final size)
stqd $125, MAIN_REG_SAVE_OFFSET+32($sp)
##COPY IN CONVERSION INFORMATION##
ila $3, conversion_info #Local Store Address
#register 4 already has address #64-bit Effective Address
il $5, CONVERSION_STRUCT_SIZE #Transfer size
il $6, 0 #DMA Tag
il $7, MFC_GET_CMD #DMA Command
brsl $lr, perform_dma
#Wait for DMA to complete
il $3, 0
brsl $lr, wait_for_dma_completion
##COPY STRING IN TO BUFFER##
#Load buffer data pointer
ila $3, conversion_buffer #Local Store
lqr $4, conversion_data #64-bit Effective Address
lqr $5, conversion_length #SIZE
il $6, 0 #DMA Tag
il $7, MFC_GET_CMD #DMA Command
brsl $lr, perform_dma
#Wait for DMA to complete
il $3, 0
brsl $lr, wait_for_dma_completion
#LOOP THROUGH BUFFER
#Load buffer size
lqr $125, conversion_length
#Load buffer pointer
ila $126, conversion_buffer
#Load buffer index
il $127, 0
loop:
ceq $7, $125, $127
brnz $7, loop_end
#Compute address for function parameter
a $3, $127, $126
#Next index
ai $127, $127, 1
#Run function
brsl $lr, convert_to_upper
#Repeat loop
br loop
loop_end:
#Copy data back
ila $3, conversion_buffer #Local Store Address
lqr $4, conversion_data #64-bit effective address
lqr $5, conversion_length #Size
il $6, 0 #DMA Tag
il $7, MFC_PUT_CMD #DMA Command
brsl $lr, perform_dma
#Wait for DMA to complete
il $3, 0
brsl $lr, wait_for_dma_completion
#Return Value
il $3, 0
#Epilogue
ai $sp, $sp, MAIN_FRAME_SIZE
lqd $lr, LR_OFFSET($sp)
bi $lr
This code relies on some utility functions for handling DMA commands. Enter
those functions as dma_utils.s:
Listing 8. DMA transferring utilities
##UTILITY FUNCTION TO PERFORM DMA OPS##
#Parameters - Local Store Address, 64-bit Effective Address, Transfer Size, DMA Tag, DMA Command
.global perform_dma
.type perform_dma, @function
perform_dma:
shlqbyi $9, $4, 4 #Get the low-order 32-bits of the address
wrch $MFC_LSA, $3
wrch $MFC_EAH, $4
wrch $MFC_EAL, $9
wrch $MFC_Size, $5
wrch $MFC_TagID, $6
wrch $MFC_Cmd, $7
bi $lr
.global wait_for_dma_completion
.type wait_for_dma_completion, @function
wait_for_dma_completion:
#We receive a tag in register 3 - convert to a tag mask
il $4, 1
shl $4, $4, $3
wrch $MFC_WrTagMask, $4
#Tell the DMA that we only want it to inform us on DMA completion
il $5, 2
wrch $MFC_WrTagUpdate, $5
#Wait for DMA Completion, and store the result in the return value
rdch $3, $MFC_RdTagStat
#Return
bi $lr
Now, not only do you need to compile this program, you need to prepare it for
embedding in a PPE application. Assuming you still have the convert_to_upper.s from your last program in the current
directory, here are the commands to compile the code and prepare it for
embedding:
spu-gcc convert_dma_main.s dma_utils.s convert_to_upper.s -o spe_convert
embedspu -m64 convert_to_upper_handle spe_convert spe_convert_csf.o
This produces what is called a CESOF Linkable, which allows an object file for the SPE to be embedded in a PPE application and loaded as needed.
Here is the PPU code to make use of the SPU code (enter as ppu_dma_main.c):
Listing 9. PPU code to utilize SPU application
#include <stdio.h>
#include <libspe.h>
#include <errno.h>
#include <string.h>
#include <malloc.h> /* for memalign */
/* embedspu actually defines this in the generated object file,
we only need an extern reference here */
extern spe_program_handle_t convert_to_upper_handle;
/* This is the parameter structure that our SPE code expects */
/* Note the alignment on all of the data that will be passed to the SPE is 16-bytes */
typedef struct {
int length __attribute__((aligned(16)));
unsigned long long data __attribute__((aligned(16)));
} conversion_structure;
int main() {
int status = 0;
/* Pad string to a quadword -- there are 12 spaces at the end. */
char *tmp_str = "This is the string we want to convert to uppercase.            ";
/* Copy it to an aligned boundary */
char *str = memalign(16, strlen(tmp_str) + 1);
strcpy(str, tmp_str);
/* Create conversion structure on an aligned boundary */
conversion_structure conversion_info __attribute__((aligned(16)));
/* Set the data elements in the parameter structure */
conversion_info.length = strlen(str) + 1; /* add one for null byte */
conversion_info.data = (unsigned long long)str;
/* Create the thread and check for errors */
speid_t spe_id = spe_create_thread(0, &convert_to_upper_handle,
&conversion_info, NULL, -1, 0);
if(spe_id == 0) {
fprintf(stderr, "Unable to create SPE thread: errno=%d\n", errno);
return 1;
}
/* Wait for SPE thread completion */
spe_wait(spe_id, &status, 0);
/* Print out result */
printf("The converted string is: %s\n", str);
return 0;
}
To build and execute the program, enter the following commands:
gcc -m64 spe_convert_csf.o ppu_dma_main.c -lspe -o dma_convert
./dma_convert
A lot of things are going on in this code, and my goal is to introduce all of the necessary foundational material so that we don't get bogged down in it when learning optimization secrets in the next article. (Stay with me, and you'll be on your way to expert SPU programming in no time!) Now, I'll explain what the code is doing. I'll start with the PPU code, since it's a little easier.
The first interesting part of the PPU code is the inclusion of the libspe.h header file, which contains all of the function
declarations for running programs on the SPE. It then references a handle called
convert_to_upper_handle. This is only an extern declaration, not the definition itself,
because convert_to_upper_handle is defined in spe_convert_csf.o. The name of the variable was set on the
command line of the embedspu command. That variable
is the handle to the program code, which will be used to create your SPE tasks.
Next, you define the structure that will be used as the parameter to your SPE
program. You need the length of the string and the pointer to the string itself.
Both need to be quadword aligned, so that you can copy them into your SPE
program and use the values with DMA transfers. Note that the pointer is
declared as an unsigned long long rather than just a
pointer. This is so that the address is stored the same way whether the program
is compiled in 32-bit mode or 64-bit mode. With a pointer, if it were
compiled in 32-bit mode, the pointer would be aligned differently within the structure. You
also have to use the memalign function and strcpy to copy the data into an area of appropriate
alignment. Here's a pointer from long nights of trial and error with this stuff:
If you are continually receiving a "bus error," you are probably doing a DMA
transfer that is either not 16-byte aligned or is not a multiple of 16 bytes.
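If you want to confirm that the parameter structure really has the layout the SPE code expects (two quadwords, one field at the start of each), you can let the compiler check it. This is a minimal sketch assuming GCC or a compatible compiler with C11 support; the pointer_to_ea helper is a hypothetical name of mine, and going through uintptr_t is my own suggestion to avoid depending on how the compiler widens a 32-bit pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Same layout as conversion_structure in Listing 9. */
typedef struct {
    int length __attribute__((aligned(16)));
    unsigned long long data __attribute__((aligned(16)));
} conversion_structure;

/* Compile-time checks (C11): each field starts its own quadword,
   matching the two .octa slots on the SPE side. */
_Static_assert(offsetof(conversion_structure, length) == 0, "length at offset 0");
_Static_assert(offsetof(conversion_structure, data) == 16, "data at offset 16");
_Static_assert(sizeof(conversion_structure) == 32, "structure is two quadwords");

/* Store a pointer in the 64-bit field the same way in 32- and 64-bit builds. */
unsigned long long pointer_to_ea(const void *p) {
    return (unsigned long long)(uintptr_t)p;
}
```

If any of these assertions fails, the build stops, which is a lot friendlier than a bus error at run time.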
In the main program, you declare your variables. Note that all of the declared
variables which will be copied using DMA are aligned on quadword boundaries and
are multiples of quadwords. That's because DMA transfers, with a few exceptions
for small transfers, must be quadword aligned in both the source and
destination addresses (the program will get even better performance if both
source and destination are 128-byte aligned). Next, the SPE task is
created with spe_create_thread, passing in your
parameter structure. Now you can just wait for the SPE task to complete using
spe_wait, and then print out the final value. As you
may have guessed, most of the interesting parts of the program are taking place
on the SPE, including all of the DMA transfers. DMA transfers are almost always
done by the SPEs rather than by the PPE because they can handle much more data
and many more active DMA operations than the PPE.
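A quick sanity check on DMA parameters can save a lot of debugging. The sketch below is plain C with hypothetical helper names (round_up_quadword, dma_params_ok). The rules it encodes are my reading of the DMA size and alignment rules, including the small-transfer exceptions (1, 2, 4, or 8 bytes, naturally aligned, with both addresses at the same offset within their quadwords), so treat them as an assumption and check them against the Cell BE Handbook:

```c
#include <stdbool.h>
#include <stdint.h>

#define DMA_MAX_SIZE 16384u

/* Round a byte count up to the next multiple of 16. */
uint32_t round_up_quadword(uint32_t n) {
    return (n + 15u) & ~15u;
}

bool dma_params_ok(uint32_t lsa, uint64_t ea, uint32_t size) {
    if (size == 1 || size == 2 || size == 4 || size == 8) {
        /* Small transfer: naturally aligned, and both addresses must
           sit at the same offset within their quadwords. */
        return (lsa % size) == 0 && (ea % size) == 0 &&
               (lsa & 15u) == (ea & 15u);
    }
    /* Otherwise: a multiple of 16 bytes, at most 16 KB, with both
       addresses 16-byte aligned. This sketch also rejects a
       zero-length request. */
    return size != 0 && size <= DMA_MAX_SIZE && (size & 15u) == 0 &&
           (lsa & 15u) == 0 && (ea & 15u) == 0;
}
```

In the example program, the 51-character string is padded with 12 spaces and a null byte so that its length comes to 64, a multiple of 16; round_up_quadword computes that kind of padded size for you.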
Before getting into the details of the main program, I'll explore the DMA
utility functions. The first function is perform_dma,
which, not surprisingly, performs DMA commands. The Cell BE Handbook defines the
sequence of channel operations needed to perform a DMA transfer on pages 450-456 (see Resources).
The first thing the function does is split the 64-bit effective address
in register 4 into two 32-bit components, a high-order and a low-order
component (remember, the channels are only 32 bits wide). Because channels are
written using a register's preferred word-sized slot, the 64-bit address already has the
high-order bits in the preferred slot. Therefore, you just shift the contents to
the left by four bytes into a new register to get the low-order bits in the
preferred slot. You then write the local store address, the high-order bits of
the effective address, the low-order bits of the effective address, the size of
the transfer, the "tag" of the DMA command, and then the command itself to their
appropriate channels using the wrch instruction.
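In C terms, the address split is nothing more than a shift and a mask. Here is a tiny sketch (the ea_halves type and the function name are hypothetical, chosen for illustration) that produces the two values that go to the $MFC_EAH and $MFC_EAL channels:

```c
#include <stdint.h>

typedef struct { uint32_t high; uint32_t low; } ea_halves;

ea_halves split_effective_address(uint64_t ea) {
    ea_halves h;
    h.high = (uint32_t)(ea >> 32);           /* value for $MFC_EAH */
    h.low  = (uint32_t)(ea & 0xFFFFFFFFu);   /* value for $MFC_EAL */
    return h;
}
```

On the SPU the same effect comes from the register layout: the high word is already in the preferred slot, and the four-byte left shift brings the low word there.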
When the command is written, the DMA request is enqueued into the MFC provided it
has available slots -- yours certainly does as you are not doing any other
concurrent DMA requests. The "tag" is a number which can be assigned to one or
many DMA commands. All DMA commands issued with the same tag are considered a
single group, and status updates and sequencing operations apply to the group as
a whole. In this application, you will only have one DMA command active at a
time, so all of your operations will use 0 as the DMA tag. The DMA command should
be either MFC_GET_CMD or MFC_PUT_CMD. There are others, but we aren't concerned with
them here. MFC commands are all done from the perspective of the SPE, whether or
not it is actually the SPE issuing the command. So MFC_GET_CMD moves data from main memory to the local store,
and MFC_PUT_CMD goes the other way.
Because DMA commands are asynchronous, it is useful to be able to wait for one
to complete. The function wait_for_dma_completion
does precisely that. It takes a tag as its only parameter, converts it to a tag
mask, requests a tag status update, and then reads the status. So how does this wait
for the DMA operation to complete? Writing a value of 2 to the $MFC_WrTagUpdate channel causes the
$MFC_RdTagStat channel to have no value until every DMA command in the selected tag groups
has completed. Thus, when you try to read the channel using rdch, it blocks until the status is available, at which
point the transfer is complete.
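The tag-to-mask conversion and the completion test can be written out in a few lines of C. This is only a sketch with hypothetical function names. It assumes tags run from 0 to 31 (one bit each in the 32-bit mask) and that a set bit in the status word means the corresponding tag group has no commands outstanding.

```c
#include <stdbool.h>
#include <stdint.h>

/* Tag n corresponds to bit n of the 32-bit tag mask
   (the il + shl pair in wait_for_dma_completion). */
uint32_t tag_to_mask(unsigned tag) {
    return (uint32_t)1u << (tag & 31u);
}

/* "All" condition: every tag group selected by the mask has finished. */
bool tag_groups_complete(uint32_t tag_status, uint32_t tag_mask) {
    return (tag_status & tag_mask) == tag_mask;
}
```

Because the mask can have several bits set, one wait can cover several tag groups at once, which becomes useful as soon as you have more than one transfer in flight.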
Now, moving on to the actual program itself. The first thing our SPE program
does is reserve space for the application's parameter data. This is also aligned
to quadword boundaries (.align 4 in assembly language
works the same as __attribute__((aligned(16))) in C
because 2^4 = 16). .octa reserves quadword values
(the mnemonic is a holdover from 16-bit days). You then define a constant CONVERSION_STRUCT_SIZE for the size of the whole structure.
After this, you go to the .bss section, which is like
the .data section, except that the executable itself
does not contain the values, it just notes how much space should be reserved for
them. This section is for uninitialized data. .lcomm
conversion_buffer, 16384 reserves 16K of space, with the starting address
defined in the symbol conversion_buffer. It is
defined for holding 16K because that is the maximum size of an MFC DMA transfer.
Therefore, if any string is longer than that, the PPE will have to invoke the
program multiple times (a better program would simply break up the request into
chunks on the SPE side).
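Breaking a large request into pieces is a simple loop. The following plain C sketch shows the bookkeeping only; the chunk_fn callback is a hypothetical placeholder for "transfer and process this piece," and the sketch assumes the starting address and total length are already multiples of 16 so that every chunk boundary stays quadword aligned.

```c
#include <stddef.h>
#include <stdint.h>

#define DMA_MAX_SIZE 16384u

/* Hypothetical callback: handle one chunk starting at effective
   address `ea` that is `size` bytes long. */
typedef void (*chunk_fn)(uint64_t ea, uint32_t size, void *ctx);

/* Walk a buffer in pieces no larger than one DMA transfer.
   Returns the number of chunks visited. */
size_t for_each_dma_chunk(uint64_t ea, uint64_t length, chunk_fn fn, void *ctx) {
    size_t chunks = 0;
    while (length > 0) {
        uint32_t size = (length > DMA_MAX_SIZE) ? DMA_MAX_SIZE : (uint32_t)length;
        if (fn != NULL)
            fn(ea, size, ctx);
        ea += size;
        length -= size;
        chunks++;
    }
    return chunks;
}
```

A 40,000-byte string, for instance, would be handled as two full 16K transfers followed by one 7,232-byte transfer.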
The main function has the main meat of the program.
It starts by setting up a stack frame. It then saves three non-volatile registers
that will be used for the main control of the program. Next, it performs a DMA
transfer to copy in the parameter structure from the PPE. Remember, the second
parameter to main, which arrives in register 4, is the 64-bit address that was passed in from the PPE.
You then use a DMA command to fetch the full structure, and wait for the DMA to
complete. After the transfer, you use the data in that structure to copy the
string itself into your buffer in the local store using another DMA transfer, and
wait for it to complete. Note that you used the ila
instruction ("immediate load address") to load the address of the buffer. The
ila instruction maxes out at 18 bits, which works for
the PLAYSTATION 3. However, if a Cell BE processor has a larger local
store,
you would load the address with the following two instructions instead:
ilhu $3, conversion_buffer@h #load high-order 16 bits of conversion_buffer
iohl $3, conversion_buffer@l #"or" it with the low-order 16 bits of conversion_buffer
Then the target effective address, the length of the string, the DMA tag, and a
MFC_GET_CMD DMA command are all passed to perform_dma. The program then waits for the operation to
complete.
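The two-instruction address load is just "upper half, then OR in the lower half." Here it is modeled in plain C; the sim_ functions are hypothetical stand-ins that I wrote for ilhu ("immediate load halfword upper") and iohl ("immediate or halfword lower"), and fits_in_ila simply encodes the 18-bit limit mentioned above.

```c
#include <stdint.h>

/* ilhu: place a 16-bit immediate in the upper halfword, zero the lower. */
uint32_t sim_ilhu(uint16_t imm) {
    return (uint32_t)imm << 16;
}

/* iohl: OR a 16-bit immediate into the lower halfword. */
uint32_t sim_iohl(uint32_t reg, uint16_t imm) {
    return reg | imm;
}

/* Build a full 32-bit address from its two halves. */
uint32_t load_address_32(uint32_t address) {
    uint32_t reg = sim_ilhu((uint16_t)(address >> 16));    /* symbol@h */
    return sim_iohl(reg, (uint16_t)(address & 0xFFFFu));   /* symbol@l */
}

/* ila carries only an 18-bit immediate. */
int fits_in_ila(uint32_t address) {
    return address < (1u << 18);
}
```

With an 18-bit local store, every address passes fits_in_ila, so the single ila instruction is enough.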
At this point, all of the data is loaded in and you just need to convert it. You
then use register 127 as your loop counter and register 126 as your base pointer,
and perform convert_to_upper on each value until you
get to the end of the buffer.
At loop_end, all of the data is converted, and you
need only to copy it back. You use the same DMA parameters as for the last
transfer, but this time it is an MFC_PUT_CMD command.
Once the DMA is completed, your function is done. You load register 3 with the
return value and perform the function epilogue to restore the stack frame and
return.
SPE/PPE communication using mailboxes
While DMA transfers are an excellent way of moving bulk data between the SPE and the PPE, another, simpler method for smaller transfers is mailboxes, which I will discuss briefly. To the SPE, a mailbox is simply a set of channels (a read channel and a write channel) for exchanging 32-bit values with the PPE.
To demonstrate the concept, I will write a very simple SPE server which waits
for an unsigned integer number in the mailbox and then writes back the square of
that number. Here is the code (enter as square_server.s):
Listing 10. SPU squaring server
.text
.global main
.type main, @function
main:
#Read the value from the inbox (stalls if no value until one is available)
rdch $3, $SPU_RdInMbox
#Square the value
mpyu $3, $3, $3
#Write the value back
wrch $SPU_WrOutMbox, $3
#Go back and do it again
br main
That's all! This will just sit around and wait for requests and process them.
It simply quits when the parent program quits. And, if there is no value
available in the inbox, the rdch instruction simply
stalls until there is one.
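One detail worth knowing about this server is that mpyu multiplies only the low-order 16 bits of each operand, producing a 32-bit result. The plain C sketch below models one request/response cycle; sim_mpyu and square_server_step are hypothetical names of mine, not library functions.

```c
#include <stdint.h>

/* mpyu: unsigned multiply of the low 16 bits of each operand,
   producing a 32-bit result. */
uint32_t sim_mpyu(uint32_t a, uint32_t b) {
    return (a & 0xFFFFu) * (b & 0xFFFFu);
}

/* One request/response cycle of the server in Listing 10:
   the value read from the inbox goes in, the value written
   to the outbox comes out. */
uint32_t square_server_step(uint32_t inbox_value) {
    return sim_mpyu(inbox_value, inbox_value);
}
```

So the server squares any request below 65,536 correctly; for larger requests, only the low 16 bits of the value take part in the multiplication.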
The PPE side isn't much harder (enter as square_client.c):
Listing 11. PPE squaring client
#include <libspe.h>
#include <stdio.h>
extern spe_program_handle_t square_server_handle;
int main() {
int status = 0;
/* Create SPE thread */
speid_t spe_id = spe_create_thread(0, &square_server_handle, NULL, NULL, -1, 0);
if(spe_id == 0) {
fprintf(stderr, "Unable to create SPE thread!\n");
return 1;
}
/* Request a square */
spe_write_in_mbox(spe_id, 4);
/* Wait for result to be available */
while(!spe_stat_out_mbox(spe_id)) {}
/* Read and display result */
printf("The square of 4 is %d\n", spe_read_out_mbox(spe_id));
/* Do it again */
spe_write_in_mbox(spe_id, 10);
while(!spe_stat_out_mbox(spe_id)) {}
printf("The square of 10 is %d\n", spe_read_out_mbox(spe_id));
return 0;
}
To compile and run this program, issue the following commands:
spu-gcc square_server.s -o square_server
embedspu -m64 square_server_handle square_server square_server_csf.o
gcc -m64 square_client.c square_server_csf.o -lspe -o square
./square
The mailboxes, even for the PPE, are named according to the perspective of the
SPE. So you write to the inbox and read from the outbox if you are the PPE.
Unlike the SPE, the PPE does not stall and wait for a value when it reads or
writes. Instead, the program must use spe_stat_out_mbox to wait for a value, and spe_stat_in_mbox to see if there are slots left for writing
to the mailbox. You don't use the latter as you only have one value in play at a
time.
The real power of mailboxes comes when a program combines the mailbox and the DMA approach. For example, an SPE task can be created which listens for buffer addresses on its mailbox, and then uses that address to pull in all of the data to be processed through DMA.
Thus far, this series has covered the main concepts of assembly language programming on the Cell BE processor of the PLAYSTATION 3 under Linux®. Topics covered include the basic architecture, the syntax of the SPU assembly language, and the primary modes of communication between the SPE and the PPE. The next article looks at how to pump every ounce of performance out of the Cell BE processor SPEs that you can. And later articles will apply this knowledge to SPE programming in C, to make your life just a little bit easier.
Resources
- See the other parts in the Programming high-performance applications on the Cell BE processor series.
- The details of every SPU instruction are available in the SPU Instruction Set Architecture Reference Guide. However, most of the time you are better off looking at the short summaries in the SPU assembly language guide. In fact, to get a good overview of what the SPU can do, I suggest a read through the assembly language guide. It is both short and packed with information. If the instruction doesn't make sense, then look up the full definition in the instruction set architecture reference.
- For ABI details, see the SPU ABI documentation as well as the Linux extensions to the ABI.
- An additional method of interprocess communication, using special references called EAR references, is described in this guide to CESOF linkables. However, the example given uses the function copy_from_ls, which is not available in the open-source SDK, but is available in the IBM System Simulator for the Cell BE processor. copy_from_ls and copy_to_ls allow you to perform DMA transfers without regard to alignment, but they both take considerably longer to run.
- Check out this good tutorial on DMA transfers on the Cell BE processor using C.
- And here is a more extensive tutorial on using mailboxes (also in C).
- The documentation of the SPE management library describes in detail task creation and communication with SPEs from the PPE.
- The definitive source of information about the Cell BE processor itself is the Cell BE Handbook.
- Keep abreast of all things Cell BE: subscribe to IBM microNews.
Jonathan Bartlett is the author of the book Programming from the Ground Up, an introduction to programming using Linux assembly language. He is the lead developer at New Media Worx, responsible for developing Web, video, kiosk, and desktop applications for clients.