Instead of focusing on the SPU's assembly language to help you get to know the Cell Broadband Engine (Cell/B.E.) processor intimately, this tip, excerpted from the developerWorks article "Programming high-performance applications on the Cell BE processor, Part 5," provides a quick look at C/C++ so you can let the compiler do a large amount of the work for you.
To use the SPU C/C++ language extensions, the header file spu_intrinsics.h must be included at the beginning of your code.
The C/C++ language extensions include data types and intrinsics that give the programmer nearly full access to the SPU's assembly language instructions. In addition, many of the intrinsics greatly simplify the SPU's assembly language by coalescing several similar instructions into one intrinsic.
Instructions that differ only in the type of operand (such as a, fa, and dfa for addition) are represented by a single C/C++ intrinsic, which selects the proper instruction based on the type of the operand. For addition, spu_add, when given two vector unsigned ints as parameters, will generate the a (32-bit add) instruction. However, if given two vector floats as parameters, it will generate the fa (float add) instruction.
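The type-based selection described above can be imitated in portable C11 with _Generic. The following is only a sketch of the idea, not how the SPU compiler implements its intrinsics: the struct types vec_u32 and vec_f32, the functions add_u32 and add_f32, and the macro model_add are all hypothetical names invented for illustration, and the code runs on any host rather than on an SPU.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for two of the SPU's 128-bit vector types. */
typedef struct { uint32_t e[4]; } vec_u32;
typedef struct { float    e[4]; } vec_f32;

/* Plays the role of the a (32-bit add) instruction. */
static vec_u32 add_u32(vec_u32 x, vec_u32 y)
{
    vec_u32 r;
    for (int i = 0; i < 4; i++)
        r.e[i] = x.e[i] + y.e[i];   /* unsigned add wraps modulo 2^32 */
    return r;
}

/* Plays the role of the fa (float add) instruction. */
static vec_f32 add_f32(vec_f32 x, vec_f32 y)
{
    vec_f32 r;
    for (int i = 0; i < 4; i++)
        r.e[i] = x.e[i] + y.e[i];
    return r;
}

/* One name, two implementations: the type of the first operand selects
   the function, much as spu_add selects between a and fa. */
#define model_add(x, y) \
    _Generic((x), vec_u32: add_u32, vec_f32: add_f32)((x), (y))

/* Usage example. */
static void example_dispatch(void)
{
    vec_u32 a = {{1, 2, 3, 4}}, b = {{10, 20, 30, 40}};
    vec_f32 x = {{1.5f, 2.5f, 0.0f, -1.0f}}, y = {{0.5f, 0.5f, 0.0f, 1.0f}};

    vec_u32 isum = model_add(a, b);   /* resolves to add_u32 */
    vec_f32 fsum = model_add(x, y);   /* resolves to add_f32 */

    assert(isum.e[0] == 11 && isum.e[3] == 44);
    assert(fsum.e[0] == 2.0f && fsum.e[3] == 0.0f);
}
```

The caller writes model_add in both cases and never names the underlying routine, which is the convenience the spu_ intrinsics offer over choosing an instruction by hand.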
Note that the intrinsics generally have the same limitations as their corresponding assembly language instructions. However, in cases where an immediate value is too large for the appropriate immediate-mode instruction, the compiler will promote the immediate value to a vector and do the corresponding vector/vector operation. For instance, spu_add(myvec, 2) generates an ai (add immediate) instruction while spu_add(myvec, 2000) first loads the 2000 into its own vector using il and then performs the a (add) instruction.
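Why 2 fits and 2000 does not comes down to the width of the immediate field: ai carries a 10-bit signed immediate, so it can encode only -512 through 511. A minimal host-side sketch of that range check follows; the helper name fits_ai_immediate is hypothetical, and the real compiler's decision logic is certainly more involved than this.

```c
#include <assert.h>
#include <stdbool.h>

/* ai encodes its constant in a 10-bit signed immediate field. */
enum { AI_IMM_MIN = -512, AI_IMM_MAX = 511 };

/* Hypothetical helper: true when value can be encoded directly in ai. */
static bool fits_ai_immediate(int value)
{
    return value >= AI_IMM_MIN && value <= AI_IMM_MAX;
}

/* Usage example with the two constants from the text. */
static void example_immediate(void)
{
    assert(fits_ai_immediate(2));       /* spu_add(myvec, 2): ai */
    assert(!fits_ai_immediate(2000));   /* spu_add(myvec, 2000): il, then a */
}
```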
The order of operands in the intrinsics is essentially the same as in the corresponding assembly language instruction, except that the first operand (which holds the destination register in assembly language) is not specified; it is instead used as the return value of the function. The compiler supplies the actual register in the code it generates.
For more on vector intrinsics, see "Programming on the Cell/B.E. processor, Part 5," the article from which this tip was taken.
This list will supply some of the more common SPU intrinsics; types are not given as most of them are polymorphic.
spu_add(val1, val2)
Adds each element of val1 to the corresponding element of val2. If val2 is a non-vector value, it adds the value to each element of val1.
spu_sub(val1, val2)
Subtracts each element of val2 from the corresponding element of val1. If val1 is a non-vector value, then val1 is replicated across a vector, and then val2 is subtracted from it.
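The scalar handling for addition and subtraction can be pictured as "replicate, then operate element by element." The sketch below models that behavior in portable C; vec_u32, splat, model_add, and model_sub are hypothetical names, and a struct of four 32-bit integers stands in for a real SPU vector.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a vector of four 32-bit unsigned integers. */
typedef struct { uint32_t e[4]; } vec_u32;

/* Replicate a scalar across all four elements. */
static vec_u32 splat(uint32_t s)
{
    vec_u32 r = {{s, s, s, s}};
    return r;
}

/* Element-wise val1 + val2. */
static vec_u32 model_add(vec_u32 val1, vec_u32 val2)
{
    vec_u32 r;
    for (int i = 0; i < 4; i++)
        r.e[i] = val1.e[i] + val2.e[i];
    return r;
}

/* Element-wise val1 - val2. */
static vec_u32 model_sub(vec_u32 val1, vec_u32 val2)
{
    vec_u32 r;
    for (int i = 0; i < 4; i++)
        r.e[i] = val1.e[i] - val2.e[i];
    return r;
}

/* Usage example. */
static void example_add_sub(void)
{
    vec_u32 v = {{1, 2, 3, 4}};

    vec_u32 sum  = model_add(v, splat(2));    /* scalar second operand */
    vec_u32 diff = model_sub(splat(10), v);   /* scalar first operand  */

    assert(sum.e[0] == 3 && sum.e[1] == 4 && sum.e[2] == 5 && sum.e[3] == 6);
    assert(diff.e[0] == 9 && diff.e[1] == 8 && diff.e[2] == 7 && diff.e[3] == 6);
}
```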
spu_mul(val1, val2)
Because the multiplication instructions operate so differently, the SPU intrinsics do not coalesce them as much as they do for other operations. spu_mul handles floating point multiplication (single and double precision). The result is a vector where each element is the result of multiplying the corresponding elements of val1 and val2.
spu_and(val1, val2), spu_or(val1, val2), spu_xor(val1, val2), spu_nor(val1, val2), spu_nand(val1, val2), spu_eqv(val1, val2)
Boolean operations operate bit-by-bit, so the type of operands the boolean operations receive is not relevant except for determining the type of value they will return. spu_eqv is a bitwise equivalency operation, not a per-element equivalency operation.
spu_rl(val, count), spu_sl(val, count)
spu_rl rotates each element of val left by the number of bits specified in the corresponding element of count. Bits rotated off the end are rotated back in on the right. If count is a scalar value, then it is used as the count for all elements of val. spu_sl operates the same way, but performs a shift instead of a rotate.
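The difference between a rotate and a shift is easiest to see side by side. The following portable C sketch models the per-element behavior described above with per-element counts; vec_u32, rotl32, shl32, model_rl, and model_sl are hypothetical names, and the sketch deliberately restricts shift counts to 0 through 31 rather than reproducing how the hardware treats larger counts.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t e[4]; } vec_u32;   /* hypothetical vector stand-in */

/* Rotate left: bits leaving the top re-enter at the bottom. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;                                  /* rotating by 32 is a no-op */
    return n ? (x << n) | (x >> (32 - n)) : x;
}

/* Shift left: bits leaving the top are lost, zeros enter at the bottom. */
static uint32_t shl32(uint32_t x, unsigned n)
{
    assert(n < 32);                           /* larger counts are outside this sketch */
    return x << n;
}

/* Per-element rotate, one count per element. */
static vec_u32 model_rl(vec_u32 val, vec_u32 count)
{
    vec_u32 r;
    for (int i = 0; i < 4; i++)
        r.e[i] = rotl32(val.e[i], count.e[i]);
    return r;
}

/* Per-element shift, one count per element. */
static vec_u32 model_sl(vec_u32 val, vec_u32 count)
{
    vec_u32 r;
    for (int i = 0; i < 4; i++)
        r.e[i] = shl32(val.e[i], count.e[i]);
    return r;
}

/* Usage example: same inputs, rotate versus shift. */
static void example_rotate_shift(void)
{
    vec_u32 val   = {{0x80000001u, 0x80000001u, 0xF0000000u, 0x12345678u}};
    vec_u32 count = {{0, 1, 4, 8}};

    vec_u32 rot = model_rl(val, count);
    vec_u32 sh  = model_sl(val, count);

    assert(rot.e[1] == 0x00000003u && sh.e[1] == 0x00000002u);
    assert(rot.e[2] == 0x0000000Fu && sh.e[2] == 0x00000000u);
    assert(rot.e[3] == 0x34567812u && sh.e[3] == 0x34567800u);
}
```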
spu_rlmask(val, count), spu_rlmaska(val, count), spu_rlmaskqw(val, count), spu_rlmaskqwbyte(val, count)
These are very confusingly named operations. They are named "rotate left and mask," but they are actually performing right shifts (they are implemented by a combination of left shifts and masks, but the programming interface is for right shifts). spu_rlmask and spu_rlmaska shift each element of val to the right by the number of bits in the corresponding element of count (or by the value of count if count is a scalar). spu_rlmaska replicates the sign bit as bits are shifted in. spu_rlmaskqw operates on the whole quadword at a time, but only up to 7 bits (it performs a modulus on count to put it in the proper range). spu_rlmaskqwbyte works similarly, except that count is the number of bytes instead of bits, and count is modulus 16 instead of 8.
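The three flavors of right shift described above (zero-filling, sign-replicating, and whole-quadword by bytes) can be modeled in portable C as follows. This is a sketch under stated assumptions: the function names are hypothetical, the shift distance is passed as a plain non-negative number, and out-of-range distances are rejected instead of being reduced by a modulus. The real intrinsics encode the count differently (the underlying rotate-and-mask instructions negate it), so consult the Language Extension Specification before relying on a particular sign convention.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Logical right shift of one element: zeros enter at the top. */
static uint32_t shr_logical(uint32_t x, unsigned n)
{
    assert(n < 32);
    return x >> n;
}

/* Arithmetic right shift of one element: the sign bit is replicated. */
static uint32_t shr_arithmetic(uint32_t x, unsigned n)
{
    assert(n < 32);
    uint32_t r = x >> n;
    if (n != 0 && (x & 0x80000000u))
        r |= ~(0xFFFFFFFFu >> n);   /* fill the vacated top bits with ones */
    return r;
}

/* Whole-quadword right shift by bytes; byte 0 is the most significant byte. */
static void shr_quadword_bytes(const uint8_t in[16], unsigned nbytes,
                               uint8_t out[16])
{
    assert(nbytes < 16);
    uint8_t tmp[16] = {0};
    memcpy(tmp + nbytes, in, 16 - nbytes);   /* bytes move toward the low end */
    memcpy(out, tmp, 16);
}

/* Usage example. */
static void example_right_shifts(void)
{
    assert(shr_logical(0x80000000u, 4)    == 0x08000000u);
    assert(shr_arithmetic(0x80000000u, 4) == 0xF8000000u);
    assert(shr_arithmetic(0x40000000u, 4) == 0x04000000u);

    uint8_t in[16], out[16];
    for (int i = 0; i < 16; i++)
        in[i] = (uint8_t)(i + 1);            /* 0x01 .. 0x10 */
    shr_quadword_bytes(in, 3, out);
    assert(out[0] == 0 && out[2] == 0 && out[3] == 0x01 && out[15] == 0x0D);
}
```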
spu_cmpgt(val1, val2), spu_cmpeq(val1, val2)
These instructions perform element-by-element comparisons of their two operands. The results are stored as all ones (for true) and all zeros (for false) in the corresponding element of the resulting vector. spu_cmpgt performs a greater-than comparison while spu_cmpeq performs an equality comparison.
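The key point is that a comparison yields a mask per element, not a single true/false value. A portable C sketch of that behavior for four signed 32-bit elements follows; vec_s32, vec_u32, model_cmpgt, and model_cmpeq are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int32_t  e[4]; } vec_s32;   /* hypothetical operand type */
typedef struct { uint32_t e[4]; } vec_u32;   /* hypothetical mask type    */

/* Each result element is all ones where val1 > val2, all zeros elsewhere. */
static vec_u32 model_cmpgt(vec_s32 val1, vec_s32 val2)
{
    vec_u32 r;
    for (int i = 0; i < 4; i++)
        r.e[i] = (val1.e[i] > val2.e[i]) ? 0xFFFFFFFFu : 0u;
    return r;
}

/* Each result element is all ones where val1 == val2, all zeros elsewhere. */
static vec_u32 model_cmpeq(vec_s32 val1, vec_s32 val2)
{
    vec_u32 r;
    for (int i = 0; i < 4; i++)
        r.e[i] = (val1.e[i] == val2.e[i]) ? 0xFFFFFFFFu : 0u;
    return r;
}

/* Usage example. */
static void example_compare(void)
{
    vec_s32 a = {{1, 5, -3, 7}};
    vec_s32 b = {{2, 5, -4, 0}};

    vec_u32 gt = model_cmpgt(a, b);
    vec_u32 eq = model_cmpeq(a, b);

    assert(gt.e[0] == 0u && gt.e[1] == 0u);
    assert(gt.e[2] == 0xFFFFFFFFu && gt.e[3] == 0xFFFFFFFFu);
    assert(eq.e[0] == 0u && eq.e[1] == 0xFFFFFFFFu);
    assert(eq.e[2] == 0u && eq.e[3] == 0u);
}
```

Masks of this shape are exactly what a bitwise select expects as its condition operand, which is why comparisons and selects are usually used together.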
spu_sel(val1, val2, conditional)
This corresponds to the selb assembly language instruction. The instruction itself is bit-based, so all types use the same underlying instruction. However, the intrinsic operation returns a value of the same type as the operands. As in assembly language, spu_sel looks at each bit in conditional. If the bit is zero, the corresponding bit in the result is selected from the corresponding bit in val1; otherwise it is selected from the corresponding bit in val2.
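The bit-by-bit selection rule reduces to a single Boolean expression. The sketch below models it on one 32-bit word in portable C and then pairs it with a comparison mask to pick a maximum without branching; model_sel and branchless_max are hypothetical names, and the real intrinsic applies the same rule across all 128 bits of a vector.

```c
#include <assert.h>
#include <stdint.h>

/* A zero bit in conditional takes the bit from val1, a one bit from val2. */
static uint32_t model_sel(uint32_t val1, uint32_t val2, uint32_t conditional)
{
    return (val1 & ~conditional) | (val2 & conditional);
}

/* Maximum without a branch: the mask is what a greater-than
   comparison would produce for this element. */
static uint32_t branchless_max(uint32_t a, uint32_t b)
{
    uint32_t mask = (a > b) ? 0xFFFFFFFFu : 0u;
    return model_sel(b, a, mask);   /* mask set: take a; mask clear: take b */
}

/* Usage example. */
static void example_select(void)
{
    /* Upper half from val1, lower half from val2. */
    assert(model_sel(0xAAAAAAAAu, 0x55555555u, 0x0000FFFFu) == 0xAAAA5555u);

    assert(branchless_max(3, 9) == 9);
    assert(branchless_max(9, 3) == 9);
}
```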
spu_shuffle(val1, val2, pattern)
This is an interesting instruction which allows you to rearrange the bytes in val1 and val2 according to a pattern, specified in pattern. The instruction goes through each byte in pattern, and if the byte starts with the bits 0b10, the corresponding byte in the result is set to 0x00; if the byte starts with the bits 0b110, the corresponding byte in the result is set to 0xff; if the byte starts with the bits 0b111, the corresponding byte in the result is set to 0x80; finally (and most importantly), if none of the previous are true, the last five bits of the pattern byte are used to choose which byte from val1 or val2 should be taken as the value for the current byte. The two values are concatenated, and the five-bit value is used as the byte index of the concatenated value. This is used for inserting elements into vectors as well as performing fast table lookups.
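The pattern rules above translate almost line for line into a loop over sixteen bytes. The following portable C sketch models them on plain byte arrays; model_shuffle is a hypothetical name, and the arrays stand in for 128-bit vectors with byte 0 first.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Byte-wise model of the shuffle rules: special pattern prefixes produce
   constants, everything else indexes into val1 followed by val2. */
static void model_shuffle(const uint8_t val1[16], const uint8_t val2[16],
                          const uint8_t pattern[16], uint8_t out[16])
{
    uint8_t both[32];
    memcpy(both, val1, 16);        /* indices 0..15  */
    memcpy(both + 16, val2, 16);   /* indices 16..31 */

    for (int i = 0; i < 16; i++) {
        uint8_t p = pattern[i];
        if ((p & 0xC0) == 0x80)
            out[i] = 0x00;               /* pattern byte starts with 0b10  */
        else if ((p & 0xE0) == 0xC0)
            out[i] = 0xFF;               /* pattern byte starts with 0b110 */
        else if ((p & 0xE0) == 0xE0)
            out[i] = 0x80;               /* pattern byte starts with 0b111 */
        else
            out[i] = both[p & 0x1F];     /* low five bits index the 32 bytes */
    }
}

/* Usage example: reverse four bytes of val1, copy four from val2,
   emit the three constants, then pick individual bytes. */
static void example_shuffle(void)
{
    uint8_t val1[16], val2[16], out[16];
    for (int i = 0; i < 16; i++) {
        val1[i] = (uint8_t)(0xA0 + i);
        val2[i] = (uint8_t)(0xB0 + i);
    }
    const uint8_t pattern[16] = {
        0x03, 0x02, 0x01, 0x00,   0x10, 0x11, 0x12, 0x13,
        0x80, 0xC0, 0xE0, 0x1F,   0x04, 0x05, 0x06, 0x07
    };
    const uint8_t expected[16] = {
        0xA3, 0xA2, 0xA1, 0xA0,   0xB0, 0xB1, 0xB2, 0xB3,
        0x00, 0xFF, 0x80, 0xBF,   0xA4, 0xA5, 0xA6, 0xA7
    };
    model_shuffle(val1, val2, pattern, out);
    assert(memcmp(out, expected, 16) == 0);
}
```

A table lookup is the same operation viewed differently: val1 and val2 hold a 32-entry byte table, and pattern holds the indices to fetch.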
All of the instructions that are prefixed with spu_ will try to find the best instruction match based on the types of operands. However, not all vector types are supported by all instructions; support depends on whether an assembly language instruction is available to handle them.
In addition, if you want a specific instruction rather than having the compiler choose one, you can perform almost any non-branching instruction with the specific intrinsics. All specific intrinsics take the form si_assemblyinstructionname, where assemblyinstructionname is the name of the assembly language instruction as defined in the SPU Assembly Language Specification. For example, si_a(a, b) forces the instruction a to be used for addition. All operands to specific intrinsics are cast to a special type called qword, which is essentially an opaque register value type. The return values from specific intrinsics are also qwords, which can then be cast into whatever vector type you wish.
In "Programming on the Cell/B.E. processor, Part 5," the article from which this tip was taken, you can learn more about how to use the vector extensions and discover how to direct the compiler to do branch prediction and to perform DMA transfers in C/C++.
The full set of intrinsics is documented in the PPU & SPU C/C++ Language Extension Specification.
Another (more extended) tutorial resource for Cell/B.E. programming on both the SPE and the PPE is the official Cell/B.E. Programming Tutorial.
For a complete list of available DMA commands on the MFC, see chapter 7 of the Cell/B.E. Architecture Specification (1.01) and pages 508-510 of the Cell/B.E. Programming Handbook (1.0).
For more information on DMA list commands, see pages 51-62, 124-125, 129-130, and 157-158 of the Cell/B.E. Architecture Specification (1.01) and pages 73, 459-460, 509-510, and 527-530 of the Cell/B.E. Programming Handbook (1.0).
The transfer class ID and replacement class ID fields for MFC operations are described on pages 78 and 114 of the Cell/B.E. Architecture Specification (1.01) and pages 155-158, 455-456, and 513-515 of the Cell/B.E. Programming Handbook (1.0).
The IBM Semiconductor Solutions Technical Library documentation section contains a wealth of downloadable manuals, specifications, and much more.
Jonathan Bartlett is the author of the book Programming from the Ground Up, an introduction to programming using Linux assembly language. He is the lead developer at New Media Worx, responsible for developing Web, video, kiosk, and desktop applications for clients.