This article dives into the depths of instruction-cycle-counting, bit manipulation, and other nuances that assembly language has typically been notorious for. By the end of it you might be convinced never to program in assembly language ever again. However, the point of it all is not to program in assembly language at all times, but rather to understand what the compiler needs to do to optimize your code, and be able to supplement that with custom assembly language when required. Knowing how the SPU's assembly language works will also aid you in exploiting the processor in higher-level languages. Subsequent articles will use C and show how to use this optimization knowledge in real-world examples. The SPU has many C language extensions; knowing SPU assembly language will help you make sense of them, and knowing SPU optimization will help you use them well.
The last article ended with a function called convert_to_upper, which operated one byte at a time to
convert a string to uppercase. The functions in the programs in this
article operate on whole buffers at a time. The SPU is built to deal
with data in batches, so moving to a buffer-at-a-time model will make the
enhancements easier. The first version will simply wrap a loop around the
code developed in the previous article. Because it is based on code and
concepts developed in the previous articles, I will not do a step-by-step
explanation of the code.
Here is the unoptimized version of a buffer-at-a-time function for
converting to uppercase (enter as convert_buffer.s):
Listing 1. First example program
.text .global convert_buffer_to_upper .type convert_buffer_to_upper, @function convert_buffer_to_upper: ##REGISTER USAGE: # 3) buffer address / current address # 4) buffer size # 5) end_address # 6) current quadword # 7) current quadword with byte in first position # 8, 9, & 10) Determine if byte is in range # 11) byte insertion control # 12) current quadword with byte properly inserted # 13) true if we need to branch, false otherwise # 14) conversion factor #Calculate end address a $5, $4, $3 loop_start: #UNALIGNED LOAD lqd $6, 0($3) rotqby $7, $6, $3 rotqbyi $7, $7, -3 #IS IN RANGE 'a'-'z'? cgtbi $8, $7, 'a' - 1 cgtbi $9, $7, 'z' xor $10, $8, $9 andi $10, $10, 255 #If no, exit brz $10, finish_loop is_lowercase: #If yes, perform conversion il $14, 'a' - 'A' absdb $7, $7, $14 finish_loop: #Unaligned Store ($6 already has current word) cbd $11, 0($3) shufb $12, $7, $6, $11 stqd $12, 0($3) #Increment pointer ai $3, $3, 1 #Are we at the end? If not then loop. cgt $13, $3, $5 brz $13, loop_start end_function: #Return bi $lr |
As far as performance goes, the current code is terrible. The subsequent sections will improve on it a step at a time.
The function which drives the conversion function is now a little bit
simpler since it only has to load the data, run the function, and copy it
back. Here is the code (enter as convert_driver.s):
Listing 2. Uppercase conversion main function
.data #This is the struct we will copy from the main PPE process .align 4 conversion_info: conversion_length: .octa 0 conversion_data: .octa 0 .equ CONVERSION_STRUCT_SIZE, 32 .section .bss #Uninitialized Data Section #This is the buffer we will store the string in .align 4 .lcomm conversion_buffer, 16384 .text #MFC Constants .equ MFC_GET_CMD, 0x40 .equ MFC_PUT_CMD, 0x20 .equ LR_OFFSET, 16 .global main .type main, @function .equ MAIN_FRAME_SIZE, 32 main: #Prologue stqd $lr, LR_OFFSET($sp) stqd $sp, -MAIN_FRAME_SIZE($sp) ai $sp, $sp, -MAIN_FRAME_SIZE ##COPY IN CONVERSION INFORMATION## ila $3, conversion_info #Local Store Address #register 4 already has address #64-bit Effective Address il $5, CONVERSION_STRUCT_SIZE #Transfer size il $6, 0 #DMA Tag il $7, MFC_GET_CMD #DMA Command brsl $lr, perform_dma #Wait for DMA to complete il $3, 0 brsl $lr, wait_for_dma_completion ##COPY STRING IN TO BUFFER## #Load buffer data pointer ila $3, conversion_buffer #Local Store lqr $4, conversion_data #64-bit Effective Address lqr $5, conversion_length #SIZE il $6, 0 #DMA Tag il $7, MFC_GET_CMD #DMA Command brsl $lr, perform_dma #Wait for DMA to complete il $3, 0 brsl $lr, wait_for_dma_completion ##PERFORM CONVERSION## ila $3, conversion_buffer lqr $4, conversion_length brsl $lr, convert_buffer_to_upper ##COPY DATA BACK## ila $3, conversion_buffer #Local Store Address lqr $4, conversion_data #64-bit effective address lqr $5, conversion_length #Size il $6, 0 #DMA Tag il $7, MFC_PUT_CMD #DMA Command brsl $lr, perform_dma #Wait for DMA to complete il $3, 0 brsl $lr, wait_for_dma_completion ##EXIT PROGRAM## #Return Value il $3, 0 #Epilogue ai $sp, $sp, MAIN_FRAME_SIZE lqd $lr, LR_OFFSET($sp) bi $lr |
You will also need the dma_utils.s and the
ppu_dma_main.c files from the previous article.
To build and run, perform these steps:
spu-gcc convert_buffer.s convert_driver.s dma_utils.s -o spe_convert embedspu -m64 convert_to_upper_handle spe_convert spe_convert_csf.o gcc -m64 spe_convert_csf.o ppu_dma_main.c -lspe -o dma_convert ./dma_convert |
These same steps can be used to build all of the examples in this article.
The most obvious optimization to make on a vector process is to vectorize the code. This is known as SIMD (single instruction, multiple data), or data parallelism. On the SPUs, most instructions can operate on registers as if they contained multiple, independent values (thus the single instruction acting on multiple data items). Each 128-bit register can be treated as 16 independent bytes, eight half-words, four words, two doublewords, or as a single unit. The instruction set is primarily geared around dividing it into four 32-bit words, but there is enough support to handle any of these situations.
If you vectorize this code, since you are treating the values as bytes, that means that each instruction will operate on 16 values at once! However, the problem is that vector processing assumes that each and every instruction will be applied to all elements of the vector. However, in the main loop, you have a conditional branch. This means that vector elements which match the criteria go through a different set of instructions than those that do not. Therefore, at least the way the code presently stands, it cannot be vectorized.
What you need to do first is eliminate the branch so that the code
uses the exact same instructions whether it matches your condition or not
(as I'll show later, eliminating branches helps reduce branching
stalls as well). So how is this done? The key is that the SPU has
several conditional instructions, such as selb,
shufb, and the bit operations, which allow
conditional operations to occur without branching. What the program will
end up doing is calculating both answers, and then using a
conditional instruction to select the desired answer.
Here is the conversion code as it currently stands:
#IS IN RANGE 'a'-'z' cgtbi $8, $7, 'a' - 1 cgtbi $9, $7, 'z' xor $10, $8, $9 andi $10, $10, 255 brz $10, finish_loop is_lowercase: #lowercase condition il $14, 'a' - 'A' absdb $7, $7, $14 finish_loop: #non-lowercase condition #all code winds up here |
In this case, the two answers we are computing are:
- Uppercase-converted letter (if lowercase)
- Original input letter (if not lowercase)
The code starts with the original value in $7.
The first thing you need to do is to move the code which calculates the
converted value before the condition, and then store it in a
different register ($15 in this case). So the
code will look like this:
#$7 has our original value il $14, 'a' - 'A' absdb $15, $7, $14 #$7 has the original, and $15 has the converted value #Choose between the value in $7 and $14 and put it in $7 ##...rest of loop... |
So now you need to figure out which value you want to use. The first thing you need to do is to use your original instructions to check the condition:
cgtbi $8, $7, 'a' - 1 cgtbi $9, $7, 'z' xor $10, $8, $9 |
Note that the previous andi is no longer
needed because it was used to mask out unwanted values for the conditional
branch (conditional branches are based on true or false value of the
word preferred slot value and you only care about the byte
preferred slot value). Since you aren't branching you don't care! So now
$10 has all ones in the preferred slot if it is
in range, and all zeroes if it is out of range. Now all you need is to
choose $7 or $15
based on the value in $10. The instruction
selb (select bits) is perfect for this. selb has four operands:
- destination register
- source value 1
- source value 2
- selector
selb operates by going through the
selector bit-by-bit. For each bit position, if the bit is 0, the same bit
position in the destination register uses the bit from source value 1.
If the bit is 1, it uses the bit from source value 2. If you imagine each
register as an array of bits, selb has the
following meaning:
//imaginary representation of selb for those more familiar with C than assembly language:
for(i = 0; i < 128; i++) {
destination[i] = selector[i] == 0 ? source_1[i] : source_2[i]
}
|
Now I hope you see why the condition statements set all of the
corresponding bits in the destination register to 1 if the condition is
true -- it makes it easier to use that value for selb. In this case, you can simply add the following
line of code:
selb $7, $7, $15, $10 |
Now, all of your values, whether they are lowercase or not, will be processed through the following code sequence:
Listing 3. Branch-free conversion code
#Original value starts in $7
#Perform conversion and store in $15
il $14, 'a' - 'A'
absdb $15, $7, $14
#Is it lowercase ('a'-'z')?
cgtbi $8, $7, 'a'-1
cgtbi $9, $7, 'z'
xor $10, $8, $9
#$10 has all 1s for lowercase and all 0s for non-lowercase in the preferred slot
#Select appropriate value into $7 based on condition
selb $7, $7, $15, $10
#$7 now has the correct value
|
In this case, the choice was between the original value and a processed
value, but the code would have been similar if the choice was between two
processed values. In that case, you would have just had two sets of
processing instructions, with each set using a different register for its
result, and the selb instruction choosing
between them. Likewise, if there were more than two possible directions
for the code to go, multiple selbs could be
used to choose between them. However, at that point, you probably need to
look and see if the cost of calculating all of the different possibilities
for every input is worth the benefit of eliminating branches.
Remember that the point of removing the branch was so that you can vectorize the code. The problem was that in order to vectorize the code, the code must follow the same set of instructions for each member of the vector. Now that you have eliminated the branches this is possible.
In fact, the core conversion code is actually almost already vectorized. All of the instructions operate on the whole register anyway. The problem before was threefold:
- The branch was causing either the whole register to be converted or not converted.
- The register holding the conversion factor was geared to a single byte usage rather than a whole register (
illoads the given value into each word but you need it in each byte). - The load/store instructions and the loop counter were geared for processing a single byte at a time.
Now that you've eliminated the branch, you need to load your conversion
factor into every byte of the conversion register. The easiest way to do
this is to put the conversion factor in the .data section manually and load it directly in. You
should also move it outside of the loop since its value is invariant.
So, in the .data section, add:
.equ CONVERSION_FACTOR, 'a' - 'A' .align 4 conversion_bytes: .fill 16, 1, CONVERSION_FACTOR |
And in the code before the loop, add:
lqr $14, conversion_bytes |
With these additions, all values in register 7 will be processed appropriately. Here is the code again, with a possible starting value to demonstrate what is happening:
Listing 4. Following a set of values through the conversion
#$7 starts with 'Hello There! ' #In hex, that's 0x48656c6c6f2054686572652120202020 #$14 is the conversion factor in each byte #In hex, that's 0x20202020202020202020202020202020 absdb $15, $7, $14 # -> $15 now has 0x28454c4c4f0034484552450100000000 cgtbi $8, $7, 'a'-1 # -> $8 now has 0xffffffffff00ffffffffff0000000000 cgtbi $9, $7, 'z' # -> $9 now has 0xff0000000000ff000000000000000000 xor $10, $8, $9 # -> $10 now has 0x00ffffffff0000ffffffff0000000000 selb $7, $7, $15, $10 # -> $7 now has 0x48454c4c4f2054484552452120202020 # which is hex for 'HELLO THERE! ' |
So now all you need to do is change the loop so that it will utilize this. It needs to load a full quadword (16 bytes) at once, and store it back at once, and increment the pointer by 16 instead of 1. This, interestingly, will require fewer instructions because you are no longer having to mess with the preferred slot. So, here is the complete function with the new loop skeleton:
Listing 5. Loop skeleton for vectorized code
##Store Conversion Factor## .data .equ CONVERSION_FACTOR, 'a' - 'A' .align 4 conversion_bytes: .fill 16, 1, CONVERSION_FACTOR .text .global convert_buffer_to_upper .type convert_buffer_to_upper, @function convert_buffer_to_upper: #Calculate end address a $5, $4, $3 #Load in conversion factors lqr $14, conversion_bytes loop_start: #Aligned Load lqd $7, 0($3) ##CONVERSION## absdb $15, $7, $14 cgtbi $8, $7, 'a'-1 cgtbi $9, $7, 'z' xor $10, $8, $9 selb $7, $7, $15, $10 ##END CONVERSION## #Aligned Store stqd $7, 0($3) #Increment Pointer ai $3, $3, 16 #Exit if needed ($5 has the ending address) cgt $13, $3, $5 brz $13, loop_start end_function: bi $lr |
As you can see, the code is much simpler -- it has fewer branches and fewer instructions. However, this new code now assumes that the starting address is 16-byte aligned, and also that it has enough padding on the end that the next data element in memory is also 16-byte aligned. Otherwise, you might end up converting something besides letters! As you can see, for vector processing, alignment and padding are both critically important. It doesn't really matter if the data itself is large enough to fit in the buffer. Since it is converting as a vector, it doesn't cost anything to convert a few extra bytes of junk. If you wind up having to waste a few bytes in your buffer, it is trivial compared to the amount of time and code needed to special-case the beginning and end of unaligned data. By keeping the data aligned and padded to 16-byte boundaries, vector operations can be performed effortlessly.
Loop unrolling has been an optimization technique since the dawn of computer programming. I cover it here not only because it increases efficiency on its own by eliminating branches, but also because it if you do it right it will help later on in instruction scheduling.
Probably by this point you have already been having trouble keeping up
with which register holds what value. After all, the register names are
essentially arbitrary numbers, which are nearly impossible to make sense
of. However, because the registers are only numbers, you can use .equ to give your registers descriptive names. For
example, you can rewrite your conversion program as follows (note that the
registers have been renumbered):
Listing 6. Uppercase conversion with named registers
.data .equ CONVERSION_FACTOR, 'a' - 'A' .align 4 conversion_bytes: .fill 16, 1, CONVERSION_FACTOR .text .global convert_buffer_to_upper .type convert_buffer_to_upper, @function ##REGISTER DEFINITIONS## #Loop/function control registers .equ BUFFER_REG, 3 #Buffer address / current address .equ BUFFER_SZ_REG, 4 #Buffer size .equ BUFFER_END_REG, 5 #End address .equ CONVERSION_BYTES_REG, 6 #Conversion data .equ IS_FINISHED_REG, 7 #Finished conversion? #Conversion-oriented registers .equ CURRENT_VAL_REG, 8 #Current quadword .equ BOOL_TMP1_REG, 9 #used for computing IN_RANGE_REG .equ BOOL_TMP2_REG, 10 #used for computing IN_RANGE_REG .equ IN_RANGE_REG, 11 #Value in range? .equ PROCESSED_VAL_REG, 12 #Conversion bytes, properly masked #Information about registers .equ NUMREGS, 5 #Number of per-iteration registers .equ REGBYTES, 16 #Number of bytes in a register convert_buffer_to_upper: #Calculate end address a $BUFFER_END_REG, $BUFFER_SZ_REG, $BUFFER_REG lqr $CONVERSION_BYTES_REG, conversion_bytes loop_start: #Aligned Load lqd $CURRENT_VAL_REG, 0($BUFFER_REG) ##CONVERSION## absdb $PROCESSED_VALS_REG, $CURRENT_VAL_REG, $CONVERSION_BYTES_REG cgtbi $BOOL_TMP1_REG, $CURRENT_VAL_REG, 'a'-1 cgtbi $BOOL_TMP2_REG, $CURRENT_VAL_REG, 'z' xor $IN_RANGE_REG, $BOOL_TMP1_REG, $BOOL_TMP2_REG selb $CURRENT_VAL_REG, $CURRENT_VAL_REG, $PROCESSED_VAL_REG, $IN_RANGE_REG ##END CONVERSION## #Aligned Store stqd $CURRENT_VAL_REG, 0($BUFFER_REG) #Increment Pointer ai $BUFFER_REG, $BUFFER_REG, REGBYTES #Exit if needed cgt $IS_FINISHED_REG, $BUFFER_REG, $BUFFER_END_REG brz $IS_FINISHED_REG, loop_start end_function: bi $lr |
It's a lot more verbose, but it also makes it easier to browse through the code. It also makes it much easier to do instruction scheduling for unrolled loops. I'll get to that in a minute. For the present, look at how you might unroll this loop four times, using different registers for each iteration (using different registers will help when you optimize instruction scheduling). I'll discuss why and how I rewrote the program in this way shortly:
Listing 7. Buffer conversion -- loop unrolled
loop_start: #ITERATION 0 lqd $(CURRENT_VAL_REG+0*NUMREGS), 0*REGBYTES($BUFFER_REG) absdb $(PROCESSED_VAL_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS), $CONVERSION_BYTES_REG cgtbi $(BOOL_TMP1_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS), 'a'-1 cgtbi $(BOOL_TMP2_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS), 'z' xor $(IN_RANGE_REG+0*NUMREGS), $(BOOL_TMP1_REG+0*NUMREGS), $(BOOL_TMP2_REG+0*NUMREGS) selb $(CURRENT_VAL_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS), $(PROCESSED_VAL_REG+0*NUMREGS), $(IN_RANGE_REG+0*NUMREGS) stqd $(CURRENT_VAL_REG+0*NUMREGS), 0*REGBYTES($BUFFER_REG) #ITERATION 1 lqd $(CURRENT_VAL_REG+1*NUMREGS), 1*REGBYTES($BUFFER_REG) absdb $(PROCESSED_VAL_REG+1*NUMREGS), $(CURRENT_VAL_REG+1*NUMREGS), $CONVERSION_BYTES_REG cgtbi $(BOOL_TMP1_REG+1*NUMREGS), $(CURRENT_VAL_REG+1*NUMREGS), 'a'-1 cgtbi $(BOOL_TMP2_REG+1*NUMREGS), $(CURRENT_VAL_REG+1*NUMREGS), 'z' xor $(IN_RANGE_REG+1*NUMREGS), $(BOOL_TMP1_REG+1*NUMREGS), $(BOOL_TMP2_REG+1*NUMREGS) selb $(CURRENT_VAL_REG+1*NUMREGS), $(CURRENT_VAL_REG+1*NUMREGS), $(PROCESSED_VAL_REG+1*NUMREGS), $(IN_RANGE_REG+1*NUMREGS) stqd $(CURRENT_VAL_REG+1*NUMREGS), 1*REGBYTES($BUFFER_REG) #ITERATION 2 lqd $(CURRENT_VAL_REG+2*NUMREGS), 2*REGBYTES($BUFFER_REG) absdb $(PROCESSED_VAL_REG+2*NUMREGS), $(CURRENT_VAL_REG+2*NUMREGS), $CONVERSION_BYTES_REG cgtbi $(BOOL_TMP1_REG+2*NUMREGS), $(CURRENT_VAL_REG+2*NUMREGS), 'a'-1 cgtbi $(BOOL_TMP2_REG+2*NUMREGS), $(CURRENT_VAL_REG+2*NUMREGS), 'z' xor $(IN_RANGE_REG+2*NUMREGS), $(BOOL_TMP1_REG+2*NUMREGS), $(BOOL_TMP2_REG+2*NUMREGS) selb $(CURRENT_VAL_REG+2*NUMREGS), $(CURRENT_VAL_REG+2*NUMREGS), $(PROCESSED_VAL_REG+2*NUMREGS), $(IN_RANGE_REG+2*NUMREGS) stqd $(CURRENT_VAL_REG+2*NUMREGS), 2*REGBYTES($BUFFER_REG) #ITERATION 3 lqd $(CURRENT_VAL_REG+3*NUMREGS), 3*REGBYTES($BUFFER_REG) absdb $(PROCESSED_VAL_REG+3*NUMREGS), $(CURRENT_VAL_REG+3*NUMREGS), $CONVERSION_BYTES_REG cgtbi $(BOOL_TMP1_REG+3*NUMREGS), $(CURRENT_VAL_REG+3*NUMREGS), 'a'-1 cgtbi $(BOOL_TMP2_REG+3*NUMREGS), $(CURRENT_VAL_REG+3*NUMREGS), 'z' xor $(IN_RANGE_REG+3*NUMREGS), $(BOOL_TMP1_REG+3*NUMREGS), $(BOOL_TMP2_REG+3*NUMREGS) selb $(CURRENT_VAL_REG+3*NUMREGS), $(CURRENT_VAL_REG+3*NUMREGS), $(PROCESSED_VAL_REG+3*NUMREGS), $(IN_RANGE_REG+3*NUMREGS) stqd $(CURRENT_VAL_REG+3*NUMREGS), 3*REGBYTES($BUFFER_REG) #Increment Pointer ai $BUFFER_REG, $BUFFER_REG, 4*REGBYTES #Exit if needed cgt $IS_FINISHED_REG, $BUFFER_REG, $BUFFER_END_REG brz $IS_FINISHED_REG, loop_start |
What this program is doing is calculating the registers being
used. You could have simply numbered the registers, but then writing
the code and remembering which register does what would get even more
tedious than before. However, since each iteration uses the same number
of registers, you can simply calculate the register number at assembly
time. For example, look at $(BOOL_TMP1_REG+2*NUMREGS). This means that it is
the BOOL_TMP1_REG for iteration 2. The actual
register number, since BOOL_TMP1_REG is 9 and
NUMREGS is 5, is 9+2*5, or 19. This way, if you
need to add a register to your code later, the assembler will
auto-recalculate the new register numbers and you don't have to alter your
register numbering convention. You would just assign the register its
own symbolic name and increase the value of NUMREGS.
In addition, as will be apparent shortly, if you need to re-order your instructions for faster execution, this way of naming registers will make it much easier to see both iterations of the loop the register is dealing with as well as the register's purpose. It also makes it easier to modify the program when both of these are readily visible from the code itself.
What is not usually apparent to new assembly language programmers is that the order of instructions affects the speed of the program. The reason for this is that instructions take more than one cycle to finish, but the processors are set up so that an instruction, depending on the ordering of the instructions, does not have to complete before it begins execution of the next instruction. This is known as pipelining. Setting up instructions so that they take full advantage of a processor's pipeline is called instruction scheduling. A few important terms related to pipelining and instruction scheduling are:
- Latency -- The number of clock cycles an instruction uses to produce a final value. This is the same as the size of the pipeline used to process the value.
- Stall -- A clock cycle where the processor does not begin a new instruction.
- Dependency stall -- This is a stall that occurs because one of the operands of the next instruction requires a value from a previous instruction that has not yet completed.
Most of the work of performance tuning on the SPU deals with avoiding register stalls. Therefore, take a look at the pipelining of different types of instructions on the SPU (information from the Cell BE Handbook, page 688):
SPU instruction latency
| Instruction type | Latency | Pipeline | Additional notes |
|---|---|---|---|
| Double-precision floating-point operations | 13 | Even | The first six cycles are actually stalls in which no other instruction can be issued. Dual-issue (discussed later) is not allowed with these instructions either. |
| Integer multiplies, floating-point/integer conversion, interpolate | 7 | Even | |
| Single-precision floating-point | 6 | Even | |
| Byte operations | 4 | Even | |
| Element-based rotates and shifts | 4 | Even | |
| Immediate-mode loads | 2 | Even | |
Simple integer and logical operations (including selb) | 2 | Even | |
| Load and Store Operations | 6 | Odd | Unlike other architectures, SPU loads and stores are deterministic because there is no cache. By reducing the memory so that it is all on-chip in the local store, the SPU can have much faster, much more reliable load and store times than other types of processors. |
| Branch hints | 6 | Odd | Special rules for branch hints will be discussed in a subsequent section. |
| Channel Operations | 6 | Odd | |
| Special-purpose register manipulation | 6 | Odd | |
| Branches | 4 | Odd | Properly hinted branches (discussed in a subsequent section) allow the next instruction to be issued in the very next cycle. |
| Shuffle bytes | 4 | Odd | |
| Quadword rotates and shifts | 4 | Odd | |
| Estimate | 4 | Odd | |
| Gather, mask, and generate insertion controls | 4 | Odd |
So, say that I have the following instructions:
a $5, $6, $7 #instruction 1 a $8, $5, $9 #instruction 2 a $10, $8, $7 #instruction 3 a $11, $8, $7 #instruction 4 |
In this program, it takes four clock cycles for instruction 1 to finish.
Instruction 2 requires the result of instruction 1 ($5) to compute, so it has to wait the full four clock
cycles. Instruction 3 requires the result of instruction 2 ($8), so it has to wait four clock cycles.
Instruction 4 can be issued in the clock cycle immediately after
instruction 3 because it does not require the result of instruction 3 to
execute. You can visualize it like this:
a $5, $6, $7 #cycle 1 #Stall for $5 #cycle 2 #Stall for $5 #cycle 3 #Stall for $5 #cycle 4 a $8, $5, $9 #cycle 5 #Stall for $8 #cycle 6 #Stall for $8 #cycle 7 #Stall for $8 #cycle 8 a $10, $8, $7 #cycle 9 a $11, $8, $7 #cycle 10 |
As you can see, you will get a drastic increase in performance if you are able to arrange your instructions so that no instruction is waiting on any other instruction.
The SPU is not only able to process multiple values at once through its
pipeline, it can also dual-issue instructions through different
pipelines. The SPU has two pipelines, even (sometimes called
pipeline 0 or the execute pipeline) and odd
(sometimes called pipeline 1 or the load pipeline). In the
table above, the different types of instructions were listed along with
which pipeline they execute in. The SPU actually loads two instructions
at a time from a doubleword-aligned boundary. This is called a fetch
group. If the first instruction in this fetch group is an even
pipeline instruction and the second one is an odd pipeline instruction,
they can both be issued simultaneously. If these conditions are not all
met, or if the second instruction needs to wait for dependencies before
issuing, then they are issued in separate cycles. To help align
instructions properly for enabling dual-issue, there are two no-operation
instructions that can be used to properly pad the instructions -- nop (no-operation on the even pipeline) and lnop (no-operation on the odd pipeline). Also, you
can use .align 3 to force a given instruction
to start in a new fetch group (it will be padded with appropriate no-ops to align it properly).
Look at one iteration in the loop and see how it performs in the SPU pipeline. I've added no-ops so you can see the pipeline issues better:
Listing 8. Loop iteration with stall information
.align 4 #force new fetch group #ITERATION 0 nop lqd $(CURRENT_VAL_REG+0*NUMREGS), 0*REGBYTES($BUFFER_REG) #stall (waiting on CURRENT_VAL_REG) #stall #stall #stall #stall absdb $(PROCESSED_VAL_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS), $CONVERSION_BYTES_REG lnop cgtbi $(BOOL_TMP1_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS), 'a'-1 lnop cgtbi $(BOOL_TMP2_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS), 'z' lnop #stall (waiting on BOOL_TMP2_REG) xor $(IN_RANGE_REG+0*NUMREGS), $(BOOL_TMP1_REG+0*NUMREGS), $(BOOL_TMP2_REG+0*NUMREGS) lnop #stall (waiting on IN_RANGE_REG) selb $(CURRENT_VAL_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS), $(PROCESSED_VAL_REG+0*NUMREGS), $(IN_RANGE_REG+0*NUMREGS) lnop #stall (waiting on CURRENT_VAL_REG) nop stqd $(CURRENT_VAL_REG+0*NUMREGS), 0*REGBYTES($BUFFER_REG) |
As you can see, this single iteration is wasting 8 cycles just waiting on registers to finish loading. In addition, it is wasting 7 opportunities for dual-issue. So even in this vectorized implementation, there is a lot of room for improvement!
You may be wondering why the program did
not put selb and stqd in the same fetch group. You could have, but it
would not have increased the speed of the program. Since stqd has to stall for the value of CURRENT_VAL_REG they would have had to be issued
separately anyway, and you would not have gained any speed.
Sometimes instruction scheduling is a hassle. However, when used in conjunction with loop unrolling, it's not so bad. Since each iteration is using a different set of registers for computation, all you have to do is interleave computations from each iteration to fill up the time slots. So your new loop body looks like this:
Listing 9. Interleaved loop body minimizes dependency stalls
lqd $(CURRENT_VAL_REG+0*NUMREGS), 0*REGBYTES($BUFFER_REG) lqd $(CURRENT_VAL_REG+1*NUMREGS), 1*REGBYTES($BUFFER_REG) lqd $(CURRENT_VAL_REG+2*NUMREGS), 2*REGBYTES($BUFFER_REG) lqd $(CURRENT_VAL_REG+3*NUMREGS), 3*REGBYTES($BUFFER_REG) absdb $(PROCESSED_VAL_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS), $CONVERSION_BYTES_REG absdb $(PROCESSED_VAL_REG+1*NUMREGS), $(CURRENT_VAL_REG+1*NUMREGS), $CONVERSION_BYTES_REG absdb $(PROCESSED_VAL_REG+2*NUMREGS), $(CURRENT_VAL_REG+2*NUMREGS), $CONVERSION_BYTES_REG absdb $(PROCESSED_VAL_REG+3*NUMREGS), $(CURRENT_VAL_REG+3*NUMREGS), $CONVERSION_BYTES_REG cgtbi $(BOOL_TMP1_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS), 'a'-1 cgtbi $(BOOL_TMP1_REG+1*NUMREGS), $(CURRENT_VAL_REG+1*NUMREGS), 'a'-1 cgtbi $(BOOL_TMP1_REG+2*NUMREGS), $(CURRENT_VAL_REG+2*NUMREGS), 'a'-1 cgtbi $(BOOL_TMP1_REG+3*NUMREGS), $(CURRENT_VAL_REG+3*NUMREGS), 'a'-1 cgtbi $(BOOL_TMP2_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS), 'z' cgtbi $(BOOL_TMP2_REG+1*NUMREGS), $(CURRENT_VAL_REG+1*NUMREGS), 'z' cgtbi $(BOOL_TMP2_REG+2*NUMREGS), $(CURRENT_VAL_REG+2*NUMREGS), 'z' cgtbi $(BOOL_TMP2_REG+3*NUMREGS), $(CURRENT_VAL_REG+3*NUMREGS), 'z' xor $(IN_RANGE_REG+0*NUMREGS), $(BOOL_TMP1_REG+0*NUMREGS), $(BOOL_TMP2_REG+0*NUMREGS) xor $(IN_RANGE_REG+1*NUMREGS), $(BOOL_TMP1_REG+1*NUMREGS), $(BOOL_TMP2_REG+1*NUMREGS) xor $(IN_RANGE_REG+2*NUMREGS), $(BOOL_TMP1_REG+2*NUMREGS), $(BOOL_TMP2_REG+2*NUMREGS) xor $(IN_RANGE_REG+3*NUMREGS), $(BOOL_TMP1_REG+3*NUMREGS), $(BOOL_TMP2_REG+3*NUMREGS) selb $(CURRENT_VAL_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS), $(PROCESSED_VAL_REG+0*NUMREGS), $(IN_RANGE_REG+0*NUMREGS) selb $(CURRENT_VAL_REG+1*NUMREGS), $(CURRENT_VAL_REG+1*NUMREGS), $(PROCESSED_VAL_REG+1*NUMREGS), $(IN_RANGE_REG+1*NUMREGS) selb $(CURRENT_VAL_REG+2*NUMREGS), $(CURRENT_VAL_REG+2*NUMREGS), $(PROCESSED_VAL_REG+2*NUMREGS), $(IN_RANGE_REG+2*NUMREGS) selb $(CURRENT_VAL_REG+3*NUMREGS), $(CURRENT_VAL_REG+3*NUMREGS), $(PROCESSED_VAL_REG+3*NUMREGS), $(IN_RANGE_REG+3*NUMREGS) stqd $(CURRENT_VAL_REG+0*NUMREGS), 0*REGBYTES($BUFFER_REG) stqd $(CURRENT_VAL_REG+1*NUMREGS), 1*REGBYTES($BUFFER_REG) stqd $(CURRENT_VAL_REG+2*NUMREGS), 2*REGBYTES($BUFFER_REG) stqd $(CURRENT_VAL_REG+3*NUMREGS), 3*REGBYTES($BUFFER_REG) #Increment Pointer ai $BUFFER_REG, $BUFFER_REG, 4*REGBYTES #Exit if needed cgt $IS_FINISHED_REG, $BUFFER_REG, $BUFFER_END_REG brz $IS_FINISHED_REG, loop_start |
This technique is called software pipelining, and this code only loses 2 cycles to stalls. However, it still does not make much use of dual-issue. In fact, there aren't many opportunities to do that at all in this code.
If you were to unroll the loop four more iterations, you could interleave
each set of four so that one set of instructions was doing the loads while
the other one was doing the executes, and that would save some clock
cycles through dual-issue. However, for now, I will simply show how to
save two clock cycles by adjusting the order of the selb and stqd
instructions. Here is the new order:
Listing 10. Rescheduling instructions
selb $(CURRENT_VAL_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS), $(PROCESSED_VAL_REG+0*NUMREGS), $(IN_RANGE_REG+0*NUMREGS) selb $(CURRENT_VAL_REG+1*NUMREGS), $(CURRENT_VAL_REG+1*NUMREGS), $(PROCESSED_VAL_REG+1*NUMREGS), $(IN_RANGE_REG+1*NUMREGS) .align 3 ####Force to the start of a fetch-group #Next two issued concurrently selb $(CURRENT_VAL_REG+2*NUMREGS), $(CURRENT_VAL_REG+2*NUMREGS), $(PROCESSED_VAL_REG+2*NUMREGS), $(IN_RANGE_REG+2*NUMREGS) stqd $(CURRENT_VAL_REG+0*NUMREGS), 0*REGBYTES($BUFFER_REG) #Next two issued concurrently selb $(CURRENT_VAL_REG+3*NUMREGS), $(CURRENT_VAL_REG+3*NUMREGS), $(PROCESSED_VAL_REG+3*NUMREGS), $(IN_RANGE_REG+3*NUMREGS) stqd $(CURRENT_VAL_REG+1*NUMREGS), 1*REGBYTES($BUFFER_REG) stqd $(CURRENT_VAL_REG+2*NUMREGS), 2*REGBYTES($BUFFER_REG) stqd $(CURRENT_VAL_REG+3*NUMREGS), 3*REGBYTES($BUFFER_REG) |
Simply by properly aligning a section of the program on what you want to
be a fetch-group boundary and moving one instruction (the last selb) to a more opportune location, you can save two
clock cycles. Note that without the .align 3,
if the selb instructions were in the odd
position while the stqd instructions were in
the even position, you would not achieve dual-issue, because dual-issue
only occurs when the two instructions are both appropriately sequenced and
aligned.
The SPU has no real hardware support for branch hinting. However, it makes up for this (and in some cases has the possibility of surpassing hardware support) by providing excellent software support for branch hinting.
Branch hinting is necessary on the SPU because mispredicted branches come at a high cost. It takes 18-19 cycles to recover from a mispredicted branch. In addition, by default, every branch encountered by the SPU is assumed to be not taken, including unconditional branches. What a branch hint does is specify to the processor that, for a specific branch instruction (also called a hint-trigger address), what address it is likely to branch to (called the branch target address). This allows the processor to prepare for the branch ahead of time (prefetching the instructions, for instance). Branch hints never affect the logical outcome of a program. They only affect the number of cycles required to run the program.
There are three branch-hinting instructions:
hbr hint_trigger, $register-- This tells the processor that the branch instruction at the relative addresshint_triggeris likely to branch to the address specified in register$register.hbrr hint_trigger, branch_target-- This tells the processor that the branch instruction at the relative addresshint_triggeris likely to branch to the relative addressbranch_target(both are relative from the current instruction).hbra hint_trigger, branch_target-- The same ashbrr, except thatbranch_targetis specified as an absolute address.
For a branch hint to be most effective (so that the branch does
not stall at all), it must be set at least four instruction fetch-groups
plus 11 cycles before the branch instruction. At minimum, the branch
hint must be four instruction fetch groups before the branch instruction,
or it will have no effect. It also may not be more than 255 instructions
away (physically) from the branch that it hints for (the instruction
itself only has space for 8 bits plus a sign bit for the relative offset
of the hint trigger, which then has two zeroes concatenated at the end).
For example, if it is four instruction fetch groups plus 3 cycles away
from the branch instruction, the branch instruction will enter hint
stall for 8 cycles, which, while not optimum, is still much better
than the 18 cycles it would stall without the hint. Only one hint can be
active at a time, and sync instructions, among
other things, clear out any active hint.
The best place to use a hint in your code is before the loop. Since the loop will be more likely taken than not taken (at least for larger strings), you could give a symbolic name to your branch instruction, and hint before the loop that the branch is likely to be taken. The code change would look like this:
Listing 11. Hinted branches
hbrr loop_branch_instruction, loop_start loop_start: ##... conversions go here ... ## #Increment Pointer ai $BUFFER_REG, $BUFFER_REG, 4*REGBYTES #Exit if needed cgt $IS_FINISHED_REG, $BUFFER_REG, $BUFFER_END_REG loop_branch_instruction: brz $IS_FINISHED_REG, loop_start |
Because the hint is before the loop body, this code leaves your hint active for every iteration of the loop, but it only has to use one cycle.
Unfortunately, because the loop branch is so close to the return statement, you cannot predict both the loop branch and the return branch. However, if you thought that the branch loop was not likely to be taken (in this case, if the string is likely less than 64 characters) then you could hint the return address instead by changing the hint to:
#This assumes that $lr has the right address right now (true in our case) hbr end_function, $lr |
You can actually do some fairly advanced hinting behavior using register-based hint instructions. Just keep in mind the hint restrictions as well as the fact that hint instructions do take up a cycle of your program.
At the end, your optimized function should look like this:
Listing 12. Fully optimized conversion function
.data .equ CONVERSION_FACTOR, 'a' - 'A' .align 4 conversion_bytes: .fill 16, 1, CONVERSION_FACTOR .text .global convert_buffer_to_upper .type convert_buffer_to_upper, @function .equ BUFFER_REG, 3 .equ BUFFER_SZ_REG, 4 .equ BUFFER_END_REG, 5 .equ CONVERSION_BYTES_REG, 6 .equ IS_FINISHED_REG, 7 .equ CURRENT_VAL_REG, 8 .equ BOOL_TMP1_REG, 9 .equ BOOL_TMP2_REG, 10 .equ IN_RANGE_REG, 11 .equ PROCESSED_VAL_REG, 12 .equ NUMREGS, 5 .equ REGBYTES, 16 convert_buffer_to_upper: a $BUFFER_END_REG, $BUFFER_SZ_REG, $BUFFER_REG lqr $CONVERSION_BYTES_REG, conversion_bytes hbrr loop_branch_instruction, loop_start loop_start: lqd $(CURRENT_VAL_REG+0*NUMREGS), 0*REGBYTES($BUFFER_REG) lqd $(CURRENT_VAL_REG+1*NUMREGS), 1*REGBYTES($BUFFER_REG) lqd $(CURRENT_VAL_REG+2*NUMREGS), 2*REGBYTES($BUFFER_REG) lqd $(CURRENT_VAL_REG+3*NUMREGS), 3*REGBYTES($BUFFER_REG) absdb $(PROCESSED_VAL_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS), $CONVERSION_BYTES_REG absdb $(PROCESSED_VAL_REG+1*NUMREGS), $(CURRENT_VAL_REG+1*NUMREGS), $CONVERSION_BYTES_REG absdb $(PROCESSED_VAL_REG+2*NUMREGS), $(CURRENT_VAL_REG+2*NUMREGS), $CONVERSION_BYTES_REG absdb $(PROCESSED_VAL_REG+3*NUMREGS), $(CURRENT_VAL_REG+3*NUMREGS), $CONVERSION_BYTES_REG cgtbi $(BOOL_TMP1_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS), 'a'-1 cgtbi $(BOOL_TMP1_REG+1*NUMREGS), $(CURRENT_VAL_REG+1*NUMREGS), 'a'-1 cgtbi $(BOOL_TMP1_REG+2*NUMREGS), $(CURRENT_VAL_REG+2*NUMREGS), 'a'-1 cgtbi $(BOOL_TMP1_REG+3*NUMREGS), $(CURRENT_VAL_REG+3*NUMREGS), 'a'-1 cgtbi $(BOOL_TMP2_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS), 'z' cgtbi $(BOOL_TMP2_REG+1*NUMREGS), $(CURRENT_VAL_REG+1*NUMREGS), 'z' cgtbi $(BOOL_TMP2_REG+2*NUMREGS), $(CURRENT_VAL_REG+2*NUMREGS), 'z' cgtbi $(BOOL_TMP2_REG+3*NUMREGS), $(CURRENT_VAL_REG+3*NUMREGS), 'z' xor $(IN_RANGE_REG+0*NUMREGS), $(BOOL_TMP1_REG+0*NUMREGS), $(BOOL_TMP2_REG+0*NUMREGS) xor $(IN_RANGE_REG+1*NUMREGS), $(BOOL_TMP1_REG+1*NUMREGS), $(BOOL_TMP2_REG+1*NUMREGS) xor $(IN_RANGE_REG+2*NUMREGS), $(BOOL_TMP1_REG+2*NUMREGS), $(BOOL_TMP2_REG+2*NUMREGS) xor $(IN_RANGE_REG+3*NUMREGS), $(BOOL_TMP1_REG+3*NUMREGS), $(BOOL_TMP2_REG+3*NUMREGS) selb $(CURRENT_VAL_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS), $(PROCESSED_VAL_REG+0*NUMREGS), $(IN_RANGE_REG+0*NUMREGS) selb $(CURRENT_VAL_REG+1*NUMREGS), $(CURRENT_VAL_REG+1*NUMREGS), $(PROCESSED_VAL_REG+1*NUMREGS), $(IN_RANGE_REG+1*NUMREGS) .align 3 selb $(CURRENT_VAL_REG+2*NUMREGS), $(CURRENT_VAL_REG+2*NUMREGS), $(PROCESSED_VAL_REG+2*NUMREGS), $(IN_RANGE_REG+2*NUMREGS) stqd $(CURRENT_VAL_REG+0*NUMREGS), 0*REGBYTES($BUFFER_REG) selb $(CURRENT_VAL_REG+3*NUMREGS), $(CURRENT_VAL_REG+3*NUMREGS), $(PROCESSED_VAL_REG+3*NUMREGS), $(IN_RANGE_REG+3*NUMREGS) stqd $(CURRENT_VAL_REG+1*NUMREGS), 1*REGBYTES($BUFFER_REG) stqd $(CURRENT_VAL_REG+2*NUMREGS), 2*REGBYTES($BUFFER_REG) stqd $(CURRENT_VAL_REG+3*NUMREGS), 3*REGBYTES($BUFFER_REG) ai $BUFFER_REG, $BUFFER_REG, REGBYTES cgt $IS_FINISHED_REG, $BUFFER_REG, $BUFFER_END_REG loop_branch_instruction: brz $IS_FINISHED_REG, loop_start end_function: bi $lr |
This code has been branch-eliminated, vectorized, loop-unrolled, instruction-scheduled, and branch-hinted. In other words, it's pretty darn fast. The next article switches to programming in C, but this information will still be useful for understanding what the compiler is trying (or at least should be trying) to do, and for analyzing the output of your compiler to understand where hand-coded assembly could give you better performance.
Stay tuned for the rest of this series: We'll explore coding the SPU in C, covering SPU extensions to C, and other higher-level optimization techniques.
-
See the other parts in the Programming high-performance applications on the Cell BE processor series.
-
Always keep your assembly language overview handy, as well as the instruction set architecture reference for more detailed information.
- ArsTechnica has an overview of SIMD architectures (before the Cell BE processor) as well as an introduction to the Cell BE's SIMD architecture.
- The Cell Broadband Engine Programming Handbook has a lot of interesting low-level details on the SPUs. Some interesting sections are:
- Pages 75-76 and 687-688 discuss pipeline and latency issues.
- Pages 768-772 describe why you need to use no-ops to take advantage of dual-issue rules, what the instruction pipeline looks like, and how instruction prefetch works. Page 772 describes the hbrp instruction which can be used to help out the prefetch schedule.
- Pages 689-697 cover branch elimination and hinting.
- If you want to get really nuts with branch optimization, you should read the additional considerations the IBM compiler team uses for branch optimization.
- Additional optimization considerations and suggestions are available in the slide presentation Cell programming tips and techniques (PDF).
-
Stay abreast of all things Cell BE:
subscribe to IBM
microNews.
Jonathan Bartlett is the author of the book Programming from the Ground Up, an introduction to programming using Linux assembly language. He is the lead developer at New Media Worx, responsible for developing Web, video, kiosk, and desktop applications for clients.




