Skip to main content

By clicking Submit, you agree to the developerWorks terms of use.

The first time you sign into developerWorks, a profile is created for you. Select information in your profile (name, country/region, and company) is displayed to the public and will accompany any content you post. You may update your IBM account at any time.

All information submitted is secure.

  • Close [x]

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerworks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

By clicking Submit, you agree to the developerWorks terms of use.

All information submitted is secure.

  • Close [x]

Programming high-performance applications on the Cell BE processor, Part 4: Program the SPU for performance

Jonathan Bartlett (johnnyb@eskimo.com), Director of Technology, New Medio
Jonathan Bartlett is the author of the book Programming from the Ground Up, an introduction to programming using Linux assembly language. He is the lead developer at New Media Worx, responsible for developing Web, video, kiosk, and desktop applications for clients.

Summary:  Write optimal code for the Cell Broadband Engine™ (Cell BE) processor's synergistic processing unit (SPU) and have your programs running lightning fast. This installment of Programming high-performance applications on the Cell BE processor covers SIMD vector programming, branch elimination, loop unrolling, instruction scheduling, and branch hinting techniques. Previous installments have covered the basics of the Sony® PLAYSTATION® 3, the Cell BE architecture, and SPU programming.

View more content in this series

Date:  06 Mar 2007
Level:  Intermediate
Also available in:   Chinese  Russian

Activity:  18861 views
Comments:  

This article dives into the depths of instruction-cycle-counting, bit manipulation, and other nuances that assembly language has typically been notorious for. By the end of it you might be convinced never to program in assembly language ever again. However, the point of it all is not to program in assembly language at all times, but rather to understand what the compiler needs to do to optimize your code, and be able to supplement that with custom assembly language when required. Knowing how the SPU's assembly language works will also aid you in exploiting the processor in higher-level languages. Subsequent articles will use C and show how to use this optimization knowledge in real-world examples. The SPU has many C language extensions; knowing SPU assembly language will help you make sense of them, and knowing SPU optimization will help you use them well.

The starting program

The last article ended with a function called convert_to_upper, which operated one byte at a time to convert a string to uppercase. The functions in the programs in this article operate on whole buffers at a time. The SPU is built to deal with data in batches, so moving to a buffer-at-a-time model will make the enhancements easier. The first version will simply wrap a loop around the code developed in the previous article. Because it is based on code and concepts developed in the previous articles, I will not do a step-by-step explanation of the code.

Here is the unoptimized version of a buffer-at-a-time function for converting to uppercase (enter as convert_buffer.s):


Listing 1. First example program
.text

.global convert_buffer_to_upper
.type convert_buffer_to_upper, @function
convert_buffer_to_upper:
	##REGISTER USAGE:
	#   3) buffer address / current address
	#   4) buffer size
	#   5) end_address
	#   6) current quadword
	#   7) current quadword with byte in first position
	#   8, 9, & 10) Determine if byte is in range
	#   11) byte insertion control
	#   12) current quadword with byte properly inserted
	#   13) true if we need to branch, false otherwise
	#   14) conversion factor

	#Calculate end address
	a $5, $4, $3

loop_start:
	#UNALIGNED LOAD
	lqd $6, 0($3)
	rotqby $7, $6, $3
	rotqbyi $7, $7, -3

	#IS IN RANGE 'a'-'z'?
	cgtbi $8, $7, 'a' - 1
	cgtbi $9, $7, 'z'
	xor $10, $8, $9
	andi $10, $10, 255

	#If no, exit
	brz $10, finish_loop

is_lowercase:
	#If yes, perform conversion
	il $14, 'a' - 'A'
	absdb $7, $7, $14

finish_loop:
	#Unaligned Store ($6 already has current word)
	cbd $11, 0($3)
	shufb $12, $7, $6, $11
	stqd $12, 0($3)

	#Increment pointer
	ai $3, $3, 1

	#Are we at the end?  If not then loop.
	cgt $13, $3, $5
	brz $13, loop_start

end_function:
	#Return
	bi $lr

As far as performance goes, the current code is terrible. The subsequent sections will improve on it a step at a time.

The function which drives the conversion function is now a little bit simpler since it only has to load the data, run the function, and copy it back. Here is the code (enter as convert_driver.s):


Listing 2. Uppercase conversion main function
.data

#This is the struct we will copy from the main PPE process
.align 4
conversion_info:
conversion_length:
	.octa 0
conversion_data:
	.octa 0
.equ CONVERSION_STRUCT_SIZE, 32

.section .bss #Uninitialized Data Section

#This is the buffer we will store the string in
.align 4
.lcomm conversion_buffer, 16384

.text

#MFC Constants
.equ MFC_GET_CMD, 0x40
.equ MFC_PUT_CMD, 0x20

.equ LR_OFFSET, 16

.global main
.type main, @function
.equ MAIN_FRAME_SIZE, 32
main:
	#Prologue
	stqd $lr, LR_OFFSET($sp)
	stqd $sp, -MAIN_FRAME_SIZE($sp)
	ai $sp, $sp, -MAIN_FRAME_SIZE

	##COPY IN CONVERSION INFORMATION##
	ila $3, conversion_info         #Local Store Address
	#register 4 already has address #64-bit Effective Address
	il $5, CONVERSION_STRUCT_SIZE   #Transfer size
	il $6, 0                        #DMA Tag
	il $7, MFC_GET_CMD              #DMA Command
	brsl $lr, perform_dma

	#Wait for DMA to complete
	il $3, 0
	brsl $lr, wait_for_dma_completion

	##COPY STRING IN TO BUFFER##
	#Load buffer data pointer
	ila $3, conversion_buffer #Local Store
	lqr $4, conversion_data   #64-bit Effective Address
	lqr $5, conversion_length #SIZE
	il $6, 0                  #DMA Tag
	il $7, MFC_GET_CMD        #DMA Command
	brsl $lr, perform_dma

	#Wait for DMA to complete
	il $3, 0
	brsl $lr, wait_for_dma_completion

	##PERFORM CONVERSION##
	ila $3, conversion_buffer
	lqr $4, conversion_length
	brsl $lr, convert_buffer_to_upper

	##COPY DATA BACK##
	ila $3, conversion_buffer   #Local Store Address
	lqr $4, conversion_data     #64-bit effective address
	lqr $5, conversion_length   #Size
	il $6, 0                    #DMA Tag
	il $7, MFC_PUT_CMD          #DMA Command
	brsl $lr, perform_dma

	#Wait for DMA to complete
	il $3, 0
	brsl $lr, wait_for_dma_completion

	##EXIT PROGRAM##
	#Return Value
	il $3, 0

	#Epilogue
	ai $sp, $sp, MAIN_FRAME_SIZE
	lqd $lr, LR_OFFSET($sp)
	bi $lr

You will also need the dma_utils.s and the ppu_dma_main.c files from the previous article.

To build and run, perform these steps:

spu-gcc convert_buffer.s convert_driver.s dma_utils.s -o spe_convert
embedspu -m64 convert_to_upper_handle spe_convert spe_convert_csf.o
gcc -m64 spe_convert_csf.o ppu_dma_main.c -lspe -o dma_convert
./dma_convert

These same steps can be used to build all of the examples in this article.


Vectorizing the code

The most obvious optimization to make on a vector process is to vectorize the code. This is known as SIMD (single instruction, multiple data), or data parallelism. On the SPUs, most instructions can operate on registers as if they contained multiple, independent values (thus the single instruction acting on multiple data items). Each 128-bit register can be treated as 16 independent bytes, eight half-words, four words, two doublewords, or as a single unit. The instruction set is primarily geared around dividing it into four 32-bit words, but there is enough support to handle any of these situations.

If you vectorize this code, since you are treating the values as bytes, that means that each instruction will operate on 16 values at once! However, the problem is that vector processing assumes that each and every instruction will be applied to all elements of the vector. However, in the main loop, you have a conditional branch. This means that vector elements which match the criteria go through a different set of instructions than those that do not. Therefore, at least the way the code presently stands, it cannot be vectorized.

What you need to do first is eliminate the branch so that the code uses the exact same instructions whether it matches your condition or not (as I'll show later, eliminating branches helps reduce branching stalls as well). So how is this done? The key is that the SPU has several conditional instructions, such as selb, shufb, and the bit operations, which allow conditional operations to occur without branching. What the program will end up doing is calculating both answers, and then using a conditional instruction to select the desired answer.

Here is the conversion code as it currently stands:

	#IS IN RANGE 'a'-'z'
	cgtbi $8, $7, 'a' - 1
	cgtbi $9, $7, 'z'
	xor $10, $8, $9
	andi $10, $10, 255

	brz $10, finish_loop

is_lowercase:
	#lowercase condition
	il $14, 'a' - 'A'
	absdb $7, $7, $14

finish_loop:
	#non-lowercase condition
	#all code winds up here

In this case, the two answers we are computing are:

  • Uppercase-converted letter (if lowercase)
  • Original input letter (if not lowercase)

The code starts with the original value in $7. The first thing you need to do is to move the code which calculates the converted value before the condition, and then store it in a different register ($15 in this case). So the code will look like this:

	#$7 has our original value
	il $14, 'a' - 'A'
	absdb $15, $7, $14
	#$7 has the original, and $15 has the converted value
	#Choose between the value in $7 and $14 and put it in $7

	##...rest of loop...

So now you need to figure out which value you want to use. The first thing you need to do is to use your original instructions to check the condition:

	cgtbi $8, $7, 'a' - 1
	cgtbi $9, $7, 'z'
	xor $10, $8, $9

Note that the previous andi is no longer needed because it was used to mask out unwanted values for the conditional branch (conditional branches are based on true or false value of the word preferred slot value and you only care about the byte preferred slot value). Since you aren't branching you don't care! So now $10 has all ones in the preferred slot if it is in range, and all zeroes if it is out of range. Now all you need is to choose $7 or $15 based on the value in $10. The instruction selb (select bits) is perfect for this. selb has four operands:

  1. destination register
  2. source value 1
  3. source value 2
  4. selector

selb operates by going through the selector bit-by-bit. For each bit position, if the bit is 0, the same bit position in the destination register uses the bit from source value 1. If the bit is 1, it uses the bit from source value 2. If you imagine each register as an array of bits, selb has the following meaning:

//imaginary representation of selb for those more familiar with C than assembly language:
for(i = 0; i < 128; i++) {
	destination[i] = selector[i] == 0 ? source_1[i] : source_2[i]
}

Now I hope you see why the condition statements set all of the corresponding bits in the destination register to 1 if the condition is true -- it makes it easier to use that value for selb. In this case, you can simply add the following line of code:

	selb $7, $7, $15, $10

Now, all of your values, whether they are lowercase or not, will be processed through the following code sequence:


Listing 3. Branch-free conversion code
	#Original value starts in $7

	#Perform conversion and store in $15
	il $14, 'a' - 'A'
	absdb $15, $7, $14

	#Is it lowercase ('a'-'z')?
	cgtbi $8, $7, 'a'-1
	cgtbi $9, $7, 'z'
	xor $10, $8, $9
	#$10 has all 1s for lowercase and all 0s for non-lowercase in the preferred slot

	#Select appropriate value into $7 based on condition
	selb $7, $7, $15, $10

	#$7 now has the correct value

In this case, the choice was between the original value and a processed value, but the code would have been similar if the choice was between two processed values. In that case, you would have just had two sets of processing instructions, with each set using a different register for its result, and the selb instruction choosing between them. Likewise, if there were more than two possible directions for the code to go, multiple selbs could be used to choose between them. However, at that point, you probably need to look and see if the cost of calculating all of the different possibilities for every input is worth the benefit of eliminating branches.

Remember that the point of removing the branch was so that you can vectorize the code. The problem was that in order to vectorize the code, the code must follow the same set of instructions for each member of the vector. Now that you have eliminated the branches this is possible.

In fact, the core conversion code is actually almost already vectorized. All of the instructions operate on the whole register anyway. The problem before was threefold:

  • The branch was causing either the whole register to be converted or not converted.
  • The register holding the conversion factor was geared to a single byte usage rather than a whole register (il loads the given value into each word but you need it in each byte).
  • The load/store instructions and the loop counter were geared for processing a single byte at a time.

Now that you've eliminated the branch, you need to load your conversion factor into every byte of the conversion register. The easiest way to do this is to put the conversion factor in the .data section manually and load it directly in. You should also move it outside of the loop since its value is invariant. So, in the .data section, add:

.equ CONVERSION_FACTOR, 'a' - 'A'
.align 4
conversion_bytes:
	.fill 16, 1, CONVERSION_FACTOR

And in the code before the loop, add:

	lqr $14, conversion_bytes

With these additions, all values in register 7 will be processed appropriately. Here is the code again, with a possible starting value to demonstrate what is happening:


Listing 4. Following a set of values through the conversion
	#$7 starts with 'Hello There!    '
	#In hex, that's   0x48656c6c6f2054686572652120202020
	#$14 is the conversion factor in each byte
	#In hex, that's   0x20202020202020202020202020202020

	absdb $15, $7, $14
	#  -> $15 now has 0x28454c4c4f0034484552450100000000
	cgtbi $8, $7, 'a'-1
	#  -> $8 now has  0xffffffffff00ffffffffff0000000000
	cgtbi $9, $7, 'z'
	#  -> $9 now has  0xff0000000000ff000000000000000000
	xor $10, $8, $9
	#  -> $10 now has 0x00ffffffff0000ffffffff0000000000
	selb $7, $7, $15, $10
	#  -> $7 now has  0x48454c4c4f2054484552452120202020
	#     which is hex for 'HELLO THERE!    '

So now all you need to do is change the loop so that it will utilize this. It needs to load a full quadword (16 bytes) at once, and store it back at once, and increment the pointer by 16 instead of 1. This, interestingly, will require fewer instructions because you are no longer having to mess with the preferred slot. So, here is the complete function with the new loop skeleton:


Listing 5. Loop skeleton for vectorized code
##Store Conversion Factor##
.data
.equ CONVERSION_FACTOR, 'a' - 'A'
.align 4
conversion_bytes:
	.fill 16, 1, CONVERSION_FACTOR

.text
.global convert_buffer_to_upper
.type convert_buffer_to_upper, @function
convert_buffer_to_upper:
	#Calculate end address
	a $5, $4, $3

	#Load in conversion factors
	lqr $14, conversion_bytes

loop_start:
	#Aligned Load
	lqd $7, 0($3)

	##CONVERSION##
	absdb $15, $7, $14
	cgtbi $8, $7, 'a'-1
	cgtbi $9, $7, 'z'
	xor $10, $8, $9
	selb $7, $7, $15, $10
	##END CONVERSION##

	#Aligned Store
	stqd $7, 0($3)

	#Increment Pointer
	ai $3, $3, 16

	#Exit if needed ($5 has the ending address)
	cgt $13, $3, $5
	brz $13, loop_start

end_function:
	bi $lr

As you can see, the code is much simpler -- it has fewer branches and fewer instructions. However, this new code now assumes that the starting address is 16-byte aligned, and also that it has enough padding on the end that the next data element in memory is also 16-byte aligned. Otherwise, you might end up converting something besides letters! As you can see, for vector processing, alignment and padding are both critically important. It doesn't really matter if the data itself is large enough to fit in the buffer. Since it is converting as a vector, it doesn't cost anything to convert a few extra bytes of junk. If you wind up having to waste a few bytes in your buffer, it is trivial compared to the amount of time and code needed to special-case the beginning and end of unaligned data. By keeping the data aligned and padded to 16-byte boundaries, vector operations can be performed effortlessly.


Unrolling loops

Loop unrolling has been an optimization technique since the dawn of computer programming. I cover it here not only because it increases efficiency on its own by eliminating branches, but also because it if you do it right it will help later on in instruction scheduling.

Probably by this point you have already been having trouble keeping up with which register holds what value. After all, the register names are essentially arbitrary numbers, which are nearly impossible to make sense of. However, because the registers are only numbers, you can use .equ to give your registers descriptive names. For example, you can rewrite your conversion program as follows (note that the registers have been renumbered):


Listing 6. Uppercase conversion with named registers
.data
.equ CONVERSION_FACTOR, 'a' - 'A'
.align 4
conversion_bytes:
	.fill 16, 1, CONVERSION_FACTOR

.text
.global convert_buffer_to_upper
.type convert_buffer_to_upper, @function
	##REGISTER DEFINITIONS##
	#Loop/function control registers
	.equ BUFFER_REG, 3             #Buffer address / current address
	.equ BUFFER_SZ_REG, 4          #Buffer size
	.equ BUFFER_END_REG, 5         #End address
	.equ CONVERSION_BYTES_REG, 6   #Conversion data
	.equ IS_FINISHED_REG, 7        #Finished conversion?

	#Conversion-oriented registers
	.equ CURRENT_VAL_REG, 8  #Current quadword
	.equ BOOL_TMP1_REG, 9    #used for computing IN_RANGE_REG
	.equ BOOL_TMP2_REG, 10   #used for computing IN_RANGE_REG
	.equ IN_RANGE_REG, 11    #Value in range?
	.equ PROCESSED_VAL_REG, 12   #Conversion bytes, properly masked

	#Information about registers
	.equ NUMREGS, 5          #Number of per-iteration registers
	.equ REGBYTES, 16        #Number of bytes in a register
convert_buffer_to_upper:
	#Calculate end address
	a $BUFFER_END_REG, $BUFFER_SZ_REG, $BUFFER_REG

	lqr $CONVERSION_BYTES_REG, conversion_bytes

loop_start:
	#Aligned Load
	lqd $CURRENT_VAL_REG, 0($BUFFER_REG)

	##CONVERSION##
	absdb $PROCESSED_VALS_REG, $CURRENT_VAL_REG, $CONVERSION_BYTES_REG
	cgtbi $BOOL_TMP1_REG, $CURRENT_VAL_REG, 'a'-1
	cgtbi $BOOL_TMP2_REG, $CURRENT_VAL_REG, 'z'
	xor $IN_RANGE_REG, $BOOL_TMP1_REG, $BOOL_TMP2_REG
	selb $CURRENT_VAL_REG, $CURRENT_VAL_REG, $PROCESSED_VAL_REG, $IN_RANGE_REG
	##END CONVERSION##

	#Aligned Store
	stqd $CURRENT_VAL_REG, 0($BUFFER_REG)

	#Increment Pointer
	ai $BUFFER_REG, $BUFFER_REG, REGBYTES

	#Exit if needed
	cgt $IS_FINISHED_REG, $BUFFER_REG, $BUFFER_END_REG
	brz $IS_FINISHED_REG, loop_start

end_function:
	bi $lr

It's a lot more verbose, but it also makes it easier to browse through the code. It also makes it much easier to do instruction scheduling for unrolled loops. I'll get to that in a minute. For the present, look at how you might unroll this loop four times, using different registers for each iteration (using different registers will help when you optimize instruction scheduling). I'll discuss why and how I rewrote the program in this way shortly:


Listing 7. Buffer conversion -- loop unrolled
loop_start:
	#ITERATION 0
	lqd $(CURRENT_VAL_REG+0*NUMREGS), 0*REGBYTES($BUFFER_REG)
	absdb $(PROCESSED_VAL_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS), $CONVERSION_BYTES_REG
	cgtbi $(BOOL_TMP1_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS), 'a'-1
	cgtbi $(BOOL_TMP2_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS), 'z'
	xor $(IN_RANGE_REG+0*NUMREGS), $(BOOL_TMP1_REG+0*NUMREGS), $(BOOL_TMP2_REG+0*NUMREGS)
	selb $(CURRENT_VAL_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS),
	$(PROCESSED_VAL_REG+0*NUMREGS), $(IN_RANGE_REG+0*NUMREGS)
	stqd $(CURRENT_VAL_REG+0*NUMREGS), 0*REGBYTES($BUFFER_REG)

	#ITERATION 1
	lqd $(CURRENT_VAL_REG+1*NUMREGS), 1*REGBYTES($BUFFER_REG)
	absdb $(PROCESSED_VAL_REG+1*NUMREGS), $(CURRENT_VAL_REG+1*NUMREGS), $CONVERSION_BYTES_REG
	cgtbi $(BOOL_TMP1_REG+1*NUMREGS), $(CURRENT_VAL_REG+1*NUMREGS), 'a'-1
	cgtbi $(BOOL_TMP2_REG+1*NUMREGS), $(CURRENT_VAL_REG+1*NUMREGS), 'z'
	xor $(IN_RANGE_REG+1*NUMREGS), $(BOOL_TMP1_REG+1*NUMREGS), $(BOOL_TMP2_REG+1*NUMREGS)
	selb $(CURRENT_VAL_REG+1*NUMREGS), $(CURRENT_VAL_REG+1*NUMREGS),
	$(PROCESSED_VAL_REG+1*NUMREGS), $(IN_RANGE_REG+1*NUMREGS)
	stqd $(CURRENT_VAL_REG+1*NUMREGS), 1*REGBYTES($BUFFER_REG)

	#ITERATION 2
	lqd $(CURRENT_VAL_REG+2*NUMREGS), 2*REGBYTES($BUFFER_REG)
	absdb $(PROCESSED_VAL_REG+2*NUMREGS), $(CURRENT_VAL_REG+2*NUMREGS), $CONVERSION_BYTES_REG
	cgtbi $(BOOL_TMP1_REG+2*NUMREGS), $(CURRENT_VAL_REG+2*NUMREGS), 'a'-1
	cgtbi $(BOOL_TMP2_REG+2*NUMREGS), $(CURRENT_VAL_REG+2*NUMREGS), 'z'
	xor $(IN_RANGE_REG+2*NUMREGS), $(BOOL_TMP1_REG+2*NUMREGS), $(BOOL_TMP2_REG+2*NUMREGS)
	selb $(CURRENT_VAL_REG+2*NUMREGS), $(CURRENT_VAL_REG+2*NUMREGS),
	$(PROCESSED_VAL_REG+2*NUMREGS), $(IN_RANGE_REG+2*NUMREGS)
	stqd $(CURRENT_VAL_REG+2*NUMREGS), 2*REGBYTES($BUFFER_REG)

	#ITERATION 3
	lqd $(CURRENT_VAL_REG+3*NUMREGS), 3*REGBYTES($BUFFER_REG)
	absdb $(PROCESSED_VAL_REG+3*NUMREGS), $(CURRENT_VAL_REG+3*NUMREGS), $CONVERSION_BYTES_REG
	cgtbi $(BOOL_TMP1_REG+3*NUMREGS), $(CURRENT_VAL_REG+3*NUMREGS), 'a'-1
	cgtbi $(BOOL_TMP2_REG+3*NUMREGS), $(CURRENT_VAL_REG+3*NUMREGS), 'z'
	xor $(IN_RANGE_REG+3*NUMREGS), $(BOOL_TMP1_REG+3*NUMREGS), $(BOOL_TMP2_REG+3*NUMREGS)
	selb $(CURRENT_VAL_REG+3*NUMREGS), $(CURRENT_VAL_REG+3*NUMREGS),
	$(PROCESSED_VAL_REG+3*NUMREGS), $(IN_RANGE_REG+3*NUMREGS)
	stqd $(CURRENT_VAL_REG+3*NUMREGS), 3*REGBYTES($BUFFER_REG)

	#Increment Pointer
	ai $BUFFER_REG, $BUFFER_REG, 4*REGBYTES

	#Exit if needed
	cgt $IS_FINISHED_REG, $BUFFER_REG, $BUFFER_END_REG
	brz $IS_FINISHED_REG, loop_start

What this program is doing is calculating the registers being used. You could have simply numbered the registers, but then writing the code and remembering which register does what would get even more tedious than before. However, since each iteration uses the same number of registers, you can simply calculate the register number at assembly time. For example, look at $(BOOL_TMP1_REG+2*NUMREGS). This means that it is the BOOL_TMP1_REG for iteration 2. The actual register number, since BOOL_TMP1_REG is 9 and NUMREGS is 5, is 9+2*5, or 19. This way, if you need to add a register to your code later, the assembler will auto-recalculate the new register numbers and you don't have to alter your register numbering convention. You would just assign the register its own symbolic name and increase the value of NUMREGS.

In addition, as will be apparent shortly, if you need to re-order your instructions for faster execution, this way of naming registers will make it much easier to see both iterations of the loop the register is dealing with as well as the register's purpose. It also makes it easier to modify the program when both of these are readily visible from the code itself.


Instruction scheduling

What is not usually apparent to new assembly language programmers is that the order of instructions affects the speed of the program. The reason for this is that instructions take more than one cycle to finish, but the processors are set up so that an instruction, depending on the ordering of the instructions, does not have to complete before it begins execution of the next instruction. This is known as pipelining. Setting up instructions so that they take full advantage of a processor's pipeline is called instruction scheduling. A few important terms related to pipelining and instruction scheduling are:

  • Latency -- The number of clock cycles an instruction uses to produce a final value. This is the same as the size of the pipeline used to process the value.
  • Stall -- A clock cycle where the processor does not begin a new instruction.
  • Dependency stall -- This is a stall that occurs because one of the operands of the next instruction requires a value from a previous instruction that has not yet completed.

Most of the work of performance tuning on the SPU deals with avoiding register stalls. Therefore, take a look at the pipelining of different types of instructions on the SPU (information from the Cell BE Handbook, page 688):


SPU instruction latency
Instruction typeLatencyPipelineAdditional notes
Double-precision floating-point operations13EvenThe first six cycles are actually stalls in which no other instruction can be issued. Dual-issue (discussed later) is not allowed with these instructions either.
Integer multiplies, floating-point/integer conversion, interpolate7Even
Single-precision floating-point6Even
Byte operations4Even
Element-based rotates and shifts4Even
Immediate-mode loads2Even
Simple integer and logical operations (including selb)2Even
Load and Store Operations6OddUnlike other architectures, SPU loads and stores are deterministic because there is no cache. By reducing the memory so that it is all on-chip in the local store, the SPU can have much faster, much more reliable load and store times than other types of processors.
Branch hints6OddSpecial rules for branch hints will be discussed in a subsequent section.
Channel Operations6Odd
Special-purpose register manipulation6Odd
Branches4OddProperly hinted branches (discussed in a subsequent section) allow the next instruction to be issued in the very next cycle.
Shuffle bytes4Odd
Quadword rotates and shifts4Odd
Estimate4Odd
Gather, mask, and generate insertion controls4Odd

So, say that I have the following instructions:

	a $5, $6, $7   #instruction 1
	a $8, $5, $9   #instruction 2
	a $10, $8, $7  #instruction 3
	a $11, $8, $7  #instruction 4

In this program, it takes four clock cycles for instruction 1 to finish. Instruction 2 requires the result of instruction 1 ($5) to compute, so it has to wait the full four clock cycles. Instruction 3 requires the result of instruction 2 ($8), so it has to wait four clock cycles. Instruction 4 can be issued in the clock cycle immediately after instruction 3 because it does not require the result of instruction 3 to execute. You can visualize it like this:

	a $5, $6, $7   #cycle 1
	#Stall for $5  #cycle 2
	#Stall for $5  #cycle 3
	#Stall for $5  #cycle 4
	a $8, $5, $9   #cycle 5
	#Stall for $8  #cycle 6
	#Stall for $8  #cycle 7
	#Stall for $8  #cycle 8
	a $10, $8, $7  #cycle 9
	a $11, $8, $7  #cycle 10

As you can see, you will get a drastic increase in performance if you are able to arrange your instructions so that no instruction is waiting on any other instruction.

The SPU is not only able to process multiple values at once through its pipeline, it can also dual-issue instructions through different pipelines. The SPU has two pipelines, even (sometimes called pipeline 0 or the execute pipeline) and odd (sometimes called pipeline 1 or the load pipeline). In the table above, the different types of instructions were listed along with which pipeline they execute in. The SPU actually loads two instructions at a time from a doubleword-aligned boundary. This is called a fetch group. If the first instruction in this fetch group is an even pipeline instruction and the second one is an odd pipeline instruction, they can both be issued simultaneously. If these conditions are not all met, or if the second instruction needs to wait for dependencies before issuing, then they are issued in separate cycles. To help align instructions properly for enabling dual-issue, there are two no-operation instructions that can be used to properly pad the instructions -- nop (no-operation on the even pipeline) and lnop (no-operation on the odd pipeline). Also, you can use .align 3 to force a given instruction to start in a new fetch group (it will be padded with appropriate no-ops to align it properly).

Look at one iteration in the loop and see how it performs in the SPU pipeline. I've added no-ops so you can see the pipeline issues better:


Listing 8. Loop iteration with stall information
.align 4 #force new fetch group
	#ITERATION 0
	nop
	lqd $(CURRENT_VAL_REG+0*NUMREGS), 0*REGBYTES($BUFFER_REG)
	#stall (waiting on CURRENT_VAL_REG)
	#stall
	#stall
	#stall
	#stall
	absdb $(PROCESSED_VAL_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS), $CONVERSION_BYTES_REG
	lnop
	cgtbi $(BOOL_TMP1_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS), 'a'-1
	lnop
	cgtbi $(BOOL_TMP2_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS), 'z'
	lnop
	#stall (waiting on BOOL_TMP2_REG)
	xor $(IN_RANGE_REG+0*NUMREGS), $(BOOL_TMP1_REG+0*NUMREGS), $(BOOL_TMP2_REG+0*NUMREGS)
	lnop
	#stall (waiting on IN_RANGE_REG)
	selb $(CURRENT_VAL_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS),
	$(PROCESSED_VAL_REG+0*NUMREGS), $(IN_RANGE_REG+0*NUMREGS)
	lnop
	#stall (waiting on CURRENT_VAL_REG)
	nop
	stqd $(CURRENT_VAL_REG+0*NUMREGS), 0*REGBYTES($BUFFER_REG)

As you can see, this single iteration is wasting 8 cycles just waiting on registers to finish loading. In addition, it is wasting 7 opportunities for dual-issue. So even in this vectorized implementation, there is a lot of room for improvement!

You may be wondering why the program did not put selb and stqd in the same fetch group. You could have, but it would not have increased the speed of the program. Since stqd has to stall for the value of CURRENT_VAL_REG they would have had to be issued separately anyway, and you would not have gained any speed.

Sometimes instruction scheduling is a hassle. However, when used in conjunction with loop unrolling, it's not so bad. Since each iteration is using a different set of registers for computation, all you have to do is interleave computations from each iteration to fill up the time slots. So your new loop body looks like this:


Listing 9. Interleaved loop body minimizes dependency stalls
	lqd $(CURRENT_VAL_REG+0*NUMREGS), 0*REGBYTES($BUFFER_REG)
	lqd $(CURRENT_VAL_REG+1*NUMREGS), 1*REGBYTES($BUFFER_REG)
	lqd $(CURRENT_VAL_REG+2*NUMREGS), 2*REGBYTES($BUFFER_REG)
	lqd $(CURRENT_VAL_REG+3*NUMREGS), 3*REGBYTES($BUFFER_REG)
	absdb $(PROCESSED_VAL_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS), $CONVERSION_BYTES_REG
	absdb $(PROCESSED_VAL_REG+1*NUMREGS), $(CURRENT_VAL_REG+1*NUMREGS), $CONVERSION_BYTES_REG
	absdb $(PROCESSED_VAL_REG+2*NUMREGS), $(CURRENT_VAL_REG+2*NUMREGS), $CONVERSION_BYTES_REG
	absdb $(PROCESSED_VAL_REG+3*NUMREGS), $(CURRENT_VAL_REG+3*NUMREGS), $CONVERSION_BYTES_REG
	cgtbi $(BOOL_TMP1_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS), 'a'-1
	cgtbi $(BOOL_TMP1_REG+1*NUMREGS), $(CURRENT_VAL_REG+1*NUMREGS), 'a'-1
	cgtbi $(BOOL_TMP1_REG+2*NUMREGS), $(CURRENT_VAL_REG+2*NUMREGS), 'a'-1
	cgtbi $(BOOL_TMP1_REG+3*NUMREGS), $(CURRENT_VAL_REG+3*NUMREGS), 'a'-1
	cgtbi $(BOOL_TMP2_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS), 'z'
	cgtbi $(BOOL_TMP2_REG+1*NUMREGS), $(CURRENT_VAL_REG+1*NUMREGS), 'z'
	cgtbi $(BOOL_TMP2_REG+2*NUMREGS), $(CURRENT_VAL_REG+2*NUMREGS), 'z'
	cgtbi $(BOOL_TMP2_REG+3*NUMREGS), $(CURRENT_VAL_REG+3*NUMREGS), 'z'
	xor $(IN_RANGE_REG+0*NUMREGS), $(BOOL_TMP1_REG+0*NUMREGS), $(BOOL_TMP2_REG+0*NUMREGS)
	xor $(IN_RANGE_REG+1*NUMREGS), $(BOOL_TMP1_REG+1*NUMREGS), $(BOOL_TMP2_REG+1*NUMREGS)
	xor $(IN_RANGE_REG+2*NUMREGS), $(BOOL_TMP1_REG+2*NUMREGS), $(BOOL_TMP2_REG+2*NUMREGS)
	xor $(IN_RANGE_REG+3*NUMREGS), $(BOOL_TMP1_REG+3*NUMREGS), $(BOOL_TMP2_REG+3*NUMREGS)
	selb $(CURRENT_VAL_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS),
	$(PROCESSED_VAL_REG+0*NUMREGS), $(IN_RANGE_REG+0*NUMREGS)
	selb $(CURRENT_VAL_REG+1*NUMREGS), $(CURRENT_VAL_REG+1*NUMREGS),
	$(PROCESSED_VAL_REG+1*NUMREGS), $(IN_RANGE_REG+1*NUMREGS)
	selb $(CURRENT_VAL_REG+2*NUMREGS), $(CURRENT_VAL_REG+2*NUMREGS),
	$(PROCESSED_VAL_REG+2*NUMREGS), $(IN_RANGE_REG+2*NUMREGS)
	selb $(CURRENT_VAL_REG+3*NUMREGS), $(CURRENT_VAL_REG+3*NUMREGS),
	$(PROCESSED_VAL_REG+3*NUMREGS), $(IN_RANGE_REG+3*NUMREGS)
	stqd $(CURRENT_VAL_REG+0*NUMREGS), 0*REGBYTES($BUFFER_REG)
	stqd $(CURRENT_VAL_REG+1*NUMREGS), 1*REGBYTES($BUFFER_REG)
	stqd $(CURRENT_VAL_REG+2*NUMREGS), 2*REGBYTES($BUFFER_REG)
	stqd $(CURRENT_VAL_REG+3*NUMREGS), 3*REGBYTES($BUFFER_REG)

	#Increment Pointer
	ai $BUFFER_REG, $BUFFER_REG, 4*REGBYTES

	#Exit if needed
	cgt $IS_FINISHED_REG, $BUFFER_REG, $BUFFER_END_REG
	brz $IS_FINISHED_REG, loop_start

This technique is called software pipelining, and this code only loses 2 cycles to stalls. However, it still does not make much use of dual-issue. In fact, there aren't many opportunities to do that at all in this code.

If you were to unroll the loop four more iterations, you could interleave each set of four so that one set of instructions was doing the loads while the other one was doing the executes, and that would save some clock cycles through dual-issue. However, for now, I will simply show how to save two clock cycles by adjusting the order of the selb and stqd instructions. Here is the new order:


Listing 10. Rescheduling instructions
	selb $(CURRENT_VAL_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS),
	$(PROCESSED_VAL_REG+0*NUMREGS), $(IN_RANGE_REG+0*NUMREGS)
	selb $(CURRENT_VAL_REG+1*NUMREGS), $(CURRENT_VAL_REG+1*NUMREGS),
	$(PROCESSED_VAL_REG+1*NUMREGS), $(IN_RANGE_REG+1*NUMREGS)
.align 3   ####Force to the start of a fetch-group
	#Next two issued concurrently
	selb $(CURRENT_VAL_REG+2*NUMREGS), $(CURRENT_VAL_REG+2*NUMREGS),
	$(PROCESSED_VAL_REG+2*NUMREGS), $(IN_RANGE_REG+2*NUMREGS)
	stqd $(CURRENT_VAL_REG+0*NUMREGS), 0*REGBYTES($BUFFER_REG)
	#Next two issued concurrently
	selb $(CURRENT_VAL_REG+3*NUMREGS), $(CURRENT_VAL_REG+3*NUMREGS),
	$(PROCESSED_VAL_REG+3*NUMREGS), $(IN_RANGE_REG+3*NUMREGS)
	stqd $(CURRENT_VAL_REG+1*NUMREGS), 1*REGBYTES($BUFFER_REG)
	stqd $(CURRENT_VAL_REG+2*NUMREGS), 2*REGBYTES($BUFFER_REG)
	stqd $(CURRENT_VAL_REG+3*NUMREGS), 3*REGBYTES($BUFFER_REG)

Simply by properly aligning a section of the program on what you want to be a fetch-group boundary and moving one instruction (the last selb) to a more opportune location, you can save two clock cycles. Note that without the .align 3, if the selb instructions were in the odd position while the stqd instructions were in the even position, you would not achieve dual-issue, because dual-issue only occurs when the two instructions are both appropriately sequenced and aligned.


Branch hinting

The SPU has no real hardware support for branch hinting. However, it makes up for this (and in some cases has the possibility of surpassing hardware support) by providing excellent software support for branch hinting.

Branch hinting is necessary on the SPU because mispredicted branches come at a high cost. It takes 18-19 cycles to recover from a mispredicted branch. In addition, by default, every branch encountered by the SPU is assumed to be not taken, including unconditional branches. What a branch hint does is specify to the processor that, for a specific branch instruction (also called a hint-trigger address), what address it is likely to branch to (called the branch target address). This allows the processor to prepare for the branch ahead of time (prefetching the instructions, for instance). Branch hints never affect the logical outcome of a program. They only affect the number of cycles required to run the program.

There are three branch-hinting instructions:

  • hbr hint_trigger, $register -- This tells the processor that the branch instruction at the relative address hint_trigger is likely to branch to the address specified in register $register.
  • hbrr hint_trigger, branch_target -- This tells the processor that the branch instruction at the relative address hint_trigger is likely to branch to the relative address branch_target (both are relative from the current instruction).
  • hbra hint_trigger, branch_target -- The same as hbrr, except that branch_target is specified as an absolute address.

For a branch hint to be most effective (so that the branch does not stall at all), it must be set at least four instruction fetch-groups plus 11 cycles before the branch instruction. At minimum, the branch hint must be four instruction fetch groups before the branch instruction, or it will have no effect. It also may not be more than 255 instructions away (physically) from the branch that it hints for (the instruction itself only has space for 8 bits plus a sign bit for the relative offset of the hint trigger, which then has two zeroes concatenated at the end). For example, if it is four instruction fetch groups plus 3 cycles away from the branch instruction, the branch instruction will enter hint stall for 8 cycles, which, while not optimum, is still much better than the 18 cycles it would stall without the hint. Only one hint can be active at a time, and sync instructions, among other things, clear out any active hint.

The best place to use a hint in your code is before the loop. Since the loop will be more likely taken than not taken (at least for larger strings), you could give a symbolic name to your branch instruction, and hint before the loop that the branch is likely to be taken. The code change would look like this:


Listing 11. Hinted branches
	hbrr loop_branch_instruction, loop_start
loop_start:

	##... conversions go here ... ##

	#Increment Pointer
	ai $BUFFER_REG, $BUFFER_REG, 4*REGBYTES

	#Exit if needed
	cgt $IS_FINISHED_REG, $BUFFER_REG, $BUFFER_END_REG
loop_branch_instruction:
	brz $IS_FINISHED_REG, loop_start

Because the hint is before the loop body, this code leaves your hint active for every iteration of the loop, but it only has to use one cycle.

Unfortunately, because the loop branch is so close to the return statement, you cannot predict both the loop branch and the return branch. However, if you thought that the branch loop was not likely to be taken (in this case, if the string is likely less than 64 characters) then you could hint the return address instead by changing the hint to:

#This assumes that $lr has the right address right now (true in our case)
hbr end_function, $lr

You can actually do some fairly advanced hinting behavior using register-based hint instructions. Just keep in mind the hint restrictions as well as the fact that hint instructions do take up a cycle of your program.


Conclusion

At the end, your optimized function should look like this:


Listing 12. Fully optimized conversion function
.data
.equ CONVERSION_FACTOR, 'a' - 'A'
.align 4
conversion_bytes:
	.fill 16, 1, CONVERSION_FACTOR

.text
.global convert_buffer_to_upper
.type convert_buffer_to_upper, @function
	.equ BUFFER_REG, 3
	.equ BUFFER_SZ_REG, 4
	.equ BUFFER_END_REG, 5
	.equ CONVERSION_BYTES_REG, 6
	.equ IS_FINISHED_REG, 7

	.equ CURRENT_VAL_REG, 8
	.equ BOOL_TMP1_REG, 9
	.equ BOOL_TMP2_REG, 10
	.equ IN_RANGE_REG, 11
	.equ PROCESSED_VAL_REG, 12

	.equ NUMREGS, 5
	.equ REGBYTES, 16
convert_buffer_to_upper:
	a $BUFFER_END_REG, $BUFFER_SZ_REG, $BUFFER_REG
	lqr $CONVERSION_BYTES_REG, conversion_bytes

	hbrr loop_branch_instruction, loop_start
loop_start:
	lqd $(CURRENT_VAL_REG+0*NUMREGS), 0*REGBYTES($BUFFER_REG)
	lqd $(CURRENT_VAL_REG+1*NUMREGS), 1*REGBYTES($BUFFER_REG)
	lqd $(CURRENT_VAL_REG+2*NUMREGS), 2*REGBYTES($BUFFER_REG)
	lqd $(CURRENT_VAL_REG+3*NUMREGS), 3*REGBYTES($BUFFER_REG)
	absdb $(PROCESSED_VAL_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS), $CONVERSION_BYTES_REG
	absdb $(PROCESSED_VAL_REG+1*NUMREGS), $(CURRENT_VAL_REG+1*NUMREGS), $CONVERSION_BYTES_REG
	absdb $(PROCESSED_VAL_REG+2*NUMREGS), $(CURRENT_VAL_REG+2*NUMREGS), $CONVERSION_BYTES_REG
	absdb $(PROCESSED_VAL_REG+3*NUMREGS), $(CURRENT_VAL_REG+3*NUMREGS), $CONVERSION_BYTES_REG
	cgtbi $(BOOL_TMP1_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS), 'a'-1
	cgtbi $(BOOL_TMP1_REG+1*NUMREGS), $(CURRENT_VAL_REG+1*NUMREGS), 'a'-1
	cgtbi $(BOOL_TMP1_REG+2*NUMREGS), $(CURRENT_VAL_REG+2*NUMREGS), 'a'-1
	cgtbi $(BOOL_TMP1_REG+3*NUMREGS), $(CURRENT_VAL_REG+3*NUMREGS), 'a'-1
	cgtbi $(BOOL_TMP2_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS), 'z'
	cgtbi $(BOOL_TMP2_REG+1*NUMREGS), $(CURRENT_VAL_REG+1*NUMREGS), 'z'
	cgtbi $(BOOL_TMP2_REG+2*NUMREGS), $(CURRENT_VAL_REG+2*NUMREGS), 'z'
	cgtbi $(BOOL_TMP2_REG+3*NUMREGS), $(CURRENT_VAL_REG+3*NUMREGS), 'z'
	xor $(IN_RANGE_REG+0*NUMREGS), $(BOOL_TMP1_REG+0*NUMREGS), $(BOOL_TMP2_REG+0*NUMREGS)
	xor $(IN_RANGE_REG+1*NUMREGS), $(BOOL_TMP1_REG+1*NUMREGS), $(BOOL_TMP2_REG+1*NUMREGS)
	xor $(IN_RANGE_REG+2*NUMREGS), $(BOOL_TMP1_REG+2*NUMREGS), $(BOOL_TMP2_REG+2*NUMREGS)
	xor $(IN_RANGE_REG+3*NUMREGS), $(BOOL_TMP1_REG+3*NUMREGS), $(BOOL_TMP2_REG+3*NUMREGS)
	selb $(CURRENT_VAL_REG+0*NUMREGS), $(CURRENT_VAL_REG+0*NUMREGS),
	$(PROCESSED_VAL_REG+0*NUMREGS), $(IN_RANGE_REG+0*NUMREGS)
	selb $(CURRENT_VAL_REG+1*NUMREGS), $(CURRENT_VAL_REG+1*NUMREGS),
	$(PROCESSED_VAL_REG+1*NUMREGS), $(IN_RANGE_REG+1*NUMREGS)
.align 3
	selb $(CURRENT_VAL_REG+2*NUMREGS), $(CURRENT_VAL_REG+2*NUMREGS),
	$(PROCESSED_VAL_REG+2*NUMREGS), $(IN_RANGE_REG+2*NUMREGS)
	stqd $(CURRENT_VAL_REG+0*NUMREGS), 0*REGBYTES($BUFFER_REG)
	selb $(CURRENT_VAL_REG+3*NUMREGS), $(CURRENT_VAL_REG+3*NUMREGS),
	$(PROCESSED_VAL_REG+3*NUMREGS), $(IN_RANGE_REG+3*NUMREGS)
	stqd $(CURRENT_VAL_REG+1*NUMREGS), 1*REGBYTES($BUFFER_REG)
	stqd $(CURRENT_VAL_REG+2*NUMREGS), 2*REGBYTES($BUFFER_REG)
	stqd $(CURRENT_VAL_REG+3*NUMREGS), 3*REGBYTES($BUFFER_REG)

	ai $BUFFER_REG, $BUFFER_REG, REGBYTES
	cgt $IS_FINISHED_REG, $BUFFER_REG, $BUFFER_END_REG
loop_branch_instruction:
	brz $IS_FINISHED_REG, loop_start

end_function:
	bi $lr

This code has been branch-eliminated, vectorized, loop-unrolled, instruction-scheduled, and branch-hinted. In other words, it's pretty darn fast. The next article switches to programming in C, but this information will still be useful for understanding what the compiler is trying (or at least should be trying) to do, and for analyzing the output of your compiler to understand where hand-coded assembly could give you better performance.

Stay tuned for the rest of this series: We'll explore coding the SPU in C, covering SPU extensions to C, and other higher-level optimization techniques.


Resources

About the author

Jonathan Bartlett is the author of the book Programming from the Ground Up, an introduction to programming using Linux assembly language. He is the lead developer at New Media Worx, responsible for developing Web, video, kiosk, and desktop applications for clients.

Report abuse help

Report abuse

Thank you. This entry has been flagged for moderator attention.


Report abuse help

Report abuse

Report abuse submission failed. Please try again later.


developerWorks: Sign in


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Select information in your profile (name, country/region, and company) is displayed to the public and will accompany any content you post. You may update your IBM account at any time.

Choose your display name

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


Rate this article

Comments

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Multicore acceleration
ArticleID=200411
ArticleTitle=Programming high-performance applications on the Cell BE processor, Part 4: Program the SPU for performance
publish-date=03062007
author1-email=johnnyb@eskimo.com
author1-email-cc=dwpower@us.ibm.com