Software optimization techniques for PowerPC 405 and 440

The PowerPC 405 and 440 processors each have optimization rules to be used when creating assembler code. This document describes a concise set of rules which cover the most common cases.

Neil Leeder (neileedr@us.ibm.com), Software Engineer, IBM, Software Group

Neil Leeder works in the PowerPC enablement group within IBM Microelectronics. He has worked in PowerPC embedded software development since 1993, and was part of the team which created OS Open, the first real-time operating system for embedded PowerPC processors. He played a key role in the development of IBM's software evaluation kits for the 405GP and 440GP processors. Neil has a degree in Computing and Information Systems from the University of Manchester in England. You can contact Neil at neileedr@us.ibm.com.



10 February 2005

Introduction

The code optimization rules for the PowerPC 405 and 440 processors are different, but it is often useful to write one set of code which will perform well on both CPU types. One problem when trying to write optimized code is that the rules can be complex and span many page of documentation.

The rules given here apply to code designed to run well on both 405 and 440. Clearly some of the rules will only produce a benefit on one of the processors, but by following the rules code will run optimally on either processor. Additional details can be found in the User Manuals for the respective processors, for example the 405GP and 440GP.


Four main rules

The four main optimizing rules are covered in the following sections, and are:

  • Instruction pairing
  • Load/Use dependencies
  • Operand dependencies
  • Compare and branch

Instruction Pairing

On the 440, instructions fall into one of three categories. The most commonly occurring instructions are highlighted:
Special categories:

  1. Loads and Stores
  2. Any of these:
  • Instructions which set a bit in the CR; compares and dot forms of instructions
  • Branches
  • Multiply and Divide
  • SPR accesses, system instructions (sc, rfi), XER-updating "o forms"


General categories:

  • Anything else, primarily non-CR, non-XER updating arithmetic and logical instructions

The 440 processor can dispatch two instructions per cycle, provided that the two instructions are not from the same special category. If there are two adjacent instructions from the same Special Category, the second one will wait until at least the cycle after the first one has dispatched. 405 and 440 processors only dispatch instructions in-order, so having to wait results in a wasted cycle in the CPU. It’s therefore useful to mix instructions from different categories to allow pairs to be dispatched together.

Load/Use dependencies

When data is loaded into a register from cache, it takes several cycles before that data is available to be used by another instruction. An instruction which uses data that’s been loaded cannot be dispatched until the third cycle after the load. Therefore it’s most efficient to place other instructions between the data load and its use. How many instructions are needed is dependent on the type of instructions, and whether they can be paired for dispatching together. A maximum of five instructions are needed: this consists of one instruction which will be paired with the load, then two the next cycle, and two the cycle after that. The "use" instruction can then dispatch on the next instruction, see Table 1.

Table 1. Load/use - 5 separating instructions needed
InstructionsCycleComment
lwz       r4,0(r5)nLoad dispatches
addi      r6,r6,0x20naddi dispatches in same cycle as load
or r7,r8,r9n+1
stw r6,32(r1)n+1
subf r10,r11,r12n+2
srawi r8,r8,4n+2
add r9,r4,r12n+3Load data in r4 is available for use 3 cycles after the load instruction

Depending on the instruction mix, as few as two instructions may be inserted between the load and the use. This occurs when the following two instructions cannot be paired with any other instructions. This leads to a performance degradation due to the non-pairing of instructions, but by including the two instructions between the load and use there is no additional penalty. See Table 2. Note here that the stw and lbz instructions in cycles n+1 and N+2 cannot pair with the other loads and stores because they are all part of the same "loads and stores" special category.

Table 2. Load/use - 2 separating instructions needed
InstructionsCycleComment
lwz r4,0(r5)nLoad dispatches
stw r6,32(r1)n+1Store must wait until cycle n+1
lbz r11,20(r1)n+2Load must wait until cycle n+2
lwz r9,0(r4)n+3Load data in r4 is available. Load must wait until cycle n+3 to be dispatched - no additional delay is incurred for waiting for r4 data.

Although most load operations only have one target register, be aware that the "update" instructions, such as lwzu, also update a source register. This updated register will not be available to be used until the load instruction completes, so be careful to apply the same use rules to it as to the target register.

Operand dependencies

Two instructions cannot be dispatched in the same cycle if the first one updates a register which is used by the second instruction. So it’s useful to separate instructions like this by at least one other instruction. In Table 3 the srawi instruction uses r4 which is updated by the preceding addi instruction. The srawi has to wait until the next cycle before it dispatches, leaving a dead cycle where nothing dispatches in parallel with the addi. Eventually the subf instruction executes 2 cycles after the addi.

Table 3. Operand dependency: wasting a cycle
InstructionsCycleComment
addi     r4,r5,r6n
nDead cycle - nothing executes here
srawi r7,r4,4n+1Shift must wait for r4 data - cannot dispatch in same cycle as addi
lwz r8,0(r1)n+1load can pair with shift in same cycle
subf r9,r10,r11n+2subf executes 2 cycles after addi

By moving the srawi instruction after the lwz, the lwz can dispatch in the same cycle as the addi, and there is no delay in the srawi instruction which dispatches in the following cycle, as shown in Table 4. The subf executes one instruction after the addi, showing that a cycle has been saved.

Table 4. Operand dependency: saving a cycle
InstructionsCycleComment
addi r4,r5,r6 n
lwz r8,0(r1)nload can pair with addi in same cycle
srawi r7,r4,4n+1Shift uses r4 data - does not have to wait
subf r9,r10,r11n+1subf executes 1 cycle after addi

Compare and Branch

CR (Condition Register) -updating instructions should be separated from branch instructions by one instruction. CR-updating instructions are usually compares or the "dot form" of arithmetic and logical instructions (but also includes the rarer mcrf, mcrxr, mtcrf). (It’s possible to have best performance with no instructions separating them, but this is only in the situation where you know that the CR-setting instruction will be paired with the preceding instruction (for 440), or the branch is correctly predicted (for 405). Often this is difficult to guarantee, so the best rule is to separate by one instruction.)


Techniques for analyzing optimized code

It can often be useful to trace a segment of code as it runs, then observe the number of cycles taken by each instruction. This can be a useful tool in spotting trouble spots which are slowing down execution. One method of observing code performance is by using the RISCTrace tool, part of the IBM RISCWatch debugger. This shows all the instructions executed, and keeps track of the number of cycles used. The usual method of using RISCTrace effectively is as follows. Create a source file containing the code you wish to analyze. It is often helpful to create a scaffold routine which can set up the environment (registers, data areas, stack etc.) that the test code is expecting. The scaffold code typically sets up its environment, branches to the test code and ends with an instruction which branches to itself (a spin loop). This spin loop prevents execution of the code from "running off the end" of the loaded file. Of course, if you have a complete run-time environment available you may run your code without the need for scaffolding.

Compile and link the code you wish to analyze along with the scaffold code, and load it onto the target system by using the RISCWatch "load file" command. Make a note of the entry point of the code, you will need it later. Make sure that you are set up to use hardware breakpoints (use RISCWatch menus Source -> Breakpoints then select "Hardware BPs", or use the command "bpmode hardware"). Set a breakpoint on the spin loop instruction at the end of the scaffold code. Usually you can open the assembly debug window (Hardware -> Asm debug) and scroll through the scaffold code until you see the final branch instruction, then click on it to set the breakpoint.

The next step is to ensure the code and data you use are loaded into cache. Check that caching is enabled for the areas your code is loaded into: in 405 real mode check the ICCR and DCCR registers, in 405 virtual mode or on the 440 check the relevant TLB entries to ensure that the I (caching inhibited) bit is not set. Click "run" on the assembly debug window. Your code should run and stop at the hardware breakpoint you set. At this point all code and data should be in cache. Now we will run it a second time, this time tracing it.

Set the instruction pointer back to the original entry point you noted earlier, by entering it into the "Set IAR" field in the assembly debug window. Open the trace window (Hardware -> Trigger/ trace). Set the "Cycles Before Trigger" field to the maximum number of cycles you wish to trace - usually a value of 1000 to 10000 is good. Set the "Cycles After Trigger" to 0. Click "Run Trace". Your code will execute, stop at the hardware breakpoint again, and produce a trace listing.

When looking at the trace, the "Cycles/Instruction" column provides an indication of any pipeline stalls. For 405, ideally all the entries in this column should be 1, indicating that the instruction executed in 1 cycle (clearly some instructions, such as divide, are designed to take more than one cycle to execute, and this does not indicate a performance problem). Instructions taking more than one cycle should be investigated to see if reordering them using the rules above can improve their execution speed.

Sample 405 RISCTrace output

# Instr	  Total     Cycle/
# Count	  Cycle     Instr    Address   Opcode	Disassembly
# ------  --------  -------  --------  -------- -----------

00000086 00000085   0000001  00040158  813E0000 lwz     R9,0x00000000(R30)
00000087 00000086   0000001  0004015C  57AB103A rlwinm  R11,R29,2,0,29
00000088 00000088   0000002  00040160  7CAB482E lwzx    R5,R11,R9
00000089 00000089   0000001  00040164  813E0004 lwz     R9,0x00000004(R30)
00000090 00000092   0000003  00040168  7CCB482E lwzx    R6,R11,R9
00000091 00000093   0000001  0004016C  813E0008 lwz     R9,0x00000008(R30)
00000092 00000096   0000003  00040170  7CEB482E lwzx     R7,R11,R9

Figure 5 shows the trace output from RISCTrace on a 405 (some columns omitted for space reasons). In the 3rd column, "Cycle/Instr", you can see that some instructions take more than one cycle to execute. For example, the third instruction, lwzx R5,R11,R9 takes an extra cycle because it has to wait for the load of R9 (in line 1) to complete before it can dispatch. Similarly, the two other lwzx instructions each take 3 cycles, because they use R9 and must wait for the previous instruction to complete its load of that register. In the first case, there is an instruction, rlwinm, between the load of R9 and its subsequent use, so the using instruction only waits one extra cycle, thereby taking a total of two cycles to complete. There are no instructions between the load and use in the last two cases, so the maximum penalty of 3 cycles is incurred. In this case, it would be advantageous to investigate surrounding code to see if other instructions could be placed between the load and use instructions to do useful work while the use instructions are waiting for the load to complete.

For 440, multiple instructions can complete in a cycle, and things are further complicated by the fact that the 440 reports the instructions which have completed in a 4-cycle block. Up to 8 instructions can complete in a single 4-cycle period. The RISCTrace output shows the start of a 4-cycle block by placing a 4 in the "Cycles/Instruction" column, followed by 0’s for all following instructions which completed in the same 4-cycle block. Ideally, a total of 8 instructions should complete in a 4-cycle block. Fewer instructions indicate that the area of code might be worth investigating for optimization.

Useful Hints

On small code fragments, an empty trace file might be produced. Trace requires a minimum number of cycles to execute before it can produce results. To ensure enough cycles are executed, start your code with a small loop, for example loading a small value into the CTR register and decrement it, test it and loop using the bdnz instruction. The value that you load into CTR is dependent on how long your test code is. Notice that all commands you issue in RISCWatch are saved in its main command line window, and may be re-executed by double-clicking on them. This can be useful when running the same (or updated) code multiple times. The sequence of operations:

  1. load file
  2. set breakpoint
  3. run
  4. reset instruction pointer
  5. trace

can be quickly re-executed by double-clicking the commands in the main window.

No-ops considered harmful

When testing or measuring the performance of a fragment of code, it is often useful to have the processor start in a steady-state, so that preceding code does not influence the execution of the code being tested.

If a preceding instruction takes an unpredictable number of instructions to execute (for example, a load which resulted in a cache miss, a mis-predicted branch, or a multiply where the values of the operands determine the time taken), it is difficult to determine if an observed pipeline delay is caused by the preceding instructions or the code being measured. In order to make sure that the number of cycles used by the test code is reported accurately, it is typical to execute a known sequence of instructions so the actual time used by the test instructions can be determined. The usual instructions used are a series of no-op (no operation) instructions, or nops.

The nop instruction is an extended mnemonic for the instruction "ori r0,r0,0". This is a simple single-cycle operation which does not alter any of the machine state - register r0 is written with its previous contents. In the 405, a sequence of nops executes at the maximum processor speed and guarantees no pipeline stalls, or "bubbles". However, in the 440 this is not the case. The nop instruction is both a producer and consumer of r0. When two adjacent nop instructions are executed, the second one cannot be paired to execute the same cycle as the first, because of the dependency on the shared register r0, as described in Operand Dependencies. This leads to a pipeline "bubble". If you need a series of nop-like instructions which will fill the pipeline and execute at maximum speed, you should consider alternating two instructions, using separate registers. An example of a second no-op instruction would be "ori r2,r2,0". A sequence of pipeline-filling no-ops would be:

nop
ori r2,r2,0
nop
ori r2,r2,0

Conclusions

Although there are many code optimization details documented for the 405 and 440 processor families, by following the basic rules described here you can gain most of the optimization benefits available, and have code which will perform well on both processor families.

By using the RISCWatch trace tool, you can verify that critical code is performing optimally.

Resources

  • Download the source code used in this article.
  • The Linux Documentation Project is a repository of Linux documentation including documents about individual software, HOWTO documents, FAQs, and more.
  • "Build a better GUI" (developerWorks, October 2001) discusses the use of Java layout managers for better overall GUI design.
  • Practical UNIX & Internet Security (O'Reilly & Associates; 1996), by Garfinkel and Spafford, is an excellent reference on all aspects of system security from user management to drafting a security policy.

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Linux on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Linux
ArticleID=13094
ArticleTitle=Software optimization techniques for PowerPC 405 and 440
publish-date=02102005