Software optimization techniques for PowerPC 405 and 440
The code optimization rules for the PowerPC 405 and 440 processors are different, but it is often useful to write one set of code which will perform well on both CPU types. One problem when trying to write optimized code is that the rules can be complex and span many page of documentation.
The rules given here apply to code designed to run well on both 405 and 440. Clearly some of the rules will only produce a benefit on one of the processors, but by following the rules code will run optimally on either processor. Additional details can be found in the User Manuals for the respective processors, for example the 405GP and 440GP.
Four main rules
The four main optimizing rules are covered in the following sections, and are:
- Instruction pairing
- Load/Use dependencies
- Operand dependencies
- Compare and branch
On the 440, instructions fall into one of three categories. The most
commonly occurring instructions are highlighted:
- Loads and Stores
- Any of these:
- Instructions which set a bit in the CR; compares and dot forms of instructions
- Multiply and Divide
- SPR accesses, system instructions (sc, rfi), XER-updating "o forms"
- Anything else, primarily non-CR, non-XER updating arithmetic and logical instructions
The 440 processor can dispatch two instructions per cycle, provided that the two instructions are not from the same special category. If there are two adjacent instructions from the same Special Category, the second one will wait until at least the cycle after the first one has dispatched. 405 and 440 processors only dispatch instructions in-order, so having to wait results in a wasted cycle in the CPU. Itâs therefore useful to mix instructions from different categories to allow pairs to be dispatched together.
When data is loaded into a register from cache, it takes several cycles before that data is available to be used by another instruction. An instruction which uses data thatâs been loaded cannot be dispatched until the third cycle after the load. Therefore itâs most efficient to place other instructions between the data load and its use. How many instructions are needed is dependent on the type of instructions, and whether they can be paired for dispatching together. A maximum of five instructions are needed: this consists of one instruction which will be paired with the load, then two the next cycle, and two the cycle after that. The "use" instruction can then dispatch on the next instruction, see Table 1.
Table 1. Load/use - 5 separating instructions needed
|lwz r4,0(r5)||n||Load dispatches|
|addi r6,r6,0x20||n||addi dispatches in same cycle as load|
|add r9,r4,r12||n+3||Load data in r4 is available for use 3 cycles after the load instruction|
Depending on the instruction mix, as few as two instructions may be inserted between the load and the use. This occurs when the following two instructions cannot be paired with any other instructions. This leads to a performance degradation due to the non-pairing of instructions, but by including the two instructions between the load and use there is no additional penalty. See Table 2. Note here that the stw and lbz instructions in cycles n+1 and N+2 cannot pair with the other loads and stores because they are all part of the same "loads and stores" special category.
Table 2. Load/use - 2 separating instructions needed
|lwz r4,0(r5)||n||Load dispatches|
|stw r6,32(r1)||n+1||Store must wait until cycle n+1|
|lbz r11,20(r1)||n+2||Load must wait until cycle n+2|
|lwz r9,0(r4)||n+3||Load data in r4 is available. Load must wait until cycle n+3 to be dispatched - no additional delay is incurred for waiting for r4 data.|
Although most load operations only have one target register, be aware that the "update" instructions, such as lwzu, also update a source register. This updated register will not be available to be used until the load instruction completes, so be careful to apply the same use rules to it as to the target register.
Two instructions cannot be dispatched in the same cycle if the first one updates a register which is used by the second instruction. So itâs useful to separate instructions like this by at least one other instruction. In Table 3 the srawi instruction uses r4 which is updated by the preceding addi instruction. The srawi has to wait until the next cycle before it dispatches, leaving a dead cycle where nothing dispatches in parallel with the addi. Eventually the subf instruction executes 2 cycles after the addi.
Table 3. Operand dependency: wasting a cycle
|n||Dead cycle - nothing executes here|
|srawi r7,r4,4||n+1||Shift must wait for r4 data - cannot dispatch in same cycle as addi|
|lwz r8,0(r1)||n+1||load can pair with shift in same cycle|
|subf r9,r10,r11||n+2||subf executes 2 cycles after addi|
By moving the srawi instruction after the lwz, the lwz can dispatch in the same cycle as the addi, and there is no delay in the srawi instruction which dispatches in the following cycle, as shown in Table 4. The subf executes one instruction after the addi, showing that a cycle has been saved.
Table 4. Operand dependency: saving a cycle
|lwz r8,0(r1)||n||load can pair with addi in same cycle|
|srawi r7,r4,4||n+1||Shift uses r4 data - does not have to wait|
|subf r9,r10,r11||n+1||subf executes 1 cycle after addi|
Compare and Branch
CR (Condition Register) -updating instructions should be separated from branch instructions by one instruction. CR-updating instructions are usually compares or the "dot form" of arithmetic and logical instructions (but also includes the rarer mcrf, mcrxr, mtcrf). (Itâs possible to have best performance with no instructions separating them, but this is only in the situation where you know that the CR-setting instruction will be paired with the preceding instruction (for 440), or the branch is correctly predicted (for 405). Often this is difficult to guarantee, so the best rule is to separate by one instruction.)
Techniques for analyzing optimized code
It can often be useful to trace a segment of code as it runs, then observe the number of cycles taken by each instruction. This can be a useful tool in spotting trouble spots which are slowing down execution. One method of observing code performance is by using the RISCTrace tool, part of the IBM RISCWatch debugger. This shows all the instructions executed, and keeps track of the number of cycles used. The usual method of using RISCTrace effectively is as follows. Create a source file containing the code you wish to analyze. It is often helpful to create a scaffold routine which can set up the environment (registers, data areas, stack etc.) that the test code is expecting. The scaffold code typically sets up its environment, branches to the test code and ends with an instruction which branches to itself (a spin loop). This spin loop prevents execution of the code from "running off the end" of the loaded file. Of course, if you have a complete run-time environment available you may run your code without the need for scaffolding.
Compile and link the code you wish to analyze along with the scaffold code, and load it onto the target system by using the RISCWatch "load file" command. Make a note of the entry point of the code, you will need it later. Make sure that you are set up to use hardware breakpoints (use RISCWatch menus Source -> Breakpoints then select "Hardware BPs", or use the command "bpmode hardware"). Set a breakpoint on the spin loop instruction at the end of the scaffold code. Usually you can open the assembly debug window (Hardware -> Asm debug) and scroll through the scaffold code until you see the final branch instruction, then click on it to set the breakpoint.
The next step is to ensure the code and data you use are loaded into cache. Check that caching is enabled for the areas your code is loaded into: in 405 real mode check the ICCR and DCCR registers, in 405 virtual mode or on the 440 check the relevant TLB entries to ensure that the I (caching inhibited) bit is not set. Click "run" on the assembly debug window. Your code should run and stop at the hardware breakpoint you set. At this point all code and data should be in cache. Now we will run it a second time, this time tracing it.
Set the instruction pointer back to the original entry point you noted earlier, by entering it into the "Set IAR" field in the assembly debug window. Open the trace window (Hardware -> Trigger/ trace). Set the "Cycles Before Trigger" field to the maximum number of cycles you wish to trace - usually a value of 1000 to 10000 is good. Set the "Cycles After Trigger" to 0. Click "Run Trace". Your code will execute, stop at the hardware breakpoint again, and produce a trace listing.
When looking at the trace, the "Cycles/Instruction" column provides an indication of any pipeline stalls. For 405, ideally all the entries in this column should be 1, indicating that the instruction executed in 1 cycle (clearly some instructions, such as divide, are designed to take more than one cycle to execute, and this does not indicate a performance problem). Instructions taking more than one cycle should be investigated to see if reordering them using the rules above can improve their execution speed.
Sample 405 RISCTrace output
# Instr Total Cycle/ # Count Cycle Instr Address Opcode Disassembly # ------ -------- ------- -------- -------- ----------- 00000086 00000085 0000001 00040158 813E0000 lwz R9,0x00000000(R30) 00000087 00000086 0000001 0004015C 57AB103A rlwinm R11,R29,2,0,29 00000088 00000088 0000002 00040160 7CAB482E lwzx R5,R11,R9 00000089 00000089 0000001 00040164 813E0004 lwz R9,0x00000004(R30) 00000090 00000092 0000003 00040168 7CCB482E lwzx R6,R11,R9 00000091 00000093 0000001 0004016C 813E0008 lwz R9,0x00000008(R30) 00000092 00000096 0000003 00040170 7CEB482E lwzx R7,R11,R9
Figure 5 shows the trace output from RISCTrace on a 405 (some columns
omitted for space reasons). In the 3rd column, "Cycle/Instr", you can see
that some instructions take more than one cycle to execute. For example,
the third instruction,
lwzx R5,R11,R9 takes an extra cycle
because it has to wait for the load of R9 (in line 1) to complete before
it can dispatch. Similarly, the two other lwzx instructions each take 3
cycles, because they use R9 and must wait for the previous instruction to
complete its load of that register. In the first case, there is an
instruction, rlwinm, between the load of R9 and its subsequent use, so the
using instruction only waits one extra cycle, thereby taking a total of
two cycles to complete. There are no instructions between the load and use
in the last two cases, so the maximum penalty of 3 cycles is incurred. In
this case, it would be advantageous to investigate surrounding code to see
if other instructions could be placed between the load and use
instructions to do useful work while the use instructions are waiting for
the load to complete.
For 440, multiple instructions can complete in a cycle, and things are further complicated by the fact that the 440 reports the instructions which have completed in a 4-cycle block. Up to 8 instructions can complete in a single 4-cycle period. The RISCTrace output shows the start of a 4-cycle block by placing a 4 in the "Cycles/Instruction" column, followed by 0âs for all following instructions which completed in the same 4-cycle block. Ideally, a total of 8 instructions should complete in a 4-cycle block. Fewer instructions indicate that the area of code might be worth investigating for optimization.
On small code fragments, an empty trace file might be produced. Trace requires a minimum number of cycles to execute before it can produce results. To ensure enough cycles are executed, start your code with a small loop, for example loading a small value into the CTR register and decrement it, test it and loop using the bdnz instruction. The value that you load into CTR is dependent on how long your test code is. Notice that all commands you issue in RISCWatch are saved in its main command line window, and may be re-executed by double-clicking on them. This can be useful when running the same (or updated) code multiple times. The sequence of operations:
- load file
- set breakpoint
- reset instruction pointer
can be quickly re-executed by double-clicking the commands in the main window.
No-ops considered harmful
When testing or measuring the performance of a fragment of code, it is often useful to have the processor start in a steady-state, so that preceding code does not influence the execution of the code being tested.
If a preceding instruction takes an unpredictable number of instructions to execute (for example, a load which resulted in a cache miss, a mis-predicted branch, or a multiply where the values of the operands determine the time taken), it is difficult to determine if an observed pipeline delay is caused by the preceding instructions or the code being measured. In order to make sure that the number of cycles used by the test code is reported accurately, it is typical to execute a known sequence of instructions so the actual time used by the test instructions can be determined. The usual instructions used are a series of no-op (no operation) instructions, or nops.
The nop instruction is an extended mnemonic for the instruction "ori r0,r0,0". This is a simple single-cycle operation which does not alter any of the machine state - register r0 is written with its previous contents. In the 405, a sequence of nops executes at the maximum processor speed and guarantees no pipeline stalls, or "bubbles". However, in the 440 this is not the case. The nop instruction is both a producer and consumer of r0. When two adjacent nop instructions are executed, the second one cannot be paired to execute the same cycle as the first, because of the dependency on the shared register r0, as described in Operand Dependencies. This leads to a pipeline "bubble". If you need a series of nop-like instructions which will fill the pipeline and execute at maximum speed, you should consider alternating two instructions, using separate registers. An example of a second no-op instruction would be "ori r2,r2,0". A sequence of pipeline-filling no-ops would be:
nop ori r2,r2,0 nop ori r2,r2,0
Although there are many code optimization details documented for the 405 and 440 processor families, by following the basic rules described here you can gain most of the optimization benefits available, and have code which will perform well on both processor families.
By using the RISCWatch trace tool, you can verify that critical code is performing optimally.
- The Linux Documentation Project is a repository of Linux documentation including documents about individual software, HOWTO documents, FAQs, and more.
- Practical UNIX & Internet Security (O'Reilly & Associates; 1996), by Garfinkel and Spafford, is an excellent reference on all aspects of system security from user management to drafting a security policy.