Assembly language for Power Architecture, Part 3

Programming with the PowerPC branch processor

A closer look at branching instructions and registers

Branching registers

Branches in PowerPC make use of three special-purpose registers: the condition register, the count register, and the link register.

The condition register

The condition register consists conceptually of eight fields. A field is a segment of four bits used to store status information about the results of an instruction. Two of the fields are somewhat special-purpose and will be covered shortly; the remaining fields are available for general use. The fields are named cr0 through cr7.

The first field, cr0, is used for the results of fixed-point computation instructions, which use non-immediate operands (with a few exceptions). The result of the computation is compared with zero, and the appropriate bits are set (negative, zero, or positive). To indicate in a computational instruction that you want it to set cr0, you simply add a period (.) to the end of the instruction. For example, add 4, 5, 6 adds register 5 to register 6 and stores the result in register 4, without setting any status bits in cr0. However, add. 4, 5, 6 does the same thing, but sets the bits in cr0 based on the computed value. cr0 is also the default field for compare instructions.

The second field (called cr1) is used by floating-point instructions using the period after the instruction name. Floating-point computation is outside the scope of this article.

Each field has four bits. The usage of those bits varies with the instruction being used. Here are their possible uses (floating-point uses are listed but not described):

Condition register field bits

Bit  Mnemonic  Fixed-point comparison  Fixed-point computation  Floating-point comparison  Floating-point computation
0    lt        Less than               Negative                 Less than                  Exception summary
1    gt        Greater than            Positive                 Greater than               Enabled exception summary
2    eq        Equal                   Zero                     Equal                      Invalid operation exception summary
3    so        Summary overflow        Summary overflow         Unordered                  Overflow exception

Later you will see how to access these fields both implicitly and directly.
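To make the fixed-point computation column concrete, here is a minimal C sketch of what a record-form instruction such as add. leaves in cr0. The helper name cr0_from_result and the CR_* constants are hypothetical names chosen for this illustration, and the sketch assumes the result is interpreted as a signed 64-bit value, with the so bit simply copied in from the caller.

```c
#include <stdint.h>

/* Values of the four bits inside one condition register field,
   written most significant bit first (bit 0 = lt ... bit 3 = so).
   These names are illustrative, not part of any real header. */
enum { CR_LT = 8, CR_GT = 4, CR_EQ = 2, CR_SO = 1 };

/* Hypothetical helper: model the 4-bit value that a record-form
   instruction (for example "add.") would place in cr0. */
unsigned cr0_from_result(int64_t result, int summary_overflow)
{
    unsigned field;

    if (result < 0)
        field = CR_LT;      /* negative */
    else if (result > 0)
        field = CR_GT;      /* positive */
    else
        field = CR_EQ;      /* zero */

    if (summary_overflow)
        field |= CR_SO;     /* so is carried along unchanged */

    return field;
}
```

Exactly one of lt, gt, and eq is set for any result, which is why a later conditional branch can test a single bit.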

The condition register can be copied to or from a general-purpose register using mtcr, mtcrf, and mfcr. mtcr copies a specified general-purpose register to the condition register. mfcr copies the condition register to a general-purpose register. mtcrf also loads the condition register from a general-purpose register, but only the fields selected by an 8-bit mask, which is the first operand. The most significant bit of the mask selects cr0, and the least significant bit selects cr7.

Here are some examples:

Listing 1. Condition register transfer examples
#Copy register 4 to the condition register
mtcr 4

#Copy the condition register to register 28
mfcr 28

#Copy fields 0, 1, 2, and 7 from register 18 to the condition register
mtcrf 0b11100001, 18
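The mask in the last example can be hard to read at a glance. The following C sketch models how an 8-bit field mask selects 4-bit fields of the 32-bit condition register. The function names expand_field_mask and model_mtcrf are hypothetical, and the sketch only illustrates the bit selection; it is not a description of how the hardware is built.

```c
#include <stdint.h>

/* Expand an 8-bit mtcrf field mask into a 32-bit bit mask.
   The most significant mask bit selects cr0, the least significant cr7. */
uint32_t expand_field_mask(uint8_t fxm)
{
    uint32_t mask = 0;

    for (int field = 0; field < 8; field++) {
        if (fxm & (0x80u >> field))
            mask |= 0xF0000000u >> (4 * field);
    }
    return mask;
}

/* Model of mtcrf: replace only the selected fields of the condition
   register with the matching bits from the low 32 bits of a register. */
uint32_t model_mtcrf(uint32_t cr, uint8_t fxm, uint32_t gpr_low32)
{
    uint32_t mask = expand_field_mask(fxm);

    return (gpr_low32 & mask) | (cr & ~mask);
}
```

With the mask 0b11100001 (0xE1) from Listing 1, the expanded mask covers the first three fields and the last one, so cr3 through cr6 keep their old contents.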

The count and link registers

The link register (called LR) is a special-purpose register that holds return addresses from branch instructions. All branch instructions can be told to set the link register, which then receives the address of the instruction immediately following the branch. (For a conditional branch, the link register is written whether or not the branch is actually taken.) You tell a branch instruction to set the link register by appending the letter l to its name. For instance, b is an unconditional branch instruction, and bl is an unconditional branch instruction that sets the link register.

The count register (called CTR) is a special-purpose register designed to hold loop counters. Special branch instructions can decrement the count register and/or conditionally branch depending on whether CTR has reached zero.

Both the link and count registers can be used as a branch destination. bctr branches to the address specified in the count register, and blr branches to the address specified in the link register.

The link and count registers can also be loaded from and copied to general-purpose registers. For the link register, mtlr moves a given register value to the link register, and mflr moves a value from the link register to a general-purpose register. mtctr and mfctr do the same for the count register.

Unconditional branching

Unconditional branching on PowerPC instruction sets uses the I-Form instruction format:

I-Form instruction format

Bits 0-5

Primary opcode

Bits 6-29

Absolute or relative branch address

Bit 30

Absolute address bit -- If this bit is set, the branch address is interpreted as an absolute address; otherwise it is interpreted as relative to the address of the branch instruction

Bit 31

Link bit -- If this field is set, the instruction sets the link register with the address of the next instruction

As mentioned earlier, adding the letter l onto a branch instruction causes the link bit to be set, so that the "return address" (the instruction after the branch) is stored in the link register. If you affix the letter a at the end (it comes after the l, if that is used), then the address specified is an absolute address (this is not often used in user-level code, because it limits the branch destinations too much).
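As a rough illustration of the I-Form layout, the following C sketch packs the fields of an unconditional branch into a 32-bit instruction word. The function name encode_b is hypothetical. The sketch assumes the primary opcode 18 for b, and it leaves range and alignment checking to the caller: the offset must be a multiple of 4 and fit in the 24-bit field once the two low zero bits are dropped.

```c
#include <stdint.h>

/* Hypothetical helper: build the instruction word for b/bl/ba/bla.
   byte_offset is the relative displacement (or the absolute address
   when absolute is nonzero). */
uint32_t encode_b(int32_t byte_offset, int absolute, int link)
{
    uint32_t insn = 18u << 26;                   /* bits 0-5: primary opcode */

    insn |= (uint32_t)byte_offset & 0x03FFFFFCu; /* bits 6-29: address field */
    insn |= absolute ? 2u : 0u;                  /* bit 30: absolute address */
    insn |= link ? 1u : 0u;                      /* bit 31: link bit */
    return insn;
}
```

For example, encode_b(8, 0, 0) yields 0x48000008, a plain b that skips one instruction, and setting the link argument only changes the lowest bit of the word.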

Listing 2 illustrates unconditional branches, and then exits (enter as branch_example.s):

Listing 2. Unconditional branching examples
.section .opd, "aw"
.align 3
.globl _start
_start:
        .quad ._start, .TOC.@tocbase, 0

.text
._start:
#branch to target t2
        b t2

t1:
#branch to target t3, setting the link register
        bl t3
#This is the instruction that it returns to
        b t4

t2:
#branch to target t1 as an absolute address
        ba t1

t3:
#branch to the address specified in the link register
#(i.e. the return address)
        blr

t4:
#exit with status 0
        li 0, 1
        li 3, 0
        sc

Assemble, link, and run it like this:

as -a64 branch_example.s -o branch_example.o
ld -melf64ppc branch_example.o -o branch_example
./branch_example

Notice that the targets for both b and ba are specified the same way in assembly language, despite the fact that they are coded differently in the instruction. The assembler and linker take care of converting the target address into a relative or absolute address for you.

Conditional branching

Comparing registers

The cmp instruction is used to compare registers with other registers or immediate operands, and set the appropriate status bits in the condition register. By default, fixed-point compare instructions use cr0 to store the result, but the field can also be specified as an optional first operand. Compare instructions are written as in Listing 3:

Listing 3. Examples of compare instructions
#Compare register 3 and register 4 as doublewords (64 bits)
cmpd 3, 4

#Compare register 5 and register 10 as unsigned doublewords (64 bits)
cmpld 5, 10

#Compare register 6 with the number 12 as words (32 bits)
cmpwi 6, 12

#Compare register 30 and register 31 as doublewords (64 bits)
#and store the result in cr4
cmpd cr4, 30, 31

As you can see, the d specifies the operands as doublewords while the w specifies the operands as words. The i indicates that the last operand is an immediate value instead of a register, and the l tells the processor to do unsigned (also called logical) comparisons instead of signed comparisons.

Each of these instructions sets the appropriate bits in the condition register (as outlined earlier in the article), which can then be used by a conditional branch instruction.
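The difference between signed and logical comparisons matters whenever the top bit of an operand is set. The C sketch below models the lt, gt, and eq bits that cmpd and cmpld would produce for the same bit patterns. The names model_cmpd, model_cmpld, and the CMP_* constants are hypothetical, and the so bit is left out for brevity.

```c
#include <stdint.h>

/* Bit values inside one condition register field (so omitted). */
enum { CMP_LT = 8, CMP_GT = 4, CMP_EQ = 2 };

/* Model of cmpd: compare two doublewords as signed values. */
unsigned model_cmpd(int64_t a, int64_t b)
{
    if (a < b)
        return CMP_LT;
    if (a > b)
        return CMP_GT;
    return CMP_EQ;
}

/* Model of cmpld: compare the same doublewords as unsigned values. */
unsigned model_cmpld(uint64_t a, uint64_t b)
{
    if (a < b)
        return CMP_LT;
    if (a > b)
        return CMP_GT;
    return CMP_EQ;
}
```

A register holding all one bits compares as less than 1 with cmpd (it is -1) but as greater than 1 with cmpld (it is the largest unsigned doubleword).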

Basics of conditional branching

Conditional branches are a lot more flexible than unconditional branches, but that flexibility comes at the cost of branch distance. Conditional branches use the B-Form instruction format:

The B-Form instruction format

Bits 0-5

Primary opcode

Bits 6-10

Specifies the options used regarding how the bit is tested, whether and how the counter register is involved, and any branch prediction hints (called the BO field)

Bits 11-15

Specifies the bit in the condition register to test (called the BI field)

Bits 16-29

Absolute or Relative Address

Bit 30

Addressing Mode -- when set to 0 the specified address is considered a relative address; when set to 1 the address is considered an absolute address

Bit 31

Link Bit -- when set to 1 the link register is set to the address following the current instruction; when set to 0 the link register is not set

As you can see, a full 10 bits are used to specify the branch mode and condition, which limits the address field to only 14 bits. Because instructions are word-aligned, the field counts 4-byte words, so a conditional branch can reach only about 32 KB in either direction. This is usable for small jumps within a function, but not much else. To conditionally call a function outside of this range, the code would need to do a conditional branch to an instruction containing an unconditional branch to the right location.
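The B-Form layout can be sketched in C in the same spirit. The function names encode_bc and bc_offset_in_range are hypothetical, and the sketch assumes the primary opcode 16 for bc. Note that the 14-bit address field counts 4-byte instructions, so the range check below is expressed in bytes.

```c
#include <stdint.h>

/* Hypothetical helper: can this byte displacement be encoded in the
   14-bit B-Form address field (which omits the two low zero bits)? */
int bc_offset_in_range(int32_t byte_offset)
{
    return byte_offset >= -32768 && byte_offset <= 32764
        && byte_offset % 4 == 0;
}

/* Hypothetical helper: build the instruction word for bc/bcl/bca/bcla. */
uint32_t encode_bc(unsigned bo, unsigned bi, int32_t byte_offset,
                   int absolute, int link)
{
    uint32_t insn = 16u << 26;                /* bits 0-5: primary opcode */

    insn |= (bo & 0x1Fu) << 21;               /* bits 6-10: BO field */
    insn |= (bi & 0x1Fu) << 16;               /* bits 11-15: BI field */
    insn |= (uint32_t)byte_offset & 0xFFFCu;  /* bits 16-29: address field */
    insn |= absolute ? 2u : 0u;               /* bit 30: addressing mode */
    insn |= link ? 1u : 0u;                   /* bit 31: link bit */
    return insn;
}
```

A caller would check bc_offset_in_range first and fall back to the branch-around-a-branch pattern described above when it returns 0.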

The basic forms of the conditional branch look like this:

bc BO, BI, address
bcl BO, BI, address
bca BO, BI, address
bcla BO, BI, address

In this basic form, BO and BI are numbers. Thankfully, the extended mnemonics (described in the first article) of the PowerPC instruction set come to the rescue again, so we don't have to memorize all of the field numbers and what they mean. As with unconditional branches, appending an l to the instruction name sets the link register, and appending an a makes the instruction use absolute addressing instead of relative addressing.

For a simple compare and branch if equal, the basic form (not using the extended mnemonics) looks like this:

Listing 4. Basic form of the conditional branch
#compare register 4 and 5
cmpd 4, 5
#branch if they are equal
bc 12, 2, address

bc stands for "branch conditionally." The 12 (the BO operand) means to branch if the given condition register bit is set, with no branch prediction hint, and 2 (the BI operand) is the bit of the condition register to test (it is the equal bit of cr0). Now, very few people, especially beginners, are going to be able to remember all of the branch code numbers and condition register bit numbers, nor would it be useful to try. The extended mnemonics make the code clearer for reading, writing, and debugging.

There are several different ways to specify the extended mnemonics. The way we will concentrate on combines the instruction name and the instruction's BO operand (specifying the mode). The simplest ones are bt and bf. bt branches if the given bit of the condition register is true, and bf branches if the given bit of the condition register is false. In addition, the condition register bit can be specified with mnemonics as well. If you specify 4*cr3+eq this will test bit 2 of cr3 (the 4* is there because each field is four bits wide). The available mnemonics for each bit of the bit fields were given earlier in the description of the condition register. If you only specify the bit without specifying the field, the instruction will default to cr0.
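The expression 4*cr3+eq is ordinary arithmetic that the assembler evaluates into a BI operand. The C sketch below spells out that arithmetic and shows how the resulting bit number picks a bit out of the 32-bit condition register. The names bi_operand, cr_bit_is_set, and the CRB_* constants are hypothetical.

```c
#include <stdint.h>

/* Bit positions inside a condition register field. */
enum { CRB_LT = 0, CRB_GT = 1, CRB_EQ = 2, CRB_SO = 3 };

/* Compute the BI operand for a bit of a given field,
   mirroring the assembler expression 4*crN+bit. */
unsigned bi_operand(unsigned cr_field, unsigned bit)
{
    return 4 * cr_field + bit;
}

/* Test one bit of the condition register. Bit 0 is the most
   significant bit of the 32-bit register. */
int cr_bit_is_set(uint32_t cr, unsigned bi)
{
    return (cr >> (31 - bi)) & 1u;
}
```

So 4*cr3+eq evaluates to 14, and leaving the field off is the same as using field 0, which gives 2 for eq.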

Here are some examples:

Listing 5. Simple conditional branches
#Branch if the equal bit of cr0 is set
bt eq, where_i_want_to_go

#Branch if the equal bit of cr1 is not set
bf 4*cr1+eq, where_i_want_to_go

#Branch if the negative bit (mnemonic is "lt") of cr5 is set
bt 4*cr5+lt, where_i_want_to_go

Another set of extended mnemonics combines the instruction, the BO operand, and the condition bit (but not the field). These use what are more-or-less "traditional" mnemonics for various kinds of common conditional branches. For example, bne my_destination (branch to my_destination if not equal) is equivalent to bf eq, my_destination (branch to my_destination if the eq bit is false). To use a different condition register field with this set of mnemonics, simply specify the field in the operand before the target address, such as bne cr4, my_destination. These are the branch mnemonics following this pattern: blt (less than), ble (less than or equal), beq (equal), bge (greater than or equal), bgt (greater than), bnl (not less than), bne (not equal), bng (not greater than), bso (summary overflow), bns (not summary overflow), bun (unordered, floating-point specific), and bnu (not unordered, floating-point specific).
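To show how these mnemonics relate to the basic form, the following C sketch maps each one to a BO value and a condition bit, and then computes the BI operand for a chosen field. The table and the function expand_branch are hypothetical names for this illustration. It assumes the BO values 12 (branch if the bit is true) and 4 (branch if the bit is false), with no prediction hint.

```c
#include <string.h>

/* One row per traditional mnemonic: BO value and bit within the field
   (0 = lt, 1 = gt, 2 = eq, 3 = so/unordered). */
struct branch_form {
    const char *mnemonic;
    unsigned bo;
    unsigned bit;
};

static const struct branch_form branch_forms[] = {
    {"blt", 12, 0}, {"ble", 4, 1}, {"beq", 12, 2}, {"bge", 4, 0},
    {"bgt", 12, 1}, {"bnl", 4, 0}, {"bne", 4, 2},  {"bng", 4, 1},
    {"bso", 12, 3}, {"bns", 4, 3}, {"bun", 12, 3}, {"bnu", 4, 3},
};

/* Look up a mnemonic and compute the operands of the equivalent bc.
   Returns 1 on success, 0 if the mnemonic is not in the table. */
int expand_branch(const char *mnemonic, unsigned cr_field,
                  unsigned *bo, unsigned *bi)
{
    size_t count = sizeof branch_forms / sizeof branch_forms[0];

    for (size_t i = 0; i < count; i++) {
        if (strcmp(mnemonic, branch_forms[i].mnemonic) == 0) {
            *bo = branch_forms[i].bo;
            *bi = 4 * cr_field + branch_forms[i].bit;
            return 1;
        }
    }
    return 0;
}
```

Looking up bne with field 4 gives BO 4 and BI 18, so bne cr4, my_destination corresponds to bc 4, 18, my_destination.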

All of the mnemonics and extended mnemonics can have l and/or a affixed to them to enable the link register or absolute addressing, respectively.

Using the extended mnemonics allows a much more readable and writable programming style. For the more advanced conditional branches, the extended mnemonics are more than just helpful, they are essential.

Additional condition register features

Because the condition register has multiple fields, different computations and comparisons can use different fields, and then logical operations can be used to combine the conditions together. All of the logical operations have the following form: cr<opname> target_bit, operand_bit_1, operand_bit_2. For example, to do a logical and on the eq bit of cr2 and the lt bit of cr7, and have it stored in the eq bit of cr0, you would write: crand 4*cr0+eq, 4*cr2+eq, 4*cr7+lt.
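A small C model may help show what crand does to the condition register as a whole. The function name model_crand is hypothetical. Bit numbers follow the same convention as the BI operand, with bit 0 as the most significant bit of the 32-bit register.

```c
#include <stdint.h>

/* Model of crand: AND two condition register bits and store the
   result in a third bit, leaving every other bit unchanged. */
uint32_t model_crand(uint32_t cr, unsigned target_bit,
                     unsigned operand_bit_1, unsigned operand_bit_2)
{
    uint32_t a = (cr >> (31 - operand_bit_1)) & 1u;
    uint32_t b = (cr >> (31 - operand_bit_2)) & 1u;
    uint32_t target_mask = 1u << (31 - target_bit);

    if (a & b)
        return cr | target_mask;   /* both set: set the target bit */
    return cr & ~target_mask;      /* otherwise: clear the target bit */
}
```

For the example above, the operands are bit 2 (4*cr0+eq), bit 10 (4*cr2+eq), and bit 28 (4*cr7+lt). The target bit is overwritten in both cases, so a stale eq in cr0 does not survive a false result.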

You can move around condition register fields using mcrf. To copy cr4 to cr1 you would write mcrf cr1, cr4.

The branch instructions can also give hints to the branch processor for branch prediction. On most conditional branch instructions, appending a + to the instruction will signal to the branch processor that this branch will probably be taken. Appending a - to the instruction will signal that this branch will probably not be taken. However, this is usually not necessary, as the branch processor in the POWER5 CPU is usually able to do branch prediction quite well.

Using the count register

The count register is a special-purpose register used for a loop counter. The BO operand of the conditional branch (controlling the mode) can be used, in addition to specifying how to test condition register bits, to decrement and test the count register. There are two operations you can do with the count register:

  • decrement the count register and branch if it becomes zero
  • decrement the count register and branch if it becomes nonzero

These count register operations can either be used on their own or in conjunction with a condition register test.

In the extended mnemonics, the count register semantics are specified by adding either dz or dnz immediately after the b. Any additional condition or instruction modifier is added after that. So, to have a loop repeat 100 times, you would load the count register with the number 100, and use bdnz to control the loop. Here is how the code would look:

Listing 6. Counter-controlled loop example
#The count register has to be loaded through a general-purpose register
#Load register 31 with the number 100
li 31, 100
#Move it to the count register
mtctr 31

#Loop start address
loop_start:

###loop body goes here###

#Decrement count register and branch if it becomes nonzero
bdnz loop_start

#Code after loop goes here

You can also combine the counter test with other tests. For instance, a loop might need to have an early exit condition. The following code demonstrates an early exit condition when register 24 is equal to register 28.

Listing 7. Count register combined branch example
#The count register has to be loaded through a general-purpose register
#Load register 31 with the number 100
li 31, 100
#Move it to the count register
mtctr 31

#Loop start address
loop_start:

###loop body goes here###

#Check for early exit condition (reg 24 == reg 28)
cmpd 24, 28

#Decrement and branch if not zero, and also test for early exit condition
bdnzf eq, loop_start

#Code after loop goes here

So, rather than having to add an additional conditional branch instruction, all that is needed is the comparison instruction, and the conditional branch is merged into the loop counter branch.
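The decrement-then-test order is easy to get wrong when reasoning about loop counts, so here is a hedged C model of the two branches used above. The function names bdnz_taken, bdnzf_taken, and count_iterations are hypothetical. The model assumes a 64-bit count register and a loop whose body sits above the branch, as in Listings 6 and 7.

```c
#include <stdint.h>

/* Model of bdnz: decrement CTR, then branch if the new value is nonzero. */
int bdnz_taken(uint64_t *ctr)
{
    *ctr -= 1;
    return *ctr != 0;
}

/* Model of bdnzf: decrement CTR, then branch only if the new value is
   nonzero AND the tested condition register bit is false. */
int bdnzf_taken(uint64_t *ctr, int cr_bit)
{
    *ctr -= 1;
    return *ctr != 0 && !cr_bit;
}

/* Count how many times the body of a bdnz-controlled loop runs.
   initial_ctr must be at least 1; a count of 0 would wrap around. */
uint64_t count_iterations(uint64_t initial_ctr)
{
    uint64_t ctr = initial_ctr;
    uint64_t iterations = 0;

    do {
        iterations++;              /* stands in for the loop body */
    } while (bdnz_taken(&ctr));

    return iterations;
}
```

Loading the count register with 100 therefore runs the body exactly 100 times. Note that bdnzf still decrements the count register on the iteration where the early exit condition ends the loop.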

Putting it together

Now we will put this information to practical use.

The first program is a rewrite of the maximum value program from the first article, updated according to what we have learned. The first version used a register to hold the current address being read from, and the code used indirect addressing to load the value. This version uses an indexed-indirect addressing mode, with a register for the base address and a register for the index. In addition, rather than the index starting at zero and going forward, the index counts from the end to the beginning in order to save an extra compare instruction. The decrement can implicitly set the condition register (as opposed to an explicit compare with zero), which can then be used by a conditional branch instruction. Here is the new version (enter as max_enhanced.s):

Listing 8. Maximum value program enhanced version
.data
.align 3

value_list:
   .quad 23, 50, 95, 96, 37, 85
value_list_end:

#Compute a constant holding the size of the list
.equ value_list_size, value_list_end - value_list

.section .opd, "aw"
.global _start
.align 3
_start:
   .quad ._start, .TOC.@tocbase, 0

.text
._start:
   .equ DATA_SIZE, 8

   #Register 3 -- current maximum
   #Register 4 -- list address
   #Register 5 -- current index
   #Register 6 -- current value
   #Register 7 -- size of data (negative)

   #Load the address of the list
   ld 4, value_list@got(2)
   #Register 7 has data size (negative)
   li 7, -DATA_SIZE
   #Load the size of the list
   li 5, value_list_size
   #Set the "current maximum" to 0
   li 3, 0

loop:
   #Decrement index to the next value; set status register (in cr0)
   add. 5, 5, 7

   #Load value (X-Form - add register 4 + register 5 for final address)
   ldx 6, 4, 5

   #Unsigned comparison of current value to current maximum (use cr2)
   cmpld cr2, 6, 3

   #If the current one is greater, set it (sets the link register)
   btl 4*cr2+gt, set_new_maximum

   #Loop unless the last index decrement resulted in zero
   bf eq, loop

   #AFTER THE LOOP -- exit
   li 0, 1
   sc

set_new_maximum:
   mr 3, 6
   #Return using the link register
   blr

Assemble, link, and execute as before:

as -a64 max_enhanced.s -o max_enhanced.o
ld -melf64ppc max_enhanced.o -o max_enhanced
./max_enhanced

The loop in this program is approximately 15% faster than the loop in the first article because (a) we've shaved off several instructions from the main loop by using the status register to detect the end of the list when we decrement register 5 and (b) the program is using different condition register fields for the comparison (so that the result of the decrement can be held for later).

Note that using the link register in the call to set_new_maximum is not strictly necessary. It would have worked just as well to set the return address explicitly rather than using the link register. However, this gives a good example of link register usage.

A quick introduction to simple functions

The PowerPC ABI is fairly complex, and will be covered in much greater detail in the next article. However, for functions which do not themselves call any functions and follow a few easy rules, the PowerPC ABI provides a greatly simplified function-call mechanism.

In order to qualify for the simplified ABI, your function must obey the following rules:

  • It must not call any other function.
  • It may only modify registers 3 through 12.
  • It may only modify condition register fields cr0, cr1, cr5, cr6, and cr7.
  • It must not alter the link register, unless it restores it before calling blr to return.

When functions are called, parameters are sent in registers, starting with register 3 and going through register 10, depending on the number of parameters. When the function returns, the return value must be stored in register 3.

So let's rewrite our maximum value program as a function, and call it from C.

The parameters we should pass are the pointer to the array as the first parameter (register 3), and the size of the array as the second parameter (register 4). Then, the maximum value will be placed into register 3 for the return value.

So here is our program, reformulated as a function (enter as max_function.s):

Listing 9. The maximum value program as a function
#Functions require entry point declarations as well
.section .opd, "aw"
.global find_maximum_value
.align 3
find_maximum_value:
   .quad .find_maximum_value, .TOC.@tocbase, 0

.text
.align 3

#size of array members
.equ DATA_SIZE, 8

#function begin
.find_maximum_value:
   #Register 3 -- list address
   #Register 4 -- list size (elements)
   #Register 5 -- current index in bytes (starts as list size in bytes)
   #Register 6 -- current value
   #Register 7 -- current maximum
   #Register 8 -- size of data (negative)

   #Register 3 and 4 are already loaded -- passed in from calling function
   li 8, -DATA_SIZE
   #Extend the number of elements to the size of the array
   #(shifting to multiply by 8)
   sldi 5, 4, 3

   #Set current maximum to 0
   li 7, 0

loop:
   #Go to next value; set status register (in cr0)
   add. 5, 5, 8

   #Load Value (X-Form - adds reg. 3 + reg. 5 to get the final address)
   ldx 6, 3, 5

   #Unsigned comparison of current value to current maximum (use cr7)
   cmpld cr7, 6, 7

   #if the current one is greater, set it
   bt 4*cr7+gt, set_new_maximum
set_new_maximum_ret:
   #Loop unless the last index decrement resulted in zero
   bf eq, loop

   #Move result to return value
   mr 3, 7
   #Return to the caller
   blr

set_new_maximum:
   mr 7, 6
   b set_new_maximum_ret

This is very similar to the earlier version, with the main exceptions being:

  • The initial conditions are passed through parameters instead of hardcoded.
  • The register usage within the function was modified to match the layout of the passed parameters.
  • The extraneous usage of the link register for set_new_maximum was removed in order to preserve the link register's contents.

The C language data type the program is working with is unsigned long long. This is quite cumbersome to write, so it would be better to typedef this as something like uint64. Then, the prototype for the function would be:

uint64 find_maximum_value(uint64 value_list[], uint64 num_values);

Here is a short driver program to test our new function (enter as use_max.c):

Listing 10. Simple C program using the maximum value function
#include <stdio.h>

typedef unsigned long long uint64;

uint64 find_maximum_value(uint64[], uint64);

int main(void) {
    uint64 my_values[] = {2364, 666, 7983, 456923, 555, 34};
    uint64 max = find_maximum_value(my_values, 6);
    printf("The maximum value is: %llu\n", max);
    return 0;
}

To compile and run this program, simply do:

gcc -m64 use_max.c max_function.s -o maximum
./maximum

Notice that since we are actually doing formatted printing now instead of returning the value to the shell, we can make use of the entire 64-bit size of the array elements.
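When testing an assembly function like this one, it can help to have a plain C version to compare results against. The following is a minimal sketch of such a reference, not a translation that any compiler would produce. The name find_maximum_value_ref is hypothetical, chosen so that it does not clash with the assembly symbol. Like the assembly version, it walks the array from the end to the beginning and compares values as unsigned.

```c
typedef unsigned long long uint64;

/* Hypothetical C reference for cross-checking the assembly function.
   Unlike the assembly loop, this version tests the index before the
   first load, so an empty list simply returns 0. */
uint64 find_maximum_value_ref(uint64 value_list[], uint64 num_values)
{
    uint64 max = 0;
    uint64 index = num_values;

    while (index != 0) {
        index--;                          /* count down toward element 0 */
        if (value_list[index] > max)      /* unsigned comparison */
            max = value_list[index];
    }
    return max;
}
```

Calling both functions on the same array in the driver program and comparing the two results is a quick way to catch mistakes in register usage.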

Simple function calls are very cheap as far as performance goes. The simplified function call ABI is fully standard, and provides an easy way to get started writing mixed-language programs that require the speed of custom assembly language in their core loops, and the expressiveness and ease of use of higher-level languages for the rest.


Knowing the ins and outs of the branch processor helps to write more efficient PowerPC code. Using the various condition register fields enables the programmer to save and combine conditions in interesting ways. Using the count register helps code efficient loops. Simple functions can enable even the novice programmer to write useful assembly language functions for use by a higher-level language program.

In the next article, I'll cover the PowerPC ABI for function calls and explain how the stack works on PowerPC platforms.
