Assembly language for Power Architecture, Part 4

Function calls and the PowerPC 64-bit ABI

Create functions that can be shared with other programs

The simplified ABI

The previous article, "Programming with the PowerPC branch processor," briefly discussed the "simplified" ABI. It lets you write functions that meet certain criteria with a minimum of fuss. The criteria a function must meet to use the simplified ABI are:

  • It must not call any other function.
  • It may modify only registers 3 through 12 (although see exceptions under Non-volatile register save areas, below).
  • It may modify only register fields cr0, cr1, cr5, cr6, and cr7.

There are a few additional restrictions if your code uses the PowerPC vector processing extensions as well, but that is beyond the scope of this article.

Interestingly, you need not declare in any way when you are using the simplified ABI, because it is a fully-compatible subset of the normal ABI for functions that do not need stack frames, discussed in the next section.

When a function is called using the PowerPC ABI, the caller passes the parameters to the function in registers. Register 3 has the first fixed-point parameter, register 4 has the second, and so on through register 10. Likewise, floating-point values are passed through the floating-point registers 1 through 13. When the function is completed, the return value is passed back through register 3, and the function exits using the blr instruction.

To demonstrate the simplified PowerPC ABI, let's look at a function that takes one parameter, squares it, and returns it. Here is the function in assembly language (enter as my_square.s):

Listing 1. Function to square a number using the simplified ABI
.section .opd, "aw"
.align 3

.global my_square
my_square:   #this is the name of the function as seen 
   .quad .my_square, .TOC.@tocbase, 0
#Tell the linker that this is a function reference
.type my_square, @function

#Place the code itself in the text section
.section .text
.my_square:  #This is the label for the code itself (referenced in the "opd")
   #Parameter 1 -- number to be squared -- in register 3

   #Multiply it by itself, and store it back into register 3
   mulld 3, 3, 3

   #The return value is now in register 3, so we just need to leave
   blr

Previously, you were using the .opd section for declaring the program's entry point, but here you're also using it to declare a function. These are called official procedure descriptors, and they contain the information the linker needs to combine position-independent code from different shared object files together. The most important field is the first one, which is the address of the start of the code for the procedure. The second field is the TOC pointer used for the function. The third field is an environment pointer for languages that use one, but is normally just set to zero. Notice that the only symbol definition that is exported globally is the official procedure descriptor.

The C language prototype for this function is:

Listing 2. C prototype for number-squaring function
typedef long long int64;
int64 my_square(int64 val);

Here is the C code for using the function (enter as my_square_tester.c):

Listing 3. C code for calling the my_square function
#include <stdio.h>

/* make declarations easier to write */
typedef long long int64; 

int64 my_square(int64);

int main() {
    int64 a = 32;
    printf("The square of %lld is %lld.\n", a, my_square(a));
    return 0;
}

The simple way to compile and run this code is to do the following:

Listing 4. Compiling and running my_square_tester
gcc -m64 my_square.s my_square_tester.c -o my_square_tester
./my_square_tester

The -m64 flag tells the compiler to use 64-bit instructions, compile using the 64-bit ABI and libraries, and use the 64-bit ABI for linking. It then takes care of all of the linking issues for you (and there are several -- you can see the full linking command line by appending -v to the command line).

As you can see, writing functions using the simplified PowerPC ABI is very straightforward. The issues come in when the functions don't meet these criteria.

The stack

Now let's get into the more complicated parts of the ABI. The most important part of any ABI is the details of how to make use of the stack, which is the area of memory that holds local function data.

The need for a stack

The best way to see why stacks are needed is to look at recursive functions. For simplicity, let's look at the recursive implementation of the factorial function:

Listing 5. Factorial function
typedef long long int64;
int64 factorial(int64 num) {
      //BASE CASE    
      if (num == 0) {
         return 1;
      } else {
         return num * factorial(num - 1);
      }
}

This may be easy enough to understand conceptually, but let's examine it concretely. What is going on here? What happens, for instance, if you try to find the value of the factorial of 4? Let's follow the sequence:

First, the function is called, and num is set equal to 4. Then, because num is greater than 0, factorial is called again, this time with 3. In the new call to factorial, num is set to 3. However, this num refers to a different memory location than the previous one, even though both share the same name and the same code. This is because each time a function is called, it has an activation record (also called a stack frame) associated with it. The activation record contains all of the call-specific data for the function, including parameters and local variables. This is how recursive functions keep from trashing the values of the variables in other, active function calls. Each call gets its own activation record, so each time the function is called, its variables get their own storage space within that activation record. Only when the function call is completely finished is the space for the activation record released for reuse (more on this later).

So, with 3 as the value of num, we go through the function again, then with 2, then with 1, then with 0. However, with 0, the function has reached its base case. The base case is the point where it ceases to call itself, and instead returns. So, with 0 as num, it returns 1 as the result. The previous function call picks up where it left off (calling factorial(0)) and multiplies the result, 1, with the value in its own num, also 1. This is returned, and the next function waiting is reactivated. This one multiplies the result, 1, with its value of num, which is 2, and the result, 2, is then returned. The next waiting function call is then reactivated, and the previous result is multiplied by this function's value of num, which is 3, resulting in 6. This number is returned to our original function, whose value of num is 4. This is multiplied with the previous result to get 24.

As you can see, each time a function calls another function, its own values and state are suspended while the next function invocation occurs. This is true for all functions, not just recursive ones. If that function again calls other functions, its state is likewise suspended. When a function returns, the function that called it is revived and it continues from there. So, as we progress, the "live" function calls stack up on top of each other with each function call, and then are removed from the stack with every function return. The result looks like this (factorial will be abbreviated as fac):

  1. fac(4) [active]
  2. fac(4) [suspended], fac(3) [active]
  3. fac(4) [suspended], fac(3) [suspended], fac(2) [active]
  4. fac(4) [suspended], fac(3) [suspended], fac(2) [suspended], fac(1) [active]
  5. fac(4) [suspended], fac(3) [suspended], fac(2) [suspended], fac(1) [suspended], fac(0) [active]
  6. fac(4) [suspended], fac(3) [suspended], fac(2) [suspended], fac(1) [active]
  7. fac(4) [suspended], fac(3) [suspended], fac(2) [active]
  8. fac(4) [suspended], fac(3) [active]
  9. fac(4) [active]

As you can see, the suspended function activation records "stack up", and then, when each function returns, it gets taken off of the stack.

The stack layout

To implement this idea, a range of memory is allocated for each program called the program stack. All PowerPC programs start off with a pointer to this stack in register 1. In the PowerPC ABI, register 1 always points to the top of the stack. This makes it easy for functions to know where their activation record is -- they are simply defined in terms of the stack pointer. If a function is executing, then the stack pointer is pointing to the top of the whole stack, which is also the top of that function's activation record. Because activation records are implemented on a stack, they are often referred to as stack frames, but both terms are equivalent.

Now, when the "top of the stack" is referred to, that is a conceptual designation. Physically, in memory, the stack grows downward, from large-numbered memory addresses to small-numbered ones. Therefore, register 1 will have a pointer to the conceptual top of the stack, and references to stack positions that have positive offsets will actually be below the top of the stack conceptually, and negative offsets will be conceptually above. So, 0(1) refers to the conceptual top of the stack, 4(1) refers to four bytes down from the top (conceptually), 24(1) is even lower conceptually, and 100(1) is lower still.

Now that you understand how the stack looks conceptually and physically, let's look at what exactly the individual stack frames hold. Here is the layout of the stack according to the 64-bit PowerPC ABI, from a physical memory standpoint (stack offsets, where given, refer to the beginning of this location in memory):

Table 1. Stack frame layout
Contains                                        Size                       Beginning stack offset
Floating point non-volatile register save area  Varies                     Varies
General non-volatile register save area         Varies                     Varies
VRSAVE                                          4 bytes                    Varies
Alignment padding                               4 or 12 bytes              Varies
Vector non-volatile register save area          Varies                     Varies (must be quadword-aligned)
Local variable storage                          Varies                     Varies
Parameters for function calls                   Varies (minimum 64 bytes)  48(1)
TOC save area                                   8 bytes                    40(1)
Link editor area                                8 bytes                    32(1)
Compiler area                                   8 bytes                    24(1)
Link Register save area                         8 bytes                    16(1)
Condition Register save area                    8 bytes                    8(1)
Pointer to top of previous stack frame          8 bytes                    0(1)

I won't concern you with the floating point, VRSAVE, Vector, or alignment space. Those topics deal with floating point and vector processing and are outside the scope of this article. All stack values must be doubleword (8-byte) aligned, and the whole frame should be quadword (16-byte) aligned. All parameters must be doubleword-aligned.

Now, let's look at what each part of the stack frame does.

Non-volatile register save areas

The first part of the stack frame is the non-volatile register save area. Registers in the PowerPC ABI are divided into three basic classes: dedicated, volatile, and non-volatile. Dedicated registers are registers that have a predefined, permanent function, like the stack pointer (register 1) and the TOC pointer (register 2). Registers 3-12 are volatile registers, which means that any function can modify them freely without having to restore their previous value. However, this means that any time a function calls another function, it should assume that registers 3-12 will be overwritten by that function.

On the other hand, registers 13 and above are considered non-volatile registers. This means that a function can use them provided their value is restored before returning from the function. Therefore, before using a non-volatile register in a function, its value must be saved in the function's stack frame, and then restored before the function returns. Likewise, a function may also assume that the values it assigns to non-volatile registers will not be modified (or at least will be restored) when it makes calls to other functions. A function may use as little or as much memory in this save area as needed.

Now you can see why our earlier rules for the simplified ABI required that only registers 3 through 12 be used: the others are non-volatile, and saving them requires stack space. However, the ABI has a way to work around this limitation. A function that does not call other functions is free to use the 288 bytes that are physically below the stack pointer. Therefore, functions using the simplified ABI actually can save, use, and restore non-volatile registers by using negative offsets from the stack pointer.

Local variable storage

The local variable storage area is a general-purpose area for saving function-specific data. Often this is not needed because of the large number of registers available for use in the PowerPC architecture. However, this space is often used for local arrays. This area can be any size needed by the function.

Parameters for function calls

Function parameters are handled a little differently from other local data. The PowerPC ABI actually puts the storage space for the function parameters in the calling function's stack space. Now, as you saw earlier, function calls actually pass their parameters through registers. However, space must still be reserved for parameters in case the values need to be saved, especially since the parameters are passed using volatile registers. This space is also used for overflow: if there are more parameters than registers available for use, then they need to go in the stack space. Since this parameter area is shared by all functions called from the current one, when a function sets up its stack space, it has to reserve space for the largest number of parameters it will use in a function call.

So that a function can know where its parameters are, parameters are stored from the bottom of memory to the top. The first parameter is in 48(1), while the second parameter is in 56(1). This way, the function being called can know the exact offset of each parameter, no matter how big the parameter list area is. Remember, the parameter list area is defined for all of the calls made by a function, and therefore will likely be bigger than necessary for any individual function call.

Now, since the save area for the parameters passed to a function is actually in the calling function's stack frame, when a function establishes its own stack frame, the offsets to the parameter list have to be adjusted to account for the function's own stack frame size. So, let's say that function func1 calls function func2 with three parameters, and func2 has a 112-byte stack frame. If func2 wants to access the memory for its first parameter, it would refer to it as 160(1), because it has to go past its own stack frame (112 bytes) and reach the first parameter in the caller's frame (48 bytes further).


Thankfully, functions rarely have to access their parameter save area because most parameters are passed by register, not in the parameter save area. However, space must be allocated for them even if there is nothing stored there. Functions must assume that for the first eight parameters, they are only passed by register, but they will still have a save area available if they need to be stored by the program. This space must also be a minimum of 64 bytes large.

TOC, link editor, and compiler areas

The TOC save area, compiler area, and linker area are all reserved for system use, and are not modified by programmers, but the programmer must reserve space for them.

Link register save area

The link register save area is different from the other parts of the ABI. When a function begins, it actually saves the link register in the calling function's stack frame, not its own, and then only if it needs to save it. Most functions that call other functions will need it, though.

Condition register save area

The condition register save area is needed if any of the non-volatile fields of the condition register are modified. The non-volatile fields are cr2, cr3, and cr4. The condition register should be saved in its area of the stack before any of these fields are modified, and then restored before returning.

Pointer to the previous stack frame

The final item in the stack frame is a pointer to the previous stack frame, often called the back pointer.

Writing a function that uses the stack

Functions create the stack frame during the beginning of the function (called the function prologue) and tear it down at the end of a function (called the function epilogue).

A function's prologue usually performs the following steps:

  1. Reserve stack space and save the old stack pointer, using stdu 1, -SIZE_OF_STACK(1) (where SIZE_OF_STACK is the size of the stack frame for this function). This will save the old stack pointer and allocate stack memory atomically.
  2. If this function will call another function, or use the link register in any way, save the link register. This is done with the instruction mflr 0 followed by a store into the link register save area of the function that called this one, using the instruction std 0, SIZE_OF_STACK+16(1).
  3. Save all non-volatile registers that will be used during this function (including the condition register, if any of its non-volatile fields will be used).

The function's epilogue follows the reverse sequence, restoring what had been saved, and then destroying the stack frame using ld 1, 0(1), which loads the previous stack pointer back into the stack pointer register.

Now, let's return to the function that we originally implemented without a stack and see what it would look like with one (enter as my_square.s and compile and run as before):

Listing 6. Function to square a number using a stack
.section .opd, "aw"
.align 3

.global my_square
my_square:   #this is the name of the function as seen
        .quad .my_square, .TOC.@tocbase, 0
.type my_square, @function

#Place the code itself in the text section
.section .text
.my_square:  #This is the label for the code itself (Referenced in the "opd")
        #Set up stack frame & back pointer (112 bytes -- minimum stack)
        stdu 1, -112(1)
        #Save LR (optional)
        mflr 0
        std 0, 128(1)
        #Save non-volatile registers (we don't have any)

        ##FUNCTION BODY##
   #Parameter 1 -- number to be squared -- in register 3
        mulld 3, 3, 3

   #The return value is now in register 3, so we just need to leave

        #Restore non-volatile registers (we don't have any)
        #Restore LR (not needed in this function, but here anyway)
        ld 0, 128(1)
        mtlr 0
        #Restore stack frame atomically
        ld 1, 0(1)
        #Return to the caller
        blr

That's exactly the same code as before, just wrapped with prologue and epilogue code. As mentioned, this code is simple enough that it doesn't need prologue and epilogue code and is perfectly fine using the simplified ABI. However, it is a good example of how to set up and tear down a stack frame.

Now, let's return to the factorial function. This function, since it calls itself, makes very good use of stack frames. Let's look at how the factorial function would work in assembly language (enter as factorial.s):

Listing 7. The factorial function in assembly language
.section .opd, "aw"
.align 3

.global factorial
factorial:
   .quad .factorial, .TOC.@tocbase, 0
.type factorial, @function

.section .text
.factorial:
   #Prologue
   #Reserve Space
   #48 (save areas) + 64 (parameter area) + 8 (local variable) = 120 bytes.
   #aligned to 16-byte boundary = 128 bytes
   stdu 1, -128(1)
   #Save Link Register
   mflr 0
   std 0, 144(1)

   #Function body

   #Base Case? (register 3 == 0)
   cmpdi 3, 0
   bt- eq, return_one

   #Not base case - recursive call
   #Save local variable
   std 3, 112(1)
   #NOTE - it could also have been stored in the parameter save area.
   #       parameter 1 would have been at 176(1)

   #Subtract One
   subi 3, 3, 1

   #Call the function (branch and set the link register to the return address)
   bl factorial
   #Linker word
   nop

   #Restore local variable (but to a different register -
   #register 3 is now the return value from the last factorial call)
   ld 4, 112(1)
   #Multiply by return value
   mulld 3, 3, 4
   #Result is in register 3, which is the return value register

factorial_return:
   #Epilogue
   #Restore Link Register
   ld 0, 144(1)
   mtlr 0
   #Restore stack
   ld 1, 0(1)
   #Return to the caller
   blr

return_one:
   #Set return value to 1
   li 3, 1
   b factorial_return

To test it from C, enter the following (enter as factorial_caller.c):

Listing 8. Program to call factorial function
#include <stdio.h>
typedef long long int64;
int64 factorial(int64);

int main() {
    int64 a = 10;
    printf("The factorial of %lld is %lld\n", a, factorial(a));
    return 0;
}

Compile and run as follows:

Listing 9. Compiling and running factorial
gcc -m64 factorial.s factorial_caller.c -o factorial
./factorial

There are a few features of this factorial function that are interesting. First of all, we are making use of the local variable storage space. We are saving the current parameter in 112(1). Now, since this is a function parameter, we could have saved an extra doubleword of stack space and stored it in the caller's parameter area.

Another interesting thing in the program is the nop instruction after the function call. That is required by the ABI. That extra instruction allows the linker to insert additional code if necessary during the linking process. For example, if you have a program that has enough symbols to warrant multiple TOCs (TOCs were discussed in "Assembly language for Power Architecture, Part 2: The art of loading and storing on PowerPC"), the linker will emit an instruction (or multiple instructions using a branch) to swap around TOCs for you.

Finally, notice that the branch target for the function call is not the code that starts it, but the .opd entry point descriptor. The linker will take care of converting this to point to the correct code. However, this will let the linker know additional information about the function, including which TOC it is using, so it can emit the code to swap these around if necessary.

Creating dynamic libraries

Now that you know how to make functions, you can put them together into a library. You don't need to write any additional code; you just need to compile it all together. To combine the factorial and my_square functions into a single library (let's call it libmymath.so), just enter the following:

Listing 10. Compiling shared libraries
gcc -m64 -shared factorial.s my_square.s -o libmymath.so

This instructs the compiler to produce a shared object called libmymath.so. To link this into executables, you need to enable both the compile-time linker and the run-time dynamic linker to find it. To compile the factorial calling function to use the shared object, compile and run like this:

Listing 11. Using the shared library
#-L tells what directories to search, -l tells what libraries to find
gcc -m64 factorial_caller.c -o factorial -L. -lmymath
#Tell the dynamic linker what additional directories to search
export LD_LIBRARY_PATH=.
#Run the program
./factorial

Of course, you can get rid of all of those directory flags if the library is installed in a standard library location.

As mentioned in "Assembly language for Power Architecture, Part 2: The art of loading and storing on PowerPC," the TOC, or table of contents, of an application only has 64KB worth of space for holding global data references. So, what happens when several shared objects are loaded into the same application space and the table of contents gets too big? This is what the .TOC.@tocbase reference is for in the official procedure descriptor. The linker can manage several TOCs in a single application. The .TOC.@tocbase instructs the linker to put the address of the TOC for that function in that spot. Then, when the linker is setting up references to functions, it compares the TOC of the current function to the TOC of the function it's calling. If they are the same, it leaves the call alone. If they are different, it actually modifies your code to swap TOC references on function call and return. This is one of the main reasons for the official procedure descriptors, and also one of the main reasons for the extra nop instruction that follows a function call. Because of this, you never have to worry about running out of global symbol space from linking in too many shared objects.


The simplified 64-bit ABI is a breeze to use in programs, and the full ABI is not much harder. The most difficult part is determining the different offsets of the different parts of the stack frame, knowing where each piece should go, and what size it should be.

Creating reusable libraries in assembly language is fast and easy. To convert functions that use the 64-bit ABI into shared libraries, all that is needed is a few extra compiler flags and you're ready to go.

Hopefully, this series of articles has demonstrated the ease and power of PowerPC programming. Perhaps in your next project, you'll consider tapping the full resources of the POWER5 chip by using its assembly language!
