High-level languages offer great advantages in general by hiding many mundane and repetitive details from programmers, allowing them to concentrate on their goals. However, sometimes programmers must use a lower-level language, such as when writing code that deals directly with hardware or that is extremely performance sensitive. Assembly language is the programming language closest to the hardware, which makes it a natural last resort in such situations.
This article assumes a basic understanding of computer design (for example, you should know that a processor has registers and can access memory) and of operating systems (system calls, exceptions, process stacks). This article should be useful to PowerPC programmers unfamiliar with assembly as well as programmers who already know ia32 assembly and want to broaden their horizons.
The PowerPC Architecture Specification, released in 1993, is a 64-bit specification with a 32-bit subset. Almost all PowerPCs generally available (with the exception of late-model IBM RS/6000 and all IBM pSeries high-end servers) are 32-bit.
PowerPC processors have a wide range of implementations, from high-end server CPUs such as the Power4 to the embedded CPU market (the Nintendo Gamecube uses a PowerPC). PowerPC processors have a strong embedded presence because of good performance, low power consumption, and low heat dissipation. The embedded processors, in addition to integrating I/O such as serial and Ethernet controllers, can differ significantly from the "desktop" CPUs. For example, the 4xx series PowerPC processors lack floating point and use a software-controlled TLB for memory management rather than the inverted page table found in desktop chips.
PowerPC processors have 32 (32- or 64-bit) GPRs (General Purpose Registers) and various others such as the PC (Program Counter, also called the IAR/Instruction Address Register or NIP/Next Instruction Pointer), LR (link register), CR (condition register), etc. Some PowerPC CPUs also have 32 64-bit FPRs (floating point registers).
PowerPC architecture is an example of a RISC (Reduced Instruction Set Computing) architecture. As a result:
- All PowerPCs (including 64-bit implementations) use fixed-length 32-bit instructions.
- The PowerPC processing model is to retrieve data from memory, manipulate it in registers, then store it back to memory. There are very few instructions (other than loads and stores) that manipulate memory directly.
Application binary interfaces (ABIs)
Technically, a developer can use any GPR for anything. For example, there is no "stack pointer register"; a programmer could use any register for that purpose. In practice, it is useful to define a set of conventions so that binary objects can interoperate with different compilers and pre-written assembly code.
Calling conventions are determined by the ABI (Application Binary Interface) used. ppc32 Linux and NetBSD implementations use the SVR4 (System V R4) ABI, but ppc64 Linux follows AIX and uses the PowerOpen ABI. The ABI specifies which registers are considered volatile (caller-save) and non-volatile (callee-save) when calling subroutines, and a lot more.
Some concrete examples of behavior specified by the SVR4 ABI:
- Since the PowerPC has so many GPRs (32 compared to ia32's 8), arguments are passed in registers starting with gpr3.
- Registers gpr3 through gpr12 are volatile (caller-save) registers that (if necessary) must be saved before calling a subroutine and restored after returning.
- Register gpr1 is used as the stack frame pointer.
Many of the SVR4 features are identical to the PowerOpen ABI, which greatly aids interoperability.
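As a rough illustration of the argument-passing convention, consider a small C function. The register assignments in the comment assume the usual SVR4 rule for integer arguments (arguments in gpr3 upward, result returned in gpr3); the function name is made up for this sketch.

```c
/* Hypothetical example: under the SVR4 ABI, a compiler would normally
 * receive a in gpr3, b in gpr4, and c in gpr5, and hand the result
 * back to the caller in gpr3. */
int weighted_sum(int a, int b, int c)
{
    return a + 2 * b + 3 * c;
}
```

Because gpr3 through gpr5 are volatile, a caller that still needs values held there must save them before the call.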
All the pros and cons listed in the "Assembly HOWTO" (see Resources for a link) apply to PowerPC.
Sometimes you must touch CPU registers that higher-level languages are
completely unaware of. This is especially true in the course of writing an
operating system. One simple example is assigning your code its own stack
-- on a PowerPC, you must set r1. A C compiler
will only increment or decrement r1, so if your
application is running directly on the hardware, you must set r1 before calling C code. Another example is an
operating system's exception handlers, which must carefully save and
restore state one register at a time until it's safe to call higher-level
code.
Nonetheless, when faced with a situation in which you must use low-level hardware features, you should implement as little as possible in assembly:
- C code is portable and understood by a large number of developers; assembly code (especially PowerPC assembly) is not.
- Higher-level code is frequently much easier to debug than assembly.
- Higher-level code is by definition more expressive than assembly; in other words you can do more with less code (and in less time).
If you find yourself writing high-level constructs such as loops or C structures in assembly, take a step back and consider if this could be done more easily in another language. A general rule is to use just enough assembly to allow you to use a higher-level language.
One of the most common reasons people want to use assembly language is to make a slow program run faster. But in these cases, assembly should be the absolute last place you turn.
General advice on optimization is beyond the scope of this document, but here are some places to start:
- Profile
You must profile your code before starting any optimization work. Not only will this tell you where the hotspots are (they're frequently not where you expect!), it will also give you proof that you've sped anything up once you're done. Once you find hotspots, you can begin optimizing the high-level code (rather than attempting to rewrite it in assembly).
- Algorithmic optimization
No matter how tight your assembly is, if you're using an n^4 algorithm, you're still going to be incredibly slow. Before reaching for assembly, try techniques such as choosing a more appropriate data structure. If you iterate repeatedly over a linked list, think about using a hash table, binary tree, or whatever is appropriate for your application.
Your compiler can almost always do a much better job than you can at
writing assembly! Rather than attempting to rewrite high-level code in
assembly, make judicious use of optimization options such as -O3 and C directives like __inline__. The compiler is aware of tricks like
instruction scheduling, which considers the internals of the processor and
tries to keep all pipelines full at all times. That may involve moving
loads earlier in the instruction stream than required to keep the pipeline
from stalling as the CPU waits for memory accesses to catch up. Unless
you've been coding assembly for many years, these are tasks that most
people cannot correctly perform by hand.
gcc is the best place to start learning assembly (for any architecture).
gcc -O3 -S file.c will produce file.s in gas-compilable format (gas is the
GNU Assembler). Open file.s in your favorite
editor and you can see the assembly output from your C code.
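A small file to try this on might look like the following sketch. The name file.c comes from the command above; the function names are invented for illustration, and __inline__ is the gcc keyword mentioned earlier.

```c
/* file.c: a tiny translation unit to feed to gcc -O3 -S */

/* __inline__ asks gcc to fold this helper into its callers. */
static __inline__ int square(int x)
{
    return x * x;
}

/* Sum of the squares of the first n elements of a. */
int sum_of_squares(const int *a, int n)
{
    int total = 0;
    int i;

    for (i = 0; i < n; i++)
        total += square(a[i]);
    return total;
}
```

Comparing the file.s produced with -O0 and with -O3 is a quick way to see how much work the optimizer does on even a short loop like this one.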
You'll probably see instructions you don't understand. You can look
them up in The PowerPC Architecture: A Specification for a New Family of RISC
Processors, 2nd. Ed and PowerPC Microprocessor Family: The Programming Environments for
32-bit Microprocessors (see Resources for links to these documents). However, like
learning any (spoken) language, there are certain words that are important
and that you should know, and others that can be safely ignored until
you've figured out more important features of the code. A good example of
an important instruction is the branch family of instructions, such as
blr.
Listing 1 is copied directly from the gas example in the Assembly
HOWTO, which unfortunately is completely ia32-specific. It makes two direct
system calls: the first writes to stdout; the second exits the application
(with a return code of 0). It is very unusual
to make system calls directly; normally applications link with a libc
library, which wraps all the system calls.
Listing 1. ia32 assembly
.data # section declaration
msg:
.string "Hello, world!\n"
len = . - msg # length of our dear string
.text # section declaration
# we must export the entry point to the ELF linker or
.global _start # loader. They conventionally recognize _start as their
# entry point. Use ld -e foo to override the default.
_start:
# write our string to stdout
movl $len,%edx # third argument: message length
movl $msg,%ecx # second argument: pointer to message to write
movl $1,%ebx # first argument: file handle (stdout)
movl $4,%eax # system call number (sys_write)
int $0x80 # call kernel
# and exit
movl $0,%ebx # first argument: exit code
movl $1,%eax # system call number (sys_exit)
int $0x80 # call kernel
Listing 2 is a straightforward translation of the same code into PowerPC assembly.
Listing 2. PPC32 assembly
.data # section declaration - variables only
msg:
.string "Hello, world!\n"
len = . - msg # length of our dear string
.text # section declaration - begin code
.global _start
_start:
# write our string to stdout
li 0,4 # syscall number (sys_write)
li 3,1 # first argument: file descriptor (stdout)
# second argument: pointer to message to write
lis 4,msg@ha # load top 16 bits of &msg
addi 4,4,msg@l # load bottom 16 bits
li 5,len # third argument: message length
sc # call kernel
# and exit
li 0,1 # syscall number (sys_exit)
li 3,1 # first argument: exit code
sc # call kernel
PowerPC assembly requires a destination register for all register-to-register operations (because it is a RISC architecture). This register is always the first in the argument list.
Under PPC Linux, system calls are made with the syscall number in gpr0 and arguments beginning with gpr3. The syscall number, order of arguments, and
number of arguments may differ under other PowerPC operating systems
(NetBSD, Mac OS, etc.), which is one reason programmers typically make
system calls through a libc library (which handles the OS-specific
details).
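For comparison, here is a minimal sketch of the same write going through libc instead of a raw sc instruction. It assumes a POSIX system that provides write() in unistd.h; the function name say_hello is hypothetical.

```c
#include <unistd.h>

/* Write the greeting to file descriptor fd through the libc wrapper
 * for the write system call. Returns the number of bytes written,
 * or -1 on error. */
long say_hello(int fd)
{
    static const char msg[] = "Hello, world!\n";

    /* sizeof counts the terminating NUL, so subtract one. */
    return (long)write(fd, msg, sizeof msg - 1);
}
```

Calling say_hello(1) writes to stdout. The libc wrapper, not your code, decides which register carries the syscall number on each operating system.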
Register notation
PowerPC registers have numbers, not names. For the learner, this can
sometimes be confusing since literals aren't easily distinguishable from
registers. "3" could mean the value 3 or the
register gpr3, or floating point fpr3, or special purpose register spr3.
Get used to it. :)
Immediate instructions
li means "load immediate", which is a way of
saying "take this constant value known at compile time and store it in
this register". Another example of an immediate instruction is addi. For
example, addi 3,3,1 would increment the contents
of gpr3 by 1, then store the result back into gpr3. Contrast this with
add 3,3,1, which increments the contents of gpr3
by the contents of gpr1, storing the result back into gpr3.
Instructions ending in "i" are usually immediate instructions.
Mnemonics
li isn't really an instruction; it's actually a mnemonic. A mnemonic is a
bit like a preprocessor macro: it's an instruction that the assembler
will accept but secretly translate into other instructions. In this case,
li 3,1 is really defined as addi 3,0,1.
The sharp-eyed will notice that those instructions aren't
necessarily the same thing: addi is really adding 1 to the
contents of gpr0, storing the result into gpr3, right? That would
be true, except the PowerPC spec says gpr0 sometimes has a value, and
sometimes is read as 0, depending on the context. In this case (and the
addi description states this explicitly), the 0 means value 0 rather than
register gpr0.
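The difference between add, addi, and li can be sketched with a toy register file in C. This is only a model of the behavior described above on a 32-bit implementation; the type and function names are made up for illustration.

```c
#include <stdint.h>

/* Toy model of the 32 GPRs on a 32-bit implementation. */
typedef struct {
    uint32_t gpr[32];
} ppc_regs;

/* add rt,ra,rb: both operands are registers. */
void ppc_add(ppc_regs *r, int rt, int ra, int rb)
{
    r->gpr[rt] = r->gpr[ra] + r->gpr[rb];
}

/* addi rt,ra,si: an ra of 0 means the value 0, not the contents of gpr0. */
void ppc_addi(ppc_regs *r, int rt, int ra, int16_t si)
{
    uint32_t base = (ra == 0) ? 0 : r->gpr[ra];

    r->gpr[rt] = base + (uint32_t)(int32_t)si;
}

/* li rt,si is accepted by the assembler as addi rt,0,si. */
void ppc_li(ppc_regs *r, int rt, int16_t si)
{
    ppc_addi(r, rt, 0, si);
}
```

In this model, ppc_li(&r, 3, 1) leaves 1 in gpr3 no matter what gpr0 holds, while ppc_add(&r, 3, 3, 1) adds the contents of gpr1.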
Mnemonics shouldn't matter at all to anyone other than assembler
developers, but mnemonics can be confusing when you're looking at disassembly
output. However, GNU objdump -d is quite good
at displaying the original mnemonic rather than the instruction actually
present in the file. For example, objdump will display the mnemonic nop rather than ori 0,0,0
(the actual instruction used).
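To make the idea concrete, here is a minimal sketch of how a disassembler might choose a mnemonic over the underlying instruction. It handles only addi and ori, it is not how objdump is actually implemented, and the function name is hypothetical. The primary opcodes (14 for addi, 24 for ori) come from the PowerPC instruction set rather than from this article.

```c
#include <stdint.h>
#include <stdio.h>

/* Format one instruction word into out, preferring the mnemonics
 * li and nop where they apply. Unknown words print as raw data. */
void show_mnemonic(uint32_t insn, char *out, size_t outlen)
{
    unsigned opcode = insn >> 26;
    unsigned field1 = (insn >> 21) & 0x1f;  /* rt for addi, rs for ori */
    unsigned field2 = (insn >> 16) & 0x1f;  /* ra for both */
    unsigned imm = insn & 0xffff;
    /* addi treats its immediate as a signed 16-bit value */
    int simm = (imm & 0x8000) ? (int)imm - 0x10000 : (int)imm;

    if (insn == 0x60000000u)
        snprintf(out, outlen, "nop");                 /* ori 0,0,0 */
    else if (opcode == 14 && field2 == 0)
        snprintf(out, outlen, "li %u,%d", field1, simm);
    else if (opcode == 14)
        snprintf(out, outlen, "addi %u,%u,%d", field1, field2, simm);
    else if (opcode == 24)
        snprintf(out, outlen, "ori %u,%u,%u", field2, field1, imm);
    else
        snprintf(out, outlen, ".long 0x%08x", (unsigned)insn);
}
```

For example, the word 0x38600001 is formatted as li 3,1 even though the instruction in the file is addi 3,0,1.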
Loading pointers
The most interesting part of our Hello World example is how we load the
address of msg. As mentioned earlier, PowerPC uses fixed-length
32-bit instructions (in contrast to ia32, which uses variable-length
instructions). That 32-bit instruction is just a 32-bit integer. This
integer is divided into fields of different sizes:
Listing 3. addi machine code format
-----------------------------------------------------------
| opcode | dest register | src register | immediate value |
| 6 bits | 5 bits        | 5 bits       | 16 bits         |
-----------------------------------------------------------
The number of fields and their sizes will vary by instruction, but the
important point here is that these fields take up space in the
instruction. In the case of addi, after just those three fields are placed
into the instruction, there are only 16 bits left for the immediate value
you're adding!
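The field packing can be sketched in C. This assumes the layout used by the PowerPC architecture books for addi rt,ra,si, in which the destination register field immediately follows the 6-bit opcode and the source register field comes next; the function names are hypothetical.

```c
#include <stdint.h>

/* Build the 32-bit word for addi rt,ra,si.
 * Layout assumed: opcode (6 bits) | rt (5) | ra (5) | si (16). */
uint32_t encode_addi(unsigned rt, unsigned ra, int16_t si)
{
    const uint32_t opcode = 14;  /* primary opcode of addi */

    return (opcode << 26) |
           ((uint32_t)(rt & 0x1f) << 21) |
           ((uint32_t)(ra & 0x1f) << 16) |
           (uint16_t)si;
}

/* Nonzero if value fits in the signed 16-bit immediate field. */
int fits_addi_immediate(long value)
{
    return value >= INT16_MIN && value <= INT16_MAX;
}
```

With 6 + 5 + 5 bits already spent, anything outside the range -32768 to 32767 simply has no room in a single addi.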
That means that li can only load 16-bit immediates. You cannot
load a 32-bit pointer into a GPR with just one instruction. You must use two
instructions, loading first the top 16 bits and then the bottom. That is
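exactly the purpose of the @ha ("high") and @l ("low") suffixes. (The "a"
part of @ha compensates for sign extension: addi sign-extends its 16-bit
immediate, so @ha adds 1 to the high half whenever bit 15 of the low half
is set.) Conveniently, lis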
(meaning "load immediate shifted") will load directly into the high 16
bits of the GPR. Then all that's left to do is add in the lower bits.
This trick must be used whenever you load an absolute address (or any 32-bit immediate value). The most common use is in referencing globals.
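The arithmetic behind @ha and @l can be checked with a short C model. It is a sketch of what the assembler and the lis/addi pair compute for a 32-bit address; the function names are invented for illustration.

```c
#include <stdint.h>

/* sym@l: the low 16 bits of the address. */
uint16_t reloc_l(uint32_t addr)
{
    return (uint16_t)(addr & 0xffff);
}

/* sym@h: the plain high 16 bits. */
uint16_t reloc_h(uint32_t addr)
{
    return (uint16_t)(addr >> 16);
}

/* sym@ha: the high 16 bits, bumped by one when the low half will be
 * sign-extended to a negative number by addi. */
uint16_t reloc_ha(uint32_t addr)
{
    return (uint16_t)((addr + 0x8000u) >> 16);
}

/* Model of "lis rt,hi" followed by "addi rt,rt,lo" on a 32-bit CPU. */
uint32_t lis_addi(uint16_t hi, uint16_t lo)
{
    uint32_t r = (uint32_t)hi << 16;                      /* lis */
    uint32_t ext = (lo & 0x8000) ? (0xffff0000u | lo) : lo; /* sign-extend */

    return r + ext;                                       /* addi */
}
```

For the address 0x10018000, @l is 0x8000, which addi treats as -32768. Pairing it with @ha (0x1002) reconstructs the address, whereas pairing it with the plain high half (0x1001) would land 0x10000 too low.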
Hello World -- PPC64 assembly
Listing 4 is almost identical to the 32-bit PowerPC example (Listing 2) above. PowerPC was designed as a 64-bit specification with 32-bit implementations, and not only that, PowerPC user-level programs are more or less binary-compatible across those implementations. Under Linux, ppc32 binaries run perfectly well on 64-bit hardware (with a little munging here and there for variable types visible to both 32-bit userland and the 64-bit kernel).
Listing 4. PPC64 assembly
.data # section declaration - variables only
msg:
.string "Hello, world!\n"
len = . - msg # length of our dear string
.text # section declaration - begin code
.global _start
.section ".opd","aw"
.align 3
_start:
.quad ._start,.TOC.@tocbase,0
.previous
.global ._start
._start:
# write our string to stdout
li 0,4 # syscall number (sys_write)
li 3,1 # first argument: file descriptor (stdout)
# second argument: pointer to message to write
# load the address of 'msg':
# load high word into the low word of r4:
lis 4,msg@highest # load msg bits 48-63 into r4 bits 16-31
ori 4,4,msg@higher # load msg bits 32-47 into r4 bits 0-15
rldicr 4,4,32,31 # rotate r4's low word into r4's high word
# load low word into the low word of r4:
oris 4,4,msg@h # load msg bits 16-31 into r4 bits 16-31
ori 4,4,msg@l # load msg bits 0-15 into r4 bits 0-15
# done loading the address of 'msg'
li 5,len # third argument: message length
sc # call kernel
# and exit
li 0,1 # syscall number (sys_exit)
li 3,1 # first argument: exit code
sc # call kernel
There are only two differences between the ppc32 code (Listing 2) and the ppc64 code (Listing 4). The first is the way we load pointers, and the second is those assembler directives about an .opd section. It's worth pointing out that the ppc32 code works perfectly under ppc64 Linux when compiled as a ppc32 binary.
Loading pointers
On ppc32 it took two instructions to load a 32-bit immediate value into a
register. On
ppc64 it takes 5! Why?
We still have 32-bit fixed-length instructions, which can only load 16 bits worth of immediate value at a time. Right there you need a minimum of four instructions (64 bits / 16 bits per instruction = 4 instructions). But there are no instructions that can load directly into the high word of a 64-bit GPR. So we have to load up the low word, shift it to the high word, then load the low word again.
The rotate instructions (like the rldicr seen here) are notoriously
complicated, and have jokingly been called Turing-complete. If all you
need to do is load 64-bit immediate values, don't worry about it -- just
convert these five instructions into a macro and never think about it again.
One last note: we used @h here instead of the
@ha used in the ppc32 example because we then use
ori rather than addi to supply the low 16 bits. Unlike addi, ori does not
sign-extend its immediate, so the high half needs no adjustment. On RISC
machines it's frequently possible to accomplish something in many different
ways (for example, there are many possibilities for nop).
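The five-instruction sequence from Listing 4 can be followed step by step in a C model. This is a sketch of the register contents after each instruction, not generated code; the helper names mirror the assembler suffixes but are otherwise made up.

```c
#include <stdint.h>

/* 16-bit pieces of a 64-bit address, named after the assembler suffixes. */
uint16_t part_highest(uint64_t a) { return (uint16_t)(a >> 48); }
uint16_t part_higher(uint64_t a)  { return (uint16_t)(a >> 32); }
uint16_t part_h(uint64_t a)       { return (uint16_t)(a >> 16); }
uint16_t part_l(uint64_t a)       { return (uint16_t)a; }

/* Model of the lis / ori / rldicr / oris / ori sequence. */
uint64_t load_imm64(uint64_t a)
{
    uint64_t r;

    /* lis: immediate shifted left 16, sign-extended to 64 bits */
    r = (uint64_t)part_highest(a) << 16;
    if (r & UINT64_C(0x80000000))
        r |= UINT64_C(0xffffffff00000000);

    /* ori: OR in the next 16 bits */
    r |= part_higher(a);

    /* rldicr r,r,32,31: rotate left by 32, then keep only the high word */
    r = ((r << 32) | (r >> 32)) & UINT64_C(0xffffffff00000000);

    /* oris: OR into bits 16-31 of the low word */
    r |= (uint64_t)part_h(a) << 16;

    /* ori: OR into bits 0-15 */
    r |= part_l(a);
    return r;
}
```

Note that any sign-extension bits left behind by lis are rotated into the low word and cleared by rldicr, which is why the sequence works even when the top bit of the address is set.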
Function descriptors -- the .opd section
Under ppc64 Linux, when you define a C function foo, the symbol foo is not actually the address of the
function's code. In assembly, if you try to bl
foo, you will quickly find your program crashing. The label foo is actually the address of foo's function
descriptor. Function descriptors are described in detail in the ppc64 ELF
ABI (see Resources), but very briefly you must
have a function descriptor (which is simply a structure containing 3
pointers) if your assembly will be called from C code, because the
compiler expects it.
We don't have any C code here, but the ELF ABI also says that the ELF
file's entry point (_start by default) points to a function
descriptor. So we must have one, and that is what goes into the .opd
section.
Those assembler directives were copied almost directly from the output
of gcc -S. This is another excellent candidate
for a preprocessor macro in your assembly code.
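Viewed from C, the three-pointer structure emitted by the .quad directive in Listing 4 might be pictured as follows. This is a sketch only: the field names are illustrative and are not taken from the ABI text, and the fields are modeled as 64-bit integers so the layout is the same on any host.

```c
#include <stdint.h>

/* Sketch of a ppc64 function descriptor, matching
 * ".quad ._start,.TOC.@tocbase,0" in Listing 4. */
typedef struct {
    uint64_t entry;        /* address of the function's code (._start) */
    uint64_t toc_base;     /* TOC base (.TOC.@tocbase) */
    uint64_t environment;  /* third doubleword, 0 in Listing 4 */
} func_descriptor;

/* Conceptually, calling through a descriptor means fetching the
 * code address from its first field. */
uint64_t descriptor_entry(const func_descriptor *d)
{
    return d->entry;
}
```

This is why branching to the descriptor's own address fails: the bytes there are data (three addresses), not instructions.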
For those of you interested in learning more about PowerPC, you can
start by compiling tiny programs with gcc -S --
provided that you have a PowerPC box handy. If you do not, check out the
PPC cross-compiling mini-howto, as well as the other sites and documents
listed in the Resources section. Also try experimenting with gdb's
psim (PowerPC simulator) target. It's easier than you may think!
- Get details on assembly instructions in The PowerPC Architecture: A Specification for a New Family of RISC Processors, 2nd. Ed (Morgan Kaufmann, May 1994, ISBN 1-55860-316-6), and also PowerPC Microprocessor Family: The Programming Environments for 32-bit Microprocessors (IBM, February 2000).
- Find links to UNIX assembly projects and programming information at linuxassembly.org's projects page.
- Learn embedded assembly in the Linux for PowerPC Embedded Systems HOWTO.
- Learn more about function descriptors in the 64-bit PowerPC ELF ABI.
- For IBM PowerPC applications, feature summaries, technical documentation, news, and more, visit the IBM PowerPC Web site.
- Browse the current list of IBM white papers and technical reports on PowerPC architecture.
- "A programmer's view of performance monitoring in the PowerPC microprocessor" (IBM Systems Journal, 1997) shows how you can analyze processor, software, and system attributes for a variety of workloads with the PowerPC's on-chip performance monitor (PM).
- "A decompression core for PowerPC" (IBM Systems Journal, 1998) shows you how to improve size efficiency for PowerPC code.
- Learn the basics and usage of inline assembly code in Linux in "Inline assembly for x86 in Linux" (developerWorks, March 2001).
- For an overview of embedded development on Linux, see "Linux system development on an embedded device" (developerWorks, March 2002).
- Find more Linux articles in the developerWorks Linux zone.



