Assembly language for Power Architecture, Part 2: The art of loading and storing on PowerPC

Techniques for putting data in memory exactly where you want it

The previous article in this series introduced assembly language programming using the 64-bit PowerPC® instruction set on POWER5 and other processors that use these instructions. This article drills down and discusses the specifics of 64-bit PowerPC assembly language programming on Linux® and UNIX®-like operating systems, focusing on data access methods and position-independent code.

Jonathan Bartlett (johnnyb@eskimo.com), Director of Technology, New Medio

Jonathan Bartlett is the author of the book Programming from the Ground Up, an introduction to programming using Linux assembly language. He is the lead developer at New Medio, developing Web, video, kiosk, and desktop applications for clients.



29 November 2006


Addressing modes and why they are important

About this "Assembly language for Power Architecture" series

The POWER5 processor is a 64-bit workhorse used in a variety of settings. This four-part series of articles introduces assembly language in general and specifically assembly language programming for the POWER5.

Before getting into addressing modes, let's review computer memory concepts. You may recognize these facts about memory and programming, but as modern programming languages attempt to de-emphasize the physical aspects of the computer, a refresher can be helpful:

  • Every location in main memory is numbered with a sequential numeric address by which the memory location is referred.
  • Every main memory location is one byte long.
  • Larger data types are made by simply treating multiple bytes as a single unit (using two memory locations together for a 16-bit number, for instance).
  • Registers are 4 bytes long on 32-bit platforms, and 8 bytes long on 64-bit platforms.
  • Memory can be loaded into registers either 1, 2, 4, or 8 bytes at a time.
  • Non-numeric data is stored as numeric data -- the only differences are what operations are used on it and how the data is used.
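As a rough illustration of these points, the following Python sketch models main memory as an array of one-byte locations and shows how 1, 2, 4, or 8 adjacent bytes can be treated as a single number. The memory contents and the load helper are invented for this example, and big-endian byte order is an assumption of the sketch.

```python
# Hypothetical model: main memory as a sequence of one-byte locations.
memory = bytearray(range(0x10, 0x20))  # addresses 0..15 hold 0x10..0x1f

def load(address, size):
    """Combine `size` bytes starting at `address` into one unsigned number.

    Big-endian byte order is an assumption of this sketch.
    """
    if size not in (1, 2, 4, 8):
        raise ValueError("loads are 1, 2, 4, or 8 bytes wide")
    return int.from_bytes(memory[address:address + size], "big")

print(hex(load(0, 1)))  # one memory location
print(hex(load(0, 2)))  # two locations treated as a 16-bit number
print(hex(load(0, 8)))  # eight locations treated as a 64-bit number
```

Nothing in the bytes themselves says whether they are a number, text, or part of a larger value; only the width and the operations you choose decide that.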

New assembly language programmers are sometimes surprised by how many different ways you can access memory. These different ways are called addressing modes. Some modes are logically equivalent but serve different purposes. They are considered different addressing modes because a processor may implement them differently.

There are actually two addressing modes that don't access memory at all. In immediate mode, the data to be used is part of the instruction (for example, the li instruction stands for "load immediate," because the number to be loaded is part of the instruction itself). In register mode, rather than accessing the contents of main memory, you access registers.

The most obvious addressing mode for accessing main memory is called direct addressing mode. In this mode, the instruction itself contains the address from which to load the data. This mode is often used for global variable access, branching, and subroutine calls. A similar mode is relative addressing mode, which calculates the address based on the current program counter. This is often used for short-range branches where the destination is near the current location, so specifying an offset rather than an absolute address makes more sense. It is similar to direct addressing mode in that the final address is known at either assemble or link time.

The indexed addressing mode makes the most sense as a way to access array elements for global variables. It has two parts: a memory address and an index register. The index register is added to the specified address, and the result is used as the address for the memory access. Some platforms (not PowerPC) allow programmers to specify a multiplier for the index register. Therefore, if each array element is 8 bytes long, you can use 8 as a multiplier. This allows the index register to be used exactly like an array index. Otherwise, the index register would have to be increased/decreased in increments of the data size.

The register indirect addressing mode uses a register to specify the whole address for the memory access. This is used for numerous situations, including, but not limited to:

  • Dereferencing pointer variables
  • Any memory access that is not available by other modes (the address can be calculated by other means and stored in the register, which is then used for the access)

Base-pointer addressing mode acts just like indexed addressing mode (the specified number and the register are added together for the final address), except that the functions of the two components are switched. In base-pointer addressing mode, the register has the base address and the literal number has the offset. This is very useful for accessing members of a struct. The register can hold the address of the whole struct, and the numeric portion can be modified depending on the structure member to be accessed.

For instance, let's say you have a struct that has three fields: the first is 8 bytes, the second is 4 bytes, and the last is 8 bytes, laid out back to back with no padding between them. Then, let's say that the address of the struct itself is in a register called register X. If you want to access the second member of the structure, you'll need to add 8 to the value in the register. So, using base-pointer addressing, you would specify register X as the base pointer and 8 as the offset. To access the third field, you would specify register X as the base pointer and 12 as the offset. To access the first field, you can actually use indirect addressing instead of base-pointer addressing, since there is no offset (this is why on many platforms the first structure member is the fastest to access; you can use a simpler addressing mode -- in PowerPC it does not matter).
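The offset arithmetic for that struct can be checked with a short Python sketch. Everything here (the memory buffer, the base address 16 standing in for "register X", and the load_field helper) is hypothetical. The ">" prefix in the format strings asks Python's struct module for big-endian data with no alignment padding, which matches the packed layout described in the text.

```python
import struct

# Hypothetical record matching the text: 8-byte, 4-byte, and 8-byte fields.
# ">" means big-endian with no alignment padding (fields packed back to back).
LAYOUT = ">qiq"
FIELD_SIZES = (8, 4, 8)

# Offset of each field = sum of the sizes of the fields before it.
offsets = []
position = 0
for size in FIELD_SIZES:
    offsets.append(position)
    position += size

memory = bytearray(64)
base = 16  # pretend "register X" holds this address
struct.pack_into(LAYOUT, memory, base, 111, 222, 333)

def load_field(base_register, offset, fmt):
    """Base-pointer access: effective address = register + constant offset."""
    return struct.unpack_from(fmt, memory, base_register + offset)[0]

first = load_field(base, offsets[0], ">q")   # offset 0: plain indirect access
second = load_field(base, offsets[1], ">i")
third = load_field(base, offsets[2], ">q")
print(offsets, first, second, third)
```

A C compiler would normally insert padding so that the third field starts at offset 16; the sketch deliberately follows the unpadded layout used in the paragraph above.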

Finally, in indexed register indirect addressing mode, both the base and the index are stored in registers. The memory address used is determined by adding the two registers together.
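To compare the memory-access modes side by side, here is a minimal Python sketch that computes the effective address for each one. The register names, their contents, and the function names are all made up for illustration; real hardware does this inside the instruction, not in separate steps.

```python
# Hypothetical register file and constants, chosen only for illustration.
registers = {"base": 0x1000, "index": 0}

def direct(address):
    return address                           # address is part of the instruction

def indexed(address, index_reg):
    return address + registers[index_reg]    # constant address + index register

def register_indirect(reg):
    return registers[reg]                    # register holds the whole address

def base_pointer(reg, offset):
    return registers[reg] + offset           # register base + constant offset

def indexed_register_indirect(base_reg, index_reg):
    return registers[base_reg] + registers[index_reg]  # both parts in registers

# Without a hardware multiplier, the index register must advance by the
# element size (8 bytes here) to reach element 3 of an array at 0x2000.
registers["index"] = 3 * 8
print(hex(indexed(0x2000, "index")))
print(hex(base_pointer("base", 12)))
print(hex(indexed_register_indirect("base", "index")))
```

Notice that indexed and base-pointer addressing perform the same addition; they differ only in which operand is thought of as the starting point.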


The importance of instruction formats

To learn how addressing modes work for load and store instructions on PowerPC processors, you must first understand a little bit about the PowerPC instruction format. The PowerPC uses a load/store (also called RISC) instruction set, which means that the only time it accesses main memory is for loading into registers or copying a register to memory. All of the actual processing takes place between registers (or between registers and immediate-mode operands). The other main type of processor architecture, CISC (the x86 processor being a popular CISC instruction set), allows for memory access in nearly every instruction. The reason for the load/store architecture is that it allows the rest of the processor to be more efficient. In fact, most modern CISC processors actually translate their instructions to an internalized RISC format for efficiency.

Each instruction on the PowerPC is exactly 32 bits long, with the instruction's opcode (the code telling the processor which instruction it is) taking the first six bits. This 32-bit length includes all immediate-mode values, register references, explicit addresses, and instruction options. This makes for a pretty tight squeeze. In fact, the largest field available for a memory address in any instruction format is only 24 bits! Because that field holds a word-aligned branch target, two zero bits are implied at the end, which gives you, at most, only 64MB of addressable range. Don't worry -- there are lots of ways around this. This is just to point out why instruction format matters on the PowerPC processor -- you need to know how much space you have to work with!

You don't need to memorize all of the instruction formats to make use of them. However, knowing some of the basic ones will help you read PowerPC documentation and understand some of the general strategies and nuances in the PowerPC instruction set. The PowerPC has 15 different instruction formats, many with several subformats. However, you only need to be concerned with a few of them.


Addressing memory using the D-Form and DS-Form instruction formats

The D-Form instruction is one of the primary memory-access instruction forms. It looks like this:

The D-Form instruction format

  Bits 0-5      Opcode
  Bits 6-10     Source/target register
  Bits 11-15    Address/index register/operand
  Bits 16-31    Numeric address, offset, or immediate-mode value

This form is used to perform loads, stores, and immediate-mode calculations. It can be used for the following addressing modes:

  • Immediate addressing mode
  • Direct addressing mode (by specifying zero for the address/index register)
  • Indexed addressing mode
  • Indirect addressing mode (by specifying zero for the address)
  • Base pointer addressing mode

As you can see, the D-Form instruction is very flexible and is used for any register-plus-address memory access form. However, its usability for direct addressing and indexed addressing is extremely limited, because it only has a 16-bit address field to work with! This gives a maximum range of only 64K. Therefore, the direct and indexed addressing modes are only rarely used to fetch and store memory. Instead, this form is much more often used for the immediate, indirect, and base-pointer addressing modes, where the 64K limit is far less of a problem because the base register can hold a full 64-bit address.

The DS-Form is only used in 64-bit instructions. It is just like the D-Form, except that it uses the last two bits of the address field for an extended opcode, leaving 14 bits for the value. The processor pads the value on the right with two zeros to form the offset. This gives it the same range as D-Form instructions (64K), but limits offsets to multiples of 4 bytes. For the assembler, the value is specified normally -- it is simply condensed by the assembler. For example, if you wanted an offset of 8, you would still enter 8; the assembler would just store the 14-bit representation 0b00000000000010 instead of the 16-bit representation 0b0000000000001000. If you entered a value that was not a multiple of 4, the assembler would give an error.
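The condensing step can be sketched in a few lines of Python. This is only a model of what the assembler and processor do with the offset field; the function names encode_ds_offset and decode_ds_offset are invented for the example.

```python
def encode_ds_offset(offset):
    """Return the 14-bit DS field for a byte offset, as an assembler might.

    The low two bits of the offset are not stored (they are implied zeros),
    so the offset must be a multiple of 4 within the signed 16-bit range.
    """
    if offset % 4 != 0:
        raise ValueError("DS-Form offsets must be a multiple of 4")
    if not -0x8000 <= offset <= 0x7FFC:
        raise ValueError("offset out of range for the DS-Form")
    return (offset >> 2) & 0x3FFF

def decode_ds_offset(field):
    """Rebuild the byte offset: append two zero bits, then sign-extend."""
    value = field << 2
    return value - 0x10000 if value & 0x8000 else value

print(format(encode_ds_offset(8), "014b"))       # the 14 stored bits for offset 8
print(decode_ds_offset(encode_ds_offset(-16)))   # negative offsets round-trip too
```

Passing an offset such as 6 raises an error in this sketch, mirroring the assembler's refusal to encode an offset that is not a multiple of 4.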

Note that in D-Form and DS-Form load and store instructions (and in addi), if the address register field is set to 0, the processor uses the value 0 rather than the contents of register 0.

Let's now look at instructions built from D-Forms and DS-Forms.

Immediate-mode instructions are specified in assembler like this:

opcode dst, src, value

Here dst is the destination register, src is a source register (used in computation), and value is the immediate-mode value used. Immediate-mode instructions never use the DS-Form. Here are some immediate-mode instructions:

Listing 1. Immediate-mode instructions
#Add the contents of register 3 to the number 25 and store in register 2
addi 2, 3, 25

#OR the contents of register 6 with the number 0b0000000000000001 and store in register 3
ori 3, 6, 0b0000000000000001

#Move the number 55 into register 7
#(remember, when 0 is the second register in D-Form instructions
#it means ignore the register)
addi 7, 0, 55
#Here is the extended mnemonic for the same instruction
li 7, 55
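To make the bit layout concrete, here is a small Python sketch that packs D-Form fields into a 32-bit instruction word. Python is used only as a calculator here; it is not part of the assembler toolchain. The function name encode_d_form is invented for this example, and the primary opcode number 14 for addi is taken from the Power ISA opcode tables rather than from this article.

```python
def encode_d_form(opcode, rt, ra, value):
    """Pack the four D-Form fields into one 32-bit instruction word.

    PowerPC numbers bits from the most significant end, so bits 0-5
    (the opcode) occupy the top six bits of the word.
    """
    if not 0 <= opcode < 64:
        raise ValueError("opcode must fit in 6 bits")
    if not (0 <= rt < 32 and 0 <= ra < 32):
        raise ValueError("register numbers must fit in 5 bits")
    if not -0x8000 <= value <= 0xFFFF:
        raise ValueError("value must fit in 16 bits")
    return (opcode << 26) | (rt << 21) | (ra << 16) | (value & 0xFFFF)

ADDI = 14  # primary opcode of addi (assumption taken from the Power ISA)

# addi 2, 3, 25
print(f"{encode_d_form(ADDI, 2, 3, 25):08x}")
# addi 7, 0, 55 (the same word the assembler emits for li 7, 55)
print(f"{encode_d_form(ADDI, 7, 0, 55):08x}")
```

Because li is only an extended mnemonic, both spellings of the last instruction produce the identical 32-bit word.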

In the non-immediate-mode uses of the D-Form, the second register is added to the value to give the final address of the memory to load from or store to. These instructions have the general form:

opcode dst, d(a)

In this form, the address to load/store is specified as d(a), where d is the numeric address/offset and a is the number of the register to use for the address/offset. They are added together to give the final effective address for the load/store. Here are some example D-Form/DS-Form load/store instructions:

Listing 2. Load/store instruction examples using the D-Form and DS-Form
#load a byte from the address in register 2, store it in register 3, 
#and zero out the remaining bits
lbz 3, 0(2)

#store the 64-bit contents (double-word) of register 5 into the 
#address 32 bytes past the address specified by register 23
std 5, 32(23)

#store the low-order 32 bits (word) of register 5 into the address 
#32 bytes past the address specified by register 23
stw 5, 32(23)

#store the byte in the low-order 8 bits of register 30 into the 
#address specified by register 4
stb 30, 0(4)

#load the 16 bits (half-word) at address 300 into register 4, and 
#zero-out the remaining bits
lhz 4, 300(0)

#load the half-word (16 bits) that is 1 byte offset from the address 
#in register 31 and store the result sign-extended into register 18
lha 18, 1(31)

If you look carefully, you can see that there is sort of a "base opcode" specified at the beginning of the instruction, with several modifiers following. l and s are used for "load" and "store." b gives you a byte, h gives you a halfword (16 bits), w gives you a word (32 bits), and d gives you a doubleword (64 bits). After this, for loads, the a and z modifiers tell whether the value is sign-extended, or if it is simply zero-padded when loaded into the register. Finally, a u can be attached to tell the processor to update the register used in address calculation with the final computed address of the instruction.
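The difference between the a and z modifiers is easiest to see with a value whose top bit is set. The following Python sketch models the two behaviors for a halfword, as lhz and lha would treat it; the helper names are invented, and Python's unbounded integers stand in for a 64-bit register.

```python
def zero_extend(value, bits):
    """What the z forms (such as lhz) do: keep the low bits, clear the rest."""
    return value & ((1 << bits) - 1)

def sign_extend(value, bits):
    """What the a forms (such as lha) do: copy the top loaded bit upward."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value

halfword = 0xFFFE  # 16 bits read from memory

as_unsigned = zero_extend(halfword, 16)
as_signed = sign_extend(halfword, 16)

# Show each result as the 64-bit pattern a register would hold.
print(f"{as_unsigned & 0xFFFFFFFFFFFFFFFF:016x}")
print(f"{as_signed & 0xFFFFFFFFFFFFFFFF:016x}")
```

The same 16 bits in memory therefore end up as either a large positive number or a small negative one, depending only on which load instruction you pick.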


Addressing memory using the X-Form instruction format

The X-Form is used for indexed register indirect addressing, where the values of two registers are added together to determine the address for loading/storing. The X-Form has the following format:

The X-Form instruction format

  Bits 0-5      Opcode
  Bits 6-10     Source/destination register
  Bits 11-15    Address calculation register A
  Bits 16-20    Address calculation register B
  Bits 21-30    Extended opcode
  Bit 31        Unused

Instructions in this form are written like this:

opcode dst, rega, regb

Here opcode is the opcode for the instruction, dst is the destination (or source) register for the data transfer, and rega and regb are the two registers used for address calculation.

Here are some example instructions using the X-Form:

Listing 3. Examples using X-Form addressing
#Load a doubleword (64 bits) from the address specified by 
#register 3 + register 20 and store the value into register 31
ldx 31, 3, 20

#Load a byte from the address specified by register 10 + register 12 
#and store the value into register 15 and zero-out remaining bits
lbzx 15, 10, 12

#Load a halfword (16 bits) from the address specified by 
#register 6 + register 7 and store the value into register 8, 
#sign-extending the result through the remaining bits
lhax 8, 6, 7

#Take the doubleword (64 bits) in register 20 and store it in the 
#address specified by register 10 + register 11
stdx 20, 10, 11

#Take the doubleword (64 bits) in register 20 and store it in the 
#address specified by register 10 + register 11, and then update 
#register 10 with the final address
stdux 20, 10, 11

The advantage of X-Form, besides being very flexible, is that you have a significantly extended address range. In the D-Form, only one value -- the register -- could specify a full range. In the X-Form, since you have two registers, both components can specify as large a range as necessary. Therefore, in situations where base-pointer addressing or indexed addressing would be used, but the 16-bit range of the constant part of the D-Form is too small, the value can be stored in a register and the X-Form can be used.
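Here is a minimal Python model of stdx and stdux that shows both points: the index can be far larger than a 16-bit constant, and the update form writes the effective address back. The register list, the memory dictionary, and the addresses are all stand-ins invented for this sketch.

```python
# Hypothetical machine state for illustration only.
registers = [0] * 32
memory = {}  # maps byte address -> 64-bit value stored there

def stdx(rs, ra, rb):
    """Store register rs at the address (ra) + (rb).

    As with the D-Form, an RA field of 0 means the value 0, not register 0.
    """
    base = registers[ra] if ra != 0 else 0
    address = (base + registers[rb]) & 0xFFFFFFFFFFFFFFFF
    memory[address] = registers[rs]
    return address

def stdux(rs, ra, rb):
    """Same store, then write the effective address back into register ra."""
    if ra == 0:
        raise ValueError("the update form needs a real base register")
    address = stdx(rs, ra, rb)
    registers[ra] = address
    return address

registers[10] = 0x1000_0000      # base address
registers[11] = 0x0012_3456      # index far beyond the 16-bit D-Form range
registers[20] = 42

stdx(20, 10, 11)                 # register 10 is unchanged
unchanged_base = registers[10]
stdux(20, 10, 11)                # register 10 now holds the effective address
print(hex(unchanged_base), hex(registers[10]))
```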


Writing position-independent code

Position-independent code is code that works no matter what part of memory it is loaded into. Why do you need position-independent code? Position-independent code allows libraries to be loaded into arbitrary locations in the address space. This is what allows libraries to be arbitrarily combined -- since none of them have specific locations they are bound to, they can be loaded with any other library without worrying about address space conflicts. The linker takes care of making sure libraries are each loaded into their own space. By using position-independent code, the libraries don't have to worry about where they are loaded.

Ultimately, however, position-independent code needs to have a method of locating global variables. It does this by maintaining a global offset table that provides addresses for all global contents that a function or group of functions access (or even a whole program, in most cases). A register is reserved for holding the pointer to the table. Then, all accesses are done by an offset into the table. The offsets are constant. The table itself is set up by the program linker/loader, which also initializes register 2 to hold the global offset table pointer. Using this method, the linker/loader can put both program and data wherever it deems appropriate, and only needs to set up a global offset table containing all of the global pointers.

It is easy to get bogged down in a discussion of all of this. Let's look at some code and analyze what is going on at each step of the way. This is the "add numbers" program used in the previous article, but adapted for position-independent code.

Listing 4. Accessing data through the global offset table
###DATA DEFINITIONS###
.data
.align 3
first_value:
        .quad 1
second_value:
        .quad 2

###ENTRY POINT DECLARATION###
.section .opd, "aw"
.align 3
.globl _start
_start:
        .quad ._start, .TOC.@tocbase, 0

###CODE###
.text
._start:
        ##Load values##
        #Load the address of first_value into register 7 from the global offset table
        ld 7, first_value@got(2)
        #Use the address to load the value of first_value into register 4
        ld 4, 0(7)
        #Load the address of second_value into register 7 from the global offset table
        ld 7, second_value@got(2)
        #Use the address to load the value of second_value into register 5
        ld 5, 0(7)

        ##Perform addition##
        add 3, 4, 5

        ##Exit with status##
        li 0, 1
        sc

To assemble, link, and run the code, do the following:

Listing 5. Assembling, linking, and running the code
#Assemble
as -a64 addnumbers.s -o addnumbers.o

#Link
ld -melf64ppc addnumbers.o -o addnumbers

#Run
./addnumbers

#View the result code (value returned from the program)
echo $?

The data definition and entry point declaration are both the same as before. However, now, instead of having to use 5 instructions to load the address of first_value into register 7, only one instruction is needed: ld 7, first_value@got(2). As I mentioned before, the linker/loader sets up register 2 as the address of the global offset table. The syntax first_value@got asks the linker to use, instead of the address of first_value, the offset within the global offset table that contains first_value's address.

Using this method, most programs can contain all of the global data they use within a single global offset table. The DS-Form can address up to 64K of memory from a single base. Note that in order to get the full range of the DS-Form, register 2 points to the middle of the global offset table, so that it can make use of both positive and negative offsets. Since you are locating pointers to data (instead of data directly), you have access to approximately 8,000 global variables (local variables are in registers or in the stack, which will be discussed in the third article in this series). And even if this were not enough, there can exist multiple global offset tables. The mechanism for this is also discussed in the next article.
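The "approximately 8,000" figure follows from simple arithmetic, sketched here in Python. The constant names are invented; the only inputs are the 16-bit signed offset field and the 8-byte size of a 64-bit pointer.

```python
OFFSET_BITS = 16          # width of the signed D-Form/DS-Form offset
POINTER_SIZE = 8          # bytes per 64-bit global offset table entry

lowest = -(1 << (OFFSET_BITS - 1))       # most negative offset from register 2
highest = (1 << (OFFSET_BITS - 1)) - 1   # most positive offset from register 2

# With register 2 pointing at the middle of the table, the whole window
# from `lowest` to `highest` is reachable with a single instruction.
reachable_bytes = highest - lowest + 1
entries = reachable_bytes // POINTER_SIZE

print(lowest, highest, reachable_bytes, entries)
```

If register 2 pointed at the start of the table instead, only the non-negative half of the offsets would be useful and the capacity would be cut in half.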

While this is much more compact and readable (not to mention relocatable) than the five-instruction data load in the last article, you can still do better. In the 64-bit ELF ABI, the global offset table is actually a subset of a larger section known as the table of contents. In addition to creating global offset table entries, the table of contents can contain variables, which, rather than containing addresses of global data, contain the data items themselves. The size and number of these variables must be small, since the table of contents is only 64K.

To declare a table of contents data item, you have to switch to the .toc section and make the declaration explicitly. It looks like this:

.section .toc
name:
.tc unused_name[TC], initial_value

This will create a table of contents entry. name is the symbol used to refer to it within the code. initial_value is the 64-bit value that is initially assigned. unused_name is a historical relic, not presently used for any purpose in ELF systems. You can leave it out (it is included above just to help with reading legacy code), but the [TC] is required.

To access data that is directly within the table of contents, you need to refer to it using @toc rather than @got. @got still functions, but it functions as before -- returning a pointer to your value rather than the value itself. Take a look at this code:

Listing 6. Difference between @got and @toc
### DATA ###

#Create the variable my_var in the table of contents
.section .toc
my_var:
.tc [TC], 10

### ENTRY POINT DECLARATION ###
.section .opd, "aw"
.align 3
.globl _start
_start:
        .quad ._start, .TOC.@tocbase, 0

### CODE ###
.text
._start:
        #loads the number 10 (my_var contents) into register 3
        ld 3, my_var@toc(2) 

        #loads the address of my_var into register 4
        ld 4, my_var@got(2)
        #loads the number 10 (my_var contents) into register 3 through the address in register 4
        ld 3, 0(4)

        #load the number 15 into register 5
        li 5, 15

        #store 15 (register 5) into my_var via ToC
        std 5, my_var@toc(2)

        #store 15 (register 5) into my_var via GOT (address already loaded into register 4)
        std 5, 0(4)

        #Exit with status 0
        li 0, 1
        li 3, 0
        sc

As you can see, if you look up a symbol that defines data within the .toc section (as opposed to the .data section where most data is), using @toc will give you an offset that leads directly to the value itself, while using @got will give you an offset to an address for the value.
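The distinction can also be modeled in Python by counting loads. The addresses, offsets, and the ld helper below are all invented for this sketch; in a real program the linker chooses the offsets and register 2 holds the table of contents pointer.

```python
# Hypothetical memory layout; every address below is invented for the sketch.
memory = {}            # byte address -> 64-bit value
TOC_BASE = 0x8000      # stands in for the value held in register 2

# A value stored directly in the table of contents (like my_var).
MY_VAR_TOC_OFFSET = 0x10
memory[TOC_BASE + MY_VAR_TOC_OFFSET] = 10

# A GOT entry holds the *address* of the value, not the value itself.
MY_VAR_GOT_OFFSET = 0x18
memory[TOC_BASE + MY_VAR_GOT_OFFSET] = TOC_BASE + MY_VAR_TOC_OFFSET

load_log = []          # records the address of every load performed

def ld(address):
    """Perform one 64-bit load and record it."""
    load_log.append(address)
    return memory[address]

# @toc: one load reaches the value.
via_toc = ld(TOC_BASE + MY_VAR_TOC_OFFSET)
toc_loads = len(load_log)

# @got: first load the pointer, then load through it.
pointer = ld(TOC_BASE + MY_VAR_GOT_OFFSET)
via_got = ld(pointer)
got_loads = len(load_log) - toc_loads

print(via_toc, toc_loads, via_got, got_loads)
```

Both paths arrive at the same value; the @got path simply pays for one extra memory access to fetch the pointer first.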

Now let's look at the adding numbers example using values from the ToC:

Listing 7. Adding numbers defined in the .toc section
### PROGRAM DATA ###
#Create the values in the table of contents
.section .toc
first_value:
        .tc [TC], 1
second_value:
        .tc [TC], 2

### ENTRY POINT DEFINITION ###
.section .opd, "aw"
.align 3
.globl _start
_start:
        .quad ._start, .TOC.@tocbase, 0

.text
._start:
        ##Load values from the table of contents ##
        ld 4, first_value@toc(2)
        ld 5, second_value@toc(2)

        ##Perform addition##
        add 3, 4, 5

        ##Exit with status##
        li 0, 1
        sc

As you can see, by using .toc-based data, you can significantly lower the number of instructions used by your code. Also, since the table of contents is usually in cache, it significantly lowers memory latency as well. Just be careful with how much data is stored.


Loading and storing multiple values

The PowerPC also has the ability to perform multiple loads and stores with a single instruction. Unfortunately, this is restricted to word-sized (32-bit) data. These are very simple D-Form instructions. You specify the base address register, the offset, and the starting destination register. The processor will then load data into all the registers from the listed destination register through register 31, starting with the address specified in the instruction and moving forward one word at a time. The instructions for this are lmw (load multiple word) and stmw (store multiple word). Here are a few examples:

Listing 8. Loading and storing multiple values
#Starting at the address specified in register 10, load
#the next 32 bytes into registers 24-31
lmw 24, 0(10)

#Starting at the address specified in register 8, load 
#the next 8 bytes into registers 30-31
lmw 30, 0(8)

#Starting at the address specified in register 5, store
#the low-order 32-bits of registers 20-31 into the next
#48 bytes
stmw 20, 0(5)
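The byte counts in those comments come from the number of registers between the starting register and register 31, times 4 bytes each. A rough Python model of lmw makes this explicit; the register list, the memory buffer, and the function signature are stand-ins invented for this sketch, and big-endian byte order is assumed.

```python
def lmw(registers, memory, rt, offset, ra):
    """Model of lmw rt, offset(ra): fill registers rt..31 with successive words.

    `registers` is a 32-entry list and `memory` a bytes-like object; both are
    invented for this sketch. Returns the number of bytes read.
    """
    base = registers[ra] if ra != 0 else 0
    start = base + offset
    address = start
    for reg in range(rt, 32):
        registers[reg] = int.from_bytes(memory[address:address + 4], "big")
        address += 4
    return address - start

registers = [0] * 32
memory = bytearray(64)
memory[16:24] = (1).to_bytes(4, "big") + (2).to_bytes(4, "big")
registers[7] = 16

consumed = lmw(registers, memory, 30, 0, 7)   # two registers: 30 and 31
print(consumed, registers[30], registers[31])
```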

And here is our add numbers program again using multiple values:

Listing 9. The add numbers program using multiple values
### Data ###
.data
first_value:
        #using ".long" instead of ".quad" because
        #the "multiple" instructions only operate
        #on 32-bit words
        .long 1  
second_value:
        .long 2

### ENTRY POINT DECLARATION ###
.section .opd, "aw"
.align 3
.globl _start
_start:
        .quad ._start, .TOC.@tocbase, 0

### CODE ###
.text
._start:
        #Load the address of our data from the GOT
        ld 7, first_value@got(2)

        #Load the values of the data into registers 30 and 31
        lmw 30, 0(7)

        #add the values together
        add 3, 30, 31

        #exit
        li 0, 1
        sc

With update mode

Most load/store instructions can update the main address register with the final effective address that was used in the load/store instruction. For example, ldu 5, 4(8) will load from the address specified in register 8 plus 4 bytes into register 5, and then store the calculated address back into register 8. This is called loading and storing with update and can be used to decrease the number of instructions required to do a number of tasks. I'll use it more in the next article.
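As a sketch of why this saves instructions, the following Python model uses an ldu-style load to walk a small array without a separate add for the pointer. The register list, the memory dictionary, and the addresses are invented for the example.

```python
registers = [0] * 32
memory = {0x1008: 11, 0x1010: 22, 0x1018: 33}   # invented doublewords

def ldu(rt, offset, ra):
    """Model of ldu rt, offset(ra): load, then store the address back in ra."""
    address = registers[ra] + offset
    registers[rt] = memory[address]
    registers[ra] = address

registers[8] = 0x1000          # one doubleword before the first element
total = 0
for _ in range(3):
    ldu(5, 8, 8)               # advances register 8 by 8 on every load
    total += registers[5]

print(total, hex(registers[8]))
```

Starting the pointer one element before the array is a common idiom with update forms, because the address is advanced before each access rather than after it.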


Conclusion

Efficient loading and storing is critical for efficient code. Knowing the instruction formats and addressing modes available helps you understand the possibilities and limitations of a platform. The D-Form and DS-Form instruction formats on the PowerPC are critical for position-independent code. Position-independent code allows you to create shared libraries and allows you to use fewer instructions to load global addresses.

The next article in this series will cover branching, function calls, and integrating with C code.
