Save your code from meltdown using PowerPC atomic instructions

Write correct, concurrent code


Pop quiz! What's wrong with the C code in Listing 1, which increments an integer?

Listing 1. Incrementing an integer: a subtle flaw

Time's up! Did you spot the flaw? If not, don't feel too badly -- it was a bit of a trick question. Just by itself, this simple code isn't wrong. However, you can place it into a context where this code will produce incorrect results: randomly losing increments. That is, calling ++gConnectCountmay not actually increment gConnectCount -- and that's troubling.

The trouble starts if ++gConnectCount is reentered. This can happen if the code is run while it's already running, for instance, if it's called from a signal handler, or from another thread. Even on a single-processor machine, a preemptive threading architecture can reenter code while a thread is already running.

It seems odd to talk about one line of code being reentered. C is a rather low-level language, with most of its semantics mapping onto processor primitives one-to-one. So, thinking that one line of code incrementing a counter would result in one processor instruction generated is not outrageous. Indeed, this is the case for CISC architectures like x86 and 68K.

The IBM® PowerPC®, however, is a RISC architecture with its characteristic load-store design -- so the PowerPC doesn't have instructions that perform operations directly on memory. Instead, you load the memory into a register, perform the operation, and then store it back. Thus, your precious C compiler translated ++gConnectCount into something like four PowerPC instructions, as Listing 2 shows.

Listing 2. Incrementing an integer: generated assembly
lwz  r3, gConnectCount(rtoc) // Load address of gConnectCount into register r3.
lwz  r4, 0(r3)               // Load integer from RAM into register r4.
addi r4, r4, 1               // Add 1 to register r4.
stw  r4, 0(r3)               // Store incremented value from register r4 back into RAM.

You can pretty much ignore Listing 2's first line -- it just fetches the address of the gConnectCount global variable into register r3. The real meat is in the lwx/addi/stw triplet. Figure 1 shows this code as an illustration, presenting what happens when gConnectCount's value starts out at 22.

Figure 1. Incrementing an integer using load/store
Incrementing an integer using load/store
Incrementing an integer using load/store

As you can see, it takes three separate instructions to increment an integer in RAM. Since it takes more than one instruction to accomplish the task, this sequence will possibly be interrupted, as is the case in time-slice-based preemptive multithreading. However, multithreading by itself isn't the problem. For example, it's okay if the code is executed sequentially by multiple threads, as Figure 2 illustrates.

Figure 2. Incrementing an integer: multithreaded sequential execution
Incrementing an integer: multithreaded sequential execution
Incrementing an integer: multithreaded sequential execution

The fun starts when this sequence is interrupted and reentered, as Figure 3 illustrates.

Figure 3. Incrementing an integer: multithreaded concurrent execution
Incrementing an integer: multithreaded concurrent execution
Incrementing an integer: multithreaded concurrent execution

The tragedy unfolds when two threads overlap and read gConnectCount twice. The threads, each unaware of the other's concurrent modification, interact in such a way that one increment is "lost".

Looking again at the sequence of instructions necessary to increment an integer, you can pick out four distinct events, as Figure 4 shows. You can reenter the code before it loads the value (the first event), or after it writes out the result (the last event). However, between the first and last event, there's a span of time when the code will malfunction if reentered. This interval is the Window of Death.

Figure 4. Incrementing an integer: the Window of Death
Incrementing an integer: the Window of Death
Incrementing an integer: the Window of Death

Closing the Window of Death with atomic instructions

The problem with ++gConnectCount is that it boils down to at least three instructions, and bad things happen if this sequence of instructions is reentered. But ++gConnectCount is as tight as you can express incrementing an integer in C. Indeed, you can't make this code reentrant-safe in portable C. For that, you need to drop down and use processor family-specific instructions.

The PowerPC User Instruction Set Architecture provides two interrelated instructions that you can use to make updating memory atomic -- that is, indivisible. Once an update is made indivisible, reentering it won't cause any problems. That's because other threads see the shared state before the update, or after the update, but never during the update (when data might be inconsistent). These two instructions are Load Word and Reserve Index (lwarx) and Store Word Conditional Index (stwcx.).

lwarx works just like the common Load Word and Zero Indexed (lwzx), except that it places a reservation on the loaded address in addition to loading the data. The PowerPC processor can hold only one reservation at a time.

Likewise, stwcx. works just like any other store instruction. The big difference is that stwcx. is conditional -- the store is only performed if a reservation is present on the given address. If a reservation exists, then it clears the reservation, performs the store and sets the condition register CR0[EQ] to true. Otherwise, the instruction does nothing except set CR0[EQ] to false.

Think of a "reservation" as a kind of register. During each store instruction, the processor compares the given address to the reservation. If they are equal, the reservation is cleared. However, reservations also work in a multiprocessor environment. If processor A places a reservation on address X and processor B stores to address X, processor A's reservation is cleared.

When coupled, these two instructions provide a foundation to implement code that's safe against reentrancy. Listing 3 shows how by rewriting ++gConnectCount to be atomic.

Listing 3. Incrementing an integer: atomically
  lwarx  r4, 0, r3 // Read integer from RAM into r4, placing reservation.
  addi   r4, r4, 1 // Add 1 to r4.
  stwcx. r4, 0, r3 // Attempt to store incremented value back to RAM.
  bne-   retry     // If the store failed (unlikely), retry.

Not surprisingly, the code looks largely the same as before. The load/store instructions (lwz and stw) have been replaced with their reservation-aware brethren (lwarx and stwcx.) and looping (the retry: label and bne- instruction) has been added.

The looping has to be added because the store can now fail (if another thread has preempted the reservation). So, after the conditional store, you should test CR0[EQ] to deduce whether the store succeeded. If it did: great, continue on. If it didn't: you should loop back, load from RAM again, and retry the operation. Also, notice the minus symbol after the bne instruction. That's a hint, passed through the assembler and encoded into the instruction, informing the branch prediction unit that this branch is unlikely to be taken. This is just an optimization, but it makes sense here -- you don't expect to be interrupted very often in your short three-instruction window.

The example in this article just increments an integer. However, you can do anything that involves updating a single 32-bit value. With this in mind, you can write a general function wrapper for the atomic instructions, enabling atomic updating from higher-level C code. Listing 4 (which takes advantage of GCC 3.3's new CodeWarrior-style assembly-within-a-C-function support) gives an example.

Listing 4: General atomic updating function
  long         // <- Zero on failure, one on success (r3).
  long prev,   // -> Previous value (r3).
  long next,   // -> New value (r4).
  void *addr ) // -> Location to update (r5).
  lwarx  r6, 0, r5 // current = *addr;
  cmpw   r6, r3    // if( current != prev )
  bne    fail      // goto fail;
  stwcx. r4, 0, r5 // if( reservation == addr ) *addr = next;
  bne-   retry     // else goto retry;
  li      r3, 1    // Return true.
  blr              // We're outta here.
  stwcx. r6, 0, r5 // Clear reservation.
  li     r3, 0     // Return false.
  blr              // We're outta here.

With your new atomic function wrapper, you can rewrite ++gConnectCount to be fully reentrant, yet keep it in high-level C (see Listing 5).

Listing 5. Incrementing an integer atomically in C
long stored;
do {
    stored = AtomicStore(gConnectCount, gConnectCount + 1, &gConnectCount);
} while(!stored);

While this article discusses atomic instructions on PowerPC32/64 processors, the POWER™ line offers a great deal of ISA compatibility with the PowerPC. The concepts outlined here are at least applicable to POWER programming, and chances are the code is as well.

Writing atomic code isn't hard, but it often requires a little esoteric knowledge about the processor you're targeting. The PowerPC offers a clean and high-performance model for handling concurrency correctly, so it's a great place to learn the reentrancy ropes.

Downloadable resources

Related topics

Zone=Multicore acceleration
ArticleTitle=Save your code from meltdown using PowerPC atomic instructions