
Understanding 64-bit PowerPC architecture

Critical considerations in 64-bit microprocessor design

Cathleen Shamieh has engineering and consulting experience in the telecommunications, speech processing, computer telephony, and medical electronics industries. She can be reached at cathleen.shamieh@verizon.net

Summary:  Each of the leading microprocessor manufacturers has announced the availability of one or more 64-bit desktop processors, but differences exist in architectural design, fabrication, support, and intended use of each processor. This article looks at the critical issues in a few of IBM's 64-bit POWER designs, covering 32-bit compatibility, power management, processor bus design, and the manufacturing process.

Date:  19 Oct 2004
Level:  Introductory

When people talk about 64-bit computing, it's not always clear what they mean. Most often, they mean some combination of register width, bus width, or address space. For the purposes of this article, it means a processor with 64-bit registers and 64-bit addressing.

In the late 1980s, as the desktop processing industry was still struggling to transition to 32 bits, several companies (among them DEC, IBM, Motorola, and Apple) were already in the throes of 64-bit development. When the first 64-bit processor was introduced in 1992 for high-end UNIX servers and workstations, who would have thought of using that kind of processing power just a dozen years later in cell phones and laptops? Yet by the end of the 20th century, desktop applications had tested the computational and memory limits of 32-bit processors, and growth in high-performance consumer electronics called for increasingly powerful embedded processors with ever-tighter power budgets. Early in the 21st century, the need for impenetrable security has grown tremendously, sparking renewed interest in encryption and decryption algorithms, as well as desktop security. The desktop and embedded industries needed another bit. To preserve the "power of two" relationships that people have come to expect, with 8-, 16-, and 32-bit words in existing architectures, developers are getting 32 more bits, resulting in an address space that is about 4 billion (2^32) times larger than the maximum address space of a 32-bit system.

A 64-bit processor can address an astronomical 18 billion GB, or 18 exabytes, of memory (see Resources for the definition of exabyte). It also accelerates complex mathematical calculations by operating directly on 64-bit integers and by performing multiple operations on smaller numbers within a single CPU cycle. The impact of 64-bit processing is substantial: the time it takes to render a 3D model can be reduced dramatically, freeing up computing resources, compressing diagnostic timeframes, and enabling you to work more efficiently.

Billions and billions

When a kilobyte was a big chunk of space, no one cared that it was 1024 instead of 1000 bytes. When the megabyte came along, the usage was entrenched. But, as the orders of magnitude add up, so does the error.

One recommended solution is to distinguish between binary and decimal prefixes, using "kilo" for 1000 and "kibi" for 1024. So, 2^64 is 18446744073709551616, which is just over eighteen quintillion, or eighteen billion billion, bytes -- but counted in binary units of 2^30 bytes, it's really only about seventeen billion gigabytes (strictly speaking, gibibytes).

Got that down? While we're at it, we should point out that our British readers probably think we're sloppy at math, because, to them, "billion" often means "million million", and "trillion" means "million million million", so they would call 2^64 "eighteen trillion" instead of "eighteen quintillion".

Want a simpler unit for the amount of memory in a 64-bit address space? Refer to it as "lots," or fall back on Carl Sagan's old line.
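The sidebar's arithmetic is easy to check. The following Python sketch computes the size of a 64-bit address space in decimal and binary units, and how much larger it is than a 32-bit one. The variable names are illustrative only and are not taken from any library.

```python
# Sizes of the two address spaces, in bytes.
ADDRESS_SPACE_64 = 2 ** 64
ADDRESS_SPACE_32 = 2 ** 32

GB = 10 ** 9    # decimal gigabyte
GIB = 2 ** 30   # binary gigabyte (gibibyte)
EB = 10 ** 18   # decimal exabyte
EIB = 2 ** 60   # binary exabyte (exbibyte)

decimal_gigabytes = ADDRESS_SPACE_64 / GB       # roughly 18.4 billion
binary_gigabytes = ADDRESS_SPACE_64 // GIB      # exactly 2**34
decimal_exabytes = ADDRESS_SPACE_64 / EB        # roughly 18.4
binary_exabytes = ADDRESS_SPACE_64 // EIB       # exactly 2**4
growth_factor = ADDRESS_SPACE_64 // ADDRESS_SPACE_32  # exactly 2**32

print(f"{ADDRESS_SPACE_64:,} bytes")
print(f"{decimal_gigabytes:,.0f} GB or {binary_gigabytes:,} GiB")
print(f"{decimal_exabytes:.2f} EB or {binary_exabytes} EiB")
print(f"{growth_factor:,} times the 32-bit address space")
```

The gap between the decimal and binary figures is about 7% at the giga scale and about 15% at the exa scale, which is the growing "error" the sidebar describes.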

This processing power, which used to be available only on high-end servers for complex enterprise applications like real-time business intelligence, is now available on the desktop. Small businesses and home PC users can perform video editing and rendering tasks that were the stuff of dreams a decade ago. Just as 32-bit processing became commonplace in desktops and entry-level servers, so 64-bit processing is poised to become more and more ubiquitous over the next few years. From a theoretical feature bragged up in trade magazines, to a reasonably cost-effective choice for high-end embedded systems, 64-bit processing has come a long way.

This kind of address space is especially useful for simulations and large databases. While home users rarely have a working set of more than 4GB of data, scientists and database technicians are quite happy to have a little more room to work on large datasets, or build larger, more complete, simulations. With modern databases frequently holding terabytes of data, the ability to have more than 4GB of what is, effectively, a working cache, can improve performance dramatically.

More memory than you can possibly count

For applications that don't need to address memory beyond the 32-bit processor limit of 4GB, 64-bit systems still provide substantial benefits in terms of processing speed. In 32-bit computing, integer math uses 32-bit wide general-purpose registers. With 64-bit computing, each general-purpose register is 64 bits wide and can represent a much larger integer. High-level languages, such as C and C++, support 64-bit mathematical operations on 32-bit processors by splitting a 64-bit number across two 32-bit registers. The 64-bit integer types (such as int64_t, sometimes called "long long" on 32-bit systems) can be contained within a single register on a 64-bit machine. This register-width difference produces a substantial difference in resource requirements when performing 64-bit math, as Table 1 illustrates.


Table 1. Resources required to load, add, and store two 64-bit integers

Load two 64-bit integers
  • Resources on 32-bit processor: four (4) 32-bit registers to hold data; 4 load instructions
  • Resources on 64-bit processor: two (2) 64-bit registers to hold data; 2 load instructions
  • Effective improvement with 64-bit: number of load instructions reduced by one half and registers consumed reduced by one half

Add two 64-bit integers
  • Resources on 32-bit processor: 2 addition instructions, an add with carry and an add extended to include the carry
  • Resources on 64-bit processor: one addition instruction
  • Effective improvement with 64-bit: number of instructions reduced by one half, and reduced interlocking among instructions and carry status

Store two 64-bit integers
  • Resources on 32-bit processor: four (4) 32-bit registers to hold data; 4 store instructions to save data
  • Resources on 64-bit processor: two (2) 64-bit registers to hold data; 2 store instructions to save data
  • Effective improvement with 64-bit: number of store instructions reduced by one half and registers consumed reduced by one half

Total resources
  • Resources on 32-bit processor: 10 instructions issued and 4 registers plus carry field
  • Resources on 64-bit processor: 5 instructions issued and 2 registers used
  • Effective improvement with 64-bit: one half the instructions, less than one half the resources consumed
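To make the add row of Table 1 concrete, here is a minimal Python simulation of the same 64-bit addition done two ways: in 32-bit halves with an explicit carry, and as a single full-width add. This is an illustrative model rather than real machine code. The function names are hypothetical, and the comments naming the PowerPC `addc` and `adde` instructions only indicate which step each line imitates.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def add64_on_32bit(a, b):
    """Add two 64-bit values using only 32-bit pieces, as a 32-bit CPU must."""
    # Each operand occupies two 32-bit registers (four registers in total).
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32

    # First add: low words, recording the carry out (imitates addc).
    lo_sum = a_lo + b_lo
    carry = lo_sum >> 32
    lo = lo_sum & MASK32

    # Second add: high words plus the carry in (imitates adde).
    hi = (a_hi + b_hi + carry) & MASK32

    # Recombined here only so the result can be compared; a 32-bit CPU
    # would keep and store the two halves separately.
    return (hi << 32) | lo


def add64_on_64bit(a, b):
    """One full-width add; the result wraps modulo 2**64."""
    return (a + b) & MASK64


x, y = 0x00000000FFFFFFFF, 0x0000000000000001
print(hex(add64_on_32bit(x, y)))
print(hex(add64_on_64bit(x, y)))
```

The second add in the split version cannot start until the first has produced its carry, which is the "interlocking among instructions and carry status" that the table mentions.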

Logical operations (AND, OR, XOR) also benefit from wider registers, since they can operate on a much larger data size. As a result, applications that involve the manipulation of huge data sets, such as document management and decision support, run much faster on a 64-bit system. Finally, 64-bit processors can drive 32-bit applications even faster, by handling more data per clock cycle than a 32-bit processor. Therefore, even apps that don't need to address memory beyond 4GB can benefit from 64-bit processing.


The impact of design differences

Frank Lloyd Wright once said, "Architecture is the triumph of human imagination over materials, methods, and men." Like building design, microprocessor design involves imagination and creativity, makes use of different materials and processes, and should bear in mind the intended use of the design. Decisions made during the design process have a great impact on the ultimate "form and function" of the resultant composition.

So what are the critical considerations in the design of a 64-bit processor? It is important to note here that this discussion focuses on 64-bit computing in the desktop, entry-level server, and embedded markets. Sixty-four-bit computing in the high-end server environment, where the IBM® POWER4™ and POWER5™ processors belong, has been well established for several years and is outside the scope of this article. Success in the desktop, embedded, and small-scale server environment depends on a combination of performance, power, compatibility with existing 32-bit code, and middleware support. Some of the design factors affecting these elements are:

  • Architecture design (for example, pipeline, register sets)
  • Performance of 32-bit software
  • Silicon manufacturing
  • Power management
  • System interface speed (bus architecture)

Architected for what?

The days of "my processor's clock is faster than your processor's clock" are over. Sure, clock speed is important, but if a processor gets bogged down making calls to memory, I/O devices, and other processors in the system, what difference does it make? Remember, a chain is only as strong as its weakest link. In the microprocessor world, performance is defined by throughput and capacity -- not just clock speed. Processor frequency, cache size, memory bandwidth, and processor architecture all contribute to overall performance. At the 2004 International Solid-State Circuits Conference, a panel of processor architects from IBM, AMD, Intel, Fujitsu, Sun, and Stanford University generally agreed that chips will increasingly rely on parallelism, rather than clock rates, for achieving faster speeds.

Performance benchmarks

One recognized source of benchmark performance data is the non-profit Standard Performance Evaluation Corporation (SPEC). SPEC's philosophy is that relevant benchmarks should be based on real-world applications, such as weather prediction and image processing. The SPEC CPU benchmarks compare systems on a known compute-intensive workload, and the results reflect the combined performance of key system components, such as the system's processor, memory hierarchy, and compiler. For more information and the latest set of benchmark data, visit www.spec.org.

The architecture and its instruction set form the core of a microprocessor system, and architectures are typically designed with one or more goals in mind. The specific goals of a particular design are important considerations that have an impact on the intended use of the processor. For instance, some processors are designed with an emphasis on clock speed or data crunching, while others seek to optimize throughput. Some are focused on general purpose computing, while others are designed to meet the unique needs of embedded systems. Other design goals may include native 32-bit compatibility (see next section), support for symmetric multiprocessing (SMP), and optimization for certain types of applications. Processors with specific design goals will perform better on some applications than others. Dynamic workloads will perform better on processors optimized for throughput, whereas workloads that involve predictable algorithms operating on static data call for processors architected for number crunching.

The IBM® PowerPC® 970 was designed for high performance general purpose processing. Multiple pipelined execution units, branch prediction, and a SIMD (vector processing) unit based on AltiVec combine to allow up to 215 in-flight instructions. With each clock cycle, up to eight instructions can be fetched from the direct-mapped 64KB L1 instruction cache, broken down, and dispatched into the execution units, while 32KB of write-through, two-way associative L1 data cache can fetch up to eight active data streams, which are loaded into data registers behind the execution units. Different types of instructions are processed concurrently by the execution units, which include two floating-point units, two integer units, two load/store units, a condition register unit, a branch prediction unit, and a vector processing unit. This dual-pipeline 128-bit vector engine performs SIMD processing, applying a single instruction to multiple data simultaneously, and uses a set of 162 specialized SIMD instructions for optimal performance.
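As a rough illustration of what "a single instruction to multiple data" means for a 128-bit vector, the Python sketch below treats one 128-bit value as four independent 32-bit lanes and adds two such vectors lane by lane. It is a software model only, loosely patterned on a modulo word add such as AltiVec's `vadduwm`. The helper names `pack`, `unpack`, and `vector_add_words` are hypothetical.

```python
LANES = 4
LANE_BITS = 32
LANE_MASK = (1 << LANE_BITS) - 1


def pack(words):
    """Pack four 32-bit words into one 128-bit integer (word 0 is highest)."""
    if len(words) != LANES:
        raise ValueError("expected exactly four words")
    value = 0
    for word in words:
        value = (value << LANE_BITS) | (word & LANE_MASK)
    return value


def unpack(vector):
    """Split a 128-bit integer back into its four 32-bit words."""
    return [
        (vector >> (LANE_BITS * (LANES - 1 - i))) & LANE_MASK
        for i in range(LANES)
    ]


def vector_add_words(va, vb):
    """Lane-wise modulo add: one operation yields four independent sums."""
    sums = [(x + y) & LANE_MASK for x, y in zip(unpack(va), unpack(vb))]
    return pack(sums)


a = pack([1, 2, 3, 0xFFFFFFFF])
b = pack([10, 20, 30, 1])
print(unpack(vector_add_words(a, b)))
```

The last lane overflows and wraps to zero without disturbing its neighbor, which is what distinguishes a lane-wise vector add from one wide 128-bit integer add.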


Figure 1. PowerPC 970 architecture (Adapted from Apple's "PowerPC G5" white paper, June 2004. See Resources.)

Evolution and compatibility

Customers with substantial technology investments in 32-bit systems will move towards 64-bit computing at different rates and for different reasons, such as the need for large file support. Some applications are best left as 32-bit programs, but should be able to coexist with applications that are ported to 64-bit. To provide customers with investment protection while offering the flexibility to deploy 64-bit technology according to their specific business needs, 64-bit systems should support 32-bit compatibility, and 32-bit and 64-bit computing environments should be able to coexist and share resources on the same system, just as 32-bit programs have in the past.

There are two different ways of providing 32-bit compatibility in 64-bit processor design: native 32-bit support or 32-bit emulation. Native 32-bit support provides full binary compatibility with existing 32-bit applications, enabling them to run at full processor speed. Compatibility through emulation requires the translation of the 32-bit application instructions on the fly, incurring substantial processing overhead and resulting in sub-optimal 32-bit performance.

The IBM PowerPC 970 family of 64-bit processors provides native support for 32-bit processing, enabling user mode 32-bit PowerPC applications to run on the PowerPC 970 processors without any modifications. Because the 64-bit PowerPC architecture is a superset of the 32-bit PowerPC architecture, the PowerPC 970 processor can run 32-bit programs the same way the programs run on a 32-bit processor. The PowerPC 970 has two execution modes: 32-bit, which enables instructions and addressing to behave the same as on a 32-bit processor, and 64-bit, which produces 64-bit addressing and instruction behavior for a true 64-bit environment. Additional supervisor instructions are provided to set up and control the execution mode on a per-process level, which enables the creation of a mixed environment of concurrent 32-bit and 64-bit processes at the system level.
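One way to picture the difference between the two execution modes is to look at how an address calculation behaves in each. The Python sketch below is a deliberately simplified model. It assumes that in 32-bit mode the upper 32 bits of a computed address are ignored, so addresses wrap at 4GB just as they would on a 32-bit processor. The function and parameter names are hypothetical and do not correspond to any real API.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def effective_address(base, offset, sixty_four_bit_mode):
    """Model base + offset addressing under the selected execution mode."""
    # The sum is formed at full register width in either mode.
    full = (base + offset) & MASK64
    if sixty_four_bit_mode:
        return full
    # Assumed 32-bit mode behavior: only the low 32 bits address memory.
    return full & MASK32


base, offset = 0xFFFFFFF0, 0x20
print(hex(effective_address(base, offset, sixty_four_bit_mode=False)))
print(hex(effective_address(base, offset, sixty_four_bit_mode=True)))
```

Because the mode is a per-process setting, an operating system could in principle run one process with the flag off and another with it on, which is the mixed environment the text describes.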

Some operating systems (for example, Linux) support a mix of 64-bit applications and 32-bit applications running at the same time. This allows for customer-controlled migration to a 64-bit environment, and enables customers to port only those applications that truly benefit from 64-bit computing. For maximum flexibility, the IBM PowerPC 970 processor family can execute code in 32-bit environments, mixed 32-bit and 64-bit environments, or in a pure 64-bit environment.


Waiting for something to work on: front-side bus architecture

One reason that cache size is so important in modern processors is that even the fastest processors can get bogged down communicating with the memory controller. Conventional bidirectional buses carry data to and from the processor over the same link, incurring delays when the bus switches direction and while the processor and the memory controller negotiate use of the bus. Dual-channel unidirectional buses enable data to flow to and from the processor simultaneously, eliminating negotiation overhead and more than doubling the effective data rate. The trade-off involved in bus architecture design is cost versus performance. While a dual-channel design revs up memory performance, system costs tend to be higher due to the need for memory module pairs as well as more sophisticated chipset technology to handle the higher complexity of the memory bus.

The IBM PowerPC 970 family of processors features two unidirectional 32-bit point-to-point channels designed to operate at an integer fraction of the CPU core frequency. With a clock speed of 2.5GHz, the front-side bus of the 90nm IBM PowerPC 970FX is theoretically capable of operating at up to 1.25GHz, for an aggregate bandwidth of up to 10GBps. This type of bus architecture achieves its highest throughput only when the numbers of reads and writes are fairly well balanced. A bidirectional bus architecture, as seen on Intel IA-64 and AMD Athlon processors, achieves a lower peak throughput, but it can deliver its peak throughput in either direction, making it better suited for applications that perform mostly reads or mostly writes.
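The 10GBps figure follows from the numbers above: two channels, each 32 bits (4 bytes) wide, each clocked at 1.25GHz. The Python sketch below reproduces that arithmetic and adds a toy model of why balance matters. The model ignores protocol and turnaround overhead, the function names are hypothetical, and the 6.4GBps bidirectional bus is an invented comparison point, not a figure for any real processor.

```python
def peak_bandwidth_gb_per_s(channels, width_bits, clock_ghz):
    """Theoretical peak: channels x bytes per transfer x transfers per ns."""
    return channels * (width_bits / 8) * clock_ghz


def transfer_time_unidirectional(read_gb, write_gb, per_channel_gb_per_s):
    """Reads and writes use separate channels, so they overlap in time."""
    return max(read_gb, write_gb) / per_channel_gb_per_s


def transfer_time_bidirectional(read_gb, write_gb, bus_gb_per_s):
    """Reads and writes share one link, so their times add up."""
    return (read_gb + write_gb) / bus_gb_per_s


aggregate = peak_bandwidth_gb_per_s(channels=2, width_bits=32, clock_ghz=1.25)
per_channel = aggregate / 2
print(f"aggregate peak: {aggregate} GB/s, per direction: {per_channel} GB/s")

HYPOTHETICAL_BIDIRECTIONAL = 6.4  # GB/s, invented for comparison only

for reads, writes in [(5, 5), (10, 0)]:
    uni = transfer_time_unidirectional(reads, writes, per_channel)
    bidi = transfer_time_bidirectional(reads, writes, HYPOTHETICAL_BIDIRECTIONAL)
    print(f"{reads}GB reads + {writes}GB writes: "
          f"unidirectional {uni:.2f}s, bidirectional {bidi:.2f}s")
```

In this model the balanced workload finishes faster on the paired unidirectional channels, while the read-only workload leaves one channel idle and finishes faster on the shared bus, matching the trade-off described in the text.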


Manufacturing process matters too

The manufacturing process used in creating processor technology has a tremendous impact on both the performance and power metrics. Traditionally, new processor technology that introduces an increase in processing speed has been accompanied by an inevitable increase in power consumption. The processor industry has come to expect this. However, recent breakthroughs in chip fabrication have enabled manufacturers to produce faster processors -- with decreased power consumption. And 64-bit processors are among the first to benefit from these breakthroughs.

Integrating strained silicon and silicon-on-insulator (SOI) technology into the same manufacturing process lets electrons flow faster through transistors and isolates neighboring transistors through an insulating layer in the silicon. The result is higher performance with reduced power consumption. Copper wiring used in place of the 30-year-old practice of connecting transistors through aluminum conduits further boosts performance, through improved conductivity and reliability.

A powerful lineage

One of the original design goals of the Apple-IBM-Motorola partnership that developed the PowerPC architecture back in 1991 was to define a 64-bit architecture that was a superset of the 32-bit architecture, in order to provide application binary compatibility for 32-bit applications. The PowerPC architecture that was born of this partnership is -- and always was -- a 64-bit architecture derived from the IBM POWER architecture. From the very beginning, PowerPC was designed to support switching between the 64-bit mode and the 32-bit mode. As a relative of the IBM POWER4 and POWER5 processors, the PowerPC 970 family may be a new generation of PowerPC processors, but it inherits a history of over ten years of 64-bit computing at IBM.

The 90nm IBM PowerPC 970FX is the first chip fabricated using a combination of SOI, strained silicon, and copper wiring technologies, placing over 58 million transistors on a 65mm² die -- a 50% die shrink over its predecessor, the 130nm IBM PowerPC 970. The PowerPC 970FX runs at speeds up to 2.5GHz, and it is smaller, faster, and more power-efficient than the PowerPC 970. These new fabrication technologies are now being deployed by other chip manufacturers eager to gain the same power-performance advantage over their own predecessors.

Note that the performance of the PowerPC 970 family actually exceeds that of its award-winning parent, the high-end IBM POWER4 processor, in many areas. This is because the circuit and process technology used for the POWER4 processor was designed to achieve the levels of reliability necessary for the continuous availability server market, at the expense of transistor switching speed. Those levels can be relaxed for the desktop and small-scale server market, so the fabrication technology used for the PowerPC 970 trades some of that reliability margin for higher performance.


Overcoming a power struggle

With more transistors being crammed into smaller chips in order to enhance microprocessor performance, power management has become increasingly challenging. Clock gating and other simple techniques have reached their limit, leading chip designers to implement somewhat precarious techniques, such as tweaking individual devices in critical sections of their chips to match a specific need and designing chips to operate close to their thermal limits. Ongoing research seeks to manage power dissipation while maintaining high levels of processor performance.

Historically, IBM Power Architecture microprocessors have incorporated features to help users effectively manage power dissipation. The PowerPC 750 microprocessor, produced in 0.25-µm technology, was the first to give users dynamic power management: execution units were not clocked when idle, and three software-selectable power-saving modes were available. The power-saving modes reduced functionality in other areas of the chip, with nap and doze modes limiting cache and bus snooping operations, and sleep mode turning off all functional units except for interrupts. These techniques were an effective way to reduce power, because they reduced switching on the chip.

As the process geometries have been reduced to below 130nm, power dissipation due to leakage currents has greatly increased. IBM addressed this challenge in the 90nm PowerPC 970FX microprocessor by integrating strained silicon and SOI into the same manufacturing process, as previously discussed. This technique speeds the flow of electrons through transistors to increase performance and provides an insulating layer in the silicon that isolates transistors and decreases power consumption.

A new approach to power management, patented by IBM, involves adding some power-control features within the processor chip. This power tuning technique, enabled through advanced system-wide tuning and controlling of processor frequency and voltage, allows designers to quickly and seamlessly change the frequency from full frequency to f/2 and f/4. The frequency switch is applied at a system level -- affecting the processor bus and the bridge and memory controller support chip as well as the processor core. The PowerPC 970FX microprocessor takes advantage of this IBM-refined power-saving technique, enabling a seamless, fine-grained, system-wide frequency and voltage change without stopping core execution units, disrupting interrupts, or disabling bus snooping.
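To get a feel for why stepping the frequency down to f/2 or f/4 saves power, especially when the voltage drops too, the Python sketch below applies the common first-order approximation that dynamic power scales with V² × f. This model is an assumption brought in for illustration. It is not taken from the article, it ignores leakage current, and the voltage ratios are invented rather than PowerPC 970FX specifications.

```python
def relative_dynamic_power(freq_divisor, voltage_ratio):
    """Dynamic power relative to full speed, assuming P ~ C * V**2 * f.

    freq_divisor  -- 1, 2, or 4 for f, f/2, or f/4
    voltage_ratio -- supply voltage as a fraction of the full-speed voltage
    """
    return (voltage_ratio ** 2) / freq_divisor


# Hypothetical operating points: (frequency divisor, voltage ratio).
operating_points = [(1, 1.0), (2, 0.85), (4, 0.75)]

for divisor, voltage_ratio in operating_points:
    power = relative_dynamic_power(divisor, voltage_ratio)
    print(f"f/{divisor} at {voltage_ratio:.2f} x Vfull: "
          f"{power:.1%} of full-speed dynamic power")
```

Under these assumptions, lowering the frequency alone cuts dynamic power in proportion, and lowering the voltage along with it compounds the saving because voltage enters the approximation squared.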

If all of that isn't enough, of course, there's always the option of using liquid cooling, as Apple did in the 2.5GHz G5 machines (see Resources).


Apple's pick

The 90nm IBM PowerPC 970FX leverages patented fabrication processes and power management techniques, along with ten years of 64-bit computing experience, to achieve high performance on compute- and bandwidth-intensive applications while maintaining compatibility with 32-bit code. Apparently, Apple appreciated the choices IBM processor architects made when designing the 970 family; dual 130nm PowerPC 970s form the foundation of the Power Mac G5, and the PowerPC 970FX is at the core of the Apple Xserve G5, a rack-mount server. Reliability, performance, backward compatibility, and years of IBM research and development have come together to produce 64-bit computing for the masses.


Resources
