When people talk about 64-bit computing, it's not always clear what they mean. Most often, they mean some combination of register width, bus width, or address space. For the purposes of this article, it means a processor with 64-bit registers and 64-bit addressing.
In the late 1980s, as the desktop processing industry was still struggling to transition to 32 bits, several companies (among them DEC, IBM, Motorola, and Apple) were already in the throes of 64-bit development. When the first 64-bit processor was introduced in 1992 for high-end UNIX servers and workstations, who would have thought that kind of processing power would turn up just a dozen years later in cell phones and laptops? Yet by the end of the 20th century, desktop applications had tested the computational and memory limits of 32-bit processors, and growth in high-performance consumer electronics called for increasingly powerful embedded processors under ever-tighter power constraints. Early in the 21st century, the need for strong security has grown tremendously, sparking renewed interest in encryption and decryption algorithms, as well as desktop security. The desktop and embedded industries needed another bit. To preserve the "power of two" relationships that people have come to expect from the 8, 16, and 32-bit words in existing architectures, developers are getting 32 more bits, resulting in an address space roughly 4 billion times larger than the maximum address space of a 32-bit system.
Capable of addressing an astronomical 18 billion GB, or 18 exabytes, of memory, 64-bit processors also accelerate complex mathematical calculations by operating directly on 64-bit numbers and by performing multiple operations on smaller numbers within a single CPU cycle (see Resources for the definition of exabyte). The impact of 64-bit processing is substantial: the time it takes to render a 3D model can be reduced dramatically, freeing up computing resources, compressing diagnostic timeframes, and enabling you to work more efficiently.
This processing power, which used to be available only on high-end servers for complex enterprise applications like real-time business intelligence, is now available on the desktop. Small businesses and home PC users can perform video editing and rendering tasks that were the stuff of dreams a decade ago. Just as 32-bit processing became commonplace in desktops and entry-level servers, 64-bit processing is poised to become more and more ubiquitous over the next few years. From a theoretical feature hyped in trade magazines to a reasonably cost-effective choice for high-end embedded systems, 64-bit processing has come a long way.
This kind of address space is especially useful for simulations and large databases. While home users rarely have a working set of more than 4GB of data, scientists and database technicians are quite happy to have a little more room to work on large datasets, or build larger, more complete, simulations. With modern databases frequently holding terabytes of data, the ability to have more than 4GB of what is, effectively, a working cache, can improve performance dramatically.
For applications that don't need to address memory beyond the 32-bit limit of 4GB, 64-bit systems still provide substantial benefits in processing speed. In 32-bit computing, integer math uses 32-bit wide general-purpose registers. With 64-bit computing, each general-purpose register is 64 bits wide and can represent a much larger integer. High-level languages, such as C and C++, support 64-bit mathematical operations on 32-bit processors by splitting a 64-bit number across two 32-bit registers. The 64-bit integer types (such as int64_t, typically defined as long long on 32-bit systems) fit within a single register on a 64-bit machine. This register-width difference produces a substantial difference in resource requirements when performing 64-bit math, as Table 1 illustrates.
Table 1. Resources required to load, add, and store two 64-bit integers
| Operation | Resources on 32-bit processor | Resources on 64-bit processor | Effective improvement with 64-bit |
| --- | --- | --- | --- |
| Load two 64-bit integers | 4 load instructions, 4 registers | 2 load instructions, 2 registers | Number of instructions to load data reduced by one half; registers consumed reduced by one half |
| Add two 64-bit integers | 2 add instructions (add, then add with carry), carry field | 1 add instruction | Number of instructions reduced by one half; no interlocking among instructions and no carry status |
| Store two 64-bit integers | 4 store instructions | 2 store instructions | Number of instructions to store data reduced by one half; registers consumed reduced by one half |
| Total resources | 10 instructions issued and 4 registers plus carry field | 5 instructions issued and 2 registers used | One half the instructions, less than one half the resources consumed |
Logical operations (AND, OR, XOR) also benefit from wider registers, since they can operate on a much larger data size per instruction. As a result, applications that involve the manipulation of huge data sets, such as document management and decision support, run much faster on a 64-bit system. Finally, 64-bit processors can drive 32-bit applications even faster, by handling more data per clock cycle than a 32-bit processor. Therefore, even applications that don't need to address memory beyond 4GB can benefit from 64-bit processing.
Frank Lloyd Wright once said, "Architecture is the triumph of human imagination over materials, methods, and men." Like building design, microprocessor design involves imagination and creativity, makes use of different materials and processes, and should bear in mind the intended use of the design. Decisions made during the design process have a great impact on the ultimate "form and function" of the resultant composition.
So what are the critical considerations in the design of a 64-bit processor? Note that this discussion focuses on 64-bit computing in the desktop, entry-level server, and embedded markets. Sixty-four-bit computing in the high-end server environment has been well established for several years and is outside the scope of this article; the IBM® POWER4™ and POWER5™ processors fall into that category. Success in the desktop, embedded, and small-scale server environment depends on a combination of performance, power, compatibility with existing 32-bit code, and middleware support. Some of the design factors affecting these elements are:
- Architecture design (for example, pipeline, register sets)
- Performance of 32-bit software
- Silicon manufacturing
- Power management
- System interface speed (bus architecture)
The days of "my processor's clock is faster than your processor's clock" are over. Sure, clock speed is important, but if a processor gets bogged down making calls to memory, I/O devices, and other processors in the system, what difference does it make? Remember, a chain is only as strong as its weakest link. In the microprocessor world, performance is defined by throughput and capacity -- not just clock speed. Processor frequency, cache size, memory bandwidth, and processor architecture all contribute to overall performance. At the 2004 International Solid-State Circuits Conference, a panel of processor architects from IBM, AMD, Intel, Fujitsu, Sun, and Stanford University generally agreed that chips will increasingly rely on parallelism, rather than clock rates, for achieving faster speeds.
The architecture and its instruction set form the core of a microprocessor system, and architectures are typically designed with one or more goals in mind. The specific goals of a particular design are important considerations that have an impact on the intended use of the processor. For instance, some processors are designed with an emphasis on clock speed or data crunching, while others seek to optimize throughput. Some are focused on general purpose computing, while others are designed to meet the unique needs of embedded systems. Other design goals may include native 32-bit compatibility (see next section), support for symmetric multiprocessing (SMP), and optimization for certain types of applications. Processors with specific design goals will perform better on some applications than others. Dynamic workloads will perform better on processors optimized for throughput, whereas workloads that involve predictable algorithms operating on static data call for processors architected for number crunching.
The IBM® PowerPC® 970 was designed for high-performance general-purpose processing. Multiple pipelined execution units, branch prediction, and a SIMD, or vector processing (AltiVec), unit combine to allow up to 215 in-flight instructions. With each clock cycle, up to eight instructions can be fetched from the direct-mapped 64K L1 instruction cache, broken down, and dispatched into the execution units, while the 32K write-through, two-way associative L1 data cache can fetch up to eight active data streams, which are loaded into data registers behind the execution units. Different types of instructions are processed concurrently by the execution units, which include two floating-point units, two integer units, two load/store units, a condition register unit, a branch prediction unit, and a vector processing unit. This dual-pipeline 128-bit vector engine performs SIMD processing, applying a single instruction to multiple data elements simultaneously, and uses a set of 162 specialized SIMD instructions for optimal performance.
Figure 1. PowerPC 970 architecture (Adapted from Apple's "PowerPC G5" white paper, June 2004. See Resources.)
Customers with substantial technology investments in 32-bit systems will move towards 64-bit computing at different rates and for different reasons, such as the need for large file support. Some applications are best left as 32-bit programs, but should be able to coexist with applications that are ported to 64-bit. To provide customers with investment protection while offering the flexibility to deploy 64-bit technology according to their specific business needs, 64-bit systems should support 32-bit compatibility, and 32-bit and 64-bit computing environments should be able to coexist and share resources on the same system, just as 32-bit programs have in the past.
There are two different ways of providing 32-bit compatibility in 64-bit processor design: native 32-bit support or 32-bit emulation. Native 32-bit support provides full binary compatibility with existing 32-bit applications, enabling them to run at full processor speed. Compatibility through emulation requires the translation of the 32-bit application instructions on the fly, incurring substantial processing overhead and resulting in sub-optimal 32-bit performance.
The IBM PowerPC 970 family of 64-bit processors provides native support for 32-bit processing, enabling user mode 32-bit PowerPC applications to run on the PowerPC 970 processors without any modifications. Because the 64-bit PowerPC architecture is a superset of a 32-bit processor, the PowerPC 970 processor can run 32-bit programs the same way the programs run on a 32-bit processor. The PowerPC 970 has two execution modes: 32-bit, which enables instructions and addressing to behave the same as on a 32-bit processor, and 64-bit, which produces 64-bit addressing and instruction behavior for a true 64-bit environment. Additional supervisor instructions are provided to set up and control the execution mode on a per-process level, which enables the creation of a mixed environment of concurrent 32-bit and 64-bit processes at the system level.
Some operating systems (for example, Linux) support a mix of 64-bit applications and 32-bit applications running at the same time. This allows for customer-controlled migration to a 64-bit environment, and enables customers to port only those applications that truly benefit from 64-bit computing. For maximum flexibility, the IBM PowerPC 970 processor family can execute code in 32-bit environments, mixed 32-bit and 64-bit environments, or in a pure 64-bit environment.
One reason that cache size is so important in modern processors is that even the fastest processors can get bogged down communicating with the memory controller. Conventional bidirectional buses carry data to and from the processor over the same link, incurring delays when the bus switches direction and while the processor and the memory controller negotiate use of the bus. Dual-channel unidirectional buses enable data to flow to and from the processor simultaneously, eliminating negotiation overhead and more than doubling the effective data rate. The trade-off involved in bus architecture design is cost versus performance. While a dual-channel design revs up memory performance, system costs tend to be higher due to the need for memory module pairs as well as more sophisticated chipset technology to handle the higher complexity of the memory bus.
The IBM PowerPC 970 family of processors features two unidirectional 32-bit point-to-point channels designed to operate at an integer fraction of the CPU core frequency. With a core clock speed of 2.5GHz, the front-side bus of the 90nm IBM PowerPC 970FX is theoretically capable of operating at up to 1.25GHz, for an aggregate bandwidth of up to 10GBps. This type of bus architecture achieves its highest throughput only when the number of reads and writes is fairly well balanced. A bidirectional bus architecture, as seen on Intel IA-64 and AMD Athlon processors, achieves a lower peak throughput, but it can deliver its peak throughput in either direction, making it better suited for applications that perform mostly reads or mostly writes.
The manufacturing process used in creating processor technology has a tremendous impact on both the performance and power metrics. Traditionally, new processor technology that introduces an increase in processing speed has been accompanied by an inevitable increase in power consumption. The processor industry has come to expect this. However, recent breakthroughs in chip fabrication have enabled manufacturers to produce faster processors -- with decreased power consumption. And 64-bit processors are among the first to benefit from these breakthroughs.
By integrating strained silicon and silicon-on-insulator (SOI) technology into the same manufacturing process, electrons flow faster through transistors, and neighboring transistors are isolated through an insulating layer in the silicon. The result is higher performance with reduced power consumption. Copper wiring used in place of the 30-year-old practice of connecting transistors through aluminum conduits further boosts performance, through improved conductivity and reliability.
The 90nm IBM PowerPC 970FX is the first chip fabricated using a combination of SOI, strained silicon, and copper wiring technologies, placing over 58 million transistors on a 65mm2 die -- a 50% die shrink over its predecessor, the 130nm IBM PowerPC 970. The PowerPC 970FX runs at speeds up to 2.5GHz, making it smaller, faster, and more power-efficient than the PowerPC 970. These new fabrication technologies are now being deployed by other chip manufacturers anxious to gain the same power-performance advantage over their own predecessors.
Note that the performance of the PowerPC 970 family actually exceeds that of its award-winning parent, the high-end IBM POWER4 processor, in many areas. The circuit and process technology used for the POWER4 processor was designed to achieve the levels of reliability necessary for the continuous-availability server market, at the expense of transistor switching speed. Those reliability levels can be relaxed for the desktop and small-scale server market, so the fabrication technology used for the PowerPC 970 trades some of that reliability margin for higher switching speed, a trade-off appropriate for its intended markets.
With more transistors being crammed into smaller chips in order to enhance microprocessor performance, power management has become increasingly challenging. Clock gating and other simple techniques have reached their limit, leading chip designers to implement somewhat precarious techniques, such as tweaking individual devices in critical sections of their chips to match a specific need and designing chips to operate close to their thermal limits. Ongoing research seeks to manage power dissipation while maintaining high levels of processor performance.
Historically, IBM Power Architecture microprocessors have incorporated features to help users effectively manage power dissipation. The PowerPC 750 microprocessor, produced in 0.25-µm technology, first gave users dynamic power management, in which execution units were not clocked when idle, along with three software-selectable power-saving modes. The power-saving modes reduced functionality in other areas, with nap and doze modes limiting cache and bus-snooping operations, and sleep mode turning off all functional units except interrupt handling. These techniques were an effective way to reduce power, as they reduced switching on the chip.
As the process geometries have been reduced to below 130 nm, power dissipation due to leakage currents has greatly increased. IBM addressed this challenge in the 90nm PowerPC 970FX microprocessor by integrating strained silicon and SOI into the same manufacturing process, as previously discussed. This technique speeds the flow of electrons through transistors to increase performance and provides an insulating layer in the silicon that isolates transistors and decreases power consumption.
A new approach to power management, patented by IBM, involves adding some power-control features within the processor chip. This power tuning technique, enabled through advanced system-wide tuning and controlling of processor frequency and voltage, allows designers to quickly and seamlessly change the frequency from full frequency to f/2 and f/4. The frequency switch is applied at a system level -- affecting the processor bus and the bridge and memory controller support chip as well as the processor core. The PowerPC 970FX microprocessor takes advantage of this IBM-refined power-saving technique, enabling a seamless, fine-grained, system-wide frequency and voltage change without stopping core execution units, disrupting interrupts, or disabling bus snooping.
If all of that isn't enough, of course, there's always the option of using liquid cooling, as Apple did in the 2.5GHz G5 machines (see Resources).
The 90nm IBM PowerPC 970FX leverages patented fabrication processes and power management techniques, along with ten years of 64-bit computing experience, to achieve high performance on compute- and bandwidth-intensive applications while maintaining compatibility with 32-bit code. Apparently, Apple appreciated the choices IBM processor architects made when designing the 970 family; dual 130nm PowerPC 970s form the foundation of the Power Mac G5, and the PowerPC 970FX is at the core of the Apple Xserve G5, a rack-mount server. Reliability, performance, backward compatibility, and years of IBM research and development have come together to produce 64-bit computing for the masses.
- Find the definition of exabyte on wikipedia.org.
- The Wikipedia article on 64-bit computing provides a good, general foundation and helpful links.
- The IBM Application Note "Developing Embedded Software for the IBM PowerPC 970FX Processor" discusses issues associated with developing new software and porting existing software to the PowerPC 970FX processor (IBM, July 2004).
- The IBM white paper "An Introduction to 64-bit Computing and the IBM PowerPC 970FX" provides an overview of 64-bit computing and discusses the advantages of a 64-bit operating system environment (IBM, April 2004).
- The software reference manual "PowerPC Microprocessor Family: Programming Environments Manual for 64-bit Microprocessors" (in PDF) can help you develop software that is compatible across the entire family of 64-bit PowerPC processors (IBM, July 2005).
- Learn more about power tuning in the PowerPC 970FX processor with "Frequency switching improves power management in Power Architecture chips" by Helena Purgatorio (developerWorks, September 2004).
- "IBM PowerPC 970FX power envelope and power management" provides an understanding of the PowerPC 970FX processor's advanced power management techniques (developerWorks, September 2004).
- Find out more about the International Solid-State Circuits Conference, where a panel agreed that parallelism, not clock speed, would be the biggest component of upcoming performance gains.
- Have experience you'd be willing to share with Power Architecture zone readers? Article submissions on all aspects of Power Architecture technology from authors inside and outside IBM are welcomed. Check out the Power Architecture author FAQ to learn more.
- Find more articles and resources on Power Architecture technology and all things related in the developerWorks Power Architecture technology content area.
- Download an IBM PowerPC 405 Evaluation Kit to demo a SoC in a simulated environment, or just to explore the fully licensed version of Power Architecture technology.
Cathleen Shamieh has engineering and consulting experience in the telecommunications, speech processing, computer telephony, and medical electronics industries. She can be reached at email@example.com