Just two years ago, in November of 2003, Microsoft and IBM announced the licensing of IBM semiconductor technology for use in "future Xbox products." On Tuesday, October 25, 2005, at the Microprocessor Reports Fall Processor Forum on behalf of IBM and Microsoft, I disclosed technical details about the CPU chip we developed. This chip is being manufactured in the IBM East Fishkill 300mm foundry as well as in Chartered Semiconductor's 300mm foundry in Singapore. Microsoft has announced plans to officially launch sales of the Xbox 360 on November 22 in time for the 2005 holiday season. This paper reviews the new processor and its development process. (Because of the density of technical terms, you might find it helpful to review the sidebar, "Acronym alley".)
The CPU was designed uniquely for Microsoft and for use in the Xbox 360 using the system architecture specifically defined around customer requirements. Microsoft and IBM engineers worked together during the definition phase of the project to specify a design to satisfy the constraints of a mass-produced consumer device. We used existing PowerPC processor and subsystem technology and designs as a foundation to jump-start the development. The chip was developed by the IBM Engineering & Technology Services group, leveraging results from the IBM R&D labs.
The Xbox 360 system has a single chip (with 165 million transistors) for its CPU. This chip is in fact a three-way symmetric multiprocessor design. The three PowerPC cores are identical, except that they are physically reflected through the X and Y axes. Each of the CPU cores is a specialized PowerPC core with a VMX128 extension related to (and partially compatible with) the VMX instructions in the G4 and G5 CPUs. The three CPU cores share a 1MB Level2 cache. Each core has 32KB of Level1 data cache and 32KB of Level1 instruction cache. The chip's front-side bus/physical interface has an aggregate bandwidth of 21.6GB/second (10.8GB/second in each direction), and runs at 5.4GHz. The high-frequency clocks are generated on-chip by four phase-locked loops: two for the core clocks, two for the PHY clock.
Figure 1. The Xbox 360 chip
The Xbox 360 CPU chip has testing and debug functions, including tracing, configuration control, and performance monitoring features. Access to these functions is through the block in Figure 1 labeled test/debug. The block labeled Miscellaneous IO provides a JTAG port, a POST monitor, and an interface for a serial EEPROM in case patch logic configuration was needed during bring-up.
To improve manufacturing yield, the SRAM arrays used in the L1 and L2 caches support both row and column redundancy. This redundancy is enabled at chip test by burning electronic fuses (eFuses). The eFuses are one of the unique capabilities of the IBM 90nm CMOS SOI technology in which the chip is fabricated. The eFuses were also used to record a unique supply voltage to be used for each chip. Finally, to help reduce the potential impact of process variations on the operation of the PHY analog circuits, eFuses were used for parametric adjustment in the analog units.
The physical package of the chip matters, too. A crucial design goal in the CPU of a consumer electronics device is high volume with good yield and comparatively low cost. The package is a 2-2-2 FC-PBGA, measuring 31mm by 31mm.
The CPU cores (there are three) are the highest frequency PowerPC cores currently available, running at 3.2GHz. Throughout, the CPU uses extensive clock gating, leaving pipelines shut down until there are instructions to be processed; this dramatically reduces power consumption under real-world loads. The basic design is a 64-bit PowerPC architecture, with the complete PowerPC ISA available.
Figure 2. The PowerPC core
The instruction unit is multithreaded, with two simultaneous threads. The instruction cache is 32KB. The core implements a two-issue, in-order execution microarchitecture. This means two instructions are issued at a time but execution within the units is in sequential order. Execution is delayed to cover the load use penalty without stalling the pipeline.
The L1 instruction cache (Icache) is a 32KB cache with parity error checking. It is a two-way set-associative cache with 128B lines. First-level translation for instruction addresses is done using a 64-entry, two-way set-associative effective-to-real address translation cache (ERAT).
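As a rough illustration of what this geometry implies, the sketch below splits an address into tag, set index, and byte offset for a 32KB, two-way cache with 128B lines. The `CacheGeometry` class and its field names are hypothetical, and the simple modulo indexing is a textbook assumption, not a description of how the actual Icache selects a set.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CacheGeometry:
    """Textbook set-associative cache geometry (illustrative only)."""
    size_bytes: int
    ways: int
    line_bytes: int

    @property
    def sets(self) -> int:
        # Total lines divided by the number of ways gives the number of sets.
        return self.size_bytes // (self.ways * self.line_bytes)

    def split(self, addr: int) -> Tuple[int, int, int]:
        """Return (tag, set_index, byte_offset) for an address."""
        offset = addr % self.line_bytes
        index = (addr // self.line_bytes) % self.sets
        tag = addr // (self.line_bytes * self.sets)
        return tag, index, offset


# Figures from the article: 32KB, two-way set associative, 128B lines.
ICACHE = CacheGeometry(size_bytes=32 * 1024, ways=2, line_bytes=128)

print(ICACHE.sets)            # 128
print(ICACHE.split(0x12345))  # (4, 70, 69)
```

With these figures the cache has 128 sets, so under this simple model seven address bits select the byte within a line and the next seven select the set.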
The two issued instructions can go to one of five execution pipes: Branch (which is really part of the instruction unit), Load/Store, Fixed Point, Floating Point, and VMX. Difficult instructions are implemented through microcode. At dispatch they are cracked and converted into multiple micro-ops.
The branch unit includes a 4KB two-way set-associative Branch History Table per thread.
The Fixed Point pipe actually has two units: one to handle simple operations (add/sub, cmp, logical ops, and rotate), and one to handle complex operations such as multiply and divide.
The Load/Store pipe handles access to the L1 Data cache and the storage hierarchy. Like the L1 Icache, the L1 Dcache is a 32KByte cache with parity error checking. However, it is four-way set associative. It is "store through" and provides non-blocking access so a cache miss does not hold up a subsequent hit.
A 64-entry, two-way set-associative ERAT handles first-level data address translation. Second-level translation for both data and instructions is handled by a 1K-entry, four-way set-associative TLB (translation lookaside buffer), which can be software-managed as well as hardware-managed.
Figure 3. The combined VMX and FPU unit
Floating point instructions are sent to a combined VMX/FPU unit, which supports two simultaneous threads for the VMX and two for the FPU. Once again, the delayed-execution issue queue reduces load latency to two cycles. The load/store unit (LSU) might operate out-of-order with respect to the FPU, but the final results are correct. Each stage in the FP/VMX is also 11 FO4. As a result, the pipelines are quite deep and instructions take a significant number of cycles to complete. Scalar double-precision floating point operations have 10-cycle latency. VMX operations have four-cycle or 14-cycle latency, depending on the operation.
Note: 11 FO4 refers to the latch-to-latch delay within the pipeline stage. It is a technology-independent way of indicating how much logic can go into each pipe stage. This also indicates how many pipeline stages are needed, since the less logic you can put in a stage, the more stages are required for the same function. However, the less logic you put in a stage, the faster the clock frequency. The metric being used is the delay of a single inverter circuit fanning out to four other circuit elements. FO4 stands for Fan Out of 4.
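The relationship between stage depth and clock frequency can be worked through with the article's own numbers. The sketch below derives the FO4 delay implied by a 3.2GHz clock and 11 FO4 per stage; that per-FO4 figure is a back-of-the-envelope result under the assumption that the whole clock period is the 11 FO4 latch-to-latch delay, not a measured property of the process. The function name `clock_ghz` is hypothetical.

```python
CORE_CLOCK_HZ = 3.2e9   # core frequency quoted in the article
FO4_PER_STAGE = 11      # latch-to-latch depth quoted in the article

# One clock period in picoseconds.
cycle_ps = 1e12 / CORE_CLOCK_HZ

# FO4 delay implied by those two numbers (derived, not measured).
implied_fo4_ps = cycle_ps / FO4_PER_STAGE


def clock_ghz(fo4_per_stage: float, fo4_delay_ps: float = implied_fo4_ps) -> float:
    """Clock frequency if every stage holds fo4_per_stage FO4 delays of logic."""
    return 1e3 / (fo4_per_stage * fo4_delay_ps)


print(f"cycle time:  {cycle_ps:.1f} ps")           # 312.5 ps
print(f"implied FO4: {implied_fo4_ps:.1f} ps")     # 28.4 ps
print(f"22 FO4/stage -> {clock_ghz(22):.1f} GHz")  # 1.6 GHz
```

Doubling the logic per stage to 22 FO4 would halve the clock under this model, which is the trade-off the note describes: shallower stages clock faster but need more of them for the same function.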
While the term VMX is familiar to PowerPC users, the implementation on the Xbox 360 processor is a new design called VMX128 which was specially enhanced to accelerate 3D graphics and game physics.
The number of vector registers was increased from 32 to 128. All 128 registers are directly addressable, and the original 32 registers are mapped to the first 32 entries of the 128-entry vector register file, and so are compatible with the original PowerPC ISA. We also added a number of new instructions. Instructions were added to calculate the dot-product of two vectors made from three or four floating point values. Data formatting instructions were added to help improve the processing of data that has been packed into memory to reduce the program size. These include instructions for rotate and insert operations, pack/unpack instructions for handling Direct3D® data types, and loads and stores for misaligned data. The VMX128 ISA is binary-compatible with a subset of VMX. A few vector floating point and vector integer instructions are no longer supported, and attempting to execute them will result in the system illegal instruction handler being invoked.
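To make the dot-product instructions concrete, here is a sketch of the arithmetic they perform, modeling a vector register as a sequence of four floats. The function names `dot3` and `dot4` are hypothetical, not VMX128 mnemonics, and the sketch does not model single-precision rounding or how the hardware places the result in the destination register.

```python
from typing import Sequence


def _dot(a: Sequence[float], b: Sequence[float], lanes: int) -> float:
    # Multiply the first `lanes` elements pairwise and sum the products.
    return sum(x * y for x, y in zip(a[:lanes], b[:lanes]))


def dot3(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of the first three lanes; the fourth lane is ignored."""
    return _dot(a, b, 3)


def dot4(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of all four lanes."""
    return _dot(a, b, 4)


va = [1.0, 2.0, 3.0, 4.0]
vb = [5.0, 6.0, 7.0, 8.0]
print(dot3(va, vb))  # 38.0
print(dot4(va, vb))  # 70.0
```

The three-element form suits positions and normals stored as (x, y, z) with an unused fourth lane, which is a common layout in 3D graphics and physics code.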
The Level2 cache provides 1MB of cache shared by the three processing units. It uses a MESI protocol for memory coherency, and is eight-way set associative. The cache itself provides single-bit correct, double-bit detect ECC validation.
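For readers unfamiliar with MESI, the sketch below shows the textbook state transitions for a single cache line as seen by one cache. It is the generic protocol, not a description of this chip's coherency logic; the `State` and `Event` names and the `others_have_copy` flag are hypothetical, and side effects such as writebacks are only noted in comments.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"    # only copy, dirty
    EXCLUSIVE = "E"   # only copy, clean
    SHARED = "S"      # clean, other copies may exist
    INVALID = "I"     # no valid copy


class Event(Enum):
    LOCAL_READ = 1
    LOCAL_WRITE = 2
    SNOOP_READ = 3    # another cache reads the line
    SNOOP_WRITE = 4   # another cache writes or claims ownership of the line


def next_state(state: State, event: Event, others_have_copy: bool = False) -> State:
    """Textbook MESI transition for one line in one cache."""
    if event is Event.LOCAL_WRITE:
        # Any local write ends in Modified (other copies get invalidated).
        return State.MODIFIED
    if event is Event.SNOOP_WRITE:
        # Another writer takes ownership (a Modified line is written back first).
        return State.INVALID
    if event is Event.LOCAL_READ:
        if state is State.INVALID:
            return State.SHARED if others_have_copy else State.EXCLUSIVE
        return state
    # SNOOP_READ: a valid line drops to Shared (Modified supplies its data).
    return State.INVALID if state is State.INVALID else State.SHARED


print(next_state(State.INVALID, Event.LOCAL_READ))    # State.EXCLUSIVE
print(next_state(State.EXCLUSIVE, Event.SNOOP_READ))  # State.SHARED
```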
Figure 4. The architecture of the L2 cache
Figure 4 shows the layout of the cache. The three processor cores are at the top of the figure and communicate with the cache through the crossbar, which operates at full processor frequency. The rest of the L2 cache operates at half processor frequency. There are eight store-gathering buffers per core in the cacheable data path, which are non-sequential to improve performance. Non-cacheable requests go through a separate data path, providing four buffers for each core; these buffers are sequential to simplify ordering for non-cacheable operations.
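Store gathering can be pictured as merging several small stores to the same cache line into one buffer before they go to the L2. The sketch below is a simplified model under stated assumptions: the count of eight buffers comes from the article, but the 128B line size for these buffers, the oldest-first flush policy, and the `StoreGatherer` class itself are assumptions made for illustration.

```python
from collections import OrderedDict

LINE_BYTES = 128   # assumption: same line size as the Icache
NUM_BUFFERS = 8    # from the article: eight store-gathering buffers per core


class StoreGatherer:
    """Merge byte stores by line; flush the oldest buffer when all are in use."""

    def __init__(self):
        self.buffers = OrderedDict()  # line address -> {byte offset: value}
        self.flushed = []             # (line address, bytes) sent on to the L2

    def store(self, addr: int, data: bytes) -> None:
        for i, byte in enumerate(data):
            a = addr + i
            line = a - (a % LINE_BYTES)
            if line not in self.buffers:
                if len(self.buffers) == NUM_BUFFERS:
                    self._flush_oldest()
                self.buffers[line] = {}
            # A later store to the same byte simply overwrites the earlier one.
            self.buffers[line][a % LINE_BYTES] = byte

    def _flush_oldest(self) -> None:
        line, contents = self.buffers.popitem(last=False)
        self.flushed.append((line, contents))


g = StoreGatherer()
g.store(0x1000, b"\x01\x02")
g.store(0x1010, b"\x03")   # same line: gathered into the existing buffer
print(len(g.buffers))      # 1
```

Two stores that land in the same line occupy a single buffer, so the L2 sees one wider write instead of two narrow ones.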
The L2 cache as a whole supports high-bandwidth streaming. Using a new instruction called Extended Data Cache Block Touch, it is possible to prefetch data directly from main memory to the L1 cache. This significantly reduces L2 thrashing on prefetch which can be a significant problem on an L2 cache of this size if care is not taken. Using an aggressive hit under miss capability implemented in the L1 Dcaches, up to eight outstanding load/prefetches are possible per core.
The L1 caches are write-through, and a store to an address not held in the L1 will not allocate a line into the L1. The L2 cache has a configurable set-locking capability, so that streaming through a locked set does not thrash the rest of the cache.
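The write-through, no-allocate store policy described above can be sketched as follows. This is a word-granular toy model with no capacity limit or replacement, and the `WriteThroughL1` class is hypothetical; it only illustrates which level is updated on a store hit versus a store miss.

```python
class WriteThroughL1:
    """Toy L1 model: write-through, allocate on load miss only."""

    def __init__(self, l2: dict):
        self.lines = {}   # address -> value held in the L1
        self.l2 = l2      # backing store standing in for the L2

    def load(self, addr: int) -> int:
        if addr not in self.lines:
            # Load miss: fetch from the L2 and allocate in the L1.
            self.lines[addr] = self.l2.get(addr, 0)
        return self.lines[addr]

    def store(self, addr: int, value: int) -> None:
        # Write-through: the L2 always receives the store.
        self.l2[addr] = value
        # Store hit updates the L1 copy; a store miss does NOT allocate.
        if addr in self.lines:
            self.lines[addr] = value


l2 = {}
l1 = WriteThroughL1(l2)
l1.store(0x40, 7)
print(0x40 in l1.lines, l2[0x40])  # False 7
```

Because every store reaches the L2, freshly written data is always visible there, which is the property the procedural geometry scheme relies on.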
Procedural geometry is a technique that Xbox 360 game developers can use to reduce memory utilization and bandwidth. Graphical objects have traditionally been represented as collections of triangles. A sphere or other curved surface, however, can be represented more compactly as an equation. Code running on one of the CPU cores can then "procedurally generate" the equivalent triangles to represent the object. The data representing these triangles will reside in the L2 since the L1 is store-through. The GPU can read modified data directly from the L2 cache without causing a castout or change in cache state. This reduces the use of main-memory bandwidth and also leaves the L2 contents unchanged.
The FSB architecture of the Xbox 360 is specifically designed to meet the throughput and latency requirements of a gaming platform. The design, test, and validation of the FSB were performed by IBM, and common VHDL was used in both the CPU and in ATI's GPU, even though the chips were built with different methodology, technology, frequency, and even data width! The FSB accepts commands from the GPU or CPU, reorders them, tracks them, and ensures correct delivery to the other chip. The physical layer has 10.8GB/sec bandwidth in each direction, a target specified to support procedural geometry. On the CPU side, this interfaces to a 1.35GHz, 8B-wide FSB dataflow; on the GPU side, it connects to a 16B-wide FSB dataflow running at 675MHz.
The transaction layer provides a common functional interface to the two chips. It manages the Link Layer protocol for reliable packet delivery. It also performs command reordering and manages the two virtual channels. The two virtual channels are used for request and response. This is done primarily for deadlock avoidance, but it also allows configurable performance by setting channel priority.
The link layer provides link training, error detection and retransmission, and flow control. We architected an enhanced soft error recovery mechanism to support the use of lower-cost manufacturing components. Link initialization must also be reliable without software intervention.
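Error detection with retransmission, in general terms, means the receiver checks each packet and the sender repeats it until a clean copy arrives. The sketch below illustrates that idea only: the packet layout, the use of CRC-32 from Python's `zlib` module, the retry limit, and the function names are all assumptions for illustration, since the article does not describe the link's actual error-detection code or packet format.

```python
import zlib
from typing import Callable, Optional, Tuple


def frame(seq: int, payload: bytes) -> bytes:
    """Build a packet: one sequence byte, the payload, then a 4-byte CRC."""
    body = bytes([seq & 0xFF]) + payload
    return body + zlib.crc32(body).to_bytes(4, "big")


def check(packet: bytes) -> Optional[Tuple[int, bytes]]:
    """Return (seq, payload) if the CRC matches, otherwise None."""
    body, crc = packet[:-4], packet[-4:]
    if zlib.crc32(body).to_bytes(4, "big") != crc:
        return None
    return body[0], body[1:]


def send_reliably(payload: bytes, wire: Callable[[bytes], bytes],
                  seq: int = 0, max_tries: int = 4) -> Tuple[bytes, int]:
    """Resend until the far side sees a clean packet; return (payload, tries)."""
    for attempt in range(1, max_tries + 1):
        received = check(wire(frame(seq, payload)))
        if received is not None and received[0] == seq:
            return received[1], attempt
    raise RuntimeError("link failed after retries")
```

Here `wire` stands in for the physical link: any callable that takes the transmitted bytes and returns what the receiver sees, possibly corrupted. A transient error costs one extra transmission and is invisible to the layers above.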
The physical layer (PHY) itself is structured as two unidirectional links, each link consisting of two single-byte lanes, with one clock each. The links are source-synchronous, so the receive clock is sent with the data. The clocks run at 5.4GHz, and each link delivers 10.8 GB/sec bandwidth.
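The bandwidth figures quoted for the PHY and for the two FSB dataflows are mutually consistent, which a few lines of arithmetic confirm. The sketch assumes each single-byte lane moves one byte per 5.4GHz clock, which is the reading that reproduces the article's numbers; the helper name `bandwidth_mb_per_s` is hypothetical.

```python
def bandwidth_mb_per_s(width_bytes: int, clock_mhz: int) -> int:
    """Bytes per transfer times transfers per microsecond gives MB/s."""
    return width_bytes * clock_mhz


# One PHY link: two single-byte lanes clocked at 5.4GHz.
phy_link = bandwidth_mb_per_s(width_bytes=2, clock_mhz=5400)

# FSB dataflows on either side of the link, from the article.
cpu_side = bandwidth_mb_per_s(width_bytes=8, clock_mhz=1350)
gpu_side = bandwidth_mb_per_s(width_bytes=16, clock_mhz=675)

# Two unidirectional links, one in each direction.
aggregate = 2 * phy_link

print(phy_link, cpu_side, gpu_side)  # 10800 10800 10800
print(aggregate)                     # 21600
```

Each side's dataflow matches the 10.8GB/sec of a single link, and the two links together give the 21.6GB/sec quoted for the chip's front-side bus interface.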
The PHY is on a separate, fixed 1.1V voltage source from the rest of the CPU. The FSB is designed for low cost (6 inches of wire on a simple system card), but also for reliability. The most significant part of the PHY design is the analog portion, which is implemented in Current Mode Logic to support the very low jitter and high noise tolerance required. Termination on the link, which improves signaling quality, is controlled dynamically at link training. Low-tolerance on-chip resistors are dynamically switched in and out to adjust the termination to 50 ohms. This provides robustness for a chip that will be used in a small box next to a powerful graphics processor and assorted other hardware, and which will not enjoy the benefits of a well-maintained data center.
Figure 5. First pass hardware performance
Figure 5 shows the transmit eye diagram and the receive error rate bathtub curve measured in the lab from the first pass hardware. The eye diagram shows a very clean transmit signal with very little jitter and noise. The error rate bathtub curve, superimposed on the eye diagram, shows a very low error rate in the eye where the link will operate.
The package design for the chip must carefully control all aspects of interfacing the chip electrically to the rest of the world. This includes high-speed signaling and power distribution or supply voltage. For high-speed signaling, the package must work to minimize signal attenuation between the package and the socket, loss due to reflection within the package, and crosstalk between adjacent signal pairs. For power distribution, the package must ensure the voltage available for the internal circuits never decreases more than 80mV as the current load changes due to dynamic changes in software characteristics.
The physical link specification included the receiver and transmitter performance, the chip package, board layout, and wiring constraints. The specification was created by IBM and used by Microsoft to design the system board. In the end, the PHY design, the FSB architecture, the link specification, and the package design all worked together to produce the needed reliability and high-volume manufacturability.
The Xbox 360 CPU has an array of testing and debug features. Console games are held to rigid quality standards and harsh deadlines, making good support for debugging a significant benefit. The Xbox 360 CPU allows full-speed operation while tracing execution and running tests; this helps to maximize defect coverage, including marginally slow circuits. The analog PHY is likewise covered by a built-in self test (BIST). Over a thousand internal signals can be traced, running at full speed. Robust local and global triggers and pattern-matching capabilities are available. Additionally, the CPU provides an external debug bus for extended traces; this runs at 1/4 full speed for the CPU, but lets the FSB run at full speed.
The trace function is controlled through the JTAG interface and allows logic signals to be sampled and stored within the on-chip trace arrays. Each of the processor cores and the L2 cache has two dedicated trace arrays and controller. The FSB has one trace array and controller. The rest of the chip units share a single trace array and controller. Each controller can be set up to trace 64 logic signals and to act as a triggered logic analyzer, allowing great flexibility in the timing and coverage of events. The on-chip trace buffers allow 1024 samples. If the trace requires more space than the on-chip arrays support, the external debug bus can be set up to capture the data externally. Because the CPU runs so fast, an individual sample is limited in the number of logic signals captured. Tracing can be turned on and off at any time. However, the trace buffers are initialized at the start of each trace event, so only the most recent trace event data is available.
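A trace array of this kind behaves much like a small triggered logic analyzer: a fixed-depth buffer of 64-bit samples that is cleared when a trace starts. The sketch below models that behavior with the 64-signal width and 1024-sample depth from the article. The `TraceArray` class, its mask/value trigger, and the post-trigger sample count are hypothetical simplifications, not the chip's actual trigger logic.

```python
from collections import deque


class TraceArray:
    """Fixed-depth sample buffer with an optional pattern-match trigger."""

    def __init__(self, depth: int = 1024, width: int = 64):
        self.width_mask = (1 << width) - 1
        self.samples = deque(maxlen=depth)  # oldest samples fall off the front
        self.running = False
        self.trigger = None      # (mask, value) or None
        self.post_trigger = 0
        self._remaining = None   # samples left to capture after the trigger

    def start(self, trigger_mask=None, trigger_value=0, post_trigger=0):
        # Buffers are initialized at the start of each trace event.
        self.samples.clear()
        self.trigger = None if trigger_mask is None else (trigger_mask, trigger_value)
        self.post_trigger = post_trigger
        self._remaining = None
        self.running = True

    def stop(self):
        self.running = False

    def clock(self, signals: int) -> None:
        """Capture one sample of the traced signals, if tracing is active."""
        if not self.running:
            return
        sample = signals & self.width_mask
        self.samples.append(sample)
        if self._remaining is not None:
            self._remaining -= 1
        elif self.trigger is not None:
            mask, value = self.trigger
            if (sample & mask) == value:
                self._remaining = self.post_trigger
        if self._remaining == 0:
            self.running = False


trace = TraceArray()
trace.start()
for cycle in range(2000):
    trace.clock(cycle)
print(len(trace.samples), trace.samples[0])  # 1024 976
```

Without a trigger the buffer simply keeps the most recent 1024 samples; with a trigger it stops itself a chosen number of samples after the pattern is seen, so the events leading up to the match are preserved.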
Additionally, the Xbox 360 CPU provides 16 32-bit counters for use in performance monitoring, and can monitor hundreds of events across all the functional units in the processor chip. There are programmable start/stop and synchronization conditions.
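One practical consequence of 32-bit counters at a 3.2GHz clock is that a counter tracking a once-per-cycle event wraps in little more than a second. The sketch below works that out and shows the usual modular subtraction for reading a delta across a single wrap. The once-per-cycle event rate is a worst-case assumption, and both helper names are hypothetical.

```python
COUNTER_BITS = 32
CORE_CLOCK_HZ = 3_200_000_000  # 3.2GHz core clock from the article


def seconds_to_wrap(events_per_second: float, bits: int = COUNTER_BITS) -> float:
    """Time until a counter of the given width overflows at a steady event rate."""
    return (1 << bits) / events_per_second


def counter_delta(start: int, end: int, bits: int = COUNTER_BITS) -> int:
    """Events between two readings, correct across at most one wrap."""
    return (end - start) % (1 << bits)


print(f"{seconds_to_wrap(CORE_CLOCK_HZ):.2f} s")  # 1.34 s
print(counter_delta(0xFFFFFFF0, 0x00000010))      # 32
```

Sampling more often than the wrap interval, or using the programmable start/stop conditions to bound the measurement window, keeps the readings unambiguous.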
These features offer developers targeting the Xbox 360 a variety of options for testing and debugging code, with the CPU helping rather than hindering.
One of the priorities in developing the Xbox 360 system was getting a reliable chip on the first pass. The first pass hardware ran with bus, CPU, and cache all at full speed, and a demo game was running on it one week from power-on. This accomplishment was the result of an exhaustive testing and verification process throughout the design of the system.
Verification was both parallel and hierarchical. Each component was verified separately, but larger units were also verified as a group. As much as possible, tests were designed to catch defects where they could be corrected quickly. The process was focused on quality, with quality measurement standards and an extensive review process. IBM verification tools were used along with industry tools. Formal verification was used where appropriate. Another critical component of testing was intelligent randomized test generation, to stress-test components in a variety of circumstances. Hardware acceleration was used to make simulations more practical, using hardware/software co-simulation.
Because of this, the kernel code to be run on the system was in testing and verification long before the first pass chip arrived; this allowed testing not only of the kernel code, but also of the chip on which it was being run (in simulation). The end result was successful Pass 1 hardware, and the move from first silicon to volume production took only eight months.
This article was adapted by Peter Seebach, working from the original presentation "Application Customized CPU Design for Microsoft Xbox 360," presented at MPR Fall Processor Forum 2005 by Jeffrey Brown of IBM. Peter would like to thank Tim Kelly and Jeff Brown for technical and editorial review during the writing process.
IBM and PowerPC are trademarks of International Business Machines Corporation in the United States, other countries or both.
Microsoft, Direct3D, and Xbox 360 are trademarks of Microsoft Corporation in the United States, other countries, or both.
This paper is based on a presentation given at Fall Processor Forum 2005: The Road to Multicore.
Jeff is the Chief Engineer for the Xbox360 CPU Chip Development. He is an IBM Distinguished Engineer in the Engineering & Technology Services Division of IBM and a member of the IBM Academy of Technology. He has been part of E&TS from its inception and has a 15 year history of CPU, memory, and IO subsystem development for iSeries, pSeries, and xSeries within IBM.