The IBM BladeCenter QS21 and QS22 provide optimal performance in a small package by using two 3.2GHz Cell Broadband Engine™ Architecture-compliant processors in a symmetric multiprocessing (SMP) configuration. The QS21 uses two Cell Broadband Engine (Cell/B.E.™) processors, while the QS22 uses two PowerXCell™ 8i processors. The Cell/B.E. processor and its architecture were developed jointly by IBM, Sony Corporation, Sony Computer Entertainment Inc., and Toshiba Corporation.
Both processors, the Cell/B.E. and the PowerXCell 8i, are asymmetrical, multicore processors that are optimized for parallel processing and streaming applications. They have a POWER™ Processor Element (PPE) and eight highly optimized single-instruction multiple data (SIMD) engines (known as Synergistic Processor Elements or SPEs). Performance of the PPE is comparable to that of contemporary general purpose processors (GPPs). Furthermore, each of the eight SIMD engines is capable of matching or even surpassing a GPP running at the same frequency. The eight decoupled SPE SIMD engines with dedicated resources (which include large register files and DMA engines) provide the Cell/B.E. and PowerXCell 8i processors with a significant performance advantage over current GPPs.
This article compares the CBEA processor memory access model with that of general purpose processors and it provides programmer guidelines to ensure that applications can be developed for maximum memory performance. Finally, this article describes the usage of the Cell Performance Counter (CPC) performance tool to monitor memory access activities for tuning and debugging memory performance.
With advances in technology, the gap between microprocessor clock rates and memory speeds has widened. A microprocessor typically depends on its cache hierarchy to alleviate this speed disparity, using high-speed cache memory to avoid data-access stalls.
There are usually multiple levels of caches along a processor's data path to memory. Because most applications exhibit spatial and temporal locality, processors primarily access their level-1 (L1) caches, the closest level of the memory hierarchy, with less frequent accesses to the L2 (and L3) caches in the event of L1 cache misses. Accesses to system memory are typically infrequent, occurring only when data is not present at any level of the cache hierarchy. With a cache-based memory design, these general purpose processors rely on their cache facilities to fetch data without having to directly access system memory.
The cache hierarchy can effectively bridge the performance gap between the general purpose processor and its memory by allowing the processor to access data primarily from its high-speed caches without having to wait for data to return from the relatively slow system memory. This improved memory subsystem design comes at the cost of a significant increase in silicon and in the complexity of coherently managing the cache and its directories. To increase the likelihood of data being available in the cache when needed, prefetch mechanisms have been developed so that the processor or application can speculatively load the cache. If the speculative prefetch is wrong, precious memory bandwidth is wasted.
Moreover, the processor has no precise control over when accessed data becomes available, whether it comes from the cache hierarchy or from distant memory. The programmer has no way to schedule data accesses and ensure the data is available at the time it is needed.
The Cell/B.E. PPE and the PowerXCell 8i PPE use a two-level cache hierarchy similar to that of other general purpose processors, as described. However, the eight SPEs (the SIMD engines that provide the primary compute capability) do not use a cache-based memory subsystem. Instead, each SPE is equipped with a 256KB, high-performance local store with an access time similar to that of the L1 cache of a general purpose processor. The SPU core executes its instructions and accesses data only from its local store. Each SPU has a companion memory flow controller (MFC) that provides DMA services for bringing in (or writing out) data and instructions as needed, as well as facilities for communicating with other SPUs and accessing system resources. The DMA engines are programmed by application software running on the SPUs so that data can be transferred before it is needed. The DMA engine in each SPU supports up to 16 outstanding memory requests, with each request being a single DMA transfer or a list of DMA transfers that provides data scattering and gathering capability.
With the capability of supporting more than 100 outstanding memory accesses concurrently at any time from its PPU, its eight SPUs, and its IO devices, the CBEA processor can access its memory with a high degree of parallelism and achieve memory access efficiency and throughput beyond what a cache-based memory subsystem can achieve. In addition, a CBEA application can schedule its data transfers into and out of local storage ahead of time to minimize or avoid stalling the SPU pipelines while they wait for data. This is accomplished using double-buffering (or multi-buffering) techniques that overlap computation on one buffer with the transfer of the next. When the computation on one buffer is complete, the SPU can switch to computing on the next buffer. Depending on the data access demands (in bytes per cycle), the buffer size can be chosen so that the memory transfer time is entirely hidden in the computation cycles required to process a previously transferred buffer. By exposing the memory access controls to the programmer, an application running on a CBEA processor can be tuned to achieve nearly its peak computation capability.
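For illustration, here is a minimal SPU-side double-buffering sketch using the MFC intrinsics from spu_mfcio.h in the Cell SDK. The block size, the process_block() function, and the input effective address are assumptions made for this example, not details from any particular application:

```c
/* Minimal double-buffering sketch for an SPU program.
 * Assumptions: 16KB blocks, a hypothetical process_block() routine,
 * and a contiguous input stream at effective address in_ea. */
#include <spu_mfcio.h>

#define BSIZE 16384   /* one DMA per block; 16KB is the maximum DMA size */

static volatile unsigned char buf[2][BSIZE] __attribute__((aligned(128)));

extern void process_block(volatile unsigned char *block, unsigned int bytes);

void consume_stream(unsigned long long in_ea, unsigned int nblocks)
{
    unsigned int i, cur = 0, nxt = 1;

    /* Prime the pipeline: start fetching block 0 on tag 'cur'. */
    mfc_get(buf[cur], in_ea, BSIZE, cur, 0, 0);

    for (i = 0; i < nblocks; i++) {
        /* Start the transfer of the next block on the other tag. */
        if (i + 1 < nblocks)
            mfc_get(buf[nxt], in_ea + (unsigned long long)(i + 1) * BSIZE,
                    BSIZE, nxt, 0, 0);

        /* Wait only for the current buffer, then compute on it while
         * the other buffer's DMA remains in flight. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        process_block(buf[cur], BSIZE);

        cur ^= 1;
        nxt ^= 1;
    }
}
```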
The IBM BladeCenter QS21 is configured with 2GB of XDR (Rambus) memory. The QS22 expands its memory capacity by using DDR2 SDRAM memory technology to support up to 32GB of memory. The initial QS22 systems can be configured with either 8GB or 32GB of memory. The performance characteristics of 8GB of DDR2 on the QS22 are very similar to those of XDR memory on the QS21. Using the highest-density DDR2 available on the market, QS22 systems incorporate eight dual-rank 4GB DIMMs to achieve 32GB of memory capacity.
Aside from the performance advantage of expanded memory capacity, an application, in general, should perform very much the same on QS22 systems with either 8GB or 32GB of memory, as long as its memory accesses are properly aligned and sized.
To maximize the potential of the QS21 or QS22 memory subsystem, apply specific programming practices, such as the following:
- Transfer data with optimal alignment and granularity.
- Prefer processor-local memory.
- Use huge pages for large data sets.
- Uniformly distribute the data transfers across all memory banks.
- Maximize the use of local storage.
The most important factor in maximizing the performance of the memory subsystem is to ensure that all transfers begin on a cache-line (128-byte) boundary and are sized as a multiple of 128 bytes. The C/C++ aligned attribute can be used to ensure that arrays, structures, and classes are properly aligned.
If alignment cannot be guaranteed, consider transferring extra data on the front and back to ensure that cache-line alignment and size are achieved. This is especially important for writes to memory (DMA PUTs). For the large memory configuration on the QS22 (greater than 8GB), partial cache-line writes can have a significant application performance impact due to the read-modify-write sequence that occurs when storing a partial cache line.
If cache-line alignment cannot be achieved, the next best option is to ensure that the source and destination buffers have the same alignment (the same seven least-significant address bits). This ensures that performance penalties associated with partial cache-line transfers occur only on the first and possibly the last transfers.
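As a sketch of these guidelines, the following hypothetical SPU fragment aligns a local store buffer on a 128-byte boundary with the aligned attribute and pads the transfer size up to a multiple of 128 bytes. The ROUND_UP_128 helper and the buffer name are illustrative:

```c
/* Cache-line-aligned buffer and 128-byte-multiple transfer size.
 * Assumes src_ea is 128-byte aligned and nbytes <= 16KB. */
#include <spu_mfcio.h>

#define ROUND_UP_128(x) (((x) + 127u) & ~127u)

static volatile float samples[4096] __attribute__((aligned(128))); /* 16KB */

void fetch_samples(unsigned long long src_ea, unsigned int nbytes)
{
    /* Pad the size up so the transfer covers whole cache lines. */
    mfc_get(samples, src_ea, ROUND_UP_128(nbytes), 0, 0, 0);
    mfc_write_tag_mask(1 << 0);
    mfc_read_tag_status_all();
}
```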
The QS21 and QS22 are dual-processor configurations with two memory nodes. The two processors are interconnected using a FlexIO interface running the fully coherent BIF protocol. The bandwidth between the CPUs is half that of the bandwidth to local memory. Figure 1 shows the QS21 and QS22 block diagram.
Figure 1. QS21 and QS22 block diagram
Applications should use NUMA control so that SPUs on CPU 0 preferentially access memory node 0. Likewise, SPUs on CPU 1 should prefer memory node 1. Applications whose data access patterns cannot be partitioned this way and that need to use all or most of the 16 available SPEs should use interleaved memory so that both memory nodes contribute to the aggregate memory bandwidth.
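For example, a PPU-side program can express this preference through the Linux NUMA API (see "A NUMA API for Linux" in Resources). This is a minimal sketch, the choice of node 0 is illustrative, and the program must be linked with -lnuma:

```c
/* Bind execution to node 0 and prefer its local memory node. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA policy is not supported on this system\n");
        return 1;
    }
    numa_run_on_node(0);   /* schedule this process on node 0 */
    numa_set_preferred(0); /* allocate from memory node 0 when possible */

    /* ... create SPE contexts and start the application work here ... */
    return 0;
}
```

The same effect can usually be obtained without code changes by launching the program under the numactl utility.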
The SPE memory management unit (SMM) has a 256-entry translation look-aside buffer (TLB). The TLB can thrash when large data sets are processed, consuming additional memory bandwidth on TLB reloads. To reduce the frequency of TLB reloads and maximize memory performance, use the largest page sizes available. This means using 64KB base pages and, if possible, allocating large data sets in 16MB huge pages.
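One common way to do this, assuming a hugetlbfs file system is mounted (the /huge mount point and file name below are illustrative), is to map a file backed by 16MB huge pages:

```c
/* Allocate a large data set from 16MB huge pages via hugetlbfs.
 * Assumes something like: mount -t hugetlbfs nodev /huge */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGE_PAGE_SIZE (16UL * 1024 * 1024)

void *alloc_huge(size_t bytes)
{
    /* The mapped length must be a multiple of the huge page size. */
    size_t len = (bytes + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);

    int fd = open("/huge/data", O_CREAT | O_RDWR, 0600);
    if (fd < 0) {
        perror("open hugetlbfs file");
        return NULL;
    }

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return (p == MAP_FAILED) ? NULL : p;
}
```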
Each memory node contains 16 memory banks, interleaved on naturally aligned 128-byte blocks.
- Bank 0 contains the first 128-byte block and every sixteenth 128-byte block after the first.
- Bank 1 contains the second 128-byte block and every sixteenth 128-byte block after the second.
And so forth. If memory transactions at any time are concentrated on a subset of the 16 banks, maximum memory performance is compromised.
Various coding techniques, such as the following, have been employed successfully to improve memory access distribution:
- If multiple data buffers are sequentially accessed in parallel, consider aligning each buffer on a different 128-byte bank boundary.
- When partitioning data across multiple SPEs, consider alternative partitioning strategies so that the transfers are more equally distributed.
- For multi-dimensional data sets, consider padding the dimensions so that successive rows, columns, or planes do not start on the same bank (see the sketch after this list).
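As a sketch of the padding technique, consider a row-major array whose unpadded row pitch is exactly 2KB, one full rotation of the 16 banks, so every row would otherwise start on the same bank. The dimensions are illustrative:

```c
/* 16 banks x 128 bytes = 2KB of memory per bank rotation. */
#define ROWS 512
#define COLS 512  /* 512 floats = 2048 bytes: every row hits bank 0 */
#define PAD   32  /*  32 floats =  128 bytes: one extra cache line  */

/* With the pad, the row pitch is 17 cache lines, so each successive
 * row starts one bank later than the previous one (17 mod 16 = 1). */
static float grid[ROWS][COLS + PAD] __attribute__((aligned(128)));
```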
Efficiently managing the limited 256KB of local storage is important so that data buffer sizes can be maximized and optimized with the alignment and padding techniques described previously. Various techniques exist to reduce demands on the local store so that data buffers can be optimally sized. The techniques include:
- Using a single set of I/O buffers with triple buffering, or even double buffering, instead of double buffering both input and output buffers.
- Building input-only DMA lists at the end of the target buffer.
- Reducing source code impact with overlays.
- Avoiding malloc heaps.
Using a single set of I/O buffers
Instead of double buffering both input and output buffers, use a single set of I/O buffers with triple buffering or even double buffering. Table 1 shows four different multi-buffering techniques.
Table 1. Four multi-buffering techniques
| Time step | Double buffered input and output (B0, B1, B2, B3) | Triple buffered, compute in place (B0, B1, B2) | Triple buffered (B0, B1, B2) | Double buffered, compute in place (B0, B1) |
| --- | --- | --- | --- | --- |
| 0 | GET B0 | GET B0 | GET B0 | GET B0 |
| 1 | GET B1, Compute B0 into B2 | GET B1, Compute B0 | GET B1, Compute B0 into B2 | GET B1, Compute B0 |
| 2 | PUT B2, GET B0, Compute B1 into B3 | PUT B0, GET B2, Compute B1 | PUT B2, GET B2, Compute B1 into B0 | PUT B0, GET B0, Compute B1 |
| 3 | PUT B3, GET B1, Compute B0 into B2 | PUT B1, GET B0, Compute B2 | PUT B0, GET B0, Compute B2 into B1 | PUT B1, GET B1, Compute B0 |
| 4 | PUT B2, GET B0, Compute B1 into B3 | PUT B2, GET B1, Compute B0 | PUT B1, GET B1, Compute B0 into B2 | PUT B0, GET B0, Compute B1 |
| 5 | PUT B3, GET B1, Compute B0 into B2 | PUT B0, GET B2, Compute B1 | PUT B2, GET B2, Compute B1 into B0 | PUT B1, GET B1, Compute B0 |
| 6 | PUT B2, Compute B1 into B3 | PUT B1, Compute B2 | PUT B0, Compute B2 into B1 | PUT B0, Compute B1 |
| 7 | PUT B3 | PUT B2 | PUT B1 | PUT B1 |
The second column demonstrates a time sequence of operations for a traditional scheme in which both input and output are double buffered. This solution requires four local store data buffers. If the processing of a buffer can be performed in place, the number of buffers can be reduced to three, as shown in the third column. If the computation cannot be performed in place, a triple-buffer solution is still possible by reusing the output buffer for the next input buffer (fourth column). To avoid a possible race condition, the GET of the next input buffer must be ordered with respect to the PUT of the output buffer, either by using a GET with a barrier or by using a fence. Even a two-buffer solution is possible if the computation can be performed in place, as demonstrated in the last column of the table.
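The ordering requirement in the fourth-column scheme can be met with a fenced GET. The following sketch, with illustrative buffer and address names, issues the PUT of the output buffer and then reuses the same buffer for the next input with mfc_getf, which is not started until all previously issued commands in the same tag group have completed:

```c
/* Reuse the output buffer for the next input (fourth-column scheme). */
#include <spu_mfcio.h>

#define BSIZE 16384

static volatile unsigned char b2[BSIZE] __attribute__((aligned(128)));

void put_then_reuse(unsigned long long out_ea,
                    unsigned long long next_in_ea, unsigned int tag)
{
    mfc_put(b2, out_ea, BSIZE, tag, 0, 0);

    /* Fenced GET: ordered after the PUT above because both commands
     * are in the same tag group, so the PUT data cannot be clobbered. */
    mfc_getf(b2, next_in_ea, BSIZE, tag, 0, 0);
}
```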
Building input-only DMA lists
Build input-only DMA lists at the end of the target buffer. Input data frequently is not accessed sequentially and must be gathered from multiple locations. If there are few locations (fewer than 16), individual DMAs can be initiated for each region. If there are many locations, a DMA list is recommended.
To save the local storage space required to hold the list, the list-element array can be constructed at the end of the target buffer and overwritten by the transfer. The processor architecture guarantees that all list elements within a list command are started and issued in sequence, so as long as each list element has a non-zero size, the list array is guaranteed not to be overwritten before it is processed.
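Here is a sketch of the technique. It assumes that all gathered regions reside in the same 4GB segment (the high 32 bits of the effective address are passed as 0); the region table and sizes are illustrative:

```c
/* Build a gather list in the tail of the target buffer; the transfer
 * itself overwrites the list after the list elements are consumed. */
#include <spu_mfcio.h>

#define NREGION 64
#define RSIZE   1024                /* bytes per gathered region */
#define BUFSZ   (NREGION * RSIZE)

static volatile unsigned char buf[BUFSZ] __attribute__((aligned(128)));

void gather(unsigned int region_eal[NREGION]) /* low 32 EA bits per region */
{
    /* Place the list elements in the last bytes of the target buffer. */
    volatile mfc_list_element_t *list = (volatile mfc_list_element_t *)
        (buf + BUFSZ - NREGION * sizeof(mfc_list_element_t));
    unsigned int i;

    for (i = 0; i < NREGION; i++) {
        list[i].notify   = 0;
        list[i].reserved = 0;
        list[i].size     = RSIZE;   /* non-zero, so each element is read
                                       before the data can overwrite it */
        list[i].eal      = region_eal[i];
    }

    mfc_getl(buf, 0, (void *)list,
             NREGION * sizeof(mfc_list_element_t), 0, 0, 0);
    mfc_write_tag_mask(1 << 0);
    mfc_read_tag_status_all();
}
```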
Reducing source code impact with overlays
Exploit overlay techniques to reduce the program code's impact on local storage. To save local storage used by the program, infrequently used code sections can be overlaid and brought into the local store only when required.
Avoiding malloc heaps
Avoid using malloc heaps. Using a general-purpose malloc heap to allocate SPE local storage consumes valuable space for both the code and data structures required to manage the heap. Instead, pre-define data buffers using static arrays, or allocate them at runtime on the stack.
Additional programming tips and best practices for executing applications efficiently on CBEA processors and achieving optimal performance are detailed in the Cell Broadband Engine Programming Handbook, the Cell/B.E. Programming Tutorial, and other documentation (see Resources).
High-level development tools (such as Gedae, RapidMind, and the IBM XL C/C++ single-source compiler) optimize CBEA programs in their entirety, both at compile time and at runtime. These tools automatically optimize memory alignment, memory access distributions, and data transfer attributes by using the techniques previously described.
SPE-accelerated libraries (like BLAS, LAPACK, and the CodeSourcery Math Library) are called directly from a PPU program, and they simply accelerate the requested function. Data arrays are typically allocated by the application, and they might require specific alignment and allocation strategies in order for the library to achieve optimal transfer characteristics. Consult the individual library's user documentation for specific library-usage instructions.
With its innovative design, the CBEA processor micro-architecture supports a high degree of parallel processing at various levels, exploiting the following:
- Data-level parallelism (DLP) in its SPU SIMD engines
- Instruction-level parallelism (ILP) in its execution pipelines
- Thread-level parallelism (TLP) in its multiple SPUs
Its memory subsystem design provides a high degree of concurrent data transfer under application control to support high memory performance. With these features exposed, an application can take full control of its resources to achieve optimum performance.
To assist programmers in tuning application performance, a performance event-monitoring facility has been implemented in the CBEA processor to monitor activities in its processor pipelines, execution dependencies, resource usage, and so on. A comprehensive set of performance events and signals has been connected from each component and island (such as the PPU execution units, the PPU storage subsystem, the SPUs, the memory flow controller units, the bus interface controller, and the memory interface controller) to performance counters and tracing facilities for recording the frequency and duration (in cycles) of each performance event.
This section addresses the performance events for monitoring memory activities and the use of the Cell Performance Counter (CPC) performance tool for tuning and debugging memory performance. Table 2 summarizes the memory performance events for the two Rambus Extreme Data Rate (XDR) memory interfaces: XIO0 and XIO1.
Table 2. Critical memory performance events
| XIO0 event | XIO1 event | Event type | Event name and description |
| --- | --- | --- | --- |
| 7209 | 7109 | Count cycles | Read command queue is empty. |
| 7210 | 7110 | Count cycles | Write command queue is empty. |
| 7212 | 7112 | Count cycles | Read command queue is full. |
| 7213 | 7113 | Count single-cycle events | Memory interface controller responds with a Retry for a read command because the read command queue is full. |
| 7214 | 7114 | Count cycles | Write command queue is full. |
| 7215 | 7115 | Count single-cycle events | Memory interface controller responds with a Retry for a write command because the write command queue is full. |
| 7234 | 7134 | Count single-cycle events | Read command dispatched; includes high-priority and fast-path reads. |
| 7235 | 7135 | Count single-cycle events | Write command dispatched. |
| 7236 | 7136 | Count single-cycle events | Read-modify-write command (data size < 16 bytes) dispatched. |
| 7237 | 7137 | Count single-cycle events | Refresh dispatched. |
| 7239 | 7139 | Count single-cycle events | Byte-masking write command (data size >= 16 bytes) dispatched. |
| 7241 | 7141 | Count single-cycle events | Write command dispatched after a read command was previously dispatched. |
| 7242 | 7142 | Count single-cycle events | Read command dispatched after a write command was previously dispatched. |
In general, these events can monitor memory read and write activities, as well as queuing behaviors, at the memory controller. The count cycles event type counts the number of cycles an event is active. For example, events 7109 and 7110 can be used to count the number of cycles that the read and write command queues are empty, which indicates idleness of the memory controller in processing read or write commands.
The count single-cycle events event type counts the occurrences of an event. For example, events 7135 and 7235 can be counted to show how many write commands the memory controller issued. Note that each write command corresponds to a 128-byte data write and that a program-requested DMA is broken into multiple 128-byte writes. As a result, write throughput can be derived by multiplying the count by the data size (128 bytes) and dividing by the execution cycles (or time in seconds).
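Expressed as a formula, for a sampling interval of $t$ seconds, the aggregate write throughput across both channels is approximately:

$$\text{write throughput} \approx \frac{(\text{count}_{7135} + \text{count}_{7235}) \times 128 \text{ bytes}}{t}$$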
The CPC performance tool can be used to select the events to monitor. The four 32-bit counters implemented in the performance monitoring facility allow any four events (from at most two islands) to be monitored at the same time when using CPC. For a detailed description of the usage of CPC and other performance tools, refer to the performance tuning guide in Resources.
To evaluate balanced usage of both memory channels (XIO0 and XIO1), count the numbers of read and write commands for each channel with events 7134, 7135, 7234, and 7235. The read-write ratio can be derived easily by dividing the counts of events 7134 and 7234 by those of 7135 and 7235. The counts from events 7109, 7110, 7209, and 7210 indicate the amount of time the channels are idle, while high counts for events 7112, 7114, 7212, and 7214 show high utilization of both channels. Events 7113, 7115, 7213, and 7215 are more important still: they count retries, which can cause system-wide performance impacts and become costly when system utilization is high.
Events 7136 and 7236 monitor read-modify-write (RMW) activity on memory accesses. Together with events 7135 and 7235, you can derive the RMW-to-write ratio, a good indicator of the severity of RMWs caused by writes that are smaller than 128 bytes or unaligned.
The following example monitors the write and RMW activities on both XIO1 and XIO0 channels. Every 200 million cycles (or 62.5 msec), it samples the occurrence of write and RMW commands during the execution of the prog1 application.
```
cpc -i 200000000 --sampling-mode count --events 7135,7136,7235,7236 prog1

Sample         7135    7136         7235    7236
------  -----------  ------  -----------  ------
     0        80983      37        81173      36
     1       404564       3       404602       4
     2       981142       4       981165      14
     3       972268       0       972282       0
     4       977525       0       977484       0
     5       974026       0       974023       0
     6       973806       0       973811       0
     7       979406       0       979398       0
     8       971131       0       971132       0
     9       979516       0       979520       0
    10       973916       2       973923       0
    11       973922       0       973922       0
    12       980355       0       980355       0
    13       971387       0       971384       0
    ..          ...      ..          ...     ...
```
This example shows that there are almost no RMW activities in the execution of prog1, which consumes about 3.74GBps of memory bandwidth, or (979406+979398)*128B/62.5msec, in writing to memory during sampling period 7.
Here is another example showing a significant number of RMWs in the execution of the prog2 application:
```
cpc -i 200000000 --sampling-mode count --events 7135,7136,7235,7236 prog2

Sample         7135         7136         7235         7236
------  -----------  -----------  -----------  -----------
     0        79518          135        79743          262
     1         8592            0         8588            1
     2        88358       169946        88318       169962
     3          543       613353          550       613341
     4          788       614463          800       614467
     5          334       615680          329       615680
     6          298       612896          298       612913
     7          297       613147          297       613155
     8          296       612716          298       612659
     9          297       613285          296       613335
    10          342       613190          330       613178
    11          308       612471          304       612460
    12          323       615415          328       615426
    ..          ...          ...          ...          ...
```
As you can see, prog2 issues too many small writes that can cause severe performance loss.
With more RMW activity, effective memory bandwidth can be severely reduced. However, the application or system performance impact depends heavily on whether RMW times can still be hidden in computation cycles and on the percentage of overall execution time that the RMW latency represents. The performance loss from RMWs can be severe for some applications, so tune the application to issue minimal or no RMWs for optimum performance.
The CBEA processors in the IBM BladeCenter QS21 and QS22 provide direct control of memory accesses through DMA commands. Compared to general purpose processors, CBEA processors have the advantage of maximizing performance by allowing applications to schedule their memory accesses directly without relying on a cache hierarchy.
Programs control the CBEA processor's data reads and writes, transferring only the data required for processing, without wasting precious memory and system-bus bandwidth on speculative data prefetches. The architecture enables applications to overlap data transfers with computation and processing to completely hide memory access latency.
If applications apply best practices when accessing memory, the DMA-based memory model achieves high computation efficiency and performance. Performance tools, such as CPC, are available to help identify inefficient memory usage patterns and to help eliminate application memory performance issues.
"Cell Broadband Engine Architecture and its first implementation -- A performance view"
(IBM Journal of Research and Development, 2007) for
how a Cell/B.E. processor can outperform other modern processors.
"The Potential of the Cell Processor for Scientific Computing"
(ACM, 2005) to find out about the potential of using the Cell/B.E. processor as a
building block for future, high-end computing systems.
- Use the tutorial series "Cell/B.E. SDK 3.0 tools: Using performance tools" (developerWorks, April 2008) for a tour of six performance tools to use with the Cell/B.E. SDK 3.0 and for a discussion of Cell/B.E. system performance best practices.
- Check out "What Every Programmer Should Know About Memory" (Red Hat, 2007) for the structure of memory subsystems in use on modern commodity hardware. The article illustrates why CPU caches were developed, how they work, and what programs should do to achieve optimal performance by using them.
- Learn about the "IBM Power6 microarchitecture" (IBM Journal of Research and Development, 2007) and the implementation of the IBM POWER6™ microprocessor, a two-way, simultaneous multithreaded, dual-core chip.
- Refer to "Introduction to the Cell Multiprocessor" (IBM Journal of Research and Development, 2005) for an introductory overview of the Cell/B.E. multiprocessor's history, program objectives and challenges, design concept, architecture and programming models, and implementation. Also of interest from early CBEA efforts is "Cell Broadband Engine Architecture and its first implementation" (developerWorks, November 2005).
- Pore through "Introduction to the Cell Broadband Engine Architecture" (IBM Journal of Research and Development, 2007) for an overview of the CBEA's organization, instruction set, commands, and facilities. The article introduces the Software Development Kit and the software standards for a CBEA-compliant processor.
"A NUMA API
for LINUX®" (Novell, 2005) for an introduction to NUMA APIs.
- Refer to the "Cell Broadband Engine Programming Handbook" (IBM, 2007) for developing applications, libraries, middleware, drivers, compilers, or operating systems for the Cell/B.E. processor.
- See "Cell/B.E.
Programming Tutorial" (IBM, 2007) if you are interested in developing
applications or libraries for Cell/B.E. systems.
- Improve your skills with "Maximizing the power of the Cell Broadband Engine processor: 25 tips to optimal application performance" (developerWorks, 2006), which gives you 25 ways to get to know the Cell/B.E. processor's architectural characteristics better so you can optimize for the processor.
- Use the "Programmer's Guide to the IBM SDK for Multicore Acceleration" (IBM, 2007) to learn how to use the SDK to write applications.
- Try the original installation document, the "Installation Guide for the SDK for Multicore Acceleration v3.0," for Cell/B.E. SDK 3.0 installation instructions.
- Click on Cell Performance for articles about all things related to getting the best performance from your Cell/B.E. processor.
- Discover "Using
the IBM XL C/C++ Alpha Edition for Multicore Acceleration Single-Source Compiler"
(IBM, 2007), which contains overview and basic usage information for the
- Check out the "Basic Linear Algebra Subprograms Programmer's Guide and API Reference" (IBM, 2008), which describes in detail how to configure the BLAS library and how to program applications using it on the SDK.
- See the "LAPACK Programmer's Guide and API Reference" (IBM, 2008) for how to configure the LAPACK library and how to program applications that use it on the SDK.
- Find more about best practices for Cell/B.E. development in the IBM Redbook "Programming the Cell Broadband Engine: Examples and Best Practices" (IBM Redbooks, February 2008).
- Learn more about Cell/B.E. programming from the developerWorks series:
- "Programming high-performance applications on the Cell/B.E. processor"
- "PS3 fab-to-lab"
- "The little broadband engine that could"
- Refer to the Cell Broadband Engine documentation section of the IBM Semiconductor Solutions Technical Library for a wealth of downloadable manuals, specifications, and more.
Get products and technologies
- Learn more about Gedae.
- Find more detailed examples at the RapidMind Web site (including an overview of the technology). The site includes comparisons of code complexity and benchmarks on various hardware targets, including the Cell/B.E.
- Get your copy of the IBM SDK for Multicore Acceleration 3.0 or browse through the library of Cell/B.E. documentation.
- Find all Cell/B.E.-related articles, discussion forums, downloads, and more at the IBM developerWorks Cell Broadband Engine resource center: your definitive resource for all things Cell/B.E.
- Contact IBM about custom Cell/B.E.-based or custom-processor-based solutions.
- Check out the Cell Broadband Engine Architecture forum to get your technical questions about the processor answered. Juicy problems and answers from the forums are rounded up periodically and highlighted in the "Forum watch" blog series.
- Go to the Cell Broadband Engine/Power Architecture blog for instructional resources and event notifications for Cell/B.E. and other Power Architecture-related technologies. You can find the popular "Forum watch" blog series (a Q&A roundup), the "FixIt" technology updates, and the Infobomb quick-read technology introductions.
Dr. Thomas Chen is a Senior Technical Staff Member responsible for systems performance in the IBM Systems and Technology Group. He leads the team effort in developing Cell/B.E. processor-based blade servers, enhancing PowerPC Architecture and design, and characterizing the emerging workloads for the design of future processors and servers. Since joining IBM in 1990, Dr. Chen has been awarded more than 20 patents spanning various technical areas. His technical interests include processor microarchitecture design, I/O and networking subsystem design, workload analysis and characterization, and performance modeling and analysis of processor and system architecture and designs. Dr. Chen received a Ph.D. degree in computer engineering from the State University of New York at Buffalo in 1989.
Daniel Brokenshire is a Senior Technical Staff Member in Systems and Technology Group working at the IBM Multicore Systems Software Design Center (Austin, Texas). His responsibilities include the architecture and development of programming standards, language extensions, and programming models for the Cell Broadband Engine processor and other multicore processors. He received a BS in computer science and BS and MS degrees in electrical engineering, all from Oregon State University. Prior to his work on the Cell Broadband Engine processor, he was involved in the development of 3D graphics products for both Tektronix, Inc. and IBM.