BladeCenter QS: Maximizing memory performance

Compare CBEA and general-purpose processor memory access models for maximum memory performance

Thomas Chen, System Performance Analyst, IBM, Software Group
Dr. Thomas Chen is a Senior Technical Staff Member responsible for systems performance in the IBM Systems and Technology Group. He leads the team effort in developing Cell/B.E. processor-based blade servers, enhancing PowerPC Architecture and design, and characterizing the emerging workloads for the design of future processors and servers. Since joining IBM in 1990, Dr. Chen has been awarded more than 20 patents spanning various technical areas. His technical interests include processor microarchitecture design, I/O and networking subsystem design, workload analysis and characterization, and performance modeling and analysis of processor and system architecture and designs. Dr. Chen received a Ph.D. degree in computer engineering from the State University of New York at Buffalo in 1989.
Daniel A. Brokenshire, Software Architect, IBM
Daniel Brokenshire is a Senior Technical Staff Member in Systems and Technology Group working at the IBM Multicore Systems Software Design Center (Austin, Texas). His responsibilities include the architecture and development of programming standards, language extensions, and programming models for the Cell Broadband Engine processor and other multicore processors. He received a BS in computer science and BS and MS degrees in electrical engineering, all from Oregon State University. Prior to his work on the Cell Broadband Engine processor, he was involved in the development of 3D graphics products for both Tektronix, Inc. and IBM.

Summary:  This article compares the CBEA processor memory access model (with a focus on the IBM BladeCenter® QS21 and QS22) with that of general purpose processors, providing programmer guidelines to ensure that applications can be developed for maximum memory performance. This article also describes how to use the Cell Performance Counter tool when monitoring memory access activities for tuning and debugging memory performance.

Date:  01 Jul 2008
Level:  Intermediate

Introduction

The IBM BladeCenter QS21 and QS22 provide optimal performance in a small package by using two 3.2GHz Cell Broadband Engine™ Architecture-compliant processors in a symmetric multiprocessing (SMP) configuration. The QS21 uses two Cell Broadband Engine (Cell/B.E.™) processors, while the QS22 uses two PowerXCell™ 8i processors. The Cell/B.E. processor and its architecture were developed jointly by IBM, Sony Corporation, Sony Computer Entertainment Inc., and Toshiba Corporation.

Both processors, the Cell/B.E. and the PowerXCell 8i, are asymmetrical, multicore processors that are optimized for parallel processing and streaming applications. They have a POWER™ Processor Element (PPE) and eight highly optimized single-instruction multiple data (SIMD) engines (known as Synergistic Processor Elements or SPEs). Performance of the PPE is comparable to that of contemporary general purpose processors (GPPs). Furthermore, each of the eight SIMD engines is capable of matching or even surpassing a GPP running at the same frequency. The eight decoupled SPE SIMD engines with dedicated resources (which include large register files and DMA engines) provide the Cell/B.E. and PowerXCell 8i processors with a significant performance advantage over current GPPs.

This article compares the CBEA processor memory access model with that of general purpose processors and it provides programmer guidelines to ensure that applications can be developed for maximum memory performance. Finally, this article describes the usage of the Cell Performance Counter (CPC) performance tool to monitor memory access activities for tuning and debugging memory performance.

Understanding the CBEA memory access model

With advances in technology, the gap between microprocessor clock rate and memory speed has widened. A microprocessor typically relies on its cache hierarchy to bridge this disparity, using high-speed cache memory to avoid data access stalls.

There are usually multiple levels of cache along a processor's data path to memory. Because most applications exhibit spatial and temporal locality, processors primarily access their level-1 (L1) caches, which are the closest level of the memory hierarchy, with less frequent accesses to the L2 (and L3) caches in the event of L1 cache misses. Access to memory is typically infrequent and occurs only when the data is not in any level of the cache hierarchy. With a cache-based memory design, these general purpose processors rely on their cache facilities to fetch data without having to directly access system memory.

The cache hierarchy can effectively bridge the performance gap between the general purpose processor and its memory by allowing the processor to access data primarily from its high speed caches without having to wait for data to return from the relatively slow system memory. This improved memory subsystem design comes at the cost of a significant increase in silicon and in the complexity of coherently managing the cache and its directories. To increase the likelihood of data being available in the cache when needed, prefetch mechanisms have been developed so that the processor or application can speculatively load the cache. If the speculative prefetch is wrong, precious memory bandwidth can be wasted.

Moreover, the processor does not have precise control over when the accessed data becomes available, whether it comes from the cache hierarchy or from distant memory. The programmer cannot schedule data accesses to ensure that the data is available at the time it is needed.

The Cell/B.E. PPE and the PowerXCell 8i PPE use a 2-level cache hierarchy similar to other general purpose processors as described. However, the eight SPEs (which are SIMD engines that provide the primary compute capability) do not use a cache-based memory subsystem. Instead, each SPE is equipped with a 256KB, high-performance local store with an access time similar to the L1 cache of a general purpose processor. The SPU core executes its instructions and accesses data only from its local store. Each SPU has a companion memory flow controller (MFC) that provides DMA services to bring in (or write out) data and instructions as needed, as well as facilities to communicate with other SPUs and access system resources. The DMA engines are programmed by application software running on SPUs so that data can be transferred before it is needed. The DMA engine in each SPU supports up to 16 outstanding memory requests, with each request being a single DMA or a list of DMAs that provide data scattering and gathering capability.

With the capability of supporting more than 100 outstanding memory accesses concurrently at any time from its PPU, eight SPUs, and its I/O devices, the CBEA processor can access its memory with a high degree of parallelism and achieve high memory access efficiency and throughput beyond what the cache-based memory subsystem can achieve. In addition, a CBEA application can schedule its data transfers into and out of local storage ahead of time and minimize or avoid stalling the SPU pipelines so they don't wait for data to be available. This is accomplished using double buffering (or multi-buffering) techniques to overlap computation on one buffer with the transfer of the next buffer. When the computation on the current buffer is complete, the SPU can then switch to compute on the next buffer. Depending on the data access demands (in bytes per cycle), the buffer size can be chosen to entirely hide memory data transfer time during the computation cycles required to process a previously transferred buffer. By exposing the memory access controls to the programmer, an application running on a CBEA processor can be tuned to achieve nearly its peak computation capability.
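The double-buffering schedule can be sketched in portable C. This is a minimal sketch under stated assumptions, not SPU code: memcpy stands in for an asynchronous DMA GET, so the copy and the computation run one after the other here, whereas on an SPE the MFC would move the next buffer while the SPU computes on the current one. The names CHUNK, fetch, and sum_double_buffered are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 1024  /* bytes per buffer; hypothetical, a multiple of 128 */

/* Stand-in for a DMA GET from system memory into local store. */
static void fetch(uint8_t *local, const uint8_t *remote, size_t n)
{
    memcpy(local, remote, n);
}

/* Sum all bytes of src, alternating between two local buffers.
 * Only whole chunks are processed; a trailing partial chunk is ignored. */
uint64_t sum_double_buffered(const uint8_t *src, size_t len)
{
    static uint8_t buf[2][CHUNK] __attribute__((aligned(128)));
    size_t nchunks = len / CHUNK;
    uint64_t total = 0;

    if (nchunks == 0)
        return 0;

    fetch(buf[0], src, CHUNK);                  /* prime the first buffer */
    for (size_t i = 0; i < nchunks; i++) {
        size_t cur = i & 1;
        if (i + 1 < nchunks)                    /* start the next transfer */
            fetch(buf[cur ^ 1], src + (i + 1) * CHUNK, CHUNK);
        /* On an SPE, the code would wait here until buffer cur has arrived. */
        for (size_t j = 0; j < CHUNK; j++)      /* compute on the current buffer */
            total += buf[cur][j];
    }
    return total;
}
```

With real DMA, the transfer of chunk i+1 proceeds during the inner loop over chunk i, so the transfer is hidden whenever the loop takes at least as long as the transfer.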

The IBM BladeCenter QS21 is configured with 2GB of XDR (Rambus) memory. The QS22 system expands its memory capacity by using DDR2 SDRAM memory technology to support up to 32GB of memory. The initial QS22 systems can be configured with either 8GB or 32GB of memory. The performance characteristics of 8GB DDR2 for QS22 are very similar to those of XDR memory on QS21. Using the highest density DDR2 available on the market, QS22 systems incorporate 8 dual-rank 4GB DIMMs to achieve 32GB of memory capacity.

Aside from the performance advantage of expanded memory capacity, an application, in general, should perform very much the same on QS22 systems with either 8GB or 32GB of memory, as long as its memory accesses are properly aligned and sized.


Maximizing the potential of the memory subsystem

To maximize the potential of the QS21 or QS22 memory subsystem, apply specific programming practices, such as the following:

  1. Transfer data with optimal alignment and granularity.
  2. Preference processor local memory.
  3. Use huge pages for large data sets.
  4. Uniformly distribute the data transfers across all memory banks.
  5. Maximize the use of local storage.

Transfer data with optimal alignment and granularity

The most important factor in maximizing the performance of the memory subsystem is to ensure that all transfers begin on a cache-line (128 byte) boundary and are sized as a multiple of 128 bytes. The C/C++ type attribute aligned can be used to ensure arrays, structures, and classes are properly aligned.
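As a minimal sketch of the aligned attribute, assuming a GCC-compatible compiler, the declarations below place an array and a structure on 128-byte boundaries. The names samples, struct packet, is_line_aligned, and samples_addr are hypothetical.

```c
#include <stdint.h>

#define CACHE_LINE 128

/* Array that starts on a cache-line boundary; 1024 floats = 4096 bytes,
 * which is also a multiple of 128. */
static float samples[1024] __attribute__((aligned(CACHE_LINE)));

/* Structure whose every instance starts on a cache-line boundary.
 * The compiler pads sizeof(struct packet) up to a multiple of 128. */
struct packet {
    uint32_t id;
    uint8_t  payload[200];
} __attribute__((aligned(CACHE_LINE)));

/* Return 1 if p sits on a cache-line boundary, 0 otherwise. */
int is_line_aligned(const void *p)
{
    return ((uintptr_t)p % CACHE_LINE) == 0;
}

/* Accessor so that other code can check the array's address. */
const void *samples_addr(void)
{
    return samples;
}
```

The attribute fixes only the starting address of an array. The array size must still be chosen as a multiple of 128 bytes so that whole-array transfers end on a cache-line boundary.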

If alignment cannot be guaranteed, consider transferring extra data on the front and back to ensure that cache-line alignment and size are achieved. This is especially important for writes to memory (DMA PUTs). For the large memory configuration on the QS22 (greater than 8GB), partial cache-line writes can have significant application performance impact due to the resulting read-modify-write sequence that occurs when storing a partial cache-line.
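The widening of an unaligned transfer can be sketched as follows. The helper widen_to_lines and the struct span are hypothetical names; the function only computes the enclosing cache-line-aligned range and issues no DMA.

```c
#include <stdint.h>

#define CACHE_LINE 128u

struct span {
    uint64_t start;  /* first byte of the widened range, a multiple of 128 */
    uint64_t size;   /* length of the widened range, a multiple of 128 */
};

/* Widen [addr, addr + len) to the smallest range that starts and ends
 * on a 128-byte boundary. */
struct span widen_to_lines(uint64_t addr, uint64_t len)
{
    const uint64_t mask = ~(uint64_t)(CACHE_LINE - 1);
    struct span s;
    uint64_t end = (addr + len + CACHE_LINE - 1) & mask;  /* round end up */

    s.start = addr & mask;                                /* round start down */
    s.size  = end - s.start;
    return s;
}
```

For example, a 300-byte transfer at address 0x1010 widens to 384 bytes starting at 0x1000. For a PUT, the widened range is written in full, so the extra bytes at the front and back must belong to the buffer and hold the values that should end up in memory.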

If cache-line alignment cannot be achieved, the next best option is to ensure that the source and destination buffers have the same alignment (the same 7 least significant address bits). This ensures that performance penalties associated with partial cache-line transfers occur only on the first and possibly the last transfers.
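A check for matching alignment compares the low 7 bits of the two addresses. The following is a minimal sketch; line_offset and same_line_offset are hypothetical helper names.

```c
#include <stdint.h>

/* Offset of an address within its 128-byte cache line (the low 7 bits). */
unsigned line_offset(uint64_t addr)
{
    return (unsigned)(addr & 0x7F);
}

/* Return 1 if source and destination share the same offset within a
 * cache line, so that only the first and last pieces of a transfer
 * are partial lines. */
int same_line_offset(uint64_t src, uint64_t dst)
{
    return line_offset(src) == line_offset(dst);
}
```

A source at 0x10030 and a destination at 0x2B0 both sit 0x30 bytes into a cache line, so they pass the check. Moving the destination to 0x2C0 fails it.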

Preference processor local memory

The QS21 and QS22 are dual-processor configurations with two memory nodes. The two processors are interconnected using a FlexIO interface running the fully coherent BIF protocol. The bandwidth between the CPUs is half the bandwidth to local memory. Figure 1 shows the QS21 and QS22 block diagram.


Figure 1. QS21 and QS22 block diagram

Applications should use NUMA control to preference SPUs on CPU 0 to access memory node 0. Likewise, SPUs in CPU 1 should preference memory accesses to memory node 1. Applications whose data access patterns cannot be separately partitioned and need to use all or most of the 16 available SPEs should use interleaved memory so that both memory nodes contribute to the aggregated memory bandwidth.

Use huge pages for large data sets

The SPE memory management unit (SMM) has a 256-entry translation look-aside buffer (TLB). When large data sets are processed, the TLB can thrash, and reloading its entries consumes additional memory bandwidth. To reduce the frequency of TLB reloads and to maximize memory performance, use the largest page sizes available. This means using 64KB base pages and, if possible, allocating large data sets in 16MB huge pages.
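The benefit of larger pages can be estimated from the amount of memory the TLB can map at once. The sketch below assumes that each of the 256 entries maps exactly one page; tlb_reach is a hypothetical helper name.

```c
#include <stdint.h>

/* Bytes of memory that a TLB with the given number of entries can map
 * without a reload, assuming one page per entry. */
uint64_t tlb_reach(unsigned entries, uint64_t page_size)
{
    return (uint64_t)entries * page_size;
}
```

Under this assumption, 256 entries cover 16MB with 64KB pages and 4GB with 16MB huge pages. A data set larger than the reach forces TLB reloads as an SPE streams through it.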

Uniformly distribute the data transfers across all memory banks

Each memory node contains 16 memory banks, which are naturally aligned on 128 bytes.

  • Bank 0 contains the first 128-byte block and every sixteenth 128-byte block after the first.
  • Bank 1 contains the second 128-byte block and every sixteenth 128-byte block after the second.

And so forth. If memory transactions at any time are concentrated on a subset of the 16 banks, maximum memory performance is compromised.
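The bank mapping described above can be written as a small function. This is a sketch that assumes addr is a byte offset within one memory node and that banks are assigned round-robin per 128-byte block exactly as listed; bank_of and banks_touched are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 128u
#define NUM_BANKS  16u

/* Bank that holds the 128-byte block containing addr. */
unsigned bank_of(uint64_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % NUM_BANKS);
}

/* Number of distinct banks hit by count accesses that start at base
 * and advance by stride bytes each time. */
unsigned banks_touched(uint64_t base, uint64_t stride, size_t count)
{
    unsigned seen = 0;   /* bit b is set once bank b has been hit */
    unsigned n = 0;

    for (size_t i = 0; i < count; i++)
        seen |= 1u << bank_of(base + i * stride);
    for (unsigned b = 0; b < NUM_BANKS; b++)
        n += (seen >> b) & 1u;
    return n;
}
```

Sixteen accesses with a stride of 2048 bytes (16 blocks) all land in one bank, while the same number of accesses with a stride of 128 or 2176 bytes spread over all 16 banks.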

Various coding techniques, such as the following, have been employed successfully to improve memory access distribution:

  • If multiple data buffers are sequentially accessed in parallel, consider aligning each buffer on a different 128-byte bank boundary.
  • When partitioning work by data across multiple SPEs, consider alternate partitioning strategies so that the transfers are more equally distributed.
  • For multi-dimensional data sets, consider padding the dimension so that each row, column, or plane doesn't start on the same bank.
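One possible way to pad a dimension is to round each row up to whole cache lines and then make the number of lines per row odd. An odd line count shares no factor with 16, so successive row starts walk through all 16 banks. This is an illustrative heuristic, not a rule from the processor documentation; padded_pitch is a hypothetical name.

```c
#include <stddef.h>

/* Return a row pitch in bytes that is a multiple of 128 and spans an
 * odd number of 128-byte lines, so that consecutive rows start in
 * different banks. */
size_t padded_pitch(size_t row_bytes)
{
    size_t lines = (row_bytes + 127) / 128;  /* whole cache lines per row */

    if (lines % 2 == 0)                      /* even counts revisit banks early */
        lines += 1;
    return lines * 128;
}
```

A 4096-byte row is padded to 4224 bytes (33 lines), at the cost of one extra cache line of local or system memory per row.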

Maximize the use of local storage

Efficiently managing the limited 256KB of local storage is important so that the size of the data buffers is maximized or optimized for the alignment and padding techniques described previously. Various techniques exist to reduce demands on the local store so that data buffers can be optimally sized. The techniques include:

  • Using a single set of I/O buffers and using triple buffers or even double buffers instead of double buffering both input and output buffers.
  • Building input-only DMA lists at the end of the target buffer.
  • Reducing source code impact with overlay.
  • Avoiding malloc heaps.

Using a single set of I/O buffers
Instead of double buffering both input and output buffers, use a single set of I/O buffers and use triple buffers or even double buffers. Table 1 shows four different multi-buffering techniques.


Table 1. Four multi-buffering techniques
Time  Double buffered      Triple buffered    Triple buffered      Double buffered
step  input and output     compute in place                        compute in place
      (B0, B1, B2, B3)     (B0, B1, B2)       (B0, B1, B2)         (B0, B1)
----  ------------------   ----------------   ------------------   ----------------
0     GET B0               GET B0             GET B0               GET B0
1     GET B1               GET B1             GET B1               GET B1
      Compute B0 into B2   Compute B0         Compute B0 into B2   Compute B0
2     PUT B2               PUT B0             PUT B2               PUT B0
      GET B0               GET B2             GETB B2              GETB B0
      Compute B1 into B3   Compute B1         Compute B1 into B0   Compute B1
3     PUT B3               PUT B1             PUT B0               PUT B1
      GET B1               GET B0             GETB B0              GETB B1
      Compute B0 into B2   Compute B2         Compute B2 into B1   Compute B0
4     PUT B2               PUT B2             PUT B1               PUT B0
      GET B0               GET B1             GETB B1              GETB B0
      Compute B1 into B3   Compute B0         Compute B0 into B2   Compute B1
5     PUT B3               PUT B0             PUT B2               PUT B1
      GET B1               GET B2             GETB B2              GETB B1
      Compute B0 into B2   Compute B1         Compute B1 into B0   Compute B0
6     PUT B2               PUT B1             PUT B0               PUT B0
      Compute B1 into B3   Compute B2         Compute B2 into B1   Compute B1
7     PUT B3               PUT B2             PUT B2               PUT B1

The second column demonstrates a time sequence of operations for a traditional double-buffered, both-input-and-output sequence. This solution requires four local store data buffers. If the processing of a buffer can be performed in place, the number of buffers can be reduced to three, as shown in the third column. If the computation cannot be performed in place, a triple buffer solution is still possible by reusing the output buffer for the next input buffer (fourth column). To avoid a possible race condition, the GET of the next input buffer must be ordered with respect to the PUT of the output buffer either by using a GET with a barrier or by using a fence. Even a two-buffer solution is possible if the computation can be performed in place, as demonstrated in the last column of the table.
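The buffer rotation of the triple-buffered, compute-in-place scheme (third column of Table 1) follows a simple modulo-3 pattern. The sketch below only computes which buffer each operation uses at a given time step and issues no DMA; struct step_ops and triple_in_place are hypothetical names.

```c
/* Buffer index used by each operation in one time step; -1 means the
 * operation does not occur in that step. */
struct step_ops {
    int get;
    int compute;
    int put;
};

/* Operations at time step t when nblocks data blocks are processed
 * with three buffers and in-place computation. Block k is fetched at
 * step k, computed at step k + 1, and written back at step k + 2,
 * always in buffer k % 3. */
struct step_ops triple_in_place(int t, int nblocks)
{
    struct step_ops s;

    s.get     = (t < nblocks)               ? t % 3       : -1;
    s.compute = (t >= 1 && t - 1 < nblocks) ? (t - 1) % 3 : -1;
    s.put     = (t >= 2 && t - 2 < nblocks) ? (t - 2) % 3 : -1;
    return s;
}
```

With nblocks set to 6, step 2 yields PUT B0, GET B2, and Compute B1, and step 7 yields only PUT B2, which matches the third column of the table.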

Building Input-only DMA lists
Build input-only DMA lists at the end of the target buffer. Input data frequently is not sequentially accessed and must be gathered from multiple locations. If there are few locations (fewer than 16), individual DMAs can be initiated for each region. If there are many locations, a DMA list is recommended.

To save the local storage space required to hold the list, the list element array can be constructed at the end of the target buffer and overwritten by the transfer. The processor architecture guarantees that the list elements within a list command are started and issued in sequence, so as long as each list element is of non-zero size, the list array is not overwritten before it is processed.

Reducing source code impact with overlay
Exploit overlay techniques to reduce the impact of program code on local storage. To save local storage used by the program, infrequently used code sections can be overlaid and brought into local store only when required.

Avoiding malloc heaps
Avoid malloc heaps. Using a general-purpose malloc heap to allocate SPE local storage consumes valuable space for both the code and data structures required to manage the heap. Instead, predefine data buffers as static arrays, or allocate them on the stack at run time using alloca.
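Both alternatives to a heap can be sketched in a few lines. The example assumes a GCC-compatible compiler and a C library that provides alloca.h; io_buf, io_buffer, BUF_BYTES, and checksum_reversed are hypothetical names chosen for illustration.

```c
#include <alloca.h>
#include <stddef.h>
#include <stdint.h>

#define BUF_BYTES 16384  /* hypothetical buffer size, a multiple of 128 */

/* Fixed transfer buffers, reserved at link time instead of on a heap. */
static uint8_t io_buf[2][BUF_BYTES] __attribute__((aligned(128)));

/* Return one of the two statically allocated buffers. */
uint8_t *io_buffer(int which)
{
    return io_buf[which & 1];
}

/* Scratch space whose size is known only at run time comes from the
 * stack and is released automatically when the function returns.
 * Keep n small: on an SPE the stack shares the 256KB local store. */
uint32_t checksum_reversed(const uint8_t *src, size_t n)
{
    uint8_t *tmp = alloca(n);
    uint32_t sum = 0;

    for (size_t i = 0; i < n; i++)   /* copy the input in reverse order */
        tmp[i] = src[n - 1 - i];
    for (size_t i = 0; i < n; i++)   /* simple polynomial checksum */
        sum = sum * 31u + tmp[i];
    return sum;
}
```

Neither approach links in allocator code or keeps heap bookkeeping structures in local store.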

Additional programming tips and best practices for applications to execute efficiently on CBEA processors to achieve optimal performance are detailed in the CBE Programming Handbook, CBE Programming Tutorial, and other documentation (see Resources).

High-level development tools (such as Gedae, RapidMind, and the IBM XL C/C++ single-source compiler) optimize CBEA programs in their entirety, both at compile time and at runtime. These tools automatically optimize memory alignment, memory access distributions, and data transfer attributes by using the techniques previously described.

SPE-accelerated libraries (like BLAS, LAPACK, and the CodeSourcery Math Library) are called directly from a PPU program, and they simply accelerate the requested function. Data arrays are typically allocated by the application, and they might require specific alignment and allocation strategies in order for the library to achieve optimal transfer characteristics. Consult the individual library's user documentation for specific library-usage instructions.


Using performance tools to monitor memory performance

With its innovative design, the CBEA processor micro-architecture supports a high degree of parallel processing capability at various levels for exploiting the following:

  • Data-level parallelism (DLP) in its SPU SIMD engines
  • Instruction-level parallelism (ILP) in its execution pipelines
  • Thread-level parallelism (TLP) in its multiple SPUs

Its memory subsystem design provides a high degree of concurrent data transfers under application control to support high memory performance. With these features exposed, an application can take full control of its resources to achieve optimum performance.

To assist programmers in tuning application performance, a performance event-monitoring facility has been implemented in the CBEA processor to monitor activities in its processor pipelines, execution dependencies, resource usages, and so on. A comprehensive set of performance events and signals has been connected from each component and island (such as PPU execution units, PPU storage subsystem, SPUs, Memory Flow Control units, Bus Interface controller, Memory Interface Controller) to performance counters and tracing facilities for recording the frequency and time (in cycles) of each performance event.

This section addresses performance events for monitoring memory activities and the use of the Cell Performance Counter (CPC) performance tool for tuning and debugging memory performance. Table 2 summarizes the memory performance events for the two Rambus Extreme Data Rate (XDR) memory interfaces: XIO0 and XIO1.


Table 2. Critical memory performance events
Event number      Event type                 Event name and description
XIO0 | XIO1
----------------  -------------------------  --------------------------
7209 | 7109       Count cycles               Read command queue is empty.
7210 | 7110       Count cycles               Write command queue is empty.
7212 | 7112       Count cycles               Read command queue is full.
7213 | 7113       Count single-cycle events  Memory interface controller responds with a Retry for a read command because the read command queue is full.
7214 | 7114       Count cycles               Write command queue is full.
7215 | 7115       Count single-cycle events  Memory interface controller responds with a Retry for a write command because the write command queue is full.
7234 | 7134       Count single-cycle events  Read command dispatched; includes high-priority and fast-path reads.
7235/7335 | 7135  Count single-cycle events  Write command dispatched.
7236/7336 | 7136  Count single-cycle events  Read-Modify-Write command (data size < 16 bytes) dispatched.
7237/7337 | 7137  Count single-cycle events  Refresh dispatched.
7339 | 7139       Count single-cycle events  Byte-masking write command (data size >= 16 bytes) dispatched.
7241 | 7141       Count single-cycle events  Write command dispatched after a read command was previously dispatched.
7242 | 7142       Count single-cycle events  Read command dispatched after a write command was previously dispatched.

In general, these events can monitor memory read and write activities, as well as queuing behaviors, at the memory controller. The count cycles event type counts the cycles during which the event is active. For example, events 7109 and 7110 can be used to count the number of cycles that the read and write command queues are empty, which indicates idleness of the memory controller in processing read or write commands.

The count single-cycle events event type counts occurrences of an event. For example, events 7135 and 7235 can be counted to show how many write commands the memory controller issued. Note that each write command corresponds to a 128-byte data write and that a program-requested DMA is broken into multiple 128-byte writes. As a result, write throughput can be derived by multiplying the count by the data size (128 bytes) and dividing by the execution time (in cycles or seconds).
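The throughput derivation can be written out as a short function. This is a minimal sketch; write_bytes_per_sec is a hypothetical name, and the caller supplies the processor clock rate (3.2GHz on the QS21 and QS22).

```c
#include <stdint.h>

/* Write throughput in bytes per second, given the number of write
 * commands counted over an interval of cycles processor cycles at a
 * clock rate of clock_hz. Each write command moves 128 bytes. */
double write_bytes_per_sec(uint64_t write_cmds, uint64_t cycles, double clock_hz)
{
    double seconds = (double)cycles / clock_hz;

    return (double)write_cmds * 128.0 / seconds;
}
```

For example, 1,958,804 write commands counted over 200 million cycles at 3.2GHz give 250,726,912 bytes in 62.5 msec, or about 4.01 billion bytes per second (roughly 3.74GB per second when 1GB is taken as 2^30 bytes).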

The CPC performance tool can be used to select the events for monitoring. The four 32-bit counters implemented in the performance monitoring facility allow for any 4 events (from 2 islands or fewer) to be monitored at the same time when using CPC. For a detailed description of the usage of CPC and other performance tools, refer to the performance tuning guide in Resources.

To evaluate balanced usage of both memory channels (XIO0 and XIO1), count the numbers of read and write commands for each channel with events 7134, 7135, 7234, and 7235. The read-write ratio follows by dividing the counts of events 7134 and 7234 by those of events 7135 and 7235. The counts from events 7109, 7110 and 7209, 7210 indicate the amount of time that the channels are idle, while high counts from events 7112, 7114 and 7212, 7214 indicate high utilization of the channels. Events 7113, 7115 and 7213, 7215 deserve particular attention because they count retries, which can be costly when system utilization is high.

Events 7136 and 7236 monitor read-modify-write (RMW) activities on memory accesses. Along with signals 7135 and 7235, derive the RMW-to-write ratio to serve as a good indicator of the severity of RMWs that are caused by smaller-than-128-byte or unaligned writes.
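The ratio can be computed from the four counter values as sketched below. The function name rmw_to_write_ratio is hypothetical, and returning -1.0 when no write commands were counted is an arbitrary convention chosen for this sketch.

```c
#include <stdint.h>

/* Ratio of RMW commands (events 7136 and 7236) to write commands
 * (events 7135 and 7235), summed over both channels. Returns -1.0
 * when no write commands were counted, because the ratio is then
 * undefined. */
double rmw_to_write_ratio(uint64_t rmw_xio1, uint64_t rmw_xio0,
                          uint64_t wr_xio1, uint64_t wr_xio0)
{
    uint64_t rmw = rmw_xio1 + rmw_xio0;
    uint64_t wr  = wr_xio1 + wr_xio0;

    if (wr == 0)
        return -1.0;
    return (double)rmw / (double)wr;
}
```

A ratio near zero means that almost all writes are full 128-byte writes. A ratio far above one indicates that small or unaligned writes dominate.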

The following example monitors the write and RMW activities on both XIO1 and XIO0 channels. Every 200 million cycles (or 62.5 msec), it samples the occurrence of write and RMW commands during the execution of the prog1 application.

cpc  -i 200000000 --sampling-mode count --events  7135,7136,7235,7236  prog1

Sample  7135          7136    7235           7236
------  -----------   ------  -----------    -----------
0       80983         37      81173          36
1       404564        3       404602         4
2       981142        4       981165         14
3       972268        0       972282         0
4       977525        0       977484         0
5       974026        0       974023         0
6       973806        0       973811         0
7       979406        0       979398         0
8       971131        0       971132         0
9       979516        0       979520         0
10      973916        2       973923         0
11      973922        0       973922         0
12      980355        0       980355         0
13      971387        0       971384         0
..      ...           ..      ...            ..

This example shows almost no RMW activity during the execution of prog1, which consumes about 3.74GBps of memory bandwidth writing to memory during sample 7: (979406+979398)*128B/62.5msec, with 1GB taken as 2^30 bytes.

Here is another example showing a significant number of RMWs in the execution of the prog2 application:

cpc  -i 200000000 --sampling-mode count --events  7135,7136,7235,7236  prog2

Sample  7135           7136          7235           7236
------  -----------    -----------   -----------    -----------
0       79518          135           79743          262
1       8592           0             8588           1
2       88358          169946        88318          169962
3       543            613353        550            613341
4       788            614463        800            614467
5       334            615680        329            615680
6       298            612896        298            612913
7       297            613147        297            613155
8       296            612716        298            612659
9       297            613285        296            613335
10      342            613190        330            613178
11      308            612471        304            612460
12      323            615415        328            615426
..      ...            ..            ...            ..

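As you can see, from sample 3 onward prog2 issues more than 600,000 RMW commands per channel in each sample against only a few hundred full writes. Small writes in this volume can cause severe performance loss.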

With more RMW activity, effective memory bandwidth can be severely reduced. However, the impact on application or system performance depends heavily on whether RMW times can still be hidden in computation cycles and on the share of RMW latency in the overall execution time. Because the performance loss from RMWs can be severe for some applications, tune the application to issue minimal or no RMWs for optimum performance.

Conclusion

The IBM BladeCenter QS21 and QS22 CBEA processors provide direct control of memory accesses using DMA commands. Compared to general-purpose processors, CBEA processors have the advantage of maximizing performance by allowing applications to schedule their memory accesses directly without relying on a cache hierarchy.

Programs control the data reads and writes of CBEA processors, transferring only the data that is required for processing without wasting precious memory and system-bus bandwidth on speculative data prefetches. The architecture enables applications to overlap data transfers with computation and processing to completely hide the memory access latency.

If applications apply best practices when accessing memory, the DMA-based memory model achieves high computation efficiency and performance. Performance tools, such as CPC, are available to help identify inefficient memory usage patterns and to help eliminate application memory performance issues.


Resources

Learn

Get products and technologies

Discuss
