October 16, 2020 By Kean Kuiper 4 min read

Before rolling out a new hardware platform to the IBM Cloud, we thoroughly evaluate its performance.

As you can see below, we stress systems to their limits, uncovering performance bottlenecks and addressing them before they impact your business.

In this post, we describe an issue we discovered and resolved with certain multi-socket systems using 2nd-Gen Intel® Xeon® Scalable Processors (formerly code-named “Cascade Lake”).

System topology

In multi-socket systems, contemporary Intel processors use Ultra Path Interconnect (UPI) links to logically tie sockets together. The system in this study has four sockets, using processors with three UPI links each, allowing a fully connected topology:

Each processor (P0 – P3) has locally attached memory (M) and a direct UPI link to every other processor. Processors use UPI links to remotely access the memory attached to other processors.
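The link arithmetic behind "fully connected" can be sketched in a few lines. This is just a model of the topology described above (four sockets, three UPI links each), not code from our measurement framework:

```python
# Minimal model of a fully connected multi-socket UPI topology.
# Each processor needs a direct link to every other processor, so a
# system with n sockets requires n - 1 UPI links per processor and
# n * (n - 1) / 2 links in total.

def links_per_processor(sockets: int) -> int:
    return sockets - 1

def total_links(sockets: int) -> int:
    return sockets * (sockets - 1) // 2

# The system in this study: four sockets, three UPI links each.
assert links_per_processor(4) == 3
assert total_links(4) == 6
```

This is also why fully connected topologies get expensive as sockets are added: an eight-socket system would need seven links per processor.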

Memory throughput collapse

We use a variety of workloads to characterize platform performance. This study focuses on the STREAM Triad benchmark, which we use to measure memory throughput in configurations that assess both local and remote memory performance. The following configuration places maximum stress on remote memory serving:

Memory attached to P3 is remotely accessed by three other processors simultaneously. In its memory serving role, P3 has no active cores. We vary the number of active cores on the other processors, starting with one on each, then two, and so on.
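On Linux, this kind of placement can be expressed with numactl, binding threads to one socket and all allocations to P3's memory (NUMA node 3). The sketch below only builds the command lines; the `./stream` binary name and the thread counts are illustrative, not our actual harness:

```python
# Sketch of the workload placement: one STREAM instance per active
# processor (P0-P2), with every allocation bound to P3's memory.
# The "./stream" binary name here is illustrative.

def placement_cmd(node: int, threads: int, mem_node: int = 3) -> str:
    # OMP_NUM_THREADS sets the number of active cores on the socket;
    # numactl pins the threads to one socket and the memory to the
    # remote node.
    return (f"OMP_NUM_THREADS={threads} "
            f"numactl --cpunodebind={node} --membind={mem_node} ./stream")

# One instance per active processor, e.g. two active cores on each:
cmds = [placement_cmd(node, threads=2) for node in range(3)]
```

Sweeping the `threads` argument from one core per socket upward reproduces the increasing-load axis of the plot below.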

This is a plot of measured throughput versus increasing load:

Things look normal on the left, where throughput rises with more active cores, reaching an expected peak consistent with the limits of the system. However, as the load continues to increase, throughput collapses, falling to around 6% of the peak.

Potential impacts

Intense remote memory traffic like this is typically avoided by NUMA-aware (Non-Uniform Memory Access) applications, but some applications are not suited to such tuning, and others may not be configured properly. Further, while we studied this throughput collapse in steady state, we recognize that transient conditions can trigger it temporarily.

A dramatic drop in memory throughput can obviously affect application performance, but even temporary occurrences of this issue can easily produce spikes in application response latency.

Identifying the root cause

In our first conversation with Intel about this issue, they suggested that we vary a BIOS parameter called Local/Remote Threshold (part of UPI Configuration under North Bridge) to see if that might help. The parameter has various settings including High, Medium, and Low. In this case, changing it made no difference.

We then worked with Intel to help them reproduce the issue, extracting the essential elements from our measurement framework. Further work by the Intel team revealed that the issue was tied to how Local/Remote Threshold was mapped internally for our hardware topology.

The Local/Remote Threshold setting configures undocumented registers in the processor. Intel showed us how to inspect these registers and to modify one of them, which we will call register R here. Adjusting R allowed us to solve the memory collapse issue, as we show below.
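On Linux, processor registers of this kind can be inspected through the msr driver (`modprobe msr`, root required). The sketch below shows only the mechanics; the actual address of register R is undocumented, so `REG_R` here is a placeholder, not a real value:

```python
import struct

def decode_msr(raw: bytes) -> int:
    # Model-specific registers are 64-bit little-endian values.
    return struct.unpack("<Q", raw)[0]

def read_msr(cpu: int, reg: int) -> int:
    # The Linux msr driver exposes each core's MSR space as
    # /dev/cpu/<n>/msr; the register address is the file offset.
    with open(f"/dev/cpu/{cpu}/msr", "rb") as f:
        f.seek(reg)
        return decode_msr(f.read(8))

REG_R = 0x0  # placeholder: the real address of register R is undocumented
```

The rdmsr and wrmsr utilities from msr-tools wrap the same interface for interactive use.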

Our earlier attempts to alter the Local/Remote Threshold had no effect because all available choices on the BIOS menu mapped to the same value of register R. This mapping from BIOS menu choice to R depends on the hardware topology. For example, the mappings for eight-socket systems and for four-socket “ring” topologies (with two UPI links per processor) both vary R with the menu choice. In our case, the BIOS vendor had correctly implemented the recommended mapping for a four-socket fully connected topology; it was that recommendation itself that assigned a single value of R to every menu choice.

New choices for register R

This shows memory throughput in the most challenging case of the problem workload, where all cores are active on each of the three processors running STREAM:

We chose register R values X, Y, and Z as possible alternatives to the original value.

Here is the problem workload with the original value of register R and the three new candidates across a range of active cores:

Clearly, all three new values address the original problem. After reaching its peak, throughput is now nearly flat in the face of increasing load.

Zooming in more closely, we see small differences in throughput between the new choices:

Note that magnification exaggerates the drop from the peak, as even value Z sustains more than 95% of best memory throughput.

Validating changes to register R

We then circled back to our full suite of memory throughput measurements to see how they were affected by the new candidate values for register R. Each was examined at the level of detail discussed above. For brevity, we present the geometric mean of all these workloads, which gives a good sense of the overall impact of varying R:

Value Y yields a balanced improvement, but each new choice significantly outperforms the original.
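For reference, the geometric mean used to summarize the workloads above can be sketched as follows (Python 3.8+ also ships this as statistics.geometric_mean):

```python
import math

def geometric_mean(values: list[float]) -> float:
    # The geometric mean is the n-th root of the product; computing it
    # as exp(mean(log(x))) avoids overflow across many workloads.
    return math.exp(sum(math.log(v) for v in values) / len(values))

# e.g. two workloads at 2 GB/s and 8 GB/s have a geometric mean of 4 GB/s
gm = geometric_mean([2.0, 8.0])
```

Unlike the arithmetic mean, the geometric mean keeps one unusually fast workload from masking a regression in the others, which is why it is the conventional summary for throughput ratios.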

Delivering the solution

Our BIOS vendor provided a new version that enables the preferred choices for register R in systems with four fully connected sockets.

If you have read this far, we hope you have a better understanding of the IBM Cloud approach to performance.
