October 16, 2020 By Kean Kuiper 4 min read

Before rolling out a new hardware platform to the IBM Cloud, we thoroughly evaluate its performance.

As you can see below, we stress systems to their limits, uncovering performance bottlenecks and addressing them before they impact your business.

In this post, we describe an issue we discovered and resolved with certain multi-socket systems using 2nd-Gen Intel® Xeon® Scalable Processors (formerly code-named “Cascade Lake”).

System topology

In multi-socket systems, contemporary Intel processors use Ultra Path Interconnect (UPI) links to logically tie sockets together. The system in this study has four sockets, using processors with three UPI links each, allowing a fully connected topology:

Each processor (P0 – P3) has locally attached memory (M) and a direct UPI link to every other processor. Processors use UPI links to remotely access the memory attached to other processors.
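The link arithmetic behind "fully connected" can be sketched in a few lines. This is just a model of the topology described above (four sockets, three UPI links each), not code from our measurement framework:

```python
# Minimal model of a fully connected multi-socket UPI topology.
# Each processor needs a direct link to every other processor, so a
# system with n sockets requires n - 1 UPI links per processor and
# n * (n - 1) / 2 links in total.

def links_per_processor(sockets: int) -> int:
    return sockets - 1

def total_links(sockets: int) -> int:
    return sockets * (sockets - 1) // 2

# The system in this study: four sockets, three UPI links each.
assert links_per_processor(4) == 3
assert total_links(4) == 6
```

This is also why fully connected topologies get expensive as sockets are added: an eight-socket system would need seven links per processor.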

Memory throughput collapse

We use a variety of workloads to characterize platform performance. This study focuses on the STREAM Triad benchmark, which we use to measure memory throughput in configurations that assess both local and remote memory performance. The following configuration places maximum stress on remote memory serving:

Memory attached to P3 is remotely accessed by three other processors simultaneously. In its memory serving role, P3 has no active cores. We vary the number of active cores on the other processors, starting with one on each, then two, and so on.
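On Linux, this kind of placement can be expressed with numactl, binding threads to one socket and all allocations to P3's memory (NUMA node 3). The sketch below only builds the command lines; the `./stream` binary name and the thread counts are illustrative, not our actual harness:

```python
# Sketch of the workload placement: one STREAM instance per active
# processor (P0-P2), with every allocation bound to P3's memory.
# The "./stream" binary name here is illustrative.

def placement_cmd(node: int, threads: int, mem_node: int = 3) -> str:
    # OMP_NUM_THREADS sets the number of active cores on the socket;
    # numactl pins the threads to one socket and the memory to the
    # remote node.
    return (f"OMP_NUM_THREADS={threads} "
            f"numactl --cpunodebind={node} --membind={mem_node} ./stream")

# One instance per active processor, e.g. two active cores on each:
cmds = [placement_cmd(node, threads=2) for node in range(3)]
```

Sweeping the `threads` argument from one core per socket upward reproduces the increasing-load axis of the plot below.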

This is a plot of measured throughput versus increasing load:

Things look normal on the left, where throughput rises with more active cores, reaching an expected peak consistent with the limits of the system. However, as the load continues to increase, throughput collapses, falling to around 6% of the peak.

Potential impacts

Intense remote memory traffic like this is typically avoided by NUMA-aware (Non-Uniform Memory Access) applications, but some applications are not suited to such tuning, and others may not be configured properly. Further, while we studied this throughput collapse in steady state, we recognize that transient conditions can trigger it temporarily.

A dramatic drop in memory throughput can obviously affect application performance, but even temporary occurrences of this issue can easily produce spikes in application response latency.

Identifying the root cause

In our first conversation with Intel about this issue, they suggested that we vary a BIOS parameter called Local/Remote Threshold (part of UPI Configuration under North Bridge) to see if that might help. The parameter has various settings including High, Medium, and Low. In this case, changing it made no difference.

We then worked with Intel to help them reproduce the issue, extracting the essential elements from our measurement framework. Further work by the Intel team revealed that the issue was tied to how Local/Remote Threshold was mapped internally for our hardware topology.

The Local/Remote Threshold setting configures undocumented registers in the processor. Intel showed us how to inspect these registers and to modify one of them, which we will call register R here. Adjusting R allowed us to solve the memory collapse issue, as we show below.
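On Linux, processor registers of this kind can be inspected through the msr driver (`modprobe msr`, root required). The sketch below shows only the mechanics; the actual address of register R is undocumented, so `REG_R` here is a placeholder, not a real value:

```python
import struct

def decode_msr(raw: bytes) -> int:
    # Model-specific registers are 64-bit little-endian values.
    return struct.unpack("<Q", raw)[0]

def read_msr(cpu: int, reg: int) -> int:
    # The Linux msr driver exposes each core's MSR space as
    # /dev/cpu/<n>/msr; the register address is the file offset.
    with open(f"/dev/cpu/{cpu}/msr", "rb") as f:
        f.seek(reg)
        return decode_msr(f.read(8))

REG_R = 0x0  # placeholder: the real address of register R is undocumented
```

The rdmsr and wrmsr utilities from msr-tools wrap the same interface for interactive use.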

Our earlier attempts to alter the Local/Remote Threshold had no effect because all available choices on the BIOS menu mapped to the same value of register R. This mapping from BIOS menu choice to R depends on the hardware topology. For example, the mappings for eight-socket systems and for four-socket “ring” topologies (with two UPI links per processor) both vary R with the menu choice. In our case, the BIOS vendor had correctly implemented the recommended mapping for a four-socket fully connected topology; it was that recommendation itself that assigned a single value of R to every menu choice.

New choices for register R

This shows memory throughput in the most challenging case of the problem workload, where all cores are active on each of the three processors running STREAM:

We chose register R values X, Y, and Z as possible alternatives to the original value.

Here is the problem workload with the original value of register R and the three new candidates across a range of active cores:

Clearly, all three new values address the original problem. After reaching its peak, throughput is now nearly flat in the face of increasing load.

Zooming in more closely, we see small differences in throughput between the new choices:

Note that magnification exaggerates the drop from the peak, as even value Z sustains more than 95% of best memory throughput.

Validating changes to register R

We then circled back to our full suite of memory throughput measurements to see how they were affected by the new candidate values for register R. Each was examined at the level of detail discussed above. For brevity, we present the geometric mean of all these workloads, which gives a good sense of the overall impact of varying R:

Value Y yields a balanced improvement, but each new choice significantly outperforms the original.
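For reference, the geometric mean used to summarize the workloads above can be sketched as follows (Python 3.8+ also ships this as statistics.geometric_mean):

```python
import math

def geometric_mean(values: list[float]) -> float:
    # The geometric mean is the n-th root of the product; computing it
    # as exp(mean(log(x))) avoids overflow across many workloads.
    return math.exp(sum(math.log(v) for v in values) / len(values))

# e.g. two workloads at 2 GB/s and 8 GB/s have a geometric mean of 4 GB/s
gm = geometric_mean([2.0, 8.0])
```

Unlike the arithmetic mean, the geometric mean keeps one unusually fast workload from masking a regression in the others, which is why it is the conventional summary for throughput ratios.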

Delivering the solution

Our BIOS vendor provided a new version that enables the preferred choices for register R in systems with four fully connected sockets.

If you have read this far, we hope you have a better understanding of the IBM Cloud approach to performance.
