May 19, 2023 By Cheuk Lam 4 min read

It has been a year and a half since we rolled out the throttling-aware container CPU sizing feature for IBM Turbonomic, and it has captured a lot of attention, for good reason. As illustrated in our first blog post, setting the wrong CPU limit silently kills your application's performance while everything is literally working as designed.

Turbonomic visualizes throttling metrics and, more importantly, takes throttling into consideration when recommending CPU limit sizing. Not only does Turbonomic expose this silent performance killer, it also prescribes the CPU limit value that minimizes its impact on your containerized application performance.

In this new post, we are going to talk about a significant improvement in the way that we measure the level of throttling. Prior to this improvement, our throttling indicator was calculated based on the percentage of throttled periods. With such a measurement, throttling was underestimated for applications with a low CPU limit and overestimated for those with a high CPU limit. That resulted in sizing up high-limit applications too aggressively as we tuned our decision-making toward low-limit applications to minimize throttling and guarantee their performance.

In this recent improvement, we measure throttling based on the percentage of time throttled. In this post, we will show you how this new measurement works and why it will correct both the underestimation and the overestimation mentioned above:

  • Brief revisit of CPU throttling
  • The old/biased way: Period-based throttling measurement
  • The new/unbiased way: Time-based throttling measurement
  • Benchmarking results
  • Release

Brief revisit of CPU throttling

If you watch this demo video, you can see a similar illustration of throttling. There it is a single-threaded container app with a CPU limit of 0.4 core (or 400m). In Linux, the 400m limit is translated to a cgroup CPU quota of 40ms per 100ms, where 100ms is the default quota enforcement period in Linux that Kubernetes adopts. That means that the app can only use 40ms of CPU time in each 100ms period before it is throttled for the remaining 60ms. For a 200ms task (like the one shown below), this repeats four times, and the task finally completes in the fifth period without being throttled. Overall, the 200ms task takes 100 * 4 + 40 = 440ms to complete, more than twice the CPU time it actually needs:
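The arithmetic above can be reproduced with a small simulation. This is only a sketch of the simplified model used in this post (a single-threaded task that always has work to do, with quota enforced per fixed period); the function and class names are hypothetical and it is not how the Linux scheduler or Turbonomic is implemented.

```python
from dataclasses import dataclass


@dataclass
class ThrottleStats:
    nr_periods: int      # runnable enforcement periods
    nr_throttled: int    # periods in which the task hit its quota
    throttled_ms: float  # total time spent throttled
    wall_ms: float       # total time until the task completes


def simulate_task(cpu_needed_ms: float, limit_millicores: int,
                  period_ms: float = 100.0) -> ThrottleStats:
    """Simplified model: one thread, busy until done, quota reset each period."""
    if limit_millicores <= 0:
        raise ValueError("limit_millicores must be positive")
    quota_ms = period_ms * limit_millicores / 1000.0
    remaining = cpu_needed_ms
    nr_periods = nr_throttled = 0
    throttled_ms = wall_ms = 0.0
    while remaining > 0:
        nr_periods += 1
        # A single thread cannot use more than one full period of CPU time.
        run = min(quota_ms, period_ms, remaining)
        remaining -= run
        if remaining > 0:
            wall_ms += period_ms
            if run < period_ms:
                # Quota exhausted with work left: throttled until the period ends.
                nr_throttled += 1
                throttled_ms += period_ms - run
        else:
            wall_ms += run  # task finishes partway through this period
    return ThrottleStats(nr_periods, nr_throttled, throttled_ms, wall_ms)


stats = simulate_task(cpu_needed_ms=200, limit_millicores=400)
print(stats)  # 5 periods, 4 throttled, 240.0 ms throttled, 440.0 ms wall time
```

Under these assumptions, the simulated counters match the 200ms task with a 400m limit described above.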

Linux provides the following metrics related to throttling, which cAdvisor monitors and feeds to Kubernetes:

| Linux Metric | cAdvisor Metric | Value (in the above example) | Explanation |
| --- | --- | --- | --- |
| nr_periods | container_cpu_cfs_periods_total | 5 | This is the number of runnable periods. In the example, there are five. |
| nr_throttled | container_cpu_cfs_throttled_periods_total | 4 | It is throttled for only four out of the five runnable periods. In the fifth period, the request is completed, so it is no longer throttled. |
| throttled_time | container_cpu_cfs_throttled_seconds_total | 240ms | For the first four periods, it runs for 40ms and is throttled for 60ms. Therefore, the total throttled time is 60ms * 4 = 240ms. |
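As a rough illustration of where these counters come from, the sketch below parses the text of a cgroup `cpu.stat` file. The sample content is made up to match the example above, and the helper names are hypothetical. It assumes the cgroup v1 layout, where `throttled_time` is reported in nanoseconds; cgroup v2 reports `throttled_usec` in microseconds instead, which the helper also handles.

```python
def parse_cpu_stat(text: str) -> dict[str, int]:
    """Parse 'key value' lines from a cgroup cpu.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_ms(stats: dict[str, int]) -> float:
    """Return total throttled time in milliseconds for cgroup v1 or v2 stats."""
    if "throttled_time" in stats:      # cgroup v1: nanoseconds
        return stats["throttled_time"] / 1_000_000
    if "throttled_usec" in stats:      # cgroup v2: microseconds
        return stats["throttled_usec"] / 1_000
    raise KeyError("no throttled time field found in cpu.stat")


# Made-up sample mirroring the example: 5 periods, 4 throttled, 240ms throttled.
SAMPLE_CPU_STAT = """\
nr_periods 5
nr_throttled 4
throttled_time 240000000
"""

stats = parse_cpu_stat(SAMPLE_CPU_STAT)
print(stats["nr_periods"], stats["nr_throttled"], throttled_ms(stats))
```

On a real node, the same text would be read from the container's cgroup directory rather than from a string.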

The old/biased way: Period-based throttling measurement

As mentioned at the beginning, we used to measure the throttling level as the percentage of runnable periods that are throttled. In the above example, that would be 4 / 5 = 80%.

There is a significant bias with this measurement. Consider a second container application that has a CPU limit of 800m, as shown below. A task with 400ms processing time will run 80ms and then be throttled for 20ms in each of the first four enforcement periods of 100ms. It will then be completed in the fifth period. The period-based measurement arrives at the same percentage for this app: 4 / 5 = 80%. But clearly, this second app suffers far less than the first app. It is throttled for only 20ms * 4 = 80ms total, just a fraction of the 400ms CPU run time. The 80% throttling level reported by the period-based measurement is far too high to reflect the true situation of this app.
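To make the bias concrete, here is a minimal sketch of the period-based calculation applied to both apps. The function name is hypothetical; the counter values are the ones from the two examples above.

```python
def period_based_throttling_pct(nr_throttled: int, nr_periods: int) -> float:
    """Percentage of runnable periods in which the container was throttled."""
    if nr_periods == 0:
        return 0.0
    return 100.0 * nr_throttled / nr_periods


# First app: 400m limit, 200ms task. Second app: 800m limit, 400ms task.
first = period_based_throttling_pct(nr_throttled=4, nr_periods=5)
second = period_based_throttling_pct(nr_throttled=4, nr_periods=5)
print(first, second)  # both 80.0, although the apps lose 240ms and 80ms respectively
```

Because the calculation only counts periods, it cannot see how long each throttled period actually stalls the app.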

We needed a better way to measure throttling, and we created it:

The new/unbiased way: Time-based throttling measurement

With the new way, we measure the level of throttling as the percentage of time throttled out of the total runnable time, that is, the time spent using the CPU plus the time spent being throttled. Here are the new measurements of the above two apps:

| Application | Throttled Time | Total Runnable Time | Percentage Time Throttled |
| --- | --- | --- | --- |
| First | 240ms | 200ms + 240ms = 440ms | 240ms / 440ms = 55% |
| Second | 80ms | 400ms + 80ms = 480ms | 80ms / 480ms = 17% |

These two numbers, 55% and 17%, make more sense than the original 80%. Not only are they two different numbers that differentiate the two application scenarios, but their respective values also more appropriately reflect the true impact of throttling, as you could perhaps visualize from the two graphs. Intuitively, the new measurement can be interpreted as how much the overall task time could be reduced by getting rid of throttling. For the first app, we can reduce the overall task time by 240ms (55% of the total). For the second app, the reduction is merely 17%, which is not as significant as for the first app.
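The time-based calculation can be sketched in the same style. The function name is hypothetical, and the inputs are the CPU usage and throttled times of the two example apps.

```python
def time_based_throttling_pct(throttled_ms: float, usage_ms: float) -> float:
    """Throttled time as a percentage of total runnable time (usage + throttled)."""
    total_ms = usage_ms + throttled_ms
    if total_ms == 0:
        return 0.0
    return 100.0 * throttled_ms / total_ms


first = time_based_throttling_pct(throttled_ms=240, usage_ms=200)
second = time_based_throttling_pct(throttled_ms=80, usage_ms=400)
print(f"{first:.1f}% {second:.1f}%")  # 54.5% 16.7%
```

Rounded to whole numbers, these are the 55% and 17% shown in the table, and each value is also the share of the task's elapsed time that would disappear if throttling were removed.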

Benchmarking results

Below, you’ll see some data comparing the throttling measurements computed using throttled periods with the time-based version.

For a container with low CPU limits, the time-based measurement shows much higher throttling percentages compared to the older version that uses only throttling periods, as expected.

As the CPU limits go up, the time-based measurements again accurately reflect lower throttling percentages. Conversely, the older version shows a much higher throttling percentage, which can result in an aggressive resize-up in spite of the CPU limit being high enough.

| Container (namespace/pod/container) | Number of Cores | CPU Limit (m) | Throttled Periods | Total Periods | Old Average (%) | Throttled Time (ms) | Total Usage (ms) | New Average (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| throttling-auto/low-cpu-high-throttling-77b6b5f84c-p97v8/kube-rbac-proxy-main | 10 | 20 | 21 | 75 | 28 | 2,884.59 | 76.23 | 97.42537968 |
| throttling-auto/low-cpu-high-throttling-77b6b5f84c-p97v8/low-cpu-high-throttling-spec | 10 | 20 | 64 | 148 | 43.24324324 | 9,690.95 | 170.8 | 98.26808196 |
| monitoring/kube-state-metrics-6c6f446b4-hrq7v/kube-rbac-proxy-main | 12 | 20 | 339 | 567 | 59.78835979 | 43,943.63 | 827.91 | 98.15081538 |
| throttling-auto/low-cpu-high-throttling-77b6b5f84c-njptn/kube-state-metrics | 12 | 100 | 360 | 8154 | 4.415011038 | 17,296.02 | 21,838.65 | 44.19615579 |
| dummy-ns/beekman-change-reconciler-5dbdcdb49b-sg2f9/beekman-2 | 10 | 200 | 8202 | 8563 | 95.78418778 | 488,921.77 | 168,961.80 | 74.31737012 |
| dummy-ns/beekman-change-reconciler-5dbdcdb49b-5mktb/beekman-2 | 12 | 200 | 8576 | 8586 | 99.88353133 | 554,103.75 | 171,659.58 | 76.34771956 |
| quota-test/cpu-quota-1-7f84f77bc5-ztdbm/cpu-quota-1-spec | 12 | 500 | 3531 | 8566 | 41.2211067 | 59,267.71 | 357,274.10 | 14.22851472 |
| turbo/kubeturbo-arsen-170-203-599fbdcff6-vbl55/kubeturbo-arsen-170-203-spec | 10 | 1000 | 101 | 1739 | 5.807935595 | 6,300.33 | 32,319.39 | 16.31375702 |
| default/nri-bundle-newrelic-logging-v8fqb/newrelic-logging | 12 | 1300 | 1 | 8250 | 0.012121212 | 11.86 | 177,353.93 | 0.00668406 |
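As a sanity check, both averages can be recomputed from the raw counters in the benchmark table. The sketch below does this for three of the rows; the tuple layout and variable names are illustrative only, and the numbers are copied from the table as we read its columns.

```python
# (container, throttled periods, total periods, throttled ms, usage ms,
#  reported old average %, reported new average %)
ROWS = [
    ("kube-state-metrics", 360, 8154, 17296.02, 21838.65, 4.415011038, 44.19615579),
    ("beekman-2", 8202, 8563, 488921.77, 168961.80, 95.78418778, 74.31737012),
    ("cpu-quota-1-spec", 3531, 8566, 59267.71, 357274.10, 41.2211067, 14.22851472),
]

results = []
for name, thr_p, tot_p, thr_ms, use_ms, old_rep, new_rep in ROWS:
    old_pct = 100.0 * thr_p / tot_p                 # period-based measurement
    new_pct = 100.0 * thr_ms / (thr_ms + use_ms)    # time-based measurement
    results.append((name, old_pct, new_pct, old_rep, new_rep))
    print(f"{name}: old={old_pct:.2f}% new={new_pct:.2f}%")
```

The recomputed values agree with the reported averages to within rounding, which suggests that "Old Average" is throttled periods over total periods and "New Average" is throttled time over throttled time plus usage.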

Release

This new measurement of throttling has been available since IBM Turbonomic release 8.7.5. Furthermore, in release 8.8.2, we also allow users to customize the max throttling tolerance for each individual application or group of applications, as we fully recognize different applications have different needs in terms of tolerating throttling. For example, response-time-sensitive applications like web-services applications may have lower tolerance while batch applications like big machine learning jobs may have much higher tolerance. Now, users can configure the desired level as they want.

Learn more about IBM Turbonomic.