A quick primer on the concept of "stolen" CPU cycles when using RHEL 5.2, RHEL 5.3, SLES 10 SP2, or SLES 11 on POWER systems. It covers POWER5, POWER6, SMT, LPARs, and sharing CPU resources across virtualized partitions. It has been noted that the wording "stolen cycles" really should be "shared CPU cycles" on POWER systems, especially with respect to simultaneous multi-threading (SMT).
For purposes of consistency, we define the terms:
- Logical CPUs: Those CPUs seen by Linux within a partition as separately schedulable CPUs. When SMT is on, there will be a pair of logical CPUs associated with each physical core.
- Physical processor core: The Power cores on the underlying hardware. Each physical core is capable of running two hardware threads when SMT is on.
- Entitled processor units: How many units of a physical processor core a partition is defined with.
When using SMT, the accounting of the cycles on a logical CPU has changed to reflect the cycles from the perspective of each SMT thread. So a common observation on a busy system is that each of the paired logical CPUs (for example, cpu0 and cpu1) is running with 50% user busy and 50% steal. This simply reflects the SMT siblings sharing the cycles of a single physical processor core.
However, the implementation can be confusing, since the steal cycles can also be CPU cycles used by other partitions on the same system.
This page is targeted at technical individuals who are using the latest SLES 10 or RHEL 5 releases on POWER systems, which these days are most typically based on POWER5 and POWER6 processors. In the Linux community this is the ppc64 architecture. While the concepts and implementation described below may be applicable to other Linux distros, our experience is primarily with the distro releases from Novell and Red Hat.
The intent is simply to describe what Linux is doing to report on the logical CPU usage. We would like the reader to understand the concepts of how Linux reports cycles which cannot be attributed to "work" for that particular logical CPU - in particular the new "stolen" cycles column of CPU usage.
There are some interesting aspects of measuring and reporting on logical CPU utilization when your "system" is really a virtualized partition sharing logical CPU resources with other partitions, or even when SMT (simultaneous multithreading) is being used.
For details on POWER6 and SMT mode, check out this IBM Journal paper: IBM POWER6 Microarchitecture
For more details on virtualizing your POWER systems, check out the IBM Redbook Virtualizing an Infrastructure with System p and Linux
Who's stealing (?!) my CPU cycles?
We occasionally hear from programmers and technical system administrators who have upgraded to one of the more recent Linux versions from Novell or Red Hat on a Power system, and are now startled (even dismayed) to see a new column of CPU metrics which shows that something is "stealing" logical CPU resources. On newer Linux systems like SLES 10, SLES 11, and RHEL 5, this appears as a new "st" column, which accounts for CPU cycles that are being shared for things like virtualization activities.
Contents
In this report, we'll discuss and review...
- Background - top, vmstat, mpstat
- Source code - where are the measurements coming from? /proc/stat
- Basic logical CPU usage metrics
- SMT on - Power (two hardware threads per physical core)
- /proc/ppc64/lparcfg
- Then we add virtualization with a couple of partitions
- capped, not shared
- uncapped, shared
Background
Firing up a system with RHEL 5 u2, RHEL 5 u3, SLES 10 sp2, or SLES 11: vmstat shows a new "st" column as the rightmost column of logical CPU data.
Checking the man page for vmstat, we see the new definition for "st" added. On Power systems, it would probably be more accurate to have an "sh" column for when the two SMT threads are sharing the processor core, and then the more traditional "st" column for logical CPU cycles shared between partitions.
As another example, top now shows the "st" column as well. As usual, if you type the number 1 while top is running, all of the available logical CPUs will be displayed. In this example, the logical CPU values are summarized.
sysstat rpm
Finally, if you have installed the sysstat rpm file on the Power system, you'll have access to the mpstat command. This command is particularly helpful because it will show all of the logical CPUs in use.
On POWER systems, the simultaneous multi-threaded "SMT" mode allows two hardware threads to share a single physical processor core. In practice, this is implemented with two schedulable logical CPUs for each physical processor core. Here we have an example of a partition with a single physical processor core, which shows two idle logical CPUs with little to no SMT sibling thread sharing.
The man page for mpstat gives the following description. This definition is a little better, as it describes the condition (involuntary wait) of the logical CPU while the physical processor core cycles are being shared.
If multiple processes that keep the two logical CPUs fully busy are started on this example single physical core partition, mpstat will then show the following for the two logical CPUs.
This shows that the "physical core" - the paired logical CPUs - is 100% busy. The strange part is that each logical CPU appears to be only half busy, with the steal cycles (those CPU cycles being shared via SMT siblings) showing the cycles that the other logical CPU (the other half of the processor core) is working on.
50% User with 50% Steal when SMT is on
This is the most common scenario reported by customers. It reflects the logical CPU cycles from the perspective of each logical CPU. In this case, each SMT sibling is using (sharing) half of the resources of the underlying physical processor core.
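To make the "two half-busy logical CPUs equal one fully busy core" reading concrete, here is a minimal Python sketch that adds up the user percentages of SMT siblings. The function name and the idea of passing sibling pairs explicitly are our own illustration, not something mpstat provides; the input values are the 50% figures discussed above.

```python
def core_busy_percent(user_pct, sibling_pairs):
    """Add the %user of the two SMT siblings that share each physical core.

    user_pct      -- {logical_cpu_number: %user as reported per logical CPU}
    sibling_pairs -- list of (cpu_a, cpu_b) tuples; the pairing is supplied
                     by the caller (an assumption of this sketch)
    """
    return {pair: user_pct[pair[0]] + user_pct[pair[1]] for pair in sibling_pairs}


if __name__ == "__main__":
    # cpu0 and cpu1 each report 50% user and 50% steal on a busy one-core partition.
    per_cpu_user = {0: 50.0, 1: 50.0}
    print(core_busy_percent(per_cpu_user, [(0, 1)]))
```

Read this way, the steal value on one logical CPU is simply the share of the core that its sibling is using.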
Source Code Insights
To better understand where the measurements are coming from, we downloaded a version of the sysstat source code. sysstat was found on freshmeat:
Which pointed to..
We grabbed the sysstat-7.0.2 version, which is what RHEL 5.2 is based on.
We looked at mpstat.c and noticed that these tools get their data from /proc/stat on the system, parsing the lines and assigning them to the various fields of the mpstat output.
It's interesting to note that even the control block names in the code reflect the phrasing of "stealing" CPU cycles.
With an example from an 8-core (8 physical processor cores) Power6 system with SMT on, we look at /proc/stat, which shows the 16 schedulable logical CPUs.
This example of /proc/stat is from an idle system. It's intended simply as an example of the data values captured by the Linux kernel. Note that this system is different from the simpler systems used elsewhere on this page, which have just 1 physical processor core.
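As a rough illustration of how a tool can turn /proc/stat into percentages, the Python sketch below parses the "cpu" lines and computes the share of each field between two samples. It is not a port of mpstat.c. The field order (user, nice, system, idle, iowait, irq, softirq, steal) follows the kernel's /proc/stat layout, while the helper names and the sample numbers are made up for this example.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_proc_stat(text):
    """Return {cpu_name: {field: ticks}} for the 'cpu' and 'cpuN' lines."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts or not parts[0].startswith("cpu"):
            continue  # skip intr, ctxt, btime, and other non-CPU lines
        values = [int(v) for v in parts[1:1 + len(FIELDS)]]
        # Older kernels print fewer columns; treat the missing ones as zero.
        values += [0] * (len(FIELDS) - len(values))
        stats[parts[0]] = dict(zip(FIELDS, values))
    return stats


def percentages(before, after):
    """Percent of elapsed ticks spent in each field between two samples."""
    deltas = {f: after[f] - before[f] for f in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {f: 0.0 for f in FIELDS}
    return {f: 100.0 * d / total for f, d in deltas.items()}


# Made-up samples for a one-core, SMT-on partition (not captured from a real system).
SAMPLE_1 = """cpu  1000 0 200 8000 50 0 0 750
cpu0 500 0 100 4000 25 0 0 375
cpu1 500 0 100 4000 25 0 0 375
intr 12345
"""
SAMPLE_2 = """cpu  1100 0 200 8000 50 0 0 850
cpu0 550 0 100 4000 25 0 0 425
cpu1 550 0 100 4000 25 0 0 425
intr 12400
"""

if __name__ == "__main__":
    first, second = parse_proc_stat(SAMPLE_1), parse_proc_stat(SAMPLE_2)
    for cpu in ("cpu0", "cpu1"):
        pct = percentages(first[cpu], second[cpu])
        print(cpu, "user", pct["user"], "steal", pct["steal"])
```

On a live system the two samples would come from reading /proc/stat twice, a few seconds apart.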
Through discussions with developers, we've heard that "time.c" in the arch/ppc64/kernel tree should have the source we're looking for.
As an exercise, we thought we'd check first in the kernel.org tree. If we checked the kernel source tree - for example the 2.6.16.18 base (which is what RHEL 5.2 is based on) - would we find the steal code?
We downloaded it, un-tar'ed it, and looked under the ppc64 arch tree, where we found the time.c which should handle the calculations for the "steal" time. Of course, time.c here doesn't have the changes. This is indicative of the parallel work efforts within the distros and the mainline kernel.
Turns out you really have to look in the Red Hat src.rpm that comes with RHEL 5.2, and look at arch/ppc/kernel/time.c.
Also in the RHEL 5.2 base there is a patch which specifically addresses the mpstat reports on "steal" usage.
- (See linux-2.6-ppc64-mpstat-reports-wrong-per-processor-stats.patch)
According to the developer discussions - and snagged nearly verbatim from a RHEL 5.2 patch in the src rpm...
- In calculating stolen time, we originally were trying to actually account for time spent in the hypervisor. We don't really have enough information to do that accurately, so we don't try. Instead, we now calculate stolen time as time that the current cpu thread is not actually dispatching instructions. On chips without a PURR, we cannot do this, so stolen time will always be zero. On chips with a PURR, this is merely the difference between the elapsed PURR values and the elapsed TB values.
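A literal reading of that description can be sketched in a few lines of Python. This is our interpretation, not the kernel code: the function names are hypothetical, we assume the sign convention "elapsed TB minus elapsed PURR", and we clamp the result at zero. The example values assume that two busy SMT siblings each accumulate about half of the elapsed timebase in their PURR, which is the behavior the 50% steal observation suggests.

```python
def stolen_ticks(tb_start, tb_end, purr_start, purr_end):
    """Time the thread was not dispatching instructions, in timebase ticks.

    Assumption of this sketch: stolen = elapsed TB - elapsed PURR,
    clamped at zero so small read skews never yield a negative value.
    """
    elapsed_tb = tb_end - tb_start
    elapsed_purr = purr_end - purr_start
    return max(0, elapsed_tb - elapsed_purr)


def stolen_percent(tb_start, tb_end, purr_start, purr_end):
    """Stolen time as a percentage of the elapsed timebase."""
    elapsed_tb = tb_end - tb_start
    if elapsed_tb <= 0:
        return 0.0
    return 100.0 * stolen_ticks(tb_start, tb_end, purr_start, purr_end) / elapsed_tb


if __name__ == "__main__":
    # Hypothetical interval: 1000 timebase ticks elapse, and this thread's PURR
    # advances by 500 because its busy sibling received the other half.
    print(stolen_percent(0, 1000, 0, 500))
```

On a chip without a PURR there is nothing to subtract, which matches the patch note that stolen time is always zero there.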
In a later paper, we'll look into what this (PURR, TB, when not dispatching instructions) means in English. And of course, we're still missing the big picture of how CPU cycles are assigned across user, idle, iowait, and steal.
Parsing /proc/ppc64/lparcfg
"lparcfg" is the key data source for the definition of the partition from the perspective of the running partition. Here we can parse out how the partition was defined and the constraints (or lack of constraints) placed on the partition by the system administrators.
For a slightly dated description, see the paper on Entries in the proc filesystem for some background information.
LPAR (partition) information can be seen in the system file /proc/ppc64/lparcfg
For this exercise, the values we really care about are as follows
| lparcfg keyword | Meaning |
| partition_entitled_capacity=400 | Entitled: How many processor units? Divide the number by 100 for the percentage of CPU to be assigned to this partition. 400 = 4.0 processor units; 80 = 0.80 processor units |
| capacity_weight=0 | Weight: Used to prioritize partitions competing for CPU resources. If zero is specified, this essentially sets the partition to capped |
| capped=1 | Capped: =1: Don't allow this partition to take more resources; =0: Allow this partition to use extra resources from the system |
| shared_processor_mode=0 | Shared: =1: Allow this partition to share unused cycles; =0: Do not share this partition's resources |
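Since lparcfg is a list of key=value lines, pulling out these four settings is straightforward. The Python sketch below is one way to do it; the function names and the sample text are ours (the sample simply reuses the values from the table), and a real /proc/ppc64/lparcfg contains many more keys than shown here.

```python
def parse_lparcfg(text):
    """Return {key: value} for every 'key=value' line; other lines are skipped."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            values[key.strip()] = value.strip()
    return values


def summarize(values):
    """Pick out the partition settings discussed in the table above."""
    return {
        # The entitlement is stored in hundredths of a processor unit.
        "entitled_units": int(values["partition_entitled_capacity"]) / 100.0,
        "weight": int(values["capacity_weight"]),
        "capped": values["capped"] == "1",
        "shared": values["shared_processor_mode"] == "1",
    }


# Hypothetical excerpt using the example values from the table.
SAMPLE_LPARCFG = """partition_entitled_capacity=400
capacity_weight=0
capped=1
shared_processor_mode=0
"""

if __name__ == "__main__":
    print(summarize(parse_lparcfg(SAMPLE_LPARCFG)))
```

On a running partition, the text would come from open("/proc/ppc64/lparcfg").read() instead of the sample string.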
Power 6 Blade measurements
For the purposes of these measurements, we used a single socket dual core (two physical processor cores) Power6 Blade - the JS12. SMT was on. Tested with RHEL 5.2. Results are slightly rounded for simplicity.
The JS12 blade was defined with three partitions - two of the partitions to be used for workload comparisons, and a third small partition for the VIOS server. The workload we used was an un-tuned engineering run of a multi-process Java workload, with the results normalized. The VIOS server had very minimal CPU usage and we essentially ignored it for this exercise.
For example, when the workload was running, we used mpstat in the Linux partition with a single physical processor core allocated and SMT on (two logical CPUs - cpu0 and cpu1). As we saw before, it shows about 50% user cycles and 50% steal cycles.
So, running various permutations of partitions and partition settings, we provide some example "normalized" metrics. These metrics are not intended as projections or commitments, but more simply as examples of how the logical CPU assignments will impact the workloads running.
| Test # | Partitions | mpstat user% (cpu0 - cpu1) | mpstat steal% (cpu0 - cpu1) | Result "value" |
| 1 | P1: 1.0 processor, Not-Shared, Capped; P2: Off; P3: VIOS 0.2 cpu | P1: 50% - 50%; P2: -; P3: - | P1: 50% - 50%; P2: -; P3: - | P1: 100; P2: -; P3: - |
| 2 | P1: 1.0 processor, Shared, Capped; P2: Off; P3: VIOS 0.2 cpu | P1: 50% - 50%; P2: -; P3: - | P1: 50% - 50%; P2: -; P3: - | P1: 100; P2: -; P3: - |
| 3 | P1: 0.8 processor, Shared, Capped; P2: Off; P3: VIOS 0.2 cpu | P1: 40% - 40%; P2: -; P3: - | P1: 60% - 60%; P2: -; P3: - | P1: 80; P2: -; P3: - |
| 4 | P1: 0.8 processor, Shared, Uncapped, Weight=128; P2: Off; P3: VIOS 0.2 cpu | P1: 50% - 50%; P2: -; P3: - | P1: 50% - 50%; P2: -; P3: - | P1: 100; P2: -; P3: - |
| 5 | P1: 0.6 processor, Shared, Uncapped, Weight=0; P2: Off; P3: VIOS 0.2 cpu (uncapped weight=0 means "capped") | P1: 30% - 30%; P2: -; P3: - | P1: 70% - 70%; P2: -; P3: - | P1: 60; P2: -; P3: - |
| Test # | Partitions | mpstat user% (cpu0 - cpu1) | mpstat steal% (cpu0 - cpu1) | Result "value" |
| 6 | P1: 1.0 processor, Not-Shared; P2: 0.8 cpu, Shared, Capped; P3: VIOS 0.2 cpu | P1: 50% - 50%; P2: 40% - 40%; P3: - | P1: 50% - 50%; P2: 60% - 60%; P3: - | P1: 100; P2: 80; P3: - |
| 7 | P1: 0.8 processor, Shared, Capped; P2: 0.8 cpu, Shared, Capped; P3: VIOS 0.2 cpu | P1: 40% - 40%; P2: 40% - 40%; P3: - | P1: 60% - 60%; P2: 60% - 60%; P3: - | P1: 80; P2: 80; P3: - |
| 8 | P1: 0.6 processor, Shared, Capped; P2: 0.6 cpu, Shared, Capped; P3: VIOS 0.2 cpu | P1: 30% - 30%; P2: 30% - 30%; P3: - | P1: 70% - 70%; P2: 70% - 70%; P3: - | P1: 60; P2: 60; P3: - |
It's interesting to note that the hypervisor assigns "physical processor cores" to a partition, corresponding to the number of entitled processor units assigned to the partition. So, with a single physical processor core assigned to each of these partitions, the most that we see here is each logical CPU capping out at 50% user busy.
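The capped rows in the tables follow a simple pattern, which the Python sketch below writes down. It is only a back-of-the-envelope model that we fitted to these measurements (one core, SMT on, a workload that saturates the partition), not a description of how the hypervisor schedules. The function name is hypothetical, and the entitlement is passed in lparcfg style, in hundredths of a processor unit.

```python
def expected_user_and_steal(entitled_capacity, smt_threads=2):
    """Estimate per-logical-CPU (user%, steal%) for a saturated, capped,
    single-core partition.

    entitled_capacity -- hundredths of a processor unit, e.g. 80 for 0.8
    smt_threads       -- hardware threads sharing the core (2 with SMT on)
    """
    core_share = min(entitled_capacity, 100)  # at most one full core here
    user = core_share / smt_threads           # siblings split the core evenly
    return user, 100 - user


if __name__ == "__main__":
    for capacity in (100, 80, 60):
        user, steal = expected_user_and_steal(capacity)
        print(f"entitled {capacity / 100:.1f}: user {user:.0f}% steal {steal:.0f}%")
```

The uncapped case with a non-zero weight (test 4) does not fit this formula, because the partition is allowed to pick up spare cycles beyond its 0.8 entitlement and so behaves like the 1.0 case.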
Possible follow-on assessments
There are other implications and subtleties which we'll explore further if there's interest. For example,
- we'd like to explore how nmon reports on usage in these cases. The nmon tool does some very nice data gathering and summary reports for Power partitions. In particular, nmon ignores the steal column.
- we'd like to show what happens when you're working on larger systems with bigger possible swings in CPU utilization and steal percentages - in other words - just how busy can a logical CPU be?
- where is the art of reporting CPU utilization really going?
- how are logical CPU cycles really assigned? In particular, the CPU percentage values from mpstat
- how are PURR and timebase values used?
Many good areas to dive into.