Measuring stolen CPU cycles
Added by wburos, last edited by billburos on Jun 19, 2009  (view change)

A quick primer on the concept of "stolen" CPU cycles when using RHEL 5.2, RHEL 5.3, SLES 10 SP2, or SLES 11 on POWER systems. It covers POWER5, POWER6, SMT, LPARs, and sharing CPU resources across virtualized partitions. It has been noted that "stolen cycles" would be better worded as "shared CPU cycles" on POWER systems, especially with respect to simultaneous multi-threading (SMT).

For purposes of consistency, we define the terms:

  • Logical CPUs: The CPUs that Linux sees within a partition as separately schedulable CPUs. When SMT is on, a pair of logical CPUs is associated with each physical core.
  • Physical processor core: A Power core on the underlying hardware. Each physical core can run two hardware threads when SMT is on.
  • Entitled processor units: The number of units of a physical processor core that a partition is defined with.




When using SMT, the accounting of the cycles on a logical CPU has changed to reflect the cycles from the perspective of each SMT thread. A common observation on a busy system is that each logical CPU of a pair (for example, cpu0 and cpu1) is running at 50% user busy and 50% steal. This simply reflects the SMT siblings sharing the cycles of a single physical processor core.

The implementation can be confusing, however, because steal cycles can also be CPU cycles used by other partitions on the same system.

This page is targeted at technical individuals who are using the latest SLES 10 or RHEL 5 releases on POWER systems - which these days is most typically POWER5 and POWER6 processor based. In the Linux community this is the ppc64 base. While the concepts and implementation described below may be applicable to other Linux distros, our experience is primarily with the distro releases from Novell and Red Hat.

The intent is simply to describe what Linux is doing to report on the logical CPU usage. We would like the reader to understand the concepts of how Linux reports cycles which cannot be attributed to "work" for that particular logical CPU - in particular the new "stolen" cycles column of CPU usage.

There are some interesting aspects of measuring and reporting on logical CPU utilization when your "system" is really a virtualized partition sharing logical CPU resources with other partitions, or even when SMT (simultaneous multithreading) is being used.

For details on POWER6 and SMT mode, check out this IBM Journal paper: IBM POWER6 Microarchitecture

For more details on virtualizing your POWER systems, check out the IBM Redbook Virtualizing an Infrastructure with System p and Linux



Who's stealing (?!) my CPU cycles?

We occasionally hear from programmers and technical system administrators who have upgraded to one of the more recent Linux versions from Novell or Red Hat on a Power system and are startled (even dismayed) to see a new column of CPU metrics showing that something is "stealing" logical CPU resources. On newer Linux systems like SLES 10, SLES 11, and RHEL 5, a new "st" column reports the CPU cycles that are being shared, for example through virtualization activities.



Contents

In this report, we'll discuss and review...

  1. Background - top, vmstat, mpstat
  2. Source code - where are the measurements coming from? /proc/stat
  3. Basic logical CPU usage metrics
  4. SMT on - Power (two hardware threads per physical core)
  5. /proc/ppc64/lparcfg
  6. Then we add virtualization with a couple of partitions
    • capped, not shared
    • uncapped, shared



Background

Firing up a system with RHEL 5 u2, RHEL 5 u3, SLES 10 sp2, or SLES 11: vmstat shows a new "st" column as the rightmost column of logical CPU data.

# vmstat 1
procs   ---- -------memory----------   ---swap--  ---io--- --system-- -----cpu------
 r  b   swpd    free   buff    cache    si   so   bi   bo   in   cs   us sy  id wa st
 0  0      0  11122432 537984 4258304    0    0    3    1    2    7    1  1  98  0  1
 0  0      0  11122496 537984 4258304    0    0    0    0   15   30    0  0 100  0  0
 0  0      0  11122496 537984 4258304    0    0    0   88   35   54    0  0 100  0  0

Checking the man page for vmstat, we see the new definition for "st". On Power systems, it would probably be more accurate to have an "sh" column for when the two SMT threads are sharing the processor core, and then the more traditional "st" column for logical CPU cycles shared between partitions.

# man vmstat
<clipped>
   CPU
       These are percentages of total CPU time.
       us: Time spent running non-kernel code. (user time, including nice time)
       sy: Time spent running kernel code. (system time)
       id: Time spent idle. Prior to Linux 2.5.41, this includes IO-wait time.
       wa: Time spent waiting for IO. Prior to Linux 2.5.41, included in idle.
       st: Time stolen from a virtual machine. Prior to Linux 2.6.11, unknown.
<clipped>

As another example, top now shows the "st" column as well. As usual, if you type the number 1 while top is running, all of the available logical CPUs will be displayed. In this example, the logical CPU values are summarized.

top - 09:09:42 up 2 days, 11:17,  1 user,  load average: 3.49, 1.69, 0.64
Tasks: 137 total,   1 running, 136 sleeping,   0 stopped,   0 zombie
Cpu(s): 23.2%us,  5.3%sy,  0.0%ni, 34.1%id, 36.4%wa,  0.3%hi,  0.3%si,  0.3%st
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+     COMMAND
21493 root      18   0  4032 1600  960 S   69  0.0   0:24.29 gzip 
21490 root      18   0  7616 4096 2688 D   28  0.0   0:11.14 tar
21492 root      18   0  7424 2368 1216 S    4  0.0   0:01.52 tar
21491 root      18   0  5696 2048 1472 S    1  0.0   0:00.38 cat
21211 root      10  -5     0    0    0 D    1  0.0   0:00.30 kjournald 
<clipped>



sysstat rpm

Finally, if you have installed the sysstat rpm file on the Power system, you'll have access to the mpstat command. This command is particularly helpful because it will show all of the logical CPUs in use.

On POWER systems, simultaneous multi-threading (SMT) mode allows two hardware threads to share a single physical processor core. In practice, this is implemented with two schedulable logical CPUs for each physical processor core. Here is an example of a partition with a single physical processor core, which shows two idle logical CPUs with little to no SMT sibling thread sharing.

# mpstat -P ALL
Linux 2.6.18-92.el5 (p6ihopenhpc2.ltc.austin.ibm.com)   10/12/2008

06:27:06 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
06:27:06 PM    0    0.09    0.00    0.10    0.03    0.33    0.09    0.00   99.35      2.37
06:27:06 PM    1    0.00    0.00    0.07    0.02    0.31    0.06    0.00   99.53      1.94

The man page for mpstat gives the following description of %steal. This definition is a little better, as it describes the condition (involuntary wait) of the logical CPU while the physical processor core cycles are being shared.

# man mpstat
<clipped>
       %steal
              Show the percentage of time spent in involuntary wait by the
              virtual CPU or CPUs while the hypervisor was servicing another
              virtual processor.
<clipped>

If multiple processes are started on this example single physical core partition and keep the two logical CPUs fully busy, mpstat shows the following for the two logical CPUs.

This shows that the "physical core" - the paired logical CPUs - is 100% busy. The strange part is that each logical CPU appears to be only half busy, with the steal cycles (the CPU cycles shared between SMT siblings) showing the cycles that the other logical CPU (the other half of the processor core) is working on.

10:14:40 AM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
10:14:45 AM  all   49.80    0.00    0.10    0.00    0.10    0.00   48.50    1.50     15.83
10:14:45 AM    0   48.50    0.00    0.00    0.00    0.00    0.00   48.50    3.01     10.22
10:14:45 AM    1   50.90    0.00    0.20    0.00    0.00    0.20   48.70    0.00      5.61

50% User with 50% Steal when SMT is on

This is the most common scenario reported by customers. It reflects the logical CPU cycles from the perspective of each logical CPU. In this case, each SMT sibling is using (sharing) half of the resources of the underlying physical processor core.
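
As a rough illustration of this reading of the numbers, the sketch below adds up the non-idle, non-steal percentages of the two SMT siblings from the mpstat sample above. It assumes a partition with a single physical core where the steal time on one thread is entirely the time its sibling was using the core; the names SAMPLE, BUSY_FIELDS, thread_busy, and core_busy are made up for this example and are not part of mpstat.

```python
# Per-logical-CPU percentages copied from the mpstat sample above.
# Assumption: all steal time on one thread is time the core spent
# running its SMT sibling (no other partitions involved).
SAMPLE = {
    0: {"user": 48.50, "nice": 0.0, "sys": 0.00, "iowait": 0.0,
        "irq": 0.0, "soft": 0.00, "steal": 48.50, "idle": 3.01},
    1: {"user": 50.90, "nice": 0.0, "sys": 0.20, "iowait": 0.0,
        "irq": 0.0, "soft": 0.20, "steal": 48.70, "idle": 0.00},
}

# Columns counted as real work done by a logical CPU.
BUSY_FIELDS = ("user", "nice", "sys", "irq", "soft")


def thread_busy(row):
    """Percentage of time this logical CPU was doing work of its own."""
    return sum(row[field] for field in BUSY_FIELDS)


def core_busy(sibling_rows):
    """Approximate busy percentage of the core shared by the siblings."""
    return min(100.0, sum(thread_busy(row) for row in sibling_rows))


core = core_busy(SAMPLE.values())
print(f"approximate physical core busy: {core:.1f}%")
```

With the sample values this comes to about 99.8%, which matches the observation that the physical core is fully busy even though each logical CPU reports only about 50% user time.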



Source Code Insights

To better understand where the measurements come from, we downloaded a version of the sysstat source code. sysstat was found on freshmeat, which pointed to the download location. We took the sysstat-7.0.2 version, which is what RHEL 5.2 is based on.

We looked at mpstat.c and noticed that these tools get their data from /proc/stat on the system, parsing the lines and assigning the values to the various fields for the mpstat output.

It's interesting to note that even the control block names in the code reflect the phrasing of "stealing" CPU cycles.

mpstat.c
<clipped>
         sscanf(line + 5, "%llu %llu %llu %llu %llu %llu %llu %llu",
                &(st_mp_cpu[curr]->cpu_user),
                &(st_mp_cpu[curr]->cpu_nice),
                &(st_mp_cpu[curr]->cpu_system),
                &(st_mp_cpu[curr]->cpu_idle),
                &(st_mp_cpu[curr]->cpu_iowait),
                &(st_mp_cpu[curr]->cpu_hardirq),
                &(st_mp_cpu[curr]->cpu_softirq),
                &(st_mp_cpu[curr]->cpu_steal));
<clipped>
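
For readers who prefer to experiment outside of C, here is a minimal Python sketch of the same parsing step. It is not sysstat code; the names CpuTimes and parse_cpu_line are invented for this example. The field order follows the sscanf call above.

```python
from collections import namedtuple

# Field order mirrors the sscanf call in mpstat.c shown above.
CpuTimes = namedtuple(
    "CpuTimes", "user nice system idle iowait hardirq softirq steal"
)


def parse_cpu_line(line):
    """Return (label, CpuTimes) for a 'cpu...' line of /proc/stat, else None."""
    parts = line.split()
    if not parts or not parts[0].startswith("cpu"):
        return None
    # Keep only the first eight counters; any later columns are ignored.
    values = [int(value) for value in parts[1:9]]
    # Kernels that report fewer columns get zeros for the missing ones.
    values += [0] * (8 - len(values))
    return parts[0], CpuTimes(*values)


label, times = parse_cpu_line("cpu14 2378 1 738 196749 6174 397 224 62")
print(label, times.steal)
```

On a live system the same function can be applied to each line read from /proc/stat. Lines such as intr or ctxt return None and can be skipped.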

The example below is from an 8-core (8 physical processor cores) Power6 system with SMT on. /proc/stat shows the 16 schedulable logical CPUs.

This example of /proc/stat is from an idle system. It's intended simply as an example of the data values captured by the Linux kernel. Note that this system is different from the simpler single-core systems used elsewhere on this page.

# cat /proc/stat
cpu  12982 30 4229 3272769 9923 6460 2651 473
cpu0 225 0 407 208077 165 448 214 27
cpu1 165 3 452 205236 151 417 149 28
cpu2 189 0 123 205767 114 364 151 22
cpu3 442 0 204 205085 269 395 188 18
cpu4 2416 0 67 203523 143 394 145 40
cpu5 444 1 177 205247 168 419 129 16
cpu6 369 6 64 205611 128 394 133 22
cpu7 1197 4 610 204025 212 391 132 31
cpu8 120 0 39 205776 148 451 173 19
cpu9 138 1 64 205692 148 412 127 20
cpu10 792 0 86 205168 146 385 125 24
cpu11 762 0 315 204761 204 400 134 25
cpu12 1250 2 127 204547 221 393 160 25
cpu13 724 1 130 204993 213 383 132 24
cpu14 2378 1 738 196749 6174 397 224 62
cpu15 1364 0 618 202503 1312 411 329 64
intr 309891
ctxt 804919
btime 1223905644
processes 37521
procs_running 1
procs_blocked 0
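
The counters in /proc/stat are cumulative, so a percentage for an interval comes from the difference between two snapshots. The sketch below shows that arithmetic in Python. The "before" tuple is the cpu0 line from the listing above; the "after" tuple is a hypothetical later reading invented for this example, and the function name percentages is likewise an assumption rather than anything taken from sysstat.

```python
# Column order of the per-CPU lines in /proc/stat as used on this page.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def percentages(before, after):
    """Share of each counter in the ticks elapsed between two snapshots."""
    deltas = [new - old for old, new in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        # No ticks elapsed (or counters went backwards): report zeros.
        return dict.fromkeys(FIELDS, 0.0)
    return {name: 100.0 * delta / total for name, delta in zip(FIELDS, deltas)}


before = (225, 0, 407, 208077, 165, 448, 214, 27)  # cpu0 line from the listing
after = (275, 0, 407, 208079, 165, 448, 214, 75)   # hypothetical later sample

pct = percentages(before, after)
print(f"user {pct['user']:.1f}%  steal {pct['steal']:.1f}%  idle {pct['idle']:.1f}%")
```

In this made-up interval, 100 ticks elapse: 50 go to user, 48 to steal, and 2 to idle, which is the kind of split mpstat reports for a busy SMT thread.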

Through discussions with developers, we heard that "time.c" in the arch/ppc64/kernel tree should have the source we're looking for.

As an exercise, we checked first in the kernel.org tree. If we looked in the kernel source tree - for example the 2.6.16.18 base (which is what RHEL 5.2 is based on) - would we find the steal code?

We downloaded it, untarred it, and looked under the ppc64 arch tree, where we found time.c, which should handle the calculations for the "steal" time. Of course, the time.c there doesn't have the changes. This is indicative of the parallel work efforts within the distros and the mainline kernel.

Turns out you really have to look in the Red Hat src.rpm that comes with RHEL 5.2, and look at arch/ppc/kernel/time.c.

Also in the RHEL 5.2 base there is a patch which specifically addresses the mpstat reports on "steal" usage.

  • (See linux-2.6-ppc64-mpstat-reports-wrong-per-processor-stats.patch)

According to the developer discussions - and snagged nearly verbatim from a RHEL 5.2 patch in the src rpm...

  • In calculating stolen time, we originally were trying to actually account for time spent in the hypervisor. We don't really have enough information to do that accurately, so we don't try. Instead, we now calculate stolen time as time that the current cpu thread is not actually dispatching instructions. On chips without a PURR, we cannot do this, so stolen time will always be zero. On chips with a PURR, this is merely the difference between the elapsed PURR values and the elapsed TB values.

In a later paper, we'll look into what this (PURR, TB, when not dispatching instructions) means in English. And of course, we're still missing the big picture of how CPU cycles are assigned across user, idle, iowait, and steal.

void calculate_steal_time(void)
{
        u64 tb, purr;
        s64 stolen;
        struct cpu_purr_data *pme;

        if (!cpu_has_feature(CPU_FTR_PURR))
                return;
        pme = &per_cpu(cpu_purr_data, smp_processor_id());
        if (!pme->initialized)
                return;         /* this can happen in early boot */
        spin_lock(&pme->lock);
        tb = mftb();
        purr = mfspr(SPRN_PURR);
        stolen = (tb - pme->tb) - (purr - pme->purr);
        if (stolen > 0)
                account_steal_time(current, stolen);
        pme->tb = tb;
        pme->purr = purr;
        spin_unlock(&pme->lock);
}
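
The arithmetic in calculate_steal_time() can be tried out with a few lines of Python. This is only a sketch of the subtraction and the "positive values only" check; the function name stolen_ticks and the sample numbers are hypothetical. The sample assumes, for illustration, that a busy SMT thread's PURR advances by about half of the elapsed timebase ticks.

```python
def stolen_ticks(tb_prev, tb_now, purr_prev, purr_now):
    """Elapsed timebase ticks minus elapsed PURR ticks, never negative.

    Mirrors: stolen = (tb - pme->tb) - (purr - pme->purr), which the
    kernel code above only accounts when it is greater than zero.
    """
    stolen = (tb_now - tb_prev) - (purr_now - purr_prev)
    return stolen if stolen > 0 else 0


# Hypothetical readings: 1,000,000 timebase ticks elapse while this
# thread's PURR advances by 500,000.
elapsed_tb = 1_000_000
stolen = stolen_ticks(0, elapsed_tb, 0, 500_000)
print(f"stolen: {stolen} ticks ({100.0 * stolen / elapsed_tb:.0f}% of the interval)")
```

Under that assumption half of the interval is reported as stolen, which lines up with the 50% steal seen on each logical CPU when both SMT siblings are busy.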



Parsing /proc/ppc64/lparcfg

"lparcfg" is the key data source for the definition of the partition from the perspective of the running partition. Here we can parse out how the partition was defined and the constraints (or lack of constraints) placed on the partition by the system administrators.

For a slightly dated description, see the paper on Entries in the proc filesystem for some background information.

LPAR (partition) information can be seen in the system file /proc/ppc64/lparcfg

# cat /proc/ppc64/lparcfg
lparcfg 1.7 
serial_number=IBM,03100A8E2
system_type=IBM,8203-E4A
partition_id=1
R4=0x190
R5=0x0
R6=0x8001ffff
R7=0x1000000000004
BoundThrds=1
CapInc=100
DisWheRotPer=5120000
MinEntCap=100
MinEntCapPerVP=100
MinMem=256
MinProcs=1
partition_max_entitled_capacity=400
system_potential_processors=4
DesEntCap=400
DesMem=15808
DesProcs=4
DesVarCapWt=0
DedDonMode=0

partition_entitled_capacity=400
group=32769
system_active_processors=4
unallocated_capacity_weight=0
capacity_weight=0
capped=1
unallocated_capacity=0
purr=725341178006384
partition_active_processors=4
partition_potential_processors=4
shared_processor_mode=0

For this exercise, the values we really care about are as follows

lparcfg keyword                    Meaning
partition_entitled_capacity=400    Entitled. How many processor units? Divide the number by 100
                                   for the percentage of CPU to be assigned to this partition.
                                     400 = 4.0 processor units
                                     80 = 0.80 processor units
capacity_weight=0                  Weight. Used to prioritize partitions competing for CPU
                                   resources. If zero is specified, this essentially sets the
                                   partition to capped.
capped=1                           Capped.
                                     =1: Don't allow this partition to take more resources
                                     =0: Allow this partition to use extra resources from the system
shared_processor_mode=0            Shared.
                                     =1: Allow this partition to share unused cycles
                                     =0: Do not share this partition's resources
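
To pull these four values out programmatically, something like the following Python sketch could be used. The helper names parse_lparcfg and summarize are invented here, and the text in SAMPLE is a trimmed copy of the listing above rather than a live read of /proc/ppc64/lparcfg. The "effectively_capped" field encodes the note above that a weight of zero essentially caps the partition.

```python
def parse_lparcfg(text):
    """Collect the key=value lines of lparcfg output into a dict of strings."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip the version line and blank lines
            settings[key.strip()] = value.strip()
    return settings


def summarize(settings):
    """Interpret the four keywords discussed above."""
    capped = settings["capped"] == "1"
    weight = int(settings["capacity_weight"])
    return {
        "entitled_units": int(settings["partition_entitled_capacity"]) / 100,
        "weight": weight,
        "capped": capped,
        "effectively_capped": capped or weight == 0,
        "shared": settings["shared_processor_mode"] == "1",
    }


# Trimmed copy of the example listing; on a real partition the text
# would be read from /proc/ppc64/lparcfg instead.
SAMPLE = """\
lparcfg 1.7
partition_entitled_capacity=400
capacity_weight=0
capped=1
shared_processor_mode=0
"""

summary = summarize(parse_lparcfg(SAMPLE))
print(summary)
```

For the example partition this reports 4.0 entitled processor units, capped, and not shared.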



Power 6 Blade measurements

For the purposes of these measurements, we used a single socket dual core (two physical processor cores) Power6 Blade - the JS12. SMT was on. Tested with RHEL 5.2. Results are slightly rounded for simplicity.

The JS12 blade was defined with three partitions - two of the partitions to be used for workload comparisons, and a third small partition for the VIOS server. The workload we used was an un-tuned engineering run of a multi-process Java workload, with the results normalized. The VIOS server had very minimal CPU usage and we essentially ignored it for this exercise.

For example, when the workload was running, we used mpstat in the Linux partition that had a single physical processor core allocated and SMT on (two logical CPUs - cpu0 and cpu1). As we saw before, it shows about 50% user cycles and 50% steal cycles.

# mpstat
10:14:40 AM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
10:14:45 AM  all   49.80    0.00    0.10    0.00    0.10    0.00   48.50    1.50     15.83
10:14:45 AM    0   48.50    0.00    0.00    0.00    0.00    0.00   48.50    3.01     10.22
10:14:45 AM    1   50.90    0.00    0.20    0.00    0.00    0.20   48.70    0.00      5.61

So, running various permutations of partitions and partition settings, we provide some example "normalized" metrics. These metrics are not intended as projections or commitments, but more simply as examples of how the logical CPU assignments will impact the workloads running.

Test #  Partitions                                    mpstat user%     mpstat steal%    Result
                                                      (cpu0 - cpu1)    (cpu0 - cpu1)    "value"

1       P1. 1.0 processor Not-Shared Capped           50% - 50%        50% - 50%        100
        P2. Off                                       -                -                -
        P3. VIOS 0.2 cpu                              -                -                -

2       P1. 1.0 processor Shared Capped               50% - 50%        50% - 50%        100
        P2. Off                                       -                -                -
        P3. VIOS 0.2 cpu                              -                -                -

3       P1. 0.8 processor Shared Capped               40% - 40%        60% - 60%        80
        P2. Off                                       -                -                -
        P3. VIOS 0.2 cpu                              -                -                -

4       P1. 0.8 processor Shared Uncapped Weight=128  50% - 50%        50% - 50%        100
        P2. Off                                       -                -                -
        P3. VIOS 0.2 cpu                              -                -                -

5       P1. 0.6 processor Shared Uncapped Weight=0    30% - 30%        70% - 70%        60
        P2. Off                                       -                -                -
        P3. VIOS 0.2 cpu                              -                -                -
        (uncapped weight=0 means "capped")

6       P1. 1.0 processor Not-Shared                  50% - 50%        50% - 50%        100
        P2. 0.8 cpu Shared Capped                     40% - 40%        60% - 60%        80
        P3. VIOS 0.2 cpu                              -                -                -

7       P1. 0.8 processor Shared Capped               40% - 40%        60% - 60%        80
        P2. 0.8 cpu Shared Capped                     40% - 40%        60% - 60%        80
        P3. VIOS 0.2 cpu                              -                -                -

8       P1. 0.6 processor Shared Capped               30% - 30%        70% - 70%        60
        P2. 0.6 cpu Shared Capped                     30% - 30%        70% - 70%        60
        P3. VIOS 0.2 cpu                              -                -                -

It's interesting to note that the hypervisor assigns "physical processor cores" to a partition according to the number of entitled processor units assigned to the partition. So, with a single physical processor core assigned to these partitions, the most that we see here is each logical CPU capping out at 50% user busy.
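
The capped rows in the results follow a simple pattern, which the Python sketch below writes down. It is an empirical fit to the rounded numbers reported on this page, not a description of how the hypervisor schedules. The function name expected_split is hypothetical, and the model only covers a fully busy, capped partition with at most one physical core and SMT on. It does not cover the uncapped case in test 4, where spare capacity let a 0.8-unit partition reach 50% - 50%.

```python
def expected_split(entitled_units, threads_per_core=2):
    """Per-logical-CPU (user%, steal%) for a fully busy, capped partition.

    Assumption: the entitlement is at most one physical core and is
    divided evenly across the SMT threads; the rest shows up as steal.
    """
    share = min(entitled_units, 1.0)
    user = 100.0 * share / threads_per_core
    return user, 100.0 - user


for units in (1.0, 0.8, 0.6):
    user, steal = expected_split(units)
    print(f"{units:.1f} processor units: user {user:.0f}%  steal {steal:.0f}%")
```

For 1.0, 0.8, and 0.6 processor units this gives 50/50, 40/60, and 30/70, matching the capped measurements listed on this page.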



Possible follow-on assessments

There are other implications and subtleties which we'll explore further if there's interest. For example,

  • we'd like to explore how nmon reports on usage in these cases. The nmon tool does some very nice data gathering and summary reports for Power partitions. In particular, nmon ignores the steal column.
  • we'd like to show what happens when you're working on larger systems with bigger possible swings in CPU utilization and steal percentages - in other words - just how busy can a logical CPU be?
  • where is the art of reporting CPU utilization really going?
  • how are logical CPU cycles really assigned? In particular, the CPU percentage values from mpstat
    %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle 
    
  • How are PURR and timebase values used?

Many good areas to dive into.
