Measuring stolen CPU cycles
Version 16 by wburos
on Nov 24, 2008 17:23.


compared with
Current by billburos
on Jun 19, 2009 09:13.


 A quick primer on the concept of "stolen" CPU cycles when using RHEL 5.2, RHEL 5.3, SLES 10sp2, or SLES 11 on POWER systems. It touches on POWER5, POWER6, SMT, LPARs, and sharing CPU resources across virtualized partitions. It has been noted that the wording "stolen cycles" really should be *_shared CPU cycles_* on POWER systems, especially with respect to simultaneous multi-threading (SMT).
  
 For purposes of consistency, we define the terms:
 * *Logical CPUs:* Those CPUs seen by Linux within a partition as separately schedulable CPUs. When SMT is on, there will be a pair of logical CPUs associated with each physical core.
 * *Physical processor core:* The Power cores on the underlying hardware. Each physical core is capable of running two hardware threads when SMT is on.
 * *Entitled processor units:* How many units of a physical processor core a partition is defined with.
  
 \\
  
 {tip:title=For discussions...}
 * Check [this thread|http://www.ibm.com/developerworks/forums/thread.jspa?threadID=233385] for discussions on this page.
  
 * Post additional questions and observations on the [Linux for Power Architecture forum|http://www.ibm.com/developerworks/forums/forum.jspa?forumID=375]
 {tip}
  
 \\
  
 When using SMT, the accounting of the cycles on a logical CPU has changed to reflect the cycles from the perspective of each SMT thread. So a common observation on a busy system is that each of the pair of logical CPUs (for example: cpu0 and cpu1) is running with 50% user busy and 50% steal. This simply reflects the SMT siblings *_sharing_* the cycles of a single physical processor core.
  
But, the implementation can be confusing since the steal cycles can also be those CPU cycles used by other partitions on the same system.
  
  
 {info:Linux on Power Systems}
 This page is targeted at technical individuals who are using the latest SLES 10 or RHEL 5 releases on POWER systems - which these days is most typically POWER5 and POWER6 processor based. In the Linux community this is the ppc64 base. While the concepts and implementation described below may be applicable to other Linux distros, our experience is primarily with the distro releases from Novell and Red Hat.
  
 The intent is simply to describe what Linux is doing to report on the logical CPU usage. We would like the reader to understand the concepts of how Linux reports cycles which cannot be attributed to "work" for that particular logical CPU - in particular the new "stolen" cycles column of CPU usage.
  
 There are some interesting aspects of measuring and reporting on logical CPU utilization when your "system" is really a virtualized partition sharing logical CPU resources with other partitions, or even when SMT (simultaneous multithreading) is being used.
  
 For details on POWER6 and SMT mode, check out this IBM Journal paper: [IBM POWER6 Microarchitecture|http://www.research.ibm.com/journal/rd/516/le.pdf]
  
 For more details on virtualizing your POWER systems, check out the IBM Redbook [Virtualizing an Infrastructure with System p and Linux|http://www.redbooks.ibm.com/abstracts/sg247499.html?Open]
  
 {info}
  
 \\
 ----
 h2. Who's stealing (?!) my CPU cycles?
  
 We occasionally hear from programmers and technical system administrators who have upgraded to one of the more recent Linux versions from Novell or Red Hat on a Power system, and are now startled (even dismayed) to see a new column of CPU metrics which shows that something is "stealing" logical CPU resources. On newer Linux systems like SLES 10, SLES 11, and RHEL 5, this is reflected by a new "st" column for the CPU cycles which are now attributed to the cycles being shared for things like virtualization activities.
  
 \\
 ----
 h2. Contents
  
 In this report, we'll discuss and review...
  
 # Background - top, vmstat, mpstat
 # Source code - where are the measurements coming from? /proc/stat
 # Basic logical CPU usage metrics
 # SMT on - Power (two hardware threads per physical core)
 # /proc/ppc64/lparcfg
 # Then we add virtualization with a couple of partitions
  ** capped, not shared
  ** uncapped, shared
  
 \\
 ----
 h2. Background
  
 Firing up a system with RHEL 5 u2, RHEL 5 u3, SLES 10 sp2, or SLES 11, vmstat shows a new "st" column as the rightmost column of logical CPU data.
  
 {noformat}
 # vmstat 1
 procs ---- -------memory---------- ---swap-- ---io--- --system-- -----cpu------
  r b swpd free buff cache si so bi bo in cs us sy id wa st
  0 0 0 11122432 537984 4258304 0 0 3 1 2 7 1 1 98 0 1
  0 0 0 11122496 537984 4258304 0 0 0 0 15 30 0 0 100 0 0
  0 0 0 11122496 537984 4258304 0 0 0 88 35 54 0 0 100 0 0
 {noformat}
  
 Checking the man page for vmstat, we see the new definition for "st" added. On Power systems, it would probably be more accurate to have an "sh" column when the two SMT threads are sharing the processor core, and then the more traditional "st" column for logical CPU cycles shared between partitions.
  
 {noformat}
 # man vmstat
 <clipped>
  CPU
  These are percentages of total CPU time.
  us: Time spent running non-kernel code. (user time, including nice time)
  sy: Time spent running kernel code. (system time)
  id: Time spent idle. Prior to Linux 2.5.41, this includes IO-wait time.
  wa: Time spent waiting for IO. Prior to Linux 2.5.41, included in idle.
  st: Time stolen from a virtual machine. Prior to Linux 2.6.11, unknown.
 <clipped>
 {noformat}
  
 As another example, top now shows the "st" column as well. As usual, if you type the number 1 while top is running, all of the available logical CPUs will be displayed. In this example, the logical CPU values are summarized.
  
 {noformat}
 top - 09:09:42 up 2 days, 11:17, 1 user, load average: 3.49, 1.69, 0.64
 Tasks: 137 total, 1 running, 136 sleeping, 0 stopped, 0 zombie
 Cpu(s): 23.2%us, 5.3%sy, 0.0%ni, 34.1%id, 36.4%wa, 0.3%hi, 0.3%si, 0.3%st
  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 21493 root 18 0 4032 1600 960 S 69 0.0 0:24.29 gzip
 21490 root 18 0 7616 4096 2688 D 28 0.0 0:11.14 tar
 21492 root 18 0 7424 2368 1216 S 4 0.0 0:01.52 tar
 21491 root 18 0 5696 2048 1472 S 1 0.0 0:00.38 cat
 21211 root 10 -5 0 0 0 D 1 0.0 0:00.30 kjournald
 <clipped>
 {noformat}
  
 \\
 ----
 h2. sysstat rpm
 Finally, if you have installed the sysstat rpm file on the Power system, you'll have access to the mpstat command. This command is particularly helpful because it will show all of the logical CPUs in use.
  
 On POWER systems, the simultaneous multi-threaded "SMT" mode allows two hardware threads to share a single physical processor core. In practice, this is implemented with two schedulable logical CPUs for each physical processor core. Here we have an example of a partition with a single physical processor core, which shows two idle logical CPUs with little to no SMT sibling thread sharing.
  
 {noformat}
 # mpstat -P ALL
 Linux 2.6.18-92.el5 (p6ihopenhpc2.ltc.austin.ibm.com) 10/12/2008
  
 06:27:06 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
 06:27:06 PM 0 0.09 0.00 0.10 0.03 0.33 0.09 0.00 99.35 2.37
 06:27:06 PM 1 0.00 0.00 0.07 0.02 0.31 0.06 0.00 99.53 1.94
 {noformat}
  
 The man page for mpstat gives the following description. This definition is a little better, as it describes the condition (involuntary wait) of the logical CPU while the physical processor core cycles are being shared.
  
 {noformat}
 # man mpstat
 <clipped>
  %steal
  Show the percentage of time spent in involuntary wait by the
  virtual CPU or CPUs while the hypervisor was servicing another
  virtual processor.
 <clipped>
 {noformat}
  
 If multiple processes are started on this example single physical core partition which keeps the two logical CPUs fully busy, mpstat will then show the following for the two logical CPUs.
  
 This shows that the "physical core" - the paired logical CPUs - is 100% busy. The potentially confusing part is that each logical CPU appears to be only half busy, with the steal cycles (those CPU cycles being shared between SMT siblings) representing the cycles that the other logical CPU (the other half of the processor core) is working on.
  
 {noformat}
 10:14:40 AM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
 10:14:45 AM all 49.80 0.00 0.10 0.00 0.10 0.00 48.50 1.50 15.83
 10:14:45 AM 0 48.50 0.00 0.00 0.00 0.00 0.00 48.50 3.01 10.22
 10:14:45 AM 1 50.90 0.00 0.20 0.00 0.00 0.20 48.70 0.00 5.61
  
 {noformat}
  
 {tip:title=50% User with 50% Steal when SMT is on}
 This is the most common scenario reported by customers. It reflects the logical CPU cycles from the perspective of each logical CPU. In this case, each SMT sibling is using (sharing) half of the resources of the underlying physical processor core.
 {tip}
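 
 As a rough illustration of this arithmetic, the following Python sketch adds up the non-steal, non-idle share of each SMT sibling from the mpstat sample above. The helper name and the dictionary layout are ours, for illustration only; mpstat itself does not report a per-core figure, and this assumes each sibling's non-steal time corresponds to its share of the core.
 
 {noformat}
# Hypothetical helper: estimate how busy a physical core is from the
# mpstat percentages of its two SMT sibling threads.

def core_busy_percent(siblings):
    """Sum the busy share (everything except steal and idle) of each sibling."""
    total = 0.0
    for cpu in siblings:
        total += 100.0 - cpu["steal"] - cpu["idle"]
    return total

# Values copied from the mpstat sample above (cpu0 and cpu1).
cpu0 = {"user": 48.50, "steal": 48.50, "idle": 3.01}
cpu1 = {"user": 50.90, "steal": 48.70, "idle": 0.00}

busy = core_busy_percent([cpu0, cpu1])
print("approximate core busy: %.2f%%" % busy)
 {noformat}
 
 With the sample values, the two siblings together account for nearly all of the core, which matches the "100% busy" reading described above.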
  
 \\
 ----
 h2. Source Code Insights
  
 To better understand where the measurements are coming from, we downloaded a version of the sysstat source code. sysstat was found on freshmeat:
  
  * http://freshmeat.net/projects/sysstat/
  
 Which pointed to..
  
  * http://pagesperso-orange.fr/sebastien.godard/ (which now appears to be down ?)
  
 Snagging the sysstat-7.0.2 version which is what RHEL 5.2 is based on..
  
  * http://perso.orange.fr/sebastien.godard/sysstat-7.0.2.tar.gz
  
 We looked at mpstat.c and noticed that these tools get their data from /proc/stat on the system, parsing the lines and assigning the values to the various fields of the mpstat output.
  
 It's interesting to note that even the control block names in the code reflect the phrasing of "steal"'ing CPU cycles.
  
 {noformat}
 mpstat.c
 <clipped>
  sscanf(line + 5, "%llu %llu %llu %llu %llu %llu %llu %llu",
  &(st_mp_cpu[curr]->cpu_user),
  &(st_mp_cpu[curr]->cpu_nice),
  &(st_mp_cpu[curr]->cpu_system),
  &(st_mp_cpu[curr]->cpu_idle),
  &(st_mp_cpu[curr]->cpu_iowait),
  &(st_mp_cpu[curr]->cpu_hardirq),
  &(st_mp_cpu[curr]->cpu_softirq),
  &(st_mp_cpu[curr]->cpu_steal));
 <clipped>
 {noformat}
  
 In the example below, from an 8-core (8 physical processor cores) Power6 system with SMT on, we look at /proc/stat, which shows the 16 schedulable logical CPUs.
  
 _This example of /proc/stat is from an idle system. It's intended simply as an example of the data values captured by the Linux kernel. Note that this example is different from the other more simple systems used which just have 1 physical processor core._
  
 {noformat}
 # cat /proc/stat
 cpu 12982 30 4229 3272769 9923 6460 2651 473
 cpu0 225 0 407 208077 165 448 214 27
 cpu1 165 3 452 205236 151 417 149 28
 cpu2 189 0 123 205767 114 364 151 22
 cpu3 442 0 204 205085 269 395 188 18
 cpu4 2416 0 67 203523 143 394 145 40
 cpu5 444 1 177 205247 168 419 129 16
 cpu6 369 6 64 205611 128 394 133 22
 cpu7 1197 4 610 204025 212 391 132 31
 cpu8 120 0 39 205776 148 451 173 19
 cpu9 138 1 64 205692 148 412 127 20
 cpu10 792 0 86 205168 146 385 125 24
 cpu11 762 0 315 204761 204 400 134 25
 cpu12 1250 2 127 204547 221 393 160 25
 cpu13 724 1 130 204993 213 383 132 24
 cpu14 2378 1 738 196749 6174 397 224 62
 cpu15 1364 0 618 202503 1312 411 329 64
 intr 309891
 ctxt 804919
 btime 1223905644
 processes 37521
 procs_running 1
 procs_blocked 0
 {noformat}
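 
 To make the field layout concrete, here is a minimal Python sketch that parses one per-CPU line in the same field order as the sscanf() call in mpstat.c shown earlier, then computes a steal percentage from the difference between two samples. The function names are ours, and the second sample is made up purely to show the calculation; this is not how mpstat is implemented, only the same idea in a few lines.
 
 {noformat}
# Field order follows the sscanf() call in mpstat.c shown earlier.
FIELDS = ("user", "nice", "system", "idle", "iowait", "hardirq", "softirq", "steal")

def parse_cpu_line(line):
    """Turn one 'cpuN ...' line from /proc/stat into (name, counters dict)."""
    parts = line.split()
    name = parts[0]
    # Only the first eight counters are used; any extra columns are ignored.
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return name, dict(zip(FIELDS, values))

def steal_percent(before, after):
    """Percentage of the ticks elapsed between two samples that were steal."""
    delta = {k: after[k] - before[k] for k in FIELDS}
    total = sum(delta.values())
    if total == 0:
        return 0.0
    return 100.0 * delta["steal"] / total

# First sample: the cpu0 line from the listing above.
# Second sample: invented values (50 more user ticks, 50 more steal ticks).
name, first = parse_cpu_line("cpu0 225 0 407 208077 165 448 214 27")
_, second = parse_cpu_line("cpu0 275 0 407 208077 165 448 214 77")

print(name, first["steal"], steal_percent(first, second))
 {noformat}
 
 The counters in /proc/stat are cumulative since boot, so a percentage only makes sense over an interval, which is why the sketch works on the difference between two samples.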
  
 Through discussions with developers, we've heard that "time.c" in the arch/ppc64/kernel tree should have the source we're looking for.
  
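 As an exercise, we thought we'd check first in the kernel.org tree. If we looked in the mainline kernel source tree - for example the 2.6.16.18 tarball below, which we took as a stand-in for the RHEL 5.2 base (the RHEL 5.2 kernel in the mpstat output above reports itself as 2.6.18-92.el5) - would we find the steal code?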
  
 * http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.16.18.tar.bz2
  
 We downloaded it, un-tar'ed it, and looked under the ppc64 arch tree, where we found time.c, which should handle the calculations for the "steal" time. Of course, the time.c here doesn't have the changes. This is indicative of the parallel work efforts within the distros and the mainline kernel.
  
 Turns out you really have to look in the Red Hat src.rpm that comes with RHEL 5.2, and look at arch/ppc/kernel/time.c.
  
 Also in the RHEL 5.2 base there is a patch which specifically addresses the mpstat reports on "steal" usage.
  
  * (See linux-2.6-ppc64-mpstat-reports-wrong-per-processor-stats.patch)
  
 According to the developer discussions - and snagged nearly verbatim from a RHEL 5.2 patch in the src rpm...
  
  * _In calculating stolen time, we originally were trying to actually account for time spent in the hypervisor. We don't really have enough information to do that accurately, so we don't try. Instead, we now calculate stolen time as time that the current cpu thread is not actually dispatching instructions. On chips without a PURR, we cannot do this, so stolen time will always be zero. On chips with a PURR, this is merely the difference between the elapsed PURR values and the elapsed TB values._
  
 In a later paper, we'll look into what this (PURR, TB, when *not* dispatching instructions) means in English. And of course, we're still missing the big picture of how CPU cycles are assigned across user, idle, iowait, and steal.
  
  
 {noformat}
 void calculate_steal_time(void)
 {
         u64 tb, purr;
         s64 stolen;
         struct cpu_purr_data *pme;
 
         if (!cpu_has_feature(CPU_FTR_PURR))
                 return;
         pme = &per_cpu(cpu_purr_data, smp_processor_id());
         if (!pme->initialized)
                 return; /* this can happen in early boot */
         spin_lock(&pme->lock);
         tb = mftb();
         purr = mfspr(SPRN_PURR);
         stolen = (tb - pme->tb) - (purr - pme->purr);
         if (stolen > 0)
                 account_steal_time(current, stolen);
         pme->tb = tb;
         pme->purr = purr;
         spin_unlock(&pme->lock);
 }
 {noformat}
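 
 The arithmetic in this function can be mirrored in a few lines of Python. This is only a sketch with made-up tick counts. It assumes, for illustration, that with two busy SMT siblings the PURR of each thread advances by roughly half of the elapsed timebase (TB) ticks, which is our reading of the 50% steal observation and not something the patch text states. The function name is ours.
 
 {noformat}
def stolen_ticks(tb_prev, tb_now, purr_prev, purr_now):
    """Elapsed timebase minus elapsed PURR, never negative.

    Mirrors the subtraction in calculate_steal_time(); the kernel only
    accounts the value when it is greater than zero.
    """
    stolen = (tb_now - tb_prev) - (purr_now - purr_prev)
    return stolen if stolen > 0 else 0

# Made-up numbers: 1000 timebase ticks elapse while this thread's
# PURR advances by only 500.
elapsed_tb = 1000
stolen = stolen_ticks(0, elapsed_tb, 0, 500)
steal_pct = 100.0 * stolen / elapsed_tb

print(stolen, steal_pct)
 {noformat}
 
 Under that assumption, each sibling would report about half of its elapsed time as steal, which lines up with the 50% user and 50% steal pattern seen in mpstat.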
  
 \\
 ----
 h2. Parsing /proc/ppc64/lparcfg
  
 "lparcfg" is the key data source for the definition of the partition from the perspective of the running partition. Here we can parse out how the partition was defined and the constraints (or lack of constraints) placed on the partition by the system administrators.
  
 For a slightly dated description, see the paper on [Entries in the proc filesystem|http://www.ibm.com/developerworks/wikis/display/LinuxP/Entries+in+the+proc+Filesystem] for some background information.
  
 LPAR (partition) information can be seen in the system file /proc/ppc64/lparcfg
  
 {noformat}
 # cat /proc/ppc64/lparcfg
 lparcfg 1.7
 serial_number=IBM,03100A8E2
 system_type=IBM,8203-E4A
 partition_id=1
 R4=0x190
 R5=0x0
 R6=0x8001ffff
 R7=0x1000000000004
 BoundThrds=1
 CapInc=100
 DisWheRotPer=5120000
 MinEntCap=100
 MinEntCapPerVP=100
 MinMem=256
 MinProcs=1
 partition_max_entitled_capacity=400
 system_potential_processors=4
 DesEntCap=400
 DesMem=15808
 DesProcs=4
 DesVarCapWt=0
 DedDonMode=0
  
 partition_entitled_capacity=400
 group=32769
 system_active_processors=4
 unallocated_capacity_weight=0
 capacity_weight=0
 capped=1
 unallocated_capacity=0
 purr=725341178006384
 partition_active_processors=4
 partition_potential_processors=4
 shared_processor_mode=0
 {noformat}
  
 For this exercise, the values we really care about are as follows:
  
 || lparcfg keyword || Meaning ||
 | partition_entitled_capacity=400 |*Entitled*\\How many processor units are assigned to this partition. Divide the number by 100 to get processor units \\400 = 4.0 processor units \\ 80 = 0.80 processor units|
 | capacity_weight=0 | *Weight* \\Used to prioritize partitions competing for CPU resources - if zero is specified, the partition is effectively capped |
 | capped=1 | *Capped* \\=1: Don't allow this partition to take more resources \\ =0: Allow this partition to use extra resources from the system |
 | shared_processor_mode=0 | *Shared*\\=1: Allow this partition to share unused cycles \\=0: Do not share this partition's resources |
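 
 As an illustration, these four values can be pulled out of the lparcfg text with a short Python sketch. The function names and the summary dictionary are ours. The sample text is trimmed from the listing above; on a real partition the text would come from reading /proc/ppc64/lparcfg.
 
 {noformat}
def parse_lparcfg(text):
    """Collect key=value lines; lines without '=' (such as 'lparcfg 1.7') are skipped."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:
            values[key] = value
    return values

def summarize(cfg):
    """Interpret the four keywords from the table above."""
    return {
        "entitled_units": int(cfg["partition_entitled_capacity"]) / 100.0,
        "weight": int(cfg["capacity_weight"]),
        "capped": cfg["capped"] == "1",
        "shared": cfg["shared_processor_mode"] == "1",
    }

# Trimmed copy of the lparcfg listing above.
SAMPLE = """lparcfg 1.7
partition_entitled_capacity=400
capacity_weight=0
capped=1
shared_processor_mode=0
"""

summary = summarize(parse_lparcfg(SAMPLE))
print(summary)
 {noformat}
 
 For the sample partition this reports 4.0 entitled processor units, capped, and not shared.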
  
  
 \\
 ----
 h2. Power 6 Blade measurements
  
 For the purposes of these measurements, we used a single socket dual core (two physical processor cores) Power6 Blade - the JS12. SMT was on. Tested with RHEL 5.2. Results are slightly rounded for simplicity.
  
 The JS12 blade was defined with three partitions - two of the partitions to be used for workload comparisons, and a third small partition for the VIOS server. The workload we used was an un-tuned engineering run of a multi-process Java workload, with the results normalized. The VIOS server had very minimal CPU usage and we essentially ignored it for this exercise.
  
 For example, when the workload was running, we used mpstat in the Linux partition with a single physical processor core allocated and SMT on (two logical CPUs - cpu0 and cpu1). As we saw before, it shows about 50% user cycles and 50% steal cycles.
  
 {noformat}
 # mpstat
 10:14:40 AM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
 10:14:45 AM all 49.80 0.00 0.10 0.00 0.10 0.00 48.50 1.50 15.83
 10:14:45 AM 0 48.50 0.00 0.00 0.00 0.00 0.00 48.50 3.01 10.22
 10:14:45 AM 1 50.90 0.00 0.20 0.00 0.00 0.20 48.70 0.00 5.61
 {noformat}
  
 So, running various permutations of partitions and partition settings, we provide some example "normalized" metrics. These metrics are not intended as projections or commitments, but more simply as examples of how the logical CPU assignments will impact the workloads running.
  
 || Test # || Partitions || mpstat \\ *user%* \\ (cpu0 - cpu1) || mpstat \\ *steal%* \\ (cpu0 - cpu1) || Result \\ "value" ||
 | 1 | P1. 1.0 processor Not-Shared Capped \\ P2. Off \\ P3. VIOS 0.2 cpu| 50% - 50% \\ - \\ - | 50% - 50% \\ - \\ - | 100 \\- \\- |
 | 2 | P1. 1.0 processor Shared Capped \\ P2. Off \\ P3. VIOS 0.2 cpu| 50% - 50% \\ - \\ - | 50% - 50% \\ - \\ - | 100 \\- \\- |
 | 3 | P1. 0.8 processor Shared Capped \\ P2. Off \\ P3. VIOS 0.2 cpu| 40% - 40% \\ - \\ - | 60% - 60% \\ - \\ - | 80 \\- \\- |
 | 4 | P1. 0.8 processor Shared Uncapped Weight=128 \\ P2. Off \\ P3. VIOS 0.2 cpu| 50% - 50% \\ - \\ - | 50% - 50% \\ - \\ - | 100 \\- \\- |
 | 5 | P1. 0.6 processor Shared Uncapped Weight=0 \\ P2. Off \\ P3. VIOS 0.2 cpu \\uncapped weight=0 means "capped" | 30% - 30% \\ - \\ - | 70% - 70% \\ - \\ - | 60 \\- \\- |
 || Test # || Partitions || mpstat \\ *user%* \\ (cpu0 - cpu1) || mpstat \\ *steal%* \\ (cpu0 - cpu1) || Result \\ "value" ||
 | 6 | P1. 1.0 processor Not-Shared \\ P2. 0.8 cpu Shared Capped \\ P3. VIOS 0.2 cpu| 50% - 50% \\ 40% - 40% \\ - | 50% - 50% \\ 60% - 60% \\ - | 100 \\ 80 \\- |
 | 7 | P1. 0.8 processor Shared Capped \\ P2. 0.8 cpu Shared Capped \\ P3. VIOS 0.2 cpu| 40% - 40% \\ 40% - 40% \\ - | 60% - 60% \\ 60% - 60% \\ - | 80 \\ 80 \\- |
 | 8 | P1. 0.6 processor Shared Capped \\ P2. 0.6 cpu Shared Capped \\ P3. VIOS 0.2 cpu| 30% - 30% \\ 30% - 30% \\ - | 70% - 70% \\ 70% - 70% \\ - | 60 \\ 60 \\- |
  
  
 It's interesting to note that the hypervisor assigns "physical processor cores" to a partition, corresponding to the number of entitled processor units assigned to the partition. So, with a single physical processor core assigned to these partitions, the "most" that we see here is each logical CPU capping out at 50% user busy.
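 
 The pattern in the table above can be approximated with a simple model, sketched below in Python. This is our own simplification and not the hypervisor's algorithm. It assumes a fully busy partition with one virtual processor (the virtual_processors parameter is a hypothetical knob, not a value taken from the tests), and that an uncapped partition with a non-zero weight finds enough spare capacity to fill a whole core.
 
 {noformat}
def expected_user_and_steal(entitled_units, capped, weight=0,
                            smt_threads=2, virtual_processors=1):
    """Rough per-logical-CPU (user%, steal%) for a fully busy partition."""
    if capped or weight == 0:
        # A weight of 0 behaves like capped, as noted for test 5.
        usable_units = entitled_units
    else:
        # Assumption: spare capacity exists, so the partition can grow
        # up to its number of virtual processors.
        usable_units = float(virtual_processors)
    usable_units = min(usable_units, float(virtual_processors))
    user = 100.0 * usable_units / (virtual_processors * smt_threads)
    return user, 100.0 - user

# Tests 3, 4, and 5 from the table above.
test3 = expected_user_and_steal(0.8, capped=True)
test4 = expected_user_and_steal(0.8, capped=False, weight=128)
test5 = expected_user_and_steal(0.6, capped=False, weight=0)

print(test3, test4, test5)
 {noformat}
 
 The model gives roughly 40%/60%, 50%/50%, and 30%/70% for these three cases, in line with the rounded measurements. It says nothing about partitions competing for the same spare cycles, which would need real measurements.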
  
 \\
 ----
 h2. Possible follow-on assessments
  
 There are other implications and subtleties which we'll explore further if there's interest. For example,
  
 * we'd like to explore how nmon reports on usage in these cases. The nmon tool does some very nice data gathering and summary reports for Power partitions. In particular, nmon ignores the steal column.
  
 * we'd like to show what happens when you're working on larger systems with bigger possible swings in CPU utilization and steal percentages - in other words - just how busy can a logical CPU be?
  
 * where is the art of reporting CPU utilization really going?
  
 * how are logical CPU cycles really assigned? In particular, the CPU percentage values from mpstat
 {noformat}
 %user %nice %sys %iowait %irq %soft %steal %idle
 {noformat}
  
 * How are PURR and timebase values used?
  
 Many good areas to dive into.
  
 {tip:title=For discussions...}
 * Feel free to post additional questions and observations on the [Linux for Power Architecture forum|http://www.ibm.com/developerworks/forums/forum.jspa?forumID=375]
 {tip}

 