  • 1 reply
  • Latest Post - 2008-11-14T16:04:30Z by SystemAdmin
SystemAdmin
706 Posts

Pinned topic Monitoring CPU cycles - what's this steal column?

2008-11-12T16:49:03Z
Several of us have recently been working with customers and software vendors on interesting questions about CPU metrics on the latest Linux distros for Power systems, in this case RHEL 5.2 and SLES 10 SP2.

The questions started with the simple: What's this "steal" CPU metric column?

In response to the original question, a wiki page on measuring stolen CPU cycles was developed and posted on developerWorks. It turned out this was just the simple "get your feet wet" introduction.
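For reference, on Linux the steal figure that tools such as vmstat and top show in their "st" column comes from the steal counter on the aggregate "cpu" line of /proc/stat (the eighth numeric field on 2.6.11 and later kernels). Below is a minimal sketch, assuming that field layout, of turning two samples of that line into a steal percentage for the interval; the function names are hypothetical and this is not how vmstat itself is written.

```python
# Assumed field order of the "cpu" lines in /proc/stat (Linux 2.6.11+).
# Later kernels append guest/guest_nice, which are already counted in user/nice,
# so this sketch ignores them.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Map a /proc/stat 'cpu' line to {field: ticks}.

    Extra trailing fields are ignored; missing ones (older kernels without
    steal accounting) are treated as 0.
    """
    values = [int(v) for v in line.split()[1:1 + len(FIELDS)]]
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def read_aggregate_cpu(path="/proc/stat"):
    """Return the parsed aggregate 'cpu ' line (the sum over all logical CPUs)."""
    with open(path) as f:
        for line in f:
            if line.startswith("cpu "):
                return parse_cpu_line(line)
    raise RuntimeError("no aggregate cpu line in " + path)


def steal_percent(before, after):
    """Share of all elapsed ticks in the interval that were stolen, in percent."""
    deltas = {f: after[f] - before[f] for f in FIELDS}
    total = sum(deltas.values())
    return 100.0 * deltas["steal"] / total if total else 0.0
```

On a live system you would call read_aggregate_cpu(), sleep for the sampling interval, call it again, and pass both results to steal_percent().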

And more questions came in.

  • In one case, someone uses only vmstat, and the way it averages the individual CPU metrics gets confusing.

  • In another case, a regular nmon user asked how nmon works with the steal metrics.

  • What happens when partitions are sharing more CPU resources on the larger systems? Can steal go over 100%?

  • How does a customer monitor and reconcile CPU metrics across AIX, SLES, and Red Hat partitions running on a single physical system?

  • And we quickly discovered that the various terms (CPUs, cores, SMT threads, virtual processors, logical CPUs, Linux's lparcfg, Linux terms across platforms, AIX terms, HMC terms) all get muddled together.
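On the vmstat averaging question: the aggregate "cpu" line in /proc/stat holds the sum of the per-CPU "cpu0", "cpu1", ... tick counts, so a system-wide steal percentage computed from it is a tick-weighted average over all logical CPUs. The sketch below, assuming the same /proc/stat field layout (steal as the eighth numeric field), computes the per-CPU and aggregate steal percentages side by side so the two views can be compared. The function names are hypothetical.

```python
# Assumed /proc/stat "cpu" field order (Linux 2.6.11+); extra fields are ignored.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")
STEAL = FIELDS.index("steal")


def parse_stat(text):
    """Return {cpu_name: [ticks, ...]} for every 'cpu'/'cpuN' line in /proc/stat text."""
    cpus = {}
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0].startswith("cpu"):
            cpus[parts[0]] = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return cpus


def steal_breakdown(before_text, after_text):
    """Steal percentage over one interval for each logical CPU and for the aggregate line."""
    before, after = parse_stat(before_text), parse_stat(after_text)
    result = {}
    for name, ticks in after.items():
        if name not in before:  # e.g. a CPU that came online during the interval
            continue
        deltas = [a - b for a, b in zip(ticks, before[name])]
        total = sum(deltas)
        result[name] = 100.0 * deltas[STEAL] / total if total else 0.0
    return result
```

With two logical CPUs where cpu0 sees no steal and cpu1 sees 40%, the aggregate "cpu" entry comes out at 20%, which is the single number a vmstat-only user would see.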

So work is proceeding, and several more Linux wiki pages are being developed. First, we're settling on the terms and on how tools like vmstat and nmon report the "steal" metrics.

In this forum thread, we'll report on our progress while trolling for more questions. We expect to post some draft wiki pages in the coming days to solicit comments and feedback.
Updated on 2008-11-14T16:04:30Z by SystemAdmin
  • SystemAdmin

    Re: Monitoring CPU cycles - what's this steal column?

    2008-11-14T16:04:30Z
    Got a good pointer from Andrew, which I'll share here. This is the basis for the metrics and for how the various CPU utilization metrics are derived. It will be tied to the SMT handling and the CPU steal cycles that Linux attempts to report on.

    Here is a good explanation of how processor utilization should be calculated with PURR values:

    http://www.research.ibm.com/journal/rd/516/mccreary.html

    • "To support accurate accounting and utilization, the POWER5 architecture added a special-purpose register for each SMT thread, called the processor utilization of resources register (PURR). The PURR counts only the timebase ticks assigned to the thread by the processor. The hypervisor virtualizes the PURR. For dedicated processor partitions, in which processor cores are dedicated to running code only for that partition, the virtualized value is simply the total PURR count for the hardware thread. On the other hand, for processor cores in the shared processor pool, the hypervisor saves and restores the PURR value, managing a separate PURR value for each partition sharing the processor core."

    and

    • "For accounting purposes, the amount of CPU time consumed in an interval is the value of the virtualized PURR at the end of an interval minus the value of the virtualized PURR at the beginning. The utilization of the logical CPU is the ratio of the number of PURR ticks spent in the active state, that is, outside of the idle state, to the total number of PURR ticks for the interval. However, the utilization of the physical CPU is the sum of the PURR ticks in the active state for the two threads divided by the number of timebase ticks in the interval."
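    The arithmetic in the quoted passages can be sketched as follows: consumed time is the change in the virtualized PURR; logical CPU utilization is active (non-idle) PURR ticks divided by total PURR ticks for that thread; physical CPU utilization is the active PURR ticks of both SMT threads divided by the timebase ticks in the interval. This is a minimal illustration only. The ThreadSample type and its idle_purr counter are hypothetical: the sketch simply assumes the OS can tell how many PURR ticks a thread spent in its idle state, and it does not reflect how Linux or AIX actually read these registers.

```python
from dataclasses import dataclass


@dataclass
class ThreadSample:
    """Hypothetical per-SMT-thread snapshot taken at one instant."""
    purr: int       # virtualized PURR value for the thread
    idle_purr: int  # assumed: PURR ticks accumulated while in the idle state


def consumed(begin, end):
    """CPU time consumed in the interval: virtualized PURR at end minus at beginning."""
    return end.purr - begin.purr


def logical_cpu_utilization(begin, end):
    """Active PURR ticks divided by total PURR ticks for one SMT thread."""
    total = end.purr - begin.purr
    active = total - (end.idle_purr - begin.idle_purr)
    return active / total if total else 0.0


def physical_cpu_utilization(begins, ends, timebase_ticks):
    """Sum of active PURR ticks over the core's SMT threads, divided by timebase ticks."""
    active = sum(
        (e.purr - b.purr) - (e.idle_purr - b.idle_purr)
        for b, e in zip(begins, ends)
    )
    return active / timebase_ticks if timebase_ticks else 0.0
```

    As a worked example, take a 1000-tick interval in which thread 0's PURR advances by 600 ticks (150 of them idle) and thread 1's by 400 ticks (300 idle). Thread 0 then looks 75% busy and thread 1 25% busy, yet the core as a whole was busy for (450 + 100) / 1000 = 55% of the timebase ticks. The numbers are chosen so the two PURR deltas add up to the timebase ticks, as one would expect on a dedicated core where every tick is assigned to one of the two threads.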