Author - Saravanan Devendran (often called DRS by his friends).
Traditionally, processor utilization has been the primary metric for understanding the performance of a system running a workload, for capacity planning, and for chargeback. Processor technology has changed tremendously over the past decade, and this has called for a change in how processor utilization is computed and interpreted. This article covers in detail how processor utilization is computed in AIX® and how that computation has changed over the past decade in step with IBM® POWER processor technology.
Today the industry is moving towards greater consolidation of computing resources to maximize their efficiency. With cloud computing services gaining traction, it is important for providers and consumers of these services to understand resource utilization metrics as they strive to drive utilization levels close to 100%.
Understanding resource utilization and the spare capacity is becoming more important than ever. Processors are one of the most significant resources among all the available resources in a system, making processor utilization a critical metric for measuring the performance of a system running a workload. IBM® POWER processor technology has undergone significant advancements in the past decade. In parallel, the processor utilization computation methodologies in AIX® have also undergone changes. It is important to understand IBM® POWER processor technology to understand the processor utilization computation methodologies used in AIX®.
Understanding Processor Utilization
Processor utilization is the metric that represents how busy a processor core is and how much spare capacity the core has to take on additional work. The processor core executes instructions. The amount of time a processor core is busy executing instructions is called the core utilization. A POWER processor core executes instructions at a rate approaching one instruction per processor cycle. (A cycle is the inverse of the processor frequency; a 4 GHz processor, at 4 billion cycles per second, has a processor cycle of 1/4 nanosecond.) Instructions do not execute from beginning to end within the time of one cycle. Instead, instructions pass through multiple stages before completing, each stage typically one cycle long, from the point where they begin execution to the point where the processor core treats them as complete. Think of this as an assembly line: each one-cycle stage does something with an instruction before passing it off to the next stage. As with an assembly line, once an instruction has gone through one stage, a subsequent instruction can use the now available stage. Figure-1 illustrates this by showing the execution of an ordered stream of instructions A through G. This is called, rather descriptively, a "pipeline".
This single pipe allows for a maximum instruction execution rate of one instruction per cycle. Typically, in a single pipe architecture, the processor core will be actively executing instructions from the same task throughout the execution time. Hence, the aggregate of the tasks' execution time is a pretty good representation of processor core utilization. With this architecture, the utilization calculations are straightforward because the processor core resource is used by only one task at a time.
The utilization computation for the case depicted in Figure-2 is very simple: it is just the time spent by the tasks on the processor core. For multi-processor systems, the overall system utilization is the average of the individual processor core utilization levels. For instance, let's review Figure-3:
Core1 and Core3, marked in red, are the fully utilized processor cores; both were busy executing tasks. Core2 and Core4 had nothing to run, so these processor cores were idle. In this case the overall CPU utilization is 50%: Core1 and Core3 are 100% busy and Core2 and Core4 are 0% busy, which averages to 50%. In these cases, the computation of processor core utilization simply uses the tasks' execution time.
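As a minimal sketch of the averaging described above, the following Python snippet reproduces the Figure-3 arithmetic. The dictionary and function names are illustrative assumptions; they are not part of AIX® or of any reporting tool.

```python
# Per-core busy percentages from the Figure-3 example:
# Core1 and Core3 are fully busy, Core2 and Core4 are idle.
core_busy_pct = {"Core1": 100.0, "Core2": 0.0, "Core3": 100.0, "Core4": 0.0}


def overall_utilization(per_core_pct):
    """Traditional view: the mean of the individual core utilization levels."""
    return sum(per_core_pct.values()) / len(per_core_pct)


print(f"overall utilization: {overall_utilization(core_busy_pct):.0f}%")
```

This simple average is only meaningful while each core runs one task at a time, which is the assumption of this section.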
Let's take a very simple example. The application "myapp" requires one second of single core execution time and can process two transactions with that amount of processor execution time. Executing "myapp" on an idle machine with a single processor core will take one second to complete, hence the output of the AIX® 'time' command looks as follows:
The above output reports that the application ran for one second, as shown by the metric "real", and that it consumed one second of processor core time in user mode, as shown by the metric "user". It is important to understand that "real" represents the elapsed wall clock time, and that the sum of "user" and "sys" represents the amount of time the processor core is busy executing instructions from the "myapp" application. The output of the AIX® 'sar' command looks as follows:
The sar command represents the utilization as a percentage. With a monitoring interval of one second, the sar command reported that the only CPU, 'cpu0', is 100% busy executing instructions. The row with "-" for cpuid represents the overall system utilization; as the system has only one CPU, the overall system utilization is also 100%. Now let's take the same "myapp" application and run it on a system with two processors. The time command output stays the same, but there would be a difference in the sar output, which looks as follows:
In the above output the overall system utilization is reported as 50%, as only one processor core is busy among the two available.
Understanding Processor Utilization in recent POWER servers
The previous section provides a basic understanding of how traditional processor core utilization is computed. As mentioned earlier, the IBM® POWER processor has undergone many advancements in the past decade. Starting with the IBM® POWER4™ processor family, the processor core supports multiple "pipeline stages" to allow independent instructions to execute in parallel. This enables a remarkable amount of capacity for concurrently executing instructions. With the introduction of multiple pipelines, it became important to ensure that all the pipeline stages are efficiently and fully utilized. Figure-4 depicts an example of how various execution units are utilized in a typical multi-piped architecture:
Each row represents an execution unit and each column represents a processor cycle. A blue colored box indicates that the execution unit was busy during that cycle. Now let's take the example of an ordered stream of instructions A through H of a task and represent its execution with multiple pipelines:
Note that certain instructions, like E and D, can execute in parallel, but a single task's instruction stream will still not be able to efficiently utilize all the pipeline stages. To efficiently utilize all the pipeline stages, the concept of Simultaneous Multi-threading (SMT) was introduced. SMT enables the concurrent execution of multiple instruction streams from multiple tasks on the same core. Their instruction streams, which are typically independent, really are being executed concurrently; if a pipeline stage is not being used by one task's instructions, another task's instruction stream gets to use it. Conversely, if multiple tasks' instructions want to use the same pipeline stage, only one gets to use it in that cycle, and the other task's instructions are momentarily delayed. This distinction is important.
Figure-6 depicts how the pipeline stages are utilized with a two hardware thread SMT model:
To provide SMT, the processor cores are provided with multiple hardware thread contexts. In the case of POWER5™ and POWER6™ processors, the core supports one or two hardware thread contexts per core. The POWER7™ processor supports one, two or four hardware thread contexts per core. The AIX® operating system treats each hardware thread as a logical CPU. Figure-7 depicts the POWER5™ and POWER6™ processor behavior:
Figure-8 depicts an example where one more task's instruction stream has been added to our earlier example (Figure-6); the second stream is identical to the first, to show the effect. Notice that the number of cycles spent by Task1 remained unchanged even with the addition of Task2's instruction stream, because the core's pipes were able to provide pipe stages to both instruction streams as needed. Of course, that is not always the case; where a pipeline stage is not available when needed, a task must wait. In general, this means that both tasks' execution speed becomes slightly slower.
It is important to note that the execution of two tasks on the same core may not provide the same throughput as running them on two cores. As mentioned earlier, the AIX® operating system supports SMT by treating each hardware thread context (Htc) as a logical processor; this makes scheduling tasks onto multiple hardware thread contexts simpler. Since the operating system treats individual hardware thread contexts as logical processors, the traditional method of calculating utilization will produce inaccurate results. Let's review this by running the "myapp" application example. Figure-9 shows two instances of "myapp" running on two different cores:
The combined throughput of the two instances would be four transactions per second. Let's review the output of time and sar commands, for one "myapp" instance, using the traditional way of computing utilization.
The "cpu" column in sar lists the AIX® logical CPU IDs for the hardware thread contexts. For this example, let's assume that cpuid 0 and 1 represent hardware thread contexts 0 and 1 of the first core, and 2 and 3 represent hardware thread contexts 0 and 1 of the second core. (Note: in a real system the logical CPU IDs need not be sequential, and the logical CPU IDs of the hardware threads in a core need not be contiguous.) The spare capacity as reported by the sar command is 50%. The normal expectation is that adding two more instances of "myapp" will scale the throughput linearly to 8 tps, as 50% of the processing resources are still available. This is not true. Let's review the illustration shown in Figure-10.
The corresponding output of the time command for one instance of "myapp" for the illustration shown in Figure-10:
Note that a single instance of "myapp" took 0.33 seconds more to complete two transactions, reducing its throughput to 1.5 tps. At the same time, the throughput of a single core increased from 2 tps to 3 tps by running two instances of "myapp" on the same core. This means that the system throughput with four instances on two cores is 6 tps instead of the expected 8 tps. This clearly indicates that the system did not have 50% spare capacity, and it shows the importance of understanding the spare capacity accurately in the case of SMT enabled cores.
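The throughput figures above can be checked with a short calculation. The following Python sketch uses only the numbers quoted in the "myapp" example; it assumes that the 1.33 second elapsed time is 4/3 of a second after rounding, and all names are illustrative.

```python
# Figures taken from the "myapp" example in the text.
TRANSACTIONS_PER_RUN = 2      # transactions completed by one run of "myapp"
CORES = 2

# Assumption: the 1.33 s elapsed time quoted in the text is 4/3 s rounded.
ELAPSED_ALONE = 1.0           # seconds, one instance per core
ELAPSED_SHARED = 4.0 / 3.0    # seconds, two instances sharing a core


def throughput_tps(elapsed_seconds, instances=1):
    """Transactions per second delivered by `instances` concurrent runs."""
    return instances * TRANSACTIONS_PER_RUN / elapsed_seconds


per_core_alone = throughput_tps(ELAPSED_ALONE)                 # one instance on a core
per_instance_shared = throughput_tps(ELAPSED_SHARED)           # each of two instances
per_core_shared = throughput_tps(ELAPSED_SHARED, instances=2)  # both instances together

# What "50% spare capacity" would suggest versus what the two SMT cores deliver.
naive_system_tps = 2 * per_core_alone * CORES
actual_system_tps = per_core_shared * CORES

print(f"expected from 50% idle: {naive_system_tps:.1f} tps")
print(f"delivered with SMT:     {actual_system_tps:.1f} tps")
```

The gap between the two printed values is the capacity that the traditional utilization figure overstated.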
In addition to SMT, IBM® POWER introduced the capability to share processor cores across multiple logical partitions. Figure-11 illustrates this behavior:
The PowerVM™ Hypervisor provides a virtualized view of the available physical processor cores to the AIX® operating system. The PowerVM™ Hypervisor takes care of providing slices of physical processor cores to the OS. The AIX® operating system sees only the virtual cores. A hypervisor virtual processor core corresponds to one AIX® logical processor when SMT is disabled. With processor core sharing enabled, the need to understand core level utilization in terms of fractions of cores became more crucial.
Beginning with the IBM® POWER5™ processor architecture, a new register, the PURR, was introduced to assist in computing utilization. PURR stands for Processor Utilization of Resources Register, and it is available per hardware thread context. The PURR provides an actual count of the physical processing time units that a hardware thread has used. The hardware increments the PURR based on how each hardware thread is using the resources of the processor core. The PURR counts in proportion to the real time clock (timebase), so over a period of time the sum of the PURR increments of the individual hardware threads in a core will be very close to the elapsed timebase. The ratio of the PURR to the timebase register gives the fraction of the core utilized over a period of time. The AIX® performance reporting tools were updated to use the PURR for computing CPU utilization, which provides more accurate results than the traditional model. Now let's review how the AIX® sar command reports the processor utilization when two instances of "myapp" are running on two different cores:
In the above output the fraction of a core consumed is represented by a new column in the sar output, named "physc". The system resource utilization was also modified to report the physical utilization of the processor resources, rather than the traditional average of individual CPU utilization levels. The above output indicates that the two instances of "myapp" running on two different cores each consume an entire core, and no spare capacity is available.
Now let's take the scenario of four instances of "myapp" running on two cores and review the output of sar, which looks as follows:
Note that each instance of "myapp" only gets 50% of the physical core. This clearly indicates why the throughput dropped from 2 tps to 1.5 tps when both instances of "myapp" are scheduled on the same core. AIX® provides the above view through a new utility named "mpstat". This helps in understanding the utilization of an individual hardware thread context (logical processor). For the above scenario, the mpstat output would look as follows:
Note that in the above output the "Proc*" entries indicate the virtual processors and the "cpu*" entries indicate the logical processors. At any point in time only one virtual processor will be executing on one core.
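The PURR-to-timebase ratio described above can be sketched in a few lines of Python. This is an illustration only: the counter deltas and the tick rate are hypothetical values chosen to match the case of two "myapp" instances sharing one core, and the function name is an assumption rather than an AIX® interface.

```python
def cores_consumed(purr_delta, timebase_delta):
    """Fraction of a physical core used by one hardware thread in an interval."""
    if timebase_delta <= 0:
        raise ValueError("timebase must advance during the interval")
    return purr_delta / timebase_delta


# Hypothetical counter deltas for one core over a one-second interval.
# The tick rate below is an illustrative assumption, not a documented value.
TIMEBASE_TICKS = 512_000_000
purr_deltas = {"cpu0": 256_000_000, "cpu1": 256_000_000}

physc = {cpu: cores_consumed(d, TIMEBASE_TICKS) for cpu, d in purr_deltas.items()}
core_total = sum(physc.values())

for cpu, value in physc.items():
    print(f"{cpu}: physc = {value:.2f}")
print(f"core total = {core_total:.2f}")
```

Because the per-thread PURR increments add up to roughly the elapsed timebase, the per-thread fractions add up to about one core, which is what makes the figures usable as fractions of physical capacity.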
Beginning with the IBM® POWER6™ processor architecture, the processor frequency can be altered. This gives the user the flexibility to cap the energy consumption of the processor, so the processor can be made to run at a lower frequency than its rated frequency. This adds another level of complexity to performance reporting. With the PURR a user is able to see the utilization level based on the current running frequency of the processor, but users may also want to understand the overall spare capacity available on their system. For this reason the POWER6™ processor provides an additional register, the SPURR. SPURR stands for Scaled Processor Utilization of Resources Register. The SPURR is similar to the PURR except that it increments in proportion to the processor core frequency. This means that when the core is running at a frequency 50% lower than its rated frequency, the SPURR will increment at half the rate of the PURR. For accounting or chargeback, AIX® always uses utilization computed from the SPURR, and for utilization reporting AIX® always uses utilization computed from the PURR. The SPURR-to-PURR ratio is used to compute the change in frequency. The AIX® lparstat, sar and mpstat utilities were modified to report this ratio via a new column named "nsp". The PURR based utilization only provides the spare capacity at the current running frequency of the processor, and there are times when users want to understand the actual spare capacity available.
Consider an LPAR running with 4 physical processor cores. When it is running at the nominal frequency F with a 50% computational load, both PURR and SPURR based physical consumption report a value of 2. When the processor is running at a frequency of 0.5F with the same load, PURR based physc will report a value of 4 while SPURR based physc reports 2. The SPURR based report shows additional spare processor core capacity: the processor core is running at a reduced frequency, and bringing it back to the rated frequency would provide additional capacity for the system. This information is not provided by PURR based metrics, so SPURR based metrics are used extensively for accounting and capacity planning. The same principle applies when running in turbo mode. Figure-12 shows these behaviors:
Now let's review how the above behavior is represented by the AIX® lparstat command:
The above lparstat snapshot was captured when the processor was running at 50% of its rated frequency (F/2), hence %nsp reports a value of 50.
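The relationship between PURR based consumption, SPURR based consumption and %nsp in the 4-core example can be worked through numerically. The Python sketch below is a simplified model under stated assumptions: the timebase delta is a hypothetical value, the SPURR is modeled as advancing at the PURR rate multiplied by the ratio of current to rated frequency, and the function names are illustrative.

```python
def physc(counter_delta, timebase_delta):
    """Physical cores consumed, from a PURR or SPURR delta summed over the LPAR."""
    return counter_delta / timebase_delta


def nsp_percent(spurr_delta, purr_delta):
    """SPURR-to-PURR ratio as a percentage (current speed relative to rated)."""
    return 100.0 * spurr_delta / purr_delta


TIMEBASE_DELTA = 1_000_000   # hypothetical ticks elapsed in the interval
CORES = 4
FREQ_RATIO = 0.5             # processor running at 0.5F, as in the text

# At 0.5F the 50% load keeps all four cores busy for the whole interval.
purr_delta = CORES * TIMEBASE_DELTA
# Simplified model: SPURR advances at FREQ_RATIO times the PURR rate.
spurr_delta = purr_delta * FREQ_RATIO

purr_physc = physc(purr_delta, TIMEBASE_DELTA)
spurr_physc = physc(spurr_delta, TIMEBASE_DELTA)
nsp = nsp_percent(spurr_delta, purr_delta)

print(f"PURR based physc:  {purr_physc:.1f}")
print(f"SPURR based physc: {spurr_physc:.1f}")
print(f"%nsp:              {nsp:.0f}")
```

Under these assumptions the PURR view shows the partition as fully consumed at the current frequency, while the SPURR view shows that half of the rated capacity is still unused.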
With the IBM® POWER7™ processor architecture, the processor core supports four hardware thread contexts, which means four threads' instruction streams can use a core concurrently. POWER7™ also gives the user the flexibility to choose whether to run a workload on a core with one hardware thread enabled (ST), with two (SMT2) or with four (SMT4). The PURR was improved for POWER7™ processors in order to measure the core's utilization levels more accurately. So the obvious question is why the improvement in the PURR was required, and what gap exists in POWER5™ or POWER6™ utilization measurement. Let's review the example provided earlier related to "myapp":
Single Instance in the Core
Two Instances in the Core
When two instances of "myapp" are running on the same core, one instance takes 0.33 seconds more to complete. Going by simple mathematics, if one instance of "myapp" gets only 50% of the core's time, then an instance that needs one second of core time should have taken 2.00s to complete 2 transactions. Instead, "myapp" is able to complete two transactions within 1.33s. This really means that when only one instance of "myapp" is running there is some additional spare capacity available, which eventually gets used when an additional instance of "myapp" is invoked. This leads to a gap in understanding the spare processor capacity available for workloads. The following section provides details about processor utilization reporting in POWER7™ processor based systems.
Understanding Processor Utilization in POWER7™ processor based servers
On POWER7™ the PURR computation is improved to provide a better picture of the spare capacity available in the core. This is one of the significant changes made to provide a better view of processor utilization and available capacity. Hence the utilization numbers will look different from those of POWER5™ or POWER6™ processors, and it is important to understand the differences.
As mentioned earlier, POWER7™ processors can run in ST, SMT2 and SMT4 modes. Applications that are single process and single threaded may benefit from running in ST mode, while multi-threaded and/or multi process applications typically benefit more from running in SMT2 or SMT4 mode. ST mode can be beneficial in the case of a multi process application where the number of application processes is smaller than the number of cores assigned to the partition. Applications that do not scale with a larger number of CPUs may also benefit from running in SMT2 or ST mode instead of SMT4, since a lower number of SMT threads means a lower number of logical CPUs.
ST, SMT2 or SMT4 mode can be set through the smtctl command in AIX®. The default mode is SMT4. Let's take our "myapp" example again to understand how the utilization is represented in a much better fashion. Let's run a single instance of the "myapp" application under the following scenarios:
Scenario-1: ST mode
Scenario-2: SMT2 mode
Scenario-3: SMT4 mode
For simplicity, let's take only one core.
In Scenario-1 (ST mode), there is only one hardware thread context and hence the behavior is similar to a uniprocessor architecture, as only one instance of "myapp" can run on the core. Since the core can take only one thread's instruction stream at a time, the time and sar command output would look as follows:
In Scenario-2 (SMT2 mode), the core is running with two hardware thread contexts. With the improved PURR computation, AIX® is able to identify the actual capacity used by "myapp" relative to the capacity made available in the core via two hardware thread contexts. Now let's see the output of the AIX® time and sar commands, which looks as follows:
The above output shows that cpu0 (say Htc0) is 100% utilized by the "myapp" application, but relative to the available resources in the core (Htc0, Htc1) "myapp" utilized only 69% of the resources (0.69 of a physical processor core used, as represented by physc). This provides a clear indication of the spare processor capacity which can be made available for additional workload. One more important difference is in the output of the time command, which is circled in red: the time command output shows the "user" time to be 0.69 seconds. Earlier, in the explanation of the time command output, it was stated that the sum of "user" and "sys" represents the amount of time the processor core is busy executing instructions from the "myapp" application.
Because of the improved PURR computation, the "user" and "sys" values are now also more accurate. One more important factor to remember is that the processor busy time is always relative to the capacity the processor possesses. Hence, if the processor capacity is increased, the processor busy time for the same workload may decrease.
In Scenario-3 (SMT4 mode), the core is running with four hardware thread contexts. This means the processor core's capacity is increased further to support the concurrent execution of four tasks. Now let's review the output of the AIX® time and sar commands, which looks as follows:
The above output shows that the spare capacity increased from 31% to 37% as the number of hardware thread contexts increased to four. The spare capacity does not increase linearly with the number of hardware thread contexts, because some resources are shared between these hardware thread contexts. The output of the time command shows the difference in the time that "myapp" is actively using the core's resources.
The above scenarios are provided as examples to describe the difference in the utilization representation between POWER7™ and the earlier families of POWER processors; the actual utilization will vary based on the workload's behavior.
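The three scenarios can be summarized by deriving the spare capacity from the fraction of the core consumed. In the Python sketch below, the SMT2 value of 0.69 is the physc figure quoted above; the ST value of 1.00 is an assumption based on the uniprocessor-like behavior described for that mode, and the SMT4 value of 0.63 is derived from the 37% spare capacity stated in the text. The names are illustrative.

```python
# Fraction of one core consumed by a single "myapp" instance in each mode.
# ST is assumed to be 1.00; SMT4 is derived from the stated 37% spare capacity.
physc_by_mode = {"ST": 1.00, "SMT2": 0.69, "SMT4": 0.63}


def spare_capacity_pct(physc, cores=1):
    """Spare capacity as a percentage of the cores available."""
    return (cores - physc) * 100.0 / cores


spare_by_mode = {mode: spare_capacity_pct(p) for mode, p in physc_by_mode.items()}

for mode, spare in spare_by_mode.items():
    print(f"{mode}: physc = {physc_by_mode[mode]:.2f}, spare = {spare:.0f}%")
```

As the text notes, these figures are specific to the "myapp" illustration and will differ for real workloads.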
The SMT processors, POWER5™ through POWER7™, offer an internal mechanism for tracking the relative usage of each core. POWER7™ took it a step further and tuned this measurement. The general intent is to provide a measure of CPU utilization in which there is a linear relationship between the current throughput (e.g., transactions per second) and the CPU utilization measured for that level of throughput. For example, a throughput of 100,000 transactions/second at 50% utilization should imply that at 100% utilization the throughput can reach 200,000 transactions/second, assuming that the workload will not encounter scaling issues unrelated to CPU capacity.
The POWER7™ utilization measurement represents the desired near linear relationship between throughput and CPU utilization for a number of typical workloads. However, for atypical workloads seeing either little or a very large benefit from SMT4, the desired linear relationship can become nonlinear. The utilization representation in POWER7™ enables users to do more accurate capacity planning and sizing.
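The intended linear relationship can be expressed as a one-line projection. The Python sketch below reproduces the 100,000 transactions/second example; the function name is an illustrative assumption, and the projection holds only under the text's caveat that the workload has no scaling issues unrelated to CPU capacity.

```python
def projected_max_throughput(current_tps, utilization_pct):
    """Linear projection of the throughput reachable at 100% CPU utilization."""
    if not 0 < utilization_pct <= 100:
        raise ValueError("utilization must be in the range (0, 100]")
    return current_tps * 100.0 / utilization_pct


projected = projected_max_throughput(100_000, 50)
print(f"projected throughput at 100% utilization: {projected:,.0f} tps")
```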
The subsequent section is going to provide insights on understanding processor utilization when the core is shared between logical partitions, basically in a Micro-Partitioned environment.
Understanding Processor Utilization in Micro-partitioned environment
In a micro-partitioned environment the processor core can be shared between various logical partitions. Each logical partition can get a minimum of 1/10th of a processor core, so a logical partition can run with fractions of a core allocated to it. The fraction of processor cores allotted to a partition is called its "Entitlement". A logical partition can be capped to run with the entitled fraction of cores, or it can run in uncapped mode, where it can get additional fractions of cores based on need and availability.
Let's take a simple example, where a logical partition is configured to run in capped mode with an entitlement of 0.1 physical core and SMT disabled. This means only 10% of the core's resources will be provided to this partition. Let's start the "myapp" application and check the output of time & sar commands:
The above report shows that the metric "real", the elapsed wall clock time to complete "myapp", is 10s. The reason is that only 10% of the core's resources are provided to the partition; as a result, the application took more time to complete. The "user" metric still reports 1s, as the core is actively executing instructions from "myapp" for only one second. So the obvious question is what the core was doing for the remaining 9 seconds. As mentioned earlier, the core is shared between various other logical partitions, so it is possible that the core was actively executing instructions for other logical partitions. Also, the sar command output is taken at a one second interval; in every second the processor core is 100% busy relative to the 10% of the processor allotted to this partition, hence the "physc" column reported 0.1 processor core consumed.
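The capped-partition arithmetic can be sketched as follows. This Python snippet is a simplified model, not a description of the hypervisor's dispatching: it assumes the partition receives exactly its entitlement in every interval and that "myapp" is the only work in the partition. The function name is illustrative.

```python
def elapsed_seconds(core_seconds_needed, entitlement):
    """Wall clock time for a capped partition that always gets its entitlement."""
    if entitlement <= 0:
        raise ValueError("entitlement must be positive")
    return core_seconds_needed / entitlement


CORE_SECONDS_NEEDED = 1.0   # "myapp" needs one second of core time ("user")
ENTITLEMENT = 0.1           # capped at 0.1 physical core

real = elapsed_seconds(CORE_SECONDS_NEEDED, ENTITLEMENT)
physc_per_interval = CORE_SECONDS_NEEDED / real   # core fraction used each second

print(f"real  = {real:.0f}s")
print(f"user  = {CORE_SECONDS_NEEDED:.0f}s")
print(f"physc = {physc_per_interval:.1f}")
```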
Understanding Processor Utilization with Energy Saving Capabilities
Beginning with the IBM® POWER6™ family of servers, an Energy Saving capability was introduced. With this feature, the user has the ability to cap the energy consumption of the processor resources. In POWER7™ processor based servers, Dynamic Power Save mode was introduced, through which the energy consumption can be automatically scaled up or down based on the processor resource utilization, the thermal levels, and whether the user would like to optimize for power savings or for performance. The energy consumption of the processor resources is controlled by varying the frequency of the processor cores. These capabilities demanded a new form of reporting in which the user can see the current running frequency of the processor as well as the utilization based on this frequency and on the rated frequency.
In AIX® 5.3 TL11 and 6.1 TL04 a new reporting capability was introduced in the lparstat utility. The new report provides both PURR and SPURR based utilization along with the current running frequency of the processor. Following is a sample output of the lparstat utility.
The report shows that the system is running with 128 logical CPUs with SMT4 enabled, which means the total number of available physical cores is 32. The new report provides both Actual and Normalized (rated frequency) views. The actual metrics use the PURR counters and the normalized metrics use the SPURR counters. The values shown in each mode are the actual physical processors used in that mode. Adding up all the values (user, sys, idle, wait) will equal the total entitlement of the partition in both the actual and the normalized view. In shared uncapped mode, however, this need not be true, as the actual processor consumption can exceed the entitlement; in that case the sum of these values might not equal the entitlement. Also, the idle value has been modified to show the actual entitlement available, so the values shown in this view should not be compared with the default views of lparstat. The idle value shown here is the available capacity.
When the partition is running at a reduced frequency, the available capacity (idle) shown by the two counters is different. The current idle capacity is shown by the PURR. The idle value shown by the SPURR is approximately what the idle capacity would be if the processor core were running at the rated frequency.
idle = Entitlement - ( user + sys + wait )
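The idle formula can be illustrated with a small Python sketch. The entitlement of 32 cores matches the partition described above, but the user, sys and wait figures are hypothetical values chosen for illustration only; they are not taken from the lparstat sample, and the function name is an assumption.

```python
def idle_capacity(entitlement, user, system, wait):
    """idle = Entitlement - (user + sys + wait), all in physical processors."""
    return entitlement - (user + system + wait)


ENTITLEMENT = 32.0   # physical cores entitled to the partition

# Hypothetical consumption figures, in physical processors.
actual_idle = idle_capacity(ENTITLEMENT, user=10.5, system=1.5, wait=0.0)

print(f"idle capacity: {actual_idle:.2f} physical processors")
```

The same function applies to both views: feeding it PURR based values gives the idle capacity at the current frequency, and feeding it SPURR based values gives the approximate idle capacity at the rated frequency.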