Processor utilization difference between IBM AIX and Linux on Power – apples-to-apples comparison
For identical hardware and workload configurations (an apples-to-apples comparison), the processor utilization reported by Linux on Power is much lower than the processor utilization reported by AIX on Power. This article explains why utilization cannot be compared directly between the two operating systems, how each operating system derives its utilization metrics, the intention behind each calculation, and the OS-level factors that cause the difference.
Note: This article is not intended to cover the merits or demerits of either operating system. It covers only why processor utilization differs between the two operating systems on the IBM Power platform.
This study was done on an IBM POWER7® processor-based system running the IBM PowerVM® hypervisor, using a workload (similar to the open source Netperf) that is used for network performance analysis on IBM Power hardware. The case study compares the processor utilization on AIX 7.1 and Red Hat Enterprise Linux (RHEL) 7.0 in an identical hardware and software environment.
- AIX: 7.1
- Linux: Red Hat Enterprise Linux Server 7.0 (BE)
- Processor: POWER7
- Hypervisor: PowerVM
- Core: 8 (SMT4)
- Model: IBM, 9179-MHC
Note: Both logical partitions (LPARs), AIX and RHEL 7.0, are on the same Central Electronics Complex (CEC).
On an eight-core POWER7 processor-based system with an identical hardware configuration (memory, network adapters, workload, and so on), the workload simulates 80 parallel socket connections between the server and the clients. To re-create the scenario, the workload intentionally does not drive all eight cores to 100% processor utilization. AIX reported a processor utilization of 24%, whereas Linux on Power (RHEL 7.0) reported approximately 9%. In other words, Linux reported roughly one-third of the utilization that AIX reported for the identical workload and hardware.
Note: This study is based on raw performance data, with no tuning applied on top of the default installation (for example, no manual binding of workload threads to logical processors of different cores). The observations hold mostly for workloads that do not drive all the available cores to 100% utilization.
Refer to the following network workload results for the identical network load generated on the AIX and Linux systems. There were four virtual partitions on the POWER7 processor-based system: two AIX LPARs and two Linux LPARs. Tests were conducted on the two AIX partitions and the two Linux partitions on the same host.
Workload results on AIX
Figure 1. Processor utilization results on AIX
The network workload ran the standard TCP Round Robin (RR) test with 80 socket connections and 1 byte, 4 byte, and 8 byte message transfer sizes between two partitions for a fixed duration. Figure 1 shows that the server (c700c02) reports a processor utilization of 24% and the client (c700c04) reports 28% for the workload.
Figure 2. Output of SAR for thread scheduling across cores on AIX
Figure 2, which was captured soon after the workload started, demonstrates how the AIX default scheduler distributes threads across cores from the initial stage. Threads are distributed evenly across cores.
Workload results on Linux on Power
Figure 3. Linux on Power processor utilization results
Figure 3 shows the Linux on Power results for the same workload configuration of 1 byte, 2 byte, and 4 byte message transfer size. The server (c700c01) reports 10% and the client (c700c03) reports 7% processor utilization.
Linux on Power reports roughly one-third of the processor utilization that AIX reports on the IBM Power platform.
Figure 4. Output of SAR for thread scheduling across cores on Linux on Power
Based on Figure 4, the default Linux process scheduler in RHEL 7.0 (the Completely Fair Scheduler, CFS) starts scheduling threads on the lower-numbered cores (the first eight available hardware threads) and gradually distributes the load across cores over time.
Processor utilization on AIX
The processor utilization reported on AIX is obtained from the perfstat_cpu_util() interface, which is part of the libperfstat.a library. The reported processor utilization is based on the Processor Utilization Resource Register (PURR) and Scaled Processor Utilization Resource Register (SPURR) calculation.
Figure 5. perfstat library
Figure 6. SMT performance in IBM POWER8
The simultaneous multithreading (SMT) performance characterization shown in Figure 6 is taken from the IBM POWER8™ specification. The figure shows that SMT8 provides 2.2 times the performance of single-threaded mode on POWER8. PURR- and SPURR-based processor utilization was introduced to set a realistic expectation of the maximum performance gain from single-threaded mode to SMT8 (or from single-threaded mode to SMT4 on IBM POWER7) in a multithreaded SMP system. This helps with capital expenditure (CAPEX) planning and with setting actual performance expectations for SMT8 mode relative to single-threaded mode.
Example 1: Consider a single-threaded application running on an IBM POWER7 SMT4 system. The core that runs the application reports a core utilization of approximately 63% to 65%, based on the fraction of core consumed (physc) value and the PURR and SPURR calculation. For more details about how the core utilization comes to 65%, refer to the Understanding CPU Utilization on AIX article.
Figure 7. Core 0 Single-threaded performance
Another factor responsible for the difference in utilization is the OS scheduling policy. AIX has a feature called CPU Folding, which helps application threads to be distributed evenly across cores in SMP systems.
Example 2: Consider a multithreaded application that completes its job with four threads. If this application runs on a four-core POWER7 processor-based system, the AIX default scheduler policy distributes the load evenly across all the available cores and schedules the four threads on four different cores, as shown in Figure 8. Based on the PURR and SPURR explanation above, each of the four cores reports 63% to 65% busy, so the overall system processor utilization is also 63% to 65%.
Figure 8. Multithreaded application scheduling on AIX
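The arithmetic of Example 2 on AIX can be sketched in Python as follows. This is only an illustration under stated assumptions: the constant 0.64 is simply the midpoint of the 63% to 65% range quoted above, the function names are hypothetical, and the sketch does not reproduce how PURR and SPURR are actually accumulated by the hardware. It models only an idle core or a core with exactly one busy hardware thread.

```python
# Assumed share of an SMT4 core credited to a single busy hardware thread.
# Midpoint of the 63% to 65% range quoted in the article; illustrative only.
SINGLE_THREAD_CORE_SHARE = 0.64


def core_utilization_capacity_view(busy_threads):
    """Return the per-core utilization (%) for 0 or 1 busy hardware threads."""
    if busy_threads == 0:
        return 0.0
    if busy_threads == 1:
        return 100.0 * SINGLE_THREAD_CORE_SHARE
    # The article gives no figures for 2 to 4 busy threads, so none are invented here.
    raise ValueError("this sketch only models 0 or 1 busy thread per core")


def system_utilization(busy_threads_per_core):
    """Average the per-core utilization over all cores."""
    values = [core_utilization_capacity_view(n) for n in busy_threads_per_core]
    return sum(values) / len(values)


# Example 2 on AIX: four threads spread over four SMT4 cores, one per core.
aix_example_2 = system_utilization([1, 1, 1, 1])

# For contrast, a plain count of busy hardware threads for the same placement:
# 4 busy threads out of 4 cores x 4 hardware threads.
busy_thread_count_view = 100.0 * 4 / (4 * 4)

print(aix_example_2, busy_thread_count_view)
```

With these assumed inputs, the capacity-oriented view reports about 64% for the system, whereas counting busy hardware threads for the same placement yields 25%.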
Processor utilization on Linux on Power
Processor utilization reported in Linux on Power is mostly based on the /proc/stat values, which are time based. Linux on Power does not yet report a PURR- or SPURR-based calculation.
Now consider Example 1 again. When the application runs on Linux (RHEL), the core running the single-threaded application reports 25% utilization, because only one of the four SMT4 hardware threads is busy, compared with the 63% to 65% utilization reported on AIX.
Now take Example 2, with the multithreaded application running on Linux on Power. Core 0 shows 100% utilization because, by default and without any tuning, the Linux process scheduler in RHEL (the Completely Fair Scheduler, CFS) first tries to schedule all the work on the first eight available hardware threads. The runtime behavior is represented in Figure 9. Because the application has four threads, it completes its task on core 0.
Now, the processor utilization for core 0 is 100% and the utilization for all the other cores (core 1 to core 3) is 0%. Because it is a four-core system, the overall system utilization is (1/4) × 100 = 25%.
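As a rough illustration of what time based means here, the following Python sketch computes a utilization percentage from two samples of the aggregate cpu line of /proc/stat. The function names and the sample values are hypothetical and are not taken from the study; the field order follows the proc(5) manual page. It is a minimal sketch of the general technique, not the exact calculation performed by any particular monitoring tool.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into (busy, total) ticks."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    ticks = [int(value) for value in fields[1:]]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice
    idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)
    # guest and guest_nice are already counted in user and nice,
    # so only the first eight fields are summed for the total.
    total = sum(ticks[:8])
    return total - idle, total


def utilization_percent(before, after):
    """Return the busy percentage between two samples of the 'cpu' line."""
    busy_before, total_before = parse_cpu_line(before)
    busy_after, total_after = parse_cpu_line(after)
    elapsed = total_after - total_before
    if elapsed <= 0:
        return 0.0
    return 100.0 * (busy_after - busy_before) / elapsed


# Hypothetical samples taken some interval apart (values are made up).
sample_before = "cpu  100 0 50 850 0 0 0 0 0 0"
sample_after = "cpu  160 0 80 1760 0 0 0 0 0 0"

print(utilization_percent(sample_before, sample_after))  # 9.0
```

On a live Linux system, the two samples would be the first line of /proc/stat read at the start and at the end of the measurement interval. Every logical processor contributes equally to the totals, regardless of how many hardware threads share a core.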
Figure 9. Multithreaded application scheduling on Linux on Power
Figure 9 depicts the default thread scheduling behavior of Linux on Power during the case study.
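The arithmetic behind Example 1 and Example 2 on Linux can be sketched in Python as follows. The sketch assumes a simplified time-based view in which each logical processor (hardware thread) is either fully busy (1) or fully idle (0) for the whole interval; the helper name and the placement lists are illustrative only.

```python
def time_based_utilization(busy_flags):
    """Average busy fraction (%) over a list of logical processors."""
    return 100.0 * sum(busy_flags) / len(busy_flags)


# Example 1: one software thread on a single SMT4 core.
example_1_core = [1, 0, 0, 0]
example_1 = time_based_utilization(example_1_core)

# Example 2: four software threads packed onto core 0 of a four-core system.
example_2_cores = [
    [1, 1, 1, 1],  # core 0
    [0, 0, 0, 0],  # core 1
    [0, 0, 0, 0],  # core 2
    [0, 0, 0, 0],  # core 3
]
per_core = [time_based_utilization(core) for core in example_2_cores]
example_2_system = sum(per_core) / len(per_core)

print(example_1)         # 25.0
print(per_core)          # [100.0, 0.0, 0.0, 0.0]
print(example_2_system)  # 25.0
```

Both examples come out at 25% in this view, because the calculation treats every hardware thread as an independent unit of capacity.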
The overall system-level processor utilization of AIX and Linux on Power cannot be compared for identical workloads and Power hardware configurations in multicore SMT systems, because the calculations are completely different. The utilization reported in Linux on Power is derived from /proc/stat, which is purely time based. The utilization reported in AIX is based on PURR and SPURR, which is oriented toward CAPEX planning and projects a more realistic and accurate processor utilization for recent processors that combine SMP and SMT.
The actual performance difference between single-threaded mode and SMT8 on POWER8 is 2.2 times. In time-based calculations, the utilization reported for a single thread is projected as if each SMT thread and execution unit were independent of the others. This gives the impression that users can achieve eight times better performance in SMT8 mode, which is not true in practice. This unaccounted factor of 2.2 is one of the reasons why Linux on Power reports lower processor utilization than AIX on Power for certain workloads that do not use 100% of the cores.
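The gap between the two views can be put into numbers with a small Python sketch. It rests on a deliberately simplified assumption that is not stated in this form in the article: a single busy thread delivers a throughput of 1.0 and a fully loaded SMT8 core delivers 2.2, the POWER8 figure quoted above. The function names are hypothetical, and the result is not the value that PURR would report.

```python
def time_based_share(smt_ways):
    """Utilization (%) a time-based view shows for one busy thread per core."""
    return 100.0 / smt_ways


def throughput_share(smt_speedup):
    """Share (%) of the core's peak throughput delivered by one busy thread,
    assuming the full core delivers smt_speedup times the single-thread rate."""
    return 100.0 / smt_speedup


smt8_time_based = time_based_share(8)       # one of eight hardware threads busy
smt8_throughput = throughput_share(2.2)     # 1.0 out of a peak of 2.2

# Headroom implied by each view when going from one thread to a full core.
implied_headroom_time_based = 100.0 / smt8_time_based   # suggests 8 times
implied_headroom_throughput = 100.0 / smt8_throughput   # about 2.2 times

print(smt8_time_based, round(smt8_throughput, 1))
```

Under these assumptions, the time-based view shows 12.5% for a core that is already delivering roughly 45% of its peak throughput, which is the kind of understatement that the article attributes to time-based reporting.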