About this series
This three-part series focuses on the various aspects of Central Processing Unit (CPU) performance and monitoring. The first installment of the series provides an overview of how to efficiently monitor your CPU, discusses the methodology for performance tuning, and covers considerations that can impact performance, either positively or negatively. Though the first part of the series goes through some commands, the second installment focuses in much more detail on actual CPU monitoring, along with analyzing trends and results. The third installment focuses on proactively controlling thread usage and other ways to tune your CPU to maximize performance. Throughout this series, I'll also expound on various best practices of AIX® CPU performance tuning and monitoring.
As an AIX administrator, you should already know some of the basics of performance tuning. You probably already use certain commands, such as topas, and are familiar with ways of identifying processes that are CPU hogs. What you might not know is that CPU performance tuning is not only about running some commands; it is about proactively monitoring your systems, particularly when there are no performance problems. This article covers the methodology for CPU performance tuning and provides you with time-tested steps that assist you throughout your tuning process. Throughout this article, I'll introduce some of the monitoring tools you might wish to use, provide you with an overview of the POWER chip, and discuss considerations that can impact performance.
Performance in a virtualized environment provides challenges to even the most senior of administrators, so I'll also go over specific issues in a virtualized environment, including Simultaneous Multi-threading (SMT), virtual processors, and the POWER Hypervisor. I'll also talk about areas to focus on when tuning the CPU, including tuning the scheduler, balancing system workload, and changing the scheduler algorithm to fine-tune priority formulas.

When investigating a performance problem, start by monitoring CPU utilization statistics. It is important to observe system performance continuously, because you need to compare the loaded system data with normal usage data, which is the baseline. Because the CPU is one of the fastest components of the system, a workload that keeps the CPU 100 percent busy also affects system-wide performance. If you discover that the system keeps the CPU 100 percent busy, you need to investigate the process that causes this situation; AIX provides many trace and profiling tools for both the system and individual processes. In a system that is CPU-bound, all the processors are 100 percent busy and some jobs might be waiting for CPU time in the run queue. Generally speaking, a system has an excellent chance of being CPU-bound if the CPU is 100 percent busy, the run queue is large compared to the number of CPUs, and there are more context switches than usual. That is the quick and dirty, and I'm sure you'll find that there is a lot more.
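That rule of thumb (CPU near 100 percent busy with a run queue larger than the CPU count) can be turned into a quick scripted check. The sketch below runs awk over vmstat-style samples; the column layout, sample numbers, and the 95 percent threshold are illustrative rather than literal AIX output:

```shell
#!/bin/sh
# Heuristic CPU-bound check against vmstat-style samples: flag any interval
# where user+system time is near 100 percent and the run queue (r) exceeds
# the number of CPUs. The sample data below is illustrative; on a live
# system you would pipe `vmstat 5 12` through the same awk script.
NCPUS=4
awk -v ncpus="$NCPUS" 'NR > 1 {
    busy = $2 + $3
    if (busy >= 95 && $1 > ncpus)
        printf "possible CPU bound: runq=%d busy=%d%%\n", $1, busy
}' <<'EOF'
r  us sy id wa
2  20 10 65  5
9  85 15  0  0
10 90 10  0  0
EOF
```

Only the second and third intervals are flagged: the first is busy at the user level but has a short run queue, which by itself is not a CPU bottleneck.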
In this section, let's take a look at several methodologies for AIX tuning.
Establishing a baseline
Before you tune or even start monitoring, you must establish a baseline. The baseline is a snapshot of what the system looks like when it is performing well. This baseline should not only capture performance-type statistics, but it should also document the actual configuration of your system (amount of memory, CPU, and disk). If you don't document your system configuration, you might not be comparing apples to apples. This is particularly important in the partitioned world, where you can use Dynamic Logical Partitioning (DLPAR) to change the configuration at a moment's notice. To come up with a proper baseline, you need to identify the tools to use for monitoring. There are many tools that you can use in AIX 5.3, some of which are more specific to a partitioned and virtualized environment (for example, lparstat and mpstat). Some of the more generic tools that are generally available on all flavors of UNIX® include vmstat, sar, and ps. Some of the AIX-specific utilities include topas, procmon, and some semi-supported tools such as nmon. Once you have identified your monitoring tools, you need to gather your statistics and performance measurements. This helps you define what an acceptable level of performance is for a given system. To reiterate, the time to start tracking problems is prior to receiving that dreaded phone call. You need to know what a well-performing system looks like. You should also work with the appropriate application and functional teams to define what exactly a well-behaved system is. You would then translate that into a Service Level Agreement (SLA), which the customer would sign off on.
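As a sketch of what capturing a baseline can look like, the portable script below pairs configuration facts with a load sample in a dated file. The BASEDIR path and file layout are hypothetical choices, not an AIX convention; on AIX you would also capture output from commands such as prtconf and lsdev:

```shell
#!/bin/sh
# Minimal baseline snapshot: record system configuration next to a load
# sample so later comparisons are apples to apples. BASEDIR and the file
# layout are hypothetical; on AIX, add prtconf/lsdev/lsattr output here.
BASEDIR=${BASEDIR:-/tmp/baseline}
STAMP=$(date +%Y%m%d)
mkdir -p "$BASEDIR"
{
    echo "== configuration =="
    uname -a
    echo "cpus: $(getconf _NPROCESSORS_ONLN 2>/dev/null || echo unknown)"
    echo "== load sample =="
    uptime
} > "$BASEDIR/baseline.$STAMP"
echo "baseline written to $BASEDIR/baseline.$STAMP"
```

Run from cron during a known-good period, a file like this gives you something concrete to diff against when the dreaded phone call comes.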
Stress testing and monitoring
The second step of the methodology is the stress testing and monitoring piece. What you would be doing here is monitoring the systems at peak workloads and during problem periods. This helps you determine exactly what is wrong with the system. Is the bottleneck really a CPU bottleneck, or is it more memory or I/O related? I like to use several monitoring tools here to help validate my findings. I might use an interactive tool, such as vmstat, and then a capturing tool, such as nmon, to help me track data historically. The monitoring section is critical, because you cannot effectively tune anything without an accurate historical record of what has been going on in your system, particularly during periods of stress. It is important here to establish performance policies for the system: determine the measures that are relevant during monitoring, analyze them historically, and then examine them further during stress testing.
Identification of bottleneck
The objective of stressing and monitoring the system is to determine the bottleneck. You cannot provide the correct medicine without the proper diagnosis. If the system is in fact CPU bound, you can run additional tools, such as trace, curt, splat, tprof, and ps, to further identify the actual processes that are causing the bottleneck. It's possible that your system might actually be memory or I/O bound and not CPU bound. Fixing one of those bottlenecks might then expose a CPU bottleneck, because the system now allows work to reach the CPU at its optimum rate, and the CPU might not have the capacity to handle the increased load. I've seen this situation often, and it is not necessarily a bad thing. Quite the opposite, it ultimately helps you isolate all your bottlenecks. You will find that monitoring and tuning systems is quite dynamic and not always predictable. That's what makes performance tuning as challenging as it is.
Finally, after you've identified the bottleneck, it is time to tune it. For a CPU bottleneck, that usually means one of four solutions:
- Balancing system workload -- Running processes at different intervals to more efficiently utilize the 24-hour day.
- Tuning scheduler using nice or renice -- This helps you to assign different priorities to running processes to prevent CPU hogs.
- Tuning scheduler algorithm using schedo to fine-tune priority formulas -- You can tune various scheduler parameters in AIX using schedo. For example, the schedo command can be used to change the amount of time the operating system allows a given process to run before the dispatcher is called to choose another process to run (the time slice). The default value for this interval is a single clock tick (10 milliseconds). The timeslice tuning parameter lets you specify the number of clock ticks by which the time slice length is to be increased.
Listing 1. Time slice tuning parameter
# schedo -a | grep timeslice
      timeslice = 1
- Increasing resources -- Adding more CPUs or, in a virtualized environment, reconfiguring your logical partitions (LPARs). This might include uncapping partitions or adding more virtual processors to your existing partitions. Virtualizing your partitioned environment appropriately can help increase physical resource utilization, decrease CPU bottlenecks on specific LPARs, and reduce the expense of idle capacity in LPARs that are not breathing heavy.
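To illustrate the nice and renice option above, this minimal sketch starts a throwaway background job at a lower priority and reads back its nice value; the renice invocation shown in the comment is the equivalent for a process that is already running:

```shell
#!/bin/sh
# Start a background job at nice 10 (a lower priority) and confirm it took.
# For an already-running CPU hog you would use: renice -n 15 -p <pid>
nice -n 10 sleep 30 &
pid=$!
ps -o ni= -p "$pid"    # prints the job's nice value (10)
kill "$pid"
```

Raising a hog's nice value this way keeps it from starving interactive work without stopping it outright.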
Now you have to go through this process again, starting from the second step in the Stress testing and monitoring section. Only by repeating your tests and consistently monitoring your systems can you determine whether your tuning has really made an impact. I know some administrators who just tune certain parameters based on best practices for a specific application and then just move on. That is the worst thing that you can do. For one thing, what works in some environments might not work in yours. More importantly, how do you really know whether what you've tuned has helped the bottleneck without looking at the data? To reiterate, AIX performance tuning is a dynamic process and, to achieve real success, you need to consistently monitor your systems, which can only come after a baseline and an SLA have been established. If you cannot define the behavior of a system that runs well, how will you define the behavior of one that doesn't? And the way to do that is not to wait for the dreaded phone call I spoke about earlier.
The POWER architecture name stands for Performance Optimization With Enhanced RISC, and it is the processor used by IBM midrange servers today. It is a descendant of the 801 CPU and is a second-generation RISC-based processor. It was first introduced in 1990 to support UNIX RS/6000® systems. The POWER4 architecture, released in 2001, was the first 64-bit symmetric multiprocessor. It became the driving force behind the IBM Regatta servers, which allowed for logical partitioning. The POWER5 architecture, introduced in 2003, contained 276 million transistors per processor. It was based on the 130 nanometer copper and silicon-on-insulator (SOI) process and featured:
- Chip multiprocessing
- A larger cache
- A memory controller on the chip
- Advanced power management
- Improved hypervisor technology
The POWER5 was built to allow up to 254 LPARs and was available on both the pSeries® and iSeries™ servers. This dual-core processor with SMT technology is fabricated using SOI devices and copper interconnects. SOI technology is used to reduce device capacitance and increase transistor performance. The POWER5 is actually IBM's second generation of dual-core microprocessor chips and provides new and improved functions for more granular and flexible partitioning. Further, it uses Dual Chip Modules (DCMs) and Multi-Chip Modules (MCMs) as the basic building blocks for its mid-range and high-end servers, respectively.
Some of the more important innovations on the POWER5 chip include:
- Enhanced memory subsystem
  - Improved L1 cache design with a new replacement algorithm (LRU versus FIFO)
  - Larger L2 cache (1.9 MB, 10-way set associative)
  - Improved L3 cache design, which satisfies L2 cache misses more frequently and avoids traffic on the inter-chip fabric
  - On-chip L3 directory and memory controller; the on-chip L3 directory reduces off-chip delays after an L2 miss
  - Improved pre-fetch algorithms
- Enhanced performance
- Hardware support for Micro-Partitioning
Perhaps the most important innovations with the POWER5 processor include support for Micro-Partitioning and SMT, which also requires the support of AIX 5L Version 5.3. Micro-Partitioning provides the ability to share a single processor between multiple partitions. These partitions are called shared processor partitions. Of course, POWER5-based systems continue to support partitions with dedicated processors that don't share a single physical processor with other partitions.
In a shared-partition environment, the POWER Hypervisor schedules and distributes processor entitlement to shared partitions from a set of physical processors. The physical processor set is called the shared processor pool. Processor entitlement is distributed with each turn of the hypervisor's dispatch wheel. During each turn, the partition consumes or cedes the given processor entitlement. Figure 1 shows a sample of shared and dedicated partitions in a micro-partitioned environment.
Figure 1. Sample of shared and dedicated partitions in a micro-partitioned environment
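To make the dispatch wheel concrete, here is a small worked example. The 10 ms wheel length is the POWER5 Hypervisor's dispatch interval; the 0.5 entitlement is an assumed example value:

```shell
#!/bin/sh
# Entitlement arithmetic: the POWER5 Hypervisor dispatch wheel is 10 ms,
# so a shared partition entitled to 0.5 processing units receives up to
# 5 ms of physical processor time per turn of the wheel. The 0.5
# entitlement here is an assumed example value.
awk 'BEGIN {
    wheel_ms    = 10       # hypervisor dispatch wheel length, milliseconds
    entitlement = 0.5      # entitled capacity, in processing units
    printf "dispatch window per wheel turn: %.1f ms\n", wheel_ms * entitlement
}'
```

If the partition does not use its full window, it cedes the remainder back to the Hypervisor for other work.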
SMT allows a single physical processor to concurrently dispatch instructions from more than one hardware thread. In AIX 5L Version 5.3, a dedicated partition created with one physical processor is actually configured as a logical two-way by default. In essence, two hardware threads can run on one physical processor at the same time. While there are isolated situations where turning on SMT can impact performance negatively, SMT is almost always the best choice, particularly when overall throughput is more important than the throughput of an individual thread. As a result of the POWER5's dual-core design and support for SMT, one POWER5 chip actually appears as a four-way microprocessor to the operating system. Processors using SMT can issue multiple instructions from different code paths during a single cycle. Figure 2, an illustration of a DCM, clearly shows the relationship between SMT and the chip itself.
Figure 2. Dual chip module
Getting threads to run on different CPUs allows for effective utilization of SMT. With the system in SMT mode, the processor fetches instructions from more than one thread. The premise behind SMT is that no single thread can keep all of the processor's execution units busy at the same time. The POWER5 design implements two-way SMT on each of the chip's cores, so two logical processors represent each physical processor core. The greatest benefits of SMT occur in commercial environments where the speed of an individual transaction is less important than the total number of transactions performed. In addition, SMT increases the throughput of workloads with large working sets, such as database and Web servers. Generally, you can expect to see approximately a 30 percent increase in system performance due to SMT.
How does the SMT implementation relate to the AIX scheduler? Because the kernel sees the two hardware threads as separate logical processors, in a four-way partition a naive scheduler could place two running processes on the two hardware threads of the same processor core, leaving the other processor core idle. Because AIX is multithreading-aware, it can distinguish between threads on the same or different processors, and the scheduler dispatches threads to a core's primary thread before dispatching to its secondary thread. When SMT is enabled, the hardware can dynamically switch between single-threaded and SMT mode on dedicated partitions. On shared partitions, you can switch modes manually using smtctl. To view the processors, use the bindprocessor command (see Listing 2).
Listing 2. View the processors
//To view all processors (logical and physical):
# bindprocessor -q
The available processors are: 0 1 2 3

//To view the physical processors:
# bindprocessor -s 0
The available processors are: 0 2

//To view the SMT-enabled processors:
# bindprocessor -s 1
The available processors are: 1 3
Hypervisor and virtual partitions
The technology behind the virtualization of the IBM p5 systems is a piece of firmware known as the POWER Hypervisor, which resides in flash memory. This firmware performs the initialization and configuration of the POWER5 processor, as well as the virtualization support required to run up to 254 partitions concurrently on the IBM p5 servers. The POWER Hypervisor uses some system processor and memory resources. The impact on performance is relatively minor for most workloads, but the impact increases with extensive amounts of page-mapping activity. There is nothing you can really tune as far as the Hypervisor is concerned. In earlier versions, there used to be a concern about limiting the number of virtual processors when uncapping partitions, due to the overhead involved in utilizing virtual processors. Starting with AIX 5.3 ML3, AIX introduced virtual processor folding, which allows idle virtual processors to sleep and to awaken only when required to meet workload demand. The entitlement for these virtual processors is then redistributed, on an as-needed basis, to other virtual processors on client partitions in the shared processor pool. The tunable is vpm_xvcpus; it is changed with schedo and is enabled by default.
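A quick way to inspect the folding tunable is sketched below. Because schedo exists only on AIX, the script guards the call and falls back to a message elsewhere; the -1 value mentioned in the comment is the documented way to disable folding:

```shell
#!/bin/sh
# Inspect the virtual processor folding tunable. schedo is AIX-only, so
# guard the call; on other systems the script simply reports that fact.
if command -v schedo >/dev/null 2>&1; then
    schedo -o vpm_xvcpus        # 0 = folding enabled (the default)
    # schedo -o vpm_xvcpus=-1   # would disable virtual processor folding
else
    echo "vpm_xvcpus is an AIX schedo tunable; schedo not found here"
fi
```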
In a shared-partition environment, you need to understand that there can be unused time within each partition's entitled processor capacity. When a virtual processor or SMT thread becomes idle, it cedes its processor cycles to the Hypervisor, which can then dispatch those unused cycles for other work. To collect CPU utilization at the processor thread level in an SMT environment, the POWER5 architecture implements a new register called the Processor Utilization Resource Register (PURR). Each hardware thread has its own PURR. The units are the same as those of the time base register, and the sum of the PURR values for both threads equals the elapsed time base value. More traditional methods of measuring processor utilization tend to yield incorrect results in an SMT or shared processor LPAR (SPLPAR) environment, which is why the PURR provides a more accurate and realistic measure of processor utilization.
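The PURR relationship lends itself to a small worked example. The tick counts below are invented, but the invariant is the one just described: the two threads' PURR increments sum to the time base increment, so each thread's share of the core is its PURR delta divided by the time base delta:

```shell
#!/bin/sh
# PURR arithmetic sketch: over one interval, the two hardware threads' PURR
# increments sum to the time base increment, so each thread's share of the
# physical core is delta_PURR / delta_timebase. The tick counts are made up.
awk 'BEGIN {
    tb_delta = 1000000     # time base ticks elapsed in the interval
    purr_t0  =  650000     # PURR ticks charged to hardware thread 0
    purr_t1  =  350000     # PURR ticks charged to hardware thread 1
    printf "thread0 share: %.0f%%\n", 100 * purr_t0 / tb_delta
    printf "thread1 share: %.0f%%\n", 100 * purr_t1 / tb_delta
}'
```

Note that the two shares total 100 percent of the core, which is exactly why per-thread "percent busy" numbers from pre-PURR tools mislead under SMT.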
Due to SMT, Micro-Partitioning, and the ability to change some parameters dynamically, it was necessary to make some changes to the old tools. If SMT is enabled, or in a Micro-Partitioning environment, the sar command automatically uses the new PURR-based data. In AIX 5L Version 5.3, the lparstat command displays statistical data about many POWER Hypervisor calls. Using the -h flag adds summary POWER Hypervisor statistics to the default lparstat output (see Listing 3).
Listing 3. lparstat output with the -h flag
# lparstat -h 1 5

System configuration: type=Dedicated mode=Capped smt=On lcpu=4 mem=3920

%user  %sys  %wait  %idle  %hypv  hcalls
-----  ----  -----  -----  -----  ------
  0.0   0.7    0.0   99.3   44.4  5933918
  0.4   0.3    0.0   99.3   44.9  5898086
  0.0   0.1    0.0   99.9   45.1  5930473
  0.0   0.1    0.0   99.9   44.6  5931287
  0.0   0.1    0.0   99.9   44.6  5931274
#
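Interval output like Listing 3 is easy to summarize with a script. This sketch averages the %hypv column of the sample data above; the same pipeline would work on live lparstat -h output once the header lines are stripped:

```shell
#!/bin/sh
# Average the %hypv column (field 5) of the lparstat sample in Listing 3.
# The same pipeline works on live `lparstat -h <interval> <count>` output
# after the header lines are stripped.
awk '{ sum += $5 } END { printf "average %%hypv: %.1f\n", sum / NR }' <<'EOF'
 0.0  0.7  0.0  99.3  44.4  5933918
 0.4  0.3  0.0  99.3  44.9  5898086
 0.0  0.1  0.0  99.9  45.1  5930473
 0.0  0.1  0.0  99.9  44.6  5931287
 0.0  0.1  0.0  99.9  44.6  5931274
EOF
```

Tracking an average like this over time shows whether Hypervisor activity is trending up, even when each individual interval looks unremarkable.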
Performance tuning is one of the most challenging aspects of systems administration. Before you can start tuning systems, you must understand and follow a performance tuning methodology, which consists of baselining your systems, monitoring, and performing effective stress testing. System p™ servers contain powerful new features to help you tune your CPU subsystem, driven by AIX 5.3 and the POWER5 architecture. I've discussed some of the virtualization features of POWER5, including Micro-Partitioning and the Hypervisor. Many commands have been enhanced to provide for the virtualization and hypervisor functions of the POWER5 architecture. Part 1 of this series also introduced some of the commands and utilities used to monitor and tune performance. In subsequent installments, I'll describe these same utilities in detail as you attempt to determine your bottleneck and tune your servers.
- Optimizing AIX 5L performance: Check out other parts in this series.
- High-Performance Architecture with a History: Read this paper for a brief description of PowerPC architecture.
- "Processor Affinity on AIX" (developerWorks, November 2006): Using process affinity settings to bind or unbind threads can help you find the root cause of troublesome hang or deadlock problems. Read this article to learn how to use processor affinity to restrict a process and run it only on a specified central processing unit (CPU).
- Check out other articles and tutorials written by Ken Milberg
- AIX and UNIX: The AIX and UNIX developerWorks zone provides a wealth of information relating to all aspects of AIX systems administration and expanding your UNIX skills.
- Safari bookstore: Visit this e-reference library to find specific technical resources.
- Future Tech: Visit Future Tech's site to learn more about their latest offerings.
Get products and technologies
- IBM trial software: Build your next development project with software for download directly from developerWorks.
- Participate in the developerWorks blogs and get involved in the developerWorks community.
- Participate in the AIX and UNIX forums