The basics of IBM i Wait Accounting
Wait Accounting is the patented technology built into the IBM® i operating system that tells you what a thread or task is doing when it appears that it is not doing anything.
When a thread or task is not executing, it is waiting. Wait accounting, a concept exclusive to IBM i, is a very powerful capability for detailed performance analysis. The following information focuses on waiting: why threads wait, and how you can use wait accounting to troubleshoot performance problems or simply to improve the performance of your applications.
A job is the basic mechanism through which work is done. Every job has at least one thread and may have multiple threads. Every thread is represented by a licensed internal code (LIC) task, but tasks also exist without the IBM i thread-level structures. LIC tasks are generally not visible externally except through the IBM i Performance or Service Tools. Wait accounting concepts apply to both threads and tasks, so the terms thread and task are both used when referring to an executable piece of work.
A thread or task has two basic states it can be in:
- Executing on the processor. This is the "running" state.
- Waiting to run on the processor.
There are three key wait conditions:
- Ready to run, waiting for the processor. This is a special wait state, generally referred to as "CPU queuing". The thread or task is queued and is waiting to run on the CPU. CPU queuing can occur for several reasons. For example, if the partition is overloaded and there is more work than it can accommodate, work is queued to wait for the CPU. This can be compared to a highway that has ramp meters: when the highway is congested, the ramp meters show a red signal so that cars have to stop and wait before they can enter traffic. Logical partitioning and simultaneous multithreading can also result in CPU queuing.
- Idle waits. Idle waits are a normal and expected wait condition. Idle waits occur when the thread is waiting for external input. This input may come from a user, the network, or another application. Until that input is received, there is no work to be done.
- Blocked waits. Blocked waits result from serialization mechanisms that synchronize access to shared resources. Blocked waits may be normal and expected; examples include serialized access when updating a row in a table, disk I/O operations, and communications I/O operations. However, some blocked waits are not normal or expected, and these unexpected block points are the situations where wait accounting can be used to analyze the wait conditions.
You can think of the lifetime of a thread or a task in a graphical manner, breaking out the time spent running or waiting. This graphical description is called the "run-wait time signature". At a high level, this signature looks as follows:
Traditionally, the focus for improving the performance of an application was to have it use the CPU as efficiently as possible. On IBM i with wait accounting, we can examine the time spent waiting and understand what contributed to that wait time. If there are elements of waiting that can be reduced or eliminated, then the overall performance can also be improved.
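As a rough illustration of the run-wait time signature, the following sketch sums a thread's time by state and reports each state's share of elapsed time. The state labels, the `signature` function, and the sample intervals are hypothetical and made up for illustration; they are not an IBM i interface or real measurements.

```python
from collections import defaultdict

# Hypothetical state labels that follow the categories described above.
STATES = ("running", "cpu_queuing", "idle_wait", "blocked_wait")

def signature(intervals):
    """Return each state's share of total elapsed time.

    intervals: iterable of (state, microseconds) pairs for one thread or task.
    """
    totals = defaultdict(int)
    for state, micros in intervals:
        if state not in STATES:
            raise ValueError(f"unknown state: {state}")
        totals[state] += micros
    elapsed = sum(totals.values())
    if elapsed == 0:
        return {state: 0.0 for state in STATES}
    return {state: totals[state] / elapsed for state in STATES}

# Made-up sample: a thread observed for 1,000 microseconds in total.
sample = [
    ("running", 200),
    ("blocked_wait", 300),
    ("running", 100),
    ("cpu_queuing", 50),
    ("idle_wait", 350),
]
shares = signature(sample)
for state in STATES:
    print(f"{state:13s} {shares[state]:.0%}")
```

In this made-up sample the thread runs for less than a third of its lifetime, so reducing CPU usage alone could only address a minority of the elapsed time.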
Nearly all of the wait conditions in the IBM i operating system have been identified and enumerated; that is, each unique wait point is assigned a numerical value. This is possible because IBM has complete control over both the licensed internal code and the operating system. As of the IBM i 6.1 release, there are 268 unique wait conditions. Keeping track of over 250 unique wait conditions for every thread and task would consume too much storage, so a grouping approach has been used. Each unique wait condition is assigned to one of 32 groups, or "buckets". As threads or tasks go into and out of wait conditions, the task dispatcher maps the wait condition to the appropriate group.
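The grouping idea can be sketched as a lookup table plus per-bucket counters. This is only a conceptual model, not how the task dispatcher is implemented: the wait-condition numbers in `ENUM_TO_BUCKET` are invented, the bucket numbers are assumed for illustration, and using one bucket as a catch-all for unmapped conditions is also an assumption.

```python
from collections import defaultdict

# Hypothetical mapping from wait-condition numbers to bucket numbers.
# The real assignments are defined by IBM and vary from release to release.
ENUM_TO_BUCKET = {
    101: 5,   # assumed: a disk page fault wait
    102: 5,   # assumed: another disk page fault wait
    110: 9,   # assumed: a disk write wait
    150: 11,  # assumed: a journaling wait
}
OTHER_WAITS_BUCKET = 4  # assumed catch-all for conditions not in the table

class WaitAccumulator:
    """Keeps one time total and one count per bucket instead of per condition."""

    def __init__(self):
        self.time_us = defaultdict(int)
        self.count = defaultdict(int)

    def record_wait(self, wait_enum, duration_us):
        """Charge a completed wait to the bucket its condition maps to."""
        bucket = ENUM_TO_BUCKET.get(wait_enum, OTHER_WAITS_BUCKET)
        self.time_us[bucket] += duration_us
        self.count[bucket] += 1
        return bucket

acc = WaitAccumulator()
acc.record_wait(101, 400)
acc.record_wait(102, 100)
acc.record_wait(150, 250)
acc.record_wait(999, 30)  # unmapped condition goes to the catch-all bucket
```

Storing two counters per bucket keeps the per-thread cost fixed at the number of buckets, however many individual wait conditions exist.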
If we take the run-wait time signature, using wait accounting, we can now identify the components that make up the time the thread or task was waiting. For example:
If the thread's wait time was due to reading and writing data to disk, locking records for serialized access, and journaling the data, we could see the waits broken out above. When you understand the types of waits that are involved, you can start to ask yourself some questions. For the example above, some of the questions that could be asked are:
- Are disk reads causing page faults? If so, are my pool sizes appropriate?
- What programs are causing the disk reads and writes? Is there unnecessary I/O that can be reduced or eliminated? Or can the I/O be done asynchronously?
- Is my record locking strategy optimal? Or am I locking records unnecessarily?
- What files are being journaled? Are all the journals required and optimally configured?
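One way to decide which of these questions to ask first is to rank the blocked wait time by bucket. The sketch below assumes you already have per-bucket times for a thread in a dictionary; the `top_blocked_waits` helper, the choice of buckets to exclude, and the sample numbers are all hypothetical.

```python
# Buckets treated as "not blocked" in this sketch (an assumption for illustration).
NOT_BLOCKED = {
    "Time dispatched on a CPU",
    "CPU queuing",
    "Idle/waiting for work",
    "Socket accepts (idle)",
}

def top_blocked_waits(bucket_times_us, limit=3):
    """Return (bucket, microseconds, share of blocked time), largest first."""
    blocked = {
        name: micros
        for name, micros in bucket_times_us.items()
        if name not in NOT_BLOCKED and micros > 0
    }
    total = sum(blocked.values())
    ranked = sorted(blocked.items(), key=lambda item: item[1], reverse=True)
    # ranked is empty when total is 0, so the division below is safe.
    return [(name, micros, micros / total) for name, micros in ranked[:limit]]

# Made-up per-bucket times for one thread.
sample = {
    "Time dispatched on a CPU": 400,
    "Disk page faults": 250,
    "Disk writes": 150,
    "Database record lock contention": 500,
    "Journaling": 100,
    "Idle/waiting for work": 2000,
}
for name, micros, share in top_blocked_waits(sample):
    print(f"{name}: {micros} us ({share:.0%} of blocked time)")
```

With these made-up numbers, record lock contention accounts for half of the blocked time, so the locking strategy would be the first thing to review.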
As of the IBM i 6.1 release, the following are the 32 wait groups or "buckets" that have been defined. The definition of the wait groups varies from release to release and may change in the future.
- Time dispatched on a CPU
- CPU queuing
- Reserved
- Other waits
- Disk page faults
- Disk non-fault reads
- Disk space usage contention
- Disk operation start contention
- Disk writes
- Disk other
- Journaling
- Semaphore contention
- Mutex contention
- Machine level gate serialization - call IBM support
- Seize contention - call IBM support
- Database record lock contention
- Object lock contention
- Ineligible waits
- Main storage pool contention - call IBM support
- Classic Java™ user including locks
- Classic Java JVM
- Classic Java other
- Socket accepts (idle)
- Socket transmits
- Socket receives
- Socket other
- IFS
- PASE
- Data queue receives
- Idle/waiting for work
- Synchronization Token contention
- Abnormal contention - call IBM support
You may see many of these wait groups surface when you do wait analysis on your application. Understanding what your application is doing and why it is waiting in those situations can help you reduce or eliminate unnecessary waits.
- Read
- Update
- Weak
- Transfer
- Check
- Conflict exit
Holders and Waiters
Not only does IBM i keep track of what resource a thread or task is waiting on, it also keeps track of the thread or task that has the resource allocated to it. This is a very powerful feature. A "holder" is the thread or task that is using the serialized resource. A "waiter" is the thread or task that wants access to that serialized resource.
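Holder and waiter pairs can be chained: a holder may itself be waiting on a resource held by another thread. The sketch below follows such links to find the thread at the end of the chain. The `waits_on` dictionary and the thread names are hypothetical; the format in which the performance tools report this data is not shown here.

```python
def find_root_holder(waits_on, start):
    """Follow waiter -> holder links starting from one thread.

    waits_on: dict mapping a waiting thread to the thread holding its resource.
    Returns (chain, is_cycle). is_cycle is True when the chain loops back on
    itself, which would indicate a deadlock among the threads in the chain.
    """
    chain = [start]
    seen = {start}
    current = start
    while current in waits_on:
        current = waits_on[current]
        chain.append(current)
        if current in seen:
            return chain, True
        seen.add(current)
    return chain, False

# Made-up example: T1 waits on T2, which in turn waits on T3.
waits_on = {"T1": "T2", "T2": "T3"}
chain, is_cycle = find_root_holder(waits_on, "T1")
print(" -> ".join(chain), "(cycle)" if is_cycle else "")
```

Here the last thread in the chain, `T3`, is not waiting on anyone, so it is the one whose activity is worth examining first.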
Call Stacks
IBM i also manages call stacks for every thread or task. This is independent of the wait accounting information. The call stack shows the programs that have been invoked and can be very useful in understanding the wait condition, because it reveals some of the logic that led up to either holding a resource or wanting to get access to it. The combination of holder, waiter, and call stacks provides a very powerful capability to analyze wait conditions.
Collecting and Analyzing the Data
Collection Services and Job Watcher are two performance data collection mechanisms on IBM i that collect the wait accounting information. Job Watcher also collects holder and waiter information, as well as call stacks. Once the performance data has been collected, you can analyze it graphically. The iDoctor product has a Windows client for graphically viewing performance data. In IBM i 6.1, the IBM Navigator for i web console also has the "Investigate Data" feature for graphically viewing performance data through a web browser interface.