Troubleshooting
Problem
This document provides high-level guidance on the top things to check if you have a performance issue and want to review performance metrics and attempt to resolve the issue yourself.
It is not intended to replace advice from IBM Support or formal performance data collection and analysis, but it may be helpful if you have a performance issue and need to quickly perform some high-level checks on your lpar. The advice pertains to generic AIX tuning and is not necessarily specific to your situation, although it may still help.
Symptom
A lack of system resources or a misconfigured system can cause many hard-to-diagnose issues, for example slow systems, application timeouts, application errors, depleted buffers, and slow user response times.
Cause
There are many reasons why an lpar may be slow. Some issues are caused "within the box", that is, within the LPAR: configuration, lack of resources, AIX tuning, application configuration and tuning. Others are caused "outside the box": remote server response times, storage, or network problems.
This Technote looks into the issues that you can check before involving the next level of support from IBM.
Environment
AIX
Diagnosing The Problem
Here are some performance metrics you can check before sending data to IBM for a deeper analysis.
CPU
For client lpars (non-VIO servers) in an uncapped configuration, ensure that the physical processors consumed (the pc column in vmstat or lparstat) do not reach or exceed the entitled capacity (the ent= value in the System configuration line of vmstat) except for short periods. For VIO servers, the physical processors consumed should never approach or exceed the entitled capacity.
In the following example the entitled capacity is 5.00 cpus, but up to 7.30 physical cpus are being consumed, which is over the entitlement. This can cause performance problems because the lpar must borrow cpu cycles from other lpars, and those cycles may come from remote hardware books with a longer path to cpu and memory. For lpars in a capped environment, us (%cpu time spent in user mode) and sy (%cpu time spent in system mode) added together should not exceed 80%.
# vmstat -t 5
System configuration: lcpu=96 mem=131072MB ent=5.00
kthr memory page faults cpu time
----- ----------------- ------------------------ ------------ ----------------------------- ---------
r b avm fre re pi po fr sr cy in sy cs us sy id wa pc ec hr mi se
19 0 15790958 13416980 0 0 0 0 0 0 578 1837634 26675 33 20 47 0 4.84 96.8 23:08:23
11 0 15825340 13382603 0 0 0 0 0 0 639 2538221 28756 32 21 47 0 6.10 122.1 23:08:28
11 0 15830226 13377720 0 0 0 0 0 0 505 2389828 30384 27 17 56 0 6.94 138.7 23:08:33
9 0 15828403 13379535 0 0 0 0 0 0 380 2081531 29492 23 16 62 0 7.30 146.0 23:08:38
8 0 15825518 13382412 0 0 0 0 0 0 316 1930309 28625 23 14 63 0 6.90 138.0 23:08:43
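The pc-versus-entitlement check can be scripted as a quick filter. The sketch below feeds a few of the sample rows above through awk; on a live lpar you would pipe `vmstat -t 5` in instead, taking the 5.00 entitlement from the configuration line.

```shell
# Flag intervals where physical processors consumed (pc, field 18)
# reach or exceed the entitled capacity.  The data rows are taken
# from the vmstat sample above; live use: vmstat -t 5 | awk ...
ent=5.00
awk -v ent="$ent" '$18+0 >= ent { print "over entitlement: pc=" $18 " at " $20 }' <<'EOF'
19 0 15790958 13416980 0 0 0 0 0 0 578 1837634 26675 33 20 47 0 4.84 96.8 23:08:23
11 0 15825340 13382603 0 0 0 0 0 0 639 2538221 28756 32 21 47 0 6.10 122.1 23:08:28
9 0 15828403 13379535 0 0 0 0 0 0 380 2081531 29492 23 16 62 0 7.30 146.0 23:08:38
EOF
```

Only the 6.10 and 7.30 intervals are reported; the 4.84 interval is under the 5.00 entitlement.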
A lack of cpu can cause many hard-to-diagnose issues, for example slow systems, application timeouts (both remote and local), and depleted buffers. If the system has been functioning well for some time and is suddenly slow or consuming more resources, the cause could be a change in workload, a misbehaving process or application, or an environmental change. Explore these possibilities before adding cpu unless the issue is urgent. A convenient way to determine which processes are using the most cpu is to run the topas command without options, which gives output similar to the following:
# topas
Topas Monitor for host: EVENTS/QUEUES FILE/TTY
Sun Sep 21 22:40:43 2025 Interval:2 Cswitch 28952 Readch 125.2M
Syscall 3248.3K Writech 4543.5K
CPU User% Kern% Wait% Idle% Physc Entc% Reads 50376 Rawin 0
Total 25.7 19.7 0.0 54.6 7.95 159.07 Writes 20156 Ttyout 1388
Forks 231 Igets 3546
Network BPS I-Pkts O-Pkts B-In B-Out Execs 191 Namei 33201
Total 370K 260.0 1.05K 69.6K 301K Runqueue 9.00 Dirblk 0
Waitqueue 0.0
Disk Busy% BPS TPS B-Read B-Writ MEMORY
Total 7.5 980K 245.0 0 980K PAGING Real,MB 131072
Faults 86832 % Comp 47
FileSystem BPS TPS B-Read B-Writ Steals 0 % Noncomp 12
Total 118M 29.0K 118M 23.0K PgspIn 0 % Client 12
PgspOut 0
Name PID CPU% PgSp Owner PageIn 0 PAGING SPACE
kdb_64 24379686 2.4 12.4M in004ivy PageOut 0 Size,MB 24192
secldapc 11796832 2.3 18.0M root Sios 0 % Used 1
rsyslogd 14483820 1.9 63.3M root % Free 99
auditpr 30736802 1.6 8.57M root NFS (calls/sec)
sshd-ses 45416934 0.4 3.19M root SerV2 0 WPAR Activ 0
sshd-ses 42664360 0.4 3.19M root CliV2 0 WPAR Total 0
sshd-ses 27001128 0.4 3.19M root SerV3 0 Press: "h"-help
sshd-ses 22938076 0.4 3.25M root CliV3 0 "q"-quit
sshd-ses 37028350 0.4 3.32M root SerV4 0
java 17957136 0.4 424M root CliV4 0
In the output above we are consuming more (159.07%) than the entitlement, and the busiest process is kdb_64. You need to determine whether this is normal for your system. Run vmstat and topas regularly, or deploy nmon or topasrec to collect data over an extended period so you can establish normal workload patterns.
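One low-effort way to build that history is to record nmon data continuously from cron. The entry below is a sketch: the nmon path and the /var/perf/nmon output directory are assumptions for your environment.

```shell
# Hypothetical crontab entry (path and directory are assumptions):
# -f write to a file, -s 60 snapshot every 60 s, -c 1440 snapshots
# (24 hours), -m output directory.  Starts a new file each midnight.
0 0 * * * /usr/bin/nmon -f -s 60 -c 1440 -m /var/perf/nmon
```

The resulting files can be post-processed with the usual nmon analyser tooling to establish baseline workload patterns.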
Run lparstat -h and ensure there are enough cpus in the shared processor pool.
# lparstat -h 5
System configuration: type=Shared mode=Capped smt=8 lcpu=16 mem=12288MB psize=23 ent=0.20
%user %sys %wait %idle physc %entc lbusy app vcsw phint %hypv hcalls %nsp %utcyc
----- ----- ------ ------ ----- ----- ------ --- ----- ----- ------ ------ ----- ------
0.0 0.4 0.0 99.6 0.01 3.2 1.8 22.85 249 0 3.1 399 117 14.42
0.0 0.2 0.0 99.8 0.01 2.6 1.0 22.88 254 0 2.6 416 117 14.39
0.0 0.2 0.0 99.7 0.01 2.8 1.6 22.86 255 0 2.8 412 117 14.40
0.0 0.2 0.0 99.8 0.01 2.6 1.1 22.87 255 0 2.6 418 117 14.39
In the example above there is a minimum of 22.85 cpus available in the pool. Since these are averaged statistics, cpu availability may dip lower than shown. You should have at least 2.0 cpus available, and bear in mind that other lpars may consume pool cpus and reduce the number available.
If you receive the following error, or the Available Physical Processors (app column) is not displayed, you can temporarily enable collection of hypervisor statistics using the following instructional video.
lparstat: 0551-021 Hypervisor statistics cannot be enabled
STORAGE IO
Run iostat and make sure the average read, write, and service times do not exceed 6ms. If they do, check the entitlement of the VIO server and involve your storage team to assess storage performance. Also check the error report for errors pertaining to storage on both the client lpar and the VIO server(s).
The following script displays average read, write, and queue times.
Discard disks that are not busy, for example under 5% tm_act.
Qfull counts the commands rejected by the storage device because its command queue was exhausted; it should not exceed 30.
There will be a delay while information is collected.
# cat chk_iostat.sh
echo " Disk %tm_act read writes queue Qfull Time\n"
sp=" " ; iostat -DlT 5 | awk -v s="$sp" '{ print $1 s $2 s $8 s $14 s $19 s $21 s $25 }' | grep hdisk
# sh ./chk_iostat.sh
Disk %tm_act read writes queue Qfull Time
hdisk0 0.0 0.0 0.0 0.0 0.0 21:01:44
hdisk3 0.0 0.0 0.0 0.0 0.0 21:01:44
hdisk1 0.0 0.0 0.4 0.0 0.0 21:01:44
hdisk2 0.0 0.0 0.0 0.0 0.0 21:01:44
hdisk0 0.0 0.0 0.0 0.0 0.0 21:01:49
hdisk3 0.0 0.0 0.0 0.0 0.0 21:01:49
hdisk1 0.0 0.0 0.0 0.0 0.0 21:01:49
hdisk2 0.0 0.0 0.0 0.0 0.0 21:01:49
hdisk0 0.0 0.0 0.0 0.0 0.0 21:01:54
hdisk3 0.0 0.0 0.0 0.0 0.0 21:01:54
hdisk1 0.0 0.0 0.0 0.0 0.0 21:01:54
hdisk2 0.0 0.0 0.0 0.0 0.0 21:01:54
hdisk0 0.0 0.0 0.0 0.0 0.0 21:01:59
hdisk3 0.0 0.0 0.0 0.0 0.0 21:01:59
hdisk1 0.0 0.0 0.0 0.0 0.0 21:01:59
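To pick out only the disks that breach the 6 ms guideline, the script's output can be filtered further. The sketch below uses hypothetical sample rows in the chk_iostat.sh output format (fields 3 and 4 being average read and write times in milliseconds); on a live system you would pipe `sh ./chk_iostat.sh` into the awk filter instead.

```shell
# Print disks whose average read (field 3) or write (field 4) time
# exceeds 6 ms.  The two rows below are hypothetical sample values,
# standing in for: sh ./chk_iostat.sh | awk ...
awk '/^hdisk/ && ($3+0 > 6 || $4+0 > 6) { print $1 " read=" $3 " write=" $4 }' <<'EOF'
hdisk0 12.0 8.2 1.4 0.0 0.0 21:01:44
hdisk1 3.0 0.5 0.4 0.0 0.0 21:01:44
EOF
```

Here only hdisk0 is reported, since its 8.2 ms average read time exceeds the guideline.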
Run the lssrad command to display the placement of cpu and memory.
The following output shows poor placement; after running DPO, placement becomes more optimal.
# lssrad -av
REF1 SRAD MEM CPU
0
0 79140.19 0-3 8-11 32-35 56-59 80-83
2 59760.00 16-19 40-43 64-67
3 59760.00 20-23 44-47 68-71
1
1 78910.00 4-7 12-15 36-39 60-63 84-87
4 59760.00 24-27 48-51 72-75
5 59760.00 28-31 52-55 76-79
Run the following Dynamic Partition Optimizer (DPO) commands from the HMC command line to determine your placement score:
Get placement score: lsmemopt -m <system_name> -p <LPAR_NAME> -o currscore
Calculate placement score: lsmemopt -m <system_name> -p <LPAR_NAME> -o calcscore
To run DPO, use the following commands:
Run DPO: optmem -m <system_name> -t affinity -o start
Monitor DPO progress: lsmemopt -m <system_name>
You can run DPO if your score is less than 80%, or after performing a Live Partition Mobility (LPM) operation.
Run lssrad again
$ lssrad -av
REF1 SRAD MEM CPU
0
0 106788.19 8-11 20-23 32-35 48-51 64-67 80-83
2 57200.00 44-47 60-63 76-79
3 132382.00 0-7 16-19 28-31 40-43 56-59 72-75
6 100720.00 12-15 24-27 36-39 52-55 68-71 84-87
1
1 0.00
4 0.00
5 0.00
It is advisable to run DPO after performing Live Partition Mobility (LPM).
MEMORY
Run vmstat and check that pi and po (4K page-ins and 4K page-outs) are zero; if they are consistently non-zero, the lpar is paging and you may need to add memory. Also check the avm column (Active Virtual Memory, in 4K pages): divide this number by 256 to determine the computational memory requirement of your lpar in megabytes. This is the minimum memory needed to run your workload without paging; add 20% for contingency. Run vmstat during the problem period to ensure you have enough memory.
# vmstat -t 5
System configuration: lcpu=96 mem=131072MB ent=5.00
kthr memory page faults cpu time
----- ----------------- ------------------------ ------------ ----------------------------- ---------
r b avm fre re pi po fr sr cy in sy cs us sy id wa pc ec hr mi se
19 0 15790958 13416980 0 0 0 0 0 0 578 1837634 26675 33 20 47 0 4.84 96.8 23:08:23
11 0 15825340 13382603 0 0 0 0 0 0 639 2538221 28756 32 21 47 0 6.10 122.1 23:08:28
11 0 15830226 13377720 0 0 0 0 0 0 505 2389828 30384 27 17 56 0 6.94 138.7 23:08:33
9 0 15828403 13379535 0 0 0 0 0 0 380 2081531 29492 23 16 62 0 7.30 146.0 23:08:38
8 0 15825518 13382412 0 0 0 0 0 0 316 1930309 28625 23 14 63 0 6.90 138.0 23:08:43
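The avm arithmetic can be done in a couple of shell lines. This sketch uses the avm value from the first sample row above; on a live system you would take the value from vmstat during the problem period.

```shell
# avm is in 4 KB pages, so avm / 256 gives megabytes.
avm=15790958                 # taken from the first vmstat row above
mb=$(( avm / 256 ))          # computational memory in MB
need=$(( mb + mb / 5 ))      # plus 20% contingency
echo "computational: ${mb} MB, with 20% headroom: ${need} MB"
```

This prints a computational-memory figure of 61683 MB, comfortably inside the 128 GB configured on this lpar.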
In the example above, avm = 15790958; divided by 256 this is 61683 MB (about 60.2 GB), which fits into the 128 GB of memory allocated to this lpar.
NETWORK
- Check retransmitted packets
# netstat -s | egrep "packets sent|retransmitted" | head -2
5717375359 packets sent
434621 data packets (321053353 bytes) retransmitted
Only up to 0.02% of packets should be retransmitted; if the rate is higher, involve your network team. In this case only 0.0076% of the packets were retransmitted.
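The percentage can be computed directly from the two counters. The sketch below embeds the sample counters shown above; on a live system you would pipe the netstat command itself into awk.

```shell
# Compute retransmitted data packets as a percentage of packets sent.
# Live use: netstat -s | egrep "packets sent|retransmitted" | awk ...
printf '%s\n' \
  '5717375359 packets sent' \
  '434621 data packets (321053353 bytes) retransmitted' |
awk '/packets sent/  { sent = $1 }
     /retransmitted/ { retx = $1 }
     END { printf "retransmitted: %.4f%%\n", retx * 100 / sent }'
```

For the sample counters this prints 0.0076%, well under the 0.02% guideline.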
- Check for CRC errors
For each ethernet device run the following and check for CRC errors
# entstat -d entN | grep CRC
Where N is the device number eg 0,1,2,3
If the CRC errors are increasing then this is generally a network problem, with cables, routers, etc. You will need to involve your network team.
# entstat -d ent0 | grep CRC
No Carrier Sense: 0
CRC Errors: 0
- Check the send and receive queues for specific network interfaces
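Checking every adapter in one pass can be scripted. In this sketch the device list and the non-zero CRC count are hypothetical, and the per-device sample lines stand in for live `entstat -d $dev` output on AIX.

```shell
# Report only adapters whose CRC error count is non-zero.
# On AIX the case statement would be replaced by:
#   line=$(entstat -d "$dev" | grep "CRC Errors")
for dev in ent0 ent1; do
    case $dev in
        ent0) line='CRC Errors: 0' ;;
        ent1) line='CRC Errors: 7' ;;   # hypothetical failing adapter
    esac
    crc=${line##*: }                    # strip everything up to ": "
    [ "$crc" -gt 0 ] && echo "$dev has $crc CRC errors"
done
```

Re-run it periodically: a count that is increasing over time, not merely non-zero, is what points to a cabling or network problem.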
A high Recv -Q could be caused by incorrect tuning, insufficient resources or the application not reading the packets fast enough. A high Send -Q could be the result of network congestion, or a remote connection is not processing traffic this lpar is sending to it fast enough. It is possible that this lpar is undersized or tuned incorrectly.
For receive queues:
# netstat -a | egrep -p "Proto Recv-Q Send-Q" | sort -d -r +1
Proto Recv-Q Send-Q Local Address Foreign Address (state)
Active Internet connections (including servers)
tcp4 0 0 *.writesrv *.* LISTEN
tcp6 0 0 *.ssh *.* LISTEN
tcp4 0 0 *.ssh *.* LISTEN
tcp4 0 0 *.smtp *.* LISTEN
For send queues, examine the Send-Q column of the same output.
- Dog threads
Dog threads are useful if your workload relies on the prioritisation of network traffic. For the purposes of troubleshooting via this document (assuming no perfpmr data is available), we do not recommend setting the ndogthreads parameter above two, and never above the number of virtual cpus allocated to the lpar.
Ensure there are more cpus than adapters. Dog threads should not be enabled if network traffic is light. If you enable dog threads, watch what effect this has on your general workload, as it can impact non-network workloads.
To determine whether dog threads could be an option for you, run ventstat -A (ventstat is provided by the perfpmr tarball, downloadable from the following Technote) and check whether the number of receive packets per interrupt (rpkts/i) is maxed out at 27; if so, enabling dog threads could be an option.
# ./ventstat -A -t 5 | awk '{ print $1 " " $11 }'
devname rpkts/i
ent0 26
ent0 27
ent0 25
To enable dog threads, do the following:
# no -p -o ndogthreads=2
Setting ndogthreads to 2
Setting ndogthreads to 2 in nextboot file
Check for the THREAD option in the ifconfig output:
# ifconfig -a | grep flags
en0: flags=1e084863,81ce0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>
lo0: flags=e08084b,c0<UP,BROADCAST,LOOPBACK,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,LARGESEND,CHAIN>
If THREAD is not present then enable it with chdev:
chdev -l en0 -a thread=on
en0 changed
ERROR LOG
Look for Permanent errors (PERM) and Performance errors (PERF) in the error report (errpt).
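For example (a sketch; errpt option behaviour can vary by AIX level, so confirm against the man page on your system):

```shell
# Summary of permanent errors, then of performance errors:
errpt -T PERM
errpt -T PERF
# Detailed view of the permanent-error entries:
errpt -a -T PERM | more
```

PERM entries against hdisk or ent devices are a strong hint that the problem is "outside the box" and that the storage or network team should be involved.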
Resolving The Problem
If you still experience performance issues then review the POWER Best Practices Guide.
POWER8 / POWER9 / POWER10
If you are still unable to resolve the problem, use the following perfscan utility to collect performance data and submit it to IBM for analysis.
Related Information
Document Location
Worldwide
[{"Type":"MASTER","Line of Business":{"code":"LOB08","label":"Cognitive Systems"},"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SWG10","label":"AIX"},"ARM Category":[{"code":"a8m0z000000cw0jAAA","label":"Performance"}],"ARM Case Number":"","Platform":[{"code":"PF002","label":"AIX"}],"Version":"5.3.0;6.1.0;7.1.0;7.2.0;7.3.0"}]
Document Information
Modified date:
22 October 2025
UID
ibm17246257