IBM Support

Quick Steps to Assist with a Performance Problem

Troubleshooting


Problem

The aim of this document is to provide high-level guidance on the top things to check when you have a performance issue, want to look at performance metrics, and would like to attempt to resolve the issue yourself.

It is not intended to replace advice given by IBM Support, or formal performance data collection and analysis; however, it may be helpful if you have a performance issue and need to quickly perform some high-level checks on your LPAR. The advice in this document pertains to generic AIX tuning and is not necessarily specific to your situation, but it may help with your current issue.

Symptom

A lack of system resources or a misconfigured system can cause many hard-to-diagnose issues, for example, slow systems, application timeouts, application errors, depleted buffers, and slow user response times.

Cause

There are many reasons why an LPAR may be slow. Some issues are caused "within the box", that is, within the LPAR itself, for example, configuration, lack of resources, AIX tuning, or application configuration and tuning. Others are caused "outside the box", for example, remote server response times, storage, or network problems.

This Technote looks into the issues that you can check before involving the next level of support from IBM.

Environment

AIX

Diagnosing The Problem

Here are some performance metrics you can check before sending data to IBM for a deeper analysis.
CPU
For client LPARs (non-VIO servers) in an uncapped configuration, ensure that the physical processors consumed (the pc column in vmstat or lparstat) do not approach or exceed the entitled capacity (the ent= value in the System configuration line of vmstat), except for short periods. For VIO servers, the physical processors consumed should never approach or exceed the entitled capacity.
In the following example we have an entitled capacity of 5.00 CPUs and are using up to 7.30 physical CPUs, which is over our entitlement. This may cause performance problems because the LPAR has to borrow CPU cycles from other LPARs, and those cycles may come from remote hardware books and therefore have a longer path to CPU or memory. For LPARs in a capped environment, us (%CPU time spent in user mode) and sy (%CPU time spent in system mode) added together should not exceed 80%.
# vmstat -t 5

System configuration: lcpu=96 mem=131072MB ent=5.00

kthr    memory                     page            faults              cpu                   time
----- ----------------- ------------------------ ------------ ----------------------------- ---------
 r  b   avm       fre    re  pi  po  fr   sr  cy  in    sy    cs   us sy id wa    pc     ec  hr mi se
19  0 15790958 13416980   0   0   0   0    0   0 578 1837634 26675 33 20 47  0  4.84   96.8  23:08:23
11  0 15825340 13382603   0   0   0   0    0   0 639 2538221 28756 32 21 47  0  6.10  122.1  23:08:28
11  0 15830226 13377720   0   0   0   0    0   0 505 2389828 30384 27 17 56  0  6.94  138.7  23:08:33
 9  0 15828403 13379535   0   0   0   0    0   0 380 2081531 29492 23 16 62  0  7.30  146.0  23:08:38
 8  0 15825518 13382412   0   0   0   0    0   0 316 1930309 28625 23 14 63  0  6.90  138.0  23:08:43
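As a quick scripted check, the pc column can be compared against the entitlement. The following is a minimal sketch, not an official tool: the field position (pc is field 18) and the 5.00 entitlement match the vmstat -t example above, and the sample lines are copied from it. In practice you would pipe live vmstat output through the awk filter instead of the here-document.

```shell
# Flag vmstat -t samples where physical CPUs consumed (pc, field 18)
# reach 90% or more of the entitled capacity.
ENT=5.00
awk -v ent="$ENT" '/^ *[0-9]/ {
    pc = $18
    if (pc >= ent * 0.9)
        printf "pc=%.2f is at %.0f%% of entitlement %.2f\n", pc, pc / ent * 100, ent
}' <<'EOF'
19  0 15790958 13416980   0   0   0   0    0   0 578 1837634 26675 33 20 47  0  4.84   96.8  23:08:23
 8  0 15825518 13382412   0   0   0   0    0   0 316 1930309 28625 23 14 63  0  6.90  138.0  23:08:43
EOF
```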
A lack of CPU can cause many hard-to-diagnose issues, for example, a slow system, application timeouts (both remote and local), and depleted buffers. If the system has been functioning well for some time and is suddenly slow or consuming more resources, the cause could be a change in workload, a misbehaving process or application, or an environmental change. You should explore these possibilities before adding CPU, unless the issue is urgent. A convenient way of determining which processes are using the most CPU is to run the topas command without options, which gives output similar to the following example:
# topas

Topas Monitor for host:                         EVENTS/QUEUES    FILE/TTY
Sun Sep 21 22:40:43 2025   Interval:2           Cswitch   28952  Readch   125.2M
                                                Syscall 3248.3K  Writech 4543.5K
CPU     User% Kern% Wait% Idle%   Physc  Entc%  Reads     50376  Rawin         0
Total    25.7  19.7   0.0  54.6    7.95 159.07  Writes    20156  Ttyout     1388
                                                Forks       231  Igets      3546
Network    BPS  I-Pkts  O-Pkts    B-In   B-Out  Execs       191  Namei     33201
Total     370K   260.0   1.05K   69.6K    301K  Runqueue   9.00  Dirblk        0
                                                Waitqueue   0.0
Disk    Busy%      BPS     TPS  B-Read  B-Writ                   MEMORY
Total     7.5     980K   245.0       0    980K  PAGING           Real,MB  131072
                                                Faults    86832  % Comp       47
FileSystem          BPS    TPS  B-Read  B-Writ  Steals        0  % Noncomp    12
Total              118M  29.0K    118M   23.0K  PgspIn        0  % Client     12
                                                PgspOut       0
Name           PID  CPU%  PgSp Owner            PageIn        0  PAGING SPACE
kdb_64     24379686  2.4 12.4M in004ivy         PageOut       0  Size,MB   24192
secldapc   11796832  2.3 18.0M root             Sios          0  % Used        1
rsyslogd   14483820  1.9 63.3M root                              % Free       99
auditpr    30736802  1.6 8.57M root             NFS (calls/sec)
sshd-ses   45416934  0.4 3.19M root             SerV2         0  WPAR Activ    0
sshd-ses   42664360  0.4 3.19M root             CliV2         0  WPAR Total    0
sshd-ses   27001128  0.4 3.19M root             SerV3         0  Press: "h"-help
sshd-ses   22938076  0.4 3.25M root             CliV3         0         "q"-quit
sshd-ses   37028350  0.4 3.32M root             SerV4         0
java       17957136  0.4  424M root             CliV4         0
In the output above we can see we are using more than our entitlement (159.07%). The busiest process is kdb_64. You need to determine whether this is normal for your system. Run vmstat and topas regularly, or deploy nmon or topasrec to collect data over an extended period, so that you can determine your normal workload patterns.
Run lparstat -h and ensure you have enough cpus in the pool.
# lparstat -h 5

System configuration: type=Shared mode=Capped smt=8 lcpu=16 mem=12288MB psize=23 ent=0.20

%user  %sys  %wait  %idle physc %entc  lbusy   app  vcsw phint  %hypv hcalls  %nsp  %utcyc
----- ----- ------ ------ ----- ----- ------   --- ----- ----- ------ ------ -----  ------
  0.0   0.4    0.0   99.6  0.01   3.2    1.8 22.85   249     0    3.1    399   117  14.42
  0.0   0.2    0.0   99.8  0.01   2.6    1.0 22.88   254     0    2.6    416   117  14.39
  0.0   0.2    0.0   99.7  0.01   2.8    1.6 22.86   255     0    2.8    412   117  14.40
  0.0   0.2    0.0   99.8  0.01   2.6    1.1 22.87   255     0    2.6    418   117  14.39
In the example above we can see there is a minimum of 22.85 CPUs available. Since these are average statistics, CPU availability may at times be lower than shown. You should have at least 2.0 CPUs available. Bear in mind that other LPARs may also draw from the pool and therefore reduce the number of CPUs available.
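A similar sketch can watch the app column against the 2.0-CPU margin suggested above. This is illustrative only; the app field is column 8 of the lparstat -h data lines, and the sample line is taken from the example:

```shell
# Warn when available physical processors in the shared pool (app,
# field 8 of lparstat -h data lines) fall below a 2.0 CPU margin.
MIN_APP=2.0
awk -v min="$MIN_APP" '/^ *[0-9]/ {
    if ($8 < min)
        printf "WARNING: only %.2f CPUs free in the shared pool\n", $8
    else
        printf "OK: %.2f CPUs free in the shared pool\n", $8
}' <<'EOF'
  0.0   0.4    0.0   99.6  0.01   3.2    1.8 22.85   249     0    3.1    399   117  14.42
EOF
```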
If you receive the following error, or the Available Physical Processors (app) column is not displayed, then you can temporarily enable hypervisor statistics using the following instructional video.
lparstat: 0551-021 Hypervisor statistics cannot be enabled
STORAGE IO
Run iostat and make sure that average read, write, and service times do not exceed 6 ms. If they do, check the entitlement of the VIO server and involve your storage team to assess storage performance. Also check the error report for errors pertaining to storage on both the client LPAR and the VIO server(s).
The following script displays average read, write, and queue times.
Discard disks that are not busy, for example, those under 5% tm_act.
Qfulls represent the number of commands rejected by the storage device because its command queue is exhausted; they should not exceed 30.
There will be a delay while information is collected.
# cat chk_iostat.sh
echo "  Disk %tm_act    read   writes   queue   Qfull      Time\n"
sp="     " ; iostat -DlT 5 | awk -v s="$sp" '{ print  $1 s $2 s $8 s $14 s $19 s $21 s $25 }' | grep hdisk

# sh ./chk_iostat.sh
  Disk %tm_act    read   writes   queue   Qfull      Time
hdisk0     0.0     0.0     0.0     0.0     0.0     21:01:44
hdisk3     0.0     0.0     0.0     0.0     0.0     21:01:44
hdisk1     0.0     0.0     0.4     0.0     0.0     21:01:44
hdisk2     0.0     0.0     0.0     0.0     0.0     21:01:44
hdisk0     0.0     0.0     0.0     0.0     0.0     21:01:49
hdisk3     0.0     0.0     0.0     0.0     0.0     21:01:49
hdisk1     0.0     0.0     0.0     0.0     0.0     21:01:49
hdisk2     0.0     0.0     0.0     0.0     0.0     21:01:49
hdisk0     0.0     0.0     0.0     0.0     0.0     21:01:54
hdisk3     0.0     0.0     0.0     0.0     0.0     21:01:54
hdisk1     0.0     0.0     0.0     0.0     0.0     21:01:54
hdisk2     0.0     0.0     0.0     0.0     0.0     21:01:54
hdisk0     0.0     0.0     0.0     0.0     0.0     21:01:59
hdisk3     0.0     0.0     0.0     0.0     0.0     21:01:59
hdisk1     0.0     0.0     0.0     0.0     0.0     21:01:59
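The thresholds above (ignore disks under 5% busy, flag service times over 6 ms or Qfull counts over 30) can be applied to the script's output with a small awk filter. A sketch only; the column layout matches chk_iostat.sh and the sample values below are hypothetical:

```shell
# Columns: Disk %tm_act read writes queue Qfull Time, as printed by
# chk_iostat.sh above. The data lines here are hypothetical samples.
awk '$1 ~ /^hdisk/ && $2 >= 5 {
    flag = ""
    if ($3 > 6 || $4 > 6) flag = " <-- service time over 6 ms"
    else if ($6 > 30)     flag = " <-- Qfull over 30"
    print $0 flag
}' <<'EOF'
hdisk0    12.0     8.5     2.1     0.3     0.0     21:01:44
hdisk1     3.0     1.0     0.5     0.0     0.0     21:01:44
hdisk2    40.0     2.0     1.5     0.1    55.0     21:01:44
EOF
```

In practice you would pipe the output of chk_iostat.sh into the filter.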

LPAR PLACEMENT
Run the lssrad command to display the placement of cpu and memory.
The following shows poor placement; after running DPO, the placement becomes more optimal.
# lssrad -av                                         

REF1   SRAD        MEM      CPU                                         
0                                                                       
          0   79140.19      0-3 8-11 32-35 56-59 80-83                  
          2   59760.00      16-19 40-43 64-67                           
          3   59760.00      20-23 44-47 68-71                           
1                                                                       
          1   78910.00      4-7 12-15 36-39 60-63 84-87                 
          4   59760.00      24-27 48-51 72-75                           
          5   59760.00      28-31 52-55 76-79  
Run the following Dynamic Platform Optimizer (DPO) commands to determine your placement score:
Get placement score:          lsmemopt -m <system_name> -p <LPAR_NAME> -o currscore
Calculate placement score:    lsmemopt -m <system_name> -p <LPAR_NAME> -o calcscore
To run DPO, use the following commands:
Run DPO:                      optmem   -m <system_name> -t affinity -o start
Monitor DPO progress:         lsmemopt -m <system_name>

 
You can run DPO if your score is less than 80%, or after performing a Live Partition Mobility (LPM) operation.
Run lssrad again
$ lssrad -av                                         
REF1   SRAD        MEM      CPU                                         
0                                                                       
          0   106788.19      8-11 20-23 32-35 48-51 64-67 80-83         
          2   57200.00      44-47 60-63 76-79                           
          3   132382.00      0-7 16-19 28-31 40-43 56-59 72-75          
          6   100720.00      12-15 24-27 36-39 52-55 68-71 84-87        
1                                                                       
          1       0.00                                                  
          4       0.00                                                  
          5       0.00
MEMORY
Run vmstat. Check that pi and po (pi = 4 KB page-ins, po = 4 KB page-outs) are equal to zero; if they are consistently non-zero, the system is paging and you may need to add more memory. Check the avm column in vmstat (Active Virtual Memory). Divide this number by 256 to determine the computational memory requirement of your LPAR in megabytes; this is the minimum memory required to run your workload without paging. You should add 20% for contingency. Run vmstat during the problem period to ensure you have enough memory.
# vmstat -t 5

System configuration: lcpu=96 mem=131072MB ent=5.00

kthr    memory                     page            faults              cpu                   time
----- ----------------- ------------------------ ------------ ----------------------------- ---------
 r  b   avm       fre    re  pi  po  fr   sr  cy  in    sy    cs   us sy id wa    pc     ec  hr mi se
19  0 15790958 13416980   0   0   0   0    0   0 578 1837634 26675 33 20 47  0  4.84   96.8  23:08:23
11  0 15825340 13382603   0   0   0   0    0   0 639 2538221 28756 32 21 47  0  6.10  122.1  23:08:28
11  0 15830226 13377720   0   0   0   0    0   0 505 2389828 30384 27 17 56  0  6.94  138.7  23:08:33
 9  0 15828403 13379535   0   0   0   0    0   0 380 2081531 29492 23 16 62  0  7.30  146.0  23:08:38
 8  0 15825518 13382412   0   0   0   0    0   0 316 1930309 28625 23 14 63  0  6.90  138.0  23:08:43
In the example above, avm = 15790958; divided by 256 this gives 61683 MB (or 60.2 GB), which fits into the 128 GB of memory allocated to this LPAR.
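The avm-to-memory arithmetic can be scripted. A minimal sketch, using the peak avm value from the samples above rather than the first one:

```shell
# Convert vmstat avm (4 KB pages) to MB (divide by 256) and add the
# suggested 20% contingency. avm here is the peak from the example.
avm=15830226
awk -v avm="$avm" 'BEGIN {
    mb = avm / 256
    printf "computational memory : %.0f MB (%.1f GB)\n", mb, mb / 1024
    printf "with 20%% contingency : %.0f MB\n", mb * 1.2
}'
```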
NETWORK
  • Check retransmitted packets

    # netstat -s | egrep "packets sent|retransmitted" | head -2
    Only up to 0.02% of the packets should be retransmitted.
    If the percentage is higher, involve your network team.
    # netstat -s | egrep "packets sent|retransmitted" | head -2
    5717375359 packets sent
    434621 data packets (321053353 bytes) retransmitted
    In this case, only about 0.008% of the packets were retransmitted.
  • Check for CRC errors

    For each Ethernet device, run the following and check for CRC errors,
    # entstat -d entN | grep CRC
    where N is the device number, for example 0, 1, 2, 3.
    If the CRC errors are increasing, this generally indicates a network problem with cables, routers, and so on. You will need to involve your network team.
     
    # entstat -d ent0 | grep CRC
    No Carrier Sense: 0                           CRC Errors: 0
    
  • Check the send and receive queues for specific network interfaces

    A high Recv-Q could be caused by incorrect tuning, insufficient resources, or the application not reading packets fast enough. A high Send-Q could be the result of network congestion, or of a remote connection not processing the traffic this LPAR sends to it fast enough. It is also possible that this LPAR is undersized or tuned incorrectly.

    For receive queues
    # netstat -a | egrep -p "Proto Recv-Q Send-Q" | sort -d -r +1
    
    Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)
    Active Internet connections (including servers)
    tcp4       0      0  *.writesrv             *.*                    LISTEN
    tcp6       0      0  *.ssh                  *.*                    LISTEN
    tcp4       0      0  *.ssh                  *.*                    LISTEN
    tcp4       0      0  *.smtp                 *.*                    LISTEN
    
    For send queue
    • Dog threads

      Dog threads are useful if your workload relies on the prioritisation of network traffic. For the purposes of troubleshooting via this document (assuming we have no perfpmr data), we don't recommend going above two dog threads (the ndogthreads parameter), nor above the number of virtual CPUs allocated to the LPAR.

      Ensure there are more CPUs than adapters. Dog threads should not be enabled if network traffic is light. If you enable dog threads, check what effect this has on your general workload, as it can impact non-network workloads.

      To determine whether dog threads could be an option for you, run ventstat -A (ventstat is provided in the perfpmr tarball, downloadable from the following Technote) and check whether the number of receive packets per interrupt (rpkts/i) is maxed out at 27. If it is, enabling dog threads could be an option.
    # ./ventstat -A -t 5 | awk '{ print $1 "  " $11 }'
    devname  rpkts/i
    ent0  26
    ent0  27
    ent0  25
    
    
    
To enable dog threads, do the following:
# no -p -o ndogthreads=2
Setting ndogthreads to 2
Setting ndogthreads to 2 in nextboot file
Check for the threads option in ifconfig:
ifconfig -a | grep flags
en0: flags=1e084863,81ce0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>
lo0: flags=e08084b,c0<UP,BROADCAST,LOOPBACK,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,LARGESEND,CHAIN>
If THREAD is not present, enable it with chdev:
chdev -l en0 -a thread=on
en0 changed
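The retransmission percentage from the first network check above can be computed rather than estimated by eye. A sketch using the two counters from the netstat -s example in that check:

```shell
# Percentage of TCP packets retransmitted; counters are taken from the
# netstat -s example earlier in this section.
sent=5717375359
retrans=434621
awk -v s="$sent" -v r="$retrans" 'BEGIN {
    pct = r / s * 100
    verdict = (pct > 0.02) ? "involve your network team" : "OK"
    printf "%.4f%% of packets retransmitted (%s)\n", pct, verdict
}'
```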
ERROR LOG
Look for Permanent Errors (PERM) and Performance errors (PERF) in the error report.

Resolving The Problem

If you have been unable to resolve the problem, use the following perfscan utility to collect performance data and submit it to IBM for analysis.

 

Document Location

Worldwide


Document Information

Modified date:
22 October 2025

UID

ibm17246257