Where does it hurt?

How describing performance problems helps you diagnose them


This content is part of the series: The performance detective

"Houston, we have a problem!"

It's not the sort of news you want to hear. A system is slow, and users want a quick resolution. The temptation is to go for a quick fix, which may mask the symptoms instead of addressing the underlying problem. That's not a smart approach. What would you think of a doctor who prescribes medicine as soon as a patient says they aren't feeling well? The heart of the art of medicine is diagnosis. The same is true for fixing system performance problems.

Asking the right questions

It can take quite a bit of investigation to work out exactly which factors are contributing to the slow response times. The first symptom reported may not be the only symptom. It may not even be the worst one. It is vital to find where the resource contention is happening. This fact-gathering process can take some time, but it saves you from "fixing" the wrong problem or spending time and effort on what is, in fact, only a minor symptom.

The first step in fixing any problem at all, of course, is identifying what the problem is. The key to understanding why a system is responding poorly is knowing where to look and what questions to ask. The diagnosis will be easier and faster if you can put together a detailed description of the performance problem.

Isolating the problem

If I had to identify a single rule for dealing with performance problems, it would be this one: You must pinpoint exactly which component in your infrastructure is the pain point. To do that, you have to look not only at what's running poorly but also at what's working normally.

Locating the areas of resource contention is far more effective than simply assuming the system is processor-bound, the network is slow, or the Storage Area Network (SAN) is poorly configured. You'll find it worth working through some basic questions that shed light on which component in your infrastructure might need attention.

The AIX Performance PMR tools

If you have a system that's performing below expectations, the AIX Performance PMR data collection tools may come in handy (see Related topics). In addition to providing some scripts that help you identify resource bottlenecks, the Performance PMR tools (perfpmr) include a set of questions that help you and IBM Support pinpoint exactly where the performance problem lies. By working through the questions, you can get a better grasp of the real bottlenecks.

To start with, ask some basic questions. What exactly is running slowly? Is the slow performance affecting a single user or many users? Is it one process that is slow, such as a report, a backup, or a database update? Are all the systems connecting to a particular SAN redundant array of independent disks (RAID) set responding poorly? Which system is affected? What application is running? Is it an entire IBM Power Systems™ server or just one logical partition (LPAR)? If it is a single LPAR, is there a bottleneck on a single file system or even just one file?

Other tools

If you narrow the performance problem down to a single LPAR, you can then drill down further. You can do some basic checks for file system usage via the df command. Commands such as nmon and topas give an overall view of performance for a logical partition. Both these commands have menus that allow you to drill down to view processor usage, identify busy disks, display network statistics, and look at many other useful metrics. Figure 1 below shows the main screen for the topas command.

Figure 1. Main screen for the topas command
Screen shot of the topas main screen
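The df check mentioned above can be scripted. The following is a minimal sketch rather than part of the perfpmr tools: it assumes the POSIX output format produced by df -P (capacity in the fifth column, mount point in the sixth), and the function name full_filesystems and the 90 percent threshold are illustrative choices.

```shell
# full_filesystems THRESHOLD
# Reads `df -P` output on stdin and prints the mount point and usage of every
# file system whose capacity is at or above THRESHOLD percent.
# Assumed columns: Filesystem, blocks, Used, Available, Capacity, Mounted on.
full_filesystems() {
    awk -v limit="$1" '
        NR > 1 {                      # skip the header line
            used = $5
            sub(/%/, "", used)        # "95%" -> "95"
            if (used + 0 >= limit + 0)
                print $6, used "%"
        }'
}
```

A typical invocation would be df -P | full_filesystems 90. Mount points that contain spaces are not handled by this sketch.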

The vmstat command is especially useful for identifying performance bottlenecks. This single command can show you memory, processor, and I/O data, as you can see below in Listing 1.

Listing 1. A vmstat example
# vmstat 1 4

System configuration: lcpu=12 mem=7168MB ent=2.80

kthr    memory              page              faults              cpu
----- ----------- ------------------------ ------------ -----------------------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa    pc    ec
 5  0 793201 942484   0   0   0   0    0   0 15550 32034 25717  4 37 50  8  1.32  47.1
 1  0 793201 942484   0   0   0   0    0   0 17369 36882 29660  6 40 48  6  1.45  51.6
 0  0 793201 942484   0   0   0   0    0   0 18309 39566 33628  8 39 47  7  1.45  51.9
 4  0 793203 942482   0   0   0   0    0   0 16068 34022 27586  5 39 49  6  1.40  49.8

For a detailed explanation of how vmstat can quickly point you to where a system is struggling for resources, see the "Optimizing AIX 7 memory performance" series. See Related topics for links to these and other articles related to system performance.
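When you collect more than a handful of vmstat samples, averaging the processor columns makes the picture easier to read. The sketch below is one possible way to do that. It assumes the column layout shown in Listing 1, where us, sy, id, and wa are fields 14 through 17; the function name vmstat_cpu_avg is hypothetical.

```shell
# vmstat_cpu_avg
# Reads AIX vmstat output on stdin and prints the average us, sy, id, and wa
# percentages across all sample rows.
# Assumes the layout of Listing 1, where us/sy/id/wa are fields 14-17.
vmstat_cpu_avg() {
    awk '
        $1 ~ /^[0-9]+$/ && NF >= 17 {   # data rows start with a number
            us += $14; sy += $15; id += $16; wa += $17; n++
        }
        END {
            if (n > 0)
                printf "us=%.2f sy=%.2f id=%.2f wa=%.2f\n", us/n, sy/n, id/n, wa/n
        }'
}
```

You could run it as vmstat 1 60 | vmstat_cpu_avg. For the four samples in Listing 1, it prints us=5.75 sy=38.75 id=48.50 wa=6.75, which shows at a glance that system time far outweighs user time on that LPAR.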

Can you reproduce the problem?

When you report a performance problem to IBM Support using the perfpmr tools, it helps if you can provide a detailed description of the problem. For example, you can provide more detail about the simplest repeatable example of the problem. When you try to reproduce the problem, see if there is a command or a sequence of events that always produces slow results. Is the execution of AIX commands also slow?

Check log files

Log files are an important source of information. Most applications, databases, and hardware components provide some sort of error logging. If a tape backup is running unusually slowly, it may be simply because the tape drive needs to be cleaned. If the tape drive connects to an AIX LPAR, you can run the errpt command. Use the -a flag for a detailed description of the error, as shown below in Listing 2.

Listing 2. Detailed error report (errpt -a)
# errpt -a | more
LABEL:          TAPE_ERR1

Date/Time:       Sat Oct  1 12:56:00 AEST 2011
Sequence Number: 136509
Machine Id:      00C5A47E4C00
Node Id:         tsm1
Class:           H
Type:            PERM
WPAR:            Global
Resource Name:   rmt1
Resource Class:
Resource Type:
        Machine Type and Model......ULT3580-TD3
        Serial Number...............1210002439
        Device Specific.(FW)........93G6


If a script is running slowly, check whether it generates any output that indicates which stage it has reached.

Has anything changed?

When a process that was running well suddenly starts to run slowly, you naturally ask if anything has changed. Is there something that worked previously, perhaps before an upgrade, that no longer works properly? The fix may not necessarily be to roll back to the pre-upgrade configuration. It may simply be that there is a tuning parameter or environment variable that you need to set.

Even a simple procedure, such as extending a file system, may require adding a new physical volume to a volume group. If the new physical volume is left with its default queue depth attribute, I/O requests can queue in the operating system, regardless of how well the SAN is able to service them.

You can check device attributes using the lsattr command. Listing 3 has an example showing the queue depth for a physical volume.

Listing 3. List device attribute
# lsattr -El hdisk7 -a queue_depth
queue_depth 3  Queue DEPTH True
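The check in Listing 3 can be repeated across every disk so that a newly added physical volume with a low queue depth stands out. This is a rough sketch under stated assumptions: the function names are hypothetical, the minimum you pass in should come from your storage vendor's recommendations, and list_queue_depths relies on lsdev -Cc disk -F name to list the disk names on AIX.

```shell
# low_queue_depth MINIMUM
# Reads lines of the form "<disk> <queue_depth>" on stdin and prints the
# disks whose queue depth is below MINIMUM.
low_queue_depth() {
    awk -v min="$1" '$2 ~ /^[0-9]+$/ && $2 + 0 < min + 0 { print $1, $2 }'
}

# list_queue_depths
# Prints "<disk> <queue_depth>" for every disk on an AIX system.
# lsattr reports the attribute value in its second column, as in Listing 3.
list_queue_depths() {
    for disk in $(lsdev -Cc disk -F name); do
        echo "$disk $(lsattr -El "$disk" -a queue_depth | awk '{ print $2 }')"
    done
}
```

On an AIX LPAR you might run list_queue_depths | low_queue_depth 8, where 8 is only an example value. Disks that do not report a numeric queue depth are skipped.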

To change a device attribute, you can usually use the chdev command, as shown in Listing 4.

Listing 4. Change device attribute
# chdev -l hdisk7 -a queue_depth=20
hdisk7 changed

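If the device is in use, the change cannot be applied immediately. You can either free up any resources that are using the device or schedule the attribute change for the next reboot. To schedule it, add the -P flag, which records the new value in the device configuration database without changing the running device (see Listing 5 below).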

Listing 5. Deferred change to device attribute
# chdev -l hdisk7 -a queue_depth=20 -P   # Record the change; it takes effect after the next reboot

There are so many components in a well-functioning system that it really helps if you can find out what configuration changes might have occurred in the lead-up to the performance problem.

What performance was expected?

If the application, system, or hardware is being set up for the first time, are there reasonable expectations about how it is supposed to perform? On what are those expectations based? For example, is there an equivalent configuration that runs a similar process significantly faster than the one that is running slowly?

You can get a simple comparison between two AIX LPARs by running the perfpmr tools on both of them. The performance data can provide a quick measure of what a normally functioning system should look like.

Listing 6 demonstrates how to run a perfpmr script for 10 minutes (600 seconds). The first few lines of the output are shown below.

Listing 6. Capturing performance statistics for 10 minutes
#./ 600

(C) COPYRIGHT International Business Machines Corp., 2000,2001,2002,2003,2004-2008

23:12:26-10/05/11 : begin
    PERFPMR: hostname: slowhost
    PERFPMR: Version 610 2010/12/01

Is the problem intermittent?

Here again, the perfpmr tools offer some key questions. Is the slow performance intermittent or constant? Is there a pattern to the slow behavior? For example, sometimes systems seem to peak when a large number of users log in at the start of the day, but then things settle down quickly.

What aspect is slow?

It can be useful to find out what exactly is leading users to report that the system is running slowly. Is it the time it takes to log in or the time it takes to echo a character? Perhaps a transaction is taking a long time to complete or a report is taking too long to generate.

Does a reboot provide a temporary fix?

If a reboot makes the problem disappear for a while, it could be because of a resource that is consumed but not released for use by other processes. If the problem does creep in again after a reboot, how long does that take? Sometimes it is worth disabling a particular process that you suspect is causing the slow response times.

It's always worth looking for processes that may be chewing up memory or processor time or putting excessive demand on I/O resources. The ps command has many options that help report the busiest processes. Listing 7 is an example.

Listing 7. The ps command reports the processes that are running
# ps -ef | more
     UID      PID     PPID   C    STIME    TTY  TIME CMD
    root        1        0   0   Oct 04      -  0:01 /etc/init
    root   655466  3866772   0   Oct 04      -  0:00 /usr/sbin/snmpd
    root  2097342        1   0   Oct 04      -  0:00 /bin/ksh /usr/tivoli/tsm/server/bin/
    root  2424972  3866772   0   Oct 04      -  0:00 /usr/sbin/inetd
    root  2883806        1   0   Oct 04      -  0:00 /usr/lib/errdemon
    root  2949246        1   0   Oct 04      -  0:00 /usr/ccs/bin/shlap64
    root  3276878  3866772   0   Oct 04      -  0:00 /usr/sbin/syslogd
    root  3604516        1   0   Oct 04      -  1:24 /usr/sbin/syncd 60
    root  3670082  3866772   0   Oct 04      -  0:05 /usr/sbin/xntpd
    root  3735676  3866772   0   Oct 04      -  0:00 /usr/sbin/muxatmd
    root  3801210  3866772   0   Oct 04      -  0:00 /usr/sbin/hostmibd
    root  3866772        1   0   Oct 04      -  0:00 /usr/sbin/srcmstr
    root  3932286  3866772   0   Oct 04      -  0:00 /usr/sbin/portmap
    root  3997832  3866772   0   Oct 04      -  0:00 /usr/sbin/aixmibd
    root  4063420        1   0   Oct 04      -  0:44 /usr/sbin/getty /dev/consol
    root  4128936  3866772   0   Oct 04      -  0:03 sendmail: accepting connect
    root  4259980  3866772   0   Oct 04      -  0:00 /usr/sbin/snmpmibd
    root  4325556        1   0   Oct 04      -  0:02 /usr/sbin/cron
    root  4391124  3866772   0   Oct 04      -  0:03 /usr/sbin/rsct/bin/vac8/IBM.
    root  4522176        1   0   Oct 04      -  0:00 /usr/bin/dsmcad
    root  4718774  3866772   0   Oct 04      -  0:00 /usr/sbin/rpc.lockd -d 0
    root  4784284  2424972   0   Oct 04      -  1:10 xmtopas -p3
    root  4980888  3866772   0   Oct 04      -  0:00 /usr/sbin/biod 6
    root  5177506  3866772   0   Oct 04      -  0:00 /usr/sbin/nfsd 3891
    root  5243046  3866772   0   Oct 04      -  0:00 /usr/sbin/rpc.mountd
    root  5439672  3866772   0   Oct 04      -  0:04 /usr/sbin/rsct/bin/rmcd -a IBM.LPCommands -r
    root  5570560        1   0   Oct 04      -  0:00 bin/nonstop_aix @config/nonstop.
    root  5701822  2097342 208   Oct 04      - 938:56 dsmserv quiet
    root  5832888        1   0   Oct 04      -  0:02 /usr/local/sbin/sshd
    root  5898436  3866772   0   Oct 04      -  0:00 /usr/sbin/qdaemon
    root  5963972        1   0   Oct 04      -  0:00 /usr/sbin/uprintfd
    root  6095040  3866772   0   Oct 04      -  0:00 /usr/sbin/writesrv
    root  6160590  3866772   0   Oct 04      -  0:08 /usr/sbin/pcmsrv
    root  6291682  3866772   0   Oct 04      -  0:00 /usr/sbin/rsct/bin/IBM.DRMd
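In a long process list, the busy entries are easy to miss. As a hedged sketch, the function below filters ps -ef output on the C column (the fourth field), which reflects recent processor usage. The function name busy_processes and the threshold are illustrative, and only the PID and C value are printed because the position of the command varies with the STIME format.

```shell
# busy_processes THRESHOLD
# Reads `ps -ef` output on stdin and prints "PID C" for each process whose
# C column (field 4) is above THRESHOLD. The header line and wrapped
# continuation lines are skipped because their fields are not numeric.
busy_processes() {
    awk -v limit="$1" '
        $2 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ && $4 + 0 > limit + 0 {
            print $2, $4
        }'
}
```

Running ps -ef | busy_processes 50 against the process list in Listing 7 would report only PID 5701822 (dsmserv), whose C value is 208 and whose accumulated TIME of 938:56 is far above that of any other process.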

Is the problem related to the network?

With a client/server configuration, it may be worth testing whether the problem also occurs when the application runs locally on the server rather than across the network. You can run the application from the console and see if the response time is similar to when you connect across the network.

If the application uses a client/server model, you can do some basic testing from the client using ping server_IP_address (see Listing 8).

Listing 8. Ping by IP address
PING ( 56 data bytes
64 bytes from icmp_seq=0 ttl=255 time=0 ms
64 bytes from icmp_seq=1 ttl=255 time=0 ms
64 bytes from icmp_seq=2 ttl=255 time=0 ms
64 bytes from icmp_seq=3 ttl=255 time=0 ms

---- PING Statistics----
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 0/0/0 ms

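A ping by IP address can help identify if the issue is related to Domain Name System (DNS) configuration: if the server responds promptly when addressed by IP address but slowly, or not at all, when addressed by host name, name resolution is the likely culprit. If you suspect network problems, a diagram or description of the network configuration is a useful starting point.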
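To compare results from several clients, or a ping by address against a ping by host name, it helps to reduce each run to its summary figures. The sketch below is one way to do that. It assumes the two summary lines have the format shown in Listing 8 (a "packet loss" line and a "round-trip min/avg/max = a/b/c ms" line); the function name ping_summary is hypothetical.

```shell
# ping_summary
# Reads ping output on stdin and prints the packet loss percentage and the
# average round-trip time taken from the two summary lines.
# Assumes the summary format of Listing 8; other ping implementations differ.
ping_summary() {
    awk '
        /packet loss/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) { loss = $i; sub(/%/, "", loss) }
        }
        /^round-trip/ {
            split($4, t, "/")          # $4 holds min/avg/max, e.g. 0/0/0
            avg = t[2]
        }
        END { print "loss=" loss, "avg_ms=" avg }'
}
```

For example, ping -c 4 server_IP_address | ping_summary followed by the same command with the server's host name gives two one-line results that are easy to compare or log.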

What vendor applications are involved?

It is important to know what vendor applications are used on a system that is performing poorly. Often there are operating system tunables, recommended kernel settings, and environment variables that you should use for some applications. There also may be patches for the application that fix known performance issues.

You should know what version/release/level of the vendor application is installed and if the application has been updated recently.

General advice

The perfpmr documentation recommends providing a clear written statement of a simple specific instance of the problem. It also recommends separating the symptoms and facts from theories, ideas, and your own conclusions. As the documentation says, "If all the facts are available, the performance team can quickly eliminate the unrelated ones."

Another piece of advice is to ensure the correct machine is being used for information gathering. In large sites—and especially with so many virtualized environments—it can be easy to collect data from the wrong system. As the documentation says, "This makes it very hard to investigate the problem."

To identify the machine model and serial number, you can use the lsconf command.
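One way to guard against collecting data from the wrong system is to label every set of collected data with the machine's identity. The following sketch assumes that the lsconf output contains lines labelled "System Model:" and "Machine Serial Number:"; the function name machine_identity is hypothetical.

```shell
# machine_identity
# Reads lsconf output on stdin and prints the system model and serial number
# on one line, suitable for labelling collected performance data.
machine_identity() {
    awk -F': *' '
        $1 == "System Model"          { model = $2 }
        $1 == "Machine Serial Number" { serial = $2 }
        END { print "model=" model, "serial=" serial }'
}
```

For instance, lsconf | machine_identity could be written to a file alongside the performance data before it is sent to the support team.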

When you're working through a performance problem, it can be easy to lose track of what steps you've already taken to resolve the issues. Keeping a record of the actions taken to diagnose or fix the problem can save you a lot of wasted effort.

The rewards of patience

Fixing performance problems requires good diagnostic skills, an ability to separate facts from theories and assumptions, and, above all, patience. Often the solution is a simple one, and your efforts are rewarded with improved system performance. The next article in this two-part series looks at some practices that can help you prevent performance bottlenecks from occurring in the first place.

Related topics
