The performance detective: Where does it hurt?

How describing performance problems helps you diagnose them

If you're up against a performance problem on the IBM® AIX® operating system, your most important task is to diagnose it correctly. When a user tells you "the system is running slowly," it's time for some detective work. You need to know what questions to ask to help you pinpoint the real issue. The first article of this two-part series demonstrates how describing a performance problem can help you identify the bottlenecks. Part 2 will look at some good practices that help prevent those bottlenecks in the first place.

Anthony English (anthonyenglish@levitar.com.au), Senior AIX specialist, Levitar Pty Ltd

Anthony English is an independent contractor from Sydney, Australia. He has worked on AIX systems since 1991 and writes the IBM developerWorks blog, AIX Down Under. He is also recognized as an IBM Champion for Power Systems. You can reach Anthony at anthonyenglish@levitar.com.au.



31 January 2012

"Houston, we have a problem!"

It's not the sort of news you want to hear. A system is slow, and there are users who are looking for a quick resolution. Sometimes the temptation is to go for a quick fix that may end up masking symptoms instead of addressing the real underlying problem. That's not a smart approach. What would you think of a doctor who prescribes medicine as soon as a patient says they aren't feeling well? The heart of the art of medicine is diagnosis. The same is true for fixing system performance problems.


Asking the right questions

It can take quite a bit of investigation to work out exactly which factors are contributing to the slow response times. The first symptom reported may not be the only symptom. It may not even be the worst one. It is vital to find where the resource contention is happening. This fact-gathering process could take some time, but it saves you from "fixing" the wrong problem or spending time and effort on what is, in fact, only a minor symptom.

The first step in fixing any problem at all, of course, is identifying what the problem is. The key to understanding why a system is responding poorly is knowing where to look and what questions to ask. The diagnosis will be easier and faster if you can put together a detailed description of the performance problem.


Isolating the problem

If I had to identify a single rule for dealing with performance problems, it would be this one: You must pinpoint exactly which component in your infrastructure is the pain point. To do that, you have to look not only at what's running poorly but also at what's working normally.

Locating the areas of resource contention is far more effective than simply assuming the system is processor-bound, the network is slow, or the Storage Area Network (SAN) is poorly configured. You'll find it worth working through some basic questions that shed light on which component in your infrastructure might need attention.

The AIX Performance PMR tools

If you have a system that's performing below expectations, the AIX Performance PMR data collection tools may come in handy (see Resources). In addition to providing some scripts that help you identify resource bottlenecks, the Performance PMR tools (perfpmr) include a set of questions that help you and IBM Support pinpoint exactly where the performance problem lies. By working through the questions, you can get a better grasp of the real bottlenecks.

To start with, ask some basic questions. What exactly is running slowly? Is the slow performance affecting a single user or many users? Is it one process that is slow, such as a report, a backup, or a database update? Are all the systems connecting to a particular SAN redundant array of independent disks (RAID) set responding poorly? Which system is affected? What application is running? Is it an entire IBM Power Systems™ server or just one logical partition (LPAR)? If it is a single LPAR, is there a bottleneck on a single file system or even just one file?

Other tools

If you narrow the performance problem down to a single LPAR, you can then drill down further. You can do some basic checks for file system usage via the df command. Commands such as nmon and topas give an overall view of performance for a logical partition. Both these commands have menus that allow you to drill down to view processor usage, identify busy disks, display network statistics, and look at many other useful metrics. Figure 1 below shows the main screen for the topas command.

Figure 1. Main screen for the topas command
Screen shot of the topas main screen
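
The basic df check mentioned above can be turned into a quick filter. The following is a minimal sketch, not part of perfpmr: the function name fs_over and the 90 percent threshold are my own choices for illustration, and it assumes the portable df -Pk layout, in which the fifth column is the capacity and the sixth is the mount point.

```shell
# fs_over [THRESHOLD]
# Reads `df -Pk` output on stdin and prints every file system whose
# used capacity is at or above THRESHOLD percent (default 90).
# The function name and the default threshold are illustrative only.
fs_over() {
    awk -v limit="${1:-90}" '
        NR == 1 { next }              # skip the header line
        {
            pct = $5                  # Capacity column, for example "93%"
            sub(/%/, "", pct)
            # Ignore rows without a numeric capacity (such as /proc)
            if (pct ~ /^[0-9]+$/ && pct + 0 >= limit + 0)
                printf "%s %s%%\n", $6, pct
        }'
}

# Typical use on the LPAR under investigation:
#   df -Pk | fs_over 90
```

A full file system is not always the cause of slow response, but it is cheap to rule out before you drill down with nmon or topas.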

The vmstat command is especially useful for identifying performance bottlenecks. This single command can show you memory, processor, and I/O data, as you can see below in Listing 1.

Listing 1. A vmstat example
# vmstat 1 4

System configuration: lcpu=12 mem=7168MB ent=2.80

kthr    memory              page              faults              cpu
----- ----------- ------------------------ ------------ -----------------------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa    pc    ec
 5  0 793201 942484   0   0   0   0    0   0 15550 32034 25717  4 37 50  8  1.32  47.1
 1  0 793201 942484   0   0   0   0    0   0 17369 36882 29660  6 40 48  6  1.45  51.6
 0  0 793201 942484   0   0   0   0    0   0 18309 39566 33628  8 39 47  7  1.45  51.9
 4  0 793203 942482   0   0   0   0    0   0 16068 34022 27586  5 39 49  6  1.40  49.8

For a detailed explanation of how vmstat can quickly point you to where a system is struggling for resources, see the "Optimizing AIX 7 memory performance" series. See Resources for links to these and other articles related to system performance.
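
If you collect more than a handful of vmstat samples, it can help to reduce them to a few averages. The sketch below assumes the column order shown in Listing 1 (r and b first, pi and po in columns 6 and 7, us, sy, and wa in columns 14, 15, and 17). The function name vmstat_summary is hypothetical and is not an AIX or perfpmr command.

```shell
# vmstat_summary
# Reads vmstat output (Listing 1 layout) on stdin and prints one
# summary line. The function name is illustrative only.
vmstat_summary() {
    awk '
        # Data rows start with a number and carry at least the 17
        # standard columns; header and separator rows are skipped.
        $1 ~ /^[0-9]+$/ && NF >= 17 {
            n++
            runq    += $1           # r: threads on the run queue
            blocked += $2           # b: threads blocked, usually on I/O
            pagein  += $6           # pi: pages paged in from paging space
            pageout += $7           # po: pages paged out to paging space
            busy    += $14 + $15    # us + sy
            iowait  += $17          # wa
        }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            printf "samples=%d avg_r=%.1f avg_b=%.1f pi=%d po=%d avg_cpu=%.1f avg_wa=%.1f\n",
                   n, runq / n, blocked / n, pagein, pageout, busy / n, iowait / n
        }'
}

# Typical use: one sample every 5 seconds for 5 minutes
#   vmstat 5 60 | vmstat_summary
```

As a rough guide, a run queue that stays above the number of logical processors points toward processor contention, nonzero pi and po totals point toward memory pressure, and high b and wa values point toward I/O. For the four samples in Listing 1, the run queue averages 2.5, us plus sy averages 44.5 percent, and there is no paging.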

Can you reproduce the problem?

When you report a performance problem to IBM Support using the perfpmr tools, it helps if you can provide a detailed description of the problem, ideally the simplest example that reproduces it reliably. When you try to reproduce the problem, see if there is a command or a sequence of events that always produces slow results. Is the execution of AIX commands also slow?

Check log files

Log files are an important source of information. Most applications, databases, and hardware components provide some sort of error logging. If a tape backup is running unusually slowly, it may be simply because the tape drive needs to be cleaned. If the tape drive connects to an AIX LPAR, you can run the errpt command. Use the -a flag for a detailed description of the error, as shown below in Listing 2.

Listing 2. Detailed error report (errpt -a)
# errpt -a | more
LABEL:          TAPE_ERR1
IDENTIFIER:     4865FA9B

Date/Time:       Sat Oct  1 12:56:00 AEST 2011
Sequence Number: 136509
Machine Id:      00C5A47E4C00
Node Id:         tsm1
Class:           H
Type:            PERM
WPAR:            Global
Resource Name:   rmt1
Resource Class:
Resource Type:
Location:
VPD:
        Manufacturer................IBM
        Machine Type and Model......ULT3580-TD3
        Serial Number...............1210002439
        Device Specific.(FW)........93G6

Description
TAPE OPERATION ERROR
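
When the error log is long, a count of entries per resource can show quickly whether one device is responsible for most of them. This is a minimal sketch that assumes the one-line-per-entry summary that errpt prints without the -a flag, in which the fifth column is the resource name. The function name errpt_by_resource is my own.

```shell
# errpt_by_resource
# Reads the errpt summary (one line per entry) on stdin and prints
# "<count> <resource>" pairs, most frequent first.
# The function name is illustrative only.
errpt_by_resource() {
    awk 'NR > 1 && NF >= 5 { count[$5]++ }     # NR > 1 skips the header
         END { for (r in count) printf "%d %s\n", count[r], r }' |
        sort -k1,1nr -k2,2
}

# Typical use, limited to permanent hardware errors:
#   errpt -d H -T PERM | errpt_by_resource
# Then look at the detail for one resource:
#   errpt -a -N rmt1 | more
```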

If a script is running slowly, check whether it generates any output that indicates which stage it has reached.

Has anything changed?

When a process that was running well suddenly starts to run slowly, you naturally ask if anything has changed. Is there something that worked previously (perhaps before an upgrade) that no longer works properly? The fix may not necessarily be to roll back to the pre-upgrade configuration. It may simply be that there is a tuning parameter or environment variable that you need to set.

A simple procedure, such as extending a file system, may require adding a new physical volume to a volume group. If the new physical volume is left with its default queue_depth attribute, and that default is too low for the workload, I/O requests can queue in the operating system regardless of how well the SAN is able to service them.

You can check device attributes using the lsattr command. Listing 3 has an example showing the queue depth for a physical volume.

Listing 3. List device attribute
# lsattr -El hdisk7 -a queue_depth
queue_depth 3  Queue DEPTH True

To change a device attribute, you can usually use the chdev command, as shown in Listing 4.

Listing 4. Change device attribute
# chdev -l hdisk7 -a queue_depth=20
hdisk7 changed

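If the device is in use, chdev usually cannot apply the change immediately. You can either free up the resources that are using the device or defer the change by adding the -P flag. With -P, only the device definition in the Object Data Manager (ODM) is updated, and the new value takes effect the next time the device is configured, typically after a reboot (see Listing 5 below).

Listing 5. Deferred change to device attribute
# chdev -l hdisk7 -a queue_depth=20 -P   # Update the ODM only; applied at the next reboot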
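
A newly added disk with a different queue depth from its neighbours is easy to miss, so it can be worth listing the attribute for every disk in one pass. The following sketch simply wraps the lsdev and lsattr commands in a loop. The function name list_queue_depths is hypothetical, and devices that do not have a queue_depth attribute are reported as "unknown".

```shell
# list_queue_depths
# Prints "<disk> <queue_depth>" for every disk device known to the LPAR.
# The function name is illustrative only.
list_queue_depths() {
    for disk in $(lsdev -Cc disk -F name); do
        # -F value prints only the attribute value, without the description
        depth=$(lsattr -El "$disk" -a queue_depth -F value 2>/dev/null)
        printf '%s %s\n' "$disk" "${depth:-unknown}"
    done
}

# Typical use:
#   list_queue_depths
```

Any disk that stands out from the others in the same volume group is a candidate for the chdev change shown above. Check the storage vendor's recommendation before choosing a value.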

There are so many components in a well-functioning system that it really helps if you can find out what configuration changes might have occurred in the lead-up to the performance problem.

What performance was expected?

If the application, system, or hardware is being set up for the first time, are there reasonable expectations about how it is supposed to perform? On what are those expectations based? For example, is there an equivalent configuration that runs a similar process significantly faster than the one that is running slowly?

You can get a simple comparison between two AIX LPARs by running the perfpmr tools on both of them. The performance data can provide a quick measure of what a normally functioning system should look like.

Listing 6 demonstrates how to run a perfpmr script for 10 minutes (600 seconds). The first few lines of the output are shown below.

Listing 6. Capturing performance statistics for 10 minutes
# ./perfpmr.sh 600

(C) COPYRIGHT International Business Machines Corp., 2000,2001,2002,2003,2004-2008

23:12:26-10/05/11 :     perfpmr.sh begin
    PERFPMR: hostname: slowhost
    PERFPMR: perfpmr.sh Version 610 2010/12/01

Is the problem intermittent?

Here again, the perfpmr tools offer some key questions. Is the slow performance intermittent or constant? Is there a pattern to the slow behavior? For example, sometimes systems seem to peak when a large number of users log in at the start of the day, but then things settle down quickly.
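
To see whether there is a pattern, you need samples with a time attached. One simple approach, sketched below, is to prefix each line of a long-running command with the current date and time and keep the result in a file. The function name stamp and the log file path are assumptions for illustration.

```shell
# stamp
# Prefixes every line read from stdin with the current date and time.
# The function name is illustrative only.
stamp() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done
}

# Typical use: one sample a minute for 24 hours, appended to a log file
# (the path is hypothetical):
#   vmstat 60 1440 | stamp >> /tmp/vmstat.$(hostname).log
```

With a day or two of timestamped samples, a morning login peak or a nightly backup window usually shows up clearly.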

What aspect is slow?

It can be useful to find out what exactly is leading users to report that the system is running slowly. Is it the time it takes to log in or the time it takes to echo a character? Perhaps a transaction is taking a long time to complete or a report is taking too long to generate.

Does a reboot provide a temporary fix?

If a reboot makes the problem disappear for a while, it could be because of a resource that is consumed but not released for use by other processes. If the problem does creep in again after a reboot, how long does that take? Sometimes it is worth disabling a particular process that you suspect is causing the slow response times.

It's always worth looking for processes that may be chewing up memory or processor time or putting excessive demand on I/O resources. The ps command has many options that help report the busiest processes. Listing 7 is an example: in the TIME column, the dsmserv process stands out with 938:56 of accumulated processor time.

Listing 7. The ps command reports the processes that are running
# ps -ef | more
     UID      PID     PPID   C    STIME    TTY  TIME CMD
    root        1        0   0   Oct 04      -  0:01 /etc/init
    root   655466  3866772   0   Oct 04      -  0:00 /usr/sbin/snmpd
    root  2097342        1   0   Oct 04      -  0:00 /bin/ksh /usr/tivoli/tsm/server/bin/rc.adsmserv
    root  2424972  3866772   0   Oct 04      -  0:00 /usr/sbin/inetd
    root  2883806        1   0   Oct 04      -  0:00 /usr/lib/errdemon
    root  2949246        1   0   Oct 04      -  0:00 /usr/ccs/bin/shlap64
    root  3276878  3866772   0   Oct 04      -  0:00 /usr/sbin/syslogd
    root  3604516        1   0   Oct 04      -  1:24 /usr/sbin/syncd 60
    root  3670082  3866772   0   Oct 04      -  0:05 /usr/sbin/xntpd
    root  3735676  3866772   0   Oct 04      -  0:00 /usr/sbin/muxatmd
    root  3801210  3866772   0   Oct 04      -  0:00 /usr/sbin/hostmibd
    root  3866772        1   0   Oct 04      -  0:00 /usr/sbin/srcmstr
    root  3932286  3866772   0   Oct 04      -  0:00 /usr/sbin/portmap
    root  3997832  3866772   0   Oct 04      -  0:00 /usr/sbin/aixmibd
    root  4063420        1   0   Oct 04      -  0:44 /usr/sbin/getty /dev/console
    root  4128936  3866772   0   Oct 04      -  0:03 sendmail: accepting connections
    root  4259980  3866772   0   Oct 04      -  0:00 /usr/sbin/snmpmibd
    root  4325556        1   0   Oct 04      -  0:02 /usr/sbin/cron
    root  4391124  3866772   0   Oct 04      -  0:03 /usr/sbin/rsct/bin/vac8/IBM.CSMAgentRMd
    root  4522176        1   0   Oct 04      -  0:00 /usr/bin/dsmcad
    root  4718774  3866772   0   Oct 04      -  0:00 /usr/sbin/rpc.lockd -d 0
    root  4784284  2424972   0   Oct 04      -  1:10 xmtopas -p3
    root  4980888  3866772   0   Oct 04      -  0:00 /usr/sbin/biod 6
    root  5177506  3866772   0   Oct 04      -  0:00 /usr/sbin/nfsd 3891
    root  5243046  3866772   0   Oct 04      -  0:00 /usr/sbin/rpc.mountd
    root  5439672  3866772   0   Oct 04      -  0:04 /usr/sbin/rsct/bin/rmcd -a IBM.LPCommands -r
    root  5570560        1   0   Oct 04      -  0:00 bin/nonstop_aix @config/nonstop.properties
    root  5701822  2097342 208   Oct 04      - 938:56 dsmserv quiet
    root  5832888        1   0   Oct 04      -  0:02 /usr/local/sbin/sshd
    root  5898436  3866772   0   Oct 04      -  0:00 /usr/sbin/qdaemon
    root  5963972        1   0   Oct 04      -  0:00 /usr/sbin/uprintfd
    root  6095040  3866772   0   Oct 04      -  0:00 /usr/sbin/writesrv
    root  6160590  3866772   0   Oct 04      -  0:08 /usr/sbin/pcmsrv
    root  6291682  3866772   0   Oct 04      -  0:00 /usr/sbin/rsct/bin/IBM.DRMd
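
A full process listing is long, so it often helps to sort it. The sketch below assumes you ask ps for the standard pcpu, vsz, pid, and args fields, in that order, and then keeps only the rows with the highest pcpu value. The function name busiest is my own, and the default of five rows is arbitrary.

```shell
# busiest [N]
# Reads `ps -e -o pcpu,vsz,pid,args` output on stdin and prints the N
# rows with the highest value in the first (pcpu) column (default 5).
# The function name is illustrative only.
busiest() {
    # Drop the header line, sort on the first column in descending
    # numeric order, and keep the top N rows.
    awk 'NR > 1' | sort -k1,1nr | head -n "${1:-5}"
}

# Typical use:
#   ps -e -o pcpu,vsz,pid,args | busiest 10
```

Running the same pipeline at intervals after a reboot, and comparing the vsz column over time, is one way to spot a process that keeps acquiring memory without releasing it.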

Is the problem related to the network?

With a client/server configuration, it may be worth testing whether the problem also occurs when the application is run locally on the server rather than across the network. You can run the application from the console and compare its response time with the time you see when you connect across the network.

If the application uses a client/server model, you can do some basic testing from the client using ping server_IP_address (see Listing 8).

Listing 8. Ping by IP address
ping 192.168.168.30
PING 192.168.168.30: (192.168.168.30): 56 data bytes
64 bytes from 192.168.168.30: icmp_seq=0 ttl=255 time=0 ms
64 bytes from 192.168.168.30: icmp_seq=1 ttl=255 time=0 ms
64 bytes from 192.168.168.30: icmp_seq=2 ttl=255 time=0 ms
64 bytes from 192.168.168.30: icmp_seq=3 ttl=255 time=0 ms

----192.168.168.30 PING Statistics----
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 0/0/0 ms

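A ping by IP address can help identify if the issue is related to Domain Name System (DNS) configuration: if the server answers promptly when addressed by IP address but is slow to respond, or fails, when addressed by host name, name resolution is a likely suspect. If you suspect network problems, a diagram or description of the network configuration is a useful starting point.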
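
To compare runs, for example by IP address and by host name, or from different clients, it can help to reduce each ping run to its packet loss and average round-trip time. This is a minimal sketch that parses the summary lines in the layout shown in Listing 8. The function name ping_summary is hypothetical.

```shell
# ping_summary
# Reads ping output on stdin and prints "loss=<percent> avg_ms=<average>".
# The function name is illustrative only.
ping_summary() {
    awk '
        /packet loss/ {
            # The loss figure is the field that ends in a percent sign
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) { loss = $i; sub(/%/, "", loss) }
        }
        /min\/avg\/max/ {
            # The values precede the trailing "ms" unit: min/avg/max
            v = ($NF == "ms") ? $(NF - 1) : $NF
            split(v, t, "/")
            avg = t[2]
        }
        END { printf "loss=%s avg_ms=%s\n", loss, avg }'
}

# Typical use from the client (the address is the one in Listing 8):
#   ping -c 4 192.168.168.30 | ping_summary
```

Any packet loss, or an average that varies widely between runs, suggests that the network deserves a closer look before you tune the server.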

What vendor applications are involved?

It is important to know what vendor applications are used on a system that is performing poorly. Often there are operating system tunables, recommended kernel settings, and other environment variables that you should use for some applications. There also may be patches for the application that fix known performance issues.

You should know what version/release/level of the vendor application is installed and if the application has been updated recently.


General advice

The perfpmr documentation recommends providing a clear written statement of a simple specific instance of the problem. It also recommends separating the symptoms and facts from theories, ideas, and your own conclusions. As the documentation says, "If all the facts are available, the performance team can quickly eliminate the unrelated ones."

Another piece of advice is to ensure the correct machine is being used for information gathering. In large sites—and especially with so many virtualized environments—it can be easy to collect data from the wrong system. As the documentation says, "This makes it very hard to investigate the problem."

To identify the machine model and serial number, you can use the lsconf command.
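
The lsconf output is long, so a small filter keeps only the identifying lines. This sketch assumes the lines are labelled "System Model:" and "Machine Serial Number:", which is how I recall them appearing; adjust the pattern if your level of AIX labels them differently. The function name machine_id is my own.

```shell
# machine_id
# Reads lsconf output on stdin and keeps the model and serial number lines.
# The function name and the exact labels are assumptions.
machine_id() {
    grep -E '^(System Model|Machine Serial Number):'
}

# Typical use, with the host name for good measure:
#   uname -n; lsconf | machine_id
```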

When you're working through a performance problem, it can be easy to lose track of what steps you've already taken to resolve the issues. Keeping a record of the actions taken to diagnose or fix the problem can save you a lot of wasted effort.
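
The record does not need to be elaborate. As one possible approach, the sketch below appends a timestamped, host-tagged line to a plain text file each time you call it. The function name note, the NOTES variable, and the default file name are all hypothetical.

```shell
# note TEXT...
# Appends a timestamped entry, tagged with the host name, to the file
# named by $NOTES (hypothetical default: ./perf-notes.log).
note() {
    printf '%s %s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$(uname -n)" "$*" \
        >> "${NOTES:-./perf-notes.log}"
}

# Typical use:
#   note "hdisk7 queue_depth was 3, changed to 20 with chdev -P"
#   note "rebooted LPAR, response time back to normal"
```

Because each entry carries the host name, the same file also helps confirm that you collected data from the system you meant to investigate.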


The rewards of patience

Fixing performance problems requires good diagnostic skills, an ability to separate facts from theories and assumptions, and, above all, patience. Often the solution is a simple one, and your efforts are rewarded with improved system performance. The next article in this two-part series looks at some practices that can help you prevent performance bottlenecks from occurring in the first place.

Resources

Learn

Get products and technologies

  • Download perfpmr, the AIX Performance PMR data collection tools.