Where does it hurt?
How describing performance problems helps you diagnose them
This content is part of the series: The performance detective
"Houston, we have a problem!"
It's not the sort of news you want to hear. A system is slow, and there are users who are looking for a quick resolution. Sometimes the temptation is to go for a quick fix that may end up masking symptoms instead of addressing the real underlying problem. That's not a smart approach. What would you think of a doctor who prescribes medicine as soon as a patient says they aren't feeling well? The heart of the art of medicine is diagnosis. The same is true for fixing system performance problems.
Asking the right questions
It can take quite a bit of investigation to work out exactly which factors are contributing to the slow response times. The first symptom reported may not be the only symptom, and it may not even be the worst one. It is vital to find where the resource contention is happening. This fact-gathering process could take some time, but it saves you from "fixing" the wrong problem or spending time and effort on what is, in fact, only a minor symptom.
The first step in fixing any problem at all, of course, is identifying what the problem is. The key to understanding why a system is responding poorly is knowing where to look and what questions to ask. The diagnosis will be easier and faster if you can put together a detailed description of the performance problem.
Isolating the problem
If I had to identify a single rule for dealing with performance problems, it would be this one: You must pinpoint exactly which component in your infrastructure is the pain point. To do that, you have to look not only at what's running poorly but also at what's working normally.
Locating the areas of resource contention is far more effective than simply assuming the system is processor-bound, the network is slow, or the Storage Area Network (SAN) is poorly configured. You'll find it worth working through some basic questions that shed light on which component in your infrastructure might need attention.
The AIX Performance PMR tools
If you have a system that's performing below expectations, the AIX Performance PMR data collection tools may come in handy (see Related topics). In addition to providing some scripts that help you identify resource bottlenecks, the Performance PMR tools (perfpmr) include a set of questions that help you and IBM Support pinpoint exactly where the performance problem lies. By working through the questions, you can get a better grasp of the real bottlenecks.
To start with, ask some basic questions. What exactly is running slowly? Is the slow performance affecting a single user or many users? Is it one process that is slow, such as a report, a backup, or a database update? Are all the systems connecting to a particular SAN redundant array of independent disks (RAID) set responding poorly? Which system is affected? What application is running? Is it an entire IBM Power Systems™ server or just one logical partition (LPAR)? If it is a single LPAR, is there a bottleneck on a single file system or even just one file?
If you narrow the performance problem down to a single LPAR, you can then drill down further. You can do some basic checks for file system usage via the df command. Interactive monitors such as topas give an overall view of performance for a logical partition, with menus that allow you to drill down to view processor usage, identify busy disks, display network statistics, and look at many other useful metrics.
Figure 1. Main screen of the monitoring tool (image not included)
The vmstat command is especially useful for identifying performance bottlenecks. This single command can show you memory, processor, and I/O data, as you can see below in Listing 1.
Listing 1. Sample vmstat output
vmstat 1 4

System configuration: lcpu=12 mem=7168MB ent=2.80

kthr    memory              page              faults              cpu
----- ----------- ------------------------ ------------ -----------------------
 r  b    avm    fre  re  pi  po  fr  sr  cy    in    sy    cs us sy id wa    pc    ec
 5  0 793201 942484   0   0   0   0   0   0 15550 32034 25717  4 37 50  8  1.32  47.1
 1  0 793201 942484   0   0   0   0   0   0 17369 36882 29660  6 40 48  6  1.45  51.6
 0  0 793201 942484   0   0   0   0   0   0 18309 39566 33628  8 39 47  7  1.45  51.9
 4  0 793203 942482   0   0   0   0   0   0 16068 34022 27586  5 39 49  6  1.40  49.8
For a detailed explanation of how vmstat can quickly point you to where a system is struggling for resources, see the "Optimizing AIX 7 memory performance" series. See Related topics for links to these and other articles related to system performance.
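As a rough illustration of reading that output, the sketch below scans vmstat samples and flags the ones that hint at a bottleneck. The function name `vmstat_flags` and the default I/O wait threshold of 20 are assumptions made for this example, not values from the perfpmr tools, and the column positions assume the 19-column layout shown in Listing 1.

```shell
# Sketch: flag vmstat samples that hint at a bottleneck.
# Reads vmstat output on stdin. Column positions follow Listing 1
# (19 columns); the wa threshold is an illustrative assumption.
vmstat_flags() {
    awk -v wa_max="${1:-20}" '
        # Data rows have exactly 19 fields and start with a number;
        # header and separator lines are skipped.
        NF == 19 && $1 ~ /^[0-9]+$/ {
            n++
            msg = ""
            if ($6 + $7 > 0)  msg = msg " paging(pi=" $6 ",po=" $7 ")"
            if ($17 > wa_max) msg = msg " iowait(wa=" $17 ")"
            if ($2 > 0)       msg = msg " blocked(b=" $2 ")"
            if (msg != "")    print "sample " n ":" msg
        }
    '
}
```

On an AIX LPAR this might be used as `vmstat 1 4 | vmstat_flags 10`. A sample with no flags is not proof of health; it only means none of these three simple checks fired.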
Can you reproduce the problem?
When you report a performance problem to IBM Support using the perfpmr tools, it helps if you can provide a detailed description of the problem. For example, you can provide more detail about the simplest repeatable example of the problem. When you try to reproduce the problem, see if there is a command or a sequence of events that always produces slow results. Is the execution of AIX commands also slow?
Check log files
Log files are an important source of information. Most applications, databases, and hardware components provide some sort of error logging. If a tape backup is running unusually slowly, it may be simply because the tape drive needs to be cleaned. If the tape drive connects to an AIX LPAR, you can run the errpt command. Use the -a flag for a detailed description of the error, as shown below in Listing 2.
Listing 2. Detailed error report (errpt -a)
# errpt -a | more
LABEL:           TAPE_ERR1
IDENTIFIER:      4865FA9B

Date/Time:       Sat Oct 1 12:56:00 AEST 2011
Sequence Number: 136509
Machine Id:      00C5A47E4C00
Node Id:         tsm1
Class:           H
Type:            PERM
WPAR:            Global
Resource Name:   rmt1
Resource Class:
Resource Type:
Location:

VPD:
        Manufacturer................IBM
        Machine Type and Model......ULT3580-TD3
        Serial Number...............1210002439
        Device Specific.(FW)........93G6

Description
TAPE OPERATION ERROR
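When the error report is long, a quick count of entries per device can show whether one resource keeps failing. The following is a minimal sketch that relies only on the "LABEL:" and "Resource Name:" lines visible in Listing 2; the function name `errpt_summary` is made up for this example.

```shell
# Sketch: summarize detailed error report text (stdin) as
# "count label resource", most frequent first.
# Uses only the "LABEL:" and "Resource Name:" lines shown in Listing 2.
errpt_summary() {
    awk '
        $1 == "LABEL:"                    { label = $2 }
        $1 == "Resource" && $2 == "Name:" { count[label " " $3]++ }
        END { for (k in count) print count[k], k }
    ' | sort -k1,1nr -k2
}
```

On AIX it could be fed with `errpt -a | errpt_summary`. A device that appears many times is a candidate for closer inspection, although the count alone says nothing about severity.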
If there is a script that is running slowly, perhaps it generates some output to indicate which stage it is up to.
Has anything changed?
When a process that was running well suddenly starts to run slowly, you naturally ask if anything has changed. Is there something that worked previously, perhaps before an upgrade, that no longer works properly? The fix may not necessarily be to roll back to the pre-upgrade configuration. It may simply be that there is a tuning parameter or environment variable that you need to set.
A simple procedure, such as extending a file system, may require adding a new physical volume to a volume group. If the new physical volume has the default queue depth attribute, it can cause I/O requests to queue on the operating system, regardless of how well the SAN may be able to service I/O requests.
You can check device attributes using the lsattr command. Listing 3 has an example showing the queue depth for a physical volume.
Listing 3. List device attribute
# lsattr -El hdisk7 -a queue_depth
queue_depth 3 Queue DEPTH True
To change a device attribute, you can usually use the chdev command, as shown in Listing 4.
Listing 4. Change device attribute
# chdev -l hdisk7 -a queue_depth=20
hdisk7 changed
If the device is in use, you can either free up any resources that may be using it or schedule the attribute change for after a reboot. To defer the change until the next reboot, add the -P flag (see Listing 5 below).
Listing 5. Permanent change to device attribute
# chdev -l hdisk7 -a queue_depth=20 -P # Make permanent change
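Checking one disk at a time gets tedious when a volume group has many physical volumes. The sketch below filters a list of "disk_name queue_depth" pairs and prints the disks that fall below a chosen minimum. The function name `low_queue_depth` and the default minimum of 8 are placeholders for illustration; the appropriate value depends on your storage vendor's guidance.

```shell
# Sketch: list disks whose queue_depth is below a chosen minimum.
# Input (stdin): one "disk_name queue_depth" pair per line.
# The default minimum of 8 is an arbitrary placeholder, not a recommendation.
low_queue_depth() {
    awk -v min="${1:-8}" 'NF == 2 && $2 + 0 < min + 0 { print $1 " queue_depth=" $2 }'
}

# On AIX the pairs could be gathered like this (assumes lsdev and the
# -F output option of lsattr are available on your level of AIX):
# for d in $(lsdev -Cc disk -F name); do
#     echo "$d $(lsattr -El "$d" -a queue_depth -F value)"
# done | low_queue_depth 8
```

Keeping the filter separate from the data gathering makes it easy to run the same check against output saved from another LPAR.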
There are so many components in a well-functioning system that it really helps if you can find out what configuration changes might have occurred in the lead-up to the performance problem.
What performance was expected?
If the application, system, or hardware is being set up for the first time, are there reasonable expectations about how it is supposed to perform? On what are those expectations based? For example, is there an equivalent configuration that runs a similar process significantly faster than the one that is running slowly?
You can get a simple comparison between two AIX LPARs by running the perfpmr tools on both of them. The performance data can provide a quick measure of what a normally functioning system should look like.
Listing 6 demonstrates how to run a perfpmr script for 10 minutes (600 seconds). The first few lines of the output are shown below.
Listing 6. Capturing performance statistics for 10 minutes
# ./perfpmr.sh 600
(C) COPYRIGHT International Business Machines Corp., 2000,2001,2002,2003,2004-2008
23:12:26-10/05/11 : perfpmr.sh begin
PERFPMR: hostname: slowhost
PERFPMR: perfpmr.sh Version 610 2010/12/01
Is the problem intermittent?
Here again, the perfpmr tools offer some key questions. Is the slow performance intermittent or constant? Is there a pattern to the slow behavior? For example, sometimes systems seem to peak when a large number of users log in at the start of the day, but then things settle down quickly.
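One way to look for a pattern is to record a metric with a timestamp over a day or more and then average it per hour. The following sketch assumes you already have samples in a simple "HH:MM:SS value" format, for example the wa column captured at regular intervals; how the samples are collected, and the function name `avg_by_hour`, are assumptions of this example.

```shell
# Sketch: average a metric per hour of day to expose a recurring pattern.
# Input (stdin): "HH:MM:SS value" per line; other lines are ignored.
avg_by_hour() {
    awk '
        $1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ {
            h = substr($1, 1, 2)      # hour of day
            sum[h] += $2
            n[h]++
        }
        END { for (h in sum) printf "%s %.1f\n", h, sum[h] / n[h] }
    ' | sort
}
```

If the 09 row stands well above the others, for instance, the start-of-day login peak described above becomes a measurable fact rather than an impression.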
What aspect is slow?
It can be useful to find out what exactly is leading users to report that the system is running slowly. Is it the time it takes to log in or the time it takes to echo a character? Perhaps a transaction is taking a long time to complete or a report is taking too long to generate.
Does a reboot provide a temporary fix?
If a reboot makes the problem disappear for a while, it could be because of a resource that is consumed but not released for use by other processes. If the problem does creep in again after a reboot, how long does that take? Sometimes it is worth disabling a particular process that you suspect is causing the slow response times.
It's always worth looking for processes that may be chewing up memory or processor time or putting excessive demand on I/O resources. The ps command has many options that help report the busiest processes. Listing 7 is an example.
Listing 7. The ps command reports the processes that are running
# ps -ef | more
     UID     PID    PPID   C    STIME    TTY   TIME CMD
    root       1       0   0   Oct 04      -   0:01 /etc/init
    root  655466 3866772   0   Oct 04      -   0:00 /usr/sbin/snmpd
    root 2097342       1   0   Oct 04      -   0:00 /bin/ksh /usr/tivoli/tsm/server/bin/rc.adsmserv
    root 2424972 3866772   0   Oct 04      -   0:00 /usr/sbin/inetd
    root 2883806       1   0   Oct 04      -   0:00 /usr/lib/errdemon
    root 2949246       1   0   Oct 04      -   0:00 /usr/ccs/bin/shlap64
    root 3276878 3866772   0   Oct 04      -   0:00 /usr/sbin/syslogd
    root 3604516       1   0   Oct 04      -   1:24 /usr/sbin/syncd 60
    root 3670082 3866772   0   Oct 04      -   0:05 /usr/sbin/xntpd
    root 3735676 3866772   0   Oct 04      -   0:00 /usr/sbin/muxatmd
    root 3801210 3866772   0   Oct 04      -   0:00 /usr/sbin/hostmibd
    root 3866772       1   0   Oct 04      -   0:00 /usr/sbin/srcmstr
    root 3932286 3866772   0   Oct 04      -   0:00 /usr/sbin/portmap
    root 3997832 3866772   0   Oct 04      -   0:00 /usr/sbin/aixmibd
    root 4063420       1   0   Oct 04      -   0:44 /usr/sbin/getty /dev/console
    root 4128936 3866772   0   Oct 04      -   0:03 sendmail: accepting connections
    root 4259980 3866772   0   Oct 04      -   0:00 /usr/sbin/snmpmibd
    root 4325556       1   0   Oct 04      -   0:02 /usr/sbin/cron
    root 4391124 3866772   0   Oct 04      -   0:03 /usr/sbin/rsct/bin/vac8/IBM.CSMAgentRMd
    root 4522176       1   0   Oct 04      -   0:00 /usr/bin/dsmcad
    root 4718774 3866772   0   Oct 04      -   0:00 /usr/sbin/rpc.lockd -d 0
    root 4784284 2424972   0   Oct 04      -   1:10 xmtopas -p3
    root 4980888 3866772   0   Oct 04      -   0:00 /usr/sbin/biod 6
    root 5177506 3866772   0   Oct 04      -   0:00 /usr/sbin/nfsd 3891
    root 5243046 3866772   0   Oct 04      -   0:00 /usr/sbin/rpc.mountd
    root 5439672 3866772   0   Oct 04      -   0:04 /usr/sbin/rsct/bin/rmcd -a IBM.LPCommands -r
    root 5570560       1   0   Oct 04      -   0:00 bin/nonstop_aix @config/nonstop.properties
    root 5701822 2097342 208   Oct 04      - 938:56 dsmserv quiet
    root 5832888       1   0   Oct 04      -   0:02 /usr/local/sbin/sshd
    root 5898436 3866772   0   Oct 04      -   0:00 /usr/sbin/qdaemon
    root 5963972       1   0   Oct 04      -   0:00 /usr/sbin/uprintfd
    root 6095040 3866772   0   Oct 04      -   0:00 /usr/sbin/writesrv
    root 6160590 3866772   0   Oct 04      -   0:08 /usr/sbin/pcmsrv
    root 6291682 3866772   0   Oct 04      -   0:00 /usr/sbin/rsct/bin/IBM.DRMd
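In a long process list, the busy entries are easy to miss. In Listing 7, only dsmserv has a nonzero value in the C column (208) and a large accumulated TIME. The sketch below sorts process listing text by that fourth column so such entries come first; the function name `busiest_by_c` is invented for this example, and it assumes the column layout shown in Listing 7.

```shell
# Sketch: show the N processes with the highest C value.
# Input (stdin): process listing text laid out as in Listing 7,
# where C is the fourth column. The header row is skipped.
busiest_by_c() {
    awk 'NR > 1' | sort -k4,4nr | head -n "${1:-5}"
}
```

For example, `ps -ef | busiest_by_c 3` would print the three rows with the highest C values. Treat the result as a starting point: a process that is busy at the moment you look is not necessarily the one causing the slowdown.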
Is the problem related to the network?
With a client/server configuration, it may be worth testing to see if the problem occurs when run locally on the server rather than across the network. You can run the application from the console and see if the response time is similar to when you connect across the network.
If the application uses a client/server model, you can do some basic testing from the client using ping server_IP_address (see Listing 8).
Listing 8. Ping by IP address
ping 192.168.168.30
PING 192.168.168.30: (192.168.168.30): 56 data bytes
64 bytes from 192.168.168.30: icmp_seq=0 ttl=255 time=0 ms
64 bytes from 192.168.168.30: icmp_seq=1 ttl=255 time=0 ms
64 bytes from 192.168.168.30: icmp_seq=2 ttl=255 time=0 ms
64 bytes from 192.168.168.30: icmp_seq=3 ttl=255 time=0 ms
----192.168.168.30 PING Statistics----
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 0/0/0 ms
A ping by IP address can help identify if the issue is related to Domain Name System (DNS) configuration. If you suspect network problems, a diagram or description of the network configuration is a useful starting point.
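To compare a ping by IP address with a ping by host name, it helps to reduce each run to the two numbers that matter. The sketch below extracts packet loss and average round-trip time from the summary lines in the format of Listing 8. The function name `ping_summary` is an assumption, and other ping implementations word their summary differently, so the patterns may need adjusting.

```shell
# Sketch: pull packet loss and average round-trip time out of ping output.
# Input (stdin): ping output with summary lines worded as in Listing 8.
ping_summary() {
    awk '
        /packet loss/ {
            # The loss figure is the field that ends in a percent sign.
            for (i = 1; i <= NF; i++) if ($i ~ /%$/) loss = $i
        }
        /^round-trip/ {
            split($4, rtt, "/")   # $4 holds the values, for example 0/0/0
            avg = rtt[2]
        }
        END { print "loss=" loss " avg_ms=" avg }
    '
}
```

Running `ping -c 4 192.168.168.30 | ping_summary` and then the same command with the server's host name gives two one-line results to compare. If the numbers match but the host-name run takes noticeably longer to start, name resolution is worth a closer look.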
What vendor applications are involved?
It is important to know what vendor applications are used on a system that is performing poorly. Often there are operating system tunables, recommended kernel settings, and other environment variables that you should use for some applications. There also may be patches for the application that fix known performance issues.
You should know what version/release/level of the vendor application is installed and if the application has been updated recently.
The perfpmr documentation recommends providing a clear written statement of a simple specific instance of the problem. It also recommends separating the symptoms and facts from theories, ideas, and your own conclusions. As the documentation says, "If all the facts are available, the performance team can quickly eliminate the unrelated ones."
Another piece of advice is to ensure the correct machine is being used for information gathering. In large sites—and especially with so many virtualized environments—it can be easy to collect data from the wrong system. As the documentation says, "This makes it very hard to investigate the problem."
To identify the machine model and serial number, you can use the
When you're working through a performance problem, it can be easy to lose track of what steps you've already taken to resolve the issues. Keeping a record of the actions taken to diagnose or fix the problem can save you a lot of wasted effort.
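A record does not need to be elaborate. As one possible approach, the sketch below appends timestamped notes to a plain-text file. The function name `note`, the `PERF_NOTES` variable, and the default file name are all hypothetical choices for this example.

```shell
# Sketch: append timestamped notes to a plain-text log of diagnostic steps.
# PERF_NOTES is a hypothetical variable naming the log file;
# it defaults to perf-notes.log in the current directory.
note() {
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" \
        >> "${PERF_NOTES:-./perf-notes.log}"
}
```

A call such as `note "changed queue_depth on hdisk7 from 3 to 20"` leaves a dated line that you can later match against the times when performance improved or degraded.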
The rewards of patience
Fixing performance problems requires good diagnostic skills, an ability to separate facts from theories and assumptions, and, above all, patience. Often the solution is a simple one, and your efforts are rewarded with improved system performance. The next article in this two-part series looks at some practices that can help you prevent performance bottlenecks from occurring in the first place.
Related topics
- Download perfpmr, the AIX Performance PMR data collection tools.
- Check out the Reporting a performance problem section of the AIX documentation to learn how to report a performance problem to IBM Support.
- Visit the Performance Monitoring Tips and Techniques wiki.
- "Optimizing AIX 7 memory performance" (developerWorks, November 2010 and January 2011) is a three-part series that looks at tuning parameters and best practices for memory tuning and discusses some improvements in AIX 6 and AIX 7.
- See IBM's AIX Version 7.1 Performance Management document.
- Read "Insufficient Evidence When Problems Occur" (IBM Systems Magazine, August 2011) to learn what to do after a system disaster with no identifiable root cause.