When a system is running too slowly, you have to identify which of the many components in the configuration might be the culprit. It could be slow disk I/O. Or maybe the system is short of memory. Is it the application that's poorly written or a runaway process? Perhaps there's an adapter firmware fix that will make all the problems go away.
The bottleneck could be any one of these items, a combination of them, or something else entirely. When a performance problem rears its ugly head, you can be under a lot of pressure to fix it. A proactive approach to system performance can spare you this stress. Some forward planning and early intervention can help you avoid huge performance bottlenecks. It's also valuable to take a step back and consider whether the expectations of the system are reasonable for the workload.
Here are some practices to help you prevent performance problems, or at least prepare you to respond when they do occur.
It's important to share your knowledge of your configuration with others who are able to understand and manage it. The best way to do this is to document your configuration. Pay special attention to those aspects that use non-default settings or do not follow industry standards.
IBM® Power Systems™ servers have many tools to help you document your configuration. For example, you can list the virtual machine's configuration using commands such as prtconf, as shown in Listing 1.
Listing 1. The prtconf command displays the system configuration
# prtconf
System Model: IBM,9117-MMA
Machine Serial Number: 01A02B3
Processor Type: PowerPC_POWER6
Processor Implementation Mode: POWER 6
Processor Version: PV_6_Compat
Number Of Processors: 2
Processor Clock Speed: 4208 MHz
CPU Type: 64-bit
Kernel Type: 64-bit
LPAR Info: 4 A3_everest-nim
Memory Size: 8192 MB
Good Memory Size: 8192 MB
Platform Firmware level: EM350_108
Firmware Version: IBM,EM350_108
Your documentation should include disk configuration, environment variables, operating system tunable settings, device attributes, application versions, and system models, types, and serial numbers. If you need to make a support call because of slow system performance, it helps to have details about the system at your fingertips.
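One way to keep this documentation current is to script it. The following is a minimal sketch, not a definitive tool: the function name snapshot_config, the default output directory, and the choice of commands are all assumptions for illustration. It saves the output of each command to a dated directory and notes any command that is not installed.

```shell
#!/bin/sh
# Sketch: save the output of several configuration commands to dated files.
# The function name, output directory, and command list are illustrative
# assumptions; adjust them to match your own environment.

snapshot_config() {
    outdir=${1:-/tmp/config-snapshot}/$(date +%Y%m%d)
    mkdir -p "$outdir" || return 1

    # Each line is "file name|command". A command that is not installed
    # is recorded as skipped rather than treated as an error.
    while IFS='|' read -r name cmd; do
        [ -n "$name" ] || continue
        prog=${cmd%% *}
        if command -v "$prog" >/dev/null 2>&1; then
            $cmd > "$outdir/$name.txt" 2>&1 < /dev/null
        else
            echo "skipped: $prog not found" > "$outdir/$name.txt"
        fi
    done <<EOF
prtconf|prtconf
oslevel|oslevel -s
lspv|lspv
lsvg|lsvg
sys0_attrs|lsattr -El sys0
vmo|vmo -a
ioo|ioo -a
no|no -a
schedo|schedo -a
filesets|lslpp -L
environment|env
EOF

    # Print the directory so the caller knows where the snapshot went.
    echo "$outdir"
}
```

Calling snapshot_config /var/tmp/config-docs before and after a change gives you two directories that you can compare with diff.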
With proper change management, you can alert other stakeholders of impending changes, provide step-by-step details of the changes, and prepare workable rollback plans.
By implementing recommended tuning parameters, you can make better use of the existing hardware and relieve the pressure on resources before it starts building up.
Even with a high-performing storage subsystem, you can have I/O bottlenecks at the operating-system level. A common example of this is when the I/O queue for a physical volume is filling up.
The iostat output in Listing 2 shows that the service queue for this physical volume cannot keep up with the service requests it's receiving. See the sqfull value in the last column of the queue values. A nonzero value here means that other I/O requests have to wait.
Listing 2. The iostat command showing a full queue
hdisk89   xfer:   %tm_act      bps      tps    bread    bwrtn
                     75.0    43.4M    336.5     6.1K    43.4M
          read:       rps  avgserv  minserv  maxserv  timeouts  fails
                      1.5    123.9     15.1     1.0S         0      0
          write:      wps  avgserv  minserv  maxserv  timeouts  fails
                    335.0     28.1      1.5    275.0         0      0
          queue:  avgtime  mintime  maxtime  avgwqsz   avgsqsz  sqfull
                     43.9      0.0    297.1      2.5       1.6   239.0
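Checking every disk by eye gets tedious on a system with many physical volumes. The following sketch filters iostat -D style output and prints only the disks with a nonzero sqfull value. It assumes the usual multi-line layout, in which each label row (xfer:, read:, write:, queue:) is followed by a row of values; the function name report_full_queues is hypothetical.

```shell
#!/bin/sh
# Sketch: read extended disk statistics on standard input and report
# disks whose service queue filled up (sqfull greater than zero).
# Assumes each "queue:" label row is followed by one row of values,
# with sqfull as the last value on that row.

report_full_queues() {
    awk '
        # The disk name is the first field on the line that carries "xfer:".
        $2 == "xfer:"  { disk = $1 }

        # This is the values row that follows a "queue:" label row.
        want_values    {
            if ($NF + 0 > 0) print disk, "sqfull=" $NF
            want_values = 0
        }

        # Remember that the next line holds the queue values.
        $1 == "queue:" { want_values = 1 }
    '
}
```

For example, you might run iostat -D 60 1 | report_full_queues during a busy period. How sqfull is counted can differ between AIX levels, so treat the numbers as an indicator rather than an exact measure.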
You can increase the queue depth using the chdev command, as shown in Listing 3.
Listing 3. Increase queue depth
# chdev -l hdisk89 -a queue_depth=20
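Before changing the value, it's worth checking what it is now. The sketch below reads the current queue_depth with lsattr and calls chdev only if the current value is lower than the target. The function name set_queue_depth is hypothetical, and the use of the -P flag is an assumption about your situation: it records the change in the device database only, so it takes effect the next time the device is configured (typically at restart), which is useful when the disk is in use.

```shell
#!/bin/sh
# Sketch: raise queue_depth for a disk only when it is below the target.
# Assumes the AIX lsattr and chdev commands. The valid range depends on
# the storage device; "lsattr -Rl <disk> -a queue_depth" lists it.

set_queue_depth() {
    disk=$1
    target=$2

    # lsattr -El prints: attribute  value  description  user_settable
    current=$(lsattr -El "$disk" -a queue_depth | awk '{ print $2 }')
    if [ -z "$current" ]; then
        echo "cannot read queue_depth for $disk" >&2
        return 1
    fi

    if [ "$current" -ge "$target" ]; then
        echo "$disk queue_depth already $current"
        return 0
    fi

    # -P defers the change until the device is next configured.
    chdev -l "$disk" -a queue_depth="$target" -P
}
```

For example, set_queue_depth hdisk89 20 leaves the disk alone if the queue depth is already 20 or more. Check your storage vendor's recommendation before choosing a target value.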
When you are making a configuration change or attempting to fix a performance problem, it's helpful to break your changes down into simple, measurable steps. If you make three changes at once, you might fix the problem, but you may not know which change, or which combination of changes, ultimately resolved it.
There's no reason to tackle performance problems on your own. Before you implement a major configuration change, check with others who have done it before or at least might have done something similar. You don't have to wait until everything goes wrong to log a support call.
Even if your systems are running well, it can be worth spending a little time once in a while to see if the configuration could use some tweaking. For example, you might have some virtual machines that are no longer as critical as they once were. Do they still use the optimal disks or have too much processor or memory allocation? Perhaps those resources could be better used by other virtual machines that need them more.
Rapidly growing workloads can easily take you by surprise. Do some regular capacity planning to see whether additional resources might be needed. Capacity planning can include evaluating disk requirements, processor usage, and memory usage. Also keep in mind the flow-on effect of increasing capacity. Pay special attention to backup infrastructure and disaster-recovery requirements.
Regular capacity planning can prevent surprise budget outlay and emergency installation of new hardware.
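A small part of capacity planning can be automated. As a minimal sketch, the following lists file systems at or above a usage threshold, using the POSIX output format of df. The function names and the default threshold of 80 percent are assumptions for illustration, and mount points that contain spaces are not handled.

```shell
#!/bin/sh
# Sketch: list file systems at or above a usage threshold.
# Expects df -P style input, one line per file system:
# Filesystem  1024-blocks  Used  Available  Capacity  Mounted on

filter_fs_usage() {
    awk -v limit="${1:-80}" '
        NR > 1 {
            pct = $5
            sub(/%/, "", pct)          # strip the percent sign
            if (pct + 0 >= limit) print $6, pct "%"
        }'
}

check_fs_usage() {
    # -P selects the portable format, -k reports 1024-byte blocks.
    df -Pk | filter_fs_usage "$1"
}
```

Running check_fs_usage 90 from a scheduled job and keeping the results over time shows you how quickly each file system is growing, which is more useful for planning than a single reading.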
Many systems underperform because they don't take advantage of available technology. For example, a point solution with dedicated I/O adapters and disks means that those resources are not available to be shared by other virtual machines. IBM PowerVM® virtualization has so many built-in features to help balance out the peaks and troughs that it doesn't make sense to bypass them by implementing a point solution.
Preparing for the worst is a good approach for preventing it from happening. Check that your system error report and application log files are easy to locate and are not being overwritten or removed too quickly to allow proper analysis.
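One quick check is to see how far back the error report currently reaches. The sketch below reads the summary output of the AIX errpt command, which lists the newest entries first, and prints the timestamp of the oldest entry. It assumes the standard summary layout, in which the second column is a timestamp in MMDDhhmmYY form; the function name oldest_errpt_entry is hypothetical.

```shell
#!/bin/sh
# Sketch: print the date and time of the oldest entry in errpt summary
# output read from standard input.
# Assumes the header line is first, entries are newest first, and the
# second column is a timestamp in MMDDhhmmYY form.

oldest_errpt_entry() {
    awk '
        NR > 1 && length($2) == 10 { ts = $2 }   # keep the last one seen
        END {
            if (ts == "") exit
            printf "20%s-%s-%s %s:%s\n",
                substr(ts, 9, 2), substr(ts, 1, 2), substr(ts, 3, 2),
                substr(ts, 5, 2), substr(ts, 7, 2)
        }'
}
```

If errpt | oldest_errpt_entry shows that the log reaches back only a few days on a busy system, the error log might be too small to allow proper analysis after the event.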
It is helpful to capture both the standard output and any error messages. To capture standard output, use 1> followed by a file name, as shown in Listing 4.
Listing 4. Capture standard output
# ./myscript.sh 1>myscript.out
Just as importantly, you can capture standard error messages. To do this, redirect file descriptor 2 to a file with 2>, as shown in Listing 5.
Listing 5. Capture standard error messages
# ./myscript.sh 2>myscript.err
You can combine the standard output and error messages in one file, as shown in Listing 6.
Listing 6. Capture standard output and error messages in one file
# ./myscript.sh > myscript.out 2>&1
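The same redirection can be wrapped in a small helper so that every run is appended to the log with a timestamp and its exit status. This is a sketch only; the function name run_logged and the log line format are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: run a command, appending its standard output and standard
# error to a log file, with a start line and an exit-status line.

run_logged() {
    log=$1
    shift
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') start: $*"
        "$@"
        status=$?
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') exit status: $status"
    } >> "$log" 2>&1
    # Pass the command's own exit status back to the caller.
    return $status
}
```

For example, run_logged myscript.log ./myscript.sh keeps a history of each run in one file. Because the helper appends rather than overwrites, an earlier failure is still there when you come to investigate.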
You can avoid many performance problems with simple preventive maintenance. When system or device firmware is released, it often includes performance enhancements. Similarly, it makes sense to dedicate a regular maintenance window for operating system updates.
When a system is responding slowly, chances are the first you'll hear about it is from anecdotal evidence. That's a helpful start, but it's good if you have some point of comparison. How slow is slow? How does it compare with a similar workload? It's important to have some framework around reporting and managing performance problems. And that framework is something that you should put in place before you need to use it.
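A point of comparison is easier to provide if you collect it while the system is healthy. As a minimal sketch, the following saves a short run of vmstat samples to a timestamped file. The function name capture_baseline, the default directory, and the default interval and count are all assumptions; you might prefer a different tool or a longer sampling period.

```shell
#!/bin/sh
# Sketch: save a series of vmstat samples to a timestamped file so that
# later readings can be compared against a known-good baseline.

capture_baseline() {
    dir=${1:-/var/tmp/perf-baseline}
    interval=${2:-5}     # seconds between samples
    count=${3:-12}       # number of samples

    mkdir -p "$dir" || return 1
    out="$dir/vmstat.$(date +%Y%m%d-%H%M%S).txt"
    vmstat "$interval" "$count" > "$out" 2>&1

    # Print the file name so the caller can find the samples.
    echo "$out"
}
```

Scheduling capture_baseline at the same times each day, for example during the batch window and at peak online hours, gives you like-for-like figures when someone reports that the system feels slow.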
In addition to proactive practices, it's good to have a plan for those (hopefully rare) times when the system is just not working as well as it should. This plan should include the items listed in Table 1.
Table 1. What to include in your plan for when there are problems
| Process for reporting performance problems | So users can easily describe and document symptoms of the problem |
| Support contacts | To make it easy to know whom to contact and how to log support calls with external support teams |
| Record of changes | To outline: |
When a system suddenly starts running slowly, time is critical. Perhaps a lot of money will be lost. Having this plan in place allows you to focus your energies where they're most needed.
When a system appears to be grinding to a halt, a level head is what is most needed. It's always worth "wasting" a little time to assess what's really going wrong, what the impact is, and how best to approach it.
A little extra time diagnosing what is wrong can be a big time saver later. As far as possible, you need to be ready to find the root cause (or causes) and have your detective magnifying glass and raincoat handy. The better prepared you are to capture snapshots, error reports, backups, and system dumps, the quicker the process of diagnosis and tuning should be.
It might take nerves of steel to tell screaming users that you're doing some information gathering, but the time investment is worth it. It's a matter of managing user expectations. It's better to overestimate how long the fix might take than to disappoint by promising a deadline you can't meet.
Even if you don't manage to pinpoint the root cause, you can at least take steps to make it easier for you if the problem happens again. Some of these are outlined in the article "Insufficient Evidence When Problems Occur" (see Resources).
Think about the workload that the system itself is expected to do. Here are some questions that are worth asking:
- Is the system underconfigured? Has the workload outgrown the hardware's capacity?
- Is the actual job that is being run necessary? If so, could you run it at another time when the system is under less strain?
- Is there a simpler, less resource-hungry approach? For example, could you do incremental backups instead of full backups? Is there a memory leak that is fixed in a later release of the software? Are there inefficient reports or applications that are putting the system under undue strain?
Just as it is unreasonable to expect a system to perform far beyond its capacity, it's even more unreasonable to expect it of the people who are working on the performance problem. They need adequate training, local knowledge of the system, and a chance to rest. There are many performance problems that can be resolved quickly but aren't because the people working on them are sleep deprived and under pressure themselves.
It's important to foresee that a big change might run into the early hours and beyond. When one person is working alone through the night (usually after a full day's work), it's a recipe for disaster. It's better to share the workload or extend the implementation time to allow for breaks and adequate rest.
Keep control over who does what to a system. Change management, sensible security practices, and a peer review of changes that have a potential impact are all important.
When performance problems occur on critical systems, the resource you're most in need of is time. By taking a proactive approach to system performance, you can spare yourself that pain and pressure, or at least minimize the impact of a system that is choking for resources.
See IBM's AIX Version 7.1 Performance management document.
Read "Insufficient Evidence When Problems Occur" (IBM Systems Magazine, August 2011) to learn what to do after a system disaster with no identifiable root cause.