The performance detective, part 2: Prevention is better than cure

Practical steps to avoid system bottlenecks

This second article in a two-part series on managing system performance looks at preventing performance problems. By keeping your system well-tuned, you can avoid a lot of stress. There are also steps you can take so that if the system does start to choke you're ready to identify the bottlenecks quickly and know where to go for help.

David Tansley (david.tansley@btinternet.com), System Administrator, Ace Europe

David Tansley is a freelance writer. He has 15 years of experience as a UNIX administrator, using AIX for the last eight years. He enjoys playing badminton, then relaxing watching Formula 1, but nothing beats riding and touring on his GSA motorbike with his wife.



01 December 2010


When a system is running too slowly, you have to identify which of the many components in the configuration might be the culprit. It could be slow disk I/O. Or maybe the system is short of memory. Is the application poorly written, or is there a runaway process? Perhaps there's an adapter firmware fix that will make all the problems go away.

The bottleneck could be any one of these items, a combination of them, or maybe something else. When a performance problem rears its ugly head, you can be under a lot of pressure to fix it. A proactive approach to system performance can spare you this stress. Some forward planning and early intervention can help you avoid huge performance bottlenecks. It's also valuable to take a step back and consider whether the expectations of the system are reasonable for the workload.

Take a proactive approach

Here are some practices to help you prevent performance problems, or at least prepare you to be ready to respond when they do occur.

Document the configuration

It's important to share your knowledge of your configuration with others who are able to understand and manage it. The best way to do this is to document your configuration. Pay special attention to those aspects that use non-default settings or do not follow industry standards.

IBM® Power Systems™ servers have many tools to help you document your configuration. For example, you can list the virtual machine's configuration using commands such as prtconf, as shown in Listing 1.

Listing 1. The prtconf command displays the system configuration
# prtconf
		
System Model: IBM,9117-MMA
Machine Serial Number: 01A02B3
Processor Type: PowerPC_POWER6
Processor Implementation Mode: POWER 6
Processor Version: PV_6_Compat
Number Of Processors: 2
Processor Clock Speed: 4208 MHz
CPU Type: 64-bit
Kernel Type: 64-bit
LPAR Info: 4 A3_everest-nim
Memory Size: 8192 MB
Good Memory Size: 8192 MB
Platform Firmware level: EM350_108
Firmware Version: IBM,EM350_108

Your documentation should include disk configuration, environment variables, operating system tunable settings, device attributes, application versions, and the system model, type, and serial number. If you need to make a support call because of slow system performance, it helps to have details about the system at your fingertips.
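
One way to keep such details current is to script their collection. The following is a minimal sketch, not a complete inventory tool. The function names snapshot_cmd and snapshot_config, the default directory /var/adm/config_snapshots, and the selection of commands are assumptions made for illustration; prtconf, oslevel, lsattr, and lspv are standard AIX commands, and you would add whatever else describes your own configuration.

```shell
#!/bin/sh
# Sketch: save the output of several configuration commands to one dated file.
# The directory name and the list of commands are assumptions; adjust both.

# Run one command and append its output, under a heading, to the file in $1.
# Commands that are not installed are noted and skipped; failures are recorded.
snapshot_cmd() {
    out_file=$1
    shift
    printf '===== %s =====\n' "$*" >> "$out_file"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" >> "$out_file" 2>&1 ||
            printf '(exit status %s)\n' "$?" >> "$out_file"
    else
        printf '(command not found: %s)\n' "$1" >> "$out_file"
    fi
}

# Collect a snapshot and print the name of the file that was written.
# The directory argument defaults to a hypothetical path.
snapshot_config() {
    snap_dir=${1:-/var/adm/config_snapshots}
    mkdir -p "$snap_dir" || return 1
    snap_file="$snap_dir/$(uname -n).$(date +%Y%m%d).txt"
    : > "$snap_file"                             # start a fresh file for today
    snapshot_cmd "$snap_file" prtconf            # system model, serial, memory
    snapshot_cmd "$snap_file" oslevel -s         # operating system level
    snapshot_cmd "$snap_file" lsattr -El sys0    # system device attributes
    snapshot_cmd "$snap_file" lspv               # physical volumes
    snapshot_cmd "$snap_file" env                # environment variables
    printf '%s\n' "$snap_file"
}
```

Calling snapshot_config from a scheduled job gives you one file per host per day, which also makes it easy to compare two dates with diff after a change.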

Use change-management procedures

With proper change management, you can alert other stakeholders to impending changes, provide step-by-step details of the changes, and prepare workable rollback plans.

Implement recommended tuning settings

By implementing recommended tuning parameters, you can make better use of the existing hardware and relieve the pressure on resources before it starts building up.

Even with a high-performing storage subsystem, you can have I/O bottlenecks at the operating-system level. A common example of this is when the I/O queue for a physical volume is filling up.

The iostat output in Listing 2 shows a physical volume whose service queue cannot keep up with the requests it is receiving. The sqfull value is nonzero, which means the queue has been filling up and other I/O requests have had to wait.

Listing 2. The iostat command showing a full queue
hdisk89        xfer:  %tm_act      bps      tps      bread      bwrtn
                        75.0     43.4M   336.5        6.1K      43.4M
               read:      rps  avgserv  minserv  maxserv   timeouts      fails
                         1.5    123.9     15.1      1.0S          0          0
              write:      wps  avgserv  minserv  maxserv   timeouts      fails
                       335.0     28.1      1.5    275.0           0          0
              queue:  avgtime  mintime  maxtime  avgwqsz    avgsqsz     sqfull
                        43.9      0.0    297.1      2.5        1.6     239.0
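
When many disks are involved, scanning this output by eye gets tedious. As a rough sketch, the filter below reads output laid out like Listing 2 and prints each disk whose sqfull value is greater than zero. The function name full_queues is an assumption, and the parsing relies on the values appearing on the line directly after the "queue:" heading with sqfull as the sixth column, as in the listing; check that against the output of your own AIX level before relying on it.

```shell
# Sketch: print "disk sqfull" for every disk whose service queue has filled.
# Reads iostat extended disk output (laid out as in Listing 2) on stdin.
full_queues() {
    awk '
        $2 == "xfer:"  { disk = $1 }          # remember the current disk name
        $1 == "queue:" { want = 1; next }     # values are on the next line
        want {
            want = 0
            if ($6 + 0 > 0) print disk, $6    # sixth column is sqfull
        }
    '
}
```

A possible use is iostat -D 5 3 | full_queues, which would list only the disks that deserve a closer look.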

You can increase the queue depth using the chdev command, as shown in Listing 3.

Listing 3. Increase queue depth
chdev -l hdisk89 -a queue_depth=20
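
Before changing the value, it is worth recording what it was and what the device allows. The sketch below wraps those checks around the chdev command; the function name set_queue_depth and the DRY_RUN variable are assumptions introduced here. It uses the -P flag of chdev, which records the change in the device database so that it takes effect at the next restart; that is a deliberate variation on Listing 3, because chdev might refuse to change a disk that is in use. The right depth depends on your storage subsystem, so treat 20 as an example value only.

```shell
# Sketch: show the current and allowed queue_depth, then change it.
# Set DRY_RUN=1 to print the chdev command instead of running anything.
set_queue_depth() {
    disk=$1
    depth=$2
    case $depth in
        ''|*[!0-9]*)
            echo "queue depth must be a whole number" >&2
            return 1
            ;;
    esac
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "chdev -l $disk -a queue_depth=$depth -P"
        return 0
    fi
    lsattr -El "$disk" -a queue_depth    # current value, for your change record
    lsattr -Rl "$disk" -a queue_depth    # values the device accepts
    # -P defers the change until the device is next configured (for example,
    # at the next restart), so a busy disk does not block the command.
    chdev -l "$disk" -a queue_depth="$depth" -P
}
```

Running DRY_RUN=1 set_queue_depth hdisk89 20 first lets a colleague review the exact command before it is applied.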

Take one step at a time

When you are making a configuration change or attempting to fix a performance problem, it's helpful to break your changes down into simple, measurable steps. If you make three changes at once, you might fix the problem, but you won't know whether one change or a combination of them ultimately resolved it.

Ask for help early

There's no reason to tackle performance problems on your own. Before you implement a major configuration change, check with others who have done it before or at least might have done something similar. You don't have to wait until everything goes wrong to log a support call.

Do periodic performance reviews

Even if your systems are running well, it can be worth spending a little time once in a while to see if the configuration could use some tweaking. For example, you might have some virtual machines that are no longer as critical as they once were. Do they still need the fastest disks? Do they have more processor or memory allocated than they need? Perhaps those resources could be better used by other virtual machines that need them more.

Implement capacity planning

Rapidly growing workloads can easily take you by surprise. Do some regular capacity planning to see whether additional resources might be needed. Capacity planning can include evaluating disk requirements, processor usage, and memory usage. Also keep in mind the flow-on effect of increasing capacity. Pay special attention to backup infrastructure and disaster-recovery requirements.

Regular capacity planning can prevent surprise budget outlay and emergency installation of new hardware.
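
For the disk part of capacity planning, even a very small script can provide an early warning and a trend to plot. The sketch below is one possible approach; the function names over_threshold and record_usage, the default threshold of 80 percent, and the history file are assumptions. It relies on the POSIX output format of df (the -P flag), in which the fifth column is the capacity used and the sixth is the mount point, and it does not handle mount points that contain spaces.

```shell
# Sketch: flag filesystems at or above a usage threshold (default 80 percent).
# Reads "df -P -k" style output on standard input.
over_threshold() {
    awk -v limit="${1:-80}" '
        NR > 1 {                              # skip the heading line
            pct = $5
            sub(/%/, "", pct)                 # "90%" becomes "90"
            if (pct + 0 >= limit + 0) print $6, $5
        }
    '
}

# Append today's usage of every filesystem to a history file ($1),
# one "date mountpoint capacity" line per filesystem.
record_usage() {
    df -P -k |
        awk -v day="$(date +%Y-%m-%d)" 'NR > 1 { print day, $6, $5 }' >> "$1"
}
```

For example, df -P -k | over_threshold 85 lists the filesystems that need attention now, while a daily record_usage run builds the history you need to estimate when a filesystem will fill.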

Avoid point solutions

Many systems underperform because they don't take advantage of available technology. For example, a point solution with dedicated I/O adapters and disks means that those resources are not available to be shared by other virtual machines. IBM PowerVM® virtualization has so many built-in features to help balance out the peaks and troughs that it doesn't make sense to bypass them by implementing a point solution.

Protect log files

Preparing for the worst is a good approach for preventing it from happening. Check that your system error report and application log files are easy to locate and not being overwritten or removed too quickly to allow proper analysis.

It is helpful to capture both the standard output and any error messages. To capture standard output, use 1> followed by a file name, as shown in Listing 4.

Listing 4. Capture standard output
# ./myscript.sh 1>myscript.out

Just as importantly, you can capture standard error messages. To do this, redirect file descriptor 2 to a file, as shown in Listing 5.

Listing 5. Capture standard error messages
# ./myscript.sh 2>myscript.err

You can combine the standard output and error messages in one file, as shown in Listing 6.

Listing 6. Capture standard output and error messages in one file
# ./myscript.sh > myscript.out 2>&1
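
A plain redirection like this overwrites the file on every run, which works against the goal of keeping logs long enough to analyze. One way around that, sketched below, is to give each run its own timestamped file. The function name run_logged and the file naming scheme are assumptions, and two runs started by the same shell within the same second would still share a name.

```shell
# Sketch: run a command with standard output and standard error captured in
# a new, timestamped file, so an earlier run's log is not overwritten.
# Usage: run_logged <log-directory> <command> [arguments...]
run_logged() {
    log_dir=$1
    shift
    mkdir -p "$log_dir" || return 1
    log_file="$log_dir/$(basename "$1").$(date +%Y%m%d_%H%M%S).$$.log"
    status=0
    "$@" > "$log_file" 2>&1 || status=$?     # same redirection as Listing 6
    printf '%s\n' "$log_file"                # tell the caller where the log is
    return "$status"                         # preserve the command's exit status
}
```

For example, run_logged /var/log/myscript ./myscript.sh prints the name of the log it wrote and returns the exit status of the script. You still need a policy for pruning old files so that the log directory itself does not fill up.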

Schedule regular system maintenance

You can avoid many performance problems with simple preventive maintenance. When system or device firmware is released, it often includes performance enhancements. Similarly, it makes sense to dedicate a regular maintenance window for operating system updates.

Establish a baseline

When a system is responding slowly, chances are the first you'll hear about it is from anecdotal evidence. That's a helpful start, but it's good if you have some point of comparison. How slow is slow? How does it compare with a similar workload? It's important to have some framework around reporting and managing performance problems. And that framework is something that you should put in place before you need to use it.
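
A baseline does not have to be elaborate. As one possible starting point, the sketch below times a representative job on every run and keeps the results in a comma-separated history file, so that "slow" can be compared with a number. The function names time_job and baseline_avg and the file layout are assumptions, the labels must not contain commas, and the timing relies on date +%s being available on your system.

```shell
# Sketch: append one record per run: timestamp, label, elapsed seconds, status.
# Usage: time_job <history-file> <label> <command> [arguments...]
time_job() {
    history_file=$1
    label=$2
    shift 2
    start=$(date +%s)
    status=0
    "$@" || status=$?
    end=$(date +%s)
    printf '%s,%s,%s,%s\n' "$(date +%Y-%m-%dT%H:%M:%S)" "$label" \
        "$((end - start))" "$status" >> "$history_file"
    return "$status"
}

# Print the average elapsed seconds recorded for one label.
# Usage: baseline_avg <history-file> <label>
baseline_avg() {
    awk -F, -v label="$2" '
        $2 == label { total += $3; n++ }
        END { if (n) printf "%.1f\n", total / n }
    ' "$1"
}
```

With a few weeks of records, a report that today took twice the value printed by baseline_avg is a much firmer basis for a support call than "it feels slow".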

Plan for problems

In addition to proactive practices, it's good to have a plan for those (hopefully rare) times when the system is just not working as well as it should. This plan should include the items listed in Table 1.

Table 1. What to include in your plan for when there are problems
Item | Reason
Process for reporting performance problems | So users can easily describe and document symptoms of the problem
Support contacts | To make it easy to know whom to contact and how to log support calls with external support teams
Record of changes | To outline:
  • What was changed
  • How it was changed
  • When it was changed
  • Whether it worked (or maybe made things worse)
  • How to reverse it
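
The record of changes can be as simple as a text file that is appended to every time someone touches the system. The sketch below shows one hypothetical layout covering the five points in the table; the function name log_change, the field labels, and the default result of "pending" are assumptions, and a formal change-management tool would replace it where one exists.

```shell
# Sketch: append one change record to a plain-text log.
# Usage: log_change <log-file> <what> <how> <rollback> [result]
log_change() {
    change_log=$1
    {
        printf 'When:     %s\n' "$(date '+%Y-%m-%d %H:%M')"
        printf 'What:     %s\n' "$2"
        printf 'How:      %s\n' "$3"
        printf 'Rollback: %s\n' "$4"
        printf 'Result:   %s\n\n' "${5:-pending}"   # update once you know
    } >> "$change_log"
}
```

For example, log_change /var/adm/changes.log "Raised queue depth on hdisk89" "chdev -l hdisk89 -a queue_depth=20" "chdev with the previous value" writes a record whose result can be filled in after the change has been measured.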

When a system suddenly starts running slowly, time is critical. Perhaps a lot of money will be lost. Having this plan in place allows you to focus your energies where they're most needed.


Manage expectations

When a system appears to be grinding to a halt, a level head is what is most needed. It's always worth "wasting" a little time to assess what's really going wrong, what the impact is, and how best to approach it.

Allow time for diagnosing the problem

A little extra time diagnosing what is wrong can be a big time saver later. As far as possible, you need to be ready to find the root cause (or causes) and have your detective magnifying glass and raincoat handy. The better prepared you are to capture snapshots, error reports, backups, and system dumps, the quicker the process of diagnosis and tuning should be.

It might take nerves of steel to tell screaming users that you're doing some information gathering, but the time investment is worth it. It's a matter of managing user expectations. It's better to overestimate how long the fix might take than to disappoint by promising a deadline you can't meet.

Even if you don't manage to pinpoint the root cause, you can at least take steps to make it easier for you if the problem happens again. Some of these are outlined in the article "Insufficient Evidence When Problems Occur" (see Resources).

Workload expectations

Think about the workload that the system itself is expected to do. Here are some questions that are worth asking:

  • Is the system underconfigured? Has the workload outgrown the hardware's capacity?
  • Is the actual job that is being run necessary? If so, could you run it at another time when the system is under less strain?
  • Is there a simpler, less resource-hungry approach? For example, could you do incremental backups instead of full backups? Is there a memory leak that is fixed in a later release of the software? Are there inefficient reports or applications that are putting the system under undue strain?

Expectations of workers

Just as it is unreasonable to expect a system to perform far beyond its capacity, it's even more unreasonable to expect it of the people who are working on the performance problem. They need adequate training, local knowledge of the system, and a chance to rest. There are many performance problems that can be resolved quickly but aren't because the people working on them are sleep deprived and under pressure themselves.

It's important to foresee that a big change might run into the early hours and beyond. When one person is working alone through the night (usually after a full day's work), it's a recipe for disaster. It's better to share the workload or extend the implementation time to allow for breaks and adequate rest.


Stay in control

Keep control over who does what to a system. Change management, sensible security practices, and a peer review of changes that have a potential impact are all important.

When performance problems occur on critical systems, the resource you're most in need of is time. By taking a proactive approach to system performance, you can spare yourself that pain and pressure, or at least minimize the impact of a system that is choking for resources.

Resources

Learn

Get products and technologies

  • Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.
  • Try out IBM software for free. Download a trial version, log into an online trial, work with a product in a sandbox environment, or access it through the cloud. Choose from over 100 IBM product trials.
