The Support Authority: 12 ways you can prepare for effective production troubleshooting

Rather than focus on what to do after a problem happens, here are 12 things you can do to your environment now to make troubleshooting quicker and more effective when problems do occur. This content is part of the IBM WebSphere Developer Technical Journal.

Daniel Julin (dpj@us.ibm.com), WebSphere Serviceability Technical Area Lead, IBM

Daniel Julin has 20 years of experience developing and troubleshooting complex online systems. As Technical Area Lead for the WebSphere Serviceability Team, he currently focuses on helping the team define and implement a collection of tools and techniques to assist in problem determination for WebSphere Application Server, and to maximize the efficiency of IBM Support. He also occasionally assists directly in various critical customer support situations.



22 August 2007


In each column, The Support Authority discusses resources, tools, and other elements of IBM Technical Support that are available for WebSphere products, plus techniques and new ideas that can further enhance your IBM support experience.

This just in...

As always, we begin with some new items of interest for the WebSphere® community at large:

  1. IBM® Guided Activity Assistant Version 3.1 has just been released, and with this version, the IBM Guided Activity Assistant becomes a fully supported problem determination tool. It also includes an improved user interface and many new features.

  2. Three problem determination tools available in the IBM Support Assistant have been updated:

    • IBM Dump Analyzer for Java™ examines the contents of a JVM system dump and diagnoses the most common problems and the overall state of that JVM.

    • Extensible Verbose Toolkit (EVTK) examines JVM verbose GC logs and diagnoses garbage-collection problems.

    • IBM Pattern Modeling and Analysis Tool for Java Garbage Collector (PMAT) examines a JVM verbose GC log. (With this new release, the version of PMAT in IBM Support Assistant is identical to the version available through alphaWorks.)

  3. Last time, we announced the introduction of Fix Central, a centralized Web site to facilitate finding, downloading, and installing fixes for many IBM products. WebSphere Application Server has just been added to the family of products using Fix Central.

Continue to monitor the various support-related Web sites, as well as this column, for news about other tools as we encounter them.

And now, on to our main topic...


Thoughtful preparation for smarter troubleshooting

In discussions of techniques and tools for troubleshooting and problem determination, attention typically centers on what to do after a problem has been discovered. Ideally, however, the prudent system administrator or troubleshooter should start thinking about the job long before a problem occurs; in other words, prepare the environment so that troubleshooting can be performed more quickly and effectively if and when problems eventually do occur.

This article presents a dozen recommendations that can be implemented to help expedite problem resolution, even in the most complex production environments. This list, which is neither definitive nor absolute, is based on general observations of client environments and problems seen by IBM WebSphere Support. However, every environment has unique factors and constraints that might make some of these suggestions more (or less) practical or applicable. As you evaluate these (and other) actions, commit to building a customized troubleshooting plan for your environment, and use this list as your starting point. Even if you are not able to fully implement every one of these recommendations, every step that you do take will save time and reduce frustration down the line:

  1. Create and maintain a system architecture diagram
  2. Create and track an inventory of all problem determination artifacts
  3. Pay special attention to dumps and other artifacts that are only generated when a problem occurs
  4. Review and optimize the level of diagnostics during normal operation
  5. Watch low-level operating system and network metrics
  6. Be prepared to actively generate additional diagnostics when a problem occurs
  7. Define a diagnostic collection plan -- and practice it
  8. Establish baselines
  9. Periodically purge, archive, or clean up old logs and dumps
  10. Eliminate spurious errors and other "noise" in the logs
  11. Keep a change log
  12. Set up ongoing system health monitoring

The sections that follow explain each of these steps in detail.


1. Create and maintain a system architecture diagram

An architecture diagram shows all major components of the overall system (machines and software components operating on these machines), how they communicate, and the main flows for requests being processed through the system. Having a good, up-to-date architecture diagram is a major help in facilitating and accelerating many troubleshooting tasks. In particular, a system architecture diagram helps you:

  • Identify the various points in the system where you can find information or clues about the cause of a problem.
  • Communicate clearly among the various parties involved in the troubleshooting task, both inside your organization and when explaining a complex environment to IBM Support.
  • Answer and verify a favorite question of all troubleshooters: What has changed recently?

The architecture diagram should be specific, yet concise enough to be quickly understood. In particular, to the extent possible, it should show the actual current version of each software component, and the names and addresses of all hardware components.


2. Create and track an inventory of all problem determination artifacts

Troubleshooting activities often focus initially on finding and examining a variety of problem determination artifacts (that is, files such as log files, dump files, and so on) that were generated before or when the problem occurred. It pays to know in advance which files to look for and where they are, and to ensure that they are indeed generated properly and will be available when needed.

Make an inventory of all the important problem determination artifacts in your system:

  • Note each file's name, location, purpose, typical contents, and typical size. The system architecture diagram can come in handy here, as it helps you review the entire system and all the components that can produce useful problem determination artifacts.
  • Don't be satisfied with simply "knowing" that an artifact exists. Though everything may be set up and documented perfectly in theory, there is no substitute for an actual real-life test to validate the theory. Check the live system periodically to verify that all expected log files and other artifacts are still being written as expected (a simple script for such a check is sketched below).
  • Ensure that there is enough disk space, wherever appropriate, to continue to write all the relevant diagnostic files, and to receive any other files that might be generated during an incident.
  • Make sure that your artifacts are not purged too quickly. If an incident occurs, you might need to refer back to log files that were generated some hours before the incident was detected. In particular, make sure that these files will not be accidentally deleted or overwritten if the system is restarted after an incident.
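
As a sketch of this kind of periodic check, the short script below (plain Python, assuming a Unix-like system; the paths, age limits, and disk-space threshold are hypothetical placeholders for the values from your own inventory) verifies that each expected artifact still exists, has been updated recently, and has free space left on its file system:

    # check_artifacts.py -- verify that expected problem determination artifacts
    # are still being produced and that there is room to keep producing them.
    # All paths and thresholds are hypothetical placeholders.
    import os
    import time

    # (path, maximum acceptable age in minutes since the last update)
    EXPECTED_ARTIFACTS = [
        ("/opt/WebSphere/AppServer/profiles/AppSrv01/logs/server1/SystemOut.log", 60),
        ("/opt/IBMHttpServer/logs/access.log", 60),
        ("/opt/myapp/logs/application.log", 240),
    ]
    MIN_FREE_MB = 500   # minimum free space on each artifact's file system

    def check():
        now = time.time()
        for path, max_age_minutes in EXPECTED_ARTIFACTS:
            if not os.path.exists(path):
                print("MISSING: %s" % path)
                continue
            age_minutes = (now - os.path.getmtime(path)) / 60
            if age_minutes > max_age_minutes:
                print("STALE: %s last updated %.0f minutes ago" % (path, age_minutes))
            stats = os.statvfs(os.path.dirname(path))
            free_mb = stats.f_bavail * stats.f_frsize / (1024 * 1024)
            if free_mb < MIN_FREE_MB:
                print("LOW DISK: only %.0f MB free for %s" % (free_mb, path))

    if __name__ == "__main__":
        check()

Run it from a scheduler such as cron and route the output to whatever alerting or mailing mechanism your team already uses.
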
Problem determination artifacts you might consider

The set of relevant artifacts varies with each environment and with the products and applications that you use. Some of the most common ones include:

  • All the standard log files associated with WebSphere Application Server: activity.log, SystemOut.log, SystemErr.log, native_stdout.log, native_stderr.log, and so on.
  • Incident files from the First Failure Data Capture (FFDC) facility in WebSphere Application Server.
  • Log files from the Web server: access.log, error.log.
  • Any log files from products built on top of WebSphere Application Server (such as WebSphere Portal, WebSphere Process Server, and so on).
  • Any log files from other components that interact with the main application server, such as firewall logs, database server logs, and LDAP directory server logs.
  • Any log files produced explicitly by an application.
  • Log files and dumps produced by the Java Virtual Machine: javacore or java dump, heap dump, and system dump (core file).

3. Pay special attention to dumps and other artifacts that are only generated when a problem occurs

When reviewing problem determination artifacts, much attention is often given to log files, which are usually generated continuously throughout the life of the system. Remember that there are also many very useful problem determination artifacts that only get generated when a problem occurs, either spontaneously by the system or upon a special action by an administrator.

Examples of artifacts generated only when a problem occurs
  • Under a variety of circumstances, the JVM can generate Java/thread dumps, heap dumps, and system dumps.
  • Some IBM products can generate a variety of other dump-type data of their internal states. For example, WebSphere Application Server provides a First Failure Data Capture (FFDC) facility and a Diagnostic Provider facility.
  • Some IBM products can also generate special trace files, either automatically or on demand, as soon as a particular problem is encountered, without requiring a system restart.

In most cases, there are a variety of configuration options that control how and when these artifacts are generated:

  • For any file that can be generated automatically when a problem is detected: carefully consider the potential benefit -- and impact -- of having this file produced automatically, and set up the configuration accordingly. Do not leave this to chance. If the potential benefit is high and the impact is low, make sure that this feature is enabled (one possible configuration is sketched after this list).
  • For files that can be generated upon a specific action by an administrator: in most cases there is very little or no impact to the system until that particular action is taken. If at all possible, make the necessary preparations and configuration changes to ensure that the action will be available if and when needed, and test to make sure that it works as expected.
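
As one illustration of the first point above, on IBM JDKs that support the -Xdump dump agents (an assumption; other JVM vendors and older IBM JDK levels use different mechanisms, and recent IBM JVMs already enable similar behavior by default, which you can verify with -Xdump:what), automatic Java and heap dumps on an OutOfMemoryError could be requested through the server's Generic JVM arguments with options such as:

    -Xdump:java:events=systhrow,filter=java/lang/OutOfMemoryError
    -Xdump:heap:events=systhrow,filter=java/lang/OutOfMemoryError

Treat this only as a sketch; check the documentation for your exact JVM level before relying on it.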

4. Review and optimize the level of diagnostics during normal operation

Effective serviceability is always a trade-off. On one hand, to maximize the chances of being able to determine the cause of a problem upon its first occurrence, you want to gather the maximum amount of diagnostic data from the system at all times. But gathering very detailed diagnostics (for example, by having tracing enabled all the time) can cause a substantial performance overhead. Therefore, for performance reasons, you might be tempted to disable all diagnostics during normal operation of the system. You need to find the right balance between these two conflicting goals.

By default, most products and environments tend to err on the conservative side with a relatively small set of diagnostics enabled at all times. This is probably the right approach if no one will be available to review the set of enabled diagnostics before the system is put into production. As part of an actively designed troubleshooting plan, however, it is quite worthwhile to examine the specific constraints of your particular environment, the likelihood of specific problems, and the specific performance requirements, and then enable as many additional diagnostics as you can afford during normal steady-state operation.

Possible additional diagnostics to enable
  • JVM verboseGC log: Often very useful, and usually relatively low overhead on a well-tuned system (see the example settings after this list).
  • JVM Java dumps, heap dumps, and system dumps: Java dumps are typically somewhat cheap to produce, and can be enabled for automatic generation. Heap dumps and system dumps can involve significant overhead, so consider carefully before setting them up to be triggered automatically.
  • Increased request logging at the HTTP server to show not just a single log entry for each request, but a separate log entry for the start and end of each request.
  • A moderate level of monitoring using the performance counters provided by the WebSphere Application Server Performance Monitoring Infrastructure.
  • Minimal WebSphere Application Server tracing to capture one or a few entries only for each transaction (Web requests or EJB requests).
  • Application level tracing and logging, if any.
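
For the first item in this list, the simplest form of the option, -verbose:gc, writes GC records to the JVM's native stderr log. As a sketch (assuming an IBM JDK whose -Xverbosegclog option supports rotating output files; the option name and syntax differ on other JVMs and levels), a dedicated, self-limiting verbose GC log could instead be requested through the Generic JVM arguments:

    -Xverbosegclog:verbosegc.log,5,10000

Here the two trailing numbers are intended to keep 5 rotating files of 10,000 GC cycles each, so the log cannot grow without bound.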

5. Watch low-level operating system and network metrics

When looking for diagnostic information, there is a tendency to focus on the logs, dumps, and other files directly associated with the failing component or application, but the underlying hardware, operating system, and network can often also provide useful information for tracking down the source of a problem.

System-level metrics
  • Overall CPU and memory usage for the entire machine.
  • CPU and memory usage of individual processes that are part of the application (or that the application depends on).
  • Paging and disk I/O activity.
  • Rate of network traffic between various components.
  • Reduction or total loss of network connectivity between various components.

Such metrics are often overlooked, or considered only late in the course of a complex problem investigation, yet many of them are relatively easy and cheap to capture. In some particularly difficult cases, especially network-related problems, this information often plays a key role in tracking down the source.

Of course, it's not practical to monitor every single system-level metric on a permanent basis, but where possible, pick a lightweight set of system-level metrics to monitor regularly, so that you will have data both from before a problem occurs (capturing the evolution of the problem) and from when the problem does occur. Depending on your software environment, you might have access to various operating system tools or specialized system management tools (such as the IBM Tivoli® suite of products) to assist in this monitoring. If not, you might be able to write a few simple command scripts to run periodically and collect the most useful statistics.
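
One minimal sketch of such a script is shown below, in plain Python, assuming a Unix-like system where vmstat, df, netstat, and ps are available; the commands, options, and output file are placeholders to adapt to your own platform. Each run appends a timestamped snapshot to a history file, for example from cron every few minutes:

    # snapshot_os_stats.py -- append a timestamped snapshot of basic OS metrics
    # to a rolling history file. Commands and paths are Unix-style placeholders.
    import subprocess
    import time

    SNAPSHOT_FILE = "/var/tmp/os_stats_history.log"   # hypothetical location

    COMMANDS = [
        ["vmstat", "1", "3"],                      # CPU, memory, and paging activity
        ["df", "-k"],                              # file system usage (watch for full disks)
        ["netstat", "-an"],                        # socket states between components
        ["ps", "-eo", "pid,pcpu,pmem,vsz,args"],   # per-process CPU and memory usage
    ]

    def snapshot():
        with open(SNAPSHOT_FILE, "a") as out:
            out.write("==== snapshot at %s ====\n" % time.strftime("%Y-%m-%d %H:%M:%S"))
            for cmd in COMMANDS:
                out.write("---- %s ----\n" % " ".join(cmd))
                try:
                    result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
                    out.write(result.stdout)
                except (OSError, subprocess.TimeoutExpired) as err:
                    out.write("command failed: %s\n" % err)

    if __name__ == "__main__":
        snapshot()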


6. Be prepared to actively generate additional diagnostics when a problem occurs

In addition to dealing with diagnostic artifacts that are present when an incident occurs, your troubleshooting plan should consider any additional explicit actions that should be performed to obtain additional information as soon as an incident is detected -- before the data disappears or the system is restarted.

Examples of explicit actions to generate additional diagnostics
  • Actively trigger various system dumps, if they have not been generated automatically (such as Java dump, heap dump, system dump, WebSphere Application Server Diagnostic Provider dumps, or other dumps that might be provided by various products and applications). For example, when a system is believed to be "hung," it is common practice to collect three consecutive Java dumps for each potentially affected JVM process.
  • Take a snapshot of key operating system metrics, such as process states, sizes, CPU usage, and so on.
  • Enable and collect information from the WebSphere Application Server Performance Monitoring Infrastructure instrumentation.
  • Dynamically enable a specific trace, and collect that trace for a given interval while the system is in the current unhealthy state.
  • Actively test or "ping" various aspects of the system to see how their behavior has changed compared to normal conditions, to try to isolate the source of the problem in a multi-component system. For example:
    • Send an HTTP request directly to the application server, bypassing the Web server, or test some operations directly against a back end database, bypassing the application server.
    • Test the responses from different WebSphere Application Server cluster members.
    • If applicable, test different functions of the application, to see if they are affected differently.
    • Selectively restart individual components of the application or the system.

Clearly, there is a potentially infinite variety of such actions, and you cannot possibly perform them all. Rather, you must try to anticipate the most likely failure modes for the system and decide which actions are most likely to yield the most useful information in each case. A careful review of the application and the system architecture is one key source of information to help you get started with such a plan, as is the set of MustGather documents on the IBM Support Web sites. Each MustGather document applies to one particular type of problem, and gives specific instructions about recommended diagnostic information to collect in each case.
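
As an example of scripting one of these actions so that it is performed consistently under pressure, the wsadmin (Jython) sketch below requests the "three consecutive Java dumps" mentioned above from a single application server. The node and server names are hypothetical, and the script assumes the dumpThreads operation of the WebSphere Application Server JVM MBean; run it with wsadmin -lang jython -f after adapting it to your topology:

    # javacores.py -- request three consecutive javacores from one application server.
    # Node and server names are hypothetical; adjust them to your own topology.
    import time

    NODE = "node1"
    SERVER = "server1"
    DUMP_COUNT = 3
    PAUSE_SECONDS = 30   # pause between dumps so thread movement can be compared

    # Look up the JVM MBean for the target server.
    jvm = AdminControl.queryNames("type=JVM,process=%s,node=%s,*" % (SERVER, NODE))
    if not jvm:
        print("No running JVM MBean found for %s/%s" % (NODE, SERVER))
    else:
        for i in range(DUMP_COUNT):
            print("Requesting javacore %d of %d..." % (i + 1, DUMP_COUNT))
            AdminControl.invoke(jvm, "dumpThreads")   # writes a javacore on the server
            if i < DUMP_COUNT - 1:
                time.sleep(PAUSE_SECONDS)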


7. Define a diagnostic collection plan -- and practice it

Assuming you have identified key diagnostic artifacts that can be used for problem determination and have ensured that these artifacts will be available, and of the best quality possible, you also need a specific, well-documented plan to actually collect these artifacts when an incident does occur. When a problem happens, too often there is confusion, along with great pressure to restore the system to normal operation, causing mistakes that lead to unnecessary delays or general difficulties in troubleshooting. Having a plan of action, ensuring that everyone is aware of the plan of action -- and rehearsing the execution of the plan ahead of time -- are critical.

Your plan should cover the collection of the diagnostic artifacts that are always present or automatically generated when a problem occurs, as well as a set of specific actions that can be taken to generate additional diagnostics specifically when a problem occurs, as suggested earlier.

The simplest diagnostic collection plan is in the form of plain, written documentation that lists all the detailed manual steps that must be taken. To be more effective, try to automate as much of this plan as possible, by providing one or more command scripts that can be invoked to perform a complex set of actions, or by using more sophisticated system management tools. The various collector tools and scripts now offered as part of IBM Support Assistant can provide a good framework for you to start automating many diagnostic collection tasks.
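
A small sketch of such an automation, in plain Python, is shown below; it simply copies a predefined list of artifact locations into a single time-stamped archive so that nothing is forgotten in the heat of an incident. The directories are hypothetical examples that would come from the inventory built in recommendation 2:

    # collect_diagnostics.py -- gather key problem determination artifacts into one archive.
    # The directories below are hypothetical examples; use your own artifact inventory.
    import os
    import tarfile
    import time

    ARTIFACT_DIRS = [
        "/opt/WebSphere/AppServer/profiles/AppSrv01/logs",   # server logs and FFDC incidents
        "/opt/IBMHttpServer/logs",                           # web server access and error logs
        "/opt/myapp/logs",                                   # application-specific logs
    ]
    ARCHIVE_DIR = "/var/tmp/diag"

    def collect():
        os.makedirs(ARCHIVE_DIR, exist_ok=True)
        stamp = time.strftime("%Y%m%d-%H%M%S")
        archive_path = os.path.join(ARCHIVE_DIR, "diag-%s.tar.gz" % stamp)
        with tarfile.open(archive_path, "w:gz") as archive:
            for directory in ARTIFACT_DIRS:
                if os.path.isdir(directory):
                    archive.add(directory)
                else:
                    print("warning: %s not found, skipping" % directory)
        print("collected diagnostics in %s" % archive_path)

    if __name__ == "__main__":
        collect()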


8. Establish baselines

"What's different now compared to yesterday when the problem was not occurring?"

To help answer this question, you must actively collect and maintain a baseline: extensive information about the state of the system at a time when that system is operating normally.

Examples of information to include as a baseline
  • Copies of the various log files, trace files, and so on, over a representative period of time in the normal operation of the system, such as a full day.
  • Copies of a few Java dumps, heap dumps, system dumps, or other types of artifacts that are normally generated "on demand." You can combine this activity with the earlier recommendation to test the generation of these artifacts on a healthy system before a problem occurs.
  • Information about the normal transaction rates in the system, response times, and so on.
  • Various operating system level statistics on a healthy system, such as CPU usage for all processes, memory usage, network traffic, and so on.
  • Copies of any other artifacts, information, or normal expected results from the special diagnostic collection actions recommended earlier, for each anticipated type of problem.

Remember that this baseline information might have to be refreshed periodically. Systems evolve, the load changes, and so what is representative of the "normal" state will likely not remain constant over time. Also, if the system experiences different periods of activity (for example, on specific days or times of the day), different baselines might have to be collected for each period.


9. Periodically purge, archive, or clean up old logs and dumps

Some of the other recommendations in this list suggest generating as many different types of diagnostic artifacts as possible, in as much detail as possible. This recommendation may therefore seem contradictory, but quantity is not always a good thing. In some environments, log files are allowed to grow indefinitely, or many old individual log and dump files are allowed to accumulate over a period of months or even years. This can actually hamper the troubleshooting process:

  • Considerable and valuable time could be spent sorting through a lot of old information to find the current information, either manually or with tools used to scan all the files.
  • Tools that collect and transfer the problem determination artifacts, such as collection scripts, might run much slower if they have to transfer a lot of old unnecessary information. This is particularly unfortunate if the same old information is transferred each time, for each new incident.
  • In extreme cases, the system could run out of disk space from the sheer volume of old accumulated artifacts.

Your main objective should be to gather a maximum amount of diagnostic information just before and around the time of a problem. You also want to keep or archive a sufficient amount of historical diagnostic information to serve as a baseline for comparison purposes, and in case a problem is not immediately detected, or to see the slow build-up of some problems. But this historical data should be kept separate from the current problem determination artifacts, so as not to interfere with them.

Various IBM products, such as WebSphere Application Server, include features to automatically purge or rotate some log files. Other log files and dumps must be archived or purged on a regular basis manually.
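
For artifacts that are not rotated automatically, a small housekeeping script run on a schedule can keep the working directories clean. The plain-Python sketch below (hypothetical directories and retention period) moves anything older than the retention window into a separate archive area instead of deleting it outright, so that historical data remains available for baselines:

    # archive_old_artifacts.py -- move old logs and dumps out of the working directories.
    # Directories and retention period are hypothetical; deletion from the archive area
    # itself can be handled separately, on a longer schedule.
    import os
    import shutil
    import time

    WORK_DIRS = [
        "/opt/WebSphere/AppServer/profiles/AppSrv01/logs/ffdc",
        "/opt/myapp/logs",
    ]
    ARCHIVE_ROOT = "/archive/diagnostics"
    RETENTION_DAYS = 14

    def archive_old_files():
        cutoff = time.time() - RETENTION_DAYS * 24 * 3600
        for directory in WORK_DIRS:
            if not os.path.isdir(directory):
                continue
            target = os.path.join(ARCHIVE_ROOT, os.path.basename(directory))
            os.makedirs(target, exist_ok=True)
            for name in os.listdir(directory):
                path = os.path.join(directory, name)
                if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
                    shutil.move(path, os.path.join(target, name))
                    print("archived %s" % path)

    if __name__ == "__main__":
        archive_old_files()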


10. Eliminate spurious errors and other "noise" in the logs

In IBM Support, we sometimes see systems and applications operating in a mode where they generate a large volume of error messages, even during normal operation. Such benign or common errors are clearly not significant, but they make it more difficult to spot unusual errors among all the noise. To simplify future troubleshooting, either eliminate all such common errors, or find a way to flag them so that they are easily distinguishable from the errors that matter.
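
When a benign message cannot be eliminated at its source, a simple filter can at least separate it from the messages that matter. The plain-Python sketch below splits a SystemOut.log-style file into "expected" and "unexpected" streams based on a list of message IDs that your team has already investigated and declared harmless; the IDs shown are only hypothetical examples of such a list:

    # filter_known_noise.py -- split a log into known-benign and needs-attention streams.
    # The message IDs below are hypothetical examples of messages that a team has
    # already investigated and declared harmless in its own environment.
    KNOWN_BENIGN_IDS = {
        "SRVE0242I",   # example: a routine servlet initialization message
        "WSVR0605W",   # example: a hung-thread warning known to be a false alarm here
    }

    def split_log(input_path, expected_path, unexpected_path):
        with open(input_path) as source, \
             open(expected_path, "w") as expected, \
             open(unexpected_path, "w") as unexpected:
            for line in source:
                if any(msg_id in line for msg_id in KNOWN_BENIGN_IDS):
                    expected.write(line)
                else:
                    unexpected.write(line)

    if __name__ == "__main__":
        split_log("SystemOut.log", "SystemOut.expected.log", "SystemOut.unexpected.log")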


11. Keep a change log

As mentioned several times already, an important aspect of most troubleshooting exercises is to figure out what is different between a working system and a broken one, and using a baseline is one way to help address the question. Another action that can help you determine system differences is to keep a rigorous log of all changes that have been applied to the system over time. When a problem occurs, you can look back through the log for any recent changes that might have contributed to the problem. You can also map these changes to the various baselines that have been collected in the past to ascertain how to interpret differences in these baselines.

Your log should at least track all upgrades and software fixes applied in every software component in the system, including both infrastructure products and application code. It should also track every configuration change in any component. Ideally, it should also track any known changes in the pattern of usage of the system; for example, expected increases in load, a different mix of operations invoked by users, and so on.

In a complex IT environment that has many teams contributing to different parts of the environment, the task of maintaining an accurate, up-to-date and global change log can be surprisingly difficult. There are tools and techniques you can use to assist in this task, from collecting regular snapshots of the configuration with simple data collection scripts (like those used to collect diagnostic data) to using sophisticated system management utilities.
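
One very lightweight form of such a snapshot is sketched below in plain Python, with hypothetical configuration directories: each run records a checksum for every file under those directories, and an ordinary diff of two snapshot files then shows exactly which configuration files changed between two dates, which can be cross-checked against the change log:

    # snapshot_config.py -- record checksums of configuration files for later comparison.
    # The configuration directories are hypothetical; adjust to your own cell and web server.
    import hashlib
    import os
    import time

    CONFIG_DIRS = [
        "/opt/WebSphere/AppServer/profiles/AppSrv01/config",
        "/opt/IBMHttpServer/conf",
    ]
    SNAPSHOT_DIR = "/var/tmp/config-snapshots"

    def snapshot():
        os.makedirs(SNAPSHOT_DIR, exist_ok=True)
        out_path = os.path.join(SNAPSHOT_DIR,
                                "config-%s.txt" % time.strftime("%Y%m%d-%H%M%S"))
        with open(out_path, "w") as out:
            for config_dir in CONFIG_DIRS:
                for root, _, files in os.walk(config_dir):
                    for name in sorted(files):
                        path = os.path.join(root, name)
                        with open(path, "rb") as config_file:
                            digest = hashlib.md5(config_file.read()).hexdigest()
                        out.write("%s  %s\n" % (digest, path))
        print("wrote %s" % out_path)

    if __name__ == "__main__":
        snapshot()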

Be aware that the concept of change control, and keeping a change log, is generally broader than the troubleshooting arena. It is also considered one of the key best practices for managing complex systems to prevent problems, as opposed to troubleshooting them.


12. Set up ongoing system health monitoring

In a surprising number of real-life cases, some minor or not-so-minor problems that contribute to a bigger problem can go undetected for a long time. Or, the overall health or performance of the system might degrade slowly over a long period before it finally leads to a serious problem. Obviously, the sooner you detect that something is wrong, the more time and opportunity you have to collect useful diagnostic information for troubleshooting. Therefore, having a good policy for continuous monitoring of the overall health of the system is an important part of an overall troubleshooting plan. This can involve a lot of the same sources of diagnostic data already discussed, the difference being that, in this case, you want to not only collect that information, but periodically scan it to ensure that no problem exists, rather than wait for a problem to be reported externally. Again, simple command scripts or sophisticated system management tools, if available, can be used to facilitate this monitoring.

Examples of things that might be monitored
  • Significant errors in the logs emitted by the various components.
  • Metrics produced by each component, which should remain within acceptable norms (for example, operating system CPU and memory statistics, WebSphere Application Server performance metrics, transaction rate through the application, and so on).
  • Spontaneous appearance of special artifacts that only get generated when a problem occurs, such as Java dumps, heap dumps, and so on.
  • Periodic "pings" sent through various system components or the application, verifying that they continue to respond as expected (a simple example is sketched after this list).
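
Such a "ping" can be as simple as the plain-Python sketch below, which requests a set of URLs (hypothetical hosts, ports, and paths) both through the web server and directly from each application server cluster member, and reports anything that fails or responds slowly. It is meant to run from a scheduler, with the output fed into whatever alerting mechanism you already have:

    # health_ping.py -- periodically verify that key components still respond as expected.
    # All URLs and thresholds are hypothetical; include both the "front door" URL and the
    # individual cluster members so that a single failing member is spotted directly.
    import time
    import urllib.request

    ENDPOINTS = [
        "http://webserver.example.com/myapp/healthcheck",
        "http://appserver1.example.com:9080/myapp/healthcheck",
        "http://appserver2.example.com:9080/myapp/healthcheck",
    ]
    TIMEOUT_SECONDS = 10
    SLOW_THRESHOLD_SECONDS = 2.0

    def ping_all():
        for url in ENDPOINTS:
            start = time.time()
            try:
                with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as response:
                    elapsed = time.time() - start
                    if response.status != 200:
                        print("WARN %s returned HTTP %d" % (url, response.status))
                    elif elapsed > SLOW_THRESHOLD_SECONDS:
                        print("WARN %s responded in %.1f seconds" % (url, elapsed))
                    else:
                        print("OK   %s (%.2f seconds)" % (url, elapsed))
            except Exception as err:
                print("FAIL %s: %s" % (url, err))

    if __name__ == "__main__":
        ping_all()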

Conclusion

To be in the best possible position for troubleshooting problems in a complex environment, there are many factors and ideas you can consider, and many variations from one specific case to the next. Hopefully, this list makes the point that preparation pays off, and gives you a place to start for defining your own customized troubleshooting plans.


Acknowledgements

The author thanks Yu Tang, and all other members of the WebSphere SWAT team, for the many valuable discussions and insights that have led to the recommendations presented in this article.
