In each column, The Support Authority discusses resources, tools, and other elements of IBM Technical Support that are available for WebSphere products, plus techniques and new ideas that can further enhance your IBM support experience.
This just in...
As always, we begin with some new items of interest for the WebSphere® community at large:
IBM® Guided Activity Assistant Version 3.1 has just been released, and with this version, the IBM Guided Activity Assistant becomes a fully supported problem determination tool. It also includes an improved user interface and many new features.
Three problem determination tools available in the IBM Support Assistant have been updated:
IBM Dump Analyzer for Java™ examines the contents of a JVM system dump and diagnoses the most common problems and the overall state of that JVM.
Extensible Verbose Toolkit (EVTK) examines JVM verbose GC logs and diagnoses garbage-collection problems.
IBM Pattern Modeling and Analysis Tool for Java Garbage Collector (PMAT) examines a JVM verbose GC log. (With this new release, the version of PMAT in IBM Support Assistant is identical to the version available through alphaWorks.)
Last time, we announced the introduction of Fix Central, a centralized Web site to facilitate finding, downloading, and installing fixes for many IBM products. WebSphere Application Server has just been added to the family of products using Fix Central.
Continue to monitor the various support-related Web sites, as well as this column, for news about other tools as we encounter them.
And now, on to our main topic...
Thoughtful preparation for smarter troubleshooting
When discussing techniques and tools for troubleshooting and problem determination, it's typical for most of the discussion to be about what to do after a problem has been discovered. Ideally, however, the prudent system administrator or troubleshooter should start thinking about the job long before a problem occurs; in other words, prepare the environment so that troubleshooting can be performed more quickly and effectively if and when problems eventually do occur.
This article presents a dozen recommendations that can be implemented to help expedite problem resolution, in even the most complex production environments. This list, which is neither definitive nor absolute, is based on general observations of client environments and problems seen by IBM WebSphere Support. However, every environment has unique factors and constraints that might make some of these suggestions more (or less) practical or applicable. As you evaluate these (and other) actions, commit to building a customized troubleshooting plan for your environment, and use this list as your starting point. Even if you are not able to fully implement every one of these recommendations, every step that you do take will save time and reduce frustration down the line:
- Create and maintain a system architecture diagram
- Create and track an inventory of all problem determination artifacts
- Pay special attention to dumps and other artifacts that are only generated when a problem occurs
- Review and optimize the level of diagnostics during normal operation
- Watch low-level operating system and network metrics
- Be prepared to actively generate additional diagnostics when a problem occurs
- Define a diagnostic collection plan -- and practice it
- Establish baselines
- Periodically purge, archive, or clean-up old logs and dumps
- Eliminate spurious errors and other "noise" in the logs
- Keep a change log
- Set up ongoing system health monitoring
The sections that follow explain each of these steps in detail.
1. Create and maintain a system architecture diagram
An architecture diagram shows all major components of the overall system (machines and software components operating on these machines), how they communicate, and the main flows for requests being processed through the system. Having a good, up-to-date architecture diagram greatly facilitates and accelerates many tasks related to troubleshooting. In particular, a system architecture diagram helps you:
- Identify the various points in the system where you can find information or clues about the cause of a problem.
- Clearly communicate between the various parties involved in the troubleshooting task, both inside your organization and when trying to explain a complex environment to IBM Support.
- Answer and verify a favorite question of all troubleshooters: What has changed recently?
The architecture diagram should be specific, yet concise enough to be quickly understood. In particular, to the extent possible, it should show the actual current version of each software component, and the names and addresses of all hardware components.
2. Create and track an inventory of all problem determination artifacts
Troubleshooting activities often focus initially on finding and examining a variety of problem determination artifacts (that is, files such as log files, dump files, and so on) that were generated before or when the problem occurred. It pays to know in advance what files to look for, where they are, and to ensure that they are indeed generated properly and will be available when needed.
Make an inventory of all the important problem determination artifacts in your system:
- Note what each file is for, its name, location, purpose, typical contents, and its typical size. The system architecture diagram can come in handy here, as it helps you review the entire system and all the components that can produce useful problem determination artifacts.
- Don't be satisfied with simply "knowing" that the artifact exists. Though everything may be set up and documented perfectly in theory, there is no substitute for an actual real-life test to validate the theory. Check the live system periodically to verify that all expected log files and other artifacts are still being written as expected.
- Ensure that there is enough disk space, wherever appropriate, to continue to write all the relevant diagnostic files, and to receive any other files that might be generated during an incident.
- Make sure that your artifacts are not purged too quickly. If an incident occurs, you might need to refer back to log files that were generated some hours before the incident was detected. In particular, make sure that these files will not be accidentally deleted or overwritten if the system is restarted after an incident.
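To make these periodic checks concrete, here is a minimal sketch of the kind of simple script that could verify an artifact inventory; the file path, size limit, and freshness threshold are hypothetical placeholders that you would replace with entries from your own inventory:

```python
import os
import time

# Hypothetical inventory entry: path -> (purpose, maximum expected size in bytes).
# Substitute the actual artifacts identified for your environment.
INVENTORY = {
    "/opt/IBM/WebSphere/profiles/AppSrv01/logs/server1/SystemOut.log":
        ("main server log", 200 * 1024 * 1024),
}

def check_artifact(path, purpose, max_size, max_age_secs=3600):
    """Return a list of warnings for one problem determination artifact."""
    warnings = []
    if not os.path.exists(path):
        warnings.append(f"{path} ({purpose}): missing")
        return warnings
    st = os.stat(path)
    if st.st_size > max_size:
        warnings.append(f"{path}: larger than expected ({st.st_size} bytes)")
    # A log that has not been touched recently may no longer be written.
    if time.time() - st.st_mtime > max_age_secs:
        warnings.append(f"{path}: not updated in over {max_age_secs} seconds")
    return warnings

if __name__ == "__main__":
    for path, (purpose, max_size) in INVENTORY.items():
        for warning in check_artifact(path, purpose, max_size):
            print("WARNING:", warning)
```

A script like this could be scheduled to run daily, with its warnings routed to the administrators responsible for the troubleshooting plan.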
|Problem determination artifacts you might consider|
The set of relevant artifacts will vary for each environment, set of products, and applications that you use. Some of the most common ones:
3. Pay special attention to dumps and other artifacts that are only generated when a problem occurs
When reviewing problem determination artifacts, much attention is often given to log files, which are usually generated continuously throughout the life of the system. Remember that there are also many very useful problem determination artifacts that only get generated when a problem occurs, either spontaneously by the system or upon a special action by an administrator.
|Examples of artifacts generated only when a problem occurs|
In most cases, there are a variety of configuration options that control how and when these artifacts are generated:
- For any file that can be generated automatically when a problem is detected: carefully consider the potential benefit -- and impact -- of having this file produced automatically, and set up the configuration accordingly. Do not leave this to chance. If the potential benefit is high and the impact is low, make sure that this feature is enabled.
- For files that can be generated upon a specific action by an administrator: in most cases there is very little or no impact to the system until that particular action is taken. If at all possible, make the necessary preparations and configuration changes to ensure that the action will be available if and when needed, and test to make sure that it works as expected.
4. Review and optimize the level of diagnostics during normal operation
Effective serviceability is always a trade-off. On one hand, to maximize the chances of being able to determine the cause of a problem upon its first occurrence, you want to gather the maximum amount of diagnostic data from the system at all times. But gathering very detailed diagnostics (for example, by having tracing enabled all the time) can cause a substantial performance overhead. Therefore, for performance reasons, you might be tempted to disable all diagnostics during normal operation of the system. You need to find the right balance between these two conflicting goals.
By default, most products and environments tend to err on the conservative side with a relatively small set of diagnostics enabled at all times. This is probably the right approach if no one will be available to review the set of enabled diagnostics before the system is put into production. As part of an actively designed troubleshooting plan, however, it is quite worthwhile to examine the specific constraints of your particular environment, the likelihood of specific problems, and the specific performance requirements, and then enable as many additional diagnostics as you can afford during normal steady-state operation.
|Possible additional diagnostics to enable|
5. Watch low-level operating system and network metrics
When looking for diagnostic information, there is a tendency to focus on the logs, dumps, and other files directly associated with the failing component or application, but the underlying hardware, operating system, and network can often also provide useful information for tracking down the source of a problem.
Such metrics are often overlooked, or considered only late in the course of a complex problem investigation, yet many of them are relatively easy and cheap to capture. In some particularly difficult cases, especially network-related problems, this information often plays a key role in tracking down the source.
Of course, it's not practical to monitor every single system-level metric on a permanent basis, but where possible, pick a lightweight set of system-level metrics to monitor regularly, so that you will have data both before a problem occurs (capturing the evolution of the problem) and when the problem does occur. Depending on your software environment, you might have access to various operating system tools or specialized system management tools (such as the IBM Tivoli® suite of products) to assist in this monitoring. If not, you might be able to write a few simple command scripts to run periodically and collect the most useful statistics.
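As one sketch of such a script, the sampler below collects a deliberately small set of cheap metrics (load average and free disk space) using only the Python standard library; the set of monitored paths is an assumption you would adapt to your environment:

```python
import os
import shutil
import time

def sample_metrics(paths=("/",)):
    """Collect a small, cheap set of system-level metrics as a dict."""
    metrics = {"timestamp": time.time()}
    try:
        # 1-, 5-, and 15-minute load averages (POSIX systems only).
        metrics["loadavg"] = os.getloadavg()
    except (AttributeError, OSError):
        metrics["loadavg"] = None
    for path in paths:
        usage = shutil.disk_usage(path)
        metrics[f"disk_free_{path}"] = usage.free
    return metrics

if __name__ == "__main__":
    # Append one sample per run; schedule via cron or a similar facility.
    print(sample_metrics())
```

Appending each sample to a file gives you exactly the kind of before-and-during history described above, at negligible cost.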
6. Be prepared to actively generate additional diagnostics when a problem occurs
In addition to dealing with diagnostic artifacts that are present when an incident occurs, your troubleshooting plan should consider any additional explicit actions that should be performed to obtain additional information as soon as an incident is detected -- before the data disappears or the system is restarted.
|Examples of explicit actions to generate additional diagnostics|
Clearly, there is a potentially infinite variety of such actions, and you cannot possibly perform them all. Rather, you must try to anticipate the most likely failure modes for the system and decide which actions are most likely to yield the most useful information in each case. A careful review of the application and the system architecture is one key source of information to help you get started with such a plan, as is the set of MustGather documents on the IBM Support Web sites. Each MustGather document applies to one particular type of problem, and gives specific instructions about recommended diagnostic information to collect in each case.
7. Define a diagnostic collection plan -- and practice it
Assuming you have identified key diagnostic artifacts that can be used for problem determination and have ensured that these artifacts will be available, and of the best quality possible, you also need a specific, well-documented plan to actually collect these artifacts when an incident does occur. When a problem happens, too often there is confusion, along with great pressure to restore the system to normal operation, causing mistakes that lead to unnecessary delays or general difficulties in troubleshooting. Having a plan of action, ensuring that everyone is aware of the plan of action -- and rehearsing the execution of the plan ahead of time -- are critical.
Your plan should cover the collection of the diagnostic artifacts that are always present or automatically generated when a problem occurs, as well as a set of specific actions that can be taken to generate additional diagnostics specifically when a problem occurs, as suggested earlier.
The simplest diagnostic collection plan is in the form of plain, written documentation that lists all the detailed manual steps that must be taken. To be more effective, try to automate as much of this plan as possible, by providing one or more command scripts that can be invoked to perform a complex set of actions, or by using more sophisticated system management tools. The various collector tools and scripts now offered as part of IBM Support Assistant can provide a good framework for you to start automating many diagnostic collection tasks.
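The core of such an automated collection script can be very small. The sketch below packs whichever artifacts from a (hypothetical) list currently exist into a single timestamped archive, ready to be examined or sent to support:

```python
import os
import tarfile
import time

# Hypothetical list of artifacts named in the collection plan;
# replace with the paths from your own artifact inventory.
ARTIFACTS = [
    "/opt/IBM/WebSphere/profiles/AppSrv01/logs/server1/SystemOut.log",
]

def collect(artifacts, dest_dir="."):
    """Pack all existing artifacts into one timestamped tar.gz archive."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = os.path.join(dest_dir, f"diagnostics-{stamp}.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        for path in artifacts:
            if os.path.exists(path):
                # Store under a flat name to keep the archive easy to browse.
                tar.add(path, arcname=os.path.basename(path))
    return archive
```

Because missing files are skipped rather than treated as fatal, the same script can be rehearsed safely on a test system before any real incident.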
8. Establish baselines
"What's different now compared to yesterday when the problem was not occurring?"
To help answer this question, you must actively collect and maintain a baseline: extensive information about the state of the system at a time when that system is operating normally.
|Examples of information to include as a baseline|
Remember that this baseline information might have to be refreshed periodically. Systems evolve, the load changes, and so what is representative of the "normal" state will likely not remain constant over time. Also, if the system experiences different periods of activity (for example, on specific days or times of the day), different baselines might have to be collected for each period.
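A baseline only answers the "what's different now?" question if you can compare against it mechanically. As a sketch, the helpers below store a snapshot of normal-state values (whatever you choose to baseline: thread counts, heap sizes, response times) as JSON and report which values have drifted:

```python
import json

def capture_baseline(state, path):
    """Save a snapshot of 'normal' system state (a plain dict) to disk."""
    with open(path, "w") as f:
        json.dump(state, f, indent=2, sort_keys=True)

def diff_against_baseline(state, path):
    """Return {key: (baseline value, current value)} for values that differ."""
    with open(path) as f:
        baseline = json.load(f)
    keys = set(baseline) | set(state)
    return {k: (baseline.get(k), state.get(k))
            for k in keys if baseline.get(k) != state.get(k)}
```

Keeping one such snapshot per activity period (weekday peak, overnight batch, and so on) addresses the point above about systems with different "normal" states.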
9. Periodically purge, archive, or clean up old logs and dumps
Some of the other recommendations in this list suggest generating as many different types of diagnostic artifacts as possible in as much detail as possible. This may seem contradictory, but quantity is not always a good thing. In some environments, log files are allowed to grow indefinitely, or many old individual log and dump files are allowed to accumulate, over a period of months or even years. This can actually hamper the troubleshooting process:
- Considerable and valuable time could be spent sorting through a lot of old information to find the current information, either manually or with tools used to scan all the files.
- Tools that collect and transfer the problem determination artifacts, such as collection scripts, might run much slower if they have to transfer a lot of old unnecessary information. This is particularly unfortunate if the same old information is transferred each time, for each new incident.
- In extreme cases, the system could run out of disk space from the sheer volume of old accumulated artifacts.
Your main objective should be to gather a maximum amount of diagnostic information just before and around the time of a problem. You also want to keep or archive a sufficient amount of historical diagnostic information to serve as a baseline for comparison purposes, and in case a problem is not immediately detected, or to see the slow build-up of some problems. But this historical data should be kept separate from the current problem determination artifacts, so as not to interfere with them.
Various IBM products, such as WebSphere Application Server, include features to automatically purge or rotate some log files. Other log files and dumps must be archived or purged on a regular basis manually.
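For the files that are not rotated automatically, a small housekeeping script can enforce the retention policy. This sketch deletes files older than a chosen age; in practice you might move them to a separate archive area instead, as suggested above, to preserve the historical baseline data:

```python
import os
import time

def purge_old_files(directory, max_age_days=30,
                    suffixes=(".log", ".txt", ".phd")):
    """Delete matching files older than max_age_days; return removed paths.

    The suffix list is an assumption; adjust it to the artifact types
    (logs, heap dumps, and so on) present in your environment.
    """
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if (os.path.isfile(path) and name.endswith(suffixes)
                and os.path.getmtime(path) < cutoff):
            os.remove(path)
            removed.append(path)
    return removed
```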
10. Eliminate spurious errors and other "noise" in the logs
In IBM Support, we sometimes see systems and applications operating in a mode where they generate a large volume of error messages, even during normal operation. Such benign or common errors are clearly not significant, but they make it more difficult to spot unusual errors among all the noise. To simplify future troubleshooting, either eliminate all such common errors, or find a way to flag them so that they are easily distinguishable from the errors that matter.
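When the benign errors cannot be eliminated at the source, flagging them can be as simple as filtering on their message IDs. In this sketch the IDs are placeholders; you would substitute the message IDs you have confirmed to be benign in your own environment:

```python
# Placeholder message IDs -- substitute the IDs that are known
# to be benign in your environment.
BENIGN_IDS = {"XXXX0001W", "XXXX0002I"}

def significant_lines(lines, benign_ids=BENIGN_IDS):
    """Yield only log lines that do not contain a known-benign message ID."""
    for line in lines:
        if not any(msg_id in line for msg_id in benign_ids):
            yield line
```

Running new log data through such a filter before reviewing it makes the errors that matter stand out immediately.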
11. Keep a change log
As mentioned several times already, an important aspect of most troubleshooting exercises is to figure out what is different between a working system and a broken one, and using a baseline is one way to help address the question. Another action that can help you determine system differences is to keep a rigorous log of all changes that have been applied to the system over time. When a problem occurs, you can look back through the log for any recent changes that might have contributed to the problem. You can also map these changes to the various baselines that have been collected in the past to ascertain how to interpret differences in these baselines.
Your log should at least track all upgrades and software fixes applied in every software component in the system, including both infrastructure products and application code. It should also track every configuration change in any component. Ideally, it should also track any known changes in the pattern of usage of the system; for example, expected increases in load, a different mix of operations invoked by users, and so on.
In a complex IT environment that has many teams contributing to different parts of the environment, the task of maintaining an accurate, up-to-date and global change log can be surprisingly difficult. There are tools and techniques you can use to assist in this task, from collecting regular snapshots of the configuration with simple data collection scripts (like those used to collect diagnostic data) to using sophisticated system management utilities.
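One simple data-collection technique along these lines is to snapshot a digest of each configuration file on a schedule, so that unrecorded changes are at least detectable. This is a minimal sketch of that idea, not a substitute for a real change-management process:

```python
import hashlib
import os

def snapshot_hashes(config_files):
    """Map each existing configuration file to a SHA-256 digest of its contents."""
    hashes = {}
    for path in config_files:
        if os.path.exists(path):
            with open(path, "rb") as f:
                hashes[path] = hashlib.sha256(f.read()).hexdigest()
    return hashes

def changed_files(old, new):
    """Compare two snapshots; return files added, removed, or modified."""
    return sorted(k for k in set(old) | set(new) if old.get(k) != new.get(k))
```

Any file reported by `changed_files` that has no corresponding entry in the change log is a gap worth investigating.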
Be aware that the concept of change control, and keeping a change log, is generally broader than the troubleshooting arena. It is also considered one of the key best practices for managing complex systems to prevent problems, as opposed to troubleshooting them.
12. Set up ongoing system health monitoring
In a surprising number of real-life cases, some minor or not-so-minor problems that contribute to a bigger problem can go undetected for a long time. Or, the overall health or performance of the system might degrade slowly over a long period before it finally leads to a serious problem. Obviously, the sooner you detect that something is wrong, the more time and opportunity you have to collect useful diagnostic information for troubleshooting. Therefore, having a good policy for continuous monitoring of the overall health of the system is an important part of an overall troubleshooting plan. This can involve a lot of the same sources of diagnostic data already discussed, the difference being that, in this case, you want to not only collect that information, but periodically scan it to ensure that no problem exists, rather than wait for a problem to be reported externally. Again, simple command scripts or sophisticated system management tools, if available, can be used to facilitate this monitoring.
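As one example of such a periodic scan, the check below counts error-level entries in a batch of log lines and raises an alert when they exceed a threshold. The `" E "` marker reflects the event-type column in WebSphere SystemOut.log entries, and the threshold is an assumption to be tuned against your baseline:

```python
def error_rate_alert(lines, threshold=10, token=" E "):
    """Return True if the number of error-level log lines exceeds a threshold.

    WebSphere SystemOut.log marks error-level entries with an 'E' in the
    event-type column; adjust the token for other log formats.
    """
    count = sum(1 for line in lines if token in line)
    return count > threshold
```

Run against each day's (or hour's) worth of log data, even a crude check like this surfaces slow degradation long before it is reported externally.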
|Examples of things that might be monitored|
To be in the best possible position for troubleshooting problems in a complex environment, there are many factors and ideas you can consider, and many variations from one specific case to the next. Hopefully, this list makes the point that preparation pays off, and gives you a place to start for defining your own customized troubleshooting plans.
The author thanks Yu Tang, and all other members of the WebSphere SWAT team, for the many valuable discussions and insights that have led to the recommendations presented in this article.
- The Support Authority: If you need help with WebSphere products, there are many ways to get it
- IBM Software Support Web site
- IBM Support Assistant
- IBM Guided Activity Assistant
- WebSphere First Failure Data Capture facility
- The Support Authority: Features and tools for practical troubleshooting
- The Support Authority: Real time problem determination with WebSphere Diagnostic Providers
- MustGather: Read first for all WebSphere Application Server products
- Common malpractices whitepaper (Eleven ways to wreck a deployment)
- IBM developerWorks
- IBM Redbooks