The Support Authority: Features and tools for practical troubleshooting

Obtaining diagnostic information from your WebSphere Application Server system

IBM® puts a lot of effort into developing and improving mechanisms for obtaining, processing, and analyzing diagnostic information to determine problem cause and resolution. This column discusses some of the practical features and tools that are available to help you troubleshoot WebSphere® Application Server.


Daniel Julin (dpj@us.ibm.com), WebSphere Serviceability Technical Area Lead, IBM India Software Lab Services and Solutions

Daniel Julin has 20 years of experience developing and troubleshooting complex online systems. As Technical Area Lead for the WebSphere Serviceability Team, he currently focuses on helping the team define and implement a collection of tools and techniques to assist in problem determination for WebSphere Application Server, and to maximize the efficiency of IBM Support. He also occasionally assists directly in various critical customer support situations.



Michel Betancourt (betancom@us.ibm.com), WebSphere Serviceability, IBM India Software Lab Services and Solutions

Michel Betancourt has been focused on WebSphere Application Server problem determination within IBM for the past 5 years. He is currently part of the WebSphere Serviceability Development Team, developing and supporting multiple tools and WebSphere user interfaces for the WebSphere Application Server runtime. Michel graduated in 2001 from Florida International University with a bachelor's degree in Computer Engineering and has participated in several publications, including the book IBM WebSphere Application Server for Distributed Platforms and z/OS: An Administrator's Guide.



28 February 2007

Also available in Chinese

From the IBM WebSphere Developer Technical Journal.

In each column, The Support Authority discusses resources, tools, and other elements of IBM Technical Support that are available for WebSphere products, plus techniques and new ideas that can further enhance your IBM support experience.

This just in...

As we announced in the first installment of this column, we will occasionally use this space to alert you to new resources and ideas in the WebSphere Support area. Here are a few items of interest this month:

  1. First a reminder: Starting this year, Daylight Saving Time in the US and several other countries will start three weeks earlier than in previous years. This might affect many products that were released before the legislation governing this change could be anticipated, including several WebSphere-based products. IBM Support has been publishing warnings and updates about this upcoming issue for several months, but now that the time is almost here, it might be worth checking once again to be sure that you are ready. To find out if some of your products require patches, visit the IBM-wide product alerts site. In addition, you might want to listen to a replay of a WebSphere Support Technical Exchange broadcast on this matter.

  2. The IBM Guided Activity Assistant just released a new set of content to guide you through troubleshooting problems related to the J2C component of WebSphere Application Server, as well as problems related to hangs of the Java™ process. This content can be downloaded through the features updater in the IBM Support Assistant.

  3. The Memory Dump Diagnostic for Java tool (MDD4J) just released a beta of Version 2.0, which provides substantially enhanced coverage of different types of heap dumps, as well as several performance improvements and fixes. Expect future refreshes to focus on usability enhancements. This tool can also be downloaded through the features updater in the IBM Support Assistant.


Tools for troubleshooters

Last time, we gave you a very high-level overview of the various support resources available in the WebSphere environment to help you find information, get assistance, and even perform the major functions of a WebSphere troubleshooter. Here, we will go one level down and discuss some of the practical features and tools that are available for troubleshooting WebSphere Application Server.

Need more info?
Unless otherwise noted, refer to the IBM WebSphere Application Server Information Center for more details about each of the items described here. Also, many of the tools discussed in this article are provided through the IBM Support Assistant tool's catalog.

At the core of most troubleshooting exercises, the following fundamental statement usually applies:

I need to get information about what's going on inside my system when it is behaving incorrectly and I need to analyze that information to determine the cause of the problem.

Our development team, therefore, puts a good deal of effort into improving the mechanisms for obtaining and processing that information. Let us survey some of these main areas.


Logging and tracing

The logging facility is most likely the first problem determination feature you will encounter when troubleshooting WebSphere Application Server. It gives users and IBM Support the insight into the runtime that is necessary for basic problem determination.

Other logs: Two other log files also exist in WebSphere Application Server: the JVM native_stdout and native_stderr files. Unlike SystemOut.log and SystemErr.log, these latter files are actually handled by the JVM itself, and contain only messages pertaining to the operation of that JVM and not from the WebSphere Application Server runtime.

The WebSphere Application Server logging infrastructure is based on the standard Java logging infrastructure, java.util.logging. In a typical WebSphere Application Server configuration, logging is set up to write normal and severe log messages to two files, named SystemOut.log and SystemErr.log, respectively.

Key tools used to view logging messages include the WebSphere Application Server administrative console's troubleshooting panels and the Log and Trace Analyzer (LTA).

  • The admin console is the simpler of these two tools. The console provides simple facilities for viewing local or remote logs within the Troubleshooting menu. Although it may lack some of the advanced features that a more sophisticated tool provides, it does enable you to take a quick look at the captured logs and tracing.

  • LTA, on the other hand, provides a richer interface for viewing logs, and can also correlate logs across multiple products and servers. Simply put, it provides an easy way to view logs at a much higher level. LTA exploits the Common Base Events standard, which provides a common format for messages from many different products. LTA is part of the Autonomic Computing Toolkit; a version of LTA is also provided within the WebSphere Application Server Toolkit, which is bundled with WebSphere Application Server.

The same underlying logging infrastructure within the WebSphere Application Server runtime is also used for tracing. The main distinction between the two is one of usage:

  • Logs are typically used only to report the key events in the life of the system. Logs are enabled by default and incur minimal performance overhead. Log messages are also normally translated into the local language and intended for end users and administrators.

  • Traces can be extremely detailed about the sequence of events in the system. They are typically disabled by default and their use could incur a substantial performance overhead, depending on the amount of tracing selected. Trace messages are typically displayed only in English and they are intended for very technical users focused on troubleshooting.
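Because the infrastructure is based on java.util.logging, the log/trace distinction above can be illustrated with plain logging levels. The following is a minimal sketch that uses only the standard java.util.logging API, not any WebSphere-specific class; the logger name com.example.orders, the messages, and the choice of INFO for a log message and FINE for a trace message are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical illustration: one logger producing a log message and a trace message.
final class LogVersusTrace {

    // Collects records in memory so the effect of the configured level is easy to inspect.
    static final class CollectingHandler extends Handler {
        final List<LogRecord> records = new ArrayList<>();
        @Override public void publish(LogRecord r) { if (isLoggable(r)) records.add(r); }
        @Override public void flush() { }
        @Override public void close() { }
    }

    // Emits one log-style message (INFO) and one trace-style message (FINE),
    // and returns the messages that passed the given logger level.
    static List<String> emit(Level level) {
        Logger logger = Logger.getLogger("com.example.orders"); // assumed logger name
        CollectingHandler handler = new CollectingHandler();
        handler.setLevel(Level.ALL);
        logger.setUseParentHandlers(false);
        logger.addHandler(handler);
        logger.setLevel(level);
        try {
            logger.info("Order service started");            // log: key event, always cheap
            if (logger.isLoggable(Level.FINE)) {              // guard skips the work when tracing is off
                logger.fine("entering processOrder id=42");   // trace: detailed, for troubleshooting
            }
        } finally {
            logger.removeHandler(handler);
        }
        List<String> out = new ArrayList<>();
        for (LogRecord r : handler.records) {
            out.add(r.getLevel() + ": " + r.getMessage());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(emit(Level.INFO)); // tracing disabled: only the INFO message passes
        System.out.println(emit(Level.FINE)); // tracing enabled: both messages pass
    }
}
```

The isLoggable guard is the usual way to keep disabled tracing inexpensive, which is why traces can be left in the code but switched off by default.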

Although LTA can also be used to examine trace files, some engineers prefer to use the specialized TraceAnalyzer tool. This tool provides an additional set of features particularly useful for viewing and understanding complex trace files, such as the ability to filter traces by thread or component, derive call stack information from a long trace, detect gaps in the timeline, and so on.

Other facilities related to logging include:

  • IBM Service Logs: (activity.log) This facility consolidates all the key messages on a particular WebSphere Application Server node, and also contains extended service information that is useful for problem determination.

  • JMX-based Monitoring: In addition to being logged to simple files, most key WebSphere Application Server events handled by the logging infrastructure are also exported as JMX events. This enables a variety of tools to be built for remotely monitoring and capturing logging information. The Tivoli Monitoring for Web Infrastructure tool is the prime example of a tool that exploits this capability.
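To give a feel for how a monitoring tool can consume events exported over JMX, here is a minimal sketch built only on the standard javax.management classes. It does not connect to a WebSphere Application Server process: the broadcaster stands in for the server side, and the notification type example.log.warning, the source name server1, and the message text are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import javax.management.Notification;
import javax.management.NotificationBroadcasterSupport;
import javax.management.NotificationListener;

// Hypothetical sketch: a listener receiving a log-style JMX notification.
final class LogEventMonitor {

    // Registers a listener, emits one notification, and returns what the listener saw.
    static List<String> demo() {
        // Stand-in for the component that exports events as JMX notifications.
        NotificationBroadcasterSupport source = new NotificationBroadcasterSupport();

        List<String> received = new ArrayList<>();
        NotificationListener listener =
            (notification, handback) ->
                received.add(notification.getType() + ": " + notification.getMessage());

        // No filter and no handback object: the listener sees every notification.
        source.addNotificationListener(listener, null, null);

        // With the default configuration, listeners are invoked in the calling thread.
        source.sendNotification(new Notification(
            "example.log.warning",              // assumed notification type
            "server1",                          // assumed source identifier
            1L,                                 // sequence number
            "Connection pool near capacity"));  // assumed message

        return received;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

A real monitoring tool would register the same kind of listener through a remote MBean server connection rather than on a local broadcaster.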


Specialized tracing and run time checks

The WebSphere Application Server runtime also provides several specialized forms of runtime checks and tracing to help diagnose some very specific common problems. Two examples are:

  • Connection leak detection: Special trace that helps identify the culprits when database connections are not properly released after being used by an application.

  • Session crossover detection: Special trace and check that attempts to detect situations where information from one HTTP session is accidentally made available in another user's HTTP session, due to defects or handling errors in the application or runtime.

These facilities are enabled either by tracing a specific component or by setting a specialized custom property. Similar facilities for other common problems might be added to the product as the need becomes apparent.


First failure data capture

First failure data capture (FFDC) is a facility built into the WebSphere Application Server runtime that attempts to automatically capture and save key information whenever a potentially abnormal situation occurs. Since many problems encountered with WebSphere Application Server are associated with some sort of Java exception, FFDC monitors all exceptions that are thrown during the operation of the server. Whenever an exception is thrown, it is examined in real-time to determine if the exception is unexpected or could potentially be part of an emerging problem. If so, FFDC writes a record in a file (an FFDC incident record) containing the stack trace, the circumstances of the exception, and, optionally, a short dump of the state of the components of the server that just generated this exception. These FFDC incident records can later be examined post-mortem to gain some insight into what happened.

The information captured by FFDC can be used to help diagnose a very broad range of problems (in theory, any problem that is closely linked to a particular exception). However, in practice today, FFDC incident records can be challenging to interpret, mainly because it is difficult to reliably determine in advance which exceptions are benign and which will turn out to be critical to diagnosing some problem. Therefore, the FFDC facility tends to generate many incident records. (It is better to capture a large record and not need it, than it is to fail to capture one when it really is needed.) We will provide a tutorial on configuring and exploiting the FFDC facility in a future article.
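The basic idea described above (examine each exception, decide whether it is unexpected, and if so save a record containing the stack trace and its circumstances) can be sketched in a few lines. This is not the FFDC API or its record format: the class IncidentRecorder, its methods, the set of "benign" exception types, and the record layout are all hypothetical, and the records are kept in memory rather than written to a file.

```java
import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of an incident recorder; not the actual FFDC implementation.
final class IncidentRecorder {
    private final Set<Class<? extends Throwable>> benign;
    private final List<String> incidents = new ArrayList<>();

    IncidentRecorder(Set<Class<? extends Throwable>> benign) {
        this.benign = benign;
    }

    // Examines an exception; returns true if an incident record was saved for it.
    synchronized boolean examine(Throwable t, String sourceId) {
        for (Class<? extends Throwable> expected : benign) {
            if (expected.isInstance(t)) {
                return false; // considered expected, so nothing is recorded
            }
        }
        StringWriter text = new StringWriter();
        PrintWriter out = new PrintWriter(text);
        out.println("source=" + sourceId);                            // which component reported it
        out.println("thread=" + Thread.currentThread().getName());   // circumstances
        t.printStackTrace(out);                                       // the stack trace itself
        out.flush();
        incidents.add(text.toString());
        return true;
    }

    // Returns a copy of the records captured so far, for post-mortem inspection.
    synchronized List<String> incidents() {
        return new ArrayList<>(incidents);
    }

    public static void main(String[] args) {
        Set<Class<? extends Throwable>> benign = new HashSet<>();
        benign.add(FileNotFoundException.class); // assumed to be harmless in this example
        IncidentRecorder recorder = new IncidentRecorder(benign);
        recorder.examine(new FileNotFoundException("optional.properties"), "ConfigLoader");
        recorder.examine(new IllegalStateException("pool exhausted"), "ConnectionManager");
        recorder.incidents().forEach(System.out::println);
    }
}
```

The difficulty mentioned above shows up in the benign set: any fixed list of "expected" exceptions is a guess made in advance, so a recorder like this tends either to miss something or to capture a great deal.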


Diagnostic Providers

Diagnostic Providers is a new facility introduced with WebSphere Application Server V6.1 that enables you to selectively query a particular component within the application server when you suspect a problem has just occurred, and obtain detailed information about the state of that component. The Diagnostic Provider for each component can be used to initiate a self-test of that component, get a dump of the static configuration information associated with that component, and get a dump of the run time state of that component. The amount of run time state diagnostic data captured by each diagnostic provider can further be configured to manage the performance overhead. Diagnostic Providers are controlled through the administration console or through the wsadmin command-line tool.

In addition, several log and error messages generated by the WebSphere Application Server runtime now include a Diagnostic Provider ID, which uniquely identifies the Diagnostic Provider that is at the source of the error being reported. You can then query that particular Diagnostic Provider to obtain more information and diagnose the cause of the error.

Currently (in Version 6.1), specialized Diagnostic Providers exist for the Connection Manager component, for the WebContainer component, and for the System Management component, as well as for the performance and diagnostic advisers, discussed later. Expect additional Diagnostic Providers to be implemented in future versions of WebSphere Application Server, as well as a growing collection of tools to exploit the information that they provide. Applications can also leverage the Diagnostic Providers framework to provide their own diagnostic data.


JVM-level diagnostics

At its core, WebSphere Application Server is, first, a JVM process. Therefore, it is natural to also consider the array of diagnostic facilities provided for all JVM processes. A substantial percentage of all problems encountered by WebSphere Application Server users are, in fact, problems that manifest themselves first at the JVM level, such as out-of-memory conditions, crashes, and so on.

  • The verboseGC log is probably the most common type of JVM diagnostic. It shows the sequence of garbage collection cycles that occurred throughout the life of that JVM. This is often invaluable to use as an initial problem determination aid for detecting and diagnosing all kinds of anomalous memory allocation issues within that JVM, such as memory leaks, fragmentation, performance issues related to GC, and so on. The PMAT tool, available in IBM Support Assistant, is currently the primary tool available to help analyze the contents of verboseGC log files.

  • Thread dumps are also a very common type of JVM diagnostic. A thread dump (also known as a javacore) can be triggered on request by an administrator, or automatically when some special condition is encountered in the JVM. A thread dump is a text file that contains a relatively short snapshot of the key aspects of the state of that JVM. The most often used part of that snapshot is the list of currently active threads in the JVM, hence its name. Thread dumps are most commonly used to diagnose the cause of hangs, slowdowns, crashes, or excessive CPU consumption in the JVM. Since thread dumps are (relatively) short text files, they can be examined with a simple text editor. However, it is often more effective to use a special tool that parses and organizes the contents, and automatically detects and highlights key information and anomalies. Two main tools are available today for this purpose: the ThreadAnalyzer, available in IBM Support Assistant, and the Thread and Monitor Dump Analyzer, available on alphaWorks.

  • Heap dumps are another form of dump that can also be generated by a JVM, either on demand or automatically when special conditions occur. A heap dump is typically a very large file that contains a list of all the objects currently on the JVM heap. It is used to perform in-depth analysis when out-of-memory conditions are observed. One can, for example, find out which objects take up the most space in the heap, which objects are proliferating, and so on. Because a heap dump is such a large file, it is not practical to attempt to examine it by hand. The Memory Dump Diagnostic for Java tool (MDD4J), available in IBM Support Assistant, is currently the main tool provided to perform this analysis.

  • The third kind of JVM dump is the system dump, or simply core file. This is the most expensive dump, but also the most complete. It is a large binary file that reflects the entire contents of the JVM process: every Java object and its fields, every thread, every memory region, and so on. The initial use of a system dump is to help diagnose crashes, hangs, or complex memory allocation issues in cases when the other types of dumps are insufficient or cannot be generated. However, because the system dump is so complete, it can also be used to gain information about many aspects of the current state of the WebSphere Application Server runtime or even the applications executing in that runtime. Expect to see more uses of system dumps for this purpose in the future. There are relatively few tools available externally to examine the contents of a WebSphere Application Server JVM system dump. Therefore, system dumps must typically be sent to IBM Support for in-depth analysis. However, IBM recently introduced a new technology, called Diagnostic Tooling Framework for Java (DTFJ), that makes it easy to build various tools to examine system dumps. Expect new tools based on DTFJ technology to become widely available in the future.

  • Finally, the JVM also provides its own JVM tracing facility -- distinct from the WebSphere tracing facility -- that provides tracing at the level of individual Java method invocations, as well as events internal to the operation of the JVM implementation itself. This type of tracing is currently used mostly to diagnose internal JVM problems, and (somewhat rarely) to diagnose WebSphere Application Server-level problems. However, the method level trace can be very useful as an adjunct to WebSphere tracing. We plan to document and expand its use in the near future.

The Java Diagnostics Guide is the primary source of information about these various types of JVM-level diagnostic facilities and how to generate the corresponding dumps. Each of the analysis tools mentioned above provides its own documentation.
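Some of the same kinds of information (garbage collection activity and the list of active threads) can also be read from inside a running JVM through the standard java.lang.management API. The sketch below is only an illustration of what such data looks like; it does not produce a verboseGC log or a javacore, and the class name JvmSnapshot and the output layout are our own assumptions.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical sketch: an in-process snapshot of GC activity and thread state.
final class JvmSnapshot {

    // One line per collector: name, number of collections, accumulated time in ms.
    static String gcSummary() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(gc.getName())
              .append(" count=").append(gc.getCollectionCount())
              .append(" timeMs=").append(gc.getCollectionTime())
              .append('\n');
        }
        return sb.toString();
    }

    // Text snapshot of all live threads, their states, and their stack frames.
    static String threadSnapshot() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // false, false: skip lock details to keep the snapshot short.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            sb.append('"').append(info.getThreadName()).append('"')
              .append(" state=").append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        // Returns null when no threads are deadlocked on object monitors.
        long[] deadlocked = threads.findMonitorDeadlockedThreads();
        sb.append("deadlocked threads: ")
          .append(deadlocked == null ? 0 : deadlocked.length).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(gcSummary());
        System.out.print(threadSnapshot());
    }
}
```

Even in this reduced form, the thread list shows why a dedicated analyzer helps: a busy server has hundreds of threads, and the interesting ones are those that are blocked or that stay in the same frame across successive snapshots.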


Performance-related tools

Though their primary purpose is to monitor, measure, and tune the performance of the system, the various performance-related tools also provide an important mechanism to gain insight into the internal state of an application server, which can be invaluable to diagnose a variety of problems, whether these problems are directly performance-related or not.

The WebSphere Application Server runtime contains two main types of performance instrumentation:

  • The Performance Monitoring Infrastructure (PMI) provides counters for a wide variety of statistics reflecting the internal operation of the server, such as number of requests per second being processed in each servlet or EJB, average response times, utilization rate of various resources (such as threads, database connections), and so on. PMI is a WebSphere-specific facility that provides in-depth information from WebSphere Application Server and some other related products.

  • Request metrics provide a mechanism to trace the flow of execution of a single request through the system, and measure the processing time at each step of that flow. The information from the request metrics instrumentation can be accessed using the standard Application Response Measurement (ARM) infrastructure. Request metrics are available from many products, making it possible to follow a request end-to-end through a complex system, even as it involves many different server components and layers.

The information provided by these two facilities is useful for general troubleshooting in two ways:

  • Problem isolation: By observing which subsystems or components of the overall system are active or have received a particular request for processing, we can often deduce which component is responsible when requests are no longer flowing, or are not completing normally.

  • Problem identification: By comparing the various statistics provided by PMI or request metrics with their normal values in a healthy system, we can discover specific problems that manifest themselves in a particular abnormal statistic, such as a resource over-utilization, overflows, and so on.
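The problem identification approach (compare current statistics with their values in a healthy system) is simple enough to sketch. The code below does not use the PMI or request metrics APIs; the statistic names, the sample values, and the rule "flag anything more than 1.5 times its baseline" are all assumptions chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: flag statistics that deviate from a healthy baseline.
final class BaselineCheck {

    // Returns, for each statistic whose observed value exceeds baseline * tolerance,
    // the ratio of the observed value to the baseline value.
    static Map<String, Double> anomalies(Map<String, Double> baseline,
                                         Map<String, Double> observed,
                                         double tolerance) {
        Map<String, Double> flagged = new LinkedHashMap<>();
        for (Map.Entry<String, Double> entry : observed.entrySet()) {
            Double normal = baseline.get(entry.getKey());
            // Statistics with no baseline are skipped: there is nothing to compare against.
            if (normal != null && entry.getValue() > normal * tolerance) {
                flagged.put(entry.getKey(), entry.getValue() / normal);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        Map<String, Double> baseline = new LinkedHashMap<>();
        baseline.put("connectionPool.percentUsed", 40.0);   // assumed healthy values
        baseline.put("servlet.avgResponseMillis", 120.0);

        Map<String, Double> observed = new LinkedHashMap<>();
        observed.put("connectionPool.percentUsed", 98.0);   // assumed current values
        observed.put("servlet.avgResponseMillis", 125.0);

        // Only the connection pool statistic exceeds 1.5 times its baseline.
        System.out.println(anomalies(baseline, observed, 1.5));
    }
}
```

The important practical point is the baseline itself: the comparison only works if statistics were collected while the system was healthy.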

Both PMI and request metrics information are exported through public APIs, which makes it possible to write specialized or third-party tools to exploit this information. The Tivoli® Performance Viewer, within the WebSphere Application Server admin console, is the primary tool for reviewing PMI information (and only PMI information). The IBM Tivoli Composite Application Manager family of tools (ITCAM) offers a more comprehensive platform for working with performance diagnostics, including PMI, request metrics, and other techniques, for a broad range of products and application environments.


Monitoring and detecting problems

The WebSphere Application Server runtime also contains several important facilities for monitoring and detecting problems as they start to occur, as opposed to after the fact. If you run a very stable and healthy application, you may not notice them running at all, but if you have a problem you may notice warnings from facilities such as:

  • The hung thread detection facility helps diagnose hanging or slow performance problems by automatically warning when a thread seems to take too long to complete a request. It provides a hook into the WebSphere Application Server runtime by letting you specify a timeout period for threads, which is called the hang threshold. If your application runs beyond this specifiable hang threshold, a notification will be sent out about the potentially hung thread.

  • The performance and diagnostic advisers monitor the system in the background and provide advice on specific WebSphere Application Server runtime component settings or JVM settings. There are many types of advice provided through this facility, including pool setting advice for ORB and Web pools, session setting advice, memory leak detection, or even data source diagnostic advice. Each adviser can be enabled or disabled at will through the WebSphere Application Server admin console.
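The hang threshold idea behind hung thread detection can be sketched as a small watchdog: worker threads note when they start and finish a request, and a monitor periodically asks which threads have been busy for longer than the threshold. This is not how the WebSphere Application Server runtime implements the facility; the class HangWatchdog and its methods are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of threshold-based hung thread detection.
final class HangWatchdog {
    private final long thresholdMillis;
    // Start time of the request each worker thread is currently processing.
    private final Map<Thread, Long> startTimes = new ConcurrentHashMap<>();

    HangWatchdog(long thresholdMillis) {
        this.thresholdMillis = thresholdMillis;
    }

    // Called by a worker thread when it begins processing a request.
    void requestStarted(long nowMillis) {
        startTimes.put(Thread.currentThread(), nowMillis);
    }

    // Called by the same worker thread when the request completes.
    void requestFinished() {
        startTimes.remove(Thread.currentThread());
    }

    // Called periodically by a monitor; returns the names of threads that have
    // been working on a request for longer than the threshold.
    List<String> suspects(long nowMillis) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<Thread, Long> entry : startTimes.entrySet()) {
            if (nowMillis - entry.getValue() > thresholdMillis) {
                names.add(entry.getKey().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        HangWatchdog watchdog = new HangWatchdog(1000); // assumed 1-second hang threshold
        watchdog.requestStarted(0);
        System.out.println(watchdog.suspects(500));  // within the threshold: no suspects
        System.out.println(watchdog.suspects(1500)); // over the threshold: current thread is reported
        watchdog.requestFinished();
        System.out.println(watchdog.suspects(1500)); // request completed: no suspects
    }
}
```

Note that a thread over the threshold is only a suspect: a legitimately long-running request looks the same as a hang, which is why the facility issues a warning rather than taking action.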


Investigating problems related to one specific subsystem

In addition to the facilities listed so far, all of which are broadly applicable to a wide range of situations you might encounter with WebSphere Application Server, there is also an ever-expanding collection of specialized tools and facilities that are targeted at a specific type of problem or subsystem. For example:

  • The System Management Configuration Validation facility can perform automated checks to detect inconsistencies and errors in the complex set of XML files that contain the entire WebSphere Application Server system configuration. Such errors, though relatively rare in recent versions of the product thanks to many run time safety checks, can still crop up due to as-yet-undiscovered product defects, unexpected events occurring during configuration operations (like crashes), or operator mistakes during configuration. This facility is embedded inside the WebSphere Application Server runtime itself, and can be invoked from the admin console (in the Troubleshooting panel).

  • The DumpNameSpace tool provides a simple dump of the contents of the JNDI name tree visible to applications at a particular server. This is typically used to help sort out problems caused by incorrect configuration of the JNDI resources in the server, or incorrect access to these JNDI resources by application code. The DumpNameSpace tool is a standalone program, shipped in the WebSphere Application Server installation bin directory.

  • The Class Loader Viewer gives administrators the ability to peek into the classloading subsystem of an application server and its sometimes complex configuration. Typical problems resolved by the Class Loader Viewer pertain to classloading issues, the most familiar case being the ClassNotFoundException. The Class Loader Viewer is packaged and shipped with WebSphere Application Server and can be enabled from the Troubleshooting menu of the admin console.
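When the Class Loader Viewer is not at hand, a rough picture of the class loader hierarchy seen by a particular class can be obtained with a few lines of standard Java. The sketch below is not part of WebSphere Application Server; the class name LoaderChain is hypothetical, and the output depends entirely on the environment in which it runs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: list the chain of class loaders behind a given class.
final class LoaderChain {

    // Walks from the loader that defined the class up through its parents.
    static List<String> of(Class<?> type) {
        List<String> chain = new ArrayList<>();
        ClassLoader loader = type.getClassLoader();
        while (loader != null) {
            chain.add(loader.getClass().getName());
            loader = loader.getParent();
        }
        // The bootstrap loader is represented by null, so it is added by name.
        chain.add("bootstrap");
        return chain;
    }

    public static void main(String[] args) {
        System.out.println(of(LoaderChain.class)); // loaders behind this application class
        System.out.println(of(String.class));      // core classes come from the bootstrap loader
    }
}
```

Printing this chain for the class that fails to find another class is often a quick first step in understanding a ClassNotFoundException, because it shows which loaders were actually consulted.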


Tools that address installation issues

Several tools are available to assist with installation-specific problems, each with a unique and specific purpose:

  • The Installation Verification Tool enables you to test the basic health of a WebSphere Application Server profile by running a simple test on an individual profile in a particular installation. This tool is often used just after a product installation to confirm that everything was properly installed.

  • When you need to perform a more in-depth check of the integrity of a WebSphere Application Server installation, such as noting if there are any file level changes or inconsistencies present, you can use the Installation Verification Utility, which enables you to peek into the file-level changes that occurred over the course of the system's lifecycle. This utility also provides a facility to help IBM Support determine if the supported file sets are properly installed and located within a WebSphere Application Server installation.

In addition to these, WebSphere Application Server contains additional problem determination aids to help you apply service (fix) packs and determine the additional set of fixes installed:

  • Update Installer: Enables you to apply different service packs or individual fixes to a WebSphere Application Server installation.

  • VersionInfo, HistoryInfo, and GenHistoryReport: These tools enable you to query a WebSphere Application Server installation and determine the level of software found or previously installed. GenHistoryReport displays this content in HTML form.

All these tools and commands are bundled with the standard WebSphere Application Server installation.


Debuggers and profilers

The ability to debug and profile is also often very valuable to the application support process. WebSphere Application Server provides these facilities through the JVM, either through the JVMPI or JVMTI interfaces of that JVM. WebSphere Application Server system management makes it easy to set up the appropriate JVM parameters to enable these facilities, either through the WebSphere Application Server admin console, or through a wsadmin script.

Rational® and other Eclipse-based development tools, including Rational Application Developer and the WebSphere Application Server Toolkit, include a powerful debugger and profiler tool that connects to these facilities. For profiling-related work, you might also consider using the Performance Inspector family of tools, which provides a variety of tools to extract and analyze run time performance information from a JVM, using these same basic interfaces. These tools are available on alphaWorks (for Windows) or SourceForge (for Linux).


Conclusion

Once again, each of these topics could fill an entire article on its own, but it is nonetheless important to know about the range of tools that are available so you can better identify the best tool or facility to help you with a given problem. We hope this article has contributed to your understanding of the tools at your disposal. As we continue with this column, we will continue to drill down and provide you with more practical support-related information.
