In each column, The Support Authority discusses resources, tools, and other elements of IBM Technical Support that are available for WebSphere products, plus techniques and new ideas that can further enhance your IBM support experience.
As always, we begin with some new items of interest for the WebSphere® community at large:
- Several new topics dealing with problems in the Java™ runtime environment have been added to the IBM Guided Activity Assistant, including new extended processes for diagnosing various types of out-of-memory conditions, both in the Java heap and in native memory. This new information is currently available as beta content in the alphaWorks version of the IBM Guided Activity Assistant, and will be made available in the IBM Support Assistant version shortly.
- Two new problem determination tools were released this month:
- The Garbage Collection and Memory Visualizer tool (formerly EVTK) is now available on IBM Support Assistant Version 4.0. This tool analyzes verboseGC logs from a JVM to help diagnose many types of memory issues. Earlier versions of this tool were available on IBM Support Assistant Version 3.1, but the new version brings many enhancements and tighter integration with the IBM Support Assistant platform. Check out this brief summary of all the tools available in IBM Support Assistant Version 4.0.
- The Database Connection Pool Analyzer for IBM WebSphere Application Server has been released in alphaWorks. This tool analyzes a trace from the database connection pool component of WebSphere Application Server, and summarizes the status and usage of this connection pool over time.
- Various updates and bug fixes have also been released for other tools:
- Log Analyzer and Symptom Editor are available in IBM Support Assistant Version 4.0.
- IBM Pattern Modeling and Analysis Tool for Java Garbage Collector and IBM Thread and Monitor Dump Analyzer for Java Technology are available in alphaWorks.
- Featured Documents, a collection of the most frequented documents from the IBM Support Web sites for various products, has been updated for many products. For a sample, see the Featured documents for WebSphere Application Server.
- The IBM Support Assistant team continues to hold public demos to help you learn more and ask questions about IBM Support Assistant Version 4.0. Check the latest schedule for upcoming Web conference demos.
- New content in the IBM Education Assistant was recently published to help you learn about both IBM Support Assistant V4.0 and the IBM Guided Activity Assistant.
Continue to monitor the various support-related Web sites, as well as this column, for news about other tools as we encounter them.
And now, on to our main topic...
When you encounter a new problem, how do you decide what to do? Where do you start? What do you look for? How can you become more effective at troubleshooting? What you need is a methodology for problem determination.
By its very nature, problem determination is about dealing with the unknown and the unexpected. If you knew in advance everything about all the problems that you could encounter and exactly how they manifest themselves, then you would take measures to prevent them and wouldn’t have to investigate them. You cannot expect problem determination to be a perfectly predictable process, but there are a number of common approaches that can make the process go more smoothly and be more effective.
This article is based on the experience and observations of members of the IBM Support, Serviceability, and SWAT organizations from years of helping our clients, and from seeing both best and worst practices in action. This is an evolving work, as we continue to look for new ways to further enhance the investigative process.
Looking back, we can identify several common challenges that can make problem determination exercises difficult to resolve:
Need for direction, or "What do I do next?"
Sometimes the people involved in resolving a problem simply don’t know where to start or what to do at each step. Problems can be complex and it is not always obvious how to approach them. This article will provide some general guidance to help you get started finding and deciding what to do at each step of the process for a broad range of problems.
Need for information, or "What does it mean?"
Sometimes what’s missing is simply information: you see some sort of diagnostic message or diagnostic file, but you don’t know how to interpret it or can’t understand how it relates to the problem at hand. You need good sources of information and tools to help you interpret all the clues discovered in the course of an investigation.
Miscommunication and lack of organization, or "What was I doing? What were you doing?"
Sometimes time is wasted or important clues are lost because of miscommunication, or because the investigation has dragged on for so long, making the collected information more difficult to manage. Events, timelines, and artifacts that are often invaluable to determining next steps of an investigation and communicating progress can easily get lost or forgotten.
Dealing with multiple unrelated problems or symptoms, or "What are we looking for?"
A particular challenge in complex situations is not knowing whether you are dealing with a single problem or with multiple independent problems that happen to occur at the same time. You might see a variety of symptoms, some of which relate to one problem, some to another, and others that are simply incidental and benign. Being able to distinguish between the “noise” and the real problems can go a long way toward a timely resolution.
A common mistake when troubleshooting is to jump to specific analysis or isolation steps without taking the time to properly characterize the problem. Many investigations take longer than necessary or go off on the wrong track because they look for the wrong problem or miss critical elements that would have helped direct the research. In other cases, time is wasted because miscommunication has caused various parties in the investigation to have different understandings of the situation.
When studying journalism, aspiring reporters are taught to approach every news story by asking Who, What, When, Where, and Why. You can use a variation of this principle when investigating software problems:
What happened?
- What are the main symptoms that led you to determine that there is a problem, as opposed to all other ancillary symptoms?
- Exactly how would the system have to look or act for you to consider that the problem is no longer present?
- How would you recognize that the same problem happened again?
- Be cautious about using vague terms like “hang,” “crash,” and “fail,” which are often generalizations, inaccurate, and distract attention away from important symptoms.
- Be aware that in real-world situations, there can be several independent problems rather than just one. You need to recognize them and prioritize them.
- Be conscious of tangential problems that are consistent and well-known (for example, an application error that is always written to the same log in which the original problem occurs). Sometimes these problems are incidental and not worth investigating, and sometimes they can be related to the problem.
Where did it happen?
- Be precise about which machine, which application, which processes, and so on, the problem was observed on.
- Which logs and which screens should you look at to see the problems?
- Know the overall environment surrounding the problem (for example, find out the system topology, network topology, application overview, software versions, and so on).
When did it happen?
- Track time stamps for where to look in the logs. Note time zone offsets and whether or not multiple systems have synchronized clocks.
- Are there any special timing circumstances? For example:
- Every day at a particular time of day?
- Every time you try to perform a particular operation, or every time a particular system process executes?
- Every time a particular user or batch process starts processing?
- Did the problem happen only once or does it occur regularly? Is the problem repeatable at will?
- Has the problem been reproduced before, for example during load or stress testing?
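As an illustration of the time-stamp bookkeeping above, the following sketch (standard-library Python only; the stamp format is an assumption for illustration) normalizes log time stamps from machines in different time zones to UTC so that entries from multiple systems can be correlated:

```python
from datetime import datetime, timezone, timedelta

def to_utc(stamp, utc_offset_hours):
    """Parse a local log time stamp (format assumed) and normalize it to UTC."""
    local = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    tz = timezone(timedelta(hours=utc_offset_hours))
    return local.replace(tzinfo=tz).astimezone(timezone.utc)

# Two servers logging the same incident in different local times:
east = to_utc("2009-01-05 12:30:04", -5)  # US Eastern, UTC-5
utc  = to_utc("2009-01-05 17:30:04", 0)   # already UTC
print(east == utc)  # True -- once normalized, the entries line up
```

Normalizing everything to UTC up front avoids repeated mental offset arithmetic when the investigation spans several machines with unsynchronized or differently configured clocks.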
Why did it happen?
- Why did the problem happen now, and not earlier? What has changed?
- Why does the problem happen here, on this system, and not on other similar systems? What is different?
- Has it ever worked correctly in the past?
What do you know?
Finally, it helps to clearly summarize all the information available. Make a list of all the symptoms and anomalies, whether or not they seem related to the problem.
Before jumping into more complex techniques, it is often useful to approach each new problem in two phases:
In Phase 1, you perform a broad scan of the entire system after a problem has occurred to find any major error messages or dump files that have been generated in the recent past. Each of these errors and dumps constitutes an initial symptom for the investigation. Search these symptoms across one or more knowledge bases of known problems.
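A Phase 1 broad scan can be as simple as sweeping every log for error or warning message IDs and noting where each appears. The sketch below is illustrative only; the message-ID pattern (four or five letters, four digits, and an E or W suffix) is an assumed convention, not an official specification, and should be adjusted for the products in your environment:

```python
import re
from pathlib import Path

# Assumed message-ID convention (e.g. WSVR0605W); adjust for your products.
ERROR_PATTERN = re.compile(r"\b[A-Z]{4,5}\d{4}[EW]\b")

def scan_logs(log_dir):
    """Broad Phase 1 scan: collect candidate error/warning message IDs
    from every log file, noting which logs they appeared in."""
    symptoms = {}
    for log in Path(log_dir).glob("**/*.log"):
        for line in log.read_text(errors="replace").splitlines():
            for msg_id in ERROR_PATTERN.findall(line):
                symptoms.setdefault(msg_id, set()).add(log.name)
    return symptoms  # search each key against the knowledge bases
```

Each key in the result is an initial symptom to search against knowledge bases; the associated log names feed directly into the table of problems, symptoms, and actions.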
In Phase 2, you do everything else: Select one or more initial symptoms for additional investigation beyond a simple search for known problems, perform specific diagnostic actions to generate additional symptoms or information (analysis or isolation approach, described below) that are typically specific to the problem under investigation, and then repeat the process as needed until a solution is found.
Phase 1 is clearly easier than Phase 2, since it is a single step rather than an iterative process, and does not typically require significant prior knowledge that is specific to a particular problem. In practice, a large percentage of problems can be resolved by a systematic execution of Phase 1 type activities, so it is well worth starting here in most if not all situations. Moreover, even if a solution is not found at the end of Phase 1, the set of initial symptoms collected during this phase is usually just the kind of information that you need to start Phase 2. Use this collected information to populate the table of problems, symptoms, and actions described below. Conversely, if you omit Phase 1, you can miss important clues that might take much longer to find during a later, focused investigation. For example, if you focus too soon on one particular server that is responding abnormally, you might not notice that another server in the system has been going down or that there have been network errors, both of which are conditions that could indirectly affect the abnormal behavior that is occurring on your server.
The concept of “Phase 1 problem determination” has been steadily gaining acceptance, and IBM Support is developing specialized tools to facilitate this activity. For example, the Log Analyzer tool (available in IBM Support Assistant), coupled with a set of broad automated log collection scripts (also in IBM Support Assistant) can be used for this purpose. (See Resources.)
Before launching a more in-depth investigation, you need to consider the business context in which the problems -- and your investigation -- occur. In every investigation, you must balance two related, but distinct goals:
- Resolution: Finding the root cause and ensuring that it will not happen again.
- Relief: Enabling your users to resume productive work.
Unfortunately, these two goals are not always simultaneously attainable. A simple reboot is often enough to bring the system back, but it can destroy information needed to debug the problem. Even actions that simply gather or report information, such as running a system dump, could cause the system to be down for an unacceptable period of time. Finally, the resources needed for a complete investigation might also be needed for higher priority work elsewhere.
In some situations, such as system or integration testing, driving the problem to full resolution is required and relief is not an issue. In other situations, such as business-critical production systems, immediate relief is paramount. Once the system is back up, the investigation begins with whatever data is available or salvaged.
The team working on the project must be clear about the business priorities and tradeoffs involved, and revisit the question periodically. Finding that the problem recurs frequently might raise the importance of full resolution; however, finding that the scope of the problem is quite limited may tilt the scales toward simple relief.
Phase 2 involves more advanced problem determination activities and techniques, but fundamentally, all troubleshooting exercises boil down to this: you watch a software system that exhibits an abnormal or undesirable behavior, make observations and perform a sequence of steps to obtain more information to understand the contributing factors, then devise and test a solution. Within this very broad context, it is useful to recognize two distinct but complementary approaches:
In the analysis approach, you pick one or more specific symptoms or anomalies observed in the system, and then drill down to gain more detailed information about these items and their causes. These, in turn, might lead to new, more specific symptoms or anomalies, which you further analyze until you reach a point where they are so simple or fundamental that it is clear what caused them.
Typical techniques used in this approach include searching knowledge bases to help interpret symptoms, and using traces or system dumps to find more specific symptoms that you can then research further with knowledge bases.
For example, consider a problem in which an application server appears to be crashing. By analyzing the diagnostic material from the crash, you can determine that the crash originated in a library called mylib.so. By looking at the source code for the library and taking the native stack trace information from the gathered diagnostic material, you can see that a bit of code creates a pointer to a memory location, but does not handle it correctly. This results in an illegal operation and subsequent crash.
In the isolation approach, rather than focusing on one particular symptom and analyzing it in ever greater detail, you look at the context in which each symptom occurs within the overall system and its relation to other symptoms, and then attempt to simplify and eliminate factors until you are left with a set of factors so small and simple that, again, it is clear what caused the problem.
Typical techniques used in this approach include performing specific experiments to observe how the system’s behavior changes in response to specific inputs or stimuli, or breaking down the system or some operations performed by the system into smaller parts to attempt to see which of these parts is contributing to the problem.
For example, consider a large WebSphere Application Server environment, consisting of many nodes across several physical machines, in which an accounting application, deployed into two clusters, is having long response times. Using the isolation approach, you might opt to trace the application servers that are involved along with the network links between the servers. This method would enable you to isolate the cause of the slowdown between the network and the application servers, making it possible for more in-depth investigation to be performed on the affected component.
At first glance, it could seem that the steps followed by someone trying to troubleshoot a complex problem are random and chaotic, fueled by the hope of stumbling upon the solution. In reality, a skilled troubleshooter rarely performs any step without a very specific objective that is rooted in one of these two approaches. Understanding these approaches will help you formulate steps to follow in each of your own investigations. Conversely, if you can’t justify a step based on one or both of these approaches, chances are good that you might be relying a little too much on luck.
Now, although a clear distinction is made here between the analysis and isolation approaches, in practice they are not mutually exclusive. In the course of a complex investigation, you will often take steps from both approaches, either in succession or in parallel, to pursue multiple avenues of investigation at the same time.
As mentioned earlier, many investigations suffer from imperfect communication and organization. For non-trivial problems (generally, anything requiring Phase 2 problem determination), you should generally keep four types of information:
Executive summary
This is a short paragraph that summarizes the status of the highest priority issues, what has been done in the last interval, and what the top action items are, with owners and approximate dates. This enables both stakeholders and those who are only marginally involved to understand the current status. Not exclusively for “executives,” this information helps focus the team, explains progress and next steps, and should highlight any important discoveries, dependencies, and constraints.
Table of problems, symptoms, and actions
Phase 2 investigations can sometimes suffer when the number of problems to be resolved grows prodigiously. It is important to keep a written list of these additional problems and not rely on collective memory. This table is a crucial piece of record-keeping and should be kept for all situations, no matter how simple they seem at first. When the situation is simple, the table is simple as well and easy to maintain. The effort to track this information will pay off on those (hopefully rare) occasions when things are much more complicated than originally believed. By the time that complexity is realized, it is almost impossible to recreate all of the information that you will want to have kept.
The actual format of this table, its level of detail, and how rigorously it is used will vary between each individual troubleshooter and each situation. Regardless of the exact format, this table should contain:
- Problems: One entry for each problem that you are attempting to resolve (or each thing that needs to be changed).
- Symptoms: External manifestations of the underlying problem and anomalies that might provide a clue about the underlying problem. A symptom might be an error message observed in a log, or a particular pattern noticed in a system dump; the problem is the error condition or crash itself. Problems can be fixed; symptoms go away (or not) as changes are made. Sometimes new symptoms appear during an investigation.
- Actions: Tasks to be performed that may or may not be directly related to a particular symptom or problem, such as upgrading the software or preparing a new test environment.
- Fixes: Alternatives to be tried to achieve a resolution or workaround. (Some troubleshooters list these as actions.)
- Theories: It is useful to track ideas about why the problem is occurring or how it might be fixed, along with actions that could be taken to test them. Noting which symptoms the theories are derived from can help rule out theories or draft new ones for the investigation.
Regardless of the actual format, you should constantly review and update the table with your team so that it reflects the current state of the investigation (that is, what is known and what is not known, what to do next, and so on). When there are multiple problems, it is important to group symptoms with their corresponding problem, and to review these relationships frequently.
Finally, each entry should be prioritized. This is really the key to staying organized and methodical throughout a complex investigation. Do not attempt to use the table as a historical record of the progress and activities in the investigation. This table will typically be complex enough just keeping track of the current state of things. The timeline (covered next) will contain the historical information needed for most investigations. If you really wish, you can keep an archive of past tables.
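One lightweight way to keep such a table in a form that can be sorted by priority and queried is a small data structure. The sketch below is a hypothetical illustration, not a prescribed format; the class name, field names, and IDs are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    kind: str          # "problem", "symptom", "action", "fix", or "theory"
    description: str
    priority: int
    related: list = field(default_factory=list)  # IDs of linked entries

# A miniature table, keyed by short IDs like those in Table 2:
table = {
    "A": Entry("problem", "High CPU after long test run", priority=1),
    "B": Entry("symptom", "Java locks found in an invalid state",
               priority=3, related=["A"]),
}

def current_focus(table):
    """Entry IDs ordered by priority -- what to work on next."""
    return sorted(table, key=lambda k: table[k].priority)
```

A spreadsheet or wiki page serves the same purpose; what matters is that entries are typed, prioritized, and explicitly linked, so that symptoms stay grouped with their corresponding problems.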
Table 1 shows an example of a simple Problem, Symptom, Action table, and Table 2 shows a more complex example, illustrating a rich set of problems, symptoms, and actions that reference each other.

Table 1. Simple problem, symptom, action table

| Problem | PMR | Symptoms | Status | Actions |
| --- | --- | --- | --- | --- |
| One JVM crashes due to OOM | 12345 | During peak workload; heapdump/javacore files created; last occurred: Monday | Javacore analysis suggests infinite loop | Applied -Xmx to increase heap size -- problem still occurs; IBM to respond |
| System slows down on Mondays | N/A | High CPU seen | Gathered ITCAM data | Analyze ITCAM data |
Table 2. Complex problem, symptom, action table

| # | Item | PMR | Status | Actions | Priority |
| --- | --- | --- | --- | --- | --- |
| #A | Problem: High CPU after long test run | 12345 | Invalid Java lock for various Java locks, which itself has several variations (symptoms #B, #D); investigation ongoing -- see specific symptoms below | | 1 |
| #B | Symptom: Java locks found in an invalid state | 56789 | High CPU seen | Investigate potential other locks in an invalid state; run a long test with the experimental fix | 3 |
| #C | Todo: Review logs from test #2 | N/A | | Run with an experimental fix for Symptom #D; analyze data tomorrow | 5 |
| #D | Symptom: Java lock BaCyclQueue invalid | 56780 | Causes a long chain of events that results in the high CPU problem #A; in a state that should never occur in the JVM | Experimental fix provided | |
Timeline of events
In any investigation that lasts more than a few days, or that involves more than a few individuals, there will invariably be questions about the results of some earlier experiment or trace, where some file was saved, and so on. Keep a written document or log book in which you record a timeline of all major events that occurred during the investigation. The exact format of the timeline and the level of detail might vary between individuals and between different situations, but a timeline will typically contain:
- One entry for each occurrence of any problem being investigated.
- One entry for each significant change made to the system (such as software upgrades, reinstalled applications, and so on).
- One entry for each major diagnostic step taken (such as a test to reproduce the problem or experiment with a solution, a trace, and so on).
- A precise date and time stamp.
- A note of the systems (machines, servers, and so on) that were involved.
- A note of where any diagnostic artifacts (logs, traces, dumps, and so on) were saved.
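The timeline entries above can be kept in any shared document. As one possible sketch (the field names here are assumptions, not a prescribed format), a small helper could append them to a CSV log book that the whole team reads and writes:

```python
import csv, io
from datetime import datetime, timezone

FIELDS = ["timestamp_utc", "event", "systems", "artifacts"]

def timeline_entry(event, systems, artifacts, when=None):
    """Build one timeline row; 'when' defaults to the current UTC time."""
    when = when or datetime.now(timezone.utc)
    return {"timestamp_utc": when.isoformat(), "event": event,
            "systems": ";".join(systems), "artifacts": ";".join(artifacts)}

# Append entries to a shared CSV log book (StringIO stands in for a file):
book = io.StringIO()
writer = csv.DictWriter(book, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(timeline_entry("Test #1 with -Xmx1100MB",
                               ["node01"], ["logs/test1/"]))
```

Because each row carries a UTC time stamp, the systems involved, and the location of the saved artifacts, the log book answers the "where was that trace saved?" questions that inevitably arise weeks into an investigation.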
Table 3 shows a high-level timeline of events.

Table 3. Timeline of events

| Event | Date/Time | Details | Results | Location |
| --- | --- | --- | --- | --- |
| Test #2 | 2009-01-05 17:30:04 UTC | Re-ran same as Test #1 | 4 out of 5 JVMs ran successfully | |
| Upgraded WebSphere Application Server | 2009-01-03 17:30:04 UTC | Upgraded entire cluster CLST03 to 22.214.171.124 | | |
| Test #1 | 2009-01-01 17:30:04 UTC | Ran test with -Xmx1100MB | OutOfMemory still occurred | |
Inventory of the diagnostic artifacts
Over the course of an investigation, you will end up collecting a large number of diagnostic artifacts and files, such as log files, traces, dumps, and so on. Some of these artifacts could be the result of multiple spontaneous occurrences of the problem over time, and others could be the result of specific experiments conducted to try to solve the problem or to gather additional information.
Just as the timeline of events is important to keep track of what happened over time, it is also very important to manage all the various diagnostic artifacts collected during these various events so that you can consult them when you need additional information to pursue a line of investigation. For each artifact, you should be able to tell:
- Which event in the timeline does it correspond with?
- Which specific machine or process (in all of the machines and processes involved in a given event) did the artifact come from?
- What system or process configuration was in effect at the time the artifact was generated?
- What options were used specifically to generate that artifact, if appropriate (for example, which trace settings)?
- If the artifact is a log or trace file that could contain entries covering a long period of time, exactly at which time stamp(s) did something noteworthy happen in the overall system, that you might wish to correlate with entries in this log or trace file?
As for other aspects of the organizational devices described here, there are several equally good ways and formats suitable to maintain this information. One straightforward approach favored by many experienced troubleshooters is to organize all artifacts into subdirectories, with one subdirectory corresponding to each major event from the timeline, and to give each artifact within a directory a meaningful file name that reflects its type and source (machine, process, and so on). When necessary, you could also create a small README file associated with each artifact or directory that provides additional details about the circumstances when that artifact was generated (for example, configuration, options, detailed timestamps, and so on).
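The subdirectory-per-event convention described above can be scripted. The helper below is a minimal sketch under those assumptions; the name `file_artifact` and the README layout are invented for illustration:

```python
from pathlib import Path

def file_artifact(root, event_name, source, artifact_path, notes=""):
    """Move an artifact under <root>/<event_name>/ with a name that
    reflects its source, and record its circumstances in a README."""
    event_dir = Path(root) / event_name
    event_dir.mkdir(parents=True, exist_ok=True)
    dest = event_dir / f"{source}_{Path(artifact_path).name}"
    Path(artifact_path).rename(dest)              # move into place
    with (event_dir / "README.txt").open("a") as f:
        f.write(f"{dest.name}: {notes}\n")        # e.g. trace settings used
    return dest
```

Filing each log, trace, or dump this way at collection time costs seconds; reconstructing the same information weeks later, after dozens of events, is often impossible.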
While it is perfectly acceptable to simply use any manually-created directory structure you wish to organize these artifacts, the Case Manager facility provided in IBM Support Assistant includes several features that help organize artifacts precisely along the principles outlined above.
Regardless of the approach you take, it’s important to always keep an open mind for out-of-the-box thinking and theories. Avoid “tunnel vision”: focusing on one possible root cause and relentlessly attempting to solve only that one cause. Failing to see the whole picture and to methodically evaluate all potential solutions can prolong relief and recovery times. Here are some ideas to help you avoid this condition:
Utilize checkpoints: During your investigation, take time to create checkpoints at which all team members share their relevant findings since the last checkpoint. With this method, team members can create and maintain creative synergy among themselves and ensure that no details are overlooked.
Work in parallel: When a team member has an alternative theory to the root cause, it can benefit the investigation by enabling one or more team members to work on proving that theory in parallel with the main investigative effort.
Regularly (re)ask the “big picture” questions: Keep asking “what is the problem?” and “are we solving the right problem?” One classic example of tunnel vision is thinking that you have reproduced a problem in a test environment only to discover later that it was actually a slight permutation of a production problem. Asking big picture questions can help contextualize problems and avoid chasing those that are of lesser priority.
Often problems are relatively straightforward, with a single symptom or a small cluster of symptoms that lead more or less directly to an understanding of a single problem which, when fixed, resolves the entire situation. Sometimes, though, you have to deal with more complex situations, with a series of related symptoms or problems that must be “peeled back” one by one to get to the root cause. It is important to understand this concept so that you can address it effectively when conducting a complex investigation. The phenomenon of “peeling the onion” might manifest itself in a few different variations:
- Multiple problems or symptoms can be linked by a cause-and-effect relationship. For example:
- A total lack of response from a Web site might be due to overload in a Web server...
- ...which might itself be due to the slow response of an application server that happens to serve only one type of request on the entire Web site but ties up excessive resources on the Web server...
- ... which might itself be due to database access slowdown from another application server, which ties up resources needed by this application server...
- ...which might itself be due to problems on the underlying network that connects the application server to the database...
- ... and so on, until you get to the so-called root cause of this sequence of problems.
- Encountering one problem might cause the system to enter a particular error recovery path, and during the execution of that path another problem might manifest itself.
- In other cases, one problem might simply mask another; the first problem does not let the system proceed past a certain point in its processing of requests, but once that first problem is resolved, you get further into the processing and encounter a second independent problem.
In all these cases, you have no choice but to address one problem at a time, in the order that they are encountered, while observing the operation of the system. Each problem could itself require a lengthy sequence of investigative steps before it is understood. Once one problem is understood, you can proceed to the next problem in the sequence, and so on, peeling away each imaginary onion skin, until you finally reach the core. This method can be very frustrating, especially to those unfamiliar with the troubleshooting process, but effective communication can help keep morale high and build trust in the process by showing concrete progress and minimizing confusion.
Maintaining and publishing a clear executive summary helps set the context of the overall situation, helps highlight each specific problem when it has been resolved so that progress is evident, and helps identify new (major) problems as they are discovered. The table of problems, symptoms and actions helps to keep track of the various layers and clarify the relationship between similar problems and symptoms.
As a complement to the various techniques outlined in this article, you might want to consider using the IBM Guided Activity Assistant to help you conduct your investigation. Its most visible function is to provide information and step-by-step guidance for the specific tasks that should be performed to diagnose a variety of problems. It also embodies many of the general principles presented in this article: it helps you characterize your problem, keeps track of the state of the investigation and the various diagnostic artifacts through its Case Manager, and guides you through initial steps that are similar to Phase 1, which also support the information gathering necessary to launch Phase 2.
Problem determination is about dealing with the unknown and unexpected, and so it will probably never be an exact science -- but it’s also not rocket science. By following the recommendations and techniques outlined in this article, you can take steps to make your problem determination work more organized, systematic, and, in the end, more effective and rewarding.
The Support Authority: If you need help with WebSphere products, there are many ways to get it
The Support Authority: 12 ways you can prepare for effective production troubleshooting
Application Server problem determination education course
IBM Software Support Web site
IBM Education Assistant
Kevin Grigorenko is a software engineer on the WebSphere Application Server SWAT team, which provides worldwide, on-site and remote supplemental product defect support, particularly in critical customer support situations. He currently focuses on problem determination for WebSphere Application Server and related stack products, including the JVM and various operating systems. He also has a deep history in development, including Java Enterprise Edition, C, C++, Perl, PHP, Python, Ruby, and .NET.
Daniel Julin has 20 years of experience developing and troubleshooting complex online systems. As technical area lead for the WebSphere Serviceability Team, he currently focuses on helping the team define and implement a collection of tools and techniques to assist in problem determination for WebSphere Application Server and to maximize the efficiency of IBM support. He also occasionally assists directly in various critical customer support situations.
Carolyn Norton is the Lead Serviceability Architect for WebSphere Application Server. She has been working on WebSphere since 1999, as an architect for both performance and system test. Other projects include the Sydney Olympics, Nagano Olympics, and AnyNet. Recurring themes in her career include autonomics and being a bridge between development and customers. A member of the IBM Academy of Technology since 2000, she holds a Ph.D. in Mathematics from MIT and an A.B. from Princeton University.
John Pape currently works with the WebSphere SWAT Team and focuses on crit-sit support for clients who utilize WebSphere Application Server, WebSphere Portal Server, and WebSphere Extended Deployment. This role requires attention to detail as well and maintaining a “think-out-of-the-box” innovative mindset, all the while assuring IBM customers get the best support possible!