Build a framework for problem determination triage

The concepts behind using event visualization and symptoms to effect problem determination


Level: Intermediate

Marcelo Perazolo (mperazol@us.ibm.com), Autonomic Computing Architecture, IBM 
Abdi Salahshour (abdis@us.ibm.com), Senior Software Engineer, IBM 

27 Mar 2007

So how do you set up "triage" problem determination? This article describes aspects of event visualization for triage problem determination that use autonomic computing concepts and tools -- such as the Log and Trace Analyzer for Java Desktop (LTA-JD) -- together with symptoms to represent, detect, evaluate, and resolve incidents and problems related to the management and operation of mission-critical business infrastructure. This two-part article also covers the event and symptom visualization and processing methods of the LTA-JD that enable efficient, proactive avoidance of these incidents and problems. In this first part, you'll take a tour of the underlying concepts.

It's a simple equation -- the task of event monitoring grows more complex as the volume of events and the number of event sources increase. Poor visualization of events leads to poor problem detection and root cause analysis, which translates into lost time, bad business practices, and higher recovery costs. There is a need to improve the visualization of events and their associated symptoms, and thereby to improve the human experience of problem detection, isolation, and prevention. Autonomic computing management applications can support specific management styles that define their capabilities and requirements for the set of manageable resources they monitor.

There are several different approaches to event visualization. Typically, an event-monitoring solution involves a human operator who is responsible for the analysis and reaction to problems associated with events. Operators rely on their experience and perception of event combinations to determine when a problem happens and how to resolve it. Now, combinations of multiple events can reveal more complex problems in the IT environment; it is at this point where human analysis may become difficult and time-consuming.

It is also at this point that monitoring solutions should implement automatic correlation of event combinations to make the job less onerous for system engineers; this automatic correlation includes running root cause analysis and grouping events by their relative contribution to problem trails. The automatically determined root cause events should then be presented to human operators for review and reaction (and, ideally, a successful resolution of the problem).

A supported management style can be hands-on, hands-off, or both. When using a hands-on style, the autonomic manager polls the resources it is managing to determine when it needs to take action. In other words, it is the method of choice for making the combination of human and user interface play the role of a manual manager in the autonomic computing architecture (see Resources). For example, a manual manager may monitor multiple event sources and when it observes one or more events of specific significance (which is a pattern also known as a symptom in autonomic computing terminology), the manager may initiate one or more actions to mitigate or resolve the observed problem.

Commonly, problematic symptoms are detected when all the events that satisfy the criteria for that symptom (also known as symptom rules) are observed. One of the main goals of this series of articles is to describe a way to facilitate incremental detection and visualization of symptoms, and a method to share domain knowledge (symptom definitions) among human operators by combining event and symptom visual semantics efficiently. This method is implemented in a simple event visualizer known as the Log and Trace Analyzer for Java Desktop (LTA-JD), a tool capable of collecting, merging, filtering, sorting, displaying, and analyzing the contents of standardized event sources (for example, Common Base Event and Web Services Distributed Management [WSDM] Event Format, or WEF) for problem isolation and triage in support of problem analysis.
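To make the idea of incremental detection more concrete, here is a minimal Python sketch. It is not LTA-JD code: it assumes that an event is a flat dictionary of properties and that each symptom rule is a predicate on a single event. The names `SymptomTracker`, `observe`, `progress`, and the sample symptom are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

Event = Dict[str, str]          # assumption: an event is a flat dict of properties
Rule = Callable[[Event], bool]  # assumption: a symptom rule is a predicate on one event


@dataclass
class SymptomTracker:
    """Tracks which rules of one symptom have been satisfied so far."""
    name: str
    rules: Dict[str, Rule]
    matched: Dict[str, Event] = field(default_factory=dict)

    def observe(self, event: Event) -> None:
        # Record the first event that satisfies each still-unmatched rule.
        for rule_id, rule in self.rules.items():
            if rule_id not in self.matched and rule(event):
                self.matched[rule_id] = event

    @property
    def progress(self) -> float:
        # Fraction of rules satisfied; partial values allow incremental display.
        return len(self.matched) / len(self.rules)

    @property
    def detected(self) -> bool:
        # The symptom is detected only when every rule has a matching event.
        return len(self.matched) == len(self.rules)


# Hypothetical symptom: a connection-pool warning plus a database error.
tracker = SymptomTracker(
    name="db-connection-exhaustion",
    rules={
        "pool_warning": lambda e: e.get("component") == "pool" and e.get("severity") == "WARNING",
        "db_error": lambda e: e.get("component") == "db" and e.get("severity") == "ERROR",
    },
)

tracker.observe({"component": "pool", "severity": "WARNING", "msg": "pool nearly full"})
halfway = tracker.progress  # one of the two rules has matched so far
tracker.observe({"component": "db", "severity": "ERROR", "msg": "connection refused"})
```

A viewer built along these lines could show a partially matched symptom (here, `halfway`) before the full pattern has been observed, which is the essence of incremental detection. The sketch ignores event ordering and time windows, which a real correlation engine would likely need.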

Together, the triage function and the visualization mechanisms offered by the LTA-JD improve root cause analysis as well as problem prediction and reaction. Domain expertise and semi-structured information resembling symptom rules can be mined and captured using industry-standard XPath expressions for quick detection and visualization of symptomatic events.

In this two-part article, we'll show you the fundamentals of event monitoring, symptom detection for problem analysis and reaction, and the enhanced visualization attributes implemented by the LTA-JD. We'll focus on the fundamentals in this first part, then move on to the details in part two.

Problem determination

Problem determination is the detection and diagnosis of situations that affect the operational status or availability of business applications. The goal of problem determination is to maximize business and IT system availability by minimizing the time it takes to recover from situations that affect system operation or availability. This is accomplished by using tools on the collected monitoring information to quickly detect meaningful conditions, diagnose the underlying problems, and apply available knowledge to restore normal business and IT system operations. Problematic symptoms are often detected when all the events that satisfy the criteria for that symptom are observed. This article discusses:

  • A way to facilitate incremental detection and visualization of symptomatic events
  • A method to share domain knowledge among human operators using a mechanism to combine events and symptom visual semantics together in an efficient manner

Problem determination events are events that are specifically intended to be used to support the process of problem determination. Problem determination events can incorporate many types of data, including information about:

  • Operational status
  • State changes
  • Request processing
  • Performance metrics or faults

In order to enable autonomic functions related to problem determination you need a normalized format of events. One is already available -- the Common Base Event, which is a normalized representation of an event to communicate problem determination situations between entities participating in autonomic computing functions. It is the fundamental enabler for an autonomic computing self-managing system. See Resources for more information on the standardization aspects of the Common Base Event.

In autonomic computing, elements frequently communicate and exchange data and knowledge; often this knowledge involves a composition of simple events into patterns, descriptive information about the situation, and the remediation actions recommended to solve it. A symptom, a normalized format for knowledge, is a form of knowledge that indicates a possible problem or diagnosed situation in the managed environment.

The problem determination discipline can follow two different paths:

  • A manual path in which incidents and problems are handled by domain experts directly
  • An autonomic path in which symptoms are required to provide the information for automatic detection, diagnosis, recovery, and resolution

Depending on the maturity of the autonomic system, the manual monitoring functions of an event monitoring application are combined with the autonomic functions of an autonomic manager capable of processing symptoms. This creates a coexistence between human operators and autonomic managers in which autonomic managers do all the heavy work, yet rely on human expertise to review and approve autonomic decisions.

LTA-JD, an event and symptom visualization tool

The LTA-JD implements a hybrid event/symptom monitoring application. It is a stand-alone, simple-to-use Java event viewer that provides the ability to gather, merge, filter, sort, display, and analyze the contents of log files and events from a large number of products in a single view, for problem isolation and triage in support of problem analysis. It uses standard event formats to aggregate event data, including the Common Base Event. These capabilities can be integrated with a number of Tivoli® products (like IBM® Tivoli Monitoring) to expand the overall problem analysis capabilities of those products. In fact, this tool can perform deep-dive analysis of captured event data to determine the root cause of problems.

The LTA-JD provides the ability to manually manage problem determination tasks and to integrate autonomic computing capabilities that help in the collection, detection, isolation, diagnosis for recovery, and resolution of problems. The triage functions, coupled with the visualization mechanisms offered by the LTA-JD, improve root cause analysis and problem resolution. Domain expertise and simple symptomatic event selection rules can be mined and captured using industry-standard XPath expressions for quick detection and visualization of symptomatic events.

When events are collected and merged, each event is checked against the known symptoms; for every symptom matched to an event, the action recommendations associated with that symptom are retrieved or authored. The visualization characteristics of the matched symptoms are then associated with the corresponding events collected by the tool, and the textual information of those events is associated with the symptoms' action recommendations. Finally, the event and symptom data is processed by grouping events into a collection whose visualization and textual information match criteria established by human operators through the tool's user interface. This collection of events and symptoms helps end users narrow down the amount of information they have to process.

The LTA-JD application consists of the following main capabilities:

  • An event normalization module in which various kinds of information are collected and transformed into a format understandable by the system
  • An event filtering and visualization module in which events are collected from a managed resource and displayed to system administrators and support personnel
  • A simple symptom definition module in which symptoms are composed and associated with visualization parameters
  • An integrated symptom visualization module that presents the symptoms by overlaying their visualization aspects with those of normal events (which are in turn components of a symptom)
  • A dynamic symptom avoidance module in which symptom trends are detected and recommendations are made to human administrators on how to keep the symptom from manifesting itself (thus avoiding the problem before it happens)

Figure 1 shows the main architectural elements of the LTA-JD.


Figure 1. The main architectural elements of the Log and Trace Analyzer for Java Desktop

The data normalization function of the LTA-JD is performed by the Generic Log Adapter (GLA) framework, which is tightly integrated with the LTA-JD. The GLA currently provides a large number of off-the-shelf adapters that convert application-specific information, in the form of application log files, into the standard Common Base Event format.
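As an illustration of what such an adapter does, the following Python sketch converts one application log line into a normalized event record. It is not the Generic Log Adapter and does not produce real Common Base Event XML: the log layout, the `NormalizedEvent` fields, and the `adapt_line` function are all assumptions chosen to keep the example small.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class NormalizedEvent:
    # Hypothetical, simplified fields; the real Common Base Event schema is richer.
    creation_time: str
    severity: str
    component: str
    msg: str


# Assumed log layout: "<timestamp> <LEVEL> [<component>] <message>"
LINE_PATTERN = re.compile(
    r"^(?P<time>\S+)\s+(?P<level>[A-Z]+)\s+\[(?P<component>[^\]]+)\]\s+(?P<msg>.*)$"
)


def adapt_line(line: str) -> Optional[NormalizedEvent]:
    """Convert one log line into a normalized event, or None if it does not parse."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    return NormalizedEvent(
        creation_time=match.group("time"),
        severity=match.group("level"),
        component=match.group("component"),
        msg=match.group("msg"),
    )


event = adapt_line("2007-03-27T10:15:00Z ERROR [db] connection refused")
```

Once every source is mapped to the same record shape, the filtering, highlighting, and analysis steps described in the rest of this article can treat events from different products uniformly.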

The event filtering and visualization module of the LTA-JD is provided by a simple visualizer that lets users filter out noise and unimportant events and focus on the events that really participate in a problem.

Figure 2 shows the main screen of the LTA-JD. This view shows how events are arranged in a tabular format and that there are selection and filtering mechanisms in place to perform traditional event monitoring duties. Also, it provides a way to drill down into a particular event and to examine its attributes and values. Finally, there are also ways to integrate various normalized event sources together, so you have an overall view of the whole spectrum of events present in the managed system.


Figure 2. Events displayed in LTA-JD

A typical IT administrator would view this filtered information and manually analyze the event flow (by applying filtering and visual inspection) to determine what problems are happening and what to do to resolve them.

Filtering capabilities in LTA-JD are provided by a special enhanced XPath processor, called Fast XPath. This processor is optimized to handle a large number of events and filtering expressions and to quickly return matching results.
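To show the general idea of XPath-based event filtering, here is a short Python sketch that uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath 1.0. It is not Fast XPath, and the event document is a hypothetical, simplified layout rather than actual Common Base Event XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified event document; real Common Base Event XML uses
# a namespace and a richer structure.
EVENTS_XML = """
<events>
  <event id="e1" severity="INFO" component="web"><msg>request served</msg></event>
  <event id="e2" severity="ERROR" component="db"><msg>connection refused</msg></event>
  <event id="e3" severity="ERROR" component="web"><msg>upstream timeout</msg></event>
</events>
"""

root = ET.fromstring(EVENTS_XML)


def filter_events(xpath: str) -> list:
    """Return the ids of the events selected by an XPath expression relative to <events>."""
    return [event.get("id") for event in root.findall(xpath)]


# All error events, in document order.
errors = filter_events("./event[@severity='ERROR']")

# Chained predicates narrow the selection to database errors only.
db_errors = filter_events("./event[@severity='ERROR'][@component='db']")
```

Each filtering expression is just a string, so a viewer can keep many of them and re-apply them as new events arrive.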

Symptom definition and composite visualization

Often symptoms are authored by the manual creation or semi-automated capture of their main sub-components:

  • The metadata contains the values of the main attributes.
  • The schema defines what information will be associated at run time.
  • The symptom rule defines how a symptom is recognized.
  • The symptom effect defines how to react to a symptom.

(Resources contains more information on the standardization aspects of symptoms and explains the symptom effect concept.)
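The four sub-components listed above can be pictured as a simple record. The Python sketch below is only a mental model: the field names mirror the list, not the actual IBM symptom format, and the sample values (including the XPath string used as the rule) are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Symptom:
    # Illustrative field names; they mirror the four sub-components described
    # in the text, not the actual symptom specification.
    metadata: Dict[str, str]  # values of the main attributes
    schema: List[str]         # names of the data associated at run time
    rule: str                 # how the symptom is recognized (an XPath string here)
    effect: str               # how to react to the symptom


symptom = Symptom(
    metadata={"name": "db-unreachable", "description": "Database refuses connections"},
    schema=["event_id", "timestamp"],
    rule="./event[@severity='ERROR'][@component='db']",
    effect="Check that the database listener is running, then restart the connection pool.",
)
```

Keeping the rule and the effect as separate fields reflects the split between recognizing a symptom and reacting to it.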

The current capabilities of the LTA-JD help you author symptom rules (using a Rule Builder) that define a simple way to recognize symptoms -- in other words, to perform problem isolation and diagnosis on the information, such as events, that makes up a symptom -- and to identify the symptomatic events. In principle, these rules could be written in any syntax; the LTA-JD uses a standard expression syntax, XPath, because it promotes easier interoperability of symptoms defined in different management environments.

The rules authored with the Rule Builder allow for additional properties, such as a description, which is text that describes the symptom in more detail (commonly provided by subject matter experts), and a highlighter (with a customizable color scheme) that highlights the events matching the rule. The highlighter enhances the visualization of symptomatic events for problem detection and analysis.

Figure 2 shows the main screen of the LTA-JD, with the events that matched the user-defined rules highlighted. In addition, a tooltip is provided so that when you position the mouse cursor over any highlighted event, the description associated with that highlighter (the rule) is displayed.

To simplify composition of the rules (also known as selection criteria), the LTA-JD provides a Rule Builder -- a simple rule editor that lets those who are not familiar with the XPath language quickly compose simple yet composite rules for three purposes:

  1. To identify events that match the rules, events that are perceived to be symptomatic events
  2. To highlight events of interest using a spectrum of colors to further enhance the visualization of symptomatic events
  3. To facilitate compressing the events to just those highlighted

Therefore, a user who has no knowledge of XPath syntax may author rules by simply using the event properties and relational/Boolean operators. Figure 3 shows the simple Rule Builder window.


Figure 3. Simple XPath Rule builder
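The following Python sketch suggests how a rule editor of this kind might turn property/operator/value selections into an XPath expression. It is an assumption about the general technique, not the LTA-JD Rule Builder: the `build_rule` function, the `//event` element name, and the attribute-based event layout are hypothetical.

```python
from typing import List, Tuple

Condition = Tuple[str, str, str]  # (event property, relational operator, value)

ALLOWED_OPERATORS = {"=", "!=", "<", "<=", ">", ">="}


def build_rule(conditions: List[Condition], joiner: str = "and") -> str:
    """Compose an XPath 1.0 expression selecting events that satisfy the conditions."""
    if joiner not in ("and", "or"):
        raise ValueError(f"unsupported Boolean operator: {joiner}")
    if not conditions:
        raise ValueError("at least one condition is required")
    parts = []
    for prop, op, value in conditions:
        if op not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported relational operator: {op}")
        if "'" in value:
            # XPath 1.0 string literals have no escape syntax; keep the sketch simple.
            raise ValueError("values containing single quotes are not supported")
        parts.append(f"@{prop}{op}'{value}'")
    return "//event[" + f" {joiner} ".join(parts) + "]"


# Hypothetical selection: error events coming from the database component.
rule = build_rule([("severity", "=", "ERROR"), ("component", "=", "db")])
```

Here `rule` evaluates to `//event[@severity='ERROR' and @component='db']`, so the user only picks properties and operators and never has to type XPath directly.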

Figure 4 shows the LTA-JD Add/Remove Highlighter window, where visualization information is associated with symptom definitions. As you can see, a highlighter consists of an identifier (provided by a symptom author or extracted from the symptom definition, which equates here to its description), background and foreground colors that are used to paint the symptom components (the events that are correlated together to form that symptom), and a highlighter filter (also known as a simple symptom rule). The matching events are determined at run time by applying the symptom rules.


Figure 4. Associating visualization parameters to symptom Rules

After this information is made available to the LTA-JD, it uses it to perform run time triage analysis of events as they arrive for processing and presents them to the end user.
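A rough Python sketch of this run-time step follows. It assumes a highlighter holds an identifier, two colors, and a filter; the filter is modeled as a plain predicate instead of an XPath expression, and the first-match-wins policy, the `highlight` and `compress` functions, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Event = Dict[str, str]  # assumption: an event is a flat dict of properties


@dataclass
class Highlighter:
    identifier: str                   # doubles as the description shown to the user
    background: str
    foreground: str
    matches: Callable[[Event], bool]  # stands in for the XPath highlighter filter


def highlight(event: Event, highlighters: List[Highlighter]) -> Optional[Highlighter]:
    """Return the first highlighter whose filter matches the event, if any."""
    for candidate in highlighters:
        if candidate.matches(event):
            return candidate
    return None


def compress(events: List[Event], highlighters: List[Highlighter]) -> List[Event]:
    """Keep only the events that at least one highlighter matches."""
    return [e for e in events if highlight(e, highlighters) is not None]


highlighters = [
    Highlighter("Database errors", "red", "white",
                lambda e: e.get("component") == "db" and e.get("severity") == "ERROR"),
]
events = [
    {"id": "e1", "component": "web", "severity": "INFO"},
    {"id": "e2", "component": "db", "severity": "ERROR"},
]

chosen = highlight(events[1], highlighters)  # colors and tooltip text for event e2
visible = compress(events, highlighters)     # the view reduced to highlighted events
```

The `compress` helper corresponds to the third purpose of the rules listed earlier: reducing the view to just the highlighted events.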

Furthermore, the LTA-JD supports the IBM Symptom 2.0 format to leverage existing symptom catalogs provided by IBM and user-provided catalogs that contain known problems. Symptoms are one form of an autonomic computing knowledge component and are used for log analysis. This helps you to both identify system error conditions and to take action to solve the problem within the LTA-JD or other LTAs.

You may select one or more catalogs from either a local system or an FTP or HTTP site. Currently, IBM provides 10 symptom catalogs for some of the major IBM products, such as the WebSphere® Application Server and the DB2® Universal Database Manager. These symptom catalogs are available from the IBM Tivoli Open Process Automation Library (OPAL); the catalog addresses are also provided in the Add/Remove symptom catalog window.

The LTA-JD analysis function lets users select one or more events and then select Analyze from the pull-down menu. Each of the selected events is checked against the selected catalogs for one or more matches. To simplify this further, a user may select one of the highlighted events and select Analyze; in this case, all the highlighted events are analyzed.

When the symptom catalogs are searched for the selected events and one or more symptoms are found, the symptom definitions are reported for each event separately. To see the result of the analysis, select the Analysis tab in the detail view of the event, in the bottom section of the LTA-JD main view. When a match is found, the analysis results include the symptom's name, its description, possible recommendations for addressing it, and possible recommended actions. Figure 5 shows the view of the analyzed events.


Figure 5. Symptom analysis result view
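The analysis step can be approximated by the Python sketch below, which checks each selected event against every entry of every selected catalog and reports the matches per event. It does not read the IBM Symptom 2.0 format: `CatalogEntry`, `analyze`, and the sample catalog are hypothetical, and matching is modeled with a plain predicate.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Event = Dict[str, str]  # assumption: an event is a flat dict with an "id" key


@dataclass
class CatalogEntry:
    name: str
    description: str
    recommendation: str
    matches: Callable[[Event], bool]  # stands in for the symptom rule


def analyze(events: List[Event],
            catalogs: List[List[CatalogEntry]]) -> Dict[str, List[CatalogEntry]]:
    """Check every selected event against every selected catalog, reporting per event."""
    results: Dict[str, List[CatalogEntry]] = {}
    for event in events:
        results[event["id"]] = [
            entry
            for catalog in catalogs
            for entry in catalog
            if entry.matches(event)
        ]
    return results


# Hypothetical catalog with a single known problem.
db_catalog = [
    CatalogEntry(
        name="db-unreachable",
        description="The database refuses new connections.",
        recommendation="Verify that the database listener is running.",
        matches=lambda e: e.get("component") == "db" and e.get("severity") == "ERROR",
    )
]

selected = [
    {"id": "e2", "component": "db", "severity": "ERROR"},
    {"id": "e3", "component": "web", "severity": "ERROR"},
]
report = analyze(selected, [db_catalog])
```

Events without a match still appear in the report with an empty list, which keeps the per-event layout of the results uniform.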

In conclusion

In this article, we've presented the fundamentals of problem determination and some of the autonomic computing artifacts necessary for automating common problem determination tasks, which in many circumstances are still performed manually. We demonstrated the common infrastructure necessary to perform collaborative and more complex analyses that are usually beyond the grasp of single or manual management.

In the second part of this article, we will present a method and an application capable of harmonizing manual operation and will show you more of the common infrastructure necessary to perform collaborative and complex analysis.

Our goal is to demonstrate that the LTA-JD has in place the infrastructure necessary for performing triage analysis that enables problem isolation and diagnosis, and that it can provide symptom analysis to facilitate triage problem determination in an easy and uniform way. We've shown you that you can add extra avoidance rules and actions as components of existing symptoms to make a high level of predictive analysis and proactive avoidance possible.

After IT system administrators are confident enough to delegate common tasks to autonomic elements, they can dedicate themselves to producing the content necessary for such complex tasks as we've described, all integrated in a common visualization and execution environment. The infrastructure necessary for doing this is already in place -- we just need to work on better prediction and avoidance content to make this function viable.



Resources

Learn

Get products and technologies

Discuss


About the authors

Marcelo Perazolo is a member of the IBM Autonomic Computing Architecture team, where he serves as an architect for symptoms and other knowledge formats and defines Management Integration Taxonomies related to autonomic computing. He has worked for IBM since 1990, with various assignments in network and systems management. Marcelo received an M.S. degree in Electrical Engineering in 1994. His interests include problem determination and prediction, process optimization techniques, security, correlation technologies, and knowledge representation.



Abdi Salahshour is a Senior Software Engineer, problem determination architect, and Master Inventor at IBM's Autonomic Computing Technology and Development, who started with IBM in 1982 and served in many roles -- from design and development of database diagnostic tools to system management and self-healing architecture and enablement in heterogeneous and distributed environments. He was a member of IBM Problem Determination Council, is one of the authors of the IBM Common Base Event specification, one of the principal designers and implementers of the Generic Log Adapter, and the architect and designer of the Log and Trace Analyzer for Java Desktop.



