Level: Introductory
Abdi Salahshour (abdis@us.ibm.com), Problem Determination Architect, IBM
Kalpana Doraisamy (kdoraisa@in.ibm.com), Staff Software Engineer, IBM
Ajay G Rengasayee, Software Engineer, Freelance writer
19 Jun 2007
This four-part series is a comprehensive usage guide that gives you an overview of the Log and Trace Analyzer for Java™ Desktop, walks you through the installation process, and teaches you to configure the tool correctly. The series includes performance-enhancing tips, integration and hands-on scenarios, and data on the IBM Tivoli Monitoring 6.1 Events Tool. Discover how your data can be more consumable from start to finish and learn how to reduce your problem determination and maintenance costs. In part one, identify the challenges in data collection and see how a common event format and a symptom repository help address those challenges.
While the first part of the series discusses the current obstacles to effective data collection, in the subsequent articles you can:
- See an overview of the architecture and functions of the Log and Trace Analyzer - Java
Desktop (LTA-JD) and view an installation guide.
- Take a visual tour of the technology, get troubleshooting tips, and learn to get the
best performance out of the LTA-JD.
- Dive into the IBM Tivoli Monitoring Events Tool view of the LTA-JD.
The challenges of data collection
Problem determination is the detection and diagnosis of situations that affect the operational status or availability of business applications. One of the challenges of data collection is the time that problem determination takes. Products write their content in their own proprietary formats; likewise, applications, databases, and networks each write their content in specific formats. When a problem occurs in an application because a network failure prevents it from accessing the database, the user must understand data from the application, the database, and the network. This need to understand all of the various components increases the complexity of problem determination. Complexity also increases because human intervention is required to manually correlate log records that are in various formats, and because the number of possible failures grows as the application interacts with more products.
The goal is to maximize business and IT system availability by minimizing the time it
takes to recover from situations that affect system availability. This is accomplished by
collecting the monitoring information and using tools to quickly detect meaningful conditions,
diagnose the underlying problem, and apply available knowledge to restore normal business
and IT system operations.
Often a complex problem is revealed only by a combination of multiple observed events, which makes human analysis difficult and time-consuming. Monitoring solutions may implement autonomic correlation and reaction to these problems, performing either simple event triage or more complex root-cause analysis on sets of events. A much smaller set of root-cause events is then presented to the human operators for review and reaction.
Using the Log and Trace Analyzer (for the purposes of this series, the Java Desktop version) as a simplified symptomatic event visualizer can help you overcome three major hurdles to more effective data collection:
- The complexity of e-business systems. Today's business systems are a collection of distributed and heterogeneous software and hardware components.
- The variety of data and collectors/adapters. The variety of collectors and the sheer volume of data collected create several problems: how to consume and publish proprietary data formats; how to make differing designs and standards coexist; how to integrate ad hoc and product-specific code; how to integrate the different skill sets required to configure, maintain, and tune the various systems; and how to overcome the difficulty of correlating data for enterprise-to-enterprise problem diagnostics.
- Instrumentation differences. These include topics such as standards compliance, customer inconvenience, and cost of ownership. In addition, when standardization is lacking, Management Tools (the consumers) need to be instrumented for every Managed Resource (the producers) with which they interact; the same is true in reverse. This is both costly and inefficient.
Handling these challenges requires a well-defined set of tools.
Defining the tools
To address these problems, richer, normalized data is needed to enable cross-product analysis and correlation; it is, in fact, a prerequisite to effective root-cause analysis and automation. Standards are fundamental to this type of data; without them, event data is of little value to autonomic management, both for problem determination and for acting in response.
One way to alleviate this problem is to structure event data in four categories:
- The source is the identification of the component that is affected by or has experienced the situation.
- The reporter is the identification of the component that is reporting the situation. This can be the same component as the source of the situation.
- The situation data consists of properties or attributes that describe the situation.
- The context/correlation data consists of properties or attributes used to correlate the situation with others.
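The four categories above can be pictured as a small data structure. The following is only a sketch in Python, not the Common Base Event schema itself: the class names, field names, and sample values (`ComponentId`, `Event`, `transaction_id`, the host and component names) are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ComponentId:
    """Identifies a component (hypothetical, simplified fields)."""
    component: str
    location: str


@dataclass
class Event:
    """An event split into the four categories described in the text."""
    source: ComponentId                 # component that experienced the situation
    reporter: ComponentId               # component that reported the situation
    situation: dict[str, str]           # properties that describe the situation
    context: dict[str, str] = field(default_factory=dict)  # correlation data


def correlated(a: Event, b: Event, key: str) -> bool:
    """Two events correlate on `key` if both carry the same context value."""
    return key in a.context and a.context.get(key) == b.context.get(key)


app = ComponentId("OrderService", "host-a")
db = ComponentId("OrdersDB", "host-b")

# The application reports a failure that the database experienced.
db_failure = Event(
    source=db,
    reporter=app,
    situation={"msg": "connection refused"},
    context={"transaction_id": "tx-1001"},
)

# The application also reports its own failed request in the same transaction.
app_failure = Event(
    source=app,
    reporter=app,
    situation={"msg": "order could not be saved"},
    context={"transaction_id": "tx-1001"},
)

print(correlated(db_failure, app_failure, "transaction_id"))
```

Keeping the source and the reporter separate is what lets a tool attribute a failure to the database even though the application wrote the log record.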
How Common Base Event format/WSDM Event Format fits
This is where the Common Base Event format and the WSDM Event Format (WEF) fit into the picture. Common Base Event is an event definition that is IBM's initial implementation of the WEF. The Common Base Event format and the WEF provide a common structure in which logs can be represented, so that the user has to understand only one format for all the product logs. The various log formats are converted to the standard format with the help of adapters. The Common Base Event format and WEF standards have been designed to make problem determination simpler and faster. Various elements provide more detail on the event that occurred, and tools are available to view problem records in Common Base Event format, which makes the problem scenario easier to understand.
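To make the adapter idea concrete, here is a minimal Python sketch. It assumes two made-up log formats (a space-separated application log and a pipe-separated database log) and converts both into one common record shape. The formats, function names, and dictionary keys are all hypothetical; this is not the Generic Log Adapter or the actual Common Base Event schema.

```python
import re
from datetime import datetime, timezone

# Hypothetical application log line: "<date> <time> <LEVEL> <component> <message>"
APP_LINE = re.compile(r"(\S+ \S+) (\w+) (\w+) (.*)")

# Hypothetical database log level codes mapped to common severity names
DB_LEVELS = {"ERR": "ERROR", "WRN": "WARNING", "INF": "INFO"}


def adapt_app_line(line: str) -> dict:
    """Convert one application log line into the common record shape."""
    match = APP_LINE.fullmatch(line)
    if match is None:
        raise ValueError(f"unrecognized application log line: {line!r}")
    stamp, level, component, msg = match.groups()
    when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return {"time": when, "severity": level.upper(), "source": component, "msg": msg}


def adapt_db_line(line: str) -> dict:
    """Convert one database log line ("LEVEL|epoch|component|message")."""
    level, epoch, component, msg = line.split("|", 3)
    when = datetime.fromtimestamp(int(epoch), tz=timezone.utc)
    return {"time": when, "severity": DB_LEVELS.get(level, level),
            "source": component, "msg": msg}


app_log = ["2007-06-19 10:15:06 ERROR OrderService cannot reach database"]
db_log = ["ERR|1182248105|OrdersDB|connection limit reached"]

# Once both logs share one shape, they can be merged into a single timeline.
merged = sorted(
    [adapt_app_line(l) for l in app_log] + [adapt_db_line(l) for l in db_log],
    key=lambda event: event["time"],
)
for event in merged:
    print(event["time"].isoformat(), event["severity"], event["source"], event["msg"])
```

With one adapter per format, the reader of the merged timeline needs to understand only the common record, not each product's log layout.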
The Common Base Event format is a consistent, common format for representing an event produced during the operation of an IT system. It facilitates effective intercommunication among disparate components that support logging, management, problem determination, autonomic computing, and on-demand business functions in an enterprise.
Common Base Events provide:
- A consistent specification for the definition of normalized event and log information for various domains (business, security, network, system, etc.)
- An exchange format for events and logs
- Situation descriptions about the external operational capabilities of the component
- Data that captures execution information within a component
- Context data
Defining the symptoms
A symptom is a form of knowledge that indicates a possible problem or diagnosed situation in the managed environment. The classic definition of symptom is "a characteristic sign or indication of the existence of something else." A symptom is recognized when the monitored data (in medical terms, the thermometer reading) matches the symptom definition (a temperature that indicates a fever).
The autonomic computing definition is a bit more involved: a symptom is "a characteristic sign or indication of a possible problem or situation happening in the context of one or more manageable resources." This definition breaks down into three parts:
- It is a form of knowledge used to solve problems and situations automatically in an autonomic system.
- It is a composite record of information formed by combining raw or composite information into patterns.
- It can be a composition of other symptoms.
Connecting the definitions: Going from events to symptoms
You may be asking, "How do I get from events to symptoms?" Keep these definitions in mind: an event is an indication of something being monitored (for example, memory usage has exceeded a set limit), and a symptom is a characteristic sign or indication of a possible problem or situation happening in the context of one or more manageable resources. You link the two like this:
If event x (and y (and ...)) occurs (under certain conditions), then report the occurrence and possible resolution actions.
For example, if memory usage has exceeded a set limit three times in a 10-minute stretch, this suggests a pattern that could benefit from a response such as increasing your buffer sizes.
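The memory-usage example can be expressed as a small sliding-window rule. The Python sketch below is one possible way to encode "event x occurs N times within a window, then report an action"; the class name `ThresholdSymptom`, the event name, and the action text are hypothetical, and the code assumes events arrive in time order.

```python
from collections import deque
from datetime import datetime, timedelta


class ThresholdSymptom:
    """Reports an action when an event occurs `count` times within `window`."""

    def __init__(self, event_name: str, count: int, window: timedelta, action: str):
        self.event_name = event_name
        self.count = count
        self.window = window
        self.action = action
        self._times = deque()  # timestamps of recent matching events

    def observe(self, name: str, when: datetime):
        """Feed one event; return the suggested action if the symptom matches."""
        if name != self.event_name:
            return None
        self._times.append(when)
        # Drop occurrences that fall outside the window ending at `when`.
        while when - self._times[0] > self.window:
            self._times.popleft()
        if len(self._times) >= self.count:
            return self.action
        return None


rule = ThresholdSymptom(
    event_name="memory_limit_exceeded",
    count=3,
    window=timedelta(minutes=10),
    action="consider increasing buffer sizes",
)

start = datetime(2007, 6, 19, 10, 0)
for minutes in (0, 4, 9):
    result = rule.observe("memory_limit_exceeded", start + timedelta(minutes=minutes))
    print(minutes, result)
```

In this sketch the rule keeps reporting for as long as the condition holds; a real symptom engine would likely add deduplication or a reset after the action is taken.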
Using this information
The event visualization of LTA-JD uses autonomic computing concepts, such as symptoms, to represent, detect, evaluate, and resolve incidents and problems related to IT infrastructure management and operations. In addition, symptom visualization and processing methods are suggested to enable efficient, proactive avoidance of these incidents and problems before they happen.
Now we come to the "value proposition." There are three ways that having the information proposed in this article can help:
- It makes the management data more consumable to the end user because:
  - It gives you visualization of product symptoms from within problem determination tooling.
  - Symptoms (patterns) are more deterministic than individual events.
- It helps reduce problem determination costs because:
  - Administrators can use automated event correlation to recognize symptoms (and, potentially, corrective actions).
  - Support personnel can access symptoms directly from the problem determination tools.
  - Cross-product symptom catalogs allow quick diagnosis of known errors.
- It helps reduce maintenance costs because:
  - Incremental improvements to symptom databases will reduce requests to Level 1, 2, and 3 support (L1, L2, and L3). L1 is the first line of support that answers when the customer calls. L2 gets involved when L1 cannot resolve the problem; it usually includes a more knowledgeable support engineer, such as a product's Subject Matter Expert. L3 commonly consists of change team or development members who change the code and provide fixes.
Introducing the tool
The tool to help you achieve this is the Log and Trace Analyzer for Java Desktop, a simple, standalone Java event viewer that merges, filters, sorts, analyzes, and displays the contents of event sources in a common event format (Common Base Event), supporting everything from problem isolation and triage to problem analysis. The triage functionality, coupled with the visualization mechanisms offered by the LTA-JD, improves root-cause analysis, problem prediction, and resolution. Domain expertise and symptom rules can be captured using industry-standard XPath expressions for quick detection and visualization of symptomatic events. Figure 1 shows a matrix of the Log and Trace Analyzer family of products on two spectra (analysis capabilities and user skills).
Figure 1. The Log and Trace Analyzer family
Log and Trace Analyzer, Java Desktop sits at the starting corner, but don't sell it short. It can enable end-to-end viewing of event sources across the heterogeneous environment, provide a customizable summary view, and let you select and expand any row from the summary view to display the full Common Base Event attributes. With it, you can also do multi-level filtering and sorting on any event properties, apply custom highlighting to triage events (single-symptom definitions), and save and share configuration settings.
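To give a feel for how an XPath expression can act as a single-symptom rule, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath. The XML is a simplified, hypothetical stand-in for Common Base Event records (real records use an XML namespace and carry many more attributes), and the rule names and expressions are invented for illustration; this is not how LTA-JD stores or evaluates its own rules.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical event log; not the full Common Base Event schema.
LOG = """<events>
  <CommonBaseEvent creationTime="2007-06-19T10:00:00Z" severity="10" msg="service started">
    <sourceComponentId component="OrderService" location="host-a"/>
  </CommonBaseEvent>
  <CommonBaseEvent creationTime="2007-06-19T10:15:05Z" severity="50" msg="connection refused">
    <sourceComponentId component="OrdersDB" location="host-b"/>
  </CommonBaseEvent>
  <CommonBaseEvent creationTime="2007-06-19T10:20:00Z" severity="50" msg="disk full">
    <sourceComponentId component="OrdersDB" location="host-b"/>
  </CommonBaseEvent>
</events>"""

# Each rule pairs a symptom name with an XPath expression that selects
# the events to highlight (names and expressions are illustrative only).
RULES = {
    "high severity": "./CommonBaseEvent[@severity='50']",
    "database unreachable": "./CommonBaseEvent[@msg='connection refused']",
}


def highlight(root: ET.Element, rules: dict) -> dict:
    """Return, per rule, the creation times of the events it matches."""
    return {
        name: [event.get("creationTime") for event in root.findall(xpath)]
        for name, xpath in rules.items()
    }


hits = highlight(ET.fromstring(LOG), RULES)
for name, times in hits.items():
    print(name, times)
```

Because the rules are plain data, they can be saved and shared alongside other configuration settings, in the same spirit as the tool's own shareable settings.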
In the next article, see an overview of the architecture and functions of the Log and Trace Analyzer, Java Desktop, along with a guide to installing it.
About the authors
Abdi Salahshour is a Senior Software Engineer, problem determination architect, Master Inventor at IBM's Autonomic Computing Technology and Development, and is currently an architect for the Plug and Manage architecture. He began working for IBM in 1982 and served in many roles -- from design and development of database diagnostic tools to system management and self-healing architecture and enablement in heterogeneous and distributed environments. He was a member of IBM Problem Determination Council, is one of the authors of the IBM Common Base Event specification, one of the principal designers and implementers of the Generic Log Adapter, and the architect and designer of the Log and Trace Analyzer for Java Desktop.
Kalpana Doraisamy is a Staff Software Engineer at IBM focusing currently on Lightweight Infrastructure for Systems Management. In her previous role she worked with the Log and Trace Analyzer for Autonomic Computing for more than two years. She was one of the senior developers of the Log and Trace Analyzer for Java Desktop. She holds a bachelor's degree in Computer Science and Engineering from Government College of Technology, Coimbatore, India.
Ajay G Rengasayee was a System Software Engineer at IBM India Software Lab, Autonomic Computing. He was a developer for Log and Trace Analyzer for Autonomic Computing and related technology for two years. He was one of the developers of the Log and Trace Analyzer for Java Desktop.