Level: Introductory Editorial staff (dwinfo@us.ibm.com), developerWorks, IBM
03 May 2005 This question and answer article features Abdi Salahshour, a Senior Software Engineer for Autonomic Computing Technology at IBM. developerWorks talked with Abdi about the current Version 1.0.1 of the Common Base Event format and situation categories and also discussed what's on the horizon with problem determination.
developerWorks:
What do you do at IBM, and what did you do before coming to IBM?
Abdi Salahshour:
I am a Senior Software Engineer and problem determination (PD) architect at IBM's Autonomic Computing Technology and Enablement in Research Triangle Park, N.C., working on autonomic computing (AC) technology and enablement tools. I started with IBM in 1982 and have served in many roles, from design and development of database diagnostic tools to system management and self-healing architecture and enablement in heterogeneous and distributed environments. I served as a member of the IBM PD Council, which focused on modeling and architecture for PD artifacts. I'm also one of the authors of the IBM Common Base Event specification, and I was one of the designers and implementers of the Generic Log Adapter.
My specific interest in the last two-and-a-half years has been to work jointly with IBM Research to exploit and apply new technologies and data mining in an autonomic environment.
Problem determination, path to the autonomic level
dW:
So right from the start, this was a natural path to autonomic computing problem determination.
AS:
Absolutely right. About four or five years ago, I got an assignment working with the IBM Problem Determination Council. A team of IBM architects and RAS experts was pulled together to look into IBM products across different platforms, to assess their problem-determination capabilities and determine how to drive PD improvements in individual products, and to make their PD information available from all components. Also, the council was charged to drive a PD architecture and standardize PD information for all resource components, identify research in PD, and foster new PD technology in product plans to improve self-resiliency and reduce cost of ownership. Basically, to monitor, detect, isolate, resolve, and eventually prevent problems. The Council's architecture and self-resiliency work was a natural progression towards autonomic computing.
In general, autonomic computing is the evolution of automated management and resiliency concepts. The autonomic computing premise is to help reduce the cost and complexity of owning and operating an IT infrastructure. In an autonomic environment, components of systems, hardware, or software become self-configuring, self-healing, self-optimizing, and self-protecting. Central to this environment is knowledge that builds on known information about the system and grows as the autonomic components learn more about the characteristics of the managed resources, which leads to more informed decisions being made by the parts.
By the same token, knowledge can be learned, mined, and saved for reuse. This is analogous to a child that you teach the basics and then it observes, absorbs, and learns more from the repetitious events and new situations that occur. Some of those events, the most significant ones, can be captured and saved as knowledge.
Sidebar: Self-CHOPs
Autonomic computing means that applications, systems, and networks become more self-managing. This self-management involves four qualities: self-configuration, self-healing, self-optimizing, and self-protection. Hence, the Self-CHOP characteristics.
dW:
How does this human self-learning translate to the AC concept?
AS:
Good question. The autonomic computing architecture describes two distinct types of system components: an autonomic manager and one or more managed resources.
An autonomic manager is a component that implements a particular control loop. A managed resource is what the autonomic manager is controlling. The self-managing attributes of the autonomic computing architecture rest on its most basic building element, an intelligent control loop. The loop collects information from the components of systems. Then it monitors, correlates, analyzes, makes plans or decisions, and executes the plan to adjust the system as necessary.
These components collect, consume, and generate knowledge. This knowledge builds on information about the system and grows as the autonomic manager learns more about the characteristics of the managed resources. The knowledge is continuously shared among the four parts (the monitoring, analyzing, planning, and executing parts), leading to more informed decisions being made by these parts.
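To make the control loop concrete, here is a minimal sketch in Python. It is an editorial illustration, not code from the Autonomic Computing Toolkit: the class names, the `load` metric, the `max_load` threshold, and the "add a worker" action are all assumptions chosen to show the four parts sharing one knowledge store.

```python
from dataclasses import dataclass, field


@dataclass
class ManagedResource:
    """Hypothetical managed resource with one sensor and one control."""
    load: float = 0.95
    workers: int = 2

    def read_sensor(self) -> dict:
        return {"load": self.load, "workers": self.workers}

    def apply(self, action: dict) -> None:
        # Illustrative effect only: each added worker halves the load.
        if action.get("add_workers"):
            self.workers += action["add_workers"]
            self.load = round(self.load * 0.5, 3)


@dataclass
class AutonomicManager:
    """One pass of a monitor-analyze-plan-execute loop over shared knowledge."""
    resource: ManagedResource
    knowledge: dict = field(
        default_factory=lambda: {"max_load": 0.8, "history": []}
    )

    def monitor(self) -> dict:
        reading = self.resource.read_sensor()
        self.knowledge["history"].append(reading)  # knowledge grows over time
        return reading

    def analyze(self, reading: dict) -> bool:
        return reading["load"] > self.knowledge["max_load"]

    def plan(self, overloaded: bool) -> dict:
        return {"add_workers": 1} if overloaded else {}

    def execute(self, action: dict) -> None:
        if action:
            self.resource.apply(action)

    def run_once(self) -> dict:
        action = self.plan(self.analyze(self.monitor()))
        self.execute(action)
        return action
```

Calling `run_once()` repeatedly on an `AutonomicManager` adjusts the resource only while the analysis step reports a problem, and every reading lands in the shared `knowledge` dictionary that the other steps consult.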
The anatomy of Common Base Events
dW:
I've also heard from other architects about the Common Base Events as a key component when it comes to autonomic computing and problem determination. What role do Common Base Events play in your work?
AS:
I spent a few years on event and data modeling in my earlier work as a member of the IBM PD Council architecture team, which included designing and modeling PD-related artifacts such as messages and events, trace, and dump. This work evolved into the Common Base Event format as autonomic computing became IBM's strategic architecture.
The purpose of the Common Base Event format is to facilitate effective communication among the disparate enterprise components that support logging, management, problem determination, autonomic computing, and on demand business functions in an enterprise. In other words, the Common Base Event provides an effective way to represent the data that needs to be exchanged between the autonomic managers and the producers of events, also known as managed resources.
Events are emitted by the software or hardware resource, the managed resource in the autonomic computing architecture; events are consumed by management applications and tools, also known as autonomic managers, for uninterrupted, consistent, self-resilient operation. The Common Base Event format is an attempt to bring consistency to this exchange so that the management system spends less instrumentation effort when trying to manage a resource.
dW:
Can you break that description of the Common Base Event down a little for me?
AS:
The Common Base Event is a standard XML schema that lends itself to several types of notifications, in particular, logging, management, auditing, and business events. In all of these cases there is a significant need for the data elements and the format of those elements to be consistent because all of these notifications need to be correlated with each other. That is the central concept.
Without data consistency and standardization, data stored in logs or published as events are of little value to autonomic management or business systems that rely on the completeness and accuracy of data to determine an appropriate course of action to take in response to the event. The Common Base Event prescribes a consistent format and definition, which helps to alleviate the problem by making sure the data is complete. It does this by providing properties to identify the component that is affected by the situation, the component that is reporting the situation, and the situation.
dW:
Can the component affected by the problem report the problem too?
AS:
Yes. Commonly, the reporter and the source components can be the same. In that case, the source component identification (sourceComponentId) alone is sufficient.
dW:
What makes up the situation data?
AS:
The situation data includes a common description of the situation that occurred and content that can be used to correlate the situations. Let me get back to this. These three data elements (the component affected by the situation, the component reporting the situation, and the situation itself) form the 3-tuple of the Common Base Event.
Sidebar: An evolutionary approach to adapting PD to AC
IBM is adopting an evolutionary approach to adapt problem determination processes for autonomic environments, starting with the development of a common problem-determination architecture to standardize log format, content, and organization, and then moving (quickly) to solutions that can automate event analysis and correlation through autonomic managers.
One of the most elegant elements of this approach is that it allows companies to incrementally transition to autonomic-based problem determination.
A basic foundation of the problem-determination architecture is the Common Base Event, the common format for log/trace information. Common Base Event is based on a structured 3-tuple format that includes the following elements:
- The source component, the component affected by the event
- The reporting component, the "tattler"
- The situation data, or just what is happening
The 3-tuple concept, organizing the content in this triad, is what makes it possible to write and deploy resource-independent management functions that are capable of isolating a failing component.
For a FAQ on IBM's approach, see this Problem Determination FAQ.
The current Version 1.0.1 of the Common Base Event provides an additional element, called ExtendedDataElement, that offers extensibility by providing a way to specify any product- or application-specific attributes, such as name-type-value collections, that are not defined in the standard Common Base Event elements. Information placed in this element is assumed to conform to some set of syntax and semantics that is understood by both the event producer and the event consumer, such as a defined schema or even product-specific data.
The first two elements of the tuple I mentioned, the source and the reporter, identify the affected and the reporting components and are prescribed by a Common Base Event data type called ComponentIdentification. The component identification property defines a collection of attributes that are required to uniquely identify a component.
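The shape described above can be pictured with a small data-structure sketch. This is a simplified illustration, not the Common Base Event schema: the names ComponentIdentification and ExtendedDataElement come from the interview, but the `Event` class, the reduced attribute sets, and the `effective_reporter` helper are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class ComponentIdentification:
    # Simplified: the real data type defines more attributes than these two.
    component: str
    location: str


@dataclass(frozen=True)
class ExtendedDataElement:
    # One name-type-value entry for product-specific data.
    name: str
    type: str
    value: str


@dataclass
class Event:
    source: ComponentIdentification      # component affected by the situation
    situation: str                       # what happened, e.g. a category name
    reporter: Optional[ComponentIdentification] = None  # omitted when same as source
    extended: List[ExtendedDataElement] = field(default_factory=list)

    def effective_reporter(self) -> ComponentIdentification:
        """Fall back to the source when no separate reporter was given."""
        return self.reporter if self.reporter is not None else self.source
```

Making the reporter optional mirrors the point that the source identification alone is enough when a component reports on itself.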
dW:
How do the source and reporter work?
AS:
For example, a generic data collection adapter could collect and forward logs produced by components in a solution package. Let's say there is an application such as MyApplication running on a server machine. The application encounters an error and logs an entry to the server error log. Then a separate application, for example an adapter, reads messages from the error log, converts them into Common Base Events, and forwards them to the intended management application. In this case, the "affected" or "source" component of the event is MyApplication, and the reporting component is the adapter that collected and submitted the event.
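The MyApplication scenario might look roughly like the following Python sketch. The log line layout, the adapter name `LogAdapter`, and the dictionary keys other than sourceComponentId are assumptions for illustration; this is not how the Generic Log Adapter is implemented.

```python
import re

# Hypothetical error-log layout: "<timestamp> <application> ERROR <message>"
LINE = re.compile(r"^(?P<time>\S+) (?P<app>\S+) ERROR (?P<msg>.+)$")

ADAPTER_NAME = "LogAdapter"  # illustrative name for the reporting component


def to_event(line: str, host: str):
    """Convert one error-log line into a dict with source and reporter parts."""
    m = LINE.match(line)
    if m is None:
        return None  # not an error entry in the assumed layout
    return {
        "creationTime": m.group("time"),
        "msg": m.group("msg"),
        # The application that hit the error is the affected (source) component.
        "sourceComponentId": {"component": m.group("app"), "location": host},
        # The adapter that read the log is the reporting component.
        "reporterComponentId": {"component": ADAPTER_NAME, "location": host},
    }
```

The application never names the adapter and the adapter never claims the error as its own, which is what lets a consumer isolate the failing component.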
dW:
Tell me more about situation data.
AS:
The third element of the Common Base Event tuple is the situation data. As I mentioned earlier, the situation is the data that describes a state change or occurrence of something that causes an event to be reported. The situation information includes a required set of properties or attributes that are common across product groups and platforms, yet flexible enough to allow for adaptation to product-specific requirements.
You can think of situation data as the elements and attributes of a Common Base Event that report a state change or some other significant occurrence as it happened. Among the situation data is the situation category, which is used to describe the category of detected events. I'll talk about that.
dW:
So, in other words, the situation data has to be flexible enough to present the right message, one that's useful in two senses of that word: it has to be usable across various systems and it has to present meaningful information.
AS:
Correct. A situation is defined as the data that a component reports for consumption by autonomic management applications, or for that matter, by a product-specific manager.
dW:
Is there a fourth tuple?
AS:
In Version 1.0.1 of the Common Base Event we have not named any specific element as the fourth tuple element, yet you could think of it as the element of the Common Base Event that contains the content used to correlate events for the purpose of end-to-end tracing of situations. The ContextDataElement is indeed a specialized extended data element and holds data that is used to assist with problem diagnostics by correlating messages or events generated during execution of a unit of work.
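As a rough illustration of correlating events from one unit of work, the sketch below groups event messages by a shared context value. The dictionary keys `msg` and `contextValue` and the `txn-` identifiers are assumptions for the example; the interview does not describe the internal layout of ContextDataElement.

```python
from collections import defaultdict


def correlate(events):
    """Group event messages by a shared context value, keeping arrival order."""
    groups = defaultdict(list)
    for event in events:
        # Events without a context value are collected under the key None.
        groups[event.get("contextValue")].append(event["msg"])
    return dict(groups)


# Hypothetical events from two interleaved units of work.
sample = [
    {"msg": "request received", "contextValue": "txn-1"},
    {"msg": "query failed", "contextValue": "txn-2"},
    {"msg": "response sent", "contextValue": "txn-1"},
]
```

Running `correlate(sample)` separates the two interleaved units of work so each can be traced end to end.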
Defining the situation
dW:
Let's talk more about the situation categories.
AS:
This is a kind of novel idea, and it is one of the most important properties of a Common Base Event. Situation data includes a situation category element. While the Common Base Event was evolving, it was important to accommodate heterogeneous solution components, including hardware resources, legacy applications, and evolving applications and programming models. It was also necessary to accommodate existing data sources, as well as data that is supplied using other existing standard formats. With those two concepts in mind, the situation category was intended not to drastically change what the products and components are currently doing. Rather, it was intended to put some structure and rigor behind how components categorize events (the situations) in an effort to standardize how this is done, so that autonomic managers can be written to act upon the common situations.
For example, a problem from a programmatic perspective is that there is no standard way to check the log files to see if a component of product X has started. X might generate a log entry like "Component X started," yet another product, Y for instance, might log "Change server state from starting to running." The reality is that both of these messages provide the same information. Yet each uses different terminology, which could make it difficult for a program to use. It would be much easier if all components reported, for example, that they "started."
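The two log messages quoted above can be folded into one category with a small rule table. The Python sketch below is an illustration under assumptions: the rule list, the `categorize` function, and the use of "OtherSituation" as the fallback name are not taken from any IBM tool.

```python
import re

# Each rule pairs a message pattern with the situation category it implies.
RULES = [
    (re.compile(r"^Component \S+ started$"), "StartSituation"),
    (re.compile(r"^Change server state from starting to running$"), "StartSituation"),
]

FALLBACK = "OtherSituation"  # placeholder category for unmatched messages


def categorize(message: str) -> str:
    """Return the category of the first rule whose pattern matches."""
    for pattern, category in RULES:
        if pattern.search(message):
            return category
    return FALLBACK
```

A management program can then test for "StartSituation" without knowing either product's wording.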
As I pointed out earlier, in order to build a robust foundation for autonomic computing, there must be a formal and disciplined approach in formatting; this can be called syntax. The definition of the syntax for the data is described in the Common Base Event specification.
Equally important is categorizing the data reported by components, also known as semantics. The categorization of the Common Base Events makes it possible to build autonomic managers that focus on implementing the analysis and planning functions rather than those that have to focus on product-specific data formats, adaptation, and collection efforts.
dW:
How do you define a situation?
AS:
The Common Base Event 1.0.1 specification defines these 12 situation categories:
- StartSituation
- StopSituation
- ConnectSituation
- ConfigureSituation
- RequestSituation
- FeatureSituation
- DependencySituation
- CreateSituation
- DestroySituation
- ReportSituation
- AvailableSituation
and a placeholder for "otherSituation," with "other" being designed to accommodate product-specific requirements.
Think of a situation as being the state change in the resource that causes data to be reported. To further qualify the situation categories, qualifiers are introduced; these are the structured representation of the parameters necessary to describe a situation.
Sidebar: Situation qualifier example
To provide an example of a situation category qualifier, look at the StartSituation category.
In the Common Base Event definition, StartSituation contains three qualifiers: reasoningScope, successDisposition, and situationQualifier.
For the reasoningScope qualifier, there are two values: INTERNAL and EXTERNAL (reporting on the scope of the impact of the situation reported).
For the successDisposition qualifier, there are two values: SUCCESSFUL and UNSUCCESSFUL (reporting on the success of a start).
For the situationQualifier qualifier, there are three values: START INITIATED, RESTART INITIATED, and START COMPLETED.
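The StartSituation qualifiers and their values lend themselves to a simple validity check. The sketch below encodes the values listed in the sidebar; the dictionary layout and the `invalid_qualifiers` function are assumptions for illustration, not part of the specification.

```python
# Allowed values per qualifier, as listed for StartSituation.
START_SITUATION = {
    "reasoningScope": {"INTERNAL", "EXTERNAL"},
    "successDisposition": {"SUCCESSFUL", "UNSUCCESSFUL"},
    "situationQualifier": {
        "START INITIATED",
        "RESTART INITIATED",
        "START COMPLETED",
    },
}


def invalid_qualifiers(situation: dict) -> list:
    """Return sorted names of qualifiers that are missing or hold an unknown value."""
    return sorted(
        name
        for name, allowed in START_SITUATION.items()
        if situation.get(name) not in allowed
    )
```

An empty result means the situation is fully and validly qualified; anything else names the qualifiers a producer still needs to fix.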
Furthermore, situations can be combined to create more complex situations. For example, Start situations reported by the WebSphere® Application Server and DB2® may result in a solution category of StartSituation. Keep in mind that the semantics of the situation are consistently represented by the situation category, and the context of the situation can be understood from the situation data and the reporting/source entities of the Common Base Event 3-tuple.
Categorizing situations allows the specification and examination of a desired state. Suppose, for example, that component A is dependent on component B being started. Categorization of a situation allows us to first check whether the desired state is true, that "B is started," and it allows us to take an action if B is not started.
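The dependency of A on B can be expressed as a desired-state check over categorized events. This Python sketch is a simplification under assumptions: events are plain dictionaries in time order, a successful StartSituation is treated as "started" regardless of its situationQualifier, and `start_action` stands in for whatever corrective action a manager would take.

```python
def is_started(component: str, events: list) -> bool:
    """Replay events in order: a successful start sets the state, a stop clears it."""
    state = False
    for e in events:  # events are assumed to be in time order
        if e["component"] != component:
            continue
        if (e["category"] == "StartSituation"
                and e.get("successDisposition") == "SUCCESSFUL"):
            state = True
        elif e["category"] == "StopSituation":
            state = False
    return state


def ensure_dependency(dependency: str, events: list, start_action) -> bool:
    """Check the desired state and invoke start_action if it does not hold."""
    if is_started(dependency, events):
        return True
    start_action(dependency)
    return False
```

Because the check relies only on category names, the same function works for any component that reports standard situations.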
You should be able to describe the situation that an event represents using one of the predefined categories and its related qualifiers.
To categorize events effectively and accurately, you have to be cognizant of and understand the context in which the event occurred. For example, for a process that has been initiated, does it mean it has started? Or does it mean the connection has been established? Or does it mean that I can now access a table that is available? You can argue for all three (that it has started, connected, or is available) and select the right one when you take into account the context in which the event was reported, for example an application, a network, or a relational database.
dW:
Do the existing AC toolkit tools automatically assign situation categories to the events?
AS:
To answer that, let's look at the Generic Log Adapter (GLA) that is distributed with the Autonomic Computing Toolkit. GLA is a rules-based tool that provides the ability to convert one or more application log entries (or messages) into the Common Base Event format. The parsing rules in the GLA configuration file (also known as the "adapter" file) can be written using regular expressions and Java™ code. The GLA configuration file and the parsing rules in combination allow for easy conversion of proprietary message formats into the Common Base Event format without changing the application code. The GLA rules also allow for mapping instructions that add the proper situation category when generating Common Base Event formatted events.
dW:
How do you determine which is a better choice for adapting events, the regular expressions or Java?
AS:
In general, rules written in regular expressions are more flexible, simpler to maintain, easier to update, and can more easily react to changes or requirements. Also, when you use rules based on regular expressions, you may choose to combine them with the callout feature, the Substitution Extension class, to benefit from greater flexibility; that is, to run custom Java parsing logic to assign a value to a specific property. On the other hand, if your messages are very unstructured and have few detectable patterns, eye-catchers, or delimiters to locate and extract data, then it may be beneficial to use what are known as static or Java-based parsers to parse the messages.
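The combination of a regular-expression rule with a callout for one property can be sketched as follows. This is written in Python for brevity and is only an analogy: it does not reproduce the GLA adapter file format or the Substitution Extension interface, and the message layout, the `CALLOUTS` table, and the function names are assumptions.

```python
import re
from datetime import datetime

# Hypothetical message layout: "[03/May/2005 10:15:00] SEVERE MyApplication: text"
RULE = re.compile(
    r"^\[(?P<time>[^\]]+)\] (?P<level>\w+) (?P<component>\w+): (?P<msg>.*)$"
)


def time_callout(raw: str) -> str:
    """Custom logic for one property: normalize the timestamp to ISO 8601.

    Month abbreviations are parsed with the default (English) locale.
    """
    return datetime.strptime(raw, "%d/%b/%Y %H:%M:%S").isoformat()


# Properties listed here are post-processed; all others are copied as matched.
CALLOUTS = {"time": time_callout}


def parse(line: str):
    """Apply the rule, then run any callout registered for a captured property."""
    m = RULE.match(line)
    if m is None:
        return None
    return {
        name: CALLOUTS.get(name, lambda v: v)(value)
        for name, value in m.groupdict().items()
    }
```

The pattern handles the structured parts of the message, and the callout is reserved for the one property whose value needs real logic, which keeps the rule itself easy to update.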
dW:
But the GLA doesn't automatically assign situation categories to events, does it?
AS:
As the common data format and situations become more widely deployed, it becomes critical to provide low-cost solutions that help in the process of converting existing IT resource messages into the Common Base Event format. The Generic Log Adapter (GLA) fulfills this task; however, the GLA may not handle well the problem of categorizing messages into standard situations, and it may require complex rules to detect and assign the appropriate situation category and related qualifiers. Assigning a situation category often requires understanding the details of individual messages, and the details vary from message to message.
Ideally, assigning meaningful situation information is a process that could be at least partially automated. This is an active research topic that might bear fruit in the future. Any additional tools and technologies that can assist developers in creating useful Common Base Event information through increased automation would help to accelerate the adoption of this technology.
Of course, human expertise will remain important to improve the overall quality and confidence level when assigning the situation categories.
dW:
So, can we expect to see additional tools in the future?
AS:
We intend to continue to enhance and augment the technologies and documentation in the AC Toolkit to enable developers to further exploit AC capabilities in the products that they develop. We recognize that the assignment of situation information in Common Base Events is one area that we would like to make more straightforward for developers.
dW:
This is exciting; this is like an evolutionary advancement in AC capabilities.
AS:
Yes, indeed it is exciting; we expect that technologies and tools will continue to evolve to make adoption of the Autonomic Computing architecture much easier and simpler.
dW:
Thank you for the great interview. We'd like to talk with you again when there are other new advancements in AC self-healing technologies.
Resources
- The Generic Log Adapter for Autonomic Computing is a rule-based tool that transforms software log events into the standard situational event formats in the autonomic computing architecture (the Common Base Events format).
- Canonical Situation Data Format: The Common Base Event defines the Common Base Event and the supporting technologies that make it possible.
- For more on the Common Base Event, try this PDF book.
- The Log and Trace Analyzer for Autonomic Computing is a tool that enables and speeds viewing, analysis, and correlation of log files.
- The Autonomic Management Engine is a sample autonomic manager implementation that monitors system resources, sends aggregated events, and performs corrective actions for problems.
- To read more technical documentation, how-to articles, education, downloads, and product information about autonomic computing, visit the developerWorks Autonomic computing zone.
- You can download the Autonomic Computing Toolkit to take advantage of the tools discussed in this article. And get a good user's guide to Version 2.0 of the toolkit.
- Keep up with news, documentation, and downloads at the IBM Autonomic computing home page.
- The articles "An autonomic computing roadmap" and "Understand the autonomic manager concept" (developerWorks, February 2004) provide information about self-CHOP.
- To learn more about the challenges of working with next-generation autonomic computing systems, read "Meet the experts: Ric Telford on the state of autonomic computing today" (developerWorks, November 2004).
- For more about instituting standards in next-generation autonomic computing systems, read "Meet the experts: Thomas Studwell on driving an autonomic computing standards-based strategy" (developerWorks, January 2005).
- For more about real-world implementations of the Common Base Event, read "Meet the experts: Mickey Nix on life in the trenches" (developerWorks, February 2005).
- Read about some real-life successes resulting from help from IBM deploying autonomic computing solutions.
About the author
This article is brought to you by the developerWorks editorial staff. For comments or questions, contact the staff at dwinfo@us.ibm.com.