How does Netcool/OMNIbus Knowledge Library operate?
Netcool/OMNIbus is a Service Level Management (SLM) system that presents specific users with a consistent, consolidated view of the current state of all systems managed by Netcool/OMNIbus. The Netcool/OMNIbus Knowledge Library 4.8 enhances the ability of Netcool/OMNIbus to provide more valuable information about the events it receives.
Probes, rules files and Netcool/OMNIbus Knowledge Library
The probes used by Netcool/OMNIbus collect and interpret information from disparate managed objects in a network. A probe parses the collected information and sends the parsed data to the ObjectServer in a format that is described by the rules file and is compatible with the ObjectServer fields.
The default rules file that a probe needs in order to run performs only generic grouping of data. A rules file enhanced to handle SNMP Management Information Base (MIB) events from a specific device provides sharper event enrichment and causal analysis. The Netcool/OMNIbus Knowledge Library 4.8 is a collection of such enhanced rules files, tuned to specific managed objects that send SNMP-based events. For more information, refer to Setting up probes to use the updated rules files.
When a device sends SNMP-based events as traps, the probe uses the device-specific rules file in the Netcool/OMNIbus Knowledge Library 4.8 that is specified by the RulesFile property. For more information, refer to Configuring the probes properties files.
ObjectServer and Netcool/OMNIbus Knowledge Library
The IBM Tivoli Netcool/OMNIbus ObjectServer currently uses two main types of automation to help reduce the number of events that require operator intervention. Generic Clear automations are designed to correlate and delete any matching pair of problem and resolution alerts, whereas deduplication eliminates duplicate alerts while maintaining an 'occurrence' count.
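The two automation types can be illustrated with a minimal Python sketch. It is not ObjectServer SQL and not the shipped automation code: the Alert class, its field names, and the choice of (node, alert_key, alert_group) as the matching key are assumptions made only for this illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Alert:
    # All field names here are hypothetical, chosen for the illustration.
    identifier: str            # two alerts with the same identifier are duplicates
    node: str
    alert_key: str
    alert_group: str
    is_resolution: bool = False
    tally: int = 1             # the 'occurrence' count


def deduplicate(table: dict[str, Alert], incoming: Alert) -> None:
    """Store a new alert, or increment the count of an existing duplicate."""
    existing = table.get(incoming.identifier)
    if existing is None:
        table[incoming.identifier] = incoming
    else:
        existing.tally += 1


def correlation_key(alert: Alert) -> tuple[str, str, str]:
    """Assumed key that ties a problem alert to its resolution alert."""
    return (alert.node, alert.alert_key, alert.alert_group)


def generic_clear(table: dict[str, Alert]) -> None:
    """Delete every matching pair of problem and resolution alerts."""
    problems = {correlation_key(a) for a in table.values() if not a.is_resolution}
    resolutions = {correlation_key(a) for a in table.values() if a.is_resolution}
    matched = problems & resolutions
    for ident in [i for i, a in table.items() if correlation_key(a) in matched]:
        del table[ident]
```

In this sketch, two identical "link down" alerts collapse into one row with a tally of 2, and a later "link up" alert for the same node, key, and group removes both the problem and the resolution, leaving unrelated alerts untouched.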
The Netcool/OMNIbus Knowledge Library 4.8 additionally increases the ability of the Tivoli Netcool/OMNIbus ObjectServer automations to correlate alarms and identify root causes by employing the following techniques:
- Event Pre-Classification: This process identifies and flags events within the probe rules files to indicate the causal relevance of events, where this can be determined without the need for correlation.
- Intra-Device Correlation: This process enhances probe rules files and adds automations to the ObjectServer to perform correlation beyond deduplication and problem or resolution correlation, identifying intra-device root causes and symptoms.
- AMOS Extended Event Recognition (for IBM Tivoli Network Manager IP Edition integration): This process provides IBM Tivoli Network Manager IP Edition with a larger dataset upon which to perform topology-based event correlation, by identifying a larger set of events for analysis.
The first two techniques are described below in further detail.
Current root cause analysis and event correlation systems rely on one or more correlation or analysis engines to determine the causal relationships between events. These existing systems ignore the simple 'common sense' understanding of the events as they are received, and are forced to perform root cause analysis operations on the full set of events. This reduces either the efficiency of the root cause analysis system or the accuracy of its analysis.
The event pre-classification mechanism implemented in Netcool/OMNIbus Knowledge Library overcomes these shortcomings. To facilitate pre-classification, a catalog of known events and their causal types is implemented as a lookup table in Netcool/OMNIbus Knowledge Library 4.8. This causal type catalog is referenced by a probe's rules file to determine the causal relevance of a received event before it is forwarded to the ObjectServer.
While causal relevance can be determined by any combination of correlation and analysis methods within an engine, entries in the catalog are restricted to those events whose causal relevance can be determined solely from the data contained within the received event message. Netcool/OMNIbus Knowledge Library 4.8 uses the following guidelines to determine causal relevance when pre-classifying events in the catalog:
- Root Cause: An event with a condition that is known not to be caused by any other detectable condition. A root cause event generally results in a degraded condition or failure of other related entities in a system. For example, if a Frame Relay interface fails, the virtual circuits (DLCIs) traversing that interface will fail. In this example, the Frame Relay interface failure is therefore the root cause of the virtual circuit (DLCI) failures. Root cause events include many physical events, for example, certain card pulls, device shutdown, or power loss.
- Symptom: An event with a condition that was caused by the degraded condition or failure of higher level entities or processes in a system. Based on the example above, the virtual circuit failures are deemed symptoms of the Frame Relay interface failure.
- Singularity: An event with a condition that is not directly caused by any other degraded condition or failure, and which does not cause other degraded conditions or failures in related entities. An example of a singularity is the Accounting File Full condition on some Cisco equipment, which does not cause any other fault condition other than that the accounting file can no longer be written to.
It can be argued that a singularity is equivalent to a root cause. IBM believes that there is value in identifying singularities as events are received, and leaves it to other correlation engines and event management methods to implement the flexibility necessary to provide the system operator a choice of how singularities are finally considered.
- Information: A message that indicates non-fault-related conditions which might be of interest to system operators. Such events also include messages that indicate the clearing or resolution of previously occurring fault-related conditions. Examples of information events include Neighbor Adjacency Establishment events, successful call establishment messages, and recovery messages relating to physical events.
- Unknown: An event that cannot be classified as a root cause, symptom, singularity, or information event. Although they cannot be pre-classified, unknown events may be further analyzed by one or more engines to determine their true causal relevance.
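The catalog lookup described above can be sketched in Python as follows. This is only an illustration of the idea: the Knowledge Library implements the catalog as lookup tables referenced from probe rules files, and the event names, the event_name key, and the causal_type field used here are hypothetical.

```python
from enum import Enum


class CausalType(Enum):
    ROOT_CAUSE = "root cause"
    SYMPTOM = "symptom"
    SINGULARITY = "singularity"
    INFORMATION = "information"
    UNKNOWN = "unknown"


# Hypothetical catalog entries, loosely based on the examples in the text.
CATALOG = {
    "frameRelayInterfaceDown": CausalType.ROOT_CAUSE,
    "frameRelayDlciDown": CausalType.SYMPTOM,
    "accountingFileFull": CausalType.SINGULARITY,
    "neighborAdjacencyEstablished": CausalType.INFORMATION,
}


def pre_classify(event: dict) -> dict:
    """Return a copy of the event tagged with its catalogued causal type.

    Only data inside the event itself is consulted; events that are not
    in the catalog are tagged UNKNOWN and left for later analysis.
    """
    causal_type = CATALOG.get(event.get("event_name"), CausalType.UNKNOWN)
    return {**event, "causal_type": causal_type}
```

Because the lookup uses nothing but the received event, it can run in the probe before the event reaches the ObjectServer, and an event that is missing from the catalog simply falls through as Unknown.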
Intra-device correlation is implemented as a collection of ObjectServer automations that determine the causal relevance of intra-device events by using algorithms which consider managed object parent and child relationships. The automations use information revealed about the relationships to determine related events and test them for causal relevance.
There are separate automations for determining root causes and symptoms. The symptom-detecting automations process an event at least once before the event is processed by the root-cause-detecting automations. If the event is identified as a symptom, it is ignored by the correlation automations. This allows more granular control over which events are processed, reducing the load that the automations place on the ObjectServer. The automations are implemented in the same style as the Generic Clear v7.x automations, using a separate table for correlating temporarily held events, which further enhances performance.
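The ordering of the two passes can be sketched in Python. This is a simplified model under stated assumptions, not the shipped automations: the Event class, the parent_of mapping from a managed object to its parent, and the rule that an event is a symptom when an ancestor on the same device also has an event are all hypothetical simplifications.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Event:
    device: str
    entity: str                  # managed object on the device (hypothetical field)
    causal_type: str = "unknown"


def ancestors(entity: str, parent_of: dict[str, str]):
    """Yield each ancestor of an entity, nearest first (guards against cycles)."""
    seen = set()
    current = parent_of.get(entity)
    while current is not None and current not in seen:
        seen.add(current)
        yield current
        current = parent_of.get(current)


def detect_symptoms(events: list[Event], parent_of: dict[str, str]) -> None:
    """First pass: an event is a symptom if an ancestor on the same device has an event."""
    affected = {(e.device, e.entity) for e in events}
    for e in events:
        if any((e.device, a) in affected for a in ancestors(e.entity, parent_of)):
            e.causal_type = "symptom"


def detect_root_causes(events: list[Event], parent_of: dict[str, str]) -> None:
    """Second pass: skip symptoms; flag events whose descendants reported symptoms."""
    parents_of_symptoms = {
        (e.device, a)
        for e in events if e.causal_type == "symptom"
        for a in ancestors(e.entity, parent_of)
    }
    for e in events:
        if e.causal_type == "symptom":
            continue  # already explained, so not examined again
        if (e.device, e.entity) in parents_of_symptoms:
            e.causal_type = "root cause"


def correlate(events: list[Event], parent_of: dict[str, str]) -> None:
    detect_symptoms(events, parent_of)      # symptom detection runs first
    detect_root_causes(events, parent_of)   # then root cause detection
```

With a hierarchy in which two DLCIs are children of interface serial0, events on serial0 and both DLCIs of one router yield one root cause and two symptoms, while a lone DLCI event on a different router stays unknown because nothing on that device explains it.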