IBM Support

Be the "Ruler" of your policies and correlation using Event Management

Technical Blog Post


Body

Event Management provides DevOps (and Ops) teams with pre-configured event correlation, incident prioritization, notifications, and runbooks (automated and manual) in a seamless experience suitable for any skill level. You can subscribe to Cloud Event Management in Bluemix or in the Marketplace.
 
This blog outlines how to use Event policies to achieve any type of event correlation you may need, ensuring that your incidents and your incident response rule the day.
 
A quick bit about Event policies. You can use these policies to enrich event fields (perhaps adding location details for a known host or application, or to achieve various types of correlation), to suppress events (reducing the noise), to associate runbooks that are known to resolve a given type of event (guiding less experienced operators to tried-and-true automated or manual runbook steps, improving MTTR), and to detect flapping events (events that flip-flop often between the problem and resolution states).
 
To discuss enrichment for correlation, we first need to discuss how correlation works out of the box. Here is the 411 (the relevant information) on correlation:
 
Think of it as a series of waterfalls. If a field is empty, correlation moves on to the next field, and a difference between the first populated correlation fields in two events means they won't correlate naturally.
 
Events may contain some or all of the fields used by correlation, but only event.resource.name is required.
 
One incident will contain all events that have the same (1) event.resource.cluster value. An event with cluster set and one without it set will not correlate naturally.
 
That behavior then cascades using these fields in order:
 
(2) event.resource.application
(3) event.resource.hostname aka the server name
(4) event.resource.ipaddress
(5) event.resource.sourceId
(6) event.resource.service
(7) event.resource.name, the catch-all if none of the above event fields contain values
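
The waterfall above can be sketched in a few lines of Python. This is only one reading of the description in this post, not the product's actual implementation: the plain dictionaries standing in for event.resource, the helper names correlation_key and correlates, and the sample values are all assumptions made for illustration.

```python
# Correlation fields in the order listed above; the first populated one wins.
CORRELATION_FIELDS = [
    "cluster",
    "application",
    "hostname",
    "ipaddress",
    "sourceId",
    "service",
    "name",  # catch-all; the only required field
]


def correlation_key(resource):
    """Return (field, value) for the first populated correlation field."""
    for field in CORRELATION_FIELDS:
        value = resource.get(field)
        if value:  # missing, None, and empty strings count as "not populated"
            return (field, value)
    raise ValueError("event.resource.name is required")


def correlates(resource_a, resource_b):
    """Two events correlate naturally only if field and value both match."""
    return correlation_key(resource_a) == correlation_key(resource_b)


# Illustrative resources (made-up values).
a = {"name": "disk1", "hostname": "correl8me"}
b = {"name": "cpu0", "hostname": "correl8me"}
c = {"name": "db", "application": "database8", "hostname": "correl8me"}

print(correlates(a, b))  # both fall through to hostname with the same value
print(correlates(a, c))  # c stops at application, a stops at hostname
```

Note that c shares a hostname with a and b but still would not join them under this reading, because its first populated field is application.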
 
You can find correlation examples using example event data here.
 
Perhaps you have 3 events in your environment that you would like to correlate, regardless of which fields they carry or how they correlate today.
 
The summary of all 3 events contains the word database, and all 3 are critical events. 2 of the events have a hostname of correl8me, and the other has an application name of database8. Assuming there are no other matching fields, two policies will be needed to achieve your goal.
 
The first policy appears as follows:
 
[Image: the first policy]
 
The policy above reads as follows: If summary contains database and severity is critical and application is database8 then enrich cluster by replacing its value with "Database 8".
 
The second policy appears as follows:
 
[Image: the second policy]
 
The policy above reads as follows: If summary contains database and severity is critical and hostname is correl8me then enrich cluster by replacing its value with "Database 8".
 
With both of these policies active and in this order, the 3 events which previously did not correlate, and instead generated 3 incidents, will now appear in the same incident.
 
[Image: the 3 events shown in a single incident]
 
Often only one policy is required to achieve correlation of disparate events. However, if the events vary enough and organically lack keys for correlation, two or more policies may be required.
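
To make the effect of the two policies concrete, here is a minimal Python sketch of the same conditions and enrichment. Policies are configured in the Event Management UI, not in code, so this only models the logic: the event dictionaries, the make_enrich_policy helper, the sample summaries, and the case-sensitive "contains" check are all assumptions for illustration.

```python
def make_enrich_policy(field, expected, cluster_value):
    """Build a policy: if summary contains 'database', severity is critical,
    and resource[field] equals `expected`, replace resource['cluster']."""
    def apply(event):
        resource = event["resource"]
        if (
            "database" in event["summary"]  # case-sensitive by assumption
            and event["severity"] == "critical"
            and resource.get(field) == expected
        ):
            resource["cluster"] = cluster_value
        return event
    return apply


# Policies in the order described: application first, then hostname.
policies = [
    make_enrich_policy("application", "database8", "Database 8"),
    make_enrich_policy("hostname", "correl8me", "Database 8"),
]

# Three made-up events matching the scenario above.
events = [
    {"summary": "database pool exhausted", "severity": "critical",
     "resource": {"name": "pool", "hostname": "correl8me"}},
    {"summary": "database disk nearly full", "severity": "critical",
     "resource": {"name": "disk", "hostname": "correl8me"}},
    {"summary": "database latency high", "severity": "critical",
     "resource": {"name": "query", "application": "database8"}},
]

for event in events:
    for policy in policies:
        policy(event)

clusters = [event["resource"]["cluster"] for event in events]
print(clusters)  # ['Database 8', 'Database 8', 'Database 8']
```

Because cluster is the first field checked by correlation, giving all three events the same cluster value is what places them in one incident.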
 
The good news is that policies are easy to write, order, and test. By examining your current incidents, you may find incidents and events that often cluster together. Once your expertise tells you that those events are in fact related, you can easily correlate them to reduce the clutter and streamline your incident response.
 
Clusters or swarms of events often contain both actionable events, like the ones described above, and non-actionable, lower-severity events. Correlate the actionable events as shown, then identify and suppress the non-actionable events with other policies to further reduce the noise in your environment.
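
As a rough illustration of the suppression idea, the sketch below filters out lower-severity events before they would reach an incident. The severity names in LOW_SEVERITIES, the should_suppress helper, and the sample events are assumptions; in practice you would express the equivalent condition in a suppression policy and choose the criteria that fit your environment.

```python
# Hypothetical set of severities treated as non-actionable in this sketch.
LOW_SEVERITIES = {"information", "warning"}


def should_suppress(event):
    """Return True when the event's severity is in the non-actionable set."""
    return event["severity"].lower() in LOW_SEVERITIES


# Made-up events: one actionable, two noisy.
events = [
    {"summary": "database disk nearly full", "severity": "critical"},
    {"summary": "database backup started", "severity": "information"},
    {"summary": "database cache miss rate rising", "severity": "warning"},
]

kept = [event for event in events if not should_suppress(event)]
print([event["summary"] for event in kept])
```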
 
Seize these examples and begin ruling events to better your operational environment.
 
Authors:
Phil Riedel
Sudhakar Chellam

UID

ibm11080057