I thought I would write up
another blog post on event correlation. This is something that a few
products (ITNM, for example) already perform, and it can be enormously
useful in helping operators get to the root of problems that show up as many events.
Here is an example scenario that I will walk through:
- You have a virtualized environment in your company that consists of 10 host machines with 50 VMs on each host (500 VMs in total).
- Suddenly 51 events appear in your event list, all of them complaining that a machine has run out of memory.
- The phone rings. It's an angry customer asking why their webpage is down. "Did you reboot it?"
- You pick one of the events at random and check which node it came from.
- You pull up your Excel spreadsheet, which has the VM -> host mapping, and find this VM.
- You log into this host. "Hrm, seems fine... oh, that VM isn't on here any more. Someone migrated it."
- You keep doing this until you eventually find the host that is out of memory (this takes a while).
- You get another phone call. "I'm not paying for your services if my webpage is down!! What? Yes... I rebooted it 3 times already."
Scary scenario, right? Well, it doesn't have to be. You could have OMNIbus do a lot of the work for you! Let me show you how.
Firstly let's determine what information we need to actually make this connection. We will need:
- Events coming from VMs
- Events coming from hosts
- The host that a VM lives on (and information about when this changes due to migrations)
- The type of events that might be connected (in our example, "out of memory" events)
We already have the first two: these are the 51 events that came in and
confused the operator. For the latter two pieces of information we can
leverage ITM. ITM offers agents that understand many hypervisors and can
give us information about which hosts VMs live on.
First, we will want to create a table to keep track of VM -> Host information.
Table to track VM -> Host association
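create table custom.vmstatus persistent (
VMHostName varchar(64) primary key,
HyperHostName varchar(64), LastChange time -- reconstructed: HyperHostName is the name the trigger uses; LastChange and both types are assumed
);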
OK, so this table will store the VM host name, the host name of the hypervisor on which it resides, and the time when this was last changed.
You could implement your own reinsert trigger for this table and do
something smart when you notice that the hypervisor host changes (this
means you are seeing a VM migration from one host to another, i.e. vMotion).
This is left as an exercise for the reader.
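If you want a feel for what that reinsert logic has to decide before writing it as a trigger, here is a minimal sketch in Python. It is only a model of the idea, not ObjectServer SQL, and every name in it (VmStatusRow, upsert_vm_status, the field names, the example VM and host names) is made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class VmStatusRow:
    """One row of a vmstatus-like table (field names are illustrative)."""
    vm_host_name: str
    hyper_host_name: str
    last_change: float


def upsert_vm_status(table, vm_host_name, hyper_host_name, now):
    """Insert or refresh a VM -> host row.

    Returns (old_host, new_host) when the VM has moved, otherwise None.
    """
    row = table.get(vm_host_name)
    if row is None:
        # First time we hear about this VM: plain insert.
        table[vm_host_name] = VmStatusRow(vm_host_name, hyper_host_name, now)
        return None
    if row.hyper_host_name == hyper_host_name:
        # Same host reported again: nothing changed.
        return None
    # The hypervisor host changed, so this looks like a migration.
    old_host = row.hyper_host_name
    row.hyper_host_name = hyper_host_name
    row.last_change = now
    return (old_host, hyper_host_name)


table = {}
first = upsert_vm_status(table, "vm042", "esx01", 1000.0)
repeat = upsert_vm_status(table, "vm042", "esx01", 1060.0)
moved = upsert_vm_status(table, "vm042", "esx07", 1120.0)
print(first, repeat, moved)
```

The "something smart" would hang off the non-None return value, for example raising an informational event that says the VM moved.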
Once you have this table you will need to configure ITM to insert events
into this table via the EIF probe. I will attach example rules files
to this blog post (I'll actually attach the entire scenario
with the bells and whistles).
This should be enough information
in an ObjectServer to start doing correlation. So, let's get to setting
up some automation to perform correlations.
Correlation Using A Trigger
To perform correlation we can use a technique similar to the generic
clear trigger. We will create a temporal trigger which populates a
temporary table of potential symptoms, and then iterates over all
potential root causes to see if any of the symptoms match the root
cause. When a match is found we can perform any
"correlation" action we want. For this blog post I will add a new column to
alerts.status to represent a correlation. So let's do that first:
alter table alerts.status add column ParentIdentifier varchar(255);
So, the idea is: when I find a correlation I will point the ParentIdentifier field of the symptom event at the Identifier of the root cause event (a parent/child relationship).
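To make the parent/child idea concrete, here is a small Python sketch of what a consumer of that field could do with it: group symptom events under their root cause. This is not OMNIbus code, just a model; group_by_parent and the example Identifier values are invented for illustration, and I am assuming an uncorrelated event carries an empty ParentIdentifier.

```python
from collections import defaultdict


def group_by_parent(events):
    """Map each root cause Identifier to the Identifiers of its symptom events."""
    children = defaultdict(list)
    for event in events:
        parent = event.get("ParentIdentifier", "")
        if parent:  # an empty string is taken to mean "not correlated"
            children[parent].append(event["Identifier"])
    return dict(children)


events = [
    {"Identifier": "esx01:mem", "ParentIdentifier": ""},
    {"Identifier": "vm001:mem", "ParentIdentifier": "esx01:mem"},
    {"Identifier": "vm002:mem", "ParentIdentifier": "esx01:mem"},
    {"Identifier": "vm300:cpu", "ParentIdentifier": ""},
]
tree = group_by_parent(events)
print(tree)
```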
We will also need to create our temporary table to perform correlation:
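create table custom.vm_events virtual (
VM_Identifier varchar(255) primary key,
ErrorType varchar(255), HyperHostName varchar(64), RC_Identifier varchar(255) -- reconstructed from the names the trigger uses; types are assumed
);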
OK, now in that temporary table we will be able to store the Identifier of each potential symptom, the error type of the event, the host name of the hypervisor that the VM lives on and, once we find one, the Identifier of the root cause.
Let's write the correlation trigger (with comments throughout).
-- Note: the group and priority values are assumed; those lines are missing from the original listing
create or replace trigger vm_correlate
group default_triggers
priority 10
comment 'Virtual machine to hypervisor host event correlation'
every 20 seconds
begin
    for each row vm_corr_cand in alerts.status where -- For all the possible symptom events
        vm_corr_cand.Severity > 1 and -- Ignore events that have been cleared
        vm_corr_cand.AlertGroup in ('Memory Allocation Status', 'CPU Status', 'Network Link Status') and -- Correlate over these event types
        vm_corr_cand.Node in (select VMHostName from custom.vmstatus) -- Symptom events can only come from VMs
    begin
        for each row vm_row in custom.vmstatus where vm_corr_cand.Node = vm_row.VMHostName -- Find this VM's host in our vmstatus table
        begin
            -- Store a temporary row remembering the event, its alert group and the host that the VM lives on
            insert into custom.vm_events (VM_Identifier, ErrorType, HyperHostName) values (vm_corr_cand.Identifier, vm_corr_cand.AlertGroup, vm_row.HyperHostName);
        end;
    end;

    for each row root_cause in alerts.status where -- For each possible root cause event
        root_cause.Severity > 1 and -- Ignore cleared events
        root_cause.AlertGroup in (select ErrorType from custom.vm_events) and -- Only events that match a symptom event's alert group
        root_cause.Node in (select HyperHostName from custom.vm_events) -- Only consider events that happen on hosts with symptoms
    begin
        -- If we enter this loop we have found a root cause event. We will mark the symptoms with the root cause identifier.
        -- We could mark up the event in alerts.status now, but for ease of reading I will do this later in the trigger
        update custom.vm_events set RC_Identifier = root_cause.Identifier where
            HyperHostName = root_cause.Node and
            ErrorType = root_cause.AlertGroup;
    end;

    -- Now modify the events if a correlation has been found
    for each row corr_event in custom.vm_events where
        corr_event.RC_Identifier != '' -- If the root cause identifier is not empty it means we found a match for this VM event
    begin
        -- Downgrade the severity of the symptom event and upgrade the severity of the root cause event
        -- We will also set the new field in alerts.status to show the relationship
        update alerts.status via corr_event.VM_Identifier set ParentIdentifier = corr_event.RC_Identifier, Severity = 2;
        update alerts.status via corr_event.RC_Identifier set Severity = 5;
    end;

    -- Clear down our temporary table
    delete from custom.vm_events;
end;
With this trigger in place, when VM and host events of the type CPU Status, Memory Allocation Status or Network Link Status arrive, we should match them up correctly. This will hopefully make the events that come in during a vMotion (network link down and up, memory hiccups) much easier to deal with and save everyone some time.
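If you would like to sanity check the logic outside of an ObjectServer, the three passes of the trigger can be modelled in a few lines of Python. Treat this as a rough sketch and not the shipped implementation: events are plain dictionaries, the VM -> host table is a dictionary, and the correlate function and all of the example identifiers and host names are invented for illustration.

```python
CORRELATED_GROUPS = {"Memory Allocation Status", "CPU Status", "Network Link Status"}


def correlate(alerts, vm_to_host):
    """Run one pass of the correlation logic over a list of event dicts, in place."""
    # Pass 1: collect candidate symptoms (live events of a watched type from a known VM).
    vm_events = []
    for ev in alerts:
        if (ev["Severity"] > 1
                and ev["AlertGroup"] in CORRELATED_GROUPS
                and ev["Node"] in vm_to_host):
            vm_events.append({
                "VM_Identifier": ev["Identifier"],
                "ErrorType": ev["AlertGroup"],
                "HyperHostName": vm_to_host[ev["Node"]],
                "RC_Identifier": "",
            })

    # Pass 2: a live event on the hypervisor host with the same alert group is a root cause.
    for ev in alerts:
        if ev["Severity"] <= 1:
            continue
        for sym in vm_events:
            if sym["HyperHostName"] == ev["Node"] and sym["ErrorType"] == ev["AlertGroup"]:
                sym["RC_Identifier"] = ev["Identifier"]

    # Pass 3: mark up the matched events (symptom down to 2, root cause up to 5).
    by_id = {ev["Identifier"]: ev for ev in alerts}
    for sym in vm_events:
        if sym["RC_Identifier"]:
            child = by_id[sym["VM_Identifier"]]
            child["ParentIdentifier"] = sym["RC_Identifier"]
            child["Severity"] = 2
            by_id[sym["RC_Identifier"]]["Severity"] = 5
    return alerts


def make_event(identifier, node, severity=4):
    """Build an illustrative out-of-memory event."""
    return {"Identifier": identifier, "Node": node, "Severity": severity,
            "AlertGroup": "Memory Allocation Status", "ParentIdentifier": ""}


vm_to_host = {"vm001": "esx01", "vm002": "esx01", "vm300": "esx02"}
alerts = [
    make_event("esx01:mem", "esx01"),
    make_event("vm001:mem", "vm001"),
    make_event("vm002:mem", "vm002"),
    make_event("vm300:mem", "vm300"),  # lives on esx02, which has no event of its own
]
result = {ev["Identifier"]: ev for ev in correlate(alerts, vm_to_host)}
for ev in result.values():
    print(ev["Identifier"], ev["Severity"], ev["ParentIdentifier"])
```

In this example the two VMs on esx01 end up pointing at the esx01 event, while the VM on esx02 is left alone because its host has not raised a matching event.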
With the ParentIdentifier -> Identifier relationship you should be able to do some pretty cool visualisations with the coming release of WebGUI (I'll make another blog post about this when the time is right).
With this blog post I will attach the example triggers we will be deploying with 7.next so you can take a look at them and see if you can take advantage of something similar.
In that package you will find:
- The trigger above with some extra work done to provide more information.
- A right click tool to jump from Symptom events to Root cause events
- EIF rules to be used with a VMWare ITM agent
- A configuration piece to be used with 7.next WebGUI
I look forward to any feedback on this post.