Hybrid system overview

Learn about the capabilities of a hybrid deployment of IBM® Netcool® Operations Insight®.

Netcool Operations Insight on Red Hat® OpenShift® is an AI-powered operations management solution that assures the availability of applications, services, and network infrastructure across local, cloud, and hybrid environments by identifying actual and potential service degradations and outages. Netcool Operations Insight on OpenShift uses cognitive analysis of real-time and historical event data from diverse sources to consolidate events into a filtered subset of actionable incidents with a probable cause. Integrated service and topology management provides contemporary and historical topological context for events and incidents, and incident management and runbook automations expedite incident resolution.

Service and topology management

Service and topology management enables the real-time and historical visualization of highly dynamic and distributed infrastructure and services.

Many observer integrations are available to obtain topology and state information from a multitude of disparate sources. These observers are easily configured and run from a provided configuration UI, or through APIs. The information that is collected by the observers is used to build a dynamic topological representation, which can be viewed in the Topology Viewer.

You can query the built topology and display a topological view of a chosen resource, with its relationships within a configurable number of hops, its properties, and its state. A topology can be viewed dynamically, so that incoming changes are shown as they arrive, or incoming changes can be paused and viewed on demand. The history timeline can be used to view any resource in the topology and the changes that occurred to its relationships, properties, and state in a defined time window.
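The idea of showing a resource with its relationships up to a configurable number of hops can be illustrated with a breadth-first search. This is a minimal sketch and not the product's implementation: the function name resources_within_hops, the adjacency-map representation, and the sample resource names are all assumptions made for illustration.

```python
from collections import deque


def resources_within_hops(adjacency, start, max_hops):
    """Return {resource: hop_distance} for every resource reachable from
    `start` in at most `max_hops` relationship hops (breadth-first search)."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distances[current] == max_hops:
            continue  # do not expand beyond the configured hop limit
        for neighbor in adjacency.get(current, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    return distances


# Hypothetical topology, stored as an undirected adjacency map.
topology = {
    "app-1": ["vm-1"],
    "vm-1": ["app-1", "host-1"],
    "host-1": ["vm-1", "switch-1"],
    "switch-1": ["host-1"],
}

print(resources_within_hops(topology, "app-1", 2))
```

With a hop limit of 2, the view that is seeded at app-1 includes vm-1 and host-1 but not switch-1, which is three hops away.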

Note: Integration with on-premises IBM Agile Service Manager is not supported for hybrid deployments.

Event management

If pre-defined attributes of incoming events have the same values, then the events are treated as related events and are correlated into an incident. The incident priority is determined by the highest-severity event that the incident contains. If an event occurs multiple times (the resource bundle and eventType are the same), then deduplication adds only one of these events to the owning incident and increments the count for this event.
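The correlation, deduplication, and priority rules that are described above can be sketched as follows. This is an illustration under stated assumptions, not the product's code: the field names application, resource, and severity, the choice of application as the correlation attribute, and the convention that a larger severity number is more severe are all hypothetical. The resource field stands in for the resource bundle.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    # Maps a deduplication key to the single stored copy of that event.
    events: dict = field(default_factory=dict)

    @property
    def priority(self):
        # Assumption: a larger severity number means a more severe event.
        return max(e["severity"] for e in self.events.values())


def correlate(events, correlation_attrs=("application",)):
    """Group events into incidents and deduplicate repeats within each one."""
    incidents = {}
    for event in events:
        # Events whose correlation attributes match belong to one incident.
        incident_key = tuple(event[a] for a in correlation_attrs)
        incident = incidents.setdefault(incident_key, Incident())
        # Same resource and eventType: keep one event, increment its count.
        dedup_key = (event["resource"], event["eventType"])
        if dedup_key in incident.events:
            incident.events[dedup_key]["count"] += 1
        else:
            incident.events[dedup_key] = {**event, "count": 1}
    return incidents


sample_events = [
    {"application": "billing", "resource": "db-1", "eventType": "HighCPU", "severity": 4},
    {"application": "billing", "resource": "db-1", "eventType": "HighCPU", "severity": 4},
    {"application": "billing", "resource": "web-1", "eventType": "Timeout", "severity": 6},
    {"application": "search", "resource": "idx-1", "eventType": "DiskFull", "severity": 5},
]

for key, incident in correlate(sample_events).items():
    print(key, "priority:", incident.priority, "events:", len(incident.events))
```

In this sample, the two identical HighCPU events on db-1 are stored once with a count of 2, and the billing incident takes its priority from the Timeout event.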

You can create event policies that perform actions against events, such as enriching events with additional information, suppressing events under specific conditions, or assigning runbooks to events to aid resolution. Incident policies can be created to assign incidents to specified groups automatically, notify users, or escalate incidents that do not have an investigation in progress after a configured time.

Cloud native analytics

Historical and live event data is analyzed to identify patterns and correlations, and policies are then suggested that can be used to group events into incidents. Policies can be auto-deployed, or can be set to require manual review first. Scheduled training runs ensure that grouping policies maintain their relevance to the stream of incoming events.

Events are grouped by the following:

  • Seasonality - events that recur at a particular time, for example at the same hour of the day or on the same day of the week.
  • Temporal grouping - events that are related because they usually occur within a short time of each other.
  • Temporal patterns - events that match a temporal pattern. A temporal pattern is a pattern of behavior that is common to similar temporal groups that occur on different resources.
  • Topological correlation - events that occur on resources that are topologically related, or on a defined part of the topology.
  • Scope-based correlation - events that are grouped by a user-defined scope-based policy, which groups events that share a common attribute, such as a particular resource or sub topology, and that occur within a specific time window.
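
As an illustration of the last item in the list, scope-based grouping can be sketched as grouping events that share a scope attribute and fall within a time window. This is a simplified sketch, not the product's algorithm: the function name group_by_scope, the resource and timestamp fields, and the choice to anchor the window at the first event of each group are assumptions.

```python
def group_by_scope(events, scope_attr="resource", window_seconds=300):
    """Group events that share `scope_attr` and arrive within
    `window_seconds` of the first event in their group."""
    groups = []
    open_group = {}  # scope value -> index of the most recent group for it
    for event in sorted(events, key=lambda e: e["timestamp"]):
        scope = event[scope_attr]
        idx = open_group.get(scope)
        if (idx is not None
                and event["timestamp"] - groups[idx][0]["timestamp"] <= window_seconds):
            groups[idx].append(event)
        else:
            # Window expired or first event for this scope: start a new group.
            open_group[scope] = len(groups)
            groups.append([event])
    return groups


# Hypothetical events; timestamps are seconds since an arbitrary origin.
scoped_events = [
    {"resource": "router-a", "timestamp": 0},
    {"resource": "router-b", "timestamp": 50},
    {"resource": "router-a", "timestamp": 100},
    {"resource": "router-a", "timestamp": 400},
]

for group in group_by_scope(scoped_events):
    print([(e["resource"], e["timestamp"]) for e in group])
```

The event on router-a at 400 seconds falls outside the 300-second window that opened at 0 seconds, so it starts a new group.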

Deployed policies automatically group incoming events together into incidents where they match the conditions of the policy, reducing noise and presenting actionable incidents in the Alerts page. These incidents, which are composed of events that the user can examine individually, present a holistic view of the problem instead of a much larger volume of isolated single events.

Cloud native analytics generates a heartbeat event to self-monitor the health of its own services.

Probable cause

On the Alerts page, a weighted probable cause is shown for each of the events in an incident to help identify which event has the greatest probability of being the cause. Probable cause ratings are calculated for each of the events in the incident by using text classification and topological information. The way that probable cause ratings are calculated is configurable.

Topology analytics

Events that have an associated resource in the topology are enriched with topological information, and the Alerts page indicates when an event has an associated topology that you can open from the event.

This dynamic topology mapping provides topological context when investigating an incident. Operators can drill down into an incident's topology and see a timeline of recent changes on the event's associated topological resource, which helps them identify and resolve the cause of the incident faster.

Incident management

The Incidents page displays all of the current incidents, and can be filtered to show only incidents that are assigned to a group or the current user. You can add events to an incident, assign it to an operator, change its state (for example, to In Progress or Resolved), view the events in the incident, view a timeline of the incident's history, and see suggested runbooks.

Runbook automation

You can create and manage runbooks that provide full and partial automation of common operations procedures. When an incident is identified, AI models match the incident with previous similar incidents and their successful resolution actions, and suggest a runbook automation that can be used to resolve the issue. The runbook automations use tested and trusted procedures from similar incidents to provide a fast, reliable, and traceable resolution.

Anomaly detection and incident avoidance

Performance metrics are ingested from multiple sources and analyzed to model the normal operational range of the metric within its environment. Analytics are used to build mathematical models, which help detect and forecast anomalous metric values, and to generate events to proactively avoid incidents.
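
A very simple way to model a normal operational range is a band of k standard deviations around the historical mean. The sketch below uses that technique purely as an illustration; the source does not state which mathematical models the product builds, and the function names, the value of k, and the sample data are assumptions.

```python
from statistics import mean, stdev


def normal_range(history, k=3.0):
    """Return (low, high): the mean of `history` plus or minus k sample
    standard deviations."""
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def detect_anomalies(history, new_values, k=3.0):
    """Return the values in `new_values` that fall outside the normal range."""
    low, high = normal_range(history, k)
    return [v for v in new_values if v < low or v > high]


# Hypothetical CPU utilization samples (percent).
cpu_history = [50, 52, 48, 51, 49, 50, 53, 47]

print(normal_range(cpu_history))
print(detect_anomalies(cpu_history, [51, 58, 43]))
```

For this history, the mean is 50 and the sample standard deviation is 2, so values below 44 or above 56 are reported as anomalous. A production model would typically also account for trend and seasonality, which this sketch ignores.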

Search (Humio and Log Analysis)

The search and analysis capabilities of on-premises Operations Analytics - Log Analysis can be run against selected events, for example, to search for similar events, events from the same node, or events with a matching keyword.

On Red Hat OpenShift, an integration with Humio can be configured to enable searching for events and topological resources in logs. Humio can also be used to search logs and create alerts if the specified search criteria are matched.

High availability and disaster recovery

High availability (HA) and disaster recovery (DR) are configurable for hybrid deployments.

Network management

Network Manager displays availability, performance, event, and configuration data for network views. Netcool Configuration Manager provides configuration and compliance management capabilities for network devices, and reports devices that violate user-defined rules. Topology Search is an extension of the Networks for Operations Insight feature. It provides insight into network performance by analyzing events that have been enriched with network data and determining the lowest-cost routes between two endpoints on the network over time.
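
Finding a lowest-cost route between two endpoints is a classic shortest-path problem. The sketch below uses Dijkstra's algorithm as a generic illustration; it is not a description of how Topology Search computes routes, and the function name, the link-cost map, and the node names are assumptions.

```python
import heapq


def lowest_cost_route(links, source, target):
    """Dijkstra's algorithm over `links`, a map {node: {neighbor: cost}} with
    non-negative costs. Returns (total_cost, path), or (None, []) if the
    target is unreachable."""
    queue = [(0, source, [source])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node in settled:
            continue  # a cheaper route to this node was already found
        settled.add(node)
        if node == target:
            return cost, path
        for neighbor, link_cost in links.get(node, {}).items():
            if neighbor not in settled:
                heapq.heappush(queue, (cost + link_cost, neighbor, path + [neighbor]))
    return None, []


# Hypothetical directed network with a cost on each link.
network_links = {
    "A": {"B": 1, "C": 5},
    "B": {"C": 1, "D": 7},
    "C": {"D": 2},
    "D": {},
}

print(lowest_cost_route(network_links, "A", "D"))
```

In this sample, the route from A through B and C to D has a total cost of 4, which is lower than the cost of either more direct route.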

Users can run a discovery to find all the devices and interfaces on their network, determine their connectivity, and build a topological representation. Polling can be configured to monitor any scope of the discovered topology, and to generate events if configured thresholds on certain values are violated, or the polled device or interface is unresponsive.
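
The polling behavior can be sketched as a function that turns one poll result into an event when a threshold is violated or the target does not respond. This is an illustration only: the function name evaluate_poll, the event fields, the eventType values, and the use of None to represent an unresponsive device are assumptions, not product identifiers.

```python
def evaluate_poll(device, metric, value, threshold):
    """Return an event dict if the poll result warrants one, else None.

    `value` is None when the device or interface did not respond."""
    if value is None:
        return {"device": device, "metric": metric, "eventType": "PollFailure"}
    if value > threshold:
        return {
            "device": device,
            "metric": metric,
            "eventType": "ThresholdViolation",
            "value": value,
            "threshold": threshold,
        }
    return None  # value is within the configured threshold


# Hypothetical poll results: (device, metric, polled value, threshold).
poll_results = [
    ("switch-1", "cpuUtil", 42.0, 80.0),
    ("switch-2", "cpuUtil", 93.5, 80.0),
    ("switch-3", "cpuUtil", None, 80.0),
]

poll_events = [e for e in (evaluate_poll(*r) for r in poll_results) if e is not None]
for event in poll_events:
    print(event)
```

Only switch-2 and switch-3 produce events in this sample: the first for exceeding the threshold, and the second for not responding.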

The discovered topology can be visualized, with its alert status, in standard network views, and in a hop view of a chosen device with a configurable number of its connections. The Network Health Dashboard can be used to display availability, performance, event, and configuration data for monitored devices and interfaces in user-selected network views. Devices can be examined in more detail with the Structure Browser, MIB Browser, and MIB Grapher, and reports can be run to retrieve data about the network and its performance.

Events are received from OMNIbus probes and from polls. The Active Event List can be used to view and filter these events, and launch to any associated topology. If events occur on topologically linked devices, then Network Manager identifies the root cause event, and highlights it in the network and event visualizations.
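
One simplified way to reason about root cause on topologically linked devices is to treat an event as a symptom when a device upstream of it also has an event. The sketch below illustrates only that idea; it is not Network Manager's root cause analysis algorithm, and the function name, the single-upstream-device model, and the device names are assumptions.

```python
def classify_events(upstream_of, alerting):
    """Label each alerting device as "root cause" or "symptom".

    `upstream_of` maps a device to its upstream device (None at the top).
    A device is a symptom if any device upstream of it is also alerting."""
    labels = {}
    for device in alerting:
        label = "root cause"
        visited = set()
        parent = upstream_of.get(device)
        while parent is not None and parent not in visited:
            if parent in alerting:
                label = "symptom"
                break
            visited.add(parent)  # guards against loops in the input map
            parent = upstream_of.get(parent)
        labels[device] = label
    return labels


# Hypothetical tree-shaped topology: core -> dist-1 -> access-1, access-2.
upstream = {
    "core": None,
    "dist-1": "core",
    "access-1": "dist-1",
    "access-2": "dist-1",
}
alerting_devices = {"dist-1", "access-1", "access-2"}

print(classify_events(upstream, alerting_devices))
```

In this sample, dist-1 is labeled as the root cause, and the events on access-1 and access-2 are labeled as symptoms because they sit behind dist-1.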