Cloud system overview

Learn about the capabilities of IBM® Netcool® Operations Insight® on Red Hat® OpenShift®.

Netcool Operations Insight on Red Hat OpenShift is an AI-powered operations management solution that assures the availability of applications, services, and network infrastructure across local, cloud, and hybrid environments by identifying actual and potential service degradations and outages. Netcool Operations Insight on OpenShift uses cognitive analysis of real-time and historical event data from diverse sources to consolidate events into a filtered subset of actionable incidents with a probable cause. Integrated service and topology management provides current and historical topological context for events and incidents, and incident management and runbook automations expedite incident resolution.

Service and topology management

Service and topology management enables the real-time and historical visualization of highly dynamic and distributed infrastructure and services.

Many observer integrations are available to obtain topology and state information from a multitude of disparate sources. These observers are easily configured and run from a provided configuration UI, or through APIs. The information that is collected by the observers is used to build a dynamic topological representation, which can be viewed in the Topology Viewer.

You can query the built topology and display a topological view of a chosen resource, including its relationships within a configurable number of hops, its properties, and its state. A topology can be viewed dynamically, so that incoming changes are shown as they arrive, or incoming changes can be paused and viewed on demand. The history timeline can be used to view any resource in the topology and the changes that occurred to its relationships, properties, and state in a defined time window.
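The "configurable number of hops" idea can be illustrated with a breadth-first search over an adjacency map. This is only a minimal sketch: the `neighbourhood` function, the dictionary-based `topology`, and the resource names are hypothetical and are not part of the product's API.

```python
from collections import deque


def neighbourhood(adjacency, seed, max_hops):
    """Return {resource: hop_distance} for every resource within max_hops of seed."""
    distances = {seed: 0}
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        # Do not expand resources that already sit at the hop limit.
        if distances[current] == max_hops:
            continue
        for neighbour in adjacency.get(current, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)
    return distances


# Hypothetical topology: an application, its pods, their nodes, and a shared switch.
topology = {
    "app": ["pod-1", "pod-2"],
    "pod-1": ["app", "node-a"],
    "pod-2": ["app", "node-b"],
    "node-a": ["pod-1", "switch-1"],
    "node-b": ["pod-2", "switch-1"],
    "switch-1": ["node-a", "node-b"],
}

print(neighbourhood(topology, "app", 2))
```

With a limit of two hops from `app`, the pods and nodes are included but `switch-1`, which is three hops away, is not.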

Event management

Incoming events that share the same predefined attributes are treated as related events and are correlated into an incident. The incident priority is determined by the highest-severity event that the incident contains. If an event occurs multiple times (the resource bundle and eventType are the same), then deduplication adds only one of these events to the owning incident, and increments the count for this event.
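The correlation, deduplication, and priority rules described above can be sketched roughly as follows. The `Event` fields, the choice of `service` as the correlation attribute, and the convention that a larger number means a higher severity are all assumptions made for illustration; they do not reflect the product's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    resource: str
    event_type: str
    severity: int   # assumption: larger number = more severe
    service: str    # hypothetical attribute used for correlation


@dataclass
class Incident:
    # Maps a deduplication key to [first event seen, occurrence count].
    events: dict = field(default_factory=dict)

    @property
    def priority(self):
        # The incident takes the severity of its most severe event.
        return max(event.severity for event, _count in self.events.values())


def correlate(events, correlation_keys=("service",)):
    """Group events into incidents and deduplicate repeats within each incident."""
    incidents = {}
    for event in events:
        incident_key = tuple(getattr(event, key) for key in correlation_keys)
        incident = incidents.setdefault(incident_key, Incident())
        dedup_key = (event.resource, event.event_type)
        if dedup_key in incident.events:
            incident.events[dedup_key][1] += 1   # repeat: only bump the count
        else:
            incident.events[dedup_key] = [event, 1]
    return incidents


sample = [
    Event("db-1", "cpu_high", 3, "billing"),
    Event("db-1", "cpu_high", 3, "billing"),   # duplicate of the first event
    Event("web-1", "latency", 5, "billing"),
    Event("cache-1", "down", 4, "search"),
]
incidents = correlate(sample)
billing = incidents[("billing",)]
print(len(incidents), billing.priority, billing.events[("db-1", "cpu_high")][1])
```

In this sample, the two `billing` events on `db-1` collapse into one entry with a count of 2, and the `billing` incident takes priority 5 from its most severe event.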

You can create event policies that perform actions against events, such as enriching events with additional information, suppressing events under specific conditions, or assigning runbooks to events to aid resolution. Incident policies can be created to assign incidents to specified groups automatically, notify users, or escalate incidents that do not have an investigation in progress after a configured time.
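As a rough illustration of the escalation condition, the check below flags incidents that have waited longer than a configured time without an investigation in progress. The dictionary fields (`id`, `state`, `created_at`) and the `Unassigned` state name are hypothetical; only the In Progress and Resolved state names come from this page.

```python
from datetime import datetime, timedelta


def needs_escalation(incident, now, max_wait):
    """True if the incident has waited longer than max_wait with no investigation."""
    # Incidents that are being investigated or already resolved are never escalated.
    if incident["state"] in ("In Progress", "Resolved"):
        return False
    return now - incident["created_at"] > max_wait


now = datetime(2024, 1, 1, 12, 0)
open_incidents = [
    {"id": "inc-1", "state": "Unassigned", "created_at": datetime(2024, 1, 1, 10, 0)},
    {"id": "inc-2", "state": "In Progress", "created_at": datetime(2024, 1, 1, 9, 0)},
    {"id": "inc-3", "state": "Unassigned", "created_at": datetime(2024, 1, 1, 11, 45)},
]

to_escalate = [
    incident["id"]
    for incident in open_incidents
    if needs_escalation(incident, now, timedelta(minutes=30))
]
print(to_escalate)
```

Only `inc-1` is selected: `inc-2` is already being investigated, and `inc-3` has not yet waited 30 minutes.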

Cloud native analytics

Historical and live event data is analyzed to identify patterns and correlations, and policies are then suggested that can be used to group events together into incidents. Policies can be auto-deployed, or can be set to require manual review first. Scheduled training runs ensure that grouping policies maintain their relevance to the stream of incoming events.

Events are grouped by the following:

  • Seasonality - events that occur at a particular time.
  • Temporal grouping - events that are related because they usually occur within a short time of each other.
  • Temporal patterns - events that match a temporal pattern. A temporal pattern is a pattern of behavior shared by temporal groups that are similar but occur on different resources.
  • Topological correlation - events that occur on resources that are topologically related, or on a defined part of the topology.
  • Scope-based correlation - events that are grouped by a user-defined scope-based policy, which groups events that have a common attribute, such as a particular resource or sub topology, and a specific time window.
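Of the grouping methods listed above, scope-based correlation is the simplest to sketch: events that share a common attribute are grouped when they fall inside the same time window. The following is a minimal illustration under stated assumptions, not the product's algorithm. The `group_by_scope` function, the dictionary event shape, and the choice to measure the window from the first event in each group are all hypothetical.

```python
def group_by_scope(events, scope_key, window_seconds):
    """Group events that share scope_key and fall within window_seconds of the
    first event in their group. Each event is a dict with a numeric 'timestamp'."""
    groups = []
    latest_group = {}  # scope value -> most recently opened group for that scope
    for event in sorted(events, key=lambda e: e["timestamp"]):
        scope = event[scope_key]
        group = latest_group.get(scope)
        # Open a new group if none exists yet or the window has elapsed.
        if group is None or event["timestamp"] - group[0]["timestamp"] > window_seconds:
            group = []
            groups.append(group)
            latest_group[scope] = group
        group.append(event)
    return groups


events = [
    {"timestamp": 0, "cluster": "a", "summary": "disk full"},
    {"timestamp": 10, "cluster": "b", "summary": "link down"},
    {"timestamp": 30, "cluster": "a", "summary": "db write failed"},
    {"timestamp": 200, "cluster": "a", "summary": "disk full"},
]

for group in group_by_scope(events, scope_key="cluster", window_seconds=60):
    print([event["summary"] for event in group])
```

Here the first two events on cluster `a` are grouped together, while the third arrives after the 60-second window and starts a new group.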

Deployed policies automatically group incoming events together into incidents where they match the conditions of the policy, reducing noise and presenting actionable incidents in the Alerts page. These incidents, which are composed of events that the user can examine individually, present a holistic view of the problem instead of a much larger volume of isolated single events.

Cloud native analytics generates a heartbeat event to self-monitor the health of its own services.

Probable cause

On the Alerts page, a weighted probable cause is shown for each of the events in an incident to help identify which event has the greatest probability of being the cause. Probable cause ratings are calculated for each of the events in the incident by using text classification and topological information. The way that probable cause ratings are calculated is configurable.

Topology analytics

Events that have an associated resource in the topology are enriched with topological information, and the Alerts page indicates when an event has an associated topology that you can launch.

This dynamic topology mapping provides topological context when investigating an incident. Operators can drill down into an incident's topology and see a timeline of recent changes on the event's associated topological resource, which helps them identify and resolve the cause of the incident faster.

Incident management

The Incidents page displays all of the current incidents, and can be filtered to show only incidents that are assigned to a group or to the current user. You can add events to an incident, assign it to an operator, change its state (for example, to In Progress or Resolved), view the events in the incident, view a timeline of the incident's history, and see suggested runbooks.

Runbook automation

You can create and manage runbooks that provide full and partial automation of common operations procedures. When an incident is identified, AI models match the incident with previous similar incidents and their successful resolution actions, and suggest a runbook automation that can be used to resolve the issue. The runbook automations use tested and trusted procedures from similar incidents to provide a fast, reliable, and traceable resolution.

Search (Humio)

An integration with Humio can be configured to enable searching for events and topological resources in logs. Humio can also be used to search logs and create alerts if the specified search criteria are matched.

Anomaly detection and incident avoidance

Performance metrics are ingested from multiple sources and analyzed to model the normal operational range of each metric within its environment. Analytics build mathematical models that help detect and forecast anomalous metric values, and generate events so that incidents can be proactively avoided.
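One very simple way to model a "normal operational range" is a band of k standard deviations around the historical mean. The sketch below uses that technique purely for illustration. The product's actual models are not described here, and the function names, the threshold `k`, and the sample values are assumptions.

```python
from statistics import mean, stdev


def normal_range(history, k=3.0):
    """Return (low, high) bounds as mean +/- k sample standard deviations."""
    centre = mean(history)
    spread = stdev(history)
    return centre - k * spread, centre + k * spread


def detect_anomalies(history, new_values, k=3.0):
    """Return (index, value) pairs for new values outside the learned range."""
    low, high = normal_range(history, k)
    return [
        (index, value)
        for index, value in enumerate(new_values)
        if not low <= value <= high
    ]


# Hypothetical CPU utilisation history (percent) and newly ingested samples.
cpu_history = [50, 52, 48, 51, 49, 50, 53, 47]
cpu_new = [51, 70, 45, 30]

print(detect_anomalies(cpu_history, cpu_new))
```

For this history the mean is 50 and the sample standard deviation is 2, so the band is 44 to 56 and the values 70 and 30 are flagged. A production model would typically also account for trend and seasonality, which this sketch ignores.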