Configure aggregation settings

Learn how to configure the type of time windows, their duration, and the timing mode for your cloud or hybrid deployment.

Four types of grouping are available in IBM® Netcool® Operations Insight® on Red Hat® OpenShift®: temporal grouping, temporal patterns, topological, and scope-based event grouping.

Window types

In addition to matching the criteria for the policy, events are grouped if they come in during a window of time. How this window is determined depends upon the type configured. Two types of event windows are possible:
  • fixedFromFirst - The window duration that is defined in this configuration is the length of the window where events are grouped together. This time window is determined as a fixed length from the first event. As new events come in, within the window of time, they are added to the group. Any events that occur after the end of this window are not included in the group.
  • rolling - The window duration for rolling is called the quiet period. It represents the amount of time after the last event when the group membership is closed for new events. With a rolling window, the end time of the window is based on the last event in the group. So, as events keep coming in, the end time of the window shifts.

All grouping types can be configured for fixed time windows instead of rolling time windows, except for super-grouping, which supports only rolling time windows. For fixed time windows, events that occur within the fixed time window are grouped. When events occur after the fixed time window, a new group is created. Any single event that occurs outside the fixed time window does not get added to the group.

For rolling windows, the window duration is referred to as the "quiet period". The group membership closes when the span of time, which is defined by the window, passes without any additional events coming in. The end time of the rolling window is based on the last event to be added to the group. As events come in, they keep getting added to the group and the end time of the window keeps rolling.
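
The difference between the two window types can be illustrated with a short sketch. It is not the product's implementation: the function name window_end is hypothetical, and treating an event that arrives exactly at the end time as still inside the window is an assumption.

```python
from datetime import datetime, timedelta


def window_end(event_times, window_type, duration_seconds):
    """Return the time at which a group stops accepting new events.

    event_times: datetimes of events that already match the policy.
    """
    if not event_times:
        raise ValueError("at least one event time is required")
    duration = timedelta(seconds=duration_seconds)
    ordered = sorted(event_times)
    end = ordered[0] + duration  # both window types start from the first event
    if window_type == "fixedFromFirst":
        return end  # later events never move the end time
    if window_type == "rolling":
        for t in ordered[1:]:
            if t > end:  # the quiet period already elapsed, the group is closed
                break
            end = t + duration  # each accepted event pushes the end time out
        return end
    raise ValueError(f"unknown window type: {window_type}")


start = datetime(2024, 1, 1, 12, 0, 0)
events = [start, start + timedelta(seconds=200), start + timedelta(seconds=400)]
print(window_end(events, "fixedFromFirst", 300))  # 2024-01-01 12:05:00
print(window_end(events, "rolling", 300))         # 2024-01-01 12:11:40
```

With a 300-second duration, the fixed window closes five minutes after the first event, while the rolling window closes 300 seconds after the last event that arrived in time.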

Supergroups and rolling windows

A supergroup is created when events match multiple cloud native event analytics policies. Every event in the supergroup can match every policy, or different events can match different subsets of the policies, but the sets of matched policies must overlap.

In the following example, a supergroup with all six events is created if all the following matches occur.
  • Events 1 and 2 match policies a, b, and c in a valid time window.
  • Events 3 and 4 match policy a in an overlapping valid time window.
  • Events 5 and 6 match policy c in an overlapping valid time window.
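
The overlap rule in this example can be sketched as follows. This is a simplified illustration only: the function name supergroups is hypothetical, and the sketch ignores time windows entirely, assuming that all events already fall within overlapping valid windows.

```python
def supergroups(event_policies):
    """Merge events that are linked, directly or transitively, by a shared policy.

    event_policies: mapping of event identifier to the set of policies it matches.
    """
    groups = []  # list of (events, policies) pairs, pairwise policy-disjoint
    for event, policies in event_policies.items():
        events, matched = {event}, set(policies)
        remaining = []
        for group_events, group_policies in groups:
            if group_policies & matched:  # shared policy: merge into one group
                events |= group_events
                matched |= group_policies
            else:
                remaining.append((group_events, group_policies))
        remaining.append((events, matched))
        groups = remaining
    return [sorted(group_events) for group_events, _ in groups]


matches = {
    1: {"a", "b", "c"}, 2: {"a", "b", "c"},
    3: {"a"}, 4: {"a"},
    5: {"c"}, 6: {"c"},
}
print(supergroups(matches))  # [[1, 2, 3, 4, 5, 6]]
```

Events 3 and 4 share only policy a with events 1 and 2, and events 5 and 6 share only policy c, yet all six end up together because events 1 and 2 bridge the two policies.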

The rolling time window for supergroups is used to determine the amount of data that the collator keeps in memory for groups. For supergroups, the time windows for the constituent groups determine when the supergroup ends. If all constituent groups within a supergroup have ended, no further groups are added to the supergroup, even if they start within the rolling time window of the supergroup.

If constituent groups are still open, the rolling time window for supergroups has no effect. If scope or temporal groups keep being created with overlapping time windows, the supergroup stays active and will stay open beyond its time window setting. However, if the scope or temporal groups expire, new groups are not added to the supergroup, even if they come in within the supergroup time window setting.

A temporal group starts with the second event that matches the temporal policy. Thereafter, the temporal group lives for its time window setting (either rolling or fixedFromFirst). Scope-based groups start with the second event that matches the scope and likewise live for their time window setting (either rolling or fixedFromFirst). Alternatively, if the scope-based event grouping (SBEG) policy that is used to populate the ScopeID parameter has a time period setting, the scope group lives for the QuietPeriod value that is set on the event. In this case, the ScopeID field is prefixed with FX:.

Exception: Where a relevant field is updated on an event in an old group, which makes it match a new group, the events for the new group are added to the old supergroup.

Note: Set the rolling time window for supergroups as short as possible, but longer than the time window for any global, policy, or event-based group.

Event timing

In addition to window type and duration, you can also set the timingMode parameter to explicit.

The time of an event is determined from when the event is seen by the cloud native analytics pods. Specifically, the time of an event is set as the time that the inference service looks at the event and determines whether there is a valid policy for the event. This time is called the policyTimeStamp. The deduplication pod bases the grouping on the policyTimeStamp value. However, small delays in the system might cause the inference service and deduplication pods to see events out of sequence. This situation might happen if the gateway sends the events out of order, or if the cloud native analytics pods are scaled and multiple pods are working on the events.

To overcome timing delays and out-of-sequence events, timingMode can be set to explicit. When timingMode is explicit, the deduplication pod holds the events for a small time period before they are processed. The value of the holdOffSeconds parameter is the amount of time that the deduplication pod waits for late, out-of-sequence events. By default, the holdOffSeconds value is 60 seconds, but you can change the value with the API. A value larger than 600 seconds is not recommended. Set the holdOffSeconds value as low as possible while still allowing for a reasonable delay between two events that should be grouped together as they travel from the ObjectServer to the pods. Investigate delays longer than 10 minutes to resolve the root cause.

When you set timingMode to explicit, group formation is delayed by the holdOffSeconds value. There is a tradeoff between reducing out-of-sequence events and delaying the formation of groups.

By holding the events for a small time period before they are processed, the deduplication pod can correctly group the events. This grouping is done based on the policyTimeStamp value and not the FirstOccurrence value. However, the FirstOccurrence value is relevant because the deduplication pod uses the FirstOccurrence value to determine whether it can hold an event.

The deduplication pod holds an event only if it is not already late. For example, the pod delays the processing of events that it receives, until the holdOffSeconds value expires. The group is started only at the end of the holdOffSeconds period. If an event is already older than the holdOffSeconds value (based on FirstOccurrence), then it cannot be held, because it is already old. These events are processed immediately. For example, if an event comes in with a policyTimeStamp of 16:30, and FirstOccurrence of 16:10, then the deduplication pod processes it immediately, because 20 minutes is more than the holdOffSeconds value. The deduplication pod can only hold events for up to the holdOffSeconds number of seconds after the FirstOccurrence time.

For fresh deployments of version 1.6.10 and later, the default value for timingMode is explicit and the default value for holdOffSeconds is 60, which are the recommended settings.

Consider the following example, which assumes a five-minute fixed window, with explicit timing, and a holdOffSeconds value of 60 seconds. If event1 with FirstOccurrence of time 18:05 gets to the deduplication pod at time 18:05, the deduplication pod holds the event until time 18:06. If event2, with FirstOccurrence of time 18:02 gets to the deduplication pod at time 18:10, then it is not delayed. The events are grouped together in the group that starts at time 18:06 because the 18:06 and 18:10 times are within a five-minute window of each other.
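
The hold rule and the examples above can be checked with a small sketch. The function name hold_until is hypothetical, and the sketch models only the decision described in the text (hold until FirstOccurrence plus holdOffSeconds, or process immediately if that time has already passed); the internal behavior of the deduplication pod may differ.

```python
from datetime import datetime, timedelta


def hold_until(first_occurrence, received_at, hold_off_seconds=60):
    """Return the time until which an event is held, or None if it is processed immediately."""
    deadline = first_occurrence + timedelta(seconds=hold_off_seconds)
    if received_at >= deadline:
        return None  # already older than holdOffSeconds, so it cannot be held
    return deadline


day = datetime(2024, 1, 1)
# event1: FirstOccurrence 18:05, received 18:05 -> held until 18:06
print(hold_until(day.replace(hour=18, minute=5), day.replace(hour=18, minute=5)))
# event2: FirstOccurrence 18:02, received 18:10 -> None (not delayed)
print(hold_until(day.replace(hour=18, minute=2), day.replace(hour=18, minute=10)))
# FirstOccurrence 16:10, policyTimeStamp 16:30 -> None (processed immediately)
print(hold_until(day.replace(hour=16, minute=10), day.replace(hour=16, minute=30)))
```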

In addition to the timingMode parameter, another parameter to consider is the useFirstOccurrenceLatenessThreshold parameter. When the useFirstOccurrenceLatenessThreshold value is set to true, the deduplication pod ignores events for which the FirstOccurrence timestamp is earlier than the current time minus the QuietPeriod value. For example, when useFirstOccurrenceLatenessThreshold is set to true and the QuietPeriod is five minutes, events with a FirstOccurrence timestamp older than five minutes ago are ignored. The QuietPeriod value is the window length. For scope-based windows, the QuietPeriod value can be set at the event level, or defaulted from the global or policy settings.
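
As a rough illustration of the lateness check, the following sketch compares FirstOccurrence with the current time minus the QuietPeriod. The function name is_ignored_as_late and its parameters are hypothetical and are not part of the product API.

```python
from datetime import datetime, timedelta


def is_ignored_as_late(first_occurrence, now, quiet_period_seconds, use_threshold=True):
    """Return True if the event would be ignored under useFirstOccurrenceLatenessThreshold."""
    if not use_threshold:
        return False  # threshold disabled: old events are still considered
    return first_occurrence < now - timedelta(seconds=quiet_period_seconds)


now = datetime(2024, 1, 1, 12, 0, 0)
six_minutes_old = now - timedelta(minutes=6)
four_minutes_old = now - timedelta(minutes=4)
print(is_ignored_as_late(six_minutes_old, now, 300))   # True
print(is_ignored_as_late(four_minutes_old, now, 300))  # False
```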

For fresh deployments of version 1.6.10 and later, the default value for useFirstOccurrenceLatenessThreshold is true, which is the recommended setting. For installations of version 1.6.9 and earlier, and for upgrades from version 1.6.9 and earlier to version 1.6.10, the useFirstOccurrenceLatenessThreshold value is set to false.

Sometimes, a probe can be disconnected from the system for a time, and its alarms arrive late. Probes go into store-and-forward mode if they can't connect to an ObjectServer. After the connection is restored, the probes replay events in chronological order. In this way, old events can be fed into the deduplication pod. To avoid grouping these old events, set useFirstOccurrenceLatenessThreshold to true.

The recommended timing settings for groupAggregationConfiguration are as follows:
    "timingMode": "explicit",
    "holdOffSeconds": 60,
    "useFirstOccurrenceLatenessThreshold" : true
Note: The deduplication pod caches the aggregation configuration settings and refreshes its cache every six minutes. It can take up to six minutes for changes to take effect. Restarting the deduplication pod forces the new settings to be read immediately.

Setting global aggregation defaults

The default window type is a rolling one, with a default duration of 1200 seconds. You can change the window type and the duration at a global level. However, a rolling window is the only supported window type for super-group aggregation. The global configuration can be changed in Swagger, by using the CNEA Aggregation Configuration API.

Enable the Swagger API for global aggregation: Edit the deployment for the ibm-hdm-analytics-dev-normalizer-aggregationservice service to enable the Swagger API as follows.
Note: If you make a change to the normalizer service, the deduplication service must be restarted.
  1. Run the following command:
    oc edit deploy $(oc get deploy|grep normalizer-agg|awk '{print $1}')
  2. Under the containers env section add the following lines:
    - name: ENABLE_SWAGGER_UI
      value: "1"
    
  3. Save the deployment.
  4. Next, create an external route to the normalizer-aggregation service. This step can be done either in the Red Hat OpenShift Container Platform Console under Networking > Routes > Create Route or from the command line. In either case, use the sample YAML file, and modify according to your setup. The YAML is then either pasted into the Create Route YAML page, or you create a file and apply the YAML by using the oc create -f command.
    Sample YAML:
    apiVersion: route.openshift.io/v1
    kind: Route
    metadata:
      name: <NOI-release-name>-agg-norm-api
      namespace: <NOI-namespace>
    spec:
      host: normalizer-aggregationservice-<release name>.apps.<FQDN>
      path: /api/aggregation/
      port:
       targetPort: 5600
      tls:
       termination: edge
      to:
       kind: Service
       name: <NOI-release-name>-ibm-hdm-analytics-dev-normalizer-aggregationservice
       weight: 100
      wildcardPolicy: None
    
    Where:
    • Release name is the name that is given to your IBM Netcool Operations Insight on Red Hat OpenShift deployment.
    • FQDN is the hostname where IBM Netcool Operations Insight on Red Hat OpenShift and its UI are running. This name can be found by running the following command:
      oc get route | grep common-ui | awk '{print $2}'
      Example output:
      netcool-<release name>.apps.<FQDN>
    • The service name of the existing normalizer aggregation service is retrieved by running the following command:
      oc get service | grep normalizer-aggregationservice
    • The Swagger UI is accessible from a browser by using a URL in the following format:
      https://normalizer-aggregationservice-<release name>.apps.<FQDN>/api/aggregation/docs/aggconfig/v1/
    Note: Authorization for the API, when configuring at the global level, is obtained from NOI systemauth-secret in the NOI namespace.
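
To avoid typing mistakes when assembling the route host and the Swagger UI address, the formats above can be filled in with a small helper. This is a convenience sketch only: the function names are hypothetical, and the example release name and FQDN are placeholders, not real values.

```python
def aggregation_route_host(release_name, fqdn):
    """Build the route host in the format used by the sample YAML."""
    return f"normalizer-aggregationservice-{release_name}.apps.{fqdn}"


def aggregation_swagger_url(release_name, fqdn):
    """Build the Swagger UI URL for the global aggregation configuration API."""
    host = aggregation_route_host(release_name, fqdn)
    return f"https://{host}/api/aggregation/docs/aggconfig/v1/"


# "noi" and "example.com" are placeholder values for illustration.
print(aggregation_swagger_url("noi", "example.com"))
```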

Set global defaults: Group and super-group aggregation global defaults can be set with the following JSON examples. In the groupAggregationConfiguration section, you can change the windowType to fixedFromFirst or leave it as rolling, and change the duration.

For groupAggregationConfiguration, you can also set the useFirstOccurrenceLatenessThreshold.
{
  "groupAggregationConfiguration": {
    "windowType": "fixedFromFirst",
    "durationSeconds": 300,
    "timingMode": "explicit",
    "holdOffSeconds": 60,
    "useFirstOccurrenceLatenessThreshold": true
  },
  "supergroupAggregationConfiguration": {
    "windowType": "rolling",
    "durationSeconds": 1200
  },
  "groupFinalisationConfiguration": {
    "enabled": false,
    "durationSeconds": 0
  },
  "additionalProp1": {}
}
Note: The useFirstOccurrenceLatenessThreshold parameter takes effect, regardless of the timing mode.
Note: Do not edit the groupFinalisationConfiguration parameters.
For supergroupAggregationConfiguration, you can change the duration, but only a rolling window type is supported for super-grouping.
{
  "groupAggregationConfiguration": {
    "windowType": "rolling",
    "durationSeconds": 1200,
    "timingMode": "explicit",
    "holdOffSeconds": 60,
    "useFirstOccurrenceLatenessThreshold": true
  },
  "supergroupAggregationConfiguration": {
    "windowType": "rolling",
    "durationSeconds": 1200
  },
  "groupFinalisationConfiguration": {
    "enabled": false,
    "durationSeconds": 0
  }
}

For more information, see CNEA Aggregation Configuration API.
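
Before you submit a global configuration, it can help to check it against the constraints described in this topic. The following sketch is a local sanity check, not part of the API: the function name check_aggregation_config is hypothetical, and it covers only the window types, the rolling-only rule for super-grouping, and the 600-second recommendation for holdOffSeconds.

```python
def check_aggregation_config(config):
    """Return a list of problems found in a global aggregation configuration."""
    problems = []
    group = config.get("groupAggregationConfiguration", {})
    supergroup = config.get("supergroupAggregationConfiguration", {})

    # The default window type is rolling, so a missing value is treated as rolling.
    if group.get("windowType", "rolling") not in ("rolling", "fixedFromFirst"):
        problems.append("group windowType must be rolling or fixedFromFirst")
    if supergroup.get("windowType", "rolling") != "rolling":
        problems.append("super-grouping supports only the rolling window type")
    if group.get("holdOffSeconds", 60) > 600:
        problems.append("holdOffSeconds larger than 600 is not recommended")
    return problems


config = {
    "groupAggregationConfiguration": {
        "windowType": "fixedFromFirst",
        "durationSeconds": 300,
        "timingMode": "explicit",
        "holdOffSeconds": 60,
        "useFirstOccurrenceLatenessThreshold": True,
    },
    "supergroupAggregationConfiguration": {"windowType": "rolling", "durationSeconds": 1200},
}
print(check_aggregation_config(config))  # []
```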

Configuring aggregation at a policy level

Policy level aggregation overrides default or global aggregation settings. To set the group aggregation configuration at a system policy or user policy level, enable the Swagger UI for the policy registry service.

Enable the Swagger API for policy aggregation: Edit the deployment for the ibm-hdm-analytics-dev-policyregistryservice service to enable the Swagger API as follows:
  1. Run the following command:
    oc edit deploy $(oc get deploy|grep policyregistryservice|awk '{print $1}')
  2. Under containers env settings, add the following code:
    - name: ENABLE_SWAGGER_UI
      value: "1"
    Note: Three containers are defined in the policy registry service deployment YAML. The environment variable must be added to the environment section of the container named <release name>-ibm-hdm-analytics-dev-policyregistryservice.
  3. Search for the route for the microservice by running the following command:
    oc get route|grep policyregistryservice
    Using the route results you can construct the URL for accessing the Swagger UI in the following format:
    netcool-<release name>.apps.<FQDN>

    For user policies:

     https://<route>/api/policies/docs/policies/user/v1/ 

    For system policies:

     https://<route>/api/policies/docs/policies/system/v1/docs-api 
    Note: The Swagger UI at the policy level uses a different authentication method from the global section. An API key is used and this key must be generated. This step is done from the GUI under Administration > Integrations with other systems > API Keys > Generate API key. Ensure that the Policy User API is selected when generating the API key.

To change the window type or duration, you need to add the groupAggregationConfiguration to the metadata section of the policy. The steps are the same for both system and user policies.

To specify fixedFromFirst as the window type, add the following JSON to the policy metadata section and set the duration of the fixed window to the desired length (see Update the policy):

"groupAggregationConfiguration": {
      "windowType": "fixedFromFirst",
      "durationSeconds": 60
    }

To specify rolling as the window type, add the following JSON to the policy metadata section and set the duration of the quiet period to the desired length (see Update the policy):

"groupAggregationConfiguration": {
      "windowType": "rolling",
      "durationSeconds": 60
    }

Update the policy:

  1. Perform the GET operation of the policy to get the entire JSON for the policy.
  2. Copy the JSON to the PUT command section.
  3. Add the sample JSON for the window type to the policy and perform the PUT operation.
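
If you script the update instead of editing the JSON by hand in Swagger, the modification between the GET and the PUT can be sketched as follows. The function name with_group_aggregation is hypothetical, the sketch makes no API calls, and it assumes that the policy JSON from the GET operation has already been parsed into a dictionary.

```python
import copy


def with_group_aggregation(policy, window_type, duration_seconds):
    """Return a copy of the policy with groupAggregationConfiguration set in its metadata."""
    if window_type not in ("rolling", "fixedFromFirst"):
        raise ValueError(f"unsupported window type: {window_type}")
    updated = copy.deepcopy(policy)  # leave the fetched policy untouched
    updated.setdefault("metadata", {})["groupAggregationConfiguration"] = {
        "windowType": window_type,
        "durationSeconds": duration_seconds,
    }
    return updated


# A trimmed-down policy for illustration; a real policy has many more fields.
policy = {"metadata": {"name": "Temporal-patterns-policy-sept21"}}
updated = with_group_aggregation(policy, "fixedFromFirst", 60)
print(updated["metadata"]["groupAggregationConfiguration"])
```

The returned dictionary is what you would serialize back to JSON for the PUT operation.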
Example policy metadata section before addition of policy aggregation:
"metadata": { "createdBy": { "entityType": "analytics", "entityId": "system", "entityMetadata": { "trainingTimestamp": "2022-09-21T06:11:40Z" } }, 
"lastUpdatedBy": { "entityType": "analytics", "entityId": "system" }, "lastUpdated": "2022-09-21T06:11:47.502Z", "created": "2022-09-21T06:11:47.502Z", 
"statedata": { "locked": false, "state": "active", "userId": "icpadmin", "timestamp": 1663742017996 }, "model": { "trainingTimestamp": 1663742017996 }, 
"name": "Temporal-patterns-policy-sept21" }
Example policy metadata section after addition of policy aggregation:
"metadata": { "createdBy": { "entityType": "analytics", "entityId": "system", "entityMetadata": { "trainingTimestamp": "2022-09-21T06:11:40Z" } }, 
"lastUpdatedBy": { "entityType": "analytics", "entityId": "system" }, "lastUpdated": "2022-09-21T06:11:47.502Z", "created": "2022-09-21T06:11:47.502Z", 
"statedata": { "locked": false, "state": "active", "userId": "icpadmin", "timestamp": 1663742017996 }, "model": { "trainingTimestamp": 1663742017996 }, 
"name": "Temporal-patterns-policy-sept21", "groupAggregationConfiguration": { "windowType": "fixedFromFirst", "durationSeconds": 60 } }

Event level aggregation

For scope-based grouping, you can define, at the event level, a configuration for fixed time windows. Configure the fixed time windows with the scope-based grouping policy that is used to populate the ScopeID, by specifying a QuietPeriod and selecting the Use fixed time window option. The ScopeID value is prefixed with FX: and the QuietPeriod field for the event will contain the window length. This configuration allows the aggregation to identify events that have a fixed time window. If the ScopeID has no FX: prefix, global or policy level configurations are used. Without the FX: prefix, the window duration is taken from the configured window duration, which is set at a global or policy level.
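
The precedence between event, policy, and global settings can be summarized in a short sketch. The function name effective_window and its parameters are hypothetical; the fallback of a rolling window with a duration of 1200 seconds is taken from the global default described earlier in this topic.

```python
def effective_window(scope_id, quiet_period, policy_config=None, global_config=None):
    """Pick the window settings that apply to a scope-based event."""
    if scope_id.startswith("FX:"):
        # Event level: fixed window whose length is the QuietPeriod on the event.
        return {"windowType": "fixedFromFirst", "durationSeconds": quiet_period}
    if policy_config:
        return dict(policy_config)  # policy level overrides global settings
    if global_config:
        return dict(global_config)
    return {"windowType": "rolling", "durationSeconds": 1200}  # documented default


print(effective_window("FX:site-a", 120))
print(effective_window("site-a", 120, policy_config={"windowType": "rolling", "durationSeconds": 60}))
```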