Managing thresholds

Thresholds test for resource issues such as slow response time. When the conditions of a threshold are true, an event is opened and an incident is generated. You can create, edit, delete, enable, or disable thresholds.

Viewing thresholds

  1. Go to Monitor health > Infrastructure monitoring > Administration and click Thresholds on the IBM Cloud Pak console. A list of defined thresholds is displayed:

    • Name is the title given to the threshold when it was saved. Click the name to view and change the definition in the threshold editor.
    • Severity is the severity that was chosen for the threshold, such as warning or indeterminate.
    • Assigned to is the resource type that the threshold is defined to monitor, such as Linux Systems or Kubernetes Service.
    • Permissions are either "Read-only" or "Editable". Read-only thresholds are predefined or imported from integrated sources and cannot be changed. Editable thresholds were created by a member of your team and you have full editing capability.
    • State is "Enabled" for thresholds that are operational, which means they are monitoring the resources that they were assigned to. State is "Disabled" when Enable has been turned off and the threshold is nonoperational.
  2. If you're looking for a threshold that doesn't show on the first page of the list, use the page controls:

    • Click inside the Filter text box and type the beginning of the value to filter by. As you type, the rows that do not fit the criteria are filtered out. For example, begin typing critical to filter the list down to only those thresholds with severity Critical.
    • Select the column to sort by. The default order is by Permissions, Assigned to, and Name.
    • Select the number of thresholds to show per page: 10, 25, or 50.
    • Select the next or previous page or a specific page number.
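The filter, sort, and paging controls described above can be pictured with a small sketch. This is an illustration only, not the console's implementation: the `Row` class and the functions are hypothetical, and the case-insensitive prefix match against every column and the ascending sort are assumptions drawn from the description.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """One row of the thresholds list (hypothetical structure)."""
    name: str
    severity: str
    assigned_to: str
    permissions: str
    state: str


def filter_rows(rows, text):
    """Keep rows in which any column value begins with the typed text.

    Assumption: the match is case-insensitive and applies to every column.
    """
    needle = text.strip().lower()
    if not needle:
        return list(rows)
    return [
        row for row in rows
        if any(str(value).lower().startswith(needle) for value in vars(row).values())
    ]


def default_sort(rows):
    """Sort by Permissions, then Assigned to, then Name (ascending is assumed)."""
    return sorted(rows, key=lambda r: (r.permissions, r.assigned_to, r.name))


def page(rows, number, size=10):
    """Return one page of rows; the page sizes offered are 10, 25, or 50."""
    if size not in (10, 25, 50):
        raise ValueError("page size must be 10, 25, or 50")
    start = (number - 1) * size
    return rows[start:start + size]
```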

Deleting a threshold

To delete one or more thresholds, select the checkbox for each threshold and click Delete in the banner that appears. After you click OK at the confirmation prompt, the selected thresholds are permanently deleted.

Enabling or disabling a threshold

To enable or disable a threshold, click the Open and close list of options icon for the threshold and select Enable or Disable from the list.

Defining a new threshold or editing a threshold

Prerequisites

For Db2 threshold requirements, see Prerequisites for creating custom thresholds.

Procedures

Do the following steps when you define a new threshold or edit a threshold:

  1. Complete the Details section:

    1. Threshold name must start with a letter and can have up to 63 letters, numbers, and underscores.
    2. Enter a description for the threshold. The description is displayed in the Thresholds pages and in the Incident details.
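The naming rule in the Details section can be checked before you type a name into the editor. The following is a minimal sketch, assuming that the limit of 63 counts the leading letter and that only ASCII letters are accepted; `is_valid_threshold_name` is a hypothetical helper, not part of the product.

```python
import re

# A letter first, then up to 62 more letters, digits, or underscores
# (63 characters in total; the total-length reading is an assumption).
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,62}")


def is_valid_threshold_name(name: str) -> bool:
    """Return True if the whole name satisfies the pattern."""
    return NAME_PATTERN.fullmatch(name) is not None
```

For instance, `WT_LZ_user_login` passes this check, while a name that starts with a digit or contains a hyphen does not.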
  2. Complete the Threshold section:

    1. For Resource type, select the resource that you want to monitor, such as Linux Systems. Note: If the resource that you select is monitored by Unified Agent, some UA-only metrics are included in the list of metrics when you define the condition.
    2. For Severity, select critical, major, minor, warning, or indeterminate.
    3. For Consecutive samples, specify how many consecutive threshold samples must evaluate to true before an event is generated. With a setting of 1, an event is generated as soon as one sample evaluates to true; with a setting of 2, two consecutive samples must evaluate to true before an event is opened.
    4. Define the condition:

      1. Select the metric to compare from the metric list. The remaining fields vary depending on the type of metric.

        Note: If the resource that you select is monitored by Unified Agent, some UA-only metrics are included in the list of metrics when you define the condition. Type ua to quickly filter the list to the UA-only metrics.

      2. If the select field is displayed, select the relational operator: < less than, <= less than or equal, = equal, >= greater than or equal, > greater than, or != not equal.

      3. If this is a text metric, the select field has = equal, != not equal, and these relational operators:
        • missing to enter a list of text entries to compare. If none of the entries matches the data sample when the threshold is evaluated, an event is opened. When the missing relational operator is used, only the And operator is available for multiple expressions.
        • match or not match to enter a regular expression to compare. The match and not match operators look for a pattern match to the expression. An event is opened if the regular expression matches the data sample (for match) or does not match it (for not match) when the threshold is evaluated. The easier it is to match a string with the expression, the more efficient the workload at the managed resource. The expression does not need to match the entire line, only a substring. For example, in See him run you want to know whether the string contains him. You might compose the regular expression using him, but you might also use .*him.*. Or, if you are looking for See, you might enter See, or you might enter ^See to confirm that it is at the beginning of the line. Entering .* wildcards is a less efficient search and raises the workload. For more information about regular expressions, search for regex in your browser.
      4. Enter the value to compare by using the allowed format for the metric, such as 20 for 20% or 120 for 2 minutes. For example, a threshold condition of Process Percent User Time > 5% tests whether the metric sample for Process Percent User Time is greater than 5% and opens an event if the comparison is true. Note: When the resource type is Kubernetes Service and the comparison metric is Real User Latency, leave the percentile at rum_. If you change it, the comparison metric switches automatically to Latency.
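The substring behavior of the match and not match operators can be tried out with Python's `re.search`, which also succeeds when the pattern occurs anywhere in the string. This is only a sketch: the product's regular expression dialect may differ from Python's, and `opens_event` is a hypothetical helper.

```python
import re


def opens_event(operator: str, expression: str, sample: str) -> bool:
    """Decide whether a text condition would open an event for one sample.

    operator is "match" or "not match"; re.search is used because the
    expression only needs to match a substring of the sample.
    """
    found = re.search(expression, sample) is not None
    if operator == "match":
        return found
    if operator == "not match":
        return not found
    raise ValueError(f"unsupported operator: {operator}")


sample = "See him run"
# "him" and ".*him.*" find the same text; the anchored "^See" also requires
# the text to be at the beginning of the line.
results = {expr: opens_event("match", expr, sample)
           for expr in ("him", ".*him.*", "See", "^See", "^him")}
```

In this sketch every expression except `^him` matches the sample, which is why the shorter `him` is preferred over `.*him.*`: both give the same answer, but the wildcards add work.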
    5. Optional: Add another condition to the threshold:

      1. Select Add condition or Add nested condition (see Example).
      2. Leave the logical operator at AND if both the previous condition and this condition must be met for the threshold to be breached. If meeting either condition is enough for the threshold to be breached, toggle the AND button to OR.
      3. Select the metric to compare from the list.
      4. Select the relational operator: < less than, <= less than or equal, = equal, >= greater than or equal, > greater than, or != not equal.
      5. Enter the value to compare by using the allowed format for the metric.

        If you are adding multiple conditions to a threshold or adding a display item, select metrics from the same metric list. Otherwise, you might get an error message while defining the threshold.

    6. Optional: Add an aggregation condition that applies to the data that meets the defined condition:

      1. Select the aggregation metric from the list.
      2. Select average for numeric metrics, count for text metrics, or none.
      3. Select the relational operator: < less than, <= less than or equal, = equal, >= greater than or equal, > greater than, or != not equal.
      4. Enter the value of the aggregation metric.

      When the threshold has multiple conditions or an aggregation condition, the metrics available are of the same type as the one chosen for the first condition. You can continue to add conditions for a complex threshold of up to 10 conditions.
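The effect of the Consecutive samples setting in this section can be sketched as a small counter. This is an illustration under assumptions, not the product's evaluation engine: the class name is hypothetical, and the sketch assumes that a false sample resets the count and closes any open event.

```python
class ConsecutiveSampleCounter:
    """Track how many samples in a row evaluated to true."""

    def __init__(self, required: int):
        if required < 1:
            raise ValueError("required must be at least 1")
        self.required = required
        self.streak = 0
        self.event_open = False

    def observe(self, condition_true: bool) -> bool:
        """Record one sample; return True only when it causes an event to open."""
        if not condition_true:
            # Assumption: a false sample resets the streak and closes the event.
            self.streak = 0
            self.event_open = False
            return False
        self.streak += 1
        if self.streak >= self.required and not self.event_open:
            self.event_open = True
            return True
        return False
```

With a setting of 2, a single true sample followed by a false one opens nothing; the event opens on the second of two true samples in a row.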

  3. Optional: Select a Display item if one is available and you want to continue evaluating the threshold on other data sample rows. After a row evaluation causes an event to open, no more events can be opened for this threshold on the monitored resource until the event is closed. By selecting a display item, you enable the threshold to continue evaluating the other rows in the data sampling and open more events if other rows qualify. Display item is not available if the threshold includes an Aggregation condition.

    Known limitation: If you deploy the runtime data collectors in an on-premises environment, the metrics of the selected item might not be displayed when you define the threshold and select Display item. The best practice is to not select a display item for data collectors in an on-premises environment.

  4. Complete the Assignments section:

    1. Select the resources for the threshold to monitor under Resources to monitor:
      • Select All <resource_type> to apply the threshold to all resource instances of the same type, such as all Hadoop hosts.
      • Select Individual instances to see and select the resource instances. Individual instances cannot be selected for the WebSphere Applications agent nor any other agent that has subnodes.
      • Select Groups to see and select from the list of resource groups. (For more information, see Managing resource groups). If you assign the threshold to a resource group that does not include a managed resource of the threshold's resource type, a message notifies you that the threshold does not apply to any resource types in the group.
    2. Optional: Set up Until another threshold. With this function, you can configure another threshold for the same resources that you want to monitor. When the existing (first) threshold generates an event, the event remains open until the other threshold is met for those same resources and generates a new event. At that point, the event for the first threshold is closed. The Related threshold list is displayed only after you select an assignment for the threshold, that is, one of the options under Resources to monitor in step 1. The thresholds in the Related threshold list have the same resource type and assignment as the first threshold that is being created. The two thresholds must also have the same display item. If no assignment is selected in step 1, the Related threshold list is not populated and you cannot configure Until another threshold.
  5. Optional: Complete the Reflex action section if you want to execute a command when an event is opened.

    1. Enter the command to execute. In this example, two commands are run by the Linux® OS agent: The text in quotes is echoed and redirected to a log file, and the clean_logs script runs on the associated Linux OS disk (&{KLZ_Disk.Disk_Name} is replaced by the attribute value).
       echo "`date` : WT_LZ_user_login is true for &{KLZ_Disk.Disk_Name}" >>/tmp/wt.log;/scripts/clean_logs.sh &{KLZ_Disk.Disk_Name}
      
    2. Select one of these options to control how often the command is run:
      • Select On first event only if the data sample has multiple rows and you want to run the command for only the first event occurrence in the data sample. Clear the checkbox to run the command for every row that causes an event.
      • Select For every consecutive true interval to run the command every time the threshold evaluates to true. Clear the checkbox to run the command when the threshold is true, but not again until the threshold evaluates to false, followed by another true evaluation in a subsequent interval.
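The attribute substitution in the reflex command (where &{KLZ_Disk.Disk_Name} is replaced by the attribute value) can be pictured as a template expansion. The sketch below is hypothetical: `expand_command`, the placeholder pattern, and the sample value `/dev/sda1` are assumptions for illustration, and the agent's actual quoting and escaping rules are not described in this topic.

```python
import re

# Matches placeholders of the form &{Group.Attribute}; the allowed characters
# are an assumption based on the single example in this topic.
PLACEHOLDER = re.compile(r"&\{([A-Za-z0-9_]+\.[A-Za-z0-9_]+)\}")


def expand_command(template: str, attributes: dict) -> str:
    """Replace each &{Group.Attribute} with its value from the event data.

    Placeholders without a known value are left unchanged.
    """
    def replace(match):
        key = match.group(1)
        return str(attributes.get(key, match.group(0)))

    return PLACEHOLDER.sub(replace, template)


command = expand_command(
    "/scripts/clean_logs.sh &{KLZ_Disk.Disk_Name}",
    {"KLZ_Disk.Disk_Name": "/dev/sda1"},  # illustrative value only
)
```

Because the substituted value ends up in a shell command, it is worth checking how attribute values that contain spaces or shell metacharacters behave before you rely on a reflex action.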
  6. Optional: Complete the Expiration section. Set up an expiration period for an event. When the expire time ends, the event closes automatically. The format is DD-HH-MM-SS (the corresponding fields to configure are: Days, Hours, Minutes, and Seconds). The minimum expire time you can set an event to is 1 second and the maximum value is 30 days. If you do not configure an expiration period for an event using this function, then the event expires after 2 hours, which is the default expire time.
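The expiration rules above (DD-HH-MM-SS fields, a minimum of 1 second, a maximum of 30 days, and a 2-hour default) can be expressed as a short validation sketch. The function `parse_expiry` and the use of `None` to mean "not configured" are assumptions for illustration; the editor itself takes the four fields separately.

```python
from datetime import timedelta
from typing import Optional

DEFAULT_EXPIRY = timedelta(hours=2)   # applies when no period is configured
MIN_EXPIRY = timedelta(seconds=1)
MAX_EXPIRY = timedelta(days=30)


def parse_expiry(text: Optional[str]) -> timedelta:
    """Convert a DD-HH-MM-SS string into a timedelta and check its bounds."""
    if text is None:
        return DEFAULT_EXPIRY
    parts = text.split("-")
    if len(parts) != 4:
        raise ValueError("expected the format DD-HH-MM-SS")
    days, hours, minutes, seconds = (int(part) for part in parts)
    expiry = timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)
    if not MIN_EXPIRY <= expiry <= MAX_EXPIRY:
        raise ValueError("expiration must be between 1 second and 30 days")
    return expiry
```

For example, `parse_expiry("00-00-01-30")` yields 90 seconds, and `parse_expiry(None)` yields the 2-hour default.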

  7. If you don't want the threshold to begin monitoring, drag the Enable slider from On to Off.

Results

After you click Finish to save the threshold, it starts monitoring the resource instances that are assigned. If you set Enable to Off, the thresholds list shows "Disabled" in the State column.

Note: You can enrich and improve the content of the events that are generated by the thresholds by using an event policy. For more information, see Enriching event information by adding custom attributes.

Example

Nested conditions are used to support multiple conditions joined with mixed AND and OR operators. Otherwise, multiple conditions would use Boolean AND logic or Boolean OR logic, not both. To illustrate, the following threshold evaluates to true if either the process CPU is at least ½ second and the process command is named kynagent, or the process command is named klzagent:

Condition 1 Process CPU Seconds >= 0.5 seconds AND Process Command Name = kynagent
Condition 2 OR Process Command Name = klzagent

The intention, however, is for the threshold to evaluate to true if the process CPU is at least ½ second and the process command is named either kynagent or klzagent. To achieve the desired result, select Add nested condition for Condition 2:

Condition 1 Process CPU Seconds >= 0.5 seconds AND
Condition 2 (nested) Process Command Name = kynagent OR Process Command Name = klzagent
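The difference between the two groupings can be checked with ordinary Boolean expressions. This is a sketch of the logic only; the function names are hypothetical and the parentheses mirror the two threshold definitions above.

```python
def flat_threshold(cpu_seconds: float, command: str) -> bool:
    """First definition: (CPU >= 0.5 AND name = kynagent) OR name = klzagent."""
    return (cpu_seconds >= 0.5 and command == "kynagent") or command == "klzagent"


def nested_threshold(cpu_seconds: float, command: str) -> bool:
    """Nested definition: CPU >= 0.5 AND (name = kynagent OR name = klzagent)."""
    return cpu_seconds >= 0.5 and (command == "kynagent" or command == "klzagent")


# A klzagent process that uses little CPU separates the two definitions.
idle_klzagent = (0.1, "klzagent")
flat_result = flat_threshold(*idle_klzagent)      # True: the OR branch alone is enough
nested_result = nested_threshold(*idle_klzagent)  # False: the CPU test must also pass
```

Both definitions agree for a busy kynagent process; they differ only when klzagent is below the CPU limit, which is the case the nested condition is meant to exclude.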

Role-based access control (RBAC)

Users with the role of Cluster Administrator or Account Administrator have full access to thresholds. The following table lists the threshold actions that are available to each role.

Role                   Create  Modify  Define reflex action  Enable/Disable  Delete  Duplicate  View
Account Administrator  Yes     Yes     Yes                   Yes             Yes     Yes        Yes
Cluster Administrator  Yes     Yes     Yes                   Yes             Yes     Yes        Yes
Administrator          Yes     Yes     Yes                   Yes             Yes     Yes        Yes
Operator               Yes     Yes*1   No                    Yes*1           Yes*2   Yes*1      Yes*1
Editor                 Yes     Yes*1   No                    Yes*1           Yes*2   Yes*1      Yes*1
Auditor                No      No      No                    No              No      No         Yes*1
Viewer                 No      No      No                    No              No      No         Yes*1

Note:

What to do next