Managing thresholds
Thresholds test for resource issues such as slow response time. When the conditions of a threshold are true, an event is opened and an incident is generated. You can create, edit, delete, enable, or disable thresholds.
Viewing thresholds
-
Go to Monitor health > Infrastructure monitoring > Administration and click Thresholds on the IBM Cloud Pak console. A list of defined thresholds is displayed:
- Name is the title given to the threshold when it was saved. Click the name to view and change the definition in the threshold editor.
- Severity is the severity that was chosen for the threshold, such as warning or indeterminate.
- Assigned to is the resource type that the threshold is defined to monitor, such as Linux Systems or Kubernetes Service.
- Permissions are either "Read-only" or "Editable". Read-only thresholds are predefined or imported from integrated sources and cannot be changed. Editable thresholds were created by a member of your team and you have full editing capability.
- State is "Enabled" for thresholds that are operational, which means they are monitoring the resources that they were assigned to. State is "Disabled" when Enable has been turned off and the threshold is nonoperational.
-
If you're looking for a threshold that doesn't show on the first page of the list, use the page controls:
- Click inside the Filter text box and type the beginning of the value to filter by. As you type, the rows that do not fit the criteria are filtered out. For example, begin typing critical to filter the list down to only those thresholds with severity Critical.
- Select the column to sort by. The default order is by Permissions, Assigned to, and Name.
- Select the number of thresholds to show per page: 10, 25, or 50.
- Select the next or previous page or a specific page number.
Deleting a threshold
To delete one or more thresholds, select the checkbox for each threshold and click Delete in the banner that appears. After you respond to the confirmation prompt by clicking OK, the selected thresholds are permanently deleted.
Enabling or disabling a threshold
To enable or disable a threshold, click the icon and select Enable or Disable from the list.
Defining a new threshold or editing a threshold
Prerequisites
For Db2 threshold requirements, see Prerequisites for creating custom thresholds.
Procedures
- To define a new threshold, click Create, and then continue with the following steps.
- To edit a threshold definition, click the threshold name, and then continue with the following steps.
Do the following steps when you define a new threshold or edit a threshold:
-
Complete the Details section:
- Threshold name must start with a letter and can have up to 63 letters, numbers, and underscores.
- Enter a description for the threshold. The description is displayed on the Thresholds page and in the Incident details.
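The naming rule above can be checked before you type a name into the console. The following is a minimal sketch, not the console's own validation: it assumes that the 63-character limit includes the leading letter and that "letters" means ASCII letters. The function name is hypothetical.

```python
import re

# Assumption: at most 63 characters in total, ASCII letters only.
# The console might apply slightly different rules.
THRESHOLD_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,62}")


def is_valid_threshold_name(name: str) -> bool:
    """Return True if name starts with a letter and uses only allowed characters."""
    return THRESHOLD_NAME.fullmatch(name) is not None


print(is_valid_threshold_name("Linux_CPU_High"))  # True
print(is_valid_threshold_name("1st_threshold"))   # False: starts with a digit
```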
-
Complete the Threshold section:
- For Resource type, select the resource that you want to monitor, such as Linux Systems. Note: If the resource that you select is monitored by Unified Agent, you see some UA only metrics included in the list of metrics when defining the condition.
- For Severity, select critical, major, minor, warning, or indeterminate.
- For Consecutive samples, specify how many consecutive threshold samples must evaluate to true before an event is generated. With a setting of 1, an event is generated as soon as one sample evaluates to true; with a setting of 2, two consecutive samples must evaluate to true before an event is opened.
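To illustrate how the Consecutive samples setting delays an event, here is a small simulation. It is only a sketch of the counting rule described above; the function name and the list of sample results are hypothetical, and the product's evaluation engine is not shown.

```python
from typing import Iterable, Optional


def first_event_index(samples: Iterable[bool], consecutive: int) -> Optional[int]:
    """Return the 0-based index of the sample at which an event would open, or None."""
    if consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    streak = 0
    for index, breached in enumerate(samples):
        # A false sample resets the streak; a true sample extends it.
        streak = streak + 1 if breached else 0
        if streak >= consecutive:
            return index
    return None


samples = [True, False, True, True]
print(first_event_index(samples, 1))  # 0: the first true sample opens the event
print(first_event_index(samples, 2))  # 3: two true samples in a row are needed
```

A higher setting filters out short spikes at the cost of a later event.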
-
Define the condition:
-
Select the metric to compare from the metric list. The remaining fields vary depending on the type of metric.
Note: If the resource that you select is monitored by Unified Agent, you see some UA only metrics included in the list of metrics when defining the condition. Type ua to quickly filter the metrics that are UA only.
-
If the select field is displayed, select the relational operator: < less than, <= less than or equal, = equal, >= greater than or equal, > greater than, or != not equal.
- If this is a text metric, the select field has = equal, != not equal, and these relational operators:
- missing to enter a list of text entries to compare. If none of the entries matches the data sample when the threshold is evaluated, an event is opened. When the missing relational operator is used, only the And operator is available for multiple expressions.
- match or not match to enter a regular expression to compare. The match and not match operators look for a pattern match to the expression. If the regular expression matches (for match) or does not match (for not match) the data sample when the threshold is evaluated, an event is opened. The easier it is to match a string with the expression, the more efficient the workload at the managed resource. The expression does not need to match the entire line; it needs to match only the substring in the expression. For example, in "See him run" you want to know whether the string contains him. You might compose the regular expression using him, but you might also use .*him.* instead. Or, if you are looking for See, you might enter See, or you might enter ^See to confirm that it's at the beginning of the line. Entering .* wildcards is a less efficient search and raises the workload. For more information about regular expressions, search for regex in your browser.
- Enter the value to compare by using the allowed format for the metric, such as 20 for 20% or 120 for 2 minutes. For example, a threshold condition of Process Percent User Time > 5% tests if the metric sample for Process Percent User Time is greater than 5% and opens an event if the comparison is true. Note: When the resource type is Kubernetes Service and the comparison metric is Real User Latency, leave the percentile at rum_. If you change it, the comparison metric switches automatically to Latency.
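The substring behavior of the match and not match operators can be tried out with any regular expression library. The sketch below uses Python's re module as a stand-in. The regular expression dialect that the monitoring agents use is an assumption here and might differ in detail. The helper names are hypothetical.

```python
import re


def matches(pattern: str, sample: str) -> bool:
    """Sketch of the match operator: true if the pattern is found anywhere in the sample."""
    return re.search(pattern, sample) is not None


def not_matches(pattern: str, sample: str) -> bool:
    """Sketch of the not match operator: true if the pattern is not found in the sample."""
    return not matches(pattern, sample)


line = "See him run"
print(matches(r"him", line))      # True: a substring match is enough
print(matches(r".*him.*", line))  # True: same result, with extra matching work
print(matches(r"^See", line))     # True: "See" is at the beginning of the line
print(matches(r"^him", line))     # False: "him" is not at the beginning
print(not_matches(r"walk", line)) # True: "walk" does not occur in the line
```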
-
-
Optional: Add another condition to the threshold:
- Select Add condition or Add nested condition (see Example).
- Leave the logical operator at AND if the previous condition and this condition must both be met for the threshold to be breached. If either condition can be met for the threshold to be breached, toggle the button to OR.
- Select the metric to compare from the list.
- Select the relational operator: < less than, <= less than or equal, = equal, >= greater than or equal, > greater than, or != not equal.
-
Enter the value to compare by using the allowed format for the metric.
If you are adding multiple conditions to a threshold or adding a display item, select metrics from the same metric list. Otherwise, you might get an error message while defining the threshold.
-
Optional: Add an aggregation condition that applies to the data that meets the defined condition:
- Select the aggregation metric from the list.
- Select average for numeric metrics, count for text metrics, or none.
- Select the relational operator: < less than, <= less than or equal, = equal, >= greater than or equal, > greater than, or != not equal.
- Enter the value of the aggregation metric.
When the threshold has multiple conditions or an aggregation condition, the metrics available are of the same type as the one chosen for the first condition. You can continue to add conditions for a complex threshold of up to 10 conditions.
-
Optional: Select a Display item if one is available and you want to continue evaluating the threshold on other data sample rows. After a row evaluation causes an event to open, no more events can be opened for this threshold on the monitored resource until the event is closed. By selecting a display item, you enable the threshold to continue evaluating the other rows in the data sampling and open more events if other rows qualify. Display item is not available if the threshold includes an Aggregation condition. Known limitation: If you deploy the runtime data collectors in an on-premises environment, metrics of the selected item might not be displayed when you define the threshold and select a display item. As a best practice, do not select a display item for data collectors in an on-premises environment.
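The effect of a display item on a multi-row data sample can be pictured with a short simulation. This is a sketch under a stated assumption: events are tracked per distinct value of the display item, which is one reading of the description above and not a documented internal detail. The row fields (disk, used) and the function name are hypothetical.

```python
def open_events(rows, is_breached, display_item=None):
    """Return the rows that would open an event in one data sample.

    Without a display item, only the first qualifying row opens an event.
    With a display item, each distinct value of that field can open its own event.
    """
    opened = []
    seen = set()
    for row in rows:
        if not is_breached(row):
            continue
        # All rows share one key when no display item is selected.
        key = row[display_item] if display_item else None
        if key in seen:
            continue
        seen.add(key)
        opened.append(row)
    return opened


rows = [
    {"disk": "/", "used": 95},
    {"disk": "/var", "used": 97},
    {"disk": "/tmp", "used": 40},
]
over_90 = lambda row: row["used"] > 90

print(len(open_events(rows, over_90)))                       # 1
print(len(open_events(rows, over_90, display_item="disk")))  # 2
```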
-
Complete the Assignments section:
- Select the resources for the threshold to monitor under Resources to monitor:
- Select All <resource_type> to apply the threshold to all resource instances of the same type, such as all Hadoop hosts.
- Select Individual instances to see and select the resource instances. Individual instances cannot be selected for the WebSphere Applications agent nor any other agent that has subnodes.
- Select Groups to see and select from the list of resource groups. (For more information, see Managing resource groups). If you assign the threshold to a resource group that does not include a managed resource of the threshold's resource type, a message notifies you that the threshold does not apply to any resource types in the group.
- Optional: Set up Until another threshold. This function lets you relate another threshold for the same monitored resources to this one. When the first threshold generates an event, that event remains open until the related threshold for those same resources is met and generates a new event. At that point, the event for the first threshold is closed. The Related threshold list is displayed only after you select an assignment under Resources to monitor. It lists thresholds that have the same resource type and assignment as the threshold that is being created. The two thresholds must also have the same display item. If no assignment is selected under Resources to monitor, the Related threshold list is not populated and you cannot configure Until another threshold.
-
Optional: Complete the Reflex action section if you want to execute a command when an event is opened.
- Enter the command to execute. In this example, two commands are run by the Linux® OS agent: The text in quotes is echoed and redirected to a log file, and the clean_logs script runs on the associated Linux OS disk (&{KLZ_Disk.Disk_Name} is replaced by the attribute value).
echo "`date` : WT_LZ_user_login is true for &{KLZ_Disk.Disk_Name}" >>/tmp/wt.log;/scripts/clean_logs.sh &{KLZ_Disk.Disk_Name}
- Select one of these options to control how often the command is run:
- Select On first event only if the data sample has multiple rows and you want to run the command for only the first event occurrence in the data sample. Clear the checkbox to run the command for every row that causes an event.
- Select For every consecutive true interval to run the command every time the threshold evaluates to true. Clear the checkbox to run the command when the threshold is true, but not again until the threshold evaluates to false, followed by another true evaluation in a subsequent interval.
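The placeholder substitution in a reflex action command can be approximated as follows. This sketch only mimics the documented effect, where &{KLZ_Disk.Disk_Name} is replaced by the attribute value. How the agent itself parses, quotes, or escapes values is not covered by this topic, so the pattern, the function name, and the sample disk name are assumptions.

```python
import re

# Assumed placeholder shape: &{Table.Attribute}
PLACEHOLDER = re.compile(r"&\{([A-Za-z0-9_]+\.[A-Za-z0-9_]+)\}")


def expand_command(template: str, attributes: dict) -> str:
    """Replace each &{Table.Attribute} placeholder with its value from attributes."""
    def substitute(match):
        name = match.group(1)
        # Leave unknown placeholders untouched rather than guessing a value.
        return str(attributes.get(name, match.group(0)))
    return PLACEHOLDER.sub(substitute, template)


template = "/scripts/clean_logs.sh &{KLZ_Disk.Disk_Name}"
print(expand_command(template, {"KLZ_Disk.Disk_Name": "/dev/sda1"}))
# /scripts/clean_logs.sh /dev/sda1
```

Previewing the expanded command this way can help you spot a mistyped attribute name before the threshold is saved.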
-
Optional: Complete the Expiration section. Set up an expiration period for an event. When the expire time ends, the event closes automatically. The format is DD-HH-MM-SS (the corresponding fields to configure are: Days, Hours, Minutes, and Seconds). The minimum expire time you can set an event to is 1 second and the maximum value is 30 days. If you do not configure an expiration period for an event using this function, then the event expires after 2 hours, which is the default expire time.
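The expiration limits above are easy to check with a little arithmetic. The following sketch converts a DD-HH-MM-SS value to seconds and applies the documented minimum, maximum, and default. Treating the four console fields as a single hyphen-separated string is an assumption made for illustration, and the function name is hypothetical.

```python
DEFAULT_EXPIRY_SECONDS = 2 * 60 * 60     # 2 hours, the documented default
MAX_EXPIRY_SECONDS = 30 * 24 * 60 * 60   # 30 days, the documented maximum
MIN_EXPIRY_SECONDS = 1                   # 1 second, the documented minimum


def expiry_seconds(value=None):
    """Convert 'DD-HH-MM-SS' to seconds; None means no expiration was configured."""
    if value is None:
        return DEFAULT_EXPIRY_SECONDS
    parts = value.split("-")
    if len(parts) != 4:
        raise ValueError("expected DD-HH-MM-SS")
    days, hours, minutes, seconds = (int(part) for part in parts)
    total = ((days * 24 + hours) * 60 + minutes) * 60 + seconds
    if not MIN_EXPIRY_SECONDS <= total <= MAX_EXPIRY_SECONDS:
        raise ValueError("expiration must be between 1 second and 30 days")
    return total


print(expiry_seconds())               # 7200
print(expiry_seconds("01-02-03-04"))  # 93784
print(expiry_seconds("30-00-00-00"))  # 2592000
```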
-
If you don't want the threshold to begin monitoring, drag the Enable slider from On to Off.
Note
- When assigning resources and the selected metrics contain any UA only metrics, the Reflex action section is not displayed.
- If you select the All <resource_type> or Groups option, only UA instances receive the threshold.
- If you select individual instances, the resource instances are restricted to UA.
- If the selected metrics contain only TEMA metrics (the metrics that are not tagged UA only), the Reflex action section is displayed.
- If you select the All <resource_type> or Groups option and a reflex action is specified, only TEMA instances are affected by the reflex action.
- If you select individual instances and a reflex action is specified, the resource instances are restricted to TEMA.
- If no reflex action is specified, all resource instances are shown.
- Cross-table thresholds are not visible for TEMA agents.
- Thresholds that contain TEMA agent metrics and have an Until another threshold condition that relates to another threshold using TEMA metrics are not supported.
Results
After you click Finish to save the threshold, it starts on the resource instances that are assigned. If you set Enable to Off, the thresholds list shows "Disabled" in the State column.
Note: You can enrich and improve the content of the events that are generated by the thresholds by using an event policy. For more information, see Enriching event information by adding custom attributes.
Example
Nested conditions are used to support multiple conditions joined with mixed AND and OR operators. Otherwise, multiple conditions would use Boolean AND logic or Boolean OR logic, not both. To illustrate, the following threshold evaluates to be true if either the process CPU is at least ½ second and the process command is named kynagent or if the process command is named klzagent:
Condition 1 | Process CPU Seconds >= 0.5 seconds AND Process Command Name = kynagent |
Condition 2 | OR Process Command Name = klzagent |
The intention, however, is for the threshold to evaluate to be true if the process CPU is at least ½ second and the process command is named either kynagent or klzagent. To achieve the desired result, select Add nested condition for Condition 2:
Condition 1 | Process CPU Seconds >= 0.5 seconds AND |
Condition 2 (nested) | Process Command Name = kynagent OR Process Command Name = klzagent |
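The difference between the two definitions is the grouping of the Boolean operators. The sketch below writes both as plain Python expressions so that the grouping is visible. The function names and the sample values are hypothetical, and the code only models the logic, not how the product evaluates thresholds.

```python
def flat(cpu_seconds: float, command: str) -> bool:
    """Without nesting: (CPU AND kynagent) OR klzagent."""
    return (cpu_seconds >= 0.5 and command == "kynagent") or command == "klzagent"


def nested(cpu_seconds: float, command: str) -> bool:
    """With a nested Condition 2: CPU AND (kynagent OR klzagent)."""
    return cpu_seconds >= 0.5 and (command == "kynagent" or command == "klzagent")


# A klzagent process that uses little CPU separates the two definitions.
print(flat(0.1, "klzagent"))    # True: opens an event that was not intended
print(nested(0.1, "klzagent"))  # False: CPU condition is not met
print(nested(0.7, "klzagent"))  # True: both parts are met
```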
Role-based access control (RBAC)
Users with the role of Cluster Administrator or Account Administrator have full access to thresholds. See the following table for the detailed threshold permissions for each role.
Role | Create | Modify | Define reflex action | Enable/Disable | Delete | Duplicate | View |
---|---|---|---|---|---|---|---|
Account Administrator | accept | accept | accept | accept | accept | accept | accept |
Cluster Administrator | accept | accept | accept | accept | accept | accept | accept |
Administrator | accept | accept | accept | accept | accept | accept | accept |
Operator | accept | accept*1 | reject | accept*1 | accept*2 | accept*1 | accept*1 |
Editor | accept | accept*1 | reject | accept*1 | accept*2 | accept*1 | accept*1 |
Auditor | reject | reject | reject | reject | reject | reject | accept*1 |
Viewer | reject | reject | reject | reject | reject | reject | accept*1 |
Note:
- *1 means that users can perform the action if they have access to the resources that the threshold is assigned to.
- *2 means that users can perform the action if the threshold was created by the current user.
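The table and its two footnotes can be read as a small lookup. The following sketch encodes them as data so that a single check answers whether a role can perform an action. It is an illustration of the table only. The function, constant, and action names are hypothetical and do not correspond to a product API.

```python
ALWAYS = "always"
IF_RESOURCE_ACCESS = "resource_access"  # footnote *1
IF_CREATOR = "creator"                  # footnote *2
NEVER = "never"

ACTIONS = ("create", "modify", "define_reflex_action",
           "enable_disable", "delete", "duplicate", "view")


def _row(*rules):
    """Pair each action with its rule, in table column order."""
    return dict(zip(ACTIONS, rules))


_FULL = _row(*[ALWAYS] * 7)
_LIMITED = _row(ALWAYS, IF_RESOURCE_ACCESS, NEVER, IF_RESOURCE_ACCESS,
                IF_CREATOR, IF_RESOURCE_ACCESS, IF_RESOURCE_ACCESS)
_READ_ONLY = _row(*[NEVER] * 6, IF_RESOURCE_ACCESS)

PERMISSIONS = {
    "Account Administrator": _FULL,
    "Cluster Administrator": _FULL,
    "Administrator": _FULL,
    "Operator": _LIMITED,
    "Editor": _LIMITED,
    "Auditor": _READ_ONLY,
    "Viewer": _READ_ONLY,
}


def can(role, action, has_resource_access=False, is_creator=False):
    """Return True if the table allows the role to perform the action."""
    rule = PERMISSIONS[role][action]
    if rule == ALWAYS:
        return True
    if rule == IF_RESOURCE_ACCESS:
        return has_resource_access
    if rule == IF_CREATOR:
        return is_creator
    return False


print(can("Operator", "delete", is_creator=True))           # True
print(can("Editor", "define_reflex_action"))                # False
print(can("Viewer", "view", has_resource_access=True))      # True
```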
What to do next
- View, edit, enable or disable, or delete the threshold in the Threshold Management table.
- Follow this usage scenario to get some hands-on practice with creating thresholds. See Getting started: Accelerate your transition to the cloud with DevOps.
- Start monitoring your resource as described in Monitoring resources in your environment.