Guidelines for setting thresholds for alerts and exceptions
When you create a monitoring profile include line, you can specify the conditions that, when exceeded, cause an exception or an alert to be generated. These conditions include CPU time, elapsed time, getpages, SQL calls, and exception limit.
The purpose of alerts
The purpose of alerts is to make you (or an operator) aware of any unusual activity. Each alert carries some overhead, so thresholds should be set high enough to avoid excessive resource consumption by the CAE Agent and CAE Server.
Considerations for setting thresholds
When you set an alert or exception threshold, you should consider the following:
- Can your site's operators and DBAs respond to each individual alert or exception that is generated?
- Would your site's operators and DBAs want to receive an email for every alert that is generated (or would emails be generated faster than they might want to receive them)?
If your site's operators and DBAs would not be able to respond to each individual alert or exception that is generated, or if they would not want to receive so many emails, then the alert threshold is set too low and should be raised.
Thresholds and workloads
The optimal thresholds also vary depending on the kinds of workloads your site experiences. For example, if an SQL section exceeds a threshold, would an operator at your site want to take the time to look at this SQL section and consider whether or not to cancel it? Would they want to review every alert for activity that exceeds 5 seconds of CPU time or 20 seconds of elapsed time?
Thresholds and SQLCODEs
When setting thresholds for SQLCODEs, you should evaluate whether an error requires human intervention. For example, would an application developer or DBA actually take some action as a result of knowing that this individual SQL error occurred?
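The evaluation above can be pictured as a simple filter. This is an illustrative Python sketch only, not Db2 Query Monitor syntax: the names `EXCLUDED_SQLCODES` and `should_alert` are hypothetical, and treating only negative SQLCODEs as alert candidates is an assumption made for the example.

```python
# Hypothetical exclusion list for illustration. -803 (duplicate key) is
# typically handled by application logic and rarely needs human attention.
EXCLUDED_SQLCODES = frozenset({-803})


def should_alert(sqlcode, excluded=EXCLUDED_SQLCODES):
    """Return True when an error SQLCODE is not on the exclusion list.

    Assumption: only negative SQLCODEs (errors) are candidates for alerts.
    """
    return sqlcode < 0 and sqlcode not in excluded


# -911 (deadlock or timeout rollback) is not excluded, so it would alert.
for code in (-803, -911, 0):
    print(code, should_alert(code))
```

In practice, the exclusion list would be built by asking, for each SQLCODE seen at your site, whether anyone would act on a single occurrence.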
While the alert system can handle 5 or more alerts per second (per CAE Agent) for short periods, a sustained average of more than 5 alerts per minute in the CAE Server causes the CAE Server to consume too much memory over the course of a day.
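To put the per-minute figure in perspective, an average of 5 alerts a minute amounts to 5 × 60 × 24 = 7,200 alerts a day. The following Python sketch shows one way to compare an observed alert count against that guideline. The function names are hypothetical, and how you obtain the alert count for a period is left as an assumption.

```python
# Guideline from this topic: a sustained average above 5 alerts per minute
# is too high for the CAE Server.
MAX_AVG_ALERTS_PER_MINUTE = 5.0
MINUTES_PER_DAY = 24 * 60


def average_alerts_per_minute(alert_count, window_minutes):
    """Return the mean alert rate over an observation window."""
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    return alert_count / window_minutes


def exceeds_guideline(alert_count, window_minutes,
                      limit=MAX_AVG_ALERTS_PER_MINUTE):
    """Return True when the average rate is above the guideline."""
    return average_alerts_per_minute(alert_count, window_minutes) > limit


# Daily alert count that corresponds exactly to the guideline rate.
daily_ceiling = int(MAX_AVG_ALERTS_PER_MINUTE * MINUTES_PER_DAY)
print(daily_ceiling)
print(exceeds_guideline(10_000, MINUTES_PER_DAY))
```

A short burst does not trip this check on its own, because the rate is averaged over the whole window. That matches the distinction above between brief peaks and a sustained average.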
Situations that cause too many alerts
The most common reasons for your site experiencing too many alerts are:
- Not having a complete list of SQL codes in SQL Codes excluded for Alerts. For example, many sites should exclude -803, because a common coding technique is to insert a row first and then update it if the insert returns -803.
- Setting alert thresholds too low for CPU, elapsed time, getpages, or SQL calls. If you want to look at the activity after the fact, you can set thresholds to store the activity as an exception. Alerts are for immediate attention. The profiles facility in Db2 Query Monitor allows you to set different thresholds for different workloads. For example, 5 seconds elapsed may be cause for concern for a transactional workload, but not for a batch workload.
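The relationship between workloads, exceptions, and alerts can be sketched as follows. This Python example is a conceptual model only and does not represent monitoring profile syntax. The `Thresholds` class, the `PROFILES` table, and all threshold values other than the 5-second transactional example are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Thresholds:
    """Limits for one severity level; None means the limit is not checked."""
    cpu_seconds: Optional[float] = None
    elapsed_seconds: Optional[float] = None

    def exceeded_by(self, cpu, elapsed):
        return (
            (self.cpu_seconds is not None and cpu > self.cpu_seconds)
            or (self.elapsed_seconds is not None
                and elapsed > self.elapsed_seconds)
        )


# Hypothetical values. Exception limits sit below alert limits so that
# activity can be stored for later review without notifying anyone.
PROFILES = {
    "transactional": {
        "exception": Thresholds(elapsed_seconds=1.0),
        "alert": Thresholds(elapsed_seconds=5.0),
    },
    "batch": {
        "exception": Thresholds(elapsed_seconds=300.0),
        "alert": Thresholds(elapsed_seconds=3600.0),
    },
}


def classify(workload, cpu, elapsed):
    """Return 'alert', 'exception', or 'none' for one unit of activity."""
    limits = PROFILES[workload]
    if limits["alert"].exceeded_by(cpu, elapsed):
        return "alert"
    if limits["exception"].exceeded_by(cpu, elapsed):
        return "exception"
    return "none"


# The same 6 seconds of elapsed time is treated differently per workload.
print(classify("transactional", cpu=0.2, elapsed=6.0))
print(classify("batch", cpu=0.2, elapsed=6.0))
```

Keeping the alert limit well above the exception limit reserves alerts for activity that needs immediate attention, while still recording slower activity for later analysis.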