Setting alert and exception thresholds

This topic describes how to set alert and exception thresholds.

When you create an INCLUDE monitoring profile line, you can specify the conditions that, when exceeded, generate an exception or an alert. These conditions include CPU time, elapsed time, getpages, SQL calls, and the exception limit.

The purpose of alerts is to bring unusual activity to the attention of a human operator. Each alert carries some overhead, so set thresholds high enough to keep the CAE Agent and CAE Server from consuming excessive resources.

A good approach when setting alert thresholds is to assess whether your site's operators and DBAs can respond to each individual alert that is generated. Another test is to imagine that an email is sent for every alert: would the emails arrive faster than you would want to receive them? If so, the alert threshold is probably too low.

The best thresholds vary depending on the kinds of workloads they target. Another way of thinking about the alert threshold is this: if an SQL section exceeds the alert threshold, a human operator would want to take the time to look at that SQL section and consider whether to cancel it. For example, an alert threshold might be more than 5 seconds of CPU time or more than 20 seconds of elapsed time.
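The example thresholds above can be expressed as a simple check. The following Python sketch is illustrative only: the names AlertThresholds and exceeds_alert_threshold are hypothetical and are not part of the product, and the sketch assumes that exceeding either limit is enough to warrant an alert.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertThresholds:
    """Hypothetical container for the example limits from the text."""
    cpu_seconds: float = 5.0       # alert above 5 seconds of CPU time
    elapsed_seconds: float = 20.0  # alert above 20 seconds of elapsed time


def exceeds_alert_threshold(cpu_seconds: float,
                            elapsed_seconds: float,
                            thresholds: AlertThresholds = AlertThresholds()) -> bool:
    """Return True when either measurement is strictly above its limit.

    Assumption: exceeding any one limit is sufficient for an alert.
    """
    return (cpu_seconds > thresholds.cpu_seconds
            or elapsed_seconds > thresholds.elapsed_seconds)
```

With these values, an SQL section that used 6 seconds of CPU would qualify for an alert, while one that used exactly 5 seconds of CPU and 20 seconds elapsed would not.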

With regard to SQL codes, you should evaluate whether an error requires human intervention. For example, would an application developer or DBA actually take some action as a result of knowing that this individual SQL error occurred?

Although the alert system can handle 5 or more alerts per second (per CAE Agent) for short periods, a sustained average of more than 5 alerts per minute in the CAE Server causes the CAE Server to consume too much memory over the course of a day.
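To judge whether a site is above that sustained rate, you can compare an alert count against the length of the period in which the alerts arrived. The following Python sketch shows the arithmetic; the function names and the idea of counting alerts over a window are assumptions made for illustration and do not correspond to a product interface.

```python
def average_alerts_per_minute(alert_count: int, window_seconds: float) -> float:
    """Average alert rate over a window, in alerts per minute."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return alert_count * 60.0 / window_seconds


def sustained_rate_too_high(alert_count: int,
                            window_seconds: float,
                            limit_per_minute: float = 5.0) -> bool:
    """Return True when the average rate is above the limit.

    The default limit is the 5 alerts per minute guideline from the text.
    """
    return average_alerts_per_minute(alert_count, window_seconds) > limit_per_minute
```

Over a 24-hour day (86,400 seconds), 7,200 alerts average exactly 5 per minute, so a daily total above roughly 7,200 alerts in the CAE Server suggests that thresholds are too low.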

The most common reasons for receiving too many alerts are:

  • Having an incomplete list of SQL codes under SQL Codes excluded for alerts. Many sites should exclude -803, because a fairly common coding technique is to insert first and then update if the insert returns -803.
  • Setting alert thresholds too low for CPU, Elapsed, Getpages, or SQL Calls. Alerts are for activity that needs immediate attention. If you only want to review the activity after the fact, set thresholds that store the activity as an exception instead. The profiles facility in CQM allows you to set different thresholds for different workloads (for example, 5 seconds elapsed may be cause for concern for a transactional workload, but not for a batch workload).
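Both causes can be pictured with a small model. The following Python sketch is a hypothetical illustration, not the product's profile format: the Profile class, the workload names, and every threshold value except the 5-second transactional alert limit are assumptions. It also assumes that only negative SQL codes are candidates for alerts and that the most urgent matching level is reported.

```python
from dataclasses import dataclass

# SQL codes that should not raise alerts; -803 is the example from the text.
EXCLUDED_SQLCODES = frozenset({-803})


@dataclass(frozen=True)
class Profile:
    """Hypothetical per-workload elapsed-time thresholds, in seconds."""
    exception_elapsed_seconds: float  # store for later review
    alert_elapsed_seconds: float      # needs immediate attention


# Assumed values; only the 5-second transactional alert limit is from the text.
PROFILES = {
    "transactional": Profile(exception_elapsed_seconds=1.0,
                             alert_elapsed_seconds=5.0),
    "batch": Profile(exception_elapsed_seconds=60.0,
                     alert_elapsed_seconds=600.0),
}


def classify_elapsed(workload: str, elapsed_seconds: float) -> str:
    """Return 'alert', 'exception', or 'none' for the given workload."""
    profile = PROFILES[workload]
    if elapsed_seconds > profile.alert_elapsed_seconds:
        return "alert"
    if elapsed_seconds > profile.exception_elapsed_seconds:
        return "exception"
    return "none"


def sqlcode_raises_alert(sqlcode: int) -> bool:
    """Return True for negative SQL codes that are not on the excluded list."""
    return sqlcode < 0 and sqlcode not in EXCLUDED_SQLCODES
```

In this model, 6 seconds elapsed is an alert for the transactional workload but not even an exception for batch, and a -803 never raises an alert.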