Built-in Events Reference
The Events page lists all currently available events: the built-in events that are provided out of the box and any user-defined custom events. To view the Events page, click Settings -> Events.
The list can be filtered by:
- Type: built-in event or custom event.
- Incidents and severity: incidents, warning, or critical.
- Full text search.
Important: Built-in events can't be modified. You can create custom events based on the same entities and metrics used for built-in events. Custom events trigger issues or incidents
based on the thresholds of an individual metric of any given entity.
.NET App
Event |
Description |
Metric |
Garbage collection activity high. |
Monitors the garbage collection (GC) time spent by the CLR runtime platform and checks it against the maximum allowed percentage value. |
GC time (mem.time_in_gc ). |
For more information about this sensor, see the .NET documentation.
ActiveMQ
Event |
Description |
Metric |
Dead-letter queue size is growing. |
Dead-letter queue size is increasing. Messages sent are not routed to their correct destination. |
ActiveMQ queue size. |
Memory usage is close to the limit. |
Memory usage is close to 100% of the memory limit. |
Memory Usage (memoryPercentage ). |
Store usage is close to the limit. |
Store usage is close to 100% of the store limit. |
Store Usage (storePercentage ). |
For more information about this sensor, see the ActiveMQ documentation.
ActiveMQ Artemis
Event |
Description |
Metric |
ActiveMQ Artemis has no connections. |
There are no connections in the last 5 seconds. The current number of connections is equal to the configured NoConnections count. |
Total Connections (totalConnectionCount ). |
ActiveMQ Artemis has no consumers. |
There are no consumers in the last 5 seconds. The current number of consumers is equal to the configured NoConsumers count. |
Total Consumers (totalConsumerCount ). |
Addresses memory usage is close to the limit. |
Memory usage of all addresses is close to 100% of its memory limit. |
Address Memory Usage (addressMemoryPercentage ). |
For more information about this sensor, see the ActiveMQ Artemis documentation.
Apache HTTPd
Event |
Description |
Metric |
Apache child processes are stuck performing DNS lookups. |
Detects high usage of server workers by DNS lookup. |
Dns (worker.dns ). |
Logging is slowing down Apache HTTPd performance. |
Detects high usage of server workers for logging purposes. |
Logging (worker.logging ). |
Number of busy workers is approaching max workers. |
Detects a high percentage of busy workers. |
Busy workers (busy_workers ). |
For more information about this sensor, see the Apache HTTPd documentation.
Application
Event |
Description |
Metric |
Complete drop in calls |
Detects a rapid drop to zero (essentially the service is not being called anymore) in the values of the calls relative to the values in the last 30 minutes. The magnitude of the drop in calls should also exceed the listed relative and
absolute threshold parameters. |
Calls/s (count ) |
Error rate too high |
Detects a consistently high error rate when the average errors KPI within the last four minutes is above the given threshold value. |
Error Rate (error_rate ). |
Increasing trend in error rate |
This rule checks for the presence of an increasing trend in a given metric. The rule is tuned to detect weakly monotonic increases in the given metric. The detector is, however, not strict and tolerates a certain amount of decreases in the
metric value inside the trend candidate. |
Error Rate (error_rate ). |
Sudden drop in calls |
Detects a rapid drop in the values of the calls KPI metric relative to the values in the last 30 minutes. The magnitude of the drop in calls should also exceed the listed relative and absolute threshold parameters. |
Calls/s (count ). |
Sudden increase in error rate |
Detects a rapid increase in the values of the errors KPI relative to the KPI values in the last 10 minutes. The magnitude of the increase in errors should also exceed the listed relative and absolute threshold parameters. |
Error Rate (error_rate ). |
Sudden increase in latency |
Detects a rapid increase in the given latency KPI percentile relative to the KPI values in the last 30 minutes. The magnitude of the increase in latency should also exceed the listed relative and absolute threshold parameters. |
Latency 50th (duration.50th ). |
Sudden increase in latency for a fraction of requests |
Detects a rapid increase in the given latency KPI percentile relative to the KPI values in the last 30 minutes. The magnitude of the increase in latency should also exceed the listed relative and absolute threshold parameters. |
Latency 99th (duration.99th ). |
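The "sudden drop" rules above combine a relative and an absolute threshold. The following is a minimal sketch of that idea, not the actual detector: the function name, the use of a plain mean as the baseline, and the default threshold values are all assumptions made for illustration.

```python
from statistics import mean


def is_sudden_drop(baseline_values, current_value,
                   rel_threshold=0.5, abs_threshold=10.0):
    """Return True when current_value fell far enough below the baseline.

    baseline_values: calls/s samples from the reference window
                     (for example, the last 30 minutes).
    rel_threshold, abs_threshold: hypothetical defaults, not product values.
    """
    if not baseline_values:
        return False
    baseline = mean(baseline_values)
    drop = baseline - current_value
    if baseline <= 0 or drop <= 0:
        return False
    # Both conditions must hold: the drop is large in absolute terms
    # and large relative to the baseline.
    return drop >= abs_threshold and drop / baseline >= rel_threshold
```

Requiring both thresholds keeps low-traffic services from alerting: a fall from 4 calls/s to 0 is a 100% relative drop but stays below the absolute threshold in this sketch.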
AWS DynamoDB
Event |
Description |
Metric |
Ratio of consumed and provisioned reads is critical. |
Detects high ratio of consumed and provisioned reads. |
Consumed read capacity (consumed_read ). |
Ratio of consumed and provisioned writes is critical. |
Detects high ratio of consumed and provisioned writes. |
Consumed write capacity (consumed_write ) and provisioned write capacity (provisioned_write ). |
For more information about this sensor, see the AWS DynamoDB documentation.
AWS MSK
Event |
Description |
Metric |
Active Controller Count. |
Checks for an unusual number of active controllers in the Kafka cluster. |
Active controller count (active_controller_count ). |
Offline Partitions Count. |
Defines the maximum allowed proportion of violations of offline partitions within the specified time window. |
Offline partitions count (offline_partitions_count ). |
Network Processor Low Idle Time. |
Checks whether the Kafka network thread is under high load. |
Network processor idle time (network_processor_idle ). |
Request Handler Low Idle Time. |
Checks whether the Kafka request handler is under high load. |
Request handler idle time (request_handler_idle ). |
Under-replicated partitions Count. |
Checks whether the number of under-replicated partitions exceeds the expected number. |
Under-replicated partitions (under_replicated_partitions ). |
For more information about this sensor, see the AWS MSK documentation.
AWS RDS
Event |
Description |
Metric |
CPU credit balance reaching zero. |
Checks if the CPU credit balance is getting closer to zero. |
CPU Credit Balance (cpu_credit_balance ). |
Number of CPU credits consumed is high. |
Checks if the percentage of CPU credits consumed by an instance is reaching max capacity. |
CPU Credit Usage (cpu_credit_usage ) and CPU Credit Balance (cpu_credit_balance ). |
For more information about this sensor, see the AWS RDS documentation.
Azure API Management Service
The Azure API Management sensor will automatically perform any configured custom health checks every minute. If the checks fail for at least one minute, an issue will be raised to inform the user.
Event |
Description |
Metric |
Azure Api Management capacity is getting closer to the max capacity limit. |
Checks whether Azure API Management is using more than 90% of the available capacity. |
Capacity (metrics.Capacity ). |
For more information about this sensor, see the Azure API Management documentation.
Azure CosmosDB
Event |
Description |
Metric |
Azure CosmosDb storage capacity is getting closer to the max capacity limit. |
Detects whether the Azure CosmosDb storage capacity is reaching the max capacity limit. |
CosmosDb storage capacity. |
For more information about this sensor, see the Azure CosmosDB documentation.
Azure Redis
The Azure Redis Cache sensor will conduct custom health checks and execute them every minute. If the checks fail for at least one minute, an issue will be raised to inform the user.
Event |
Description |
Metric |
Azure Redis Cache client connections are getting closer to max connections limit. |
Azure Redis Cache is using more than 90% of available client connections. |
Connected Clients (connectedclients ). |
Azure Redis Cache memory usage is getting closer to max memory limit. |
Azure Redis Cache is using more than 90% of available memory. |
Percentage of Memory Used (usedmemorypercentage ). |
For more information about this sensor, see the Azure Redis documentation.
Azure SQL Database
The Azure SQL Database sensor will conduct custom health checks and execute them every minute. If the checks fail for at least one minute, an issue will be raised to inform the user.
Event |
Description |
Metric |
Database is running out of space. |
Checks if Azure SQL Database is running out of space. Warning limit is at 80% and the critical limit is at 90% of the used size. |
metrics.storage_percent . |
Database status. |
Unhealthy state is caused by the database being unavailable. A database can be unavailable if one of the following conditions is true:
- The database has been set offline by the user
- The database is being restored from backup
- The database is being recovered
- The database has been corrupted
- The database has been set to the Emergency state by the administrator
- The database is in the process of being created by copying another database
|
metrics.statusCode . |
The total DTU utilization is getting closer to max DTU limit. |
Checks if the Azure SQL Database DTU utilization is reaching max DTU limit. Warning limit is at 75% and the critical limit is at 85% of the DTU utilization. |
metrics.dtu_consumption_percent . |
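The warning and critical limits listed above (80% and 90% for storage, 75% and 85% for DTU utilization) can be read as a two-level classification. The sketch below illustrates that reading; the function name and the choice of inclusive comparisons are assumptions, since the table does not say whether a value exactly at the limit triggers the event.

```python
def classify_usage(percent, warning_at, critical_at):
    """Map a utilization percentage to a severity label.

    Assumes inclusive limits (>=); the exact comparison used by the
    sensor is not documented here.
    """
    if percent >= critical_at:
        return "critical"
    if percent >= warning_at:
        return "warning"
    return "ok"


# Limits taken from the table above.
storage_severity = classify_usage(82.0, warning_at=80, critical_at=90)
dtu_severity = classify_usage(86.0, warning_at=75, critical_at=85)
```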
Azure MySQL Database
The Azure MySQL Database sensor runs custom health checks every minute. If the checks fail for at least one minute, an issue is raised to inform you.
Event |
Description |
Metric |
Available server connections are getting closer to the max connections limit |
The usage of Azure MySQL Server connections is more than 85% of the available client connections. |
Active Connections (active_connections ) |
For more information about this sensor, see the Azure MySQL documentation.
Azure Service Bus
The Azure Service Bus sensor runs custom health checks every minute. If the checks fail for at least one minute, an issue is raised to inform you.
Event |
Description |
Metric |
Azure Service Bus has at least one message in DL queue |
Checks if the Azure Service Bus has at least one message in the dead-letter queue. |
Dead Lettered Messages (deadletteredMessages ) |
For more information about this sensor, see the Azure Service Bus documentation.
Azure SQL Elastic Pool
The Azure SQL Elastic Pool sensor will conduct custom health checks and execute them every minute. If the checks fail for at least one minute, an issue will be raised to inform the user.
Event |
Description |
Metric |
The total eDTU utilization is getting closer to max eDTU limit. |
Checks if Azure SQL Elastic Pool eDTU is reaching maximum eDTU limit. |
metrics.dtu_consumption_percent . |
Ceph
Event |
Description |
Metric |
Ceph cluster status. |
Ceph cluster is reporting a problem; HEALTH_WARN or HEALTH_ERR . |
Status of the Ceph Cluster (overall_status ). |
Monitor quorum is not reached. |
The number of healthy monitors is less than 50% of all monitors. |
Number of monitors (num_mons ) and number of active monitors (num_active_mons ). |
Osd(s) full capacity state. |
Some OSDs are reporting a full state. |
Number of full osds (num_full_osds ). |
Osd(s) near full capacity state. |
Some OSDs are reporting a near full state. |
Number of near full osds (num_near_full_osds ). |
For more information about this sensor, see the Ceph documentation.
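The monitor quorum rule above compares the two listed metrics. A minimal sketch of that comparison follows, assuming that num_active_mons counts the healthy monitors; the function name is hypothetical.

```python
def monitor_quorum_lost(num_mons, num_active_mons):
    """Return True when fewer than 50% of all monitors are healthy.

    Uses integer arithmetic (2 * active < total) to avoid float rounding.
    """
    if num_mons <= 0:
        raise ValueError("num_mons must be positive")
    return 2 * num_active_mons < num_mons
```

With five monitors, two healthy monitors trigger the rule and three do not. With four monitors, two healthy monitors are exactly 50% and do not trigger it in this sketch.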
Consul (HashiCorp)
Event |
Description |
Metric |
Consul cluster health. |
Detects the overall health of the cluster and if any of the nodes are considered unhealthy by Autopilot. |
Consul autopilot health status (consul.autopilot.healthy ). |
CRI-O
Event |
Description |
Metric |
Memory exhausted. |
Detects when the container memory usage exceeds specified limits. |
RSS (memory.total_rss ). |
Docker
Event |
Description |
Metric |
Memory exhausted. |
When the container memory usage exceeds specified limits, a memory warning threshold or a memory critical threshold alert is displayed. |
RSS (memory.total_rss ). |
For more information about this sensor, see the Docker documentation.
Endpoint
Event |
Description |
Metric |
Complete drop in calls. |
Detects a rapid drop to zero (essentially the service is not being called anymore) in the values of the calls KPI metric relative to the values in the last 30 minutes. The magnitude of the drop in calls should also exceed the listed relative
and absolute threshold parameters. |
Calls/s (count ). |
Error rate too high. |
Detects a consistently high error rate when the average errors KPI within the last four minutes is above the given threshold value. |
Error Rate (error_rate ). |
Error rate too high for a Synthetic endpoint. |
Detects a consistently high error rate of a Synthetic endpoint when the average errors KPI within the last four minutes is above the given threshold value. |
Synthetic error rate (synthetic_error_rate ). |
Increasing trend in error rate. |
Checks for the presence of an increasing trend in a given metric. The rule is tuned to detect weakly monotonic increases in the given metric. The detector is, however, not strict and tolerates a certain amount of decreases in the metric value
inside the trend candidate. |
Error Rate (error_rate ). |
Sudden drop in calls. |
Detects a rapid drop in the values of the calls KPI metric relative to the values in the last 30 minutes. The magnitude of the drop in calls should also exceed the listed relative and absolute threshold parameters. |
Calls/s (count ). |
Sudden drop in Synthetic calls. |
Detects a rapid drop in the values of the calls KPI metric relative to the values in the last 30 minutes. The magnitude of the drop in calls should also exceed the listed relative and absolute threshold parameters. |
Synthetic calls/s (synthetic_count ). |
Sudden increase in error rate. |
Detects a rapid increase in the values of the errors KPI relative to the KPI values in the last 10 minutes. The magnitude of the increase in errors should also exceed the listed relative and absolute threshold parameters. |
Error Rate (error_rate ). |
Sudden increase in latency. |
Detects a rapid increase in the given latency KPI percentile relative to the KPI values in the last 30 minutes. The magnitude of the increase in latency should also exceed the listed relative and absolute threshold parameters. |
Latency 50th (duration.50th ). |
Sudden increase in latency for a fraction of requests. |
Detects a rapid increase in the given latency KPI percentile relative to the KPI values in the last 30 minutes. The magnitude of the increase in latency should also exceed the listed relative and absolute threshold parameters. |
Latency 99th (duration.99th ). |
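The "Increasing trend in error rate" rule describes a detector that looks for a mostly increasing series while tolerating some decreases. One simple way to express such a tolerance is sketched below. This is an illustration only: the function name, the max_decrease_fraction parameter, and the minimum of three samples are assumptions, and the real detector may work differently.

```python
def has_increasing_trend(values, max_decrease_fraction=0.2):
    """Return True when values rise overall with only a few decreasing steps.

    max_decrease_fraction: hypothetical tolerance, the share of
    consecutive steps that are allowed to go down.
    """
    if len(values) < 3:
        return False
    steps = list(zip(values, values[1:]))
    decreases = sum(1 for previous, current in steps if current < previous)
    # Equal consecutive values are allowed, so the trend only has to be
    # weakly increasing; a limited share of real decreases is tolerated.
    return (values[-1] > values[0]
            and decreases / len(steps) <= max_decrease_fraction)
```

For example, the series 1, 2, 2, 3, 2.5, 4, 5 has one decreasing step out of six and counts as a trend here, while 1, 3, 2, 4, 3, 5 has two out of five and does not.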
etcd
Event |
Description |
Metric |
Abnormally high disk backend commit duration. |
Detects a high disk backend commit duration. |
Disk backend commit duration (health.disk_backend_commit_duration ). |
Abnormally high disk wal fsync duration. |
Detects a high disk WAL fsync duration. |
Disk fsync duration (health.disk_wal_fsync_duration ). |
Abnormally high snapshot duration. |
Detects high duration of saving a snapshot. |
Snap save total duration (health.debugging_snap_save_total_duration ). |
Frequent leader changes seen in last minute. |
Detects a high number of leader changes in the last minute. |
Server leader changes (health.server_leader_changes ). |
Member doesn't have leader. |
Detects a member who does not have a leader (unavailable). |
Server has leader (health.server_has_leader ). |
Proposal ratio analysis. |
Detects unusual fall of applied proposals and an unusual rise of pending and failed proposals. |
Number of proposals committed (health.server_proposals_committed ), number of proposals applied (health.server_proposals_applied ), number of proposals pending (health.server_proposals_pending ), and
number of proposals failed (health.server_proposals_failed ). |
Usage of open file descriptors is critical. |
Detects a high usage of open file descriptors. |
Number of open file descriptors (health.process_open_fds ) and the maximum number of file descriptors (health.process_max_fds ). |
For more information about this sensor, see the etcd documentation.
Garden Container
Event |
Description |
Metric |
Memory exhausted. |
Container memory usage is getting close to its memory limit. |
Usage (memory.usage ). |
For more information about this sensor, see the Garden documentation.
Glassfish
Event |
Description |
Metric |
Glassfish file cache hit rate is below 70%. |
A processing pipeline checks the file cache hit rate and validates whether it's lower than the given threshold value. |
Hit rate (file_cache_rate ). |
Maximum number of JDBC connections reached. |
A processing pipeline checks the total number of JDBC connections. It validates whether it's reaching the maximum limit for the server configuration. |
Used (jdbc_connection_used ). |
For more information about this sensor, see the Glassfish documentation.
Google Cloud Datastore
Event |
Description |
Metric |
Datastore request count dropped significantly in last 30 minutes. |
Checks for a sudden decrease in the request count. |
Requests (request_count ) |
Datastore request count increased significantly in last 30 minutes. |
Checks for a sudden increase in the request count. |
Requests (request_count ) |
For more information about this sensor, see the Google Cloud Datastore documentation.
Google Cloud Storage
Event |
Description |
Metric |
Sudden increase in size of all objects |
Checks for a sudden increase in the size of all objects within 24 hours for non-empty buckets. |
Total size of all objects in the bucket. |
For more information about this sensor, see the Google Cloud Storage documentation.
Google Cloud Pub/Sub
Event |
Description |
Metric |
The push request latency for the subscription has increased in last 10 minutes. |
Checks for a sudden increase in push request latency for the subscription. |
Request Latency (push_request_latencies ) |
Topic oldest message. |
Checks whether there are messages on the topic older than the threshold value. |
Oldest Message (oldest_unacked_message_age ) |
For more information about this sensor, see the Google Cloud Pub/Sub documentation.
Hadoop YARN
Event |
Description |
Metric |
Resource manager is reporting lost node. |
Detects if the resource manager is reporting lost nodes. |
Lost Nodes (lostNodes ). |
Resource manager is reporting unhealthy node. |
Detects if the resource manager is reporting unhealthy nodes. |
Unhealthy Nodes (unhealthyNodes ). |
Submitted app has failed. |
Detects if a submitted app has failed. |
Apps Failed (appsFailed ). |
For more information about this sensor, see the Hadoop YARN documentation.
HAProxy
Event |
Description |
Metric |
HAProxy backend average queue size is high. |
HAProxy backend average queue size is large. |
Backend Queue Size. |
HAProxy frontend session usage is high. |
HAProxy frontend session usage is high. |
Frontend Session Utilization. |
Sudden increase in average response time. |
Checks for a sudden increase in the average response time of a single backend. |
Average response time metrics. |
For more information about this sensor, see the HAProxy documentation.
Hazelcast
Starting with Hazelcast 3.3, the public method HazelcastInstance::getPartitionService()::isLocalMemberSafe() is used. For older Hazelcast versions, the health status is derived from an internal "has ongoing migrations" status on each local node.
The Hazelcast cluster health status is aggregated from each Hazelcast node. This is exactly what HazelcastInstance::getPartitionService()::isClusterSafe() does internally, but without the additional overhead of calling this method.
Hazelcast Cluster
Event |
Description |
Metric |
Cluster status. |
Checks the cluster status of Hazelcast. Hazelcast 3.3 or above. |
Hazelcast cluster status flag. |
Hazelcast Node
Event |
Description |
Metric |
Node status. |
Checks the status of the local member. Hazelcast 3.3 or above. |
Hazelcast node status flag. |
For more information about this sensor, see the Hazelcast IMDG documentation.
HBase
Event |
Description |
Metric |
Difference between number of stores and number of store files is significant. |
Detects an unusually low or unusually high number of stores. |
Stores count (rs_store_count ) and stores files count (rs_store_file_count ). |
Region server block cache hit ratio is low. |
Detects low cache hit ratio. |
Block cache hit rate (rs_blk_cache_hit_rate ) and block cache hit count (rs_blk_cache_hit_count ). |
Significant increase in compaction queue length. |
Checks for a sudden increase in the length of the compaction queue. This rule indicates that all regions are growing at a similar rate and need to split/compact at around the same time. This can be addressed by pre-splitting or turning
off auto-compactions. |
Compaction queue length (rs_comp_queue_length ). |
Significant increase in flush queue length. |
Checks for a sudden increase in the length of the flush queue. When triggered, this can be an indication of a lack of RAM or that flushes are faster than what disks can handle. |
Flush queue length (rs_flush_queue_length ). |
For more information about this sensor, see the Apache HBase documentation.
Host
Event |
Description |
Metric |
CPU spends significant time waiting for input/output. |
Checks whether the system spends significant time waiting for input/output (sampling in a sliding window of 60 seconds). |
Wait (cpu.wait ). |
CPU Steal Time exceeded. |
Checks, on a per-second moving window, whether too much CPU is stolen between running processes or by the hypervisor / host OS (sampling in a sliding window of 60 seconds). |
Steal (cpu.steal ). |
Device has low capacity left or is full. |
Detects low disk capacity problems to give an early prediction of a possible capacity breach up to 15 minutes in advance. The detector does not fire when the remaining disk space is more than 1GB or 1% of the total capacity. However,
it fires if either the remaining disk space is nearly exhausted (<1MB) or the disk would fill up within the next 15 minutes based on the current trend. |
The disk's free storage capacity. |
Disk fills up faster than it is being purged. |
Detects long-term disk capacity problems and fires when the disk is likely to run out of capacity within the next 48 hours. The detector does not fire when the remaining disk space is more than 20% of the total capacity. However, it
fires when the disk would fill up within the next 48 hours based on the current trend. This trend is computed based on local minima collected over time. When these local minima span a timeframe of at least 4 hours, a linear regression
model is fitted on these data points to produce the long-term forecast. |
The disk's free storage capacity. |
Frequent TCP errors. |
Checks whether the host has an unusually high number of TCP errors (sampling in a sliding window of 60 seconds). |
In Segments/s (tcp.inSegs ) and error (tcp.errors ). |
Frequent TCP fails. |
Checks whether the host has an unusually high number of TCP fails (sampling in a sliding window of 60 seconds). |
Fail (tcp.fails ) and open/s (tcp.opens ). |
Permanent TCP retransmissions. |
Checks whether the host has an unusually high number of TCP retransmissions (sampling in a sliding window of 60 seconds). |
Retransmission (tcp.retrans ) and out Segments/s (tcp.outSegs ). |
System load too high. |
Checks whether the system load is too high, by comparing the load against 2 times the CPU cores of the machine (sampling in a sliding window of 120 seconds). |
Load (load.1min ). |
System memory exhausted. |
Checks whether the system memory is close to being exhausted (triggered instantly). |
Free (memory.free ) and used (memory.used ). |
Too many open files. |
Processes are opening files faster than they close them (current vs max ratio exceeds threshold). |
Used (openFiles.used ). |
Too many used inodes. |
A low level of free inodes on the filesystem triggers this health rule (current vs max ratio exceeds threshold). |
inode usage. |
Too much CPU usage by user processes. |
Checks whether CPU usage of user processes is too high (sampling in a sliding window of 180 seconds). |
User (cpu.user ) and topPID. |
You will run out of disk space soon. |
Detects short-term capacity problems of a disk and fires when the disk is likely to run out of capacity within the next hour. The detector does not fire when the disk freed up a considerable amount of space (>=100MB) in the recent
past, or when the remaining disk space is more than 20% of the total capacity. However, it fires when the disk would fill up within the next hour based on the current trend. This trend is computed based on a linear regression
model fitted on the data points of the current sliding window. |
The disk's free storage capacity. |
Windows service status is changed. |
Checks whether the Windows service status is changed (sampling in a sliding window of 60 seconds). |
Windows service status (state ). |
For more information about this sensor, see the Host documentation.
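The disk capacity events in the Host table describe fitting a linear regression to free-space data points and extrapolating to the moment the disk is full. The following sketch shows that kind of forecast with an ordinary least-squares fit in plain Python. It is an illustration under assumptions: the function name and the (hours, free bytes) sample format are hypothetical, and it does not reproduce the local-minima selection or the suppression conditions described in the table.

```python
def hours_until_full(samples):
    """Estimate hours from the last sample until free space reaches zero.

    samples: list of (hours_elapsed, free_bytes) points.
    Returns None when free space is not shrinking or a fit is impossible.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_f = sum(f for _, f in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Least-squares slope and intercept of free space over time.
    slope = sum((t - mean_t) * (f - mean_f) for t, f in samples) / var_t
    if slope >= 0:
        return None  # free space is stable or growing
    intercept = mean_f - slope * mean_t
    t_zero = -intercept / slope  # time at which the fitted line hits zero
    return max(0.0, t_zero - samples[-1][0])


def would_fire(samples, horizon_hours):
    """True when the forecast predicts a full disk within the horizon."""
    remaining = hours_until_full(samples)
    return remaining is not None and remaining <= horizon_hours
```

With samples that lose 10 units of free space per hour and have 70 units left, the estimate is 7 hours, so a 48-hour horizon would fire and a 1-hour horizon would not.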
IBM ACE
Event |
Description |
Metric |
Status of ACE Integration Server |
Check the status of ACE Integration Server. |
Integration Server State |
ACE Integration Server status digital format |
Check the digital status of ACE Integration Server. |
Integration Server State Metrics |
Queue Manager connection status digital format |
Check the digital status between ACE Integration Server and Queue manager. |
Queue Manager Connection Status Metrics |
Message with errors number |
Number of messages that contain errors. |
Number of Messages with Errors |
Message flow with errors number |
Number of MQGET errors for MQInput nodes or Web Services errors for HTTPInput nodes. |
Number of MQGET Errors |
Message processing with errors number |
Number of errors that occur when processing a message. |
Number of Messages with Errors |
Message flow status |
Check the status of ACE Message Flow. |
Message Flow Status |
Message flow status digital format |
Check the digital status of ACE Message Flow. |
Message Flow Status Metrics |
For more information about this sensor, see the IBM ACE documentation.
IBM Db2
Event |
Description |
Metric |
Table Space Utilities metrics status |
Check for events that are related to table space and its metrics when the auto resize feature is enabled and disabled. |
Table Space Utilities |
HADR Connect Status |
Check for events that are related to the connection status of the HADR standby databases. The standby ID is used as a filter to generate the HADR_CONNECT_STATUS event, which is specific to any standby node, and can be set
with the standby ID in the matching operator field. The events can be created based on the following, which represents the current state of any database:
- The database is connected (Connect State = CONNECTED as 1).
- The database is in disconnected state (Connect State = DISCONNECTED as 0).
|
HADR_CONNECT_STATUS (hadr.standbyId.HADR_CONNECT_STATUS ). When the matching operator is set to any, events are generated irrespective of the standby ID. |
For more information about this sensor, see the IBM Db2 documentation.
JBoss
Event |
Description |
Metric |
Average errors on connector too high. |
A processing pipeline detects the number of errors that occurred on connectors in the given time window and also checks whether the number of errors is greater than the threshold value. |
JBoss connector errors. |
ConnectionPool is running out of connections. |
A processing pipeline detects the used connections ratio and checks if it is about to reach the threshold value. |
JBoss connection pool connections used ratio. |
Connections on datasources run out. |
A processing pipeline detects the number of available connections on data sources in the given time window and checks if the total number of connections is about to reach the threshold value. |
JBoss datasources connections used, datasources connections available. |
ThreadPool is running out of threads. |
A processing pipeline detects the number of max threads and checks if the current thread count is about to reach the threshold value. |
JBoss thread pool current thread count, thread pool max threads. |
For more information about this sensor, see the JBoss AS documentation.
JBoss Data Grid
Event |
Description |
Metric |
Caches not in the running state. |
Checks the ratio of the number of caches created to the number of caches running in JBoss Data Grid. If the ratio is below a certain value, then it is considered a violation. |
Running and created caches of cache managers. |
For more information about this sensor, see the JBoss Data Grid documentation.
JVM
Event |
Description |
Metric |
Garbage collection activity high. |
A processing pipeline monitors the Garbage Collection time spent by the JVM Runtime Platform and validates it against a threshold. |
JVM Garbage Collection. |
JVM code cache is full. |
A processing pipeline monitors the maximum Code Cache usage of the JVM Runtime Platform. |
JVM maximum Code Cache usage. |
Perm Gen is full (CMS). |
A processing pipeline detects the maximum Perm Gen CMS Pools utilized. |
pools.CMS Perm Gen |
Perm Gen is full (G1). |
A processing pipeline detects the maximum Perm Gen G1 Pools utilized. |
pools.G1 Perm Gen |
Perm Gen is full (PS). |
A processing pipeline detects the maximum Perm Gen PS Pools utilized. |
pools.PS Perm Gen |
Threads are deadlocked. |
A detector monitors the JVM Runtime Platform and detects if there are any Deadlocked threads. |
Number of threads deadlocked (threads.deadlocked ). |
J9VM Memory Leak. |
A detector checks the growth rate of heap used after GC in MB per hour, and detects whether there is possibly a memory leak in the JVM. IBM J9 VM memory leak detection is an optional feature, disabled by default in the Instana backend. To enable this optional feature, see the page for your Instana deployment: SaaS, Self-Hosted Custom Edition (Kubernetes or Red Hat OpenShift Container Platform),
or Self-Hosted Classic Edition (Docker) |
memory.gc.after and memory.gc.before |
For more information about this sensor, see the JVM documentation.
Kubernetes
Kubernetes Cluster
Event |
Description |
Metric |
Kubernetes Cluster component status. |
Kubernetes reports that a Master-Component (API-server, scheduler, controller manager) is unhealthy. Due to a bug in Kubernetes, the health is not always reliably reported. Such unreliable reports are filtered so that they do not cause an alert
and are shown only on the Cluster detail page. |
Instana low level events. |
Kubernetes DaemonSet
Event |
Description |
Metric |
Available replicas is less than desired replicas. |
Checks whether the total number of available replicas is less than the number of desired replicas. This indicates that the Kubernetes DaemonSet is missing replica pods. |
Desired (desiredReplicas ) and available (availableReplicas ). |
Kubernetes Deployment
Event |
Description |
Metric |
Available replicas is less than desired replicas. |
Checks whether the total number of available replicas is less than the number of desired replicas. This indicates that the Kubernetes Deployment is missing replica pods. |
Desired (desiredReplicas ) and available (availableReplicas ). |
Kubernetes Namespace
Event |
Description |
Metric |
Allocatable cpu requests too low. |
Requested CPU is approaching max capacity (requested CPU / CPU capacity ratio is greater than 80%). |
CPU Requests Allocation (required_cpu_percentage ). |
Allocatable memory requests too low. |
Requested Memory is approaching max capacity (requested memory/memory capacity ratio is greater than 80%) |
Memory Requests Allocation (required_mem_percentage ). |
Allocatable pod count too low. |
Allocated pods are approaching maximum capacity (allocated pods/pods capacity ratio is greater than 80%). For a namespace, pods in the phases Pending , Running , and Unknown are counted as allocated.
The namespace capacity values are based on ResourceQuotas, which can be set per Namespace. For more information, see the Kubernetes documentation. |
Pods Allocation (used_pods_percentage ). |
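As a rough illustration of the pod allocation check described above, the following Python sketch computes the allocated-pods ratio for a namespace and compares it with the 80% threshold. The function name, the constants, and the input shape are hypothetical and chosen only for this example; this is not the product's implementation.

```python
# Hypothetical sketch: names and input shape are assumptions, not a product API.
ALLOCATED_PHASES_NAMESPACE = {"Pending", "Running", "Unknown"}
THRESHOLD = 0.80  # 80%, as stated in the event description


def namespace_pods_allocation_too_high(pod_phases, pods_capacity):
    """Return True when allocated pods / pods capacity is greater than 80%.

    pod_phases: iterable of pod phase strings in the namespace.
    pods_capacity: pod limit taken from the namespace ResourceQuota.
    """
    if pods_capacity <= 0:
        return False  # no quota-based capacity to compare against
    allocated = sum(1 for phase in pod_phases if phase in ALLOCATED_PHASES_NAMESPACE)
    return allocated / pods_capacity > THRESHOLD
```

For a node, the same ratio would count only pods in the phases Running and Unknown.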
Kubernetes Node
Event |
Description |
Metric |
Allocatable CPU too low. |
Requested CPU is approaching max capacity (requested CPU / CPU capacity ratio is greater than 80%). |
CPU Requests Allocation (required_cpu_percentage ). |
Allocatable memory too low. |
Requested Memory is approaching max capacity (requested memory/memory capacity ratio is higher than 80%). |
Memory Requests Allocation (required_mem_percentage ). |
Allocatable pod count too low. |
Allocated pods are approaching maximum capacity (allocated pods/pods capacity ratio is greater than 80%). For a node, pods in the phases Running and Unknown are counted as allocated. For more information, see
the Kubernetes documentation. |
Pods Allocation (alloc_pods_percentage ). |
Kubernetes Node condition status. |
The node reports a condition that is not ready for more than one minute. For a node, this means all conditions besides the Ready condition. For more information, see the Kubernetes documentation. |
Instana low level events. |
Kubernetes Pod
Event |
Description |
Metric |
Kubernetes Pod condition status. |
A pod is not ready for more than one minute, and the reason is not that it has completed (PodCondition=Ready, Status=False, Reason != PodCompleted). For more information, see the Kubernetes documentation. |
Instana low level events. |
For more information about this sensor, see the Kubernetes documentation.
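The pod condition rule above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the dictionary keys (`type`, `status`, `reason`, `last_transition`) and the function name are hypothetical, and the one-minute grace period is taken from the event description.

```python
from datetime import datetime, timedelta, timezone

NOT_READY_GRACE = timedelta(minutes=1)  # "more than one minute" from the description


def pod_not_ready_event(condition, now):
    """Return True when a pod's Ready condition has been False for over a minute.

    condition: hypothetical dict with keys 'type', 'status', 'reason',
    and 'last_transition' (a timezone-aware datetime).
    """
    if condition["type"] != "Ready" or condition["status"] != "False":
        return False
    if condition.get("reason") == "PodCompleted":
        return False  # completed pods are expected to be not ready
    return now - condition["last_transition"] > NOT_READY_GRACE
```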
Memcached Nodes
Event |
Description |
Metric |
Flush all command executed. |
Detects a high number of flush_all commands. |
Flush (cmd_flush ). |
High key eviction. |
Detects a high number of key evictions. |
Evictions (evictions ). |
Number of queued connections increases. |
Detects a high number of queued connections. |
Queued (conn_queued ). |
Number of yielded connections increases. |
Detects a high number of yielded connections. |
Yields (conn_yields ). |
Used bytes by Memcached reached maxbytes limit. |
Used bytes by Memcached reached the maxbytes limit. |
Used bytes. |
For more information about this sensor, see the Memcached documentation.
MongoDB Node
Event |
Description |
Metric |
Continuously increasing background flushing latency. |
Database reports increasing background flushing latency (sampling in a sliding window of 150 seconds). |
Last background flushing latency (backgroundFlushingLast ). |
Continuously increasing lock queue length. |
Monitors the MongoDB Lock Queue metric and checks whether the lock queue size is increasing too fast. |
Lock Queue Length (lockQueue ). |
Increasing page faults. |
Increasing page faults (sampling in a sliding window of 150 seconds). |
Number of Page Faults (pageFaults ). |
Journal commits in write lock growing |
Journal commits in write lock growing (sampling in a sliding window of 150 seconds). |
Journal Write Lock (journalWriteLock ). |
Too high ratio of non-mapped virtual memory |
Too high ratio of non-mapped virtual memory (triggered instantly and reported by the Instana Host sensor). |
Virtual and mapped . |
MongoDB Replica Set
Event |
Description |
Metric |
ReplicaSet has member(s) down. |
The member, as seen from another member of the set, is unreachable. |
unreachableNodeCount . |
ReplicaSet monitoring status. |
Monitors the health of all members of the MongoDB replica set. |
Slave Delays Count (slaveDelaysCount ), optimes count (optimesCount ), and monitored members count (monitoredMembersCount ). |
Replication lag is growing. |
Replication lag is growing (sampling in a sliding window of 150 seconds). |
Slave Delays (slaveDelays ) and Optimes (optimes ). |
Replica Set connection usage is high. |
Number of active connections is more than 90% of the maximum connections. |
Connections (connections ). |
For more information about this sensor, see the MongoDB documentation.
MySQL DB
Event |
Description |
Metric |
Available server connections are at limit. |
Ratio between the used connections and the connections limit is greater than the configured ratio threshold. |
Connections (status.THREADS_CONNECTED ). |
For more information about this sensor, see the MySQL documentation.
Nginx Server
Event |
Description |
Metric |
Nginx has a problem with offline peers. |
Inactive Peer (available only for NGINX Plus). |
Upstreams failed (nginx_plus.http.upstreams.peers.failed ). |
Nginx is dropping connections. |
Dropped connections. |
Dropped connections (connections.dropped ). |
Nginx is failing with SSL handshakes. |
Failed SSL handshakes (available only for NGINX Plus). |
Failed handshakes (nginx_plus.ssl.handshakes_failed ). |
Number of active connections is close to the max. |
Used connections ratio exceeds the configured ratio threshold for used connections. |
Active connections (connections.active ). |
For more information about this sensor, see the NGINX documentation.
Node.js App
Event |
Description |
Metric |
Garbage collection activity high. |
Checks whether the time spent in GC in the given window is above the given threshold. |
GC pause metrics. |
Health checks are failing. |
Checks whether there are any failing health checks. For more information, see Health check support. |
Health check result (healthcheckResult ). |
For more information about this sensor, see the Node.js documentation.
OpenShift Deployment Config
Event |
Description |
Metric |
Available replicas is less than desired replicas. |
Checks whether the total number of available replicas is less than the number of desired replicas. This indicates that the OpenShift DeploymentConfig is missing replica pods. |
Desired (desiredReplicas ) and available (availableReplicas ). |
For more information about this sensor, see the OpenShift documentation.
OTel Host
Event |
Description |
Metric |
CPU Wait time exceeded |
Checks whether the system spends a significant amount of time waiting for input or output operations. |
CPU Wait (cpu.wait ) |
CPU Steal time exceeded |
Specifies the number of allowed CPU Wait violations within a time frame. |
CPU Steal (cpu.steal ) |
CPU usage high |
Checks whether the CPU use is high. This event continuously evaluates data over the most recent 180-second interval. |
CPU User (cpu.user ) |
System load too high |
Checks whether the system load is high by comparing the load against two times the CPU cores of the machine. This event continuously evaluates data over the most recent 120-second interval. |
Load (load.avg_1m ) |
System memory exhausted |
Checks whether the system memory is close to fully used (triggered instantly). |
Memory free (memory.free ) and Memory used (memory.used ) |
Disk low capacity |
Detects short-term capacity problems of a device that has less free space than a static threshold (1 GB) or less than 1% of the total volume size. In addition, it detects whether the remaining time until the capacity reaches zero, given the current rate of change, is under 15 minutes. |
Disks free storage capacity |
For more information about this sensor, see the OpenTelemetry documentation.
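The three conditions of the Disk low capacity event can be sketched in Python as follows. This is only an illustration: the function name and parameters are hypothetical, the 1 GB static threshold is treated as 1 GiB, and the rate of change is assumed to be supplied as a fill rate in bytes per second.

```python
GIB = 1024 ** 3  # assumption: the 1 GB static threshold is treated as 1 GiB
MIN_FREE_FRACTION = 0.01  # less than 1% of the total volume size
MIN_SECONDS_TO_ZERO = 15 * 60  # under 15 minutes until the device is full


def disk_low_capacity(free_bytes, total_bytes, fill_rate_bytes_per_s):
    """Return True when any of the three low-capacity conditions holds."""
    if free_bytes < GIB:
        return True
    if total_bytes > 0 and free_bytes / total_bytes < MIN_FREE_FRACTION:
        return True
    if fill_rate_bytes_per_s > 0:
        # Projected time until free space reaches zero at the current rate.
        return free_bytes / fill_rate_bytes_per_s < MIN_SECONDS_TO_ZERO
    return False
```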
OracleDB
Event |
Description |
Metric |
Ratio between DB CPU Time and DB Time is low. |
Ratio between DB CPU Time and DB Time is below the configured threshold. |
DB CPU Time/DB Time Ratio (stats.cpuTimeDbTimeRatio ). |
Tablespace space usage is high. |
Tablespace used space is greater than the configured amount of maximum space. |
Tablespace used space percentage. |
Total amount of sessions at maximum. |
Used sessions ratio exceeds the configured used sessions ratio threshold. |
Sessions/Session Limit (stats.usedSessionsRatio ). |
For more information about this sensor, see the OracleDB documentation.
OS process
Event |
Description |
Metric |
CPU Usage |
Process is causing high CPU usage on host. |
The result of a high CPU usage rule evaluation on the underlying host and the CPU user time of the given process. |
Open Files Usage. |
Process is opening files faster than it closes them (the ratio of current to maximum open files exceeds the threshold). |
Used (openFiles.used ). |
Abnormal termination. |
Process terminated as a result of an uncaught signal. |
|
Abnormal termination. |
Process terminated with a non-zero exit code. |
|
For more information about this sensor, see the OS process documentation.
PHP-FPM Runtime
Event |
Description |
Metric |
Frequent restarts of PHP-FPM worker pool. |
Checks for frequent restarts of a PHP-FPM worker pool by evaluating the number of its restarts in a given time window against a given threshold. |
Start times for a worker pool. |
Listen Backlog configured over capacity. |
Checks whether the listen backlog of a worker pool is over the configured capacity. |
Worker pool queue length. |
Too many connections reset. |
Checks whether the number of connection resets is above the given threshold in the given time window. |
Connection resets metric for worker pool. |
Too many requests piling up in Listen Backlog. |
Checks the size of various PHP-FPM worker queues and validates it against the threshold value. |
Listen queue size metrics for various PHP-FPM worker queues. |
Too many slow requests. |
Checks the ratio of slow requests on all monitored PHP-FPM worker pools. |
Slow requests and accepted connection metric for a worker pool of a PHP-FPM instance. |
For more information about this sensor, see the PHP documentation.
Synthetic Check
Event |
Description |
Metric |
Remote target is not reachable. |
Checks whether the percentage of failed communication attempts in the given sliding window is above the given threshold. |
Status of Ping (status ). An HTTP status code in the range 200-206 or 300-307 results in a healthy status. For ICMP, the exit value 0 is seen as healthy and the value 1 as unhealthy. In addition, a maximum execution time of 2 seconds is set. |
For more information about this sensor, see the Synthetic Check documentation.
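The health classification and the sliding-window failure check described above can be sketched in Python as follows. The function names and the representation of the window as a list of booleans are assumptions made for this example only.

```python
def http_status_healthy(code):
    """HTTP codes 200-206 and 300-307 count as healthy."""
    return 200 <= code <= 206 or 300 <= code <= 307


def icmp_healthy(exit_value):
    """For ICMP, exit value 0 is healthy; exit value 1 is unhealthy."""
    return exit_value == 0


def failure_percentage_exceeded(results, threshold_pct):
    """results: hypothetical list of booleans (True = healthy) in the sliding window."""
    if not results:
        return False
    failed = sum(1 for ok in results if not ok)
    return 100.0 * failed / len(results) > threshold_pct
```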
PostgreSQL DB
Event |
Description |
Metric |
Active connection usage. |
Number of active connections is more than 90% of the maximum connections. |
Connection Usage (max_conn_pct ). |
For more information about this sensor, see the PostgreSQL documentation.
Process
Event |
Description |
Metric |
High CPU usage. |
Evaluates whether the given process is causing high CPU usage on a host. |
Results of high CPU usage rule evaluation on the underlying host and CPU user time of the given process. |
Too many open files. |
Open files percentage is higher than the configured threshold. |
Used (openFiles.used ). |
SAP ABAP
Event |
Description |
Metric |
Lock contention detected |
Detects lock contention and provides details about the lock mode and lock object. |
ABAP Lock Contention |
ABAP dumps generated |
Detects ABAP dumps that are generated and provides details on the severity. |
ABAP Dumps Severity |
IDoc Inbound and Outbound errors occurred |
Detects errors for both Inbound and Outbound IDocs. |
Inbound IDoc Error and Outbound IDoc Error |
High CPU usage detected |
Detects if the CPU usage is greater than 90%. |
High CPU usage |
High memory usage detected |
Detects if the memory usage is greater than 90%. |
High Memory Usage |
Work process in stopped, shutdown, or PRIV mode (private) detected |
Detects if the work process is in PRIV mode (private), stopped, or shutdown. |
Work Process Status |
File system usage crossing threshold detected |
Detects if the file system usage crosses the threshold of 80%. |
File System Usage |
Connection issues detected |
Detects incorrect username, password, gateway failure, or incorrect login attempts. |
Connectivity Status |
Authorization missing detected |
Detects if the user is missing the authorization to run a function module. |
Authorization check |
User account locked detected |
Detects if the user account is locked due to login failures. |
User Account lock |
Spool Error detected |
Detects spool error. |
Spool Error |
Dialog response time exceeding threshold |
Detects if the dialog response time exceeds the preferred threshold. |
Dialog Response Time |
Dialog work process exceeding threshold |
Detects if the dialog work process is running longer than 10 seconds. |
Dialog Work Process |
Transport request release detected |
Detects whether a transport request is released or protected. |
Transport Request |
For more information about this sensor, see SAP ABAP.
SAP HANA
Event |
Description |
Metric |
High CPU utilization |
Detects if the total CPU usage exceeds 90% |
Total CPU Utilization |
High HANA memory usage |
Detects if the used memory exceeds 90% of the allocated limit |
HANA Memory Usage |
High host memory usage |
Detects if the host memory usage exceeds 90% |
Host Memory Usage |
High Disk usage |
Detects if the disk usage exceeds 90% |
Disk Usage Summary |
High number of queuing connections |
Detects if the queuing connections are more than one |
Connections |
High number of blocked sessions |
Detects if the blocked sessions are more than one |
Sessions |
High number of blocking sessions |
Detects if the blocking sessions are more than one |
Sessions |
High number of blocked threads |
Detects if the blocked threads are more than 10 |
Threads |
High number of blocked SQL threads |
Detects if the blocked SQL threads are more than 10 |
SQL Threads |
High number of blocked job worker threads |
Detects if the blocked job worker threads are more than 10 |
Job Worker Threads |
High number of pending requests |
Detects if the pending requests are more than 10 |
Requests |
High process CPU |
Detects if any of the process CPUs exceeds 90% |
Service Details |
Service status is not active |
Detects if service status is not active |
Service Details |
Backup failed |
Detects failed backups |
Backup Progress |
User locks occurred |
Detects user locks |
User Locks |
Scheduler jobs failed |
Detects failed scheduler jobs |
Scheduler Jobs |
System events occurred |
Detects system events |
System Events |
Archive log backup failed |
Detects failed log backups |
Archive Log Backup |
Transaction is not active |
Detects partial aborting and aborting transactions |
Transaction Statistics |
For more information about the SAP HANA sensor, see Monitoring SAP HANA.
Service
Event |
Description |
Metric |
Complete drop in calls. |
Detects a rapid drop to zero (essentially the service is not being called anymore) in the values of the calls KPI metric relative to the values in the last 30 minutes. The magnitude of the drop in calls must also exceed the configured relative and absolute threshold parameters. |
Calls/s (count ). |
Error rate too high. |
Detects a consistently high error rate when the average errors KPI within the last four minutes is above the given threshold value. |
Error rate (error_rate ). |
Increasing trend in error rate. |
Checks for the presence of an increasing trend in a given metric. The rule is tuned to detect weakly monotonic increases in the given metric. The detector is, however, not strict and tolerates a certain amount of decrease in the metric value inside the trend candidate. |
Error rate (error_rate ). |
Sudden drop in calls. |
Detects a rapid drop in the values of the calls KPI metric relative to the values in the last 30 minutes. The magnitude of the drop in calls must also exceed the configured relative and absolute threshold parameters. |
Calls/s (count ). |
Sudden increase in error rate. |
Detects a rapid increase in the values of the errors KPI relative to the KPI values in the last 10 minutes. The magnitude of the increase in errors must also exceed the configured relative and absolute threshold parameters. |
Error Rate (error_rate ). |
Sudden increase in latency. |
Detects a rapid increase in the given latency KPI percentile relative to the KPI values in the last 30 minutes. The magnitude of the increase in latency must also exceed the configured relative and absolute threshold parameters. |
Latency 50th (duration.50th ). |
Sudden increase in latency for a fraction of requests. |
Detects a rapid increase in the given latency KPI percentile relative to the KPI values in the last 30 minutes. The magnitude of the increase in latency must also exceed the configured relative and absolute threshold parameters. |
Latency 99th (duration.99th ). |
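The "sudden drop" style of rule in this table combines a relative and an absolute threshold. The following Python sketch shows one plausible reading, in which the current value is compared with the mean of the preceding 30 minutes and the drop must exceed both thresholds. The function name, the use of the mean as the baseline, and the parameter names are assumptions; the actual detector is not documented here.

```python
from statistics import mean


def sudden_drop(baseline, current, relative_threshold, absolute_threshold):
    """Return True when `current` dropped sharply against the baseline samples.

    baseline: calls/s samples from the preceding 30 minutes (assumed input).
    relative_threshold: minimum drop as a fraction of the baseline mean.
    absolute_threshold: minimum drop in calls/s.
    """
    if not baseline:
        return False
    reference = mean(baseline)
    if reference <= 0:
        return False  # nothing to drop from
    drop = reference - current
    if drop <= 0:
        return False
    return drop > absolute_threshold and drop / reference > relative_threshold
```

Requiring the absolute threshold as well keeps low-traffic services from alerting on small fluctuations that are large only in relative terms.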
Spring Boot App
Event |
Description |
Metric |
Number of active sessions reached maximum number. |
A processing pipeline detects the number of active sessions of the Spring Boot application in the given time window. It validates whether the number of active sessions is greater than the threshold value. |
Active sessions (metrics.httpsessions.active ). |
Spring Boot Application down. |
Monitors the status of the Spring Boot application. |
Status of Spring Boot application (metrics.status ). |
For more information about this sensor, see the Spring Boot documentation.
Sybase Server
Event |
Description |
Metric |
Available server connections are at limit. |
Number of connections is close to 100% of connections limit per server. |
Connections (stats.connCount ). |
The maximum number of databases is at limit. |
Number of databases is close to 100% of databases limit per server. |
databasesCount . |
For more information about the SAP SQL Anywhere sensor, see Monitoring SAP SQL Anywhere.
Synthetic PoP
Event |
Description |
Metric |
Synthetic PoP status |
Checks whether the Synthetic PoP can connect to the Instana backend. |
Status of Synthetic PoP (status ) |
Playback engine status |
Check whether the playback engine is overloaded. |
Workload status of the playback engines browserscript.workloadStatus , http.workloadStatus , javascript.workloadStatus , and ism.workloadStatus . |
Retrieving credentials failed |
Failed to get Synthetic credentials from the Instana backend. |
Error code and URL of pop_get_cred_failed (error.pop_get_cred_failed ). |
Retrieving tests failed |
Failed to get Synthetic tests from the Instana backend. |
Error code and URL of pop_get_test_failed (error.pop_get_test_failed ). |
Reporting test results failed |
Failed to post Synthetic test result to the Instana backend. |
Error code and URL of pop_report_result_failed (error.pop_report_result_failed ). |
Reporting test result details failed |
Failed to post Synthetic test result details to the Instana backend. |
Error code and URL of pop_report_result_details_failed (error.pop_report_result_details_failed ). |
Reporting result queue depth is high |
Detects whether the result queue depth is high. |
ResultQueueDepthHigh (resultQueueDepthHigh ). |
For more information about this sensor, see the Synthetic PoP documentation.
Tibco EMS
Event |
Description |
Metric |
Connections exceeds max available connections. |
The max number of connections is almost used up. |
Connections Count (connectionCount ). |
Messages memory usage exceeds the limit. |
The maximum message memory is almost used up. |
Messages Memory (messagesMemory ). |
Queues pending messages exceeds the limit. |
The max number of pending messages for a queue is almost used up. |
Queue pending messages usage. |
Topics pending messages exceeds the limit. |
The max number of pending messages for a topic is almost used up. |
Topic pending messages usage. |
For more information about this sensor, see the Tibco EMS documentation.
Tomcat
Event |
Description |
Metric |
Active connections reached maximum. |
Detects if the number of connections of a specific connector is reaching its maximum configured value. |
Connector connection count. |
Sudden drop in the number of session. |
Checks for a significant drop in the number of sessions. |
Total session count (totalSessionCount ). |
Sudden increase in the number of session. |
Checks for a significant increase in the number of sessions. |
Total session count (totalSessionCount ). |
Threads number reached maximum. |
Detects if the number of busy threads of a specific connector is reaching its maximum configured value. |
Number of connector busy threads. |
For more information about this sensor, see the Tomcat documentation.
Varnish Node
Event |
Description |
Metric |
Sudden drop in the number of requests. |
Checks for a sudden drop in the number of client requests. |
Received client requests (client_req ). |
Sudden increase in evicted objects. |
Checks for a sudden increase in the number of evicted objects. |
Nuked Objects (n_lru_nuked ). |
Thread creation is failing. |
Too many thread creations failed. |
Failed (threads_failed ) and limited (threads_limited ). |
Varnish backend is marked unhealthy. |
Varnish backend server is unhealthy or is not available. |
Unhealthy (backend_unhealthy ). |
Varnish hit rate is low. |
Varnish hit rate is very low. |
Cache Hit Rate (cache_hit_rate ). |
Varnish is out of worker threads. |
Varnish is out of worker threads. |
Connections dropped due to a full queue (sess_dropped ). |
For more information about this sensor, see the Varnish documentation.
Vault
Event |
Description |
Metric |
Vault is sealed. |
Detects if the sealed status is set to true. |
Sealed (sealed ). |
Sudden increase in secret reads |
Checks for a sudden increase (increase by 60% based on the average of the last 5 minutes) in the number of secrets read. |
Secrets read count (secret.read.count ). |
For more information about this sensor, see the Vault documentation.
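The "Sudden increase in secret reads" rule can be sketched in Python as follows. This is a minimal illustration: the function name and inputs are hypothetical, and treating "increase by 60%" as "current value at least 1.6 times the five-minute average" is an assumption about how the comparison is made.

```python
def sudden_increase_in_reads(last_five_minutes, current, increase=0.60):
    """Return True when `current` is at least 60% above the recent average.

    last_five_minutes: secret read counts sampled over the last 5 minutes
    (assumed input shape).
    """
    if not last_five_minutes:
        return False
    average = sum(last_five_minutes) / len(last_five_minutes)
    if average <= 0:
        return False  # no baseline to compare against
    return current >= average * (1 + increase)
```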
WebLogic Server
Event |
Description |
Metric |
Datasource error state. |
A processing pipeline monitors status codes of the WebLogicApplications data sources, and checks if any data source is unhealthy. |
WebLogic datasource status. |
Health state |
Detects overall system degradation based on reported health state. |
Health State status. |
For more information about this sensor, see the WebLogic documentation.
WebSphere
Event |
Description |
Metric |
WebContainer thread pool active threads reached maximum. |
A processing pipeline validates that the number of active threads in the WebContainer thread pool is reaching the maximum limit. |
Active threads (threadPools.webContainer.activeThreads ). |
WebSphere certificate is about to expire. |
Remaining days before certificate expiration is less than the threshold value. |
Remaining days before expiration (certificates.{certificate}.expDaysLeft ) |
For more information about this sensor, see the WebSphere Application Server documentation.
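The certificate expiration rule compares the remaining days with a threshold. A minimal Python sketch, assuming the expiration date is available as a `date` and using a hypothetical function name and threshold parameter:

```python
from datetime import date


def certificate_about_to_expire(expires_on, today, threshold_days):
    """Return True when fewer than `threshold_days` days remain before expiration."""
    days_left = (expires_on - today).days
    return days_left < threshold_days
```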
ZooKeeper
Event |
Description |
Metric |
Maximum request latency is high. |
A processing pipeline checks if the maximum request latency is reaching the threshold value. |
Max request latency (max_request_latency ). |
Number of queued requests is high. |
A processing pipeline detects the number of queued requests and validates whether the number is reaching the threshold value. |
Outstanding request count (outstanding_requests ). |
For more information about this sensor, see the ZooKeeper documentation.