Built-in Events Reference

The Events page lists all currently available events: the built-in events that are provided out of the box and any user-defined custom events. To view the Events page, click Settings -> Events.

The list can be filtered by:

  • Type: built-in event or custom event.
  • Incidents and severity: incidents, warning, or critical.
  • Full text search.

Important: Built-in events can't be modified. You can create custom events based on the same entities and metrics used for built-in events. Custom events trigger issues or incidents based on the thresholds of an individual metric of any given entity.

.NET App

Event Description Metric
Garbage collection activity high. Monitors the garbage collection (GC) time spent by the CLR runtime platform and checks it against the maximum allowed percentage value. GC time (mem.time_in_gc).

For more information about this sensor, see the .NET documentation.

ActiveMQ

Event Description Metric
Dead-letter queue size is growing. Dead-letter queue size is increasing. Messages sent are not routed to their correct destination. ActiveMQ queue size.
Memory usage is close to the limit. Memory usage is close to 100% of the memory limit. Memory Usage (memoryPercentage).
Store usage is close to the limit. Store usage is close to 100% of the store limit. Store Usage (storePercentage).

For more information about this sensor, see the ActiveMQ documentation.

ActiveMQ Artemis

Event Description Metric
ActiveMQ Artemis has no connections. There are no connections in the last 5 seconds. The current number of connections is equal to the configured NoConnections count. Total Connections (totalConnectionCount).
ActiveMQ Artemis has no consumers. There are no consumers in the last 5 seconds. The current number of consumers is equal to the configured NoConsumers count. Total Consumers (totalConsumerCount).
Addresses memory usage is close to the limit. Memory usage of all addresses is close to 100% of its memory limit. Address Memory Usage (addressMemoryPercentage).

For more information about this sensor, see the ActiveMQ Artemis documentation.

Apache HTTPd

Event Description Metric
Apache child processes are stuck performing DNS lookups. Detects high usage of server workers by DNS lookup. Dns (worker.dns).
Logging is slowing down Apache HTTPd performance. Detects high usage of server workers for logging purposes. Logging (worker.logging).
Number of busy workers is approaching max workers. Detects a high percentage of busy workers. Busy workers (busy_workers).

For more information about this sensor, see the Apache HTTPd documentation.

Application

Event Description Metric
Complete drop in calls. Detects a rapid drop to zero (essentially, the service is no longer being called) in the values of the calls KPI metric relative to the values in the last 30 minutes. The magnitude of the drop in calls should also exceed the listed relative and absolute threshold parameters. Calls/s (count).
Error rate too high. Detects a consistently high error rate when the average errors KPI within the last four minutes is above the given threshold value. Error Rate (error_rate).
Increasing trend in error rate. Checks for the presence of an increasing trend in a given metric. The rule is tuned to detect weakly monotonic increases in the given metric. The detector is, however, not strict and tolerates a certain amount of decrease in the metric value inside the trend candidate. Error Rate (error_rate).
Sudden drop in calls. Detects a rapid drop in the values of the calls KPI metric relative to the values in the last 30 minutes. The magnitude of the drop in calls should also exceed the listed relative and absolute threshold parameters. Calls/s (count).
Sudden increase in error rate. Detects a rapid increase in the values of the errors KPI relative to the KPI values in the last 10 minutes. The magnitude of the increase in errors should also exceed the listed relative and absolute threshold parameters. Error Rate (error_rate).
Sudden increase in latency. Detects a rapid increase in the given latency KPI percentile relative to the KPI values in the last 30 minutes. The magnitude of the increase in latency should also exceed the listed relative and absolute threshold parameters. Latency 50th (duration.50th).
Sudden increase in latency for a fraction of requests. Detects a rapid increase in the given latency KPI percentile relative to the KPI values in the last 30 minutes. The magnitude of the increase in latency should also exceed the listed relative and absolute threshold parameters. Latency 99th (duration.99th).
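
The drop rules in this table combine a baseline taken from the preceding 30 minutes with a relative and an absolute threshold. The following Python sketch illustrates how such a combination might work. The function names, the use of a simple mean as the baseline, and the default threshold values are assumptions made for illustration; they are not the product's actual detector.

```python
from statistics import mean


def is_sudden_drop(baseline_calls, current_calls,
                   relative_threshold=0.5, absolute_threshold=10.0):
    """Return True when calls/s dropped sharply against the baseline.

    baseline_calls: calls/s samples from the preceding window (for example, 30 minutes).
    current_calls: the latest calls/s value.
    Both threshold defaults are hypothetical, not product values.
    """
    if not baseline_calls:
        return False  # no history, nothing to compare against
    baseline = mean(baseline_calls)
    drop = baseline - current_calls
    if drop <= 0 or baseline == 0:
        return False  # calls did not decrease
    # The drop must exceed BOTH the relative and the absolute threshold.
    return drop / baseline > relative_threshold and drop > absolute_threshold


def is_complete_drop(baseline_calls, current_calls, **thresholds):
    """A complete drop is a sudden drop that ends at zero calls."""
    return current_calls == 0 and is_sudden_drop(
        baseline_calls, current_calls, **thresholds)
```

Requiring both thresholds keeps low-traffic services from triggering the rule: a fall from 4 calls/s to 1 call/s is large in relative terms but small in absolute terms.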

AWS DynamoDB

Event Description Metric
Ratio of consumed and provisioned reads is critical. Detects high ratio of consumed and provisioned reads. Consumed read capacity (consumed_read).
Ratio of consumed and provisioned writes is critical. Detects high ratio of consumed and provisioned writes. Consumed write capacity (consumed_write) and provisioned write capacity (provisioned_write).

For more information about this sensor, see the AWS DynamoDB documentation.

AWS MSK

Event Description Metric
Active Controller Count. Checks for an unusual number of active controllers in the Kafka cluster. Active controller count (active_controller_count).
Offline Partitions Count. Defines the maximum allowed proportion of violations of offline partitions within the specified time window. Offline partitions count (offline_partitions_count).
Network Processor Low Idle Time. Checks whether the Kafka network thread is under high load. Network processor idle time (network_processor_idle).
Request Handler Low Idle Time. Checks whether the Kafka request handler is under high load. Request handler idle time (request_handler_idle).
Under-replicated partitions Count. Checks whether the number of under-replicated partitions exceeds the expected number. Under-replicated partitions (under_replicated_partitions).

For more information about this sensor, see the AWS MSK documentation.

AWS RDS

Event Description Metric
CPU credit balance reaching zero. Checks if the CPU credit balance is getting closer to zero. CPU Credit Balance (cpu_credit_balance).
Number of CPU credits consumed is high. Checks if the percentage of CPU credits consumed by an instance is reaching max capacity. CPU Credit Usage (cpu_credit_usage) and CPU Credit Balance (cpu_credit_balance).

For more information about this sensor, see the AWS RDS documentation.

Azure API Management Service

The Azure API Management sensor automatically runs any configured custom health checks every minute. If the checks fail for at least one minute, an issue is raised to inform you.

Event Description Metric
Azure API Management capacity is getting closer to the max capacity limit. Checks whether Azure API Management is using more than 90% of the available capacity. Capacity (metrics.Capacity).

For more information about this sensor, see the Azure API Management documentation.

Azure CosmosDB

Event Description Metric
Azure CosmosDB storage capacity is getting closer to the max capacity limit. Detects whether the Azure CosmosDB storage capacity is reaching the max capacity limit. CosmosDB storage capacity.

For more information about this sensor, see the Azure CosmosDB documentation.

Azure Redis

The Azure Redis Cache sensor runs custom health checks every minute. If the checks fail for at least one minute, an issue is raised to inform you.

Event Description Metric
Azure Redis Cache client connections are getting closer to max connections limit. Azure Redis Cache is using more than 90% of available client connections. Connected Clients (connectedclients).
Azure Redis Cache memory usage is getting closer to max memory limit. Azure Redis Cache is using more than 90% of available memory. Percentage of Memory Used (usedmemorypercentage).

For more information about this sensor, see the Azure Redis documentation.

Azure SQL Database

The Azure SQL Database sensor runs custom health checks every minute. If the checks fail for at least one minute, an issue is raised to inform you.

Event Description Metric
Database is running out of space. Checks if Azure SQL Database is running out of space. Warning limit is at 80% and the critical limit is at 90% of the used size. metrics.storage_percent.
Database status. Unhealthy state is caused by the database being unavailable. A database can be unavailable if one of the following conditions is true:
  • The database has been set offline by the user
  • The database is being restored from backup
  • The database is being recovered
  • The database has been corrupted
  • The database has been set to the Emergency state by the administrator
  • The database is in the process of being created by copying another database
metrics.statusCode.
The total DTU utilization is getting closer to max DTU limit. Checks if the Azure SQL Database DTU utilization is reaching max DTU limit. Warning limit is at 75% and the critical limit is at 85% of the DTU utilization. metrics.dtu_consumption_percent.
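
The warning and critical limits quoted in this table can be read as a two-level threshold check. The following Python sketch shows that reading. The function name, the constant names, and the choice to treat the limits as inclusive are assumptions for illustration only.

```python
# Limits taken from the table above, as (warning, critical) percentages.
STORAGE_LIMITS = (80.0, 90.0)   # metrics.storage_percent
DTU_LIMITS = (75.0, 85.0)       # metrics.dtu_consumption_percent


def classify(value_percent, warning_limit, critical_limit):
    """Map a percentage metric to 'ok', 'warning', or 'critical'.

    Treating a value equal to a limit as a breach is an assumption.
    """
    if value_percent >= critical_limit:
        return "critical"
    if value_percent >= warning_limit:
        return "warning"
    return "ok"
```

For example, `classify(85.0, *STORAGE_LIMITS)` evaluates a database that uses 85% of its storage, and `classify(85.0, *DTU_LIMITS)` evaluates the same percentage against the stricter DTU limits.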

Azure MySQL Database

The Azure MySQL Database sensor runs custom health checks every minute. If the checks fail for at least one minute, an issue is raised to inform you.

Event Description Metric
Available server connections are getting closer to the max connections limit. The usage of Azure MySQL Server connections is more than 85% of the available client connections. Active Connections (active_connections).

For more information about this sensor, see the Azure MySQL documentation.

Azure Service Bus

The Azure Service Bus sensor runs custom health checks every minute. If the checks fail for at least one minute, an issue is raised to inform you.

Event Description Metric
Azure Service Bus has at least one message in DL queue. Checks if the Azure Service Bus has at least one message in the dead-letter queue. Dead Lettered Messages (deadletteredMessages).

For more information about this sensor, see the Azure Service Bus documentation.

Azure SQL Elastic Pool

The Azure SQL Elastic Pool sensor runs custom health checks every minute. If the checks fail for at least one minute, an issue is raised to inform you.

Event Description Metric
The total eDTU utilization is getting closer to max eDTU limit. Checks if the Azure SQL Elastic Pool eDTU utilization is reaching the maximum eDTU limit. metrics.dtu_consumption_percent.

Cassandra

Cassandra Cluster

Event Description Metric
Unreachable Cassandra nodes. One or more nodes are down. Number of unreachable nodes (unreachableNodes).

Cassandra Node

Event Description Metric
Blocked threadpools. Checks whether there are stages with blocked threads. Blocked threads metric for a stage.
Dropped messages. Checks whether there are thread pools dropping messages. Dropped messages metric for a stage.
Pending compactions. Checks whether pending compactions are increasing. Write (Pending) (compaction.pending).
Pending mutations. Checks whether there are pending mutations. Counter Mutation (stage.mutation.pending).
Pending reads. Checks whether there are pending reads. Read Repair (stage.read.pending).
Pending request responses. Checks whether there are pending request responses. Write (Mutation) (stage.requestresponse.pending).
Sudden drop in write requests. Checks for a sudden drop in the number of Cassandra write requests. Writes (clientrequests.write.count).

For more information about this sensor, see the Cassandra documentation.

Ceph

Event Description Metric
Ceph cluster status. Ceph cluster is reporting a problem; HEALTH_WARN or HEALTH_ERR. Status of the Ceph Cluster (overall_status).
Monitor quorum is not reached. The number of healthy monitors is less than 50% of all monitors. Number of monitors (num_mons) and number of active monitors (num_active_mons).
Osd(s) full capacity state. Some OSDs are reporting a full state. Number of full osds (num_full_osds).
Osd(s) near full capacity state. Some OSDs are reporting a near-full state. Number of near full osds (num_near_full_osds).

For more information about this sensor, see the Ceph documentation.
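
The "Monitor quorum is not reached" rule compares two metrics, num_mons and num_active_mons. The following Python sketch expresses the condition as stated in the table (fewer than 50% of all monitors are healthy). The function name and the handling of a cluster that reports zero monitors are assumptions.

```python
def quorum_not_reached(num_mons, num_active_mons):
    """Return True when fewer than 50% of all monitors are active.

    Integer arithmetic avoids floating-point rounding at exactly 50%.
    Returning False for a cluster with no monitors is an assumption.
    """
    if num_mons <= 0:
        return False
    return num_active_mons * 2 < num_mons
```

With five monitors, two active monitors (40%) trigger the condition, while two of four (exactly 50%) do not.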

Consul (HashiCorp)

Event Description Metric
Consul cluster health. Detects the overall health of the cluster and if any of the nodes are considered unhealthy by Autopilot. Consul autopilot health status (consul.autopilot.healthy).

CRI-O

Event Description Metric
Memory exhausted. Detects when the container memory usage exceeds specified limits. RSS (memory.total_rss).

Docker

Event Description Metric
Memory exhausted. When the container memory usage exceeds specified limits, a memory warning threshold or a memory critical threshold alert is displayed. RSS (memory.total_rss).

For more information about this sensor, see the Docker documentation.

Elasticsearch

Elasticsearch Cluster

Event Description Metric
Cluster status. Monitors the status of Elasticsearch cluster. Number of Elasticsearch nodes (node_count) and the status of Elasticsearch cluster (cluster_status).
Elasticsearch is in split-brain situation. Checks whether an Elasticsearch cluster has more than one master node. Split brain is also triggered for environments that have two Elasticsearch clusters with the same name. Master nodes count in the Elasticsearch cluster.

Elasticsearch Node

Event Description Metric
Capacity limit while rebalancing. Characterizes the node as being at the capacity limit by checking whether it is relocating shards while at the capacity limit. Results of the capacity limit evaluation and shard relocation.
Heap overallocation. Evaluates whether the heap size setting of Elasticsearch is too large. Maximum heap size of the underlying JVM and the total memory on the underlying host.
High heap usage. Checks the heap usage of the node along with the recent workload characteristics to detect whether the heap usage is too high. Heap usage by the underlying JVM and workload characterization.
Node at capacity limits. Checks whether the node is at the capacity limit, which is determined by the presence of the following issues: high load and CPU usage on the host, high heap usage, and high GC time in the Elasticsearch JVM. High load and high CPU time on the host, high heap usage by Elasticsearch, and high GC time on the underlying JVM.
Node status. Checks the cluster status provided by Elasticsearch. High load and high CPU time on the host, high heap usage by Elasticsearch, and high GC time on the underlying JVM.
Rejected actions. Checks for the number of rejected threads being too high. Index (threads.index_rejected), search (threads.search_rejected), bulk (threads.bulk_rejected), and get (threads.get_rejected).

For more information about this sensor, see the Elasticsearch documentation.

Endpoint

Event Description Metric
Complete drop in calls. Detects a rapid drop to zero (essentially, the service is no longer being called) in the values of the calls KPI metric relative to the values in the last 30 minutes. The magnitude of the drop in calls should also exceed the listed relative and absolute threshold parameters. Calls/s (count).
Error rate too high. Detects a consistently high error rate when the average errors KPI within the last four minutes is above the given threshold value. Error Rate (error_rate).
Error rate too high for a Synthetic endpoint. Detects a consistently high error rate of a Synthetic endpoint when the average errors KPI within the last four minutes is above the given threshold value. Synthetic error rate (synthetic_error_rate).
Increasing trend in error rate. Checks for the presence of an increasing trend in a given metric. The rule is tuned to detect weakly monotonic increases in the given metric. The detector is, however, not strict and tolerates a certain amount of decrease in the metric value inside the trend candidate. Error Rate (error_rate).
Sudden drop in calls. Detects a rapid drop in the values of the calls KPI metric relative to the values in the last 30 minutes. The magnitude of the drop in calls should also exceed the listed relative and absolute threshold parameters. Calls/s (count).
Sudden drop in Synthetic calls. Detects a rapid drop in the values of the calls KPI metric relative to the values in the last 30 minutes. The magnitude of the drop in calls should also exceed the listed relative and absolute threshold parameters. Synthetic calls/s (synthetic_count).
Sudden increase in error rate. Detects a rapid increase in the values of the errors KPI relative to the KPI values in the last 10 minutes. The magnitude of the increase in errors should also exceed the listed relative and absolute threshold parameters. Error Rate (error_rate).
Sudden increase in latency. Detects a rapid increase in the given latency KPI percentile relative to the KPI values in the last 30 minutes. The magnitude of the increase in latency should also exceed the listed relative and absolute threshold parameters. Latency 50th (duration.50th).
Sudden increase in latency for a fraction of requests. Detects a rapid increase in the given latency KPI percentile relative to the KPI values in the last 30 minutes. The magnitude of the increase in latency should also exceed the listed relative and absolute threshold parameters. Latency 99th (duration.99th).
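
The "Error rate too high" rules average the errors KPI over the last four minutes and compare the result with a threshold. The following Python sketch shows one way to keep such a sliding-window average. The class name, its methods, and the sample threshold are hypothetical; only the four-minute window comes from the table.

```python
from collections import deque


class ErrorRateMonitor:
    """Track error-rate samples and test their average over a sliding window."""

    def __init__(self, threshold, window_seconds=240):
        self.threshold = threshold            # hypothetical, user-chosen value
        self.window_seconds = window_seconds  # four minutes, as in the table
        self._samples = deque()               # (timestamp, error_rate) pairs

    def add(self, timestamp, error_rate):
        """Record a sample and discard samples that left the window."""
        self._samples.append((timestamp, error_rate))
        cutoff = timestamp - self.window_seconds
        while self._samples and self._samples[0][0] <= cutoff:
            self._samples.popleft()

    def is_too_high(self):
        """Return True when the windowed average exceeds the threshold."""
        if not self._samples:
            return False
        average = sum(rate for _, rate in self._samples) / len(self._samples)
        return average > self.threshold
```

Averaging over the window means that a single spike does not raise the event, which matches the "consistently high" wording of the rule.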

etcd

Event Description Metric
Abnormally high disk backend commit duration. Detects high disk backend commit duration. Disk backend commit duration (health.disk_backend_commit_duration).
Abnormally high disk WAL fsync duration. Detects high disk WAL fsync duration. Disk fsync duration (health.disk_wal_fsync_duration).
Abnormally high snapshot duration. Detects high duration of saving a snapshot. Snap save total duration (health.debugging_snap_save_total_duration).
Frequent leader changes seen in last minute. Detects a high number of leader changes in the last minute. Server leader changes (health.server_leader_changes).
Member doesn't have leader. Detects a member who does not have a leader (unavailable). Server has leader (health.server_has_leader).
Proposal ratio analysis. Detects an unusual fall in applied proposals and an unusual rise in pending and failed proposals. Number of proposals committed (health.server_proposals_committed), number of proposals applied (health.server_proposals_applied), number of proposals pending (health.server_proposals_pending), and number of proposals failed (health.server_proposals_failed).
Usage of open file descriptors is critical. Detects a high usage of open file descriptors. Number of open file descriptors (health.process_open_fds) and the maximum number of file descriptors (health.process_max_fds).

For more information about this sensor, see the etcd documentation.
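
The "Usage of open file descriptors is critical" rule relates two metrics, health.process_open_fds and health.process_max_fds. The following Python sketch shows the ratio calculation. The function names and the 0.9 default are assumptions, because the table does not state the threshold.

```python
def fd_usage_ratio(open_fds, max_fds):
    """Return the fraction of available file descriptors that are in use."""
    if max_fds <= 0:
        raise ValueError("max_fds must be positive")
    return open_fds / max_fds


def fd_usage_is_critical(open_fds, max_fds, critical_ratio=0.9):
    """Return True when usage reaches the critical ratio.

    The default of 0.9 is a hypothetical value chosen for illustration.
    """
    return fd_usage_ratio(open_fds, max_fds) >= critical_ratio
```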

Garden Container

Event Description Metric
Memory exhausted. Container memory usage is getting close to its memory limit. Usage (memory.usage).

For more information about this sensor, see the Garden documentation.

Glassfish

Event Description Metric
Glassfish file cache hit rate is below 70%. A processing pipeline checks the file cache hit rate and validates whether it's lower than the given threshold value. Hit rate (file_cache_rate).
Maximum number of JDBC connections reached. A processing pipeline checks the total number of JDBC connections. It validates whether it's reaching the maximum limit for the server configuration. Used (jdbc_connection_used).

For more information about this sensor, see the Glassfish documentation.

Google Cloud Datastore

Event Description Metric
Datastore request count dropped significantly in last 30 minutes. Checks for a sudden decrease in the request count. Requests (request_count).
Datastore request count increased significantly in last 30 minutes. Checks for a sudden increase in the request count. Requests (request_count).

For more information about this sensor, see the Google Cloud Datastore documentation.

Google Cloud Storage

Event Description Metric
Sudden increase in size of all objects. Checks for a sudden increase in the size of all objects in 24 hours for non-empty buckets. Total size of all objects in the bucket.

For more information about this sensor, see the Google Cloud Storage documentation.

Google Cloud Pub/Sub

Event Description Metric
The push request latency for the subscription has increased in last 10 minutes. Checks for a sudden increase in push request latency for the subscription. Request Latency (push_request_latencies).
Topic oldest message. Checks whether there are messages on the topic that are older than the threshold value. Oldest Message (oldest_unacked_message_age).

For more information about this sensor, see the Google Cloud Pub/Sub documentation.

Hadoop YARN

Event Description Metric
Resource manager is reporting lost node. Detects if the resource manager is reporting lost nodes. Lost Nodes (lostNodes).
Resource manager is reporting unhealthy node. Detects if the resource manager is reporting unhealthy nodes. Unhealthy Nodes (unhealthyNodes).
Submitted app has failed. Detects if submitted app has failed. Apps Failed (appsFailed).

For more information about this sensor, see the Hadoop YARN documentation.

HAProxy

Event Description Metric
HAProxy backend average queue size is high. HAProxy backend average queue size is large. Backend Queue Size.
HAProxy frontend session usage is high. HAProxy frontend session usage is high. Frontend Session Utilization.
Sudden increase in average response time. Checks for a sudden increase in the average response time of a single backend. Average response time metrics.

For more information about this sensor, see the HAProxy documentation.

Hazelcast

Starting with Hazelcast 3.3, the public method HazelcastInstance::getPartitionService()::isLocalMemberSafe() is used. For older Hazelcast versions, the health status is derived from an internal "has ongoing migrations" status on each local node.

The Hazelcast cluster health status is aggregated from each Hazelcast node. This is exactly what HazelcastInstance::getPartitionService()::isClusterSafe() does internally, but without the additional overhead of calling this method.
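
The aggregation described above can be pictured as a logical AND over the per-node status flags. The following Python sketch illustrates that idea only. The function name is hypothetical, the flags stand in for the per-node results that the sensor already collects, and treating an empty cluster as not safe is an assumption.

```python
def cluster_is_safe(local_member_safe_flags):
    """Aggregate per-node 'local member safe' flags into a cluster status.

    The cluster counts as safe only when every node reports that it is safe.
    An empty collection is reported as not safe (an assumption).
    """
    flags = list(local_member_safe_flags)
    return bool(flags) and all(flags)
```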

Hazelcast Cluster

Event Description Metric
Cluster status. Checks the cluster status of Hazelcast. Hazelcast 3.3 or above. Hazelcast cluster status flag.

Hazelcast Node

Event Description Metric
Node status. Checks the status of the local member. Hazelcast 3.3 or above. Hazelcast node status flag.

For more information about this sensor, see the Hazelcast IMDG documentation.

HBase

Event Description Metric
Difference between number of stores and number of store files is significant. Detects an unusually low or unusually high number of stores. Stores count (rs_store_count) and store files count (rs_store_file_count).
Region server block cache hit ratio is low. Detects low cache hit ratio. Block cache hit rate (rs_blk_cache_hit_rate) and block cache hit count (rs_blk_cache_hit_count).
Significant increase in compaction queue length. Checks for a sudden increase in the length of the compaction queue. This rule indicates that all regions are growing at a similar rate and need to split/compact at around the same time. This can be addressed by pre-splitting or turning off auto-compactions. Compaction queue length (rs_comp_queue_length).
Significant increase in flush queue length. Checks for a sudden increase in the length of the flush queue. When triggered, this can be an indication of a lack of RAM or that flushes are faster than what disks can handle. Flush queue length (rs_flush_queue_length).

For more information about this sensor, see the Apache HBase documentation.

Host

Event Description Metric
CPU spends significant time waiting for input/output. Checks whether the system spends significant time waiting for input/output (sampling in a sliding window of 60 seconds). Wait (cpu.wait).
CPU Steal Time exceeded. Checks, on a per-second moving window, whether too much CPU is stolen between running processes or by the hypervisor or host OS (sampling in a sliding window of 60 seconds). Steal (cpu.steal).
Device has low capacity left or is full. Detects low disk capacity to give an early prediction of a possible capacity breach up to 15 minutes in advance. The detector does not fire when the remaining disk space is more than 1 GB or 1% of the total capacity. However, it fires if either the disk has no space remaining (<1 MB) or the disk would fill up within the next 15 minutes based on the current trend. The disk's free storage capacity.
Disk fills up faster than it is being purged. Detects long-term disk capacity problems and fires when the disk is likely to run out of capacity within the next 48 hours. The detector does not fire when the remaining disk space is more than 20% of the total capacity. However, it fires when the disk would fill up within the next 48 hours based on the current trend. This trend is computed based on local minima collected over time. When these local minima span a timeframe of at least 4 hours, a linear regression model is fitted on these data points to produce the long-term forecast. The disk's free storage capacity.
Frequent TCP errors. Checks whether the host has an unusually high number of TCP errors (sampling in a sliding window of 60 seconds). In Segments/s (tcp.inSegs) and error (tcp.errors).
Frequent TCP fails. Checks whether the host has an unusually high number of TCP fails (sampling in a sliding window of 60 seconds). Fail (tcp.fails) and open/s (tcp.opens).
Permanent TCP retransmissions. Checks whether the host has an unusually high number of TCP retransmissions (sampling in a sliding window of 60 seconds). Retransmission (tcp.retrans) and out Segments/s (tcp.outSegs).
System load too high. Checks whether the system load is too high by comparing the load against two times the number of CPU cores of the machine (sampling in a sliding window of 120 seconds). Load (load.1min).
System memory exhausted. Checks whether the system memory is close to being exhausted (triggered instantly). Free (memory.free) and used (memory.used).
Too many open files. Processes are opening files faster than they close them (current vs max ratio exceeds threshold). Used (openFiles.used).
Too many used inodes. Low level of free inodes on filesystem triggers this health rule (current vs max ratio exceeds threshold). inode usage.
Too much CPU usage by user processes. Checks whether CPU usage of user processes is too high (sampling in a sliding window of 180 seconds). User (cpu.user) and topPID.
You will run out of disk space soon. Detects short-term capacity problems of a disk and fires when the disk is likely to run out of capacity within the next hour. The detector does not fire when the disk freed up a considerable amount of space (>=100 MB) in the recent past, or when the remaining disk space is more than 20% of the total capacity. However, it fires when the disk would fill up within the next hour based on the current trend. This trend is computed based on a linear regression model fitted on the data points of the current sliding window. The disk's free storage capacity.
Windows service status is changed. Checks whether the Windows service status is changed (sampling in a sliding window of 60 seconds). Windows service status (state).

For more information about this sensor, see the Host documentation.
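
Several of the disk rules in this table forecast when a disk fills up by fitting a linear regression to recent free-space samples. The following Python sketch shows an ordinary least squares version of that idea. The function names and the sample format are assumptions, and the sketch omits parts of the documented behavior, such as the check for recently freed space and the use of local minima for the 48-hour forecast.

```python
def seconds_until_full(samples):
    """Estimate seconds from the last sample until free space reaches zero.

    samples: list of (timestamp_seconds, free_bytes) tuples, oldest first.
    Returns None when there is too little data or free space is not shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_f = sum(f for _, f in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Ordinary least squares slope of free space over time (bytes per second).
    slope = sum((t - mean_t) * (f - mean_f) for t, f in samples) / var_t
    if slope >= 0:
        return None  # free space is stable or growing
    intercept = mean_f - slope * mean_t
    time_at_zero = -intercept / slope
    return max(0.0, time_at_zero - samples[-1][0])


def will_fill_within(samples, horizon_seconds, total_bytes,
                     min_free_fraction=0.20):
    """Return True when the fitted trend predicts a full disk within the horizon.

    The rule stays silent while more than min_free_fraction of the disk is
    free, mirroring the 20% guard described in the table.
    """
    if not samples:
        return False
    free_now = samples[-1][1]
    if free_now > min_free_fraction * total_bytes:
        return False
    eta = seconds_until_full(samples)
    return eta is not None and eta <= horizon_seconds
```

A call such as `will_fill_within(samples, 3600, total_bytes)` corresponds to the one-hour horizon of the short-term rule, and a horizon of `48 * 3600` corresponds to the long-term rule.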

IBM ACE

Event Description Metric
Status of ACE Integration Server. Checks the status of ACE Integration Server. Integration Server State.
ACE Integration Server status digital format. Checks the digital status of ACE Integration Server. Integration Server State Metrics.
Queue Manager connection status digital format. Checks the digital status of the connection between ACE Integration Server and the queue manager. Queue Manager Connection Status Metrics.
Message with errors number. Number of messages that contain errors. Number of Messages with Errors.
Message flow with errors number. Number of MQGET errors for MQInput nodes or Web Services errors for HTTPInput nodes. Number of MQGET Errors.
Message processing with errors number. Number of errors that occur when processing a message. Number of Messages with Errors.
Message flow status. Checks the status of ACE Message Flow. Message Flow Status.
Message flow status digital format. Checks the digital status of ACE Message Flow. Message Flow Status Metrics.

For more information about this sensor, see the IBM ACE documentation.

IBM Db2

Event Description Metric
Table Space Utilities metrics status. Checks for events that are related to table spaces and their metrics when the auto-resize feature is enabled or disabled. Table Space Utilities.
HADR Connect Status. Checks for events that are related to the connection status of the HADR standby databases. The standby ID is used as a filter to generate the HADR_CONNECT_STATUS event, which is specific to a standby node and can be set with the standby ID in the matching operator field. If the matching operator is set to any, events are generated irrespective of the standby ID. Events can be created based on the following values, which represent the current state of a database:
  • The database is connected (Connect State = CONNECTED as 1).
  • The database is in disconnected state (Connect State = DISCONNECTED as 0).
HADR_CONNECT_STATUS (hadr.standbyId.HADR_CONNECT_STATUS).

For more information about this sensor, see the IBM Db2 documentation.

IBM MQ

IBM MQ Queue Manager

Event Description Metric
Queue Manager number of connections. Checks whether there are currently no connections on Queue Manager. Connection count (connectionCount).
Queue Manager status. Checks whether Queue Manager is in the stopped or standby state to trigger the Down or Switchover event. Queue Manager Status (statusMetric).
Channel Initiator status for Queue Manager. Checks whether Channel Initiator is in a running state. Channel Initiator Status (channelInitiatorStatus).
Publish/Subscribe Engine status for Queue Manager. Checks whether the Publish/Subscribe engine is in a running state. Publish/Subscribe Engine Status (pubsubStatus).
Bridge stopped[1]. Indicates that the IMS bridge is stopped. From IBM MQ events.

IBM MQ Queue

Event Description Metric
Queue oldest message. Checks whether the queue has messages that are older than the threshold value. Oldest message on queue (oldestMessage).
Queue depth diff. Checks whether the queue depth is approaching the maximum queue depth value. Queue depth (queueDepth) and max queue depth (maxQueueDepth).
Queue Full. Checks whether the queue depth percentage has reached the warning or critical value. Queue Depth Percentage (queueFullPercentage).
Transmission Queue High. Checks whether the number of transmission queue messages is too high. Queue depth (queueDepth).
Queue Service Interval High[1]. Detects that no successful get operations or MQPUT calls occurred within an interval that is greater than the limit specified in the QServiceInterval attribute. From IBM MQ events.
Queue Depth High[1]. Indicates that an MQPUT or MQPUT1 call has caused the queue depth to increase to the threshold that is specified in the QDepthHighLimit attribute. From IBM MQ events.
Queue Full[1]. Indicates that an MQPUT or MQPUT1 call failed because the queue is full, that is, the queue already contains the maximum number of messages possible. From IBM MQ events.
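
The metric-based "Queue Full" rule evaluates a queue depth percentage against a warning and a critical value. The following Python sketch shows how such a percentage can be derived from queueDepth and maxQueueDepth. The function names and the 80% and 90% defaults are assumptions for illustration; the table does not state the configured values.

```python
def queue_full_percentage(queue_depth, max_queue_depth):
    """Return the queue depth as a percentage of the maximum queue depth."""
    if max_queue_depth <= 0:
        raise ValueError("max_queue_depth must be positive")
    return 100.0 * queue_depth / max_queue_depth


def queue_full_severity(queue_depth, max_queue_depth,
                        warning_percent=80.0, critical_percent=90.0):
    """Map the queue depth percentage to 'ok', 'warning', or 'critical'.

    Both default limits are hypothetical values, not product settings.
    """
    percentage = queue_full_percentage(queue_depth, max_queue_depth)
    if percentage >= critical_percent:
        return "critical"
    if percentage >= warning_percent:
        return "warning"
    return "ok"
```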

IBM MQ Channel

Event Description Metric
Channel status. Checks whether the channel is in a healthy state. Channel status (channelStatus).
Channel InDoubt status. Checks whether the channel is in the InDoubt status. Channel status (channelStatus).
Channel conversion error[1]. Indicates that a channel was unable to complete data conversion because the MQGET call to get a message from the transmission queue resulted in a data conversion error. From IBM MQ events.
Channel SSL Error[1]. Indicates that a channel that uses Transport Layer Security (TLS) or Secure Sockets Layer (SSL) failed to establish a connection. From IBM MQ events.

You can use built-in events for channels in the Stopped and InDoubt statuses. For channels in any other status, create custom events based on the built-in metrics. For the enumeration values of channel status, see IBM MQ channel metrics reference.

IBM MQ Listener

Event Description Metric
Listener status Checks whether the listener is in a healthy state. Listener status (listenerStatus)

For more information about this sensor, see the IBM MQ documentation.

IIS Internet Information Server

Event Description Metric
Sudden drop in requests to IIS-site. Checks for a sudden drop in the requests for an IIS site. Total requests metric of an IIS site.

For more information about this sensor, see the Microsoft IIS documentation.

IBM DataPower

IBM DataPower Appliance

Event Description Metric
Appliance percentage of CPU usage Checks whether the appliance percentage of CPU usage is too high. CPU Usage (cpuUsage)
Appliance percentage of memory usage Checks whether the appliance percentage of memory usage is too high. Memory Usage (memoryUsage)
Appliance percentage of system load Checks whether the appliance percentage of system load is too high. System Load (systemLoad)
Appliance status Checks whether the appliance is in a healthy state. Status (status)

IBM DataPower Domain

Event Description Metric
Domain percentage of memory usage Checks whether the domain percentage of memory usage is too high. Current Memory Usage (currentMemUsage)
IBM DataPower Gateway Peering status Checks whether the gateway peering status of each instance is broken. Broken status (brokenStatus)

IBM DataPower Service

Event Description Metric
Service percentage of memory usage Checks whether the service percentage of memory usage is too high. Current Memory Usage (currentMemUsage)
Service status Checks whether the service is in a healthy state. Status (status)

For more information about this sensor, see the IBM DataPower documentation.

JBoss

Event Description Metric
Average errors on connector too high. A processing pipeline detects the number of errors that occurred on connectors in the given time window and checks whether the number of errors is greater than the threshold value. JBoss connector errors.
ConnectionPool is running out of connections. A processing pipeline detects the used connections ratio and checks if it is about to reach the threshold value. JBoss connection pool connections used ratio.
Connections on datasources run out. A processing pipeline detects the number of available connections on data sources in the given time window and checks if the total number of connections is about to reach the threshold value. JBoss datasources connections used, datasources connections available.
ThreadPool is running out of threads. A processing pipeline detects the number of max threads and checks if the current thread count is about to reach the threshold value. JBoss thread pool current thread count, thread pool max threads.

For more information about this sensor, see the JBoss AS documentation.

JBoss Data Grid

Event Description Metric
Caches not in the running state. Checks the ratio of the number of caches running to the number of caches created in JBoss Data Grid. If the ratio is below a certain value, it is considered a violation. Running and created caches of cache managers.

For more information about this sensor, see the JBoss Data Grid documentation.

JVM

Event Description Metric
Garbage collection activity high. A processing pipeline monitors the Garbage Collection time spent by the JVM Runtime Platform and validates it against a threshold. JVM Garbage Collection.
JVM code cache is full. A processing pipeline monitors the maximum Code Cache usage of the JVM Runtime Platform. JVM maximum Code Cache usage.
Perm Gen is full (CMS). A processing pipeline detects the maximum Perm Gen CMS Pools utilized. pools.CMS Perm Gen
Perm Gen is full (G1). A processing pipeline detects the maximum Perm Gen G1 Pools utilized. pools.G1 Perm Gen
Perm Gen is full (PS). A processing pipeline detects the maximum Perm Gen PS Pools utilized. pools.PS Perm Gen
Threads are deadlocked. A detector monitors the JVM Runtime Platform and detects if there are any Deadlocked threads. Number of threads deadlocked (threads.deadlocked).
J9VM Memory Leak. A detector checks the growth rate of heap used after GC in MB per hour, and detects whether there is possibly a memory leak in the JVM. IBM J9 VM memory leak detection is an optional feature that is disabled by default in the Instana backend. To enable this optional feature, see the page for your Instana deployment: SaaS, Self-Hosted Custom Edition (Kubernetes or Red Hat OpenShift Container Platform), or Self-Hosted Classic Edition (Docker). memory.gc.after and memory.gc.before.

For more information about this sensor, see the JVM documentation.

Kafka

Kafka Cluster

Event Description Metric
Number of active controllers. Checks for an unusual number of active controllers in the Kafka cluster. Broker active controller count (broker.activeControllerCount).

Kafka Node

Event Description Metric
Kafka network thread is under high load. Checks whether the Kafka network thread is under high load. Network Processor (broker.networkProcessorIdle).
Kafka request handler thread is under high load. Checks whether the Kafka request handler is under high load. Request Handler (broker.requestHandlerIdle).
Leader elections are too often. Checks whether there are too many leader elections within a given timeframe. Leader Elections (broker.leaderElections).
Potential data loss due to unclean leader election. Checks for potential data loss due to unclean leader elections. Unclean Leader Elections (broker.uncleanLeaderElections).
Producers and consumer are blocked. Checks whether producers and consumers are blocked because partitions are offline. Offline Partitions (broker.offlinePartitionsCount).
The number of in-sync replicas has shrunk. Checks whether the number of in-sync replicas has shrunk and did not recover back within the given interval. ISR shrinks (broker.isrShrinks) and ISR expansions (broker.isrExpansions).
Under-replicated partitions. Checks whether the number of under-replicated partitions exceeds the expected number. Under-replicated partitions (broker.underReplicatedPartitions).

For more information about this sensor, see the Kafka documentation.

Kubernetes

Kubernetes Cluster

Event Description Metric
Kubernetes Cluster component status. Kubernetes reports that a master component (API server, scheduler, or controller manager) is unhealthy. Due to a bug in Kubernetes, the health is not always reliably reported. Instana tries to filter out these unreliable reports: the event does not cause an alert and shows up only on the Cluster detail page. Instana low level events.

Kubernetes DaemonSet

Event Description Metric
Available replicas is less than desired replicas. Checks whether the total number of available replicas is less than the number of desired replicas. This indicates that the Kubernetes DaemonSet is missing replica pods. Desired (desiredReplicas) and available (availableReplicas).

Kubernetes Deployment

Event Description Metric
Available replicas is less than desired replicas. Checks whether the total number of available replicas is less than the number of desired replicas. This indicates that the Kubernetes Deployment is missing replica pods. Desired (desiredReplicas) and available (availableReplicas).

Kubernetes Namespace

Event Description Metric
Allocatable CPU requests too low. Requested CPU is approaching max capacity (requested CPU / CPU capacity ratio is greater than 80%). CPU Requests Allocation (required_cpu_percentage).
Allocatable memory requests too low. Requested memory is approaching max capacity (requested memory / memory capacity ratio is greater than 80%). Memory Requests Allocation (required_mem_percentage).
Allocatable pod count too low. Allocated pods are approaching maximum capacity (allocated pods/pods capacity ratio is greater than 80%). For a namespace, pods in the phases Pending, Running, and Unknown are counted as allocated. The namespace capacity values are based on ResourceQuotas, which can be set per Namespace. For more information, see the Kubernetes documentation. Pods Allocation (used_pods_percentage).

Kubernetes Node

Event Description Metric
Allocatable CPU too low. Requested CPU is approaching max capacity (requested CPU / CPU capacity ratio is greater than 80%). CPU Requests Allocation (required_cpu_percentage).
Allocatable memory too low. Requested Memory is approaching max capacity (requested memory/memory capacity ratio is higher than 80%). Memory Requests Allocation (required_mem_percentage).
Allocatable pod count too low. Allocated pods are approaching maximum capacity (allocated pods/pods capacity ratio is greater than 80%). For a node, pods in the phases Running and Unknown are counted as allocated. For more information, see the Kubernetes documentation. Pods Allocation (alloc_pods_percentage).
Kubernetes Node condition status. The node reports a condition other than Ready for more than one minute. For a node, this applies to all conditions besides the Ready condition. For more information, see the Kubernetes documentation. Instana low level events.

Kubernetes Pod

Event Description Metric
Kubernetes Pod condition status. A pod is not ready for more than one minute, and the reason is not that it’s completed. (PodCondition=Ready, Status=False, Reason != PodCompleted). For more information, see the Kubernetes documentation. Instana low level events.

For more information about this sensor, see the Kubernetes documentation.
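
The allocation events for Kubernetes namespaces and nodes compare a requested or allocated amount against a capacity and trigger when the ratio is greater than 80%. The following is a minimal sketch of that comparison, not Instana's implementation. The function names and the handling of a missing capacity are assumptions made for illustration.

```python
ALLOCATION_THRESHOLD = 0.80  # "ratio is greater than 80%" from the event descriptions


def allocation_ratio(requested: float, capacity: float) -> float:
    """Return requested / capacity, or 0.0 when no capacity is defined."""
    if capacity <= 0:
        # Assumption: without a capacity (for example, no ResourceQuota on a
        # namespace) there is nothing to compare against, so report no usage.
        return 0.0
    return requested / capacity


def is_allocation_too_low(requested: float, capacity: float) -> bool:
    """True when the requested share of the capacity is greater than 80%."""
    return allocation_ratio(requested, capacity) > ALLOCATION_THRESHOLD
```

For example, 8.5 requested CPU cores on a node with a capacity of 10 cores gives a ratio of 0.85, which is above the threshold, while exactly 8 cores is not.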

Memcached Nodes

Event Description Metric
Flush all command executed. Detects a high number of flush_all commands. Flush (cmd_flush).
High key eviction. Detects a high number of key evictions. Evictions (evictions).
Number of queued connections increases. Detects a high number of queued connections. Queued (conn_queued).
Number of yielded connections increases. Detects a high number of yielded connections. Yields (conn_yields).
Used bytes by Memcached reached maxbytes limit. Used bytes by Memcached reached max bytes limit. Used bytes.

For more information about this sensor, see the Memcached documentation.

MongoDB Node

Event Description Metric
Continuously increasing background flushing latency. Database reports increasing background flushing latency (sampling in a sliding window of 150 seconds). Last background flushing latency (backgroundFlushingLast).
Continuously increasing lock queue length. Monitors the MongoDB Lock Queue metric and validates if the lock queue size is increasing too fast. Lock Queue Length (lockQueue).
Increasing page faults. Increasing page faults (sampling in a sliding window of 150 seconds). Number of Page Faults (pageFaults).
Journal commits in write lock growing Journal commits in write lock growing (sampling in a sliding window of 150 seconds). Journal Write Lock (journalWriteLock).
Too high ratio of non-mapped virtual memory Too high ratio of non-mapped virtual memory (triggered instantly and reported by the Instana Host sensor). Virtual and mapped.

MongoDB Replica Set

Event Description Metric
ReplicaSet has member(s) down. The member, as seen from another member of the set, is unreachable. unreachableNodeCount.
ReplicaSet monitoring status. Monitors the health of all the members of the MongoDB replica set. Slave Delays Count (slaveDelaysCount), optimes count (optimesCount), and monitored members count (monitoredMembersCount).
Replication lag is growing. Replication lag is growing (sampling in a sliding window of 150 seconds). Slave Delays (slaveDelays) and Optimes (optimes).
Replica Set connection usage is high. Number of active connections is more than 90% of the maximum connections. Connections (connections).

For more information about this sensor, see the MongoDB documentation.

MySQL DB

Event Description Metric
Available server connections are at limit. Ratio between the used connections and the connections limit is greater than the configured ratio threshold. Connections (status.THREADS_CONNECTED).

For more information about this sensor, see the MySQL documentation.

Nginx Server

Event Description Metric
Nginx has a problem with offline peers. Inactive Peer (available only for NGINX Plus). Upstreams failed (nginx_plus.http.upstreams.peers.failed).
Nginx is dropping connections. Dropped connections. Dropped connections (connections.dropped).
Nginx is failing with SSL handshakes. Failed SSL handshakes (available only for NGINX Plus). Failed handshakes (nginx_plus.ssl.handshakes_failed).
Number of active connections is close to the max. Used connections ratio exceeds the configured ratio threshold for used connections. Active connections (connections.active).

For more information about this sensor, see the NGINX documentation.

Node.js App

Event Description Metric
Garbage collection activity high. Checks whether the time spent in GC in the given window is above the given threshold. GC pause metrics.
Health checks are failing. Checks whether there are any failing health checks. For more information, see Health check support. Health check result (healthcheckResult).

For more information about this sensor, see the Node.js documentation.

OpenShift Deployment Config

Event Description Metric
Available replicas is less than desired replicas. Checks whether the total number of available replicas is less than the number of desired replicas. This indicates that the OpenShift DeploymentConfig is missing replica pods. Desired (desiredReplicas) and available (availableReplicas).

For more information about this sensor, see the OpenShift documentation.

OTel Host

Event Description Metric
CPU Wait time exceeded Checks whether the system spends a significant amount of time waiting for input or output operations. CPU Wait (cpu.wait)
CPU Steal time exceeded Specifies the number of allowed CPU Wait violations within a time frame. CPU Steal (cpu.steal)
CPU usage high Checks whether the CPU use is high. This event continuously evaluates data over the most recent 180-second interval. CPU User (cpu.user)
System load too high Checks whether the system load is high by comparing the load against two times the CPU cores of the machine. This event continuously evaluates data over the most recent 120-second interval. Load (load.avg_1m)
System memory exhausted Checks whether the system memory is close to fully used (triggered instantly). Memory free (memory.free) and Memory used (memory.used)
Disk low capacity Detects short-term capacity problems of a device that has less free space than a static threshold (1GB) or less than 1% of the total volume size. In addition, it detects a capacity problem if the remaining time until the free capacity reaches zero, given the current rate of change, is under 15 minutes. Disks free storage capacity

For more information about this sensor, see the OpenTelemetry documentation.
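
The Disk low capacity event combines three conditions: a static free-space threshold, a relative threshold, and a projected time until the device is full. The following sketch shows one way to express these conditions. It is an illustration under stated assumptions, not the sensor's code: the function names are hypothetical, "1GB" is read as 10**9 bytes, and the rate of change is assumed to be available as bytes of free space gained or lost per second.

```python
from typing import Optional

STATIC_THRESHOLD_BYTES = 1_000_000_000  # assumption: "1GB" read as 10**9 bytes
RELATIVE_THRESHOLD = 0.01               # less than 1% of the total volume size
TIME_TO_ZERO_SECONDS = 15 * 60          # remaining time under 15 minutes


def seconds_until_full(free_bytes: float, free_change_per_sec: float) -> Optional[float]:
    """Project the time until free space reaches zero at the current rate.

    free_change_per_sec is negative while the device is filling up.
    Returns None when free space is not shrinking.
    """
    if free_change_per_sec >= 0:
        return None
    return free_bytes / -free_change_per_sec


def disk_low_capacity(free_bytes: float, total_bytes: float,
                      free_change_per_sec: float) -> bool:
    """True when any of the three low-capacity conditions holds."""
    if free_bytes < STATIC_THRESHOLD_BYTES:
        return True
    if free_bytes < RELATIVE_THRESHOLD * total_bytes:
        return True
    remaining = seconds_until_full(free_bytes, free_change_per_sec)
    return remaining is not None and remaining < TIME_TO_ZERO_SECONDS
```

With 50 GB free on a 100 GB volume that loses 100 MB of free space per second, the projected time to zero is 500 seconds, so the third condition applies even though both static checks pass.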

OracleDB

Event Description Metric
Ratio between DB CPU Time and DB Time is low. Ratio between DB CPU Time and DB Time is below the configured threshold. DB CPU Time/DB Time Ratio (stats.cpuTimeDbTimeRatio).
Tablespace space usage is high. Tablespace used space is greater than the configured amount of maximum space. Tablespace used space percentage.
Total amount of sessions at maximum. Used sessions ratio exceeds the configured used sessions ratio threshold. Sessions/Session Limit (stats.usedSessionsRatio).

For more information about this sensor, see the OracleDB documentation.

OS process

Event Description Metric
CPU Usage Process is causing high CPU usage on host. The result of a high CPU usage rule evaluation on the underlying host and the CPU user time of the given process.
Open Files Usage. Process is opening files faster than it closes them (current vs max ratio exceeds threshold). Used (openFiles.used).
Abnormal termination. Process terminated as a result of an uncaught signal.
Abnormal termination. Process terminated with a non-zero exit code.

For more information about this sensor, see the OS process documentation.

PHP-FPM Runtime

Event Description Metric
Frequent restarts of PHP-FPM worker pool. Checks for frequent restarts of a PHP-FPM worker pool by evaluating the number of its restarts in a given time window against a given threshold. Start times for a worker pool.
Listen Backlog configured over capacity. Checks whether the listen backlog of a worker pool is over the configured capacity. Worker pool queue length.
Too many connections reset. Checks the number of connection resets to be above the given threshold in the given time window. Connection resets metric for worker pool.
Too many requests piling up in Listen Backlog. Checks the size for various PHP-FPM worker queues and validates it against the threshold value. Listen queue size metrics for various PHP-FPM worker queues.
Too many slow requests. Checks the ratio of slow requests on all monitored PHP-FPM worker pools. Slow requests and accepted connection metric for a worker pool of a PHP-FPM instance.

For more information about this sensor, see the PHP documentation.

Synthetic Check

Event Description Metric
Remote target is not reachable. Checks whether the percentage of failed communication attempts in the given sliding window is above the given threshold. Status of Ping (status). An HTTP status code in the range 200-206 or 300-307 results in a healthy status. For ICMP, the exit value 0 is seen as healthy and the value 1 is seen as unhealthy. In addition, a maximum execution time of 2 seconds is set.

For more information about this sensor, see the Synthetic Check documentation.
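
The health rules for a Synthetic Check (HTTP status ranges, ICMP exit value, and the 2-second execution limit) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and treating an attempt that exceeds the time limit as failed is an assumption.

```python
from typing import Sequence

# Inclusive ranges 200-206 and 300-307 from the event description.
HEALTHY_HTTP_RANGES = (range(200, 207), range(300, 308))
MAX_EXECUTION_SECONDS = 2.0


def http_is_healthy(status_code: int, elapsed_seconds: float) -> bool:
    """Classify one HTTP attempt as healthy or unhealthy."""
    if elapsed_seconds > MAX_EXECUTION_SECONDS:
        # Assumption: exceeding the execution limit counts as a failed attempt.
        return False
    return any(status_code in r for r in HEALTHY_HTTP_RANGES)


def icmp_is_healthy(exit_value: int) -> bool:
    """Exit value 0 is healthy; any other value is treated as unhealthy."""
    return exit_value == 0


def failure_percentage(results: Sequence[bool]) -> float:
    """Percentage of failed attempts (False entries) in a sliding window."""
    if not results:
        return 0.0
    failed = sum(1 for ok in results if not ok)
    return 100.0 * failed / len(results)
```

The event then compares the value of `failure_percentage` for the current window against the configured threshold.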

PostgreSQL DB

Event Description Metric
Active connection usage. Number of active connections is more than 90% of the maximum connections. Connection Usage (max_conn_pct).

For more information about this sensor, see the PostgreSQL documentation.

Process

Event Description Metric
High CPU usage. Evaluates whether the given process is causing high CPU usage on a host. Results of high CPU usage rule evaluation on the underlying host and CPU user time of the given process.
Too many open files. Open files percentage is higher than the configured threshold. Used (openFiles.used).

RabbitMQ

RabbitMQ Cluster

Event Description Metric
RabbitMQ network partition detected Detects if network partition occurs inside the RabbitMQ cluster (triggered every 5 seconds). Total number of Network partitions (net_partitions_count).

RabbitMQ Server

Event Description Metric
Queues are filling up with messages Over a period of 10 minutes, queues are filling up with messages that are not delivered. Messages ready (overview.messages_ready) and messages acknowledged (overview.ack).
RabbitMQ has no consumers In the last 5 seconds, RabbitMQ has had no consumers. Consumers (overview.consumers).
RabbitMQ has no connections In the last 5 seconds, RabbitMQ has had no connections. Connections (overview.connections).

RabbitMQ Nodes

Event Description Metric
RabbitMQ File Descriptors Usage is critical. File descriptors usage rate is critical on a specific node (Warning: > 90%, Critical: > 98%). This is triggered every 5 seconds. RabbitMQ file descriptors used rate (fd_used_rate).
RabbitMQ Memory Usage is critical on node. Memory usage rate is critical on a specific node (Warning: > 90%, Critical: > 98%). This is triggered every 5 seconds. RabbitMQ memory used rate (mem_used_rate).
RabbitMQ Erlang Processes count is critical. Erlang Processes count is critical on a specific node (Warning: > 90%, Critical: > 98%). This is triggered every 5 seconds. RabbitMQ processes rate.

RabbitMQ Queues

Event Description Metric
More messages are being produced than consumed. More messages are being published to a queue than the consumers can process from a queue. RabbitMQ unacknowledged messages in a queue.

For more information about this sensor, see the RabbitMQ documentation.
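
The RabbitMQ node events use two severity levels: a warning above 90% and a critical level above 98%. A minimal sketch of that mapping follows. The function name is hypothetical, and computing the rate as used divided by limit is an assumption about how the rate metrics are derived.

```python
from typing import Optional

WARNING_RATE = 0.90   # Warning: > 90%
CRITICAL_RATE = 0.98  # Critical: > 98%


def usage_severity(used: float, limit: float) -> Optional[str]:
    """Map a used/limit rate to "warning", "critical", or None."""
    if limit <= 0:
        return None  # assumption: no limit means nothing to evaluate
    rate = used / limit
    if rate > CRITICAL_RATE:
        return "critical"
    if rate > WARNING_RATE:
        return "warning"
    return None
```

The critical check comes first so that a rate above 98% is never reported as a warning only.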

Redis

Redis Cluster

Event Description Metric
Redis cluster state isn't ok. Cluster is in an inappropriate state. cluster_state.

Redis Node

Event Description Metric
Memory allocation analysis. Redis server is causing external memory fragmentation. Used memory (used_memory) and memory fragmentation ratio (mem_fragmentation_ratio).
Redis hit rate is low. Redis hit rate is below the configured threshold. Cache hit rate (hit_rate), keyspace hits (keyspace_hits), keyspace misses (keyspace_misses), and Redis evicted keys (evicted_keys).
Redis memory usage is getting closer to max memory limit. Redis memory usage is getting closer to max memory limit. Used memory (used_memory).
Redis rejecting connections. Redis is rejecting connections. Number of rejected connections (rejected_connections).
Redis slave node can't connect to master node. Redis slave node can't connect to the master node. master_downtime_seconds.

For more information about this sensor, see the Redis documentation.

SAP ABAP

Event Description Metric
Lock contention detected Detects lock contention and provides details about the lock mode and lock object. ABAP Lock Contention
ABAP dumps generated Detects ABAP dumps that are generated and provides details on the severity. ABAP Dumps Severity
IDoc Inbound and Outbound errors occurred Detects errors for both Inbound and Outbound IDocs. Inbound IDoc Error and Outbound IDoc Error
High CPU usage detected Detects if the CPU usage is greater than 90%. High CPU usage
High memory usage detected Detects if the memory usage is greater than 90%. High Memory Usage
Work process in stopped, shutdown, or PRIV mode (private) detected Detects if the work process is in PRIV mode (private), stopped, or shutdown. Work Process Status
File system usage crossing threshold detected Detects if the file system usage crosses the threshold of 80%. File System Usage
Connection issues detected Detects incorrect username, password, gateway failure, or incorrect login attempts. Connectivity Status
Authorization missing detected Detects if the user is missing the authorization to run a function module. Authorization check
User account locked detected Detects if the user account is locked due to login failures. User Account lock
Spool Error detected Detects spool error. Spool Error
Dialog response time exceeding threshold Detects if the dialog response time exceeds the preferred threshold. Dialog Response Time
Dialog work process exceeding threshold Detects if the dialog work process is running longer than 10 seconds. Dialog Work Process
Transport request release detected Detects whether transport request is released or protected. Transport Request

For more information about this sensor, see SAP ABAP.

SAP HANA

Event Description Metric
High CPU utilization Detects if the total CPU usage exceeds 90% Total CPU Utilization
High HANA memory usage Detects if the used memory exceeds 90% of the allocated limit HANA Memory Usage
High host memory usage Detects if the host memory usage exceeds 90% Host Memory Usage
High Disk usage Detects if the disk usage exceeds 90% Disk Usage Summary
High number of queuing connections Detects if the queuing connections are more than one Connections
High number of blocked sessions Detects if the blocked sessions are more than one Sessions
High number of blocking sessions Detects if the blocking sessions are more than one Sessions
High number of blocked threads Detects if the blocked threads are more than 10 Threads
High number of blocked SQL threads Detects if the blocked SQL threads are more than 10 SQL Threads
High number of blocked job worker threads Detects if the blocked job worker threads are more than 10 Job Worker Threads
High number of pending requests Detects if the pending requests are more than 10 Requests
High process CPU Detects if any of the process CPUs exceeds 90% Service Details
Service status is not active Detects if service status is not active Service Details
Backup failed Detects failed backups Backup Progress
User locks occurred Detects user locks User Locks
Scheduler jobs failed Detects failed scheduler jobs Scheduler Jobs
System events occurred Detects system events System Events
Archive log backup failed Detects failed log backups Archive Log Backup
Transaction is not active Detects partial aborting and aborting transactions Transaction Statistics

For more information about the SAP HANA sensor, see Monitoring SAP HANA.

Service

Event Description Metric
Complete drop in calls. Detects a rapid drop to zero (essentially the service is not being called anymore) in the values of the calls KPI metric relative to the values in the last 30 minutes. The magnitude of the drop in calls should also exceed the configured relative and absolute threshold parameters. Calls/s (count).
Error rate too high. Detects a consistently high error rate when the average errors KPI within the last four minutes is above the given threshold value. Error rate (error_rate).
Increasing trend in error rate. Checks for the presence of an increasing trend in a given metric. The rule is tuned to detect weakly monotonic increases in the given metric. The detector is not strict, however, and tolerates a certain amount of decrease in the metric value inside the trend candidate. Error rate (error_rate).
Sudden drop in calls. Detects a rapid drop in the values of the calls KPI metric relative to the values in the last 30 minutes. The magnitude of the drop in calls should also exceed the configured relative and absolute threshold parameters. Calls/s (count).
Sudden increase in error rate. Detects a rapid increase in the values of the errors KPI relative to the KPI values in the last 10 minutes. The magnitude of the increase in errors should also exceed the configured relative and absolute threshold parameters. Error Rate (error_rate).
Sudden increase in latency. Detects a rapid increase in the given latency KPI percentile relative to the KPI values in the last 30 minutes. The magnitude of the increase in latency should also exceed the configured relative and absolute threshold parameters. Latency 50th (duration.50th).
Sudden increase in latency for a fraction of requests. Detects a rapid increase in the given latency KPI percentile relative to the KPI values in the last 30 minutes. The magnitude of the increase in latency should also exceed the configured relative and absolute threshold parameters. Latency 99th (duration.99th).

Solr

Solr Cloud Cluster

Event Description Metric
Unreachable Solr nodes. One or more nodes are down. unreachableNodes.

Solr Node

Event Description Metric
Solr cache hit rate is low. Solr cache hit rate is below 80% over the last minute, possibly due to high evictions or clients querying the wrong data. Solr Hit Ratio (hitratio) and Solr evictions.

For more information about this sensor, see the Apache Solr documentation.

Spark

Spark Application

Event Description Metric
Failed tasks on executor. Number of failed tasks on an executor exceeds the configured threshold. Spark Application failed tasks.
Scheduling delay is high. Scheduling delay is increasing too fast or is too high. Scheduling Delay (schedulingDelay).

Spark Standalone

Event Description Metric
Driver has failed. Number of failed drivers exceeds the configured threshold. Number of failed drivers (drivers.failed).
Spark standalone master is reporting dead worker(s). Number of dead workers exceeds the configured threshold. Dead workers (workers.deadWorkers).
Spark standalone master is reporting worker(s) in unknown state. Number of workers in an unknown state exceeds the configured threshold. Workers in unknown state (workers.workersInUnknownState).
Submitted app has failed. Number of failed applications exceeds the configured threshold.

For more information about this sensor, see the Apache Spark documentation.

Spring Boot App

Event Description Metric
Number of active sessions reached maximum number. A processing pipeline detects the number of active sessions of the Spring Boot application in the given time window. It validates whether the number of active sessions is greater than the threshold value. Active sessions (metrics.httpsessions.active).
Spring Boot Application down. Monitors the status of the Spring Boot application. Status of Spring Boot application (metrics.status).

For more information about this sensor, see the Spring Boot documentation.

Sybase Server

Event Description Metric
Available server connections are at limit. Number of connections is close to 100% of connections limit per server. Connections (stats.connCount).
The maximum number of databases is at limit. Number of databases is close to 100% of databases limit per server. databasesCount.

For more information about the SAP SQL Anywhere sensor, see Monitoring SAP SQL Anywhere.

Synthetic PoP

Event Description Metric
Synthetic PoP status Checks whether the Synthetic PoP can connect to the Instana backend. Status of Synthetic PoP (status)
Playback engine status Checks whether the playback engine is overloaded. Workload status of the playback engines: browserscript.workloadStatus, http.workloadStatus, javascript.workloadStatus, and ism.workloadStatus.
Retrieving credentials failed Failed to get Synthetic credentials from the Instana backend. Error code and URL of pop_get_cred_failed (error.pop_get_cred_failed).
Retrieving tests failed Failed to get Synthetic tests from the Instana backend. Error code and URL of pop_get_test_failed (error.pop_get_test_failed).
Reporting test results failed Failed to post Synthetic test result to the Instana backend. Error code and URL of pop_report_result_failed (error.pop_report_result_failed).
Reporting test result details failed Failed to post Synthetic test result details to the Instana backend. Error code and URL of pop_report_result_details_failed (error.pop_report_result_details_failed).
Reporting result queue depth is high Detects whether the result queue depth is high. ResultQueueDepthHigh (resultQueueDepthHigh).

For more information about this sensor, see the Synthetic PoP documentation.

Tibco EMS

Event Description Metric
Connections exceeds max available connections. The max number of connections is almost used up. Connections Count (connectionCount).
Messages memory usage exceeds the limit. The maximum message memory is almost used up. Messages Memory (messagesMemory).
Queues pending messages exceeds the limit. The max number of pending messages for queue is almost used up. Queue pending messages usage.
Topics pending messages exceeds the limit. The max number of pending messages for topic is almost used up. Topic pending messages usage.

For more information about this sensor, see the Tibco EMS documentation.

Tomcat

Event Description Metric
Active connections reached maximum. Detects if the number of connections of a specific connector is reaching its maximum configured value. Connector connection count.
Sudden drop in the number of sessions. Checks for a significant drop in the number of sessions. Total session count (totalSessionCount).
Sudden increase in the number of sessions. Checks for a significant increase in the number of sessions. Total session count (totalSessionCount).
Threads number reached maximum. Detects if the number of busy threads of a specific connector is reaching its maximum configured value. Number of connector busy threads.

For more information about this sensor, see the Tomcat documentation.

Varnish Node

Event Description Metric
Sudden drop in the number of requests. Checks for a sudden drop in the number of client requests. Received client requests (client_req).
Sudden increase in evicted objects. Checks for a sudden increase in the number of evicted objects. Nuked Objects (n_lru_nuked).
Thread creation is failing. Too many thread creations failed. Failed (threads_failed) and limited (threads_limited).
Varnish backend is marked unhealthy. Varnish backend server is unhealthy or is not available. Unhealthy (backend_unhealthy).
Varnish hit rate is low. Varnish hit rate is very low. Cache Hit Rate (cache_hit_rate).
Varnish is out of worker threads. Varnish is out of worker threads. Connections dropped due to a full queue (sess_dropped).

For more information about this sensor, see the Varnish documentation.

Vault

Event Description Metric
Vault is sealed. Detects if the sealed status is set to true. Sealed (sealed).
Sudden increase in secret reads Checks for a sudden increase (increase by 60% based on the average of the last 5 minutes) in the number of secrets read. Secrets read count (secret.read.count).

For more information about this sensor, see the Vault documentation.
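
The "Sudden increase in secret reads" event compares the current number of secrets read with the average of the last 5 minutes and looks for an increase of 60%. The sketch below shows one reading of that rule. It is an assumption-based illustration: the function name is hypothetical, and whether the comparison is strict and how an empty or zero baseline is handled are not stated in this reference.

```python
from statistics import mean
from typing import Sequence

INCREASE_FACTOR = 1.60  # an increase of 60% over the 5-minute average


def sudden_increase(current: float, last_five_minutes: Sequence[float]) -> bool:
    """True when current exceeds the baseline average by more than 60%."""
    if not last_five_minutes:
        return False  # assumption: no baseline, no event
    baseline = mean(last_five_minutes)
    if baseline <= 0:
        return False  # assumption: a zero baseline is not evaluated
    return current > baseline * INCREASE_FACTOR
```

With a baseline of 100 reads per sample, a current value of 170 triggers the check and a value of 150 does not.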

WebLogic Server

Event Description Metric
Datasource error state. A processing pipeline monitors the status codes of the WebLogic application data sources and checks whether any data source is unhealthy. WebLogic datasource status.
Health state Detects overall system degradation based on reported health state. Health State status.

For more information about this sensor, see the WebLogic documentation.

WebSphere

Event Description Metric
WebContainer thread pool active threads reached maximum. A processing pipeline checks whether the number of active threads in the WebContainer thread pool is reaching the maximum limit. Active threads (threadPools.webContainer.activeThreads).
WebSphere certificate is about to expire. Remaining days before certificate expiration is less than the threshold value. Remaining days before expiration (certificates.{certificate}.expDaysLeft)

For more information about this sensor, see the WebSphere Application Server documentation.

ZooKeeper

Event Description Metric
Maximum request latency is high. A processing pipeline checks if the maximum request latency is reaching the threshold value. Max request latency (max_request_latency).
Number of queued requests is high. A processing pipeline detects the number of queued requests and validates whether the number is reaching the threshold value. Outstanding request count (outstanding_requests).

For more information about this sensor, see the ZooKeeper documentation.


  1. The events are retrieved from IBM MQ events. The Instana agent collects these IBM MQ events and reports them as Instana events. To collect these events, you need to enable Queue Manager performance events and channel events. For more information, see Extra IBM MQ configuration.