Built-in Events Reference

.NET App


Event	Description	Metric
Garbage collection activity is high.	Monitors the garbage collection (GC) time spent by the CLR runtime platform and checks it against the maximum allowed percentage value.	GC time (`mem.time_in_gc`).

For more information about this sensor, see the .NET documentation.

ActiveMQ


Event	Description	Metric
Dead-letter queue size is growing.	Dead-letter queue size is increasing. Messages sent are not routed to their correct destination.	ActiveMQ queue size.
Memory usage is close to the limit.	Memory usage is close to 100% of the memory limit.	Memory Usage (`memoryPercentage`).
Store usage is close to the limit.	Store usage is close to 100% of the store limit.	Store Usage (`storePercentage`).

For more information about this sensor, see the ActiveMQ documentation.

ActiveMQ Artemis


Event	Description	Metric
ActiveMQ Artemis has no connections.	There are no connections in the last 5 seconds. The current number of connections is equal to the configured NoConnections count.	Total Connections (`totalConnectionCount`).
ActiveMQ Artemis has no consumers.	There are no consumers in the last 5 seconds. The current number of consumers is equal to the configured NoConsumers count.	Total Consumers (`totalConsumerCount`).
Addresses memory usage is close to the limit.	Memory usage of all addresses is close to 100% of its memory limit.	Address Memory Usage (`addressMemoryPercentage`).

For more information about this sensor, see the ActiveMQ Artemis documentation.

Apache HTTPd


Event	Description	Metric
Apache child processes are stuck performing DNS lookups.	Detects high usage of server workers by DNS lookup.	Dns (`worker.dns`).
Logging is slowing down Apache HTTPd performance.	Detects high usage of server workers for logging purposes.	Logging (`worker.logging`).
Number of busy workers is approaching max workers.	Detects a high percentage of busy workers.	Busy workers (`busy_workers`).

For more information about this sensor, see the Apache HTTPd documentation.

Application


Event	Description	Metric
Complete drop in calls	Detects a rapid drop to zero (essentially the service is not being called anymore) in the values of the calls relative to the values in the last 30 minutes. The magnitude of the drop in calls should also exceed the listed relative and absolute threshold parameters.	Calls/s (`count`)
Error rate too high	Detects a consistently high error rate when the average errors KPI within the last four minutes is above the given threshold value.	Error Rate (`error_rate`).
Increasing trend in error rate	Checks for the presence of an increasing trend in a given metric. The rule is tuned to detect weakly monotonic increases in the given metric. However, the detector is not strict and tolerates a certain amount of decrease in the metric value inside the trend candidate.	Error Rate (`error_rate`).
Sudden drop in calls	Detects a rapid drop in the values of the calls KPI metric relative to the values in the last 30 minutes. The magnitude of the drop in calls should also exceed the listed relative and absolute threshold parameters.	Calls/s (`count`).
Sudden increase in error rate	Detects a rapid increase in the values of the errors KPI relative to the KPI's values in the last 10 minutes. The magnitude of the increase in errors should also exceed the listed relative and absolute threshold parameters.	Error Rate (`error_rate`).
Sudden increase in latency	Detects a rapid increase in the given latency KPI percentile relative to the KPI's values in the last 30 minutes. The magnitude of the increase in latency should also exceed the listed relative and absolute threshold parameters.	Latency 50th (`duration.50th`).
Sudden increase in latency for a fraction of requests	Detects a rapid increase in the given latency KPI percentile relative to the KPI's values in the last 30 minutes. The magnitude of the increase in latency should also exceed the listed relative and absolute threshold parameters.	Latency 99th (`duration.99th`).
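
The drop rules in this table combine a baseline over the last 30 minutes with a relative and an absolute threshold. The following Python sketch illustrates that combination only; the function names, the use of a plain mean as the baseline, and the default threshold values are assumptions for illustration and are not the product's actual implementation or defaults.

```python
from statistics import mean


def sudden_drop(history, current, rel_threshold=0.5, abs_threshold=1.0):
    """Return True when `current` calls/s dropped sharply versus the baseline.

    history: calls/s samples from the preceding 30 minutes.
    rel_threshold, abs_threshold: hypothetical values, not product defaults.
    """
    if not history:
        return False
    baseline = mean(history)
    drop = baseline - current
    if baseline <= 0 or drop <= 0:
        return False
    # Both the relative and the absolute magnitude must be exceeded.
    return drop / baseline > rel_threshold and drop > abs_threshold


def complete_drop(history, current, **thresholds):
    """A complete drop is a sudden drop that ends at zero calls."""
    return current == 0 and sudden_drop(history, current, **thresholds)
```

Requiring both thresholds keeps low-traffic services from alerting on small fluctuations: a fall from 0.5 to 0 calls/s is a 100% relative drop but does not exceed the assumed absolute threshold of 1 call/s.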

AWS DynamoDB


Event	Description	Metric
Ratio of consumed and provisioned reads is critical.	Detects high ratio of consumed and provisioned reads.	Consumed read capacity (`consumed_read`).
Ratio of consumed and provisioned writes is critical.	Detects high ratio of consumed and provisioned writes.	Consumed write capacity (`consumed_write`) and provisioned write capacity (`provisioned_write`).

For more information about this sensor, see the AWS DynamoDB documentation.

AWS MSK


Event	Description	Metric
Active Controller Count.	Checks for an unusual number of active controllers in the Kafka cluster.	Active controller count (`active_controller_count`).
Offline Partitions Count.	Defines the maximum allowed proportion of violations of offline partitions within the specified time window.	Offline partitions count (`offline_partitions_count`).
Network Processor Low Idle Time.	Checks whether the Kafka network thread is under high load.	Network processor idle time (`network_processor_idle`).
Request Handler Low Idle Time.	Checks whether the Kafka request handler is under high load.	Request handler idle time (`request_handler_idle`).
Under-replicated partitions Count.	Checks whether the number of under-replicated partitions exceeds the expected number.	Under-replicated partitions (`under_replicated_partitions`).

For more information about this sensor, see the AWS MSK documentation.

AWS RDS


Event	Description	Metric
CPU credit balance reaching zero.	Checks if the CPU credit balance is getting closer to zero.	CPU Credit Balance (`cpu_credit_balance`).
Number of CPU credits consumed is high.	Checks if the percentage of CPU credits consumed by an instance is reaching max capacity.	CPU Credit Usage (`cpu_credit_usage`) and CPU Credit Balance (`cpu_credit_balance`).

For more information about this sensor, see the AWS RDS documentation.

Azure API Management Service

The Azure API Management sensor automatically runs any configured custom health checks every minute. If the checks fail for at least one minute, an issue is raised to inform you.


Event	Description	Metric
Azure Api Management capacity is getting closer to the max capacity limit.	Checks whether Azure API Management is using more than 90% of the available capacity.	Capacity (`metrics.Capacity`).

For more information about this sensor, see the Azure API Management documentation.

Azure CosmosDB


Event	Description	Metric
Azure CosmosDb storage capacity is getting closer to the max capacity limit.	Detects whether the Azure CosmosDb storage capacity is reaching the max capacity limit.	CosmosDb storage capacity.

For more information about this sensor, see the Azure CosmosDB documentation.

Azure Redis

The Azure Redis Cache sensor runs custom health checks every minute. If the checks fail for at least one minute, an issue is raised to inform you.


Event	Description	Metric
Azure Redis Cache client connections are getting closer to max connections limit.	Azure Redis Cache is using more than 90% of available client connections.	Connected Clients (`connectedclients`).
Azure Redis Cache memory usage is getting closer to max memory limit.	Azure Redis Cache is using more than 90% of available memory.	Percentage of Memory Used (`usedmemorypercentage`).

For more information about this sensor, see the Azure Redis documentation.

Azure SQL Database

The Azure SQL Database sensor runs custom health checks every minute. If the checks fail for at least one minute, an issue is raised to inform you.


Event	Description	Metric
Database is running out of space.	Checks if Azure SQL Database is running out of space. Warning limit is at 80% and the critical limit is at 90% of the used size.	`metrics.storage_percent`.
Database status.	An unhealthy state is caused by the database being unavailable. A database can be unavailable if one of the following conditions is true: the database has been set offline by the user; the database is being restored from backup; the database is being recovered; the database has been corrupted; the database has been set to the Emergency state by the administrator; or the database is in the process of being created by copying another database.	`metrics.statusCode`.
The total DTU utilization is getting closer to max DTU limit.	Checks if the Azure SQL Database DTU utilization is reaching max DTU limit. Warning limit is at 75% and the critical limit is at 85% of the DTU utilization.	`metrics.dtu_consumption_percent`.
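
The warning and critical limits quoted in this table can be read as a simple two-level classification. The following Python sketch shows that reading; the `severity` function and the `LIMITS` mapping are hypothetical names, and treating a value exactly at a limit as a breach is an assumption, because the table does not say whether the comparison is inclusive.

```python
from typing import Optional

# Limits from the table, in percent: (warning, critical).
LIMITS = {
    "metrics.storage_percent": (80.0, 90.0),
    "metrics.dtu_consumption_percent": (75.0, 85.0),
}


def severity(metric: str, value: float) -> Optional[str]:
    """Classify a metric value as "warning", "critical", or None (healthy)."""
    warning, critical = LIMITS[metric]
    if value >= critical:  # assumption: inclusive comparison
        return "critical"
    if value >= warning:
        return "warning"
    return None
```

For example, `severity("metrics.dtu_consumption_percent", 80)` falls between the two DTU limits and is classified as a warning.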

Azure MySQL Database

The Azure MySQL Database sensor runs custom health checks every minute. If the checks fail for at least one minute, an issue is raised to inform you.


Event	Description	Metric
Available server connections are getting closer to the max connections limit	The usage of Azure MySQL Server connections is more than 85% of the available client connections.	Active Connections (`active_connections`)

For more information about this sensor, see the Azure MySQL documentation.

Azure Service Bus

The Azure Service Bus sensor runs custom health checks every minute. If the checks fail for at least one minute, an issue is raised to inform you.


Event	Description	Metric
Azure Service Bus has at least one message in DL queue	Checks if the Azure Service Bus has at least one message in the dead-letter queue.	Dead Lettered Messages (`deadletteredMessages`)

For more information about this sensor, see the Azure Service Bus documentation.

Azure SQL Elastic Pool

The Azure SQL Elastic Pool sensor runs custom health checks every minute. If the checks fail for at least one minute, an issue is raised to inform you.


Event	Description	Metric
The total eDTU utilization is getting closer to max eDTU limit.	Checks if Azure SQL Elastic Pool eDTU is reaching maximum eDTU limit.	`metrics.dtu_consumption_percent`.

Cassandra

Cassandra Cluster


Event	Description	Metric
Unreachable Cassandra nodes.	One or more nodes are down.	Number of unreachable nodes (`unreachableNodes`).

Cassandra Node


Event	Description	Metric
Blocked threadpools.	Checks whether there are stages with blocked threads.	Blocked threads metric for a stage.
Dropped messages.	Checks whether there are thread pools dropping messages.	Dropped messages metric for a stage.
Pending compactions.	Checks whether pending compactions are increasing.	Write (Pending) (`compaction.pending`).
Pending mutations.	Checks whether there are pending mutations.	Counter Mutation (`stage.mutation.pending`).
Pending reads.	Checks whether there are pending reads.	Read Repair (`stage.read.pending`).
Pending request responses.	Checks whether there are pending request responses.	Write (Mutation) (`stage.requestresponse.pending`).
Sudden drop in write requests.	Checks for a sudden drop in the number of Cassandra write requests.	Writes (`clientrequests.write.count`).

For more information about this sensor, see the Cassandra documentation.

Ceph


Event	Description	Metric
Ceph cluster status.	Ceph cluster is reporting a problem; `HEALTH_WARN` or `HEALTH_ERR`.	Status of the Ceph Cluster (`overall_status`).
Monitor quorum is not reached.	The number of healthy monitors is less than 50% of all monitors.	Number of monitors (`num_mons`) and number of active monitors (`num_active_mons`).
Osd(s) full capacity state.	Some OSDs are reporting a full state.	Number of full osds (`num_full_osds`).
Osd(s) near full capacity state.	Some OSDs are reporting a near-full state.	Number of near full osds (`num_near_full_osds`).

For more information about this sensor, see the Ceph documentation.
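
The monitor quorum rule compares the two listed metrics. The following Python sketch expresses the stated condition (fewer than 50% of all monitors are healthy); the function name is hypothetical, and returning `False` when no monitors are reported is an assumption.

```python
def quorum_not_reached(num_mons: int, num_active_mons: int) -> bool:
    """True when the active monitors are less than 50% of all monitors."""
    if num_mons <= 0:
        return False  # assumption: nothing to evaluate without monitors
    # Integer arithmetic avoids rounding: active / total < 0.5.
    return num_active_mons * 2 < num_mons
```

With three monitors, one active monitor triggers the condition and two do not. With four monitors, two active monitors are exactly 50% and therefore do not satisfy "less than 50%" as the rule is worded here.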

Consul (HashiCorp)


Event	Description	Metric
Consul cluster health.	Detects the overall health of the cluster and if any of the nodes are considered unhealthy by Autopilot.	Consul autopilot health status (`consul.autopilot.healthy`).

CRI-O


Event	Description	Metric
Memory exhausted.	Detects when the container memory usage exceeds specified limits.	RSS (`memory.total_rss`).

Docker


Event	Description	Metric
Memory exhausted.	When the container memory usage exceeds specified limits, a memory warning threshold or a memory critical threshold alert is displayed.	RSS (`memory.total_rss`).

For more information about this sensor, see the Docker documentation.

Elasticsearch

Elasticsearch Cluster


Event	Description	Metric
Cluster status.	Monitors the status of the Elasticsearch cluster.	Number of Elasticsearch nodes (`node_count`) and the status of the Elasticsearch cluster (`cluster_status`).
Elasticsearch is in split-brain situation.	Checks whether an Elasticsearch cluster has more than one master node. Split brain is also triggered for environments with two Elasticsearch clusters that have the same name.	Master node count in the Elasticsearch cluster.

Elasticsearch Node


Event	Description	Metric
Capacity limit while rebalancing.	Characterizes the node as being at the capacity limit by checking whether it is relocating shards while at the capacity limit.	Results of the capacity limit evaluation and shard relocation.
Heap overallocation.	Evaluates whether the heap size setting of Elasticsearch is too large.	Maximum heap size of the underlying JVM and the total memory on the underlying host.
High heap usage.	Checks the heap usage of the node along with the recent workload characteristics to detect whether the heap usage is too high.	Heap usage by the underlying JVM and workload characterization.
Node at capacity limits.	Checks whether the node is at the capacity limit, which is determined by the presence of the following issues: high load and CPU usage on the host, high heap usage, and high GC time in the Elasticsearch JVM.	High load and high CPU time on the host, high heap usage by Elasticsearch, and high GC time on the underlying JVM.
Node status.	Checks the cluster status provided by Elasticsearch.	High load and high CPU time on the host, high heap usage by Elasticsearch, and high GC time on the underlying JVM.
Rejected actions.	Checks for the number of rejected threads being too high.	Index (`threads.index_rejected`), search (`threads.search_rejected`), bulk (`threads.bulk_rejected`), and get (`threads.get_rejected`).

For more information about this sensor, see the Elasticsearch documentation.

Endpoint


Event	Description	Metric
Complete drop in calls.	Detects a rapid drop to zero (essentially the service is not being called anymore) in the values of the calls KPI metric relative to the values in the last 30 minutes. The magnitude of the drop in calls should also exceed the listed relative and absolute threshold parameters.	Calls/s (`count`).
Error rate too high.	Detects a consistently high error rate when the average errors KPI within the last four minutes is above the given threshold value.	Error Rate (`error_rate`).
Error rate too high for a Synthetic endpoint.	Detects a consistently high error rate of a Synthetic endpoint when the average errors KPI within the last four minutes is above the given threshold value.	Synthetic error rate (`synthetic_error_rate`).
Increasing trend in error rate.	Checks for the presence of an increasing trend in a given metric. The rule is tuned to detect weakly monotonic increases in the given metric. However, the detector is not strict and tolerates a certain amount of decrease in the metric value inside the trend candidate.	Error Rate (`error_rate`).
Sudden drop in calls.	Detects a rapid drop in the values of the calls KPI metric relative to the values in the last 30 minutes. The magnitude of the drop in calls should also exceed the listed relative and absolute threshold parameters.	Calls/s (`count`).
Sudden drop in Synthetic calls.	Detects a rapid drop in the values of the calls KPI metric relative to the values in the last 30 minutes. The magnitude of the drop in calls should also exceed the listed relative and absolute threshold parameters.	Synthetic calls/s (`synthetic_count`).
Sudden increase in error rate.	Detects a rapid increase in the values of the errors KPI relative to the KPI's values in the last 10 minutes. The magnitude of the increase in errors should also exceed the listed relative and absolute threshold parameters.	Error Rate (`error_rate`).
Sudden increase in latency.	Detects a rapid increase in the given latency KPI percentile relative to the KPI's values in the last 30 minutes. The magnitude of the increase in latency should also exceed the listed relative and absolute threshold parameters.	Latency 50th (`duration.50th`).
Sudden increase in latency for a fraction of requests.	Detects a rapid increase in the given latency KPI percentile relative to the KPI's values in the last 30 minutes. The magnitude of the increase in latency should also exceed the listed relative and absolute threshold parameters.	Latency 99th (`duration.99th`).
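
The "Error rate too high" rules evaluate the average error rate over the last four minutes against a threshold. The following Python sketch shows one way to express that check; the function name, the `(timestamp, rate)` sample shape, and the default threshold of 0.1 are assumptions for illustration, not the product's implementation or defaults.

```python
from statistics import mean


def error_rate_too_high(samples, threshold=0.1, window_seconds=240):
    """samples: iterable of (timestamp_seconds, error_rate) pairs, any order.

    Averages the error rates that fall inside the trailing window, measured
    back from the newest sample, and compares the mean with the threshold.
    """
    samples = list(samples)
    if not samples:
        return False
    latest = max(t for t, _ in samples)
    recent = [rate for t, rate in samples if latest - t < window_seconds]
    return mean(recent) > threshold
```

Averaging over the window means that a single old spike outside the last four minutes does not raise the event, while a sustained elevated rate inside the window does.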

etcd


Event	Description	Metric
Abnormally high disk backend commit duration.	Detects a high disk backend commit duration.	Disk backend commit duration (`health.disk_backend_commit_duration`).
Abnormally high disk wal fsync duration.	Detects a high disk WAL fsync duration.	Disk fsync duration (`health.disk_wal_fsync_duration`).
Abnormally high snapshot duration.	Detects high duration of saving a snapshot.	Snap save total duration (`health.debugging_snap_save_total_duration`).
Frequent leader changes seen in last minute.	Detects a high number of leader changes in the last minute.	Server leader changes (`health.server_leader_changes`).
Member doesn't have leader.	Detects a member who does not have a leader (unavailable).	Server has leader (`health.server_has_leader`).
Proposal ratio analysis.	Detects an unusual fall in applied proposals and an unusual rise in pending and failed proposals.	Number of proposals committed (`health.server_proposals_committed`), number of proposals applied (`health.server_proposals_applied`), number of proposals pending (`health.server_proposals_pending`), and number of proposals failed (`health.server_proposals_failed`).
Usage of open file descriptors is critical.	Detects a high usage of open file descriptors.	Number of open file descriptors (`health.process_open_fds`) and the maximum number of file descriptors (`health.process_max_fds`).

For more information about this sensor, see the etcd documentation.

Garden Container


Event	Description	Metric
Memory exhausted.	Container memory usage is getting close to its memory limit.	Usage (`memory.usage`).

For more information about this sensor, see the Garden documentation.

Glassfish


Event	Description	Metric
Glassfish file cache hit rate is below 70%.	A processing pipeline checks the file cache hit rate and validates whether it's lower than the given threshold value.	Hit rate (`file_cache_rate`).
Maximum number of JDBC connections reached.	A processing pipeline checks the total number of JDBC connections. It validates whether it's reaching the maximum limit for the server configuration.	Used (`jdbc_connection_used`).

For more information about this sensor, see the Glassfish documentation.

Google Cloud Datastore


Event	Description	Metric
Datastore request count dropped significantly in last 30 minutes.	Checks for a sudden decrease in the request count.	Requests (`request_count`).
Datastore request count increased significantly in last 30 minutes.	Checks for a sudden increase in the request count.	Requests (`request_count`).

For more information about this sensor, see the Google Cloud Datastore documentation.

Google Cloud Storage


Event	Description	Metric
Sudden increase in size of all objects	Checks for a sudden increase in the size of all objects within 24 hours for non-empty buckets.	Total size of all objects in the bucket.

For more information about this sensor, see the Google Cloud Storage documentation.

Google Cloud Pub/Sub


Event	Description	Metric
The push request latency for the subscription has increased in last 10 minutes.	Checks for sudden increase of push request latency for the subscription.	Request Latency (`push_request_latencies`)
Topic oldest message.	Checks whether there are messages on the topic that are older than the threshold value.	Oldest Message (`oldest_unacked_message_age`)

For more information about this sensor, see the Google Cloud Pub/Sub documentation.

Hadoop YARN


Event	Description	Metric
Resource manager is reporting lost node.	Detects if the resource manager is reporting lost nodes.	Lost Nodes (`lostNodes`).
Resource manager is reporting unhealthy node.	Detects if the resource manager is reporting unhealthy nodes.	Unhealthy Nodes (`unhealthyNodes`).
Submitted app has failed.	Detects if a submitted app has failed.	Apps Failed (`appsFailed`).

For more information about this sensor, see the Hadoop YARN documentation.

HAProxy


Event	Description	Metric
HAProxy backend average queue size is high.	HAProxy backend average queue size is large.	Backend Queue Size.
HAProxy frontend session usage is high.	HAProxy frontend session usage is high.	Frontend Session Utilization.
Sudden increase in average response time.	Checks for a sudden increase in the average response time of a single backend.	Average response time metrics.

For more information about this sensor, see the HAProxy documentation.

Hazelcast

Starting with Hazelcast 3.3, the public method `HazelcastInstance::getPartitionService()::isLocalMemberSafe()` is used. For older Hazelcast versions, the health status is derived from an internal "has ongoing migrations" status on each local node.

The Hazelcast cluster health status is aggregated from each Hazelcast node. This is exactly what `HazelcastInstance::getPartitionService()::isClusterSafe()` does internally, but without the additional overhead of calling this method.
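
The aggregation from node status to cluster status can be pictured with the following Python sketch. It assumes that the cluster counts as safe only when every node reports its local member as safe; the text says only that the status is "aggregated", so the logical AND, the function name, and the handling of an empty list are all assumptions.

```python
def cluster_is_safe(local_member_safe_flags):
    """Combine per-node safety flags into one cluster status flag.

    local_member_safe_flags: one boolean per Hazelcast node, for example
    the value each node obtained from isLocalMemberSafe().
    """
    flags = list(local_member_safe_flags)
    # Assumption: with no reporting nodes the cluster is not considered safe.
    return bool(flags) and all(flags)
```

Because each node already reports its own flag, the combined status needs no extra call into the cluster, which matches the stated goal of avoiding additional overhead.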

Hazelcast Cluster


Event	Description	Metric
Cluster status.	Checks the cluster status of Hazelcast. Hazelcast 3.3 or above.	Hazelcast cluster status flag.

Hazelcast Node


Event	Description	Metric
Node status.	Checks the status of the local member. Hazelcast 3.3 or above.	Hazelcast node status flag.

For more information about this sensor, see the Hazelcast IMDG documentation.

HBase


Event	Description	Metric
Difference between number of stores and number of store files is significant.	Detects unusually low or unusually high number of stores.	Stores count (`rs_store_count`) and stores files count (`rs_store_file_count`).
Region server block cache hit ratio is low.	Detects low cache hit ratio.	Block cache hit rate (`rs_blk_cache_hit_rate`) and block cache hit count (`rs_blk_cache_hit_count`).
Significant increase in compaction queue length.	Checks for a sudden increase in the length of the compaction queue. This rule indicates that all regions are growing at a similar rate and need to split/compact at around the same time. This can be addressed by pre-splitting or turning off auto-compactions.	Compaction queue length (`rs_comp_queue_length`).
Significant increase in flush queue length.	Checks for a sudden increase in the length of the flush queue. When triggered, this can be an indication of a lack of RAM or that flushes are faster than what disks can handle.	Flush queue length (`rs_flush_queue_length`).

For more information about this sensor, see the Apache HBase documentation.

Host


Event	Description	Metric
CPU spends significant time waiting for input/output.	Checks whether the system spends significant time waiting for input/output (sampling in a sliding window of 60 seconds).	Wait (`cpu.wait`).
CPU Steal Time exceeded.	Checks, on a per-second moving window, whether too much CPU is stolen between running processes or by the hypervisor or host OS (sampling in a sliding window of 60 seconds).	Steal (`cpu.steal`).
Device has low capacity left or is full.	Detects low disk capacity problems to give an early prediction of a possible capacity breach up to 15 minutes in advance. The detector does not fire when the remaining disk space is more than 1 GB or 1% of the total capacity. However, it fires if either the disk has no remaining space (<1 MB) or the disk would fill up within the next 15 minutes based on the current trend.	The disk's free storage capacity.
Disk fills up faster than it is being purged.	Detects long-term disk capacity problems and fires when the disk is likely to run out of capacity within the next 48 hours. The detector does not fire when the remaining disk space is more than 20% of the total capacity. However, it fires when the disk would fill up within the next 48 hours based on the current trend. This trend is computed from local minima collected over time. When these local minima span a timeframe of at least 4 hours, a linear regression model is fitted to these data points to produce the long-term forecast.	The disk's free storage capacity.
Frequent TCP errors.	Checks whether the host has an unusually high number of TCP errors (sampling in a sliding window of 60 seconds).	In Segments/s (`tcp.inSegs`) and error (`tcp.errors`).
Frequent TCP fails.	Checks whether the host has an unusually high number of TCP fails (sampling in a sliding window of 60 seconds).	Fail (`tcp.fails`) and open/s (`tcp.opens`).
Permanent TCP retransmissions.	Checks whether the host has an unusually high number of TCP retransmissions (sampling in a sliding window of 60 seconds).	Retransmission (`tcp.retrans`) and out Segments/s (`tcp.outSegs`).
System load too high.	Checks whether the system load is too high, by comparing the load against 2 times the CPU cores of the machine (sampling in a sliding window of 120 seconds).	Load (`load.1min`).
System memory exhausted.	Checks whether the system memory is close to being exhausted (triggered instantly).	Free (`memory.free`) and used (`memory.used`).
Too many open files.	Processes are opening files faster than they close them (current vs max ratio exceeds threshold).	Used (`openFiles.used`).
Too many used inodes.	Low level of free inodes on filesystem triggers this health rule (current vs max ratio exceeds threshold).	inode usage.
Too much CPU usage by user processes.	Checks whether CPU usage of user processes is too high (sampling in a sliding window of 180 seconds).	User (`cpu.user`) and topPID.
You will run out of disk space soon.	Detects short-term capacity problems of a disk and fires when the disk is likely to run out of capacity within the next hour. The detector does not fire when the disk freed up a considerable amount of space (>=100 MB) in the recent past, or when the remaining disk space is more than 20% of the total capacity. However, it fires when the disk would fill up within the next hour based on the current trend. This trend is computed from a linear regression model fitted to the data points of the current sliding window.	The disk's free storage capacity.
Windows service status is changed.	Checks whether the Windows service status is changed (sampling in a sliding window of 60 seconds).	Windows service status (`state`).

For more information about this sensor, see the Host documentation.
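
The disk rules in this table forecast when free space reaches zero by fitting a linear regression to recent data points. The following Python sketch illustrates the idea for the one-hour rule. It is a simplified sketch under stated assumptions: the function names and the `(timestamp, free_bytes)` sample shape are hypothetical, an ordinary least-squares fit stands in for whatever model the product uses, and the check for recently freed space (>=100 MB) is omitted.

```python
def seconds_until_full(samples):
    """Fit a least-squares line through (timestamp_s, free_bytes) samples.

    Returns the predicted seconds from the last sample until free space
    reaches zero, or None when free space is not shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_f = sum(f for _, f in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (f - mean_f) for t, f in samples) / var_t
    if slope >= 0:
        return None  # free space is flat or growing
    intercept = mean_f - slope * mean_t
    t_zero = -intercept / slope  # time at which the fitted line hits zero
    return max(0.0, t_zero - samples[-1][0])


def will_run_out_soon(samples, total_bytes, horizon_s=3600):
    """True when the trend predicts a full disk within `horizon_s` seconds."""
    if not samples:
        return False
    free_now = samples[-1][1]
    if free_now > 0.20 * total_bytes:
        return False  # more than 20% free: the rule stays quiet
    eta = seconds_until_full(samples)
    return eta is not None and eta <= horizon_s
```

Passing `horizon_s=48 * 3600` and feeding the function local minima instead of raw samples would approximate the long-term rule in the same way.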

IBM ACE


Event	Description	Metric
Status of ACE Integration Server	Checks the status of the ACE Integration Server.	Integration Server State
ACE Integration Server status digital format	Checks the digital status of the ACE Integration Server.	Integration Server State Metrics
Queue Manager connection status digital format	Checks the digital status of the connection between the ACE Integration Server and the Queue Manager.	Queue Manager Connection Status Metrics
Message with errors number	Number of messages that contain errors.	Number of Messages with Errors
Message flow with errors number	Number of MQGET errors for MQInput nodes or Web Services errors for HTTPInput nodes.	Number of MQGET Errors
Message processing with errors number	Number of errors that occur when processing a message.	Number of Messages with Errors
Message flow status	Checks the status of the ACE Message Flow.	Message Flow Status
Message flow status digital format	Checks the digital status of the ACE Message Flow.	Message Flow Status Metrics

For more information about this sensor, see the IBM ACE documentation.

IBM Db2


Event	Description	Metric
Table Space Utilities metrics status	Check for events that are related to table space and its metrics when the auto resize feature is enabled and disabled.	Table Space Utilities
HADR Connect Status	Checks for events that are related to the connection status of the HADR standby databases. The standby ID is used as a filter to generate the `HADR_CONNECT_STATUS` event, which is specific to a standby node and can be set with the standby ID in the matching operator field. The events can be created based on the following values, which represent the current state of a database: the database is connected (`Connect State = CONNECTED` as 1), or the database is disconnected (`Connect State = DISCONNECTED` as 0).	`HADR_CONNECT_STATUS` (`hadr.standbyId.HADR_CONNECT_STATUS`). Matching operators that are set to `any` generate the events irrespective of the standby ID.

For more information about this sensor, see the IBM Db2 documentation.

IBM MQ

IBM MQ Queue Manager


Event	Description	Metric
Queue Manager number of connections	Checks whether there are no connections currently on Queue Manager.	Connection count (`connectionCount`)
Queue Manager status	Checks whether Queue Manager is in the stopped or standby state to trigger the `Down` or `Switchover` event.	Queue Manager Status (`statusMetric`)
Channel Initiator status for Queue Manager	Checks whether Channel Initiator is in a running state.	Channel Initiator Status (`channelInitiatorStatus`)
Publish/Subscribe Engine status for Queue Manager	Checks whether Publish or Subscribe engine is in a running state.	Publish/Subscribe Engine Status (`pubsubStatus`)
Bridge stopped^[1]	Indicates that the IMS bridge is stopped.	From IBM MQ events

IBM MQ Queue


Event	Description	Metric
Queue oldest message	Checks whether the queue has messages that are older than the threshold value.	Oldest message on queue (`oldestMessage`)
Queue depth diff	Checks whether the queue depth is approaching the maximum queue depth value.	Queue depth (`queueDepth`) and max queue depth (`maxQueueDepth`)
Queue Full	Checks whether the queue depth percentage has reached the warning or critical value.	Queue Depth Percentage (`queueFullPercentage`)
Transmission Queue High	Checks whether the number of transmission queue messages is too high.	Queue depth (`queueDepth`)
Queue Service Interval High^[1]	Detects that no successful GET operations or MQPUT calls occurred within an interval that is greater than the limit that is specified in the `QServiceInterval` attribute.	From IBM MQ events
Queue Depth High^[1]	Indicates that an MQPUT or MQPUT1 call has caused the queue depth to increase to the threshold that is specified in the `QDepthHighLimit` attribute.	From IBM MQ events
Queue Full^[1]	Indicates a call failure (on an MQPUT or MQPUT1 call) because the queue is full. That is, the queue already contains the maximum number of messages that is possible.	From IBM MQ events

IBM MQ Channel


Event	Description	Metric
Channel status	Checks whether the channel is in a healthy state.	Channel status (`channelStatus`)
Channel InDoubt status	Checks whether the channel is in the InDoubt status.	Channel status (`channelStatus`)
Channel conversion error^[1]	Indicates that a channel was unable to complete data conversion because the MQGET call to get a message from the transmission queue resulted in a data conversion error.	From IBM MQ events
Channel SSL Error^[1]	Indicates an error when a channel that uses Transport Layer Security (TLS) or Secure Sockets Layer (SSL) fails to establish an MQ connection.	From IBM MQ events

Built-in events are available for channels in Stopped and InDoubt status. For channels in any other status, create custom events based on the built-in metrics. For the enumeration values of channel status, see IBM MQ channel metrics reference.

IBM MQ Listener


Event	Description	Metric
Listener status	Checks whether the listener is in a healthy state.	Listener status (`listenerStatus`)

For more information about this sensor, see the IBM MQ documentation.

IIS Internet Information Server


Event	Description	Metric
Sudden drop in requests to IIS-site.	Checks for a sudden drop in the requests for an IIS site.	Total requests metric of an IIS site.

For more information about this sensor, see the Microsoft IIS documentation.

IBM DataPower

IBM DataPower Appliance


Event	Description	Metric
Appliance percentage of CPU usage	Checks whether the appliance CPU usage percentage is too high.	CPU Usage (`cpuUsage`)
Appliance percentage of memory usage	Checks whether the appliance memory usage percentage is too high.	Memory Usage (`memoryUsage`)
Appliance percentage of system load	Checks whether the appliance system load percentage is too high.	System Load (`systemLoad`)
Appliance status	Checks whether the appliance is in a healthy state.	Status (`status`)

IBM DataPower Domain


Event	Description	Metric
Domain percentage of memory usage	Checks whether the domain memory usage percentage is too high.	Current Memory Usage (`currentMemUsage`)
IBM DataPower Gateway Peering status	Checks whether the gateway peering status of each instance is broken.	Broken status (`brokenStatus`)

IBM DataPower Service


Event	Description	Metric
Service percentage of memory usage	Checks whether the service memory usage percentage is too high.	Current Memory Usage (`currentMemUsage`)
Service status	Checks whether the service is in a healthy state.	Status (`status`)

For more information about this sensor, see the IBM DataPower documentation.

JBoss


Event	Description	Metric
Average errors on connector too high.	A processing pipeline detects the number of errors that occurred on connectors in the given time window and checks whether the number of errors is greater than the threshold value.	JBoss connector errors.
ConnectionPool is running out of connections.	A processing pipeline detects the used connections ratio and checks if it is about to reach the threshold value.	JBoss connection pool connections used ratio.
Connections on datasources run out.	A processing pipeline detects the number of available connections on data sources in the given time window and checks if the total number of connections is about to reach the threshold value.	JBoss datasources connections used, datasources connections available.
ThreadPool is running out of threads.	A processing pipeline detects the number of max threads and checks if the current thread count is about to reach the threshold value.	JBoss thread pool current thread count, thread pool max threads.

For more information about this sensor, see the JBoss AS documentation.

JBoss Data Grid


Event	Description	Metric
Caches not in the running state.	Checks the ratio of the number of caches created to the number of caches running in JBoss Data Grid. If the ratio is below a certain value, it is considered a violation.	Running and created caches of cache managers.

For more information about this sensor, see the JBoss Data Grid documentation.

JVM


Event	Description	Metric
Garbage collection activity high.	A processing pipeline monitors the Garbage Collection time spent by the JVM Runtime Platform and validates it against a threshold.	JVM Garbage Collection.
JVM code cache is full.	A processing pipeline monitors the maximum Code Cache usage of the JVM Runtime Platform.	JVM maximum Code Cache usage.
Perm Gen is full (CMS).	A processing pipeline detects the maximum Perm Gen CMS Pools utilized.	`pools.CMS Perm Gen`
Perm Gen is full (G1).	A processing pipeline detects the maximum Perm Gen G1 Pools utilized.	`pools.G1 Perm Gen`
Perm Gen is full (PS).	A processing pipeline detects the maximum Perm Gen PS Pools utilized.	`pools.PS Perm Gen`
Threads are deadlocked.	A detector monitors the JVM Runtime Platform and detects if there are any Deadlocked threads.	Number of threads deadlocked (`threads.deadlocked`).
J9VM Memory Leak.	A detector checks the growth rate of heap used after GC, in MB per hour, to detect a possible memory leak in the JVM. IBM J9 VM memory leak detection is an optional feature that is disabled by default in the Instana backend. To enable it, see the page for your Instana deployment: SaaS, Self-Hosted Custom Edition (Kubernetes or Red Hat OpenShift Container Platform), or Self-Hosted Classic Edition (Docker).	`memory.gc.after` and `memory.gc.before`

For more information about this sensor, see the JVM documentation.

Kafka

Kafka Cluster


Event	Description	Metric
Number of active controllers.	Checks for an unusual number of active controllers in the Kafka cluster.	Broker active controller count (`broker.activeControllerCount`).

Kafka Node


Event	Description	Metric
Kafka network thread is under high load.	Checks whether the Kafka network thread is under high load.	Network Processor (`broker.networkProcessorIdle`).
Kafka request handler thread is under high load.	Checks whether the Kafka request handler is under high load.	Request Handler (`broker.requestHandlerIdle`).
Leader elections are too often.	Checks whether there are too many leader elections within a given timeframe.	Leader Elections (`broker.leaderElections`).
Potential data loss due to unclean leader election.	Checks for potential data loss due to unclean leader elections.	Unclean Leader Elections (`broker.uncleanLeaderElections`).
Producers and consumer are blocked.	Checks whether producers and consumers are blocked because partitions are offline.	Offline Partitions (`broker.offlinePartitionsCount`).
The number of in-sync replicas has shrunk.	Checks whether the number of in-sync replicas has shrunk and did not recover within the given interval.	ISR shrinks (`broker.isrShrinks`) and ISR expansions (`broker.isrExpansions`).
Under-replicated partitions.	Checks whether the number of under-replicated partitions exceeds the expected number.	Under-replicated partitions (`broker.underReplicatedPartitions`).

For more information about this sensor, see the Kafka documentation.

Kubernetes

Kubernetes Cluster


Event	Description	Metric
Kubernetes Cluster component status.	Kubernetes reports that a Master-Component (API-server, scheduler, controller manager) is unhealthy. Due to a bug in Kubernetes, the health is not always reliably reported. Instana tries to filter out these cases so that they do not cause an alert and are shown only on the Cluster detail page.	Instana low level events.

Kubernetes DaemonSet


Event	Description	Metric
Available replicas is less than desired replicas.	Checks whether the total number of available replicas is less than the number of desired replicas. This indicates that the Kubernetes DaemonSet is missing replica pods.	Desired (`desiredReplicas`) and available (`availableReplicas`).

Kubernetes Deployment


Event	Description	Metric
Available replicas is less than desired replicas.	Checks whether the total number of available replicas is less than the number of desired replicas. This indicates that the Kubernetes Deployment is missing replica pods.	Desired (`desiredReplicas`) and available (`availableReplicas`).

Kubernetes Namespace


Event	Description	Metric
Allocatable cpu requests too low.	Requested CPU is approaching max capacity (requested CPU/CPU capacity ratio is greater than 80%).	CPU Requests Allocation (`required_cpu_percentage`).
Allocatable memory requests too low.	Requested memory is approaching max capacity (requested memory/memory capacity ratio is greater than 80%).	Memory Requests Allocation (`required_mem_percentage`).
Allocatable pod count too low.	Allocated pods are approaching maximum capacity (allocated pods/pods capacity ratio is greater than 80%). For a namespace, pods in the phases `Pending`, `Running`, and `Unknown` are counted as allocated. The namespace capacity values are based on ResourceQuotas, which can be set per Namespace. For more information, see the Kubernetes documentation.	Pods Allocation (`used_pods_percentage`).

Kubernetes Node


Event	Description	Metric
Allocatable CPU too low.	Requested CPU is approaching max capacity (requested CPU/CPU capacity ratio is greater than 80%).	CPU Requests Allocation (`required_cpu_percentage`).
Allocatable memory too low.	Requested memory is approaching max capacity (requested memory/memory capacity ratio is greater than 80%).	Memory Requests Allocation (`required_mem_percentage`).
Allocatable pod count too low.	Allocated pods are approaching maximum capacity (allocated pods/pods capacity ratio is greater than 80%). For a node, pods in the phases `Running` and `Unknown` are counted as allocated. For more information, see the Kubernetes documentation.	Pods Allocation (`alloc_pods_percentage`).
Kubernetes Node condition status.	The node reports a condition other than `Ready` for more than one minute. This covers every node condition besides the `Ready` condition. For more information, see the Kubernetes documentation.	Instana low level events.
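
The pod allocation events for namespaces and nodes differ only in which pod phases count as allocated. The following sketch illustrates that difference under stated assumptions: the function names and the sample data are hypothetical, and only the phase sets and the 80% limit come from the tables above.

```python
# Pod phases that count as "allocated", as described in the tables above.
NAMESPACE_ALLOCATED_PHASES = {"Pending", "Running", "Unknown"}
NODE_ALLOCATED_PHASES = {"Running", "Unknown"}


def pods_allocation_percentage(pod_phases, capacity, allocated_phases):
    """Return allocated pods as a percentage of the pod capacity."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    allocated = sum(1 for phase in pod_phases if phase in allocated_phases)
    return 100.0 * allocated / capacity


def allocation_too_low(percentage, limit_pct=80.0):
    """True when the allocation ratio is greater than the limit (80% above)."""
    return percentage > limit_pct


if __name__ == "__main__":
    # Hypothetical sample: 8 running pods, 1 pending pod, 1 completed pod.
    phases = ["Running"] * 8 + ["Pending", "Succeeded"]
    node_pct = pods_allocation_percentage(phases, 10, NODE_ALLOCATED_PHASES)
    ns_pct = pods_allocation_percentage(phases, 10, NAMESPACE_ALLOCATED_PHASES)
    print(node_pct, allocation_too_low(node_pct))  # Pending is not counted for a node
    print(ns_pct, allocation_too_low(ns_pct))      # Pending is counted for a namespace
```

With the same pods, the namespace view crosses the limit while the node view does not, because the pending pod is counted only for the namespace.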

Kubernetes Pod


Event	Description	Metric
Kubernetes Pod condition status.	A pod is not ready for more than one minute, and the reason is not that it’s completed. (PodCondition=Ready, Status=False, Reason != PodCompleted). For more information, see the Kubernetes documentation.	Instana low level events.

For more information about this sensor, see the Kubernetes documentation.

Kubernetes Cost


Event	Description	Metric
Kubecost vCPU usage > 200	A warning is logged because usage is nearing the 250 vCPU limit of the Free license.	`coreCountStats.totalCoreCount`
Kubecost vCPU usage > 250	The Free license supports up to 250 vCPUs only.	`coreCountStats.totalCoreCount`

Large language models (LLMs)


Event	Description	Metric
OTel LLMs Status Threshold.	When the LLM is down, the alarms are activated.	Status (`llm.status`).
OTel LLMs Response Duration.	When the LLM response duration exceeds the specified threshold, the alarms are activated.	Latency (`llm.response.duration.max`).

Memcached Nodes


Event	Description	Metric
Flush all command executed.	Detects a high number of `flush_all` commands.	Flush (`cmd_flush`).
High key eviction.	Detects a high number of key evictions.	Evictions (`evictions`).
Number of queued connections increases.	Detects a high number of queued connections.	Queued (`conn_queued`).
Number of yielded connections increases.	Detects a high number of yielded connections.	Yields (`conn_yields`).
Used bytes by Memcached reached maxbytes limit.	The bytes used by Memcached reached the `maxbytes` limit.	Used bytes.

For more information about this sensor, see the Memcached documentation.

MongoDB Node


Event	Description	Metric
Continuously increasing background flushing latency.	Database reports increasing background flushing latency (sampling in a sliding window of 150 seconds).	Last background flushing latency (`backgroundFlushingLast`).
Continuously increasing lock queue length.	Monitors the MongoDB Lock Queue metric and validates if the lock queue size is increasing too fast.	Lock Queue Length (`lockQueue`).
Increasing page faults.	Increasing page faults (sampling in a sliding window of 150 seconds).	Number of Page Faults (`pageFaults`).
Journal commits in write lock growing	Journal commits in write lock growing (sampling in a sliding window of 150 seconds).	Journal Write Lock (`journalWriteLock`).
Too high ratio of non-mapped virtual memory	Too high ratio of non-mapped virtual memory (triggered instantly and reported by the Instana Host sensor).	`Virtual` and `mapped`.

MongoDB Replica Set


Event	Description	Metric
ReplicaSet has member(s) down.	The member, as seen from another member of the set, is unreachable.	`unreachableNodeCount`.
ReplicaSet monitoring status.	Monitors the health of all the members of MongoDB replica set.	Slave Delays Count (`slaveDelaysCount`), optimes count (`optimesCount`), and monitored members count (`monitoredMembersCount`).
Replication lag is growing.	Replication lag is growing (sampling in a sliding window of 150 seconds).	Slave Delays (`slaveDelays`) and Optimes (`optimes`).
Replica Set connection usage is high.	Number of active connections is more than 90% of the maximum connections.	Connections (`connections`).

For more information about this sensor, see the MongoDB documentation.

MySQL DB


Event	Description	Metric
Available server connections are at limit.	Ratio of used connections to the connection limit is greater than the configured ratio threshold.	Connections (`status.THREADS_CONNECTED`).

For more information about this sensor, see the MySQL documentation.

Nginx Server


Event	Description	Metric
Nginx has a problem with offline peers.	Inactive Peer (available only for NGINX Plus).	Upstreams failed (`nginx_plus.http.upstreams.peers.failed`).
Nginx is dropping connections.	Dropped connections.	Dropped connections (`connections.dropped`).
Nginx is failing with SSL handshakes.	Failed SSL handshakes (available only for NGINX Plus).	Failed handshakes (`nginx_plus.ssl.handshakes_failed`).
Number of active connections is close to the max.	Used connections ratio exceeds the configured ratio threshold for used connections.	Active connections (`connections.active`).

For more information about this sensor, see the NGINX documentation.

Node.js App


Event	Description	Metric
Garbage collection activity high.	Checks whether the time spent in GC in the given window is above the given threshold.	GC pause metrics.
Health checks are failing.	Checks whether there are any failing health checks. For more information, see Health check support.	Health check result (`healthcheckResult`).

For more information about this sensor, see the Node.js documentation.

OpenShift Deployment Config


Event	Description	Metric
Available replicas is less than desired replicas.	Checks whether the total number of available replicas is less than the number of desired replicas. This indicates that the OpenShift DeploymentConfig is missing replica pods.	Desired (`desiredReplicas`) and available (`availableReplicas`).

For more information about this sensor, see the OpenShift documentation.

OTel Host


Event	Description	Metric
CPU Wait time exceeded	Checks whether the system spends a significant amount of time waiting for input or output operations.	CPU Wait (`cpu.wait`)
CPU Steal time exceeded	Specifies the number of allowed CPU Wait violations within a time frame.	CPU Steal (`cpu.steal`)
CPU usage high	Checks whether the CPU use is high. This event continuously evaluates data over the most recent 180-second interval.	CPU User (`cpu.user`)
System load too high	Checks whether the system load is high by comparing the load against two times the CPU cores of the machine. This event continuously evaluates data over the most recent 120-second interval.	Load (`load.avg_1m`)
System memory exhausted	Checks whether the system memory is close to fully used (triggered instantly).	Memory free (`memory.free`) and Memory used (`memory.used`)
Disk low capacity	Detects short-term capacity problems of a device that has less free space than a static threshold (1 GB) or less than 1% of the total volume size. It also triggers when the remaining time until zero capacity, given the current rate of change, is under 15 minutes.	Disks free storage capacity

For more information about this sensor, see the OpenTelemetry documentation.
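
The Disk low capacity and System load too high rules combine several conditions, so a small sketch can make the logic concrete. This is an illustration under stated assumptions rather than the sensor's code: the function and parameter names are hypothetical, 1 GB is taken as 1024³ bytes, and the rate of change is assumed to be expressed in bytes per second (negative when free space shrinks).

```python
GIB = 1024 ** 3  # assumption: "1 GB" is interpreted as 2**30 bytes


def disk_low_capacity(free_bytes, total_bytes, change_bytes_per_sec,
                      min_free_bytes=GIB, min_free_ratio=0.01,
                      min_seconds_to_zero=15 * 60):
    """True if free space is under 1 GB, under 1% of the volume,
    or projected to reach zero in under 15 minutes."""
    if free_bytes < min_free_bytes or free_bytes < min_free_ratio * total_bytes:
        return True
    if change_bytes_per_sec < 0:  # free space is shrinking
        seconds_to_zero = free_bytes / -change_bytes_per_sec
        return seconds_to_zero < min_seconds_to_zero
    return False


def system_load_too_high(load_avg_1m, cpu_cores):
    """True if the 1-minute load exceeds two times the number of CPU cores."""
    return load_avg_1m > 2 * cpu_cores


if __name__ == "__main__":
    # 50 GiB free on a 100 GiB volume, shrinking by 1 GiB per second.
    print(disk_low_capacity(50 * GIB, 100 * GIB, -GIB))
    print(system_load_too_high(9.0, 4))
```

The projection branch only runs when free space is decreasing, which avoids a division by zero for a stable or growing volume.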

OracleDB


Event	Description	Metric
Ratio between DB CPU Time and DB Time is low.	Ratio between DB CPU Time and DB Time is below the configured threshold.	DB CPU Time/DB Time Ratio (`stats.cpuTimeDbTimeRatio`).
Tablespace space usage is high.	Tablespace used space is greater than the configured amount of maximum space.	Tablespace used space percentage.
Total amount of sessions at maximum.	Used sessions ratio exceeds the configured used sessions ratio threshold.	Sessions/Session Limit (`stats.usedSessionsRatio`).

For more information about this sensor, see the OracleDB documentation.

OS process


Event	Description	Metric
CPU Usage	Process is causing high CPU usage on host.	The result of a high CPU usage rule evaluation on the underlying host and the CPU user time of the given process.
Open Files Usage.	Process is opening files faster than it closes them (current vs max ratio exceeds threshold).	Used (`openFiles.used`).
Abnormal termination.	Process terminated as a result of an uncaught signal.
Abnormal termination.	Process terminated with a non-zero exit code.

For more information about this sensor, see the OS process documentation.

PHP-FPM Runtime


Event	Description	Metric
Frequent restarts of PHP-FPM worker pool.	Checks for frequent restarts of a PHP-FPM worker pool by evaluating the number of its restarts in a given time window against a given threshold.	Start times for a worker pool.
Listen Backlog configured over capacity.	Checks whether the listen backlog of a worker pool is over the configured capacity.	Worker pool queue length.
Too many connections reset.	Checks whether the number of connection resets is above the given threshold in the given time window.	Connection resets metric for worker pool.
Too many requests piling up in Listen Backlog.	Checks the size for various PHP-FPM worker queues and validates it against the threshold value.	Listen queue size metrics for various PHP-FPM worker queues.
Too many slow requests.	Checks the ratio of slow requests on all monitored PHP-FPM worker pools.	Slow requests and accepted connection metric for a worker pool of a PHP-FPM instance.

For more information about this sensor, see the PHP documentation.

Synthetic Check


Event	Description	Metric
Remote target is not reachable.	Checks whether the percentage of failed communication attempts in the given sliding window is above the given threshold.	Status of Ping (`status`). An HTTP status code in the range 200-206 or 300-307 results in a healthy status. For ICMP, the exit value 0 is treated as healthy and the value 1 as unhealthy. In addition, a maximum execution time of 2 seconds is set.

For more information about this sensor, see the Synthetic Check documentation.
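
The health rules for the Synthetic Check status can be sketched as follows. The status code ranges and ICMP exit values come from the table above; the function names, the representation of the sliding window as a list of booleans, and the threshold value in the example are hypothetical.

```python
def http_is_healthy(status_code: int) -> bool:
    """HTTP status codes 200-206 and 300-307 count as healthy."""
    return 200 <= status_code <= 206 or 300 <= status_code <= 307


def icmp_is_healthy(exit_value: int) -> bool:
    """For ICMP, exit value 0 is healthy; any other value is treated as unhealthy."""
    return exit_value == 0


def failure_percentage(results) -> float:
    """Percentage of unhealthy results in a window of booleans (True = healthy)."""
    if not results:
        return 0.0
    failures = sum(1 for healthy in results if not healthy)
    return 100.0 * failures / len(results)


def target_unreachable(results, threshold_pct: float) -> bool:
    """True when the failure percentage in the window is above the threshold."""
    return failure_percentage(results) > threshold_pct


if __name__ == "__main__":
    # Hypothetical window of four HTTP checks.
    window = [http_is_healthy(code) for code in (200, 500, 503, 302)]
    print(failure_percentage(window), target_unreachable(window, 40.0))
```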

PostgreSQL DB


Event	Description	Metric
Active connection usage.	Number of active connections is more than 90% of the maximum connections.	Connection Usage (`max_conn_pct`).

For more information about this sensor, see the PostgreSQL documentation.

Process


Event	Description	Metric
High CPU usage.	Evaluates whether the given process is causing high CPU usage on a host.	Results of high CPU usage rule evaluation on the underlying host and CPU user time of the given process.
Too many open files.	Open files percentage is higher than the configured threshold.	Used (`openFiles.used`).

RabbitMQ

RabbitMQ Cluster


Event	Description	Metric
RabbitMQ network partition detected	Detects whether a network partition occurs inside the RabbitMQ cluster (triggered every 5 seconds).	Total number of network partitions (`net_partitions_count`).

RabbitMQ Server


Event	Description	Metric
Queues are filling up with messages	Over a period of 10 minutes, queues are filling up with messages that are not delivered.	Messages ready (`overview.messages_ready`) and messages acknowledged (`overview.ack`).
RabbitMq has no consumers	In the last 5 seconds, RabbitMQ has had no consumers.	Consumers (`overview.consumers`).
RabbitMq has no connections	In the last 5 seconds, RabbitMQ has had no connections.	Connections (`overview.connections`).

RabbitMQ Nodes


Event	Description	Metric
RabbitMQ File Descriptors Usage is critical.	File descriptors usage rate is critical on a specific node (Warning: > 90%, Critical: > 98%). This is triggered every 5 seconds.	RabbitMQ file descriptors used rate (`fd_used_rate`).
RabbitMQ Memory Usage is critical on node.	Memory usage rate is critical on a specific node (Warning: > 90%, Critical: > 98%). This is triggered every 5 seconds.	RabbitMQ memory used rate (`mem_used_rate`).
RabbitMQ Erlang Processes count is critical.	Erlang Processes count is critical on a specific node (Warning: > 90%, Critical: > 98%). This is triggered every 5 seconds.	RabbitMQ processes rate.

RabbitMQ Queues


Event	Description	Metric
More messages are being produced than consumed.	More messages are being published to a queue than the consumers can process from a queue.	RabbitMQ unacknowledged messages in a queue.

For more information about this sensor, see the RabbitMQ documentation.

Redis

Redis Cluster


Event	Description	Metric
Redis cluster state isn't ok.	Cluster is in an inappropriate state.	`cluster_state`.

Redis Node


Event	Description	Metric
Memory allocation analysis.	Redis server is causing external memory fragmentation.	Used memory (`used_memory`) and memory fragmentation ratio (`mem_fragmentation_ratio`).
Redis hit rate is low.	Redis hit rate is below the configured threshold.	Cache hit rate (`hit_rate`), keyspace hits (`keyspace_hits`), keyspace misses (`keyspace_misses`), and Redis evicted keys (`evicted_keys`).
Redis memory usage is getting closer to max memory limit.	Redis memory usage is getting closer to max memory limit.	Used memory (`used_memory`).
Redis rejecting connections.	Redis is rejecting connections.	Number of rejected connections (`rejected_connections`).
Redis slave node can't connect to master node.	Redis slave node can't connect to the master node.	`master_downtime_seconds`.

For more information about this sensor, see the Redis documentation.
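
A cache hit rate is commonly derived from the `keyspace_hits` and `keyspace_misses` counters as hits divided by total lookups. The sketch below assumes that this is how `hit_rate` relates to the two counters; the function names and the 0.8 threshold are hypothetical and stand in for the configured threshold.

```python
def redis_hit_rate(keyspace_hits: int, keyspace_misses: int):
    """Return hits / (hits + misses), or None when there were no lookups."""
    total = keyspace_hits + keyspace_misses
    if total == 0:
        return None
    return keyspace_hits / total


def hit_rate_is_low(keyspace_hits: int, keyspace_misses: int,
                    threshold: float = 0.8) -> bool:  # hypothetical threshold
    """True when a hit rate exists and is below the threshold."""
    rate = redis_hit_rate(keyspace_hits, keyspace_misses)
    return rate is not None and rate < threshold


if __name__ == "__main__":
    print(redis_hit_rate(900, 100), hit_rate_is_low(900, 100))
    print(redis_hit_rate(700, 300), hit_rate_is_low(700, 300))
```

Returning `None` for an idle instance keeps a server with no traffic from being reported as having a low hit rate.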

SAP ABAP


Event	Description	Metric
Lock contention detected	Detects lock contention and provides details about the lock mode and lock object.	ABAP Lock Contention
ABAP dumps generated	Detects ABAP dumps that are generated and provides details on the severity.	ABAP Dumps Severity
IDoc Inbound and OutBound errors occured	Detects errors for both Inbound and Outbound IDocs.	Inbound IDoc Error and Outbound IDoc Error
Background Job gets Aborted or Cancelled	Detects if any Background Job gets Aborted or Cancelled.	Background Job gets Aborted or Cancelled
High CPU usage detected	Detects if the CPU usage is greater than 90%.	High CPU usage
High memory usage detected	Detects if the memory usage is greater than 90%.	High Memory Usage
Work process in stopped, shutdown, or PRIV mode (private) detected	Detects if the work process is in PRIV mode (private), stopped, or shutdown.	Work Process Status
Work process On Hold exceeds threshold	Detects if the number of work processes that are On Hold exceeds 5.	Work Process Status
File system usage crossing threshold detected	Detects if the file system usage crosses the threshold of 80%.	File System Usage
Connection issues detected	Detects incorrect username, password, gateway failure, or incorrect login attempts.	Connectivity Status
Authorization missing detected	Detects if the user is missing the authorization to run a function module.	Authorization check
User account locked detected	Detects if the user account is locked due to login failures.	User Account lock
Spool Error detected	Detects spool errors.	Spool Error
Dialog response time exceeding threshold	Detects if the dialog response time exceeds the preferred threshold.	Dialog Response Time
Dialog work process exceeding threshold	Detects if the dialog work process is running longer than 10 seconds.	Dialog Work Process
Database latency exceeding threshold	Detects if the average Database latency exceeds 5 seconds.	Database latency
Transport request release detected	Detects whether transport request is released or protected.	Transport Request
Background job duration exceeds 6 hours	Detects if the background job does not complete within 6 hours.	Background Jobs

For more information about this sensor, see SAP ABAP.

SAP Java NetWeaver


Event	Description	Metric
High CPU usage detected	Detects if the CPU usage is greater than 90%.	High CPU usage
High Memory usage detected	Detects if the memory usage is greater than 90%.	High Memory Usage
High Disk usage detected	Detects if the disk usage is greater than 90%.	High Disk Usage
Authentication failure detected	Detects authentication failure caused by an incorrect username or password.	Authentication failure
Authorization failure detected	Detects authorization failure caused by not assigning JMXManageAll action to the user.	Authorization failure
Connection Timeout detected	Detects connection timeout or network-related issue.	Connection Timeout detected
High System load detected	Detects if the system load average is greater than 90%.	High System Load Usage
GC problem detected	Detects the garbage collection (GC) problem.	GC Problems detected
System problem detected	Detects the system problem.	System Problems detected
High HTTP Threads usage detected	Detects if the number of active HTTP Threads reaches the configured Pool size.	High HTTP Threads usage

For more information about this sensor, see SAP Java NetWeaver.

SAP HANA


Event	Description	Metric
High CPU utilization	Detects if the total CPU usage exceeds 90%	Total CPU Utilization
High HANA memory usage	Detects if the used memory exceeds 90% of the allocated limit	HANA Memory Usage
High host memory usage	Detects if the host memory usage exceeds 90%	Host Memory Usage
High Disk usage	Detects if the disk usage exceeds 90%	Disk Usage Summary
High number of queuing connections	Detects if the queuing connections are more than one	Connections
High number of blocked sessions	Detects if the blocked sessions are more than one	Sessions
High number of blocking sessions	Detects if the blocking sessions are more than one	Sessions
High number of blocked threads	Detects if the blocked threads are more than 10	Threads
High number of blocked SQL threads	Detects if the blocked SQL threads are more than 10	SQL Threads
High number of blocked job worker threads	Detects if the blocked job worker threads are more than 10	Job Worker Threads
High number of pending requests	Detects if the pending requests are more than 10	Requests
High process CPU	Detects if any of the process CPUs exceeds 90%	Service Details
Service status is not active	Detects if service status is not active	Service Details
Backup failed	Detects the latest failed backup	Latest Backup
User locks occurred	Detects user locks	User Locks
Scheduled jobs failed	Detects failed scheduled jobs	Scheduled Jobs
System events occurred	Detects system events	System Events
Archive log backup failed	Detects failed log backups	All Backups
Transaction is not active	Detects partial aborting and aborting transactions	Transaction Statistics
Blocked transactions	Detects if any transaction is blocked	Blocked Transactions

For more information about the SAP HANA sensor, see Monitoring SAP HANA.

Service


Event	Description	Metric
Complete drop in calls.	Detects a rapid drop to zero (essentially, the service is not being called anymore) in the values of the calls KPI metric relative to the values in the last 30 minutes. The magnitude of the drop in calls should also exceed the relative and absolute threshold parameters.	Calls/s (`count`).
Error rate too high.	Detects a consistently high error rate when the average errors KPI within the last four minutes is above the given threshold value.	Error rate (`error_rate`).
Increasing trend in error rate.	Checks for the presence of an increasing trend in a given metric. The rule is tuned to detect weakly monotonic increases in the given metric. The detector is not strict, however, and tolerates a certain amount of decrease in the metric value inside the trend candidate.	Error rate (`error_rate`).
Sudden drop in calls.	Detects a rapid drop in the values of the calls KPI metric relative to the values in the last 30 minutes. The magnitude of the drop in calls should also exceed the relative and absolute threshold parameters.	Calls/s (`count`).
Sudden increase in error rate.	Detects a rapid increase in the values of the errors KPI relative to the KPI values in the last 10 minutes. The magnitude of the increase in errors should also exceed the relative and absolute threshold parameters.	Error Rate (`error_rate`).
Sudden increase in latency.	Detects a rapid increase in the given latency KPI percentile relative to the KPI values in the last 30 minutes. The magnitude of the increase in latency should also exceed the relative and absolute threshold parameters.	Latency 50th (`duration.50th`).
Sudden increase in latency for a fraction of requests.	Detects a rapid increase in the given latency KPI percentile relative to the KPI values in the last 30 minutes. The magnitude of the increase in latency should also exceed the relative and absolute threshold parameters.	Latency 99th (`duration.99th`).
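
Several Service events require a change to exceed both a relative and an absolute threshold against a recent baseline. The sketch below shows one way such a combined check could work for a drop in calls. It is not Instana's detection algorithm: the use of a plain mean as the baseline, the function name, and both default threshold values are assumptions made for illustration.

```python
from statistics import mean


def sudden_drop_in_calls(baseline, current,
                         relative_threshold=0.5,   # hypothetical: 50% drop
                         absolute_threshold=1.0):  # hypothetical: 1 call/s
    """True when the drop from the baseline mean to the current value
    exceeds both the relative and the absolute threshold."""
    if not baseline:
        return False
    baseline_mean = mean(baseline)
    if baseline_mean <= 0:
        return False
    drop = baseline_mean - current
    if drop <= 0:
        return False  # calls did not decrease
    relative_drop = drop / baseline_mean
    return relative_drop > relative_threshold and drop > absolute_threshold


if __name__ == "__main__":
    # Hypothetical baseline: 30 one-minute samples of calls per second.
    print(sudden_drop_in_calls([100.0] * 30, 20.0))  # large drop on a busy service
    print(sudden_drop_in_calls([1.0] * 30, 0.2))     # same ratio, tiny absolute drop
```

Requiring both thresholds keeps a low-traffic service from triggering on a drop that is large in relative terms but negligible in absolute terms.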

Solr

Solr Cloud Cluster


Event	Description	Metric
Unreachable Solr nodes.	One or more nodes are down.	`unreachableNodes`.

Solr Node


Event	Description	Metric
Solr cache hit rate is low.	Solr cache hit rate is below 80% over the last minute, possibly due to high evictions or clients querying the wrong data.	Solr Hit Ratio (`hitratio`) and Solr evictions.

For more information about this sensor, see the Apache Solr documentation.

Spark

Spark Application


Event	Description	Metric
Failed tasks on executor.	Number of failed tasks on an executor exceeds the configured threshold.	Spark Application failed tasks.
Scheduling delay is high.	Scheduling delay is increasing too fast or is too high.	Scheduling Delay (`schedulingDelay`).

Spark Standalone


Event	Description	Metric
Driver has failed.	Number of failed drivers exceeds the configured threshold.	Number of failed drivers (`drivers.failed`).
Spark standalone master is reporting dead worker(s).	Number of dead workers exceeds the configured threshold.	Dead workers (`workers.deadWorkers`).
Spark standalone master is reporting worker(s) in unknown state.	Number of workers in an unknown state exceeds the configured threshold.	Workers in unknown state (`workers.workersInUnknownState`).
Submitted app has failed.	Number of failed applications exceeds the configured threshold.

For more information about this sensor, see the Apache Spark documentation.

Spring Boot App


Event	Description	Metric
Number of active sessions reached maximum number.	A processing pipeline detects the number of active connections of the SpringBoot application in the given time window. It validates whether the number of active sessions is greater than the threshold value.	Active sessions (`metrics.httpsessions.active`).
Spring Boot Application down.	Monitors the status of the SpringBoot Application.	Status of SpringBoot Application (`metrics.status`).

For more information about this sensor, see the Spring Boot documentation.

Sybase Server


Event	Description	Metric
Available server connections are at limit.	Number of connections is close to 100% of connections limit per server.	Connections (`stats.connCount`).
The maximum number of databases is at limit.	Number of databases is close to 100% of databases limit per server.	`databasesCount`.

For more information about the SAP SQL Anywhere sensor, see Monitoring SAP SQL Anywhere.

Synthetic PoP


Event	Description	Metric
Synthetic PoP status.	Checks whether the Synthetic PoP can connect to the Instana backend.	Status of Synthetic PoP (`status`).
Playback engine status.	Checks whether the playback engine is overloaded.	Workload status of the playback engines (`browserscript.workloadStatus`, `http.workloadStatus`, `javascript.workloadStatus`, and `ism.workloadStatus`).
Retrieving credentials failed.	Failed to get Synthetic credentials from the Instana backend.	Error code and URL of pop_get_cred_failed (`error.pop_get_cred_failed`).
Retrieving tests failed.	Failed to get Synthetic tests from the Instana backend.	Error code and URL of pop_get_test_failed (`error.pop_get_test_failed`).
Reporting test results failed.	Failed to post Synthetic test results to the Instana backend.	Error code and URL of pop_report_result_failed (`error.pop_report_result_failed`).
Reporting test result details failed.	Failed to post Synthetic test result details to the Instana backend.	Error code and URL of pop_report_result_details_failed (`error.pop_report_result_details_failed`).
Reporting result queue depth is high.	Detects whether the result queue depth is high.	ResultQueueDepthHigh (`resultQueueDepthHigh`).

For more information about this sensor, see the Synthetic PoP documentation.

TIBCO EMS


Event	Description	Metric
Connections exceeds max available connections.	The maximum number of connections is almost used up.	Connections Count (`connectionCount`).
Messages memory usage exceeds the limit.	The maximum message memory is almost used up.	Messages Memory (`messagesMemory`).
Queues pending messages exceeds the limit.	The maximum number of pending messages for a queue is almost used up.	Queue pending messages usage.
Topics pending messages exceeds the limit.	The maximum number of pending messages for a topic is almost used up.	Topic pending messages usage.

For more information about this sensor, see the TIBCO EMS documentation.

Tomcat


Event	Description	Metric
Active connections reached maximum.	Detects whether the number of connections of a specific connector is reaching its maximum configured value.	Connector connection count.
Sudden drop in the number of sessions.	Checks for a significant drop in the number of sessions.	Total session count (`totalSessionCount`).
Sudden increase in the number of sessions.	Checks for a significant increase in the number of sessions.	Total session count (`totalSessionCount`).
Threads number reached maximum.	Detects whether the number of busy threads of a specific connector is reaching its maximum configured value.	Connector busy thread count.

For more information about this sensor, see the Tomcat documentation.

Varnish Node


Event	Description	Metric
Sudden drop in the number of requests.	Checks for a sudden drop in the number of client requests.	Received client requests (`client_req`).
Sudden increase in evicted objects.	Checks for a sudden increase in the number of evicted objects.	Nuked Objects (`n_lru_nuked`).
Thread creation is failing.	Too many thread creations failed.	Failed (`threads_failed`) and limited (`threads_limited`).
Varnish backend is marked unhealthy.	Varnish backend server is unhealthy or is not available.	Unhealthy (`backend_unhealthy`).
Varnish hit rate is low.	Varnish hit rate is very low.	Cache Hit Rate (`cache_hit_rate`).
Varnish is out of worker threads.	Varnish is out of worker threads.	Connections dropped due to a full queue (`sess_dropped`).

For more information about this sensor, see the Varnish documentation.

Vault


Event	Description	Metric
Vault is sealed.	Detects if the sealed status is set to true.	Sealed (`sealed`).
Sudden increase in secret reads.	Checks for a sudden increase (an increase of 60% over the average of the last 5 minutes) in the number of secrets read.	Secrets read count (`secret.read.count`).

For more information about this sensor, see the Vault documentation.
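
The "sudden increase" rule described for Vault can be illustrated with a small sketch. This is not the Instana implementation; it only assumes that the check compares the current reading against the mean of the samples collected over the last 5 minutes. The function name, the sampling scheme, and the handling of a zero baseline are all hypothetical.

```python
from statistics import mean


def is_sudden_increase(recent_counts, current_count, increase_ratio=0.60):
    """Return True if current_count is at least increase_ratio above the
    mean of recent_counts (for example, samples from the last 5 minutes).

    Hypothetical helper for illustration only; not an Instana API.
    """
    if not recent_counts:
        # No history yet, so there is no baseline to compare against.
        return False
    baseline = mean(recent_counts)
    if baseline == 0:
        # Assumption: a relative increase is undefined for a zero baseline,
        # so the sketch does not raise an event in this case.
        return False
    return (current_count - baseline) / baseline >= increase_ratio
```

With a baseline of 100 secret reads, a reading of 170 would count as a sudden increase under these assumptions, while a reading of 150 would not.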

WebLogic Server


Event	Description	Metric
Datasource error state.	A processing pipeline monitors the status codes of the WebLogic application data sources and checks whether any data source is unhealthy.	WebLogic datasource status.
Health state.	Detects overall system degradation based on the reported health state.	Health State status.

For more information about this sensor, see the WebLogic documentation.

WebSphere


Event	Description	Metric
WebContainer thread pool active threads reached maximum.	A processing pipeline validates that the number of active threads in the WebContainer thread pool is reaching the maximum limit.	Active threads (`threadPools.webContainer.activeThreads`).
WebSphere certificate is about to expire.	Remaining days before certificate expiration is less than the threshold value.	Remaining days before expiration (`certificates.{certificate}.expDaysLeft`).

For more information about this sensor, see the WebSphere Application Server documentation.
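
As a rough illustration of the certificate-expiration check, the sketch below computes the remaining days before a certificate expires and compares that value with a threshold. It is a minimal sketch under stated assumptions, not the sensor's actual logic: the function names and the way the expiration timestamp is obtained are hypothetical, and only the "remaining days less than the threshold" comparison comes from the table above.

```python
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Whole days left before the certificate expires (negative if expired).

    not_after and now are timezone-aware datetimes. Hypothetical helper.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return (not_after - now).days


def is_about_to_expire(not_after, threshold_days, now=None):
    """True when the remaining days are less than the threshold value."""
    return days_until_expiry(not_after, now) < threshold_days
```

For example, a certificate that expires 10 days from now would trigger the check with a 30-day threshold but not with a 5-day threshold.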

ZooKeeper


Event	Description	Metric
Maximum request latency is high.	A processing pipeline checks if the maximum request latency is reaching the threshold value.	Max request latency (`max_request_latency`).
Number of queued requests is high.	A processing pipeline detects the number of queued requests and validates whether the number is reaching the threshold value.	Outstanding request count (`outstanding_requests`).

For more information about this sensor, see the ZooKeeper documentation.

The events are retrieved from IBM MQ events. The Instana agent collects these IBM MQ events and reports them as Instana events. To collect these events, you need to enable Queue Manager performance events and channel events. For more information, see Extra IBM MQ configuration.