Infrastructure Metrics

Infrastructure metrics include system metrics and container metrics. For information about container metrics, see Container Metrics.

System Metrics

Monitor the following system metrics to analyze Elasticsearch health.

  • CPU usage
  • Disk usage
  • Memory usage

Monitor the CPU usage

To ensure that the CPU is not overutilized, you must monitor CPU health regularly. You can monitor CPU usage at two levels: process level and OS level. If the process-level CPU usage exceeds the threshold limits, you can redistribute the load across nodes. However, if the OS-level CPU has reached its limits, you must contact your IT team.

Command: curl -X GET http://localhost:9240/_nodes/stats/process?pretty
  Retrieves the CPU utilization of the external Elasticsearch pods.

JSON path: $.nodes.nodeid.process.cpu.percent
  Retrieves the percentage of CPU used by an Elasticsearch pod.
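For illustration, the command and JSON path above can be combined with jq to print the process-level CPU percentage for every node. This is a minimal sketch, assuming jq is installed and Elasticsearch is reachable on localhost:9240:

    # Print each node's name and process-level CPU percentage.
    # Assumes jq is installed and Elasticsearch listens on localhost:9240.
    curl -s http://localhost:9240/_nodes/stats/process |
      jq -r '.nodes[] | "\(.name): \(.process.cpu.percent)% CPU"'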
If a pod is using 80% of its CPU capacity for more than 15 minutes, consider the severity as WARNING and perform the following steps to identify the causes of the high CPU usage (a sketch of steps 1 and 2 follows this list).
  1. Identify the process that consumes the most CPU.
  2. Generate a thread dump.
  3. Analyze the thread dump to identify thread locks.
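A sketch of steps 1 and 2, assuming shell access to the pod and a JDK (with jstack and pgrep) inside the container; the pod name es-0 is a placeholder. Elasticsearch's hot threads API gives a quick first look before a full dump:

    # Quick first look: Elasticsearch's built-in hot-threads report.
    curl -s http://localhost:9240/_nodes/hot_threads

    # Full thread dump with jstack (pod name "es-0" is an example).
    kubectl exec es-0 -- sh -c \
      'jstack $(pgrep -f org.elasticsearch.bootstrap.Elasticsearch) > /tmp/threaddump.txt'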

If a pod is using 90% of its CPU capacity for more than 15 minutes, check the following Prometheus metrics (a query sketch follows the list):

  • elasticsearch_os_cpu_percent
  • elasticsearch_process_cpu_percent
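As an example, both metrics can be checked ad hoc through the Prometheus HTTP API; the Prometheus host and port below are placeholders for your monitoring endpoint:

    # Query the current OS-level and process-level CPU percentages.
    # Replace prometheus:9090 with your Prometheus endpoint.
    curl -s 'http://prometheus:9090/api/v1/query' \
      --data-urlencode 'query=elasticsearch_os_cpu_percent' | jq '.data.result'
    curl -s 'http://prometheus:9090/api/v1/query' \
      --data-urlencode 'query=elasticsearch_process_cpu_percent' | jq '.data.result'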
elasticsearch_os_cpu_percent
If elasticsearch_os_cpu_percent is more than 90%, consider the severity as CRITICAL and perform the following steps to address the high CPU usage (a kubectl sketch follows the list).
  1. Restart the pod.
  2. Check the readiness and liveness of the pod.
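A minimal sketch of both steps with kubectl; the pod name es-0 and namespace elastic are placeholders:

    # Restart the pod by deleting it; its controller recreates it.
    kubectl delete pod es-0 -n elastic

    # Verify that the recreated pod passes its readiness and liveness probes.
    kubectl get pod es-0 -n elastic
    kubectl describe pod es-0 -n elastic | grep -E -A3 'Readiness|Liveness'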
elasticsearch_process_cpu_percent

If elasticsearch_process_cpu_percent is more than 90%, consider the severity as CRITICAL and add a new node to the cluster. To learn more about how to add a new external Elasticsearch node, see Adding New Nodes to an Elasticsearch Cluster.

Monitor the Disk usage

To ensure that all nodes have enough disk space, monitor the disk space regularly.

Command: curl -X GET http://localhost:9240/_nodes/stats/fs
  Retrieves the disk space statistics of the external Elasticsearch nodes. It lists the disk space available on all nodes. For more information about Elasticsearch node statistics, see Elasticsearch documentation.

JSON path: $.nodes..fs.total.total_in_bytes
  Retrieves the total disk space.

JSON path: $.nodes..fs.total.free_in_bytes
  Retrieves the free disk space.

JSON path: $.nodes..fs.total.available_in_bytes
  Retrieves the available disk space.
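For instance, the fs statistics can be reduced to a per-node disk usage percentage with jq; this sketch assumes jq is installed and Elasticsearch listens on localhost:9240:

    # Print each node's disk usage as a percentage of total capacity.
    curl -s http://localhost:9240/_nodes/stats/fs |
      jq -r '.nodes[] | "\(.name): \((100 - .fs.total.available_in_bytes / .fs.total.total_in_bytes * 100) | floor)% used"'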
Disk-based shard allocations
Note: 500 GB (HA) / 150 GB (single node) is used as an example of maximum data retention here.
Command: curl -X GET http://localhost:9240/_cluster/settings?pretty
  Retrieves the configured disk-based shard allocation settings in Elasticsearch. To learn more about disk-based shard allocation, see Elasticsearch documentation.

Shard allocation is governed by three thresholds, known as the Low, High, and Flood stage watermarks.
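For example, the three watermark values can be read directly from the settings response; adding include_defaults=true makes them visible even when they have not been overridden (assumes jq and localhost:9240):

    # Show the low, high, and flood stage disk watermarks, including defaults.
    curl -s 'http://localhost:9240/_cluster/settings?include_defaults=true&flat_settings=true' |
      jq '[.persistent, .transient, .defaults] | add | with_entries(select(.key | contains("watermark")))'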

Shard allocation: Low

$.persistent.cluster.routing.allocation.disk.watermark.low

The default threshold for this level is 80%. Once the threshold is reached, Elasticsearch does not allocate new shards to nodes that have used more than 80% of their disk space. To check whether this level is reached, compute the average disk usage of the external Elasticsearch cluster (or of the standalone node). If the result exceeds the defined threshold (80%), the disk has reached the Low stage. If your disk usage has reached the Low stage, perform the following steps:

  1. Query the transaction event index size and verify whether the index is above 525 GB (HA) / 175 GB (single node); see the query sketch after this list. If the limit is already breached, monitor whether the purge scripts are running and the index size is decreasing.
  2. Verify that the combined size of the transaction event indexes matches the used disk space (within a range of 25 GB). If it does not, other items, such as growing logs or heap dumps, are occupying the space. Clear the logs and heap dumps.
  3. Repeat the above steps until the transaction event index is less than 525 GB and the average disk usage of the cluster is less than 80%.
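Step 1 can be scripted with the cat indices API; the index pattern transaction-event-* is hypothetical, so substitute the actual name of your transaction event index:

    # List transaction event indices with their sizes, largest first.
    # The pattern "transaction-event-*" is an example; adjust it to your deployment.
    curl -s 'http://localhost:9240/_cat/indices/transaction-event-*?v&h=index,store.size&s=store.size:desc'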
Shard allocation: High

$.persistent.cluster.routing.allocation.disk.watermark.high

The default threshold for this level is 85%. Once the threshold is reached, Elasticsearch attempts to relocate shards away from any node whose disk usage is above 85%. To check whether this level is reached, compute the average disk usage of the external Elasticsearch cluster (or of the standalone node). If the result exceeds the defined threshold (85%), the disk has reached the High stage. If your disk usage has reached the High stage, perform the following steps:

  1. Query the transaction event index size and verify whether the index is above 525 GB (HA) / 175 GB (single node). If the limit is already breached, monitor whether the purge scripts are running and the index size is decreasing.
  2. Verify that the combined size of the transaction event indexes matches the used disk space (within a range of 25 GB). If it does not, other items, such as growing logs or heap dumps, are occupying the space. Clear the logs and heap dumps.
  3. Repeat the above steps until the transaction event index is less than 525 GB and the average disk usage of the cluster is less than 85%.
Shard allocation: Flood

$.persistent.cluster.routing.allocation.disk.watermark.flood_stage

The default threshold for this level is 90%. Once the threshold is reached, Elasticsearch enforces a read-only index block (index.blocks.read_only_allow_delete) on every index that has one or more shards allocated on a node where at least one disk exceeds the flood stage. This is the last resort to prevent nodes from running out of disk space.

To check whether the Flood stage is reached, compute the average disk usage of the Elasticsearch cluster (or of the standalone node). If the result exceeds the defined threshold (90%), the disk is in the Flood stage. If your disk usage has reached the Flood stage, perform the following steps:

  • Monitor the purging of data; ensure that purging happens and the disk space occupancy is reduced.
  • If this situation is due to a spike in request count or size, follow up with the customer to understand the reason for the sudden spike, and ask the customer to compress the payload for transaction logging or to stop storing the request or response.
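Once purging has brought disk usage back under the thresholds, note that older Elasticsearch versions do not release the read-only block automatically (7.4 and later do). A sketch for one affected index, where the index name is a placeholder:

    # Remove the read-only block after disk space has been reclaimed.
    # "transaction-event-000001" is an example index name.
    curl -X PUT 'http://localhost:9240/transaction-event-000001/_settings' \
      -H 'Content-Type: application/json' \
      -d '{"index.blocks.read_only_allow_delete": null}'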
Command: curl -X GET http://localhost:9240/_nodes/stats/<metric>
  Retrieves information about a specific metric group, such as fs, http, os, or process. For more information about the corresponding metrics, see Elasticsearch documentation.

Monitor the Memory usage

Command: http://HOST:9240/_nodes/nodeid/stats/os
  Retrieves the memory usage of the external Elasticsearch pods.

Command: http://HOST:9240/_cat/nodes?v&full_id=true&h=id,name,ip
  Retrieves the node IDs of the cluster. It returns the node ID, node name, and node IP address.

JSON path: $.nodes.nodeid.os.mem.free_percent
  Retrieves the percentage of memory that is free.
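As with CPU, the free-memory percentage can be printed per node with jq; the sketch below assumes jq is installed and Elasticsearch listens on localhost:9240:

    # Print each node's OS memory usage (100 minus the free percentage).
    curl -s http://localhost:9240/_nodes/stats/os |
      jq -r '.nodes[] | "\(.name): \(100 - .os.mem.free_percent)% used"'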

If a pod is using 85% of the available memory, consider the severity as WARNING. Identify the process that consumes the most memory and generate a heap dump.

If a pod is using 90% of the available memory, consider the severity as CRITICAL, and perform the following steps to identify the reason (a sketch follows the list).
  1. Identify the process that consumes the most memory.
  2. Generate a heap dump.
  3. Restart the pod.
  4. Check the readiness and liveness of the pod.
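A sketch of steps 2 through 4, assuming shell access to the pod and a JDK (with jmap and pgrep) inside the container; the pod name es-0 and namespace elastic are placeholders:

    # Capture a heap dump from the Elasticsearch JVM before restarting.
    kubectl exec es-0 -n elastic -- sh -c \
      'jmap -dump:live,format=b,file=/tmp/heap.hprof $(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)'
    kubectl cp elastic/es-0:/tmp/heap.hprof ./heap.hprof

    # Restart the pod and confirm it passes its readiness and liveness probes.
    kubectl delete pod es-0 -n elastic
    kubectl get pod es-0 -n elastic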