Elastic Stack troubleshooting

Follow this high-level troubleshooting process to isolate and resolve problems with the Elastic Stack, for example, when a service in the Elastic Stack is in the Error state or remains in the TENTATIVE state.

  1. For information about important configurations, what to monitor, and how to diagnose and prevent problems, see the following Elastic documentation:
    • Monitoring
    • Cluster Health
      Tip: A red cluster indicates that at least one primary shard and all of its replicas are missing. As a result, the data in that shard is not available, searches return partial results, and indexing into that shard returns errors.
    • Monitoring Individual Nodes to troubleshoot each node. Identify the troublesome indices and determine why the shards are not available. Check the disks or review the logs for errors and warnings. If the issue stems from node failure or hard disk failure, take steps to bring the node online.
    • cat API to view cluster statistics.
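The cluster health check described above can be sketched in Python. This is a minimal sketch, not a supported tool: `fetch_json` and `summarize_health` are hypothetical helper names, and the base URL that you pass to `fetch_json` is an assumption (your Elasticsearch endpoint might use a different host or port, HTTPS, or authentication). The sketch relies only on the `status` and `unassigned_shards` fields of the Elasticsearch `_cluster/health` response.

```python
import json
import urllib.request


def fetch_json(base_url, path, timeout=10):
    """GET base_url + path and decode the JSON body.

    base_url is an assumption: substitute your own Elasticsearch endpoint.
    """
    url = base_url.rstrip("/") + path
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def summarize_health(health):
    """Turn a _cluster/health response (a dict) into a one-line summary."""
    status = health.get("status", "unknown")
    unassigned = health.get("unassigned_shards", 0)
    if status == "red":
        meaning = "at least one primary shard and all of its replicas are missing"
    elif status == "yellow":
        meaning = "all primary shards are assigned, but some replicas are not"
    elif status == "green":
        meaning = "all primary and replica shards are assigned"
    else:
        meaning = "status not recognized"
    return f"{status}: {meaning} (unassigned shards: {unassigned})"


if __name__ == "__main__":
    # A canned response so that the sketch runs without a live cluster.
    sample = {"status": "red", "number_of_nodes": 3, "unassigned_shards": 4}
    print(summarize_health(sample))
```

To check a live cluster, pass the result of `fetch_json(base_url, "/_cluster/health")` to `summarize_health`.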
  2. Based on the type of error you encounter, refer to the appropriate Elastic Stack log file.
    Table 1. Elastic Stack log files
    Log file: default log location
    Elastic Stack manager service log (standard out or error log):
      $EGO_TOP/integration/elk/log/manager-[out|err].log.*
    Elasticsearch service log (standard out or error log for the primary, client, or data service):
      $EGO_TOP/integration/elk/log/es-[out|err].log.[master|client|data].*
    Elasticsearch runtime log (runtime log for the primary, client, or data service):
      $EGO_TOP/integration/elk/log/elasticsearch/*.log.[master|client|data]_*
    Logstash (indexer) service log (standard out or error log):
      $EGO_TOP/integration/elk/log/indexer-[out|err].log.*
    Logstash (indexer) runtime log:
      $EGO_TOP/integration/elk/log/logstash/logstash-plain.log.*
    Filebeat (shipper) service log (standard out or error log):
      $EGO_TOP/integration/elk/log/shipper-[out|err].log.*
    Filebeat (shipper) runtime log:
      $EGO_TOP/integration/elk/log/filebeat/filebeat.log.*
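To locate the files in Table 1 on a host, a short script can expand the documented locations into glob patterns. The following is a minimal sketch: `LOG_PATTERNS`, `log_globs`, and `find_logs` are hypothetical names, and `/opt/ego` is only a placeholder for hosts where the EGO_TOP environment variable is not set.

```python
import glob
import os

# File-name patterns from Table 1, relative to $EGO_TOP/integration/elk/log.
# The table's [a|b] notation means "a or b". It is spelled out here because
# glob would read square brackets as a character class. The Elasticsearch
# patterns are deliberately broad: they match the master, client, and data
# variants alike.
LOG_PATTERNS = {
    "manager": ["manager-out.log.*", "manager-err.log.*"],
    "elasticsearch": ["es-out.log.*", "es-err.log.*",
                      os.path.join("elasticsearch", "*.log.*")],
    "indexer": ["indexer-out.log.*", "indexer-err.log.*",
                os.path.join("logstash", "logstash-plain.log.*")],
    "shipper": ["shipper-out.log.*", "shipper-err.log.*",
                os.path.join("filebeat", "filebeat.log.*")],
}


def log_globs(component, ego_top):
    """Return the absolute glob patterns for one component."""
    log_dir = os.path.join(ego_top, "integration", "elk", "log")
    return [os.path.join(log_dir, pattern) for pattern in LOG_PATTERNS[component]]


def find_logs(component, ego_top):
    """Return the sorted paths of log files that exist for one component."""
    matches = []
    for pattern in log_globs(component, ego_top):
        matches.extend(glob.glob(pattern))
    return sorted(matches)


if __name__ == "__main__":
    # /opt/ego is a placeholder that is used only when EGO_TOP is not set.
    top = os.environ.get("EGO_TOP", "/opt/ego")
    for name in LOG_PATTERNS:
        print(name, find_logs(name, top))
```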
  3. Resolve any of the following problems that might occur:
    Out of memory exception or Java heap size reached
    The default Elasticsearch installation uses a 10 GB heap for the Elasticsearch services and a 4 GB heap for the Logstash service, which fits within the 24 GB of RAM that the IBM® Spectrum Conductor system requirements specify. If your hosts have more than 24 GB of memory and you need a larger heap, for example to improve system performance, you can increase the Elasticsearch and Logstash heap sizes in IBM Spectrum Conductor. For more information, see Tuning the heap sizes for Elasticsearch and Logstash to accommodate heavy load.
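To confirm that a service ran out of heap, you can search its log files for the `java.lang.OutOfMemoryError` message that the JVM reports. The following is a minimal sketch: `find_oom_lines` is a hypothetical helper name, and the sample lines are illustrative, not real Elastic Stack output.

```python
import re

# The JVM reports heap exhaustion as java.lang.OutOfMemoryError.
OOM_PATTERN = re.compile(r"java\.lang\.OutOfMemoryError")


def find_oom_lines(lines):
    """Return (line_number, text) pairs for lines that mention OutOfMemoryError."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if OOM_PATTERN.search(line)
    ]


if __name__ == "__main__":
    # Illustrative lines only. In practice, pass an open log file instead.
    sample = [
        "service starting\n",
        "java.lang.OutOfMemoryError: Java heap space\n",
    ]
    for number, text in find_oom_lines(sample):
        print(f"line {number}: {text}")
```

Because an open file is an iterable of lines, `find_oom_lines(open(path, errors="replace"))` works for any of the log files listed in Table 1.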
    Disk full or watermark is reached
    The Elasticsearch service can remain in the TENTATIVE state when disk usage reaches the limits that are defined in the Elasticsearch watermark parameters.
    Consider freeing up disk space and increasing the watermarks. For more information, see Configuring Elasticsearch disk usage.
    If you see the TOO_MANY_REQUESTS/12/index read-only error in the logs, a safeguard has set the read_only_allow_delete parameter to true. Clean up the storage and verify that you have sufficient space before you run a command to reset the setting. For more information, see Resolving reports on full disk or watermark reached.
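As an illustration of what resetting the read-only safeguard involves, the following sketch builds (but does not send) a request to the standard Elasticsearch index settings API that sets `index.blocks.read_only_allow_delete` back to null. It is a sketch under assumptions, not the procedure from the linked topic: `build_unblock_request` is a hypothetical name, `http://localhost:9200` is a placeholder endpoint, and your cluster might require HTTPS and credentials.

```python
import json
import urllib.request


def build_unblock_request(base_url, index="_all"):
    """Build a PUT that clears the read-only-allow-delete block.

    Setting the value to null (None in Python) resets it to its default.
    The request is only constructed here; nothing is sent.
    """
    body = json.dumps({"index.blocks.read_only_allow_delete": None})
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}/_settings",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    # Placeholder endpoint: replace it with your own Elasticsearch URL.
    request = build_unblock_request("http://localhost:9200")
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # After you free up disk space, send the request with:
    # urllib.request.urlopen(request)
```

Send the request only after you confirm that enough disk space is available. Otherwise, Elasticsearch applies the block again.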
    Too many buckets exception thrown or charts not displaying properly
    If you have many applications, charts on the Resource Usage page of the cluster or instance group management console might not display properly, and you might encounter exceptions about failing to retrieve data and requiring more buckets for aggregation.
    To resolve the problem, either reduce the amount of data for application charts by reducing the duration of the data, or increase the search.max_buckets cluster-level setting. For more information, see Resolving exception on trying to create too many buckets.
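For reference, the Elasticsearch cluster setting for the bucket limit is named `search.max_buckets`, and it can be changed through the cluster settings API. The following sketch only builds the request. `build_max_buckets_request` is a hypothetical name, the endpoint is a placeholder, and the value 20000 is an arbitrary example, not a recommendation from this documentation.

```python
import json
import urllib.request


def build_max_buckets_request(base_url, max_buckets):
    """Build a PUT to _cluster/settings that persistently sets search.max_buckets.

    The request is only constructed here; nothing is sent.
    """
    if max_buckets <= 0:
        raise ValueError("max_buckets must be a positive integer")
    body = json.dumps({"persistent": {"search.max_buckets": int(max_buckets)}})
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/_cluster/settings",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    # Placeholder endpoint and an arbitrary example value.
    request = build_max_buckets_request("http://localhost:9200", 20000)
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

A higher limit lets larger aggregations run but also lets them use more memory, so reducing the duration of the charted data is usually the safer first step.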
    Red cluster or UNASSIGNED shards
    The Elasticsearch service can remain in the TENTATIVE state when at least one primary shard and all its replicas are missing.
    First, rule out a full disk or a reached watermark. For more information, see Resolving reports on full disk or watermark reached. Then, see Resolving red cluster or UNASSIGNED shards.
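To see which shards are affected, you can filter the output of the Elasticsearch `_cat/shards` API when it is requested as JSON (`_cat/shards?format=json`). The following is a minimal sketch: the function names are hypothetical, and the index names in the sample rows are made up for illustration. It relies on the `index`, `prirep`, and `state` columns of the response.

```python
def unassigned_shards(cat_shards):
    """Keep only the rows of _cat/shards?format=json that are UNASSIGNED."""
    return [row for row in cat_shards if row.get("state") == "UNASSIGNED"]


def indices_with_unassigned_primaries(cat_shards):
    """Return the names of indices that have an unassigned primary shard.

    In the prirep column, "p" marks a primary shard and "r" marks a replica.
    """
    return sorted({
        row["index"]
        for row in unassigned_shards(cat_shards)
        if row.get("prirep") == "p"
    })


if __name__ == "__main__":
    # Made-up rows in the shape that _cat/shards?format=json returns.
    rows = [
        {"index": "example-a", "shard": "0", "prirep": "p", "state": "UNASSIGNED"},
        {"index": "example-a", "shard": "0", "prirep": "r", "state": "UNASSIGNED"},
        {"index": "example-b", "shard": "0", "prirep": "p", "state": "STARTED"},
        {"index": "example-b", "shard": "0", "prirep": "r", "state": "UNASSIGNED"},
    ]
    print(indices_with_unassigned_primaries(rows))
```

Indices that this function returns are the ones that make the cluster red. An index with only unassigned replicas, such as example-b in the sample rows, is not returned.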