Monitoring topology pod has a high number of restarts

If you see that the monitoring topology pod has a high number of restarts, follow these steps to diagnose and resolve the problem.

Symptoms

When you run the command oc -n management-monitoring get pod |grep topology, you might see that the pod monitoring-topology-xxx has a high number of restarts in a short period of time. See the following example:

NAME                                   READY   STATUS    RESTARTS   AGE
monitoring-topology-7fbf949c87-lg8j6   1/1     Running   3          1d

Diagnosis

  1. Check the monitoring topology pod to see whether it ran out of container memory or Java heap space.

    1. Determine the full topology pod name by running the following command:
      oc -n management-monitoring get pod |grep topology
      
    2. Check the reason for the last termination by running the following command. A reason such as OOMKilled indicates that the container ran out of memory.
      oc -n management-monitoring describe pod monitoring-topology-XXX
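
      For example, if the container was stopped for exceeding its memory limit, the Last State section of the output looks similar to the following example (the values are illustrative):

      Last State:     Terminated
        Reason:       OOMKilled
        Exit Code:    137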
      
    3. Check whether any errors, such as Java heap space errors, exist in the previous log by running the following command:
      oc -n management-monitoring logs monitoring-topology-XXX -p
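
      For example, a log entry similar to the following indicates that the JVM ran out of heap space:

      java.lang.OutOfMemoryError: Java heap space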
      
    4. Check whether any other errors, such as database access failures, exist in the current log by running the following command:
      oc -n management-monitoring logs monitoring-topology-XXX
      
  2. Check whether the Cassandra pod has any problems, such as running out of memory, insufficient disk space, or I/O delays.

    1. Check the reason for the last termination by running the following command. A reason such as OOMKilled indicates that the pod ran out of memory.
      oc -n management-monitoring describe pod monitoring-cassandra-0
      
    2. Check whether any errors exist in the previous logs by running the following command:
      oc -n management-monitoring logs monitoring-cassandra-0 -p
      
    3. Check whether any errors exist in the current logs by running the following command:
      oc -n management-monitoring logs monitoring-cassandra-0
      
    4. Check whether any errors exist in the debug log by running the following commands:
      oc cp monitoring-cassandra-0:/opt/ibm/cassandra/logs/debug.log /tmp/cassandra_debug.log
      
      less /tmp/cassandra_debug.log
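
      For example, warnings similar to the following in debug.log indicate that queries are scanning a large number of tombstones in the janusgraph keyspace (the exact wording and counts vary by Cassandra version and are shown for illustration only):

      WARN  [ReadStage-2] ReadCommand.java - Read 0 live rows and 100001 tombstone cells for query SELECT * FROM janusgraph.edgestore ... (see tombstone_warn_threshold)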
      

Solutions

  1. If the issue is caused by insufficient memory or Java heap size in Topology, increase the memory or JVM heap size as follows.

    1. Determine the full name of your sizing configmap by running the following commands:

      oc -n cp4mcm get installation ibm-management -o yaml |grep environment
      
      oc -n management-monitoring get configmap |grep sizing
      
    2. Edit the sizing configmap by running the following command, specifying the full configmap name that you found:

      oc -n management-monitoring edit configmap monitoring-sizing-sizeXXX
      
    3. Update the following statement with a larger memory limit:

      Deployment.topology.monitoring-topology.memLimit: xxMi
      

      You need to replace xx with the memory limit value that you want to set.
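
      For example, to set a 2048 Mi memory limit (an illustrative value; choose a size that suits your environment):

      Deployment.topology.monitoring-topology.memLimit: 2048Mi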

    4. Add or replace the following statement to increase the maximum JVM heap size:

      Deployment.topology.monitoring-topology.env: '[{"name":"JVM_ARGS","value":"-Xms128M -Xmx512M"}]'
      
  2. If the issue is caused by insufficient memory or Java heap size in Cassandra, increase the memory or JVM heap size as follows:

    1. Determine the full name of your sizing configmap by running the following commands:

      oc -n cp4mcm get installation ibm-management -o yaml |grep environment
      
      oc -n management-monitoring get configmap |grep sizing
      
    2. Edit the sizing configmap by running the following command, specifying the full configmap name that you found:

      oc -n management-monitoring edit configmap monitoring-sizing-sizeXXX
      
    3. Update the following statement with a larger memory limit and memory request size:

      StatefulSet.cassandra.monitoring-cassandra.memLimit: xxGi
      StatefulSet.cassandra.monitoring-cassandra.memRequest: xxGi
      

      You need to replace xx with the memory limit and memory request values that you want to set.
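
      For example, to request and limit the Cassandra container to 8 Gi of memory (illustrative values; choose sizes that suit your environment):

      StatefulSet.cassandra.monitoring-cassandra.memLimit: 8Gi
      StatefulSet.cassandra.monitoring-cassandra.memRequest: 8Gi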

    4. Add or replace the following statement to increase the maximum JVM heap size:

      StatefulSet.cassandra.monitoring-cassandra.env: '[{"name":"CASSANDRA_HEAP_SIZE","value":"xxG"}]'
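
      For example, to set a 4 G heap (an illustrative value; the heap must fit comfortably within the memory limit that you set in the previous step):

      StatefulSet.cassandra.monitoring-cassandra.env: '[{"name":"CASSANDRA_HEAP_SIZE","value":"4G"}]'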
      
  3. If the issue is caused by errors in Topology or Cassandra, take the following steps to apply the latest fixes and reduce the redundant data in the Cassandra janusgraph keyspace.

    1. Upgrade IBM Cloud Pak® for Multicloud Management to the latest release.

    2. If you use IBM Cloud Pak® for Multicloud Management 2.3 Fix Pack 4, apply the following hot fixes:

      1. Back up the existing csv before you edit it by running the following command:
        oc -n management-monitoring get csv ibm-management-monitoring.v2.3.24 -o yaml >/tmp/backup-ibm-management-monitoring.v2.3.24.yaml
        
      2. Edit the csv by running the following command:

        oc -n management-monitoring edit csv ibm-management-monitoring.v2.3.24
        
        • Replace the line olm.relatedImage.agentmgmt: ...... with the following line:

          olm.relatedImage.agentmgmt: cp.icr.io/cp/cp4mcm/agentmgmt:developer-fix-23095
          
        • Replace the line olm.relatedImage.metricprovider: ...... with the following line:

          olm.relatedImage.metricprovider: cp.icr.io/cp/cp4mcm/metricprovider:developer-fix-22989-mcm-monitor22
          
        • Replace the line olm.relatedImage.nasm-topology-service: ...... with the following line:

          olm.relatedImage.nasm-topology-service: cp.icr.io/cp/cp4mcm/nasm-topology-service:developer-fix-23095
          
      3. Save the changes.

    3. If Cassandra is running in standalone mode (not high availability mode, and there are no plans to run in high availability mode), reduce the tombstones in Cassandra by running the following commands:

      oc -n management-monitoring exec -it monitoring-cassandra-0 bash
      set |grep -i CASSANDRA_PASS
      set |grep -i CASSANDRA_USER
      cqlsh -u $CASSANDRA_USER -p $CASSANDRA_PASS
      SELECT table_name,gc_grace_seconds FROM system_schema.tables WHERE keyspace_name='janusgraph';
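
      The output is similar to the following example; 864000 seconds (10 days) is the Cassandra default, and your values might differ:

       table_name             | gc_grace_seconds
      ------------------------+------------------
       edgestore              |           864000
       graphindex             |           864000
       janusgraph_ids         |           864000
       ...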
      
    4. If any tables are shown with a non-zero gc_grace_seconds value, run the following commands to change them. Setting gc_grace_seconds to 0 allows tombstones to be dropped at the next compaction, which is appropriate only for standalone (single-node) Cassandra, where no replicas need the tombstones for repair.

      ALTER TABLE janusgraph.edgestore WITH gc_grace_seconds = 0;
      ALTER TABLE janusgraph.edgestore_lock_ WITH gc_grace_seconds = 0;
      ALTER TABLE janusgraph.graphindex WITH gc_grace_seconds = 0;
      ALTER TABLE janusgraph.graphindex_lock_ WITH gc_grace_seconds = 0;
      ALTER TABLE janusgraph.janusgraph_ids WITH gc_grace_seconds = 0;
      ALTER TABLE janusgraph.system_properties WITH gc_grace_seconds = 0;
      ALTER TABLE janusgraph.system_properties_lock_ WITH gc_grace_seconds = 0;
      ALTER TABLE janusgraph.systemlog WITH gc_grace_seconds = 0;
      ALTER TABLE janusgraph.txlog WITH gc_grace_seconds = 0;
      -- Check the values again
      SELECT table_name,gc_grace_seconds FROM system_schema.tables WHERE keyspace_name='janusgraph';
      exit
      exit
      
    5. Reduce the Topology history data retention period to 8 days (192 hours) in the Cassandra keyspace.

      1. Determine the history data retention period in Topology by running the following command:
        oc -n management-monitoring describe deploy monitoring-topology |grep HISTORY_TTL
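
        If HISTORY_TTL is set as a literal value, the output looks similar to the following example (the value shown is illustrative):

        HISTORY_TTL:  720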
        
      2. If HISTORY_TTL is greater than 192 (hours), take the following steps:

        1. Determine the full name of your sizing configmap by running the following commands:

          oc -n cp4mcm get installation ibm-management -o yaml |grep environment
          
          oc -n management-monitoring get configmap |grep sizing
          
        2. Edit the sizing configmap by running the following command, specifying the full configmap name that you found:

          oc -n management-monitoring edit configmap monitoring-sizing-sizeXXX
          
        3. Add or replace the following statement to set HISTORY_TTL to 8 days:

          Deployment.topology.monitoring-topology.env: '[{"name": "HISTORY_TTL", "value": "192"}]'
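
          If you already added JVM_ARGS to this env entry in an earlier step, keep both variables in the same array rather than replacing it, for example:

          Deployment.topology.monitoring-topology.env: '[{"name":"JVM_ARGS","value":"-Xms128M -Xmx512M"},{"name":"HISTORY_TTL","value":"192"}]'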
          
  4. Compact the Cassandra janusgraph keyspace regularly.

    1. Run the following command to enter the Cassandra container so that you can run the nodetool commands:

      oc -n management-monitoring exec -it monitoring-cassandra-0 bash
      
    2. Compact the janusgraph keyspace by running the following commands:

      nodetool status janusgraph
      
      nodetool compact janusgraph
      
    3. Check the compaction progress from a second terminal connection to the Cassandra pod by running the following command:

      nodetool compactionstats
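
      When the compaction finishes, the output reports no active compactions, similar to the following example (the exact format varies by Cassandra version):

      pending tasks: 0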
      
    4. Check whether the keyspace has a smaller size by running the following commands:

      nodetool status janusgraph
      
      exit
      
