Monitoring topology pod has a high number of restarts
If you see that the monitoring topology pod has a high number of restarts, follow these steps to diagnose and resolve the problem.
Symptoms
When you run the command oc -n management-monitoring get pod |grep topology, you might see that the pod monitoring-topology-xxx
has a high number of restarts in a short period of time. See the following example:
NAME                                   READY   STATUS    RESTARTS   AGE
monitoring-topology-7fbf949c87-lg8j6   1/1     Running   3          1d
Diagnosis
-
Check the monitoring topology pod to see whether it runs out of memory or out of Java heap memory.
- Determine the full topology pod name by running the following command:
oc -n management-monitoring get pod |grep topology
- Check the reason for the last termination, such as OOMKilled, which indicates out of memory, by running the following command:
oc -n management-monitoring describe pod monitoring-topology-XXX
- Check whether any errors, such as out of Java heap memory, exist in the previous logs by running the following command:
oc -n management-monitoring logs monitoring-topology-XXX -p
- Check whether any other errors, such as cannot access database, exist in the current log by running the following command:
oc -n management-monitoring logs monitoring-topology-XXX
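For example, to see only the last-termination details, you can filter the describe output. The output below is illustrative and varies by environment, but an out-of-memory kill typically looks similar to this:
oc -n management-monitoring describe pod monitoring-topology-XXX | grep -A 4 "Last State"
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137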
-
Check whether the Cassandra pod has any problems, such as running out of memory, insufficient disk space, or I/O delays.
- Check the reason for the last termination, such as OOMKilled, which indicates out of memory, by running the following command:
oc -n management-monitoring describe pod monitoring-cassandra-0
- Check whether any errors exist in the previous logs by running the following command:
oc -n management-monitoring logs monitoring-cassandra-0 -p
- Check whether any errors exist in the current logs by running the following command:
oc -n management-monitoring logs monitoring-cassandra-0
- Check whether any errors exist in the debug log by running the following commands:
oc -n management-monitoring cp monitoring-cassandra-0:/opt/ibm/cassandra/logs/debug.log /tmp/cassandra_debug.log
less /tmp/cassandra_debug.log
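As a quick way to surface common failure signatures in the Cassandra logs, you can also filter them. The pattern below is only a starting point, not an exhaustive list of possible errors:
oc -n management-monitoring logs monitoring-cassandra-0 | grep -iE "error|exception|outofmemory" | tail -n 40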
Solutions
-
If the issue is caused by insufficient memory or Java heap size in Topology, increase the memory or JVM heap size as follows.
-
Determine the full name of your sizing configmap by running the following commands:
oc -n cp4mcm get installation ibm-management -o yaml |grep environment
oc -n management-monitoring get configmap |grep sizing
-
Specify the full configmap name by running the following command:
oc -n management-monitoring edit configmap monitoring-sizing-sizeXXX
-
Update the following statement with a larger memory limit:
Deployment.topology.monitoring-topology.memLimit: xxMi
Replace xx with the memory limit value.
-
Add or replace the following statement to increase the JVM maximum limit:
Deployment.topology.monitoring-topology.env: '[{"name":"JVM_ARGS","value":"-Xms128M -Xmx512M"}]'
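Putting these together, the edited entries in the sizing configmap might look similar to the following sketch. The values are illustrative only; keep the -Xmx heap size comfortably below the memory limit (see the notes at the end of this topic):
Deployment.topology.monitoring-topology.memLimit: 768Mi
Deployment.topology.monitoring-topology.env: '[{"name":"JVM_ARGS","value":"-Xms128M -Xmx512M"}]'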
-
If the issue is caused by insufficient memory or Java heap size in Cassandra, increase the memory or JVM heap size as follows:
-
Determine the full name of your sizing configmap by running the following commands:
oc -n cp4mcm get installation ibm-management -o yaml |grep environment
oc -n management-monitoring get configmap |grep sizing
-
Specify the full configmap name by running the following command:
oc -n management-monitoring edit configmap monitoring-sizing-sizeXXX
-
Update the following statements with a larger memory limit and memory request size:
StatefulSet.cassandra.monitoring-cassandra.memLimit: xxGi
StatefulSet.cassandra.monitoring-cassandra.memRequest: xxGi
Replace xx with the memory limit and memory request values.
-
Add or replace the following statement to increase the JVM maximum limit:
StatefulSet.cassandra.monitoring-cassandra.env: '[{"name":"CASSANDRA_HEAP_SIZE","value":"xxG"}]'
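For example, the edited Cassandra entries might look similar to the following sketch. The values are illustrative only; keep CASSANDRA_HEAP_SIZE smaller than the memory request so that memory remains available for off-heap and operating system usage:
StatefulSet.cassandra.monitoring-cassandra.memLimit: 10Gi
StatefulSet.cassandra.monitoring-cassandra.memRequest: 8Gi
StatefulSet.cassandra.monitoring-cassandra.env: '[{"name":"CASSANDRA_HEAP_SIZE","value":"6G"}]'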
-
If the issue is caused by errors in Topology or Cassandra, take the following steps to apply the latest fixes and reduce redundant data in the Cassandra janusgraph keyspace.
-
Upgrade IBM Cloud Pak® for Multicloud Management to the latest release.
-
If you use IBM Cloud Pak® for Multicloud Management 2.3 Fix Pack 4, apply the following hot fixes:
- Back up the existing csv before you edit it by running the following command:
oc -n management-monitoring get csv ibm-management-monitoring.v2.3.24 -o yaml >/tmp/backup-ibm-management-monitoring.v2.3.24.yaml
-
Edit the csv by running the following command:
oc -n management-monitoring edit csv ibm-management-monitoring.v2.3.24
-
Replace the line
olm.relatedImage.agentmgmt: ......
with the following line:
olm.relatedImage.agentmgmt: cp.icr.io/cp/cp4mcm/agentmgmt:developer-fix-23095
-
Replace the line
olm.relatedImage.metricprovider: ......
with the following line:
olm.relatedImage.metricprovider: cp.icr.io/cp/cp4mcm/metricprovider:developer-fix-22989-mcm-monitor22
-
Replace the line
olm.relatedImage.nasm-topology-service: ......
with the following line:
olm.relatedImage.nasm-topology-service: cp.icr.io/cp/cp4mcm/nasm-topology-service:developer-fix-23095
-
Save the changes.
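To confirm that the image references now point to the fix levels, you can optionally list them again:
oc -n management-monitoring get csv ibm-management-monitoring.v2.3.24 -o yaml | grep olm.relatedImage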
-
If Cassandra is running in standalone mode (not high availability mode, and there are no plans to run in high availability mode), reduce the tombstones in Cassandra by running the following commands:
oc -n management-monitoring exec -it monitoring-cassandra-0 bash
set |grep -i CASSANDRA_PASS
set |grep -i CASSANDRA_USER
cqlsh -u $CASSANDRA_USER -p $CASSANDRA_PASS
SELECT table_name,gc_grace_seconds FROM system_schema.tables WHERE keyspace_name='janusgraph';
-
If any tables are shown with a non-zero gc_grace_seconds, run the following commands to change them:
Alter table janusgraph.edgestore WITH gc_grace_seconds =0;
Alter table janusgraph.edgestore_lock_ WITH gc_grace_seconds =0;
Alter table janusgraph.graphindex WITH gc_grace_seconds =0;
Alter table janusgraph.graphindex_lock_ WITH gc_grace_seconds =0;
Alter table janusgraph.janusgraph_ids WITH gc_grace_seconds =0;
Alter table janusgraph.system_properties WITH gc_grace_seconds =0;
Alter table janusgraph.system_properties_lock_ WITH gc_grace_seconds =0;
Alter table janusgraph.systemlog WITH gc_grace_seconds =0;
Alter table janusgraph.txlog WITH gc_grace_seconds =0;
Check the settings again:
SELECT table_name,gc_grace_seconds FROM system_schema.tables WHERE keyspace_name='janusgraph';
Exit cqlsh and then the container:
exit
exit
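If you prefer not to open an interactive session, the same check can be run in one step. This is a sketch that assumes CASSANDRA_USER and CASSANDRA_PASS are available as environment variables in the container, as in the interactive steps above; if they are set only by an interactive profile script, use the interactive steps instead:
oc -n management-monitoring exec monitoring-cassandra-0 -- bash -c "cqlsh -u \$CASSANDRA_USER -p \$CASSANDRA_PASS -e \"SELECT table_name,gc_grace_seconds FROM system_schema.tables WHERE keyspace_name='janusgraph';\""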
-
Reduce the Topology history data retention period to 8 days (192 hours) in the Cassandra keyspace.
- Determine the history data retention period in Topology by running the following command:
oc -n management-monitoring describe deploy monitoring-topology |grep HISTORY_TTL
- If HISTORY_TTL is greater than 192 (hours), take the following steps:
-
Determine the full name of your sizing configmap by running the following commands:
oc -n cp4mcm get installation ibm-management -o yaml |grep environment
oc -n management-monitoring get configmap |grep sizing
-
Specify the full configmap name in the following command:
oc -n management-monitoring edit configmap monitoring-sizing-sizeXXX
-
Add or replace the following statement to set
HISTORY_TTL
to8
days:Deployment.topology.monitoring-topology.env: '[{"name": "HISTORY_TTL", "value": "192"}]'
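After the configmap change is applied and the topology pod restarts, you can optionally confirm the new value from the deployment, for example:
oc -n management-monitoring get deploy monitoring-topology -o yaml | grep -A 1 HISTORY_TTL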
-
Compact the Cassandra janusgraph keyspace regularly.
-
Run the following command to enter the Cassandra container so that you can run the nodetool commands:
oc -n management-monitoring exec -it monitoring-cassandra-0 bash
-
Compact the janusgraph keyspace by running the following commands:
nodetool status janusgraph
nodetool compact janusgraph
-
Check the compaction progress in another connection to the Cassandra pod by running the following command:
nodetool compactionstats
-
Check whether the keyspace has a smaller size by running the following commands:
nodetool status janusgraph
exit
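To run the compaction on a schedule without an interactive session, you can also invoke nodetool directly through oc exec. This is a sketch that could be placed in an external cron job, not a supported automation:
oc -n management-monitoring exec monitoring-cassandra-0 -- nodetool compact janusgraph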
Notes:
-
The Kubernetes memory request needs to be paired with the heap setting. For example, if the JVM heap size (Xmx) is set to 256 MB, the memRequest needs to be around 350 MB (256 MB plus around 100 MB for native JVM usage). The memLimit needs to be set to a higher value, about 25% more or 450 MB in this example, to prevent frequent OOMKills from small temporary spikes. For a worked example of matching values, see the sketch after these notes.
-
After the changes in the configmap or csv are made, the affected pods will be restarted.
-
Here is an example of specifying more than one environment variable for Topology in the sizing configmap:
Deployment.topology.monitoring-topology.env: '[{"name": "HISTORY_TTL", "value": "192"},{"name":"JVM_ARGS","value":"-Xms128M -Xmx512M"}]'
-
You need to monitor the Topology and Cassandra pods in the OpenShift Container Platform console by clicking Workloads > Pods and selecting the management-monitoring namespace. Review the memory usage and pod logs regularly.
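As referenced in the first note, here is a minimal sketch of matching heap and memory values for the topology deployment. The values are illustrative, and the Deployment.topology.monitoring-topology.memRequest key is an assumption modeled on the Cassandra memRequest key shown earlier; verify that your sizing configmap supports it before you rely on it:
Deployment.topology.monitoring-topology.env: '[{"name":"JVM_ARGS","value":"-Xms128M -Xmx256M"}]'
Deployment.topology.monitoring-topology.memRequest: 350Mi
Deployment.topology.monitoring-topology.memLimit: 450Mi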