Troubleshooting topology

Learn how to isolate and resolve problems with the Topology viewer and topology service.

To help troubleshoot topology issues, review the topology service logs. For more information, see Viewing the topology logs.

Common issues:

Chrome extension display issues on Incident Topology page

When you display the details of a listed event in the Resolution Hub in the Chrome browser, resizing the page might cause display errors. This is caused by the MEGA Secure Cloud Storage and Chat Chrome browser extension.

Solution: Remove the MEGA Secure Cloud Storage and Chat extension.

Error connecting topology services to Cassandra database

During startup, the topology service can attempt to connect to the Cassandra database before the database is fully started. The attempt fails with an error, followed by a warning message similar to the following:

WARN   [2022-02-28 11:01:36,748] [main] c.i.i.t.g.ConnectionManager -  failed to connect to storage back end sleeping before retrying.

This issue can occur when the topology service has started before the Cassandra database is fully ready.

Solution: No workaround is required. The topology service retries the connection to the Cassandra database repeatedly. When the Cassandra database is up and running, the connection succeeds.
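To check that the warning is only the transient startup retry, you can count how often it appears in a saved copy of the topology service log and confirm that the count stops growing once Cassandra is ready. The following is a minimal sketch, assuming the log was saved to a local file; the function name count_cassandra_retries is hypothetical and not part of the product.

```shell
# Count occurrences of the retry warning in a saved topology service log.
# Assumption: the log text was saved to a local file whose path is passed as $1.
count_cassandra_retries() {
  # grep -c prints the number of matching lines. It exits non-zero when the
  # count is 0, so "|| true" keeps the function safe to use under "set -e".
  grep -c 'failed to connect to storage back end' "$1" || true
}
```

For example, count_cassandra_retries topology.log prints the number of retry warnings in topology.log. If the number keeps increasing long after the Cassandra pods are ready, the problem is probably not the startup race described here.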

Observer schedule toggle becomes disabled after saving the job

After editing and saving an observer job that includes scheduled jobs, the Schedule request toggle becomes disabled. However, the job remains scheduled and continues to run in the background.

This issue can occur when the Schedule request toggle fails to read the existing job schedule and erroneously reverts to its default setting of Off.

Solution: The job scheduling remains active and no workaround is required. To confirm the existing observer job schedule:

  1. Log in to the IBM Cloud Pak for AIOps console.
  2. Expand the navigation menu (four horizontal bars), then click Define > Integrations.
  3. On the Integrations page, click Add integration.
  4. On the Add integration page, click Topology in the Category list that is next to the list of all integrations.
  5. Click the configure, schedule, and manage other observer jobs link in the description for the topology integration.
  6. Select the observer.
  7. Confirm that the job schedule remains active.

File Observer fails due to hidden characters

An error similar to the following can occur when your File Observer input file contains hidden characters that interfere with the processing of the content:

The following lines numbers had problems, check the logs for details: 1

This issue can occur when a file appears compliant in content and format but is encoded as UTF-8 with a byte order mark (BOM) instead of plain UTF-8.

Solution: Convert the file to UTF-8 without a BOM. For example, the following command creates a new file from the source file with the BOM bytes removed from the start of the first line (the \x escapes require GNU sed):

sed '1s/^\xEF\xBB\xBF//' < topology.txt > new.txt
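Before you convert a file, it can help to confirm that it really starts with a BOM. The following is a minimal sketch that inspects the first three bytes of a file; the function name has_utf8_bom is hypothetical.

```shell
# Succeed (exit status 0) if the file starts with the UTF-8 byte order mark,
# which is the byte sequence EF BB BF.
has_utf8_bom() {
  # Read the first three bytes and print them as hex without offsets or spaces.
  first_bytes=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
  [ "$first_bytes" = "efbbbf" ]
}

# Example use:
#   if has_utf8_bom topology.txt; then echo "BOM found"; fi
```

The check reads only the start of the file, so it is safe to run on large input files.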

Kubernetes Observer job fails to restart after OOM

Kubernetes Observer jobs with very large payloads can encounter an OOM (out-of-memory) error, after which they may fail to restart. The observer appears offline, but a health check fails to flag any errors.

Solution: Restart the observer if it appears as offline in the UI.

Halt a Cassandra node for maintenance

Sometimes you need to manually maintain the Cassandra server nodes. For example, you might need to halt one node for long enough to allow you to complete the required maintenance work or debug a problem.

The following steps suspend (halt) one Cassandra node while the other nodes still serve queries, which allows you to perform maintenance tasks without shutting down your production environment.

Note: To reduce the impact of pausing a Cassandra node on the performance of your remaining system resources, keep such maintenance periods to a minimum.

Halt the node

  1. Edit the bootstrap configMap.

oc edit configmaps aiops-topology-cassandra-bootstrap-config

For example, to halt node 0, set the following fields.

hostname_filter: aiops-topology-cassandra-0
running_mode: maintenance

  2. Save and exit.

  3. Restart the target node pod by deleting it. The statefulSet re-creates the pod automatically, and the new pod starts in the halted state.

oc delete pod aiops-topology-cassandra-0
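If you prefer not to edit the configMap interactively, the same change might be scripted with oc patch. The following is a sketch only: it assumes that hostname_filter and running_mode are top-level keys under data in the configMap, which this page does not confirm, and the function name set_cassandra_running_mode is hypothetical. Check the configMap layout with oc get configmap before you rely on it.

```shell
# Set the target node and running mode in the Cassandra bootstrap configMap.
# Assumption: hostname_filter and running_mode are top-level keys under "data".
#   $1 = pod name, for example aiops-topology-cassandra-0
#   $2 = running mode, either maintenance or normal
set_cassandra_running_mode() {
  node="$1"
  mode="$2"
  # A merge patch changes only the listed keys and leaves other data untouched.
  oc patch configmap aiops-topology-cassandra-bootstrap-config --type merge \
    -p "{\"data\":{\"hostname_filter\":\"${node}\",\"running_mode\":\"${mode}\"}}"
}
```

For example, set_cassandra_running_mode aiops-topology-cassandra-0 maintenance corresponds to the edit shown above, and passing normal as the second argument reverses it. You still need to delete the pod for the change to take effect.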

Debug the Cassandra server

After the pod restarts in the halted state, its status stays at 0/1 Running. In this mode, the Cassandra server is not started automatically, and the liveness and readiness probes are not activated for many days, which gives you time to perform any required maintenance.

There are two methods to start the Cassandra server for debugging.

  • You can go into the pod container to manually start the server.

    oc exec -it aiops-topology-cassandra-0 -- bash
    bash-4.4$ /opt/ibm/cassandra/bin/cassandra -fR > /tmp/server.log 2>&1 &
    

    The server process starts in the background and writes its log messages to /tmp/server.log.

  • You can reuse the entry point script.

    oc exec -it aiops-topology-cassandra-0 -- bash
    bash-4.4$ export RUNNING_MODE=normal
    bash-4.4$ /opt/ibm/start-cassandra.sh > /tmp/server.log 2>&1 &
    

Example maintenance task: Remove corrupt commit logs that prevent Cassandra from starting.

oc exec -it aiops-topology-cassandra-0 -- bash
bash-4.4$ rm -rf /opt/ibm/cassandra/data/commitlog/CommitLog-6-1682019553495.log

After performing the required maintenance, you can stop the manually started Cassandra server.

oc exec -it aiops-topology-cassandra-0 -- bash
bash-4.4$ nodetool stopdaemon

Note: You can also perform other actions, such as fixing volumes or transforming the data files. While the running mode is set to maintenance, the pod is halted every time it restarts.

Return the node to normal

After you complete your manual work, edit the configMap and restart the pod to return the node to normal.

oc edit configmaps aiops-topology-cassandra-bootstrap-config

Set the running mode to normal.

hostname_filter: aiops-topology-cassandra-0
running_mode: normal

Restart normal operations

Finally, delete the pod again so that it restarts as a normal node and resumes work in the cluster.
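To confirm that the node has rejoined the cluster, you can delete the pod and then poll until its container reports ready. The following is a minimal sketch, assuming the pod has a single container (consistent with the 0/1 status shown earlier); the function name restart_and_wait and the POLL_SECONDS variable are hypothetical.

```shell
# Delete a pod and wait until the re-created pod reports ready.
#   $1 = pod name, for example aiops-topology-cassandra-0
#   $2 = maximum number of polls (default 60)
# POLL_SECONDS sets the delay between polls (default 10 seconds).
restart_and_wait() {
  pod="$1"
  tries="${2:-60}"
  oc delete pod "$pod" || return 1
  while [ "$tries" -gt 0 ]; do
    # Prints "true" once the first container passes its readiness probe.
    # The lookup can fail briefly while the statefulSet re-creates the pod.
    ready=$(oc get pod "$pod" -o jsonpath='{.status.containerStatuses[0].ready}' 2>/dev/null) || ready=""
    if [ "$ready" = "true" ]; then
      return 0
    fi
    tries=$((tries - 1))
    sleep "${POLL_SECONDS:-10}"
  done
  # The pod did not become ready within the allowed number of polls.
  return 1
}
```

For example, restart_and_wait aiops-topology-cassandra-0 returns 0 when the pod shows 1/1 Running. A non-zero return after the timeout suggests that the running mode is still set to maintenance or that Cassandra failed to start.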

Topology observer job failure

When you attempt to run a topology observer, the job fails because no response is returned for the request to the corresponding external service.

Solution: Run the following commands to create a NetworkPolicy that allows egress from the topology observer pods:

export AIOPS_NAMESPACE=<AIOps installation namespace>

cat << EOF | oc apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: aiops-observer-egress
  namespace: ${AIOPS_NAMESPACE}
spec:
  egress:
  - {}
  podSelector:
    matchLabels:
      app.kubernetes.io/instance: aiops-topology
  policyTypes:
  - Egress
EOF
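After you apply the policy, you can confirm that it exists in the installation namespace and covers egress. The following is a minimal sketch that reuses the AIOPS_NAMESPACE variable and the policy name from the commands above; the function name observer_egress_policy_types is hypothetical.

```shell
# Print the policy types of the aiops-observer-egress NetworkPolicy.
# Requires AIOPS_NAMESPACE to be exported as in the previous commands.
observer_egress_policy_types() {
  oc get networkpolicy aiops-observer-egress -n "${AIOPS_NAMESPACE}" \
    -o jsonpath='{.spec.policyTypes[*]}'
}
```

If the policy was created as shown, the function prints Egress. An error from oc indicates that the policy is missing or was created in a different namespace.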

Topology viewer user interface crashes after the update manager is displayed

When the update manager is displayed, the Topology viewer user interface crashes, so you cannot use the update manager feature.

Solution: Change your Topology viewer user preferences to automatically render changes on refresh, which prevents the update manager from appearing.