Observer troubleshooting

See the following information to troubleshoot a variety of observer issues.

Observer jobs appear stuck in 'queued' state

Observer jobs can appear stuck in the 'Queued' state after a Kafka outage, or after Kafka authentication is enabled, because messages between the observers and the observer service are lost.

Workaround: As an administrator, remove or modify the existing job schedule to get the job working again.

Kubernetes Observer job fails to restart after OOM

Kubernetes Observer jobs with very large payloads can encounter an OOM (out-of-memory) error, after which they may fail to restart. The observer appears offline, but a health check fails to flag any errors.
Workaround
Restart the observer if it appears as offline in the UI.

Observation data dropped after retention period (OCP only)

If an observation takes longer than 8 hours, it can exceed the default Kafka retention period, and the remainder of the data from that observation is dropped from the resources.json Kafka topic.

Cause
The default retention period is 8 hours.
Workaround
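The default retention time is defined by the environment variable named KAFKA_RESOURCES_JSON_RETENTION_MS. In OCP, you can change it by adding the following code to the spec section of the custom resource of the operator. The value is in milliseconds; the example shows 28800000, which equals the 8-hour default, so set a larger value to extend the retention period.
spec:
  helmValuesASM:
    global.asm.inputTopicRetentionPeriodMs: 28800000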
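Because the retention period is expressed in milliseconds, it can help to compute the value from the expected observation duration. The following is a minimal sketch; the function name hours_to_ms is hypothetical and not part of Agile Service Manager.

```python
def hours_to_ms(hours):
    """Convert a duration in hours to whole milliseconds."""
    # 1 hour = 60 minutes * 60 seconds * 1000 milliseconds
    return int(hours * 60 * 60 * 1000)


if __name__ == "__main__":
    # The 8-hour default corresponds to the value used in the example.
    print(hours_to_ms(8))
    # A longer period, for observations expected to run up to a day.
    print(hours_to_ms(24))
```

For an 8-hour period the function returns 28800000, which matches the default value shown in the custom resource example.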

File Observer fails due to hidden characters

An error can occur when your File Observer input file contains hidden characters that interfere with the processing of the content.

Error message example:
The following lines numbers had problems, check the logs for details: 1
Cause
A file can appear compliant in content and format but actually be encoded as UTF-8 with a byte order mark (BOM) rather than as plain UTF-8.
Workaround
Change the file format by removing the BOM. For example, the following command creates a new file from the source file with the leading BOM bytes stripped from the first line:
sed '1s/^\xEF\xBB\xBF//' < topology.txt > new.txt
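Where a script is preferred over sed, the same cleanup can be sketched in Python. This is an illustration only: the function names strip_utf8_bom and rewrite_without_bom are hypothetical, and the file names are taken from the sed example.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def rewrite_without_bom(src_path, dst_path):
    """Copy src_path to dst_path, dropping a leading UTF-8 BOM."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(strip_utf8_bom(data))
```

For example, calling rewrite_without_bom("topology.txt", "new.txt") produces the same kind of output file as the sed command. A file without a BOM is copied unchanged.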

Network Discovery Observer 'unable to start threads' error

An error can occur when you upgrade to the latest version of Agile Service Manager and then create a new Network Discovery Observer job without increasing the pids_limit value defined in the crio.conf file to at least 44406.

Error message example:
{"message":"Tue Dec  6 09:32:06 2022  Warning: Error found in file CRivThread.cc at line 101 - Unable to create a thread.","timestamp":"2022-12-06T09:32:06","level":"trace","log_file":"ncp_agent.SerialLink.NCOMS.trace"}
{"message":"Reason: Resource temporarily unavailable","timestamp":"2022-12-06T09:32:06","level":"trace","log_file":"ncp_agent.SerialLink.NCOMS.trace"}
{"message":"If possible, try reducing the number of threads this process has been configured to use.","timestamp":"2022-12-06T09:32:06","level":"trace","log_file":"ncp_agent.SerialLink.NCOMS.trace"}
...
Workaround
OCP requirement: On the OCP hosts, network discovery requires that pids_limit be set to at least 44406 in the crio.conf file.
For information about changing the values in the crio.conf file using Machine Configs, see Creating a ContainerRuntimeConfig CR to edit CRI-O parameters.
For reference information about Machine Configs, see Red Hat Enterprise Linux CoreOS (RHCOS).
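To confirm that a host meets the requirement, the pids_limit setting can be read from the contents of the crio.conf file. The following is a rough sketch, assuming the setting appears as an uncommented "pids_limit = <number>" line; the function names are hypothetical, and special values that a CRI-O version may treat as "unlimited" are not handled.

```python
import re

# Minimum value required for network discovery, as stated above.
REQUIRED_PIDS_LIMIT = 44406


def read_pids_limit(conf_text):
    """Return the pids_limit value found in crio.conf text, or None if unset."""
    # Only uncommented lines match, because the pattern is anchored
    # at the start of a line and allows only whitespace before the key.
    match = re.search(r"^\s*pids_limit\s*=\s*(\d+)", conf_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def pids_limit_ok(conf_text, required=REQUIRED_PIDS_LIMIT):
    """Return True if pids_limit is set and is at least the required value."""
    limit = read_pids_limit(conf_text)
    return limit is not None and limit >= required
```

The text passed to these functions would typically be read from the crio.conf file on the host; the location of that file is not assumed here.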

OpenStack Observer certificate chaining error

A certificate chaining error can occur when you launch an OpenStack Observer job. In that case, the /opt/ibm/netcool/asm/logs/openstack-observer/openstack-observer.log file contains entries such as the following:
INFO   [2019-11-01 14:48:50,609] [cfd95b7e-3bc7-4006-a4a8-a73a79c71255:OpenStack - Ericsson - ceevepc] c.i.i.t.o.t.ObservationVertex -  Backing up observation vertex Ericsson - VEPC
INFO   [2019-11-01 14:48:50,617] [cfd95b7e-3bc7-4006-a4a8-a73a79c71255:OpenStack - Ericsson - ceevepc] c.i.i.t.o.t.ObservationVertex -  Existing backup observation vertex CTvJ5KIFQgaGNexrlJBsjA for Ericsson - VEPC.bak
INFO   [2019-11-01 14:48:50,636] [cfd95b7e-3bc7-4006-a4a8-a73a79c71255:OpenStack - Ericsson - ceevepc/KeystoneV3IdentityTask] c.i.i.t.o.o.j.r.v.t.AbstractTask -  cfd95b7e-3bc7-4006-a4a8-a73a79c71255:OpenStack - Ericsson - ceevepc/KeystoneV3IdentityTask - Starting...
INFO   [2019-11-01 14:48:50,661] [cfd95b7e-3bc7-4006-a4a8-a73a79c71255:OpenStack - Ericsson - ceevepc] c.i.i.t.o.o.j.r.OpenStackV3FullTopologyGetter -  cfd95b7e-3bc7-4006-a4a8-a73a79c71255:OpenStack - Ericsson - ceevepc - cancel - Cancelling Tasks, Shutting Down Executor...
ERROR  [2019-11-01 14:48:50,663] [cfd95b7e-3bc7-4006-a4a8-a73a79c71255:OpenStack - Ericsson - ceevepc] c.i.i.t.o.o.j.r.OpenStackV3FullTopologyGetter -  cfd95b7e-3bc7-4006-a4a8-a73a79c71255:OpenStack - Ericsson - ceevepc - OpenStack task error occurred, rethrowing...
java.util.concurrent.ExecutionException: com.ibm.itsm.topology.observer.openstack.job.OpenStackTaskProcessingException: An error occurred while processing KeystoneV3IdentityTask:- javax.net.ssl.SSLHandshakeException: com.ibm.jsse2.util.h: PKIX path building failed: java.security.cert.CertPathBuilderException: PKIXCertPathBuilderImpl could not build a valid CertPath.; internal cause is:
        java.security.cert.CertPathValidatorException: The certificate issued by CN=IBMSubCA01, DC=IBM, DC=com, DC=Raleigh is not trusted; internal cause is:
        java.security.cert.CertPathValidatorException: Certificate chaining error
        at java.util.concurrent.FutureTask.report(FutureTask.java:133)
        at java.util.concurrent.FutureTask.get(FutureTask.java:203)
        at com.ibm.itsm.topology.observer.openstack.job.rest.OpenStackV3FullTopologyGetter.waitForFutures(OpenStackV3FullTopologyGetter.java:155)
        at com.ibm.itsm.topology.observer.openstack.job.rest.OpenStackV3FullTopologyGetter.go(OpenStackV3FullTopologyGetter.java:107)
        at com.ibm.itsm.topology.observer.openstack.job.rest.FullRESTLoadJob.observe(FullRESTLoadJob.java:85)
        at com.ibm.itsm.topology.observer.app.ObservationJob.call(ObservationJob.java:179)
        at com.ibm.itsm.topology.observer.app.ObservationJob.call(ObservationJob.java:63)
        at com.ibm.itsm.topology.service.utils.InstrumentedVisibleExecutorService.wrapCallable(InstrumentedVisibleExecutorService.java:385)
        at com.ibm.itsm.topology.service.utils.InstrumentedVisibleExecutorService.access$400(InstrumentedVisibleExecutorService.java:65)
        at com.ibm.itsm.topology.service.utils.InstrumentedVisibleExecutorService$InstrumentedCallable.call(InstrumentedVisibleExecutorService.java:345)
        at java.util.concurrent.FutureTask.run(FutureTask.java:277)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1160)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.lang.Thread.run(Thread.java:812)
Caused by: com.ibm.itsm.topology.observer.openstack.job.OpenStackTaskProcessingException: An error occurred while processing KeystoneV3IdentityTask:- javax.net.ssl.SSLHandshakeException: com.ibm.jsse2.util.h: PKIX path building failed: java.security.cert.CertPathBuilderException: PKIXCertPathBuilderImpl could not build a valid CertPath.; internal cause is:
        java.security.cert.CertPathValidatorException: The certificate issued by CN=IBMSubCA01, DC=IBM, DC=com, DC=Raleigh is not trusted; internal cause is:
        java.security.cert.CertPathValidatorException: Certificate chaining error
        at com.ibm.itsm.topology.observer.openstack.job.rest.v3.task.KeystoneV3IdentityTask.process(KeystoneV3IdentityTask.java:43)
        at com.ibm.itsm.topology.observer.openstack.job.rest.v2.task.AbstractTask.call(AbstractTask.java:45)
        at com.ibm.itsm.topology.observer.openstack.job.rest.v2.task.AbstractTask.call(AbstractTask.java:22)
        at java.util.concurrent.FutureTask.run(FutureTask.java:277)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:522)
        at java.util.concurrent.FutureTask.run(FutureTask.java:277)
        at com.ibm.itsm.topology.service.utils.InstrumentedVisibleExecutorService.wrapRunnable(InstrumentedVisibleExecutorService.java:406)
        at com.ibm.itsm.topology.service.utils.InstrumentedVisibleExecutorService.access$200(InstrumentedVisibleExecutorService.java:65)
        at com.ibm.itsm.topology.service.utils.InstrumentedVisibleExecutorService$InstrumentedRunnable.run(InstrumentedVisibleExecutorService.java:317)
        ... 3 common frames omitted
Cause
The problem can occur if not all OpenStack certificates have been loaded into Agile Service Manager, or if the certificate has not been added to the trusted CA list on the Agile Service Manager server.
Workaround
To load all OpenStack certificates into Agile Service Manager, obtain a copy of the root certificate or certificates from the OpenStack host, and import them into the keystore.
Note: If the host has more than one naming alias, ensure that you obtain all of the certificates. Obtain the certificates directly from the OpenStack administrator or the server; do not generate them using the openssl command.
To add the certificate to the trusted CA list on the Agile Service Manager server, copy the ca.pem file as a root certificate.
Note: For more information, see the following link: https://access.redhat.com/solutions/3220561

File Observer heap size issue

If a large number of events are being processed, the default Java Virtual Machine (JVM) memory settings may prove insufficient and processing errors may occur. These errors can generate WARNING logs, and processing of data may be suspended.
Workaround
Increase the maximum Java heap size (Xmx) value to 6G.
  1. Edit the ASM_HOME/etc/nasm-file-observer.yml file and change the Xmx value in the following default argument to 6G:
    JVM_ARGS: ${FILE_OBSERVER_JVM_ARGS:--Xms1G -Xmx2G}
  2. Restart the service.
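If the change has to be applied to several installations, the edit to the JVM_ARGS line can be scripted. The following is a minimal sketch, assuming the line has the default form shown above; the function name set_max_heap is hypothetical.

```python
import re


def set_max_heap(line, new_xmx="6G"):
    """Return the line with every -Xmx value replaced by new_xmx."""
    # Matches -Xmx followed by digits and an optional size suffix (k, m, or g).
    return re.sub(r"-Xmx\d+[kKmMgG]?", f"-Xmx{new_xmx}", line)


if __name__ == "__main__":
    default_line = "JVM_ARGS: ${FILE_OBSERVER_JVM_ARGS:--Xms1G -Xmx2G}"
    print(set_max_heap(default_line))
```

Only the -Xmx value is changed; the initial heap size (-Xms) and the rest of the line are left as they are.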

Jenkins Observer troubleshooting

Artifactory integration: script approval
The first time you use the integration with Artifactory, your build may fail because the Artifactory API code being called has not yet been approved (whitelisted). In that case, the build log suggests that you approve the API code.
Workaround: Approve the scripts in Jenkins, on the Manage Jenkins > In-process script approval screen. After the scripts are approved, a 'No pending approvals' message is displayed.
Culprits: getting the expected username
Depending on your build configuration, you may get 'noreply' as the user in the culprits information reported by Jenkins.
To get the actual user ID as expected, you can modify your build configuration to make it use the actual author.
Workaround: Go to the Jenkins Pipeline tab inside the build configuration, select Add from the Additional Behaviours drop-down, then click Use commit author in changelog.
For more Jenkins-specific information on this issue and workaround, see the following location: https://issues.jenkins-ci.org/browse/JENKINS-38698
Git resources URLs
Topology tools expect that artifact properties contain a valid URL using HTTP rather than SSH.
If your current Jenkins pipeline performs the checkout operation using an SSH location (such as git@github.domain:org/repo.git), then the right-click links will not work.
Workaround: Modify your Jenkins pipeline to use HTTP checkout.
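To find the HTTP form of an existing SSH location, the URL can be rewritten mechanically. The following is an illustrative sketch only: the function name ssh_to_https is hypothetical, and it assumes that the Git server exposes the repository over HTTPS at the same host and path, which should be confirmed for your server.

```python
import re
from urllib.parse import urlsplit


def ssh_to_https(url):
    """Rewrite an SSH-style Git location as an HTTPS URL (same host and path assumed)."""
    if url.startswith(("http://", "https://")):
        return url  # Already an HTTP location; nothing to change.
    if url.startswith("ssh://"):
        # Form: ssh://user@host[:port]/org/repo.git
        parts = urlsplit(url)
        return f"https://{parts.hostname}{parts.path}"
    # scp-like form: [user@]host:org/repo.git
    match = re.match(r"^(?:[^@/]+@)?([^:/]+):(.+)$", url)
    if match is None:
        raise ValueError(f"Unrecognized Git location: {url}")
    host, path = match.groups()
    return f"https://{host}/{path.lstrip('/')}"


if __name__ == "__main__":
    print(ssh_to_https("git@github.domain:org/repo.git"))
```

With the example location from above, the function returns https://github.domain/org/repo.git, which can then be used as the checkout URL in the pipeline.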