Troubleshooting issues in installation

Known issues in the installation of Managed services.

  • If you have issues accessing the Managed services user interface after installation, check your firewall settings and turn the firewall off.
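
    For example, on a RHEL-based system that uses firewalld (an assumption; adjust the commands for your firewall):

    systemctl status firewalld
    systemctl stop firewalld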

  • If the Managed services entry is missing from the Automate Infrastructure navigation menu, a problem might have occurred during the Managed services installation.

    To resolve this issue:

    • Delete the Managed services installation instance and install it again.
    • After the installation succeeds and the Managed services entry is created in the Automate Infrastructure navigation menu, delete the Service library pods to restart them, as shown in the following example.
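
    A hedged one-liner for restarting those pods, assuming that the pod names contain service-library (the actual names can differ in your deployment):

    kubectl -n cp4aiops delete pod $(kubectl -n cp4aiops get pod | grep service-library | awk '{print $1}')
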
  • In some instances, the cam-tenant-api pod does not start even after 45 minutes and the error 'Failed to get IAM access token' is displayed. Restart the pod to resolve the issue.
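
    For example, you can restart the pod with a one-line command that is modeled on the command used later in this topic (assumes the cp4aiops namespace):

    kubectl -n cp4aiops delete pod $(kubectl -n cp4aiops get pod | grep cam-tenant-api | awk '{print $1}')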

  • Sometimes, even after the successful deployment of Managed services, the "cam-mongo" microservice might go down unexpectedly.

    Run the following command to check the pod status and events:

    kubectl describe pods -n cp4aiops
    

    If this command does not provide the details that you need to understand the issue, run the kubectl logs command with the --previous flag to get logs from the previously running container. For example, the kubectl -n cp4aiops logs --previous cam-mongo-5c89fcccbd-r2hv4 command results in the following output:

    exception in initAndListen: 98 Unable to lock file: /data/db/mongod.lock Resource temporarily unavailable. Is a mongod instance already running?, terminating

    Conclusion: While starting the container inside the cam-mongo pod, MongoDB was unable to use the existing /data/db/mongod.lock file. As a result, the pod does not come up, and you cannot access the CAM URL.
    

    To resolve the issue, do the following steps:

    1. Use the following pod creation YAML to spin up a container and mount the cam-mongo volume within it. The pod mounts the relevant PV, cam-mongo-pv, where /data/db/ is present.

      apiVersion: v1
      kind: Pod
      metadata:
        name: mongo-troubleshoot-pod
      spec:
        volumes:
          - name: cam-mongo-pv
            persistentVolumeClaim:
              claimName: cam-mongo-pv
        containers:
          - name: mongo-troubleshoot
            image: nginx
            ports:
              - containerPort: 80
                name: "http-server"
            volumeMounts:
              - mountPath: "/data/db"
                name: cam-mongo-pv
      
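      Create the pod from the saved YAML; for example, assuming that you saved it as mongo-troubleshoot-pod.yaml and that the cam-mongo PVC is in the cp4aiops namespace (the file name is illustrative):

      kubectl -n cp4aiops apply -f mongo-troubleshoot-pod.yaml
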
    2. Use kubectl -n cp4aiops exec -it mongo-troubleshoot-pod -- /bin/bash to keep stdin open and to allocate a terminal in the pod that you created. Run the following commands to remove the stale lock files:

      cd /data/db
      rm mongod.lock      # remove the stale MongoDB lock file
      rm WiredTiger.lock  # remove the stale WiredTiger storage-engine lock file
      
    3. Delete the pod that you created for troubleshooting.
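
      For example, assuming that the pod was created in the cp4aiops namespace:

      kubectl -n cp4aiops delete pod mongo-troubleshoot-pod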

    4. Run the following command to delete the corrupted cam-mongo pod so that it is re-created:

      kubectl -n cp4aiops delete pod <cam-mongo-podname>
      
  • Managed services container debugging (kubectl)

    When a container is not in a running state, run the following kubectl commands to describe the pods and persistent volumes and look for errors:

    kubectl -n cp4aiops get pod
    kubectl -n cp4aiops describe pod <podname>
    kubectl -n cp4aiops get pv
    kubectl -n cp4aiops describe pv <pvname>
    

    Look for events or error messages when you describe pods or persistent volumes that are not in a healthy state, for example, CrashLoopBackOff, Pending (for a while), or Init (for a while).
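
    If the describe output is inconclusive, the container logs, including the logs from a previous restart, often reveal the failure:

    kubectl -n cp4aiops logs <podname>
    kubectl -n cp4aiops logs --previous <podname>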

  • Run the following command to ensure that the PVs are created successfully:

    kubectl -n cp4aiops describe pv cam-mongo-pv
    

    If the PVs are not set up, complete the PV setup steps before you install Managed services.

    Note: PVs must be deleted and re-created every time that Managed services is installed.
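
    For example, to delete the MongoDB PV before a reinstallation (assuming the default PV name that is used earlier in this topic):

    kubectl delete pv cam-mongo-pv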

  • Managed services installation fails due to an incorrect Worker node architecture value.

    The installation fails with the following error message:

    Events:
      Type     Reason            Age                From               Message
      ----     ------            ----               ----               -------
      Warning  FailedScheduling  71s (x2 over 71s)  default-scheduler  0/1 nodes are available: 1 node(s) didn't match node selector.
    

    To resolve this issue, review the 'Worker node architecture' input parameter and check that the supported architecture, amd64, is selected or entered correctly.
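
    To verify the architecture that your worker nodes actually report, you can run a standard kubectl query such as the following:

    kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.architecture}{"\n"}{end}'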

  • Managed services reinstallation fails due to an existing cam-api-secret-gen-job that is left over from a prior Managed services installation. The job hangs indefinitely with the following error:

    Internal service error : rpc error: code = Unknown desc = jobs.batch "cam-api-secret-gen-job" already exists
    
    root@csz25087:~# kubectl -n cp4aiops get pods
    NAME                           READY   STATUS      RESTARTS   AGE
    cam-api-secret-gen-job-n5d87   0/1     Completed   0          24m
    

    To resolve this issue:

    1. Run the following command:

      kubectl -n cp4aiops delete job cam-api-secret-gen-job
      
    2. Install Managed services.
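
    After step 1, you can optionally confirm that the job is gone before you reinstall; the following command returns NotFound after a successful deletion:

    kubectl -n cp4aiops get job cam-api-secret-gen-job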

  • Managed services installation fails due to an existing template-crd-gen-job that is left over from a prior installation. You might see the job hang for ten minutes and then time out with the following error:

    Internal service error : rpc error: code = Unknown desc = jobs.batch "template-crd-gen-job" already exists
    
    root@csz25087:~# kubectl -n cp4aiops get pods
    NAME                         READY   STATUS             RESTARTS   AGE
    template-crd-gen-job-wm7mj   0/1     ImagePullBackOff   0          8m27s
    

    To resolve this issue:

    1. Run the following command:

      kubectl -n cp4aiops delete job template-crd-gen-job
      
    2. Install Managed services.

  • An error that contains the version string 3.6.0.0 (20220113_2156) x86_64 is encountered when you access the library page after you uninstall and reinstall Managed services.

    To resolve this issue, edit the cam-proxy-zen-extension configmap and increase the icpdata_addon_version value:

    oc -n cp4aiops edit configmap cam-proxy-zen-extension
    
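    To locate the current value before you edit it, a read-only check such as the following can help (it uses only the names that are given above):

    oc -n cp4aiops get configmap cam-proxy-zen-extension -o yaml | grep icpdata_addon_version
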
  • A Bad Gateway error appears on some Managed services pages after the installation of Infrastructure Automation.

    To resolve this issue, restart the cam-tenant-api pod:

    oc delete pod $(oc get pod|grep cam-tenant-api|awk '{print $1}')