Troubleshooting installation issues
Known issues in the installation of Cloud Automation Manager, including known issues with storage/persistent volume (PV) setup and NFS configuration in Cloud Automation Manager.
- The following sample error might occur when you set up a new Cloud Automation Manager environment with MongoDB and MariaDB pods:

      NETWORK [main] cannot read certificate file: /data/db/tls.pem error: 02001002:system library:fopen: No such file or directory
      CONTROL [main] Failed global initialization: InvalidSSLConfiguration: Can not set up PEM key file.
      ERROR: child process failed, exited with error number 1

  As a resolution, do the following steps (an example /etc/exports sketch follows these steps):

  - Open the NFS exports file (/etc/exports) in edit mode.
  - Remove fsid=0 from the base directory entry.
  - Add no_root_squash to all entries.
  - Run the following command to restart the NFS server:

        systemctl restart nfs-kernel-server.service

  - Run the following command to restart the pod:

        kubectl delete pod
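  For reference, a minimal sketch of what the edited export entry and the restart might look like; the /export path is only an illustration, so substitute your own NFS export directory.

  ```bash
  # Hypothetical /etc/exports entry after removing fsid=0 and adding no_root_squash:
  #   /export  *(rw,insecure,no_subtree_check,async,no_root_squash)
  cat /etc/exports                               # confirm the edited entries
  exportfs -ra                                   # re-export all directories
  systemctl restart nfs-kernel-server.service    # restart the NFS server
  ```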
- When the MariaDB pod starts, it creates the database needed by the Template Designer. It also creates a Template Designer user and password and grants that user access to the Template Designer database. When the cam-bpd-ui pod starts, it first creates tables in the database. If, for any reason, the database, user, or access grant was not set up correctly by the MariaDB pod, the table creation fails. The Template Designer cannot recover from this error and it stops. As a result, IBM Multicloud Manager marks the pod as unhealthy and kills the pod. The cam-bpd-ui pod starts again with the same results and the loop continues. Furthermore, the MariaDB pod shows as started and healthy even if it cannot create the Template Designer database. The possible solutions are as follows:
  - Check the PVs and ensure that they are set up correctly, are initialized, and have enough resources allocated. If you find a problem with the PV settings, start the Cloud Automation Manager installation with the updated PV settings. Sometimes, the PV might not be fully ready while the Template Designer database is being created. To work around a glusterfs issue, increase its resource allocation settings. You can change the default resource allocation for the glusterfs settings in the IBM Multicloud Manager config.yaml. For example:

        gluster:
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
        heketi:
          backupDbSecret: heketi-db-backup
          authSecret: heketi-secret
          maxInFlightOperations: 20
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi

  - Check the Template Designer database and verify whether it is set up correctly:

        # Get into the MariaDB pod
        kubectl exec -it -n services cam-bpd-mariadb-abc123 -- bash
        # Get the MariaDB root password
        env | grep MYSQL_ROOT_PASSWORD
        # Run the mysql command line tool
        mysql -u root -p<password_found_above>
        # for example: mysql -u root -pabc123
        # you might only need mysql without any credentials
        # Show the databases
        show databases;
        # Verify that the database ibm_ucdp exists. If it does, then:
        use ibm_ucdp;
        show tables;
        # Verify there are many tables (should show around 61)
        # Verify the user "ucdpadmin" exists
        SELECT User,Host FROM mysql.user;

  - If the database exists but does not have all of the tables, delete the database:

        drop database ibm_ucdp;

  - If the database does not exist or you just dropped it:

        # In the mysql command line, manually create the database, user, and grant access
        CREATE DATABASE ibm_ucdp;
        CREATE USER 'ucdpadmin'@'%' IDENTIFIED BY '<password_found_above>';
        GRANT ALL ON ibm_ucdp.* TO 'ucdpadmin'@'%';
        FLUSH PRIVILEGES;

  - If you manually set up the database, then restart the cam-bpd-ui pod, as shown in the sketch below.
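  A minimal way to restart the pod, assuming the standard cam-bpd-ui naming in the services namespace; deleting the pod lets its deployment recreate it.

  ```bash
  # Find the current cam-bpd-ui pod name, then delete it so the deployment recreates it
  kubectl -n services get pods | grep cam-bpd-ui
  kubectl -n services delete pod <cam-bpd-ui-pod-name>
  ```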
  - If the Template Designer database looks correct, then it might be that the designer is just slow to start and IBM Multicloud Manager is killing it before it has a chance to become fully ready. You can update the failure threshold values to avoid this, as shown in the sketch after these steps:

    - Log in to the IBM Multicloud Manager UI.
    - Go to Workloads > Deployments.
    - For the cam-bpd-ui deployment, select Edit in the overflow menu.
    - In the deployment editor, update the following values:
      - Under livenessProbe, change failureThreshold to 15.
      - Under readinessProbe, change failureThreshold to 15.
    - Click Submit. The pod gets restarted with these settings.
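  If you prefer the command line to the console editor, the following sketch makes the same change; it assumes the deployment is named cam-bpd-ui in the services namespace and that the probes are defined on the first container in the pod spec.

  ```bash
  # Open the deployment and raise both failureThreshold values to 15
  kubectl -n services edit deployment cam-bpd-ui

  # Or patch the thresholds directly (container index 0 is an assumption)
  kubectl -n services patch deployment cam-bpd-ui --type='json' -p='[
    {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/failureThreshold", "value": 15},
    {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/failureThreshold", "value": 15}
  ]'
  ```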
- When you use NFS on SoftLayer, add the following mount options because SoftLayer uses NFS version 4, whereas IBM Multicloud Manager and Kubernetes automatically use NFS 2 or 3:

      hard
      intr
      nfsvers=4

  An example YAML snippet:

      kind: PersistentVolume
      apiVersion: v1
      metadata:
        name: cam-bpd-appdata-pv
        labels:
          type: cam-bpd-appdata
      spec:
        capacity:
          storage: 20Gi
        accessModes:
          - ReadWriteMany
        mountOptions:
          - hard
          - intr
          - nfsvers=4
        nfs:
          server:
          path:
- To clean up a previously failed deployment, do the following steps:

  Note: In the following commands, replace <RELEASE_NAME> with the name you gave the Cloud Automation Manager deployment.

  - Run the following command to delete the failed helm release:

        helm del <RELEASE_NAME> --purge

  - Run the following command to ensure that all previous artifacts are removed:

        kubectl delete deploy,svc,secret,cm,ing,clusterservicebroker -l release=<RELEASE_NAME>

  - Check whether the required Cloud Automation Manager persistent volume claims are bound to the persistent volumes:

        kubectl get pvc -l release=<RELEASE_NAME>

  - Based on whether the 'STATUS' column is in the Bound state or not, run either of the following steps:

    a. If the 'STATUS' column for all of the Cloud Automation Manager persistent volume claims is in the 'Bound' state, then set the following chart parameters on the next Cloud Automation Manager helm install, as shown in the sketch after these steps:

        camMongoPV:
          persistence:
            existingClaimName: "cam-mongo-pv"
        camLogsPV:
          persistence:
            existingClaimName: "cam-logs-pv"
        camTerraformPV:
          persistence:
            existingClaimName: "cam-terraform-pv"
        camBPDAppDataPV:
          persistence:
            existingClaimName: "cam-bpd-appdata-pv"

    b. If the Cloud Automation Manager persistent volume claims are not in the 'Bound' state, then delete the claims with the following command:

        kubectl delete pvc -l release=<RELEASE_NAME>

      Create the persistent volumes again. For steps to create persistent volumes, see Creating Cloud Automation Manager persistent volumes and Creating Cloud Automation Manager persistent volumes using GlusterFS.
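  For illustration only, one way to pass those existing-claim parameters on the command line with --set instead of a values file; the chart reference, release name, and namespace are assumptions, so adjust them to match your environment.

  ```bash
  # Hypothetical helm install that reuses the already-bound persistent volume claims
  helm install <cam_chart> --name <RELEASE_NAME> --namespace services \
    --set camMongoPV.persistence.existingClaimName=cam-mongo-pv \
    --set camLogsPV.persistence.existingClaimName=cam-logs-pv \
    --set camTerraformPV.persistence.existingClaimName=cam-terraform-pv \
    --set camBPDAppDataPV.persistence.existingClaimName=cam-bpd-appdata-pv
  ```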
- When you run cloudctl catalog load-ppa-archive --archive <PPA archive> to register the Cloud Automation Manager offline PPA image into IBM Multicloud Manager, the following error message might be displayed:

      Returned status 400 Bad Request

  You can try the following resolutions:

  - The error might occur due to an incomplete download, so double-check the md5sum of the PPA archive file (see the sketch after this list).
  - Make sure that you are logged in to the IBM Multicloud Manager docker repository before you issue the cloudctl catalog load-ppa-archive command:

        docker login mycluster.icp:8500
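  As a quick sketch of the first resolution, compare the checksum of the downloaded archive with the value published for the PPA package; the exact expected value depends on your download source.

  ```bash
  # Compute the checksum of the downloaded PPA archive
  md5sum <PPA archive>

  # Compare the printed value with the checksum published for the package;
  # if they differ, download the archive again before retrying the load
  ```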
- If you face issues while accessing the Cloud Automation Manager user interface after the installation, check your firewall settings and turn the firewall off.
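  A minimal sketch for checking the firewall on a node, assuming a Linux host that uses firewalld; use the tooling that matches your distribution.

  ```bash
  # Check whether firewalld is running on the node (firewalld is an assumption)
  systemctl status firewalld

  # Stop it temporarily to confirm whether the firewall is blocking the UI
  systemctl stop firewalld
  ```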
- The Cloud Automation Manager broker does not support bind and unbind operations.
- When you install nfs-common on the worker node and containers are in the CrashLoopBackOff state, the NFS mount on the worker node fails with the following exception:

      cam-mongo pod is in CrashLoopBackoff
      kubectl logs cam-mongo-4205725084-wx86f
      exception in initAndListen: 98 Unable to lock file: /data/db/mongod.lock No locks available. Is a mongod instance already running?, terminating

  As a resolution, do the following steps (a verification sketch follows the list):

  - Ensure that permissions are correct for the cam-mongo-db volume.
  - If you have an NFS server, ensure that /etc/exports has /export *(rw,insecure,no_subtree_check,async,no_root_squash).
  - If you have NFS on a separate server, then check /etc/exports for fsid=0 and remove it.
  - Run the following command to restart the NFS server:

        systemctl restart nfs-kernel-server.service

  - Run the following command to restart the pod:

        kubectl delete pod
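  Before restarting the pod, a short check such as the following can confirm that the worker node can reach the export; the NFS server address is a placeholder and the /export path matches the entry above.

  ```bash
  # List the exports that the NFS server advertises (server address is a placeholder)
  showmount -e <nfs_server_ip>

  # Try the mount manually from the worker node to surface permission or version errors
  mkdir -p /mnt/nfs-test
  mount -t nfs <nfs_server_ip>:/export /mnt/nfs-test && ls -ld /mnt/nfs-test
  umount /mnt/nfs-test
  ```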
- Sometimes, even after a successful deployment of Cloud Automation Manager, the "cam-mongo" microservice might go down unexpectedly. Run the following command to check the pod details:

      kubectl describe pods -n services

  If this command does not provide the details that you need to understand the issue, run the kubectl logs command to get the logs from the previously running container. For example, the command

      kubectl -n services logs cam-mongo-5c89fcccbd-r2hv4 -p

  results in the following output:

      exception in initAndListen: 98 Unable to lock file: /data/db/mongod.lock Resource temporarily unavailable. Is a mongod instance already running?, terminating

  Conclusion: While starting the container inside the "cam-mongo" pod, it was unable to use the existing /data/db/mongod.lock file. As a result, the pod is not up and running and you cannot access the Cloud Automation Manager URL.

  Solution: To resolve the issue, do the following steps (a usage sketch follows the list):

  - Use the following pod creation YAML to spin up a container and mount the cam-mongo volume within it. It mounts the concerned PV, that is, cam-mongo-pv, where /data/db/ is present.

        apiVersion: v1
        kind: Pod
        metadata:
          name: mongo-troubleshoot-pod
        spec:
          volumes:
            - name: cam-mongo-pv
              persistentVolumeClaim:
                claimName: cam-mongo-pv
          containers:
            - name: mongo-troubleshoot
              image: nginx
              ports:
                - containerPort: 80
                  name: "http-server"
              volumeMounts:
                - mountPath: "/data/db"
                  name: cam-mongo-pv

  - Use docker exec -it <container_id> /bin/bash to keep stdin open and to allocate a terminal. Run the following commands:

        cd /data/db
        rm mongod.lock
        rm WiredTiger.lock

  - Kill the pod that you created for troubleshooting.
  - Run the following command to kill the corrupted cam-mongo pod:

        kubectl delete pods -n services
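  As a usage sketch for the steps above, assuming the pod YAML is saved as mongo-troubleshoot-pod.yaml (an illustrative file name):

  ```bash
  # Create the troubleshooting pod that mounts cam-mongo-pv
  kubectl -n services apply -f mongo-troubleshoot-pod.yaml

  # After clearing the lock files and deleting the corrupted pod,
  # watch the cam-mongo deployment bring up a fresh pod
  kubectl -n services get pods -w | grep cam-mongo
  ```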
- Cloud Automation Manager Container Debugging (kubectl)

  When a container is not in the running state, run the following kubectl commands to describe the pods and look for errors:

      kubectl -n services get pod
      kubectl -n services describe pod <podname>
      kubectl -n services get pv
      kubectl -n services describe pv <pvname>

  Look for events or error messages when describing the pods or persistent volumes that are not in healthy states. For example, CrashLoopBackoff, Pending (for a while), Init (for a while).

  - Run the following commands to ensure that the PVs are created successfully:

        kubectl -n services describe pv cam-mongo-pv
        kubectl -n services describe pv cam-logs-pv
        kubectl -n services describe pv cam-terraform-pv
        kubectl -n services describe pv cam-bpd-appdata-pv

    If the PVs are not set up, follow the PV setup steps before you install Cloud Automation Manager.

    Note: PVs must be deleted and recreated every time Cloud Automation Manager is installed.
- If the environment variable http_proxy is used on the master and you do a PPA upload on the master, then the helm chart does not load into the repository properly. As a workaround, either do the upload from an external machine or add mycluster and mycluster.icp to the NO_PROXY variable.
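  A minimal sketch of the second workaround, assuming the proxy variables are exported in the shell on the master before you run the upload:

  ```bash
  # Exclude the internal cluster host names from the proxy before uploading the PPA
  export NO_PROXY=$NO_PROXY,mycluster,mycluster.icp
  export no_proxy=$no_proxy,mycluster,mycluster.icp
  ```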
- Cloud Automation Manager re-install fails due to an already existing cam-api-secret-gen-job left from prior Cloud Automation Manager installations, and the job hangs indefinitely with the following error:

      Internal service error : rpc error: code = Unknown desc = jobs.batch "cam-api-secret-gen-job" already exists

      root@csz25087:~# kubectl -n services get pods
      NAME                           READY   STATUS      RESTARTS   AGE
      cam-api-secret-gen-job-n5d87   0/1     Completed   0          24m

  To resolve this issue:

  - Run the following command:

        kubectl -n services delete job cam-api-secret-gen-job

  - Install Cloud Automation Manager.
- Cloud Automation Manager installation fails due to an incorrect Worker node architecture value. The installation fails with the following error message:

      Events:
        Type     Reason            Age                From               Message
        ----     ------            ----               ----               -------
        Warning  FailedScheduling  71s (x2 over 71s)  default-scheduler  0/1 nodes are available: 1 node(s) didn't match node selector.

  To resolve this issue, review the input parameter for the 'Worker node architecture' and check that one of the supported architectures is selected or entered correctly (a node check sketch follows the list):

  - amd64
  - ppc64le
  - s390x
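  To see which architecture your worker nodes actually report, a quick check such as the following can help; it reads the standard node status field, so nothing Cloud Automation Manager specific is assumed.

  ```bash
  # Print the architecture reported by each node, for comparison with the chart parameter
  kubectl get nodes -o custom-columns=NAME:.metadata.name,ARCH:.status.nodeInfo.architecture
  ```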
- Cloud Automation Manager installation has all pods ready and running except 'cam-tenant-api', as indicated by its "0/1" non-ready status:

      cam-tenant-api-6fc4d4fcff-ltdvj   0/1   Running   0   122m

      Events:
        Type     Reason     Age                   From                   Message
        ----     ------     ----                  ----                   -------
        Warning  Unhealthy  86s (x375 over 126m)  kubelet, 9.30.255.175  Readiness probe failed: HTTP probe failed with statuscode: 404

  To confirm the issue, search for the following error messages:

  - Failed to get IAM access token
  - Provided API key could not be found
  - Error occurred while onboarding CAM into IAM

  The error messages can be found in the /[export folder]/CAM_logs/cam-tenant-api/ logs or in the output of the following command:

      kubectl -n services logs -f <tenant-pod-name>

      [2020-03-12T21:01:26.977] [INFO] init-platform-security - Onboarding CAM Service into ICP
      [2020-03-12T21:01:27.554] [ERROR] init-platform-security - Failed to get IAM access token. { statusCode: 400,
        body: '{"context":{"requestId":"f0cef458b1c6463cb6daed3597445d42","requestType":"incoming.OIDC_Token","userAgent":"NotSet","clientIp":"9.30.255.175","instanceId":"NotSet/999999","threadId":"ac73","host":"auth-idp-kngmb","startTime":"12.03.2020 21:01:14:952 UTC","endTime":"12.03.2020 21:01:15:176 UTC","elapsedTime":"224","locale":"c.u_US"},"errorCode":"BXNIM0415E","errorMessage":"Provided API key could not be found","errorDetails":"BXNIM0102E: Unable to find Object. Object Type: \'ApiKey\' with ID: \'JZWSw9rwauvmAKVcuzYdY09hzO51jCyGHci-12yb0kje\' not found."}' }
      [2020-03-12T21:01:27.557] [ERROR] init-platform-security - Error occurred while onboarding CAM into IAM. { Error: [object Object]
        at /usr/src/app/lib/icp/platform-security.js:319:19

  This issue is resolved by deleting the invalid API key from the MCM console and creating a new API key.
- Cloud Automation Manager installation fails with the following error message when the service catalog is not installed on OpenShift Container Platform 4.x:

      Error: validation failed: unable to recognize "": no matches for kind "ClusterServiceBroker" in version "servicecatalog.k8s.io/v1beta1"

  The service catalog for OpenShift Container Platform is not installed by default. To resolve the issue, see Install the service catalog for OpenShift Container Platform.
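  As a quick sketch to confirm whether the service catalog API is present before you retry the installation (standard Kubernetes discovery, nothing Cloud Automation Manager specific):

  ```bash
  # If the service catalog is installed, this lists resources such as clusterservicebrokers;
  # empty output means the servicecatalog.k8s.io API group is missing
  kubectl api-resources --api-group=servicecatalog.k8s.io
  ```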