Troubleshooting installation issues
Known issues in the installation of Cloud Automation Manager, including known issues with storage/persistent volume (PV) setup and NFS configuration in Cloud Automation Manager.
- The following sample error might occur when you set up a new Cloud Automation Manager environment with MongoDB and MariaDB pods:

      NETWORK [main] cannot read certificate file: /data/db/tls.pem error: 02001002:system library:fopen: No such file or directory
      CONTROL [main] Failed global initialization: InvalidSSLConfiguration: Can not set up PEM key file.
      ERROR: child process failed, exited with error number 1

  As a resolution, do the following steps (an example /etc/exports sketch follows these steps):

  - Open the NFS exports file (/etc/exports) in edit mode.
  - Remove fsid=0 from the base directory entry.
  - Add no_root_squash to all entries.
  - Run the following command to restart the NFS server:

        systemctl restart nfs-kernel-server.service

  - Run the following command to restart the pod:

        kubectl delete pod
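  For reference, a minimal sketch of what the edited export entry and the restart might look like; the /export path is only an illustration, so substitute your own NFS export directory.

  ```bash
  # Hypothetical /etc/exports entry after removing fsid=0 and adding no_root_squash:
  #   /export  *(rw,insecure,no_subtree_check,async,no_root_squash)
  cat /etc/exports                               # confirm the edited entries
  exportfs -ra                                   # re-export all directories
  systemctl restart nfs-kernel-server.service    # restart the NFS server
  ```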
- When the MariaDB pod starts, it creates the database needed by the Template Designer. It also creates a Template Designer user and password and grants that user access to the Template Designer database. When the cam-bpd-ui pod starts, it first creates tables in the database. If, for any reason, the database, user, or access grant was not set up correctly by the MariaDB pod, the table creation fails. The Template Designer cannot recover from this error and it stops. As a result, IBM Multicloud Manager marks the pod as unhealthy and kills the pod. The cam-bpd-ui pod starts again with the same results and the loop continues. Furthermore, the MariaDB pod shows as started and healthy even if it cannot create the Template Designer database. The possible solutions are as follows:
  - Check the PVs and ensure that they are set up correctly, are initialized, and have enough resources allocated. If you find a problem with the PV settings, start the Cloud Automation Manager installation with the updated PV settings. Sometimes, the PV might not be fully ready while the Template Designer database is being created. To work around a glusterfs issue, increase its resource allocation settings. You can change the default resource allocation for the glusterfs settings in the IBM Multicloud Manager config.yaml. For example:

        gluster:
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
        heketi:
          backupDbSecret: heketi-db-backup
          authSecret: heketi-secret
          maxInFlightOperations: 20
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi

  - Check the Template Designer database and verify whether it is set up correctly:

        # Get into the MariaDB pod
        kubectl exec -it -n services cam-bpd-mariadb-abc123 -- bash
        # Get the MariaDB root password
        env | grep MYSQL_ROOT_PASSWORD
        # Run the mysql command line tool
        mysql -u root -p<password_found_above>
        # for example: mysql -u root -pabc123
        # you might only need mysql without any credentials
        # Show the databases
        show databases;
        # Verify that the database ibm_ucdp exists. If it does, then:
        use ibm_ucdp;
        show tables;
        # Verify there are many tables (should show around 61)
        # Verify the user "ucdpadmin" exists
        SELECT User,Host FROM mysql.user;

  - If the database exists but does not have all of the tables, delete the database:

        drop database ibm_ucdp;

  - If the database does not exist or you just dropped it:

        # In the mysql command line, manually create the database, user, and grant access
        CREATE DATABASE ibm_ucdp;
        CREATE USER 'ucdpadmin'@'%' IDENTIFIED BY '<password_found_above>';
        GRANT ALL ON ibm_ucdp.* TO 'ucdpadmin'@'%';
        FLUSH PRIVILEGES;

  - If you manually set up the database, then restart the cam-bpd-ui pod, as shown in the sketch below.
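  A minimal way to restart the pod, assuming the standard cam-bpd-ui naming in the services namespace; deleting the pod lets its deployment recreate it.

  ```bash
  # Find the current cam-bpd-ui pod name, then delete it so the deployment recreates it
  kubectl -n services get pods | grep cam-bpd-ui
  kubectl -n services delete pod <cam-bpd-ui-pod-name>
  ```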
  - If the Template Designer database looks correct, then it might be that the designer is just slow to start and IBM Multicloud Manager is killing it before it has a chance to become fully ready. You can update the failure threshold values to avoid this, as shown in the sketch after these steps:

    - Log in to the IBM Multicloud Manager UI.
    - Go to Workloads > Deployments.
    - For the cam-bpd-ui deployment, select Edit in the overflow menu.
    - In the deployment editor, update the following values:
      - Under livenessProbe, change failureThreshold to 15.
      - Under readinessProbe, change failureThreshold to 15.
    - Click Submit. The pod gets restarted with these settings.
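  If you prefer the command line to the console editor, the following sketch makes the same change; it assumes the deployment is named cam-bpd-ui in the services namespace and that the probes are defined on the first container in the pod spec.

  ```bash
  # Open the deployment and raise both failureThreshold values to 15
  kubectl -n services edit deployment cam-bpd-ui

  # Or patch the thresholds directly (container index 0 is an assumption)
  kubectl -n services patch deployment cam-bpd-ui --type='json' -p='[
    {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/failureThreshold", "value": 15},
    {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/failureThreshold", "value": 15}
  ]'
  ```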
- When you use NFS on SoftLayer, add the following mount options because SoftLayer uses NFS version 4, whereas IBM Multicloud Manager and Kubernetes automatically use NFS 2 or 3:

      hard
      intr
      nfsvers=4

  An example YAML snippet:

      kind: PersistentVolume
      apiVersion: v1
      metadata:
        name: cam-bpd-appdata-pv
        labels:
          type: cam-bpd-appdata
      spec:
        capacity:
          storage: 20Gi
        accessModes:
          - ReadWriteMany
        mountOptions:
          - hard
          - intr
          - nfsvers=4
        nfs:
          server:
          path:
- To clean up a previously failed deployment, do the following steps:

  Note: In the following commands, replace <RELEASE_NAME> with the name you gave the Cloud Automation Manager deployment.

  - Run the following command to delete the failed helm release:

        helm del <RELEASE_NAME> --purge

  - Run the following command to ensure that all previous artifacts are removed:

        kubectl delete deploy,svc,secret,cm,ing,clusterservicebroker -l release=<RELEASE_NAME>

  - Check whether the required Cloud Automation Manager persistent volume claims are bound to the persistent volumes:

        kubectl get pvc -l release=<RELEASE_NAME>

  - Based on whether the 'STATUS' column is in the Bound state or not, run either of the following steps:

    a. If the 'STATUS' column for all of the Cloud Automation Manager persistent volume claims is in the 'Bound' state, then set the following chart parameters on the next Cloud Automation Manager helm install, as shown in the sketch after these steps:

        camMongoPV:
          persistence:
            existingClaimName: "cam-mongo-pv"
        camLogsPV:
          persistence:
            existingClaimName: "cam-logs-pv"
        camTerraformPV:
          persistence:
            existingClaimName: "cam-terraform-pv"
        camBPDAppDataPV:
          persistence:
            existingClaimName: "cam-bpd-appdata-pv"

    b. If the Cloud Automation Manager persistent volume claims are not in the 'Bound' state, then delete the claims with the following command:

        kubectl delete pvc -l release=<RELEASE_NAME>

      Create the persistent volumes again. For steps to create persistent volumes, see Creating Cloud Automation Manager persistent volumes and Creating Cloud Automation Manager persistent volumes using GlusterFS.
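  For illustration only, one way to pass those existing-claim parameters on the command line with --set instead of a values file; the chart reference, release name, and namespace are assumptions, so adjust them to match your environment.

  ```bash
  # Hypothetical helm install that reuses the already-bound persistent volume claims
  helm install <cam_chart> --name <RELEASE_NAME> --namespace services \
    --set camMongoPV.persistence.existingClaimName=cam-mongo-pv \
    --set camLogsPV.persistence.existingClaimName=cam-logs-pv \
    --set camTerraformPV.persistence.existingClaimName=cam-terraform-pv \
    --set camBPDAppDataPV.persistence.existingClaimName=cam-bpd-appdata-pv
  ```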
- When you run cloudctl catalog load-ppa-archive --archive <PPA archive> to register the Cloud Automation Manager offline PPA image into IBM Multicloud Manager, the following error message might be displayed:

      Returned status 400 Bad Request

  You can try the following resolutions:

  - The error might occur due to an incomplete download, so double-check the md5sum of the PPA archive file (see the sketch after this list).
  - Make sure that you are logged in to the IBM Multicloud Manager docker repository before you issue the cloudctl catalog load-ppa-archive command:

        docker login mycluster.icp:8500
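  As a quick sketch of the first resolution, compare the checksum of the downloaded archive with the value published for the PPA package; the exact expected value depends on your download source.

  ```bash
  # Compute the checksum of the downloaded PPA archive
  md5sum <PPA archive>

  # Compare the printed value with the checksum published for the package;
  # if they differ, download the archive again before retrying the load
  ```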
- If you face issues while accessing the Cloud Automation Manager user interface after the installation, check your firewall settings and turn the firewall off.
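  A minimal sketch for checking the firewall on a node, assuming a Linux host that uses firewalld; use the tooling that matches your distribution.

  ```bash
  # Check whether firewalld is running on the node (firewalld is an assumption)
  systemctl status firewalld

  # Stop it temporarily to confirm whether the firewall is blocking the UI
  systemctl stop firewalld
  ```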
- The Cloud Automation Manager broker does not support bind and unbind operations.
- When you install nfs-common on the worker node and containers are in the CrashLoopBackOff state, the NFS mount on the worker node fails with the following exception:

      cam-mongo pod is in CrashLoopBackoff
      kubectl logs cam-mongo-4205725084-wx86f
      exception in initAndListen: 98 Unable to lock file: /data/db/mongod.lock No locks available. Is a mongod instance already running?, terminating

  As a resolution, do the following steps (a verification sketch follows the list):

  - Ensure that permissions are correct for the cam-mongo-db volume.
  - If you have an NFS server, ensure that /etc/exports has /export *(rw,insecure,no_subtree_check,async,no_root_squash).
  - If you have NFS on a separate server, then check /etc/exports for fsid=0 and remove it.
  - Run the following command to restart the NFS server:

        systemctl restart nfs-kernel-server.service

  - Run the following command to restart the pod:

        kubectl delete pod
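  Before restarting the pod, a short check such as the following can confirm that the worker node can reach the export; the NFS server address is a placeholder and the /export path matches the entry above.

  ```bash
  # List the exports that the NFS server advertises (server address is a placeholder)
  showmount -e <nfs_server_ip>

  # Try the mount manually from the worker node to surface permission or version errors
  mkdir -p /mnt/nfs-test
  mount -t nfs <nfs_server_ip>:/export /mnt/nfs-test && ls -ld /mnt/nfs-test
  umount /mnt/nfs-test
  ```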
- Sometimes, even after a successful deployment of Cloud Automation Manager, the "cam-mongo" microservice might go down unexpectedly. Run the following command to check the pod details:

      kubectl describe pods -n services

  If this command does not provide the details that you need to understand the issue, run the kubectl logs command to get the logs from the previously running container. For example, the command

      kubectl -n services logs cam-mongo-5c89fcccbd-r2hv4 -p

  results in the following output:

      exception in initAndListen: 98 Unable to lock file: /data/db/mongod.lock Resource temporarily unavailable. Is a mongod instance already running?, terminating

  Conclusion: While starting the container inside the "cam-mongo" pod, it was unable to use the existing /data/db/mongod.lock file. As a result, the pod is not up and running and you cannot access the Cloud Automation Manager URL.

  Solution: To resolve the issue, do the following steps (a usage sketch follows the list):

  - Use the following pod creation YAML to spin up a container and mount the cam-mongo volume within it. It mounts the concerned PV, that is, cam-mongo-pv, where /data/db/ is present.

        apiVersion: v1
        kind: Pod
        metadata:
          name: mongo-troubleshoot-pod
        spec:
          volumes:
            - name: cam-mongo-pv
              persistentVolumeClaim:
                claimName: cam-mongo-pv
          containers:
            - name: mongo-troubleshoot
              image: nginx
              ports:
                - containerPort: 80
                  name: "http-server"
              volumeMounts:
                - mountPath: "/data/db"
                  name: cam-mongo-pv

  - Use docker exec -it <container_id> /bin/bash to keep stdin open and to allocate a terminal. Run the following commands:

        cd /data/db
        rm mongod.lock
        rm WiredTiger.lock

  - Kill the pod that you created for troubleshooting.
  - Run the following command to kill the corrupted cam-mongo pod:

        kubectl delete pods -n services
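  As a usage sketch for the steps above, assuming the pod YAML is saved as mongo-troubleshoot-pod.yaml (an illustrative file name):

  ```bash
  # Create the troubleshooting pod that mounts cam-mongo-pv
  kubectl -n services apply -f mongo-troubleshoot-pod.yaml

  # After clearing the lock files and deleting the corrupted pod,
  # watch the cam-mongo deployment bring up a fresh pod
  kubectl -n services get pods -w | grep cam-mongo
  ```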
- Cloud Automation Manager Container Debugging (kubectl)

  When a container is not in the running state, run the following kubectl commands to describe the pods and look for errors:

      kubectl -n services get pod
      kubectl -n services describe pod <podname>
      kubectl -n services get pv
      kubectl -n services describe pv <pvname>

  Look for events or error messages when describing the pods or persistent volumes that are not in healthy states. For example, CrashLoopBackoff, Pending (for a while), Init (for a while).

  - Run the following commands to ensure that the PVs are created successfully:

        kubectl -n services describe pv cam-mongo-pv
        kubectl -n services describe pv cam-logs-pv
        kubectl -n services describe pv cam-terraform-pv
        kubectl -n services describe pv cam-bpd-appdata-pv

    If the PVs are not set up, follow the PV setup steps before you install Cloud Automation Manager.

    Note: PVs must be deleted and recreated every time Cloud Automation Manager is installed.
- If the environment variable http_proxy is used on the master and you do a PPA upload on the master, then the helm chart does not load into the repository properly. As a workaround, either do the upload from an external machine or add mycluster and mycluster.icp to the NO_PROXY variable.
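  A minimal sketch of the second workaround, assuming the proxy variables are exported in the shell on the master before you run the upload:

  ```bash
  # Exclude the internal cluster host names from the proxy before uploading the PPA
  export NO_PROXY=$NO_PROXY,mycluster,mycluster.icp
  export no_proxy=$no_proxy,mycluster,mycluster.icp
  ```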
- Cloud Automation Manager re-install fails due to an already existing cam-api-secret-gen-job left from prior Cloud Automation Manager installations, and the job hangs indefinitely with the following error:

      Internal service error : rpc error: code = Unknown desc = jobs.batch "cam-api-secret-gen-job" already exists

      root@csz25087:~# kubectl -n services get pods
      NAME                           READY   STATUS      RESTARTS   AGE
      cam-api-secret-gen-job-n5d87   0/1     Completed   0          24m

  To resolve this issue:

  - Run the following command:

        kubectl -n services delete job cam-api-secret-gen-job

  - Install Cloud Automation Manager.
- Cloud Automation Manager installation fails due to an incorrect Worker node architecture value. The installation fails with the following error message:

      Events:
        Type     Reason            Age                From               Message
        ----     ------            ----               ----               -------
        Warning  FailedScheduling  71s (x2 over 71s)  default-scheduler  0/1 nodes are available: 1 node(s) didn't match node selector.

  To resolve this issue, review the input parameter for the 'Worker node architecture' and check that one of the supported architectures is selected or entered correctly (a node check sketch follows the list):

  - amd64
  - ppc64le
  - s390x
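  To see which architecture your worker nodes actually report, a quick check such as the following can help; it reads the standard node status field, so nothing Cloud Automation Manager specific is assumed.

  ```bash
  # Print the architecture reported by each node, for comparison with the chart parameter
  kubectl get nodes -o custom-columns=NAME:.metadata.name,ARCH:.status.nodeInfo.architecture
  ```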
- Cloud Automation Manager installation has all pods ready and running except 'cam-tenant-api', as indicated by its "0/1" non-ready status:

      cam-tenant-api-6fc4d4fcff-ltdvj   0/1   Running   0   122m

      Events:
        Type     Reason     Age                   From                   Message
        ----     ------     ----                  ----                   -------
        Warning  Unhealthy  86s (x375 over 126m)  kubelet, 9.30.255.175  Readiness probe failed: HTTP probe failed with statuscode: 404

  To confirm the issue, search for the following error messages:

  - Failed to get IAM access token
  - Provided API key could not be found
  - Error occurred while onboarding CAM into IAM

  The error messages can be found in the /[export folder]/CAM_logs/cam-tenant-api/ logs or in the output of the following command:

      kubectl -n services logs -f <tenant-pod-name>

      [2020-03-12T21:01:26.977] [INFO] init-platform-security - Onboarding CAM Service into ICP
      [2020-03-12T21:01:27.554] [ERROR] init-platform-security - Failed to get IAM access token. { statusCode: 400,
        body: '{"context":{"requestId":"f0cef458b1c6463cb6daed3597445d42","requestType":"incoming.OIDC_Token","userAgent":"NotSet","clientIp":"9.30.255.175","instanceId":"NotSet/999999","threadId":"ac73","host":"auth-idp-kngmb","startTime":"12.03.2020 21:01:14:952 UTC","endTime":"12.03.2020 21:01:15:176 UTC","elapsedTime":"224","locale":"c.u_US"},"errorCode":"BXNIM0415E","errorMessage":"Provided API key could not be found","errorDetails":"BXNIM0102E: Unable to find Object. Object Type: \'ApiKey\' with ID: \'JZWSw9rwauvmAKVcuzYdY09hzO51jCyGHci-12yb0kje\' not found."}' }
      [2020-03-12T21:01:27.557] [ERROR] init-platform-security - Error occurred while onboarding CAM into IAM. { Error: [object Object]
        at /usr/src/app/lib/icp/platform-security.js:319:19

  This issue is resolved by deleting the invalid API key from the MCM console and creating a new API key.
- Cloud Automation Manager installation fails with the following error message when the service catalog is not installed on OpenShift Container Platform 4.x:

      Error: validation failed: unable to recognize "": no matches for kind "ClusterServiceBroker" in version "servicecatalog.k8s.io/v1beta1"

  The service catalog for OpenShift Container Platform is not installed by default. To resolve the issue, see Install the service catalog for OpenShift Container Platform.
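  As a quick sketch to confirm whether the service catalog API is present before you retry the installation (standard Kubernetes discovery, nothing Cloud Automation Manager specific):

  ```bash
  # If the service catalog is installed, this lists resources such as clusterservicebrokers;
  # empty output means the servicecatalog.k8s.io API group is missing
  kubectl api-resources --api-group=servicecatalog.k8s.io
  ```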