Troubleshooting installing and uninstalling Infrastructure Automation
Review frequently encountered issues related to installing, upgrading and uninstalling Infrastructure Automation.
Installation issues
- Database is stuck on retrying on Infrastructure Management IMInstall Power Environment
- The cam-tenant-api pod is not in a ready state after installing the IAConfig CR
- Offline install or upgrade throws 'invalid character' error
- Offline install or upgrade stuck because the 'oc ibm-pak generate mirror-manifests' command fails with 'no space left on device'
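Uninstall issues
- Uninstall fails to remove Infrastructure Management cleanly after deleting IAConfig instance
- Uninstall hangs when uninstalling Managed services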
Troubleshooting installation
Database is stuck on retrying on Infrastructure Management IMInstall Power Environment
Sometimes when Infrastructure Management is deployed, the orchestrator pod fails to start properly and the oc logs output for the pod contains the following message:
Cannot connect to the database!
Deployment status is
Check that the postgresql pod is running and listening. Then verify that the vmdb_production database was created by opening a remote shell (oc rsh) in the postgresql pod:
sh-4.4$ psql -U postgres
psql (10.19)
Type "help" for help.
postgres=# \l
                                 List of databases
   Name    |  Owner   | Encoding |  Collate   |   Ctype    |   Access privileges
-----------+----------+----------+------------+------------+-----------------------
 postgres  | postgres | UTF8     | en_US.utf8 | en_US.utf8 |
 template0 | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +
           |          |          |            |            | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +
           |          |          |            |            | postgres=CTc/postgres
(3 rows)
Solution: If the vmdb_production database does not exist, as in the preceding output, a leftover postgresql.conf file in the data directory might be preventing the postgresql image from initializing the new database on boot.
Check the data directory and either clean it manually or re-create the PVC/PV and confirm that the new volume is empty.
To find the location of the data directory, run the following psql command:
postgres=# show data_directory;
data_directory
------------------------------
/var/lib/pgsql/data/userdata
(1 row)
Ensure that the data directory is empty before the postgresql pod is restarted.
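The database check can also be scripted. The following is a minimal sketch, assuming the databases are listed with psql -U postgres -lqt (list, quiet, tuples only). The helper name has_database is hypothetical, and the sample text only imitates the listing shown earlier.

```shell
# Hypothetical helper: succeeds if the database named in $1 appears in the
# first column of `psql -lqt` output read from stdin.
has_database() {
  cut -d '|' -f 1 | tr -d ' ' | grep -qxF -- "$1"
}

# Sample listing that imitates the output shown earlier (no vmdb_production row).
sample=' postgres  | postgres | UTF8
 template0 | postgres | UTF8
 template1 | postgres | UTF8'

if printf '%s\n' "$sample" | has_database vmdb_production; then
  echo "vmdb_production exists"
else
  echo "vmdb_production is missing"
fi
```

Against a live cluster, the same helper could be fed from the pod, for example: oc -n <project> exec <postgresql-pod> -- psql -U postgres -lqt | has_database vmdb_production, where <project> and <postgresql-pod> are placeholders for your environment.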
The cam-tenant-api pod is not in a ready state after installing the IAConfig CR
After you install Infrastructure Automation, the cam-tenant-api pod can display as running but not reach a ready state. When this error occurs, the following message is logged:
[ERROR] init-platform-security - >>>>>>>>>> Failed to configure Platform Security. Will retry in 60 seconds <<<<<<<<<<<<< OperationalError: [object Object]
If this error occurs, delete the cam-tenant-api pod. Deleting the pod causes it to restart and attempt to enter a ready state again.
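To find the affected pod before deleting it, the READY column of the pod listing can be filtered. The following is a sketch only, assuming the pod name starts with cam-tenant-api and that the listing comes from oc get pods --no-headers. The helper name and the pod names in the example are made up for illustration.

```shell
# Hypothetical helper: reads `oc get pods --no-headers` output on stdin and
# prints cam-tenant-api pods whose READY column (for example 0/1) shows fewer
# ready containers than the total.
not_ready_tenant_pods() {
  awk '$1 ~ /^cam-tenant-api/ { split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# Example with made-up pod names.
printf '%s\n' \
  'cam-tenant-api-abc12   0/1   Running   0   10m' \
  'other-pod-def34        1/1   Running   0   10m' |
  not_ready_tenant_pods
```

In a cluster, the input would come from oc -n <project> get pods --no-headers, and each printed pod could then be removed with oc -n <project> delete pod <pod-name>.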
Offline install or upgrade throws 'invalid character' error
During an offline install or upgrade, the oc ibm-pak generate mirror-manifests <..> command throws an error similar to the following:
Error: failed to load the catalog FBC at cp.stg.icr.io/cp/ <...> invalid character '<' in string escape code
Solution: You must have IBM Catalog Management Plug-in for IBM Cloud Pak (ibm-pak-plugin) v1.10.0 or higher installed. Complete the following steps to ensure that ibm-pak-plugin is at the required level.
- Check which version of ibm-pak-plugin you have installed. Run the following command on your bastion host, portable compute device, or connected compute device if you are using a portable storage device.
  oc ibm-pak --version
  Example output:
  oc ibm-pak --version
  v1.11.0
- If the ibm-pak-plugin version is lower than v1.10.0, you must download and install the most recent version. Follow the steps for your installation approach:
  - Bastion host: Install the IBM Catalog Management Plug-in for IBM Cloud Pak®.
  - Portable device: Install the IBM Catalog Management Plug-in for IBM Cloud Pak®.
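The version comparison can be automated. The following is a minimal sketch, assuming that sort -V (version sort from GNU coreutils) is available. The helper name version_at_least is hypothetical, and the versions in the loop are sample values.

```shell
# Hypothetical helper: succeeds if version $1 is at least version $2.
# A leading "v" is stripped, then `sort -V` decides which version is lower.
version_at_least() {
  local have="${1#v}" want="${2#v}"
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

# Sample versions checked against the required minimum.
for v in v1.9.2 v1.10.0 v1.11.0; do
  if version_at_least "$v" v1.10.0; then
    echo "$v: ok"
  else
    echo "$v: upgrade ibm-pak-plugin"
  fi
done
```

A version sort is used because a plain string comparison would rank v1.9.2 above v1.10.0. If oc ibm-pak --version prints only the version string, as in the example output, the check could be written as version_at_least "$(oc ibm-pak --version)" v1.10.0.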
Offline install or upgrade stuck because the 'oc ibm-pak generate mirror-manifests' command fails with 'no space left on device'
The oc ibm-pak generate mirror-manifests $IA_CASE_NAME $TARGET_REGISTRY --version $IA_CASE_VERSION command fails with a message similar to the following in $IBMPAK_HOME/.ibm-pak/logs/oc-ibm_pak.log:
write /tmp/render-unpack-4002583241/var/lib/rpm/Packages: no space left on device
Solution: The default temporary directory does not have enough space to run the ibm-pak tool. Set the TMPDIR environment variable to a directory with more space before you run the oc ibm-pak generate mirror-manifests command.
TMPDIR=<new_temp_dir> oc ibm-pak generate mirror-manifests $IA_CASE_NAME $TARGET_REGISTRY --version $IA_CASE_VERSION
Where <new_temp_dir> is the path to a directory with more space.
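Before choosing a new temporary directory, it can help to compare the free space of the candidates. The following is a sketch under stated assumptions: avail_kb is a hypothetical helper, NEW_TEMP_DIR is a placeholder variable that defaults to the current directory, and no specific space requirement is implied because the source does not give one.

```shell
# Hypothetical helper: prints the available space, in KiB, on the file system
# that holds directory $1 (fourth column of POSIX `df -P` output).
avail_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Compare the default temporary directory with a candidate replacement.
# NEW_TEMP_DIR is a placeholder; it defaults to the current directory here.
default_tmp="${TMPDIR:-/tmp}"
candidate="${NEW_TEMP_DIR:-.}"
echo "$default_tmp: $(avail_kb "$default_tmp") KiB available"
echo "$candidate: $(avail_kb "$candidate") KiB available"
```

The directory with the larger value is the better choice for TMPDIR when you rerun the oc ibm-pak generate mirror-manifests command.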
Troubleshooting uninstall
Uninstall fails to remove Infrastructure Management cleanly after deleting IAConfig instance
After you delete the Infrastructure Management IAConfig CR from the Red Hat OpenShift Container Platform console to uninstall Infrastructure Management, some Infrastructure Management pods can remain in the namespace. To work around the problem, delete the IMInstall custom resource from the Red Hat OpenShift Container Platform console. This removes the pods from the namespace.
Also delete the clients.oidc.security.ibm.com custom resource that was used by Infrastructure Management if it still exists. Then, you can uninstall the operators that remain in the namespace.
To delete the IMInstall custom resource (CR), first edit the CR to remove the finalizer and save the CR. Then delete the CR. Run the following commands, or complete the same steps from the Red Hat OpenShift Container Platform console.
- Identify the CR name, for example, im-iminstall:
  oc -n <project> get iminstall
- Edit the CR. Delete the statement "finalizers:" and the line below it. Save and exit.
  oc -n <project> edit iminstall im-iminstall
- Delete the CR.
  oc -n <project> delete iminstall im-iminstall
Where <project> is the project (namespace) where Infrastructure Automation is deployed.
After a few minutes, the Infrastructure Management pods should no longer exist. The clients.oidc.security.ibm.com custom resource (CR) might still exist. To delete it, complete the following steps:
- Check whether the CR exists.
  oc -n <project> get clients.oidc.security.ibm.com
- If the CR exists, edit the CR. Delete the statement "finalizers:" and the line below it. Save and exit.
  oc -n <project> edit clients.oidc.security.ibm.com ibm-infra-management-application-client
- Delete the CR.
  oc -n <project> delete clients.oidc.security.ibm.com ibm-infra-management-application-client
Where <project> is the project (namespace) where Infrastructure Automation is deployed.
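As a non-interactive alternative to editing each CR by hand, the finalizers can be cleared with a merge patch. This is a sketch, not the documented procedure: the helper name is hypothetical, and it assumes that you are logged in with oc and have permission to patch and delete the resources.

```shell
# Hypothetical helper: clears metadata.finalizers on a resource with a merge
# patch, then deletes the resource.
# Arguments: project (namespace), resource type, resource name.
remove_finalizers_and_delete() {
  local project="$1" kind="$2" name="$3"
  oc -n "$project" patch "$kind" "$name" --type=merge \
    -p '{"metadata":{"finalizers":null}}' &&
    oc -n "$project" delete "$kind" "$name" --ignore-not-found
}

# Usage (requires a logged-in oc session; <project> is a placeholder):
#   remove_finalizers_and_delete <project> iminstall im-iminstall
#   remove_finalizers_and_delete <project> clients.oidc.security.ibm.com \
#     ibm-infra-management-application-client
```

The delete call uses --ignore-not-found because a resource that is already marked for deletion can disappear as soon as its finalizers are cleared.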
Uninstall hangs when uninstalling Managed services
The Infrastructure Automation uninstallation sometimes hangs when trying to delete the manageservice instance.
Describing the Managed services instance shows output similar to the following:
# oc describe manageservice cam
Name: cam
Namespace: cp4aiops
Labels: operator.ibm.com/opreq-control=true
Annotations: operator-sdk/primary-resource: /cam-services-sa-csb-patch-rb-pod
operator-sdk/primary-resource-type: ClusterRoleBinding.rbac.authorization.k8s.io
API Version: cam.management.ibm.com/v1alpha1
Kind: ManageService
Metadata:
Creation Timestamp: 2024-03-05T21:46:56Z
Deletion Grace Period Seconds: 0
Deletion Timestamp: 2024-04-15T01:11:29Z
Finalizers:
helm.sdk.operatorframework.io/uninstall-release
Generation: 2
Resource Version: 55077252
UID: 87e6b8cf-3123-4914-bf64-a2b209cbe22d
Solution: Edit the Managed services instance to remove the finalizer entry.
- Run the following command:
  oc edit ManageService cam
- Delete the following lines, and then save your changes.
  finalizers:
  - helm.sdk.operatorframework.io/uninstall-release
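To confirm that a finalizer is what blocks the deletion, the relevant metadata fields can be printed without reading the full describe output. The following is a minimal sketch. The helper name is hypothetical, and it assumes that oc is logged in and pointed at the namespace that contains the instance.

```shell
# Hypothetical helper: prints the deletion timestamp and the finalizers of the
# ManageService instance named in $1, one per line.
show_pending_finalizers() {
  oc get manageservice "$1" \
    -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'
}

# Usage (requires a logged-in oc session):
#   show_pending_finalizers cam
```

A deletion timestamp together with a remaining finalizer entry indicates that the instance is waiting on that finalizer. After the entry is removed, the command should report that the instance is not found.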