Troubleshooting installing and uninstalling Infrastructure Automation

Review frequently encountered issues related to installing, upgrading, and uninstalling Infrastructure Automation.

Installation issues

Uninstall issues

Troubleshooting installation

Database is stuck on retrying in an Infrastructure Management IMInstall Power environment

Sometimes when Infrastructure Management is deployed, the orchestrator pod fails to start properly, and the oc logs output contains messages similar to the following:

Cannot connect to the database!
Deployment status is

Check the postgresql pod to ensure that it is running and listening. Then, verify that the vmdb_production database was created by running oc rsh into the postgresql pod:

sh-4.4$ psql -U postgres
psql (10.19)
Type "help" for help.

postgres=# \l
                                    List of databases
      Name       |  Owner   | Encoding |  Collate   |   Ctype    |   Access privileges
-----------------+----------+----------+------------+------------+-----------------------
 postgres        | postgres | UTF8     | en_US.utf8 | en_US.utf8 |
 template0       | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +
                 |          |          |            |            | postgres=CTc/postgres
 template1       | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +
                 |          |          |            |            | postgres=CTc/postgres
(3 rows)

Solution: If the vmdb_production database does not exist, as in the example above, a leftover postgresql.conf file in the DATA directory might be preventing the postgresql image from initializing the new database on boot.

You need to check the DATA directory and either clean it manually or recreate the PVC/PV, and then confirm that the directory is empty.

You can find the location of the DATA directory by using the following psql command:

postgres=# show data_directory;
        data_directory
------------------------------
 /var/lib/pgsql/data/userdata
(1 row)

Ensure that the data directory is empty before the postgresql pod is restarted.
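The emptiness check above can be sketched as a small shell helper; the function name is illustrative, and the userdata path comes from the show data_directory output:

```shell
# Minimal sketch: succeed only when the given data directory has no entries,
# including hidden ones. A leftover postgresql.conf (or any other file) can
# prevent the postgresql image from initializing a new database on boot.
is_data_dir_empty() {
  [ -z "$(ls -A "$1" 2>/dev/null)" ]
}

# Usage sketch (run inside the postgresql pod):
# is_data_dir_empty /var/lib/pgsql/data/userdata && echo "safe to restart"
```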

The cam-tenant-api pod is not in a ready state after installing the IAConfig CR

After you install Infrastructure Automation, you might encounter an error where the cam-tenant-api pod is shown as running but is not in a ready state. When this error occurs, the following message is displayed:

[ERROR] init-platform-security - >>>>>>>>>> Failed to configure Platform Security. Will retry in 60 seconds <<<<<<<<<<<<< OperationalError: [object Object]

If this error occurs, delete the cam-tenant-api pod so that it restarts and attempts to enter a ready state.
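A hedged sketch of that restart as a small wrapper; the function name and the example pod name are assumptions, so substitute the actual pod name from oc get pods:

```shell
# Sketch: delete a pod so that its controller recreates it and the new pod
# retries the Platform Security configuration. Names are illustrative.
restart_pod() {
  ns="$1"; pod="$2"
  oc -n "$ns" delete pod "$pod"
}

# Usage sketch:
# restart_pod <project> cam-tenant-api-<hash>
```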

Offline install or upgrade throws an 'invalid character' error

When doing an offline install or upgrade, running the oc ibm-pak generate mirror-manifests <..> command throws an error similar to the following:

Error: failed to load the catalog FBC at cp.stg.icr.io/cp/ <...> invalid character '<' in string escape code

Solution: You must have IBM Catalog Management Plug-in for IBM Cloud Pak (ibm-pak-plugin) v1.10 or higher installed. Run the following commands to ensure that ibm-pak-plugin is at the required level.

  1. Check which version of ibm-pak-plugin you have installed.

    Run the following command on your bastion host, portable compute device, or connected compute device if you are using a portable storage device.

    oc ibm-pak --version
    

    Example output:

    oc ibm-pak --version
    v1.11.0
    

  2. If the ibm-pak-plugin version is lower than v1.10.0, then you must download and install the most recent version.

    Follow the steps for your installation approach:
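The minimum-version check in the steps above can be sketched as a small guard; version_ok is an illustrative helper, and 1.10.0 is the minimum stated in this section:

```shell
# Sketch: succeed when the given ibm-pak plugin version (as printed by
# 'oc ibm-pak --version', for example v1.11.0) is at least 1.10.0.
version_ok() {
  min="1.10.0"
  ver="${1#v}"   # strip a leading 'v'
  # sort -V orders versions numerically; if min sorts first (or ties),
  # the installed version meets the requirement.
  [ "$(printf '%s\n%s\n' "$min" "$ver" | sort -V | head -n1)" = "$min" ]
}
```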

Offline install or upgrade is stuck because the 'oc ibm-pak generate mirror-manifests' command fails with 'no space left on device'

The oc ibm-pak generate mirror-manifests $IA_CASE_NAME $TARGET_REGISTRY --version $IA_CASE_VERSION command fails with a message similar to the following in $IBMPAK_HOME/.ibm-pak/logs/oc-ibm_pak.log:

write /tmp/render-unpack-4002583241/var/lib/rpm/Packages: no space left on device

Solution: The default temporary directory does not have enough space to run the ibm-pak tool. You must set the TMPDIR environment variable to a different directory with more space before running the oc ibm-pak generate mirror-manifests command.

TMPDIR=<new_temp_dir> oc ibm-pak generate mirror-manifests $IA_CASE_NAME $TARGET_REGISTRY --version $IA_CASE_VERSION

Where <new_temp_dir> is the path for a directory with more space.
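Before pointing TMPDIR at a new directory, you can confirm that it actually has room. This is a hedged sketch; the helper name and the 5120 MB threshold in the usage note are only examples:

```shell
# Sketch: succeed when the given directory's filesystem has at least the
# requested number of megabytes free (df -P for portable output, -m for
# 1 MB blocks; column 4 is the available space).
has_space() {
  dir="$1"; need_mb="$2"
  avail_mb=$(df -Pm "$dir" | awk 'NR==2 {print $4}')
  [ "$avail_mb" -ge "$need_mb" ]
}

# Usage sketch:
# has_space /var/bigtmp 5120 && \
#   TMPDIR=/var/bigtmp oc ibm-pak generate mirror-manifests $IA_CASE_NAME $TARGET_REGISTRY --version $IA_CASE_VERSION
```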

Troubleshooting uninstall

Uninstall fails to remove Infrastructure Management cleanly after deleting IAConfig instance

After you delete the Infrastructure Management IAConfig CR in the namespace to uninstall Infrastructure Management from the Red Hat OpenShift Container Platform console, some Infrastructure Management pods might remain in the namespace. You can work around the problem by deleting the IMInstall custom resource from the Red Hat OpenShift Container Platform console, which removes the pods from the namespace.

You should also delete the clients.oidc.security.ibm.com custom resource that was used by Infrastructure Management if it still exists. Then, you can uninstall the operators that remain in the namespace.

To delete the IMInstall custom resource (CR), edit the CR to remove the finalizer, save the CR, and then delete it. Run the following commands, or perform the same steps from the Red Hat OpenShift Container Platform console.

  1. Identify the CR name, for example, im-iminstall.

    oc -n <project> get iminstall
    

    Where <project> is the project (namespace) where Infrastructure Automation is deployed.

  2. Edit the CR to remove the "finalizers:" line and the entry below it. Save and exit.

    oc -n <project> edit iminstall im-iminstall
    

    Where <project> is the project (namespace) where Infrastructure Automation is deployed.

  3. Delete the CR.

    oc -n <project> delete iminstall im-iminstall
    

    Where <project> is the project (namespace) where Infrastructure Automation is deployed.
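As an alternative to editing the CR by hand, the finalizers can be cleared with a single merge patch. This is a hedged sketch; clear_finalizers is an illustrative helper, not a documented command:

```shell
# Sketch: remove every finalizer from a resource with one merge patch.
# A JSON merge patch replaces the whole finalizers list, so setting it
# to [] lets a pending delete complete.
clear_finalizers() {
  ns="$1"; resource="$2"; name="$3"
  oc -n "$ns" patch "$resource" "$name" --type=merge -p '{"metadata":{"finalizers":[]}}'
}

# Usage sketch:
# clear_finalizers <project> iminstall im-iminstall
```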

After a few minutes, the Infrastructure Management pods should no longer exist. The clients.oidc.security.ibm.com custom resource (CR) might still exist. To delete it, complete the following steps:

  1. Check whether the CR exists.

    oc -n <project> get clients.oidc.security.ibm.com
    

    Where <project> is the project (namespace) where Infrastructure Automation is deployed.

  2. If the CR exists, edit the CR to remove the "finalizers:" line and the entry below it. Save and exit.

    oc -n <project> edit clients.oidc.security.ibm.com ibm-infra-management-application-client
    

    Where <project> is the project (namespace) where Infrastructure Automation is deployed.

  3. Delete the CR.

    oc -n <project> delete clients.oidc.security.ibm.com ibm-infra-management-application-client
    

    Where <project> is the project (namespace) where Infrastructure Automation is deployed.

Uninstall hangs when uninstalling Managed services

The Infrastructure Automation uninstallation sometimes hangs when it tries to delete the ManageService instance.

Describing the Managed services instance shows output similar to the following:

# oc describe manageservice cam
Name:         cam
Namespace:    cp4aiops
Labels:       operator.ibm.com/opreq-control=true
Annotations:  operator-sdk/primary-resource: /cam-services-sa-csb-patch-rb-pod
              operator-sdk/primary-resource-type: ClusterRoleBinding.rbac.authorization.k8s.io
API Version:  cam.management.ibm.com/v1alpha1
Kind:         ManageService
Metadata:
  Creation Timestamp:             2024-03-05T21:46:56Z
  Deletion Grace Period Seconds:  0
  Deletion Timestamp:             2024-04-15T01:11:29Z
  Finalizers:
    helm.sdk.operatorframework.io/uninstall-release
  Generation:        2
  Resource Version:  55077252
  UID:               87e6b8cf-3123-4914-bf64-a2b209cbe22d

Solution: Edit the Managed services instance to remove the finalizer entry.

  1. Run the following command:

    oc edit ManageService cam
    
  2. Delete the following lines, and then save your changes.

    finalizers:
    - helm.sdk.operatorframework.io/uninstall-release