Common installation issues

This document provides solutions to common issues encountered during installation and operation.

Common issues during installation

Symptoms
Kafka is not verifying
Solution
Delete the Kafka resource and allow it to be re-created automatically.

Symptoms
Role binding conflicts
Solution
Delete the existing role bindings before you proceed with the installation.

Symptoms
Image pull errors
Solution
Verify that the entitlement key secrets are correctly created and configured.
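To narrow down image pull errors before you inspect the secrets, a minimal bash sketch like the following can help. It is an illustration, not a product tool: the function names image_pull_failures and has_registry_auth are hypothetical, it assumes the entitlement key is stored as a kubernetes.io/dockerconfigjson pull secret, and the registry host you check for (for IBM entitled images this is commonly cp.icr.io) should be confirmed for your environment.

```bash
#!/usr/bin/env bash
# Hypothetical triage helpers for image pull errors.
# Assumption: the entitlement key is a kubernetes.io/dockerconfigjson secret.

# Print the names of pods whose STATUS column reports an image pull failure.
# Reads `oc get pods --no-headers` output (NAME READY STATUS ...) on stdin.
image_pull_failures() {
  awk '$3 == "ImagePullBackOff" || $3 == "ErrImagePull" { print $1 }'
}

# Succeed if a decoded .dockerconfigjson document has an "auths" section
# that mentions the given registry host.
# Usage: has_registry_auth DECODED_JSON REGISTRY_HOST
has_registry_auth() {
  local config=$1 registry=$2
  case "$config" in
    *'"auths"'*"\"${registry}\""*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, pipe `oc get pods -n <namespace> --no-headers` into image_pull_failures to list the affected pods, then pass the output of `oc get secret <secret-name> -n <namespace> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d` to has_registry_auth together with the registry host. The `<namespace>` and `<secret-name>` placeholders are yours to fill in.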

Conflicts when multiple operator versions attempt to manage the same PostgreSQL resources

Symptoms:
When you migrate from OLM (Operator Lifecycle Manager) to non-OLM deployments, the old OLM-based Postgres operator and the new Helm-based operator might run at the same time. Because both operator versions attempt to manage the same PostgreSQL resources, this causes conflicts and resource contention. You must manually remove the old CSV (ClusterServiceVersion) to prevent these issues and to ensure that only the Helm-based operator manages the PostgreSQL instances.
Solution
Follow these steps to remove CSV-based Postgres deployments when you migrate from OLM to non-OLM.
  1. Step 1 - Verify Helm-based Postgres Operator
    1. Check whether the Helm-based Postgres operator (1-25-4) and the older operator (1-25-3) are running:
      oc get deploy postgresql-operator-controller-manager-1-25-4 \
        -n ${PROJECT_CPD_INST_OPERATORS}
      
      oc get deploy postgresql-operator-controller-manager-1-25-3 \
        -n ${PROJECT_CPD_INST_OPERATORS}
    2. If both deployments are active (versions 1-25-4 and 1-25-3), remove the older version (1-25-3) by completing the remaining steps.
  2. Step 2 - Inspect Finalizers on the Old OLM CSV
    oc get csv cloud-native-postgresql.v1.25.3 \
      -n ${PROJECT_CPD_INST_OPERATORS} \
      -o jsonpath='{.metadata.finalizers}'

    Expected output

    ["operators.coreos.com/csv-cleanup"]
  3. Step 3 - Remove the Finalizer
    1. Remove the finalizers from the old Postgres CSV so that the CSV can be deleted:
      oc patch csv cloud-native-postgresql.v1.25.3 \
        -n ${PROJECT_CPD_INST_OPERATORS} \
        --type=json \
        -p='[{"op":"remove","path":"/metadata/finalizers"}]'
  4. Step 4 - Delete the CSV
    oc delete csv cloud-native-postgresql.v1.25.3 \
      -n ${PROJECT_CPD_INST_OPERATORS}
  5. Step 5 - Verify Removal
    Wait 30 seconds, then verify that the OLM-based operator deployment is removed:
    oc get deploy -n ${PROJECT_CPD_INST_OPERATORS} | grep postgresql
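The checks in Steps 1, 2, and 5 can be scripted. The following is a minimal bash sketch, not part of the product: it assumes the deployment naming pattern postgresql-operator-controller-manager-<version> shown in Step 1, and the helper names postgres_operator_versions, has_csv_cleanup_finalizer, and wait_until_gone are hypothetical. The helpers only parse text or poll a command that you pass in, so they can be tried without a cluster.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the OLM-to-Helm Postgres operator cleanup.
# Assumes deployment names of the form
# postgresql-operator-controller-manager-<major>-<minor>-<patch>.

# Step 1: print the version suffixes of Postgres operator deployments,
# oldest first. Reads `oc get deploy --no-headers` output on stdin.
postgres_operator_versions() {
  awk '{ print $1 }' \
    | sed -n 's/^postgresql-operator-controller-manager-\([0-9][0-9-]*\)$/\1/p' \
    | sort -t- -k1,1n -k2,2n -k3,3n
}

# Step 2: succeed if the finalizer list printed by
# -o jsonpath='{.metadata.finalizers}' still contains the OLM finalizer.
has_csv_cleanup_finalizer() {
  case "$1" in
    *'"operators.coreos.com/csv-cleanup"'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Step 5: re-run a command until it fails (for example, until
# `oc get deploy NAME` reports NotFound) instead of a fixed 30-second wait.
# Usage: wait_until_gone TIMEOUT_SECONDS INTERVAL_SECONDS command [args...]
wait_until_gone() {
  local timeout=$1 interval=$2 elapsed=0
  shift 2
  while "$@" >/dev/null 2>&1; do
    if (( elapsed >= timeout )); then
      echo "still present after ${timeout}s: $*" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$(( elapsed + interval ))
  done
  return 0
}
```

For example, `oc get deploy -n ${PROJECT_CPD_INST_OPERATORS} --no-headers | postgres_operator_versions` prints both 1-25-3 and 1-25-4 when both operators are present. After Step 4, `wait_until_gone 120 5 oc get deploy postgresql-operator-controller-manager-1-25-3 -n ${PROJECT_CPD_INST_OPERATORS}` waits up to 120 seconds, assuming the 1-25-3 deployment is the one that belongs to the old CSV.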
    

Watson Assistant TensorFlow deployment issues

Problem: Watson Assistant TensorFlow deployment does not start

Symptom: The pod restarts before the models finish downloading because the startup probe times out.

Solution: Apply a temporary patch that relaxes the startup probe to accommodate slow download speeds.

Apply the following patch:
cat <<EOF | oc apply -f -
apiVersion: assistant.watson.ibm.com/v1
kind: TemporaryPatch
metadata:
  name: wa-tf-probe-fix
spec:
  apiVersion: assistant.watson.ibm.com/v1
  kind: WatsonAssistantClu
  name: wo-wa
  patchType: patchStrategicMerge
  patch:
    tf:
      deployment:
        spec:
          template:
            spec:
              containers:
              - name: tensorflow-serving-adapter
                startupProbe:
                  periodSeconds: 30
EOF

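This increases the interval between startup probe checks to 30 seconds. Kubernetes restarts the container after failureThreshold consecutive probe failures, so the container has roughly failureThreshold multiplied by periodSeconds to start. A longer period therefore gives the models more time to download.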
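To pick values for these fields, the arithmetic can be sketched in bash. This is a minimal sketch based on standard Kubernetes probe semantics, where the startup window is about initialDelaySeconds + failureThreshold * periodSeconds. The function names are hypothetical, and the current failureThreshold and initialDelaySeconds of the tensorflow-serving-adapter container are not given in this document, so read them from the deployment before you use them.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: startup probe budget arithmetic (integer seconds).

# Approximate time Kubernetes allows before it restarts the container:
# initialDelaySeconds + failureThreshold * periodSeconds.
startup_window_seconds() {
  local initial_delay=$1 failure_threshold=$2 period=$3
  echo $(( initial_delay + failure_threshold * period ))
}

# Smallest failureThreshold whose window covers TARGET seconds of startup
# (for example, the expected model download time) at the given period.
# Usage: min_failure_threshold TARGET INITIAL_DELAY PERIOD
min_failure_threshold() {
  local target=$1 initial_delay=$2 period=$3
  local remaining=$(( target - initial_delay ))
  if (( remaining <= 0 )); then
    echo 1
    return
  fi
  # Integer ceiling division: ceil(remaining / period).
  echo $(( (remaining + period - 1) / period ))
}
```

For example, if the probe kept the Kubernetes default failureThreshold of 3 with no initial delay, `startup_window_seconds 0 3 30` prints 90, and `min_failure_threshold 600 0 30` prints 20 for a 10-minute download at a 30-second period.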

Alternative

Increase failureThreshold instead of periodSeconds. Either change lengthens the total time that the startup probe allows before the container is restarted.
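A failureThreshold version of the periodSeconds patch could be generated as follows. This bash sketch is illustrative: render_failure_threshold_patch is a hypothetical helper, it copies the TemporaryPatch fields from the periodSeconds example unchanged (including the name wa-tf-probe-fix, so applying it updates that same TemporaryPatch rather than creating a second one), and the threshold you pass in is your own choice, not a documented value.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a TemporaryPatch that sets failureThreshold on
# the tensorflow-serving-adapter startup probe. The other fields mirror the
# periodSeconds patch in this document.
render_failure_threshold_patch() {
  local threshold=$1
  # Reject anything that is not a positive integer.
  case "$threshold" in
    ''|*[!0-9]*|0)
      echo "failureThreshold must be a positive integer" >&2
      return 1
      ;;
  esac
  cat <<EOF
apiVersion: assistant.watson.ibm.com/v1
kind: TemporaryPatch
metadata:
  name: wa-tf-probe-fix
spec:
  apiVersion: assistant.watson.ibm.com/v1
  kind: WatsonAssistantClu
  name: wo-wa
  patchType: patchStrategicMerge
  patch:
    tf:
      deployment:
        spec:
          template:
            spec:
              containers:
              - name: tensorflow-serving-adapter
                startupProbe:
                  failureThreshold: ${threshold}
EOF
}
```

For example, `render_failure_threshold_patch 20 | oc apply -f -` applies the patch with a threshold of 20, which is only an illustrative value.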