Watson Query post-restore hooks fail because pods are unreachable

Important: IBM Cloud Pak® for Data Version 4.8 will reach end of support (EOS) on 31 July 2025. For more information, see the Discontinuance of service announcement for IBM Cloud Pak for Data Version 4.X.

Upgrade to IBM Software Hub Version 5.1 before IBM Cloud Pak for Data Version 4.8 reaches end of support. For more information, see Upgrading from IBM Cloud Pak for Data Version 4.8 to IBM Software Hub Version 5.1.

When you perform an offline restore, Watson Query pods might show as running, but the IP addresses that are assigned to them are invalid and the pods cannot connect to each other. As a result, the Watson Query post-restore hooks might fail because the pods are unreachable.

Symptoms

Note:

Watson Query instances can be provisioned on multiple namespaces (the control plane namespace or tethered namespaces). Be sure to use the correct Watson Query instance namespace when you complete these steps.

  1. Run the following command to check the pods:
    oc -n <namespace> logs <podname>
    • For example, to check head pod logs for a Watson Query instance that is provisioned in a control plane namespace zen, run the following command:
      oc -n zen logs c-db2u-dv-db2u-0
    • To check head pod logs for a Watson Query instance that is provisioned in a tethered namespace tn1, run the following command:
      oc -n tn1 logs c-db2u-dv-db2u-0

    Check whether the pod logs contain errors similar to the following. (A log-scanning sketch follows these steps.)

    Head pod c-db2u-dv-db2u-0 logs:
    Error: unable to initialize: Get "https://xxx.xxx.xx.x:443/api/v1/namespaces/dv1/configmaps/c-db2u-dv-db2u-api": dial tcp xxx.xxx.xx.x:443: connect: no route to host
    
    Hurricane pod logs:
    Error: dial tcp: lookup c-db2u-dv-db2u-internal.dv1.svc on 172.30.0.10:53: read udp 10.254.16.131:40797->172.30.0.10:53: i/o timeout
    
  2. Log in to the head pod and check the bigsql status:
    oc -n <namespace> rsh c-db2u-dv-db2u-0 bash
    su db2inst1
    bigsql status
    
    The bigsql status might show the following error:
    [db2inst1@c-db2u-dv-db2u-0 - Db2U /]$ bigsql status
    SERVICE              HOSTNAME                               NODE      PID STATUS
                         c-db2u-dv-db2u-1.c-db2u-dv-db2u-internal.zen.svc.cluster.local    -        - Unreachable
    Big SQL Master       c-db2u-dv-db2u-0.c-db2u-dv-db2u-internal.zen.svc.cluster.local    0        - DB2 Not Running
    Error: dial tcp: lookup c-db2u-dv-db2u-internal.zen.svc on 172.30.0.10:53: read udp 10.254.20.103:45777->172.30.0.10:53: i/o timeout
    Error: dial tcp: lookup c-db2u-dv-db2u-internal.zen.svc on 172.30.0.10:53: read udp 10.254.20.103:52983->172.30.0.10:53: i/o timeout
    
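If multiple pods might be affected, scanning the logs in bulk can save time. The following Bash sketch is illustrative only; it assumes that oc is logged in to the cluster, that the NAMESPACE value is replaced with your Watson Query instance namespace, and that the db2u-dv name filter matches your instance pods:

    #!/bin/bash
    # Sketch: scan Watson Query pods for the connectivity errors shown above.
    NAMESPACE=zen   # replace with your instance namespace (for example, tn1)

    for pod in $(oc -n "$NAMESPACE" get pods -o name | grep db2u-dv); do
      echo "=== ${pod#pod/} ==="
      # Look for the "no route to host" and DNS "i/o timeout" signatures.
      oc -n "$NAMESPACE" logs "${pod#pod/}" --tail=200 2>/dev/null \
        | grep -E 'no route to host|i/o timeout' \
        || echo "no matching errors in the last 200 lines"
    done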

Causes

After a restore, each pod keeps the IP address that was recorded in the backup, but the pod can be scheduled to a different worker node. As a result, a pod can hold an IP address that does not belong to the subnet of the worker node that it runs on.
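
To see the mismatch concretely, you can compare a pod's IP address with the pod subnet of the node that it runs on. The following is a minimal sketch that assumes a cluster that uses OpenShift SDN, where each node's pod subnet is exposed through the hostsubnet resource; on clusters that use OVN-Kubernetes, the subnet is recorded in the k8s.ovn.org/node-subnets node annotation instead. The pod and namespace names are placeholders:

    NAMESPACE=zen                 # Watson Query instance namespace
    POD=c-db2u-dv-db2u-0          # pod to inspect

    POD_IP=$(oc -n "$NAMESPACE" get pod "$POD" -o jsonpath='{.status.podIP}')
    NODE=$(oc -n "$NAMESPACE" get pod "$POD" -o jsonpath='{.spec.nodeName}')
    echo "pod $POD has IP $POD_IP and runs on node $NODE"

    # After healthy scheduling, POD_IP falls inside this per-node subnet.
    oc get hostsubnet "$NODE" -o jsonpath='{.subnet}{"\n"}'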

Diagnosing the problem

Note:

Watson Query instances can be provisioned on multiple namespaces (the control plane namespace or tethered namespaces). Be sure to use the correct Watson Query instance namespace when you complete these steps.

  1. Get the worker node to which the pod is assigned:
    oc -n <namespace> describe po <podname> | grep Node
  2. Get the IP address that is assigned to the pod:
    oc -n <namespace> describe po <podname> | grep IP
  3. Check the IP address and subnet of the worker node that you obtained in step 1:
    oc describe node <workernodename> | grep IP
    oc describe node <workernodename> | grep subnet
  4. Download the Kubernetes resources backup file (for example, tenant-offline-b2).

    To get the backup file name, run the following command:

    cpd-cli oadp backup ls

    To download the backup file, run the following command:

    cpd-cli oadp backup download <backupname>
  5. Unzip the backup tar file <backupname>.tar.gz:
    tar -zxvf <backupname>.tar.gz
  6. Get the IP address and worker node details from the backup file of a pod that has the issue:
    • For the head pod c-db2u-dv-db2u-0, run the following commands:
      cat resources/pods/namespaces/<namespace>/c-db2u-dv-db2u-0.json | python -m json.tool | grep nodeName
      cat resources/pods/namespaces/<namespace>/c-db2u-dv-db2u-0.json | python -m json.tool | grep podIP
    • For the worker pod c-db2u-dv-db2u-1, run the following commands:
      cat resources/pods/namespaces/<namespace>/c-db2u-dv-db2u-1.json | python -m json.tool | grep nodeName
      cat resources/pods/namespaces/<namespace>/c-db2u-dv-db2u-1.json | python -m json.tool | grep podIP
    • For the hurricane pod, run the following commands:
      cat resources/pods/namespaces/<namespace>/<hurricane-podname>.json | python -m json.tool | grep nodeName
      cat resources/pods/namespaces/<namespace>/<hurricane-podname>.json | python -m json.tool | grep podIP
    • For the dv-utils pod c-db2u-dv-dvutils-0, run the following commands:
      cat resources/pods/namespaces/<namespace>/c-db2u-dv-dvutils-0.json | python -m json.tool | grep nodeName
      cat resources/pods/namespaces/<namespace>/c-db2u-dv-dvutils-0.json | python -m json.tool | grep podIP
  7. Compare the worker node and IP address that you obtained in step 6 with the worker node and IP address that you obtained in steps 1 and 2. If the IP address is the same but the worker node is different, the pod is affected by the issue that is described in the Causes section. Proceed to the resolution steps. (A comparison sketch follows these steps.)
  8. Repeat steps 1 - 7 for all Watson Query pods that have this issue.
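
The per-pod comparison in steps 1 - 7 can also be scripted. The following sketch assumes that the backup archive is already extracted into the current directory (step 5), that python is available as in the commands above, and that the pod and namespace names are replaced with your own:

    NAMESPACE=zen
    POD=c-db2u-dv-db2u-0
    BACKUP_JSON=resources/pods/namespaces/$NAMESPACE/$POD.json

    # Live placement of the pod after the restore.
    LIVE_NODE=$(oc -n "$NAMESPACE" get pod "$POD" -o jsonpath='{.spec.nodeName}')
    LIVE_IP=$(oc -n "$NAMESPACE" get pod "$POD" -o jsonpath='{.status.podIP}')

    # Placement that was recorded in the Kubernetes resources backup.
    BACKUP_NODE=$(python -c "import json,sys; print(json.load(open(sys.argv[1]))['spec']['nodeName'])" "$BACKUP_JSON")
    BACKUP_IP=$(python -c "import json,sys; print(json.load(open(sys.argv[1]))['status']['podIP'])" "$BACKUP_JSON")

    echo "live:   node=$LIVE_NODE ip=$LIVE_IP"
    echo "backup: node=$BACKUP_NODE ip=$BACKUP_IP"

    # Same IP but a different worker node indicates the stale-IP condition.
    if [ "$LIVE_IP" = "$BACKUP_IP" ] && [ "$LIVE_NODE" != "$BACKUP_NODE" ]; then
      echo "$POD is affected; restart it as described in Resolving the problem"
    fi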

Resolving the problem

  1. Restart the Watson Query pods to refresh their IP addresses. (A restart-and-wait sketch follows these steps.)
    oc -n <namespace> delete pod <podname>
  2. Re-run the Watson Query post-restore hooks if they have failed:
    1. Run the following commands for head, worker, and hurricane pods:
      oc -n <namespace> rsh <podname> bash
      su db2inst1
      /db2u/scripts/bigsql-exec.sh /usr/ibmpacks/current/bigsql/bigsql/bigsql-cli/BIGSQL/package/scripts/bigsql-db2ubar-hook.sh -H POST -M RESTORE
    2. Run the following commands for the dv-utils pod:
      oc -n <namespace> rsh <dvutil-podname> bash 
      su db2inst1
      /opt/dv/current/dv-utils.sh -o start --is-bar
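
When you restart a pod in step 1, it can help to wait until the replacement pod is Ready before you rerun the hooks. The following sketch assumes that the pod is managed by a StatefulSet that re-creates it under the same name, which is the case for the db2u pods that are shown above; the names are placeholders:

    NAMESPACE=zen
    POD=c-db2u-dv-db2u-0

    oc -n "$NAMESPACE" delete pod "$POD"
    # The StatefulSet re-creates the pod; wait until it is Ready again.
    oc -n "$NAMESPACE" wait --for=condition=Ready "pod/$POD" --timeout=600s
    # Confirm that the pod now has an IP address on its worker node's subnet.
    oc -n "$NAMESPACE" get pod "$POD" -o wide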