Watson Query post-restore hooks fail because pods are unreachable
When you perform an offline restore, Watson Query pods might show as running, but the IP addresses that are assigned to them are invalid and the pods cannot connect to each other. The Watson Query post-restore hooks might fail because the pods are unreachable.
Symptoms
Watson Query instances can be provisioned on multiple namespaces (the control plane namespace or tethered namespaces). Be sure to use the correct Watson Query instance namespace when you complete these steps.
- Run the following command to check the pods:
  oc -n <namespace> logs <podname>
  - For example, to check head pod logs for a Watson Query instance that is provisioned in a control plane namespace zen, run the following command:
    oc -n zen logs c-db2u-dv-db2u-0
  - To check head pod logs for a Watson Query instance that is provisioned in a tethered namespace tn1, run the following command:
    oc -n tn1 logs c-db2u-dv-db2u-0
Check if the pod errors are similar to the following:
Head pod c-db2u-dv-db2u-0 logs:
  Error: unable to initialize: Get "https://xxx.xxx.xx.x:443/api/v1/namespaces/dv1/configmaps/c-db2u-dv-db2u-api": dial tcp xxx.xxx.xx.x:443: connect: no route to host
Hurricane pod logs:
  Error: dial tcp: lookup c-db2u-dv-db2u-internal.dv1.svc on 172.30.0.10:53: read udp 10.254.16.131:40797->172.30.0.10:53: i/o timeout
- Log in to the head pod and check the bigsql status:
  oc -n <namespace> rsh c-db2u-dv-db2u-0 bash
  su db2inst1
  bigsql status
  The bigsql status might show the following error:
  [db2inst1@c-db2u-dv-db2u-0 - Db2U /]$ bigsql status
  SERVICE          HOSTNAME                                                         NODE  PID  STATUS
                   c-db2u-dv-db2u-1.c-db2u-dv-db2u-internal.zen.svc.cluster.local   -     -    Unreachable
  Big SQL Master   c-db2u-dv-db2u-0.c-db2u-dv-db2u-internal.zen.svc.cluster.local   0     -    DB2 Not Running
  Error: dial tcp: lookup c-db2u-dv-db2u-internal.zen.svc on 172.30.0.10:53: read udp 10.254.20.103:45777->172.30.0.10:53: i/o timeout
  Error: dial tcp: lookup c-db2u-dv-db2u-internal.zen.svc on 172.30.0.10:53: read udp 10.254.20.103:52983->172.30.0.10:53: i/o timeout
  A short loop that sweeps the logs of all Watson Query pods for these errors is sketched after this procedure.
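To check all Watson Query pods for these errors in one pass, you can loop over their logs. The following is a minimal sketch rather than an official utility: it assumes that the Watson Query pod names start with c-db2u-dv and it checks only the most recent log lines. Replace <namespace> with your Watson Query instance namespace.
  NAMESPACE=<namespace>   # Watson Query instance namespace, for example zen or tn1
  for pod in $(oc -n "$NAMESPACE" get pods -o name | grep c-db2u-dv); do
    echo "== $pod =="
    # Search recent log lines for the "no route to host" and DNS "i/o timeout" errors shown above
    oc -n "$NAMESPACE" logs "$pod" --tail=200 | grep -E "no route to host|i/o timeout" || echo "no matching errors"
  done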
Causes
After a restore, pods use the same IP addresses that they used at backup time, but they can be scheduled on different worker nodes. As a result, pods get assigned IP addresses that do not belong to the subnet of the worker node that they run on.
Diagnosing the problem
Watson Query instances can be provisioned on multiple namespaces (the control plane namespace or tethered namespaces). Be sure to use the correct Watson Query instance namespace when you complete these steps.
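For orientation, the following illustration shows the kind of mismatch that the steps below uncover. All pod, node, and address values here are hypothetical; only the commands come from the steps in this procedure.
  oc -n zen describe po c-db2u-dv-db2u-0 | grep -E "Node:|IP:"
  # Node:  worker-2.example.com/...        <- hypothetical: the pod is now scheduled on worker-2
  # IP:    10.254.20.103                   <- hypothetical: the IP address kept from the backup
  oc describe node worker-2.example.com | grep -i subnet
  # ...subnet: 10.254.16.0/23              <- hypothetical: the pod IP is outside this node's subnet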
1. Get the worker node to which the pod is assigned:
   oc -n <namespace> describe po <podname> | grep Node
2. Get the IP address that is assigned to the pod:
   oc -n <namespace> describe po <podname> | grep IP
3. Check the IP address and subnet of the worker node that you obtained in step 1:
   oc describe node <workernodename> | grep IP
   oc describe node <workernodename> | grep subnet
4. Download the Kubernetes resources backup file (for example, tenant-offline-b2).
   To get the backup file name, run the following command:
   cpd-cli oadp backup ls
   To download the backup file, run the following command:
   cpd-cli oadp backup download <backupname>
5. Extract the backup tar file <backupname>.tar.gz:
   tar -zxvf <backupname>.tar.gz
6. Get the IP address and worker node details from the backup file of a pod that has the issue:
   - For the head pod c-db2u-dv-db2u-0, run the following commands:
     cat resources/pods/namespaces/<namespace>/c-db2u-dv-db2u-0.json | python -m json.tool | grep nodeName
     cat resources/pods/namespaces/<namespace>/c-db2u-dv-db2u-0.json | python -m json.tool | grep podIP
   - For the worker pod c-db2u-dv-db2u-1, run the following commands:
     cat resources/pods/namespaces/<namespace>/c-db2u-dv-db2u-1.json | python -m json.tool | grep nodeName
     cat resources/pods/namespaces/<namespace>/c-db2u-dv-db2u-1.json | python -m json.tool | grep podIP
   - For the hurricane pod, run the following commands:
     cat resources/pods/namespaces/<namespace>/<hurricane-podname>.json | python -m json.tool | grep nodeName
     cat resources/pods/namespaces/<namespace>/<hurricane-podname>.json | python -m json.tool | grep podIP
   - For the dv-utils pod c-db2u-dv-dvutils-0, run the following commands:
     cat resources/pods/namespaces/<namespace>/c-db2u-dv-dvutils-0.json | python -m json.tool | grep nodeName
     cat resources/pods/namespaces/<namespace>/c-db2u-dv-dvutils-0.json | python -m json.tool | grep podIP
7. Compare the worker node and IP address that you obtained in step 6 with the worker node and IP address that you obtained in steps 1 and 2. If the IP address is the same but the worker node is different, the pod is affected by the issue that is described above. Proceed to complete the resolution steps.
8. Repeat steps 1 to 7 for all Watson Query pods that have this issue. A sketch that automates this comparison for the Watson Query pods follows this list.
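The comparison in steps 1, 2, and 6 can also be scripted. The following is a minimal sketch rather than an official tool: it assumes that the backup archive was extracted into the current directory, that python is available, and that the pod names match the examples in this document (add the hurricane pod by its actual name). Replace <namespace> with your Watson Query instance namespace.
  NAMESPACE=<namespace>   # Watson Query instance namespace, for example zen or tn1
  for pod in c-db2u-dv-db2u-0 c-db2u-dv-db2u-1 c-db2u-dv-dvutils-0; do
    echo "== $pod =="
    # Worker node and IP address that were recorded in the backup
    echo -n "backup: "
    cat resources/pods/namespaces/"$NAMESPACE"/"$pod".json | python -m json.tool | grep -E '"nodeName"|"podIP"' | tr -d ' ' | tr '\n' ' '
    echo
    # Worker node and IP address of the live pod after the restore
    echo -n "live  : "
    oc -n "$NAMESPACE" get pod "$pod" -o jsonpath='{.spec.nodeName} {.status.podIP}{"\n"}'
  done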
Resolving the problem
- Restart the Watson Query pods to refresh the IP addresses:
  oc -n <namespace> delete pod <podname>
- Re-run the Watson Query post-restore hooks if they failed:
  - Run the following commands for the head, worker, and hurricane pods:
    oc -n <namespace> rsh <podname> bash
    su db2inst1
    /db2u/scripts/bigsql-exec.sh /usr/ibmpacks/current/bigsql/bigsql/bigsql-cli/BIGSQL/package/scripts/bigsql-db2ubar-hook.sh -H POST -M RESTORE
  - Run the following commands for the dv-utils pod:
    oc -n <namespace> rsh <dvutil-podname> bash
    su db2inst1
    /opt/dv/current/dv-utils.sh -o start --is-bar
  An end-to-end sketch that combines the pod restart and the post-restore hooks follows this list.
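The full recovery sequence can also be driven from one script. The following is a minimal sketch, not an official procedure: it uses oc exec with su -c instead of the interactive oc rsh sessions shown above, the pod names, sleep, and 15-minute timeout are assumptions based on the examples in this document, and the hurricane pod must be added by its actual name. Replace <namespace> with your Watson Query instance namespace.
  NAMESPACE=<namespace>   # Watson Query instance namespace, for example zen or tn1
  # 1. Restart the affected Watson Query pods so that they come back with valid IP addresses
  oc -n "$NAMESPACE" delete pod c-db2u-dv-db2u-0 c-db2u-dv-db2u-1 c-db2u-dv-dvutils-0
  # 2. Give the controller a moment to recreate the pods, then wait until they are ready
  sleep 30
  oc -n "$NAMESPACE" wait --for=condition=Ready pod/c-db2u-dv-db2u-0 pod/c-db2u-dv-db2u-1 pod/c-db2u-dv-dvutils-0 --timeout=15m
  # 3. Re-run the post-restore hook on the head and worker pods (repeat for the hurricane pod)
  for pod in c-db2u-dv-db2u-0 c-db2u-dv-db2u-1; do
    oc -n "$NAMESPACE" exec "$pod" -- su db2inst1 -c "/db2u/scripts/bigsql-exec.sh /usr/ibmpacks/current/bigsql/bigsql/bigsql-cli/BIGSQL/package/scripts/bigsql-db2ubar-hook.sh -H POST -M RESTORE"
  done
  # 4. Re-run the post-restore hook on the dv-utils pod
  oc -n "$NAMESPACE" exec c-db2u-dv-dvutils-0 -- su db2inst1 -c "/opt/dv/current/dv-utils.sh -o start --is-bar"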