Restoring Watson Machine Learning
Backup and restore for Watson Machine Learning follows the standard backup and restore process for all Cloud Pak for Data services. If you encounter problems with the wml-deployment-manager pod after backup and restore, refer to the Troubleshooting section.
Troubleshooting
If the wml-deployment-manager-<pod name> pod does not recover after a backup or restore:
- Check the wml-deployment-manager-<pod name> init container logs to see whether the pod failed to establish a connection to etcd and is stuck, as in the example output below:
  oc logs -l app=wml-deployment-manager -c init-container
  Example output:
  Waiting for Ectd service to come up...
  wml-cpd-etcd:2379 is unhealthy: failed to connect: context deadline exceeded
  Error: unhealthy cluster
  Waiting for Ectd service to come up...
  wml-cpd-etcd:2379 is unhealthy: failed to connect: context deadline exceeded
  Error: unhealthy cluster
  Waiting for Ectd service to come up...
- If the logs show this error, restart these pods:
  wml-cpd-etcd-0
  wml-cpd-etcd-1
  wml-cpd-etcd-2
- If the pods restart without error, restart the wml-deployment-manager-<pod name> pod and verify that the deployment manager started successfully.
- Check the logs of the agent and envoy pods for etcd connection issues. Use these commands:
  oc logs wml-deployment-agent-0 -c init-container
  oc logs -l app=wml-deployment-envoy -c init-container
  If agent and envoy failed to connect to etcd after the restore, restart these pods:
  wml-deployment-agent-0
  wml-deployment-envoy-<podname>
- If the issue persists, reboot the cluster.
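The log check and pod restarts in the steps above could be scripted roughly as follows. This is a minimal sketch, not part of the product: the helper names etcd_wait_stuck and restart_pods and the DRY_RUN switch are hypothetical, and it assumes that "restarting" a pod means deleting it so that its controller recreates it. By default the restart helper only prints the oc commands it would run.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the DRY_RUN switch are hypothetical.

# Return 0 if the log text on stdin shows the etcd connection failure
# from the example output (unhealthy cluster / context deadline exceeded).
etcd_wait_stuck() {
  grep -Eq 'unhealthy cluster|context deadline exceeded'
}

# Restart the named pods by deleting them, assuming their controller
# recreates them. With DRY_RUN=1 (the default) the oc commands are
# printed instead of executed.
restart_pods() {
  local pod
  for pod in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "oc delete pod ${pod}"
    else
      oc delete pod "${pod}"
    fi
  done
}

# Usage against a live cluster (not run here):
#   if oc logs -l app=wml-deployment-manager -c init-container | etcd_wait_stuck; then
#     DRY_RUN=0 restart_pods wml-cpd-etcd-0 wml-cpd-etcd-1 wml-cpd-etcd-2
#   fi
```

The same restart_pods helper could be pointed at wml-deployment-agent-0 or at a specific envoy pod name. After any restart, oc get pods can be used to confirm that the pods came back before moving on to the next step.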