Restoring Watson Machine Learning
Backup and restore for Watson Machine Learning follows the standard backup and restore process for all Cloud Pak for Data services. If you encounter problems with the wml-deployment-manager pod after backup and restore, refer to the Troubleshooting section.
Troubleshooting
If wml-deployment-manager-<pod name> is not recovering after backup or restore:
-
Check the wml-deployment-manager-<pod name> init container logs to see whether the pod failed to establish a connection to etcd and is stuck, as in the example below.
oc logs -l app=wml-deployment-manager -c init-container
Example output:
Waiting for Ectd service to come up...
wml-cpd-etcd:2379 is unhealthy: failed to connect: context deadline exceeded
Error: unhealthy cluster
Waiting for Ectd service to come up...
wml-cpd-etcd:2379 is unhealthy: failed to connect: context deadline exceeded
Error: unhealthy cluster
Waiting for Ectd service to come up...
-
If the logs show this error, restart these pods:
wml-cpd-etcd-0
wml-cpd-etcd-1
wml-cpd-etcd-2
-
If the pods restart without error, restart the wml-deployment-manager-<pod name> pod and verify that the deployment manager started successfully.
-
Check the logs of the agent and envoy pods for etcd connection issues. Use these commands:
oc logs wml-deployment-agent-0 -c init-container
oc logs -l app=wml-deployment-envoy -c init-container
If the agent and envoy pods failed to connect to etcd after restore, restart these pods:
wml-deployment-agent-0
wml-deployment-envoy-<podname>
-
If the issue is still not fixed, reboot the cluster.
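The first two steps above (check the init container log, then restart the etcd pods) can be sketched as a small shell helper. This is an illustrative sketch, not an official procedure: the function names are hypothetical, and it assumes that "restart" means deleting each pod so that its controller recreates it, and that you are already logged in with oc and working in the project where Watson Machine Learning is installed.

```shell
#!/usr/bin/env bash
# Sketch only. Function names are hypothetical, and "restart" is ASSUMED to
# mean deleting a pod so that its controller recreates it.

# Reads log text on stdin. Succeeds if the text contains the etcd connection
# failure shown in the example output.
init_log_shows_etcd_failure() {
  grep -Eq 'failed to connect: context deadline exceeded|Error: unhealthy cluster'
}

# Deletes each named pod in the current project.
restart_pods() {
  local pod
  for pod in "$@"; do
    oc delete pod "$pod"
  done
}

# Restarts the three etcd pods only if the deployment manager init container
# log shows the connection failure.
restart_etcd_if_unhealthy() {
  if oc logs -l app=wml-deployment-manager -c init-container | init_log_shows_etcd_failure; then
    restart_pods wml-cpd-etcd-0 wml-cpd-etcd-1 wml-cpd-etcd-2
  fi
}
```

To use it, source the file and run restart_etcd_if_unhealthy. Under the same assumption, restart_pods wml-deployment-agent-0 covers the agent pod in the later step. The etcd check is kept separate from the restart so that pods are deleted only when the log actually shows the failure.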