Restoring Watson Machine Learning

Backup and restore for Watson Machine Learning follows the standard backup and restore process for all Cloud Pak for Data services. If you encounter problems with the wml-deployment-manager pod after backup and restore, refer to the Troubleshooting section.

Troubleshooting

If wml-deployment-manager-<pod name> is not recovering after backup or restore:

  1. Check the init container logs of the wml-deployment-manager-<pod name> pod to see whether it failed to establish a connection to etcd and is stuck, as in the example below.

    oc logs -l app=wml-deployment-manager -c init-container

    Example output:

    Waiting for Ectd service to come up...
    wml-cpd-etcd:2379 is unhealthy: failed to connect: context deadline exceeded
    Error: unhealthy cluster
    Waiting for Ectd service to come up...
    wml-cpd-etcd:2379 is unhealthy: failed to connect: context deadline exceeded
    Error: unhealthy cluster
    Waiting for Ectd service to come up...
    
  2. If the logs show this error, restart these pods:

    • wml-cpd-etcd-0
    • wml-cpd-etcd-1
    • wml-cpd-etcd-2

  3. If the etcd pods restart without error, restart the wml-deployment-manager-<pod name> pod and verify that the deployment manager started successfully.

  4. Check the logs of the agent and envoy pods for etcd connection issues by using these commands:

    • oc logs wml-deployment-agent-0 -c init-container
    • oc logs -l app=wml-deployment-envoy -c init-container

    If the agent and envoy pods failed to connect to etcd after the restore, restart these pods:

    • wml-deployment-agent-0
    • wml-deployment-envoy-<pod name>

  5. If the issue is not fixed, reboot the cluster.
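
Steps 1 to 3 above can be sketched as a few shell helpers. This is a minimal sketch, not an official procedure. It assumes that the current oc project is the one where Watson Machine Learning is installed, and that deleting a pod causes its controller to recreate it, which is one common way to restart a pod. The function names are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Sketch only. Assumes the current oc project contains the
# Watson Machine Learning pods. Function names are illustrative.

# Succeeds if the log text on stdin contains the etcd failure
# message shown in the example output of step 1.
etcd_unhealthy() {
  grep -q 'unhealthy cluster'
}

# Step 1: succeeds if the deployment manager init container logs
# report that etcd is unhealthy.
manager_blocked_on_etcd() {
  oc logs -l app=wml-deployment-manager -c init-container 2>&1 | etcd_unhealthy
}

# Step 2: restart the etcd pods by deleting them
# (assumption: their controller recreates them).
restart_etcd_pods() {
  oc delete pod wml-cpd-etcd-0 wml-cpd-etcd-1 wml-cpd-etcd-2
}

# Step 3: restart the deployment manager pod, selected by the same
# label that step 1 uses to read its logs.
restart_deployment_manager() {
  oc delete pod -l app=wml-deployment-manager
}
```

A possible way to use the helpers is to run `manager_blocked_on_etcd && restart_etcd_pods`, confirm with `oc get pods` that the etcd pods are running again, and only then run `restart_deployment_manager`. The check between the two restarts is left manual here because step 3 depends on the etcd pods coming back without error.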
