Backing up and restoring the Watson Machine Learning Accelerator service
Use this information to back up or restore the IBM Watson® Machine Learning Accelerator service.
- Online backup and restore of Watson Machine Learning Accelerator
- Offline backup and restore of Watson Machine Learning Accelerator
- After restoring Watson Machine Learning Accelerator
Online backup and restore of Watson Machine Learning Accelerator
Online backup
To complete an online backup, see Cloud Pak for Data online backup and restore.
Online restore
After you restore the Watson Machine Learning Accelerator service using the Cloud Pak for Data restore process, you must run an additional script to restore owner references to all Watson Machine Learning Accelerator resources.
Before you begin:
Before performing an online restore, make sure that the Watson Machine Learning Accelerator namespace is deleted.
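To confirm that the namespace is gone before you start the restore, you can run a quick check such as the following (a minimal sketch; wmla-namespace is a placeholder for the project that the service was installed in). The command should fail with a NotFound error if the namespace was deleted:
  oc get namespace wmla-namespace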
- Log in to your OpenShift cluster as a project administrator:
  oc login OpenShift_URL:port
- Switch to the Watson Machine Learning Accelerator namespace:
  oc project wmla-namespace
- To restore owner references to the Watson Machine Learning Accelerator resources, run the following script, depending on your version of Watson Machine Learning Accelerator. An optional spot check that you can run after the script completes follows it.
#!/bin/bash
wmla_name=`oc get wmla -o name|awk -F/ '{print $NF}'`
wmla_uid=`oc get wmla $wmla_name -o jsonpath='{.metadata.uid}'`
user_pvc=`oc get wmla $wmla_name -o jsonpath={.spec.usePreCreatedPvcs}`
for r in \
  certificate.cert-manager.io/wmla-ca-crt \
  certificate.cert-manager.io/wmla-internal-keys \
  certificate.cert-manager.io/wmla-nginx-keys \
  certificate.cert-manager.io/wmla-internal-keys-ecdsa \
  certificate.cert-manager.io/wmla-nginx-keys-ecdsa \
  certificate.cert-manager.io/wmla-worker-keys \
  configmap/cpd-wmla-br-cm \
  configmap/cpd-wmla-ckpt-cm \
  configmap/cpd-wmla-qu-cm \
  configmap/cpd-wmla-add-on-br-cm \
  configmap/wmla-edi-lbd-nginx \
  configmap/wmla-gpu-types \
  configmap/wmla-install-info-cm \
  configmap/wmla-watchdog-conf \
  configmap/wmla-wml-accelerator-instance-cm \
  configmap/wmla-dlpd-bootstrap \
  configmap/wmla-edi \
  configmap/wmla-edi-dlim \
  configmap/wmla-edi-imd-nginx \
  configmap/wmla-edi-isd \
  configmap/wmla-edi-isd-ingress \
  configmap/wmla-grafana-configmap \
  configmap/wmla-grafana-ini \
  configmap/wmla-grafana-providers \
  configmap/wmla-infoservice \
  configmap/wmla-jupyter-hub-config \
  configmap/wmla-logstash-conf \
  configmap/wmla-mongodb-shells \
  configmap/wmla-msd \
  configmap/wmla-mss \
  configmap/wmla-nginx-conf \
  configmap/wmla-nginx-grafana-sidecar-conf \
  configmap/wmla-nginx-sidecar-conf \
  configmap/wmla-prometheus \
  configmap/wmla-version-info \
  configmap/wmlaconfigmap \
  deployment.apps/wmla-auth-rest \
  deployment.apps/wmla-conda \
  deployment.apps/wmla-dlpd \
  deployment.apps/wmla-edi-imd \
  deployment.apps/wmla-edi-lbd \
  deployment.apps/wmla-grafana \
  deployment.apps/wmla-gui \
  deployment.apps/wmla-infoservice \
  deployment.apps/wmla-ingress \
  deployment.apps/wmla-jupyter-gateway \
  deployment.apps/wmla-jupyter-hub \
  deployment.apps/wmla-jupyter-proxy \
  deployment.apps/wmla-logstash \
  deployment.apps/wmla-msd \
  deployment.apps/wmla-mss \
  deployment.apps/wmla-prometheus \
  deployment.apps/wmla-watchdog \
  horizontalpodautoscaler.autoscaling/wmla-auth-rest-hpa \
  horizontalpodautoscaler.autoscaling/wmla-dlpd-hpa \
  horizontalpodautoscaler.autoscaling/wmla-edi-lbd-hpa \
  horizontalpodautoscaler.autoscaling/wmla-gui-hpa \
  horizontalpodautoscaler.autoscaling/wmla-ingress-hpa \
  horizontalpodautoscaler.autoscaling/wmla-watchdog-hpa \
  ingress.networking.k8s.io/wmla-jupyter-ingress \
  issuer.cert-manager.io/wmla-ca \
  issuer.cert-manager.io/wmla-root-issuer \
  networkpolicy.networking.k8s.io/wmla-dlpd-netpol \
  networkpolicy.networking.k8s.io/wmla-edi-imd-network-policy \
  networkpolicy.networking.k8s.io/wmla-edi-isd-network-policy \
  networkpolicy.networking.k8s.io/wmla-infoservice-netpol \
  networkpolicy.networking.k8s.io/wmla-ingress-network-policy \
  networkpolicy.networking.k8s.io/wmla-logstash-network-policy \
  networkpolicy.networking.k8s.io/wmla-msd-netpol \
  networkpolicy.networking.k8s.io/wmla-namespace-network-policy \
  persistentvolumeclaim/wmla-conda \
  persistentvolumeclaim/wmla-cws-share \
  persistentvolumeclaim/wmla-edi \
  persistentvolumeclaim/wmla-infoservice \
  persistentvolumeclaim/wmla-logging \
  persistentvolumeclaim/wmla-mygpfs \
  persistentvolumeclaim/wmla-grafana \
  persistentvolumeclaim/wmla-prometheus \
  poddisruptionbudget.policy/wmla-jupyter-hub-pdb \
  poddisruptionbudget.policy/wmla-jupyter-proxy-pdb \
  role.rbac.authorization.k8s.io/wmla-core-role \
  role.rbac.authorization.k8s.io/wmla-edi \
  role.rbac.authorization.k8s.io/wmla-msd-mss \
  role.rbac.authorization.k8s.io/wmla-notebook-role \
  role.rbac.authorization.k8s.io/wmla-role \
  rolebinding.rbac.authorization.k8s.io/wmla-core-rb \
  rolebinding.rbac.authorization.k8s.io/wmla-edi \
  rolebinding.rbac.authorization.k8s.io/wmla-msd-mss \
  rolebinding.rbac.authorization.k8s.io/wmla-notebook-rb \
  rolebinding.rbac.authorization.k8s.io/wmla-rb \
  route.route.openshift.io/wmla-console \
  route.route.openshift.io/wmla-grafana \
  route.route.openshift.io/wmla-inference \
  route.route.openshift.io/wmla-jupyter-notebook \
  secret/wmla-dlpd-conf \
  secret/wmla-eg-secret \
  secret/wmla-grafana-secret \
  secret/wmla-jupyter-hub-secret \
  secret/wmla-mongodb-secret \
  secret/wmla-prometheus-htpasswd \
  service/wmla-auth-rest \
  service/wmla-dlpd \
  service/wmla-edi \
  service/wmla-edi-admin \
  service/wmla-etcd \
  service/wmla-grafana \
  service/wmla-gui \
  service/wmla-inference \
  service/wmla-infoservice \
  service/wmla-ingress \
  service/wmla-jupyter-enterprise-gateway \
  service/wmla-jupyter-hub \
  service/wmla-jupyter-proxy-api \
  service/wmla-jupyter-proxy-public \
  service/wmla-logstash-service \
  service/wmla-mongodb \
  service/wmla-msd \
  service/wmla-mss \
  service/wmla-prometheus \
  serviceaccount/wmla-core-sa \
  serviceaccount/wmla-msd-mss \
  serviceaccount/wmla-norbac \
  serviceaccount/wmla-notebook-sa \
  serviceaccount/wmla-sa \
  statefulset.apps/wmla-etcd \
  statefulset.apps/wmla-mongodb \
  zenextension/zen-wmla-frontdoor-extension \
  zenextension/zen-wmla-edi-frontdoor-extension \
  wmla-add-on.spectrumcomputing.ibm.com/wmla; do
  oc get $r >& /dev/null
  if [ $? == "0" ]; then
    #skip patch user pvc
    if [ x$user_pvc == 'xtrue' ];then
      resourcetype=`echo $r|awk -F'/' '{print $1}'`
      if [ x$resourcetype == 'xpersistentvolumeclaim' ];then
        echo "skip user defined PVC $r"
        continue
      fi
    fi
    echo "Patch ownerReferences for $r"
    oc patch $r --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla\",\"name\":\"$wmla_name\",\"uid\":\"$wmla_uid\"}]}}"
  fi
done

#update ownerReferences for wmla resource plans
wmla_rps=`oc get rp -o name`
ns_rp=`oc get rp platform -o jsonpath={.spec.parent}`
wmla_fix_rp=`oc get rp platform -o jsonpath={.spec.children[0].name}`
cpd_fix_rp=`oc get rp platform -o jsonpath={.spec.children[1].name}`
for r in $wmla_rps; do
  #skip scheduler created resource plans
  rp_name=`echo $r|awk -F'/' '{print $2}'`
  if [ x$rp_name == "xplatform" -o x$rp_name == "x$ns_rp" -o x$rp_name == "x$wmla_fix_rp" -o x$rp_name == "x$cpd_fix_rp" ];then
    echo "skip resource plan $rp_name"
    continue
  fi
  echo "Patch ownerReferences for $r"
  oc patch $r --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla\",\"name\":\"$wmla_name\",\"uid\":\"$wmla_uid\"}]}}"
done

#update ownerReferences for deploy/isd and isd/service
isds=`oc get deploy -o name|grep wmla-edi-isd`
imd_uid=`oc get deploy wmla-edi-imd -o jsonpath='{.metadata.uid}'`
for r in $isds; do
  echo "Patch ownerReferences for $r"
  oc patch $r --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"apps/v1\",\"blockOwnerDeletion\":true,\"controller\":true,\"kind\":\"Deployment\",\"name\":\"wmla-edi-imd\",\"uid\":\"$imd_uid\"}]}}"
done
isd_services=`oc get services -o name|grep wmla-edi-isd`
for r in $isd_services; do
  isd_name=`echo $r|awk -F/ '{print $NF}'`
  isd_uid=`oc get deploy $isd_name -o jsonpath='{.metadata.uid}'`
  echo "Patch ownerReferences for $r"
  oc patch $r --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"apps/v1\",\"blockOwnerDeletion\":true,\"controller\":true,\"kind\":\"Deployment\",\"name\":\"$isd_name\",\"uid\":\"$isd_uid\"}]}}"
done

#update ownerReferences for wmla-add-on cm
wmla_add_on_name=`oc get wmla-add-on -o name|awk -F/ '{print $NF}'`
if [ x$wmla_add_on_name != x ];then
  wmla_add_on_uid=`oc get wmla-add-on $wmla_add_on_name -o jsonpath='{.metadata.uid}'`
  oc patch configmap/cpd-wmla-add-on-br-cm --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla-add-on\",\"name\":\"$wmla_add_on_name\",\"uid\":\"$wmla_add_on_uid\"}]}}"
  wmla_instance_cm=`oc get cm -o name|grep wml-accelerator-instance-cm`
  oc patch $wmla_instance_cm --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla-add-on\",\"name\":\"$wmla_add_on_name\",\"uid\":\"$wmla_add_on_uid\"}]}}"
  wmla_connection_cm=`oc get cm -o name|grep wml-accelerator-connection-info-extension`
  if [[ 'x' != x"$wmla_connection_cm" ]];then
    oc patch $wmla_connection_cm --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla-add-on\",\"name\":\"$wmla_add_on_name\",\"uid\":\"$wmla_add_on_uid\"}]}}"
  fi
  wmla_zen_extension=`oc get zenextension -o name|grep wml-accelerator-zen-extension|awk '{print $1}'`
  if [[ 'x' != x"$wmla_zen_extension" ]];then
    oc patch $wmla_zen_extension --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla-add-on\",\"name\":\"$wmla_add_on_name\",\"uid\":\"$wmla_add_on_uid\"}]}}"
  fi
fi

#remove unused sa docker config secret
sa_secrets=`oc get secret --field-selector type=kubernetes.io/dockercfg -o name|grep 'secret/wmla'`
for s in $sa_secrets;do
  owner=`oc get $s -o jsonpath='{.metadata.ownerReferences}' 2> /dev/null`
  if [ x$owner == 'x' ];then
    echo "remove $s"
    oc delete $s
  fi
done
Offline backup and restore of Watson Machine Learning Accelerator
Offline backup
Before you complete an offline backup of the Watson Machine Learning Accelerator service using the standard backup process, you must stop all running workloads.
- Stop all running workloads (an optional check that no workload pods remain follows this list):
  - As a Watson Machine Learning Accelerator project administrator, stop all running jobs. See Stopping an application.
  - Stop all running deployed models. Use the WML Accelerator command line interface to stop each running model. See Stop an inference service.
- Back up the Watson Machine Learning Accelerator service by using the standard backup process. See Cloud Pak for Data offline backup and restore (OADP utility).
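The following is an optional, minimal check that you can run before you take the backup. It assumes that your Watson Machine Learning Accelerator project is named wmla-namespace; review the output and confirm that no training job or inference service pods are still running:
  oc get pods -n wmla-namespace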
Offline restore
After you restore the Watson Machine Learning Accelerator service using the Cloud Pak for Data restore process, you must run an additional script to restore owner references to all Watson Machine Learning Accelerator resources.
Before you begin:
Before performing an offline restore, make sure that the Watson Machine Learning Accelerator namespace is deleted.
- Log in to your OpenShift cluster as a project administrator:
  oc login OpenShift_URL:port
- Switch to the Watson Machine Learning Accelerator namespace:
  oc project wmla-namespace
- To restore owner references to the Watson Machine Learning Accelerator resources, run the following script, depending on your version of Watson Machine Learning Accelerator. An optional spot check that you can run after the script completes follows it.
#!/bin/bash
wmla_name=`oc get wmla -o name|awk -F/ '{print $NF}'`
wmla_uid=`oc get wmla $wmla_name -o jsonpath='{.metadata.uid}'`
user_pvc=`oc get wmla $wmla_name -o jsonpath={.spec.usePreCreatedPvcs}`
for r in \
  certificate.cert-manager.io/wmla-ca-crt \
  certificate.cert-manager.io/wmla-internal-keys \
  certificate.cert-manager.io/wmla-nginx-keys \
  certificate.cert-manager.io/wmla-internal-keys-ecdsa \
  certificate.cert-manager.io/wmla-nginx-keys-ecdsa \
  certificate.cert-manager.io/wmla-worker-keys \
  configmap/cpd-wmla-br-cm \
  configmap/cpd-wmla-ckpt-cm \
  configmap/cpd-wmla-qu-cm \
  configmap/cpd-wmla-add-on-br-cm \
  configmap/wmla-edi-lbd-nginx \
  configmap/wmla-gpu-types \
  configmap/wmla-install-info-cm \
  configmap/wmla-watchdog-conf \
  configmap/wmla-wml-accelerator-instance-cm \
  configmap/wmla-dlpd-bootstrap \
  configmap/wmla-edi \
  configmap/wmla-edi-dlim \
  configmap/wmla-edi-imd-nginx \
  configmap/wmla-edi-isd \
  configmap/wmla-edi-isd-ingress \
  configmap/wmla-grafana-configmap \
  configmap/wmla-grafana-ini \
  configmap/wmla-grafana-providers \
  configmap/wmla-infoservice \
  configmap/wmla-jupyter-hub-config \
  configmap/wmla-logstash-conf \
  configmap/wmla-mongodb-shells \
  configmap/wmla-msd \
  configmap/wmla-mss \
  configmap/wmla-nginx-conf \
  configmap/wmla-nginx-grafana-sidecar-conf \
  configmap/wmla-nginx-sidecar-conf \
  configmap/wmla-prometheus \
  configmap/wmla-version-info \
  configmap/wmlaconfigmap \
  deployment.apps/wmla-auth-rest \
  deployment.apps/wmla-conda \
  deployment.apps/wmla-dlpd \
  deployment.apps/wmla-edi-imd \
  deployment.apps/wmla-edi-lbd \
  deployment.apps/wmla-grafana \
  deployment.apps/wmla-gui \
  deployment.apps/wmla-infoservice \
  deployment.apps/wmla-ingress \
  deployment.apps/wmla-jupyter-gateway \
  deployment.apps/wmla-jupyter-hub \
  deployment.apps/wmla-jupyter-proxy \
  deployment.apps/wmla-logstash \
  deployment.apps/wmla-msd \
  deployment.apps/wmla-mss \
  deployment.apps/wmla-prometheus \
  deployment.apps/wmla-watchdog \
  horizontalpodautoscaler.autoscaling/wmla-auth-rest-hpa \
  horizontalpodautoscaler.autoscaling/wmla-dlpd-hpa \
  horizontalpodautoscaler.autoscaling/wmla-edi-lbd-hpa \
  horizontalpodautoscaler.autoscaling/wmla-gui-hpa \
  horizontalpodautoscaler.autoscaling/wmla-ingress-hpa \
  horizontalpodautoscaler.autoscaling/wmla-watchdog-hpa \
  ingress.networking.k8s.io/wmla-jupyter-ingress \
  issuer.cert-manager.io/wmla-ca \
  issuer.cert-manager.io/wmla-root-issuer \
  networkpolicy.networking.k8s.io/wmla-dlpd-netpol \
  networkpolicy.networking.k8s.io/wmla-edi-imd-network-policy \
  networkpolicy.networking.k8s.io/wmla-edi-isd-network-policy \
  networkpolicy.networking.k8s.io/wmla-infoservice-netpol \
  networkpolicy.networking.k8s.io/wmla-ingress-network-policy \
  networkpolicy.networking.k8s.io/wmla-logstash-network-policy \
  networkpolicy.networking.k8s.io/wmla-msd-netpol \
  networkpolicy.networking.k8s.io/wmla-namespace-network-policy \
  persistentvolumeclaim/wmla-conda \
  persistentvolumeclaim/wmla-cws-share \
  persistentvolumeclaim/wmla-edi \
  persistentvolumeclaim/wmla-infoservice \
  persistentvolumeclaim/wmla-logging \
  persistentvolumeclaim/wmla-mygpfs \
  persistentvolumeclaim/wmla-grafana \
  persistentvolumeclaim/wmla-prometheus \
  poddisruptionbudget.policy/wmla-jupyter-hub-pdb \
  poddisruptionbudget.policy/wmla-jupyter-proxy-pdb \
  role.rbac.authorization.k8s.io/wmla-core-role \
  role.rbac.authorization.k8s.io/wmla-edi \
  role.rbac.authorization.k8s.io/wmla-msd-mss \
  role.rbac.authorization.k8s.io/wmla-notebook-role \
  role.rbac.authorization.k8s.io/wmla-role \
  rolebinding.rbac.authorization.k8s.io/wmla-core-rb \
  rolebinding.rbac.authorization.k8s.io/wmla-edi \
  rolebinding.rbac.authorization.k8s.io/wmla-msd-mss \
  rolebinding.rbac.authorization.k8s.io/wmla-notebook-rb \
  rolebinding.rbac.authorization.k8s.io/wmla-rb \
  route.route.openshift.io/wmla-console \
  route.route.openshift.io/wmla-grafana \
  route.route.openshift.io/wmla-inference \
  route.route.openshift.io/wmla-jupyter-notebook \
  secret/wmla-dlpd-conf \
  secret/wmla-eg-secret \
  secret/wmla-grafana-secret \
  secret/wmla-jupyter-hub-secret \
  secret/wmla-mongodb-secret \
  secret/wmla-prometheus-htpasswd \
  service/wmla-auth-rest \
  service/wmla-dlpd \
  service/wmla-edi \
  service/wmla-edi-admin \
  service/wmla-etcd \
  service/wmla-grafana \
  service/wmla-gui \
  service/wmla-inference \
  service/wmla-infoservice \
  service/wmla-ingress \
  service/wmla-jupyter-enterprise-gateway \
  service/wmla-jupyter-hub \
  service/wmla-jupyter-proxy-api \
  service/wmla-jupyter-proxy-public \
  service/wmla-logstash-service \
  service/wmla-mongodb \
  service/wmla-msd \
  service/wmla-mss \
  service/wmla-prometheus \
  serviceaccount/wmla-core-sa \
  serviceaccount/wmla-msd-mss \
  serviceaccount/wmla-norbac \
  serviceaccount/wmla-notebook-sa \
  serviceaccount/wmla-sa \
  statefulset.apps/wmla-etcd \
  statefulset.apps/wmla-mongodb \
  zenextension/zen-wmla-frontdoor-extension \
  zenextension/zen-wmla-edi-frontdoor-extension \
  wmla-add-on.spectrumcomputing.ibm.com/wmla; do
  oc get $r >& /dev/null
  if [ $? == "0" ]; then
    #skip patch user pvc
    if [ x$user_pvc == 'xtrue' ];then
      resourcetype=`echo $r|awk -F'/' '{print $1}'`
      if [ x$resourcetype == 'xpersistentvolumeclaim' ];then
        echo "skip user defined PVC $r"
        continue
      fi
    fi
    echo "Patch ownerReferences for $r"
    oc patch $r --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla\",\"name\":\"$wmla_name\",\"uid\":\"$wmla_uid\"}]}}"
  fi
done

#update ownerReferences for wmla resource plans
wmla_rps=`oc get rp -o name`
ns_rp=`oc get rp platform -o jsonpath={.spec.parent}`
wmla_fix_rp=`oc get rp platform -o jsonpath={.spec.children[0].name}`
cpd_fix_rp=`oc get rp platform -o jsonpath={.spec.children[1].name}`
for r in $wmla_rps; do
  #skip scheduler created resource plans
  rp_name=`echo $r|awk -F'/' '{print $2}'`
  if [ x$rp_name == "xplatform" -o x$rp_name == "x$ns_rp" -o x$rp_name == "x$wmla_fix_rp" -o x$rp_name == "x$cpd_fix_rp" ];then
    echo "skip resource plan $rp_name"
    continue
  fi
  echo "Patch ownerReferences for $r"
  oc patch $r --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla\",\"name\":\"$wmla_name\",\"uid\":\"$wmla_uid\"}]}}"
done

#update ownerReferences for deploy/isd and isd/service
isds=`oc get deploy -o name|grep wmla-edi-isd`
imd_uid=`oc get deploy wmla-edi-imd -o jsonpath='{.metadata.uid}'`
for r in $isds; do
  echo "Patch ownerReferences for $r"
  oc patch $r --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"apps/v1\",\"blockOwnerDeletion\":true,\"controller\":true,\"kind\":\"Deployment\",\"name\":\"wmla-edi-imd\",\"uid\":\"$imd_uid\"}]}}"
done
isd_services=`oc get services -o name|grep wmla-edi-isd`
for r in $isd_services; do
  isd_name=`echo $r|awk -F/ '{print $NF}'`
  isd_uid=`oc get deploy $isd_name -o jsonpath='{.metadata.uid}'`
  echo "Patch ownerReferences for $r"
  oc patch $r --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"apps/v1\",\"blockOwnerDeletion\":true,\"controller\":true,\"kind\":\"Deployment\",\"name\":\"$isd_name\",\"uid\":\"$isd_uid\"}]}}"
done

#update ownerReferences for wmla-add-on cm
wmla_add_on_name=`oc get wmla-add-on -o name|awk -F/ '{print $NF}'`
if [ x$wmla_add_on_name != x ];then
  wmla_add_on_uid=`oc get wmla-add-on $wmla_add_on_name -o jsonpath='{.metadata.uid}'`
  oc patch configmap/cpd-wmla-add-on-br-cm --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla-add-on\",\"name\":\"$wmla_add_on_name\",\"uid\":\"$wmla_add_on_uid\"}]}}"
  wmla_instance_cm=`oc get cm -o name|grep wml-accelerator-instance-cm`
  oc patch $wmla_instance_cm --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla-add-on\",\"name\":\"$wmla_add_on_name\",\"uid\":\"$wmla_add_on_uid\"}]}}"
  wmla_connection_cm=`oc get cm -o name|grep wml-accelerator-connection-info-extension`
  if [[ 'x' != x"$wmla_connection_cm" ]];then
    oc patch $wmla_connection_cm --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla-add-on\",\"name\":\"$wmla_add_on_name\",\"uid\":\"$wmla_add_on_uid\"}]}}"
  fi
  wmla_zen_extension=`oc get zenextension -o name|grep wml-accelerator-zen-extension|awk '{print $1}'`
  if [[ 'x' != x"$wmla_zen_extension" ]];then
    oc patch $wmla_zen_extension --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla-add-on\",\"name\":\"$wmla_add_on_name\",\"uid\":\"$wmla_add_on_uid\"}]}}"
  fi
fi

#remove unused sa docker config secret
sa_secrets=`oc get secret --field-selector type=kubernetes.io/dockercfg -o name|grep 'secret/wmla'`
for s in $sa_secrets;do
  owner=`oc get $s -o jsonpath='{.metadata.ownerReferences}' 2> /dev/null`
  if [ x$owner == 'x' ];then
    echo "remove $s"
    oc delete $s
  fi
done
After restoring Watson Machine Learning Accelerator
After restoring Watson Machine Learning Accelerator, make sure to address the following known issues.
Known issue:
After restoring Watson Machine Learning Accelerator, a known issue exists where the wmla-etcd cluster is unhealthy and fails to get the status of one of its endpoints.
- Run the following command to check the status of the wmla-etcd cluster:
  oc exec -it wmla-etcd-0 -- bash -c "ETCDCTL_API=3 etcdctl --cacert=/etc/pki/etcd/ca.crt --cert=/etc/pki/etcd/tls.crt --key=/etc/pki/etcd/tls.key --insecure-skip-tls-verify endpoint status --cluster"
  The following error is displayed if the cluster is unhealthy:
  Defaulted container "etcd" out of: etcd, init-data-dir (init)
  {"level":"warn","ts":"2023-06-19T12:01:20.476Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://wmla-etcd-2.wmla-etcd:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp: lookup wmla-etcd-2.wmla-etcd on 172.30.0.10:53: no such host\""}
  Failed to get the status of endpoint https://wmla-etcd-2.wmla-etcd:2379 (context deadline exceeded)
  https://wmla-etcd-1.wmla-etcd:2379, 968d327db883b4b4, 3.3.27, 9.0 MB, true, 2715, 2044
  https://wmla-etcd-0.wmla-etcd:2379, f5b85a4577d2c8db, 3.3.27, 9.1 MB, false, 2715, 2044
- Modify the wmla-etcd statefulset so that the failed pod can be prepared for maintenance:
  oc edit statefulset wmla-etcd
  - Remove the livenessProbe and readinessProbe by deleting the following lines:
    livenessProbe:
      failureThreshold: 3
      initialDelaySeconds: 60
      periodSeconds: 30
      successThreshold: 1
      tcpSocket:
        port: 2379
      timeoutSeconds: 1
    readinessProbe:
      failureThreshold: 3
      initialDelaySeconds: 10
      periodSeconds: 20
      successThreshold: 1
      tcpSocket:
        port: 2379
      timeoutSeconds: 1
  - Modify the container command as follows:
    containers:
    - command:
      - /bin/sh
      - -c
      - |
        PEERS="wmla-etcd-0=https://wmla-etcd-0.wmla-etcd:2380,wmla-etcd-1=https://wmla-etcd-1.wmla-etcd:2380,wmla-etcd-2=https://wmla-etcd-2.wmla-etcd:2380"
        ETCD_INITIAL_CLUSTER_STATE="new"
        if [ "$WMLA_ETCD_FAILURE_NODE" == "$HOSTNAME" -a ! -f /var/run/etcd/${HOSTNAME}.etcd/_recovered ]; then
          rm -rf /var/run/etcd/${HOSTNAME}.etcd
          echo "Restore ${HOSTNAME} in maintenance ..."
          ETCD_INITIAL_CLUSTER_STATE="existing"
          sleep 5
          mkdir -p /var/run/etcd/${HOSTNAME}.etcd
          touch /var/run/etcd/${HOSTNAME}.etcd/_recovered
        fi
        exec etcd --name ${HOSTNAME} \
          --listen-peer-urls https://0.0.0.0:2380 \
          --listen-client-urls https://0.0.0.0:2379 \
          --advertise-client-urls https://${HOSTNAME}.wmla-etcd:2379 \
          --initial-advertise-peer-urls https://${HOSTNAME}:2380 \
          --initial-cluster-token wmla-etcd-cluster \
          --initial-cluster ${PEERS} \
          --initial-cluster-state ${ETCD_INITIAL_CLUSTER_STATE} \
          --data-dir /var/run/etcd/${HOSTNAME}.etcd \
          --cert-file=/etc/pki/etcd/tls.crt \
          --key-file=/etc/pki/etcd/tls.key \
          --trusted-ca-file=/etc/pki/etcd/ca.crt \
          --client-cert-auth \
          --peer-cert-file=/etc/pki/etcd/tls.crt \
          --peer-key-file=/etc/pki/etcd/tls.key \
          --peer-trusted-ca-file=/etc/pki/etcd/ca.crt \
          --peer-client-cert-auth \
          --quota-backend-bytes=8589934592 \
          --cipher-suites TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
- Set the WMLA_ETCD_FAILURE_NODE environment variable to wmla-etcd-2, the pod that failed in this example:
  oc set env statefulset/wmla-etcd WMLA_ETCD_FAILURE_NODE=wmla-etcd-2
- Restart the failed pod:
  oc delete pod wmla-etcd-2
- Run the following command to check the status of the wmla-etcd cluster again:
  oc exec -it wmla-etcd-0 -- bash -c "ETCDCTL_API=3 etcdctl --cacert=/etc/pki/etcd/ca.crt --cert=/etc/pki/etcd/tls.crt --key=/etc/pki/etcd/tls.key --insecure-skip-tls-verify endpoint status --cluster"
  The following output is displayed for a healthy cluster:
  Defaulted container "etcd" out of: etcd, init-data-dir (init)
  https://wmla-etcd-1.wmla-etcd:2379, 968d327db883b4b4, 3.3.27, 1.4 MB, true, 2815, 2094
  https://wmla-etcd-2.wmla-etcd:2379, cc26316c8c459e22, 3.3.27, 2.7 MB, false, 2815, 2094
  https://wmla-etcd-0.wmla-etcd:2379, f5b85a4577d2c8db, 3.3.27, 1.4 MB, false, 2815, 2094
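In addition to endpoint status, the etcdctl client also provides an endpoint health subcommand. As an optional extra check, you can run it with the same certificates that are used in the status command above; each member of a healthy cluster should be reported as healthy:
  oc exec -it wmla-etcd-0 -- bash -c "ETCDCTL_API=3 etcdctl --cacert=/etc/pki/etcd/ca.crt --cert=/etc/pki/etcd/tls.crt --key=/etc/pki/etcd/tls.key --insecure-skip-tls-verify endpoint health --cluster"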
Known issue:
After restoring Watson Machine Learning Accelerator, the wmla-mongodb-1 or wmla-mongodb-2 pod may fail to start.
If this issue has occurred, complete the following steps to start the pods. Depending on the status of the cluster, this procedure may take several minutes to complete.
- Scale down the MongoDB service to replica number 1:
  oc scale --replicas=1 sts wmla-mongodb -n <wmla_instance_namespace>
- Wait for the MongoDB pods to scale down and stabilize.
- Remove the wmla-mongodb-1 and wmla-mongodb-2 PVCs. Do not delete the wmla-mongodb-0 PVC.
  oc delete pvc data-wmla-mongodb-1 data-wmla-mongodb-2 -n <wmla_instance_namespace>
- Scale up the MongoDB service to replica number 3:
  oc scale --replicas=3 sts wmla-mongodb -n <wmla_instance_namespace>
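After you scale the service back up, you can optionally confirm that all three MongoDB pods start. The following minimal check lists the MongoDB pods; wmla-mongodb-0, wmla-mongodb-1, and wmla-mongodb-2 should all eventually report a Running status with all of their containers ready:
  oc get pods -n <wmla_instance_namespace> | grep wmla-mongodb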