Backing up and restoring the Watson Machine Learning Accelerator service
Use this information to back up or restore the IBM Watson® Machine Learning Accelerator service.
- Online backup and restore of Watson Machine Learning Accelerator
- Offline backup and restore of Watson Machine Learning Accelerator
- After restoring Watson Machine Learning Accelerator
Online backup and restore of Watson Machine Learning Accelerator
Online backup
To complete an online backup, see Cloud Pak for Data online backup and restore.
Online restore
After you restore the Watson Machine Learning Accelerator service using the Cloud Pak for Data restore process, you must run an additional script to restore owner references to all Watson Machine Learning Accelerator resources.
Before you begin:
Before performing an online restore, make sure that the Watson Machine Learning Accelerator namespace is deleted.
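You can quickly confirm this prerequisite before you start the Cloud Pak for Data restore. This is a minimal sketch; wmla-namespace stands for your Watson Machine Learning Accelerator project name, matching the placeholder used in the steps that follow.
  # The project should be reported as not found if the namespace was deleted before the restore.
  oc get project wmla-namespace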
- Log in to your OpenShift cluster as a project administrator:
  oc login OpenShift_URL:port
- Switch to the Watson Machine Learning Accelerator namespace:
  oc project wmla-namespace
- To restore owner references to Watson Machine Learning Accelerator resources, run the following script, depending on your version of Watson Machine Learning Accelerator:
#!/bin/bash
wmla_name=`oc get wmla -o name|awk -F/ '{print $NF}'`
wmla_uid=`oc get wmla $wmla_name -o jsonpath='{.metadata.uid}'`
user_pvc=`oc get wmla $wmla_name -o jsonpath={.spec.usePreCreatedPvcs}`
for r in \
  certificate.cert-manager.io/wmla-ca-crt \
  certificate.cert-manager.io/wmla-internal-keys \
  certificate.cert-manager.io/wmla-nginx-keys \
  certificate.cert-manager.io/wmla-internal-keys-ecdsa \
  certificate.cert-manager.io/wmla-nginx-keys-ecdsa \
  certificate.cert-manager.io/wmla-worker-keys \
  configmap/cpd-wmla-br-cm \
  configmap/cpd-wmla-ckpt-cm \
  configmap/cpd-wmla-qu-cm \
  configmap/cpd-wmla-add-on-br-cm \
  configmap/wmla-edi-lbd-nginx \
  configmap/wmla-gpu-types \
  configmap/wmla-install-info-cm \
  configmap/wmla-watchdog-conf \
  configmap/wmla-wml-accelerator-instance-cm \
  configmap/wmla-dlpd-bootstrap \
  configmap/wmla-edi \
  configmap/wmla-edi-dlim \
  configmap/wmla-edi-imd-nginx \
  configmap/wmla-edi-isd \
  configmap/wmla-edi-isd-ingress \
  configmap/wmla-grafana-configmap \
  configmap/wmla-grafana-ini \
  configmap/wmla-grafana-providers \
  configmap/wmla-infoservice \
  configmap/wmla-jupyter-hub-config \
  configmap/wmla-logstash-conf \
  configmap/wmla-mongodb-shells \
  configmap/wmla-msd \
  configmap/wmla-mss \
  configmap/wmla-nginx-conf \
  configmap/wmla-nginx-grafana-sidecar-conf \
  configmap/wmla-nginx-sidecar-conf \
  configmap/wmla-prometheus \
  configmap/wmla-version-info \
  deployment.apps/wmla-auth-rest \
  deployment.apps/wmla-conda \
  deployment.apps/wmla-dlpd \
  deployment.apps/wmla-edi-imd \
  deployment.apps/wmla-edi-lbd \
  deployment.apps/wmla-grafana \
  deployment.apps/wmla-gui \
  deployment.apps/wmla-infoservice \
  deployment.apps/wmla-ingress \
  deployment.apps/wmla-jupyter-gateway \
  deployment.apps/wmla-jupyter-hub \
  deployment.apps/wmla-jupyter-proxy \
  deployment.apps/wmla-logstash \
  deployment.apps/wmla-msd \
  deployment.apps/wmla-mss \
  deployment.apps/wmla-prometheus \
  deployment.apps/wmla-watchdog \
  horizontalpodautoscaler.autoscaling/wmla-auth-rest-hpa \
  horizontalpodautoscaler.autoscaling/wmla-dlpd-hpa \
  horizontalpodautoscaler.autoscaling/wmla-edi-lbd-hpa \
  horizontalpodautoscaler.autoscaling/wmla-gui-hpa \
  horizontalpodautoscaler.autoscaling/wmla-ingress-hpa \
  horizontalpodautoscaler.autoscaling/wmla-watchdog-hpa \
  ingress.networking.k8s.io/wmla-jupyter-ingress \
  issuer.cert-manager.io/wmla-ca \
  issuer.cert-manager.io/wmla-root-issuer \
  networkpolicy.networking.k8s.io/wmla-dlpd-netpol \
  networkpolicy.networking.k8s.io/wmla-edi-imd-network-policy \
  networkpolicy.networking.k8s.io/wmla-edi-isd-network-policy \
  networkpolicy.networking.k8s.io/wmla-infoservice-netpol \
  networkpolicy.networking.k8s.io/wmla-ingress-network-policy \
  networkpolicy.networking.k8s.io/wmla-logstash-network-policy \
  networkpolicy.networking.k8s.io/wmla-msd-netpol \
  networkpolicy.networking.k8s.io/wmla-namespace-network-policy \
  persistentvolumeclaim/wmla-conda \
  persistentvolumeclaim/wmla-cws-share \
  persistentvolumeclaim/wmla-edi \
  persistentvolumeclaim/wmla-infoservice \
  persistentvolumeclaim/wmla-logging \
  persistentvolumeclaim/wmla-mygpfs \
  persistentvolumeclaim/wmla-grafana \
  persistentvolumeclaim/wmla-prometheus \
  poddisruptionbudget.policy/wmla-jupyter-hub-pdb \
  poddisruptionbudget.policy/wmla-jupyter-proxy-pdb \
  role.rbac.authorization.k8s.io/wmla-core-role \
  role.rbac.authorization.k8s.io/wmla-edi \
  role.rbac.authorization.k8s.io/wmla-msd-mss \
  role.rbac.authorization.k8s.io/wmla-notebook-role \
  role.rbac.authorization.k8s.io/wmla-role \
  rolebinding.rbac.authorization.k8s.io/wmla-core-rb \
  rolebinding.rbac.authorization.k8s.io/wmla-edi \
  rolebinding.rbac.authorization.k8s.io/wmla-msd-mss \
  rolebinding.rbac.authorization.k8s.io/wmla-notebook-rb \
  rolebinding.rbac.authorization.k8s.io/wmla-rb \
  route.route.openshift.io/wmla-console \
  route.route.openshift.io/wmla-grafana \
  route.route.openshift.io/wmla-inference \
  route.route.openshift.io/wmla-jupyter-notebook \
  secret/wmla-dlpd-conf \
  secret/wmla-eg-secret \
  secret/wmla-grafana-secret \
  secret/wmla-jupyter-hub-secret \
  secret/wmla-mongodb-secret \
  secret/wmla-prometheus-htpasswd \
  service/wmla-auth-rest \
  service/wmla-dlpd \
  service/wmla-edi \
  service/wmla-edi-admin \
  service/wmla-etcd \
  service/wmla-grafana \
  service/wmla-gui \
  service/wmla-inference \
  service/wmla-infoservice \
  service/wmla-ingress \
  service/wmla-jupyter-enterprise-gateway \
  service/wmla-jupyter-hub \
  service/wmla-jupyter-proxy-api \
  service/wmla-jupyter-proxy-public \
  service/wmla-logstash-service \
  service/wmla-mongodb \
  service/wmla-msd \
  service/wmla-mss \
  service/wmla-prometheus \
  serviceaccount/wmla-core-sa \
  serviceaccount/wmla-msd-mss \
  serviceaccount/wmla-norbac \
  serviceaccount/wmla-notebook-sa \
  serviceaccount/wmla-sa \
  statefulset.apps/wmla-etcd \
  statefulset.apps/wmla-mongodb \
  wmla-add-on.spectrumcomputing.ibm.com/wmla; do
  oc get $r >& /dev/null
  if [ $? == "0" ]; then
    #skip patch user pvc
    if [ x$user_pvc == 'xtrue' ];then
      resoucetype=`echo $r|awk -F'/' '{print $1}'`
      if [ x$resoucetype == 'xpersistentvolumeclaim' ];then
        echo "skip user defined PVC $r"
        continue
      fi
    fi
    echo "Patch ownerReferences for $r"
    oc patch $r --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla\",\"name\":\"$wmla_name\",\"uid\":\"$wmla_uid\"}]}}"
  fi
done

#update ownerReferences for wmla resource plans
wmla_rps=`oc get rp -o name`
ns_rp=`oc get rp platform -o jsonpath={.spec.parent}`
wmla_fix_rp=`oc get rp platform -o jsonpath={.spec.children[0].name}`
cpd_fix_rp=`oc get rp platform -o jsonpath={.spec.children[1].name}`
for r in $wmla_rps; do
  #skip scheduler created resource plans
  rp_name=`echo $r|awk -F'/' '{print $2}'`
  if [ x$rp_name == "xplatform" -o x$rp_name == "x$ns_rp" -o x$rp_name == "x$wmla_fix_rp" -o x$rp_name == "x$cpd_fix_rp" ];then
    echo "skip resource plan $rp_name"
    continue
  fi
  echo "Patch ownerReferences for $r"
  oc patch $r --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla\",\"name\":\"$wmla_name\",\"uid\":\"$wmla_uid\"}]}}"
done

#update ownerReferences for deploy/isd and isd/service
isds=`oc get deploy -o name|grep wmla-edi-isd`
imd_uid=`oc get deploy wmla-edi-imd -o jsonpath='{.metadata.uid}'`
for r in $isds; do
  echo "Patch ownerReferences for $r"
  oc patch $r --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"apps/v1\",\"blockOwnerDeletion\":true,\"controller\":true,\"kind\":\"Deployment\",\"name\":\"wmla-edi-imd\",\"uid\":\"$imd_uid\"}]}}"
done
isd_servicess=`oc get services -o name|grep wmla-edi-isd`
for r in $isd_servicess; do
  isd_name=`echo $r|awk -F/ '{print $NF}'`
  isd_uid=`oc get deploy $isd_name -o jsonpath='{.metadata.uid}'`
  echo "Patch ownerReferences for $r"
  oc patch $r --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"apps/v1\",\"blockOwnerDeletion\":true,\"controller\":true,\"kind\":\"Deployment\",\"name\":\"$isd_name\",\"uid\":\"$isd_uid\"}]}}"
done

#update ownerReferences for wmla-add-on cm
wmla_add_on_name=`oc get wmla-add-on -o name|awk -F/ '{print $NF}'`
if [ x$wmla_add_on_name != x ];then
  wmla_add_on_uid=`oc get wmla-add-on $wmla_add_on_name -o jsonpath='{.metadata.uid}'`
  oc patch configmap/cpd-wmla-add-on-br-cm --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla-add-on\",\"name\":\"$wmla_add_on_name\",\"uid\":\"$wmla_add_on_uid\"}]}}"
  wmla_instance_cm=`oc get cm -o name|grep wml-accelerator-instance-cm`
  oc patch $wmla_instance_cm --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla-add-on\",\"name\":\"$wmla_add_on_name\",\"uid\":\"$wmla_add_on_uid\"}]}}"
  wmla_connection_cm=`oc get cm -o name|grep wml-accelerator-connection-info-extension`
  if [[ 'x' != x"$wmla_connection_cm" ]];then
    oc patch $wmla_connection_cm --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla-add-on\",\"name\":\"$wmla_add_on_name\",\"uid\":\"$wmla_add_on_uid\"}]}}"
  fi
fi

retry=1
crash='Y'
MAX_RETRY=60
echo "checking mongodb status..."
oc get po|grep -E 'wmla-mongodb-1|wmla-mongodb-2'|grep CrashLoopBackOff
until [[ $? == '0' ]]; do
  if [ $retry -ge $MAX_RETRY ]; then
    crash='N'
    echo "Not found mongodb pod crash"
    oc get po|grep -E 'wmla-mongodb'
    break
  fi
  sleep 1
  let "retry += 1"
  oc get po|grep -E 'wmla-mongodb-1|wmla-mongodb-2'|grep CrashLoopBackOff
done
if [ $crash == 'Y' ]; then
  echo "Found mongodb pod crash"
  oc get po|grep -E 'wmla-mongodb'
  oc scale --replicas 0 sts wmla-mongodb
  oc delete pvc data-wmla-mongodb-1 data-wmla-mongodb-2
  oc scale --replicas=3 sts wmla-mongodb
fi

#workaround for etcd unhealthy, check 18 times (the result is from the last 3 times)
MAX_RETRY=18
echo "checking wmla-etcd status (in $MAX_RETRY rounds)..."
dead_node=""
result=0
healthy_node_count=0
ETCD_CHECK_HEALTH="oc exec -it wmla-etcd-0 -c etcd -- etcdctl --cacert=/etc/pki/etcd/ca.crt --cert=/etc/pki/etcd/tls.crt --key=/etc/pki/etcd/tls.key --endpoints=https://wmla-etcd:2379 endpoint health --cluster"
for j in $(seq 1 $MAX_RETRY); do
  $ETCD_CHECK_HEALTH > /tmp/wmla-etcd-check-result
  if [ "$?" != "0" ];then
    result=$(($result+1))
    if [ $j -lt 9 ]; then
      continue
    fi
    healthy_node_count=$(cat /tmp/wmla-etcd-check-result|grep -w "healthy"|wc -l)
    if [ $result -ge 3 -a $healthy_node_count -ge 2 ]; then
      dead_node=$(cat /tmp/wmla-etcd-check-result|grep "is unhealthy"|awk '{print substr($1,9,11)}')
      break
    fi
  else
    echo "$j round: Not found unhealthy in wmla-etcd pods"
    result=0
    healthy_node_count=0
  fi
  sleep 20
done
if [ "x$dead_node" != "x" ];then
  echo "Found unhealthy in wmla-etcd pods:"
  cat /tmp/wmla-etcd-check-result
  if [ "x$dead_node" != "x" ];then
    echo "== restoring unhealthy pod $dead_node"
    oc exec $dead_node -- rm -f /var/run/etcd/$dead_node.etcd/_recovered 2>/dev/null
    # to workaround failed node DNS not ready issue (the probe will be recovered when operator restarting)
    oc set probe sts/wmla-etcd --remove --readiness --liveness
    oc set env statefulset/wmla-etcd WMLA_ETCD_FAILURE_NODE=$dead_node
    oc delete pod $dead_node
  fi
elif [ $result -ge 3 ]; then
  echo "WARN: Found unhealthy in wmla-etcd pods, but wmla-etcd is not ready for fixing, wait a while and rerun this tool."
fi
rm -f /tmp/wmla-etcd-check-result
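After the script completes, you can spot-check that the owner references were applied. This is a minimal sketch, not part of the restore script; it assumes the default resource names that the script above patches (for example, the wmla-gui deployment).
  # Confirm that the Wmla custom resource is present.
  oc get wmla -o name
  # Spot-check a patched resource; the owner reference should name the Wmla custom resource.
  oc get deployment.apps/wmla-gui -o jsonpath='{.metadata.ownerReferences[0].kind}/{.metadata.ownerReferences[0].name}{"\n"}'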
Offline backup and restore of Watson Machine Learning Accelerator
Offline backup
Before you complete an offline backup of the Watson Machine Learning Accelerator service using the standard backup process, you must stop all running workloads.
- Stop all running workloads.
  - As a Watson Machine Learning Accelerator project administrator, stop all running jobs from the Watson Machine Learning Accelerator console:
    - Log in to the Watson Machine Learning Accelerator console as a project administrator.
    - Navigate to .
    - For each running application, select the menu icon and click Stop.
  - Stop all running deployed models. Use the Watson Machine Learning Accelerator console or the command-line interface to stop each running model; see Stop an inference service. A sketch for verifying that no workload pods remain follows this list.
- Back up the Watson Machine Learning Accelerator service using the standard backup process. See: https://www.ibm.com/docs/en/cloud-paks/cp-data/4.7.x?topic=project-backing-up
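Before you take the backup, you can optionally confirm that no workload pods are left running. This is a hedged sketch: it assumes that wmla-namespace is your Watson Machine Learning Accelerator project, and it uses the wmla-edi-isd prefix, which the restore scripts in this topic use to identify per-model inference-service deployments.
  # List everything that is still running in the project so that you can confirm
  # that only the core wmla-* service pods remain.
  oc get pods -n wmla-namespace --field-selector=status.phase=Running
  # Review any per-model inference-service deployments that are still present.
  oc get deploy -n wmla-namespace -o name | grep wmla-edi-isd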
Offline restore
After you restore the Watson Machine Learning Accelerator service using the Cloud Pak for Data restore process, you must run an additional script to restore owner references to all Watson Machine Learning Accelerator resources.
Before you begin:
Before performing an offline restore, make sure that the Watson Machine Learning Accelerator namespace is deleted.
- Log in to your OpenShift cluster as a project administrator:
  oc login OpenShift_URL:port
- Switch to the Watson Machine Learning Accelerator namespace:
  oc project wmla-namespace
- To restore owner references to Watson Machine Learning Accelerator resources, run the owner-reference restore script that is provided in the online restore procedure earlier in this topic, depending on your version of Watson Machine Learning Accelerator. The script is identical for online and offline restore.
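After the restore script completes, you can check the stateful components before you review the known issues in the next section. This is a minimal sketch, assuming the default pod names used throughout this topic.
  # etcd and MongoDB are the components that the known issues below can affect.
  oc get pods -n wmla-namespace | grep -E 'wmla-etcd|wmla-mongodb'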
After restoring Watson Machine Learning Accelerator
After restoring Watson Machine Learning Accelerator, make sure to address the following known issues.
Known issue:
After restoring Watson Machine Learning Accelerator, a known issue exists where the wmla-etcd cluster is unhealthy and fails to get the status of an endpoint.
- Run the following command to check the status of the wmla-etcd cluster:
  oc exec -it wmla-etcd-0 -- bash -c "ETCDCTL_API=3 etcdctl --cacert=/etc/pki/etcd/ca.crt --cert=/etc/pki/etcd/tls.crt --key=/etc/pki/etcd/tls.key --insecure-skip-tls-verify endpoint status --cluster"
  The following error is displayed if the cluster is unhealthy:
  Defaulted container "etcd" out of: etcd, init-data-dir (init)
  {"level":"warn","ts":"2023-06-19T12:01:20.476Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://wmla-etcd-2.wmla-etcd:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp: lookup wmla-etcd-2.wmla-etcd on 172.30.0.10:53: no such host\""}
  Failed to get the status of endpoint https://wmla-etcd-2.wmla-etcd:2379 (context deadline exceeded)
  https://wmla-etcd-1.wmla-etcd:2379, 968d327db883b4b4, 3.3.27, 9.0 MB, true, 2715, 2044
  https://wmla-etcd-0.wmla-etcd:2379, f5b85a4577d2c8db, 3.3.27, 9.1 MB, false, 2715, 2044
- Modify the wmla-etcd statefulset so that the failed pod is ready for maintenance:
  oc edit statefulset wmla-etcd
  - Remove the livenessProbe and readinessProbe by removing the following lines:
    livenessProbe:
      failureThreshold: 3
      initialDelaySeconds: 60
      periodSeconds: 30
      successThreshold: 1
      tcpSocket:
        port: 2379
      timeoutSeconds: 1
    readinessProbe:
      failureThreshold: 3
      initialDelaySeconds: 10
      periodSeconds: 20
      successThreshold: 1
      tcpSocket:
        port: 2379
      timeoutSeconds: 1
  - Modify the container as follows:
    containers:
    - command:
      - /bin/sh
      - -c
      - |
        PEERS="wmla-etcd-0=https://wmla-etcd-0.wmla-etcd:2380,wmla-etcd-1=https://wmla-etcd-1.wmla-etcd:2380,wmla-etcd-2=https://wmla-etcd-2.wmla-etcd:2380"
        ETCD_INITIAL_CLUSTER_STATE="new"
        if [ "$WMLA_ETCD_FAILURE_NODE" == "$HOSTNAME" -a ! -f /var/run/etcd/${HOSTNAME}.etcd/_recovered ]; then
          rm -rf /var/run/etcd/${HOSTNAME}.etcd
          echo "Restore ${HOSTNAME} in mainteance ..."
          ETCD_INITIAL_CLUSTER_STATE="existing"
          sleep 5
          mkdir -p /var/run/etcd/${HOSTNAME}.etcd
          touch /var/run/etcd/${HOSTNAME}.etcd/_recovered
        fi
        exec etcd --name ${HOSTNAME} \
          --listen-peer-urls https://0.0.0.0:2380 \
          --listen-client-urls https://0.0.0.0:2379 \
          --advertise-client-urls https://${HOSTNAME}.wmla-etcd:2379 \
          --initial-advertise-peer-urls https://${HOSTNAME}:2380 \
          --initial-cluster-token wmla-etcd-cluster \
          --initial-cluster ${PEERS} \
          --initial-cluster-state ${ETCD_INITIAL_CLUSTER_STATE} \
          --data-dir /var/run/etcd/${HOSTNAME}.etcd \
          --cert-file=/etc/pki/etcd/tls.crt \
          --key-file=/etc/pki/etcd/tls.key \
          --trusted-ca-file=/etc/pki/etcd/ca.crt \
          --client-cert-auth \
          --peer-cert-file=/etc/pki/etcd/tls.crt \
          --peer-key-file=/etc/pki/etcd/tls.key \
          --peer-trusted-ca-file=/etc/pki/etcd/ca.crt \
          --peer-client-cert-auth \
          --quota-backend-bytes=8589934592 \
          --cipher-suites TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
- Set the WMLA_ETCD_FAILURE_NODE environment variable to wmla-etcd-2 for the pod that failed:
  oc set env statefulset/wmla-etcd WMLA_ETCD_FAILURE_NODE=wmla-etcd-2
- Restart the failed pod:
  oc delete pod wmla-etcd-2
- Run the following command to check the status of the wmla-etcd cluster:
  oc exec -it wmla-etcd-0 -- bash -c "ETCDCTL_API=3 etcdctl --cacert=/etc/pki/etcd/ca.crt --cert=/etc/pki/etcd/tls.crt --key=/etc/pki/etcd/tls.key --insecure-skip-tls-verify endpoint status --cluster"
  The following is displayed for a healthy cluster:
  Defaulted container "etcd" out of: etcd, init-data-dir (init)
  https://wmla-etcd-1.wmla-etcd:2379, 968d327db883b4b4, 3.3.27, 1.4 MB, true, 2815, 2094
  https://wmla-etcd-2.wmla-etcd:2379, cc26316c8c459e22, 3.3.27, 2.7 MB, false, 2815, 2094
  https://wmla-etcd-0.wmla-etcd:2379, f5b85a4577d2c8db, 3.3.27, 1.4 MB, false, 2815, 2094
Known issue:
After restoring Watson Machine Learning Accelerator, the wmla-mongodb-1 or wmla-mongodb-2 pod may fail to start.
If this issue has occurred, complete the following steps to start the pods. Depending on the status of the cluster, this procedure may take several minutes to complete.
- Scale down the MongoDB service to 1 replica:
  oc scale --replicas=1 sts wmla-mongodb -n <wmla_instance_namespace>
- Wait for the MongoDB pods to scale down and stabilize.
- Remove the wmla-mongodb-1 and wmla-mongodb-2 PVCs. Do not delete the wmla-mongodb-0 PVC.
  oc delete pvc data-wmla-mongodb-1 data-wmla-mongodb-2 -n <wmla_instance_namespace>
- Scale the MongoDB service back up to 3 replicas:
  oc scale --replicas=3 sts wmla-mongodb -n <wmla_instance_namespace>
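When the statefulset is scaled back up, you can confirm that all three MongoDB pods return to a Running state. This is a minimal check, assuming the default statefulset name that is used in the steps above.
  # All three replicas should eventually report as ready.
  oc get sts wmla-mongodb -n <wmla_instance_namespace>
  oc get pods -n <wmla_instance_namespace> | grep wmla-mongodb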