Backing up and restoring the Watson Machine Learning Accelerator service

Use this information to back up or restore the IBM Watson® Machine Learning Accelerator service.

Online backup and restore of Watson Machine Learning Accelerator

Online backup

To complete an online backup, see Cloud Pak for Data online backup and restore.
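
Before you start the online backup, you can optionally confirm that the Wmla custom resource exists in the service namespace and note its name. This is a quick pre-check only; <wmla_instance_namespace> is a placeholder for your instance namespace:
  oc get wmla -n <wmla_instance_namespace>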

Online restore

After you restore the Watson Machine Learning Accelerator service using the Cloud Pak for Data restore process, you must run an additional script to restore owner references to all Watson Machine Learning Accelerator resources.

Before you begin:

Before performing an online restore, make sure that the Watson Machine Learning Accelerator namespace is deleted.
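
As an optional check, you can confirm that the namespace no longer exists before you start the restore (replace <wmla_instance_namespace> with your instance namespace):
  # Should return an error such as: Error from server (NotFound): namespaces "<wmla_instance_namespace>" not found
  oc get namespace <wmla_instance_namespace>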

Steps:
  1. Log in to your OpenShift cluster as a project administrator.
    oc login OpenShift_URL:port
  2. Switch to the Watson Machine Learning Accelerator namespace.
    oc project wmla-namespace
  3. Restore owner references to Watson Machine Learning Accelerator resources by running the following script:
    #!/bin/bash
    
    wmla_name=`oc get wmla -o name|awk -F/ '{print $NF}'`
    wmla_uid=`oc get wmla $wmla_name -o jsonpath='{.metadata.uid}'`
    user_pvc=`oc get wmla $wmla_name -o jsonpath={.spec.usePreCreatedPvcs}`
    
    for r in \
    certificate.cert-manager.io/wmla-ca-crt \
    certificate.cert-manager.io/wmla-internal-keys \
    certificate.cert-manager.io/wmla-nginx-keys \
    certificate.cert-manager.io/wmla-internal-keys-ecdsa \
    certificate.cert-manager.io/wmla-nginx-keys-ecdsa \
    certificate.cert-manager.io/wmla-worker-keys \
    configmap/cpd-wmla-br-cm \
    configmap/cpd-wmla-ckpt-cm \
    configmap/cpd-wmla-qu-cm \
    configmap/cpd-wmla-add-on-br-cm \
    configmap/wmla-edi-lbd-nginx \
    configmap/wmla-gpu-types \
    configmap/wmla-install-info-cm \
    configmap/wmla-watchdog-conf \
    configmap/wmla-wml-accelerator-instance-cm \
    configmap/wmla-dlpd-bootstrap \
    configmap/wmla-edi \
    configmap/wmla-edi-dlim \
    configmap/wmla-edi-imd-nginx \
    configmap/wmla-edi-isd \
    configmap/wmla-edi-isd-ingress \
    configmap/wmla-grafana-configmap \
    configmap/wmla-grafana-ini \
    configmap/wmla-grafana-providers \
    configmap/wmla-infoservice \
    configmap/wmla-jupyter-hub-config \
    configmap/wmla-logstash-conf \
    configmap/wmla-mongodb-shells \
    configmap/wmla-msd \
    configmap/wmla-mss \
    configmap/wmla-nginx-conf \
    configmap/wmla-nginx-grafana-sidecar-conf \
    configmap/wmla-nginx-sidecar-conf \
    configmap/wmla-prometheus \
    configmap/wmla-version-info \
    deployment.apps/wmla-auth-rest \
    deployment.apps/wmla-conda \
    deployment.apps/wmla-dlpd \
    deployment.apps/wmla-edi-imd \
    deployment.apps/wmla-edi-lbd \
    deployment.apps/wmla-grafana \
    deployment.apps/wmla-gui \
    deployment.apps/wmla-infoservice \
    deployment.apps/wmla-ingress \
    deployment.apps/wmla-jupyter-gateway \
    deployment.apps/wmla-jupyter-hub \
    deployment.apps/wmla-jupyter-proxy \
    deployment.apps/wmla-logstash \
    deployment.apps/wmla-msd \
    deployment.apps/wmla-mss \
    deployment.apps/wmla-prometheus \
    deployment.apps/wmla-watchdog \
    horizontalpodautoscaler.autoscaling/wmla-auth-rest-hpa \
    horizontalpodautoscaler.autoscaling/wmla-dlpd-hpa \
    horizontalpodautoscaler.autoscaling/wmla-edi-lbd-hpa \
    horizontalpodautoscaler.autoscaling/wmla-gui-hpa \
    horizontalpodautoscaler.autoscaling/wmla-ingress-hpa \
    horizontalpodautoscaler.autoscaling/wmla-watchdog-hpa \
    ingress.networking.k8s.io/wmla-jupyter-ingress \
    issuer.cert-manager.io/wmla-ca \
    issuer.cert-manager.io/wmla-root-issuer \
    networkpolicy.networking.k8s.io/wmla-dlpd-netpol \
    networkpolicy.networking.k8s.io/wmla-edi-imd-network-policy \
    networkpolicy.networking.k8s.io/wmla-edi-isd-network-policy \
    networkpolicy.networking.k8s.io/wmla-infoservice-netpol \
    networkpolicy.networking.k8s.io/wmla-ingress-network-policy \
    networkpolicy.networking.k8s.io/wmla-logstash-network-policy \
    networkpolicy.networking.k8s.io/wmla-msd-netpol \
    networkpolicy.networking.k8s.io/wmla-namespace-network-policy \
    persistentvolumeclaim/wmla-conda \
    persistentvolumeclaim/wmla-cws-share \
    persistentvolumeclaim/wmla-edi \
    persistentvolumeclaim/wmla-infoservice \
    persistentvolumeclaim/wmla-logging \
    persistentvolumeclaim/wmla-mygpfs \
    persistentvolumeclaim/wmla-grafana \
    persistentvolumeclaim/wmla-prometheus \
    poddisruptionbudget.policy/wmla-jupyter-hub-pdb \
    poddisruptionbudget.policy/wmla-jupyter-proxy-pdb \
    role.rbac.authorization.k8s.io/wmla-core-role \
    role.rbac.authorization.k8s.io/wmla-edi \
    role.rbac.authorization.k8s.io/wmla-msd-mss \
    role.rbac.authorization.k8s.io/wmla-notebook-role \
    role.rbac.authorization.k8s.io/wmla-role \
    rolebinding.rbac.authorization.k8s.io/wmla-core-rb \
    rolebinding.rbac.authorization.k8s.io/wmla-edi \
    rolebinding.rbac.authorization.k8s.io/wmla-msd-mss \
    rolebinding.rbac.authorization.k8s.io/wmla-notebook-rb \
    rolebinding.rbac.authorization.k8s.io/wmla-rb \
    route.route.openshift.io/wmla-console \
    route.route.openshift.io/wmla-grafana \
    route.route.openshift.io/wmla-inference \
    route.route.openshift.io/wmla-jupyter-notebook \
    secret/wmla-dlpd-conf \
    secret/wmla-eg-secret \
    secret/wmla-grafana-secret \
    secret/wmla-jupyter-hub-secret \
    secret/wmla-mongodb-secret \
    secret/wmla-prometheus-htpasswd \
    service/wmla-auth-rest \
    service/wmla-dlpd \
    service/wmla-edi \
    service/wmla-edi-admin \
    service/wmla-etcd \
    service/wmla-grafana \
    service/wmla-gui \
    service/wmla-inference \
    service/wmla-infoservice \
    service/wmla-ingress \
    service/wmla-jupyter-enterprise-gateway \
    service/wmla-jupyter-hub \
    service/wmla-jupyter-proxy-api \
    service/wmla-jupyter-proxy-public \
    service/wmla-logstash-service \
    service/wmla-mongodb \
    service/wmla-msd \
    service/wmla-mss \
    service/wmla-prometheus \
    serviceaccount/wmla-core-sa \
    serviceaccount/wmla-msd-mss \
    serviceaccount/wmla-norbac \
    serviceaccount/wmla-notebook-sa \
    serviceaccount/wmla-sa \
    statefulset.apps/wmla-etcd \
    statefulset.apps/wmla-mongodb \
    wmla-add-on.spectrumcomputing.ibm.com/wmla;
    do
    	oc get $r >& /dev/null
    	if [ $? == "0" ]; then
    		#skip patch user pvc
    		if [ x$user_pvc == 'xtrue' ];then
    			resourcetype=`echo $r|awk -F'/' '{print $1}'`
    			if [ x$resourcetype == 'xpersistentvolumeclaim' ];then
    				echo "skip user defined PVC $r"
    				continue
    			fi
    		fi
    		echo "Patch ownerReferences for $r"
    		oc patch $r --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla\",\"name\":\"$wmla_name\",\"uid\":\"$wmla_uid\"}]}}"
    	fi
    done
    
    #update ownerReferences for wmla resource plans
    wmla_rps=`oc get rp -o name`
    ns_rp=`oc get rp platform  -o jsonpath={.spec.parent}`
    wmla_fix_rp=`oc get rp platform  -o jsonpath={.spec.children[0].name}`
    cpd_fix_rp=`oc get rp platform  -o jsonpath={.spec.children[1].name}`
    for r in $wmla_rps;
    do
    	#skip scheduler created resource plans
    	rp_name=`echo $r|awk -F'/' '{print $2}'`
    	if [ x$rp_name == "xplatform" -o x$rp_name == "x$ns_rp" -o x$rp_name == "x$wmla_fix_rp" -o x$rp_name == "x$cpd_fix_rp" ];then
    		echo "skip resource plan $rp_name"
    		continue
    	fi
    	echo "Patch ownerReferences for $r"
    	oc patch $r --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla\",\"name\":\"$wmla_name\",\"uid\":\"$wmla_uid\"}]}}"
    done
    
    #update ownerReferences for deploy/isd and isd/service
    isds=`oc get deploy -o name|grep wmla-edi-isd`
    imd_uid=`oc get deploy wmla-edi-imd -o jsonpath='{.metadata.uid}'`
    for r in $isds;
    do
    	echo "Patch ownerReferences for $r"
    	oc patch $r --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"apps/v1\",\"blockOwnerDeletion\":true,\"controller\":true,\"kind\":\"Deployment\",\"name\":\"wmla-edi-imd\",\"uid\":\"$imd_uid\"}]}}"
    done
    
    isd_servicess=`oc get services -o name|grep wmla-edi-isd`
    for r in $isd_servicess;
    do
    	isd_name=`echo $r|awk -F/ '{print $NF}'`
    	isd_uid=`oc get deploy $isd_name -o jsonpath='{.metadata.uid}'`
    	echo "Patch ownerReferences for $r"
    	oc patch $r --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"apps/v1\",\"blockOwnerDeletion\":true,\"controller\":true,\"kind\":\"Deployment\",\"name\":\"$isd_name\",\"uid\":\"$isd_uid\"}]}}"
    done
    
    #update ownerReferences for wmla-add-on cm
    wmla_add_on_name=`oc get wmla-add-on -o name|awk -F/ '{print $NF}'`
    if [ x$wmla_add_on_name != x ];then
    	wmla_add_on_uid=`oc get wmla-add-on $wmla_add_on_name -o jsonpath='{.metadata.uid}'`
    	oc patch configmap/cpd-wmla-add-on-br-cm --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla-add-on\",\"name\":\"$wmla_add_on_name\",\"uid\":\"$wmla_add_on_uid\"}]}}"
    	wmla_instance_cm=`oc get cm -o name|grep wml-accelerator-instance-cm`
    	oc patch $wmla_instance_cm --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla-add-on\",\"name\":\"$wmla_add_on_name\",\"uid\":\"$wmla_add_on_uid\"}]}}"
    	wmla_connection_cm=`oc get cm -o name|grep wml-accelerator-connection-info-extension`
    	if [[ 'x' != x"$wmla_connection_cm" ]];then
    		oc patch $wmla_connection_cm --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla-add-on\",\"name\":\"$wmla_add_on_name\",\"uid\":\"$wmla_add_on_uid\"}]}}"
    	fi
    fi
    
    retry=1
    crash='Y'
    MAX_RETRY=60
    echo "checking mongodb status..."
    oc get po|grep -E 'wmla-mongodb-1|wmla-mongodb-2'|grep CrashLoopBackOff
    until [[ $? == '0' ]]; do
        if [ $retry -ge $MAX_RETRY ]; then
            crash='N'
            echo "Not found mongodb pod crash"
            oc get po|grep -E 'wmla-mongodb'
            break
        fi
        sleep 1
        let "retry += 1"
        oc get po|grep -E 'wmla-mongodb-1|wmla-mongodb-2'|grep CrashLoopBackOff
    done
    if [ $crash == 'Y' ]; then
        echo "Found mongodb pod crash"
        oc get po|grep -E 'wmla-mongodb'
        oc scale --replicas 0 sts wmla-mongodb
        oc delete pvc data-wmla-mongodb-1 data-wmla-mongodb-2 
        oc scale  --replicas=3 sts wmla-mongodb
    fi
    
    #workaround for etcd unhealthy, check 18 times (the result is from the last 3 times)
    MAX_RETRY=18
    echo "checking wmla-etcd status (in $MAX_RETRY rounds)..."
    dead_node=""
    result=0
    healthy_node_count=0
    ETCD_CHECK_HEALTH="oc exec -it wmla-etcd-0 -c etcd -- etcdctl --cacert=/etc/pki/etcd/ca.crt --cert=/etc/pki/etcd/tls.crt --key=/etc/pki/etcd/tls.key --endpoints=https://wmla-etcd:2379 endpoint health --cluster"
    for j in $(seq 1 $MAX_RETRY); do
        $ETCD_CHECK_HEALTH > /tmp/wmla-etcd-check-result
        if [ "$?" != "0" ];then
            result=$(($result+1))
            if [ $j -lt 9 ]; then
                continue
            fi
            healthy_node_count=$(cat /tmp/wmla-etcd-check-result|grep -w "healthy"|wc -l)
            if [ $result -ge 3 -a $healthy_node_count -ge 2 ]; then
                dead_node=$(cat /tmp/wmla-etcd-check-result|grep "is unhealthy"|awk '{print substr($1,9,11)}')
                break
            fi
        else
            echo "$j round: Not found unhealthy in wmla-etcd pods"
            result=0
            healthy_node_count=0
        fi
        sleep 20
    done
    if [ "x$dead_node" != "x" ];then
        echo "Found unhealthy in wmla-etcd pods:"
        cat /tmp/wmla-etcd-check-result
        if [ "x$dead_node" != "x" ];then
            echo "== restoring unhealthy pod $dead_node"
            oc exec $dead_node -- rm -f /var/run/etcd/$dead_node.etcd/_recovered 2>/dev/null
            # to workaround failed node DNS not ready issue (the probe will be recovered when operator restarting)
            oc set probe sts/wmla-etcd --remove --readiness --liveness
            oc set env statefulset/wmla-etcd WMLA_ETCD_FAILURE_NODE=$dead_node
            oc delete pod $dead_node
        fi
    elif [ $result -ge 3 ]; then
        echo "WARN: Found unhealthy in wmla-etcd pods, but wmla-etcd is not ready for fixing, wait a while and rerun this tool."
    fi
    rm -f /tmp/wmla-etcd-check-result
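
    After the script completes, you can optionally spot-check that the owner references were applied. A minimal check, using one of the config maps that the script patches; the output should name your Wmla custom resource:
    # Optional spot check: the owner reference should now point to the Wmla custom resource
    oc get configmap/cpd-wmla-br-cm -o jsonpath='{.metadata.ownerReferences[*].kind}{"/"}{.metadata.ownerReferences[*].name}{"\n"}'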
    

Offline backup and restore of Watson Machine Learning Accelerator

Offline backup

Before you complete an offline backup of the Watson Machine Learning Accelerator service using the standard backup process, you must stop all running workloads.

Steps:
  1. Stop all running workloads.
    1. As a Watson Machine Learning Accelerator project administrator, stop all running jobs from the Watson Machine Learning Accelerator console.
      1. Log in to the Watson Machine Learning Accelerator console as a project administrator.
      2. Navigate to Monitoring > Applications.
      3. For each running application, select the menu icon and click Stop.
    2. Stop all running deployed models. Use the Watson Machine Learning Accelerator console or the command-line interface to stop each running model; see Stop an inference service.
  2. Back up the Watson Machine Learning Accelerator service using the standard backup process. See: https://www.ibm.com/docs/en/cloud-paks/cp-data/4.7.x?topic=project-backing-up
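
Optionally, before running the backup, you can list the pods in the Watson Machine Learning Accelerator namespace as a quick sanity check that no training or inference workload pods remain. Replace <wmla_instance_namespace> with your instance namespace:
    oc get pods -n <wmla_instance_namespace>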

Offline restore

After you restore the Watson Machine Learning Accelerator service using the Cloud Pak for Data restore process, you must run an additional script to restore owner references to all Watson Machine Learning Accelerator resources.

Before you begin:

Before performing an offline restore, make sure that the Watson Machine Learning Accelerator namespace is deleted.

Steps:
  1. Log in to your OpenShift cluster as a project administrator.
    oc login OpenShift_URL:port
  2. Switch to the Watson Machine Learning Accelerator namespace.
    oc project wmla-namespace
  3. Restore owner references to Watson Machine Learning Accelerator resources by running the following script:
    #!/bin/bash
    
    wmla_name=`oc get wmla -o name|awk -F/ '{print $NF}'`
    wmla_uid=`oc get wmla $wmla_name -o jsonpath='{.metadata.uid}'`
    user_pvc=`oc get wmla $wmla_name -o jsonpath={.spec.usePreCreatedPvcs}`
    
    for r in \
    certificate.cert-manager.io/wmla-ca-crt \
    certificate.cert-manager.io/wmla-internal-keys \
    certificate.cert-manager.io/wmla-nginx-keys \
    certificate.cert-manager.io/wmla-internal-keys-ecdsa \
    certificate.cert-manager.io/wmla-nginx-keys-ecdsa \
    certificate.cert-manager.io/wmla-worker-keys \
    configmap/cpd-wmla-br-cm \
    configmap/cpd-wmla-ckpt-cm \
    configmap/cpd-wmla-qu-cm \
    configmap/cpd-wmla-add-on-br-cm \
    configmap/wmla-edi-lbd-nginx \
    configmap/wmla-gpu-types \
    configmap/wmla-install-info-cm \
    configmap/wmla-watchdog-conf \
    configmap/wmla-wml-accelerator-instance-cm \
    configmap/wmla-dlpd-bootstrap \
    configmap/wmla-edi \
    configmap/wmla-edi-dlim \
    configmap/wmla-edi-imd-nginx \
    configmap/wmla-edi-isd \
    configmap/wmla-edi-isd-ingress \
    configmap/wmla-grafana-configmap \
    configmap/wmla-grafana-ini \
    configmap/wmla-grafana-providers \
    configmap/wmla-infoservice \
    configmap/wmla-jupyter-hub-config \
    configmap/wmla-logstash-conf \
    configmap/wmla-mongodb-shells \
    configmap/wmla-msd \
    configmap/wmla-mss \
    configmap/wmla-nginx-conf \
    configmap/wmla-nginx-grafana-sidecar-conf \
    configmap/wmla-nginx-sidecar-conf \
    configmap/wmla-prometheus \
    configmap/wmla-version-info \
    deployment.apps/wmla-auth-rest \
    deployment.apps/wmla-conda \
    deployment.apps/wmla-dlpd \
    deployment.apps/wmla-edi-imd \
    deployment.apps/wmla-edi-lbd \
    deployment.apps/wmla-grafana \
    deployment.apps/wmla-gui \
    deployment.apps/wmla-infoservice \
    deployment.apps/wmla-ingress \
    deployment.apps/wmla-jupyter-gateway \
    deployment.apps/wmla-jupyter-hub \
    deployment.apps/wmla-jupyter-proxy \
    deployment.apps/wmla-logstash \
    deployment.apps/wmla-msd \
    deployment.apps/wmla-mss \
    deployment.apps/wmla-prometheus \
    deployment.apps/wmla-watchdog \
    horizontalpodautoscaler.autoscaling/wmla-auth-rest-hpa \
    horizontalpodautoscaler.autoscaling/wmla-dlpd-hpa \
    horizontalpodautoscaler.autoscaling/wmla-edi-lbd-hpa \
    horizontalpodautoscaler.autoscaling/wmla-gui-hpa \
    horizontalpodautoscaler.autoscaling/wmla-ingress-hpa \
    horizontalpodautoscaler.autoscaling/wmla-watchdog-hpa \
    ingress.networking.k8s.io/wmla-jupyter-ingress \
    issuer.cert-manager.io/wmla-ca \
    issuer.cert-manager.io/wmla-root-issuer \
    networkpolicy.networking.k8s.io/wmla-dlpd-netpol \
    networkpolicy.networking.k8s.io/wmla-edi-imd-network-policy \
    networkpolicy.networking.k8s.io/wmla-edi-isd-network-policy \
    networkpolicy.networking.k8s.io/wmla-infoservice-netpol \
    networkpolicy.networking.k8s.io/wmla-ingress-network-policy \
    networkpolicy.networking.k8s.io/wmla-logstash-network-policy \
    networkpolicy.networking.k8s.io/wmla-msd-netpol \
    networkpolicy.networking.k8s.io/wmla-namespace-network-policy \
    persistentvolumeclaim/wmla-conda \
    persistentvolumeclaim/wmla-cws-share \
    persistentvolumeclaim/wmla-edi \
    persistentvolumeclaim/wmla-infoservice \
    persistentvolumeclaim/wmla-logging \
    persistentvolumeclaim/wmla-mygpfs \
    persistentvolumeclaim/wmla-grafana \
    persistentvolumeclaim/wmla-prometheus \
    poddisruptionbudget.policy/wmla-jupyter-hub-pdb \
    poddisruptionbudget.policy/wmla-jupyter-proxy-pdb \
    role.rbac.authorization.k8s.io/wmla-core-role \
    role.rbac.authorization.k8s.io/wmla-edi \
    role.rbac.authorization.k8s.io/wmla-msd-mss \
    role.rbac.authorization.k8s.io/wmla-notebook-role \
    role.rbac.authorization.k8s.io/wmla-role \
    rolebinding.rbac.authorization.k8s.io/wmla-core-rb \
    rolebinding.rbac.authorization.k8s.io/wmla-edi \
    rolebinding.rbac.authorization.k8s.io/wmla-msd-mss \
    rolebinding.rbac.authorization.k8s.io/wmla-notebook-rb \
    rolebinding.rbac.authorization.k8s.io/wmla-rb \
    route.route.openshift.io/wmla-console \
    route.route.openshift.io/wmla-grafana \
    route.route.openshift.io/wmla-inference \
    route.route.openshift.io/wmla-jupyter-notebook \
    secret/wmla-dlpd-conf \
    secret/wmla-eg-secret \
    secret/wmla-grafana-secret \
    secret/wmla-jupyter-hub-secret \
    secret/wmla-mongodb-secret \
    secret/wmla-prometheus-htpasswd \
    service/wmla-auth-rest \
    service/wmla-dlpd \
    service/wmla-edi \
    service/wmla-edi-admin \
    service/wmla-etcd \
    service/wmla-grafana \
    service/wmla-gui \
    service/wmla-inference \
    service/wmla-infoservice \
    service/wmla-ingress \
    service/wmla-jupyter-enterprise-gateway \
    service/wmla-jupyter-hub \
    service/wmla-jupyter-proxy-api \
    service/wmla-jupyter-proxy-public \
    service/wmla-logstash-service \
    service/wmla-mongodb \
    service/wmla-msd \
    service/wmla-mss \
    service/wmla-prometheus \
    serviceaccount/wmla-core-sa \
    serviceaccount/wmla-msd-mss \
    serviceaccount/wmla-norbac \
    serviceaccount/wmla-notebook-sa \
    serviceaccount/wmla-sa \
    statefulset.apps/wmla-etcd \
    statefulset.apps/wmla-mongodb \
    wmla-add-on.spectrumcomputing.ibm.com/wmla;
    do
    	oc get $r >& /dev/null
    	if [ $? == "0" ]; then
    		#skip patch user pvc
    		if [ x$user_pvc == 'xtrue' ];then
    			resourcetype=`echo $r|awk -F'/' '{print $1}'`
    			if [ x$resourcetype == 'xpersistentvolumeclaim' ];then
    				echo "skip user defined PVC $r"
    				continue
    			fi
    		fi
    		echo "Patch ownerReferences for $r"
    		oc patch $r --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla\",\"name\":\"$wmla_name\",\"uid\":\"$wmla_uid\"}]}}"
    	fi
    done
    
    #update ownerReferences for wmla resource plans
    wmla_rps=`oc get rp -o name`
    ns_rp=`oc get rp platform  -o jsonpath={.spec.parent}`
    wmla_fix_rp=`oc get rp platform  -o jsonpath={.spec.children[0].name}`
    cpd_fix_rp=`oc get rp platform  -o jsonpath={.spec.children[1].name}`
    for r in $wmla_rps;
    do
    	#skip scheduler created resource plans
    	rp_name=`echo $r|awk -F'/' '{print $2}'`
    	if [ x$rp_name == "xplatform" -o x$rp_name == "x$ns_rp" -o x$rp_name == "x$wmla_fix_rp" -o x$rp_name == "x$cpd_fix_rp" ];then
    		echo "skip resource plan $rp_name"
    		continue
    	fi
    	echo "Patch ownerReferences for $r"
    	oc patch $r --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla\",\"name\":\"$wmla_name\",\"uid\":\"$wmla_uid\"}]}}"
    done
    
    #update ownerReferences for deploy/isd and isd/service
    isds=`oc get deploy -o name|grep wmla-edi-isd`
    imd_uid=`oc get deploy wmla-edi-imd -o jsonpath='{.metadata.uid}'`
    for r in $isds;
    do
    	echo "Patch ownerReferences for $r"
    	oc patch $r --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"apps/v1\",\"blockOwnerDeletion\":true,\"controller\":true,\"kind\":\"Deployment\",\"name\":\"wmla-edi-imd\",\"uid\":\"$imd_uid\"}]}}"
    done
    
    isd_servicess=`oc get services -o name|grep wmla-edi-isd`
    for r in $isd_servicess;
    do
    	isd_name=`echo $r|awk -F/ '{print $NF}'`
    	isd_uid=`oc get deploy $isd_name -o jsonpath='{.metadata.uid}'`
    	echo "Patch ownerReferences for $r"
    	oc patch $r --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"apps/v1\",\"blockOwnerDeletion\":true,\"controller\":true,\"kind\":\"Deployment\",\"name\":\"$isd_name\",\"uid\":\"$isd_uid\"}]}}"
    done
    
    #update ownerReferences for wmla-add-on cm
    wmla_add_on_name=`oc get wmla-add-on -o name|awk -F/ '{print $NF}'`
    if [ x$wmla_add_on_name != x ];then
    	wmla_add_on_uid=`oc get wmla-add-on $wmla_add_on_name -o jsonpath='{.metadata.uid}'`
    	oc patch configmap/cpd-wmla-add-on-br-cm --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla-add-on\",\"name\":\"$wmla_add_on_name\",\"uid\":\"$wmla_add_on_uid\"}]}}"
    	wmla_instance_cm=`oc get cm -o name|grep wml-accelerator-instance-cm`
    	oc patch $wmla_instance_cm --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla-add-on\",\"name\":\"$wmla_add_on_name\",\"uid\":\"$wmla_add_on_uid\"}]}}"
    	wmla_connection_cm=`oc get cm -o name|grep wml-accelerator-connection-info-extension`
    	if [[ 'x' != x"$wmla_connection_cm" ]];then
    		oc patch $wmla_connection_cm --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla-add-on\",\"name\":\"$wmla_add_on_name\",\"uid\":\"$wmla_add_on_uid\"}]}}"
    	fi
    fi
    
    retry=1
    crash='Y'
    MAX_RETRY=60
    echo "checking mongodb status..."
    oc get po|grep -E 'wmla-mongodb-1|wmla-mongodb-2'|grep CrashLoopBackOff
    until [[ $? == '0' ]]; do
        if [ $retry -ge $MAX_RETRY ]; then
            crash='N'
            echo "Not found mongodb pod crash"
            oc get po|grep -E 'wmla-mongodb'
            break
        fi
        sleep 1
        let "retry += 1"
        oc get po|grep -E 'wmla-mongodb-1|wmla-mongodb-2'|grep CrashLoopBackOff
    done
    if [ $crash == 'Y' ]; then
        echo "Found mongodb pod crash"
        oc get po|grep -E 'wmla-mongodb'
        oc scale --replicas 0 sts wmla-mongodb
        oc delete pvc data-wmla-mongodb-1 data-wmla-mongodb-2 
        oc scale  --replicas=3 sts wmla-mongodb
    fi
    
    #workaround for etcd unhealthy, check 18 times (the result is from the last 3 times)
    MAX_RETRY=18
    echo "checking wmla-etcd status (in $MAX_RETRY rounds)..."
    dead_node=""
    result=0
    healthy_node_count=0
    ETCD_CHECK_HEALTH="oc exec -it wmla-etcd-0 -c etcd -- etcdctl --cacert=/etc/pki/etcd/ca.crt --cert=/etc/pki/etcd/tls.crt --key=/etc/pki/etcd/tls.key --endpoints=https://wmla-etcd:2379 endpoint health --cluster"
    for j in $(seq 1 $MAX_RETRY); do
        $ETCD_CHECK_HEALTH > /tmp/wmla-etcd-check-result
        if [ "$?" != "0" ];then
            result=$(($result+1))
            if [ $j -lt 9 ]; then
                continue
            fi
            healthy_node_count=$(cat /tmp/wmla-etcd-check-result|grep -w "healthy"|wc -l)
            if [ $result -ge 3 -a $healthy_node_count -ge 2 ]; then
                dead_node=$(cat /tmp/wmla-etcd-check-result|grep "is unhealthy"|awk '{print substr($1,9,11)}')
                break
            fi
        else
            echo "$j round: Not found unhealthy in wmla-etcd pods"
            result=0
            healthy_node_count=0
        fi
        sleep 20
    done
    if [ "x$dead_node" != "x" ];then
        echo "Found unhealthy in wmla-etcd pods:"
        cat /tmp/wmla-etcd-check-result
        if [ "x$dead_node" != "x" ];then
            echo "== restoring unhealthy pod $dead_node"
            oc exec $dead_node -- rm -f /var/run/etcd/$dead_node.etcd/_recovered 2>/dev/null
            # to workaround failed node DNS not ready issue (the probe will be recovered when operator restarting)
            oc set probe sts/wmla-etcd --remove --readiness --liveness
            oc set env statefulset/wmla-etcd WMLA_ETCD_FAILURE_NODE=$dead_node
            oc delete pod $dead_node
        fi
    elif [ $result -ge 3 ]; then
        echo "WARN: Found unhealthy in wmla-etcd pods, but wmla-etcd is not ready for fixing, wait a while and rerun this tool."
    fi
    rm -f /tmp/wmla-etcd-check-result
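
    After the script completes, and if your deployment includes the wmla-add-on custom resource, you can optionally confirm that the add-on config map picked up its new owner reference. A minimal spot check, using one of the config maps that the script patches:
    # Optional spot check: should print the name of the Wmla-add-on custom resource
    oc get configmap/cpd-wmla-add-on-br-cm -o jsonpath='{.metadata.ownerReferences[*].name}{"\n"}'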
    

After restoring Watson Machine Learning Accelerator

After restoring Watson Machine Learning Accelerator, make sure to address the following known issues.

Known issue:

After restoring Watson Machine Learning Accelerator, a known issue exists where the wmla-etcd cluster is unhealthy and the status of an endpoint cannot be retrieved.

To resolve this issue:
  1. Run the following command to check the status of the wmla-etcd cluster:
    oc exec -it wmla-etcd-0 -- bash -c "ETCDCTL_API=3 etcdctl --cacert=/etc/pki/etcd/ca.crt --cert=/etc/pki/etcd/tls.crt --key=/etc/pki/etcd/tls.key --insecure-skip-tls-verify endpoint status --cluster"
    The following error is displayed if the cluster is unhealthy:
    Defaulted container "etcd" out of: etcd, init-data-dir (init)
    {"level":"warn","ts":"2023-06-19T12:01:20.476Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://wmla-etcd-2.wmla-etcd:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp: lookup wmla-etcd-2.wmla-etcd on 172.30.0.10:53: no such host\""}
    Failed to get the status of endpoint https://wmla-etcd-2.wmla-etcd:2379 (context deadline exceeded)
    https://wmla-etcd-1.wmla-etcd:2379, 968d327db883b4b4, 3.3.27, 9.0 MB, true, 2715, 2044
    https://wmla-etcd-0.wmla-etcd:2379, f5b85a4577d2c8db, 3.3.27, 9.1 MB, false, 2715, 2044
  2. Modify the wmla-etcd statefulset so that the failed pod is ready for maintenance.
    oc edit statefulset wmla-etcd
    1. Remove the liveness probe and readiness probe by removing the following lines:
              livenessProbe:
                failureThreshold: 3
                initialDelaySeconds: 60
                periodSeconds: 30
                successThreshold: 1
                tcpSocket:
                  port: 2379
                timeoutSeconds: 1
      
              readinessProbe:
                failureThreshold: 3
                initialDelaySeconds: 10
                periodSeconds: 20
                successThreshold: 1
                tcpSocket:
                  port: 2379
                timeoutSeconds: 1
    2. Modify the container as follows:
            containers:
            - command:
              - /bin/sh
              - -c
              - |
                PEERS="wmla-etcd-0=https://wmla-etcd-0.wmla-etcd:2380,wmla-etcd-1=https://wmla-etcd-1.wmla-etcd:2380,wmla-etcd-2=https://wmla-etcd-2.wmla-etcd:2380"
                ETCD_INITIAL_CLUSTER_STATE="new"
                if [ "$WMLA_ETCD_FAILURE_NODE" == "$HOSTNAME" -a ! -f /var/run/etcd/${HOSTNAME}.etcd/_recovered ]; then
                    rm -rf /var/run/etcd/${HOSTNAME}.etcd
                    echo "Restore ${HOSTNAME} in maintenance ..."
                    ETCD_INITIAL_CLUSTER_STATE="existing"
                    sleep 5
                    mkdir -p /var/run/etcd/${HOSTNAME}.etcd
                    touch /var/run/etcd/${HOSTNAME}.etcd/_recovered
                fi
                exec etcd --name ${HOSTNAME} \
                  --listen-peer-urls https://0.0.0.0:2380 \
                  --listen-client-urls https://0.0.0.0:2379 \
                  --advertise-client-urls https://${HOSTNAME}.wmla-etcd:2379 \
                  --initial-advertise-peer-urls https://${HOSTNAME}:2380 \
                  --initial-cluster-token wmla-etcd-cluster \
                  --initial-cluster ${PEERS} \
                  --initial-cluster-state ${ETCD_INITIAL_CLUSTER_STATE} \
                  --data-dir /var/run/etcd/${HOSTNAME}.etcd \
                  --cert-file=/etc/pki/etcd/tls.crt \
                  --key-file=/etc/pki/etcd/tls.key \
                  --trusted-ca-file=/etc/pki/etcd/ca.crt \
                  --client-cert-auth \
                  --peer-cert-file=/etc/pki/etcd/tls.crt \
                  --peer-key-file=/etc/pki/etcd/tls.key \
                  --peer-trusted-ca-file=/etc/pki/etcd/ca.crt \
                  --peer-client-cert-auth \
                  --quota-backend-bytes=8589934592 \
                  --cipher-suites TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
      
  3. Set the WMLA_ETCD_FAILURE_NODE environment variable to the name of the failed pod (wmla-etcd-2 in this example):
    oc set env statefulset/wmla-etcd WMLA_ETCD_FAILURE_NODE=wmla-etcd-2
  4. Restart the failed pod:
    oc delete pod wmla-etcd-2
  5. Run the following command to check the status of the wmla-etcd cluster:
    oc exec -it wmla-etcd-0 -- bash -c "ETCDCTL_API=3 etcdctl --cacert=/etc/pki/etcd/ca.crt --cert=/etc/pki/etcd/tls.crt --key=/etc/pki/etcd/tls.key --insecure-skip-tls-verify endpoint status --cluster"
    The following is displayed for a healthy cluster:
    Defaulted container "etcd" out of: etcd, init-data-dir (init)
    https://wmla-etcd-1.wmla-etcd:2379, 968d327db883b4b4, 3.3.27, 1.4 MB, true, 2815, 2094
    https://wmla-etcd-2.wmla-etcd:2379, cc26316c8c459e22, 3.3.27, 2.7 MB, false, 2815, 2094
    https://wmla-etcd-0.wmla-etcd:2379, f5b85a4577d2c8db, 3.3.27, 1.4 MB, false, 2815, 2094
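
    Optionally, you can also run the same health check that the restore script uses, to confirm that every member reports healthy:
    oc exec -it wmla-etcd-0 -c etcd -- etcdctl --cacert=/etc/pki/etcd/ca.crt --cert=/etc/pki/etcd/tls.crt --key=/etc/pki/etcd/tls.key --endpoints=https://wmla-etcd:2379 endpoint health --cluster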
    

Known issue:

After restoring Watson Machine Learning Accelerator, the wmla-mongodb-1 or wmla-mongodb-2 pod may fail to start.

If this issue occurs, complete the following steps to start the pods. Depending on the status of the cluster, this procedure can take several minutes to complete.

  1. Scale down the MongoDB statefulset to 1 replica:
    oc scale --replicas=1 sts wmla-mongodb -n <wmla_instance_namespace>
  2. Wait for the MongoDB pods to scale down and stabilize.
  3. Remove the wmla-mongodb-1 and wmla-mongodb-2 PVCs. Do not delete the wmla-mongodb-0 PVC.
    oc delete pvc data-wmla-mongodb-1 data-wmla-mongodb-2 -n <wmla_instance_namespace>
  4. Scale the MongoDB statefulset back up to 3 replicas:
    oc scale  --replicas=3 sts wmla-mongodb -n <wmla_instance_namespace>
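
    After the statefulset scales back up, you can optionally check the MongoDB pods until all three replicas reach the Running state (repeat the command, or add -w to watch):
    oc get pods -n <wmla_instance_namespace> | grep wmla-mongodb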