Backing up and restoring the Watson Machine Learning Accelerator service
Use this information to back up or restore the IBM Watson® Machine Learning Accelerator service.
- Online backup and restore of Watson Machine Learning Accelerator
- Offline backup and restore of Watson Machine Learning Accelerator
- After restoring Watson Machine Learning Accelerator
Online backup and restore of Watson Machine Learning Accelerator
Online backup
To complete an online backup, see Cloud Pak for Data online backup and restore.
Online restore
After you restore the Watson Machine Learning Accelerator service using the Cloud Pak for Data restore process, you must run an additional script to restore owner references to all Watson Machine Learning Accelerator resources.
Before you begin:
Before performing an online restore, make sure that the Watson Machine Learning Accelerator namespace is deleted.
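You can quickly confirm this prerequisite before you start the Cloud Pak for Data restore. This is a minimal sketch; wmla-namespace stands for your Watson Machine Learning Accelerator project name, matching the placeholder used in the steps that follow.
  # The project should be reported as not found if the namespace was deleted before the restore.
  oc get project wmla-namespace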
- Log in to your OpenShift cluster as a project administrator:
  oc login OpenShift_URL:port
- Switch to the Watson Machine Learning Accelerator namespace:
  oc project wmla-namespace
- To restore owner references to Watson Machine Learning Accelerator resources, run the following script, depending on your version of Watson Machine Learning Accelerator:
#!/bin/bash
wmla_name=`oc get wmla -o name|awk -F/ '{print $NF}'`
wmla_uid=`oc get wmla $wmla_name -o jsonpath='{.metadata.uid}'`
user_pvc=`oc get wmla $wmla_name -o jsonpath={.spec.usePreCreatedPvcs}`
for r in \
  certificate.cert-manager.io/wmla-ca-crt \
  certificate.cert-manager.io/wmla-internal-keys \
  certificate.cert-manager.io/wmla-nginx-keys \
  certificate.cert-manager.io/wmla-internal-keys-ecdsa \
  certificate.cert-manager.io/wmla-nginx-keys-ecdsa \
  certificate.cert-manager.io/wmla-worker-keys \
  configmap/cpd-wmla-br-cm \
  configmap/cpd-wmla-ckpt-cm \
  configmap/cpd-wmla-qu-cm \
  configmap/cpd-wmla-add-on-br-cm \
  configmap/wmla-edi-lbd-nginx \
  configmap/wmla-gpu-types \
  configmap/wmla-install-info-cm \
  configmap/wmla-watchdog-conf \
  configmap/wmla-wml-accelerator-instance-cm \
  configmap/wmla-dlpd-bootstrap \
  configmap/wmla-edi \
  configmap/wmla-edi-dlim \
  configmap/wmla-edi-imd-nginx \
  configmap/wmla-edi-isd \
  configmap/wmla-edi-isd-ingress \
  configmap/wmla-grafana-configmap \
  configmap/wmla-grafana-ini \
  configmap/wmla-grafana-providers \
  configmap/wmla-infoservice \
  configmap/wmla-jupyter-hub-config \
  configmap/wmla-logstash-conf \
  configmap/wmla-mongodb-shells \
  configmap/wmla-msd \
  configmap/wmla-mss \
  configmap/wmla-nginx-conf \
  configmap/wmla-nginx-grafana-sidecar-conf \
  configmap/wmla-nginx-sidecar-conf \
  configmap/wmla-prometheus \
  configmap/wmla-version-info \
  deployment.apps/wmla-auth-rest \
  deployment.apps/wmla-conda \
  deployment.apps/wmla-dlpd \
  deployment.apps/wmla-edi-imd \
  deployment.apps/wmla-edi-lbd \
  deployment.apps/wmla-grafana \
  deployment.apps/wmla-gui \
  deployment.apps/wmla-infoservice \
  deployment.apps/wmla-ingress \
  deployment.apps/wmla-jupyter-gateway \
  deployment.apps/wmla-jupyter-hub \
  deployment.apps/wmla-jupyter-proxy \
  deployment.apps/wmla-logstash \
  deployment.apps/wmla-msd \
  deployment.apps/wmla-mss \
  deployment.apps/wmla-prometheus \
  deployment.apps/wmla-watchdog \
  horizontalpodautoscaler.autoscaling/wmla-auth-rest-hpa \
  horizontalpodautoscaler.autoscaling/wmla-dlpd-hpa \
  horizontalpodautoscaler.autoscaling/wmla-edi-lbd-hpa \
  horizontalpodautoscaler.autoscaling/wmla-gui-hpa \
  horizontalpodautoscaler.autoscaling/wmla-ingress-hpa \
  horizontalpodautoscaler.autoscaling/wmla-watchdog-hpa \
  ingress.networking.k8s.io/wmla-jupyter-ingress \
  issuer.cert-manager.io/wmla-ca \
  issuer.cert-manager.io/wmla-root-issuer \
  networkpolicy.networking.k8s.io/wmla-dlpd-netpol \
  networkpolicy.networking.k8s.io/wmla-edi-imd-network-policy \
  networkpolicy.networking.k8s.io/wmla-edi-isd-network-policy \
  networkpolicy.networking.k8s.io/wmla-infoservice-netpol \
  networkpolicy.networking.k8s.io/wmla-ingress-network-policy \
  networkpolicy.networking.k8s.io/wmla-logstash-network-policy \
  networkpolicy.networking.k8s.io/wmla-msd-netpol \
  networkpolicy.networking.k8s.io/wmla-namespace-network-policy \
  persistentvolumeclaim/wmla-conda \
  persistentvolumeclaim/wmla-cws-share \
  persistentvolumeclaim/wmla-edi \
  persistentvolumeclaim/wmla-infoservice \
  persistentvolumeclaim/wmla-logging \
  persistentvolumeclaim/wmla-mygpfs \
  persistentvolumeclaim/wmla-grafana \
  persistentvolumeclaim/wmla-prometheus \
  poddisruptionbudget.policy/wmla-jupyter-hub-pdb \
  poddisruptionbudget.policy/wmla-jupyter-proxy-pdb \
  role.rbac.authorization.k8s.io/wmla-core-role \
  role.rbac.authorization.k8s.io/wmla-edi \
  role.rbac.authorization.k8s.io/wmla-msd-mss \
  role.rbac.authorization.k8s.io/wmla-notebook-role \
  role.rbac.authorization.k8s.io/wmla-role \
  rolebinding.rbac.authorization.k8s.io/wmla-core-rb \
  rolebinding.rbac.authorization.k8s.io/wmla-edi \
  rolebinding.rbac.authorization.k8s.io/wmla-msd-mss \
  rolebinding.rbac.authorization.k8s.io/wmla-notebook-rb \
  rolebinding.rbac.authorization.k8s.io/wmla-rb \
  route.route.openshift.io/wmla-console \
  route.route.openshift.io/wmla-grafana \
  route.route.openshift.io/wmla-inference \
  route.route.openshift.io/wmla-jupyter-notebook \
  secret/wmla-dlpd-conf \
  secret/wmla-eg-secret \
  secret/wmla-grafana-secret \
  secret/wmla-jupyter-hub-secret \
  secret/wmla-mongodb-secret \
  secret/wmla-prometheus-htpasswd \
  service/wmla-auth-rest \
  service/wmla-dlpd \
  service/wmla-edi \
  service/wmla-edi-admin \
  service/wmla-etcd \
  service/wmla-grafana \
  service/wmla-gui \
  service/wmla-inference \
  service/wmla-infoservice \
  service/wmla-ingress \
  service/wmla-jupyter-enterprise-gateway \
  service/wmla-jupyter-hub \
  service/wmla-jupyter-proxy-api \
  service/wmla-jupyter-proxy-public \
  service/wmla-logstash-service \
  service/wmla-mongodb \
  service/wmla-msd \
  service/wmla-mss \
  service/wmla-prometheus \
  serviceaccount/wmla-core-sa \
  serviceaccount/wmla-msd-mss \
  serviceaccount/wmla-norbac \
  serviceaccount/wmla-notebook-sa \
  serviceaccount/wmla-sa \
  statefulset.apps/wmla-etcd \
  statefulset.apps/wmla-mongodb \
  wmla-add-on.spectrumcomputing.ibm.com/wmla; do
  oc get $r >& /dev/null
  if [ $? == "0" ]; then
    #skip patch user pvc
    if [ x$user_pvc == 'xtrue' ];then
      resoucetype=`echo $r|awk -F'/' '{print $1}'`
      if [ x$resoucetype == 'xpersistentvolumeclaim' ];then
        echo "skip user defined PVC $r"
        continue
      fi
    fi
    echo "Patch ownerReferences for $r"
    oc patch $r --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla\",\"name\":\"$wmla_name\",\"uid\":\"$wmla_uid\"}]}}"
  fi
done

#update ownerReferences for wmla resource plans
wmla_rps=`oc get rp -o name`
ns_rp=`oc get rp platform -o jsonpath={.spec.parent}`
wmla_fix_rp=`oc get rp platform -o jsonpath={.spec.children[0].name}`
cpd_fix_rp=`oc get rp platform -o jsonpath={.spec.children[1].name}`
for r in $wmla_rps; do
  #skip scheduler created resource plans
  rp_name=`echo $r|awk -F'/' '{print $2}'`
  if [ x$rp_name == "xplatform" -o x$rp_name == "x$ns_rp" -o x$rp_name == "x$wmla_fix_rp" -o x$rp_name == "x$cpd_fix_rp" ];then
    echo "skip resource plan $rp_name"
    continue
  fi
  echo "Patch ownerReferences for $r"
  oc patch $r --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla\",\"name\":\"$wmla_name\",\"uid\":\"$wmla_uid\"}]}}"
done

#update ownerReferences for deploy/isd and isd/service
isds=`oc get deploy -o name|grep wmla-edi-isd`
imd_uid=`oc get deploy wmla-edi-imd -o jsonpath='{.metadata.uid}'`
for r in $isds; do
  echo "Patch ownerReferences for $r"
  oc patch $r --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"apps/v1\",\"blockOwnerDeletion\":true,\"controller\":true,\"kind\":\"Deployment\",\"name\":\"wmla-edi-imd\",\"uid\":\"$imd_uid\"}]}}"
done
isd_servicess=`oc get services -o name|grep wmla-edi-isd`
for r in $isd_servicess; do
  isd_name=`echo $r|awk -F/ '{print $NF}'`
  isd_uid=`oc get deploy $isd_name -o jsonpath='{.metadata.uid}'`
  echo "Patch ownerReferences for $r"
  oc patch $r --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"apps/v1\",\"blockOwnerDeletion\":true,\"controller\":true,\"kind\":\"Deployment\",\"name\":\"$isd_name\",\"uid\":\"$isd_uid\"}]}}"
done

#update ownerReferences for wmla-add-on cm
wmla_add_on_name=`oc get wmla-add-on -o name|awk -F/ '{print $NF}'`
if [ x$wmla_add_on_name != x ];then
  wmla_add_on_uid=`oc get wmla-add-on $wmla_add_on_name -o jsonpath='{.metadata.uid}'`
  oc patch configmap/cpd-wmla-add-on-br-cm --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla-add-on\",\"name\":\"$wmla_add_on_name\",\"uid\":\"$wmla_add_on_uid\"}]}}"
  wmla_instance_cm=`oc get cm -o name|grep wml-accelerator-instance-cm`
  oc patch $wmla_instance_cm --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla-add-on\",\"name\":\"$wmla_add_on_name\",\"uid\":\"$wmla_add_on_uid\"}]}}"
  wmla_connection_cm=`oc get cm -o name|grep wml-accelerator-connection-info-extension`
  if [[ 'x' != x"$wmla_connection_cm" ]];then
    oc patch $wmla_connection_cm --type merge -p "{\"metadata\":{\"ownerReferences\":[{\"apiVersion\":\"spectrumcomputing.ibm.com/v1\",\"kind\":\"Wmla-add-on\",\"name\":\"$wmla_add_on_name\",\"uid\":\"$wmla_add_on_uid\"}]}}"
  fi
fi

retry=1
crash='Y'
MAX_RETRY=60
echo "checking mongodb status..."
oc get po|grep -E 'wmla-mongodb-1|wmla-mongodb-2'|grep CrashLoopBackOff
until [[ $? == '0' ]]; do
  if [ $retry -ge $MAX_RETRY ]; then
    crash='N'
    echo "Not found mongodb pod crash"
    oc get po|grep -E 'wmla-mongodb'
    break
  fi
  sleep 1
  let "retry += 1"
  oc get po|grep -E 'wmla-mongodb-1|wmla-mongodb-2'|grep CrashLoopBackOff
done
if [ $crash == 'Y' ]; then
  echo "Found mongodb pod crash"
  oc get po|grep -E 'wmla-mongodb'
  oc scale --replicas 0 sts wmla-mongodb
  oc delete pvc data-wmla-mongodb-1 data-wmla-mongodb-2
  oc scale --replicas=3 sts wmla-mongodb
fi

#workaround for etcd unhealthy, check 18 times (the result is from the last 3 times)
MAX_RETRY=18
echo "checking wmla-etcd status (in $MAX_RETRY rounds)..."
dead_node=""
result=0
healthy_node_count=0
ETCD_CHECK_HEALTH="oc exec -it wmla-etcd-0 -c etcd -- etcdctl --cacert=/etc/pki/etcd/ca.crt --cert=/etc/pki/etcd/tls.crt --key=/etc/pki/etcd/tls.key --endpoints=https://wmla-etcd:2379 endpoint health --cluster"
for j in $(seq 1 $MAX_RETRY); do
  $ETCD_CHECK_HEALTH > /tmp/wmla-etcd-check-result
  if [ "$?" != "0" ];then
    result=$(($result+1))
    if [ $j -lt 9 ]; then
      continue
    fi
    healthy_node_count=$(cat /tmp/wmla-etcd-check-result|grep -w "healthy"|wc -l)
    if [ $result -ge 3 -a $healthy_node_count -ge 2 ]; then
      dead_node=$(cat /tmp/wmla-etcd-check-result|grep "is unhealthy"|awk '{print substr($1,9,11)}')
      break
    fi
  else
    echo "$j round: Not found unhealthy in wmla-etcd pods"
    result=0
    healthy_node_count=0
  fi
  sleep 20
done
if [ "x$dead_node" != "x" ];then
  echo "Found unhealthy in wmla-etcd pods:"
  cat /tmp/wmla-etcd-check-result
  if [ "x$dead_node" != "x" ];then
    echo "== restoring unhealthy pod $dead_node"
    oc exec $dead_node -- rm -f /var/run/etcd/$dead_node.etcd/_recovered 2>/dev/null
    # to workaround failed node DNS not ready issue (the probe will be recovered when operator restarting)
    oc set probe sts/wmla-etcd --remove --readiness --liveness
    oc set env statefulset/wmla-etcd WMLA_ETCD_FAILURE_NODE=$dead_node
    oc delete pod $dead_node
  fi
elif [ $result -ge 3 ]; then
  echo "WARN: Found unhealthy in wmla-etcd pods, but wmla-etcd is not ready for fixing, wait a while and rerun this tool."
fi
rm -f /tmp/wmla-etcd-check-result
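After the script completes, you can spot-check that the owner references were applied. This is a minimal sketch, not part of the restore script; it assumes the default resource names that the script above patches (for example, the wmla-gui deployment).
  # Confirm that the Wmla custom resource is present.
  oc get wmla -o name
  # Spot-check a patched resource; the owner reference should name the Wmla custom resource.
  oc get deployment.apps/wmla-gui -o jsonpath='{.metadata.ownerReferences[0].kind}/{.metadata.ownerReferences[0].name}{"\n"}'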
Offline backup and restore of Watson Machine Learning Accelerator
Offline backup
Before you complete an offline backup of the Watson Machine Learning Accelerator service using the standard backup process, you must stop all running workloads.
- Stop all running workloads.
  - As a Watson Machine Learning Accelerator project administrator, stop all running jobs from the Watson Machine Learning Accelerator console:
    - Log in to the Watson Machine Learning Accelerator console as a project administrator.
    - Navigate to .
    - For each running application, select the menu icon and click Stop.
  - Stop all running deployed models. Use the Watson Machine Learning Accelerator console or the command-line interface to stop each running model; see Stop an inference service. A sketch for verifying that no workload pods remain follows this list.
- Back up the Watson Machine Learning Accelerator service using the standard backup process. See: https://www.ibm.com/docs/en/cloud-paks/cp-data/4.7.x?topic=project-backing-up
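Before you take the backup, you can optionally confirm that no workload pods are left running. This is a hedged sketch: it assumes that wmla-namespace is your Watson Machine Learning Accelerator project, and it uses the wmla-edi-isd prefix, which the restore scripts in this topic use to identify per-model inference-service deployments.
  # List everything that is still running in the project so that you can confirm
  # that only the core wmla-* service pods remain.
  oc get pods -n wmla-namespace --field-selector=status.phase=Running
  # Review any per-model inference-service deployments that are still present.
  oc get deploy -n wmla-namespace -o name | grep wmla-edi-isd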
Offline restore
After you restore the Watson Machine Learning Accelerator service using the Cloud Pak for Data restore process, you must run an additional script to restore owner references to all Watson Machine Learning Accelerator resources.
Before you begin:
Before performing an offline restore, make sure that the Watson Machine Learning Accelerator namespace is deleted.
- Log in to your OpenShift cluster as a project administrator:
  oc login OpenShift_URL:port
- Switch to the Watson Machine Learning Accelerator namespace:
  oc project wmla-namespace
- To restore owner references to Watson Machine Learning Accelerator resources, run the owner-reference restore script that is provided in the online restore procedure earlier in this topic, depending on your version of Watson Machine Learning Accelerator. The script is identical for online and offline restore.
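After the restore script completes, you can check the stateful components before you review the known issues in the next section. This is a minimal sketch, assuming the default pod names used throughout this topic.
  # etcd and MongoDB are the components that the known issues below can affect.
  oc get pods -n wmla-namespace | grep -E 'wmla-etcd|wmla-mongodb'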
After restoring Watson Machine Learning Accelerator
After restoring Watson Machine Learning Accelerator, make sure to address the following known issues.
Known issue:
After restoring Watson Machine Learning Accelerator, a known issue exists where the wmla-etcd cluster is unhealthy and fails to get the status of an endpoint.
- Run the following command to check the status of the wmla-etcd cluster:
  oc exec -it wmla-etcd-0 -- bash -c "ETCDCTL_API=3 etcdctl --cacert=/etc/pki/etcd/ca.crt --cert=/etc/pki/etcd/tls.crt --key=/etc/pki/etcd/tls.key --insecure-skip-tls-verify endpoint status --cluster"
  The following error is displayed if the cluster is unhealthy:
  Defaulted container "etcd" out of: etcd, init-data-dir (init)
  {"level":"warn","ts":"2023-06-19T12:01:20.476Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://wmla-etcd-2.wmla-etcd:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp: lookup wmla-etcd-2.wmla-etcd on 172.30.0.10:53: no such host\""}
  Failed to get the status of endpoint https://wmla-etcd-2.wmla-etcd:2379 (context deadline exceeded)
  https://wmla-etcd-1.wmla-etcd:2379, 968d327db883b4b4, 3.3.27, 9.0 MB, true, 2715, 2044
  https://wmla-etcd-0.wmla-etcd:2379, f5b85a4577d2c8db, 3.3.27, 9.1 MB, false, 2715, 2044
- Modify the wmla-etcd statefulset so that the failed pod is ready for maintenance:
  oc edit statefulset wmla-etcd
  - Remove the livenessProbe and readinessProbe by removing the following lines:
    livenessProbe:
      failureThreshold: 3
      initialDelaySeconds: 60
      periodSeconds: 30
      successThreshold: 1
      tcpSocket:
        port: 2379
      timeoutSeconds: 1
    readinessProbe:
      failureThreshold: 3
      initialDelaySeconds: 10
      periodSeconds: 20
      successThreshold: 1
      tcpSocket:
        port: 2379
      timeoutSeconds: 1
  - Modify the container as follows:
    containers:
    - command:
      - /bin/sh
      - -c
      - |
        PEERS="wmla-etcd-0=https://wmla-etcd-0.wmla-etcd:2380,wmla-etcd-1=https://wmla-etcd-1.wmla-etcd:2380,wmla-etcd-2=https://wmla-etcd-2.wmla-etcd:2380"
        ETCD_INITIAL_CLUSTER_STATE="new"
        if [ "$WMLA_ETCD_FAILURE_NODE" == "$HOSTNAME" -a ! -f /var/run/etcd/${HOSTNAME}.etcd/_recovered ]; then
          rm -rf /var/run/etcd/${HOSTNAME}.etcd
          echo "Restore ${HOSTNAME} in mainteance ..."
          ETCD_INITIAL_CLUSTER_STATE="existing"
          sleep 5
          mkdir -p /var/run/etcd/${HOSTNAME}.etcd
          touch /var/run/etcd/${HOSTNAME}.etcd/_recovered
        fi
        exec etcd --name ${HOSTNAME} \
          --listen-peer-urls https://0.0.0.0:2380 \
          --listen-client-urls https://0.0.0.0:2379 \
          --advertise-client-urls https://${HOSTNAME}.wmla-etcd:2379 \
          --initial-advertise-peer-urls https://${HOSTNAME}:2380 \
          --initial-cluster-token wmla-etcd-cluster \
          --initial-cluster ${PEERS} \
          --initial-cluster-state ${ETCD_INITIAL_CLUSTER_STATE} \
          --data-dir /var/run/etcd/${HOSTNAME}.etcd \
          --cert-file=/etc/pki/etcd/tls.crt \
          --key-file=/etc/pki/etcd/tls.key \
          --trusted-ca-file=/etc/pki/etcd/ca.crt \
          --client-cert-auth \
          --peer-cert-file=/etc/pki/etcd/tls.crt \
          --peer-key-file=/etc/pki/etcd/tls.key \
          --peer-trusted-ca-file=/etc/pki/etcd/ca.crt \
          --peer-client-cert-auth \
          --quota-backend-bytes=8589934592 \
          --cipher-suites TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
- Set the WMLA_ETCD_FAILURE_NODE environment variable to wmla-etcd-2 for the pod that failed:
  oc set env statefulset/wmla-etcd WMLA_ETCD_FAILURE_NODE=wmla-etcd-2
- Restart the failed pod:
  oc delete pod wmla-etcd-2
- Run the following command to check the status of the wmla-etcd cluster:
  oc exec -it wmla-etcd-0 -- bash -c "ETCDCTL_API=3 etcdctl --cacert=/etc/pki/etcd/ca.crt --cert=/etc/pki/etcd/tls.crt --key=/etc/pki/etcd/tls.key --insecure-skip-tls-verify endpoint status --cluster"
  The following is displayed for a healthy cluster:
  Defaulted container "etcd" out of: etcd, init-data-dir (init)
  https://wmla-etcd-1.wmla-etcd:2379, 968d327db883b4b4, 3.3.27, 1.4 MB, true, 2815, 2094
  https://wmla-etcd-2.wmla-etcd:2379, cc26316c8c459e22, 3.3.27, 2.7 MB, false, 2815, 2094
  https://wmla-etcd-0.wmla-etcd:2379, f5b85a4577d2c8db, 3.3.27, 1.4 MB, false, 2815, 2094
Known issue:
After restoring Watson Machine Learning Accelerator, the wmla-mongodb-1 or wmla-mongodb-2 pod may fail to start.
If this issue has occurred, complete the following steps to start the pods. Depending on the status of the cluster, this procedure may take several minutes to complete.
- Scale down the MongoDB service to 1 replica:
  oc scale --replicas=1 sts wmla-mongodb -n <wmla_instance_namespace>
- Wait for the MongoDB pods to scale down and stabilize.
- Remove the wmla-mongodb-1 and wmla-mongodb-2 PVCs. Do not delete the wmla-mongodb-0 PVC.
  oc delete pvc data-wmla-mongodb-1 data-wmla-mongodb-2 -n <wmla_instance_namespace>
- Scale the MongoDB service back up to 3 replicas:
  oc scale --replicas=3 sts wmla-mongodb -n <wmla_instance_namespace>
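When the statefulset is scaled back up, you can confirm that all three MongoDB pods return to a Running state. This is a minimal check, assuming the default statefulset name that is used in the steps above.
  # All three replicas should eventually report as ready.
  oc get sts wmla-mongodb -n <wmla_instance_namespace>
  oc get pods -n <wmla_instance_namespace> | grep wmla-mongodb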