Replace cluster nodes
If a master, storage, or compute node fails, a Watson Studio Local administrator can manually replace it by using the command line.
Tasks that you can do:
- Replace a master node by command line
- Replace a storage node by command line
- Replace a compute node by command line
Replace a master node by command line
If you have a three-node configuration and need to replace an old or faulty master node with a new one, complete the following steps:
- Copy the following folders from a working master node to the new master node:
  - /wdp
  - /etc/kubernetes

Bash command for a folder:

```
scp -r root@IP_OF_GOOD_MASTER:/PATH_TO_FOLDER PATH_TO_SAVE_FOLDER
```

Example command:

```
scp -r root@9.30.10.92:/wdp /
```

Bash command for a file:

```
scp root@IP_OF_GOOD_MASTER:/PATH_TO_FILE PATH_TO_SAVE_FOLDER
```

Example command:

```
scp -r root@9.30.10.92:/etc/kubernetes /etc/
```
- On the new master node, add the local repository by entering the following command:

```
cat <<EOF > /etc/yum.repos.d/wdp_local.repo
[WDP_Local]
name = WDP_Local
baseurl = file:///wdp/wdp-repo-rhel7
gpgcheck = 0
EOF
```

- Install kubectl and all other packages:

```
yum install -y nfs-utils net-tools ebtables socat lvm2 yum-utils glusterfs-server docker-engine kubectl.x86_64 kubelet.x86_64 kubernetes-cni.x86_64 iptables-services haproxy keepalived jq
```

- Remove the local repository:

```
yum clean all
rm -f /etc/yum.repos.d/wdp_local.repo
```

- Create the new /var/etcd and /var/etcd/data directories on the new master node:

```
mkdir /var/etcd
mkdir /var/etcd/data
```

- Copy the kubelet.service file (/etc/systemd/system/kubelet.service) from one of the working master nodes. Change the `BIND: IP` line to `BIND: {NEWNODE-IP}`.
- Copy the /var/etcd/data/etcd_kube.sh script to the same location on the new master node.
- In the /etc/kubernetes/manifests/etcd.yaml file, edit the /var/etcd/etcd_kube.sh line:
  - Update every old node IP address to the new node IP address.
  - Set the initial cluster state to `existing`.

Example of the final changes:

```
/var/etcd/etcd_kube.sh 4 "9.87.654.321" "etcd1=http://9.87.654.322:2380,etcd2=http://9.87.654.323:2380,etcd4=http://9.87.654.321:2380" existing
```
- Start Docker:

```
systemctl enable docker
systemctl start docker
```

- Preload the Docker images:

```
ls /wdp/DockerImages | awk '{system("docker load -i /wdp/DockerImages/"$1)}'
```

- On one of the working master nodes, find the failed etcd pod and its ID in the log:

```
kubectl logs {ETCD} -n=kube-system
```

where {ETCD} represents the pod name of an etcd member. If the ID cannot be found in the logs, enter the following command to list all of the etcd members:

```
kubectl exec -it $(kubectl get po --all-namespaces | grep etcd | grep Running | head -n 1 | awk '{print $2}') -n=kube-system -- etcdctl member list
```

In the output, the first field is the ID. Example:

```
2a9a0a84da3f2511: name=etcd3 peerURLs=http://9.87.654.321:2380 clientURLs=http://127.0.0.1:2379,http://9.87.654.321:2379
```

- Remove the old etcd member:

```
kubectl exec -it {ETCD} -n=kube-system -- etcdctl member remove {ETCD-ID}
```

where {ETCD} represents the full pod name of a working etcd member, and {ETCD-ID} represents the ID you obtained.
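A hypothetical example, combining the member ID from the sample output above with the sample pod name that appears in the next step:

```
kubectl exec -it etcd-server-ettin-master-1.fyre.ibm.com -n=kube-system -- etcdctl member remove 2a9a0a84da3f2511
```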
- Add the new etcd member:

```
kubectl exec -it {ETCD} -n=kube-system -- etcdctl member add ETCD-NAME http://ETCD-IP:2380
```

Example:

```
kubectl exec -it etcd-server-ettin-master-1.fyre.ibm.com -n=kube-system -- etcdctl member add etcd4 http://9.87.654.321:2380
```

- On the new master node, start kubelet:

```
systemctl enable kubelet
systemctl start kubelet
```

- Verify that the new master node was added:

```
kubectl get no
```

If kubelet did not start up, verify that etcd is running:

```
docker ps -a | grep etcd
```

If etcd is not running, enter the following commands to restart Docker:

```
systemctl stop docker
rm -rf /run/docker
systemctl start docker
systemctl start kubelet
```

Troubleshooting tip: If Watson Studio Local connected to the server but the connection to localhost:8080 was refused, correct the host and port in the /etc/kubernetes/manifests/etcd.yaml file, enter the following commands, and then check again:

```
kubectl exec -it {ONE_OF_RUNNING_ETCD_POD_NAME} -n=kube-system -- etcdctl member remove {WRONG_ETCD_HASH_CODE}
kubectl exec -it {ONE_OF_RUNNING_ETCD_POD_NAME} -n=kube-system -- etcdctl member add etcd4 http://{CORRECT_NEW_NODE_IP}:2380
```

Then restart etcd and kubelet:

```
pkill etcd
systemctl stop kubelet
systemctl start kubelet
```

You can also check the log for the error on the working master node:

```
kubectl logs etcd-server-ettin-9.87.654.321 -n=kube-system
```

where etcd-server-ettin-9.87.654.321 represents the pod name returned by `kubectl get po --all-namespaces | grep etcd | grep Running`.

- To configure Keepalived on the new master node, copy over the /etc/keepalived/keepalived.conf file from a working master node.
- In the keepalived.conf file, edit the following two lines:
  - `state` should be `state MASTER` for the first master node, and `state BACKUP` for the backup master nodes.
  - `priority` should be `priority 102` for the first master node, and `priority 101` for the backup master nodes.

Example:

```
state BACKUP
interface eth0
virtual_router_id 91
priority 101
```
- To configure haproxy on the new master node, copy over /etc/haproxy/haproxy.cfg from a working master node to the same location on the new master node.
- In haproxy.cfg, change the IP of the old master node to the IP of the new master node:

```
backend app
    balance roundrobin
    server http1 9.87.654.321:6443 check
    server http2 9.87.654.322:6443 check
    server http3 9.87.654.323:6443 check
```

Repeat this step for all of the other master nodes as well.
- If the DNS is not set up to handle the domain: On the new master node, edit the /etc/hosts file to add the IP and domain pair of every node. In the /etc/hosts file of each working master node, add the IP and domain pair of the new master node. Example:

```
9.87.654.321 high-io-1-proxy.ibm.com high-io-1-proxy
9.87.654.322 high-io-1-master-1.ibm.com high-io-1-master-1
9.87.654.323 high-io-1-master-2.ibm.com high-io-1-master-2
9.87.654.324 replace-new-master.ibm.com replace-new-master
```

- Remove the old master node from kubelet and weave (the node name can be found in `kubectl get no`):

```
kubectl delete node NODE_NAME
docker cp $(docker ps -f name=weave | grep -v -E "(npc|pause|^CONTAINER)" | awk '{print $1}'):/home/weave/weave weave
./weave rmpeer OLD_NODE_IP
./weave connect NEW_NODE_IP
rm weave
```
- In /wdp/config, remove the old master node and add the new master node information using the following format:

```
{NEW_MASTER_IP} {M#} WDP_PLACEHOLDER {INSTALL_FOLDER} {FQDN} {HOSTNAME}
```

where:
  - NEW_MASTER_IP: The IP address of the new master node
  - M#: The number of the master node being replaced (Master 1 = M1, Master 2 = M2, Master 3 = M3)
  - INSTALL_FOLDER: The path to the installation folder
  - FQDN: The fully qualified domain name of the new master node
  - HOSTNAME: The host name of the new master node
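A hypothetical example entry, reusing the replacement node from the /etc/hosts example above and assuming it replaces master 3 and that the installation folder is /ibm:

```
9.87.654.324 M3 WDP_PLACEHOLDER /ibm replace-new-master.ibm.com replace-new-master
```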
- On the new master node, start GlusterFS:

```
systemctl enable glusterd
systemctl start glusterd
```

- From one of the working master nodes, run the following command:

```
gluster peer probe IP_OF_NEW_MASTER_NODE
```

- Save the following code segment, which replaces the bricks, into a script on one of the master nodes inside the cluster:

```
#!/usr/bin/env bash
# Replace all the gluster bricks that are linked to the old storage IP
# so that they point to the new storage IP.
# Only the storage node that is being replaced can be down.
# Note that this script does not heal the volumes.
if [[ $# -ne 2 ]]; then
    echo
    echo " Usage: "$(basename $0)" old_storage_node_ip new_storage_node_ip"
    echo
    exit 1
fi
old_ip=$1
new_ip=$2
volumes=$(gluster volume info | grep 'Volume Name')
IFS=$'\n'
for volume in $volumes ; do
    volume=$(echo ${volume} | awk -F ':' '{print $2}' | awk '{print $1}')
    info=$(gluster volume info "${volume}" | grep ${old_ip})
    if [[ $? -eq 0 ]]; then
        brick=$(echo ${info} | awk '{print $2}')
        new_brick=$(echo ${brick} | sed "s|${old_ip}|${new_ip}|g")
        gluster volume replace-brick "${volume}" "${brick}" "${new_brick}" commit force
    fi
done
```

- Grant the script executable privileges:

```
chmod +x scriptname.sh
```

- Run the script to replace the bricks:

```
./scriptname.sh old_storage_node_ip new_storage_node_ip
```

where old_storage_node_ip represents the old master node IP address and new_storage_node_ip represents the new master node IP address.

- On the working master node, heal the gluster volumes by entering the following commands:

```
volumes=$(gluster volume info | grep 'Volume Name')
IFS=$'\n'
for volume in $volumes ; do
    volume=$(echo ${volume} | awk -F ':' '{print $2}' | awk '{print $1}')
    gluster volume heal "$volume"
done
```
- On the same master node, update mongo by entering the following commands:

```
/wdp/k8s/wdp-deploy-dashboard/k8s/mongoDelete.sh sysibm-adm 0 NUMBER
/wdp/k8s/wdp-deploy-dashboard/k8s/mongoCreate.sh sysibm-adm 0 NUMBER
```

where NUMBER represents the index at which the master node information can be found in /wdp/config, and 0 is the first index. Example:

```
/wdp/k8s/wdp-deploy-dashboard/k8s/mongoDelete.sh sysibm-adm 0 0
/wdp/k8s/wdp-deploy-dashboard/k8s/mongoCreate.sh sysibm-adm 0 0
```

- On a working master node, replace the gluster services and endpoints for the new node by running the following script:

```
glusterName="$( kubectl get svc --all-namespaces | grep gluster | awk '{print $1}')"
kubectl get svc --all-namespaces | grep gluster | awk '{system("kubectl delete svc -n "$1" "$2)}'
echo "$glusterName" | awk '{system("/wdp/k8s/gluster-endpoints/createCommonEndpointSvcYamlFile.sh /wdp/config "$1)}'
```

- Verify whether the new node IP address is updated for the gluster endpoints by entering the following command:

```
kubectl get ep --all-namespaces | grep gluster
```
- On a working master node, prepare the new certificate license:

```
/wdp/scripts/crtmastercert.sh {PROXY_IP} {MASTER_1} {MASTER_2} {MASTER_3}
```

From the master node used to run the above script, copy all of the files under /etc/kubernetes/ssl/ to all other nodes.

- Restart the API server on all master nodes by running the following command on each node:

```
pkill hyperkube
```

- Load the new certificate for pods by deleting all secrets:

```
kubectl get secrets --all-namespaces | grep default-token | awk '{system("kubectl delete secrets -n "$1" "$2)}'
```

- Restart all nodes in the cluster, including the new master node.
- Shut down the proxy node:

```
shutdown -h now
```

If you are redirected when trying to SSH to the proxy node, shut down Keepalived on all of the master nodes:

```
systemctl stop keepalived
```

then SSH to the proxy node and shut it down. Afterward, if Keepalived was shut down, bring it back up on all master nodes:

```
systemctl start keepalived
```

- Ensure all nodes are in the ready state:

```
kubectl get no
```

If any nodes are down, enter the following command until all master nodes are in the ready state:

```
systemctl restart kubelet
```

- Restart all pods by entering the following commands in order. Ensure the related pods are up before moving on to the next command. (To just check the status of the related pods, remove the trailing awk command from the pipeline; see the example after this step.)

```
kubectl get po --all-namespaces -o wide | grep kube-system | grep -v kube-apiserver | grep -v weave | grep -v dns | awk '{system("kubectl delete po -n "$1" "$2" --grace-period=0 --force")}'
kubectl get po --all-namespaces -o wide | grep kube-system | grep weave | awk '{system("kubectl delete po -n "$1" "$2" --grace-period=0 --force")}'
kubectl get po --all-namespaces -o wide | grep kube-system | grep dns | awk '{system("kubectl delete po -n "$1" "$2" --grace-period=0 --force")}'
kubectl get po --all-namespaces -o wide | grep docker | awk '{system("kubectl delete po -n "$1" "$2" --grace-period=0 --force")}'
kubectl get po --all-namespaces -o wide | grep redis | awk '{system("kubectl delete po -n "$1" "$2" --grace-period=0 --force")}'
kubectl get po --all-namespaces -o wide | grep cloudant | awk '{system("kubectl delete po -n "$1" "$2" --grace-period=0 --force")}'
kubectl get po --all-namespaces -o wide | grep -v kube-system | grep -v redis | grep -v cloudant | grep -v docker | grep -v nginx | awk '{system("kubectl delete po -n "$1" "$2" --grace-period=0 --force")}'
kubectl get po --all-namespaces -o wide | grep nginx | awk '{system("kubectl delete po -n "$1" "$2" --grace-period=0 --force")}'
```

Troubleshooting tip: If you are unable to bring up all pods, redo the previous two steps.
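For example, to check the status of the first group of related pods without deleting anything, keep the pipeline but drop the awk portion:

```
kubectl get po --all-namespaces -o wide | grep kube-system | grep -v kube-apiserver | grep -v weave | grep -v dns
```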
- In the /etc/kubernetes/manifests/etcd.yaml file on the other two master nodes, edit the /var/etcd/etcd_kube.sh line to update every old master node IP address to the new master node IP address. Example of the final change:

```
- /var/etcd/etcd_kube.sh 1 "9.87.654.321" "etcd1=http://9.87.654.322:2380,etcd2=http://9.87.654.323:2380,etcd4=http://9.87.654.321:2380"
```

- Label the new master node:

```
kubectl label no {NEW_NODE_NAME} nodetype=control
kubectl label no {NEW_NODE_NAME} is_control=true
kubectl label no {NEW_NODE_NAME} is_compute=false
kubectl label no {NEW_NODE_NAME} is_storage=false
```
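Optionally (this check is not part of the original procedure), you can confirm that the labels were applied:

```
kubectl get no {NEW_NODE_NAME} --show-labels
```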
Replace a storage node by command line
Requirement: You can replace a storage node only when just one storage node is down.
If you need to replace an old or faulty storage node with a new one, complete the following steps:
- Shut down the storage node to replace.
- Use the mkdir command on the new storage node to create a new storage directory. The new storage path must match the old storage path, for example, /data.
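For example, if the old storage path was /data, create the matching directory on the new node:

```
mkdir /data
```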
- Copy the following folders from a working storage node to the new storage node:
  - /wdp
  - /etc/kubernetes

Bash command for a folder:

```
scp -r root@IP_GOOD_STORAGE:/PATH_TO_FOLDER PATH_TO_SAVE_FOLDER
```

Example command:

```
scp -r root@9.87.654.321:/wdp /
```

Bash command for a file:

```
scp root@IP_GOOD_STORAGE:/PATH_TO_FILE PATH_TO_SAVE_FOLDER
```

Example command:

```
scp -r root@9.87.654.321:/etc/kubernetes /etc/
```
- On the new storage node, add the local repository by entering the following command:

```
cat <<EOF > /etc/yum.repos.d/wdp_local.repo
[WDP_Local]
name = WDP_Local
baseurl = file:///wdp/wdp-repo-rhel7
gpgcheck = 0
EOF
```

- Install kubectl and all other packages:

```
yum install -y nfs-utils net-tools ebtables socat lvm2 yum-utils glusterfs-server docker-engine kubectl.x86_64 kubelet.x86_64 kubernetes-cni.x86_64 iptables-services
```

- Remove the local repository:

```
yum clean all
rm -f /etc/yum.repos.d/wdp_local.repo
```

- Copy the kubelet.service file (/etc/systemd/system/kubelet.service) from one of the working storage nodes.
- Create the SSL certificate on the new storage node by running the /wdp/scripts/crtworkercert.sh script.
- Start Docker:

```
systemctl enable docker
systemctl start docker
```

- Preload the Docker images:

```
ls /wdp/DockerImages | awk '{system("docker load -i /wdp/DockerImages/"$1)}'
```

- On the new storage node, start GlusterFS:

```
systemctl enable glusterd
systemctl start glusterd
```

- On one of the original storage nodes, run the following command:

```
gluster peer probe IP_OF_NEW_STORAGE_NODE
```
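Optionally (this check is not part of the original procedure), you can confirm that the probe succeeded before continuing:

```
gluster peer status
```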
- On the new storage node, start kubelet:

```
systemctl enable kubelet
systemctl start kubelet
```

- Save the following code segment, which replaces the bricks, into a script on one of the storage nodes inside the cluster:

```
#!/usr/bin/env bash
# Replace all the gluster bricks that are linked to the old storage IP
# so that they point to the new storage IP.
# Only the storage node that is being replaced can be down.
# Note that this script does not heal the volumes.
if [[ $# -ne 2 ]]; then
    echo
    echo " Usage: "$(basename $0)" old_storage_node_ip new_storage_node_ip"
    echo
    exit 1
fi
old_ip=$1
new_ip=$2
volumes=$(gluster volume info | grep 'Volume Name')
IFS=$'\n'
for volume in $volumes ; do
    volume=$(echo ${volume} | awk -F ':' '{print $2}' | awk '{print $1}')
    info=$(gluster volume info "${volume}" | grep ${old_ip})
    if [[ $? -eq 0 ]]; then
        brick=$(echo ${info} | awk '{print $2}')
        new_brick=$(echo ${brick} | sed "s|${old_ip}|${new_ip}|g")
        gluster volume replace-brick "${volume}" "${brick}" "${new_brick}" commit force
    fi
done
```

- Grant the script executable privileges:

```
chmod +x scriptname.sh
```

- Run the script to replace the bricks:

```
./scriptname.sh old_storage_node_ip new_storage_node_ip
```

where old_storage_node_ip represents the old storage node IP address and new_storage_node_ip represents the new storage node IP address.

- On the working storage node, heal the gluster volumes by entering the following commands:

```
volumes=$(gluster volume info | grep 'Volume Name')
IFS=$'\n'
for volume in $volumes ; do
    volume=$(echo ${volume} | awk -F ':' '{print $2}' | awk '{print $1}')
    gluster volume heal "$volume"
done
```
- On one of the master nodes, edit /wdp/config to remove the line that contains the old storage node. Replace it with a new line that contains the new storage node IP address, domain name, and hostname. The following example replaces storage node 1 (dsx-local-storage-1) with a new storage node (dsx-local-newstorage-1):

OLD FILE:

```
storage_group_start
987.16.163.7 WDP_PLACEHOLDER /data /ibm dsx-local-storage-1.ibm.com dsx-local-storage-1
987.65.432.237 WDP_PLACEHOLDER /data /ibm dsx-local-storage-2.ibm.com dsx-local-storage-2
987.65.432.238 WDP_PLACEHOLDER /data /ibm dsx-local-storage-3.ibm.com dsx-local-storage-3
storage_group_end
```

NEW EDIT:

```
storage_group_start
987.65.432.147 WDP_PLACEHOLDER /data /ibm dsx-local-newstorage-1.ibm.com dsx-local-newstorage-1
987.65.432.237 WDP_PLACEHOLDER /data /ibm dsx-local-storage-2.ibm.com dsx-local-storage-2
987.65.432.238 WDP_PLACEHOLDER /data /ibm dsx-local-storage-3.ibm.com dsx-local-storage-3
storage_group_end
```

- Copy the /wdp/config file from the master node to all other nodes.
- On the same master node, update mongo by entering the following commands:

```
/wdp/k8s/wdp-deploy-dashboard/k8s/mongoDelete.sh sysibm-adm 0 NUMBER
/wdp/k8s/wdp-deploy-dashboard/k8s/mongoCreate.sh sysibm-adm 0 NUMBER
```

where 0 is the first index and NUMBER represents the index at which the storage node information can be found in /wdp/config. For example:

```
/wdp/k8s/wdp-deploy-dashboard/k8s/mongoDelete.sh sysibm-adm 0 0
/wdp/k8s/wdp-deploy-dashboard/k8s/mongoCreate.sh sysibm-adm 0 0
```
- On the master node, replace the gluster services and endpoints for the new node by running the following script:

```
glusterName="$( kubectl get svc --all-namespaces | grep gluster | awk '{print $1}')"
kubectl get svc --all-namespaces | grep gluster | awk '{system("kubectl delete svc -n "$1" "$2)}'
echo "$glusterName" | awk '{system("/wdp/k8s/gluster-endpoints/createCommonEndpointSvcYamlFile.sh /wdp/config "$1)}'
```

- Verify whether the new node IP address is updated for the gluster endpoints by entering the following command:

```
kubectl get ep --all-namespaces | grep gluster
```

- Delete the old pods from the old storage node by entering the following command:

```
kubectl get po --all-namespaces | grep -E "NodeLost|Unknown" | awk '{system("kubectl delete pod --grace-period=0 --force -n="$1" "$2)}'
```

- Find the name of the old storage node to delete by entering the following command:

```
kubectl get no
```

- Delete the old storage node from the cluster:

```
kubectl delete no "NODE_NAME"
```

- On the master node, verify that the new storage node is running properly by entering the following commands:

```
kubectl get no
kubectl get po --all-namespaces
```

If all the nodes are Ready and all the pods are Running, then the storage node was replaced properly.
Replace a compute node by command line
Requirement: You can replace a compute node only when just one compute node is down.
If you need to replace an old or faulty compute node with a new one, complete the following steps:
- Shut down the compute node to replace.
- Copy the following folders from a working compute node to the new compute node:
  - /wdp
  - /etc/kubernetes

Bash command for a folder:

```
scp -r root@IP_GOOD_COMPUTE:/PATH_TO_FOLDER PATH_TO_SAVE_FOLDER
```

Example command:

```
scp -r root@9.87.654.321:/wdp /
```

Bash command for a file:

```
scp root@IP_GOOD_COMPUTE:/PATH_TO_FILE PATH_TO_SAVE_FOLDER
```

Example command:

```
scp -r root@9.87.654.321:/etc/kubernetes /etc/
```
- On the new compute node, add the local repository by entering the following command:

```
cat <<EOF > /etc/yum.repos.d/wdp_local.repo
[WDP_Local]
name = WDP_Local
baseurl = file:///wdp/wdp-repo-rhel7
gpgcheck = 0
EOF
```

- Install kubectl and all other packages:

```
yum install -y nfs-utils net-tools ebtables socat lvm2 yum-utils glusterfs-server docker-engine kubectl.x86_64 kubelet.x86_64 kubernetes-cni.x86_64 iptables-services
```

- Remove the local repository:

```
yum clean all
rm -f /etc/yum.repos.d/wdp_local.repo
```

- Copy the kubelet.service file (/etc/systemd/system/kubelet.service) from one of the working compute nodes.
- Create the SSL certificate on the new compute node by running the /wdp/scripts/crtworkercert.sh script.
- Start Docker:

```
systemctl enable docker
systemctl start docker
```

- Preload the Docker images:

```
ls /wdp/DockerImages | awk '{system("docker load -i /wdp/DockerImages/"$1)}'
```

- On the new compute node, start kubelet:

```
systemctl enable kubelet
systemctl start kubelet
```
- On one of the master nodes, edit /wdp/config to remove the line that contains the old compute node. Replace it with a new line that contains the new compute node IP address, domain name, and hostname. The following example replaces old compute node 1 (dsx-local-compute-1) with a new compute node (dsx-local-newcompute-1):

OLD FILE:

```
storage_group_start
987.65.163.7 WDP_PLACEHOLDER /data /ibm dsx-local-compute-1.ibm.com dsx-local-compute-1
987.65.432.237 WDP_PLACEHOLDER /data /ibm dsx-local-compute-2.ibm.com dsx-local-compute-2
987.65.432.238 WDP_PLACEHOLDER /data /ibm dsx-local-compute-3.ibm.com dsx-local-compute-3
storage_group_end
```

NEW EDIT:

```
storage_group_start
987.65.432.147 WDP_PLACEHOLDER /data /ibm dsx-local-newcompute-1.ibm.com dsx-local-newcompute-1
987.65.432.237 WDP_PLACEHOLDER /data /ibm dsx-local-compute-2.ibm.com dsx-local-compute-2
987.65.432.238 WDP_PLACEHOLDER /data /ibm dsx-local-compute-3.ibm.com dsx-local-compute-3
storage_group_end
```
- Copy the /wdp/config file from the master node to all other nodes.
- On the master node, replace the gluster services and endpoints for the new node by running the following script:

```
glusterName="$( kubectl get svc --all-namespaces | grep gluster | awk '{print $1}')"
kubectl get svc --all-namespaces | grep gluster | awk '{system("kubectl delete svc -n "$1" "$2)}'
echo "$glusterName" | awk '{system("/wdp/k8s/gluster-endpoints/createCommonEndpointSvcYamlFile.sh /wdp/config "$1)}'
```

- Verify whether the new node IP address is updated for the gluster endpoints by entering the following command:

```
kubectl get ep --all-namespaces | grep gluster
```
- Delete the old pods from the old compute node by entering the following command:

```
kubectl get po --all-namespaces | grep -E "NodeLost|Unknown" | awk '{system("kubectl delete pod --grace-period=0 --force -n="$1" "$2)}'
```

- Find the name of the old compute node to delete by entering the following command:

```
kubectl get no
```

- Delete the old compute node from the cluster:

```
kubectl delete no "NODE_NAME"
```

- On the master node, verify that the new compute node is running properly by entering the following commands:

```
kubectl get no
kubectl get po --all-namespaces
```

If all the nodes are Ready and all the pods are Running, then the compute node was replaced properly.