Replace cluster nodes
If a master, storage, or compute node fails, a Watson Studio Local administrator can manually replace it by using the command line.
Tasks that you can do:
- Replace a master node by command line
- Replace a storage node by command line
- Replace a compute node by command line
Replace a master node by command line
If you have a three-node configuration and need to replace an old or faulty master node with a new one, complete the following steps:
- Copy the following folders from a working master node to the new master node:
  - /wdp
  - /etc/kubernetes

Bash command for a folder:

```
scp -r root@IP_OF_GOOD_MASTER:/PATH_TO_FOLDER PATH_TO_SAVE_FOLDER
```

Example command:

```
scp -r root@9.30.10.92:/wdp /
```

Bash command for a file:

```
scp root@IP_OF_GOOD_MASTER:/PATH_TO_FILE PATH_TO_SAVE_FOLDER
```

Example command:

```
scp -r root@9.30.10.92:/etc/kubernetes /etc/
```
- On the new master node, add the local repository by entering the following command:

```
cat <<EOF > /etc/yum.repos.d/wdp_local.repo
[WDP_Local]
name = WDP_Local
baseurl = file:///wdp/wdp-repo-rhel7
gpgcheck = 0
EOF
```

- Install kubectl and all other packages:

```
yum install -y nfs-utils net-tools ebtables socat lvm2 yum-utils glusterfs-server docker-engine kubectl.x86_64 kubelet.x86_64 kubernetes-cni.x86_64 iptables-services haproxy keepalived jq
```

- Remove the local repository:

```
yum clean all
rm -f /etc/yum.repos.d/wdp_local.repo
```

- Create the new /var/etcd and /var/etcd/data directories on the new master node:

```
mkdir /var/etcd
mkdir /var/etcd/data
```

- Copy the kubelet.service file (/etc/systemd/system/kubelet.service) from one of the working master nodes. Change the `BIND: IP` line to `BIND: {NEWNODE-IP}`.
- Copy the /var/etcd/data/etcd_kube.sh script to the same location on the new master node.
- In the /etc/kubernetes/manifests/etcd.yaml file, edit the /var/etcd/etcd_kube.sh line:
  - Update every old node IP address to the new node IP address.
  - Set the initial cluster state to `existing`.

Example of the final changes:

```
/var/etcd/etcd_kube.sh 4 "9.87.654.321" "etcd1=http://9.87.654.322:2380,etcd2=http://9.87.654.323:2380,etcd4=http://9.87.654.321:2380" existing
```
- Start Docker:

```
systemctl enable docker
systemctl start docker
```

- Preload the Docker images:

```
ls /wdp/DockerImages | awk '{system("docker load -i /wdp/DockerImages/"$1)}'
```

- On one of the working master nodes, find the failed etcd pod and its ID in the log:

```
kubectl logs {ETCD} -n=kube-system
```

where {ETCD} represents the pod name of an etcd member. If the ID cannot be found in the logs, enter the following command to list all of the etcd members:

```
kubectl exec -it $(kubectl get po --all-namespaces | grep etcd | grep Running | head -n 1 | awk '{print $2}') -n=kube-system -- etcdctl member list
```

In the output, the first field is the ID. Example:

```
2a9a0a84da3f2511: name=etcd3 peerURLs=http://9.87.654.321:2380 clientURLs=http://127.0.0.1:2379,http://9.87.654.321:2379
```

- Remove the old etcd member:

```
kubectl exec -it {ETCD} -n=kube-system -- etcdctl member remove {ETCD-ID}
```

where {ETCD} represents the full pod name of a working etcd member, and {ETCD-ID} represents the ID you obtained.
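A hypothetical example, combining the member ID from the sample output above with the sample pod name that appears in the next step:

```
kubectl exec -it etcd-server-ettin-master-1.fyre.ibm.com -n=kube-system -- etcdctl member remove 2a9a0a84da3f2511
```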
- Add the new etcd member:

```
kubectl exec -it {ETCD} -n=kube-system -- etcdctl member add ETCD-NAME http://ETCD-IP:2380
```

Example:

```
kubectl exec -it etcd-server-ettin-master-1.fyre.ibm.com -n=kube-system -- etcdctl member add etcd4 http://9.87.654.321:2380
```

- On the new master node, start kubelet:

```
systemctl enable kubelet
systemctl start kubelet
```

- Verify that the new master node was added:

```
kubectl get no
```

If kubelet did not start up, verify that etcd is running:

```
docker ps -a | grep etcd
```

If etcd is not running, enter the following commands to restart Docker:

```
systemctl stop docker
rm -rf /run/docker
systemctl start docker
systemctl start kubelet
```

Troubleshooting tip: If Watson Studio Local connected to the server but the connection to localhost:8080 was refused, correct the host and port in the /etc/kubernetes/manifests/etcd.yaml file, enter the following commands, and then check again:

```
kubectl exec -it {ONE_OF_RUNNING_ETCD_POD_NAME} -n=kube-system -- etcdctl member remove {WRONG_ETCD_HASH_CODE}
kubectl exec -it {ONE_OF_RUNNING_ETCD_POD_NAME} -n=kube-system -- etcdctl member add etcd4 http://{CORRECT_NEW_NODE_IP}:2380
```

Then restart etcd and kubelet:

```
pkill etcd
systemctl stop kubelet
systemctl start kubelet
```

You can also check the log for the error on the working master node:

```
kubectl logs etcd-server-ettin-9.87.654.321 -n=kube-system
```

where etcd-server-ettin-9.87.654.321 represents the pod name returned by `kubectl get po --all-namespaces | grep etcd | grep Running`.

- To configure Keepalived on the new master node, copy over the /etc/keepalived/keepalived.conf file from a working master node.
- In the keepalived.conf file, edit the following two lines:
  - `state` should be `state MASTER` for the first master node, and `state BACKUP` for the backup master nodes.
  - `priority` should be `priority 102` for the first master node, and `priority 101` for the backup master nodes.

Example:

```
state BACKUP
interface eth0
virtual_router_id 91
priority 101
```
- To configure haproxy on the new master node, copy over /etc/haproxy/haproxy.cfg from a working master node to the same location on the new master node.
- In haproxy.cfg, change the IP of the old master node to the IP of the new master node:

```
backend app
    balance roundrobin
    server http1 9.87.654.321:6443 check
    server http2 9.87.654.322:6443 check
    server http3 9.87.654.323:6443 check
```

Repeat this step for all of the other master nodes as well.
- If the DNS is not set up to handle the domain: On the new master node, edit the /etc/hosts file to add the IP and domain pair of every node. In the /etc/hosts file of each working master node, add the IP and domain pair of the new master node. Example:

```
9.87.654.321 high-io-1-proxy.ibm.com high-io-1-proxy
9.87.654.322 high-io-1-master-1.ibm.com high-io-1-master-1
9.87.654.323 high-io-1-master-2.ibm.com high-io-1-master-2
9.87.654.324 replace-new-master.ibm.com replace-new-master
```

- Remove the old master node from kubelet and weave (the node name can be found in `kubectl get no`):

```
kubectl delete node NODE_NAME
docker cp $(docker ps -f name=weave | grep -v -E "(npc|pause|^CONTAINER)" | awk '{print $1}'):/home/weave/weave weave
./weave rmpeer OLD_NODE_IP
./weave connect NEW_NODE_IP
rm weave
```
- In /wdp/config, remove the old master node and add the new master node information using the following format:

```
{NEW_MASTER_IP} {M#} WDP_PLACEHOLDER {INSTALL_FOLDER} {FQDN} {HOSTNAME}
```

where:
  - NEW_MASTER_IP: The IP address of the new master node
  - M#: The number of the master node being replaced (Master 1 = M1, Master 2 = M2, Master 3 = M3)
  - INSTALL_FOLDER: The path to the installation folder
  - FQDN: The fully qualified domain name of the new master node
  - HOSTNAME: The host name of the new master node
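A hypothetical example entry, reusing the replacement node from the /etc/hosts example above and assuming it replaces master 3 and that the installation folder is /ibm:

```
9.87.654.324 M3 WDP_PLACEHOLDER /ibm replace-new-master.ibm.com replace-new-master
```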
- On the new master node, start GlusterFS:

```
systemctl enable glusterd
systemctl start glusterd
```

- From one of the working master nodes, run the following command:

```
gluster peer probe IP_OF_NEW_MASTER_NODE
```

- Save the following code segment, which replaces the bricks, into a script on one of the master nodes inside the cluster:

```
#!/usr/bin/env bash
# Replace all the gluster bricks that are linked to the old storage IP
# so that they point to the new storage IP.
# Only the storage node that is being replaced can be down.
# Note that this script does not heal the volumes.
if [[ $# -ne 2 ]]; then
    echo
    echo " Usage: "$(basename $0)" old_storage_node_ip new_storage_node_ip"
    echo
    exit 1
fi
old_ip=$1
new_ip=$2
volumes=$(gluster volume info | grep 'Volume Name')
IFS=$'\n'
for volume in $volumes ; do
    volume=$(echo ${volume} | awk -F ':' '{print $2}' | awk '{print $1}')
    info=$(gluster volume info "${volume}" | grep ${old_ip})
    if [[ $? -eq 0 ]]; then
        brick=$(echo ${info} | awk '{print $2}')
        new_brick=$(echo ${brick} | sed "s|${old_ip}|${new_ip}|g")
        gluster volume replace-brick "${volume}" "${brick}" "${new_brick}" commit force
    fi
done
```

- Grant the script executable privileges:

```
chmod +x scriptname.sh
```

- Run the script to replace the bricks:

```
./scriptname.sh old_storage_node_ip new_storage_node_ip
```

where old_storage_node_ip represents the old master node IP address and new_storage_node_ip represents the new master node IP address.

- On the working master node, heal the gluster volumes by entering the following commands:

```
volumes=$(gluster volume info | grep 'Volume Name')
IFS=$'\n'
for volume in $volumes ; do
    volume=$(echo ${volume} | awk -F ':' '{print $2}' | awk '{print $1}')
    gluster volume heal "$volume"
done
```
- On the same master node, update mongo by entering the following commands:

```
/wdp/k8s/wdp-deploy-dashboard/k8s/mongoDelete.sh sysibm-adm 0 NUMBER
/wdp/k8s/wdp-deploy-dashboard/k8s/mongoCreate.sh sysibm-adm 0 NUMBER
```

where NUMBER represents the index at which the master node information can be found in /wdp/config, and 0 is the first index. Example:

```
/wdp/k8s/wdp-deploy-dashboard/k8s/mongoDelete.sh sysibm-adm 0 0
/wdp/k8s/wdp-deploy-dashboard/k8s/mongoCreate.sh sysibm-adm 0 0
```

- On a working master node, replace the gluster services and endpoints for the new node by running the following script:

```
glusterName="$( kubectl get svc --all-namespaces | grep gluster | awk '{print $1}')"
kubectl get svc --all-namespaces | grep gluster | awk '{system("kubectl delete svc -n "$1" "$2)}'
echo "$glusterName" | awk '{system("/wdp/k8s/gluster-endpoints/createCommonEndpointSvcYamlFile.sh /wdp/config "$1)}'
```

- Verify whether the new node IP address is updated for the gluster endpoints by entering the following command:

```
kubectl get ep --all-namespaces | grep gluster
```
- On a working master node, prepare the new certificate license:

```
/wdp/scripts/crtmastercert.sh {PROXY_IP} {MASTER_1} {MASTER_2} {MASTER_3}
```

From the master node used to run the above script, copy all of the files under /etc/kubernetes/ssl/ to all other nodes.

- Restart the API server on all master nodes by running the following command on each node:

```
pkill hyperkube
```

- Load the new certificate for pods by deleting all secrets:

```
kubectl get secrets --all-namespaces | grep default-token | awk '{system("kubectl delete secrets -n "$1" "$2)}'
```

- Restart all nodes in the cluster, including the new master node.
- Shut down the proxy node:

```
shutdown -h now
```

If you are redirected when trying to SSH to the proxy node, shut down Keepalived on all of the master nodes:

```
systemctl stop keepalived
```

then SSH to the proxy node and shut it down. Afterward, if Keepalived was shut down, bring it back up on all master nodes:

```
systemctl start keepalived
```

- Ensure all nodes are in the ready state:

```
kubectl get no
```

If any nodes are down, enter the following command until all master nodes are in the ready state:

```
systemctl restart kubelet
```

- Restart all pods by entering the following commands in order. Ensure the related pods are up before moving on to the next command. (To just check the status of the related pods, remove the trailing awk command from the pipeline; see the example after this step.)

```
kubectl get po --all-namespaces -o wide | grep kube-system | grep -v kube-apiserver | grep -v weave | grep -v dns | awk '{system("kubectl delete po -n "$1" "$2" --grace-period=0 --force")}'
kubectl get po --all-namespaces -o wide | grep kube-system | grep weave | awk '{system("kubectl delete po -n "$1" "$2" --grace-period=0 --force")}'
kubectl get po --all-namespaces -o wide | grep kube-system | grep dns | awk '{system("kubectl delete po -n "$1" "$2" --grace-period=0 --force")}'
kubectl get po --all-namespaces -o wide | grep docker | awk '{system("kubectl delete po -n "$1" "$2" --grace-period=0 --force")}'
kubectl get po --all-namespaces -o wide | grep redis | awk '{system("kubectl delete po -n "$1" "$2" --grace-period=0 --force")}'
kubectl get po --all-namespaces -o wide | grep cloudant | awk '{system("kubectl delete po -n "$1" "$2" --grace-period=0 --force")}'
kubectl get po --all-namespaces -o wide | grep -v kube-system | grep -v redis | grep -v cloudant | grep -v docker | grep -v nginx | awk '{system("kubectl delete po -n "$1" "$2" --grace-period=0 --force")}'
kubectl get po --all-namespaces -o wide | grep nginx | awk '{system("kubectl delete po -n "$1" "$2" --grace-period=0 --force")}'
```

Troubleshooting tip: If you are unable to bring up all pods, redo the previous two steps.
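For example, to check the status of the first group of related pods without deleting anything, keep the pipeline but drop the awk portion:

```
kubectl get po --all-namespaces -o wide | grep kube-system | grep -v kube-apiserver | grep -v weave | grep -v dns
```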
- In the /etc/kubernetes/manifests/etcd.yaml file on the other two master nodes, edit the /var/etcd/etcd_kube.sh line to update every old master node IP address to the new master node IP address. Example of the final change:

```
- /var/etcd/etcd_kube.sh 1 "9.87.654.321" "etcd1=http://9.87.654.322:2380,etcd2=http://9.87.654.323:2380,etcd4=http://9.87.654.321:2380"
```

- Label the new master node:

```
kubectl label no {NEW_NODE_NAME} nodetype=control
kubectl label no {NEW_NODE_NAME} is_control=true
kubectl label no {NEW_NODE_NAME} is_compute=false
kubectl label no {NEW_NODE_NAME} is_storage=false
```
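Optionally (this check is not part of the original procedure), you can confirm that the labels were applied:

```
kubectl get no {NEW_NODE_NAME} --show-labels
```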
Replace a storage node by command line
Requirement: You can replace a storage node only when just one storage node is down.
If you need to replace an old or faulty storage node with a new one, complete the following steps:
- Shut down the storage node to replace.
- Use the mkdir command on the new storage node to create a new storage directory. The new storage path must match the old storage path, for example, /data.
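For example, if the old storage path was /data, create the matching directory on the new node:

```
mkdir /data
```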
- Copy the following folders from a working storage node to the new storage node:
  - /wdp
  - /etc/kubernetes

Bash command for a folder:

```
scp -r root@IP_GOOD_STORAGE:/PATH_TO_FOLDER PATH_TO_SAVE_FOLDER
```

Example command:

```
scp -r root@9.87.654.321:/wdp /
```

Bash command for a file:

```
scp root@IP_GOOD_STORAGE:/PATH_TO_FILE PATH_TO_SAVE_FOLDER
```

Example command:

```
scp -r root@9.87.654.321:/etc/kubernetes /etc/
```
- On the new storage node, add the local repository by entering the following command:

```
cat <<EOF > /etc/yum.repos.d/wdp_local.repo
[WDP_Local]
name = WDP_Local
baseurl = file:///wdp/wdp-repo-rhel7
gpgcheck = 0
EOF
```

- Install kubectl and all other packages:

```
yum install -y nfs-utils net-tools ebtables socat lvm2 yum-utils glusterfs-server docker-engine kubectl.x86_64 kubelet.x86_64 kubernetes-cni.x86_64 iptables-services
```

- Remove the local repository:

```
yum clean all
rm -f /etc/yum.repos.d/wdp_local.repo
```

- Copy the kubelet.service file (/etc/systemd/system/kubelet.service) from one of the working storage nodes.
- Create the SSL certificate on the new storage node by running the /wdp/scripts/crtworkercert.sh script.
- Start Docker:

```
systemctl enable docker
systemctl start docker
```

- Preload the Docker images:

```
ls /wdp/DockerImages | awk '{system("docker load -i /wdp/DockerImages/"$1)}'
```

- On the new storage node, start GlusterFS:

```
systemctl enable glusterd
systemctl start glusterd
```

- On one of the original storage nodes, run the following command:

```
gluster peer probe IP_OF_NEW_STORAGE_NODE
```
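Optionally (this check is not part of the original procedure), you can confirm that the probe succeeded before continuing:

```
gluster peer status
```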
- On the new storage node, start kubelet:

```
systemctl enable kubelet
systemctl start kubelet
```

- Save the following code segment, which replaces the bricks, into a script on one of the storage nodes inside the cluster:

```
#!/usr/bin/env bash
# Replace all the gluster bricks that are linked to the old storage IP
# so that they point to the new storage IP.
# Only the storage node that is being replaced can be down.
# Note that this script does not heal the volumes.
if [[ $# -ne 2 ]]; then
    echo
    echo " Usage: "$(basename $0)" old_storage_node_ip new_storage_node_ip"
    echo
    exit 1
fi
old_ip=$1
new_ip=$2
volumes=$(gluster volume info | grep 'Volume Name')
IFS=$'\n'
for volume in $volumes ; do
    volume=$(echo ${volume} | awk -F ':' '{print $2}' | awk '{print $1}')
    info=$(gluster volume info "${volume}" | grep ${old_ip})
    if [[ $? -eq 0 ]]; then
        brick=$(echo ${info} | awk '{print $2}')
        new_brick=$(echo ${brick} | sed "s|${old_ip}|${new_ip}|g")
        gluster volume replace-brick "${volume}" "${brick}" "${new_brick}" commit force
    fi
done
```

- Grant the script executable privileges:

```
chmod +x scriptname.sh
```

- Run the script to replace the bricks:

```
./scriptname.sh old_storage_node_ip new_storage_node_ip
```

where old_storage_node_ip represents the old storage node IP address and new_storage_node_ip represents the new storage node IP address.

- On the working storage node, heal the gluster volumes by entering the following commands:

```
volumes=$(gluster volume info | grep 'Volume Name')
IFS=$'\n'
for volume in $volumes ; do
    volume=$(echo ${volume} | awk -F ':' '{print $2}' | awk '{print $1}')
    gluster volume heal "$volume"
done
```
- On one of the master nodes, edit /wdp/config to remove the line that contains the old storage node. Replace it with a new line that contains the new storage node IP address, domain name, and hostname. The following example replaces storage node 1 (dsx-local-storage-1) with a new storage node (dsx-local-newstorage-1):

OLD FILE:

```
storage_group_start
987.16.163.7 WDP_PLACEHOLDER /data /ibm dsx-local-storage-1.ibm.com dsx-local-storage-1
987.65.432.237 WDP_PLACEHOLDER /data /ibm dsx-local-storage-2.ibm.com dsx-local-storage-2
987.65.432.238 WDP_PLACEHOLDER /data /ibm dsx-local-storage-3.ibm.com dsx-local-storage-3
storage_group_end
```

NEW EDIT:

```
storage_group_start
987.65.432.147 WDP_PLACEHOLDER /data /ibm dsx-local-newstorage-1.ibm.com dsx-local-newstorage-1
987.65.432.237 WDP_PLACEHOLDER /data /ibm dsx-local-storage-2.ibm.com dsx-local-storage-2
987.65.432.238 WDP_PLACEHOLDER /data /ibm dsx-local-storage-3.ibm.com dsx-local-storage-3
storage_group_end
```

- Copy the /wdp/config file from the master node to all other nodes.
- On the same master node, update mongo by entering the following commands:

```
/wdp/k8s/wdp-deploy-dashboard/k8s/mongoDelete.sh sysibm-adm 0 NUMBER
/wdp/k8s/wdp-deploy-dashboard/k8s/mongoCreate.sh sysibm-adm 0 NUMBER
```

where 0 is the first index and NUMBER represents the index at which the storage node information can be found in /wdp/config. For example:

```
/wdp/k8s/wdp-deploy-dashboard/k8s/mongoDelete.sh sysibm-adm 0 0
/wdp/k8s/wdp-deploy-dashboard/k8s/mongoCreate.sh sysibm-adm 0 0
```
- On the master node, replace the gluster services and endpoints for the new node by running the following script:

```
glusterName="$( kubectl get svc --all-namespaces | grep gluster | awk '{print $1}')"
kubectl get svc --all-namespaces | grep gluster | awk '{system("kubectl delete svc -n "$1" "$2)}'
echo "$glusterName" | awk '{system("/wdp/k8s/gluster-endpoints/createCommonEndpointSvcYamlFile.sh /wdp/config "$1)}'
```

- Verify whether the new node IP address is updated for the gluster endpoints by entering the following command:

```
kubectl get ep --all-namespaces | grep gluster
```

- Delete the old pods from the old storage node by entering the following command:

```
kubectl get po --all-namespaces | grep -E "NodeLost|Unknown" | awk '{system("kubectl delete pod --grace-period=0 --force -n="$1" "$2)}'
```

- Find the name of the old storage node to delete by entering the following command:

```
kubectl get no
```

- Delete the old storage node from the cluster:

```
kubectl delete no "NODE_NAME"
```

- On the master node, verify that the new storage node is running properly by entering the following commands:

```
kubectl get no
kubectl get po --all-namespaces
```

If all the nodes are Ready and all the pods are Running, then the storage node was replaced properly.
Replace a compute node by command line
Requirement: You can replace a compute node only when just one compute node is down.
If you need to replace an old or faulty compute node with a new one, complete the following steps:
- Shut down the compute node to replace.
- Copy the following folders from a working compute node to the new compute node:
  - /wdp
  - /etc/kubernetes

Bash command for a folder:

```
scp -r root@IP_GOOD_COMPUTE:/PATH_TO_FOLDER PATH_TO_SAVE_FOLDER
```

Example command:

```
scp -r root@9.87.654.321:/wdp /
```

Bash command for a file:

```
scp root@IP_GOOD_COMPUTE:/PATH_TO_FILE PATH_TO_SAVE_FOLDER
```

Example command:

```
scp -r root@9.87.654.321:/etc/kubernetes /etc/
```
- On the new compute node, add the local repository by entering the following command:

```
cat <<EOF > /etc/yum.repos.d/wdp_local.repo
[WDP_Local]
name = WDP_Local
baseurl = file:///wdp/wdp-repo-rhel7
gpgcheck = 0
EOF
```

- Install kubectl and all other packages:

```
yum install -y nfs-utils net-tools ebtables socat lvm2 yum-utils glusterfs-server docker-engine kubectl.x86_64 kubelet.x86_64 kubernetes-cni.x86_64 iptables-services
```

- Remove the local repository:

```
yum clean all
rm -f /etc/yum.repos.d/wdp_local.repo
```

- Copy the kubelet.service file (/etc/systemd/system/kubelet.service) from one of the working compute nodes.
- Create the SSL certificate on the new compute node by running the /wdp/scripts/crtworkercert.sh script.
- Start Docker:

```
systemctl enable docker
systemctl start docker
```

- Preload the Docker images:

```
ls /wdp/DockerImages | awk '{system("docker load -i /wdp/DockerImages/"$1)}'
```

- On the new compute node, start kubelet:

```
systemctl enable kubelet
systemctl start kubelet
```
- On one of the master nodes, edit /wdp/config to remove the line that contains the old compute node. Replace it with a new line that contains the new compute node IP address, domain name, and hostname. The following example replaces old compute node 1 (dsx-local-compute-1) with a new compute node (dsx-local-newcompute-1):

OLD FILE:

```
storage_group_start
987.65.163.7 WDP_PLACEHOLDER /data /ibm dsx-local-compute-1.ibm.com dsx-local-compute-1
987.65.432.237 WDP_PLACEHOLDER /data /ibm dsx-local-compute-2.ibm.com dsx-local-compute-2
987.65.432.238 WDP_PLACEHOLDER /data /ibm dsx-local-compute-3.ibm.com dsx-local-compute-3
storage_group_end
```

NEW EDIT:

```
storage_group_start
987.65.432.147 WDP_PLACEHOLDER /data /ibm dsx-local-newcompute-1.ibm.com dsx-local-newcompute-1
987.65.432.237 WDP_PLACEHOLDER /data /ibm dsx-local-compute-2.ibm.com dsx-local-compute-2
987.65.432.238 WDP_PLACEHOLDER /data /ibm dsx-local-compute-3.ibm.com dsx-local-compute-3
storage_group_end
```
- Copy the /wdp/config file from the master node to all other nodes.
- On the master node, replace the gluster services and endpoints for the new node by running the following script:

```
glusterName="$( kubectl get svc --all-namespaces | grep gluster | awk '{print $1}')"
kubectl get svc --all-namespaces | grep gluster | awk '{system("kubectl delete svc -n "$1" "$2)}'
echo "$glusterName" | awk '{system("/wdp/k8s/gluster-endpoints/createCommonEndpointSvcYamlFile.sh /wdp/config "$1)}'
```

- Verify whether the new node IP address is updated for the gluster endpoints by entering the following command:

```
kubectl get ep --all-namespaces | grep gluster
```
- Delete the old pods from the old compute node by entering the following command:

```
kubectl get po --all-namespaces | grep -E "NodeLost|Unknown" | awk '{system("kubectl delete pod --grace-period=0 --force -n="$1" "$2)}'
```

- Find the name of the old compute node to delete by entering the following command:

```
kubectl get no
```

- Delete the old compute node from the cluster:

```
kubectl delete no "NODE_NAME"
```

- On the master node, verify that the new compute node is running properly by entering the following commands:

```
kubectl get no
kubectl get po --all-namespaces
```

If all the nodes are Ready and all the pods are Running, then the compute node was replaced properly.