IBM Support

Big SQL 4.2 How to remove a "dead" worker node - Hadoop Dev

Technical Blog Post




The following article explains how to remove a worker node that is “down” or otherwise unreachable from a Big SQL cluster. Normally, a Big SQL worker node can be decommissioned from the cluster if the host is alive:


Figure 1: Showing the decommission option of a worker node.

However, when a host can no longer be accessed, this option is not available, as shown below:


Figure 2 : The host worker2Host is dead. The service state shows “unknown” and, in the Ambari UI, the host heartbeats are lost.

Figure 3 : The host worker2Host is reported as ‘unknown-state’ in the host list because it is down.

How to remove a dead node from a Big SQL cluster using the command line

There are 3 phases of the Big SQL worker cleanup:
Phase 1 - Deregistering the host from Big SQL (whether or not the targeted Big SQL worker is accessible)
– Removal from the Big SQL db
– Removal from the Big SQL cluster
Phase 2 - Removal of Big SQL binaries/packages from the worker host (only if the targeted Big SQL worker is accessible)
Phase 3 - Removal of the Big SQL worker service entry from Ambari (whether or not the targeted Big SQL worker is accessible)

The 3 phases above are performed by the utility when it is run with the -w option.

WARNING : Running this script without the -w option will WIPE the whole Big SQL cluster
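
Because omitting -w is destructive, one option is to guard any wrapper that calls the utility. The following is a minimal sketch and not part of the Big SQL tooling: has_worker_flag is a hypothetical helper name, and it only checks that -w is followed by a non-empty value that does not look like another option.

```shell
# has_worker_flag (hypothetical helper): succeed only if the argument list
# contains "-w" followed by a non-empty value that is not another option.
has_worker_flag() {
    while [ "$#" -gt 0 ]; do
        if [ "$1" = "-w" ]; then
            case "${2:-}" in
                ""|-*) return 1 ;;   # -w given without a hostname
                *)     return 0 ;;   # -w <hostname> present
            esac
        fi
        shift
    done
    return 1                         # -w not given at all
}

# Example use inside a wrapper script (assumption: the wrapper forwards "$@"
# to the cleanup utility only after this check passes):
#   has_worker_flag "$@" || { echo "Refusing to run without -w <host>" >&2; exit 1; }
```

A check like this does not validate that the hostname is correct; it only prevents the case where the worker option is forgotten entirely.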

The steps are as follows:

Step 1 : ssh to the Big SQL Head node as a user with sudo privileges
Step 2 : Switch to the bigsql user and determine the dead node’s node number in the Big SQL cluster:

# su - bigsql
$ cat ~/sqllib/db2nodes.cfg |grep worker2Host
2 worker2Host 0

The host in question has Big SQL node number “2”.
The node number is the first space-separated field on the db2nodes.cfg line that corresponds to worker2Host.
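
The lookup above can also be sketched as a small function that matches the hostname field exactly instead of relying on a substring match from grep. This is an illustrative sketch: node_number_for_host is a hypothetical name, and the default path assumes the function is run as the bigsql user so that $HOME/sqllib/db2nodes.cfg is the file shown above.

```shell
# node_number_for_host (hypothetical helper): print the node number(s) of a
# host from a db2nodes.cfg-style file.
# Usage: node_number_for_host <hostname> [path-to-db2nodes.cfg]
node_number_for_host() {
    host="$1"
    cfg="${2:-$HOME/sqllib/db2nodes.cfg}"
    # Field 1 is the node number, field 2 is the hostname.
    # Comparing field 2 exactly avoids matching hosts with similar names.
    awk -v h="$host" '$2 == h { print $1 }' "$cfg"
}

# Example (run as the bigsql user on the head node):
#   node_number_for_host worker2Host
```

If the function prints nothing, the host is not listed in the file under that exact name.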

Step 3 : Validate the host’s node number once more by attempting to start and/or stop the Big SQL service on that node from the command line


Figure 4: Failure to ping and to start/stop Big SQL Worker worker2Host

The user has now validated that node number 2 is dead. It is completely off the network, and the user has no intention of ever bringing it back.

Step 4 : Run the cleanup utility as root on the Big SQL Head node. First, as the bigsql user, review the current node list:

$ su - bigsql
$ cat ~/sqllib/db2nodes.cfg
0 0
1 0
2 0 >> This node will be removed

Switch to root on Big SQL Head node :

[root@head1 ~]# cd /var/lib/ambari-agent
[root@head1 ambari-agent]# find . -name
./cache/stacks/HDP/2.4/services/BIGSQL/package/scripts/
[root@head1 ambari-agent]# cd ./cache/stacks/HDP/2.4/services/BIGSQL/package/scripts

Let’s run the command to get usage help :

[root@headHost1 scripts]# ./
Usage: ./ -u -p [-s ] [-w ]
Required parameters:
  -u: Ambari admin username
  -p: Ambari admin password
Worker node cleanup:
  -w: Worker node hostname
      Using this option will remove the specified worker from the existing Big SQL cluster.
Optional:
  -Z: sudo_ssh_user specify the sudo/ssh user if it is other than root.
WARNING: THIS SCRIPT SHOULD BE INVOKED FROM BIGSQL_HEAD_NODE

Here is the command to clean up the worker from the Big SQL cluster:

[root@head1 scripts]# ./ -u admin -p admin -w

The output of the command will look like the following. It waits for user confirmation; enter “Y” to continue with the removal:

Single host cleanup is requested on
SSL is NOT set
Please confirm the following cluster info:
Ambari server =
Ambari port = 8081
Ambari cluster = TESTHDP24
Would you like to continue? (Y/n): Y
Exporting environment variables for bigsql service
Successfully exported environment variables
Cleanup parameters:
BIGSQL_USER = bigsql
DATA_DIRS = /var/ibm/bigsql/database,/hadoop/bigsql
AMBARI_SERVER =
AMBARI_CLUSTER = TESTHDP24
AMBARI_PORT = 8081
AMBARI_USER = admin
BIGSQL_USER_HOME = /home/bigsql
TARGET_HOSTLIST = /tmp/bigsqlSSHHostList
**************************
Existing Big SQL host list:
**************************
**************************
2 0
Target host:
Target nodes: 2
Current host:
Big SQL Head Host:
Processing worker removal for node 2 (forced: 0)
Given node array to remove
0 0
1 0
2 0
Processing removal of 2,, 0
Log file of this shell is: /tmp/bigsql/logs/bigsql-fixtopology-2016-10-15_03.56.23.3491.log
...
...
Timed out in first attempt. Retrying ambari-server restart
Using python /usr/bin/python
Restarting ambari-server
Using python /usr/bin/python
Stopping ambari-server
Ambari Server stopped
Using python /usr/bin/python
Starting ambari-server
Ambari Server running with administrator privileges.
Organizing resource files at /var/lib/ambari-server/resources...
Server PID at: /var/run/ambari-server/
Server out at: /var/log/ambari-server/ambari-server.out
Server log at: /var/log/ambari-server/ambari-server.log
Waiting for server start....................
Ambari Server 'start' completed successfully.

Now let’s validate the outcome:

[root@head1 scripts]# su - bigsql
[bigsql@head1 sqllib]$ cat ~/sqllib/db2nodes.cfg
0 0
1 0

The dead host (worker2Host), with associated Big SQL node number “2”, is now removed from the cluster. As seen in the Ambari UI, the Big SQL service now has only 1 worker.
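
The same validation can be scripted so that it returns an exit status instead of requiring a visual check. This is a sketch under assumptions: host_still_registered is a hypothetical helper name, and the default path assumes it is run as the bigsql user on the head node. It only inspects db2nodes.cfg and does not check the Ambari side.

```shell
# host_still_registered (hypothetical helper): exit 0 if the host appears in
# the hostname field of a db2nodes.cfg-style file, exit 1 otherwise.
# Usage: host_still_registered <hostname> [path-to-db2nodes.cfg]
host_still_registered() {
    host="$1"
    cfg="${2:-$HOME/sqllib/db2nodes.cfg}"
    # Field 2 is the hostname; remember whether any line matches exactly.
    awk -v h="$host" '$2 == h { found = 1 } END { exit (found ? 0 : 1) }' "$cfg"
}

# Example (run as the bigsql user after the cleanup has finished):
#   if host_still_registered worker2Host; then
#       echo "worker2Host is still listed in db2nodes.cfg" >&2
#   else
#       echo "worker2Host is no longer listed in db2nodes.cfg"
#   fi
```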

