Node procedures
This topic describes procedures that you can perform on a node to accomplish common administrative tasks.
Note: Test any new hardware (especially disks) that is used in the following procedures before you put it into production, to verify the hardware quality.
When you add a new node or replace a node, you must complete the following preconditions before the new node can become operational:
- A homogeneous server is recommended: it must have the same CPU, memory, PCI speed, network speed, disk controller, and the same number and model of disks. If the node has a different configuration, make sure that it does not introduce any performance bottleneck into the cluster.
- Enclosure Descriptor File: If the new server is homogeneous with the other servers, including the drive mapping (which is recommended), the edf files (/usr/lpp/mmfs/data/gems/*edf) can be copied from an existing node to the new node (see the example after this list). If the new server is not homogeneous with the others, new edf files must be created. For more information, see Mapping NVMe disk slot location.
- Set the disks that are used for IBM Storage Scale Erasure Code Edition to JBOD mode, check the disk format, update the firmware, and disable the disk write cache. For more information, see Hardware checklist.
- SAS disk slot location: If the new server is homogeneous with the other servers, including the drive mapping, and the disk slot locations need to be remapped, the slot remapping file /usr/lpp/mmfs/data/gems/slotmap.yaml can be copied from an existing node to the new node. Otherwise, a new slotmap file must be created. For more information, see Mapping LMR disk location.
- Set customized udev rules if required.
- Set the systemctl settings if required.
- After you prepare the node, follow the OS precheck tool Readme file to run the precheck tools. For more information, see Minimum hardware requirements and precheck.
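For the homogeneous case, copying the existing descriptor and slot-mapping files is a simple scp operation. The following is a minimal sketch that is run on the new node; existingnode is a placeholder for any existing node that already has the files:
# scp "existingnode:/usr/lpp/mmfs/data/gems/*edf" /usr/lpp/mmfs/data/gems/
# scp existingnode:/usr/lpp/mmfs/data/gems/slotmap.yaml /usr/lpp/mmfs/data/gems/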
Adding new I/O nodes
Adding a new node by using the mmvdisk command:
- Make sure that the node is a member of the IBM Storage Scale cluster and that its state is active (if not, issue mmaddnode and mmstartup). Also, make sure that the node has a server license (if not, run mmchlicense).
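A minimal sketch of these checks, assuming that the new node is c72f4m5u15-ib0 (the node that is used in the examples that follow):
# mmaddnode -N c72f4m5u15-ib0
# mmchlicense server --accept -N c72f4m5u15-ib0
# mmstartup -N c72f4m5u15-ib0
# mmgetstate -N c72f4m5u15-ib0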
- Issue the mmvdisk server list -N newnode --disk-topology command to verify
that the new node has the same disk topology as the other nodes in the recovery group to which the
node is
added.
# mmvdisk server list -N c72f4m5u15-ib0 --disk-topology -L
The system displays the following output:
GNR server: name c72f4m5u15-ib0 arch x86_64 model 7X06CTO1WW serial J100574A
GNR enclosures found: internal
Enclosure internal (internal, number 1):
Enclosure internal sees 9 disks (6 SSDs, 3 HDDs)
GNR server disk topology: ECE 6 SSD/NVMe and 3 HDD (match: 100/100)
GNR configuration: 1 enclosure, 6 SSDs, 0 empty slots, 9 disks total, 0 NVRAM partitions
- Issue the mmvdisk server configure -N newnode --recycle one command to configure the new node as an IBM Storage Scale Erasure Code Edition server and restart the IBM Storage Scale daemon.
# mmvdisk server configure -N c72f4m5u15-ib0 --recycle one
mmvdisk: Checking resources for specified nodes.
mmvdisk: Setting configuration for node 'c72f4m5u15-ib0'.
mmvdisk: Node 'c72f4m5u15-ib0' has a scale-out recovery group disk topology.
mmvdisk: Using 'default.scale-out' RG configuration for topology 'ECE 6 SSD/NVMe and 3 HDD'.
mmvdisk: Node 'c72f4m5u15-ib0' is now configured to be a recovery group server.
mmvdisk: Restarting GPFS daemon on node 'c72f4m5u15-ib0'.
- Issue the mmvdisk rg add --rg rg_name -N newnode command to add the new node to the current recovery group. After the node is added, all DAs go into the rebalance state. The mmvdisk rg add --rg rg_name -N newnode command adds a call-back script to monitor the rebalance process. When the rebalance is finished, the call-back runs the mmvdisk recoverygroup add --recovery-group rg_name --complete-node-add command of the next step to finish the procedure for adding the node.
# mmvdisk rg add --rg rg_1 -N c72f4m5u15-ib0
mmvdisk: Checking daemon status on node 'c72f4m5u15-ib0'.
mmvdisk: Checking resources for specified nodes.
mmvdisk: Adding 'c72f4m5u15-ib0' to node class 'nc_1'.
mmvdisk: Obtaining pdisk information for recovery group 'rg_1'.
mmvdisk: Analyzing disk topology for node 'c72f4m5u13-ib0'.
mmvdisk: Analyzing disk topology for node 'c72f4m5u19-ib0'.
mmvdisk: Analyzing disk topology for node 'c72f4m5u17-ib0'.
mmvdisk: Analyzing disk topology for node 'c72f4m5u21-ib0'.
mmvdisk: Analyzing disk topology for node 'c72f4m5u11-ib0'.
mmvdisk: Analyzing disk topology for node 'c72f4m5u15-ib0'.
mmvdisk: Validating declustered arrays for recovery group 'rg_1'.
mmvdisk: Updating server list for recovery group 'rg_1'.
mmvdisk: Updating pdisk list for recovery group 'rg_1'.
mmvdisk: Updating parameters for declustered array 'DA1'.
mmvdisk: Updating parameters for declustered array 'DA2'.
mmvdisk: Updating parameters for declustered array 'DA3'.
mmvdisk: Node 'c72f4m5u15-ib0' added to recovery group 'rg_1'.
mmvdisk: Log group and vdisk set operations for recovery group 'rg_1'
mmvdisk: must be deferred until rebalance completes in all declustered arrays.
mmvdisk: A callback 'RG001CompletNodeAdd' has been created to monitor the rebalance state.
mmvdisk: Once rebalance completes in all declustered arrays,
mmvdisk: log group and vdisk set will be created automatically.
- Check the DA status and rebalance progress by issuing the following
command:
# mmvdisk rg list --rg rg_1 --da
The system displays the following output:
declustered   needs                  vdisks       pdisks               capacity
      array service  type  trim  user  log  total spare rt   total raw  free raw  background task
----------- -------  ----  ----  ----  ---  ----- ----- --   ---------  --------  ---------------
        DA1      no  NVMe    no    10    0     12     2  2    8869 GiB  1237 GiB  rebalance (12%)
        DA2     yes   HDD    no    10    0     18     2  2    8829 GiB  1089 GiB  rebalance (88%)
        DA3      no   SSD    no    10   11     24     3  2    9173 GiB   695 GiB  rebalance (19%)
mmvdisk: Total capacity is the raw space before any vdisk set definitions.
mmvdisk: Free capacity is what remains for additional vdisk set definitions.
mmvdisk: Attention: Recovery group 'rg_1' has an incomplete node addition (c72f4m5u15-ib0).
mmvdisk: callback 'RG001CompletNodeAdd' will perform the node addition after rebalance completes
mmvdisk: in all declustered arrays of recovery group 'rg_1'.
- Verify that the call-back is added by issuing the following
command:
# mmlscallback RG001CompletNodeAdd
RG001CompletNodeAdd
        command       = /usr/lpp/mmfs/bin/mmvdisk
        sync          = false
        event         = imEventRebalance
        node          = c72f4m5u11-ib0,c72f4m5u13-ib0,c72f4m5u15-ib0,c72f4m5u17-ib0,c72f4m5u19-ib0,c72f4m5u21-ib0
        parms         = recoverygroup add --recovery-group %rgName --complete-node-add --callback RG001CompletNodeAdd
- The call-back automatically runs the mmvdisk recoverygroup add --recovery-group rg_name --complete-node-add command to finish the node addition after the rebalance is finished. This operation creates new log groups, new vdisks for all existing vdisk sets, and new NSDs, and it adds the free NSDs to file systems if the vdisk sets belong to a file system. If you run the command manually while the rebalance is still ongoing, it reports a message similar to the following:
# mmvdisk recoverygroup add --recovery-group rg_1 --complete-node-add
mmvdisk: Verifying that the DAs in recovery group 'rg1' are idle.
mmvdisk: Declustered array 'DA1' is in task 'rebalance'.
mmvdisk: All DAs must be in task 'scrub' to complete node addition.
mmvdisk: Log group and vdisk set operations for recovery group 'rg1'
mmvdisk: must be deferred until rebalance completes in all declustered arrays.
mmvdisk: A callback 'RG001CompletNodeAdd' has been created to monitor the rebalance state.
mmvdisk: Once rebalance completes in all declustered arrays,
mmvdisk: log group and vdisk set will be created automatically.
mmvdisk: Command failed. Examine previous error messages to determine cause.
Generally, the mmvdisk command reports the same message if:
- The rebalance is ongoing.
- The call-back has not finished.
Check the DA status and rebalance progress again by issuing the following command:
# mmvdisk rg list --rg rg_1 --da
The system displays the following output:
declustered   needs                  vdisks       pdisks               capacity
      array service  type  trim  user  log  total spare rt   total raw  free raw  background task
----------- -------  ----  ----  ----  ---  ----- ----- --   ---------  --------  ---------------
        DA1      no  NVMe    no    10    0     12     2  2    8869 GiB  1237 GiB  rebalance (12%)
        DA2     yes   HDD    no    10    0     18     2  2    8829 GiB  1089 GiB  rebalance (88%)
        DA3      no   SSD    no    10   11     24     3  2    9173 GiB   695 GiB  rebalance (19%)
mmvdisk: Total capacity is the raw space before any vdisk set definitions.
mmvdisk: Free capacity is what remains for additional vdisk set definitions.
mmvdisk: Attention: Recovery group 'rg_1' has an incomplete node addition (c72f4m5u15-ib0).
mmvdisk: callback 'RG001CompletNodeAdd' will perform the node addition after rebalance completes
mmvdisk: in all declustered arrays of recovery group 'rg_1'.
After the call-back runs, the attention message in the preceding mmvdisk output no longer appears.
- Run the following command to verify that the number of vdisks has increased.
# mmvdisk rg list --rg rg_1 --da
The system displays the following output:
declustered   needs                  vdisks       pdisks               capacity
      array service  type  trim  user  log  total spare rt   total raw  free raw  background task
----------- -------  ----  ----  ----  ---  ----- ----- --   ---------  --------  ---------------
        DA1      no  NVMe    no    12    0     12     2  2    8869 GiB  1237 GiB  scrub 14d (63%)
        DA2     yes   HDD    no    12    0     18     2  2    8829 GiB  1089 GiB  scrub 14d (63%)
        DA3      no   SSD    no    12   13     24     3  2    9173 GiB   695 GiB  scrub 14d (65%)
mmvdisk: Total capacity is the raw space before any vdisk set definitions.
mmvdisk: Free capacity is what remains for additional vdisk set definitions.
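As an additional check (a sketch that is not part of the original procedure, assuming that your mmvdisk version supports the --server and --node-class listing options), you can confirm that the new node is listed as a recovery group server and as a member of the node class nc_1 shown in the earlier output:
# mmvdisk rg list --rg rg_1 --server
# mmvdisk nodeclass list --node-class nc_1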
Replacing an I/O node with a new node and disks
In this scenario, a failed server is to be replaced with an entirely new server, including new drives.
- Prepare a new node with the same disk topology as the node that is to be replaced. The server type, memory, and disks must be the same.
- Issue the mmaddnode command to add this node to the IBM Storage Scale cluster, accept the server license, and issue the mmstartup -N command to bring up the IBM Storage Scale daemon.
- Define the node with the same roles as the old server, such as quorum, fsmgr, and so on.
- Run the mmvdisk server configure -N nodename command to configure the node, then restart the daemon on this node.
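A minimal sketch of these two steps, assuming that the old node was a quorum node and that the new node is c72f4m5u07-ib0 (the node that is used in the example that follows); --recycle one restarts the daemon as part of the configuration:
# mmchnode --quorum -N c72f4m5u07-ib0
# mmvdisk server configure -N c72f4m5u07-ib0 --recycle one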
- Run the mmvdisk rg replace command to replace the existing node with the new node. In some cases, you might need to specify the --match parameter if there are slight differences between your configuration and the standard topology definitions, for example --match 90.
# mmvdisk rg replace --rg rg1 -N c72f4m5u01-ib0 --new-node c72f4m5u07-ib0
mmvdisk: Attempting to complete a previous replace command.
mmvdisk: Analyzing disk topology for node 'c72f4m5u01-ib0'.
mmvdisk: Analyzing disk topology for node 'c72f4m5u03-ib0'.
mmvdisk: Analyzing disk topology for node 'c72f4m5u05-ib0'.
mmvdisk: Analyzing disk topology for node 'c72f4m5u11-ib0'.
mmvdisk: Analyzing disk topology for node 'c72f4m5u09-ib0'.
mmvdisk: Analyzing disk topology for node 'c72f4m5u15-ib0'.
mmvdisk: Analyzing disk topology for node 'c72f4m5u13-ib0'.
mmvdisk: Analyzing disk topology for node 'c72f4m5u07-ib0'.
mmvdisk: Updating server list for recovery group 'rg1'.
mmvdisk: Updating pdisk list for recovery group 'rg1'.
mmvdisk: This could take a long time.
mmvdisk: The following pdisks will be formatted on node c72f4m5u01.gpfs.net:
mmvdisk: //c72f4m5u07-ib0/dev/nvme1n1
mmvdisk: //c72f4m5u07-ib0/dev/nvme0n1
mmvdisk: //c72f4m5u07-ib0/dev/sda
mmvdisk: //c72f4m5u07-ib0/dev/sdc
mmvdisk: //c72f4m5u07-ib0/dev/sdb
mmvdisk: //c72f4m5u07-ib0/dev/sde
mmvdisk: //c72f4m5u07-ib0/dev/sdg
mmvdisk: //c72f4m5u07-ib0/dev/sdf
mmvdisk: //c72f4m5u07-ib0/dev/sdd
mmvdisk: Removing node 'c72f4m5u01-ib0' from node class 'r1'.
mmvdisk: Updating server list for recovery group 'rg1'.
- Run the mmvdisk rg list command to make sure that the new node has joined the node class and that all related pdisks work fine. Also, make sure that the replaced node and its related pdisks are no longer in the recovery group. Then, wait until all DAs are in the scrub state.
- The node is now successfully replaced in the recovery group. Run mmshutdown -N and mmdelnode -N to delete the replaced node from the cluster (if you do not need the node in the cluster anymore).
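A sketch of this final check and cleanup, assuming that the replaced node is c72f4m5u01-ib0 as in the preceding example and that your mmvdisk version supports the --server listing option:
# mmvdisk rg list --rg rg1 --server
# mmshutdown -N c72f4m5u01-ib0
# mmdelnode -N c72f4m5u01-ib0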
Replacing a broken I/O node by moving its disks to a new node
- Make sure that the node is broken, for example, it cannot be pinged or logged in to. You can pull the network cable on the broken node if you have physical access to it.
- Prepare a new node that has the same hardware as the broken node.
- Install the same OS on it, make sure that its time is synchronized with all other nodes in the IBM Storage Scale Erasure Code Edition cluster, and then install the same IBM Storage Scale build on the new node.
- Connect the new node to the switch, and change the hostname and IP address of the node to those of the old node.
- Pull the pdisks that the old node was using and insert them into the new node.
- Make sure that all disks are visible on the new node and that none of the pdisks are broken. If a pdisk is broken, the data on that disk can never be restored.
- Make sure that the ssh and scp commands work on the new node. You must configure passwordless ssh and scp for root users.
- Make sure that ssh and scp work between all of the other nodes and the new node.
- Issue the mmsdrrestore -p <node name> -R /usr/bin/scp command on the new node, where <node name> is one of the active nodes in the node class.
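For example (a sketch, assuming that c72f4m5u11-ib0 is an active node in the node class), run the following commands on the new node to restore the cluster configuration and then bring up the daemon:
# mmsdrrestore -p c72f4m5u11-ib0 -R /usr/bin/scp
# mmstartup
# mmgetstate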