Node procedures
This topic describes procedures that you can perform on a node to accomplish common administrative tasks.
Note: Test any new hardware (especially disks) that is used in the following procedures before you put it into production, to verify the hardware quality.
When you add a new node or replace a node, you must complete the following preconditions before the new node can become operational:
- A homogeneous server is recommended: it must have the same CPU, memory, PCI speed, network speed, disk controller, and the same number and model of disks. If the node has a different configuration, make sure that it does not introduce any performance bottleneck into the cluster.
- Enclosure Descriptor File: If the new server is homogeneous with the other servers, including the drive mapping (which is recommended), the edf files (/usr/lpp/mmfs/data/gems/*edf) can be copied from an existing node to the new node (see the example after this list). If the new server is not homogeneous with the others, new edf files must be created. For more information, see Mapping NVMe disk slot location.
- Set the disks that are used for IBM Storage Scale Erasure Code Edition to JBOD mode, check the disk format, update the firmware, and disable the disk write cache. For more information, see Hardware checklist.
- SAS disk slot location: If the new server is homogeneous with the other servers, including the drive mapping, and the disk slot locations need to be remapped, the slot remapping file /usr/lpp/mmfs/data/gems/slotmap.yaml can be copied from an existing node to the new node. Otherwise, a new slotmap file must be created. For more information, see Mapping LMR disk location.
- Set customized udev rules if required.
- Set the systemctl settings if required.
- After you prepare the node, follow the OS precheck tool Readme file to run the precheck tools. For more information, see Minimum hardware requirements and precheck.
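For the homogeneous case, copying the existing descriptor and slot-mapping files is a simple scp operation. The following is a minimal sketch that is run on the new node; existingnode is a placeholder for any existing node that already has the files:
# scp "existingnode:/usr/lpp/mmfs/data/gems/*edf" /usr/lpp/mmfs/data/gems/
# scp existingnode:/usr/lpp/mmfs/data/gems/slotmap.yaml /usr/lpp/mmfs/data/gems/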
Adding new I/O nodes
Adding a new node by using the mmvdisk command:
- Make sure that the node is a member of the IBM Storage Scale cluster and that its state is active (if not, issue mmaddnode and mmstartup). Also, make sure that the node has a server license (if not, run mmchlicense).
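A minimal sketch of these checks, assuming that the new node is c72f4m5u15-ib0 (the node that is used in the examples that follow):
# mmaddnode -N c72f4m5u15-ib0
# mmchlicense server --accept -N c72f4m5u15-ib0
# mmstartup -N c72f4m5u15-ib0
# mmgetstate -N c72f4m5u15-ib0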
- Issue the mmvdisk server list -N newnode --disk-topology command to verify
that the new node has the same disk topology as the other nodes in the recovery group to which the
node is
added.
# mmvdisk server list -N c72f4m5u15-ib0 --disk-topology -L
The system displays the following output:
GNR server: name c72f4m5u15-ib0 arch x86_64 model 7X06CTO1WW serial J100574A
GNR enclosures found: internal
Enclosure internal (internal, number 1):
Enclosure internal sees 9 disks (6 SSDs, 3 HDDs)
GNR server disk topology: ECE 6 SSD/NVMe and 3 HDD (match: 100/100)
GNR configuration: 1 enclosure, 6 SSDs, 0 empty slots, 9 disks total, 0 NVRAM partitions
- Issue the mmvdisk server configure -N newnode --recycle one command to configure the new node as an IBM Storage Scale Erasure Code Edition server and restart the IBM Storage Scale daemon.
# mmvdisk server configure -N c72f4m5u15-ib0 --recycle one
mmvdisk: Checking resources for specified nodes.
mmvdisk: Setting configuration for node 'c72f4m5u15-ib0'.
mmvdisk: Node 'c72f4m5u15-ib0' has a scale-out recovery group disk topology.
mmvdisk: Using 'default.scale-out' RG configuration for topology 'ECE 6 SSD/NVMe and 3 HDD'.
mmvdisk: Node 'c72f4m5u15-ib0' is now configured to be a recovery group server.
mmvdisk: Restarting GPFS daemon on node 'c72f4m5u15-ib0'.
- Issue the mmvdisk rg add --rg rg_name -N newnode command to add the new node to the current recovery group. After the node is added, all DAs go into the rebalance state. The mmvdisk rg add --rg rg_name -N newnode command adds a call-back script to monitor the rebalance process. When the rebalance is finished, the call-back runs the mmvdisk recoverygroup add --recovery-group rg_name --complete-node-add command of the next step to finish the procedure for adding the node.
# mmvdisk rg add --rg rg_1 -N c72f4m5u15-ib0
mmvdisk: Checking daemon status on node 'c72f4m5u15-ib0'.
mmvdisk: Checking resources for specified nodes.
mmvdisk: Adding 'c72f4m5u15-ib0' to node class 'nc_1'.
mmvdisk: Obtaining pdisk information for recovery group 'rg_1'.
mmvdisk: Analyzing disk topology for node 'c72f4m5u13-ib0'.
mmvdisk: Analyzing disk topology for node 'c72f4m5u19-ib0'.
mmvdisk: Analyzing disk topology for node 'c72f4m5u17-ib0'.
mmvdisk: Analyzing disk topology for node 'c72f4m5u21-ib0'.
mmvdisk: Analyzing disk topology for node 'c72f4m5u11-ib0'.
mmvdisk: Analyzing disk topology for node 'c72f4m5u15-ib0'.
mmvdisk: Validating declustered arrays for recovery group 'rg_1'.
mmvdisk: Updating server list for recovery group 'rg_1'.
mmvdisk: Updating pdisk list for recovery group 'rg_1'.
mmvdisk: Updating parameters for declustered array 'DA1'.
mmvdisk: Updating parameters for declustered array 'DA2'.
mmvdisk: Updating parameters for declustered array 'DA3'.
mmvdisk: Node 'c72f4m5u15-ib0' added to recovery group 'rg_1'.
mmvdisk: Log group and vdisk set operations for recovery group 'rg_1'
mmvdisk: must be deferred until rebalance completes in all declustered arrays.
mmvdisk: A callback 'RG001CompletNodeAdd' has been created to monitor the rebalance state.
mmvdisk: Once rebalance completes in all declustered arrays,
mmvdisk: log group and vdisk set will be created automatically.
- Check the DA status and rebalance progress by issuing the following
command:
# mmvdisk rg list --rg rg_1 --da
The system displays the following output:
declustered   needs                  vdisks       pdisks               capacity
      array service  type  trim  user  log  total spare rt   total raw  free raw  background task
----------- -------  ----  ----  ----  ---  ----- ----- --   ---------  --------  ---------------
        DA1      no  NVMe    no    10    0     12     2  2    8869 GiB  1237 GiB  rebalance (12%)
        DA2     yes   HDD    no    10    0     18     2  2    8829 GiB  1089 GiB  rebalance (88%)
        DA3      no   SSD    no    10   11     24     3  2    9173 GiB   695 GiB  rebalance (19%)
mmvdisk: Total capacity is the raw space before any vdisk set definitions.
mmvdisk: Free capacity is what remains for additional vdisk set definitions.
mmvdisk: Attention: Recovery group 'rg_1' has an incomplete node addition (c72f4m5u15-ib0).
mmvdisk: callback 'RG001CompletNodeAdd' will perform the node addition after rebalance completes
mmvdisk: in all declustered arrays of recovery group 'rg_1'.
- Verify that the call-back is added by issuing the following
command:
# mmlscallback RG001CompletNodeAdd
RG001CompletNodeAdd
        command       = /usr/lpp/mmfs/bin/mmvdisk
        sync          = false
        event         = imEventRebalance
        node          = c72f4m5u11-ib0,c72f4m5u13-ib0,c72f4m5u15-ib0,c72f4m5u17-ib0,c72f4m5u19-ib0,c72f4m5u21-ib0
        parms         = recoverygroup add --recovery-group %rgName --complete-node-add --callback RG001CompletNodeAdd
- The call-back automatically runs the mmvdisk recoverygroup add --recovery-group rg_name --complete-node-add command to finish the node addition after the rebalance is finished. This operation creates new log groups, new vdisks for all existing vdisk sets, and new NSDs, and it adds the free NSDs to file systems if the vdisk sets belong to a file system. If you run the command manually while the rebalance is still ongoing, it reports a message similar to the following:
# mmvdisk recoverygroup add --recovery-group rg_1 --complete-node-add
mmvdisk: Verifying that the DAs in recovery group 'rg1' are idle.
mmvdisk: Declustered array 'DA1' is in task 'rebalance'.
mmvdisk: All DAs must be in task 'scrub' to complete node addition.
mmvdisk: Log group and vdisk set operations for recovery group 'rg1'
mmvdisk: must be deferred until rebalance completes in all declustered arrays.
mmvdisk: A callback 'RG001CompletNodeAdd' has been created to monitor the rebalance state.
mmvdisk: Once rebalance completes in all declustered arrays,
mmvdisk: log group and vdisk set will be created automatically.
mmvdisk: Command failed. Examine previous error messages to determine cause.
Generally, the mmvdisk command reports the same message if:
- The rebalance is ongoing.
- The call-back has not finished.
Check the DA status and rebalance progress again by issuing the following command:
# mmvdisk rg list --rg rg_1 --da
The system displays the following output:
declustered   needs                  vdisks       pdisks               capacity
      array service  type  trim  user  log  total spare rt   total raw  free raw  background task
----------- -------  ----  ----  ----  ---  ----- ----- --   ---------  --------  ---------------
        DA1      no  NVMe    no    10    0     12     2  2    8869 GiB  1237 GiB  rebalance (12%)
        DA2     yes   HDD    no    10    0     18     2  2    8829 GiB  1089 GiB  rebalance (88%)
        DA3      no   SSD    no    10   11     24     3  2    9173 GiB   695 GiB  rebalance (19%)
mmvdisk: Total capacity is the raw space before any vdisk set definitions.
mmvdisk: Free capacity is what remains for additional vdisk set definitions.
mmvdisk: Attention: Recovery group 'rg_1' has an incomplete node addition (c72f4m5u15-ib0).
mmvdisk: callback 'RG001CompletNodeAdd' will perform the node addition after rebalance completes
mmvdisk: in all declustered arrays of recovery group 'rg_1'.
After the call-back runs, the attention message in the preceding mmvdisk output no longer appears.
- Run the following command to verify that the number of vdisks has increased.
# mmvdisk rg list --rg rg_1 --da
The system displays the following output:
declustered   needs                  vdisks       pdisks               capacity
      array service  type  trim  user  log  total spare rt   total raw  free raw  background task
----------- -------  ----  ----  ----  ---  ----- ----- --   ---------  --------  ---------------
        DA1      no  NVMe    no    12    0     12     2  2    8869 GiB  1237 GiB  scrub 14d (63%)
        DA2     yes   HDD    no    12    0     18     2  2    8829 GiB  1089 GiB  scrub 14d (63%)
        DA3      no   SSD    no    12   13     24     3  2    9173 GiB   695 GiB  scrub 14d (65%)
mmvdisk: Total capacity is the raw space before any vdisk set definitions.
mmvdisk: Free capacity is what remains for additional vdisk set definitions.
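As an additional check (a sketch that is not part of the original procedure, assuming that your mmvdisk version supports the --server and --node-class listing options), you can confirm that the new node is listed as a recovery group server and as a member of the node class nc_1 shown in the earlier output:
# mmvdisk rg list --rg rg_1 --server
# mmvdisk nodeclass list --node-class nc_1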
Replacing an I/O node with a new node and disks
In this scenario, a failed server is to be replaced with an entirely new server, including new drives.
- Prepare a new node with the same disk topology as the node that is to be replaced. The server type, memory, and disks must be the same.
- Issue the mmaddnode command to add this node to the IBM Storage Scale cluster, accept the server license, and issue the mmstartup -N command to bring up the IBM Storage Scale daemon.
- Define the node with the same roles as the old server, such as quorum, fsmgr, and so on.
- Run the mmvdisk server configure -N nodename command to configure the node, then restart the daemon on this node.
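A minimal sketch of these two steps, assuming that the old node was a quorum node and that the new node is c72f4m5u07-ib0 (the node that is used in the example that follows); --recycle one restarts the daemon as part of the configuration:
# mmchnode --quorum -N c72f4m5u07-ib0
# mmvdisk server configure -N c72f4m5u07-ib0 --recycle one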
- Run the mmvdisk rg replace command to replace the existing node with the new node. In some cases, you might need to specify the --match parameter if there are slight differences between your configuration and the standard topology definitions, for example --match 90.
# mmvdisk rg replace --rg rg1 -N c72f4m5u01-ib0 --new-node c72f4m5u07-ib0
mmvdisk: Attempting to complete a previous replace command.
mmvdisk: Analyzing disk topology for node 'c72f4m5u01-ib0'.
mmvdisk: Analyzing disk topology for node 'c72f4m5u03-ib0'.
mmvdisk: Analyzing disk topology for node 'c72f4m5u05-ib0'.
mmvdisk: Analyzing disk topology for node 'c72f4m5u11-ib0'.
mmvdisk: Analyzing disk topology for node 'c72f4m5u09-ib0'.
mmvdisk: Analyzing disk topology for node 'c72f4m5u15-ib0'.
mmvdisk: Analyzing disk topology for node 'c72f4m5u13-ib0'.
mmvdisk: Analyzing disk topology for node 'c72f4m5u07-ib0'.
mmvdisk: Updating server list for recovery group 'rg1'.
mmvdisk: Updating pdisk list for recovery group 'rg1'.
mmvdisk: This could take a long time.
mmvdisk: The following pdisks will be formatted on node c72f4m5u01.gpfs.net:
mmvdisk: //c72f4m5u07-ib0/dev/nvme1n1
mmvdisk: //c72f4m5u07-ib0/dev/nvme0n1
mmvdisk: //c72f4m5u07-ib0/dev/sda
mmvdisk: //c72f4m5u07-ib0/dev/sdc
mmvdisk: //c72f4m5u07-ib0/dev/sdb
mmvdisk: //c72f4m5u07-ib0/dev/sde
mmvdisk: //c72f4m5u07-ib0/dev/sdg
mmvdisk: //c72f4m5u07-ib0/dev/sdf
mmvdisk: //c72f4m5u07-ib0/dev/sdd
mmvdisk: Removing node 'c72f4m5u01-ib0' from node class 'r1'.
mmvdisk: Updating server list for recovery group 'rg1'.
- Run the mmvdisk rg list command to make sure that the new node has joined the node class and that all related pdisks work fine. Also, make sure that the replaced node and its related pdisks are no longer in the recovery group. Then, wait until all DAs are in the scrub state.
- The node is now successfully replaced in the recovery group. Run mmshutdown -N and mmdelnode -N to delete the replaced node from the cluster (if you do not need the node in the cluster anymore).
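A sketch of this final check and cleanup, assuming that the replaced node is c72f4m5u01-ib0 as in the preceding example and that your mmvdisk version supports the --server listing option:
# mmvdisk rg list --rg rg1 --server
# mmshutdown -N c72f4m5u01-ib0
# mmdelnode -N c72f4m5u01-ib0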
Replacing a broken I/O node by moving its disks to a new node
- Make sure that the node is broken, for example, it cannot be pinged or logged in to. You can pull the network cable on the broken node if you have physical access to it.
- Prepare a new node that has the same hardware as the broken node.
- Install the same OS on it, make sure that its time is synchronized with all other nodes in the IBM Storage Scale Erasure Code Edition cluster, and then install the same IBM Storage Scale build on the new node.
- Connect the new node to the switch, and change the hostname and IP address of the node to those of the old node.
- Pull the pdisks that the old node was using and insert them into the new node.
- Make sure that all disks are visible on the new node and that none of the pdisks are broken. If a pdisk is broken, the data on that disk can never be restored.
- Make sure that the ssh and scp commands work on the new node. You must configure passwordless ssh and scp for root users.
- Make sure that ssh and scp work between all of the other nodes and the new node.
- Issue the mmsdrrestore -p <node name> -R /usr/bin/scp command on the new node, where <node name> is one of the active nodes in the node class.
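For example (a sketch, assuming that c72f4m5u11-ib0 is an active node in the node class), run the following commands on the new node to restore the cluster configuration and then bring up the daemon:
# mmsdrrestore -p c72f4m5u11-ib0 -R /usr/bin/scp
# mmstartup
# mmgetstate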