Using load balancing with scale-up and scale-down

Namespaces are automatically redistributed during scale-up and scale-down processes to ensure equal load across gateways.

About this task

Namespace redistribution helps ensure that the load is the same across all gateways in the group. The redistribution process begins automatically after a scale-up or scale-down operation. Use this information to understand how to use and prepare for load balancing with scale-up and scale-down operations.

Adding gateway nodes to a gateway group, also known as a scale-up operation, requires certain steps before and after the operation. These steps include:
  1. Preparing new gateways with listeners defined for each system.
  2. Verifying that the new gateway IP addresses are discovered as expected.
  3. Reconnecting from the initiators to the subsystems.

The following procedure provides detailed load balancing scale-up steps.

Removing gateway nodes from a gateway group, also known as a scale-down operation, does not require extra steps. After scale-down, the system automatically fails over the namespaces that correspond to the load balancing group ID of the gateway that is being removed. The namespaces fail over to one of the other gateways that remain in the group.
Note: When running scale-down operations, delete listeners from the deleted gateways for all related subsystems. Deleting listeners is required for scale-down operations but not for gateway node replacement procedures. For more information about deleting listeners, see Deleting listeners.

When automatic listeners are enabled on the subsystems, the listeners are deleted automatically during scale-down, and there is no need to delete them manually.

The first step of the failover verifies that I/O can continue to the remaining namespaces, leaving one of the gateways with more load than the others. Therefore, the system begins a background process that automatically rebalances the namespaces across the existing gateways.
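
The two phases described above can be pictured with a toy model. The following Python sketch is only an illustration: the function names, the gateway names, and the move-one-namespace-at-a-time strategy are hypothetical and do not represent the actual Ceph rebalancing algorithm.

```python
def fail_over(load, removed, target):
    """Phase 1 (toy): move every namespace of the removed gateway to one remaining gateway."""
    new_load = dict(load)
    new_load[target] += new_load.pop(removed)
    return new_load


def rebalance(load):
    """Phase 2 (toy): move namespaces one at a time from the busiest gateway
    to the least busy one until the counts differ by at most one."""
    new_load = dict(load)
    while True:
        busiest = max(new_load, key=new_load.get)
        idlest = min(new_load, key=new_load.get)
        if new_load[busiest] - new_load[idlest] <= 1:
            return new_load
        new_load[busiest] -= 1
        new_load[idlest] += 1


# Hypothetical starting point: four gateways with 236 namespaces each (944 in total).
before = {"gw1": 236, "gw2": 236, "gw3": 236, "gw4": 236}
after_failover = fail_over(before, removed="gw4", target="gw1")
after_rebalance = rebalance(after_failover)
print(after_failover)
print(after_rebalance)
```

In this model, gw1 temporarily carries the namespaces of the removed gateway, and the second phase evens out the counts across the three gateways that remain.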

Important: mTLS certificates must include all IP addresses of each of the Ceph NVMe-oF gateways. When adding or removing gateways, the certificates added during mTLS configuration need to be re-created to include all IP addresses. For more information, see Configuring mTLS authentication.
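
As a rough aid for the certificate requirement above, the following Python sketch compares the gateway IP addresses against the IP addresses that a certificate covers. It assumes that both lists have already been collected by other means; the function name and the sample addresses are hypothetical, and extracting the addresses from a certificate is not shown.

```python
import ipaddress


def missing_gateway_ips(gateway_ips, certificate_ips):
    """Return the gateway IP addresses that the certificate does not cover."""
    gateways = {ipaddress.ip_address(ip) for ip in gateway_ips}
    covered = {ipaddress.ip_address(ip) for ip in certificate_ips}
    return sorted(str(ip) for ip in gateways - covered)


# Hypothetical values: the certificate was created before the last gateway was added.
gateway_ips = ["10.243.64.5", "10.243.64.10", "10.243.64.11", "10.243.64.12"]
certificate_ips = ["10.243.64.5", "10.243.64.10", "10.243.64.11"]
missing = missing_gateway_ips(gateway_ips, certificate_ips)
print(missing)
```

A non-empty result suggests that the certificates need to be re-created before the gateway group is used with mTLS.
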
Use the following information during scale-up and scale-down, based on your deployment method:

Procedure

Run the following steps before and after the scale-up operations, as indicated.
  1. Before adding new gateways to a gateway group, add a listener to define the IP port on the gateway that is to process NVMe/TCP commands and I/O operations.
    Note: If automatic listeners were enabled either during subsystem creation or by using the subsystem add_network command, listeners do not need to be added during scale-up operations. Skip to step 2.
    Repeat this step for all subsystems present in the High Availability gateway group.
    Important: NQN and gateway name values must not contain an underscore character.
    ceph nvmeof listener add --nqn NQN --host-name HOST_NAME [--traddr LISTENER_ADDRESS] [--server-address SERVER_ADDRESS]

    To define the IP port on the gateway, be sure to include the --host-name and --traddr parameters.

    For extra input, the following parameters can optionally be added to the command:

    • --trsvcid
    • --adrfam
    • --server-port
    For example,
    [root@host01 ~]# ceph nvmeof listener add --nqn nqn.2016-06.io.spdk:cnode1.group1 --host-name host02 --traddr 10.172.19.01
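
Because step 1 must be repeated for every subsystem in the gateway group, it can help to generate the commands first and review them. The following Python sketch only builds command strings in the form shown above and does not run them. The function name and the list of NQNs are assumptions for illustration, and the underscore check covers only the NQN values.

```python
import shlex


def listener_add_commands(subsystem_nqns, host_name, traddr):
    """Build one 'ceph nvmeof listener add' command per subsystem NQN."""
    commands = []
    for nqn in subsystem_nqns:
        if "_" in nqn:
            raise ValueError(f"NQN must not contain an underscore: {nqn}")
        commands.append(shlex.join([
            "ceph", "nvmeof", "listener", "add",
            "--nqn", nqn,
            "--host-name", host_name,
            "--traddr", traddr,
        ]))
    return commands


# Hypothetical subsystem list for one new gateway host.
nqns = ["nqn.2016-06.io.spdk:cnode1.group1", "nqn.2016-06.io.spdk:cnode2"]
commands = listener_add_commands(nqns, host_name="host02", traddr="10.172.19.01")
for command in commands:
    print(command)
```
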
  2. Add any required gateway nodes to the gateway group.
  3. Optional: Verify that the scale-up operation is complete.
    ceph nvme-gw show NVME-OF_POOL_NAME GW_GROUP
    Check the following output:
    • The num-namespaces value of each gateway in Created Gateways is similar, within a few of each other.
    • The sum of the num-namespaces values of the gateways in Created Gateways equals the total num-namespaces value.
    • The number of gateways within Created Gateways equals the number of gateways after the scale-up.

    The following example has a total of three gateways. The num-namespaces values within Created Gateways are 315, 315, and 314, which add up to the total of "num-namespaces": 944. The rebalance_ana_group, num gws, and number of gateways within Created Gateways all equal 3, and all gateways are in an AVAILABLE state.

    For example,
    [root@host-01 gwconf]# ceph nvme-gw show mypool mygroup1
    {
        "epoch": 2059,
        "pool": "mypool",
        "group": "mygroup1",
        "features": "LB",
        "rebalance_ana_group": 3,
        "num gws": 3,
        "Anagrp list": "[ 3 2 1 ]",
        "num-namespaces": 944,
        "Created Gateways:": [
            {
                "gw-id": "client.nvmeof.mypool.mygroup1.host01.yarqjx",
                "anagrp-id": 3,
                "num-namespaces": 315,
                "performed-full-startup": 1,
                "Availability": "AVAILABLE",
                "num-listeners": 118,
                "ana states": " 1: STANDBY ,  2: STANDBY ,  3: ACTIVE "
            },
            {
                "gw-id": "client.nvmeof.mypool.mygroup1.host02.eashcy",
                "anagrp-id": 2,
                "num-namespaces": 315,
                "performed-full-startup": 1,
                "Availability": "AVAILABLE",
                "num-listeners": 118,
                "ana states": " 1: STANDBY ,  2: ACTIVE ,  3: STANDBY "
            },
            {
                "gw-id": "client.nvmeof.mypool.mygroup1.host03.eakobb",
                "anagrp-id": 1,
                "num-namespaces": 314,
                "performed-full-startup": 1,
                "Availability": "AVAILABLE",
                "num-listeners": 118,
                "ana states": " 1: ACTIVE ,  2: STANDBY ,  3: STANDBY "
            }
        ]
    }
    Important: Continue with the following steps only after the scale-up operation is complete.
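
The checks in step 3 can also be applied programmatically to the JSON output. The following Python sketch assumes the output has the same keys as the example above; the function name, the trimmed sample data, and the tolerance of 2 (standing in for "within a few of each other") are assumptions for illustration.

```python
import json


def check_scale_up(show_output, expected_gateways, tolerance=2):
    """Apply the step 3 checks to the JSON text printed by 'ceph nvme-gw show'."""
    data = json.loads(show_output)
    gateways = data["Created Gateways:"]
    counts = [gw["num-namespaces"] for gw in gateways]
    return {
        # Per-gateway namespace counts are close to each other.
        "balanced": max(counts) - min(counts) <= tolerance,
        # Per-gateway counts add up to the group total.
        "totals_match": sum(counts) == data["num-namespaces"],
        # The number of gateways matches what is expected after the scale-up.
        "gateway_count_ok": len(gateways) == expected_gateways == data["num gws"],
        "all_available": all(gw["Availability"] == "AVAILABLE" for gw in gateways),
    }


# Trimmed copy of the example output, keeping only the fields that are checked.
sample = """
{
    "num gws": 3,
    "num-namespaces": 944,
    "Created Gateways:": [
        {"num-namespaces": 315, "Availability": "AVAILABLE"},
        {"num-namespaces": 315, "Availability": "AVAILABLE"},
        {"num-namespaces": 314, "Availability": "AVAILABLE"}
    ]
}
"""
result = check_scale_up(sample, expected_gateways=3)
print(result)
```

If any value in the result is False, the scale-up or the rebalancing might still be in progress.
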
  4. Run the nvme discover command to verify that the new gateway IP addresses are discovered as expected.
    Run this command for each initiator.
    Note: This step is only applicable to RHEL initiators. For ESXi initiators, skip to step 5.
    nvme discover -t tcp -a GATEWAY_IP -s 8009

    The output provides the IP address (traddr) of the new NVMe-oF gateway for the subsystems added with listeners in step 1, which can be connected from this initiator.

    For example,
    [root@host01 ~]# nvme discover -t tcp -a 10.172.19.01 -s 8009
    Discovery Log Entry 0
    trtype:  tcp
    adrfam:  ipv4
    subtype: nvme subsystem
    treq:    not required
    portid:  0
    trsvcid: 4420
    subnqn:  nqn.2016-06.io.spdk:cnode1.group1 
    traddr:  9.147.168.14
    eflags:  none
    sectype: none
    Discovery Log Entry 1
    trtype:  tcp
    adrfam:  ipv4
    subtype: nvme subsystem
    treq:    not required
    portid:  1
    trsvcid: 4420
    subnqn:  nqn.2016-06.io.spdk:cnode1.group1 
    traddr:  9.147.168.32
    eflags:  none
    sectype: none
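
To confirm on several initiators that a new gateway address appears in the discovery log, the text output can be parsed. The following Python sketch assumes the output layout shown in the example above; the function names are hypothetical and the sample is a trimmed copy of that example.

```python
def parse_discovery_log(text):
    """Split 'nvme discover' text output into one dictionary per log entry."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if "Discovery Log Entry" in line:
            entries.append({})
        elif ":" in line and entries:
            # Split on the first colon only, because NQN values contain colons.
            key, _, value = line.partition(":")
            entries[-1][key.strip()] = value.strip()
    return entries


def gateway_discovered(entries, gateway_ip, subnqn):
    """Return True if the gateway address is listed for the given subsystem."""
    return any(e.get("traddr") == gateway_ip and e.get("subnqn") == subnqn
               for e in entries)


sample = """
Discovery Log Entry 0
trtype:  tcp
trsvcid: 4420
subnqn:  nqn.2016-06.io.spdk:cnode1.group1
traddr:  9.147.168.14
Discovery Log Entry 1
trtype:  tcp
trsvcid: 4420
subnqn:  nqn.2016-06.io.spdk:cnode1.group1
traddr:  9.147.168.32
"""
entries = parse_discovery_log(sample)
print(gateway_discovered(entries, "9.147.168.32", "nqn.2016-06.io.spdk:cnode1.group1"))
```
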
  5. Reconnect from the initiators to the subsystems.
    Run this command for each initiator.
    Reconnecting Red Hat Enterprise Linux initiators
    Use the nvme connect-all command to connect to all gateways in the group, establishing multipath connections.
    Note: When using bidirectional in-band authentication, use the nvme connect command instead of nvme connect-all. The nvme connect command must be run for each gateway for all namespaces to be visible.
    To connect to a specific subsystem, use the following values from the previous step:
    • subnqn value as the SUBSYSTEM_NQN
    • The trsvcid port value
    nvme connect --traddr GATEWAY_IP --transport tcp --nqn SUBSYSTEM_NQN --trsvcid PORT
    Important: Namespaces might not be visible if an initiator is connected to two different gateway groups with the same NQN.
    Note: Adding the -l flag enables the initiator to retry connecting to the gateway in cases where the gateway becomes temporarily unavailable.
    nvme connect-all --traddr GATEWAY_IP --transport tcp -l 1800 -s 8009
    For example,
    [root@host01 ~]# nvme connect-all --traddr 10.172.19.01 --transport tcp -l 1800 -s 8009
    In this example, the -l flag is set to 1800 seconds, so the initiator keeps retrying the connection for up to 1800 seconds if the gateway becomes temporarily unavailable.
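
When nvme connect must be run for each gateway, the commands can be derived from the values reported by the discovery step. The following Python sketch builds one command per discovered entry in the form shown above and does not run anything. The function name and the inline entry list are assumptions for illustration, and authentication options are intentionally left out.

```python
import shlex


def connect_commands(entries):
    """Build one 'nvme connect' command per discovered gateway entry."""
    commands = []
    for entry in entries:
        commands.append(shlex.join([
            "nvme", "connect",
            "--traddr", entry["traddr"],
            "--transport", "tcp",
            "--nqn", entry["subnqn"],
            "--trsvcid", entry["trsvcid"],
        ]))
    return commands


# Values taken from the discovery example in the previous step.
entries = [
    {"traddr": "9.147.168.14", "trsvcid": "4420",
     "subnqn": "nqn.2016-06.io.spdk:cnode1.group1"},
    {"traddr": "9.147.168.32", "trsvcid": "4420",
     "subnqn": "nqn.2016-06.io.spdk:cnode1.group1"},
]
commands = connect_commands(entries)
for command in commands:
    print(command)
```
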
    Reconnecting ESXi initiators
    Connect to the NVMe-oF gateway subsystem.
    This command discovers the NVMe-oF gateways in the gateway group and then connects to the gateways.
    esxcli nvme fabrics discover -a NVME_TCP_ADAPTER -i GATEWAY_IP -p 8009 -c
    Note: If a specific connection is required without discovering all of the available gateways, run the esxcli nvme fabrics connect command.
    esxcli nvme fabrics connect -a NVME_TCP_ADAPTER -i GATEWAY_IP -s SUBSYSTEM_NQN -p 8009
    For example,
    [root@host01:~] esxcli nvme fabrics connect -a vmhba64 -i 10.0.211.196 -s nqn.2016-06.io.spdk:cnode1.group1 -p 8009
    To verify, use the esxcli nvme controller list command.
    esxcli nvme controller list |grep TCP
    Check that the new connection is listed and marked as true.
    For example,
    [root@host01:~] esxcli nvme fabrics discover -a vmhba64 -i 10.0.211.196 -p 8009 -c
    
    Transport Type Address Family Subsystem Type Controller ID Admin Queue Max Size Transport Address Transport Service ID Subsystem NQN              Connected
    -------------- -------------- -------------- ------------- -------------------- ----------------- -------------------- -------------------------- ---------
    TCP            IPv4           NVM            65535         128                   10.0.211.196     8009                 nqn.2016-06.io.spdk:cnode1.group1   true

What to do next

Verify the gateway connections.
  1. List the NVMe-oF block devices.
    nvme list
    For example,
    [root@host01 ~]# nvme list
    Node                    Generic           SN                   Model                   Namespace Usage                      Format           FW Rev
    ---------------------   ----------------  -------------------  ----------------------- --------- -------------------------- ---------------- --------
    /home/nvme01_node01     /home/ng1n1       SPDK00000000000001   SPDK bdev Controller    1          10,49  MB /  10,49  MB      4 KiB +  0 B   23.01
    ...
  2. Verify that the initiator is connected to all NVMe-oF gateways and subsystems in the gateway group.
    nvme list-subsys
    For example,
    [root@init-nvme-vm5 ~]# nvme list-subsys
    nvme-subsys5 - NQN nqn.2016-06.io.spdk:cnode2
    \
    +- nvme5 tcp traddr 10.243.64.5,trsvcid 4420 live
    +- nvme6 tcp traddr 10.243.64.10,trsvcid 4420 live
    +- nvme7 tcp traddr 10.243.64.11,trsvcid 4420 live
    +- nvme8 tcp traddr 10.243.64.12,trsvcid 4420 live
    nvme-subsys1 - NQN nqn.2016-06.io.spdk:cnode1.group1 
    \
    +- nvme1 tcp traddr 10.243.64.5,trsvcid 4420 live
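
The check in step 2 can be sketched as a comparison between the gateway addresses and the live paths of each subsystem. The following Python sketch assumes the nvme list-subsys text layout shown in the example above, which can differ between nvme-cli versions. The function names are hypothetical and the sample is a trimmed copy of that example.

```python
import re

SUBSYS_RE = re.compile(r"^nvme-subsys\d+ - NQN[ =](\S+)")
PATH_RE = re.compile(r"\+-\s+(\S+)\s+tcp\s+traddr[ =]([^,\s]+),trsvcid[ =](\d+)\s+(\w+)")


def paths_by_subsystem(text):
    """Map each subsystem NQN to a list of (traddr, state) tuples."""
    paths = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        subsys = SUBSYS_RE.match(line)
        if subsys:
            current = subsys.group(1)
            paths[current] = []
            continue
        path = PATH_RE.search(line)
        if path and current is not None:
            paths[current].append((path.group(2), path.group(4)))
    return paths


def missing_live_paths(paths, nqn, gateway_ips):
    """Return the gateway addresses without a live path for the subsystem."""
    live = {traddr for traddr, state in paths.get(nqn, []) if state == "live"}
    return [ip for ip in gateway_ips if ip not in live]


sample = """
nvme-subsys5 - NQN nqn.2016-06.io.spdk:cnode2
+- nvme5 tcp traddr 10.243.64.5,trsvcid 4420 live
+- nvme6 tcp traddr 10.243.64.10,trsvcid 4420 live
+- nvme7 tcp traddr 10.243.64.11,trsvcid 4420 live
+- nvme8 tcp traddr 10.243.64.12,trsvcid 4420 live
nvme-subsys1 - NQN nqn.2016-06.io.spdk:cnode1.group1
+- nvme1 tcp traddr 10.243.64.5,trsvcid 4420 live
"""
gateways = ["10.243.64.5", "10.243.64.10", "10.243.64.11", "10.243.64.12"]
paths = paths_by_subsystem(sample)
print(missing_live_paths(paths, "nqn.2016-06.io.spdk:cnode2", gateways))
print(missing_live_paths(paths, "nqn.2016-06.io.spdk:cnode1.group1", gateways))
```

An empty list means that the initiator has a live path to every listed gateway for that subsystem.
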
  3. Create a filesystem on the target of your choosing. Use the target path that was found in step 1.
    mkfs NVME_NODE_PATH
    For example,
    [root@host01 ~]# mkfs /home/nvme01_node01
    mke2fs 1.46.5 (20-Dec-2023)
    Discarding device blocks: done
    Creating filesystem with 2560 4k blocks and 2560 inodes
    
    Allocating group tables: done
    Writing inode tables: done
    Writing superblocks and filesystem accounting information: done
  4. Create a mount point directory for NVMe-oF.
    mkdir /mnt/nvmeof
    For example,
    [root@host01 ~]# mkdir /mnt/nvmeof
  5. Mount the node on the NVMe-oF directory.
    mount NVME_NODE_PATH /mnt/nvmeof
    For example,
    [root@host01 ~]# mount /home/nvme01_node01 /mnt/nvmeof
  6. List the files on the mounted NVMe-oF file system, using sudo if required.
    ls /mnt/nvmeof
    For example,
    $ ls /mnt/nvmeof
    lost+found
  7. Create a text file within the /mnt/nvmeof directory.
    For example,
    $ sudo bash -c "echo Hello NVMe-oF > /mnt/nvmeof/hello.txt"
  8. Verify that the text file can now be reached.
    For example,
    $ cat /mnt/nvmeof/hello.txt
    Hello NVMe-oF
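
The write and read check in steps 7 and 8 can be expressed as one small function. In the following Python sketch, the function name is hypothetical, and a temporary directory stands in for /mnt/nvmeof so that the sketch runs without a mounted NVMe-oF device.

```python
import tempfile
from pathlib import Path


def verify_read_write(mount_point, name="hello.txt", text="Hello NVMe-oF\n"):
    """Write a small text file under the mount point and read it back."""
    target = Path(mount_point) / name
    target.write_text(text)
    return target.read_text() == text


# A temporary directory is used here in place of the real mount point.
with tempfile.TemporaryDirectory() as demo_dir:
    ok = verify_read_write(demo_dir)
print(ok)
```

On a real initiator, the mount point from step 4 would be passed instead, with the privileges needed to write to it.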