Recovery group issues for shared recovery groups
An ESS 3000, ESS 3200, or ESS 3500 recovery group is called a shared recovery group because the enclosure disks are shared by both canister servers in the building block. The building block contains two canister servers and an NVMe enclosure, and is configured as a single recovery group that is simultaneously active on both canister servers.
The single shared recovery group structure is necessary because the ESS system can have as few as 12 disks, which is the smallest number of disks a recovery group can contain. Twelve disks allow for one equivalent spare and the 11-wide 8+3P RAID code.
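The disk-count arithmetic behind that minimum can be sketched as follows; the widths (8 data strips plus 3 parity strips for the 8+3P code, plus one equivalent spare) are taken from the paragraph above:

```python
# Minimal sketch of the smallest-recovery-group arithmetic described above.
data_strips = 8        # data strips in the 8+3P RAID code
parity_strips = 3      # parity strips in the 8+3P RAID code
equivalent_spares = 1  # spare capacity reserved for rebuilds

code_width = data_strips + parity_strips    # the 11-wide 8+3P code
min_disks = code_width + equivalent_spares  # smallest possible recovery group

print(code_width, min_disks)  # 11 12
```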
Consider a building block whose server node class is ESSNC:
# mmvdisk server list --node-class ESSNC
node
number server active remarks
------ -------------------------------- ------- -------
3 canister1.gpfs.net yes serving ESSRG: LG002, LG004
4 canister2.gpfs.net yes serving ESSRG: root, LG001, LG003
For these ESS systems, each server is simultaneously serving the same single recovery group, ESSRG. The server workload within the building block is balanced by subdividing the single shared recovery group into the following log groups: LG001, LG002, LG003, LG004, and the lightweight root or master log group. The non-root log groups are called user log groups. Only the user log groups contain the file system vdisk NSDs.
# mmvdisk recoverygroup list
needs user
recovery group active current or master server service vdisks remarks
-------------- ------- -------------------------------- ------- ------ -------
ESSRG yes canister2.gpfs.net no 16
ESSRG1 yes server1.gpfs.net no 8
ESSRG2 yes server2.gpfs.net no 8
The needs service column in all IBM Spectrum Scale RAID commands is narrowly defined to mean whether a disk in the recovery group is called out for replacement. The mmvdisk recoverygroup list --not-ok command can be used to show other recovery group issues, including those involving log groups or servers:
# mmvdisk recoverygroup list --not-ok
recovery group remarks
-------------- -------
ESSRG server canister2.gpfs.net 'down'
#
# mmvdisk recoverygroup list --server --recovery-group ESSRG
node
number server active remarks
------ -------------------------------- ------- -------
3 canister1.gpfs.net yes serving ESSRG: root, LG001, LG002, LG003, LG004
4 canister2.gpfs.net no configured
When the down server is brought back up, the Recovery Group Configuration Manager (RGCM) process that is running on the cluster manager node assigns it two of the user log groups. The two user log groups are used to rebalance the recovery group server workload. For more information, see Server failover for shared recovery groups.
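The balanced state that RGCM maintains can be illustrated with a small sketch (this is not the actual RGCM implementation; the round-robin dealing and the server names are illustrative assumptions): the four user log groups are dealt across the active servers, so with both canisters up each one serves two user log groups, and with one canister down the survivor serves all four.

```python
# Illustrative sketch only: round-robin assignment of user log groups
# to active servers, mimicking the balanced state RGCM maintains.
# The names (LG001..LG004, canister1/canister2) follow the example above.

def assign_user_log_groups(log_groups, active_servers):
    """Deal the user log groups across the active servers in round-robin order."""
    assignment = {server: [] for server in active_servers}
    for i, lg in enumerate(log_groups):
        server = active_servers[i % len(active_servers)]
        assignment[server].append(lg)
    return assignment

user_lgs = ["LG001", "LG002", "LG003", "LG004"]

# Both canisters active: each serves two user log groups.
print(assign_user_log_groups(user_lgs, ["canister1", "canister2"]))

# One canister down: the surviving server serves all four.
print(assign_user_log_groups(user_lgs, ["canister1"]))
```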
Except when a failover occurs or while servers are rejoining a recovery group, RGCM always keeps two user log groups on each server. In the unlikely event that both servers are active but each server does not have two user log groups, you can shut down one of the servers and restart it. Shutting down a server and restarting it causes RGCM to redistribute the user log groups across the servers.
# mmvdisk recoverygroup list --server --recovery-group ESSRG
node
number server active remarks
------ -------------------------------- ------- -------
3 canister1.gpfs.net yes serving ESSRG: root, LG001, LG002, LG003
4 canister2.gpfs.net yes serving ESSRG: LG004
For example, shutting down canister2 and starting it back up restores the log group workload balance in the building block within five minutes or fewer:
# mmshutdown -N canister2.gpfs.net
# mmstartup -N canister2.gpfs.net
# sleep 300
# mmvdisk recoverygroup list --server --recovery-group ESSRG
node
number server active remarks
------ -------------------------------- ------- -------
3 canister1.gpfs.net yes serving ESSRG: root, LG002, LG003
4 canister2.gpfs.net yes serving ESSRG: LG001, LG004
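A quick way to confirm the balance is to count the user log groups per server in the mmvdisk server listing. The sketch below parses captured output rather than calling mmvdisk itself; the embedded sample is hypothetical text shaped like the listings above.

```python
import re

# Hypothetical captured output, shaped like
#   mmvdisk recoverygroup list --server --recovery-group ESSRG
listing = """\
 node
number  server              active  remarks
------  ------------------  ------  -------
     3  canister1.gpfs.net  yes     serving ESSRG: root, LG002, LG003
     4  canister2.gpfs.net  yes     serving ESSRG: LG001, LG004
"""

def user_log_groups_per_server(text):
    """Count the non-root log groups (LGnnn) served by each active server."""
    counts = {}
    for line in text.splitlines():
        match = re.search(r"(\S+)\s+yes\s+serving \S+: (.+)", line)
        if match:
            server, groups = match.groups()
            counts[server] = len(re.findall(r"LG\d+", groups))
    return counts

counts = user_log_groups_per_server(listing)
print(counts)  # balanced when each server serves two user log groups
```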