Creating recovery groups on the IBM Storage Scale System 3200
You can create recovery groups using IBM Storage Scale RAID commands. You can create vdisks, NSDs, and file systems using IBM Storage Scale commands or using the IBM Storage Scale System 3200 GUI.
To create vdisks, NSDs, and file systems using the IBM Storage Scale System 3200 GUI, use the Create File System action in the Files > File Systems view. You must decide whether you will use IBM Storage Scale commands or the IBM Storage Scale System 3200 GUI to create vdisks, NSDs, and file systems; a combination of the two is not supported. The IBM Storage Scale System 3200 GUI cannot create a file system on existing NSDs.
Configuring GPFS nodes to be recovery group servers
After you have verified the disk enclosure connectivity of the GL4 building block, and optionally created a component database to associate names and machine room locations with the storage hardware, you can create the recovery groups on the GL4 building block disk enclosures and servers.
The servers must be members of the same GPFS cluster and must be configured for IBM Storage Scale RAID.
IBM Storage Scale must be running on the servers, and the servers should not have been rebooted or had their disk configuration changed since the verified disk topology files were acquired.
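For reference, the verified topology files used in this example were acquired by running the mmgetpdisktopology command on each server and checking the result with topsummary. A minimal sketch of that step, assuming the file names used below:
On server1:
# mmgetpdisktopology > server1.top
# topsummary server1.top
On server2:
# mmgetpdisktopology > server2.top
# topsummary server2.top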
Defining the recovery group layout
The definition of recovery groups on a GL4 building block is accomplished by dividing the drawers of the enclosures into left and right halves. The sharing of GL4 disk enclosures by two servers implies two recovery groups; one is served by one node and one by the other, and each server acts as the other's backup. Half the disks in each enclosure and drawer should belong to one recovery group, and half to the other. One recovery group will therefore be defined on the disks in the left half of each drawer, slots 1 through 6, and one on the disks in the right half of each drawer, slots 7 through 12. The SSD in drawer 1, slot 3 of the first enclosure will make up the SSD declustered array for the left recovery group, and the SSD in drawer 5, slot 12 of the first enclosure will make up the SSD declustered array of the right recovery group. The remaining 116 HDDs in each half are divided into two vdisk data declustered arrays of 58 disks.
IBM Storage Scale RAID provides a script, mkrginput, that understands the layout of IBM Storage Scale System 3200 building blocks and will automatically generate the mmcrrecoverygroup stanza files for creating the left and right recovery groups. The mkrginput script, when supplied with the output of the mmgetpdisktopology command from the two servers, will create recovery group stanza files for the left and right sets of disks.
# mkrginput server1.top server2.top
(In ESS 3.5, the -s parameter can be used with mkrginput to create a single data declustered array in GL4 and GL6 building blocks. Do this only if all GL4 and GL6 recovery groups in the cluster are to be treated the same. See the mkrginput script for more information.)
The mkrginput script creates two recovery group stanza files, one for the left set of disks and one for the right set of disks found in the server1 topology. The files will be named after the serial number of the enclosure determined to be first in the topology, but each will contain disks from all four enclosures. In this case, the resulting stanza files will be SV35229088L.stanza for the left half and SV35229088R.stanza for the right half:
# ls -l SV35229088L.stanza SV35229088R.stanza
-rw-r--r-- 1 root root 7244 Nov 25 09:18 SV35229088L.stanza
-rw-r--r-- 1 root root 7243 Nov 25 09:18 SV35229088R.stanza
The recovery group stanza files will follow the recommended best practice for a GL4 building block of defining in each half a declustered array called NVR with two NVRAM partitions (one from each server) for fast recovery group RAID update logging; a declustered array called SSD with either the left or right SSD to act as a backup for RAID update logging; and two file system data declustered arrays called DA1 and DA2 using the regular HDDs. (If the -s parameter was used with mkrginput, there will be no DA2 data declustered array.)
The declustered array definitions and their required parameters can be seen by grepping for daName in the stanza files:
# grep daName SV35229088L.stanza
%da: daName=SSD spares=0 replaceThreshold=1 auLogSize=120m
%da: daName=NVR spares=0 replaceThreshold=1 auLogSize=120m nspdEnable=yes
%da: daName=DA1 VCDSpares=31
%da: daName=DA2 VCDSpares=31
The parameters after the declustered array names are required exactly as shown. (If the -s parameter was used with mkrginput, there will be no DA2 data declustered array, and the parameters for the DA1 data declustered array will instead be spares=4 VCDSpares=60.)
The NVRAM partitions supplied by the servers can be seen by grepping for da=NVR in the stanza file:
# grep da=NVR SV35229088L.stanza
%pdisk: pdiskName=n1s01 device=//server1/dev/sda5 da=NVR rotationRate=NVRAM
%pdisk: pdiskName=n2s01 device=//server2/dev/sda5 da=NVR rotationRate=NVRAM
This shows that a pdisk called n1s01 will be created using the /dev/sda5 NVRAM partition from server1, and a pdisk called n2s01 will use the /dev/sda5 NVRAM partition on server2. The parameters again are required exactly as shown. The name n1s01 means node 1, slot 1, referring to the server node and the NVRAM partition, or "slot." Similarly, n2s01 means server node 2, NVRAM slot 1.
The disks in the SSD, DA1, and DA2 declustered arrays can be found using similar grep invocations on the stanza files.
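For example, this invocation (output omitted here) would list the pdisk stanzas assigned to the DA1 declustered array of the left recovery group:
# grep da=DA1 SV35229088L.stanza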
If, as was recommended, the IBM Storage Scale RAID component database was used to provide meaningful names for the GL4 building block components, the names of the recovery groups should be chosen to follow that convention. In this example, the building block rack was named BB1 and the enclosures were named BB1ENC1, BB1ENC2, BB1ENC3, and BB1ENC4. It would make sense then to name the left and right recovery groups BB1RGL and BB1RGR. Other conventions are possible as well.
The left and right recovery groups can then be created with the mmcrrecoverygroup command:
# mmcrrecoverygroup BB1RGL -F SV35229088L.stanza --servers server1,server2
mmcrrecoverygroup: Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.
# mmcrrecoverygroup BB1RGR -F SV35229088R.stanza --servers server2,server1
mmcrrecoverygroup: Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.
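As a quick first check (a sketch; output omitted), running mmlsrecoverygroup with no arguments lists all recovery groups in the cluster and should now show both BB1RGL and BB1RGR with their servers:
# mmlsrecoverygroup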
The left recovery group is created with server1 as primary and server2 as backup, and the right recovery group is created with server2 as primary and server1 as backup.
Verifying recovery group creation
The details of a newly created recovery group can be examined with the mmlsrecoverygroup command; here the left recovery group is shown:
# mmlsrecoverygroup BB1RGL -L
declustered
recovery group arrays vdisks pdisks format version
----------------- ----------- ------ ------ --------------
BB1RGL 4 0 119 4.1.0.1
declustered needs replace scrub background activity
array service vdisks pdisks spares threshold free space duration task progress priority
----------- ------- ------ ------ ------ --------- ---------- -------- -------------------------
SSD no 0 1 0,0 1 186 GiB 14 days repair-RGD/VCD 10% low
NVR no 0 2 0,0 1 3744 MiB 14 days repair-RGD/VCD 10% low
DA1 no 0 58 2,31 2 138 TiB 14 days repair-RGD/VCD 10% low
DA2 no 0 58 2,31 2 138 TiB 14 days repair-RGD/VCD 10% low
declustered checksum
vdisk RAID code array vdisk size block size granularity state remarks
------------------ ------------------ ----------- ---------- ---------- ----------- ----- -------
config data declustered array VCD spares actual rebuild spare space remarks
------------------ ------------------ ------------- --------------------------------- ----------------
rebuild space DA1 31 36 pdisk
rebuild space DA2 31 36 pdisk
config data max disk group fault tolerance actual disk group fault tolerance remarks
-------------- ------------------------------ --------------------------------- -----------
rg descriptor 1 enclosure + 1 drawer 1 enclosure + 1 drawer limiting fault tolerance
system index 2 enclosure 1 enclosure + 1 drawer limited by rg descriptor
active recovery group server servers
----------------------------------------------- -------
server1 server1,server2
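A similar listing for the right recovery group (output omitted here) should show server2 as the active recovery group server:
# mmlsrecoverygroup BB1RGR -L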
Notice that the vdisk information for the newly created recovery group is indicated with 0s or is missing; the next step is to create the vdisks.
Defining and creating the vdisks
Once the recovery groups are created and being served by their respective servers, it is time to create the vdisks using the mmcrvdisk command.
The internal RAID transaction and update log vdisks must be created first:
- A log tip vdisk (type vdiskLogTip) in the NVR declustered array
- A log tip backup vdisk (type vdiskLogTipBackup) in the SSD declustered array
- A log home vdisk (type vdiskLog) in the DA1 declustered array
- A log reserved vdisk (type vdiskLogReserved) in the DA2 declustered array (and in the DA3 declustered array, in the case of GL6)
(For brevity, the log reserved vdisk is shown only for DA2 in the remainder of this example.)
On Power Systems servers, the checksumGranularity=4k parameter is required for the various log vdisks in the log vdisk stanza file. This parameter should be omitted on non-Power® servers.
The log vdisks for the example left and right recovery groups can be created with the following stanza file:
# cat mmcrvdisklog.BB1
%vdisk:
vdiskName=BB1RGLLOGTIP
rg=BB1RGL
daName=NVR
blocksize=2m
size=48m
raidCode=2WayReplication
checksumGranularity=4k # Power only
diskUsage=vdiskLogTip
%vdisk:
vdiskName=BB1RGLLOGTIPBACKUP
rg=BB1RGL
daName=SSD
blocksize=2m
size=48m
raidCode=Unreplicated
checksumGranularity=4k # Power only
diskUsage=vdiskLogTipBackup
%vdisk:
vdiskName=BB1RGLLOGHOME
rg=BB1RGL
daName=DA1
blocksize=2m
size=20g
raidCode=4WayReplication
checksumGranularity=4k # Power only
diskUsage=vdiskLog
longTermEventLogSize=4m
shortTermEventLogSize=4m
fastWriteLogPct=90
%vdisk:
vdiskName=BB1RGLDA2RESERVED
rg=BB1RGL
daName=DA2
blocksize=2m
size=20g
raidCode=4WayReplication
checksumGranularity=4k # Power only
diskUsage=vdiskLogReserved
%vdisk:
vdiskName=BB1RGRLOGTIP
rg=BB1RGR
daName=NVR
blocksize=2m
size=48m
raidCode=2WayReplication
checksumGranularity=4k # Power only
diskUsage=vdiskLogTip
%vdisk:
vdiskName=BB1RGRLOGTIPBACKUP
rg=BB1RGR
daName=SSD
blocksize=2m
size=48m
raidCode=Unreplicated
checksumGranularity=4k # Power only
diskUsage=vdiskLogTipBackup
%vdisk:
vdiskName=BB1RGRLOGHOME
rg=BB1RGR
daName=DA1
blocksize=2m
size=20g
raidCode=4WayReplication
checksumGranularity=4k # Power only
diskUsage=vdiskLog
longTermEventLogSize=4m
shortTermEventLogSize=4m
fastWriteLogPct=90
%vdisk:
vdiskName=BB1RGRDA2RESERVED
rg=BB1RGR
daName=DA2
blocksize=2m
size=20g
raidCode=4WayReplication
checksumGranularity=4k # Power only
diskUsage=vdiskLogReserved
The parameters chosen for size, blocksize, raidCode, fastWriteLogPct, and the event log sizes are standard, have been carefully calculated, and should not be changed. The only difference in the vdisk log stanza files between two building blocks will be in the recovery group and vdisk names. (In the case of a GL6 building block with NVRAM partitions, there will be an additional vdiskLogReserved for DA3, with parameters otherwise identical to the DA2 log reserved vdisk.) As noted above, the checksumGranularity=4k parameter is required on Power Systems servers and should be omitted on non-Power servers.
The log vdisks are then created with the mmcrvdisk command:
# mmcrvdisk -F mmcrvdisklog.BB1
mmcrvdisk: [I] Processing vdisk BB1RGLLOGTIP
mmcrvdisk: [I] Processing vdisk BB1RGLLOGTIPBACKUP
mmcrvdisk: [I] Processing vdisk BB1RGLLOGHOME
mmcrvdisk: [I] Processing vdisk BB1RGLDA2RESERVED
mmcrvdisk: [I] Processing vdisk BB1RGRLOGTIP
mmcrvdisk: [I] Processing vdisk BB1RGRLOGTIPBACKUP
mmcrvdisk: [I] Processing vdisk BB1RGRLOGHOME
mmcrvdisk: [I] Processing vdisk BB1RGRDA2RESERVED
mmcrvdisk: Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.
You can use the mmlsvdisk command (or the mmlsrecoverygroup command) to verify that the log vdisks have been created:
# mmlsvdisk
declustered block size
vdisk name RAID code recovery group array in KiB remarks
------------------ --------------- ------------------ ----------- ---------- -------
BB1RGLDA2RESERVED 4WayReplication BB1RGL DA2 2048 logRsvd
BB1RGLLOGHOME 4WayReplication BB1RGL DA1 2048 log
BB1RGLLOGTIP 2WayReplication BB1RGL NVR 2048 logTip
BB1RGLLOGTIPBACKUP Unreplicated BB1RGL SSD 2048 logTipBackup
BB1RGRDA2RESERVED 4WayReplication BB1RGR DA2 2048 logRsvd
BB1RGRLOGHOME 4WayReplication BB1RGR DA1 2048 log
BB1RGRLOGTIP 2WayReplication BB1RGR NVR 2048 logTip
BB1RGRLOGTIPBACKUP Unreplicated BB1RGR SSD 2048 logTipBackup
Now the file system vdisks may be created. In this example, two vdisks are defined in each of the DA1 and DA2 declustered arrays of each recovery group:
- one using 4-way replication, a 1 MiB block size, and a total vdisk size of 2048 GiB, suitable for file system metadata
- one using Reed-Solomon 8 + 3p encoding and a 16 MiB block size, suitable for file system data
The contents of the vdisk creation stanza file depend on the number and type of vdisk NSDs required for the number and type of file systems desired, so the stanza file will generally need to be created by hand, possibly following a template. The sample vdisk stanza file supplied in /usr/lpp/mmfs/samples/vdisk/vdisk.stanza can be used for this purpose and adapted to specific file system requirements.
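For example, one way to start (a sketch; mmcrvdisknsd.BB1 is the file name used below) is to copy the sample file and edit it to match the building block:
# cp /usr/lpp/mmfs/samples/vdisk/vdisk.stanza mmcrvdisknsd.BB1
# vi mmcrvdisknsd.BB1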
# cat mmcrvdisknsd.BB1
%vdisk: vdiskName=BB1RGLMETA1
rg=BB1RGL
da=DA1
blocksize=1m
size=2048g
raidCode=4WayReplication
diskUsage=metadataOnly
failureGroup=1
pool=system
%vdisk: vdiskName=BB1RGLMETA2
rg=BB1RGL
da=DA2
blocksize=1m
size=2048g
raidCode=4WayReplication
diskUsage=metadataOnly
failureGroup=1
pool=system
%vdisk: vdiskName=BB1RGRMETA1
rg=BB1RGR
da=DA1
blocksize=1m
size=2048g
raidCode=4WayReplication
diskUsage=metadataOnly
failureGroup=1
pool=system
%vdisk: vdiskName=BB1RGRMETA2
rg=BB1RGR
da=DA2
blocksize=1m
size=2048g
raidCode=4WayReplication
diskUsage=metadataOnly
failureGroup=1
pool=system
%vdisk: vdiskName=BB1RGLDATA1
rg=BB1RGL
da=DA1
blocksize=16m
raidCode=8+3p
diskUsage=dataOnly
failureGroup=1
pool=data
%vdisk: vdiskName=BB1RGLDATA2
rg=BB1RGL
da=DA2
blocksize=16m
raidCode=8+3p
diskUsage=dataOnly
failureGroup=1
pool=data
%vdisk: vdiskName=BB1RGRDATA1
rg=BB1RGR
da=DA1
blocksize=16m
raidCode=8+3p
diskUsage=dataOnly
failureGroup=1
pool=data
%vdisk: vdiskName=BB1RGRDATA2
rg=BB1RGR
da=DA2
blocksize=16m
raidCode=8+3p
diskUsage=dataOnly
failureGroup=1
pool=data
Notice how the file system metadata vdisks are flagged for eventual file system usage as metadataOnly and for placement in the system storage pool, and the file system data vdisks are flagged for eventual dataOnly usage in the data storage pool. (After the file system is created, a policy will be required to allocate file system data to the correct storage pools; see Creating the GPFS file system.)
Importantly, also notice that the block sizes for the file system metadata and file system data vdisks must be specified at this time, may not later be changed, and must match the block sizes supplied to the eventual mmcrfs command.
Notice also that the eventual failureGroup=1 value for the NSDs on the file system vdisks is the same for vdisks in both the BB1RGL and BB1RGR recovery groups. This is because the recovery groups, although they have different servers, still share a common point of failure in the four GL4 disk enclosures, and IBM Storage Scale should be informed of this through a distinct failure group designation for each building block. It is up to the IBM Storage Scale system administrator to decide upon the failure group numbers for each IBM Storage Scale System 3200 building block in the GPFS cluster. In this example, the failure group number 1 has been chosen to match the example building block number.
To create the vdisks specified in the mmcrvdisknsd.BB1 file, use the following mmcrvdisk command:
# mmcrvdisk -F mmcrvdisknsd.BB1
mmcrvdisk: [I] Processing vdisk BB1RGLMETA1
mmcrvdisk: [I] Processing vdisk BB1RGLMETA2
mmcrvdisk: [I] Processing vdisk BB1RGRMETA1
mmcrvdisk: [I] Processing vdisk BB1RGRMETA2
mmcrvdisk: [I] Processing vdisk BB1RGLDATA1
mmcrvdisk: [I] Processing vdisk BB1RGLDATA2
mmcrvdisk: [I] Processing vdisk BB1RGRDATA1
mmcrvdisk: [I] Processing vdisk BB1RGRDATA2
mmcrvdisk: Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.
You can use the mmlsvdisk command or the mmlsrecoverygroup command to verify that the vdisks have been created.
Creating NSDs from vdisks
The mmcrvdisk command rewrites the vdisk input stanza file so that it is ready to be used as input to the mmcrnsd command, which creates the NSDs:
# mmcrnsd -F mmcrvdisknsd.BB1
mmcrnsd: Processing disk BB1RGLMETA1
mmcrnsd: Processing disk BB1RGLMETA2
mmcrnsd: Processing disk BB1RGRMETA1
mmcrnsd: Processing disk BB1RGRMETA2
mmcrnsd: Processing disk BB1RGLDATA1
mmcrnsd: Processing disk BB1RGLDATA2
mmcrnsd: Processing disk BB1RGRDATA1
mmcrnsd: Processing disk BB1RGRDATA2
mmcrnsd: Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.
The mmcrnsd command then once again rewrites the stanza file in preparation for use as input to the mmcrfs command.
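As a check (a sketch; output omitted), the mmlsnsd command lists the new NSDs, which at this point are not yet assigned to any file system:
# mmlsnsd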
Creating the GPFS file system
The file system is created with the mmcrfs command, specifying the data block size with -B and the metadata block size with --metadata-block-size:
# mmcrfs gpfsbb1 -F mmcrvdisknsd.BB1 -B 16m --metadata-block-size 1m -T /gpfsbb1 -n 256
The following disks of gpfsbb1 will be formatted on node server1:
BB1RGLMETA1: size 269213696 KB
BB1RGRMETA1: size 269213696 KB
BB1RGLDATA1: size 8593965056 KB
BB1RGRDATA1: size 8593965056 KB
BB1RGLMETA2: size 269213696 KB
BB1RGRMETA2: size 269213696 KB
BB1RGLDATA2: size 8593965056 KB
BB1RGRDATA2: size 8593965056 KB
Formatting file system ...
Disks up to size 3.3 TB can be added to storage pool system.
Disks up to size 82 TB can be added to storage pool data.
Creating Inode File
Creating Allocation Maps
Creating Log Files
Clearing Inode Allocation Map
Clearing Block Allocation Map
Formatting Allocation Map for storage pool system
98 % complete on Tue Nov 25 13:27:00 2014
100 % complete on Tue Nov 25 13:27:00 2014
Formatting Allocation Map for storage pool data
85 % complete on Tue Nov 25 13:27:06 2014
100 % complete on Tue Nov 25 13:27:06 2014
Completed creation of file system /dev/gpfsbb1.
mmcrfs: Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.
The -n 256 parameter specifies that the allocation maps should account for 256 nodes mounting the file system. This is an example only and should be adjusted to actual cluster expectations.
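As a sanity check (a sketch; output omitted), the attributes of the new file system, including the block sizes chosen above, can be displayed with the mmlsfs command:
# mmlsfs gpfsbb1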
Because the file system has multiple storage pools, a placement policy is required to direct file data to the data storage pool. In this example, a one-line policy file sends all data there:
# cat policy
rule 'default' set pool 'data'
The policy must then be
installed in the file system using the mmchpolicy command:
# mmchpolicy gpfsbb1 policy -I yes
Validated policy 'policy': parsed 1 Placement Rules, 0 Restore Rules, 0 Migrate/Delete/Exclude Rules,
0 List Rules, 0 External Pool/List Rules
Policy 'policy' installed and broadcast to all nodes.
If a policy is not placed in a file system with multiple storage pools, attempts to place data into files will return ENOSPC as if the file system were full.
The file system can now be mounted on all nodes:
# mmmount gpfsbb1 -a
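To confirm the mounts (a sketch; output omitted), the mmlsmount command shows the nodes on which the file system is mounted:
# mmlsmount gpfsbb1 -L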