Configure storage type data replication

To view the file system data replica values, run the mmlsfs <fsName> -r -R command. In the output, the -r value is the default number of data replicas and the -R value is the maximum number of data replicas.

Important: The value of -R cannot be changed after the file system is created. If you are using IBM Storage Scale FPO, 3 is usually the recommended value for both -r and -R. If you are using Centralized Storage, -r = 1 and -R = 2 are the recommended values for production.
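
The recommendations above can be expressed as a small check. The following Python sketch is illustrative only: the mode labels "fpo" and "centralized" and the function name check_replica_values are hypothetical names chosen for this example, and the values are expected to be read manually from the mmlsfs output. It covers only the two recommendations stated in this paragraph.

```python
# Recommended (-r, -R) pairs taken from the paragraph above.
# The keys "fpo" and "centralized" are labels chosen for this sketch.
RECOMMENDED = {
    "fpo": (3, 3),
    "centralized": (1, 2),
}


def check_replica_values(mode, default_replicas, max_replicas):
    """Return a list of findings for the given -r and -R values."""
    findings = []
    # -R is the maximum number of data replicas, so -r cannot be above it.
    if default_replicas > max_replicas:
        findings.append("-r is larger than -R")
    rec_r, rec_max = RECOMMENDED[mode]
    if (default_replicas, max_replicas) != (rec_r, rec_max):
        findings.append(
            f"recommended for {mode}: -r = {rec_r}, -R = {rec_max}; "
            f"found -r = {default_replicas}, -R = {max_replicas}"
        )
    return findings


if __name__ == "__main__":
    print(check_replica_values("fpo", 3, 3))          # matches the recommendation
    print(check_replica_values("centralized", 2, 2))  # differs from the recommendation
```

Because -R cannot be changed after creation, a check of this kind is most useful before the file system is created.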

The following table lists the recommended combinations of dfs.replication, gpfs.replica.enforced, and the file system data replica values for each storage mode.

Table 1. Configurations for data replication
#1 FPO (gpfs.storage.type=local)
  dfs.replication: 3
  gpfs.replica.enforced: gpfs or dfs
  File system data replica: -r = 3 -R = 3
  Comments: Other combinations are not recommended.

#2 IBM Storage Scale System (gpfs.storage.type=shared)
  dfs.replication: 1
  gpfs.replica.enforced: dfs
  File system data replica: -r = 1 -R = 2, or -r = 1 -R = 3
  Comments:
  • Follows the HDFS protocol. However, a job fails if one DataNode (DN) goes down after getBlockLocation is returned.
  • Potential issue: This configuration does not take advantage of the fact that all DataNodes can access the blocks.
  • If you are using this configuration, you must use the mmlsattr command to check the file replication value. If the file replication value that is set is less than the dfs.replication value, the HDFS interface cannot be used to check it, because the NameNode returns at least the dfs.replication value in the shared storage mode.

#3 IBM Storage Scale System (gpfs.storage.type=shared)
  dfs.replication: 2 or 3
  gpfs.replica.enforced: gpfs
  File system data replica: -r = 1 -R = 2, or -r = 1 -R = 3
  Comments:
  • Follows the HDFS protocol (returns 2 or 3 DataNodes), but the reported replication does not match the real storage usage at the GPFS level.
  • A job does not fail if one DataNode goes down after getBlockLocation is returned.
  • Potential risk: Upper-layer applications calculate disk space consumption as replication * file size and therefore assume that a file takes more storage space than it actually does. HDFS Transparency still uses the actual disk space correctly.

#4 IBM Storage Scale System (gpfs.storage.type=shared)
  dfs.replication: 1
  gpfs.replica.enforced: gpfs
  File system data replica: -r = 1 -R = 2, or -r = 1 -R = 3
  Comments: Do not use this configuration if the application needs to set the replication value through the HDFS protocol.

#5 IBM Storage Scale System (gpfs.storage.type=shared)
  dfs.replication: 2 or 3
  gpfs.replica.enforced: dfs
  File system data replica: -r = 1 -R = 2, or -r = 1 -R = 3
  Comments:
  • All data is stored with 2 or 3 replicas, which does not take advantage of IBM Storage Scale System or SAN storage.
  • If you are using this configuration, you must use the mmlsattr command to check the file replication value. If the file replication value that is set is less than the dfs.replication value, the HDFS interface cannot be used to check it, because the NameNode returns at least the dfs.replication value in the shared storage mode.
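
To make the table easier to apply, the rows can be encoded as a lookup. The following Python sketch is a hedged illustration, not part of HDFS Transparency: the names TABLE_1, table_row, and apparent_and_actual_bytes are hypothetical, and the second function assumes, as a simplification, that actual usage equals the file replica count multiplied by the file size (metadata and block rounding are ignored). It illustrates the potential risk described for configuration #3.

```python
# Rows of Table 1, keyed by
# (gpfs.storage.type, gpfs.replica.enforced, dfs.replication).
# The dictionary layout is a choice made for this sketch.
TABLE_1 = {}


def _add(row, storage_type, enforced_values, dfs_values):
    """Register one table row under every key combination it covers."""
    for enforced in enforced_values:
        for dfs in dfs_values:
            TABLE_1[(storage_type, enforced, dfs)] = row


_add(1, "local", ("gpfs", "dfs"), (3,))
_add(2, "shared", ("dfs",), (1,))
_add(3, "shared", ("gpfs",), (2, 3))
_add(4, "shared", ("gpfs",), (1,))
_add(5, "shared", ("dfs",), (2, 3))


def table_row(storage_type, enforced, dfs_replication):
    """Return the Table 1 row number, or None if the combination is not listed."""
    return TABLE_1.get((storage_type, enforced, dfs_replication))


def apparent_and_actual_bytes(file_size, dfs_replication, file_replicas):
    """Configuration #3 effect: space an application computes as
    replication * file size, next to the simplified actual usage."""
    apparent = dfs_replication * file_size
    actual = file_replicas * file_size  # simplifying assumption, see lead-in
    return apparent, actual


if __name__ == "__main__":
    print(table_row("shared", "gpfs", 3))
    print(apparent_and_actual_bytes(100, 3, 1))
```

For a 100-byte file with dfs.replication set to 3 and one file system replica, an application that multiplies by the replication factor computes 300 bytes, while this simplified model counts 100 bytes.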

Note:
  • The dfs.replication parameter is defined in the hdfs-site.xml file. The gpfs.storage.type and gpfs.replica.enforced parameters are defined in the gpfs-site.xml file.
  • Starting from HDFS Transparency version 3.1.1-1, the default value of dfs.replication is 3 in hdfs-site.xml and the default value of gpfs.replica.enforced is gpfs in gpfs-site.xml.
  • The dfs.replication value must be less than or equal to the number of DataNodes.
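
As a minimal sketch of how these notes could be checked, the following Python example reads a property from Hadoop-style configuration XML and compares dfs.replication with a DataNode count. The function names, the sample XML text, and the DataNode count are assumptions made for illustration. The sketch also assumes that gpfs-site.xml uses the same property layout as hdfs-site.xml.

```python
import xml.etree.ElementTree as ET


def read_property(xml_text, name, default=None):
    """Return the value of a named property from Hadoop-style XML, or default."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == name:
            return prop.findtext("value", "").strip()
    return default


def replication_fits(dfs_replication, datanode_count):
    """True if dfs.replication does not exceed the number of DataNodes."""
    return dfs_replication <= datanode_count


# Illustrative content only; a real hdfs-site.xml contains many more properties.
SAMPLE_HDFS_SITE = """<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>"""


if __name__ == "__main__":
    # Fall back to the documented default of 3 when the property is absent.
    replication = int(read_property(SAMPLE_HDFS_SITE, "dfs.replication", default="3"))
    # The DataNode count is supplied by hand in this sketch.
    print(replication_fits(replication, datanode_count=2))
```

In this sample, a dfs.replication value of 3 with only two DataNodes does not satisfy the last note, so either the value must be lowered or more DataNodes must be added.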