IBM Storage Scale configuration and tuning
- Data replica and metadata replica:
While creating IBM Storage Scale file systems, ensure that the replication settings meet the data protection needs of the cluster.
For a production cluster built on internal disks, it is recommended to use a replication factor of 3 for both data and metadata. If you have local RAID5 or RAID6 adapters with battery protection, you can use a replication factor of 2 for the data.
When a file system is created, the default numbers of copies of data and metadata are defined by the -r (DefaultDataReplicas) and -m (DefaultMetadataReplicas) options of the mmcrfs command. Also, the values of -R (MaxDataReplicas) and -M (MaxMetadataReplicas) cannot be changed after the file system is created. Therefore, it is recommended to set both -R and -M to 3 so that you retain the flexibility to change the replication level in the future.
Note: The first instance (copy) of the data is referred to as the first replica. For example, setting DefaultDataReplicas=1 (by using the -r 1 option of mmcrfs) results in only a single copy of each piece of data, which is typically not desirable for a shared-nothing environment.
Query the number of replicas that are kept for any specific file system by running the command:
/usr/lpp/mmfs/bin/mmlsfs <filesystem_name> | egrep " -r| -m"
Change the level of data and metadata replication for any file system by running mmchfs with the same -r (DefaultDataReplicas) and -m (DefaultMetadataReplicas) flags to change the default replication options, and then running mmrestripefs (with the -R flag) to restripe the file system to match the new default replication options. For example:
/usr/lpp/mmfs/bin/mmchfs <filesystem_name> -r <NewDefaultDataReplicas> -m <NewDefaultMetadataReplicas>
/usr/lpp/mmfs/bin/mmrestripefs <filesystem_name> -R
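As a concrete, hedged illustration (the file system name fs1 is hypothetical, and raising the replication level assumes the file system was created with -R 3 and -M 3):
# Check the current default replication levels
/usr/lpp/mmfs/bin/mmlsfs fs1 | egrep " -r| -m"
# Raise the default data and metadata replication to 3
/usr/lpp/mmfs/bin/mmchfs fs1 -r 3 -m 3
# Rewrite existing file data and metadata to match the new replication settings
/usr/lpp/mmfs/bin/mmrestripefs fs1 -R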
- Additional considerations for the file system: When you create the file system, consider tuning the following /usr/lpp/mmfs/bin/mmcrfs parameters based on the characteristics of your applications (a sample invocation follows this list):
- -L
- By default, the value is 4 MB for the file system log file. It is a good idea to create any file system with at least a 16 MB log file (-L 16M) or, if your application is sensitive to metadata operations, with at least a 32 MB log file (-L 32M).
- -E
- By default, the value is yes, which provides exact mtime. If your applications do not require exact mtime, you can change this value to no for better performance.
- -S
- The default value depends on the minimum release level of the cluster when the file system is created. If the minimum release level is 5.0.0 or greater, the default value is relatime. Otherwise, the default value is no, which causes the atime to be updated each time that the file is read. If your application does not depend on exact atime, yes or relatime provides better performance.
- --inode-limit
- If you plan for the file system to contain many files, it is a good idea to set the value as
large as possible to avoid getting errors that say "no inode space". You can estimate the value of
this parameter with the following
formula:
--inode-limit = (<metadata_disk_size> * <metadata_disk_number>)/(<inode_size> * DefaultMetadataReplicas)
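For illustration only, a hedged sketch of an mmcrfs invocation applying these options (the file system name gpfs1, the stanza file /tmp/nsd.stanza, and the disk sizes used in the inode estimate are hypothetical; -S relatime assumes a cluster minimum release level of 5.0.0 or later, per the -S description above):
# Inode limit estimate using the formula above, with assumed values:
#   metadata_disk_size = 800 GiB, metadata_disk_number = 10,
#   inode_size = 4 KiB, DefaultMetadataReplicas = 3
#   --inode-limit = (800 GiB * 10) / (4 KiB * 3), which is about 700 million inodes
/usr/lpp/mmfs/bin/mmcrfs gpfs1 -F /tmp/nsd.stanza -L 32M -E no -S relatime \
  -m 3 -M 3 -r 3 -R 3 --inode-limit 700M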
- Define the data and the metadata distribution across the NSD server nodes in the
cluster:
Ensure that clusters larger than four nodes are not defined with a single (dataAndMetadata) system storage pool.
For performance and RAS reasons, it is recommended that data and metadata are separated in some configurations (which means that not all the storage is defined to use a single dataAndMetadata system pool).
These guidelines focus on the RAS considerations that are related to the implications of losing metadata servers from the cluster. In IBM Storage Scale Shared Nothing configurations (which recommend setting the unmountOnDiskFail=meta option), a given file system is unmounted when the number of nodes experiencing metadata disk failures is equal to or greater than the value of the DefaultMetadataReplicas option defined for the file system (the -m option to the mmcrfs command, as described above). So, for a file system with the typically configured value DefaultMetadataReplicas=3, the file system unmounts when metadata disks in three separate locality group IDs fail (when a node fails, all the internal disks in that node are marked down).
Note: All the disks in the same file system on a given node must have the same locality group ID. The locality group ID refers to all three elements of the extended failure group topology vector (for example, the vector 2,1,3 could represent rack 2, rack position 1, node 3 in this portion of the rack).
To avoid file system unmounts associated with losing too many nodes serving metadata, it is recommended that the number of metadata servers be limited when possible. Also, metadata servers must be distributed evenly across the cluster to avoid the case of a single hardware failure (such as the loss of a frame/rack or network switch) leading to multiple metadata node failures.
Some suggestions for separation of data and metadata based on cluster size (total NSD nodes in the cluster, followed by suggestions for allocating data and metadata across NSDs):
1-5 nodes:
All nodes must have both data and metadata disks. Depending on the available disks, both the data and metadata can be stored on each disk in this configuration (in which case, the NSDs are all defined as dataAndMetadata) or the disks can be specifically allocated for data or metadata.
If the number of disks per node is 3 or less, define all the disks as dataAndMetadata.
If the number of disks per node is larger than 3, use a 1:3 ratio of metadataOnly disks to dataOnly disks if your applications are metadata I/O sensitive. If your applications are not metadata I/O sensitive, consider using 1 metadataOnly disk.
6-9 nodes:
5 nodes must serve metadata disks. Assign one node per virtual rack, where each node is one unique failure group. Among these nodes, select 5 nodes to carry metadata disks; the other nodes carry data-only disks.
For the number of metadata disks: if you are not considering IOPS for the metadata disks, you can select one disk on each of the 5 metadata nodes as a metadata NSD; the other disks on these 5 nodes are used as data disks. If you are considering IOPS for the metadata disks, you can use a 1:3 metadata:data ratio.
For example, if you have 8 nodes with 10 disks per node, you have 80 disks in total. With the 1:3 ratio, you can use 20 disks for metadata by selecting 4 disks per node on the 5 metadata nodes as metadata NSD disks. All other disks are configured as data NSDs.
10-19 nodes:
There are several possible layouts: for example, 2 nodes per virtual rack for a 10-node cluster; for a 20-node cluster, every 4 nodes or every 2 nodes per virtual rack; for a 15-node cluster, every 3 nodes per virtual rack.
You must keep at least 5 failure groups for metadata and data. This ensures that you have enough failure groups for data restripe when you have failures in 2 failure groups.
To keep it simple, it is suggested that every 2 nodes be defined as a virtual rack, with the first element of the extended failure group kept the same for nodes in the same virtual rack, and that every virtual rack have a node with metadata disks defined.
For example, for an 18-node cluster (node1 through node18), node1 and node2 are considered one virtual rack. You can select some disks from node1 as metadataOnly disks, and the other disks from node1 and all disks from node2 as dataOnly disks. Ensure that these nodes are in the same failure group (for example, all dataOnly disks from node1 use failure group 1,0,1 and all dataOnly disks from node2 use failure group 1,0,2); see the sample NSD stanzas after the notes below.
20 or more nodes:
Usually, it is recommended that the number of virtual racks be greater than 4 but less than 32, with each rack containing the same number of nodes. Each rack is defined as one unique failure group, so you have more than five failure groups and can tolerate failures in 2 failure groups for data restripe. Select one node from each rack to serve as a metadata node.
For example, for a 24-node cluster, you can split the cluster into 6 virtual racks with 4 nodes per rack. For a 21-node cluster, it is recommended to use 7 virtual racks with 3 nodes per rack. For node counts larger than 40, as a starting point, it is recommended that approximately every 10 nodes be defined as a virtual rack, with the first element of the extended failure group kept the same for nodes in the same virtual rack. As for metadata, every virtual rack should have one node with metadataOnly disks defined. If you have more than 10 racks, you can select only 5-10 virtual racks to be configured with metadata disks.
As for how many disks must be configured as metadataOnly disks on the node selected for metadataOnly disks, this depends on the exact disk configuration and workloads. For example, if you configure one SSD per virtual rack, defining the SSD from each virtual rack as a metadataOnly disk works well for most workloads.
Note:
- If you are not considering the IOPS requirements of metadata I/O operations, usually 5% of the total disk capacity in the file system must be kept for metadata. If you can predict how many files the file system will hold and the average file size, the required metadata space can be estimated roughly; for example, 100 million files with a 4 KB inode size and 3 metadata replicas need on the order of 1.2 TB for inodes alone, before directories and indirect blocks are counted.
- In a shared-nothing framework, it is recommended that all nodes have a similar number of disks and similar disk capacities. Otherwise, nodes with fewer or smaller disks can run out of space and become hot disks.
- As for the number of nodes considered as one virtual rack, it is recommended to keep the number of nodes the same in each virtual rack.
- It is always recommended to configure SSD or other fast disks as metadataOnly disks. This speeds up some maintenance operations, such as mmrestripefs, mmdeldisk, and mmchdisk.
- If you are not sure about failure group definition, contact scale@us.ibm.com
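To make the failure group layout concrete, the following is a minimal sketch of NSD stanzas for part of the 18-node example above (the NSD names, device paths, server names, and the pool name datapool are hypothetical; check the stanza syntax against the mmcrnsd documentation for your release):
# Virtual rack 1 = node1 + node2; node1 carries the metadataOnly disk.
# The extended failure group is the topology vector rack,position,node; all disks on one node share the same vector.
%nsd: nsd=node1_meta1 device=/dev/sdb servers=node1 usage=metadataOnly failureGroup=1,0,1 pool=system
%nsd: nsd=node1_data1 device=/dev/sdc servers=node1 usage=dataOnly failureGroup=1,0,1 pool=datapool
%nsd: nsd=node2_data1 device=/dev/sdb servers=node2 usage=dataOnly failureGroup=1,0,2 pool=datapool
%nsd: nsd=node2_data2 device=/dev/sdc servers=node2 usage=dataOnly failureGroup=1,0,2 pool=datapool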
- When running a shared-nothing cluster, choose a failure group mapping scheme suited to IBM Storage Scale.
Defining more than 32 failure group IDs for a specific file system slows down many concurrent disk space allocation operations, such as the restripe operation mmrestripefs -b.
On FPO-enabled clusters, defining more than 32 locality groups per failure group ID slows down restripe operations, such as mmrestripefs -r.
To define an FPO-enabled storage pool in an IBM Storage Scale cluster, set the pool option allowWriteAffinity to yes. This option can be checked by running the mmlspool <fs-name> all -L command. In FPO-enabled clusters, currently all disks on the same node must be assigned to the same locality group ID (a three-integer vector x,y,z), which also defines a failure group ID <x,y>. It is recommended that failure group IDs refer to sets of common resources, with nodes sharing a failure group ID having a common point of failure, such as a shared rack or a network switch.
- Do not configure allowWriteAffinity=yes for a metadataOnly system pool.
For a metadataOnly storage pool (not a dataAndMetadata pool), set allowWriteAffinity to no. Setting allowWriteAffinity to yes for a metadataOnly storage pool slows down inode allocation for the pool.
- Any FPO-enabled storage pool (any pool with allowWriteAffinity=yes defined) must define
blockGroupFactor to be larger than 1 (regardless of the value of writeAffinityDepth).
When allowWriteAffinity is enabled, more RPC (Remote Procedure Call) activity might occur compared to the case of setting allowWriteAffinity=no.
To reduce some of the RPC overhead associated with enabling write affinity, it is recommended that blockGroupFactor be set to greater than 1 for pools with allowWriteAffinity=yes. Starting point recommendations are blockGroupFactor=2 (for general workloads), blockGroupFactor=10 (for database workloads), and blockGroupFactor=128 (for Hadoop workloads).
- Tune the block size for storage pools defined to IBM Storage Scale.
For storage pools containing both data and metadata (pools defined as dataAndMetadata), a block size of 1M is recommended.
For storage pools containing only data (pools defined as dataOnly), a block size of 2M is recommended.
For storage pools containing only metadata (pools defined as metadataOnly), a block size of 256K is recommended.
The following sample pool stanzas (used when creating NSDs via the mmcrnsd command) are based on the tuning suggestions from steps 4-7:
# for a metadata-only system pool:
%pool: pool=system blockSize=256K layoutMap=cluster allowWriteAffinity=no
# for a data and metadata system pool:
%pool: pool=system blockSize=1M layoutMap=cluster allowWriteAffinity=yes writeAffinityDepth=1 blockGroupFactor=2
# for a data-only pool:
%pool: pool=datapool blockSize=2M layoutMap=cluster allowWriteAffinity=yes writeAffinityDepth=1 blockGroupFactor=10
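As a hedged sketch of how the stanzas are consumed (the stanza file path /tmp/nsd.stanza, NSD names, device paths, and server names are hypothetical), the %pool and %nsd stanzas can be combined in one stanza file; mmcrnsd uses the %nsd entries, and the same file can then be passed to mmcrfs, which also reads the %pool entries:
# Contents of the hypothetical stanza file /tmp/nsd.stanza (pool stanzas from above plus matching %nsd stanzas;
# in practice the file lists NSDs from all nodes so that enough failure groups exist for the chosen replication):
#   %pool: pool=system blockSize=256K layoutMap=cluster allowWriteAffinity=no
#   %pool: pool=datapool blockSize=2M layoutMap=cluster allowWriteAffinity=yes writeAffinityDepth=1 blockGroupFactor=10
#   %nsd: nsd=node1_meta1 device=/dev/sdb servers=node1 usage=metadataOnly failureGroup=1,0,1 pool=system
#   %nsd: nsd=node1_data1 device=/dev/sdc servers=node1 usage=dataOnly failureGroup=1,0,1 pool=datapool
# Create the NSDs from the stanza file
/usr/lpp/mmfs/bin/mmcrnsd -F /tmp/nsd.stanza
# The same stanza file is then passed to mmcrfs -F, as in the earlier mmcrfs example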
- Tune the size of the IBM Storage Scale pagepool attribute by setting the pool size on each node to be between 10% and 25% of the real memory installed.
Note: The Linux® buffer pool cache is not used for IBM Storage Scale file systems. The recommended size of the pagepool attribute depends on the workload and the expectations for improvements due to caching. A good starting point recommendation is somewhere between 10% and 25% of real memory. If machines with different amounts of memory are installed, use the -N option to mmchconfig to set different values according to the memory installed on the machines in the cluster. Though these are good starting points for performance recommendations, some customers use relatively small page pools, such as between 2-3% of real memory installed, particularly for machines with more than 256 GB installed.
The following example shows how to set a page pool size equal to 10% of the memory (this assumes all the nodes have the same amount of memory installed):
TOTAL_MEM=$(cat /proc/meminfo | grep MemTotal | tr -d "[:alpha:]" | tr -d "[:punct:]" | tr -d "[:blank:]")
PERCENT_OF_MEM=10
PAGE_POOL=$((${TOTAL_MEM}*${PERCENT_OF_MEM}/(100*1024)))
mmchconfig pagepool=${PAGE_POOL}M -i
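For clusters with mixed memory sizes, a hedged sketch of using the -N option (the node class name bigmem_nodes and the 16G value are hypothetical):
# Give nodes in the user-defined node class bigmem_nodes a larger pagepool, effective immediately
mmchconfig pagepool=16G -N bigmem_nodes -i
# Verify the per-node values
mmlsconfig pagepool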
- Change the following IBM Storage Scale configuration options and then restart IBM Storage Scale.
Note: For IBM Storage Scale 4.2.0.3 or 4.2.1 and later, the restart of IBM Storage Scale can be delayed until the next step, because tuning workerThreads requires a restart.
Set each configuration option individually:
mmchconfig readReplicaPolicy=local
mmchconfig unmountOnDiskFail=meta
mmchconfig restripeOnDiskFailure=yes
mmchconfig nsdThreadsPerQueue=10
mmchconfig nsdMinWorkerThreads=48
mmchconfig prefetchaggressivenesswrite=0
mmchconfig prefetchaggressivenessread=2
For versions of IBM Storage Scale earlier than 5.0.2, also set one of the following values:
mmchconfig maxStatCache=512
mmchconfig maxStatCache=0
In versions of IBM Storage Scale earlier than 5.0.2, the stat cache is not effective on the Linux platform unless the Local Read-Only Cache (LROC) is configured. For more information, see the description of the maxStatCache parameter in the topic mmchconfig command.
Set all the configuration options at once by using the mmchconfig command:
mmchconfig readReplicaPolicy=local,unmountOnDiskFail=meta,restripeOnDiskFailure=yes,nsdThreadsPerQueue=10,nsdMinWorkerThreads=48,prefetchaggressivenesswrite=0,prefetchaggressivenessread=2
For versions of IBM Storage Scale earlier than 5.0.2, also include one of the following expressions: maxStatCache=512 or maxStatCache=0.
The maxMBpS tuning option must be set according to the network bandwidth available to IBM Storage Scale. If you are using one 10 Gbps link for the IBM Storage Scale network traffic, the default value of 2048 is appropriate. Otherwise, scale the value of maxMBpS to about twice the network bandwidth available on a per-node basis.
For example, for two bonded 10 Gbps links an appropriate setting for maxMBpS is:mmchconfig maxMBpS=4000 # this example assumes a network bandwidth of about 2GB/s (or 2 bonded 10 Gbps links) available to Spectrum Scale
Note: In special user scenarios, such as an active-to-active disaster recovery deployment, restripeOnDiskFailure must be configured as no for an internal-disk cluster.
Some of these configuration options do not take effect until IBM Storage Scale is restarted, as sketched below.
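As a brief illustration (a sketch only; mmshutdown -a stops the daemon on all nodes, so schedule this for a maintenance window):
# Restart IBM Storage Scale cluster-wide so the changed options take effect
/usr/lpp/mmfs/bin/mmshutdown -a
/usr/lpp/mmfs/bin/mmstartup -a
# Confirm the active configuration values
/usr/lpp/mmfs/bin/mmlsconfig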
- Depending on the level of code installed, follow the tuning recommendation for Case A or Case B:
- Case A: If running IBM Storage Scale 4.2.0 PTF3, 4.2.1, or any higher level, either set workerThreads to 512, or try setting workerThreads to 8 * cores per node (both require a restart of IBM Storage Scale to take effect). For lower code levels, setting worker1Threads to 72 (with the -i, immediate, option to mmchconfig) does not require restarting IBM Storage Scale. A shell sketch for computing the 8 * cores value follows these cases.
mmchconfig workerThreads=512 # for Spectrum Scale 4.2.0 PTF3, 4.2.1, or any higher levels
or
mmchconfig workerThreads=<8 * cores per node> # for Spectrum Scale 4.2.0 PTF3, 4.2.1, or any higher levels
Change workerThreads to 512 (the default is 128) to enable additional thread tuning. This change requires that IBM Storage Scale be restarted to take effect.
Note: For IBM Storage Scale 4.2.0.3 or 4.2.1 or later, it is recommended that the following configuration parameters not be changed (setting workerThreads to 512, or to 8 * cores per node, auto-tunes these values): parallelWorkerThreads, logWrapThreads, logBufferCount, maxBackgroundDeletionThreads, maxBufferCleaners, maxFileCleaners, syncBackgroundThreads, syncWorkerThreads, sync1WorkerThreads, sync2WorkerThreads, maxInodeDeallocPrefetch, flushedDataTarget, flushedInodeTarget, maxAllocRegionsPerNode, maxGeneralThreads, worker3Threads, and prefetchThreads. After you enable auto-tuning by tuning the value of workerThreads, if you previously changed any of these settings (parallelWorkerThreads, logWrapThreads, and so on), you must restore them to their default values by running mmchconfig <tunable>=Default.
- Case B: For IBM Storage Scale 4.1.0.x, 4.1.1.x, 4.2.0.0, 4.2.0.1, and 4.2.0.2, the default values work for most scenarios. Generally, only worker1Threads tuning is required:
mmchconfig worker1Threads=72 -i # for Spectrum Scale 4.1.0.x, 4.1.1.x, 4.2.0.0, 4.2.0.1, 4.2.0.2
For IBM Storage Scale 4.1.0.x, 4.1.1.x, 4.2.0.0, 4.2.0.1, 4.2.0.2, worker1Threads=72 is a good starting point (the default is 48), though larger values have been used in database environments and other configurations that have many disks present.
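The following is a minimal sketch, for Case A code levels only, of computing and applying the 8-threads-per-core value (nproc reports online logical CPUs, which can overcount physical cores on SMT systems), and of restoring a previously changed auto-tuned parameter to its default:
# Compute 8 * (logical CPUs on this node) and apply it cluster-wide
CORES_PER_NODE=$(nproc)
/usr/lpp/mmfs/bin/mmchconfig workerThreads=$((8*CORES_PER_NODE))
# If any auto-tuned parameter was changed previously, restore it; for example:
/usr/lpp/mmfs/bin/mmchconfig parallelWorkerThreads=Default
# workerThreads changes take effect only after IBM Storage Scale is restarted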
- Customers running IBM
Storage Scale 4.1.0, 4.1.1, and
4.2.0 must change the default configuration of trace to run in overwrite mode instead of blocking
mode.
To avoid potential performance problems, customers running IBM Storage Scale 4.1.0, 4.1.1, and 4.2.0 must change the default IBM Storage Scale tracing mode from blocking mode to overwrite mode as follows:
/usr/lpp/mmfs/bin/mmtracectl --set --trace=def --tracedev-writemode=overwrite --tracedev-overwrite-buffer-size=500M # only for Spectrum Scale 4.1.0, 4.1.1, and 4.2.0
This assumes that 500 MB can be made available on each node for IBM Storage Scale trace buffers. If 500 MB is not available, set an appropriately smaller trace buffer size.
- Consider whether pipeline writing must be enabled.
By default, pipeline writing is disabled (enableRepWriteStream=0) and the data ingestion node writes the 2 or 3 replicas of the data to the target nodes over the network in parallel, which consumes additional network bandwidth. If pipeline writing is enabled, the data ingestion node writes only one replica over the network and the target node writes the additional replicas. Enabling pipeline writing (mmchconfig enableRepWriteStream=1 and restarting the IBM Storage Scale daemon on all nodes, as sketched below) can increase I/O write performance in the following two scenarios:
- Data is ingested from the IBM Storage Scale client and the network bandwidth from the data-ingesting client is limited.
- Data is written through a rack-to-rack switch with limited bandwidth. For example, with 30 nodes per rack and a 40 Gb rack-to-rack switch, when all the nodes are writing data over the rack-to-rack switch, each node gets only 40 Gb/30, which is approximately 1.33 Gb of average network bandwidth.
Otherwise, enableRepWriteStream must be kept as 0.
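A minimal sketch of enabling pipeline writing cluster-wide (the restart stops the daemon on all nodes, so plan a maintenance window):
# Enable pipeline writing, then restart the daemon on all nodes for it to take effect
/usr/lpp/mmfs/bin/mmchconfig enableRepWriteStream=1
/usr/lpp/mmfs/bin/mmshutdown -a
/usr/lpp/mmfs/bin/mmstartup -a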