Ingesting data into IBM Spectrum Scale clusters

MapReduce tasks perform best when input data is evenly distributed across cluster nodes. You can use the following approaches, alone or in combination, to ingest data for the first time and on an ongoing basis:

  • Import data through a diskless IBM Spectrum Scale node. Because a diskless node has no local disks for write affinity to favor, the data it writes is distributed evenly across all failure groups and across all nodes within a failure group.
  • If you have a large set of data to copy, it might help to share the ingest workload across all cluster nodes. Use a write-affinity depth of 0, and copy the data in parallel from as many cluster nodes with storage as possible.
  • A write-affinity depth of 0 causes each node to distribute the data it writes across as many nodes as possible. IBM Spectrum Scale policies can enforce write-affinity depth settings based on fileset name, file name, or other attributes.
  • Another way to distribute data on ingest is to use a write-affinity depth failure group (WADFG) to control the placement of file replicas. A WADFG setting of “*,*,*” ensures that file chunks are distributed evenly across all nodes. A placement policy can apply this attribute selectively to the data set being ingested, as shown in the sketch after this list.
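
The following sketch shows one way to apply these settings. The mount point /bigdatafs, the device name bigdatafs, the file paths, and the policy file name are assumptions for illustration, and the exact syntax of write-affinity clauses in placement policy rules varies by release, so verify it against the policy reference for your version:

    # Set a write-affinity depth of 0 and a WADFG of "*,*,*" on a file so
    # that its chunks and replicas are spread across all failure groups:
    mmchattr --write-affinity-depth 0 \
             --write-affinity-failure-group "*,*,*" \
             /bigdatafs/ingest/part-00000

    # Inspect the file's attributes (output details vary by release):
    mmlsattr -L /bigdatafs/ingest/part-00000

    # To apply such settings automatically as files are created, install a
    # file-placement policy (with rules keyed to fileset name, file name,
    # or other attributes) by using mmchpolicy:
    mmchpolicy bigdatafs /tmp/placement.pol

Applying the attributes through a placement policy rather than per file keeps ingest scripts simple and places every file in the target data set consistently.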

Even after you use these ingest techniques, the cluster can become unbalanced as nodes and disks are added or removed. You can check whether the data in the cluster is balanced by using the mmdf command. If the data disks in different nodes show uneven usage, rebalance the cluster by running the mmrestripefs -b command. Keep in mind that rebalancing generates additional I/O activity in the cluster, so plan to run it at a time when the workload is light.
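
For example, the following sequence checks balance and then rebalances. The device name bigdatafs is an assumption; substitute your own file system name:

    # Show per-disk utilization; noticeably uneven free-space percentages
    # across the data disks of different nodes indicate an unbalanced cluster:
    mmdf bigdatafs

    # Rebalance file data across all disks. This generates extra I/O, so run
    # it during a quiet period; add -N to limit which nodes do the work:
    mmrestripefs bigdatafs -b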