Performance sizing

Many factors, such as the number of logical processors, memory size, network bandwidth, storage bandwidth, and the IBM Storage Scale® deployment mode, can affect performance sizing. This section gives a brief throughput sizing guide for the TeraGen and TeraSort workloads; query and transaction workload types (HBase, Hive) depend on too many factors for general sizing rules to be given.

Sizing the throughput of an HDFS Transparency cluster can be done in two steps:
  1. Size the throughput of the IBM Storage Scale POSIX interface.
  2. Calculate the throughput of HDFS Transparency from that value.
To size the throughput of the IBM Storage Scale POSIX interface, use the open source IOR benchmark to measure the read and write throughput from the POSIX interface. If you cannot run the IOR benchmark, estimate the throughput of the IBM Storage Scale POSIX interface as follows:
  • For IBM Storage Scale System, get the throughput number from the IBM® product guide.
  • For IBM Storage Scale FPO:
    • If the network bandwidth per node is greater than (disk-number-per-node * disk-bandwidth), calculate the throughput as:

      ((total-disk-number * disk-bandwidth / replica-number) * 0.7)

    • If the network bandwidth per node is smaller than (disk-number-per-node * disk-bandwidth), calculate the throughput as:

      ((network-bandwidth-per-node * node-number) * 0.7)

It is usually recommended to use SSDs for metadata so that metadata operations do not become the bottleneck in IBM Storage Scale FPO. Under this condition, the HDFS Transparency interface yields approximately 70% to 80% of the POSIX interface throughput. The benchmark throughput is also affected by the number of Hadoop nodes and the Hadoop-level configuration settings.
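The following Python sketch combines the FPO estimation rules above with the 70% to 80% HDFS Transparency factor. The disk, network, and replica values in the example are hypothetical placeholders, not measured or recommended numbers.

def fpo_posix_throughput_estimate(disks_per_node, disk_bw_gBps, node_count,
                                  network_bw_per_node_gBps, replica_count):
    # Estimate IBM Storage Scale FPO POSIX throughput (GB/s) using the rules above.
    per_node_disk_bw = disks_per_node * disk_bw_gBps
    if network_bw_per_node_gBps > per_node_disk_bw:
        # Disk-limited case: total disk bandwidth divided by replicas, derated to 70%.
        return (disks_per_node * node_count * disk_bw_gBps / replica_count) * 0.7
    # Network-limited case: aggregate network bandwidth, derated to 70%.
    return (network_bw_per_node_gBps * node_count) * 0.7

def hdfs_transparency_throughput_estimate(posix_throughput_gBps):
    # HDFS Transparency yields roughly 70% to 80% of the POSIX throughput.
    return (posix_throughput_gBps * 0.7, posix_throughput_gBps * 0.8)

# Hypothetical example: 8 FPO nodes, 10 disks per node at 0.15 GB/s each,
# 1.25 GB/s (10 Gb/s) network per node, 3-way replication.
posix = fpo_posix_throughput_estimate(10, 0.15, 8, 1.25, 3)
low, high = hdfs_transparency_throughput_estimate(posix)
print(f"POSIX estimate: {posix:.1f} GB/s, HDFS Transparency: {low:.1f}-{high:.1f} GB/s")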

To size the number of Hadoop nodes:
  • For IBM Storage Scale System:

    Calculate the number of Hadoop nodes by using the IBM Storage Scale System official throughput value and the client network bandwidth.

    For example, if you use an IBM Storage Scale System GL4s with a throughput of 36 GB/s from the IBM product guide, and each client has 10 Gb/s of network bandwidth, you need 36 GB/s / ((10 Gb/s / 8) * 0.8) ≈ 36 clients to drive that throughput.

    OR

    Calculate based only on the network bandwidth of the IBM Storage Scale System configuration and the client network adapter throughput.

    For example, with 100 Gb of IBM Storage Scale System network bandwidth and 10 Gb client network adapters, you need 100 Gb / 10 Gb = 10 clients (see the sketch after this list).

  • For FPO:

    All Hadoop nodes should be IBM Storage Scale FPO nodes.
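The two client-count calculations above can be expressed as a short Python sketch. The GL4s throughput of 36 GB/s, the 0.8 client efficiency factor, and the 100 Gb / 10 Gb network figures are the example values from this section, not general recommendations.

import math

def clients_for_storage_throughput(storage_throughput_gBps, client_network_gbps, efficiency=0.8):
    # Clients needed to drive the storage system's published throughput.
    # client_network_gbps is in gigabits per second; divide by 8 for GB/s
    # and derate by the assumed efficiency factor (0.8 in the example above).
    per_client_gBps = (client_network_gbps / 8.0) * efficiency
    return math.ceil(storage_throughput_gBps / per_client_gBps)

def clients_for_network_bandwidth(storage_network_gbps, client_network_gbps):
    # Clients needed to saturate the storage system's configured network bandwidth.
    return math.ceil(storage_network_gbps / client_network_gbps)

# IBM Storage Scale System GL4s example: 36 GB/s throughput, 10 Gb/s client network.
print(clients_for_storage_throughput(36, 10))    # -> 36
# Network-only example: 100 Gb storage network, 10 Gb client adapters.
print(clients_for_network_bandwidth(100, 10))    # -> 10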