Operating system configuration and tuning

Perform the following steps to configure and tune a Linux system:

  1. deadline disk scheduler

    Change all the disks defined to IBM Spectrum Scale to use the 'deadline' I/O scheduler ('cfq' is the default on some distributions, such as RHEL 6).

    For each block device defined to IBM Spectrum Scale, run the following command to enable the deadline scheduler:
    echo "deadline" > /sys/block/<diskname>/queue/scheduler

    Changes made in this manner (echoing changes to sysfs) do not persist over reboots. To make these changes permanent, enable the changes in a script that runs on every boot or (generally preferred) create a udev rule.
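
    If you choose the boot-script approach, one option is to append the echo command for each disk to a script that runs at boot. The following is a minimal sketch only, assuming a RHEL-style system where /etc/rc.d/rc.local exists (on RHEL 7 the file must also be made executable); sdb is a hypothetical disk name:
    # Sketch (assumption: RHEL-style rc.local; sdb is a placeholder device name)
    echo 'echo deadline > /sys/block/sdb/queue/scheduler' >> /etc/rc.d/rc.local
    chmod +x /etc/rc.d/rc.local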

    The following sample script sets the deadline scheduler for all disks in the cluster that are defined to IBM Spectrum Scale (this example must be run on a node that has passwordless ssh access to all the other nodes):
    #!/bin/bash
    # set the deadline scheduler on every NSD device, on every node
    /usr/lpp/mmfs/bin/mmlsnsd -X | /bin/awk ' { print $3 " " $5 } ' | /bin/grep dev |
    while read device node ; do
        device=$(echo $device | /bin/sed 's/\/dev\///')
        /usr/lpp/mmfs/bin/mmdsh -N $node "echo deadline > /sys/block/$device/queue/scheduler"
    done
    As previously stated, changes made by echoing to sysfs files (as in this example script) take effect immediately when the script runs, but do not persist over reboots. One approach to making such changes permanent is to enable a udev rule, such as the following rule, which forces all block devices to use the deadline scheduler after rebooting. To enable this rule, create it as the file /etc/udev/rules.d/99-hdd.rules:
    ACTION=="add|change", SUBSYSTEM=="block", ATTR{device/model}=="*",
    ATTR{queue/scheduler}="deadline"
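
    After the scheduler has been set (either by echoing to sysfs or by the udev rule after a reboot), you can confirm which scheduler is active on a given device by reading the same sysfs file; the scheduler currently in use is shown in square brackets:
    cat /sys/block/<diskname>/queue/scheduler
    # sample output; the active scheduler appears in brackets, e.g. noop [deadline] cfq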

    The next step includes an example of how to create udev rules that apply only to the devices used by IBM Spectrum Scale.

  2. disk IO parameter change
    To further tune the block devices used by IBM Spectrum Scale, run the following commands from the console on each node:
    echo 16384 > /sys/block/<device>/queue/max_sectors_kb
    echo 256 > /sys/block/<device>/queue/nr_requests
    echo 32 > /sys/block/<device>/device/queue_depth
    These block device tuning settings are generally large enough for SAS/SATA disks. For /sys/block/<device>/queue/max_sectors_kb, the value chosen must be less than or equal to /sys/block/<device>/queue/max_hw_sectors_kb. Many SAS/SATA devices allow max_sectors_kb to be set to 16384, but not all devices accept this value.

    If your device does not accept the recommended values, try smaller values, halving the recommendation until the setting succeeds. For example, if setting max_sectors_kb to 16384 results in a write error:
    echo 16384 > /sys/block/sdd/queue/max_sectors_kb
    -bash: echo: write error: Invalid argument
    Try setting max_sectors_kb to 8192:
    echo 8192 > /sys/block/sdd/queue/max_sectors_kb
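
    The halving procedure can also be scripted. The following is a minimal sketch (sdd is the same example device name used above) that starts at 16384 and halves the value until the kernel accepts it or a floor of 256 is reached; keep in mind the accepted value can never exceed max_hw_sectors_kb:
    #!/bin/bash
    # Sketch: try max_sectors_kb=16384 and halve until the write succeeds (sdd is a placeholder device)
    val=16384
    while [ $val -ge 256 ]; do
        if echo $val > /sys/block/sdd/queue/max_sectors_kb 2>/dev/null; then
            echo "max_sectors_kb set to $val"
            break
        fi
        val=$((val / 2))
    done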

    If your disk is not SAS/SATA, check the disk specification from the disk vendor for tuning recommendations.

    Note: If the max_sectors_kb of your disks is small (for example, 256 or 512) and the kernel does not let you raise it (that is, you get an "invalid argument" error as in the example above), disk performance might be impacted, because IBM Spectrum Scale IO requests might be split into several smaller requests to satisfy the max_sectors_kb limit at the block device level.

    As discussed in the step 1 tuning recommendations, any tuning done by echoing to sysfs files is lost when a node reboots. To make such tuning permanent, either create appropriate udev rules or place these commands in a boot script that runs on each reboot.

    As udev rules are the preferred way of accomplishing this kind of block device tuning, the following is an example of a generic udev rule that enables the block device tuning recommended in steps 1 and 2 for all block devices. The rule can be enabled by creating it as the file /etc/udev/rules.d/100-hdd.rules:
    ACTION=="add|change", SUBSYSTEM=="block", ATTR{device/model}=="*",
    ATTR{queue/nr_requests}="256", ATTR{device/queue_depth}="32",
    ATTR{queue/max_sectors_kb}="16384"
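
    udev rules normally take effect at the next reboot or on the next add/change event for a device. To apply a newly created rule without rebooting, you can ask udev to reload its rules and replay events for block devices (standard udevadm commands; the path to udevadm varies by distribution):
    udevadm control --reload-rules
    udevadm trigger --subsystem-match=block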
    If it is not desirable to tune all block devices with the same settings, multiple rules can be created with specific tuning for the appropriate devices. To create such device-specific rules, use the KERNEL match key to limit which devices a rule applies to (for example, KERNEL=="sdb"). The following example script creates udev rules that tune only the block devices used by IBM Spectrum Scale:
    #!/bin/bash
    # clean up any existing /etc/udev/rules.d/100-hdd.rules files
    /usr/lpp/mmfs/bin/mmdsh -N All "rm -f /etc/udev/rules.d/100-hdd.rules"
    # collect all disks in use by GPFS and create udev rules one disk at a time
    /usr/lpp/mmfs/bin/mmlsnsd -X | /bin/awk ' { print $3 " " $5 } ' | /bin/grep dev |
    while read device node ; do
        device=$(echo $device | /bin/sed 's/\/dev\///')
        echo $device $node
        echo "ACTION==\"add|change\", SUBSYSTEM==\"block\", \
    KERNEL==\"$device\", ATTR{device/model}==\"*\", \
    ATTR{queue/nr_requests}=\"256\", \
    ATTR{device/queue_depth}=\"32\", ATTR{queue/max_sectors_kb}=\"16384\"" > /tmp/100-hdd.rules
        /usr/bin/scp /tmp/100-hdd.rules $node:/tmp/100-hdd.rules
        /usr/lpp/mmfs/bin/mmdsh -N $node "cat /tmp/100-hdd.rules >> /etc/udev/rules.d/100-hdd.rules"
    done
    Note: The previous example script must be run from a node that has passwordless ssh access to all nodes in the cluster. It creates udev rules that apply the recommended block device tuning on future reboots. To put the recommended tuning values from steps 1 and 2 into effect immediately, the following example script can be used:
    #!/bin/bash
    # apply the recommended scheduler and queue settings immediately on every NSD device, on every node
    /usr/lpp/mmfs/bin/mmlsnsd -X | /bin/awk ' { print $3 " " $5 } ' | /bin/grep dev |
    while read device node ; do
        device=$(echo $device | /bin/sed 's/\/dev\///')
        /usr/lpp/mmfs/bin/mmdsh -N $node "echo deadline > /sys/block/$device/queue/scheduler"
        /usr/lpp/mmfs/bin/mmdsh -N $node "echo 16384 > /sys/block/$device/queue/max_sectors_kb"
        /usr/lpp/mmfs/bin/mmdsh -N $node "echo 256 > /sys/block/$device/queue/nr_requests"
        /usr/lpp/mmfs/bin/mmdsh -N $node "echo 32 > /sys/block/$device/device/queue_depth"
    done
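
    After applying the tuning (immediately, or via udev rules after a reboot), the settings can be verified across the cluster with the same mmlsnsd/mmdsh pattern used above. This is a sketch only:
    #!/bin/bash
    # Sketch: print the current scheduler and queue settings for every NSD device on every node
    /usr/lpp/mmfs/bin/mmlsnsd -X | /bin/awk ' { print $3 " " $5 } ' | /bin/grep dev |
    while read device node ; do
        device=$(echo $device | /bin/sed 's/\/dev\///')
        for f in queue/scheduler queue/max_sectors_kb queue/nr_requests device/queue_depth ; do
            /usr/lpp/mmfs/bin/mmdsh -N $node "cat /sys/block/$device/$f"
        done
    done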
  3. disk cache checking

    On clusters running Hadoop/Spark workloads, disks used by IBM Spectrum Scale must have physical disk write caching disabled, regardless of whether RAID adapters are used for these disks.

    When running other (non-Hadoop/Spark) workloads, write caching on the RAID adapters can be enabled if the local RAID adapter cache is battery protected, but the write cache on the physical disks must not be enabled.

    Check the documentation for your RAID adapter to determine how to enable or disable the RAID adapter write cache, as well as the physical disk write cache.

    For common SAS/SATA disks without a RAID adapter, run the following command to check whether physical disk write caching is enabled on the disk in question:
    sdparm --long /dev/<diskname> | grep WCE

    If WCE is 1, it means the disk write cache is on.

    The following commands can be used to turn on/off physical disk write caching:
    # turn on physical disk cache
    sdparm -S -s WCE=1 /dev/<diskname>
    # turn off physical disk cache
    sdparm -S -s WCE=0 /dev/<diskname>
    Note: The physical disk read cache must be enabled regardless of the kind of disk used. For SAS/SATA disks without RAID adapters, run the following command to check whether the disk read cache is enabled:
    sdparm --long /dev/<diskname> | grep RCD

    If the value of RCD (Read Cache Disable) is 0, the physical disk read cache is enabled. On Linux, usually the physical disk read cache is enabled by default.
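
    As a convenience, the WCE and RCD checks can be run against every NSD disk in the cluster with the same pattern used in step 2. This is a sketch only, and assumes sdparm is installed on all nodes and passwordless ssh access is in place:
    #!/bin/bash
    # Sketch: report write-cache (WCE) and read-cache-disable (RCD) state for every NSD device
    # (assumes sdparm is installed on all nodes)
    /usr/lpp/mmfs/bin/mmlsnsd -X | /bin/awk ' { print $3 " " $5 } ' | /bin/grep dev |
    while read device node ; do
        /usr/lpp/mmfs/bin/mmdsh -N $node "sdparm --long $device | grep -E 'WCE|RCD'"
    done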

  4. Tune vm.min_free_kbytes to avoid potential memory exhaustion problems.

    When vm.min_free_kbytes is set to its default value, some configurations can encounter memory exhaustion symptoms even though free memory should still be available. It is recommended that vm.min_free_kbytes be set to between 5 and 6 percent of the total amount of physical memory, but no more than 2 GB should be allocated for this reserve memory.

    To tune this value, add the following into /etc/sysctl.conf and then run 'sysctl -p' on Red Hat or SuSE:
    vm.min_free_kbytes = <your-min-free-KBmemory>
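
    As an illustration of the sizing rule, the following sketch computes roughly 5 percent of physical memory from /proc/meminfo and caps the result at 2 GB (2097152 KB):
    #!/bin/bash
    # Sketch: suggest a vm.min_free_kbytes value of about 5% of RAM, capped at 2 GB
    total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
    reserve_kb=$(( total_kb * 5 / 100 ))
    max_kb=2097152    # 2 GB cap
    if [ $reserve_kb -gt $max_kb ]; then
        reserve_kb=$max_kb
    fi
    echo "vm.min_free_kbytes = $reserve_kb"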
  5. OS network tuning
    If your network adapter is a 10Gb Ethernet adapter, you can add the following to /etc/sysctl.conf and then run /sbin/sysctl -p /etc/sysctl.conf on each node:
    sunrpc.udp_slot_table_entries = 128
    sunrpc.tcp_slot_table_entries = 128
    net.core.rmem_max=4194304
    net.core.wmem_max=4194304
    net.core.rmem_default=4194304
    net.core.wmem_default=4194304
    net.core.netdev_max_backlog = 300000
    net.core.somaxconn = 10000
    net.ipv4.tcp_rmem = 4096 4224000 16777216
    net.ipv4.tcp_wmem = 4096 4224000 16777216
    net.core.optmem_max=4194304
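
    After running sysctl -p, you can spot-check that a setting took effect, for example:
    /sbin/sysctl net.core.rmem_max
    # expected output: net.core.rmem_max = 4194304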

    If your cluster is based on InfiniBand adapters, see the guide from your InfiniBand adapter vendor.

    If you bond two adapters and configure xmit_hash_policy=layer3+4 with bond mode 4 (802.3ad, the recommended bond mode), IBM Spectrum Scale establishes only one TCP/IP connection between any pair of nodes in the cluster for data transfer. As a result, the traffic between two nodes might flow over only one physical link if the network traffic is not heavy.

    If your cluster is not large (for example, one physical switch is enough for all cluster nodes), you could try bonding mode 6 (balance-alb, which requires no special switch support). This might give better network bandwidth than bonding mode 4 (802.3ad, which requires switch support). See the Linux bonding: 802.3ad (LACP) vs. balance-alb mode link for the advantages and disadvantages of 802.3ad versus balance-alb bonding.
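
    To confirm which bonding mode and transmit hash policy are currently in effect, the kernel exposes the bond state under /proc (bond0 is a placeholder interface name):
    # bond0 is a placeholder; use your bonded interface name
    cat /proc/net/bonding/bond0
    # look for the "Bonding Mode:" and "Transmit Hash Policy:" lines in the output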