What to do if you see degraded performance over NSD protocol

This topic describes the issues relating to degraded performance over NSD protocol.

Degraded compared to what? Performance can be called degraded only relative to a baseline that was obtained with a repeatable test. "It is slow" is not a measurable metric. You must have a baseline to compare to.

Several tools can be used to create such a baseline. The product includes nsdperf, but you can also choose other available tools such as ior, iozone, or bonnie++.
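
As a minimal sketch of turning "slow" into a number, the following shell functions compare a measured throughput with a baseline value. The function names and the 10 percent default threshold are illustrative assumptions and are not part of the product; both values must come from the same repeatable test and use the same unit.

```shell
#!/bin/bash
# Hypothetical helpers: compare a measured throughput against a baseline.
# Both values must use the same unit (for example MB/s).

# Print the deviation from the baseline in percent (negative means slower).
deviation_pct() {
    local baseline="$1" measured="$2"
    # awk provides the floating-point arithmetic that bash lacks.
    awk -v b="$baseline" -v m="$measured" 'BEGIN {
        if (b <= 0) { print "baseline must be positive" > "/dev/stderr"; exit 1 }
        printf "%.1f\n", (m - b) / b * 100
    }'
}

# Succeed (exit 0) when the measured value is more than <threshold> percent
# below the baseline. The default threshold of 10 is an arbitrary assumption.
is_degraded() {
    local baseline="$1" measured="$2" threshold="${3:-10}"
    awk -v b="$baseline" -v m="$measured" -v t="$threshold" 'BEGIN {
        if (b <= 0) exit 2
        exit !((b - m) / b * 100 > t)
    }'
}
```

For example, `deviation_pct 1000 800` prints `-20.0`, and `is_degraded 1000 800` succeeds because the result is more than 10 percent below the baseline.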

Things to check:
  • First, check the network end to end.
  • Review any changes that were made to either the clients or the servers (for example, sysctl settings or software updates).
  • Check OS resources on the client system (CPU, memory, swap in and out).
  • Check OS resources on the server system.
  • Look for mmhealth events.
  • Look for SMART events (if applicable).
  • Restart the client.
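
Some of the checks in this list can be collected in one pass. The following sketch is only an illustration: the `check` and `collect_checks` function names are hypothetical, and it assumes that `vmstat` is installed and that `mmhealth` is in the PATH (it is normally found in /usr/lpp/mmfs/bin). Commands that are not found are reported as skipped instead of causing an error.

```shell
#!/bin/bash
# Run a check only if its command is installed; otherwise note that it was skipped.
check() {
    local label="$1"; shift
    echo "== $label"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || echo "(command exited with status $?)"
    else
        echo "(skipped: $1 not found)"
    fi
}

# Hypothetical wrapper that gathers OS resource usage and health information.
collect_checks() {
    check "CPU, memory, swap in and out" vmstat 1 3
    check "Health state"                 mmhealth node show
    check "Health event log"             mmhealth node eventlog
}
```

Running `collect_checks > checks.txt` on both the client and the server gives you a record that you can compare with a later run.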
If you still see degraded performance compared to your baseline with the repeatable test, it is time to gather some information and contact IBM®, as follows:
  • Generate an IBM Storage Scale snap on the IBM Storage Scale Erasure Code Edition cluster.
  • Generate an IBM Storage Scale snap on the client cluster.

You can contact IBM support as soon as you have these snaps. If you suspect issues at the disk level, use the tools that are provided by the disk vendor. In addition, you can gather the following information and attach it to the IBM case.

ICT (intercompletion time) data is a full I/O trace. For each pdisk I/O request, it records the size, seek distance, LBA, queue depth at the time of completion, overall response time of the I/O, and the completion time of the I/O relative to the completion of the previous I/O or to the start of this I/O, whichever is later. Things to look for include the distribution of the ICT times, how the response time compares to the ICT time, and whether anomalies are specific to hardware domains or to particular ranges of time. This data can help IBM support to determine many different types of issues.
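
To illustrate what looking at a distribution means, the following sketch summarizes a list of time values as count, minimum, median, 99th percentile, and maximum. It assumes that the values were already extracted into plain text with one number per line. The ICT log format is not described in this topic, so the extraction step is not shown and the `summarize_times` function name is hypothetical.

```shell
#!/bin/bash
# Summarize numbers read from standard input (one value per line).
# Assumption: the values were already extracted from the trace data.
summarize_times() {
    sort -n | awk '
        { v[NR] = $1 }
        END {
            if (NR == 0) { print "no data"; exit 1 }
            # Nearest-rank percentiles on the sorted values: rank = ceil(P/100 * N).
            p50 = int((NR * 50 + 99) / 100)
            p99 = int((NR * 99 + 99) / 100)
            printf "count=%d min=%s p50=%s p99=%s max=%s\n", NR, v[1], v[p50], v[p99], v[NR]
        }'
}
```

A large gap between the median and the 99th percentile suggests that a small number of requests are much slower than the rest, which is the kind of anomaly that is worth correlating with a hardware domain or a time range.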

When you contact IBM support, compile the following data in addition to your baseline and the results that differ from the baseline. Also, include an overview of the environment, and the tools and their versions that were used to create the baseline:
  • Gather ICT debug data:
    • Create a directory to hold the debug data. You can use NFS or a separate disk, because the trace can generate a fair amount of data. In the following example, /tmp/mmfs/ict is used:
      # mkdir /tmp/mmfs/ict
    • Enable the gathering of ICT data on the IBM Storage Scale Erasure Code Edition node:
      # mmchconfig nsdRAIDICTLogDir=/tmp/mmfs/ict,nsdRAIDDetailedICTLogging=all -N NODE -i
      
    • After you re-create the performance degradation against the baseline, set the logging back to the default and tar the information to be sent to IBM:
      # mmchconfig nsdRAIDICTLogDir=default,nsdRAIDDetailedICTLogging=default -N NODE -i
      
      # tar -czf ict.tgz -C /tmp/mmfs ict
    • Attach the compressed file to the IBM case.
  • Check for an unbalanced vdisk partition distribution:
    • Save the output of the following command from the IBM Storage Scale Erasure Code Edition nodes to a text file and attach it to the IBM case.
      # /usr/lpp/mmfs/bin/mmfsadm test vdisk vdDist 1
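
The ICT capture steps in this topic can be wrapped in a small script so that logging is switched on and off consistently. This is a sketch only: the `ICT_DIR`, `NODE`, and `DRY_RUN` variables and the function names are illustrative and are not product settings. The mmchconfig attributes and the tar invocation are the same ones that are used in the steps above. By default the script only prints the commands; set `DRY_RUN=0` to run them.

```shell
#!/bin/bash
# Illustrative variables, not product settings.
ICT_DIR="${ICT_DIR:-/tmp/mmfs/ict}"   # directory that receives the ICT data
NODE="${NODE:-NODE}"                  # replace with the Erasure Code Edition node name
DRY_RUN="${DRY_RUN:-1}"               # 1 = print the commands only, 0 = run them

# Print a command, and run it only when DRY_RUN is 0.
run() {
    echo "# $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Create the target directory and enable detailed ICT logging.
ict_start() {
    run mkdir -p "$ICT_DIR"
    run mmchconfig "nsdRAIDICTLogDir=$ICT_DIR,nsdRAIDDetailedICTLogging=all" -N "$NODE" -i
}

# Restore the default logging settings and pack the collected data.
ict_stop() {
    run mmchconfig nsdRAIDICTLogDir=default,nsdRAIDDetailedICTLogging=default -N "$NODE" -i
    run tar -czf ict.tgz -C "$(dirname "$ICT_DIR")" "$(basename "$ICT_DIR")"
}
```

A typical sequence is to call `ict_start`, rerun the repeatable test until the degradation shows, and then call `ict_stop`. Keeping the capture window short limits the amount of data that is generated.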