Information to be collected before contacting the IBM Support Center

For effective communication with the IBM® Support Center to help with problem diagnosis, you need to collect certain information.

Information to be collected for all problems related to GPFS

Regardless of the problem encountered with GPFS™, the following data should be available when you contact the IBM Support Center:
  1. A description of the problem.
  2. Output of the failing application, command, and so forth.
  3. A tar file generated by the gpfs.snap command that contains data from the nodes in the cluster. In large clusters, the gpfs.snap command can collect data from certain nodes (for example, the affected nodes, NSD servers, or manager nodes) using the -N option.

    For more information about gathering data with gpfs.snap, see gpfs.snap command.

    If the gpfs.snap command cannot be run, collect these items:
    1. Any error log entries relating to the event:
      • On an AIX node, issue this command:
        errpt -a
      • On a Linux node, create a tar file of all the entries in the /var/log/messages file from all nodes in the cluster or the nodes that experienced the failure. For example, issue the following command to create a tar file that includes all nodes in the cluster:
        mmdsh -v -N all "cat /var/log/messages" > all.messages 
      • On a Windows node, use the Export List... dialog in the Event Viewer to save the event log to a file.
    2. A master GPFS log file that is merged and chronologically sorted for the date of the failure (see Creating a master GPFS log file).
    3. If the cluster was configured to store dumps, collect any internal GPFS dumps written to that directory relating to the time of the failure. The default directory is /tmp/mmfs.
    4. On a failing Linux node, gather the installed software packages and the versions of each package by issuing this command:
      rpm -qa
    5. On a failing AIX node, gather the name, most recent level, state, and description of all installed software packages by issuing this command:
      lslpp -l
    6. The file system attributes for each of the failing file systems. To display them, issue:
      mmlsfs Device
    7. The current configuration and state of the disks for each of the failing file systems. To display them, issue:
      mmlsdisk Device
    8. A copy of file /var/mmfs/gen/mmsdrfs from the primary cluster configuration server.
  4. For Linux on Z, collect the operating system data as described in the Linux on z Systems® Troubleshooting Guide.
  5. If you are experiencing one of the following problems, see the appropriate section before contacting the IBM Support Center:
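When gpfs.snap cannot be run, the manual fallback collection in step 3 can be sketched as follows for a Linux node. The scratch directory name is hypothetical, Device is a placeholder for your file system device name, and the mm* commands are shown as comments because they require a running GPFS cluster:

```shell
#!/bin/sh
# Sketch of the manual fallback data collection on a failing Linux node.
# OUT is a hypothetical scratch directory; "Device" is a placeholder.
OUT=/tmp/gpfs-collect.$$
mkdir -p "$OUT"
cp /var/log/messages "$OUT"/ 2>/dev/null || true       # error log entries
rpm -qa > "$OUT"/packages.txt 2>/dev/null || true      # installed packages
# mmlsfs Device   > "$OUT"/mmlsfs.txt                  # file system attributes
# mmlsdisk Device > "$OUT"/mmlsdisk.txt                # disk configuration and state
cp /var/mmfs/gen/mmsdrfs "$OUT"/ 2>/dev/null || true   # cluster configuration file
echo "collected into $OUT"
```

Package the resulting directory (for example, with tar) before sending it to the IBM Support Center.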

Additional information to collect for delays and deadlocks

When a delay or deadlock situation is suspected, the IBM Support Center will need additional information to assist with problem diagnosis. If you have not done so already, ensure you have the following information available before contacting the IBM Support Center:
  1. Everything that is listed in Information to be collected for all problems related to GPFS.
  2. The deadlock debug data collected automatically.
  3. If the cluster size is relatively small and the maxFilesToCache setting is not high (less than 10,000), issue the following command:
    gpfs.snap --deadlock
    If the cluster size is large or the maxFilesToCache setting is high (greater than 1M), issue the following command:
    gpfs.snap --deadlock --quick

    For more information about the --deadlock and --quick options, see gpfs.snap command.
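The choice between the two forms can be sketched as a small shell test. The maxFilesToCache value shown here is hypothetical; on a real cluster you would read it with mmlsconfig:

```shell
# Hypothetical maxFilesToCache value; on a real cluster you might read it
# with:  mmlsconfig maxFilesToCache
MFTC=4000
if [ "$MFTC" -lt 10000 ]; then
    # small cluster / low maxFilesToCache: full deadlock data collection
    echo "gpfs.snap --deadlock"
else
    # large cluster / high maxFilesToCache: reduced (quick) collection
    echo "gpfs.snap --deadlock --quick"
fi
```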

Additional information to collect for file system corruption or MMFS_FSSTRUCT errors

When file system corruption or MMFS_FSSTRUCT errors are encountered, the IBM Support Center will need additional information to assist with problem diagnosis. If you have not done so already, ensure you have the following information available before contacting the IBM Support Center:
  1. Everything that is listed in Information to be collected for all problems related to GPFS.
  2. Unmount the file system everywhere, then run mmfsck -n in offline mode and redirect its output to a file.

The IBM Support Center will determine if and when you should run the mmfsck -y command.
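A minimal sketch of the offline check, assuming gpfs0 as an illustrative device name and /tmp as the output location. The mm* commands are shown as comments because they require a running GPFS cluster:

```shell
# Hypothetical device name and output path for the offline check.
DEVICE=gpfs0
OUT=/tmp/mmfsck.$DEVICE.out
# mmumount $DEVICE -a                   # unmount the file system everywhere
# mmfsck $DEVICE -n > "$OUT" 2>&1       # read-only (no-repair) offline check
echo "mmfsck $DEVICE -n > $OUT"
```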

Additional information to collect for GPFS daemon crashes

When the GPFS daemon is repeatedly crashing, the IBM Support Center will need additional information to assist with problem diagnosis. If you have not done so already, ensure you have the following information available before contacting the IBM Support Center:
  1. Everything that is listed in Information to be collected for all problems related to GPFS.
  2. Ensure the /tmp/mmfs directory exists on all nodes. If this directory does not exist, the GPFS daemon will not generate internal dumps.
  3. Set the traces on this cluster and all clusters that mount any file system from this cluster:
    mmtracectl --set --trace=def --trace-recycle=global
  4. Start the trace facility by issuing:
    mmtracectl --start
  5. Recreate the problem if possible or wait for the assert to be triggered again.
  6. Once the assert is encountered on the node, turn off the trace facility by issuing:
    mmtracectl --off

    If traces were started on multiple clusters, issue mmtracectl --off immediately on all of them.

  7. Collect gpfs.snap output:
    gpfs.snap
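The directory check in step 2 can be verified with a short sketch. The cluster-wide mmdsh form is shown as a comment because it needs a live cluster:

```shell
# Ensure the default dump directory exists on this node; on a real cluster
# you might run it everywhere with:  mmdsh -v -N all "mkdir -p /tmp/mmfs"
mkdir -p /tmp/mmfs
test -d /tmp/mmfs && echo "dump directory present"
```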