Information to be collected before contacting the IBM Support Center
To communicate effectively with the IBM® Support Center about a problem, you need to collect certain information in advance.
Information to be collected for all problems related to GPFS
Regardless of the problem encountered with GPFS™, the following data should be available when you contact the IBM Support Center:
- A description of the problem.
- Output of the failing application, command, and so forth.
- A tar file generated by the gpfs.snap command that contains data from the nodes in the cluster. In large clusters, the gpfs.snap command can collect data from certain nodes (for example, the affected nodes, NSD servers, or manager nodes) by using the -N option.
For more information about gathering data with gpfs.snap, see gpfs.snap command.
If the gpfs.snap command cannot be run, collect these items:
- Any error log entries relating to the event:
- On an AIX node, issue this command:
errpt -a
- On a Linux node, gather all the entries in the /var/log/messages file from all nodes in the cluster or from the nodes that experienced the failure. For example, issue the following command to create a file that includes the entries from all nodes in the cluster:
mmdsh -v -N all "cat /var/log/messages" > all.messages
- On a Windows node, use the Export List... dialog in the Event Viewer to save the event log to a file.
- A master GPFS log file that is merged and chronologically sorted for the date of the failure (see Creating a master GPFS log file).
- If the cluster was configured to store dumps, collect any internal GPFS dumps written to that directory relating to the time of the failure. The default directory is /tmp/mmfs.
- On a failing Linux node, gather the installed software packages and the version of each package by issuing this command:
rpm -qa
- On a failing AIX node, gather the name, most recent level, state, and description of all installed software packages by issuing this command:
lslpp -l
- The file system attributes for each of the failing file systems; to gather them, issue:
mmlsfs Device
- The current configuration and state of the disks for each of the failing file systems; to gather them, issue:
mmlsdisk Device
- A copy of file /var/mmfs/gen/mmsdrfs from the primary cluster configuration server.
- For Linux on Z, collect the data of the operating system as described in the Linux on z Systems® Troubleshooting Guide (www.ibm.com/support/knowledgecenter/linuxonibm/liaaf/lnz_r_sv.html).
- If you are experiencing one of the following problems, see the
appropriate section before contacting the IBM Support Center:
- For delay and deadlock issues, see Additional information to collect for delays and deadlocks.
- For file system corruption or MMFS_FSSTRUCT errors, see Additional information to collect for file system corruption or MMFS_FSSTRUCT errors.
- For GPFS daemon crashes, see Additional information to collect for GPFS daemon crashes.
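As an illustration of limiting data collection on a large cluster with the -N option mentioned above, a gpfs.snap invocation might look like the following sketch; the node names are placeholders for your own affected nodes and NSD servers:

```shell
# Collect snap data only from the nodes of interest rather than the
# whole cluster (node names here are hypothetical examples).
gpfs.snap -N affectednode1,affectednode2,nsdserver1
```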
Additional information to collect for delays and deadlocks
When a delay or deadlock situation is suspected, the IBM Support Center will need additional information to assist with problem diagnosis. If you have not done so already, ensure that you have the following information available before contacting the IBM Support Center:
- Everything that is listed in Information to be collected for all problems related to GPFS.
- The deadlock debug data collected automatically.
- If the cluster size is relatively small and the maxFilesToCache setting is not high (less than 10,000), issue the following command:
gpfs.snap --deadlock
If the cluster size is large or the maxFilesToCache setting is high (greater than 1M), issue the following command:
gpfs.snap --deadlock --quick
For more information about the --deadlock and --quick options, see gpfs.snap command.
Additional information to collect for file system corruption or MMFS_FSSTRUCT errors
When file system corruption or MMFS_FSSTRUCT errors are encountered, the IBM Support Center will need additional information to assist with problem diagnosis. If you have not done so already, ensure that you have the following information available before contacting the IBM Support Center:
- Everything that is listed in Information to be collected for all problems related to GPFS.
- Unmount the file system everywhere, then run mmfsck -n in offline mode and redirect its output to a file.
The IBM Support Center will determine whether and when you should run the mmfsck -y command.
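For example, after the file system is unmounted on all nodes, the offline check can be captured like this sketch; the device name and output path are placeholders:

```shell
# Run mmfsck in no-repair mode (-n) against the unmounted file system
# and capture all output for the IBM Support Center.
# "fsdev" and the output path are hypothetical examples.
mmfsck fsdev -n > /tmp/mmfsck.fsdev.out 2>&1
```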
Additional information to collect for GPFS daemon crashes
When the GPFS daemon is repeatedly crashing, the IBM Support Center will need additional information to assist with problem diagnosis. If you have not done so already, ensure that you have the following information available before contacting the IBM Support Center:
- Everything that is listed in Information to be collected for all problems related to GPFS.
- Ensure the /tmp/mmfs directory exists on all nodes. If this directory does not exist, the GPFS daemon will not generate internal dumps.
- Set the traces on this cluster and all clusters that mount
any file system from this cluster:
mmtracectl --set --trace=def --trace-recycle=global
- Start the trace facility by issuing:
mmtracectl --start
- Recreate the problem if possible or wait for the assert to be triggered again.
- Once the assert is encountered on the node, turn off the trace
facility by issuing:
mmtracectl --off
If traces were started on multiple clusters, issue mmtracectl --off immediately on all of those clusters.
- Collect gpfs.snap output:
gpfs.snap
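Taken together, the tracing steps above can be sketched as one sequence, assuming the commands are run with administrative access on the affected cluster:

```shell
# 1. Configure tracing on this cluster; repeat on every cluster that
#    mounts a file system from it.
mmtracectl --set --trace=def --trace-recycle=global

# 2. Start the trace facility.
mmtracectl --start

# 3. Recreate the problem, or wait for the daemon assert to occur.

# 4. After the assert, stop tracing immediately on all clusters.
mmtracectl --off

# 5. Collect the snap data for the IBM Support Center.
gpfs.snap
```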