HSM common problems and solutions

Common problems with the space management client are listed, and typical solutions are suggested.

The following table lists common problems and typical solutions.
Table 1. Common HSM problems and resolutions. Each entry lists the problem, the problem source, and the solution.

Problem: No HSM daemons are running.

Problem source: The configuration in the dsm.opt file or the dsm.sys file is invalid. The error prevents all HSM daemons from starting.

Solution: Run any HSM command. The command output describes the failure. Correct the configuration in the dsm.opt file or the dsm.sys file.
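
For example, any simple HSM query surfaces the configuration error in its output; dsmmigfs query -Detail is used here as one such command, and the option file locations shown are typical Linux defaults that might differ on your system:

    dsmmigfs query -Detail                      # the command output reports the invalid option
    vi /opt/tivoli/tsm/client/ba/bin/dsm.sys    # correct the reported option in the system options file
    vi /opt/tivoli/tsm/client/ba/bin/dsm.opt    # or correct it in the client user options file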

Problem: The watch daemon (dsmwatchd) is the only active daemon.

Problem source: Any of the following conditions can cause this problem:
  • HSM was stopped on the specified node.
  • Failover is disabled on the specified node.
  • The DMAPI service is not running.

Solution: Try the following actions; a combined command sequence follows this entry:

  • Start the HSM daemons by issuing the HSM command: dsmmigfs start. The daemons might take up to 30 seconds to start.
  • Enable failover on the node by issuing the HSM command: dsmmigfs enablefailover.
  • Ensure that GPFS™ is in the active state on all nodes in the cluster. To verify this state, issue the GPFS command: mmgetstate -a
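
A combined check on the affected node might look like the following sketch; all commands are taken from the actions above, and dsmmigfs enablefailover is needed only if failover was disabled:

    dsmmigfs start            # start the HSM daemons; they might take up to 30 seconds
    dsmmigfs enablefailover   # re-enable failover on this node
    mmgetstate -a             # verify that GPFS is active on all nodes in the cluster
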
Problem: The mount of DMAPI-enabled file systems fails.

Problem source: The recall daemon is not running.

Solution: Ensure that the recall daemon is running by issuing the command: dsmrecalld. The mount of a DMAPI-enabled file system requires at least one running recall daemon in the cluster.
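
For example, you can first check whether a recall daemon process exists and start one if it does not; the ps check is only an illustrative way to verify the process:

    ps -ef | grep dsmrecalld    # check whether the recall daemon is already running
    dsmrecalld                  # start the recall daemon if it is not running
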
Problem: The mount of DMAPI-enabled file systems hangs.

Problem source: There are two possible causes:

  1. On one node in the GPFS cluster, there is an orphaned DMAPI session from a recall daemon that failed.
  2. The GPFS configuration parameter, enableLowspaceEvents, is set to yes. To see the current value of this parameter, issue the command:

    mmlsconfig | grep enableLowspaceEvents

Solution: If there is an orphaned DMAPI session (cause 1), restart the recall daemon:
  1. Stop the recall daemon on all the nodes in the cluster. Issue the command: dmkilld.
  2. Start the recall daemon by issuing the command: dsmrecalld. The orphaned DMAPI session is cleaned up while the recall daemon starts.
If enableLowspaceEvents is set to yes (cause 2), change the value and restart the GPFS daemon on all nodes; a combined command sketch follows this entry:
  1. /usr/lpp/mmfs/bin/mmchconfig enableLowspaceEvents=no
  2. /usr/lpp/mmfs/bin/mmshutdown -a
  3. /usr/lpp/mmfs/bin/mmstartup -a
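
Taken together, the two corrective paths might look like the following sketch; run the GPFS commands only if enableLowspaceEvents is currently set to yes:

    # Cause 1: clean up an orphaned DMAPI session by restarting the recall daemons
    dmkilld                                                    # stop the recall daemon on all nodes in the cluster
    dsmrecalld                                                 # restart; the orphaned session is cleaned up at startup
    # Cause 2: disable low-space events and restart GPFS on all nodes
    /usr/lpp/mmfs/bin/mmlsconfig | grep enableLowspaceEvents   # confirm the current setting
    /usr/lpp/mmfs/bin/mmchconfig enableLowspaceEvents=no
    /usr/lpp/mmfs/bin/mmshutdown -a
    /usr/lpp/mmfs/bin/mmstartup -a
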
Problem: Several space management commands end without processing.

Problem source: The space management client cannot access the node configuration in the /etc/adsm/SpaceMan directory. Typically, this condition is caused by an unmounted /etc file system.

Solution: Mount the /etc file system.
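
A minimal check, assuming that /etc is a separate file system with an entry in the file system table as described above:

    df /etc                  # verify whether the /etc file system is mounted
    mount /etc               # mount it if it is missing
    ls /etc/adsm/SpaceMan    # confirm that the node configuration is accessible again
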
Problem: A file migration operation fails with the messages "ANS1228E Sending of object .. failed." and "ANS9256E File .. is currently opened by another process.", or a file recall operation hangs and provides no feedback to the user.

Problem source: A previous file migration or file recall operation on the affected file ended prematurely because of a failure or a GPFS shutdown on the node that processed the operation. Later, on this node, one of the following occurred:
  1. The recall daemons were restarted BEFORE the affected file system was remounted.
  2. The recall daemons were not restarted at all.
  3. The recall daemons were restarted on this node even though this node was not the owner of the affected file system.

Solution: Restart the recall daemons on the affected node by issuing the command: dsmmigfs restart. If it is not clear which node caused the problem, perform the following procedure (a sample command sequence follows this entry):
  1. Recursively list the content of the .SpaceMan/logdir/ subdirectory of the affected file system, for example, ls -lR /<affected_file_system>/.SpaceMan/logdir/.
  2. Look for entries in all translog subdirectories that contain one or more trailing digits in their name, for example, translog12/.
    • For problems with a file migration operation, look for entries with the suffix .mig, for example, 099B3477562F877D0000000000007D5B00000000000000001772E70200000000.mig
    • For problems with a file recall operation, look for entries with the suffix .rec, for example, 099B3477562F877D0000000000007D5B00000000000000001772E70200000000.rec
  3. Take note of the trailing digits in the name of the matching translog subdirectory or subdirectories, for example, "12" for the subdirectory translog12/.
  4. Issue the command mmlscluster on one of the cluster nodes. In the command output, look for a number in the "Node" column that matches the trailing digits, for example, "12". In the same row, find the node name that corresponds to the matching number; for example, "12" corresponds to the node name "number_cruncher".
  5. Ensure that the matching node is the owner of the affected file system. Check the owner by issuing the command: dsmmigfs q -d. If required, transfer ownership of the file system to the node identified in step 4 by issuing the command dsmmigfs takeover /<affected_file_system> on that node.
  6. Restart the recall daemons on the matching nodes by issuing the command dsmmigfs restart, for example, on the node "number_cruncher".
    Important: When you restart the recall daemons, make sure not to disrupt unaffected recall processes.
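
Assuming, for illustration, that the affected file system is mounted at /gpfs1 (a hypothetical mount point) and that the trailing digits resolve to node 12 ("number_cruncher"), the procedure might look like this sketch:

    ls -lR /gpfs1/.SpaceMan/logdir/    # steps 1-3: find translog<digits>/ entries that contain *.mig or *.rec files
    mmlscluster                        # step 4: map the trailing digits (for example, 12) to a node name
    dsmmigfs q -d                      # step 5: check which node owns the affected file system
    dsmmigfs takeover /gpfs1           # step 5: if required, run on the node from step 4 to take ownership
    dsmmigfs restart                   # step 6: restart the recall daemons on that node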