Solving common LSF problems
Most problems are due to incorrect installation or configuration. Before you start to troubleshoot LSF problems, always check the error log files first. Log messages often point directly to the problem.
Finding LSF error logs
When something goes wrong, LSF server daemons log error messages in the LSF log directory (specified by the LSF_LOGDIR parameter in the lsf.conf file).
Procedure
- lim.log.host_name
- res.log.host_name
- pim.log.host_name
- mbatchd.log.management_host
- mbschd.log.management_host
- sbatchd.log.management_host
- vemkd.log.management_host
If these log files contain any error messages that you do not understand, contact IBM Support.
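As a rough aid for locating these files, the following sketch builds the expected file names from the list above. The function names `expected_log_files` and `missing_logs` are hypothetical helpers, not part of LSF, and the log directory must be supplied by the caller (for example, the value of LSF_LOGDIR read from lsf.conf).

```python
from pathlib import Path

# Daemon log prefixes, grouped exactly as they appear in the list above.
PER_HOST_DAEMONS = ("lim", "res", "pim")
MANAGEMENT_DAEMONS = ("mbatchd", "mbschd", "sbatchd", "vemkd")


def expected_log_files(logdir, host_name, management_host):
    """Return the log file paths to inspect, in the order listed above."""
    base = Path(logdir)
    paths = [base / f"{d}.log.{host_name}" for d in PER_HOST_DAEMONS]
    paths += [base / f"{d}.log.{management_host}" for d in MANAGEMENT_DAEMONS]
    return paths


def missing_logs(paths):
    """Return the subset of paths that do not exist on this machine."""
    return [p for p in paths if not p.is_file()]
```

Calling `missing_logs(expected_log_files(logdir, host, management_host))` on a host shows which of the listed logs are absent there, which can itself hint at a daemon that never started.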
Diagnosing and fixing most LSF problems
General troubleshooting steps for most LSF problems.
Procedure
Cannot open the lsf.conf file
You might see this message when you run the lsid command. The message usually means that the LSF_CONFDIR/lsf.conf file is not accessible to LSF.
About this task
By default, LSF checks the directory that is defined by the LSF_ENVDIR parameter for the lsf.conf file. If the lsf.conf file is not in LSF_ENVDIR, LSF looks for it in the /etc directory.
For more information, see Setting up the LSF environment with cshrc.lsf and profile.lsf.
Procedure
- Make sure that a symbolic link exists from /etc/lsf.conf to lsf.conf.
- Use the cshrc.lsf or profile.lsf script to set up your LSF environment.
- Ensure that the cshrc.lsf or profile.lsf script is available for users to set the LSF environment variables.
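The lookup order described in this section (LSF_ENVDIR first, then /etc) can be approximated outside LSF to see which file, if any, would be found. This is a minimal sketch; `find_lsf_conf` is a hypothetical helper, and it only mimics the documented search order rather than calling any LSF API.

```python
import os
from pathlib import Path


def find_lsf_conf(environ=None, fallback_dir="/etc"):
    """Return the first readable lsf.conf path, or None if none is found."""
    env = os.environ if environ is None else environ
    candidates = []
    envdir = env.get("LSF_ENVDIR")
    if envdir:
        # The directory named by LSF_ENVDIR is checked first.
        candidates.append(Path(envdir) / "lsf.conf")
    # The /etc directory is the documented fallback location.
    candidates.append(Path(fallback_dir) / "lsf.conf")
    for path in candidates:
        if path.is_file() and os.access(path, os.R_OK):
            return path
    return None
```

A result of `None` corresponds to the situation in which LSF reports that it cannot open the lsf.conf file.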
LIM dies quietly
When the LSF LIM daemon exits unexpectedly, check for errors in the LIM configuration files.
Procedure
- Run the lsadmin ckconfig -v command to check the LIM configuration files.
This command displays most configuration errors. If the command does not report any errors, check the LIM error log.
LIM communication times out
Sometimes the LIM is up, but running the lsload command prints the Communication time out error message.
About this task
If the LIM just started, LIM needs time to get initialized by reading configuration files and contacting other instances of LIM. If the LIM does not become available within one or two minutes, check the LIM error log for the host you are working on.
To prevent communication timeouts when the local LIM is starting or restarting, define the parameter LSF_SERVER_HOSTS in the lsf.conf file. The client contacts the LIM on one of the LSF_SERVER_HOSTS and runs the command. At least one of the hosts that are defined in the list must have a LIM that is up and running.
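To confirm which hosts a client would try, the parameter can be read back from the configuration text. The sketch below assumes the usual lsf.conf form of one `NAME="value"` assignment per line with a space-separated host list; `read_host_list` is a hypothetical helper, not an LSF command.

```python
def read_host_list(conf_text, parameter):
    """Return the hosts assigned to `parameter` in lsf.conf-style text."""
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, value = line.partition("=")
        if sep and name.strip() == parameter:
            # Drop optional surrounding quotes, then split on whitespace.
            return value.strip().strip("\"'").split()
    return []
```

For example, `read_host_list(text, "LSF_SERVER_HOSTS")` returns an empty list when the parameter is not defined, which is the case in which a restarting local LIM can cause timeouts.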
When the local LIM is running but the cluster has no management host, LSF applications display the Cannot locate master LIM now, try later. message.
Procedure
- Check the LIM error logs on the first few hosts that are listed in the Host section of the lsf.cluster.cluster_name file. If the LSF_MASTER_LIST parameter is defined in the lsf.conf file, check the LIM error logs on the hosts that are listed in this parameter instead.
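The choice of hosts in this step can be sketched as follows. The code assumes the Host section is delimited by `Begin Host` and `End Host` lines with a `HOSTNAME` header row, and it interprets "first few" as three hosts, which is an arbitrary assumption. Both function names are hypothetical.

```python
def host_section_names(cluster_text):
    """Return host names from the Host section of lsf.cluster-style text."""
    names = []
    in_host = False
    for raw in cluster_text.splitlines():
        fields = raw.split("#", 1)[0].split()  # ignore trailing comments
        if not fields:
            continue
        head = [f.lower() for f in fields[:2]]
        if head == ["begin", "host"]:
            in_host = True
        elif head == ["end", "host"]:
            in_host = False
        elif in_host and fields[0].upper() != "HOSTNAME":
            names.append(fields[0])
    return names


def hosts_to_check(cluster_text, master_list=None, limit=3):
    """Prefer LSF_MASTER_LIST hosts; otherwise take the first `limit` hosts."""
    if master_list:
        return list(master_list)
    return host_section_names(cluster_text)[:limit]
```

Pass the already-parsed LSF_MASTER_LIST value as `master_list` when that parameter is defined; leave it as `None` otherwise.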
Management host LIM is down
Sometimes the management host LIM is up, but running the lsload or lshosts command displays the Master LIM is down; try later. message.
About this task
If the /etc/hosts file on the host where the management host LIM is running assigns the host name to the loopback IP address (127.0.0.1), LSF client LIMs cannot contact the management host LIM. When the management host LIM starts up, it sets its official host name and IP address to the loopback address. Any client that requests the management host LIM address gets 127.0.0.1, tries to connect to it, and in fact tries to access itself.
Procedure
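- Check the entry for the management host LIM in /etc/hosts.
The following example incorrectly sets the management host LIM IP address to the loopback address:
127.0.0.1 localhost myhostname
The following example maps the host name to its own IP address instead:
127.0.0.1 localhost
192.168.123.123 myhostname
- For a management host LIM running on a host that uses an IPv6 address, the loopback address is ::1.
The following example sets the management host LIM IP address by using an IPv6 address:
::1 localhost ipv6-localhost ipv6-loopback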
fe00::0 ipv6-localnet
ff00::0 ipv6-mcastprefix
ff02::1 ipv6-allnodes
ff02::2 ipv6-allrouters
ff02::3 ipv6-allhosts
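The check in this procedure can also be done programmatically. The following sketch scans hosts-file text and reports whether a given host name appears on a loopback line (127.0.0.0/8 or ::1). The helper names are hypothetical, and the code inspects only the text it is given; it does not query the resolver.

```python
import ipaddress


def loopback_names(hosts_text):
    """Return every host name that the hosts-file text maps to loopback."""
    names = set()
    for raw in hosts_text.splitlines():
        fields = raw.split("#", 1)[0].split()  # drop comments
        if len(fields) < 2:
            continue
        try:
            address = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # not an address line; ignore it
        if address.is_loopback:
            names.update(name.lower() for name in fields[1:])
    return names


def maps_to_loopback(hosts_text, host_name):
    """True if `host_name` is listed on a loopback line."""
    return host_name.lower() in loopback_names(hosts_text)
```

Applied to the examples above, `maps_to_loopback` is true for myhostname only in the first, misconfigured file.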
User permission denied
If the remote host cannot securely determine the user ID of the user that is requesting remote execution, remote execution fails with a User permission denied error message.
Procedure
Remote execution fails because of non-uniform file name space
A non-uniform file name space might cause a command to fail with the chdir(...) failed: no such file or directory message.
About this task
You are trying to run a command remotely, but either your current working directory does not exist on the remote host, or your current working directory is mapped to a different name on the remote host.
If your current working directory does not exist on a remote host, do not run commands remotely on that host.
Procedure
Batch daemons die quietly
When the LSF batch daemons sbatchd and mbatchd exit unexpectedly, check for errors in the configuration files.
About this task
If the mbatchd daemon is running but the sbatchd daemon dies on some hosts, it might be because mbatchd is not configured to use those hosts.
Procedure
- Check the sbatchd and mbatchd daemon error logs.
- Run the badmin ckconfig command to check the configuration.
- Check for email in the LSF administrator mailbox.
sbatchd starts but mbatchd does not
When the sbatchd daemon starts but the mbatchd daemon is not running, it is possible that mbatchd is temporarily unavailable because the management host LIM is temporarily unknown. The sbatchd: unknown service error message displays.
Procedure
Avoiding orphaned job processes
LSF uses process groups to track all the processes of a job. However, if the application forks a child that becomes a new process group and the parent then dies immediately, the child process group is orphaned from the parent process and cannot be tracked.
About this task
For more information about process tracking with Linux cgroups, see Memory and swap limit enforcement based on Linux cgroup memory subsystem.
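To make the process group behavior concrete, the following sketch starts a child process twice on a POSIX system: once in the parent's process group and once in a new session, which also gives the child its own process group. It illustrates only the operating system mechanism and is not how LSF itself tracks jobs. `child_process_group` is a hypothetical helper.

```python
import os
import subprocess
import sys

# The child simply reports the ID of the process group it belongs to.
CHILD = "import os; print(os.getpgid(0))"


def child_process_group(new_session):
    """Start a child Python process and return the process group it reports."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        text=True,
        check=True,
        start_new_session=new_session,  # True makes the child call setsid()
    )
    return int(result.stdout.strip())
```

Comparing both results with `os.getpgid(0)` in the parent shows that only the first child stays in the parent's group. A tracker that relies on that group alone would lose sight of the second child.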
Procedure
Host not used by LSF
The mbatchd daemon allows the sbatchd daemon to run only on the hosts that are listed in the Host section of the lsb.hosts file. If you configure an unknown host in the HostGroup or HostPartition sections of the lsb.hosts file, or as a HOSTS definition for a queue in the lsb.queues file, mbatchd logs an error message.
About this task
If you try to configure a host that is not listed in the Host section of the lsb.hosts file, the mbatchd daemon logs the following message.
mbatchd on host: LSB_CONFDIR/cluster1/configdir/file(line #): Host hostname is not used by lsbatch; ignored
If you start the sbatchd daemon on a host that is not known by the mbatchd daemon, mbatchd rejects the sbatchd. The sbatchd daemon logs the This host is not used by lsbatch system. message and exits.
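When the daemon logs are long, both messages can be picked out with a simple scan. The sketch below matches only the message text quoted in this section. The surrounding log line layout (timestamps, prefixes) varies, so the patterns search anywhere in the line. `unused_host_report` is a hypothetical helper.

```python
import re

# Message fragments copied from the text above.
IGNORED_HOST = re.compile(r"Host (\S+) is not used by lsbatch; ignored")
REJECTED_SBATCHD = "This host is not used by lsbatch system"


def unused_host_report(log_lines):
    """Return (hosts ignored by mbatchd, whether sbatchd reported rejection)."""
    ignored = []
    rejected = False
    for line in log_lines:
        match = IGNORED_HOST.search(line)
        if match:
            ignored.append(match.group(1))
        elif REJECTED_SBATCHD in line:
            rejected = True
    return ignored, rejected
```

Feed it the lines of the mbatchd and sbatchd logs; any host name it returns is a candidate for adding to the Host section of the lsb.hosts file.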
Procedure
Unknown host type or model
A model or type UNKNOWN indicates that the host is down or the LIM on the host is down. You need to take immediate action to restart LIM on the UNKNOWN host.
Procedure
Default host type or model
If you see DEFAULT in the output of lim -t, automatic detection of the host type or model failed, and the host type that is configured in the lsf.shared file cannot be found. LSF works on the host, but a DEFAULT model might be inefficient because of incorrect CPU factors. A DEFAULT type might also cause binary incompatibility because a job from a DEFAULT host type can be migrated to another DEFAULT host type.