Kerberos authentication with IBM® Spectrum LSF

The Kerberos integration for LSF allows you to use Kerberos authentication for LSF clusters and jobs.

Kerberos integration for LSF includes the following features:

  • The dedicated binary krbrenewd renews ticket-granting tickets (TGTs) for pending jobs and running jobs. It is designed to handle many jobs without creating excessive overhead for mbatchd and the KDC.
  • User TGT forwarding is separated from daemon and user authentication. The LSB_KRB_TGT_FWD parameter controls TGT forwarding.
  • The Kerberos solution package is preinstalled in the LSF installation directory, so users do not need to compile it from source code. krb5 function calls are dynamically linked.
  • Preliminary TGT forwarding support for parallel jobs, including shared directory support for parallel jobs. If all hosts at a customer site have a shared directory, you can configure this directory in lsf.conf with the LSB_KRB_TGT_DIR parameter, and the TGT for each individual job is stored in this directory.
  • LSF Kerberos integration works in an NFSv4 environment.

Install LSF in a location that does not require a credential to access.

You must provide the following krb5 libraries since they do not ship with LSF:

  • libkrb5.so
  • libkrb5support.so
  • libk5crypto.so
  • libcom_err.so

Set LSB_KRB_LIB_PATH in lsf.conf to the path that contains these four libraries.
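
Before setting LSB_KRB_LIB_PATH, it can help to confirm that a candidate directory really contains all four libraries. The following is a minimal sketch of such a check. The function name check_krb5_libs is hypothetical and is not part of LSF.

```shell
# Hypothetical helper (not an LSF command): report which of the four
# required krb5 libraries are missing from a candidate directory.
check_krb5_libs() {
    dir=$1
    missing=0
    for lib in libkrb5.so libkrb5support.so libk5crypto.so libcom_err.so; do
        if [ ! -e "$dir/$lib" ]; then
            echo "missing: $dir/$lib" >&2
            missing=1
        fi
    done
    # Return 0 only when every library was found.
    return "$missing"
}
```

For example, if check_krb5_libs /usr/lib64 succeeds, the corresponding lsf.conf entry would be LSB_KRB_LIB_PATH=/usr/lib64. The /usr/lib64 path is only an assumption; use the directory where the libraries are installed at your site.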

Note the following issues when using the Kerberos integration:

  • If you turn on the account mapping feature of LSF, you must ensure that the execution user has read/write permission for the directory that is defined by the LSB_KRB_TGT_DIR parameter, which holds the runtime TGT.
  • krb5 libraries are required for TGT manipulation.
  • Configure the TGT renew limit so it is long enough for jobs to finish running. Long jobs that last several hours or even several days need their TGTs renewed in time to keep the job running. Ensure that the job execution time does not exceed the TGT renew limit.
  • With the blaunch command, only one task res is invoked per host.
  • blaunch krb5 does not support auto-resizable jobs.
  • blaunch krb5 does not support remote execution servers (RES) that are running LSF Version 9.1.2 or older, and therefore the renew script does not work with these versions of RES. Similarly, blaunch krb5 does not support sbatchd daemons from LSF Version 9.1.2 or older. Therefore, child sbatchd daemons cannot be kerberized and the renew script does not work in root sbatchd daemons from LSF Version 9.1.2 or older.
  • The brequeue command does not transfer new TGTs to the mbatchd daemon. If a job is requeued by the brequeue command, the TGT that is used for the job is the one that is cached by the mbatchd daemon.
  • LSF does not check the contents or exit code of the erenew script. If erenew contains the wrong command, AFS tokens cannot be renewed and LSF does not report any errors in the log file. Therefore, users must ensure that the commands in the erenew script can renew AFS tokens successfully.
  • Some bsub options, such as bsub -Zs or bsub -is, require the bsub command to do file manipulation. In this case, if the file involved resides in an AFS volume, users must ensure that they acquire a proper AFS token before they run the bsub command.
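
Two of the notes above lend themselves to a quick pre-flight check: the execution user's access to the LSB_KRB_TGT_DIR directory, and the job run time relative to the TGT renew limit. The following sketch illustrates both checks. The function names are hypothetical, and the hour values are supplied by the caller rather than read from Kerberos or LSF.

```shell
# Sketch only: both helper names are hypothetical, not LSF commands.

# Succeeds if the current user can read and write the TGT directory.
# Run it as the execution user on an execution host.
tgt_dir_usable() {
    dir=$1
    [ -d "$dir" ] && [ -r "$dir" ] && [ -w "$dir" ]
}

# Succeeds if the expected job run time (in hours) does not exceed
# the TGT renew limit (in hours).
runtime_within_renew_limit() {
    runtime_hours=$1
    renew_limit_hours=$2
    [ "$runtime_hours" -le "$renew_limit_hours" ]
}
```

For example, runtime_within_renew_limit 72 168 succeeds for a three-day job when the renew limit is seven days. With MIT Kerberos, the klist command typically shows a "renew until" time for renewable tickets, which is one place to find the limit.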

Kerberos Support for NFSv4 and AFS

When LSF is used on NFSv4 or Andrew File System (AFS), each process in a sequential job or a distributed parallel job needs to periodically renew its credentials. For this reauthentication to take place in a secure, user-friendly environment, a TGT file is distributed to each execution host and the root sbatchd daemon on each execution host renews the TGT.

If you use the AFS feature, you must provide the libkopenafs.so or libkopenafs.so.1 libraries, which do not ship with LSF. You can use them from the openafs-authlibs-* package or build them directly from the AFS source.

To support AFS, LSF provides an external renew hook mechanism, which is called after the TGT is renewed. Users can write their own renew logic through this renew hook. More specifically, users can rename the demo script erenew.krb5 in the $LSF_SERVERDIR directory to erenew, or create their own erenew executable file in the $LSF_SERVERDIR directory. LSF calls the erenew executable file immediately at job startup time to make sure that the user’s job has a valid AFS token, and calls it again automatically each time the TGT is renewed. For example, AFS users can use this hook to run the aklog command to renew AFS tokens.
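
The following is a minimal sketch of what an erenew script might look like. It is not the erenew.krb5 demo script that ships with LSF. Because LSF does not check the exit code of erenew, the sketch records failures in its own log file. The AKLOG_CMD and ERENEW_LOG variables, the default log location, and the function name are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical erenew sketch: renew the AFS token with aklog.

# Assumption: aklog is on the PATH of the user that runs this hook.
AKLOG_CMD="${AKLOG_CMD:-aklog}"
# Assumption: a per-user log file under /tmp is acceptable at your site.
ERENEW_LOG="${ERENEW_LOG:-/tmp/erenew.$(id -un).log}"

renew_afs_token() {
    "$AKLOG_CMD" >/dev/null 2>&1
    status=$?
    if [ "$status" -ne 0 ]; then
        # LSF does not report erenew failures, so record them here.
        printf '%s erenew: %s failed with status %s\n' \
            "$(date)" "$AKLOG_CMD" "$status" >> "$ERENEW_LOG"
    fi
    return "$status"
}

# Renew only when this file is invoked under the name erenew.
case "${0##*/}" in
    erenew) renew_afs_token ;;
esac
```

To try the sketch, save it as erenew in the $LSF_SERVERDIR directory and make it executable. Run it by hand first, as a user with a valid TGT, to confirm that the token is renewed and that nothing is written to the log file.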