Script interface for data transfer jobs
LSF data manager administrators can provide a customized data transfer command for data transfer jobs to use. The command is specified by the FILE_TRANSFER_CMD parameter in lsf.datamanager.
Requirements of the data transfer command
- The command must take two arguments with the following form:
  [host_name:]abs_file_path
  The first argument is an absolute path to the location of the source file and the second is an absolute path to the destination of the transfer.
- The command that you specify must block until the transfer is successfully completed or an error occurs.
- The command must return 0 if successful, and a nonzero value if an error occurs.
- The command must be executable by users from the cluster data transfer nodes.
- The command must be able to accept path descriptors with or without host names for each of its two arguments. For example, the default scp command accepts both forms. The cp command is not valid because it cannot accept a host name.
If the command returns successfully, LSF data manager assumes that the transfer was completed without error.
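The argument requirements above can be checked at the top of a custom transfer command. The following is a minimal sketch, not part of LSF: parse_descriptor and validate_transfer_args are hypothetical helper names, and the exit value 2 for bad arguments is an arbitrary choice that only needs to be nonzero.

```shell
#!/bin/bash
# Hypothetical helper: split a "[host_name:]abs_file_path" descriptor into
# desc_host and desc_path. Returns nonzero if the path part is not absolute.
parse_descriptor() {
    case $1 in
        /*)
            # No host prefix: the descriptor is already an absolute path
            desc_host=""
            desc_path=$1
            ;;
        *:/*)
            # host_name:abs_file_path
            desc_host=${1%%:*}
            desc_path=${1#*:}
            [ -n "$desc_host" ] || return 1
            ;;
        *)
            # Relative path or malformed descriptor
            return 1
            ;;
    esac
}

# Hypothetical helper: succeed only if exactly two valid descriptors are given.
validate_transfer_args() {
    [ "$#" -eq 2 ] || return 2
    parse_descriptor "$1" || return 2
    parse_descriptor "$2" || return 2
}
```

A transfer command could call validate_transfer_args "$@" || exit 2 before it starts the copy, so that malformed arguments produce a nonzero return value instead of a partial transfer.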
The default transfer command (/usr/bin/scp) meets these requirements under the following conditions:
- The data transfer nodes are standard Linux hosts with scp deployed.
- Passwordless SSH is configured for users from the data node to the data source and destination hosts.
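One way to confirm the passwordless SSH condition is to attempt a non-interactive login from a data transfer node, as the user who will own the transfer. The sketch below assumes the OpenSSH client; check_passwordless_ssh and check_transfer_hosts are hypothetical helper names, and the 5-second connection timeout is an arbitrary choice.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if a non-interactive SSH login to the
# given host works. BatchMode=yes makes ssh fail instead of prompting for a
# password, so a password-only setup is reported as a failure.
check_passwordless_ssh() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Hypothetical helper: check every host given as an argument.
# Returns 0 only if all hosts accept a passwordless login.
check_transfer_hosts() {
    failed=0
    for h in "$@"; do
        if ! check_passwordless_ssh "$h"; then
            echo "passwordless SSH to $h failed" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

For example, check_transfer_hosts followed by the names of the data source and destination hosts reports each host that would cause scp to stop and prompt for a password.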
Important: If a transfer job that is submitted by LSF data manager fails, it is not retried. Any jobs that request the data file being transferred by the failed job are killed by LSF with exit code 125.
Example data transfer wrapper script
Depending on the transfer tool you are using, you might want to implement a wrapper script to
help you debug data transfer problems, and to make data transfers more resilient to failure. The
following example wrapper script for the scp command does two things:
- Retries the transfer up to five times before it gives up, sleeping 10 seconds between retries
- Logs any error messages output by the scp command to the file /tmp/<transfer_jobid>.<host_name>.err on the data transfer node (I/O node) that ran the transfer job
#!/bin/bash
#
# Save the source, destination, and execution host
#
src=$1
dst=$2
host=$(hostname)
#
# File names for the error log and a temp file
#
errlog=/tmp/$LSB_JOBID.$host.err
tmplog=/tmp/$LSB_JOBID.tmp
#
# Append the source and destination of the transfer to the error log
#
echo "SRC = $src" >> $errlog
echo "DST = $dst" >> $errlog
#
# Try to do the transfer up to 5 times, sleeping 10 seconds in between
#
ntries=0
done=0
while (( ntries < 5 && done == 0 )); do
    #
    # Increment the retry count
    #
    ntries=$((ntries + 1))
    #
    # Run the transfer command and store its output in a temp file
    #
    scp "$src" "$dst" &> "$tmplog"
    #
    # Save the error code returned by the command
    #
    code=$?
    if [ "$code" -eq 0 ]; then
        #
        # Success! Delete the error log and break out of the loop
        #
        rm -f "$errlog"
        done=1
    else
        #
        # Failure :( Append the command output to the error log
        #
        echo "==== OUTPUT: (Attempt $ntries) ===================" >> "$errlog"
        cat "$tmplog" >> "$errlog"
        #
        # Sleep for 10 seconds before trying again, unless this was the last attempt
        #
        if (( ntries < 5 )); then
            sleep 10
        fi
    fi
done
#
# Always delete the temp file
#
rm -f "$tmplog"
#
# Exit with the last return code provided by the transfer command
#
exit $code
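The retry pattern in the wrapper script is not specific to scp. As a minimal sketch, the loop can be factored into a reusable function that wraps any transfer tool meeting the requirements listed earlier. The function name retry and its argument order (maximum attempts, delay in seconds, then the command) are assumptions made for this illustration.

```shell
#!/bin/bash
# Hypothetical helper: run a command up to max_tries times, sleeping delay
# seconds between attempts. Returns 0 on the first success, otherwise the
# return code of the last failed attempt.
# Usage: retry <max_tries> <delay_seconds> <command> [args...]
retry() {
    max_tries=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        "$@"
        rc=$?
        if [ "$rc" -eq 0 ]; then
            return 0
        fi
        if [ "$attempt" -ge "$max_tries" ]; then
            return "$rc"
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example call inside a wrapper script (not executed here):
#   retry 5 10 scp "$1" "$2"
#   exit $?
```

Because the function returns the code of the last attempt, a wrapper built on it still returns 0 on success and a nonzero value on failure. To put any such wrapper into use, the administrator sets FILE_TRANSFER_CMD in lsf.datamanager to the wrapper's path and makes the script executable by users on the data transfer nodes.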