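Troubleshooting data transfer job failures
If a data transfer job fails, LSF kills the job that requires the file that the transfer job was transferring.
Use the bjobs -l command to see the following information about the job:
- The job exited with exit code 125.
- The job exit reason is TERM_DATA.
- An external message was posted to the job.
For example, the data staging transfer job for job 126, which was submitted with a data requirement, failed: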
bjobs -l 126
Job <126>, User <user1>, Project <default>, Status <EXIT>, Queue <normal>, Comm
and <my_data_job.sh>
Wed Sep 3 14:33:12: Submitted from host <hostA>, CWD </scratch/dev4/user1/wo
rkspace/datajobs>, Data Requirement Requested;
Wed Sep 3 14:33:26: Exited with exit code 125.
Wed Sep 3 14:33:26: Completed <exit>; TERM_DATA: job killed by LSF due to fail
ed data staging.
SCHEDULING PARAMETERS:
            r15s   r1m  r15m    ut    pg    io    ls    it   tmp   swp   mem
loadSched      -     -     -     -     -     -     -     -     -     -     -
loadStop       -     -     -     -     -     -     -     -     -     -     -
EXTERNAL MESSAGES:
MSG_ID  FROM  POST_TIME    MESSAGE                            ATTACHMENT
0       root  Sep 3 14:33  Staging failed for <hostA:/home/u  N
RESOURCE REQUIREMENT DETAILS:
Combined: select[type == local] order[r15s:pg]
Effective: -
Because the bjobs -l command displays only the first line of the external message, use the bread command to read the full external message:
bread 126
JOBID  MSG_ID  FROM  POST_TIME    DESCRIPTION
126    0       root  Sep 3 14:33  Staging failed for <hostA:/home
                                  /user1/data2> : transfer job <127
                                  @dq9.1.2> through </home/user1/my
                                  scp> exited; TERM_OWNER: job kill
                                  ed by owner.
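The two checks above can be combined in a small script. The following is a minimal sketch, not part of LSF: the helper name is_staging_failure is hypothetical, and the strings it matches are taken from the sample output above, so it assumes that your bjobs -l output uses the same wording. Because bjobs -l wraps long lines, the helper removes all whitespace before matching.

```shell
#!/bin/sh
# Hypothetical helper (not an LSF command): reads `bjobs -l` output on stdin
# and returns 0 only if the job exited with code 125 AND the exit reason is
# TERM_DATA. The matched wording is assumed from the sample output.
is_staging_failure() {
    # bjobs -l wraps long lines mid-word, so strip all whitespace first.
    text=$(tr -d ' \t\n')
    case $text in
        *"Exitedwithexitcode125."*) ;;   # exit code matches, keep checking
        *) return 1 ;;
    esac
    case $text in
        *"TERM_DATA"*) return 0 ;;       # both indicators are present
        *) return 1 ;;
    esac
}
```

With the helper defined, the full external message for job 126 could be read only when staging actually failed, for example with: bjobs -l 126 | is_staging_failure && bread 126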