Troubleshooting data transfer job failures

If a data transfer job fails, LSF terminates the jobs that require the file that the transfer job was transferring.

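Use the bjobs -l command on a terminated job to view the following information:
  • The job exited with exit code 125.
  • The exit reason is TERM_DATA.
  • An external message was posted to the job.
For example, the data staging transfer job fails for job 126, which was submitted with a data requirement: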
bjobs -l 126

Job <126>, User <user1>, Project <default>, Status <EXIT>, Queue <normal>, Comm
                     and <my_data_job.sh>
Wed Sep  3 14:33:12: Submitted from host <hostA>, CWD </scratch/dev4/user1/wo
                     rkspace/datajobs>, Data Requirement Requested;
Wed Sep  3 14:33:26: Exited with exit code 125.
Wed Sep  3 14:33:26: Completed <exit>; TERM_DATA: job killed by LSF due to fail
                     ed data staging.

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -

 EXTERNAL MESSAGES:
 MSG_ID FROM       POST_TIME      MESSAGE                             ATTACHMENT
 0      root       Sep  3 14:33   Staging failed for <hostA:/home/u   N

 RESOURCE REQUIREMENT DETAILS:
 Combined: select[type == local] order[r15s:pg]
 Effective: -
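
The two indicators described above (exit code 125 and the TERM_DATA exit reason) can also be checked by a script. The following is a minimal sketch, not an official LSF tool: it assumes that the text of bjobs -l has already been captured into a string, and that wrapped continuation lines are indented by at least 20 spaces, as in the sample output. The function names unwrap and is_data_staging_failure are hypothetical.

```python
import re


def unwrap(text: str) -> str:
    """Rejoin lines that bjobs -l hard-wrapped onto indented continuation lines.

    Assumption: a continuation line starts with 20 or more spaces, as in the
    sample output. Other LSF versions or terminal widths may differ.
    """
    return re.sub(r"\n {20,}", "", text)


def is_data_staging_failure(bjobs_output: str) -> bool:
    """Return True if the output shows exit code 125 and the TERM_DATA reason."""
    flat = unwrap(bjobs_output)
    has_exit_code = re.search(r"Exited with exit code 125\b", flat) is not None
    has_reason = "TERM_DATA" in flat
    return has_exit_code and has_reason
```

Unwrapping before searching matters because bjobs -l can split a word across lines, as it does with "failed" in the sample output.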
Because the bjobs -l command displays only the first line of the external message, use the bread command to read the full external message:
bread 126
JOBID      MSG_ID FROM       POST_TIME      DESCRIPTION
126        0      root       Sep  3 14:33   Staging failed for <hostA:/home
                                            /user1/data2> : transfer job <127
                                            @dq9.1.2> through </home/user1/my
                                            scp> exited; TERM_OWNER: job kill
                                            ed by owner.
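
Because bread wraps the message inside the DESCRIPTION column, the text is easier to search or log after it is rejoined into one line. The following is a rough sketch under stated assumptions: the output contains a single message, the message text starts at the column where the DESCRIPTION header starts, and wrapped fragments can be concatenated without a separator. A space that fell exactly on a wrap boundary would be lost. The function name join_bread_message is hypothetical.

```python
def join_bread_message(bread_output: str) -> str:
    """Rejoin one wrapped bread message into a single line.

    Assumptions: one message only, and its text is aligned with the
    DESCRIPTION header column, as in the sample output.
    """
    lines = bread_output.splitlines()

    # Locate the header row and the column where DESCRIPTION begins.
    for index, line in enumerate(lines):
        column = line.find("DESCRIPTION")
        if column != -1:
            break
    else:
        raise ValueError("no DESCRIPTION header found in bread output")

    # Collect the text in that column from every row below the header.
    fragments = [line[column:].strip() for line in lines[index + 1:]]
    return "".join(fragment for fragment in fragments if fragment)
```

Applied to the sample output for job 126, this yields a single line stating that transfer job <127@dq9.1.2> through </home/user1/myscp> exited with TERM_OWNER, which means that the transfer job was killed by its owner.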