SystemAdmin

Pinned topic LoadLeveler cluster dynamic reconfiguration causes already running jobs to restart from scratch.

2007-09-06T09:35:42Z
Hello,

I currently have a problem where, after amending the LoadLeveler configuration files and issuing 'llctl -g reconfig', jobs that were already running are restarted from scratch, so the output they had already generated is lost. On some occasions, previously running jobs are instead left queued after the reconfig command has completed.
This does not affect every job, and it can affect both serial and parallel jobs.

I have used the llctl -g reconfig command numerous times without this issue.

Has anyone else experienced this or a similar problem?

Platform / Cluster Details:

Platform - Power5+ (P575 cluster)
OS - AIX 5.3 (5300-04-03)
LoadLeveler version - 3.3.2

Kind regards,

Andrew Austin.
Updated on 2007-09-12T13:37:11Z by SystemAdmin
  • SystemAdmin

    Re: LoadLeveler cluster dynamic reconfiguration causes already running jobs to restart from scratch.

    ‏2007-09-12T13:37:11Z  
    Hi Andrew,

    Based on your description of the problem, the running jobs on the nodes (or some of the nodes) where the reconfig was done were vacated. This could be due to a changed configuration value, to the reconfig processing itself, or to some other reason. Please open a PMR and send in the startd and starter logs from the nodes where the running jobs were vacated, along with the log of the schedd that owned those jobs. We will take a look and help to resolve the problem.
    Regards,
    Waiman
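
When gathering the startd, starter, and schedd logs requested above, it can help to first pull out only the lines that mention an affected job. The following is a minimal sketch, not a LoadLeveler tool: the function name `collect_job_lines` is hypothetical, it assumes the logs are plain text files whose lines contain the job identifier somewhere, and the log paths are whatever your site configuration uses.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def collect_job_lines(log_paths: Iterable[str], job_id: str) -> Dict[str, List[str]]:
    """Return, for each readable log file, the lines that contain job_id.

    Assumption: logs are plain text and the job identifier appears
    verbatim in the relevant lines. No LoadLeveler log format is assumed.
    """
    matches: Dict[str, List[str]] = {}
    for path in log_paths:
        log_file = Path(path)
        if not log_file.is_file():
            # Skip logs that do not exist on this node.
            continue
        # errors="replace" keeps the scan going if a log has odd bytes.
        with log_file.open("r", errors="replace") as handle:
            hits = [line.rstrip("\n") for line in handle if job_id in line]
        if hits:
            matches[str(log_file)] = hits
    return matches
```

Called with a list of log paths and a job identifier, it returns a dictionary keyed by log file, which can be printed or saved next to the full logs before they are attached to the PMR. The full logs should still be sent, since the surrounding lines may matter.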