I currently have a problem where, after amending the LoadLeveler configuration files and issuing 'llctl -g reconfig', jobs that were already running are restarted from scratch, so the output they had already generated is lost. On some occasions, previously running jobs are left queued after the reconfig command has completed.
This does not affect every job, and it can affect both serial and parallel jobs.
I have used the llctl -g reconfig command numerous times without this issue.
Has anyone else experienced this or a similar problem?
Platform / Cluster Details:
Platform - Power5+ (P575 cluster)
OS - AIX 5.3 (5300-04-03)
LoadLeveler version - 3.3.2
LoadLeveler cluster dynamic reconfiguration causes already running jobs to restart from scratch.
Updated on 2007-09-12T13:37:11Z by SystemAdmin
Re: LoadLeveler cluster dynamic reconfiguration causes already running jobs to restart from scratch.
2007-09-12T13:37:11Z
This is the accepted answer.
Hi Andrew,
Based on your description of the problem, the running jobs on (some of?) the nodes where the reconfig was done were vacated. This could be due to a config value change, to the reconfig processing itself, or to some other reason. Please open a PMR and send in the startd and starter logs from the nodes where the running job(s) were vacated, along with the log of the schedd that owned the job(s). We will take a look and help to resolve the problem.
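To work out which jobs (and therefore which nodes' logs) belong in the PMR, it may help to compare job states from just before and just after the reconfig. The following is only a rough sketch: it assumes you have already captured two snapshots as plain mappings of job id to state code (for example "R" for running, "I" for idle), and how you build those mappings from your own llq output is left to you. The function name, the "MISSING" marker and the sample job ids are hypothetical.

```python
def find_disrupted_jobs(before, after):
    """Return {job_id: new_state} for jobs that were running before the
    reconfig but are no longer running afterwards.

    `before` and `after` are assumed to be dicts mapping a job id string
    to a state code string, with "R" meaning running.
    """
    disrupted = {}
    for job_id, state in before.items():
        if state != "R":
            continue  # only jobs that were running beforehand matter here
        # A job that has vanished from the queue is reported as "MISSING"
        # (a hypothetical marker, not a LoadLeveler state).
        new_state = after.get(job_id, "MISSING")
        if new_state != "R":
            disrupted[job_id] = new_state
    return disrupted


# Illustrative snapshots with made-up job ids.
before = {"node1.101.0": "R", "node1.102.0": "R", "node2.55.0": "I"}
after = {"node1.101.0": "R", "node1.102.0": "I", "node2.55.0": "I"}

for job_id, new_state in sorted(find_disrupted_jobs(before, after).items()):
    print(job_id, "was running, now", new_state)
```

Note that a job which was vacated and had already started running again by the time of the second snapshot shows as running in both, so state alone will not catch it; comparing dispatch times or checking the starter log would be needed for those.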