When the llctl drain startd someclass command is run, we expect LL to stop scheduling jobs in class someclass and let jobs already running in someclass terminate normally.
However, if a job in someclass is in the preempted state when the class receives the drain command, the job will not resume until the class is resumed. This behavior makes it difficult to take part of a cluster down for maintenance without losing jobs.
Have other sites encountered this problem? Is there a clean way to deal with it? Checkpointing is not an option: it is too unreliable and may not even work at all on a preempted job.