Controlling and monitoring host power state management

The following commands allow for control and monitoring of host power state management.

badmin hpower

Use the badmin hpower subcommand to manually switch idle hosts (individual hosts and host groups, including compute units and host partitions) into power saving state or back into working state. The syntax is:

badmin hpower suspend | resume [-C comments] host_name […]

Options:

suspend
Puts the host in energy saving state. badmin hpower suspend calls the script defined by POWER_SUSPEND_CMD in the PowerPolicy, and tags the host so that it cannot be resumed by the PowerPolicy.
resume
Puts the host in working state. The host can enter power saving state again when CYCLE_TIME is reached. If the host should not enter power saving state, use the badmin hclose command to block the host from the power policy.
-C
Adds a comment that describes the specified power management action. Comments are displayed by badmin hist and badmin hhist.
host_name
Specify one or more host names, host groups, compute units, or host partitions. All specified hosts are switched to energy saving state or working state. If a host is not in a state that allows switching, an error message is displayed. Each host is reported on its own line with its message.
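
The syntax above can be assembled programmatically. The following Python sketch only builds and prints the argument list and does not run badmin. The function name build_hpower_command and the sample host names and comment are hypothetical and chosen for illustration.

```python
import shlex

def build_hpower_command(action, hosts, comment=None):
    """Build the argument list for 'badmin hpower' (nothing is executed)."""
    if action not in ("suspend", "resume"):
        raise ValueError("action must be 'suspend' or 'resume'")
    if not hosts:
        raise ValueError("at least one host name is required")
    argv = ["badmin", "hpower", action]
    if comment:
        # -C attaches a comment that badmin hist and badmin hhist display.
        argv += ["-C", comment]
    argv += list(hosts)
    return argv

# Hypothetical host names and comment.
argv = build_hpower_command("suspend", ["host1", "host2"], comment="night shutdown")
print(shlex.join(argv))
```

Passing the list to a process launcher without a shell avoids quoting problems when the comment contains spaces.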

badmin hist and badmin hhist

Use badmin hist and badmin hhist to retrieve the historical information about the power state changes of hosts.

All power-related events are logged, both for badmin hpower and for actions triggered by a configured (automated) PowerPolicy.

Logged events, by power state action, initiator, and result:

Power state action: Suspend

  Performed by badmin hpower

    On success:
      Host <host_name> suspend request from administrator <cluster_admin_name>.
      Host <host_name> suspend request done.
      Host <host_name> suspend.

    On failure:
      Host <host_name> suspend request from administrator <cluster_admin_name>.
      Host <host_name> suspend request failed.
      Host <host_name> power unknown.

  Performed by PowerPolicy

    On success:
      Host <host_name> suspend request from power policy <policy_name>.
      Host <host_name> suspend request done.
      Host <host_name> suspend.

    On failure:
      Host <host_name> suspend request from power policy <policy_name>.
      Host <host_name> suspend request failed.
      Host <host_name> power unknown.

Power state action: Resume

  Performed by badmin hpower

    On success:
      Host <host_name> resume request from administrator <cluster_admin_name>.
      Host <host_name> resume request done.
      Host <host_name> on.

    On failure:
      Host <host_name> resume request from administrator <cluster_admin_name>.
      Host <host_name> resume request exit.
      Host <host_name> power unknown.

  Performed by PowerPolicy

    On success:
      Host <host_name> resume request from power policy <policy_name>.
      Host <host_name> resume request done.
      Host <host_name> on.

    On failure:
      Host <host_name> resume request from power policy <policy_name>.
      Host <host_name> resume request exit.
      Host <host_name> power unknown.
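
When history output is post-processed, the logged event messages can be classified by pattern. The following Python sketch is one way to do that, assuming the messages appear exactly as worded in this section. Whether real logs wrap names in angle brackets is not stated here, so the patterns accept both forms. The function name classify and the tuple layout are hypothetical.

```python
import re

# Assumption: names may or may not be wrapped in angle brackets in real logs.
REQUEST = re.compile(
    r"^Host <?(?P<host>[^<>\s]+)>? (?P<action>suspend|resume) request "
    r"(?:from (?P<source>administrator|power policy) <?(?P<who>[^<>\s]+)>?"
    r"|(?P<result>done|failed|exit))\.$"
)
STATE = re.compile(
    r"^Host <?(?P<host>[^<>\s]+)>? (?P<state>suspend|on|power unknown)\.$"
)

def classify(line):
    """Return a tuple describing one logged power event, or None."""
    line = line.strip()
    m = REQUEST.match(line)
    if m:
        if m.group("source"):
            return ("request", m.group("host"), m.group("action"),
                    m.group("source"), m.group("who"))
        # "done" marks success; "failed" and "exit" mark failure.
        outcome = "success" if m.group("result") == "done" else "failure"
        return ("result", m.group("host"), m.group("action"), outcome)
    m = STATE.match(line)
    if m:
        return ("state", m.group("host"), m.group("state"))
    return None

# Hypothetical host and administrator names.
print(classify("Host host1 suspend request from administrator lsfadmin."))
print(classify("Host host1 resume request exit."))
print(classify("Host host1 power unknown."))
```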

bhosts

Use bhosts -l to display the power state of hosts. bhosts shows the power state of a host only when PowerPolicy (in lsb.resources) is enabled. If the host status becomes unknown because a power operation failed, the power state is shown as a dash (“-”).

Final power states:

on
The host power state is “on”. Note that the power state “on” does not mean that the batch host state is “ok”. The batch host state depends on whether the management host can connect to lim and sbatchd on the host.
suspend
The host is suspended by policy or manually with badmin hpower.

Intermediate power states:

The following states are displayed when mbatchd has sent a request for a power operation but the operation has not yet returned. When the operation command returns, LSF assumes that the operation is done and the intermediate state is updated.

restarting
The host is being reset because a resume operation failed.
resuming
The host is being resumed from standby state. The resume is triggered by a policy, a job, or the cluster administrator.
suspending
The host is being suspended. The suspend is triggered by a policy or the cluster administrator.

Final host state under administrator control:

closed_Power
The host was put into power saving (suspend) state by the cluster administrator.

Final host state under policy control:

ok_Power
A transitional state that is displayed while the host waits for sbatchd to resume. It tells mbatchd that the host can be considered for scheduling, although it cannot be used for jobs immediately.
A host can enter this state in two ways:
  1. The host is manually resumed (using badmin hpower resume) after it was manually suspended (using badmin hpower suspend).
  2. PowerPolicy is defined in lsb.resources and the policy suspends a member host. The state of this host is displayed as ok_Power (rather than closed_Power). This differs from suspending the host manually (with badmin hpower suspend) because a host that is suspended by the policy can still be woken by job scheduling.

Example bhosts:

HOST_NAME  STATUS        JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV
host1      closed        -     4    0      0    0      0      0
host2      ok_Power      -     4    0      0    0      0      0
host3      unavail       -     4    0      0    0      0      0

Example bhosts -w:

HOST_NAME  STATUS        JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV
host1      closed_Power  -     4    0      0    0      0      0
host2      ok_Power      -     4    0      0    0      0      0
host3      unavail       -     4    0      0    0      0      0
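
To pick out power-managed hosts from bhosts -w output, a script can group hosts by the STATUS column. The following Python sketch assumes the whitespace-separated layout shown in the example, with HOST_NAME first and STATUS second. The function name hosts_by_status is hypothetical, and the sample text is embedded rather than read from a live cluster.

```python
SAMPLE = """\
HOST_NAME  STATUS        JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV
host1      closed_Power  -     4    0      0    0      0      0
host2      ok_Power      -     4    0      0    0      0      0
host3      unavail       -     4    0      0    0      0      0
"""

def hosts_by_status(text):
    """Map each STATUS value to the list of host names that report it."""
    lines = [line for line in text.splitlines() if line.strip()]
    result = {}
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 2:
            continue
        result.setdefault(fields[1], []).append(fields[0])
    return result

by_status = hosts_by_status(SAMPLE)
# Keep only the power-related host states described in this section.
power_managed = {s: h for s, h in by_status.items() if s.endswith("_Power")}
print(power_managed)
```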

Example bhosts -l:

HOST  host1
STATUS        CPUF  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV  DISPATCH_WINDOW
closed_Power  1.00  -     4    4      4    0      0      -    -

CURRENT LOAD USED FOR SCHEDULING:
          r15s  r1m  r15m  ut  pg   io  ls  it  tmp  swp  mem    slots
Total     0.0   0.0  0.0   0%  0.0  0   0   0   31G  31G  12G    0
Reserved  0.0   0.0  0.0   0%  0.0  0   0   0   0M   0M   4096M  -

LOAD THRESHOLD USED FOR SCHEDULING:
           r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
loadSched  -     -    -     -   -   -   -   -   -    -    -
loadStop   -     -    -     -   -   -   -   -   -    -    -

POWER STATUS:  ok
IDLE TIME: 2m 12s
CYCLE TIME REMAINING: 3m 1s
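
The power fields at the end of bhosts -l output can be extracted with a small parser. The following Python sketch assumes the three labels shown in the example and assumes that durations use only h, m, and s units. Other unit formats are not covered. The function names are hypothetical.

```python
import re

POWER_KEYS = ("POWER STATUS", "IDLE TIME", "CYCLE TIME REMAINING")

def parse_power_fields(output):
    """Collect the power-related 'LABEL: value' lines from bhosts -l output."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in POWER_KEYS:
            fields[key] = value.strip()
    return fields

def parse_duration(text):
    """Convert a string such as '2m 12s' to seconds (h/m/s units assumed)."""
    units = {"h": 3600, "m": 60, "s": 1}
    return sum(int(n) * units[u] for n, u in re.findall(r"(\d+)\s*([hms])", text))

SAMPLE = """\
POWER STATUS:  ok
IDLE TIME: 2m 12s
CYCLE TIME REMAINING: 3m 1s
"""

fields = parse_power_fields(SAMPLE)
idle_seconds = parse_duration(fields["IDLE TIME"])
remaining_seconds = parse_duration(fields["CYCLE TIME REMAINING"])
print(fields["POWER STATUS"], idle_seconds, remaining_seconds)
```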

bjobs

When a job switches a host from energy saving state to working state (that is, the job has been dispatched and is waiting for the host to resume), the job state is not shown as pending. Instead, it is displayed as provisioning (PROV). For example:

bjobs

JOBID   USER    STAT   QUEUE     FROM_HOST  EXEC_HOST  JOB_NAME    SUBMIT_TIME
204     root    PROV   normal    host2      host1      sleep 9999  Jun  5 15:24

The state PROV is displayed. This state shows that the job is dispatched to a suspended host, and this host is being resumed. The job remains in PROV state until LSF dispatches the job.
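
A monitoring script can list the jobs that are waiting for a host to resume by filtering on the STAT column. The following Python sketch assumes the default bjobs column order shown in the example, where STAT is the third field and EXEC_HOST is the sixth. The function name provisioning_jobs is hypothetical, and the second sample row is invented for contrast.

```python
SAMPLE = """\
JOBID   USER    STAT   QUEUE     FROM_HOST  EXEC_HOST  JOB_NAME    SUBMIT_TIME
204     root    PROV   normal    host2      host1      sleep 9999  Jun  5 15:24
205     root    RUN    normal    host2      host3      sleep 10    Jun  5 15:25
"""  # job 205 is a hypothetical row, not taken from real output

def provisioning_jobs(text):
    """Return (job_id, exec_host) pairs for jobs in PROV state."""
    jobs = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric job ID; this skips the header row.
        if len(fields) < 6 or not fields[0].isdigit():
            continue
        if fields[2] == "PROV":
            jobs.append((fields[0], fields[5]))
    return jobs

prov_jobs = provisioning_jobs(SAMPLE)
print(prov_jobs)
```

Only the first six columns are read by position, because JOB_NAME and SUBMIT_TIME can contain spaces.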

When a job requires a host that is in energy saving state or is powered off, and LSF is switching the host to working state, the following event is appended to the bjobs -l output:

Mon Nov 5 16:40:47: Will start on 2 Hosts <host1> <host2>. Waiting for machine provisioning;

The message indicates which host is being provisioned and how many slots are requested.

bhist

When a job is dispatched to a standby host and LSF starts provisioning the host to return it to working state, two events are saved to lsb.events and lsb.streams. For example:

Tue Nov 19 01:29:20: Host is being provisioned for job. Waiting for host <xxxx> to power on;

Tue Nov 19 01:30:06: Host provisioning is done;
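
The two events make it possible to estimate how long provisioning took. The following Python sketch assumes the timestamp layout and message wording shown in the example. The messages carry no year, so the caller must supply one. The year used here and the function names are hypothetical.

```python
import re
from datetime import datetime

# Weekday, month, day, and time, then ": " and the message text.
EVENT = re.compile(
    r"^\w{3} (?P<stamp>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}): (?P<text>.*)$"
)

def parse_event(line, year):
    """Return (datetime, message) for one event line, or None."""
    m = EVENT.match(line.strip())
    if not m:
        return None
    when = datetime.strptime(f"{year} {m.group('stamp')}", "%Y %b %d %H:%M:%S")
    return when, m.group("text")

def provisioning_seconds(lines, year):
    """Seconds between the 'being provisioned' and 'done' events, or None."""
    start = end = None
    for line in lines:
        event = parse_event(line, year)
        if event is None:
            continue
        when, text = event
        if text.startswith("Host is being provisioned for job"):
            start = when
        elif text.startswith("Host provisioning is done"):
            end = when
    if start is None or end is None:
        return None
    return (end - start).total_seconds()

LINES = [
    "Tue Nov 19 01:29:20: Host is being provisioned for job. Waiting for host <xxxx> to power on;",
    "Tue Nov 19 01:30:06: Host provisioning is done;",
]
elapsed = provisioning_seconds(LINES, year=2013)  # the year is an assumption
print(elapsed)
```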

bresources

Use bresources -p to show the configured energy aware scheduling policies. For example:

bresources -p

Begin PowerPolicy
  NAME = policy_night
  HOSTS = hostGroup1 host3
  TIME_WINDOW= 23:59-5:00
  MIN_IDLE_TIME= 1800
  CYCLE_TIME= 60
  APPLIED = Yes
End PowerPolicy



Begin PowerPolicy
  NAME = policy_other
  HOSTS = all
  TIME_WINDOW= all
  APPLIED = Yes
End PowerPolicy

In the above case, “policy_night” is defined only for hostGroup1 and host3 and applies between 23:59 and 5:00. In contrast, “policy_other” covers all other hosts that are not included in the “policy_night” power policy (with the exception of management and management candidate hosts) and is in effect at all hours.
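
The bresources -p output can be read into a simple data structure for auditing. The following Python sketch assumes the Begin PowerPolicy / End PowerPolicy block layout and the KEY = value lines shown above. The function name parse_power_policies is hypothetical, and values are kept as strings because their units are not stated here.

```python
SAMPLE = """\
Begin PowerPolicy
  NAME = policy_night
  HOSTS = hostGroup1 host3
  TIME_WINDOW= 23:59-5:00
  MIN_IDLE_TIME= 1800
  CYCLE_TIME= 60
  APPLIED = Yes
End PowerPolicy

Begin PowerPolicy
  NAME = policy_other
  HOSTS = all
  TIME_WINDOW= all
  APPLIED = Yes
End PowerPolicy
"""

def parse_power_policies(text):
    """Return one dict of KEY -> value per PowerPolicy block."""
    policies = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "Begin PowerPolicy":
            current = {}
        elif line == "End PowerPolicy":
            if current is not None:
                policies.append(current)
            current = None
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return policies

policies = parse_power_policies(SAMPLE)
for policy in policies:
    print(policy["NAME"], policy["HOSTS"].split(), policy["TIME_WINDOW"])
```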