resource

The egosh resource subcommand manages resources through EGO.

resource close [-reclaim] [-c comment] [-b] resource_name …

Closes a resource, preventing further allocation. Closing a resource does not change its allocation status. If the resource is currently allocated to a consumer, the resource remains allocated until the consumer returns it voluntarily. If the resource is not currently allocated to a consumer, the resource remains in its unallocated state. Existing workload finishes running before the resource closes.

This is an administrative subcommand. You must first log on as cluster administrator before you can issue this subcommand.

-reclaim
EGO reclaims the host before it closes; running workload terminates as per the configured grace period. The host is prevented from further allocation. If the resource is currently allocated to a consumer, it is reclaimed. Once reclaimed, it is not allocated to another consumer.

After you issue this command, the host status changes to CLOSED; the reported reason is that the cluster administrator closed and reclaimed the host.

-c comment
Specifies a reason why this action was requested. The reason can comprise up to 1024 alphanumeric or special characters, except control characters (Ctrl + key) and multi-byte characters. Enclose the reason in double quotation marks (" ") if it contains spaces.
-b
Specifies closing resources in batch mode. The egosh resource command automatically combines resources and submits the request as one for all the listed resources, eliminating the need to run this command multiple times. In this mode, the command does not return individual resource action status results (for example, separate messages for host1, host2, and host3); it returns a confirmation that the action for the batch was accepted (for example, a single message listing host1, host2, and host3). Use this option if performance is an issue.
resource_name …
Specifies the name of the resource or resources to close.

To close multiple resources, separate the resource names with a space.
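
The -c constraints and the option layout above can be checked in a wrapper script before a close request is submitted. The following is a minimal Python sketch, not part of EGO. The helper names validate_comment and build_close_argv are hypothetical, "multi-byte" is read here as any character outside ASCII (an assumption), and the options are emitted in the order shown in the synopsis. The sketch only builds the argument list and does not run egosh.

```python
import unicodedata

MAX_COMMENT_LEN = 1024  # limit stated for the -c option


def validate_comment(comment):
    """Return the comment unchanged if it satisfies the -c constraints."""
    if len(comment) > MAX_COMMENT_LEN:
        raise ValueError("comment exceeds 1024 characters")
    for ch in comment:
        # Unicode category "Cc" covers control characters (Ctrl + key codes).
        if unicodedata.category(ch) == "Cc":
            raise ValueError("control characters are not allowed")
        # Assumption: "multi-byte" means any character outside ASCII.
        if ord(ch) > 127:
            raise ValueError("multi-byte characters are not allowed")
    return comment


def build_close_argv(names, comment=None, reclaim=False, batch=False):
    """Build an argument list for 'egosh resource close' (hypothetical helper)."""
    if not names:
        raise ValueError("at least one resource name is required")
    argv = ["egosh", "resource", "close"]
    if reclaim:
        argv.append("-reclaim")
    if comment is not None:
        argv += ["-c", validate_comment(comment)]
    if batch:
        argv.append("-b")
    # Resource names are separate arguments, matching the space-separated list.
    return argv + list(names)
```

Passing the resulting list to subprocess.run (rather than a single command string) means a comment that contains spaces needs no extra quotation marks, because no shell parses the arguments.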

resource group | rg [-MDS -P resource_plan [-slotmap consumer]] [-l] [group_name ... ]

Displays information about all of the resource groups in the cluster including the number of hosts in the group, the total number of slots or units, the number of free and allocated slots or units, and detailed usage information describing distribution among consumers.

Note: This subcommand limits the displayed length of the resource group name to 21 characters. If the full length of the resource group name is required, use egosh rg -l instead.
rg
Is an alias to the resource group subcommand. You can use this as a shortcut instead of typing the full subcommand name.
  • ALLOCATED: Indicates the total number of resources allocated to a consumer.
  • FREE: Indicates the total number of unused resources, including unused owned and unused shared (guaranteed) resources, as per the resource plan.
  • OWN: Indicates the configured ownership numbers, as per the resource plan.
  • SHARE: Indicates the configured share percentage among siblings, as per the resource plan.
-MDS
Displays multi-dimensional information.
-P resource_plan
Specifies the name of the multi-dimensional resource plan.
-slotmap consumer
Shows the resource group from the perspective of the consumer's slot mapping.
-l
Lists values for allocated and free slots within resource groups. Detailed usage information includes a breakdown of owned, shared, and borrowed slots (both in-use and unused slots) in the cluster:
  • OWN_USE: Indicates the number of owned resources assigned to the consumer.
  • SHARE_USE: Indicates the number of resources assigned to the consumer from the share pool.
  • BORROW_USE: Indicates the number of resources borrowed from other consumers.
  • OWN_FREE: Indicates the number of remaining (unused) owned resources, as guaranteed by the resource plan.
  • SHARE_FREE: Indicates the number of remaining (unused) share pool resources, as guaranteed by the resource plan.
Note: Values for OWN_FREE and SHARE_FREE may not add up to the actual free or total number of resources for the resource group. Some resources reflected in the number may be reclaimed resources.
group_name
Specifies the names of one or more resource groups for which you want information displayed. For example, ManagementHosts.

resource list [-l] [-ll] [-m | -s | -t | -a | -G | [-g] -o attribute,…] [-R res_req] [resource_name …]

Displays information about the resources in the cluster, listing each host and information about the resources on each host.

-l
Provides the same information with a longer name field, for cases where names are truncated when -l is not specified.
-ll
Provides the same information as the -l option, and in comma-separated values (CSV) format.
-m
Displays the list of failover candidate hosts in the cluster and identifies which host is currently the primary host.
-s
Displays summaries of the hosts in the cluster, including information on host states and resource utilization.
-t
Displays a list of host types defined in the cluster.
-a
Displays all load indices for all resources.
-g
Shows total slots (all or free) for a host, including excluded resource groups. Use with the -o option, specifying values for the slot and freeslot attributes.
-o attribute,…
Specifies the attributes to include in the display. Use this option to customize the output, including only those attributes you are interested in. For example:

resource list -o status,type,ncpus

Specify one (or more) of the following:
  • status: Current state of the host
  • type: Type of host
  • ncpus: Number of CPUs as seen by EGO (value used to determine the number of slots; can be overridden by resource group configuration)
  • ngpus: Number of GPUs as seen by EGO
  • nprocs: Number of physical processors (if ncpus defined as procs, then ncpus = nprocs)
  • ncores: Number of cores per processor (if ncpus defined as cores, then ncpus = nprocs * ncores)
  • nthreads: Number of threads per core (if ncpus defined as threads, then ncpus = nprocs * ncores * nthreads)
  • ut: CPU utilization
  • mem: Available memory
  • swp: Available swap space
  • pg: Paging rate
  • io: Disk input and output rate
  • slot: Number of slots
  • freeslot: Number of free slots
  • r15s: 15-second load
  • r15m: 15-minute load
  • r1m: 1-minute load
  • model: The host model
  • cpuf: The CPU factor
  • maxmem: Maximum memory
  • maxswp: Maximum swap space
  • tmp: Available temp space
  • maxtmp: Maximum space in the /tmp directory
  • ndisks: Number of local disks
  • it: Idle time
  • ls: Logon users
  • resourceattr: Resource attributes assigned to this host
  • processpri: The OS process priority of cluster workloads (either normal or lowest)
  • resource_name: The name of a host resource
  • (7.3.2 Fix) podman_active: The active version of Podman on the host.
  • These attributes are supported if you have enabled GPUs for your environment:
    • ngpus: Number of GPU devices
    • gpushared: Number of GPU devices in shared mode
    • gpuexclusive_thread: Number of GPU devices in exclusive mode
    • gpuexclusive_process: Number of GPU devices in exclusive process compute mode
    • gpuprohibited: Number of GPU devices in prohibited mode
    • gpucap1_0: GPU capability 1.0 count per host
    • gpucap1_1: GPU capability 1.1 count per host
    • gpucap1_2: GPU capability 1.2 count per host
    • gpucap1_3: GPU capability 1.3 count per host
    • gpucap2_plus: GPU capability 2.0 count per host
    • gpudriverversion: GPU driver version
    • gpusdkversion: GPU SDK version
    • gpumaxfactor: Maximum factor for all GPU devices
    • gputopology: GPU topology
    • gpumatrix: GPU NVLink connection information
    • gpumodedevice_number: GPU compute mode for each device; specifically:
      • gpumode0: GPU compute mode for device0
      • gpumode1: GPU compute mode for device1
      • gpumode2: GPU compute mode for device2
      • gpumode3: GPU compute mode for device3
    • gpueccdevice_number: Number of unhandled ECC errors for each device; specifically:
      • gpuecc0: Number of unhandled ECC errors for device0
      • gpuecc1: Number of unhandled ECC errors for device1
      • gpuecc2: Number of unhandled ECC errors for device2
      • gpuecc3: Number of unhandled ECC errors for device3
    • gpuutdevice_number: GPU utilization percentage for each device (where the value is a float); specifically:
      • gpuut0: GPU utilization percentage for device0
      • gpuut1: GPU utilization percentage for device1
      • gpuut2: GPU utilization percentage for device2
      • gpuut3: GPU utilization percentage for device3
    • gputempdevice_number: Core temperature for each device; specifically:
      • gputemp0: Core temperature for device0
      • gputemp1: Core temperature for device1
      • gputemp2: Core temperature for device2
      • gputemp3: Core temperature for device3
    • gpumodeldevice_number: GPU model name for each device; specifically:
      • gpumodel0: GPU model name for device0
      • gpumodel1: GPU model name for device1
      • gpumodel2: GPU model name for device2
      • gpumodel3: GPU model name for device3
    • gpumaxmemdevice_number: Total memory for each device (in MB); specifically:
      • gpumaxmem0: Total memory for device0
      • gpumaxmem1: Total memory for device1
      • gpumaxmem2: Total memory for device2
      • gpumaxmem3: Total memory for device3
    • gpumutdevice_number: GPU memory utilization for each device; specifically:
      • gpumut0: GPU memory utilization for device0
      • gpumut1: GPU memory utilization for device1
      • gpumut2: GPU memory utilization for device2
      • gpumut3: GPU memory utilization for device3
    • gpucapverdevice_number: GPU capability version for each device; specifically:
      • gpucapver0: GPU capability version for device0
      • gpucapver1: GPU capability version for device1
      • gpucapver2: GPU capability version for device2
      • gpucapver3: GPU capability version for device3
    • gpumemdevice_number: Memory available for each device (in MB); specifically:
      • gpumem0: Memory available for device0
      • gpumem1: Memory available for device1
      • gpumem2: Memory available for device2
      • gpumem3: Memory available for device3
    • gpupstatedevice_number: GPU performance state for each device; specifically:
      • gpupstate0: GPU performance state for device0
      • gpupstate1: GPU performance state for device1
      • gpupstate2: GPU performance state for device2
      • gpupstate3: GPU performance state for device3
    • gpustatusdevice_number: GPU overall status for each device; specifically:
      • gpustatus0: GPU overall status for device0
      • gpustatus1: GPU overall status for device1
      • gpustatus2: GPU overall status for device2
      • gpustatus3: GPU overall status for device3
    • gpuerrordevice_number: GPU error for each device; specifically:
      • gpuerror0: GPU error for device0
      • gpuerror1: GPU error for device1
      • gpuerror2: GPU error for device2
      • gpuerror3: GPU error for device3
    • gpubusiddevice_number: GPU bus ID for each device; specifically:
      • gpubusid0: GPU bus ID for device0
      • gpubusid1: GPU bus ID for device1
      • gpubusid2: GPU bus ID for device2
      • gpubusid3: GPU bus ID for device3
Note: You cannot use this command option to view global ncpu settings. This information can only be viewed directly in the shared copy of ego.conf.
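
The relationship between ncpus, nprocs, ncores, and nthreads given in the attribute descriptions above can be written out as a small calculation. This Python sketch is illustrative only: the function name effective_ncpus is hypothetical, the labels "procs", "cores", and "threads" are taken from the wording of the attribute descriptions, and the sketch ignores any override from resource group configuration.

```python
def effective_ncpus(defined_as, nprocs, ncores=1, nthreads=1):
    """Compute ncpus from the processor topology (hypothetical helper).

    defined_as is one of "procs", "cores", or "threads", matching how
    ncpus is defined for the host.
    """
    if defined_as == "procs":
        return nprocs
    if defined_as == "cores":
        return nprocs * ncores
    if defined_as == "threads":
        return nprocs * ncores * nthreads
    raise ValueError("defined_as must be 'procs', 'cores', or 'threads'")
```

For a host with 2 physical processors, 8 cores per processor, and 2 threads per core, the three definitions give 2, 16, and 32 respectively.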
-G
Displays one or more resource groups to which the host belongs. To view resource groups for multiple hosts, separate the host names with a space.
-R res_req
Displays information about the resources that match the resource requirement string specified.

Specify name-value pairs for the resource requirement(s). Multiple resource requirements are separated with the characters &&.

When using res_req, use select(select_string) to specify the criteria for selecting the resources. The selection string is a logical expression to select one or more resources to match one or more criteria. Any resource that satisfies the criteria is selected.

select(expression)
select(expression operator expression)
select((expression operator expression) operator expression)

The entire resource requirement string cannot contain more than 512 characters and parentheses must be entered as shown:

resource_name operator value
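
The rules above (criteria joined with &&, wrapped in select(), at most 512 characters in total) can be sketched as a small builder. This is a minimal Python illustration under stated assumptions: the function name build_select is hypothetical, each criterion is expected to already be in the form resource_name operator value, and a criterion that itself contains || should be parenthesized by the caller.

```python
MAX_RES_REQ_LEN = 512  # limit stated for the whole resource requirement string


def build_select(*criteria):
    """Join criteria with && inside select(...) and enforce the length limit."""
    cleaned = [c.strip() for c in criteria if c.strip()]
    if not cleaned:
        raise ValueError("at least one criterion is required")
    res_req = "select(" + " && ".join(cleaned) + ")"
    if len(res_req) > MAX_RES_REQ_LEN:
        raise ValueError("resource requirement exceeds 512 characters")
    return res_req
```

For example, build_select("mem>100", "it>1") produces select(mem>100 && it>1), which can then be passed as the value of -R.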

resource_name
The following resources can be used as selection criteria.
Static resources
Static resources are built-in resources that represent host information that does not change over time, such as the maximum RAM available to user processes or the number of processors in a machine. Most static resources are determined at start-up time, or when hardware configuration changes are detected.

Static resources can be used to select appropriate hosts based on binary architecture, relative CPU speed, and system configuration.

Note: The resources ncpus, ncores, nprocs, nthreads, maxmem, maxswp, and maxtmp are not static on Linux® hosts that support dynamic hardware reconfiguration.

Table 1. Static resources

Index    Measures                   Units             Determined by
type     host type                  string            configuration
model    host model                 string            configuration
hname    host name                  string            configuration
cpuf     CPU factor                 relative          configuration
server   host can run remote jobs   Boolean           configuration
rexpri   execution priority         nice(2) argument  configuration
ncpus    number of processors       processors        LIM
ndisks   number of local disks      disks             LIM
maxmem   maximum RAM                MB                LIM
maxswp   maximum swap space         MB                LIM
maxtmp   maximum space in /tmp      MB                LIM


CPU factor (cpuf)
The CPU factor is the speed of the host’s CPU relative to other hosts in the cluster. If one processor is twice the speed of another, its CPU factor should be twice as large. CPU factors are defined by the cluster administrator. For multiprocessor hosts, the CPU factor is the speed of a single processor.
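
As a worked illustration of the "twice the speed, twice the factor" rule, the following Python sketch derives relative CPU factors from the time each host takes to run the same benchmark. Deriving factors from benchmark runtimes is an assumption made for this example, not a procedure described in this reference; the function name relative_cpu_factors and the host names are hypothetical, and the cluster administrator still decides the values that are actually configured.

```python
def relative_cpu_factors(runtimes, baseline):
    """Map each host to a CPU factor relative to the baseline host.

    runtimes maps host name -> seconds taken by a single processor to
    run the same benchmark; a host that finishes in half the time gets
    twice the factor.
    """
    base = runtimes[baseline]
    factors = {}
    for host, seconds in runtimes.items():
        if seconds <= 0:
            raise ValueError("runtime must be positive")
        factors[host] = base / seconds
    return factors
```

With runtimes of 120 seconds on hostA and 60 seconds on hostB, and hostA as the baseline, hostA gets a factor of 1.0 and hostB a factor of 2.0.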
Server
The server static resource is Boolean. It has the following values:
  • 1 if the host is configured to run jobs from other hosts.
  • 0 if the host is a client for submitting jobs to other hosts.
operator
The following operators can be used in selection strings. If you are using the selection string in an XML format, you must use the applicable escape characters in the XML equivalent column. The operators are listed in order of decreasing precedence:
Table 2. Selection string operators
Operator  XML equivalent  Syntax   Meaning
!         Not applicable  !a       Logical NOT: 1 if a==0, 0 otherwise
*         Not applicable  a*b      Multiply a and b
/         Not applicable  a / b    Divide a by b
+         Not applicable  a+b      Add a and b
-         Not applicable  a-b      Subtract b from a
>         &gt;            a > b    1 if a is greater than b, 0 otherwise
<         &lt;            a < b    1 if a is less than b, 0 otherwise
>=        &gt;=           a >= b   1 if a is greater than or equal to b, 0 otherwise
<=        &lt;=           a <= b   1 if a is less than or equal to b, 0 otherwise
==        Not applicable  a == b   1 if a is equal to b, 0 otherwise
!=        Not applicable  a != b   1 if a is not equal to b, 0 otherwise
&&        &amp;&amp;      a && b   Logical AND: 1 if both a and b are non-zero, 0 otherwise
||        Not applicable  a || b   Logical OR: 1 if either a or b is non-zero, 0 otherwise
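
When a selection string is embedded in XML, the escaping in the table above can be applied programmatically rather than by hand. This minimal Python sketch uses the standard library function xml.sax.saxutils.escape, which replaces &, <, and > with their XML entities; the wrapper name selection_string_for_xml is hypothetical.

```python
from xml.sax.saxutils import escape


def selection_string_for_xml(selection):
    """Escape &, < and > so the selection string is valid XML text."""
    return escape(selection)


escaped = selection_string_for_xml("select(mem>=100 && it<5)")
print(escaped)  # select(mem&gt;=100 &amp;&amp; it&lt;5)
```

Operators listed as "Not applicable" (such as ! and ||) contain no characters that need escaping, so they pass through unchanged.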

value
Specifies the value to be used as criteria for selecting a resource. Value can be numerical, such as when referring to available memory or swap space, or it can be textual, such as when referring to a specific type of host.
Important: If the command is issued in whole from the shell console or the requirement has white space, enclose the requirement in double quotation marks. For example:
>egosh resource list -R "select(mem>100 && it>1)"
If the command is issued from the egosh console, do not use quotation marks. For example:
>egosh
>resource list -R select(mem>100)
Tip: To display all hosts currently enabled to log core hours, run the following command:
egosh resource list -l -R corehoursaudit
Sample output:
NAME                   status        mem    swp    tmp   ut    it    pg   r1m  r15s  r15m  ls
ib15b02                ok            45G     0M   889G   4%   258   0.0   0.9   1.2   0.9   2
To display all hosts currently not logging core hours, run the following command:
egosh resource list -l -R "select(!corehoursaudit)"
Use single quotation marks if you are running on Linux with bash and have an exclamation mark (!) in the string:
egosh resource list -l -R 'select (!corehoursaudit)'
Sample output:
NAME                   status        mem    swp    tmp   ut    it    pg   r1m  r15s  r15m  ls
Master1                ok            21G    45G   207G   3%     0  12.5   0.3   0.0   0.3   egosh
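
When a resource requirement is assembled by a script, the quoting rules above can be delegated to the standard library. The following Python sketch is an illustration only: the helper name shell_command_for is hypothetical, and it relies on shlex.quote, which wraps any argument containing shell-special characters (spaces, parentheses, !, &, >) in single quotation marks. Single quotation marks are used instead of the double quotation marks described above; in bash they also keep ! from being interpreted.

```python
import shlex


def shell_command_for(res_req, long_names=False):
    """Return a shell-safe 'egosh resource list -R ...' command line."""
    argv = ["egosh", "resource", "list"]
    if long_names:
        argv.append("-l")
    argv += ["-R", res_req]
    # Quote each argument individually, then join with spaces.
    return " ".join(shlex.quote(arg) for arg in argv)


print(shell_command_for("select(!corehoursaudit)", long_names=True))
# egosh resource list -l -R 'select(!corehoursaudit)'
```

If the command is started with subprocess.run and an argument list instead of a command string, no shell is involved and no quoting is needed.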

resource open resource_name …

Opens the specified resource, allowing it to accept requests.

This is an administrative subcommand. You must first log on as cluster administrator before you can issue this subcommand.

resource_name …
Specifies the name of the resource or resources to open.

To open multiple resources, separate the resource names with a space.

resource remove [-b] resource_name …

Removes the specified resource (host) from the cluster. EGO is also shut down if the host is closed. To be removed, a host must have joined the cluster dynamically and must now be either unavailable or closed with no running workload.

This is an administrative subcommand. You must first log on as cluster administrator before you can issue this subcommand.

-b
Specifies removing resources in batch mode. The egosh resource command automatically combines resources and submits the request as one for all the listed resources, eliminating the need to run this command multiple times. In this mode, the command does not return individual resource action status results (for example, separate messages for host1, host2, and host3); it returns a confirmation that the action for the batch was accepted (for example, a single message listing host1, host2, and host3). Use this option if performance is an issue.
resource_name …
Specifies the name of the host or hosts to remove.

To remove multiple hosts, separate the host names with a space.

resource view [resource_name …]

Displays all the information about all resources.

resource_name …
Specifies the name of the resource or resources you want to view.

Displays information about the specified resource or resources.

To view multiple resources, separate the resource names with a space.

resource updaterg

Checks all hosts and all resource groups for resource attribute changes to existing hosts (for example, resource attributes added to or removed from hosts). If there are changes, this subcommand adds the host to, or removes it from, resource groups, and then updates the resource group membership status to reflect the current status, as follows:
  • If the changed resource does not meet all of the resource requirements (resreq), then the subcommand removes this host from the resource group and reclaims the workload from this resource group.
  • If the changed resource meets at least some of the resource requirements, then the subcommand adds the host to a new resource group.

Note that there are no options or arguments for this subcommand. Simply run egosh resource updaterg to update all hosts and all resource groups.

By default, resource group membership status is not checked and updated automatically. To enable automatic updates, set EGO_ENABLE_RG_UPDATE_MEMBERSHIP=Y and EGO_RG_UPDATE_MEMBERSHIP_INTERVAL=time_in_seconds in the ego.conf file. For details, see ego.conf reference. If automatic updates are not enabled, run egosh resource updaterg to update manually.
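
To confirm whether automatic membership updates are switched on, the two parameters can be read back from the configuration text. This Python sketch is a rough illustration, not an EGO tool: it assumes ego.conf consists of NAME=value lines with # starting a comment, the helper names parse_conf and rg_auto_update are hypothetical, and the interval of 300 seconds in the sample is an arbitrary value chosen for the example.

```python
def parse_conf(text):
    """Parse NAME=value lines into a dict (assumed ego.conf layout)."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip().strip('"')
    return settings


def rg_auto_update(settings):
    """Return (enabled, interval_in_seconds or None)."""
    enabled = settings.get("EGO_ENABLE_RG_UPDATE_MEMBERSHIP", "").upper() == "Y"
    raw = settings.get("EGO_RG_UPDATE_MEMBERSHIP_INTERVAL")
    interval = int(raw) if raw is not None else None
    return enabled, interval


sample = """
# illustrative fragment, values are examples only
EGO_ENABLE_RG_UPDATE_MEMBERSHIP=Y
EGO_RG_UPDATE_MEMBERSHIP_INTERVAL=300
"""
print(rg_auto_update(parse_conf(sample)))  # (True, 300)
```

If the first element of the result is False, membership is only refreshed when egosh resource updaterg is run manually.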