resource
The egosh resource subcommand manages resources through EGO.
resource close [-reclaim] [-c comment] [-b] resource_name …
Closes a resource, preventing further allocation. Closing a resource does not change its allocation status. If the resource is currently allocated to a consumer, the resource remains allocated until the consumer returns it voluntarily. If the resource is not currently allocated to a consumer, the resource remains in its unallocated state. Existing workload finishes running before the resource closes.
This is an administrative subcommand. You must first log on as cluster administrator before you can issue this subcommand.
- -reclaim
- EGO reclaims the host before it closes; running workload terminates as per the configured grace
period. The host is prevented from further allocation. If the resource is currently allocated to a
consumer, it is reclaimed. Once reclaimed, it is not allocated to another consumer.
After issuing this command, the host status changes to CLOSED; the reported reason is that the cluster administrator closes and reclaims the host.
- -c comment
- Specifies a reason why this action was requested. The reason can comprise up to 1024
alphanumeric or special characters, except control characters (Ctrl + key) and multi-byte
characters. The description must be enclosed in double quotation marks (" ").
- -b
- Specifies closing resources in batch mode. The egosh resource command automatically combines resources and submits the request as one for all the listed resources, eliminating the need to run this command multiple times. In this mode, the command does not return individual resource action status results (for example, separate messages for host1, host2, and host3); it returns a confirmation that the action for the batch was accepted (for example, a single message listing host1, host2, and host3). Use this option if performance is an issue.
- resource_name …
- Specifies the name of the resource or resources to close.
To close multiple resources, separate the resource names with a space.
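The option rules above can be sketched as a small Python helper that assembles the argument list for `egosh resource close`. This is only an illustration: the function names are hypothetical, and treating every non-ASCII character as "multi-byte" is an assumption rather than something the reference states.

```python
import unicodedata

MAX_COMMENT_LEN = 1024  # limit stated for the -c option


def validate_comment(comment):
    """Raise ValueError if the comment breaks the documented -c rules."""
    if len(comment) > MAX_COMMENT_LEN:
        raise ValueError("comment exceeds 1024 characters")
    for ch in comment:
        # Unicode category "Cc" covers control characters (Ctrl + key codes).
        if unicodedata.category(ch) == "Cc":
            raise ValueError("comment contains a control character")
        # Assumption: anything outside ASCII counts as a multi-byte character.
        if ord(ch) > 127:
            raise ValueError("comment contains a multi-byte character")


def build_close_command(resources, reclaim=False, comment=None, batch=False):
    """Return the argv list for `egosh resource close` (hypothetical helper)."""
    if not resources:
        raise ValueError("at least one resource name is required")
    argv = ["egosh", "resource", "close"]
    if reclaim:
        argv.append("-reclaim")
    if comment is not None:
        validate_comment(comment)
        # Passed as a single argv element; in a shell it would be double-quoted.
        argv += ["-c", comment]
    if batch:
        argv.append("-b")
    argv += list(resources)
    return argv
```

For example, `build_close_command(["host1", "host2"], reclaim=True, comment="maintenance", batch=True)` yields one batch request for both hosts. The list form can be handed to `subprocess.run` without shell quoting concerns.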
resource group | rg [-MDS -P resource_plan [-slotmap consumer]] [-l] [group_name ... ]
Displays information about all of the resource groups in the cluster including the number of hosts in the group, the total number of slots or units, the number of free and allocated slots or units, and detailed usage information describing distribution among consumers.
- rg
- Is an alias to the resource group subcommand. You can use this as a shortcut
instead of typing the full subcommand name.
- ALLOCATED: Indicates the total number of resources allocated to a consumer.
- FREE: Indicates the total number of unused resources, including unused owned and unused shared (guaranteed), as per the resource plan.
- OWN: Indicates the configured ownership numbers, as per the resource plan.
- SHARE: Indicates the configured share percentage among siblings, as per the resource plan.
- -MDS
- Displays multi-dimensional information.
- -P resource_plan
- Specifies the name of the multi-dimensional resource plan.
- -slotmap consumer
- Shows resource group from the view of the consumer's slot-mapping.
- -l
- Lists values for allocated and free slots within resource groups. Detailed usage information
includes breakdown of owned, shared, and borrowed slots (both in-use and unused slots) in the cluster:
- OWN_USE: Indicates number of owned resources assigned to consumer.
- SHARE_USE: Indicates number of resources assigned to consumer from share pool.
- BORROW_USE: Indicates number of resources borrowed from other consumers.
- OWN_FREE: Indicates number of remaining (unused) owned resources as guaranteed from resource plan.
- SHARE_FREE: Indicates number of remaining (unused) share pool resources as guaranteed from resource plan.
Note: Values for OWN_FREE and SHARE_FREE may not add up to the actual free
or total number of resources for the resource group. Some resources reflected in the number may be reclaimed resources.
- group_name
- Specifies the names of one or more resource groups for which you want information displayed. For example, ManagementHosts.
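The synopsis nests `-slotmap` inside the `-MDS -P` group, so `-slotmap` only makes sense when a resource plan is named. A minimal Python sketch of that dependency follows; the helper name and its parameters are hypothetical, and the function only builds an argument list without running anything.

```python
def build_rg_command(groups=(), plan=None, slotmap=None, long_format=False):
    """Return the argv list for `egosh resource group` (hypothetical helper)."""
    if slotmap is not None and plan is None:
        # The synopsis shows -slotmap only inside the [-MDS -P resource_plan ...] group.
        raise ValueError("-slotmap requires -MDS -P resource_plan")
    argv = ["egosh", "resource", "group"]
    if plan is not None:
        argv += ["-MDS", "-P", plan]
        if slotmap is not None:
            argv += ["-slotmap", slotmap]
    if long_format:
        argv.append("-l")
    argv += list(groups)
    return argv
```

Calling `build_rg_command(["ManagementHosts"], long_format=True)` produces the arguments for a detailed listing of the ManagementHosts group.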
resource list [-l] [-ll] [-m | -s | -t | -a | -G | [-g] -o attribute,…] [-R res_req] [resource_name …]
Displays information about the resources in the cluster, listing each host and information about the resources on each host.
- -l
- Provides the same information with a longer name field, which is useful if some names are truncated when -l is not specified.
- -ll
- Provides the same information as the -l option, and in comma-separated values (CSV) format.
- -m
- Displays the list of failover candidate hosts in the cluster and identifies which host is currently the primary host.
- -s
- Displays summaries of the hosts in the cluster, including information on host states and resource utilization.
- -t
- Displays a list of host types defined in the cluster.
- -a
- Displays all load indices for all resources.
- -g
- Shows total slots (all or free) for a host, including excluded resource groups. Use with the -o option, specifying values for the slot and freeslot attributes.
- -o attribute,…
- Specifies the attributes to include in the display. Use this option to customize the output,
including only those attributes you are interested in. For example:
resource list -o status,type,ncpus
Specify one (or more) of the following:
- status: Current state of the host
- type: Type of host
- ncpus: Number of CPUs as seen by EGO (value used to determine the number of slots; can be overridden by resource group configuration)
- ngpus: Number of GPUs as seen by EGO
- nprocs: Number of physical processors (if ncpus defined as procs, then ncpus = nprocs)
- ncores: Number of cores per processor (if ncpus defined as cores, then ncpus = nprocs * ncores)
- nthreads: Number of threads per core (if ncpus defined as threads, then ncpus = nprocs * ncores * nthreads)
- ut: CPU utilization
- mem: Available memory
- swp: Available swap space
- pg: Paging rate
- io: Disk input and output rate
- slot: Number of slots
- freeslot: Number of free slots
- r15s: 15-second load
- r15m: 15-minute load
- r1m: 1-minute load
- model: The host model
- cpuf: The CPU factor
- maxmem: Maximum memory
- maxswp: Maximum swap space
- tmp: Available temp space
- maxtmp: Maximum space in the /tmp directory
- ndisks: Number of local disks
- it: Idle time
- ls: Logon users
- resourceattr: Resource attributes assigned to this host
- processpri: The OS process priority of cluster workloads (either normal or lowest)
- resource_name: The name of a host resource
- podman_active: The active version of Podman on the host.
- These attributes are supported if you have enabled GPUs for your
environment:
- ngpus: Number of GPU devices
- gpushared: Number of GPU devices in shared mode
- gpuexclusive_thread: Number of GPU devices in exclusive mode
- gpuexclusive_process: Number of GPU devices in exclusive process compute mode
- gpuprohibited: Number of GPU devices in prohibited mode
- gpucap1_0: GPU capability 1.0 count per host
- gpucap1_1: GPU capability 1.1 count per host
- gpucap1_2: GPU capability 1.2 count per host
- gpucap1_3: GPU capability 1.3 count per host
- gpucap2_plus: GPU capability 2.0 count per host
- gpudriverversion: GPU driver version
- gpusdkversion: GPU SDK version
- gpumaxfactor: Maximum factor for all GPU devices
- gputopology: GPU topology
- gpumatrix: GPU NVLink connection information
- gpumodedevice_number: GPU compute mode for each device; specifically:
- gpumode0: GPU compute mode for device0
- gpumode1: GPU compute mode for device1
- gpumode2: GPU compute mode for device2
- gpumode3: GPU compute mode for device3
- gpueccdevice_number: Number of unhandled ECC errors for each device;
specifically:
- gpuecc0: Number of unhandled ECC errors for device0
- gpuecc1: Number of unhandled ECC errors for device1
- gpuecc2: Number of unhandled ECC errors for device2
- gpuecc3: Number of unhandled ECC errors for device3
- gpuutdevice_number: GPU utilization percentage for each device (where the
value is a float); specifically:
- gpuut0: GPU Utilization percentage for device0
- gpuut1: GPU Utilization percentage for device1
- gpuut2: GPU Utilization percentage for device2
- gpuut3: GPU Utilization percentage for device3
- gputempdevice_number: Core temperature for each device; specifically:
- gputemp0: Core temperature for device0
- gputemp1: Core temperature for device1
- gputemp2: Core temperature for device2
- gputemp3: Core temperature for device3
- gpumodeldevice_number: GPU model name for each device; specifically:
- gpumodel0: GPU model name for device0
- gpumodel1: GPU model name for device1
- gpumodel2: GPU model name for device2
- gpumodel3: GPU model name for device3
- gpumaxmemdevice_number: Total memory for each device (in MB);
specifically:
- gpumaxmem0: Total memory for device0
- gpumaxmem1: Total memory for device1
- gpumaxmem2: Total memory for device2
- gpumaxmem3: Total memory for device3
- gpumutdevice_number: GPU memory utilization for each device; specifically:
- gpumut0: GPU memory utilization for device0
- gpumut1: GPU memory utilization for device1
- gpumut2: GPU memory utilization for device2
- gpumut3: GPU memory utilization for device3
- gpucapverdevice_number: GPU capability version for each device;
specifically:
- gpucapver0: GPU capability version for device0
- gpucapver1: GPU capability version for device1
- gpucapver2: GPU capability version for device2
- gpucapver3: GPU capability version for device3
- gpumemdevice_number: Memory available for each device (in MB);
specifically:
- gpumem0: Memory available for device0
- gpumem1: Memory available for device1
- gpumem2: Memory available for device2
- gpumem3: Memory available for device3
- gpupstatedevice_number: GPU performance state for each device;
specifically:
- gpupstate0: GPU performance state for device0
- gpupstate1: GPU performance state for device1
- gpupstate2: GPU performance state for device2
- gpupstate3: GPU performance state for device3
- gpustatusdevice_number: GPU overall status for each device; specifically:
- gpustatus0: GPU overall status for device0
- gpustatus1: GPU overall status for device1
- gpustatus2: GPU overall status for device2
- gpustatus3: GPU overall status for device3
- gpuerrordevice_number: GPU error for each device; specifically:
- gpuerror0: GPU error for device0
- gpuerror1: GPU error for device1
- gpuerror2: GPU error for device2
- gpuerror3: GPU error for device3
- gpubusiddevice_number: GPU bus ID for each device; specifically:
- gpubusid0: GPU bus ID for device0
- gpubusid1: GPU bus ID for device1
- gpubusid2: GPU bus ID for device2
- gpubusid3: GPU bus ID for device3
Note: You cannot use this command option to view global ncpu settings. This information can only be viewed directly in the shared copy of ego.conf.
- -G
- Displays one or more resource groups to which the host belongs. To view resource groups for multiple hosts, separate the host names with a space.
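The relationship between the ncpus, nprocs, ncores, and nthreads attributes listed under the -o option can be written out as a short Python sketch. The function name and the `defined_as` labels are hypothetical, and the host figures in the usage note are invented for illustration.

```python
def expected_ncpus(nprocs, ncores, nthreads, defined_as):
    """Return ncpus for a host, given how ncpus is defined for the cluster.

    defined_as is one of "procs", "cores", or "threads" (illustrative labels).
    """
    if defined_as == "procs":
        return nprocs
    if defined_as == "cores":
        return nprocs * ncores
    if defined_as == "threads":
        return nprocs * ncores * nthreads
    raise ValueError(f"unknown ncpus definition: {defined_as!r}")
```

For a hypothetical host with 2 processors, 8 cores per processor, and 2 threads per core, the three definitions give 2, 16, and 32 respectively. Resource group configuration can still override the value used for slot counts, as noted in the ncpus description.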
- -R res_req
- Displays information about the resources that match the resource requirement string
specified.
Specify name-value pairs for the resource requirement(s). Multiple resource requirements are separated with the characters &&.
When using res_req, use select(select_string) to specify the criteria for selecting the resources. The selection string is a logical expression to select one or more resources to match one or more criteria. Any resource that satisfies the criteria is selected.
select(expression)
select(expression operator expression)
select((expression operator expression) operator expression)
The entire resource requirement string cannot contain more than 512 characters, and parentheses must be entered as shown above. Each expression takes the following form:
resource_name operator value
- resource_name
- The following resources can be used as selection criteria.
- Static resources
- Static resources are built-in resources that represent host information that does not change
over time, such as the maximum RAM available to user processes or the number of processors in a
machine. Most static resources are determined at start-up time, or when hardware configuration
changes are detected.
Static resources can be used to select appropriate hosts based on binary architecture, relative CPU speed, and system configuration.
Note: The resources ncpus, ncores, nprocs, nthreads, maxmem, maxswp, and maxtmp are not static on Linux® hosts that support dynamic hardware reconfiguration.
Table 1. Static resources
Index   Measures                  Units             Determined by
type    host type                 string            configuration
model   host model                string            configuration
hname   host name                 string            configuration
cpuf    CPU factor                relative          configuration
server  host can run remote jobs  Boolean           configuration
rexpri  execution priority        nice(2) argument  configuration
ncpus   number of processors      processors        LIM
ndisks  number of local disks     disks             LIM
maxmem  maximum RAM               MB                LIM
maxswp  maximum swap space        MB                LIM
maxtmp  maximum space in /tmp     MB                LIM
- CPU factor (cpuf)
- The CPU factor is the speed of the host’s CPU relative to other hosts in the cluster. If one processor is twice the speed of another, its CPU factor should be twice as large. CPU factors are defined by the cluster administrator. For multiprocessor hosts, the CPU factor is the speed of a single processor.
- Server
- The server static resource is Boolean. It has the following values:
- 1 if the host is configured to run jobs from other hosts.
- 0 if the host is a client for submitting jobs to other hosts.
- operator
- The following operators can be used in selection strings. If you are using the selection string
in an XML format, you must use the applicable escape characters in the XML equivalent column. The
operators are listed in order of decreasing precedence:
Table 2. Selection string operators
Operator  XML equivalent  Syntax  Meaning
!         Not applicable  !a      Logical NOT: 1 if a==0, 0 otherwise
*         Not applicable  a*b     Multiply a and b
/         Not applicable  a / b   Divide a by b
+         Not applicable  a+b     Add a and b
-         Not applicable  a-b     Subtract b from a
>         &gt;            a > b   1 if a is greater than b, 0 otherwise
<         &lt;            a < b   1 if a is less than b, 0 otherwise
>=        &gt;=           a >= b  1 if a is greater than or equal to b, 0 otherwise
<=        &lt;=           a <= b  1 if a is less than or equal to b, 0 otherwise
==        Not applicable  a == b  1 if a is equal to b, 0 otherwise
!=        Not applicable  a != b  1 if a is not equal to b, 0 otherwise
&&        &amp;&amp;      a && b  Logical AND: 1 if both a and b are non-zero, 0 otherwise
||        Not applicable  a || b  Logical OR: 1 if either a or b is non-zero, 0 otherwise
- value
- Specifies the value to be used as criteria for selecting a resource. Value can be numerical, such as when referring to available memory or swap space, or it can be textual, such as when referring to a specific type of host.
Important: If the command is issued in whole from the shell console or the requirement has white space, enclose the requirement in double quotation marks. For example:
>egosh resource list -R "select(mem>100 && it>1)"
If the command is issued from the egosh console, do not use quotation marks. For example:
>egosh
>resource list -R select(mem>100)
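The selection-string rules can be sketched in Python as a helper that joins expressions, enforces the 512-character limit, and quotes the result for a shell. The function names are hypothetical. Note that `shlex.quote` wraps the string in single quotes rather than the double quotes shown above; single quotes also protect `!` in bash, so this is a deliberate substitution, not the form the reference prescribes.

```python
import shlex

MAX_RES_REQ_LEN = 512  # limit stated for the whole requirement string


def build_select(*expressions, joiner="&&"):
    """Join 'resource_name operator value' expressions into select(...)."""
    if not expressions:
        raise ValueError("at least one expression is required")
    res_req = "select(" + f" {joiner} ".join(expressions) + ")"
    if len(res_req) > MAX_RES_REQ_LEN:
        raise ValueError("resource requirement exceeds 512 characters")
    return res_req


def shell_command(res_req):
    """Return a command line for a POSIX shell with the requirement quoted."""
    return "egosh resource list -R " + shlex.quote(res_req)


req = build_select("mem>100", "it>1")
print(req)                 # prints: select(mem>100 && it>1)
print(shell_command(req))  # prints: egosh resource list -R 'select(mem>100 && it>1)'
```

When the string is typed inside the egosh console instead, the unquoted value from `build_select` is the one to use.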
Tip: To display all hosts currently enabled to log core hours, run the following command:
egosh resource list -l -R corehoursaudit
Sample output:
NAME     status  mem  swp  tmp   ut  it   pg   r1m  r15s  r15m  ls
ib15b02  ok      45G  0M   889G  4%  258  0.0  0.9  1.2   0.9   2
To display all hosts currently not logging core hours, run the following command:
egosh resource list -l -R select(!corehoursaudit)
Use single quotation marks if you are running on Linux with bash and have an exclamation mark (!) in the string:
egosh resource list -l -R 'select (!corehoursaudit)'
Sample output:
NAME     status  mem  swp  tmp   ut  it  pg    r1m  r15s  r15m  ls
Master1  ok      21G  45G  207G  3%  0   12.5  0.3  0.0   0.3
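Output in the tabular form shown in the samples can be turned into records with a few lines of Python. This sketch assumes the first non-empty line is the header and that no value contains white space; real output may not always satisfy that, in which case the CSV output of the -ll option is likely a safer input. The function name is hypothetical, and the sample text is the first sample row from above.

```python
def parse_resource_list(text):
    """Parse whitespace-separated `resource list` output into a list of dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        values = line.split()
        # zip() stops at the shorter sequence, so a short row simply lacks
        # the trailing keys instead of raising an error.
        rows.append(dict(zip(header, values)))
    return rows


sample = """\
NAME status mem swp tmp ut it pg r1m r15s r15m ls
ib15b02 ok 45G 0M 889G 4% 258 0.0 0.9 1.2 0.9 2
"""
hosts = parse_resource_list(sample)
print(hosts[0]["NAME"], hosts[0]["mem"])  # prints: ib15b02 45G
```

All values stay as strings; converting figures such as `45G` or `4%` to numbers is left to the caller.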
resource open resource_name …
Opens the specified resource, allowing it to accept requests.
This is an administrative subcommand. You must first log on as cluster administrator before you can issue this subcommand.
- resource_name …
- Specifies the name of the resource or resources to open.
To open multiple resources, separate the resource names with a space.
resource remove [-b] resource_name …
Removes the specified resource (host) from the cluster. EGO is also shut down if the host is closed. A host can be removed only if it joined the cluster dynamically and is now either unavailable or closed with no running workload.
This is an administrative subcommand. You must first log on as cluster administrator before you can issue this subcommand.
- -b
- Specifies removing resources in batch mode. The egosh resource command automatically combines resources and submits the request as one for all the listed resources, eliminating the need to run this command multiple times. In this mode, the command does not return individual resource action status results (for example, separate messages for host1, host2, and host3); it returns a confirmation that the action for the batch was accepted (for example, a single message listing host1, host2, and host3). Use this option if performance is an issue.
- resource_name …
- Specifies the name of the host or hosts to remove.
To remove multiple hosts, separate the host names with a space.
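The removal precondition can be expressed as a small predicate. The following Python sketch is illustrative only: the `HostState` type, its field names, and the status labels `"unavail"` and `"closed"` are assumptions made for the example, not values taken from egosh output.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    joined_dynamically: bool  # True if the host joined the cluster dynamically
    status: str               # assumed labels: "ok", "closed", "unavail"
    running_workload: int     # count of workload units still running


def can_remove(host):
    """Return True if the host meets the documented removal precondition."""
    if not host.joined_dynamically:
        return False
    if host.status == "unavail":
        return True
    # A closed host qualifies only when nothing is still running on it.
    return host.status == "closed" and host.running_workload == 0
```

A script could filter a host list with this check before passing the remaining names to `egosh resource remove -b` in one batch.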
resource view [resource_name …]
Displays all the information about all resources.
- resource_name …
- Specifies the name of the resource or resources you want to view.
Displays information about the specified resource or resources.
To view multiple resources, separate the resource names with a space.
resource updaterg
- If the changed resource does not meet all of the resource requirements (resreq), then the subcommand removes this host from the resource group and reclaims the workload from this resource group.
- If the changed resource meets at least some of the resource requirements, then the subcommand adds the host to a new resource group.
Note that there are no options or arguments for this subcommand. Simply run egosh resource updaterg to update all hosts and all resource groups.
By default, automatically checking and updating the resource group membership status to keep it current is not enabled. To enable it, set EGO_ENABLE_RG_UPDATE_MEMBERSHIP=Y and EGO_RG_UPDATE_MEMBERSHIP_INTERVAL=time_in_seconds in the ego.conf file. For details, see ego.conf reference. If not enabled, run egosh resource updaterg to update manually.
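Whether automatic membership updates are switched on can be checked by reading the two parameters from the configuration text. The Python sketch below assumes ego.conf holds plain KEY=VALUE lines with `#` comments; the function name is hypothetical and the interval of 300 seconds in the usage note is an arbitrary example value.

```python
def rg_auto_update_settings(conf_text):
    """Return (enabled, interval_seconds) from ego.conf-style KEY=VALUE text."""
    settings = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    # The feature is off unless the parameter is explicitly set to Y.
    enabled = settings.get("EGO_ENABLE_RG_UPDATE_MEMBERSHIP", "N").upper() == "Y"
    raw_interval = settings.get("EGO_RG_UPDATE_MEMBERSHIP_INTERVAL")
    interval = int(raw_interval) if raw_interval is not None else None
    return enabled, interval
```

With the text `EGO_ENABLE_RG_UPDATE_MEMBERSHIP=Y` and `EGO_RG_UPDATE_MEMBERSHIP_INTERVAL=300` on separate lines, the function returns `(True, 300)`; with neither parameter present it returns `(False, None)`, which is the case where `egosh resource updaterg` has to be run manually.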