Adding a GPU worker node to a Red Hat OpenShift cluster on AWS
Before you begin
Ensure that you meet the following requirements:
- A GPU worker node is added to the Red Hat® OpenShift® cluster. Current AWS Maximo® Application Suite BYOL offerings do not include nodes with GPU.
- A control shell.
The control shell can be the boot node. Locate the control shell in the list of all instances in the EC2 dashboard. If the boot node is in a stopped state, restart the instance. Connect to this instance as the ec2-user.
Tip: You can use Visual Studio Code to remotely connect to the boot node, but it is not necessary.
  - oc
  - jq
- The appropriate EC2 GPU instance is selected and has sufficient availability in the region where the Maximo Application Suite instance is installed.
Obtain this information from the instance type page that is located under the EC2 service in the AWS console.
For example, if you deployed the Maximo Application Suite instance in the us-east-1 region, go to the EC2 instance type page for that region on the AWS website. The instance type page details the compute, networking, storage, accelerators, and pricing information. The networking section details the availability zones.
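The availability check can also be scripted from the control shell. The following is a minimal sketch, assuming that the AWS CLI is installed and configured there, which this topic does not state. The function name gpu_zones is hypothetical.

```shell
# Hypothetical helper: list the availability zones in a region that offer
# a given EC2 instance type, one zone per line.
# Assumption: the AWS CLI is installed and has credentials that allow
# the ec2:DescribeInstanceTypeOfferings action.
gpu_zones() {
  local instance_type="$1"   # for example, p3.2xlarge
  local region="$2"          # for example, us-east-1
  aws ec2 describe-instance-type-offerings \
    --location-type availability-zone \
    --filters "Name=instance-type,Values=${instance_type}" \
    --region "${region}" \
    --query 'InstanceTypeOfferings[].Location' \
    --output text | tr '\t' '\n' | sort
}

# Example call (uncomment to run against your account):
# gpu_zones p3.2xlarge us-east-1
```

If the zone that you plan to use is not in the list, choose one of the listed zones for the machine set.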
About this task
For more information about the processes in this task, see:
- AWS Recommended GPU Instances
Note: AWS offers EC2 instances that come with GPUs. Use p3.2xlarge as the EC2 instance type for MVI.
- Install & use GPU on AWS
- Creating a machine set on AWS
Procedure
What to do next
To verify that the process completed successfully, or to investigate errors, run the following commands in the openshift-machine-api namespace.
The oc create -f <machine set custom resource> command (step 12) always reports that the machine is created. However, if machine creation fails, the machine set does not become ready and available. Run the oc get machine command to see a failure immediately. The PHASE column in the output shows whether each machine is provisioned:
NAME                                              PHASE     TYPE         REGION      ZONE         AGE
masocp-qxkeml-wh7px-master-0                      Running   m5.2xlarge   us-east-1   us-east-1a   18h
masocp-qxkeml-wh7px-master-1                      Running   m5.2xlarge   us-east-1   us-east-1b   18h
masocp-qxkeml-wh7px-master-2                      Running   m5.2xlarge   us-east-1   us-east-1c   18h
masocp-qxkeml-wh7px-worker-gpu-us-east-1a-5z7sd   Failed                                          4s
masocp-qxkeml-wh7px-worker-gpu-us-east-1a-nhldx   Failed                                          20s
masocp-qxkeml-wh7px-worker-us-east-1a-h2c8g       Running   m5.4xlarge   us-east-1   us-east-1a   18h
masocp-qxkeml-wh7px-worker-us-east-1a-p7mt9       Running   m5.4xlarge   us-east-1   us-east-1a   18h
masocp-qxkeml-wh7px-worker-us-east-1b-4rlrq       Running   m5.4xlarge   us-east-1   us-east-1b   18h
masocp-qxkeml-wh7px-worker-us-east-1b-dhv6g       Running   m5.4xlarge   us-east-1   us-east-1b   18h
masocp-qxkeml-wh7px-worker-us-east-1c-ks85p       Running   m5.4xlarge   us-east-1   us-east-1c   18h
masocp-qxkeml-wh7px-workerocs-us-east-1a-9r6pj    Running   m5.4xlarge   us-east-1   us-east-1a   17h
masocp-qxkeml-wh7px-workerocs-us-east-1b-p9psl    Running   m5.4xlarge   us-east-1   us-east-1b   17h
masocp-qxkeml-wh7px-workerocs-us-east-1c-94d7q    Running   m5.4xlarge   us-east-1   us-east-1c   17h
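In a long machine list, the failed entries can be picked out with a small filter. The following is a sketch only. The function name failed_machines is hypothetical, and it assumes that the output keeps the column order shown above, with PHASE as the second column.

```shell
# Hypothetical helper: print the names of machines whose PHASE column is
# "Failed". Reads the output of `oc get machine` on standard input and
# skips the header row.
failed_machines() {
  awk 'NR > 1 && $2 == "Failed" { print $1 }'
}

# Example call (requires a logged-in oc session):
# oc get machine -n openshift-machine-api | failed_machines
```

Each name that the filter prints can be passed to oc describe machine to see why the machine failed.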
To find the cause of a failure, run oc describe machine <machine name> or oc describe machineset <machineset name> and check the error message that is listed under Status or Events:
Status:
  Conditions:
    Last Transition Time: 2022-05-26T15:20:25Z
    Message: Instance has not been created
    Reason: InstanceNotCreated
    Severity: Warning
    Status: False
    Type: InstanceExists
  Error Message: error launching instance: Your requested instance type (p3.2xlarge) is not supported in your requested Availability Zone (us-east-1a). Please retry your request by not specifying an Availability Zone or choosing us-east-1b, us-east-1c, us-east-1d, us-east-1f.
  Error Reason: InvalidConfiguration
  Last Updated: 2022-05-26T15:20:26Z
  Phase: Failed
  Provider Status:
    Conditions:
      Last Probe Time: 2022-05-26T15:20:26Z
      Last Transition Time: 2022-05-26T15:20:26Z
      Message: error launching instance: Your requested instance type (p3.2xlarge) is not supported in your requested Availability Zone (us-east-1a). Please retry your request by not specifying an Availability Zone or choosing us-east-1b, us-east-1c, us-east-1d, us-east-1f.
      Reason: MachineCreationFailed
      Status: False
      Type: MachineCreation
Events:
  Type     Reason        Age                From           Message
  ----     ------        ----               ----           -------
  Warning  FailedCreate  52s (x2 over 53s)  awscontroller  masocp-qxkeml-wh7px-worker-gpu-us-east-1a-5z7sd: reconciler failed to Create machine: failed to launch instance: error launching instance: Your requested instance type (p3.2xlarge) is not supported in your requested Availability Zone (us-east-1a). Please retry your request by not specifying an Availability Zone or choosing us-east-1b, us-east-1c, us-east-1d, us-east-1f.
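The error message in this example already names the zones that accept the instance type. The following sketch pulls that list out of the describe output. The function name suggested_zones is hypothetical, and the pattern assumes the exact AWS wording shown in this example ("or choosing ..."), which can differ for other errors.

```shell
# Hypothetical helper: print the availability zones that the AWS
# "not supported in your requested Availability Zone" message suggests,
# one per line, without duplicates. Reads text on standard input.
suggested_zones() {
  sed -n 's/.*or choosing \(.*\)\.[[:space:]]*$/\1/p' \
    | tr -d ' ' | tr ',' '\n' | sort -u
}

# Example call (requires a logged-in oc session):
# oc describe machine <machine name> -n openshift-machine-api | suggested_zones
```

The message appears several times in the describe output, so the sketch removes duplicates before it prints the zones.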
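If the error is an unsupported availability zone, as in this example, delete the machine set by running oc delete machineset <machineset name>.
Edit the availabilityZone value in the custom resource to one of the zones that the error message lists, and rerun oc create -f <customresource.json>. Monitor the creation of the machine set and machines by using the commands that are listed in step 13.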
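The edit to the custom resource can be made with jq instead of a text editor. The following is a sketch under stated assumptions: jq is installed on the control shell, and the custom resource uses the AWS provider layout, with the zone at spec.template.spec.providerSpec.value.placement.availabilityZone. The function name set_machineset_zone and the output file name are hypothetical. Other zone-specific values in the custom resource, such as the machine set name or the subnet, might also need to change; the sketch does not touch them.

```shell
# Hypothetical helper: write a copy of a machine set JSON file with a new
# availability zone. All other fields are copied unchanged.
# Assumption: jq is installed and the file uses the AWS provider layout.
set_machineset_zone() {
  local in_file="$1" zone="$2" out_file="$3"
  jq --arg zone "${zone}" \
    '.spec.template.spec.providerSpec.value.placement.availabilityZone = $zone' \
    "${in_file}" > "${out_file}"
}

# Example sequence (placeholders as in the text; requires a logged-in oc session):
# oc delete machineset <machineset name> -n openshift-machine-api
# set_machineset_zone customresource.json us-east-1b customresource-edited.json
# oc create -f customresource-edited.json -n openshift-machine-api
# oc get machine -n openshift-machine-api
```

Writing to a new file keeps the original custom resource available for comparison if the second attempt also fails.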
For any other types of errors, delete the machine set, edit the custom resource, and re-create the machine set by using the edited custom resource
JSON file.