Maximo Application Suite
Customer-managed

Adding a GPU worker node to a Red Hat OpenShift cluster on AWS

Before you begin

Ensure that you have the following requirements:

A GPU worker node is added to the Red Hat® OpenShift® cluster. Current AWS Maximo® Application Suite BYOL offerings do not include nodes with GPUs.
A control shell.

The control shell can be the boot node. Locate the boot node in the EC2 dashboard under the list of all instances. If the boot node is in a stopped state, start the instance. Connect to this instance as ec2-user.
Tip: You can use Visual Studio Code to remotely connect to the boot node, but it is not necessary.

The oc and jq command-line tools are installed on the control shell.
The appropriate EC2 GPU instance is selected and has sufficient availability in the region where the Maximo Application Suite instance is installed.

Obtain this information from the instance type page that is located under the EC2 service in the AWS console.

For example, if you deployed the Maximo Application Suite instance in the us-east-1 region, go to the EC2 instance type page for that region by navigating to the AWS website. The instance type page details the compute, networking, storage, accelerators, and pricing information. The networking section details the availability zones.
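
As a quick preflight for the tool requirement above, the following sketch reports which command-line tools are missing from the control shell. The function name missing_tools is illustrative and is not part of any product tooling.

```shell
# Print the name of each listed tool that is not found on the PATH.
# missing_tools is a hypothetical helper name used only for this sketch.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example (run on the control shell); no output means both tools are present:
#   missing_tools oc jq
```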

About this task

For more information about the processes in this task, see:

AWS Recommended GPU Instances
Note: AWS offers EC2 instances that come with GPUs. Use p3.2xlarge as the EC2 instance type for Maximo Visual Inspection (MVI).
Install & use GPU on AWS
Creating a machine set on AWS

Procedure

In the control shell, log in as masocpuser (or kubeadmin).
Switch to the openshift-machine-api namespace.
```
oc project openshift-machine-api
```
Note:
If the namespace is not switched, use the -n flag and provide openshift-machine-api as an argument in the succeeding steps.
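
If you prefer not to switch projects, a small wrapper can add the namespace flag to every command. This is only a sketch; the wrapper name ocm is an illustrative assumption.

```shell
# Run any oc subcommand against the openshift-machine-api namespace.
# ocm is a hypothetical wrapper name, not a real CLI.
ocm() {
  oc -n openshift-machine-api "$@"
}

# Example:
#   ocm get machineset -o name
```
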
List the available machine sets in the cluster.
```
oc get machineset -o name
```
Select an appropriate machine set as a template for the new GPU worker node's YAML custom resource. Pick a machine set in an availability zone where the GPU EC2 instance type that you want to use is available.
For example, if p3.2xlarge is available in us-east-1b, pick a machine set that has us-east-1b as part of its name.
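
One way to make this selection from the command line is sketched below. It assumes that the AWS CLI is installed and configured on the control shell, which this procedure does not otherwise require, and that worker machine set names end in -worker-<zone> as in the example output later in this topic. Both function names are illustrative.

```shell
# List the availability zones in a region that offer an instance type.
# Assumes a configured AWS CLI. Usage: zones_for_instance_type <type> <region>
zones_for_instance_type() {
  aws ec2 describe-instance-type-offerings \
    --region "$2" \
    --location-type availability-zone \
    --filters "Name=instance-type,Values=$1" \
    --query 'InstanceTypeOfferings[].Location' \
    --output text
}

# List the worker machine sets whose names end in -worker-<zone>.
# Usage: worker_machinesets_in_zone <zone>
worker_machinesets_in_zone() {
  oc get machineset -n openshift-machine-api -o name | grep -- "-worker-$1\$"
}

# Example:
#   zones_for_instance_type p3.2xlarge us-east-1
#   worker_machinesets_in_zone us-east-1b
```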

Assign a variable for the template machine set name.

For example,


SOURCE_MACHINESET=machineset.machine.openshift.io/masocp-4kyowr-mm5b5-worker-us-east-1b

Copy the source machine set's custom resource to a new file.
```
oc get "$SOURCE_MACHINESET" -o json | jq -r '.' > source-machineset.json
```
Note: The file source-machineset.json is created in the current folder.

Define variables to use for later.


OLD_MACHINESET_NAME=$(jq '.metadata.name' -r source-machineset.json)


NEW_MACHINESET_NAME=${OLD_MACHINESET_NAME/worker/worker-gpu}
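
The ${OLD_MACHINESET_NAME/worker/worker-gpu} expansion replaces the first occurrence of worker in the name. It is a Bash feature rather than POSIX shell syntax. If the control shell does not support it, a sed-based equivalent such as the following sketch gives the same result. The function name is illustrative.

```shell
# Derive the GPU machine set name by replacing the first "worker"
# in the source name with "worker-gpu".
gpu_machineset_name() {
  printf '%s\n' "$1" | sed 's/worker/worker-gpu/'
}

gpu_machineset_name masocp-4kyowr-mm5b5-worker-us-east-1b
# prints masocp-4kyowr-mm5b5-worker-gpu-us-east-1b
```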

Change the instanceType and, if needed, the number of replicas. Delete the cluster-generated metadata (selfLink, uid, creationTimestamp, and resourceVersion) and write the result to a new file, gpu-machineset.json. This file is used to create the new machine set with the GPU.


jq -r '.spec.template.spec.providerSpec.value.instanceType = "p3.2xlarge"
  | .spec.replicas = 1
  | del(.metadata.selfLink)
  | del(.metadata.uid)
  | del(.metadata.creationTimestamp)
  | del(.metadata.resourceVersion)
  ' source-machineset.json > gpu-machineset.json

Change the machine set name in gpu-machineset.json.


sed -i "s/$OLD_MACHINESET_NAME/$NEW_MACHINESET_NAME/g" gpu-machineset.json

Run the diff command to check changes.
```
diff -Nuar source-machineset.json gpu-machineset.json
```
For more information, see Install & use GPU on AWS.
Check the value of availabilityZone (found under spec.template.spec.providerSpec.value.placement). Ensure that the new instance type (p3.2xlarge) is available in that availability zone, or omit the availabilityZone key-value pair from the JSON file. Otherwise, an error is displayed after you create the machine set. For more information, see the troubleshooting information at the end of this task.
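
To inspect these fields without opening the file, a jq query along the following lines can print the instance type, replica count, and availability zone on one line. This is a sketch: the function name is illustrative, and the field paths are the ones that are used earlier in this procedure.

```shell
# Print instance type, replica count, and availability zone
# from a machine set JSON file. Usage: show_machineset_summary <file>
show_machineset_summary() {
  jq -r '[
      .spec.template.spec.providerSpec.value.instanceType,
      (.spec.replicas | tostring),
      (.spec.template.spec.providerSpec.value.placement.availabilityZone // "(not set)")
    ] | join(" ")' "$1"
}

# Example:
#   show_machineset_summary gpu-machineset.json
```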

Create a machine set:


oc create -f gpu-machineset.json

Example output

machineset.machine.openshift.io/masocp-4kyowr-mm5b5-worker-gpu-us-east-1b created

Verify that the machine set is created.


oc get machineset

Example output

NAME                                        DESIRED   CURRENT   READY   AVAILABLE   AGE
masocp-4kyowr-mm5b5-worker-gpu-us-east-1b   1         1                             10s
masocp-4kyowr-mm5b5-worker-us-east-1a       3         3         3       3           7d8h
masocp-4kyowr-mm5b5-worker-us-east-1b       2         2         2       2           7d8h
masocp-4kyowr-mm5b5-worker-us-east-1c       2         2         2       2           7d8h
masocp-4kyowr-mm5b5-workerocs-us-east-1a    1         1         1       1           7d7h
masocp-4kyowr-mm5b5-workerocs-us-east-1b    1         1         1       1           7d7h
masocp-4kyowr-mm5b5-workerocs-us-east-1c    1         1         1       1           7d7h

Note: The output shows that the new GPU node was created but is not ready and available yet.

Get the list of machines to show the status:


oc get machine

Example output

NAME                                              PHASE          TYPE         REGION      ZONE         AGE                                             
masocp-4kyowr-mm5b5-master-0                      Running        m5.2xlarge   us-east-1   us-east-1a   7d8h
masocp-4kyowr-mm5b5-master-1                      Running        m5.2xlarge   us-east-1   us-east-1b   7d8h
masocp-4kyowr-mm5b5-master-2                      Running        m5.2xlarge   us-east-1   us-east-1c   7d8h
masocp-4kyowr-mm5b5-master-3                      Running        m5.2xlarge   us-east-1   us-east-1a   7d8h
masocp-4kyowr-mm5b5-master-4                      Running        m5.2xlarge   us-east-1   us-east-1b   7d8h
masocp-4kyowr-mm5b5-worker-gpu-us-east-1b-nrr4n   Provisioning   p3.2xlarge   us-east-1   us-east-1b   22s
masocp-4kyowr-mm5b5-worker-us-east-1a-kx449       Running        m5.4xlarge   us-east-1   us-east-1a   7d8h
masocp-4kyowr-mm5b5-worker-us-east-1a-nn72q       Running        m5.4xlarge   us-east-1   us-east-1a   7d8h
masocp-4kyowr-mm5b5-worker-us-east-1a-p5nqf       Running        m5.4xlarge   us-east-1   us-east-1a   7d8h
masocp-4kyowr-mm5b5-worker-us-east-1b-7r5wz       Running        m5.4xlarge   us-east-1   us-east-1b   7d8h
masocp-4kyowr-mm5b5-worker-us-east-1b-94khr       Running        m5.4xlarge   us-east-1   us-east-1b   7d8h
masocp-4kyowr-mm5b5-worker-us-east-1c-fvv52       Running        m5.4xlarge   us-east-1   us-east-1c   7d8h
masocp-4kyowr-mm5b5-worker-us-east-1c-rsnwf       Running        m5.4xlarge   us-east-1   us-east-1c   7d8h
masocp-4kyowr-mm5b5-workerocs-us-east-1a-hwb4m    Running        m5.4xlarge   us-east-1   us-east-1a   7d7h
masocp-4kyowr-mm5b5-workerocs-us-east-1b-979w8    Running        m5.4xlarge   us-east-1   us-east-1b   7d7h
masocp-4kyowr-mm5b5-workerocs-us-east-1c-85ktb    Running        m5.4xlarge   us-east-1   us-east-1c   7d7h

When the machine set is done provisioning, the output for oc get machineset is similar to the following example:

Example output

NAME                                        DESIRED   CURRENT   READY   AVAILABLE   AGE
masocp-4kyowr-mm5b5-worker-gpu-us-east-1b   1         1         1       1           3m38s
masocp-4kyowr-mm5b5-worker-us-east-1a       3         3         3       3           7d8h
masocp-4kyowr-mm5b5-worker-us-east-1b       2         2         2       2           7d8h
masocp-4kyowr-mm5b5-worker-us-east-1c       2         2         2       2           7d8h
masocp-4kyowr-mm5b5-workerocs-us-east-1a    1         1         1       1           7d7h
masocp-4kyowr-mm5b5-workerocs-us-east-1b    1         1         1       1           7d7h
masocp-4kyowr-mm5b5-workerocs-us-east-1c    1         1         1       1           7d7h

Run the oc get machine command. The output indicates that the machine is provisioned:

Example output

NAME                                              PHASE         TYPE         REGION      ZONE         AGE
...
masocp-4kyowr-mm5b5-master-3                      Running       m5.2xlarge   us-east-1   us-east-1a   7d8h
masocp-4kyowr-mm5b5-master-4                      Running       m5.2xlarge   us-east-1   us-east-1b   7d8h
masocp-4kyowr-mm5b5-worker-gpu-us-east-1b-nrr4n   Provisioned   p3.2xlarge   us-east-1   us-east-1b   107s
masocp-4kyowr-mm5b5-worker-us-east-1a-kx449       Running       m5.4xlarge   us-east-1   us-east-1a   7d8h
...

Note: You can also check the Red Hat OpenShift console by clicking Compute > Nodes or Compute > MachineSets.
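
Instead of rerunning oc get machineset by hand, the readiness check can be wrapped in a polling loop. The following sketch is one possible approach. The function name, the 10-second interval, and the default timeout are illustrative choices, and the loop assumes that status.readyReplicas is the field behind the READY column.

```shell
# Wait until a machine set reports as many ready replicas as it requests.
# Usage: wait_for_machineset <name> [timeout_seconds]
wait_for_machineset() {
  name=$1
  timeout=${2:-900}
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    desired=$(oc get machineset "$name" -n openshift-machine-api -o jsonpath='{.spec.replicas}')
    ready=$(oc get machineset "$name" -n openshift-machine-api -o jsonpath='{.status.readyReplicas}')
    # readyReplicas is empty until the first machine becomes ready.
    if [ -n "$desired" ] && [ "${ready:-0}" = "$desired" ]; then
      echo "$name is ready ($ready/$desired)"
      return 0
    fi
    sleep 10
    elapsed=$((elapsed + 10))
  done
  echo "Timed out waiting for $name" >&2
  return 1
}

# Example:
#   wait_for_machineset masocp-4kyowr-mm5b5-worker-gpu-us-east-1b 600
```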

What to do next

If errors occur, or to verify that the process completed successfully, ensure that you run the following commands in the openshift-machine-api namespace.

The oc create -f <machine set custom resource> command (step 12) always reports that the machine set is created. However, if a machine cannot be created, the machine set does not become ready and available. Running oc get machine immediately shows the failure:

Example output

NAME                                              PHASE     TYPE         REGION      ZONE         AGE
masocp-qxkeml-wh7px-master-0                      Running   m5.2xlarge   us-east-1   us-east-1a   18h
masocp-qxkeml-wh7px-master-1                      Running   m5.2xlarge   us-east-1   us-east-1b   18h
masocp-qxkeml-wh7px-master-2                      Running   m5.2xlarge   us-east-1   us-east-1c   18h
masocp-qxkeml-wh7px-worker-gpu-us-east-1a-5z7sd   Failed                                          4s
masocp-qxkeml-wh7px-worker-gpu-us-east-1a-nhldx   Failed                                          20s
masocp-qxkeml-wh7px-worker-us-east-1a-h2c8g       Running   m5.4xlarge   us-east-1   us-east-1a   18h
masocp-qxkeml-wh7px-worker-us-east-1a-p7mt9       Running   m5.4xlarge   us-east-1   us-east-1a   18h
masocp-qxkeml-wh7px-worker-us-east-1b-4rlrq       Running   m5.4xlarge   us-east-1   us-east-1b   18h
masocp-qxkeml-wh7px-worker-us-east-1b-dhv6g       Running   m5.4xlarge   us-east-1   us-east-1b   18h
masocp-qxkeml-wh7px-worker-us-east-1c-ks85p       Running   m5.4xlarge   us-east-1   us-east-1c   18h
masocp-qxkeml-wh7px-workerocs-us-east-1a-9r6pj    Running   m5.4xlarge   us-east-1   us-east-1a   17h
masocp-qxkeml-wh7px-workerocs-us-east-1b-p9psl    Running   m5.4xlarge   us-east-1   us-east-1b   17h
masocp-qxkeml-wh7px-workerocs-us-east-1c-94d7q    Running   m5.4xlarge   us-east-1   us-east-1c   17h

To see the reason for the failure, run oc describe machine <machine name> or oc describe machineset <machineset name> and check the error message that is listed under Status or Events:

Status:
  Conditions:
    Last Transition Time:  2022-05-26T15:20:25Z
    Message:               Instance has not been created
    Reason:                InstanceNotCreated
    Severity:              Warning
    Status:                False
    Type:                  InstanceExists
  Error Message:           error launching instance: Your requested instance type (p3.2xlarge) is not supported in your requested Availability Zone (us-east-1a). Please retry your request by not specifying an Availability Zone or choosing us-east-1b, us-east-1c, us-east-1d, us-east-1f.
  Error Reason:            InvalidConfiguration
  Last Updated:            2022-05-26T15:20:26Z
  Phase:                   Failed
  Provider Status:
    Conditions:
      Last Probe Time:       2022-05-26T15:20:26Z
      Last Transition Time:  2022-05-26T15:20:26Z
      Message:               error launching instance: Your requested instance type (p3.2xlarge) is not supported in your requested Availability Zone (us-east-1a). Please retry your request by not specifying an Availability Zone or choosing us-east-1b, us-east-1c, us-east-1d, us-east-1f.
      Reason:                MachineCreationFailed
      Status:                False
      Type:                  MachineCreation
Events:
  Type     Reason        Age                From           Message
  ----     ------        ----               ----           -------
  Warning  FailedCreate  52s (x2 over 53s)  awscontroller  masocp-qxkeml-wh7px-worker-gpu-us-east-1a-5z7sd: reconciler failed to Create machine: failed to launch instance: error launching instance: Your requested instance type (p3.2xlarge) is not supported in your requested Availability Zone (us-east-1a). Please retry your request by not specifying an Availability Zone or choosing us-east-1b, us-east-1c, us-east-1d, us-east-1f.

In this case, you can delete the machine set:

oc delete machineset <machineset name>

Edit the availabilityZone value in the custom resource and rerun oc create -f <customresource.json>. Monitor the creation of the machine set and machines by using the commands that are listed in step 13. For any other types of errors, delete the machine set, edit the custom resource, and re-create the machine set by using the edited custom resource JSON file.
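
When you troubleshoot in this way, the following sketch can list the failed machines and print their error messages without reading the full describe output. It relies on the PHASE column position that is shown in the example output, and it assumes that status.errorMessage is the field behind Error Message in the describe output. The function names are illustrative.

```shell
# Print the names of machines whose PHASE column is Failed.
failed_machines() {
  oc get machine -n openshift-machine-api --no-headers | awk '$2 == "Failed" { print $1 }'
}

# Print the error message that is recorded on a machine.
# Usage: machine_error <machine name>
machine_error() {
  oc get machine "$1" -n openshift-machine-api -o jsonpath='{.status.errorMessage}'
}

# Example: print the error for every failed machine.
#   for m in $(failed_machines); do machine_error "$m"; echo; done
```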