Maximo Application Suite
Customer-managed

Adding a GPU worker node to a Red Hat OpenShift cluster on AWS

Before you begin

Ensure that the following requirements are met:

  • A GPU worker node that is added to the Red Hat® OpenShift® cluster. Current AWS Maximo® Application Suite BYOL offerings do not include nodes with GPUs.
  • A control shell.

    The control shell can be the boot node. Locate the boot node in the EC2 dashboard in the list of all instances. If the boot node is in a stopped state, restart the instance. Connect to this instance as the ec2-user user.
    Tip: You can use Visual Studio Code to connect remotely to the boot node, but it is not necessary.
    The control shell must have the oc and jq command-line tools installed.
  • The appropriate EC2 GPU instance is selected and has sufficient availability in the region where the Maximo Application Suite instance is installed.

    Obtain this information from the instance type page, which is available from the EC2 service in the AWS console.

    For example, if you deployed the Maximo Application Suite instance in the us-east-1 region, open the EC2 instance type page for that region on the AWS website. The instance type page details the compute, networking, storage, accelerator, and pricing information. The networking section lists the availability zones.
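
If the AWS CLI is configured in the control shell, one way to check which availability zones in a region offer a given instance type is the describe-instance-type-offerings command (p3.2xlarge and us-east-1 here are examples; substitute your own values):

```shell
# List the availability zones in us-east-1 that offer the p3.2xlarge
# instance type (adjust --region and the instance type as needed).
aws ec2 describe-instance-type-offerings \
  --location-type availability-zone \
  --filters Name=instance-type,Values=p3.2xlarge \
  --region us-east-1 \
  --query 'InstanceTypeOfferings[].Location' \
  --output text
```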

About this task


Procedure

  1. In the control shell, log in as masocpuser (or kubeadmin).
  2. Switch to the openshift-machine-api namespace.
    
    oc project openshift-machine-api
    
    Note:

     If you do not switch the namespace, add the -n openshift-machine-api flag to the commands in the following steps.

  3. List the available machine sets in the cluster.
    
    oc get machineset -o name
    
  4. Select an appropriate machine set as a template for the new GPU worker node's custom resource. Pick a machine set that is located in an availability zone where the GPU EC2 instance type for the new node is available.
    For example, if p3.2xlarge is available in us-east-1b, pick a machine set that has us-east-1b as part of its name.
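
One way to see each machine set together with its availability zone in a single listing (assuming the AWS provider spec layout used in the later steps) is:

```shell
# Show each machine set with the availability zone recorded in its
# AWS provider spec, so you can match one to the zone where the GPU
# instance type is offered.
oc get machineset -n openshift-machine-api \
  -o custom-columns=NAME:.metadata.name,ZONE:.spec.template.spec.providerSpec.value.placement.availabilityZone
```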
  5. Assign a variable for the template machine set name.
    For example,
    
     SOURCE_MACHINESET=machineset.machine.openshift.io/masocp-4kyowr-mm5b5-worker-us-east-1b
    
  6. Copy the source machine set's custom resource to a new file.
    
     oc get -o json $SOURCE_MACHINESET | jq -r > source-machineset.json
    
    Note: The file source-machineset.json is created in the current folder.
  7. Define variables for later use.
    
    OLD_MACHINESET_NAME=$(jq '.metadata.name' -r source-machineset.json)
    
    
    NEW_MACHINESET_NAME=${OLD_MACHINESET_NAME/worker/worker-gpu}
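
The ${OLD_MACHINESET_NAME/worker/worker-gpu} substitution is a bash feature that replaces the first occurrence of worker. As a quick sanity check with the example name from step 5:

```shell
# Example machine set name from step 5 (illustrative).
OLD_MACHINESET_NAME=masocp-4kyowr-mm5b5-worker-us-east-1b

# Replace the first occurrence of "worker" with "worker-gpu" (bash syntax).
NEW_MACHINESET_NAME=${OLD_MACHINESET_NAME/worker/worker-gpu}

echo "$NEW_MACHINESET_NAME"
# masocp-4kyowr-mm5b5-worker-gpu-us-east-1b
```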
    
  8. Change the instanceType and, if needed, the number of replicas. Delete some metadata fields and write the result into a new file, gpu-machineset.json. This file is used to create the new machine set with the GPU.
    
    jq -r '.spec.template.spec.providerSpec.value.instanceType = "p3.2xlarge"
      | .spec.replicas = 1
      | del(.metadata.selfLink)
      | del(.metadata.uid)
      | del(.metadata.creationTimestamp)
      | del(.metadata.resourceVersion)
      ' source-machineset.json > gpu-machineset.json
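
To see what the filter does, you can run it against a minimal stand-in file. A real machine set custom resource has many more fields; the names and values below are illustrative only:

```shell
# Create a minimal stand-in for source-machineset.json (illustrative only).
cat > source-machineset.json <<'EOF'
{
  "metadata": {
    "name": "masocp-worker-us-east-1b",
    "selfLink": "/apis/machine.openshift.io/v1beta1/machinesets/masocp-worker-us-east-1b",
    "uid": "00000000-0000-0000-0000-000000000000",
    "creationTimestamp": "2020-01-01T00:00:00Z",
    "resourceVersion": "1"
  },
  "spec": {
    "replicas": 3,
    "template": {"spec": {"providerSpec": {"value": {"instanceType": "m5.2xlarge"}}}}
  }
}
EOF

# Apply the same filter as in the step above.
jq -r '.spec.template.spec.providerSpec.value.instanceType = "p3.2xlarge"
  | .spec.replicas = 1
  | del(.metadata.selfLink)
  | del(.metadata.uid)
  | del(.metadata.creationTimestamp)
  | del(.metadata.resourceVersion)
  ' source-machineset.json > gpu-machineset.json

jq -r '.spec.replicas' gpu-machineset.json       # 1
jq -r '.metadata | keys[]' gpu-machineset.json   # name
```

The instance type and replica count are rewritten, and only the name survives in the metadata block.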
    
  9. Change the machine set name in gpu-machineset.json.
    
    sed -i "s/$OLD_MACHINESET_NAME/$NEW_MACHINESET_NAME/g" gpu-machineset.json
    
  10. Run the diff command to check changes.
    
    diff -Nuar source-machineset.json gpu-machineset.json
    

    For more information, see Install & use GPU on AWS.

  11. Check the value of availabilityZone (found under spec.template.spec.providerSpec.value.placement). Ensure that the new instance type (p3.2xlarge) is available in that availability zone, or omit the availabilityZone key-value pair from the JSON file. Otherwise, an error is displayed after you create the machine set. For more information, see the troubleshooting section at the end of this task.
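
A quick way to read the recorded zone is a jq query. The stand-in file below only illustrates the placement block; in practice, run the query against gpu-machineset.json:

```shell
# Stand-in for the placement block of a machine set (illustrative values).
cat > example-machineset.json <<'EOF'
{"spec":{"template":{"spec":{"providerSpec":{"value":{"placement":{"availabilityZone":"us-east-1b","region":"us-east-1"}}}}}}}
EOF

# Print the availability zone recorded in the machine set definition.
jq -r '.spec.template.spec.providerSpec.value.placement.availabilityZone' example-machineset.json
# us-east-1b
```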
  12. Create a machine set:
    
    oc create -f gpu-machineset.json
    
    Example output
  13. Verify that the machine set is created.
    
    oc get machineset
    
    Example output
    Note: The output shows that the new GPU node was created but is not ready and available yet.
    1. Get the list of machines to show the status:
      
      oc get machine
      
      Example output

      When the machine set is done provisioning, the output for oc get machineset is similar to the following example:

      Example output
    2. Run the oc get machine command. The output indicates that the machine is provisioned:
      Example output
      Note: You can also check the Red Hat OpenShift console by clicking Compute > Nodes or Compute > MachineSets.
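
When the machine reports the Running phase, one way to confirm that the node joined the cluster (assuming the p3.2xlarge instance type from the earlier examples) is to filter nodes by the standard instance-type label:

```shell
# List machines and their phases in the machine API namespace.
oc get machine -n openshift-machine-api

# List the nodes that were created with the GPU instance type
# (node.kubernetes.io/instance-type is a well-known Kubernetes label).
oc get nodes -l node.kubernetes.io/instance-type=p3.2xlarge
```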

What to do next

Whether you are verifying that the process completed successfully or investigating errors, ensure that you run the commands in the openshift-machine-api namespace.

Next, run the oc create -f <machine set custom resource> command (step 12). The output always indicates that the machine set is created. However, if creating the machine fails, the machine set does not become ready and available. Running oc get machine can immediately indicate the failure:

Example output
To see the reason for the failure, run oc describe machine <machine name> or oc describe machineset <machineset name> and check the error message that is listed under Status or Events.
If the error is caused by the availabilityZone value, you can delete the machine set:
 oc delete machineset <machineset name>

Edit the availabilityZone value in the custom resource and rerun oc create -f <customresource.json>. Monitor the creation of the machine set and machines by using the commands that are listed in step 13. For any other types of errors, delete the machine set, edit the custom resource, and re-create the machine set by using the edited custom resource JSON file.
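
As a sketch of that recovery path, assuming the failure was an availabilityZone mismatch and that us-east-1c is a zone where the instance type is offered (both are assumptions; substitute your own values):

```shell
# Delete the failed machine set that was created from gpu-machineset.json.
oc delete -f gpu-machineset.json

# Point the custom resource at a zone where p3.2xlarge is offered
# (us-east-1c is only an example) and write a corrected file.
jq '.spec.template.spec.providerSpec.value.placement.availabilityZone = "us-east-1c"' \
  gpu-machineset.json > gpu-machineset-fixed.json

# Re-create the machine set from the corrected file.
oc create -f gpu-machineset-fixed.json
```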