Enabling GPUs

Enable GPUs when you want instance groups to use GPU resources for applications and you want GPU monitoring to be available.

About this task

To enable GPUs for instance groups, and to display GPU monitoring charts and table columns in the cluster management console, configure the EGO_GPU_AUTOCONFIG parameter in the ego.conf file. You must also configure GPU resource groups to run the GPU workload.

Procedure

  1. If you installed IBM® Spectrum Conductor 2.5.0 as a fresh installation, skip this step. Complete this step only if you performed a rolling upgrade from a previous version of IBM Spectrum Conductor and previously set EGO_GPU_ENABLED=Y by using the gpuconfig.sh enable command:
    1. Upgrade to the latest version of IBM Spectrum Conductor. For more details, see the Upgrade by using rolling upgrade topic.
    2. Disable the GPUs:
      • To run with user interaction: # $EGO_TOP/conductorspark/2.5.0/etc/gpuconfig.sh disable
      • To run without user interaction: # $EGO_TOP/conductorspark/2.5.0/etc/gpuconfig.sh disable --quiet -u username -x password
  2. Set EGO_GPU_AUTOCONFIG=Y in the $EGO_CONFDIR/ego.conf file on all primary hosts.
  3. Restart EGO on all the hosts in the cluster and restart all the services:
    egosh ego restart all
    egosh service stop all
    egosh service start all
  4. Verify the GPU resource information, either in the host properties in the cluster management console or from the CLI by running the following command:
    egosh resource list -o ngpus
    Sample output:
    # egosh resource list -o ngpus
    NAME    ngpus
    hostA     2
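
The verification in step 4 can also be scripted. The following is a minimal sketch, assuming that the command output has the same two-column layout as the sample above (a NAME header row followed by one row per host). The function name hosts_with_gpus is hypothetical and is not part of IBM Spectrum Conductor; it only filters text that is passed to it on standard input.

```shell
#!/bin/sh
# Hypothetical helper (not part of the product): reads output in the
# two-column layout shown in the sample above and prints the names of
# hosts that report at least one GPU.
hosts_with_gpus() {
    # Skip the NAME header row; keep rows whose second column is a
    # positive integer.
    awk '$1 != "NAME" && $2 ~ /^[0-9]+$/ && $2 + 0 > 0 { print $1 }'
}

# Example use on a host where egosh is available:
#   egosh resource list -o ngpus | hosts_with_gpus
```

If a host that has GPUs is missing from the result, check that the parameter is set in ego.conf and that EGO and the services were restarted as described in steps 2 and 3.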

Results

EGO is restarted on all hosts in the cluster with the change applied. If you chose not to restart the cluster, you must manually restart EGO on all hosts in the cluster and restart all services for the change to take effect.

What to do next

  1. You must configure GPU resource groups to run the GPU workload; see Using resource groups with GPU hosts.
  2. To disable GPUs:
    1. Set EGO_GPU_AUTOCONFIG=N in the $EGO_CONFDIR/ego.conf file on all primary hosts.
    2. Restart EGO on all the hosts in the cluster and restart all the services:
      egosh ego restart all
      egosh service stop all
      egosh service start all
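
Because enabling and disabling GPUs both involve the same edit to ego.conf, the edit can be wrapped in a small script. The following is a minimal sketch, assuming that ego.conf holds one NAME=value setting per line. The function name set_gpu_autoconfig and the .bak backup file are hypothetical conventions for illustration, not product features. The function changes only the file; you still need to restart EGO and the services as described in the procedure.

```shell
#!/bin/sh
# Hypothetical helper (not part of the product): sets EGO_GPU_AUTOCONFIG
# to Y or N in the given ego.conf file.
# Usage: set_gpu_autoconfig Y|N /path/to/ego.conf
set_gpu_autoconfig() {
    value=$1
    conf=$2
    case "$value" in
        Y|N) ;;
        *) echo "value must be Y or N" >&2; return 1 ;;
    esac
    [ -f "$conf" ] || { echo "file not found: $conf" >&2; return 1; }

    # Keep a backup copy before changing anything.
    cp "$conf" "$conf.bak" || return 1
    tmp=$(mktemp) || return 1

    # Drop any existing EGO_GPU_AUTOCONFIG line, then append the new value,
    # so that the parameter appears exactly once.
    awk '!/^EGO_GPU_AUTOCONFIG=/' "$conf.bak" > "$tmp" || return 1
    echo "EGO_GPU_AUTOCONFIG=$value" >> "$tmp"

    # Write the result back in place so that the original file keeps its
    # ownership and permissions.
    cat "$tmp" > "$conf"
    rm -f "$tmp"
}

# Example use on a primary host:
#   set_gpu_autoconfig Y "$EGO_CONFDIR/ego.conf"
```

Run the function on every primary host so that all copies of ego.conf carry the same value.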