GPU application configuration
Understand the configuration required for GPU applications.
Error handling
Error handling, like other application-specific behavior, can be configured in the application profile. See Service error handling control for more details. Note that onGpuInvoke() does not currently have its own error handler in the application profile. If an error occurs in onGpuInvoke(), it is handled according to the onInvoke() error settings in the application profile.
GPU scheduling and application profile
IBM® Spectrum Symphony schedules one available GPU device per GPU service instance. With exclusive IBM Spectrum Symphony scheduling, the device is used exclusively by the service until the service is shut down. The choice of which GPU device to schedule can be controlled to a certain extent by defining specific rules and variables in the application profile of the GPU application.
- As of CUDA 8.0, compute mode support has changed: exclusive thread compute mode has been deprecated and replaced with exclusive process mode.
- If you want to run the GPU application in exclusive thread mode (or exclusive process mode, if you use CUDA 8.0 or later), you must switch the card to this mode and configure the corresponding value of SI_GPU_COMPUTE_MODE in the application profile, as shown in the sketch after this list.
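Switching the card itself happens outside the application profile, using the NVIDIA driver tools. Here is a minimal sketch, assuming the nvidia-smi utility is on the PATH, you have administrator rights on the host, and device index 0 is the card in question:
# Before CUDA 8.0: switch device 0 to exclusive thread mode
nvidia-smi -i 0 -c EXCLUSIVE_THREAD
# CUDA 8.0 or later: switch device 0 to exclusive process mode
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
# Verify the current compute mode of device 0
nvidia-smi -q -d COMPUTE -i 0
Note that newer NVIDIA drivers no longer accept EXCLUSIVE_THREAD, matching its deprecation in CUDA 8.0.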
Set resource requirements
A basic knowledge of resource requirement strings (resReq) in IBM Spectrum Symphony is assumed.
Generally, there are two levels of scheduling in IBM Spectrum Symphony. At the global level, IBM Spectrum Symphony schedules the host where a particular service instance of the application will run. At the local level, GPU device scheduling takes place during service instance initialization. Ideally, the service will not start on a host that has no available GPU devices compatible with the restrictions defined in the application profile. This is especially useful when not all hosts in the resource group assigned to your application have GPU devices.
- In the cluster management console, click .
The Applications page displays.
- Click your application.
The Application Profile page displays.
- Select Advanced Configuration.
- In the Resource Requirements field, replace the current contents with the following:
select(gpuexclusive_thread > 0 && gpuexclusive_thread < 9)
If you use CUDA 8.0 or later, specify this content instead:
select(gpuexclusive_process > 0 && gpuexclusive_process < 9)
- Click Save.
The Confirmation window displays.
- Review your selections, and then click Confirm.
Check GPU device information using CLI
- Determine the number of GPU devices. For example, run this command:
egosh resource list -o ngpus,gpushared,gpuexclusive_thread,gpucap1_0,gpucap1_1,gpucap1_2,gpucap1_3,gpucap2_plus
This command shows the following output:
NAME   ngpus  gpushared  gpuexclusive_thread  gpucap1_0  gpucap1_1  gpucap1_2  gpucap1_3  gpucap2_plus
HostA  2      1.0        1.0                  0.0        0.0        1.0        0.0        1.0
If you use CUDA 8.0 or later, run this command:
egosh resource list -o ngpus,gpushared,gpuexclusive_process,gpucap1_0,gpucap1_1,gpucap1_2,gpucap1_3,gpucap2_plus
For CUDA 8.0 or later, the command shows this output:
NAME   ngpus  gpushared  gpuexclusive_process  gpucap1_0  gpucap1_1  gpucap1_2  gpucap1_3  gpucap2_plus
HostA  2      1.0        1.0                   0.0        0.0        1.0        0.0        1.0
This output shows that there are two GPU devices: one is in share mode and one is in exclusive thread (or exclusive process) mode; one device has a compute capability level of 1.2, whereas the other has a capability level of 2.0 or later. Here is a breakdown to help understand this output (a usage sketch follows the list):
ngpus
- Shows the number of GPU devices.
gpushared
- Shows the number of GPU devices in share mode.
gpuexclusive_thread (or gpuexclusive_process)
- Shows the number of GPU devices in exclusive thread mode (or in exclusive process mode).
gpucap1_0
- Shows the number of GPU devices with compute capability level 1.0.
gpucap1_1
- Shows the number of GPU devices with compute capability level 1.1.
gpucap1_2
- Shows the number of GPU devices with compute capability level 1.2.
gpucap1_3
- Shows the number of GPU devices with compute capability level 1.3.
gpucap2_plus
- Shows the number of GPU devices with compute capability level 2.0 or later.
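These attribute names can be referenced directly in a resource requirement string. As a minimal sketch (the threshold values are illustrative), the following string selects only hosts that have more than one GPU device and at least one device in share mode:
select(ngpus > 1 && gpushared > 0)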
- Determine a GPU device's compute mode value (SI_GPU_COMPUTE_MODE) and capability value (SI_GPU_CAPABILITY). For example, run this command to retrieve information for device 0:
egosh resource list -o gpumode0,gpucapver0
The following example output for this command shows that device 0's compute mode value is 0.0 (to indicate shared mode), and its capability value is 3.7:
NAME   gpumode0  gpucapver0
HostA  0.0       3.7
Additionally, you can run a similar command to determine values for other devices. For example, to view the values for device 1, run:
egosh resource list -o gpumode1,gpucapver1
Here is example output for CUDA versions before CUDA 8.0, which shows that device 1's compute mode value is 1.0 (to indicate exclusive thread mode), and its capability value is 2.0:
NAME   gpumode1  gpucapver1
HostA  1.0       2.0
Here is example output for CUDA 8.0 or later, which shows that device 1's compute mode value is 3.0 (to indicate exclusive process mode), and its capability value is 2.0:
NAME   gpumode1  gpucapver1
HostA  3.0       2.0
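You can also retrieve the values for several devices in a single call. For example, this command (a sketch that simply combines the attribute names used above; the output shown is what the HostA examples in this section would produce) lists the mode and capability values for devices 0 and 1 at once:
egosh resource list -o ngpus,gpumode0,gpucapver0,gpumode1,gpucapver1
NAME   ngpus  gpumode0  gpucapver0  gpumode1  gpucapver1
HostA  2      0.0       3.7         1.0       2.0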
Profile environment variables
The SI_GPU_COMPUTE_MODE environment variable sets the GPU device's compute mode value, and SI_GPU_CAPABILITY sets the capability value. Set these variables within your application profile.
- SI_GPU_COMPUTE_MODE
- Controls the compute mode of the target GPU device. Only devices that match the configured mode will be scheduled for the application's services. Valid values are as follows:
- 0
- Specify a value of 0 for the SI_GPU_COMPUTE_MODE environment variable to indicate shared mode.
Note that setting this variable to 0 instructs IBM Spectrum Symphony to schedule only devices that are set to shared mode; IBM Spectrum Symphony scheduling itself remains exclusive.
- 1
- For CUDA versions before CUDA 8.0, specify a value of 1 for the SI_GPU_COMPUTE_MODE environment variable to indicate exclusive thread mode.
- 3
- For CUDA 8.0 or later, specify a value of 3 for the SI_GPU_COMPUTE_MODE environment variable to indicate exclusive process mode.
Note: If this variable is not set or is set to an empty value, IBM Spectrum Symphony does not check the mode of a GPU device before scheduling it for a service. If there are prohibited GPU devices in your system, always set SI_GPU_COMPUTE_MODE to one of these values.
- SI_GPU_CAPABILITY
- Controls the compute capability level of the target GPU device. Only devices that match the configured GPU capability level will be scheduled for the application's services. Valid values are gpucap1_0, gpucap1_1, gpucap1_2, gpucap1_3, or gpucap2_plus. These values correspond to the supported GPU compute capability levels: 1.0, 1.1, 1.2, 1.3, and 2.0 or later.
Tip: For details about the SI_GPU_CAPABILITY values, see the previous section about determining the number of GPU devices.
Use the SI_GPU_CAPABILITY environment variable to control the capability level of the GPU device. For example, suppose a host has two GPU devices: one with a capability level of 1.1 and another with a capability level of 2.0. To restrict your application to the device with capability level 1.1, follow the process described for setting resource requirements and update the resource requirement with a gpucap1_1 value, as illustrated in the following string:
select((gpuexclusive_thread > 0 && gpuexclusive_thread < 9) && (gpucap1_1==1))
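Following the same pattern used elsewhere in this section, if you use CUDA 8.0 or later, substitute gpuexclusive_process in the restriction (a sketch):
select((gpuexclusive_process > 0 && gpuexclusive_process < 9) && (gpucap1_1==1))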
- To run a workload with a device in share mode, use one of the following two application profile configurations:
- Specify a value of 0 for the SI_GPU_COMPUTE_MODE environment variable to indicate shared mode:
<Consumer applicationName="GpuSampleAppCPP"
          consumerId="/SampleApplications/SOASamples"
          ...
          resReq="select(gpushared > 0 && gpushared < 9)" />
<Service description="The GPU Sample Service" name="GpuSampleService" packageName="GpuSampleServiceCPP">
    <osTypes>
        <osType name="NTX64" startCmd="${SOAM_DEPLOY_DIR}/GpuSampleServiceCPP" workDir="${SOAM_HOME}/work">
            <env name="PATH">C:\Program Files\NVIDIA Corporation\NVSMI</env>
            <env name="SI_GPU_COMPUTE_MODE">0</env>
        </osType>
- Specify a value of 0 for the SI_GPU_COMPUTE_MODE environment variable to indicate shared mode, and specify 1.2 for the SI_GPU_CAPABILITY environment variable to indicate that the compute capability level of the target GPU device should be 1.2:
<Consumer applicationName="GpuSampleAppCPP"
          consumerId="/SampleApplications/SOASamples"
          ...
          resReq="select((gpushared > 0 && gpushared < 9) && (gpucap1_2 > 0))" />
<Service description="The GPU Sample Service" name="GpuSampleService" packageName="GpuSampleServiceCPP">
    <osTypes>
        <osType name="NTX64" startCmd="${SOAM_DEPLOY_DIR}/GpuSampleServiceCPP" workDir="${SOAM_HOME}/work">
            <env name="PATH">C:\Program Files\NVIDIA Corporation\NVSMI</env>
            <env name="SI_GPU_COMPUTE_MODE">0</env>
            <env name="SI_GPU_CAPABILITY">1.2</env>
        </osType>
- To run a workload with a device in exclusive thread mode (or in exclusive process mode), use one of the following two application profile configurations:
- Specify a value of 1 for the SI_GPU_COMPUTE_MODE environment variable to indicate exclusive thread mode:
<Consumer applicationName="GpuSampleAppCPP"
          consumerId="/SampleApplications/SOASamples"
          ...
          resReq="select(gpuexclusive_thread > 0 && gpuexclusive_thread < 9)" />
<Service description="The GPU Sample Service" name="GpuSampleService" packageName="GpuSampleServiceCPP">
    <osTypes>
        <osType name="NTX64" startCmd="${SOAM_DEPLOY_DIR}/GpuSampleServiceCPP" workDir="${SOAM_HOME}/work">
            <env name="PATH">C:\Program Files\NVIDIA Corporation\NVSMI</env>
            <env name="SI_GPU_COMPUTE_MODE">1</env>
        </osType>
If you use CUDA 8.0 or later, configure using gpuexclusive_process, as follows:
<Consumer applicationName="GpuSampleAppCPP"
          consumerId="/SampleApplications/SOASamples"
          ...
          resReq="select(gpuexclusive_process > 0 && gpuexclusive_process < 9)" />
<Service description="The GPU Sample Service" name="GpuSampleService" packageName="GpuSampleServiceCPP">
    <osTypes>
        <osType name="NTX64" startCmd="${SOAM_DEPLOY_DIR}/GpuSampleServiceCPP" workDir="${SOAM_HOME}/work">
            <env name="PATH">C:\Program Files\NVIDIA Corporation\NVSMI</env>
            <env name="SI_GPU_COMPUTE_MODE">3</env>
        </osType>
- Specify a value of 1 for the SI_GPU_COMPUTE_MODE environment variable to indicate exclusive thread mode, and specify 3.7 for the SI_GPU_CAPABILITY environment variable to indicate that the compute capability level of the target GPU device should be 3.7:
<Consumer applicationName="GpuSampleAppCPP"
          consumerId="/SampleApplications/SOASamples"
          ...
          resReq="select((gpuexclusive_thread > 0 && gpuexclusive_thread < 9) && (gpucap2_plus > 0))" />
<Service description="The GPU Sample Service" name="GpuSampleService" packageName="GpuSampleServiceCPP">
    <osTypes>
        <osType name="NTX64" startCmd="${SOAM_DEPLOY_DIR}/GpuSampleServiceCPP" workDir="${SOAM_HOME}/work">
            <env name="PATH">C:\Program Files\NVIDIA Corporation\NVSMI</env>
            <env name="SI_GPU_COMPUTE_MODE">1</env>
            <env name="SI_GPU_CAPABILITY">3.7</env>
        </osType>
If you use CUDA 8.0 or later, configure using gpuexclusive_process, as follows:
<Consumer applicationName="GpuSampleAppCPP"
          consumerId="/SampleApplications/SOASamples"
          ...
          resReq="select((gpuexclusive_process > 0 && gpuexclusive_process < 9) && (gpucap2_plus > 0))" />
<Service description="The GPU Sample Service" name="GpuSampleService" packageName="GpuSampleServiceCPP">
    <osTypes>
        <osType name="NTX64" startCmd="${SOAM_DEPLOY_DIR}/GpuSampleServiceCPP" workDir="${SOAM_HOME}/work">
            <env name="PATH">C:\Program Files\NVIDIA Corporation\NVSMI</env>
            <env name="SI_GPU_COMPUTE_MODE">3</env>
            <env name="SI_GPU_CAPABILITY">3.7</env>
        </osType>
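Before enabling either profile, it can help to confirm that at least one host in your resource group actually satisfies the configured resReq. A simple sketch is to list the same attributes that the select() string references; for example, for the CUDA 8.0 or later exclusive process configuration above:
egosh resource list -o ngpus,gpuexclusive_process,gpucap2_plus
A host qualifies when gpuexclusive_process falls within the configured bounds and gpucap2_plus is greater than 0.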