GPU application configuration
Understand the configuration required for GPU applications.
Error handling
Error handling, like other application-specific behavior, can be configured in the application profile. See Service error handling control for more details. Note that onGpuInvoke() does not currently have its own error handler in the application profile. If an error occurs in onGpuInvoke(), it is handled according to the onInvoke() error settings in the application profile.
GPU scheduling and application profile
IBM® Spectrum Symphony schedules one available GPU device per GPU service instance. With exclusive IBM Spectrum Symphony scheduling, the device is used exclusively by the service until the service is shut down. The choice of which GPU device to schedule can be controlled to a certain extent by defining specific rules and variables in the application profile of the GPU application.
- As of CUDA 8.0, compute mode support has changed: exclusive thread compute mode has been deprecated and replaced with exclusive process mode.
- If you want to run the GPU application in exclusive thread mode (or exclusive process mode, if you use CUDA 8.0 or later), you must switch the card to this mode and configure the corresponding value of SI_GPU_COMPUTE_MODE in the application profile, as shown in the sketch after this list.
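Switching the card itself happens outside the application profile, using the NVIDIA driver tools. Here is a minimal sketch, assuming the nvidia-smi utility is on the PATH, you have administrator rights on the host, and device index 0 is the card in question:
# Before CUDA 8.0: switch device 0 to exclusive thread mode
nvidia-smi -i 0 -c EXCLUSIVE_THREAD
# CUDA 8.0 or later: switch device 0 to exclusive process mode
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
# Verify the current compute mode of device 0
nvidia-smi -q -d COMPUTE -i 0
Note that newer NVIDIA drivers no longer accept EXCLUSIVE_THREAD, matching its deprecation in CUDA 8.0.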
Set resource requirements
A basic knowledge of resource requirement strings (resReq) in IBM Spectrum Symphony is assumed.
Generally, there are two levels of scheduling in IBM Spectrum Symphony. At the global level, IBM Spectrum Symphony schedules the host where a particular service instance of the application will run. At the local level, GPU device scheduling takes place during service instance initialization. Ideally, the service will not start on a host that has no available GPU devices compatible with the restrictions defined in the application profile. This is especially useful when not all hosts in the resource group assigned to your application have GPU devices.
- In the cluster management console, click .
The Applications page displays.
- Click your application.
The Application Profile page displays.
- Select Advanced Configuration.
- In the Resource Requirements field, replace the current contents with the following:
select(gpuexclusive_thread > 0 && gpuexclusive_thread < 9)
If you use CUDA 8.0 or later, specify this content instead:
select(gpuexclusive_process > 0 && gpuexclusive_process < 9)
- Click Save.
The Confirmation window displays.
- Review your selections, and then click Confirm.
Check GPU device information using CLI
- Determine the number of GPU devices. For example, run this command:
egosh resource list -o ngpus,gpushared,gpuexclusive_thread,gpucap1_0,gpucap1_1,gpucap1_2,gpucap1_3,gpucap2_plus
This command shows the following output:
NAME   ngpus  gpushared  gpuexclusive_thread  gpucap1_0  gpucap1_1  gpucap1_2  gpucap1_3  gpucap2_plus
HostA  2      1.0        1.0                  0.0        0.0        1.0        0.0        1.0
If you use CUDA 8.0 or later, run this command:
egosh resource list -o ngpus,gpushared,gpuexclusive_process,gpucap1_0,gpucap1_1,gpucap1_2,gpucap1_3,gpucap2_plus
For CUDA 8.0 or later, the command shows this output:
NAME   ngpus  gpushared  gpuexclusive_process  gpucap1_0  gpucap1_1  gpucap1_2  gpucap1_3  gpucap2_plus
HostA  2      1.0        1.0                   0.0        0.0        1.0        0.0        1.0
This output shows that there are two GPU devices: one is in share mode and one is in exclusive thread (or exclusive process) mode; one device has a compute capability level of 1.2, whereas the other has a capability level of 2.0 or later. Here is a breakdown to help understand this output (a usage sketch follows the list):
ngpus
- Shows the number of GPU devices.
gpushared
- Shows the number of GPU devices in share mode.
gpuexclusive_thread (or gpuexclusive_process)
- Shows the number of GPU devices in exclusive thread mode (or in exclusive process mode).
gpucap1_0
- Shows the number of GPU devices with compute capability level 1.0.
gpucap1_1
- Shows the number of GPU devices with compute capability level 1.1.
gpucap1_2
- Shows the number of GPU devices with compute capability level 1.2.
gpucap1_3
- Shows the number of GPU devices with compute capability level 1.3.
gpucap2_plus
- Shows the number of GPU devices with compute capability level 2.0 or later.
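These attribute names can be referenced directly in a resource requirement string. As a minimal sketch (the threshold values are illustrative), the following string selects only hosts that have more than one GPU device and at least one device in share mode:
select(ngpus > 1 && gpushared > 0)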
- Determine a GPU device's compute mode value (SI_GPU_COMPUTE_MODE) and capability value (SI_GPU_CAPABILITY). For example, run this command to retrieve information for device 0:
egosh resource list -o gpumode0,gpucapver0
The following example output for this command shows that device 0's compute mode value is 0.0 (to indicate shared mode), and its capability value is 3.7:
NAME   gpumode0  gpucapver0
HostA  0.0       3.7
Additionally, you can run a similar command to determine values for other devices. For example, to view the values for device 1, run:
egosh resource list -o gpumode1,gpucapver1
Here is example output for CUDA versions before CUDA 8.0, which shows that device 1's compute mode value is 1.0 (to indicate exclusive thread mode), and its capability value is 2.0:
NAME   gpumode1  gpucapver1
HostA  1.0       2.0
Here is example output for CUDA 8.0 or later, which shows that device 1's compute mode value is 3.0 (to indicate exclusive process mode), and its capability value is 2.0:
NAME   gpumode1  gpucapver1
HostA  3.0       2.0
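You can also retrieve the values for several devices in a single call. For example, this command (a sketch that simply combines the attribute names used above; the output shown is what the HostA examples in this section would produce) lists the mode and capability values for devices 0 and 1 at once:
egosh resource list -o ngpus,gpumode0,gpucapver0,gpumode1,gpucapver1
NAME   ngpus  gpumode0  gpucapver0  gpumode1  gpucapver1
HostA  2      0.0       3.7         1.0       2.0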
Profile environment variables
The SI_GPU_COMPUTE_MODE environment variable sets the GPU device's compute mode value, and SI_GPU_CAPABILITY sets the capability value. Set these variables within your application profile.
- SI_GPU_COMPUTE_MODE
- Controls the compute mode of the target GPU device. Only devices that match the configured mode will be scheduled for the application's services. Valid values are as follows:
- 0
- Specify a value of 0 for the SI_GPU_COMPUTE_MODE environment variable to indicate shared mode.
Note that setting this variable to 0 instructs IBM Spectrum Symphony to schedule only devices that are set to shared mode; IBM Spectrum Symphony scheduling itself remains exclusive.
- 1
- For CUDA versions before CUDA 8.0, specify a value of 1 for the SI_GPU_COMPUTE_MODE environment variable to indicate exclusive thread mode.
- 3
- For CUDA 8.0 or later, specify a value of 3 for the SI_GPU_COMPUTE_MODE environment variable to indicate exclusive process mode.
Note: If this variable is not set or is set to an empty value, IBM Spectrum Symphony does not check the mode of a GPU device before scheduling it for a service. If there are prohibited GPU devices in your system, always set SI_GPU_COMPUTE_MODE to one of these values.
- SI_GPU_CAPABILITY
- Controls the compute capability level of the target GPU device. Only devices that match the configured GPU capability level will be scheduled for the application's services. Valid values are gpucap1_0, gpucap1_1, gpucap1_2, gpucap1_3, or gpucap2_plus. These values correspond to the supported GPU compute capability levels: 1.0, 1.1, 1.2, 1.3, and 2.0 or later.
Tip: For details about the SI_GPU_CAPABILITY values, see the previous section about determining the number of GPU devices.
Use the SI_GPU_CAPABILITY environment variable to control the capability level of the GPU device. For example, suppose a host has two GPU devices: one with a capability level of 1.1 and another with a capability level of 2.0. To restrict your application to the device with capability level 1.1, follow the process described for setting resource requirements and update the resource requirement with a gpucap1_1 value, as illustrated in the following string:
select((gpuexclusive_thread > 0 && gpuexclusive_thread < 9) && (gpucap1_1==1))
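Following the same pattern used elsewhere in this section, if you use CUDA 8.0 or later, substitute gpuexclusive_process in the restriction (a sketch):
select((gpuexclusive_process > 0 && gpuexclusive_process < 9) && (gpucap1_1==1))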
- To run a workload with a device in share mode, use one of the following two application profile configurations:
- Specify a value of 0 for the SI_GPU_COMPUTE_MODE environment variable to indicate shared mode:
<Consumer applicationName="GpuSampleAppCPP"
          consumerId="/SampleApplications/SOASamples"
          ...
          resReq="select(gpushared > 0 && gpushared < 9)" />
<Service description="The GPU Sample Service" name="GpuSampleService" packageName="GpuSampleServiceCPP">
    <osTypes>
        <osType name="NTX64" startCmd="${SOAM_DEPLOY_DIR}/GpuSampleServiceCPP" workDir="${SOAM_HOME}/work">
            <env name="PATH">C:\Program Files\NVIDIA Corporation\NVSMI</env>
            <env name="SI_GPU_COMPUTE_MODE">0</env>
        </osType>
- Specify a value of 0 for the SI_GPU_COMPUTE_MODE environment variable to indicate shared mode, and specify 1.2 for the SI_GPU_CAPABILITY environment variable to indicate that the compute capability level of the target GPU device should be 1.2:
<Consumer applicationName="GpuSampleAppCPP"
          consumerId="/SampleApplications/SOASamples"
          ...
          resReq="select((gpushared > 0 && gpushared < 9) && (gpucap1_2 > 0))" />
<Service description="The GPU Sample Service" name="GpuSampleService" packageName="GpuSampleServiceCPP">
    <osTypes>
        <osType name="NTX64" startCmd="${SOAM_DEPLOY_DIR}/GpuSampleServiceCPP" workDir="${SOAM_HOME}/work">
            <env name="PATH">C:\Program Files\NVIDIA Corporation\NVSMI</env>
            <env name="SI_GPU_COMPUTE_MODE">0</env>
            <env name="SI_GPU_CAPABILITY">1.2</env>
        </osType>
- To run a workload with a device in exclusive thread mode (or in exclusive process mode), use one of the following two application profile configurations:
- Specify a value of 1 for the SI_GPU_COMPUTE_MODE environment variable to indicate exclusive thread mode:
<Consumer applicationName="GpuSampleAppCPP"
          consumerId="/SampleApplications/SOASamples"
          ...
          resReq="select(gpuexclusive_thread > 0 && gpuexclusive_thread < 9)" />
<Service description="The GPU Sample Service" name="GpuSampleService" packageName="GpuSampleServiceCPP">
    <osTypes>
        <osType name="NTX64" startCmd="${SOAM_DEPLOY_DIR}/GpuSampleServiceCPP" workDir="${SOAM_HOME}/work">
            <env name="PATH">C:\Program Files\NVIDIA Corporation\NVSMI</env>
            <env name="SI_GPU_COMPUTE_MODE">1</env>
        </osType>
If you use CUDA 8.0 or later, configure using gpuexclusive_process, as follows:
<Consumer applicationName="GpuSampleAppCPP"
          consumerId="/SampleApplications/SOASamples"
          ...
          resReq="select(gpuexclusive_process > 0 && gpuexclusive_process < 9)" />
<Service description="The GPU Sample Service" name="GpuSampleService" packageName="GpuSampleServiceCPP">
    <osTypes>
        <osType name="NTX64" startCmd="${SOAM_DEPLOY_DIR}/GpuSampleServiceCPP" workDir="${SOAM_HOME}/work">
            <env name="PATH">C:\Program Files\NVIDIA Corporation\NVSMI</env>
            <env name="SI_GPU_COMPUTE_MODE">3</env>
        </osType>
- Specify a value of 1 for the SI_GPU_COMPUTE_MODE environment variable to indicate exclusive thread mode, and specify 3.7 for the SI_GPU_CAPABILITY environment variable to indicate that the compute capability level of the target GPU device should be 3.7:
<Consumer applicationName="GpuSampleAppCPP"
          consumerId="/SampleApplications/SOASamples"
          ...
          resReq="select((gpuexclusive_thread > 0 && gpuexclusive_thread < 9) && (gpucap2_plus > 0))" />
<Service description="The GPU Sample Service" name="GpuSampleService" packageName="GpuSampleServiceCPP">
    <osTypes>
        <osType name="NTX64" startCmd="${SOAM_DEPLOY_DIR}/GpuSampleServiceCPP" workDir="${SOAM_HOME}/work">
            <env name="PATH">C:\Program Files\NVIDIA Corporation\NVSMI</env>
            <env name="SI_GPU_COMPUTE_MODE">1</env>
            <env name="SI_GPU_CAPABILITY">3.7</env>
        </osType>
If you use CUDA 8.0 or later, configure using gpuexclusive_process, as follows:
<Consumer applicationName="GpuSampleAppCPP"
          consumerId="/SampleApplications/SOASamples"
          ...
          resReq="select((gpuexclusive_process > 0 && gpuexclusive_process < 9) && (gpucap2_plus > 0))" />
<Service description="The GPU Sample Service" name="GpuSampleService" packageName="GpuSampleServiceCPP">
    <osTypes>
        <osType name="NTX64" startCmd="${SOAM_DEPLOY_DIR}/GpuSampleServiceCPP" workDir="${SOAM_HOME}/work">
            <env name="PATH">C:\Program Files\NVIDIA Corporation\NVSMI</env>
            <env name="SI_GPU_COMPUTE_MODE">3</env>
            <env name="SI_GPU_CAPABILITY">3.7</env>
        </osType>
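Before enabling either profile, it can help to confirm that at least one host in your resource group actually satisfies the configured resReq. A simple sketch is to list the same attributes that the select() string references; for example, for the CUDA 8.0 or later exclusive process configuration above:
egosh resource list -o ngpus,gpuexclusive_process,gpucap2_plus
A host qualifies when gpuexclusive_process falls within the configured bounds and gpucap2_plus is greater than 0.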