GPU overview

You can improve the performance of your Spark applications by offloading certain processing functions from your processor (CPU) to a graphics processing unit (GPU); specific hardware and software requirements apply.

A GPU is designed to optimize parallel processing. An NVIDIA GPU can potentially have thousands of cores that can run the same instruction at the same time, such as operating on each pixel within an image. By comparison, a CPU has a few cores that are ideal for serial processing, carrying out different instructions sequentially.

GPU scheduling for instance groups

You can enable GPU scheduling so that applications can use graphics processing units (GPUs) in addition to CPUs in your cluster. GPU scheduling is available only with certain Spark versions; versions 1.5.2 and 3.0.0 are not supported.

GPUs are designed to optimize parallel processing. If your Spark application contains code that would benefit from parallel processing, you can run that part of the code on the GPU, and the rest of the code on the CPU.

IBM® Spectrum Conductor interfaces with the Spark scheduler to ensure that GPU resources are assigned to the applications that can use them. A stage marked for GPU processing is scheduled to the GPU resource group, while other stages are scheduled to the CPU resource group.

You can create an instance group and enable GPU slots to be allocated to Spark executors.

GPU configuration for Spark applications

There are two ways to configure GPU for Spark applications:
  • Configure GPU Resilient Distributed Dataset (RDD) in your Spark application, which supports adaptive GPU scheduling.
  • Set the Spark parameter spark.ego.gpu.app=true, which does not support adaptive GPU scheduling, but does specify that the Spark application is a GPU application that can schedule jobs on GPU resources.
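As an illustration of the second method, the parameter can be passed at submission time with spark-submit. The application class, JAR path, and master URL below are placeholders, and the mode value string is an assumption; check your cluster's configuration reference for the exact values:

```shell
# Submit a Spark application flagged as a GPU application.
# spark.ego.gpu.app=true marks the application for GPU scheduling,
# but does not enable adaptive GPU scheduling.
# Class name, JAR path, and master URL are placeholders.
spark-submit \
  --class com.example.GpuApp \
  --master spark://master-host:7077 \
  --conf spark.ego.gpu.app=true \
  /path/to/gpu-app.jar
```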
For both configuration methods, you can also specify the GPU mode to use (exclusive mode or default mode), so that the Spark executor can be started on the corresponding GPU that has the mode you request:
Exclusive GPU mode
IBM Spectrum Conductor obtains an exclusive lock on the GPU device, so that when you submit a process, it uses that GPU for itself and no other processes can use it.
Default GPU mode
IBM Spectrum Conductor shares the GPU, so that any process can use it.

Specify the GPU mode by configuring either the Spark spark.ego.gpu.mode parameter or the SPARK_EGO_GPU_MODE environment variable. As a best practice, configure all the GPUs on a host to use the same GPU mode (either all exclusive or all default); a single mode makes it easy to calculate the slots for the host (resource group). If a host must have both exclusive and default (shared) GPU modes configured, then, as a best practice, count each GPU as one slot.
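The slot-counting best practice above can be sketched as follows. This is an illustrative calculation, not a Conductor API; the function name and the slots_per_gpu parameter are hypothetical, and the rule of counting each GPU as one slot on a mixed-mode host follows the guidance stated above:

```python
def gpu_slots(modes, slots_per_gpu=1):
    """Illustrative slot count for one host, given its GPU modes.

    modes: a list with one entry per GPU, each "exclusive" or "default".
    slots_per_gpu: how many slots to count per GPU when all GPUs on
    the host share a single mode (a hypothetical tuning knob).
    """
    if len(set(modes)) <= 1:
        # Uniform mode on the host: slots can be computed uniformly.
        return len(modes) * slots_per_gpu
    # Mixed exclusive/default modes: count each GPU as a single slot,
    # per the best practice for mixed-mode hosts.
    return len(modes)

# A host with four GPUs, all in default mode, two slots per GPU:
print(gpu_slots(["default"] * 4, slots_per_gpu=2))           # 8
# A mixed-mode host: one slot per GPU, regardless of slots_per_gpu:
print(gpu_slots(["exclusive", "default"], slots_per_gpu=2))  # 2
```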

Tip: If you manually run the nvidia-smi command to change the GPU mode, you must restart EGO on the host afterwards so that EGO can detect the changed GPU mode.
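The manual sequence in the tip might look like the following sketch. The nvidia-smi compute-mode flag is standard NVIDIA tooling; the egosh restart command is an assumption based on a typical EGO installation, so verify the exact command for your cluster:

```shell
# Set GPU 0 to exclusive-process compute mode (requires root).
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS

# Restart EGO on this host so that EGO detects the changed GPU mode.
# (Command assumed from a typical EGO installation.)
egosh ego restart $(hostname)
```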

GPU usage

You can monitor GPU usage for instance groups and Spark applications in the cluster management console. To enable GPU monitoring charts and table columns in the cluster management console, you run a script on the primary host. You also need to configure GPU resource groups to run the GPU workload.

With GPU monitoring enabled, for instance groups, you see the following information:
  • The number of CPU and GPU slots used.
  • The amount of memory and GPU utilization in use, and the total available, for all instance groups.
  • CPU and GPU resource usage in chart form to quickly identify which instance groups are using the most CPU or GPU slots. You can also view the total usage in chart form.
For Spark applications, you see the following information:
  • The amount of memory and GPU utilization in use, and the total available, for the applications.
  • Current values (total, average, maximum, and minimum) across all GPU devices that are used by the applications.
  • Type of slot that is being used by the executors.
  • CPU and GPU resource usage for the application in chart form, to quickly identify GPU slot usage and the number of executors running on GPUs.