Create an instance group
that allocates GPU slots to Spark executors.
About this task
Follow these steps to create an instance group that uses GPU resources to run
its applications. This task covers only the GPU allocation steps; for the full procedure,
see Creating instance groups.
Procedure
-
In the Basic Settings tab, click the
Configuration link to customize the Spark version properties for the
following GPU parameters. If you do not make changes, the default values are used.
- SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK
- Specifies a comma-separated list of SPARK_EGO_GPU_SLOTS_PER_TASK values.
When specified, the Spark master service
is scaled up to provide, at a minimum, one service instance for each value
specified. A maximum of five values is supported. If no list is specified (the default), the
SPARK_EGO_GPU_SLOTS_PER_TASK value takes effect for all Spark master service instances.
To prevent a
Spark master instance from becoming stuck
while it waits for executors on another instance to finish, enable fair share scheduling for
executors or select a multidimensional resource plan for executors for the instance group. For more information, see Setting consumers and resource groups for an instance group.
- SPARK_EGO_GPU_EXECUTOR_SLOTS_MAX
- Specifies the maximum number of GPU tasks that can run concurrently in one GPU executor. Default
is Integer.MAX_VALUE.
- SPARK_EGO_GPU_EXECUTOR_SLOTS_RESERVE
- Specifies the minimum number of GPU slots to reserve after the executor starts. The number of
GPU executor slots is the minimum of the reserved number and the owned number. Default is 1. The
timeout for the reserved GPU slots is set by SPARK_EGO_EXECUTOR_IDLE_TIMEOUT.
- SPARK_EGO_GPU_MODE
- Specifies whether the executor requires exclusive or shared GPUs. Valid values are Exclusive
and Shared. Default is Shared.
- SPARK_EGO_GPU_SLOTS_MAX
- Specifies the maximum number of slots that an application can get for GPU tasks in Spark master mode. Default is
Integer.MAX_VALUE.
- SPARK_EGO_GPU_SLOTS_PER_TASK
- Specifies
the number of slots that are allocated to GPU Spark tasks:
- To enable one GPU Spark task to run with multiple EGO slots, specify a positive integer
(for example, 1, 2, or 3). For example, setting
SPARK_EGO_GPU_SLOTS_PER_TASK=2 means that each task can run on a maximum of
two EGO slots.
- To enable multiple GPU Spark tasks to run with a single EGO slot, specify a negative integer
that is less than -1 (for example, -2, -3, or -4). For example, setting
SPARK_EGO_GPU_SLOTS_PER_TASK=-2 means that two tasks run on a
single EGO slot.
The initial number of tasks of an executor multiplied by the
SPARK_EGO_GPU_SLOTS_PER_TASK value must be less than or equal to the number of GPUs
on the host where the executor runs. Default is 1.
When
this parameter takes effect for GPU scheduling, the number of slots in the egosh
alloc command output equals the number of running GPU tasks times the value of
SPARK_EGO_GPU_SLOTS_PER_TASK.
- SPARK_EGO_SLOTS_REQUIRED_TIMEOUT
- Specifies the time, in seconds, to wait for a Spark application to get the required number of slots,
including CPU and GPU slots, before launching tasks. After this time elapses, any held slots are
released and the application fails. Default is Integer.MAX_VALUE.
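To make the parameter semantics concrete, the settings above might be combined as in the following sketch. All values here are illustrative assumptions for a hypothetical cluster, not recommendations:

```shell
# Illustrative values only -- substitute settings that suit your cluster.
export SPARK_EGO_GPU_MODE=Shared                  # Shared or Exclusive
export SPARK_EGO_GPU_SLOTS_PER_TASK=2             # each GPU task holds 2 EGO slots
export SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK=1,2 # comma-separated list, 5 values maximum
export SPARK_EGO_GPU_EXECUTOR_SLOTS_RESERVE=1     # reserve 1 GPU slot per started executor
export SPARK_EGO_SLOTS_REQUIRED_TIMEOUT=600       # give up on missing slots after 600 seconds

# Constraint from the description above: the initial tasks of an executor times
# SPARK_EGO_GPU_SLOTS_PER_TASK must not exceed the GPUs on the executor's host.
initial_tasks=4
gpus_on_host=8
required=$(( initial_tasks * SPARK_EGO_GPU_SLOTS_PER_TASK ))
[ "$required" -le "$gpus_on_host" ] && echo "fits: $required slots <= $gpus_on_host GPUs"

# With a positive value, egosh alloc reports running GPU tasks * slots-per-task:
running_tasks=4
slots=$(( running_tasks * SPARK_EGO_GPU_SLOTS_PER_TASK ))
echo "egosh alloc reports $slots slots"
# A negative value inverts the ratio: -2 would run two GPU tasks per EGO slot.
```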
-
If your cluster is installed on a shared file system, decide whether you want to enable the
shuffle service for the instance group.
- If you do not enable the shuffle service, set the spark.local.dir
parameter to a shared directory on the file system.
- If you want to enable the shuffle service, enable and configure it. When
you enable the shuffle service, a new consumer is created by default exclusively for the shuffle
service. If you want to change this default consumer, the shuffle service consumer must be
associated with only two resource groups (or a resource plan): one for CPU scheduling and the other
for GPU scheduling. For more information on enabling the shuffle service in a shared file system,
see Enabling and configuring the Spark shuffle service.
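If you do not enable the shuffle service, the spark.local.dir setting must point at the shared file system. The following is a minimal sketch; the shared-directory path is a hypothetical mount point, and the setting is shown against a local copy of spark-defaults.conf for illustration:

```shell
# Hypothetical shared file system mount; use your cluster's actual shared directory.
SHARED_DIR=/gpfs/spark-local
# Append the setting to the instance group's Spark configuration
# (a local spark-defaults.conf stands in for the real file here).
echo "spark.local.dir ${SHARED_DIR}" >> spark-defaults.conf
grep "spark.local.dir" spark-defaults.conf
```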
-
Enable GPU slot allocation and specify the resource group (or multidimensional resource plan)
from which resources are allocated to executors in the instance group.
Make sure that the CPU executor resource group contains all the CPU and GPU executor hosts;
otherwise, GPU slots are used for the shuffle service.
-
In the Resource Groups and Plans section, select the CPU resource group for use by Spark
executors (CPU slots); for example, the CPUrg resource group that you
created.
-
Select the GPU resource group for use by Spark executors (GPU slots); for example, the
GPUrg resource group that you created. Ensure that you do not select the resource group
that is used by Spark drivers.
Results
The instance group is set up
for GPU allocation.
What to do next
- Create and deploy the instance group. After you start the instance group,
GPU slots (in addition to CPU slots) are allocated to Spark executors in the instance group. See Starting instance groups.
Tip: If
you create and deploy the instance group with Spark and other components (such as notebooks with GPUs), the GPU memory usage on the
Overview tab reflects only Spark applications, not the other
instance group components. For example,
a message on the Overview tab that GPU allocation is not enabled indicates that no
Spark applications are using GPUs; it does not mean that no GPUs are in use.
- Submit a Spark application that uses GPUs to the instance group. See either Submitting a Spark application with GPU RDD or Submitting a Spark application without GPU RDD.
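As a generic sketch of that last step (the linked topics describe the product-specific submission flow), a GPU-enabled application is ultimately launched through the Spark version's spark-submit. The paths, master URL, and application name below are all placeholders, not values from this documentation:

```shell
# All names here are placeholders for illustration only.
SPARK_HOME=${SPARK_HOME:-/opt/spark}      # deployment directory of the Spark version
APP=my_gpu_app.py                         # a hypothetical GPU-using application
MASTER_URL='spark://<master-host>:<port>' # the instance group's Spark master URL
cmd="$SPARK_HOME/bin/spark-submit --master $MASTER_URL $APP"
echo "$cmd"
```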