Create an instance group
that allocates GPU slots to Spark executors.
About this task
Follow these steps to create an instance group that uses GPU resources to run
its applications. This task covers only the GPU allocation steps; for the full procedure,
see Creating instance groups.
Procedure
-
In the Basic Settings tab, click the
Configuration link to customize the Spark version properties for the
following GPU parameters. If you do not make changes, the default values are used.
- SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK
- Specifies a comma-separated list of SPARK_EGO_GPU_SLOTS_PER_TASK values.
When specified, the Spark master service
is scaled up to provide, at a minimum, one service instance for each value
specified. A maximum of five values is supported. If no list is specified (the default), the
SPARK_EGO_GPU_SLOTS_PER_TASK value takes effect for all Spark master service instances.
To prevent a
Spark master instance from becoming stuck
while it waits for executors on another instance to finish, enable fair share scheduling for
executors or select a multidimensional resource plan for executors for the instance group. For more information, see Setting consumers and resource groups for an instance group.
- SPARK_EGO_GPU_EXECUTOR_SLOTS_MAX
- Specifies the maximum number of GPU tasks that can run concurrently in one GPU executor. Default
is Integer.MAX_VALUE.
- SPARK_EGO_GPU_EXECUTOR_SLOTS_RESERVE
- Specifies the minimum number of GPU slots to reserve after the executor starts. The number of
GPU executor slots is the minimum of the reserved number and the owned number. Default is 1. The
timeout for the reserved GPU slots is set by SPARK_EGO_EXECUTOR_IDLE_TIMEOUT.
- SPARK_EGO_GPU_MODE
- Specifies whether the executor requires exclusive or shared GPUs. Valid values are Exclusive
and Shared. Default is Shared.
- SPARK_EGO_GPU_SLOTS_MAX
- Specifies the maximum number of slots that an application can get for GPU tasks in Spark master mode. Default is
Integer.MAX_VALUE.
- SPARK_EGO_GPU_SLOTS_PER_TASK
- Specifies
the number of slots that are allocated to GPU Spark tasks:
- To enable one GPU Spark task to run with multiple EGO slots, specify a positive integer
(for example, 1, 2, or 3). For example, setting
SPARK_EGO_GPU_SLOTS_PER_TASK=2 means that each task can run on a maximum of
two EGO slots.
- To enable multiple GPU Spark tasks to run with a single EGO slot, specify a negative integer
that is less than -1 (for example, -2, -3, or -4). For example, setting
SPARK_EGO_GPU_SLOTS_PER_TASK=-2 means that two tasks run on a
single EGO slot.
The initial number of tasks of an executor multiplied by the
SPARK_EGO_GPU_SLOTS_PER_TASK value must be less than or equal to the number of GPUs
on the host where the executor runs. Default is 1.
When
this parameter takes effect for GPU scheduling, the number of slots in the egosh
alloc command output equals the number of running GPU tasks times the value of
SPARK_EGO_GPU_SLOTS_PER_TASK.
- SPARK_EGO_SLOTS_REQUIRED_TIMEOUT
- Specifies the time, in seconds, to wait for a Spark application to get the required number of slots,
including CPU and GPU slots, before launching tasks. After this time elapses, any held slots are
released and the application fails. Default is Integer.MAX_VALUE.
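To make the parameter semantics concrete, the settings above might be combined as in the following sketch. All values here are illustrative assumptions for a hypothetical cluster, not recommendations:

```shell
# Illustrative values only -- substitute settings that suit your cluster.
export SPARK_EGO_GPU_MODE=Shared                  # Shared or Exclusive
export SPARK_EGO_GPU_SLOTS_PER_TASK=2             # each GPU task holds 2 EGO slots
export SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK=1,2 # comma-separated list, 5 values maximum
export SPARK_EGO_GPU_EXECUTOR_SLOTS_RESERVE=1     # reserve 1 GPU slot per started executor
export SPARK_EGO_SLOTS_REQUIRED_TIMEOUT=600       # give up on missing slots after 600 seconds

# Constraint from the description above: the initial tasks of an executor times
# SPARK_EGO_GPU_SLOTS_PER_TASK must not exceed the GPUs on the executor's host.
initial_tasks=4
gpus_on_host=8
required=$(( initial_tasks * SPARK_EGO_GPU_SLOTS_PER_TASK ))
[ "$required" -le "$gpus_on_host" ] && echo "fits: $required slots <= $gpus_on_host GPUs"

# With a positive value, egosh alloc reports running GPU tasks * slots-per-task:
running_tasks=4
slots=$(( running_tasks * SPARK_EGO_GPU_SLOTS_PER_TASK ))
echo "egosh alloc reports $slots slots"
# A negative value inverts the ratio: -2 would run two GPU tasks per EGO slot.
```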
-
If your cluster is installed on a shared file system, decide whether you want to enable the
shuffle service for the instance group.
- If you do not enable the shuffle service, set the spark.local.dir
parameter to a shared directory on the file system.
- If you want to enable the shuffle service, enable and configure it. When
you enable the shuffle service, a new consumer is created by default exclusively for the shuffle
service. If you want to change this default consumer, the shuffle service consumer must be
associated with only two resource groups (or a resource plan): one for CPU scheduling and the other
for GPU scheduling. For more information on enabling the shuffle service in a shared file system,
see Enabling and configuring the Spark shuffle service.
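If you do not enable the shuffle service, the spark.local.dir setting must point at the shared file system. The following is a minimal sketch; the shared-directory path is a hypothetical mount point, and the setting is shown against a local copy of spark-defaults.conf for illustration:

```shell
# Hypothetical shared file system mount; use your cluster's actual shared directory.
SHARED_DIR=/gpfs/spark-local
# Append the setting to the instance group's Spark configuration
# (a local spark-defaults.conf stands in for the real file here).
echo "spark.local.dir ${SHARED_DIR}" >> spark-defaults.conf
grep "spark.local.dir" spark-defaults.conf
```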
-
Enable GPU slot allocation and specify the resource group (or multidimensional resource plan)
from which resources are allocated to executors in the instance group.
Make sure that the CPU executor resource group contains all the CPU and GPU executor hosts;
otherwise, GPU slots are used for the shuffle service.
-
In the Resource Groups and Plans section, select the CPU resource group for use by Spark
executors (CPU slots); for example, the CPUrg resource group that you
created.
-
Select the GPU resource group for use by Spark executors (GPU slots); for example, the
GPUrg resource group that you created. Ensure that you do not select the resource group
that is used by Spark drivers.
Results
The instance group is set up
for GPU allocation.
What to do next
- Create and deploy the instance group. After you start the instance group,
GPU slots (in addition to CPU slots) are allocated to Spark executors in the instance group. See Starting instance groups.
Tip: If
you create and deploy the instance group with Spark and other components (such as notebooks with GPUs), the GPU memory usage on the
Overview tab reflects only Spark applications, not the other
instance group components. For example,
a message on the Overview tab that GPU allocation is not enabled indicates that no
Spark applications are using GPUs; it does not mean that no GPUs are in use.
- Submit a Spark application that uses GPUs to the instance group. See either Submitting a Spark application with GPU RDD or Submitting a Spark application without GPU RDD.
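As a generic sketch of that last step (the linked topics describe the product-specific submission flow), a GPU-enabled application is ultimately launched through the Spark version's spark-submit. The paths, master URL, and application name below are all placeholders, not values from this documentation:

```shell
# All names here are placeholders for illustration only.
SPARK_HOME=${SPARK_HOME:-/opt/spark}      # deployment directory of the Spark version
APP=my_gpu_app.py                         # a hypothetical GPU-using application
MASTER_URL='spark://<master-host>:<port>' # the instance group's Spark master URL
cmd="$SPARK_HOME/bin/spark-submit --master $MASTER_URL $APP"
echo "$cmd"
```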