Tuning task scheduling
Tune task scheduler settings to control how tasks in a Spark application are scheduled and run.
Executor startup: Numbers and overhead
Once a slot is given to the task scheduler, it can start executors and run tasks. Unlike in Apache Spark, executors in IBM® Spectrum Conductor can be started on demand. You can also dynamically change the number of executors, and the number of slots on each executor, according to the total number of tasks in the current stage.
Normally, an executor on a host takes as many slots as it can get, and tasks are executed on different threads within this executor. In this case, the memory of this executor must be large enough to avoid Java OutOfMemoryError exceptions. However, a large JVM heap can degrade garbage-collection performance, and a single large executor is also a poor choice for fault tolerance: if it fails, all tasks running in it are lost at once.
To limit the maximum slots of an executor, use the SPARK_EGO_EXECUTOR_SLOTS_MAX parameter to set the maximum number of tasks that can run concurrently in one Spark executor. For example, if you have 16 slots and SPARK_EGO_EXECUTOR_SLOTS_MAX is set to 8, normally 2 executors are started, given that 8 slots are from the same host. If the 16 slots are distributed on 4 hosts, 4 executors are started.
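The executor counts above can be sketched as a ceiling division per host. This is an illustrative model of the arithmetic in the example, not scheduler internals; the host and slot counts below are assumptions matching the example:

```shell
# Illustrative only: model how many executors start for 16 slots when
# SPARK_EGO_EXECUTOR_SLOTS_MAX=8. An executor handles at most $max slots
# and never spans hosts, so take a ceiling per host.
max=8

# Scenario 1: 16 slots on 2 hosts (8 slots each) -> 2 executors.
hosts=2; slots_per_host=8
executors=$(( hosts * ( (slots_per_host + max - 1) / max ) ))
echo "scenario1=$executors"

# Scenario 2: 16 slots spread over 4 hosts (4 slots each) -> 4 executors.
hosts=4; slots_per_host=4
executors=$(( hosts * ( (slots_per_host + max - 1) / max ) ))
echo "scenario2=$executors"
```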
Having too many executors might burden the resource orchestrator and exhaust system resources. For example, entropy exhaustion in the random number generator while initializing Jetty causes an executor to hang for 3 seconds. In this case, set -Djava.security.egd=file:/dev/./urandom to use a pseudo random number generator (set by default in IBM Spectrum Conductor).
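One way to deliver this JVM option to executors is Spark's standard extraJavaOptions setting. The system property and value are from the text above; the spark-submit mechanism and the application name `your_app.py` are assumptions for illustration:

```shell
# Point executor JVMs at the non-blocking pseudo random source so Jetty
# initialization does not stall on entropy. IBM Spectrum Conductor sets
# this by default; shown here for completeness.
spark-submit \
  --conf "spark.executor.extraJavaOptions=-Djava.security.egd=file:/dev/./urandom" \
  your_app.py
```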
Resource fluctuation between stages
- Executor level: Normally, when an executor has no tasks, it is stopped and all of its slots are returned to the session scheduler so that other applications can reuse them. To extend the lifetime of an executor, configure the SPARK_EGO_EXECUTOR_SLOTS_RESERVE parameter to reserve slots on an idle executor, and configure how long those reserved slots are held through the SPARK_EGO_EXECUTOR_IDLE_TIMEOUT parameter. Although it occupies slots, an idle executor does not consume any CPU cycles. The process keeps running until the idle timeout expires, at which point all slots are returned. Any data cached on this executor is lost if the process is terminated; thus, for workloads that can have straggling tasks and straggling executors, set the timeout to a higher value.
The executor idle timeout only prevents the executor process from being stopped and restarted. To avoid waiting for slots to be assigned when new tasks arrive, use the SPARK_EGO_EXECUTOR_SLOTS_RESERVE parameter to keep a few reserved slots on the executor, so that new tasks can start immediately as they come. The more reserved slots there are, the more responsive an executor is when new tasks arrive.
- Slot level: The idle timeout can also be applied at the slot level through the SPARK_EGO_FREE_SLOTS_IDLE_TIMEOUT parameter. In this case, a slot is not returned until the timeout expires, which helps to avoid resource fluctuation between short stages, especially in machine learning scenarios where the same iterations are repeatedly executed.
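Combined, these parameters might appear in a spark-env.sh fragment like the following. The values and the assumption that the timeouts are in seconds are illustrative, not tuned recommendations:

```shell
# Illustrative spark-env.sh fragment; values are assumptions, not
# recommendations. Timeout units are assumed to be seconds.

# Keep 2 slots reserved on each idle executor so new tasks start immediately.
export SPARK_EGO_EXECUTOR_SLOTS_RESERVE=2

# Keep an idle executor (and its cached data) alive for 600 seconds.
export SPARK_EGO_EXECUTOR_IDLE_TIMEOUT=600

# Hold an idle slot for 30 seconds before returning it to the scheduler,
# smoothing resource fluctuation between short stages.
export SPARK_EGO_FREE_SLOTS_IDLE_TIMEOUT=30
```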
| Parameter | Description | When to use |
|---|---|---|
| SPARK_EGO_EXECUTOR_IDLE_TIMEOUT | Specifies the wait duration before stopping an idle executor; requires reserved slots to be defined as well. | When executors are frequently lost and restarting executors takes up a good proportion of running time, set this parameter to a suitable value. |
| SPARK_EGO_EXECUTOR_SLOTS_RESERVE | Specifies the minimum number of slots to reserve for idle executors. | When the overhead of adjusting slots takes up a good proportion of running time, set this parameter to a suitable value. |
| SPARK_EGO_FREE_SLOTS_IDLE_TIMEOUT | Specifies the wait duration before returning an idle slot. | When the overhead of adjusting slots takes up a good proportion of running time, set this parameter for more fine-grained control on slots. |
Adaptive scheduling for mixed cluster resources
In many applications, especially traditional machine-learning applications and deep-learning frameworks, tasks can run on either CPU or GPU resources. When your cluster is short on GPU resources, enable adaptive scheduling through the SPARK_EGO_GPU_ADAPTIVE_ENABLE parameter so that GPU tasks can be scheduled on CPU resources. For more information, see Enabling adaptive scheduling.
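For example, adaptive scheduling might be switched on in spark-env.sh as follows. The parameter name is from the text; the boolean value format is an assumption, so check the parameter's documented values:

```shell
# Assumed value format: allow GPU tasks to be scheduled on CPU resources
# when GPU slots are scarce.
export SPARK_EGO_GPU_ADAPTIVE_ENABLE=true
```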
Straggler tasks
If you find straggler tasks in your applications, enable speculation through the spark.speculation parameter to rerun those tasks concurrently on other executors; the result of whichever copy finishes first is used. The rerun copies can benefit from better data locality and run faster. For more information, see Spark scheduling parameters.
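A sketch of enabling speculation at submit time. spark.speculation is the parameter named above; spark.speculation.multiplier and spark.speculation.quantile are standard Apache Spark tuning knobs included for illustration, and `your_app.py` is a hypothetical application name:

```shell
# spark.speculation.multiplier: a task is eligible for speculation when it
# runs this many times slower than the median task.
# spark.speculation.quantile: fraction of tasks in a stage that must finish
# before speculation kicks in.
spark-submit \
  --conf spark.speculation=true \
  --conf spark.speculation.multiplier=1.5 \
  --conf spark.speculation.quantile=0.75 \
  your_app.py
```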