Tuning parameters for shared Spark batch applications

Tune shared Spark batch application settings to prevent loss of RDD block data and to set the size of the shared Spark context pool.

Before you begin

  • You can tune shared Spark batch application settings only with certain Spark versions. Spark versions 1.5.2 and 3.0.0 are not supported.
  • You must be a cluster administrator, consumer administrator, or have the Spark Instance Groups Configure permission to modify the configuration of the shared Spark batch application's instance group.

About this task

Shared Spark batch applications require Spark executors to stay alive for longer periods, even when they are not running workload. Without this configuration, executors exit as soon as a job completes, and the RDD block data they hold is lost. To avoid losing RDD block data, set a longer idle timeout for executors that hold cached data blocks. The higher the timeout, the better the chance of preserving RDD block data.
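For example, a shared application might cache an RDD so that its blocks remain available to later jobs; the idle timeout controls how long the executors that hold those cached blocks stay alive. The following Scala sketch is illustrative only (the data source and key logic are assumptions, and sc is an existing SparkContext):

    import org.apache.spark.storage.StorageLevel

    // Compute the RDD once and cache its blocks on the executors.
    // Executors that hold these cached blocks stay alive until the
    // cached-executor idle timeout expires.
    val lookup = sc.textFile("hdfs:///data/lookup.txt")
      .map(line => (line.split(",")(0), line))
      .persist(StorageLevel.MEMORY_AND_DISK)

    lookup.count()  // materialize the cache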

You might also want to configure the size of the shared context pool, which maintains a set of prestarted contexts so that applications get instant access to an existing context. When the number of shared contexts in the pool reaches the specified limit and a new shared context is submitted, the least recently used shared context is stopped.
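The eviction policy is least recently used (LRU). The following Scala sketch illustrates only the policy, not the product's implementation: with a pool limit of 2, submitting a third context stops whichever context was used least recently.

    import scala.collection.mutable

    // Illustrative LRU pool with a fixed capacity (not the product code).
    class ContextPool(limit: Int) {
      private val pool = mutable.LinkedHashMap[String, String]()

      // Re-inserting an entry moves it to the most recently used position.
      def use(name: String): Unit =
        pool.remove(name).foreach(ctx => pool.put(name, ctx))

      def submit(name: String): Unit = {
        if (pool.size >= limit) {
          val (lru, _) = pool.head          // least recently used entry
          println(s"stopping shared context: $lru")
          pool.remove(lru)
        }
        pool.put(name, s"context-$name")
      }
    }

    val p = new ContextPool(2)
    p.submit("a"); p.submit("b")
    p.use("a")      // "b" is now the least recently used context
    p.submit("c")   // prints: stopping shared context: b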

To specify a longer timeout for executors or to set the shared context pool size, you must modify the configuration of the instance group to which the shared Spark batch application is submitted. Optionally, you can also change the defaults for the number of concurrent jobs that can be started in a shared context and the timeout duration for RDD creation.

The sharable RDD API provides a data caching layer in which shared RDD data is computed once and cached for reuse. To avoid losing cached RDD block data, you can allocate more memory to Spark executors. For information on Spark memory tuning, click the Spark Documentation link when you are configuring the Spark version, and then search for the Memory Management Overview in the Tuning Guide.
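Step 4 of the procedure sets executor memory through the cluster management console. For applications that set Spark properties programmatically, the equivalent standard Spark property is spark.executor.memory, as in this minimal Scala sketch (the application name and memory value are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // Give each executor 2 GB so that cached shared RDD blocks fit in memory.
    val conf = new SparkConf()
      .setAppName("shared-batch-example")       // illustrative name
      .set("spark.executor.memory", "2g")

    val sc = new SparkContext(conf)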

Procedure

  1. From the cluster management console, click Instance Groups.
  2. Select the instance group to modify and click Configure.
  3. On the Basic Settings tab, select the Spark version and click the Configuration link.
  4. Update the following parameter under Applications Properties:
    • To set the amount of memory to use per executor process, set the spark.executor.memory parameter. Valid values are size strings (for example, 512m or 2g).
  5. Update the following parameters under Spark on EGO (these settings are summarized in the example after this procedure):
    • To specify a longer timeout for idle executors with cached data, set the SPARK_EGO_CACHED_EXECUTOR_IDLE_TIMEOUT parameter to as high a value as possible (default is Integer.MAX_VALUE).
    • To set the shared context pool size, set the SPARK_EGO_SHARED_CONTEXT_POOL_SIZE parameter. Valid values are integers starting from 0 (default is 5).
    • To set the number of concurrent jobs that can be started in a shared context, set the SPARK_EGO_SHARED_CONTEXT_MAX_JOBS parameter. Valid values are integers greater than 0 (default is 100). New jobs that are submitted beyond this limit are rejected.
    • To set how long other jobs wait for a shared RDD to be created before the operation times out, set the SPARK_EGO_SHARED_RDD_WAIT_TIMEOUT parameter. Valid values are integers greater than 0, in milliseconds (default is 60000). While an RDD is being created, other access to that RDD is blocked.
  6. Click Save.
  7. Deploy and start the instance group.
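For reference, the parameters from steps 4 and 5 might look like the following when set to their defaults (except executor memory, shown here as 2g for illustration). Enter each value on the configuration pages described in the procedure; this listing is a summary, not a file to edit directly:

    spark.executor.memory=2g
    SPARK_EGO_CACHED_EXECUTOR_IDLE_TIMEOUT=2147483647    # Integer.MAX_VALUE
    SPARK_EGO_SHARED_CONTEXT_POOL_SIZE=5
    SPARK_EGO_SHARED_CONTEXT_MAX_JOBS=100
    SPARK_EGO_SHARED_RDD_WAIT_TIMEOUT=60000              # milliseconds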