Spark on EGO instance group parameters

Configure an instance group for resource management and session scheduling using the Spark on EGO plug-in.

For a Spark version used by an instance group, the Spark on EGO framework provides two configuration groups to configure resource management and session scheduling parameters. The parameters are available with all supported Spark versions, except where specifically mentioned. For information on editing the Spark version in an instance group, see Modifying instance groups.

Spark on EGO settings

When editing the configuration for an instance group, the following settings are available from the Spark configuration dialog, Spark on EGO drop-down list:
Note:
  • Each uppercase property has a lowercase counterpart. Use the lowercase version in the spark-submit command.
  • To configure host blocking, see the Host Blocking drop-down list. You can also check the settings for the BLOCKHOST parameters in the spark-env.sh configuration file.
SPARK_EGO_ACCESS_NAMENODES
In cluster mode, specifies the name nodes from which the application can get the delegation token.
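For example, a minimal spark-env.sh sketch, assuming that multiple name nodes are specified as a comma-separated list of HDFS URIs (host names and ports are illustrative):
SPARK_EGO_ACCESS_NAMENODES=hdfs://namenode1.example.com:9000,hdfs://namenode2.example.com:9000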
SPARK_EGO_APP_MAX_IN_CACHE

Specifies the maximum number of running applications that can be retained in the Spark master memory cache. Valid value is a positive integer.

Default: 5000

SPARK_EGO_APPS_MAX_SEARCH_NUM

Specifies the maximum number of applications to be searched. If the Spark master retains more applications than this limit, only the top 100,000 applications (default) are searchable.

Default: 100000

SPARK_EGO_APPS_MAX_RESULTS_NUM

Specifies the maximum number of applications to be returned in one request. If you change this parameter to return more than 200 applications (default), ensure that you increase the Spark master's JVM memory.
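
For example, a hedged spark-env.sh sketch that raises all three application limits together (values are illustrative; SPARK_DAEMON_MEMORY is the standard variable that controls the Spark master memory, mentioned later in this section):
SPARK_EGO_APP_MAX_IN_CACHE=10000
SPARK_EGO_APPS_MAX_SEARCH_NUM=200000
SPARK_EGO_APPS_MAX_RESULTS_NUM=500
SPARK_DAEMON_MEMORY=4g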

Default: 200

SPARK_EGO_AUTH_MODE
Specifies the authentication and authorization mode that is used with EGO security plug-ins.
Note: In the cluster management console, this parameter is controlled by selecting the Enable authentication and authorization for the submission user check box in the Basic Settings tab of an instance group. If selected, the Spark master authenticates and authorizes the specified user. Unselected, the Spark master trusts all specified users.
Valid values for SPARK_EGO_AUTH_MODE are:
  • EGO_TRUST: The Spark master trusts all specified users. No password is required. When you run a Spark application from the spark-submit command, you can add the spark.ego.uname with the user name. For example:
    ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://MasterHost:7077 --conf spark.ego.uname=UserName \
    $SPARK_HOME/lib/spark-examples-1.6.1-hadoop2.6.0.jar 100
  • EGO_AUTH: The Spark master authenticates and authorizes the specified user. When you run a Spark application from the spark-submit command, you must add the parameters spark.ego.uname and spark.ego.passwd.

    For spark.ego.passwd, you can either specify the password in the spark-submit command, or leave the value empty (--conf spark.ego.passwd=). If the value is empty, the spark-submit command either prompts for the password at run time or reads it from stdin (for example, a piped file). For example:

    Password included in the spark-submit command:
    ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://MasterHost:7077 --conf spark.ego.uname=UserName \
    --conf spark.ego.passwd=Password $SPARK_HOME/lib/spark-examples-1.6.1-hadoop2.6.0.jar 100
    Password prompted at run time:
    ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://MasterHost:7077 --conf spark.ego.uname=UserName \
    --conf spark.ego.passwd= $SPARK_HOME/lib/spark-examples-1.6.1-hadoop2.6.0.jar 100
    The user is prompted to enter the password.
    Password from stdin (a specified file):
    cat password_file | bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://MasterHost:7077 --conf spark.ego.uname=UserName \
    --conf spark.ego.passwd= $SPARK_HOME/lib/spark-examples-1.6.1-hadoop2.6.0.jar 100

Default: EGO_TRUST

SPARK_EGO_CACHED_EXECUTOR_IDLE_TIMEOUT

For shared SparkContext applications, specifies how long (in seconds) an executor with cached data blocks stays alive without any workload running on it. When an executor has cached data, this timeout overrides the value of the SPARK_EGO_EXECUTOR_IDLE_TIMEOUT parameter. Valid value is an integer starting from 1.

Spark versions not supported: 1.5.2 and 3.0.0.

Default: Integer.MAX_VALUE

SPARK_EGO_CLIENT_TTL

When high availability is enabled for an instance group during creation, specifies how long resource allocations for the Spark master are retained after the Spark master goes down. If the Spark master restarts within the specified duration (defined in minutes), the Spark master retrieves those allocations for reuse.

Default: 30 (minutes)

SPARK_EGO_DISTRIBUTE_FILES

Specifies whether to distribute files through a shared file system. Valid values are true and false. If set to true, resource files (.jar files and scripts) are uploaded to the staging directory.
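
For example, a minimal spark-env.sh sketch that enables file distribution through a shared file system (the staging directory path is illustrative; see SPARK_EGO_STAGING_DIR later in this section):
SPARK_EGO_DISTRIBUTE_FILES=true
SPARK_EGO_STAGING_DIR=hdfs://namenode1.example.com:9000/user/spark/staging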

Default: false

SPARK_EGO_DOCKER_CONTAINER_START_TIMEOUT

Specifies the wait duration (in seconds) for a Docker container to start. If the container is not started within this duration (default 60 seconds), the Docker controller terminates the container process. Valid value is a positive integer.

Spark versions not supported: 1.5.2 and 3.0.0.

Default: 60 (seconds)

SPARK_EGO_DOCKER_CONTAINER_STOP_TIMEOUT

Specifies the wait duration (in seconds) for a Docker container to stop. If the container is not stopped within this duration (default 60 seconds), the Docker controller terminates the container process. Valid value is a positive integer.

Spark versions not supported: 1.5.2 and 3.0.0.

Default: 60 (seconds)

SPARK_EGO_DRIVER_BLOCKHOST_CONF

Defines rules to block hosts when Spark drivers in an instance group fail a specified number of times within a specified duration of time on those hosts. When a host is blocked, the Spark master does not allocate resources from the host for driver startup.

Specify this parameter as one or more keywords that are enclosed within single quotation marks ('). Keywords can contain uppercase and lowercase letters, numbers, underscores (_), and dashes (-). Multiple keywords within each rule must be separated by a space ( ). Multiple rules must be separated by a semi-colon (;). Multiple phrases within each rule must be separated by a comma, which represents an AND condition.

For example, when the configuration is defined as '[127];[100];rule1,rule2;rule3;rule4', a semi-colon (;) separates the rule sets. The numbers in the first bracket are built-in driver exit code rules. The numbers in the second bracket are user driver program exit codes. The remaining entries are user-defined rules that are separated by semicolons.

If the driver's exit reason on a host matches one or more of these rules, that host is blocked, as in the following scenarios:
  • The driver does not start on a host because the startup command does not exist. The host is blocked and the resources are returned to the resource orchestrator. The resource orchestrator cannot allocate resources for the driver on this host until the user unblocks this host from the command line or from the cluster management console.
  • The driver starts on a host and then fails; the exit reason contains the phrase rule1 and rule2, which meets the host blocking criteria. The host is blocked and the resources are returned to the resource orchestrator. The resource orchestrator cannot allocate resources for the driver on this host until the user unblocks this host from the command line or from the cluster management console.
  • The driver starts on a host and then fails; the user program exit code is 100, which meets the host blocking criteria. The resource orchestrator cannot allocate resources for the driver on this host until the user unblocks this host from the command line or from the cluster management console.
The following example shows the SPARK_EGO_DRIVER_BLOCKHOST_CONF parameter, which is defined with additional rules:
SPARK_EGO_DRIVER_BLOCKHOST_CONF='[6,7,16,17,18,25,126,127];[d1,d2,d3];rule1,rule2;rule3;rule4'
In this example, the numbers in the first bracket are built-in rules, while the entries in the second bracket are user-defined driver program exit codes. Other user-defined rules can be added after the second bracket, separated by semicolons. Rules that are separated by a comma represent AND conditions, while rules that are separated by a semicolon represent OR conditions.
Note: The exit codes 0 - 40 are occupied by the system. When you add program exit codes as host blocking rules, specify numbers outside of this range.

Spark versions not supported: 1.5.2 and 2.0.1.

Default: [6,7,16,17,18,25,126,127];[]

For a full list of the default rules, conditions, and other host blocking information, see Host blocking.

SPARK_EGO_DRIVER_BLOCKHOST_DURATION

When host blocking is enabled, this parameter specifies the duration (in minutes) within which driver failures are counted toward blocking a host for the instance group. Within this duration, the host is blocked if a Spark driver fails to start or exits due to an error that hits the same rule as many times as specified by the SPARK_EGO_DRIVER_BLOCKHOST_MAX_FAILURE_TIMES parameter.

Default: 10

SPARK_EGO_DRIVER_BLOCKHOST_MAX_FAILURE_TIMES

When host blocking is enabled, this parameter specifies the maximum number of times that a driver can fail to start or exit due to an error that hits the same rule before the host is blocked. The host is blocked if it fails this many times within the duration that is specified by the SPARK_EGO_DRIVER_BLOCKHOST_DURATION parameter. If both this parameter and the SPARK_EGO_EXECUTOR_BLOCKHOST_MAX_FAILURE_TIMES parameter are set to 0, host blocking is disabled.
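
For example, a hedged spark-env.sh sketch that combines the driver host blocking parameters (rule values are illustrative; the program exit codes are chosen outside the system-reserved 0 - 40 range):
SPARK_EGO_ENABLE_BLOCKHOST=true
SPARK_EGO_DRIVER_BLOCKHOST_CONF='[6,7,16,17,18,25,126,127];[100,101]'
SPARK_EGO_DRIVER_BLOCKHOST_DURATION=10
SPARK_EGO_DRIVER_BLOCKHOST_MAX_FAILURE_TIMES=3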

Default: 3

SPARK_EGO_DRIVER_RESREQ

Resource requirement for drivers when requesting resources from EGO.

Default: select(('X86_64' || 'LINUXPPC64' || 'LINUXPPC64LE') && ('ascd_pkg_deployed'==1))

SPARK_EGO_ENABLE_BLOCKHOST

Enables or disables host blocking. To configure host blocking, ensure that the SPARK_EGO_ENABLE_BLOCKHOST parameter is set to true. Valid values are true and false.

Default: true

SPARK_EGO_ENABLE_PREEMPTION

Enables or disables preemption. Set this parameter to false to disable resource reclaim for performance tuning purposes. Valid values are true and false.

Default: true

SPARK_EGO_EXECUTOR_BLOCKHOST_CONF

Defines rules to block hosts when Spark executors in an instance group fail a specified number of times within a specified duration of time on those hosts. When a host is blocked, the Spark master does not allocate resources from the host for executor startup.

Specify this parameter as one or more keywords that are enclosed within single quotation marks ('). Keywords can contain uppercase and lowercase letters, numbers, underscores (_), and dashes (-). Multiple keywords within each rule must be separated by a space ( ). Multiple rules must be separated by a semi-colon (;). Multiple phrases within each rule must be separated by a comma, which represents an AND condition.

For example, when the configuration is defined as '[127];[100];rule1,rule2;rule3;rule4', a semi-colon (;) separates the rule sets. The numbers in the first bracket are built-in executor exit code rules. The numbers in the second bracket are user executor program exit codes. The remaining entries are user-defined rules that are separated by semicolons.

If the executor's exit reason on a host matches one or more of these rules, that host is blocked, as in the following scenarios:
  • The executor does not start on a host because the startup command does not exist. The host is blocked and the resources are returned to the resource orchestrator. The resource orchestrator cannot allocate resources for the executor on this host until the user unblocks this host from the command line or from the cluster management console.
  • The executor starts on a host and then fails; the exit reason contains the phrase rule1 and rule2, which meets the host blocking criteria. The host is blocked and the resources are returned to the resource orchestrator. The resource orchestrator cannot allocate resources for the executor on this host until the user unblocks this host from the command line or from the cluster management console.
  • The executor starts on a host and then fails; the user program exit code is 100, which meets the host blocking criteria. The resource orchestrator cannot allocate resources for the executor on this host until the user unblocks this host from the command line or from the cluster management console.
The following example shows the SPARK_EGO_EXECUTOR_BLOCKHOST_CONF parameter, which is defined with additional rules:
SPARK_EGO_EXECUTOR_BLOCKHOST_CONF='[6,7,16,17,18,25,126,127];[d1,d2,d3];rule1,rule2;rule3;rule4'
In this example, the numbers in the first bracket are built-in rules, while the entries in the second bracket are user-defined executor program exit codes. Other user-defined rules can be added after the second bracket, separated by semicolons. Rules that are separated by a comma represent AND conditions, while rules that are separated by a semicolon represent OR conditions.
Note: The exit codes 0 - 40 are occupied by the system. When you add program exit codes as host blocking rules, specify numbers outside of this range.

Spark versions not supported: 1.5.2 and 2.0.1.

Default: [6,7,16,17,18,25,126,127];[]

For a full list of the default rules, conditions, and other host blocking information, see Host blocking.

SPARK_EGO_EXECUTOR_BLOCKHOST_DURATION

When host blocking is enabled, this parameter specifies the duration (in minutes) within which executor failures are counted toward blocking a host for the instance group. Within this duration, the host is blocked if a Spark executor fails to start or exits due to an error that hits the same rule as many times as specified by the SPARK_EGO_EXECUTOR_BLOCKHOST_MAX_FAILURE_TIMES parameter.

Default: 10

SPARK_EGO_EXECUTOR_BLOCKHOST_MAX_FAILURE_TIMES

When host blocking is enabled, this parameter specifies the maximum number of times that an executor can fail to start or exit due to an error that hits the same rule before the host is blocked. The host is blocked if it fails this many times within the duration that is specified by the SPARK_EGO_EXECUTOR_BLOCKHOST_DURATION parameter. If both this parameter and the SPARK_EGO_DRIVER_BLOCKHOST_MAX_FAILURE_TIMES parameter are set to 0, host blocking is disabled.

Default: 3

SPARK_EGO_EXECUTOR_RESREQ

Resource requirement for executors when requesting resources from EGO.

Default: select(('X86_64' || 'LINUXPPC64' || 'LINUXPPC64LE') && ('ascd_pkg_deployed'==1))

SPARK_EGO_EXECUTORS_MAX_RESULTS_NUM

Specifies the maximum number of executors to be returned for each application. If you change this parameter to return more than 200 executors per application (default), ensure that you increase the Spark master's JVM memory.

Default: 200

SPARK_EGO_FAILOVER_INDEX_CLEANUP_TIMEOUT

Specifies the timeout (in hours) after a Spark master failover for in-memory indices and records to be cleaned up when the application, driver, or executor status remains unknown. The status is assumed to be FINISHED for applications and drivers, and EXITED for executors. Indices are not cleaned up if the Spark master receives an update within the timeout.

Default: 1 (hour)

SPARK_EGO_FREE_SLOTS_IDLE_TIMEOUT

Specifies how long (in seconds) the Spark driver must retain free slots before releasing them back to the Spark master. If new tasks are generated within the specified duration, the slots are allocated for those tasks. Valid value is an integer, starting from 0 up to the value of the SPARK_EGO_EXECUTOR_IDLE_TIMEOUT parameter.

Use this parameter especially in the context of Spark Streaming applications. For regular Spark workload, the Spark driver aggressively releases free slots back to the Spark master. This resource scheduling pattern is not ideal for Spark Streaming workload, where batches of data must be processed as fast as they are generated. To efficiently schedule resources for Spark Streaming applications, set this parameter so that the Spark driver holds free slots for a reasonable period; these free slots can then be used for new Spark Streaming tasks in the next interval.
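
For example, a minimal sketch for a streaming application, assuming that the lowercase counterpart of this parameter follows the spark.ego naming pattern noted earlier (spark.ego.free.slots.idle.timeout); the class and .jar names are placeholders:
./bin/spark-submit --class MyStreamingApp --master spark://MasterHost:7077 \
--conf spark.ego.free.slots.idle.timeout=30 my-streaming-app.jar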

Spark versions not supported: 1.5.2 and 1.6.1.

Default: 5

SPARK_EGO_HOST_CLOSE_GRACE_PERIOD

When a host is manually closed for maintenance, specifies the duration (in seconds) to wait for Spark drivers and executors that are running on the host to complete. After this period passes, any running drivers and executors are terminated. Valid value is an integer starting from 0.

When the egosh resource close command is run to close a host, this parameter takes effect as follows:
  • When Spark executors are running on the closed host:
    • If tasks running on these executors complete before the specified period, none of the tasks are terminated. After the tasks complete, the slots are returned to the resource orchestrator.
    • If tasks running on the executors do not complete before the specified period, the Spark driver terminates these tasks and reruns them on another host.
  • When Spark drivers are running on the closed host, the Spark master ensures that the grace period for Spark drivers is longer than that of executors. Spark executors exit earlier than Spark drivers.
    • If the Spark driver completes before the specified period, none of the applications are terminated. After the driver completes, the slots are returned to the resource orchestrator.
    • If the Spark driver does not complete before the specified period, the Spark master terminates the Spark driver; any executors that belong to the driver are also terminated. The Spark master returns the slots to the resource orchestrator and reruns the driver on another host.
  • When Spark masters are running on the closed host:
    • If the Spark master is running but Spark drivers and executors are not running, use the kill -9 Spark_pid command to manually terminate the Spark master. The new Spark master restarts on another host and all running drivers register to the new Spark master.
    • If the Spark master and some executors and drivers are running, the Spark driver terminates the executors. The Spark master then terminates the Spark driver (and any executors that belong to the driver) and reruns the Spark driver on another host. Use the kill -9 Spark_pid command to manually terminate the Spark master after the drivers and executors are terminated.
When set to 0, Spark drivers and executors are immediately terminated on a closed host and the slots are returned to the resource orchestrator.
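
For example, a minimal sketch (the host name is illustrative): set a 10-minute grace period in spark-env.sh, then close the host with the egosh resource close command:
SPARK_EGO_HOST_CLOSE_GRACE_PERIOD=600
egosh resource close hostA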

Spark versions not supported: 1.5.2 and 3.0.0.

Default: 0

SPARK_EGO_IMPERSONATION

Specifies the designated account to run Spark applications. Valid values are true and false. If set to true, Spark applications submitted from the command line run as the submission user. If set to false, Spark applications run as the consumer execution user for the driver and executor.

Note: Consider the following notes when you configure SPARK_EGO_IMPERSONATION:
  • The owner of the driver and executor log is the execution user of the driver and executor.
  • If you submit Spark applications in client mode on a host inside the cluster, ensure that the user who is logged in to the client host is the same user as either the submission user (when impersonation is enabled) or the consumer execution user (when impersonation is disabled) for the Spark executors; otherwise, you encounter permission issues.
  • If SPARK_EGO_IMPERSONATION is set to true and SPARK_EGO_AUTH_MODE is set to EGO_AUTH, the user of spark.ego.uname or spark.ego.credential must be the LDAP or OS execution user, rather than an EGO user such as Admin.
  • In the cluster management console, this parameter is controlled by selecting the Enable impersonation to have Spark applications run as the submission user check box in the Basic Settings tab of an instance group. If selected, Spark applications run as the submission user. Unselected, Spark applications run as the consumer execution user for the driver and executor.

Default: false

SPARK_EGO_LAST_UPDATED_TIMESTAMP

Specifies the timestamp for the instance group deployment in the Spark deployment script (spark-env.sh).

Default: No default

SPARK_EGO_LOG_GROUP_READABLE

Enables group read permission for the Spark service logs.

This parameter does not display by default. To show it in the Spark on EGO drop-down list:
  1. Click Workload > Instance Groups > Spark.
  2. In the Spark tab, click Configuration.
  3. Scroll to the Additional Environment Variables section, click Add an Environment Variable, and add an environment variable with the name SPARK_EGO_LOG_GROUP_READABLE and the value true.
  4. Click Save.

Any user who requires read access to the Spark service logs must be a member of either the primary user group of the instance group's execution user, or the user group that is provided as input in the Administrator user group parameter of the instance group.

Default: false

SPARK_EGO_LOG_OTHER_READABLE_ENABLED

Enables read access for others to Spark driver, executor, and application event logs.

If you modify an existing instance group and set SPARK_EGO_LOG_OTHER_READABLE_ENABLED=true, the setting affects only logs for new Spark applications.

This parameter does not display by default. To show it in the Spark on EGO drop-down list, configure SPARK_EGO_LOG_OTHER_READABLE_ENABLED=ON in the ascd.conf file.

To enable the read permission for others, set the value of this parameter to true.

Default: false

SPARK_EGO_LOGSERVICE_PORT

Specifies the UI port for the EGO log service.

Default: 28082

SPARK_EGO_MAX_SNAP_FILE_SIZE
Limits the maximum size of the snap file that is used to recover the application history. It is best practice to increase this value along with SPARK_DAEMON_MEMORY, which controls the total amount of memory that is allocated to the Spark master. Enter a positive integer, followed by any one of the following units:
  • b for bytes. For example, 100b specifies a limit of 100 bytes.
  • k for kibibytes. For example, 10k specifies 10 kibibytes.
  • m for mebibytes. For example, 1m specifies one mebibyte.
  • g for gibibytes. For example, 1g specifies one gibibyte.
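
For example, a hedged spark-env.sh sketch that raises the snap file limit together with the Spark master memory (values are illustrative):
SPARK_EGO_MAX_SNAP_FILE_SIZE=512m
SPARK_DAEMON_MEMORY=8g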

Default: 256m

SPARK_EGO_SHARED_CONTEXT_MAX_JOBS

Specifies the number of concurrent jobs that can be started in a shared context. New job submissions that exceed this limit (default 100) are rejected. Valid value is an integer greater than 0.

Spark versions not supported: 1.5.2 and 3.0.0.

Default: 100

SPARK_EGO_SHARED_CONTEXT_POOL_SIZE

For shared SparkContext applications, specifies the size of the shared context pool, wherein a pool of pre-started shared contexts is maintained. When the context pool reaches this specified limit and new shared contexts are submitted, the least recently used context is stopped. Valid value is an integer that starts from 0.

Spark versions not supported: 1.5.2 and 3.0.0.

Default: 5

SPARK_EGO_SHARED_RDD_WAIT_TIMEOUT

When a shared RDD is being created, this parameter specifies the duration (in milliseconds) that other applications must wait for the RDD to be created before the operation times out. Valid value is an integer greater than 0. During the RDD creation period (default 60000 milliseconds), other access to this RDD is blocked.

Spark versions not supported: 1.5.2 and 3.0.0.

Default: 60000

SPARK_EGO_SHARED_PYTHON

For shared SparkContext applications, enables or disables support for Python applications to use the sharable RDD API. When enabled, shared RDDs can be reused by both Scala and Python applications. Valid values are true or false.

Spark versions not supported: 1.5.2 and 3.0.0.

Default: false

SPARK_EGO_SLOTS_REQUIRED_TIMEOUT

The time, in seconds, to wait for a Spark application to get the required number of slots, including CPU and GPU slots, before launching tasks. After this time, any slots that are held are released and the application fails.

Default: Integer.MAX_VALUE

SPARK_EGO_STAGING_DIR

Specifies the staging directory for submission in a distributed file system (DFS). The staging directory is used to transfer .jar and data files.

Default: No default

SPARK_EGO_SUBMIT_FILE_REPLICATION

Specifies the number of replicas that the uploaded .jar file uses. If not defined, the default setting of dfs.replication in the Hadoop core-site.xml file is used.

Default: No default

spark.ssl.ego.gui.enabled

Specifies whether the configured instance group uses HTTPS for its web UIs.

Default: false

spark.ssl.ego.workload.enabled

Specifies whether the configured instance group uses SSL communication between the following Spark processes: Spark master and Spark driver, Spark driver and executor, and executor and the shuffle service.

Default: false

Session scheduler settings

With session scheduling, an application does not connect to the resource orchestrator directly. Instead, it registers to the Spark master and acquires resources. The Spark master holds a connection to the resource orchestrator and consolidates resources from all applications. It then acquires resources from the resource orchestrator and offers these resources to different applications.

When you edit the configuration for an instance group, the following settings are available from the Session Scheduler drop-down list:

SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK

Specifies a comma-separated list of SPARK_EGO_GPU_SLOTS_PER_TASK values. When specified, the Spark master service is scaled up so that there is (at a minimum) one service instance for each value in the list. This parameter does not affect the Spark notebook master instance. With no list specified (default), the SPARK_EGO_GPU_SLOTS_PER_TASK value takes effect for all Spark master service instances. A maximum of five values is supported.
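
For example, a minimal spark-env.sh sketch (the list values are illustrative; at most five values are supported):
SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK=1,2,4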

Spark versions not supported: 1.5.2, 2.0.1, and 2.1.0.

Default: Empty, no list specified.

SPARK_EGO_APP_SCHEDULE_POLICY

Specifies the scheduling policy for the application. The following scheduling policies are supported:

  • First-in, first-out scheduling (fifo): Meets the requirements of applications submitted earlier first.
  • Fair-share scheduling (fairshare): Shares the entire resource pool among all submitted applications. Applications receive resources in proportion to their priority.
  • Priority (priority): Meets the requirements of applications that are submitted with higher priority.
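
For example, a minimal spark-env.sh sketch that switches the instance group to first-in, first-out scheduling:
SPARK_EGO_APP_SCHEDULE_POLICY=fifo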

Default: fairshare

SPARK_EGO_CONF_DIR_EXTRA

Specifies an optional directory containing an additional spark-env.sh configuration file to load on startup. Use this spark-env.sh file to set additional third-party environment variables that you require beyond available Spark environment variables.

Default: Not defined

SPARK_EGO_DRIVER_RELAUNCH_MAX_TIMES

Sets the maximum number of attempts to relaunch a Spark driver in the failed or error state. Valid value is an integer from 1 to 10.

Spark versions not supported: 1.5.2 and 3.0.0.

Default: 3

SPARK_EGO_EXECUTOR_IDLE_TIMEOUT

Specifies the duration (in seconds) that an executor stays alive when there is no workload on it. This variable requires SPARK_EGO_EXECUTOR_SLOTS_RESERVE to be defined.
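
For example, a minimal spark-env.sh sketch that pairs this parameter with the required SPARK_EGO_EXECUTOR_SLOTS_RESERVE setting (values are illustrative):
SPARK_EGO_EXECUTOR_SLOTS_RESERVE=2
SPARK_EGO_EXECUTOR_IDLE_TIMEOUT=120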

Default: 60 (seconds)

SPARK_EGO_EXECUTOR_SLOTS_MAX

Specifies the maximum number of tasks that can run concurrently in one Spark executor. To prevent the Spark executor process from running out of memory, define this variable only after you evaluate Spark executor memory and memory usage per task.

Default: Integer.MAX_VALUE

SPARK_EGO_GPU_ADAPTIVE_ENABLE

Enables adaptive scheduling wherein tasks run first on a portion of GPU resources in the cluster. When GPU resources are no longer available, the remaining tasks run on CPU resources.

Spark versions not supported: 1.5.2 and 3.0.0.

Default: false

SPARK_EGO_GPU_ADAPTIVE_EST_RATIO

When adaptive scheduling is enabled, specifies the estimated speedup ratio of CPU tasks to GPU tasks, based on the average duration of CPU tasks versus GPU tasks. Valid value is a positive integer, starting from 1.

Spark versions not supported: 1.5.2 and 3.0.0.

Default: 5

SPARK_EGO_GPU_ADAPTIVE_PERCENTAGE

When adaptive scheduling is enabled, specifies the maximum number of tasks in the GPU stage (as a percentage of total GPU tasks) that are transferred to CPU resources when GPU resources are no longer available. The number of transferred GPU tasks depends on a combination of factors; it can be zero or any value up to, but never exceeding, the specified percentage. Valid values are between 0 and 1 (decimal values are accepted).

Spark versions not supported: 1.5.2 and 3.0.0.

Default: 0.1

SPARK_EGO_GPU_EXECUTOR_SLOTS_MAX

For GPU scheduling, specifies the maximum number of GPU tasks that can run concurrently in one GPU executor.

Spark versions not supported: 1.5.2 and 3.0.0.

Default: Integer.MAX_VALUE

SPARK_EGO_EXECUTOR_SLOTS_RESERVE

Specifies the minimum number of slots to reserve once the executor is started. The number of executor slots is the minimum of the reserved number and the owned number. The timeout for the reserve slots is set by SPARK_EGO_EXECUTOR_IDLE_TIMEOUT. For Spark 1.5.2, look for this parameter under the Spark on EGO configuration group. 

Default: 1

SPARK_EGO_GPU_EXECUTOR_SLOTS_RESERVE

For GPU scheduling, specifies the minimum number of GPU slots to reserve once the executor is started. The number of GPU executor slots is the minimum of the reserved number and the owned number. The timeout for the reserve GPU slots is set by SPARK_EGO_EXECUTOR_IDLE_TIMEOUT.  

Spark versions not supported: 1.5.2 and 3.0.0.

Default: 1

SPARK_EGO_PRIORITY

Specifies the priority of Spark driver and executor scheduling for the instance group. The valid range for scheduling priority is 1 to 10000, with 10000 being the highest priority.

Default: 5000

SPARK_EGO_RECLAIM_GRACE_PERIOD

Specifies the duration (in seconds) that the Spark master waits to reclaim resources from applications. If set to 0, the Spark driver kills any running tasks and returns resources at once to the Spark master. If set to (for example) 10, the Spark driver waits 10 seconds for running tasks to complete. After 10 seconds, it kills the tasks and returns the resources.

Valid value is an integer in the range 0 - 8640000.

Spark versions not supported: 1.5.2 and 3.0.0.

Default: 0

SPARK_EGO_SLOTS_MAX

Specifies the maximum number of slots that an application can get in Spark master mode.

Default: Integer.MAX_VALUE

SPARK_EGO_GPU_SLOTS_MAX

For GPU scheduling, specifies the maximum number of slots that an application can get for GPU tasks in Spark master mode.

Spark versions not supported: 1.5.2 and 3.0.0.

Default: Integer.MAX_VALUE

SPARK_EGO_SLOTS_PER_TASK
For CPU scheduling, specifies the number of slots that are allocated to Spark tasks:
  • To enable one CPU Spark task to run with multiple EGO slots, specify a positive integer (for instance, 1, 2, or 3 are valid values). As an example, setting SPARK_EGO_SLOTS_PER_TASK=2 means that each task can run on a maximum of two EGO slots.
  • To enable multiple CPU Spark tasks to run with a single EGO slot, specify a negative integer that is less than -1 (for instance, -2, -3, or -4 are valid values). As an example, setting SPARK_EGO_SLOTS_PER_TASK=-2 means that two tasks run on a single EGO slot.

When this parameter takes effect for CPU scheduling, the number of slots in the egosh alloc command output equals the number of running CPU tasks times the SPARK_EGO_SLOTS_PER_TASK value.
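
As an illustrative example: with SPARK_EGO_SLOTS_PER_TASK=2 and 10 CPU tasks running, the egosh alloc output reports 10 x 2 = 20 slots; with SPARK_EGO_SLOTS_PER_TASK=-2, two tasks share each slot, so the same 10 tasks need only 5 slots.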

Spark versions not supported: 1.5.2 and 3.0.0.

For GPU scheduling, use SPARK_EGO_GPU_SLOTS_PER_TASK instead of this parameter.

Default: 1

SPARK_EGO_GPU_SLOTS_PER_TASK
For GPU scheduling, specifies the number of slots that are allocated to Spark tasks:
  • To enable one GPU Spark task to run with multiple EGO slots, specify a positive integer (for instance, 1, 2, or 3 are valid values). As an example, setting SPARK_EGO_GPU_SLOTS_PER_TASK=2 means that each task can run on a maximum of two EGO slots.
  • To enable multiple GPU Spark tasks to run with a single EGO slot, specify a negative integer that is less than -1 (for instance, -2, -3, or -4 are valid values). As an example, setting SPARK_EGO_GPU_SLOTS_PER_TASK=-2 means that two tasks run on a single EGO slot.

The initial number of tasks of an executor times the SPARK_EGO_GPU_SLOTS_PER_TASK value must be equal to or less than the number of GPUs on the host where the executor runs. When this parameter takes effect for GPU scheduling, the number of slots in the egosh alloc command output equals the number of running GPU tasks times the SPARK_EGO_GPU_SLOTS_PER_TASK value.
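
As an illustrative example: on a host with 4 GPUs and SPARK_EGO_GPU_SLOTS_PER_TASK=2, an executor should start with at most 4 / 2 = 2 initial tasks, and 2 running GPU tasks appear as 2 x 2 = 4 slots in the egosh alloc output.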

Spark versions not supported: 1.5.2 and 3.0.0.

For CPU scheduling, use SPARK_EGO_SLOTS_PER_TASK instead of this parameter.

Default: 1

Modified Spark properties

The following Spark property configuration settings are modified in IBM® Spectrum Conductor from their configuration settings within the Spark community.

spark.locality.wait

Specifies how long to wait to launch a data-local task before giving up and launching it on a less-local node. The same wait is used to step through multiple locality levels (process-local, node-local, rack-local and then any). For Spark on Mesos, if the locality does not match, the allocated resources can be returned. If possible, slots with better locality can be assigned on the next schedule. In IBM Spectrum Conductor, resources are not exchanged for better locality if the allocated slot number is greater than or equal to the demanded slot number.

spark.shuffle.service.enabled
Enables or disables the external shuffle service. This service preserves the shuffle files that are written by executors so the executors can be safely removed. This property has two default configurations depending on your cluster configuration:
  • When your cluster is installed to a local file system, spark.shuffle.service.enabled is set to true by default.
  • When your cluster is installed to a shared file system (such as IBM Spectrum Scale), spark.shuffle.service.enabled is set to false by default. You can optionally enable the shuffle service when you create the instance group. To enable this service through the RESTful API, set this value to true and ensure that spark.local.dir is set to a directory in the shared file system.
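
For example, a hedged sketch of the Spark property settings for a shared file system installation (the directory path is illustrative):
spark.shuffle.service.enabled=true
spark.local.dir=/sharedfs/spark/local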