Memory and CPU configuration options

You can configure a variety of memory and CPU options within Apache Spark, IBM® Java™, and z/OS®.

Apache Spark configuration options

There are two major categories of Apache Spark configuration options: Spark properties and environment variables.

Spark properties control most application settings and can be configured separately for each application. You can set these properties in the following ways, in order from highest to lowest precedence:
  1. Directly on a SparkConf passed to your SparkContext
  2. At run time, as command line options passed to spark-shell and spark-submit
  3. In the spark-defaults.conf properties file
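For example, each of the following sets the executor memory to 4 GB (an illustrative value, not a recommendation). If all three are present, the SparkConf setting takes precedence, followed by the command line option, and then the properties file:

  // In the application itself (Scala), highest precedence:
  import org.apache.spark.{SparkConf, SparkContext}
  val conf = new SparkConf().set("spark.executor.memory", "4g")
  val sc = new SparkContext(conf)

  # On the spark-submit or spark-shell command line:
  spark-submit --conf spark.executor.memory=4g ...

  # In the spark-defaults.conf properties file, lowest precedence:
  spark.executor.memory 4g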

In addition, you can configure certain Spark settings through environment variables, which are read from the $SPARK_CONF_DIR/spark-env.sh script.
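For example, a $SPARK_CONF_DIR/spark-env.sh script might reserve resources for the worker and daemons as follows (the 16g, 8, and 1g values are illustrative only):

  # $SPARK_CONF_DIR/spark-env.sh
  export SPARK_WORKER_MEMORY=16g   # total memory the worker can give to executors
  export SPARK_WORKER_CORES=8      # cores the worker can give to executors
  export SPARK_DAEMON_MEMORY=1g    # memory for the master and worker daemons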

Table 1 through Table 5 summarize some of the Spark properties and environment variables that control memory use, CPU use, and parallelism for Spark applications.

Table 1. Environment variables that control memory settings
SPARK_WORKER_MEMORY
  Default value: Total memory on the host system minus 1 GB
  Total amount of memory that a worker can give to executors.
  Note: This value should equal spark.executor.memory multiplied by the number of executor instances.
SPARK_DAEMON_MEMORY
  Default value: 1 GB
  Amount of memory to allocate to the Spark master and worker daemons.
Table 2. Spark properties that control memory settings
spark.driver.memory
  Default value: 1 GB
  Amount of memory to use for the driver process (that is, where SparkContext is initialized).
  Note: In client mode, this property must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, set it through the --driver-memory command line option or in your default properties file.
spark.driver.maxResultSize
  Default value: 1 GB
  Limit of the total size of serialized results of all partitions for each Spark action (for instance, collect).
  Note: Jobs fail if the size of the results exceeds this limit. Setting the limit too high can cause out-of-memory errors in the driver.
spark.executor.memory
  Default value: 1 GB
  Amount of memory to use per executor process.
  Note: This sets the heap size of the executor JVM.
spark.memory.fraction
  Default value: 0.6
  Fraction of (heap space - 300 MB) used for execution and storage. The lower this value is, the more frequently spills and cached data eviction occur. The purpose of this property is to set aside memory for internal metadata, user data structures, and imprecise size estimation in the case of sparse, unusually large records.
spark.memory.storageFraction
  Default value: 0.5
  Amount of storage memory that is immune to eviction, expressed as a fraction of the size of the region set aside by spark.memory.fraction. The higher this value is, the less working memory is available to execution, and tasks may spill to disk more often.
spark.memory.offHeap.enabled
  Default value: false
  If set to true, Spark attempts to use off-heap memory for certain operations. If off-heap memory use is enabled, spark.memory.offHeap.size must be positive.
spark.memory.offHeap.size
  Default value: 0
  The absolute amount of memory, in bytes, that can be used for off-heap allocation. This setting has no impact on heap memory usage, so if the total memory consumption of your executors must fit within a hard limit, be sure to shrink the JVM heap size accordingly. This value must be positive when spark.memory.offHeap.enabled is set to true.
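For example, if the total memory that an executor uses must fit within a known address-space limit, the heap and off-heap sizes can be planned together. The following spark-defaults.conf lines are an illustrative sketch, not recommended values:

  spark.executor.memory        4g            # executor JVM heap
  spark.memory.offHeap.enabled true
  spark.memory.offHeap.size    1073741824    # 1 GB of off-heap memory, in bytes

In this sketch, the executor needs roughly 4 GB of heap plus 1 GB of off-heap memory plus JVM overhead, and any hard limit on the executor address space must accommodate that total.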
Table 3. Environment variables that control CPU settings
SPARK_WORKER_CORES
  Default value: All cores on the host system
  Number of cores to use for all executors.
  Note: This value should equal spark.executor.cores multiplied by the number of executor instances.
Table 4. Spark properties that control CPU settings
spark.deploy.defaultCores
  Default value: Infinite
  Default number of cores to give to applications that do not set a spark.cores.max value.
spark.cores.max
  Default value: (Not set)
  Maximum number of cores to give to an application across the entire cluster. If not set, the default is the spark.deploy.defaultCores value.
spark.driver.cores
  Default value: 1
  Number of cores to use for the driver process, in cluster mode only.
spark.executor.cores
  Default value: All cores on the host system
  Number of cores to use on each executor. Setting this property allows an application to run multiple executors on the same worker, provided that the worker has enough cores (SPARK_WORKER_CORES); otherwise, only one executor per application runs on each worker.
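As a worked example (illustrative values only): with SPARK_WORKER_CORES=8 on each worker, the following settings cap an application at 6 cores across the cluster, which it can use as up to three 2-core executors. Without spark.executor.cores, a single executor per worker would claim all of that worker's available cores.

  spark.executor.cores 2
  spark.cores.max      6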
Table 5. Spark properties that affect application and cluster parallelism
spark.zos.master.app.alwaysScheduleApps
  Default value: false
  Use this property to allow the Spark master to start applications even when the current applications appear to be using all of the CPU and memory. This lets the system start the applications and then manage resources across them according to the WLM policy.
spark.zos.maxApplications
  Default value: 5
  When spark.zos.master.app.alwaysScheduleApps is set to true, specifies the maximum number of applications that can be scheduled to run concurrently. Set this value based on the resources available under the current WLM policy.
spark.dynamicAllocation.enabled
  Default value: false
  Specifies whether the Spark master can request that the Spark worker add or remove executors, based on the workload and the available resources.
spark.shuffle.service.enabled
  Default value: false
  Specifies whether the worker starts the Spark shuffle service, which allows data to be transferred (shuffled) between executors as needed. This property must be true when spark.dynamicAllocation.enabled is true.
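For example, to let the cluster add and remove executors for each application as its workload changes, both properties are typically enabled together in spark-defaults.conf. The optional spark.dynamicAllocation.maxExecutors property (not described in the tables above) can bound the growth; the value shown is illustrative:

  spark.dynamicAllocation.enabled      true
  spark.shuffle.service.enabled        true   # required when dynamic allocation is enabled
  spark.dynamicAllocation.maxExecutors 4      # optional, illustrative cap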

For more information about the Apache Spark configuration options, see http://spark.apache.org/docs/2.4.8/configuration.html. For more information about tuning the Apache Spark cluster, see http://spark.apache.org/docs/2.4.8/tuning.html.

IBM Java configuration options

In addition to the Apache Spark settings, you can use IBM JVM runtime options to manage resources used by Apache Spark applications.

You can set the IBM JVM runtime options in the following places:
IBM_JAVA_OPTIONS
Use this environment variable to set generic JVM options.
SPARK_DAEMON_JAVA_OPTS
Use this environment variable to set additional JVM options for the Apache Spark master and worker daemons.
spark.driver.extraJavaOptions
Use this Apache Spark property to set additional JVM options for the Apache Spark driver process.
spark.executor.extraJavaOptions
Use this Apache Spark property to set additional JVM options for the Apache Spark executor process. You cannot use this option to set Spark properties or heap sizes.
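For example, the same stack-size option can be applied at several scopes. The -Xss2m value is illustrative; remember that heap sizes are still set through spark.driver.memory and spark.executor.memory:

  # All IBM JVMs started in this environment:
  export IBM_JAVA_OPTIONS="-Xss2m"

  # Spark master and worker daemons (typically in spark-env.sh):
  export SPARK_DAEMON_JAVA_OPTS="-Xss2m"

  # Driver and executor processes (typically in spark-defaults.conf):
  spark.driver.extraJavaOptions   -Xss2m
  spark.executor.extraJavaOptions -Xss2m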

Table 6 summarizes some of the IBM JVM runtime options that control resource usage.

Table 6. IBM JVM runtime options that control resource usage
-Xmx
  Default value: Half of the available memory, with a minimum of 16 MB and a maximum of 512 MB
  Maximum heap size.
  Note: Set Spark JVM heap sizes through Spark properties (such as spark.executor.memory), not through Java options.
-Xms
  Default value: 4 MB
  Initial heap size.
-Xss
  Default value: 1024 KB
  Maximum stack size.
-Xcompressedrefs
  Default value: Enabled by default when the value of the -Xmx option is less than or equal to 57 GB
  Enables the use of compressed references.
-Xlp
  Default value: 1M pageable pages, when available, are the default size for the object heap and the code cache. If -Xlp is specified without a size, 1M non-pageable is the default for the object heap.
  Requests that the JVM allocate the Java object heap by using large page sizes (1M or 2G). Instead of -Xlp, you can use -Xlp:codecache and -Xlp:objectcache to set the JIT code cache and the object heap separately.
  Note: The number of large pages available on a z/OS system is limited. Consult your system administrator before changing this setting.
-Xmn
  Default value: (Not set)
  Sets the initial and maximum size of the new area to the specified value when the -Xgcpolicy:gencon option is used.

For more information about IBM Java on z/OS, see IBM SDK, Java Technology Edition z/OS User Guide.

z/OS configuration parameters

The z/OS operating system provides additional configuration options that allow the system administrator to ensure that no workloads use more resources than they are allowed. Be sure that any application-level configuration does not conflict with the z/OS system settings. For example, the executor JVM will not start if you set spark.executor.memory=4G but the MEMLIMIT parameter for the user ID that runs the executor is set to 2G.

Typically, you can modify these settings in the following ways (in order from highest to lowest precedence):
  1. Use the ALTUSER RACF® command to modify the resource settings in the OMVS segment of the security profile for the user ID under which Apache Spark runs.
  2. Use the IEFUSI exit, which receives control before each job step starts.
  3. Set the system-wide defaults in the appropriate SYS1.PARMLIB member.
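For example, a security administrator might raise the limits in the OMVS segment of the security profile for the user ID under which Spark runs. The user ID (SPARKID) and the values shown are illustrative only; consult your security and system administrators before changing a security profile:

  ALTUSER SPARKID OMVS(MEMLIMIT(6G) ASSIZEMAX(2147483647))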

Table 7 summarizes some of the z/OS configuration options that might be relevant to your Open Data Analytics for z/OS environment.

Table 7. IBM z/OS configuration parameters
MEMLIMIT
  Default value: 2 GB
  Amount of virtual storage above the bar that an address space is allowed to use.
  Note: Set the MEMLIMIT for the Spark user ID to the largest JVM heap size (the executor memory size) plus the amount of native memory that Spark processing might require (typically, at least 2 GB).
  Where to set:
  • MEMLIMIT parameter in the SMFPRMxx parmlib member (sets the system-wide default)
  • MEMLIMIT parameter on JCL JOB or EXEC statements
  • IEFUSI exit
  • MEMLIMIT variable in the OMVS security profile segment
MAXMMAPAREA
  Default value: 40960 pages
  Maximum amount of data space storage, in pages, that can be allocated for memory mapping of z/OS UNIX files.
  Where to set:
  • MAXMMAPAREA parameter in the BPXPRMxx parmlib member (sets the system-wide default)
  • MMAPAREAMAX variable in the OMVS security profile segment
MAXASSIZE
  Default value: 200 MB
  Maximum address space size for a new process.
  Note: The JVM fails to start if the MAXASSIZE for the Spark user ID is too small. The recommended minimum for Java is 2,147,483,647 bytes.
  Where to set:
  • MAXASSIZE parameter in the BPXPRMxx parmlib member (sets the system-wide default)
  • IEFUSI exit
  • ASSIZEMAX variable in the OMVS security profile segment
MAXCPUTIME
  Default value: 1000 seconds
  Maximum CPU time, in seconds, that a process can use.
  Where to set:
  • MAXCPUTIME parameter in the BPXPRMxx parmlib member (sets the system-wide default)
  • CPUTIMEMAX variable in the OMVS security profile segment
MAXPROCUSER
  Default value: 25
  Maximum number of concurrently active processes for a single z/OS UNIX user ID.
  Where to set:
  • MAXPROCUSER parameter in the BPXPRMxx parmlib member (sets the system-wide default)
  • PROCUSERMAX variable in the OMVS security profile segment
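As an illustration of the MEMLIMIT guidance in Table 7 (the values are illustrative): with spark.executor.memory set to 4G and off-heap memory disabled, the MEMLIMIT for the Spark user ID would need to be at least 4 GB + 2 GB = 6 GB; enabling spark.memory.offHeap.size increases the requirement by the off-heap amount.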

For more information about z/OS parmlib members, see z/OS MVS Initialization and Tuning Reference.