Creating the Apache Spark working directories

Complete this task to create the Apache Spark working directories.

About this task

Table 1 lists some of the working directories that Apache Spark uses. Depending on the type of work that is running, these directories might need to be large; this is particularly true for the SPARK_LOCAL_DIRS directory.

Table 1. Apache Spark working directories
Directory contents                  | Default location | Environment variable | Suggested new directory
Log files                           | $SPARK_HOME/logs | SPARK_LOG_DIR        | Under /var, such as /var/spark/logs
Working data for the worker process | $SPARK_HOME/work | SPARK_WORKER_DIR     | Under /var, such as /var/spark/work
Shuffle and RDD data                | /tmp             | SPARK_LOCAL_DIRS     | Under /tmp, such as /tmp/spark/scratch
PID files                           | /tmp             | SPARK_PID_DIR        | Under /tmp, such as /tmp/spark/pid

Procedure

  1. As you did in Creating the Apache Spark configuration directory, follow your file system conventions and create new working directories for Apache Spark.
    Note: Consider mounting the $SPARK_WORKER_DIR and $SPARK_LOCAL_DIRS directories on separate zFS file systems to avoid uncontrolled growth on the primary zFS where Spark is located. The sizes of these zFS file systems depend on the activity level of your applications and whether auto-cleanup or rolling logs is enabled (see step 4). If you are unsure about the sizes, 500 MB is a good starting point. Then, monitor the growth of these file systems and adjust their sizes accordingly. Avoid using temporary file systems (TFS) for these directories if you expect significant growth, as TFS can use a large amount of real memory.
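    For example, the following commands create the suggested directories from Table 1 (a sketch; adjust the paths to match your own file system conventions):

    mkdir -p /var/spark/logs
    mkdir -p /var/spark/work
    mkdir -p /tmp/spark/scratch
    mkdir -p /tmp/spark/pid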
  2. Give the following users read/write access to the newly created working directories:
    • The user ID who runs z/OS Spark (SPARKID in these examples)
    • The end user IDs who will use z/OS Spark

    Assuming that those users belong to the same UNIX group, you can issue the following commands:

    chmod ug+rwx /var/spark/logs
    chmod ug+rwx /var/spark/work
    chmod ug+rwx /tmp/spark/scratch
    chmod ug+rwx /tmp/spark/pid
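
    If the directories are not already owned by that group, you can also set their group ownership (a sketch; SPKGRP is a hypothetical group name to which SPARKID and the end users belong):

    chgrp SPKGRP /var/spark/logs /var/spark/work /tmp/spark/scratch /tmp/spark/pid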
    
  3. Update the $SPARK_CONF_DIR/spark-env.sh script with the new environment variables pointing to the newly created working directories.
    For example:
    export SPARK_WORKER_DIR=/var/spark/work
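    A fuller sketch of the spark-env.sh updates, assuming the suggested directories from Table 1:

    export SPARK_LOG_DIR=/var/spark/logs
    export SPARK_WORKER_DIR=/var/spark/work
    export SPARK_LOCAL_DIRS=/tmp/spark/scratch
    export SPARK_PID_DIR=/tmp/spark/pid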
  4. Configure these directories to be cleaned regularly.
    1. Configure Spark to perform cleanup.
      By default, Spark does not regularly clean up worker directories, but you can configure it to do so. Change the following Spark properties in $SPARK_CONF_DIR/spark-defaults.conf to values that support your planned activity, and monitor these settings over time:
      spark.worker.cleanup.enabled
      Enables periodic cleanup of worker and application directories. This is disabled by default. Set to true to enable it.
      spark.worker.cleanup.interval
      The interval, in seconds, at which the worker cleans up old application work directories. The default is 30 minutes (1800 seconds). Modify the value as you deem appropriate.
      spark.worker.cleanup.appDataTtl
      Controls how long, in seconds, to retain application work directories. The default is 7 days, which is generally too long if Spark jobs run frequently, because the retained data can fill the file system. Modify the value as you deem appropriate.
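      For example, the following spark-defaults.conf settings enable cleanup every 30 minutes and remove application data that is older than one day (a sketch; the interval and TTL values are illustrative, so tune them for your workload):

      spark.worker.cleanup.enabled    true
      spark.worker.cleanup.interval   1800
      spark.worker.cleanup.appDataTtl 86400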
      For more information about these properties, see http://spark.apache.org/docs/2.4.8/spark-standalone.html.
    2. Configure Spark to enable rolling log files.

      By default, Spark retains all of the executor log files. You can change the following Spark properties in $SPARK_CONF_DIR/spark-defaults.conf to enable rolling of executor logs:

      spark.executor.logs.rolling.maxRetainedFiles
      Sets the number of latest rolling log files that are going to be retained by the system. Older log files will be deleted. The default is to retain all log files.
      spark.executor.logs.rolling.strategy
      Sets the strategy for rolling of executor logs. By default, it is disabled. The valid values are:
      time
      Time-based rolling. Use spark.executor.logs.rolling.time.interval to set the rolling time interval.
      size
      Size-based rolling. Use spark.executor.logs.rolling.maxSize to set the maximum file size for rolling.
      spark.executor.logs.rolling.time.interval
      Sets the time interval by which the executor logs will be rolled over. Valid values are:
      • daily
      • hourly
      • minutely
      • Any number of seconds
      spark.executor.logs.rolling.maxSize
      Sets the maximum file size, in bytes, by which the executor logs will be rolled over.
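
      For example, the following spark-defaults.conf settings roll the executor logs daily and keep only the latest seven files (a sketch; the values are illustrative):

      spark.executor.logs.rolling.strategy         time
      spark.executor.logs.rolling.time.interval    daily
      spark.executor.logs.rolling.maxRetainedFiles 7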

      For more information about these properties, see http://spark.apache.org/docs/2.4.8/configuration.html.

    3. Create jobs that clean up or archive the following directories listed in Table 1:
      • $SPARK_LOG_DIR
      • $SPARK_WORKER_DIR, if not configured to be cleaned by Spark properties
      • $SPARK_LOCAL_DIRS

      z/OS® UNIX ships a sample script, skulker, that you can use as written or modify to suit your needs. The -R option can be useful, because Spark files are often nested in subdirectories. You can schedule skulker to run regularly from cron or from your own automation tooling. You can find the sample skulker script in the /samples directory. For more information about skulker, see "skulker - Remove old files from a directory" in z/OS UNIX System Services Command Reference.
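
      For example, a crontab entry like the following might run skulker nightly at 2:00 AM to delete files that are more than 7 days old from the Spark log directory (a sketch; the path to your copy of skulker, the age, and the schedule are illustrative, and you should verify the skulker syntax in the command reference before use):

      0 2 * * * /u/sparkid/bin/skulker -R /var/spark/logs 7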

  5. Optional: Periodically check all file systems involved in Spark (such as $SPARK_HOME and any others mounted under it or elsewhere).
    • You can specify the FSFULL parameter for a file system so that it generates operator messages as the file system reaches a user-specified threshold (an example MOUNT statement appears at the end of this step).
    • Check the number of extents, which can impact I/O performance for the disks involved. Perform these steps to reduce the number of extents:
      1. Create and mount a new zFS.
      2. Use copytree, tar, or similar utilities to copy the key directories from the old file system to the new one.
      3. Unmount the old file system and re-mount the new file system in its place.
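
      For example, one way to copy a directory tree with tar (a sketch that assumes the new zFS is temporarily mounted at /var/spark/work.new):

      cd /var/spark/work && tar -cf - . | (cd /var/spark/work.new && tar -xf -)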

    For more information, see "Managing File System Size" in z/OS DFSMSdfp Advanced Services.

    Note: Update the BPXPRMxx member of parmlib with the new file systems.
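
    For example, a BPXPRMxx MOUNT statement for the worker directory file system might look like the following (a sketch; the data set name, mount point, and FSFULL thresholds are illustrative):

    MOUNT FILESYSTEM('OMVS.SPARK.WORK')
          MOUNTPOINT('/var/spark/work')
          TYPE(ZFS) MODE(RDWR)
          PARM('FSFULL(85,5)')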

Results

You have completed the customization of your Apache Spark directory structure.

What to do next

Continue with Configuring networking for Apache Spark.