Complete this task to create the Apache Spark working directories.
About this task
Table 1 lists some of the working directories that Apache Spark uses. These directories might need to be large, depending on the type of work that is running; this is particularly true for the SPARK_LOCAL_DIRS directory.
Table 1. Apache Spark working directories

Directory contents                  | Default location | Environment variable | Suggested new directory
Log files                           | $SPARK_HOME/logs | SPARK_LOG_DIR        | Under /var, such as /var/spark/logs
Working data for the worker process | $SPARK_HOME/work | SPARK_WORKER_DIR     | Under /var, such as /var/spark/work
Shuffle and RDD data                | /tmp             | SPARK_LOCAL_DIRS     | Under /tmp, such as /tmp/spark/scratch
PID files                           | /tmp             | SPARK_PID_DIR        | Under /tmp, such as /tmp/spark/pid
Procedure
-
As you did in Creating the Apache Spark configuration directory, follow your file system conventions
and create new working directories for Apache Spark.
Note: Consider mounting the $SPARK_WORKER_DIR and $SPARK_LOCAL_DIRS directories on separate zFS file systems to avoid uncontrolled growth on the primary zFS file system where Spark is located.
The sizes of these zFS file systems depend on the activity level of your applications and on whether automatic cleanup or log rolling is enabled (see step 4). If you are unsure about the sizes, 500 MB is a good starting point; then monitor the growth of these file systems and adjust their sizes accordingly. Avoid using temporary file systems (TFS) for these directories if you expect significant growth, because a TFS can use a large amount of real memory.
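For example, if you follow the suggested directory names from Table 1, you might create the working directories with commands such as the following:
mkdir -p /var/spark/logs
mkdir -p /var/spark/work
mkdir -p /tmp/spark/scratch
mkdir -p /tmp/spark/pid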
-
Give the following users read/write access to the newly created working directories:
- The user ID that runs z/OS Spark (SPARKID in these examples)
- The end user IDs that will be using z/OS Spark.
Assuming that those users belong to the same UNIX user group, you can issue the following commands:
chmod ug+rwx /var/spark/logs
chmod ug+rwx /var/spark/work
chmod ug+rwx /tmp/spark/scratch
chmod ug+rwx /tmp/spark/pid
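If the directories are not already owned by that UNIX group, you might also need to set the group owner first. For example, assuming a hypothetical group named SPKGRP:
chgrp SPKGRP /var/spark/logs
chgrp SPKGRP /var/spark/work
chgrp SPKGRP /tmp/spark/scratch
chgrp SPKGRP /tmp/spark/pid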
-
Update the $SPARK_CONF_DIR/spark-env.sh script with the new
environment variables pointing to the newly created working directories.
For example:
export SPARK_WORKER_DIR=/var/spark/work
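If you use all of the suggested directories from Table 1, the complete set of exports might look like this:
export SPARK_LOG_DIR=/var/spark/logs
export SPARK_WORKER_DIR=/var/spark/work
export SPARK_LOCAL_DIRS=/tmp/spark/scratch
export SPARK_PID_DIR=/tmp/spark/pid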
-
Configure these directories to be cleaned regularly.
-
Configure Spark to
perform cleanup.
By default,
Spark does
not regularly clean up worker directories, but you can configure it to do so. Change the following
Spark properties in
$SPARK_CONF_DIR/spark-defaults.conf to values that support your planned
activity, and monitor these settings over time:
- spark.worker.cleanup.enabled
- Enables periodic cleanup of worker and application directories. This is disabled by default. Set
to true to enable it.
- spark.worker.cleanup.interval
- The frequency, in seconds, at which the worker cleans up old application work directories. The default is 1800 (30 minutes). Modify the value as you deem appropriate.
- spark.worker.cleanup.appDataTtl
- Controls how long, in seconds, to retain application work directories. The default is 604800 (7 days), which might be too long if Spark jobs run frequently.
Modify the value as you deem appropriate.
For more information about these properties, see http://spark.apache.org/docs/2.4.8/spark-standalone.html.
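For example, the following entries in spark-defaults.conf enable worker cleanup; the interval and time-to-live values shown here are only illustrative starting points that you should adjust for your workload:
spark.worker.cleanup.enabled    true
spark.worker.cleanup.interval   1800
spark.worker.cleanup.appDataTtl 86400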
-
Configure Spark to
enable rolling log files.
By default, Spark
retains all of the executor log files. You can change the following Spark properties in
$SPARK_CONF_DIR/spark-defaults.conf to enable rolling of executor logs:
- spark.executor.logs.rolling.maxRetainedFiles
- Sets the number of latest rolling log files that are going to be retained by the system. Older
log files will be deleted. The default is to retain all log files.
- spark.executor.logs.rolling.strategy
- Sets the strategy for rolling of executor logs. By default, it is disabled. The valid values
are:
- time
- Time-based rolling. Use
spark.executor.logs.rolling.time.interval
to set the
rolling time interval.
- size
- Size-based rolling. Use
spark.executor.logs.rolling.maxSize
to set the maximum
file size for rolling.
- spark.executor.logs.rolling.time.interval
- Sets the time interval by which the executor logs will be rolled over. Valid values are:
- daily
- hourly
- minutely
- Any number of seconds
- spark.executor.logs.rolling.maxSize
- Sets the maximum file size, in bytes, by which the executor logs will be rolled over.
For more information about these properties, see http://spark.apache.org/docs/2.4.8/configuration.html.
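For example, the following entries in spark-defaults.conf enable time-based rolling of executor logs on a daily interval and retain the most recent log files; the retention count shown is only illustrative:
spark.executor.logs.rolling.strategy           time
spark.executor.logs.rolling.time.interval      daily
spark.executor.logs.rolling.maxRetainedFiles   7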
-
Create jobs that clean up or archive the following directories listed in Table 1:
- $SPARK_LOG_DIR
- $SPARK_WORKER_DIR, if not configured to be cleaned by Spark properties
- $SPARK_LOCAL_DIRS
z/OS® UNIX ships
a sample script, skulker, that you can use as written or modify to suit your
needs. The -R option can be useful, as Spark files are often nested in
subdirectories. You can schedule skulker to run regularly from
cron or other in-house automation tooling. You can find a
sample skulker script in the /samples directory. For more
information about skulker, see "skulker - Remove old files from a directory" in
z/OS UNIX System Services Command Reference.
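For example, you might schedule the cleanup from cron. The following crontab entry is only a sketch; it assumes a hypothetical wrapper script, /u/sparkid/bin/clean_spark_dirs.sh, that calls skulker (or your own cleanup logic) against the directories in Table 1 and writes its output to a log file:
# Run the Spark directory cleanup nightly at 02:00
0 2 * * * /u/sparkid/bin/clean_spark_dirs.sh >/tmp/spark/cleanup.log 2>&1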
- Optional:
Periodically check all file systems involved in Spark (such as
$SPARK_HOME and any others mounted under it or elsewhere).
- You can specify the FSFULL parameter for a file system so that it generates operator messages as the file system reaches a user-specified threshold (see the example at the end of this step).
- Look for the number of extents, which can impact I/O performance for the disks involved. Perform
these steps to reduce the number of extents:
- Create and mount a new zFS.
- Use copytree, tar, or similar utilities to copy the key
directories from the old file system to the new one.
- Unmount the old file system and re-mount the new file system in its place.
For more information, see "Managing File System Size" in z/OS DFSMSdfp Advanced Services.
Note: Update the
BPXPRMxx member of parmlib with the new file systems.
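For example, a zFS MOUNT statement in BPXPRMxx might specify the FSFULL parameter; the data set name, mount point, and threshold values shown here are only assumptions for illustration:
MOUNT FILESYSTEM('OMVS.SPARK.WORK')
      TYPE(ZFS)
      MODE(RDWR)
      MOUNTPOINT('/var/spark/work')
      PARM('FSFULL(85,5)')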
Results
You have completed the customization of your Apache Spark directory structure.