Troubleshooting issues with Apache Spark

Use the following information to troubleshoot issues you might encounter with Apache Spark.

You can also use Apache Spark log files to help identify issues with your Spark processes. See Spark log files for more information about where to find these log files.

Multiple Spark applications cannot run simultaneously with the "alwaysScheduleApps" setting enabled

Symptom: Multiple Spark applications cannot run simultaneously with the "alwaysScheduleApps" setting enabled, even when sufficient memory and CPU (zIIP) resources are available:
21/02/09 15:34:10 INFO Master: Attempted to re-register application at same address: dipn.ipc.us.aexp.com:4056
The following are examples of repeating Spark messages in this instance:
21/02/09 15:34:19 WARN Master: Unknown application app-20210209153410-0352 requested 1 total executors.
21/02/09 15:34:19 WARN Master: Unknown application app-20210209153409-0351 requested 1 total executors.
21/02/09 15:34:20 WARN Master: Unknown application app-20210209153407-0350 requested 1 total executors.
21/02/09 15:34:20 WARN Master: Unknown application app-20210209153410-0352 requested 1 total executors.
Spark issues the following messages when the second and subsequent applications try to use the same port as the first application:
21/01/15 10:01:23 DEBUG TransportServer: Shuffle server started on port: 4056
21/01/15 10:01:23 INFO Utils: Successfully started service 'sparkDriver' on port 4056.
Cause: Spark expects a port to appear in use when another Spark application already holds it, and its own error handling then moves on to the next port in the Spark port range. The TCP/IP SHAREPORT option defeats that logic by allowing multiple applications to bind the same port, so the applications interfere with each other.
Response: Do not specify SHAREPORT on the TCP/IP PORT definitions that you assign to Spark.
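For example, a port reservation in the TCP/IP profile might look like the following sketch, where the starting port, range size, and job name are illustrative; the key point is that SHAREPORT is omitted:
; Reserve the Spark port range for the Spark job name, without SHAREPORT
PORTRANGE 4056 10 TCP SPARKID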

Spark commands fail with an EDC5111I message, and an ICH408I message appears on the z/OS console

Symptom: Spark worker daemon fails to create executors with the following error:
18/01/22 13:11:14 ERROR ExecutorRunner: Error running executor
java.io.IOException: Cannot run program "/usr/lpp/java/java800/J8.0_64/bin/java"
(in directory "/u/usr1/work/app-20180122131112-0000/0"): EDC5111I Permission denied.
         at java.lang.ProcessBuilder.start(ProcessBuilder.java:1059)
         at org.apache.spark.deploy.worker.ExecutorRunner.org$apache$spark$deploy$worker$ExecutorRunner$$fetchAndRunExecutor(ExecutorRunner.scala:167)
         at org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:73)
Caused by: java.io.IOException: EDC5111I Permission denied.
         at java.lang.UNIXProcess.<init>(UNIXProcess.java:189)
         at java.lang.ProcessImpl.start(ProcessImpl.java:167)
         at java.lang.ProcessBuilder.start(ProcessBuilder.java:1040)
         ... 2 more
The following message appears on the z/OS console:
SY1  ICH408I USER(USR1 ) GROUP(SYS1  ) NAME(####################)
 /u/usr1/work/app-20180122131112-0000/0
 CL(DIRSRCH ) FID(E2D7D2F0F0F10002000000014BA847EA)
 INSUFFICIENT AUTHORITY TO CHDIR
 ACCESS INTENT(--X) ACCESS ALLOWED(GROUP  ---)
 EFFECTIVE UID(0000000012) EFFECTIVE GID(0000000500)
Cause: When z/OS Spark client authentication is enabled, the Spark executor processes run under the user ID of the driver. However, the z/OS system that hosts the Spark cluster is not configured to accept the ACLs set by Spark, which the executors need in order to access the Spark directories.
Response: Configure the z/OS system that hosts the Spark cluster to accept ACLs. For example, issue the following RACF command:
SETROPTS CLASSACT(FSSEC)
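After the FSSEC class is active, you can verify that ACL entries are being set on the Spark work directories. For example, issue the following shell command (the directory path is illustrative, taken from the symptom above):
getfacl /u/usr1/work/app-20180122131112-0000/0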

For more information, see Configuring z/OS Spark client authentication.

Spark scripts fail with FSUM6196 and EDC5129I messages

Symptom: Spark scripts fail with the following error message:
env: FSUM6196 bash: not executable: EDC5129I No such file or directory
Cause: Apache Spark expects to find the bash shell in the user's PATH environment variable, but bash cannot be found when the Spark script attempts to invoke it.
Response: Ensure that the PATH environment variable includes the directory where the bash executable is installed. For example, if bash is located at /usr/bin/bash-4.2/bin/bash, ensure that PATH includes /usr/bin/bash-4.2/bin.
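For example, you might add a line such as the following to the user's profile or to $SPARK_CONF_DIR/spark-env.sh (the bash installation path is illustrative):
# Add the directory that contains the bash executable to PATH
export PATH=/usr/bin/bash-4.2/bin:$PATH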

Spark scripts fail with FSUM7321 message

Symptom: Spark scripts fail with the following error message:
failed to launch org.apache.spark.deploy.master.Master: 
/usr/lpp/IBM/izoda/spark/spark23x/bin/spark-class 76: 
FSUM7321 Unknown option "posix"
Cause: Apache Spark expects to find the env command in /usr/bin, but it cannot be found. Either the /usr/bin/env symbolic link is missing or it is not pointing to /bin/env. It is possible that creation of this symbolic link was missed during Spark setup or that the symbolic link was lost after a system IPL.
Response: Ensure that /usr/bin/env exists and is a symbolic link to /bin/env, and that the symbolic link persists across system IPLs. For more information, see Verifying the env command path.
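For example, the following shell commands check for the symbolic link and re-create it if it is missing (re-creating the link requires appropriate authority):
ls -l /usr/bin/env            # should show a symbolic link to /bin/env
ln -s /bin/env /usr/bin/env   # re-create the link if it is missing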

An error occurs when starting the master, but the master starts correctly

Symptom: When starting the master, the following error occurs:
bash-4.2$ $SPARK_HOME/sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, 
logging to /u/user/Billy/logs/spark--org.apache.spark.deploy.master.Master-1-ALPS.out
failed to launch org.apache.spark.deploy.master.Master:
full log in /u/user/Billy/logs/spark--org.apache.spark.deploy.master.Master-1-ALPS.out
Cause: Apache Spark polls for a number of seconds, repeatedly checking whether the master started successfully. If your system is under heavy load, this message might appear because the check finished polling before the master startup completed.
Response: Check the master log, or issue the ps command and look for the master process, to determine definitively whether the master started successfully.
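For example, either of the following commands confirms whether the master is running (the log path is illustrative, taken from the symptom above):
ps -ef | grep org.apache.spark.deploy.master.Master
tail /u/user/Billy/logs/spark--org.apache.spark.deploy.master.Master-1-ALPS.out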

Only one Spark executor is started

Symptom: You specified --num-executors 2 on a spark-submit, but only one executor was started.
Cause: The --num-executors parameter is only valid in YARN mode, which is not supported on z/OS®. Instead, the number of executors is determined by your resource settings.
Response: For more information about resource settings, see Configuring memory and CPU options.
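For example, in standalone mode the number of executors follows from settings such as spark.cores.max and spark.executor.cores in spark-defaults.conf. The following sketch uses illustrative values that would yield two executors, assuming the worker has enough cores and memory:
# 4 total cores / 2 cores per executor = 2 executors (values are illustrative)
spark.cores.max        4
spark.executor.cores   2
spark.executor.memory  2g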

Shell script displays unreadable characters

Symptom: When you run a shell script, unreadable characters appear on the screen, such as:
./start-master.sh: line 1: syntax error near 
unexpected token `$'\101\123\124\105\122^''
Cause: Incorrect file encoding or downlevel bash shell.
Response: Ensure that the file encoding is EBCDIC, not ASCII, and that the file is not tagged as ASCII. You can check the tagging of a file by issuing the ls -T shell command. Also, ensure that your bash shell level is 4.2.53 or 4.3.48. You can check the bash level by issuing the bash --version command.
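For example, the following shell commands display the file tag and the bash level (the file name is illustrative):
ls -T start-master.sh    # the file should not be tagged as ASCII
bash --version           # display the bash level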

Spark-shell fails with java.lang.ExceptionInInitializerError error message

Symptom: Spark-shell fails with the following error message:
java.lang.ExceptionInInitializerError …. Scala signature 
package has wrong version  expected: 5.0  found: 45.0 in scala.package
Cause: Your JVM is likely running with the wrong default encoding.
Response: Ensure that you have the following environment variable set:
IBM_JAVA_OPTIONS=-Dfile.encoding=UTF8
For more information about setting environment variables, see Setting up a user ID for use with z/OS Spark.
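For example, you might set the variable in the user's profile or in $SPARK_CONF_DIR/spark-env.sh (the location is a suggestion):
export IBM_JAVA_OPTIONS="-Dfile.encoding=UTF8"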

The Spark master fails with JVMJ9VM015W error

Symptom: The Spark master fails to start and gives the following error:
JVMJ9VM015W Initialization error for library j9gc28(2): 
Failed to instantiate compressed references metadata; 200M requested
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
Cause: The master JVM could not obtain enough memory to start. Memory is most likely constrained by your ASSIZEMAX setting.
Response: For more information about setting the ASSIZEMAX parameter, see Configuring memory and CPU options.
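For example, if your installation manages the address space size through the OMVS segment of the Spark user ID, a RACF command similar to the following might apply (the user ID and value are illustrative):
ALTUSER SPARKID OMVS(ASSIZEMAX(2147483647))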

A Spark application is not progressing and shows JVMJ9VM015W error in the log

Symptom: The Spark master and worker started successfully; however, the Spark application is not making any progress, and the following error appears in the executor log:
JVMJ9VM015W Initialization error for library j9gc28(2): 
Failed to instantiate heap; 20G requested
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
Cause: The executor JVM could not obtain enough memory to start. Memory is most likely constrained by your MEMLIMIT setting or IEFUSI exit.
Response: For more information about setting the MEMLIMIT parameter or using the IEFUSI exit, see Configuring memory and CPU options. Also see "Displaying process limits" in z/OS UNIX System Services Planning.
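For example, you can display the current limits from the z/OS UNIX shell, and, if your installation manages MEMLIMIT through the OMVS segment of the Spark user ID, raise it with a RACF command similar to the following (the user ID and value are illustrative):
ulimit -a
ALTUSER SPARKID OMVS(MEMLIMIT(32G))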

Spark commands fail with an EDC5157I message, or BPXP015I and BPXP014I messages appear on the z/OS console

Symptom: Spark commands fail with the following error message:
17/08/08 13:51:40 INFO StandaloneAppClient$ClientEndpoint: Executor updated: 
app-20170808135140-0000/0 is now FAILED (java.io.IOException: Cannot run program 
"/usr/lpp/java/java800/J8.0_64/bin/java" (in directory "/u/user1/work/app-20170808135140-0000/0"):
EDC5157I An internal error has occurred.)
The following messages appear on the z/OS console:
BPXP015I HFS PROGRAM /bin/setfacl IS NOT MARKED PROGRAM CONTROLLED.
BPXP014I ENVIRONMENT MUST BE CONTROLLED FOR SURROGATE (BPX.SRV.uuuuuuuu)
      PROCESSING. 
Cause: Apache Spark 2.1.1 and later requires that the _BPX_SHAREAS environment variable be set to NO when starting the cluster, but it is currently set to YES.
Response: Under your Spark user ID (for instance, SPARKID), ensure that the $SPARK_CONF_DIR/spark-env.sh file contains _BPX_SHAREAS=NO, and that the master and worker processes were started with that spark-env.sh file. Verify that the SPARK_CONF_DIR and _BPX_SHAREAS environment variables are set properly for any BPXBATCH jobs that run start-master.sh and start-slave.sh. Consider restarting the master and worker processes to ensure that the proper setting is used.
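For example, $SPARK_CONF_DIR/spark-env.sh should contain a line such as the following:
export _BPX_SHAREAS=NO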

Spark worker or driver cannot connect to the master and the log files show "java.io.IOException: Failed to connect" error messages

Symptom: The Spark master starts successfully, but the worker or the driver is unable to connect to it. The following error message is repeated several times in the worker or application log:
org.apache.spark.SparkException: Exception thrown in awaitResult
...Caused by: java.io.IOException: Failed to connect to ip_address:port
Cause: The worker or driver is unable to connect to the master due to network errors.
Response: Check your network connectivity. If you have Spark client authentication enabled, verify that your AT-TLS policy is set up correctly and that the worker or driver has a valid digital certificate. For more information about client authentication, see Configuring z/OS Spark client authentication.

Spark worker fails with ICH408I message with NEWJOBNAME insert

Symptom: Spark worker fails with the following error message:
ICH408I USER(SPARKID) 
GROUP(SPARKGRP) NAME(####################) 
CL(PROCESS ) 
INSUFFICIENT AUTHORITY TO NEWJOBNAME 
Cause: The IzODA Apache Spark worker spawns drivers and executors by using the job name prefixes or templates specified in spark-defaults.conf. The Spark user ID (for instance, SPARKID) cannot create jobs with those job names unless it is authorized to do so.
Response: Permit the Spark user ID to the BPX.JOBNAME profile with READ access.
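For example, the following RACF commands grant the access and refresh the FACILITY class (the user ID is illustrative):
PERMIT BPX.JOBNAME CLASS(FACILITY) ID(SPARKID) ACCESS(READ)
SETROPTS RACLIST(FACILITY) REFRESH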