Troubleshooting issues with Apache Spark
Use the following information to troubleshoot issues you might encounter with Apache Spark.
You can also use Apache Spark log files to help identify issues with your Spark processes. See Spark log files for more information about where to find these log files.
Multiple Spark applications cannot run simultaneously with the "alwaysScheduleApps" setting enabled
Symptom: Multiple Spark applications cannot run simultaneously with the "alwaysScheduleApps" setting enabled, even when there are sufficient memory and CPU (zIIP) resources:
21/02/09 15:34:10 INFO Master: Attempted to re-register application at same address: dipn.ipc.us.aexp.com:4056
The following are examples of repeating Spark messages in this instance:
21/02/09 15:34:19 WARN Master: Unknown application app-20210209153410-0352 requested 1 total executors.
21/02/09 15:34:19 WARN Master: Unknown application app-20210209153409-0351 requested 1 total executors.
21/02/09 15:34:20 WARN Master: Unknown application app-20210209153407-0350 requested 1 total executors.
21/02/09 15:34:20 WARN Master: Unknown application app-20210209153410-0352 requested 1 total executors.
Spark issues the following messages when the second and subsequent applications try to use the same port as the first application:
21/01/15 10:01:23 DEBUG TransportServer: Shuffle server started on port: 4056
21/01/15 10:01:23 INFO Utils: Successfully started service 'sparkDriver' on port 4056.
Cause: Spark expects to find a port already in use when another Spark application is using it, and it has its own error handling that moves on to the next port in the Spark port range. Using SHAREPORT defeats that logic and causes applications to interfere with each other.
Response: Do not use SHAREPORT when assigning TCPIP PORT definitions to Spark.
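For example, a PORT reservation in the TCP/IP profile might look like the following sketch; the port numbers and job name are illustrative, and no SHAREPORT parameter appears on the entries:

```
PORT
   4056 TCP SPARKMSTR      ; Spark master port - no SHAREPORT
   4057 TCP SPARKMSTR      ; additional Spark ports as needed
```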
Spark commands fail with an EDC5111I message, and ICH408I message appears on the z/OS console
Symptom: Spark worker daemon fails to create executors with the following error:
18/01/22 13:11:14 ERROR ExecutorRunner: Error running executor
java.io.IOException: Cannot run program "/usr/lpp/java/java800/J8.0_64/bin/java" (in directory "/u/usr1/work/app-20180122131112-0000/0"): EDC5111I Permission denied.
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1059)
at
org.apache.spark.deploy.worker.ExecutorRunner.org$apache$spark$deploy$worker
$ExecutorRunner$$fetchAndRunExecutor(ExecutorRunner.scala:167)
at org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run
(ExecutorRunner.scala:73)
Caused by: java.io.IOException: EDC5111I Permission denied.
at java.lang.UNIXProcess.<init>(UNIXProcess.java:189)
at java.lang.ProcessImpl.start(ProcessImpl.java:167)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1040)
... 2 more
The following message appears on the z/OS console:
SY1 ICH408I USER(USR1 ) GROUP(SYS1 ) NAME(####################)
/u/usr1/work/app-20180122131112-0000/0
CL(DIRSRCH ) FID(E2D7D2F0F0F10002000000014BA847EA)
INSUFFICIENT AUTHORITY TO CHDIR
ACCESS INTENT(--X) ACCESS ALLOWED(GROUP ---)
EFFECTIVE UID(0000000012) EFFECTIVE GID(0000000500)
Cause: When z/OS Spark client authentication is enabled, the Spark executor processes run under the user ID of the driver. However, the z/OS system that hosts the Spark cluster is not configured to accept the ACLs set by Spark, which the executors need in order to access the Spark directories.
Response: Configure the z/OS system that hosts the Spark cluster to accept ACLs. For example, issue the following RACF command:
SETROPTS CLASSACT(FSSEC)
For more information, see Configuring z/OS Spark client authentication.
Spark scripts fail with FSUM6196 and EDC5129I messages
Symptom: Spark scripts fail with the following error message:
env: FSUM6196 bash: not executable: EDC5129I No such file or directory
Cause: Apache Spark expects to find the bash shell in the user's PATH environment variable, but bash cannot be found when the Spark script attempts to invoke it.
Response: Ensure that the PATH environment variable includes the directory where the bash executable is installed. For example, if bash is located at /usr/bin/bash-4.2/bin/bash, ensure that PATH includes /usr/bin/bash-4.2/bin.
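A minimal sketch of that setup, assuming the bash path shown above; add the export to your shell profile or to $SPARK_CONF_DIR/spark-env.sh:

```shell
# Prepend the directory that contains bash (path from the example above; use your site's location)
export PATH=/usr/bin/bash-4.2/bin:$PATH

# Verify that the shell can now locate bash
type bash
```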
Spark scripts fail with FSUM7321 message
Symptom: Spark scripts fail with the following error message:
failed to launch org.apache.spark.deploy.master.Master:
/usr/lpp/IBM/izoda/spark/spark23x/bin/spark-class 76:
FSUM7321 Unknown option "posix"
Cause: Apache Spark expects to find the env command in /usr/bin, but it cannot be found. Either the /usr/bin/env symbolic link is missing, or it does not point to /bin/env. Creation of this symbolic link might have been missed during Spark setup, or the symbolic link might have been lost after a system IPL.
Response: Ensure that /usr/bin/env exists and is a symbolic link to /bin/env, and that the symbolic link persists across system IPLs. For more information, see Verifying the env command path.
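You can verify the link with a quick check like the following sketch (on platforms other than z/OS, /usr/bin/env may be a regular file rather than a symbolic link):

```shell
# Show what /usr/bin/env is; on z/OS it should be a symbolic link to /bin/env
ls -l /usr/bin/env

# If the link is missing, an administrator can recreate it:
#   ln -s /bin/env /usr/bin/env
```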
An error occurs when starting the master, but the master starts correctly
Symptom: When starting the master, the following error occurs:
bash-4.2$ $SPARK_HOME/sbin/start-master.sh
starting org.apache.spark.deploy.master.Master,
logging to /u/user/Billy/logs/spark--org.apache.spark.deploy.master.Master-1-ALPS.out
failed to launch org.apache.spark.deploy.master.Master:
full log in /u/user/Billy/logs/spark--org.apache.spark.deploy.master.Master-1-ALPS.out
Cause: Apache Spark polls for a number of seconds, repeatedly checking whether the master started successfully. If your system is under heavy load, this message might appear; it generally means that the check stopped polling before the master startup completed, not that the master failed to start.
Response: Check the master log, or issue the ps command and look for the master process, to determine definitively whether the master started successfully.
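For example, a sketch of such a check; the bracketed first character keeps grep from matching its own process:

```shell
# Look for a running Spark master process; report if none is found
ps -ef | grep '[o]rg.apache.spark.deploy.master.Master' \
  || echo "Spark master process not found"
```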
Only one Spark executor is started
Symptom: You specified --num-executors 2 on a spark-submit, but only one executor was started.
Cause: The --num-executors parameter is valid only in YARN mode, which is not supported on z/OS®. Instead, the number of executors is determined by your resource settings.
Response: For more information about resource settings, see Configuring memory and CPU options.
Shell script displays unreadable characters
Symptom: When you run a shell script, it displays unreadable characters on the screen, such as:
./start-master.sh: line 1: syntax error near
unexpected token `$'\101\123\124\105\122^''
Cause: Incorrect file encoding or downlevel bash shell.
Response: Ensure that the file encoding is EBCDIC, not ASCII, and that the file is not tagged as ASCII. You can check the tagging of a file by issuing the ls -T shell command. Also, ensure that your bash shell level is 4.2.53 or 4.3.48. You can check the bash level by issuing the bash --version command.
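For example, a sketch of the version check (the ls -T form is z/OS-specific, so it is shown only as a comment):

```shell
# Display the bash level; look for 4.2.53 or 4.3.48 in the first line
bash --version | head -n 1

# On z/OS only: display the file tag; an ASCII tag indicates the wrong encoding
#   ls -T ./start-master.sh
```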
Spark-shell fails with java.lang.ExceptionInInitializerError error message
Symptom: Spark-shell fails with the following error message:
java.lang.ExceptionInInitializerError ... Scala signature package has wrong version expected: 5.0 found: 45.0 in scala.package
Cause: Your JVM is likely running with the wrong default encoding.
Response: Ensure that you have the following environment variable set:
IBM_JAVA_OPTIONS=-Dfile.encoding=UTF8
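For example, you might add the export to $SPARK_CONF_DIR/spark-env.sh (a sketch; the echo is only there to confirm the value):

```shell
# Set the default file encoding for the IBM JVM
export IBM_JAVA_OPTIONS="-Dfile.encoding=UTF8"

# Confirm the value
echo "$IBM_JAVA_OPTIONS"
```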
For more information about setting environment variables, see Setting up a user ID for use with z/OS Spark.
The Spark master fails with JVMJ9VM015W error
Symptom: The Spark master fails to start and gives the following error:
JVMJ9VM015W Initialization error for library j9gc28(2):
Failed to instantiate compressed references metadata; 200M requested
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
Cause: The master JVM could not obtain enough memory to start. Memory is
most likely constrained by your ASSIZEMAX setting.
Response: For more information about setting the ASSIZEMAX parameter, see Configuring memory and CPU options.
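For example, you might raise the limit in the Spark user ID's RACF OMVS segment with a command like the following sketch (the SPARKID user ID and the value are illustrative; choose a value appropriate for your system):

```
ALTUSER SPARKID OMVS(ASSIZEMAX(2147483647))
```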
A Spark application is not progressing and shows JVMJ9VM015W error in the log
Symptom: The Spark master and worker started successfully; however, the Spark application is not making any progress, and the following error appears in the executor log:
JVMJ9VM015W Initialization error for library j9gc28(2):
Failed to instantiate heap; 20G requested
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
Cause: The executor JVM could not obtain enough memory to start. Memory is
most likely constrained by your MEMLIMIT setting or IEFUSI exit.
Response: For more information about setting the MEMLIMIT parameter or
using the IEFUSI exit, see Configuring memory and CPU options. Also see
"Displaying process limits" in z/OS UNIX System Services Planning.
Spark commands fail with an EDC5157I message, or BPXP015I and BPXP014I messages appear on the z/OS console
Symptom: Spark commands fail with the following error message:
17/08/08 13:51:40 INFO StandaloneAppClient$ClientEndpoint: Executor updated:
app-20170808135140-0000/0 is now FAILED (java.io.IOException: Cannot run program
"/usr/lpp/java/java800/J8.0_64/bin/java" (in directory "/u/user1/work/app-20170808135140-0000/0"):
EDC5157I An internal error has occurred.)
The following messages appear on the z/OS console:
BPXP015I HFS PROGRAM /bin/setfacl IS NOT MARKED PROGRAM CONTROLLED.
BPXP014I ENVIRONMENT MUST BE CONTROLLED FOR SURROGATE (BPX.SRV.uuuuuuuu) PROCESSING.
Cause: Apache Spark 2.1.1 and later requires that the _BPX_SHAREAS environment variable be set to NO when starting the cluster, but it is currently set to YES.
Response: Under your Spark user ID (for instance, SPARKID), ensure that the $SPARK_CONF_DIR/spark-env.sh file contains _BPX_SHAREAS=NO, and that the master and worker processes were started using that spark-env.sh file. Verify that the SPARK_CONF_DIR and _BPX_SHAREAS environment variables are set properly for any BPXBATCH jobs that run start-master.sh and start-slave.sh. Consider restarting the master and worker processes to ensure that the proper setting is used.
Spark worker or driver cannot connect to the master and the log files show "java.io.IOException: Failed to connect" error messages
Symptom: The Spark master starts successfully, but the worker or the driver is unable to connect to it. The following error message is repeated several times in the worker or application log:
org.apache.spark.SparkException: Exception thrown in awaitResult
...Caused by: java.io.IOException: Failed to connect to ip_address:port
Cause: The worker or driver is unable to connect to the master due to network errors.
Response: Check your network connectivity. If you have Spark client authentication enabled,
verify that your AT-TLS policy is set up correctly and that the worker or driver has a valid digital certificate. For
more information about client authentication, see Configuring z/OS Spark client authentication.
Spark worker fails with ICH408I message with NEWJOBNAME insert
Symptom: Spark worker fails with the following error message:
ICH408I USER(SPARKID)
GROUP(SPARKGRP) NAME(####################)
CL(PROCESS )
INSUFFICIENT AUTHORITY TO NEWJOBNAME
Cause: The IzODA Apache Spark worker spawns drivers and executors using the job name prefixes or templates specified in spark-defaults.conf. The SPARKID user ID cannot create jobs with those job names unless it is authorized to do so.
Response: Permit the SPARKID user ID READ access to the BPX.JOBNAME profile.
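For example, RACF commands like the following sketch grant the access (this assumes the BPX.JOBNAME profile is defined in the FACILITY class and that the class is RACLISTed):

```
PERMIT BPX.JOBNAME CLASS(FACILITY) ID(SPARKID) ACCESS(READ)
SETROPTS RACLIST(FACILITY) REFRESH
```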