IBM Support

FileNotFoundException for shuffle service

Troubleshooting


Problem

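IBM Spectrum Conductor with Spark reports a FileNotFoundException from the shuffle service. The error occurs for only one Spark instance group (SIG), jh-spark211, and only one shuffle directory, /data10/spark/jh-spark211/local_dir/. The root cause in the stack trace is a java.io.FileNotFoundException with the message "Permission denied", raised while the external shuffle service opens a shuffle index file: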
2019-05-28 17:32:58.724 com.spark.rules.DefaultRuleRunner.runRules(DefaultRuleRunner.java:34)
TaskFailException----------------------------------------
com.spark.rules.common.TaskFailException: org.apache.spark.SparkException: Job aborted.
at com.spark.rules.executor.AbstractExecutor.process(AbstractExecutor.java:60)
at com.spark.rules.executor.ExecutorGroup.processRule(ExecutorGroup.java:87)
at com.spark.rules.executor.ExecutorGroup.processRules(ExecutorGroup.java:74)
.................
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:520)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:494)
at com.spark.rules.executor.output.Spark2Hdfs.doProcess(Spark2Hdfs.java:35)
at com.spark.rules.executor.AbstractExecutor.process(AbstractExecutor.java:50)
... 15 more
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 11.0 failed 4 times, most recent failure: Lost task 10.3 in stage 11.0 (TID 1320, dsszbyz-etl-node57, executor 74-08cf9e79-87e4-48f7-bb33-808b66b881a8): org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:396)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1160)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.lang.Thread.run(Thread.java:795)
Caused by: org.apache.spark.shuffle.FetchFailedException: java.lang.RuntimeException: Failed to open file: /data10/spark/jh-spark211/local_dir/d2f5f7e6-adcb-4c7d-ae56-f0bb20a957db_hive_hive/blockmgr-6976f3e6-7539-4b60-8ebc-bf3efe46b8f7/25/shuffle_13_168_0.index
at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getSortBasedShuffleBlockData(ExternalShuffleBlockResolver.java:262)
at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:174)
at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:99)
at org.apache.spark.deploy.ego.EGOExternalShuffleBlockHandler.handleMessage(EGOShuffleService.scala:127)
..............
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:795)
Caused by: java.util.concurrent.ExecutionException: java.io.FileNotFoundException: /data10/spark/jh-spark211/local_dir/d2f5f7e6-adcb-4c7d-ae56-f0bb20a957db_hive_hive/blockmgr-6976f3e6-7539-4b60-8ebc-bf3efe46b8f7/25/shuffle_13_168_0.index (Permission denied)
..............................
Caused by: java.io.FileNotFoundException: /data10/spark/jh-spark211/local_dir/d2f5f7e6-adcb-4c7d-ae56-f0bb20a957db_hive_hive/blockmgr-6976f3e6-7539-4b60-8ebc-bf3efe46b8f7/25/shuffle_13_168_0.index (Permission denied)
..............
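
The innermost cause in the trace is a FileNotFoundException carrying "(Permission denied)", which Java raises when a file cannot be opened because of access rights, not only when it is missing. As a diagnostic aid, the following minimal sketch walks every component of a path and reports its mode, owner, and whether the current user can traverse or read it. It is not part of the IBM article or the product; the function name check_path_access is hypothetical, and the sketch assumes it is run on the affected host as the same OS user that runs the shuffle service (as root, the access checks always pass).

```python
import os
import stat
from pathlib import Path


def check_path_access(target):
    """Report access for each component of `target`, from the root down.

    Returns a list of dicts with keys: path, mode, uid, gid, ok, error.
    `ok` is True when the current user can traverse a directory (execute
    permission) or read a non-directory (read permission).
    The walk stops at the first component that cannot be inspected.
    """
    target = Path(target)
    # Path.parents lists the nearest parent first, so reverse it.
    components = list(reversed(target.parents)) + [target]
    report = []
    for component in components:
        try:
            st = os.stat(component)
        except OSError as exc:
            # Missing component, or a parent that denies traversal.
            report.append({"path": str(component), "mode": None, "uid": None,
                           "gid": None, "ok": False, "error": exc.strerror})
            break
        needed = os.X_OK if stat.S_ISDIR(st.st_mode) else os.R_OK
        report.append({"path": str(component),
                       "mode": stat.filemode(st.st_mode),
                       "uid": st.st_uid, "gid": st.st_gid,
                       "ok": os.access(component, needed), "error": None})
    return report
```

Calling check_path_access with the shuffle_13_168_0.index path from the trace and printing each entry shows the first component whose ok value is False, which is where ownership or mode bits are worth comparing against a shuffle directory that works.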

Document Location

Worldwide

Business Unit: IBM Software
Product: IBM Spectrum Conductor with Spark
Platform: Linux
Version: All Versions
Line of Business: Automation Platform

Document Information

Modified date:
06 September 2019

UID

ibm10964712