Spark tuning (HortonWorks or IBM Spectrum Conductor)

If you run a standalone Spark distribution (for example, community Spark or IBM Spectrum Conductor® with Spark), it is recommended to use the IBM Storage Scale POSIX interface.

If you run a standalone Spark distribution on IBM Storage Scale FPO, refer to the Tuning for IBM Storage Scale FPO section.

If you run a standalone Spark distribution on IBM Storage Scale System, no further tuning is needed.

If you use a Hadoop distribution (for example, HortonWorks HDP), it is recommended to use IBM Storage Scale HDFS Transparency. Refer to the System tuning and HDFS Transparency Tuning sections.

At the Spark level, the following configurations should be tuned to make Spark work well on IBM Storage Scale:

spark.shuffle.file.buffer ($SPARK_HOME/conf/spark-defaults.conf)
  • Default value: 32k
  • Recommended value: the IBM Storage Scale data blocksize, which can be read with:

spark_shuffle_file_buffer=$(/usr/lpp/mmfs/bin/mmlsfs <filesystem_name> -B | tail -1 | awk ' { print $2} ')

If the file system blocksize is larger than 2 MB, set spark.shuffle.file.buffer to 2 MB.

spark.local.dir
  • Default value: /tmp
  • Recommended value: a local directory (do not point this at an IBM Storage Scale directory).

Note: This configuration is overridden by whichever of the following environment variables the cluster manager sets:
  • SPARK_LOCAL_DIRS (Standalone)
  • MESOS_SANDBOX (Mesos)
  • LOCAL_DIRS (YARN)

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version
  • Default value: 1
  • Recommended value: 2, which makes Spark jobs commit faster.
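The blocksize lookup and the 2 MB cap above can be combined into one step. The following is a minimal sketch; `cap_shuffle_buffer` is a hypothetical helper, and the sample blocksize stands in for the value that `mmlsfs` would return on a real cluster:

```shell
#!/bin/sh
# Hypothetical helper: cap the recommended spark.shuffle.file.buffer at 2 MB.
cap_shuffle_buffer() {
  blocksize=$1                      # file system blocksize in bytes
  cap=$((2 * 1024 * 1024))          # 2 MB upper bound from the recommendation
  if [ "$blocksize" -gt "$cap" ]; then
    echo "$cap"
  else
    echo "$blocksize"
  fi
}

# On a live cluster, the blocksize would come from the command shown above:
#   blocksize=$(/usr/lpp/mmfs/bin/mmlsfs <filesystem_name> -B | tail -1 | awk '{print $2}')
blocksize=4194304                   # sample value: a 4 MB blocksize
echo "spark.shuffle.file.buffer $(cap_shuffle_buffer "$blocksize")b"
# prints "spark.shuffle.file.buffer 2097152b"
```

The printed line can be appended to $SPARK_HOME/conf/spark-defaults.conf; Spark accepts byte-size values with a `b` suffix.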

For other Spark-level tuning, refer to the Spark configuration and Tuning Spark documentation.