Spark tuning (Hortonworks or IBM Spectrum Conductor)
If you run a standalone Spark distribution (for example, community Spark or IBM Spectrum Conductor® with Spark), it is recommended to use the IBM Storage Scale POSIX interface.
If you run a standalone Spark distribution on IBM Storage Scale FPO, refer to the Tuning for IBM Storage Scale FPO section.
If you run a standalone Spark distribution on IBM Storage Scale System, no further tuning is needed.
If you use a Hadoop distribution (for example, Hortonworks HDP), it is recommended to use IBM Storage Scale HDFS Transparency. Refer to the System tuning and HDFS Transparency Tuning sections.
At the Spark level, tune the following configurations to make Spark work well on IBM Storage Scale:
| Configuration | Default value | Recommended value |
|---|---|---|
| `spark.shuffle.file.buffer` (set in `$SPARK_HOME/conf/spark-defaults.conf`) | 32K | The IBM Storage Scale data blocksize: `spark_shuffle_file_buffer=$(/usr/lpp/mmfs/bin/mmlsfs <filesystem_name> -B \| tail -1 \| awk '{print $2}')`. If the file system blocksize is larger than 2MB, configure 2MB for `spark.shuffle.file.buffer`. |
| `spark.local.dir` (Note: this configuration is overridden by environment variables set by the cluster manager.) | /tmp | Configure a node-local directory for this (do not configure it with an IBM Storage Scale directory). |
| `spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version` | 1 | Changing this to 2 makes Spark job commits faster. |
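The blocksize lookup and the 2MB cap above can be combined into a short shell sketch. This is not part of the product documentation; the `4194304` blocksize is an example value standing in for the `mmlsfs` output, and `/data/spark-local` is a hypothetical node-local directory:

```shell
# On a real cluster, read the blocksize from the file system, for example:
#   bs=$(/usr/lpp/mmfs/bin/mmlsfs <filesystem_name> -B | tail -1 | awk '{print $2}')
# Here we use an example value of 4MB so the sketch is self-contained.
bs=4194304

# Cap the shuffle buffer at 2MB (2097152 bytes), as recommended above.
cap=2097152
buffer=$(( bs > cap ? cap : bs ))

# Resulting spark-defaults.conf entries (append to $SPARK_HOME/conf/spark-defaults.conf):
echo "spark.shuffle.file.buffer ${buffer}"
echo "spark.local.dir /data/spark-local"
echo "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2"
```

With a 4MB blocksize, the cap applies and the buffer is set to 2097152 bytes; with a blocksize of 2MB or less, the blocksize itself is used.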
For other Spark-level tuning, refer to Spark configuration and Tuning Spark.