Configuring and tuning Spark workloads

  1. Configure spark.shuffle.file.buffer.

    By default, this option is configured in $SPARK_HOME/conf/spark-defaults.conf.

    To optimize Spark workloads on an IBM Storage Scale file system, the key tuning value is the spark.shuffle.file.buffer Spark configuration option (defined in a Spark configuration file), which must be set to match the block size of the IBM Storage Scale file system being used.

    The user can query the block size for an IBM Storage Scale file system by running: mmlsfs <filesystem_name> -B

    The following is an example of deriving the spark_shuffle_file_buffer value for a given file system:

    spark_shuffle_file_buffer=$(/usr/lpp/mmfs/bin/mmlsfs <filesystem_name> -B | tail -1 | awk '{ print $2 }')

    Set the Spark configuration option spark.shuffle.file.buffer to the value assigned to $spark_shuffle_file_buffer.
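
    For example, on a hypothetical file system named gpfs0, the value can be derived and appended to $SPARK_HOME/conf/spark-defaults.conf in one step. This is a sketch only: gpfs0 is a placeholder, and note that Spark interprets a bare number for spark.shuffle.file.buffer in KiB, so the byte value reported by mmlsfs is given an explicit b suffix here:

    # Sketch only: gpfs0 is a placeholder file system name.
    spark_shuffle_file_buffer=$(/usr/lpp/mmfs/bin/mmlsfs gpfs0 -B | tail -1 | awk '{ print $2 }')
    # Append the option; the trailing "b" marks the mmlsfs value as bytes,
    # since Spark reads this option in KiB unless a unit suffix is given.
    echo "spark.shuffle.file.buffer ${spark_shuffle_file_buffer}b" >> "$SPARK_HOME/conf/spark-defaults.conf"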

    Defining a large block size for IBM Storage Scale file systems used for Spark shuffle operations can improve system performance. However, it has not been proven that block sizes larger than 2M offer useful improvements on the typical hardware used in FPO configurations.
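
    The block size is fixed when the file system is created, so a 2M block size would be chosen at mmcrfs time. The following sketch assumes a hypothetical device name gpfs0 and NSD stanza file /tmp/nsd.stanza:

    # Sketch only: create a file system with a 2M block size.
    /usr/lpp/mmfs/bin/mmcrfs gpfs0 -F /tmp/nsd.stanza -B 2M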

  2. Configure spark.local.dir with a local path.

    Do not put Spark's shuffle data into the IBM Storage Scale file system because this slows down the shuffle process.
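
    A minimal sketch of such a configuration in $SPARK_HOME/conf/spark-defaults.conf follows; the mount points /disk1 and /disk2 are assumptions, and spark.local.dir accepts a comma-separated list of directories:

    # Sketch only: /disk1 and /disk2 are hypothetical disks local to
    # every worker node, not on the IBM Storage Scale file system.
    spark.local.dir /disk1/spark-tmp,/disk2/spark-tmp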