Submitting a sample Spark job

Submit a sample Spark job to the Spark on EGO framework to test your cluster.

Before you begin

  • Ensure that the Spark on EGO framework is installed.
  • (Optional) If you plan to run workloads in the cluster deployment mode, ensure that you set up a distributed file system that supports the Apache Hadoop API (for example, HDFS or IBM Spectrum Scale).
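
  For example, you can confirm that the file system is reachable before you submit cluster-mode jobs. This quick check assumes HDFS, with the hadoop client on your PATH:

      hadoop fs -ls /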

About this task

This section demonstrates the SparkPi sample application, which is packaged with Spark and computes an approximation of Pi.

Procedure

  1. Set up your environment by sourcing the appropriate environment script.
    • If you are using bash, run:

      source $EGO_TOP/profile.platform

    • If you are using csh, run:

      source $EGO_TOP/cshrc.platform
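
    After sourcing the script, you can verify that the environment variables used in the remaining steps resolve (this assumes the profile sets both the EGO and Spark variables for your deployment):

      echo $EGO_TOP
      echo $SPARK_HOME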

  2. Configure the deployment mode, either in the configuration files or from the command line when submitting a job.
    • If you want to set the deployment mode using configuration files, see Configuring the deployment mode.
    • If you want to set the deployment mode from the command line, continue with the rest of this task; an example follows.
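
    For example, the following command (shown again in step 4) selects the client deployment mode for a single submission:

      $SPARK_HOME/bin/spark-submit --master ego-client --class org.apache.spark.examples.SparkPi $SPARK_HOME/lib/spark-examples-1.4.1-hadoop2.6.0.jar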
  3. (Optional) If you are using IBM Spectrum Scale, set up the following configuration to add the connector JAR files or your own dependency JAR files to the classpath.
    1. In $SPARK_HOME/conf/spark-env.sh, define the following variables to add an additional Java classpath and library path for the client:
      • SPARK_SUBMIT_CLASSPATH=path_to_spectrum_scale_connector_jar
      • SPARK_SUBMIT_LIBRARY_PATH=path_to_spectrum_scale_connector_library_folder

      For example:

      • SPARK_SUBMIT_CLASSPATH=/usr/lpp/mmfs/hadoop/hadoop-gpfs-2.7.0.jar
      • SPARK_SUBMIT_LIBRARY_PATH=/usr/lpp/mmfs/hadoop
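
      If you also have your own dependency JAR files, append them to the same variable, separating the entries with colons. For example (the second path here is illustrative, not part of IBM Spectrum Scale):

        SPARK_SUBMIT_CLASSPATH=/usr/lpp/mmfs/hadoop/hadoop-gpfs-2.7.0.jar:/opt/myapp/lib/mydeps.jar
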
    2. In $SPARK_HOME/conf/spark-defaults.conf, define the following properties to add an additional Java classpath and library path for the Spark driver (in cluster mode) and for the executors:
      • spark.driver.extraClassPath path_to_spectrum_scale_connector_jar. For example:

        spark.driver.extraClassPath /usr/lpp/mmfs/hadoop/hadoop-gpfs-2.7.0.jar

      • spark.executor.extraClassPath path_to_spectrum_scale_connector_jar. For example:

        spark.executor.extraClassPath /usr/lpp/mmfs/hadoop/hadoop-gpfs-2.7.0.jar

      • spark.driver.extraLibraryPath path_to_spectrum_scale_connector_library_folder. For example:

        spark.driver.extraLibraryPath /usr/lpp/mmfs/hadoop

      • spark.executor.extraLibraryPath path_to_spectrum_scale_connector_library_folder. For example:

        spark.executor.extraLibraryPath /usr/lpp/mmfs/hadoop
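
      Taken together, the relevant entries in $SPARK_HOME/conf/spark-defaults.conf would look like the following (using the same example paths as above):

        spark.driver.extraClassPath     /usr/lpp/mmfs/hadoop/hadoop-gpfs-2.7.0.jar
        spark.executor.extraClassPath   /usr/lpp/mmfs/hadoop/hadoop-gpfs-2.7.0.jar
        spark.driver.extraLibraryPath   /usr/lpp/mmfs/hadoop
        spark.executor.extraLibraryPath /usr/lpp/mmfs/hadoop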

      If you are using Spark 1.4.1, use the following configuration:
      • spark.driver.extraClassPath: the extra classpath for the driver.
      • spark.executor.extraClassPath: the extra classpath for the executors.
      • SPARK_CLASSPATH in the CLI or script: the extra classpath for the submission client or other daemons.
        Note: SPARK_CLASSPATH has been deprecated since Spark 1.0. With the Spark on EGO framework, configuring SPARK_CLASSPATH in spark-env.sh can cause the driver and executor configuration check to fail. In that case, remove SPARK_CLASSPATH from spark-env.sh and define it only when needed, for example, when submitting a job:
        SPARK_CLASSPATH="XXX" $SPARK_HOME/bin/spark-submit --master ego-cluster --class  org.apache.spark.examples.SparkPi
        $SPARK_HOME/lib/spark-examples-1.4.1-hadoop2.6.0.jar
  4. Submit a Spark job using the SparkPi sample in much the same way as you would in open-source Spark.

    Note that --master ego-client submits the job in the client deployment mode, where the SparkContext and driver program run outside the cluster. Use --master ego-cluster to submit the job in the cluster deployment mode, where the driver runs inside the cluster; a cluster-mode example follows the commands below.

    • $SPARK_HOME/bin/spark-submit --master ego-client --class org.apache.spark.examples.SparkPi $SPARK_HOME/lib/spark-examples-1.4.1-hadoop2.6.0.jar
    • $SPARK_HOME/bin/run-example SparkPi
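
    To run the same sample in the cluster deployment mode, switch the master URL. This assumes that the distributed file system described in the prerequisites is set up:

      $SPARK_HOME/bin/spark-submit --master ego-cluster --class org.apache.spark.examples.SparkPi $SPARK_HOME/lib/spark-examples-1.4.1-hadoop2.6.0.jar
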
  5. If you see errors, troubleshoot your setup.
    • If you are using the client deployment mode, use the logs that are printed to the command line by default.
    • If you are using the cluster deployment mode, use the stdout and stderr logs that are under $SPARK_LOCAL_DIRS/logs on the host.
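
    For example, to find the newest log files for a cluster-mode job on a host (the exact file names under this directory depend on your deployment), list the log directory by modification time:

      ls -lt $SPARK_LOCAL_DIRS/logs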