Submitting Spark batch applications

Submit your Spark workload as batch applications by using the cluster management console, RESTful APIs, or the CLI.

A Spark batch application is launched only through the spark-submit command, by using one of the following ways:
  • cluster management console (immediately or by scheduling the submission).
  • ascd Spark application RESTful APIs.
  • CLI (by using the spark-submit command in the Spark deployment directory) either inside or outside the cluster.
    • From inside the cluster: DEPLOY_HOME/spark-###-hadoop-###/bin
    • From an external client: CLIENT_HOME/spark-###-hadoop-###/bin
    Cluster and client mode:
    You can submit a Spark batch application by using cluster mode (default) or client mode either inside the cluster or from an external client:
    • Cluster mode (default): Submits the Spark batch application with the driver running on a host in your driver resource group. The spark-submit syntax is --deploy-mode cluster.
      When you submit from an external client outside the cluster in cluster mode, you must specify a .jar file that all hosts in the instance group have access to: either a file in a shared location or a file path that exists on every host. For example:
      CLIENT_HOME/spark-###-hadoop-###/bin/spark-submit --master spark://<hostname>:<port> \
      --deploy-mode cluster --class org.apache.spark.examples.SparkPi \
      file:///var/conductor/deploydir/examples/jars/spark-examples_2.11-2.1.1.jar 100
      
    • Client mode: Submits the Spark batch application with the driver running on the machine from which you submit. The spark-submit syntax is --deploy-mode client.
      When you submit from an external client outside the cluster in client mode, you must use a .jar file that is local to the client machine. The path can be relative or fully qualified. For example:
      CLIENT_HOME/spark-###-hadoop-###/bin/spark-submit --master spark://<hostname>:<port> \
      --deploy-mode client --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-2.1.1.jar 100
      
      If you are submitting Spark applications from the spark-submit command line in client mode, ensure that you source the profile before submitting the application; otherwise, you cannot access the Spark driver UI.
      • If you are using BASH, run the following command:
        . $EGO_TOP/profile.platform
      • If you are using CSH, run the following command:
        source $EGO_TOP/cshrc.platform

    If you submit a Spark batch application from an external client by using client mode and the spark.eventLog.enabled parameter is set, ensure that the spark.eventLog.dir file path is accessible to the driver on the external client. If you are using cluster mode, spark.eventLog.dir must be accessible to the driver within the cluster.
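    For example, for a client-mode submission from an external client, a minimal sketch that points the event log at a location the driver can reach; the file:///shared/spark-events path is only a placeholder for a directory that is accessible to the driver:
      CLIENT_HOME/spark-###-hadoop-###/bin/spark-submit --master spark://<hostname>:<port> \
      --deploy-mode client --conf spark.eventLog.enabled=true \
      --conf spark.eventLog.dir=file:///shared/spark-events \
      --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-2.1.1.jar 100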

    Specifying an instance group by its name:
    To specify the destination instance group by it's name, you can enter --master ego://<signame>:<x>; where signame is the instance group that you want to submit the application to and x is the order number of the primary instance. Before you submit a Spark application inside the cluster from the CLI by using either the spark-submit command or other client tools such as pyspark, saprkR, and spark-shell with --master ego://<signame>:<x>, you must source the cluster profile under $EGO_TOP.
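    For example, a sketch that sources the BASH profile and then submits by instance group name; the instance group name myig and the order number 1 are placeholders for your own values:
      . $EGO_TOP/profile.platform
      DEPLOY_HOME/spark-###-hadoop-###/bin/spark-submit --master ego://myig:1 \
      --class org.apache.spark.examples.SparkPi \
      file:///var/conductor/deploydir/examples/jars/spark-examples_2.11-2.1.1.jar 100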

    Along with --master ego://<signame>:<x>, you can specify either spark.ego.uname/spark.ego.passwd or spark.ego.credential in the spark-submit command, or you can run egosh user logon on the command line before you submit the Spark application.
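    For example, a sketch that passes the user name and password as Spark configuration options; myig, the order number, and the credential values are placeholders:
      DEPLOY_HOME/spark-###-hadoop-###/bin/spark-submit --master ego://myig:1 \
      --conf spark.ego.uname=<username> --conf spark.ego.passwd=<password> \
      --class org.apache.spark.examples.SparkPi \
      file:///var/conductor/deploydir/examples/jars/spark-examples_2.11-2.1.1.jar 100
    Alternatively, log on with the EGO CLI before you submit:
      egosh user logon -u <username> -x <password>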

    If submitting from inside the cluster, Kerberos authentication works with --master ego://<signame>:<x>. From an external client, Kerberos authentication is not supported.

    Note: If you change the cluster EGO security plug-in (EGO_SEC_PLUGIN in ego.conf), you must redeploy the instance group to ensure that the spark.ego.vemkd.principal parameter is updated to the new value in the Spark configuration file. For an external client, you must either change the spark.ego.vemkd.principal parameter manually in the spark-defaults.conf file or re-create and download the newest external Spark configuration package again. If you add a management host, you must either update spark.ego.ascd.rest.urls to include the ascd RESTful API URL of the new management host or re-create and download the newest external Spark configuration package again. For more information, see Setting up an external client.
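    For example, a sketch of the two entries to check in the external client's spark-defaults.conf; the bracketed values are placeholders, and the comma-separated URL list is an assumption about how multiple ascd RESTful API URLs are expressed in your configuration:
      spark.ego.vemkd.principal   <new_vemkd_principal>
      spark.ego.ascd.rest.urls    <existing_ascd_rest_urls>,<new_management_host_ascd_rest_url>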