Technical Blog Post
Abstract
Submitting Spark batch applications to IBM Platform Conductor for Spark
Body
IBM Platform Conductor for Spark (Platform Conductor) addresses many of the challenges that organizations face in deploying and working with Apache Spark. To learn more about building a multi-tenant Spark environment on a shared platform, see our blog post on Spark Multitenancy with IBM Platform Conductor. This blog post discusses the different methods for submitting a Spark batch application.
Compatible with standalone cluster, better than standalone cluster
The Spark community provides good material on using the spark-submit script, under the Spark bin directory, to launch an application on a given cluster manager (see Submitting Applications in the Spark community documentation). For compatibility, Platform Conductor follows the behavior of the Spark Standalone Cluster Manager and uses spark://HOST:PORT to specify the master service. The following examples can run either inside or outside of the Platform Conductor cluster, as long as the submission host is connected to the cluster:
Submitting a simple batch application (SparkPi) from the command line
Client Mode:
./bin/spark-submit --deploy-mode client --master spark://masterhost:7077 --class org.apache.spark.examples.SparkPi ./lib/spark-examples-1.4.1-hadoop2.6.0.jar 100
Cluster Mode:
./bin/spark-submit --deploy-mode cluster --master spark://masterhost:6066 --class org.apache.spark.examples.SparkPi ./lib/spark-examples-1.4.1-hadoop2.6.0.jar 100
In client mode, the Spark master port (7077 by default) is used for submission; in cluster mode, the REST port (6066 by default) is used for submission. The main application JAR is automatically transferred to the cluster.
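Because only the port differs between the two modes, both submissions can be folded into one small wrapper script. The following is a minimal sketch (masterhost and the SparkPi example JAR are reused from the commands above; adjust them for your own cluster):
#!/bin/sh
# submit-sparkpi.sh: submit SparkPi in the requested deploy mode
# Usage: ./submit-sparkpi.sh client|cluster
DEPLOY_MODE=$1
if [ "$DEPLOY_MODE" = "cluster" ]; then
  PORT=6066   # REST port, used for cluster-mode submission
else
  PORT=7077   # Spark master port, used for client-mode submission
fi
./bin/spark-submit --deploy-mode "$DEPLOY_MODE" \
  --master spark://masterhost:$PORT \
  --class org.apache.spark.examples.SparkPi \
  ./lib/spark-examples-1.4.1-hadoop2.6.0.jar 100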
Master and REST URLs:
Follow these steps to find the master and REST URLs in your Spark instance group.
- Navigate to the Spark Instance Groups page, and ensure that the target Spark instance group is in the “Started” state.
- Select the target Spark instance group and click the batch master description link.
- The Masters pop-up window displays all of your master and REST URLs. Click Close to close the pop-up window.

Deploying additional packages
Your application may require additional JAR files or other dependencies. Platform Conductor supports several methods for specifying the location of these files in the spark-submit CLI, including an absolute path, hdfs:, http:, https:, ftp:, and local:. Learn more about these methods in the Advanced Dependency Management section of the Spark community documentation.
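For example, a dependency passed through --jars can take any of the following forms (the paths and host names below are placeholders for illustration):
--jars /opt/libs/mylib.jar (an absolute path on the submission host)
--jars hdfs://namenode:8020/libs/mylib.jar (pulled from HDFS)
--jars http://repo.example.com/libs/mylib.jar (downloaded over HTTP)
--jars ftp://repo.example.com/libs/mylib.jar (downloaded over FTP)
--jars local:/opt/libs/mylib.jar (expected to already exist at this path on every host)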
The following is a Kafka Python example:
Client Mode:
./bin/spark-submit --deploy-mode client --master spark://masterhost:7077 --jars ./external/spark-streaming-kafka-assembly_2.10.jar ./examples/src/main/python/streaming/kafka_wordcount.py zookeeperhost:port kafka_topic
Cluster Mode:
./bin/spark-submit --deploy-mode cluster --master spark://masterhost:6066 --jars /location/spark-streaming-kafka-assembly_2.10.jar ./examples/src/main/python/streaming/kafka_wordcount.py zookeeperhost:port kafka_topic
As in the SparkPi example, the master port (7077 by default) is used for submission in client mode, and the REST port (6066 by default) is used for submission in cluster mode.
In client mode, the JAR file specified by --jars is automatically transferred to the cluster. This is not the case in cluster mode, however, because the JAR file is transferred by the Spark driver, not by the spark-submit script. In client mode, the Spark driver runs in the same environment as the spark-submit script and can access any file that spark-submit can access. In cluster mode, the Spark driver is scheduled to one of the hosts inside the cluster; if that host is not the submission host, the driver may not be able to access the JAR location specified in the spark-submit command.
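One way to avoid this problem is to reference the dependency through a location that every host can reach. For example, assuming the assembly JAR has already been copied to HDFS (the namenode host, port, and path below are placeholders):
./bin/spark-submit --deploy-mode cluster --master spark://masterhost:6066 --jars hdfs://namenode:8020/libs/spark-streaming-kafka-assembly_2.10.jar ./examples/src/main/python/streaming/kafka_wordcount.py zookeeperhost:port kafka_topic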
Deploying files in cluster mode
In cluster mode, the additional JAR file must be predeployed to all eligible hosts, or to a shared location that all eligible hosts in the cluster can access. To enable this, Platform Conductor provides the soamdeploy command, which predeploys packages to all hosts in the target resource group. For more information on this command, see soamdeploy in the IBM Knowledge Center.
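As a rough sketch, a predeployment might look like the following; the package name, JAR file, and consumer path are placeholders, and the exact options vary by version, so check the soamdeploy reference before using it:
soamdeploy add kafkaAssembly -p spark-streaming-kafka-assembly_2.10.jar -c /consumerpath
Once the JAR is present at the same path on every eligible host, it can be referenced with the local: scheme in the spark-submit command, which tells Spark not to copy the file.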
Platform Conductor also supports IBM Spectrum Scale to build a distributed file system in the target cluster in order to share additional JAR files among hosts. For more information on IBM Spectrum Scale support in Platform Conductor, see IBM Spectrum Scale overview in the IBM Knowledge Center.
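With a shared file system in place, the cluster-mode submission can simply point at the shared location. For illustration, assuming the file system is mounted at /gpfs on every host:
./bin/spark-submit --deploy-mode cluster --master spark://masterhost:6066 --jars /gpfs/libs/spark-streaming-kafka-assembly_2.10.jar ./examples/src/main/python/streaming/kafka_wordcount.py zookeeperhost:port kafka_topic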
Submitting a batch application from the PMC
In addition to CLI submission, you can use the Platform Management Console (PMC) to submit batch applications, particularly in cluster mode. Follow these steps to submit a Spark batch application using the PMC:
- Navigate to the Spark Instance Groups page, and ensure that the target Spark instance group is in the “Started” state.
- Navigate to the Applications & Notebooks page.
- Click Run a batch application.
- Select the target Spark instance group. The master URL is provided, and the deploy mode is added to the command line automatically.
- In the main dialog, enter other required information and click Submit.
In the future
Going forward, we plan to add many advanced capabilities for batch application submission, including:
- Enhanced Security: Authentication can be turned on for spark-submit if fine-grained isolation is required within a Spark instance group. With certain configurations, Spark batch applications can be authenticated by the EGO user, PAM user, or Kerberos user.
- Scheduled Applications: Instead of running a batch application immediately after submission, you can schedule a batch application to run at a specified time.
- Streaming Applications: Unlike regular Spark batch applications, streaming batch applications are usually long running and depend on input data sources. Platform Conductor includes special enhancements to better serve these types of applications.
Now that you understand how to submit your Spark batch applications to Platform Conductor, try it out! Download an evaluation version of Platform Conductor from our Service Management Connect page. If you have any questions, post them in our forum!
For more information on Platform Conductor, visit the IBM Platform Conductor for Spark Knowledge Center.
UID
ibm16163815